## NLP TWITTER ANALYSIS

* Final Project Submission
* Group Members
   1. Benson Kamau
   2. Kevin Muchori
   3. Nancy Chelangat
   4. Sally Kinyanjui
   5. Breden Mugambi

* Student Pace: Full-Time
* Instructor: Nikita Njoroge

### Problem Statement
Accurately classifying the sentiment expressed in tweets about topics or brands as positive, negative, or neutral is a significant challenge for companies such as Apple and Google. Tweets are informal and full of slang and abbreviations, so building a reliable sentiment analysis model that can interpret and classify them is a complex task. Getting it right gives a company like Apple insight into how consumers perceive and interact with its products and brand.

### Objective
The main objective is to build a model that classifies the sentiment of a tweet based on its content.

#### Project Success Metrics
Over 75% accuracy on the testing data.
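
As a minimal sketch of how this metric could be checked, accuracy is the share of test tweets whose predicted label matches the true label. The helper `accuracy`, the constant `TARGET_ACCURACY`, and the labels below are illustrative assumptions, not part of the project code.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


TARGET_ACCURACY = 0.75  # success threshold stated above

# Hypothetical labels, for illustration only
y_true = ["positive", "negative", "neutral", "positive", "negative"]
y_pred = ["positive", "negative", "positive", "positive", "negative"]

score = accuracy(y_true, y_pred)
meets_target = score > TARGET_ACCURACY  # "over 75%" is read as a strict inequality
print(f"accuracy={score:.2f}, meets target: {meets_target}")
```

In practice the same number is available from `sklearn.metrics.accuracy_score` once a model has produced predictions for the test split.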

#### Importing Relevant Libraries

In [3]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import nltk
from nltk.tokenize import regexp_tokenize, word_tokenize,TweetTokenizer, RegexpTokenizer
from nltk.corpus import stopwords, wordnet
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
import re
nltk.download('punkt')



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

### 1. Data Loading and Understanding

The dataset used in this study comes from CrowdFlower and is hosted on data.world: https://data.world/crowdflower/brands-and-product-emotions.


In [4]:
# Load a CSV file and return the DataFrame together with its summary and a preview.
def load_and_get_info(file_path, encoding='utf-8'):
    try:
        # Load data
        df = pd.read_csv(file_path, encoding=encoding)

        # Keep the first few rows of the DataFrame for display
        df_head = df.head()

        # df.info() prints the summary itself and returns None, so df_info is None
        df_info = df.info()

        return df, df_info, df_head
    except UnicodeDecodeError:
        # Fall back to latin1, which can decode any byte sequence
        print(f"Failed to decode {file_path} with encoding {encoding}. Trying with 'latin1' encoding.")
        return load_and_get_info(file_path, encoding='latin1')

# Check the data types of the DataFrame columns and return the number of columns per data type.
def check_data_types(df):
    # Report pandas' 'object' dtype under the friendlier name 'string'
    data_type_counts = df.dtypes.replace({'object': 'string'}).value_counts().to_dict()
    return data_type_counts

In [6]:
file_path1 = '/content/tweet-analysis.csv'
df1,data_info, data_head = load_and_get_info(file_path1)
print(data_info)
print("\nFirst few rows of the DataFrame:")
data_head

Failed to decode /content/tweet-analysis.csv with encoding utf-8. Trying with 'latin1' encoding.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9093 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column                                              Non-Null Count  Dtype 
---  ------                                              --------------  ----- 
 0   tweet_text                                          9092 non-null   object
 1   emotion_in_tweet_is_directed_at                     3291 non-null   object
 2   is_there_an_emotion_directed_at_a_brand_or_product  9093 non-null   object
dtypes: object(3)
memory usage: 213.2+ KB
None

First few rows of the DataFrame:


| | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product |
|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion |


The dataset contains the following columns:

1. 'tweet_text' column
    - Contains the text of the tweet.
2. 'emotion_in_tweet_is_directed_at' column
    - Contains the brand or product that the emotion in the tweet is directed at.
3. 'is_there_an_emotion_directed_at_a_brand_or_product' column
    - Indicates the kind of emotion in the tweet directed at the brand or product.

The dataset has a total of 9093 rows. One row has no tweet text, and only 3291 rows name a target brand or product.
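
The `None` printed after the summary above appears because `DataFrame.info()` writes the summary to standard output and returns `None`. If the summary is needed as a value, one option is to pass a text buffer through the `buf` parameter. The sketch below assumes a hypothetical helper `info_as_text` and a small stand-in DataFrame rather than the real dataset.

```python
import io

import pandas as pd


def info_as_text(df: pd.DataFrame) -> str:
    """Return the summary that df.info() would print, as a string."""
    buffer = io.StringIO()
    df.info(buf=buffer)  # info() writes into the buffer and still returns None
    return buffer.getvalue()


# Small stand-in frame; the values are illustrative only
demo = pd.DataFrame({"tweet_text": ["a", None], "emotion": ["x", "y"]})

summary = info_as_text(demo)
print(summary)
```

With a helper like this, a loading function could return the summary text instead of `None`.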



In [7]:
# Check the data types of the DataFrame columns.
data_type_counts = check_data_types(df1)
print("Count of columns for each data type category:")
print(data_type_counts)

Count of columns for each data type category:
{'string': 3}


All three columns share a single data type: object, which pandas uses for strings.


To simplify working with the dataset, we will rename the columns to simpler and shorter names.


In [8]:
# Function to rename the columns
def rename_columns(df, columns_dict):
    """
    Rename DataFrame columns in place and return the DataFrame.

    Parameters:
        df (pd.DataFrame): The DataFrame whose columns need to be renamed.
        columns_dict (dict): A dictionary where keys are current column names and values are the new column names.
    """
    df.rename(columns=columns_dict, inplace=True)
    return df

# Define the dictionary for renaming columns
columns_dict = {
    'tweet_text': 'tweet',
    'emotion_in_tweet_is_directed_at': 'target_entity',
    'is_there_an_emotion_directed_at_a_brand_or_product': 'emotion'
}

# Rename columns using the dictionary
df1 = rename_columns(df1, columns_dict)

print("\nRenamed DataFrame columns:")
print(df1.columns)
df1.head()


Renamed DataFrame columns:
Index(['tweet', 'target_entity', 'emotion'], dtype='object')


| | tweet | target_entity | emotion |
|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion |


The columns have been renamed successfully, and the new names are shorter and easier to work with.

### 2. Data Cleaning
Data cleaning ensures that the text data is consistent and free of errors. For this project, we will check for missing values, check for duplicates, remove extra white space, and normalize capitalization.



In [9]:
# Check missing values
df1.isna().sum()

tweet               1
target_entity    5802
emotion             0
dtype: int64

There is only one missing tweet text value, which will be removed. There are 5,802 missing "target_entity" values; however, this is acceptable since the current project focuses on overall tweet sentiment rather than specific items. These missing values will be replaced with an "Uncategorized" classification.

In [20]:
# Remove null tweets, remove duplicate rows, and fill in missing target_entity values

# Remove the one row with a null 'tweet'
df1.dropna(subset=['tweet'], inplace=True)

# Remove duplicate rows
df1.drop_duplicates(inplace=True)

# Fill null 'target_entity' values with "Uncategorized"
# (assigning back avoids calling fillna with inplace=True on a column selection)
df1['target_entity'] = df1['target_entity'].fillna('Uncategorized')

df1.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9070 entries, 0 to 9092
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   tweet          9070 non-null   object
 1   target_entity  9070 non-null   object
 2   emotion        9070 non-null   object
 3   cleaned_tweet  9070 non-null   object
dtypes: object(4)
memory usage: 354.3+ KB


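After these steps the data contains no null values or duplicate rows, and the row count has dropped from 9093 to 9070: one row with a missing tweet and 22 duplicate rows were removed. The cleaned_tweet column appears in the summary above because this cell (In [20]) was last run after the text-cleaning cell below (In [14]).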
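
To make the row accounting explicit, the same three steps can be run while counting what each one removes. This is a sketch on a toy DataFrame with made-up values. The column names match df1, and everything else (`demo`, `cleaned`, the counters) is an illustrative assumption.

```python
import pandas as pd

# Toy frame standing in for df1; the values are illustrative only
demo = pd.DataFrame({
    "tweet": ["love my iPad", "love my iPad", None, "meh"],
    "target_entity": ["iPad", "iPad", None, None],
    "emotion": ["Positive emotion", "Positive emotion",
                "Negative emotion", "Negative emotion"],
})

rows_before = len(demo)
null_tweets = int(demo["tweet"].isna().sum())

# Same three steps as the cell above, written without inplace=True
cleaned = demo.dropna(subset=["tweet"])
duplicate_rows = int(cleaned.duplicated().sum())  # rows identical to an earlier row
cleaned = cleaned.drop_duplicates()
cleaned = cleaned.assign(
    target_entity=cleaned["target_entity"].fillna("Uncategorized")
)

print(f"{rows_before} rows -> {len(cleaned)} rows "
      f"({null_tweets} null tweet(s), {duplicate_rows} duplicate(s) removed)")
```

Counting duplicates with `duplicated().sum()` before dropping them gives a number that can be checked against the change in row count.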

Text Cleaning: The function below cleans the tweet column by stripping leading and trailing white space, converting the text to lower case, and removing every character that is not a letter, digit, or white space.

In [14]:
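def clean_text(text):
    # Non-string values (e.g. NaN) become an empty string
    if not isinstance(text, str):
        return ""
    # Strip leading/trailing white space and convert to lower case
    text = text.strip()
    text = text.lower()
    # Delete every character that is not a letter, digit, or white space
    pattern = re.compile(r'[^a-zA-Z0-9\s]')
    text = pattern.sub('', text)
    return text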

# Apply the clean_text function to the tweet column
df1['cleaned_tweet'] = df1['tweet'].apply(clean_text)
df1.head()

| | tweet | target_entity | emotion | cleaned_tweet |
|---|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion | wesley83 i have a 3g iphone after 3 hrs tweeti... |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion | jessedee know about fludapp awesome ipadiphon... |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion | swonderlin can not wait for ipad 2 also they s... |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion | sxsw i hope this years festival isnt as crashy... |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion | sxtxstate great stuff on fri sxsw marissa maye... |


The new cleaned_tweet column stores each tweet after text cleaning.
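
One side effect is visible in row 1 of the output above: because special characters are deleted outright, "iPad/i..." becomes "ipadiphon...", so two words are fused into one token. A possible alternative, sketched below, replaces special characters with a space and then collapses repeated white space. The function `clean_text_spaced` and the sample string are hypothetical and are not part of the project pipeline.

```python
import re

# Same character class as in clean_text above
NON_ALNUM = re.compile(r"[^a-zA-Z0-9\s]")


def clean_text_spaced(text):
    """Hypothetical variant of clean_text: special characters become spaces."""
    if not isinstance(text, str):
        return ""
    text = text.strip().lower()
    text = NON_ALNUM.sub(" ", text)
    # Collapse the runs of white space that the substitution can create
    return re.sub(r"\s+", " ", text).strip()


sample = "Awesome iPad/iPhone app!"  # illustrative input, not a row from the dataset

deleted = NON_ALNUM.sub("", sample.strip().lower())  # behaviour of clean_text
spaced = clean_text_spaced(sample)

print(deleted)  # awesome ipadiphone app
print(spaced)   # awesome ipad iphone app
```

Whether the spaced variant helps depends on the tokenizer and vectorizer used later. It also splits contractions such as "isn't" into "isn" and "t", so it is a trade-off rather than a strict improvement.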