## Sentiment Analysis

### Business Problem

- This analysis aims to build a model that rates the sentiment of a tweet based on its content.

### Objectives

- To build a multiclass classifier that accurately classifies tweets as positive, negative, or neutral.

## Importing Relevant Libraries


In [81]:
import pandas as pd
import numpy as np

## Loading Data

### Importing the Dataset

In [82]:
# Set display options to show all rows and increase the column width
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)

# Read data
df = pd.read_csv('data/judge-1377884607_tweet_product_company.csv', encoding='latin1')

In [83]:
# Checking the first five rows
df.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,".@wesley83 I have a 3G iPhone. After 3 hrs tweeting at #RISE_Austin, it was dead! I need to upgrade. Plugin stations at #SXSW.",iPhone,Negative emotion
1,"@jessedee Know about @fludapp ? Awesome iPad/iPhone app that you'll likely appreciate for its design. Also, they're giving free Ts at #SXSW",iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. They should sale them down at #SXSW.,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as crashy as this year's iPhone app. #sxsw,iPad or iPhone App,Negative emotion
4,"@sxtxstate great stuff on Fri #SXSW: Marissa Mayer (Google), Tim O'Reilly (tech books/conferences) &amp; Matt Mullenweg (Wordpress)",Google,Positive emotion


In [84]:
# Simplifying the column names

df.columns = ['Tweet','Brand','Emotion']
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9093 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Tweet    9092 non-null   object
 1   Brand    3291 non-null   object
 2   Emotion  9093 non-null   object
dtypes: object(3)
memory usage: 213.2+ KB


The dataset contains 9,093 rows. All but one have tweet text (9,092 non-null), and every row carries an emotion label. Only 3,291 rows name the brand or product the emotion is directed at, so most tweets have no labelled target. This does not mean those tweets never mention a brand, only that none was recorded. The first five rows show positive and negative labels; the objective above also calls for a neutral class.
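
The counts above can also be checked directly instead of being read off `df.info()`. The following is a minimal sketch on a toy frame. The three rows and the `"Neutral (placeholder)"` label are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; all values are illustrative only
toy = pd.DataFrame({
    "Tweet": ["love my iPad", np.nan, "Android app keeps crashing"],
    "Brand": ["iPad", np.nan, np.nan],
    "Emotion": ["Positive emotion", "Neutral (placeholder)", "Negative emotion"],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Distribution of the emotion labels, counting NaN as its own bucket if present
label_counts = toy["Emotion"].value_counts(dropna=False)
print(label_counts)
```

Running the same two calls on `df` would show which label values actually occur, which `df.info()` alone does not reveal.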

## Data Cleaning

### Cleaning of the Brand column

In [85]:
# Identify distribution of column
df['Brand'].value_counts()

Brand
iPad                               946
Apple                              661
iPad or iPhone App                 470
Google                             430
iPhone                             297
Other Google product or service    293
Android App                         81
Android                             78
Other Apple product or service      35
Name: count, dtype: int64

In [86]:
# Map product category to brand
product_mapping = {
    "iPad": "Apple",
    "iPad or iPhone App": "Apple",
    "iPhone": "Apple",
    "Other Apple product or service": "Apple",
    "Other Google product or service": "Google",
    "Android App": "Google",
    "Android": "Google"
}

# Replace each product category with its parent brand; values without a mapping are kept as-is
df["Brand"] = df["Brand"].map(product_mapping).fillna(df["Brand"])

# Display value counts and info
print(df["Brand"].value_counts())
df.info()  # info() prints its own report and returns None, so it is not wrapped in print()


Brand
Apple     2409
Google     882
Name: count, dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9093 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Tweet    9092 non-null   object
 1   Brand    3291 non-null   object
 2   Emotion  9093 non-null   object
dtypes: object(3)
memory usage: 213.2+ KB


### Analyzing the Distribution of Missing Values in Brand Column

In [87]:
print(df["Brand"].isna().sum())
print(df["Brand"].isna().mean())  # Proportion of missing data

5802
0.6380732431540745
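
Beyond the overall share, it can help to see whether missing `Brand` values are concentrated in particular emotion labels. The following is a minimal sketch on a toy frame. The rows and the `"Other (placeholder)"` label are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in; the values are illustrative only
toy = pd.DataFrame({
    "Brand": ["Apple", np.nan, np.nan, "Google", np.nan, np.nan],
    "Emotion": [
        "Positive emotion", "Positive emotion",
        "Negative emotion", "Negative emotion",
        "Other (placeholder)", "Other (placeholder)",
    ],
})

# Share of rows with a missing Brand within each emotion label
missing_share = toy["Brand"].isna().groupby(toy["Emotion"]).mean()
print(missing_share)
```

Applied to `df`, the same expression would show whether the unlabelled brands are spread evenly across labels or cluster in one of them.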


### Imputing Missing Values in Brand Column Based on Other Columns

Most missing Brand values can be filled by searching the Tweet text for brand keywords.

In [88]:
# Assign 'Apple' or 'Google' to 'Brand' based on keywords in 'Tweet'.
# The tweet text is lowercased so that matching is case-insensitive.
# Note: any row that matches a keyword is relabelled, even if 'Brand' was already set,
# and Apple keywords take precedence when a tweet mentions both brands.
df["Brand"] = np.where(df["Tweet"].str.lower().str.contains("iphone|ipad|apple|itunes", na=False), "Apple",
                       np.where(df["Tweet"].str.lower().str.contains("android|google", na=False), "Google", df["Brand"]))
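
The nested `np.where` above assigns a brand to every tweet that matches a keyword, including rows that already had a label. If the intent is strictly to impute, a variant that touches only missing rows may be preferable. The following is a sketch under that assumption, using the same keywords on a made-up toy frame. The `Brand_filled` column name is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in; tweets and labels are illustrative only
toy = pd.DataFrame({
    "Tweet": [
        "Google party tonight, leaving my iPhone at home",  # already labelled Google
        "New iPad 2 line is huge",
        "Android fans at #sxsw",
        "Great panel today",
        np.nan,
    ],
    "Brand": ["Google", np.nan, np.nan, np.nan, np.nan],
})

text = toy["Tweet"].str.lower()
is_apple = text.str.contains("iphone|ipad|apple|itunes", na=False)
is_google = text.str.contains("android|google", na=False)

# Keyword-based guess for every row; Apple is assigned last so it wins on ties
guess = pd.Series(np.nan, index=toy.index, dtype="object")
guess[is_google] = "Google"
guess[is_apple] = "Apple"

# fillna only fills rows where Brand is missing, so existing labels are preserved
toy["Brand_filled"] = toy["Brand"].fillna(guess)
print(toy[["Brand", "Brand_filled"]])
```

In this sketch the first row keeps its original `Google` label even though the tweet mentions an iPhone, whereas the nested `np.where` would relabel it as `Apple`.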



In [89]:
# Count "Apple" and "Google"
brand_counts = df["Brand"].value_counts()

# Count NaN values
nan_count = df["Brand"].isna().sum()

# Display counts
print("Counts for 'Apple' and 'Google':")
print(brand_counts)

print("\nCount for NaN values:")
print(nan_count)


Counts for 'Apple' and 'Google':
Brand
Apple     5606
Google    2780
Name: count, dtype: int64

Count for NaN values:
707


In [90]:
# Select rows whose Brand is still missing: tweets that matched no keyword and had no original label
non_matching_tweets = df[df["Brand"].isna()]

# Display the tweets that did not match any keyword
print(non_matching_tweets[["Tweet"]])


                                                                                                                                                               Tweet
6                                                                                                                                                                NaN
51                                  ÛÏ@mention {link} &lt;-- HELP ME FORWARD THIS DOC to all Anonymous accounts, techies,&amp; ppl who can help us JAM #libya #SXSW
52                                                                                   ÷¼ WHAT? ÷_ {link} ã_ #edchat #musedchat #sxsw #sxswi #classical #newTwitter
53                                                                      .@mention @mention on the location-based 'fast, fun and future' - {link} (via @mention #sxsw
66                                                                          At #sxsw? @mention / @mention wanna buy you a drink. 7pm at Fado on 4th. {link} Join us!
71        

In [91]:
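# Drop the rows where no brand could be assigned (this includes the row with a missing tweet)

# .copy() makes the result an independent DataFrame, avoiding SettingWithCopyWarning later
df = df[df["Brand"].isin(["Apple", "Google"])].copy()

# Reset the index if needed
# df.reset_index(drop=True, inplace=True)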
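
Before discarding rows, it is worth recording how many are removed. The following is a minimal sketch on a toy frame with made-up rows. It applies the same `isin` filter and resets the index in one step.

```python
import numpy as np
import pandas as pd

# Toy stand-in; the values are illustrative only
toy = pd.DataFrame({
    "Tweet": ["iPad line", "no brand here", "Google maps demo", np.nan],
    "Brand": ["Apple", np.nan, "Google", np.nan],
})

rows_before = len(toy)

# Keep only rows with an assigned brand and renumber the index from 0
kept = toy[toy["Brand"].isin(["Apple", "Google"])].reset_index(drop=True)

rows_dropped = rows_before - len(kept)
print(f"dropped {rows_dropped} of {rows_before} rows")
print(kept)
```

Resetting the index is optional, but it avoids gaps in the row labels if later steps rely on positional alignment.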

