# Business understanding
Getting feedback from users is a crucial aspect of growth: it gives a deeper understanding of user sentiment, improves content moderation, and informs product and service improvements. Our project uses the Google AI GoEmotions dataset to expand emotion classification, with the aim of improving chatbot sensitivity, online behavior detection, and customer support. By training neural networks and SVM models to analyze text tonality, we advance emotion analysis in NLP for stakeholders such as chatbot system providers, online platforms, and customer support departments. The real-world problem is that these systems have limited sensitivity to, and understanding of, user emotions. Better emotion analysis addresses it, leading to more empathetic interactions, improved content moderation, and better customer support, and ultimately to better user experiences.

## Objectives

### Main Objective
* Expand emotion classification datasets by training models to analyze text tonality using the Google AI GoEmotions dataset.

### Specific Objectives:

* Enhance chatbot sensitivity by improving the understanding and response to user emotions.
* Detect online hazards by identifying potential harmful content through emotional analysis.
* Improve customer support by recognizing and addressing user emotions in textual communication.

# Data Understanding
The Google AI GoEmotions dataset contains labeled comments from Reddit users expressing diverse emotions. This dataset is suitable for training neural networks to analyze text tonality, as it provides a comprehensive emotional spectrum and allows for subtle differentiation among various emotions. The dataset includes detailed emotional annotations and descriptive statistics, facilitating the analysis of emotions in text. While there may be limitations such as potential biases and subjective categorization, the GoEmotions dataset remains valuable for enhancing chatbot sensitivity, detecting online hazards, and improving customer support through the analysis of diverse emotions.


In [1]:
import pandas as pd
import numpy as np
import seaborn as sns

import matplotlib.pyplot as plt
import plotly.express as px

import warnings
warnings.filterwarnings("ignore")

In [2]:
sentiment_data = pd.read_csv('go_emotions_dataset.csv',index_col=0)
sentiment_data.head()

id,text,example_very_unclear,admiration,amusement,anger,annoyance,approval,caring,confusion,curiosity,...,love,nervousness,optimism,pride,realization,relief,remorse,sadness,surprise,neutral
eew5j0j,That game hurt.,False,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
eemcysk,>sexuality shouldn’t be a grouping category I...,True,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ed2mah1,"You do right, if you don't care then fuck 'em!",False,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
eeibobj,Man I love reddit.,False,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
eda6yn6,"[NAME] was nowhere near them, he was by the Fa...",False,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


In [None]:
sentiment_data.tail()

In [None]:
sentiment_data.info()

In [None]:
sentiment_data.shape

There are 211225 rows and 31 columns.

In [None]:
# Checking for missing values
sentiment_data.isna().sum()

In [None]:
# Check for duplicates (rows that repeat an earlier row in every column)
duplicates_df = sentiment_data[sentiment_data.duplicated()]
duplicates_df

In [None]:
duplicates_df.head()

Duplicated rows were not dropped: the text may be the same while the emotions recorded in the row differ.
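To see which texts repeat and whether their labels differ, one option is to flag duplicates on the `text` column only. The sketch below uses a small made-up frame; the rows, the index values, and the name `toy` are illustrative assumptions, not taken from the dataset.

```python
import pandas as pd

# Toy frame standing in for sentiment_data (values are made up for illustration)
toy = pd.DataFrame(
    {
        "text": ["Man I love reddit.", "Man I love reddit.", "That game hurt."],
        "love": [1, 0, 0],
        "sadness": [0, 0, 1],
    },
    index=["a1", "a1", "b2"],
)

# Rows identical in every column (what DataFrame.duplicated() checks by default)
full_duplicates = toy[toy.duplicated()]

# Rows sharing the same text regardless of labels; keep=False marks every copy
same_text = toy[toy.duplicated(subset="text", keep=False)]

print(len(full_duplicates))  # 0
print(len(same_text))        # 2
```

Here the two copies of the same text carry different labels, so the default check finds nothing while the text-only check finds both rows.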

In [None]:
sentiment_data['text'].nunique()

In [None]:
# Exploring patterns and trends in the dataset for the emotions columns 
# Summary statistics
summary_stats = sentiment_data.describe()

# Correlation matrix (numeric columns only; the text column cannot be correlated)
correlation_matrix = sentiment_data.corr(numeric_only=True)

# Print the results
print("Summary Statistics:")
print(summary_stats)

In [None]:
print("\nCorrelation Matrix:")
print(correlation_matrix)

In [None]:
value_counts = sentiment_data['example_very_unclear'].value_counts()
value_counts

In [None]:
# Count the occurrences of each value in the example_very_unclear column
unclear_counts = sentiment_data['example_very_unclear'].value_counts()

# Plot the counts once, with one colour per value
unclear_counts.plot(kind='bar', figsize=(8, 6), color=['blue', 'orange'])
plt.xlabel('Unclear Label')
plt.ylabel('Count')
plt.title('Count of Unclear Labels')
plt.show()


This column indicates whether the annotator marked the example as very unclear or difficult to label, in which case they did not choose any emotion labels. `True` means no emotions were recorded.
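One way to check this claim programmatically is to sum the emotion columns for the flagged rows. The following is a minimal sketch on a made-up frame; `toy`, its rows, and the two-column `emotion_cols` list are assumptions for illustration (the real data has 28 emotion columns).

```python
import pandas as pd

# Toy frame standing in for sentiment_data (values are made up for illustration)
toy = pd.DataFrame(
    {
        "text": ["unclear comment", "happy comment"],
        "example_very_unclear": [True, False],
        "joy": [0, 1],
        "anger": [0, 0],
    }
)
emotion_cols = ["joy", "anger"]  # in the real data: all emotion columns

# Number of emotion labels attached to each row
labels_per_row = toy[emotion_cols].sum(axis=1)

# Largest label count among rows flagged as unclear; 0 means none were labelled
max_labels_when_unclear = labels_per_row[toy["example_very_unclear"]].max()
print(max_labels_when_unclear)  # 0
```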

In [None]:
# Confirming no emotions were recorded
filtered_data = sentiment_data[sentiment_data['example_very_unclear'] == True]
filtered_data

## Data Preparation

#### Dropping records that have ['example_very_unclear'] == True


In [None]:
# Take an explicit copy so later in-place changes do not act on a slice of sentiment_data
sentiments_df = sentiment_data[sentiment_data['example_very_unclear'] != True].copy()

In [None]:
# creating a copy of the text column 
# Rename the existing 'text' column to 'original_text'
sentiments_df.rename(columns={'text': 'original_text'}, inplace=True)

# Create a new column 'text' as a copy of 'original_text' and insert it next to 'original_text'
sentiments_df.insert(sentiments_df.columns.get_loc('original_text') + 1, 'text', sentiments_df['original_text'].copy())


In [None]:
sentiments_df.head()

In [None]:
sentiments_df.tail()

In [None]:
sentiments_df.info()

In [None]:
sentiments_df.shape

In [None]:
sentiments_df.describe()

In [None]:
sentiments_df.isna().sum()

In [None]:
sentiments_df.duplicated().sum()

In [None]:
duplicate_rows = sentiments_df[sentiments_df.duplicated()]
duplicate_rows

Text can be duplicated even though the IDs differ, and the same ID can appear more than once with different responses.

#### How many emotions can one text record have?

In [None]:
# Grouping the emotions into a set (the first three columns are original_text, text and example_very_unclear)
emotions = set(sentiments_df.columns[3:])
emotions

In [None]:
def assign_emotions(row):
    emotion_list = [emotion for emotion, value in row.items() if value == 1]
    return ', '.join(emotion_list)
    

# Create a new column 'listed_emotions' to store the individual emotions per row
sentiments_df['listed_emotions'] = sentiments_df.apply(assign_emotions, axis=1)

# Print the updated DataFrame
print(sentiments_df.head())

In [None]:
# The comment ID is the DataFrame index, so a record can be looked up by its ID
specific_id = 'eew5j0j'

# Using loc to access the record with the specific ID
specific_record = sentiments_df.loc[specific_id]

# Printing the specific record
print(specific_record)

In [None]:
# Count the emotions per record from the comma-separated 'listed_emotions' column
sentiments_df['emotion_count'] = sentiments_df['listed_emotions'].str.split(', ').str.len()

# Print the updated DataFrame
print(sentiments_df.head(20))
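The count can also be taken directly from the 0/1 emotion columns. The sketch below compares the two approaches on a made-up frame (`toy` and its values are illustrative assumptions). It also shows one edge case: a row with no labels produces an empty string, which splits into a one-element list.

```python
import pandas as pd

# Toy one-hot frame (values are made up for illustration)
toy = pd.DataFrame(
    {"joy": [1, 0, 0], "love": [1, 0, 1], "anger": [0, 0, 0]},
    index=["x1", "x2", "x3"],
)

# Count labels directly from the 0/1 columns
count_from_columns = toy.sum(axis=1)

# Same count via a comma-separated string, mirroring the cells above
listed = toy.apply(lambda row: ", ".join(c for c, v in row.items() if v == 1), axis=1)
count_from_string = listed.str.split(", ").str.len()

print(count_from_columns.tolist())  # [2, 0, 1]
print(count_from_string.tolist())   # [2, 1, 1]  ('' splits into [''])
```

If every remaining record carries at least one label, the two counts agree; the row sum is simply the safer choice when that is not guaranteed.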

#### Lowercasing the text 

In [None]:
# Work on an explicit copy, then lowercase the text
sentiments_df = sentiments_df.copy()
sentiments_df['text'] = sentiments_df['text'].str.lower()

In [None]:
sentiments_df.head()

#### Removing punctuation

In [None]:
import re
import string

# Escape the punctuation characters so the character class is read literally
sentiments_df['text'] = sentiments_df['text'].str.replace('[{}]'.format(re.escape(string.punctuation)), '', regex=True)

In [None]:
sentiments_df.head()
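A minimal standalone sketch of the same punctuation stripping, using only the standard library. The sample strings are made up for illustration, and `punct_pattern` is a name introduced here.

```python
import re
import string

# Escape every punctuation character so none is read as regex syntax
punct_pattern = re.compile("[{}]".format(re.escape(string.punctuation)))

sample = "you do right, if you don't care [name]!"
cleaned = punct_pattern.sub("", sample)
print(cleaned)  # you do right if you dont care name

# string.punctuation is ASCII-only, so a curly apostrophe is left untouched
curly = punct_pattern.sub("", "shouldn’t")
print(curly)  # shouldn’t
```

Because `string.punctuation` covers ASCII characters only, typographic quotes such as the one in `shouldn’t` survive this step.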

#### Creating a column to hold emojis 

In [None]:
pip install emoji

In [None]:
import emoji

In [None]:
pip install demoji

In [None]:
import demoji

In [None]:
# Download the emoji dictionary
demoji.download_codes()

# Function to extract emojis from text
def extract_emojis(text):
    emojis = demoji.findall(text)
    return ''.join(emojis.keys())

# Store the emojis found in each text in a new 'emojis' column
sentiments_df['emojis'] = sentiments_df['text'].apply(extract_emojis)

In [None]:
sentiments_df.head()

In [None]:
sentiments_df.tail()

In [None]:
# The comment ID is the DataFrame index; this record contains an emoji
specific_id = 'ee0sak1'

# Using loc to access the record with the specific ID
specific_record = sentiments_df.loc[specific_id]

# Printing the specific record
print(specific_record)


In [None]:
pip install regex

In [None]:
pip install emoji --upgrade

In [None]:
# Select the records whose 'emojis' column contains at least one emoji

def has_emoji(text):
    emojis = demoji.findall(text)
    return bool(emojis)

emoji_records = sentiments_df[sentiments_df['emojis'].apply(has_emoji)]

# Printing the records with emojis
print(emoji_records)

In [None]:
type(emoji_records)

In [None]:
emoji_records.info()

In [None]:
# Count how often each emoji occurs across the 'emojis' column
emoji_counts = sentiments_df['emojis'].apply(lambda x: list(demoji.findall(x))).explode().value_counts()

# Select the top 10 most used emojis
top_10_emojis = emoji_counts.head(10)

# Plot the top 10 emojis as horizontal bars (counts on the x-axis, emojis on the y-axis)
plt.figure(figsize=(10, 6))
top_10_emojis.plot(kind='barh')
plt.xlabel('Count')
plt.ylabel('Emoji')
plt.title('Top 10 Most Used Emojis')
plt.tight_layout()
plt.show()


#try sns to create the visual 

In [None]:
top_10_emojis

#### Replacing emojis with corresponding text

In [None]:
def replace_emojis_with_text(text):
    # Replace emojis with corresponding text descriptions
    text_with_text_emojis = emoji.demojize(text)
    
    return text_with_text_emojis

# Applying the function to the 'text' column in your DataFrame
sentiments_df['text'] = sentiments_df['text'].apply(replace_emojis_with_text)

In [None]:
# ID 'ee0sak1' had an emoji in its text; retrieve it to confirm the conversion
# Set the maximum width of column 'text' to display the full text
pd.set_option('display.max_colwidth', None)

# Access the 'text' column for a specific ID
specific_data = sentiments_df.loc['ee0sak1', 'text']

# Print the full text
print(specific_data)

In [None]:
import re

In [None]:
# Function to remove emoji from the text column
def remove_emoji(text):
    no_emoji = re.sub(r':[^\s:]+:', '', text)
    return no_emoji

# Apply the remove_emoji function to the 'text' column
sentiments_df['text'] = sentiments_df['text'].apply(remove_emoji)
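To illustrate what the pattern `:[^\s:]+:` matches, here is a small standalone sketch. The sample strings are made up, and `:red_heart:` is used as an example of the shortcode style that `emoji.demojize` produces.

```python
import re

shortcode = re.compile(r":[^\s:]+:")

# A demojized shortcode is removed, leaving the surrounding words in place
without_emoji = shortcode.sub("", "i love this :red_heart: so much")

# Any other text wrapped in two colons matches as well
ratio = shortcode.sub("", "ratio 3:2:1 holds")
print(ratio)  # ratio 31 holds
```

Since punctuation was stripped in an earlier step, the only colons left in the text at this point should be the ones introduced by the emoji conversion, which is what makes this broad pattern workable here.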

In [None]:
# Display the updated DataFrame
sentiments_df.head()

In [None]:
# Retrieve the text using the index
text_ = sentiments_df.loc['ee0sak1', 'text']

# Display the text
print(text_)

In [None]:
sentiments_df.head()

In [None]:
# saving the sentiments_df to a csv 
sentiments_df.to_csv('Data/sentiments_df.csv', index = True)

#### Tokenization

In [None]:
pip install nltk

In [None]:
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

# Tokenize the 'text' column
sentiments_df['text'] = sentiments_df['text'].apply(word_tokenize)


In [None]:
# Print the tokenized text
sentiments_df.head()

#### Removing stop words

In [None]:
from nltk.corpus import stopwords

# Download stopwords
nltk.download('stopwords')

# Get the set of stop words
stop_words = set(stopwords.words('english'))

# Remove stop words from each token list and join the remaining words back into a string
sentiments_df['text'] = sentiments_df['text'].apply(lambda x: ' '.join(word for word in x if word.lower() not in stop_words))

In [None]:
# Print the text without stop words
sentiments_df.head()
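The filtering step can be sketched on a single token list without downloading anything. The stop list below is a tiny hypothetical stand-in for NLTK's English list, and the tokens are made up.

```python
# Hypothetical mini stop list; the notebook uses NLTK's English list instead
toy_stop_words = {"i", "the", "is", "a", "this"}

tokens = ["This", "is", "the", "best", "game"]

# Compare in lowercase so capitalised stop words are dropped too
kept = " ".join(word for word in tokens if word.lower() not in toy_stop_words)
print(kept)  # best game
```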

In [None]:
# Tokenize the 'text' column
sentiments_df['text'] = sentiments_df['text'].apply(word_tokenize)

In [None]:
sentiments_df.head()

#### Lemmatization

In [None]:
nltk.download('omw-1.4')

from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

In [None]:
# Create a lemmatizer instance
lemmatizer = WordNetLemmatizer()

In [None]:
# Lemmatize the tokenized text column
sentiments_df['text'] = sentiments_df['text'].apply(lambda x: ' '.join([lemmatizer.lemmatize(word) for word in x]))


In [None]:
# Print the lemmatized text
sentiments_df.head()

#### *Emotions Categorization*

In [None]:
# Creating a column for the labels
# Define the mapping of emotions to categories
emotion_to_category = {
    'admiration': 'positive',
    'amusement': 'positive',
    'approval': 'positive',
    'caring': 'positive',
    'curiosity': 'positive',
    'excitement': 'positive',
    'gratitude': 'positive',
    'joy': 'positive',
    'love': 'positive',
    'optimism': 'positive',
    'relief': 'positive',
    'surprise': 'positive',
    'sadness': 'negative',
    'pride': 'negative',
    'fear': 'negative',
    'embarrassment': 'negative',
    'disapproval': 'negative',
    'disappointment': 'negative',
    'confusion': 'negative',
    'annoyance': 'negative',
    'anger': 'negative',
    'nervousness': 'negative',
    'desire': 'negative',
    'remorse': 'ambiguous',
    'realization': 'ambiguous',
    'grief': 'ambiguous',
    'disgust': 'ambiguous',
    'neutral': 'neutral'
}

emotions_columns = sentiments_df.columns[3:-3]

In [None]:
emotions_columns

In [None]:
sentiments_df['labels'] = sentiments_df[emotions_columns].apply(lambda row: emotion_to_category.get(row.idxmax(), 'unknown'), axis=1)


In [None]:
sentiments_df.head()
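Because a record can carry several emotions, it helps to see what `idxmax` does in that case. The sketch below uses a made-up frame and a three-entry mapping (`toy` and `toy_mapping` are illustrative assumptions): when several columns hold a 1, the first of them in column order decides the category.

```python
import pandas as pd

# Toy one-hot frame: record m1 carries two emotions (values are made up)
toy = pd.DataFrame(
    {"admiration": [1, 0], "anger": [1, 0], "neutral": [0, 1]},
    index=["m1", "m2"],
)
toy_mapping = {"admiration": "positive", "anger": "negative", "neutral": "neutral"}

# idxmax returns the first column holding the row maximum
first_label = toy.idxmax(axis=1)
categories = first_label.map(toy_mapping)
print(categories.tolist())  # ['positive', 'neutral']
```

Record `m1` is labelled with both admiration and anger, yet it ends up as `positive` because `admiration` comes first among the columns.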

In [None]:
from sklearn.preprocessing import LabelEncoder
# Create an instance of LabelEncoder
label_encoder = LabelEncoder()

# Fit and transform the target column
sentiments_df['encoded_labels'] = label_encoder.fit_transform(sentiments_df['labels'])

In [None]:
sentiments_df.head(10)

In [None]:
unique_values = sentiments_df[['encoded_labels', 'labels']].drop_duplicates().values

In [None]:
for value in unique_values:
    encoded_label, label = value
    print(f"Encoded Label: {encoded_label}, Label: {label}")


In [None]:
print(sentiments_df['labels'].value_counts())

In [None]:
sentiments_df.head(30)

#### Data Exploration

#### Categorizing Emotions

In [None]:
sentiments_df.info()

In [None]:
#Grouping the emotions into a set
emotions = set(sentiments_df.columns[3:-5])
emotions

In [None]:
positive_col = ['admiration','amusement','approval','caring','curiosity','excitement','gratitude','joy','love','optimism','relief','surprise']
negative_col = ['sadness','pride','fear','embarrassment','disapproval','disappointment','confusion','annoyance','anger','nervousness','desire']
ambiguous_col = ['remorse','realization','grief','disgust']
neutral_col = ['neutral']

In [None]:
positive_col = sentiments_df[positive_col]
negative_col = sentiments_df[negative_col]
ambiguous_col = sentiments_df[ambiguous_col]
neutral_col = sentiments_df[neutral_col]

In [None]:
df_emotion = pd.DataFrame()
df_emotion['emotion'] = list(emotions)
df_emotion['group'] = ''
# Assign each emotion to its category with a single .loc call (avoids chained assignment)
df_emotion.loc[df_emotion['emotion'].isin(positive_col.columns), 'group'] = 'positive'
df_emotion.loc[df_emotion['emotion'].isin(negative_col.columns), 'group'] = 'negative'
df_emotion.loc[df_emotion['emotion'].isin(ambiguous_col.columns), 'group'] = 'ambiguous'
df_emotion.loc[df_emotion['emotion'].isin(neutral_col.columns), 'group'] = 'neutral'
df_emotion.head(3)

In [None]:
df_emotion['group'].unique()

In [None]:
df_emotion.columns

In [None]:
# Total count per emotion, joined with its category (a set cannot be used as a column indexer, so convert to a list)
temp = sentiments_df[list(emotions)].sum(axis=0) \
    .reset_index() \
    .rename(columns={'index': 'emotion', 0: 'n'}) \
    .merge(df_emotion, how='left', on='emotion') \
    .sort_values('n', ascending=False)

fig, ax = plt.subplots(figsize=(7, 7))
ax.tick_params(axis='x', rotation=90)
palette ={
    "positive": "skyblue", 
    "negative": "red", 
    "ambiguous": 'gray',
    "neutral": 'green'  # Add 'neutral' category and corresponding color
}
sns.barplot(data=temp, x='n', 
            y='emotion', hue='group', 
            dodge=False,
            palette=palette,
            ax=ax)
ax.set_title('Count of emotions appearance')
plt.show()


In [None]:
# Sum the counts within each category and sort from most to least frequent
temp = temp.groupby('group', as_index=False)['n'].sum()
temp = temp.sort_values('n', ascending=False)

ax = sns.barplot(data=temp, x='group', y='n')
ax.set_title('Emotions category counts')
plt.show()


In [None]:
# Number of texts labelled with each positive emotion (sum of the 0/1 column)
emotion_counts = positive_col.sum().to_dict()

In [None]:
import plotly.graph_objects as go

emotion_counts_sorted = sorted(emotion_counts.items(), key=lambda x: x[1], reverse=True)
x = [item[0] for item in emotion_counts_sorted]
y = [item[1] for item in emotion_counts_sorted]

fig = go.Figure(data=go.Bar(x=x, y=y))
fig.update_layout(
    title='Go Emotions',
    title_x=0.5,  # Center the title
    height=600,
    xaxis_title="Positive Emotion",
    yaxis_title="Number of Texts",
    xaxis_tickangle=45
)
fig.show()


In [None]:
# Number of texts labelled with each negative emotion (sum of the 0/1 column)
emotion_counts = negative_col.sum().to_dict()

In [None]:
emotion_counts_sorted = sorted(emotion_counts.items(), key=lambda x: x[1], reverse=True)
x = [item[0] for item in emotion_counts_sorted]
y = [item[1] for item in emotion_counts_sorted]

fig = go.Figure(data=go.Bar(x=x, y=y))
fig.update_layout(
    title='Go Emotions',
    title_x=0.5,  # Center the title
    height=600,
    xaxis_title="Negative Emotion",
    yaxis_title="Number of Texts",
    xaxis_tickangle=45
)
fig.show()

In [None]:
# Number of texts labelled with each ambiguous emotion (sum of the 0/1 column)
emotion_counts = ambiguous_col.sum().to_dict()

In [None]:
emotion_counts_sorted = sorted(emotion_counts.items(), key=lambda x: x[1], reverse=True)
x = [item[0] for item in emotion_counts_sorted]
y = [item[1] for item in emotion_counts_sorted]

fig = go.Figure(data=go.Bar(x=x, y=y))
fig.update_layout(
    title='Go Emotions',
    title_x=0.5,  # Center the title
    height=600,
    xaxis_title="Ambiguous Emotion",
    yaxis_title="Number of Texts",
    xaxis_tickangle=45
)
fig.show()

In [None]:
neutral_counts = neutral_col.value_counts()
neutral_counts

In [None]:
sentiments_df.columns

In [None]:
sentiments_df = sentiments_df[['original_text', 'text', 'listed_emotions','emotion_count','labels','encoded_labels']]

In [None]:
sentiments_df.info()

#### *Saving the preprocessed data to CSV for the modelling phase*

In [None]:
# saving the sentiments_df to a csv 
sentiments_df.to_csv('Data/preprocessed_data.csv', index = True)
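For the modelling phase, the saved file can be read back with the comment ID restored as the index. The sketch below shows the round trip on a made-up one-row frame, with an in-memory buffer standing in for `Data/preprocessed_data.csv`. The name `toy` and its values are assumptions for illustration.

```python
import io
import pandas as pd

# Toy frame with the comment ID as a named index (values are made up)
toy = pd.DataFrame(
    {"text": ["best game"], "encoded_labels": [3]},
    index=pd.Index(["eew5j0j"], name="id"),
)

buffer = io.StringIO()  # stands in for the CSV file on disk
toy.to_csv(buffer, index=True)
buffer.seek(0)

# index_col=0 turns the first CSV column back into the index
reloaded = pd.read_csv(buffer, index_col=0)
print(reloaded.index.tolist())  # ['eew5j0j']
```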