# Moderate Ltd. Assignment



Name: Aneesa Begum J



Location: Chennai



Role: AIML Developer



# Emotional Sentiment Analysis and Adaptive Response System


Dataset link: https://www.kaggle.com/datasets/atharvjairath/empathetic-dialogues-facebook-ai/data

In [None]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

In [None]:
# Load the data
df=pd.read_csv("/content/emotion-emotion_69k.csv")

In [None]:
# Display the first 10 rows of the dataset
df.head(10)

Unnamed: 0.1,Unnamed: 0,Situation,emotion,empathetic_dialogues,labels,Unnamed: 5,Unnamed: 6
0,0,I remember going to the fireworks with my best...,sentimental,Customer :I remember going to see the firework...,"Was this a friend you were in love with, or ju...",,
1,1,I remember going to the fireworks with my best...,sentimental,Customer :This was a best friend. I miss her.\...,Where has she gone?,,
2,2,I remember going to the fireworks with my best...,sentimental,Customer :We no longer talk.\nAgent :,Oh was this something that happened because of...,,
3,3,I remember going to the fireworks with my best...,sentimental,Customer :Was this a friend you were in love w...,This was a best friend. I miss her.,,
4,4,I remember going to the fireworks with my best...,sentimental,Customer :Where has she gone?\nAgent :,We no longer talk.,,
5,5,i used to scare for darkness,afraid,Customer : it feels like hitting to blank wall...,Oh ya? I don't really see how,,
6,6,i used to scare for darkness,afraid,Customer :dont you feel so.. its a wonder \nAg...,I do actually hit blank walls a lot of times b...,,
7,7,i used to scare for darkness,afraid,Customer : i virtually thought so.. and i used...,Wait what are sweatings,,
8,8,i used to scare for darkness,afraid,Customer :Oh ya? I don't really see how\nAgent :,dont you feel so.. its a wonder,,
9,9,i used to scare for darkness,afraid,Customer :I do actually hit blank walls a lot ...,i virtually thought so.. and i used to get sw...,,


# Exploratory Data Analysis

In [None]:
# Display dataset information
print("\nDataset Info:")
print(df.info())


Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64636 entries, 0 to 64635
Data columns (total 7 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Unnamed: 0            64636 non-null  int64 
 1   Situation             64636 non-null  object
 2   emotion               64632 non-null  object
 3   empathetic_dialogues  64636 non-null  object
 4   labels                64636 non-null  object
 5   Unnamed: 5            113 non-null    object
 6   Unnamed: 6            5 non-null      object
dtypes: int64(1), object(6)
memory usage: 3.5+ MB
None


In [None]:
# Drop columns with high missing values and irrelevant columns
columns_to_drop = ['Unnamed: 5', 'Unnamed: 6', 'Unnamed: 0']  # Adjust if needed
df = df.drop(columns=columns_to_drop)

In [None]:
print("\nColumns after dropping irrelevant ones:")
print(df.columns)


Columns after dropping irrelevant ones:
Index(['Situation', 'emotion', 'empathetic_dialogues', 'labels'], dtype='object')


In [None]:
# Check for missing values
print("\nMissing Values Before Handling:")
print(df.isnull().sum())



Missing Values Before Handling:
Situation               0
emotion                 4
empathetic_dialogues    0
labels                  0
dtype: int64


In [None]:
# Drop rows with missing values in 'emotion'
clean_df = df.dropna(subset=['emotion'])

# Display missing values after handling
print("\nMissing Values After Handling:")
print(clean_df.isnull().sum())


Missing Values After Handling:
Situation               0
emotion                 0
empathetic_dialogues    0
labels                  0
dtype: int64


In [None]:
# Reset the index after dropping rows (reassign rather than modify in place)
clean_df = clean_df.reset_index(drop=True)

# Display the cleaned dataset shape
print("\nCleaned Dataset Shape:", clean_df.shape)


Cleaned Dataset Shape: (64632, 4)
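
Before modelling, it can help to check how the rows are spread across emotion classes, since a skewed distribution changes how later scores should be read. The following is a minimal sketch on a small made-up DataFrame; `toy_df` and its rows are hypothetical and only stand in for `clean_df`.

```python
import pandas as pd

# Hypothetical stand-in for clean_df; the notebook would use clean_df itself
toy_df = pd.DataFrame({
    "emotion": ["sad", "joyful", "sad", "anxious", "sentimental", "joyful", "sad"]
})

# Count rows per emotion, most frequent first
emotion_counts = toy_df["emotion"].value_counts()

# Share of each class, useful for spotting imbalance
emotion_share = toy_df["emotion"].value_counts(normalize=True).round(3)

print(emotion_counts)
print(emotion_share)
```

On the real data, the same two calls on `clean_df['emotion']` would show whether a few emotions dominate the dataset.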


# Data Preparation

In [None]:
import string
import re

# Predefined stopwords list
stop_words = set([
    'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", 'your', 'yours',
    'he', 'she', 'it', 'they', 'this', 'that', 'was', 'were', 'be', 'have', 'do', 'a', 'an', 'the',
    'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with',
    'about', 'against', 'into', 'through', 'before', 'after', 'above', 'below', 'to', 'from', 'up'
])

# Function to clean text data
def clean_text(text):
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'\[.*?\]', '', text)  # Remove text in brackets
    text = re.sub(r'\s+', ' ', text)  # Remove extra spaces
    text = text.translate(str.maketrans('', '', string.punctuation))  # Remove punctuation
    text = ' '.join(word for word in text.split() if word not in stop_words)  # Remove stopwords
    return text.strip()

# Apply cleaning to the 'empathetic_dialogues' column
df['empathetic_dialogues'] = df['empathetic_dialogues'].apply(clean_text)

# Verify the cleaned data
print(df.head())

                                           Situation      emotion  \
0  I remember going to the fireworks with my best...  sentimental   
1  I remember going to the fireworks with my best...  sentimental   
2  I remember going to the fireworks with my best...  sentimental   
3  I remember going to the fireworks with my best...  sentimental   
4  I remember going to the fireworks with my best...  sentimental   

                                empathetic_dialogues  \
0  customer remember going see fireworks best fri...   
1                customer best friend miss her agent   
2                      customer no longer talk agent   
3     customer friend in love just best friend agent   
4                      customer where has gone agent   

                                              labels  
0  Was this a friend you were in love with, or ju...  
1                                Where has she gone?  
2  Oh was this something that happened because of...  
3                This was a 
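
To make the effect of `clean_text` concrete, the sketch below reproduces the function and stopword list from the cell above so that it runs on its own, then applies it to two strings. The first string is row 1 of the dataset; the second is a made-up input chosen to show two side effects: bracketed text is removed, and because punctuation is stripped before the stopword check, a contraction such as "you're" becomes "youre" and no longer matches the stopword list.

```python
import re
import string

# Same stopword list as in the notebook cell
stop_words = set([
    'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", 'your', 'yours',
    'he', 'she', 'it', 'they', 'this', 'that', 'was', 'were', 'be', 'have', 'do', 'a', 'an', 'the',
    'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with',
    'about', 'against', 'into', 'through', 'before', 'after', 'above', 'below', 'to', 'from', 'up'
])

def clean_text(text):
    text = text.lower()                      # lowercase
    text = re.sub(r'\[.*?\]', '', text)      # drop text in square brackets
    text = re.sub(r'\s+', ' ', text)         # collapse whitespace (including newlines)
    text = text.translate(str.maketrans('', '', string.punctuation))  # strip punctuation
    text = ' '.join(word for word in text.split() if word not in stop_words)
    return text.strip()

# Row 1 of the dataset: speaker prefixes survive as plain tokens
row_1 = "Customer :This was a best friend. I miss her.\nAgent :"
print(clean_text(row_1))

# Made-up input: the contraction is kept as "youre"
made_up = "You're [laughs] great!"
print(clean_text(made_up))
```

The first call prints `customer best friend miss her agent`, matching row 1 of the output above. The `customer` and `agent` tokens appear in every row, so they carry no information about the emotion.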

# Model Building - BERT

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
import torch

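# Split the data into training and validation sets.
# Use only rows with a known emotion: `df` still contains the rows with a
# missing 'emotion' value that were excluded from `clean_df` above.
labeled_df = df.dropna(subset=['emotion'])
train_texts, val_texts, train_labels, val_labels = train_test_split(
    labeled_df['empathetic_dialogues'].tolist(),
    labeled_df['emotion'].tolist(),
    test_size=0.2,
    random_state=42
)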

# Encode the emotion labels into integers
label_encoder = LabelEncoder()
train_labels_encoded = label_encoder.fit_transform(train_labels)
val_labels_encoded = label_encoder.transform(val_labels)

# Load pre-trained BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenize the data
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=128)
val_encodings = tokenizer(val_texts, truncation=True, padding=True, max_length=128)


In [None]:
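class EmotionDataset(torch.utils.data.Dataset):
    """Wraps tokenizer encodings and integer labels for use with the Trainer."""

    def __init__(self, encodings, labels):
        self.encodings = encodings  # dict of lists: input_ids, attention_mask, ...
        self.labels = labels        # integer-encoded emotion labels

    def __getitem__(self, idx):
        # Build one example: each encoding field plus its label, as tensors
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)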

train_dataset = EmotionDataset(train_encodings, train_labels_encoded)
val_dataset = EmotionDataset(val_encodings, val_labels_encoded)

In [None]:
# Load pre-trained BERT model for sequence classification
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=len(label_encoder.classes_))

# Define training parameters
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset
)

# Train the model
trainer.train()


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss
10,3.8299
20,3.7959
30,3.9271
40,3.7758
50,3.8375
60,3.751
70,3.7898
80,3.7775
90,3.7723
100,3.7146


TrainOutput(global_step=19392, training_loss=2.1824088948148703, metrics={'train_runtime': 4452.335, 'train_samples_per_second': 34.841, 'train_steps_per_second': 4.355, 'total_flos': 7177188985058880.0, 'train_loss': 2.1824088948148703, 'epoch': 3.0})

In [None]:
# Reuse the LabelEncoder that was fitted on the training labels.
# Refitting it on a different label list would change the integer-to-emotion
# mapping, so the model's predictions would decode to the wrong emotions.
emotion_labels = label_encoder.classes_
print("Emotion labels:", emotion_labels)


In [None]:
# After training is complete:
model.save_pretrained('/content/saved_model')
tokenizer.save_pretrained('/content/saved_model')


('/content/saved_model/tokenizer_config.json',
 '/content/saved_model/special_tokens_map.json',
 '/content/saved_model/vocab.txt',
 '/content/saved_model/added_tokens.json')
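
The saved directory holds the model weights and tokenizer files, but not the fitted `LabelEncoder`, so a fresh session cannot turn class ids back into emotion names from that directory alone. One possible approach is to store the ordered class list next to the model. The sketch below is an assumption, not part of the original notebook: `label_classes`, the `labels.json` file name, and the temporary directory are all hypothetical.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical class list; in the notebook this would be label_encoder.classes_.tolist()
label_classes = ["anxious", "joyful", "sad", "sentimental"]

# A temporary directory stands in for '/content/saved_model'
save_dir = Path(tempfile.mkdtemp())
labels_path = save_dir / "labels.json"  # hypothetical file name

# Save the ordered class list; the position of each name is its integer id
labels_path.write_text(json.dumps(label_classes))

# Later (for example in a fresh session), reload and rebuild the mapping
loaded_classes = json.loads(labels_path.read_text())
id2label = dict(enumerate(loaded_classes))

predicted_class = 2  # hypothetical model output
print(id2label[predicted_class])
```

Keeping the list in a file avoids having to retype the labels by hand, which risks a mapping that differs from the one used in training.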

In [None]:
# Evaluate the model on the validation dataset
eval_results = trainer.evaluate()

# Print evaluation results (only loss and timing here, since no compute_metrics function was passed to the Trainer)
print("Evaluation results:", eval_results)


Evaluation results: {'eval_loss': 2.520890712738037, 'eval_runtime': 53.9745, 'eval_samples_per_second': 239.52, 'eval_steps_per_second': 29.94, 'epoch': 3.0}
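
The results above contain only the loss and timing figures because the `Trainer` was created without a `compute_metrics` function. A minimal sketch of such a function is shown below, assuming accuracy is the metric of interest; the toy logits and labels are made up for illustration.

```python
import numpy as np

def compute_metrics(eval_pred):
    """Return accuracy given a (logits, label_ids) pair."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # highest-scoring class per row
    accuracy = float((predictions == labels).mean())
    return {"accuracy": accuracy}

# Made-up logits for 4 examples and 3 classes
toy_logits = np.array([
    [2.0, 0.1, 0.3],
    [0.2, 1.5, 0.1],
    [0.3, 0.2, 0.9],
    [1.1, 0.4, 0.2],
])
toy_labels = np.array([0, 1, 2, 1])

metrics = compute_metrics((toy_logits, toy_labels))
print(metrics)
```

In the notebook, this would be passed as `compute_metrics=compute_metrics` when constructing the `Trainer`, so that `trainer.evaluate()` reports accuracy alongside `eval_loss`.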


In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the trained model and tokenizer
model_dir = '/content/saved_model'  # Ensure this is the correct path
emotion_tokenizer = AutoTokenizer.from_pretrained(model_dir)
emotion_model = AutoModelForSequenceClassification.from_pretrained(model_dir)


# Emotion Classification and Response Generation
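
The cells in this section call `generate_response`, whose definition is not included in the notebook. One plausible shape is a lookup table with a fallback reply. In the sketch below, the `RESPONSES` table and its entries are hypothetical; only the fallback sentence is taken from the notebook output.

```python
# Hypothetical emotion-to-reply table; the entries are illustrative only
RESPONSES = {
    "sad": "I'm sorry you're feeling down. Do you want to talk about it?",
    "joyful": "That's wonderful to hear! What made your day?",
    "anxious": "That sounds stressful. What is worrying you most?",
}

# Fallback reply, copied from the notebook output
DEFAULT_RESPONSE = "I'm here to listen. Let me know if you need any help."

def generate_response(emotion):
    """Return a canned reply for the emotion, or the fallback if it is unknown."""
    key = str(emotion).strip().lower()
    return RESPONSES.get(key, DEFAULT_RESPONSE)

print(generate_response("joyful"))
print(generate_response("sentimental"))
```

The outputs recorded in this notebook show the fallback reply for every detected emotion. That would be consistent with a table whose keys did not match the predicted labels, though the actual cause is not shown.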

In [None]:
import torch

# Ensure the model is on the correct device (GPU if available, else CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

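# Example input (the same sentence that appears in the output below)
user_input = "Everything seems fine, but I feel a bit down."

# Tokenize the input text and ensure the input is also on the same device
inputs = tokenizer(user_input, return_tensors='pt', truncation=True, padding=True).to(device)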

# Set the model to evaluation mode (important when using it for inference)
model.eval()

# Perform prediction
with torch.no_grad():
    outputs = model(**inputs)
    predicted_class = torch.argmax(outputs.logits, dim=1).item()

# Map the predicted class to the corresponding emotion
detected_emotion = label_encoder.inverse_transform([predicted_class])[0]

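# Generate a response based on the detected emotion.
# generate_response must already be defined in the session; the author's
# definition is not included in this notebook.
response = generate_response(detected_emotion)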

# Print the result
print(f"User Input: {user_input}")
print(f"Detected Emotion: {detected_emotion}")
print(f"Chatbot Response: {response}")


User Input: Everything seems fine, but I feel a bit down.
Detected Emotion: sad
Chatbot Response: I'm here to listen. Let me know if you need any help.


In [None]:
# List of multiple user inputs for testing
user_inputs = [
    "I'm feeling really happy today!",
    "I'm anxious about the upcoming deadline.",
    "It's been a tough day, I'm feeling stressed.",
    "I miss my old friends, it makes me sentimental.",
    "Everything seems fine, but I feel a bit down."
]

# Iterate through each user input
for user_input in user_inputs:
    # Tokenize the input and move to the same device as the model
    inputs = tokenizer(user_input, return_tensors='pt', truncation=True, padding=True).to(device)

    # Perform inference
    model.eval()  # Set the model to evaluation mode
    with torch.no_grad():
        outputs = model(**inputs)
        predicted_class = torch.argmax(outputs.logits, dim=1).item()

    # Map the predicted class to the corresponding emotion
    detected_emotion = label_encoder.inverse_transform([predicted_class])[0]

    # Generate an empathetic response
    response = generate_response(detected_emotion)

    # Print results
    print(f"User Input: {user_input}")
    print(f"Detected Emotion: {detected_emotion}")
    print(f"Chatbot Response: {response}\n")

User Input: I'm feeling really happy today!
Detected Emotion: joyful
Chatbot Response: I'm here to listen. Let me know if you need any help.

User Input: I'm anxious about the upcoming deadline.
Detected Emotion: anxious
Chatbot Response: I'm here to listen. Let me know if you need any help.

User Input: It's been a tough day, I'm feeling stressed.
Detected Emotion: anxious
Chatbot Response: I'm here to listen. Let me know if you need any help.

User Input: I miss my old friends, it makes me sentimental.
Detected Emotion: sentimental
Chatbot Response: I'm here to listen. Let me know if you need any help.

User Input: Everything seems fine, but I feel a bit down.
Detected Emotion: sad
Chatbot Response: I'm here to listen. Let me know if you need any help.

