## Empathy Emotion and Personality Detection using Machine Learning
### 7120CEM CW1
WASSA 2023 Shared Task on Empathy Emotion and Personality Detection in Interactions (includes both regression and classification problems):  
- Website: https://codalab.lisn.upsaclay.fr/competitions/11167 
- Summary paper: https://aclanthology.org/2023.wassa-1.44/ 

1. Import Libraries

In [14]:
import string
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.multioutput import MultiOutputRegressor
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.feature_extraction.text import TfidfVectorizer

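# WordNet is required by WordNetLemmatizer. word_tokenize and stopwords.words()
# also need the Punkt tokenizer models and the 'stopwords' corpus to be installed.
nltk.download('wordnet')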

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\AttahiruJibril\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

2. Load data and preprocess

In [15]:
# Load data from TSV file
data = 'data/WASSA23_conv_level_with_labels_train.tsv'
df = pd.read_table(data, header=0)
df.head()

Unnamed: 0,conversation_id,turn_id,text,EmotionalPolarity,Emotion,Empathy,speaker_number,article_id,speaker_id,essay_id
0,2,0,I feel very sad for the people. ...,2.0,3.0,3.3333,1,35.0,30.0,1.0
1,2,1,It's terrible. Not only the people but the ani...,2.0,4.0,3.3333,2,35.0,17.0,501.0
2,2,10,I felt really sorry for the sister that now ha...,2.0,3.6667,2.6667,1,35.0,30.0,1.0
3,2,11,"Yeah, it's going to be tough but i am sure she...",0.6667,3.0,2.0,2,35.0,17.0,501.0
4,2,12,"Yeah, we never know what we can do unless we a...",0.3333,2.3333,1.3333,1,35.0,30.0,1.0


In [16]:
# Strip stray whitespace from column names (optional)
df.columns = df.columns.str.strip()

# Drop unnecessary columns
df.drop(["conversation_id", "turn_id", "speaker_number", "article_id", "speaker_id", "essay_id"], axis=1, inplace=True)

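# Separate features (text) and target variables.
# y_data still holds every non-text column (including 'Unnamed: 0');
# the three label columns are selected later, before training.
X_data, y_data = df.loc[:, 'text'], df.drop('text', axis=1)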

3. Split data into training and test sets

In [22]:
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, train_size=0.8)
# Reset the index of both splits so that rows are numbered from 0
X_train, X_test = X_train.reset_index(drop=True), X_test.reset_index(drop=True)
y_train, y_test = y_train.reset_index(drop=True), y_test.reset_index(drop=True)

# X_data.to_numpy(), y_data.to_numpy()
X_train[0:4]

0    wonder what they will do for food and water no...
1    I thought the fact that he faces 14 felony cou...
2    I agree - no way you just leave a dog in a car...
3    I agree. I hope these people learn from this s...
Name: text, dtype: object
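
Because `train_test_split` shuffles the rows randomly, the split above changes on every run, and the reported metrics change with it. A minimal sketch of a reproducible split follows, assuming a fixed seed is acceptable here; the toy data and the seed value 42 are illustrative and not part of the original notebook.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for X_data / y_data (hypothetical values)
texts = [f"sentence {i}" for i in range(10)]
labels = [[i, i + 1, i + 2] for i in range(10)]

# random_state fixes the shuffle, so repeated calls return the same split
split_a = train_test_split(texts, labels, train_size=0.8, random_state=42)
split_b = train_test_split(texts, labels, train_size=0.8, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print(len(X_tr), len(X_te))        # sizes of the two splits
print(split_a[0] == split_b[0])    # identical training texts across calls
```

Passing the same `random_state` to the notebook's own call would make later cells comparable between runs.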

4. Convert text data into numbers
- tokenization
- removal of stop words and punctuation (numbers are kept by the code below)
- lemmatization
- vectorization

In [23]:
def word_preprocessor(sentence):
    """Preprocess a sentence for natural language processing tasks.

    Steps:
        1. Tokenize the sentence into individual words.
        2. Remove stop words (common words that carry little meaning).
        3. Remove punctuation marks.
        4. Join the remaining tokens back into a single string.
        5. Pass the joined string to the WordNet lemmatizer.

    Note: WordNetLemmatizer.lemmatize() expects a single word, so a
    multi-word string usually comes back unchanged. The stop-word check
    is also case-sensitive, so capitalized words such as "I" are kept.

    Args:
        sentence: The input sentence to be preprocessed (string).

    Returns:
        The preprocessed sentence string.
    """
    stop_words = set(stopwords.words('english'))
    punctuations = set(string.punctuation)
    lem = WordNetLemmatizer().lemmatize
    tokens = word_tokenize(sentence)
    tokens = [word for word in tokens if word not in stop_words]
    tokens = [word for word in tokens if word not in punctuations]
    sentence_str = ' '.join(tokens)
    return lem(sentence_str)
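
The lemmatizer in `word_preprocessor` receives the whole joined string, whereas `WordNetLemmatizer.lemmatize` works on one word at a time. A minimal sketch of a token-level variant is shown below. The function name `preprocess_tokens`, the regex tokenizer, and the toy stop-word set and lemma table are all hypothetical stand-ins so that the example runs without any NLTK data; it is not the notebook's method.

```python
import re

def preprocess_tokens(sentence, stop_words, lemmatize):
    """Lowercase, tokenize, drop stop words, and lemmatize token by token."""
    # Simple regex tokenizer (assumption): keeps runs of letters and
    # apostrophes, which also discards digits and punctuation.
    tokens = re.findall(r"[a-z']+", sentence.lower())
    kept = [tok for tok in tokens if tok not in stop_words]
    # Lemmatize each token individually instead of the joined string
    return ' '.join(lemmatize(tok) for tok in kept)

# Hypothetical stand-ins for the NLTK stop-word list and lemmatizer
toy_stop_words = {"i", "the", "that", "he", "is", "a"}
toy_lemmas = {"faces": "face", "counts": "count"}

def toy_lemmatize(token):
    return toy_lemmas.get(token, token)

result = preprocess_tokens(
    "I thought the fact that he faces 14 felony counts",
    toy_stop_words,
    toy_lemmatize,
)
print(result)
```

With NLTK installed, the same function could be called with `set(stopwords.words('english'))` and `WordNetLemmatizer().lemmatize` in place of the toy arguments. Lowercasing first means capitalized stop words are also removed.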

In [24]:
X_train = X_train.apply(word_preprocessor)
X_test = X_test.apply(word_preprocessor)

# Convert features and labels to NumPy arrays, keeping only the three target columns
X_train, X_test = np.array(X_train), np.array(X_test)
y_train, y_test = np.array(y_train[['EmotionalPolarity', 'Emotion', 'Empathy']]), np.array(y_test[['EmotionalPolarity', 'Emotion', 'Empathy']])
X_train[0:4]


0                                    wonder food water
1         I thought fact faces 14 felony counts enough
2    I agree way leave dog car forget What horrible...
3           I agree I hope people learn stupid mistake
Name: text, dtype: object

In [17]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((7020,), (7020, 3), (1756,), (1756, 3))
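
The vectorization step listed in section 4 is carried out inside the pipeline in the next section. As a rough illustration of what `TfidfVectorizer` produces, the sketch below fits it on three short made-up sentences; the corpus and the `max_features=5` cap are assumptions chosen to keep the output small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny made-up corpus of already preprocessed sentences
corpus = [
    "wonder food water",
    "agree hope people learn stupid mistake",
    "agree way leave dog car forget",
]

# max_features keeps only the most frequent terms across the corpus
vectorizer = TfidfVectorizer(max_features=5)
X_tfidf = vectorizer.fit_transform(corpus)  # sparse matrix, one row per document

print(X_tfidf.shape)
print(sorted(vectorizer.vocabulary_))
```

Each row is a weighted bag-of-words vector, which is the numeric input that the regressor receives.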

5. Create and Train model

In [37]:
# Create a pipeline for text regression
regressor = make_pipeline(
  TfidfVectorizer(max_features=4086),  # Feature extraction with TF-IDF
  MultiOutputRegressor(Ridge())         # Multi-output regression with Ridge regularization
)

# Train the regression model on the training data
regressor.fit(X_train, y_train)

# Make predictions on the test data
y_pred = regressor.predict(X_test)

# Calculate evaluation metrics
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)

# Print the evaluation results (errors are in label units, not percentages)
print(f'MeanSquaredError: \t {mse:.4f} \nMeanAbsoluteError: \t {mae:.4f}')


MeanSquaredError: 	 0.3763 
MeanAbsoluteError: 	 0.4770
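
`mean_squared_error` and `mean_absolute_error` average over all output columns by default, so the two numbers above blend EmotionalPolarity, Emotion, and Empathy. One possible way to report each target separately is `multioutput='raw_values'`. The arrays below are small invented examples, not results from this notebook.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

targets = ['EmotionalPolarity', 'Emotion', 'Empathy']

# Invented values for illustration only (rows = samples, columns = targets)
y_true = np.array([[2.0, 3.0, 3.0], [1.0, 2.0, 1.0], [0.0, 3.0, 2.0]])
y_hat = np.array([[1.5, 3.0, 2.0], [1.0, 2.5, 1.0], [0.5, 2.0, 2.0]])

# raw_values returns one score per output column instead of their mean
mse_per_target = mean_squared_error(y_true, y_hat, multioutput='raw_values')
mae_per_target = mean_absolute_error(y_true, y_hat, multioutput='raw_values')

for name, mse_t, mae_t in zip(targets, mse_per_target, mae_per_target):
    print(f'{name}: MSE={mse_t:.4f}, MAE={mae_t:.4f}')
```

The mean of the per-target scores equals the single aggregated value returned by the default setting.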
