# Predict the Emotion Labels based on the Basic NRC Emotion Lexicon 

In this section, we predicted emotion labels for the texts in both the t5 and yangswei_85 datasets. First, we used the SBERT model to transform each text into an embedding vector. Next, we grouped the words of the Basic NRC Emotion Lexicon by emotion and converted each group into an SBERT embedding as well. We then calculated the cosine similarity between each text embedding and each emotion embedding, and assigned the emotion with the highest similarity score as the predicted emotion for the text. Finally, we mapped the emotions predicted with the NRC Emotion Lexicon to Parrott's emotion categories.

## Import libraries

In [4]:
%pip install sentence-transformers

Note: you may need to restart the kernel to use updated packages.




In [5]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support




In [6]:
import spacy
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk import download

## T5 dataset

In this part, we applied the pipeline to the t5 test dataset (t5_test).

In [7]:
t5_test = pd.read_csv('https://raw.githubusercontent.com/SaraHoxha/emotion-detection-txa/main/Model%20Implementation/data/test_t5.csv')

## Data Preprocessing

In [8]:
# Download necessary NLTK resources
download('punkt')  # For tokenization
download('stopwords')  # For stopwords

# Load SpaCy English model
nlp = spacy.load("en_core_web_sm")

[nltk_data] Downloading package punkt to C:\Users\minhd/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\minhd/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [9]:
# Build the stopword set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

def preprocess_text(text):
    # Lowercase
    text = text.lower()

    # Tokenize into words
    tokens = word_tokenize(text)

    # Keep alphabetic tokens only (drops punctuation and numbers) and remove stopwords
    tokens = [word for word in tokens if word.isalpha() and word not in STOP_WORDS]

    # Lemmatization (using SpaCy)
    doc = nlp(" ".join(tokens))
    lemmatized_tokens = [token.lemma_ for token in doc]

    # Join the lemmas back into a single string
    return " ".join(lemmatized_tokens)

In [10]:
# Data preprocessing
t5_test['processed_text'] = t5_test['text'].apply(preprocess_text)

## Text Embedding with SBERT 

In this part, we used SBERT (Sentence-BERT) to convert each text into a vector representation that captures the meaning of the sentence. The all-mpnet-base-v2 model generates these embeddings. Each text in the dataset is transformed into an embedding and stored in a new column called embeddings. These embeddings are later compared with the emotion embeddings to predict the emotion behind each text.

In [11]:
# Load the SBERT model used to generate sentence embeddings
model = SentenceTransformer('all-mpnet-base-v2')

In [12]:
# Convert text into sentence embeddings
t5_test['embeddings'] = t5_test['processed_text'].apply(lambda x: model.encode(x))
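
Encoding one row at a time works, but the encode method of SentenceTransformer also accepts a list of sentences and batches them internally, which is usually faster on large datasets. The following is a minimal sketch under that assumption; the helper name embed_texts and the batch size of 64 are illustrative choices, not part of the original notebook.

```python
def embed_texts(texts, model, batch_size=64):
    """Encode all texts in one call and return one vector per text (hypothetical helper)."""
    # `model` is assumed to expose encode(sentences, batch_size=..., show_progress_bar=...)
    vectors = model.encode(list(texts), batch_size=batch_size, show_progress_bar=False)
    # Split the result into a plain list with one vector per input text
    return list(vectors)

# Possible usage with the notebook's objects (not executed here):
# t5_test['embeddings'] = embed_texts(t5_test['processed_text'], model)
```

The returned list has the same order as the input texts, so it can be assigned back to the DataFrame column directly.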

In [13]:
emotion_embeddings = {}

## Emotion Embedding with SBERT 

This code reads the Basic NRC Emotion Lexicon, keeps only the word-emotion pairs flagged as associated, drops the 'positive' and 'negative' sentiment categories, and creates one embedding per emotion. For each emotion, it groups the related words, joins them into a single string, and generates an embedding with the SBERT model.

In [14]:
# Import and read NRC Emotion file
nrc_data_path = 'NRC-Emotion-Lexicon-Wordlevel-v0.92.txt'  
nrc_df = pd.read_csv(nrc_data_path, sep='\t', header=None, names=['word', 'emotion', 'sentiment'])

# Keep only associated word-emotion pairs (flag == 1) and drop the 'positive' and 'negative' sentiment categories
nrc_df = nrc_df[(nrc_df['sentiment'] == 1) & (~nrc_df['emotion'].isin(['positive', 'negative']))]

# Get emotion from the word
def get_emotion(word):
    emotions = nrc_df[nrc_df['word'] == word]['emotion'].tolist()
    if emotions:
        return emotions
    else:
        return None  
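
The read_csv call above expects three tab-separated columns per line: a word, an emotion or sentiment category, and a 0/1 association flag. The sketch below applies the same filtering to a few made-up rows using only the standard library; the sample rows are illustrative and are not copied from the lexicon.

```python
import csv
import io
from collections import defaultdict

# Made-up rows in the assumed "word<TAB>category<TAB>flag" layout
sample = (
    "sunny\tjoy\t1\n"
    "sunny\tfear\t0\n"
    "sunny\tpositive\t1\n"
    "storm\tfear\t1\n"
    "storm\tnegative\t1\n"
)

words_by_emotion = defaultdict(list)
for word, category, flag in csv.reader(io.StringIO(sample), delimiter="\t"):
    # Keep associated pairs only, and skip the two sentiment categories
    if flag == "1" and category not in {"positive", "negative"}:
        words_by_emotion[category].append(word)

print(dict(words_by_emotion))
```

Only the rows with flag 1 and an emotion category survive, which mirrors the boolean mask applied to nrc_df.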

In [15]:
# Generate sentence embeddings for each emotion in NRC Emotion Lexicon
emotion_embeddings = {}
nrc_emotions = nrc_df['emotion'].unique()

# Create one embedding per emotion from the words in that emotion group
for emotion in nrc_emotions:
    # Retrieve the words associated with this emotion in the NRC lexicon
    words = nrc_df[nrc_df['emotion'] == emotion]['word'].tolist()

    # Join the words into a single string and encode it as one embedding.
    # Note: SBERT truncates input beyond the model's maximum sequence length,
    # so only the leading part of a long word list contributes to the vector.
    words_embeddings = model.encode(' '.join(words))
    emotion_embeddings[emotion] = words_embeddings
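
The loop above encodes each emotion's word list as one long string. An alternative, shown here only as a sketch and not as the method used for the results in this notebook, is to encode every word separately and average the vectors, so that every word contributes regardless of list length. The helper name mean_emotion_embedding and its encode argument are hypothetical.

```python
def mean_emotion_embedding(words, encode):
    """Average per-word vectors; `encode` maps a list of strings to a list of vectors."""
    vectors = encode(words)
    dim = len(vectors[0])
    # Element-wise mean over all word vectors
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]

# Possible usage with the notebook's objects (not executed here):
# emotion_embeddings[emotion] = mean_emotion_embedding(words, model.encode)
```

If encode returns a NumPy array, calling .mean(axis=0) on it gives the same result more concisely. Switching to this variant would change the predictions, so the reported counts and metrics would need to be recomputed.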

## Predict emotion 

This code compares the similarity between text posts and emotions using cosine similarity. The get_most_similar_emotion function calculates the cosine similarity between the embedding of each text and the emotion embeddings. It selects the emotion with the highest similarity score as the predicted emotion for each text. The predicted emotion is then assigned to the predicted_emotion column in the dataset.

In [16]:
# Calculate cosine similarity between texts and emotions
def get_most_similar_emotion(post_embedding):
    max_sim = -1
    most_similar_emotion = None
    for emotion, emotion_embedding in emotion_embeddings.items():
        similarity = cosine_similarity([post_embedding], [emotion_embedding])[0][0]
        if similarity > max_sim:
            max_sim = similarity
            most_similar_emotion = emotion
    return most_similar_emotion

In [17]:
# Predict emotion by embedding
t5_test['predicted_emotion'] = t5_test['embeddings'].apply(lambda x: get_most_similar_emotion(x))
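
To make the selection rule concrete, the sketch below computes cosine similarity by hand on toy 2-D vectors and picks the emotion with the highest score. The vectors are made up for illustration and stand in for the much longer SBERT embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors standing in for emotion and text embeddings (made up)
toy_emotions = {"joy": [1.0, 0.0], "fear": [0.0, 1.0]}
post = [3.0, 1.0]

scores = {emotion: cosine(post, vec) for emotion, vec in toy_emotions.items()}
predicted = max(scores, key=scores.get)
print(predicted)  # joy
```

The post vector points mostly along the joy axis, so its similarity to joy (3/sqrt(10)) exceeds its similarity to fear (1/sqrt(10)). Because cosine similarity ignores vector length, only the direction of the embeddings matters.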

In [19]:
label_counts = t5_test['predicted_emotion'].value_counts()
label_counts

predicted_emotion
trust           10016
anticipation     8591
surprise         3188
anger            1465
sadness           608
fear              371
joy               208
disgust            27
Name: count, dtype: int64

## Convert to Parrott's emotion

After predicting the NRC labels, we mapped the results to Parrott's emotion categories, based on the emotion groups listed at https://en.wikipedia.org/wiki/Emotion_classification. This mapping was an attempt to align the NRC emotions with Parrott's emotions and observe how they correspond.
The mapping involves three choices that need explanation:
1. Parrott's "joy" class contains the sub-emotions "eagerness" and "hope". These are close in meaning to "anticipation" in the NRC lexicon, so we mapped "anticipation" to "joy".
2. Parrott's "anger" class lists "disgust" among its sub-emotions, so we mapped "disgust" to "anger".
3. We added "trust" to Parrott's "love" class because it is an important part of love: trust helps build strong, loving relationships.

In [26]:
def map_nrc_to_parrott(nrc_emotion):
    mapping = {
        "anger": "anger",
        "anticipation": "joy",  # convert Anticipation to Joy
        "disgust": "anger",   # convert Disgust to Anger
        "fear": "fear",
        "joy": "joy",
        "sadness": "sadness",
        "surprise": "surprise",
        "trust": "love",         # convert Trust to Love
    }
    return mapping.get(nrc_emotion, None)

In [27]:
# Map the NRC emotion labels to Parrott's emotion labels
t5_test['map_to_parrott'] = t5_test['predicted_emotion'].apply(map_nrc_to_parrott)

In [49]:
# Count the labels after mapping
label_counts_t5 = t5_test['map_to_parrott'].value_counts()
label_counts_t5

map_to_parrott
love        10016
joy          8799
surprise     3188
anger        1492
sadness       608
fear          371
Name: count, dtype: int64
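
As a quick consistency check, the mapped counts can be reproduced from the NRC-level counts reported earlier for the t5 test set. The sketch below hard-codes those reported counts and the same mapping dictionary, then sums the counts per Parrott category.

```python
from collections import Counter

# NRC-level counts as reported in the value_counts output for t5_test
nrc_counts = {
    "trust": 10016, "anticipation": 8591, "surprise": 3188, "anger": 1465,
    "sadness": 608, "fear": 371, "joy": 208, "disgust": 27,
}

# Same mapping as map_nrc_to_parrott
mapping = {
    "anger": "anger", "anticipation": "joy", "disgust": "anger", "fear": "fear",
    "joy": "joy", "sadness": "sadness", "surprise": "surprise", "trust": "love",
}

parrott_counts = Counter()
for nrc_emotion, count in nrc_counts.items():
    parrott_counts[mapping[nrc_emotion]] += count

print(dict(parrott_counts))
```

The merged classes account for the differences: joy becomes 208 + 8591 = 8799 and anger becomes 1465 + 27 = 1492, while the other categories keep their counts.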

In [48]:
t5_test.head()

Unnamed: 0,text,label,processed_text,embeddings,predicted_emotion,map_to_parrott
0,Winter Blues and WFH question i think the sudd...,anger,winter blue wfh question think sudden shift bu...,"[0.019767283, -0.023702554, -0.024428228, 0.03...",anticipation,joy
1,New Workspace i saw some of your other posts a...,joy,new workspace see post renovation go incredibl...,"[-0.0148920305, 0.053874586, 0.008216122, 0.00...",joy,joy
2,Hard to mentally unwind… i go for a long walk ...,joy,hard mentally go long walk work help shift wor...,"[-0.005250863, -0.021374501, -0.04698471, -0.0...",sadness,sadness
3,Would you leave 150k for 90k depends on your e...,joy,would leave depend expense saving,"[0.0018950137, 0.006368839, -0.0058382694, -0....",sadness,sadness
4,There’s no magic formula to get remote work so...,fear,magic formula get remote work worry nothing go...,"[-0.01843844, -0.03417755, -0.0084715495, -0.0...",surprise,surprise


## Yangswei_85

In this part, we applied the same pipeline to the yangswei_85 dataset.

In [29]:
# Import data
yangswei_85_test = pd.read_csv('https://raw.githubusercontent.com/SaraHoxha/emotion-detection-txa/main/Model%20Implementation/data/test_yangswei_85.csv')

In [30]:
# Data preprocessing
yangswei_85_test['processed_text'] = yangswei_85_test['text'].apply(preprocess_text)

In [32]:
# Convert text into sentence embeddings
yangswei_85_test['embeddings'] = yangswei_85_test['processed_text'].apply(lambda x: model.encode(x))

In [33]:
# Predict emotion by embedding
yangswei_85_test['predicted_emotion'] = yangswei_85_test['embeddings'].apply(lambda x: get_most_similar_emotion(x))

In [35]:
# Map the NRC emotion labels to Parrott's emotion labels
yangswei_85_test['map_to_parrott'] = yangswei_85_test['predicted_emotion'].apply(map_nrc_to_parrott)

In [47]:
yangswei_85_test.head()

Unnamed: 0,text,label,processed_text,embeddings,predicted_emotion,map_to_parrott
0,RTO is the new war on the middle class don't f...,joy,rto new war middle class fire tie package aver...,"[-0.023674954, -0.005235405, 0.007556416, -0.0...",trust,love
1,How do you continue with life outside of work ...,joy,continue life outside work mental exhaustion c...,"[0.0020819395, 0.042672526, -0.032852497, -0.0...",anticipation,joy
2,Very desperate for a job would you know a pers...,fear,desperate job would know person come apply job...,"[0.02066167, 0.011141692, -0.021646924, -0.056...",trust,love
3,What time do you start working most days quest...,joy,time start work day question set hour like get...,"[-0.04142596, -0.0028264832, -0.00038986365, -...",anticipation,joy
4,What are good job sites to find LEGIT remote w...,joy,good job site find legit remote work like dece...,"[-0.027872037, 0.0032044589, -0.03555838, -0.0...",trust,love


In [50]:
# Count the labels after mapping
label_counts_yangswei_85 = yangswei_85_test['map_to_parrott'].value_counts()
label_counts_yangswei_85

map_to_parrott
love        9092
joy         7979
surprise    2970
anger       1337
sadness      575
fear         336
Name: count, dtype: int64

## Calculate metrics

In [42]:
# Calculate metrics
def calculate_metrics(y_true, y_pred):
    accuracy = accuracy_score(y_true, y_pred)
    precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average='weighted')
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall,
        'f1': f1}

In [43]:
# Results of t5 dataset
y_true_t5 = t5_test['label']
y_pred_t5 = t5_test['map_to_parrott']
t5_metrics_map_to_parrott = calculate_metrics(y_true_t5, y_pred_t5)
t5_metrics_map_to_parrott

{'accuracy': 0.23719048786467273,
 'precision': 0.481811500158746,
 'recall': 0.23719048786467273,
 'f1': 0.2871546505038804}

In [44]:
# Results of yangswei_85 dataset
y_true_yangswei_85_test = yangswei_85_test['label']
y_pred_yangswei_85_test = yangswei_85_test['map_to_parrott']
yangswei_85_metrics_map_to_parrott = calculate_metrics(y_true_yangswei_85_test, y_pred_yangswei_85_test)
yangswei_85_metrics_map_to_parrott

{'accuracy': 0.2595002018933106,
 'precision': 0.4941631108057709,
 'recall': 0.2595002018933106,
 'f1': 0.3281493426504335}
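
In both result dictionaries, recall equals accuracy. This follows from average='weighted': each class's recall is weighted by its share of the true labels, and the weighted sum reduces to the overall fraction of correct predictions. The sketch below shows this on a handful of made-up labels, using only the standard library.

```python
import math
from collections import Counter

# Made-up true and predicted labels for illustration
y_true = ["joy", "joy", "joy", "fear", "anger"]
y_pred = ["joy", "love", "joy", "fear", "joy"]

support = Counter(y_true)  # number of true samples per class
correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # hits per class

accuracy = sum(correct.values()) / len(y_true)

# Per-class recall, weighted by each class's share of the true labels
weighted_recall = sum(
    (correct[c] / support[c]) * (support[c] / len(y_true)) for c in support
)

print(math.isclose(accuracy, weighted_recall))  # True
```

Weighted precision and F1 do not simplify this way, which is why they differ from accuracy in the results above.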

In [45]:
# Save the metrics to the result file
def save_metrics_to_file(metrics, filename):
    metrics_str = (f"Accuracy: {metrics['accuracy']:.4f}\n"
        f"Precision: {metrics['precision']:.4f}\n"
        f"Recall: {metrics['recall']:.4f}\n"
        f"F1-Score: {metrics['f1']:.4f}\n")
    with open(filename, 'w') as file:
        file.write(metrics_str)

In [46]:
save_metrics_to_file(t5_metrics_map_to_parrott, 't5_metrics_nrc.txt')
save_metrics_to_file(yangswei_85_metrics_map_to_parrott, 'yangswei_85_metrics_nrc.txt')