# GoEmotions: Google Emotions Dataset Preprocessing

## About Dataset

The Google AI GoEmotions dataset consists of Reddit comments labeled with the emotions they express. GoEmotions is designed for training neural networks to perform fine-grained emotion analysis of text. Most existing emotion classification datasets cover narrow domains (for example, news headlines or movie subtitles), are small, and use only six basic emotions (anger, surprise, disgust, joy, fear, and sadness). A broader emotional spectrum could make it possible to build more sensitive chatbots and models for detecting dangerous behavior on the Internet, and to improve customer support services.

The emotion categories were identified by Google together with psychologists and include 12 positive, 11 negative, and 4 ambiguous emotions, plus 1 neutral category. This makes the dataset suitable for tasks that require subtle differentiation between emotions.
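For reference, the 28 categories can be written out by sentiment group. The grouping below follows the sentiment mapping distributed with GoEmotions as best we can reconstruct it, so check it against the repository before relying on it. The constant name `SENTIMENT_GROUPS` is only illustrative.

```python
# Sentiment grouping of the 28 GoEmotions categories.
# Assumption: reconstructed from the published sentiment mapping; verify before use.
SENTIMENT_GROUPS = {
    "positive": [
        "admiration", "amusement", "approval", "caring", "desire", "excitement",
        "gratitude", "joy", "love", "optimism", "pride", "relief",
    ],
    "negative": [
        "anger", "annoyance", "disappointment", "disapproval", "disgust",
        "embarrassment", "fear", "grief", "nervousness", "remorse", "sadness",
    ],
    "ambiguous": ["confusion", "curiosity", "realization", "surprise"],
    "neutral": ["neutral"],
}

# Count the categories per group and overall.
counts = {group: len(labels) for group, labels in SENTIMENT_GROUPS.items()}
total = sum(counts.values())
print(counts, total)
```

The per-group counts add up to the 28 labels that appear as columns in the dataset.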


Source: https://arxiv.org/pdf/2005.00547.pdf

GitHub: https://github.com/google-research/google-research/tree/master/goemotions

In [19]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datasets import load_dataset
from transformers import pipeline, MarianTokenizer, MarianMTModel
from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM
from transformers import DataCollatorForSeq2Seq
import evaluate
import tensorflow as tf
from tqdm import tqdm
import torch

In [20]:
import pandas as pd

splits = {'train': 'data/train-00000-of-00001.parquet', 'test': 'data/test-00000-of-00001.parquet'}
df = pd.read_parquet("hf://datasets/sebdg/go_emotions_cleaned/" + splits["train"])

In [21]:
# Copy the dataset in case we need to restart the preprocessing
# (.copy() creates an independent frame; plain assignment would only alias df)
df_copy = df.copy()
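A backup only protects the original data if it is an independent object. Plain assignment binds a second name to the same DataFrame, while `.copy()` creates a separate one. A minimal sketch on a toy frame (the names `frame`, `alias`, and `backup` are illustrative):

```python
import pandas as pd

frame = pd.DataFrame({"text": ["a", "b"]})
alias = frame           # second name for the same object
backup = frame.copy()   # independent copy of the data

# Modify the original frame in place.
frame.loc[0, "text"] = "changed"

print(alias.loc[0, "text"])   # changed
print(backup.loc[0, "text"])  # a
```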

In [22]:
display(df.head())

Unnamed: 0,text,labels,labels_text,admiration,amusement,anger,annoyance,approval,caring,confusion,...,nervousness,optimism,pride,realization,relief,remorse,sadness,surprise,neutral,__index_level_0__
0,Thats true but [NAME] is still a valuable reso...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",realization,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,17779
1,I don't see how a NBC journalist's departure i...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","annoyance, disapproval",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,97610
2,If [NAME] supporters trigger you more than mas...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",neutral,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,46210
3,"What if you're someone, like me, who believes ...","[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...",curiosity,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,38641
4,You are normal. Very normal.,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",approval,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,20606


In [23]:
# info is a method; without the parentheses the notebook only shows the bound-method repr
df.info()

In [24]:
# dropna() returns a new frame, so assign the result to keep it
df = df.dropna()
display(df)

Unnamed: 0,text,labels,labels_text,admiration,amusement,anger,annoyance,approval,caring,confusion,...,nervousness,optimism,pride,realization,relief,remorse,sadness,surprise,neutral,__index_level_0__
0,Thats true but [NAME] is still a valuable reso...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",realization,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,17779
1,I don't see how a NBC journalist's departure i...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","annoyance, disapproval",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,97610
2,If [NAME] supporters trigger you more than mas...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",neutral,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,46210
3,"What if you're someone, like me, who believes ...","[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...",curiosity,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,38641
4,You are normal. Very normal.,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",approval,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,20606
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
46027,Dozens. My favorite scene is the tribe bringin...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","love, approval",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,49479
46028,"I love [NAME], he’s literally the craziest cou...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","love, amusement",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,5018
46029,"That's awful. I'm glad [NAME] doing better, bu...","[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","disapproval, gratitude, joy, annoyance, disgus...",0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,5951
46030,I'll check that out. Thank you!!,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",gratitude,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,26769


In [25]:
df.columns

Index(['text', 'labels', 'labels_text', 'admiration', 'amusement', 'anger',
       'annoyance', 'approval', 'caring', 'confusion', 'curiosity', 'desire',
       'disappointment', 'disapproval', 'disgust', 'embarrassment',
       'excitement', 'fear', 'gratitude', 'grief', 'joy', 'love',
       'nervousness', 'optimism', 'pride', 'realization', 'relief', 'remorse',
       'sadness', 'surprise', 'neutral', '__index_level_0__'],
      dtype='object')

In [26]:
# Remove the unnecessary columns
columns_to_keep = ["text", "labels", "labels_text"]
df = df[columns_to_keep]

In [27]:
# Split the comma-separated labels_text values and explode them into one row per (text, label) pair
df = df.assign(labels_text=df["labels_text"].str.split(",")).explode("labels_text")
display(df.tail(10))

Unnamed: 0,text,labels,labels_text
46029,"That's awful. I'm glad [NAME] doing better, bu...","[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",annoyance
46029,"That's awful. I'm glad [NAME] doing better, bu...","[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",disgust
46029,"That's awful. I'm glad [NAME] doing better, bu...","[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",admiration
46029,"That's awful. I'm glad [NAME] doing better, bu...","[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",curiosity
46029,"That's awful. I'm glad [NAME] doing better, bu...","[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",grief
46030,I'll check that out. Thank you!!,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",gratitude
46031,Your video fails to show how though?,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ...",disapproval
46031,Your video fails to show how though?,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ...",disappointment
46031,Your video fails to show how though?,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ...",disgust
46031,Your video fails to show how though?,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ...",surprise
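The split-and-explode step can be seen more easily on a toy frame. This is a sketch with made-up rows, not data from the dataset. Note that splitting on `","` leaves a leading space on every label after the first, which is why the labels are stripped afterwards.

```python
import pandas as pd

toy = pd.DataFrame({
    "text": ["first comment", "second comment"],
    "labels_text": ["love, approval", "neutral"],
})

# Turn each comma-separated string into a list, then emit one row per list element.
exploded = (
    toy.assign(labels_text=toy["labels_text"].str.split(","))
       .explode("labels_text")
)

# Remove the whitespace left over from splitting on "," only.
exploded["labels_text"] = exploded["labels_text"].str.strip()
print(exploded)
```

The exploded rows keep the index of the row they came from, so a comment with two labels appears twice under the same index.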


In [28]:
df["labels_text"] = df["labels_text"].str.strip().str.lower()
print(df["labels_text"].unique())  # Show all unique labels

['realization' 'annoyance' 'disapproval' 'neutral' 'curiosity' 'approval'
 'optimism' 'disappointment' 'confusion' 'joy' 'love' 'admiration'
 'surprise' 'amusement' 'excitement' 'anger' 'gratitude' 'caring'
 'remorse' 'embarrassment' 'pride' 'relief' 'disgust' 'sadness' 'grief'
 'desire' 'nervousness' 'fear']


In [29]:
# Collapse the 28 GoEmotions labels into our 7 labels
# ("happiness", "sadness", "anger", "surprise", "fear", "disgust", "neutral").
# A dictionary lookup matches whole labels only. Chained str.replace calls match substrings,
# so replacing "approval" would also turn "disapproval" into "dishappiness".
label_map = {
    "admiration": "happiness", "amusement": "happiness", "approval": "happiness",
    "caring": "happiness", "desire": "happiness", "excitement": "happiness",
    "gratitude": "happiness", "joy": "happiness", "love": "happiness",
    "optimism": "happiness", "pride": "happiness", "relief": "happiness",
    "anger": "anger", "annoyance": "anger",
    "confusion": "surprise", "curiosity": "surprise", "realization": "surprise", "surprise": "surprise",
    "disappointment": "disgust", "disgust": "disgust",
    "disapproval": "sadness",  # mapped to sadness, consistent with the outputs shown below
    "embarrassment": "sadness", "grief": "sadness", "remorse": "sadness", "sadness": "sadness",
    "fear": "fear", "nervousness": "fear",
    "neutral": "neutral",
}
df["labels_text"] = df["labels_text"].map(label_map)

In [30]:
print(df["labels_text"].unique()) 

['surprise' 'anger' 'sadness' 'neutral' 'happiness' 'disgust' 'fear']
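Substring replacement is order-sensitive, which is worth seeing in isolation: replacing `approval` also rewrites the tail of `disapproval`, and that is where an intermediate value like `dishappiness` comes from. The sketch below contrasts it with a whole-value lookup. The two-entry `mapping` is an illustrative subset, not the full label table.

```python
import pandas as pd

labels = pd.Series(["approval", "disapproval"])

# Substring replacement: "disapproval" contains "approval", so it is rewritten too.
replaced = labels.str.replace("approval", "happiness", regex=False)

# Whole-value lookup: each label is matched as a complete string.
mapping = {"approval": "happiness", "disapproval": "sadness"}  # illustrative subset
mapped = labels.map(mapping)

print(replaced.tolist())  # ['happiness', 'dishappiness']
print(mapped.tolist())    # ['happiness', 'sadness']
```

One caveat with `Series.map` and a dictionary: values missing from the dictionary become `NaN`, so the mapping needs an entry for every label that occurs.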


In [31]:
display(df)

Unnamed: 0,text,labels,labels_text
0,Thats true but [NAME] is still a valuable reso...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",surprise
1,I don't see how a NBC journalist's departure i...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",anger
1,I don't see how a NBC journalist's departure i...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",sadness
2,If [NAME] supporters trigger you more than mas...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",neutral
3,"What if you're someone, like me, who believes ...","[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...",surprise
...,...,...,...
46030,I'll check that out. Thank you!!,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",happiness
46031,Your video fails to show how though?,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ...",sadness
46031,Your video fails to show how though?,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ...",disgust
46031,Your video fails to show how though?,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ...",disgust


In [32]:
# Rename the text column to resemble our transcribed dataset better
df.rename(columns={"text": "Translated"}, inplace=True)

In [33]:
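# Keep one row per text; after the explode step this retains only the first listed label of each comment
df = df.drop_duplicates(subset=["Translated"], keep="first")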
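Because each multi-label comment was exploded into several rows, deduplicating on the text column turns the data back into one label per comment, keeping whichever label came first. A small sketch with made-up rows:

```python
import pandas as pd

toy = pd.DataFrame({
    "Translated": ["same text", "same text", "other text"],
    "labels_text": ["anger", "sadness", "neutral"],
})

# Rows are compared on "Translated" only; the first occurrence of each text is kept.
deduped = toy.drop_duplicates(subset=["Translated"], keep="first")
print(deduped)
```

Here the second label of `"same text"` is discarded, so any later label statistics reflect only first-listed labels.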

In [34]:
display(df)

Unnamed: 0,Translated,labels,labels_text
0,Thats true but [NAME] is still a valuable reso...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",surprise
1,I don't see how a NBC journalist's departure i...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",anger
2,If [NAME] supporters trigger you more than mas...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",neutral
3,"What if you're someone, like me, who believes ...","[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...",surprise
4,You are normal. Very normal.,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",happiness
...,...,...,...
46027,Dozens. My favorite scene is the tribe bringin...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",happiness
46028,"I love [NAME], he’s literally the craziest cou...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",happiness
46029,"That's awful. I'm glad [NAME] doing better, bu...","[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",sadness
46030,I'll check that out. Thank you!!,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",happiness


In [36]:
# Check if CUDA is available
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Let cuDNN benchmark and pick its fastest algorithms (results may be non-deterministic)
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False

# Load model and tokenizer
model_name = "Helsinki-NLP/opus-mt-en-fr"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name).to(device)  # Move model to the selected device

# Touch the device once before the first batch
# (the model is not run here, so this only initializes CUDA)
dummy_input = torch.tensor([[0]]).to(device)

# Function to translate sentences in batches
def translate(sentences, batch_size=8):  # Reduce batch size if translation stalls
    translated = []
    for i in tqdm(range(0, len(sentences), batch_size), desc="Translating", unit="batch"):
        batch = sentences[i:i+batch_size]

        # Tokenize and move the inputs to the device
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True).to(device)

        # Generate translations without tracking gradients
        with torch.no_grad():
            translated_batch = model.generate(**inputs)

        # Decode the generated token IDs back to text
        translated.extend(tokenizer.batch_decode(translated_batch, skip_special_tokens=True))

    return translated

# Translate the English sentences to French
df['Sentence'] = translate(df['Translated'].tolist())

# Display the translated sentences
print(df.head())

Using device: cuda


Translating: 100%|██████████| 5754/5754 [35:59<00:00,  2.66batch/s]  


                                          Translated  \
0  Thats true but [NAME] is still a valuable reso...   
1  I don't see how a NBC journalist's departure i...   
2  If [NAME] supporters trigger you more than mas...   
3  What if you're someone, like me, who believes ...   
4                       You are normal. Very normal.   

                                              labels labels_text  \
0  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...    surprise   
1  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...       anger   
2  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...     neutral   
3  [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...    surprise   
4  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...   happiness   

                                            Sentence  
0  C'est vrai, mais [NAME] est toujours une resso...  
1  Je ne vois pas comment le départ d'un journali...  
2  Si les partisans de [NAME] vous déclenchent pl...  
3  Et si vous êtes quelqu'un, comm
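The batch count in the progress bar follows directly from the slicing loop: with a batch size of 8, the number of batches is the row count divided by 8, rounded up. The sketch below reproduces the slicing without any model. `iter_batches` is a hypothetical helper, not part of the notebook or of `transformers`.

```python
import math

def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` holding at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

sentences = [f"sentence {i}" for i in range(20)]  # stand-in data
batches = list(iter_batches(sentences, batch_size=8))

print([len(b) for b in batches])        # [8, 8, 4]
print(math.ceil(len(sentences) / 8))    # 3

# 5754 batches of size 8 correspond to between 46025 and 46032 sentences.
print(math.ceil(46025 / 8), math.ceil(46032 / 8))  # 5754 5754
```

The last batch may be shorter than the others, which the tokenizer handles because padding is applied per batch.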

In [38]:
df.tail()

Unnamed: 0,Translated,labels,labels_text,Sentence
46027,Dozens. My favorite scene is the tribe bringin...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",happiness,Ma scène préférée est la tribu qui ramène le m...
46028,"I love [NAME], he’s literally the craziest cou...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",happiness,"J'aime [NAME], il est littéralement le canapé ..."
46029,"That's awful. I'm glad [NAME] doing better, bu...","[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",sadness,C'est affreux. Je suis content [NAME] de faire...
46030,I'll check that out. Thank you!!,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",happiness,Je vais voir ça.
46031,Your video fails to show how though?,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ...",sadness,Votre vidéo ne montre pas comment ?


In [46]:
df_copy = df.copy()

In [45]:
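# Move 'Sentence' to the front and drop the binary 'labels' vector column
df = df[['Sentence'] + [col for col in df.columns if col != 'Sentence']]
df = df.drop(columns=['labels'])
df = df.loc[:, ~df.columns.duplicated()]  # guard against duplicated column names
print(df)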

                                                Sentence  \
0      C'est vrai, mais [NAME] est toujours une resso...   
1      Je ne vois pas comment le départ d'un journali...   
2      Si les partisans de [NAME] vous déclenchent pl...   
3      Et si vous êtes quelqu'un, comme moi, qui croi...   
4                         Vous êtes normal, très normal.   
...                                                  ...   
46027  Ma scène préférée est la tribu qui ramène le m...   
46028  J'aime [NAME], il est littéralement le canapé ...   
46029  C'est affreux. Je suis content [NAME] de faire...   
46030                                   Je vais voir ça.   
46031                Votre vidéo ne montre pas comment ?   

                                              Translated  \
0      Thats true but [NAME] is still a valuable reso...   
1      I don't see how a NBC journalist's departure i...   
2      If [NAME] supporters trigger you more than mas...   
3      What if you're someone, like me,

In [51]:
# 'labels' was already dropped in the previous cell, so only the rename remains.
# rename() returns a new frame unless inplace=True is passed.
df.rename(columns = {'labels_text':'Emotion'}, inplace=True)
display(df)

Unnamed: 0,Sentence,Translated,Emotion
0,"C'est vrai, mais [NAME] est toujours une resso...",Thats true but [NAME] is still a valuable reso...,surprise
1,Je ne vois pas comment le départ d'un journali...,I don't see how a NBC journalist's departure i...,anger
2,Si les partisans de [NAME] vous déclenchent pl...,If [NAME] supporters trigger you more than mas...,neutral
3,"Et si vous êtes quelqu'un, comme moi, qui croi...","What if you're someone, like me, who believes ...",surprise
4,"Vous êtes normal, très normal.",You are normal. Very normal.,happiness
...,...,...,...
46027,Ma scène préférée est la tribu qui ramène le m...,Dozens. My favorite scene is the tribe bringin...,happiness
46028,"J'aime [NAME], il est littéralement le canapé ...","I love [NAME], he’s literally the craziest cou...",happiness
46029,C'est affreux. Je suis content [NAME] de faire...,"That's awful. I'm glad [NAME] doing better, bu...",sadness
46030,Je vais voir ça.,I'll check that out. Thank you!!,happiness


In [52]:
df.to_csv("df_gotranslated.csv", index=False)
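Passing `index=False` leaves the row index out of the file, so reading it back yields exactly the three data columns rather than an extra `Unnamed: 0` column. A round-trip sketch using an in-memory buffer in place of a file (the single toy row mirrors the column layout above):

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "Sentence": ["Vous êtes normal, très normal."],
    "Translated": ["You are normal. Very normal."],
    "Emotion": ["happiness"],
})

# Write without the index, then read the same text back.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored.columns.tolist())
```

The comma inside the French sentence survives the round trip because `to_csv` quotes fields that contain the delimiter.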