# Emotion Detection Dataset

## About Dataset

Emotion Detection aims to assign a fine-grained emotion label to each utterance in multiparty dialogue. Our annotation is based on the primary emotions in the Feeling Wheel (Willcox, 1982). We must admit that the inter-annotator agreement for this annotation is not high; we welcome any contribution from the community to improve the annotation quality. This task is part of the Character Mining project led by the Emory NLP research group.

GitHub: https://github.com/emorynlp/emotion-detection

In [1]:
import pandas as pd
import json
import numpy as np

# Load JSON file
with open("emotion-detection-emotion-detection-1.0/json/emotion-detection-trn.json", "r", encoding="utf-8") as f:
    data = json.load(f)

# Convert to DataFrame
dataf = pd.DataFrame(data)

# Expand the 'episodes' column
episodes_list = []

for _, row in dataf.iterrows():
    season_id = row["season_id"]  # Track season ID

    # Check if the episodes column is a dictionary and process accordingly
    if isinstance(row['episodes'], dict):  # If episodes is already a dict
        episodes = [row['episodes']]  # Put it in a list for uniform processing
    elif isinstance(row['episodes'], str):  # If episodes is a string, try parsing it
        try:
            episodes = json.loads(row['episodes'])  # Convert string to dictionary
        except json.JSONDecodeError:
            print(f"Error decoding JSON for row {row.name}")
            continue
    else:
        episodes = []  # Empty list if structure is unexpected

    # Iterate through the episodes
    for episode in episodes:
        # Make sure episode is a dictionary and has the expected keys
        if isinstance(episode, dict) and "episode_id" in episode and "scenes" in episode:
            episode_id = episode["episode_id"]
            for scene in episode["scenes"]:
                # Make sure each scene has the expected structure
                if isinstance(scene, dict) and "scene_id" in scene and "utterances" in scene:
                    scene_id = scene["scene_id"]
                    for utterance in scene["utterances"]:
                        if isinstance(utterance, dict) and "utterance_id" in utterance:
                            episodes_list.append({
                                "season_id": season_id,
                                "episode_id": episode_id,
                                "scene_id": scene_id,
                                "utterance_id": utterance["utterance_id"],
                                "speaker": utterance["speakers"][0] if utterance["speakers"] else None,
                                "transcript": utterance["transcript"],
                                "emotion": utterance["emotion"]
                            })
        else:
            print(f"Unexpected episode structure: {episode}")

# Create a new DataFrame
df = pd.DataFrame(episodes_list)
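
The nested loops above can also be written as a small generator, which keeps the flattening logic separate from the DataFrame construction. This is only a sketch: `iter_utterances` and `toy_season` are hypothetical names, and the toy input merely imitates the fields the cell above reads (`episodes`, `scenes`, `utterances`, `speakers`, `transcript`, `emotion`).

```python
from typing import Any, Dict, Iterator


def iter_utterances(season: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one flat record per utterance from a season-level dict."""
    for episode in season.get("episodes", []):
        for scene in episode.get("scenes", []):
            for utt in scene.get("utterances", []):
                # An utterance may list zero or more speakers; keep the first
                speakers = utt.get("speakers") or []
                yield {
                    "season_id": season.get("season_id"),
                    "episode_id": episode.get("episode_id"),
                    "scene_id": scene.get("scene_id"),
                    "utterance_id": utt.get("utterance_id"),
                    "speaker": speakers[0] if speakers else None,
                    "transcript": utt.get("transcript"),
                    "emotion": utt.get("emotion"),
                }


# Toy input (hypothetical): same shape as the fields used above, not real data
toy_season = {
    "season_id": "trn",
    "episodes": [{
        "episode_id": "e1",
        "scenes": [{
            "scene_id": "e1_c1",
            "utterances": [
                {"utterance_id": "e1_c1_u1", "speakers": ["A"],
                 "transcript": "Hello.", "emotion": "Neutral"},
                {"utterance_id": "e1_c1_u2", "speakers": [],
                 "transcript": "Hi!", "emotion": "Joyful"},
            ],
        }],
    }],
}

rows = list(iter_utterances(toy_season))
for row in rows:
    print(row["utterance_id"], row["speaker"], row["emotion"])
```

The resulting list of dicts can be passed to `pd.DataFrame(rows)` in the same way as `episodes_list`.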

In [2]:
# Copy the dataset in case we need to restart the preprocessing
df_copy = df.copy()

In [3]:
display(df.head())

   season_id  episode_id     scene_id      utterance_id         speaker                                         transcript   emotion
0        trn     s01_e02  s01_e02_c01  s01_e02_c01_u001   Monica Geller  What you guys don't understand is, for us, kis...    Joyful
1        trn     s01_e02  s01_e02_c01  s01_e02_c01_u002  Joey Tribbiani                      Yeah, right!.......Y'serious?   Neutral
2        trn     s01_e02  s01_e02_c01  s01_e02_c01_u003   Phoebe Buffay                                          Oh, yeah!    Joyful
3        trn     s01_e02  s01_e02_c01  s01_e02_c01_u004    Rachel Green  Everything you need to know is in that first k...  Powerful
4        trn     s01_e02  s01_e02_c01  s01_e02_c01_u005   Monica Geller                                        Absolutely.  Powerful


In [4]:
df.columns

Index(['season_id', 'episode_id', 'scene_id', 'utterance_id', 'speaker',
       'transcript', 'emotion'],
      dtype='object')

In [5]:
# Remove the unnecessary columns
columns_to_keep = ["transcript", "emotion"]
df_copy = df_copy[columns_to_keep]

In [6]:
print(df_copy.head())

                                          transcript   emotion
0  What you guys don't understand is, for us, kis...    Joyful
1                      Yeah, right!.......Y'serious?   Neutral
2                                          Oh, yeah!    Joyful
3  Everything you need to know is in that first k...  Powerful
4                                        Absolutely.  Powerful


In [7]:
print(df_copy["emotion"].unique())  # Show all unique labels

['Joyful' 'Neutral' 'Powerful' 'Mad' 'Sad' 'Scared' 'Peaceful']


In [8]:
# Map the dataset's 7 labels ('Joyful', 'Neutral', 'Powerful', 'Mad', 'Sad', 'Scared', 'Peaceful')
# onto our label set ("happiness", "sadness", "anger", "surprise", "fear", "disgust", "neutral").
# Only 5 of our 7 labels are reachable: no source label corresponds to "surprise" or "disgust".
label_map = {
    "Joyful": "happiness",
    "Neutral": "neutral",
    "Powerful": "happiness",
    "Mad": "anger",
    "Sad": "sadness",
    "Scared": "fear",
    "Peaceful": "happiness",
}
# Series.replace with a dict matches whole values, unlike str.replace, which rewrites substrings
df_copy.loc[:, "emotion"] = df_copy["emotion"].replace(label_map)
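
Because the relabelling above is many-to-one, it can help to check two things before relying on it: that every source label has a target, and which target labels can never occur. The sketch below does this with the standard library only; `LABEL_MAP`, `TARGET_LABELS`, and `map_labels` are hypothetical names introduced for illustration, and the mapping itself is copied from the cell above.

```python
from collections import Counter

# Source label -> target label, as used in the cell above
LABEL_MAP = {
    "Joyful": "happiness",
    "Neutral": "neutral",
    "Powerful": "happiness",
    "Mad": "anger",
    "Sad": "sadness",
    "Scared": "fear",
    "Peaceful": "happiness",
}

# The 7-label target scheme named in the cell's comment
TARGET_LABELS = {"happiness", "sadness", "anger", "surprise",
                 "fear", "disgust", "neutral"}


def map_labels(labels):
    """Map source labels to target labels, failing loudly on unknown ones."""
    unknown = sorted(set(labels) - LABEL_MAP.keys())
    if unknown:
        raise ValueError(f"Unmapped labels: {unknown}")
    return [LABEL_MAP[label] for label in labels]


source = ["Joyful", "Neutral", "Powerful", "Mad", "Sad", "Scared", "Peaceful"]
mapped = map_labels(source)

# Three source labels collapse into "happiness"
counts = Counter(mapped)
print(counts)

# Target labels that no source label maps to
unreachable = sorted(TARGET_LABELS - set(LABEL_MAP.values()))
print(unreachable)
```

On a pandas column, `df_copy["emotion"].map(LABEL_MAP)` would apply the same dictionary; unmapped values become `NaN` there instead of raising.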

In [9]:
print(df_copy["emotion"].unique()) 

['happiness' 'neutral' 'anger' 'sadness' 'fear']


In [10]:
display(df_copy)

                                             transcript    emotion
0     What you guys don't understand is, for us, kis...  happiness
1                         Yeah, right!.......Y'serious?    neutral
2                                             Oh, yeah!  happiness
3     Everything you need to know is in that first k...  happiness
4                                           Absolutely.  happiness
...                                                 ...        ...
9929        Ahh, yes, I will have a glass of the Merlot    neutral
9930                                              Okay.    neutral
9931        And uh, he will have a white wine spritzer.    neutral
9932  Okay, good. Thank you. I'll be back shortly, a...  happiness


In [11]:
# Rename both columns to better match our transcribed dataset
df_copy.rename(columns={"emotion": "Emotion", "transcript": "Sentence"}, inplace=True)

In [None]:
from googletrans import Translator
import time

translator = Translator()

def translate_function(text):
    try:
        print(f"Translating: {text}")  # Add this to see if the function is being called
        if not isinstance(text, str) or text.strip() == "":
            print(f"Skipping invalid input: {text}")
            return text
        
        time.sleep(0.5)  # Avoid rate limits
        # Assumes a googletrans version whose translate() is synchronous (newer releases may expose it as a coroutine)
        translated_text = translator.translate(text, dest="fr").text
        
        if not translated_text:
            print(f"Empty translation for: {text}")
            return text  

        return translated_text
    
    except Exception as e:
        print(f"Error translating: {text} - {e}")
        return text

# Apply the translation function to the "Sentence" column
df_copy["Translated"] = df_copy["Sentence"].apply(translate_function)

# df_copy now has a "Translated" column with the French translations

Translating: What you guys don't understand is, for us, kissing is as important as any part of it.
Translating: Yeah, right!.......Y'serious?
Translating: Oh, yeah!
Translating: Everything you need to know is in that first kiss.
Translating: Absolutely.
Translating: Yeah, I think for us, kissing is pretty much like an opening act, y'know? I mean it's like the stand-up comedian you have to sit through before Pink Floyd comes out.
Translating: Yeah, and-and it's not that we don't like the comedian, it's that-that... that's not why we bought the ticket.
Translating: The problem is, though, after the concert's over, no matter how great the show was, you girls are always looking for the comedian again, y'know? I mean, we're in the car, we're fighting traffic... basically just trying to stay awake.
Translating: Yeah, well, word of advice: Bring back the comedian. Otherwise next time you're gonna find yourself sitting at home, listening to that album alone.
Translating: ....Are we still talki
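
Each call above waits 0.5 s and hits the translation service, and short utterances such as "Okay." are likely to occur more than once. One possible way to reduce the number of requests is to translate each distinct sentence only once and reuse the result. The sketch below assumes this is acceptable for the task; `translate_unique` and `fake_translate` are hypothetical names, and `fake_translate` is a stand-in so the example runs without network access.

```python
from typing import Any, Callable, Dict, Iterable, List


def translate_unique(sentences: Iterable[Any],
                     translate_fn: Callable[[str], str]) -> List[Any]:
    """Translate each distinct non-empty string once; pass other values through."""
    cache: Dict[str, str] = {}
    result: List[Any] = []
    for sentence in sentences:
        # Same validity check as translate_function: skip non-strings and blanks
        if not isinstance(sentence, str) or not sentence.strip():
            result.append(sentence)
            continue
        if sentence not in cache:
            cache[sentence] = translate_fn(sentence)
        result.append(cache[sentence])
    return result


# Stand-in translator (hypothetical): records calls instead of contacting a service
calls: List[str] = []


def fake_translate(text: str) -> str:
    calls.append(text)
    return f"[fr] {text}"


sentences = ["Okay.", "Oh, yeah!", "Okay.", "", None]
translated = translate_unique(sentences, fake_translate)
print(translated)
print(len(calls))  # number of translation requests actually made
```

In the notebook, `translate_function` could be passed as `translate_fn`, with the result assigned to `df_copy["Translated"]`.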

In [21]:
df_copy.to_csv('emory_nlp_ds.csv', index=False)