# Description

We downloaded two datasets from two different sources:

1. Tweets dataset containing emotions downloaded from: http://saifmohammad.com/WebPages/EmotionIntensity-SharedTask.html
2. Emotion classification data downloaded from: https://www.kaggle.com/eray1yildiz/emotion-classification

In [56]:
import pandas as pd
import os

# Preparing the Twitter dataset

Each text in the dataset has an emotion label, and each label comes with an intensity (a "confidence" score between 0 and 1). We keep only the examples whose intensity is greater than 0.5.

In [57]:
DATA_DIR = 'data/raw'

files = os.listdir(DATA_DIR)
print(files)
train_files = []
test_files = []

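# Files with 'test' in the name form the test set; everything else
# (the train and dev files) is used for training.
for file in files:
    if 'test' in file:
        test_files.append(file)
    else:
        train_files.append(file)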

['anger-ratings-0to1.dev.gold.txt', 'sadness-ratings-0to1.train.txt', 'fear-ratings-0to1.test.gold.txt', 'fear-ratings-0to1.dev.gold.txt', 'joy-ratings-0to1.train.txt', 'anger-ratings-0to1.train.txt', 'joy-ratings-0to1.dev.gold.txt', 'sadness-ratings-0to1.test.gold.txt', 'sadness-ratings-0to1.dev.gold.txt', 'joy-ratings-0to1.test.gold.txt', 'fear-ratings-0to1.train.txt', 'anger-ratings-0to1.test.gold.txt']
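
The split above treats every file without 'test' in its name as training data. Because `train.csv` and `test.csv` are later written into the same `data/raw` directory, a rerun of the cell would pick them up as well. A minimal sketch of a stricter grouping follows; the helper name `split_rating_files` and the name-matching rule are assumptions for illustration, not part of the original notebook.

```python
def split_rating_files(filenames):
    """Group rating files into train and test lists, skipping everything else."""
    train, test = [], []
    for name in sorted(filenames):
        # Assumed naming pattern, taken from the listing above:
        # <emotion>-ratings-0to1.<split>[.gold].txt
        if "-ratings-0to1." not in name or not name.endswith(".txt"):
            continue  # e.g. train.csv or emotion.data
        if ".test." in name:
            test.append(name)
        else:
            train.append(name)  # train and dev files
    return train, test


names = [
    "anger-ratings-0to1.dev.gold.txt",
    "fear-ratings-0to1.test.gold.txt",
    "joy-ratings-0to1.train.txt",
    "train.csv",
    "emotion.data",
]
train_names, test_names = split_rating_files(names)
print(train_names, test_names)
```

Sorting the names also makes the order of the resulting rows independent of the order returned by `os.listdir`.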


In [58]:
def get_dataframe(files):
    """Build a (text, label) DataFrame, keeping rows with intensity > 0.5."""
    raw_text = []
    labels = []
    for file in files:
        # A context manager closes each file once it has been read
        with open(os.path.join(DATA_DIR, file)) as f:
            lines = [line.split() for line in f]
        for tokens in lines:
            # Skip the first two whitespace-separated tokens
            # (the id and the token that follows it)
            tokens = tokens[2:]
            # The last two tokens are the emotion label and its intensity
            label = tokens[-2]
            confidence = float(tokens[-1])
            if confidence > 0.5:
                raw_text.append(' '.join(tokens[:-2]))
                labels.append(label)

    df = pd.DataFrame({'text': raw_text, 'label': labels})
    return df
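
As an alternative sketch of the same intensity filter, the snippet below parses each line by tab instead of by whitespace. It assumes every line has exactly four tab-separated fields (id, text, label, intensity); that layout is an assumption about the rating files, and the function name `parse_ratings` and the sample rows are hypothetical. Unlike the cell above, this version keeps the full text field.

```python
import csv
import io


def parse_ratings(handle, threshold=0.5):
    """Return (text, label) pairs whose intensity is strictly above threshold."""
    rows = []
    # QUOTE_NONE keeps quote characters inside tweets as plain text
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for record in reader:
        if len(record) != 4:
            continue  # skip blank or malformed lines
        _id, text, label, intensity = record
        if float(intensity) > threshold:
            rows.append((text, label))
    return rows


# Hypothetical sample rows, invented for illustration only
sample = io.StringIO(
    "10000\tso angry right now\tanger\t0.812\n"
    "10001\tslightly annoyed\tanger\t0.250\n"
)
pairs = parse_ratings(sample)
print(pairs)
```

The returned pairs can be passed to `pd.DataFrame(pairs, columns=['text', 'label'])` to get the same two-column frame.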

In [59]:
train_df = get_dataframe(train_files)
test_df = get_dataframe(test_files)

In [60]:
train_df.shape

(1790, 2)

In [61]:
train_df.head()

Unnamed: 0,text,label
0,that Rutgers game was an abomination. An affro...,anger
1,I get mad over something so minuscule I try to...,anger
2,I get mad over something so minuscule I try to...,anger
3,eyes have been dilated. I hate the world right...,anger
4,One chosen by the CLP members! MP seats are no...,anger


In [62]:
test_df.shape

(1505, 2)

In [63]:
test_df.head()

Unnamed: 0,text,label
0,#afraid of the #quiet ones they are the ones w...,fear
1,he's a horrible person and now i gag when i se...,fear
2,pedicure is supposed to be nice but honestly I...,fear
3,you need to band together not apart #nevertrum...,fear
4,you need to band together not apart #nevertrum...,fear


In [65]:
# index=False leaves the row index out of the CSV files
train_df.to_csv('data/raw/train.csv', index=False)
test_df.to_csv('data/raw/test.csv', index=False)

# Preparing the Kaggle dataset

In [66]:
data = pd.read_csv('data/raw/emotion.data')

In [67]:
data.shape

(416809, 3)

In [68]:
data.head()

Unnamed: 0.1,Unnamed: 0,text,emotions
0,27383,i feel awful about it too because it s my job ...,sadness
1,110083,im alone i feel awful,sadness
2,140764,ive probably mentioned this before but i reall...,joy
3,100071,i was feeling a little low few days back,sadness
4,2837,i beleive that i am much more sensitive to oth...,love


In [69]:
# Drop the leftover index column that was saved with the original file
data = data.drop(columns=['Unnamed: 0'])

In [70]:
data.head()

Unnamed: 0,text,emotions
0,i feel awful about it too because it s my job ...,sadness
1,im alone i feel awful,sadness
2,ive probably mentioned this before but i reall...,joy
3,i was feeling a little low few days back,sadness
4,i beleive that i am much more sensitive to oth...,love


In [71]:
data.to_csv('data/raw/kaggle_emotion_data.csv', index=False)