# Aggregating EmoBank

This notebook illustrates how the EmoBank data was aggregated from the raw crowdsourcing data. The original aggregation script from 2016 was written in R; this notebook is a cleaner and simpler Python version.

## Imports and references to local paths

In [1]:
import pandas as pd
from pathlib import Path

In [2]:
path_individual_reader = Path("/Users/sven/julie/research/__ARCHIVE__/EACL-2017-EmoBank/corpus/results-crowdsourcing/final_evoke/ratings_evoke.csv")
path_individual_writer = Path("/Users/sven/julie/research/__ARCHIVE__/EACL-2017-EmoBank/corpus/results-crowdsourcing/final_express/ratings_express.csv")
path_aggregated_reader = Path("/Users/sven/workspace/distribute/EmoBank/corpus/reader.csv")
path_aggregated_writer = Path("/Users/sven/workspace/distribute/EmoBank/corpus/writer.csv")

In [3]:
df_individual_reader = pd.read_csv(path_individual_reader)
df_individual_writer = pd.read_csv(path_individual_writer)
df_aggregated_reader = pd.read_csv(path_aggregated_reader, index_col=0)
df_aggregated_writer = pd.read_csv(path_aggregated_writer, index_col=0)

## Basic data inspection

`df_individual_reader` and `df_individual_writer` are the raw crowdsourcing data provided by CrowdFlower (today known as Figure Eight). These dataframes contain personal data of the participants, so they are neither shown here nor contained in the repository.

In [4]:
print(df_individual_reader.shape)
print(df_individual_reader.columns)

(53055, 26)
Index(['_unit_id', '_created_at', '_golden', '_id', '_missed', '_started_at',
       '_tainted', '_channel', '_trust', '_worker_id', '_country', '_region',
       '_city', '_ip', 'arousal', 'control', 'pleasure', 'orig__golden',
       'arousal_gold', 'arousal_gold_reason', 'control_gold',
       'control_gold_reason', 'id', 'pleasure_gold', 'pleasure_gold_reason',
       'sentence'],
      dtype='object')


In [5]:
print(df_individual_writer.shape)
print(df_individual_writer.columns)

(53118, 26)
Index(['_unit_id', '_created_at', '_golden', '_id', '_missed', '_started_at',
       '_tainted', '_channel', '_trust', '_worker_id', '_country', '_region',
       '_city', '_ip', 'arousal', 'control', 'pleasure', 'orig__golden',
       'arousal_gold', 'arousal_gold_reason', 'control_gold',
       'control_gold_reason', 'id', 'pleasure_gold', 'pleasure_gold_reason',
       'sentence'],
      dtype='object')


`df_aggregated_reader` and `df_aggregated_writer` contain the originally published EmoBank data. 

In [6]:
df_aggregated_reader

id,V,A,D,stdV,stdA,stdD,N
110CYL068_1036_1079,3.00,3.20,3.00,0.00,0.40,0.00,5
110CYL068_1079_1110,2.60,3.00,2.60,0.49,0.63,0.49,5
110CYL068_1110_1127,2.00,2.33,2.33,1.41,0.47,0.47,3
110CYL068_1127_1130,3.00,3.00,3.00,0.00,0.00,0.00,2
110CYL068_1137_1188,3.60,3.00,3.40,0.80,0.63,0.49,5
...,...,...,...,...,...,...,...
wwf12_4531_4624,3.33,3.67,2.33,0.94,0.47,0.47,3
wwf12_501_591,3.50,3.00,3.50,0.50,0.00,0.50,2
wwf12_592_691,3.00,3.00,3.00,0.00,0.00,0.00,5
wwf12_702_921,3.40,3.40,3.40,0.49,0.49,0.49,5


In [7]:
df_aggregated_writer

id,V,A,D,stdV,stdA,stdD,N
110CYL068_1036_1079,3.00,2.80,3.40,0.00,0.98,0.49,5
110CYL068_1079_1110,3.00,3.20,3.00,0.00,0.40,0.00,5
110CYL068_1127_1130,3.00,3.00,3.00,0.00,0.00,0.00,5
110CYL068_1137_1188,3.25,3.00,3.00,0.43,0.71,0.00,4
110CYL068_1189_1328,3.40,3.40,3.20,0.49,0.49,0.40,5
...,...,...,...,...,...,...,...
wwf12_4531_4624,2.80,3.40,3.40,0.40,0.80,0.49,5
wwf12_501_591,4.00,3.67,3.67,0.82,0.47,0.47,3
wwf12_592_691,3.00,3.00,3.20,0.00,0.00,0.40,5
wwf12_702_921,3.25,3.50,3.50,0.43,0.50,0.50,4


The code below shows how these dataframes were built from the raw crowdsourcing data.

## Remove personal data, only keeping the essentials

In [8]:
to_keep = ["id", "pleasure", "arousal", "control"]
to_rename = {"pleasure": "V", "arousal": "A", "control":"D"}

df_individual_reader = df_individual_reader[to_keep].rename(columns=to_rename)
df_individual_writer = df_individual_writer[to_keep].rename(columns=to_rename)

df_individual_reader.to_csv("individual_reader_ratings.csv", index=False)
df_individual_writer.to_csv("individual_writer_ratings.csv", index=False)
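As a minimal sketch of the column filtering above, the following applies the same `to_keep` and `to_rename` to a single made-up row. The frame `raw` and all of its values are hypothetical; only the column names are taken from the raw export shown earlier.

```python
import pandas as pd

# Hypothetical one-row stand-in for the raw export (values are made up).
raw = pd.DataFrame({
    "id": ["s1"],
    "_worker_id": [123],
    "_ip": ["0.0.0.0"],
    "pleasure": [3],
    "arousal": [4],
    "control": [3],
})

to_keep = ["id", "pleasure", "arousal", "control"]
to_rename = {"pleasure": "V", "arousal": "A", "control": "D"}

# Select the essential columns, then rename them to V, A, D.
slim = raw[to_keep].rename(columns=to_rename)
print(list(slim.columns))
```

Selecting by an explicit allow-list, rather than dropping known personal columns, means any column not named in `to_keep` is left out of the export.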


## Reload and inspect dataframes

In [9]:
df_individual_reader = pd.read_csv("individual_reader_ratings.csv")
df_individual_writer = pd.read_csv("individual_writer_ratings.csv")

In [10]:
df_individual_reader

,id,V,A,D
0,Acephalous-Cant-believe_4_47,3,3,3
1,Acephalous-Cant-believe_4_47,3,3,3
2,Acephalous-Cant-believe_4_47,3,5,3
3,Acephalous-Cant-believe_4_47,3,3,4
4,Acephalous-Cant-believe_4_47,3,3,3
...,...,...,...,...
53050,SemEval_1499,3,3,3
53051,SemEval_1499,3,3,3
53052,SemEval_1499,3,3,3
53053,SemEval_1499,2,4,2


In [11]:
df_individual_writer

,id,V,A,D
0,Acephalous-Cant-believe_4_47,3,4,3
1,Acephalous-Cant-believe_4_47,4,3,3
2,Acephalous-Cant-believe_4_47,3,4,3
3,Acephalous-Cant-believe_4_47,3,3,3
4,Acephalous-Cant-believe_4_47,3,3,3
...,...,...,...,...
53113,SemEval_1499,3,3,3
53114,SemEval_1499,3,3,3
53115,SemEval_1499,3,3,3
53116,SemEval_1499,3,3,4


## Recreating the published data from the filtered individual ratings

We verify that `df_individual_reader` and `df_individual_writer` are identical to the data EmoBank was built from by replicating the original aggregation process.

In [12]:
def build_corpus(df):
    # Discard ratings where all three dimensions were rated 1.
    df = df.query("not (V == 1 and A == 1 and D == 1)")
    df_mean = df.groupby("id").mean()
    # Population standard deviation (ddof=0) per id.
    df_sd = df.groupby("id").std(ddof=0).rename(columns={"V": "stdV", "A": "stdA", "D": "stdD"})
    # Number of ratings per id. The name is set explicitly because the default name of the
    # value_counts() result differs between pandas versions ("id" before 2.0, "count" after).
    df_n = df.id.value_counts().rename("N")
    df = pd.concat([df_mean, df_sd, df_n], axis=1, sort=True)
    # Keep only ids with more than one remaining rating.
    df = df[df.N > 1]
    df.index.rename("id", inplace=True)
    df = df.round(2)
    return df
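
To make the aggregation steps concrete, here is a small self-contained sketch on made-up ratings. The frame `toy` and the helper `aggregate` are hypothetical and only mirror the logic of `build_corpus`; they are not part of the original script.

```python
import pandas as pd

# Hypothetical toy ratings, not taken from EmoBank.
toy = pd.DataFrame({
    "id": ["s1", "s1", "s1", "s2", "s2", "s3"],
    "V":  [3, 4, 1, 2, 4, 3],
    "A":  [3, 3, 1, 3, 3, 3],
    "D":  [3, 3, 1, 3, 3, 3],
})

def aggregate(df):
    # Drop ratings where all three dimensions were rated 1.
    df = df.query("not (V == 1 and A == 1 and D == 1)")
    grouped = df.groupby("id")[["V", "A", "D"]]
    mean = grouped.mean()
    sd = grouped.std(ddof=0).add_prefix("std")  # population SD
    n = grouped.size().rename("N")              # ratings per id
    out = pd.concat([mean, sd, n], axis=1)
    # Keep only ids with more than one remaining rating.
    return out[out.N > 1].round(2)

toy_agg = aggregate(toy)
print(toy_agg)
```

In this toy data, `s1` loses its all-ones rating and keeps two ratings, while `s3` has a single rating and is dropped. With `ddof=0`, the two V ratings 2 and 4 of `s2` give a standard deviation of 1.0; the sample standard deviation (`ddof=1`) would be about 1.41.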

In [13]:
newly_aggregated_reader = build_corpus(df_individual_reader)
newly_aggregated_reader

id,V,A,D,stdV,stdA,stdD,N
110CYL068_1036_1079,3.00,3.20,3.00,0.00,0.40,0.00,5
110CYL068_1079_1110,2.60,3.00,2.60,0.49,0.63,0.49,5
110CYL068_1110_1127,2.00,2.33,2.33,1.41,0.47,0.47,3
110CYL068_1127_1130,3.00,3.00,3.00,0.00,0.00,0.00,2
110CYL068_1137_1188,3.60,3.00,3.40,0.80,0.63,0.49,5
...,...,...,...,...,...,...,...
wwf12_4531_4624,3.33,3.67,2.33,0.94,0.47,0.47,3
wwf12_501_591,3.50,3.00,3.50,0.50,0.00,0.50,2
wwf12_592_691,3.00,3.00,3.00,0.00,0.00,0.00,5
wwf12_702_921,3.40,3.40,3.40,0.49,0.49,0.49,5


In [14]:
newly_aggregated_writer = build_corpus(df_individual_writer)
newly_aggregated_writer

id,V,A,D,stdV,stdA,stdD,N
110CYL068_1036_1079,3.00,2.80,3.40,0.00,0.98,0.49,5
110CYL068_1079_1110,3.00,3.20,3.00,0.00,0.40,0.00,5
110CYL068_1127_1130,3.00,3.00,3.00,0.00,0.00,0.00,5
110CYL068_1137_1188,3.25,3.00,3.00,0.43,0.71,0.00,4
110CYL068_1189_1328,3.40,3.40,3.20,0.49,0.49,0.40,5
...,...,...,...,...,...,...,...
wwf12_4531_4624,2.80,3.40,3.40,0.40,0.80,0.49,5
wwf12_501_591,4.00,3.67,3.67,0.82,0.47,0.47,3
wwf12_592_691,3.00,3.00,3.20,0.00,0.00,0.40,5
wwf12_702_921,3.25,3.50,3.50,0.43,0.50,0.50,4


The original aggregation script was written in R and produced a different ordering of the instances, so a direct comparison with `equals` fails:

In [15]:
df_aggregated_reader.equals(newly_aggregated_reader)

False

In [16]:
df_aggregated_writer.equals(newly_aggregated_writer)

False

Simply reordering the newly aggregated dataframes according to the index of the published data solves the problem.

In [17]:
newly_aggregated_reader = newly_aggregated_reader.loc[df_aggregated_reader.index]
df_aggregated_reader.equals(newly_aggregated_reader)

True

In [18]:
newly_aggregated_writer = newly_aggregated_writer.loc[df_aggregated_writer.index]
df_aggregated_writer.equals(newly_aggregated_writer)

True

The newly aggregated dataframes are indeed identical to the published ones.
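
The role of row order in this comparison can be reproduced on a toy frame. The sketch below uses hypothetical stand-ins (`published`, `rebuilt`) rather than the real data. It also shows `pd.testing.assert_frame_equal` with `check_like=True`, which ignores the order of index and columns, as one possible alternative to reordering by hand.

```python
import pandas as pd

# Hypothetical stand-ins for the published and the rebuilt frames.
published = pd.DataFrame(
    {"V": [3.0, 2.6], "N": [5, 5]},
    index=pd.Index(["x", "y"], name="id"),
)
rebuilt = published.loc[["y", "x"]]  # same rows, different order

# DataFrame.equals is sensitive to row order.
same_as_is = published.equals(rebuilt)
same_after_reorder = published.equals(rebuilt.loc[published.index])
print(same_as_is, same_after_reorder)

# Raises an AssertionError describing the mismatch if the frames differ;
# check_like=True ignores the order of index and columns.
pd.testing.assert_frame_equal(published, rebuilt, check_like=True)
```

Unlike `equals`, which only returns a boolean, `assert_frame_equal` reports where two frames differ, which can help when a replication does not match.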

--- 