## Prerequisites

In [1]:
import pandas as pd
import numpy as np

## Loading Data Into Pandas
I use pandas to read each tab-separated file into a DataFrame, which makes the data easier to filter and join.

Loading aggression annotation data:

In [4]:
# read_csv already returns a DataFrame, so no empty placeholder is needed
aggression_labels = pd.read_csv('aggression_annotations.tsv', sep='\t')

Unnamed: 0,rev_id,worker_id,aggression,aggression_score
0,37675,1362,1.0,-1.0
1,37675,2408,0.0,1.0
2,37675,1493,0.0,0.0
3,37675,1439,0.0,0.0
4,37675,170,0.0,0.0
...,...,...,...,...
1365212,699897151,628,0.0,0.0
1365213,699897151,15,0.0,0.0
1365214,699897151,57,0.0,0.0
1365215,699897151,1815,0.0,0.0


Loading aggression comment data:

In [5]:
# read_csv already returns a DataFrame, so no empty placeholder is needed
aggression_comments = pd.read_csv('aggression_annotated_comments.tsv', sep='\t')

Unnamed: 0,rev_id,comment,year,logged_in,ns,sample,split
0,37675,`-NEWLINE_TOKENThis is not ``creative``. Thos...,2002,True,article,random,train
1,44816,`NEWLINE_TOKENNEWLINE_TOKEN:: the term ``stand...,2002,True,article,random,train
2,49851,"NEWLINE_TOKENNEWLINE_TOKENTrue or false, the s...",2002,True,article,random,train
3,89320,"Next, maybe you could work on being less cond...",2002,True,article,random,dev
4,93890,This page will need disambiguation.,2002,True,article,random,train
...,...,...,...,...,...,...,...
115859,699848324,`NEWLINE_TOKENNEWLINE_TOKENNEWLINE_TOKENThese ...,2016,True,article,blocked,train
115860,699851288,NEWLINE_TOKENNEWLINE_TOKENThe Institute for Hi...,2016,True,article,blocked,test
115861,699857133,NEWLINE_TOKEN:The way you're trying to describ...,2016,True,article,blocked,train
115862,699891012,NEWLINE_TOKENNEWLINE_TOKEN== Warning ==NEWLINE...,2016,True,user,blocked,dev


Loading aggression worker data:


In [6]:
# read_csv already returns a DataFrame, so no empty placeholder is needed
aggression_workers = pd.read_csv('aggression_worker_demographics.tsv', sep='\t')

Unnamed: 0,worker_id,gender,english_first_language,age_group,education
0,833,female,0,45-60,bachelors
1,1072,male,0,30-45,bachelors
2,872,male,0,18-30,hs
3,2116,male,0,30-45,professional
4,453,male,0,30-45,hs
...,...,...,...,...,...
2185,1442,male,0,18-30,hs
2186,529,female,0,30-45,hs
2187,2036,female,0,18-30,masters
2188,393,female,0,18-30,masters


Loading toxicity annotation data:

In [7]:
# read_csv already returns a DataFrame, so no empty placeholder is needed
toxicity_labels = pd.read_csv('toxicity_annotations.tsv', sep='\t')

Unnamed: 0,rev_id,worker_id,toxicity,toxicity_score
0,2232.0,723,0,0.0
1,2232.0,4000,0,0.0
2,2232.0,3989,0,1.0
3,2232.0,3341,0,0.0
4,2232.0,1574,0,1.0
...,...,...,...,...
1598284,699897151.0,1550,0,0.0
1598285,699897151.0,1025,0,1.0
1598286,699897151.0,648,0,1.0
1598287,699897151.0,379,0,0.0


Loading toxicity comment data:

In [8]:
# read_csv already returns a DataFrame, so no empty placeholder is needed
toxicity_comments = pd.read_csv('toxicity_annotated_comments.tsv', sep='\t')

Unnamed: 0,rev_id,comment,year,logged_in,ns,sample,split
0,2232.0,This:NEWLINE_TOKEN:One can make an analogy in ...,2002,True,article,random,train
1,4216.0,`NEWLINE_TOKENNEWLINE_TOKEN:Clarification for ...,2002,True,user,random,train
2,8953.0,Elected or Electoral? JHK,2002,False,article,random,test
3,26547.0,`This is such a fun entry. DevotchkaNEWLINE_...,2002,True,article,random,train
4,28959.0,Please relate the ozone hole to increases in c...,2002,True,article,random,test
...,...,...,...,...,...,...,...
159681,699848324.0,`NEWLINE_TOKENNEWLINE_TOKENNEWLINE_TOKENThese ...,2016,True,article,blocked,train
159682,699851288.0,NEWLINE_TOKENNEWLINE_TOKENThe Institute for Hi...,2016,True,article,blocked,test
159683,699857133.0,NEWLINE_TOKEN:The way you're trying to describ...,2016,True,article,blocked,dev
159684,699891012.0,NEWLINE_TOKENNEWLINE_TOKEN== Warning ==NEWLINE...,2016,True,user,blocked,train


Loading toxicity worker data:

In [9]:
# read_csv already returns a DataFrame, so no empty placeholder is needed
toxicity_workers = pd.read_csv('toxicity_worker_demographics.tsv', sep='\t')

Unnamed: 0,worker_id,gender,english_first_language,age_group,education
0,85,female,0,18-30,bachelors
1,1617,female,0,45-60,bachelors
2,1394,female,0,,bachelors
3,311,male,0,30-45,bachelors
4,1980,male,0,45-60,masters
...,...,...,...,...,...
3586,3189,female,0,18-30,bachelors
3587,1105,female,0,18-30,bachelors
3588,2192,female,1,Under 18,hs
3589,2692,female,0,30-45,hs
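
The toxicity tables above display `rev_id` as a float (for example `2232.0`), while the aggression tables display it as an integer. Joins within the toxicity data still work, but an integer key is easier to read and to compare across the two data sets. The sketch below shows one way to normalise the column. `normalize_rev_id` is a hypothetical helper, not part of the original notebook, and the toy frame only stands in for `toxicity_labels`; why pandas parsed the column as a float is not something the output above reveals.

```python
import pandas as pd


def normalize_rev_id(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df whose rev_id column has an integer dtype.

    Hypothetical helper for illustration only.
    """
    out = df.copy()
    # Fail loudly rather than silently truncating a fractional id
    if not (out['rev_id'].dropna() % 1 == 0).all():
        raise ValueError('rev_id contains non-integer values')
    # The nullable Int64 dtype keeps any missing ids as <NA>
    out['rev_id'] = out['rev_id'].astype('Int64')
    return out


# Toy frame standing in for toxicity_labels
demo = pd.DataFrame({'rev_id': [2232.0, 2232.0, 699897151.0],
                     'toxicity': [0, 0, 1]})
demo_fixed = normalize_rev_id(demo)
print(demo_fixed.dtypes)
```

On the real data the same call would be `toxicity_labels = normalize_rev_id(toxicity_labels)`, and likewise for `toxicity_comments`.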


## Analysis
For both data sets loaded above, I investigate whether different demographic groups find different words toxic or aggressive. This gives some insight into how much perception of the same content can differ, and therefore into which words (if any) introduce bias into the labelling. If the labels for particular words differ significantly across some demographic breakdown, then a model trained on this data would probably perform poorly on comments containing those words.

First, I must select the annotations that indicate toxicity or aggression.

In [17]:
toxicity_labels_bad = toxicity_labels.loc[toxicity_labels['toxicity'] == 1]

aggression_labels_bad = aggression_labels.loc[aggression_labels['aggression'] == 1]

Next, I isolate the demographic data I will be focusing on: gender.

In [19]:
tox_worker_gender = toxicity_workers[['worker_id', 'gender']]

agg_worker_gender = aggression_workers[['worker_id', 'gender']]

Now I combine the gender data with the annotation data.

In [21]:
tox_labels_gender = toxicity_labels_bad.merge(tox_worker_gender, on='worker_id')

agg_labels_gender = aggression_labels_bad.merge(agg_worker_gender, on='worker_id')
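
`DataFrame.merge` defaults to an inner join, so any annotation whose `worker_id` has no row in the demographics table is dropped without warning. Whether that happens in this data set is not visible from the cells above. The following is a minimal sketch, on toy frames standing in for `toxicity_labels_bad` and `tox_worker_gender`, of how the `indicator` parameter can be used to count such rows before discarding them.

```python
import pandas as pd

# Toy stand-ins for toxicity_labels_bad and tox_worker_gender
labels = pd.DataFrame({'rev_id': [1, 1, 2],
                       'worker_id': [10, 11, 12],
                       'toxicity': [1, 1, 1]})
workers = pd.DataFrame({'worker_id': [10, 11],
                        'gender': ['female', 'male']})

# A left merge keeps every annotation; indicator=True adds a `_merge`
# column recording whether a matching worker row was found
checked = labels.merge(workers, on='worker_id', how='left', indicator=True)

# Annotations whose worker has no demographic record
unmatched = checked[checked['_merge'] == 'left_only']
print(len(unmatched), 'of', len(checked), 'annotations have no gender on record')

# Keeping only the matched rows reproduces the default inner merge
matched = checked[checked['_merge'] == 'both'].drop(columns='_merge')
```

If the unmatched count is large, the gender comparison is based on a noticeably smaller sample than the full annotation table.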

I also isolate the comment data and combine it with the above.

In [22]:
tox_comments = toxicity_comments[['rev_id', 'comment']]

agg_comments = aggression_comments[['rev_id', 'comment']]

In [23]:
tox = tox_labels_gender.merge(tox_comments, on='rev_id', how='inner')

agg = agg_labels_gender.merge(agg_comments, on='rev_id', how='inner')
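
This join assumes that each `rev_id` appears only once in the comment table. If a `rev_id` were duplicated there, every matching annotation would be repeated and the word lists built later would be inflated. The sketch below, again on toy frames, uses the `validate` parameter of `merge` to turn that assumption into a check. The frame names are illustrative only.

```python
import pandas as pd

# Toy stand-ins for tox_labels_gender and tox_comments
labels = pd.DataFrame({'rev_id': [1, 1, 2],
                       'worker_id': [10, 11, 12],
                       'gender': ['female', 'male', 'male']})
comments = pd.DataFrame({'rev_id': [1, 2],
                         'comment': ['first comment', 'second comment']})

# validate='many_to_one' raises pandas.errors.MergeError if rev_id
# is not unique in the right-hand (comment) table
merged = labels.merge(comments, on='rev_id', how='inner',
                      validate='many_to_one')

# Demonstrate the failure case with a deliberately duplicated comment row
dup_comments = pd.concat([comments, comments.iloc[[0]]], ignore_index=True)
try:
    labels.merge(dup_comments, on='rev_id', validate='many_to_one')
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
```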

Now I build, for each gender, a single list of all the words in the comments that annotators of that gender labelled toxic.

In [75]:
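# Annotations made by male workers; chaining reset_index avoids
# modifying a slice of `tox` in place (SettingWithCopyWarning)
male_tox = tox.loc[tox['gender'] == 'male'].reset_index(drop=True)

# Split each comment on whitespace and collect the tokens in one flat list
male_tox_words = []
for comment in male_tox['comment']:
    male_tox_words.extend(comment.split())

male_tox_words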


['This:NEWLINE_TOKEN:One',
 'can',
 'make',
 'an',
 'analogy',
 'in',
 'mathematical',
 'terms',
 'by',
 'envisioning',
 'the',
 'distribution',
 'of',
 'opinions',
 'in',
 'a',
 'population',
 'as',
 'a',
 'Gaussian',
 'curve.',
 'We',
 'would',
 'then',
 'say',
 'that',
 'the',
 'consensus',
 'would',
 'be',
 'a',
 'statement',
 'that',
 'represents',
 'the',
 'range',
 'of',
 'opinions',
 'within',
 'perhaps',
 'three',
 'standard',
 'deviations',
 'of',
 'the',
 'mean',
 'opinion.',
 'NEWLINE_TOKENsounds',
 'arbitrary',
 'and',
 'ad',
 'hoc.',
 'Does',
 'it',
 'really',
 'belong',
 'in',
 'n',
 'encyclopedia',
 'article?',
 'I',
 "don't",
 'see',
 'that',
 'it',
 'adds',
 'anything',
 'useful.NEWLINE_TOKENNEWLINE_TOKENThe',
 'paragraph',
 'that',
 'follows',
 'seems',
 'much',
 'more',
 'useful.',
 'Are',
 'there',
 'any',
 'political',
 'theorists',
 'out',
 'there',
 'who',
 'can',
 'clarify',
 'the',
 'issues?',
 'It',
 'seems',
 'to',
 'me',
 'that',
 'this',
 'is',
 'an',
 'is
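
The output above shows that splitting on whitespace alone leaves the `NEWLINE_TOKEN` placeholder and punctuation attached to words (`'This:NEWLINE_TOKEN:One'`, `'curve.'`), so the same word can appear under several spellings. Below is a minimal sketch of a stricter tokenizer. `tokenize` is a hypothetical helper, and lowercasing and stripping punctuation are assumptions about what counts as "the same word", not choices made in the original notebook.

```python
import re
from collections import Counter
from typing import List

NEWLINE_MARKER = 'NEWLINE_TOKEN'  # placeholder visible in the comment text


def tokenize(comment: str) -> List[str]:
    """Hypothetical helper: drop the newline placeholder, lowercase,
    and keep only word-like tokens."""
    text = comment.replace(NEWLINE_MARKER, ' ').lower()
    # Runs of letters, optionally with one internal apostrophe ("don't")
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text)


sample = ("This:NEWLINE_TOKEN:One can make an analogy. I don't see that it "
          "adds anything useful.NEWLINE_TOKENNEWLINE_TOKENThe paragraph")
tokens = tokenize(sample)

# Counter gives per-word frequencies that can be compared across groups
counts = Counter(tokens)
print(tokens)
```

On the real data this could be applied with `male_tox['comment'].map(tokenize)`. Note also that the merged table has one row per annotation, so a comment flagged by several annotators of the same gender contributes its words once per annotation. Calling `drop_duplicates(subset='rev_id')` first would count each comment once, if that is the intended weighting.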

In [76]:
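# Annotations made by female workers; chaining reset_index avoids
# modifying a slice of `tox` in place (SettingWithCopyWarning)
fem_tox = tox.loc[tox['gender'] == 'female'].reset_index(drop=True)

# Split each comment on whitespace and collect the tokens in one flat list
fem_tox_words = []
for comment in fem_tox['comment']:
    fem_tox_words.extend(comment.split())

fem_tox_words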

['NEWLINE_TOKENNEWLINE_TOKEN==',
 'I',
 "didn't",
 'vandalise',
 'it',
 'you',
 'cockface!',
 '==NEWLINE_TOKENNEWLINE_TOKENI',
 'just',
 'told',
 'you',
 'shut',
 'up',
 'and',
 'sit',
 'down.',
 "NEWLINE_TOKENNEWLINE_TOKENI'LL",
 'KILL',
 'YOU',
 'ALL!!!!',
 "NEWLINE_TOKENNEWLINE_TOKENI'LL",
 'KILL',
 'YOU',
 'ALL!!!!',
 "NEWLINE_TOKENNEWLINE_TOKENI'LL",
 'KILL',
 'YOU',
 'ALL!!!!',
 '`',
 '(and',
 'lest',
 'we',
 'forget,',
 'this',
 'is',
 'the',
 'same',
 'publication',
 'that',
 'ran',
 'with',
 'the',
 'outrageously',
 'false',
 '``story``',
 'that',
 'Governor',
 'Rick',
 'Perry',
 'was',
 'gay)`',
 'Islamophobia',
 '==NEWLINE_TOKENNEWLINE_TOKEN:That',
 'may',
 'be',
 'so,',
 'but',
 'the',
 'to',
 'exclude',
 'another',
 "party's",
 'point',
 'of',
 'view',
 'even',
 'if',
 'that',
 'point',
 'of',
 'view',
 'is',
 'seen',
 'as',
 'abhorrent',
 'is',
 'against',
 'the',
 'NPOV',
 'policy,',
 'which,',
 'frankly,',
 'reduces',
 'the',
 "article's",
 'utility',
 'and',
 '(in',
 '