## Step 0: Gathering the data
Get the personal attacks and toxicity datasets for the annotated comments and the annotations. Additionally, get the demographics data for the personal attack data.

Define the URLs for each dataset and use the urlretrieve function from urllib.request to download the data.

In [23]:
from urllib.request import urlretrieve

#data urls
ATTACK_ANNOTATED_COMMENTS_URL = 'https://ndownloader.figshare.com/files/7554634' 
ATTACK_ANNOTATIONS_URL = 'https://ndownloader.figshare.com/files/7554637'
ATTACK_DEMOGRAPHICS_URL = 'https://ndownloader.figshare.com/files/7640752'
TOXICITY_ANNOTATED_COMMENTS_URL = 'https://ndownloader.figshare.com/files/7394542' 
TOXICITY_ANNOTATIONS_URL = 'https://ndownloader.figshare.com/files/7394539'

#download files given the url and file name
urlretrieve(ATTACK_ANNOTATED_COMMENTS_URL, 'attack_annotated_comments.tsv')
urlretrieve(ATTACK_ANNOTATIONS_URL, 'attack_annotations.tsv')   
urlretrieve(ATTACK_DEMOGRAPHICS_URL, 'attack_worker_demographics.tsv')   
urlretrieve(TOXICITY_ANNOTATED_COMMENTS_URL, 'toxicity_annotated_comments.tsv')
urlretrieve(TOXICITY_ANNOTATIONS_URL, 'toxicity_annotations.tsv') 

('toxicity_annotations.tsv', <http.client.HTTPMessage at 0x26534ab5f60>)
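Re-running the download cell fetches all five files again. A minimal sketch of a guard that skips files already on disk is shown below; `download_if_missing` and its `retrieve` parameter are hypothetical names introduced here, not part of the original notebook.

```python
from pathlib import Path
from urllib.request import urlretrieve


def download_if_missing(url, filename, retrieve=urlretrieve):
    """Download url to filename unless a non-empty file already exists.

    Hypothetical helper for illustration. `retrieve` defaults to
    urllib.request.urlretrieve and can be swapped out for testing.
    Returns True if a download happened, False if it was skipped.
    """
    path = Path(filename)
    if path.exists() and path.stat().st_size > 0:
        return False  # reuse the local copy
    retrieve(url, str(path))
    return True
```

It could be called once per dataset, for example `download_if_missing(ATTACK_ANNOTATIONS_URL, 'attack_annotations.tsv')`.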

Read the downloaded data files into dataframes for analysis and view the first 5 rows of the toxicity annotations dataframe.

In [45]:
import pandas as pd

attack_comments = pd.read_csv('attack_annotated_comments.tsv', sep = '\t')
attack_annotations = pd.read_csv('attack_annotations.tsv',  sep = '\t')
attack_demographics = pd.read_csv('attack_worker_demographics.tsv',  sep = '\t')
toxicity_comments = pd.read_csv('toxicity_annotated_comments.tsv', sep = '\t')
toxicity_annotations = pd.read_csv('toxicity_annotations.tsv',  sep = '\t')

toxicity_annotations.head()

Unnamed: 0,rev_id,worker_id,toxicity,toxicity_score
0,2232.0,723,0,0.0
1,2232.0,4000,0,0.0
2,2232.0,3989,0,1.0
3,2232.0,3341,0,0.0
4,2232.0,1574,0,1.0


## Step 1: Select and perform analysis
There are two analyses performed in this section.

The first analyzes the personal attacks dataset on its own. A worker demographics dataset is provided for it, which makes it possible to see how different demographics could bias the labels. The research questions for this analysis are:
- Are workers who share demographics more likely to label the data more similarly than workers who are in different demographic groups?
- If so, how consistent is the labelling across different demographic groups?

The second is a comparative analysis across the toxicity dataset and the personal attacks dataset. Some types of hostile comments may draw more or less agreement among workers than others, which could affect the reliability of models trained on these datasets. The research question is:
- Is there a significant difference between agreement in how workers label comments as personal attacks vs toxic comments? 
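One possible way to quantify agreement, offered here as an assumption rather than the measure used in this notebook, is the share of workers who chose the majority label for each comment. The sketch below uses a made-up dataframe; `majority_agreement` is a hypothetical helper name, while `rev_id` and `attack` are columns from the annotations data.

```python
import pandas as pd


def majority_agreement(annotations, label_col):
    """Per-comment share of workers who chose the majority 0/1 label."""
    # The mean of a 0/1 label is the share of workers who answered 1.
    share_positive = annotations.groupby("rev_id")[label_col].mean()
    # Agreement with the majority is whichever of the two shares is larger.
    return share_positive.where(share_positive >= 0.5, 1 - share_positive)


# Made-up example rows, not real data
toy = pd.DataFrame({
    "rev_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "attack": [1, 1, 1, 0, 0, 0, 0, 0],
})
agreement = majority_agreement(toy, "attack")
print(agreement)
```

The same function could be applied to the toxicity annotations with `label_col="toxicity"`, giving two distributions of agreement scores to compare.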

### Analysis 1: Worker demographics impacting labelling

Are labelers more likely to agree on how to label comments for personal attacks if they share demographics? 

The demographics included in the personal attack worker demographics data are gender, whether English is the worker's first language, age group, and education.

See the different demographics below and the fraction of workers each makes up:

In [40]:
print("Gender:\n", attack_demographics["gender"].value_counts(normalize=True))
print("\nEnglish as first language:\n", attack_demographics["english_first_language"].value_counts(normalize=True))
print("\nAge group:\n", attack_demographics["age_group"].value_counts(normalize=True))
print("\nEducation:\n", attack_demographics["education"].value_counts(normalize=True))

Gender:
 male      0.615982
female    0.383562
other     0.000457
Name: gender, dtype: float64

English as first language:
 0    0.816438
1    0.183562
Name: english_first_language, dtype: float64

Age group:
 18-30       0.486775
30-45       0.385615
45-60       0.101160
Under 18    0.017169
Over 60     0.009281
Name: age_group, dtype: float64

Education:
 bachelors       0.393607
hs              0.288128
masters         0.175799
professional    0.110959
some            0.021918
doctorate       0.009132
none            0.000457
Name: education, dtype: float64


Join the demographics dataset for attacks with the annotations dataset to get the annotations per worker. 

In [41]:
attack_demo_ann = pd.merge(attack_annotations, attack_demographics, 
how='inner', on='worker_id')
attack_demo_ann.head()

['rev_id' 'worker_id' 'quoting_attack' 'recipient_attack'
 'third_party_attack' 'other_attack' 'attack' 'gender'
 'english_first_language' 'age_group' 'education']


Unnamed: 0,rev_id,worker_id,quoting_attack,recipient_attack,third_party_attack,other_attack,attack,gender,english_first_language,age_group,education
0,37675,1362,0.0,0.0,0.0,0.0,0.0,male,0,18-30,masters
1,3202092,1362,0.0,0.0,0.0,0.0,0.0,male,0,18-30,masters
2,4745553,1362,0.0,0.0,0.0,0.0,0.0,male,0,18-30,masters
3,4855563,1362,0.0,0.0,0.0,0.0,0.0,male,0,18-30,masters
4,8350378,1362,0.0,0.0,0.0,0.0,0.0,male,0,18-30,masters
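An inner join silently drops annotations from workers who have no demographics row. A small sketch of how that could be checked is shown below, using made-up stand-ins for the two dataframes; the `validate` and `indicator` arguments are standard `pd.merge` options.

```python
import pandas as pd

# Made-up stand-ins for attack_annotations and attack_demographics
annotations = pd.DataFrame({
    "rev_id": [10, 10, 11, 11],
    "worker_id": [1, 2, 1, 3],
    "attack": [0.0, 1.0, 0.0, 0.0],
})
demographics = pd.DataFrame({
    "worker_id": [1, 2],
    "gender": ["male", "female"],
})

# validate raises MergeError if a worker_id appears twice in demographics;
# indicator adds a _merge column showing where each row came from.
checked = pd.merge(annotations, demographics, how="left", on="worker_id",
                   validate="many_to_one", indicator=True)

unmatched = (checked["_merge"] == "left_only").sum()
print(f"{unmatched} of {len(checked)} annotations have no demographics")
```

Rows marked `left_only` are the ones an inner join would discard, so this count shows how much of the annotation data the demographic analysis leaves out.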


Get the probabilities for all the demographics that make up more than 3% of the data:
- P(Attack=True | Female)
- P(Attack=True | Male)
- P(Attack=True | English is first language)
- P(Attack=True | English is not first language)
- P(Attack=True | Age group is 18 - 30)
- P(Attack=True | Age group is 30 - 45)
- P(Attack=True | Age group is 45 - 60)
- P(Attack=True | Education is high school)
- P(Attack=True | Education is bachelors)
- P(Attack=True | Education is masters)
- P(Attack=True | Education is professional)
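The conditional probabilities listed above can also be computed in one pass with a groupby. The sketch below is an alternative to the per-group filters, shown on made-up data; `attack_rate_by_group` and `min_share` are hypothetical names, and the 3% cutoff is applied to the share of workers.

```python
import pandas as pd


def attack_rate_by_group(merged, workers, column, min_share=0.03):
    """P(attack = 1 | group) for groups covering more than min_share of workers."""
    share = workers[column].value_counts(normalize=True)
    keep = share[share > min_share].index
    subset = merged[merged[column].isin(keep)]
    # The mean of a 0/1 column is the fraction of rows equal to 1.
    return subset.groupby(column)["attack"].mean()


# Made-up data: 100 workers, one of whom identifies as "other" (1% share)
workers = pd.DataFrame({
    "worker_id": range(100),
    "gender": ["male"] * 50 + ["female"] * 49 + ["other"],
})
merged = pd.DataFrame({
    "worker_id": [0, 0, 50, 50, 99],
    "attack": [0.0, 1.0, 1.0, 1.0, 1.0],
}).merge(workers, on="worker_id")

rates = attack_rate_by_group(merged, workers, "gender")
print(rates)
```

On the real data the call would look like `attack_rate_by_group(attack_demo_ann, attack_demographics, "gender")`, repeated for each demographic column.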

As you can see below, labelers who identified as female were ~1.5 percentage points more likely to label a comment as a personal attack than labelers who identified as male (17.4% vs 15.9%). This is a small difference.

In [239]:
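print("P(Attack=True | Female):\n", attack_demo_ann[attack_demo_ann["gender"] == "female"]["attack"].value_counts(normalize=True))
print("\nP(Attack=True | Male):\n", attack_demo_ann[attack_demo_ann["gender"] == "male"]["attack"].value_counts(normalize=True))
print("\nP(Attack=True | Other):\n", attack_demo_ann[attack_demo_ann["gender"] == "other"]["attack"].value_counts(normalize=True))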

P(Attack=True | Female):
 0.0    0.826402
1.0    0.173598
Name: attack, dtype: float64

P(Attack=True | Male):
 0.0    0.841309
1.0    0.158691
Name: attack, dtype: float64

P(Attack=True | Other):
 0.0    0.554217
1.0    0.445783
Name: attack, dtype: float64
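A gap between two rates can be stated in percentage points (absolute) or as a percentage of one of the rates (relative), and the two give quite different numbers. The short calculation below works both out for the female and male rates printed above.

```python
# Rates copied from the output above
p_female = 0.173598
p_male = 0.158691

# Absolute gap, in percentage points
gap_points = (p_female - p_male) * 100

# Relative gap, as a percentage of the male rate
gap_relative = (p_female - p_male) / p_male * 100

print(f"{gap_points:.2f} percentage points, {gap_relative:.1f}% relative")
```

The gaps quoted in this analysis are of the first kind, that is, differences in percentage points.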


As you can see below, labelers who said they spoke English as their first language were ~2 percentage points more likely to label comments as a personal attack than labelers who said English was not their first language (18.1% vs 16.1%).

In [55]:
print("P(Attack=True | English is first language):\n", attack_demo_ann[attack_demo_ann["english_first_language"] == 1]["attack"].value_counts(normalize=True))
print("\nP(Attack=True | English is not first language):\n", attack_demo_ann[attack_demo_ann["english_first_language"] == 0]["attack"].value_counts(normalize=True))

P(Attack=True | English is first language):
 0.0    0.818531
1.0    0.181469
Name: attack, dtype: float64

P(Attack=True | English is not first language):
 0.0    0.838578
1.0    0.161422
Name: attack, dtype: float64


As you can see below, age has a larger effect on whether a labeler marks a comment as a personal attack. Labelers in the 45-60 age group were ~5.7 percentage points more likely to label comments as an attack than labelers in the 18-30 age group and ~3.8 percentage points more likely than those in the 30-45 age group.

In [57]:
print("P(Attack=True | Age group is 18 - 30):\n", attack_demo_ann[attack_demo_ann["age_group"] == "18-30"]["attack"].value_counts(normalize=True))
print("\nP(Attack=True | Age group is 30 - 45):\n", attack_demo_ann[attack_demo_ann["age_group"] == "30-45"]["attack"].value_counts(normalize=True))
print("\nP(Attack=True | Age group is 45 - 60):\n", attack_demo_ann[attack_demo_ann["age_group"] == "45-60"]["attack"].value_counts(normalize=True))

P(Attack=True | Age group is 18 - 30):
 0.0    0.848967
1.0    0.151033
Name: attack, dtype: float64

P(Attack=True | Age group is 30 - 45):
 0.0    0.829853
1.0    0.170147
Name: attack, dtype: float64

P(Attack=True | Age group is 45 - 60):
 0.0    0.791523
1.0    0.208477
Name: attack, dtype: float64


As you can see below, education has the smallest impact on whether a labeler marked a comment as an attack. The four education groups shown range from 16.1% to 16.8%, within about 0.7 percentage points of each other.

In [58]:
print("P(Attack=True | Education is high school):\n", attack_demo_ann[attack_demo_ann["education"] == "hs"]["attack"].value_counts(normalize=True))
print("\nP(Attack=True | Education is bachelors):\n", attack_demo_ann[attack_demo_ann["education"] == "bachelors"]["attack"].value_counts(normalize=True))
print("\nP(Attack=True | Education is masters):\n", attack_demo_ann[attack_demo_ann["education"] == "masters"]["attack"].value_counts(normalize=True))
print("\nP(Attack=True | Education is professional):\n", attack_demo_ann[attack_demo_ann["education"] == "professional"]["attack"].value_counts(normalize=True))

P(Attack=True | Education is high school):
 0.0    0.838819
1.0    0.161181
Name: attack, dtype: float64

P(Attack=True | Education is bachelors):
 0.0    0.834606
1.0    0.165394
Name: attack, dtype: float64

P(Attack=True | Education is masters):
 0.0    0.834386
1.0    0.165614
Name: attack, dtype: float64

P(Attack=True | Education is professional):
 0.0    0.832173
1.0    0.167827
Name: attack, dtype: float64


After looking at the probabilities, age appears to be the demographic with the largest influence on a labeler's decision, with a difference of ~5.7 percentage points between the 18-30 and 45-60 age groups. It is followed by English as a first language (~2.0 points), gender (~1.5 points), and education last, with less than 0.7 percentage points of variation between the four education levels shown.

Next, we will expand on the demographic data by training a logistic regression model and looking at its coefficients.

The cell below sets the y labels to whether or not a comment was labeled an attack and builds a dataframe of predictor features: a binary feature for each age group, gender, and education level, plus the English as a first language feature.

In [223]:
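# Target: 1.0 if the worker labeled the comment an attack, else 0.0
y_train = attack_demo_ann["attack"]

# Predictors: one 0/1 indicator column per demographic category
X_train = pd.DataFrame(columns=[])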

X_train["Age_Under18"] = attack_demo_ann["age_group"].apply(lambda g: 1 if g == "Under 18" else 0)
X_train["Age_18-30"] = attack_demo_ann["age_group"].apply(lambda g: 1 if g == "18-30" else 0)
X_train["Age_30-45"] = attack_demo_ann["age_group"].apply(lambda g: 1 if g == "30-45" else 0)
X_train["Age_45-60"] = attack_demo_ann["age_group"].apply(lambda g: 1 if g == "45-60" else 0)
X_train["Age_Over60"] = attack_demo_ann["age_group"].apply(lambda g: 1 if g == "Over 60" else 0)

X_train["GenderOther"] = attack_demo_ann["gender"].apply(lambda g: 1 if g == "other" else 0)
X_train["GenderFemale"] = attack_demo_ann["gender"].apply(lambda g: 1 if g == "female" else 0)
X_train["GenderMale"] = attack_demo_ann["gender"].apply(lambda g: 1 if g == "male" else 0)

X_train["EnglishFirst"] = attack_demo_ann["english_first_language"]

X_train["Education_HS"] = attack_demo_ann["education"].apply(lambda g: 1 if g == "hs" else 0)
X_train["Education_Bachelors"] = attack_demo_ann["education"].apply(lambda g: 1 if g == "bachelors" else 0)
X_train["Education_Masters"] = attack_demo_ann["education"].apply(lambda g: 1 if g == "masters" else 0)
X_train["Education_Professional"] = attack_demo_ann["education"].apply(lambda g: 1 if g == "professional" else 0)
X_train["Education_Some"] = attack_demo_ann["education"].apply(lambda g: 1 if g == "some" else 0)
X_train["Education_Doctorate"] = attack_demo_ann["education"].apply(lambda g: 1 if g == "doctorate" else 0)
X_train["Education_None"] = attack_demo_ann["education"].apply(lambda g: 1 if g == "none" else 0)

X_train.head()

Unnamed: 0,Age_Under18,Age_18-30,Age_30-45,Age_45-60,Age_Over60,GenderOther,GenderFemale,GenderMale,EnglishFirst,Education_HS,Education_Bachelors,Education_Masters,Education_Professional,Education_Some,Education_Doctorate,Education_None
0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0
1,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0
2,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0
3,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0
4,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0
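The per-category lambda columns above can be produced more compactly with `pd.get_dummies`. The sketch below is an alternative encoding shown on made-up rows, not the code this notebook ran; the resulting column names follow the `<column>_<category>` pattern and so differ from the hand-written names.

```python
import pandas as pd

# Made-up rows with the same column names as attack_demo_ann
toy = pd.DataFrame({
    "age_group": ["18-30", "30-45", "18-30"],
    "gender": ["male", "female", "male"],
    "english_first_language": [0, 1, 0],
})

# One 0/1 column per category, named "<column>_<category>"
dummies = pd.get_dummies(toy[["age_group", "gender"]], dtype=int)

# Append the feature that is already binary
X_toy = pd.concat([dummies, toy[["english_first_language"]]], axis=1)
print(X_toy.columns.tolist())
```

Rows with a missing category get 0 in every indicator column for that variable, which matches what the lambda version does.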


Next up is creating a logistic regression model and fitting it to the data. Below we can see the features and their coefficients from the model.

In [214]:
import statsmodels.api as sm

model = sm.Logit(y_train, X_train)
result = model.fit()
result.summary()

Optimization terminated successfully.
         Current function value: 0.444978
         Iterations 7


Dep. Variable:,attack,No. Observations:,855514.0
Model:,Logit,Df Residuals:,855499.0
Method:,MLE,Df Model:,14.0
Date:,"Sun, 11 Oct 2020",Pseudo R-squ.:,0.003154
Time:,20:54:14,Log-Likelihood:,-380680.0
converged:,True,LL-Null:,-381890.0
Covariance Type:,nonrobust,LLR p-value:,0.0

,coef,std err,z,P>|z|,[0.025,0.975]
Age_Under18,-0.2978,0.042,-7.174,0.000,-0.379,-0.216
Age_18-30,0.0069,0.026,0.262,0.793,-0.045,0.059
Age_30-45,0.1469,0.026,5.579,0.000,0.095,0.199
Age_45-60,0.3836,0.027,13.991,0.000,0.330,0.437
Age_Over60,0.2748,0.039,7.090,0.000,0.199,0.351
GenderOther,0.0248,,,,,
GenderFemale,-1.4276,,,,,
GenderMale,-1.5177,,,,,
EnglishFirst,0.1005,0.008,11.924,0.000,0.084,0.117
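The blank standard errors in the gender rows are what one would expect when predictor columns are linearly dependent. Assuming every annotation row has both a gender and an education value, the gender indicators sum to 1 in every row and so do the education indicators, so one set can be written in terms of the other. The sketch below uses a made-up dataframe to show the rank deficiency and one common remedy: drop a reference category per variable and add an explicit intercept column.

```python
import numpy as np
import pandas as pd

# Made-up rows, not real data
toy = pd.DataFrame({
    "education": ["hs", "bachelors", "masters", "hs", "bachelors", "masters"],
    "gender": ["male", "female", "male", "female", "male", "female"],
})

# Every category gets a column: the education columns sum to 1 in each row
# and so do the gender columns, so the columns are linearly dependent.
full = pd.get_dummies(toy, dtype=float)
full_rank = np.linalg.matrix_rank(full.to_numpy())
print(full.shape[1], "columns, rank", full_rank)

# Drop one reference category per variable and add an explicit intercept.
reduced = pd.get_dummies(toy, drop_first=True, dtype=float)
reduced.insert(0, "const", 1.0)
reduced_rank = np.linalg.matrix_rank(reduced.to_numpy())
print(reduced.shape[1], "columns, rank", reduced_rank)
```

With this encoding each coefficient is read relative to the dropped reference category, and the design matrix has full column rank.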


Exponentiating the coefficients (log odds) gives odds ratios. Features with an odds ratio greater than 1 are associated with higher odds of a comment being labeled as an attack, and features with an odds ratio below 1 with lower odds.

In [219]:
import numpy as np

# Exponentiate the log-odds coefficients to get odds ratios;
# values greater than 1 are positively associated with attack labels
odds = np.exp(result.params)
odds

# Age 30-45: 1.1582, i.e. about 15.8% higher odds of an attack label
# Age under 18: 0.7425, i.e. about 25.8% lower odds of an attack label

Age_Under18               0.742461
Age_18-30                 1.006927
Age_30-45                 1.158234
Age_45-60                 1.467557
Age_Over60                1.316219
GenderOther               1.025093
GenderFemale              0.239885
GenderMale                0.219211
EnglishFirst              1.105735
Education_HS              0.756290
Education_Bachelors       0.779260
Education_Masters         0.770216
Education_Professional    0.795967
Education_Some            0.662451
Education_Doctorate       0.737382
Education_None            0.305423
dtype: float64
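To read these values, it helps to convert each coefficient into a percent change in the odds. The short calculation below does this for three coefficients copied from the summary table; note that the change applies to the odds, not to the probability.

```python
import math

# Coefficients copied from the summary table above
coefficients = {
    "Age_Under18": -0.2978,
    "Age_30-45": 0.1469,
    "EnglishFirst": 0.1005,
}

results = {}
for name, coef in coefficients.items():
    odds_ratio = math.exp(coef)
    percent_change = (odds_ratio - 1) * 100  # change in the odds
    results[name] = (odds_ratio, percent_change)
    print(f"{name}: odds ratio {odds_ratio:.3f}, {percent_change:+.1f}% odds")
```

An odds ratio below 1, as for Age_Under18, corresponds to a negative percent change, meaning lower odds of an attack label.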

Comparing the different gender breakdowns, we can see that labelers who identified as other were more than twice as likely to mark a comment as an attack as labelers who identified as male or female (44.6% vs 15.9% and 17.4%).

Having fewer than 0.1% of the labelers identify as other biases the data: according to the LGBTQ+ Population Quick Facts, in 2017 about 4.5% of US adults identified as LGBT and 8.2% of millennials identified as LGBT (source: https://diversity.iupui.edu/offices/lgbtq/images/LGBT-Population-Quick-Facts.pdf). It would be better to have a higher, more representative percentage of labelers who identify as other, given that this makes a large difference in how the data is labelled.

In [243]:
print("P(Attack=True | English is first language):\n", attack_demo_ann[attack_demo_ann["english_first_language"] == 1]["attack"].value_counts(normalize=True))
print("\nP(Attack=True | English is not first language):\n", attack_demo_ann[attack_demo_ann["english_first_language"] == 0]["attack"].value_counts(normalize=True))
print("\nP(Attack=True | Other):\n", attack_demo_ann[attack_demo_ann["gender"] == "other"]["attack"].value_counts(normalize=True))

P(Attack=True | English is first language):
 0.0    0.818531
1.0    0.181469
Name: attack, dtype: float64

P(Attack=True | English is not first language):
 0.0    0.838578
1.0    0.161422
Name: attack, dtype: float64

P(Attack=True | Other):
 0.0    0.554217
1.0    0.445783
Name: attack, dtype: float64
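Because the "other" group is such a small share of workers (0.000457 in the demographics output), its attack rate may rest on very few individuals. A sketch of how group sizes could be reported next to the rates is shown below, on made-up rows that reuse the column names of attack_demo_ann.

```python
import pandas as pd

# Made-up rows with the same column names as attack_demo_ann
toy = pd.DataFrame({
    "worker_id": [1, 1, 2, 2, 3, 3, 3],
    "gender": ["male", "male", "female", "female", "other", "other", "other"],
    "attack": [0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0],
})

# Named aggregation: distinct workers, annotation count, and attack rate
summary = toy.groupby("gender").agg(
    workers=("worker_id", "nunique"),
    annotations=("attack", "size"),
    attack_rate=("attack", "mean"),
)
print(summary)
```

Showing the number of distinct workers beside each rate makes it easier to tell a group-level pattern from the labeling habits of one or two people.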


The demographic data would be more helpful in detecting bias if it also included information on ethnicity and household income.