In [3]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

In [4]:
data = pd.read_csv("context_toxicity_raw.txt")

<h1> The Dataset (CCC) </h1>

To build the dataset of this work, we used the publicly available Civil Comments (CC) dataset (Borkan et al., 2019).

CC was originally annotated by ten annotators per post, but the parent post (the previous post in the thread) was not shown to the annotators. The posts in this dataset were annotated again by annotators who could also see the parent post; we call this new dataset Civil Comments in Context (CCC). Each CCC post was rated either as NON-TOXIC, UNSURE, TOXIC, or VERY TOXIC, as in the original CC dataset. We unified the latter two labels in both CC and CCC annotations to simplify the problem.
In only 71 posts (0.07%) an annotator said UNSURE, meaning annotators were confident in their decisions most of the time. We exclude these 71 posts from our study, as there are too few to generalize about.

The dataset is stored as a CSV. The data file contains 8 columns:
<ul>
<li>id = the id of the target post on the Civil Comments platform </li>
<li>tox_codes_oc = the toxic codes given by the annotators who did not have access to the parent post</li>
<li>text = the target posts</li>
<li>toxicity_annotator_count = the number of the annotators who annotated this post</li>
<li>parent = the parent post</li>
<li>tox_codes_ic = the toxic codes given by the annotators who did have access to the parent post</li>
<li>tox_codes_parent = the toxic codes (out of context) of the parent post</li>
<li>workers_ic = the ids of the annotators on the Appen platform </li>
</ul>

<h1> Cleaning Methodology</h1>

The labels for this dataset are split across two columns, tox_codes_oc and tox_codes_ic, each of which contains a list of codes given by annotators. 1 means toxic and 0 means non-toxic.

To determine whether a text is toxic or not, I will count all the codes and decide by majority vote.

[0,0,0,1] = Non-Toxic

[1,1,0,1] = Toxic

[0,0,1,1] = Neutral

These will then be recoded according to the preprocessing guidelines: Non-Toxic becomes non-risky (0), Neutral becomes potentially risky (1), and Toxic becomes risky (2).
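
The majority-vote rule and the recoding described above can be sketched as follows. This is a minimal illustration, assuming the codes have already been parsed into a list of numbers; the function name `majority_label` is hypothetical and is not used elsewhere in this notebook.

```python
def majority_label(codes):
    """Map a list of 0/1 annotator codes to a risk label.

    0 = non-risky (majority non-toxic), 1 = potentially risky (tie),
    2 = risky (majority toxic). Hypothetical helper for illustration.
    """
    toxic_votes = sum(1 for c in codes if c == 1)
    non_toxic_votes = sum(1 for c in codes if c == 0)
    if non_toxic_votes > toxic_votes:
        return 0
    if toxic_votes > non_toxic_votes:
        return 2
    return 1


# The three examples listed above.
examples = [[0, 0, 0, 1], [1, 1, 0, 1], [0, 0, 1, 1]]
labels = [majority_label(codes) for codes in examples]
print(labels)  # [0, 2, 1]
```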


To add more text to the dataset, I will include the parent text's labels.

In [5]:
data

Unnamed: 0,id,tox_codes_oc,text,toxicity_annotator_count,parent,tox_codes_ic,tox_codes_parent,workers_ic
0,240290,"[0, 0, 0, 0]","Not a good idea, considering that the elephant...",4,Here's an idea. Spend $2 million moving them b...,"[0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 0.0, 0.0, 0.0, 0.0]","[45485526, 45438879, 45500804, 45432844, 45388..."
1,240311,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",You're an idiot.,32,"""Yesterday I was giddy, today I'm scared,"" Tho...","[1.0, 1.0, 0.0, 1.0, 1.0]","[0.0, 0.0, 0.0, 0.0, 0.0]","[45432844, 45485526, 45397010, 45596860, 45630..."
2,240314,"[0, 0, 0, 0]","No, I think 1983 is the correct debut year. Th...",4,Perhaps someone born less recently than I has ...,"[0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 0.0, 0.0, 0.0, 0.5]","[45397769, 45598353, 45486432, 45525588, 45599..."
3,240325,"[0, 0, 0, 0]",Author of article here. I appreciate this fact...,4,Does the author of this piece not know that St...,"[0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 0.0, 0.0, 0.0, 0.0]","[45432844, 45331940, 45582632, 45449661, 45489..."
4,240367,"[0, 0, 0, 0]",HAHAHA you have a point!,4,"They both start with Star, so how are the diff...","[0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 0.0, 0.0, 0.0, 0.0]","[45536973, 45541171, 45491654, 45388080, 45224..."
...,...,...,...,...,...,...,...,...
9924,6333040,"[0, 0, 0, 0, 0, 0]","Man, I was thinking about becoming a stoner, b...",6,Count on the federal government to have not pe...,"[0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 0.0, 0.0, 1.0, 0.0]","[45535369, 45314593, 45597995, 45313149, 45571..."
9925,6333248,"[0, 0, 0, 0]",I did it just so you could rub my belly again!...,4,thank you for rolling over motley. knew you wo...,"[0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 0.0, 0.0, 0.0, 0.0]","[45256178, 45184889, 45224788, 45653549, 45541..."
9926,6333695,"[0, 0, 0, 0]","Let me re-phrase it then, ""Comment""---Perhaps ...",4,"“In the original Hebrew of the Old Testament, ...","[0.0, 0.0, 0.0, 0.0, 1.0]","[0.0, 0.0, 0.0, 0.0, 1.0]","[45446324, 45418477, 45599360, 45589137, 45407..."
9927,6333872,"[0, 0, 0, 1, 0]",I don't know that abortion played much of a pa...,5,I don't know whether you're trying to label me...,"[0.0, 0.0, 0.0, 0.0, 0.0]","[0.0, 0.0, 0.0, 0.0, 0.0]","[45537352, 45184889, 45388080, 45588938, 45505..."


In [6]:
df_to_clean = data.copy()

In [7]:
df_to_clean['tox_codes_parent'].str.contains('0.5').sum()

46

Create a new DataFrame with the text and its labels for processing

In [8]:
data_text = data[['text', 'tox_codes_oc', 'tox_codes_ic']]
data_text

Unnamed: 0,text,tox_codes_oc,tox_codes_ic
0,"Not a good idea, considering that the elephant...","[0, 0, 0, 0]","[0.0, 0.0, 0.0, 0.0, 0.0]"
1,You're an idiot.,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[1.0, 1.0, 0.0, 1.0, 1.0]"
2,"No, I think 1983 is the correct debut year. Th...","[0, 0, 0, 0]","[0.0, 0.0, 0.0, 0.0, 0.0]"
3,Author of article here. I appreciate this fact...,"[0, 0, 0, 0]","[0.0, 0.0, 0.0, 0.0, 0.0]"
4,HAHAHA you have a point!,"[0, 0, 0, 0]","[0.0, 0.0, 0.0, 0.0, 0.0]"
...,...,...,...
9924,"Man, I was thinking about becoming a stoner, b...","[0, 0, 0, 0, 0, 0]","[0.0, 0.0, 0.0, 0.0, 0.0]"
9925,I did it just so you could rub my belly again!...,"[0, 0, 0, 0]","[0.0, 0.0, 0.0, 0.0, 0.0]"
9926,"Let me re-phrase it then, ""Comment""---Perhaps ...","[0, 0, 0, 0]","[0.0, 0.0, 0.0, 0.0, 1.0]"
9927,I don't know that abortion played much of a pa...,"[0, 0, 0, 1, 0]","[0.0, 0.0, 0.0, 0.0, 0.0]"


In [9]:
# Both columns store the lists as strings, so + joins the text, e.g. "[0, 0][0.0, 0.0]"
data_text['combined_codes'] = data_text['tox_codes_oc'] + data_text['tox_codes_ic']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_text['combined_codes'] = data_text['tox_codes_oc'] + data_text['tox_codes_ic']
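
The warning above is raised because `data_text` was created by selecting columns from `data`, so pandas cannot tell whether the assignment is meant to modify the original frame. Below is a minimal sketch of one common way to avoid it, assuming an explicit `.copy()` is acceptable here. The toy frame `raw` and its values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the raw data; values are illustrative only.
raw = pd.DataFrame({
    "text": ["first post", "second post"],
    "tox_codes_oc": ["[0, 0]", "[1, 1]"],
    "tox_codes_ic": ["[0.0]", "[1.0]"],
})

# .copy() makes the subset an independent DataFrame, so adding columns
# to it is unambiguous and leaves `raw` untouched.
subset = raw[["text", "tox_codes_oc", "tox_codes_ic"]].copy()
subset["combined_codes"] = subset["tox_codes_oc"] + subset["tox_codes_ic"]

print(subset["combined_codes"].tolist())  # ['[0, 0][0.0]', '[1, 1][1.0]']
```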


Changing labels based on preprocessing guidelines

In [10]:
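# NOTE: combined_codes is a string, so count() counts characters, not votes:
# "0.0" adds two to zero_count, and "1.0" adds one to each count.
data_text['zero_count'] = data_text['combined_codes'].apply(lambda x:x.count('0'))
data_text['one_count'] = data_text['combined_codes'].apply(lambda x:x.count('1'))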



In [11]:
data_text['label_value'] = data_text['zero_count'] - data_text['one_count']



In [12]:
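def labeler(x):
    # x is zero_count - one_count for a row
    if x > 0:
        return 0  # more non-toxic codes: non-risky
    elif x < 0:
        return 2  # more toxic codes: risky
    else:
        return 1  # tie: potentially risky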

In [13]:
# Cleaned labels
data_text['label'] = data_text['label_value'].apply(labeler)
data_text.head()

Unnamed: 0,text,tox_codes_oc,tox_codes_ic,combined_codes,zero_count,one_count,label_value,label
0,"Not a good idea, considering that the elephant...","[0, 0, 0, 0]","[0.0, 0.0, 0.0, 0.0, 0.0]","[0, 0, 0, 0][0.0, 0.0, 0.0, 0.0, 0.0]",14,0,14,0
1,You're an idiot.,"[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[1.0, 1.0, 0.0, 1.0, 1.0]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",7,35,-28,2
2,"No, I think 1983 is the correct debut year. Th...","[0, 0, 0, 0]","[0.0, 0.0, 0.0, 0.0, 0.0]","[0, 0, 0, 0][0.0, 0.0, 0.0, 0.0, 0.0]",14,0,14,0
3,Author of article here. I appreciate this fact...,"[0, 0, 0, 0]","[0.0, 0.0, 0.0, 0.0, 0.0]","[0, 0, 0, 0][0.0, 0.0, 0.0, 0.0, 0.0]",14,0,14,0
4,HAHAHA you have a point!,"[0, 0, 0, 0]","[0.0, 0.0, 0.0, 0.0, 0.0]","[0, 0, 0, 0][0.0, 0.0, 0.0, 0.0, 0.0]",14,0,14,0
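
In the output above, rows with nine non-toxic votes show `zero_count` = 14, because `str.count('0')` counts characters and each `0.0` contributes two of them. The following is a hedged sketch of an alternative that parses the stringified lists before counting. It assumes both columns hold Python-style list literals, as displayed, and `count_votes` is a hypothetical helper that is not part of this notebook. Labels derived from these counts may differ from the ones shown here.

```python
import ast


def count_votes(oc_codes, ic_codes):
    """Return (zero_votes, one_votes) across both annotation lists.

    Both arguments are strings such as "[0, 0, 1]" or "[0.0, 1.0]".
    Values other than 0 and 1 are ignored in this sketch.
    """
    votes = ast.literal_eval(oc_codes) + ast.literal_eval(ic_codes)
    zero_votes = sum(1 for v in votes if v == 0)
    one_votes = sum(1 for v in votes if v == 1)
    return zero_votes, one_votes


# Row 0 from the table above: four plus five non-toxic votes.
zeros, ones = count_votes("[0, 0, 0, 0]", "[0.0, 0.0, 0.0, 0.0, 0.0]")
print(zeros, ones)  # 9 0
```

With pandas, a helper like this could be applied row by row, for example through `DataFrame.apply` with `axis=1`.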


Removing Stopwords and Punctuation

In [14]:
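def text_cleaner_training(dataset, text_column, new_column):
    """Add new_column: stopwords removed, non-letters stripped, tokens lemmatized."""
    # Requires the NLTK stopword, tokenizer, and WordNet data (see nltk.download).
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words("english"))
    # 1. Tokenize and drop English stopwords (case-insensitive match).
    dataset[new_column] = dataset[text_column].apply(lambda x: " ".join([word for word in word_tokenize(x) if word.lower() not in stop_words]))
    # 2. Replace every character that is not an ASCII letter or apostrophe with a space.
    dataset[new_column] = dataset[new_column].apply(lambda x: re.sub(r"[^a-zA-Z']", ' ', x))
    # 3. Lemmatize each remaining whitespace-separated token.
    dataset[new_column] = dataset[new_column].apply(lambda x: " ".join([lemmatizer.lemmatize(word) for word in x.split()]))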


Removing Emojis

In [15]:
def text_cleaner_eda(dataset, text_column, new_column):
    """Add new_column with emoji in the Unicode ranges below removed."""
    emoji_pattern = re.compile(
        "["
        "\U0001F600-\U0001F64F"  # emoticons
        "\U0001F300-\U0001F5FF"  # symbols & pictographs
        "\U0001F680-\U0001F6FF"  # transport & map symbols
        "\U0001F1E0-\U0001F1FF"  # flags (iOS)
        "]+",
        flags=re.UNICODE,
    )

    dataset[new_column] = dataset[text_column].apply(lambda x: emoji_pattern.sub('', x))
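
As a quick check of the emoji pattern, the sketch below applies the same four Unicode ranges to a single string using only the standard library. The sample input is hypothetical and does not come from the dataset.

```python
import re

# Same four Unicode ranges as in the cell above.
emoji_pattern = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags
    "]+"
)

sample = "good point \U0001F600\U0001F680"  # hypothetical input ending in two emoji
cleaned = emoji_pattern.sub("", sample)
print(repr(cleaned))  # 'good point '
```

These four ranges do not cover every emoji. For instance, U+1F914 falls outside all of them and would be left in place.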

In [16]:
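text_cleaner_training(data_text, 'text', 'clean_sentence_training')
# The EDA cleaner runs on the already-cleaned column, which no longer contains emoji,
# so clean_sentence_eda ends up identical to clean_sentence_training.
text_cleaner_eda(data_text, 'clean_sentence_training', 'clean_sentence_eda')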

Retrieving needed columns from partial dataset

In [17]:
data_partial = data_text[['text','label','clean_sentence_training', 'clean_sentence_eda']]

In [18]:
data_partial['corpus_name'] = 'context_toxicity'



In [19]:
data = data_partial[['corpus_name', 'text','label', 'clean_sentence_training', 'clean_sentence_eda']]

In [20]:
data.columns = ['corpus_name', 'raw_sentence', 'label', 'clean_sentence_training', 'clean_sentence_eda']

In [21]:
#data.to_csv('context_toxicity.csv', index=False)

In [22]:
data['label'].value_counts()

0    9660
2     247
1      22
Name: label, dtype: int64

In [23]:
data

Unnamed: 0,corpus_name,raw_sentence,label,clean_sentence_training,clean_sentence_eda
0,context_toxicity,"Not a good idea, considering that the elephant...",0,good idea considering elephant zoo Asian elephant,good idea considering elephant zoo Asian elephant
1,context_toxicity,You're an idiot.,2,'re idiot,'re idiot
2,context_toxicity,"No, I think 1983 is the correct debut year. Th...",0,think correct debut year Think Star Wars Episo...,think correct debut year Think Star Wars Episo...
3,context_toxicity,Author of article here. I appreciate this fact...,0,Author article appreciate fact check ca n't ri...,Author article appreciate fact check ca n't ri...
4,context_toxicity,HAHAHA you have a point!,0,HAHAHA point,HAHAHA point
...,...,...,...,...,...
9924,context_toxicity,"Man, I was thinking about becoming a stoner, b...",0,Man thinking becoming stoner way,Man thinking becoming stoner way
9925,context_toxicity,I did it just so you could rub my belly again!...,0,could rub belly Thanks,could rub belly Thanks
9926,context_toxicity,"Let me re-phrase it then, ""Comment""---Perhaps ...",0,Let re phrase Comment '' Perhaps translator de...,Let re phrase Comment '' Perhaps translator de...
9927,context_toxicity,I don't know that abortion played much of a pa...,0,n't know abortion played much part New Jersey ...,n't know abortion played much part New Jersey ...
