This notebook creates the dataset we use to train an ILM to explain positive predictions of a toxicity classifier. The source datasets all target toxic/abusive language detection, have closely related task definitions, and come from a variety of sources.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
pd.set_option('display.max_colwidth', None)
#import preprocessor
import pickle
import wordsegment as ws
from html import unescape
import re
import string
ws.load() # load vocab for word segmentation

random_seed = 42

# Cleaning functions from hatecheck-experiments
# Define helper function for segmenting hashtags found through regex
def regex_match_segmentation(match):
    return ' '.join(ws.segment(match.group(0)))

# Define function for cleaning text
def clean_text(text):
    
    # convert HTML codes
    text = unescape(text)
    
    # lowercase text
    text = text.lower()
    
    # replace mentions and URLs with special tokens
    text = re.sub(r"@[A-Za-z0-9_-]+",'[USER]',text)
    text = re.sub(r"u/[A-Za-z0-9_-]+",'[USER]',text)
    text = re.sub(r"http\S+",'[URL]',text)
    
    # find and split hashtags into words
    text = re.sub(r"#[A-Za-z0-9]+", regex_match_segmentation, text)

    # remove punctuation at beginning of string (quirk in Davidson data)
    text = text.lstrip("!")
    text = text.lstrip(":")
    
    # remove newline and tab characters
    text = text.replace('\n',' ')
    text = text.replace('\t',' ')
    text = text.replace('[linebreak]', ' ')
    
    return text
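
As a quick sanity check, the placeholder substitutions can be tried in isolation. The sketch below re-implements only the unescape, lowercase, mention and URL steps of `clean_text` with the standard library. It skips hashtag segmentation so that it does not need `wordsegment`. The helper name `mask_users_and_urls` and the example string are hypothetical and not part of the notebook.

```python
import re
from html import unescape


def mask_users_and_urls(text):
    """Hypothetical helper: only the unescape, lowercase and placeholder steps of clean_text."""
    text = unescape(text).lower()
    text = re.sub(r"@[A-Za-z0-9_-]+", "[USER]", text)   # Twitter-style mentions
    text = re.sub(r"u/[A-Za-z0-9_-]+", "[USER]", text)  # Reddit-style mentions
    text = re.sub(r"http\S+", "[URL]", text)            # URLs
    return text


# Made-up input that mixes an HTML entity, both mention styles and a URL
example = "@Some_User thanks &amp; see https://example.com/page, u/other-user"
masked = mask_users_and_urls(example)
print(masked)  # [USER] thanks & see [URL] [USER]
```

Note that `http\S+` runs up to the next whitespace, so the comma directly after the URL is absorbed into the `[URL]` token.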

## Founta

The first dataset we consider is from [Founta et al. 2018](https://arxiv.org/pdf/1802.00393.pdf), which is a dataset sampled from Twitter. We split this into train, valid and test sets here, and only use the neutral tweets in the train split to train the ILM. We will use the same splits later when training a BERT classifier. 

In [None]:
df_texts = pd.read_csv("../Founta/hatespeech_text_label_vote.csv",names=['text', 'label', 'count_label_votes'], delimiter='\t')
df_texts.drop_duplicates(subset='text', inplace=True)
founta_train, founta_valtest = train_test_split(df_texts, test_size=0.2, stratify=df_texts.label, random_state=123)
founta_val, founta_test = train_test_split(founta_valtest, test_size=0.5, stratify=founta_valtest.label, random_state=123)
founta_train_neutral = founta_train[founta_train['label'] == 'normal']

founta_train.to_csv("Data/Founta/train.csv")
founta_val.to_csv("Data/Founta/valid.csv")
founta_test.to_csv("Data/Founta/test.csv")

founta_train_neutral[:10]
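
The two-step stratified split above can be checked on toy data. The following is a minimal sketch, assuming a made-up corpus of 100 items with an 80/20 label ratio; the texts, labels and proportions are illustrative and are not taken from the Founta data.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy corpus: 80 "normal" and 20 "abusive" items (made-up labels and proportions)
texts = [f"tweet {i}" for i in range(100)]
labels = ["normal"] * 80 + ["abusive"] * 20

# First carve off 20% of the data, then halve that part into validation and test
x_train, x_rest, y_train, y_rest = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=123)
x_val, x_test, y_val, y_test = train_test_split(
    x_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=123)

# Label counts per split
split_counts = {name: Counter(y) for name, y in
                [("train", y_train), ("valid", y_val), ("test", y_test)]}
print(split_counts)
```

Because both calls pass `stratify`, each of the 80/10/10 splits keeps the 80/20 label ratio of the full toy corpus.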

## CAD

Next, we get the neutral posts from the CAD dataset, introduced in [Vidgen et al. 2021](https://aclanthology.org/2021.naacl-main.182.pdf), which can be obtained [here](https://zenodo.org/record/4881008#.YnvpkvPMK3I). This dataset is sourced from Reddit; posts are annotated in context with hierarchical labels. For our task we keep only the posts with the Neutral label.

In [None]:
cad_train = pd.read_csv("../cad_naacl2021/data/cad_v1_1_train.tsv", sep="\t")
cad_train_neutral = cad_train[cad_train.labels == 'Neutral']
cad_train_neutral[:3]

## Wikipedia Toxicity

The next dataset we use is the Wikipedia Toxicity dataset from [Wulczyn et al. 2017](https://arxiv.org/abs/1610.08914), which can be downloaded [here](https://figshare.com/articles/dataset/Wikipedia_Talk_Labels_Toxicity/4563973). As shown in [Nejadgholi and Kiritchenko 2020](https://aclanthology.org/2020.alw-1.20.pdf), the neutral class of this dataset is dominated by Wikipedia-specific topics such as edits and formatting. We use the topic clusters found in that work to remove these domain-specific instances from the training set before sampling.

In [None]:
comments = pd.read_csv('../cross_dataset_toxicity/toxicity_annotated_comments.tsv', sep = '\t', index_col = 0)  #from https://figshare.com/articles/dataset/Wikipedia_Talk_Labels_Toxicity/4563973
annotations = pd.read_csv('../cross_dataset_toxicity/toxicity_annotations.tsv',  sep = '\t')
# label a comment as toxic when more than half of its annotators marked it toxic
comments['toxicity'] = annotations.groupby('rev_id')['toxicity'].mean() > 0.5

# replace newline and tab tokens with spaces
comments['comment'] = comments['comment'].apply(lambda x: x.replace("NEWLINE_TOKEN", " "))
comments['comment'] = comments['comment'].apply(lambda x: x.replace("TAB_TOKEN", " "))

wiki_topics = pd.read_csv('../cross_dataset_toxicity/wiki_toxicity_topics.csv', index_col=[0]) #from this repo

data = comments.merge(wiki_topics, on='rev_id')  #merge the two datasets

#pruned Wiki-toxic 
topic_categories={1:[0,1],
                  2:[2,7,8,9,12,14,16],
                  3:[3,4,5,6,10,11,13,15,17,18,19]}


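# keep training comments from topic categories 1 and 2 only (category 3 is dropped);
# a single combined mask avoids pandas' reindexing warning for chained boolean indexing
train_mask = (data['split'] == 'train') & data['wiki_topic'].isin(topic_categories[1] + topic_categories[2])
toxic_train_pruned = data[train_mask]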
wiki_train_neutral = toxic_train_pruned[toxic_train_pruned.toxicity == False]

In [None]:
wiki_train_neutral[:3]
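
The toxicity label used above is a strict majority vote over annotators. Here is a small sketch of that aggregation on made-up annotations (the `rev_id` values, votes and comment texts are invented for illustration), which also shows how a tie is resolved.

```python
import pandas as pd

# Made-up annotations: one row per (comment, annotator) pair, 1 = toxic, 0 = not toxic
annotations = pd.DataFrame({
    "rev_id":   [10, 10, 10, 11, 11, 11, 12, 12],
    "toxicity": [1,  1,  0,  0,  0,  1,  1,  0],
})

# Made-up comments indexed by rev_id, like the frame loaded with index_col=0
comments = pd.DataFrame(
    {"comment": ["first", "second", "third"]},
    index=pd.Index([10, 11, 12], name="rev_id"),
)

# Mean vote per comment; the assignment aligns on the rev_id index
comments["toxicity"] = annotations.groupby("rev_id")["toxicity"].mean() > 0.5
print(comments)
```

Comment 10 (two of three votes) is labelled toxic and comment 11 (one of three) is not. Comment 12 has a 1:1 tie, and because the comparison is strictly greater than 0.5, it ends up in the non-toxic class.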

## Civil Comments

Next, we get the Civil Comments dataset from [Kaggle](https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/data). This dataset consists of comments made on a number of news platforms between 2015 and 2017 and later annotated by Jigsaw. As neutral comments, we pick those whose target score is 0 (in practice, below 0.0001).

In [None]:
civil_comments_train = pd.read_csv('../civil_comments/train.csv')
civil_comments_neutral = civil_comments_train[(civil_comments_train['target'] < 0.0001)]

## Putting it all together

In [None]:
# comparing the sizes of different datasets
len(founta_train_neutral)

In [None]:
cad_train_neutral.shape[0]

In [None]:
wiki_train_neutral.shape[0]

In [None]:
civil_comments_neutral.shape[0]

In [None]:
# sample 30K comments from civil_comments, and take others as is. 
civil_comments_sampled = civil_comments_neutral.sample(n=30000, random_state=random_seed)
civil_comments_sampled.shape

In [None]:
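# coerce to str so that clean_text (which calls html.unescape) never receives a non-string value
civil_comments_sampled['comment_text'] = civil_comments_sampled['comment_text'].astype(str)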

In [None]:
founta_texts = [clean_text(tt) for tt in founta_train_neutral['text'].tolist()]
cad_texts = [clean_text(tt) for tt in cad_train_neutral['text'].tolist()]
wiki_texts = [clean_text(tt) for tt in wiki_train_neutral['comment'].tolist()]
civil_texts = [clean_text(tt) for tt in civil_comments_sampled['comment_text'].tolist()]

We split the texts again, this time into train and valid sets for ILM training.

In [None]:
from sklearn.model_selection import train_test_split
from random import Random

founta_train, founta_valid = train_test_split(founta_texts, test_size=0.05, random_state=random_seed+1)
cad_train, cad_valid = train_test_split(cad_texts, test_size=0.05, random_state=random_seed+2)
wiki_train, wiki_valid = train_test_split(wiki_texts, test_size=0.05, random_state=random_seed+3)
civil_train, civil_valid = train_test_split(civil_texts, test_size=0.05, random_state=random_seed+4)

In [None]:
compound_train = founta_train + cad_train + wiki_train + civil_train
compound_valid = founta_valid + cad_valid + wiki_valid + civil_valid
Random(random_seed+5).shuffle(compound_train)
Random(random_seed+6).shuffle(compound_valid)
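
Shuffling with a dedicated `Random` instance keeps the order reproducible without touching the global random state. A minimal sketch with a toy list (the list contents are illustrative; the seed 47 mirrors `random_seed+5`):

```python
from random import Random

items = list(range(10))

first = items.copy()
Random(47).shuffle(first)    # 47 = 42 + 5, mirroring random_seed + 5

second = items.copy()
Random(47).shuffle(second)   # same seed, fresh generator

print(first == second)         # True: same seed gives the same order
print(sorted(first) == items)  # True: shuffling only reorders the elements
```

Using a different seed for each list, as the cell above does, avoids applying the same permutation pattern to both the train and valid sets.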

In [None]:
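# texts are separated by three newlines; an explicit encoding keeps the output platform-independent
with open("Data/ILM/compound_dataset/train.txt", "w", encoding="utf-8") as ff:
    ff.write("\n\n\n".join(compound_train))

with open("Data/ILM/compound_dataset/valid.txt", "w", encoding="utf-8") as ff:
    ff.write("\n\n\n".join(compound_valid))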
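
Because the texts are joined with three newline characters, they can be recovered by splitting on the same separator. This only works if no individual text contains that separator, which `clean_text` ensures by replacing newlines with spaces. The round trip below is a sketch on stand-in texts written to a temporary file; the path and the texts are hypothetical and not the notebook's.

```python
import os
import tempfile

docs = ["first comment", "second comment", "third comment"]  # stand-ins for cleaned texts
separator = "\n\n\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train.txt")  # hypothetical path inside a temp directory

    # Write the texts the same way as the cell above
    with open(path, "w", encoding="utf-8") as ff:
        ff.write(separator.join(docs))

    # Read them back and split on the separator
    with open(path, encoding="utf-8") as ff:
        recovered = ff.read().split(separator)

print(len(recovered))  # 3
```

A check like this on the real files is a cheap way to confirm that the number of recovered texts matches `len(compound_train)` and `len(compound_valid)`.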