# Data Preprocessing

This notebook preprocesses the data from Davidson et al. (2017) and saves it to disk for future use.

It saves two sets of labels: one with the original hate speech class, and another that breaks the hate class down into directed and generalized hate speech.


In [1]:
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from utils import preprocess, create_lookup_tables, create_pad_fn, pad_tweets,\
                  hate_classification, change_hate_labels

In [2]:
df = pd.read_csv("./data/labeled_data.csv", index_col=0)
raw_tweets = df.tweet
raw_labels = df["class"].values

In [3]:
df.head()

Unnamed: 0,count,hate_speech,offensive_language,neither,class,tweet
0,3,0,0,3,2,!!! RT @mayasolovely: As a woman you shouldn't...
1,3,0,3,0,1,!!!!! RT @mleew17: boy dats cold...tyga dwn ba...
2,3,0,3,0,1,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...
3,3,0,2,1,1,!!!!!!!!! RT @C_G_Anderson: @viva_based she lo...
4,6,0,6,0,1,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...


### Data Cleaning 

In [4]:
tweets = raw_tweets.map(preprocess)
print("Example of a raw tweet:\n{}".format(raw_tweets[68]))
print("\nIts cleaned version is:\n{}".format(preprocess(raw_tweets[68])))

Example of a raw tweet:
"@Almightywayne__: @JetsAndASwisher @Gook____ bitch fuck u http://t.co/pXmGA68NC1" maybe you'll get better. Just http://t.co/TPreVwfq0S

Its cleaned version is:
 ||Quotation_Mark|| MENTIONHERE : MENTIONHERE MENTIONHERE bitch fuck u URLHERE ||Quotation_Mark|| maybe you'll get better ||Period|| just URLHERE 
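
`preprocess` is imported from `utils`, so its implementation isn't shown here. Judging from the output above, it lowercases the text and swaps mentions, URLs, and punctuation for placeholder tokens. Below is a minimal sketch of that idea. The name `preprocess_sketch`, the regular expressions, and the token table are assumptions made for illustration, and the sketch won't necessarily reproduce the real function's output.

```python
import re

# Assumed punctuation-to-token table; the real one lives in utils.preprocess.
PUNCTUATION_TOKENS = {
    ".": "||Period||",
    ",": "||Comma||",
    '"': "||Quotation_Mark||",
    "!": "||Exclamation_Mark||",
    "\n": "||Return||",
}


def preprocess_sketch(text):
    """Lowercase a tweet and replace URLs, mentions, hashtags and punctuation."""
    text = text.lower()
    # URLs go first: they contain periods that must not become ||Period|| tokens.
    text = re.sub(r'https?://[^\s"]+', " URLHERE ", text)
    text = re.sub(r"@\w+", " MENTIONHERE ", text)
    text = re.sub(r"#\w+", " HASHTAGHERE ", text)
    for symbol, token in PUNCTUATION_TOKENS.items():
        text = text.replace(symbol, " {} ".format(token))
    # Collapse the extra whitespace introduced by the padded replacements.
    return " ".join(text.split())


print(preprocess_sketch('"@user: Hello, world!" http://t.co/abc'))
```

Lowercasing before substituting keeps the placeholder tokens in upper case, which makes them easy to tell apart from ordinary words in the vocabulary.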


### Checking for outliers

In [5]:
# Get cleaned tweets
df["clean_tweet"] = tweets

# Get their word count
df["word_count"] = df.clean_tweet.apply(lambda x : len(x.split()))

df.word_count.describe()

count    24783.000000
mean        16.729936
std          8.445555
min          1.000000
25%         10.000000
50%         16.000000
75%         23.000000
max         95.000000
Name: word_count, dtype: float64

In [6]:
# Check tweets with the minimum word count
df.loc[df.word_count == df.word_count.min(),]

Unnamed: 0,count,hate_speech,offensive_language,neither,class,tweet,clean_tweet,word_count
821,3,0,0,3,2,#Yankees,HASHTAGHERE,1
24147,3,0,3,0,1,bitches,bitches,1
24218,3,3,0,0,0,coons,coons,1
24869,3,0,3,0,1,pussy,pussy,1


Looks good. Let's check the tweet(s) with the maximum word count.

In [7]:
# Check tweets with the maximum length
df.loc[df.word_count == df.word_count.max(),]

Unnamed: 0,count,hate_speech,offensive_language,neither,class,tweet,clean_tweet,word_count
22953,3,0,0,3,2,Was finna slit my eyebrows up in the shop but ...,was finna slit my eyebrows up in the shop but ...,95


Something strange is going on: 95 tokens is a lot for a single tweet. Let's look at the raw text.

In [8]:
df.loc[df.word_count == df.word_count.max(),].tweet.values

array(['Was finna slit my eyebrows up in the shop but nahhhhhh.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.'],
      dtype=object)

The tweet contains a long run of line breaks and periods. It's hard to know why, but I'll cut the tweet off at the first line break.

In [9]:
# Locate the outlier once, so later assignments don't rely on a mask
# that is recomputed after each change
idx = df.word_count.idxmax()
old_tweet = df.loc[idx, "tweet"]
new_tweet = old_tweet[:old_tweet.find("\r")]  # keep only the text before the first line break
df.loc[idx, "tweet"] = new_tweet
df.loc[idx, "clean_tweet"] = preprocess(new_tweet)
df.loc[idx, "word_count"] = len(preprocess(new_tweet).split())

Let's check it again.

In [10]:
# Check tweets with the maximum length
df.loc[df.word_count == df.word_count.max(),]

Unnamed: 0,count,hate_speech,offensive_language,neither,class,tweet,clean_tweet,word_count
18267,3,0,3,0,1,RT @TrxllLegend: One good girl is worth a thou...,rt MENTIONHERE : one good girl is worth a thou...,91


In [11]:
df.loc[df.word_count == df.word_count.max(),].clean_tweet.values[0]

'rt MENTIONHERE : one good girl is worth a thousand bitches ||Return|| ||Return|| 👰 = 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 … '

Sigh. Well, format-wise it is okay.

### Create lookup tables

In [12]:
vocab_to_int, int_to_vocab = create_lookup_tables(tweets)

In [13]:
print("The size of the vocabulary is: {} tokens.".format(len(vocab_to_int)))
vocab = list(vocab_to_int.keys())
np.random.shuffle(vocab)
print("These are 10 randomly sampled words in the vocabulary:\n{}".format(vocab[:10]))
del vocab

The size of the vocabulary is: 21134 tokens.
These are 10 randomly sampled words in the vocabulary:
['parties', 'unt', 'spoiling', 'talking', 'wilde', 'together', 'pubeless', '<', 'helmsley', 'check']
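
`create_lookup_tables` also comes from `utils`. Since the vocabulary is built from the unpadded tweets while the tokenized output further down maps `<PAD>` to 0, the function presumably reserves an id for the padding token. A minimal sketch under that assumption follows. The name `create_lookup_tables_sketch` and the alphabetical id order are hypothetical, and the real function may order ids differently.

```python
def create_lookup_tables_sketch(texts, pad_token="<PAD>"):
    """Map every whitespace-separated token to an integer id and back."""
    words = sorted({word for text in texts for word in text.split()})
    # Reserve id 0 for the padding token (assumption based on the padded output).
    vocab = [pad_token] + [word for word in words if word != pad_token]
    vocab_to_int = {word: idx for idx, word in enumerate(vocab)}
    int_to_vocab = {idx: word for word, idx in vocab_to_int.items()}
    return vocab_to_int, int_to_vocab


vocab_to_int_demo, int_to_vocab_demo = create_lookup_tables_sketch(["b a", "a c"])
print(vocab_to_int_demo)
```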


### Padding the Data

In [14]:
# Hemker sets the sentence length to 100. We'll do the same here for consistency, but
# note that the max sentence length in our dataset is actually 91.
MAX_LENGTH = 100  # df.word_count.max()
pad_fn = create_pad_fn(MAX_LENGTH)  # new name, so the pad_tweets imported from utils isn't shadowed
df["padded_tweets"] = df.clean_tweet.map(pad_fn)
print(df.padded_tweets[10])

<PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>  ||Quotation_Mark|| keeks is a bitch she curves everyone ||Quotation_Mark|| lol i walked into a conversation like this ||Period|| smh
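
The output above shows left-padding: `<PAD>` tokens are prepended until the tweet reaches `MAX_LENGTH` tokens. A closure is one natural way to write such a factory. The sketch below is an assumption about the shape of `create_pad_fn`, not its actual code. In particular, raising an error on over-long tweets is a choice made here, and the real helper may truncate instead.

```python
def create_pad_fn_sketch(max_length, pad_token="<PAD>"):
    """Return a function that left-pads a tweet to exactly max_length tokens."""
    def pad(text):
        words = text.split()
        n_missing = max_length - len(words)
        if n_missing < 0:
            # Assumption: over-long tweets are treated as an error, not truncated.
            raise ValueError("tweet has more than {} tokens".format(max_length))
        return " ".join([pad_token] * n_missing + words)
    return pad


pad_demo = create_pad_fn_sketch(5)
print(pad_demo("a b"))
```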


### Tokenizing the Data

In [15]:
tweets_ints = np.array([[vocab_to_int[word] for word in tweet.split()] for tweet in df.padded_tweets.values])
print(tweets_ints[10])

[    0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0  3621  5100   958
 20469 21124  7434  4077 20029  3621 19519 16692 20201 16706 20469  8061
 11248 20570 16880  5586]
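
A quick way to sanity-check the encoding is to map an id sequence back through the inverse table and compare it with the original text. The sketch below uses a toy vocabulary rather than the notebook's tables; the helper names `encode` and `decode` are hypothetical.

```python
toy_vocab_to_int = {"<PAD>": 0, "hello": 1, "world": 2}
toy_int_to_vocab = {idx: word for word, idx in toy_vocab_to_int.items()}


def encode(padded_text, vocab_to_int):
    """Turn a padded, space-separated tweet into a list of integer ids."""
    return [vocab_to_int[word] for word in padded_text.split()]


def decode(ids, int_to_vocab, pad_token="<PAD>"):
    """Turn ids back into text, dropping the padding tokens."""
    words = [int_to_vocab[idx] for idx in ids]
    return " ".join(word for word in words if word != pad_token)


ids = encode("<PAD> <PAD> hello world", toy_vocab_to_int)
print(ids)
print(decode(ids, toy_int_to_vocab))
```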


### Hate Subclass Extraction

In [16]:
hate_tweets = tweets[df["class"] == 0].values
_hate_prnt = lambda x : "Generalized" if hate_classification(x) == 4 else "Directed"

print("Example of a hateful tweet: \n{}".format(hate_tweets[20]))
print("Its type of hate speech is: {}\n".format(_hate_prnt(hate_tweets[20])))

print("Example of a hateful tweet:\n{}".format(hate_tweets[10]))
print("Its type of hate speech is: {}\n".format(_hate_prnt(hate_tweets[10])))

Example of a hateful tweet: 
 ||Quotation_Mark|| we're out here ||Comma|| and we're queer ||Exclamation_Mark|| ||Quotation_Mark|| ||Return|| ||Quotation_Mark|| 2 ||Comma|| 4 ||Comma|| 6 ||Comma|| hut ||Exclamation_Mark|| we like it in our butt ||Exclamation_Mark|| ||Quotation_Mark|| 
Its type of hate speech is: Generalized

Example of a hateful tweet:
 ||Quotation_Mark|| MENTIONHERE : jackies a retard HASHTAGHERE ||Quotation_Mark|| at least i can make a grilled cheese ||Exclamation_Mark|| 
Its type of hate speech is: Directed



### Change hate labels

In [17]:
labels = change_hate_labels(tweets, raw_labels)
pd.Series(labels).value_counts()

0    19190
1     4163
2      954
3      476
dtype: int64
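
These counts are consistent with the following remapping: the offensive class (1) becomes 0, the neither class (2) becomes 1, and the hate class (0) is split into labels 2 and 3, since 24783 - 19190 - 4163 = 1430 = 954 + 476. `change_hate_labels` itself is not shown, so the sketch below is only a guess at its shape. `relabel_sketch` and the toy `mention_based` rule are hypothetical, and the notebook does not state which of 2 and 3 stands for directed versus generalized hate.

```python
def relabel_sketch(texts, labels, classify_hate):
    """Remap the original 3-class labels to 4 classes by splitting the hate class."""
    new_labels = []
    for text, label in zip(texts, labels):
        if label == 1:      # offensive language
            new_labels.append(0)
        elif label == 2:    # neither
            new_labels.append(1)
        else:               # hate speech: the subclass rule must return 2 or 3
            new_labels.append(classify_hate(text))
    return new_labels


def mention_based(text):
    # Toy rule for this demo only; it is NOT the notebook's hate_classification.
    return 2 if "MENTIONHERE" in text else 3


demo_labels = relabel_sketch(
    ["MENTIONHERE is x", "they are x", "nice day", "some insult"],
    [0, 0, 2, 1],
    mention_based,
)
print(demo_labels)
```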

### Save Files

In [18]:
np.save("data/tweets", tweets_ints)
np.save("data/hate_original", raw_labels)
np.save("data/hate_breakdown", labels)

In [19]:
with open('vocab_to_int.json', 'w') as f:
    json.dump(vocab_to_int, f)
    
with open('int_to_vocab.json', 'w') as f:
    json.dump(int_to_vocab, f)
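
One caveat when reading these files back: JSON object keys are always strings, so the integer keys of `int_to_vocab` come back as `"0"`, `"1"`, and so on. The sketch below shows the round trip with a toy dictionary (not the notebook's table) and converts the keys back to `int` on load.

```python
import json

toy_int_to_vocab = {0: "<PAD>", 1: "hello"}

# Serialize and parse again, as json.dump / json.load would do via a file.
loaded = json.loads(json.dumps(toy_int_to_vocab))

# Keys are strings after the round trip, so convert them back explicitly.
restored = {int(idx): word for idx, word in loaded.items()}

print(loaded)
print(restored)
```

Also note that `np.save` appends the `.npy` extension when the file name lacks it, so the arrays end up in `data/tweets.npy`, `data/hate_original.npy`, and `data/hate_breakdown.npy`.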