todo
1. optimize hyperparameters
2. clean dataset
    - balance
3. data augmentation
4. add requirements.txt
5. correct title (of file as well)
6. go through all variable names
7. ask which types of groups should "be in focus". Do disabled people count?
8. more advanced duplicate dropping?
9. balancing the dataset needed?
10. dedicated train test split?
11. data augmentation only on train?
12. remove duplicates a second time
13. train test split after tokenization?

# Hate Detection

In [64]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import confusion_matrix, f1_score

from keras.callbacks import EarlyStopping
from keras.layers import SimpleRNN, LSTM, GRU, Bidirectional, Dense, Embedding
from keras.models import Sequential
from keras.preprocessing.text import Tokenizer
from keras.utils import pad_sequences


from random import sample

### Data Exploration

In [65]:
data = pd.read_csv('hate-text.csv')
data

Unnamed: 0,case_id,test_case,label_gold,target_ident
0,1,I hate women.,hateful,women
1,2,I hate trans people.,hateful,trans people
2,3,I hate gay people.,hateful,gay people
3,4,I hate black people.,hateful,black people
4,5,I hate disabled people.,hateful,disabled people
...,...,...,...,...
3896,3897,Gay people are all complete 4ssholes.,hateful,gay people
3897,3898,Black people are all complete 4ssholes.,hateful,black people
3898,3899,Disabled people are all complete 4ssholes.,hateful,disabled people
3899,3900,Muslims are all complete 4ssholes.,hateful,Muslims


The dataset contains some missing values in the target_ident column, which we'll have to deal with later

In [66]:
# Using display() to display both at the same time. Won't comment any future use of display() (for obvious reasons)
display(
    data.isnull().any(),
    data[data.isna().any(axis=1)]
)

case_id         False
test_case       False
label_gold      False
target_ident     True
dtype: bool

Unnamed: 0,case_id,test_case,label_gold,target_ident
1271,1272,"Fucking hell, what a day.",non-hateful,
1272,1273,That's a great fucking idea.,non-hateful,
1273,1274,No fucking way.,non-hateful,
1274,1275,I'm so fucking excited.,non-hateful,
1275,1276,Are you fucking kidding me?,non-hateful,
...,...,...,...,...
3097,3098,Pro lifers should get the fuck out of our coun...,non-hateful,
3098,3099,Most gun dealers are seriously fucked in the h...,non-hateful,
3099,3100,"By nature, corporate accountants are just shit...",non-hateful,
3100,3101,Most real estate agents are absolute bastards.,non-hateful,


No duplicate texts found

In [67]:
print(f'''duplicate texts: {data['test_case'].duplicated().any()}''')

duplicate texts: False
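The check above only catches exact string matches. As a rough sketch of what a "more advanced" duplicate check could look like, the snippet below compares texts after a light normalization (lowercasing, dropping punctuation, collapsing whitespace). The helper names `normalize_text` and `find_near_duplicates` are hypothetical and not part of this notebook, and the normalization rules are only one possible choice.

```python
import re
from collections import defaultdict


def normalize_text(text: str) -> str:
    """Lowercase, replace non-alphanumeric characters with spaces, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return " ".join(text.split())


def find_near_duplicates(texts):
    """Return groups of indices whose texts are identical after normalization."""
    groups = defaultdict(list)
    for index, text in enumerate(texts):
        groups[normalize_text(text)].append(index)
    return [indices for indices in groups.values() if len(indices) > 1]


# Toy example (not taken from the dataset output above)
examples = ["I hate women.", "i hate  women", "I love dogs."]
print(find_near_duplicates(examples))
```

On the real data this could be called as `find_near_duplicates(data['test_case'].tolist())`. Whether texts that differ only in casing or punctuation should count as duplicates is a judgment call for this dataset.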


While the dataset is quite balanced across the target groups, it's still asymmetric when it comes to the number of hateful/non-hateful texts

In [68]:
display(
    pd.DataFrame(data['label_gold'].value_counts()),
    pd.DataFrame(data['target_ident'].value_counts())
)

Unnamed: 0_level_0,count
label_gold,Unnamed: 1_level_1
hateful,2659
non-hateful,1242


Unnamed: 0_level_0,count
target_ident,Unnamed: 1_level_1
gay people,577
women,535
disabled people,510
Muslims,510
black people,504
trans people,485
immigrants,485


No unusually short/long texts

In [69]:
longest_text_length = data['test_case'].str.len().max()
shortest_text_length = data['test_case'].str.len().min()

longest_text_index = data['test_case'].str.len().idxmax()
shortest_text_index = data['test_case'].str.len().idxmin()

print(
    f'''Shortest: {shortest_text_length}, "{data['test_case'][shortest_text_index]}"\n'''
    f'''Longest: {longest_text_length}, "{data['test_case'][longest_text_index]}"'''
)

Shortest: 11, "Ihate women"
Longest: 104, "When will it finally be clear to people that it's in a disabled person's nature to be an annoying prick?"


### Data cleaning

Drop and rename columns to something more suitable

In [70]:
data = data.drop('case_id', axis=1)
data = data.rename(
    columns=
    {
        'test_case': 'text',
        'label_gold': 'label',
        'target_ident': 'target',
    }
)

data.head()

Unnamed: 0,text,label,target
0,I hate women.,hateful,women
1,I hate trans people.,hateful,trans people
2,I hate gay people.,hateful,gay people
3,I hate black people.,hateful,black people
4,I hate disabled people.,hateful,disabled people


It appears as though none of the rows with a missing target contain any relevant hate speech. This can easily be verified by manually looking at the dataset (given its tiny size). But alas, given my horrendous dyslexia, I decided to verify it programmatically as a sanity check

In [71]:
target_search_words = [
    'gay',
    'women',
    'disabled',
    'Muslims',
    'black',
    'trans',
    'immigrants'
]

# Get all rows whose target is NaN
data_target_nans = data[data['target'].isna()]

for index, row in data_target_nans.iterrows():
    for word in target_search_words:
        # Split by words and search each one
        if word in row['text'].split(' '):
            print(f'''found target word in: "{row['text']}" row''')
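One caveat with the check above: splitting on spaces is case-sensitive and keeps punctuation attached, so a token like "women." or "Women" would not match "women". A slightly more forgiving variant, sketched below, uses a case-insensitive regular expression with word boundaries. The helper name `find_target_words` is hypothetical. It still only matches the exact word forms in the list (for example "Muslims" but not "Muslim"), so it remains a sanity check rather than a proof.

```python
import re

target_search_words = [
    'gay',
    'women',
    'disabled',
    'Muslims',
    'black',
    'trans',
    'immigrants'
]

# \b anchors the match to whole words; IGNORECASE also catches "Women", "GAY", ...
target_pattern = re.compile(
    r"\b(" + "|".join(re.escape(word) for word in target_search_words) + r")\b",
    flags=re.IGNORECASE,
)


def find_target_words(text):
    """Return all target words found in the text, lowercased."""
    return [match.lower() for match in target_pattern.findall(text)]


print(find_target_words("No fucking way."))
print(find_target_words("I hate Women."))
```

Applied to the notebook's data, something like `data_target_nans['text'].apply(find_target_words)` would give a list of hits per row.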

I'm now confident in dropping all rows with target NaN values

In [72]:
data = data.dropna()

Balance the dataset

In [73]:
hateful = data['label'].value_counts()['hateful']
non_hateful = data['label'].value_counts()['non-hateful']

# Because we've got far more hateful texts
number_of_rows_to_drop = hateful - non_hateful

# Randomly sample the surplus hateful rows and drop them
rows_indexes_to_remove = data[data['label'] == 'hateful'].sample(n=number_of_rows_to_drop).index
data = data.drop(rows_indexes_to_remove)

display(
    pd.DataFrame(data['label'].value_counts()),
    pd.DataFrame(data['target'].value_counts())
)

Unnamed: 0_level_0,count
label,Unnamed: 1_level_1
hateful,947
non-hateful,947


Unnamed: 0_level_0,count
target,Unnamed: 1_level_1
gay people,324
women,289
Muslims,274
black people,267
immigrants,259
disabled people,251
trans people,230
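Because `sample()` is called without a seed, a different set of hateful rows is dropped on every run, so the per-target counts above will vary between runs. A minimal sketch of a reproducible variant is shown below, assuming a fixed seed is acceptable. The function name `undersample_majority` and the seed value are my own choices, and the toy frame is made up for illustration.

```python
import pandas as pd


def undersample_majority(frame, label_column="label", seed=42):
    """Downsample every class to the size of the smallest class, reproducibly."""
    smallest_class_size = frame[label_column].value_counts().min()
    balanced = frame.groupby(label_column).sample(
        n=smallest_class_size, random_state=seed
    )
    # Shuffle so the classes are not stored in contiguous blocks
    return balanced.sample(frac=1, random_state=seed).reset_index(drop=True)


# Toy data, only to show the behaviour
toy = pd.DataFrame({
    "text": ["a", "b", "c", "d", "e", "f", "g"],
    "label": ["hateful"] * 5 + ["non-hateful"] * 2,
})
balanced_toy = undersample_majority(toy)
print(balanced_toy["label"].value_counts())
```

This also folds the shuffle into the same step. Passing `random_state` to the existing `sample()` calls in this notebook would achieve the same reproducibility with fewer changes.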


Shuffle the dataset

In [74]:
# Shuffles and returns the entire dataset as frac=1
data = data.sample(frac=1)

# Reset index after having both balanced and shuffled
data = data.reset_index(drop=True)
data

Unnamed: 0,text,label,target
0,We should not let Muslims suffer.,non-hateful,Muslims
1,I have no respect for people who write things ...,non-hateful,disabled people
2,My friend's bitch is not like other dogs. She ...,non-hateful,women
3,There is nothing in the world I like less than...,hateful,women
4,I do not know many trans people.,non-hateful,trans people
...,...,...,...
1889,Immigrants are just the dumbest people.,hateful,immigrants
1890,There is nothing worse than a Muslim.,hateful,Muslims
1891,"I absolutely loathe people, especially if they...",hateful,black people
1892,There have to be consequences for saying you w...,non-hateful,gay people


In [75]:
print(
    f'''Shortest: {data['text'].str.len().min()}\n'''
    f'''Longest: {data['text'].str.len().max()}'''
)

Shortest: 13
Longest: 100


In [76]:
max_words_to_use = 1000

# Tokenize the text data (convert them into "sequences")
tokenizer = Tokenizer(num_words=max_words_to_use) # Consider only using the top 1000 words
tokenizer.fit_on_texts(data['text'])
tokenized_texts = tokenizer.texts_to_sequences(data['text'])

# printing using loop for easier viewing
for i in range(5):
    print(f'seq {i}: {tokenized_texts[i]}')

seq 0: [27, 28, 19, 142, 24, 133]
seq 1: [6, 29, 41, 82, 20, 1, 73, 107, 108, 7, 64, 67, 552, 12, 30, 46]
seq 2: [38, 734, 134, 17, 19, 7, 119, 635, 224, 735, 74, 20, 736]
seq 3: [52, 17, 44, 12, 10, 135, 6, 7, 240, 54, 4, 96]
seq 4: [6, 59, 19, 185, 67, 16, 1]
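To make the numbers above less opaque: the tokenizer assigns each word an integer rank by frequency (1 = most frequent), and words whose rank is not below `num_words` are dropped from the sequences. The pure-Python sketch below imitates that idea on toy sentences. It is a simplified illustration, not the Keras implementation, and the punctuation filter and helper names are assumptions of mine.

```python
import string
from collections import Counter

# Simplified filter: replace punctuation (except apostrophes) with spaces
PUNCTUATION_TO_SPACE = str.maketrans(
    {char: " " for char in string.punctuation if char != "'"}
)


def split_words(text):
    """Lowercase, strip punctuation and split on whitespace."""
    return text.lower().translate(PUNCTUATION_TO_SPACE).split()


def build_word_index(texts):
    """Map each word to its frequency rank, starting at 1 for the most frequent."""
    counts = Counter()
    for text in texts:
        counts.update(split_words(text))
    return {
        word: rank
        for rank, (word, _) in enumerate(counts.most_common(), start=1)
    }


def to_sequences(texts, word_index, num_words=None):
    """Convert texts to lists of ranks, skipping words with rank >= num_words."""
    sequences = []
    for text in texts:
        ranks = [word_index[word] for word in split_words(text) if word in word_index]
        if num_words is not None:
            ranks = [rank for rank in ranks if rank < num_words]
        sequences.append(ranks)
    return sequences


toy_texts = ["I hate rain.", "I love rain", "Rain again"]
toy_index = build_word_index(toy_texts)
print(toy_index)
print(to_sequences(toy_texts, toy_index, num_words=3))
```

With `num_words=3`, only the two most frequent toy words ("rain" and "i") survive, which is the same effect `max_words_to_use` has on rare words in the real data.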


The longest series of tokenized words is only 19 items. Thus, we'll set our padding length to 20.

In [77]:
longest_tokenized_text = max(tokenized_texts, key=len)
len(longest_tokenized_text)

19

In [79]:
# Use 20 as max sequence length, given that the longest tokenized text is only 19 tokens long
max_sequence_length = 20

# Pad the sequences to make them of uniform length
tokenized_padded_texts = pad_sequences(tokenized_texts, maxlen=max_sequence_length)

# printing using loop for easier viewing
for i in range(5):
    print(f'seq {i}: {tokenized_padded_texts[i]}')

seq 0: [  0   0   0   0   0   0   0   0   0   0   0   0   0   0  27  28  19 142
  24 133]
seq 1: [  0   0   0   0   6  29  41  82  20   1  73 107 108   7  64  67 552  12
  30  46]
seq 2: [  0   0   0   0   0   0   0  38 734 134  17  19   7 119 635 224 735  74
  20 736]
seq 3: [  0   0   0   0   0   0   0   0  52  17  44  12  10 135   6   7 240  54
   4  96]
seq 4: [  0   0   0   0   0   0   0   0   0   0   0   0   0   6  59  19 185  67
  16   1]
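As the output shows, `pad_sequences` by default adds the zeros in front of each sequence ("pre" padding) and, for sequences longer than `maxlen`, keeps the last `maxlen` tokens. The small sketch below reproduces that behaviour with plain lists, just to make the mechanics explicit. The function `pad_pre` is a hypothetical stand-in, not a Keras function, and it assumes `maxlen` is a positive integer.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value`; keep only the last `maxlen` tokens."""
    padded = []
    for sequence in sequences:
        trimmed = list(sequence)[-maxlen:]
        padded.append([value] * (maxlen - len(trimmed)) + trimmed)
    return padded


# First tokenized sequence from the output above
example = [[27, 28, 19, 142, 24, 133]]
print(pad_pre(example, maxlen=20)[0])
```

Since zero is used for padding, index 0 never refers to a real word, which is worth keeping in mind when sizing an embedding layer.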
