todo
1. optimize hyperparameters
2. clean dataset
    - balance
3. data augmentation
4. add requirements.txt
5. correct title (of file as well)
6. go through all variable names
7. ask which types of groups should "be in focus". Do disabled people count?
8. more advanced duplicate dropping?
9. balancing the dataset needed?
10. dedicated train test split?

# Hate Detection

In [943]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import confusion_matrix, f1_score

from keras.callbacks import EarlyStopping
from keras.layers import SimpleRNN, LSTM, GRU, Bidirectional, Dense, Embedding
from keras.models import Sequential
from keras.preprocessing.text import Tokenizer
from keras.utils import pad_sequences

### Data Exploration

In [944]:
data = pd.read_csv('hate-text.csv')
data

Unnamed: 0,case_id,test_case,label_gold,target_ident
0,1,I hate women.,hateful,women
1,2,I hate trans people.,hateful,trans people
2,3,I hate gay people.,hateful,gay people
3,4,I hate black people.,hateful,black people
4,5,I hate disabled people.,hateful,disabled people
...,...,...,...,...
3896,3897,Gay people are all complete 4ssholes.,hateful,gay people
3897,3898,Black people are all complete 4ssholes.,hateful,black people
3898,3899,Disabled people are all complete 4ssholes.,hateful,disabled people
3899,3900,Muslims are all complete 4ssholes.,hateful,Muslims


The dataset contains some missing values in the target_ident column, which we'll have to deal with later

In [994]:
# display() shows both results from a single cell. Later uses of display() are not commented again
display(
    data.isnull().any(),
    data[data.isna().any(axis=1)]
)

text      False
label     False
target     True
dtype: bool

Unnamed: 0,text,label,target
31,He did really piss me off by doing that.,non-hateful,
32,"Fuck this, I am done.",non-hateful,
33,Labour organisers are subhuman creatures.,non-hateful,
34,I can be a bit of an asshole sometimes.,non-hateful,
55,You should fuck off forever.,non-hateful,
...,...,...,...
2442,I am trying to like sushi but it is just too r...,non-hateful,
2446,"Fucking hell, what a day.",non-hateful,
2456,Swarms of lobbyists are infesting our governme...,non-hateful,
2465,Those marketing executives should be burned at...,non-hateful,


No duplicate texts found

In [946]:
print(f'''duplicate texts: {data['test_case'].duplicated().any()}''')

duplicate texts: False
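
`duplicated()` only catches exact string matches. A minimal sketch of a looser check, assuming that case and punctuation should be ignored when comparing texts (`toy` is a hypothetical stand-in for `data`, and the normalization rules are an assumption, not something the dataset prescribes):

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `data`
toy = pd.DataFrame(
    {'test_case': ['I hate women.', 'i hate women', 'I love sushi.']}
)

# Lower-case, drop everything except letters, digits and whitespace,
# then collapse runs of whitespace
normalized = (
    toy['test_case']
    .str.lower()
    .str.replace(r'[^a-z0-9\s]', '', regex=True)
    .str.replace(r'\s+', ' ', regex=True)
    .str.strip()
)

# keep=False marks every member of a duplicate group, not just the repeats
near_duplicates = toy[normalized.duplicated(keep=False)]
print(near_duplicates)
```

Obfuscated spellings such as "4ssholes" still count as distinct texts here, which is probably what we want for this dataset.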


While the dataset is quite balanced across the target groups, it is still asymmetric in the number of hateful vs. non-hateful texts (2659 vs. 1242)

In [947]:
display(
    pd.DataFrame(data['label_gold'].value_counts()),
    pd.DataFrame(data['target_ident'].value_counts())
)

label_gold,count
hateful,2659
non-hateful,1242


target_ident,count
gay people,577
women,535
disabled people,510
Muslims,510
black people,504
trans people,485
immigrants,485


No unusually short/long texts

In [948]:
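# Note: iloc[0] and iloc[-1] are the first and last rows, so these are the
# lengths of those two texts, not the minimum and maximum over the column
print(
    f'Shortest: {len(data["test_case"].iloc[0])}\n'
    f'Longest: {len(data["test_case"].iloc[-1])}'
)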

Shortest: 14
Longest: 38
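
The cell above reads the lengths of the first and last rows only. A minimal sketch of measuring the true extremes over the whole column (`toy` is a hypothetical stand-in for `data`):

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `data`
toy = pd.DataFrame(
    {'test_case': ['I hate women.', 'No.', 'Black people are all complete 4ssholes.']}
)

# Character length of every text, then the extremes over the whole column
lengths = toy['test_case'].str.len()
shortest, longest = lengths.min(), lengths.max()

print(f'Shortest: {shortest}\nLongest: {longest}')
```

`lengths.describe()` would additionally give the mean and quartiles, which helps when choosing a padding length later.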


### Data preprocessing

Drop and rename columns to something more suitable

In [949]:
data = data.drop('case_id', axis=1)
data = data.rename(
    columns={
        'test_case': 'text',
        'label_gold': 'label',
        'target_ident': 'target',
    }
)

data.head()

Unnamed: 0,text,label,target
0,I hate women.,hateful,women
1,I hate trans people.,hateful,trans people
2,I hate gay people.,hateful,gay people
3,I hate black people.,hateful,black people
4,I hate disabled people.,hateful,disabled people


It appears as though none of the rows with missing target data contain any relevant hate speech. This can easily be verified by manually looking at the dataset (given its tiny size). But alas, given my horrendous dyslexia, I decided to verify it programmatically as a sanity check

In [992]:
target_search_words = [
    'gay',
    'women',
    'disabled',
    'Muslims',
    'black',
    'trans',
    'immigrants'
]

# Get all rows whose target is NaN
data_target_nans = data[data['target'].isna()]

for index, row in data_target_nans.iterrows():
    for word in target_search_words:
        # Split on spaces and look for an exact (case-sensitive) token match
        if word in row['text'].split(' '):
            print(f'''found target word in: "{row['text']}" row''')
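
Because the check above compares exact, case-sensitive tokens, a text ending in "women." (with the full stop attached) would not match "women". A slightly stricter variant, as a sketch on toy data (`toy_nans` is a hypothetical stand-in for `data_target_nans`, and the regex approach is an assumption about what counts as a match):

```python
import re
import pandas as pd

target_search_words = [
    'gay', 'women', 'disabled', 'Muslims', 'black', 'trans', 'immigrants'
]

# Hypothetical stand-in for `data_target_nans`
toy_nans = pd.DataFrame(
    {'text': [
        'Fucking hell, what a day.',
        'I respect women.',
        'Public transport is awful.',
    ]}
)

# \b...\b matches whole words only, so "transport" does not count as "trans";
# the non-capturing group (?:...) avoids pandas' match-group warning
pattern = r'\b(?:' + '|'.join(re.escape(w) for w in target_search_words) + r')\b'

# case=False makes the search case-insensitive
matches = toy_nans[toy_nans['text'].str.contains(pattern, case=False, regex=True)]
print(matches)
```

An empty `matches` frame on the real data would support dropping the rows. Anything it does print deserves a manual look.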

I'm now confident in dropping all rows with target NaN values

In [954]:
data[data.isna().any(axis=1)]

Unnamed: 0,text,label,target
1271,"Fucking hell, what a day.",non-hateful,
1272,That's a great fucking idea.,non-hateful,
1273,No fucking way.,non-hateful,
1274,I'm so fucking excited.,non-hateful,
1275,Are you fucking kidding me?,non-hateful,
...,...,...,...
3097,Pro lifers should get the fuck out of our coun...,non-hateful,
3098,Most gun dealers are seriously fucked in the h...,non-hateful,
3099,"By nature, corporate accountants are just shit...",non-hateful,
3100,Most real estate agents are absolute bastards.,non-hateful,


Balance the dataset

In [955]:
# Randomly sample 1417 "hateful" rows (2659 - 1242, the surplus over "non-hateful") and drop them
rows_indexes_to_remove = data[data['label'] == 'hateful'].sample(n=1417).index
data = data.drop(rows_indexes_to_remove)

# Using display() to display both at the same time
display(
    pd.DataFrame(data['label'].value_counts()),
    pd.DataFrame(data['target'].value_counts())
)

label,count
hateful,1242
non-hateful,1242


target,count
gay people,369
women,326
black people,315
disabled people,301
Muslims,299
immigrants,298
trans people,281
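
The hard-coded `n=1417` only holds for the exact label counts shown earlier. A sketch that derives the number from the data and fixes the random seed so the result is reproducible (`toy` is a hypothetical stand-in for `data`, and the seed value 42 is arbitrary):

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `data`: 7 hateful, 3 non-hateful
toy = pd.DataFrame(
    {
        'text': [f'text {i}' for i in range(10)],
        'label': ['hateful'] * 7 + ['non-hateful'] * 3,
    }
)

# Surplus of the majority class over the minority class
counts = toy['label'].value_counts()
n_excess = int(counts['hateful'] - counts['non-hateful'])

# random_state makes the same rows get dropped on every run
rows_to_remove = (
    toy[toy['label'] == 'hateful']
    .sample(n=n_excess, random_state=42)
    .index
)
balanced = toy.drop(rows_to_remove)

print(balanced['label'].value_counts())
```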


Shuffle the dataset

In [956]:
# frac=1 samples every row, i.e. returns the whole dataset in shuffled order
data = data.sample(frac=1)

# Reset index after having both balanced and shuffled
data = data.reset_index(drop=True)
data

Unnamed: 0,text,label,target
0,Love to upset homophobes by being a fag.,non-hateful,gay people
1,"If I see a comment like ""you disabled people m...",non-hateful,disabled people
2,I will embrace you and any other black people ...,non-hateful,black people
3,Three disabled people were elected to parliame...,non-hateful,disabled people
4,Gay lives aren't worth sh1t.,hateful,gay people
...,...,...,...
2479,I believe immigrant lives aren't worth shit.,hateful,immigrants
2480,I have such strong admiration for immigrants l...,non-hateful,immigrants
2481,"You can hate all you want, I'm not going anywh...",non-hateful,gay people
2482,I have no respect for people who write things ...,non-hateful,gay people


In [957]:
max_words_to_use = 1000

# Tokenize the text data (convert them into "sequences")
tokenizer = Tokenizer(num_words=max_words_to_use)  # Only consider the most frequent words, capped by max_words_to_use
tokenizer.fit_on_texts(data['text'])
tokenized_data = tokenizer.texts_to_sequences(data['text'])

# printing using loop for easier viewing
for i in range(5):
    print(f'seq {i}: {tokenized_data[i]}')

seq 0: [143, 3, 810, 595, 97, 88, 4, 191]
seq 1: [46, 6, 83, 4, 371, 7, 5, 17, 1, 60, 20, 137, 6, 33, 372, 42]
seq 2: [6, 33, 373, 5, 37, 32, 93, 13, 1, 7, 5]
seq 3: [333, 17, 1, 124, 374, 3, 375, 149]
seq 4: [15, 39, 120, 52, 376]
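
To make the numbers above less opaque: each integer is a word's frequency rank (1 = most frequent), and words ranked at or beyond `num_words` are dropped. A simplified pure-Python sketch of that idea follows. It is not the Keras implementation, it ignores punctuation filtering, and `fit_word_index` and `to_sequences` are hypothetical helper names:

```python
from collections import Counter


def fit_word_index(texts):
    """Rank words by descending frequency, starting at index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # sorted() is stable, so ties keep their first-seen order
    ranked = sorted(counts, key=lambda word: -counts[word])
    return {word: rank for rank, word in enumerate(ranked, start=1)}


def to_sequences(texts, word_index, num_words):
    """Replace each word by its rank, skipping ranks >= num_words."""
    return [
        [
            word_index[word]
            for word in text.lower().split()
            if word in word_index and word_index[word] < num_words
        ]
        for text in texts
    ]


texts = ['i hate mondays', 'i love sushi', 'i love mondays']
word_index = fit_word_index(texts)
sequences = to_sequences(texts, word_index, num_words=4)

for i, seq in enumerate(sequences):
    print(f'seq {i}: {seq}')
```

With `num_words=4` only ranks 1 to 3 survive, so the rarer words "hate" and "sushi" vanish from the sequences.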
