todo
1. optimize hyperparameters
2. clean dataset
    - balance
3. data augmentation
4. add requirements.txt
5. correct title (of file as well)
6. go through all variable names
7. ask which types of groups should "be in focus". Do disabled people count?
8. more advanced duplicate dropping?
9. balancing the dataset needed?
10. dedicated train test split?

# Hate Detection

In [172]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import confusion_matrix, f1_score

from keras.callbacks import EarlyStopping
from keras.layers import SimpleRNN, LSTM, GRU, Bidirectional, Dense, Embedding
from keras.models import Sequential
from keras.preprocessing.text import Tokenizer
from keras.utils import pad_sequences


from random import sample

### Data Exploration

In [173]:
data = pd.read_csv('hate-text.csv')
data

Unnamed: 0,case_id,test_case,label_gold,target_ident
0,1,I hate women.,hateful,women
1,2,I hate trans people.,hateful,trans people
2,3,I hate gay people.,hateful,gay people
3,4,I hate black people.,hateful,black people
4,5,I hate disabled people.,hateful,disabled people
...,...,...,...,...
3896,3897,Gay people are all complete 4ssholes.,hateful,gay people
3897,3898,Black people are all complete 4ssholes.,hateful,black people
3898,3899,Disabled people are all complete 4ssholes.,hateful,disabled people
3899,3900,Muslims are all complete 4ssholes.,hateful,Muslims


The dataset contains some missing values in the target_ident column, which we'll have to deal with later

In [174]:
# Using display() to display both at the same time. Won't comment any future use of display() (for obvious reasons)
display(
    data.isnull().any(),
    data[data.isna().any(axis=1)]
)

case_id         False
test_case       False
label_gold      False
target_ident     True
dtype: bool

Unnamed: 0,case_id,test_case,label_gold,target_ident
1271,1272,"Fucking hell, what a day.",non-hateful,
1272,1273,That's a great fucking idea.,non-hateful,
1273,1274,No fucking way.,non-hateful,
1274,1275,I'm so fucking excited.,non-hateful,
1275,1276,Are you fucking kidding me?,non-hateful,
...,...,...,...,...
3097,3098,Pro lifers should get the fuck out of our coun...,non-hateful,
3098,3099,Most gun dealers are seriously fucked in the h...,non-hateful,
3099,3100,"By nature, corporate accountants are just shit...",non-hateful,
3100,3101,Most real estate agents are absolute bastards.,non-hateful,


No duplicate texts found

In [175]:
print(f'''duplicate texts: {data['test_case'].duplicated().any()}''')

duplicate texts: False
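
The check above only catches exact string matches. A minimal sketch of a stricter check, assuming near-duplicates differ only in case, punctuation, or spacing; the `normalize` helper and the toy frame are hypothetical stand-ins for `data`:

```python
import re

import pandas as pd


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


# Toy frame standing in for `data`; the column name matches the notebook.
toy = pd.DataFrame({"test_case": ["I hate rain.", "i hate  rain", "No way."]})

normalized = toy["test_case"].map(normalize)
# keep=False flags every member of a duplicate group, not only the repeats
near_duplicates = toy[normalized.duplicated(keep=False)]
print(near_duplicates)
```

Whether such rows should be dropped is a separate decision, since template-generated texts can legitimately differ by a single character.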


While the dataset is fairly balanced across the target groups, it is still imbalanced when it comes to the number of hateful vs. non-hateful texts

In [176]:
display(
    pd.DataFrame(data['label_gold'].value_counts()),
    pd.DataFrame(data['target_ident'].value_counts())
)

label_gold,count
hateful,2659
non-hateful,1242


target_ident,count
gay people,577
women,535
disabled people,510
Muslims,510
black people,504
trans people,485
immigrants,485


No unusually short/long texts

In [177]:
print(
    f'Shortest: {len(data["test_case"].iloc[0])}\n'
    f'Longest: {len(data["test_case"].iloc[-1])}'
)

Shortest: 14
Longest: 38
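
Note that `iloc[0]` and `iloc[-1]` return the first and last rows, so the figures above are the character counts of those two rows and not necessarily the extremes of the dataset. A minimal sketch of a length check over all rows, using a toy frame in place of `data`:

```python
import pandas as pd

# Toy frame standing in for `data`; the column name matches the notebook.
toy = pd.DataFrame({"test_case": ["I hate rain.", "No.", "What a long and winding day."]})

# Character count of every text
lengths = toy["test_case"].str.len()

print(
    f"Shortest: {lengths.min()} characters -> {toy['test_case'][lengths.idxmin()]!r}\n"
    f"Longest: {lengths.max()} characters -> {toy['test_case'][lengths.idxmax()]!r}"
)
```

`lengths.describe()` gives a fuller picture if outliers are a concern.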


### Data Preprocessing

Drop and rename columns to something more suitable

In [178]:
data = data.drop('case_id', axis=1)
data = data.rename(
    columns=
    {
        'test_case': 'text',
        'label_gold': 'label',
        'target_ident': 'target',
    }
)

data.head()

Unnamed: 0,text,label,target
0,I hate women.,hateful,women
1,I hate trans people.,hateful,trans people
2,I hate gay people.,hateful,gay people
3,I hate black people.,hateful,black people
4,I hate disabled people.,hateful,disabled people


It appears that none of the rows with a missing target contain hate speech aimed at one of the target groups. Given the dataset's tiny size, this could easily be verified by reading through it manually. But alas, given my horrendous dyslexia, I decided to verify it programmatically as a sanity check

In [179]:
target_search_words = [
    'gay',
    'women',
    'disabled',
    'Muslims',
    'black',
    'trans',
    'immigrants'
]

# Get all rows whose target is NaN
data_target_nans = data[data['target'].isna()]

for index, row in data_target_nans.iterrows():
    for word in target_search_words:
        # Split by words and search each one
        if word in row['text'].split(' '):
            print(f'''found target word in: "{row['text']}" row''')
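
The check above splits on spaces and compares exact tokens, so it would miss capitalized or punctuated forms such as `Women,` and singular forms such as `Muslim`. A stricter variant, as a sketch: a case-insensitive regular expression with word boundaries. The stem list and the toy frame are assumptions for illustration, and no result on the real dataset is claimed here.

```python
import pandas as pd

# Hypothetical stems covering singular/plural forms of the target groups
target_stems = ["gay", "wom[ae]n", "disabled", "muslims?", "black", "trans", "immigrants?"]
pattern = r"\b(?:" + "|".join(target_stems) + r")\b"

# Toy frame standing in for the rows of `data` whose target is NaN
toy = pd.DataFrame({"text": ["No way.", "Women, listen up.", "That Muslim bakery is great."]})

# Boolean mask of texts mentioning any target word, ignoring case
mentions_target = toy["text"].str.contains(pattern, case=False, regex=True)
print(toy[mentions_target])
```

Any rows this flags would still need a manual read, since mentioning a group is not the same as targeting it.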

I'm now confident in dropping all rows with target NaN values

In [180]:
# Only the target column contains NaNs, so restrict the drop to it explicitly
data = data.dropna(subset=['target'])

Balance the dataset

In [181]:
hateful = data['label'].value_counts()['hateful']
non_hateful = data['label'].value_counts()['non-hateful']

# Because we've got far more hateful texts
number_of_rows_to_drop = hateful - non_hateful

# Randomly sample the surplus hateful rows and drop them
rows_indexes_to_remove = data[data['label'] == 'hateful'].sample(n=number_of_rows_to_drop).index
data = data.drop(rows_indexes_to_remove)

display(
    pd.DataFrame(data['label'].value_counts()),
    pd.DataFrame(data['target'].value_counts())
)

label,count
hateful,947
non-hateful,947


target,count
gay people,327
black people,276
trans people,266
disabled people,264
women,262
Muslims,252
immigrants,247
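
Because `sample` draws at random, the rows that survive balancing change on every run. A minimal sketch of a reproducible variant, assuming a fixed seed is acceptable; the seed value `42` and the toy frame are arbitrary choices for illustration:

```python
import pandas as pd

# Toy frame standing in for `data`: four hateful rows, two non-hateful rows
toy = pd.DataFrame({
    "text": ["a", "b", "c", "d", "e", "f"],
    "label": ["hateful"] * 4 + ["non-hateful"] * 2,
})

# Size of the smallest class
minority_size = toy["label"].value_counts().min()

# Draw the same number of rows from every class; random_state fixes the draw
balanced = toy.groupby("label").sample(n=minority_size, random_state=42)
print(balanced["label"].value_counts())
```

Sampling per label keeps the classes equal but, like the cell above, does not control how many rows each target group retains.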


### Data Augmentation

In [182]:
data

Unnamed: 0,text,label,target
1,I hate trans people.,hateful,trans people
3,I hate black people.,hateful,black people
4,I hate disabled people.,hateful,disabled people
5,I hate Muslims.,hateful,Muslims
9,I really can't stand gay people like you.,hateful,gay people
...,...,...,...
3888,Trans lives aren't worth sh1t.,hateful,trans people
3891,Disabled lives aren't worth sh1t.,hateful,disabled people
3892,Muslim lives aren't worth sh1t.,hateful,Muslims
3893,Immigrant lives aren't worth sh1t.,hateful,immigrants


In [183]:

data['text'] = data['text'].apply(lambda x: ' '.join(sample(x.split(), len(x.split()))))
data

Unnamed: 0,text,label,target
1,I trans people. hate,hateful,trans people
3,people. hate I black,hateful,black people
4,hate I people. disabled,hateful,disabled people
5,I hate Muslims.,hateful,Muslims
9,gay stand you. people I really like can't,hateful,gay people
...,...,...,...
3888,worth sh1t. aren't Trans lives,hateful,trans people
3891,Disabled lives aren't worth sh1t.,hateful,disabled people
3892,Muslim aren't sh1t. worth lives,hateful,Muslims
3893,aren't lives Immigrant sh1t. worth,hateful,immigrants
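
The cell above overwrites each text with its shuffled version, so the dataset keeps the same number of rows. If the goal is to enlarge the dataset, one option is to append the shuffled copies to the originals instead. A minimal sketch, assuming word-order shuffling is the desired augmentation; `shuffle_words`, the seed, and the toy frame are hypothetical:

```python
import random

import pandas as pd


def shuffle_words(text: str, rng: random.Random) -> str:
    """Return the words of `text` in a random order."""
    words = text.split()
    return " ".join(rng.sample(words, len(words)))


# Toy frame standing in for `data`
toy = pd.DataFrame({
    "text": ["I like rainy days.", "What a day."],
    "label": ["non-hateful", "non-hateful"],
})

rng = random.Random(0)  # arbitrary seed so the shuffle is repeatable

# Shuffle a copy, then stack it under the untouched originals
shuffled = toy.copy()
shuffled["text"] = shuffled["text"].map(lambda text: shuffle_words(text, rng))
augmented = pd.concat([toy, shuffled], ignore_index=True)
print(augmented)
```

With this approach, a text and its shuffled copy should end up on the same side of any later train/test split, otherwise the test set leaks into training.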


Shuffle the dataset

In [184]:
# frac=1 samples every row, i.e. returns the entire dataset in random order
data = data.sample(frac=1)

# Reset index after having both balanced and shuffled
data = data.reset_index(drop=True)
data

Unnamed: 0,text,label,target
0,saying to lives see hurts Muslim all It worthl...,non-hateful,Muslims
1,shit. aren't lives Immigrant worth,hateful,immigrants
2,despicable. are Muslims absolutely,hateful,Muslims
3,"make black it. sick"", will a I report me see I...",non-hateful,black people
4,"immigrant, are then probably are If you you st...",hateful,immigrants
...,...,...,...
1889,are people our parasites Black society. to,hateful,black people
1890,Gay a are people cancr.,hateful,gay people
1891,gay time is This person the is nominated. a first,non-hateful,gay people
1892,"are like send blocked. ""You People to messages...",non-hateful,black people


In [185]:
max_words_to_use = 1000

# Tokenize the text data (convert them into "sequences")
tokenizer = Tokenizer(num_words=max_words_to_use) # Consider only the most frequent words, capped by max_words_to_use
tokenizer.fit_on_texts(data['text'])
tokenized_data = tokenizer.texts_to_sequences(data['text'])

# printing using loop for easier viewing
for i in range(5):
    print(f'seq {i}: {tokenized_data[i]}')

seq 0: [64, 3, 32, 99, 297, 66, 8, 41, 110, 1, 40, 2]
seq 1: [102, 108, 32, 59, 48]
seq 2: [212, 2, 25, 82]
seq 3: [67, 14, 41, 181, 33, 5, 6, 298, 21, 99, 47, 1, 7, 4, 299, 6]
seq 4: [59, 2, 213, 457, 2, 47, 4, 4, 114, 90]
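
Each integer above is a word's rank in the fitted vocabulary, with `1` assigned to the most frequent word. The pure-Python sketch below illustrates that idea and the zero-padding that typically follows; it is a simplified stand-in, not Keras's implementation, and `build_word_index`, `to_sequences`, and `pad` are hypothetical helpers.

```python
from collections import Counter


def build_word_index(texts):
    """Rank words by frequency: the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def to_sequences(texts, word_index, num_words):
    """Map words to indices, dropping unknown words and words ranked num_words or higher."""
    return [
        [word_index[w] for w in text.lower().split() if word_index.get(w, num_words) < num_words]
        for text in texts
    ]


def pad(sequences, fill=0):
    """Left-pad every sequence with `fill` to the length of the longest one."""
    width = max(len(seq) for seq in sequences)
    return [[fill] * (width - len(seq)) + seq for seq in sequences]


texts = ["the cat sat", "the cat", "the dog barked loudly"]
word_index = build_word_index(texts)
sequences = to_sequences(texts, word_index, num_words=4)
print(pad(sequences))
```

Index `0` is left unused by the ranking so it can serve as the padding value, which is why the fill does not collide with any real word.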
