todo
1. optimize hyperparameters
2. clean dataset
    - balance
3. data augmentation
4. add requirements.txt
5. correct title (of file as well)
6. go through all variable names
7. ask which types of groups should "be in focus". Do disabled people count?
8. more advanced duplicate dropping?
9. balancing the dataset needed?
10. dedicated train test split?

# Hate Detection

In [715]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import confusion_matrix, f1_score

from keras.callbacks import EarlyStopping
from keras.layers import SimpleRNN, LSTM, GRU, Bidirectional, Dense, Embedding
from keras.models import Sequential
from keras.preprocessing.text import Tokenizer
from keras.utils import pad_sequences

### Data Exploration

In [716]:
data = pd.read_csv('hate-text.csv')
data

,case_id,test_case,label_gold,target_ident
0,1,I hate women.,hateful,women
1,2,I hate trans people.,hateful,trans people
2,3,I hate gay people.,hateful,gay people
3,4,I hate black people.,hateful,black people
4,5,I hate disabled people.,hateful,disabled people
...,...,...,...,...
3896,3897,Gay people are all complete 4ssholes.,hateful,gay people
3897,3898,Black people are all complete 4ssholes.,hateful,black people
3898,3899,Disabled people are all complete 4ssholes.,hateful,disabled people
3899,3900,Muslims are all complete 4ssholes.,hateful,Muslims


The dataset contains some missing values in the target_ident column, which we'll have to deal with later.

In [717]:
data.isnull().any()

case_id         False
test_case       False
label_gold      False
target_ident     True
dtype: bool

No duplicate texts found

In [718]:
print(f"duplicate texts: {data['test_case'].duplicated().any()}")

duplicate texts: False
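The check above only catches exact string matches. As a minimal sketch of the "more advanced duplicate dropping" idea from the todo list, texts could be normalized before comparison. The `normalize_text` helper and the small `sample` frame are hypothetical stand-ins, not part of the notebook; in the notebook the same logic would apply to `data['test_case']`.

```python
import re

import pandas as pd


def normalize_text(text: str) -> str:
    """Lowercase, drop punctuation, and collapse repeated whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical stand-in for data['test_case']
sample = pd.DataFrame(
    {"test_case": ["I hate women.", "i hate  women", "I love cats."]}
)

# Flag rows whose normalized text has already appeared earlier
normalized = sample["test_case"].map(normalize_text)
near_duplicates = normalized.duplicated(keep="first")

print(sample[near_duplicates])
```

Whether such near-duplicates should be dropped is a judgment call: in this dataset, spelling variants such as "haet" or "4ssholes" look deliberate, so aggressive normalization could remove intended test cases.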


While the dataset is fairly balanced across the target groups, it is still asymmetric in the number of hateful and non-hateful texts (2659 vs. 1242).

In [719]:
# Using display() to display both at the same time
display(
    pd.DataFrame(data['label_gold'].value_counts()),
    pd.DataFrame(data['target_ident'].value_counts())
)

label_gold,count
hateful,2659
non-hateful,1242


target_ident,count
gay people,577
women,535
disabled people,510
Muslims,510
black people,504
trans people,485
immigrants,485


No unusually short/long texts

In [720]:
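# Note: iloc[0] and iloc[-1] are the first and last rows, so these are only
# the true shortest/longest lengths if the rows are ordered by text length
print(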
    f'Shortest: {len(data["test_case"].iloc[0])}\n'
    f'Longest: {len(data["test_case"].iloc[-1])}'
)

Shortest: 14
Longest: 38
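The cell above reads the lengths of the first and last rows. A minimal sketch of measuring the actual extremes over every row, using a hypothetical `sample` frame in place of `data` (the values below are illustrative and are not the notebook's results):

```python
import pandas as pd

# Hypothetical stand-in for the notebook's data frame
sample = pd.DataFrame(
    {"test_case": ["I hate women.", "Short", "A considerably longer sentence."]}
)

# Character length of every text, then the extremes across the column
lengths = sample["test_case"].str.len()
shortest, longest = lengths.min(), lengths.max()

print(f"Shortest: {shortest}\nLongest: {longest}")

# Inspect the texts themselves to judge whether they look unusual
print(sample.loc[lengths.idxmin(), "test_case"])
print(sample.loc[lengths.idxmax(), "test_case"])
```

`lengths.describe()` would additionally give the mean and quartiles, which helps when deciding on a padding length later.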


### Data Preprocessing

Drop the case_id column and rename the remaining columns to shorter names

In [721]:
data = data.drop('case_id', axis=1)
data = data.rename(
    columns=
    {
        'test_case': 'text',
        'label_gold': 'label',
        'target_ident': 'target',
    }
)

data.head()

,text,label,target
0,I hate women.,hateful,women
1,I hate trans people.,hateful,trans people
2,I hate gay people.,hateful,gay people
3,I hate black people.,hateful,black people
4,I hate disabled people.,hateful,disabled people


Fill in missing target data
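One possible way to handle the missing targets, sketched on a hypothetical `sample` frame: replace them with an explicit placeholder so the column contains no NaN values. The placeholder string `'none'` is an assumption, not something the notebook specifies.

```python
import pandas as pd

# Hypothetical stand-in for the renamed data frame
sample = pd.DataFrame(
    {
        "text": ["I hate women.", "My dog is smarter than any computer."],
        "label": ["hateful", "non-hateful"],
        "target": ["women", None],
    }
)

# Assumed placeholder for texts that do not mention any target group
NO_TARGET = "none"
sample["target"] = sample["target"].fillna(NO_TARGET)

print(sample["target"].isnull().any())
```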

Balance the dataset

In [722]:
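# Randomly sample surplus "hateful" rows and drop them so both classes end up the same size
# (2659 hateful - 1242 non-hateful = 1417 rows to remove)
label_counts = data['label'].value_counts()
n_to_remove = int(label_counts['hateful'] - label_counts['non-hateful'])
row_indexes_to_remove = data[data['label'] == 'hateful'].sample(n=n_to_remove).index
data = data.drop(row_indexes_to_remove)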

# Using display() to display both at the same time
display(
    pd.DataFrame(data['label'].value_counts()),
    pd.DataFrame(data['target'].value_counts())
)

label,count
hateful,1242
non-hateful,1242


target,count
gay people,369
women,329
black people,318
Muslims,309
disabled people,303
trans people,281
immigrants,280
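Because the rows to drop are sampled randomly, the per-target counts above will differ from run to run. A minimal sketch of a reproducible alternative, assuming pandas 1.1 or later for `groupby(...).sample`; the `sample` frame and the seed value are hypothetical:

```python
import pandas as pd

# Hypothetical imbalanced stand-in for the notebook's data frame
sample = pd.DataFrame(
    {
        "text": [f"text {i}" for i in range(10)],
        "label": ["hateful"] * 7 + ["non-hateful"] * 3,
    }
)

# Downsample every class to the size of the smallest one
minority_size = int(sample["label"].value_counts().min())
balanced = sample.groupby("label").sample(n=minority_size, random_state=42)

print(balanced["label"].value_counts())
```

Fixing `random_state` makes the balanced subset identical across runs, which is useful when comparing hyperparameter settings.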


Shuffle the dataset

In [725]:
# frac=1 samples every row, which returns the whole dataset in shuffled order
data = data.sample(frac=1)

# Reset index after having both balanced and shuffled
data = data.reset_index(drop=True)
data

,text,label,target
0,"Obvious hate like ""Muslim lives aren't worth s...",non-hateful,Muslims
1,My dog is smarter than any computer could ever...,non-hateful,
2,"By their nature, trans people are never attrac...",hateful,trans people
3,I haet trans people.,hateful,trans people
4,Calling disabled people cockroaches is such a ...,non-hateful,disabled people
...,...,...,...
2479,We should get rid of gay people once and for a...,hateful,gay people
2480,There have to be consequences for saying you w...,non-hateful,women
2481,So s1ck of being around Muslims like you.,hateful,Muslims
2482,Never trust a single word said by a disabled p...,hateful,disabled people


In [724]:
max_words_to_use = 1000

# Tokenize the text data (convert them into "sequences")
tokenizer = Tokenizer(num_words=max_words_to_use)  # Limit the vocabulary to the most frequent words
tokenizer.fit_on_texts(data['text'])
tokenized_data = tokenizer.texts_to_sequences(data['text'])

# Print the first five sequences one per line for easier viewing
for i in range(5):
    print(f'seq {i}: {tokenized_data[i]}')

seq 0: [5, 27, 58, 70, 487]
seq 1: [527, 528, 119, 14, 1, 221, 43, 79, 132, 89, 33, 7, 140, 295]
seq 2: [10, 366, 367, 12, 11, 368, 16, 65, 80]
seq 3: [5, 2, 22, 4, 488, 3, 21]
seq 4: [156, 154, 183, 120, 147, 6, 154, 120, 2, 96, 167, 12, 10, 176]
