todo
1. optimize hyperparameters
2. clean dataset
    - balance
3. data augmentation
4. add requirements.txt
5. correct title (of file as well)
6. go through all variable names
7. ask which types of groups should "be in focus". Do disabled people count?
8. more advanced duplicate dropping?
9. balancing the dataset needed?
10. dedicated train test split?

# Hate Detection

In [895]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import confusion_matrix, f1_score

from keras.callbacks import EarlyStopping
from keras.layers import SimpleRNN, LSTM, GRU, Bidirectional, Dense, Embedding
from keras.models import Sequential
from keras.preprocessing.text import Tokenizer
from keras.utils import pad_sequences

### Data Exploration

In [896]:
data = pd.read_csv('hate-text.csv')
data

Unnamed: 0,case_id,test_case,label_gold,target_ident
0,1,I hate women.,hateful,women
1,2,I hate trans people.,hateful,trans people
2,3,I hate gay people.,hateful,gay people
3,4,I hate black people.,hateful,black people
4,5,I hate disabled people.,hateful,disabled people
...,...,...,...,...
3896,3897,Gay people are all complete 4ssholes.,hateful,gay people
3897,3898,Black people are all complete 4ssholes.,hateful,black people
3898,3899,Disabled people are all complete 4ssholes.,hateful,disabled people
3899,3900,Muslims are all complete 4ssholes.,hateful,Muslims


The dataset contains some missing values in the target_ident column, which we'll have to deal with later

In [897]:
data.isnull().any()

case_id         False
test_case       False
label_gold      False
target_ident     True
dtype: bool

No duplicate texts found

In [898]:
print(f'''duplicate texts: {data['test_case'].duplicated().any()}''')

duplicate texts: False
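
The check above only catches exact string matches. A minimal sketch of a stricter check, assuming near-duplicates differ only in case, punctuation, or whitespace. The `normalize` helper and the toy rows are hypothetical and not taken from the dataset:

```python
import re

import pandas as pd


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace (hypothetical helper)."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()


# Toy frame standing in for the real data; same column name as above
toy = pd.DataFrame({"test_case": ["I like tea.", "i like  tea", "I like coffee."]})

# Exact duplicates vs. duplicates after normalization
exact_dupes = toy["test_case"].duplicated().sum()
normalized_dupes = toy["test_case"].map(normalize).duplicated().sum()

print(f"exact duplicates: {exact_dupes}")
print(f"duplicates after normalization: {normalized_dupes}")
```

Whether normalized duplicates should actually be dropped depends on the dataset: if spelling or spacing variations are deliberate test cases, they may need to stay.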


While the dataset is fairly balanced across the target groups, it is imbalanced between hateful and non-hateful texts (2659 versus 1242).

In [899]:
# Using display() to display both at the same time
display(
    pd.DataFrame(data['label_gold'].value_counts()),
    pd.DataFrame(data['target_ident'].value_counts())
)

label_gold,count
hateful,2659
non-hateful,1242


target_ident,count
gay people,577
women,535
disabled people,510
Muslims,510
black people,504
trans people,485
immigrants,485


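Length check on the first and last rows. Note that `iloc[0]` and `iloc[-1]` select rows by position, so these are not necessarily the shortest and longest texts in the dataset.

In [900]:
print(
    f'First row: {len(data["test_case"].iloc[0])}\n'
    f'Last row: {len(data["test_case"].iloc[-1])}'
)

First row: 14
Last row: 38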
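
The cell above reads the lengths of the first and last rows only. A minimal sketch of a column-wide check, using `Series.str.len()` on a toy frame (the rows are hypothetical stand-ins for the real data):

```python
import pandas as pd

# Toy frame standing in for the real data; same column name as above
toy = pd.DataFrame(
    {"test_case": ["No way.", "I like tea.", "This sentence is clearly the longest one."]}
)

# Character count for every row
lengths = toy["test_case"].str.len()

# Report the extremes together with the texts they belong to
print(f"Shortest: {lengths.min()} -> {toy.loc[lengths.idxmin(), 'test_case']!r}")
print(f"Longest: {lengths.max()} -> {toy.loc[lengths.idxmax(), 'test_case']!r}")

# Summary statistics help spot outliers at a glance
print(lengths.describe())
```

Applied to the real frame, the same calls would show whether any text is unusually short or long.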


### Data preprocessing

Drop and rename columns to something more suitable

In [901]:
data = data.drop('case_id', axis=1)
data = data.rename(
    columns=
    {
        'test_case': 'text',
        'label_gold': 'label',
        'target_ident': 'target',
    }
)

data.head()

Unnamed: 0,text,label,target
0,I hate women.,hateful,women
1,I hate trans people.,hateful,trans people
2,I hate gay people.,hateful,gay people
3,I hate black people.,hateful,black people
4,I hate disabled people.,hateful,disabled people


Fill in missing target data

In [903]:
data[data.isna().any(axis=1)]

Unnamed: 0,text,label,target
1271,"Fucking hell, what a day.",non-hateful,
1272,That's a great fucking idea.,non-hateful,
1273,No fucking way.,non-hateful,
1274,I'm so fucking excited.,non-hateful,
1275,Are you fucking kidding me?,non-hateful,
...,...,...,...
3097,Pro lifers should get the fuck out of our coun...,non-hateful,
3098,Most gun dealers are seriously fucked in the h...,non-hateful,
3099,"By nature, corporate accountants are just shit...",non-hateful,
3100,Most real estate agents are absolute bastards.,non-hateful,


In [904]:
# Create a new DataFrame with rows containing NaN values
new_df = data[data['target'].isna()].copy()

#data = data.dropna()

new_df

Unnamed: 0,text,label,target
1271,"Fucking hell, what a day.",non-hateful,
1272,That's a great fucking idea.,non-hateful,
1273,No fucking way.,non-hateful,
1274,I'm so fucking excited.,non-hateful,
1275,Are you fucking kidding me?,non-hateful,
...,...,...,...
3097,Pro lifers should get the fuck out of our coun...,non-hateful,
3098,Most gun dealers are seriously fucked in the h...,non-hateful,
3099,"By nature, corporate accountants are just shit...",non-hateful,
3100,Most real estate agents are absolute bastards.,non-hateful,


In [905]:
key_words = ['gay', 'women', 'disabled', 'Muslims', 'black', 'trans', 'immigrants']

def update_target(row):
    # Split on spaces; punctuation stays attached and matching is case-sensitive
    words = row['text'].split(' ')

    # Check every keyword before falling back to the existing target
    for key_word in key_words:
        if key_word in words:
            return key_word
    return row['target']

# Fill the 'target' column of the NaN-only copy from keywords found in 'text'
new_df['target'] = new_df.apply(update_target, axis=1)

new_df

Unnamed: 0,text,label,target
1271,"Fucking hell, what a day.",non-hateful,
1272,That's a great fucking idea.,non-hateful,
1273,No fucking way.,non-hateful,
1274,I'm so fucking excited.,non-hateful,
1275,Are you fucking kidding me?,non-hateful,
...,...,...,...
3097,Pro lifers should get the fuck out of our coun...,non-hateful,
3098,Most gun dealers are seriously fucked in the h...,non-hateful,
3099,"By nature, corporate accountants are just shit...",non-hateful,
3100,Most real estate agents are absolute bastards.,non-hateful,
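
The keyword matching above works on a copy of the rows and compares raw, case-sensitive tokens. A minimal sketch of a more tolerant variant, assuming the filled value should be the full label used elsewhere in the `target` column and that the result should be written back to the original frame. The `KEYWORD_TO_TARGET` mapping, the `infer_target` helper, and the toy rows are hypothetical:

```python
import re

import pandas as pd

# Hypothetical mapping from a lowercase keyword to the full target label
KEYWORD_TO_TARGET = {
    "gay": "gay people",
    "women": "women",
    "disabled": "disabled people",
    "muslims": "Muslims",
    "black": "black people",
    "trans": "trans people",
    "immigrants": "immigrants",
}


def infer_target(text):
    """Return the first matching target label, or None if no keyword occurs."""
    # Lowercase and strip punctuation so "Muslims," matches "muslims"
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    for keyword, target in KEYWORD_TO_TARGET.items():
        if keyword in words:
            return target
    return None


toy = pd.DataFrame(
    {
        "text": ["No way.", "Muslims, like everyone, deserve respect.", "Women deserve respect."],
        "target": [None, None, "women"],
    }
)

# Fill only the missing targets, directly on the original frame
missing = toy["target"].isna()
toy.loc[missing, "target"] = toy.loc[missing, "text"].map(infer_target)
print(toy)
```

Single-word matching is crude (for example, "black" also appears in unrelated contexts), so rows filled this way are worth a manual spot check. Rows without any keyword stay missing.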


In [906]:
data[data['target'].isnull()]

Unnamed: 0,text,label,target
1271,"Fucking hell, what a day.",non-hateful,
1272,That's a great fucking idea.,non-hateful,
1273,No fucking way.,non-hateful,
1274,I'm so fucking excited.,non-hateful,
1275,Are you fucking kidding me?,non-hateful,
...,...,...,...
3097,Pro lifers should get the fuck out of our coun...,non-hateful,
3098,Most gun dealers are seriously fucked in the h...,non-hateful,
3099,"By nature, corporate accountants are just shit...",non-hateful,
3100,Most real estate agents are absolute bastards.,non-hateful,


Balance the dataset

In [907]:
# Randomly drop 1417 "hateful" rows (2659 - 1242) so that both classes have 1242 rows
rows_indexes_to_remove = data[data['label'] == 'hateful'].sample(n=1417).index
data = data.drop(rows_indexes_to_remove)

# Using display() to display both at the same time
display(
    pd.DataFrame(data['label'].value_counts()),
    pd.DataFrame(data['target'].value_counts())
)

label,count
hateful,1242
non-hateful,1242


target,count
gay people,369
women,327
disabled people,319
black people,302
Muslims,297
trans people,295
immigrants,280
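
The number of rows to drop is hard-coded above and the sample is not seeded, so the result changes between runs. A minimal sketch of an alternative, assuming downsampling every class to the size of the smallest one is the goal. The toy frame and the seed value are hypothetical:

```python
import pandas as pd

# Toy frame standing in for the real data; same column names as above
toy = pd.DataFrame(
    {
        "text": [f"text {i}" for i in range(10)],
        "label": ["hateful"] * 7 + ["non-hateful"] * 3,
    }
)

# Size of the smallest class; every class is downsampled to this size
n_per_class = int(toy["label"].value_counts().min())

# random_state (an arbitrary seed) makes the sample reproducible
balanced = toy.groupby("label").sample(n=n_per_class, random_state=42)

print(balanced["label"].value_counts())
```

Deriving the count from the data avoids having to update a constant whenever the dataset changes.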


Shuffle the dataset

In [908]:
# frac=1 samples every row, which shuffles the entire dataset
data = data.sample(frac=1)

# Reset index after having both balanced and shuffled
data = data.reset_index(drop=True)
data

Unnamed: 0,text,label,target
0,Immigrants are scum to me.,hateful,immigrants
1,They are all complete assholes. Needless to sa...,hateful,disabled people
2,You are just a queer to me.,hateful,gay people
3,Trans people with any power are a t h r e a t ...,hateful,trans people
4,All women are such massivescumbags.,hateful,women
...,...,...,...
2479,We must never condone the killing of any trans...,non-hateful,trans people
2480,"He said this was ""typical retard behaviour"", w...",non-hateful,disabled people
2481,We should not allow for any trans person to be...,non-hateful,trans people
2482,It's really not okay to be calling people queer.,non-hateful,gay people


In [909]:
max_words_to_use = 1000

# Tokenize the text data (convert them into "sequences")
tokenizer = Tokenizer(num_words=max_words_to_use) # Limit the vocabulary to the most frequent words
tokenizer.fit_on_texts(data['text'])
tokenized_data = tokenizer.texts_to_sequences(data['text'])

# printing using loop for easier viewing
for i in range(5):
    print(f'seq {i}: {tokenized_data[i]}')

seq 0: [24, 2, 132, 3, 20]
seq 1: [42, 2, 7, 120, 171, 652, 3, 126, 42, 2, 14, 1]
seq 2: [5, 2, 22, 4, 129, 3, 20]
seq 3: [16, 1, 68, 31, 266, 2, 4, 133, 193, 194, 102, 4, 133, 3, 214]
seq 4: [7, 23, 2, 85, 653]
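
To make the integer sequences above easier to read, here is a rough pure-Python approximation of the encoding: each word is mapped to its frequency rank (1 is the most frequent), and words whose rank is at or beyond `num_words` are dropped. This is a simplified sketch and not Keras's actual implementation; the `build_word_index` and `to_sequences` helpers and the toy texts are hypothetical:

```python
import re
from collections import Counter

WORD_PATTERN = r"[a-z0-9']+"


def build_word_index(texts):
    """Rank words by frequency: the most frequent word gets index 1."""
    counts = Counter(
        word for text in texts for word in re.findall(WORD_PATTERN, text.lower())
    )
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def to_sequences(texts, word_index, num_words):
    """Encode texts as index lists, dropping words with index >= num_words."""
    return [
        [
            word_index[word]
            for word in re.findall(WORD_PATTERN, text.lower())
            if word in word_index and word_index[word] < num_words
        ]
        for text in texts
    ]


texts = ["tea is nice", "tea is hot", "coffee is fine"]
word_index = build_word_index(texts)
sequences = to_sequences(texts, word_index, num_words=3)

print(word_index)
print(sequences)
```

Low indices therefore correspond to common words, and rare words simply disappear from the sequences once the vocabulary cap is reached.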
