# Train Hate Speech Classification Neural Network Models

With the data cleaned and processed, this notebook trains models on the data sets. The code assumes that the cleaned data is at the filepath `"data/combined_deduped.csv"`.
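
The cells below use `x_train`, `y_train`, `x_val`, and `y_val` without showing where they come from. The following is a minimal sketch of one way to produce them, not the notebook's actual loading code. It assumes the CSV has a text column (hypothetically named `text` here) and the `inappropriate` label used later in this notebook, and it assumes the same split proportions as the LSTM section. A toy frame stands in for the real file so the sketch runs on its own.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def make_splits(df, text_col="text", target="inappropriate"):
    """Split a frame into train/val/test text and label series.

    `text_col` is a hypothetical column name; adjust it to the real schema.
    """
    # Hold out 20% for testing, then 15% of the remainder for validation.
    train, test = train_test_split(df, test_size=0.2, random_state=42)
    train, val = train_test_split(train, test_size=0.15, random_state=42)
    return (train[text_col], train[target],
            val[text_col], val[target],
            test[text_col], test[target])


# Toy frame standing in for pd.read_csv("data/combined_deduped.csv").
toy = pd.DataFrame({
    "text": [f"example tweet {i}" for i in range(100)],
    "inappropriate": [i % 2 == 0 for i in range(100)],
})
x_train, y_train, x_val, y_val, x_test, y_test = make_splits(toy)
print(len(x_train), len(x_val), len(x_test))
```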

In [None]:
import pandas as pd
import numpy as np
import spacy
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import roc_curve, roc_auc_score, precision_recall_fscore_support, classification_report
import matplotlib.pyplot as plt
import wandb
from tensorflow import keras
from wandb.keras import WandbCallback
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten, Embedding, LSTM
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras import regularizers
from tensorflow.keras.preprocessing.sequence import pad_sequences


# Neural Network Baselines

## MLP

What counts as a "baseline" neural network is debatable. We chose a simple multilayer perceptron with just enough neurons and layers to be functional, and little customization beyond that.

First, we instantiate a CountVectorizer that turns each text into a vector of word counts. Next, we scale those counts and convert them from sparse matrices to dense arrays. Finally, we create, compile, and fit our model.

In [21]:
vect = CountVectorizer(stop_words = 'english', max_features=3000)
x_train_vect = vect.fit_transform(x_train)
x_val_vect = vect.transform(x_val)

In [22]:
scaler = StandardScaler(with_mean = False)
x_train_vect_scale = scaler.fit_transform(x_train_vect)
x_val_vect_scale = scaler.transform(x_val_vect)

In [23]:
x_train_vect_scale = x_train_vect_scale.toarray()
x_val_vect_scale = x_val_vect_scale.toarray()
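
`StandardScaler(with_mean=False)` divides each column by its standard deviation without subtracting the mean, so zero counts stay zero and the matrix can remain sparse until `toarray()` is called. The plain-Python sketch below illustrates that per-column computation on a toy count matrix. It uses the population standard deviation and leaves zero-variance columns unscaled, which to my knowledge mirrors scikit-learn's behavior. `scale_without_centering` is an illustrative helper, not part of the notebook.

```python
import math


def scale_without_centering(rows):
    """Divide every column by its population standard deviation."""
    n_rows, n_cols = len(rows), len(rows[0])
    scales = []
    for j in range(n_cols):
        column = [row[j] for row in rows]
        mean = sum(column) / n_rows
        var = sum((v - mean) ** 2 for v in column) / n_rows
        std = math.sqrt(var)
        # A constant column has zero spread; leave it unscaled.
        scales.append(std if std > 0 else 1.0)
    return [[row[j] / scales[j] for j in range(n_cols)] for row in rows]


counts = [
    [2, 0, 1],
    [0, 0, 1],
    [4, 0, 1],
]
scaled = scale_without_centering(counts)
for row in scaled:
    print(row)
```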

In [24]:
model = Sequential([
    Dense(128, input_dim=3000, activation='relu'),
    Dense(32, activation='relu'),
    Dense(1, activation='sigmoid')
])

In [25]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
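
To get a feel for the size of this baseline, the number of trainable parameters can be worked out by hand: each `Dense` layer has one weight per input-unit pair plus one bias per unit. The sketch below does that arithmetic for the layer widths used above; `dense_params` is an illustrative helper.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units


# Input width followed by the width of each Dense layer in the model above.
layer_sizes = [3000, 128, 32, 1]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)
print(per_layer, total)
```

Almost all of the parameters sit in the first layer, which is a direct consequence of the 3000-feature input.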

In [41]:
wandb.init(project="allay-ds-23")

results = model.fit(x_train_vect_scale,
                    y_train,
                    epochs=5,
                    batch_size=20)

Train on 99459 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [42]:
score = model.evaluate(x_val_vect_scale, y_val)
y_pred = model.predict(x_val_vect_scale, batch_size=64, verbose=1)
y_pred = np.round(y_pred)
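
Rounding a sigmoid output amounts to thresholding the predicted probability at 0.5. A minimal sketch of the same decision with an explicit, adjustable cutoff follows; `to_labels` and `threshold` are illustrative names, not part of the notebook.

```python
def to_labels(probabilities, threshold=0.5):
    """Map sigmoid outputs to boolean class labels."""
    return [p > threshold for p in probabilities]


probs = [0.03, 0.49, 0.51, 0.97]
labels = to_labels(probs)
strict_labels = to_labels(probs, threshold=0.9)
print(labels, strict_labels)
```

Raising the threshold makes the model flag fewer texts, which generally trades recall on the `True` class for precision.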



In [43]:
print(classification_report(y_val, y_pred, digits=4))

              precision    recall  f1-score   support

       False     0.8485    0.8924    0.8699     10223
        True     0.8382    0.7777    0.8069      7329

    accuracy                         0.8445     17552
   macro avg     0.8434    0.8351    0.8384     17552
weighted avg     0.8442    0.8445    0.8436     17552



In [44]:
accuracy, precision, recall, f1 = .8445, .8382, .7777, .8069

wandb.log({'accuracy': accuracy, 'recall': recall,
           'f1': f1, 'precision': precision})
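
The values logged above are copied by hand from the classification report. As a hedged alternative, they could be computed from the predictions directly. The plain-Python sketch below shows the definitions for the positive (`True`) class on toy labels; `binary_metrics` is an illustrative helper, not something the notebook defines.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for the positive (True) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    tn = sum(1 for t, p in pairs if not t and not p)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


y_true = [True, True, True, False, False, False, False, True]
y_hat = [True, True, False, False, False, True, False, True]
metrics = binary_metrics(y_true, y_hat)
print(metrics)
```

The resulting dictionary has the same keys as the one passed to `wandb.log`. In the notebook itself, `precision_recall_fscore_support` (already imported) with `average="binary"` should return the same precision, recall, and F1.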

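Next, we repeat the experiment with a TF-IDF vectorizer in place of raw counts. This time the features are passed to the network without a separate scaling step.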

In [45]:
vect = TfidfVectorizer(stop_words = 'english', max_features=3000)
x_train_vect = vect.fit_transform(x_train)
x_val_vect = vect.transform(x_val)

In [46]:
x_train_vect = x_train_vect.toarray()
x_val_vect = x_val_vect.toarray()
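
For intuition about what the vectorizer produces, the plain-Python sketch below computes TF-IDF on a toy corpus. It assumes scikit-learn's default settings as I understand them: raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2-normalized rows. Stop-word removal and the `max_features` cap are left out, and `tfidf` is an illustrative helper.

```python
import math
from collections import Counter


def tfidf(docs):
    """TF-IDF with smoothed IDF and L2-normalized rows."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        row = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocab, rows


docs = [["good", "movie"], ["bad", "movie"], ["bad", "bad", "acting"]]
vocab, rows = tfidf(docs)
print(vocab)
for row in rows:
    print([round(v, 3) for v in row])
```

Every row comes out with unit length, which is presumably why the TF-IDF cells skip the `StandardScaler` step. Rarer words such as "good" receive more weight than words shared across documents such as "movie".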

In [47]:
model = Sequential([
    Dense(128, input_dim=3000, activation='relu'),
    Dense(32, activation='relu'),
    Dense(1, activation='sigmoid')
])

In [48]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [49]:
wandb.init(project="allay-ds-23")
results = model.fit(x_train_vect,
                    y_train,
                    epochs=5,
                    batch_size=20)

Train on 99459 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [51]:
y_pred = model.predict(x_val_vect, batch_size=64, verbose=1)
y_pred = np.round(y_pred)
print(classification_report(y_val, y_pred, digits=4))

              precision    recall  f1-score   support

       False     0.8541    0.9035    0.8781     10223
        True     0.8535    0.7847    0.8177      7329

    accuracy                         0.8539     17552
   macro avg     0.8538    0.8441    0.8479     17552
weighted avg     0.8538    0.8539    0.8528     17552



In [53]:
accuracy, precision, recall, f1 = .8539, .8535, .7847, .8177

wandb.log({'accuracy': accuracy, 'recall': recall,
           'f1': f1, 'precision': precision})

## RNN + LSTM

The next model we try is a recurrent neural network with an LSTM layer. It relies on pickled, lemmatized data at the filepath `data/lemmas_2020-05-04-16-27-18Z.pkl.xz`. The lemmatization itself is done in a separate notebook.

In [60]:
lemmas = pd.read_pickle("data/lemmas_2020-05-04-16-27-18Z.pkl.xz", compression = 'xz')

In [61]:
lemmas.head()

| | sm_lemmas | md_lemmas | lg_lemmas | inappropriate |
|---|---|---|---|---|
| 0 | [beat, Dr., Dre, urbeat, Wired, Ear, Headphone... | [beat, Dr., Dre, urbeat, Wired, ear, Headphone... | [beat, Dr., Dre, urBeats, wire, Ear, Headphone... | True |
| 1 | [@Papapishu, man, fucking, rule, party, perpet... | [@Papapishu, man, fucking, rule, party, perpet... | [@Papapishu, man, fucking, rule, party, perpet... | True |
| 2 | [time, draw, close, 128591;&#127995, Father, d... | [time, draw, close, 128591;&#127995, Father, d... | [time, draw, close, 128591;&#127995, Father, d... | False |
| 3 | [notice, start, act, different, distant, peep,... | [notice, start, act, different, distant, peep,... | [notice, start, act, different, distant, peep,... | False |
| 4 | [forget, unfollower, believe, grow, new, follo... | [forget, unfollower, believe, grow, new, follo... | [forget, unfollower, believe, grow, new, follo... | False |


In [63]:
medium_lemmas = lemmas[["md_lemmas", "inappropriate"]].copy()
del lemmas

In [82]:
train, test = train_test_split(medium_lemmas, test_size=.2, random_state=42)
train, val = train_test_split(train, test_size=.15, random_state=42)
target = 'inappropriate'

In [83]:
y_train = train[target]
y_val = val[target]

x_train = train.drop([target], axis=1)
x_val = val.drop([target], axis=1)
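
Because the second split is taken from what remains after the first, the validation share of the full data set is not 15%. The short calculation below works out the effective fractions and checks them against the sample counts that appear in this notebook's outputs (99459 training rows, 17552 validation rows).

```python
test_frac = 0.2          # first split: held out for testing
val_frac_of_rest = 0.15  # second split: taken from the remaining 80%

val_frac = (1 - test_frac) * val_frac_of_rest
train_frac = (1 - test_frac) * (1 - val_frac_of_rest)

# Sample counts reported in this notebook's outputs.
n_train, n_val = 99459, 17552
observed_val_share = n_val / (n_train + n_val)
print(round(train_frac, 2), round(val_frac, 2), round(observed_val_share, 4))
```

So roughly 68% of the rows are used for training, 12% for validation, and 20% are held out for testing.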

We want to map each lemma in our data (up to a maximum number of features) to an integer so that we can pass the tweets into an embedding layer. Since preprocessing was done in the lemmatization step, we create a CountVectorizer, whose fitted vocabulary we will use to build the mapping, and turn off its own tokenization and lowercasing. English stop-word filtering and a minimum document frequency are still applied when the vocabulary is built.

In [102]:
def do_nothing(tokens):
    return tokens

In [125]:
vectorizer = CountVectorizer(tokenizer=do_nothing, preprocessor=None, lowercase=False, stop_words="english", max_features=5000, min_df=.0001)

In [126]:
vectorizer.fit(x_train["md_lemmas"])

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=False, max_df=1.0, max_features=5000, min_df=0.0001,
                ngram_range=(1, 1), preprocessor=None, stop_words='english',
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=<function do_nothing at 0x0000028736EF8948>,
                vocabulary=None)

In [127]:
# On scikit-learn >= 1.0, use vectorizer.get_feature_names_out() instead.
word2idx = {word: idx for idx, word in enumerate(vectorizer.get_feature_names())}

In [None]:
# A dict cannot be sliced, so convert its items to a list first.
list(word2idx.items())[:3]

In [111]:
def to_sequence(index, text):
    indexes = [index[word] for word in text if word in index]
    return indexes

In [129]:
x_train["integers"] = x_train["md_lemmas"].apply(lambda x: to_sequence(word2idx, x))
x_val["integers"] = x_val["md_lemmas"].apply(lambda x: to_sequence(word2idx, x))
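
The mapping step can be tried in isolation on a toy vocabulary. The sketch below uses made-up words and a hypothetical `build_index` helper alongside the same `to_sequence` logic, to show that tokens missing from the vocabulary are silently dropped rather than mapped to a placeholder id.

```python
def build_index(vocabulary):
    """Assign each vocabulary entry a consecutive integer id."""
    return {word: idx for idx, word in enumerate(vocabulary)}


def to_sequence(index, tokens):
    """Replace known tokens with their ids and drop the rest."""
    return [index[tok] for tok in tokens if tok in index]


toy_index = build_index(["cute", "like", "protest", "think", "time"])
sequence = to_sequence(toy_index, ["time", "like", "@unseen_handle", "think"])
print(sequence)
```

This behavior would explain why a tweet consisting mostly of user handles ends up as a very short sequence in the `x_train.head()` output.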

In [130]:
x_train.head()

| | md_lemmas | integers |
|---|---|---|
| 14520 | [.@pepsi, think, protest, hip, cute, thing, mi... | [4580, 3824, 2838, 2077, 4579, 3360, 3843, 2931] |
| 68713 | [Mississauga, load, line, finally, click, Spen... | [3221, 3199, 2539, 1851, 3212, 4612, 1531] |
| 117901 | [kejriwal, accept, role, making, udtapunjab, c... | [1258, 4043, 3283, 4721, 1955, 3838] |
| 61528 | [@AsEasyAsRiding, @lastnotlost, @shoestringcyc... | [3193] |
| 115752 | [time, like, boxer, rapper, etc, bitch, fuckin... | [4608, 3193, 3888, 1578, 2632, 2805, 1781, 355... |


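Reviews are likely to be longer than tweets, so we set the maximum sequence length to twice the length of the longest training sequence, which lets the model process longer texts. The padded sequence and model summary shown further down have length 30, so they appear to come from a run made before the doubling.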

In [145]:
max_seq_length = int(x_train["integers"].apply(len).max() * 2)
max_seq_length

60

In [116]:
len(word2idx)

5000

In [136]:
n_features = len(vectorizer.get_feature_names())
x_train_sequences = pad_sequences(x_train["integers"], maxlen = max_seq_length, value=n_features)

In [141]:
x_val_sequences = pad_sequences(x_val["integers"], maxlen=max_seq_length, value=n_features)
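
`pad_sequences` pads and truncates at the front of each sequence by default. The plain-Python sketch below reproduces that default behavior on toy input to make the layout explicit; `pad_pre` is an illustrative stand-in, not the Keras function, and it assumes `maxlen` is at least 1.

```python
def pad_pre(sequences, maxlen, value):
    """Left-pad (and left-truncate) integer sequences to a fixed length."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep only the last `maxlen` items
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded


toy_padded = pad_pre([[7, 8], [1, 2, 3, 4, 5]], maxlen=4, value=99)
print(toy_padded)
```

In the cells above the padding value is `n_features`. Word ids run from 0 to `n_features - 1`, so `n_features` is the first unused id, and the embedding layer needs `n_features + 1` rows to cover it.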

In [138]:
x_train_sequences[0]

array([5000, 5000, 5000, 5000, 5000, 5000, 5000, 5000, 5000, 5000, 5000,
       5000, 5000, 5000, 5000, 5000, 5000, 5000, 5000, 5000, 5000, 5000,
       4580, 3824, 2838, 2077, 4579, 3360, 3843, 2931])

In [139]:
model = Sequential()
model.add(Embedding(len(vectorizer.get_feature_names()) + 1,
                    64,
                    input_length=max_seq_length))
model.add(LSTM(64))
model.add(Dense(units=1, activation='sigmoid'))
 
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

Model: "sequential_13"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_8 (Embedding)      (None, 30, 64)            320064    
_________________________________________________________________
lstm_9 (LSTM)                (None, 64)                33024     
_________________________________________________________________
dense_11 (Dense)             (None, 1)                 65        
Total params: 353,153
Trainable params: 353,153
Non-trainable params: 0
_________________________________________________________________
None
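
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard LSTM layout (four gates, each with input weights, recurrent weights, and a bias) and uses the sizes from this model: 5000 vocabulary entries plus one padding id, 64-dimensional embeddings, and 64 LSTM units. The helper names are illustrative.

```python
def embedding_params(vocab_size, dim):
    """One `dim`-sized vector per id."""
    return vocab_size * dim


def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit."""
    return n_inputs * n_units + n_units


param_counts = {
    "embedding": embedding_params(5000 + 1, 64),
    "lstm": lstm_params(64, 64),
    "dense": dense_params(64, 1),
}
total_params = sum(param_counts.values())
print(param_counts, total_params)
```

These figures match the 320,064, 33,024, and 65 parameters listed in the summary, for a total of 353,153.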


In [142]:
import os

# wandb reads the notebook name from the environment, not from a Python variable.
os.environ["WANDB_NOTEBOOK_NAME"] = "train_models.ipynb"
wandb.init(project="allay-ds-23", config={"epochs": 3, "optimizer": "adam", "batch_size": 20})
results = model.fit(x_train_sequences,
                    y_train,
                    epochs=3,
                    batch_size=20,
                    callbacks=[WandbCallback(validation_data=(x_val_sequences, y_val),
                                             labels=["appropriate", "inappropriate"])])


Train on 99459 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


In [144]:
y_pred = model.predict(x_val_sequences, batch_size=64, verbose=1)
y_pred = np.round(y_pred)
print(classification_report(y_val, y_pred, digits=4))

              precision    recall  f1-score   support

       False     0.8569    0.9119    0.8835     10223
        True     0.8650    0.7876    0.8245      7329

    accuracy                         0.8600     17552
   macro avg     0.8609    0.8497    0.8540     17552
weighted avg     0.8603    0.8600    0.8589     17552

