# Code Content



In [None]:
# Import base libraries for mathematical operations, dataframes, time and plotting
import numpy as np
import pandas as pd
from time import time
import matplotlib.pyplot as plt
import seaborn as sns
font = {'family' : 'sans-serif',
        'style' : 'normal',
        'size'   : 15}
plt.rc('font', **font)
plt.rcParams['figure.figsize'] = 12, 8

import warnings
warnings.filterwarnings("ignore")

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import precision_recall_fscore_support as prfs

from keras.wrappers.scikit_learn import KerasClassifier
import keras
from keras.utils import to_categorical
from keras import regularizers
from keras.constraints import maxnorm

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences


from keras.models import Sequential
from keras.layers import Dense, Embedding, Flatten, LeakyReLU
from keras.callbacks import EarlyStopping

from keras.layers.core import Dropout

from joblib import dump, load

In [None]:
import py_plots
from py_plots import precisionmeasures as pm

In [None]:
%%HTML
<style type="text/css">
table.dataframe td, table.dataframe th {
    border: 1px  black solid !important;
  color: black !important;
}
</style>

In [None]:
def performance_metrics_table(test,pred,feature):
    '''Inputs:
            test = actual labels of test set
            pred = model predictions for the test set
            feature = feature name
            
            Computes macro- and micro- precision, recall and F1-score
        Output:
            Multi-index data frame with 3 precision measures 
    '''
    temp_dict = {'Performance':['Precision','Recall','F1-Score']}
    averages = ['micro','macro']
    for average in averages:
        p,r,f,_ = prfs(test,pred,average = average)
        temp_dict[average]= np.round((p,r,f),4)
    temp_df = pd.DataFrame(temp_dict)
    temp_df = pd.melt(temp_df, id_vars=['Performance'], value_vars=averages,
                        var_name='Metric', value_name=feature).set_index(['Metric','Performance'])
    temp_df = temp_df.rename_axis([None,'Performance Measures'])
    return temp_df
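
The table above reports both micro- and macro-averaged scores. As a minimal sketch of how the two averages differ, the snippet below recomputes precision by hand on a few made-up labels (the label values and the helper name `micro_macro_precision` are illustrative assumptions, not part of the notebook). Micro-averaging pools the counts over all classes before dividing, while macro-averaging computes the score per class and then takes the unweighted mean.

```python
from collections import Counter

def micro_macro_precision(y_true, y_pred):
    """Return (micro, macro) precision for single-label multi-class data."""
    classes = sorted(set(y_true) | set(y_pred))
    tp = Counter()         # correct predictions per predicted class
    predicted = Counter()  # total predictions per class
    for t, p in zip(y_true, y_pred):
        predicted[p] += 1
        if t == p:
            tp[p] += 1
    # Micro: pool the counts over all classes, then divide once.
    micro = sum(tp.values()) / sum(predicted.values())
    # Macro: precision per class, then the unweighted mean over classes.
    per_class = [tp[c] / predicted[c] if predicted[c] else 0.0 for c in classes]
    macro = sum(per_class) / len(per_class)
    return micro, macro

# Hypothetical toy labels: 0 = Hate, 1 = Offensive, 2 = Neutral
y_true_toy = [0, 0, 1, 1, 1, 2]
y_pred_toy = [0, 1, 1, 1, 2, 2]
micro, macro = micro_macro_precision(y_true_toy, y_pred_toy)
print(round(micro, 4), round(macro, 4))
```

For single-label problems like this one, micro-averaged precision equals micro-averaged recall and plain accuracy, so the macro row is usually the more informative one when per-class behaviour differs.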

# 1. Data upload

In [None]:
class_names = ['Hate','Offensive','Neutral']
path = "datasets/balanced_dataset.csv"

In [None]:
# load the dataset
data = pd.read_csv(path)
# drop any rows with null (after preprocessing)
data = data.dropna()
# print first 5 rows of the data set
data.head()

# 2. Split dataset into training-validation-test sets

In [None]:
# Split the dataset into training and test sets (2:1)
X_train, x_test, Y_train, y_test = train_test_split(data.clean_tweet, data.labels, test_size=0.33, random_state=42)

# maximum word count of tweets in the training set
max_length = np.max([len(tweet.split()) for tweet in X_train])

print('Maximum length (word-count) of tweets in the training set: {}\n'.format(max_length))


# Split the training dataset further into training and validation sets (2:1)
x_train, x_val, y_train, y_val = train_test_split(X_train, Y_train, test_size=0.33, random_state=42)

y_train_onehot = to_categorical(y_train)
y_val_onehot = to_categorical(y_val)
y_test_onehot = to_categorical(y_test)

# Print
print('=='*15)
print('Training-Validation-Test Split')
print('=='*15)
print('Size of training data: {}'.format(len(y_train)))
print('..'*15)
print('Size of validation data: {}'.format(len(y_val)))
print('..'*15)
print('Size of test data: {}'.format(len(y_test)))
print('..'*15)
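
Because the second split is applied to what remains after the first, the overall shares are not 1/3 each. The short calculation below is only a sketch of the arithmetic implied by the two `test_size=0.33` calls; the exact row counts depend on how `train_test_split` rounds, so the printed sizes above remain the authoritative numbers.

```python
test_fraction = 0.33  # first split: share held out as the test set
val_fraction = 0.33   # second split: share of the remainder used for validation

remaining = 1.0 - test_fraction
overall = {
    "train": remaining * (1.0 - val_fraction),
    "validation": remaining * val_fraction,
    "test": test_fraction,
}
for name, share in overall.items():
    print("{:<10s} {:.1%}".format(name, share))
```

Roughly 45% of the rows end up in the training set, 22% in the validation set and 33% in the test set.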

# 2. Word vectorization

## 2.1 Tokenization

In [None]:
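# Initialize the tokenizer and fit it on the training and validation texts
tokenizer = Tokenizer()
# pd.concat replaces Series.append, which was removed in pandas 2.0
tokenizer.fit_on_texts(pd.concat([x_train, x_val]))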
vocab_size = len(tokenizer.word_index)+1
print('Total size of training (including validation set) vocabulary: {} words'.format(vocab_size))
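
To make the `+1` in `vocab_size` concrete, here is a simplified, pure-Python sketch of the idea behind a frequency-ranked word index. It is not the Keras implementation (which also strips punctuation, among other things); the helper names and the toy tweets are assumptions for illustration. Indices start at 1, so index 0 stays free for padding, and words unseen during fitting are simply dropped when texts are converted.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer rank, most frequent first, starting at 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() orders by descending count; index 0 is left unused for padding
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index):
    """Replace each known word by its index; unknown words are skipped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

toy_tweets = ["good morning all", "good night", "good morning"]  # hypothetical data
toy_word_index = build_word_index(toy_tweets)
toy_vocab_size = len(toy_word_index) + 1  # +1 accounts for the reserved index 0

print(toy_word_index)
print(to_sequences(["good evening all"], toy_word_index))
```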

## 2.2 Zero-padding

In [None]:
sequence_train = tokenizer.texts_to_sequences(x_train)
padded_train = pad_sequences(sequence_train, maxlen=max_length, padding='post') 

sequence_val = tokenizer.texts_to_sequences(x_val)
padded_val = pad_sequences(sequence_val, maxlen=max_length, padding='post') 

sequence_test = tokenizer.texts_to_sequences(x_test)
padded_test = pad_sequences(sequence_test, maxlen=max_length, padding='post') 
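
The sketch below mimics, in plain Python, what `padding='post'` does to sequences of different lengths. The helper `pad_post` is hypothetical and written only for illustration. It also reproduces the default truncation behaviour of `pad_sequences` (`truncating='pre'`): a validation or test tweet longer than `max_length`, which was measured on the training portion only, loses its first tokens rather than its last.

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad with `value`; longer sequences keep only their last `maxlen` items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # drop tokens from the front if too long
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

toy_padded = pad_post([[5, 2], [7, 1, 4, 9, 3]], maxlen=4)
print(toy_padded)
```

If keeping the start of long tweets matters more than keeping the end, passing `truncating='post'` to `pad_sequences` is the corresponding option.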

In [None]:
# Load the pre-computed embedding matrix for the words in the vocabulary
embed_dim = 300
embedding_matrix = pd.read_pickle("model/GloVe_matrix.pkl").values
print('Shape of embedding matrix is {} x {}'.format(embedding_matrix.shape[0], embedding_matrix.shape[1]))
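
The notebook loads the matrix from a pickle and does not show how it was built. A common construction, sketched below under that assumption, places the pretrained vector of the word with index `i` in row `i`, leaving row 0 (padding) and words without a pretrained vector as zeros. The helper name and the toy vectors are hypothetical.

```python
def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector of the word with index i; row 0 and unknown words stay zero."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, idx in word_index.items():
        if word in vectors:
            matrix[idx] = list(vectors[word])
    return matrix

toy_index = {"good": 1, "morning": 2, "zzz": 3}
toy_vectors = {"good": [0.1, 0.2], "morning": [0.3, 0.4]}  # "zzz" has no pretrained vector
toy_matrix = build_embedding_matrix(toy_index, toy_vectors, dim=2)
print(len(toy_matrix), "x", len(toy_matrix[0]))
```

Whatever the construction, the loaded matrix needs to have shape `(vocab_size, embed_dim)` to be accepted as initial weights by the `Embedding` layer used below, so comparing the printed shape with `vocab_size` is a cheap sanity check.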

# 3. Multilayer Perceptron

## 3.1  5-Hidden layers

Parameter Grid:

    - Hidden Layer 1: # of neurons [512, 256]
    - Hidden Layer 2: # of neurons [256, 128]
    - Hidden Layer 3: # of neurons [128, 64]
    - Hidden Layer 4: # of neurons [64, 16]
    - Hidden Layer 5: # of neurons [16, 8]
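
To get a feel for the cost of this grid before running it, the snippet below enumerates the combinations with `itertools.product` and multiplies by the number of cross-validation folds used in the next cell. The variable names are illustrative; only the grid values and `cv=5` come from the notebook.

```python
from itertools import product

grid_spec = {
    "neuronsHL1": [512, 256],
    "neuronsHL2": [256, 128],
    "neuronsHL3": [128, 64],
    "neuronsHL4": [64, 16],
    "neuronsHL5": [16, 8],
}
cv_folds = 5

# Every combination of one value per hidden layer
combinations = [dict(zip(grid_spec, values)) for values in product(*grid_spec.values())]
n_fits = len(combinations) * cv_folds
print(len(combinations), "combinations,", n_fits, "cross-validation fits")
```

With the default `refit=True`, `GridSearchCV` trains one additional model on the full training data using the best combination.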

In [None]:
def create_model(neuronsHL1 = 256, neuronsHL2 = 128, neuronsHL3 = 64, neuronsHL4 = 16, neuronsHL5 = 8):
    print('=='*15)
    print(neuronsHL1, neuronsHL2, neuronsHL3, neuronsHL4, neuronsHL5)
    print('=='*15)
    model = Sequential()
    model.add(Embedding(vocab_size, embed_dim, weights=[embedding_matrix], input_length=max_length, trainable=True))
    model.add(Flatten())
    model.add(Dense(neuronsHL1, activation='relu'))
    model.add(Dense(neuronsHL2, activation='relu'))
    model.add(Dense(neuronsHL3, activation='relu'))
    model.add(Dense(neuronsHL4, activation='relu'))
    model.add(Dense(neuronsHL5, activation='relu'))
    model.add(Dense(3, activation='softmax'))
    model.compile(optimizer='adam',loss='categorical_crossentropy',metrics=['acc'])
    return model

model = KerasClassifier(build_fn = create_model, epochs = 30, verbose = 0)

param_grid = dict(neuronsHL1=[512,256],
                  neuronsHL2=[256,128],
                  neuronsHL3=[128,64],
                  neuronsHL4=[64,16],
                  neuronsHL5=[16,8])


grid = GridSearchCV(estimator=model, cv =5, param_grid=param_grid, n_jobs=1)

es = EarlyStopping(monitor='val_loss',patience=1)

grid_result = grid.fit(padded_train,y_train_onehot,epochs=30,batch_size=128,
                        validation_data=(padded_val, y_val_onehot),
                       callbacks=[es], verbose=2)


print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

# Model predictions
pred = grid_result.predict(padded_test)
tbl = performance_metrics_table(y_test,pred,'Best 6 layer fit')
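
The callback `EarlyStopping(monitor='val_loss', patience=1)` used above ends training the first time the validation loss fails to improve. The function below is a simplified sketch of that rule (it ignores options such as `min_delta`), with a made-up loss curve; `early_stop_epoch` is a hypothetical helper, not a Keras function.

```python
def early_stop_epoch(val_losses, patience=1):
    """Return the 1-based epoch after which training stops, or None if it runs to the end."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0   # improvement: remember it and reset the counter
        else:
            wait += 1              # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

toy_losses = [0.90, 0.70, 0.80, 0.60]  # hypothetical validation losses
print(early_stop_epoch(toy_losses, patience=1))
print(early_stop_epoch(toy_losses, patience=2))
```

In this toy curve, `patience=1` stops after the third epoch and never reaches the lower loss of the fourth, which is why a slightly larger patience may be worth trying if the validation loss is noisy.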

## 3.2  4-Hidden layers

Parameter Grid:

    - Hidden Layer 1: # of neurons [128, 64]
    - Hidden Layer 2: # of neurons [64,32]
    - Hidden Layer 3: # of neurons 16
    - Hidden Layer 4: # of neurons 8

In [None]:
def create_model(neuronsHL1 = 128, neuronsHL2 = 64):
    print('=='*15)
    print(neuronsHL1, neuronsHL2)
    print('=='*15)
    model = Sequential()
    model.add(Embedding(vocab_size, embed_dim, weights=[embedding_matrix], input_length=max_length, trainable=True))
    model.add(Flatten())
    model.add(Dense(neuronsHL1, activation='relu'))
    model.add(Dense(neuronsHL2, activation='relu'))
    model.add(Dense(16, activation='relu'))
    model.add(Dense(8, activation='relu'))
    model.add(Dense(3, activation='softmax'))
    model.compile(optimizer='adam',loss='categorical_crossentropy',metrics=['acc'])
    return model

model = KerasClassifier(build_fn = create_model, epochs = 30, verbose = 0)

param_grid = dict(neuronsHL1=[128,64],
                  neuronsHL2=[64,32])

grid = GridSearchCV(estimator=model, cv =5, param_grid=param_grid, n_jobs=1)

es = EarlyStopping(monitor='val_loss',patience=1)

grid_result = grid.fit(padded_train,y_train_onehot,epochs=30,batch_size=128,
                        validation_data=(padded_val, y_val_onehot),
                       callbacks=[es], verbose=2)


print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

# Model predictions
pred = grid_result.predict(padded_test)
tbl = tbl.join(performance_metrics_table(y_test,pred,'Best 4 layer fit'))

## 3.3  3-Hidden layers

Parameter Grid:

    - Hidden Layer 1: # of neurons 128
    - Hidden Layer 2: # of neurons [64, 32]
    - Hidden Layer 3: # of neurons [16, 8]


In [None]:
def create_model(neuronsHL1 = 64, neuronsHL2 = 8):
    print('=='*15)
    print(neuronsHL1, neuronsHL2)
    print('=='*15)
    model = Sequential()
    model.add(Embedding(vocab_size, embed_dim, weights=[embedding_matrix], input_length=max_length, trainable=True))
    model.add(Flatten())
    model.add(Dense(128, activation='relu'))
    model.add(Dense(neuronsHL1, activation='relu'))
    model.add(Dense(neuronsHL2, activation='relu'))
    model.add(Dense(3, activation='softmax'))
    model.compile(optimizer='adam',loss='categorical_crossentropy',metrics=['acc'])
    return model

model = KerasClassifier(build_fn = create_model, epochs = 30, verbose = 0)

param_grid = dict(neuronsHL1=[64,32],
                  neuronsHL2=[16,8])

grid = GridSearchCV(estimator=model, cv =5, param_grid=param_grid, n_jobs=1)

es = EarlyStopping(monitor='val_loss',patience=1)

grid_result = grid.fit(padded_train,y_train_onehot,epochs=30,batch_size=128,
                        validation_data=(padded_val, y_val_onehot),
                       callbacks=[es], verbose=2)


print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
    
# Model predictions
pred = grid_result.predict(padded_test)
tbl = tbl.join(performance_metrics_table(y_test,pred,'Best 3 layer fit'))

In [None]:
print('=='*10, 'Best MLP Model fits', '=='*10)
tbl