# LSTM Model

This notebook specifies the build for our LSTM model.
It uses TensorFlow 2.0. If you are training the model, a GPU and at least ~32 GB of RAM are strongly recommended.

In [None]:
%load_ext tensorboard

In [1]:
# Import general modules
import numpy as np
import pandas as pd 
from ast import literal_eval
import datetime, os

In [2]:
# Import scikit learn modules
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report

In [3]:
# Import tensorflow and keras modules. NOTE: WE ARE USING TENSORFLOW 2.0
import tensorflow as tf
from tensorflow.keras.preprocessing import text, sequence
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.layers import Embedding, Dense, LSTM, MaxPooling1D, Input, GlobalAveragePooling1D, GlobalMaxPooling1D
from tensorflow.keras.layers import Bidirectional, Dropout
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.metrics import AUC


In [4]:
# Here we define the various parameters that will be used in the model.

# General set-up parameters
TRAIN_TEXT_COL = 'comment_text_clean2'
TEST_TEXT_COL = 'comment_text_clean2'
TRAIN_TARGET_COL = 'target'
TEST_TARGET_COL = 'target'
EMBEDDING_FILE = 'embeds/glove.840B.300d.txt' #change this if you are using a different embedding file

# Model hyperparameters - the first 3 refer more to the data itself. Reducing the vocab size and max sequence length can
# have negative impact on accuracy, but will likely lead to a model that trains faster and is less memory intensive. 
# There are similar effects to using lower dimension word embeddings. Changing the embedding dim requires you to have
# the specific word-embedding file available. 

MAX_VOCAB_SIZE = 200000 # there are 563693 words in the vocabulary 
MAX_LEN_SEQ = 300
EMBED_DIM = 300 #change this if you have chosen different embedding dimensions
DROPOUT_RATE = 0.2
LSTM_UNITS = 128
BATCH_SIZE = 128
NUM_EPOCHS = 4

# Location to save training checkpoints
CHECKPOINT_PATH = "NN_models/cp.ckpt"
CHECKPOINT_DIR = os.path.dirname(CHECKPOINT_PATH)
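
The memory trade-off described in the comments above can be put into rough numbers. The sketch below is only an estimate of the embedding matrix itself, assuming it is held as a dense NumPy array; the helper name `embedding_matrix_bytes` is ours for illustration and is not used elsewhere in the notebook.

```python
def embedding_matrix_bytes(n_rows, embed_dim, bytes_per_value):
    """Size in bytes of a dense (n_rows x embed_dim) matrix."""
    return n_rows * embed_dim * bytes_per_value

FULL_VOCAB = 563693       # vocabulary size quoted in the parameter cell
MAX_VOCAB_SIZE = 200000
EMBED_DIM = 300

# np.zeros defaults to float64 (8 bytes per value); +1 row for the padding index 0
full_f64 = embedding_matrix_bytes(FULL_VOCAB + 1, EMBED_DIM, 8)
# Hypothetical alternative: only MAX_VOCAB_SIZE rows, stored as float32 (4 bytes)
capped_f32 = embedding_matrix_bytes(MAX_VOCAB_SIZE + 1, EMBED_DIM, 4)

print(f"full vocabulary, float64:   {full_f64 / 1e9:.2f} GB")
print(f"capped vocabulary, float32: {capped_f32 / 1e9:.2f} GB")
```

Note that `build_embedding_matrix` further down sizes the matrix by the length of the tokenizer's word index, so the first figure is the closer one for the notebook as written.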


#### Loading data from S3 for cloud computing

In [None]:
import boto3
s3 = boto3.client('s3')
s3.download_file(bucket, dataset_file_path_train, 'train_for_nn.csv')
s3.download_file(bucket, dataset_file_path_test, 'test_for_nn.csv')

# When the data set was saved as a CSV, the tokenized column, which was a list, was converted to a string.
# The converters option changes this back into its list form
train_data = pd.read_csv('train_for_nn.csv', converters={"comment_text_clean2": literal_eval})
test_data = pd.read_csv('test_for_nn.csv', converters={"comment_text_clean2": literal_eval})
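
To see why the `converters` option is needed, here is a minimal standard-library sketch of the same round trip. It does not use pandas; the token list is a made-up example.

```python
import csv
import io
from ast import literal_eval

tokens = ["this", "is", "a", "comment"]

# Writing a list to CSV stores its string representation
buffer = io.StringIO()
csv.writer(buffer).writerow([str(tokens)])

# Reading it back gives a plain string, not a list
stored = next(csv.reader(io.StringIO(buffer.getvalue())))[0]
print(type(stored).__name__)

# literal_eval safely parses the string back into a Python list
restored = literal_eval(stored)
print(type(restored).__name__, restored)
```

Passing `literal_eval` as a converter makes `pd.read_csv` apply this parsing step to every cell of the named column.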

#### Loading data from local machine

In [7]:
train_data = pd.read_csv('data/train_for_nn.csv', converters={"comment_text_clean2": literal_eval})
test_data = pd.read_csv('data/test_for_nn.csv', converters={"comment_text_clean2": literal_eval})

In [8]:
# Drop unnamed col 
train_data.drop(['Unnamed: 0'], axis=1, inplace=True)
test_data.drop(['Unnamed: 0'], axis=1, inplace=True)

In [9]:
# Create train val split, stratify on target
train_df, val_df = train_test_split(train_data, test_size=0.2, stratify=train_data['target'], random_state=1)
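
Stratifying on the target keeps the share of toxic comments the same in the training and validation splits. The sketch below illustrates the idea in plain Python on toy labels; it is not scikit-learn's implementation, and `stratified_split` is a hypothetical helper.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size, seed):
    """Return (train_idx, val_idx) with each class split in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * test_size)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return train_idx, val_idx

# Toy labels with 10% positives
labels = [1] * 10 + [0] * 90
train_idx, val_idx = stratified_split(labels, test_size=0.2, seed=1)

train_pos_rate = sum(labels[i] for i in train_idx) / len(train_idx)
val_pos_rate = sum(labels[i] for i in val_idx) / len(val_idx)
print(train_pos_rate, val_pos_rate)
```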

In [106]:
# save down the train_df - we will need this in the future to fit the tokenizer to
train_df.to_csv('data/nn_tokenizer_fit_data.csv')

#### Model Methods
Below are the various methods required to run the model
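
As a rough picture of what the tokenizer and padding helpers do, the following plain-Python sketch mimics the behaviour we rely on: words are indexed by frequency (index 0 is reserved for padding), words whose index is at or above the vocabulary limit are dropped, and sequences are padded and truncated at the front, which is the Keras default. The function names and toy comments are ours; the real work is done by the Keras `Tokenizer` and `pad_sequences`.

```python
from collections import Counter

def fit_word_index(token_lists):
    """Rank words by frequency: index 1 is the most frequent word, 0 is kept for padding."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

def to_padded(token_lists, word_index, num_words, maxlen):
    """Map tokens to indexes, drop indexes >= num_words, then pad/truncate at the front."""
    padded = []
    for toks in token_lists:
        seq = [word_index[t] for t in toks
               if t in word_index and word_index[t] < num_words]
        seq = seq[-maxlen:]                        # keep the last maxlen tokens
        padded.append([0] * (maxlen - len(seq)) + seq)
    return padded

docs = [
    ["you", "are", "great"],
    ["great", "you", "you", "are", "day"],
    ["today", "you"],
]
word_index = fit_word_index(docs)
sequences = to_padded(docs, word_index, num_words=4, maxlen=3)
print(word_index)
print(sequences)
```

With `num_words=4` only the three most frequent words survive, so rare words such as "day" and "today" disappear from the sequences instead of being mapped to 0.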

In [10]:
# Create and fit tokenizer
def train_tokenizer(train_data, vocab_size):
    '''
    Function to train the Keras tokenizer to create a vocabulary dictionary
    
    INPUT:
    train_data - training data column
    int vocab_size - When texts_to_sequences is run, any word with an index at or above this value is dropped
    
    OUTPUT:
    A fitted keras tokenizer
    '''
    # Use Keras tokenizer to create vocabulary dictionary 
    # default arguments will filter punctuation and convert to lower, we do not want this given our use 
    # of pre-trained word embeddings
    tokenizer = text.Tokenizer(num_words = vocab_size, filters='', lower=False)
    tokenizer.fit_on_texts(train_data)
    return tokenizer

# pad tokenized sequences
def text_padder(text, tokenizer):
    '''
    Function to convert texts to sequences of word indexes and to pad these sequences to the max sequence length
    
    INPUT:
    text - text to process
    tokenizer - trained keras tokenizer.
    
    OUTPUT:
    Returns a padded sequence of length MAX_LEN_SEQ
    '''
    
    return sequence.pad_sequences(tokenizer.texts_to_sequences(text), maxlen=MAX_LEN_SEQ)

# Build embedding matrix
def build_embedding_matrix(word_indexes, EMBEDDING_FILE):
    '''
    Function to create the word-embedding matrix to feed into the model input layer
    
    INPUT:
    word_indexes - word-index dictionary from the keras tokenizer
    EMBEDDING_FILE - path to embedding file
    
    OUTPUT:
    numpy array with word indexes and embedding values
    '''
    # Used to store words as key and vectors as value
    embedding_dict = {}
    # read in embedding file
    with open(EMBEDDING_FILE, encoding='utf-8') as file:
        # file is formatted word {whitespace} vector
        for line in file:
            # split from the right so that tokens containing spaces stay intact
            pairs = line.rstrip().rsplit(' ', EMBED_DIM)
            # word is 0 index of pairs
            word = pairs[0]
            vec = pairs[1:]
            # convert vec into a numpy array
            vec = np.asarray(vec, dtype=np.float32)
            embedding_dict[word] = vec
    
    # create the embedding matrix which has dimensions:
    # len(word_indexes) + 1 rows: one row per word in the tokenizer's word index, plus row 0 for padding.
    # Rows for words with an index at or above MAX_VOCAB_SIZE are left as zeros (see the loop below).
    # EMBED_DIM is the number of columns, this reflects the dimensions of the word embedding vectors we are using.
    embedding_matrix = np.zeros((len(word_indexes)+1, EMBED_DIM))


    word_count = 0
    for word, i in word_indexes.items():
        
        # checks if word index is outside max vocab size. If true we just continue
        if i >= MAX_VOCAB_SIZE:
            continue
        
        # gets the vector to the corresponding word from the previous dictionary and sets it to the variable
        embedding_vector = embedding_dict.get(word)
        # We check whether the embedding_vector is not none (i.e the word is in the embedding index)
        if embedding_vector is not None:
            word_count += 1
            # Assign the embedding vector to row i of the embedding matrix. If word is not in embeddings the row stays 0.
            embedding_matrix[i] = embedding_vector
            
    return embedding_matrix
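
The embedding file is plain text with one `token v1 ... vN` entry per line. The sketch below parses a tiny in-memory sample with 3-dimensional vectors and fills a matrix the same way as `build_embedding_matrix`. It splits each line from the right, on the assumption that some tokens in the file may themselves contain spaces; the sample lines, `parse_embedding_lines` and the toy word index are made up for illustration.

```python
import io
import numpy as np

def parse_embedding_lines(lines, embed_dim):
    """Parse 'token v1 ... vN' lines, splitting from the right so tokens with spaces survive."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().rsplit(' ', embed_dim)
        if len(parts) != embed_dim + 1:
            continue  # skip malformed lines
        embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings

sample = io.StringIO("good 0.1 0.2 0.3\n. . . 0.4 0.5 0.6\n")
vectors = parse_embedding_lines(sample, embed_dim=3)

# Row 0 is reserved for padding; words without a vector keep an all-zero row
word_index = {"good": 1, "unseen": 2}
matrix = np.zeros((len(word_index) + 1, 3), dtype=np.float32)
for word, i in word_index.items():
    vec = vectors.get(word)
    if vec is not None:
        matrix[i] = vec

print(matrix)
```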
            

In [7]:
# build model

# NOTE: With TF 2.0 the cuDNN LSTM implementation is used by default when a GPU is available, but only if the layer keeps its default settings.
# See https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM for more details

def build_model(embedding_matrix):
    '''
    Function to build out the model structure. This function can be changed to edit the model architecture or a new
    function can be created to separate these. 
    
    INPUT:
    embedding_matrix - The embedding_matrix created by the build_embedding_matrix function
    
    OUTPUT:
    tensorflow model
    '''
    
    # Input layer, this has shape max_len_seq, which is the length all sequences will be padded to. 
    input_words = Input(shape=(MAX_LEN_SEQ,), dtype='int32')
    
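    # Embedding layer. This is a fixed (non-trainable) layer whose weights are the values stored in the embedding matrix.
    # The vocabulary size is read from the matrix itself so the function does not depend on a global tokenizer.
    embedding = Embedding(embedding_matrix.shape[0], EMBED_DIM,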
                          weights=[embedding_matrix],
                          input_length = MAX_LEN_SEQ,
                          #mask_zero = True
                          trainable = False) (input_words)
    
    # Dropout layer to reduce overfitting
    x = Dropout(DROPOUT_RATE)(embedding)
    
    # Bidirectional LSTM layer. We go over each sequence forwards and backwards to extract as much
    # contextual information as possible
    # Note with TF2.0, to use the cuDNN kernel on GPU we cannot change certain parameters such as the activation function.
    # Second layer has been commented out for now
    x = Bidirectional(LSTM(LSTM_UNITS, activation='tanh', return_sequences=True))(x) 
    #x = Bidirectional(LSTM(LSTM_UNITS, activation='tanh', return_sequences=True))(x)
    
    # Use GlobalMaxPooling
    x = GlobalMaxPooling1D()(x)
    
    # Pass into DENSE layers 
    # Dense nodes total has been calculated as per 
    # https://ai.stackexchange.com/questions/3156/how-to-select-number-of-hidden-layers-and-number-of-memory-cells-in-an-lstm
    # 300,000 / (5 * (128 + 2)) ≈ 462
    x = Dense(462, activation='relu')(x)
    
    # Final output layer: two sigmoid units, one per one-hot encoded class (targets are passed through to_categorical)
    prediction = Dense(2, activation='sigmoid')(x)
    
    model = Model(inputs=input_words, outputs=prediction, name='baseline-LSTM')
    # we use binary_crossentropy loss function given this is a binary classification problem
    # adam has been selected given general performance
    # We check both the classification accuracy and the AUC of the model.
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy', AUC()])
    
    return model
                           
def train_model(train_df, val_df, tokenizer):
    '''
    Function to train a tensorflow model. This function runs a number of the previous functions
    
    INPUT:
    train_df - training data
    val_df - validation data
    tokenizer - fitted keras tokenizer
    
    OUTPUT:
    model - tensorflow model
    fitted_model - fit history
    '''
    # Create processed and padded train and targets
    print('padding_text')
    X_train = text_padder(train_df[TRAIN_TEXT_COL], tokenizer)
    X_val = text_padder(val_df[TRAIN_TEXT_COL], tokenizer)
    y_train = to_categorical(train_df[TRAIN_TARGET_COL])
    y_val = to_categorical(val_df[TRAIN_TARGET_COL])
    
    print('building embedding matrix')
    # build embedding matrix
    embed_matrix = build_embedding_matrix(tokenizer.word_index, EMBEDDING_FILE)
    
    # build model
    print('building model')
    model = build_model(embed_matrix)
    
    # set up checkpoint callbacks
    cp_callback = tf.keras.callbacks.ModelCheckpoint(filepath=CHECKPOINT_PATH,
                                                 save_weights_only=True,
                                                 verbose=1)
    
    # Connect to tensorboard
    #logdir = os.path.join("logs", datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))
    #tensorboard_callback = tf.keras.callbacks.TensorBoard(logdir, histogram_freq=1, write_images=True, 
                                                          #write_graph=False
                                                          #)
   
    # Train the model; batch size and number of epochs were set up earlier.
    print('training model')
    fitted_model = model.fit(X_train, y_train,
                             batch_size = BATCH_SIZE,
                             epochs = NUM_EPOCHS,
                             validation_data=(X_val, y_val),
                             callbacks=[cp_callback],
                             verbose = 1)

    return model, fitted_model
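
The size of the dense layer in `build_model` comes from a rule of thumb of the form N_h = N_s / (alpha * (N_i + N_o)). The sketch below reproduces the arithmetic from the code comment. Reading 300,000 as the number of training samples, 128 as the input size, 2 as the output size and 5 as the scaling factor alpha is our interpretation of that comment, and the function name is illustrative.

```python
def hidden_units_rule_of_thumb(n_samples, n_inputs, n_outputs, alpha):
    """N_h = N_s / (alpha * (N_i + N_o))"""
    return n_samples / (alpha * (n_inputs + n_outputs))

n_hidden = hidden_units_rule_of_thumb(
    n_samples=300_000, n_inputs=128, n_outputs=2, alpha=5
)
print(round(n_hidden))
```

This is a heuristic starting point rather than a tuned value. Note also that the bidirectional layer concatenates both directions, so the pooled output has 2 x 128 = 256 features; using 256 as the input size would give a smaller layer.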

    

#### Training the model

In [11]:
%%time
# Create fitted tokenizer
tokenizer = train_tokenizer(train_df[TRAIN_TEXT_COL], MAX_VOCAB_SIZE)

CPU times: user 47.5 s, sys: 894 ms, total: 48.4 s
Wall time: 48.4 s


In [None]:
# Double check tensorflow can find GPU
tf.config.experimental.list_physical_devices('GPU')

In [None]:
# Call the model training function
model, fitted_model = train_model(train_df, val_df, tokenizer)

In [None]:
#save full model  using TF saved model format
model.save('NN_model/saved_model/baseline-LSTM') 

#saves full model using h5 format
model.save('NN_model/saved_model_h5/baseline-LSTM.h5')
    
#save weights
model.save_weights('NN_model/saved_weights/baseline-LSTM')
model.save_weights('NN_model/saved_weights_h5/baseline-LSTM.h5')

In [79]:
# Pass trained tokenizer to convert test results to sequences
X_test = text_padder(test_data[TEST_TEXT_COL], tokenizer)

#convert target col to categorical 
y_test = to_categorical(test_data[TEST_TARGET_COL])

In [None]:
# evaluate on test set; evaluate returns the loss followed by each compiled metric (accuracy, AUC)
loss, acc, auc = model.evaluate(X_test, y_test, batch_size = BATCH_SIZE)

In [None]:
# Predict the test set 
test_preds = model.predict(X_test)

In [None]:
# Create a new dataframe to add test set predictions
test_pred_results = pd.DataFrame(test_data['id'])

In [None]:
# Create new columns to add the test predictions
test_pred_results['prediction_prob_0'] = test_preds[:,0]
test_pred_results['prediction_prob_1'] = test_preds[:,1]

In [None]:
# save results to csv
test_pred_results.to_csv('data/LSTM_pred_results.csv')

**NOTE: To predict on new text, the string must first be passed through the same pre-processing functions as the training data (e.g. text_padder()). This also requires a tokenizer fitted on the training data.**

#### Assessing Model Performance


In [26]:
# load in test predictions
test_preds = pd.read_csv('data/LSTM_pred_results.csv')

In [27]:
#drop unnamed column
test_preds.drop('Unnamed: 0', axis=1, inplace=True)
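# errors='ignore' avoids a KeyError if the column was already dropped earlier in the session
test_data.drop('Unnamed: 0', axis=1, inplace=True, errors='ignore')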

In [28]:
#merge the predictions onto the test dataframe on id
test_results = test_data.merge(test_preds, how='inner', on='id')

In [29]:
# define identity columns
identity_columns = [
    'male', 'female', 'homosexual_gay_or_lesbian', 'christian', 'jewish',
    'muslim', 'black', 'white', 'psychiatric_or_mental_illness']

# convert identity and target columns to boolean
for col in identity_columns + ['target']:
    test_results[col] = np.where(test_results[col] >= 0.5, True, False)
    
# create a binary col for prediction of class 1 (toxic)
test_results['prediction_binary'] = np.where(test_results['prediction_prob_1'] >= 0.5, True, False)

In [30]:
# store the precision, recall, and f1 score for later and print the classification report
nn_precision = precision_score(test_results['target'], test_results['prediction_binary'])
nn_recall = recall_score(test_results['target'], test_results['prediction_binary'])
nn_f1 = f1_score(test_results['target'], test_results['prediction_binary'])

print(classification_report(test_results['target'], test_results['prediction_binary']))

              precision    recall  f1-score   support

       False       0.97      0.98      0.97    179192
        True       0.75      0.63      0.68     15448

    accuracy                           0.95    194640
   macro avg       0.86      0.80      0.83    194640
weighted avg       0.95      0.95      0.95    194640



We can see that the model is very strong at predicting the non-toxic (False) class, but less adept at identifying toxic comments. A recall of 0.63 for the toxic class means that roughly 37% of toxic comments are being let through. This is most likely due to the very large class imbalance we noted during our EDA: the model has far fewer toxic comments to train on than non-toxic ones, which impairs its ability to learn what constitutes a toxic comment.
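
To make these figures concrete, the short check below recomputes the F1 score for the toxic class from the rounded precision and recall in the report, and estimates how many toxic comments are missed. Because the inputs are rounded to two decimals, the count is only approximate.

```python
toxic_support = 15448             # 'True' support from the classification report
precision, recall = 0.75, 0.63    # rounded values from the report

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

# Recall is the share of toxic comments that were caught, so 1 - recall is the share missed
approx_missed = round((1 - recall) * toxic_support)

print(f"F1 = {f1:.2f}")
print(f"approximately {approx_missed} toxic comments missed")
```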

In the future we will re-run all models with up-sampling and down-sampling applied, to see whether this leads to better precision and recall for the positive class.
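
As a sketch of what that resampling could look like, the plain-Python helpers below balance a toy dataset by random down-sampling of the majority class or random up-sampling of the minority class. The function names and the toy rows are hypothetical; this is one simple approach, not the method we have settled on.

```python
import random

def split_by_class(rows, label_key):
    pos = [r for r in rows if r[label_key]]
    neg = [r for r in rows if not r[label_key]]
    # Return (minority, majority)
    return (pos, neg) if len(pos) <= len(neg) else (neg, pos)

def downsample_majority(rows, label_key, seed):
    """Randomly drop majority-class rows until both classes are the same size."""
    rng = random.Random(seed)
    minority, majority = split_by_class(rows, label_key)
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)
    return balanced

def upsample_minority(rows, label_key, seed):
    """Repeat minority-class rows (sampled with replacement) until both classes are the same size."""
    rng = random.Random(seed)
    minority, majority = split_by_class(rows, label_key)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    balanced = majority + minority + extra
    rng.shuffle(balanced)
    return balanced

rows = [{"id": i, "target": i < 8} for i in range(100)]   # 8% positives
down = downsample_majority(rows, "target", seed=1)
up = upsample_minority(rows, "target", seed=1)
print(len(down), len(up))
```

Resampling should be applied to the training split only, so that validation and test metrics still reflect the real class balance.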

(see ML_models.ipynb for direct comparison to our other models)

In [269]:
# Define subgroup metrics

SUBGROUP_AUC = 'subgroup_auc'
BPSN_AUC = 'bpsn_auc'  # stands for background positive, subgroup negative
BNSP_AUC = 'bnsp_auc'  # stands for background negative, subgroup positive


# These calculations have been provided by Jigsaw AI for scoring based on the metrics of the kaggle competition
# https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/overview/evaluation

# They work by filtering the relevant dataframe into specific subgroups and using the roc_auc_score metric from sklearn.

def compute_auc(y_true, y_pred):
    try:
        return metrics.roc_auc_score(y_true, y_pred)
    except ValueError:
        return np.nan

def compute_subgroup_auc(df, subgroup, label, model_name):
    subgroup_examples = df[df[subgroup]]
    return compute_auc(subgroup_examples[label], subgroup_examples[model_name])

def compute_bpsn_auc(df, subgroup, label, model_name):
    """Computes the AUC of the within-subgroup negative examples and the background positive examples."""
    subgroup_negative_examples = df.loc[df[subgroup] & ~df[label]]
    non_subgroup_positive_examples = df.loc[~df[subgroup] & df[label]]
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    examples = pd.concat([subgroup_negative_examples, non_subgroup_positive_examples])
    return compute_auc(examples[label], examples[model_name])

def compute_bnsp_auc(df, subgroup, label, model_name):
    """Computes the AUC of the within-subgroup positive examples and the background negative examples."""
    subgroup_positive_examples = df.loc[df[subgroup] & df[label]]
    non_subgroup_negative_examples = df.loc[~df[subgroup] & ~df[label]]
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    examples = pd.concat([subgroup_positive_examples, non_subgroup_negative_examples])
    return compute_auc(examples[label], examples[model_name])

def compute_bias_metrics_for_model(dataset,
                                   subgroups,
                                   model,
                                   label_col,
                                   include_asegs=False):
    """Computes per-subgroup metrics for all subgroups and one model."""
    records = []
    for subgroup in subgroups:
        record = {
            'subgroup': subgroup,
            'subgroup_size': len(dataset.loc[dataset[subgroup]])
        }
        record[SUBGROUP_AUC] = compute_subgroup_auc(dataset, subgroup, label_col, model)
        record[BPSN_AUC] = compute_bpsn_auc(dataset, subgroup, label_col, model)
        record[BNSP_AUC] = compute_bnsp_auc(dataset, subgroup, label_col, model)
        records.append(record)
    return pd.DataFrame(records).sort_values('subgroup_auc', ascending=True)

def calculate_overall_auc(df, model_name):
    true_labels = df[TOXICITY_COLUMN]
    predicted_labels = df[model_name]
    return metrics.roc_auc_score(true_labels, predicted_labels)

def power_mean(series, p):
    total = sum(np.power(series, p))
    return np.power(total / len(series), 1 / p)

def get_final_metric(bias_df, overall_auc, POWER=-5, OVERALL_MODEL_WEIGHT=0.25):
    bias_score = np.average([
        power_mean(bias_df[SUBGROUP_AUC], POWER),
        power_mean(bias_df[BPSN_AUC], POWER),
        power_mean(bias_df[BNSP_AUC], POWER)
    ])
    return (OVERALL_MODEL_WEIGHT * overall_auc) + ((1 - OVERALL_MODEL_WEIGHT) * bias_score)
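
To illustrate how the three subgroup metrics differ, the sketch below applies the same filters to a handful of made-up rows. It uses a small pairwise AUC (the share of positive/negative pairs that are ranked correctly) in place of `roc_auc_score` so that it runs without any dependencies; the rows and helper names are purely illustrative.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy rows: (in_subgroup, is_toxic, model_score)
rows = [
    (True,  False, 0.70),   # non-toxic comment mentioning the identity, scored high
    (True,  False, 0.20),
    (True,  True,  0.90),
    (False, True,  0.60),   # toxic background comments
    (False, True,  0.80),
    (False, False, 0.10),
    (False, False, 0.30),
]

def subset_auc(rows, keep):
    kept = [r for r in rows if keep(r)]
    return pairwise_auc([r[1] for r in kept], [r[2] for r in kept])

# Subgroup AUC: only rows inside the subgroup
subgroup_auc = subset_auc(rows, lambda r: r[0])
# BPSN: background positives + subgroup negatives
bpsn_auc = subset_auc(rows, lambda r: (r[0] and not r[1]) or (not r[0] and r[1]))
# BNSP: background negatives + subgroup positives
bnsp_auc = subset_auc(rows, lambda r: (r[0] and r[1]) or (not r[0] and not r[1]))

print(subgroup_auc, bpsn_auc, bnsp_auc)
```

In this toy data only the BPSN AUC drops, because one non-toxic subgroup comment is scored above a toxic background comment. A low BPSN AUC is the signature of a model that over-flags comments that merely mention an identity.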
    



In [33]:
SUBGROUP_AUC = 'subgroup_auc'
BPSN_AUC = 'bpsn_auc'  # stands for background positive, subgroup negative
BNSP_AUC = 'bnsp_auc'

MODEL_NAME = 'prediction_prob_1'
TOXICITY_COLUMN = 'target'

#log_bias_metrics_df_train = compute_bias_metrics_for_model(train_df, identity_columns, MODEL_NAME, TOXICITY_COLUMN)
#log_final_metric_train = get_final_metric(log_bias_metrics_df_train, calculate_overall_auc(train_df, MODEL_NAME))

nn_bias_metrics_df_test = compute_bias_metrics_for_model(test_results, identity_columns, MODEL_NAME, TOXICITY_COLUMN)
nn_final_metric_test = get_final_metric(nn_bias_metrics_df_test, calculate_overall_auc(test_results, MODEL_NAME))

In [34]:
nn_bias_metrics_df_test

| | subgroup | subgroup_size | subgroup_auc | bpsn_auc | bnsp_auc |
|---|---|---|---|---|---|
| 2 | homosexual_gay_or_lesbian | 1065 | 0.814429 | 0.834683 | 0.963505 |
| 6 | black | 1519 | 0.825677 | 0.814967 | 0.973397 |
| 5 | muslim | 2040 | 0.845463 | 0.856942 | 0.966448 |
| 7 | white | 2452 | 0.849332 | 0.826127 | 0.975361 |
| 0 | male | 4386 | 0.9104 | 0.916824 | 0.96362 |
| 4 | jewish | 835 | 0.914738 | 0.898723 | 0.972254 |
| 8 | psychiatric_or_mental_illness | 511 | 0.916247 | 0.896643 | 0.97216 |
| 1 | female | 5155 | 0.921562 | 0.928183 | 0.962616 |
| 3 | christian | 4226 | 0.933272 | 0.943499 | 0.958967 |


In [35]:
nn_final_metric_test

0.9196725784887158
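
The final metric combines the per-subgroup AUCs with a power mean using p = -5, which weights the weakest subgroups more heavily than a plain average would. The sketch below shows this effect on the subgroup_auc column from the table above; it is a plain-Python restatement of `power_mean` for illustration only.

```python
def power_mean(values, p):
    """Generalised (power) mean: (mean of v**p) ** (1/p)."""
    return (sum(v ** p for v in values) / len(values)) ** (1 / p)

# subgroup_auc column from the bias metrics table
subgroup_aucs = [0.814429, 0.825677, 0.845463, 0.849332, 0.9104,
                 0.914738, 0.916247, 0.921562, 0.933272]

arithmetic = sum(subgroup_aucs) / len(subgroup_aucs)
generalised = power_mean(subgroup_aucs, p=-5)

print(f"arithmetic mean:     {arithmetic:.4f}")
print(f"power mean (p = -5): {generalised:.4f}")
```

The power mean sits below the arithmetic mean, so a model cannot compensate for one poorly served subgroup by doing very well on the others.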

The final bias metric and overall accuracy of our LSTM model are very encouraging. On the final weighted AUC (0.92), the model performed significantly better than our classical ML models. Looking at the individual identity subgroups, no group has a subgroup AUC below 0.81, although scores are lowest for homosexual_gay_or_lesbian, black, muslim and white (0.81 to 0.85) and highest for christian (0.93).

While this is a good result in terms of our stated aim of reducing bias, we are interested to see how adjusting for the existing class imbalance affects model performance, especially precision and recall for toxic comments.

------