# Continuation - Classifying toxic comments with deep learning
 
This notebook continues my [previous](./01_ToxicCommentsOnAWS.ipynb) attempt to classify comments on Wikipedia articles according to their verbal toxicity while getting to know AWS SageMaker. In the previous notebook I used Scikit-Learn implementations of shallow, non-sequential ML algorithms and the Scikit-Learn-specific parts of the AWS SageMaker Python SDK. This time I'll use 1D convolutional (Conv1D) and LSTM neural nets implemented in TensorFlow and the corresponding parts of the SageMaker Python SDK to define the model in this notebook, launch a separate training job, and then host the trained model on an endpoint instance. To keep things a little simpler, I won't consider specific types of verbal toxicity this time, only whether a comment is toxic at all.

## Setup
First we need to set up everything to run on AWS, namely the S3 bucket and IAM role.

In [1]:
import boto3
import re
import sagemaker
from sagemaker import get_execution_role

# Define IAM role
region = boto3.Session().region_name    
smclient = boto3.Session().client("sagemaker")
role = get_execution_role()

# S3 bucket
bucket = '<my-bucket-name>'
prefix = 'sagemaker/toxic-comments'

# get the same zipped training and test data from S3 as last time
# (the zip already contains differently preprocessed versions to save some time)
s3 = boto3.resource("s3")
s3.Bucket(bucket).download_file("jigsaw-toxic-comment-classification-challenge.zip",
                                "local-jigsaw-toxic-comment-classification-challenge.zip")

# unzip the data
import zipfile
with zipfile.ZipFile("local-jigsaw-toxic-comment-classification-challenge.zip", 'r') as zip_ref:
    zip_ref.extractall("./data")

In [2]:
# data processing and computation
import pandas as pd
import numpy as np

import os

# deep learning and corresponding data preprocessing
import tensorflow as tf
import tensorflow.keras as keras
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.preprocessing.text import Tokenizer




## Prepare the data
We already explored the data in detail in the first notebook, so we'll just do the processing here. We will start with the same train and test data.

In [3]:
# load training data
train_df = pd.read_csv("./data/train.csv")
train_df.sample(5)

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
24991,42176424aa902f06,Policy says to block both participants in an e...,0,0,0,0,0,0
6867,125536354304fcca,Go Fuck Yourself \n\nDeeside College is a moth...,1,0,1,0,1,0
17750,2edb9cb73ea97eca,"fiddle away, with your",0,0,0,0,0,0
156845,d4707bf4a06d1855,"""\n\n Oregon Ducks football \n\nHi Abdoozy, so...",0,0,0,0,0,0
88759,ed759c8c9bc94c7b,"""\nWelcome!\n\nHello, , and welcome to Wikiped...",0,0,0,0,0,0


In [4]:
# load test data
test_df = pd.read_csv("./data/test.csv").merge(pd.read_csv("./data/test_labels.csv"),left_on="id", right_on="id")
test_df.sample(5)

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
148224,f7b7ff872e3a094c,""" \n :::Are you aware of any recent articles w...",0,0,0,0,0,0
78628,8333fb10ce0bfc19,"==Bot on please?== \n Hi Alex, I don't know if...",-1,-1,-1,-1,-1,-1
113904,be27f4723912ac6b,""" \n\n == Cultural relativism == \n\n """"Psycho...",0,0,0,0,0,0
52926,57e3ad0c96651f30,""" \n :::Well, seeing as I'm supporting, I don'...",-1,-1,-1,-1,-1,-1
93310,9ba32b4beb22808a,and that he realised Wikipedia was gonna be a ...,0,0,0,0,0,0


In [5]:
# remove the instances labeled -1 (the masking converts the labels to float, so convert back to int)
test_df = test_df.where(test_df!=-1).dropna()
# np.int was removed in recent NumPy releases; the built-in int is equivalent here
test_df[test_df.columns[2:]] = test_df[test_df.columns[2:]].astype(int)
test_df.sample(5)

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
40701,439e70520e9e584a,We already quote Barron prominently in the led...,0,0,0,0,0,0
102576,ab42619b5656aa26,dates only from Protestant Reformation,0,0,0,0,0,0
96791,a180be3aa1103c0d,: Don't repeat the same comment in several pla...,0,0,0,0,0,0
129573,d884b8c801c193e9,Is this article and the Type 98 20 mm AA Half-...,0,0,0,0,0,0
44944,4a89a5fda298a1f2,:::What??? Have you read it???? It was written...,0,0,0,0,0,0


In [6]:
# get the labels
y_train = train_df.iloc[:,2:].reset_index(drop=True).copy()
y_test = test_df.iloc[:,2:].reset_index(drop=True).copy()

In the previous part we already preprocessed the data by removing stop words, punctuation, and numbers. Let's re-use this data here (and remember to remove the rows that are now completely empty due to the processing, together with their labels).

In [7]:
# load processed training and test features
X_train_stop = pd.read_csv("./data/X_train_stop.csv",skip_blank_lines=False)
X_test_stop = pd.read_csv("./data/X_test_stop.csv",skip_blank_lines=False)

# get index of empty rows
idx_train_stop = X_train_stop[X_train_stop.comment_text.isnull()].index
idx_test_stop = X_test_stop[X_test_stop.comment_text.isnull()].index

# remove empty rows
X_train_stop = X_train_stop.drop(idx_train_stop)
X_test_stop = X_test_stop.drop(idx_test_stop)
y_train_stop = y_train.drop(idx_train_stop)
y_test_stop = y_test.drop(idx_test_stop)

To make the problem a bit simpler, we'll only consider whether a comment is toxic or not, not in which way it is toxic. This changes our task from multi-label to binary classification.

In [8]:
# make binary labels - either toxic or not
y_train_bin = np.zeros([len(y_train_stop)])
y_test_bin = np.zeros([len(y_test_stop)])

for y_stop, y_bin in zip([y_train_stop,y_test_stop],[y_train_bin,y_test_bin]):
    for i, labels in enumerate(y_stop.values):
        if np.sum(labels) > 0:
            y_bin[i] = 1
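
As a minimal sketch of what the loop above does, the snippet below collapses a few made-up label rows (hypothetical toy data, not taken from the dataset) into a single toxic/non-toxic flag. A row counts as toxic as soon as any of its six labels is set.

```python
# Hypothetical toy label rows in the dataset's column order:
# toxic, severe_toxic, obscene, threat, insult, identity_hate
label_rows = [
    [0, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 1],
]


def to_binary(rows):
    """Return 1.0 for every row with at least one positive label, else 0.0."""
    return [1.0 if any(row) else 0.0 for row in rows]


binary_labels = to_binary(label_rows)
print(binary_labels)
```

On the real label DataFrames, a vectorized expression such as `(y_train_stop.sum(axis=1) > 0).astype(float).values` should give the same result as the loop, assuming all labels are 0 or 1.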

At this point our data is still in text format. If we want to feed it into a neural net that starts with an embedding layer, we have to convert it into a numerical format. Since the comments have different lengths, we also have to pad them with zeros so that all sequences we feed into the network have the same length.

In [9]:
# choose a vocabulary of only the 50000 most frequent words
# and a sequence length of 700 (shorter sequences will be padded with zeros to len=700)
max_features = 50000
maxlen = 700

# tokenizing and padding the comment sequences
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(X_train_stop.comment_text)
train_sequences = tokenizer.texts_to_sequences(X_train_stop.comment_text)
X_train_seq = sequence.pad_sequences(train_sequences, maxlen=maxlen)
test_sequences = tokenizer.texts_to_sequences(X_test_stop.comment_text)
X_test_seq = sequence.pad_sequences(test_sequences, maxlen=maxlen)
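
To make the padding step more concrete, here is a small pure-Python sketch of what `pad_sequences` does with its default settings (`padding="pre"`, `truncating="pre"`): short sequences get zeros prepended, long ones keep only their last `maxlen` tokens. The helper `pad_pre` and the toy sequences are hypothetical and only illustrate the behaviour; they are not part of Keras.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad `seq` with `value`, or keep only its last `maxlen` items.

    Assumes maxlen is a positive integer.
    """
    seq = list(seq)[-maxlen:]  # truncate from the front if too long
    return [value] * (maxlen - len(seq)) + seq


short_comment = [5, 3]             # two word indices
long_comment = [1, 2, 3, 4, 5, 6]  # six word indices

padded_short = pad_pre(short_comment, maxlen=4)
padded_long = pad_pre(long_comment, maxlen=4)
print(padded_short)
print(padded_long)
```

Zero is safe as a padding value because the Keras `Tokenizer` starts its word indices at 1 and reserves index 0.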

Now everything should be as we need it and we can upload the training data to S3 so that the training job which we will launch later can fetch it from there.

In [10]:
# create directory for training data
WORK_DIRECTORY = "train_data"
if not os.path.isdir(WORK_DIRECTORY):
    os.mkdir(WORK_DIRECTORY)

# write the training data (features and labels) to .npy files
np.save("./"+WORK_DIRECTORY+"/X_train_seq.npy",X_train_seq)
np.save("./"+WORK_DIRECTORY+"/y_train_bin.npy",y_train_bin)

# upload the data to S3 to be accessed for training later
train_input = sagemaker.Session().upload_data(WORK_DIRECTORY, key_prefix="{}/{}".format(prefix, WORK_DIRECTORY))
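
One detail worth keeping in mind: the training script further down loads the features and labels by their position in the sorted file listing of the training channel. The sketch below (using a temporary directory and empty placeholder files, both purely for illustration) shows why `X_train_seq.npy` ends up first: Python sorts uppercase letters before lowercase ones.

```python
import os
import tempfile

# Create two empty placeholder files with the same names as the training files
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ("y_train_bin.npy", "X_train_seq.npy"):
        open(os.path.join(tmp_dir, name), "w").close()
    # Same listing logic as in the training script
    ordered = sorted(os.listdir(tmp_dir))

print(ordered)
```

If the files were renamed, for example to all-lowercase names, this order could change, so loading by explicit file name would be the more robust choice.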

## Creating the model
We will write a training script that parses all potential hyperparameters (I've already set the defaults the way I want them), loads the training data that SageMaker fetches from S3, builds a model with an embedding layer, a Conv1D layer with max pooling, and an LSTM layer, feeding into a dense layer with sigmoid activation for binary classification, and then trains it.
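
To get a feeling for what the LSTM actually sees, the following sketch works through the sequence lengths for the default hyperparameters. It assumes the standard output-length formulas for a Conv1D layer with `padding="valid"` and for a MaxPooling1D layer whose stride equals its pool size (the Keras default); the helper functions are hypothetical and not part of any library.

```python
def conv1d_valid_length(steps, kernel_size, strides=1):
    """Output length of a 1D convolution without padding."""
    return (steps - kernel_size) // strides + 1


def max_pool_length(steps, pool_size):
    """Output length of non-overlapping max pooling without padding."""
    return (steps - pool_size) // pool_size + 1


maxlen, kernel_size, pool_size = 700, 5, 64

after_conv = conv1d_valid_length(maxlen, kernel_size)
after_pool = max_pool_length(after_conv, pool_size)
print(f"after Conv1D: {after_conv} steps, after MaxPooling1D: {after_pool} steps")
```

With these defaults the 700 token positions shrink to 696 after the convolution and to 10 after pooling, so the LSTM processes 10 time steps with one feature per convolution filter.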

In [11]:
%%writefile script.py
# write this notebook cell as a script file

import argparse
import os
import pandas as pd
import numpy as np
import tensorflow as tf
import tensorflow.keras as keras
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.preprocessing.text import Tokenizer
from sklearn.utils import class_weight
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import LSTM, Flatten
from tensorflow.keras.layers import Conv1D, MaxPooling1D


# the training function
if __name__ == '__main__':
    parser = argparse.ArgumentParser()

    # Hyperparameters 
    parser.add_argument('--max_features', type=int, default=50000)
    parser.add_argument('--maxlen', type=int, default=700)
    parser.add_argument('--embedding_size', type=int, default=128)
    parser.add_argument('--kernel_size', type=int, default=5)
    parser.add_argument('--filters', type=int, default=64)
    parser.add_argument('--pool_size', type=int, default=64)
    parser.add_argument('--lstm_output_size', type=int, default=70)
    parser.add_argument('--batch_size', type=int, default=64)
    parser.add_argument('--epochs', type=int, default=1)

    # Sagemaker specific arguments. Defaults are set in the environment variables.
    parser.add_argument('--output_data_dir', type=str, default=os.environ['SM_OUTPUT_DATA_DIR'])
    parser.add_argument('--model_dir', type=str, default=os.environ['SM_MODEL_DIR'])
    parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAIN'])

    args = parser.parse_args()
    
    # set hyperparameters
    # Embedding
    max_features = args.max_features
    maxlen = args.maxlen
    embedding_size = args.embedding_size

    # Convolution
    kernel_size = args.kernel_size
    filters = args.filters
    pool_size = args.pool_size

    # LSTM
    lstm_output_size = args.lstm_output_size
    
    # Training
    batch_size = args.batch_size
    epochs = args.epochs
    
    # Take the set of input files 
    input_files = [ os.path.join(args.train, file) for file in sorted(os.listdir(args.train)) ]
    if len(input_files) == 0:
        raise ValueError(('There are no files in {}.\n' +
                          'This usually indicates that the channel ({}) was incorrectly specified,\n' +
                          'the data specification in S3 was incorrectly specified or the role specified\n' +
                          'does not have permission to access the data.').format(args.train, "train"))
        
    # load the input data from the files
    X_train_seq = np.load(input_files[0],allow_pickle=True)
    y_train_bin = np.load(input_files[1],allow_pickle=True)
    
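    # compute class weights and map them to class indices, since Keras expects a dict
    weights = class_weight.compute_class_weight(class_weight="balanced",
                                                classes=np.array([0.0, 1.0]),
                                                y=y_train_bin)
    class_weights = dict(enumerate(weights))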
    
    # split off a validation set from the training data
    X_train_seq, X_valid_seq, y_train_bin, y_valid_bin = train_test_split(X_train_seq, 
                                                                          y_train_bin,
                                                                          stratify=y_train_bin, 
                                                                          test_size=0.25, 
                                                                          random_state=42)
    
    # convert to tensorflow dataset format
    train_ds = (tf.data.Dataset.from_tensor_slices((X_train_seq, y_train_bin.astype(np.int32)))
                .repeat()
                .shuffle(100)
                .batch(batch_size)
                .prefetch(tf.data.experimental.AUTOTUNE))

    valid_ds = (tf.data.Dataset.from_tensor_slices((X_valid_seq, y_valid_bin.astype(np.int32)))
                .repeat()
                .batch(batch_size)
                .prefetch(tf.data.experimental.AUTOTUNE))
    
    
    # build the model - starting with an embedding layer,
    # feeding into a Conv1D and an LSTM
    model = Sequential()
    model.add(Embedding(max_features, embedding_size, input_length=maxlen))
    model.add(Dropout(0.25))
    model.add(Conv1D(filters,
                     kernel_size,
                     padding="valid",
                     activation="relu",
                     strides=1))
    model.add(MaxPooling1D(pool_size=pool_size))
    model.add(LSTM(lstm_output_size))
    model.add(Dense(1))
    model.add(Activation("sigmoid"))
    
    model.compile(loss="binary_crossentropy",
                  optimizer="adam",
                  metrics=["accuracy"])
    
    lr_scheduler = keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=0)
    earlystop = keras.callbacks.EarlyStopping(monitor="val_loss", min_delta=0.01, patience=0, restore_best_weights=True)
    
    # print model summary
    model.summary()
    
    # start the training
    print('Train...')
    history = model.fit(train_ds,
                        epochs=epochs,
                        steps_per_epoch=len(X_train_seq)//batch_size,
                        validation_data=valid_ds,
                        validation_steps=len(X_valid_seq)//batch_size,
                        class_weight=class_weights,
                        callbacks=[lr_scheduler,earlystop])
    
    # save the model
    
    if not os.path.isdir(args.model_dir):
        os.mkdir(args.model_dir)
        
    model.save(os.path.join(args.model_dir, "model.h5"))
    
    
# the inference function
def model_fn(model_dir):
    """
    Deserialize and return the fitted model.
    Note that the file name must match the one used when saving the model in the main block.
    """
    model = keras.models.load_model(os.path.join(model_dir, "model.h5"))
    return model
    
    

Overwriting script.py
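
The script weights the two classes with scikit-learn's "balanced" heuristic, which assigns each class the weight `n_samples / (n_classes * class_count)`. The pure-Python sketch below reimplements that formula on a made-up, heavily imbalanced toy label list to show the effect; the function name and the data are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in sorted(counts.items())}


# Toy labels: nine non-toxic comments and one toxic comment
toy_labels = [0] * 9 + [1]
weights = balanced_class_weights(toy_labels)
print(weights)
```

The rare class gets a much larger weight, so each misclassified toxic comment contributes more to the loss than a misclassified non-toxic one.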


## Training the model

Now that we have specified what should be done during training in the script file, we can easily create a SageMaker estimator from its prebuilt TensorFlow container. We'll just specify the script's location, our role, and the type of instance on which we want to perform the training.

In [12]:
from sagemaker.tensorflow import TensorFlow

# build a TensorFlow estimator
tf_estimator = TensorFlow(entry_point="script.py", role=role,
                          train_instance_count=1, train_instance_type="ml.m4.xlarge",
                          framework_version="2.0.0", py_version="py3")

In [13]:
# start the training, specifying the S3 location of the training data
tf_estimator.fit({'train': train_input})

## Making predictions
If we now want to use the trained model, we'll either first have to deploy it to an endpoint instance...

In [14]:
#predictor = tf_estimator.deploy(initial_instance_count=1,
#                            instance_type="ml.m4.xlarge",
#                             endpoint_type="tensorflow-serving")

...or we just load the trained model into this notebook instance (e.g. because AWS thinks I'm already above my deployment instance quota).

In [15]:
# download the trained model from the S3 bucket
s3.Bucket(bucket).download_file("model.h5","model.h5")

In [16]:
# load the model
model = keras.models.load_model("model.h5")



Let's bring the test data into the right format and make predictions.

In [17]:
batch_size = 64
test_ds = (tf.data.Dataset.from_tensor_slices((X_test_seq, y_test_bin.astype(np.int32)))
            .batch(batch_size)
            .prefetch(tf.data.experimental.AUTOTUNE))

In [18]:
# make predictions for the test data
y_pred = model.predict(test_ds)

In [19]:
y_pred = np.where(y_pred>=0.5,1.0,0.0)

And finally, let's see how the model performed:

In [20]:
from sklearn.metrics import f1_score, accuracy_score, confusion_matrix

f1 = f1_score(y_test_bin, y_pred, average="binary")
acc = accuracy_score(y_test_bin, y_pred)
cm = confusion_matrix(y_test_bin, y_pred, labels=[0,1])
print(f"F1 (binary): {f1}\nAccuracy: {acc}\nConfusion matrix:\n{cm}")

F1 (binary): 0.6635685459214871
Accuracy: 0.9138630136986301
Confusion matrix:
[[52947  4686]
 [  816  5426]]


The confusion matrix shows that many instances are still misclassified: 4,686 non-toxic comments are flagged as toxic and 816 toxic comments are missed. Overall, though, this model achieves a better F1 score and accuracy than the shallow multi-label model. This is probably due more to the binary classification problem being easier than to employing deep learning instead of shallow learning.
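
To see where the F1 score comes from, the reported metrics can be recomputed by hand from the confusion matrix above. The sketch below assumes scikit-learn's layout (rows are true labels, columns are predicted labels) and treats "toxic" as the positive class.

```python
# Entries of the confusion matrix printed above
tn, fp = 52947, 4686  # true non-toxic: predicted non-toxic / predicted toxic
fn, tp = 816, 5426    # true toxic:     predicted non-toxic / predicted toxic

precision = tp / (tp + fp)  # share of flagged comments that are really toxic
recall = tp / (tp + fn)     # share of toxic comments that were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.4f} recall={recall:.4f} "
      f"f1={f1:.4f} accuracy={accuracy:.4f}")
```

Recall is high (about 0.87) while precision is only about 0.54, so the model catches most toxic comments at the cost of flagging many harmless ones. The accuracy looks better than the F1 score mainly because the large non-toxic class dominates it.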

## Conclusion
We used deep, sequential machine learning methods to classify comments on Wikipedia articles as verbally toxic or not, using AWS SageMaker's capabilities for building and training models.