In [158]:
#Dependencies
import pandas as pd
import numpy as np
import nltk
import re
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras import backend as K
from keras.models import Model
from keras.layers import Input, Embedding, GlobalMaxPool1D, Bidirectional
from keras.layers import Dense, LSTM, Dropout, BatchNormalization, Activation
from keras.models import Sequential

In [3]:
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/jharmse/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## TF-IDF - Akshi

## Naive Bayes - Xinbin

## Logistic Regression - Akshi

## Word2vec - Xinbin

## Multilayer Perceptron - Matt

### Data Import

For this project, we are using Kaggle's toxic comment datasets. The data, and an overview of the data, can be found [here](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data).

There are two files worth taking note of:

* `train.csv`
* `test.csv`

We will *train* our models on the `train.csv` data. To ensure that our model isn't memorising the training data (a.k.a. overfitting), we will *test* our model on the independent `test.csv` data.

`test.csv` has the same format as `train.csv`, but contains never-seen-before comment. Testing our model on this dataset, will give us an indication of whether our model works in a real-world application (will it be able to flag or delete new, unseen toxic comments?).

In [190]:
train = pd.read_csv('../additional/data/train.csv')
test_X = pd.read_csv('../additional/data/test.csv')
test_labels = pd.read_csv('../additional/data/test_labels.csv')

print("Training data examples:")
display(train.head())
print("Test input examples:")
display(test_X.head())
print("Test expected output examples:")
display(test_labels.head())

Training data examples:


Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


Test input examples:


Unnamed: 0,id,comment_text
0,00001cee341fdb12,Yo bitch Ja Rule is more succesful then you'll...
1,0000247867823ef7,== From RfC == \n\n The title is fine as it is...
2,00013b17ad220c46,""" \n\n == Sources == \n\n * Zawe Ashton on Lap..."
3,00017563c3f7919a,":If you have a look back at the source, the in..."
4,00017695ad8997eb,I don't anonymously edit articles at all.


Test expected output examples:


Unnamed: 0,id,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,00001cee341fdb12,-1,-1,-1,-1,-1,-1
1,0000247867823ef7,-1,-1,-1,-1,-1,-1
2,00013b17ad220c46,-1,-1,-1,-1,-1,-1
3,00017563c3f7919a,-1,-1,-1,-1,-1,-1
4,00017695ad8997eb,-1,-1,-1,-1,-1,-1


The test data contains a bunch of `-1` values. These comments cannot be considered as toxic or non-toxic and should be omitted from scoring (see Kaggle link).

In [191]:
# cleaner ways of doing this, but whatevs
remove_rows = test_labels.toxic != -1
test_labels = test_labels[remove_rows]
test_X = test_X[remove_rows]

display(test_X.head())
test_labels.head()

Unnamed: 0,id,comment_text
5,0001ea8717f6de06,Thank you for understanding. I think very high...
7,000247e83dcc1211,:Dear god this site is horrible.
11,0002f87b16116a7f,"""::: Somebody will invariably try to add Relig..."
13,0003e1cccfd5a40a,""" \n\n It says it right there that it IS a typ..."
14,00059ace3e3e9a53,""" \n\n == Before adding a new product to the l..."


Unnamed: 0,id,toxic,severe_toxic,obscene,threat,insult,identity_hate
5,0001ea8717f6de06,0,0,0,0,0,0
7,000247e83dcc1211,0,0,0,0,0,0
11,0002f87b16116a7f,0,0,0,0,0,0
13,0003e1cccfd5a40a,0,0,0,0,0,0
14,00059ace3e3e9a53,0,0,0,0,0,0


For our needs, we are only interested in wether a comment is toxic or not. We aren't interested in the type of toxicity. Let's change the comment labels to a binary label.

In [192]:
# train data
train_y = train.iloc[:, 2:] != np.zeros((len(train), 6))
train_y = train_y.any(axis=1)

# test data
test_y = test_labels.iloc[:, 1:] != np.zeros((len(test_labels), 6))
test_y = test_y.any(axis=1)

print('Training data Binary Response Variable')
print(train_y.head())
print('\nTest data Binary Response Variable')
print(test_y.head())

Training data Binary Response Variable
0    False
1    False
2    False
3    False
4    False
dtype: bool

Test data Binary Response Variable
5     False
7     False
11    False
13    False
14    False
dtype: bool


In this case, `True` represents a *toxic* comment. `False` represents *non-toxic*. However, for a mathematical model to work, we need numbers. We will convert `True` to `1` and `False` to `0`.

In [193]:
train_y = train_y.astype(float)
test_y = test_y.astype(float)
print('Training Data Response Variable')
print(train_y.head())
print('\nTest Data Response Variable')
print(test_y.head())

Training Data Response Variable
0    0.0
1    0.0
2    0.0
3    0.0
4    0.0
dtype: float64

Test Data Response Variable
5     0.0
7     0.0
11    0.0
13    0.0
14    0.0
dtype: float64


The toxic comment datasets are quite big.

In [202]:
print(f'Number of training samples: {len(train_y)}')
print(f'Number of test samples: {len(test_y)}')

Number of training samples: 159571
Number of test samples: 63978


Neural networks are generally quite complex, which means that training can take very long. For this reason, let's reduce the number of examples we train on. This will most likely reduce our accuracy, but will increase the training time.

To prevent training on comments that are mostly non-toxic (a non-toxic comment is more likely to occur than a toxic comment), we can reduce the training data by omitting more non-toxic comments. In the example below we try to end up with an equal amount of non-toxic and toxic comments in our training data. This will also prevent the model from having a higher probability of predicting a comment as non-toxic, simply becuase it's a more common occurance.

In [203]:
train_toxic = train_y[train_y == 1.0]
train_non_toxic = train_y[train_y == 0.0]
train_non_toxic = train_non_toxic[:len(train_toxic)]

rows_keep = list(train_toxic.index)
rows_keep = keep + list(train_non_toxic)
rows_keep.sort()

train_y = train_y.iloc[rows_keep]
train = train.iloc[rows_keep]

In [205]:
print(f'Number of training samples: {len(train_y)}')

Number of training samples: 48675


### Data Preprocessing

In our last workshop we spoke about multiple text preprocessing techniques. For reasons already discussed, these preprocessing techniques might help the accuracy of your model a lot.

Below we are only applying the following preprocessing techniques:

* make comments lowercase
* split comments into individual words (1-gram).
* remove stopwords

Feel free to apply more preprocessing techniques here before training your model.

In [206]:
def preproc_line(line):
    text = re.sub(r"[^a-zA-Z0-9]", " ", line.lower())
    words = text.split()
    words = [w for w in words if w not in stopwords.words('english')]
  
    return words

In [207]:
# training data
X_train_pro = []

for line in train['comment_text']:
    X_train_pro.append(preproc_line(line))
    
# test data
X_test_pro = []

for line in test_X['comment_text']:
    X_test_pro.append(preproc_line(line))

In [208]:
X_train_pro[0]

['explanation',
 'edits',
 'made',
 'username',
 'hardcore',
 'metallica',
 'fan',
 'reverted',
 'vandalisms',
 'closure',
 'gas',
 'voted',
 'new',
 'york',
 'dolls',
 'fac',
 'please',
 'remove',
 'template',
 'talk',
 'page',
 'since',
 'retired',
 '89',
 '205',
 '38',
 '27']

### Word Vectorizing

Next we need to present our comments as numerical vectors.

You can use more complex techniques, such as `word2vec`, but for this example we simply consider the 20000 most common words and represent each comment as a vector in terms of these common words.

In [210]:
# number of most common words to use
max_features = 20000

tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(X_train_pro)) # we only look at words occurring in training data

X_train_tokenized = tokenizer.texts_to_sequences(X_train_pro)
X_test_tokenized = tokenizer.texts_to_sequences(X_test_pro)

In [212]:
X_train_tokenized[0]

[14,
 4,
 5,
 15,
 21,
 23,
 9,
 10,
 24,
 26,
 20,
 22,
 6,
 12,
 27,
 25,
 3,
 8,
 11,
 2,
 1,
 7,
 19,
 16,
 18,
 17,
 13]

As with any machine learning model, we need to have a fixed number of features. Since comments can contain a varying number of words, we need to limit the number of words that can occur in a comment being classified. For our model we will allow a maximum of 200 words per comment that is fed into our model. Any comment containing fewer words, will be padded with a bunch of zeros.

In [214]:
max_len = 200
X_train_pad = pad_sequences(X_train_tokenized, maxlen=max_len)
X_test_pad = pad_sequences(X_test_tokenized, maxlen=max_len)

In [215]:
X_train_pad[0]

array([ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0, 14,  4,  5, 15, 21, 23,  9, 10, 24, 26, 20, 22,  6, 12,
       27, 25,  3,  8, 11,  2,  1,  7, 19, 16, 18, 17, 13], dtype=int32)

### Creating the Model

A multilayer perceptron is just another word for a normal neural network. For a neural network to be considered a deep neural network, it should have at least 2 hidden layers.

![](../additional/img/ANN.jpg)

In our case, our input layer should have a size of 200, since our comments are represented as word vectors, each having a size of 200.

Both our hidden layers have a size of 32. This is arbitrary and can be optimised.

The `relu` activation functions change linear outputs from the neurons to non-linear outputs. This ensures that we aren't simply training a complicated linear regression model.

THe last layer has a size of 1 and a `sigmoid` activation function. This layer allows the output the model to be a single value per sample that has a value between 0 and 1. This value can be seen as the probability of a comment being toxic (close to 1.0 means it is more likely that it is toxic).

![](../additional/img/activations.png)

The `Dropout` lines below essentially mean that we *turn off* a cerain fraction of neurons (50% in this case) and see if the model gives similar performance. If this is the case, we know that we can deactivate the neurons which will result in a simpler model.

If you want a clearer understanding of how neural networks work, play around with them [here](https://playground.tensorflow.org/#activation=tanh&batchSize=10&dataset=circle&regDataset=reg-plane&learningRate=0.03&regularizationRate=0&noise=0&networkShape=4,2&seed=0.44195&showTestData=false&discretize=false&percTrainData=50&x=true&y=true&xTimesY=false&xSquared=false&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false&hideText=false).

In [216]:
# size of input layer
input_dim = len(X_train_pad[0])

model = Sequential()
model.add(Dense(32, activation='relu', input_dim=input_dim))
model.add(Dropout(0.5))
model.add(Dense(32, activation='relu', input_dim=input_dim))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))

print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_5 (Dense)              (None, 32)                6432      
_________________________________________________________________
dropout_5 (Dropout)          (None, 32)                0         
_________________________________________________________________
dense_6 (Dense)              (None, 32)                1056      
_________________________________________________________________
dropout_6 (Dropout)          (None, 32)                0         
_________________________________________________________________
dense_7 (Dense)              (None, 1)                 33        
Total params: 7,521
Trainable params: 7,521
Non-trainable params: 0
_________________________________________________________________
None


We need to specify how our model is trained. Below follows standard binary model training parameters. You can find more details about these [here](https://keras.io/getting-started/sequential-model-guide/#training).

In short, we are using [RMSProp](https://keras.io/optimizers/#rmsprop) to find the best neural network weights that minimizes our loss, [Binary Cross Entropy](https://keras.io/losses/). The metric we use for our model is [Accuracy](https://keras.io/metrics/).

In [217]:
model.compile(optimizer='rmsprop', 
              loss='binary_crossentropy', 
             metrics=['accuracy'])

Now we can train our model.

We feed in batches of data during training. In this example, our batch size is 32, which means that 32 training data comments are being fed into the mdoel during each iteration. During an iteration, the model weights are updated in an attempt to reduce our loss. This is done, using a technique called backpropagation.

Epochs refer to the number of times the full training dataset gets fed into the model during training. Having epochs=4, means our model sees each training example 4 times during the training process. Be careful for setting the number of epopchs too high, becuase this can result in overfitting.

The validation split specifies that 10% of the data should be used as validation data, while 90% is used for training. This gives us an indication of whether the model is underfitting or overfitting at any stage during the training process.

In [218]:
model.fit(X_train_pad, train_y, epochs=4, batch_size=32, validation_split=0.1)

Train on 43807 samples, validate on 4868 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7fd6f432e3c8>

In [224]:
score = model.evaluate(X_test_pad, test_y, batch_size=32)
print(f'Test Accuracy: {score[1]}')

Test Accuracy: 0.35496576948512315


## Bidirectional Recurrent Neural Network(RNN) - Matt

In [181]:
def create_network(input_dim, embed_size, units, layers, output_dim, prob=0.2):
    input_ = Input(name='input_', shape=(input_dim, ))
    embed = Embedding(input_dim, embed_size)(input_)

    def add_layer(input_layer, units, name):
        lstm = Bidirectional(LSTM(units, return_sequences=True, activation='relu',
                                 name=name))(input_layer)
        bn_layer = BatchNormalization()(lstm)
        return bn_layer

    for i in range(layers):
        if i == 0:
            last_layer = add_layer(embed, units, 'rnn0')
        else:
            last_layer = add_layer(last_layer, units, 'rnn'+str(i))

    x = Dropout(prob)(last_layer)
    x = GlobalMaxPool1D()(x)
    x = Dense(units, activation='relu')(x)
    x = Dropout(prob)(x)
    logits = Dense(output_dim, name='logits')(x)
    out = Activation('sigmoid', name='out')(logits)
    model = Model(inputs=input_, outputs=out)

    print(model.summary())

    return model

In [182]:
embed_size = 128
units = 128
layers = 2
batch_size = 32
epochs = 4

In [183]:
K.clear_session()
model = create_network(max_len, embed_size, units, layers, 1)
model.compile(loss='binary_crossentropy',
             optimizer='adam',
             metrics=['accuracy'])

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_ (InputLayer)          (None, 200)               0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 200, 128)          25600     
_________________________________________________________________
bidirectional_1 (Bidirection (None, 200, 256)          263168    
_________________________________________________________________
batch_normalization_1 (Batch (None, 200, 256)          1024      
_________________________________________________________________
bidirectional_2 (Bidirection (None, 200, 256)          394240    
_________________________________________________________________
batch_normalization_2 (Batch (None, 200, 256)          1024      
_________________________________________________________________
dropout_1 (Dropout)          (None, 200, 256)          0         
__________

In [184]:
model.fit(X_pad, train_y, batch_size=batch_size, epochs=epochs, validation_split=0.1)

Train on 43807 samples, validate on 4868 samples
Epoch 1/4
 3136/43807 [=>............................] - ETA: 15:57 - loss: 0.6160 - acc: 0.7369

KeyboardInterrupt: 

## Accuracy - Akshi

## AUC (ROC) - Akshi