# Train Toxicity Model

This notebook trains a model to detect toxicity in online comments. It uses a CNN architecture for text classification, trained on the [Wikipedia Talk Labels: Toxicity dataset](https://figshare.com/articles/Wikipedia_Talk_Labels_Toxicity/4563973) with pre-trained [GloVe embeddings](http://nlp.stanford.edu/data/glove.6B.zip) (source page: http://nlp.stanford.edu/projects/glove/).

This model is adapted from [example code](https://github.com/fchollet/keras/blob/master/examples/pretrained_word_embeddings.py) in the [Keras GitHub repository](https://github.com/fchollet/keras), which is released under an [MIT license](https://github.com/fchollet/keras/blob/master/LICENSE). The license text is also included in this repository in the file KERAS_LICENSE.

## Usage Instructions

Prior to running the notebook, you must:

* Download the [Wikipedia Talk Labels: Toxicity dataset](https://figshare.com/articles/Wikipedia_Talk_Labels_Toxicity/4563973)
* Download pre-trained [GloVe embeddings](http://nlp.stanford.edu/data/glove.6B.zip)
* (optional) To skip the training step, download a trained model and tokenizer file. We are looking into an appropriate way to distribute these (sometimes large) files.

## Setting Up

In [1]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os

In [2]:
DATA_DIR = '../data/'
MODEL_DIR = '../models/'
EMBEDDING_DIR = '../data/glove.6B/'
MODEL_VERSION = 'wiki_tox_labels_v1'
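
Before running the remaining cells, it can help to confirm that the downloaded files are where the notebook expects them. The sketch below is a minimal check, assuming the file names read by later cells (`toxicity_annotated_comments.tsv`, `toxicity_annotations.tsv`, `glove.6B.100d.txt`); the helper `find_missing` is a hypothetical name, not part of the original code.

```python
import os

DATA_DIR = '../data/'
EMBEDDING_DIR = '../data/glove.6B/'

# Files read by later cells of this notebook.
REQUIRED_FILES = [
    os.path.join(DATA_DIR, 'toxicity_annotated_comments.tsv'),
    os.path.join(DATA_DIR, 'toxicity_annotations.tsv'),
    os.path.join(EMBEDDING_DIR, 'glove.6B.100d.txt'),
]


def find_missing(paths):
    """Return the subset of `paths` that do not exist as regular files."""
    return [p for p in paths if not os.path.isfile(p)]


# Report anything that still needs to be downloaded or unzipped.
for path in find_missing(REQUIRED_FILES):
    print('Missing required file: %s' % path)
```

If nothing is printed, all three inputs were found.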

## Clean and Prep Data

In [3]:
import pandas as pd

In [4]:
toxicity_annotated_comments = pd.read_csv(os.path.join(DATA_DIR, 'toxicity_annotated_comments.tsv'), sep = '\t')
toxicity_annotations = pd.read_csv(os.path.join(DATA_DIR, 'toxicity_annotations.tsv'), sep = '\t')

In [5]:
annotations_gped = toxicity_annotations.groupby('rev_id', as_index=False).agg({'toxicity': 'mean'})
all_data = pd.merge(annotations_gped, toxicity_annotated_comments, on = 'rev_id')

In [6]:
all_data['comment'] = all_data['comment'].apply(lambda x: x.replace("NEWLINE_TOKEN", " "))
all_data['comment'] = all_data['comment'].apply(lambda x: x.replace("TAB_TOKEN", " "))

# TODO(nthain): Consider doing regression instead of classification
all_data['is_toxic'] = all_data['toxicity'] > 0.5
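
To make the labelling rule concrete, the sketch below reproduces it in plain Python on invented toy votes. Each comment's score is the mean of its annotators' binary `toxicity` votes, and the comment counts as toxic only when that mean is strictly greater than 0.5, so an even split is labelled non-toxic. The `rev_id` values and votes are made up for illustration.

```python
from collections import defaultdict

# (rev_id, toxicity) pairs, one per annotator judgement. Toy data only.
annotations = [
    (1, 1), (1, 1), (1, 0),   # 2 of 3 annotators say toxic
    (2, 0), (2, 0), (2, 1),   # 1 of 3 annotators says toxic
    (3, 1), (3, 0),           # even split
]

votes = defaultdict(list)
for rev_id, toxicity in annotations:
    votes[rev_id].append(toxicity)

# Mean vote per comment, mirroring groupby('rev_id').agg({'toxicity': 'mean'}).
mean_toxicity = {rev_id: sum(v) / float(len(v)) for rev_id, v in votes.items()}

# Strict threshold, mirroring all_data['toxicity'] > 0.5.
is_toxic = {rev_id: score > 0.5 for rev_id, score in mean_toxicity.items()}

print(is_toxic)
```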

## Build Model

### Hyperparameters

In [7]:
MAX_SEQUENCE_LENGTH = 1000
MAX_NUM_WORDS = 20000
EMBEDDING_DIM = 100

BATCH_SIZE = 128
EPOCHS = 3

# One entry per convolutional block.
CNN_FILTER_SIZES = [128, 128, 128]  # number of filters in each Conv1D layer
CNN_KERNEL_SIZES = [5, 5, 5]
CNN_POOLING_SIZES = [5, 5, 40]


### Split and Tokenize Data

In [8]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
import numpy as np
try:
    import cPickle  # Python 2
except ImportError:
    import pickle as cPickle  # Python 3 has no cPickle module

Using TensorFlow backend.


In [9]:
# split into train, valid, test
train = all_data.query('split == "train"')
valid = all_data.query('split == "dev"')
test = all_data.query('split == "test"')

train_text = train['comment']
valid_text = valid['comment']
test_text = test['comment']

In [10]:
# TODO(nthain): Make the code not repeat itself
tokenizer = Tokenizer(num_words = MAX_NUM_WORDS)
tokenizer.fit_on_texts(train_text)
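# Save the fitted tokenizer so the same vocabulary can be reused at inference time.
with open(os.path.join(MODEL_DIR, '%s_tokenizer.pkl' % MODEL_VERSION), 'wb') as tokenizer_file:
    cPickle.dump(tokenizer, tokenizer_file)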

train_sequences = tokenizer.texts_to_sequences(train_text)
valid_sequences = tokenizer.texts_to_sequences(valid_text)
test_sequences = tokenizer.texts_to_sequences(test_text)

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Found 151181 unique tokens.
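
Note that `word_index` covers every token seen in the training text (151,181 here), while `num_words` only limits which indices `texts_to_sequences` emits: in Keras, words whose index is `num_words` or higher are dropped. The sketch below imitates that behaviour in plain Python as a simplified stand-in, assuming lowercase whitespace tokenization instead of Keras's punctuation filtering; `fit_word_index` and `to_sequences` are hypothetical helper names.

```python
from collections import Counter


def fit_word_index(texts):
    """Map each word to a 1-based index, most frequent word first."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending count, breaking ties alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


def to_sequences(texts, word_index, num_words):
    """Encode texts as index lists, dropping unknown words and any word
    whose index is num_words or higher."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]


train_texts = ['you are great', 'you are you', 'great work']
word_index = fit_word_index(train_texts)
print(len(word_index))  # every distinct word is indexed
print(to_sequences(['you are great work'], word_index, num_words=3))
```

With `num_words=3`, only the two most frequent words survive encoding, even though the index itself holds four entries.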


In [11]:
train_data = pad_sequences(train_sequences, maxlen = MAX_SEQUENCE_LENGTH)
valid_data = pad_sequences(valid_sequences, maxlen = MAX_SEQUENCE_LENGTH)
test_data = pad_sequences(test_sequences, maxlen = MAX_SEQUENCE_LENGTH)

train_labels = to_categorical(np.asarray(train['is_toxic']))
valid_labels = to_categorical(np.asarray(valid['is_toxic']))
test_labels = to_categorical(np.asarray(test['is_toxic']))
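
With its default arguments, `pad_sequences` pads and truncates at the front of each sequence, so short comments are left-padded with zeros and long comments keep their last `maxlen` tokens. `to_categorical` then turns each boolean label into a one-hot row, with column 1 marking the toxic class. The plain-Python sketch below imitates both steps on toy inputs; `pad_pre` and `one_hot` are hypothetical names, not Keras functions.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences and keep the last `maxlen` items of long ones.

    Assumes maxlen >= 1.
    """
    padded = []
    for seq in sequences:
        tail = list(seq)[-maxlen:]                 # truncate from the front
        padded.append([value] * (maxlen - len(tail)) + tail)
    return padded


def one_hot(labels, num_classes=2):
    """Turn integer-like labels (including booleans) into one-hot rows."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[int(label)] = 1.0
        rows.append(row)
    return rows


print(pad_pre([[7, 8], [1, 2, 3, 4, 5]], maxlen=4))
print(one_hot([False, True]))
```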

### Load Pre-trained Embeddings

In [12]:
# Parse the GloVe file: each line is a word followed by its vector components.
embeddings_index = {}
with open(os.path.join(EMBEDDING_DIR, 'glove.6B.100d.txt')) as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

Found 400000 word vectors.
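
Each line of the GloVe text file holds a token followed by its vector components, separated by spaces. The sketch below parses two invented three-dimensional lines to show the resulting mapping; the values are toy data, and `parse_glove_lines` is a hypothetical helper.

```python
def parse_glove_lines(lines):
    """Build a {word: [float, ...]} mapping from GloVe-formatted text lines."""
    index = {}
    for line in lines:
        values = line.split()
        if not values:
            continue  # skip blank lines
        index[values[0]] = [float(v) for v in values[1:]]
    return index


# Toy lines in the same "word v1 v2 ..." layout as glove.6B.100d.txt.
toy_lines = [
    'the 0.1 -0.2 0.3\n',
    'cat 0.5 0.25 -1.0\n',
]
toy_index = parse_glove_lines(toy_lines)
print(len(toy_index), toy_index['cat'])
```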


In [13]:
# Row 0 is reserved for padding; words not found in the embedding index stay all-zeros.
embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
num_words_in_embedding = 0
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        num_words_in_embedding += 1
        embedding_matrix[i] = embedding_vector

print('%s words successfully embedded.' % num_words_in_embedding)

78047 words successfully embedded.
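
Row `i` of the embedding matrix holds the GloVe vector for the word with index `i`; row 0 corresponds to the padding value, and words without a pre-trained vector keep an all-zero row. In the run above, 78,047 of 151,181 tokens (about 52%) were found in GloVe. The toy sketch below shows the same construction with plain lists and invented vectors.

```python
EMBEDDING_DIM = 2  # toy dimension; the notebook uses 100

word_index = {'good': 1, 'bad': 2, 'zzyzx': 3}               # toy tokenizer index
embeddings_index = {'good': [0.5, 0.1], 'bad': [-0.4, 0.2]}  # toy vectors

# One row per word index, plus row 0 for the padding value.
embedding_matrix = [[0.0] * EMBEDDING_DIM for _ in range(len(word_index) + 1)]
num_found = 0
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = list(vector)
        num_found += 1

coverage = num_found / float(len(word_index))
print('%d of %d words embedded (%.0f%%)' % (num_found, len(word_index), 100 * coverage))
```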


### Model Architecture

In [14]:
from keras.layers import Dense, Input, Flatten
from keras.layers import Conv1D, MaxPooling1D, Embedding, GlobalMaxPooling1D
from keras.models import Model

In [15]:
def build_conv_layer(input_tensor, filter_size, kernel_size, pool_size):
    """Apply a Conv1D layer with `filter_size` filters, then pooling.

    A falsy `pool_size` (e.g. 0 or None) selects global max pooling.
    """
    output = Conv1D(filter_size, kernel_size, activation='relu', padding='same')(input_tensor)
    if pool_size:
        output = MaxPooling1D(pool_size, padding='same')(output)
    else:
        # TODO(nthain): This seems broken. Fix.
        output = GlobalMaxPooling1D()(output)
    return output

In [16]:
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedding_layer = Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=True)

embedded_sequences = embedding_layer(sequence_input)
x = embedded_sequences
for filter_size, kernel_size, pool_size in zip(CNN_FILTER_SIZES, CNN_KERNEL_SIZES, CNN_POOLING_SIZES):
    x = build_conv_layer(x, filter_size, kernel_size, pool_size)
    
x = Flatten()(x)
# TODO(nthain): Parametrize the number and size of fully connected layers
x = Dense(128, activation='relu')(x)
preds = Dense(2, activation='softmax')(x)

model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])

In [18]:
# A summary of the model architecture
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 1000)              0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 1000, 100)         15118200  
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 1000, 128)         64128     
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 200, 128)          0         
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 200, 128)          82048     
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 40, 128)           0         
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 40, 128)           82048     
__________
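
The shapes and parameter counts in the summary can be checked by hand. With `padding='same'`, a convolution keeps the sequence length, and a pooling layer with pool size `p` (and the default stride of `p`) reduces length `n` to `ceil(n / p)`. A `Conv1D` layer has `kernel_size * in_channels * filters + filters` parameters. The sketch below redoes this arithmetic from the hyperparameters above and the vocabulary size reported by the tokenizer.

```python
import math

MAX_SEQUENCE_LENGTH = 1000
EMBEDDING_DIM = 100
VOCAB_ROWS = 151181 + 1          # len(word_index) + 1, from the tokenizer output
CNN_FILTER_SIZES = [128, 128, 128]
CNN_KERNEL_SIZES = [5, 5, 5]
CNN_POOLING_SIZES = [5, 5, 40]

embedding_params = VOCAB_ROWS * EMBEDDING_DIM

length, channels = MAX_SEQUENCE_LENGTH, EMBEDDING_DIM
conv_params, lengths = [], []
for filters, kernel, pool in zip(CNN_FILTER_SIZES, CNN_KERNEL_SIZES, CNN_POOLING_SIZES):
    conv_params.append(kernel * channels * filters + filters)  # weights + biases
    length = int(math.ceil(length / float(pool)))               # 'same' pooling
    lengths.append(length)
    channels = filters

print(embedding_params)  # 15118200
print(conv_params)       # [64128, 82048, 82048]
print(lengths)           # [200, 40, 1]
```

After the third pooling layer the sequence length is 1, so `Flatten` hands a 128-dimensional vector to the dense layers.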

### Train Model

In [19]:
model.fit(train_data, train_labels,
          batch_size=BATCH_SIZE,
          epochs=EPOCHS,
          validation_data=(valid_data, valid_labels))

Train on 95692 samples, validate on 32128 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x1271ece90>

In [20]:
model.save(os.path.join(MODEL_DIR, '%s_model.h5' % MODEL_VERSION))

### Evaluate Model on Test Data

In [21]:
from sklearn import metrics
from keras.models import load_model

In [22]:
model = load_model(os.path.join(MODEL_DIR, '%s_model.h5' % MODEL_VERSION))

In [23]:
def compute_auc(y_true, y_pred):
    fpr, tpr, thresholds = metrics.roc_curve(y_true, y_pred)
    return metrics.auc(fpr, tpr)

In [24]:
test_preds = model.predict(test_data, batch_size=BATCH_SIZE)

In [25]:
test_auc = compute_auc(test_labels[:,1], test_preds[:,1])
print('The model achieves an AUC of %.3f on the test set.' % test_auc)

The model achieves an AUC of 0.977 on the test set.
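
ROC AUC can also be read as the probability that a randomly chosen toxic comment receives a higher score than a randomly chosen non-toxic one, with ties counted as one half. The sketch below computes it that way by comparing every positive/negative pair. This pairwise formulation is a different technique from the `roc_curve`/`auc` calls above, though it yields the same value; it is quadratic in the number of examples, so it is only meant for small illustrations. The labels and scores are invented, and `pairwise_auc` is a hypothetical helper.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError('AUC needs at least one positive and one negative example')
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


print(pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```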
