# Toxic Comment Classifier

Below is a classifier for the Jigsaw [Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge) hosted by Kaggle, written in Keras/TF. This model leverages multiple insights from high-scoring Kaggle kernels, as well as exploratory work in PyTorch's Torchtext library.

In this notebook I'll compare a simple bidirectional LSTM with a more complex bi-GRU-ConvNet. If the two models produce similar predictions, or the simpler one isn't dramatically worse, I'll keep the simpler model and ensemble it with an NB-SVM.

**See Aside 1** for terms.

-- Wayne Nixalo 1/4/2018

---

## 1. Data & Embeddings Preparation

Imports & Paths

In [2]:
import pathlib
import pandas as pd
import numpy as np
import keras
import keras.preprocessing.text

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding
from keras.layers import SpatialDropout1D
from keras.layers import Bidirectional
from keras.layers import GRU
from keras.layers import LSTM
from keras.layers import Conv1D
from keras.layers import GlobalAveragePooling1D
from keras.layers import GlobalMaxPooling1D
from keras.layers import concatenate
from keras.layers import Dense
from keras.layers import Input
from keras.models import Model
from keras.callbacks import EarlyStopping
from keras.callbacks import ModelCheckpoint

In [3]:
PATH_DATA = pathlib.Path('../../data')
PATH_COMP = PATH_DATA/'competitions/jigsaw-toxic-comment-classification-challenge'
TRAIN_FILE = 'train.csv'
TEST_FILE  = 'test.csv'
EMBEDDING_GLOVE    = PATH_DATA/'glove/glove.6B.50d.txt'
EMBEDDING_FASTTEXT = PATH_DATA/'fasttext/crawl-300d-2M.vec'

Basic Config Parameters

In [4]:
embed_sz = 50    # embedding vector length
max_feat = 20000 # num unique words
maxlen   = 100   # max length of sequence to read

Data preprocessing **See: Aside 2**

In [5]:
# read data & replace missing values
train_df = pd.read_csv(PATH_COMP/TRAIN_FILE)
test_df  = pd.read_csv(PATH_COMP/TEST_FILE)

list_sentences_train = train_df["comment_text"].fillna("_na_").values
list_sentences_test  = test_df["comment_text"].fillna("_na_").values

list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
labels = train_df[list_classes].values

In [6]:
# tokenize, numericalize, and pad
tokenizer = keras.preprocessing.text.Tokenizer(num_words=max_feat, lower=True, oov_token='<unk>')
tokenizer.fit_on_texts(list(list_sentences_train))

list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train)
list_tokenized_test  = tokenizer.texts_to_sequences(list_sentences_test)

input_train = keras.preprocessing.sequence.pad_sequences(list_tokenized_train, maxlen=maxlen)
input_test  = keras.preprocessing.sequence.pad_sequences(list_tokenized_test, maxlen=maxlen)
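A note on what the padding step does to each comment: with its default arguments (`padding='pre'`, `truncating='pre'`), `pad_sequences` puts zeros at the front of short sequences and drops tokens from the front of long ones. Below is a minimal pure-Python imitation of that default behaviour. The helper name `pad_like_keras_default` and the toy sequences are made up for illustration and are not part of Keras.

```python
def pad_like_keras_default(seqs, maxlen, value=0):
    """Imitate pad_sequences(..., padding='pre', truncating='pre') on lists of ints."""
    padded = []
    for seq in seqs:
        seq = list(seq)[-maxlen:]                           # keep only the LAST maxlen tokens
        padded.append([value] * (maxlen - len(seq)) + seq)  # pad with zeros at the FRONT
    return padded

# toy token-id sequences (hypothetical): one shorter and one longer than maxlen
toy = [[5, 8], [1, 2, 3, 4, 5, 6]]
result = pad_like_keras_default(toy, maxlen=4)
print(result)
```

So with `maxlen = 100`, a long comment is represented by its final 100 tokens rather than its first 100.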

Load pretrained word vectors (GloVe or fastText) into word->vector dictionary **See: Aside 3**

In [7]:
# build word-vector lookup dictionary
def get_coefficients(word, *arr):
    """return a word and an ndarray of its associated vector embedding"""
    return word, np.asarray(arr, dtype='float32')

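# read as UTF-8 and close the file handle once the dictionary is built
with open(EMBEDDING_GLOVE, encoding='utf-8') as f:
    embeddings_index = dict(get_coefficients(*o.strip().split()) for o in f)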

In [31]:
# if you want to use embedding size of the pretrained vectors:
# embed_sz = len(embeddings_index[next(iter(embeddings_index.keys()))])
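The one-line loader above assumes every line has the form `word v1 ... vn`. The fastText `.vec` file starts with a `count dim` header line (see Aside 3), and a whitespace split can also misparse any token that itself contains a space. The sketch below is one possible, more defensive loader; `load_vectors` and its `dim` argument are hypothetical names, and the in-memory "file" is fabricated toy data.

```python
import io
import numpy as np

def load_vectors(lines, dim):
    """Parse `word v1 ... v_dim` lines into a {word: float32 vector} dict."""
    index = {}
    for line in lines:
        # split from the right so the last `dim` fields are always the numbers
        parts = line.rstrip().rsplit(' ', dim)
        if len(parts) != dim + 1:
            continue  # header line (e.g. "2000000 300") or malformed line
        index[parts[0]] = np.asarray(parts[1:], dtype='float32')
    return index

# toy stand-in for an embedding file: header, then three 3-d vectors
fake_file = io.StringIO(
    "3 3\n"
    "the 0.1 0.2 0.3\n"
    "new york 0.4 0.5 0.6\n"
    ". 0.7 0.8 0.9\n"
)
vectors = load_vectors(fake_file, dim=3)
print(sorted(vectors))
```

With a real file, the same function could be called as `load_vectors(open(EMBEDDING_FASTTEXT, encoding='utf-8'), dim=300)`, assuming the file is UTF-8 encoded.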

Build Embedding Matrix. Words with no pretrained vector are initialized randomly, drawn from a normal distribution with the same mean & standard deviation as the pretrained embeddings.

In [8]:
# np.stack needs a sequence; newer NumPy versions reject a bare dict view
embeddings = np.stack(list(embeddings_index.values()))
emb_mean   = embeddings.mean()
emb_stdv   = embeddings.std()

In [9]:
n_words    = min(max_feat, tokenizer.num_words)
embedding_matrix = np.random.normal(emb_mean, emb_stdv, (n_words, embed_sz))

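# fill rows for the `max_feat` most common words; Keras word indices start at 1
# (most frequent first), so row 0 stays random and is only used for padding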
for word, i in tokenizer.word_index.items():
    if i >= max_feat:
        continue
    embedding_vector = embeddings_index.get(word)
    
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
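It can be worth checking how many of the kept vocabulary words actually received a pretrained vector, since every miss stays at its random initialization. A minimal sketch, using toy dictionaries in place of `tokenizer.word_index` and `embeddings_index`; the helper name `embedding_coverage` is hypothetical.

```python
def embedding_coverage(word_index, pretrained, max_feat):
    """Fraction of in-vocabulary words (index < max_feat) with a pretrained vector."""
    kept = [w for w, i in word_index.items() if i < max_feat]
    found = sum(1 for w in kept if w in pretrained)
    return found / len(kept) if kept else 0.0

# toy stand-ins (hypothetical data) for tokenizer.word_index and embeddings_index
toy_word_index = {'<unk>': 1, 'the': 2, 'cat': 3, 'zzyzx': 4, 'rare': 5}
toy_pretrained = {'the': [0.1], 'cat': [0.2], 'rare': [0.3]}

coverage = embedding_coverage(toy_word_index, toy_pretrained, max_feat=5)
print(f"{coverage:.0%} of kept words have a pretrained vector")
```

On the real data this would be `embedding_coverage(tokenizer.word_index, embeddings_index, max_feat)`.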

## 2. Architecture Design

In [10]:
def BiLSTM():
    m_input = keras.layers.Input(shape=(maxlen,))
    x = keras.layers.Embedding(max_feat, embed_sz, weights=[embedding_matrix])(m_input)
    x = keras.layers.Bidirectional(keras.layers.LSTM(50, return_sequences=True, 
                                                     dropout=0.1, recurrent_dropout=0.1))(x)
    x = keras.layers.GlobalMaxPool1D()(x)
    x = keras.layers.Dense(50, activation="relu")(x)
    x = keras.layers.Dropout(0.1)(x)
    x = keras.layers.Dense(len(list_classes), activation="sigmoid")(x)
    
    return keras.models.Model(inputs=m_input, outputs=x)

In [11]:
def BiGRU_ConvNet():
    m_input = keras.layers.Input(shape=(maxlen,))
    x = keras.layers.Embedding(max_feat, embed_sz, 
                               weights=[embedding_matrix], 
                               trainable=False)(m_input)
    x = keras.layers.SpatialDropout1D(0.2)(x)
    x = keras.layers.Bidirectional(keras.layers.GRU(128, 
                                                    return_sequences=True,
                                                    dropout=0.1, 
                                                    recurrent_dropout=0.1))(x)
    x = keras.layers.Conv1D(64, kernel_size=3)(x)
    x = keras.layers.concatenate([
            keras.layers.GlobalAveragePooling1D()(x),
            keras.layers.GlobalMaxPooling1D()(x)
        ])
    x = keras.layers.Dense(len(list_classes), activation="sigmoid")(x)
    
    return keras.models.Model(inputs=m_input, outputs=x)
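To make the bi-GRU-ConvNet easier to follow, here is a rough walk through the tensor shapes (ignoring the batch dimension) using plain arithmetic rather than Keras. It assumes the Keras defaults the model relies on: `Bidirectional` concatenates the forward and backward outputs, and `Conv1D` uses `padding='valid'` with stride 1. The helper `conv1d_valid_len` is a made-up name for this sketch.

```python
def conv1d_valid_len(steps, kernel_size, stride=1):
    """Output length of a 1-D convolution with 'valid' padding."""
    return (steps - kernel_size) // stride + 1

# values taken from the notebook's config and BiGRU_ConvNet definition
maxlen, embed_sz = 100, 50
gru_units, conv_filters, kernel_size, n_classes = 128, 64, 3, 6

shapes = {
    'embedding': (maxlen, embed_sz),
    'bi_gru':    (maxlen, 2 * gru_units),          # forward + backward concatenated
    'conv1d':    (conv1d_valid_len(maxlen, kernel_size), conv_filters),
    'pooled':    (2 * conv_filters,),              # average-pool and max-pool concatenated
    'output':    (n_classes,),
}
for name, shape in shapes.items():
    print(f"{name:10s} {shape}")
```

The two global pooling layers each collapse the time axis, so the final `Dense` layer sees a fixed 128-dimensional vector regardless of `maxlen`.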

In [12]:
model = BiLSTM()
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

In [None]:
model.fit(input_train, labels, batch_size=64, epochs=2, validation_split=0.01)

In [67]:
model = BiGRU_ConvNet()
model.compile(optimizer="adam", loss='binary_crossentropy', metrics=['accuracy'])

## Notes

This notebook does not demonstrate cross-validation. The built-in Keras `validation_split` parameter in `Model.fit` sets aside the last fraction of the data *before* it is shuffled, so the validation set is not a random sample unless the data is shuffled beforehand.
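One way around this is to build the hold-out indices by hand and pass the held-out rows through `validation_data` instead. The sketch below contrasts an unshuffled tail hold-out with a shuffled one, using index lists only. Both helper names are hypothetical, and the sample count and fraction are toy values.

```python
import random

def tail_holdout(n_samples, val_fraction):
    """Row indices held out when the LAST fraction of rows is taken, unshuffled."""
    n_val = int(n_samples * val_fraction)
    return list(range(n_samples - n_val, n_samples))

def shuffled_holdout(n_samples, val_fraction, seed=0):
    """Shuffle row indices first, then split into (train_idx, val_idx)."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)   # seeded so the split is reproducible
    n_val = int(n_samples * val_fraction)
    return idx[n_val:], idx[:n_val]

tail_idx = tail_holdout(100, 0.25)
train_idx, val_idx = shuffled_holdout(100, 0.25)
print(tail_idx[:3], len(train_idx), len(val_idx))
```

With the notebook's arrays, the shuffled indices could be used as `model.fit(input_train[train_idx], labels[train_idx], validation_data=(input_train[val_idx], labels[val_idx]), ...)`.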

---

## Asides

### Aside 1: Terms

- LSTM: [Long Short-Term Memory Network](http://colah.github.io/posts/2015-08-Understanding-LSTMs/#lstm-networks)
- GRU: [Gated Recurrent Unit Network](http://colah.github.io/posts/2015-08-Understanding-LSTMs/#variants-on-long-short-term-memory)
- NB-SVM: [Naïve Bayes Support Vector Machine](https://nlp.stanford.edu/pubs/sidaw12_simple_sentiment.pdf)
- ConvNet: [Convolutional Neural Network](http://cs231n.github.io/convolutional-networks/)

### Aside 2: Sequence Padding: Keras vs. PyTorch

PyTorch's `torchtext.data.BucketIterator` pads by batch, whereas Keras' `keras.preprocessing.sequence.pad_sequences` pads the entire dataset.

This points to some memory savings in PyTorch and makes it something to pay attention to as the library and its FastAI integration develop. That is, Keras pads all sequences in the dataset to `maxlen`, but torchtext batches sequences of similar length together and pads to the longest sequence in each batch. This seems like something doable in TensorFlow that's hidden by Keras' level of abstraction.
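The difference can be illustrated without either library. The sketch below is a simplified imitation of per-batch padding (sort by length, batch, pad each batch to its own longest sequence) and is not how `BucketIterator` is actually implemented. The function names and toy sequences are invented for the example, and zeros are placed at the front only to mirror the Keras default.

```python
def pad_batch(batch, value=0):
    """Pad every sequence in one batch to that batch's longest sequence."""
    longest = max(len(s) for s in batch)
    return [[value] * (longest - len(s)) + list(s) for s in batch]

def bucketed_batches(seqs, batch_size):
    """Sort by length so similar-length sequences share a batch, then pad per batch."""
    ordered = sorted(seqs, key=len)
    for i in range(0, len(ordered), batch_size):
        yield pad_batch(ordered[i:i + batch_size])

# toy token-id sequences (hypothetical) with very uneven lengths
toy = [[1], [2, 3], [4, 5, 6, 7, 8, 9], [1, 2], [3], [4, 5, 6, 7, 8]]

batches = list(bucketed_batches(toy, batch_size=2))
per_batch_cells = sum(len(row) for b in batches for row in b)
global_cells = len(toy) * max(len(s) for s in toy)   # everything padded to one length
print(per_batch_cells, "cells with per-batch padding vs", global_cells, "with global padding")
```

The saving grows with how skewed the length distribution is, which is typical for comment text where most entries are short and a few are very long.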

---

### Aside 3: Taking a look at embedding vectors

In [14]:
# GloVe 50d - First Vector
for o in open(EMBEDDING_GLOVE):
    print(len(o.split()) - 1)
    print(o)
    break

50
the 0.418 0.24968 -0.41242 0.1217 0.34527 -0.044457 -0.49688 -0.17862 -0.00066023 -0.6566 0.27843 -0.14767 -0.55677 0.14658 -0.0095095 0.011658 0.10204 -0.12792 -0.8443 -0.12181 -0.016801 -0.33279 -0.1552 -0.23131 -0.19181 -1.8823 -0.76746 0.099051 -0.42125 -0.19526 4.0071 -0.18594 -0.52287 -0.31681 0.00059213 0.0074449 0.17778 -0.15897 0.012041 -0.054223 -0.29871 -0.15749 -0.34758 -0.045637 -0.44251 0.18785 0.0027849 -0.18411 -0.11514 -0.78581



In [15]:
# fastText 300d - header line, then the third line of the file (the vector for "the")
i = 0
for o in open(EMBEDDING_FASTTEXT):
    if i == 0:
        print(o)
    if i == 2: 
        print(o)
        break
    i += 1

2000000 300

the 0.0231 0.0170 0.0157 -0.0773 0.1088 0.0031 -0.1487 -0.2672 -0.0357 -0.0487 0.0807 0.1532 -0.0739 -0.0291 -0.0445 -0.0014 0.1014 0.0186 -0.0253 0.0200 -0.0026 -0.0179 0.0005 0.0054 -0.0134 0.0233 -0.0755 -0.0156 0.0415 -0.4985 0.0410 -0.0616 0.0047 0.0325 -0.0162 -0.0172 0.0988 0.0766 -0.0796 -0.0345 0.0124 -0.1007 -0.0292 -0.0762 -0.1261 -0.0531 0.0424 0.0144 -0.0683 0.2859 0.0399 0.0201 0.3240 -0.0656 -0.0497 0.0090 0.0902 -0.0138 -0.0412 -0.0297 0.3139 -0.1428 0.0166 -0.0219 -0.0575 0.1359 -0.1655 0.0019 0.0323 -0.0013 -0.3033 -0.0091 0.1462 0.1860 -0.0524 0.1886 -0.7372 -0.0248 -0.0205 0.0022 0.5988 -0.0359 -0.0269 -0.0483 0.0109 -0.0044 0.0592 0.0174 0.0010 -0.0012 -0.0251 0.4620 -0.0443 -0.0350 0.0115 0.1496 0.3125 -0.0091 0.2517 0.0654 0.0237 -0.0432 0.0952 0.0650 -0.2932 0.0630 0.0236 0.0340 -0.0012 0.0889 -0.0006 -0.1736 0.0374 0.0313 -0.6184 0.0282 -0.3836 0.0589 0.2443 0.0602 0.0057 -0.0038 0.1352 0.0053 0.0193 -0.0213 0.0248 0.0214 0.2334 -0.0438 0.0527 0.02