# Final Project for the NLP Class (CogSci @ UW, summer 2018)
#### Radosław Jurczak
***

## The Jigsaw Toxic Comments Classification Challenge

---
### Introduction & Task Description
This notebook contains the full code of a solution to [Kaggle's Jigsaw Toxic Comments Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge).

The Google/Jigsaw ConversationAI research group (a team focused mainly on automated web moderation and hate speech detection) posted a dataset of comments collected from Wikipedia page edit discussions. The comments had been hand-labelled with respect to different kinds of toxicity. Six binary labels were used:
* toxic
* severe toxic
* obscene
* threat
* insult
* identity hate.

As we can see, the labels are quite different from each other, ranging from very specific types of toxicity (threat, obscenity) to a more "generic" toxicity (the first two labels).

This is __not__ a simple single-label classification task. Each comment may belong to __more than one__ category. Of course, most of the comments are labelled 0 for all categories, i.e. they are not toxic. This problem is discussed in Section 1 of this notebook.

For each comment in the test set, the model is expected to predict probability for each of the six types of toxicity specified above. Submissions are scored using the __mean column-wise ROC AUC (area under the Receiver Operating Characteristic curve)__. In other words, the final score is the average of ROC AUC scores for each category.
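
To make the metric concrete, here is a minimal pure-Python sketch of the mean column-wise ROC AUC. It relies on the fact that the ROC AUC equals the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one (ties count as one half). The helper names `roc_auc` and `mean_columnwise_roc_auc` and the toy data are illustrative assumptions, not competition code; the notebook itself uses scikit-learn's `roc_auc_score`, which averages per-column scores by default when given a 2D label array.

```python
def roc_auc(y_true, y_score):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC is undefined when only one class is present")
    # a tie between a positive and a negative counts as half a correct ranking
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_columnwise_roc_auc(y_true_rows, y_score_rows):
    """Average the per-label ROC AUC over all label columns."""
    n_cols = len(y_true_rows[0])
    aucs = []
    for j in range(n_cols):
        col_true = [row[j] for row in y_true_rows]
        col_score = [row[j] for row in y_score_rows]
        aucs.append(roc_auc(col_true, col_score))
    return sum(aucs) / n_cols


# toy data: 4 comments, 2 labels (hypothetical values)
toy_true = [[1, 0], [0, 0], [1, 1], [0, 1]]
toy_pred = [[0.9, 0.2], [0.1, 0.3], [0.4, 0.8], [0.5, 0.6]]
toy_score = mean_columnwise_roc_auc(toy_true, toy_pred)  # (0.75 + 1.0) / 2
```

The pairwise loop is quadratic in the number of examples, so it is only suitable for illustration.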

As usual with Kaggle competitions, the final score is calculated on a larger dataset called the "private" test set; during the competition, submissions are scored only against a small fraction of the test data, the "public" test set (this promotes more generalisable models). When developing my solution, I used only the public (small) test set; I checked the private (i.e. ranking-relevant) test set score only at the end, after selecting the final model. The whole work was therefore done fully in accordance with Kaggle competition rules.

---
### Solution Overview
In general, the solution is a deep recurrent network. Although this is a Kaggle competition, I chose not to use any form of blending/ensembling but to build and tune a single model only.

The network was trained on a single Nvidia GPU (16 GB) hosted remotely on [Paperspace](https://www.paperspace.com/) (per-hour charge). I was using a $13 free credit for students; this allowed only for limited experimentation, so the model could possibly be improved further with more computational resources.

---
### 0. Loading
First, we'll load all the necessary libraries. This may take a while.

In [None]:
# basics
from collections import defaultdict
import re

# Numerics, data processing & similar utils
import numpy as np
import pandas as pd

# nlp tools
import nltk
from nltk.corpus import stopwords
import spacy

# Machine learning (metrics from scikit-learn, model building tools from Keras on tensorflow backend)
from keras import backend as K
from keras.callbacks import Callback, EarlyStopping, ModelCheckpoint
from keras.layers import Conv1D, CuDNNGRU, CuDNNLSTM, GlobalAveragePooling1D, GlobalMaxPooling1D, Dense, Embedding, Flatten
from keras.layers import Activation, BatchNormalization, Bidirectional, concatenate, Dropout, Input, SpatialDropout1D
from keras.models import Model, load_model
from keras.optimizers import Adam
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import Sequence, pad_sequences
from keras.regularizers import l1_l2
from keras.utils import plot_model
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Visualisation
import matplotlib.pyplot as plt
%matplotlib inline
import pydot
from tqdm import tqdm, tqdm_notebook

In [None]:
nltk.download("stopwords")

In [None]:
tqdm_notebook().pandas()

Now load the data.

I am using pre-trained 300-dimensional Fasttext word vectors downloaded from [Facebook Research's github account](https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md). More information on the Fasttext embedding is included in Section 2 of this notebook.

In [None]:
TRAIN_FILE = "../data/train.csv"
TEST_DATA_FILE = "../data/test.csv"
EMB_FASTTEXT_300_FILE = "../data/wiki.en.vec"

train = pd.read_csv(TRAIN_FILE)
test_data = pd.read_csv(TEST_DATA_FILE)

---
### 1. (Very) fast-and-simple exploratory analysis
Let's have a quick look at the data.

In [None]:
print("Training set dimension: " + str(train.shape))
print("Test set dimension: " + str(test_data.shape))

In [None]:
train.head(5)

There are 159571 comments in the training set.

It looks as if there was about the same amount of data in the test set, but according to the competition description, only a fraction of the examples is actually relevant for testing and score calculation. The "fake"/unused test examples were labeled -1 for all label categories (this labelling remained unknown to competitors). Although the competition has ended recently and thus the fully labeled version of the test set has been published, I will not remove the unused examples to stick to a realistic Kaggle setting.

As we can see, there are luckily no missing values in the training set, which will save us some work:

In [None]:
np.sum(train.isna())

In [None]:
train_size = train.shape[0]
test_size = test_data.shape[0]
print(f"Number of examples in the training set: {train_size}")
print(f"Number of examples in the test set: {test_size}")

Now let's see how many examples in the training data have been labelled as toxic (some way or another):

In [None]:
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
percent_toxic = train[LABELS].sum().apply(lambda x: (x / train_size) * 100)
print("Toxic comments by category (as percentages of the whole dataset): \n")
for category, percent in percent_toxic.items():
    print(category + ": " + str(round(percent, 2)) + "%")

Two problems are obvious from what we have seen above. First, only a small fraction of the whole dataset consists of toxic comments. This will make model training harder.

Second, the toxicity distribution between categories is very unbalanced. This also poses a potential problem for training, as models might become more sensitive to some types of toxicity while neglecting others. Things will get even worse if the test distribution does not match the training distribution, which is quite probable for such peculiar data.

In [None]:
ts = train[LABELS].sum()
print("Total toxicity occurrences detected by human annotators: " + str(ts.sum()))

---
### 2. Data preprocessing
What we have in the training (and test) set is pretty much raw text written in informal internet language. Quite a lot of preprocessing is necessary to turn this into a reasonably clean dataset suitable for feeding a neural network.

In [None]:
train_target = train[LABELS]
train_comments = train["comment_text"]
test_comments = test_data["comment_text"]

In the cell below, I have defined a regex-based word replacement vocabulary. It's a revised and extended version of a similar dictionary that circulated among the competitors (I am unsure of the original author). The making of this vocabulary was probably the most linguistics-based part of my work, as well as the one that took the longest (I needed to look through at least part of the toxic comments in the training set). The vocabulary covers the following issues:
* all newline signs (\n), IP addresses and user nicknames are removed;
* common abbreviations like "afaik" or "imho" are expanded to their full forms;
* all the expressions of amusement and/or amazement and/or mockery like "lol", "lel", "kek" etc. are reduced to one common form "lol";
* emoticons like ;), :) and the like are reduced to the token "happy"; similarly, the many varieties of the :( emoticon are reduced to the token "sad";
* numerous cases of deliberate or non-deliberate misspelling, esp. of "toxic" words like "fuck", "bitch" etc., are corrected and replaced with the correct word forms;
* @ and & symbols are replaced with the words "at" and "and", respectively;
* "in/in'" word endings are replaced with the proper suffix "ing".

Forms like "I'll", "he'd" were deliberately left as they were; the reason for that is described below.
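
The substitution mechanism can be illustrated on a much smaller scale. The sketch below uses a hypothetical three-entry dictionary (`MINI_DICT`) and a hypothetical helper `normalise`; neither is part of the actual solution, they only show how a "target form -> list of regex patterns" mapping is applied with `re.sub`.

```python
import re

# Hypothetical miniature of the replacement vocabulary: target form -> regex patterns.
MINI_DICT = {
    " lol ": [r"\blel\b", r"\bkek\b"],
    " happy ": [r":\)+", r";\)+"],
    " and ": [r"&"],
}


def normalise(comment, replacement_dict=MINI_DICT):
    """Lowercase the comment and apply every substitution in dictionary order."""
    comment = comment.lower()
    for base, patterns in replacement_dict.items():
        for pattern in patterns:
            comment = re.sub(pattern, base, comment)
    # collapse the extra spaces introduced by the space-padded replacements
    return re.sub(r" +", " ", comment).strip()


example = normalise("KEK :)) you & me")  # -> "lol happy you and me"
```

Because the substitutions run sequentially, the order of dictionary entries matters: a later pattern sees the output of all earlier ones.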

In [None]:
CLEANING_DICT = {
    " ": ["\n", "/", "(([0-9]{2,3}[^a-z0-9])+)+", "\[\[.*\]", " +"],
    " as far as i know ": ["afaik"],
    " in my opinion ": ["imo", "imho"],
    " sad ": ["\:\(", "\:\'\(", "\:\(+"],
    " lol ": [" (h[ae])+h? ", " lel ", " kek ", "lu+lz?", "loo(o)*lz?"],
    " happy ": ["\:\)+", ";\)+", "\:d+"],
    " american ": ["amerikan"],
    " adolf ": ["adolf"],
    " hitler ": ["hitler"],
    "fuck": ["(f)(u|[^a-z0-9 ])(c|[^a-z0-9 ])(k|[^a-z0-9 ])", "(f)([^a-z]*)(u)([^a-z]*)(c)([^a-z]*)(k)",
            " f[!@#\$%\^\&\*]*u[!@#\$%\^&\*]*k", "f u u c", "(f)(c|[^a-z ])(u|[^a-z ])(k)", r"f\*", "feck ",
            " fux ", "f\*\*", "f\.u\.", "f###", " fu ", "f@ck", "f u c k", "f uck", "f ck"],
    "fucking ": ["f[u|a]c?king?"],
    " ass ": ["[^a-z]ass ", "[^a-z]azz ", "arrse", " arse ", "@\$\$", "[^a-z]anus", " a\*s\*s", "[^a-z]ass[^a-z ]",
             "a[@#\$%\^&\*][@#\$%\^&\*]", "[^a-z]anal ", "a s s", "butt "],
    " asshole ": [" a[s|z]*wipe", "a[s|z]*[w]*h[o|0]+[l]*e", "@\$\$hole"],
    " bitch ": ["b[w]*i[t]*ch", "b!tch", "bi\+ch", "b!\+ch", "(b)([^a-z]*)(i)([^a-z]*)(t)([^a-z]*)(c)([^a-z]*)(h)",
                "biatch", "bi\*\*h", "bytch", "b i t c h"],
    " bastard ": ["ba[s|z]t[a|e]rd"],
    " gay ": ["gay"],
    " cock ": ["[^a-z]cock", "c0ck", "[^a-z]cok ", "c0k", "[^a-z]cok[^aeiou]", " cawk",
               "(c)([^a-z ])(o)([^a-z ]*)(c)([^a-z ]*)(k)", "c o c k"],
    " dick ": [" dick[aeiou]", "deek", "d i c k"],
    " suck ": ["(s)([^a-z ]*)(u)([^a-z ]*)(c)([^a-z ]*)(k)", "sucks", "5uck", "s u c k"],
    " sucking ": ["s[u|a]c?king?"],
    " cunt ": ["cunt", "c u n t", "c\*\*\*", "c\*\*t", "c\*nt"],
    " bullshit ": ["bullsh\*t", "bull\$hit"],
    " homosexual ": ["homo"],
    " jerk ": ["jerk"],
    " idiot ": ["i[d]+io[t]+", "(i)([^a-z ]*)(d)([^a-z ]*)(i)([^a-z ]*)(o)([^a-z ]*)(t)", "idiots", "i d i o t"],
    " dumb ": ["(d)([^a-z ]*)(u)([^a-z ]*)(m)([^a-z ]*)(b)", "d\*mb", "$dumm"],
    " shit ": ["shitty", "(s)([^a-z ]*)(h)([^a-z ]*)(i)([^a-z ]*)(t)", "shite", "\$hit", "s h i t"],
    " shit hole ": ["shythole", "shithole"],
    " retard ": ["returd", "retad", "retard", "ret[au]rded", "wiktard", "wikitard", "wikitud"],
    " dumb ass": ["dumbass", "dubass", "du(m)+ass"],
    " ass head ": ["butthead"],
    " sex ": ["s3x", "s\*x"],
    " nigger ": ["nigger", "ni[g]+a", " nigr ", "negrito", "niguh", "n3gr", "n i g g e r"],
    " shut the fuck up ": ["stfu"],
    " rape ": ["reap", "rpe"],
    " pussy ": ["pussy[^c]", "pusy", "pussi[^l]", "pusses"],
    " faggot ": ["faggot", " fa[g]+[s]*[^a-z ]", "fagot", "f a g g o t", "faggit",
                 "(f)([^a-z ]*)(a)([^a-z ]*)([g]+)([^a-z ]*)(o)([^a-z ]*)(t)", "fau[g]+ot", "fae[g]+ot",
                 "fagg+az", "fagg+otz"],
    " motherfucker ": [" motha ", "motha f ", "moth[a|er]fucka?r", "mother f", "motherucker"],
    " whore ": ["w\*\*\*(\*)?", "whor", "w h o r e"],
    " and ": ["&"],
    " at ": [" @ "],
    "ing ": ["$in\'"]
}

For preprocessing, I have used [SpaCy](https://spacy.io/), a very fast, production-oriented NLP library for Python. As a language model for preprocessing tasks, I loaded one of SpaCy's predefined English language models: en_core_web_md, which is a convolutional neural network capable of tokenization, lemmatization, POS tagging and dependency parsing, trained on CommonCrawl and OntoNotes. This model's performance on the advanced tasks like tagging or parsing is actually irrelevant, as I'm using only tokenization+lemmatization, which is of decent quality.

It is important to mention that SpaCy can automatically deal with contractions like "I'd", "he'll" and the like. That is why such patterns are not present in the replacement dictionary.

The whole preprocessing is contained within the clean_text() function defined below. It takes as input a single comment string and performs the following steps:
* lowercase all words in the comment;
* use regular expressions to apply the substitutions defined in the replacement dictionary;
* tokenize the result;
* lemmatize the result;
* remove stopwords (I used the list of English stopwords available as part of [NLTK](https://www.nltk.org/));
* return all remaining tokens joined in a single string.

This function is applied both to the training and test set.

WARNING: __do not__ run the cleaning procedure (unless you want to check if it works). Even with SpaCy (which is really very fast in comparison to other similar tools), it does take time. Below you can find code that loads the preprocessed datasets directly into pandas DataFrame objects.

In [None]:
STOPWORDS = stopwords.words("english")
SPACY_LANG_MODEL = spacy.load("en_core_web_md")

In [None]:
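def clean_text(comment, replacement_dict=CLEANING_DICT, spacy_model=SPACY_LANG_MODEL, stopwds=STOPWORDS):
    """Lowercase, normalise, lemmatize and strip stopwords from a single comment string."""
    comment = comment.lower()
    # apply every regex substitution from the replacement dictionary, in insertion order
    for base, patterns in replacement_dict.items():
        for pattern in patterns:
            comment = re.sub(pattern, base, comment)
    # only tokenization and lemmatization are needed, so the heavier pipeline components are disabled
    tokenized = list(spacy_model(comment, disable=["tagger", "parser", "ner"]))
    # this SpaCy version lemmatizes all pronouns to "-PRON-"; keep their lowercased surface form instead
    lemmatized = [token.lemma_ if token.lemma_ != "-PRON-" else token.lower_ for token in tokenized]
    lemmatized = [lem for lem in lemmatized if lem not in stopwds and lem != " "]
    comment = " ".join(lemmatized)
    return comment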

In [None]:
train_clean = train_comments.progress_apply(clean_text)

In [None]:
train_clean.to_csv("../data/train_clean.csv", index=False, header=["comment_text"])

In [None]:
test_clean = test_comments.progress_apply(clean_text)

In [None]:
test_clean.to_csv("../data/test_clean.csv", index=False, header=["comment_text"])

__Run the cell below to load preprocessed data:__

In [None]:
train_clean = pd.read_csv("../data/train_clean.csv", header=0)
test_clean = pd.read_csv("../data/test_clean.csv", header=0)
train_clean = train_clean["comment_text"]
test_clean = test_clean["comment_text"]

The processed comments need to be tokenized to a Keras-feedable form. This is done with [Keras built-in tokenizer](https://keras.io/preprocessing/text/). It automatically removes special symbols, whitespaces and punctuation; it also returns a word index (a dictionary mapping words to their integer vocabulary indices).

As far as possible, I want to keep any information available in the comment text, so the model will use all the words that appear in the comments, hence the large MAX_FEATURES constant defining maximum vocabulary size. MAX_COMMENT_LEN is the cutoff comment length (longer comments will be trimmed, shorter ones will be zero-padded later).

In [None]:
MAX_FEATURES = 300000
MAX_COMMENT_LEN = 900

In [None]:
def tokenize_for_keras(tr_set, test_set, max_features):
    tokenizer = Tokenizer(num_words=max_features)
    tokenizer.fit_on_texts(list(tr_set) + list(test_set))
    train_tokenized = tokenizer.texts_to_sequences(tr_set)
    test_tokenized = tokenizer.texts_to_sequences(test_set)
    tokenizer_index = tokenizer.word_index
    return (tokenizer_index, train_tokenized, test_tokenized)

In [None]:
train_feed = train_clean.fillna("fillna")
test_feed = test_clean.fillna("fillna")
word_index, train_tokenized, test_tokenized = tokenize_for_keras(train_feed, test_feed, max_features=MAX_FEATURES)

Now we need to load the Fasttext word vectors and form an embedding dictionary to pass to Keras embedding layer.

As a side note, the advantage of Fasttext over other embeddings, e.g. GloVe or word2vec, is that its vectors are built from character n-grams and are therefore capable of representing incomplete words (see [this publication by Bojanowski et al. (2016)](https://arxiv.org/abs/1607.04606)), which seems useful when working with informal internet language data.
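
The subword idea from the Bojanowski et al. paper can be sketched in a few lines: each word is wrapped in the boundary markers `<` and `>` and decomposed into character n-grams (the paper uses lengths 3 to 6), and the word vector is the sum of the n-gram vectors. The helper `char_ngrams` below is a hypothetical illustration of the decomposition step only; it is not the Fasttext implementation and is not used elsewhere in this notebook.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """List the character n-grams of a word wrapped in boundary markers."""
    wrapped = "<" + word + ">"
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


# the example from the paper: trigrams of "where"
trigrams = char_ngrams("where", n_min=3, n_max=3)
# -> ["<wh", "whe", "her", "ere", "re>"]

# a misspelled form shares many n-grams with the correct one
shared = set(char_ngrams("idiot")) & set(char_ngrams("idiott"))
```

Note that a plain `.vec` text file stores only the resulting whole-word vectors, so words missing from the file still end up without a pretrained vector in the embedding matrix.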

In [None]:
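def get_embedding_dict(emb_file=EMB_FASTTEXT_300_FILE, emb_size=300):
    """Read a Fasttext .vec file into a {word: vector} dictionary."""
    emb_dict = {}
    with open(emb_file, encoding="utf-8") as file:
        for line in file:
            emb = line.rstrip().split(" ")
            # skip the "<vocabulary size> <dimension>" header line and any malformed line
            if len(emb) != emb_size + 1:
                continue
            emb_dict[emb[0]] = np.asarray(emb[1:], dtype="float32")
    return emb_dict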

In [None]:
emb_dict_fasttext = get_embedding_dict()

In [None]:
EMB_SIZE_FASTTEXT = 300

The Keras-tokenized comments need to be zero-padded so that they all have the same length (defined above as MAX_COMMENT_LEN).

In [None]:
train_tokenized = pad_sequences(train_tokenized, maxlen=MAX_COMMENT_LEN)
test_tokenized = pad_sequences(test_tokenized, maxlen=MAX_COMMENT_LEN)
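
It is worth knowing what `pad_sequences` does with its default settings (`padding="pre"`, `truncating="pre"`): zeros are added at the front of short sequences, and sequences longer than `maxlen` lose their beginning, not their end. The helper `pad_pre` below is a hypothetical pure-Python imitation of that default behaviour for a single sequence, written only for illustration.

```python
def pad_pre(sequence, maxlen, value=0):
    """Imitate the default pad_sequences behaviour for one sequence."""
    if len(sequence) >= maxlen:
        # "pre" truncation: keep only the last maxlen tokens
        return list(sequence[-maxlen:])
    # "pre" padding: fill the front with the padding value
    return [value] * (maxlen - len(sequence)) + list(sequence)


short = pad_pre([5, 8, 2], maxlen=5)           # -> [0, 0, 5, 8, 2]
long = pad_pre([1, 2, 3, 4, 5, 6], maxlen=4)   # -> [3, 4, 5, 6]
```

With MAX_COMMENT_LEN set to 900, only very long comments are affected by the truncation.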

In order to construct an Embedding layer in Keras, an embedding matrix (i.e. a matrix whose i-th row holds the vector of the word with index i) is needed. The function below builds the embedding matrix from a given embedding dictionary, the tokenizer index, and parameters specifying the maximum vocabulary size and the dimension of the embedding vectors.

The matrix is initialized with 300-dimensional vectors filled with zeros, then filled with available embeddings. This means that words for which we do not have Fasttext embeddings are zero-initialized.

The function's optional boolean parameter "gaussian_initialization" specifies whether the out-of-vocabulary words should instead be initialized with random numbers drawn from the normal distribution with mean and standard deviation computed from all the embeddings available in the vocabulary. This solution improved the score when combined with GloVe vectors (which I was initially using), but for Fasttext, zero-initialized vectors tend to work better.

In [None]:
def get_embedding_matrix(emb_dict, word_index, max_features, emb_size, gaussian_initialization=False):
    # Keras word indices start at 1 (0 is reserved for padding), hence the "+ 1"
    n_words = min(max_features, len(word_index) + 1)
    
    if gaussian_initialization:
        stacked_embs = np.stack(list(emb_dict.values()))
        emb_mean, emb_std = (np.mean(stacked_embs), np.std(stacked_embs))
        emb_matrix = np.random.normal(emb_mean, emb_std, size=(n_words, emb_size))
    else:
        emb_matrix = np.zeros((n_words, emb_size))
    
    for word, index in word_index.items():
        if index >= n_words:
            continue
        vec = emb_dict.get(word)
        if vec is not None:
            emb_matrix[index] = vec
    return emb_matrix

In [None]:
emb_matrix_fasttext = get_embedding_matrix(emb_dict=emb_dict_fasttext, word_index=word_index, max_features=MAX_FEATURES, emb_size=EMB_SIZE_FASTTEXT, gaussian_initialization=False)
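
Since out-of-vocabulary words end up as all-zero rows, it can be useful to check how much of the vocabulary is actually covered by the pretrained vectors. The sketch below is a hypothetical helper (`embedding_coverage`) run on made-up toy data; it was not part of the original experiments.

```python
def embedding_coverage(word_index, emb_dict, max_features):
    """Fraction of the kept vocabulary that has a pretrained vector."""
    kept = [word for word, index in word_index.items() if index < max_features]
    if not kept:
        return 0.0
    found = sum(1 for word in kept if word in emb_dict)
    return found / len(kept)


# toy data (hypothetical): one of four words has no pretrained vector
toy_index = {"the": 1, "idiot": 2, "wikitard": 3, "lol": 4}
toy_embs = {"the": [0.1], "idiot": [0.2], "lol": [0.3]}
coverage = embedding_coverage(toy_index, toy_embs, max_features=10)  # 3 / 4
```

On the real data the same call would take `word_index`, `emb_dict_fasttext` and `MAX_FEATURES` as arguments.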

---
### 3. Model building
Now we can construct the model.

---
First, a few utils are necessary.
* The official evaluation metric in the competition is the ROC AUC score; it is not, however, a differentiable loss function, so we need a reasonable proxy. I chose binary crossentropy, which is a standard loss function for multi-label classification (each of the six sigmoid outputs is treated as an independent binary problem).
* It's better to track ROC AUC anyway. For that, I defined a custom Keras callback.
* I also define a model checkpointer which creates backup by saving the current best model (in terms of ROC AUC) to file.
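
To show what the proxy loss measures, here is a minimal pure-Python sketch of binary crossentropy for a single comment with six labels, averaged over the labels. The function name, the clipping constant `eps` and the example predictions are assumptions made for illustration; the model itself uses the Keras built-in `"binary_crossentropy"` loss.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary crossentropy over the labels of one example."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)


# label order: toxic, severe_toxic, obscene, threat, insult, identity_hate
labels = [1, 0, 1, 0, 1, 0]  # toxic, obscene and insult at the same time
good = binary_crossentropy(labels, [0.9, 0.1, 0.8, 0.05, 0.7, 0.1])
bad = binary_crossentropy(labels, [0.1, 0.9, 0.2, 0.95, 0.3, 0.9])
uninformed = binary_crossentropy(labels, [0.5] * 6)  # equals ln(2)
```

Each label contributes its own term, which is why a comment can be penalised for several categories independently.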

In [None]:
class ROC_AUC_Score(Callback):
    """Custom Keras callback class tracking ROC AUC.
    After every epoch's end, it prints out current ROC AUC score for model's predictions
    and saves the value to model's logs for potential use by other callbacks.
    """
    
    def __init__(self, validation_data=(), interval=1):
        super().__init__()
        self.interval = interval
        self.val_X, self.val_target = validation_data
     
    def on_train_begin(self, logs={}):
        self.aucs = []
        self.losses = []
    
    def on_epoch_end(self, epoch, logs={}):
        if epoch % self.interval == 0:
            preds = self.model.predict(self.val_X, verbose=0)
            auc = roc_auc_score(self.val_target, preds)
            self.aucs.append(auc)
            logs["roc_auc_val"] = auc
            print(f"Epoch: {epoch+1} - ROC AUC score: {round(auc, 5)}")


The model is defined below. It has the following components:
* Non-trainable Embedding layer (using Fasttext vectors for embedding as explained above). Trainable vectors tend to overfit terribly (the dataset is too small and also somehow prone to overfitting), so I decided not to unfreeze the embeddings.
* Spatial dropout (rate 0.5) is applied to the embeddings. Spatial dropout drops entire feature maps rather than individual activations, which in the case of our model means that the same randomly chosen embedding dimensions are "switched off" for every word of a given comment, with a new choice drawn at every training step. Like every dropout layer, this has a regularizing effect. Adding the spatial dropout layer right after the embeddings turned out to be one of the biggest improvements to the model's score.
* Dense layer with 128 units and ReLU activation. The idea of including this layer is to provide a substitute to trainable vectors (the network can learn to "preprocess" the embedding layer if necessary).
* Batch normalization and dropout (rate 0.5). All dense layers in the model are followed by the commonly used pattern of batchnorm+dropout. Batch normalization normalizes the preceding layer's activations using the current batch's mean and variance; this usually speeds up learning, decreases the model's sensitivity to initial (randomly initialized) weight values and to possible differences between training/test feature distributions, allows for adding more layers (basically makes the layers more independent in a way) and has a small regularizing effect. Dropout is a standard regularization technique. I'm using quite a high dropout rate (established empirically) because it's really easy to overfit on this dataset even for much simpler models than the final one.
* After that, the main trick is applied: the model splits into two separate branches. Along the first branch, an LSTM layer with 64 units is applied (it's bidirectional, i.e. goes through the input both right-to-left and left-to-right). The LSTM's output is then convolved with 128 filters of size 3 (in the 1D case, this means a moving "horizontal" window of length 3) with stride 1. The convolution's padding is valid, which means that the input dimension is reduced, not artificially kept as it was by use of padding. As usual, after the convolution comes a pooling layer; in this case, both max and average pooling are applied.
* The second branch is similar, but it uses a greater number (128) of simpler GRU units instead of LSTM and fewer (64 instead of 128) convolutional filters.
* After that, the branches are concatenated together; one more trick is to use both the results of max and average pooling over each branch, simply stacked into one long tensor. This allows the network to extract as much information about the input's properties as possible. Details of this approach will be described during the class presentation.
* The concatenated results of pooling are then fed to another dense layer of 128 units with ReLU activation followed by batchnorm and 0.5 dropout. This allows the network to extract necessary features from the tensor obtained from pooling.
* Finally, a simple 6-unit sigmoid-activated dense layer outputs the predicted probability of each label.

The network uses the Adam (adaptive moment estimation) optimizer with an initial learning rate of 0.001. One important thing is that gradient clipping is added to the optimizer. It helps to deal with the so-called exploding gradient problem, which is a common issue in recurrent networks and also proved empirically to be an issue in this particular task. Clipping the gradient values, rather than the gradient norm, turned out to work a bit better.

---
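
The difference between spatial and ordinary dropout is easiest to see on a tiny example. The sketch below is a hypothetical pure-Python imitation for one comment (a list of word vectors): one keep/drop decision is drawn per embedding dimension and reused at every timestep, and the surviving values are rescaled by 1 / (1 - rate), as in inverted dropout. It only illustrates the idea and is not the Keras `SpatialDropout1D` implementation.

```python
import random


def spatial_dropout_1d(sequence, rate, rng):
    """Zero the same randomly chosen feature columns at every timestep of one sample."""
    n_features = len(sequence[0])
    # one keep/drop decision per feature, shared by all timesteps
    keep = [rng.random() >= rate for _ in range(n_features)]
    scale = 1.0 / (1.0 - rate)  # rescale the surviving features
    return [[x * scale if k else 0.0 for x, k in zip(step, keep)] for step in sequence]


rng = random.Random(0)
# toy comment: 3 words, 4 embedding dimensions (made-up values)
comment = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], [9.0, 10.0, 11.0, 12.0]]
dropped = spatial_dropout_1d(comment, rate=0.5, rng=rng)
zero_columns = [j for j in range(4) if all(step[j] == 0.0 for step in dropped)]
```

Whatever the random draw, every column of `dropped` is either entirely zero or entirely kept; ordinary dropout would instead zero individual cells.
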
The schematic picture of the model (created with Keras built-in plot_model() function) looks as follows:

![model](../models/final-model.png)
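
The tensor sizes implied by the description above can be checked with simple arithmetic. The sketch below assumes the default Keras behaviour of `Bidirectional` (outputs of both directions are concatenated) and uses the standard output-length formula for a "valid" convolution; the variable names are illustrative only.

```python
def conv1d_valid_length(input_length, kernel_size, stride=1):
    """Output length of a 1D convolution with "valid" padding."""
    return (input_length - kernel_size) // stride + 1


seq_len = 900                                            # MAX_COMMENT_LEN
conv_len = conv1d_valid_length(seq_len, kernel_size=3)   # 898 positions

lstm_features = 2 * 64    # bidirectional LSTM, 64 units per direction
gru_features = 2 * 128    # bidirectional GRU, 128 units per direction
lstm_branch_filters = 128
gru_branch_filters = 64

# global max pooling and global average pooling each yield one value per filter
concatenated = 2 * lstm_branch_filters + 2 * gru_branch_filters  # 384 features
```

The final dense layers therefore operate on a fixed-size vector regardless of the comment length.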

In [None]:
def build_model():
    model_input = Input(shape=(MAX_COMMENT_LEN,))
    X = Embedding(input_dim=emb_matrix_fasttext.shape[0], output_dim=EMB_SIZE_FASTTEXT, weights=[emb_matrix_fasttext], trainable=False)(model_input)
    X = SpatialDropout1D(rate=0.5)(X)
    
    X = Dense(units=128, activation="relu")(X)
    X = BatchNormalization()(X)
    X = Dropout(rate=0.5)(X)
    
    Y = Bidirectional(CuDNNLSTM(units=64, return_sequences=True))(X)
    Y = Conv1D(filters=128, kernel_size=3, strides=1, padding="valid")(Y)
    maxpool2 = GlobalMaxPooling1D()(Y)
    avgpool2 = GlobalAveragePooling1D()(Y)
    
    X = Bidirectional(CuDNNGRU(units=128, return_sequences=True))(X)
    X = Conv1D(filters=64, kernel_size=3, strides=1, padding="valid")(X)
    maxpool = GlobalMaxPooling1D()(X)
    avgpool = GlobalAveragePooling1D()(X)
    
    X = concatenate([avgpool, maxpool, avgpool2, maxpool2])
    
    X = Dense(units=128, activation="relu")(X)
    X = BatchNormalization()(X)
    X = Dropout(rate=0.5)(X)
    X = Dense(units=6, activation="sigmoid")(X)

    model = Model(inputs=model_input, outputs=X)
    opt = Adam(lr=0.001, clipvalue=1.0)
    model.compile(optimizer=opt, loss="binary_crossentropy", metrics=["accuracy"])
    return model
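
The `clipvalue=1.0` argument passed to the optimizer clips every gradient component separately; the alternative offered by Keras optimizers is `clipnorm`, which rescales the whole gradient. The sketch below contrasts the two on a made-up gradient vector. The helper functions are hypothetical simplifications written for illustration, not the Keras internals.

```python
import math


def clip_by_value(grad, clip_value):
    """Clamp each component to the interval [-clip_value, clip_value]."""
    return [max(-clip_value, min(clip_value, g)) for g in grad]


def clip_by_norm(grad, clip_norm):
    """Rescale the whole vector so that its L2 norm does not exceed clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= clip_norm:
        return list(grad)
    return [g * clip_norm / norm for g in grad]


grad = [3.0, -4.0, 0.5]              # toy gradient
by_value = clip_by_value(grad, 1.0)  # -> [1.0, -1.0, 0.5]
by_norm = clip_by_norm(grad, 1.0)    # same direction as grad, unit length
```

Value clipping changes the direction of the update (small components are left untouched while large ones are cut), whereas norm clipping preserves it.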

I couldn't do crossvalidation (it would take too much time on the computer I had), so I used the standard train-validation-test approach. The train/validation split is 90%/10%.

The model was set for 15 epochs of training. I've tried a few minibatch sizes; rather small 64-item minibatches turned out to work best, perhaps due to the slight regularizing effect they provide.

In [None]:
model = build_model()
X_tr, X_val, y_train, y_val = train_test_split(train_tokenized, train_target, train_size=0.9, random_state=233)

In [None]:
# ROC AUC tracker, model checkpointer
roc_auc = ROC_AUC_Score(validation_data=(X_val, y_val))
model_path = "final-model.h5"
checkpointer = ModelCheckpoint(model_path, monitor="roc_auc_val", mode="max", verbose=1, save_best_only=True)

In [None]:
model.fit(x=X_tr, y=y_train, batch_size=64, epochs=15, validation_data=(X_val, y_val), callbacks=[roc_auc, checkpointer])

__If needed, load the final model by running the cell below. DO NOT run the whole training routine: training this network from scratch takes about 1.5h on a GPU, and I don't even know how long it would take on a CPU.__

In [None]:
model = load_model("final-model.h5")

Make predictions and save them to a file; back up the model.

In [None]:
pred_test = model.predict(x=[test_tokenized], batch_size=1024, verbose=1)

In [None]:
def predictions_to_csv(predictions, submission_name):
    submission = pd.read_csv("../data/sample_submission.csv")
    submission[LABELS] = predictions
    submission.to_csv("../output/" + submission_name + ".csv", index=False)

In [None]:
predictions_to_csv(pred_test, "final-submission")

In [None]:
def backup_model(model, model_name):
    model_path = "../models/" + model_name + ".h5"
    model.save(model_path)
    picture_path = "../models/" + model_name + ".png"
    plot_model(model, to_file=picture_path)

In [None]:
backup_model(model, "final-model")

---
### 4. Results and prospects of improvement
* The model scores __0.9801 ROC AUC__ on the private (full) test set (and 0.9803 on the public test set). This is a pretty good estimate of quality as well as the basis for the Kaggle ranking.
* This is __0.0084 ROC AUC below the winning model__.

One way of improving the model I would like to try if I had time and resources is the trick the winning team used. It's a technique of data augmentation by translating each comment to German, French and Spanish using Google Translate's public API, then retranslating to English and adding the resulting text to the dataset. This method, however, comes with numerous difficulties I'll explain in detail during the presentation.

Were it an active Kaggle competition, I'd probably begin ensembling (most of the top 5% solutions were ensembles), but here I wanted to keep things clean and build a single model.

If I had much more time or my own GPU, I would switch from the train/validation/test framework to crossvalidation to get better estimates of model quality.
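
For reference, the index bookkeeping behind k-fold crossvalidation is simple; the expensive part is retraining the network once per fold (with roughly 1.5h per training run, 5 folds would mean several hours of GPU time). The generator below is a hypothetical minimal sketch without shuffling; in practice the examples should be shuffled first, for instance with scikit-learn's `KFold(shuffle=True)`.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, validation_indices) for contiguous, near-equal folds."""
    # the first (n_samples % n_folds) folds get one extra example
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size


folds = list(k_fold_indices(n_samples=10, n_folds=3))
fold_sizes = [len(val) for _, val in folds]  # -> [4, 3, 3]
```

Every example lands in exactly one validation fold, so the averaged validation score uses the whole training set.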




---
#### Sources:
* The model is not based on any publicly available solution known to me. I have read the descriptions of 1st and 2nd place solutions, but could not apply much of what they mentioned. However, I did benefit greatly from reading various discussions on Kaggle forums, especially regarding the dataset's properties (warnings about easy overfitting, presence of IP addresses in the training data, etc.).
* The Fasttext papers: [Bojanowski et al. (2016), Enriching Word Vectors with Subword Information](https://arxiv.org/abs/1607.04606); [Joulin et al. (2017), Bag of Tricks for Efficient Text Classification](https://arxiv.org/abs/1607.01759).
* The idea of forking the model, applying recurrent, convolutional and pooling layers and then concatenating the results is loosely based on a similar idea from deep image processing networks which is called "Inception blocks". I'll be talking about it during the presentation; the original image processing paper is [Szegedy, Liu et al. (2014), Going Deeper with Convolutions](https://arxiv.org/abs/1409.4842).
* Much of what I learnt about recurrent networks when doing the project was from [Goodfellow, Bengio & Courville (2016), Deep Learning](https://mitpress.mit.edu/books/deep-learning). [Andrew Ng's lectures on sequence models](https://www.youtube.com/playlist?list=PLBAGcD3siRDittPwQDGIIAWkjz-RucAc7) were also helpful in developing higher-level intuition of how the various layers behave.