# Improved LSTM baseline

This kernel is a somewhat improved version of [Keras - Bidirectional LSTM baseline](https://www.kaggle.com/CVxTz/keras-bidirectional-lstm-baseline-lb-0-051) along with some additional documentation of the steps. (NB: this notebook has been re-run on the new test set.)

In [1]:
import sys, os, re, csv, codecs, time, numpy as np, pandas as pd

from keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau
from kaggletoxicity.keras_utils import KaggleToxicityValMetric
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation
from keras.layers import Bidirectional, GlobalMaxPool1D
from keras.models import Model
from keras import initializers, regularizers, constraints, optimizers, layers
import constants as ct

Using TensorFlow backend.


We include the GloVe word vectors in our input files. To include these in your kernel, simply click 'input files' at the top of the notebook, and search for 'glove' in the 'datasets' section.

In [2]:
EMBEDDING_FILE = os.path.join(ct.DATA_TOOLS_FOLDER, 'glove.6B.50d.txt')
TRAIN_DATA_FILE = os.path.join(ct.DATA_FOLDER, 'train.csv')
TEST_DATA_FILE = os.path.join(ct.DATA_FOLDER, 'test.csv')

Set some basic config parameters:

In [3]:
embed_size = 50 # how big is each word vector
max_features = 20000 # how many unique words to use (i.e. number of rows in the embedding matrix)
maxlen = 500 # max number of words in a comment to use

Read in our data and replace missing values:

In [4]:
train = pd.read_csv(TRAIN_DATA_FILE)
test = pd.read_csv(TEST_DATA_FILE)

list_sentences_train = train["comment_text"].fillna("_na_").values
list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
y = train[list_classes].values
list_sentences_test = test["comment_text"].fillna("_na_").values

Standard Keras preprocessing, to turn each comment into a list of word indexes of equal length (with truncation or padding as needed).

In [5]:
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(list_sentences_train))
list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train)
list_tokenized_test = tokenizer.texts_to_sequences(list_sentences_test)
X_t = pad_sequences(list_tokenized_train, maxlen=maxlen)
X_te = pad_sequences(list_tokenized_test, maxlen=maxlen)
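
To make the padding step concrete, here is a minimal pure-Python sketch of what `pad_sequences` does with its default settings (zeros added at the front, extra tokens dropped from the front). `pad_pre` is a hypothetical helper written for illustration only; it is not part of Keras and assumes `maxlen` is positive.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad or left-truncate one list of word indexes to exactly maxlen."""
    seq = list(seq)[-maxlen:]                    # keep only the last maxlen tokens
    return [value] * (maxlen - len(seq)) + seq   # fill the front with the pad value

short = pad_pre([5, 9, 2], maxlen=5)
long_ = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short)  # [0, 0, 5, 9, 2]
print(long_)  # [3, 4, 5, 6, 7]
```

With `maxlen = 500`, most comments end up with a long run of leading zeros, and only unusually long comments lose their opening words.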

Read the GloVe word vectors (space-delimited strings) into a dictionary from word->vector.

In [6]:
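def get_coefs(word, *arr): return word, np.asarray(arr, dtype='float32')
# Read as UTF-8 explicitly; the with-block also closes the file handle
with open(EMBEDDING_FILE, encoding='utf8') as f:
    embeddings_index = dict(get_coefs(*o.strip().split()) for o in f)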
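
As a quick illustration of the file format, the sketch below parses two made-up lines laid out like the GloVe text file (a word followed by its vector components). The words, the values, and the 3-dimensional size are invented for the example; the real file used here has 50 components per line.

```python
import io
import numpy as np

def get_coefs(word, *arr):
    # First token is the word, the remaining tokens are the vector components
    return word, np.asarray(arr, dtype='float32')

# Two invented lines standing in for the real embedding file
fake_glove = io.StringIO("the 0.1 -0.2 0.3\ncat 0.4 0.5 -0.6\n")
toy_index = dict(get_coefs(*line.strip().split()) for line in fake_glove)

print(sorted(toy_index))       # ['cat', 'the']
print(toy_index['cat'].shape)  # (3,)
```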

Use these vectors to create our embedding matrix, with random initialization for words that aren't in GloVe. We draw the random values from a normal distribution with the same mean and standard deviation as the GloVe embeddings.

In [7]:
all_embs = np.stack(list(embeddings_index.values()))  # np.stack needs a sequence, not a dict view
emb_mean,emb_std = all_embs.mean(), all_embs.std()
emb_mean,emb_std

(0.020940492, 0.64410442)

In [8]:
word_index = tokenizer.word_index
# Tokenizer indexes start at 1 (0 is reserved for padding), hence len(word_index) + 1
nb_words = min(max_features, len(word_index) + 1)
embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
for word, i in word_index.items():
    if i >= nb_words: continue  # skip words that fall outside the matrix
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None: embedding_matrix[i] = embedding_vector
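
The same construction on a toy vocabulary may make the logic easier to follow. Everything here (`toy_word_index`, `toy_vectors`, the sizes, the fixed seed) is invented for illustration: rows for words with a pretrained vector are overwritten, and the row for the unknown word keeps its random initialization.

```python
import numpy as np

rng = np.random.RandomState(0)  # fixed seed so the sketch is repeatable

toy_word_index = {'the': 1, 'cat': 2, 'zzyzx': 3}   # word indexes start at 1
toy_vectors = {'the': np.array([0.1, 0.2], dtype='float32'),
               'cat': np.array([0.3, 0.4], dtype='float32')}
toy_max_features, toy_embed_size = 4, 2

# Random init with the mean and std of the known vectors
stacked = np.stack(list(toy_vectors.values()))
toy_matrix = rng.normal(stacked.mean(), stacked.std(),
                        (toy_max_features, toy_embed_size))
random_row = toy_matrix[3].copy()   # remember the init for the unknown word

for word, i in toy_word_index.items():
    if i >= toy_max_features:
        continue
    vec = toy_vectors.get(word)
    if vec is not None:
        toy_matrix[i] = vec         # overwrite with the pretrained vector

print(toy_matrix.shape)  # (4, 2)
```

Row 0 is never assigned because no word maps to index 0, which is the padding value.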

Simple bidirectional LSTM with two fully connected layers. We add some dropout to the LSTM since even 2 epochs is enough to overfit.

In [9]:
inp = Input(shape=(maxlen,))
x = Embedding(max_features, embed_size, weights=[embedding_matrix])(inp)
x = Bidirectional(LSTM(50, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))(x)
x = GlobalMaxPool1D()(x)
x = Dense(50, activation="relu")(x)
x = Dropout(0.1)(x)
x = Dense(6, activation="sigmoid")(x)
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy', optimizer='adam')
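
Because the LSTM is created with `return_sequences=True`, it emits one feature vector per timestep, and `GlobalMaxPool1D` then keeps the largest value of each feature across all timesteps. A minimal NumPy sketch of that pooling step, using a tiny made-up tensor rather than real LSTM output:

```python
import numpy as np

# Shape (batch, timesteps, features): one comment, three timesteps, two features
seq_out = np.array([[[1.0, -2.0],
                     [0.5,  4.0],
                     [3.0,  0.0]]])

pooled = seq_out.max(axis=1)   # maximum over the timestep axis
print(pooled)        # [[3. 4.]]
print(pooled.shape)  # (1, 2)
```

In the model above the same reduction turns a `(batch, 500, 100)` tensor into `(batch, 100)`, since the two LSTM directions of 50 units each are concatenated by default.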

Now we're ready to fit our model! Use `validation_split` when not submitting.

In [10]:
X_t.shape

(159571, 500)
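
Keras takes the `validation_split` fraction from the end of the arrays, before any shuffling. The arithmetic below is a rough sketch of how the split sizes come out for this dataset, assuming the split point is found by truncating `n * (1 - fraction)` to an integer:

```python
n_samples = 159571   # rows in X_t, from the shape printed above
val_prop = 0.05      # fraction held out for validation

split_at = int(n_samples * (1.0 - val_prop))   # first split_at rows are used for training
n_train, n_val = split_at, n_samples - split_at
print(n_train, n_val)  # 151592 7979
```

Since the held-out rows are simply the last ones in `train.csv`, this is a fixed split rather than a random one.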

In [11]:
batch_size = 1024
epochs = 50
val_prop = 0.05
es_patience = 5
rlr_patience = 2
rlr_cooldown = 4

file_path = os.path.join(ct.MODELS_FOLDER, "weights_base_best.hdf5")
extraval = KaggleToxicityValMetric()
early_stop = EarlyStopping(monitor='val_roc_auc', patience=es_patience, mode='max', verbose=0)
checkpoint = ModelCheckpoint(file_path, monitor='val_roc_auc', verbose=0, mode='max', save_best_only=True)
# mode='max' is needed here too: in 'auto' mode Keras minimises any metric without 'acc' in its name
reduce_lr = ReduceLROnPlateau(monitor='val_roc_auc',
                              mode='max',
                              factor=0.5,
                              patience=rlr_patience,
                              cooldown=rlr_cooldown,
                              min_lr=1e-4)

callbacks_list = [extraval, checkpoint, early_stop, reduce_lr]
model.fit(X_t, y, batch_size=batch_size, epochs=epochs, validation_split=val_prop, callbacks=callbacks_list)

Train on 151592 samples, validate on 7979 samples
Epoch 1/50
...
Epoch 50/50


<keras.callbacks.History at 0x7facb83723c8>
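
The `KaggleToxicityValMetric` callback is imported from a local `kaggletoxicity` package whose code is not shown in this notebook. Since the other callbacks monitor `val_roc_auc`, it presumably computes a ROC AUC on the validation data and records it under that name. One plausible definition is the mean of the per-label ROC AUCs; the sketch below is an assumption about that metric, not the callback's actual code, and `roc_auc` and `mean_columnwise_auc` are hypothetical helper names.

```python
import numpy as np

def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]            # every positive vs every negative
    correct = (diff > 0).sum() + 0.5 * (diff == 0).sum()   # ties count as half
    return correct / (len(pos) * len(neg))

def mean_columnwise_auc(Y_true, Y_score):
    """Average the AUC over label columns."""
    return float(np.mean([roc_auc(Y_true[:, j], Y_score[:, j])
                          for j in range(Y_true.shape[1])]))

# Toy data: four samples, two labels
Y_true = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])
Y_score = np.array([[0.9, 0.2], [0.1, 0.8], [0.4, 0.3], [0.6, 0.7]])
auc = mean_columnwise_auc(Y_true, Y_score)
print(auc)  # 0.875
```

The pairwise comparison uses memory proportional to positives times negatives, so it only suits small examples; on the real validation set, `sklearn.metrics.roc_auc_score` would be the practical choice.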

And finally, get predictions for the test set and prepare a submission CSV:

In [12]:
model.load_weights(file_path)
y_test = model.predict([X_te], batch_size=1024, verbose=1)



In [13]:
sample_submission = pd.read_csv(os.path.join(ct.DATA_FOLDER, 'sample_submission.csv'))

In [14]:
sample_submission[list_classes] = y_test

In [15]:
moment = time.strftime("%Y_%m_%d_%H_%M")
moment

'2018_02_17_22_49'

In [16]:
file_name = 'results_%s.csv' % moment
sample_submission.to_csv(os.path.join(ct.RESULTS_FOLDER, file_name), index=False)