# Improved LSTM baseline

This kernel is a somewhat improved version of [Keras - Bidirectional LSTM baseline](https://www.kaggle.com/CVxTz/keras-bidirectional-lstm-baseline-lb-0-051) along with some additional documentation of the steps. (NB: this notebook has been re-run on the new test set.)

In [1]:
import sys, os, re, csv, codecs, numpy as np, pandas as pd
from sklearn.metrics import f1_score, make_scorer, classification_report
from sklearn.model_selection import train_test_split

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation
from keras.layers import Bidirectional, GlobalMaxPool1D
from keras.models import Model
from keras import initializers, regularizers, constraints, optimizers, layers

Using TensorFlow backend.


In [2]:
df = pd.read_csv('../input/disaster-data/disaster_data.csv')

In [3]:
X = df['message']
Y = df.drop(['Unnamed: 0', 'id', 'message', 'genre', 'original', 'child_alone'], axis=1)

We include the GloVe word vectors in our input files. To include these in your kernel, simply click 'input files' at the top of the notebook and search for 'glove' in the 'datasets' section.

In [4]:
path = '../input/'
comp = 'jigsaw-toxic-comment-classification-challenge/'
EMBEDDING_FILE=f'{path}glove6b50d/glove.6B.50d.txt'
TRAIN_DATA_FILE=f'{path}{comp}train.csv'
TEST_DATA_FILE=f'{path}{comp}test.csv'

Set some basic config parameters:

In [5]:
embed_size = 50 # how big is each word vector
max_features = 20000 # how many unique words to use (i.e. number of rows in the embedding matrix)
maxlen = 200 # max number of words in a comment to use

Split the data into training and test sets (80/20, with a fixed random seed):

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

In [7]:
# train = pd.read_csv(TRAIN_DATA_FILE)
# test = pd.read_csv(TEST_DATA_FILE)

list_sentences_train = X_train.values
list_classes = Y.columns
y = y_train
list_sentences_test = X_test.values

Standard Keras preprocessing, to turn each message into a list of word indexes of equal length (with truncation or padding as needed).

In [8]:
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(list_sentences_train))
list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train)
list_tokenized_test = tokenizer.texts_to_sequences(list_sentences_test)
X_t = pad_sequences(list_tokenized_train, maxlen=maxlen)
X_te = pad_sequences(list_tokenized_test, maxlen=maxlen)
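
To make "truncation or padding" concrete: with its default arguments, `pad_sequences` pads with zeros at the start of a short sequence and drops tokens from the start of a long one. Below is a minimal plain-Python sketch of that behaviour. `pad_pre` is a hypothetical helper written for illustration, not part of Keras, and it assumes `maxlen` is positive.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items."""
    seq = list(seq)[-maxlen:]                      # truncate from the front
    return [value] * (maxlen - len(seq)) + seq     # pad at the front

short = pad_pre([5, 8, 2], maxlen=5)
long_ = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short)  # [0, 0, 5, 8, 2]
print(long_)  # [3, 4, 5, 6, 7]
```

Index 0 is never assigned to a word by the `Tokenizer`, which is why it is safe to use as the padding value.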

Read the GloVe word vectors (space-delimited strings) into a dictionary mapping word -> vector.

In [9]:
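def get_coefs(word, *arr): return word, np.asarray(arr, dtype='float32')
# Open the file in a context manager so the handle is closed after parsing
with open(EMBEDDING_FILE, encoding='utf8') as f:
    embeddings_index = dict(get_coefs(*o.strip().split()) for o in f)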
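
As a rough illustration of the file layout being parsed, each line holds a token followed by its vector components, separated by spaces. The sketch below parses two made-up lines from an in-memory buffer instead of the real GloVe file, and uses plain Python floats rather than NumPy arrays to stay dependency-free. `parse_line` and `toy_index` are hypothetical names.

```python
import io

# Two made-up lines in the GloVe text layout: a token, then its components.
fake_glove = io.StringIO("the 0.1 -0.2 0.3\ncat 0.5 0.0 -1.5\n")

def parse_line(line):
    word, *values = line.strip().split()
    return word, [float(v) for v in values]

toy_index = dict(parse_line(line) for line in fake_glove)
print(toy_index["cat"])  # [0.5, 0.0, -1.5]
```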

Use these vectors to create our embedding matrix, with random initialization for words that aren't in GloVe. The random init is drawn from a normal distribution with the same mean and standard deviation as the GloVe embeddings.

In [10]:
all_embs = np.stack(list(embeddings_index.values()))  # np.stack needs a sequence, not a dict view
emb_mean,emb_std = all_embs.mean(), all_embs.std()
emb_mean,emb_std

(0.020940498, 0.6441043)

In [11]:
word_index = tokenizer.word_index
nb_words = min(max_features, len(word_index) + 1)  # +1 because Tokenizer indices start at 1
embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
for word, i in word_index.items():
    if i >= nb_words: continue  # skip words outside the embedding matrix
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None: embedding_matrix[i] = embedding_vector
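
The fill-in logic can be seen on a toy vocabulary. This is a simplified sketch with made-up vectors and statistics, using Python lists in place of NumPy arrays. `toy_vectors`, `toy_word_index` and `matrix` are hypothetical names. Rows for words found in the lookup are overwritten, and the rest keep their random values.

```python
import random

rng = random.Random(0)
toy_vectors = {"water": [0.1, 0.2], "food": [0.3, 0.4]}    # stand-in for GloVe
toy_word_index = {"water": 1, "food": 2, "zzzunknown": 3}  # Tokenizer-style, 1-based
mean, std, dim = 0.0, 0.5, 2                               # made-up statistics

n_rows = len(toy_word_index) + 1   # row 0 is never assigned to a word
matrix = [[rng.gauss(mean, std) for _ in range(dim)] for _ in range(n_rows)]
for word, i in toy_word_index.items():
    vec = toy_vectors.get(word)
    if vec is not None:
        matrix[i] = list(vec)      # overwrite the random row with the known vector

print(matrix[1], matrix[2])  # [0.1, 0.2] [0.3, 0.4]
```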

Simple bidirectional LSTM with two fully connected layers. We add some dropout to the LSTM since even 2 epochs is enough to overfit.

In [None]:
inp = Input(shape=(maxlen,))
x = Embedding(max_features, embed_size, weights=[embedding_matrix])(inp)
x = Bidirectional(LSTM(100, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))(x)
x = GlobalMaxPool1D()(x)
x = Dense(64, activation="relu")(x)
x = Dropout(0.2)(x)
x = Dense(64, activation="relu")(x)
x = Dropout(0.2)(x)
x = Dense(len(Y.columns), activation="sigmoid")(x)
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
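
The output layer uses a sigmoid per label rather than a softmax because a message can carry several labels at once. Binary cross-entropy then scores each label independently. A minimal plain-Python sketch of that loss for a single example follows. The label count and probabilities are made up, and `binary_crossentropy` here is a hypothetical helper, not the Keras function.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of the per-label log losses for one example."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)      # avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

# One message tagged with labels 0 and 2 out of three hypothetical labels.
y_true = [1, 0, 1]
good = binary_crossentropy(y_true, [0.9, 0.1, 0.8])
bad = binary_crossentropy(y_true, [0.2, 0.7, 0.3])
print(good < bad)  # True
```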

Now we're ready to fit our model! Use `validation_split` when not submitting.

In [None]:
hist = model.fit(X_t, y, batch_size=32, epochs=5, validation_split=0.2)

Next, swap the max-pooling layer for an attention layer (adapted from the kernel linked in the code below), retrain, and evaluate on the held-out test set:

In [13]:
from keras.models import Sequential,Model
from keras.layers import CuDNNLSTM, Dense, Bidirectional, Input,Dropout

from keras import backend as K
from keras.engine.topology import Layer
from keras import initializers, regularizers, constraints

In [14]:
# https://www.kaggle.com/qqgeogor/keras-lstm-attention-glove840b-lb-0-043
class Attention(Layer):
    def __init__(self, step_dim,
                 W_regularizer=None, b_regularizer=None,
                 W_constraint=None, b_constraint=None,
                 bias=True, **kwargs):
        self.supports_masking = True
        self.init = initializers.get('glorot_uniform')

        self.W_regularizer = regularizers.get(W_regularizer)
        self.b_regularizer = regularizers.get(b_regularizer)

        self.W_constraint = constraints.get(W_constraint)
        self.b_constraint = constraints.get(b_constraint)

        self.bias = bias
        self.step_dim = step_dim
        self.features_dim = 0
        super(Attention, self).__init__(**kwargs)

    def build(self, input_shape):
        assert len(input_shape) == 3

        self.W = self.add_weight(shape=(input_shape[-1],),
                                 initializer=self.init,
                                 name='{}_W'.format(self.name),
                                 regularizer=self.W_regularizer,
                                 constraint=self.W_constraint)
        self.features_dim = input_shape[-1]

        if self.bias:
            self.b = self.add_weight(shape=(input_shape[1],),
                                     initializer='zero',
                                     name='{}_b'.format(self.name),
                                     regularizer=self.b_regularizer,
                                     constraint=self.b_constraint)
        else:
            self.b = None

        self.built = True

    def compute_mask(self, input, input_mask=None):
        return None

    def call(self, x, mask=None):
        features_dim = self.features_dim
        step_dim = self.step_dim

        eij = K.reshape(K.dot(K.reshape(x, (-1, features_dim)),
                        K.reshape(self.W, (features_dim, 1))), (-1, step_dim))

        if self.bias:
            eij += self.b

        eij = K.tanh(eij)

        a = K.exp(eij)

        if mask is not None:
            a *= K.cast(mask, K.floatx())

        a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())

        a = K.expand_dims(a)
        weighted_input = x * a
        return K.sum(weighted_input, axis=1)

    def compute_output_shape(self, input_shape):
        return input_shape[0],  self.features_dim
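
The arithmetic inside `call` can be traced on a single small sequence. The sketch below is a simplified plain-Python version under stated assumptions: one example instead of a batch, no mask, and no `K.epsilon()` in the denominator. `attention_pool` and the input values are made up for illustration.

```python
import math

def attention_pool(x, w, b):
    """x: list of timesteps, each a feature list; w: per-feature weights; b: per-step bias."""
    scores = [math.tanh(sum(f * wi for f, wi in zip(step, w)) + bi)
              for step, bi in zip(x, b)]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]              # sums to 1 over timesteps
    n_features = len(x[0])
    pooled = [sum(a * step[j] for a, step in zip(weights, x))
              for j in range(n_features)]
    return weights, pooled

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 timesteps, 2 features (made-up)
weights, pooled = attention_pool(x, w=[0.5, -0.5], b=[0.0, 0.0, 0.0])
print(round(sum(weights), 6), len(pooled))  # 1.0 2
```

The output has one value per feature, so the layer collapses the time axis much like `GlobalMaxPool1D` does, but with a learned weighted average instead of a maximum.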

In [16]:
inp = Input(shape=(maxlen,))
x = Embedding(max_features, embed_size, weights=[embedding_matrix])(inp)
x = Bidirectional(LSTM(128, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))(x)
#x = Bidirectional(LSTM(64, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))(x)
x = Attention(maxlen)(x)
# x = GlobalMaxPool1D()(x)
x = Dense(64, activation="relu")(x)
x = Dropout(0.2)(x)
# x = Dense(64, activation="relu")(x)
# x = Dropout(0.2)(x)
x = Dense(len(Y.columns), activation="sigmoid")(x)
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [17]:
hist = model.fit(X_t, y, batch_size=32, epochs=5, validation_split=0.2)

Train on 16777 samples, validate on 4195 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [23]:
y_preds = model.predict([X_te], batch_size=1024, verbose=1)
preds = (y_preds > 0.5).astype(int)  # np.int was removed in recent NumPy releases
print(classification_report(y_test.values, preds, 
                            target_names = y_test.columns))

                        precision    recall  f1-score   support

               related       0.89      0.90      0.89      3978
               request       0.85      0.57      0.68       895
                 offer       0.00      0.00      0.00        26
           aid_related       0.79      0.66      0.72      2131
          medical_help       0.66      0.05      0.08       422
      medical_products       0.89      0.03      0.06       270
     search_and_rescue       0.00      0.00      0.00       127
              security       0.00      0.00      0.00        88
              military       1.00      0.01      0.01       155
                 water       0.80      0.18      0.29       339
                  food       0.76      0.43      0.55       595
               shelter       0.86      0.05      0.10       470
              clothing       0.00      0.00      0.00        73
                 money       0.00      0.00      0.00       104
        missing_people       0.00      


# Build Naive Bayes - SVM models

Paper: Baselines and Bigrams: Simple, Good Sentiment and Topic Classification, Sida Wang and Christopher D. Manning (https://nlp.stanford.edu/pubs/sidaw12_simple_sentiment.pdf)

Reference: AlexSánchez's comment on https://www.kaggle.com/jhoward/nb-svm-strong-linear-baseline

In [19]:
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.validation import check_X_y, check_is_fitted
from sklearn.linear_model import LogisticRegression
from scipy import sparse
import numpy as np

class NbSvmClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, C=1.0, dual=False, n_jobs=1):
        self.C = C
        self.dual = dual
        self.n_jobs = n_jobs

    def predict(self, x):
        # Verify that model has been fit
        check_is_fitted(self, ['_r', '_clf'])
        return self._clf.predict(x.multiply(self._r))

    def predict_proba(self, x):
        # Verify that model has been fit
        check_is_fitted(self, ['_r', '_clf'])
        return self._clf.predict_proba(x.multiply(self._r))

    def fit(self, x, y):
        # Check that X and y have correct shape
        #y = y.values
        x, y = check_X_y(x, y, accept_sparse=True)

        def pr(x, y_i, y):
            p = x[y==y_i].sum(0)
            return (p+1) / ((y==y_i).sum()+1)

        self._r = sparse.csr_matrix(np.log(pr(x,1,y) / pr(x,0,y)))
        x_nb = x.multiply(self._r)
        self._clf = LogisticRegression(C=self.C, dual=self.dual, n_jobs=self.n_jobs).fit(x_nb, y)
        return self
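
The key quantity in `fit` is the log-count ratio `_r`: for each feature, the log of its smoothed rate in the positive class over its smoothed rate in the negative class. The features are scaled by this ratio before the logistic regression is fit. Here is a small sketch on a made-up dense count matrix, in plain Python rather than sparse matrices. `smoothed_rate` is a hypothetical stand-in for the inner `pr` function.

```python
import math

# Toy term-count matrix: 4 documents x 3 terms, with binary labels (made-up data).
X_toy = [[2, 0, 1],
         [1, 0, 0],
         [0, 3, 1],
         [0, 1, 0]]
y_toy = [1, 1, 0, 0]

def smoothed_rate(X, y, label):
    rows = [row for row, yi in zip(X, y) if yi == label]
    n_terms = len(X[0])
    # (feature total in class + 1) / (number of class documents + 1), as in `pr`
    return [(sum(r[j] for r in rows) + 1) / (len(rows) + 1) for j in range(n_terms)]

p, q = smoothed_rate(X_toy, y_toy, 1), smoothed_rate(X_toy, y_toy, 0)
r = [math.log(pj / qj) for pj, qj in zip(p, q)]
print([round(v, 3) for v in r])  # [1.386, -1.609, 0.0]
```

Terms that occur mostly in positive documents get a positive ratio, terms typical of negative documents get a negative one, and terms that are equally common in both end up near zero.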

In [27]:
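# Text Processing
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Data loading and model persistence (used by load_data, build_model, save_model)
import pickle
from sqlalchemy import create_engine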

# Modeling
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import f1_score, make_scorer, classification_report


def load_data(database_filepath):
    '''
    Load data from sqlite database, quick cleaning, and
    return the features and labels of the models.

    INPUTS:
        database_filepath: path to the sqlite database

    OUTPUTS:
        X, Y: features and labels of the model
        Y.columns: categories of the label
    '''

    engine = create_engine('sqlite:///' + database_filepath)
    df = pd.read_sql_table('DisasterResponse', engine)

    # 'related' column has values of 0,1,2 which doesn't make sense for binary classification
    df['related'] = df['related'].replace(2, 1)

    # 'child_alone' column has only value of 0. So our model will always predict 0.
    df = df.drop('child_alone', axis=1)

    X = df['message']
    Y = df.drop(['id', 'message', 'genre', 'original'], axis=1)

    return X, Y, Y.columns


def tokenize(text):
    '''
    Preprocess text features by tokenization and lemmatization.

    INPUTS:
        text: string of text need to be processed

    OUTPUTS:
        tokens: a list of tokens from the text
    '''

    # normalize case and remove punctuation
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())

    # tokenize text
    tokens = word_tokenize(text)

    # lemmatize and remove stop words
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words('english'))  # build once, not per token
    tokens = [lemmatizer.lemmatize(word) for word in tokens
              if word not in stop_words]

    return tokens


def scorer(y_test, y_pred):
    '''
    Create an evaluation metric for the grid search:
    the support-weighted average F1 score across all labels.
    '''

    report = classification_report(y_test, y_pred, output_dict=True)
    weighted_avg = report['weighted avg']
    return weighted_avg['f1-score']
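
For intuition, the `'weighted avg'` F1 is the per-label F1 averaged with weights proportional to each label's support (its number of true instances). A plain-Python sketch with made-up confusion counts follows. The label names are borrowed from the dataset only for readability, and `f1`, `counts` and `weighted` are hypothetical names.

```python
def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# (true positives, false positives, false negatives) per label: made-up counts.
counts = {"related": (90, 10, 10), "offer": (0, 0, 5)}

support = {k: tp + fn for k, (tp, fp, fn) in counts.items()}   # true instances per label
per_label = {k: f1(*v) for k, v in counts.items()}
weighted = sum(per_label[k] * support[k] for k in counts) / sum(support.values())
print(round(per_label["related"], 2), per_label["offer"], round(weighted, 3))  # 0.9 0.0 0.857
```

A rare label that is never predicted scores an F1 of 0 but carries little weight, so this metric is dominated by the frequent labels.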


def build_model(pretrained_model=None):
    '''
    Build an ML pipeline that includes text processing and a multi-output classifier.
    If pretrained_model is not None, load the pretrained model from that path.
    Otherwise the returned model is a grid search, which takes longer to train.
    '''

    if pretrained_model is not None:
        with open(pretrained_model, 'rb') as f:
            clf = pickle.load(f)
    else:
        pipeline = Pipeline([
            ('countvec', CountVectorizer(tokenizer=tokenize)),
            ('tfidf', TfidfTransformer()),
            ('clf', MultiOutputClassifier(NbSvmClassifier()))
        ])

        parameters = {
            'clf__estimator__C': [1.0, 5.0, 10.0],
            'countvec__ngram_range': [(1, 1), (1, 2)]
        }

        f1_scorer = make_scorer(scorer)

        # optimize model
        clf = GridSearchCV(pipeline, parameters, scoring=f1_scorer,
                           cv=3, verbose=10)

    return clf


def evaluate_model(model, X_test, Y_test, category_names, pretrained_model=None):
    '''
    Evaluate the classifier by classification report (sklearn).
    If the pretrained_model is None, we trained the grid search, 
    so best parameters will be reported.
    '''
    if pretrained_model is None:
        print("Best parameters: ", model.best_params_)

    Y_pred = model.predict(X_test)
    report = classification_report(Y_test, Y_pred, target_names=category_names, 
                                   output_dict=True)

    print("Validation Results: ")
    print(classification_report(Y_test, Y_pred, target_names=category_names))

    return (Y_pred, report)


def save_model(model, model_filepath):
    '''
    Save the model to a path specified by model_filepath.
    '''
    with open(model_filepath, 'wb') as f:
        pickle.dump(model, f)


In [29]:
print('Building model...')
model2 = build_model()

print('Training model...')
model2.fit(X_train, y_train)

print('Evaluating model...')
preds2, report = evaluate_model(model2, X_test, y_test, y_test.columns)

# print('Saving model...\n    MODEL: {}'.format(model_filepath))
# save_model(model, model_filepath)


Building model...
Training model...
Fitting 3 folds for each of 6 candidates, totalling 18 fits
[CV] clf__estimator__C=1.0, countvec__ngram_range=(1, 1) .............
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[CV]  clf__estimator__C=1.0, countvec__ngram_range=(1, 1), score=0.6174687613469391, total= 1.1min
[CV] clf__estimator__C=1.0, countvec__ngram_range=(1, 1) .............
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  1.8min remaining:    0.0s
[CV]  clf__estimator__C=1.0, countvec__ngram_range=(1, 1), score=0.6257062108818834, total= 1.1min
[CV] clf__estimator__C=1.0, countvec__ngram_range=(1, 1) .............
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  3.7min remaining:    0.0s
[CV]  clf__estimator__C=1.0, countvec__ngram_range=(1, 1), score=0.6289740579661346, total= 1.1min
[CV] clf__estimator__C=1.0, countvec__ngram_range=(1, 2) .............
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  5.5min remaining:    0.0s
[CV]  clf__estimator__C=1.0, countvec__ngram_range=(1, 2), score=0.6241910613339897, total= 1.3min
[CV] clf__estimator__C=1.0, countvec__ngram_range=(1, 2) .............
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:  7.5min remaining:    0.0s
[CV]  clf__estimator__C=1.0, countvec__ngram_range=(1, 2), score=0.6307677048001096, total= 1.3min
[CV] clf__estimator__C=1.0, countvec__ngram_range=(1, 2) .............
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:  9.6min remaining:    0.0s
[CV]  clf__estimator__C=1.0, countvec__ngram_range=(1, 2), score=0.6329363273998921, total= 1.3min
[CV] clf__estimator__C=5.0, countvec__ngram_range=(1, 1) .............
[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed: 11.6min remaining:    0.0s
[CV]  clf__estimator__C=5.0, countvec__ngram_range=(1, 1), score=0.6284620805508135, total= 1.1min
[CV] clf__estimator__C=5.0, countvec__ngram_range=(1, 1) .............
[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed: 13.4min remaining:    0.0s
[CV]  clf__estimator__C=5.0, countvec__ngram_range=(1, 1), score=0.6379854020161618, total= 1.2min
[CV] clf__estimator__C=5.0, countvec__ngram_range=(1, 1) .............
[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed: 15.3min remaining:    0.0s
[CV]  clf__estimator__C=5.0, countvec__ngram_range=(1, 1), score=0.6371521421092954, total= 1.1min
[CV] clf__estimator__C=5.0, countvec__ngram_range=(1, 2) .............
[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed: 17.0min remaining:    0.0s
[CV]  clf__estimator__C=5.0, countvec__ngram_range=(1, 2), score=0.6414443281483035, total= 1.3min
[CV] clf__estimator__C=5.0, countvec__ngram_range=(1, 2) .............
[CV]  clf__estimator__C=5.0, countvec__ngram_range=(1, 2), score=0.6469481155431975, total= 1.4min
[CV] clf__estimator__C=5.0, countvec__ngram_range=(1, 2) .............
[CV]  clf__estimator__C=5.0, countvec__ngram_range=(1, 2), score=0.6507796391990328, total= 1.4min
[CV] clf__estimator__C=10.0, countvec__ngram_range=(1, 1) ............
[CV]  clf__estimator__C=10.0, countvec__ngram_range=(1, 1), score=0.6270712033821747, total= 1.1min
[CV] clf__estimator__C=10.0, countvec__ngram_range=(1, 1) ............
[CV]  clf__estimator__C=10.0, countvec__ngram_range=(1, 1), score=0.6349160837109199, total= 1.1min
[CV] clf__estimator__C=10.0, countvec__ngram_range=(1, 1) ............
[CV]  clf__estimator__C=10.0, countvec__ngram_range=(1, 1), score=0.6339367125638458, total= 1.1min
[CV] clf__estimator__C=10.0, countvec__ngram_range=(1, 2) ............
[CV]  clf__estimator__C=10.0, countvec__ngram_range=(1, 2), score=0.6409498568105442, total= 1.4min
[CV] clf__estimator__C=10.0, countvec__ngram_range=(1, 2) ............
[CV]  clf__estimator__C=10.0, countvec__ngram_range=(1, 2), score=0.648802592557683, total= 1.4min
[CV] clf__estimator__C=10.0, countvec__ngram_range=(1, 2) ............
[CV]  clf__estimator__C=10.0, countvec__ngram_range=(1, 2), score=0.6508290228506336, total= 1.4min
[Parallel(n_jobs=1)]: Done  18 out of  18 | elapsed: 35.1min finished


Evaluating model...
Best parameters:  {'clf__estimator__C': 10.0, 'countvec__ngram_range': (1, 2)}
Validation Results: 
                        precision    recall  f1-score   support

               related       0.83      0.95      0.88      3978
               request       0.80      0.58      0.67       895
                 offer       0.00      0.00      0.00        26
           aid_related       0.74      0.72      0.73      2131
          medical_help       0.61      0.32      0.42       422
      medical_products       0.70      0.32      0.44       270
     search_and_rescue       0.71      0.17      0.28       127
              security       0.40      0.02      0.04        88
              military       0.54      0.34      0.42       155
                 water       0.73      0.63      0.68       339
                  food       0.83      0.72      0.77       595
               shelter       0.77      0.59      0.67       470
              clothing       0.78      0.44    


# Simple Ensemble

In [30]:
ens_preds = (preds + preds2) / 2
ens_preds = (ens_preds >= 0.5).astype(int)  # np.int was removed in recent NumPy releases
print(classification_report(y_test, ens_preds, target_names=y_test.columns))

                        precision    recall  f1-score   support

               related       0.82      0.97      0.89      3978
               request       0.78      0.65      0.71       895
                 offer       0.00      0.00      0.00        26
           aid_related       0.73      0.77      0.75      2131
          medical_help       0.60      0.33      0.42       422
      medical_products       0.70      0.33      0.45       270
     search_and_rescue       0.71      0.17      0.28       127
              security       0.40      0.02      0.04        88
              military       0.54      0.34      0.42       155
                 water       0.70      0.64      0.67       339
                  food       0.76      0.75      0.75       595
               shelter       0.77      0.60      0.67       470
              clothing       0.78      0.44      0.56        73
                 money       0.51      0.24      0.33       104
        missing_people       1.00      
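
One thing worth noting about this ensemble: `preds` and `preds2` are both already hard 0/1 predictions, so averaging them and thresholding at `>= 0.5` marks a label positive whenever either model does. That is consistent with recall rising for several labels in the report above. A small plain-Python check of the equivalence on made-up prediction matrices (`a`, `b`, `averaged` and `either` are hypothetical names):

```python
a = [[1, 0, 0], [0, 0, 1]]   # hard 0/1 predictions from model A (made-up)
b = [[1, 1, 0], [0, 0, 0]]   # hard 0/1 predictions from model B (made-up)

averaged = [[int((x + y) / 2 >= 0.5) for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]
either = [[int(x or y) for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]
print(averaged == either)  # True
```

Averaging the predicted probabilities of the two models before thresholding would be a different ensemble that keeps each model's confidence; that variant is not what the cell above computes.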

