# Recurrent neural networks

## Introduction

The purpose of this notebook is to test several types of recurrent neural networks on the Author identification challenge.

### Loading the dataset

Let's start by loading the dataset and displaying some sentences:

In [1]:
import pandas as pd

from IPython.display import display, HTML

raw_train = pd.read_csv("train.csv")
raw_test = pd.read_csv("test.csv")

print("Train set:")
display(raw_train.head())

print("Test set:")
display(raw_test.head())

Train set:


| | id | text | author |
|---|---|---|---|
| 0 | id26305 | This process, however, afforded me no means of... | EAP |
| 1 | id17569 | It never once occurred to me that the fumbling... | HPL |
| 2 | id11008 | In his left hand was a gold snuff box, from wh... | EAP |
| 3 | id27763 | How lovely is spring As we looked from Windsor... | MWS |
| 4 | id12958 | Finding nothing else, not even gold, the Super... | HPL |


Test set:


| | id | text |
|---|---|---|
| 0 | id02310 | Still, as I urged our leaving Ireland with suc... |
| 1 | id24541 | If a fire wanted fanning, it could readily be ... |
| 2 | id00134 | And when they had broken down the frail door t... |
| 3 | id27757 | While I was thinking how I should possibly man... |
| 4 | id04081 | I am not sure to what limit his knowledge may ... |


We can extract the labels from the training dataframe, and convert the remaining column into an array of strings.

In [2]:
X_train, y_train = raw_train['text'].values, raw_train['author'].values
X_test = raw_test['text'].values

print("Some raw sentences from train:")
print(X_train[:2])

Some raw sentences from train:
[ 'This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall.'
 'It never once occurred to me that the fumbling might be a mere mistake.']


We now use scikit-learn's `CountVectorizer` to produce a bag-of-words encoding of our sentences: each sentence is mapped to a vector with one entry per vocabulary word, holding the number of times that word occurs in the sentence (zero if it does not occur). Tokenization is performed automatically, and `fit_transform` returns a sparse matrix.
We then apply a TF-IDF transformation.

In [3]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

tf_transformer = TfidfTransformer(use_idf=True).fit(X_train)
X_train_tfidf = tf_transformer.transform(X_train)
X_test_tfidf = tf_transformer.transform(X_test)
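
To make the two steps concrete, here is a minimal pure-Python sketch of what the count and TF-IDF stages compute on a toy corpus. It assumes scikit-learn's default settings (smoothed IDF, L2 row normalization) and uses a plain whitespace split, which is simpler than `CountVectorizer`'s real tokenizer; the corpus and all variable names are illustrative.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, not taken from the dataset).
docs = ["the cat sat", "the cat sat on the mat", "dogs bark"]

# Simplified tokenization: split on whitespace only.
tokenized = [doc.split() for doc in docs]
vocab = sorted({word for doc in tokenized for word in doc})

# Step 1: raw term counts, one row per document.
counts = [[Counter(doc)[word] for word in vocab] for doc in tokenized]

# Step 2: smoothed inverse document frequency,
# idf = ln((1 + n_docs) / (1 + document_frequency)) + 1.
n_docs = len(docs)
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

# Step 3: weight the counts by idf, then scale each row to unit L2 norm.
def tfidf_row(row):
    weighted = [count * weight for count, weight in zip(row, idf)]
    norm = math.sqrt(sum(x * x for x in weighted))
    return [x / norm for x in weighted]

tfidf = [tfidf_row(row) for row in counts]

print(vocab)
print(counts[1])
print([round(x, 3) for x in tfidf[1]])
```

Words that appear in many documents (here "the") receive a lower IDF weight than words confined to a single document (here "mat"), which is the point of the transformation.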

We define a simple one-hot encoder for our labels, and split the inputs and labels into a training and a validation set.

In [4]:
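import numpy as np
from sklearn.model_selection import train_test_split

def simple_encoder(y):
    # Map each author label to a one-hot vector; return an array so np.argmax works on it.
    keys = {'EAP': [1, 0, 0], 'HPL': [0, 1, 0], 'MWS': [0, 0, 1]}
    return np.array([keys[label] for label in y])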

X_learn_tfidf, X_val_tfidf, y_learn, y_val = train_test_split(X_train_tfidf, y_train, random_state=0)
y_learn, y_val = simple_encoder(y_learn), simple_encoder(y_val)
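
The one-hot form is what the neural networks consume, while scikit-learn and xgboost expect class indices, which is why later cells call `np.argmax(..., axis=1)` on the labels. A minimal pure-Python sketch of that round trip follows; the helper names `one_hot` and `decode` are illustrative and not part of the notebook, and the class order is assumed to match the `keys` mapping above.

```python
# Class order assumed to match the one-hot mapping used in the notebook.
CLASSES = ['EAP', 'HPL', 'MWS']

def one_hot(label):
    """Return a list with a single 1 at the position of `label`."""
    return [1 if label == cls else 0 for cls in CLASSES]

def decode(vector):
    """Invert one_hot: take the index of the largest entry (an argmax)."""
    index = max(range(len(vector)), key=vector.__getitem__)
    return CLASSES[index]

labels = ['EAP', 'MWS', 'HPL']
encoded = [one_hot(label) for label in labels]
decoded = [decode(vector) for vector in encoded]

print(encoded)
print(decoded)
```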

We define a naive Bayes classifier to use as a benchmark for our more complex models.

In [5]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import log_loss
import numpy as np

gnb = MultinomialNB()
gnb.fit(X_learn_tfidf, np.argmax(y_learn, axis=1))
prediction = gnb.predict_proba(X_val_tfidf)

And we assess the accuracy and the multiclass logloss of our predictions.

In [6]:
def assess_performance(predicted_proba):
    accuracy = accuracy_score(np.argmax(y_val, axis=1), np.argmax(predicted_proba, axis=1))
    multiclass_log_loss = log_loss(np.argmax(y_val, axis=1), predicted_proba)
    print("Accuracy of {} on the validation set".format(accuracy))
    print("Multiclass log loss of {} on the validation set".format(multiclass_log_loss))
    
assess_performance(prediction)

Accuracy of 0.809601634321 on the validation set
Multiclass log loss of 0.613639775439 on the validation set


Not too bad!
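
To put these two metrics in context, the sketch below computes them by hand on made-up predictions. The multiclass log loss is the mean negative log-probability assigned to the true class, so a classifier that always predicts 1/3 for each of the three authors scores ln 3 ≈ 1.0986. The helper names and the toy data are illustrative, and unlike `sklearn.metrics.log_loss` this version does not clip probabilities away from zero.

```python
import math

def multiclass_log_loss(true_classes, predicted_proba):
    """Mean negative log-probability assigned to the true class of each sample."""
    total = 0.0
    for true_class, proba in zip(true_classes, predicted_proba):
        total -= math.log(proba[true_class])
    return total / len(true_classes)

def accuracy(true_classes, predicted_proba):
    """Fraction of samples whose most probable class is the true class."""
    hits = 0
    for true_class, proba in zip(true_classes, predicted_proba):
        predicted = max(range(len(proba)), key=proba.__getitem__)
        hits += predicted == true_class
    return hits / len(true_classes)

# Hypothetical class indices and predicted probabilities for four samples.
y_true = [0, 1, 2, 0]
proba = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.2, 0.5, 0.3],  # wrong: most of the mass is on class 1
]
uniform = [[1 / 3, 1 / 3, 1 / 3] for _ in y_true]

toy_accuracy = accuracy(y_true, proba)
toy_loss = multiclass_log_loss(y_true, proba)
uniform_loss = multiclass_log_loss(y_true, uniform)

print(toy_accuracy, toy_loss, uniform_loss)
```

Any log loss below ln 3 therefore indicates that the model is doing better than an uninformed guess.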

In [7]:
X_learn_tfidf, X_val_tfidf = X_learn_tfidf.toarray(), X_val_tfidf.toarray()

print(X_learn_tfidf[:2])

[[ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]]


In [None]:
import xgboost as xgb

xg_train = xgb.DMatrix(X_learn_tfidf, label=np.argmax(y_learn, axis=1))
xg_test = xgb.DMatrix(X_val_tfidf, label=np.argmax(y_val, axis=1))
# setup parameters for xgboost
param = {}
# multi-class objective that outputs one probability per class
param['objective'] = 'multi:softprob'
param['eval_metric'] = 'mlogloss'
# learning rate (step size shrinkage)
param['eta'] = 0.2
param['max_depth'] = 12
param['nthread'] = 8
param['num_class'] = 3

watchlist = [(xg_train, 'train'), (xg_test, 'test')]
num_round = 30

bst = xgb.train(param, xg_train, num_round, watchlist)
# Depending on the xgboost version, predictions come back flat or already 2D;
# reshaping to (n_samples, n_classes) handles both cases.
pred_prob = bst.predict(xg_test).reshape(len(y_val), 3)

[0]	train-mlogloss:1.02882	test-mlogloss:1.04816
[1]	train-mlogloss:0.973694	test-mlogloss:1.01029
[2]	train-mlogloss:0.930061	test-mlogloss:0.980565
[3]	train-mlogloss:0.893394	test-mlogloss:0.956907
[4]	train-mlogloss:0.859228	test-mlogloss:0.937623
[5]	train-mlogloss:0.830139	test-mlogloss:0.920776
[6]	train-mlogloss:0.804961	test-mlogloss:0.906962
[7]	train-mlogloss:0.782397	test-mlogloss:0.894857
[8]	train-mlogloss:0.762067	test-mlogloss:0.883999
[9]	train-mlogloss:0.744574	test-mlogloss:0.873654
[10]	train-mlogloss:0.725952	test-mlogloss:0.864393
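
The `multi:softprob` objective yields one probability per class for each row, obtained by applying a softmax to the per-class scores of the boosted trees. The sketch below illustrates that final step with made-up scores; it is not xgboost's internal code, and the `softmax` helper is written here for illustration only.

```python
import math

def softmax(scores):
    """Turn arbitrary real-valued scores into probabilities that sum to 1."""
    highest = max(scores)
    # Subtracting the maximum avoids overflow without changing the result.
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]

# Hypothetical per-class scores for a single sentence (EAP, HPL, MWS).
scores = [1.0, 2.0, 0.5]
probs = softmax(scores)

# With identical scores every class gets the same probability.
flat_probs = softmax([0.0, 0.0, 0.0])

print(probs)
print(flat_probs)
```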


In [62]:
assess_performance(pred_prob)

Accuracy of 0.58835546476 on the validation set
Multiclass log loss of 0.978218757546 on the validation set


In [8]:
import keras

from keras.models import Sequential
from keras.layers import Dense, Activation

m1 = Sequential()
# Hidden layers use ReLU; softmax is reserved for the output layer.
m1.add(Dense(20000, input_shape=(X_learn_tfidf.shape[1],), activation='relu'))
m1.add(Dense(1000, activation='relu'))
m1.add(Dense(3, activation='softmax'))

m1.summary()
m1.compile(optimizer='rmsprop',
           loss='categorical_crossentropy',
           metrics=['accuracy'])

Using TensorFlow backend.


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 20000)             501380000 
_________________________________________________________________
dense_2 (Dense)              (None, 1000)              20001000  
_________________________________________________________________
dense_3 (Dense)              (None, 3)                 3003      
Total params: 521,384,003
Trainable params: 521,384,003
Non-trainable params: 0
_________________________________________________________________
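
The parameter counts in this summary follow from the rule that a dense layer holds one weight per input-unit pair plus one bias per unit. A quick check, assuming a vocabulary of 25,068 features (inferred from the first layer's count, not printed anywhere in the notebook) and 32-bit floats (the Keras default):

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return (n_inputs + 1) * n_units

# Assumed input width: 501,380,000 / 20,000 - 1 = 25,068 TF-IDF features.
vocab_size = 25068
layer_sizes = [vocab_size, 20000, 1000, 3]

per_layer = [dense_params(n_in, n_out)
             for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

# Memory needed for the weights alone, at 4 bytes per parameter.
weights_gigabytes = total * 4 / 1e9

print(per_layer)
print(total)
print(round(weights_gigabytes, 2))
```

Almost all of the parameters sit in the first layer, so its width is the main lever for shrinking the model. The weights alone take roughly 2 GB before gradients and optimizer state are counted.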


In [None]:
from keras.callbacks import ModelCheckpoint 

checkpointer = ModelCheckpoint(filepath='weights.best.hdf5', 
                               verbose=1, save_best_only=True)

m1.fit(X_learn_tfidf, y_learn, validation_data=(X_val_tfidf, y_val), epochs=4, callbacks=[checkpointer])

Train on 14684 samples, validate on 4895 samples
Epoch 1/4
  416/14684 [..............................] - ETA: 12086s - loss: 1.0970 - acc: 0.3894