# Keras models
This is based on the Keras tutorial (https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html) and Keras Convolutional Model notebook (https://github.com/JoeDumoulin/CSCD439F17/blob/master/notebooks/Final%20Project/Keras%20Convolutional%20Network%20for%20Spooky%20Author%20ID1.ipynb)

In [1]:
# Definitions

from __future__ import print_function

import os
import sys
import numpy as np

# tensorflow settings to activate gpu
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"   # see issue #152
os.environ["CUDA_VISIBLE_DEVICES"]="0"

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.layers import Dense, Input, GlobalMaxPooling1D, Dropout
from keras.layers import Conv1D, MaxPooling1D, Embedding
from keras.models import Model
from keras.optimizers import RMSprop

from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())


BASE_DIR = '../data'
GLOVE_DIR = 'glove.6B'
TEXT_DATA_DIR = os.path.join(BASE_DIR, 'SpookyData')
MAX_SEQUENCE_LENGTH = 1000
MAX_NB_WORDS = 20000
EMBEDDING_DIM = 300
VALIDATION_SPLIT = 0.2

import tensorflow as tf
# Creates a graph.
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)
# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Runs the op.
print(sess.run(c))

Using TensorFlow backend.


[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 17600307339371531092
]
[[ 22.  28.]
 [ 49.  64.]]


In [2]:
import pandas as pd

# read the training data
df = pd.read_csv('train.csv')
df.head()

Unnamed: 0,id,text,author
0,id26305,"This process, however, afforded me no means of...",EAP
1,id17569,It never once occurred to me that the fumbling...,HPL
2,id11008,"In his left hand was a gold snuff box, from wh...",EAP
3,id27763,How lovely is spring As we looked from Windsor...,MWS
4,id12958,"Finding nothing else, not even gold, the Super...",HPL


In [3]:
# get a list of classifications and generate numeric 
#  values for each class.  put the numeric class back 
#  on to the data frame.
authors = dict([(auth, idx) for idx, auth in enumerate(df['author'].unique())])
print(authors)
df['author_id'] = df['author'].apply(lambda x: authors[x])

df.head()

{'MWS': 2, 'EAP': 0, 'HPL': 1}


Unnamed: 0,id,text,author,author_id
0,id26305,"This process, however, afforded me no means of...",EAP,0
1,id17569,It never once occurred to me that the fumbling...,HPL,1
2,id11008,"In his left hand was a gold snuff box, from wh...",EAP,0
3,id27763,How lovely is spring As we looked from Windsor...,MWS,2
4,id12958,"Finding nothing else, not even gold, the Super...",HPL,1


In [17]:
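# Build a stop-word set. These common words probably won't provide insight into which author wrote each sentence.
# Note: only word_index is filtered below; the padded sequences still contain stop-word indices.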
from nltk.corpus import stopwords
stops = set(stopwords.words('english'))

# now we will use the text and author_id fields to train a classifier.
#  We have to: 
#  1. Get the sentences, 
# this takes each sentence from the training file and places it in the list
sents = df['text'].tolist()
# this takes each author id (assigned above) and places them in a list,
# so that we can compare our results to the actual authors
labels = df['author_id'].tolist()

#  2. Tokenize each sentence, 
tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(sents)
# turns the text input into numerical arrays
sequences = tokenizer.texts_to_sequences(sents)
print(len(sequences))
print(sequences[0])
##    Get a vector of unique terms here
print('Found %s unique tokens before stopwords removal.' % len(tokenizer.word_index))
print([w for w in tokenizer.word_index.items()][:5])
word_index = dict([(w,i) for w,i in tokenizer.word_index.items() if w not in stops])
print('Found %s unique tokens after stopwords removal.' % len(word_index))


data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)

labels = to_categorical(np.asarray(labels))
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

# split the data into a training set and a validation set
indices = np.arange(data.shape[0])
#shuffling indices takes less time than shuffling objects
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]
# sets the number of the validation samples to 20% of the data (20% is the percentage selected above)
num_validation_samples = int(VALIDATION_SPLIT * data.shape[0])

x_train = data[:-num_validation_samples]
y_train = labels[:-num_validation_samples]
x_val = data[-num_validation_samples:]
y_val = labels[-num_validation_samples:]
y_val[:5]

19579
[26, 2945, 143, 1372, 22, 36, 294, 2, 7451, 1, 2440, 2, 10, 4556, 16, 6, 79, 179, 48, 4245, 3, 295, 4, 1, 249, 1943, 6, 326, 74, 134, 123, 891, 2, 1, 313, 39, 1438, 4928, 98, 1, 430]
Found 25943 unique tokens before stopwords removal.
[('splendour', 2558), ('waddle', 15025), ('atlantic', 4141), ('foraging', 16866), ('millstone', 16530)]
Found 25808 unique tokens after stopwords removal.
Shape of data tensor: (19579, 1000)
Shape of label tensor: (19579, 3)


array([[ 0.,  0.,  1.],
       [ 0.,  0.,  1.],
       [ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.]])

The cells above are shared by the first and second models, which is why I left them outside of the "Model 1" section. The data is read in from the "train.csv" file and prepared for use in the models. The models below all use embeddings; some use pretrained GloVe embeddings, and some train embeddings on the training set.

GloVe (Global Vectors for Word Representation) is an "unsupervised learning algorithm for obtaining vector representations of words," developed at Stanford (https://nlp.stanford.edu/projects/glove/). The GloVe model is trained on the non-zero entries of a word-word co-occurrence matrix. Building the matrix requires a one-time pass over the corpus, which can be expensive up front but saves time in the long run. In the resulting vector space, words that frequently occur near each other in the corpus end up closer together than words that rarely do.
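To make the idea of a word-word co-occurrence matrix concrete, here is a minimal sketch that counts co-occurrences on a toy corpus. The function name `cooccurrence_counts`, the window size, and the two sentences are made up for illustration, and the sketch uses plain counts; it is not the GloVe implementation, which also weights pairs by distance and then fits vectors to these counts.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """Count how often two words appear within `window` tokens of each other."""
    counts = defaultdict(float)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            # Look only at neighbours to the right, then record both
            # directions so the resulting matrix is symmetric.
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                counts[(word, tokens[j])] += 1.0
                counts[(tokens[j], word)] += 1.0
    return dict(counts)

# Hypothetical toy corpus, not taken from the Spooky Author data.
toy_corpus = [
    "the old house stood on the hill",
    "the old man left the house",
]
counts = cooccurrence_counts(toy_corpus, window=2)

print(counts[("the", "old")])      # pair seen in both sentences
print(("hill", "man") in counts)   # never within the window, so no entry
```

Only pairs that actually co-occur get an entry, which is why the training works on the non-zero cells rather than the full vocabulary-by-vocabulary grid.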

# Model 1

In [10]:
#  3. Load embeddings
embeddings_index = {}
# the context manager closes the file even if parsing fails part-way through
with open(os.path.join(GLOVE_DIR, 'glove.6B.300d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

Found 400000 word vectors.


This uses one of several GloVe text files; using one of the others may be more effective. The word vectors in the file used here have length 300. The other options are 50, 100, and 200.

In [11]:
#  4. Create the Embedding matrix for the training set
num_words = min(MAX_NB_WORDS, len(word_index))
# returns an array of the size num_words x EMBEDDING_DIM, filled with 0s
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
unk = []
for word, i in word_index.items():
    if i >= MAX_NB_WORDS:
        continue
    # gets the vector for the current word    
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector
    else:
        unk.append(word)
print(len(unk))

2092


As noted in the comments in the cell above, words not found in the embedding index are left as all-zero rows; here 2,092 words had no GloVe vector. Since H. P. Lovecraft invents many words, this might cause issues if his invented words do not appear in the corpus GloVe was trained on.
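One way to check this concern is to list exactly which in-vocabulary words have no pretrained vector. The sketch below uses a hypothetical helper, `split_by_coverage`, and toy dictionaries standing in for `tokenizer.word_index` and `embeddings_index`; whether any particular word is really missing from GloVe is not something this toy data shows.

```python
def split_by_coverage(word_index, embeddings_index, max_words):
    """Return (covered, missing) word lists for the words the model can see."""
    covered, missing = [], []
    for word, i in word_index.items():
        if i >= max_words:
            continue  # same cut-off as the embedding-matrix loop
        if word in embeddings_index:
            covered.append(word)
        else:
            missing.append(word)
    return covered, missing

# Toy stand-ins for tokenizer.word_index and the GloVe dictionary.
toy_word_index = {"house": 1, "cthulhu": 2, "night": 3, "r'lyeh": 4, "rare": 50}
toy_embeddings = {"house": [0.1, 0.2], "night": [0.3, 0.4]}

covered, missing = split_by_coverage(toy_word_index, toy_embeddings, max_words=10)
print(sorted(missing))
print(len(covered) / (len(covered) + len(missing)))  # fraction with a vector
```

Run against the real dictionaries, printing a sample of `missing` would show whether the unknown words are mostly invented names or mostly typos and rare forms.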

In [12]:
# load pre-trained word embeddings into an Embedding layer
# note that trainable=True here, so the GloVe vectors are fine-tuned during
# training (set trainable=False to keep the embeddings fixed)
embedding_layer = Embedding(num_words,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=True)

print('Training model.')

# train a 1D convnet with global maxpooling
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
x = Conv1D(128, 5, activation='relu')(embedded_sequences)
x = GlobalMaxPooling1D()(x)
x = Dense(128, activation='relu')(x)
x = Dropout(.5)(x)
preds = Dense(len(authors), activation='softmax')(x)
rms = RMSprop(lr=0.003)
model = Model(sequence_input, preds)
model.compile(loss='mean_squared_logarithmic_error',
              optimizer=rms, #'rmsprop',
              metrics=['acc'])

Training model.


In [14]:
model.fit(x_train, y_train,
          batch_size=100,
          epochs=2,
          validation_data=(x_val, y_val))

Train on 15664 samples, validate on 3915 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7fdd5bcc7048>

In [15]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 1000)              0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 1000, 300)         6000000   
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 996, 128)          192128    
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 128)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 128)               16512     
_________________________________________________________________
dropout_1 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 3)                 387       
Total para

We can see that this model has one input layer, one embedding layer, one convolutional layer, one pooling layer, two dense layers, and one dropout layer.
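The parameter counts and output shapes in the summary can be reproduced by hand. The sketch below is plain arithmetic, assuming the default Conv1D settings ('valid' padding, stride 1) and one bias per filter or unit; the constant names are only for illustration.

```python
VOCAB_SIZE = 20000     # num_words = min(MAX_NB_WORDS, len(word_index))
EMBEDDING_DIM = 300
SEQ_LEN = 1000         # MAX_SEQUENCE_LENGTH
FILTERS, KERNEL = 128, 5
DENSE_UNITS, NUM_CLASSES = 128, 3

# Embedding: one EMBEDDING_DIM-long vector per vocabulary entry.
embedding_params = VOCAB_SIZE * EMBEDDING_DIM

# Conv1D: each filter spans KERNEL positions x EMBEDDING_DIM channels, plus a bias.
conv_params = KERNEL * EMBEDDING_DIM * FILTERS + FILTERS
conv_out_len = SEQ_LEN - KERNEL + 1

# Dense layers: weights plus one bias per unit.
dense1_params = FILTERS * DENSE_UNITS + DENSE_UNITS
dense2_params = DENSE_UNITS * NUM_CLASSES + NUM_CLASSES

print(embedding_params, conv_out_len, conv_params, dense1_params, dense2_params)
```

The embedding layer accounts for almost all of the weights, so the vocabulary size and vector length dominate the model's size.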

After two epochs, the validation accuracy of this model was 77.57%. Running more epochs could result in higher accuracy, but because each model takes a long time to run, I will test the various models with two epochs each and then run the best model for more epochs to prepare the final submission.

# Model 2 -- Not pretrained embeddings
As I noted above, I was concerned that using the pretrained embeddings might cause issues due to Lovecraft's tendency to make up words. These imaginary words (Cthulhu, R'lyeh, etc.) are highly indicative of Lovecraft's writing and could affect the accuracy of the model. In this model, the embedding layer is trained from scratch on the training sentences, with its size taken from the word_index dictionary defined above, rather than being initialized from GloVe. In all other respects, the model is identical to Model 1.

In [6]:
# train word embeddings and load into an Embedding layer
embedding_layer = Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            input_length=MAX_SEQUENCE_LENGTH)

print('Training model.')

# train a 1D convnet with global maxpooling
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
x = Conv1D(128, 5, activation='relu')(embedded_sequences)
x = GlobalMaxPooling1D()(x)
x = Dense(128, activation='relu')(x)
x = Dropout(.5)(x)
preds = Dense(len(authors), activation='softmax')(x)
rms = RMSprop(lr=0.003)
model = Model(sequence_input, preds)
model.compile(loss='mean_squared_logarithmic_error',
              optimizer=rms, #'rmsprop',
              metrics=['acc'])

Training model.


In [7]:
model.fit(x_train, y_train,
          batch_size=100,
          epochs=2,
          validation_data=(x_val, y_val))

Train on 15664 samples, validate on 3915 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f56a74f4470>

This model was very slightly less accurate after the second epoch (77.32% vs 77.57%), but interestingly, it was more accurate after the first epoch (78.34%, compared to 75.35% after the first epoch in Model 1).

# Model 3 -- Pretrained embeddings of length 100 (instead of 300)

In [18]:
#  3. Load embeddings
embeddings_index = {}
# the context manager closes the file even if parsing fails part-way through
with open(os.path.join(GLOVE_DIR, 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

Found 400000 word vectors.


In [20]:
#  4. Create the Embedding matrix for the training set
num_words = min(MAX_NB_WORDS, len(word_index))
# returns an array of the size num_words x EMBEDDING_DIM, filled with 0s
embedding_matrix = np.zeros((num_words, 100))
unk = []
for word, i in word_index.items():
    if i >= MAX_NB_WORDS:
        continue
    # gets the vector for the current word    
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector
    else:
        unk.append(word)
print(len(unk))

2092


In [22]:
# load pre-trained word embeddings into an Embedding layer
# note that trainable=True here, so the GloVe vectors are fine-tuned during
# training (set trainable=False to keep the embeddings fixed)
embedding_layer = Embedding(num_words,
                            100,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=True)

print('Training model.')

# train a 1D convnet with global maxpooling
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
x = Conv1D(128, 5, activation='relu')(embedded_sequences)
x = GlobalMaxPooling1D()(x)
x = Dense(128, activation='relu')(x)
x = Dropout(.5)(x)
preds = Dense(len(authors), activation='softmax')(x)
rms = RMSprop(lr=0.003)
model = Model(sequence_input, preds)
model.compile(loss='mean_squared_logarithmic_error',
              optimizer=rms, #'rmsprop',
              metrics=['acc'])

Training model.


In [23]:
model.fit(x_train, y_train,
          batch_size=100,
          epochs=2,
          validation_data=(x_val, y_val))

Train on 15664 samples, validate on 3915 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f56a22d61d0>

This model is almost ten percentage points worse than model 1. When using pretrained embeddings, it appears sticking with the longer word vectors is better.

# Model 4 -- Not pretrained embeddings, sentences to lower
I tried to find out whether the capitalization of the words used to train the embedding mattered, but found nothing conclusive when I searched. In most string comparisons, capital letters are treated as entirely different from lowercase letters, so I tried using a non-pretrained embedding layer again, this time setting all the words in the sentences to lowercase beforehand. Changing the capitalization is the only intended difference between this model and Model 2.

In [4]:
# Drop stop words
from nltk.corpus import stopwords
stops = set(stopwords.words('english'))

# now we will use the text and author_id fields to train a classifier.
#  We have to: 
#  1. Get the sentences, 
sents = df['text'].tolist()
labels = df['author_id'].tolist()

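# make all sentences lower case (the result has to be assigned back to sents;
# note that Tokenizer also lowercases by default via lower=True)
sents = [x.lower() for x in sents]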

#  2. Tokenize each sentence, 
tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(sents)
sequences = tokenizer.texts_to_sequences(sents)
print(len(sequences))
print(sequences[0])
##    Get a vector of unique terms here
print('Found %s unique tokens before stopwords removal.' % len(tokenizer.word_index))
print([w for w in tokenizer.word_index.items()][:5])
word_index = dict([(w,i) for w,i in tokenizer.word_index.items() if w not in stops])
print('Found %s unique tokens after stopwords removal.' % len(word_index))


data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)

labels = to_categorical(np.asarray(labels))
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

# split the data into a training set and a validation set
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]
num_validation_samples = int(VALIDATION_SPLIT * data.shape[0])

x_train = data[:-num_validation_samples]
y_train = labels[:-num_validation_samples]
x_val = data[-num_validation_samples:]
y_val = labels[-num_validation_samples:]
y_val[:5]

19579
[26, 2945, 143, 1372, 22, 36, 294, 2, 7451, 1, 2440, 2, 10, 4556, 16, 6, 79, 179, 48, 4245, 3, 295, 4, 1, 249, 1943, 6, 326, 74, 134, 123, 891, 2, 1, 313, 39, 1438, 4928, 98, 1, 430]
Found 25943 unique tokens before stopwords removal.
[('city', 224), ('oozing', 24131), ('unsatisfying', 10739), ('aye', 6915), ('dooms', 15393)]
Found 25808 unique tokens after stopwords removal.
Shape of data tensor: (19579, 1000)
Shape of label tensor: (19579, 3)


array([[ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  1.,  0.],
       [ 1.,  0.,  0.]])

In [9]:
# train word embeddings and load into an Embedding layer
embedding_layer = Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            input_length=MAX_SEQUENCE_LENGTH)

print('Training model.')

# train a 1D convnet with global maxpooling
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
x = Conv1D(128, 5, activation='relu')(embedded_sequences)
x = GlobalMaxPooling1D()(x)
x = Dense(128, activation='relu')(x)
x = Dropout(.5)(x)
preds = Dense(len(authors), activation='softmax')(x)
rms = RMSprop(lr=0.003)
model = Model(sequence_input, preds)
model.compile(loss='mean_squared_logarithmic_error',
              optimizer=rms, #'rmsprop',
              metrics=['acc'])

Training model.


In [10]:
model.fit(x_train, y_train,
          batch_size=100,
          epochs=2,
          validation_data=(x_val, y_val))

Train on 15664 samples, validate on 3915 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f56a241ce48>

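This is the highest accuracy model yet (80.87% accuracy after two epochs, 3.3 percentage points higher than the second-most accurate model), so it appears that making the sentences entirely lowercase before training the embedding layer helped. This should be read with caution, though: the Keras Tokenizer lowercases its input by default (lower=True), and the token counts and first sequence printed above are identical to the earlier run, so the tokenized input is the same as in Model 2. The improvement may instead come from the different random train/validation split.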
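When checking whether a preprocessing step like lowercasing actually took effect, it helps to remember that Python strings are immutable and that a list comprehension returns a new list. A small pure-Python sketch with made-up sentences:

```python
sents = ["It was a Dark Night.", "The House stood EMPTY."]

# A bare list comprehension builds a new list and throws it away;
# the strings in `sents` stay exactly as they were.
[s.lower() for s in sents]
unchanged = list(sents)

# Assigning the result is what actually changes the data.
lowered = [s.lower() for s in sents]

print(unchanged[0])
print(lowered[0])
```

Comparing a few entries before and after a transformation, as the two print calls do here, is a cheap way to confirm that an experiment is really testing what it is meant to test.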

# Model 5 -- Pretrained embeddings, sentences to lower
In this model, I am checking whether making the sentences all lowercase improves the accuracy when using a pretrained embedding layer.

In [11]:
#  3. Load embeddings
embeddings_index = {}
# the context manager closes the file even if parsing fails part-way through
with open(os.path.join(GLOVE_DIR, 'glove.6B.300d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

Found 400000 word vectors.


In [12]:
#  4. Create the Embedding matrix for the training set
num_words = min(MAX_NB_WORDS, len(word_index))
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
unk = []
for word, i in word_index.items():
    if i >= MAX_NB_WORDS:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector
    else:
        unk.append(word)
print(len(unk))

2092


In [13]:
# load pre-trained word embeddings into an Embedding layer
# note that trainable=True here, so the GloVe vectors are fine-tuned during
# training (set trainable=False to keep the embeddings fixed)
embedding_layer = Embedding(num_words,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=True)

print('Training model.')

# train a 1D convnet with global maxpooling
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
x = Conv1D(128, 5, activation='relu')(embedded_sequences)
#x = MaxPooling1D()(x)
#x = Conv1D(128, 5, activation='relu')(x)
x = GlobalMaxPooling1D()(x)
x = Dense(128, activation='relu')(x)
x = Dropout(.5)(x)
preds = Dense(len(authors), activation='softmax')(x)
rms = RMSprop(lr=0.003)
model = Model(sequence_input, preds)
model.compile(loss='mean_squared_logarithmic_error',
              optimizer=rms, #'rmsprop',
              metrics=['acc'])
#model.compile(loss='categorical_crossentropy',
#              optimizer=rms, #'rmsprop',
#              metrics=['acc'])

Training model.


In [14]:
model.fit(x_train, y_train,
          batch_size=100,
          epochs=2,
          validation_data=(x_val, y_val))

Train on 15664 samples, validate on 3915 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f56a2135400>

This model reached only 73.87%, lower than every model so far except Model 3. It appears that making all the sentences lowercase only helps when not using pretrained embeddings.

# Model 6 - Based on model 4, messing with layers
Model 4 was the best model so far, so I will use it as the base for further experimentation. This model adds a max pooling layer and a second convolutional layer, which were not in Model 4.

In [15]:
# train word embeddings and load into an Embedding layer
embedding_layer = Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            input_length=MAX_SEQUENCE_LENGTH)

print('Training model.')

# train a 1D convnet with global maxpooling
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
x = Conv1D(128, 5, activation='relu')(embedded_sequences)
x = MaxPooling1D()(x)
x = Conv1D(128, 5, activation='relu')(x)
x = GlobalMaxPooling1D()(x)
x = Dense(128, activation='relu')(x)
x = Dropout(.5)(x)
preds = Dense(len(authors), activation='softmax')(x)
rms = RMSprop(lr=0.003)
model = Model(sequence_input, preds)
model.compile(loss='mean_squared_logarithmic_error',
              optimizer=rms, #'rmsprop',
              metrics=['acc'])

Training model.


In [16]:
model.fit(x_train, y_train,
          batch_size=100,
          epochs=2,
          validation_data=(x_val, y_val))

Train on 15664 samples, validate on 3915 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f5696d2b240>

The extra layers made this model slightly more accurate (81.15% after two epochs vs 80.87%), although it took nearly twice as long to run.
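To see what the extra layers do to the data flowing through the network, the sequence length after each layer can be worked out by hand. This is a sketch assuming the Keras defaults used above ('valid' padding, stride 1 for Conv1D, pool size 2 for MaxPooling1D); the helper names are hypothetical.

```python
def conv1d_len(length, kernel_size):
    # 'valid' padding, stride 1
    return length - kernel_size + 1

def maxpool1d_len(length, pool_size=2):
    # non-overlapping windows: stride equals pool_size
    return (length - pool_size) // pool_size + 1

after_conv1 = conv1d_len(1000, 5)        # first Conv1D(128, 5)
after_pool = maxpool1d_len(after_conv1)  # MaxPooling1D()
after_conv2 = conv1d_len(after_pool, 5)  # second Conv1D(128, 5)

# The second convolution sees 128 input channels (the first layer's filters).
conv2_params = 5 * 128 * 128 + 128

print(after_conv1, after_pool, after_conv2, conv2_params)
```

GlobalMaxPooling1D then collapses whatever length remains into a single 128-value vector, so the dense layers are the same size as in Model 4.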

# Model 7 -- Based on Model 4, various tweaks to hyperparameters
In this model, I halved the number of filters in the convolutional layer (and the units in the first dense layer) from 128 to 64, reduced the kernel size from 5 to 3, and increased the learning rate from .003 to .005.

In [25]:
# train word embeddings and load into an Embedding layer
embedding_layer = Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            input_length=MAX_SEQUENCE_LENGTH)

print('Training model.')

# train a 1D convnet with global maxpooling
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
x = Conv1D(64, 3, activation='relu')(embedded_sequences)
x = GlobalMaxPooling1D()(x)
x = Dense(64, activation='relu')(x)
x = Dropout(.5)(x)
preds = Dense(len(authors), activation='softmax')(x)
rms = RMSprop(lr=0.005)
model = Model(sequence_input, preds)
model.compile(loss='mean_squared_logarithmic_error',
              optimizer=rms, #'rmsprop',
              metrics=['acc'])

Training model.


In [26]:
model.fit(x_train, y_train,
          batch_size=100,
          epochs=2,
          validation_data=(x_val, y_val))

Train on 15664 samples, validate on 3915 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f56a7684908>

This model, with an accuracy of 78.19%, is better than the first three models but a step down from Model 4, on which it is based.

# Initial Submission
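This model isn't great, but I made a test submission at this point just to make sure I could get it to work. The contest score (a loss, so lower is better) was 2.68128, which isn't particularly surprising given the general state of the model. It's worth noting that the contest score doesn't seem to have much to do with the loss calculated during training. That is expected to some degree: the contest is scored with multi-class logarithmic loss, while these models are trained with mean squared logarithmic error, so the two numbers are not on the same scale.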
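The gap between the two numbers is easier to see with a small example. The sketch below re-implements both losses in NumPy on made-up predictions; the clipping constant and the row renormalization are assumptions about how the contest scorer treats submitted probabilities, and the function names are illustrative rather than Keras or Kaggle APIs.

```python
import numpy as np

def multiclass_log_loss(y_true, y_pred, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    p = np.clip(y_pred, eps, 1 - eps)       # assumed clipping, avoids log(0)
    p = p / p.sum(axis=1, keepdims=True)    # assumed row renormalization
    return float(-np.mean(np.sum(y_true * np.log(p), axis=1)))

def mean_squared_log_error(y_true, y_pred):
    """NumPy version of the loss used for training in this notebook."""
    return float(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

y_true = np.array([[1.0, 0.0, 0.0]])
confident_wrong = np.array([[0.0, 1.0, 0.0]])
hedged = np.array([[0.4, 0.3, 0.3]])

for name, pred in [("confident but wrong", confident_wrong), ("hedged", hedged)]:
    print(name,
          "log loss:", round(multiclass_log_loss(y_true, pred), 3),
          "MSLE:", round(mean_squared_log_error(y_true, pred), 3))
```

A single confidently wrong row costs well under 1 in mean squared logarithmic error but more than 30 in log loss, so a model can look fine by its training loss while scoring badly in the contest.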

# Model 8 -- Based on Model 4, different loss function, smaller batches, lower dropout rate

In [27]:
# train word embeddings and load into an Embedding layer
embedding_layer = Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            input_length=MAX_SEQUENCE_LENGTH)

print('Training model.')

# train a 1D convnet with global maxpooling
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
x = Conv1D(128, 5, activation='relu')(embedded_sequences)
x = GlobalMaxPooling1D()(x)
x = Dense(128, activation='relu')(x)
x = Dropout(.4)(x)
preds = Dense(len(authors), activation='softmax')(x)
rms = RMSprop(lr=0.003)
model = Model(sequence_input, preds)
model.compile(loss='mean_absolute_error',
              optimizer=rms, #'rmsprop',
              metrics=['acc'])

Training model.


In [28]:
model.fit(x_train, y_train,
          batch_size=50,
          epochs=2,
          validation_data=(x_val, y_val))

Train on 15664 samples, validate on 3915 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f56a23b5c18>

This model is significantly worse than even Model 3, the previous worst model, and after two epochs is barely better than chance.

# Model 9 -- Combined tweaks
This model has the same layers as model 6, a slightly lower learning rate, a slightly higher dropout rate, and a slightly larger batch size.

In [5]:
# train word embeddings and load into an Embedding layer
embedding_layer = Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            input_length=MAX_SEQUENCE_LENGTH)

print('Training model.')

# train a 1D convnet with global maxpooling
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
x = Conv1D(128, 5, activation='relu')(embedded_sequences)
x = MaxPooling1D()(x)
x = Conv1D(128, 5, activation='relu')(x)
x = GlobalMaxPooling1D()(x)
x = Dense(128, activation='relu')(x)
x = Dropout(.6)(x)
preds = Dense(len(authors), activation='softmax')(x)
rms = RMSprop(lr=0.002)
model = Model(sequence_input, preds)
model.compile(loss='mean_squared_logarithmic_error',
              optimizer=rms, #'rmsprop',
              metrics=['acc'])

Training model.


In [6]:
model.fit(x_train, y_train,
          batch_size=150,
          epochs=2,
          validation_data=(x_val, y_val))

Train on 15664 samples, validate on 3915 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f33b0103e10>

This model looked like it was going to have a higher accuracy while it was running, but the end result was a lower accuracy than most of the others.

# Final Model -- More Epochs
The best model of the previous nine models was Model 6, with an accuracy of 81.15%. This final model is unchanged from that model, except that it runs for 10 epochs in order to achieve a higher accuracy for the submission to the contest.

In [5]:
# train word embeddings and load into an Embedding layer
embedding_layer = Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            input_length=MAX_SEQUENCE_LENGTH)

print('Training model.')

# train a 1D convnet with global maxpooling
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
x = Conv1D(128, 5, activation='relu')(embedded_sequences)
x = MaxPooling1D()(x)
x = Conv1D(128, 5, activation='relu')(x)
x = GlobalMaxPooling1D()(x)
x = Dense(128, activation='relu')(x)
x = Dropout(.5)(x)
preds = Dense(len(authors), activation='softmax')(x)
rms = RMSprop(lr=0.003)
model = Model(sequence_input, preds)
model.compile(loss='mean_squared_logarithmic_error',
              optimizer=rms, #'rmsprop',
              metrics=['acc'])

Training model.


In [6]:
model.fit(x_train, y_train,
          batch_size=100,
          epochs=10,
          validation_data=(x_val, y_val))

Train on 15664 samples, validate on 3915 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f621d90b550>

Overall, increasing the number of epochs does not appear to have improved the actual accuracy of the model, contrary to what I expected. The training accuracy increased sharply (from 66% in the first epoch to the high 90s in epochs 6 and up), but the validation accuracy stayed around 80% the entire time. This gap is a typical sign of overfitting: the extra epochs most likely led to overtraining the model and did not actually help improve the score.
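A common response to this pattern is early stopping: watch the validation loss and stop once it has not improved for a few epochs. Keras provides an `EarlyStopping` callback for this; the pure-Python sketch below only illustrates the rule itself. The helper `best_epoch` and the validation losses are hypothetical and are not the values from the run above.

```python
def best_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch), both 1-indexed, for patience-based stopping."""
    best, best_idx, waited = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, best_idx, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_idx  # stop: no improvement for `patience` epochs
    return len(val_losses), best_idx

# Hypothetical validation losses that improve and then drift upward.
hypothetical_val_loss = [0.60, 0.52, 0.50, 0.55, 0.61, 0.70, 0.78, 0.85, 0.93, 1.02]
stop, best = best_epoch(hypothetical_val_loss, patience=2)
print("stop after epoch", stop, "and keep the weights from epoch", best)
```

With a rule like this, the ten-epoch run would have ended as soon as the validation loss stopped improving, rather than continuing to fit the training set.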

The loss rate also decreased sharply over the epochs, but I'm not convinced that will translate to a low loss rate in the actual contest submission.

# Exporting final model to CSV to submit to the contest

In [7]:
test_df = pd.read_csv("test.csv")
#  1. Get the sentences, 
sents = test_df['text'].tolist()
ids = test_df['id'].tolist()
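#  2. Tokenize each sentence, reusing the tokenizer that was fit on the
#  training sentences. Fitting a fresh Tokenizer on the test set would
#  assign different integer indices than the ones the model was trained on.
sequences = tokenizer.texts_to_sequences(sents)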
test_data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)

In [8]:
predictions = pd.DataFrame(
    model.predict(test_data)
                           )
predictions = predictions.rename(columns={0: 'EAP', 1: 'HPL', 2: 'MWS'})
predictions['id'] = test_df['id']
predictions.to_csv("submission_keras.csv", index=False)

# Submission Result
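The score from the submission was 16.03046 (lower is better), which is significantly worse than I expected. I assume the model was overtrained and as a result couldn't generalize to the wider dataset. As noted above, I suspected that might be the case, but the degree to which it was off surprised me. A second likely factor is the tokenization of the test set: the model only makes sense of integer indices produced by the tokenizer fit on the training data, so test sentences encoded with a different word-to-index mapping would look like noise to it.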
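Word indices depend on the text a tokenizer was fit on, because indices are assigned by frequency rank. The sketch below uses a simplified stand-in for that behaviour (`fit_vocab` and `to_sequence` are hypothetical helpers, not the Keras API) on made-up sentences, to show how the same test sentence gets different integers depending on which corpus the vocabulary came from.

```python
from collections import Counter

def fit_vocab(texts):
    """Rank words by frequency, most frequent first, starting at index 1."""
    counts = Counter(w for t in texts for w in t.lower().split())
    # Sort by descending count, then alphabetically for a deterministic result.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {w: i for i, w in enumerate(ranked, start=1)}

def to_sequence(text, vocab):
    """Map each known word to its index, dropping unknown words."""
    return [vocab[w] for w in text.lower().split() if w in vocab]

train_texts = ["the raven sat", "the raven spoke", "the night fell"]
test_texts = ["night fell on the night", "the raven"]

train_vocab = fit_vocab(train_texts)
test_vocab = fit_vocab(test_texts)

print(to_sequence("the raven", train_vocab))  # indices the model was trained on
print(to_sequence("the raven", test_vocab))   # different indices for the same words
```

A model trained on the first mapping would read the second sequence as two entirely different words, which is why the test data should be encoded with the vocabulary built from the training data.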

# Final Thoughts
I tried to use a variety of methods in the submissions, but overall the convolutional network was easiest to use, although it ended up with the worst score. I wish I'd focused more of my time on trying another neural network model, using a different approach (such as a recurrent neural network). Despite my difficulties in getting the other, simpler-seeming methods to work, they ended up with much better scores than the neural network. I'd hoped to have time to get a fourth submission in, but unfortunately with the amount of time it takes to run even the simpler models on my machine, that just didn't happen.

I also should have been making more submissions to the contest as I tweaked this model. Specifically, it would have been interesting to see what the score was after each epoch with the last model, and whether it improved before getting worse, or if fewer epochs would have been better all around.