# LSTM

Implementing an LSTM classifier for the IMDB review data. For the data analysis, see Logistic_Regression.ipynb.

In [1]:
import imdb_functions # Helper for data loading. Not needed here, since keras.datasets.imdb is used instead.
import keras
import numpy as np
import pandas as pd
import sklearn as sk

from keras.datasets import imdb # Easier way to load the data
from keras.layers import LSTM, Dense, Input, Embedding, Reshape
from keras.models import Model, save_model, load_model
from keras.preprocessing.sequence import pad_sequences

from time import time
from keras.callbacks import TensorBoard

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

Using TensorFlow backend.


In [2]:
(X_train,y_train),(X_test,y_test)=imdb.load_data(num_words=5000)
print("X_train:", len(X_train),"y_train:", len(y_train),"X_test:",len(X_test),"y_test:", len(y_test))

X_train: 25000 y_train: 25000 X_test: 25000 y_test: 25000


In [3]:
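max_review_length = 500
# Pad or truncate every review to 500 tokens. padding='pre' is the Keras default, so both calls behave the same.
X_train = pad_sequences(X_train, maxlen=max_review_length, padding='pre')
X_test = pad_sequences(X_test, maxlen=max_review_length, padding='pre')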
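
To make the padding step concrete, here is a minimal NumPy sketch of what pre-padding and pre-truncation to a fixed length do. The helper `pad_pre` is hypothetical and written only for illustration; it is not the Keras implementation, just an approximation of `pad_sequences` with its default `padding='pre'` and `truncating='pre'` settings.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Illustrative stand-in for pad_sequences(..., padding='pre', truncating='pre')."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for i, seq in enumerate(sequences):
        kept = list(seq)[-maxlen:]       # long reviews keep only their last maxlen tokens
        if kept:
            out[i, -len(kept):] = kept   # right-align; the fill value stays on the left
    return out

toy = [[1, 2, 3], [4, 5, 6, 7, 8, 9]]    # toy word-index sequences, not real IMDB data
padded = pad_pre(toy, maxlen=4)
print(padded)
```

Short reviews gain leading zeros and long reviews lose their earliest tokens, so every row ends up with the same length and can be stacked into one array.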

In [7]:
# Creating validation data by holding out 20% of the test set
X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, test_size=0.2, random_state=42)

In [8]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((25000, 500), (25000,), (20000, 500), (20000,))

In [5]:
# Model setup
input_nodes= Input(shape=(X_train.shape[1],))
e = Embedding(5000,
              32,
              input_length=X_train.shape[1],
              trainable=True)(input_nodes)
lstm=LSTM(100)(e)
output_nodes=Dense(1, activation='sigmoid')(lstm)

# Build model
model = Model(inputs=input_nodes, outputs=output_nodes)
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

tensorboard = TensorBoard(log_dir="logs/{}".format(time()))
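
As a rough sanity check on the size of this architecture, the trainable parameter count can be worked out by hand. This is a sketch that assumes the standard LSTM formulation (four sets of input weights, recurrent weights, and biases) and a dense layer with a bias; `model.summary()` remains the authoritative source.

```python
vocab_size, embed_dim, lstm_units = 5000, 32, 100

# Embedding: one embed_dim-sized vector per vocabulary entry
embedding_params = vocab_size * embed_dim

# LSTM: 4 weight sets, each with input weights, recurrent weights, and a bias
lstm_params = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)

# Dense(1): one weight per LSTM unit plus a bias
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

Under these assumptions, most of the parameters sit in the embedding table rather than in the LSTM itself.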

In [9]:
# Pass the TensorBoard callback so the run is actually logged
model.fit(X_train, y_train, batch_size=64, epochs=5, validation_data=(X_val, y_val), callbacks=[tensorboard])

Train on 25000 samples, validate on 5000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x1291bf1d0>

In [12]:
model.evaluate(X_train, y_train)



[0.15253367988467217, 0.94592]

In [11]:
model.evaluate(X_test, y_test)



[0.32600042741298674, 0.8714]
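
The two numbers returned by `model.evaluate` are the loss followed by the compiled metric, here binary cross-entropy and accuracy. The sketch below recomputes both with NumPy on a few made-up predictions. The helper names and the toy values are assumptions for illustration, and the 0.5 threshold mirrors the usual default for binary accuracy.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return float(np.mean(-(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))))

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of predictions on the correct side of the threshold."""
    return float(np.mean((y_prob > threshold).astype(int) == y_true))

# Toy labels and sigmoid outputs, not taken from the model above
y_true = np.array([1, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.4, 0.6])

loss = binary_crossentropy(y_true, y_prob)
acc = accuracy(y_true, y_prob)
print([loss, acc])
```

The loss depends on how confident each prediction is, while accuracy only counts which side of the threshold it falls on, which is why the two can move differently between train and test.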

__Observations and Things Learned:__
- An embedding layer maps each word index to a dense, trainable vector, so the LSTM receives a learned representation of each word instead of a raw integer. This makes the model faster to train.
- Without an embedding layer, a considerably more complex model is needed. A single LSTM layer reached only 53% accuracy on train; more LSTM layers or a different architecture would be needed for similar performance.
- First time using TensorBoard.
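
The embedding observation above can be illustrated with plain NumPy indexing: an embedding layer is essentially a lookup table with one row per word index. In this sketch the matrix holds random values as a stand-in for trained weights, and the review indices are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 5000, 32

# Stand-in for the trained embedding weights (random here, learned in the real model)
embedding_matrix = rng.normal(size=(vocab_size, embed_dim)).astype("float32")

# A toy padded review: two padding zeros followed by three word indices
review = np.array([0, 0, 14, 22, 16])

# The lookup: each index selects its row, giving one vector per time step
vectors = embedding_matrix[review]
print(vectors.shape)
```

Each of the 500 positions in a padded review becomes a 32-dimensional vector in this way, and that sequence of vectors is what the LSTM consumes.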

__Results:__
- 94.6% accuracy on train and 87.1% accuracy on test.
- Took ~20 minutes to train.
