**PROBLEM DESCRIPTION**

This notebook demonstrates **sequence classification on the IMDB movie reviews dataset** using a simple LSTM-based classifier.

Each movie review is a variable-length sequence of words, and the sentiment of each review must be classified. The Large Movie Review Dataset (often referred to as the IMDB dataset) contains *25,000 film reviews (good or bad) for training and 25,000 reviews for testing*. The problem is deciding whether a given movie review is positive or negative. The data were collected by researchers at Stanford and used in a 2011 paper that split the data 50-50 between training and testing. In that paper, an accuracy of 88.89% was achieved.

Here, the built-in IMDB movie reviews dataset is used. Keras provides several built-in datasets, and this one is loaded with **imdb.load_data()**.

**Import modules**

Let's start off with the basic step of importing all the relevant modules and functions required for this particular classifier.

In [19]:
import numpy
from keras.datasets import imdb #built-in dataset
from keras.models import Sequential 
import pandas as pd
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding #to define the word embedding
from keras.preprocessing import sequence

Next, decide how much of the vocabulary to use for this model. The original dataset contains 50,000 movie reviews, and its vocabulary can be restricted as required. Here, we keep only the 10,000 most frequent words in the dataset.

The dataset is loaded already split into train (50%) and test (50%) sets.

In [18]:
top_words = 10000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words)
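
To make the effect of `num_words` concrete, here is a small sketch in plain Python of the kind of filtering it performs. It assumes the documented Keras defaults, in which word indices are ranked by frequency and any index outside the kept vocabulary is replaced by a placeholder (`oov_char=2`). The helper `restrict_vocabulary` and the toy review are hypothetical and are not part of Keras.

```python
def restrict_vocabulary(review, num_words, oov_char=2):
    """Replace every word index >= num_words with the placeholder oov_char.

    Hypothetical helper: an illustration of the filtering idea, not the
    Keras implementation itself.
    """
    return [w if w < num_words else oov_char for w in review]


# A made-up review encoded as word indices (lower index = more frequent word).
toy_review = [1, 14, 22, 9999, 10000, 25000, 43]

restricted = restrict_vocabulary(toy_review, num_words=10000)
print(restricted)
```

Indices below 10,000 pass through unchanged, while the two rarer words are mapped to the placeholder, so the Embedding layer never sees an index outside its vocabulary.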

Now, we need to pad the sequences so that they all have the same length for modelling. Reviews longer than the chosen length are truncated, and shorter ones are filled with zeros, which the model learns carry no information.

In [None]:
# truncate and pad input sequences
max_review_length = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)
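
As a rough illustration of what the padding step does to a single review, the sketch below mimics the default behaviour of `pad_sequences` (zeros added at the front, and the start of an over-long sequence dropped) using NumPy. The function `pad_or_truncate` is a hypothetical stand-in written for this example, not the Keras routine.

```python
import numpy as np


def pad_or_truncate(seq, maxlen, value=0):
    """Mimic default 'pre' padding and 'pre' truncation on one sequence.

    Hypothetical helper for illustration; assumes maxlen >= 1.
    """
    seq = list(seq)[-maxlen:]                      # keep only the last maxlen items
    padding = [value] * (maxlen - len(seq))        # zeros go in front
    return np.array(padding + seq)


short = pad_or_truncate([5, 8, 13], maxlen=5)              # shorter than maxlen
long_ = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], maxlen=5)   # longer than maxlen

print(short)
print(long_)
```

Both results have exactly five entries, which is what allows reviews of different lengths to be stacked into one array for training.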

**Define the LSTM model**

The first layer is the Embedding layer, which uses vectors of length 32 to represent each word. The next layer is an LSTM layer with 100 memory units (smart neurons). More than one LSTM layer can be stacked if desired. Finally, because this is a classification problem, we use a Dense output layer with a single neuron and a sigmoid activation function to make 0 or 1 predictions for the two classes (good and bad) in the problem.

Because it is a *binary classification problem*, log loss is used as the loss function (binary_crossentropy in Keras), together with the efficient ADAM optimization algorithm. The number of epochs and the batch size can be adjusted as required. Here, we train for 10 epochs with a batch size of 64.
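
To show what the sigmoid output and the log loss compute, here is a minimal NumPy sketch. The labels and raw scores are made up for illustration, and the functions are simplified versions written for this example rather than the Keras implementations.

```python
import numpy as np


def sigmoid(z):
    """Squash a raw score into a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean log loss; probabilities are clipped to avoid log(0)."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))


# Made-up labels (1 = positive review) and raw output scores.
y_true = np.array([1.0, 0.0, 1.0])
scores = np.array([2.0, -1.0, 0.0])

probs = sigmoid(scores)
loss = binary_crossentropy(y_true, probs)
print(probs, loss)
```

Confident correct predictions contribute little to the loss, while the undecided third example (probability 0.5) contributes the most.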

In [12]:
# create the model
embedding_vecor_length = 32
model = Sequential()
model.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, batch_size=64)
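
The parameter counts that `model.summary()` reports can be checked by hand. The sketch below works through the arithmetic for the layer sizes used above, assuming the standard LSTM formulation with four gates, each having an input kernel, a recurrent kernel, and a bias. The variable names are chosen for this example.

```python
top_words = 10000      # vocabulary size fed to the Embedding layer
embedding_dim = 32     # length of each word vector
lstm_units = 100       # memory units in the LSTM layer

# Embedding: one vector of length embedding_dim per word.
embedding_params = top_words * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)

# Dense: one weight per LSTM unit plus a single bias.
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

Most of the parameters sit in the Embedding layer, so the vocabulary size has the largest influence on the overall model size.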

Once the model is trained, we can evaluate its performance on unseen reviews.

In [None]:
# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))
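
For intuition about the accuracy figure, the sketch below computes it by hand from predicted probabilities, assuming a decision threshold of 0.5. The labels and probabilities are made up, and `binary_accuracy` is a hypothetical helper written for this example.

```python
import numpy as np


def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of predictions that match the labels after thresholding."""
    y_pred = (np.asarray(y_prob) > threshold).astype(int)
    return float(np.mean(y_pred == np.asarray(y_true)))


# Made-up labels and sigmoid outputs for four reviews.
y_true = [1, 0, 1, 1]
y_prob = [0.91, 0.20, 0.45, 0.77]

acc = binary_accuracy(y_true, y_prob)
print("Accuracy: %.2f%%" % (acc * 100))
```

Here the third review is a positive one scored below the threshold, so three of the four predictions are correct.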

The complete code is shown below. Let's see how the model turns out.

In [20]:
# LSTM for sequence classification in the IMDB dataset
import numpy
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding
from keras.preprocessing import sequence

# load the dataset, keeping only the top_words most frequent words
top_words = 10000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words)

# truncate and pad input sequences to a common length
max_review_length = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)

# create the model
embedding_vecor_length = 32
model = Sequential()
model.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(X_train, y_train, epochs=10, batch_size=64)

# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))