## IMDB review sentiment analysis
25,000 movie reviews from IMDB, each labelled as positive or negative.
The data ships with the Keras datasets module, already encoded as sequences of integer word indexes, so we never handle the raw vocabulary directly.
We will embed the reviews with an embedding layer and then learn their sequential structure with an LSTM.

In [2]:
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM
from keras.datasets import imdb

In [3]:
max_feature = 20000    # Max number of vocab words considered
max_len = 80       # Max length of the review taken

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words = max_feature)

Downloading data from https://s3.amazonaws.com/text-datasets/imdb.npz


In [4]:
## Pad (or truncate) every review to max_len tokens so that all inputs have the same length (Tx).
x_train = sequence.pad_sequences(x_train, maxlen = max_len)
x_test = sequence.pad_sequences(x_test, maxlen = max_len)

model = Sequential()
model.add(Embedding(max_feature, 128))
model.add(LSTM(128, dropout = 0.2, recurrent_dropout =0.2))
model.add(Dense(1, activation = 'sigmoid'))
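
To make the padding step concrete, here is a minimal NumPy sketch of what `pad_sequences` does with its default settings, which pad and truncate at the front of each sequence. The helper `pad_pre` and the toy reviews are made up for illustration; this is not the Keras implementation.

```python
import numpy as np

def pad_pre(seqs, maxlen, value=0):
    """Left-pad short sequences with `value`; keep only the last `maxlen` tokens of long ones."""
    out = np.full((len(seqs), maxlen), value, dtype="int32")
    for i, seq in enumerate(seqs):
        if len(seq) == 0:
            continue  # an empty sequence stays all padding
        trunc = seq[-maxlen:]          # drop tokens from the front if too long
        out[i, -len(trunc):] = trunc   # right-align, so padding ends up on the left
    return out

toy_reviews = [[5, 8, 3], [7, 2, 9, 4, 6, 1]]  # hypothetical word-index sequences
padded = pad_pre(toy_reviews, maxlen=4)
print(padded)
```

The short review gains a leading zero, while the long one loses its first two tokens, so with `max_len = 80` the model only sees the last 80 words of a long review.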

In [5]:
# Limit TensorFlow to a single thread to keep resource usage low (TensorFlow 1.x backend API)
from keras import backend as K

K.set_session(K.tf.Session(config=K.tf.ConfigProto(intra_op_parallelism_threads=1, inter_op_parallelism_threads=1)))

In [6]:
## Run and evaluate the model

model.compile(loss = 'binary_crossentropy', optimizer = 'sgd', metrics = ['accuracy'])
model.fit(x_train, y_train, batch_size = 32, epochs = 10, validation_data = (x_test, y_test))
model.evaluate(x_test, y_test, batch_size = 32)

Train on 25000 samples, validate on 25000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


[0.68965756261825562, 0.55855999999999995]
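
To put the loss and accuracy above in context, it helps to know what binary cross-entropy looks like for a classifier that has learned nothing. The sketch below, with made-up labels and a hand-written `binary_crossentropy` helper (an illustration, not the Keras function), computes the loss of a model that outputs 0.5 for every review.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy; predictions are clipped to avoid log(0)."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    losses = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return float(np.mean(losses))

y_true = np.array([0, 1, 1, 0, 1], dtype=float)  # made-up labels
y_pred = np.full_like(y_true, 0.5)               # an uninformed constant guess
chance_loss = binary_crossentropy(y_true, y_pred)
print(chance_loss)
```

The result is ln 2, roughly 0.693, whatever the labels are. The reported loss of about 0.690 is therefore only slightly below that of a constant guess, which is consistent with an accuracy of about 56%.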

## Reuters Newswire Data

This dataset consists of 11,228 newswires from the Reuters news agency. Each wire is encoded as a sequence of word indexes, just as in the IMDB data. Each wire is also assigned to one of 46 topics, which serves as our label. The dataset is available through the Keras API. We will build a multi-layer perceptron (MLP) in Keras and train it to classify news items into these 46 topics.

In [9]:
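import subprocess
import sys

try:
    import h5py  # needed by model.save() for the .h5 format
except ImportError:
    # pip.main() was removed in pip 10; call pip through the current interpreter instead
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'h5py'])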

import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.utils import to_categorical

seed = 1337
np.random.seed(seed)

In [10]:
from keras.datasets import reuters

max_words = 1000
(x_train, y_train), (x_test, y_test) = reuters.load_data(num_words=max_words,
                                                         test_split=0.2,
                                                         seed=seed)
num_classes = np.max(y_train) + 1  # 46 topics

Downloading data from https://s3.amazonaws.com/text-datasets/reuters.npz


Note that the *num_words* keyword restricts the vocabulary to the 1000 most frequent words; it does not limit the length of a news item. Less frequent words are replaced by an out-of-vocabulary index. Also, 20% of the data is held out as test data, and we fix the random seed for reproducibility.
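
To make the effect of *num_words* concrete, the following sketch mimics the vocabulary capping on a toy sequence. It assumes the loader's default out-of-vocabulary index of 2; the helper `cap_vocab` and the toy wire are hypothetical and only illustrate the idea.

```python
def cap_vocab(seq, num_words, oov_char=2):
    """Replace every word index at or above num_words with the out-of-vocabulary index."""
    return [w if w < num_words else oov_char for w in seq]

toy_wire = [1, 4, 57, 1200, 8, 31000]  # hypothetical word-index sequence
capped = cap_vocab(toy_wire, num_words=1000)
print(capped)  # [1, 4, 57, 2, 8, 2]
```

The sequence keeps its length; only the rare words are collapsed into a single placeholder index.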

Our training features are still just sequences of word indexes, and we need to preprocess them further before we can feed them into a *Dense* layer. For this we use the *Tokenizer* from Keras' text preprocessing module. It maps each index sequence to a vector of length *max_words=1000*, in which each position corresponds to one word of the capped vocabulary. The vector has a 1 at position i if the word with index i appears in the newswire, and a 0 otherwise. A word that appears several times still yields a single 1, i.e. the encoding is binary. We use this tokenizer to transform both train and test features:

In [11]:
from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=max_words)
x_train = tokenizer.sequences_to_matrix(x_train, mode='binary')
x_test = tokenizer.sequences_to_matrix(x_test, mode='binary')
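
The binary encoding described above can be sketched in plain NumPy. The helper `multi_hot` and the toy wires are made up for illustration; this shows the idea behind `sequences_to_matrix(..., mode='binary')` and is not the Keras code itself.

```python
import numpy as np

def multi_hot(seqs, num_words):
    """One row per sequence, with a 1 in column j if word index j occurs at least once."""
    matrix = np.zeros((len(seqs), num_words))
    for i, seq in enumerate(seqs):
        for j in seq:
            if j < num_words:       # ignore indexes outside the vocabulary
                matrix[i, j] = 1.0  # repeated words still give a single 1
    return matrix

toy_wires = [[1, 3, 3, 5], [2, 7]]  # hypothetical word-index sequences
encoded_wires = multi_hot(toy_wires, num_words=8)
print(encoded_wires)
```

Word order and word counts are discarded, so every newswire becomes a fixed-length vector that a *Dense* layer can consume.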

In [12]:
y_train = to_categorical(y_train, num_classes)
y_test = to_categorical(y_test, num_classes)
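
The labels are converted in a similar way: each topic id becomes a vector with a single 1, which is the target format that a softmax output trained with categorical cross-entropy expects. A minimal NumPy sketch of that one-hot encoding, using a hypothetical helper `one_hot` and made-up labels:

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer class ids into rows with a single 1 at the class position."""
    return np.eye(num_classes, dtype="float32")[np.asarray(labels)]

toy_labels = [3, 0, 2]  # made-up topic ids
encoded_labels = one_hot(toy_labels, num_classes=4)
print(encoded_labels)
```

Taking `argmax` along each row recovers the original topic ids.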

In [13]:
model = Sequential()  
model.add(Dense(512, activation='relu', input_shape = (max_words,))) 
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax')) 

model.compile(loss = 'categorical_crossentropy', optimizer= 'adam', metrics=['accuracy'])

In [14]:
# Keras can consume a lot of resources; this workaround limits TensorFlow to a single thread on the (IBM) cloud
from keras import backend as K

K.set_session(K.tf.Session(config=K.tf.ConfigProto(intra_op_parallelism_threads=1, inter_op_parallelism_threads=1)))

In [15]:
batch_size = 32
model.fit(x_train, y_train, batch_size = batch_size, epochs=5)
score = model.evaluate(x_test, y_test)
model.save("model.h5")

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [None]:
print("accuracy of the model is {:.2f}%".format(score[1] * 100))