Tutorial 2: Sentiment analysis with a recurrent LSTM network
============================================================

Train a recurrent neural network to read movie reviews from IMDB and decide whether each review is positive or negative.

An example of a review:

`"Okay, sorry, but I loved this movie. I just love the whole 80's genre of these kind of movies, because you don't see many like this one anymore! I want to ask all of you people who say this movie is just a rip-off, or a cheesy imitation, what is it imitating? I've never seen another movie like this one, well, not horror anyway.
Basically its about the popular group in school, who like to make everyones lives living hell, so they decided to pick on this nerdy boy named Marty. It turns fatal when he really gets hurt from one of their little pranks.
So, its like 10 years later, and the group of friends who hurt Marty start getting High School reunion letters. But...they are the only ones receiving them! So they return back to the old school, and one by one get knocked off by.......Yeah you probably know what happens!
The only part that disappointed me was the very end. It could have been left off, or thought out better.
I think you should give it a try, and try not to be to critical!
~*~CupidGrl~*~"`

We have 25000 reviews like this.

1. Data preparation
-------------------
We have a script that reads the reviews from a text file and stores them as a one-hot encoded dataset.

In [1]:
# read reviews from text file and store as one-hot encoded dataset
import prepare
fname='labeledTrainData.tsv'
prepare.main(fname)

vocabulary size -  81321
# of samples -  25000
# of classes 2
class distribution -  [0 1] [12500 12500]
sentence length -  1108 [  10   11   12 ..., 1919 1977 2633] [1 1 1 ..., 1 1 1]
# of train - 19881, # of valid - 5119


2. Building the model
---------------------
As in the convnet example, we need a dataset, layers, callbacks, and a backend.

In [2]:
# hyperparameters
hidden_size = 128
embedding_dim = 128
vocab_size = 20000
sentence_length = 128
batch_size = 128
num_epochs = 2

# setup backend
from neon.backends import gen_backend
be = gen_backend(backend='cpu',
                 batch_size=batch_size,
                 rng_seed=0)

In [3]:
# load the h5 datasets, print stats
import h5py
h5f = h5py.File(fname + '.h5', 'r')
reviews, h5train, h5valid = h5f['reviews'], h5f['train'], h5f['valid']
ntrain, nvalid, nclass = reviews.attrs['ntrain'], reviews.attrs['nvalid'], reviews.attrs['nclass']
print "# of train examples - {0}, valid examples - {1}".format(ntrain, nvalid)
print "# of classes - ", nclass
print "class distribution - ", reviews.attrs['class_distribution']
print "vocab size - {0}, sentence_length - {1}".format(vocab_size, sentence_length)


# of train examples - 19881, valid examples - 5119
# of classes -  2
class distribution -  [12500 12500]
vocab size - 20000, sentence_length - 128


### Create datasets
Extract a training set and a validation set from the raw dataset, and pad or truncate each review to 128 words. Finally, wrap them in a DataIterator.


In [4]:
# make train dataset
from preprocess_text import get_paddedXY
from neon.data import DataIterator
Xy = h5train[:ntrain]
X = [xy[1:] for xy in Xy]
y = [xy[0] for xy in Xy]
X_train, y_train = get_paddedXY(
    X, y, vocab_size=vocab_size, sentence_length=sentence_length)
train_set = DataIterator(X_train, y_train, nclass=nclass)

# make valid dataset
Xy = h5valid[:nvalid]
X = [xy[1:] for xy in Xy]
y = [xy[0] for xy in Xy]
X_valid, y_valid = get_paddedXY(
    X, y, vocab_size=vocab_size, sentence_length=sentence_length)
valid_set = DataIterator(X_valid, y_valid, nclass=nclass)
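
The internals of `get_paddedXY` are not shown in this tutorial. As a rough sketch of what padding and truncating to a fixed length can look like, the NumPy function below mirrors the scheme used in the inference loop later in this tutorial: zeros on the left, and only the last `sentence_length` tokens of a long review are kept. The function name `pad_or_truncate` is our own and is not part of neon or `preprocess_text`.

```python
import numpy as np

def pad_or_truncate(sequences, sentence_length=128, pad_value=0):
    """Left-pad short sequences with pad_value and keep only the
    last sentence_length tokens of long ones (hypothetical helper)."""
    out = np.full((len(sequences), sentence_length), pad_value, dtype=np.int32)
    for row, seq in enumerate(sequences):
        trunc = list(seq)[-sentence_length:]
        if trunc:  # an empty review stays all padding
            out[row, -len(trunc):] = trunc
    return out

# toy example with a sentence length of 5 instead of 128
padded = pad_or_truncate([[1, 7, 9], [1, 2, 3, 4, 5, 6, 7]], sentence_length=5)
print(padded)
```

The first toy review is shorter than five tokens and gets two leading zeros; the second is longer and loses its first two tokens.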


### Initializers

We use Glorot initialization, which automatically scales the weights so that the variance of the input activations is preserved on the output side.

In [5]:
# initialization
from neon.initializers import GlorotUniform, Uniform
init_glorot = GlorotUniform()
init_emb = Uniform(-0.1 / embedding_dim, 0.1 / embedding_dim)
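
To make the scaling concrete, here is a minimal NumPy sketch of the commonly cited Glorot uniform rule, which draws weights from a uniform distribution on [-limit, limit] with limit = sqrt(6 / (fan_in + fan_out)). This is the textbook formula; we have not checked it against neon's `GlorotUniform` source, so treat the function `glorot_uniform` below as an illustration rather than neon's implementation.

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng):
    """Textbook Glorot/Xavier uniform initialization (illustrative only)."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

rng = np.random.RandomState(0)
W = glorot_uniform(128, 128, rng)      # e.g. hidden_size x hidden_size
limit = np.sqrt(6.0 / (128 + 128))

# A uniform distribution on [-a, a] has variance a**2 / 3,
# which here equals 2 / (fan_in + fan_out).
target_variance = 2.0 / (128 + 128)
print("limit {0:.4f}, sample variance {1:.5f}, target {2:.5f}".format(
    limit, W.var(), target_variance))
```

The embedding initializer in the cell above is a plain uniform distribution instead, with bounds of plus or minus 0.1 / 128.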

### Model layers
* The network consists of a word embedding layer, an LSTM, a RecurrentSum, and an Affine layer.
* LookupTable is a word embedding that maps from a sparse one-hot representation to dense word vectors. The embedding is learned from the data.
* LSTM is a recurrent layer with "long short-term memory" units. LSTM networks tend to be easier to train, but generally perform similarly to standard RNN layers.
* RecurrentSum is a recurrent output layer that collapses the time dimension of the sequence by summing the outputs of the individual steps.
* Dropout performs regularization by randomly zeroing out some of the units.
* Affine is a fully connected MLP layer that is used for the binary classification of the outputs.

In [6]:
# define layers
from neon.layers import LookupTable, LSTM, RecurrentSum, Dropout, Affine
from neon.transforms import Softmax, Tanh, Logistic
layers = [
    LookupTable(vocab_size=vocab_size, embedding_dim=embedding_dim, init=init_emb),
    LSTM(hidden_size, init_glorot, activation=Tanh(),
         gate_activation=Logistic(), reset_cells=True),
    RecurrentSum(),
    Dropout(keep=0.5),
    Affine(nclass, init_glorot, bias=init_glorot, activation=Softmax())
]
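
To illustrate how data flows through this stack, the sketch below runs one toy review through a hand-written NumPy forward pass: embedding lookup, a standard LSTM recurrence, a sum over time steps, and an affine layer with softmax. It is not neon code. The sizes are shrunk for readability, the weights are random, Dropout is left out (as it would be at inference time), and the gate ordering inside `z` is an arbitrary choice of ours.

```python
import numpy as np

rng = np.random.RandomState(0)
# toy sizes, much smaller than the tutorial's 20000 / 128 / 128
vocab_size, embedding_dim, hidden_size, nclass = 50, 8, 6, 2

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# LookupTable: one dense row per word index
embedding = rng.uniform(-0.1, 0.1, size=(vocab_size, embedding_dim))
# LSTM: stacked weights for input gate, forget gate, output gate, candidate
W_x = rng.uniform(-0.1, 0.1, size=(4 * hidden_size, embedding_dim))
W_h = rng.uniform(-0.1, 0.1, size=(4 * hidden_size, hidden_size))
b = np.zeros(4 * hidden_size)
# Affine: hidden_size -> nclass
W_out = rng.uniform(-0.1, 0.1, size=(nclass, hidden_size))
b_out = np.zeros(nclass)

word_ids = np.array([0, 0, 1, 7, 9])   # one left-padded toy review
vectors = embedding[word_ids]          # shape (time steps, embedding_dim)

# Row lookup gives the same result as multiplying one-hot vectors
# by the embedding matrix, without building the sparse vectors.
one_hot = np.eye(vocab_size)[word_ids]
assert np.allclose(one_hot.dot(embedding), vectors)

h = np.zeros(hidden_size)              # hidden state
c = np.zeros(hidden_size)              # cell state
summed = np.zeros(hidden_size)
for x_t in vectors:
    z = W_x.dot(x_t) + W_h.dot(h) + b
    i, f, o, g = np.split(z, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    summed += h                        # RecurrentSum: add up every step

probs = softmax(W_out.dot(summed) + b_out)
print("class probabilities: {0}".format(probs))
```

Because the weights are random, the two probabilities carry no meaning here; the point is the shapes and the order of operations.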

### Cost and Optimizer

In [7]:
# set the cost, metrics, optimizer
from neon.layers import GeneralizedCost
from neon.transforms import CrossEntropyMulti
from neon.transforms import Accuracy
from neon.models import Model
from neon.optimizers import Adagrad
cost = GeneralizedCost(costfunc=CrossEntropyMulti(usebits=True))
metric = Accuracy()
model = Model(layers=layers)
optimizer = Adagrad(learning_rate=0.01)
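
As a hedged illustration of these two pieces, the NumPy sketch below computes a multi-class cross-entropy in bits (log base 2, which is our reading of `usebits=True`) and performs a textbook Adagrad update, where each parameter's step is divided by the root of its accumulated squared gradients. The function names and the `epsilon` value are our own choices, and the exact placement of `epsilon` differs between implementations, so this is not necessarily what neon computes internally.

```python
import numpy as np

def cross_entropy_bits(probs, label):
    """Cross-entropy of one example in bits: -log2 of the probability
    assigned to the true class."""
    return -np.log2(probs[label])

def adagrad_step(param, grad, state, learning_rate=0.01, epsilon=1e-6):
    """Textbook Adagrad update, modifying param and state in place."""
    state += grad ** 2
    param -= learning_rate * grad / np.sqrt(state + epsilon)
    return param, state

# a coin-flip prediction costs exactly one bit
one_bit = cross_entropy_bits(np.array([0.5, 0.5]), 1)

param = np.array([1.0, -2.0])
state = np.zeros_like(param)
grad = np.array([0.5, 0.5])

before = param.copy()
adagrad_step(param, grad, state)
first_step = np.abs(param - before)

before = param.copy()
adagrad_step(param, grad, state)
second_step = np.abs(param - before)   # smaller: the accumulator has grown
print("first step {0}, second step {1}".format(first_step, second_step))
```

With a constant gradient the second step is smaller than the first, which shows how Adagrad lowers the effective learning rate for parameters that keep receiving large gradients.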

### Callbacks
In addition to the default progress bar, we set up a callback that saves the model to a pickle file after every epoch.

In [8]:
# configure callbacks
from neon.callbacks import Callbacks
callbacks = Callbacks(model, train_set, eval_set=valid_set, 
                      epochs=num_epochs, serialize=1,
                      save_path=fname + '.pickle')

### Training the model
We now have all the parts in place to train the model. Two epochs are sufficient to obtain some interesting results. 

In [9]:
# train model
model.fit(train_set, optimizer=optimizer, num_epochs=num_epochs,
          cost=cost, callbacks=callbacks)

# eval model
print "\nTest Accuracy -", 100 * model.eval(valid_set, metric=metric)
print "Train Accuracy -", 100 * model.eval(train_set, metric=metric)

Epoch 0   [Train |████████████████████|  156/156  batches, 0.41 cost, 108.16s]
Epoch 1   [Train |████████████████████|  155/155  batches, 0.21 cost, 106.63s]

Test Accuracy - [ 86.38405609]
Train Accuracy - [ 95.5736618]
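
Note that the figure printed as "Test Accuracy" is computed on `valid_set`. For reference, accuracy in the usual sense is the percentage of examples whose highest-probability class matches the label. The NumPy sketch below shows that calculation on made-up numbers; it is not neon's `Accuracy` implementation, and the helper name `accuracy_percent` is ours.

```python
import numpy as np

def accuracy_percent(probs, labels):
    """Percentage of rows whose argmax class equals the true label."""
    predicted = np.argmax(probs, axis=1)
    return 100.0 * np.mean(predicted == np.asarray(labels))

# made-up softmax outputs for four reviews and their true labels
probs = np.array([[0.9, 0.1],
                  [0.3, 0.7],
                  [0.6, 0.4],
                  [0.2, 0.8]])
labels = [0, 1, 1, 1]
acc = accuracy_percent(probs, labels)  # the third review is misclassified
print("accuracy: {0}".format(acc))
```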


3. TODO: Inference
------------------
The trained model can now be used to perform inference on reviews entered by the user.

In [10]:
# hyperparameters from the reference
batch_size = 1
clip_gradients = True
gradient_limit = 15
vocab_size = 20000
sentence_length = 128
embedding_dim = 128
hidden_size = 128
reset_cells = True
fname = 'labeledTrainData.tsv'
save_path = fname + '.pickle'
#num_epochs = args.epochs

In [11]:
# setup backend
from neon.backends import gen_backend
be = gen_backend(#backend=args.backend,
                 batch_size=batch_size,
                 #rng_seed=args.rng_seed,
                 #device_id=args.device_id,
                 #default_dtype=args.datatype
)



In [12]:
from neon.initializers import GlorotUniform, Uniform
init_glorot = GlorotUniform()
init_emb = Uniform(-0.1 / embedding_dim, 0.1 / embedding_dim)
nclass = 2

In [13]:
# define same model as in train
from neon.layers import LookupTable, LSTM, RecurrentSum, Dropout, Affine
from neon.transforms import Tanh, Softmax, Logistic
layers = [

    LookupTable(vocab_size=vocab_size, embedding_dim=embedding_dim, init=init_emb),
    LSTM(hidden_size, init_glorot, activation=Tanh(),
         gate_activation=Logistic(), reset_cells=True),
    RecurrentSum(),
    Dropout(keep=0.5),
    Affine(nclass, init_glorot, bias=init_glorot, activation=Softmax())
]



In [14]:
# load the weights
from neon.models import Model
print "Initialized the models - "
model_new = Model(layers=layers)
print "Loading the weights from {0}".format(save_path)

model_new.load_weights(save_path)
model_new.initialize(dataset=(sentence_length, batch_size))



Initialized the models - 
Loading the weights from labeledTrainData.tsv.pickle


In [15]:
# setup buffers before accepting reviews
import cPickle
import numpy as np
xdev = be.zeros((sentence_length, 1), dtype=np.int32)  # bsz is 1, feature size
xbuf = np.zeros((1, sentence_length), dtype=np.int32)
oov = 2
start = 1
index_from = 3
pad_char = 0
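# load the word-to-index and index-to-word mappings, closing the file afterwards
with open(fname + '.vocab', 'rb') as vocab_file:
    vocab, rev_vocab = cPickle.load(vocab_file)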



In [None]:
import preprocess_text
while True:
    line = raw_input('Enter a Review from testData.tsv file \n')

    # clean the input
    tokens = preprocess_text.clean_string(line).strip().split()

    # check for oov and add start
    sent = [len(vocab) + 1 if t not in vocab else vocab[t] for t in tokens]
    sent = [start] + [w + index_from for w in sent]
    sent = [oov if w >= vocab_size else w for w in sent]

    # pad sentences
    xbuf[:] = 0
    trunc = sent[-sentence_length:]
    xbuf[0, -len(trunc):] = trunc
    xdev[:] = xbuf.T.copy()
    y_pred = model_new.fprop(xdev, inference=True)  # inference=True disables dropout

    print "Sent - {0}".format(xbuf)
    print "Pred - {0} ".format(y_pred.get().T)
    print '-' * 100

Enter a Review from testData.tsv file 
preprocess_text
Sent - [[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    1
     2 3004]]
Pred - [[ 0.30626768  0.69373232]] 
----------------------------------------------------------------------------------------------------
Enter a Review from testData.tsv file 
The only part that disappointed me was the very end. It could have been left off, or thought out bet
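
The token-to-index steps in the loop above can be isolated into a small function. The sketch below uses the same constants (`start = 1`, `oov = 2`, `index_from = 3`) but a tiny hypothetical vocabulary in place of the pickled one. One simplification is ours: unknown tokens are mapped to the out-of-vocabulary index directly, whereas the loop first assigns them `len(vocab) + 1` and relies on that value exceeding `vocab_size`, which holds as long as the full vocabulary is larger than 20000 words.

```python
START, OOV, INDEX_FROM = 1, 2, 3

def encode_tokens(tokens, vocab, vocab_size=20000):
    """Map tokens to integer ids: a start marker first, then each word's
    shifted index, with unknown or too-rare words replaced by OOV."""
    ids = [START]
    for token in tokens:
        if token in vocab:
            idx = vocab[token] + INDEX_FROM
            ids.append(idx if idx < vocab_size else OOV)
        else:
            ids.append(OOV)
    return ids

# hypothetical vocabulary entries, for illustration only
toy_vocab = {'text': 3001, 'rareword': 50000}
encoded = encode_tokens(['preprocess', 'text', 'rareword'], toy_vocab)
print(encoded)
```

Here `preprocess` is unknown and `rareword` falls outside the 20000 most frequent words, so both become index 2, while `text` keeps its shifted index.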
