# Predicting sentiment on the IMDB dataset

In [1]:
!date

Fr 27. Apr 17:29:14 CEST 2018


In this notebook, we show you how to train an RNN to classify movie reviews by sentiment. We mostly start from scratch, so you should be able to plug in your own dataset without too much hassle. We also explain some best practices with skorch and show how to perform a randomized hyper-parameter search.

## Import

In [2]:
import os
import tarfile

In [3]:
from dstoolbox.transformers import Padder2d
from dstoolbox.transformers import TextFeaturizer
import numpy as np
from scipy import stats
from sklearn.datasets import load_files
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV
from skorch import NeuralNetClassifier
import torch
from torch import nn
F = nn.functional

In [4]:
np.random.seed(0)

## Constants

In [5]:
VOCAB_SIZE = 1000  # This is on the low end
MAX_LEN = 50  # Texts are pretty long on average, this is on the low end
USE_CUDA = True  # Set this to False if you don't want to use CUDA
NUM_CV_STEPS = 10  # Number of randomized search steps to perform

## Preparation

First we need to install an additional package, dstoolbox, for this example to run:

    $ pip install dstoolbox

Also, download the IMDB dataset:

    $ wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz

## Load data

### Untar and unzip data

In [6]:
if not os.path.exists('aclImdb'):
    # unzip data if it does not exist
    with tarfile.open('aclImdb_v1.tar.gz', 'r:gz') as f:
        f.extractall()

### Description of the dataset

In [7]:
!head -n 8 aclImdb/README

Large Movie Review Dataset v1.0

Overview

This dataset contains movie reviews along with their associated binary
sentiment polarity labels. It is intended to serve as a benchmark for
sentiment classification. This document outlines how the dataset was
gathered, and how to use the files provided. 


In [8]:
!tail -n 22 aclImdb/README

@InProceedings{maas-EtAl:2011:ACL-HLT2011,
  author    = {Maas, Andrew L.  and  Daly, Raymond E.  and  Pham, Peter T.  and  Huang, Dan  and  Ng, Andrew Y.  and  Potts, Christopher},
  title     = {Learning Word Vectors for Sentiment Analysis},
  booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
  month     = {June},
  year      = {2011},
  address   = {Portland, Oregon, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {142--150},
  url       = {http://www.aclweb.org/anthology/P11-1015}
}

References

Potts, Christopher. 2011. On the negativity of negation. In Nan Li and
David Lutz, eds., Proceedings of Semantics and Linguistic Theory 20,
636-659.

Contact

For questions/comments/corrections please contact Andrew Maas
amaas@cs.stanford.edu


### Read in data

In [9]:
dataset = load_files('aclImdb/train/', categories=['pos', 'neg'])

In [10]:
dataset.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

### Only minimal data transformation

We mostly leave the data as is. For better results, we should, for instance, remove the markup, but we leave this out for brevity.

In [11]:
X, y = dataset['data'], dataset['target']
X = np.asarray([x.decode() for x in X])  # decode from bytes

### A peek at the data

In [12]:
for text, target in zip(X[:3], y):
    print("Target: {}".format(dataset['target_names'][target]))
    print(text)
    print()

Target: pos
Zero Day leads you to think, even re-think why two boys/young men would do what they did - commit mutual suicide via slaughtering their classmates. It captures what must be beyond a bizarre mode of being for two humans who have decided to withdraw from common civility in order to define their own/mutual world via coupled destruction.<br /><br />It is not a perfect movie but given what money/time the filmmaker and actors had - it is a remarkable product. In terms of explaining the motives and actions of the two young suicide/murderers it is better than 'Elephant' - in terms of being a film that gets under our 'rationalistic' skin it is a far, far better film than almost anything you are likely to see. <br /><br />Flawed but honest with a terrible honesty.

Target: neg
Words can't describe how bad this movie is. I can't explain it by writing only. You have too see it for yourself to get at grip of how horrible a movie really can be. Not that I recommend you to do that. There 

## Transformation steps

There are many possible ways to transform our data so that we can pass it to our neural net. What we effectively need is to transform a list of strings to an array of indices, where each row corresponds to one sample, and in each row, each int represents a specific word.

Below we show one way to achieve this using `TextFeaturizer` from dstoolbox (more information in this [notebook](https://nbviewer.jupyter.org/github/ottogroup/dstoolbox/blob/master/notebooks/Examples_transformers.ipynb#TextFeaturizer)). This transformer relies heavily on sklearn's `CountVectorizer`. By using sklearn under the hood instead of rolling our own transformation code, we gain the following benefits:

* battle-tested, (mostly) bug free code
* since it is an sklearn transformer, we can put it into a `Pipeline`
* many parameters for us to test in a hyper-parameter search

For more on the last point, see the section about randomized search.

Finally, we have to solve a small problem, namely that the texts contain different numbers of words. This results in a heterogeneous array, but we need a homogeneous one. With the help of dstoolbox's `Padder2d`, we get this functionality in an sklearn transformer. (Note: We set `pad_value=VOCAB_SIZE` to give the padded value a unique index, since the other indices will range from 0 to VOCAB_SIZE-1.)
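To make the two steps concrete, here is a minimal pure-Python sketch of the idea: build a vocabulary, map words to indices, then truncate or pad every row to the same length. The helpers `build_vocab`, `to_indices`, and `pad` are hypothetical names used only for illustration. This is not how `TextFeaturizer` or `Padder2d` are implemented, and details such as whitespace tokenization, dropping unknown words, and padding on the right are assumptions of the sketch.

```python
from collections import Counter


def build_vocab(texts, max_features):
    """Map the max_features most frequent lowercase tokens to indices 0..n-1."""
    counts = Counter(token for text in texts for token in text.lower().split())
    most_common = [token for token, _ in counts.most_common(max_features)]
    return {token: idx for idx, token in enumerate(sorted(most_common))}


def to_indices(text, vocab):
    """Replace each known token by its index; unknown tokens are dropped."""
    return [vocab[token] for token in text.lower().split() if token in vocab]


def pad(rows, max_len, pad_value):
    """Truncate or right-pad every row to exactly max_len entries."""
    return [row[:max_len] + [pad_value] * (max_len - len(row)) for row in rows]


texts = ["a good good movie", "a bad bad movie indeed a movie"]
vocab = build_vocab(texts, max_features=4)
rows = [to_indices(text, vocab) for text in texts]
# The padding index equals the vocabulary size, so it cannot clash with a word.
padded = pad(rows, max_len=5, pad_value=len(vocab))
print(padded)
```

The first row is shorter than `max_len` and gets the padding index appended, while the second row is cut off after five entries, so both rows end up with the same length.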

Putting all of this together, these are the transformation steps in the pipeline before the actual neural net:

In [13]:
steps = [
    ('to_idx', TextFeaturizer(max_features=VOCAB_SIZE)),
    ('pad', Padder2d(max_len=MAX_LEN, pad_value=VOCAB_SIZE, dtype=int)),
]

Here is what the output looks like so far:

In [14]:
Pipeline(steps).fit_transform(X[:3])

array([[220,  48, 104, 217, 190, 186,  63, 156, 186, 207, 193,  29, 218,
        117, 215,  57, 205, 184,  54,  43, 129, 173, 199, 169, 181,  39,
        102,  35, 205, 128,  19,  26,  27, 120, 133,  23,  76, 193,  95,
        206,  87,  49, 190, 210,  77,  44,  38,  98, 140, 190],
       [213,  33,  52,  94,  18, 187, 124, 101,  33,  67, 102,  32, 216,
        137, 217,  87, 191, 163, 102,  76, 219, 190,  78,  17,  83, 133,
         94,  93, 124, 158,  33,  19, 132, 179, 159, 217, 190,  57, 179,
        183,  14, 170, 115,  40, 119,  12,   8, 142, 130, 185],
       [ 65, 151, 181, 148, 153, 203,  98, 187, 108, 131, 124,  24,  79,
        180,  36, 190, 109, 148, 133,  90, 105,  56,  31,  62, 195, 157,
        179, 205,  88,  85, 201,  81, 190,  19, 103,  16,  82, 139, 116,
         63,  25, 180, 124, 166, 196, 179, 202, 143, 190, 174]])

As desired, we have a homogeneous array of indices, exactly what we need.

## The RNN

We define a rather simple RNN with just embeddings, a recurrent layer, and an output layer. So that we can later include all hyper-parameters in the search, we make sure to pass them to the `__init__` of our pytorch module.

In [15]:
class RNNClassifier(nn.Module):
    def __init__(
            self,
            embedding_dim=128,
            rec_layer_type='lstm',
            num_units=128,
            num_layers=2,
            dropout=0,
    ):
        super().__init__()
        self.embedding_dim = embedding_dim
        self.rec_layer_type = rec_layer_type.lower()
        self.num_units = num_units
        self.num_layers = num_layers
        self.dropout = dropout

        self.emb = nn.Embedding(VOCAB_SIZE + 1, embedding_dim=self.embedding_dim)
        
        rec_layer = {'lstm': nn.LSTM, 'gru': nn.GRU}[self.rec_layer_type]
        # We have to make sure that the recurrent layer is batch_first,
        # since sklearn assumes the batch dimension to be the first
        self.rec = rec_layer(
            self.embedding_dim, self.num_units, num_layers=num_layers, batch_first=True)

        self.output = nn.Linear(self.num_units, 2)

    def forward(self, X):
        embeddings = self.emb(X)
        # from the recurrent layer, only take the activities from the last sequence step
        if self.rec_layer_type == 'gru':
            _, rec_out = self.rec(embeddings)
        else:
            _, (rec_out, _) = self.rec(embeddings)
        rec_out = rec_out[-1]  # take output of last RNN layer
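        # F.dropout defaults to training=True, so pass the module's mode explicitly;
        # otherwise dropout would also be applied at prediction time
        drop = F.dropout(rec_out, p=self.dropout, training=self.training)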
        # Remember that the final non-linearity should be softmax, so that our predict_proba
        # method outputs actual probabilities!
        out = F.softmax(self.output(drop), dim=-1)
        return out
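The comment about the final softmax can be illustrated with a few lines of plain Python. This is only a sketch of what the non-linearity does to the two raw output scores; the function name `softmax` and the example scores are made up for illustration and are not taken from the model above.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Two raw scores, one per class (e.g. neg and pos).
probs = softmax([2.0, -1.0])
print(probs, sum(probs))
```

The outputs are non-negative and sum to one, which is what lets `predict_proba` return values that can be read as class probabilities.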

We wrap the pytorch module into a skorch `NeuralNetClassifier`, since we are dealing with a binary classification task, and append the step to our transformation steps.

In [16]:
steps.append(
    ('net', NeuralNetClassifier(
        RNNClassifier,
        device=('cuda' if USE_CUDA else 'cpu'),
        max_epochs=5,
        lr=0.01,
        optimizer=torch.optim.RMSprop,
    ))
)

Now we are good to go:

In [17]:
pipe = Pipeline(steps)

In [24]:
%time pipe.fit(X, y)

CPU times: user 28.8 s, sys: 5.79 s, total: 34.6 s
Wall time: 34.6 s


Pipeline(memory=None,
     steps=[('to_idx', TextFeaturizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=1000, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        stri...ayers=2, batch_first=True)
    (output): Linear(in_features=128, out_features=2, bias=True)
  ),
))])

## Randomized search

The results above were already okay, but we have many hyper-parameters, so of course we would like to know which ones are best. Therefore, we perform a randomized search. For those not aware, a randomized search is like a grid search, but instead of testing the parameters systematically, they are drawn randomly from a distribution. In practice, for a fixed budget of fits, randomized search often finds better parameter values than grid search, especially when only a few of the hyper-parameters matter much.
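The difference between the two strategies can be sketched in plain Python. The parameter names mirror some of the ones used in this notebook, but the value ranges, the helper `sample_candidate`, and the budget of 10 are assumptions chosen for illustration; sklearn's actual sampling code is not shown here.

```python
import itertools
import random

grid = {
    'num_layers': [1, 2, 3],
    'rec_layer_type': ['gru', 'lstm'],
    'dropout': [0.0, 0.3, 0.6, 0.9],
}

# Grid search: every combination is evaluated exactly once.
grid_candidates = [
    dict(zip(grid, values)) for values in itertools.product(*grid.values())
]

# Randomized search: a fixed budget of candidates, each drawn independently.
rng = random.Random(0)


def sample_candidate():
    return {
        'num_layers': rng.choice([1, 2, 3]),
        'rec_layer_type': rng.choice(['gru', 'lstm']),
        'dropout': rng.uniform(0.0, 0.9),  # continuous, not limited to a grid
    }


random_candidates = [sample_candidate() for _ in range(10)]

print(len(grid_candidates), len(random_candidates))  # prints: 24 10
```

The grid grows multiplicatively with every added parameter, whereas the randomized budget stays fixed and can still cover continuous ranges such as the dropout rate.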

For the randomized search, we turn off the net's verbosity to not clutter the notebook. Also, we set `train_split=None`. This is because we don't need an internal train/valid split, given that sklearn's `RandomizedSearchCV` already takes care of cross-validation.

In [25]:
pipe.set_params(net__verbose=0, net__train_split=None)

Pipeline(memory=None,
     steps=[('to_idx', TextFeaturizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=1000, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        stri...ayers=2, batch_first=True)
    (output): Linear(in_features=128, out_features=2, bias=True)
  ),
))])

Now we would like to set the hyper-parameter range to test. With randomized search, we can either specify a list (mostly for discrete variables) or a `scipy.stats` distribution, from which sklearn will sample automatically.

As we can see below, we can extend the randomized search to not only cover the parameters we defined for our RNN, but also to cover the way we construct our vocabulary using `TextFeaturizer`. This is why we said earlier that we should use it instead of implementing it ourselves. As shown below, we test:

* stop_words: whether to remove English stop words or not
* lowercase: whether to convert all words to lower case or not
* ngram_range: whether to use word uni-grams or bi-grams

We could also easily switch from words to characters by setting `analyzer='char'`, but then we would probably need longer sequences.

Additionally, we test some hyper-parameters on the RNN module itself (e.g. LSTM vs GRU) and on the skorch `NeuralNetClassifier` (e.g. `max_epochs`).

In [26]:
params = {
    'to_idx__stop_words': ['english', None],
    'to_idx__lowercase': [False, True],
    'to_idx__ngram_range': [(1, 1), (2, 2)],
    'net__module__embedding_dim': stats.randint(32, 256 + 1),
    'net__module__rec_layer_type': ['gru', 'lstm'],
    'net__module__num_units': stats.randint(32, 256 + 1),
    'net__module__num_layers': [1, 2, 3],
    'net__module__dropout': stats.uniform(0, 0.9),
    'net__lr': [10**(-stats.uniform(1, 5).rvs()) for _ in range(NUM_CV_STEPS)],
    'net__max_epochs': [5, 10],
}
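The `net__lr` entry deserves a short explanation. `stats.uniform(loc, scale)` covers the interval from `loc` to `loc + scale`, so `stats.uniform(1, 5)` draws exponents between 1 and 6, and the learning rates end up between 1e-6 and 1e-1, spread evenly on a log scale. The sketch below reproduces this idea with the standard library's `random` module instead of scipy, which is a substitution made purely to keep the example dependency-free.

```python
import random

rng = random.Random(0)

# Draw exponents uniformly from [1, 6], then turn them into learning rates.
exponents = [rng.uniform(1, 6) for _ in range(1000)]
learning_rates = [10 ** -e for e in exponents]

# Count how many samples fall into each decade, e.g. [1e-2, 1e-1).
decades = {k: 0 for k in range(1, 6)}
for e in exponents:
    decades[min(int(e), 5)] += 1

print(min(learning_rates), max(learning_rates))
print(decades)
```

Each decade receives roughly the same number of samples, which is usually what we want for a learning rate. Sampling the rate itself uniformly would place almost all candidates in the largest decade.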

We define our randomized search and start fitting.

For demonstration purposes, we perform only a small number of iterations (10) and fit on only the first 5000 samples. Of course, with more time, we should use more iterations and include all samples.

In [27]:
search = RandomizedSearchCV(
    pipe, params, n_iter=NUM_CV_STEPS, verbose=2, refit=False, scoring='accuracy', cv=3)

In [28]:
%time search.fit(X[:5000], y[:5000])

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[CV] net__lr=0.000394032773719, net__max_epochs=5, net__module__dropout=0.792428300327, net__module__embedding_dim=254, net__module__num_layers=3, net__module__num_units=206, net__module__rec_layer_type=lstm, to_idx__lowercase=False, to_idx__ngram_range=(2, 2), to_idx__stop_words=english 
[CV]  net__lr=0.000394032773719, net__max_epochs=5, net__module__dropout=0.792428300327, net__module__embedding_dim=254, net__module__num_layers=3, net__module__num_units=206, net__module__rec_layer_type=lstm, to_idx__lowercase=False, to_idx__ngram_range=(2, 2), to_idx__stop_words=english, total=  13.6s
[CV] net__lr=0.000394032773719, net__max_epochs=5, net__module__dropout=0.792428300327, net__module__embedding_dim=254, net__module__num_layers=3, net__module__num_units=206, net__module__rec_layer_type=lstm, to_idx__lowercase=False, to_idx__ngram_range=(2, 2), to_idx__stop_words=english 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   15.0s remaining:    0.0s


[CV]  net__lr=0.000394032773719, net__max_epochs=5, net__module__dropout=0.792428300327, net__module__embedding_dim=254, net__module__num_layers=3, net__module__num_units=206, net__module__rec_layer_type=lstm, to_idx__lowercase=False, to_idx__ngram_range=(2, 2), to_idx__stop_words=english, total=  13.4s
[CV] net__lr=0.000394032773719, net__max_epochs=5, net__module__dropout=0.792428300327, net__module__embedding_dim=254, net__module__num_layers=3, net__module__num_units=206, net__module__rec_layer_type=lstm, to_idx__lowercase=False, to_idx__ngram_range=(2, 2), to_idx__stop_words=english 
[CV]  net__lr=0.000394032773719, net__max_epochs=5, net__module__dropout=0.792428300327, net__module__embedding_dim=254, net__module__num_layers=3, net__module__num_units=206, net__module__rec_layer_type=lstm, to_idx__lowercase=False, to_idx__ngram_range=(2, 2), to_idx__stop_words=english, total=  13.3s
[CV] net__lr=0.000394032773719, net__max_epochs=5, net__module__dropout=0.518351846001, net__module_

[CV]  net__lr=1.38125857601e-06, net__max_epochs=5, net__module__dropout=0.880713835579, net__module__embedding_dim=171, net__module__num_layers=1, net__module__num_units=118, net__module__rec_layer_type=lstm, to_idx__lowercase=True, to_idx__ngram_range=(2, 2), to_idx__stop_words=None, total=   6.3s
[CV] net__lr=1.38125857601e-06, net__max_epochs=5, net__module__dropout=0.880713835579, net__module__embedding_dim=171, net__module__num_layers=1, net__module__num_units=118, net__module__rec_layer_type=lstm, to_idx__lowercase=True, to_idx__ngram_range=(2, 2), to_idx__stop_words=None 
[CV]  net__lr=1.38125857601e-06, net__max_epochs=5, net__module__dropout=0.880713835579, net__module__embedding_dim=171, net__module__num_layers=1, net__module__num_units=118, net__module__rec_layer_type=lstm, to_idx__lowercase=True, to_idx__ngram_range=(2, 2), to_idx__stop_words=None, total=   6.5s
[CV] net__lr=1.38125857601e-06, net__max_epochs=5, net__module__dropout=0.880713835579, net__module__embedding_d

[CV]  net__lr=0.000394032773719, net__max_epochs=10, net__module__dropout=0.516892723965, net__module__embedding_dim=76, net__module__num_layers=1, net__module__num_units=157, net__module__rec_layer_type=gru, to_idx__lowercase=True, to_idx__ngram_range=(1, 1), to_idx__stop_words=None, total=   7.8s
CPU times: user 3min 55s, sys: 53.1 s, total: 4min 48s
Wall time: 4min 48s


[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:  4.8min finished


RandomizedSearchCV(cv=3, error_score='raise',
          estimator=Pipeline(memory=None,
     steps=[('to_idx', TextFeaturizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=1000, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        stri...ayers=2, batch_first=True)
    (output): Linear(in_features=128, out_features=2, bias=True)
  ),
))]),
          fit_params=None, iid=True, n_iter=10, n_jobs=1,
          param_distributions={'to_idx__stop_words': ['english', None], 'to_idx__lowercase': [False, True], 'to_idx__ngram_range': [(1, 1), (2, 2)], 'net__module__embedding_dim': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7f60dbc226d8>, 'net__module__rec_layer_type': ['gru', 'lstm'], 'net__modul... 1.2649296015411273e-06, 0.0015950624092266551, 0.00039403277371927766], 'net__max_epochs': [5, 10]},
          pre_dispatch='2*n_

Below we see the best accuracy we achieved and what the best hyper-parameters were. Of course the scores here are underwhelming, given that we used so few samples and iterations. Using all the data and trying out more iterations should lead to much better outcomes.

In [29]:
search.best_score_, search.best_params_

(0.6784,
 {'net__lr': 0.00039403277371927766,
  'net__max_epochs': 10,
  'net__module__dropout': 0.51689272396462094,
  'net__module__embedding_dim': 76,
  'net__module__num_layers': 1,
  'net__module__num_units': 157,
  'net__module__rec_layer_type': 'gru',
  'to_idx__lowercase': True,
  'to_idx__ngram_range': (1, 1),
  'to_idx__stop_words': None})
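Because the search was run with `refit=False`, it does not keep a fitted best model. One way to obtain a final model is to copy the best parameters onto the pipeline and fit it once more on all the data. The sketch below shows this pattern on a toy problem; the synthetic data, the `StandardScaler` and `LogisticRegression` steps, and all `toy_` names are stand-ins chosen so that the example runs quickly, and they are not part of the notebook above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data and a toy pipeline stand in for the IMDB texts and the RNN pipeline.
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)
toy_pipe = Pipeline([
    ('scale', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])

toy_search = RandomizedSearchCV(
    toy_pipe,
    {'clf__C': [0.01, 0.1, 1.0, 10.0]},
    n_iter=3,
    cv=3,
    scoring='accuracy',
    refit=False,
    random_state=0,
)
toy_search.fit(X_toy, y_toy)

# With refit=False there is no best_estimator_, so we apply the best
# parameters to the pipeline ourselves and fit it once on all the data.
toy_pipe.set_params(**toy_search.best_params_)
toy_pipe.fit(X_toy, y_toy)
train_accuracy = toy_pipe.score(X_toy, y_toy)
print(toy_search.best_params_, train_accuracy)
```

Applied to this notebook, the same two calls would be `pipe.set_params(**search.best_params_)` followed by `pipe.fit(X, y)`. Alternatively, passing `refit=True` to `RandomizedSearchCV` makes the search perform this final fit itself, on the data passed to `search.fit`.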