# Sentiment Analysis with Apache MXNet GLUON

# Pre-requisites

1. Apache MXNet >= v1.3
2. Gluon-NLP >=v0.3
3. Spacy - Natural Language Processing Utility

Apache MXNet [1] and Gluon-NLP [2] are already pre-installed. In the next section, install Spacy [3] and setup resources for English Language Model.

In this tutorial, we will use Spacy for Sentence Tokenizer and Language Model.

**Credits:** This notebook is borrowed from GLUON-NLP Tutorials [4] and enhanced for this workshop


In [None]:
%%bash

# Install Spacy
pip install spacy -U --quiet

# Download Spacy resources for English Language Model
python -m spacy download en

# Problem
Given an input text, classify its sentiment as positive of negative.
X -> Input Text
Y -> Probability. <=0.5: Negative, >0.5: Positive

# Solution
1. Use IMDB movie review dataset [5] for training the model. IMDB movie review dataset is a curated collection of 25,000 movie reviews (positive and negative) for training and 25,000 for testing. 
2. Use pre-trained Vocabulary, Embedding and Language Model trained on wikitext2 [2]. This pretrained model is based on LSTM with 200 hidden units. Essentially, we will be using a pre-trained LSTM model for English Vocabulary, Embeddings and Language Model based on wikitext2. 
3. Use Spacy and GluonNLP for data preparation.
4. Use Gluon for defining a simple Neural Network - Embedding -> Encoding (LSTM) -> Dropout -> Dense -> Softmax
5. Train and test the model.


# 1. Import dependencies

In [9]:
import warnings
warnings.filterwarnings('ignore')

import random
import time
import multiprocessing as mp
import numpy as np

import mxnet as mx
from mxnet import nd, gluon, autograd

import gluonnlp as nlp

random.seed(123)
np.random.seed(123)
mx.random.seed(123)

In [10]:
# Set MXNet Context. Use mx.cpu() for CPU. Use mx.gpu(0) for 1 GPU
context = mx.gpu(0)

# 2. Load Pretrained wikitext-2 Language Model

We use a pretrained model on wikitext-2 dataset [6]. Specifically, we use Vocabulary, Language Model i.e., Embeddings and Encodings (LSTM weights) based on wikitext-2 dataset.

**Intuition:** Using pretrained language model weights is a common approach for semi-supervised learning in NLP. In order to do a good job with large language modeling on a large corpus of text, our model must learn representations that contain information about the structure of natural language. Intuitively, by starting with these good features, vs random features, we’re able to converge faster upon a good model for our downsteam task.

In [11]:
language_model_name = 'standard_lstm_lm_200'

In [12]:
lm_model, vocab = nlp.model.get_model(name=language_model_name,
                                      dataset_name='wikitext-2',
                                      pretrained=True,
                                      ctx=context,
                                      dropout=0)

# 3. Data pipeline

* Load IMDB reviews dataset
* Label negative reviews (score <= 5) as o
* Label positive reviews (score > 5) as 1
* Tokenize using spaCy- Extract words, punctuation marks from review text
* Convert each token to an index in the vocabulary. Vocabulary is obtained from wikitext2
* Prepare data iterators that can iterate on training and test data


In [13]:
# Step 1: Load the train and test IMDB movie review dataset
train_dataset, test_dataset = [nlp.data.IMDB(root='data/imdb', segment=segment)
                               for segment in ('train', 'test')]

# Use spaCy English (en) tokenizer on input sentences to get tokens(words and punctuation marks)
tokenizer = nlp.data.SpacyTokenizer('en')

# Clip sentences to be max 500 tokens
length_clip = nlp.data.ClipSequence(500)

def preprocess(x):
    """
    1. Prepare labels. label = 1 (positive) if score > 5. label = 0 (negative) if score <= 5.
    2. Tokenize - Extract words, punctuation marks from review text.
    3. Convert each token to an index in the vocabulary.
    """
    data, label = x
    label = int(label > 5)
    # Tokenize the data
    tokenized_data = tokenizer(data)
    # Clip the tokens
    tokenized_clipped_data = length_clip(tokenized_data)
    # Get vocabulary indexes for the tokens. Use pre-loaded 'vocab'.
    data = vocab[tokenized_clipped_data]

    return data, label

def get_length(x):
    return float(len(x[0]))

def preprocess_dataset(dataset):
    with mp.Pool() as pool:
        # Each sample is processed in an asynchronous manner.
        dataset = gluon.data.SimpleDataset(pool.map(preprocess, dataset))
        lengths = gluon.data.SimpleDataset(pool.map(get_length, dataset))
    
    return dataset, lengths

# Preprocess the dataset
print("Preparing Train dataset. This will take few minutes...")
train_dataset, train_data_lengths = preprocess_dataset(train_dataset)

print("Preparing Test dataset. This will take few minutes...")
test_dataset, test_data_lengths = preprocess_dataset(test_dataset)

print("Data is ready!!!")

Preparing Train dataset. This will take few minutes...
Preparing Test dataset. This will take few minutes...
Data is ready!!!


## 3.2 Prepare Dataloader

* Input sentences can be of different lengths.
* Use FixedBucketSampler, which assigns each data sample to a fixed bucket based on its length.
* Batchify function (batchify) is applied on all the samples as the loaders read the batches.
* We apply *Pad* for padding smaller length sequence to max length sequence in the bucket.
* We apply *Stack* for stacking data, label, data_length i.e., [sentence, sentiment label, sentence_length]

In [6]:
batch_size = 32
bucket_num, bucket_ratio = 10, 0.2

In [14]:
def get_dataloader():
    batchify_fn = nlp.data.batchify.Tuple(
        nlp.data.batchify.Pad(axis=0, ret_length=True),
        nlp.data.batchify.Stack(dtype='float32'))
    batch_sampler = nlp.data.sampler.FixedBucketSampler(
        train_data_lengths,
        batch_size=batch_size,
        num_buckets=bucket_num,
        ratio=bucket_ratio,
        shuffle=True)
    print(batch_sampler.stats())
    train_dataloader = gluon.data.DataLoader(
        dataset=train_dataset,
        batch_sampler=batch_sampler,
        batchify_fn=batchify_fn, num_workers=4)
    test_dataloader = gluon.data.DataLoader(
        dataset=test_dataset,
        batch_size=batch_size,
        shuffle=False,
        batchify_fn=batchify_fn, num_workers=4)
    return train_dataloader, test_dataloader

train_dataloader, test_dataloader = get_dataloader()

FixedBucketSampler:
  sample_num=25000, batch_num=779
  key=[59, 108, 157, 206, 255, 304, 353, 402, 451, 500]
  cnt=[590, 1999, 5092, 5102, 3038, 2085, 1477, 1165, 870, 3582]
  batch_size=[54, 32, 32, 32, 32, 32, 32, 32, 32, 32]


# 4. Define Network

* **Embedding, LSTM Layer:** To use pre-trained weights, we base our network on the Language Model Network (Embedding -> LSTM). 
* **Mean Pooling Layer:** We have multiple words input (reviews) and one output (sentiment). Hence, we average(mean) states across all time steps into one value.
* **Dense Layer:** To generate the final output

![Network Structure](network.png)

In [15]:
class MeanPoolingLayer(gluon.HybridBlock):
    """A block for mean pooling of encoder features"""
    def __init__(self, prefix=None, params=None):
        super(MeanPoolingLayer, self).__init__(prefix=prefix, params=params)

    def hybrid_forward(self, F, data, valid_length):
        masked_encoded = F.SequenceMask(data,
                                        sequence_length=valid_length,
                                        use_sequence_length=True)
        agg_state = F.broadcast_div(F.sum(masked_encoded, axis=0),
                                    F.expand_dims(valid_length, axis=1))
        return agg_state


class SentimentNet(gluon.HybridBlock):
    """Network for sentiment analysis."""
    def __init__(self, prefix=None, params=None):
        super(SentimentNet, self).__init__(prefix=prefix, params=params)
        with self.name_scope():
            self.embedding = None # will set with lm embedding later
            self.encoder = None # will set with lm encoder later
            self.agg_layer = MeanPoolingLayer()
            self.output = gluon.nn.HybridSequential()
            with self.output.name_scope():
                self.output.add(gluon.nn.Dense(1, flatten=False))

    def hybrid_forward(self, F, data, valid_length): 
        embedded = self.embedding(data)
        encoded = self.encoder(embedded)
        agg_state = self.agg_layer(encoded, valid_length)
        out = self.output(agg_state)
        return out

# 5. Initialize Network with Pretrained Weights

In [16]:
net = SentimentNet()

# Use Pretrained Embeddings from wikitext-2
net.embedding = lm_model.embedding

# Use Pretrained Encoder states (LSTM) from wikitext-2
net.encoder = lm_model.encoder

net.hybridize()

# Random initialize the last Dense Laywer
net.output.initialize(mx.init.Xavier(), ctx=context)
print(net)

SentimentNet(
  (embedding): HybridSequential(
    (0): Embedding(33278 -> 200, float32)
  )
  (encoder): LSTM(200 -> 200, TNC, num_layers=2)
  (agg_layer): MeanPoolingLayer(
  
  )
  (output): HybridSequential(
    (0): Dense(None -> 1, linear)
  )
)


# 6. Train the Network

## 6.1 Hyperparameters

In [17]:
learning_rate = 0.005
epochs = 3
grad_clip = None

## 6.2 Evaluation Function

In [18]:
def evaluate(net, dataloader, context):
    loss = gluon.loss.SigmoidBCELoss()
    total_L = 0.0
    total_sample_num = 0
    total_correct_num = 0
    print('Begin Testing...')
    for i, ((data, valid_length), label) in enumerate(dataloader):
        # Step 1: Prepare data
        data = mx.nd.transpose(data.as_in_context(context))
        valid_length = valid_length.as_in_context(context).astype(np.float32)
        label = label.as_in_context(context)
        
        # Step 2: Forward pass
        output = net(data, valid_length)
        
        # Step 3: Calculate loss
        L = loss(output, label)
        
        # Step 4: Statistics - Keeping moving average loss and accuracy
        pred = (output > 0.5).reshape(-1)
        total_L += L.sum().asscalar()
        total_sample_num += label.shape[0]
        total_correct_num += (pred == label).sum().asscalar()
    avg_L = total_L / float(total_sample_num)
    acc = total_correct_num / float(total_sample_num)
    return avg_L, acc

## 6.3 Train the Network

In [19]:
def train(net, context, epochs):
    # Use Follow the Moving Leader Optimizer - [7]
    trainer = gluon.Trainer(net.collect_params(), 'ftml',
                            {'learning_rate': learning_rate})
    loss = gluon.loss.SigmoidBCELoss()
    parameters = net.collect_params().values()
    print("Training the Sentiment Classification Model...")
    for epoch in range(epochs):
        epoch_L = 0.0
        epoch_sent_num = 0
        print("[Epoch - {}]".format(epoch))
        for i, ((data, length), label) in enumerate(train_dataloader):
            L = 0
            with autograd.record():
                # Step 1: Forward pass
                output = net(data.as_in_context(context).T,
                             length.as_in_context(context)
                                   .astype(np.float32))
                # Step 2: Calculate Loss
                L = L + loss(output, label.as_in_context(context)).mean()
            
            # Step 3: Backward pass
            L.backward()
            
            # Step 3.1: Clip gradient - Avoid gradient explosion
            if grad_clip:
                gluon.utils.clip_global_norm(
                    [p.grad(context) for p in parameters],
                    grad_clip)
            
            # Step 4: Do parameter updates
            trainer.step(1)
            
            # For epoch statistics - Loss and data sample count
            epoch_sent_num += data.shape[1]
            epoch_L += L.asscalar()
    
        print('Train Avg Loss {:.6f}'.format(epoch_L / epoch_sent_num))
        
        # Step 5: Evaluation after each epoch
        test_avg_L, test_acc = evaluate(net, test_dataloader, context)
        print('Test Acc {:.2f}, Test Avg Loss {:.6f}'.format(test_acc, test_avg_L))

In [20]:
# Train the model
train(net, context, epochs)

Training the Sentiment Classification Model...
[Epoch - 0]
Train Avg Loss 0.001512
Begin Testing...
Test Acc 0.85, Test Avg Loss 0.313760
[Epoch - 1]
Train Avg Loss 0.000692
Begin Testing...
Test Acc 0.81, Test Avg Loss 0.392310
[Epoch - 2]
Train Avg Loss 0.000257
Begin Testing...
Test Acc 0.85, Test Avg Loss 0.436076


# 7. Prediction

## 7.1 Positive Sentiment

In [21]:
prob1 = net(
            mx.nd.reshape(
            mx.nd.array(vocab[['This', 'movie', 'is', 'good']], ctx=context),
            shape=(-1, 1)), mx.nd.array([4], ctx=context)).sigmoid()
print("Sentiment - ", 'positive' if prob1[0] > 0.5 else 'negative')


Sentiment -  positive


## 7.2 Negative Sentiment

In [22]:
prob2 = net(
            mx.nd.reshape(
            mx.nd.array(vocab[['Movie', 'was', 'bad', 'and', 'boring']], ctx=context),
            shape=(-1, 1)), mx.nd.array([5], ctx=context)).sigmoid()
print("Sentiment - ", 'positive' if prob2[0] > 0.5 else 'negative')

Sentiment -  negative


# References
1. http://mxnet.incubator.apache.org/
2. https://gluon-nlp.mxnet.io
3. https://spacy.io/usage/
4. https://gluon-nlp.mxnet.io/examples/sentiment_analysis/sentiment_analysis.html
5. http://ai.stanford.edu/~amaas/data/sentiment/
6. https://einstein.ai/research/blog/the-wikitext-long-term-dependency-language-modeling-dataset
7. https://mxnet.incubator.apache.org/api/python/optimization/optimization.html#mxnet.optimizer.FTML