# Text Sentiment Analysis - 1


Text classification is a common task in natural language processing, which transforms a sequence of text of indefinite length into a category of text. It is similar to the image classification. The only difference is that, rather than an image, text classification's example is a text sentence.

This notebook will focus on one of the sub problems in this field: using text sentiment classification to analyze the emotions of the text's author. This problem is also called sentiment analysis and has a wide range of applications. For example, we can analyze user reviews of products to obtain user satisfaction statistics in E-commerce, or analyze user sentiments about market conditions and use it to predict future trends in Finance.


Like searching for synonyms and analogies, text classification is also a popular application of word embedding. I will use pre-trained GloVe vectors and 2 models in this notebook : first a **bidirectional RNN (Recurrent Neural Network)** and then a **CNN (Convolutional Neural Network)** to determine whether a certain length of text sequence contains positive or negative emotions.

In [10]:
import sys
sys.path.append("../input/")

In [11]:
import collections
import os
import random
import time
from tqdm import tqdm
import torch
from torch import nn
import torchtext.vocab as Vocab
import torch.utils.data as Data
import torch.nn.functional as F
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

### The Sentiment Analysis Dataset

We use [Stanford's Large Movie Review Dataset](http://ai.stanford.edu/~amaas/data/sentiment/) as the dataset for sentiment analysis. This dataset is divided into two datasets for training and testing purposes, each containing 25,000 movie reviews downloaded from IMDb. In each dataset, the number of comments labeled as "positive" and "negative" is equal, making it a quite balanced dataset.


- Dataset file structure

```
| aclImdb_v1
    | train
    |   | pos
    |   |   | 0_9.txt  
    |   |   | 1_7.txt
    |   |   | ...
    |   | neg
    |   |   | 0_3.txt
    |   |   | 1_1.txt
    |   | ...
    | test
    |   | pos
    |   | neg
    |   | ...
    | ...
```

In [12]:
def read_imdb(folder='train', data_root="/home/kesci/input/IMDB2578/aclImdb_v1/aclImdb"):
    data = []
    for label in ['pos', 'neg']:
        folder_name = os.path.join(data_root, folder, label)
        for file in tqdm(os.listdir(folder_name)):
            with open(os.path.join(folder_name, file), 'rb') as f:
                review = f.read().decode('utf-8').replace('\n', '').lower()
                data.append([review, 1 if label == 'pos' else 0])
    random.shuffle(data)
    return data

DATA_ROOT = "/home/kesci/input/IMDB2578/aclImdb_v1/"
data_root = os.path.join(DATA_ROOT, "aclImdb")
train_data, test_data = read_imdb('train', data_root), read_imdb('test', data_root)

# print first 5 samples in the training set
for sample in train_data[:5]:
    print(sample[1], '\t', sample[0][:50])

100%|██████████| 12500/12500 [00:00<00:00, 41725.80it/s]
100%|██████████| 12500/12500 [00:00<00:00, 41366.23it/s]
100%|██████████| 12500/12500 [00:00<00:00, 42091.20it/s]
100%|██████████| 12500/12500 [00:00<00:00, 41608.67it/s]

0 	 the fiendish plot of dr. fu manchu (1980). this is
0 	 how do i begin to review a film that will soon be 
0 	 if you want an undemanding and reasonably amusing 
0 	 well since seeing part's 1 through 3 i can honestl
1 	 spoiler alert! this movie, zero day, gives an insi





In [13]:
# previe of the data
print('# trainings:', len(train_data[0]))

for x, y in zip(train_data[0][:3], train_data[1][:3]):
    print('label:', y, 'review:', x[0:60])

# trainings: 2
label: how do i begin to review a film that will soon be recognized as the `worst film of all time' by the `worst director of all time?' a film that could develop a cult following because it's `so bad it's good?'<br /><br />an analytical approach criticizing the film seems both pointless and part of band-wagon syndrome--let's bash freely without worry of backlash because every other human on earth is doing it, and the people who like the film like it for those flaws we'd cite.<br /><br />the film's universal poor quality goes without saying-- 'sixteen years of alcohol' is not without competition for title of worst film so it has to sink pretty low to acquire the title and keep a hold of it, but i believe this film could go the distance. imdb doesn't allow enough words to cite all the films failures, and it be much easier to site the elements 'sixteen years of alcohol' does right. unfortunately, those moments of glory are so far buried in the shadows of this film's poorne

TypeError: 'int' object is not subscriptable

### Preprocessing:  Tokenization and Vocabulary

After reading the data, we first tokenize the text, and then use a word as a token to create a dictionary based on the training set using [`torchtext.vocab.Vocab`](https://torchtext.readthedocs.io/en/latest/vocab.html#vocab)

In [14]:
def get_tokenized_imdb(data):
    '''
    @params:
        data: each element is [string, 0/1 label] 
    @return: list after tokenization, each element is token sequence
    '''
    def tokenizer(text):
        return [tok.lower() for tok in text.split(' ')] # split by space
    
    return [tokenizer(review) for review, _ in data]

def get_vocab_imdb(data):
    '''
    @params:
        data: same as above
    @return: vocab created on the dataset，instance as（freqs, stoi, itos）
    '''
    tokenized_data = get_tokenized_imdb(data)
    counter = collections.Counter([tk for st in tokenized_data for tk in st])
    return Vocab.Vocab(counter, min_freq=5)

vocab = get_vocab_imdb(train_data)
print('# words in vocab:', len(vocab))

# words in vocab: 46152


After the vocab and the mapping of token to idx are created, the text can be converted from string to a sequence with idx for later use. Because the reviews have different lengths, so they cannot be directly combined into mini batches. Here we fix the length of each comment to 500 by truncating or padding to the samel length.

In [15]:
def preprocess_imdb(data, vocab):
    '''
    @params:
        data: raw input data
        vocab: dictionary we just created
    @return:
        features: idx seq of token, shape is (n, max_l) int tensor
        labels: emotions labels, shape is (n,) 0/1 int tensor
    '''
    max_l = 500  # sequence length we defined: 500

    def pad(x):
        return x[:max_l] if len(x) > max_l else x + [0] * (max_l - len(x)) 

    tokenized_data = get_tokenized_imdb(data)
    features = torch.tensor([pad([vocab.stoi[word] for word in words]) for words in tokenized_data])
    labels = torch.tensor([score for _, score in data])
    
    return features, labels

### Creating the data iterator

Using [`torch.utils.data.TensorDataset`](https://pytorch.org/docs/stable/data.html?highlight=tensor%20dataset#torch.utils.data.TensorDataset) can create PyTorch format dataset and data iterator. Each iteration will return a minibatch of data.

In [16]:
train_set = Data.TensorDataset(*preprocess_imdb(train_data, vocab))
test_set = Data.TensorDataset(*preprocess_imdb(test_data, vocab))

# below is the code for same function as above
# train_features, train_labels = preprocess_imdb(train_data, vocab)
# test_features, test_labels = preprocess_imdb(test_data, vocab)
# train_set = Data.TensorDataset(train_features, train_labels)
# test_set = Data.TensorDataset(test_features, test_labels)

# preview 
# len(train_set) = features.shape[0] or labels.shape[0] 
# train_set[index] = (features[index], labels[index])

batch_size = 64
train_iter = Data.DataLoader(train_set, batch_size, shuffle=True) # shuffle the training set
test_iter = Data.DataLoader(test_set, batch_size)

for X, y in train_iter:
    print('X', X.shape, 'y', y.shape)
    break
# preview
print('#batches:', len(train_iter))

X torch.Size([64, 500]) y torch.Size([64])
#batches: 391


## Recurrent Neural Network Model

**- Using a Bidirectional RNN Model**

In this model, each word first obtains a feature vector from the embedding layer. Then, we further encode the feature sequence using a bidirectional recurrent neural network to obtain sequence information. Finally, we transform the encoded sequence information to output through the fully connected layer. Specifically, we can concatenate hidden states of bidirectional long-short term memory in the initial timestep and final timestep and pass it to the output layer classification as encoded feature sequence information. In the BiRNN class implemented below, the Embedding instance is the embedding layer, the LSTM instance is the hidden layer for sequence encoding, and the Dense instance is the output layer for generated classification results.

Let's quickly recap our Bi-RNN model: 

![Image Name](https://cdn.kesci.com/upload/image/q5mnobct47.png?imageView2/0/w/960/h/960)


![Image Name](https://cdn.kesci.com/upload/image/q5mo6okdnp.png?imageView2/0/w/960/h/960)



Given the input sequence $\{\boldsymbol{X}_1,\boldsymbol{X}_2,\dots,\boldsymbol{X}_T\}$，where $\boldsymbol{X}_t\in\mathbb{R}^{n\times d}$ is the time step（batch size is $n$，input dimension is $d$).

In the Bi-RNN structure，let the forward hidden state at time step $t$ be $\overrightarrow{\boldsymbol{H}}_{t} \in \mathbb{R}^{n \times h}$ (Forward hidden state dim: $h$), while the hidden state in reverse order is $\overleftarrow{\boldsymbol{H}}_{t} \in \mathbb{R}^{n \times h}$ (Reverse hidden state dim: $h$). We can compute the forward hidden state and reverse hidden state separately:

$$ \begin{aligned}
&\overrightarrow{\boldsymbol{H}}_{t}=\phi\left(\boldsymbol{X}_{t} \boldsymbol{W}_{x h}^{(f)}+\overrightarrow{\boldsymbol{H}}_{t-1} \boldsymbol{W}_{h h}^{(f)}+\boldsymbol{b}_{h}^{(f)}\right)\\
&\overleftarrow{\boldsymbol{H}}_{t}=\phi\left(\boldsymbol{X}_{t} \boldsymbol{W}_{x h}^{(b)}+\overleftarrow{\boldsymbol{H}}_{t+1} \boldsymbol{W}_{h h}^{(b)}+\boldsymbol{b}_{h}^{(b)}\right)
\end{aligned} $$


Where, weight $\boldsymbol{W}_{x h}^{(f)} \in \mathbb{R}^{d \times h},  \boldsymbol{W}_{h h}^{(f)} \in \mathbb{R}^{h \times h},  \boldsymbol{W}_{x h}^{(b)} \in \mathbb{R}^{d \times h},  \boldsymbol{W}_{h h}^{(b)} \in \mathbb{R}^{h \times h}$ and bias  $\boldsymbol{b}_{h}^{(f)} \in \mathbb{R}^{1 \times h},  \boldsymbol{b}_{h}^{(b)} \in \mathbb{R}^{1 \times h}$ are params，$\phi$ is the activation function for hidden layer.


Then we concatenate the hidden states of two directions $\overrightarrow{\boldsymbol{H}}_{t}$ and  $\overleftarrow{\boldsymbol{H}}_{t}$ to get the hidden state $\boldsymbol{H}_{t} \in \mathbb{R}^{n \times 2 h}$, and feed it to output layer. The output layer calculates $\boldsymbol{O}_{t} \in \mathbb{R}^{n \times q}$（output dim: $q$）：


$$ \boldsymbol{O}_{t}=\boldsymbol{H}_{t} \boldsymbol{W}_{h q}+\boldsymbol{b}_{q} $$


Where, weight $\boldsymbol{W}_{h q} \in \mathbb{R}^{2 h \times q}$ and bias $\boldsymbol{b}_{q} \in \mathbb{R}^{1 \times q}$ are params for output layer. The hidden units dim can be different in different directions.

Luckily, we don't have to implement all from scratch, instead we can easily use built-in modules [`torch.nn.RNN`](https://pytorch.org/docs/stable/nn.html?highlight=rnn#torch.nn.RNN) or [`torch.nn.LSTM`](https://pytorch.org/docs/stable/nn.html?highlight=lstm#torch.nn.LSTM) to build a bi-RNN model. (Yay! Truth is PyTorch made it a breeze!)

Below I will show an example using [`torch.nn.LSTM`](https://pytorch.org/docs/stable/nn.html?highlight=lstm#torch.nn.LSTM) to build the model for our sentiment analysis task!

In [17]:
class BiRNN(nn.Module):
    def __init__(self, vocab, embed_size, num_hiddens, num_layers):
        '''
        @params:
            vocab: vocab created on the dataset
            embed_size: embedding dim size
            num_hiddens: hidden state dim
            num_layers: num of hidden layers
        '''
        super(BiRNN, self).__init__()
        self.embedding = nn.Embedding(len(vocab), embed_size) # word embedding layer: get word vector from idx
        
        # encoder-decoder framework
        self.encoder = nn.LSTM(input_size=embed_size, 
                                hidden_size=num_hiddens, 
                                num_layers=num_layers,
                                bidirectional=True) # set bidirectional to True
        self.decoder = nn.Linear(4*num_hiddens, 2) # first and last timestep hidden state is input for FC dense layer
        
    def forward(self, inputs):
        '''
        @params:
            inputs: idx sequence of tokens, shape is (batch_size, seq_len) int tensor
        @return:
            outs: prediction of text sentiment, shape is (batch_size, 2) int tensor
        '''
        # transpose input as LSTM needs to have (seq_len) at 1st-dim
        embeddings = self.embedding(inputs.permute(1, 0)) # (seq_len, batch_size, d)
        # rnn.LSTM returns outputs, (hidden state, cell)
        outputs, _ = self.encoder(embeddings) # (seq_len, batch_size, 2*h)
        encoding = torch.cat((outputs[0], outputs[-1]), -1) # (batch_size, 4*h)
        outs = self.decoder(encoding) # (batch_size, 2)
        return outs

embed_size, num_hiddens, num_layers = 100, 100, 2
net = BiRNN(vocab, embed_size, num_hiddens, num_layers)

- Load pre-trained word vectors

Since the vocab and idx token of the pre-trained word vectors are not the same as the dataset we use, the pre-trained word vector needs to be loaded according to the current order of the vocab and idx.

In [25]:
cache_dir = "/home/kesci/input/GloVe6B5429"
glove_vocab = Vocab.GloVe(name='6B', dim=100, cache=cache_dir) # 100dim is good enough
#glove_vocab = Vocab.GloVe(name='6B', dim=300, cache=cache_dir) 

def load_pretrained_embedding(words, pretrained_vocab):
    '''
    @params:
        words: list of word vectors to be loaded，as itos dictionary type
        pretrained_vocab: pre-trained word vectors
    @return:
        embed: word vector loaded
    '''
    embed = torch.zeros(len(words), pretrained_vocab.vectors[0].shape[0]) # initialize to 0
    oov_count = 0 # out of vocabulary
    for i, word in enumerate(words):
        try:
            idx = pretrained_vocab.stoi[word]
            embed[i, :] = pretrained_vocab.vectors[idx]
        except KeyError:
            oov_count += 1
    if oov_count > 0:
        print("There are %d oov words." % oov_count)
    return embed

net.embedding.weight.data.copy_(load_pretrained_embedding(vocab.itos, glove_vocab))
net.embedding.weight.requires_grad = False # load pre-trained, no need to update

There are 21202 oov words.


### Model Training

Define `train` and `evaluate_accuracy` functions.

In [None]:
def evaluate_accuracy(data_iter, net, device=None):
    if device is None and isinstance(net, torch.nn.Module):
        device = list(net.parameters())[0].device 
    acc_sum, n = 0.0, 0
    
    with torch.no_grad():
        for X, y in data_iter:
            if isinstance(net, torch.nn.Module):
                net.eval()
                acc_sum += (net(X.to(device)).argmax(dim=1) == y.to(device)).float().sum().cpu().item()
                net.train()
            else:
                if('is_training' in net.__code__.co_varnames):
                    acc_sum += (net(X, is_training=False).argmax(dim=1) == y).float().sum().item() 
                else:
                    acc_sum += (net(X).argmax(dim=1) == y).float().sum().item() 
            n += y.shape[0]
    
    return acc_sum / n

def train(train_iter, test_iter, net, loss, optimizer, device, num_epochs):
    net = net.to(device)
    print("training on ", device)
    batch_count = 0
    
    for epoch in range(num_epochs):
        train_l_sum, train_acc_sum, n, start = 0.0, 0.0, 0, time.time()
        for X, y in train_iter:
            X = X.to(device)
            y = y.to(device)
            y_hat = net(X)
            l = loss(y_hat, y) # calculate loss
            optimizer.zero_grad() # reset grad to zero
            l.backward() # backprop
            optimizer.step()
            train_l_sum += l.cpu().item()
            train_acc_sum += (y_hat.argmax(dim=1) == y).sum().cpu().item()
            n += y.shape[0]
            batch_count += 1
        test_acc = evaluate_accuracy(test_iter, net)
        
        print('epoch %d, loss %.4f, train acc %.3f, test acc %.3f, time %.1f sec'
              % (epoch + 1, train_l_sum / batch_count, train_acc_sum / n, test_acc, time.time() - start))


100%|█████████▉| 399629/400000 [01:05<00:00, 7731.61it/s][A

*Note: Since the params of the embedding layer do not need to be updated during the training, so we use the `filter` function and the`lambda` expression to filter out some parts of params that do not need to update.*

In [22]:
lr, num_epochs = 0.01, 5
optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, net.parameters()), lr=lr)
loss = nn.CrossEntropyLoss()

train(train_iter, test_iter, net, loss, optimizer, device, num_epochs)

training on  cuda
epoch 1, loss 0.2468, train acc 0.902, test acc 0.838, time 115.5 sec
epoch 2, loss 0.1083, train acc 0.914, test acc 0.846, time 115.5 sec
epoch 3, loss 0.0608, train acc 0.932, test acc 0.852, time 114.5 sec
epoch 4, loss 0.0424, train acc 0.934, test acc 0.844, time 114.6 sec
epoch 5, loss 0.0319, train acc 0.940, test acc 0.842, time 115.3 sec


### Model Evaluation

In [23]:
def predict_sentiment(net, vocab, sentence):
    '''
    @params：
        net: trained model
        vocab: vocab on the dataset，mapping token to idx
        sentence: text in word tokens seq
    @return: pred outcome: positive for postive emotions，negative else
    '''
    device = list(net.parameters())[0].device # load the device location
    sentence = torch.tensor([vocab.stoi[word] for word in sentence], device=device)
    label = torch.argmax(net(sentence.view((1, -1))), dim=1)
    return 'positive' if label.item() == 1 else 'negative'

predict_sentiment(net, vocab, ['this', 'movie', 'is', 'so', 'great'])

'positive'

In [24]:
predict_sentiment(net, vocab, ['this', 'movie', 'is', 'so', 'bad'])

'negative'

- Summary

Text classification transforms a sequence of text of indefinite length into a category of text. This is a downstream application of word embedding.

We can apply pre-trained word vectors and recurrent neural networks to classify the emotions in a text.