# Text Sentiment Analysis - 2

Text classification is a common task in natural language processing: it maps a text sequence of arbitrary length to a category. It is similar to image classification; the only difference is that each example is a text sentence rather than an image.

This notebook focuses on one subproblem in this field: using text sentiment classification to analyze the emotions of the text's author. This problem is also called sentiment analysis and has a wide range of applications. For example, in e-commerce we can analyze user reviews of products to obtain user satisfaction statistics, and in finance we can analyze user sentiment about market conditions and use it to predict future trends.

Following the first part, this second notebook uses pre-trained GloVe vectors and **TextCNN (Convolutional Neural Network)** to determine whether a text sequence expresses positive or negative emotions.

In [1]:
import sys
sys.path.append("../input/")

In [2]:
import collections
import os
import random
import time
from tqdm import tqdm
import torch
from torch import nn
import torchtext.vocab as Vocab
import torch.utils.data as Data
import torch.nn.functional as F
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

### The Sentiment Analysis Dataset

We use [Stanford's Large Movie Review Dataset](http://ai.stanford.edu/~amaas/data/sentiment/) as the dataset for sentiment analysis. This dataset is divided into two datasets for training and testing purposes, each containing 25,000 movie reviews downloaded from IMDb. In each dataset, the number of comments labeled as "positive" and "negative" is equal, making it a quite balanced dataset.


- Dataset file structure

```
| aclImdb_v1
    | train
    |   | pos
    |   |   | 0_9.txt  
    |   |   | 1_7.txt
    |   |   | ...
    |   | neg
    |   |   | 0_3.txt
    |   |   | 1_1.txt
    |   | ...
    | test
    |   | pos
    |   | neg
    |   | ...
    | ...
```

In [3]:
def read_imdb(folder='train', data_root="/home/kesci/input/IMDB2578/aclImdb_v1/aclImdb"):
    data = []
    for label in ['pos', 'neg']:
        folder_name = os.path.join(data_root, folder, label)
        for file in tqdm(os.listdir(folder_name)):
            with open(os.path.join(folder_name, file), 'rb') as f:
                review = f.read().decode('utf-8').replace('\n', '').lower()
                data.append([review, 1 if label == 'pos' else 0])
    random.shuffle(data)
    return data

DATA_ROOT = "/home/kesci/input/IMDB2578/aclImdb_v1/"
data_root = os.path.join(DATA_ROOT, "aclImdb")
train_data, test_data = read_imdb('train', data_root), read_imdb('test', data_root)

# print first 5 samples in the training set
for sample in train_data[:5]:
    print(sample[1], '\t', sample[0][:50])

100%|██████████| 12500/12500 [00:00<00:00, 52698.83it/s]
100%|██████████| 12500/12500 [00:00<00:00, 51241.03it/s]
100%|██████████| 12500/12500 [00:00<00:00, 53923.67it/s]
100%|██████████| 12500/12500 [00:00<00:00, 53180.24it/s]

0 	 certainly nomad has some of the best horse riding 
1 	 this movie is really good. the plot, which works l
0 	 who could possibly have wished for a sequel to ber
1 	 saw this today with my 8 year old. i thought it wa
0 	 unlike endemol usa's two other current game shows 





In [5]:
# preview of the data: train_data is a list of [review, label] pairs
print('# trainings:', len(train_data))

for x, y in train_data[:3]:
    print('label:', y, 'review:', x[0:60])

### Preprocessing:  Tokenization and Vocabulary

After reading the data, we first tokenize the text, treating each word as a token, and then build a dictionary from the training set using [`torchtext.vocab.Vocab`](https://torchtext.readthedocs.io/en/latest/vocab.html#vocab).

In [6]:
def get_tokenized_imdb(data):
    '''
    @params:
        data: each element is [string, 0/1 label] 
    @return: list after tokenization, each element is token sequence
    '''
    def tokenizer(text):
        return [tok.lower() for tok in text.split(' ')] # split by space
    
    return [tokenizer(review) for review, _ in data]

def get_vocab_imdb(data):
    '''
    @params:
        data: same as above
    @return: vocab created on the dataset, an instance with (freqs, stoi, itos)
    '''
    tokenized_data = get_tokenized_imdb(data)
    counter = collections.Counter([tk for st in tokenized_data for tk in st])
    return Vocab.Vocab(counter, min_freq=5)

vocab = get_vocab_imdb(train_data)
print('# words in vocab:', len(vocab))

# words in vocab: 46152
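
To make the token-to-index mapping concrete, here is a minimal plain-Python sketch of a frequency-thresholded vocabulary. It is not torchtext's implementation: `build_vocab`, the toy corpus, and the choice to reserve index 0 for `<unk>` and index 1 for `<pad>` are assumptions made for illustration.

```python
import collections

def build_vocab(tokenized_texts, min_freq=2, specials=('<unk>', '<pad>')):
    """Hypothetical helper: returns (itos, stoi) for tokens seen >= min_freq times."""
    counter = collections.Counter(tok for text in tokenized_texts for tok in text)
    # most frequent first; ties broken alphabetically so the order is deterministic
    kept = sorted((tok for tok, c in counter.items() if c >= min_freq),
                  key=lambda tok: (-counter[tok], tok))
    itos = list(specials) + kept
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

corpus = [['a', 'good', 'movie'], ['a', 'bad', 'movie'], ['a', 'good', 'plot']]
itos, stoi = build_vocab(corpus, min_freq=2)
print(itos)
# rare or unseen tokens fall back to the index of '<unk>'
print([stoi.get(tok, stoi['<unk>']) for tok in ['a', 'great', 'movie']])
```

Words below the frequency threshold ("bad", "plot") never get their own index, which is why the real vocab is much smaller than the number of distinct strings in the reviews.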


After the vocab and the token-to-index mapping are created, each review can be converted from a string to a sequence of indices for later use. Because the reviews have different lengths, they cannot be combined directly into minibatches. Here we fix the length of each review to 500 by truncating longer reviews or padding shorter ones with index 0 (the "<unk>" index).

In [7]:
def preprocess_imdb(data, vocab):
    '''
    @params:
        data: raw input data
        vocab: dictionary we just created
    @return:
        features: idx seq of token, shape is (n, max_l) int tensor
        labels: emotions labels, shape is (n,) 0/1 int tensor
    '''
    max_l = 500  # sequence length we defined: 500

    def pad(x):
        return x[:max_l] if len(x) > max_l else x + [0] * (max_l - len(x)) 

    tokenized_data = get_tokenized_imdb(data)
    features = torch.tensor([pad([vocab.stoi[word] for word in words]) for words in tokenized_data])
    labels = torch.tensor([score for _, score in data])
    
    return features, labels
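
The effect of the inner `pad` function is easier to see on a toy sequence. The sketch below uses a hypothetical helper, `pad_or_truncate`, with a small `max_len` instead of 500; the index values are made up.

```python
def pad_or_truncate(ids, max_len, pad_idx=0):
    """Hypothetical helper mirroring `pad` above for an arbitrary target length."""
    if len(ids) > max_len:
        return ids[:max_len]                       # drop everything past max_len
    return ids + [pad_idx] * (max_len - len(ids))  # fill the tail with pad_idx

short = pad_or_truncate([7, 3, 9], max_len=5)
long_ = pad_or_truncate([7, 3, 9, 4, 8, 2, 6], max_len=5)
print(short)
print(long_)
```

Both results have exactly `max_len` entries, so they can be stacked into one `(n, max_len)` tensor.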

### Creating the data iterator

[`torch.utils.data.TensorDataset`](https://pytorch.org/docs/stable/data.html?highlight=tensor%20dataset#torch.utils.data.TensorDataset) wraps the feature and label tensors in a PyTorch dataset, and `DataLoader` turns that dataset into a data iterator. Each iteration returns one minibatch of data.

In [8]:
train_set = Data.TensorDataset(*preprocess_imdb(train_data, vocab))
test_set = Data.TensorDataset(*preprocess_imdb(test_data, vocab))

# below is the code for same function as above
# train_features, train_labels = preprocess_imdb(train_data, vocab)
# test_features, test_labels = preprocess_imdb(test_data, vocab)
# train_set = Data.TensorDataset(train_features, train_labels)
# test_set = Data.TensorDataset(test_features, test_labels)

# preview 
# len(train_set) = features.shape[0] or labels.shape[0] 
# train_set[index] = (features[index], labels[index])

batch_size = 64
train_iter = Data.DataLoader(train_set, batch_size, shuffle=True) # shuffle the training set
test_iter = Data.DataLoader(test_set, batch_size)

for X, y in train_iter:
    print('X', X.shape, 'y', y.shape)
    break
# preview
print('#batches:', len(train_iter))

X torch.Size([64, 500]) y torch.Size([64])
#batches: 391
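
The batch count above follows from 25,000 samples at 64 per batch, with a smaller final batch. A plain-Python sketch of that arithmetic; `batch_slices` is a hypothetical helper that ignores shuffling and only mimics the default behaviour of keeping the last, incomplete batch.

```python
import math

def batch_slices(n_samples, batch_size):
    """Yield (start, end) index pairs that cover n_samples in order."""
    for start in range(0, n_samples, batch_size):
        yield start, min(start + batch_size, n_samples)

slices = list(batch_slices(25000, 64))
print(len(slices), math.ceil(25000 / 64))  # number of batches, computed two ways
print(slices[-1])                          # the last batch is shorter than 64
```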


### Load pre-trained word vectors

The pre-trained word vectors come with their own vocab and token indices, which differ from those of our dataset. We therefore load the pre-trained vectors row by row, following the order of our own vocab and indices.

In [9]:
cache_dir = "/home/kesci/input/GloVe6B5429"
glove_vocab = Vocab.GloVe(name='6B', dim=100, cache=cache_dir) # 100dim is good enough
#glove_vocab = Vocab.GloVe(name='6B', dim=300, cache=cache_dir) 

def load_pretrained_embedding(words, pretrained_vocab):
    '''
    @params:
        words: list of words whose vectors are to be loaded, e.g. vocab.itos
        pretrained_vocab: pre-trained word vectors
    @return:
        embed: word vector loaded
    '''
    embed = torch.zeros(len(words), pretrained_vocab.vectors[0].shape[0]) # initialize to 0
    oov_count = 0 # out of vocabulary
    for i, word in enumerate(words):
        try:
            idx = pretrained_vocab.stoi[word]
            embed[i, :] = pretrained_vocab.vectors[idx]
        except KeyError:
            oov_count += 1
    if oov_count > 0:
        print("There are %d oov words." % oov_count)
    return embed

# The vectors are copied into the model's embedding layers once `net` exists.
# Calling net.embedding.weight.data.copy_(...) in this cell, before the TextCNN
# model is defined, raises NameError: name 'net' is not defined.

100%|█████████▉| 398244/400000 [00:14<00:00, 27457.92it/s]
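
The row-reordering idea behind `load_pretrained_embedding` can be shown without GloVe. The sketch below is a plain-Python illustration under assumed toy data: `align_embeddings`, the three "pre-trained" vectors, and the tiny `itos` list are all hypothetical.

```python
def align_embeddings(itos, pretrained_stoi, pretrained_vectors, dim):
    """Hypothetical helper: row i of the result is the vector of itos[i]."""
    embed = [[0.0] * dim for _ in itos]  # rows stay zero for out-of-vocabulary words
    oov = []
    for i, word in enumerate(itos):
        idx = pretrained_stoi.get(word)
        if idx is None:
            oov.append(word)
        else:
            embed[i] = list(pretrained_vectors[idx])
    return embed, oov

# toy "pre-trained" table with its own, different index order
pretrained_stoi = {'movie': 0, 'good': 1, 'bad': 2}
pretrained_vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
itos = ['<unk>', 'good', 'movie']

embed, oov = align_embeddings(itos, pretrained_stoi, pretrained_vectors, dim=2)
print(embed)
print(oov)
```

Row 1 now holds the vector for "good" and row 2 the vector for "movie", matching our own indices rather than the pre-trained ones; `<unk>` has no pre-trained vector and keeps its zero row.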

### Model training

Define `train` and `evaluate_accuracy` functions.

In [15]:
def evaluate_accuracy(data_iter, net, device=None):
    if device is None and isinstance(net, torch.nn.Module):
        device = list(net.parameters())[0].device 
    acc_sum, n = 0.0, 0
    
    with torch.no_grad():
        for X, y in data_iter:
            if isinstance(net, torch.nn.Module):
                net.eval()
                acc_sum += (net(X.to(device)).argmax(dim=1) == y.to(device)).float().sum().cpu().item()
                net.train()
            else:
                if('is_training' in net.__code__.co_varnames):
                    acc_sum += (net(X, is_training=False).argmax(dim=1) == y).float().sum().item() 
                else:
                    acc_sum += (net(X).argmax(dim=1) == y).float().sum().item() 
            n += y.shape[0]
    
    return acc_sum / n

def train(train_iter, test_iter, net, loss, optimizer, device, num_epochs):
    net = net.to(device)
    print("training on ", device)
    batch_count = 0 # never reset, so the printed loss is the epoch's loss sum divided by all batches seen so far
    
    for epoch in range(num_epochs):
        train_l_sum, train_acc_sum, n, start = 0.0, 0.0, 0, time.time()
        for X, y in train_iter:
            X = X.to(device)
            y = y.to(device)
            y_hat = net(X)
            l = loss(y_hat, y) # calculate loss
            optimizer.zero_grad() # reset grad to zero
            l.backward() # backprop
            optimizer.step()
            train_l_sum += l.cpu().item()
            train_acc_sum += (y_hat.argmax(dim=1) == y).sum().cpu().item()
            n += y.shape[0]
            batch_count += 1
        test_acc = evaluate_accuracy(test_iter, net)
        
        print('epoch %d, loss %.4f, train acc %.3f, test acc %.3f, time %.1f sec'
              % (epoch + 1, train_l_sum / batch_count, train_acc_sum / n, test_acc, time.time() - start))

In the previous language models and text classification tasks, we treated text data as a time series data with only one dimension, and naturally, we used RNN models to process such data. In fact, we can also treat text as a one-dimensional image, so that we can apply one-dimensional Convolutional Neural Networks to capture associations between adjacent words. 

This notebook describes one such approach that applies a CNN model to sentiment analysis: **textCNN**.

## Convolutional Neural Network

### One-dimensional Convolution layer

Before introducing the model, let's look at how a 1-dim convolutional layer works. Like 2-dim conv layers, 1-dim conv layers use cross-correlation operations, here in one dimension. In the 1-dim cross-correlation operation, the convolution window starts from the leftmost side of the input array and slides over it from left to right. At each position, the input subarray in the window and the kernel array are multiplied elementwise and summed to give the element at the corresponding location in the output array.


![Image Name](https://github.com/d2l-ai/d2l-en/raw/master/img/conv1d.svg?sanitize=true)


As shown in the figure above, the input is a 1-dim array with a width of 7 and the kernel array has a width of 2. The output width is therefore $7-2+1=6$, and the first output element is obtained by multiplying the leftmost input subarray of width 2 elementwise with the kernel array and summing the results: $0\times1+1\times2=2$.
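
The same operation can be written as a formula. The symbols are introduced here for illustration and do not appear elsewhere in the notebook: $x$ is the input of width $n$, $k$ the kernel of width $w$, and $y$ the output, all indexed from 0.

$$y_i = \sum_{j=0}^{w-1} x_{i+j}\,k_j, \qquad i = 0, 1, \dots, n-w$$

With $n=7$ and $w=2$ this gives $n-w+1=6$ outputs. For the input $(0, 1, \dots, 6)$ and kernel $(1, 2)$, the first two are $y_0 = 0\times1+1\times2 = 2$ and $y_1 = 1\times1+2\times2 = 5$.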

In [10]:
# define the 1-dim cross-correlation function
def corr1d(X, K):
    '''
    @params:
        X: input array, (seq_len,) shape tensor
        K: kernel array, (w,) shape tensor
    @return:
        Y: output array, (seq_len - w + 1,) shape tensor
    '''
    w = K.shape[0] # kernel window size
    Y = torch.zeros((X.shape[0] - w + 1))
    for i in range(Y.shape[0]): # sliding window
        Y[i] = (X[i: i + w] * K).sum()
    return Y

X, K = torch.tensor([0, 1, 2, 3, 4, 5, 6]), torch.tensor([1, 2])
print(corr1d(X, K))

tensor([ 2.,  5.,  8., 11., 14., 17.])


The 1-dim cross-correlation operation for multiple input channels is similar to its 2-dim counterpart. On each channel, it performs the 1-dim cross-correlation operation on the kernel and its corresponding input, then adds the results of all channels to get the output. The figure below shows a 1-dim cross-correlation operation with three input channels, where the blue part marks the first output element together with the input and kernel array elements used to compute it: $0\times1+1\times2+1\times3+2\times4+2\times(-1)+3\times(-3)=2$.


![Image Name](https://github.com/d2l-ai/d2l-en/raw/master/img/conv1d-channel.svg?sanitize=true)

In [11]:
def corr1d_multi_in(X, K):
    # First, traverse along the 0th dim (channel dimension) of X and K,
    # calculate the 1-dim cross-correlation result, 
    # stack all results together and add up along the 0th dim
    
    return torch.stack([corr1d(x, k) for x, k in zip(X, K)]).sum(dim=0)
    # [corr1d(X[i], K[i]) for i in range(X.shape[0])]

X = torch.tensor([[0, 1, 2, 3, 4, 5, 6],
              [1, 2, 3, 4, 5, 6, 7],
              [2, 3, 4, 5, 6, 7, 8]])
K = torch.tensor([[1, 2], [3, 4], [-1, -3]])

print(corr1d_multi_in(X, K))

tensor([ 2.,  8., 14., 20., 26., 32.])


The definition of a 2-dim cross-correlation operation shows that a 1-dim cross-correlation operation with multiple input channels can be regarded as a 2-dim cross-correlation operation with a single input channel. 

In other words, the 1-dim cross-correlation operation with multiple input channels can also be presented as the equivalent 2-dim cross-correlation operation with a single input channel, shown below. Here, the height of the kernel is equal to the height of the input. The blue part shows the calculation: $2\times(-1)+3\times(-3)+1\times3+2\times4+0\times1+1\times2=2$.

![Image Name](https://github.com/d2l-ai/d2l-en/raw/master/img/conv1d-2d.svg?sanitize=true)
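
This equivalence can be checked numerically. The sketch below reimplements both operations on plain Python lists (the list-based `corr1d`, `corr1d_multi_in`, and `corr2d` here are illustrative stand-ins, not the tensor versions defined earlier) and reuses the three-channel input and kernel from the example.

```python
def corr1d(x, k):
    """1-D cross-correlation of two lists."""
    w = len(k)
    return [sum(x[i + j] * k[j] for j in range(w)) for i in range(len(x) - w + 1)]

def corr1d_multi_in(X, K):
    """Sum the per-channel 1-D cross-correlations."""
    per_channel = [corr1d(x, k) for x, k in zip(X, K)]
    return [sum(vals) for vals in zip(*per_channel)]

def corr2d(X, K):
    """2-D cross-correlation of a single-channel input given as a list of rows."""
    h, w = len(K), len(K[0])
    out_h, out_w = len(X) - h + 1, len(X[0]) - w + 1
    return [[sum(X[r + a][c + b] * K[a][b] for a in range(h) for b in range(w))
             for c in range(out_w)] for r in range(out_h)]

X = [[0, 1, 2, 3, 4, 5, 6],
     [1, 2, 3, 4, 5, 6, 7],
     [2, 3, 4, 5, 6, 7, 8]]
K = [[1, 2], [3, 4], [-1, -3]]

print(corr1d_multi_in(X, K))  # rows treated as three separate input channels
print(corr2d(X, K))           # same arrays treated as one 2-D input: a single output row
```

Because the kernel is as tall as the input, the 2-D window can only slide horizontally, so the 2-D output has exactly one row and it equals the multi-channel 1-D result.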



Both examples above have only one output channel. Similarly, we can specify multiple output channels in the 1-dim convolutional layer to increase the number of model parameters in the layer.
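
A minimal sketch of multiple output channels, again on plain lists: each output channel gets its own multi-input kernel, so the kernels form a `(out_channels, in_channels, width)` array. The helper `corr1d_multi_in_out` and the numbers are made up for illustration.

```python
def corr1d_multi_in(X, K):
    """Multi-input-channel 1-D cross-correlation producing one output channel."""
    w = len(K[0])
    n_out = len(X[0]) - w + 1
    return [sum(X[c][i + j] * K[c][j] for c in range(len(X)) for j in range(w))
            for i in range(n_out)]

def corr1d_multi_in_out(X, kernels):
    """kernels: (out_channels, in_channels, width); returns one output row per kernel."""
    return [corr1d_multi_in(X, K) for K in kernels]

X = [[0, 1, 2, 3], [1, 2, 3, 4]]   # 2 input channels, width 4
kernels = [[[1, 2], [3, 4]],       # kernel for output channel 0
           [[0, 1], [1, 0]],       # kernel for output channel 1
           [[1, 0], [0, -1]]]      # kernel for output channel 2
Y = corr1d_multi_in_out(X, kernels)
print(Y)
```

The output has 3 channels of width $4-2+1=3$, and the number of kernel parameters grows linearly with the number of output channels.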

### Max-Over-Time Pooling Layer

Similarly, we have a 1-dim pooling layer. The **max-over-time pooling layer** used in TextCNN corresponds to a *1-dim global maximum pooling layer*. Assuming that the input contains multiple channels, and each channel consists of values at different timesteps, the output of each channel is the largest value over all timesteps in that channel. Therefore, the input of the max-over-time pooling layer can have a different number of timesteps on each channel.



To improve computational performance, we often combine sequences of different lengths into a minibatch and make the lengths consistent by padding the shorter sequences with a special token (such as index 0). Naturally, these added tokens have no intrinsic meaning. Because the main purpose of the max-over-time pooling layer is to capture the most important feature over time, it usually allows the model to be unaffected by the manually added tokens.

In [12]:
class GlobalMaxPool1d(nn.Module):
    def __init__(self):
        super(GlobalMaxPool1d, self).__init__()
    def forward(self, x):
        '''
        @params:
            x: input, (batch_size, n_channels, seq_len) tensor
        @return: output, (batch_size, n_channels, 1) tensor
        '''
        return F.max_pool1d(x, kernel_size=x.shape[2]) # kernel_size=seq_len
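
The two properties described above can be illustrated in plain Python: channels of different lengths each reduce to a single number, and appended zeros do not change the result as long as each channel's maximum is non-negative (as it is after ReLU). The helper `max_over_time` and the feature values are hypothetical; padding the feature maps directly is a simplification, since the notebook pads token indices before the embedding layer.

```python
def max_over_time(channels):
    """Return the largest value of each channel; channels may differ in length."""
    return [max(ch) for ch in channels]

features = [[0.2, 1.5, 0.0, 0.7],        # 4 timesteps
            [0.9, 0.1],                  # 2 timesteps
            [0.0, 0.3, 2.1, 0.4, 0.6]]   # 5 timesteps
pooled = max_over_time(features)
print(pooled)

# pad every channel with zeros up to 6 timesteps and pool again
padded = [ch + [0.0] * (6 - len(ch)) for ch in features]
pooled_padded = max_over_time(padded)
print(pooled_padded)
```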

### TextCNN Model


TextCNN mainly uses a **1-dim convolutional layer** and **max-over-time pooling layer**. Suppose the input text sequence consists of $n$ words, and each word is represented by a $d$-dim word vector. Then the input sample has a width of $n$, a height of 1, and $d$ input channels. The calculation of textCNN can be mainly divided into the following steps:


1. Define multiple one-dimensional convolution kernels and use them to perform convolution calculations on the inputs. Convolution kernels with different widths may capture the correlation of different numbers of adjacent words.


2. Perform max-over-time pooling on all output channels, and then concatenate the pooling output values of these channels in a vector.


3. The concatenated vector is transformed into the output for each category through the fully connected layer. A dropout layer can be used in this step to deal with overfitting.


![Image Name](https://github.com/d2l-ai/d2l-en/raw/master/img/textcnn.svg?sanitize=true)



This figure shows an example to illustrate the textCNN. The input here is a sentence with 11 words, with each word represented by a 6-dim word vector. Therefore, the input sequence has a width of 11, and 6 input channels. 

Suppose we use two 1-dim convolutional layers with kernel widths of 2 and 4, and with 4 and 5 output channels, respectively. After the 1-dim convolution, the 4 output channels of the first layer have a width of $11-2+1=10$, while the 5 output channels of the second have a width of $11-4+1=8$. Even though the channels differ in width, we can still perform max-over-time pooling on each channel and concatenate the pooling outputs of the 9 channels into a 9-dim vector.

Finally, we use a fully connected layer to transform the 9-dim vector into a 2-dim output: positive sentiment and negative sentiment predictions.
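
The shapes in this walkthrough can be traced with a toy forward pass. The sketch below uses plain Python lists and random weights, so only the shapes are meaningful; `rand_matrix`, `conv1d`, and the omission of biases and dropout are simplifications assumed for illustration, not the model implemented in this notebook.

```python
import random

random.seed(0)

def rand_matrix(rows, cols):
    """Random (rows, cols) matrix standing in for learned weights."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def conv1d(x, kernels):
    """x: (in_channels, width); kernels: (out_channels, in_channels, k).
    Returns (out_channels, width - k + 1)."""
    k = len(kernels[0][0])
    width = len(x[0]) - k + 1
    return [[sum(x[c][i + j] * K[c][j] for c in range(len(x)) for j in range(k))
             for i in range(width)] for K in kernels]

n_words, embed_dim = 11, 6
x = rand_matrix(embed_dim, n_words)   # 6 input channels, 11 timesteps

# two conv layers: width 2 with 4 output channels, width 4 with 5 output channels
layers = [[rand_matrix(embed_dim, 2) for _ in range(4)],
          [rand_matrix(embed_dim, 4) for _ in range(5)]]

encoding = []
for kernels in layers:
    out = conv1d(x, kernels)
    print(len(out), 'channels of width', len(out[0]))
    # ReLU followed by max-over-time pooling: one number per channel
    encoding += [max(max(v, 0.0) for v in channel) for channel in out]

W = rand_matrix(2, len(encoding))     # fully connected layer, bias omitted
logits = [sum(w * e for w, e in zip(row, encoding)) for row in W]
print(len(encoding), len(logits))
```

The pooled vector has $4+5=9$ entries regardless of the sentence length, which is what lets a single fully connected layer follow convolutions of different widths.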

Next, we will implement our textCNN model. Compared with the LSTM model, we replace the recurrent layers with 1-dim convolutional layers and use two embedding layers: one with fixed weights and another that participates in training.

In [16]:
class TextCNN(nn.Module):
    
    def __init__(self, vocab, embed_size, kernel_sizes, num_channels):
        '''
        @params:
            vocab: vocab
            embed_size: embedding dim for word vectors
            kernel_sizes: kernel size list
            num_channels: channel number for kernel
        '''
        super(TextCNN, self).__init__()
        self.embedding = nn.Embedding(len(vocab), embed_size) # embedding layer that participates in training
        self.constant_embedding = nn.Embedding(len(vocab), embed_size) # embedding layer that does not participate in training
        
        self.pool = GlobalMaxPool1d() # layer has no weight, so it can share an instance
        self.convs = nn.ModuleList()  # create multiple 1-dim convolutional layers
        for c, k in zip(num_channels, kernel_sizes):
            self.convs.append(nn.Conv1d(in_channels = 2*embed_size, 
                                        out_channels = c, 
                                        kernel_size = k))
            
        self.decoder = nn.Linear(sum(num_channels), 2)
        self.dropout = nn.Dropout(0.5) # add dropout layer

        
    def forward(self, inputs):
        '''
        @params:
            inputs: idx sequence for token, (batch_size, seq_len) int tensor
        @return:
            outputs: binary prediction, (batch_size, 2) tensor
        '''
        embeddings = torch.cat((
            self.embedding(inputs), 
            self.constant_embedding(inputs)), dim=2)  # (batch_size, seq_len, 2*embed_size)
        # concatenate the outputs of the two embedding layers along the word-vector dim;
        # Conv1d expects (batch_size, channels, seq_len), so move the word-vector
        # dimension to the channel position
        embeddings = embeddings.permute(0, 2, 1) # (batch_size, 2*embed_size, seq_len)
        
        
        # for each 1-dim convolutional layer, after max-over-time pooling,
        # get tensor shape of (batch size, channel size, 1)
        # use squeeze to remove the last dim and then concatenate on the channel dim
        encoding = torch.cat([
            self.pool(F.relu(conv(embeddings))).squeeze(-1) for conv in self.convs], dim=1)
    
        # below is the step-by-step code for the same function
        # encoding = []
        # for conv in self.convs:
        #     out = conv(embeddings) # (batch_size, out_channels, seq_len-kernel_size+1)
        #     out = self.pool(F.relu(out)) # (batch_size, out_channels, 1)
        #     encoding.append(out.squeeze(-1)) # (batch_size, out_channels)
        # encoding = torch.cat(encoding, dim=1) # (batch_size, out_channels_sum)
        
        outputs = self.decoder(self.dropout(encoding))
        # after dropout, use a fully connected layer get the output
        
        return outputs

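embed_size, kernel_sizes, nums_channels = 100, [3, 4, 5], [100, 100, 100]
net = TextCNN(vocab, embed_size, kernel_sizes, nums_channels)

# net exists now, so load the pre-trained GloVe vectors into both embedding layers
glove_embed = load_pretrained_embedding(vocab.itos, glove_vocab)
net.embedding.weight.data.copy_(glove_embed)          # fine-tuned during training
net.constant_embedding.weight.data.copy_(glove_embed) # kept fixed
net.constant_embedding.weight.requires_grad = False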

### Training and Evaluation

In [17]:
lr, num_epochs = 0.001, 5
optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, net.parameters()), lr=lr)
loss = nn.CrossEntropyLoss()
train(train_iter, test_iter, net, loss, optimizer, device, num_epochs)

training on  cpu
epoch 1, loss 0.6358, train acc 0.663, test acc 0.783, time 371.6 sec
epoch 2, loss 0.2478, train acc 0.758, test acc 0.830, time 367.9 sec
epoch 3, loss 0.1367, train acc 0.814, test acc 0.851, time 367.1 sec
epoch 4, loss 0.0822, train acc 0.857, test acc 0.851, time 367.4 sec
epoch 5, loss 0.0499, train acc 0.896, test acc 0.860, time 367.6 sec


```
training on  cuda
epoch 1, loss 0.6314, train acc 0.666, test acc 0.803, time 15.9 sec
epoch 2, loss 0.2416, train acc 0.766, test acc 0.807, time 15.9 sec
epoch 3, loss 0.1330, train acc 0.821, test acc 0.849, time 15.9 sec
epoch 4, loss 0.0825, train acc 0.858, test acc 0.860, time 16.0 sec
epoch 5, loss 0.0494, train acc 0.898, test acc 0.865, time 15.9 sec
```

In [22]:
def predict_sentiment(net, vocab, sentence):
    '''
    @params:
        net: trained model
        vocab: vocab on the dataset, mapping token to idx
        sentence: text as a sequence of word tokens
                  (needs at least max(kernel_sizes) tokens for the conv layers)
    @return: 'positive' for positive emotions, 'negative' otherwise
    '''
    device = list(net.parameters())[0].device # device the model lives on
    sentence = torch.tensor([vocab.stoi[word] for word in sentence], device=device)
    net.eval() # switch off dropout so the prediction is deterministic
    with torch.no_grad():
        label = torch.argmax(net(sentence.view((1, -1))), dim=1)
    return 'positive' if label.item() == 1 else 'negative'

In [25]:
predict_sentiment(net, vocab, ['this', 'movie', 'is', 'so', 'so'])

'negative'

In [26]:
predict_sentiment(net, vocab, ['this', 'movie', 'is', 'so', 'great'])

'positive'

In [24]:
predict_sentiment(net, vocab, ['this', 'movie', 'is', 'so', 'bad'])

'negative'

#### Summary:

* We can use one-dimensional convolution to process and analyze sequential data.

* A one-dimensional cross-correlation operation with multiple input channels can be regarded as a two-dimensional cross-correlation operation with a single input channel.

* The input of the max-over-time pooling layer can have different numbers of timesteps on each channel.

* TextCNN mainly uses a one-dimensional convolutional layer and max-over-time pooling layer.