# Sentiment analysis of IMDb movie reviews

# Bag-of-words model with learned embeddings

## Learnings:

At the end of this lesson you will know:

- How to create a bag-of-words model for sentiment analysis.

- How to feed text data into a PyTorch model using `torchtext`.

- How to write a training/validation loop in PyTorch.

- How to visualize training metrics and word embeddings with Tensorboard.

<a href="https://colab.research.google.com/github/Paulescu/practical-nlp-2021/blob/main/notebooks/1_fine_tune_bert_for_sentiment_analysis.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" align="left"/>
</a>

# Stage 1: Data download and pre-processing

The original dataset in `http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz` has one file per example, and examples are grouped in folders according to train vs test, and positive vs negative.

At the end of Stage 1 we will have the data split into three CSV files: `train_data.csv`, `val_data.csv`, and `test_data.csv`. In each file, the first column is the review text and the second column is the sentiment (0: negative, 1: positive). This is the format expected by the `torchtext` convenience function `TabularDataset.splits()`.
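
To make the target layout concrete, here is a minimal sketch that writes and re-reads a two-column CSV using only the standard library. The two reviews are made up for illustration, and the in-memory buffer stands in for a file on disk; the notebook itself uses `pandas` for this step.

```python
import csv
import io

# Two made-up reviews standing in for the real data.
rows = [
    ("A wonderful, moving film.", 1),     # 1 = positive
    ("Dull plot, and far too long.", 0),  # 0 = negative
]

# Write a header plus one row per review; the csv module quotes
# any text that contains commas or quotes.
buffer = io.StringIO()
csv_writer = csv.writer(buffer)
csv_writer.writerow(["text", "label"])
csv_writer.writerows(rows)

# Read the data back to confirm the round trip preserves both columns.
buffer.seek(0)
parsed = list(csv.DictReader(buffer))
print(parsed[0]["text"], parsed[0]["label"])
```

Note that labels come back as strings when read with `csv`; a loader has to convert them to integers.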

### Download raw data

In [98]:
!wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz

--2020-12-08 17:49:39--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘aclImdb_v1.tar.gz.2’


2020-12-08 17:50:04 (3.28 MB/s) - ‘aclImdb_v1.tar.gz.2’ saved [84125825/84125825]



### Read raw data from disk into Python lists

In [99]:
from typing import List, Tuple
from pathlib import Path

def read_imdb_split(split_dir: str) -> Tuple[List[str], List[int]]:
    """
    Auxiliary function to read raw data from disk
    into 2 Python lists, one for texts the other for labels
    """
    split_dir = Path(split_dir)
    texts = []
    labels = []
    for label_dir in ['pos', 'neg']:
        for text_file in (split_dir/label_dir).iterdir():
            texts.append(text_file.read_text(encoding='utf-8'))
            # compare strings with ==, not `is` (identity check)
            labels.append(0 if label_dir == "neg" else 1)
    return texts, labels

train_texts_, train_labels_ = read_imdb_split('aclImdb/train')
print('train_texts: ', len(train_texts_))

test_texts, test_labels = read_imdb_split('aclImdb/test')
print('test_texts: ', len(test_texts))

train_texts:  25000
test_texts:  25000


### Split data into train, validation and test sets

In [100]:
from sklearn.model_selection import train_test_split

train_texts, val_texts, train_labels, val_labels = \
    train_test_split(train_texts_, train_labels_, test_size=0.2, random_state=1)

print('train_texts: ', len(train_texts))
print('val_texts: ', len(val_texts))
print('test_texts: ', len(test_texts))

train_texts:  20000
val_texts:  5000
test_texts:  25000


### Save `train_data.csv`, `val_data.csv`, `test_data.csv`

In [101]:
import pandas as pd

train_data = pd.DataFrame({'text': train_texts, 'label': train_labels},
                          columns=['text', 'label'])
val_data = pd.DataFrame({'text': val_texts, 'label': val_labels},
                        columns=['text', 'label'])
test_data = pd.DataFrame({'text': test_texts, 'label': test_labels},
                         columns=['text', 'label'])

train_data.to_csv('train_data.csv', index=False)
val_data.to_csv('val_data.csv', index=False)
test_data.to_csv('test_data.csv', index=False)

# Stage 2: Define PyTorch `Dataset` and `DataLoader`s using `torchtext`

### Define `Field` objects

In [1]:
from torchtext.data import Field

TEXT = Field(sequential=True, tokenize='spacy', lower=True, batch_first=True)
LABEL = Field(sequential=False, use_vocab=False)



### Create `Dataset` objects for train, validation, test files

A `Dataset` behaves like a list of `Example` objects. Each `Example` stores one value per `Field`, exposed as an attribute named after the field (here `.text` and `.label`).

`TabularDataset` provides a convenient way to load columnar data from a CSV file into a PyTorch `Dataset`.

In [2]:
from torchtext.data import TabularDataset

fields = [('text', TEXT), ('label', LABEL)]

train, val, test = TabularDataset.splits(
    path='',
    train='train_data.csv',
    validation='val_data.csv',
    test='test_data.csv',
    format='csv',
    skip_header=True,
    fields=fields,
)



In [40]:
import torchtext

print(type(train))
print(TabularDataset.__bases__[0])
print(torchtext.data.dataset.Dataset.__bases__[0])
print(type(train[0]))

<class 'torchtext.data.dataset.TabularDataset'>
<class 'torchtext.data.dataset.Dataset'>
<class 'torch.utils.data.dataset.Dataset'>
<class 'torchtext.data.example.Example'>


You can access each example the same way you access elements of a Python `list`,
and each field of an example the same way you access attributes of a Python object.

In [3]:
print('text: ', train[0].text)
print('\nlabel: ', train[0].label)

text:  ['there', 'are', 'so', 'many', 'good', 'things', 'to', 'say', 'about', 'this', '“', 'b', '”', 'movie.<br', '/><br', '/>“b', '’', 'maybe', 'in', 'connections', ',', 'but', 'not', 'in', 'commission', '.', 'this', 'is', 'about', 'the', 'best', 'of', 'its', 'genre', 'that', 'i', 'have', 'ever', 'seen', '.', 'a', 'grade', 'a', 'effort', 'by', 'universal', '.', 'the', 'script', 'is', 'well', 'done', ',', 'imaginative', ',', 'and', 'without', 'fault', '.', 'writing', 'credits', ':', 'howard', 'higgin', 'original', 'story', '&', 'douglas', 'hodges', 'story', ',', 'john', 'colton', '(', 'screenplay', ')', '.', 'director', 'lambert', 'hillyer', 'handled', 'the', 'complex', 'story', 'and', 'story', 'locations', 'very', 'well', '.', 'no', 'skimping', 'on', 'the', 'loads', 'of', 'extras', 'and', 'locations', '.', 'i', 'loved', 'beulah', 'bondy', '(', 'jimmy', 'stewarts', 'mother', 'in', '“', 'it', '’s', 'a', 'wonderful', 'life', '”', '.', 'the', 'fem', 'lead', ',', 'frances', 'drake', 'is', 
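
The tokens above include fragments such as `movie.<br` and `/><br` because the raw reviews contain HTML `<br />` tags. The notebook does not remove them. A minimal sketch of an optional cleaning step, assuming you would rather strip these tags before tokenizing (the helper name `strip_br_tags` is hypothetical):

```python
import re

# Matches <br>, <br/>, and <br /> in any letter case.
BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)

def strip_br_tags(text: str) -> str:
    """Replace HTML line breaks with a space and collapse repeated whitespace."""
    without_tags = BR_TAG.sub(" ", text)
    return re.sub(r"\s+", " ", without_tags).strip()

sample = "good movie.<br /><br />loved it"
print(strip_br_tags(sample))
```

One place to apply such a function would be inside `read_imdb_split`, right before each text is appended to the list.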

### Build the vocabulary using the training data

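A `Vocab` object maps each word it keeps from the training set to a unique integer. With `max_size=10000` and `min_freq=2`, only the 10,000 most frequent words that appear at least twice are kept; every other word maps to the unknown token `<unk>`.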

Once you build the vocabulary you can express sentences as sequences of integers.

Only training data is used to generate the vocabulary.

In [4]:
TEXT.build_vocab(train, max_size=10000, min_freq=2)

# we will need this later
vocab_size = len(TEXT.vocab)

You can map integers to words, and vice versa, using the list `Vocab.itos` and the dictionary `Vocab.stoi`.

In [5]:
print(TEXT.vocab.itos[13])
print(TEXT.vocab.stoi['this'])

this
13
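
To show what `max_size` and `min_freq` do, here is a plain-Python sketch of vocabulary building. It is not `torchtext`'s implementation: the function name `build_vocab`, the tie-breaking rule, and the choice not to count the two special tokens toward `max_size` are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple

def build_vocab(token_lists: List[List[str]], max_size: int, min_freq: int
                ) -> Tuple[List[str], Dict[str, int]]:
    """Return (itos, stoi) for the most frequent tokens."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Special tokens come first, so unknown = 0 and padding = 1.
    itos = ["<unk>", "<pad>"]
    # Sort by descending frequency, breaking ties alphabetically.
    for token, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq < min_freq or len(itos) >= max_size + 2:
            break
        itos.append(token)
    stoi = {token: index for index, token in enumerate(itos)}
    return itos, stoi

docs = [["a", "good", "movie"], ["a", "bad", "movie"], ["a", "good", "plot"]]
itos, stoi = build_vocab(docs, max_size=10, min_freq=2)
print(itos)

# Words that were dropped ("bad" appears only once) fall back to <unk> = 0.
encoded = [stoi.get(tok, 0) for tok in ["a", "bad", "movie"]]
print(encoded)
```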


### Create the train/validation/test dataloaders

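In `torchtext`, data loaders are called `BucketIterator`s. A `BucketIterator` batches examples of similar length together to minimize the amount of padding needed.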

In [8]:
import torch
from torchtext.data import BucketIterator

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
batch_size = 128

train_iter, val_iter, test_iter = BucketIterator.splits(
    (train, val, test),
    batch_sizes=(batch_size, batch_size, batch_size),
    device=device,
    sort_key=lambda x: len(x.text),
    sort_within_batch=True,
)



**What is `train_iter`?**

In [9]:
# it is an iterator
print(type(train_iter))

# let's pick the first batch and check what is inside
x = next(iter(train_iter))
print(x, '\n')
print('text: \n', x.text)
print('label: \n', x.label)

<class 'torchtext.data.iterator.BucketIterator'>

[torchtext.data.batch.Batch of size 128]
	[.text]:[torch.LongTensor of size 128x661]
	[.label]:[torch.LongTensor of size 128] 

text: 
 tensor([[2390,    0,    3,  ..., 1828, 1761,    4],
        [   0,   91,  382,  ...,    0,   18, 7875],
        [1610, 5653,    3,  ..., 2674,   18, 7572],
        ...,
        [  57,   27,  591,  ...,    1,    1,    1],
        [  21,  170,   12,  ...,    1,    1,    1],
        [  10,   19,    6,  ...,    1,    1,    1]])
label: 
 tensor([1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0,
        0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1,
        1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1,
        0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1,
        0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0,
        1, 0, 0, 1, 1, 0, 1, 1])
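
The trailing 1s in the `text` tensor above are padding: shorter reviews are padded up to the length of the longest review in the batch. The sketch below illustrates the bucketing idea in plain Python. It is a simplification, not `BucketIterator`'s actual algorithm, and the function name `bucket_batches` and the padding index `PAD = 1` are assumptions for this example.

```python
from typing import List

PAD = 1  # padding index assumed for this sketch

def bucket_batches(sequences: List[List[int]], batch_size: int) -> List[List[List[int]]]:
    """Group sequences of similar length and pad each batch to its longest member."""
    ordered = sorted(sequences, key=len)
    batches = []
    for start in range(0, len(ordered), batch_size):
        chunk = ordered[start:start + batch_size]
        width = max(len(seq) for seq in chunk)
        batches.append([seq + [PAD] * (width - len(seq)) for seq in chunk])
    return batches

sequences = [[5, 6, 7, 8, 9, 10], [3], [4, 5], [7, 8, 9, 10, 11]]
batches = bucket_batches(sequences, batch_size=2)
for batch in batches:
    print(batch)
```

Sorting by length first means each batch needs only one padding token here; batching the sequences in their original order would need eight.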




# Stage 3: Define the neural net model

In [44]:
# TODO: add diagram here

In [23]:
import torch.nn as nn
import torch.nn.functional as F
    
class Model(nn.Module):
    
    def __init__(self, vocab_size: int, embedding_dim: int):
        super(Model, self).__init__()
        self.embed = nn.Embedding(vocab_size, embedding_dim)
        self.global_avg_pooling = lambda x: torch.mean(x, dim=-2)
        self.fc1 = nn.Linear(embedding_dim, 16)
        self.fc2 = nn.Linear(16, 2)
        
    def forward(self, x):
        x = self.embed(x)
        x = self.global_avg_pooling(x)

        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        
        return x
    
# move the model to the same device as the batches produced by the iterators
model = Model(vocab_size, 16).to(device)
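
The model's `global_avg_pooling` averages over every position in the sequence, padding included. Whether that hurts accuracy is not measured in this notebook. As a sketch of an alternative, assuming the padding index is 1 (as the trailing 1s in the batch printed earlier suggest), a masked mean that ignores padding could look like this. The function name `masked_mean` is hypothetical.

```python
import torch

def masked_mean(embedded: torch.Tensor, token_ids: torch.Tensor, pad_idx: int = 1) -> torch.Tensor:
    """Average embeddings over the sequence axis, ignoring padding positions.

    embedded:  (batch, seq_len, embedding_dim)
    token_ids: (batch, seq_len)
    """
    mask = (token_ids != pad_idx).unsqueeze(-1).to(embedded.dtype)  # (batch, seq_len, 1)
    summed = (embedded * mask).sum(dim=1)    # (batch, embedding_dim)
    counts = mask.sum(dim=1).clamp(min=1.0)  # avoid division by zero
    return summed / counts

token_ids = torch.tensor([[4, 7, 1, 1]])  # two real tokens, two pads
embedded = torch.tensor([[[1.0, 3.0], [3.0, 5.0], [9.0, 9.0], [9.0, 9.0]]])
pooled_masked = masked_mean(embedded, token_ids)  # averages the first two rows only
pooled_plain = embedded.mean(dim=-2)              # plain mean also counts the pads
print(pooled_masked)
print(pooled_plain)
```

Using it inside `forward` would require passing the token ids alongside the embedded tensor.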

# Stage 4: Train

### Loss function and optimizer

In [24]:
import torch.optim as optim

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.0001)
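
`nn.CrossEntropyLoss` expects raw scores (logits), which is why the model's `forward` ends without a softmax. The following plain-Python sketch shows the per-example quantity it computes: the negative log-probability of the correct class under a softmax. The helper name `cross_entropy` is an assumption for illustration.

```python
import math
from typing import List

def cross_entropy(logits: List[float], target: int) -> float:
    """Negative log-probability of the target class under softmax(logits)."""
    # Subtract the max logit for numerical stability; the result is unchanged.
    shift = max(logits)
    log_sum_exp = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return log_sum_exp - logits[target]

# Equal logits: the model is undecided, so the loss is log(2).
undecided = cross_entropy([0.0, 0.0], target=1)
# A confident, correct prediction gives a loss close to zero.
confident = cross_entropy([2.0, -1.0], target=0)
print(round(undecided, 4), round(confident, 4))
```

A loss near `log(2) ≈ 0.693` on this two-class problem therefore means the model is doing no better than guessing.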

### Launch Tensorboard

In [25]:
%load_ext tensorboard
%tensorboard --logdir runs

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


Reusing TensorBoard on port 6007 (pid 26106), started 0:01:44 ago. (Use '!kill 26106' to kill it.)

### Train loop

In [30]:
# Setup logging to Tensorboard
from torch.utils.tensorboard import SummaryWriter
from datetime import datetime

now = datetime.now()
now = now.strftime("%Y-%m-%d-%H:%M:%S")
model_name = 'bag_of_words_embeddings_from_scratch'
log_file = f'./runs/{model_name}/{now}'
writer = SummaryWriter(log_file)

# training loop
from tqdm import tqdm

n_epochs = 125
for epoch in range(n_epochs):
    
    # train
    running_loss = 0.0
    model.train()
    train_size = 0
    running_accuracy = 0.0
    for batch in tqdm(train_iter):
        
        # forward pass to compute the batch loss
        x = batch.text
        y = batch.label.long()
        predictions = model(x)        
        loss = criterion(predictions, y)
            
        # backward pass to update model parameters
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        # compute train metrics
        # loss.item() returns a plain Python float, detached from the graph
        running_loss += loss.item() * x.size(0)
        _, predicted_classes = torch.max(predictions, 1)
        running_accuracy += predicted_classes.eq(y).sum().item()
        train_size += x.size(0)
        
    epoch_loss = running_loss / train_size
    writer.add_scalar('training_epoch_loss', epoch_loss, epoch + 1)
    epoch_accuracy = running_accuracy / train_size
    writer.add_scalar('training_epoch_accuracy', epoch_accuracy, epoch + 1)
    
    # validation
    val_loss = 0.0
    model.eval()
    val_size = 0
    val_accuracy = 0
    with torch.no_grad():
        for batch in val_iter:
            x = batch.text
            y = batch.label.long()
            predictions = model(x)
            loss = criterion(predictions, y)
            
            # compute validation metrics
            val_loss += loss.item() * x.size(0)
            _, predicted_classes = torch.max(predictions, 1)
            val_accuracy += predicted_classes.eq(y).sum().item()
            val_size += x.size(0)
            
        val_loss /= val_size
        val_accuracy /= val_size
        
        print('\nEpoch: {}'.format(epoch))
        print('Train loss: {:.4f} \t Train accuracy: {:.4f}'.format(epoch_loss, epoch_accuracy))
        print('Val loss: {:.4f} \t Val accuracy: {:.4f}'.format(val_loss, val_accuracy))
        writer.add_scalar('validation_epoch_loss', val_loss, epoch + 1)
        writer.add_scalar('validation_epoch_accuracy', val_accuracy, epoch + 1)

100%|██████████| 157/157 [00:15<00:00, 10.04it/s]
  1%|          | 1/157 [00:00<00:17,  8.70it/s]


Epoch: 0
Train loss: 0.2697 	 Train accuracy: 0.8936
Val loss: 0.3524 	 Val accuracy: 0.8564


100%|██████████| 157/157 [00:14<00:00, 10.94it/s]
  1%|▏         | 2/157 [00:00<00:10, 14.87it/s]


Epoch: 1
Train loss: 0.2678 	 Train accuracy: 0.8939
Val loss: 0.3512 	 Val accuracy: 0.8554


100%|██████████| 157/157 [00:14<00:00, 10.99it/s]
  0%|          | 0/157 [00:00<?, ?it/s]


Epoch: 2
Train loss: 0.2658 	 Train accuracy: 0.8946
Val loss: 0.3503 	 Val accuracy: 0.8568


100%|██████████| 157/157 [00:14<00:00, 11.03it/s]
  0%|          | 0/157 [00:00<?, ?it/s]


Epoch: 3
Train loss: 0.2639 	 Train accuracy: 0.8952
Val loss: 0.3495 	 Val accuracy: 0.8572


100%|██████████| 157/157 [00:13<00:00, 11.23it/s]
  1%|          | 1/157 [00:00<00:20,  7.74it/s]


Epoch: 4
Train loss: 0.2618 	 Train accuracy: 0.8963
Val loss: 0.3486 	 Val accuracy: 0.8570


100%|██████████| 157/157 [00:14<00:00, 11.16it/s]
  1%|          | 1/157 [00:00<00:22,  6.90it/s]


Epoch: 5
Train loss: 0.2600 	 Train accuracy: 0.8972
Val loss: 0.3475 	 Val accuracy: 0.8572


100%|██████████| 157/157 [00:14<00:00, 11.10it/s]
  1%|          | 1/157 [00:00<00:22,  6.78it/s]


Epoch: 6
Train loss: 0.2582 	 Train accuracy: 0.8979
Val loss: 0.3469 	 Val accuracy: 0.8582


100%|██████████| 157/157 [00:14<00:00, 11.15it/s]
  1%|          | 1/157 [00:00<00:22,  7.06it/s]


Epoch: 7
Train loss: 0.2561 	 Train accuracy: 0.8990
Val loss: 0.3457 	 Val accuracy: 0.8576


100%|██████████| 157/157 [00:14<00:00, 11.15it/s]
  1%|▏         | 2/157 [00:00<00:08, 17.50it/s]


Epoch: 8
Train loss: 0.2546 	 Train accuracy: 0.8992
Val loss: 0.3449 	 Val accuracy: 0.8582


100%|██████████| 157/157 [00:14<00:00, 11.14it/s]
  0%|          | 0/157 [00:00<?, ?it/s]


Epoch: 9
Train loss: 0.2527 	 Train accuracy: 0.8998
Val loss: 0.3442 	 Val accuracy: 0.8598


100%|██████████| 157/157 [00:14<00:00, 11.19it/s]
  1%|          | 1/157 [00:00<00:16,  9.75it/s]


Epoch: 10
Train loss: 0.2507 	 Train accuracy: 0.9010
Val loss: 0.3433 	 Val accuracy: 0.8602


100%|██████████| 157/157 [00:14<00:00, 11.21it/s]
  1%|          | 1/157 [00:00<00:25,  6.06it/s]


Epoch: 11
Train loss: 0.2490 	 Train accuracy: 0.9016
Val loss: 0.3428 	 Val accuracy: 0.8624


100%|██████████| 157/157 [00:14<00:00, 11.11it/s]
  0%|          | 0/157 [00:00<?, ?it/s]


Epoch: 12
Train loss: 0.2475 	 Train accuracy: 0.9026
Val loss: 0.3426 	 Val accuracy: 0.8626


100%|██████████| 157/157 [00:14<00:00, 11.19it/s]
  1%|▏         | 2/157 [00:00<00:14, 10.92it/s]


Epoch: 13
Train loss: 0.2459 	 Train accuracy: 0.9039
Val loss: 0.3414 	 Val accuracy: 0.8630


100%|██████████| 157/157 [00:14<00:00, 11.12it/s]
  1%|          | 1/157 [00:00<00:16,  9.23it/s]


Epoch: 14
Train loss: 0.2442 	 Train accuracy: 0.9048
Val loss: 0.3406 	 Val accuracy: 0.8636


100%|██████████| 157/157 [00:13<00:00, 11.24it/s]
  0%|          | 0/157 [00:00<?, ?it/s]


Epoch: 15
Train loss: 0.2422 	 Train accuracy: 0.9055
Val loss: 0.3399 	 Val accuracy: 0.8638


100%|██████████| 157/157 [00:14<00:00, 11.13it/s]
  0%|          | 0/157 [00:00<?, ?it/s]


Epoch: 16
Train loss: 0.2409 	 Train accuracy: 0.9055
Val loss: 0.3399 	 Val accuracy: 0.8650


100%|██████████| 157/157 [00:14<00:00, 11.06it/s]
  1%|          | 1/157 [00:00<00:17,  8.80it/s]


Epoch: 17
Train loss: 0.2392 	 Train accuracy: 0.9071
Val loss: 0.3388 	 Val accuracy: 0.8648


100%|██████████| 157/157 [00:14<00:00, 11.07it/s]
  1%|          | 1/157 [00:00<00:17,  8.67it/s]


Epoch: 18
Train loss: 0.2377 	 Train accuracy: 0.9075
Val loss: 0.3381 	 Val accuracy: 0.8646


100%|██████████| 157/157 [00:14<00:00, 11.19it/s]
  1%|          | 1/157 [00:00<00:16,  9.69it/s]


Epoch: 19
Train loss: 0.2358 	 Train accuracy: 0.9082
Val loss: 0.3379 	 Val accuracy: 0.8658


100%|██████████| 157/157 [00:14<00:00, 11.12it/s]
  1%|          | 1/157 [00:00<00:24,  6.29it/s]


Epoch: 20
Train loss: 0.2346 	 Train accuracy: 0.9089
Val loss: 0.3371 	 Val accuracy: 0.8654


100%|██████████| 157/157 [00:14<00:00, 11.04it/s]
  1%|          | 1/157 [00:00<00:17,  8.96it/s]


Epoch: 21
Train loss: 0.2329 	 Train accuracy: 0.9101
Val loss: 0.3365 	 Val accuracy: 0.8652


100%|██████████| 157/157 [00:14<00:00, 10.80it/s]
  1%|          | 1/157 [00:00<00:28,  5.53it/s]


Epoch: 22
Train loss: 0.2315 	 Train accuracy: 0.9102
Val loss: 0.3360 	 Val accuracy: 0.8658


100%|██████████| 157/157 [00:14<00:00, 10.65it/s]
  0%|          | 0/157 [00:00<?, ?it/s]


Epoch: 23
Train loss: 0.2300 	 Train accuracy: 0.9106
Val loss: 0.3359 	 Val accuracy: 0.8674


100%|██████████| 157/157 [00:17<00:00,  9.07it/s]



Epoch: 24
Train loss: 0.2286 	 Train accuracy: 0.9116
Val loss: 0.3352 	 Val accuracy: 0.8668


# Stage 5: Test
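
This stage is left empty in the notebook. A minimal sketch of how the test set could be evaluated, mirroring the validation part of the training loop, is given below. The function name `evaluate` and the toy model and data are assumptions so that the block runs on its own; no test results for the real model are implied.

```python
from typing import Iterable, Tuple

import torch
import torch.nn as nn

def evaluate(model: nn.Module,
             batches: Iterable[Tuple[torch.Tensor, torch.Tensor]],
             criterion: nn.Module) -> Tuple[float, float]:
    """Return (mean loss, accuracy) over an iterable of (inputs, labels) batches."""
    model.eval()
    total_loss, n_correct, n_examples = 0.0, 0, 0
    with torch.no_grad():
        for x, y in batches:
            logits = model(x)
            total_loss += criterion(logits, y).item() * x.size(0)
            n_correct += (logits.argmax(dim=1) == y).sum().item()
            n_examples += x.size(0)
    return total_loss / n_examples, n_correct / n_examples

# Toy stand-ins so the sketch runs on its own: a linear "model" and random data.
torch.manual_seed(0)
toy_model = nn.Linear(4, 2)
toy_batches = [(torch.randn(8, 4), torch.randint(0, 2, (8,))) for _ in range(3)]
test_loss, test_accuracy = evaluate(toy_model, toy_batches, nn.CrossEntropyLoss())
print(f"Test loss: {test_loss:.4f} \t Test accuracy: {test_accuracy:.4f}")
```

With the objects defined earlier in this notebook, the call would be along the lines of `evaluate(model, ((b.text, b.label.long()) for b in test_iter), criterion)`.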

## Extra: Visualize the word embeddings
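
This section is also left empty in the notebook. One way to do it, sketched here under assumptions, is `SummaryWriter.add_embedding`, which sends a matrix and one label per row to TensorBoard's projector. The helper name `log_word_embeddings` is hypothetical, and the `RecordingWriter` class is only a stand-in so the block runs without TensorBoard installed.

```python
from typing import List

import torch
import torch.nn as nn

def log_word_embeddings(writer, embedding: nn.Embedding, itos: List[str],
                        tag: str = "word_embeddings") -> None:
    """Send the embedding matrix and its word labels to TensorBoard's projector."""
    # detach() drops gradient tracking; cpu() moves the weights off the GPU.
    matrix = embedding.weight.detach().cpu()
    if matrix.size(0) != len(itos):
        raise ValueError("one label is needed per embedding row")
    writer.add_embedding(matrix, metadata=itos, tag=tag)

class RecordingWriter:
    """Stand-in for SummaryWriter that only records what it is given."""
    def __init__(self):
        self.calls = []

    def add_embedding(self, mat, metadata=None, tag="default"):
        self.calls.append((tuple(mat.shape), list(metadata), tag))

toy_embedding = nn.Embedding(num_embeddings=4, embedding_dim=3)
toy_itos = ["<unk>", "<pad>", "good", "bad"]
recorder = RecordingWriter()
log_word_embeddings(recorder, toy_embedding, toy_itos)
print(recorder.calls[0][0])
```

With the objects from this notebook, the call would be along the lines of `log_word_embeddings(writer, model.embed, TEXT.vocab.itos)`, after which the embeddings should appear in TensorBoard's Projector tab.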