# Sentiment analysis of IMDb movie reviews

# Bag-of-words model with learned embeddings

## Learnings:

At the end of this lesson you will know how to:

- Create a bag-of-words model for sentiment analysis.

- Feed text data into a PyTorch model using `torchtext`.

- Write a training/validation loop in PyTorch.

- Visualize the trained embeddings with the [Embedding Projector](https://projector.tensorflow.org/).

<a href="https://colab.research.google.com/github/Paulescu/practical-nlp-2021/blob/main/0_sentiment_analysis/0_bag_of_words_with_learned_embeddings.ipynb" target="_blank">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" align="left"/>
</a>

# Stage 1: Data download and pre-processing

The original dataset at `http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz` has one file per review. The files are grouped into folders by split (`train` or `test`) and by sentiment (`pos` or `neg`).
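To make this folder layout concrete, the sketch below builds a tiny stand-in tree in a temporary directory and walks it. The file names and review texts are invented for illustration, and `build_toy_tree` and `list_examples` are hypothetical helpers; only the `train`/`test` and `pos`/`neg` folder names come from the dataset.

```python
import tempfile
from pathlib import Path

def build_toy_tree(root: Path) -> None:
    """Create a miniature tree: <split>/<label>/<one file per review>."""
    toy_reviews = {  # hypothetical contents, for illustration only
        ("train", "pos"): ["Loved it."],
        ("train", "neg"): ["Dull and far too long."],
        ("test", "pos"): ["A real gem."],
        ("test", "neg"): ["Not worth the ticket."],
    }
    for (split, label), texts in toy_reviews.items():
        folder = root / split / label
        folder.mkdir(parents=True)
        for i, text in enumerate(texts):
            (folder / f"{i}.txt").write_text(text, encoding="utf-8")

def list_examples(root: Path):
    """Return sorted (split, label, text) triples found under root."""
    rows = []
    for path in root.glob("*/*/*.txt"):
        split, label = path.parts[-3], path.parts[-2]
        rows.append((split, label, path.read_text(encoding="utf-8")))
    return sorted(rows)

with tempfile.TemporaryDirectory() as tmp:
    build_toy_tree(Path(tmp))
    examples = list_examples(Path(tmp))

for split, label, text in examples:
    print(f"{split}/{label}: {text}")
```

The label of each review is never stored inside the file; it is implied by the folder the file sits in, which is why the reading code has to derive it from the path.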

At the end of Stage 1 the data will be split into three CSV files: `train_data.csv`, `val_data.csv` and `test_data.csv`. In each file, the first column is the review text and the second column is the sentiment (0: negative, 1: positive). This is the format expected by `TabularDataset.splits()`, the `torchtext` convenience function we use to load the data.
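As a rough illustration of this two-column layout, the sketch below writes two made-up reviews to an in-memory CSV with a header row and reads them back. It uses only the standard `csv` module rather than `pandas`, and the review texts are assumptions for the example.

```python
import csv
import io

rows = [
    ("A warm, funny film.", 1),           # hypothetical positive review
    ('Flat, "by the numbers" plot.', 0),  # hypothetical negative review
]

# Write: header row first, then one (text, label) row per review.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["text", "label"])
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Read back: skip the header and convert each label back to an int.
reader = csv.reader(io.StringIO(csv_text))
header = next(reader)
parsed = [(text, int(label)) for text, label in reader]
```

Reviews routinely contain commas and quotation marks, so the text column is quoted where needed. Any CSV reader that follows the same quoting rules recovers the original text unchanged.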

### Download raw data

In [1]:
!wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz

### Read raw data from disk into Python lists

In [2]:
from typing import List, Tuple
from pathlib import Path

def read_imdb_split(split_dir: str) -> Tuple[List[str], List[int]]:
    """
    Auxiliary function to read raw data from disk
    into 2 Python lists, one for texts the other for labels
    """
    split_dir = Path(split_dir)
    texts = []
    labels = []
    for label_dir in ['pos', 'neg']:
        for text_file in (split_dir/label_dir).iterdir():
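            texts.append(text_file.read_text(encoding='utf-8'))
            # compare strings with ==, not `is` (identity of string literals is not guaranteed)
            labels.append(0 if label_dir == "neg" else 1)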
    return texts, labels

train_texts_, train_labels_ = read_imdb_split('aclImdb/train')
print('train_texts: ', len(train_texts_))

test_texts, test_labels = read_imdb_split('aclImdb/test')
print('test_texts: ', len(test_texts))

train_texts:  25000
test_texts:  25000


### Split data into train, validation and test sets

In [3]:
from sklearn.model_selection import train_test_split

train_texts, val_texts, train_labels, val_labels = \
    train_test_split(train_texts_, train_labels_, test_size=0.2, random_state=1)

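# Optional: uncomment the lines below to keep only 100 examples per split for a
# quick smoke test. The outputs shown in this notebook come from such a run.
# train_texts = train_texts[:100]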
# train_labels = train_labels[:100]
# val_texts = val_texts[:100]
# val_labels = val_labels[:100]
# test_texts = test_texts[:100]
# test_labels = test_labels[:100]

print('train_texts: ', len(train_texts))
print('val_texts: ', len(val_texts))
print('test_texts: ', len(test_texts))

train_texts:  100
val_texts:  100
test_texts:  100


### Save `train_data.csv`, `val_data.csv`, `test_data.csv`

In [4]:
import pandas as pd

train_data = pd.DataFrame({'text': train_texts, 'label': train_labels},
                          columns=['text', 'label'])
val_data = pd.DataFrame({'text': val_texts, 'label': val_labels},
                        columns=['text', 'label'])
test_data = pd.DataFrame({'text': test_texts, 'label': test_labels},
                         columns=['text', 'label'])

train_data.to_csv('train_data.csv', index=False)
val_data.to_csv('val_data.csv', index=False)
test_data.to_csv('test_data.csv', index=False)

# Stage 2: Define PyTorch `Dataset` and `DataLoader`s using `torchtext`

### Define `Field` objects

In [5]:
!python -m spacy download en

In [6]:
from torchtext.data import Field

TEXT = Field(sequential=True, tokenize='spacy', lower=True, batch_first=True)
LABEL = Field(sequential=False, use_vocab=False)



### Create `Dataset` objects for train, validation, test files

A `Dataset` is a list of `Example` objects. Each `Example` stores one value per `Field` name, exposed as an attribute (for example, `example.text`).

`TabularDataset` provides a convenient way to load columnar data from a CSV file into a PyTorch `Dataset`.

In [7]:
from torchtext.data import TabularDataset

fields = [('text', TEXT), ('label', LABEL)]

train, val, test = TabularDataset.splits(
    path='',
    train='train_data.csv',
    validation='val_data.csv',
    test='test_data.csv',
    format='csv',
    skip_header=True,
    fields=fields,
)



In [8]:
import torchtext

print(type(train))
print(TabularDataset.__bases__[0])
print(torchtext.data.dataset.Dataset.__bases__[0])
print(type(train[0]))

<class 'torchtext.data.dataset.TabularDataset'>
<class 'torchtext.data.dataset.Dataset'>
<class 'torch.utils.data.dataset.Dataset'>
<class 'torchtext.data.example.Example'>


You can index a `Dataset` the same way you index a Python `list`, and read the fields of each `Example` the same way you read attributes of a Python object.

In [9]:
print('text: ', train[0].text)
print('\nlabel: ', train[0].label)

text:  ['there', 'are', 'so', 'many', 'good', 'things', 'to', 'say', 'about', 'this', '“', 'b', '”', 'movie.<br', '/><br', '/>“b', '’', 'maybe', 'in', 'connections', ',', 'but', 'not', 'in', 'commission', '.', 'this', 'is', 'about', 'the', 'best', 'of', 'its', 'genre', 'that', 'i', 'have', 'ever', 'seen', '.', 'a', 'grade', 'a', 'effort', 'by', 'universal', '.', 'the', 'script', 'is', 'well', 'done', ',', 'imaginative', ',', 'and', 'without', 'fault', '.', 'writing', 'credits', ':', 'howard', 'higgin', 'original', 'story', '&', 'douglas', 'hodges', 'story', ',', 'john', 'colton', '(', 'screenplay', ')', '.', 'director', 'lambert', 'hillyer', 'handled', 'the', 'complex', 'story', 'and', 'story', 'locations', 'very', 'well', '.', 'no', 'skimping', 'on', 'the', 'loads', 'of', 'extras', 'and', 'locations', '.', 'i', 'loved', 'beulah', 'bondy', '(', 'jimmy', 'stewarts', 'mother', 'in', '“', 'it', '’s', 'a', 'wonderful', 'life', '”', '.', 'the', 'fem', 'lead', ',', 'frances', 'drake', 'is', 

### Build the vocabulary using the training data

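A `Vocab` object maps each word it keeps from the training set to a unique integer. With the settings used below, a word is kept only if it appears at least twice (`min_freq=2`), up to a maximum of 10,000 words (`max_size=10000`); every other word is mapped to the unknown token.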

Once you build the vocabulary you can express sentences as sequences of integers.

Only training data is used to generate the vocabulary.

In [10]:
TEXT.build_vocab(train, max_size=10000, min_freq=2)

# we will need this later
vocab_size = len(TEXT.vocab)
print('Vocabulary size: ', vocab_size)

Vocabulary size:  1757
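The effect of `max_size` and `min_freq` can be sketched in plain Python. This is a simplified stand-in, not `torchtext`'s implementation: `build_vocab` and `encode` are hypothetical helpers, and the only details taken from this notebook are that index 0 is the unknown token and index 1 is the padding token.

```python
from collections import Counter
from typing import Dict, List

def build_vocab(tokenized_texts: List[List[str]], max_size: int, min_freq: int) -> Dict[str, int]:
    """Map the most frequent tokens to integer ids; 0 and 1 are reserved."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    # Keep tokens seen at least min_freq times, most frequent first
    # (ties broken alphabetically so the result is deterministic).
    kept = sorted(
        (tok for tok, c in counts.items() if c >= min_freq),
        key=lambda tok: (-counts[tok], tok),
    )[:max_size]
    vocab = {"<unk>": 0, "<pad>": 1}
    for tok in kept:
        vocab[tok] = len(vocab)
    return vocab

def encode(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Replace each token by its id; unseen or rare tokens become <unk> (0)."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]

train_tokens = [
    ["a", "great", "movie"],
    ["a", "boring", "movie"],
    ["great", "acting"],
]
vocab = build_vocab(train_tokens, max_size=10000, min_freq=2)
print(vocab)
print(encode(["a", "great", "sequel"], vocab))
```

In this toy run, "boring" and "acting" occur only once, so they are dropped, and the unseen word "sequel" is encoded as 0. The same thing happens to rare or new words in the validation and test reviews.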


### Create the train/validation/test dataloaders

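In `torchtext`, data loaders are called iterators. A `BucketIterator` batches together examples of similar length, which keeps the amount of padding per batch small.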

In [12]:
import torch
from torchtext.data import BucketIterator

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
batch_size = 128

train_iter, val_iter, test_iter = BucketIterator.splits(
    (train, val, test),
    batch_sizes=(batch_size, batch_size, batch_size),
    device=device,
    sort_key=lambda x: len(x.text),
    sort_within_batch=True,
)
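The idea behind length-based bucketing can be sketched without `torchtext`. The code below is a deliberately simplified stand-in (no shuffling, no tensors): `make_padded_batches` is a hypothetical helper, and `PAD_ID = 1` follows the padding index used in this notebook.

```python
from typing import List

PAD_ID = 1  # padding index, matching the <pad> token id used in this notebook

def make_padded_batches(sequences: List[List[int]], batch_size: int) -> List[List[List[int]]]:
    """Group sequences of similar length and pad each batch to a rectangle."""
    ordered = sorted(sequences, key=len)  # plays the role of sort_key
    batches = []
    for start in range(0, len(ordered), batch_size):
        chunk = ordered[start:start + batch_size]
        longest = max(len(seq) for seq in chunk)
        # Longest first within the batch, similar in spirit to sort_within_batch=True.
        chunk = sorted(chunk, key=len, reverse=True)
        batches.append([seq + [PAD_ID] * (longest - len(seq)) for seq in chunk])
    return batches

encoded_reviews = [[5, 8, 2, 9, 4, 7], [3, 6], [4, 4, 2], [7], [9, 2, 3, 5, 8]]
batches = make_padded_batches(encoded_reviews, batch_size=2)
for batch in batches:
    print(batch)
```

Because neighbouring lengths end up in the same batch, the one-token review is padded to length 2 rather than to length 6, which is the saving that bucketing is meant to provide.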



**What is `train_iter`?**

In [13]:
# it is an iterator
print(type(train_iter))

# let's take the first batch and check what is inside
x = next(iter(train_iter))
print(x, '\n')
print('text: \n', x.text)
print('label: \n', x.label)

<class 'torchtext.data.iterator.BucketIterator'>

[torchtext.data.batch.Batch of size 100]
	[.text]:[torch.LongTensor of size 100x1143]
	[.label]:[torch.LongTensor of size 100] 

text: 
 tensor([[  10,   31,  137,  ...,    0,   18,    0],
        [  48,  236,    5,  ...,    1,    1,    1],
        [  66,    2,  585,  ...,    1,    1,    1],
        ...,
        [1491,    0,    9,  ...,    1,    1,    1],
        [ 952,    3,    0,  ...,    1,    1,    1],
        [  71,  510,  194,  ...,    1,    1,    1]])
label: 
 tensor([0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1,
        1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1,
        0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1,
        1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0,
        1, 1, 0, 1])




# Stage 3: Define the neural net model

In [14]:
# TODO: add diagram here

In [15]:
import torch.nn as nn
import torch.nn.functional as F
    
class Model(nn.Module):
    
    def __init__(self, vocab_size: int, embedding_dim: int):
        super(Model, self).__init__()
        self.embed = nn.Embedding(vocab_size, embedding_dim)
        self.global_avg_pooling = lambda x: torch.mean(x, dim=-2)
        self.fc1 = nn.Linear(embedding_dim, 16)
        self.fc2 = nn.Linear(16, 2)
        
    def forward(self, x):
        x = self.embed(x)
        x = self.global_avg_pooling(x)

        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        
        return x
    
model = Model(vocab_size, 16).to(device)
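The `global_avg_pooling` step averages over every position of the padded batch, so for shorter reviews the padding embeddings are part of the mean. As a possible variation (not what this notebook does), the sketch below averages over real tokens only. The helper name `masked_mean_pool` and the toy tensor sizes are assumptions for illustration; `PAD_ID = 1` follows the padding index used in this notebook.

```python
import torch
import torch.nn as nn

PAD_ID = 1  # padding index used by the vocabulary in this notebook

def masked_mean_pool(embedded: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
    """Average embeddings over real tokens only, ignoring padding positions."""
    mask = (token_ids != PAD_ID).unsqueeze(-1).float()  # (batch, seq_len, 1)
    summed = (embedded * mask).sum(dim=1)                # (batch, embedding_dim)
    lengths = mask.sum(dim=1).clamp(min=1.0)             # (batch, 1), avoids division by zero
    return summed / lengths

torch.manual_seed(0)
embed = nn.Embedding(num_embeddings=10, embedding_dim=4)
token_ids = torch.tensor([[2, 3, 4], [5, 1, 1]])  # the second review is padded
embedded = embed(token_ids)                        # (2, 3, 4)

plain = embedded.mean(dim=-2)                      # what global_avg_pooling computes
masked = masked_mean_pool(embedded, token_ids)

print(plain.shape, masked.shape)
```

For the unpadded first review both versions agree. For the second review the masked version returns the embedding of token 5 alone, while the plain mean mixes in two copies of the padding embedding. Whether this changes accuracy on IMDb is not something this sketch measures.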

# Stage 4: Train

### Loss function and optimizer

In [16]:
import torch.optim as optim

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=3e-4)
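To see what the loss measures, here is a minimal plain-Python sketch of cross-entropy for a single review with two output scores (logits). The function `cross_entropy` and the example logits are assumptions for illustration; `nn.CrossEntropyLoss` computes the same quantity per example and, by default, averages it over the batch.

```python
import math
from typing import List

def cross_entropy(logits: List[float], target: int) -> float:
    """Negative log of the softmax probability assigned to the target class."""
    shift = max(logits)  # subtract the max for numerical stability
    log_sum_exp = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return log_sum_exp - logits[target]

# One review, two output scores: index 0 = negative, index 1 = positive.
confident_right = cross_entropy([-2.0, 2.0], target=1)
undecided = cross_entropy([0.0, 0.0], target=1)
confident_wrong = cross_entropy([2.0, -2.0], target=1)
print(f"{confident_right:.4f} {undecided:.4f} {confident_wrong:.4f}")
```

A model that gives both classes the same score has a loss of ln 2 ≈ 0.693. The first-epoch losses reported further down (about 0.71) are close to this value, which is what one would expect from a model that has not yet learned to separate the two classes.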

### Launch TensorBoard

In [17]:
%load_ext tensorboard
%tensorboard --logdir runs

Reusing TensorBoard on port 6007 (pid 46459), started 0:23:18 ago. (Use '!kill 46459' to kill it.)

### Train loop

In [18]:
# Setup logging to Tensorboard
from torch.utils.tensorboard import SummaryWriter
from datetime import datetime

now = datetime.now()
now = now.strftime("%Y-%m-%d-%H:%M:%S")
model_name = 'bag_of_words_embeddings_from_scratch'
log_file = f'./runs/{model_name}/{now}'
writer = SummaryWriter(log_file)

# training loop
from tqdm import tqdm

n_epochs = 100
for epoch in range(n_epochs):
    
    # train
    running_loss = 0.0
    model.train()
    train_size = 0
    running_accuracy = 0.0
    for batch in tqdm(train_iter):
        
        # forward pass to compute the batch loss
        x = batch.text
        y = batch.label.long()
        predictions = model(x)        
        loss = criterion(predictions, y)
            
        # backward pass to update model parameters
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        # compute train metrics
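        # loss is the batch mean, so weight it by the batch size;
        # .item() converts the 0-dim tensor to a plain Python float
        running_loss += loss.item() * x.size(0)
        _, predicted_classes = torch.max(predictions, 1)
        running_accuracy += predicted_classes.eq(y).sum().item()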
        train_size += x.size(0)
        
    epoch_loss = running_loss / train_size
    writer.add_scalar('training_epoch_loss', epoch_loss, epoch + 1)
    epoch_accuracy = running_accuracy / train_size
    writer.add_scalar('training_epoch_accuracy', epoch_accuracy, epoch + 1)
    
    # validation
    val_loss = 0.0
    model.eval()
    val_size = 0
    val_accuracy = 0
    with torch.no_grad():
        for batch in val_iter:
            x = batch.text
            y = batch.label.long()
            predictions = model(x)
            loss = criterion(predictions, y)
            
            # compute validation metrics
            val_loss += loss.item() * x.size(0)
            _, predicted_classes = torch.max(predictions, 1)
            val_accuracy += predicted_classes.eq(y).sum().item()
            val_size += x.size(0)
            
        val_loss /= val_size
        val_accuracy /= val_size
        
        print('\nEpoch: {}'.format(epoch))
        print('Train loss: {:.4f} \t Train accuracy: {:.4f}'.format(epoch_loss, epoch_accuracy))
        print('Val loss: {:.4f} \t Val accuracy: {:.4f}'.format(val_loss, val_accuracy))
        writer.add_scalar('validation_epoch_loss', val_loss, epoch + 1)
        writer.add_scalar('validation_epoch_accuracy', val_accuracy, epoch + 1)

writer.close()

100%|██████████| 1/1 [00:00<00:00,  3.40it/s]


Epoch: 0
Train loss: 0.7096 	 Train accuracy: 0.4600
Val loss: 0.7170 	 Val accuracy: 0.4100
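The training loop multiplies each batch loss by the batch size before summing, then divides by the number of examples. The sketch below, with made-up numbers and a hypothetical helper `epoch_average`, shows why: the last batch of an epoch is usually smaller, and a plain average of batch means would give its examples too much weight.

```python
from typing import List, Tuple

def epoch_average(batch_stats: List[Tuple[float, int]]) -> float:
    """Combine per-batch mean losses into a per-example mean for the epoch."""
    total_loss = sum(mean_loss * size for mean_loss, size in batch_stats)
    total_examples = sum(size for _, size in batch_stats)
    return total_loss / total_examples

# (mean loss of the batch, number of examples in the batch); made-up numbers
batch_stats = [(0.50, 128), (0.70, 128), (0.90, 32)]

weighted = epoch_average(batch_stats)
naive = sum(loss for loss, _ in batch_stats) / len(batch_stats)
print(f"weighted: {weighted:.4f}  naive: {naive:.4f}")
```

Here the small final batch has the highest loss, so the naive average overstates the epoch loss compared with the per-example average.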





# Stage 5: Test

In [25]:
model.eval()  # make sure the model is in evaluation mode
test_accuracy = 0.0
test_size = 0
with torch.no_grad():
    for batch in test_iter:
        # forward pass
        x = batch.text
        y = batch.label.long()
        predictions = model(x)

        # compute accuracy
        _, predicted_classes = torch.max(predictions, 1)
        test_accuracy += predicted_classes.eq(y).sum().item()
        test_size += x.size(0)

test_accuracy /= test_size
print('Test accuracy: {:.4f}'.format(test_accuracy))

100%|██████████| 1/1 [00:00<00:00, 40.08it/s]

Test accuracy: 0.9200





# Extra: Visualize the learned word embeddings with the [Embedding Projector](https://projector.tensorflow.org/)

### Extract embedding parameters

In [43]:
for name, parameter in model.named_parameters():
    if name == 'embed.weight':
        embeddings = parameter

print(embeddings.shape)

torch.Size([1757, 16])


### Generate tsv files

In [47]:
import io

embeddings = embeddings.cpu().detach().numpy()
vocab = TEXT.vocab.itos

# line i of vectors.tsv holds the vector of the word on line i of metadata.tsv
with io.open('vectors.tsv', 'w', encoding='utf-8') as out_v, \
     io.open('metadata.tsv', 'w', encoding='utf-8') as out_m:
    for index, word in enumerate(vocab):
        if index in [0, 1]:
            # skip 0, it's the unknown token.
            # skip 1, it's the padding token.
            continue

        vec = embeddings[index, :]
        out_v.write('\t'.join([str(x) for x in vec]) + "\n")
        out_m.write(word + "\n")
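Before uploading, it can help to check that the two files line up: one tab-separated vector per line in the vectors file, and the matching word on the same line of the metadata file. The sketch below runs such a check on toy files in a temporary directory. `check_projector_files` and `read_lines` are hypothetical helpers, and the toy vectors are made up.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple

def read_lines(path: Path) -> List[str]:
    """Read a text file into a list of lines without the trailing newline."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]

def check_projector_files(vectors_path: Path, metadata_path: Path) -> Tuple[int, int]:
    """Validate the two files and return (number of rows, vector dimension)."""
    vector_rows = [line.split("\t") for line in read_lines(vectors_path)]
    labels = read_lines(metadata_path)
    if len(vector_rows) != len(labels):
        raise ValueError("vectors and metadata must have the same number of lines")
    dims = {len(row) for row in vector_rows}
    if len(dims) != 1:
        raise ValueError("every vector must have the same number of columns")
    for row in vector_rows:
        for value in row:
            float(value)  # raises ValueError if a cell is not numeric
    return len(vector_rows), dims.pop()

# Toy stand-ins for vectors.tsv and metadata.tsv (values are made up).
toy_vectors = {"good": [0.12, -0.40], "bad": [-0.33, 0.25], "movie": [0.01, 0.02]}

with tempfile.TemporaryDirectory() as tmp:
    vectors_path = Path(tmp) / "vectors.tsv"
    metadata_path = Path(tmp) / "metadata.tsv"
    vectors_path.write_text(
        "".join("\t".join(str(x) for x in vec) + "\n" for vec in toy_vectors.values()),
        encoding="utf-8",
    )
    metadata_path.write_text("".join(word + "\n" for word in toy_vectors), encoding="utf-8")
    n_rows, n_dims = check_projector_files(vectors_path, metadata_path)

print(n_rows, n_dims)
```

Pointing the same function at the real `vectors.tsv` and `metadata.tsv` should report the vocabulary size minus the two skipped special tokens, and the embedding dimension.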

### Download files to your local computer (in case you are running this notebook in Google Colab)

In [48]:
try:
    from google.colab import files
    files.download('vectors.tsv')
    files.download('metadata.tsv')
except ImportError:
    # not running in Google Colab: the files are already on the local disk
    pass