### Faster Sentiment Analysis.

> "This paper explores a simple and efficient baseline for text classification. Our experiments show that our fast text classifier fastText is often on par with deep learning classifiers in terms of accuracy, and many orders of magnitude faster for training and evaluation. We can train fastText on more than one billion words in less than ten minutes using a standard multicore CPU, and classify half a million sentences among 312K classes in less than a minute." - Fast Text

This notebook is based on the paper [Bag of Tricks for Efficient Text Classification](https://arxiv.org/abs/1607.01759).


In this notebook we are going to implement a model based on that paper that yields results comparable to the previous model. This model trains in a short amount of time and has few trainable parameters.


### Data Preparation.
One of the key concepts in the ``FastText`` paper is that they calculate the ``n-grams`` of an input sentence and append them to the end of a sentence. Here, we'll use ``bi-grams``.

We are going to create a ``generate_bigrams`` function that takes a sentence that has already been tokenized, calculates the bi-grams and appends them to the end of the tokenized list.

In [1]:

def generate_bigrams(x):
  n_grams = set(zip(*[x[i:] for i in range(2)]))
  for n_gram in n_grams:
      x.append(' '.join(n_gram))
  return x
generate_bigrams(['This', 'film', 'is', 'terrible'])

['This', 'film', 'is', 'terrible', 'film is', 'is terrible', 'This film']
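
Because ``generate_bigrams`` collects the bi-grams in a ``set``, repeated bi-grams are added only once and the order in which they are appended can differ between runs; the function also modifies the list it is given. The sketch below is a hypothetical variant, ``generate_bigrams_ordered`` (not part of the original notebook), that keeps first-occurrence order and returns a new list, which can make outputs easier to compare.

```python
def generate_bigrams_ordered(tokens):
    """Return tokens followed by their unique bi-grams in first-seen order.

    Hypothetical variant for illustration; the input list is left unchanged.
    """
    seen = set()
    bigrams = []
    for first, second in zip(tokens, tokens[1:]):
        bigram = f"{first} {second}"
        if bigram not in seen:  # keep each bi-gram once, like the set version
            seen.add(bigram)
            bigrams.append(bigram)
    return list(tokens) + bigrams


tokens = ['This', 'film', 'is', 'terrible']
print(generate_bigrams_ordered(tokens))
```

Either version can be passed as the ``preprocessing`` function; the model averages the embeddings, so the order of the appended bi-grams should not affect its output.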

**TorchText** ``Fields`` have a ``preprocessing`` argument, a function that is applied to a sentence after it has been tokenized, **but before it has been numericalized**. This is where we'll pass our ``generate_bigrams`` function.

As we aren't using an ``RNN``, we can't use packed padded sequences, so we do not need to set ``include_lengths = True``.

In [2]:
import torch
from torchtext.legacy import data, datasets

In [3]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
SEED = 42

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

In [4]:
TEXT = data.Field(tokenize = 'spacy',
                  tokenizer_language = 'en_core_web_sm',
                  preprocessing = generate_bigrams)

LABEL = data.LabelField(dtype = torch.float)

### We need to split the data.

In [5]:
import random

In [6]:
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)
train_data, valid_data = train_data.split(random_state = random.seed(SEED))

downloading aclImdb_v1.tar.gz


aclImdb_v1.tar.gz: 100%|██████████| 84.1M/84.1M [00:05<00:00, 15.6MB/s]
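
``train_data.split`` divides the training examples into a training portion and a validation portion. As a rough illustration of the idea, the sketch below performs a seeded shuffle-and-cut on a list of indices. The 70/30 ratio is an assumption (it is believed to be the default ``split_ratio`` in ``torchtext.legacy``, so check your version), and ``seeded_split`` is a hypothetical helper rather than torchtext's implementation.

```python
import random


def seeded_split(n_examples, split_ratio=0.7, seed=42):
    """Shuffle indices with a fixed seed and cut them into two parts."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)  # local RNG, global state untouched
    cut = int(round(split_ratio * n_examples))
    return indices[:cut], indices[cut:]


# The IMDB training set contains 25,000 reviews.
train_idx, valid_idx = seeded_split(25_000)
print(len(train_idx), len(valid_idx))
```

Fixing the seed means the same examples land in the validation set on every run, which keeps validation scores comparable between experiments.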


### Build the vocab and load the pre-trained word embeddings.

In [7]:
MAX_VOCAB_SIZE = 25_000

TEXT.build_vocab(train_data, 
                 max_size = MAX_VOCAB_SIZE, 
                 vectors = "glove.6B.100d", 
                 unk_init = torch.Tensor.normal_)

LABEL.build_vocab(train_data)

.vector_cache/glove.6B.zip: 862MB [02:42, 5.29MB/s]                           
100%|█████████▉| 399479/400000 [00:14<00:00, 28432.33it/s]
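
With ``max_size = MAX_VOCAB_SIZE``, the vocabulary keeps the most frequent tokens (words and bi-grams alike) and adds the special ``<unk>`` and ``<pad>`` tokens, which is why the embedding layer later reports 25,002 rows. The sketch below imitates that behaviour on a toy corpus; ``build_toy_vocab`` is a hypothetical helper, not the torchtext API.

```python
from collections import Counter


def build_toy_vocab(tokenized_sentences, max_size, specials=('<unk>', '<pad>')):
    """Keep the max_size most frequent tokens and prepend the special tokens."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    itos = list(specials) + [tok for tok, _ in counts.most_common(max_size)]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi


corpus = [
    ['good', 'film', 'good film'],
    ['bad', 'film', 'bad film'],
    ['good', 'plot', 'good plot'],
]
itos, stoi = build_toy_vocab(corpus, max_size=3)
print(len(itos))  # max_size plus the two special tokens

# Tokens outside the vocabulary fall back to the <unk> index.
unk_idx = stoi['<unk>']
indexed = [stoi.get(tok, unk_idx) for tok in ['good', 'film', 'boring']]
print(indexed)
```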

### Create Iterators and push them to `device`.

In [8]:
BATCH_SIZE = 64
train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE, 
    device = device
    )
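
``BucketIterator`` tries to batch examples of similar length together so that less padding is needed. The sketch below is a simplified illustration of that idea (sort by length, then slice into batches), not torchtext's actual algorithm; ``padding_needed`` and the toy lengths are made up for the example.

```python
def padding_needed(lengths, batch_size):
    """Count pad tokens when each batch is padded to its longest example."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += sum(max(batch) - length for length in batch)
    return total


lengths = [12, 250, 31, 240, 15, 33, 260, 10]  # toy sentence lengths
unsorted_pad = padding_needed(lengths, batch_size=4)
bucketed_pad = padding_needed(sorted(lengths), batch_size=4)
print(unsorted_pad, bucketed_pad)
```

Less padding matters for this model in particular, because the averaging step treats padded positions like any other position in the sentence.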

### Building a Model.
This model has far fewer parameters than the previous model, as only ``2`` of its layers have any parameters: the ``embedding layer`` and the ``linear layer``.

There is no ``RNN`` component. Instead, the model first calculates the word embedding for each word using the Embedding layer _(blue)_, then calculates the average of all of the word embeddings _(pink)_ and feeds this through the Linear layer _(silver)_, and that's it!

<p align="center">
  <img src="https://github.com/bentrevett/pytorch-sentiment-analysis/raw/2b666b3cba7d629a2f192c7d9c66fadcc9f0c363/assets/sentiment8.png"/>
</p>

We implement the averaging with the ``avg_pool2d`` (2-dimensional average pooling) function. Using 2-dimensional pooling may seem strange at first, since our sentences are 1-dimensional, not 2-dimensional. However, you can think of the word embeddings as a 2-dimensional grid, where the words are along one axis and the dimensions of the word embeddings are along the other. The image below is an example sentence after being converted into 5-dimensional word embeddings, with the words along the vertical axis and the embedding dimensions along the horizontal axis. Each element in this ``[4x5]`` tensor is represented by a green block.

<p align="center">
  <img src="https://github.com/bentrevett/pytorch-sentiment-analysis/raw/2b666b3cba7d629a2f192c7d9c66fadcc9f0c363/assets/sentiment9.png"/>
</p>

The ``avg_pool2d`` uses a filter of size ``embedded.shape[1]`` (i.e. the length of the sentence) by ``1``. This is shown in pink in the image below.

<p align="center">
  <img src="https://github.com/bentrevett/pytorch-sentiment-analysis/raw/2b666b3cba7d629a2f192c7d9c66fadcc9f0c363/assets/sentiment10.png"/>
</p>

We calculate the average value of all elements covered by the filter. The filter then slides to the right, calculating the average over the next column of embedding values for each word in the sentence.

<p align="center">
  <img src="https://github.com/bentrevett/pytorch-sentiment-analysis/raw/2b666b3cba7d629a2f192c7d9c66fadcc9f0c363/assets/sentiment11.png"/>
</p>

Each filter position gives us a single value, the average of all covered elements. After the filter has covered all embedding dimensions we get a ``[1x5]`` tensor. This tensor is then passed through the linear layer to produce our prediction.
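
The sliding-filter description above amounts to a column-wise mean over the words of the sentence. The sketch below reproduces it in plain Python for a toy ``[4x5]`` grid of made-up embedding values; in the model the same reduction is done by ``F.avg_pool2d`` over a whole batch.

```python
# Toy sentence: 4 words, each with a 5-dimensional embedding (made-up values).
embedded = [
    [1.0, 2.0, 0.0, 4.0, 1.0],
    [3.0, 0.0, 2.0, 0.0, 1.0],
    [1.0, 2.0, 2.0, 0.0, 1.0],
    [3.0, 0.0, 0.0, 4.0, 1.0],
]

sent_len = len(embedded)
# zip(*embedded) yields one column (embedding dimension) at a time,
# which mirrors the [sent_len x 1] filter sliding to the right.
pooled = [sum(column) / sent_len for column in zip(*embedded)]
print(pooled)  # one value per embedding dimension, i.e. a [1x5] result
```

Note that in a padded batch the average is taken over every position, so padded positions also count towards the denominator.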

In [9]:
from torch.nn import functional as F
import torch.nn as nn

In [12]:
class FastText(nn.Module):
  def __init__(self, vocab_size, embedding_dim, output_dim, pad_idx):
    super(FastText, self).__init__()

    self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
    self.fc = nn.Linear(embedding_dim, output_dim)

  def forward(self, text):
    #text = [sent len, batch size]
    embedded = self.embedding(text)
    #embedded = [sent len, batch size, emb dim]
    embedded = embedded.permute(1, 0, 2)
    #embedded = [batch size, sent len, emb dim]
    pooled = F.avg_pool2d(embedded, (embedded.shape[1], 1)).squeeze(1) 
    #pooled = [batch size, embedding_dim]
    return self.fc(pooled)

### Creating the instance of the `FastText` model.

In [13]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
OUTPUT_DIM = 1
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]

model = FastText(INPUT_DIM, EMBEDDING_DIM, OUTPUT_DIM, PAD_IDX)
model

FastText(
  (embedding): Embedding(25002, 100, padding_idx=1)
  (fc): Linear(in_features=100, out_features=1, bias=True)
)

### Checking number of trainable parameters in our `FastText` model.

In [15]:
def count_trainable_params(model):
  return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_trainable_params(model):,} trainable parameters')

The model has 2,500,301 trainable parameters
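
This figure can be checked by hand: the embedding layer holds one ``EMBEDDING_DIM``-sized vector per vocabulary entry, and the linear layer holds one weight per embedding dimension plus a bias. The values below are taken from the model summary printed earlier.

```python
vocab_size = 25_002   # rows of the embedding layer
embedding_dim = 100
output_dim = 1

embedding_params = vocab_size * embedding_dim
linear_params = embedding_dim * output_dim + output_dim  # weights + bias
total = embedding_params + linear_params
print(f'{total:,}')
```

Almost all of the parameters sit in the embedding layer; the linear layer contributes only 101.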


### Copying the pretrained vectors to the `embedding` layer.

In [16]:
pretrained_embeddings = TEXT.vocab.vectors

In [17]:
model.embedding.weight.data.copy_(pretrained_embeddings)

tensor([[ 1.9269,  1.4873,  0.9007,  ...,  0.1233,  0.3499,  0.6173],
        [ 0.7262,  0.0912, -0.3891,  ...,  0.0821,  0.4440, -0.7240],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [-0.0232, -0.1614,  0.2054,  ...,  0.5729,  1.6427,  0.1845],
        [ 0.2418, -0.4951,  0.7971,  ..., -0.0517,  0.3518,  0.1536],
        [-1.8498, -0.1302, -0.6559,  ..., -0.3399,  1.0973, -0.7170]])

### Initialising `padding` and `unknown` weights to zeros.

In [18]:
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]

model.embedding.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)
print(model.embedding.weight.data)

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [-0.0232, -0.1614,  0.2054,  ...,  0.5729,  1.6427,  0.1845],
        [ 0.2418, -0.4951,  0.7971,  ..., -0.0517,  0.3518,  0.1536],
        [-1.8498, -0.1302, -0.6559,  ..., -0.3399,  1.0973, -0.7170]])


### Training the Model.

In [19]:
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.BCEWithLogitsLoss()
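
``nn.BCEWithLogitsLoss`` combines the sigmoid and the binary cross-entropy in one numerically stable step, which is why the model returns raw scores (logits) rather than probabilities. The plain-Python sketch below shows the per-example formula in a naive and a stable form; ``bce_naive`` and ``bce_with_logits`` are illustrative names, and the stable expression is the commonly used rearrangement rather than a copy of PyTorch's source.

```python
import math


def bce_naive(logit, target):
    """Binary cross-entropy computed from an explicit sigmoid."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def bce_with_logits(logit, target):
    """Same loss, rearranged as max(x, 0) - x*y + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


for logit, target in [(2.0, 1.0), (-1.5, 0.0), (0.3, 0.0)]:
    print(round(bce_naive(logit, target), 6), round(bce_with_logits(logit, target), 6))

# A confident wrong prediction: the stable form stays finite, whereas the
# naive form would try to evaluate log(0) for this input.
print(bce_with_logits(50.0, 0.0))
```

With the default settings the loss reported for a batch is the mean of these per-example values.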

### Pushing the Model and Loss Function to the `device`

In [20]:
model = model.to(device)
criterion = criterion.to(device)

### The Accuracy Function

In [21]:
def accuracy(y_preds, y_true):
  #round predictions to the closest integer
  rounded_preds = torch.round(torch.sigmoid(y_preds))
  correct = (rounded_preds == y_true).float() #convert into float for division 
  acc = correct.sum() / len(correct)
  return acc
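
Rounding the sigmoid output is the same as thresholding the raw logit at zero, since ``sigmoid(0) = 0.5``. The sketch below mirrors the accuracy computation in plain Python on made-up logits; ``toy_accuracy`` is an illustrative helper, not part of the notebook.

```python
import math


def toy_accuracy(logits, labels):
    """Fraction of examples whose thresholded sigmoid matches the label."""
    probs = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    preds = [1.0 if p > 0.5 else 0.0 for p in probs]  # same as logit > 0
    correct = [float(pred == label) for pred, label in zip(preds, labels)]
    return sum(correct) / len(correct)


logits = [2.3, -0.7, 0.1, -4.0]
labels = [1.0, 0.0, 0.0, 0.0]
print(toy_accuracy(logits, labels))  # the third example is misclassified
```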

### Training and Evaluation Functions

In [22]:
def train(model, iterator, optimizer, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.train()
    for batch in iterator:
        optimizer.zero_grad()
        text = batch.text
        predictions = model(text).squeeze(1)
        loss = criterion(predictions, batch.label)
        acc = accuracy(predictions, batch.label)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
        epoch_acc += acc.item()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)


def evaluate(model, iterator, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.eval()
    with torch.no_grad():
        for batch in iterator:
            text = batch.text
            predictions = model(text).squeeze(1)
            loss = criterion(predictions, batch.label)
            acc = accuracy(predictions, batch.label)
            epoch_loss += loss.item()
            epoch_acc += acc.item()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

We'll also create a function to tell us how long an epoch takes to compare training times between models.

In [23]:
import time
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

### Running Training loop

In [26]:
N_EPOCHS = 5
best_valid_loss = float('inf')
for epoch in range(N_EPOCHS):
    start_time = time.time()
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    end_time = time.time()
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'best-model.pt')
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01 | Epoch Time: 0m 8s
	Train Loss: 0.577 | Train Acc: 80.38%
	 Val. Loss: 0.410 |  Val. Acc: 81.47%
Epoch: 02 | Epoch Time: 0m 8s
	Train Loss: 0.497 | Train Acc: 84.67%
	 Val. Loss: 0.377 |  Val. Acc: 84.42%
Epoch: 03 | Epoch Time: 0m 8s
	Train Loss: 0.428 | Train Acc: 87.32%
	 Val. Loss: 0.375 |  Val. Acc: 85.70%
Epoch: 04 | Epoch Time: 0m 8s
	Train Loss: 0.379 | Train Acc: 88.81%
	 Val. Loss: 0.385 |  Val. Acc: 86.79%
Epoch: 05 | Epoch Time: 0m 8s
	Train Loss: 0.337 | Train Acc: 89.82%
	 Val. Loss: 0.403 |  Val. Acc: 87.44%
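
The loop keeps the checkpoint with the lowest validation loss, not the one from the last epoch. Using the validation losses printed in the log, the sketch below replays that selection rule and shows which epoch's weights end up in ``best-model.pt``.

```python
valid_losses = [0.410, 0.377, 0.375, 0.385, 0.403]  # taken from the training log

best_valid_loss = float('inf')
best_epoch = None
for epoch, valid_loss in enumerate(valid_losses, start=1):
    if valid_loss < best_valid_loss:  # same rule as the training loop
        best_valid_loss = valid_loss
        best_epoch = epoch

print(best_epoch, best_valid_loss)
```

In this run the validation loss starts rising after the third epoch while the validation accuracy keeps improving, so selecting on accuracy instead would have kept a different checkpoint.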


### Evaluating the Best Model

In [27]:
model.load_state_dict(torch.load('best-model.pt'))

test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.383 | Test Acc: 85.41%


### Making predictions

In [31]:
import spacy
import en_core_web_sm
nlp = en_core_web_sm.load()
def predict_sentiment(model, sentence):
    model.eval()
    tokenized = generate_bigrams([tok.text for tok in nlp.tokenizer(sentence)])
    indexed = [TEXT.vocab.stoi[t] for t in tokenized]
    tensor = torch.LongTensor(indexed).to(device)
    tensor = tensor.unsqueeze(1)
    prediction = torch.sigmoid(model(tensor))
    return prediction.item()

#### Negative sentiment.

In [32]:
predict_sentiment(model, "This film is terrible")

1.0

#### Positive Sentiment

In [33]:
predict_sentiment(model, "This film is great")

1.208332384873421e-19
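
The returned value is the sigmoid of the logit, so it is the model's score for whichever class was assigned index ``1`` when ``LABEL.build_vocab`` ran. Here the negative review scores close to 1, which would be consistent with ``neg`` being mapped to ``1`` in this run. The sketch below turns a score into a label name given such a mapping. The ``itos`` lists are hypothetical stand-ins for ``LABEL.vocab.itos``; inspect the real attribute to confirm the order in your own run.

```python
def score_to_label(score, itos):
    """Map a sigmoid score in [0, 1] to a label name via an index-to-string list."""
    index = 1 if score > 0.5 else 0
    return itos[index]


# Scores copied from the two predictions in this notebook.
scores = {
    'This film is terrible': 1.0,
    'This film is great': 1.208332384873421e-19,
}

for itos in (['pos', 'neg'], ['neg', 'pos']):  # the two possible orderings
    labels = {text: score_to_label(score, itos) for text, score in scores.items()}
    print(itos, labels)
```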

### Next Steps
* CNN in Sentiment Analysis

### Credits.
* [bentrevett](https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/3%20-%20Faster%20Sentiment%20Analysis.ipynb)