### 1. Simple Sentiment Analysis using the `IMDB` dataset.
* In this notebook we are going to predict whether a movie review is positive or negative using the `imdb` dataset.


#### Preparing Data
We are going to use `TorchText`'s ``Field`` class, which defines how your data should be processed.

We use the ``TEXT`` field to define how the review should be processed, and the ``LABEL`` field to process the sentiment.

Our ``TEXT`` field has ``tokenize='spacy'`` as an argument. This specifies that the "tokenization" (the act of splitting the string into discrete "tokens") should be done using the ``spaCy`` tokenizer. If no ``tokenize`` argument is passed, the default is simply splitting the string on whitespace. We also need to specify a ``tokenizer_language``, which tells torchtext which spaCy model to use. We use ``en_core_web_sm``.
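To see why the choice of tokenizer matters, here is a minimal sketch comparing plain whitespace splitting with a punctuation-aware split. The regex is only a rough stand-in chosen for illustration; it is not how spaCy tokenizes, and `simple_tokenize` is a hypothetical helper.

```python
import re

sentence = "This movie was witty, charming and very enjoyable."

# Behaviour when no tokenizer is given: split on whitespace.
whitespace_tokens = sentence.split()

# Hypothetical stand-in for a real tokenizer: separate words from punctuation.
def simple_tokenize(text):
    return re.findall(r"\w+|[^\w\s]", text)

regex_tokens = simple_tokenize(sentence)

print(whitespace_tokens)  # punctuation stays glued to words, e.g. 'witty,'
print(regex_tokens)       # punctuation becomes its own token, e.g. 'witty', ','
```

With whitespace splitting, `'witty,'` and `'witty'` would count as two different words in the vocabulary, which is one reason a proper tokenizer is preferred.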

**Downloading `en_core_web_sm`:**

```
python -m spacy download en_core_web_sm
```

``LABEL`` is defined by a ``LabelField``, a special subclass of the ``Field`` class specifically used for handling labels.

In [2]:
import en_core_web_sm

In [6]:
import torch
from torchtext.legacy import data

In [7]:
SEED = 1234

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
device = torch.device("cuda" if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

In [9]:
TEXT = data.Field(tokenize = 'spacy',
                  tokenizer_language = 'en_core_web_sm')
LABEL = data.LabelField(dtype = torch.float)
TEXT, LABEL

(<torchtext.legacy.data.field.Field at 0x7f1180859b10>,
 <torchtext.legacy.data.field.LabelField at 0x7f11816fafd0>)

### Downloading the `IMDB` dataset.
Another handy feature of ``TorchText`` is that it has support for common datasets used in natural language processing (NLP).

In [10]:
from torchtext.legacy import datasets
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

aclImdb_v1.tar.gz:   0%|          | 0.00/84.1M [00:00<?, ?B/s]

downloading aclImdb_v1.tar.gz


aclImdb_v1.tar.gz: 100%|██████████| 84.1M/84.1M [00:01<00:00, 71.7MB/s]


### Checking the data structure.

In [16]:
print(f"TRAINING EXAMPLES: \t {len(train_data)}\nTEST EXAMPLES: \t {len(test_data)}\nTOTAL EXAMPLES: \t {len(train_data) + len(test_data)}")

TRAINING EXAMPLES: 	 25000
TEST EXAMPLES: 	 25000
TOTAL EXAMPLES: 	 50000


### Checking one example.

In [17]:
vars(train_data.examples[0])

{'label': 'pos',
 'text': ['This',
  'movie',
  'was',
  'like',
  'any',
  'Jimmy',
  'Stewart',
  'film',
  ',',
  'witty',
  ',',
  'charming',
  'and',
  'very',
  'enjoyable',
  '.',
  'Kim',
  'Novak',
  "'s",
  'performance',
  'as',
  'Gillian',
  ',',
  'the',
  'beautiful',
  'witch',
  'who',
  'longs',
  'to',
  'be',
  'human',
  ',',
  'is',
  'splendid',
  ',',
  'her',
  'subtle',
  'facial',
  'expressions',
  ',',
  'her',
  'every',
  'move',
  'and',
  'gesture',
  'all',
  'create',
  'Gillian',
  "'s",
  'unique',
  'and',
  'somewhat',
  'haunting',
  'character',
  ',',
  'she',
  'left',
  'us',
  'hanging',
  'on',
  'her',
  'every',
  'word',
  '.',
  'I',
  'should',
  'not',
  'fail',
  'to',
  'mention',
  'Ernie',
  'Kovacs',
  "'",
  'and',
  'Elsa',
  'Lanchester',
  "'s",
  'highly',
  'commendable',
  'performances',
  'as',
  'the',
  'scotch',
  'loving',
  'writer',
  'obsessed',
  'with',
  'the',
  'world',
  'of',
  'magic(Kovacs',
  ')',
  'an

### Creating the validation data.
By default the `IMDB` dataset only has two sets, the training and testing sets. We also need a validation set. We are going to create one by using the `.split()` method on the training data.

1. `.split()` method.
This method splits the dataset into ``70%`` training and ``30%`` validation by default.
* We can change this by specifying the keyword argument `split_ratio = 0.8`, which means ``80%`` of the data will belong to the training set and the rest to the validation set.
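The idea of a seeded random split can be sketched in plain Python. This is an illustration under assumptions, not torchtext's implementation: `split_examples` is a hypothetical helper and the integers stand in for the 25,000 training reviews.

```python
import random

def split_examples(examples, split_ratio=0.7, seed=1234):
    """Shuffle a copy of `examples` and cut it into two parts."""
    rng = random.Random(seed)          # local RNG so the split is reproducible
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = int(round(split_ratio * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

examples = list(range(25_000))         # stand-in for 25,000 training reviews

train_part, valid_part = split_examples(examples)
print(len(train_part), len(valid_part))    # 17500 7500

train_80, valid_20 = split_examples(examples, split_ratio=0.8)
print(len(train_80), len(valid_20))        # 20000 5000
```

Using the same seed gives the same split on every run, which makes results comparable between experiments.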

In [22]:
from random import seed

In [24]:
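# seed(SEED) seeds Python's global random module and returns None,
# so split() is called with random_state=None.
train_data, val_data = train_data.split(random_state=seed(SEED))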

Let's check how many examples we have now.

In [25]:
print(f"TRAINING EXAMPLES: \t {len(train_data)}\nVALIDATION EXAMPLES: \t {len(val_data)}\nTEST EXAMPLES: \t {len(test_data)}\nTOTAL EXAMPLES: \t {len(train_data) + len(test_data) + len(val_data)}")

TRAINING EXAMPLES: 	 17500
VALIDATION EXAMPLES: 	 7500
TEST EXAMPLES: 	 25000
TOTAL EXAMPLES: 	 50000


### $B$uilding a $V$ocabulary.
A vocabulary is effectively a lookup table where every unique word in your dataset has a corresponding index (an integer).

The reason we create a vocabulary is that our machine learning models cannot operate on string data.
Each index is used to construct a one-hot vector for each word. A one-hot vector is a vector where all of the elements are 0, except one, which is 1, and dimensionality is the total number of unique words in your vocabulary, commonly denoted by $V$.

The number of unique words in our training set is over 100,000, which means that our one-hot vectors will have over 100,000 dimensions. This will make training slower.

There are two ways to effectively cut down our vocabulary: we can either take only the top $n$ most common words or ignore words that appear fewer than $m$ times. We'll do the former, only keeping the top 25,000 words.

What do we do with words that appear in examples but we have cut from the vocabulary? We replace them with a special unknown or ``<unk>`` token. For example, if the sentence was "This film is great and I love it" but the word "love" was not in the vocabulary, it would become "This film is great and I ``<unk>`` it".
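The lookup table and the ``<unk>`` fallback can be sketched with the standard library. This is a toy illustration, not how `build_vocab` is implemented: `build_vocab` and `numericalize` below are hypothetical helpers, and the three training sentences are made up.

```python
from collections import Counter

def build_vocab(tokenized_texts, max_size):
    """Keep the `max_size` most frequent tokens, plus <unk> and <pad>."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    itos = ["<unk>", "<pad>"] + [tok for tok, _ in counts.most_common(max_size)]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    """Map tokens to indices; tokens outside the vocabulary become <unk>."""
    return [stoi.get(tok, stoi["<unk>"]) for tok in tokens]

train_texts = [
    "this film is great and I like it".split(),
    "this film is bad and I hate it".split(),
    "this film is great".split(),
]
itos, stoi = build_vocab(train_texts, max_size=7)
print(len(itos))                       # 7 kept words + 2 special tokens = 9

sentence = "this film is great and I love it".split()
indices = numericalize(sentence, stoi)
print([itos[i] for i in indices])
# ['this', 'film', 'is', 'great', 'and', 'I', '<unk>', 'it']
```

The words that appear only once ("like", "bad", "hate") are cut by `max_size`, and "love" never appeared in the training data, so it maps to ``<unk>``.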

Let's build a Vocabulary.

In [26]:
MAX_VOCAB_SIZE = 25_000

TEXT.build_vocab(train_data, max_size = MAX_VOCAB_SIZE)
LABEL.build_vocab(train_data)

**Why build the vocabulary on the ``train set`` only?**

* A machine learning system must not look at the ``test data`` in any way.
* We want the ``validation data`` to represent the test data as closely as possible, so we keep it out of the vocabulary as well.



In [34]:
print(f"Unique words in the TEXT vocabulary: {len(TEXT.vocab)}")
print(f"Unique labels in the LABEL vocabulary: {len(LABEL.vocab)}")

Unique words in the TEXT vocabulary: 25002
Unique labels in the LABEL vocabulary: 2


#### But wait, didn't we say our vocabulary size was $25000$? Where did the extra $2$ words come from?

Calm down dude, the two additions to our vocabulary are `<unk>` for unknown words and `<pad>` for padding sequences.

#### But why?
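When we feed sentences into our model, we feed a batch of them at a time, i.e. more than one at a time, and all sentences in the batch need to be the same size. To make that happen, any sentence shorter than the longest one in the batch is filled up with `<pad>` tokens.

Consider the following illustration: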

<p align="center">
<img src="https://github.com/bentrevett/pytorch-sentiment-analysis/raw/2b666b3cba7d629a2f192c7d9c66fadcc9f0c363/assets/sentiment6.png"/>
</p>
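The padding shown in the illustration can be sketched in a few lines. `pad_batch` is a hypothetical helper written for this example, and the sentences are made up.

```python
def pad_batch(batch, pad_token="<pad>"):
    """Pad every token list in `batch` to the length of the longest one."""
    max_len = max(len(tokens) for tokens in batch)
    return [tokens + [pad_token] * (max_len - len(tokens)) for tokens in batch]

batch = [
    ["I", "hate", "this", "film"],
    ["This", "film", "is", "great", "and", "I", "love", "it"],
    ["Terrible"],
]
padded = pad_batch(batch)
for row in padded:
    print(row)
```

After padding, every row has 8 tokens, so the batch can be stacked into a single rectangular tensor.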

### Most common words.
The 10 most common words and their frequencies.

In [35]:
TEXT.vocab.freqs.most_common(10)

[('the', 203562),
 (',', 193428),
 ('.', 166888),
 ('a', 109851),
 ('and', 109723),
 ('of', 100959),
 ('to', 94207),
 ('is', 76693),
 ('in', 61321),
 ('I', 54250)]

### The vocabulary.
We can also see the vocabulary directly by using either the ``stoi`` (**s**tring **t**o **i**nt) or ``itos`` (**i**nt **t**o **s**tring) attribute, for both the text and the labels.

In [36]:
print(TEXT.vocab.itos)



In [37]:
print(TEXT.vocab.stoi)



In [38]:
print(LABEL.vocab.stoi)
print(LABEL.vocab.itos)

defaultdict(None, {'neg': 0, 'pos': 1})
['neg', 'pos']


### Creating Iterators - `BucketIterator`

The final step of preparing the data is creating the iterators. We iterate over these in the training/evaluation loop, and they return a batch of examples (indexed and converted into tensors) at each iteration.

We'll use a BucketIterator which is a special type of iterator that will return a batch of examples where each example is of a similar length, minimizing the amount of padding per example.
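To get a feel for why grouping examples of similar length helps, the sketch below counts how many `<pad>` tokens are needed with and without sorting by length. It only illustrates the idea and is not how `BucketIterator` works internally; `padding_needed` is a hypothetical helper and the review lengths are randomly generated.

```python
import random

def padding_needed(lengths, batch_size):
    """Total <pad> tokens when consecutive `lengths` are grouped into batches."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        chunk = lengths[start:start + batch_size]
        total += sum(max(chunk) - n for n in chunk)
    return total

rng = random.Random(0)
lengths = [rng.randint(10, 1000) for _ in range(640)]   # made-up review lengths

unsorted_pad = padding_needed(lengths, batch_size=64)
bucketed_pad = padding_needed(sorted(lengths), batch_size=64)
print(f"padding without bucketing: {unsorted_pad}")
print(f"padding with bucketing:    {bucketed_pad}")
```

Less padding means fewer wasted time steps for the RNN to process in each batch.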

In [39]:
BATCH_SIZE = 64

train_iterator, test_iterator, validation_iterator = data.BucketIterator.splits(
    (train_data, test_data, val_data),
    batch_size = BATCH_SIZE,
    device=device
)

In [44]:
for X in train_iterator: 
  break
print(X)


[torchtext.legacy.data.batch.Batch of size 64]
	[.text]:[torch.cuda.LongTensor of size 1205x64 (GPU 0)]
	[.label]:[torch.cuda.FloatTensor of size 64 (GPU 0)]


### Creating a model.
The next stage is building the model that we'll eventually train and evaluate.

1. **The embedding layer.**

The embedding layer is used to transform our sparse ``one-hot vector`` (sparse as most of the elements are 0) into a dense embedding vector (dense as the dimensionality is a lot smaller and all the elements are real numbers). This embedding layer is simply a single fully connected layer. As well as reducing the dimensionality of the input to the ``RNN``, there is the theory that words which have similar impact on the sentiment of the review are mapped close together in this dense vector space.
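The claim that the embedding layer acts like a fully connected layer on one-hot vectors can be checked directly. The sketch below uses tiny made-up sizes; it compares multiplying one-hot vectors by the embedding weight matrix with the row lookup that `nn.Embedding` performs.

```python
import torch
from torch import nn
from torch.nn import functional as F

torch.manual_seed(0)
vocab_size, emb_dim = 6, 3           # tiny made-up sizes for illustration
emb = nn.Embedding(vocab_size, emb_dim)

indices = torch.tensor([4, 1, 4])    # three token indices

# Explicit route: build one-hot vectors and multiply by the weight matrix.
one_hot = F.one_hot(indices, num_classes=vocab_size).float()
via_matmul = one_hot @ emb.weight

# What nn.Embedding does: pick the matching rows directly.
via_lookup = emb(indices)

print(via_lookup.shape)                          # torch.Size([3, 3])
print(torch.allclose(via_matmul, via_lookup))    # True
```

Both routes give the same dense vectors, but the lookup never has to materialize the large one-hot vectors.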

The RNN layer is our RNN which takes in our dense vector and the previous hidden state $h_{t-1}$, which it uses to calculate the next hidden state, $h_t$.

<p align="center">
<img src="https://github.com/bentrevett/pytorch-sentiment-analysis/raw/2b666b3cba7d629a2f192c7d9c66fadcc9f0c363/assets/sentiment7.png"/>
</p>

Finally, the linear layer takes the final hidden state and feeds it through a fully connected layer, $f(h_T)$, transforming it to the correct output dimension.

The forward method is called when we feed examples into our model.

Each batch, text, is a tensor of size **[sentence length, batch size]**. That is a batch of sentences, each having each word converted into a one-hot vector.

You may notice that this tensor should have another dimension due to the one-hot vectors, however PyTorch conveniently stores a one-hot vector as its index value, i.e. the tensor representing a sentence is just a tensor of the indexes for each token in that sentence. The act of converting a list of tokens into a list of indexes is commonly called **numericalizing**.

The input batch is then passed through the embedding layer to get embedded, which gives us a dense vector representation of our sentences. embedded is a tensor of size **[sentence length, batch size, embedding dim]**.

embedded is then fed into the **RNN**. In some frameworks you must feed the initial hidden state, $h_0$, into the RNN, however in PyTorch, if no initial hidden state is passed as an argument it defaults to a tensor of all zeros.

The **RNN** returns 2 tensors, output of size **[sentence length, batch size, hidden dim]** and hidden of size **[1, batch size, hidden dim]**. output is the concatenation of the hidden state from every time step, whereas hidden is simply the final hidden state. We verify this using the assert statement. Note the squeeze method, which is used to remove a dimension of size 1.
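The shapes and the relationship between `output` and `hidden` can be checked on a tiny example. The sketch below assumes a single-layer `nn.RNN` with its default tanh nonlinearity and made-up sizes, and recomputes the hidden states by hand from the layer's own weights, starting from $h_0 = 0$.

```python
import torch
from torch import nn

torch.manual_seed(0)
sent_len, batch_size, emb_dim, hid_dim = 5, 2, 4, 3   # tiny made-up sizes
rnn = nn.RNN(emb_dim, hid_dim)                        # single layer, tanh
embedded = torch.randn(sent_len, batch_size, emb_dim)

output, hidden = rnn(embedded)

# Recompute h_t = tanh(x_t W_ih^T + b_ih + h_{t-1} W_hh^T + b_hh) step by step.
h = torch.zeros(batch_size, hid_dim)
steps = []
with torch.no_grad():
    for x_t in embedded:
        h = torch.tanh(x_t @ rnn.weight_ih_l0.T + rnn.bias_ih_l0
                       + h @ rnn.weight_hh_l0.T + rnn.bias_hh_l0)
        steps.append(h)
manual = torch.stack(steps)

print(output.shape, hidden.shape)
print(torch.allclose(output, manual, atol=1e-5))
print(torch.allclose(output[-1], hidden.squeeze(0)))
```

Note that the `squeeze(0)` comparison only lines up like this for one layer and one direction; with more layers `hidden` has a larger first dimension.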

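Finally, we feed the last hidden state through the linear layer, ``fc``, to produce a prediction. The code passes ``output[-1,:,:]`` rather than ``hidden``; the assert confirms the two hold the same values.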

In [45]:
from torch import nn
from torch.nn import functional as F

In [71]:
class RNN(nn.Module):
  def __init__(self,input_size, hidden_size, embedding_size, num_layers, output_size):
    super().__init__()
    self.emb = nn.Embedding(input_size, embedding_dim=embedding_size)
    self.rnn = nn.RNN(embedding_size, hidden_size=hidden_size, num_layers=num_layers)
    self.fc = nn.Linear(hidden_size, out_features=output_size)

  def forward(self, x):
    # x = [sent len, batch size]
    embedded = self.emb(x)
    #embedded = [sent len, batch size, emb dim]

    output, hidden = self.rnn(embedded)
    #output = [sent len, batch size, hid dim]
    #hidden = [1, batch size, hid dim] 
    assert torch.equal(output[-1,:,:], hidden.squeeze(0))

    return self.fc(output[-1,:,:])

We now create an instance of our RNN class.

The input size is the dimension of the one-hot vectors, which is equal to the vocabulary size.

The embedding size is the size of the dense word vectors. This is usually around ``50-250`` dimensions, but depends on the size of the vocabulary.

The ``hidden size`` is the size of the hidden states. This is usually around ``100-500`` dimensions, but also depends on factors such as on the vocabulary size, the size of the dense vectors and the complexity of the task.

The ``output size`` is usually the number of classes, however in the case of only 2 classes the prediction can be 1-dimensional, i.e. a single scalar real number. The model outputs one raw score (a logit), which a sigmoid maps to a value between 0 and 1.

In [72]:
INPUT_SIZE = len(TEXT.vocab)
EMBEDDING_SIZE = 100
HIDDEN_SIZE = 256
OUTPUT_SIZE = 1
NUM_LAYERS = 1

model = RNN(INPUT_SIZE, HIDDEN_SIZE, EMBEDDING_SIZE, NUM_LAYERS, OUTPUT_SIZE)
model

RNN(
  (emb): Embedding(25002, 100)
  (rnn): RNN(100, 256)
  (fc): Linear(in_features=256, out_features=1, bias=True)
)

### A function that tells us how many trainable parameters we have in the model.

In [73]:
def count_trainable_params(model):
  return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_trainable_params(model):,} trainable parameters')

The model has 2,592,105 trainable parameters
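The total can be reproduced by hand from the layer sizes printed above. The breakdown below assumes the standard parameter layout of these layers: one weight row per token for the embedding, input-to-hidden and hidden-to-hidden weights plus two bias vectors for the RNN, and weights plus one bias for the linear layer.

```python
vocab_size, emb_dim, hid_dim, out_dim = 25_002, 100, 256, 1

embedding = vocab_size * emb_dim                             # one row per token
rnn = emb_dim * hid_dim + hid_dim * hid_dim + 2 * hid_dim    # W_ih, W_hh, two biases
linear = hid_dim * out_dim + out_dim                         # weights + bias

total = embedding + rnn + linear
print(f"{embedding:,} + {rnn:,} + {linear:,} = {total:,}")
# 2,500,200 + 91,648 + 257 = 2,592,105
```

Almost all of the parameters sit in the embedding layer, which is why the vocabulary size has such a large effect on model size.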


### Training the model.
We are going to use `SGD` as our optimizer and `BCEWithLogitsLoss` as our loss.

* The reason we are using this loss is that we don't have an activation function on our last layer ([more](https://pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html)).

* This loss combines a Sigmoid layer and the BCELoss in one single class. This version is more numerically stable than using a plain Sigmoid followed by a BCELoss as, by combining the operations into one layer, we take advantage of the log-sum-exp trick for numerical stability.
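The stability issue can be reproduced with the standard library alone. The sketch below is an illustration of the idea, not PyTorch's actual implementation: `naive_bce` applies a sigmoid and then the cross-entropy formula literally, while `stable_bce` uses one common algebraically equivalent form that works on the logit directly. Both function names are hypothetical.

```python
import math

def naive_bce(logit, target):
    """Sigmoid followed by binary cross-entropy, computed literally."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

def stable_bce(logit, target):
    """Equivalent form on the raw logit that never evaluates log(0)."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

# For a moderate logit both versions agree.
print(naive_bce(2.0, 1.0), stable_bce(2.0, 1.0))

# For a confident wrong prediction the sigmoid rounds to exactly 1.0 in
# floating point, so the naive version ends up evaluating log(0).
try:
    naive_bce(40.0, 0.0)
except ValueError as err:
    print("naive version failed:", err)
print(stable_bce(40.0, 0.0))
```

The stable form returns a loss of about 40 for the confident wrong prediction, which is the value the naive formula would give with infinite precision.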

In [74]:
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
criterion = nn.BCEWithLogitsLoss()

### Pushing the model and loss function to the device

In [75]:
model = model.to(device)
criterion = criterion.to(device)

### $L$oss and $A$ccuracy.

Our criterion function calculates the loss, however we have to write our own function to calculate the accuracy.

This function first feeds the predictions through a sigmoid layer, squashing the values between 0 and 1, and then rounds them to the nearest integer. This rounds any value greater than 0.5 to 1 (a positive sentiment) and the rest to 0 (a negative sentiment).

We then calculate how many rounded predictions equal the actual labels and average it across the batch.

In [76]:
def accuracy(y_preds, y_true):
  #round predictions to the closest integer
  rounded_preds = torch.round(torch.sigmoid(y_preds))
  correct = (rounded_preds == y_true).float() #convert into float for division 
  acc = correct.sum() / len(correct)
  return acc
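A small worked example makes the steps concrete. The function is repeated so the snippet runs on its own; the logits and labels are made-up values.

```python
import torch

def accuracy(y_preds, y_true):
    rounded_preds = torch.round(torch.sigmoid(y_preds))
    correct = (rounded_preds == y_true).float()
    return correct.sum() / len(correct)

logits = torch.tensor([2.0, -1.0, 0.3, -0.2])   # made-up raw model outputs
labels = torch.tensor([1.0, 0.0, 0.0, 0.0])

print(torch.sigmoid(logits))                # probabilities between 0 and 1
print(torch.round(torch.sigmoid(logits)))   # tensor([1., 0., 1., 0.])
print(accuracy(logits, labels))             # 3 of 4 predictions match
```

The third example has a positive logit, so it is rounded to 1 and counted as wrong against its label of 0, giving an accuracy of 0.75.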

The train function iterates over all examples, one batch at a time.

**model.train()** is used to put the model in "training mode", which turns on ``dropout`` and ``batch normalization``. Although we aren't using them in this model, it's good practice to include it.

For each batch, we first ``zero the gradients``. Each parameter in a model has a grad attribute which stores the gradient calculated by the criterion. PyTorch does not automatically remove (or "zero") the gradients calculated from the last gradient calculation, so they must be manually zeroed.
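The accumulation behaviour is easy to see on a single parameter. The sketch below uses a made-up scalar `w` and zeroes its gradient manually; `optimizer.zero_grad()` plays the same role for all parameters of a model, although recent PyTorch versions may reset `.grad` to `None` rather than to zeros.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

(3 * w).backward()
first = w.grad.item()      # d(3w)/dw = 3

(3 * w).backward()         # no zeroing in between
second = w.grad.item()     # the new gradient was added to the old one

w.grad.zero_()             # manual reset of the stored gradient
third = w.grad.item()

print(first, second, third)   # 3.0 6.0 0.0
```

Without the reset, each batch would update the parameters using the sum of all gradients seen so far rather than the gradient of the current batch.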

We then feed the batch of sentences, ``batch.text``, into the model. Note, you do not need to do ``model.forward(batch.text)``, simply calling the model works. The squeeze is needed as the predictions are initially size ``[batch size, 1]``, and we need to remove the dimension of size ``1`` as PyTorch expects the predictions input to our criterion function to be of size ``[batch size]``.

The loss and accuracy are then calculated using our predictions and the labels, batch.label, with the loss being averaged over all examples in the batch.

We calculate the gradient of each parameter with ``loss.backward()``, and then update the parameters using the gradients and optimizer algorithm with ``optimizer.step()``.

The loss and accuracy are accumulated across the epoch. The ``.item()`` method is used to extract a scalar from a tensor which only contains a single value.

Finally, we return the loss and accuracy, averaged across the epoch. The len of an iterator is the number of batches in the iterator.

You may recall that when initializing the LABEL field, we set ``dtype=torch.float``. This is because TorchText sets tensors to be LongTensors by default, however our criterion expects both inputs to be ``FloatTensors``. Setting the dtype to ``torch.float`` did this for us. The alternative would be to do the conversion inside the train function by passing ``batch.label.float()`` instead of ``batch.label`` to the criterion.

In [90]:
def train(model, iterator, optimizer, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.train()
    for batch in iterator:
        optimizer.zero_grad()
        predictions = model(batch.text).squeeze(1)
        loss = criterion(predictions, batch.label)
        acc = accuracy(predictions, batch.label)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
        epoch_acc += acc.item()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)


``evaluate`` is similar to train, with a few modifications as you don't want to update the parameters when evaluating.

``model.eval()`` puts the model in "evaluation mode", which turns off ``dropout`` and ``batch normalization``. Again, we are not using them in this model, but it is good practice to include it.

No gradients are calculated on PyTorch operations inside the ``with torch.no_grad()`` block. This uses less memory and speeds up computation.
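The effect of the block can be observed on a single made-up tensor: the same operation is tracked for gradients outside the block and untracked inside it.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

tracked = w * 3                 # part of the autograd graph
with torch.no_grad():
    untracked = w * 3           # computed without recording a graph

print(tracked.requires_grad, untracked.requires_grad)   # True False
```

Since nothing inside the block is recorded, calling `backward()` on `untracked` would not work, which is fine during evaluation because no parameters are updated there.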

The rest of the function is the same as train, with the removal of ``optimizer.zero_grad()``, ``loss.backward()`` and ``optimizer.step()``, as we do not update the model's parameters when evaluating.

In [87]:
def evaluate(model, iterator, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.eval()
    with torch.no_grad():
        for batch in iterator:
            predictions = model(batch.text).squeeze(1)
            loss = criterion(predictions, batch.label)
            acc = accuracy(predictions, batch.label)
            epoch_loss += loss.item()
            epoch_acc += acc.item()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

We'll also create a function to tell us how long an epoch takes to compare training times between models.

In [79]:
import time
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs
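A quick usage example shows what the helper returns. The function is repeated so the snippet runs on its own, and the timestamps are made-up values in seconds, in the form `time.time()` would return them.

```python
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

print(epoch_time(1000.0, 1125.7))   # (2, 5)
print(epoch_time(0.0, 15.4))        # (0, 15)
```

Fractions of a second are dropped rather than rounded, which is precise enough for comparing epoch durations.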

We then train the model through multiple epochs, an epoch being a complete pass through all examples in the training and validation sets.

At each epoch, if the validation loss is the best we have seen so far, we'll save the parameters of the model and then after training has finished we'll use that model on the test set.

In [91]:
N_EPOCHS = 5
best_valid_loss = float('inf')
for epoch in range(N_EPOCHS):
    start_time = time.time()
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, validation_iterator, criterion)
    end_time = time.time()
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut1-model.pt')
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01 | Epoch Time: 0m 15s
	Train Loss: 0.693 | Train Acc: 50.22%
	 Val. Loss: 0.694 |  Val. Acc: 50.42%
Epoch: 02 | Epoch Time: 0m 15s
	Train Loss: 0.693 | Train Acc: 50.37%
	 Val. Loss: 0.695 |  Val. Acc: 49.83%
Epoch: 03 | Epoch Time: 0m 15s
	Train Loss: 0.693 | Train Acc: 50.03%
	 Val. Loss: 0.694 |  Val. Acc: 50.46%
Epoch: 04 | Epoch Time: 0m 15s
	Train Loss: 0.693 | Train Acc: 50.29%
	 Val. Loss: 0.695 |  Val. Acc: 50.24%
Epoch: 05 | Epoch Time: 0m 15s
	Train Loss: 0.693 | Train Acc: 50.19%
	 Val. Loss: 0.694 |  Val. Acc: 50.75%


You may have noticed the loss is not really decreasing and the accuracy is poor. This is due to several issues with the model which we'll improve in the next notebook.
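The value the loss is stuck at is recognizable: 0.693 is approximately $\ln 2$, the binary cross-entropy obtained when the predicted probability is 0.5 regardless of the label. Together with an accuracy near 50%, this suggests the model is doing little better than guessing. A quick check, with `bce` as a hypothetical helper:

```python
import math

def bce(prob, target):
    """Binary cross-entropy for one predicted probability."""
    return -(target * math.log(prob) + (1 - target) * math.log(1 - prob))

print(round(bce(0.5, 1), 3), round(bce(0.5, 0), 3))   # 0.693 0.693
print(round(math.log(2), 3))                          # 0.693
```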

Finally, we compute the metrics we actually care about, the test loss and accuracy, using the parameters that gave us the best validation loss.

In [92]:
model.load_state_dict(torch.load('tut1-model.pt'))

test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.687 | Test Acc: 58.45%


### Next Steps
In the next notebook, the improvements we will make are:

* packed padded sequences
* pre-trained word embeddings
* different RNN architecture
* bidirectional RNN
* multi-layer RNN
* regularization
* a different optimizer

This will allow us to achieve ~84% accuracy.

### Credits:

* [bentrevett](https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/1%20-%20Simple%20Sentiment%20Analysis.ipynb)
* [PyTorch](https://pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html)
