### 2. Updated Sentiment Analysis using `IMDB` dataset.
* In this notebook we are going to update the code from the previous example so that we can improve accuracy on our test dataset.

We will walk through the following:
* packed padded sequences
* pre-trained word embeddings
* different RNN architecture
* bidirectional RNN
* multi-layer RNN
* regularization
* a different optimizer (`Adam`)


#### Preparing Data
We'll be using ``packed padded sequences``, which will make our ``RNN`` only process the ``non-padded`` elements of our sequence, and for any padded element the output will be a ``zero`` tensor. To use packed padded sequences, we have to tell the RNN how long the actual sequences are. We do this by setting **``include_lengths = True``** for our ``TEXT`` field. This will cause ``batch.text`` to now be a tuple with the first element being our sentence (a numericalized tensor that has been padded) and the second element being the actual lengths of our sentences.
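
To make the "padded tensor plus lengths" idea concrete, here is a minimal sketch in plain Python. The helper `pad_batch` is hypothetical (it is not part of TorchText) and works on batch-first lists, whereas the real `batch.text` holds a `[sent len, batch size]` tensor. The pad index of 1 is an assumption chosen for illustration.

```python
# Hypothetical helper, for illustration only (not a TorchText function).
def pad_batch(sequences, pad_idx=1):
    """Pad lists of token ids to a common length and return the true lengths."""
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths)
    padded = [seq + [pad_idx] * (max_len - len(seq)) for seq in sequences]
    return padded, lengths

batch = [[12, 7, 45, 3], [8, 19], [5, 6, 2]]
padded, lengths = pad_batch(batch)
print(padded)   # [[12, 7, 45, 3], [8, 19, 1, 1], [5, 6, 2, 1]]
print(lengths)  # [4, 2, 3]
```

The lengths are what lets the RNN skip the trailing pad indices later on.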

In [1]:
import en_core_web_sm

In [2]:
import torch
from torchtext.legacy import data

In [3]:
SEED = 1234

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
device = torch.device("cuda" if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

In [4]:
TEXT = data.Field(tokenize = 'spacy',
                  tokenizer_language = 'en_core_web_sm',
                  include_lengths = True)
LABEL = data.LabelField(dtype = torch.float)
TEXT, LABEL

(<torchtext.legacy.data.field.Field at 0x7fe1b1e9b990>,
 <torchtext.legacy.data.field.LabelField at 0x7fe1b1ea3f90>)

### Downloading the `IMDB` dataset.
Another handy feature of ``TorchText`` is that it has support for common datasets used in natural language processing (NLP).

In [5]:
from torchtext.legacy import datasets
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

aclImdb_v1.tar.gz:   1%|          | 459k/84.1M [00:00<00:19, 4.20MB/s]

downloading aclImdb_v1.tar.gz


aclImdb_v1.tar.gz: 100%|██████████| 84.1M/84.1M [00:00<00:00, 93.5MB/s]


### Checking the data structure.

In [6]:
print(f"TRAINING EXAMPLES: \t {len(train_data)}\nTEST EXAMPLES: \t {len(test_data)}\nTOTAL EXAMPLES: \t {len(train_data) + len(test_data)}")

TRAINING EXAMPLES: 	 25000
TEST EXAMPLES: 	 25000
TOTAL EXAMPLES: 	 50000


### Checking one example.

In [7]:
vars(train_data.examples[0])

{'label': 'pos',
 'text': ['This',
  'is',
  'my',
  'first',
  'comment',
  '!',
  'This',
  'is',
  'a',
  'fantastic',
  'movie',
  '!',
  'I',
  'watched',
  'it',
  'all',
  'by',
  'luck',
  'one',
  'night',
  'on',
  'TV',
  '.',
  'At',
  'first',
  '5',
  'minutes',
  'i',
  'thought',
  'it',
  'was',
  'a',
  'B',
  'movie',
  ',',
  'but',
  'afterward',
  'i',
  'understood',
  'what',
  'an',
  'amazing',
  'product',
  'this',
  'was.<br',
  '/><br',
  '/>I',
  'suggested',
  'to',
  'some',
  'friends',
  'to',
  'see',
  'the',
  'movie',
  ',',
  'only',
  'to',
  'tell',
  'me',
  'that',
  'it',
  'was',
  'a',
  'bad',
  'B',
  'movie',
  '.',
  'How',
  'wrong',
  '.',
  'Superficial',
  'critiques.<br',
  '/><br',
  '/>I',
  'think',
  'that',
  'the',
  'movie',
  'is',
  'almost',
  'a',
  'product',
  'of',
  'genius',
  '!',
  'The',
  'well',
  'known',
  'director',
  'made',
  'an',
  'excellent',
  'job',
  'here',
  'and',
  'it',
  'is',
  'a',
  'sham

### Creating the validation data.
By default, the `IMDB` dataset only comes with two splits, the training set and the test set. We also need a validation set, so we are going to use the `.split()` method on the training data.

1. `.split()` method.
By default, this method splits the dataset into ``70%`` training and ``30%`` validation.
* We can change this by specifying the keyword argument `split_ratio = 0.8`, which means ``80%`` of the data will go to the training set and the rest to the validation set.
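
As a quick sanity check of the ratios, the split sizes can be worked out by hand. The function `split_sizes` below is a hypothetical helper written for this illustration; it only reproduces the arithmetic and is not how TorchText is called.

```python
# Hypothetical helper: how many examples land on each side of the split.
def split_sizes(n_examples, split_ratio=0.7):
    n_train = int(round(split_ratio * n_examples))
    return n_train, n_examples - n_train

print(split_sizes(25_000))       # (17500, 7500)
print(split_sizes(25_000, 0.8))  # (20000, 5000)
```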

In [8]:
from random import seed

In [9]:
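# seed(SEED) seeds Python's global RNG and returns None; with random_state=None,
# split() falls back to the current global RNG state, so the split stays reproducible.
train_data, val_data = train_data.split(random_state=seed(SEED))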

Let's check how many examples we have now.

In [10]:
print(f"TRAINING EXAMPLES: \t {len(train_data)}\nVALIDATION EXAMPLES: \t {len(val_data)}\nTEST EXAMPLES: \t {len(test_data)}\nTOTAL EXAMPLES: \t {len(train_data) + len(test_data) + len(val_data)}")

TRAINING EXAMPLES: 	 17500
VALIDATION EXAMPLES: 	 7500
TEST EXAMPLES: 	 25000
TOTAL EXAMPLES: 	 50000


### $P$re-trained $W$ord $E$mbeddings.

Now, instead of having our word embeddings initialized randomly, they are initialized with these pre-trained vectors. We get these vectors simply by specifying which vectors we want and passing it as an argument to ``build_vocab``. ``TorchText`` handles downloading the vectors and associating them with the correct words in our vocabulary.

Here, we'll be using the ``"glove.6B.100d"`` vectors. GloVe is the algorithm used to calculate the vectors. ``6B`` indicates these vectors were trained on 6 billion tokens and ``100d`` indicates these vectors are 100-dimensional.

The theory is that these pre-trained vectors already have words with similar semantic meaning close together in vector space, e.g. "terrible", "awful", "dreadful" are nearby. This gives our embedding layer a good initialization as it does not have to learn these relations from scratch.

By default, ``TorchText`` initializes words that are in your vocabulary but not in your pre-trained embeddings to zero. We don't want this; instead we initialize them randomly by setting ``unk_init`` to ``torch.Tensor.normal_``. This initializes those words from a Gaussian distribution.
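
The effect can be sketched without any library. The code below is a simplified stand-in, not TorchText's implementation: `build_embedding_matrix`, the toy vectors and the token list are all made up for illustration. Tokens found in the pre-trained set copy their vector, and everything else is sampled from $\mathcal{N}(0,1)$.

```python
import random

# Hypothetical helper mimicking the idea behind `vectors` + `unk_init`.
def build_embedding_matrix(itos, pretrained, dim, seed=1234):
    rng = random.Random(seed)
    matrix = []
    for token in itos:
        if token in pretrained:
            # token has a pre-trained vector: copy it
            matrix.append(list(pretrained[token]))
        else:
            # token is missing from the pre-trained set: sample from N(0, 1)
            matrix.append([rng.gauss(0.0, 1.0) for _ in range(dim)])
    return matrix

toy_vectors = {"great": [0.9, 0.1, 0.3], "awful": [-0.8, 0.2, -0.4]}
itos = ["<unk>", "<pad>", "great", "awful", "zzyzx"]
matrix = build_embedding_matrix(itos, toy_vectors, dim=3)

print(matrix[2])  # [0.9, 0.1, 0.3], copied from the pre-trained vectors
print(matrix[4])  # three random values for the out-of-vocabulary token
```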

In [11]:
MAX_VOCAB_SIZE = 25_000

TEXT.build_vocab(train_data, 
                 max_size = MAX_VOCAB_SIZE,
                 vectors = "glove.6B.100d",
                 unk_init = torch.Tensor.normal_
                 )
LABEL.build_vocab(train_data)

.vector_cache/glove.6B.zip: 862MB [02:40, 5.38MB/s]                          
100%|█████████▉| 399832/400000 [00:13<00:00, 29330.03it/s]

### Creating Iterators - `BucketIterator`

As before, we create the iterators, placing the tensors on the GPU if one is available.

One more requirement for packed padded sequences: all of the tensors within a batch need to be sorted by their lengths. This is handled in the iterator by setting ``sort_within_batch = True``.
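
The idea behind bucketing and sorting within a batch can be sketched in plain Python. This is a simplified illustration, not the `BucketIterator` implementation (the real iterator also shuffles, for example); `bucket_batches` and `padding_needed` are hypothetical helpers.

```python
# Hypothetical helpers illustrating length-based batching.
def bucket_batches(lengths, batch_size):
    """Group example indices into batches of similar length, longest first within each batch."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    return [sorted(b, key=lambda i: lengths[i], reverse=True) for b in batches]

def padding_needed(batches, lengths):
    """Count the pad tokens required to make every batch rectangular."""
    return sum(max(lengths[i] for i in b) - lengths[i] for b in batches for i in b)

lengths = [5, 120, 7, 118, 6, 119]
naive = [[0, 1, 2], [3, 4, 5]]               # batches in dataset order
bucketed = bucket_batches(lengths, batch_size=3)

print(bucketed)                           # [[2, 4, 0], [1, 5, 3]]
print(padding_needed(naive, lengths))     # 342
print(padding_needed(bucketed, lengths))  # 6
```

Grouping sentences of similar length keeps padding to a minimum, and the longest-first order inside each batch is what packing expects by default.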

In [12]:
BATCH_SIZE = 64

train_iterator, validation_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, val_data, test_data),
    batch_size = BATCH_SIZE,
    device = device,
    sort_within_batch = True
)

### Different `RNN` architectures.
We'll be using a different RNN architecture called a Long Short-Term Memory ``(LSTM)``. **Why is an LSTM better than a standard RNN?** Standard RNNs suffer from the vanishing gradient problem. LSTMs overcome this by having an extra recurrent state called a ``cell``, $c$ (which can be thought of as the "memory" of the LSTM), and by using multiple gates which control the flow of information into and out of the memory. We can simply think of the LSTM as a function of $x_t$, $h_{t-1}$ and $c_{t-1}$, instead of just $x_t$ and $h_{t-1}$.

The `LSTM`

$$(h_t, c_t) = \text{LSTM}(x_t, h_{t-1}, c_{t-1})$$
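
The recurrence can be sketched one time step at a time with ``nn.LSTMCell``. This is only an illustration of how the hidden and cell states are carried forward; the sizes are arbitrary, and the model in this notebook uses ``nn.LSTM``, which runs this loop internally.

```python
import torch
from torch import nn

torch.manual_seed(0)
cell = nn.LSTMCell(input_size=4, hidden_size=3)

x = torch.randn(5, 2, 4)   # [sent len, batch size, emb dim]
h = torch.zeros(2, 3)      # initial hidden state
c = torch.zeros(2, 3)      # initial cell state (the "memory")

for x_t in x:
    # the new (h, c) depends on the current input and the previous (h, c)
    h, c = cell(x_t, (h, c))

print(h.shape, c.shape)    # torch.Size([2, 3]) torch.Size([2, 3])
```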

The `LSTM` looks like (without the embedding layer)

<p align="center">
<img src="https://github.com/bentrevett/pytorch-sentiment-analysis/raw/2b666b3cba7d629a2f192c7d9c66fadcc9f0c363/assets/sentiment2.png"/>
</p>

### Bidirectional RNN
The concept behind a bidirectional RNN is simple. As well as having an RNN processing the words in the sentence from the first to the last (a forward RNN), we have a second RNN processing the words in the sentence from the last to the first (a backward RNN). At time step $t$, the forward RNN is processing word $x_t$, and the backward RNN is processing word $x_{T-t+1}$.
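
The indexing is easy to check on a toy sentence. In the sketch below (the example sentence is made up), time steps are 1-based to match the notation above.

```python
words = ["this", "film", "is", "really", "great"]
T = len(words)

# (t, word seen by the forward RNN, word seen by the backward RNN)
pairs = [(t, words[t - 1], words[(T - t + 1) - 1]) for t in range(1, T + 1)]

for t, forward_word, backward_word in pairs:
    print(t, forward_word, backward_word)
# 1 this great
# 2 film really
# 3 is is
# 4 really film
# 5 great this
```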

In PyTorch, the hidden state (and cell state) tensors returned by the forward and backward RNNs are stacked on top of each other in a single tensor.

We make our sentiment prediction using a concatenation of the last hidden state from the forward RNN (obtained from the final word of the sentence), $h_T^\rightarrow$, and the last hidden state from the backward RNN (obtained from the first word of the sentence), $h_T^\leftarrow$, i.e. $\hat{y}=f(h_T^\rightarrow, h_T^\leftarrow)$.

The image below shows a bi-directional RNN, with the forward RNN in orange, the backward RNN in green and the linear layer in silver.

<p align="center">
<img src="https://github.com/bentrevett/pytorch-sentiment-analysis/raw/2b666b3cba7d629a2f192c7d9c66fadcc9f0c363/assets/sentiment3.png"/>
</p>

### Regularization
Although we've added improvements to our model, each one adds additional parameters. Without going into too much detail about overfitting, the more parameters you have in your model, the higher the probability that your model will overfit **(memorize the training data, causing a low training error but high validation/testing error, i.e. poor generalization to new, unseen examples).** To combat this, we use regularization. More specifically, we use a method of regularization called dropout. **Dropout** works by randomly dropping out (setting to 0) neurons in a layer during a forward pass. The probability that each neuron is dropped out is set by a hyperparameter, and each neuron with dropout applied is considered independently. One theory about why dropout works is that a model with parameters dropped out can be seen as a "weaker" (fewer parameters) model. The predictions from all these "weaker" models (one for each forward pass) get averaged together within the parameters of the model. Thus, your one model can be thought of as an ensemble of weaker models, none of which are over-parameterized and thus should not overfit.
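
A minimal sketch of the mechanism in plain Python follows. The `dropout` function is a hypothetical stand-in for ``nn.Dropout``; like the PyTorch layer, it scales the surviving values by $1/(1-p)$ during training so that the expected activation stays the same, and it does nothing in evaluation mode.

```python
import random

# Hypothetical stand-in for nn.Dropout, operating on a plain list.
def dropout(values, p, training, rng):
    if not training or p == 0.0:
        return list(values)          # evaluation mode: identity
    scale = 1.0 / (1.0 - p)          # rescale the survivors
    return [0.0 if rng.random() < p else v * scale for v in values]

rng = random.Random(1234)
activations = [1.0] * 8

train_out = dropout(activations, p=0.5, training=True, rng=rng)
eval_out = dropout(activations, p=0.5, training=False, rng=rng)

print(train_out)  # a mix of 0.0 (dropped) and 2.0 (kept and rescaled)
print(eval_out)   # unchanged: eight values of 1.0
```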


### Implementation Details
Another addition to this model is that we are not going to learn the embedding for the ``<pad>`` token. This is because we want to explicitly tell our model that padding tokens are irrelevant to determining the sentiment of a sentence. This means the embedding for the pad token will remain at what it is initialized to (we initialize it to all zeros later). We do this by passing the index of our pad token as the ``padding_idx`` argument to the ``nn.Embedding`` layer.
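
The behavior of ``padding_idx`` can be checked on a tiny embedding layer. The sizes and token ids below are arbitrary values chosen for this sketch.

```python
import torch
from torch import nn

torch.manual_seed(0)
PAD = 1
emb = nn.Embedding(num_embeddings=5, embedding_dim=3, padding_idx=PAD)

tokens = torch.tensor([[2, 3], [4, PAD]])  # the last position is padding
emb(tokens).sum().backward()

print(emb.weight[PAD])       # all zeros at initialization
print(emb.weight.grad[PAD])  # zero gradient, so this row is never updated
print(emb.weight.grad[2])    # non-zero gradient for a real token
```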

To use an ``LSTM`` instead of the standard RNN, we use ``nn.LSTM`` instead of ``nn.RNN``. Also, note that the ``LSTM`` returns the output and a tuple of the final hidden state and the final cell state, whereas the standard ``RNN`` only returned the output and final hidden state.

As the final hidden state of our ``LSTM`` has both a forward and a backward component, which will be concatenated together, the size of the input to the ``nn.Linear`` layer is twice that of the hidden dimension size.

Implementing bidirectionality and adding additional layers are done by passing values for the ``num_layers`` and ``bidirectional`` arguments of the RNN/LSTM.

``Dropout`` is implemented by initializing an ``nn.Dropout`` layer (the argument is the probability of dropping out each neuron) and using it within the forward method after each layer we want to apply dropout to. 

**Note:** Never use dropout on the input or output layers (text or fc in this case), you only ever want to use dropout on intermediate layers. The ``LSTM`` has a dropout argument which adds dropout on the connections between hidden states in one layer to hidden states in the next layer.

As we are passing the lengths of our sentences to be able to use packed padded sequences, we have to add a second argument, ``text_lengths``, to ``forward``.

Before we pass our embeddings to the RNN, we need to pack them, which we do with ``nn.utils.rnn.pack_padded_sequence``. This will cause our ``RNN`` to only process the non-padded elements of our sequence. The ``RNN`` will then return ``packed_output`` (a packed sequence) as well as the hidden and cell states (both of which are tensors). Without packed padded sequences, the hidden and cell states are taken from the last element in the sequence, which will most probably be a pad token; with packed padded sequences they are both taken from the last non-padded element in the sequence. Note that the lengths argument of ``pack_padded_sequence`` must be a CPU tensor, so we explicitly make it one by using ``.to('cpu')``.

We then unpack the output sequence, with ``nn.utils.rnn.pad_packed_sequence``, to transform it from a packed sequence to a tensor. The elements of output from padding tokens will be zero tensors (tensors where every element is zero). Usually, we only have to unpack output if we are going to use it later on in the model. Although we aren't in this case, we still unpack the sequence just to show how it is done.
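
The pack and unpack round trip can be tried on random data. The sketch below uses a small unidirectional, single-layer LSTM with arbitrary sizes so that the result is easy to inspect; it is not the model built in this notebook.

```python
import torch
from torch import nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=4, hidden_size=3)   # unidirectional, one layer

embedded = torch.randn(5, 2, 4)               # [sent len, batch size, emb dim]
lengths = torch.tensor([5, 3])                # true lengths, longest first, on the CPU

packed = nn.utils.rnn.pack_padded_sequence(embedded, lengths)
packed_output, (hidden, cell) = lstm(packed)
output, output_lengths = nn.utils.rnn.pad_packed_sequence(packed_output)

print(output.shape)    # torch.Size([5, 2, 3])
print(output[3:, 1])   # zeros: steps 3 and 4 of the shorter sequence are padding
# the final hidden state of the shorter sequence comes from its last real step
print(torch.allclose(hidden[-1, 1], output[2, 1]))  # True
```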

The final hidden state, ``hidden``, has a shape of ``[num layers * num directions, batch size, hid dim]``. These are ordered: ``[forward_layer_0, backward_layer_0, forward_layer_1, backward_layer_1, ..., forward_layer_n, backward_layer_n]``. As we want the final (top) layer forward and backward hidden states, we take the last two entries along the first dimension, ``hidden[-2,:,:]`` and ``hidden[-1,:,:]``, and concatenate them together before passing them to the linear layer (after applying dropout).
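
The ordering can be verified on a small bidirectional, two-layer LSTM. The sizes are arbitrary and the input has no padding, so the top-layer states can be compared directly against ``output``.

```python
import torch
from torch import nn

torch.manual_seed(0)
HID = 3
lstm = nn.LSTM(input_size=4, hidden_size=HID, num_layers=2, bidirectional=True)

x = torch.randn(5, 2, 4)                     # [sent len, batch size, emb dim]
output, (hidden, cell) = lstm(x)

print(hidden.shape)                          # torch.Size([4, 2, 3])
final = torch.cat((hidden[-2], hidden[-1]), dim=1)
print(final.shape)                           # torch.Size([2, 6])

# top-layer forward state = forward half of the output at the last time step
print(torch.allclose(hidden[-2], output[-1, :, :HID]))  # True
# top-layer backward state = backward half of the output at the first time step
print(torch.allclose(hidden[-1], output[0, :, HID:]))   # True
```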

###  Model Creation.


In [13]:
from torch import nn

In [15]:
class RNN(nn.Module):
  def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers,
               bidirectional, dropout, pad_idx):
    super().__init__()
    self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx = pad_idx)
    self.rnn = nn.LSTM(embedding_dim,
                       hidden_dim,
                       num_layers=n_layers,
                       bidirectional=bidirectional,
                       dropout=dropout)
    # hidden_dim * 2: forward and backward states are concatenated (assumes bidirectional=True)
    self.fc = nn.Linear(hidden_dim * 2, output_dim)
    self.dropout = nn.Dropout(dropout)

  def forward(self, text, text_lengths):
    # text = [sent len, batch size]
    embedded = self.dropout(self.embedding(text))
    # embedded = [sent len, batch size, emb dim]

    # pack the sequence; the lengths need to be on the CPU
    packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths.to('cpu'))
    # hidden and cell are the final states, taken from the last non-padded element
    packed_output, (hidden, cell) = self.rnn(packed_embedded)

    # unpack the sequence (not used below, shown for illustration)
    output, output_lengths = nn.utils.rnn.pad_packed_sequence(packed_output)
    # output = [sent len, batch size, hid dim * num directions]
    # output over padding tokens are zero tensors

    # hidden = [num layers * num directions, batch size, hid dim]
    # cell = [num layers * num directions, batch size, hid dim]

    # concat the final forward (hidden[-2,:,:]) and backward (hidden[-1,:,:]) hidden states
    # and apply dropout
    hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1))
    # hidden = [batch size, hid dim * num directions]

    return self.fc(hidden)

Like before, we'll create an instance of our ``RNN`` class, with the new parameters and arguments for the number of layers, bidirectionality and dropout probability.

To ensure the pre-trained vectors can be loaded into the model, the ``EMBEDDING_DIM`` must be equal to that of the pre-trained GloVe vectors loaded earlier.

We get our pad token index from the vocabulary, getting the actual string representing the pad token from the field's ``pad_token`` attribute, which is ``<pad>`` by default.

In [16]:

INPUT_DIM = len(TEXT.vocab) # 25002
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 1
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.5
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token] # 1

model = RNN(INPUT_DIM, 
            EMBEDDING_DIM, 
            HIDDEN_DIM, 
            OUTPUT_DIM, 
            N_LAYERS, 
            BIDIRECTIONAL, 
            DROPOUT, 
            PAD_IDX)
model

RNN(
  (embedding): Embedding(25002, 100, padding_idx=1)
  (rnn): LSTM(100, 256, num_layers=2, dropout=0.5, bidirectional=True)
  (fc): Linear(in_features=512, out_features=1, bias=True)
  (dropout): Dropout(p=0.5, inplace=False)
)

#### Number of trainable parameters.

In [17]:
def count_trainable_params(model):
  # sum the element counts of every parameter that receives gradients
  return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_trainable_params(model):,} trainable parameters')

The model has 4,810,857 trainable parameters
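
The total can be reproduced by hand from the layer sizes printed above. The sketch below assumes PyTorch's LSTM parameter layout (four gates, each with input weights, recurrent weights and two bias vectors); `lstm_layer_params` is a hypothetical helper for this calculation.

```python
# Hypothetical helper: parameters of one LSTM layer in one direction.
def lstm_layer_params(input_dim, hidden_dim):
    # 4 gates x (input weights + recurrent weights) + 2 bias vectors of size 4 * hidden_dim
    return 4 * hidden_dim * (input_dim + hidden_dim) + 2 * 4 * hidden_dim

VOCAB, EMB, HID, OUT = 25_002, 100, 256, 1

embedding = VOCAB * EMB                        # 2,500,200
layer_0 = 2 * lstm_layer_params(EMB, HID)      # forward + backward: 733,184
layer_1 = 2 * lstm_layer_params(2 * HID, HID)  # input is both directions: 1,576,960
linear = 2 * HID * OUT + OUT                   # weights + bias: 513

total = embedding + layer_0 + layer_1 + linear
print(f"{total:,}")  # 4,810,857
```

More than half of the parameters sit in the embedding layer, which is worth keeping in mind when comparing models.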


The final addition is copying the pre-trained word embeddings we loaded earlier into the embedding layer of our model.

We retrieve the embeddings from the field's vocab, and check they're the correct size, ``[vocab size, embedding dim]``

In [18]:
pretrained_embeddings = TEXT.vocab.vectors
print(pretrained_embeddings.shape)

torch.Size([25002, 100])


In [19]:
pretrained_embeddings[:1]

tensor([[-0.1117, -0.4966,  0.1631, -0.8817,  0.0539,  0.6684, -0.0597, -0.4675,
         -0.2153,  0.8840, -0.7584, -0.3689, -0.3424, -1.4020,  0.3206, -1.0219,
          0.7988, -0.0923, -0.7049, -1.6024,  0.2891,  0.4899, -0.3853, -0.7120,
         -0.1706, -1.4594,  0.2207,  0.2463, -1.3248,  0.6970, -0.6631,  1.2158,
         -1.4949,  0.8810, -1.1786, -0.9340, -0.5675, -0.2772, -2.1834,  0.3668,
          0.9380,  0.0078, -0.3139, -1.1567,  1.8409, -1.0174,  1.2192,  0.1601,
          1.5985, -0.0469, -1.5270, -2.0143, -1.5173,  0.3877, -1.1849,  0.6897,
          1.3232,  1.8169,  0.6808,  0.7244,  0.0323, -1.6593, -1.8773,  0.7372,
          0.9257,  0.9247,  0.1825, -0.0737,  0.3147, -1.0369,  0.2100,  0.6144,
          0.0628, -0.3297, -1.7970,  0.8728,  0.7670, -0.1138, -0.9428,  0.7540,
          0.1407, -0.6937, -0.6159, -0.7295,  1.3204,  1.5997, -1.0792, -0.3396,
         -1.4538, -2.6740,  1.5984,  0.8021,  0.5722,  0.0653, -0.0235,  0.8876,
          1.4689,  1.2647, -


We then replace the initial weights of the embedding layer with the pre-trained embeddings.

**Note**: this should always be done on the weight.data and not the weight!

In [21]:
model.embedding.weight.data.copy_(pretrained_embeddings)

tensor([[-0.1117, -0.4966,  0.1631,  ...,  1.2647, -0.2753, -0.1325],
        [-0.8555, -0.7208,  1.3755,  ...,  0.0825, -1.1314,  0.3997],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 0.0686, -0.2422,  0.2714,  ..., -0.1480, -0.4611,  0.3606],
        [-0.1419,  0.0282,  0.2185,  ..., -0.1100, -0.1250,  0.0282],
        [-0.9607,  0.5405, -0.8723,  ..., -0.2872,  0.1165,  0.2891]])

As our ``<unk>`` and ``<pad>`` tokens aren't in the pre-trained vocabulary, they have been initialized using ``unk_init`` (an $\mathcal{N}(0,1)$ distribution) when building our vocab. It is preferable to initialize them both to all zeros to explicitly tell our model that, initially, they are irrelevant for determining sentiment.

We do this by manually setting their row in the embedding weights matrix to zeros. We get their row by finding the index of the tokens, which we have already done for the padding index.

**Note:** like initializing the embeddings, this should be done on the weight.data and not the weight!

In [23]:

UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token] # 0

model.embedding.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)

print(model.embedding.weight.data)

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 0.0686, -0.2422,  0.2714,  ..., -0.1480, -0.4611,  0.3606],
        [-0.1419,  0.0282,  0.2185,  ..., -0.1100, -0.1250,  0.0282],
        [-0.9607,  0.5405, -0.8723,  ..., -0.2872,  0.1165,  0.2891]])


We can now see the first two rows of the embedding weights matrix have been set to zeros. As we passed the index of the pad token to the ``padding_idx`` of the embedding layer, it will remain zeros throughout training. The ``<unk>`` token embedding, however, will be learned.

### Training the model.
The only part that changes is our `optimizer`: we are now using `Adam` instead of `SGD`. The `criterion` remains the same.

In [24]:
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.BCEWithLogitsLoss()

### Pushing the model and loss function to the device

In [25]:
model = model.to(device)
criterion = criterion.to(device)

### $L$oss and $A$ccuracy.

Our criterion function calculates the loss; however, we have to write our own function to calculate the accuracy.

This function first feeds the predictions through a sigmoid layer, squashing the values between 0 and 1, and then rounds them to the nearest integer. This rounds any value greater than 0.5 to 1 (a positive sentiment) and the rest to 0 (a negative sentiment).

We then calculate how many rounded predictions equal the actual labels and average it across the batch.

In [26]:
def accuracy(y_preds, y_true):
  #round predictions to the closest integer
  rounded_preds = torch.round(torch.sigmoid(y_preds))
  correct = (rounded_preds == y_true).float() #convert into float for division 
  acc = correct.sum() / len(correct)
  return acc
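
A worked example of the same steps in plain Python may help. The logits and labels below are made-up values; the sketch only mirrors the sigmoid, round and compare logic and does not use PyTorch.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

logits = [2.0, -1.0, 0.3, -0.2]   # raw model outputs (made up)
labels = [1, 0, 0, 0]

probs = [sigmoid(z) for z in logits]   # roughly 0.88, 0.27, 0.57, 0.45
preds = [round(p) for p in probs]
acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)

print(preds)  # [1, 0, 1, 0]
print(acc)    # 0.75
```

The third example is misclassified: its logit is only slightly positive, yet it still rounds to 1.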


### Training the model - `train()` function
We define a function for training our model.

As we have set ``include_lengths = True``, our ``batch.text`` is now a tuple with the first element being the numericalized tensor and the second element being the actual lengths of each sequence. We separate these into their own variables, ``text`` and ``text_lengths``, before passing them to the model.

Note: as we are now using dropout, we must remember to use ``model.train()`` to ensure the dropout is "turned on" while training.

In [27]:
def train(model, iterator, optimizer, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.train()
    for batch in iterator:
        optimizer.zero_grad()
        text, text_lengths = batch.text
        predictions = model(text, text_lengths).squeeze(1)
        loss = criterion(predictions, batch.label)
        acc = accuracy(predictions, batch.label)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
        epoch_acc += acc.item()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)


Then we define a function for testing our model, again remembering to separate ``batch.text``.

Note: as we are now using dropout, we must remember to use ``model.eval()`` to ensure the dropout is "turned off" while evaluating.

In [28]:
def evaluate(model, iterator, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.eval()
    with torch.no_grad():
        for batch in iterator:
            text, text_lengths = batch.text
            predictions = model(text, text_lengths).squeeze(1)
            loss = criterion(predictions, batch.label)
            acc = accuracy(predictions, batch.label)
            epoch_loss += loss.item()
            epoch_acc += acc.item()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

We'll also create a function to tell us how long an epoch takes to compare training times between models.

In [29]:
import time
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

We then train the model through multiple epochs, an epoch being a complete pass through all examples in the training and validation sets.

At each epoch, if the validation loss is the best we have seen so far, we'll save the parameters of the model and then after training has finished we'll use that model on the test set.

In [30]:
N_EPOCHS = 5
best_valid_loss = float('inf')
for epoch in range(N_EPOCHS):
    start_time = time.time()
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, validation_iterator, criterion)
    end_time = time.time()
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'best-model.pt')
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01 | Epoch Time: 0m 42s
	Train Loss: 0.673 | Train Acc: 58.21%
	 Val. Loss: 0.689 |  Val. Acc: 57.18%
Epoch: 02 | Epoch Time: 0m 42s
	Train Loss: 0.618 | Train Acc: 66.50%
	 Val. Loss: 0.539 |  Val. Acc: 74.41%
Epoch: 03 | Epoch Time: 0m 42s
	Train Loss: 0.462 | Train Acc: 78.76%
	 Val. Loss: 0.343 |  Val. Acc: 85.38%
Epoch: 04 | Epoch Time: 0m 42s
	Train Loss: 0.332 | Train Acc: 86.46%
	 Val. Loss: 0.376 |  Val. Acc: 83.40%
Epoch: 05 | Epoch Time: 0m 42s
	Train Loss: 0.272 | Train Acc: 89.18%
	 Val. Loss: 0.280 |  Val. Acc: 88.47%
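
To see which checkpoints were written during this run, the saving rule can be replayed on the validation losses from the log above. This is only a plain-Python replay of the `if valid_loss < best_valid_loss` check.

```python
valid_losses = [0.689, 0.539, 0.343, 0.376, 0.280]  # from the log above

best = float("inf")
saved_at = []
for epoch, loss in enumerate(valid_losses, start=1):
    if loss < best:          # same rule as in the training loop
        best = loss
        saved_at.append(epoch)

print(saved_at)  # [1, 2, 3, 5]
print(best)      # 0.28
```

Epoch 4 does not overwrite the checkpoint because its validation loss went up, so `best-model.pt` ends up holding the weights from epoch 5.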


### Evaluate the best model.

In [31]:
model.load_state_dict(torch.load('best-model.pt'))

test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.292 | Test Acc: 87.85%


### Making predictions

We can now use our model to predict the sentiment of any sentence we give it. As it has been trained on movie reviews, the sentences provided should also be movie reviews.

**Note:** When using a model for inference it should always be in evaluation mode.

Our ``predict_sentiment`` function does a few things:

* sets the model to ``evaluation`` mode
* ``tokenizes`` the sentence, i.e. splits it from a raw string into a list of tokens
* ``indexes`` the tokens by converting them into their integer representation from our vocabulary
* gets the length of our sequence
* converts the indexes, which are a Python list, into a PyTorch tensor
* adds a batch dimension by unsqueezing
* converts the length into a tensor
* squashes the output prediction from a real number to a value between 0 and 1 with the sigmoid function
* converts the tensor holding a single value into a Python float with the ``item()`` method


> We are expecting reviews with a negative sentiment to return a value close to 0 and positive reviews to return a value close to 1.

In [32]:
import spacy
import en_core_web_sm
nlp = en_core_web_sm.load()
def predict_sentiment(model, sent):
  model.eval()

  tokenized = [tok.text for tok in nlp.tokenizer(sent)]
  indexed = [TEXT.vocab.stoi[t] for t in tokenized]
  length = [len(indexed)]

  tensor = torch.LongTensor(indexed).to(device)
  tensor = tensor.unsqueeze(1)
  length_tensor = torch.LongTensor(length)
  prediction = torch.sigmoid(model(tensor, length_tensor))

  return prediction.item()
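
One detail worth knowing is what happens to words the model has never seen. The sketch below assumes that the vocabulary's ``stoi`` mapping falls back to the ``<unk>`` index for unknown tokens, which is our understanding of the legacy TorchText ``Vocab``. The toy vocabulary here is made up, with a `defaultdict` standing in for ``TEXT.vocab.stoi``.

```python
from collections import defaultdict

UNK_IDX = 0
# Toy stand-in for TEXT.vocab.stoi (assumption: unknown tokens map to <unk>).
stoi = defaultdict(lambda: UNK_IDX,
                   {"<unk>": 0, "<pad>": 1, "this": 2, "film": 3, "is": 4, "great": 5})

tokens = ["this", "film", "is", "zzyzx"]
indexed = [stoi[t] for t in tokens]
print(indexed)  # [2, 3, 4, 0]
```

Under that assumption, a review made up mostly of unseen words is fed to the model as a run of ``<unk>`` embeddings, so its prediction should be treated with caution.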


> Negative sentiment

In [33]:
predict_sentiment(model, "This film is terrible")

0.007193383295089006

> Positive sentiment

In [34]:
predict_sentiment(model, "This film is great")

0.9876543879508972


### Next Steps
We've now built a decent sentiment analysis model for movie reviews! In the next notebook we'll implement a model that gets comparable accuracy with far fewer parameters and trains much, much faster.

### Credits:

* [bentrevett](https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/2%20-%20Upgraded%20Sentiment%20Analysis.ipynb)

* [Read the Docs](https://torchtext.readthedocs.io/en/latest/data.html#functions)

