# Lab 2: Text classification: Sentiment analysis

In this session we are going to use state-of-the-art models for text classification, with sentiment analysis as the running example. To be more precise, we will build a feed-forward neural network (FFNN) and a convolutional neural network (CNN). We will look into the details of data preparation, how each model works and how the performance of these NNs can be measured. We will start with a toy corpus, so that we can then extend our models to larger datasets.

Again we are using [PyTorch](https://www.pytorch.org), an open source deep learning platform, as the backbone library of the course.

In [None]:
# Uncomment below if you want to install packages into your local environment
# Colab already provides the required packages for this lab

#! pip install torch

In [None]:
import random

import numpy as np

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

# we fix the seeds to get consistent results before every training
# loop in what follows
def fix_seed(seed=234):
  torch.manual_seed(seed)
  torch.cuda.manual_seed(seed)
  np.random.seed(seed)
  random.seed(seed)

## Data preparation

Here are the toy training and validation sets. It is good practice to hold out a validation set (a set that is representative of the test data). This set is used to tune hyperparameters and to choose the model configuration that gives the best performance. Our toy sets are already tokenized and lowercased.

In [None]:
# Our toy sentiment analysis corpus
train = ['i like his paper !',
         'what a well-written essay !',
         'i do not agree with the criticism on this paper',
         'well done ! it was an enjoyable reading',
         'it was very good . send me a copy please .',
         'the argumentation in the paper is very weak',
         'poor effort !',
         'the methodology could have been more detailed',
         'i am not impressed',
         'could have done better .',
]

train_labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# Validation set
valid = ['i like your paper', 
         'i agree with your results', 
         'what a success ! a well-written paper', 
         'not enough details . very poor', 
         'i support the criticism',
         'could be better',
]

valid_labels = [1, 1, 1, 0, 0, 0]

### Pre-processing

Using the material from the previous lab session, fill in the function below to tokenize the corpus:

In [None]:
def get_tokenized_corpus(corpus):
  tokenized_corpus = []

  #######################
  # Q: Process the corpus
  #######################
  for sentence in corpus:
    tokenized_sentence = []
    for token in sentence.split(' '): 
      tokenized_sentence.append(token)
    tokenized_corpus.append(tokenized_sentence)
 
  return tokenized_corpus

### Word2index dictionary

Similar to the way it was done in the previous lab, we define here a method that returns a word to index dictionary. Note that we reserve the 0 index for the padding token `<pad>`.

In [None]:
def get_word2idx(tokenized_corpus):
  vocabulary = []
  for sentence in tokenized_corpus:
    for token in sentence:
        if token not in vocabulary:
            vocabulary.append(token)
  
  word2idx = {w: idx+1 for (idx, w) in enumerate(vocabulary)}
  # we reserve the 0 index for the padding token
  word2idx['<pad>'] = 0
  
 
  return word2idx

### Preparation of inputs

The first layer of our FFNN will be an embedding (look-up) layer which takes as input indexes of tokens (we do not need to one-hot encode our vectors).

---

**Q: Why do we need to fix the length of our input vectors (we take the maximum sentence length here)? This process is referred to as padding. Print the padded training corpus.**

*A: We are preparing inputs for the FFNN, which takes fixed-size vectors as input. Padding brings every input sentence to the same length.*
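
For comparison, PyTorch also ships a helper that performs this kind of zero-padding. The sketch below is only an illustration: `toy_sents` is a hypothetical list of already-indexed sentences, not part of the lab corpus.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical, already-indexed sentences of different lengths
toy_sents = [[1, 2, 3, 4, 5], [6, 7], [8, 9, 10]]

# pad_sequence expects a list of tensors; batch_first=True gives a tensor
# of shape (batch_size, max_sent_len), filled with 0 where a sentence ends
padded = pad_sequence(
    [torch.tensor(s, dtype=torch.long) for s in toy_sents],
    batch_first=True, padding_value=0)

print(padded)
```

The result has one row per sentence, and shorter sentences are filled with the padding index 0 on the right.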

In [None]:
def get_model_inputs(tokenized_corpus, word2idx, labels):
  # we index our sentences
  vectorized_sents = [[word2idx[tok] for tok in sent if tok in word2idx] for sent in tokenized_corpus]

  # Sentence lengths
  sent_lengths = [len(sent) for sent in vectorized_sents]

  # Get maximum length
  max_len = max(sent_lengths)
  
  # we create a tensor of a fixed size filled with zeroes for padding
  sent_tensor = torch.zeros((len(vectorized_sents), max_len)).long()

  # we fill it with our vectorized sentences 
  for idx, (sent, sentlen) in enumerate(zip(vectorized_sents, sent_lengths)):
    sent_tensor[idx, :sentlen] = torch.LongTensor(sent)

  # Label tensor
  label_tensor = torch.FloatTensor(labels)
  
  return sent_tensor, label_tensor

###

tokenized_corpus = get_tokenized_corpus(train)
word2idx = get_word2idx(tokenized_corpus)
train_sent_tensor, train_label_tensor = get_model_inputs(tokenized_corpus, word2idx, train_labels)

print(f'Vocabulary size: {len(word2idx)}')
print('Training set tensor:')
print(train_sent_tensor)

## Building the Feed-Forward Neural Network

We will start by building a very simple feed-forward neural network (FFNN).
Our FFNN class is a sub-class of `nn.Module`. Within the `__init__` method, we define the layers of the module:

- Our first layer is an embedding layer (look-up layer). This layer could be initialized with pre-trained embeddings (as we will see at the end of this lab) or could be trained together with other layers.
 
- The next layer is a fully connected layer followed by a ReLU activation.

- Finally, the last linear layer is the output layer for the classification task.

The `forward()` method is called when we feed data into our model. Please note that the output dimension of each layer is the input dimension for the next one.

---

**Q: Recall from the previous lab the functioning of a lookup layer. How does the mapping to the dense representation happen?**

*A: We multiply the one-hot input vector of the size of the vocabulary by a matrix of the shape `vocabulary size X embedding size`.*
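
The equivalence between the one-hot multiplication and a plain row look-up can be checked on a small example. This is a minimal sketch with illustrative sizes (`vocab_size`, `emb_dim` and `token_ids` are made up for the demonstration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size, emb_dim = 6, 4          # illustrative sizes
embedding = nn.Embedding(vocab_size, emb_dim)

token_ids = torch.tensor([2, 5, 2])

# Explicit version: one-hot vectors times the (vocab_size x emb_dim) weight matrix
one_hot = F.one_hot(token_ids, num_classes=vocab_size).float()
via_matmul = one_hot @ embedding.weight

# What the layer does in practice: select rows of the matrix by index
via_lookup = embedding(token_ids)

print(torch.allclose(via_matmul, via_lookup))
```

Both routes give the same vectors; the look-up simply avoids building the one-hot vectors.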

**Q: Implement the averaging of embeddings in the `forward()` method of the class below.**

*A: If all the input sentences had the same length, i.e. there were no 0-padded positions, the solution would simply be* `embedded.mean(1)`*. Here, however, sentence lengths differ, so we first sum the embeddings and then divide the sum by the true length of each sentence:*

```python
sent_lens = x.ne(0).sum(1, keepdims=True)
averaged = embedded.sum(1) / sent_lens
```
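
To see why the length-aware average matters, here is a small hand-made example. The tensors are hypothetical values chosen so the arithmetic is easy to follow; the padded position holds a zero vector, as `padding_idx=0` would give.

```python
import torch

# Two "sentences" with max length 3; the second one has a padded position (index 0)
x = torch.tensor([[1, 2, 3],
                  [4, 5, 0]])

# Hand-made embeddings of dimension 2; the padded position is a zero vector
embedded = torch.tensor([[[1., 1.], [2., 2.], [3., 3.]],
                         [[2., 2.], [4., 4.], [0., 0.]]])

naive = embedded.mean(1)                    # always divides by the max length (3)
sent_lens = x.ne(0).sum(1, keepdim=True)    # true lengths: 3 and 2
masked = embedded.sum(1) / sent_lens        # divides by the true length

print(naive)
print(masked)
```

For the second sentence the naive mean divides by 3 instead of 2, so its average is pulled towards zero; the length-aware version is not affected by padding.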


In [None]:
class FFNN(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, vocab_size, num_classes):  
        super(FFNN, self).__init__()
        
        # embedding (lookup layer) layer
        # padding_idx argument makes sure that the 0-th token in the vocabulary
        # is used for padding purposes i.e. its embedding will be a 0-vector
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        
        # hidden layer
        self.fc1 = nn.Linear(embedding_dim, hidden_dim)
        
        # activation
        self.relu1 = nn.ReLU()
        
        # output layer
        self.fc2 = nn.Linear(hidden_dim, num_classes)  
    
    def forward(self, x):
        # x has shape (batch_size, max_sent_len)

        embedded = self.embedding(x)
        # `embedded` has shape (batch size, max_sent_len, embedding dim)

        ########################################################################
        # Q: Compute the average embeddings of shape (batch_size, embedding_dim)
        ########################################################################
        # Implement averaging that ignores padding (average using actual sentence lengths).
        # Hint: You need to ignore the <pad> token when averaging.
        # How does this affect the result?
        
        sent_lens = x.ne(0).sum(1, keepdims=True)
        averaged = embedded.sum(1) / sent_lens

        out = self.fc1(averaged)
        out = self.relu1(out)
        out = self.fc2(out)
        return out

### Training the model

In this section we will define the hyperparameters of our model, the loss function, the optimizer and perform a number of training epochs over our toy training corpus.

We will use the **Stochastic gradient descent (SGD)** optimizer. The learning rate hyperparameter of the optimizer controls how far the weights are moved along the loss gradient at each update. The lower the value, the smaller each weight update.

**Note that** it is common practice to train using mini-batches (subsets of training instances that the model sees in a single weight update step). In this case, the epoch loss is defined as the loss averaged across the mini-batches. Since our corpus is very small, we train on the whole training set without batching.
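
For reference, a mini-batch version of one training epoch could look roughly like the sketch below. It is not used in this lab: the data is random, and `toy_model`, `toy_optimizer` and `toy_loss_fn` are hypothetical stand-ins for the objects defined later.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Hypothetical data: 10 examples with 4 features each and binary labels
features = torch.randn(10, 4)
labels = torch.randint(0, 2, (10,)).float()

loader = DataLoader(TensorDataset(features, labels), batch_size=4, shuffle=True)

toy_model = nn.Linear(4, 1)          # stand-in for the lab's FFNN
toy_optimizer = optim.SGD(toy_model.parameters(), lr=0.1)
toy_loss_fn = nn.BCEWithLogitsLoss()

batch_losses = []
for batch_x, batch_y in loader:
    toy_optimizer.zero_grad()
    batch_loss = toy_loss_fn(toy_model(batch_x).squeeze(1), batch_y)
    batch_loss.backward()
    toy_optimizer.step()             # one weight update per mini-batch
    batch_losses.append(batch_loss.item())

# Epoch loss = mean of the mini-batch losses
epoch_loss = sum(batch_losses) / len(batch_losses)
print(f'{len(batch_losses)} mini-batches, epoch loss: {epoch_loss:.3f}')
```

With 10 examples and a batch size of 4, the loader yields batches of 4, 4 and 2 examples, so the weights are updated three times per epoch.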

---

**Q: Why is the number of output classes equal to 1 for binary classification?**

*A: With the sigmoid transformation, the single output is interpreted as the probability of the positive class. If it is $\geq 0.5$ the predicted class is $1$, otherwise $0$.*
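
The relation between the single raw output (logit), the sigmoid probability and the predicted class can be sketched as follows. The `logits` and `gold` values are hypothetical, not outputs of the lab model.

```python
import torch
import torch.nn as nn

# Hypothetical raw model outputs (logits) and gold labels
logits = torch.tensor([-2.0, -0.1, 0.3, 4.0])
gold = torch.tensor([0.0, 1.0, 1.0, 1.0])

probs = torch.sigmoid(logits)        # probability of the positive class
classes = (probs >= 0.5).long()      # equivalent to (logits >= 0)

# BCEWithLogitsLoss applies the sigmoid internally,
# BCELoss expects probabilities
loss_from_logits = nn.BCEWithLogitsLoss()(logits, gold)
loss_from_probs = nn.BCELoss()(probs, gold)

print(probs)
print(classes)
print(loss_from_logits.item(), loss_from_probs.item())
```

Both losses agree up to floating point error, which is why the model can output raw logits as long as the loss function applies the sigmoid itself.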

**Q: Try to modify the learning rate (which is initially set to 0.5 below) in the range $[0.0001, 0.5]$. How does the loss react to these changes?**

*A: The loss will typically change more slowly with a lower learning rate.*

In [None]:
# Reset the seed before every model construction for reproducible results
fix_seed()

# we will train for N epochs (The model will see the corpus N times)
EPOCHS = 10

# Learning rate is initially set to 0.5
LRATE = 0.5

# we define our embedding dimension (dimensionality of the output of the first layer)
EMBEDDING_DIM = 50

# dimensionality of the output of the second hidden layer
HIDDEN_DIM = 50

# the output dimension is the number of classes, 1 for binary classification
OUTPUT_DIM = 1

# Construct the model
model = FFNN(EMBEDDING_DIM, HIDDEN_DIM, len(word2idx), OUTPUT_DIM)

# Print the model
print(model)

# we use the stochastic gradient descent (SGD) optimizer
optimizer = optim.SGD(model.parameters(), lr=LRATE)

# we use the binary cross-entropy loss with sigmoid (applied to logits) 
# Recall that we did not apply any activation to our output layer, hence we need
# to make our outputs look like probabilities.
loss_fn = nn.BCEWithLogitsLoss()

# Input and label tensors
feature = train_sent_tensor
target = train_label_tensor

################
# Start training
################
print(f'Will train for {EPOCHS} epochs')
for epoch in range(1, EPOCHS + 1):
  # to ensure the dropout (explained later) is "turned on" while training
  # good practice to include even if we do not use dropout here
  model.train()
  
  # we zero the gradients as they are not removed automatically
  optimizer.zero_grad()
  
  # squeeze is needed as the predictions will have the shape (batch size, 1)
  # and we need to remove the dimension of size 1
  predictions = model(feature).squeeze(1)

  # Compute the loss
  loss = loss_fn(predictions, target)
  train_loss = loss.item()

  # calculate the gradient of each parameter
  loss.backward()

  # update the parameters using the gradients and optimizer algorithm 
  optimizer.step()
  
  print(f'| Epoch: {epoch:02} | Train Loss: {train_loss:.3f}')

### Measuring the accuracy

In addition to measuring the loss, we can also evaluate the actual classification performance of our model. (In the case of training with mini-batches, the epoch accuracy is defined as the accuracy averaged across the mini-batches.)

---

**Q: Fill in the below function so that it computes the accuracy of the model. Once you are done, improve the previous loop so that it also prints the training accuracy after each epoch.**

In [None]:
def accuracy(output, target):
  #####################################
  # Q: Return the accuracy of the model
  #####################################
  # Pass through the sigmoid and round the values to 0 or 1
  output = torch.round(torch.sigmoid(output))
  correct = (output == target).float()
  acc = correct.mean()

  return acc

In [None]:
# Reset the seed for consistent results
fix_seed()

# we will train for N epochs (The model will see the corpus N times)
EPOCHS = 10

# Learning rate is initially set to 0.5
LRATE = 0.5

# we define our embedding dimension (dimensionality of the output of the first layer)
EMBEDDING_DIM = 50

# dimensionality of the output of the second hidden layer
HIDDEN_DIM = 50

# the output dimension is the number of classes, 1 for binary classification
OUTPUT_DIM = 1

# Construct the model
model = FFNN(EMBEDDING_DIM, HIDDEN_DIM, len(word2idx), OUTPUT_DIM)

# Print the model
print(model)

# we use the stochastic gradient descent (SGD) optimizer
optimizer = optim.SGD(model.parameters(), lr=LRATE)

# we use the binary cross-entropy loss with sigmoid (applied to logits) 
# Recall that we did not apply any activation to our output layer, hence we need
# to make our outputs look like probabilities.
loss_fn = nn.BCEWithLogitsLoss()

# Input and label tensors
feature = train_sent_tensor
target = train_label_tensor

################
# Start training
################
print(f'Will train for {EPOCHS} epochs')
for epoch in range(1, EPOCHS + 1):
  # to ensure the dropout (explained later) is "turned on" while training
  # good practice to include even if we do not use dropout here
  model.train()
  
  # we zero the gradients as they are not removed automatically
  optimizer.zero_grad()
  
  # squeeze is needed as the predictions will have the shape (batch size, 1)
  # and we need to remove the dimension of size 1
  predictions = model(feature).squeeze(1)

  # Compute the loss
  loss = loss_fn(predictions, target)
  train_loss = loss.item()
  
  #####################
  # Q: Compute accuracy
  #####################
  train_acc = accuracy(predictions, target)

  # calculate the gradient of each parameter
  loss.backward()

  # update the parameters using the gradients and optimizer algorithm 
  optimizer.step()

  print(f'| Epoch: {epoch:02} | Train Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')

### Hyperparameter tuning on the validation set

You should now apply the previous pre-processing and input preparation procedures to the validation set as well.

---

**Q: Should we re-use the word to index dictionary we created before? Why?**

*A: Yes. The model only knows the training vocabulary, so we have to use that same dictionary to map our tokens to indices.*
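
Note that `get_model_inputs` silently drops tokens that are missing from `word2idx`. A common alternative, which this lab does not use, is to map unknown tokens to a dedicated `<unk>` index. The sketch below contrasts the two options; `toy_word2idx` and `encode` are hypothetical names introduced only for this illustration.

```python
# Hypothetical vocabulary built from training data only;
# index 0 is <pad> as in the lab, index 1 is an assumed <unk> entry
toy_word2idx = {'<pad>': 0, '<unk>': 1, 'i': 2, 'like': 3, 'paper': 4}

def encode(tokens, vocab, drop_unknown=True):
    """Map tokens to indices, either dropping or replacing unknown tokens."""
    if drop_unknown:
        # same behaviour as the lab's get_model_inputs
        return [vocab[tok] for tok in tokens if tok in vocab]
    return [vocab.get(tok, vocab['<unk>']) for tok in tokens]

valid_tokens = 'i like your paper'.split(' ')

dropped = encode(valid_tokens, toy_word2idx)
replaced = encode(valid_tokens, toy_word2idx, drop_unknown=False)

print(dropped)
print(replaced)
```

Dropping shortens the sentence, while the `<unk>` variant keeps its length but requires the unknown-token embedding to be trained or initialized in some way.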
 

In [None]:
###############################################
# Q: Prepare the validation corpus and labels #
###############################################
# Store validation sentences and labels in ``valid_sent_tensor`` 
# and ``valid_label_tensor`` respectively.
tokenized_valid_corpus = get_tokenized_corpus(valid)
valid_sent_tensor, valid_label_tensor = get_model_inputs(tokenized_valid_corpus, word2idx, valid_labels)
print(valid_sent_tensor)

**Q: Try to modify the learning rate and the number of epochs now. How will the validation loss and accuracy react to those changes?**

*A: Typically the validation loss and accuracy will change more slowly with a lower learning rate. Smaller steps potentially increase the chance of reaching a good optimum, but the optimization takes longer because each step towards the minimum of the loss function is smaller; hence, we increase the number of epochs. A very high learning rate risks making the loss "bounce around" and even overshoot the optimum, preventing convergence.*
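
The effect of the learning rate can be illustrated on a one-dimensional toy problem. The sketch below minimises $f(w) = w^2$ with plain gradient descent; it is a simplified stand-in for the lab's training loop, and `gradient_descent` is a hypothetical helper.

```python
def gradient_descent(lr, steps=10, w0=1.0):
    """Minimise f(w) = w**2 with plain gradient descent; return the final |w|."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w          # derivative of w**2
        w = w - lr * grad       # each step multiplies w by (1 - 2 * lr)
    return abs(w)

for lr in (0.01, 0.4, 1.1):
    print(f'lr={lr}: |w| after 10 steps = {gradient_descent(lr):.4f}')
```

With the small learning rate the distance to the minimum shrinks only slowly, with the moderate one it shrinks quickly, and with the large one every step overshoots the minimum by more than the previous distance, so the iterate diverges.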

In [None]:
# Reset the seed for consistent results
fix_seed()

# we will train for N epochs (The model will see the corpus N times)
EPOCHS = 10

# Learning rate is initially set to 0.5
LRATE = 0.5

# we define our embedding dimension (dimensionality of the output of the first layer)
EMBEDDING_DIM = 50

# dimensionality of the output of the second hidden layer
HIDDEN_DIM = 50

# the output dimension is the number of classes, 1 for binary classification
OUTPUT_DIM = 1

# Construct the model
model = FFNN(EMBEDDING_DIM, HIDDEN_DIM, len(word2idx), OUTPUT_DIM)

# we use the stochastic gradient descent (SGD) optimizer
optimizer = optim.SGD(model.parameters(), lr=LRATE)

# we use the binary cross-entropy loss with sigmoid (applied to logits) 
# Recall that we did not apply any activation to our output layer, hence we need
# to make our outputs look like probabilities.
loss_fn = nn.BCEWithLogitsLoss()

# Input and label tensors for training
feature_train = train_sent_tensor
target_train = train_label_tensor

# Input and label tensors for validation
feature_valid = valid_sent_tensor
target_valid = valid_label_tensor

################
# Start training
################
print(f'Will train for {EPOCHS} epochs')
for epoch in range(1, EPOCHS + 1):
  # to ensure the dropout (explained later) is "turned on" while training
  # good practice to include even if we do not use dropout here
  model.train()
  
  # we zero the gradients as they are not removed automatically
  optimizer.zero_grad()
  
  # squeeze is needed as the predictions will have the shape (batch size, 1)
  # and we need to remove the dimension of size 1
  predictions = model(feature_train).squeeze(1)

  # Compute the loss
  loss = loss_fn(predictions, target_train)
  train_loss = loss.item()

  # Compute training accuracy
  train_acc = accuracy(predictions, target_train)

  # calculate the gradient of each parameter
  loss.backward()

  # update the parameters using the gradients and optimizer algorithm 
  optimizer.step()
  
  # this puts the model in "evaluation mode" (disables dropout and makes batch
  # normalization use its running statistics)
  # good practice to include even if we do not use them right now
  model.eval()

  # we do not compute gradients within this block, i.e. no training
  with torch.no_grad():
    predictions_valid = model(feature_valid).squeeze(1)
    valid_loss = loss_fn(predictions_valid, target_valid).item()
    valid_acc = accuracy(predictions_valid, target_valid)
  
  print(f'| Epoch: {epoch:02} | Train Loss: {train_loss:.3f} | Train Acc: {train_acc*100:6.2f}% | Val. Loss: {valid_loss:.3f} | Val. Acc: {valid_acc*100:6.2f}% |')

### Testing the model

Now let us test our trained model. We define a small test set below. First, apply the data preparation procedures to this test set as you did for the validation set.


In [None]:
test = ['i really do not like your paper', 
        'well done', 
        'good results for a paper !',
        'amazing effort', 
        'your effort is poor !', 
        'not impressed'   
]

test_labels = [0, 1, 1, 1, 0, 0]

#########################################
# Q: Prepare the test corpus and labels #
#########################################
# Store test sentences and labels in ``test_sent_tensor`` 
# and ``test_label_tensor`` respectively.
tokenized_test_corpus = get_tokenized_corpus(test)
test_sent_tensor, test_label_tensor = get_model_inputs(tokenized_test_corpus, word2idx, test_labels)
print(test_sent_tensor)

**Q: Fill in the function below for the computation of F-measure. Once done, complete the missing lines in the final evaluation part.**  

*A: See the following code*

In [None]:
def f_measure(output, gold):
  ############################################
  # Q: Compute precision, recall and f-measure 
  ############################################
  pred = torch.round(torch.sigmoid(output))
  pred = pred.detach().cpu().numpy()
     
  test_pos_preds = np.sum(pred)
  test_pos_real = np.sum(gold)
    
  correct = (np.logical_and(pred, gold)).astype(int)
  correct = np.sum(correct)
  
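  # Guard against division by zero (no positive predictions or no positive labels)
  precision = correct / test_pos_preds if test_pos_preds > 0 else 0.0
  recall = correct / test_pos_real if test_pos_real > 0 else 0.0
  
  denom = precision + recall
  fscore = (2.0 * precision * recall) / denom if denom > 0 else 0.0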

  # Print them
  print(f"     Recall: {recall:.2f}, Precision: {precision:.2f}, F-measure: {fscore:.2f}")
  

####

model.eval()

feature_test = test_sent_tensor
target_test = test_label_tensor

with torch.no_grad():
  ####################################################################
  # Q: Get predictions for the test set, compute the loss and accuracy
  ####################################################################
  predictions = model(feature_test).squeeze(1)
  test_loss = loss_fn(predictions, target_test).item()
  test_acc = accuracy(predictions, target_test)

  # Print
  print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')
  f_measure(predictions, test_labels)  

**Q: Are the resulting evaluations different? How do you interpret those differences? Print the predictions.**

*A: Accuracy is higher than F-measure. F-measure focuses on the results for the positive class in terms of their precision and recall (for our example precision is high, recall is low), while accuracy compares all the predictions for both classes to gold labels.*
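
A small worked example makes the difference concrete. The predictions below are hypothetical (they are not the lab model's outputs) and were chosen so that precision is high and recall is low.

```python
import numpy as np

# Hypothetical predictions and gold labels
toy_pred = np.array([1, 0, 0, 0, 0, 0])
toy_gold = np.array([1, 1, 1, 0, 0, 0])

true_pos = np.sum(np.logical_and(toy_pred == 1, toy_gold == 1))

toy_accuracy = np.mean(toy_pred == toy_gold)        # 4 of 6 predictions are correct
toy_precision = true_pos / np.sum(toy_pred == 1)    # 1 of 1 positive predictions is correct
toy_recall = true_pos / np.sum(toy_gold == 1)       # 1 of 3 positive examples is found
toy_f1 = 2 * toy_precision * toy_recall / (toy_precision + toy_recall)

print(f'Accuracy: {toy_accuracy:.2f}, Precision: {toy_precision:.2f}, '
      f'Recall: {toy_recall:.2f}, F-measure: {toy_f1:.2f}')
```

Here accuracy is about 0.67 because the three negative examples are all classified correctly, while the F-measure is only 0.50 because two of the three positive examples are missed.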
 

## Building the Convolutional Neural Network (CNN)

We will implement a model inspired by the state-of-the-art (at the time) CNN model described in [Convolutional Neural Networks for Sentence Classification (Kim, 2014)](https://arxiv.org/abs/1408.5882).

Similar to the FFNN model, we start with an embedding layer. We implement the convolutional layer with the help of `nn.Conv2d` and use the ReLU activation after it. The above-mentioned paper, being inspired by convolutions for images, applies a 2-dimensional convolution with a filter of shape (window size, embedding dimension). The filter covers `n` consecutive words (`n` being the window size) and spans the full embedding dimension as its width. We then pass the tensors through a **max pooling layer**.

The **max pooling layer** is typically followed by a **dropout** layer. The latter sets a random subset of the max-pooling layer's activations to zero. This discourages the network from relying on specific activations and helps to prevent overfitting. Note that dropout is only active during training, not at test time.
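
The difference between training and evaluation mode can be observed directly on a standalone dropout layer. This is a minimal sketch with an illustrative dropout probability of 0.5; `drop` and `activations` are hypothetical names.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

drop = nn.Dropout(p=0.5)
activations = torch.ones(1, 8)

drop.train()                      # training mode: dropout is active
train_out = drop(activations)     # entries are either 0 or scaled by 1 / (1 - p)

drop.eval()                       # evaluation mode: dropout does nothing
eval_out = drop(activations)

print(train_out)
print(eval_out)
```

In training mode the surviving activations are scaled up by `1 / (1 - p)`, so that their expected value stays the same; in evaluation mode the input is passed through unchanged.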

---

**Q: Study the shapes of outputs coming from convolution and max pooling layers. What is the shape of the max pooling layer output?**

*A:* `batch_size X n_filters`
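
The shapes can be traced on random data before looking at the full model. The sketch below uses illustrative sizes (`batch_size`, `max_sent_len`, `n_filters` and `window` are made up) and mirrors the operations of the `forward()` method that follows.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

batch_size, max_sent_len, emb_dim = 4, 10, 50   # illustrative sizes
n_filters, window = 100, 3

# Embedded batch with the extra channel dimension already added
embedded = torch.randn(batch_size, 1, max_sent_len, emb_dim)
conv = nn.Conv2d(in_channels=1, out_channels=n_filters,
                 kernel_size=(window, emb_dim))

feature_maps = conv(embedded)
print(feature_maps.shape)   # (batch_size, n_filters, max_sent_len - window + 1, 1)

# Drop the width dimension of size 1, apply ReLU, then max-pool over all positions
feature_maps = F.relu(feature_maps.squeeze(3))
pooled = F.max_pool1d(feature_maps, feature_maps.shape[2]).squeeze(2)
print(pooled.shape)         # (batch_size, n_filters)
```

Max pooling over the whole sentence keeps one value per filter, which is why the output no longer depends on the sentence length.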

In [None]:
class CNN(nn.Module):
  def __init__(self, vocab_size, embedding_dim, out_channels, window_size, output_dim, dropout):
    super(CNN, self).__init__()
    
    # Create the embedding layer as usual
    self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
    
    # in_channels -- 1 text channel
    # out_channels -- the number of output channels
    # kernel_size is (window size x embedding dim)
    self.conv = nn.Conv2d(
      in_channels=1, out_channels=out_channels,
      kernel_size=(window_size, embedding_dim))
    
    # the dropout layer
    self.dropout = nn.Dropout(dropout)

    # the output layer
    self.fc = nn.Linear(out_channels, output_dim)
        
  def forward(self, x):
    # x -> (batch size, max_sent_length)
    
    embedded = self.embedding(x)
    # embedded -> (batch size, max_sent_length, embedding_dim)
    
    # images have 3 RGB channels 
    # for the text we add 1 channel
    embedded = embedded.unsqueeze(1)
    # embedded -> (batch size, 1, max_sent_length, embedding dim)

    # Compute the feature maps      
    feature_maps = self.conv(embedded)

    ##########################################
    # Q: What is the shape of `feature_maps` ?
    ##########################################
    # A: (batch size, n filters, max_sent_length - window size + 1, 1)
    
    feature_maps = feature_maps.squeeze(3)
    
    # Q: Why do we remove 1 dimension here?
    # A: The kernel spans the whole embedding dimension, so the last (width)
    #    dimension of the output has size 1 and can be dropped
    
    # Apply ReLU
    feature_maps = F.relu(feature_maps)
    
    # Apply the max pooling layer
    pooled = F.max_pool1d(feature_maps, feature_maps.shape[2])
    
    pooled = pooled.squeeze(2)

    ####################################
    # Q: What is the shape of `pooled` ?
    ####################################
    # A: (batch size, n_filters)
    
    dropped = self.dropout(pooled)
    preds = self.fc(dropped)
    
    return preds

### Training and testing the CNN

Here we will define the CNN-specific hyperparameters and perform the network training and testing. **Note that** the learning rate is initially set to 0.1.

In [None]:
fix_seed()

EPOCHS = 10
LRATE = 0.1

EMBEDDING_DIM = 50
OUTPUT_DIM = 1

# the hyperparameters specific to CNN
# we define the number of filters
N_OUT_CHANNELS = 100

# we define the window size
WINDOW_SIZE = 1

# we apply the dropout with the probability 0.2
DROPOUT = 0.2

# Construct the model
model = CNN(len(word2idx), EMBEDDING_DIM, N_OUT_CHANNELS, WINDOW_SIZE, OUTPUT_DIM, DROPOUT)

optimizer = optim.SGD(model.parameters(), lr=LRATE)
loss_fn = nn.BCEWithLogitsLoss()

feature_train = train_sent_tensor
target_train = train_label_tensor

feature_valid = valid_sent_tensor
target_valid = valid_label_tensor

feature_test = test_sent_tensor
target_test = test_label_tensor

################
# Start training
################
print(f'Will train for {EPOCHS} epochs')
for epoch in range(1, EPOCHS + 1):
  model.train()
  
  optimizer.zero_grad()

  predictions = model(feature_train).squeeze(1)
  loss = loss_fn(predictions, target_train)
  train_loss = loss.item()
  train_acc = accuracy(predictions, target_train)

  loss.backward()

  optimizer.step()

  model.eval()
  with torch.no_grad():
    predictions_valid = model(feature_valid).squeeze(1)
    valid_loss = loss_fn(predictions_valid, target_valid).item()
    valid_acc = accuracy(predictions_valid, target_valid)
  
  print(f'| Epoch: {epoch:02} | Train Loss: {train_loss:.3f} | Train Acc: {train_acc*100:6.2f}% | Val. Loss: {valid_loss:.3f} | Val. Acc: {valid_acc*100:6.2f}% |')


## Finally, test on the test set
model.eval()

with torch.no_grad():
    predictions = model(feature_test).squeeze(1)
    loss = loss_fn(predictions, target_test)
    acc = accuracy(predictions, target_test)
    print(f'Test Loss: {loss:.3f} | Test Acc: {acc*100:.2f}%')
    f_measure(predictions, test_labels)

**Q: Is the performance of the CNN different from the performance of the FFNN? Output predictions.**

*A: CNNs typically perform better on larger datasets. Here, although our toy dataset is only for demonstration purposes, the CNN performs better than the FFNN as well.*

**Q: Is padding necessary for CNN inputs? What is the role of the window size?**

*A: For a single input, padding is only necessary when the sentence is shorter than the largest window size. When sentences of different lengths are batched into one tensor, they still have to be padded to a common length. The window size determines the receptive field (i.e. span) of the convolution operation.*
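
The case of a sentence that is shorter than the window can be reproduced on random data. The sketch below uses illustrative sizes and hypothetical names (`short`, `padded_short`); it shows the convolution failing on a 2-token input with a window of 3, and working once the input is zero-padded.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

emb_dim, window = 50, 3                       # illustrative sizes
conv = nn.Conv2d(1, 8, kernel_size=(window, emb_dim))

short = torch.randn(1, 1, 2, emb_dim)         # a 2-token "sentence"

try:
    conv(short)
    too_short_failed = False
except RuntimeError:
    # the kernel (3 rows) does not fit into the input (2 rows)
    too_short_failed = True

# Add zero rows at the end of the sentence dimension up to the window size.
# F.pad lists the last dimension first: (emb_left, emb_right, sent_start, sent_end)
padded_short = F.pad(short, (0, 0, 0, window - short.shape[2]))

print(too_short_failed, conv(padded_short).shape)
```

After padding, the filter fits exactly once, so each feature map contains a single position.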

### Initializing CNN with pre-trained representations

The work [Convolutional Neural Networks for Sentence Classification (Kim, 2014)](https://arxiv.org/abs/1408.5882) also investigates the use of pre-trained embeddings and demonstrates their effectiveness.

First, download the embeddings and unzip them below:

In [None]:
!wget http://nlp.stanford.edu/data/glove.6B.zip

In [None]:
# Unzip the file: 4 different embedding sizes are provided
!unzip glove.6B.zip

In [None]:
# Check the file format
!head -n10 glove.6B.50d.txt


Try to initialize the CNN embedding layer with the `50D` pre-trained GloVe embeddings. Pay particular attention to keeping the correct indices from `word2idx` for the lookup table! Once you have filled the `wvecs` matrix below, copy the previous training loop and initialize the model's embedding layer with the pre-trained vectors as follows:

```python
model.embedding.weight.data.copy_(torch.from_numpy(wvecs))
```
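
As an alternative to copying into an existing layer, PyTorch can also build the embedding layer directly from a weight matrix with `nn.Embedding.from_pretrained`. The sketch below is only an illustration of that API: `toy_wvecs` is a hypothetical 4 x 3 matrix, not the GloVe vectors.

```python
import numpy as np
import torch
import torch.nn as nn

# Hypothetical pre-trained matrix: 4 words x 3 dimensions, row 0 reserved for <pad>
toy_wvecs = np.array([[0.0, 0.0, 0.0],
                      [0.1, 0.2, 0.3],
                      [0.4, 0.5, 0.6],
                      [0.7, 0.8, 0.9]], dtype='float32')

# freeze=False keeps the embeddings trainable, as with the copy_ approach above
toy_embedding = nn.Embedding.from_pretrained(
    torch.from_numpy(toy_wvecs), freeze=False, padding_idx=0)

print(toy_embedding(torch.tensor([0, 2])))
```

With the default `freeze=True` the vectors would stay fixed during training, which is a different experimental setting from the one used in this lab.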

**Note:** The learning rate is initially set to 0.5.

---

**Q: What should the embedding for the padding token `<pad>` be?**
 
*A: It should be a constant. Recall that we chose a vector of zeros at the beginning of the session.*

**Q: What is the impact of using those pre-trained embeddings on the model performance?**

*A: The model typically performs better.*

In [None]:
from tqdm import tqdm

EMBEDDING_DIM = 50

# Yet another hyperparameter: since the pre-trained embeddings are coming
# from a different network, their magnitudes could differ from the parameters
# of this network. So scaling may be necessary.
SCALE_EMBS = 0.65

# Creates the empty numpy array that you should fill below
wvecs = np.zeros((len(word2idx), EMBEDDING_DIM), dtype='float32')

#####################################################################
# Q: Read line by line, find the corresponding word and
# insert its embedding to the correct position in the `wvecs` matrix.
# Once done, apply the SCALE_EMBS factor to scale the vectors
#####################################################################
with open(f'glove.6B.{EMBEDDING_DIM}d.txt', 'r', encoding='utf-8') as f:
  for line in tqdm(f):
    # each line is: word followed by its embedding values
    fields = line.strip().split()
    if len(fields) > 3:
      word = fields[0]
      if word in word2idx:
        # keep the row position given by word2idx
        wvecs[word2idx[word]] = list(map(float, fields[1:]))

wvecs = wvecs * SCALE_EMBS

print()          
print(wvecs)

#####################
# Re-create the model
#####################
fix_seed()

EPOCHS = 10
LRATE = 0.5

# the hyperparameters specific to CNN
OUTPUT_DIM = 1

# we define the number of filters
N_OUT_CHANNELS = 100

# we define the window size
WINDOW_SIZE = 1

# we apply the dropout with the probability 0.1
DROPOUT = 0.1

# Construct the model
model = CNN(len(word2idx), EMBEDDING_DIM, N_OUT_CHANNELS, WINDOW_SIZE, OUTPUT_DIM, DROPOUT)

#################################################################
### Q: Initialize the embeddings with the loaded pre-trained ones
#################################################################
model.embedding.weight.data.copy_(torch.from_numpy(wvecs))

optimizer = optim.SGD(model.parameters(), lr=LRATE)
loss_fn = nn.BCEWithLogitsLoss()

feature_train = train_sent_tensor
target_train = train_label_tensor

feature_valid = valid_sent_tensor
target_valid = valid_label_tensor

feature_test = test_sent_tensor
target_test = test_label_tensor

################
# Start training
################
print(f'Will train for {EPOCHS} epochs')
for epoch in range(1, EPOCHS + 1):
  model.train()
  
  optimizer.zero_grad()

  predictions = model(feature_train).squeeze(1)
  loss = loss_fn(predictions, target_train)
  train_loss = loss.item()
  train_acc = accuracy(predictions, target_train)

  loss.backward()

  optimizer.step()

  model.eval()
  with torch.no_grad():
    predictions_valid = model(feature_valid).squeeze(1)
    valid_loss = loss_fn(predictions_valid, target_valid).item()
    valid_acc = accuracy(predictions_valid, target_valid)
  
  print(f'| Epoch: {epoch:02} | Train Loss: {train_loss:.3f} | Train Acc: {train_acc*100:6.2f}% | Val. Loss: {valid_loss:.3f} | Val. Acc: {valid_acc*100:6.2f}% |')


## Finally, test on the test set
model.eval()

with torch.no_grad():
    predictions = model(feature_test).squeeze(1)
    loss = loss_fn(predictions, target_test)
    acc = accuracy(predictions, target_test)
    print(f'Test Loss: {loss:.3f} | Test Acc: {acc*100:.2f}%')
    f_measure(predictions, test_labels)

## Advanced: Experimenting with larger corpora

For advanced experiments with a larger dataset, we suggest using the [IMDB dataset](http://ai.stanford.edu/~amaas/data/sentiment/) of movie reviews available from [`torchtext.datasets`](https://torchtext.readthedocs.io/en/latest/data.html). This module also provides a range of useful functionalities for data preparation: defining a preprocessing pipeline, splitting, batching, padding, iterating through data, loading pre-trained embeddings, building vocabulary, etc. Below we provide an example using the tokenizer provided by the [spaCy](https://spacy.io) toolkit.

With the batch size provided, `BucketIterator` defines mini-batches by grouping sequences with similar original lengths, so that there is minimal need for padding. For this bigger dataset, use `.cuda()` on any input batches/tensors, network modules and loss functions to place computations on the GPU. When working on **Google Colab**, make sure that you have changed your runtime to GPU from the menu above.

You can start by applying the provided CNN model to this dataset.
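
Device placement can be sketched as follows. The example uses `.to(device)`, a variant of `.cuda()` that also runs on machines without a GPU; `toy_model`, `toy_loss_fn` and the random batch are hypothetical stand-ins for the CNN and the IMDB batches.

```python
import torch
import torch.nn as nn

# Fall back to the CPU when no GPU is available
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

toy_model = nn.Linear(4, 1).to(device)          # stand-in for the CNN
toy_loss_fn = nn.BCEWithLogitsLoss().to(device)

batch_x = torch.randn(8, 4).to(device)          # stand-in for an input batch
batch_y = torch.randint(0, 2, (8,)).float().to(device)

# Model, loss function and tensors all live on the same device
batch_loss = toy_loss_fn(toy_model(batch_x).squeeze(1), batch_y)
print(batch_loss.device)
```

All tensors involved in one computation have to be on the same device, otherwise PyTorch raises an error.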

In [None]:
# This part of the lab relies on torchtext 0.11.2.
# Google Colab currently uses 0.14, which can cause backwards incompatibility,
# so we install the required version explicitly.
! pip install torchtext==0.11.2
# Uncomment in case you need to install spacy. Google Colab should already include it
# ! pip install spacy

In [None]:
import torch
from torchtext.legacy import data, datasets
from torch.utils.data import DataLoader
import spacy

# Seed used for the data splits below; on the GPU we also make cuDNN deterministic
SEED = 9320

if torch.cuda.is_available():
  torch.backends.cudnn.deterministic = True
  DEVICE='cuda:0'
else:
  DEVICE='cpu'

print('Device is', DEVICE)

In [None]:
# NOTE: Execution of this cell takes a couple of minutes
##
spacy_en = spacy.load('en_core_web_sm')

def tokenizer(text): # create a custom tokenizer function
    return [tok.text for tok in spacy_en.tokenizer(text)]

# define types of data and their preprocessing
text_field = data.Field(sequential=True, tokenize=tokenizer, lower=True)
label_field = data.Field(sequential=False)

# get pre-defined split
train, test_init = datasets.IMDB.splits(text_field, label_field)

# define our own validation and test sets (the initial test set is too large)
# seed once, then pass the RNG state explicitly (random.seed() itself returns None)
random.seed(SEED)
train, valid_test = train.split(split_ratio=0.9, random_state=random.getstate())
valid, test = valid_test.split(split_ratio=0.5, random_state=random.getstate())

print(f'Train size: {len(train)}')
print(f'Validation size: {len(valid)}')
print(f'Test size: {len(test)}')

In [None]:
# build vocabulary with maximum size (less frequent words are not considered)
# load the pre-trained word embeddings.
EMBEDDING_DIM = 50

text_field.build_vocab(train, max_size=25000, vectors=f"glove.6B.{EMBEDDING_DIM}d")
label_field.build_vocab(train)

In [None]:
print(label_field.vocab.stoi)

In [None]:
# get iterators over the data
# place iterators on the GPU if possible

# define our batch size
BATCH_SIZE = 64

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
  (train, valid, test),
  batch_sizes=(BATCH_SIZE, BATCH_SIZE, BATCH_SIZE), device=DEVICE)

In [None]:
def eval_data(data_iter, model, loss_fn):
  model.eval()
  loss = 0
  acc = 0
  denom = 0

  with torch.no_grad():
    for batch in data_iter:
      # place on the GPU          
      feature = batch.text.to(DEVICE)
      # Pos and neg are indexed 1 and 2 respectively (index 0 is <unk>); map them to 0 and 1.
      target = ((batch.label.to(DEVICE) - 1) > 0).float()
      predictions = model(feature.t()).squeeze(1)
      
      _loss = loss_fn(predictions, target)
      loss += (_loss.item() * predictions.shape[0])
      acc += (accuracy(predictions, target) * predictions.shape[0])
      denom += predictions.shape[0]

  model.train()
  return loss / denom, acc / denom

In [None]:
def train_model(train_iter, dev_iter, model, loss_fn, n_epochs):
  for epoch in range(1, n_epochs + 1): 
    print(f'Starting epoch {epoch}')
    train_loss = 0
    train_loss_denom = 0
    train_acc = 0
    model.train()
    
    # iterate over batches
    for batch in train_iter:
        # place on the GPU          
        feature = batch.text.to(DEVICE)
        # Pos and neg are indexed 1 and 2 respectively (index 0 is <unk>); map them to 0 and 1.
        target = ((batch.label.to(DEVICE) - 1) > 0).float()
        
        optimizer.zero_grad()
        predictions = model(feature.t()).squeeze(1)
        loss = loss_fn(predictions, target)
        acc = accuracy(predictions, target)
        
        loss.backward()
        optimizer.step()
        train_loss += (loss.item() * predictions.shape[0])
        train_loss_denom += predictions.shape[0]
        train_acc += (acc * predictions.shape[0])
        
    valid_loss, valid_acc = eval_data(dev_iter, model, loss_fn)

    # Normalize everything
    train_loss /= train_loss_denom
    train_acc /= train_loss_denom

    print(f'| Epoch: {epoch:02} | Train Loss: {train_loss:.3f} | Train Acc: {train_acc*100:6.2f}% | Val. Loss: {valid_loss:.3f} | Val. Acc: {valid_acc*100:6.2f}%')


In [None]:
N_OUT_CHANNELS = 100
LRATE = 0.5
DROPOUT = 0.4
WINDOW_SIZE = 1

# Construct the model
model = CNN(len(text_field.vocab), EMBEDDING_DIM, N_OUT_CHANNELS, WINDOW_SIZE, OUTPUT_DIM, DROPOUT)
print(model)

model = model.to(DEVICE)
optimizer = optim.SGD(model.parameters(), lr=LRATE)
loss_fn = nn.BCEWithLogitsLoss()

# Start training
train_model(train_iterator, valid_iterator, model, loss_fn, n_epochs=10)

**Q: The paper [Convolutional Neural Networks for Sentence Classification (Kim, 2014)](https://arxiv.org/abs/1408.5882) applies 3 convolutional layers in parallel with window sizes [3, 4, 5]. Try to extend our CNN model with 2 more convolution layers and apply these window sizes. Outputs of the pooling layers are concatenated. What will be the effect on the model performance?**

**Hint:** you can use the `nn.ModuleList` container to hold the parallel convolution layers.
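As a starting point, here is one possible sketch of parallel convolutions held in an `nn.ModuleList`. The class name `MultiWindowCNN` and its layer layout are assumptions made for illustration; this is not the `CNN` class defined earlier in the notebook, so you will need to adapt it to that model (for example, to load the pre-trained embeddings).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiWindowCNN(nn.Module):
    """Hypothetical CNN with one convolution per window size, applied in parallel."""

    def __init__(self, vocab_size, embedding_dim, out_channels,
                 window_sizes, output_dim, dropout):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # nn.ModuleList registers every convolution so its parameters are trained
        self.convs = nn.ModuleList([
            nn.Conv2d(in_channels=1, out_channels=out_channels,
                      kernel_size=(window, embedding_dim))
            for window in window_sizes
        ])
        self.dropout = nn.Dropout(dropout)
        # The pooled outputs are concatenated, hence len(window_sizes) * out_channels
        self.fc = nn.Linear(len(window_sizes) * out_channels, output_dim)

    def forward(self, x):
        # x: (batch, seq_len) tensor of token indices
        embedded = self.embedding(x).unsqueeze(1)  # (batch, 1, seq_len, emb_dim)
        pooled = []
        for conv in self.convs:
            # (batch, out_channels, seq_len - window + 1)
            feature_maps = F.relu(conv(embedded)).squeeze(3)
            # Max-pool over time: (batch, out_channels)
            pooled.append(feature_maps.max(dim=2).values)
        concatenated = torch.cat(pooled, dim=1)
        return self.fc(self.dropout(concatenated))

# Quick shape check on random token indices
sketch_model = MultiWindowCNN(vocab_size=100, embedding_dim=50, out_channels=8,
                              window_sizes=[3, 4, 5], output_dim=1, dropout=0.4)
sketch_out = sketch_model(torch.randint(0, 100, (4, 12)))
print(sketch_out.shape)
```

Note that every input sequence must be at least as long as the largest window (5 here); shorter sequences need padding first.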

**Q: Pre-processing: experiment with filtering out stop words from input data. What will be the effect on the performance? You may choose to use spaCy to get a list of stop words. Here's an example:**


In [None]:
# Example
from spacy.lang.en.stop_words import STOP_WORDS as spacy_stop_words
print(spacy_stop_words)

def tokenizer(text): # create a custom tokenizer function (reuses spacy_en loaded above)
    return [tok.text for tok in spacy_en.tokenizer(text)]

# define types of data and their preprocessing
text_field = data.Field(sequential=True, tokenize=tokenizer, lower=True, stop_words=spacy_stop_words)
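
Before running this experiment, it is worth checking which words your stop list removes. Many general-purpose stop lists contain negations such as `not`, which carry sentiment. The sketch below uses a small, hand-written stop list (`TOY_STOP_WORDS`, hypothetical and not spaCy's list) to show how two sentences with opposite sentiment can collapse to the same input.

```python
# Hypothetical, hand-written stop list for illustration only (not spaCy's list)
TOY_STOP_WORDS = {'i', 'am', 'not', 'the', 'a', 'it', 'was', 'do', 'this'}

def remove_stop_words(tokens, stop_words):
    """Keep only the tokens that are not in the stop list."""
    return [tok for tok in tokens if tok not in stop_words]

positive = 'i am impressed'.split()
negative = 'i am not impressed'.split()

filtered_positive = remove_stop_words(positive, TOY_STOP_WORDS)
filtered_negative = remove_stop_words(negative, TOY_STOP_WORDS)

# Both sentences are reduced to the same token list
print(filtered_positive)
print(filtered_negative)
```

If the list you use contains such words, you may want to compare results with and without them in the stop list.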

**Q: Apply a Naive Bayes classifier to the problem. How would it perform for this task? You can use the `sklearn.naive_bayes.MultinomialNB` implementation from the popular `scikit-learn` toolkit. Extraction of the data for this purpose could be performed as follows:**

In [None]:
for example in train:
  if example.label == 'pos':
      label = 1
  else:
      label = 0