<a href="https://colab.research.google.com/github/AndreiS22/deep_learning_labs/blob/main/docs/labs/lab7/7_3_SequenceClassification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Part 3: Sequence Classification

__Before starting, we recommend you enable GPU acceleration if you're running on Colab.__

In [3]:
# Execute this code block to install dependencies when running on colab
!pip uninstall -y torch
!pip install torch==2.3.0
!pip install torchdata==0.8.0
!pip install portalocker==2.8.2

try:
    import torchtext
except:
    !pip install torchtext


try:
    import torchbearer
except:
    !pip install torchbearer

try:
    import spacy
except:
    !pip install spacy

try:
    spacy.load('en_core_web_sm')
except:
    !python -m spacy download en_core_web_sm



## Sequence Classification
The problem that we will use to demonstrate sequence classification in this lab is the IMDB movie review sentiment classification problem. Each movie review is a variable-length sequence of words, and the sentiment of each movie review must be classified.

The Large Movie Review Dataset (often referred to as the IMDB dataset) contains 25,000 highly polar movie reviews (good or bad) for training and the same number again for testing. The problem is to determine whether a given movie review has a positive or negative sentiment. The data was collected by Stanford researchers and was used in a 2011 paper in which a 50-50 split of the data was used for training and testing. An accuracy of 88.89% was achieved.

We'll be using a **recurrent neural network** (RNN), as they are commonly used for analysing sequences. An RNN takes in a sequence of words, $X=\{x_1, ..., x_T\}$, one at a time, and produces a _hidden state_, $h$, for each word. We use the RNN _recurrently_ by feeding in the current word $x_t$ as well as the hidden state from the previous word, $h_{t-1}$, to produce the next hidden state, $h_t$.

$$h_t = \text{RNN}(x_t, h_{t-1})$$

Once we have our final hidden state, $h_T$, (from feeding in the last word in the sequence, $x_T$) we feed it through a linear layer, $f$, (also known as a fully connected layer), to receive our predicted sentiment, $\hat{y} = f(h_T)$.

The figure below shows an example sentence, with the RNN predicting zero, which indicates a negative sentiment. The RNN is shown in orange and the linear layer in silver. Note that we use the same RNN for every word, i.e. it has the same parameters at every time step. The initial hidden state, $h_0$, is a tensor initialized to all zeros.

![](https://ecs-vlc.github.io/COMP6258/labs/lab7/assets/sentiment1.png)

**Note:** some layers and steps have been omitted from the diagram, but these will be explained later.
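The recurrence described above can be sketched in a few lines. This is a minimal illustration rather than the model used later in the lab: the dimensions and the random stand-in word vectors are assumptions chosen only to keep the example small, and `nn.RNNCell` is used so that each time step is visible.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes, chosen only for illustration
embedding_dim, hidden_dim, seq_len = 4, 3, 5

cell = nn.RNNCell(embedding_dim, hidden_dim)  # the same parameters are reused at every step
f = nn.Linear(hidden_dim, 1)                  # linear layer applied to the final hidden state

x = torch.randn(seq_len, 1, embedding_dim)    # stand-in word vectors, batch size 1
h = torch.zeros(1, hidden_dim)                # h_0 is initialized to all zeros

for t in range(seq_len):
    h = cell(x[t], h)                         # h_t = RNN(x_t, h_{t-1})

y_hat = f(h)                                  # y_hat = f(h_T)
print(y_hat.shape)                            # torch.Size([1, 1])
```

The loop makes explicit that only the last hidden state reaches the linear layer; `nn.RNN`, used later, performs the same loop internally.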


The TorchText library provides easy access to the IMDB dataset. The `IMDB` class allows you to load the dataset in a format that is ready for use in neural network and deep learning models, and TorchText's utility methods allow us to easily create batches of data that are `padded` to the same length (we need to pad shorter sentences in the batch to the length of the longest sentence).

With `torchtext` we can utilise the built in tools to perform tokenisation,
build vocabularies and turn the text into tensors.

In [4]:
import torch
from torchtext.data.utils import get_tokenizer

tokenizer = get_tokenizer('basic_english')

The following code automatically downloads the IMDb dataset and splits it
into the canonical train/test splits:

In [5]:
from torchtext.datasets import IMDB
from collections import Counter

train_iter, test_iter = IMDB(split=('train', 'test'))

################################################################################
The 'datapipes', 'dataloader2' modules are deprecated and will be removed in a
future torchdata release! Please see https://github.com/pytorch/data/issues/1196
to learn more and leave feedback.
################################################################################



We can also check an example from the train set:

In [6]:
next(iter(train_iter))

(1,
 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far betwee

The IMDb dataset only has train/test splits, so we need to create a validation set. We can do this with the `.random_split()` method.

We choose to make a 70/30 split, but this can be controlled.

In [7]:
train_iter, valid_iter = train_iter.random_split(total_length=len(list(train_iter)), weights={"train": 0.7, "valid": 0.3}, seed=0)



Next, we'll check how many examples are in each split.

In [8]:
print(f'Number of training examples: {len(list(train_iter))}')
print(f'Number of validation examples: {len(list(valid_iter))}')
print(f'Number of testing examples: {len(list(test_iter))}')

Number of training examples: 8750
Number of validation examples: 3750
Number of testing examples: 25000


Next, we have to build a _vocabulary_. This is effectively a look up table where every unique word in your data set has a corresponding _index_ (an integer).

We do this as our machine learning model cannot operate on strings, only numbers. Each _index_ is used to construct a _one-hot_ vector for each word. A one-hot vector is a vector where all of the elements are 0, except one, which is 1, and dimensionality is the total number of unique words in your vocabulary, commonly denoted by $V$.

![](https://ecs-vlc.github.io/COMP6258/labs/lab7/assets/sentiment5.png)
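As a small illustration of the one-hot encoding, the sketch below builds the vector for a single word. The five-word `toy_vocab` is a made-up stand-in for the real vocabulary, so here $V = 5$.

```python
import torch
import torch.nn.functional as F

# Hypothetical toy vocabulary; the real one is built from the training set
toy_vocab = {'<unk>': 0, 'this': 1, 'film': 2, 'is': 3, 'great': 4}
V = len(toy_vocab)

index = toy_vocab['film']                               # look up the word's index
one_hot = F.one_hot(torch.tensor(index), num_classes=V)

print(one_hot.tolist())  # [0, 0, 1, 0, 0]
```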

The number of unique words in our training set is over 100,000, which means that our one-hot vectors will have over 100,000 dimensions! This will make training slow and possibly won't fit onto your GPU (if you're using one).

There are two ways to effectively cut down our vocabulary: we can either take only the top $n$ most common words, or ignore words that appear fewer than $m$ times. We'll do the latter, ignoring words that appear fewer than 5 times.

What do we do with words that appear in examples but we have cut from the vocabulary? We replace them with a special _unknown_ or `<unk>` token. For example, if the sentence was "This film is great and I love it" but the word "love" was not in the vocabulary, it would become "This film is great and I `<unk>` it".
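The effect of cutting the vocabulary can be sketched in plain Python. The three-sentence corpus, the `min_freq` value, and the helper names `numericalize` and `decode` are all hypothetical and are not part of TorchText; they only illustrate how out-of-vocabulary words collapse onto `<unk>`.

```python
from collections import Counter

# Hypothetical toy corpus standing in for the training reviews
corpus = ["this film is great", "this film is bad", "i love this film"]
counts = Counter(tok for line in corpus for tok in line.split())

min_freq = 2  # keep only tokens seen at least this many times
itos = ['<unk>'] + [tok for tok, c in counts.items() if c >= min_freq]
stoi = {tok: i for i, tok in enumerate(itos)}

def numericalize(sentence):
    # Unknown tokens fall back to the <unk> index
    return [stoi.get(tok, stoi['<unk>']) for tok in sentence.split()]

def decode(indices):
    return ' '.join(itos[i] for i in indices)

encoded = numericalize("this film is great and i love it")
print(encoded)          # [1, 2, 3, 0, 0, 0, 0, 0]
print(decode(encoded))  # this film is <unk> <unk> <unk> <unk> <unk>
```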

The following builds the vocabulary, only keeping the most common tokens
(ones that appear at least 5 times).

In [9]:
from torchtext.vocab import vocab as Vocab

counter = Counter()
for (label, line) in train_iter:
    counter.update(tokenizer(line))
vocab = Vocab(counter, min_freq=5, specials=('<unk>', '<BOS>', '<EOS>', '<PAD>'))
vocab.set_default_index(0) # set the default token to <unk>



Why do we only build the vocabulary on the training set? When testing any machine learning system you do not want to look at the test set in any way. We do not include the validation set as we want it to reflect the test set as much as possible.

In [10]:
print(f"Unique tokens in vocabulary: {len(vocab)}")

Unique tokens in vocabulary: 16470


We can also see the vocabulary directly using either the `get_stoi` (**s**tring
**to** **i**nt) or `get_itos` (**i**nt **to**  **s**tring) methods.

In [11]:
print(vocab.get_itos()[:10])

['<unk>', '<BOS>', '<EOS>', '<PAD>', 'if', 'only', 'to', 'avoid', 'making', 'this']


The final step of preparing the data is creating the iterators. We iterate
over these in the training/evaluation loop, and they return a batch of
examples (indexed and converted into tensors) at each iteration. Note that we
 define transformations which convert the text and labels into tensors.

When we feed sentences into our model, we feed a _batch_ of them at a time,
i.e. more than one at a time, and all sentences in the batch need to be the
same size. Thus, to ensure each sentence in the batch is the same size, any
sentences which are shorter than the longest within the batch are padded.
This is done by the `collate_batch` function. `collate_batch` also returns
the sequence lengths as part of the data.

![](https://ecs-vlc.github.io/COMP6258/labs/lab7/assets/sentiment6.png)
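To see what padding does to a batch, here is a minimal sketch using `pad_sequence` on three made-up index sequences. The index values and `PAD_IDX = 3` are assumptions that mirror the position of `<PAD>` in the specials list above.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_IDX = 3  # assumed index of <PAD>, matching the order of the specials

# Three hypothetical numericalized sentences of different lengths
seqs = [torch.tensor([1, 10, 11, 12, 2]),
        torch.tensor([1, 20, 2]),
        torch.tensor([1, 30, 31, 2])]
seq_lengths = [len(s) for s in seqs]

padded = pad_sequence(seqs, padding_value=PAD_IDX)

print(padded.shape)           # torch.Size([5, 3])
print(padded[:, 1].tolist())  # [1, 20, 2, 3, 3]
print(seq_lengths)            # [5, 3, 4]
```

Note that the result is laid out as [max sentence length, batch size], so each column is one sentence. The true lengths are kept alongside so that the padding can be ignored later.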

In [12]:
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence

text_transform = lambda x: [vocab['<BOS>']] + [vocab[token] for token in tokenizer(x)] + [vocab['<EOS>']]
label_transform = lambda x: x - 1


def collate_batch(batch):
    label_list, text_list, len_list = [], [], []
    for (_label, _text) in batch:
        label_list.append(label_transform(_label))
        processed_text = torch.tensor(text_transform(_text))
        text_list.append(processed_text)
        len_list.append(len(processed_text))
    # pad every sequence in the batch to the length of the longest one using the <PAD> index
    padded_text = pad_sequence(text_list, padding_value=vocab['<PAD>'])
    return (padded_text, len_list), torch.tensor(label_list).unsqueeze(1).float()

train_dataloader = DataLoader(list(train_iter), batch_size=8, shuffle=True,
                              collate_fn=collate_batch)
valid_dataloader = DataLoader(list(valid_iter), batch_size=8, shuffle=False,
                              collate_fn=collate_batch)
test_dataloader = DataLoader(list(test_iter), batch_size=8, shuffle=False,
                             collate_fn=collate_batch)

## Build the Model

The next stage is building the model that we'll eventually train and evaluate.

There is a small amount of boilerplate code when creating models in PyTorch. Note how our `RNN` class is a sub-class of `nn.Module`, and note the use of `super`.

Within the `__init__` we define the _layers_ of the module. Our three layers are an _embedding_ layer, our RNN, and a _linear_ layer. All layers have their parameters initialized to random values, unless explicitly specified.

The embedding layer is used to transform our sparse one-hot vector (sparse as most of the elements are 0) into a dense embedding vector (dense as the dimensionality is a lot smaller and all the elements are real numbers). This embedding layer is simply a single fully connected layer. As well as reducing the dimensionality of the input to the RNN, there is the theory that words which have similar impact on the sentiment of the review are mapped close together in this dense vector space. For more information about word embeddings, see [here](https://monkeylearn.com/blog/word-embeddings-transform-text-numbers/).
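The statement that the embedding layer acts like a fully connected layer on one-hot vectors can be checked directly. The sketch below uses assumed toy sizes (`V = 6`, `D = 3`) and compares the lookup performed by `nn.Embedding` with an explicit matrix product between one-hot vectors and the embedding weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

V, D = 6, 3                                   # toy vocabulary size and embedding dimension
embedding = nn.Embedding(V, D)

indices = torch.tensor([4, 1, 4])             # three word indexes
looked_up = embedding(indices)                # direct table lookup, shape [3, D]

one_hot = F.one_hot(indices, num_classes=V).float()  # shape [3, V]
projected = one_hot @ embedding.weight               # linear layer without a bias, shape [3, D]

print(torch.allclose(looked_up, projected))   # True
```

In practice the lookup is preferred because it never materialises the $V$-dimensional one-hot vectors.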

The RNN layer is our RNN which takes in our dense vector and the previous hidden state $h_{t-1}$, which it uses to calculate the next hidden state, $h_t$.

![](https://ecs-vlc.github.io/COMP6258/labs/lab7/assets/sentiment7.png)

Finally, the linear layer takes the final hidden state and feeds it through a fully connected layer, $f(h_T)$, transforming it to the correct output dimension.

The `forward` method is called when we feed examples into our model.

Each input batch is a tuple `(text, lengths)`: `text` is a tensor of size _**[max sentence length, batch size]**_ and `lengths` holds **batch size** values giving the true length of each sentence (remember, they won't necessarily be the same; some reviews are much longer than others).

The first tensor in the tuple contains the ordered word indexes for each review in the batch. The act of converting a list of tokens into a list of indexes is commonly called *numericalizing*.

The input batch is then passed through the embedding layer to get `embedded`, which gives us a dense vector representation of our sentences. `embedded` is a tensor of size _**[sentence length, batch size, embedding dim]**_.

`embedded` is then fed into a function called `pack_padded_sequence` before being fed into the RNN. `pack_padded_sequence` creates a data structure that allows the RNN to 'mask' off the padding during backpropagation through time (BPTT); we don't want to learn from the padding, as this could drastically influence the results! In some frameworks you must feed the initial hidden state, $h_0$, into the RNN. In PyTorch, however, if no initial hidden state is passed as an argument it defaults to a tensor of all zeros.

The RNN returns two values: `output` and `hidden`. `output` holds the hidden state from every time step; because the input was packed it is returned as a `PackedSequence`, which would have size _**[sentence length, batch size, hidden dim]**_ once unpacked. `hidden`, of size _**[1, batch size, hidden dim]**_, is simply the final hidden state of each sequence.

Finally, we feed the last hidden state, `hidden`, through the linear layer, `fc`, to produce a prediction. Note the `squeeze` method, which is used to remove a dimension of size 1.
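To make the role of `pack_padded_sequence` more concrete, here is a minimal sketch under assumed toy dimensions. It packs a batch of two random "embedded" sequences, one of which is treated as padded, and checks that the final hidden state of the shorter sequence matches what the RNN produces when given only its unpadded steps.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)

emb_dim, hid_dim = 4, 5                       # toy sizes for illustration
rnn = nn.RNN(emb_dim, hid_dim)

embedded = torch.randn(6, 2, emb_dim)         # [max sentence length, batch size, embedding dim]
seq_lengths = [6, 3]                          # the second sequence has 3 real steps, 3 padded

packed = pack_padded_sequence(embedded, seq_lengths, enforce_sorted=False)
packed_output, hidden = rnn(packed)
print(hidden.shape)                           # torch.Size([1, 2, 5])

# Run the short sequence on its own, without any padding
_, hidden_short = rnn(embedded[:3, 1:2])

# The packed run ignored the padded steps, so the two final states agree
same = torch.allclose(hidden[0, 1], hidden_short[0, 0], atol=1e-5)
print(same)
```

Without packing, the RNN would keep updating the hidden state on the padded steps, and `hidden` would no longer correspond to the last real word of the shorter review.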

In [13]:
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim):
        super().__init__()

        self.embedding = nn.Embedding(input_dim, embedding_dim)
        self.rnn = nn.RNN(embedding_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, text, lengths):
        embedded = self.embedding(text)
        embedded = nn.utils.rnn.pack_padded_sequence(embedded, lengths, enforce_sorted=False)
        packed_output, hidden = self.rnn(embedded)

        return self.fc(hidden.squeeze(0))

We now create an instance of our RNN class.

The input dimension is the dimension of the one-hot vectors, which is equal to the vocabulary size.

The embedding dimension is the size of the dense word vectors. This is usually around 50-250 dimensions, but depends on the size of the vocabulary.

The hidden dimension is the size of the hidden states. This is usually around 100-500 dimensions, but also depends on factors such as on the vocabulary size, the size of the dense vectors and the complexity of the task.

The output dimension is usually the number of classes. With only 2 classes, however, a single scalar is enough: the model outputs one real number per review, which is later squashed to a value between 0 and 1.

In [14]:
INPUT_DIM = len(vocab)
EMBEDDING_DIM = 50
HIDDEN_DIM = 100
OUTPUT_DIM = 1

model = RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM)

## Train the Model

Now we'll set up the training and then train the model.

First, we'll create an optimizer. This is the algorithm we use to update the parameters of the module. Here, we'll use _stochastic gradient descent_ (SGD). The first argument is the parameters that will be updated by the optimizer, the second is the learning rate, i.e. how much we'll change the parameters by when we do a parameter update.
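As a small worked example of what a single SGD update does, the sketch below applies one step to a toy two-element parameter. The parameter values, the quadratic loss, and the learning rate of 0.1 are assumptions picked so the arithmetic is easy to follow; the names `w` and `toy_optimizer` are illustrative only.

```python
import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)  # toy parameter
toy_optimizer = torch.optim.SGD([w], lr=0.1)

loss = (w ** 2).sum()     # simple loss with gradient 2 * w
loss.backward()           # w.grad is now [2.0, -4.0]
toy_optimizer.step()      # w <- w - lr * grad

print(w.detach().tolist())
```

After the step, each element has moved against its gradient by `lr * grad`, giving approximately `[0.8, -1.6]`.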

In [15]:
import torch.optim as optim

optimizer = optim.SGD(model.parameters(), lr=0.001)

Next, we'll define our loss function. In PyTorch this is commonly called a criterion.

The loss function here is _binary cross entropy with logits_.

Our model currently outputs an unbounded real number. As our labels are either 0 or 1, we want to restrict the predictions to a number between 0 and 1. We do this using the _sigmoid_ function.

We then use this bounded scalar to calculate the loss using binary cross entropy.

The `BCEWithLogitsLoss` criterion carries out both the sigmoid and the binary cross entropy steps.
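The equivalence between the fused criterion and the two separate steps can be checked numerically. The logits and targets below are made-up values for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical raw model outputs (logits) and 0/1 labels
logits = torch.tensor([[-5.9], [0.0], [2.3]])
targets = torch.tensor([[0.0], [1.0], [1.0]])

fused = nn.BCEWithLogitsLoss()(logits, targets)

# The same computation done in two explicit steps
probs = torch.sigmoid(logits)
two_step = nn.BCELoss()(probs, targets)

print(torch.allclose(fused, two_step, atol=1e-6))  # True
```

The fused version is generally preferred because it is more numerically stable for large positive or negative logits.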

In [16]:
criterion = nn.BCEWithLogitsLoss()

Finally, we can use a Torchbearer trial to train the model:

In [17]:
from torchbearer import Trial

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

torchbearer_trial = Trial(model, optimizer, criterion, metrics=['acc', 'loss']).to(device)
torchbearer_trial.with_generators(train_generator=train_dataloader,
                                  val_generator=valid_dataloader,
                                  test_generator=test_dataloader)
torchbearer_trial.run(epochs=5)
torchbearer_trial.predict()

tensor([[-5.9715],
        [-5.9107],
        [-4.0504],
        ...,
        [-5.8562],
        [-5.8191],
        [-5.9642]], device='cuda:0')

__Use the box below to comment on and give insight into the performance of the above model:__

Test accuracy of 50%, which is no better than chance for a balanced two-class problem.

Now try and build a better model. Rather than using a plain RNN, we'll instead use a (single layer) LSTM, and we'll use Adam with an initial learning rate of 0.01 as the optimiser. __Complete the following code to implement the improved model, and then train it:__

In [25]:
class ImprovedRNN(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim):
        super().__init__()

        self.embedding = nn.Embedding(input_dim, embedding_dim)
        # YOUR CODE HERE
        # raise NotImplementedError()
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, text, lengths):
        embedded = self.embedding(text)
        embedded = nn.utils.rnn.pack_padded_sequence(embedded, lengths, enforce_sorted=False)

        # YOUR CODE HERE
        # raise NotImplementedError()
        # nn.LSTM returns (output, (h_n, c_n)); unpack the final hidden state h_n
        lstm_out, (hidden, cell) = self.lstm(embedded)
        out = self.fc(hidden.squeeze(0))
        return out

INPUT_DIM = len(vocab)
EMBEDDING_DIM = 50
HIDDEN_DIM = 100
OUTPUT_DIM = 1

imodel = ImprovedRNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM)

# TODO: Train and evaluate the model
# YOUR CODE HERE
# raise NotImplementedError()
optimizer = optim.Adam(imodel.parameters(), lr=0.01)
torchbearer_trial = Trial(imodel, optimizer, criterion, metrics=['acc', 'loss']).to(device)
torchbearer_trial.with_generators(train_generator=train_dataloader,
                                  val_generator=valid_dataloader,
                                  test_generator=test_dataloader)
torchbearer_trial.run(epochs=5)
torchbearer_trial.predict()

__What do you observe about the performance of this model? What would you do next if you wanted to improve it further? Write your answers in the box below:__

In [None]:
# YOUR CODE HERE
# raise NotImplementedError()

## User Input

We can now use our model to predict the sentiment of any sentence we give it. As it has been trained on movie reviews, the sentences provided should also be movie reviews.

Our `predict_sentiment` function does a few things:
- tokenizes the sentence, i.e. splits it from a raw string into a list of tokens
- indexes the tokens by converting them into their integer representation from our vocabulary
- converts the indexes, which are a Python list, into a PyTorch tensor
- adds a batch dimension by `unsqueeze`ing
- squashes the output prediction from an unbounded real number to a value between 0 and 1 with the `sigmoid` function
- converts the tensor holding a single value into a Python float with the `item()` method

We are expecting reviews with a negative sentiment to return a value close to 0 and positive reviews to return a value close to 1.

In [None]:
def predict_sentiment(model, sentence):
    model.eval()
    tokenized = tokenizer(sentence)
    # vocab[token] falls back to the <unk> index for out-of-vocabulary tokens
    indexed = [vocab[t] for t in tokenized]
    tensor = torch.LongTensor(indexed).to(device)
    tensor = tensor.unsqueeze(1)  # [sentence length, batch size = 1]
    with torch.no_grad():
        prediction = torch.sigmoid(model(tensor, torch.tensor([tensor.shape[0]])))
    return prediction.item()

An example negative review...

In [None]:
predict_sentiment(imodel, "This film is terrible")

and an example positive review...

In [None]:
predict_sentiment(imodel, "This film is great")

__Use the box below to try classifying some of your own 'movie reviews':__