In [1]:
%%html
<style type='text/css'>
.CodeMirror{
font-family: JetBrains Mono;
}
</style>

In [2]:
# Suggested imports. Do not import any modules that are not in the requirements.txt file on the VLE.

import sklearn.metrics
import sklearn.model_selection
import matplotlib.pyplot as plt
import random
import collections
import torch
import pandas as pd
import numpy as np
%matplotlib inline


device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
# device = 'cpu'

# Movie titles assignment

Table of contents:

* [Data filtering and splitting (10%)](#Data-filtering-and-splitting-(10%))
* [Title classification (25%)](#Title-classification-(25%))
* [Title generation (25%)](#Title-generation-(25%))
* [Language models as classifiers (30%)](#Language-models-as-classifiers-(30%))
* [Conclusion (10%)](#Conclusion-(10%))

Information:

This assignment is 100% of your assessment.
You are to follow the instructions below and fill each cell as instructed.
Once ready, submit this notebook on VLE with all the outputs included (run all your code and don't clear any output cells).
Do not submit anything else apart from the notebook and do not use any extra data apart from what is requested.

## Introduction

A big-shot Hollywood producer is looking for a way to automatically generate titles for future movies, and you have been employed to do this (in exchange for millions of dollars!).
A data set of movie details has already been collected from IMDb for you and your task is to create the model and the algorithms necessary to use it.

## Data filtering and splitting (10%)

Start by downloading the CSV file `filmtv_movies - ENG.csv` from [this kaggle data set](https://www.kaggle.com/datasets/stefanoleone992/filmtv-movies-dataset).

The CSV file needs to be filtered as the producer is only interested in certain types of movie titles.
Load the file and filter it so that only movies with the following criteria are kept:

* The country needs to be `United States` (and no other country should be mentioned).
* The genre should be one of `Action`, `Horror`, `Fantasy`, `Western`, or `Adventure`.
* The title should not have more than 20 characters.

In [3]:
df = pd.read_csv('data.csv')  #Load full csv

In [4]:
df = pd.read_csv('data.csv')  # Load the full CSV
df = df[df['country'] == 'United States']  # Exactly 'United States', no co-productions
df = df[df['genre'].isin(['Action', 'Horror', 'Fantasy', 'Western', 'Adventure'])]  # Keep the five genres
df = df[df['title'].str.len() <= 20]  # Titles of at most 20 characters
df['title'] = df['title'].str.lower()  # Lowercase all titles
df = df[['title', 'genre']]  # Only the title and genre columns are needed


df = df.sample(frac=1) #Shuffle dataset
df.to_csv('filtered_data.csv', index=False)
df

Unnamed: 0,title,genre
15575,empire strike back,Adventure
8524,fluke,Adventure
35540,ford v. ferrari,Action
30257,see no evil 2,Horror
11026,armageddon,Fantasy
...,...,...
12035,grizzly falls,Adventure
3653,blue heat,Action
14339,death race 2000,Fantasy
127,the hanging tree,Western


Split the filtered data into 80% train, 10% validation, and 10% test.
You will only need the title and genre columns.

In [5]:
#df = pd.read_csv('filtered_data.csv')

#Train = 80%, Other = 20%
train_x, other_x, train_y, other_y = sklearn.model_selection.train_test_split(df['title'],df['genre'],
                                                             test_size=0.2, random_state=1)


#Split other in half -> [Train = 80%, Val = 10%, Test = 10%]
val_x, test_x, val_y, test_y = sklearn.model_selection.train_test_split(other_x, other_y,
                                                       test_size=0.5, random_state=1)



From your processed data set, display:

* the number of movies in each genre and split
* 5 examples of movie titles from each genre and split

In [6]:
print('Amount of Movies in Training Set:')
print(train_y.value_counts())

print('\nAmount of Movies in Validation Set:')
print(val_y.value_counts())

print('\nAmount of Movies in Testing Set:')
print(test_y.value_counts())

Amount of Movies in Training Set:
Action       708
Horror       666
Fantasy      434
Western      424
Adventure    367
Name: genre, dtype: int64

Amount of Movies in Validation Set:
Action       89
Horror       78
Fantasy      58
Adventure    53
Western      47
Name: genre, dtype: int64

Amount of Movies in Testing Set:
Action       91
Horror       74
Western      66
Fantasy      50
Adventure    44
Name: genre, dtype: int64
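
The cell above covers the counts but not the second bullet (5 example titles per genre and split). A minimal sketch of one way to display them follows; `show_examples` is a hypothetical helper, and the demo rows are copied from the table shown earlier rather than from the real splits.

```python
import pandas as pd

def show_examples(titles, genres, split_name, n=5):
    """Print up to n example titles for every genre in one split."""
    frame = pd.DataFrame({'title': list(titles), 'genre': list(genres)})
    examples = {}
    for genre, group in frame.groupby('genre'):
        examples[genre] = group['title'].head(n).tolist()
        print(f'{split_name} / {genre}: {examples[genre]}')
    return examples

# Toy rows taken from the table displayed earlier (not the real splits).
demo_titles = pd.Series(['fluke', 'grizzly falls', 'blue heat', 'see no evil 2'])
demo_genres = pd.Series(['Adventure', 'Adventure', 'Action', 'Horror'])
demo = show_examples(demo_titles, demo_genres, 'demo')
```

In the notebook this would presumably be called once per split, for example `show_examples(train_x, train_y, 'train')`, before the titles are tokenised into characters.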


## Title classification (25%)

Your first task is to prove that a neural network can identify the genre of a movie based on its title.

Many titles are only one or two words long, so you need to work at the character level instead of the word level: each token is a single character, including punctuation marks and spaces.
You must also lowercase the titles.
Preprocess the data sets, create a neural network, and train it to classify the movie titles into their genre.
Plot a graph of the **accuracy** of the model on the train and validation sets after each epoch.

In [62]:
# Tokenise each character.
train_x = train_x.apply(lambda s: [*s])

# Get the lengths of each title.
text_lens = torch.tensor([len(sent) for sent in train_x],
                         dtype=torch.int64, device=device)

# Get the maximum length of a title (as a plain Python int).
max_len = int(text_lens.max())

# Create the vocabulary.
vocab = ['<PAD>'] + sorted({token for sent in train_x for token in sent})

# Pad the titles to max_len characters using <PAD> tokens.
padded_train_x = [sent + ['<PAD>']*(max_len - len(sent)) for sent in train_x]

# Replace each character with its index in the vocabulary.
indexed_train_x = torch.tensor([[vocab.index(token) for token in title]
                               for title in padded_train_x],
                               dtype=torch.int64, device=device)

# Map each genre name to a class index.
categories = ['Action', 'Horror', 'Fantasy', 'Western', 'Adventure']
cat2idx = {cat: i for (i, cat) in enumerate(categories)}

indexed_train_y = torch.tensor(train_y.map(cat2idx.get).to_numpy()[:, None],
                           dtype=torch.int64, device=device)

# Repeat each title's genre at every character position: shape (num_titles, max_len).
seq_train_y = indexed_train_y.tile(1, max_len)

# Number of classes to classify.
num_classes = len(categories)
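
The preprocessing above is fitted on the training titles only, but the validation and test titles need the same encoding, and they may contain characters that never occur in training. The sketch below shows one way to reuse a vocabulary for another split. The helper `encode_titles` and the `<UNK>` token are assumptions: the notebook's `vocab` has no `<UNK>` entry, so it would have to be added when the vocabulary is built.

```python
import torch

def encode_titles(titles, vocab, max_len, pad='<PAD>', unk='<UNK>'):
    """Lowercase, split into characters, pad/truncate, and index a list of titles."""
    token2idx = {token: i for i, token in enumerate(vocab)}  # O(1) lookups
    indexed, lengths = [], []
    for title in titles:
        chars = list(title.lower())[:max_len]
        lengths.append(len(chars))
        chars = chars + [pad] * (max_len - len(chars))
        # Characters missing from the vocabulary fall back to the <UNK> index.
        indexed.append([token2idx.get(c, token2idx[unk]) for c in chars])
    return (torch.tensor(indexed, dtype=torch.int64),
            torch.tensor(lengths, dtype=torch.int64))

# Hypothetical vocabulary for illustration only.
demo_vocab = ['<PAD>', '<UNK>', ' ', 'a', 'b', 'e', 'l', 't', 'u']
x, lens = encode_titles(['Blue', 'at!'], demo_vocab, max_len=5)
```

A dictionary lookup also avoids the repeated linear scans of `vocab.index`, which matters little at this data size but scales better.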

In [92]:
seq_train_y

tensor([[1, 1, 1,  ..., 1, 1, 1],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        ...,
        [0, 0, 0,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 1, 1, 1],
        [3, 3, 3,  ..., 3, 3, 3]], device='cuda:0')

In [66]:
class Model(torch.nn.Module):

    def __init__(self, vocab_size, embedding_size, hidden_size, num_classes):
        super().__init__()
        self.hidden_size = hidden_size

        self.embedding_matrix = torch.nn.Embedding(
            vocab_size, embedding_size, device=device)

        # Forward State
        self.rnn_fw_s0 = torch.nn.Parameter(torch.zeros(
            (hidden_size,), dtype=torch.float32, device=device))
        self.rnn_fw_c0 = torch.nn.Parameter(torch.zeros(
            (hidden_size,), dtype=torch.float32, device=device))
        self.rnn_fw_cell = torch.nn.LSTMCell(
            embedding_size, hidden_size, device=device)

        # Backward State
        self.rnn_bw_s0 = torch.nn.Parameter(torch.zeros(
            (hidden_size,), dtype=torch.float32, device=device))
        self.rnn_bw_c0 = torch.nn.Parameter(torch.zeros(
            (hidden_size,), dtype=torch.float32, device=device))
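        # The backward cell also reads the embeddings, so its input size is embedding_size
        self.rnn_bw_cell = torch.nn.LSTMCell(
            embedding_size, hidden_size, device=device)

        # Input to this layer will be the concatenated fw and bw states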
        self.output_layer = torch.nn.Linear(
            2*hidden_size, num_classes, device=device)

    def forward(self, x, text_lens):
        batch_size = x.shape[0]  # Number of titles
        time_steps = x.shape[1]  # Number of characters

        # Pass indices to embedding matrix
        embedded = self.embedding_matrix(x)

        # Get Forward State
        state = self.rnn_fw_s0.unsqueeze(0).tile((batch_size, 1))
        c = self.rnn_fw_c0.unsqueeze(0).tile((batch_size, 1))
        interm_states = []

        for t in range(time_steps):
            (state, c) = self.rnn_fw_cell(embedded[:, t, :], (state, c))
            interm_states.append(state)
        interm_states_fw = torch.stack(interm_states, dim=1)

        # Get Backward Intermediate States
        state = self.rnn_bw_s0.unsqueeze(0).tile((batch_size, 1))
        c = self.rnn_bw_c0.unsqueeze(0).tile((batch_size, 1))
        interm_states = []

        for t in reversed(range(time_steps)):
            mask = (t < text_lens).unsqueeze(1).tile((1, self.hidden_size))
            (next_state, next_c) = self.rnn_bw_cell(
                embedded[:, t, :], (state, c))
            
            #Apply mask
            state = torch.where(mask, next_state, state)
            c = torch.where(mask, next_c,c)
            
            interm_states.append(state)
            
        interm_states_bw = torch.stack(
            interm_states[::-1], dim=1)  # Re-Reverse states

        # Concatenate forward and backward states
        interm_states = torch.cat((interm_states_fw, interm_states_bw), dim=2)
        
        # Pass through output layer
        return self.output_layer(interm_states)

In [87]:
model = Model(len(vocab), embedding_size=2, hidden_size=2,
              num_classes=num_classes)
model.to(device)

optimiser = torch.optim.Adam(model.parameters())

print('step', 'error')
train_errors = []

# Generate the padding mask: True wherever a position lies past the end of the title
batch_size = seq_train_y.shape[0]
time_steps = seq_train_y.shape[1]
mask = (torch.arange(time_steps, device=device).unsqueeze(0)
        >= text_lens.unsqueeze(1))

# Training loop (full batch: every step processes the whole training set)
for step in range(1, 100_000+1):

    optimiser.zero_grad()

    # Get per-character logits: (batch, time, classes)
    output = model(indexed_train_x, text_lens)

    # Per-character cross-entropy; the class dimension must come second
    errors = torch.nn.functional.cross_entropy(
         output.transpose(1,2), seq_train_y, reduction='none')

    # Zero out the padding positions and average over the real characters
    errors = torch.masked_fill(errors, mask, 0.0)
    error = errors.sum()/text_lens.sum()

    # Record error
    train_errors.append(error.detach().tolist())

    # Backpropagate and update the model parameters
    error.backward()
    optimiser.step()

    if step % 10_000 == 0:
        print(step, train_errors[-1])

step error
10000 1.5052309036254883
20000 1.497367024421692
30000 1.4906222820281982


KeyboardInterrupt: 

In [91]:
with torch.no_grad():
    # argmax over the logits equals argmax over the softmax probabilities
    predictions = model(indexed_train_x, text_lens).argmax(dim=2)
    # Per-character accuracy: count matches at real (non-padding) positions only
    num_correct = ((predictions == seq_train_y) & ~mask).sum().item()
    accuracy = num_correct/text_lens.sum().item()
    print('accuracy: {:.2%}'.format(accuracy))

accuracy: 31.58%
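
The figure above is a per-character accuracy, whereas the task asks for one genre per title. One possible way to collapse the per-character outputs into a single prediction is to sum the log-probabilities over the real characters of each title and take the best genre. This is a sketch under that assumption, not the notebook's method; `title_level_predictions` is a hypothetical helper and the logits are toy values.

```python
import torch

def title_level_predictions(logits, text_lens):
    """Collapse per-character logits (batch, time, classes) into one genre per title."""
    log_probs = torch.log_softmax(logits, dim=2)
    time_steps = logits.shape[1]
    # True at real character positions, False at padding.
    valid = (torch.arange(time_steps, device=logits.device).unsqueeze(0)
             < text_lens.unsqueeze(1))
    summed = (log_probs * valid.unsqueeze(2)).sum(dim=1)  # padding contributes 0
    return summed.argmax(dim=1)

demo_logits = torch.tensor([
    [[2.0, 0.0], [2.0, 0.0], [0.0, 9.0]],   # length 2: the third step is padding
    [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]],   # length 3
])
demo_lens = torch.tensor([2, 3])
demo_pred = title_level_predictions(demo_logits, demo_lens)
demo_true = torch.tensor([0, 1])
demo_accuracy = (demo_pred == demo_true).float().mean().item()
```

Computing this after every epoch on both the train and validation sets would give the two curves that the task asks to plot.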


Measure the F1 score performance of the model when applied on the test set.
Also plot a confusion matrix showing how often each genre is mistaken as another genre.
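
As a minimal sketch of the evaluation requested here, the snippet below computes a macro-averaged F1 score and a confusion matrix with `sklearn.metrics`. The label lists are hypothetical stand-ins for the test targets and the model's title-level predictions, and macro averaging is an assumption since the task does not name an averaging mode.

```python
import sklearn.metrics

categories = ['Action', 'Horror', 'Fantasy', 'Western', 'Adventure']

# Hypothetical class indices standing in for real test targets and predictions.
y_true = [0, 0, 1, 1, 2, 3, 4, 4]
y_pred = [0, 1, 1, 1, 2, 3, 4, 0]

macro_f1 = sklearn.metrics.f1_score(y_true, y_pred, average='macro')
confusion = sklearn.metrics.confusion_matrix(
    y_true, y_pred, labels=list(range(len(categories))))

print(f'macro F1: {macro_f1:.3f}')
print(confusion)  # rows = true genre, columns = predicted genre
```

To plot the matrix rather than print it, `sklearn.metrics.ConfusionMatrixDisplay(confusion, display_labels=categories).plot()` is one option.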

## Title generation (25%)

Now that you've proven that titles and genre are related, make a model that can generate a title given a genre.

Again, you need to generate tokens at the character level instead of the word level and the titles must be lowercased.
Preprocess the data sets, create a neural network, and train it to generate the movie titles given their genre.
Plot a graph of the **perplexity** of the model on the train and validation sets after each epoch.
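
Perplexity is the exponential of the average negative log-likelihood per token, so it can be computed from the per-character log-probabilities that the generator assigns to each title. The sketch below assumes natural logarithms and that padding positions have already been excluded; `perplexity` is a hypothetical helper.

```python
import math

def perplexity(token_log_probs):
    """Perplexity from the natural-log probabilities of each real (non-padding) token."""
    total = sum(sum(seq) for seq in token_log_probs)
    count = sum(len(seq) for seq in token_log_probs)
    return math.exp(-total / count)

# A model that spreads its probability uniformly over 4 characters.
uniform = [[math.log(0.25)] * 3, [math.log(0.25)] * 5]
demo_perplexity = perplexity(uniform)
```

With a masked cross-entropy averaged over the real characters, as in the classification loop, the same value is simply the exponential of that loss.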

Generate 3 titles for every genre.
Make sure that the titles are not all the same.
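
Always picking the most probable next character would produce the same title every time for a given genre, so one common way to get varied titles is to sample each character from the predicted distribution. The sketch below assumes an end-of-title token `<EOS>` and a function that returns next-character probabilities for a prefix; `toy_model` is a hypothetical stand-in for the trained generator.

```python
import random

def sample_title(next_char_probs, max_len=20, end='<EOS>', rng=random):
    """Sample one title character by character from a next-character distribution."""
    chars = []
    while len(chars) < max_len:
        probs = next_char_probs(chars)          # dict: character -> probability
        tokens, weights = zip(*probs.items())
        token = rng.choices(tokens, weights=weights, k=1)[0]
        if token == end:
            break
        chars.append(token)
    return ''.join(chars)

def toy_model(prefix):
    """Hypothetical stand-in for the trained generator conditioned on one genre."""
    if len(prefix) >= 3:
        return {'<EOS>': 1.0}
    return {'a': 0.5, 'b': 0.5}

rng = random.Random(0)
demo_titles = [sample_title(toy_model, rng=rng) for _ in range(3)]
```

The `max_len=20` default mirrors the 20-character title limit used when filtering the data.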

## Language models as classifiers (30%)

It occurs to you that the movie title generator can also be used as a classifier by doing the following:

* Let title $t$ be the title that you want to classify.
* For every genre $g$,
    * Use the generator as a language model to get the probability of $t$ (the whole title) using genre $g$.
* Pick the genre that makes the language model give the largest probability.
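
The steps above can be sketched as follows. To keep the example self-contained, the language model is replaced by a hypothetical add-one smoothed character unigram model per genre (`train_unigram`); in the assignment the per-character log-probabilities would come from the trained generator instead. Summing log-probabilities rather than multiplying probabilities avoids numerical underflow.

```python
import math
from collections import Counter

def train_unigram(titles, alphabet):
    """Hypothetical stand-in language model: add-one smoothed character unigram."""
    counts = Counter(c for t in titles for c in t)
    total = sum(counts.values()) + len(alphabet)
    return {c: math.log((counts[c] + 1) / total) for c in alphabet}

def classify(title, models):
    """Return the genre whose model gives the whole title the highest log-probability."""
    scores = {genre: sum(lp[c] for c in title) for genre, lp in models.items()}
    return max(scores, key=scores.get), scores

alphabet = sorted(set('blue heat see no evil'))
models = {
    'Action': train_unigram(['blue heat'], alphabet),
    'Horror': train_unigram(['see no evil'], alphabet),
}
demo_genre, demo_scores = classify('evil', models)
```

Because every genre scores the same title, the title length is identical across candidates and no length normalisation is needed for the comparison.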

The producer is thrilled to not need two separate models and now you have to implement this.
**Use the preprocessed test set from the previous task** in order to find the genre that makes the language model give the largest probability.
There is no need to plot anything here.

Just like in the classification task, measure the F1 score and plot the confusion matrix of this new classifier.

Write a paragraph or pseudocode to describe what your code above does.

In [None]:
'''

'''

## Conclusion (10%)

The producer's funders are asking for a report about this new technology they invested in.
In 300 words, write your interpretation of the results together with what you think could make the model perform better.

In [None]:
'''

'''