In [1]:
# Suggested imports. Do not use import any modules that are not in the requirements.txt file on the VLE.

%matplotlib inline

import numpy as np
import pandas as pd
import torch
import collections
import random
import matplotlib.pyplot as plt
import sklearn.model_selection
import sklearn.metrics

device = 'cuda:0' if torch.cuda.is_available() else 'cpu'

# Movie titles assignment

Table of contents:

* [Data filtering and splitting (10%)](#Data-filtering-and-splitting-(10%))
* [Title classification (25%)](#Title-classification-(25%))
* [Title generation (25%)](#Title-generation-(25%))
* [Language models as classifiers (30%)](#Language-models-as-classifiers-(30%))
* [Conclusion (10%)](#Conclusion-(10%))

Information:

This assignment is 100% of your assessment.
You are to follow the instructions below and fill each cell as instructed.
Once ready, submit this notebook on VLE with all the outputs included (run all your code and don't clear any output cells).
Do not submit anything else apart from the notebook and do not use any extra data apart from what is requested.

## Introduction

A big shot Hollywood producer is looking for a way to automatically generate new movie titles for future movies and you have been employed to do this (in exchange for millions of dollars!).
A data set of movie details has already been collected from IMDb for you and your task is to create the model and the algorithms necessary to use it.

## Data filtering and splitting (10%)

Start by downloading the CSV file `filmtv_movies - ENG.csv` from [this kaggle data set](https://www.kaggle.com/datasets/stefanoleone992/filmtv-movies-dataset).

The CSV file needs to be filtered as the producer is only interested in certain types of movie titles.
Load the file and filter it so that only movies with the following criteria are kept:

* The country needs to be `United States` (and no other country should be mentioned).
* The genre should be `Action`, `Horror`, `Fantasy`, `Western`, and `Adventure`.
* The title should not have more than 20 characters.

In [2]:
data = pd.read_csv('Data/filmtv_movies - ENG.csv', index_col=None)

In [3]:
genres = ['Action', 'Horror', 'Fantasy', 'Western', 'Adventure']

data = data.loc[(data['country'] == 'United States') &
                (data['genre'].isin(genres)) & 
                (data['title'].str.len() <= 20)]

In [4]:
data

Unnamed: 0,filmtv_id,title,year,genre,duration,country,directors,actors,avg_vote,critics_vote,public_vote,total_votes,description,notes,humor,rhythm,effort,tension,erotism
13,36,Bowery at Midnight,1942,Horror,62,United States,Wallace Fox,"Bela Lugosi, John Archer, Wanda McKay, Dave O'...",5.1,5.00,5.0,27,In the infamous New York neighborhood of Bower...,"Defined by critics as shaky, Wallace W. Fox's ...",0,2,1,3,0
15,38,Mr. Majestyk,1974,Action,105,United States,Richard Fleischer,"Charles Bronson, Linda Cristal, Al Lettieri, L...",6.2,5.71,7.0,28,"A veteran of Vietnam, Vince (Bronson) grows me...",Cliché screenplay (by Elmore Leonard) tailored...,0,4,3,3,0
16,45,Warning Sign,1985,Action,99,United States,Hal Barwood,"Sam Waterston, Kathleen Quinlan, Yaphet Kotto,...",4.8,4.00,6.0,10,"Inside the Biotek laboratory, a renowned and r...","It is a film that mixes ""The invasion of the b...",0,2,1,1,0
25,61,The Appaloosa,1966,Western,98,United States,Sidney J. Furie,"Marlon Brando, Anjanette Comer, John Saxon, Em...",6.9,7.00,7.0,26,"On the Mexican frontier, a dramatic rivalry be...",A very original western for the subject and fo...,0,2,1,3,1
32,74,The Deep,1977,Adventure,130,United States,Peter Yates,"Nick Nolte, Jacqueline Bisset, Robert Shaw, El...",5.3,4.88,6.0,30,"A boy and a girl, passionate divers, dive off ...","Not very exciting, but the underwater scenes a...",1,2,1,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40020,218528,House of Darkness,2022,Horror,88,United States,Neil LaBute,"Kate Bosworth, Justin Long, Gia Crovatin, Lucy...",4.0,4.00,,1,Hap (Justin Long) and Mina (Kate Bosworth) mee...,,2,3,0,2,1
40022,218530,Nix,2022,Horror,103,United States,Anthony C. Ferrante,"James Zimbardi, Dee Wallace, Michael Paré, Ang...",1.0,1.00,,1,Jack Coyle (James Zimbardi) desperately tries ...,,0,2,1,2,0
40033,219060,Matriarch,2022,Horror,85,United States,Ben Steiner,"Jemima Rooper, Kate Dickie, Sarah Paul, Simon ...",5.9,7.00,5.0,5,"Laura (Jemima Rooper) leads a lonely life, imm...",,0,2,2,3,0
40034,219063,The Inhabitant,2022,Horror,97,United States,Jerren Lauder,"Odessa A’zion, Leslie Bibb, Dermot Mulroney, L...",6.0,6.00,,1,"From a fact of the news, which took place in F...",,0,3,1,3,0


Split the filtered data into 80% train, 10% validation, and 10% test.
You will only need the title and genre columns.

In [5]:
data = data[['title', 'genre']]
train, validate, test = np.split(data.sample(frac=1), [int(.8*len(data)), int(.9*len(data))])

From your processed data set, display:

* the amount of movies in each genre and split
* 5 examples of movie titles from each genre and split

In [6]:
print(f"Training Set Size: {len(train)}")
print(f"Validation Set Size: {len(validate)}")
print(f"Test Set Size: {len(test)}")

for genre in data['genre'].unique():
    print(f"Num. of movies with genre {genre}: {len(data[data['genre'] == genre])} ")

Training Set Size: 2736
Validation Set Size: 342
Test Set Size: 342

Num. of movies with genre Horror: 926 
Num. of movies with genre Action: 914 
Num. of movies with genre Western: 538 
Num. of movies with genre Adventure: 483 
Num. of movies with genre Fantasy: 559 


## Title classification (25%)

Your first task is to prove that a neural network can identify the genre of a movie based on its title.

You will note that many titles are just a single word or two words long so you need to work at the character level instead of the word level, that is, a token would be a single character, including punctuation marks and spaces.
You must also lowercase the titles.
Preprocess the data sets, create a neural network, and train it to classify the movie titles into their genre.
Plot a graph of the **accuracy** of the model on the train and validation sets after each epoch.

In [7]:
train['title'] = train['title'].str.lower()
titles_temp = train['title'].values
titles = [list(title) for title in titles_temp]

In [8]:
min_freq = 3

In [22]:
# Training Set
train_x = []
train_y = []
for row in range(0, len(train)):
    title = list(train.iloc[row]['title'].lower())
    genre = [train.iloc[row]['genre']]

    train_x.append(title)
    train_y.append(genre)

# Testing Set
test_x = []
test_y = []
for row in range(0, len(test)):
    title = list(test.iloc[row]['title'].lower())
    genre = [test.iloc[row]['genre']]

    test_x.append(title)
    test_y.append(genre)

# Lengths
train_lens = torch.tensor(
    [len(title) for title in train_x],
    dtype=torch.int64, device=device
)

test_lens = torch.tensor(
    [len(title) for title in test_x],
    dtype=torch.int64, device=device
)

max_len = max(max(train_lens), max(test_lens))

# Genres
genre = sorted(set(genre for text in train_y for genre in text))
genre = sorted(genre)

# Genre Indexing
genre2index = {genre: i for (i, genre) in enumerate(genre)}

# vocab
frequencies = collections.Counter(letter for text in train_x for word in text for letter in word)
vocab = sorted(frequencies.keys(), key=frequencies.get, reverse=True)
while frequencies[vocab[-1]] < min_freq:
    vocab.pop()
vocab = ['<PAD>', '<UNK>'] + sorted(vocab)
letter2index = {letter: i for (i, letter) in enumerate(vocab)}

# Padding and UNK indexing
for i in range(len(train_x)):
    for j in range(len(train_x[i])):
        if train_x[i][j] not in letter2index:
            train_x[i][j] = '<UNK>'
    
    for x in range(0, (max_len - len(train_x[i]))):
        train_x[i].extend(['<PAD>'])
    
    temp_ans = train_y[i]
    train_y[i] = [0] * len(genre)
    train_y[i][genre2index[temp_ans[0]]] = 1

for i in range(len(test_x)):
    for j in range(len(test_x[i])):
        if test_x[i][j] not in letter2index:
            test_x[i][j] = '<UNK>'

    for x in range(0, (max_len - len(test_x[i]))):
        test_x[i].extend(['<PAD>'])
    
    temp_ans = test_y[i]
    test_y[i] = [0] * len(genre)
    test_y[i][genre2index[temp_ans[0]]] = 1

# indexing
indexed_train_x = torch.tensor([[letter2index[letter] for letter in text] for text in train_x], 
                                dtype=torch.int64, 
                                device=device)
indexed_train_y = torch.tensor([y for y in train_y],
                                dtype=torch.int64, 
                                device=device)

indexed_test_x = torch.tensor([[letter2index[letter] for letter in text] for text in test_x], 
                                dtype=torch.int64, 
                                device=device)
indexed_test_y = torch.tensor([y for y in train_y],
                                dtype=torch.int64, 
                                device=device)


In [51]:
print(f'First 10 vocab: {vocab[:10]}')
print(f'Last 10 vocab: {vocab[-10:]}')
print(f'Vocab Size: {len(vocab)}\n')

print(f'First train_x:\n {train_x[1]}\n')
print(f'First train_y: {train_y[0]}\n')

print(f'Genres: {genre}\n')

print(f'First indexed_train_x:\n {indexed_train_x[0]}')
print(f'First indexed_train_y:\n {indexed_train_y[0]}')

print(f'First indexed_test_x:\n {indexed_test_x[0]}')
print(f'First indexed_test_y:\n {indexed_test_y[0]}')

First 10 vocab: ['<PAD>', '<UNK>', ' ', '!', '&', "'", ',', '-', '.', '/']
Last 10 vocab: ['q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
Vocab Size: 47

First train_x:
 ['t', 'o', 'p', ' ', 'g', 'u', 'n', ':', ' ', 'm', 'a', 'v', 'e', 'r', 'i', 'c', 'k', '<PAD>', '<PAD>', '<PAD>']

First train_y: [0, 0, 1, 0, 0]

Genres: ['Action', 'Adventure', 'Fantasy', 'Horror', 'Western']

First indexed_train_x:
 tensor([39, 21, 40, 41, 38, 34,  2, 13,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0])
First indexed_train_y:
 tensor([0, 0, 1, 0, 0])
First indexed_test_x:
 tensor([24, 35, 27,  2, 25, 21, 40,  2, 24, 35, 27,  0,  0,  0,  0,  0,  0,  0,
         0,  0])
First indexed_test_y:
 tensor([0, 0, 1, 0, 0])


In [59]:
class Model(torch.nn.Module):

    def __init__(self, chars_size, embedding_size, hidden_size, genre_size):
        super().__init__()
        self.embedding_layer = torch.nn.Embedding(chars_size, embedding_size)
        self.rnn_s0 = torch.nn.Parameter(torch.zeros((hidden_size,), dtype=torch.float32))
        self.rnn_c0 = torch.nn.Parameter(torch.zeros((hidden_size,), dtype=torch.float32))
        self.rnn_cell = torch.nn.LSTMCell(hidden_size, hidden_size)
        self.output_layer = torch.nn.Sigmoid(hidden_size, genre_size) # Output size number of genres
    
    def forward(self, x):
        batch_size = x.shape[0]
        time_steps = x.shape[1]

        embedded = self.embedding_layer(x)

        state = self.rnn_s0.unsqueeze(0).tile((batch_size, 1))
        c = self.rnn_c0.unsqueeze(0).tile((batch_size, 1))

        interm_states = []
        for t in range(time_steps):
            (state, c) = self.rnn_cell(embedded[:, t, :], (state, c))
            interm_states.append(state)
        interm_states = torch.stack(interm_states, dim=1)

        dead = self.output_layer(interm_states)

        return dead

In [60]:
class GenreClassifier():
    
    def __init__(self, chars, iters, title_lens, genre_size, embedding, hidden):
        self.model = Model(chars_size=len(chars), embedding_size=embedding, hidden_size=hidden, genre_size=genre_size)
        self.model.to(device)
        self.optimiser = torch.optim.Adam(self.model.parameters())

        self.iters = iters
        self.title_lens = title_lens
        self.train_errors = []
    
    def run(self, indexed_train_x, indexed_train_y):
        print('step', 'error')
        self.train_errors = []

        for step in range(1, self.iters + 1):
            batch_size = indexed_train_x.shape[0]
            time_steps = indexed_train_x.shape[1]
            mask = torch.zeros((indexed_train_y.shape[0], indexed_train_y.shape[1]), dtype=torch.float32)

            for i in range(indexed_train_y.shape[0]):
                for j in range(indexed_train_y.shape[1]):
                    if j >= self.title_lens[i]:
                        mask[i, j] = 1
            
            self.optimiser.zero_grad()
            output = self.model(indexed_train_x)
            errors = torch.nn.functional.cross_entropy(output, indexed_train_y, reduction='none')
            errors = torch.masked_fill(errors, mask, 0.0)
            error = errors.sum()/self.title_lens.sum()
            self.train_errors.append(error.detach().tolist())
            error.backward()
            self.optimiser.step()

            if step%100 == 0:
                print(step, self.train_errors[-1])  

        
        
    def backTesting(self, test_x, test_y):
        with torch.no_grad():
            predictions = torch.sigmoid(self.model(test_x))
            # predictions = torch.round(predictions)
            print(predictions)
            accuracy = (predictions == test_y).numpy().mean()
            print('Test accuracy: {:.3%}'.format(accuracy))

    def errors(self):
        (fig, ax) = plt.subplots(1, 1)
        ax.set_xlabel('step')
        ax.set_ylabel('$E$')
        ax.plot(range(1, len(self.train_errors) + 1), self.train_errors, color='blue', linestyle='-', linewidth=3)
        ax.grid()

In [61]:
classifier = GenreClassifier(chars=vocab, iters=500, title_lens=train_lens, genre_size=len(genre), embedding=16, hidden=16)
classifier.run(indexed_train_x, indexed_train_y)
classifier.errors()

TypeError: __init__() takes 1 positional argument but 2 were given

In [50]:
classifier.backTesting(indexed_test_x, indexed_test_y)

tensor([[[0.8587, 0.9231, 0.8580, 0.8567, 0.8966],
         [0.6541, 0.7056, 0.5835, 0.6381, 0.6337],
         [0.2341, 0.2073, 0.1543, 0.2063, 0.1858],
         ...,
         [0.0081, 0.0038, 0.0062, 0.0060, 0.0041],
         [0.0081, 0.0038, 0.0062, 0.0060, 0.0041],
         [0.0081, 0.0038, 0.0062, 0.0060, 0.0041]],

        [[0.8809, 0.9346, 0.8750, 0.8654, 0.9045],
         [0.6393, 0.6723, 0.5508, 0.6162, 0.5749],
         [0.1280, 0.0903, 0.0812, 0.1044, 0.0748],
         ...,
         [0.0089, 0.0043, 0.0070, 0.0066, 0.0045],
         [0.0081, 0.0038, 0.0062, 0.0060, 0.0041],
         [0.0081, 0.0038, 0.0062, 0.0060, 0.0041]],

        [[0.9032, 0.9334, 0.9039, 0.8758, 0.9084],
         [0.6949, 0.7516, 0.6230, 0.7186, 0.6631],
         [0.1398, 0.0975, 0.0882, 0.1150, 0.0813],
         ...,
         [0.0081, 0.0038, 0.0062, 0.0060, 0.0041],
         [0.0081, 0.0038, 0.0062, 0.0060, 0.0041],
         [0.0081, 0.0038, 0.0062, 0.0060, 0.0041]],

        ...,

        [[0.8415, 0.

RuntimeError: The size of tensor a (20) must match the size of tensor b (2736) at non-singleton dimension 1

Measure the F1 score performance of the model when applied on the test set.
Also plot a confusion matrix showing how often each genre is mistaken as another genre.

## Title generation (25%)

Now that you've proven that titles and genre are related, make a model that can generate a title given a genre.

Again, you need to generate tokens at the character level instead of the word level and the titles must be lowercased.
Preprocess the data sets, create a neural network, and train it to generate the movie titles given their genre.
Plot a graph of the **perplexity** of the model on the train and validation sets after each epoch.

In [None]:
class Model(torch.nn.Module):

    def __init__(self, vocab_size, embedding_size, hidden_size):
        super().__init__()
        self.embedding_layer = torch.nn.Embedding(vocab_size, embedding_size)
        self.rnn_s0 = torch.nn.Parameter(torch.zeros((hidden_size,), dtype=torch.float32))
        self.rnn_c0 = torch.nn.Parameter(torch.zeros((hidden_size,), dtype=torch.float32))
        self.rnn_cell = torch.nn.LSTMCell(embedding_size, hidden_size)
        self.output_layer = torch.nn.Linear(hidden_size, vocab_size)
    
    def forward(self, x):
        batch_size = x.shape[0]
        time_steps = x.shape[1]

        embedded = self.embedding_layer(x)
        state = self.rnn_s0.unsqueeze(0).tile((batch_size, 1))
        c = self.rnn_s0.unsqueeze(0).tile((batch_size, 1))
        interm_states = []
        for t in range(time_steps):
            (state, c) = self.rnn_cell(embedded[:, t, :], (state, c))
            interm_states.append(state)
        interm_states = torch.stack(interm_states, dim=1)
        return self.output_layer(interm_states)


In [None]:
class titleGeneration():

    def __init__(self, vocab):
        self.model = Model(len(vocab), embedding_size = 16, hidden_size = 16)
        self.model.to(device)
        self.optimiser = torch.optim.Adam(self.model.parameters())
    
    def run(self):
        

Generate 3 titles for every genre.
Make sure that the titles are not all the same.

## Language models as classifiers (30%)

It occurs to you that the movie title generator can also be used as a classifier by doing the following:

* Let title $t$ be the title that you want to classify.
* For every genre $g$,
    * Use the generator as a language model to get the probability of $t$ (the whole title) using genre $g$.
* Pick the genre that makes the language model give the largest probability.

The producer is thrilled to not need two separate models and now you have to implement this.
**Use the preprocessed test set from the previous task** in order to find the genre that makes the language model give the largest probability.
There is no need to plot anything here.

Just like in the classification task, measure the F1 score and plot the confusion matrix of this new classifier.

Write a paragraph or psuedo code to describe what your code above does.

In [None]:
'''

'''

## Conclusion (10%)

The producer's funders are asking for a report about this new technology they invested in.
In 300 words, write your interpretation of the results together with what you think could make the model perform better.

In [None]:
'''

'''