In [3]:
# Suggested imports. Do not import any modules that are not in the requirements.txt file on the VLE.

%matplotlib inline

import numpy as np
import pandas as pd
import torch
import collections
import random
import matplotlib.pyplot as plt
import sklearn.model_selection
import sklearn.metrics

device = 'cuda:0' if torch.cuda.is_available() else 'cpu'

# Movie titles assignment

Table of contents:

* [Data filtering and splitting (10%)](#Data-filtering-and-splitting-(10%))
* [Title classification (25%)](#Title-classification-(25%))
* [Title generation (25%)](#Title-generation-(25%))
* [Language models as classifiers (30%)](#Language-models-as-classifiers-(30%))
* [Conclusion (10%)](#Conclusion-(10%))

Information:

This assignment is 100% of your assessment.
You are to follow the instructions below and fill each cell as instructed.
Once ready, submit this notebook on the VLE with all the outputs included (run all your code and don't clear any output cells).
Do not submit anything else apart from the notebook and do not use any extra data apart from what is requested.

## Introduction

A big shot Hollywood producer is looking for a way to automatically generate new movie titles for future movies and you have been employed to do this (in exchange for millions of dollars!).
A data set of movie details has already been collected from IMDb for you and your task is to create the model and the algorithms necessary to use it.

## Data filtering and splitting (10%)

Start by downloading the CSV file `filmtv_movies - ENG.csv` from [this kaggle data set](https://www.kaggle.com/datasets/stefanoleone992/filmtv-movies-dataset).

The CSV file needs to be filtered as the producer is only interested in certain types of movie titles.
Load the file and filter it so that only movies with the following criteria are kept:

* The country needs to be `United States` (and no other country should be mentioned).
* The genre should be one of `Action`, `Horror`, `Fantasy`, `Western`, or `Adventure`.
* The title should not have more than 20 characters.

In [4]:
data = pd.read_csv('Data/filmtv_movies - ENG.csv', index_col=None)
genres = ['Action', 'Horror', 'Fantasy', 'Western', 'Adventure']

data = data.loc[(data['country'] == 'United States') &
                (data['genre'].isin(genres)) & 
                (data['title'].str.len() <= 20)]

In [5]:
data

Unnamed: 0,filmtv_id,title,year,genre,duration,country,directors,actors,avg_vote,critics_vote,public_vote,total_votes,description,notes,humor,rhythm,effort,tension,erotism
12,36,Bowery at Midnight,1942,Horror,62,United States,Wallace Fox,"Bela Lugosi, John Archer, Wanda McKay, Dave O'...",5.3,5.00,6.0,25,In the infamous New York neighborhood of Bower...,"Defined by critics as shaky, Wallace W. Fox's ...",0,2,1,3,0
14,38,Mr. Majestyk,1974,Action,105,United States,Richard Fleischer,"Charles Bronson, Linda Cristal, Al Lettieri, L...",6.2,5.67,7.0,24,"A veteran of Vietnam, Vince (Bronson) grows me...",Cliché screenplay (by Elmore Leonard) tailored...,0,3,2,2,0
15,45,Warning Sign,1985,Action,99,United States,Hal Barwood,"Sam Waterston, Kathleen Quinlan, Yaphet Kotto,...",4.8,4.00,6.0,10,"Inside the Biotek laboratory, a renowned and r...","It is a film that mixes ""The invasion of the b...",0,2,1,1,0
24,61,The Appaloosa,1966,Western,98,United States,Sidney J. Furie,"Marlon Brando, Anjanette Comer, John Saxon, Em...",6.9,7.00,7.0,25,"On the Mexican frontier, a dramatic rivalry be...",A very original western for the subject and fo...,0,2,1,3,1
31,74,The Deep,1977,Adventure,130,United States,Peter Yates,"Nick Nolte, Jacqueline Bisset, Robert Shaw, El...",5.2,4.71,6.0,28,"A boy and a girl, passionate divers, dive off ...","Not very exciting, but the underwater scenes a...",1,2,1,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37685,205934,Aftermath,2021,Horror,114,United States,Peter Winther,"Ashley Greene, Shawn Ashmore, Sharif Atkins, B...",3.4,3.00,4.0,6,"In a desperate attempt to save their marriage,...",,0,0,0,0,0
37686,205935,Atomic Shark,2016,Action,90,United States,Lisa Palenica,"Alex Chayka, William 'Bill' Connor, Sean Dilli...",2.7,,3.0,3,A shark threatens swimmers on the San Diego co...,,0,0,0,0,0
37705,206906,The Stairs,2021,Horror,92,United States,Peter 'Drago' Tiemann,"Kathleen Quinlan, John Schneider, Russell Hodg...",4.0,4.00,,1,"1997: during a hunting trip, a boy (accompanie...",,0,0,0,0,0
37709,207047,Abandoned Dead,2015,Horror,77,United States,Mark W. Curran,"Ivan Adame, Hannah Johnson, Sarah Nicklin, Jud...",3.0,3.00,,1,"Rachel (Sarah Nicklin), despite working as a s...",,0,0,0,0,0


Split the filtered data into 80% train, 10% validation, and 10% test.
You will only need the title and genre columns.

In [10]:
data = data[['title', 'genre']]
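shuffled = data.sample(frac=1)  # shuffle the rows before cutting into splits
n_train, n_val = int(.8 * len(data)), int(.9 * len(data))  # cut points at 80% and 90%
train, validate, test = shuffled.iloc[:n_train], shuffled.iloc[n_train:n_val], shuffled.iloc[n_val:]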

From your processed data set, display:

* the number of movies in each genre and split
* 5 examples of movie titles from each genre and split
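
A minimal sketch of one way to produce both displays, shown on a tiny made-up data set. The `splits` dictionary, the one-letter titles, and the variable names are illustrative assumptions, not the assignment data.

```python
import pandas as pd

# Toy stand-ins for the real splits; titles and genres here are made up.
splits = {
    'train': pd.DataFrame({'title': ['a', 'b', 'c'], 'genre': ['Horror', 'Action', 'Horror']}),
    'validation': pd.DataFrame({'title': ['d'], 'genre': ['Action']}),
    'test': pd.DataFrame({'title': ['e'], 'genre': ['Horror']}),
}

# Stack the splits with a 'split' column so one crosstab covers every genre/split pair.
combined = pd.concat(
    [frame.assign(split=name) for name, frame in splits.items()],
    ignore_index=True,
)
counts = pd.crosstab(combined['genre'], combined['split'])
print(counts)

# Up to 5 example titles for each (split, genre) pair.
examples = combined.groupby(['split', 'genre'])['title'].apply(lambda s: s.head(5).tolist())
print(examples)
```

With the real data, the three frames produced by the split step would take the place of the toy frames.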

In [13]:
print(f"Training Set Size: {len(train)}")
print(f"Validation Set Size: {len(validate)}")
print(f"Test Set Size: {len(test)}")
print()
for genre in data['genre'].unique():
    print(f"Num. of movies with genre {genre}: {len(data[data['genre'] == genre])} ")

Training Set Size: 2599
Validation Set Size: 325
Test Set Size: 325

Num. of movies with genre Horror: 818 
Num. of movies with genre Action: 888 
Num. of movies with genre Western: 537 
Num. of movies with genre Adventure: 464 
Num. of movies with genre Fantasy: 542 


## Title classification (25%)

Your first task is to prove that a neural network can identify the genre of a movie based on its title.

Many titles are only one or two words long, so you need to work at the character level instead of the word level: each token is a single character, including punctuation marks and spaces.
You must also lowercase the titles.
Preprocess the data sets, create a neural network, and train it to classify the movie titles into their genre.
Plot a graph of the **accuracy** of the model on the train and validation sets after each epoch.
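
A minimal sketch of character-level preprocessing, assuming a padding id of 0 and right-padding to the longest title. The helper names `build_vocab` and `encode` are hypothetical.

```python
def build_vocab(titles):
    """Map each character seen in the titles to an integer id; id 0 is reserved for padding."""
    chars = sorted({ch for title in titles for ch in title})
    return {'<pad>': 0, **{ch: i + 1 for i, ch in enumerate(chars)}}

def encode(titles, vocab):
    """Turn titles into equal-length lists of ids, right-padded with 0."""
    max_len = max(len(t) for t in titles)
    return [[vocab[ch] for ch in t] + [0] * (max_len - len(t)) for t in titles]

# Two titles from the filtered data, lowercased as the task requires.
titles = [t.lower() for t in ['The Deep', 'Mr. Majestyk']]
vocab = build_vocab(titles)
encoded = encode(titles, vocab)
lengths = [len(t) for t in titles]  # true lengths, needed to ignore padding later
print(encoded, lengths)
```

The vocabulary should be built from the training titles only. Characters that appear only in the validation or test titles would then need an extra "unknown" id, which this sketch leaves out.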

In [18]:
train['title'].str.lower()

class Model(torch.nn.Module):
    def __init__(self, vocab_size, embedding_size, hidden_size):
        super().__init__()
        self.embedding_layer = torch.nn.Embedding(vocab_size, embedding_size)
        self.rnn_s0 = torch.nn.Parameter(torch.zeros((hidden_size,), dtype=torch.float32))
        self.rnn_c0 = torch.nn.Parameter(torch.zeros((hidden_size,), dtype=torch.float32))
        self.rnn_cell = torch.nn.LSTMCell(embedding_size, hidden_size)
        self.output_layer = torch.nn.Linear(hidden_size, vocab_size) # Output size is vocabulary size.

    def forward(self, x):
        batch_size = x.shape[0]
        time_steps = x.shape[1]

        embedded = self.embedding_layer(x)
        # Broadcast the learned initial hidden state and cell state across the batch.
        state = self.rnn_s0.unsqueeze(0).tile((batch_size, 1))
        c = self.rnn_c0.unsqueeze(0).tile((batch_size, 1))
        interm_states = []
        for t in range(time_steps):
            (state, c) = self.rnn_cell(embedded[:, t, :], (state, c))
            interm_states.append(state)
        interm_states = torch.stack(interm_states, dim=1)
        return self.output_layer(interm_states)

33738         midnighters
26687               train
8802         fright night
7620         lost horizon
36169       the gentlemen
               ...       
22557        unstoppable 
4332           black belt
20884    alien apocalypse
31090       jurassic city
807         heaven's gate
Name: title, Length: 2599, dtype: object

Measure the F1 score performance of the model when applied on the test set.
Also plot a confusion matrix showing how often each genre is mistaken as another genre.
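
A minimal sketch of both measurements using `sklearn.metrics`, assuming macro-averaged F1 (the task does not say which average to use). The labels below are made up for illustration; real ones would come from the test set and the trained model.

```python
import sklearn.metrics

genres = ['Action', 'Horror', 'Fantasy', 'Western', 'Adventure']

# Made-up labels purely for illustration.
y_true = ['Action', 'Horror', 'Horror', 'Western', 'Fantasy', 'Adventure']
y_pred = ['Action', 'Horror', 'Action', 'Western', 'Fantasy', 'Action']

# zero_division=0 scores a genre that is never predicted as 0 instead of warning.
macro_f1 = sklearn.metrics.f1_score(
    y_true, y_pred, labels=genres, average='macro', zero_division=0
)

# Rows are true genres and columns are predicted genres, both in the order of `genres`.
matrix = sklearn.metrics.confusion_matrix(y_true, y_pred, labels=genres)
print(f'Macro F1: {macro_f1:.3f}')
print(matrix)
```

The resulting array can be drawn with `plt.imshow(matrix)`, with `genres` as the tick labels on both axes.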

## Title generation (25%)

Now that you've proven that titles and genre are related, make a model that can generate a title given a genre.

Again, you need to generate tokens at the character level instead of the word level and the titles must be lowercased.
Preprocess the data sets, create a neural network, and train it to generate the movie titles given their genre.
Plot a graph of the **perplexity** of the model on the train and validation sets after each epoch.
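
Perplexity is the exponential of the average negative log-probability that the model assigns to each target character. A minimal sketch, assuming natural-log probabilities with padding positions already excluded (the function name and the toy inputs are illustrative):

```python
import math

def perplexity(token_log_probs):
    """Perplexity from natural-log probabilities of each real (non-padding) target token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Hypothetical model that gives each of 4 target characters probability 0.25.
log_probs = [math.log(0.25)] * 4
print(perplexity(log_probs))
```

A model that guesses uniformly among 4 characters has a perplexity of about 4, so lower values mean the model is less "surprised" by the titles.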

Generate 3 titles for every genre.
Make sure that the titles are not all the same.
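
Always picking the most probable next character gives the same title every time for a given genre, so one common way to get variety is to sample from the predicted distribution instead. A minimal sketch, assuming a hypothetical `next_char_probs` callable that stands in for the trained generator and an `'<eos>'` end-of-title token:

```python
import random

def sample_title(next_char_probs, max_len=20, rng=random):
    """Sample characters one at a time until '<eos>' is drawn or max_len is reached."""
    title = ''
    while len(title) < max_len:
        probs = next_char_probs(title)  # dict: character -> probability
        chars, weights = zip(*probs.items())
        ch = rng.choices(chars, weights=weights, k=1)[0]
        if ch == '<eos>':
            break
        title += ch
    return title

# Toy stand-in for a trained generator: it ignores the prefix (and any genre).
def toy_probs(prefix):
    return {'a': 0.4, 'b': 0.4, '<eos>': 0.2}

rng = random.Random(0)
titles = [sample_title(toy_probs, rng=rng) for _ in range(3)]
print(titles)
```

The `max_len=20` default mirrors the 20-character limit used when filtering the data.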

## Language models as classifiers (30%)

It occurs to you that the movie title generator can also be used as a classifier by doing the following:

* Let title $t$ be the title that you want to classify.
* For every genre $g$,
    * Use the generator as a language model to get the probability of $t$ (the whole title) using genre $g$.
* Pick the genre that makes the language model give the largest probability.
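
The steps above can be sketched as follows. Summing log-probabilities is used in place of multiplying probabilities to avoid numerical underflow; the ranking of genres is the same. The `char_prob` callable and the `'<eos>'` token are assumptions standing in for the trained generator.

```python
import math

def classify(title, genres, char_prob):
    """Return the genre whose language model gives the title the highest log-probability."""
    def log_prob(genre):
        # Chain rule: log p(t | g) is the sum of log p(char | prefix, g) over the title,
        # including the end-of-title marker so that the whole title is scored.
        tokens = list(title) + ['<eos>']
        return sum(math.log(char_prob(genre, title[:i], tok)) for i, tok in enumerate(tokens))
    return max(genres, key=log_prob)

# Toy stand-in for the generator with hypothetical numbers (not a normalised
# distribution, just enough to exercise the function).
def toy_char_prob(genre, prefix, char):
    favoured = {'Horror': 'x', 'Western': 'y'}[genre]
    return 0.6 if char == favoured else 0.2

print(classify('xx', ['Horror', 'Western'], toy_char_prob))
```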

The producer is thrilled to not need two separate models and now you have to implement this.
**Use the preprocessed test set from the previous task** in order to find the genre that makes the language model give the largest probability.
There is no need to plot anything here.

Just like in the classification task, measure the F1 score and plot the confusion matrix of this new classifier.

Write a paragraph or pseudocode to describe what your code above does.

In [None]:
'''

'''

## Conclusion (10%)

The producer's funders are asking for a report about this new technology they invested in.
In 300 words, write your interpretation of the results together with what you think could make the model perform better.

In [None]:
'''

'''