In [1]:
# Suggested imports. Do not import any modules that are not in the requirements.txt file on the VLE.

%matplotlib inline

import numpy as np
import pandas as pd
import torch
import collections
import random
import matplotlib.pyplot as plt
import sklearn.model_selection
import sklearn.metrics

device = 'cuda:0' if torch.cuda.is_available() else 'cpu'

# Movie titles assignment

Table of contents:

* [Data filtering and splitting (10%)](#Data-filtering-and-splitting-(10%))
* [Title classification (25%)](#Title-classification-(25%))
* [Title generation (25%)](#Title-generation-(25%))
* [Language models as classifiers (30%)](#Language-models-as-classifiers-(30%))
* [Conclusion (10%)](#Conclusion-(10%))

Information:

This assignment is 100% of your assessment.
You are to follow the instructions below and fill each cell as instructed.
Once ready, submit this notebook on the VLE with all the outputs included (run all your code and don't clear any output cells).
Do not submit anything else apart from the notebook and do not use any extra data apart from what is requested.

## Introduction

A big-shot Hollywood producer is looking for a way to automatically generate titles for future movies, and you have been employed to do this (in exchange for millions of dollars!).
A data set of movie details has already been collected from IMDb for you, and your task is to create the model and the algorithms necessary to use it.

## Data filtering and splitting (10%)

Start by downloading the CSV file `filmtv_movies - ENG.csv` from [this kaggle data set](https://www.kaggle.com/datasets/stefanoleone992/filmtv-movies-dataset).

The CSV file needs to be filtered as the producer is only interested in certain types of movie titles.
Load the file and filter it so that only movies with the following criteria are kept:

* The country needs to be `United States` (and no other country should be mentioned).
* The genre should be one of `Action`, `Horror`, `Fantasy`, `Western`, or `Adventure`.
* The title should not have more than 20 characters.

In [41]:
GENRES = ['Action', 'Horror', 'Fantasy', 'Western', 'Adventure']

df = pd.read_csv('data.csv')  # Load the full CSV
df = df[df['country'] == 'United States']  # Keep movies whose only listed country is the United States
df = df[df['genre'].isin(GENRES)]  # Keep only the requested genres
df = df[df['title'].str.len() <= 20]  # Keep titles of at most 20 characters
df = df[['title', 'genre']].copy()  # Only the title and genre columns are needed
df['title'] = df['title'].str.lower()  # Lowercase all titles

df = df.sample(frac=1)  # Shuffle the data set
df.to_csv('filtered_data.csv', index=False)
df

Unnamed: 0,title,genre
25169,john carter,Fantasy
29786,terminator: genisys,Fantasy
9290,frankenstein,Horror
4470,high plains drifter,Western
20348,deer woman,Horror
...,...,...
35989,midsommar,Horror
16895,intruder,Horror
31751,outlaws and angels,Horror
517,cherry 2000,Fantasy


Split the filtered data into 80% train, 10% validation, and 10% test.
You will only need the title and genre columns.

In [37]:
# df = pd.read_csv('filtered_data.csv')

# Train = 80%, other = 20%
train_x, other_x, train_y, other_y = sklearn.model_selection.train_test_split(
    df['title'], df['genre'], test_size=0.2, random_state=1)

# Split "other" in half -> train = 80%, validation = 10%, test = 10%
val_x, test_x, val_y, test_y = sklearn.model_selection.train_test_split(
    other_x, other_y, test_size=0.5, random_state=1)



From your processed data set, display:

* the number of movies in each genre and split
* 5 examples of movie titles from each genre and split
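
One possible way to produce both displays is to stack the splits into a single frame and group by split and genre. The sketch below is only an illustration: `summarise_splits` and the toy titles are hypothetical, and in the notebook the title and genre series from the split step would be passed in instead.

```python
import pandas as pd

def summarise_splits(splits, n_examples=5):
    """Return (counts, examples) for a dict of split name -> (titles, genres)."""
    frames = [
        pd.DataFrame({'split': name, 'title': list(titles), 'genre': list(genres)})
        for name, (titles, genres) in splits.items()
    ]
    combined = pd.concat(frames, ignore_index=True)
    # Rows are splits, columns are genres, cells are movie counts.
    counts = combined.groupby(['split', 'genre']).size().unstack(fill_value=0)
    # Up to n_examples titles for every (split, genre) pair that occurs.
    examples = combined.groupby(['split', 'genre'])['title'].apply(
        lambda s: list(s.head(n_examples)))
    return counts, examples

# Toy stand-in for the real (titles, genres) pairs of each split.
toy_splits = {
    'train': (['frankenstein', 'midsommar', 'john carter'],
              ['Horror', 'Horror', 'Fantasy']),
    'val': (['intruder'], ['Horror']),
}
counts, examples = summarise_splits(toy_splits)
print(counts)
print(examples)
```

Keeping the counts as a split-by-genre table makes class imbalance visible at a glance, which matters later when interpreting the F1 score.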

## Title classification (25%)

Your first task is to prove that a neural network can identify the genre of a movie based on its title.

Many titles are only one or two words long, so you need to work at the character level instead of the word level: each token is a single character, including punctuation marks and spaces.
You must also lowercase the titles.
Preprocess the data sets, create a neural network, and train it to classify the movie titles into their genre.
Plot a graph of the **accuracy** of the model on the train and validation sets after each epoch.
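
As a rough illustration of character-level preprocessing, the sketch below maps each character to an integer index and pads every title to a fixed length. The `<pad>` and `<unk>` entries, the function names, and the default `max_len=20` (chosen to match the 20-character title limit) are assumptions made for this example, not part of the assignment.

```python
def build_vocab(titles):
    """Map every character seen in the (lowercased) titles to an integer index."""
    chars = sorted(set(''.join(t.lower() for t in titles)))
    vocab = {'<pad>': 0, '<unk>': 1}  # reserved indexes (assumed convention)
    for c in chars:
        vocab[c] = len(vocab)
    return vocab

def encode(title, vocab, max_len=20):
    """Return (padded index list, true length) for one title."""
    ids = [vocab.get(c, vocab['<unk>']) for c in title.lower()[:max_len]]
    length = len(ids)
    return ids + [vocab['<pad>']] * (max_len - length), length

vocab = build_vocab(['ab', 'b c'])
print(encode('AB', vocab, max_len=4))  # ([3, 4, 0, 0], 2)
```

The vocabulary would normally be built from the train split only, so that characters first seen in validation or test fall back to `<unk>`.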

Measure the F1 score performance of the model when applied on the test set.
Also plot a confusion matrix showing how often each genre is mistaken as another genre.
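
A minimal sketch of the evaluation step using `sklearn.metrics` is shown below. The assignment does not say how the F1 score should be averaged across genres, so `average='macro'` is an assumption here, and `evaluate` and the toy labels are hypothetical.

```python
import sklearn.metrics

def evaluate(y_true, y_pred, labels):
    """Return (macro F1, confusion matrix) with rows = true genre, columns = predicted genre."""
    f1 = sklearn.metrics.f1_score(y_true, y_pred, labels=labels, average='macro')
    cm = sklearn.metrics.confusion_matrix(y_true, y_pred, labels=labels)
    return f1, cm

labels = ['Action', 'Horror', 'Western']
y_true = ['Horror', 'Horror', 'Western', 'Action']
y_pred = ['Horror', 'Western', 'Western', 'Action']
f1, cm = evaluate(y_true, y_pred, labels)
print(f1)
print(cm)
```

Passing `labels` explicitly fixes the row and column order of the matrix, so the same order can be reused for the axis tick labels when plotting it.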

## Title generation (25%)

Now that you've proven that titles and genre are related, make a model that can generate a title given a genre.

Again, you need to generate tokens at the character level instead of the word level and the titles must be lowercased.
Preprocess the data sets, create a neural network, and train it to generate the movie titles given their genre.
Plot a graph of the **perplexity** of the model on the train and validation sets after each epoch.
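
Perplexity is the exponential of the mean negative log-probability per token. The sketch below assumes the model's natural-log probability for each real (non-padding) token has already been collected into a flat list; `perplexity` is a hypothetical helper name.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative natural-log probability over all tokens."""
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)

# A model that gives every token probability 0.5 has perplexity 2.
print(perplexity([math.log(0.5)] * 4))
```

Padding positions should be excluded from both the sum and the token count, otherwise the reported perplexity depends on how much padding a batch happens to contain.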

Generate 3 titles for every genre.
Make sure that the titles are not all the same.
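
One common way to avoid identical titles is to sample the next character from the model's output distribution instead of always taking the most likely one. The sketch below shows sampling with a temperature from a list of raw scores (logits); `sample_token` and its parameters are hypothetical, and temperature sampling is only one of several options.

```python
import math
import random

def sample_token(logits, temperature=1.0, rng=random):
    """Sample an index from softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(x - m) for x in scaled]
    return rng.choices(range(len(weights)), weights=weights)[0]

rng = random.Random(0)
print([sample_token([0.0, 5.0, 1.0], temperature=1.0, rng=rng) for _ in range(10)])
```

A temperature close to zero approaches greedy decoding (always the top-scoring character), while higher temperatures give more varied but less plausible titles.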

## Language models as classifiers (30%)

It occurs to you that the movie title generator can also be used as a classifier by doing the following:

* Let title $t$ be the title that you want to classify.
* For every genre $g$,
    * Use the generator as a language model to get the probability of $t$ (the whole title) using genre $g$.
* Pick the genre that makes the language model give the largest probability.
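
The steps above can be sketched as an argmax over genres. Here `log_prob_fn(title, genre)` is a hypothetical callable standing in for the trained generator; it is assumed to return the sum of the log-probabilities of every token in the title, which ranks genres the same way as the product of the probabilities while avoiding numerical underflow. `toy_log_prob` is a made-up scorer used only to make the example runnable.

```python
import math

def classify_with_lm(title, genres, log_prob_fn):
    """Return (best genre, per-genre scores) for one title."""
    scores = {g: log_prob_fn(title, g) for g in genres}
    return max(scores, key=scores.get), scores

def toy_log_prob(title, genre):
    # Made-up stand-in for the generator: each genre favours one character.
    favoured = {'Horror': 'd', 'Western': 'g'}[genre]
    return sum(math.log(0.5 if c == favoured else 0.1) for c in title)

genres = ['Horror', 'Western']
print(classify_with_lm('dead', genres, toy_log_prob)[0])
print(classify_with_lm('gang', genres, toy_log_prob)[0])
```

If the generator was trained with an end-of-title token, its log-probability would normally be included in the sum so that the score covers the whole title.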

The producer is thrilled to not need two separate models and now you have to implement this.
**Use the preprocessed test set from the previous task** in order to find the genre that makes the language model give the largest probability.
There is no need to plot anything here.

Just like in the classification task, measure the F1 score and plot the confusion matrix of this new classifier.

Write a paragraph or pseudocode to describe what your code above does.

In [None]:
'''

'''

## Conclusion (10%)

The producer's funders are asking for a report about this new technology they invested in.
In 300 words, write your interpretation of the results together with what you think could make the model perform better.

In [None]:
'''

'''