In [63]:
# Suggested imports. Do not use import any modules that are not in the requirements.txt file on the VLE.

%matplotlib inline

import numpy as np
import pandas as pd
import torch
import collections
import random
import matplotlib.pyplot as plt
import sklearn.model_selection
import sklearn.metrics

device = 'cuda:0' if torch.cuda.is_available() else 'cpu'

# Movie titles assignment

Table of contents:

* [Data filtering and splitting (10%)](#Data-filtering-and-splitting-(10%))
* [Title classification (25%)](#Title-classification-(25%))
* [Title generation (25%)](#Title-generation-(25%))
* [Language models as classifiers (30%)](#Language-models-as-classifiers-(30%))
* [Conclusion (10%)](#Conclusion-(10%))

Information:

This assignment is 100% of your assessment.
You are to follow the instructions below and fill each cell as instructed.
Once ready, submit this notebook on VLE with all the outputs included (run all your code and don't clear any output cells).
Do not submit anything else apart from the notebook and do not use any extra data apart from what is requested.

## Introduction

A big shot Hollywood producer is looking for a way to automatically generate new movie titles for future movies and you have been employed to do this (in exchange for millions of dollars!).
A data set of movie details has already been collected from IMDb for you and your task is to create the model and the algorithms necessary to use it.

## Data filtering and splitting (10%)

Start by downloading the CSV file `filmtv_movies - ENG.csv` from [this kaggle data set](https://www.kaggle.com/datasets/stefanoleone992/filmtv-movies-dataset).

The CSV file needs to be filtered as the producer is only interested in certain types of movie titles.
Load the file and filter it so that only movies with the following criteria are kept:

* The country needs to be `United States` (and no other country should be mentioned).
* The genre should be `Action`, `Horror`, `Fantasy`, `Western`, and `Adventure`.
* The title should not have more than 20 characters.

In [16]:
filtered_data = []
lines = open('filmtv_movies - ENG.csv', encoding="utf-8")

columns = next(lines).strip().split(',')

for i in lines:
    x = i.strip().split(',')
    # Checking if the movie was made in the US
    if x[columns.index('country')] == 'United States':
        # Checking if the genre of the movie is Action, Horror, Fantasy, Western and Adventure
        if (x[columns.index('genre')] == 'Action') or (x[columns.index('genre')] == 'Horror') or (x[columns.index('genre')] == 'Fantasy') or (x[columns.index('genre')] == 'Western') or (x[columns.index('genre')] == 'Adventure'):
            # Checking if title length not more than 20 characters
            #print(len(x[columns.index('title')]))
            if len(x[columns.index('title')]) <= 20:
                # Printing out the title of the movie
                print(x[columns.index('title')], " has len: ", len(x[columns.index('title')]))
                dict = {'title': x[columns.index('title')], 'genre': x[columns.index('genre')]}
                filtered_data.append(dict)

print(filtered_data)

Bowery at Midnight  has len:  18
Mr. Majestyk  has len:  12
The Appaloosa  has len:  13
The Deep  has len:  8
The Abyss  has len:  9
Mail Order Bride  has len:  16
Air America  has len:  11
Airport '77  has len:  11
Steel Dawn  has len:  10
Dawn at Socorro  has len:  15
The Hanging Tree  has len:  16
Nightwing  has len:  9
Alien Predator  has len:  14
The Hidden  has len:  10
Aliens  has len:  6
Romancing the Stone  has len:  19
Run for Cover  has len:  13
King Solomon's Mines  has len:  20
Broken Arrow  has len:  12
Colorado Territory  has len:  18
Forever Amber  has len:  13
Flesh and Blood  has len:  15
Android  has len:  7
The Andromeda Strain  has len:  20
Almost an Angel  has len:  15
Scarlet Angel  has len:  13
Neon City  has len:  9
Wings of the Apache  has len:  19
Date with an Angel  has len:  18
The Eagle  has len:  9
Gun Glory  has len:  9
Lethal Weapon  has len:  13
Lethal Weapon 2  has len:  15
Dark Angel  has len:  10
Comes a Horseman  has len:  16
Tiger Claws  has len: 

Piraña  has len:  6
Hurricane Smith  has len:  15
Gun Point  has len:  9
Arizona Raiders  has len:  15
Heaven with a Gun  has len:  17
The Shootist  has len:  12
The Magic of Lassie  has len:  19
Shakedown  has len:  9
Poltergeist  has len:  11
A Force of One  has len:  14
Cariboo Trail  has len:  13
Run for the Sun  has len:  15
Predator  has len:  8
Mother Lode  has len:  11
Abilene Town  has len:  12
The Omen  has len:  8
Timbuktu  has len:  8
Garden of Evil  has len:  14
Allegheny Uprising  has len:  18
Prince Valiant  has len:  14
The War Lord  has len:  12
Mohawk  has len:  6
Eve of Destruction  has len:  18
The Evil That Men Do  has len:  20
Shadowchaser  has len:  12
Evilspeak  has len:  9
The Next Man  has len:  12
Blue Heat  has len:  9
Something Evil  has len:  14
When Worlds Collide  has len:  19
Forty Guns  has len:  10
Benji the Hunted  has len:  16
Québec  has len:  6
The Train Robbers  has len:  17
3:10 to Yuma  has len:  12
Quick  has len:  5
Quintet  has len:  7
Rage 

The Spoilers  has len:  12
2010  has len:  4
Robocop 3  has len:  9
Windwalker  has len:  10
Montana  has len:  7
Devil's Doorway  has len:  15
Nowhere to Run  has len:  14
The Big Land  has len:  12
Bingo  has len:  5
Army of Darkness  has len:  16
Universal Soldier  has len:  17
Extreme Prejudice  has len:  17
The Ravagers  has len:  12
Straight Shooting  has len:  17
Friendly Persuasion  has len:  19
Far and Away  has len:  12
Cheyenne  has len:  8
The Sea Hawk  has len:  12
Alien 3  has len:  7
Tron  has len:  4
Free Willy  has len:  10
Dracula  has len:  7
Joshua Tree  has len:  11
Rapid Fire  has len:  10
New Frontier  has len:  12
Zardoz  has len:  6
Hook  has len:  4
The Swarm  has len:  9
Three Violent People  has len:  20
Black Eagle  has len:  11
Riding Shotgun  has len:  14
Apache Uprising  has len:  15
Hearthquake  has len:  11
The Real McCoy  has len:  14
Never Say Die  has len:  13
The Outriders  has len:  13
Freaks  has len:  6
Piraña 2  has len:  8
Ricochet  has len:  

Man from Del Rio  has len:  16
Babylon 5  has len:  9
From Dusk Till Dawn  has len:  19
Escape from L.A.  has len:  16
Courage Under Fire  has len:  18
Dragonslayer  has len:  12
The Lion  has len:  8
Seven Cities of Gold  has len:  20
Hell Camp  has len:  9
Mars Attacks!  has len:  13
Trapper County  has len:  14
Dante's Peak  has len:  12
Blind Justice  has len:  13
She  has len:  3
The Final Countdown  has len:  19
Scanner Cop  has len:  11
Howards of Virginia  has len:  19
Last of the Dogmen  has len:  18
Demon Knight  has len:  12
American Kickboxer  has len:  18
Prime Cut  has len:  9
Sworn to Justice  has len:  16
Airborne  has len:  8
The Shooter  has len:  11
Fifty/Fifty  has len:  11
Surf Ninjas  has len:  11
The Kentuckian  has len:  14
One Tough Bastard  has len:  17
Bad Boys  has len:  8
Panic in the Skies!  has len:  19
Three Young Texans  has len:  18
Dark Avenger  has len:  12
Lord of Illusions  has len:  17
Lost Continent  has len:  14
The Night Flier  has len:  15
The

Fortress 2: Re-Entry  has len:  20
Romeo Must Die  has len:  14
Gone in 60 Seconds  has len:  18
Battlefield Earth  has len:  17
The Cell  has len:  8
Mission to Mars  has len:  15
Russkies  has len:  8
They Rode West  has len:  14
The Wrong Man  has len:  13
Sister Sister  has len:  13
Ride with the Devil  has len:  19
Hollow Man  has len:  10
Deadly Outbreak  has len:  15
Frequency  has len:  9
Bionic Ever After?  has len:  18
The Art of War  has len:  14
If You Believe  has len:  14
City Beneath the Sea  has len:  20
Redhead From Wyoming  has len:  20
The 6th Day  has len:  11
Lost Souls  has len:  10
Red River Range  has len:  15
Snow Job  has len:  8
The Killer Elite  has len:  16
The President's Man  has len:  19
The Bandit Queen  has len:  16
Dracula 2000  has len:  12
Proof of Life  has len:  13
Seminole Uprising  has len:  17
Wind River  has len:  10
No Way Back  has len:  11
The Newton Boys  has len:  15
Kiwi Safari  has len:  11
The Faculty  has len:  11
Hired to Kill  has l

Crack In the World  has len:  18
Panic in Year Zero!  has len:  19
Serenity  has len:  8
The Proud Rebel  has len:  15
Foxfire  has len:  7
The Hunting Party  has len:  17
Saw II  has len:  6
Sniper 2  has len:  8
A Lust to Kill  has len:  14
Best of the Best 2  has len:  18
Seed of Chucky  has len:  14
Oklahoma!  has len:  9
The Real Glory  has len:  14
Nevada Smith  has len:  12
Bird of Paradise  has len:  16
Son of Dracula  has len:  14
Dallas  has len:  6
Aeon Flux  has len:  9
Hostel  has len:  6
The Maze  has len:  8
Santa Fe Stampede  has len:  17
Three Texas Steers  has len:  18
Wyoming Outlaw  has len:  14
Comanche Territory  has len:  18
Lone Hand  has len:  9
War Arrow  has len:  9
Johnny Dark  has len:  11
Reprisal!  has len:  9
The Hard Man  has len:  12
The Desperadoes  has len:  15
Plunder of the Sun  has len:  18
You'll Find Out  has len:  15
Eight Below  has len:  11
Not of This Earth  has len:  17
The Wasp Woman  has len:  14
Last Woman on Earth  has len:  19
Macabre 

Prairie Fever   has len:  14
Red Sands   has len:  10
Lies & Illusions  has len:  16
Legion  has len:  6
Half Past Dead 2   has len:  17
Sorority Row  has len:  12
Diary of the Dead   has len:  18
Ninja Assassin  has len:  14
24: Redemption  has len:  14
G-Force   has len:  8
Hollow Man II   has len:  14
Takers  has len:  6
Webs   has len:  5
Buried Alive   has len:  13
Hatchet  has len:  7
Living hell  has len:  11
Paranormal Activity  has len:  19
Black Dynamite  has len:  14
The Ministers   has len:  14
More Than Murder   has len:  17
Bubba Ho-tep  has len:  12
Ben  has len:  3
The Way of War  has len:  14
Bats: Human Harvest  has len:  19
Highlander: Endgame  has len:  19
Season of the Witch  has len:  19
Command Performance  has len:  19
Killers  has len:  7
Ring of Death  has len:  13
Sand Serpents   has len:  14
Street Warrior  has len:  14
Bangkok Dangerous   has len:  18
Beastly  has len:  7
The Net 2.0  has len:  11
Call of the Wild   has len:  17
Knight & Day   has len:  13


The Cursed  has len:  10
13 Sins  has len:  7
Fingerprints  has len:  12
Peter Pan  has len:  9
Elephant White  has len:  14
Tape 407  has len:  8
The Zero Theorem  has len:  16
Succubus: Hell-Bent  has len:  19
Europa Report  has len:  13
Lost Signal  has len:  11
All Is Lost  has len:  11
Rogue River  has len:  11
Skinned Deep  has len:  12
Alien Origin  has len:  12
The Conjuring  has len:  13
sxtape  has len:  6
The Package  has len:  11
The Purge  has len:  9
Bad Milo  has len:  8
Horns  has len:  5
Homefront  has len:  9
Cold Harvest  has len:  12
Cobra Woman  has len:  11
Sherlock Holmes  has len:  15
Non-Stop  has len:  8
Yankee Buccaneer  has len:  16
Big Ass Spider  has len:  14
The Green Inferno  has len:  17
The Last Keepers  has len:  16
Swamp Shark  has len:  11
Jug Face  has len:  8
Mercy  has len:  5
Ride Along  has len:  10
Clash of the Empires  has len:  20
Mr. Jones  has len:  9
The Human Race  has len:  14
Found  has len:  5
Alyce  has len:  5
Ink  has len:  3
Absol

Metal Tornado  has len:  13
Martian Land  has len:  12
Taken Heart  has len:  11
Cold Zone  has len:  9
Life  has len:  4
Independents' Day  has len:  17
Future World  has len:  12
#Screamers  has len:  10
It Comes at Night  has len:  17
A Wrinkle in Time  has len:  17
Death Race 2050  has len:  15
Replicas  has len:  8
It  has len:  2
The Commuter  has len:  12
Alita: Battle Angel  has len:  19
Hostiles  has len:  8
Bushwick  has len:  8
XX  has len:  2
Ozark Sharks  has len:  12
Mom and Dad  has len:  11
Cross 2  has len:  7
Coin Heist  has len:  10
Io  has len:  2
Don't Knock Twice  has len:  17
Road Wars  has len:  9
Planet of the Sharks  has len:  20
Ice Sharks  has len:  10
Ghosthunters  has len:  12
Altitude  has len:  8
Descendants 2  has len:  13
The Archer  has len:  10
The Burning Dead  has len:  16
Wish Upon  has len:  9
American Warships  has len:  17
Hidden Agenda  has len:  13
Bright  has len:  6
Mayhem  has len:  6
The Endless  has len:  11
Eloise  has len:  6
Ready Pla

Rest Stop  has len:  9
Freaky  has len:  6
The Bunny Game  has len:  14
The Lie  has len:  7
Nocturne  has len:  8
Black Box  has len:  9
Evil Eye  has len:  8
The Tomorrow War  has len:  16
The Craft: Legacy  has len:  17
Books of Blood  has len:  14
Archenemy  has len:  9
Jiu Jitsu  has len:  9
Spell  has len:  5
The Old Ways  has len:  12
Uncharted  has len:  9
Willy's Wonderland  has len:  18
Godmothered  has len:  11
We Can Be Heroes  has len:  16
Scream 5  has len:  8
I eat your skin  has len:  15
Born a Champion  has len:  15
Echo Boomers  has len:  12
Army of One  has len:  11
The Suicide Squad  has len:  17
Flora & Ulysses  has len:  15
How It Ends  has len:  11
Saving Flora  has len:  12
Deep Blue Sea 3  has len:  15
Finding Ohana  has len:  13
Army of the Dead  has len:  16
Bliss  has len:  5
The Manor  has len:  9
Black as Night  has len:  14
Madres  has len:  6
Things Heard & Seen  has len:  19
The Dark Red  has len:  12
Virus Shark  has len:  11
Cosmic Sin  has len:  10
T

In [3]:
f_data = pd.DataFrame(filtered_data)
 
# saving the DataFrame as a CSV file
f_data.to_csv('filtered.csv', index = False)

Split the filtered data into 80% train, 10% validation, and 10% test.
You will only need the title and genre columns.

In [4]:
filtered_data = pd.read_csv('filtered.csv')
  
x = filtered_data['title']
y = filtered_data['genre']
  
#using the train test split function to split the filtered data 
# 80% train data
x_train, x_others, y_train, y_others = sklearn.model_selection.train_test_split(x, y, random_state = 0, train_size=0.8, shuffle=True)
# 10% validation data, 10% test data
x_validation, x_test, y_validation, y_test = sklearn.model_selection.train_test_split(x_others, y_others, random_state = 0, train_size=0.5, shuffle=True)

#converting the data into dictionaries
dict_train = {'title': x_train, 'genre': y_train}
dict_validation = {'title': x_validation, 'genre': y_validation}
dict_test = {'title': x_test, 'genre': y_test}

#converting dictionaries into dataframes
train = pd.DataFrame(dict_train)
validation = pd.DataFrame(dict_validation)
test = pd.DataFrame(dict_test)

In [5]:
#train, validation, test = np.split(f_data.sample(frac=1), [int(.8 * len(f_data)), int(.9 * len(f_data))])

In [6]:
# saving the DataFrame as a CSV file
train.to_csv('train.csv', index = False)

# saving the DataFrame as a CSV file
validation.to_csv('validation.csv', index = False)

# saving the DataFrame as a CSV file
test.to_csv('test.csv', index = False)

In [7]:
train

Unnamed: 0,title,genre
1310,The Spiral Road,Adventure
228,Posse,Western
37,Wyoming Mail,Adventure
1087,Scream,Horror
1357,Bionic Ever After?,Fantasy
...,...,...
835,Lionheart,Adventure
3264,Spell,Horror
1653,Hell Asylum,Horror
2607,Cell,Action


In [8]:
validation

Unnamed: 0,title,genre
570,The Driver,Action
3349,Studio 666,Horror
333,Moon of the Wolf,Horror
2784,All Hallows' Eve,Horror
1432,Knightriders,Adventure
...,...,...
3333,Flight World War II,Fantasy
1457,Wild in the Street,Fantasy
654,Tennessee's Partner,Western
1696,The Ape Man,Horror


In [9]:
test

Unnamed: 0,title,genre
1867,Bullfighter,Western
2305,Quantum Apocalypse,Fantasy
304,Miami Supercops,Adventure
2174,Minutemen,Adventure
840,Remote Control,Fantasy
...,...,...
1283,The Illustrated Man,Fantasy
2968,Bethany,Horror
1465,Minority Report,Fantasy
3321,Bingo Hell,Horror


From your processed data set, display:

* the amount of movies in each genre and split
* 5 examples of movie titles from each genre and split

In [17]:
#Train split
lines_train = open('train.csv', encoding="utf-8")
columns_train = next(lines_train).strip().split(',')

action_train = 0
horror_train = 0
fantasy_train = 0
western_train = 0
adventure_train = 0

action_titles_train = []
horror_titles_train = []
fantasy_titles_train = []
western_titles_train = []
adventure_titles_train = []

for i in lines_train:
    x = i.strip().split(',')
    
    if (x[columns_train.index('genre')] == 'Action'):
        action_train = action_train + 1
        if action_train < 6:
            action_titles_train.append(x[columns_train.index('title')])
            
    elif (x[columns_train.index('genre')] == 'Horror'):
        horror_train = horror_train + 1
        if horror_train < 6:
            horror_titles_train.append(x[columns_train.index('title')])
            
    elif (x[columns_train.index('genre')] == 'Fantasy'): 
        fantasy_train = fantasy_train + 1
        if fantasy_train < 6:
            fantasy_titles_train.append(x[columns_train.index('title')])
            
    elif (x[columns_train.index('genre')] == 'Western'):
        western_train = western_train + 1
        if western_train < 6:
            western_titles_train.append(x[columns_train.index('title')])
            
    elif (x[columns_train.index('genre')] == 'Adventure'):
        adventure_train = adventure_train + 1
        if adventure_train < 6:
            adventure_titles_train.append(x[columns_train.index('title')])

print("Number of movies in Train split:")
print("Action movies: ", action_train)
print("Horror movies: ", horror_train)
print("Fantasy movies: ", fantasy_train)
print("Western movies: ", western_train)
print("Adventure movies: ", adventure_train)

print("\n5 examples of movies in Train split:")
print("Action movies: ", action_titles_train)
print("Horror movies: ", horror_titles_train)
print("Fantasy movies: ", fantasy_titles_train)
print("Western movies: ", western_titles_train)
print("Adventure movies: ", adventure_titles_train)

Number of movies in Train split:
Action movies:  733
Horror movies:  761
Fantasy movies:  429
Western movies:  423
Adventure movies:  378

5 examples of movies in Train split:
Action movies:  ['Machete Kills', 'End of Watch', 'Red Notice', 'Top of the World', 'The Net 2.0']
Horror movies:  ['Scream', 'Midsommar', 'Becky', 'The Invitation', 'Picture Mommy Dead']
Fantasy movies:  ['Bionic Ever After?', 'John Carter', 'Marooned', '400 Days', 'Virtuosity']
Western movies:  ['Posse', 'The Big Country', 'Cheyenne Warrior', 'Love Takes Wing', 'Major Dundee']
Adventure movies:  ['The Spiral Road', 'Wyoming Mail', 'Bingo', 'Watusi', 'On Deadly Ground']


In [18]:
#Validation split
lines_validation = open('validation.csv', encoding="utf-8")
columns_validation = next(lines_validation).strip().split(',')

action_validation = 0
horror_validation = 0
fantasy_validation = 0
western_validation = 0
adventure_validation = 0

action_titles_validation = []
horror_titles_validation = []
fantasy_titles_validation = []
western_titles_validation = []
adventure_titles_validation = []

for i in lines_validation:
    x = i.strip().split(',')
    
    if (x[columns_validation.index('genre')] == 'Action'):
        action_validation = action_validation + 1
        if action_validation < 6:
            action_titles_validation.append(x[columns_validation.index('title')])
            
    elif (x[columns_validation.index('genre')] == 'Horror'):
        horror_validation = horror_validation + 1
        if horror_validation < 6:
            horror_titles_validation.append(x[columns_validation.index('title')])
            
    elif (x[columns_validation.index('genre')] == 'Fantasy'): 
        fantasy_validation = fantasy_validation + 1
        if fantasy_validation < 6:
            fantasy_titles_validation.append(x[columns_validation.index('title')])
            
    elif (x[columns_validation.index('genre')] == 'Western'):
        western_validation = western_validation + 1
        if western_validation < 6:
            western_titles_validation.append(x[columns_validation.index('title')])
            
    elif (x[columns_validation.index('genre')] == 'Adventure'):
        adventure_validation = adventure_validation + 1
        if adventure_validation < 6:
            adventure_titles_validation.append(x[columns_validation.index('title')])

print("\nNumber of movies in Validation split:")
print("Action movies: ", action_validation)
print("Horror movies: ", horror_validation)
print("Fantasy movies: ", fantasy_validation)
print("Western movies: ", western_validation)
print("Adventure movies: ", adventure_validation)
            
print("\n5 examples of movies in Validation split:")
print("Action movies: ", action_titles_validation)
print("Horror movies: ", horror_titles_validation)
print("Fantasy movies: ", fantasy_titles_validation)
print("Western movies: ", western_titles_validation)
print("Adventure movies: ", adventure_titles_validation)


Number of movies in Validation split:
Action movies:  85
Horror movies:  81
Fantasy movies:  66
Western movies:  59
Adventure movies:  49

5 examples of movies in Validation split:
Action movies:  ['The Driver', 'Real Steel', 'Extreme Prejudice', 'John Wick', 'On Wings of Eagles']
Horror movies:  ['Studio 666', 'Moon of the Wolf', "All Hallows' Eve", 'Nazi Overlord', 'Evil Dead 2']
Fantasy movies:  ['The New Mutants', 'Battlestar Galactica', 'Sherlock Holmes', 'The Zero Theorem', 'Invaders From Mars']
Western movies:  ['Big Jake', 'Journey to Shiloh', 'Red River Range', "Love's Long Journey ", 'Cannon for Cordoba']
Adventure movies:  ['Knightriders', 'Plunder of the Sun', 'Challenge to Lassie', 'Mutiny', 'Hellfighters']


In [None]:
#Test split

lines_test = open('test.csv', encoding="utf-8")
columns_test = next(lines_test).strip().split(',')

action_test = 0
horror_test = 0
fantasy_test = 0
western_test = 0
adventure_test = 0

action_titles_test = []
horror_titles_test = []
fantasy_titles_test = []
western_titles_test = []
adventure_titles_test = []

for i in lines_test:
    x = i.strip().split(',')
    
    if (x[columns_test.index('genre')] == 'Action'):
        action_test = action_test + 1
        if action_test < 6:
            action_titles_test.append(x[columns_test.index('title')])
            
    elif (x[columns_test.index('genre')] == 'Horror'):
        horror_test = horror_test + 1
        if horror_test < 6:
            horror_titles_test.append(x[columns_test.index('title')])
            
    elif (x[columns_test.index('genre')] == 'Fantasy'): 
        fantasy_test = fantasy_test + 1
        if fantasy_test < 6:
            fantasy_titles_test.append(x[columns_test.index('title')])
            
    elif (x[columns_test.index('genre')] == 'Western'):
        western_test = western_test + 1
        if western_test < 6:
            western_titles_test.append(x[columns_test.index('title')])
            
    elif (x[columns_test.index('genre')] == 'Adventure'):
        adventure_test = adventure_test + 1
        if adventure_test < 6:
            adventure_titles_test.append(x[columns_test.index('title')])

print("\nNumber of movies in Test split:")
print("Action movies: ", action_test)
print("Horror movies: ", horror_test)
print("Fantasy movies: ", fantasy_test)
print("Western movies: ", western_test)
print("Adventure movies: ", adventure_test)

print("\n5 examples of movies in Test split:")
print("Action movies: ", action_titles_test)
print("Horror movies: ", horror_titles_test)
print("Fantasy movies: ", fantasy_titles_test)
print("Western movies: ", western_titles_test)
print("Adventure movies: ", adventure_titles_test)

## Title classification (25%)

Your first task is to prove that a neural network can identify the genre of a movie based on its title.

You will note that many titles are just a single word or two words long so you need to work at the character level instead of the word level, that is, a token would be a single character, including punctuation marks and spaces.
You must also lowercase the titles.
Preprocess the data sets, create a neural network, and train it to classify the movie titles into their genre.
Plot a graph of the **accuracy** of the model on the train and validation sets after each epoch.

In [19]:
#reading the data with lowercase titles
train_lowercase = pd.read_csv("train.csv",converters={"title": lambda x: x.lower()})
validation_lowercase = pd.read_csv("validation.csv",converters={"title": lambda x: x.lower()})
test_lowercase = pd.read_csv("test.csv",converters={"title": lambda x: x.lower()})

# print(train_lowercase)
# print(validation_lowercase)
# print(test_lowercase)

In [20]:
#tokenising the titles into characters
train_lowercase_title = [list(title) for title in train_lowercase['title']]
validation_lowercase_title = [list(title) for title in validation_lowercase['title']]
test_lowercase_title = [list(title) for title in test_lowercase['title']]

#print(train_lowercase_title)
print(validation_lowercase_title)
#print(test_lowercase_title)

[['t', 'h', 'e', ' ', 'd', 'r', 'i', 'v', 'e', 'r'], ['s', 't', 'u', 'd', 'i', 'o', ' ', '6', '6', '6'], ['m', 'o', 'o', 'n', ' ', 'o', 'f', ' ', 't', 'h', 'e', ' ', 'w', 'o', 'l', 'f'], ['a', 'l', 'l', ' ', 'h', 'a', 'l', 'l', 'o', 'w', 's', "'", ' ', 'e', 'v', 'e'], ['k', 'n', 'i', 'g', 'h', 't', 'r', 'i', 'd', 'e', 'r', 's'], ['b', 'i', 'g', ' ', 'j', 'a', 'k', 'e'], ['p', 'l', 'u', 'n', 'd', 'e', 'r', ' ', 'o', 'f', ' ', 't', 'h', 'e', ' ', 's', 'u', 'n'], ['c', 'h', 'a', 'l', 'l', 'e', 'n', 'g', 'e', ' ', 't', 'o', ' ', 'l', 'a', 's', 's', 'i', 'e'], ['m', 'u', 't', 'i', 'n', 'y'], ['t', 'h', 'e', ' ', 'n', 'e', 'w', ' ', 'm', 'u', 't', 'a', 'n', 't', 's'], ['j', 'o', 'u', 'r', 'n', 'e', 'y', ' ', 't', 'o', ' ', 's', 'h', 'i', 'l', 'o', 'h'], ['r', 'e', 'a', 'l', ' ', 's', 't', 'e', 'e', 'l'], ['r', 'e', 'd', ' ', 'r', 'i', 'v', 'e', 'r', ' ', 'r', 'a', 'n', 'g', 'e'], ['l', 'o', 'v', 'e', "'", 's', ' ', 'l', 'o', 'n', 'g', ' ', 'j', 'o', 'u', 'r', 'n', 'e', 'y', ' '], ['n', 'a', 

In [45]:
#converting the data into dictionaries
dict_train = {'title': list(train_lowercase_title), 'genre': train_lowercase['genre']}
dict_validation = {'title': validation_lowercase_title, 'genre': validation_lowercase['genre']}
dict_test = {'title': test_lowercase_title, 'genre': test_lowercase['genre']}

#converting dictionaries into dataframes
train_lc = pd.DataFrame(dict_train)
validation_lc = pd.DataFrame(dict_validation)
test_lc = pd.DataFrame(dict_test)

# # saving the DataFrame as a CSV file
# train_lc.to_csv('train_characters.csv', index = False)
# validation_lc.to_csv('validation_characters.csv', index = False)
# test_lc.to_csv('test_characters.csv', index = False)

In [46]:
# train_lc = pd.read_csv("train_characters.csv", sep = ',')  
# train_lc = pd.DataFrame(train_lc)

In [48]:
x_train_lc = list(train_lc['title'])
x_validation_lc = list(validation_lc['title'])
x_test_lc = list(test_lc['title'])

#getting vocab
vocab = ['<PAD>'] + sorted({token for sent in x_train_lc for token in sent})
print('vocab:', vocab)

val_tokens = sorted({token for sent in x_validation_lc for token in sent})
for token in val_tokens:
    if token not in vocab:
        vocab.append(token)
        vocab = sorted(vocab)
        
test_tokens = sorted({token for sent in x_test_lc for token in sent})
for token in test_tokens:
    if token not in vocab:
        vocab.append(token)
        vocab = sorted(vocab)


text_lens = torch.tensor([len(sent) for sent in x_train_lc], dtype=torch.int64, device=device)
print('text_lens:', text_lens, ' with len: ',len(text_lens))

max_len = int(max(text_lens))
print('max_len:', max_len)

padded_train_x = [sent + ['<PAD>']*(max_len - len(sent)) for sent in x_train_lc]
#print(padded_train_x)

indexed_train_x = torch.tensor([[vocab.index(token) for token in sent] for sent in padded_train_x], dtype=torch.int64, device=device)
print('indexed_train_x:')
print(indexed_train_x)


# for validation set

text_lens_validation = torch.tensor([len(sent) for sent in x_validation_lc], dtype=torch.int64, device=device)
print('text_lens_validation:', text_lens_validation, ' with len: ',len(text_lens_validation))

max_len_validation = int(max(text_lens_validation))
print('max_len_validation:', max_len_validation)

padded_validation_x = [sent + ['<PAD>']*(max_len_validation - len(sent)) for sent in x_validation_lc]

indexed_validation_x = torch.tensor([[vocab.index(token) for token in sent] for sent in padded_validation_x], dtype=torch.int64, device=device)
print('indexed_validation_x:', indexed_validation_x)


# for test set

text_lens_test = torch.tensor([len(sent) for sent in x_test_lc], dtype=torch.int64, device=device)
print('text_lens_test:', text_lens_test, ' with len: ',len(text_lens_test))

max_len_test = int(max(text_lens_test))
print('max_len_test:', max_len_test)

padded_test_x = [sent + ['<PAD>']*(max_len_test - len(sent)) for sent in x_test_lc]

indexed_test_x = torch.tensor([[vocab.index(token) for token in sent] for sent in padded_test_x], dtype=torch.int64, device=device)
print('indexed_test_x:', indexed_test_x)

vocab: ['<PAD>', ' ', '!', '#', '&', "'", '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', '?', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'á', 'é', 'ñ']
text_lens: tensor([15,  5, 12,  ..., 11,  4, 14])  with len:  2724
max_len: 20
indexed_train_x:
tensor([[40, 28, 25,  ...,  0,  0,  0],
        [36, 35, 39,  ...,  0,  0,  0],
        [43, 45, 35,  ...,  0,  0,  0],
        ...,
        [28, 25, 32,  ...,  0,  0,  0],
        [23, 25, 32,  ...,  0,  0,  0],
        [28, 25, 25,  ...,  0,  0,  0]])
text_lens_validation: tensor([10, 10, 16, 16, 12,  8, 18, 19,  6, 15, 17, 10, 15, 20, 13, 17,  9, 12,
        18, 17, 15, 20, 20, 18, 11,  6, 18, 12, 17, 16, 12, 12, 17, 16, 19, 12,
        11, 14,  7, 15, 11,  6,  6,  9, 16, 16, 14, 18, 20, 12, 13, 18, 17,  9,
        16, 10,  6,  8, 18, 11, 15, 12, 15,  8, 11,  8,  8, 10, 16, 10, 10,  6,
         9, 14, 17, 11, 13, 18,  8, 10, 14, 18

In [None]:
y_train_lc = train_lc['genre']
y_validation_lc = validation_lc['genre']
y_test_lc = test_lc['genre']

categories = ['Action', 'Horror', 'Fantasy', 'Western', 'Adventure']
cat2idx = {cat: i for (i, cat) in enumerate(categories)}
#print(cat2idx)

train_y_idx = list(y_train_lc.map(cat2idx.get))
validation_y_idx = list(y_validation_lc.map(cat2idx.get))
test_y_idx = list(y_test_lc.map(cat2idx.get))
print(len(train_y_idx))
print(len(validation_y_idx))
print(len(test_y_idx))

#print(tensor_x.size())
#print(tensor_t.size())

In [None]:
# https://www.datasciencelearner.com/convert-list-to-2d-array-python-methods/
# converting list to array to match tensor size
def covnvert_list_2d(list1,rows, columns):    
    result=[]               
    start = 0
    end = columns
    for i in range(rows): 
        result.append(list1[start:end])
        start +=columns
        end += columns
    return result

#converting y sets to 2d array to match x tensor sets size

train_y_idx = covnvert_list_2d(train_y_idx,len(train_y_idx),1)
tensor_t = torch.tensor(train_y_idx, dtype=torch.float32, device=device)

validation_y_idx = covnvert_list_2d(validation_y_idx,len(validation_y_idx),1)
tensor_t_validation = torch.tensor(validation_y_idx, dtype=torch.float32, device=device)

test_y_idx = covnvert_list_2d(test_y_idx,len(test_y_idx),1)
tensor_t_test = torch.tensor(test_y_idx, dtype=torch.float32, device=device)

print(tensor_t.size())
print(tensor_t_validation.size())
print(tensor_t_test.size())

In [None]:
class Model(torch.nn.Module):

    def __init__(self, vocab_size, embedding_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        
        self.embedding_matrix = torch.nn.Embedding(vocab_size, embedding_size)
        self.rnn_s0 = torch.nn.Parameter(torch.zeros((hidden_size,), dtype=torch.float32))
        self.rnn_cell = torch.nn.Linear(hidden_size + embedding_size, hidden_size)
        self.output_layer = torch.nn.Linear(hidden_size, 1)
        
    def forward(self, x, text_lens):
        batch_size = x.shape[0]
        time_steps = x.shape[1]

        embedded = self.embedding_matrix(x)
        state = self.rnn_s0.unsqueeze(0).tile((batch_size, 1))
        for t in range(time_steps):
            mask = (t < text_lens).unsqueeze(1).tile((1, self.hidden_size))
            next_state = torch.nn.functional.leaky_relu(self.rnn_cell(torch.cat((state, embedded[:, t, :]), dim=1)))
            state = torch.where(mask, next_state, state)
        return self.output_layer(state)

model = Model(len(vocab), embedding_size=2, hidden_size=3)
model.to(device)

optimiser = torch.optim.Adam(model.parameters())

print("Training Set: ")
accuracy_arr = []
for epoch in range(1, 101):
    optimiser.zero_grad()
    output = model(indexed_train_x, text_lens)
    error = torch.nn.functional.binary_cross_entropy_with_logits(output, tensor_t)
    
    outputs = torch.round(torch.sigmoid(output))
    accuracy = (tensor_t == outputs).type(torch.float32).mean()
    
    accuracy_arr.append(accuracy.detach().tolist())
    error.backward()
    
    optimiser.step()
    
    if epoch%10 == 0:
        print('Epoch:', epoch, ', Accuracy: {:.2%}'.format(accuracy))

#print(accuracy_arr)
print()

(fig, ax) = plt.subplots(1, 1)
ax.set_xlabel('Epoch')
ax.set_ylabel('Accuracy')
ax.set_title("Training Set")
ax.plot(range(1, len(accuracy_arr) + 1), accuracy_arr, color='blue', linestyle='-', linewidth=3)
ax.grid()

print("Validation Set: ")
accuracy_arr_validation = []
for epoch in range(1, 101):
    optimiser.zero_grad()
    output = model(indexed_validation_x, text_lens_validation)
    error = torch.nn.functional.binary_cross_entropy_with_logits(output, tensor_t_validation)
    
    outputs = torch.round(torch.sigmoid(output))
    accuracy = (tensor_t_validation == outputs).type(torch.float32).mean()
    
    accuracy_arr_validation.append(accuracy.detach().tolist())
    error.backward()
    
    optimiser.step()
    
    if epoch%10 == 0:
        print('Epoch:', epoch, ', Accuracy: {:.2%}'.format(accuracy))

#print(accuracy_arr_validation)
print()

(fig, ax) = plt.subplots(1, 1)
ax.set_xlabel('Epoch')
ax.set_ylabel('Accuracy')
ax.set_title("Validation Set")
ax.plot(range(1, len(accuracy_arr_validation) + 1), accuracy_arr_validation, color='blue', linestyle='-', linewidth=3)
ax.grid()

Measure the F1 score performance of the model when applied on the test set.
Also plot a confusion matrix showing how often each genre is mistaken as another genre.

In [None]:
# model = Model(len(vocab), embedding_size=2, hidden_size=3)
# model.to(device)

# optimiser = torch.optim.Adam(model.parameters())

# print("Training Set: ")
# accuracy_arr = []
# for epoch in range(1, 101):
#     optimiser.zero_grad()
#     output = model(indexed_train_x, text_lens)
#     error = torch.nn.functional.binary_cross_entropy_with_logits(output, tensor_t)
    
#     outputs = torch.round(torch.sigmoid(output))
#     accuracy = (tensor_t == outputs).type(torch.float32).mean()
    
#     accuracy_arr.append(accuracy.detach().tolist())
#     error.backward()
    
#     optimiser.step()
    
#     if epoch%10 == 0:
#         print('Epoch:', epoch, ', Accuracy: {:.2%}'.format(accuracy))



In [None]:
# with torch.no_grad():
#     print('sent', 'prediction')
#     outputs = torch.sigmoid(model(indexed_train_x, text_lens))
#     for (sent, text_len, output) in zip(padded_train_x, text_lens.tolist(), outputs):
#         print(sent, output.numpy()[:text_len].round(1).tolist())

In [None]:
# model = Model(len(vocab), embedding_size=2, hidden_size=3)
# model.to(device)

# optimiser = torch.optim.Adam(model.parameters())

# print("Test Set: ")
# accuracy_arr_test = []
# for epoch in range(1, 10):
#     optimiser.zero_grad()
#     output = model(indexed_test_x, text_lens_test)
#     #print(output)
    
#     error = torch.nn.functional.binary_cross_entropy_with_logits(output, tensor_t_test, reduction='none')
    
# #     perplexity  = torch.exp(error.detach())
# #     print('error:', error, 'PP:', perplexity)
#     #print(error.detach())
#     mask = torch.zeros((batch_size, time_steps, 1), dtype=torch.bool, device=device)
#     for i in range(batch_size):
#         for j in range(time_steps):
#             if j >= text_lens[i]:
#                 mask[i, j, :] = 1
    
#     outputs = torch.round(torch.sigmoid(output))
#     accuracy = (tensor_t_test == outputs).type(torch.float32).mean()
    
#     accuracy_arr_test.append(accuracy.detach().tolist())
#     error.backward()
    
#     optimiser.step()
    
#     if epoch%10 == 0:
#         print('Epoch:', epoch, ', Accuracy: {:.2%}'.format(accuracy))

        

        
# print(accuracy_arr_test)

# print(np.argmax(accuracy_arr_test))

In [None]:
#use argmax for calculating the accuracy and then count the number of correct predictions.



# # output_probs = torch.sigmoid(model(indexed_test_x, text_lens_test))
# # #print(output_probs)
# # # = torch.sigmoid(model(tensor_x))[:, 0]
# # outputs = torch.round(output_probs) # Round the probabilities into discrete classes (0 or 1).

# # accuracy = (tensor_t_test == outputs).type(torch.float32).mean()
# # print('accuracy: {:.2%}'.format(accuracy))


# output = model(indexed_test_x, text_lens_test)
# print(output)
# # error = torch.nn.functional.binary_cross_entropy_with_logits(output, tensor_t_validation)

# # outputs = torch.round(torch.sigmoid(output))
# # accuracy = (tensor_t_validation == outputs).type(torch.float32).mean()

# # accuracy_arr_validation.append(accuracy.detach().tolist())

In [None]:
# model = Model(len(vocab), embedding_size=2, hidden_size=3)
# model.to(device)

# optimiser = torch.optim.Adam(model.parameters())
# with torch.no_grad():
#     for (name, sentiment) in [('positive', [4]), ('negative', [5])]:
#         print(name)
#         sentiment = torch.tensor([sentiment], dtype=torch.float32)
#         prefix_indexes = [vocab.index('<PAD>')]
#         max_words = 10
#         for _ in range(max_words):
#             prefix_tensor = torch.tensor([prefix_indexes], dtype=torch.int64, device=device)
#             outputs = (model(sentiment, prefix_tensor))
#             word_probs = outputs[0, -1, :].tolist()
#             next_word_index = random.choices(range(len(vocab)), word_probs)[0]
#             if next_word_index == vocab.index('<PAD>'):
#                 break
#             prefix_indexes.append(next_word_index)
#         sent = [vocab[index] for index in prefix_indexes[1:]]
#         print(sent)
#         print()

## Title generation (25%)

Now that you've proven that titles and genre are related, make a model that can generate a title given a genre.

Again, you need to generate tokens at the character level instead of the word level and the titles must be lowercased.
Preprocess the data sets, create a neural network, and train it to generate the movie titles given their genre.
Plot a graph of the **perplexity** of the model on the train and validation sets after each epoch.

In [None]:
#reading the data with lowercase titles
train_lowercase = pd.read_csv("train.csv",converters={"title": lambda x: x.lower()})
validation_lowercase = pd.read_csv("validation.csv",converters={"title": lambda x: x.lower()})
test_lowercase = pd.read_csv("test.csv",converters={"title": lambda x: x.lower()})

#tokenising the titles into characters
train_lowercase_title = [list(title) for title in train_lowercase['title']]
validation_lowercase_title = [list(title) for title in validation_lowercase['title']]
test_lowercase_title = [list(title) for title in test_lowercase['title']]

#converting the data into dictionaries
dict_train = {'title': list(train_lowercase_title), 'genre': train_lowercase['genre']}
dict_validation = {'title': validation_lowercase_title, 'genre': validation_lowercase['genre']}
dict_test = {'title': test_lowercase_title, 'genre': test_lowercase['genre']}

#converting dictionaries into dataframes
train_lc = pd.DataFrame(dict_train)
validation_lc = pd.DataFrame(dict_validation)
test_lc = pd.DataFrame(dict_test)

In [50]:
x_train_lc = list(train_lc['title'])
x_validation_lc = list(validation_lc['title'])
x_test_lc = list(test_lc['title'])

# getting the vocab

vocab = ['<PAD>','<EDGE>'] + sorted({token for sent in x_train_lc for token in sent})
print('vocab:', vocab)

val_tokens = sorted({token for sent in x_validation_lc for token in sent})
for token in val_tokens:
    if token not in vocab:
        vocab.append(token)
        vocab = sorted(vocab)
        
test_tokens = sorted({token for sent in x_test_lc for token in sent})
for token in test_tokens:
    if token not in vocab:
        vocab.append(token)
        vocab = sorted(vocab)


# for training set        

text_lens = torch.tensor([len(sent)+1 for sent in x_train_lc], dtype=torch.int64, device=device)
print('text_lens:', text_lens, ' with len: ',len(text_lens))

max_len = int(max(text_lens))
print('max_len:', max_len)

padded_train_x_start = [['<EDGE>'] + sent + ['<PAD>']*(max_len - len(sent)) for sent in x_train_lc]
padded_train_x_end = [sent + ['<EDGE>'] + ['<PAD>']*(max_len - len(sent)) for sent in x_train_lc]
#print(padded_train_x_start)
#print(padded_train_x_start)

indexed_train_x_start = torch.tensor([[vocab.index(token) for token in sent] for sent in padded_train_x_start], dtype=torch.int64, device=device)
indexed_train_x_end = torch.tensor([[vocab.index(token) for token in sent] for sent in padded_train_x_end], dtype=torch.int64, device=device)
print('indexed_train_x_start:')
print(indexed_train_x_start)
print('indexed_train_x_end:')
print(indexed_train_x_end)


# for validation set

text_lens_validation = torch.tensor([len(sent)+1 for sent in x_validation_lc], dtype=torch.int64, device=device)
print('text_lens_validation:', text_lens_validation, ' with len: ',len(text_lens_validation))

max_len_validation = int(max(text_lens_validation))
print('max_len_validation:', max_len_validation)

padded_validation_x_start = [['<EDGE>'] + sent + ['<PAD>']*(max_len_validation - len(sent)) for sent in x_validation_lc]
padded_validation_x_end = [sent + ['<EDGE>'] + ['<PAD>']*(max_len_validation - len(sent)) for sent in x_validation_lc]
#print(padded_validation_x_start)
#print(padded_validation_x_start)

indexed_validation_x_start = torch.tensor([[vocab.index(token) for token in sent] for sent in padded_validation_x_start], dtype=torch.int64, device=device)
indexed_validation_x_end = torch.tensor([[vocab.index(token) for token in sent] for sent in padded_validation_x_end], dtype=torch.int64, device=device)
print('indexed_validation_x_start:')
print(indexed_validation_x_start)
print('indexed_validation_x_end:')
print(indexed_validation_x_end)


# for test set

text_lens_test = torch.tensor([len(sent)+1 for sent in x_test_lc], dtype=torch.int64, device=device)
print('text_lens_test:', text_lens_test, ' with len: ',len(text_lens_test))

max_len_test = int(max(text_lens_test))
print('max_len_test:', max_len_test)

padded_test_x_start = [['<EDGE>'] + sent + ['<PAD>']*(max_len_test - len(sent)) for sent in x_test_lc]
padded_test_x_end = [sent + ['<EDGE>'] + ['<PAD>']*(max_len_test - len(sent)) for sent in x_test_lc]
#print(padded_test_x_start)
#print(padded_test_x_start)

indexed_test_x_start = torch.tensor([[vocab.index(token) for token in sent] for sent in padded_test_x_start], dtype=torch.int64, device=device)
indexed_test_x_end = torch.tensor([[vocab.index(token) for token in sent] for sent in padded_test_x_end], dtype=torch.int64, device=device)
print('indexed_test_x_start:')
print(indexed_test_x_start)
print('indexed_test_x_end:')
print(indexed_test_x_end)

vocab: ['<PAD>', '<EDGE>', ' ', '!', '#', '&', "'", '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', '?', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'á', 'é', 'ñ']
text_lens: tensor([15,  5, 12,  ..., 11,  4, 14])  with len:  2724
max_len: 20
indexed_train_x_start:
tensor([[ 1, 41, 29,  ...,  0,  0,  0],
        [ 1, 37, 36,  ...,  0,  0,  0],
        [ 1, 44, 46,  ...,  0,  0,  0],
        ...,
        [ 1, 29, 26,  ...,  0,  0,  0],
        [ 1, 24, 26,  ...,  0,  0,  0],
        [ 1, 29, 26,  ...,  0,  0,  0]])
indexed_train_x_end:
tensor([[41, 29, 26,  ...,  0,  0,  0],
        [37, 36, 40,  ...,  0,  0,  0],
        [44, 46, 36,  ...,  0,  0,  0],
        ...,
        [29, 26, 33,  ...,  0,  0,  0],
        [24, 26, 33,  ...,  0,  0,  0],
        [29, 26, 26,  ...,  0,  0,  0]])
text_lens_validation: tensor([10, 10, 16, 16, 12,  8, 18, 19,  6, 15, 17, 10, 15, 20, 13, 17,  9,

In [60]:
action = [0, 0, 0, 0, 1]
horror = [0, 0, 0, 1, 0]
fantasy = [0, 0, 1, 0, 0]
western = [0, 1, 0, 0, 0]
adventure = [1, 0, 0, 0, 0]

# for training set

sentiments_train = []

for i in train_lc['genre']:
    if i == 'Action':
        sentiments_train.append(action)
    elif i == 'Horror':
        sentiments_train.append(horror)
    elif i == 'Fantasy':
        sentiments_train.append(fantasy)
    elif i == 'Western':
        sentiments_train.append(western)
    elif i == 'Adventure':
        sentiments_train.append(adventure)


# for validation set        
        
sentiments_validation = []

for i in validation_lc['genre']:
    if i == 'Action':
        sentiments_validation.append(action)
    elif i == 'Horror':
        sentiments_validation.append(horror)
    elif i == 'Fantasy':
        sentiments_validation.append(fantasy)
    elif i == 'Western':
        sentiments_validation.append(western)
    elif i == 'Adventure':
        sentiments_validation.append(adventure)

        
# for test set        
        
sentiments_test = []

for i in test_lc['genre']:
    if i == 'Action':
        sentiments_test.append(action)
    elif i == 'Horror':
        sentiments_test.append(horror)
    elif i == 'Fantasy':
        sentiments_test.append(fantasy)
    elif i == 'Western':
        sentiments_test.append(western)
    elif i == 'Adventure':
        sentiments_test.append(adventure)

# print(train_lc)
# print(sentiments_train)
# print()
# print(sentiments_validation)
# print()
# print(sentiments_test)

sentiments_train_t = torch.tensor(sentiments_train, dtype=torch.float32, device=device)
print(sentiments_train_t.size())

torch.Size([2724, 5])


In [67]:
#LTSM
# class Model(torch.nn.Module):

#     def __init__(self, cond_size, vocab_size, embedding_size, hidden_size):
#         super().__init__()
#         self.embedding_layer = torch.nn.Embedding(vocab_size, embedding_size)
#         self.cond_layer = torch.nn.Linear(cond_size, hidden_size) # Conditioning vector will be projected into the state space.
#         self.rnn_c0 = torch.nn.Parameter(torch.zeros((hidden_size,), dtype=torch.float32))
#         self.rnn_cell = torch.nn.LSTMCell(embedding_size, hidden_size)
#         self.output_layer = torch.nn.Linear(hidden_size, vocab_size)

#     def forward(self, cond, x):
#         batch_size = x.shape[0]
#         time_steps = x.shape[1]
        
#         embedded = self.embedding_layer(x)
#         cond_state = self.cond_layer(cond)
        
#         state = cond_state # Conditioning vector is the initial state.
#         c = self.rnn_c0.unsqueeze(0).tile((batch_size, 1))
#         interm_states = []
#         for t in range(time_steps):
#             (state, c) = self.rnn_cell(embedded[:, t, :], (state, c))
#             interm_states.append(state)
#         interm_states = torch.stack(interm_states, dim=1)
#         return self.output_layer(interm_states)

# model = Model(sentiments_train_t.shape[1], len(vocab), embedding_size=2, hidden_size=2)
# model.to(device)

# optimiser = torch.optim.Adam(model.parameters())

# print('step', 'error')
# train_errors = []
# for step in range(1, 100+1):
#     batch_size = indexed_train_x_start.shape[0]
#     time_steps = indexed_train_x_start.shape[1]
#     mask = torch.zeros((batch_size, time_steps), dtype=torch.bool)
#     for i in range(batch_size):
#         for j in range(time_steps):
#             if j >= text_lens[i]:
#                 mask[i, j] = 1
    
#     optimiser.zero_grad()
#     output = model(sentiments_train_t, indexed_train_x_start)
#     errors = torch.nn.functional.cross_entropy(output.transpose(1, 2), indexed_train_x_end, reduction='none')
#     errors = torch.masked_fill(errors, mask, 0.0)
#     error = errors.sum()/text_lens.sum()
#     train_errors.append(error.detach().tolist())
#     error.backward()
#     optimiser.step()

#     if step%10 == 0:
#         print(step, train_errors[-1])


step error
10 3.975938081741333
20 3.961124897003174
30 3.946540355682373
40 3.932030200958252
50 3.917388677597046
60 3.902397394180298
70 3.886855363845825
80 3.870584011077881
90 3.853426218032837
100 3.8352441787719727


In [88]:
#pre-inject
class Model(torch.nn.Module):

    def __init__(self, cond_size, vocab_size, embedding_size, hidden_size):
        super().__init__()
        self.embedding_layer = torch.nn.Embedding(vocab_size, embedding_size)
        self.cond_layer = torch.nn.Linear(cond_size, embedding_size) # Conditioning vector will be projected into the embedding space.
        self.rnn_s0 = torch.nn.Parameter(torch.zeros((hidden_size,), dtype=torch.float32))
        self.rnn_c0 = torch.nn.Parameter(torch.zeros((hidden_size,), dtype=torch.float32))
        self.rnn_cell = torch.nn.LSTMCell(embedding_size, hidden_size)
        self.output_layer = torch.nn.Linear(hidden_size, vocab_size)

    def forward(self, cond, x):
        batch_size = x.shape[0]
        time_steps = x.shape[1]

        cond_word = self.cond_layer(cond)
        embedded = self.embedding_layer(x)
        embedded = torch.cat((cond_word.unsqueeze(1), embedded), dim=1) # Attach the conditioning vector as a first word.
        
        state = self.rnn_s0.unsqueeze(0).tile((batch_size, 1))
        c = self.rnn_c0.unsqueeze(0).tile((batch_size, 1))
        interm_states = []
        for t in range(time_steps + 1): # Don't forget the conditioning vector!
            (state, c) = self.rnn_cell(embedded[:, t, :], (state, c))
            interm_states.append(state)
        interm_states = torch.stack(interm_states, dim=1)
        interm_states = interm_states[:, 1:, :] # Drop the first state from each text.
        return self.output_layer(interm_states)

model = Model(sentiments_train_t.shape[1], len(vocab), embedding_size=2, hidden_size=2)
model.to(device)

optimiser = torch.optim.Adam(model.parameters())

print('step', 'error')
train_errors = []
for step in range(1, 100+1):
    batch_size = indexed_train_x_start.shape[0]
    time_steps = indexed_train_x_start.shape[1]
    mask = torch.zeros((batch_size, time_steps), dtype=torch.bool)
    for i in range(batch_size):
        for j in range(time_steps):
            if j >= text_lens[i]:
                mask[i, j] = 1
    
    optimiser.zero_grad()
    output = model(sentiments_train_t, indexed_train_x_start)
    errors = torch.nn.functional.cross_entropy(output.transpose(1, 2), indexed_train_x_end, reduction='none')
    errors = torch.masked_fill(errors, mask, 0.0)
    error = errors.sum()/text_lens.sum()
    train_errors.append(error.detach().tolist())
    error.backward()
    optimiser.step()

    if step%10 == 0:
        print(step, train_errors[-1])
        with torch.no_grad():
            outputs = torch.log_softmax(output, dim=2)
            total_word_log_prob = 0.0
            num_words = 0
            for (sent, text_len, log_probs, y) in zip(train_lc, text_lens, outputs, indexed_train_x_end):
                word_log_probs = log_probs[torch.arange(text_len), y[:text_len]]
                total_word_log_prob += word_log_probs.sum().tolist()
                num_words += text_len.tolist()
            print('PPLX:', np.exp(-1/num_words*total_word_log_prob))
# print()

step error
10 4.033138751983643
PPLX: 60.57325819744251
20 4.018068313598633
PPLX: 59.584358187583256
30 4.002994537353516
PPLX: 58.59992146977439
40 3.9876914024353027
PPLX: 57.60723605290041
50 3.971952438354492
PPLX: 56.595166065288105
60 3.95558762550354
PPLX: 55.55347642517525
70 3.9384167194366455
PPLX: 54.47141596769888
80 3.9202513694763184
PPLX: 53.33948912067969
90 3.900883913040161
PPLX: 52.150705349676656
100 3.880079984664917
PPLX: 50.89821635796951


In [87]:
# #par-inject
# class Model(torch.nn.Module):

#     def __init__(self, cond_size, vocab_size, embedding_size, hidden_size):
#         super().__init__()
#         self.embedding_layer = torch.nn.Embedding(vocab_size, embedding_size)
#         self.rnn_s0 = torch.nn.Parameter(torch.zeros((hidden_size,), dtype=torch.float32))
#         self.rnn_c0 = torch.nn.Parameter(torch.zeros((hidden_size,), dtype=torch.float32))
#         self.rnn_cell = torch.nn.LSTMCell(cond_size + embedding_size, hidden_size) # RNN must include the conditioning vector in its input.
#         self.output_layer = torch.nn.Linear(hidden_size, vocab_size)
        
#     def forward(self, cond, x):
#         batch_size = x.shape[0]
#         time_steps = x.shape[1]

#         cond_3d = cond.unsqueeze(1).tile((1, time_steps, 1)) # Replicate the same conditioning vector for every word.
#         embedded = self.embedding_layer(x)
#         embedded = torch.cat((cond_3d, embedded), dim=2) # Attach the replicated conditioning vector to the embedded words.

#         state = self.rnn_s0.unsqueeze(0).tile((batch_size, 1))
#         c = self.rnn_s0.unsqueeze(0).tile((batch_size, 1))
#         interm_states = []
#         for t in range(time_steps):
#             (state, c) = self.rnn_cell(embedded[:, t, :], (state, c))
#             interm_states.append(state)
#         interm_states = torch.stack(interm_states, dim=1)
#         return self.output_layer(interm_states)

# model = Model(sentiments_train_t.shape[1], len(vocab), embedding_size=2, hidden_size=2)
# model.to(device)

# optimiser = torch.optim.Adam(model.parameters())

# print('step', 'error')
# train_errors = []
# for step in range(1, 100+1):
#     batch_size = indexed_train_x_start.shape[0]
#     time_steps = indexed_train_x_start.shape[1]
#     mask = torch.zeros((batch_size, time_steps), dtype=torch.bool)
#     for i in range(batch_size):
#         for j in range(time_steps):
#             if j >= text_lens[i]:
#                 mask[i, j] = 1
    
#     optimiser.zero_grad()
#     output = model(sentiments_train_t, indexed_train_x_start)
#     errors = torch.nn.functional.cross_entropy(output.transpose(1, 2), indexed_train_x_end, reduction='none')
#     errors = torch.masked_fill(errors, mask, 0.0)
#     error = errors.sum()/text_lens.sum()
#     train_errors.append(error.detach().tolist())
#     error.backward()
#     optimiser.step()

#     if step%10 == 0:
#         print(step, train_errors[-1])
#         with torch.no_grad():
#             outputs = torch.log_softmax(output, dim=2)
#             total_word_log_prob = 0.0
#             num_words = 0
#             for (sent, text_len, log_probs, y) in zip(train_lc, text_lens, outputs, indexed_train_x_end):
#                 word_log_probs = log_probs[torch.arange(text_len), y[:text_len]]
#                 total_word_log_prob += word_log_probs.sum().tolist()
#                 num_words += text_len.tolist()
#             print('PPLX:', np.exp(-1/num_words*total_word_log_prob))
# print()

step error
10 4.093809127807617
PPLX: 56.889495249513864
20 4.074880599975586
PPLX: 55.742352326524674
30 4.056140899658203
PPLX: 54.645245783971816
40 4.037400245666504
PPLX: 53.586297523779805
50 4.018407821655273
PPLX: 52.55090716554985
60 3.9988746643066406
PPLX: 51.522996211755455
70 3.9785263538360596
PPLX: 50.485842153201304
80 3.957150459289551
PPLX: 49.42464471195519
90 3.934621810913086
PPLX: 48.32903839247401
100 3.910931348800659
PPLX: 47.19608262963879



In [86]:
#merge
class Model(torch.nn.Module):

    def __init__(self, cond_size, vocab_size, embedding_size, hidden_size):
        super().__init__()
        self.embedding_layer = torch.nn.Embedding(vocab_size, embedding_size)
        self.rnn_s0 = torch.nn.Parameter(torch.zeros((hidden_size,), dtype=torch.float32))
        self.rnn_c0 = torch.nn.Parameter(torch.zeros((hidden_size,), dtype=torch.float32))
        self.rnn_cell = torch.nn.LSTMCell(embedding_size, hidden_size)
        self.output_layer = torch.nn.Linear(cond_size + hidden_size, vocab_size) # Output layer must include the conditioning vector in its input.
        
    def forward(self, cond, x):
        batch_size = x.shape[0]
        time_steps = x.shape[1]

        embedded = self.embedding_layer(x)
        
        state = self.rnn_s0.unsqueeze(0).tile((batch_size, 1))
        c = self.rnn_s0.unsqueeze(0).tile((batch_size, 1))
        interm_states = []
        for t in range(time_steps):
            (state, c) = self.rnn_cell(embedded[:, t, :], (state, c))
            interm_states.append(state)
        interm_states = torch.stack(interm_states, dim=1)
        
        cond_3d = cond.unsqueeze(1).tile((1, time_steps, 1)) # Replicate the same conditioning vector for every word.
        interm_states = torch.cat((cond_3d, interm_states), dim=2) # Attach the replicated conditioning vector to the intermediate states.
        
        return self.output_layer(interm_states)

model = Model(sentiments_train_t.shape[1], len(vocab), embedding_size=2, hidden_size=2)
model.to(device)

optimiser = torch.optim.Adam(model.parameters())

print('step', 'error')
train_errors = []
for step in range(1, 100+1):
    batch_size = indexed_train_x_start.shape[0]
    time_steps = indexed_train_x_start.shape[1]
    mask = torch.zeros((batch_size, time_steps), dtype=torch.bool)
    for i in range(batch_size):
        for j in range(time_steps):
            if j >= text_lens[i]:
                mask[i, j] = 1
    
    optimiser.zero_grad()
    output = model(sentiments_train_t, indexed_train_x_start)
    errors = torch.nn.functional.cross_entropy(output.transpose(1, 2), indexed_train_x_end, reduction='none')
    errors = torch.masked_fill(errors, mask, 0.0)
    error = errors.sum()/text_lens.sum()
    train_errors.append(error.detach().tolist())
    error.backward()
    optimiser.step()
    
    with torch.no_grad():
        outputs = torch.log_softmax(output, dim=2)
        total_word_log_prob = 0.0
        num_words = 0
        for (sent, text_len, log_probs, y) in zip(train_lc, text_lens, outputs, indexed_train_x_end):
            word_log_probs = log_probs[torch.arange(text_len), y[:text_len]]
            total_word_log_prob += word_log_probs.sum().tolist()
            num_words += text_len.tolist()
        print('PPLX:', np.exp(-1/num_words*total_word_log_prob))

    if step%10 == 0:
        print(step, train_errors[-1])
print()

step error
PPLX: 46.96202490784544
PPLX: 46.8183729863008
PPLX: 46.675810373847064
PPLX: 46.53438304390153
PPLX: 46.3941762151919
PPLX: 46.25530494549943
PPLX: 46.11776416046015
PPLX: 45.98146990793302
PPLX: 45.84637419574284
PPLX: 45.71238598963332
10 3.90478253364563
PPLX: 45.5795152804579
PPLX: 45.447732922674994
PPLX: 45.317040347019116
PPLX: 45.18740879031306
PPLX: 45.058801211733595
PPLX: 44.931232398546264
PPLX: 44.804631501534864
PPLX: 44.67898384750537
PPLX: 44.55426216763033
PPLX: 44.43045219558723
20 3.8799147605895996
PPLX: 44.30755671242956
PPLX: 44.185498392922945
PPLX: 44.06433101967815
PPLX: 43.9439652864944
PPLX: 43.824446588897814
PPLX: 43.705711446739656
PPLX: 43.587809288417354
PPLX: 43.47067706487262
PPLX: 43.35433094810239
PPLX: 43.23872094846959
30 3.8555996417999268
PPLX: 43.123892186150165
PPLX: 43.009799053675636
PPLX: 42.89642095914649
PPLX: 42.783766076911306
PPLX: 42.67182213691302
PPLX: 42.560564809478265
PPLX: 42.449986200341215
PPLX: 42.34007444822901
PP

In [80]:
# #full prefix tree search
# def generate(conditioning_vector, max_len):
#     with torch.no_grad():
#         conditioning_tensor = torch.tensor([conditioning_vector], dtype=torch.float32, device=device)
#         curr_level = [(1.0, [vocab.index('<EDGE>')])]
#         best_complete_sent = (0.0, [])
#         for _ in range(max_len):
#             new_level = []
#             for (prefix_prob, prefix_indexes) in curr_level: # Produce partial and complete sentences that are one word longer.
#                 prefix_tensor = torch.tensor([prefix_indexes], dtype=torch.int64, device=device)
#                 outputs = torch.softmax(model(conditioning_tensor, prefix_tensor), dim=2)
#                 word_probs = outputs[0, -1, :].numpy()
#                 new_prefix_probs = prefix_prob*word_probs
#                 for (next_word_index, new_prefix_prob) in enumerate(new_prefix_probs.tolist()):
#                     new_entry = (new_prefix_prob, prefix_indexes + [next_word_index])
#                     if next_word_index == vocab.index('<EDGE>'): # Attaching the EDGE token at the end of a prefix makes a complete sentence.
#                         if new_prefix_prob > best_complete_sent[0]: # A more probable complete sentence was found.
#                             best_complete_sent = new_entry
#                     else: # Add the partial sentence to the new tree level.
#                         new_level.append(new_entry)
#             # Compare the probability of the best complete sentence with that of the new level's best partial sentence.
#             if best_complete_sent[0] > max(prefix_prob for (prefix_prob, _) in new_level):
#                 break
#             curr_level = new_level
#         sent = [vocab[index] for index in best_complete_sent[1][1:-1]]
#         return (sent, best_complete_sent[0])

    
# for (name, sentiment) in [('Action', [0,0,0,0,1]), ('Horror', [0,0,0,1,0]),('Fantasy',[0,0,1,0,0]),('Western',[0,1,0,0,0]),('Adventure',[1,0,0,0,0])]:
#     print(name)
#     (sent, prob) = generate(sentiment, max_len=20)
#     print(sent, prob)
#     print()

Action
[] 0.013677187263965607

Horror
[] 0.014090662822127342

Fantasy
[] 0.013164513744413853

Western
[] 0.01264261081814766

Adventure
[] 0.01414669118821621



In [81]:
# #beam search
# def generate(conditioning_vector, beam_size, max_len):
#     with torch.no_grad():
#         conditioning_tensor = torch.tensor([conditioning_vector], dtype=torch.float32, device=device)
#         beam = [(1.0, [vocab.index('<EDGE>')])]
#         best_complete_sent = (0.0, [])
#         for _ in range(max_len):
#             new_beam = []
#             for (prefix_prob, prefix_indexes) in beam:
#                 prefix_tensor = torch.tensor([prefix_indexes], dtype=torch.int64, device=device)
#                 outputs = torch.softmax(model(conditioning_tensor, prefix_tensor), dim=2)
#                 word_probs = outputs[0, -1, :].numpy()
#                 new_prefix_probs = prefix_prob*word_probs
#                 for (next_word_index, new_prefix_prob) in enumerate(new_prefix_probs.tolist()):
#                     new_entry = (new_prefix_prob, prefix_indexes + [next_word_index])
#                     if next_word_index == vocab.index('<EDGE>'):
#                         if new_prefix_prob > best_complete_sent[0]:
#                             best_complete_sent = new_entry
#                     else:
#                         new_beam.append(new_entry)
#             new_beam.sort(reverse=True) # Sort the new beam by partial sentence probability.
#             beam = new_beam[:beam_size] # Take the top beam_size partial sentences.
#             # Compare the probability of the best complete sentence with that of the new beams's best partial sentence.
#             if best_complete_sent[0] > new_beam[0][0]:
#                 break
#         sent = [vocab[index] for index in best_complete_sent[1][1:-1]]
#         return (sent, best_complete_sent[0])
    
# for (name, sentiment) in [('Action', [0,0,0,0,1]), ('Horror', [0,0,0,1,0]),('Fantasy',[0,0,1,0,0]),('Western',[0,1,0,0,0]),('Adventure',[1,0,0,0,0])]:
#     print(name)
#     (sent, prob) = generate(sentiment, beam_size=3, max_len=20)
#     print(sent, prob)
#     print()

Action
[] 0.013677187263965607

Horror
[] 0.014090662822127342

Fantasy
[] 0.013164513744413853

Western
[] 0.01264261081814766

Adventure
[] 0.01414669118821621



Generate 3 titles for every genre.
Make sure that the titles are not all the same.

In [74]:
# action = [0, 0, 0, 0, 1]
# horror = [0, 0, 0, 1, 0]
# fantasy = [0, 0, 1, 0, 0]
# western = [0, 1, 0, 0, 0]
# adventure = [1, 0, 0, 0, 0]


with torch.no_grad():  
    for (name, sentiment) in [('Action', [0,0,0,0,1]), ('Horror', [0,0,0,1,0]),('Fantasy',[0,0,1,0,0]),('Western',[0,1,0,0,0]),('Adventure',[1,0,0,0,0])]:
        for i in range(3):
            print(name)
            sentiment = torch.tensor([sentiment], dtype=torch.float32)
            prefix_indexes = [vocab.index('<EDGE>')]
            max_words = 10
            for _ in range(max_words):
                prefix_tensor = torch.tensor([prefix_indexes], dtype=torch.int64)
                outputs = torch.softmax(model(sentiment, prefix_tensor), dim=2)
                word_probs = outputs[0, -1, :].tolist()
                next_word_index = random.choices(range(len(vocab)), word_probs)[0]
                if next_word_index == vocab.index('<EDGE>'):
                    break
                prefix_indexes.append(next_word_index)
            sent = [vocab[index] for index in prefix_indexes[1:]]
            print(sent)
            print()

Action
['e', 'o', 'e', 'n', '0', 's', 'n', 'l', 'e', 'ñ']

Horror
['n', '9', 'n', 'n', 'l', 'v', 'c', 'h', 'l', '7']

Fantasy
['w']

Western
['x', 'n', 'y', 'q', '?', 'r', '5', 'h', '/', '6']

Adventure
['2', 'u', 'r', '2', 'd', ':', 's', '3', 'é', '9']

Action
['?', 'c', 'r', 's', 'k', 'h', 'e', "'", 'n', ' ']

Horror
['t', '1', 'l', '!', '4', 'r', '#', '5', 's', 's']

Fantasy
['n', 'z', 'r', 'k', 'h', 't', "'", 'p', 'u', 'm']

Western
['z', '5', "'", 'a', 'b', 'o', '3', '4', 'o', 'l']

Adventure
['2', 'ñ', 'm', 'y', 'l', 's', '1', 'y', '9', 'u']

Action
['y', 'h', '4', '5', 'x', 'l', 'c', 's', 'n', '3']

Horror
['t', '/', '3', 'é', 's', 'u', 'c', 'k', '5', 'w']

Fantasy
['s', 'y', '7', 'h', '-', 'h', 'o', 'r', '9', 'h']

Western
['5', '?', '0', "'", '-', '0', 'ñ', ':', '9', '9']

Adventure
['?', 'w', 'ñ', 'k', 'h', 'n', 'j', 'o', 'l', 't']



## Language models as classifiers (30%)

It occurs to you that the movie title generator can also be used as a classifier by doing the following:

* Let title $t$ be the title that you want to classify.
* For every genre $g$,
    * Use the generator as a language model to get the probability of $t$ (the whole title) using genre $g$.
* Pick the genre that makes the language model give the largest probability.

The producer is thrilled to not need two separate models and now you have to implement this.
**Use the preprocessed test set from the previous task** in order to find the genre that makes the language model give the largest probability.
There is no need to plot anything here.

Just like in the classification task, measure the F1 score and plot the confusion matrix of this new classifier.

Write a paragraph or psuedo code to describe what your code above does.

In [None]:
'''

'''

## Conclusion (10%)

The producer's funders are asking for a report about this new technology they invested in.
In 300 words, write your interpretation of the results together with what you think could make the model perform better.

In [None]:
'''

'''