# Movie genre prediction with Object2Vec Algorithm
### MD: modified to work on a local container

1. [Introduction](#Introduction)
2. [Install and import dependencies](#Install-and-import-dependencies)
3. [Preprocessing](#Preprocessing)
  1. [Build the vocabulary](#Build-the-vocabulary)
  2. [Split data into train, validation and test](#Split-data-into-train,-validation-and-test)
  3. [Negative sampling](#Negative-sampling)
  4. [Tokenization](#Tokenization)
  5. [Download pretrained word embeddings](#Download-pretrained-word-embeddings)
4. [Sagemaker Training](#Sagemaker-Training)
  1. [Upload data to S3](#Upload-data-to-S3)
  2. [Training hyperparameters](#Training-hyperparameters)
5. [Evaluation with Batch inference](#Evaluation-with-Batch-inference)
6. [Online inference demo](#Online-inference-demo)

## Introduction

In this notebook, we will explore how the Object2Vec algorithm can be used in a multi-label prediction setting
to predict the genres of a movie from its plot description. We will be using a dataset provided by IMDb.


At a high level, the network architecture that we use for this task is illustrated in the diagram below.

<img src="image.png" width="500">

We cast the problem of multi-label prediction as a binary classification problem. A positive example is a pair consisting of a movie plot description and a genre that applies to that movie according to the labeled data. If a movie has multiple genres, we create multiple positive examples for the movie, one for each genre. A negative example is a pair where the genre does not apply to the movie. The negative examples are generated by picking a random subset of the genres which do not apply to the movie, as determined by the labeled dataset.
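The pair construction described above can be sketched on a toy example. This is a minimal illustration, not the code used later in the notebook: the genre list, the plot text, and the helper name `make_pairs` are assumptions made for the example.

```python
import random

# Toy genre list; illustrative only, not the genre set of the dataset.
GENRES = ["Action", "Comedy", "Drama", "Horror", "Romance"]

def make_pairs(plot, movie_genres, num_negatives, rng):
    """Return (plot, genre, label) tuples: one positive per true genre plus sampled negatives."""
    positives = [(plot, g, 1) for g in movie_genres]
    # Negatives are drawn only from genres that do not apply to the movie.
    candidates = [g for g in GENRES if g not in movie_genres]
    sampled = rng.sample(candidates, min(num_negatives, len(candidates)))
    negatives = [(plot, g, 0) for g in sampled]
    return positives + negatives

rng = random.Random(0)
pairs = make_pairs("A detective falls in love.", ["Drama", "Romance"], num_negatives=2, rng=rng)
for pair in pairs:
    print(pair)
```

A movie with two genres yields two positive pairs, and the number of negatives is controlled separately, which is what makes the class balance tunable.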

Let us start by downloading the data.

<div class="alert alert-warning">
Important: Before you begin downloading, please read the following README file using your browser and make sure you are okay with the license.
ftp://ftp.fu-berlin.de/pub/misc/movies/database/frozendata/README
</div>

In [1]:
# !wget ftp://ftp.fu-berlin.de/pub/misc/movies/database/frozendata/genres.list.gz
# !gunzip genres.list.gz

--2020-05-23 13:38:55--  ftp://ftp.fu-berlin.de/pub/misc/movies/database/frozendata/genres.list.gz
           => ‘genres.list.gz’
Resolving ftp.fu-berlin.de (ftp.fu-berlin.de)... 130.133.3.130
Connecting to ftp.fu-berlin.de (ftp.fu-berlin.de)|130.133.3.130|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/misc/movies/database/frozendata ... done.
==> SIZE genres.list.gz ... 20525974
==> PASV ... done.    ==> RETR genres.list.gz ... done.
Length: 20525974 (20M) (unauthoritative)


2020-05-23 13:39:18 (1.03 MB/s) - ‘genres.list.gz’ saved [20525974]



In [2]:
# !wget ftp://ftp.fu-berlin.de/pub/misc/movies/database/frozendata/plot.list.gz
# !gunzip plot.list.gz

--2020-05-23 13:39:43--  ftp://ftp.fu-berlin.de/pub/misc/movies/database/frozendata/plot.list.gz
           => ‘plot.list.gz’
Resolving ftp.fu-berlin.de (ftp.fu-berlin.de)... 130.133.3.130
Connecting to ftp.fu-berlin.de (ftp.fu-berlin.de)|130.133.3.130|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/misc/movies/database/frozendata ... done.
==> SIZE plot.list.gz ... 159742723
==> PASV ... done.    ==> RETR plot.list.gz ... done.
Length: 159742723 (152M) (unauthoritative)


2020-05-23 13:41:43 (1.33 MB/s) - ‘plot.list.gz’ saved [159742723]



## Install and import dependencies

In [3]:
import sys

In [4]:
! {sys.prefix}/bin/pip install langdetect

Collecting langdetect
  Downloading langdetect-1.0.8.tar.gz (981 kB)
Building wheels for collected packages: langdetect
  Building wheel for langdetect (setup.py) ... done
  Created wheel for langdetect: filename=langdetect-1.0.8-py3-none-any.whl size=993191 sha256=2a9dc7ab8939bb14500daa8be0a522fe7d855c20abd66bb15ed5f0c4473ae627
  Stored in directory: /tmp/pip-ephem-wheel-cache-fmehe9hd/wheels/53/88/5d/b239dc55d773b01fdd2059606b1a8f4b64548848b8f6e381c3
Successfully built langdetect
Installing collected packages: langdetect
Successfully installed langdetect-1.0.8


In [5]:
! {sys.prefix}/bin/pip install nltk

Collecting nltk
  Downloading nltk-3.5.zip (1.4 MB)
Collecting click
  Downloading click-7.1.2-py2.py3-none-any.whl (82 kB)
Collecting joblib
  Downloading joblib-0.15.1-py3-none-any.whl (298 kB)
Collecting regex
  Downloading regex-2020.5.14-cp36-cp36m-manylinux2010_x86_64.whl (675 kB)
Collecting tqdm
  Downloading tqdm-4.46.0-py2.py3-none-any.whl (63 kB)
Building wheels for collected packages: nltk
  Building wheel for nltk (setup.py) ... done
  Created wheel for nltk: filename=nltk-3.5-py3-none-any.whl size=1434676 sha256=690abbe3bbb2c9cf4946a5b7ac88ce2e5ecc1de3299f9264c541f939584863a2
  Stored

In [6]:
!conda upgrade -y sqlite

Solving environment: done


  current version: 4.5.12
  latest version: 4.8.3

Please update conda by running

    $ conda update -n base conda



## Package Plan ##

  environment location: /home/ec2-user/anaconda3

  added / updated specs: 
    - sqlite


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    openssl-1.1.1g             |       h516909a_0         2.1 MB  conda-forge
    certifi-2020.4.5.1         |   py37hc8dfbb8_0         151 KB  conda-forge
    python_abi-3.7             |          1_cp37m           4 KB  conda-forge
    ca-certificates-2020.4.5.1 |       hecc5488_0         146 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         2.4 MB

The following NEW packages will be INSTALLED:

    python_abi:      3.7-1_cp37m           conda-forge

The following packages will be UPDATED:

    ca-cert

In [7]:
! {sys.prefix}/bin/pip install jsonlines



In [35]:
import json
import sys
from collections import Counter
from collections import defaultdict
from itertools import chain, islice

import boto3
import jsonlines
import matplotlib
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import sagemaker
import seaborn as sns

from langdetect import detect
from nltk.corpus import stopwords
from nltk.tokenize import TreebankWordTokenizer, sent_tokenize
from sagemaker import get_execution_role
from sagemaker.amazon.amazon_estimator import get_image_uri
from sagemaker.session import s3_input
from sklearn.model_selection import StratifiedShuffleSplit

%matplotlib inline

In [4]:
# # execute this on aws sagemaker
# role = get_execution_role()

# use this if running sagemaker locally
def resolve_sm_role():
    client = boto3.client('iam', region_name='us-east-2')
    response_roles = client.list_roles(
        PathPrefix='/',
        # Marker='string',
        MaxItems=999
    )
    for role in response_roles['Roles']:
        if role['RoleName'].startswith('AmazonSageMaker-ExecutionRole-'):
#             print('Resolved SageMaker IAM Role to: ' + str(role))
            return role['Arn']
    raise Exception('Could not resolve what should be the SageMaker role to be used')

# this is the role created by sagemaker notebook on aws
role_arn = resolve_sm_role()
print(role_arn)
role=role_arn

arn:aws:iam::558157414092:role/service-role/AmazonSageMaker-ExecutionRole-20200523T082014


## Preprocessing

In [5]:
row = [0] * 5
print(type(row))
print(row)

<class 'list'>
[0, 0, 0, 0, 0]


In [6]:
def get_genres(filename):
    
    genres = defaultdict(list)
    unique_genres = set()
    
    with open(filename, "r", errors='ignore') as f:
        for line in f:
            if line.startswith('"'):
                
                data = line.split('\t')
                
                movie = data[0]
                genre = data[-1].strip()
                
                genres[movie].append(genre)
                unique_genres.add(genre)
                
    unique_genres = sorted(unique_genres)
    print(unique_genres)
    
    # md: do a one hot encoding for movies and genres
    data = []
    for movie in genres:
        
        # md: create a list with dimension equal to number of genres, each element equal to 0
        row = [0]*len(unique_genres)
        
        for g in genres[movie]:
            row[unique_genres.index(g)] = 1
            
        row.insert(0, movie)
        data.append(row)
        
    genres_df = pd.DataFrame(data)
    genres_df.columns = ['short_title'] + unique_genres
    return genres_df
    
genres_df = get_genres("genres.list")
genres_df.head()

['Action', 'Adult', 'Adventure', 'Animation', 'Biography', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Family', 'Fantasy', 'Game-Show', 'History', 'Horror', 'Lifestyle', 'Music', 'Musical', 'Mystery', 'News', 'Reality-TV', 'Reality-tv', 'Romance', 'Sci-Fi', 'Sci-fi', 'Short', 'Sport', 'Talk-Show', 'Thriller', 'War', 'Western']


Unnamed: 0,short_title,Action,Adult,Adventure,Animation,Biography,Comedy,Crime,Documentary,Drama,...,Reality-tv,Romance,Sci-Fi,Sci-fi,Short,Sport,Talk-Show,Thriller,War,Western
0,"""!Next?"" (1994)",0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
1,"""#1 Single"" (2006)",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"""#15SecondScare"" (2015)",0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0
3,"""#15SecondScare"" (2015) {Who Wants to Play wit...",0,0,0,0,0,0,0,0,1,...,0,0,0,0,1,0,0,1,0,0
4,"""#1MinuteNightmare"" (2014)",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [7]:
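def get_plots(filename):
    # Parse plot.list: "MV:" lines carry the title, "PL:" lines carry the plot text.
    # NOTE: an "MV:" line seen while a record is open only closes that record; its own
    # title is not captured, so every other entry is skipped, and the last open record
    # is never appended. The outputs recorded in this notebook were produced with this behaviour.
    with open(filename, "r", errors='ignore') as f:
        data = []
        inside = False
        plot = ''
        full_title = ''
        for line in f:
            if line.startswith("MV:") and not inside:
                # Open a new record
                inside = True
                full_title = line.split("MV:")[1].strip()

            elif line.startswith("PL:") and inside:
                # Accumulate the plot text of the open record
                plot += line.split("PL:")[1].replace("\n", "")

            elif line.startswith("MV:") and inside:
                # Close the open record
                short_title = full_title.split('{')[0].strip()
                data.append((short_title, full_title, plot))
                plot = ''
                inside = False
    plots_df = pd.DataFrame(data)
    plots_df.columns = ['short_title', 'title', 'plot']
    return plots_df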

plots_df = get_plots("plot.list")
plots_df.head()

Unnamed: 0,short_title,title,plot
0,"""#7DaysLater"" (2013)","""#7DaysLater"" (2013)",#7dayslater is an interactive comedy series f...
1,"""#BlackLove"" (2015)","""#BlackLove"" (2015) {Crash the Party (#1.9)}","With just one week left in the workshops, the..."
2,"""#BlackLove"" (2015)","""#BlackLove"" (2015) {Making Lemonade Out of Le...",All of the women start making strides towards...
3,"""#BlackLove"" (2015)","""#BlackLove"" (2015) {Miss Independent (#1.5)}",All five of these women are independent and s...
4,"""#BlackLove"" (2015)","""#BlackLove"" (2015) {Sealing the Deal (#1.10)}",Despite having gone through a life changing p...
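As written, `get_plots` uses an `MV:` line that arrives while a record is open only to close that record, so the title on that line never starts a record of its own and the final record in the file is never appended. The following is a minimal sketch of a variant that emits every record, shown on an inline sample. The function name `iter_plots` and the sample lines are assumptions for illustration; the real `plot.list` file contains further line types that this sketch simply ignores.

```python
def iter_plots(lines):
    """Yield (short_title, full_title, plot) for every MV: record in plot.list-style lines."""
    full_title = None
    plot_parts = []
    for line in lines:
        if line.startswith("MV:"):
            # A new title closes the previous record, if any, and opens the next one.
            if full_title is not None:
                yield full_title.split("{")[0].strip(), full_title, "".join(plot_parts)
            full_title = line.split("MV:", 1)[1].strip()
            plot_parts = []
        elif line.startswith("PL:") and full_title is not None:
            plot_parts.append(line.split("PL:", 1)[1].replace("\n", ""))
    # Flush the final record, which has no following MV: line.
    if full_title is not None:
        yield full_title.split("{")[0].strip(), full_title, "".join(plot_parts)

# Hypothetical sample in the same MV:/PL: layout.
sample = [
    'MV: "Show A" (2001)\n',
    "PL: First line.\n",
    "PL: Second line.\n",
    'MV: "Show B" (2002) {Pilot (#1.1)}\n',
    "PL: Only line.\n",
]
records = list(iter_plots(sample))
for record in records:
    print(record)
```

Swapping this in for `get_plots` would change the row counts and all downstream numbers reported in this notebook, so treat it as an optional alternative rather than a drop-in replacement.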


Now join the genre and the plot dataframes.

In [15]:
data_df = plots_df.merge(genres_df, how='inner', on='short_title')
data_df.dropna(inplace=True)
data_df.drop('short_title', axis=1, inplace=True)
data_df.head()

Unnamed: 0,title,plot,Action,Adult,Adventure,Animation,Biography,Comedy,Crime,Documentary,...,Reality-tv,Romance,Sci-Fi,Sci-fi,Short,Sport,Talk-Show,Thriller,War,Western
0,"""#7DaysLater"" (2013)",#7dayslater is an interactive comedy series f...,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"""#BlackLove"" (2015) {Crash the Party (#1.9)}","With just one week left in the workshops, the...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"""#BlackLove"" (2015) {Making Lemonade Out of Le...",All of the women start making strides towards...,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"""#BlackLove"" (2015) {Miss Independent (#1.5)}",All five of these women are independent and s...,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"""#BlackLove"" (2015) {Sealing the Deal (#1.10)}",Despite having gone through a life changing p...,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [16]:
genres = list(data_df.columns)[2:]
genres

['Action',
 'Adult',
 'Adventure',
 'Animation',
 'Biography',
 'Comedy',
 'Crime',
 'Documentary',
 'Drama',
 'Family',
 'Fantasy',
 'Game-Show',
 'History',
 'Horror',
 'Lifestyle',
 'Music',
 'Musical',
 'Mystery',
 'News',
 'Reality-TV',
 'Reality-tv',
 'Romance',
 'Sci-Fi',
 'Sci-fi',
 'Short',
 'Sport',
 'Talk-Show',
 'Thriller',
 'War',
 'Western']

In [17]:
counts = []
for genre in genres:
    counts.append((genre, data_df[genre].sum()))
counts

[('Action', 13620),
 ('Adult', 129),
 ('Adventure', 11353),
 ('Animation', 12944),
 ('Biography', 1580),
 ('Comedy', 37354),
 ('Crime', 16777),
 ('Documentary', 13882),
 ('Drama', 51150),
 ('Family', 17127),
 ('Fantasy', 8488),
 ('Game-Show', 2316),
 ('History', 3165),
 ('Horror', 2826),
 ('Lifestyle', 0),
 ('Music', 3198),
 ('Musical', 779),
 ('Mystery', 12813),
 ('News', 4719),
 ('Reality-TV', 13748),
 ('Reality-tv', 1),
 ('Romance', 21557),
 ('Sci-Fi', 9504),
 ('Sci-fi', 0),
 ('Short', 858),
 ('Sport', 2406),
 ('Talk-Show', 6516),
 ('Thriller', 9511),
 ('War', 1534),
 ('Western', 2841)]

In [18]:
distribution = pd.DataFrame(counts, columns=['genre', 'count'])
distribution

Unnamed: 0,genre,count
0,Action,13620
1,Adult,129
2,Adventure,11353
3,Animation,12944
4,Biography,1580
5,Comedy,37354
6,Crime,16777
7,Documentary,13882
8,Drama,51150
9,Family,17127


In [19]:
# Remove the genres with 0 movies
data_df.drop('Lifestyle', axis=1, inplace=True)
data_df.drop('Sci-fi', axis=1, inplace=True)
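The genre list contains case variants of the same genre (`Reality-TV` / `Reality-tv`, `Sci-Fi` / `Sci-fi`), which is why some columns are empty or nearly empty. An alternative to dropping columns would be to merge such variants before encoding. The sketch below is one possible way to do that, under the assumption that the most frequent spelling is the canonical one; the helper name `canonical_genres` and the sample list are hypothetical.

```python
from collections import Counter

def canonical_genres(raw_genres):
    """Map each raw genre to the most frequent spelling among its case-insensitive variants."""
    counts = Counter(raw_genres)
    best = {}
    for name, n in counts.items():
        key = name.lower()
        # Keep the spelling with the highest count for each lowercase key.
        if key not in best or n > counts[best[key]]:
            best[key] = name
    return {name: best[name.lower()] for name in counts}

# Hypothetical raw genre values, one entry per (movie, genre) line.
raw = ["Sci-Fi", "Sci-Fi", "Sci-fi", "Reality-TV", "Reality-tv", "Reality-TV", "Drama"]
mapping = canonical_genres(raw)
print(mapping)
```

The notebook itself keeps the variants as separate columns, so the genre counts reported later assume no such merging.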

Next we select all the movies whose descriptions are in English. Note that this takes about 12 minutes to run.

In [20]:
data_df['plot_lang'] = data_df.apply(lambda row: detect(row['plot']), axis=1)

In [21]:
data_df.head()

Unnamed: 0,title,plot,Action,Adult,Adventure,Animation,Biography,Comedy,Crime,Documentary,...,Reality-tv,Romance,Sci-Fi,Short,Sport,Talk-Show,Thriller,War,Western,plot_lang
0,"""#7DaysLater"" (2013)",#7dayslater is an interactive comedy series f...,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,en
1,"""#BlackLove"" (2015) {Crash the Party (#1.9)}","With just one week left in the workshops, the...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,en
2,"""#BlackLove"" (2015) {Making Lemonade Out of Le...",All of the women start making strides towards...,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,en
3,"""#BlackLove"" (2015) {Miss Independent (#1.5)}",All five of these women are independent and s...,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,en
4,"""#BlackLove"" (2015) {Sealing the Deal (#1.10)}",Despite having gone through a life changing p...,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,en


In [22]:
data_df['plot_lang'].value_counts()

en    131169
nl       145
fr       113
de        36
es        10
it         7
no         3
da         3
ca         2
sv         2
sl         1
pt         1
tl         1
hu         1
Name: plot_lang, dtype: int64

### Select only the English records and save them to a CSV

In [23]:
df = data_df[data_df.plot_lang.isin(['en'])]
df.to_csv("movies_genres_en.csv", sep='\t', encoding='utf-8')

### From now on we can read the data from the CSV and save time

In [24]:
df = pd.read_csv("movies_genres_en.csv", delimiter='\t', encoding='utf-8', index_col=0)

### Build the vocabulary

Let's define a few functions to tokenize our data and build the vocabulary.

In [25]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [36]:
tokenizer = TreebankWordTokenizer()

In [27]:
vocab = Counter()
vocab

Counter()

In [28]:
len(df)

131169

In [32]:
def tokenize_plot_summary(summary):
    for sent in sent_tokenize(summary):
        for token in tokenizer.tokenize(sent):
            yield token

In [31]:
tmp_01 = sent_tokenize('hello wonderful world')
print(tmp_01)

for sent in tmp_01:
    for token in tokenizer.tokenize(sent):
        print(token)

['hello wonderful world']
hello
wonderful
world


In [33]:
UNKNOWN = '<unk>'

def build_vocab(data, max_vocab_size=None):
    
    vocab = Counter()
    total = len(data)
    
    for i, row in enumerate(data.itertuples()):
        
        vocab.update(tokenize_plot_summary(row.plot))
        
        if (i+1)%1000 == 0:
            sys.stdout.write(".")
            sys.stdout.flush()
            
    final_vocab = {word:i for i, (word, count) in enumerate(vocab.most_common(max_vocab_size))}
    # Give the unknown token the next free id so that ids stay contiguous (0..len-1).
    final_vocab[UNKNOWN] = len(final_vocab)
    return final_vocab

In [33]:
vocab = build_vocab(df)

...................................................................................................................................

In [34]:
type(vocab)

dict

In [35]:
print("Vocab size: ", len(vocab))
with open("vocab.json", "w") as f:
    json.dump(vocab, f)
    print("Saved vocabulary file to vocab.json")

Vocab size:  226171
Saved vocabulary file to vocab.json
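To make the vocabulary mapping concrete, here is a toy sketch of the same idea: ids are assigned by descending token frequency, and one extra id is reserved for unknown tokens. Whitespace splitting stands in for the NLTK tokenizer, and the names `build_toy_vocab` and `encode` are assumptions for this example only. This sketch gives the unknown token the next free id.

```python
from collections import Counter

UNKNOWN = "<unk>"

def build_toy_vocab(texts, max_vocab_size=None):
    """Assign ids by descending frequency, then reserve the next free id for unknown tokens."""
    counts = Counter(token for text in texts for token in text.split())
    vocab = {word: i for i, (word, _) in enumerate(counts.most_common(max_vocab_size))}
    vocab[UNKNOWN] = len(vocab)
    return vocab

def encode(text, vocab):
    """Map tokens to ids, falling back to the unknown id for out-of-vocabulary tokens."""
    return [vocab.get(token, vocab[UNKNOWN]) for token in text.split()]

toy_vocab = build_toy_vocab(["the cat sat", "the dog sat", "the end"], max_vocab_size=3)
ids = encode("the bird sat", toy_vocab)
print(toy_vocab)
print(ids)
```

With `max_vocab_size` set, rare tokens fall outside the vocabulary and are all mapped to the single unknown id, which bounds the size of the embedding table.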


# Start From Here If Data is Already Processed

In [8]:
df = pd.read_csv("movies_genres_en.csv", delimiter='\t', encoding='utf-8', index_col=0)

In [9]:
print(df.shape)

(131169, 31)


In [10]:
with open("vocab.json", "r") as read_file:
    vocab = json.load(read_file)


### Split data into train, validation and test

Now we prepare the data for training. First we split the dataframe into train, validation and test partitions; a later step converts each partition into the jsonlines format that the algorithm consumes.

In [11]:
df.values

array([['"#7DaysLater" (2013)',
        " #7dayslater is an interactive comedy series featuring an ensemble cast of YouTube celebrities. Each week the audience writes the brief via social media for an all-new episode featuring a well-known guest-star. Seven days later that week's episode premieres on TV and across multiple platforms.",
        0, ..., 0, 0, 'en'],
       ['"#BlackLove" (2015) {Crash the Party (#1.9)}',
        ' With just one week left in the workshops, the women consider the idea of "The One." The ladies are stunned when Jahmil finally comes to a decision about Bentley and if he\'s the one for her. Jack challenges Tennesha to express her feelings of love towards Errol, but can she put herself out there and face possible rejection?',
        0, ..., 0, 0, 'en'],
       ['"#BlackLove" (2015) {Making Lemonade Out of Lemons (#1.2)}',
        " All of the women start making strides towards finding their own version of a happy ending. Tennesha and Errol decide to become exc

In [12]:
data_y = df.drop(['title', 'plot', 'plot_lang'], axis=1).values
# print(data_y)
tmp = np.argmax(data_y, axis=1)
len(tmp)

131169

In [13]:
def split(df, test_size):
    data = df.values
    data_y = df.drop(['title', 'plot', 'plot_lang'], axis=1).values
    
    # StratifiedShuffleSplit does not work with one hot encoded / multiple labels. 
    # Doing the split on basis of arg max labels.
    data_y = np.argmax(data_y, axis=1)
    print(data_y.shape)
    
    # Note: n_splits=2 generates two splits; the loop below keeps only the last one.
    stratified_split = StratifiedShuffleSplit(n_splits=2, test_size=test_size, random_state=42)
    print(type(stratified_split))
    
    for train_index, test_index in stratified_split.split(data, data_y):
        train, test = df.iloc[train_index], df.iloc[test_index]
    return train, test

train, test = split(df, 0.33)
#Split the train further into train and validation
train, validation = split(train, 0.2)

(131169,)
<class 'sklearn.model_selection._split.StratifiedShuffleSplit'>
(87883,)
<class 'sklearn.model_selection._split.StratifiedShuffleSplit'>


In [14]:
print(train.shape)
print(test.shape)

(70306, 31)
(43286, 31)
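The `split` function stratifies on a single label per movie, obtained by taking the arg max over the multi-hot genre row. Arg max returns the first index holding the maximal value, so a movie with several genres is stratified by its first genre in column order, and a row with no genre set (if any exist) falls into the same stratum as the first column. The pure-Python sketch below mimics that tie-breaking on toy rows; the genre list and the rows are illustrative assumptions.

```python
def first_max_index(row):
    """Index of the first maximal entry, matching the tie-breaking of numpy.argmax."""
    return row.index(max(row))

# Toy genre columns and multi-hot rows, for illustration only.
genres = ["Action", "Comedy", "Drama"]
rows = {
    "comedy and drama": [0, 1, 1],
    "action only": [1, 0, 0],
    "no genre at all": [0, 0, 0],
}
strata = {name: genres[first_max_index(row)] for name, row in rows.items()}
for name, stratum in strata.items():
    print(name, "->", stratum)
```

This means the stratification is only approximate for multi-genre movies: genre proportions are preserved for the first listed genre, not for every genre a movie carries.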


In [15]:
train.head()

Unnamed: 0,title,plot,Action,Adult,Adventure,Animation,Biography,Comedy,Crime,Documentary,...,Reality-tv,Romance,Sci-Fi,Short,Sport,Talk-Show,Thriller,War,Western,plot_lang
42360,"""Grange Hill"" (1978) {(#4.16)}",Cathy's bunking off Cathy's group have been f...,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,en
34969,"""Evil Lives Here"" (2016) {My Brother's Secret ...","Danyall White always thought her brother, Ric...",0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,en
71718,"""Neighbours"" (1985) {(#1.7178)}","Both frustrated in love and school, Tyler tak...",0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,en
8291,"""Australian Story"" (1996) {A True Calling (#4....",Millions of television viewers around the wor...,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,en
85090,"""Rote Rosen"" (2006) {Eine neue Allianz (#1.2012)}",Nora assumes Carla is mistaken. Nora wants to...,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,en


In [16]:
test.head()

Unnamed: 0,title,plot,Action,Adult,Adventure,Animation,Biography,Comedy,Crime,Documentary,...,Reality-tv,Romance,Sci-Fi,Short,Sport,Talk-Show,Thriller,War,Western,plot_lang
3759,"""Air Warriors"" (2014) {Harrier (#5.1)}","It's sleek, powerful, fast, and innovative, w...",0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,en
66259,"""Mission: Impossible"" (1966) {The Bargain (#3....","A former dictator, now in exile in Miami, pla...",1,0,1,0,0,0,1,0,...,0,0,0,0,0,0,1,0,0,en
53300,"""Jjang!"" (2012) {U-KISS/ZE:A Interviews (#1.55)}",Happy Holidays from JJANG! This week on JJANG...,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,en
39564,"""Fun Farm"" (2014)",52x7' Fun Farm is a very unique place. A farm...,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,en
43992,"""Happy Tree Friends"" (1999/II) {Change of Hear...",An emergency heart transplant for Disco Bear ...,0,0,0,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,en


### Negative sampling

The Object2Vec algorithm is set up here as a binary classification problem. The positive examples are the (movie, genre) pairs present in the dataset. To train the algorithm, we also need to provide negative examples. One option is to add every genre to which the movie does not belong. However, this strategy creates a highly skewed dataset with a large percentage of negative examples, as there are 28 genre columns. Instead we choose to have 5 negative examples per positive example, a ratio that has been reported in related work such as word2vec.

Let's look at the class distribution and figure out how much we should down-sample the negative examples to reach that ratio of 5 negatives per positive.

In [19]:
train.columns

Index(['title', 'plot', 'Action', 'Adult', 'Adventure', 'Animation',
       'Biography', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Family',
       'Fantasy', 'Game-Show', 'History', 'Horror', 'Music', 'Musical',
       'Mystery', 'News', 'Reality-TV', 'Reality-tv', 'Romance', 'Sci-Fi',
       'Short', 'Sport', 'Talk-Show', 'Thriller', 'War', 'Western',
       'plot_lang'],
      dtype='object')

In [18]:
genres = list(train.columns)[2:-1]
genres


['Action',
 'Adult',
 'Adventure',
 'Animation',
 'Biography',
 'Comedy',
 'Crime',
 'Documentary',
 'Drama',
 'Family',
 'Fantasy',
 'Game-Show',
 'History',
 'Horror',
 'Music',
 'Musical',
 'Mystery',
 'News',
 'Reality-TV',
 'Reality-tv',
 'Romance',
 'Sci-Fi',
 'Short',
 'Sport',
 'Talk-Show',
 'Thriller',
 'War',
 'Western']

In [20]:
print ("Number of genres: ", len(genres))

Number of genres:  28


In [23]:


# create a dictionary for df aggregation; each value is the name of the aggregation function
agg = {genre:'sum' for genre in genres}
type(agg)
agg

{'Action': 'sum',
 'Adult': 'sum',
 'Adventure': 'sum',
 'Animation': 'sum',
 'Biography': 'sum',
 'Comedy': 'sum',
 'Crime': 'sum',
 'Documentary': 'sum',
 'Drama': 'sum',
 'Family': 'sum',
 'Fantasy': 'sum',
 'Game-Show': 'sum',
 'History': 'sum',
 'Horror': 'sum',
 'Music': 'sum',
 'Musical': 'sum',
 'Mystery': 'sum',
 'News': 'sum',
 'Reality-TV': 'sum',
 'Reality-tv': 'sum',
 'Romance': 'sum',
 'Sci-Fi': 'sum',
 'Short': 'sum',
 'Sport': 'sum',
 'Talk-Show': 'sum',
 'Thriller': 'sum',
 'War': 'sum',
 'Western': 'sum'}

In [24]:
agg_by_genre = train.agg(agg)
agg_by_genre

Action          7297
Adult             70
Adventure       6103
Animation       6924
Biography        851
Comedy         20052
Crime           8982
Documentary     7447
Drama          27341
Family          9117
Fantasy         4466
Game-Show       1218
History         1737
Horror          1516
Music           1681
Musical          428
Mystery         6828
News            2544
Reality-TV      7360
Reality-tv         0
Romance        11631
Sci-Fi          5105
Short            472
Sport           1282
Talk-Show       3510
Thriller        5129
War              839
Western         1503
dtype: int64

In [25]:
total_positive_samples = agg_by_genre.sum()
total_positive_samples

151433

In [26]:
total_negative_samples = len(train)*len(genres) - total_positive_samples
total_negative_samples



1817135

In [27]:
NEGATIVE_TO_POSITIVE_RATIO = 5
sampling_percent = NEGATIVE_TO_POSITIVE_RATIO * total_positive_samples / total_negative_samples
sampling_percent


0.4166806538864751

In [28]:

print("total positive examples: ", total_positive_samples)
print("total negative samples", total_negative_samples)
print("negative sampling needed: ", sampling_percent )

total positive examples:  151433
total negative samples 1817135
negative sampling needed:  0.4166806538864751
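The arithmetic behind the sampling fraction can be checked in isolation. The sketch below reuses the two totals printed above and also shows what ratio results when the fraction is rounded down to 0.4, as done in the tokenization step; the variable names are chosen for this example only.

```python
total_positive = 151433      # positive (movie, genre) pairs in the training split, from the output above
total_negative = 1817135     # all remaining (movie, genre) pairs in the training split

target_ratio = 5             # desired negatives per positive
exact_fraction = target_ratio * total_positive / total_negative

# Expected number of sampled negatives, and the resulting ratio, if the fraction is rounded to 0.4.
rounded_fraction = 0.4
expected_negatives = rounded_fraction * total_negative
effective_ratio = expected_negatives / total_positive

print(round(exact_fraction, 4), round(effective_ratio, 2))
```

Because each negative pair is kept independently with probability equal to the fraction, the realised count varies slightly from run to run around the expected value.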


### Tokenization

Now we can proceed to create the tokenized jsonlines dataset for the training, validation and test partitions. We will use a negative sampling fraction of 0.4 for the training set, and keep all the negatives for the validation and test sets.

In [29]:
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [30]:

def tokenize(df, vocab, filename, negative_frac=1.0, use_stopwords=False):
    # Rename the columns so that they are valid python identifiers
    df = df.rename(lambda x:x.replace("-", "_") ,axis='columns')

    genres = list(df.columns)[2:-1]
    max_seq_length = 0
    total = len(df)
    stop_words = set()

    if use_stopwords:
        stop_words = set(stopwords.words('english'))

    with jsonlines.open(filename, mode='w') as writer:
        for j, row in enumerate(df.itertuples()):
            tokens = [token for token in tokenize_plot_summary(row.plot) if token not in stop_words]

            plot_token_ids = [vocab[token] if token in vocab else vocab[UNKNOWN] for token in tokens]
            for i, genre in enumerate(genres):
                label = getattr(row, genre)

                # here we also consider generating of negative samples 
                if label == 1 or np.random.rand() < negative_frac:
                    # All positive examples and fraction of negative examples are picked.
                    writer.write({"in0": plot_token_ids, "in1": [i], "label": label})
                    
            max_seq_length = max(len(plot_token_ids), max_seq_length)
            if (j+1)%1000==0:
                sys.stdout.write(".")
                sys.stdout.flush()
        print("Finished tokenizing data. Max sequence length of the tokenized data: ", max_seq_length)

In [37]:
tokenize(df=train, vocab=vocab, filename="tokenized_movie_genres_train.jsonl", negative_frac=0.4, use_stopwords=True)

......................................................................Finished tokenizing data. Max sequence length of the tokenized data:  1192


In [38]:
tokenize(df=validation, vocab=vocab, filename="tokenized_movie_genres_validation.jsonl", use_stopwords=True)

.................Finished tokenizing data. Max sequence length of the tokenized data:  1465


In [39]:
tokenize(df=test, vocab=vocab, filename="tokenized_movie_genres_test.jsonl", use_stopwords=True)

...........................................Finished tokenizing data. Max sequence length of the tokenized data:  998
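Each line written by `tokenize` is one JSON object with the keys `in0` (plot token ids), `in1` (a one-element list holding the genre index) and `label`. The sketch below shows what two such lines look like and how they round-trip through the standard `json` module; the token ids and genre indices are made up for illustration.

```python
import json

# Hypothetical token ids and genre indices, for illustration only.
records = [
    {"in0": [12, 845, 3], "in1": [5], "label": 1},   # plot tokens paired with a genre that applies
    {"in0": [12, 845, 3], "in1": [8], "label": 0},   # same plot paired with a sampled negative genre
]

# JSON Lines: one JSON object per line.
text = "\n".join(json.dumps(record) for record in records)
print(text)

# Reading the lines back recovers the original records.
parsed = [json.loads(line) for line in text.splitlines()]
```

Note that the plot token ids are repeated on every line for the same movie, which is why the training file grows with the number of sampled negatives.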


For better performance, the training dataset needs to be shuffled.

In [40]:
!shuf tokenized_movie_genres_train.jsonl > tokenized_movie_genres_train_shuffled.jsonl
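On systems where the `shuf` utility is not available, the same line-level shuffle can be done in Python. This is a minimal sketch that loads the whole file into memory; the function name `shuffle_lines` and its `seed` parameter are assumptions for this example.

```python
import random

def shuffle_lines(src_path, dst_path, seed=None):
    """Write the lines of src_path to dst_path in random order (loads the whole file into memory)."""
    with open(src_path, "r", encoding="utf-8") as src:
        lines = src.readlines()
    # Guard against a missing final newline so that no two records are merged.
    if lines and not lines[-1].endswith("\n"):
        lines[-1] += "\n"
    random.Random(seed).shuffle(lines)
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(lines)
```

It could be called as `shuffle_lines("tokenized_movie_genres_train.jsonl", "tokenized_movie_genres_train_shuffled.jsonl", seed=42)`; fixing the seed makes the shuffled order reproducible between runs.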

### Download pretrained word embeddings

We will make use of pretrained word embeddings from https://nlp.stanford.edu/projects/glove/. 

<div class="alert alert-warning">
Important: Before you begin downloading, please read the following license and make sure you are okay with it.
https://opendatacommons.org/licenses/pddl/1.0/
</div>

In [41]:
# !mkdir /tmp/glove
# !wget -P /tmp/glove/ http://nlp.stanford.edu/data/glove.840B.300d.zip
# !unzip -d /tmp/glove /tmp/glove/glove.840B.300d.zip
# !rm /tmp/glove/glove.840B.300d.zip

--2020-05-23 17:22:21--  http://nlp.stanford.edu/data/glove.840B.300d.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)...171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80...connected.
HTTP request sent, awaiting response...302 Found
Location: https://nlp.stanford.edu/data/glove.840B.300d.zip [following]
--2020-05-23 17:22:22--  https://nlp.stanford.edu/data/glove.840B.300d.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443...connected.
HTTP request sent, awaiting response...301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.840B.300d.zip [following]
--2020-05-23 17:22:25--  http://downloads.cs.stanford.edu/nlp/data/glove.840B.300d.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)...171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80...connected.
HTTP request sent, awaiting response...200 OK
Length: 2176768927 (2.0G) [application/zip]
Saving

## Sagemaker Training

Let us start by defining some configuration values.

In [42]:
bucket='md-backup-bucket-01' # customize to your bucket

In [43]:
prefix = 'object2vec-movie-genre-prediction'

container = get_image_uri(boto3.Session().region_name, 'object2vec')
container

train_s3_path = "s3://{}/{}/data/train/".format(bucket, prefix)
validation_s3_path = "s3://{}/{}/data/validation/".format(bucket, prefix)
test_s3_path = "s3://{}/{}/data/test/".format(bucket, prefix)
auxiliary_s3_path = "s3://{}/{}/data/auxiliary/".format(bucket, prefix)
prediction_s3_path = "s3://{}/{}/predictions/".format(bucket, prefix)

In [44]:
container


'404615174143.dkr.ecr.us-east-2.amazonaws.com/object2vec:1'

### Upload data to S3

In [45]:
!aws s3 cp tokenized_movie_genres_train_shuffled.jsonl {train_s3_path}
!aws s3 cp tokenized_movie_genres_validation.jsonl {validation_s3_path}
!aws s3 cp tokenized_movie_genres_test.jsonl {test_s3_path}


In [46]:
!aws s3 cp vocab.json {auxiliary_s3_path}
!aws s3 cp /tmp/glove/glove.840B.300d.txt {auxiliary_s3_path}

upload: ./vocab.json to s3://md-backup-bucket-01/object2vec-movie-genre-prediction/data/auxiliary/vocab.json

### Training hyperparameters

Object2Vec is a highly customizable algorithm, so it has quite a few hyperparameters. Let's review some of the important ones:

* **enc_dim**: The dimension of the encoder. Both the movie plot description and genre embeddings are mapped to this dimension. 
* **mlp_dim**: The dimension of the output from multilayer perceptron (MLP) layers.
* **mlp_activation**: Type of activation function for the multilayer perceptron (MLP) layer.
* **mlp_layers**: The number of multilayer perceptron (MLP) layers in the network.
* **output_layer**: The type of output layer. We choose 'softmax' as it is a classification problem.
* **bucket_width**: The allowed difference between data sequence length when bucketing is enabled. Bucketing is enabled when a non-zero value is specified for this parameter.
* **num_classes**: The number of classes for classification training, which is 2 for our case.
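
To make the `output_layer` and `num_classes` choices concrete, here is a minimal sketch of a two-class softmax followed by thresholding. The logit values are made up for illustration, and the snippet is not Object2Vec's internal code; it only shows why the evaluation later compares the second score against a threshold.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw outputs of the final layer for one (plot, genre) pair.
logits = [0.3, 1.9]
scores = softmax(logits)

# scores[0] ~ P(genre does not apply), scores[1] ~ P(genre applies)
genre_applies = scores[1] > 0.5
print(scores, genre_applies)
```

With `num_classes` set to 2, the two scores sum to 1, so thresholding `scores[1]` is enough to decide whether the genre applies.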

**enc0** encodes the movie plot description, which is a sequence of tokens, and **enc1** encodes the movie genre, which is a single token. The encoder parameters are:

* **max_seq_len**: The maximum sequence length that will be considered. Any input tokens beyond max_seq_len will be truncated and ignored. We choose a value of 500 for enc0 and 2 for enc1.
* **network**: Network model. We choose hcnn for both enc0 and enc1.
* **cnn_filter_width**: The filter width of the hcnn encoder.
* **layers**: The number of layers. We choose 2 layers for enc0, as we want to capture richer structures in the movie plot description which is a sequence input. For enc1, we choose 1 layer.
* **token_embedding_dim**: The output dimension of the token embedding layer. We choose a dimension of 300 for enc0, consistent with the dimension of the GloVe embeddings. For enc1, we choose 10.
* **pretrained_embedding_file**: The filename of the pretrained token embedding file present in the auxiliary data channel. We use the GloVe embeddings for enc0. For enc1, the embeddings will be learned by the algorithm.
* **freeze_pretrained_embedding**: Whether to freeze the pretrained embedding weights. We set this to True for enc0.
* **vocab_file**: The vocabulary file for mapping pretrained token embeddings to vocabulary IDs. This is specified only for enc0, as we use pretrained embeddings only for enc0.
* **vocab_size**: The vocabulary size of the tokens. For enc0, it is the number of words appearing in the dataset. For enc1, it is the number of genres.
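
Since `enc0_token_embedding_dim` should agree with the pretrained embedding file, a quick sanity check before training can save a failed job. The sketch below is an illustration, not part of the original notebook: `embedding_dim` is a hypothetical helper, and it assumes the usual GloVe text layout of one token followed by space-separated floats per line.

```python
import os
import tempfile

def embedding_dim(path):
    """Return the vector length found on the first line of a GloVe-style text file.

    Assumes each line is '<token> <v1> <v2> ... <vN>' separated by single spaces;
    a token that itself contains spaces would be miscounted.
    """
    with open(path, "r", encoding="utf-8") as f:
        first = f.readline().rstrip("\n").split(" ")
    return len(first) - 1

# Tiny stand-in file so the sketch runs without the multi-gigabyte GloVe download.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("the 0.1 0.2 0.3\n")
    tmp.write("movie 0.4 0.5 0.6\n")
    toy_path = tmp.name

toy_dim = embedding_dim(toy_path)
os.remove(toy_path)
print(toy_dim)
```

In the notebook, calling the helper on `/tmp/glove/glove.840B.300d.txt` would be expected to agree with the value of 300 used for `enc0_token_embedding_dim`.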

In [47]:
hyperparameters = {
 'enc_dim': 4096, 
 'mlp_dim': 512, 
 'mlp_activation': 'relu', 
 'mlp_layers': 2, 
 'output_layer': 'softmax',
 'bucket_width': 10, 
 'num_classes': 2,
 
 'mini_batch_size': 256,
 
 'enc0_max_seq_len': 500,
 'enc1_max_seq_len': 2,
 
 'enc0_network': 'hcnn',
 'enc1_network': 'hcnn',
    
 'enc0_layers': '2',
 'enc1_layers': '1',
    
 'enc0_cnn_filter_width': 2,
 'enc1_cnn_filter_width': 1,
 
 'enc0_token_embedding_dim': 300,
 'enc1_token_embedding_dim': 10,
 
 'enc0_pretrained_embedding_file' : "glove.840B.300d.txt",
 
 'enc0_freeze_pretrained_embedding': 'true',
 
 'enc0_vocab_file': 'vocab.json',
 'enc1_vocab_file': '',
 
 'enc0_vocab_size': len(vocab),
 'enc1_vocab_size': len(genres),
}


<div class="alert alert-warning">
Note that training takes approximately 1.5 hours on the ml.p2.8xlarge instance type. The estimator below is configured with ml.p3.8xlarge, so the actual training time may differ.
</div>


In [49]:
o2v = sagemaker.estimator.Estimator(container,
                                    # get_execution_role(),
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type='ml.p3.8xlarge',
                                    output_path="s3://{}/{}/output".format(bucket, prefix),
                                   )


In [50]:
                                   
o2v.set_hyperparameters(**hyperparameters)
input_data = {
    "train": s3_input(train_s3_path, content_type="application/jsonlines"),
    "validation": s3_input(validation_s3_path, content_type="application/jsonlines"),
    "auxiliary": s3_input(auxiliary_s3_path)
}
o2v.fit(input_data)

20 21:45:44 INFO 140260686526272] Epoch: 5, batches: 2600, num_examples: 665600, 6346.9 samples/sec, epoch time so far: 0:01:44.870096
[05/23/2020 21:45:44 INFO 140260686526272] Training metrics: perplexity: 1.052 cross_entropy: 0.051 accuracy: 0.981
[05/23/2020 21:45:48 INFO 140260686526272] Epoch: 5, batches: 2700, num_examples: 691200, 6330.2 samples/sec, epoch time so far: 0:01:49.191039
[05/23/2020 21:45:48 INFO 140260686526272] Training metrics: perplexity: 1.052 cross_entropy: 0.051 accuracy: 0.980
[05/23/2020 21:45:52 INFO 140260686526272] Epoch: 5, batches: 2800, num_examples: 716800, 6334.0 samples/sec, epoch time so far: 0:01:53.167647
[05/23/2020 21:45:52 INFO 140260686526272] Training metrics: perplexity: 1.053 cross_entropy: 0.051 accuracy: 0.980
[05/23/2020 21:45:56 INFO 140260686526272] Epoch: 5, batches: 2900, num_examples: 742400, 6325.2 samples/sec, epoch time so far: 0:01:57.372203
[05/23/

## Evaluation with Batch inference

<div class="alert alert-warning">
Note that batch inference takes approximately 30 minutes on the ml.p2.8xlarge instance type. The transformer below is configured with ml.p3.8xlarge, so the actual time may differ.
</div>


In [51]:
transformer = o2v.transformer(instance_count=1, 
                              instance_type="ml.p3.8xlarge", 
                              output_path=prediction_s3_path)
transformer.transform(data=test_s3_path, content_type="application/jsonlines", split_type="Line")
transformer.wait()

140050967226176] module data shapes:[DataDesc[source,(16L, 460L),<type 'numpy.float32'>,NTC], DataDesc[target,(16L, 10L),<type 'numpy.float32'>,NTC]]
[05/23/2020 22:04:24 INFO 140050967226176] module label shapes:None
[05/23/2020 22:04:24 INFO 140050967226176] data iter data shapes:[DataDesc[source,(16, 310L),<type 'numpy.float32'>,NTC], DataDesc[target,(16, 10L),<type 'numpy.float32'>,NTC]]
[05/23/2020 22:04:24 INFO 140050967226176] data iter label shapes:None
#metrics {"Metrics": {"model.evaluate.time": {"count": 1, "max": 3163.2018089294434, "sum": 3163.2018089294434, "min": 3163.2018089294434}, "json.encoder.time": {"count": 1, "max": 842.4201011657715, "sum": 842.4201011657715, "min": 842.4201011657715}, "jsonlines_bucket_iterator.time": {"count": 1, "max": 327.84485816955566, "sum": 327.84485816955566, "min": 327.84485816955566}, "invocations.count": {"count": 1, "max": 1, "sum": 1.0, "min": 1}}, "EndTime": 1590271463.695261, "Dimensions": {"Ho

Download the predictions from S3 to perform the evaluation.

In [53]:
!aws s3 cp --recursive {prediction_s3_path} /home/ec2-user/SageMaker/AWS-ML-Certification/__my_study/pyspark_mnist/

download: s3://md-backup-bucket-01/object2vec-movie-genre-prediction/predictions/tokenized_movie_genres_test.jsonl.out to ../../pyspark_mnist/tokenized_movie_genres_test.jsonl.out


In [54]:
def evaluate(filename, predictions, genre_dict, threshold=0.5):
    """Compute per-genre confusion counts and metrics from the batch transform output."""
    metrics = {g: {"genre": g, "tp": 0, "tn": 0, "fp": 0, "fn": 0} for g in genre_dict.values()}
    with jsonlines.open(filename, "r") as reader, jsonlines.open(predictions, "r") as pred_reader:
        # Row i of the predictions file corresponds to row i of the test file.
        for row, pred in zip(reader, pred_reader):
            predicted = pred["scores"][1] > threshold  # score of the positive class
            actual = row["label"] == 1
            g = genre_dict[row["in1"][0]]  # in1 holds the single genre token id
            if predicted:
                metrics[g]["tp" if actual else "fp"] += 1
            else:
                metrics[g]["fn" if actual else "tn"] += 1
    summary = pd.DataFrame(list(metrics.values())).set_index('genre')
    # Column-wise arithmetic; a genre with no predicted positives yields NaN precision.
    summary['accuracy'] = (summary.tp + summary.tn) / (summary.tp + summary.tn + summary.fp + summary.fn)
    summary['precision'] = summary.tp / (summary.tp + summary.fp)
    summary['recall'] = summary.tp / (summary.tp + summary.fn)
    summary['f1'] = 2 * (summary.precision * summary.recall) / (summary.precision + summary.recall)
    return summary

In [55]:
genre_dict = {i: genre for i, genre in enumerate(genres)}
summary = evaluate("tokenized_movie_genres_test.jsonl", "tokenized_movie_genres_test.jsonl.out", genre_dict, threshold=0.6)
summary

| genre | fn | fp | tn | tp | accuracy | precision | recall | f1 |
|---|---|---|---|---|---|---|---|---|
| Action | 1698 | 1006 | 37787 | 2795 | 0.937532 | 0.735333 | 0.622079 | 0.673981 |
| Adult | 26 | 2 | 43242 | 16 | 0.999353 | 0.888889 | 0.380952 | 0.533333 |
| Adventure | 1734 | 996 | 38543 | 2013 | 0.936931 | 0.668993 | 0.53723 | 0.595915 |
| Animation | 823 | 1116 | 37906 | 3441 | 0.955205 | 0.755102 | 0.806989 | 0.780184 |
| Biography | 498 | 29 | 42734 | 25 | 0.987825 | 0.462963 | 0.047801 | 0.086655 |
| Comedy | 1635 | 5801 | 25192 | 10658 | 0.828212 | 0.647548 | 0.866997 | 0.741375 |
| Crime | 1249 | 1697 | 36085 | 4255 | 0.931941 | 0.714886 | 0.773074 | 0.742842 |
| Documentary | 1249 | 1934 | 36787 | 3316 | 0.926466 | 0.631619 | 0.726396 | 0.6757 |
| Drama | 1944 | 4287 | 22200 | 14855 | 0.85605 | 0.776042 | 0.884279 | 0.826633 |
| Family | 1620 | 2028 | 35631 | 4007 | 0.915723 | 0.66396 | 0.712102 | 0.687189 |
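
As a quick check on how the metric columns follow from the confusion counts, the snippet below recomputes the Action row by hand. It only uses the four counts printed in the table above and plain Python arithmetic, so it should reproduce that row to the displayed precision.

```python
# Confusion counts copied from the Action row of the table above.
tp, fp, tn, fn = 2795, 1006, 37787, 1698

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)          # of the predicted positives, how many were right
recall = tp / (tp + fn)             # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 6), round(precision, 6), round(recall, 6), round(f1, 6))
```

Note how Biography has a high accuracy but very low recall: with so many true negatives, accuracy alone says little about how well a rare genre is detected.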


In [56]:
tp_sum = summary["tp"].sum()
fp_sum = summary["fp"].sum()
tn_sum = summary["tn"].sum()
fn_sum = summary["fn"].sum()
precision = (tp_sum) / (tp_sum + fp_sum)
recall = (tp_sum) / (tp_sum + fn_sum)

print("Accuracy: ", (tp_sum + tn_sum) / (tp_sum + fp_sum + tn_sum + fn_sum))
print("Micro Precision: ", precision)
print("Micro Recall: ", recall)
print("Micro F1: ", 2*precision*recall/(precision + recall))

Accuracy:  0.954885611316097
Micro Precision:  0.6930519264333491
Micro Recall:  0.7390640129101668
Micro F1:  0.7153188143967596
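
The aggregate figures above pool the confusion counts over all genres before dividing, which is the usual definition of micro-averaging. Written out, with $g$ ranging over the genres:

$$
P_{\text{micro}} = \frac{\sum_g TP_g}{\sum_g TP_g + \sum_g FP_g},
\qquad
R_{\text{micro}} = \frac{\sum_g TP_g}{\sum_g TP_g + \sum_g FN_g},
\qquad
F1_{\text{micro}} = \frac{2\, P_{\text{micro}}\, R_{\text{micro}}}{P_{\text{micro}} + R_{\text{micro}}}
$$

Plugging in the printed values, $2 \cdot 0.6931 \cdot 0.7391 / (0.6931 + 0.7391) \approx 0.7153$, which agrees with the reported Micro F1. Because true negatives dominate the counts, the overall accuracy is much higher than the micro F1, so the F1 figure is probably the more informative of the two here.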


We compared the performance with [fastText](https://fasttext.cc/). fastText does not perform multi-label predictions, so for a fair comparison we trained 28 binary classification models with fastText, one for each movie genre, and combined the results of the individual predictors. While training the fastText models we set **wordNgrams** to 2, **dim** to 300 and **pretrainedVectors** to the GloVe embeddings.

<img src="comparison.png">

## Online inference demo

In this section we set up an online inference endpoint and perform inference for a few recently released movies.

In [57]:
predictor = o2v.deploy(initial_instance_count=1, instance_type="ml.m4.xlarge")

Using already existing model: object2vec-2020-05-23-21-16-44-612
--------------!

In [58]:
def get_movie_genre_predictions(movie_summary, genre_dict, vocab, predictor, threshold=0.5):
    # Map each plot token to its vocabulary id; out-of-vocabulary tokens map to UNKNOWN.
    plot_token_ids = [vocab[token] if token in vocab else vocab[UNKNOWN] for token in tokenize_plot_summary(movie_summary)]

    # One instance per genre: the same plot paired with each candidate genre id.
    batch = [{"in0": plot_token_ids, "in1": [genre_id]} for genre_id in range(len(genre_dict))]

    request = {"instances": batch}
    response = predictor.predict(data=json.dumps(request))

    scores = [score["scores"] for score in json.loads(response)["predictions"]]

    # Keep the genres whose positive-class score exceeds the threshold.
    predictions = [genre_dict[i] for i, score in enumerate(scores) if score[1] > threshold]

    return predictions
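
To see the request and response shapes without a live endpoint, the sketch below builds the same kind of payload from toy data. Everything named `toy_*`, the `UNKNOWN_TOKEN` value, and the hard-coded response are hypothetical stand-ins; only the `instances` / `in0` / `in1` / `predictions` / `scores` structure is taken from the function above.

```python
import json

# Toy stand-ins for the notebook's vocab and genre_dict (hypothetical values).
toy_vocab = {"<unk>": 0, "a": 1, "space": 2, "adventure": 3}
toy_genres = {0: "Action", 1: "Comedy", 2: "Sci-Fi"}
UNKNOWN_TOKEN = "<unk>"  # assumed name; the notebook uses its own UNKNOWN constant

def build_request(tokens, vocab, genre_dict):
    """Pair one tokenized plot with every genre id, one instance per genre."""
    plot_ids = [vocab.get(tok, vocab[UNKNOWN_TOKEN]) for tok in tokens]
    return {"instances": [{"in0": plot_ids, "in1": [gid]} for gid in range(len(genre_dict))]}

request = build_request(["a", "space", "opera"], toy_vocab, toy_genres)
print(json.dumps(request, indent=2))

# Hypothetical response in the shape the function parses: one score pair per instance.
response = {"predictions": [{"scores": [0.2, 0.8]}, {"scores": [0.9, 0.1]}, {"scores": [0.3, 0.7]}]}
picked = [toy_genres[i] for i, p in enumerate(response["predictions"]) if p["scores"][1] > 0.5]
print(picked)
```

Sending all genres in a single request keeps the demo to one endpoint call per movie, at the cost of repeating the plot token ids once per genre in the payload.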

In [59]:
star_trek = "Ten years before Kirk, Spock and the Enterprise, the USS Discovery discovers new worlds and lifeforms \
as one Starfleet officer learns to understand all things alien."

get_movie_genre_predictions(star_trek, genre_dict, vocab, predictor)

['Action', 'Adventure', 'Sci-Fi']

In [60]:
nun = "A priest with a haunted past and a novice on the threshold of her final vows are sent by the Vatican \
to investigate the death of a young nun in Romania and confront a malevolent force in the form of a demonic nun."

get_movie_genre_predictions(nun, genre_dict, vocab, predictor)

['Drama', 'Fantasy', 'Horror', 'Mystery', 'Thriller']

In [61]:
fantastic_beasts = "The second installment of the 'Fantastic Beasts' series set in J.K. Rowling's Wizarding World \
featuring the adventures of magizoologist Newt Scamander."

get_movie_genre_predictions(fantastic_beasts, genre_dict, vocab, predictor)

['Animation']

In [62]:
predictor.delete_endpoint()