## Data wrangling

In [1]:
dataset = 'cornell movie-dialogs corpus'

These are the different files found in the dataset. Let's see what's inside them.

In [2]:
with open(f'{dataset}/movie_lines.txt', encoding='utf-8') as f:
    lines = f.readlines()

with open(f'{dataset}/movie_conversations.txt', encoding='utf-8') as f:
    conversations = f.readlines()

with open(f'{dataset}/movie_titles_metadata.txt', encoding='latin-1') as f:
    titles = f.readlines()

with open(f'{dataset}/movie_characters_metadata.txt', encoding='latin-1') as f:
    characters = f.readlines()

The lines and conversations files are connected through line IDs: each conversation lists the IDs of the lines it contains.

> According to the dataset documentation, these files contain:
* 220,579 conversational exchanges between 10,292 pairs of movie characters
* involves 9,035 characters from 617 movies
* in total 304,713 utterances

The titles file also provides:

> Movie metadata included:
* genres
* release year
* IMDB rating
* number of IMDB votes

And the characters file provides:

> Character metadata included:
* gender (for 3,774 characters)
* position on movie credits (3,321 characters)


For easier visualization, we'll load the files into pandas dataframes, starting with the lines.

The columns in each row of the dataset are separated by the string ` +++$+++ `, so we'll pass that as the separator to the split function. We'll also strip the trailing `\n` from each line.
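As a quick illustration of that split, here is a minimal sketch that parses a single raw line by hand. The helper `parse_line` and the constant `FIELD_SEP` are names made up for this example, and the sample string is reconstructed from the first row of the lines dataframe, so treat the exact raw layout as an assumption.

```python
# Hypothetical helper: split one raw corpus line into its fields.
FIELD_SEP = ' +++$+++ '

def parse_line(raw):
    """Strip the trailing newline, then split on the corpus separator."""
    return raw.rstrip('\n').split(FIELD_SEP)

# Sample reconstructed from the first row of the lines file (assumed layout)
sample = 'L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!\n'
fields = parse_line(sample)
print(fields)
```

The pandas version in the notebook does the same thing column-wise, using an escaped regular expression because `+` and `$` are special characters in regex patterns.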

In [3]:
import re
import ast
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

import joblib


In [4]:
def clean_text(text):  # adapted from https://github.com/REDFOX1899/Chatbot/blob/master/Chatbot.py
    text = text.lower()
    text = re.sub(r"i'm", "i am", text)
    text = re.sub(r"he's", "he is", text)
    text = re.sub(r"she's", "she is", text)
    text = re.sub(r"that's", "that is", text)
    text = re.sub(r"what's", "what is", text)
    text = re.sub(r"where's", "where is", text)
    text = re.sub(r"how's", "how is", text)
    text = re.sub(r"\'ll", " will", text)
    text = re.sub(r"\'ve", " have", text)
    text = re.sub(r"\'re", " are", text)
    text = re.sub(r"\'d", " would", text)
    # the specific negations must be expanded before the generic "n't" rule,
    # otherwise "won't" becomes "wo not" and "can't" becomes "ca not"
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"can't", "cannot", text)
    text = re.sub(r"n't", " not", text)
    # remove punctuation and other symbols
    text = re.sub(r"[-()\"#/@;:<>{}`+=~|.!?,]", "", text)

    # some common encoding mistakes found in this dataset
    text = re.sub(r"youíre", "you are", text)
    text = re.sub(r"óó", "", text)

    return text
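When chaining `re.sub` rules like these, the order of the rules matters: a generic pattern such as `n't` will also match inside `won't` and `can't`. The sketch below is only an illustration of that effect; `expand_negations` and its `specific_first` flag are hypothetical names, not part of the notebook.

```python
import re

def expand_negations(text, specific_first=True):
    """Expand negative contractions; `specific_first` toggles the rule order."""
    specific = [(r"won't", "will not"), (r"can't", "cannot")]
    generic = [(r"n't", " not")]
    rules = specific + generic if specific_first else generic + specific
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text

sentence = "i won't go and i can't stay"
good = expand_negations(sentence)                       # specific rules first
bad = expand_negations(sentence, specific_first=False)  # generic rule first
print(good)
print(bad)
```

With the generic rule first, the specific patterns never get a chance to match, which leaves fragments such as "wo" and "ca" in the text.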

In [5]:
# create dataframe with lines
df_lines = pd.DataFrame({'line_text': lines})

# split into columns
df_lines = df_lines['line_text'].str.split(r' \+\+\+\$\+\+\+ ', expand=True)
df_lines.columns = ['line_ID', 'speaker_ID', 'movie','speaker','text']

# strip trailing whitespace (including the newline) from every column
df_lines = df_lines.apply(lambda col: col.str.rstrip())

# Add column clean text
df_lines['clean_text'] = df_lines['text'].apply(clean_text)

#might be needed?
# # add column speaker + text
# df_lines['line'] = df_lines['speaker'] + ": " + df_lines['clean_text']

df_lines.head(5)


Unnamed: 0,line_ID,speaker_ID,movie,speaker,text,clean_text
0,L1045,u0,m0,BIANCA,They do not!,they do not
1,L1044,u2,m0,CAMERON,They do to!,they do to
2,L985,u0,m0,BIANCA,I hope so.,i hope so
3,L984,u2,m0,CAMERON,She okay?,she okay
4,L925,u0,m0,BIANCA,Let's go.,let's go


Now let's do the same thing with all the other txt files

In [6]:
# create dataframe with conversations
df_conv = pd.DataFrame({'conv': conversations})

# split into columns
df_conv = df_conv['conv'].str.split(r' \+\+\+\$\+\+\+ ', expand=True)
df_conv.columns = ['speaker1_ID', 'speaker2_ID', 'movie_ID','lines_list']

# delete new line character
df_conv['lines_list'] = df_conv['lines_list'].str.rstrip('\n')

# set lines_list to list type
df_conv['lines_list'] = df_conv['lines_list'].apply(ast.literal_eval)

df_conv.head(5)

Unnamed: 0,speaker1_ID,speaker2_ID,movie_ID,lines_list
0,u0,u2,m0,"[L194, L195, L196, L197]"
1,u0,u2,m0,"[L198, L199]"
2,u0,u2,m0,"[L200, L201, L202, L203]"
3,u0,u2,m0,"[L204, L205, L206]"
4,u0,u2,m0,"[L207, L208]"
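The conversion of `lines_list` from text to a real list can be checked on a single value. This is a small sketch, assuming the raw field is written as a Python list literal (which is what `ast.literal_eval` requires); the sample string is an assumption based on the first row above.

```python
import ast

# Assumed raw value of the last field of a conversation row
raw = "['L194', 'L195', 'L196', 'L197']\n"

# literal_eval only parses Python literals, so it is safer than eval()
line_ids = ast.literal_eval(raw.rstrip('\n'))
print(type(line_ids).__name__, line_ids)
```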


In [7]:
# create dataframe with movie titles
df_title = pd.DataFrame({'title': titles})

# split into columns
df_title = df_title['title'].str.split(r' \+\+\+\$\+\+\+ ', expand=True)
df_title.columns = ['movie_ID','title','year','IMBD_rating','IMBD_votes','genres']

# cast types to what they are
df_title['IMBD_rating'] = df_title['IMBD_rating'].astype(float)
df_title['IMBD_votes'] = df_title['IMBD_votes'].astype(int)
# Clean 'year' column using regex (for cases like ' 1989/I ')
df_title['year'] = df_title['year'].apply(lambda x: re.sub(r'\D', '', x))  # \D matches any non-digit character
df_title['year'] = df_title['year'].astype(int)

# delete new line character
df_title['genres'] = df_title['genres'].str.rstrip('\n')

# set genres_list to list type
df_title['genres'] = df_title['genres'].apply(ast.literal_eval)
df_title.head(5)

Unnamed: 0,movie_ID,title,year,IMBD_rating,IMBD_votes,genres
0,m0,10 things i hate about you,1999,6.9,62847,"[comedy, romance]"
1,m1,1492: conquest of paradise,1992,6.2,10421,"[adventure, biography, drama, history]"
2,m2,15 minutes,2001,6.1,25854,"[action, crime, drama, thriller]"
3,m3,2001: a space odyssey,1968,8.4,163227,"[adventure, mystery, sci-fi]"
4,m4,48 hrs.,1982,6.9,22289,"[action, comedy, crime, drama, thriller]"
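The year clean-up can be tried in isolation. This is a minimal sketch; `clean_year` is a hypothetical helper, and the sample values are the kind of input mentioned in the code comment above.

```python
import re

def clean_year(raw_year):
    """Drop every non-digit character, then cast what is left to int."""
    return int(re.sub(r'\D', '', raw_year))

print(clean_year(' 1989/I '))
print(clean_year('1999'))
```

Note that this approach keeps all digits, so it only works when the extra characters around the year contain no digits themselves.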


In [8]:
# create dataframe with characters
df_chars = pd.DataFrame({'characters': characters})

# split into columns
df_chars = df_chars['characters'].str.split(r' \+\+\+\$\+\+\+ ', expand=True)
df_chars.columns = ['character_ID','name','movie_ID','movie_title','gender','credits_pos']

# delete new line character
df_chars['credits_pos'] = df_chars['credits_pos'].str.rstrip('\n')

# cast credits_pos to int (-1 if unknown)
df_chars.loc[df_chars['credits_pos'] == '?', 'credits_pos'] = -1
df_chars['credits_pos'] = df_chars['credits_pos'].astype(int)  # max(df_chars['credits_pos']) = 1000

df_chars.head(5)

Unnamed: 0,character_ID,name,movie_ID,movie_title,gender,credits_pos
0,u0,BIANCA,m0,10 things i hate about you,f,4
1,u1,BRUCE,m0,10 things i hate about you,?,-1
2,u2,CAMERON,m0,10 things i hate about you,m,3
3,u3,CHASTITY,m0,10 things i hate about you,?,-1
4,u4,JOEY,m0,10 things i hate about you,m,6


To make the conversations and lines easier to handle, we'll join them into a single dataframe.

In [9]:
# Create a guide index
df_conv['index'] = df_conv.index
# Expand list into all sub item lines
expanded_lines = df_conv.explode('lines_list')
# Reset index of expanded df
expanded_lines.reset_index(drop=True, inplace=True)
# merge the line with line ids
merged_df = pd.merge(
    expanded_lines,
    df_lines,
    left_on='lines_list',
    right_on='line_ID',
    how='inner'
)

# Select relevant columns and group by the guide index
merged_df = merged_df[['speaker1_ID', 'speaker2_ID', 'movie_ID', 'lines_list', 'index', 'clean_text']]
merged_df = merged_df.groupby(['index','speaker1_ID', 'speaker2_ID', 'movie_ID'])['clean_text'].apply(list).reset_index()
merged_df = merged_df.rename(columns={'clean_text': 'dialog'})

# convert dialog to string, not list
merged_df['dialog'] = merged_df['dialog'].apply(lambda x: ';'.join(x))
dialog_df = merged_df.drop(['index'], axis=1)

dialog_df

Unnamed: 0,speaker1_ID,speaker2_ID,movie_ID,dialog
0,u0,u2,m0,can we make this quick roxanne korrine and an...
1,u0,u2,m0,you are asking me out that is so cute what is...
2,u0,u2,m0,no no it's my fault we did not have a proper ...
3,u0,u2,m0,why;unsolved mystery she used to be really po...
4,u0,u2,m0,gosh if only we could find kat a boyfriend;let...
...,...,...,...,...
83092,u9028,u9031,m616,do you think she might be interested in someo...
83093,u9028,u9031,m616,choose your targets men that is right watch th...
83094,u9030,u9034,m616,colonel durnford william vereker i hear you h...
83095,u9030,u9034,m616,your orders mr vereker;i am to take the sikali...
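The explode, merge and group steps are easier to follow on a toy example. The sketch below uses two tiny made-up dataframes (the IDs and texts are invented for illustration) and the same sequence of operations.

```python
import pandas as pd

# Toy stand-ins for df_conv and df_lines (made-up values)
conv = pd.DataFrame({'movie_ID': ['m0', 'm0'],
                     'lines_list': [['L1', 'L2'], ['L3']]})
lines = pd.DataFrame({'line_ID': ['L1', 'L2', 'L3'],
                      'clean_text': ['hi', 'hello', 'bye']})

# Remember which conversation each line came from
conv['conv_idx'] = conv.index

# One row per (conversation, line ID) pair
exploded = conv.explode('lines_list')

# Attach the text of every line
merged = exploded.merge(lines, left_on='lines_list', right_on='line_ID', how='inner')

# Collapse back to one row per conversation, joining the lines with ';'
dialogs = (merged.groupby('conv_idx')['clean_text']
                 .agg(';'.join)
                 .reset_index(name='dialog'))
print(dialogs)
```

Because the merge is an inner join, any line ID without a match in the lines table is silently dropped from its dialog.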


In [10]:
dialog_df['dialog'][501]

'i can tell from the tone of your voice dave that you are upset why do not you take a stress pill and get some rest;hal i am in command of this ship i order you to release the manual hibernation control;i am sorry dave but in accordance with subroutine c15324 quote when the crew are dead or incapacitated the computer must assume control unquote i must therefore override your authority now since you are not in any condition to intel ligently exercise it;hal unless you follow my instructions i shall be forced to disconnect you'

### Preparing dataframes

For the character metadata prediction we only need each line and its speaker, not the whole dialog, because we'll predict the gender and the position in the movie credits from a single line. So let's put the lines and the character metadata into a single dataframe.

In [11]:
character_metadata = pd.merge(
    df_lines,
    df_chars,
    left_on='speaker_ID',
    right_on='character_ID',
    how='inner'
)

# Select the relevant columns
character_metadata = character_metadata[['line_ID', 'speaker_ID', 'movie_ID', 'speaker', 'clean_text', 'movie_title','gender','credits_pos']]
character_metadata = character_metadata.rename(columns={'clean_text': 'line'})

character_metadata

Unnamed: 0,line_ID,speaker_ID,movie_ID,speaker,line,movie_title,gender,credits_pos
0,L1045,u0,m0,BIANCA,they do not,10 things i hate about you,f,4
1,L1044,u2,m0,CAMERON,they do to,10 things i hate about you,m,3
2,L985,u0,m0,BIANCA,i hope so,10 things i hate about you,f,4
3,L984,u2,m0,CAMERON,she okay,10 things i hate about you,m,3
4,L925,u0,m0,BIANCA,let's go,10 things i hate about you,f,4
...,...,...,...,...,...,...,...,...
304708,L666371,u9030,m616,DURNFORD,lord chelmsford seems to want me to stay back ...,zulu dawn,?,-1
304709,L666370,u9034,m616,VEREKER,i am to take the sikali with the main column t...,zulu dawn,?,-1
304710,L666369,u9030,m616,DURNFORD,your orders mr vereker,zulu dawn,?,-1
304711,L666257,u9030,m616,DURNFORD,good ones yes mr vereker gentlemen who can rid...,zulu dawn,?,-1


And for the movie metadata prediction we will need the whole dialog. So, let's put the metadata into the dialogs dataframe to make things easier.

In [12]:
movie_metadata = pd.merge(
    dialog_df,
    df_title,
    left_on='movie_ID',
    right_on='movie_ID',
    how='inner'
)

movie_metadata

Unnamed: 0,speaker1_ID,speaker2_ID,movie_ID,dialog,title,year,IMBD_rating,IMBD_votes,genres
0,u0,u2,m0,can we make this quick roxanne korrine and an...,10 things i hate about you,1999,6.9,62847,"[comedy, romance]"
1,u0,u2,m0,you are asking me out that is so cute what is...,10 things i hate about you,1999,6.9,62847,"[comedy, romance]"
2,u0,u2,m0,no no it's my fault we did not have a proper ...,10 things i hate about you,1999,6.9,62847,"[comedy, romance]"
3,u0,u2,m0,why;unsolved mystery she used to be really po...,10 things i hate about you,1999,6.9,62847,"[comedy, romance]"
4,u0,u2,m0,gosh if only we could find kat a boyfriend;let...,10 things i hate about you,1999,6.9,62847,"[comedy, romance]"
...,...,...,...,...,...,...,...,...,...
83092,u9028,u9031,m616,do you think she might be interested in someo...,zulu dawn,1979,6.4,1911,"[action, adventure, drama, history, war]"
83093,u9028,u9031,m616,choose your targets men that is right watch th...,zulu dawn,1979,6.4,1911,"[action, adventure, drama, history, war]"
83094,u9030,u9034,m616,colonel durnford william vereker i hear you h...,zulu dawn,1979,6.4,1911,"[action, adventure, drama, history, war]"
83095,u9030,u9034,m616,your orders mr vereker;i am to take the sikali...,zulu dawn,1979,6.4,1911,"[action, adventure, drama, history, war]"


## Vectorizing

In [13]:
vec_lines = TfidfVectorizer(max_df=0.8, min_df=5, stop_words='english', max_features=1000)

### Vectorizing lines

In [14]:
lines_X = vec_lines.fit_transform(character_metadata['line'])
tfidf_df = pd.DataFrame(lines_X.toarray(), columns=vec_lines.get_feature_names_out())
tfidf_df

Unnamed: 0,able,absolutely,accept,accident,account,act,acting,action,actually,address,...,wrong,wrote,ya,yeah,year,years,yes,yesterday,york,young
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
304708,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0
304709,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0
304710,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0
304711,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.268399,0.0,0.0,0.0


### Vectorizing dialogs

In [15]:
vec_dial = TfidfVectorizer(max_df=0.8, min_df=5, stop_words='english', max_features=1000)

Using the vectorizer (computed in the 'otherTasks' notebook), we transform the data so that each dialog is in a form we can feed to the model.

In [16]:
dialog_X = vec_dial.fit_transform(movie_metadata['dialog'])
tfidf_df = pd.DataFrame(dialog_X.toarray(), columns=vec_dial.get_feature_names_out())
tfidf_df

Unnamed: 0,able,absolutely,accept,accident,account,act,acting,actually,address,admit,...,wrong,wrote,ya,yeah,year,years,yes,yesterday,york,young
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
83092,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0
83093,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0
83094,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.21258,0.0,0.0,0.0
83095,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0


## Model Creation

Note: you don't need to run this section if you already have the `.pkl` models.

### Character metadata models

**Input:** Line

**Output:**
Character metadata:

* gender
* position on movie *credits*

#### Gender model

The gender of a character is labeled as `F`, `M` or `?`. Since the target is categorical with 3 classes, we need a classification model.

In [26]:
y_gender = character_metadata['gender']  # Real size y - to uncomment
X_in_use = lines_X  # Real size X - to uncomment

X_train, X_test, y_train, y_test = train_test_split(X_in_use, y_gender, test_size=0.2, random_state=42)

# Using RandomForestClassifier for classification
gender_model = RandomForestClassifier()
gender_model.fit(X_train, y_train)

accuracy = gender_model.score(X_test, y_test)
print('Gender Prediction Accuracy:', accuracy)

Gender Prediction Accuracy: 0.49136077974500764
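An accuracy figure is easier to interpret next to a trivial baseline. The sketch below is a suggestion rather than part of the original analysis: it uses scikit-learn's `DummyClassifier` on made-up labels to show what always predicting the most frequent class would score. On the real data, `X_train`, `y_train`, `X_test` and `y_test` from the cell above could be passed in the same way.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up toy labels: 6 'm', 3 'f', 1 unknown
X_toy = np.zeros((10, 1))  # features are ignored by the dummy model
y_toy = ['m'] * 6 + ['f'] * 3 + ['?']

baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_toy, y_toy)

baseline_accuracy = baseline.score(X_toy, y_toy)
print('Majority-class baseline accuracy:', baseline_accuracy)
```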


Export the model so we don't have to train it again.

In [29]:
joblib.dump(gender_model, 'gender_model.pkl', compress=9)

['gender_model.pkl']

#### Credits position model

The position in the credits can be seen as either a continuous or a categorical variable: it's a number, but it also represents how important the character is in the movie. If characters are often ranked as "lead", "supporting", and "minor", it makes sense to classify them into categories. So we will try both regression and classification. For classification, the classes are the credits positions themselves (`-1` for unknown, `1` for the main character, and so on; the larger the number, the less important the character).

In [46]:
y_pos = character_metadata['credits_pos'] # real size y - to uncomment
X_in_use = lines_X # real size X - to uncomment

X_train, X_test, y_train, y_test = train_test_split(X_in_use, y_pos, test_size=0.2, random_state=42)

# Using RandomForestRegressor for regression
credits_pos_model_reg = RandomForestRegressor()
credits_pos_model_reg.fit(X_train, y_train)

# For a regressor, .score() returns the coefficient of determination R^2, not an accuracy
r2 = credits_pos_model_reg.score(X_test, y_test)
print('Credits position Accuracy:', r2)

Credits position Accuracy: -0.05991770736540669


In [48]:
joblib.dump(credits_pos_model_reg, 'credits_pos_model_reg.pkl', compress=9)

In [31]:
y_pos = character_metadata['credits_pos'] # real size y - to uncomment
X_in_use = lines_X # real size X - to uncomment

X_train, X_test, y_train, y_test = train_test_split(X_in_use, y_pos, test_size=0.2, random_state=42)

# Using RandomForestClassifier for classification
credits_pos_model_clas = RandomForestClassifier()
credits_pos_model_clas.fit(X_train, y_train)

accuracy = credits_pos_model_clas.score(X_test, y_test)
print('Credits position Accuracy:', accuracy)

Credits position Accuracy: 0.2886467682916824


For a regressor, the `score` method returns the coefficient of determination R^2, which can take the following values:

* R^2=1 : Indicates that the model perfectly explains the variance in the test data.
* 0<R^2<1 : Indicates that the model has some predictive power, capturing part of the variance in the data.
* R^2=0 : Indicates that the model explains none of the variance, equivalent to always predicting the mean of the observed values.
* R^2<0 : Indicates that the model is worse than always predicting the mean, meaning it fails to capture the variance in the data.

For a classifier, `score` returns the mean accuracy instead, so the two numbers above are not directly comparable. Still, the regressor's negative R^2 means it does worse than always predicting the mean credits position, while the classifier gets about 29% of the test lines right. So we'll treat the credits position as a classification problem and use that model.
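The R^2 cases can be reproduced by hand with the usual definition, R^2 = 1 - SS_res / SS_tot. The sketch below is a plain-Python illustration with made-up numbers; `r2` is a hypothetical helper, not the scikit-learn implementation.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean) ** 2 for t in y_true)               # total sum of squares
    return 1 - ss_res / ss_tot

y_true = [1, 2, 3, 4]

perfect = r2(y_true, [1, 2, 3, 4])        # predictions match exactly
mean_only = r2(y_true, [2.5] * 4)         # always predict the mean
worse = r2(y_true, [4, 3, 2, 1])          # predictions are reversed

print(perfect, mean_only, worse)
```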

In [32]:
joblib.dump(credits_pos_model_clas, 'credits_pos_model_clas.pkl', compress=9)

['credits_pos_model_clas.pkl']

### Movie metadata models

**Input:** Dialog

**Output:** Movie metadata:

* genres
* release year
* IMDB rating
* number of IMDB votes

We will create a model for each metadata field

#### Release year model

In [None]:
y_year = movie_metadata['year']  # real size y - to uncomment
X_in_use = dialog_X # real size X - to uncomment

X_train, X_test, y_train, y_test = train_test_split(X_in_use, y_year, test_size=0.2, random_state=42)

release_year_model = RandomForestRegressor()
release_year_model.fit(X_train, y_train)

accuracy = release_year_model.score(X_test, y_test)
print('Release Year Prediction Accuracy:', accuracy)

Release Year Prediction Accuracy: 0.2689944802819675


In [None]:
joblib.dump(release_year_model, 'release_year_model.pkl', compress=9)

#### IMDB rating model

In [None]:
y_rate = movie_metadata['IMBD_rating']  # real size y - to uncomment
X_in_use = dialog_X # real size X - to uncomment

X_train, X_test, y_train, y_test = train_test_split(X_in_use, y_rate, test_size=0.2, random_state=42)

imdb_rating_model = RandomForestRegressor()
imdb_rating_model.fit(X_train, y_train)

accuracy = imdb_rating_model.score(X_test, y_test)
print('IMDB rating Prediction Accuracy:', accuracy)

IMDB rating Prediction Accuracy: 0.27930733642047323


In [None]:
joblib.dump(imdb_rating_model, 'imdb_rating_model.pkl', compress=9)

#### Number of votes model

In [None]:
y_votes = movie_metadata['IMBD_votes']  # real size y - to uncomment
X_in_use = dialog_X # real size X - to uncomment

X_train, X_test, y_train, y_test = train_test_split(X_in_use, y_votes, test_size=0.2, random_state=42)

votes_model = RandomForestRegressor()
votes_model.fit(X_train, y_train)

accuracy = votes_model.score(X_test, y_test)
print('IMDB votes Prediction Accuracy:', accuracy)

IMDB votes Prediction Accuracy: 0.28853147062131657


In [None]:
joblib.dump(votes_model, 'votes_model.pkl', compress=9)

#### Genres model

This is a particular case, since what we are trying to predict is a list of elements (genres). This kind of problem is called multilabel classification, which is why we'll use a `MultiLabelBinarizer`.
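As a small illustration of what the binarizer does, the sketch below encodes three genre lists (taken from the titles shown earlier, with the third one shortened for the example) into a 0/1 matrix and decodes the first row back.

```python
from sklearn.preprocessing import MultiLabelBinarizer

genres = [
    ['comedy', 'romance'],
    ['action', 'crime', 'drama', 'thriller'],
    ['drama', 'romance'],
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(genres)  # one column per genre, one row per movie

print(mlb.classes_)            # column order (sorted genre names)
print(Y)

# Map a 0/1 row back to the genre names
decoded = mlb.inverse_transform(Y[:1])
print(decoded)
```

For multilabel targets like this one, the classifier's `score` method reports subset accuracy: a prediction only counts as correct when the whole set of genres matches.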

In [None]:
# Binarize the list of genres
mlb = MultiLabelBinarizer()
y_genres = mlb.fit_transform(movie_metadata['genres'])  # real size y - to uncomment
X_in_use = dialog_X # real size X - to uncomment

X_train, X_test, y_train, y_test = train_test_split(X_in_use, y_genres, test_size=0.2, random_state=42)

genres_model = OneVsRestClassifier(RandomForestClassifier())
genres_model.fit(X_train, y_train)

accuracy = genres_model.score(X_test, y_test)
print('Genre Prediction Accuracy:', accuracy)

Genre Prediction Accuracy: 0.7333333333333333


In [None]:
joblib.dump(genres_model, 'genres_model.pkl', compress=9)

## Metadata prediction task

### Prediction of character metadata for a specific line

In [17]:
def predict_character_metadata(line):
    line_vector = vec_lines.transform([line])

    gender = gender_model.predict(line_vector)
    credits_pos = credits_pos_model_clas.predict(line_vector)
    # credits_pos_model_clas has a higher accuracy than credits_pos_model_reg

    return {
        'gender': gender[0],
        'credits_pos': credits_pos[0]
    }


### Prediction of movie metadata for a specific dialog

In [22]:
# Fit the binarizer once, so predicted label vectors can be mapped back to genre names
mlb = MultiLabelBinarizer()
mlb.fit(movie_metadata['genres'])

def predict_movie_metadata(dialogue):
    dialogue_vector = vec_dial.transform([dialogue])
    genres = mlb.inverse_transform(genres_model.predict(dialogue_vector))
    release_year = release_year_model.predict(dialogue_vector)
    imdb_rating = imdb_rating_model.predict(dialogue_vector)
    number_of_votes = votes_model.predict(dialogue_vector)

    return {
        'genres': list(genres[0]),
        'year': int(release_year[0]),
        'IMBD_rating': round(float(imdb_rating[0]), 1),
        'IMBD_votes': int(number_of_votes[0])
    }


## Testing

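Run the following cell to load the trained `.pkl` models.
**Only run it if you skipped the Model Creation section.** The data wrangling and vectorizing cells still need to be run, because the vectorizers are fit there and are not saved with the models.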



In [19]:
model_folder = 'Metadata-models/'
gender_model = joblib.load(f'{model_folder}gender_model.pkl')
credits_pos_model_clas = joblib.load(f'{model_folder}credits_pos_model_clas.pkl')
genres_model = joblib.load(f'{model_folder}genres_model.pkl')
release_year_model = joblib.load(f'{model_folder}release_year_model.pkl')
imdb_rating_model = joblib.load(f'{model_folder}imdb_rating_model.pkl')
votes_model = joblib.load(f'{model_folder}votes_model.pkl')

### Testing with real data from dataset

#### Character

In [23]:
random_index = np.random.randint(len(character_metadata))
test1 = character_metadata.loc[random_index]['line']
predictions = predict_character_metadata(test1)

print(f'Index: {random_index}')
print(f'Line: {test1}')
print('Results:')
print("-------------------------------------------------")
for i in predictions:
    print(f'Real {i}:     \t {character_metadata.loc[random_index][i]}')
    print(f'Predicted {i}:\t {predictions[i]}')
    print()

Index: 5436
Line: is it because of my mistake six men did not return from that raid
Results:
-------------------------------------------------
Real gender:     	 m
Predicted gender:	 m

Real credits_pos:     	 5
Predicted credits_pos:	 5
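Because `random_index` is drawn from the whole dataframe, the sampled row will often be one the model was trained on, which makes a match look better than it is. A possible refinement, sketched below on toy indices, is to split the row indices and sample only from the held-out part. The names `all_idx`, `train_idx` and `test_idx` are made up for the example. With the same `random_state`, `test_size` and number of rows, the split should reproduce the partition used during training.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 10  # toy size; on the real data this would be len(character_metadata)
all_idx = np.arange(n_rows)

# Split the indices the same way the features were split for training
train_idx, test_idx = train_test_split(all_idx, test_size=0.2, random_state=42)

# Pick a random row that the model has not seen during training
rng = np.random.default_rng(0)
random_index = int(rng.choice(test_idx))
print(random_index)
```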



#### Movie

In [26]:
random_index = np.random.randint(len(movie_metadata))
test2 = movie_metadata.loc[random_index]['dialog']
predictions = predict_movie_metadata(test2)

print(f'Index: {random_index}')
print(f'Dialog: {test2}')
print('Results:')
print("-------------------------------------------------")
for i in predictions:
    print(f'Real {i}:     \t {movie_metadata.loc[random_index][i]}')
    print(f'Predicted {i}:\t {predictions[i]}')
    print()

Index: 66844
Dialog: ho ho capital punks;signior romeo bon jour there's a french salutation to your french slop you gave us the counterfeit fairly last night;good morrow to you both what counterfeit did i give you;the slip son the slip can you not conceive;pardon good mercutio my business was great and in such a case as mine a man may strain courtesy;that is as much as to say such a case as yours constrains a man to bow in the hams;meaning to court'sy;thou hast most kindly hit it;a most courteous exposition;nay i am the very pink of courtesy;pink for flower;right;why then is my pump well flowered;sure witt now art thou sociable now art thou romeo now art thou what thou art by art as well as by nature;here's goodly gear
Results:
-------------------------------------------------
Real genres:     	 ['drama', 'romance']
Predicted genres:	 ['drama', 'romance']

Real year:     	 1968
Predicted year:	 1977

Real IMBD_rating:     	 7.8
Predicted IMBD_rating:	 7.5

Real IMBD_votes:     	 12360


### Testing with fictional data

In [27]:
predict_character_metadata("Hey, I think I'm in love with you")

{'gender': 'm', 'credits_pos': 1}

In [28]:
predict_character_metadata("Do you want to play football?")

{'gender': 'f', 'credits_pos': 1}

In [29]:
predict_movie_metadata("Hi; Hey how are you?; I'm really scared, I think there is some on in my house")

{'genres': ['comedy', 'drama'],
 'year': 1996,
 'IMBD_rating': 6.7,
 'IMBD_votes': 37502}

In [30]:
predict_movie_metadata("I want 100000 dollars; How will you get all that money?; I'll rob a bank")

{'genres': ['crime', 'drama', 'thriller'],
 'year': 1992,
 'IMBD_rating': 6.5,
 'IMBD_votes': 30232}