# Film recommender system - regression style


Jo and I are always perplexed as to which film to watch together. To combat this I've made a list of all the films we've watched together and had her rate them from 0 to 3. I hope to build a content-based recommender system that we can use to find future films.

The key ingredient is a massive film database that I found on Kaggle:

https://www.kaggle.com/stefanoleone992/imdb-extensive-dataset

Let's take a look at it.

In [1]:
# import modules and data

import pandas as pd
import numpy as np
import re
import warnings
warnings.filterwarnings('ignore')

films = pd.read_csv('IMDb movies.csv')

# rename the id column for future merging

films = films.rename(columns={'imdb_title_id':'Id'}).fillna('')

films.head()

Unnamed: 0,Id,title,original_title,year,date_published,genre,duration,country,language,director,...,actors,description,avg_vote,votes,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics
0,tt0000009,Miss Jerry,Miss Jerry,1894,1894-10-09,Romance,45,USA,,Alexander Black,...,"Blanche Bayliss, William Courtenay, Chauncey D...",The adventures of a female reporter in the 1890s.,5.9,154,,,,,1,2
1,tt0000574,The Story of the Kelly Gang,The Story of the Kelly Gang,1906,1906-12-26,"Biography, Crime, Drama",70,Australia,,Charles Tait,...,"Elizabeth Tait, John Tait, Norman Campbell, Be...",True story of notorious Australian outlaw Ned ...,6.1,589,$ 2250,,,,7,7
2,tt0001892,Den sorte drøm,Den sorte drøm,1911,1911-08-19,Drama,53,"Germany, Denmark",,Urban Gad,...,"Asta Nielsen, Valdemar Psilander, Gunnar Helse...",Two men of high rank are both wooing the beaut...,5.8,188,,,,,5,2
3,tt0002101,Cleopatra,Cleopatra,1912,1912-11-13,"Drama, History",100,USA,English,Charles L. Gaskill,...,"Helen Gardner, Pearl Sindelar, Miss Fielding, ...",The fabled queen of Egypt's affair with Roman ...,5.2,446,$ 45000,,,,25,3
4,tt0002130,L'Inferno,L'Inferno,1911,1911-03-06,"Adventure, Drama, Fantasy",68,Italy,Italian,"Francesco Bertolini, Adolfo Padovan",...,"Salvatore Papa, Arturo Pirovano, Giuseppe de L...",Loosely adapted from Dante's Divine Comedy and...,7.0,2237,,,,,31,14


In [2]:
# take a look at the columns

films.columns

Index(['Id', 'title', 'original_title', 'year', 'date_published', 'genre',
       'duration', 'country', 'language', 'director', 'writer',
       'production_company', 'actors', 'description', 'avg_vote', 'votes',
       'budget', 'usa_gross_income', 'worlwide_gross_income', 'metascore',
       'reviews_from_users', 'reviews_from_critics'],
      dtype='object')

# Data preparation

There are plenty of columns here that could be used to measure similarity between films, which makes the job much easier. Not all of them are useful, though. For example, most films have no budget listed, and filling the gaps with the average wouldn't be very informative, so we drop that column along with a few others.

In [3]:
# make a categorical dataframe

columns_to_drop = ['title', 'original_title', 'date_published', 'year', 'country', 'language', 'description', 'votes', 'budget', 'usa_gross_income', 'metascore', 'worlwide_gross_income']

regression_df = films.drop(columns_to_drop, axis = 1)

regression_df.head()

Unnamed: 0,Id,genre,duration,director,writer,production_company,actors,avg_vote,reviews_from_users,reviews_from_critics
0,tt0000009,Romance,45,Alexander Black,Alexander Black,Alexander Black Photoplays,"Blanche Bayliss, William Courtenay, Chauncey D...",5.9,1,2
1,tt0000574,"Biography, Crime, Drama",70,Charles Tait,Charles Tait,J. and N. Tait,"Elizabeth Tait, John Tait, Norman Campbell, Be...",6.1,7,7
2,tt0001892,Drama,53,Urban Gad,"Urban Gad, Gebhard Schätzler-Perasini",Fotorama,"Asta Nielsen, Valdemar Psilander, Gunnar Helse...",5.8,5,2
3,tt0002101,"Drama, History",100,Charles L. Gaskill,Victorien Sardou,Helen Gardner Picture Players,"Helen Gardner, Pearl Sindelar, Miss Fielding, ...",5.2,25,3
4,tt0002130,"Adventure, Drama, Fantasy",68,"Francesco Bertolini, Adolfo Padovan",Dante Alighieri,Milano Film,"Salvatore Papa, Arturo Pirovano, Giuseppe de L...",7.0,31,14


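The missing review counts were filled with empty strings when the data was loaded, so we now replace them with 0 and cast both review columns to integers. We also shorten two long production company names.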

In [4]:
#regression_df = regression_df.replace('', 0)
regression_df[['reviews_from_users', 'reviews_from_critics']] = regression_df[['reviews_from_users', 'reviews_from_critics']].replace('', 0).astype(int)

regression_df = regression_df.replace('Metro-Goldwyn-Mayer (MGM)', 'MGM')
regression_df = regression_df.replace('European Broadcasting Union (EBU)', 'EBU')

regression_df.dtypes

Id                       object
genre                    object
duration                  int64
director                 object
writer                   object
production_company       object
actors                   object
avg_vote                float64
reviews_from_users        int32
reviews_from_critics      int32
dtype: object

To build the lookup tables we need Jo's ratings (the response values) for the films we've watched. Let's import them.

In [5]:
# import Jo's watched films

Jo = pd.read_csv('JosFilms.csv')

# drop the rows with NaN values

Jo = Jo.dropna()

print(Jo.shape)

Jo.head()

(83, 3)


Unnamed: 0,Film,Response,Id
0,Shutter Island,3.0,tt1130884
3,Green Book,3.0,tt6966692
4,The Silence of the Lambs,3.0,tt0102926
5,Paranormal Activity,0.0,tt1179904
6,Saw,1.0,tt0387564


We join the two dataframes by Id.

In [6]:
# Join the two dataframes by ID

Jo_regression = pd.merge(regression_df, Jo, on='Id', how='inner')

# drop the titles and Id columns

Jo_regression = Jo_regression.drop(['Id', 'Film'], axis=1)

Jo_regression.tail()

Unnamed: 0,genre,duration,director,writer,production_company,actors,avg_vote,reviews_from_users,reviews_from_critics,Response
78,"Drama, War",119,Sam Mendes,"Sam Mendes, Krysty Wilson-Cairns",DreamWorks,"Dean-Charles Chapman, George MacKay, Daniel Ma...",8.3,2843,474,3.0
79,"Comedy, Music",123,David Dobkin,"Will Ferrell, Andrew Steele",EBU,"Will Ferrell, Rachel McAdams, Dan Stevens, Mik...",6.5,1154,157,0.0
80,"Drama, Horror, Mystery",148,Ari Aster,Ari Aster,A24,"Florence Pugh, Jack Reynor, Vilhelm Blomgren, ...",7.1,2706,390,1.0
81,"Comedy, Crime, Drama",130,Rian Johnson,Rian Johnson,Lionsgate,"Daniel Craig, Chris Evans, Ana de Armas, Jamie...",7.9,2334,448,2.0
82,"Fantasy, Horror, Mystery",87,Oz Perkins,Rob Hayes,Orion Pictures,"Sophia Lillis, Samuel Leakey, Alice Krige, Jes...",5.3,418,154,1.0


For each categorical feature we now create a lookup table. The keys are the individual labels (a single genre, director, actor and so on) found in the dataframe above, and each value is the mean response Jo gave to the rated films that carry that label.

In [7]:
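# function for making a lookup table

def make_lookup(column):
    """Map each label in `column` to the mean Response of the rated films containing it."""
    lookup = {}
    values = Jo_regression[column].astype(str)

    # split the comma-separated entries into individual labels, dropping duplicates
    labels = set(", ".join(values).split(", "))

    for label in labels:
        # regex=False treats the label as literal text, so characters like '(' or '.' are safe
        lookup[label] = Jo_regression[values.str.contains(label, regex=False)]['Response'].mean()

    return lookup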

In [8]:
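# function to replace a text entry with the mean lookup value of its labels
# (labels missing from the lookup fall back to 1.5, the midpoint of the 0-3 scale)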

def lookup_replace(entry):
    global lookup
    value = 0
    category_list = entry.split(', ')
    
    counter = 0
    for category in category_list:
        value += lookup.get(category, 1.5)
        counter += 1
    
    if counter != 0:
        value = value / counter
        
    return value
    

In [9]:
columns_for_lookup = ['genre', 'director', 'writer', 'production_company', 'actors']

for column in columns_for_lookup:
    lookup = make_lookup(column)
    
    regression_df[column] = regression_df[column].apply(lookup_replace)
    Jo_regression[column] = Jo_regression[column].apply(lookup_replace)
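
The lookup-and-replace step above can be sketched on a tiny example. This is a minimal illustration in plain Python rather than the notebook's own code: the three rated films, the helper names `build_lookup` and `encode`, and the exact (rather than substring) label matching are all assumptions made for the sketch.

```python
from collections import defaultdict

# Hypothetical toy ratings: (comma-separated genres, response on the 0-3 scale)
rated = [
    ("Drama, War", 3.0),
    ("Comedy, Music", 0.0),
    ("Drama, Horror", 1.0),
]

def build_lookup(rows):
    """Map each label to the mean response of the rated films carrying it."""
    responses = defaultdict(list)
    for labels, response in rows:
        for label in labels.split(", "):
            responses[label].append(response)
    return {label: sum(vals) / len(vals) for label, vals in responses.items()}

def encode(entry, lookup, default=1.5):
    """Average the lookup values of an entry's labels; unseen labels get the default."""
    labels = entry.split(", ")
    return sum(lookup.get(label, default) for label in labels) / len(labels)

genre_lookup = build_lookup(rated)
print(genre_lookup["Drama"])                    # 2.0, the mean of 3.0 and 1.0
print(encode("Drama, Romance", genre_lookup))   # 1.75, the mean of 2.0 and the 1.5 default
```

A film with no rated labels at all ends up at exactly 1.5, which is why so many rows in the full dataset show 1.5 for director, writer and actors.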

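Every categorical column is now numeric. Next we normalise each column by dividing it by its maximum. Note that the cell after next scales `regression_df` and `Jo_regression` by their own maxima, so the same raw value can map to different scaled values in the two dataframes.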

In [10]:
regression_df.head()

Unnamed: 0,Id,genre,duration,director,writer,production_company,actors,avg_vote,reviews_from_users,reviews_from_critics
0,tt0000009,1.6,45,1.5,1.5,1.5,1.5,5.9,1,2
1,tt0000574,2.06676,70,1.5,1.5,1.5,1.5,6.1,7,7
2,tt0001892,2.176471,53,1.5,1.5,1.5,1.5,5.8,5,2
3,tt0002101,2.338235,100,1.5,1.5,1.5,1.5,5.2,25,3
4,tt0002130,2.033932,68,1.5,1.5,1.5,1.5,7.0,31,14


In [11]:
columns = list(regression_df.columns)[1:]

print(columns)

for column in columns:
    regression_df[column] = regression_df[column]/regression_df[column].max()
    Jo_regression[column] = Jo_regression[column]/Jo_regression[column].max()
    
regression_df.head()

['genre', 'duration', 'director', 'writer', 'production_company', 'actors', 'avg_vote', 'reviews_from_users', 'reviews_from_critics']


Unnamed: 0,Id,genre,duration,director,writer,production_company,actors,avg_vote,reviews_from_users,reviews_from_critics
0,tt0000009,0.533333,0.055693,0.5,0.5,0.5,0.5,0.59596,9.5e-05,0.002002
1,tt0000574,0.68892,0.086634,0.5,0.5,0.5,0.5,0.616162,0.000668,0.007007
2,tt0001892,0.72549,0.065594,0.5,0.5,0.5,0.5,0.585859,0.000477,0.002002
3,tt0002101,0.779412,0.123762,0.5,0.5,0.5,0.5,0.525253,0.002387,0.003003
4,tt0002130,0.677977,0.084158,0.5,0.5,0.5,0.5,0.707071,0.00296,0.014014
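
One possible variation, not what the cell above runs, is to take the column maxima from a single reference dataframe and reuse them for both, so that training and prediction features share a scale. The sketch below assumes two hypothetical toy frames, `all_films` and `rated_films`, standing in for `regression_df` and `Jo_regression`.

```python
import pandas as pd

# Hypothetical toy frames standing in for regression_df (all films)
# and Jo_regression (rated films only)
all_films = pd.DataFrame({"duration": [50, 100, 200], "avg_vote": [5.0, 8.0, 10.0]})
rated_films = pd.DataFrame({"duration": [100, 150], "avg_vote": [8.0, 4.0]})

# Take the maxima from one reference frame and reuse them for both frames.
# Dividing a DataFrame by a Series aligns on column names.
col_max = all_films.max()
all_scaled = all_films / col_max
rated_scaled = rated_films / col_max

print(rated_scaled)
```

With shared maxima, a 100-minute film is scaled to 0.5 in both frames. Scaling each frame by its own maximum would instead give 0.5 in one and about 0.67 in the other.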


## Description (large text)

The description of a film is a handy way to find similar films. To make use of it we create a feature for each word (and short phrase) in the descriptions, which results in a massive, mostly empty matrix. First we pull out just the descriptions of the films that have been watched already, because the text itself is no longer available once the data has been converted to sparse format.

In [12]:
Jo_descriptions = pd.merge(films[['Id', 'description']], Jo, on='Id', how='inner').drop(columns=['Film', 'Id', 'Response'])

Jo_descriptions

Unnamed: 0,description
0,Dorothy Gale is swept away from a farm in Kans...
1,The story of a young deer growing up in the fo...
2,The romantic tale of a sheltered uptown Cocker...
3,"After being snubbed by the royal family, a mal..."
4,A rootless young man's drawn into an unnamed r...
...,...
78,"April 6th, 1917. As a regiment assembles to wa..."
79,When aspiring musicians Lars and Sigrit are gi...
80,A couple travels to Sweden to visit a rural ho...
81,A detective investigates the death of a patria...


The massive matrix is created with scikit-learn's TfidfVectorizer class.

In [13]:
# import the vectorizer module

from scipy import sparse
from time import time
from sklearn.feature_extraction.text import TfidfVectorizer

# create a sparse matrix from the description column

vectorizer = TfidfVectorizer(
    min_df=3, # ignore terms that appear in fewer than 3 descriptions
    max_features=None, # have as many columns as we need
    strip_accents='unicode', # strip accents from characters
    analyzer='word', # build features from words
    token_pattern=r'\w{1,}', # a token is any run of one or more word characters
    ngram_range=(1,3), # use single words, pairs and triples of consecutive words
    stop_words='english' # drop common English words that carry little meaning, e.g. "and"
    )

X = vectorizer.fit_transform(films['description'])

X

<85855x80474 sparse matrix of type '<class 'numpy.float64'>'
	with 1570616 stored elements in Compressed Sparse Row format>
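
To see what `min_df` and `ngram_range` do, here is a small sketch on a made-up corpus. The four sentences are hypothetical, and the sketch uses the default `token_pattern` and a shorter n-gram range than the cell above to keep the vocabulary readable.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus standing in for films['description']
docs = [
    "a detective investigates a murder",
    "a detective investigates a robbery",
    "a young detective investigates a kidnapping",
    "a couple travels to sweden",
]

# Keep only terms (single words or word pairs) found in at least 3 documents
vec = TfidfVectorizer(min_df=3, ngram_range=(1, 2), stop_words="english")
X_toy = vec.fit_transform(docs)

print(sorted(vec.vocabulary_))  # the terms that survive the min_df filter
print(X_toy.shape)              # (number of documents, number of surviving terms)
```

Only "detective", "investigates" and the pair "detective investigates" occur in three documents, so they are the only columns left. The last document shares none of them and becomes an all-zero row, which is the situation of most films in the full matrix relative to any given term.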

We need to slice a training set from this sparse matrix.

In [14]:
row_index = films['Id'].isin(list(Jo['Id']))
X_train = X[row_index.values]

X_train

<83x80474 sparse matrix of type '<class 'numpy.float64'>'
	with 1641 stored elements in Compressed Sparse Row format>

# Model fitting and predicting

We are now ready for regression. We have two datasets to model, and we'll start with our main dataframe of encoded categories and numeric columns.

## First regression

We define our training and test data.

In [15]:
from sklearn.neural_network import MLPRegressor

#regression_df = regression_df.drop(['Response'], axis = 1)

X_train1 = Jo_regression.iloc[:,0:Jo_regression.shape[1]-1] # leaving out the response

# the training data will be the same for both our models

y_train = Jo_regression['Response']

X_test1 = regression_df.iloc[:,1:regression_df.shape[1]]

# define and fit the model

model1 = MLPRegressor(activation='relu',hidden_layer_sizes=(20, 20), solver='adam',alpha=0.1, max_iter=300).fit(X_train1, y_train)

In [16]:
first_predictions = model1.predict(X_test1)

first_predictions.mean()

1.3262956271925468

In [17]:
films['Response'] = first_predictions

films.sort_values(by='Response', ascending = False).head()

Unnamed: 0,Id,title,original_title,year,date_published,genre,duration,country,language,director,...,description,avg_vote,votes,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics,Response
31331,tt0120815,Salvate il soldato Ryan,Saving Private Ryan,1998,1998-10-30,"Drama, War",169,USA,"English, French, German, Czech",Steven Spielberg,...,"Following the Normandy Landings, a group of U....",8.6,1203825,$ 70000000,$ 217049603,$ 482349603,91,2610,268,2.911009
31279,tt0120737,Il Signore degli Anelli - La compagnia dell'An...,The Lord of the Rings: The Fellowship of the Ring,2001,2002-01-18,"Action, Adventure, Drama",178,"New Zealand, USA","English, Sindarin",Peter Jackson,...,A meek Hobbit from the Shire and eight compani...,8.8,1619920,$ 93000000,$ 315544750,$ 887934303,92,5392,340,2.88597
34127,tt0167260,Il Signore degli Anelli - Il ritorno del re,The Lord of the Rings: The Return of the King,2003,2004-01-22,"Action, Adventure, Drama",201,"New Zealand, USA","English, Quenya, Old English, Sindarin",Peter Jackson,...,Gandalf and Aragorn lead the World of Men agai...,8.9,1604280,$ 94000000,$ 377845905,$ 1142271098,94,3718,353,2.875498
32487,tt0137523,Fight Club,Fight Club,1999,1999-10-29,Drama,139,"USA, Germany",English,David Fincher,...,An insomniac office worker and a devil-may-car...,8.8,1807440,$ 63000000,$ 37030102,$ 101218804,66,3758,370,2.8603
34128,tt0167261,Il Signore degli Anelli - Le due torri,The Lord of the Rings: The Two Towers,2002,2003-01-16,"Action, Adventure, Drama",179,"New Zealand, USA","English, Sindarin, Old English",Peter Jackson,...,While Frodo and Sam edge closer to Mordor with...,8.7,1449778,$ 94000000,$ 342551365,$ 951227416,87,2575,324,2.810049


This completes the first regression model which accounted for most of the categories available to us. Now we move on to the description.

## Second regression: TF-IDF descriptions

In [18]:
from sklearn import linear_model

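# define and fit the model
# note: alpha=0 switches off the L1 penalty, so this is effectively ordinary least squares;
# scikit-learn advises against alpha=0 with Lasso for numerical reasons

sparse_lasso = linear_model.Lasso(alpha=0, fit_intercept=False, max_iter=1000).fit(X_train, y_train)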

In [19]:
second_predictions = sparse_lasso.predict(X)

second_predictions

array([ 0.00000000e+00, -2.93721245e-17,  9.37901448e-01, ...,
        0.00000000e+00,  0.00000000e+00,  2.82049170e-01])

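The predictions are a little erratic, so we rescale them so that the largest prediction sits at 3, the top of our rating scale. Dividing by the maximum does not change the sign of a value, so any negative predictions stay slightly below 0.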

In [20]:
second_predictions = (second_predictions*3)/second_predictions.max()
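
If the predictions must lie strictly inside the 0 to 3 range, one option is to clip them after rescaling. The sketch below is an assumed extra step, not part of the notebook, and the `raw` values are made up to include a tiny negative number like the one in the output above.

```python
import numpy as np

# Hypothetical raw predictions, including a tiny negative value
raw = np.array([0.0, -2.9e-17, 0.94, 0.28, 2.0])

# Same rescaling as the cell above: the largest value becomes 3,
# but negative values stay negative
scaled = raw * 3 / raw.max()

# Assumed extra step: clip so that every value lies in [0, 3]
clipped = np.clip(scaled, 0, 3)

print(scaled.min() < 0)              # True: rescaling alone leaves a negative value
print(clipped.min(), clipped.max())  # 0.0 3.0
```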

# Final step

Now we'll amalgamate our two results. This was done by feel: the main model gave better recommendations, so we gave more importance to it. We then tweaked the ratio of importance until a nice joint list was given; the cell below weights the first model 5 to 1.

In [21]:
results = ((first_predictions*5) + second_predictions)/6

films['response1'] = first_predictions

films['response2'] = second_predictions

results

array([1.06387767, 1.1009419 , 1.12349785, ..., 1.1558389 , 1.11349848,
       1.13115572])
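
The blend above is a weighted mean with weights 5 and 1. A small sketch with the weights exposed as parameters makes it easier to experiment with the ratio; the function name `blend` and the example values are hypothetical.

```python
import numpy as np

def blend(first, second, weight_first=5.0, weight_second=1.0):
    """Weighted mean of two prediction arrays; the defaults mirror the 5:1 ratio above."""
    first = np.asarray(first, dtype=float)
    second = np.asarray(second, dtype=float)
    total = weight_first + weight_second
    return (weight_first * first + weight_second * second) / total

# Hypothetical predictions for three films
print(blend([2.4, 1.2, 0.6], [0.0, 3.0, 0.6]))  # approximately [2.0, 1.5, 0.6]
```

Because the weights are normalised by their sum, the blended score stays on the same scale as the two inputs, and a film only ranks highly overall if the first model rates it well.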

In [22]:
films['predicted_response'] = results

films = films.sort_values(by='predicted_response', ascending=False)

# remove the films that have already been watched

films = films[~films['Id'].isin(list(Jo['Id']))]

# keep only films whose sole listed language is English

films = films[films['language'] == 'English']

films.head()

Unnamed: 0,Id,title,original_title,year,date_published,genre,duration,country,language,director,...,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics,Response,response1,response2,predicted_response
71919,tt3606756,Gli Incredibili 2,Incredibles 2,2018,2018-09-19,"Animation, Action, Adventure",118,USA,English,Brad Bird,...,$ 200000000,$ 608581744,$ 1242805359,80,1074,395,2.420136,2.420136,8.686146e-18,2.01678
56816,tt1313092,Animal Kingdom,Animal Kingdom,2010,2010-10-29,"Crime, Drama",113,Australia,English,David Michôd,...,AUD 5000000,$ 1044039,$ 7209912,83,176,263,2.348018,2.348018,0.08296105,1.970508
60256,tt1637688,In Time,In Time,2011,2012-02-17,"Action, Sci-Fi, Thriller",109,USA,English,Andrew Niccol,...,$ 40000000,$ 37520095,$ 173930596,53,540,373,2.326649,2.326649,0.1379791,1.96187
70672,tt3297330,Good Kill,Good Kill,2014,2016-02-25,"Drama, Thriller, War",102,USA,English,Andrew Niccol,...,,$ 316472,$ 1474471,63,87,175,2.265723,2.265723,0.3360935,1.944118
66427,tt2334879,Sotto assedio - White House Down,White House Down,2013,2013-09-26,"Action, Drama, Thriller",131,USA,English,Roland Emmerich,...,$ 150000000,$ 73103784,$ 205366737,52,509,355,2.118569,2.118569,0.9119836,1.917471
