# Film recommender system - regression style


Jo and I can never decide which film to watch together. To fix this, I've listed every film we've watched together and had her rate each one from 0 to 3. The goal is a content-based recommender system that we can use to pick future films.

The key ingredient is a large film database that I found on Kaggle:

https://www.kaggle.com/stefanoleone992/imdb-extensive-dataset

Let's take a look at it.

In [1]:
# import modules and data

import pandas as pd
import numpy as np
import re

films = pd.read_csv('IMDb movies.csv')

# rename the id column for future merging

films = films.rename(columns={'imdb_title_id':'Id'}).fillna('')

# we run into some memory issues later, so I'm keeping only films released after 1980

# pull the first four-digit number out of the year column and store it as an integer

films['year'] = films['year'].apply(lambda x: int(re.findall("[0-9]{4}", str(x))[0]))

films = films[films['year']>1980].reset_index(drop=True)

print(films.shape)

films.head()



(63432, 22)


Unnamed: 0,Id,title,original_title,year,date_published,genre,duration,country,language,director,...,actors,description,avg_vote,votes,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics
0,tt0017938,La glace à trois faces,La glace à trois faces,1983,1983-03-13,"Drama, Romance",45,France,"None, French",Jean Epstein,...,"Jeanne Helbling, Suzy Pierson, Olga Day, Raymo...",Psychological narrative avantgarde film about ...,7.0,759,,,,,7,4
1,tt0035423,Kate & Leopold,Kate & Leopold,2001,2002-03-01,"Comedy, Fantasy, Romance",118,USA,"English, French",James Mangold,...,"Meg Ryan, Hugh Jackman, Liev Schreiber, Brecki...",An English Duke from 1876 is inadvertedly drag...,6.4,77852,$ 48000000,$ 47121859,$ 76019048,44.0,341,115
2,tt0036606,"Another time, another place - Una storia d'amore","Another Time, Another Place",1983,1983-07-15,"Drama, War",118,UK,"English, Italian",Michael Radford,...,"Phyllis Logan, Giovanni Mauriello, Gianluca Fa...","Set in 1943 Scotland during World War II, Jani...",6.5,252,,,,,3,10
3,tt0062181,Mani in alto!,Rece do góry,1981,1985-01-21,Drama,76,Poland,Polish,Jerzy Skolimowski,...,"Jerzy Skolimowski, Joanna Szczerbic, Tadeusz L...","Censored by the Polish authorities, this movie...",6.5,296,,,,,2,6
4,tt0064730,Nihon boryoku-dan: Kumicho,Nihon boryoku-dan: Kumicho,2000,1969,"Action, Crime",97,Japan,Japanese,Kinji Fukasaku,...,"Kôji Tsuruta, Tomisaburô Wakayama, Bunta Sugaw...",Coming out of jail and hoping for a quiet life...,7.0,168,,,,,3,5
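The lambda in the cell above takes the first run of four digits in each `year` value. Here is a minimal sketch of the same idea as a named helper, assuming some raw values carry extra text around the year. `parse_year` and the sample values are illustrative only and are not part of the notebook. Unlike the lambda, the helper returns `None` instead of raising when no year is found.

```python
import re

# Hypothetical helper: extract the first four-digit run from a raw year value.
YEAR_PATTERN = re.compile(r"[0-9]{4}")

def parse_year(raw):
    """Return the first four-digit number in `raw` as an int, or None."""
    match = YEAR_PATTERN.search(str(raw))
    return int(match.group()) if match else None

# Made-up sample values, for illustration only
samples = ["1983", "TV Movie 2019", 2001, ""]
years = [parse_year(value) for value in samples]
print(years)
```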


In [2]:
# take a look at the columns

films.columns

Index(['Id', 'title', 'original_title', 'year', 'date_published', 'genre',
       'duration', 'country', 'language', 'director', 'writer',
       'production_company', 'actors', 'description', 'avg_vote', 'votes',
       'budget', 'usa_gross_income', 'worlwide_gross_income', 'metascore',
       'reviews_from_users', 'reviews_from_critics'],
      dtype='object')

The dataset offers many columns we could use to measure similarity between films. To start with we only need the genre, director, writer and production company.

In [3]:
# remove redundant columns

columns_to_drop = ['title', 'duration', 'actors', 'original_title', 'year', 'date_published', 'country', 'language', 'description', 'avg_vote', 'votes', 'reviews_from_users', 'reviews_from_critics', 'budget', 'usa_gross_income', 'metascore', 'worlwide_gross_income']

films_regression = films.drop(columns_to_drop, axis = 1)

films_regression.head()

Unnamed: 0,Id,genre,director,writer,production_company
0,tt0017938,"Drama, Romance",Jean Epstein,"Jean Epstein, Paul Morand",Films Jean Epstein
1,tt0035423,"Comedy, Fantasy, Romance",James Mangold,"Steven Rogers, James Mangold",Konrad Pictures
2,tt0036606,"Drama, War",Michael Radford,"Jessie Kesson, John Francis Lane",Associated-Rediffusion Television
3,tt0062181,Drama,Jerzy Skolimowski,"Andrzej Kostenko, Jerzy Skolimowski",Arte France Cinéma
4,tt0064730,"Action, Crime",Kinji Fukasaku,"Kinji Fukasaku, Fumio Kônami",Toei Company


Now we create dummy features for each genre, director, writer and production company.

In [4]:
# create dummy variables for regression

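from collections import Counter

def make_dummies(column, df):
    
    print('Creating list and counting labels')
    
    # Get a list of every label, e.g. "Drama, War" contributes "Drama" and "War"
    
    labels_list = ", ".join(df[column].astype(str)).split(", ")
    
    # Count all labels in a single pass (list.count inside a loop is very slow here)
    
    label_counts = Counter(labels_list)
    
    print('Removing labels with only one occurrence')
    
    # keep only the labels that occur more than once
    
    labels = {label for label, count in label_counts.items() if count > 1}

    print('Creating new columns for the labels')
    
    # Fill columns with 0's and 1's
    # regex=False matches the label literally, so characters such as '(' or '.' are safe
    # note that this is still a substring match, not a whole-label match
    
    for label in labels:
        df[label] = df[column].astype(str).str.contains(label, regex=False).astype(int)

    return df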

dummy_variable_list = ['genre', 'director', 'writer', 'production_company']

i=0

#for column in dummy_variable_list:
    #print('Completed ' + str(i) + '/' + str(len(dummy_variable_list)))
    #films_regression = make_dummies(column, films_regression)
    #i+=1
    
#print('Completed ' + str(i) + '/' + str(len(dummy_variable_list)))
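One caveat with `str.contains` is that it matches substrings, so a short name can also flag rows where it only appears inside a longer name. The sketch below compares whole labels instead. It is plain Python rather than pandas, `encode_labels` is a hypothetical helper, and the names in `cells` are made up for illustration.

```python
from collections import Counter

def encode_labels(cells, min_count=2):
    """One-hot encode comma-separated labels, comparing whole labels only.

    Returns the kept labels (sorted) and one 0/1 row per input cell.
    """
    # Split every cell into its labels and discard empty strings
    split_cells = [[label for label in cell.split(", ") if label] for cell in cells]
    counts = Counter(label for labels in split_cells for label in labels)
    # Keep labels that occur at least `min_count` times
    kept = sorted(label for label, count in counts.items() if count >= min_count)
    rows = [[int(label in set(labels)) for label in kept] for labels in split_cells]
    return kept, rows

# Made-up cells, for illustration only
cells = ["John Ford", "John Fordham, John Ford", "John Fordham", ""]
kept, rows = encode_labels(cells)
print(kept)
print(rows)
```

With a substring match, the third cell would also be flagged as `John Ford`. Here it is not.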

This bit gets confusing: we drop the Id column, add it back, convert to a sparse matrix and then convert back again. The reason is that the first run saved the encoded data to a file for the regression, whereas later runs load that file and need the Id column again, so don't worry too much about this bit.

In [5]:
# set aside the id column in a separate df

id_df = films_regression['Id']

# drop the now redundant columns

#films_regression = films_regression.drop(columns=['Id', 'genre', 'director', 'writer', 'production_company'])

#films_regression = pd.read_csv('labeled_df.csv.gz', compression='gzip', header=0, sep=',', quotechar='"')
        
#films_regression.head()

Convert this massive dataframe to a scipy sparse matrix for quick saving.
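As a rough picture of what the COO ("coordinate") format stores, here is a minimal pure-Python sketch. It keeps only the non-zero entries as parallel lists of row index, column index and value, which is why a table that is almost entirely zeros becomes so much smaller. `to_coo` is a hypothetical helper for illustration and does not use scipy; the rows are made up.

```python
def to_coo(dense_rows):
    """Return (row_indices, col_indices, values) for the non-zero entries."""
    rows, cols, values = [], [], []
    for i, row in enumerate(dense_rows):
        for j, value in enumerate(row):
            if value != 0:
                rows.append(i)
                cols.append(j)
                values.append(value)
    return rows, cols, values

# Made-up dummy-variable rows, for illustration only
dense = [
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 1],
]
rows, cols, values = to_coo(dense)
print(rows, cols, values)
print(len(values), "stored values instead of", len(dense) * len(dense[0]))
```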

In [6]:
from scipy import sparse
from time import time

start = time()


#sparse_df = sparse.coo_matrix(films_regression)

print('Converting to a sparse matrix took ' + str((time()-start)/60) + ' minutes.')

#sparse_df

Converting to a sparse matrix took 0.0 minutes.


Save/load the sparse matrix

In [7]:
#sparse.save_npz('sparse_matrix.npz', sparse_df)

sparse_df = sparse.load_npz('sparse_matrix.npz')

films_regression = pd.DataFrame.sparse.from_spmatrix(sparse_df)

films_regression['Id'] = id_df

films_regression.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,30741,30742,30743,30744,30745,30746,30747,30748,30749,Id
0,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,tt0017938
1,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,tt0035423
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,tt0036606
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,tt0062181
4,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,tt0064730


We are now ready for regression. We will load Jo's films and convert them to the sparse format for use as our training data.

In [8]:
# import Jo's watched films

Jo = pd.read_csv('JosFilms.csv')

# drop the rows with NaN values

Jo = Jo.dropna()

Jo.head()

Unnamed: 0,Film,Response,Id
0,Shutter Island,3.0,tt1130884
3,Green Book,3.0,tt6966692
4,The Silence of the Lambs,3.0,tt0102926
5,Paranormal Activity,0.0,tt1179904
6,Saw,1.0,tt0387564


The data frame needs all of the new features so we conduct an inner join with the main films dataframe on the Id column.

In [9]:
# Join the two dataframes by ID

Jo_regression = pd.merge(films_regression, Jo, on='Id', how='inner')

# drop the titles and Id columns

Jo_regression = Jo_regression.drop(['Id', 'Film'], axis=1)

Jo_regression.tail()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,30741,30742,30743,30744,30745,30746,30747,30748,30749,Response
69,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,3.0
70,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0.0
71,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,1.0
72,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2.0
73,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,1.0
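An inner join silently drops any rated film whose `Id` is missing from the main table, for example a film that was filtered out by the year cut-off. A minimal sketch of a check for this, using plain Python sets and made-up Ids; `unmatched_ids` is a hypothetical helper, not part of the notebook.

```python
def unmatched_ids(rated_ids, database_ids):
    """Return the rated Ids that have no match in the database, sorted."""
    return sorted(set(rated_ids) - set(database_ids))

# Made-up Ids, for illustration only
database_ids = ["tt0000001", "tt0000002", "tt0000003"]
rated_ids = ["tt0000002", "tt0000009"]
missing = unmatched_ids(rated_ids, database_ids)
print(missing)
```

In the notebook this could be called as `unmatched_ids(Jo['Id'], films_regression['Id'])` to list the rated films that never reach the training set.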


Now we fit a regression model on Jo's rated films and use it to predict her response to every film in the database.
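A note on the settings used in the cells below: with `alpha=0` the Lasso penalty vanishes, so the fit reduces to ordinary least squares, and scikit-learn's documentation advises against using `Lasso` this way for numerical reasons. With far more feature columns than rated films, plain least squares can usually reproduce the training ratings exactly, which says little about how well it generalises. The sketch below illustrates this with `numpy.linalg.lstsq` on made-up data. It is not the notebook's model, and giving each film one feature of its own is an assumption that guarantees an exact fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: 5 "rated films" described by 20 binary features.
# The first 5 columns give every film one feature of its own (assumption),
# the remaining 15 columns are random 0/1 features.
n_films = 5
X = np.hstack([np.eye(n_films), rng.integers(0, 2, size=(n_films, 15))])
y = np.array([3.0, 0.0, 1.0, 2.0, 3.0])

# Minimum-norm least-squares solution (no penalty, no intercept)
coef, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)

# Largest absolute error on the training ratings
train_error = np.abs(X @ coef - y).max()
print(rank, train_error)
```

A positive `alpha` trades some of this exact fit for smaller, sparser coefficients.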

In [10]:
from sklearn.linear_model import Lasso

In [11]:
# define the training and test sets as sparse matrices

X_train1 = sparse.coo_matrix(Jo_regression.iloc[:,0:Jo_regression.shape[1]-1])

y_train1 = Jo_regression['Response']

X_test1 = sparse_df

# define and fit the model

sparse_lasso1 = Lasso(alpha=0, fit_intercept=False, max_iter=1000).fit(X_train1, y_train1)



In [12]:
first_predictions = sparse_lasso1.predict(X_test1)

first_predictions = (first_predictions*3)/first_predictions.max()

first_predictions

array([1.30539642, 1.36498856, 2.0189668 , ..., 0.5417426 , 0.51897283,
       0.5417426 ])

This completes the first regression model which accounted for the genre, director, writers and production company. We will now do the same with the film description using a similar process.

# TF-IDF vectoriser

In this section we do the same as above, but using the words in each film's description as features.

In [13]:
# import the vectorizer module

from sklearn.feature_extraction.text import TfidfVectorizer

# create a sparse matrix from the description column

vectorizer = TfidfVectorizer(
    min_df=3, # ignore terms that appear in fewer than 3 descriptions
    max_features=None, # have as many columns as we need
    strip_accents='unicode', # strip accents from characters
    analyzer='word', # build features from words
    token_pattern=r'\w{1,}', # regex defining a token: one or more word characters
    ngram_range=(1,3), # use single words plus two- and three-word sequences
    stop_words='english' # drop common English words that carry little meaning, e.g. "and"
    )

X = vectorizer.fit_transform(films['description'])

X

<63432x60007 sparse matrix of type '<class 'numpy.float64'>'
	with 1138068 stored elements in Compressed Sparse Row format>
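To make the weights in this matrix less of a black box, here is a minimal pure-Python sketch of TF-IDF. It assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation of each row, which is what scikit-learn's documentation describes as the default behaviour. It handles single words only, with no stop words, n-grams or `min_df`, and the descriptions are made up.

```python
import math
from collections import Counter

def tfidf(documents):
    """Return the sorted vocabulary and one L2-normalised TF-IDF row per document."""
    tokenised = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenised for token in tokens})
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(token for tokens in tokenised for token in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}
    rows = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocabulary, rows

# Made-up descriptions, for illustration only
docs = ["detective hunts killer", "detective falls in love", "killer shark"]
vocabulary, rows = tfidf(docs)
shark = vocabulary.index("shark")
killer = vocabulary.index("killer")
print(rows[2][shark], rows[2][killer])
```

In the last description the rarer word `shark` receives a larger weight than `killer`, which also appears in another description.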

We need to add an Id column, so we convert the sparse matrix to a DataFrame with sparse columns.

In [14]:
films_regression = pd.DataFrame.sparse.from_spmatrix(X)

films_regression['Id'] = id_df

films_regression.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,59998,59999,60000,60001,60002,60003,60004,60005,60006,Id
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,tt0017938
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,tt0035423
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,tt0036606
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,tt0062181
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,tt0064730


In [15]:
# Join the two dataframes by ID

Jo_regression = pd.merge(films_regression, Jo, on='Id', how='inner')

# drop the titles and Id columns

Jo_regression = Jo_regression.drop(['Id', 'Film'], axis=1)

Jo_regression.tail()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,59998,59999,60000,60001,60002,60003,60004,60005,60006,Response
69,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0
70,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
71,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
72,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0
73,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [16]:
# define the training and test sets as sparse matrices

X_train2 = sparse.coo_matrix(Jo_regression.iloc[:,0:Jo_regression.shape[1]-1])

y_train2 = Jo_regression['Response']

X_test2 = X

# define and fit the model

sparse_lasso2 = Lasso(alpha=0, fit_intercept=False, max_iter=1000).fit(X_train2, y_train2)



In [17]:
second_predictions = sparse_lasso2.predict(X_test2)

second_predictions

array([2.13405573e-68, 3.28774556e-16, 1.31861099e-34, ...,
       0.00000000e+00, 0.00000000e+00, 2.88293791e-01])

The predictions span many orders of magnitude, so we'll normalise them by rescaling so that the highest prediction is 3.

In [18]:
second_predictions = (second_predictions*3)/second_predictions.max()
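Dividing by the maximum pins the top score to 3 but leaves the lower end wherever the model put it, so any negative predictions stay negative. If scores strictly between 0 and 3 are wanted, one option is min-max rescaling. The sketch below is an alternative, not what the notebook does; `rescale_to_range` is a hypothetical helper and the input values are made up.

```python
import numpy as np

def rescale_to_range(values, low=0.0, high=3.0):
    """Linearly map `values` so the minimum becomes `low` and the maximum `high`."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        # All predictions identical: return the midpoint of the range
        return np.full_like(values, (low + high) / 2)
    return low + (values - values.min()) * (high - low) / span

# Made-up predictions, for illustration only
raw = np.array([-0.5, 0.0, 0.25, 1.5])
scaled = rescale_to_range(raw)
print(scaled)
```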

# Final step

Now we'll amalgamate our two results by averaging them.

In [19]:
results = (first_predictions + second_predictions)/2

films['response1'] = first_predictions

films['response2'] = second_predictions

results

array([0.65269821, 0.68249428, 1.0094834 , ..., 0.2708713 , 0.25948641,
       0.31000128])

In [20]:
films['predicted_response'] = results

films = films.sort_values(by='predicted_response', ascending=False)

# keep only films whose language is listed as exactly 'English'

films = films[films['language'] == 'English']

films.head()

Unnamed: 0,Id,title,original_title,year,date_published,genre,duration,country,language,director,...,votes,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics,response1,response2,predicted_response
36097,tt1468703,Bear,Bear,2010,2010-06-04,"Action, Adventure, Drama",78,USA,English,Roel Reiné,...,1130,$ 100000,,,,45,15.0,1.478715,2.337357,1.908036
5027,tt0097888,L'eredita di Miss Richards,Minnamurra,1989,1989-08-17,"Action, Drama, Romance",92,Australia,English,Ian Barry,...,159,,,,,4,,1.261335,2.48826,1.874798
34574,tt1319553,1 and 0 nly,1 and 0 nly,2008,2008-10-16,"Drama, Fantasy, Mystery",84,Australia,English,Martyn Park,...,195,,,,,2,,1.318234,2.406744,1.862489
24764,tt0443706,Zodiac,Zodiac,2007,2007-05-18,"Crime, Drama, Mystery",157,USA,English,David Fincher,...,443791,$ 65000000,$ 33080084,$ 84785914,78.0,782,409.0,2.758106,0.823423,1.790765
52375,tt4470288,Abnormal Attraction,Abnormal Attraction,2018,2018-03-01,"Adventure, Comedy, Fantasy",107,USA,English,Michael Leavy,...,2241,,,,,100,26.0,1.582368,1.956495,1.769432
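The equality test keeps only rows whose `language` field is exactly `English`, so a film listed as `English, French` (such as Kate & Leopold in the first table) is dropped. If those films should be kept, a membership test on the comma-separated list is one option. `has_language` below is a hypothetical helper and the sample values are for illustration only.

```python
def has_language(language_field, wanted="English"):
    """True if `wanted` is one of the comma-separated languages in the field."""
    return wanted in [part.strip() for part in str(language_field).split(",")]

# Sample language fields, for illustration only
samples = ["English", "English, French", "None, French", ""]
flags = [has_language(value) for value in samples]
print(flags)
```

In the notebook this could replace the equality filter with `films[films['language'].apply(has_language)]`.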
