# Film recommender system - regression style


Jo and I can never decide which film to watch together. To fix this, I've made a list of all the films we've watched together and had her rate each one on a simple scale: 0 = bad, 1 = ok, 2 = good. I hope to build a content-based recommender system from these ratings that we can use to find future films.

The key ingredient is a massive film database that I found on Kaggle here:

https://www.kaggle.com/stefanoleone992/imdb-extensive-dataset

Let's take a look at it.

In [1]:
# import modules and data

import pandas as pd
import numpy as np
import re

films = pd.read_csv('IMDb movies.csv')

# rename the id column for future merging

films = films.rename(columns={'imdb_title_id':'Id'})

# pull the first four-digit number out of the year column and convert it to an integer

films['year'] = films['year'].apply(lambda x: int(re.findall("[0-9]{4}", str(x))[0]))

# we run into memory issues later, so keep only the films released after 1980

films = films[films['year']>1980].reset_index(drop=True)

print(films.shape)

films.head()



(63432, 22)


Unnamed: 0,Id,title,original_title,year,date_published,genre,duration,country,language,director,...,actors,description,avg_vote,votes,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics
0,tt0017938,La glace à trois faces,La glace à trois faces,1983,1983-03-13,"Drama, Romance",45,France,"None, French",Jean Epstein,...,"Jeanne Helbling, Suzy Pierson, Olga Day, Raymo...",Psychological narrative avantgarde film about ...,7.0,759,,,,,7.0,4.0
1,tt0035423,Kate & Leopold,Kate & Leopold,2001,2002-03-01,"Comedy, Fantasy, Romance",118,USA,"English, French",James Mangold,...,"Meg Ryan, Hugh Jackman, Liev Schreiber, Brecki...",An English Duke from 1876 is inadvertedly drag...,6.4,77852,$ 48000000,$ 47121859,$ 76019048,44.0,341.0,115.0
2,tt0036606,"Another time, another place - Una storia d'amore","Another Time, Another Place",1983,1983-07-15,"Drama, War",118,UK,"English, Italian",Michael Radford,...,"Phyllis Logan, Giovanni Mauriello, Gianluca Fa...","Set in 1943 Scotland during World War II, Jani...",6.5,252,,,,,3.0,10.0
3,tt0062181,Mani in alto!,Rece do góry,1981,1985-01-21,Drama,76,Poland,Polish,Jerzy Skolimowski,...,"Jerzy Skolimowski, Joanna Szczerbic, Tadeusz L...","Censored by the Polish authorities, this movie...",6.5,296,,,,,2.0,6.0
4,tt0064730,Nihon boryoku-dan: Kumicho,Nihon boryoku-dan: Kumicho,2000,1969,"Action, Crime",97,Japan,Japanese,Kinji Fukasaku,...,"Kôji Tsuruta, Tomisaburô Wakayama, Bunta Sugaw...",Coming out of jail and hoping for a quiet life...,7.0,168,,,,,3.0,5.0
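As an aside, the year clean-up in the cell above can also be written with pandas' vectorised string methods. This is only a sketch on made-up values (the `raw_years` series is hypothetical, not taken from the dataset); it assumes every entry contains a four-digit year, just as the `re.findall` version does.

```python
import pandas as pd

# Hypothetical raw values: plain years, a year with stray text, and an integer
raw_years = pd.Series(['1983', 'TV Movie 2019', 2001])

# Cast to string, capture the first run of four digits, convert to int.
# An entry with no four-digit run would become NaN and make astype(int) fail.
years = raw_years.astype(str).str.extract(r'(\d{4})', expand=False).astype(int)

print(years.tolist())
```

On the real dataframe the same expression would be applied to `films['year']`.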


In [2]:
# take a look at the columns

films.columns

Index(['Id', 'title', 'original_title', 'year', 'date_published', 'genre',
       'duration', 'country', 'language', 'director', 'writer',
       'production_company', 'actors', 'description', 'avg_vote', 'votes',
       'budget', 'usa_gross_income', 'worlwide_gross_income', 'metascore',
       'reviews_from_users', 'reviews_from_critics'],
      dtype='object')

There are lots of columns here that could be used to measure how similar two films are, which makes the job much easier. I will try a regression-type method for recommending. This will require some data preparation.

In [3]:
# remove redundant columns

columns_to_drop = ['title', 'duration', 'actors', 'original_title', 'year', 'date_published', 'country', 'language', 'description', 'avg_vote', 'votes', 'reviews_from_users', 'reviews_from_critics', 'budget', 'usa_gross_income', 'metascore', 'worlwide_gross_income']

films_regression = films.drop(columns_to_drop, axis = 1)

films_regression.head()

Unnamed: 0,Id,genre,director,writer,production_company
0,tt0017938,"Drama, Romance",Jean Epstein,"Jean Epstein, Paul Morand",Films Jean Epstein
1,tt0035423,"Comedy, Fantasy, Romance",James Mangold,"Steven Rogers, James Mangold",Konrad Pictures
2,tt0036606,"Drama, War",Michael Radford,"Jessie Kesson, John Francis Lane",Associated-Rediffusion Television
3,tt0062181,Drama,Jerzy Skolimowski,"Andrzej Kostenko, Jerzy Skolimowski",Arte France Cinéma
4,tt0064730,"Action, Crime",Kinji Fukasaku,"Kinji Fukasaku, Fumio Kônami",Toei Company


In [4]:
# create dummy variables for regression

def make_dummies(column):
    
    dummy = pd.get_dummies(films_regression[column])
    
    df = pd.merge(dummy, films_regression, left_index=True, right_index=True).drop([column], axis=1)
    
    return df

dummy_variable_list = ['genre', 'director', 'writer', 'production_company']

for column in dummy_variable_list:
    films_regression = make_dummies(column)

films_regression.head()

Unnamed: 0,"""DIA"" Productions GmbH & Co. KG","""DumBeast"" Partners","""G"" P.C. S.A.","""GREEN"" Productions","""Mi"" Production Studio","""Pempti & 12"" Tsaltabasis-Xenopoulos","""Ulitka"" Studio","""Weathering With You"" Film Partners",#Sinning Works,#littlesecretfilm,...,"Thriller, Sci-Fi","Thriller, War","Thriller, Western",War,"War, Action","War, Drama","War, Drama, Action",Western,"Western, Action, Drama",Id
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,tt0017938
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,tt0035423
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,tt0036606
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,tt0062181
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,tt0064730
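Note that `pd.get_dummies` treats each full string as one category, so "Drama, Romance" and "Drama, War" become unrelated columns, as the column names in the output above show. One possible alternative, sketched here on a hypothetical three-row frame rather than the real data, is `Series.str.get_dummies`, which splits on a separator and gives one column per individual genre.

```python
import pandas as pd

# Hypothetical toy frame with the same comma-separated genre format
toy = pd.DataFrame({
    'Id': ['tt1', 'tt2', 'tt3'],
    'genre': ['Drama, Romance', 'Drama, War', 'Drama'],
})

# One column per distinct string: 'Drama', 'Drama, Romance', 'Drama, War'
whole_string = pd.get_dummies(toy['genre'])

# One column per individual genre: 'Drama', 'Romance', 'War'
per_genre = toy['genre'].str.get_dummies(sep=', ')

print(list(whole_string.columns))
print(per_genre)
```

With per-genre columns, a rating for one drama says something about every other drama, not only about films with exactly the same genre combination. The same idea would apply to the `writer` column, which is also comma-separated.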


In [5]:
# import Jo's watched films

Jo = pd.read_csv('JosFilms.csv')

# drop the rows with NaN values

Jo = Jo.dropna()

Jo.head()

Unnamed: 0,Film,Response,Id
0,Shutter Island,2,tt1130884
3,Green Book,2,tt6966692
4,The Silence of the Lambs,2,tt0102926
5,Paranormal Activity,0,tt1179904
6,Saw,1,tt0387564


In [6]:
# Join the two dataframes by ID

Jo_regression = pd.merge(films_regression, Jo,on='Id', how='right')

# drop the titles and Id columns

Jo_regression = Jo_regression.drop(['Film_y', 'Id'], axis=1)

Jo_regression.tail()

Unnamed: 0,"""DIA"" Productions GmbH & Co. KG","""DumBeast"" Partners","""G"" P.C. S.A.","""GREEN"" Productions","""Mi"" Production Studio","""Pempti & 12"" Tsaltabasis-Xenopoulos","""Ulitka"" Studio","""Weathering With You"" Film Partners",#Sinning Works,#littlesecretfilm,...,"Thriller, Sci-Fi","Thriller, War","Thriller, Western",War,"War, Action","War, Drama","War, Drama, Action",Western,"Western, Action, Drama",Response
50,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
51,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
52,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
53,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
54,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
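A right merge keeps every row of Jo's list, so any rated film whose Id is missing from the database (for example one released before 1981) would come through with NaN features. A quick way to check for this is the `indicator` option of `pd.merge`. The frames below are hypothetical stand-ins for `films_regression` and `Jo`.

```python
import pandas as pd

# Hypothetical stand-ins for the feature table and the ratings list
features = pd.DataFrame({'Id': ['tt1', 'tt2'], 'Drama': [1, 0]})
ratings = pd.DataFrame({'Id': ['tt1', 'tt9'], 'Response': [2, 1]})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
merged = pd.merge(features, ratings, on='Id', how='right', indicator=True)

# Rated films with no match in the feature table
unmatched = merged.loc[merged['_merge'] == 'right_only', 'Id'].tolist()

# Keep only the rows that have features to train on
matched = merged[merged['_merge'] == 'both'].drop(columns='_merge')

print(unmatched)
print(matched)
```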


Now we simply fit a regression model on Jo's rated films and use it to predict a response for every film in the database.
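To make the idea concrete, here is a rough sketch of a least-squares fit on dummy variables using a hypothetical three-film example. It uses `np.linalg.lstsq` as a stand-in, so the coefficients are not necessarily the ones `LinearRegression` would return, but the principle is the same: a film's predicted response is the intercept plus the coefficients of the features it has.

```python
import numpy as np
import pandas as pd

# Hypothetical rated films described by one-hot features
rated = pd.DataFrame({
    'genre_Drama':  [1, 0, 1],
    'genre_Horror': [0, 1, 0],
    'director_A':   [1, 0, 0],
    'Response':     [2, 0, 1],
})

X = rated.drop(columns='Response').to_numpy(dtype=float)
y = rated['Response'].to_numpy(dtype=float)

# Prepend a column of ones so the first coefficient acts as the intercept
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# With more unknowns than rated films, the fit reproduces the ratings exactly
fitted = A @ coef

# Score an unseen film: a horror film by director A
unseen = np.array([1.0, 0.0, 1.0, 1.0])
score = unseen @ coef

print(fitted.round(3), round(float(score), 3))
```

The real data is in the same situation as this toy example: there are far more dummy columns than rated films, so the model can match Jo's ratings exactly, and the predictions are best read as a rough ranking rather than as precise ratings.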

In [7]:
# import modules

from sklearn.linear_model import LinearRegression

# define the training and test sets

X_train = Jo_regression.iloc[:,0:Jo_regression.shape[1]-1]

y_train = Jo_regression.iloc[:,Jo_regression.shape[1]-1:Jo_regression.shape[1]]

X_test = films_regression.drop(['Id'], axis=1)

# define and fit the model

lr = LinearRegression().fit(X_train, y_train)

In [8]:
# due to memory issues we'll have to split the predicting into pieces

import math

responses = []

for i in range(10):
    start = math.ceil(i*(films.shape[0]/10))
    end = math.ceil((i+1)*(films.shape[0]/10))
    X_test_piece = X_test.iloc[start:end,:]
    
    # predict the responses for the piece

    responses.append(list(lr.predict(X_test_piece)))
    
print(len(responses))

10


In [9]:
# flatten the 10 separate lists into a single list of all the responses
responses_list = []
for item in responses:
    for items in item:
        responses_list.append(items)
        
# convert to pandas series

responses = pd.Series(responses_list)

responses = responses.apply(lambda x: x[0])
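The chunked prediction and the flattening step can also be folded into a single helper. The sketch below is one way to do it; `predict_in_chunks` and the `RowSum` stand-in model are hypothetical names introduced for illustration, and any fitted object with a `.predict` method (such as `lr`) could be passed in instead.

```python
import numpy as np
import pandas as pd

class RowSum:
    """Hypothetical stand-in for a fitted model; returns shape (n, 1) like lr does."""
    def predict(self, X):
        return X.to_numpy().sum(axis=1, keepdims=True)

def predict_in_chunks(model, X, chunk_size):
    """Predict chunk_size rows at a time and return one flat Series."""
    parts = []
    for start in range(0, len(X), chunk_size):
        piece = X.iloc[start:start + chunk_size]
        # ravel() turns an (n, 1) prediction array into a flat (n,) array
        parts.append(np.asarray(model.predict(piece)).ravel())
    return pd.Series(np.concatenate(parts), index=X.index)

# Hypothetical feature table with five films
X_demo = pd.DataFrame({'a': [1, 0, 1, 1, 0], 'b': [0, 0, 1, 0, 1]})
preds = predict_in_chunks(RowSum(), X_demo, chunk_size=2)

print(preds.tolist())
```

Returning a Series that shares the index of the input means the result can be assigned straight to a new column without a separate flattening step.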

In [10]:
# insert back into the films dataframe for sorting

films['response'] = responses

# sort the dataframe by the responses

films = films.sort_values(by=['response'], ascending=False)

# remove the previously seen films

films = films[~films['Id'].isin(Jo['Id'])]
    
# display the top 10 results

films.head(10)

Unnamed: 0,Id,title,original_title,year,date_published,genre,duration,country,language,director,...,description,avg_vote,votes,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics,response
46271,tt2788710,The Interview,The Interview,2014,2014-12-24,"Action, Adventure, Comedy",112,USA,"English, Korean","Evan Goldberg, Seth Rogen",...,Dave Skylark and his producer Aaron Rapaport r...,6.5,301891,$ 44000000,$ 6105175,$ 11782625,52.0,929.0,301.0,2.176884
7324,tt0106226,L'età dell'innocenza,The Age of Innocence,1993,1993-09-10,"Drama, Romance",139,USA,"English, Italian",Martin Scorsese,...,A tale of nineteenth-century New York high soc...,7.2,50690,$ 34000000,$ 32255440,$ 32255440,90.0,186.0,73.0,2.11974
57887,tt6320628,Spider-Man: Far from Home,Spider-Man: Far from Home,2019,2019-07-10,"Action, Adventure, Sci-Fi",129,USA,"English, Italian, Czech",Jon Watts,...,Following the events of,7.5,304708,$ 160000000,$ 390532085,$ 1131927996,69.0,2041.0,415.0,2.039769
10998,tt0119485,Kundun,Kundun,1997,1998-03-26,"Biography, Drama, History",134,"USA, Monaco, Morocco","English, Tibetan, Mandarin",Martin Scorsese,...,"From childhood to adulthood, Tibet's fourteent...",7.0,25870,$ 28000000,$ 5684789,$ 5684789,74.0,131.0,76.0,1.988597
2053,tt0088680,Fuori orario,After Hours,1985,1986-05-15,"Comedy, Crime, Drama",97,USA,English,Martin Scorsese,...,An ordinary word processor has the worst night...,7.7,56815,$ 4500000,$ 10609321,$ 10609321,90.0,228.0,113.0,1.931724
1245,tt0085794,Re per una notte,The King of Comedy,1982,1982-12-18,"Comedy, Crime, Drama",109,USA,English,Martin Scorsese,...,Rupert Pupkin is a passionate yet unsuccessful...,7.8,84308,$ 20000000,$ 2536242,$ 2536242,73.0,296.0,111.0,1.931724
45286,tt2527336,Star Wars - Gli ultimi Jedi,Star Wars: Episode VIII - The Last Jedi,2017,2017-12-13,"Action, Adventure, Fantasy",152,USA,English,Rian Johnson,...,Rey develops her newly discovered abilities wi...,7.0,547797,$ 317000000,$ 620181382,$ 1332540187,84.0,6718.0,717.0,1.92724
49683,tt3654796,Creep 2,Creep 2,2017,2017-10-24,"Horror, Thriller",78,USA,English,Patrick Brice,...,A video artist looking for work drives to a re...,6.4,17701,,,,75.0,104.0,48.0,1.860592
28399,tt0844286,The Brothers Bloom,The Brothers Bloom,2008,2009-06-19,"Action, Adventure, Comedy",114,USA,"English, French, Czech, Japanese",Rian Johnson,...,The Brothers Bloom are the best con men in the...,6.8,48453,$ 20000000,$ 3531756,$ 5530764,55.0,126.0,184.0,1.846537
29500,tt0959337,Revolutionary Road,Revolutionary Road,2008,2009-01-30,"Drama, Romance",119,"USA, UK",English,Sam Mendes,...,A young couple living in a Connecticut suburb ...,7.3,192030,$ 35000000,$ 22911480,$ 75981180,69.0,491.0,323.0,1.841764


In [16]:
films.to_csv('regression_films.csv', index=False)