# Data Exploitation - Recommender Systems

This notebook is devoted to the investigation of three different approaches to recommender systems implemented upon the [TMDB](https://www.kaggle.com/tmdb/tmdb-movie-metadata) dataset .

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import numpy as np
import json
import base64
import io
import codecs
from IPython.display import HTML
from surprise import Reader, Dataset, SVD, evaluate
from scipy import spatial
import operator
import warnings
warnings.filterwarnings('ignore')

# Graph - based recommender

In this part we will present a Graph-based recommender which makes movie recommendations based on a very simple method. 

As described, we created the adjacency matrix where the edges are created between two movies whenever they share a common cast or crew member. The weights are determined depending on the number of shared cast and crew members. Thus, for a given movie, the graph-based recommender will recommend the top 10 movies which have the largest connecting weights with the movie we want to have recommendations for. 

This model suffers from some severe limitations. It is only capable of suggesting movies which are best connected to a certain movie through their cast and crew members. Furthermore, it is not capable of capturing user tastes and providing recommendations across genres or other movie content.

This is obviously a very naive approach, but the results are satisfying for this level of complexity as will be shown below.

In [2]:
# Load the adjacency matrix for our graph
adjacency = np.load('data/final_project_adjacency_largest.npy')

In [3]:
final_project_df = pd.read_csv('data/final_project_df.csv')

In [4]:
def recommend_weights(movie_id):
    """
    Return top 10 movies connected with largest edge weights for a given movie id.
    """
    row = adjacency[movie_id,:]
    max_indices = list(row.argsort()[-10:][::-1])
    return final_project_df[final_project_df.index.isin(max_indices)]

In [5]:
# Movie that we want to have recommendations for
final_project_df[final_project_df.index == 1]

Unnamed: 0,movie_id,title,budget,keywords,original_language,popularity,production_companies,release_date,revenue,runtime,...,History,Horror,Music,Mystery,Romance,Science Fiction,TV Movie,Thriller,War,Western
1,285,Pirates of the Caribbean: At World's End,300000000,"{'traitor', 'exotic island', 'pirate', 'east i...",en,139.082615,"{'Jerry Bruckheimer Films', 'Second Mate Produ...",2007-05-19,961000000,169.0,...,0,0,0,0,0,0,0,0,0,0


In [6]:
# Get the top 10 recommended movies
recommend_weights(1)

Unnamed: 0,movie_id,title,budget,keywords,original_language,popularity,production_companies,release_date,revenue,runtime,...,History,Horror,Music,Mystery,Romance,Science Fiction,TV Movie,Thriller,War,Western
12,58,Pirates of the Caribbean: Dead Man's Chest,200000000,"{'exotic island', 'pirate', 'east india tradin...",en,145.847379,"{'Jerry Bruckheimer Films', 'Second Mate Produ...",2006-06-20,1065659812,151.0,...,0,0,0,0,0,0,0,0,0,0
13,57201,The Lone Ranger,255000000,"{'lawyer', 'tonto', 'texas', 'survivor', 'texa...",en,49.046956,"{'Silver Bullet Productions (II)', 'Classic Me...",2013-07-03,89289910,149.0,...,0,0,0,0,0,0,0,0,0,1
17,1865,Pirates of the Caribbean: On Stranger Tides,380000000,"{'silver', 'battle', 'pirate', 'sword', '3d', ...",en,135.413856,"{'Jerry Bruckheimer Films', 'Moving Picture Co...",2011-05-14,1045713802,136.0,...,0,0,0,0,0,0,0,0,0,0
107,676,Pearl Harbor,140000000,"{'love', 'pilot', 'airplane', 'pin-up', 'war',...",en,34.20669,"{'Jerry Bruckheimer Films', 'Touchstone Pictur...",2001-05-21,449220945,183.0,...,1,0,0,0,1,0,0,0,1,0
173,44896,Rango,135000000,"{'las vegas', 'armadillo', 'rango', 'chameleon...",en,29.91353,"{'Blind Wink', 'Paramount Animation', 'GK Film...",2011-03-02,245724603,107.0,...,0,0,0,0,0,0,0,0,0,1
181,8961,Bad Boys II,130000000,"{'ku klux klan', 'criminal underworld', 'mexic...",en,38.068736,"{'Columbia Pictures Corporation', 'Don Simpson...",2003-07-18,273339556,147.0,...,0,0,0,0,0,0,0,1,0,0
194,22,Pirates of the Caribbean: The Curse of the Bla...,140000000,"{'east india trading company', 'caribbean', 'a...",en,271.972889,"{'Jerry Bruckheimer Films', 'Walt Disney Pictu...",2003-07-09,655011224,143.0,...,0,0,0,0,0,0,0,0,0,0
307,855,Black Hawk Down,92000000,"{'warlord', 'famine', 'rescue operation', 'del...",en,44.455166,"{'Jerry Bruckheimer Films', 'Revolution Studio...",2001-12-28,172989651,144.0,...,1,0,0,0,0,0,0,0,1,0
342,2253,Valkyrie,75000000,"{'german officer', 'piano wire', 'wife husband...",en,38.832842,"{'United Artists', 'Achte Babelsberg Film', 'B...",2008-12-25,200276000,121.0,...,1,0,0,0,0,0,0,1,1,0
1484,6963,The Weather Man,20000000,"{'new york', 'chicago', 'daughter', 'weatherma...",en,14.031377,"{'Escape Artists', 'Kumar Mobiliengesellschaft...",2005-10-20,12482775,101.0,...,0,0,0,0,0,0,0,0,0,0


As shown in the example above, for "Pirates of the Caribbean: At World's End" the recommendations using the naive Graph-based recommender are limited, and we can notice that the system suggested the other three parts from the series of the "Pirates of the Caribbean" which confirmes that the results are acceptable and sufficient at this level. 

# Content - based recommender

Content-based recommender will be described in this part as a follow up on the naive Graph-based recommender.

Instead of only taking into account the cast and crew the movies share, in this recommender system the content of the movie (genres, cast, crew, keywords, etc.) is used to find its similarity with other movies. Then the movies that are most likely to be similar are recommended.

This is also referred as Content Based Filtering - where the recommender suggests similar items based on a particular item. The general idea behind these recommender systems is that if a person liked a particular item, then will also like an item that is similar to it.

The quality of our recommender would be increased with the usage of better metadata. That is exactly what we are going to do in this section. We are going to build a recommender based on the following content of each movie: the top 4 actors, the director, related genres and the movie plot keywords. We will compute pairwise similarity scores for all movies based on these features and recommend movies based on that similarity score. 

In [7]:
movies = pd.read_csv('data/tmdb_5000_movies.csv')
credits = pd.read_csv('data/tmdb_5000_credits.csv')

In [8]:
filter_movies = list(final_project_df.movie_id.values)

In [9]:
movies = movies[movies['id'].isin(filter_movies)]

In [10]:
movies.head(2)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500


After we load the movie data, we represent the above-mentioned four features for each movie as one-hot encoding vectors in order to be able to compute pairwise similarity between them and make recommendations. 

This is done for movie genres, top 4 actors from the cast, director of the movie and movie plot keywords. 

In [11]:
def prepare_df(df, column):
    """
    This function extracts data from json format for a given dataframe and column.  
    """
    df[column]=df[column].apply(json.loads)
    for index, i in zip(df.index, df[column]):
        temp_list=[]
        for j in range(len(i)):
            temp_list.append((i[j]['name'])) # the key 'name' contains the name of the genre
        df.loc[index, column]=str(temp_list)

In [12]:
# Extract the genres, keywords and the cast for the movies
prepare_df(movies, 'genres')
prepare_df(movies, 'keywords')
prepare_df(credits, 'cast')

In [13]:
def director(x):
    """
    Return the movie Director from movie crew.
    """
    for i in x:
        if i['job'] == 'Director':
            return i['name']

In [14]:
# Save just the movie Director from movie crew
credits['crew'] = credits['crew'].apply(json.loads)
credits['crew'] = credits['crew'].apply(director)
credits.rename(columns = {'crew':'director'},inplace=True)

In [15]:
# Create new merged dataframe from movies and credits with the specified columns
movies = movies.merge(credits, left_on='id', right_on='movie_id', how='left')
movies = movies[['id','original_title','genres','cast','vote_average','director','keywords']]
movies['genres'] = movies['genres'].str.strip('[]').str.replace(' ','').str.replace("'",'')
movies['genres'] = movies['genres'].str.split(',')

Once we have the prepared dataframe with all necessary information, we can continue to assign one-hot encoding vectors. We start with movie genres.

In [16]:
# Sort the genres in every movie and format the string
for i,j in zip(movies['genres'],movies.index):
    temp_list = []
    temp_list = i
    temp_list.sort()
    movies.loc[j,'genres']=str(temp_list)
movies['genres'] = movies['genres'].str.strip('[]').str.replace(' ','').str.replace("'",'')
movies['genres'] = movies['genres'].str.split(',')

In [17]:
# Get a list of unique movie genres 
genreList = []
for index, row in movies.iterrows():
    genres = row["genres"]
    
    for genre in genres:
        if genre not in genreList:
            genreList.append(genre)

In [18]:
def binary_genres(genre_list):
    """
    One-hot encoding for genres in a given list.
    """
    binaryList = []
    
    for genre in genreList:
        if genre in genre_list:
            binaryList.append(1)
        else:
            binaryList.append(0)
    
    return binaryList

In [19]:
movies['genres_bin'] = movies['genres'].apply(lambda x: binary_genres(x))

In this part, the same is done for the top 4 actors from the movie cast.

In [20]:
# Format the movie cast string
movies['cast'] = movies['cast'].str.strip('[]').str.replace(' ','').str.replace("'",'').str.replace('"','')
movies['cast'] = movies['cast'].str.split(',')

In [21]:
# Take the top 4 actors, sort them and format the string
for i,j in zip(movies['cast'],movies.index):
    temp_list = []
    temp_list = i[:4]
    temp_list.sort()
    movies.loc[j,'cast'] = str(temp_list)
    
movies['cast'] = movies['cast'].str.strip('[]').str.replace(' ','').str.replace("'",'')
movies['cast'] = movies['cast'].str.split(',')

In [22]:
# Get a list of unique cast members
castList = []
for index, row in movies.iterrows():
    cast = row["cast"]
    
    for i in cast:
        if i not in castList:
            castList.append(i)

In [23]:
def binary_cast(cast_list):
    """
    One-hot encoding for cast in a given list.
    """
    binaryList = []
    
    for genre in castList:
        if genre in cast_list:
            binaryList.append(1)
        else:
            binaryList.append(0)
    
    return binaryList

In [24]:
movies['cast_bin'] = movies['cast'].apply(lambda x: binary_cast(x))

From the movie crew data, we extracted only the movie director and here we encode this part.

In [25]:
def xstr(s):
    """
    Return whitespace if the string is empty. 
    """
    if s is None:
        return ''
    return str(s)

In [26]:
# Get a list of unique movie directors 
movies['director'] = movies['director'].apply(xstr)
directorList=[]
for i in movies['director']:
    if i not in directorList:
        directorList.append(i)

In [27]:
def binary_director(director_list):
    """
    One-hot encoding for director in a given list.
    """
    binaryList = []
    
    for direct in directorList:
        if direct in director_list:
            binaryList.append(1)
        else:
            binaryList.append(0)
    
    return binaryList

In [28]:
movies['director_bin'] = movies['director'].apply(lambda x: binary_director(x))

Lastly, the same is done for the movie plot keywords.

In [29]:
# Take the keywords for every movie, sort them and format the string
movies['keywords'] = movies['keywords'].str.strip('[]').str.replace(' ','').str.replace("'",'').str.replace('"','')
movies['keywords'] = movies['keywords'].str.split(',')

for i,j in zip(movies['keywords'],movies.index):
    temp_list = []
    temp_list = i
    temp_list.sort()
    movies.loc[j,'keywords'] = str(temp_list)

movies['keywords'] = movies['keywords'].str.strip('[]').str.replace(' ','').str.replace("'",'')
movies['keywords'] = movies['keywords'].str.split(',')

In [30]:
# Get a list of unique movie directors 
words_list = []
for index, row in movies.iterrows():
    genres = row["keywords"]
    
    for genre in genres:
        if genre not in words_list:
            words_list.append(genre)

In [31]:
def binary_keywords(words):
    """
    One-hot encoding for words in a given list.
    """
    binaryList = []
    
    for genre in words_list:
        if genre in words:
            binaryList.append(1)
        else:
            binaryList.append(0)
    
    return binaryList

In [32]:
movies['words_bin'] = movies['keywords'].apply(lambda x: binary_keywords(x))

In [33]:
movies['new_id'] = list(range(0,movies.shape[0]))
movies = movies[['original_title','genres','vote_average','genres_bin','cast_bin','new_id','director_bin','words_bin']]
movies.head(2)

Unnamed: 0,original_title,genres,vote_average,genres_bin,cast_bin,new_id,director_bin,words_bin
0,Avatar,"[Action, Adventure, Fantasy, ScienceFiction]",7.2,"[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",0,"[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
1,Pirates of the Caribbean: At World's End,"[Action, Adventure, Fantasy]",6.9,"[1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, ...",1,"[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."


The dataframe with assigned one-hot encoding vectors to the desired features is shown above. 

With this dataframe, we can now compute a similarity score. There are several options to compute this score, but the most commonly used are the euclidean distance, the Pearson and the cosine similarity scores. Different scores work well in different scenarios and it is often a good idea to experiment with different metrics in order to obtain the best score.

We will be using the cosine similarity to calculate a numeric quantity that denotes the similarity between two movies. We use the cosine similarity score since it is independent of magnitude and represents a judgment of orientation, and also is relatively easy and fast to calculate. 

The cosine similarity is calculated for each of the chosen features between two movies. The function below gives an equally weighted sum of the cosine similarity calculated for each of the four features. 

We can easily modify this setting and add new features (production company, production country) to the weighted sum. We can also increase the weights of different features (by multipling to a factor) and thus prioritizing over some of them.

In [34]:
def Similarity(movieId1, movieId2):
    """
    Compute and return cosine similarity between two movies for genres, cast, director and keywords.
    """
    
    a = movies.iloc[movieId1]
    b = movies.iloc[movieId2]
    
    genresA = a['genres_bin']
    genresB = b['genres_bin']
    genreDistance = spatial.distance.cosine(genresA, genresB)
    
    scoreA = a['cast_bin']
    scoreB = b['cast_bin']
    scoreDistance = spatial.distance.cosine(scoreA, scoreB)
    
    directA = a['director_bin']
    directB = b['director_bin']
    directDistance = spatial.distance.cosine(directA, directB)
    
    wordsA = a['words_bin']
    wordsB = b['words_bin']
    wordsDistance = spatial.distance.cosine(wordsA, wordsB)
    
    return genreDistance + directDistance + scoreDistance + wordsDistance

In [35]:
def getNeighbors(baseMovie, K):
    """
    Return K movie neighbors for a movie, based on cosine similarity for the defined contents of the movies.
    """
    distances = []

    for index, movie in movies.iterrows():
        if movie['new_id'] != baseMovie['new_id'].values[0]:
            dist = Similarity(baseMovie['new_id'].values[0], movie['new_id'])
            distances.append((movie['new_id'], dist))

    distances.sort(key=operator.itemgetter(1))
    neighbors = []

    for x in range(K):
        neighbors.append(distances[x])
    return neighbors

Here we just apply the cosine similarity function for a movie by calculating the cosine similarity between the chosen movie and every other movie in our data, and thus get the K best movie neighbors based on this metric. 

In [36]:
def recommend(name):
    """
    Return a movie recommendation neighbors for a movie, based on cosine similarity for the defined contents of the movies.
    """
    new_movie = movies[movies['original_title'].str.contains(name)].iloc[0].to_frame().T
    print('Selected Movie: '+ new_movie.original_title.values[0] + " | Genres: " + \
          str(new_movie.genres.values[0]).strip('[]').replace(' ','') + " | Rating: " + str(new_movie.vote_average.values[0]))

    K = 10
    avgRating = 0
    neighbors = getNeighbors(new_movie, K)
    print('\nRecommended Movies: \n')
    for neighbor in neighbors:
        avgRating += movies.iloc[neighbor[0]][2]  
        print( movies.iloc[neighbor[0]][0] + " | Genres: " + str(movies.iloc[neighbor[0]][1]).strip('[]').replace(' ','') + \
              " | Rating: " + str(movies.iloc[neighbor[0]][2]))
    print('\nAverage rating from recommended movies: {:.2f}'.format(avgRating/K))

We present the computed K best neighbors based on the similarity metric as movie recommendations for the selected movie. We can see how this model works on some examples below.

In [37]:
recommend("Pirates of the Caribbean: At World's End")

Selected Movie: Pirates of the Caribbean: At World's End | Genres: 'Action','Adventure','Fantasy' | Rating: 6.9

Recommended Movies: 

Pirates of the Caribbean: Dead Man's Chest | Genres: 'Action','Adventure','Fantasy' | Rating: 7.0
Pirates of the Caribbean: The Curse of the Black Pearl | Genres: 'Action','Adventure','Fantasy' | Rating: 7.5
The Lone Ranger | Genres: 'Action','Adventure','Western' | Rating: 5.9
Rango | Genres: 'Adventure','Animation','Comedy','Family','Western' | Rating: 6.6
Pirates of the Caribbean: On Stranger Tides | Genres: 'Action','Adventure','Fantasy' | Rating: 6.4
The Mexican | Genres: 'Action','Comedy','Crime','Romance' | Rating: 5.8
The Lord of the Rings: The Fellowship of the Ring | Genres: 'Action','Adventure','Fantasy' | Rating: 8.0
Thor: The Dark World | Genres: 'Action','Adventure','Fantasy' | Rating: 6.8
Spider-Man 2 | Genres: 'Action','Adventure','Fantasy' | Rating: 6.7
Thor | Genres: 'Action','Adventure','Fantasy' | Rating: 6.6

Average rating from rec

In [38]:
recommend('Donnie Brasco')

Selected Movie: Donnie Brasco | Genres: 'Crime','Drama','Thriller' | Rating: 7.4

Recommended Movies: 

Four Weddings and a Funeral | Genres: 'Comedy','Drama','Romance' | Rating: 6.6
The Departed | Genres: 'Crime','Drama','Thriller' | Rating: 7.9
Righteous Kill | Genres: 'Action','Crime','Drama','Thriller' | Rating: 5.9
The Godfather: Part III | Genres: 'Crime','Drama','Thriller' | Rating: 7.1
The Infiltrator | Genres: 'Crime','Drama','Thriller' | Rating: 6.6
Runner Runner | Genres: 'Crime','Drama','Thriller' | Rating: 5.5
Brooklyn's Finest | Genres: 'Crime','Drama','Thriller' | Rating: 6.2
A History of Violence | Genres: 'Crime','Drama','Thriller' | Rating: 6.9
Ocean's Thirteen | Genres: 'Crime','Thriller' | Rating: 6.5
The Insider | Genres: 'Drama','Thriller' | Rating: 7.3

Average rating from recommended movies: 6.65


In [39]:
recommend('Shawshank Redemption')

Selected Movie: The Shawshank Redemption | Genres: 'Crime','Drama' | Rating: 8.5

Recommended Movies: 

The Green Mile | Genres: 'Crime','Drama','Fantasy' | Rating: 8.2
The Bad Lieutenant: Port of Call - New Orleans | Genres: 'Crime','Drama' | Rating: 6.0
The Place Beyond the Pines | Genres: 'Crime','Drama' | Rating: 6.8
25th Hour | Genres: 'Crime','Drama' | Rating: 7.2
GoodFellas | Genres: 'Crime','Drama' | Rating: 8.2
Bronson | Genres: 'Action','Crime','Drama' | Rating: 6.9
American Gangster | Genres: 'Crime','Drama' | Rating: 7.4
Wall Street: Money Never Sleeps | Genres: 'Crime','Drama' | Rating: 5.8
Black Mass | Genres: 'Crime','Drama' | Rating: 6.3
Catch Me If You Can | Genres: 'Crime','Drama' | Rating: 7.7

Average rating from recommended movies: 7.05


In [40]:
recommend('The Green Mile')

Selected Movie: The Green Mile | Genres: 'Crime','Drama','Fantasy' | Rating: 8.2

Recommended Movies: 

The Shawshank Redemption | Genres: 'Crime','Drama' | Rating: 8.5
Catch Me If You Can | Genres: 'Crime','Drama' | Rating: 7.7
Perfume: The Story of a Murderer | Genres: 'Crime','Drama','Fantasy' | Rating: 7.1
The Mist | Genres: 'Horror','ScienceFiction','Thriller' | Rating: 6.7
Freedom Writers | Genres: 'Crime','Drama' | Rating: 7.5
The Legend of Bagger Vance | Genres: 'Drama','Fantasy' | Rating: 6.3
Road to Perdition | Genres: 'Crime','Drama','Thriller' | Rating: 7.3
Dancer in the Dark | Genres: 'Crime','Drama','Music' | Rating: 7.6
Boyz n the Hood | Genres: 'Crime','Drama' | Rating: 7.4
Extremely Loud & Incredibly Close | Genres: 'Drama' | Rating: 6.9

Average rating from recommended movies: 7.30


In [41]:
recommend('The Avengers')

Selected Movie: The Avengers | Genres: 'Action','Adventure','ScienceFiction' | Rating: 7.4

Recommended Movies: 

Avengers: Age of Ultron | Genres: 'Action','Adventure','ScienceFiction' | Rating: 7.3
Captain America: Civil War | Genres: 'Action','Adventure','ScienceFiction' | Rating: 7.1
Captain America: The Winter Soldier | Genres: 'Action','Adventure','ScienceFiction' | Rating: 7.6
Serenity | Genres: 'Action','Adventure','ScienceFiction','Thriller' | Rating: 7.4
Iron Man 2 | Genres: 'Action','Adventure','ScienceFiction' | Rating: 6.6
Captain America: The First Avenger | Genres: 'Action','Adventure','ScienceFiction' | Rating: 6.6
Ant-Man | Genres: 'Action','Adventure','ScienceFiction' | Rating: 7.0
Iron Man | Genres: 'Action','Adventure','ScienceFiction' | Rating: 7.4
Iron Man 3 | Genres: 'Action','Adventure','ScienceFiction' | Rating: 6.8
Thor: The Dark World | Genres: 'Action','Adventure','Fantasy' | Rating: 6.8

Average rating from recommended movies: 7.06


It is evident that our recommender has been successful in capturing more information due to more metadata and has given us better recommendations than the Graph-based one. As shown in the examples above, for "Pirates of the Caribbean: At World's End" the recommendations using the naive Graph-based recommender were limited, but here we can notice that the recommender suggested not just the other three parts from the series of the "Pirates of the Caribbean" and mostly movies starring Johnny Depp, but new  ones which correlate with the new given features, which confirms that the results are better using this model. We can modify the described model even further by adding even more metadata or by chaning the weights in the sum for the total cosine similarity between two movies.

However, our content based recommender suffers from some severe limitations. It is only capable of suggesting movies which are close to a certain movie based on the set of features we provide. It is not capable of capturing preferences and providing recommendations across genres but only to similar ones. Also, this engine is not user oriented, in that it doesn't capture the personal tastes and biases of a user. Anyone querying our engine for recommendations based on a movie will receive the same recommendations for that movie, regardless of who the user is, as shown above in the examples.

Other way of building this content based recommender is through graphs, namely by constructing four different graphs based on our selected features. These graphs should be build by calculating a chosen similarity metric (cosine similarity in our case) between every movie. Then similarly as in our case, K best neighbors will be calculated by taking a weighted sum over the cosine similarity metric for a given movie from each graph. We believe that the results we gained with our approach will be very close to the alternative graph method.

# Collaborative - based recommender

The content based recommender system was not capable of capturing user preferences or providing recommendations across genres. Therefore, in this section, we will use a technique called Collaborative Filtering to make movie recommendations. 

Collaborative filtering methods are based on collecting and analyzing a large amount of information on users’ behaviors, activities or preferences and predicting what users will like based on their similarity to other users. Basically there are two types of collaborative filtering: user based and item based. User based systems recommend products to a user that similar users have liked. Item Based systems recommend items based on their similarity with the items that the target user rated, instead of measuring the similarity between users. 

We will use the item based collaborative recommender system to recommend movies for a user by taking user data for movie rankings. For measuring the similarity between two users we use again the cosine similarity. However, this model has issues with scalabillity, where the computation grows with both the users and the ratings and with sparsity. One way to handle the scalability and sparsity issue created by this model is to leverage a latent factor model to capture the similarity between users and items. We will use SVD, which decreases the dimension of the utility matrix by extracting its latent factors. Essentially, we will map each user and each item into a latent space with dimension r. We will do inference on the model by predicting the rating for movies given a user using Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) as a metrics.

First, we downloaded two additional dataframes. One which contains user data with ratings and the second one which gives the connections between movie ids from imdb and tmdb. We merge these two dataframes with our final network dataframe and thus create the dataframe with user data and movies from our network.

In [42]:
ratings = pd.read_csv('data/ratings_small.csv')
links = pd.read_csv('data/links.csv')
movies = pd.read_csv('data/tmdb_5000_movies.csv')
credits = pd.read_csv('data/tmdb_5000_credits.csv')

In [43]:
merged = ratings.merge(links, how='inner',on='movieId')

In [44]:
movies_names_id = movies[['original_title', 'id']]

In [45]:
final = movies_names_id.merge(merged,how='inner',left_on='id',right_on='tmdbId')

In [46]:
final = final[['id','original_title','userId','rating']]

In [47]:
filter_movies = list(final_project_df.movie_id.values)

In [48]:
ratings = final[final['id'].isin(filter_movies)]

In [49]:
ratings.head()

Unnamed: 0,id,original_title,userId,rating
0,19995,Avatar,15,4.0
1,19995,Avatar,26,3.5
2,19995,Avatar,31,4.0
3,19995,Avatar,48,4.5
4,19995,Avatar,72,2.0


Next, we run the SVD on k=5 folds with RMSE and MAE as evaluation metrics and fit the training set.

In [50]:
reader = Reader()

In [51]:
data = Dataset.load_from_df(ratings[['userId', 'id', 'rating']], reader)
data.split(n_folds=5)

In [52]:
svd = SVD()
_ = evaluate(svd, data, measures=['RMSE', 'MAE'])

Evaluating RMSE, MAE of algorithm SVD.

------------
Fold 1
RMSE: 0.8971
MAE:  0.6872
------------
Fold 2
RMSE: 0.8703
MAE:  0.6689
------------
Fold 3
RMSE: 0.8761
MAE:  0.6731
------------
Fold 4
RMSE: 0.8704
MAE:  0.6677
------------
Fold 5
RMSE: 0.8823
MAE:  0.6738
------------
------------
Mean RMSE: 0.8792
Mean MAE : 0.6741
------------
------------


In [53]:
training_set = data.build_full_trainset()
svd.fit(training_set)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x28dbcfa6358>

Now that our model is trained, let us pick a user with user Id 100 and check the given ratings. We create a function to predict the ratings that this user would give to the movies he already gave a rating and compare the absolute and relative error the model is making.

In [54]:
def user_predict_error(user_id):
    """
    Predicts ratings from other users for each movie that the user has rated and calculates average rating error.
    """
    rated_movies = list(ratings[ratings['userId'] == user_id].id.values)
    if rated_movies == []:
        print('Wrong user id entered! \n')
        return
    errors = []
    predictions= []
    for movie_id in rated_movies:
        true_rating = ratings[(ratings['userId'] == user_id) & (ratings['id'] == movie_id)].rating.values[0]
        error = abs(true_rating - svd.predict(user_id,movie_id).est)
        predictions.append(svd.predict(user_id,movie_id).est)
        errors.append(error)
    avg_error = np.mean(errors)
    return avg_error, predictions

In [55]:
user100_error, user100_predictions = user_predict_error(100)

In [56]:
sample = ratings[ratings['userId'] == 100]
sample['predicted_rating'] = user100_predictions
print("User ID 100 - Average absolute rating error: {:.2f}\n  \
       Average relative rating error: {:.2f} %".format(user100_error, user100_error/5*100))
sample

User ID 100 - Average absolute rating error: 0.31
         Average relative rating error: 6.19 %


Unnamed: 0,id,original_title,userId,rating,predicted_rating
7336,9268,Eraser,100,3.0,3.28067
10236,664,Twister,100,3.0,3.133251
11301,954,Mission: Impossible,100,3.0,3.383961
12930,602,Independence Day,100,3.0,3.478147
13421,9802,The Rock,100,3.0,3.623135
16532,9208,Broken Arrow,100,3.0,3.106994
24163,199,Star Trek: First Contact,100,4.0,3.758913
31159,9294,Phenomenon,100,4.0,3.544949
32122,862,Toy Story,100,4.0,3.845779
60567,451,Leaving Las Vegas,100,4.0,3.762114


We can indeed confirm that the model is working quite well. 

In addition we will calculate the total average absolute and relative rating errors for all users.

In [57]:
def total_error(ratings):
    """
    Predicts ratings for each movie that every user has rated and calculates total average rating error.
    """
    user_list = list(set(ratings.userId.values))
    total_error = []
    for user in user_list:
        total_error.append(user_predict_error(user)[0])
    avg_total_error = np.mean(total_error)
    return avg_total_error

In [58]:
print("Average absolute rating error: {:.2f}\nAverage relative rating error: {:.2f} %"\
      .format(total_error(ratings), total_error(ratings)/5*100))

Average absolute rating error: 0.53
Average relative rating error: 10.65 %


Overall, the model makes a 10% relative rating error which is impressive. 

This leads to the final part where we use the predicted ratings to make a movie recommendation for a user. The function below suggests 10 movies for a given user, based on predicted ratings for every movie that the particular user has not rated, sorted in descending order.

In [59]:
def recommend(user_id):
    """
    Recommend 10 movies for a given user id based on rating prediction from other users.
    """
    # Avoid duplicates
    unrated_movies = list(set(ratings[ratings['userId'] != user_id].id.values))
    names = list(set(ratings[ratings['userId'] != user_id].original_title.values))
    predicted_ratings = []
    for movie_id in unrated_movies:
        predicted_ratings.append(svd.predict(user_id,movie_id).est)
        
    dataframe = pd.DataFrame()
    dataframe['movie_id'] = unrated_movies
    dataframe['rating'] = predicted_ratings
    dataframe['name'] = names
    sorted_df = dataframe.sort_values(by='rating',ascending=False).head(10)
    return sorted_df

In [60]:
ratings[ratings['userId'] == 16].sort_values(by='rating',ascending=False)

Unnamed: 0,id,original_title,userId,rating
52806,489,Good Will Hunting,16,5.0
29144,782,Gattaca,16,5.0
50461,2105,American Pie,16,5.0
60356,153,Lost in Translation,16,4.5
6538,180,Minority Report,16,4.5
12498,453,A Beautiful Mind,16,4.5
57744,629,The Usual Suspects,16,4.5
32049,2770,American Pie 2,16,4.5
38125,278,The Shawshank Redemption,16,4.0
52384,1587,What's Eating Gilbert Grape,16,4.0


User with ID 16, likes mostly comedies and action movies, as shown above. 

The recommender system suggests mostly comedies and action movies as well, as shown below.  

In [61]:
recommend(16)

Unnamed: 0,movie_id,rating,name
255,603,4.76164,Identity
30,77,4.669862,Crimson Tide
58,106646,4.65169,Sliding Doors
1410,152601,4.638528,Sherlock Holmes
276,629,4.637297,Inside Out
180,423,4.631893,Hugo
17,38,4.625028,While We're Young
126,278,4.620823,Elektra
1397,242582,4.609624,JFK
365,807,4.607771,The Fugitive


One startling feature of this recommender system is that it does not impact which movie it is by genre or crew and cast. It overcomes the problem of suggesting movies which are close to a certain movie based on the set of features we provide as the previous Content-based system. This engine is strictly user oriented, in a way that represents the personal tastes of a user. It works purely on the basis of an assigned movie ID and tries to predict ratings based on how the other users have predicted the movie. The more user ratings data we have, the better. Once we have the predicted ratings for a given user, we sort them in descending order, take the first 10 and classify them as recommendations for our user. 

Based on this, it is worth mentioning that the model can be used solely for ratings prediction for a specific user for all movies in the database.