# Week 4 Project: Similarity-Based Recommendation System

## Part 1: Data Description

The TMDB 5000 Movies dataset, obtained from https://www.kaggle.com/tmdb/tmdb-movie-metadata, is used to implement the recommendation system.

This dataset was generated from The Movie Database (TMDb) API. It was published on Kaggle so that the public could look for a consistent formula for predicting a movie's success before its release. It replaces the original dataset on Kaggle, which was swapped out at IMDB's request.

## Part 2: Setting up the Data

A content-based system can only suggest movies that are close to a given movie, so it fails to capture users' preferences and cannot provide appropriate recommendations across genres. Collaborative filtering is therefore employed to overcome this limitation.

In [2]:
from collections import defaultdict
import numpy as np
import pandas as pd
import scipy.optimize
import csv
import scipy
import random

credits_path = "./TMDb5000movies_dataset/tmdb_5000_credits.csv"
movies_path = "./TMDb5000movies_dataset/tmdb_5000_movies.csv"
df1 = pd.read_csv(credits_path)
df2 = pd.read_csv(movies_path)
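# Rename the credits columns so the join key is called 'id' in both tables
df1.columns = ['id', 'title', 'cast', 'crew']
# Both tables have a 'title' column, so the merge yields title_x and title_y
df = df2.merge(df1, on='id')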
df.head(3)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title_x,vote_average,vote_count,title_y,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...",...,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
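
The `title_x` and `title_y` columns in the output above come from the merge: both tables carry a `title` column that is not part of the join key. A minimal sketch of that behaviour, using two made-up toy frames (the names `movies` and `credits` and all values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy stand-ins for the movies and credits tables (values are illustrative only).
movies = pd.DataFrame({"id": [1, 2], "title": ["A", "B"], "budget": [10, 20]})
credits = pd.DataFrame({"id": [1, 2], "title": ["A", "B"], "cast": ["x", "y"]})

# Both frames have a non-key 'title' column, so merge disambiguates them
# with the default suffixes '_x' (left frame) and '_y' (right frame).
merged = movies.merge(credits, on="id")
print(list(merged.columns))
```

If the duplicate is not wanted, one option is to drop `title` from one frame before merging.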


## Part 3: Finding Similarities

The columns we want to work with are `popularity`, `vote_average`, and `vote_count`, with `id` kept as the key. None of these columns has missing values (see the `df.info()` output below), so no handling of missing data is required.

In [3]:
var_of_interests = ['id', 'popularity', 'vote_average', 'vote_count']
df = df[var_of_interests]
df.head()

       id  popularity  vote_average  vote_count
0   19995  150.437577           7.2       11800
1     285  139.082615           6.9        4500
2  206647  107.376788           6.3        4466
3   49026   112.31295           7.6        9106
4   49529   43.926995           6.1        2124


In [4]:
print(df.shape)
df.info()

(4803, 4)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4803 entries, 0 to 4802
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   id            4803 non-null   int64  
 1   popularity    4803 non-null   float64
 2   vote_average  4803 non-null   float64
 3   vote_count    4803 non-null   int64  
dtypes: float64(2), int64(2)
memory usage: 187.6 KB
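
The non-null counts above show that no values are missing. The same check can be done directly by counting nulls per column. A minimal sketch on a toy frame (the frame `toy` and its values are illustrative stand-ins for `df`):

```python
import pandas as pd

# Toy frame with the same column names as the notebook's `df` (values illustrative).
toy = pd.DataFrame({
    "id": [19995, 285, 206647],
    "popularity": [150.4, 139.1, 107.4],
    "vote_average": [7.2, 6.9, 6.3],
    "vote_count": [11800, 4500, 4466],
})

# Count missing values per column; all zeros means nothing needs imputing.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```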


### Functions to find Similarities

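Set up a Jaccard similarity function and a function that returns the items most similar to a given item. Here `usersPerItem` maps each item to the set of users who rated it, so two items are similar when many of the same users rated both.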

In [6]:
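def jaccard(s1, s2):
    # Jaccard similarity: |intersection| / |union|
    numerator = len(s1.intersection(s2))
    denominator = len(s1.union(s2))
    if denominator == 0:
        # Both sets are empty; define the similarity as 0 to avoid dividing by zero
        return 0
    return numerator / denominator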
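
As a quick sanity check of the Jaccard similarity, here is a small worked example. The user sets are made up, and returning 0 for two empty sets is an assumption added in this sketch:

```python
def jaccard(s1, s2):
    """Jaccard similarity: size of the intersection over size of the union."""
    union = s1 | s2
    if not union:  # assumption: treat two empty sets as having similarity 0
        return 0.0
    return len(s1 & s2) / len(union)

# Hypothetical sets of users who rated two different items.
users_a = {"u1", "u2", "u3"}
users_b = {"u2", "u3", "u4"}

# Intersection is {u2, u3} (size 2); union is {u1, u2, u3, u4} (size 4).
similarity = jaccard(users_a, users_b)
print(similarity)  # 0.5
```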

In [24]:
def most_similar(prod_id, num_prod):
    # Rank all other items by the Jaccard similarity of their user sets
    sim_list = []
    users = usersPerItem[prod_id]
    for other_id in usersPerItem:
        if other_id == prod_id: continue
        sim = jaccard(users, usersPerItem[other_id])
        sim_list.append((sim, other_id))
    sim_list.sort(reverse=True)
    return sim_list[:num_prod]
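
To illustrate what `most_similar` returns, the sketch below runs the same logic on a tiny hypothetical `usersPerItem` mapping. The item and user identifiers are invented for the example:

```python
def jaccard(s1, s2):
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0

# Hypothetical item -> set-of-users mapping.
usersPerItem = {
    "itemA": {"u1", "u2", "u3"},
    "itemB": {"u2", "u3"},
    "itemC": {"u3", "u4", "u5"},
    "itemD": {"u6"},
}

def most_similar(prod_id, num_prod):
    # Score every other item against prod_id, then keep the top num_prod.
    users = usersPerItem[prod_id]
    scored = [(jaccard(users, others), other_id)
              for other_id, others in usersPerItem.items()
              if other_id != prod_id]
    scored.sort(reverse=True)
    return scored[:num_prod]

top = most_similar("itemA", 2)
print(top)
```

The result is a list of `(similarity, item)` pairs in descending order of similarity; here `itemB` ranks first because it shares two of its users with `itemA`.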

## Part 4: Implement Recommender System

The similarity-based recommender is used to predict a user's future ratings from their past ratings. A user's rating for an item is assumed to be a weighted average of their previous ratings, where each weight is the similarity between the query item and a previously rated item.
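
Written out, the prediction computed by `predictRating` below can be expressed as follows. The notation is introduced here for illustration and is not from the original notebook: $I_u$ is the set of items rated by user $u$, $R_{u,j}$ is the rating user $u$ gave item $j$, and $\mathrm{Sim}$ is the Jaccard similarity between the user sets of two items.

$$
r(u, i) = \frac{\sum_{j \in I_u \setminus \{i\}} R_{u,j} \, \mathrm{Sim}(i, j)}{\sum_{j \in I_u \setminus \{i\}} \mathrm{Sim}(i, j)}
$$

When the denominator is zero (the user has rated no item similar to $i$), the function falls back to the global mean rating.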

In [12]:
reviewsPerUser = defaultdict(list)
reviewsPerItem = defaultdict(list)
usersPerItem = defaultdict(set)  # item -> set of users who rated it

for d in dataset:
    user,item = d['customer_id'], d['product_id']
    reviewsPerUser[user].append(d)
    reviewsPerItem[item].append(d)
    usersPerItem[item].add(user)
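
The prediction function relies on a global mean rating, `ratingMean`, whose definition is not shown in this section. A minimal sketch, assuming it is the plain average of `star_rating` over all records (the three records below are hypothetical):

```python
# Hypothetical records with the same 'star_rating' field as the notebook's dataset.
dataset = [
    {"customer_id": "u1", "product_id": "p1", "star_rating": 5},
    {"customer_id": "u2", "product_id": "p1", "star_rating": 3},
    {"customer_id": "u1", "product_id": "p2", "star_rating": 4},
]

# Global average rating, used as the fallback prediction.
ratingMean = sum(d["star_rating"] for d in dataset) / len(dataset)
print(ratingMean)  # 4.0
```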

Now that we have calculated the average rating of the dataset as a whole (`ratingMean`), we implement a function that predicts a rating for a given user and item.

In [14]:
def predictRating(user,item):
    ratings = []
    similarities = []
    for d in reviewsPerUser[user]:
        i2 = d['product_id']
        if i2 == item: continue
        ratings.append(d['star_rating'])
        similarities.append(jaccard(usersPerItem[item],usersPerItem[i2]))
    if (sum(similarities) > 0):
        weightedRatings = [(x*y) for x,y in zip(ratings,similarities)]
        return sum(weightedRatings) / sum(similarities)
    else:
        # User hasn't rated any similar items
        return ratingMean

In [15]:
dataset[10]

{'marketplace': 'US',
 'customer_id': '8926809',
 'review_id': 'R3B3UHK1FO0ERS',
 'product_id': 'B004774IPU',
 'product_parent': '151985175',
 'product_title': "Sid Meier's Civilization V",
 'product_category': 'Digital_Video_Games',
 'star_rating': 1,
 'helpful_votes': 0,
 'total_votes': 0,
 'vine': 'N',
 'verified_purchase': 'N',
 'review_headline': "I am still playing Civ 4 and love it. It's a shame because I'm ready for ...",
 'review_body': "As has been written by so many others, I quickly lost interest in this game. I am still playing Civ 4 and love it. It's a shame because I'm ready for an expanded version of Civ 4 and have waited for about a decade for a better version of it. Civ 5 was not an evolution but a total rewrite and it lost all that was good in Civ 4. I really hope that when Civ 6 comes out they use Civ 4 as the starting point and forget Civ 5 ever happened. Failing that there is a place in the market for a strategy game that involves building a civilisation.",
 'revi

In [25]:
# Predict the rating for the user and item of the review at index 10
user,item = dataset[10]['customer_id'], dataset[10]['product_id']
predictRating(user, item)

3.8531262248076406

In this case the user hasn't rated any similar items, so the function falls back to the dataset's mean rating. Let's try another example with a user who has.

In [26]:
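# Predict the rating for the user and item of the review at index 12;
# the result should differ from the mean-rating fallback above
user,item = dataset[12]['customer_id'], dataset[12]['product_id']
predictRating(user, item)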

4.305965957081731

## Part 5: Evaluating Performance

In [27]:
def MSE(predictions, labels):
    differences = [(x-y)**2 for x,y in zip(predictions,labels)]
    return sum(differences) / len(differences)

To evaluate the performance of our model, we need two things:
1. A baseline list that predicts the average rating (`ratingMean`) for every review (`alwaysPredictMean`)
2. A list of our predicted ratings, produced by the `predictRating` function (`cfPredictions`)

Finally, we compare each of these lists with the actual star ratings in our dataset.
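
The cell that builds `alwaysPredictMean` and `cfPredictions` is not shown here. The sketch below is one plausible construction, run end to end on a small hypothetical dataset; the records are invented, and the helper definitions are repeated only so the block is self-contained:

```python
from collections import defaultdict

# Hypothetical toy records; field names follow the notebook's dataset.
dataset = [
    {"customer_id": "u1", "product_id": "p1", "star_rating": 5},
    {"customer_id": "u1", "product_id": "p2", "star_rating": 4},
    {"customer_id": "u2", "product_id": "p1", "star_rating": 4},
    {"customer_id": "u2", "product_id": "p2", "star_rating": 5},
    {"customer_id": "u3", "product_id": "p3", "star_rating": 1},
]

usersPerItem = defaultdict(set)
reviewsPerUser = defaultdict(list)
for d in dataset:
    usersPerItem[d["product_id"]].add(d["customer_id"])
    reviewsPerUser[d["customer_id"]].append(d)

ratingMean = sum(d["star_rating"] for d in dataset) / len(dataset)

def jaccard(s1, s2):
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0

def predictRating(user, item):
    ratings, similarities = [], []
    for d in reviewsPerUser[user]:
        i2 = d["product_id"]
        if i2 == item:
            continue  # skip the item being predicted
        ratings.append(d["star_rating"])
        similarities.append(jaccard(usersPerItem[item], usersPerItem[i2]))
    if sum(similarities) > 0:
        return sum(r * s for r, s in zip(ratings, similarities)) / sum(similarities)
    return ratingMean  # no similar items: fall back to the global mean

def MSE(predictions, labels):
    return sum((x - y) ** 2 for x, y in zip(predictions, labels)) / len(labels)

labels = [d["star_rating"] for d in dataset]
# Baseline: the same mean rating predicted for every review.
alwaysPredictMean = [ratingMean for _ in dataset]
# Collaborative-filtering prediction for each (user, item) pair in the data.
cfPredictions = [predictRating(d["customer_id"], d["product_id"]) for d in dataset]
print(MSE(alwaysPredictMean, labels), MSE(cfPredictions, labels))
```

Note that this sketch evaluates on the same records used to build the similarity sets, so it measures fit rather than generalisation; a held-out split would be needed for the latter.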

In [30]:
labels = [d['star_rating'] for d in dataset]

print(MSE(alwaysPredictMean, labels), MSE(cfPredictions, labels))

2.371535478415058 2.3705596136412614


In this case, the accuracy of our rating prediction model (in terms of the MSE) was _nearly identical_ to that of simply predicting the mean rating. Note, however, that this is just a heuristic example, and the similarity function could be modified to change its predictions (e.g. by using a function other than the Jaccard similarity).
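
As one example of an alternative similarity function, the sketch below computes cosine similarity between two user sets treated as binary vectors. The function name `cosine_sets` is hypothetical, and whether it would improve the MSE on this dataset has not been tested here:

```python
from math import sqrt

def cosine_sets(s1, s2):
    """Cosine similarity between two sets treated as binary vectors."""
    if not s1 or not s2:  # assumption: similarity is 0 if either set is empty
        return 0.0
    return len(s1 & s2) / sqrt(len(s1) * len(s2))

# Hypothetical user sets for two items: one shared user, two users each.
print(cosine_sets({"u1", "u2"}, {"u2", "u3"}))  # 0.5
```

It could be swapped in for `jaccard` inside `predictRating` without other changes, since it takes the same two set arguments.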