<h2> Content Based Recommender System - Metafeatures </h2>

The goal of this notebook is to implement a content-based recommender system on the MovieLens 100k dataset.

The movie profile is based on the movie genres.

Two approaches are implemented. 

<b> Approach 1: </b>

The user profile is either a rating-weighted average of the profiles of the movies the user rated, or the average of the profiles of the movies the user liked (rating >= 3) minus the average of the profiles of the movies the user disliked, with a lower weight given to the disliked movies.

The recommended movies are those whose profiles are closest (e.g. by cosine similarity) to the user profile vector.

The implementation is based on this blog post [website]
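The liked/disliked variant of Approach 1 can be sketched on toy data as follows. This is a minimal illustration, not the notebook's implementation: the genre matrix, the ratings and the helper `cosine` are hypothetical, and `negative_weight` plays the role of `NEGATIVE_WEIGHT`.

```python
import numpy as np

# Hypothetical toy data: 4 movies x 3 genres (Action, Comedy, Drama).
movie_genres = np.array([
    [1, 0, 0],   # movie 0: Action
    [1, 0, 1],   # movie 1: Action, Drama
    [0, 1, 0],   # movie 2: Comedy
    [0, 0, 1],   # movie 3: Drama
], dtype=float)

ratings = np.array([5, 4, 1, np.nan])  # NaN marks an unrated movie
negative_weight = 0.25                 # assumed, same role as NEGATIVE_WEIGHT

rated = ~np.isnan(ratings)
# +1 for liked movies (rating >= 3), -negative_weight for the others.
signed = np.where(ratings >= 3, 1.0, -negative_weight)

# User profile: mean of the signed genre vectors over the rated movies only.
user_profile = (movie_genres[rated] * signed[rated, None]).mean(axis=0)

def cosine(a, b):
    """Cosine similarity between two 1-D vectors (0 if either is all zeros)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

scores = np.array([cosine(user_profile, m) for m in movie_genres])
ranking = np.argsort(scores)[::-1]  # movie indices, most similar first
```

With these toy ratings the Action/Drama movie ranks first and the disliked Comedy movie ranks last, because its genre receives a negative weight in the profile.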
    
<b> Approach 2: </b>

The similarity score between two movies is the similarity between their movie profiles, computed for every pair of movies.

The predicted rating a user will give to a candidate item is computed from the ratings the user gave to the K items most similar to the candidate. The recommended movies are those with the highest predicted rating.

The implementation is based on this post [website2]
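One way to write the prediction rule of Approach 2 is as a similarity-weighted mean. The notation below is introduced here for illustration and is not taken from the referenced post: $\hat{r}_{u,i}$ is the predicted rating of user $u$ for item $i$, $r_{u,j}$ is a rating the user actually gave, $\mathrm{sim}(i,j)$ is the similarity between the two movie profiles, and $N_K(i;u)$ is the set of the $K$ items most similar to $i$ among the items $u$ has rated.

```latex
\hat{r}_{u,i} \;=\;
\frac{\sum_{j \in N_K(i;u)} \mathrm{sim}(i,j)\, r_{u,j}}
     {\sum_{j \in N_K(i;u)} \mathrm{sim}(i,j)}
```

When the denominator is zero, meaning no rated neighbor has positive similarity, no prediction can be made and a default such as 0 has to be used.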

[website2]: https://www.kaggle.com/varian97/item-based-collaborative-filtering    

[website]: https://towardsdatascience.com/movie-recommendation-system-based-on-movielens-ef0df580cd0e

In [31]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics.pairwise import cosine_similarity

In [32]:
# BINARY_OPTION selects the rating column: True uses the binary (liked/disliked) rating, False uses the raw rating.
# Set it to True for the first approach and False for the second approach.
BINARY_OPTION = False
#NEGATIVE WEIGHT is relevant only for the first approach
NEGATIVE_WEIGHT = 0.25

<b> Data loading </b>

In [33]:
column_names = ['user_id', 'item_id', 'rating', 'timestamp']
folder = "./ml-100k/"
ratings = pd.read_csv(folder+'ua.base',sep='\t',names=column_names) 
# the file is tab-separated; read_csv defaults to ',' so sep must be passed explicitly
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90570 entries, 0 to 90569
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   user_id    90570 non-null  int64
 1   item_id    90570 non-null  int64
 2   rating     90570 non-null  int64
 3   timestamp  90570 non-null  int64
dtypes: int64(4)
memory usage: 2.8 MB


In [34]:
def brating(row):
    if row['rating'] >= 3:
        val = 1
    elif row['rating'] >=0:
        val = -NEGATIVE_WEIGHT
    else:
        val = row['rating']
    return val


ratings['binary_rating'] = ratings.apply(brating, axis=1)
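Because MovieLens ratings are integers from 1 to 5, the final `else` branch of `brating` is never reached, and the same mapping can be written without a row-wise `apply`. The sketch below uses a hypothetical toy frame `toy` rather than the real `ratings` table.

```python
import numpy as np
import pandas as pd

NEGATIVE_WEIGHT = 0.25  # same value as in the configuration cell

# Hypothetical toy ratings on the 1-5 MovieLens scale.
toy = pd.DataFrame({"rating": [5, 3, 2, 1, 4]})

# Ratings >= 3 become +1.0, lower ratings become -NEGATIVE_WEIGHT.
toy["binary_rating"] = np.where(toy["rating"] >= 3, 1.0, -NEGATIVE_WEIGHT)
```

The vectorized form evaluates the whole column at once, which is usually much faster than `apply(..., axis=1)` on a table of this size.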

In [35]:
ratings.head()

Unnamed: 0,user_id,item_id,rating,timestamp,binary_rating
0,1,1,5,874965758,1.0
1,1,2,3,876893171,1.0
2,1,3,4,878542960,1.0
3,1,4,3,876893119,1.0
4,1,5,3,889751712,1.0


In [36]:
item_col = ['item_id','movie title','release date','video release date','IMDb URL','unknown','Action','Adventure','Animation',
              'Children','Comedy','Crime','Documentary','Drama','Fantasy',
              'Film-Noir','Horror','Musical','Mystery','Romance','Sci-Fi','Thriller','War','Western']
movie_titles = pd.read_csv(folder+"u.item",sep='|',encoding='ISO-8859-1',names=item_col)
movie_titles.head()

Unnamed: 0,item_id,movie title,release date,video release date,IMDb URL,unknown,Action,Adventure,Animation,Children,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


<b> Movie profile </b>

The movie profile is based on the movie genres

In [37]:
movie_profile = movie_titles[['item_id','Action','Adventure','Animation',
              'Children','Comedy','Crime','Documentary','Drama','Fantasy',
              'Film-Noir','Horror','Musical','Mystery','Romance','Sci-Fi','Thriller','War','Western']].set_index('item_id')
movie_profile.sort_index(axis=0, inplace=True)

In [38]:
movie_profile.head()

item_id,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
1,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
4,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0


<b> User profile </b>

In [39]:
if BINARY_OPTION:
    rating_column = 'binary_rating'
else:
    rating_column = 'rating'

In [40]:
# user profile
user_x_movie = pd.pivot_table(ratings, values=rating_column, index=['item_id'], columns = ['user_id'])
user_x_movie.sort_index(axis=0, inplace=True)
userIDs = user_x_movie.columns
user_profile = pd.DataFrame(columns = movie_profile.columns)
user_profile

Unnamed: 0,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western


user_x_movie is the rating matrix. Rows are item_id, columns are user_id. Missing values are NaN
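A small, self-contained illustration of how `pd.pivot_table` builds such a matrix. The frame `toy_ratings` is hypothetical data used only for this sketch.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, item) pair.
toy_ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "item_id": [10, 20, 10, 30],
    "rating":  [5, 3, 4, 2],
})

# Rows are items, columns are users; pairs without a rating become NaN.
toy_matrix = pd.pivot_table(
    toy_ratings, values="rating", index=["item_id"], columns=["user_id"]
)
```

Only items that appear in the ratings table get a row, which is why `user_x_movie` has fewer rows than there are movies in `movie_titles`.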

In [41]:
user_x_movie

user_id,1,2,3,4,5,6,7,8,9,10,...,934,935,936,937,938,939,940,941,942,943
item_id,,,,,,,,,,,,,,,,,,,,,
1,5.0,4.0,,,,4.0,,,,4.0,...,2.0,3.0,4.0,,4.0,,,5.0,,
2,3.0,,,,,,,,,,...,4.0,,,,,,,,,5.0
3,4.0,,,,,,,,,,...,,,4.0,,,,,,,
4,3.0,,,,,,5.0,,,4.0,...,5.0,,,,,,2.0,,,
5,3.0,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1678,,,,,,,,,,,...,,,,,,,,,,
1679,,,,,,,,,,,...,,,,,,,,,,
1680,,,,,,,,,,,...,,,,,,,,,,
1681,,,,,,,,,,,...,,,,,,,,,,


In [17]:
user_profile.head()

Unnamed: 0,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western


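For each genre, the user profile is the sum of the ratings the user gave to movies of that genre, divided by the number of movies the user rated. Rated movies that do not belong to the genre contribute 0 to the average, and unrated movies are ignored.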
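The toy example below illustrates the difference between this definition and a per-genre average rating. The frames `genres` and `user_ratings` are hypothetical, and `genre_avg` is an alternative shown only for comparison, not what the notebook computes.

```python
import numpy as np
import pandas as pd

# Hypothetical genre flags for three movies.
genres = pd.DataFrame(
    {"Action": [1, 1, 0], "Drama": [0, 1, 1]},
    index=pd.Index([1, 2, 3], name="item_id"),
)
# The user rated movies 1 and 2; movie 3 is unrated.
user_ratings = pd.Series([4.0, 2.0, np.nan], index=genres.index)

# Rating in each genre the movie belongs to, 0 elsewhere, NaN rows if unrated.
weighted = genres.mul(user_ratings, axis=0)

# Profile as in the notebook: the mean skips NaN rows but counts the zeros.
profile = weighted.mean(axis=0)

# Alternative for comparison: average rating within each genre only.
genre_avg = weighted.where(genres == 1).mean(axis=0)
```

Here the Drama entry of `profile` is 1.0 although the only Drama movie the user rated received a 2.0, because the rated Action-only movie contributes a zero to the Drama average.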

In [18]:
for user_id in userIDs:
    # working_df: for each movie the user rated, the rating in every genre the movie
    # belongs to and 0 elsewhere; rows of unrated movies are NaN
    working_df = movie_profile.mul(user_x_movie[user_id], axis=0)
    # working_df.replace(0, np.nan, inplace=True)
    # user_profile: per-genre mean over all the movies the user rated (NaN rows are skipped)
    user_profile.loc[user_id] = working_df.mean(axis=0)

In [19]:
working_df.head()

item_id,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
1,,,,,,,,,,,,,,,,,,
2,5.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0
3,,,,,,,,,,,,,,,,,,
4,,,,,,,,,,,,,,,,,,
5,,,,,,,,,,,,,,,,,,


In [21]:
user_profile.head()

Unnamed: 0,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
1,0.912214,0.458015,0.141221,0.209924,1.156489,0.328244,0.091603,1.557252,0.026718,0.019084,0.171756,0.137405,0.068702,0.603053,0.637405,0.675573,0.351145,0.083969
2,0.557692,0.153846,0.076923,0.211538,0.884615,0.596154,0.0,2.307692,0.057692,0.173077,0.115385,0.057692,0.211538,1.115385,0.192308,0.769231,0.115385,0.0
3,0.568182,0.227273,0.0,0.0,0.613636,0.590909,0.113636,1.386364,0.0,0.113636,0.204545,0.090909,0.613636,0.272727,0.431818,0.886364,0.295455,0.0
4,1.571429,0.642857,0.0,0.0,0.714286,1.0,0.357143,1.071429,0.0,0.0,0.0,0.357143,0.928571,0.214286,0.785714,1.714286,0.285714,0.0
5,1.0,0.624242,0.29697,0.393939,1.424242,0.187879,0.0,0.412121,0.030303,0.030303,0.393939,0.242424,0.054545,0.266667,0.70303,0.278788,0.266667,0.030303


<b> TFIDF </b>

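In the movie profile we want to give a higher weight to rare genres. Each genre flag is therefore multiplied by the inverse document frequency of the genre, IDF = log(N / DF), where N is the number of movies and DF is the number of movies tagged with that genre. Since the genre flags are binary, the term frequency is either 0 or 1, so the TFIDF value of a genre a movie belongs to is simply the IDF of that genre.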
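A worked example of this weighting on a toy genre matrix. The frame `toy_profile` is hypothetical; the computation mirrors the notebook's, using the natural logarithm.

```python
import numpy as np
import pandas as pd

# Hypothetical genre flags: Drama is common, Western is rare.
toy_profile = pd.DataFrame(
    {"Drama": [1, 1, 1, 0], "Western": [0, 0, 1, 0]},
    index=pd.Index([1, 2, 3, 4], name="item_id"),
)

doc_freq = toy_profile.sum()               # movies per genre: Drama 3, Western 1
idf = np.log(len(toy_profile) / doc_freq)  # log(4/3) for Drama, log(4) for Western
toy_tfidf = toy_profile.mul(idf, axis=1)   # scale each genre column by its IDF
```

The rare genre gets the larger weight, log(4) ≈ 1.39 versus log(4/3) ≈ 0.29, so a match on Western counts for more than a match on Drama when profiles are compared.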

In [22]:
# TFIDF
df = movie_profile.sum()
idf = (len(movie_titles)/df).apply(np.log) #log inverse of DF
TFIDF = movie_profile.mul(idf.values)

In [23]:
TFIDF.head()

item_id,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
1,0.0,0.0,3.690069,2.623718,1.20318,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.902286,2.522464,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.902286,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.902286,0.0,0.0
4,1.902286,0.0,0.0,0.0,1.20318,0.0,0.0,0.841567,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,2.736391,0.0,0.841567,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.902286,0.0,0.0


Calculate the user profile using the TFIDF representation of the items.

In [24]:
# user profile
user_x_movie = pd.pivot_table(ratings, values=rating_column, index=['item_id'], columns = ['user_id'])
user_x_movie.sort_index(axis=0, inplace=True)
userIDs = user_x_movie.columns
user_profile_TFIDF = pd.DataFrame(columns = movie_profile.columns)

In [25]:
for user_id in userIDs:
    # working_df: for each movie the user rated, the rating times the TFIDF weight of every
    # genre the movie belongs to and 0 elsewhere; rows of unrated movies are NaN
    working_df = TFIDF.mul(user_x_movie[user_id], axis=0)
    # working_df.replace(0, np.nan, inplace=True)
    # user_profile_TFIDF: per-genre mean over all the movies the user rated (NaN rows are skipped)
    user_profile_TFIDF.loc[user_id] = working_df.mean(axis=0)

In [26]:
user_profile_TFIDF.head()

Unnamed: 0,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
1,1.735291,1.155327,0.521117,0.55078,1.391464,0.898205,0.32205,1.310532,0.115866,0.081101,0.499114,0.467504,0.227876,1.156868,1.792776,1.285132,1.111395,0.346954
2,1.06089,0.388071,0.283851,0.555017,1.064352,1.63131,0.0,1.942078,0.250194,0.735522,0.335302,0.196292,0.701645,2.139699,0.540888,1.463297,0.365199,0.0
3,1.080844,0.573287,0.0,0.0,0.738315,1.616958,0.399513,1.166718,0.0,0.482919,0.594399,0.309308,2.035349,0.523187,1.21454,1.686117,0.935131,0.0
4,2.989306,1.621584,0.0,0.0,0.859415,2.736391,1.255613,0.901679,0.0,0.0,0.0,1.215138,3.079946,0.411075,2.209914,3.261062,0.904303,0.0
5,1.902286,1.574629,1.095839,1.033586,1.713621,0.51411,0.0,0.346828,0.131415,0.128778,1.144768,0.824821,0.18092,0.51156,1.977356,0.530334,0.844016,0.125209


<b> Recommend movies to user (Approach 1) </b>

The items recommended to a user are the items with the highest cosine similarity to the user profile vector.
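The ranking step can be sketched with plain NumPy on toy profiles. Everything here is hypothetical: `cosine_matrix` is a hand-written stand-in for `sklearn.metrics.pairwise.cosine_similarity`, and the `already_rated` mask is an assumed variation that excludes items the user has rated before ranking.

```python
import numpy as np

def cosine_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a_norm = np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = np.linalg.norm(b, axis=1, keepdims=True)
    # Replace zero norms by 1 so all-zero rows yield similarity 0 instead of NaN.
    a_unit = a / np.where(a_norm == 0, 1.0, a_norm)
    b_unit = b / np.where(b_norm == 0, 1.0, b_norm)
    return a_unit @ b_unit.T

# Hypothetical profiles over two genres.
user_profiles = np.array([[3.0, 0.5], [0.2, 2.0]])               # 2 users
item_profiles = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])   # 3 items
already_rated = np.array([[True, False, False],
                          [False, False, True]])

sim = cosine_matrix(user_profiles, item_profiles)   # shape (2, 3)
sim_masked = np.where(already_rated, -np.inf, sim)  # push rated items to the bottom
top_item = np.argsort(sim_masked, axis=1)[:, ::-1][:, :1]  # best unrated item per user
```

Without the mask, each user's top item would be the one they have already rated, which is why a filter of this kind is often added on top of a pure similarity ranking.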

In [27]:
# recommendation prediction
use_TFIDF = True
if use_TFIDF:
    cosine_similarity_user_item =cosine_similarity(user_profile_TFIDF,TFIDF)
else:
    cosine_similarity_user_item =cosine_similarity(user_profile,movie_profile)

In [28]:
cosine_similarity_user_item.shape

(943, 1682)

In [29]:
def predict_most_similar_items_per_user(user_id,num_items=10):
    result = np.argsort(cosine_similarity_user_item[user_profile.index.get_loc(user_id),:])[::-1][:num_items]
    ret_result = [movie_profile.index[i] for i in result]
    return ret_result

Testing

The goal of this section is to serve as a 'sanity check'. We expect that users who give high ratings to specific genres will be recommended movies that belong to those preferred genres.

In [None]:
user_profile.head(10)

Unnamed: 0,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
1,0.912214,0.458015,0.141221,0.209924,1.156489,0.328244,0.091603,1.557252,0.026718,0.019084,0.171756,0.137405,0.068702,0.603053,0.637405,0.675573,0.351145,0.083969
2,0.557692,0.153846,0.076923,0.211538,0.884615,0.596154,0.0,2.307692,0.057692,0.173077,0.115385,0.057692,0.211538,1.115385,0.192308,0.769231,0.115385,0.0
3,0.568182,0.227273,0.0,0.0,0.613636,0.590909,0.113636,1.386364,0.0,0.113636,0.204545,0.090909,0.613636,0.272727,0.431818,0.886364,0.295455,0.0
4,1.571429,0.642857,0.0,0.0,0.714286,1.0,0.357143,1.071429,0.0,0.0,0.0,0.357143,0.928571,0.214286,0.785714,1.714286,0.285714,0.0
5,1.0,0.624242,0.29697,0.393939,1.424242,0.187879,0.0,0.412121,0.030303,0.030303,0.393939,0.242424,0.054545,0.266667,0.70303,0.278788,0.266667,0.030303
6,0.41791,0.358209,0.169154,0.303483,1.104478,0.258706,0.0199,1.791045,0.034826,0.139303,0.079602,0.228856,0.253731,0.711443,0.21393,0.427861,0.383085,0.089552
7,0.928753,0.590331,0.147583,0.371501,0.806616,0.3257,0.043257,1.541985,0.073791,0.10687,0.424936,0.274809,0.198473,0.536896,0.486005,0.75827,0.46056,0.167939
8,2.877551,1.183673,0.0,0.061224,0.326531,0.653061,0.0,1.244898,0.0,0.0,0.102041,0.0,0.061224,0.44898,1.306122,1.244898,0.77551,0.265306
9,1.25,1.25,0.0,0.0,1.833333,0.0,0.0,1.583333,0.0,0.0,0.416667,0.0,0.0,2.333333,0.75,0.666667,0.916667,0.0
10,0.614943,0.350575,0.166667,0.195402,1.017241,0.327586,0.068966,1.862069,0.022989,0.183908,0.155172,0.333333,0.367816,0.816092,0.252874,0.816092,0.505747,0.132184


In [None]:
user_profile_TFIDF.head(10)

Unnamed: 0,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
1,1.735291,1.155327,0.521117,0.55078,1.391464,0.898205,0.32205,1.310532,0.115866,0.081101,0.499114,0.467504,0.227876,1.156868,1.792776,1.285132,1.111395,0.346954
2,1.06089,0.388071,0.283851,0.555017,1.064352,1.63131,0.0,1.942078,0.250194,0.735522,0.335302,0.196292,0.701645,2.139699,0.540888,1.463297,0.365199,0.0
3,1.080844,0.573287,0.0,0.0,0.738315,1.616958,0.399513,1.166718,0.0,0.482919,0.594399,0.309308,2.035349,0.523187,1.21454,1.686117,0.935131,0.0
4,2.989306,1.621584,0.0,0.0,0.859415,2.736391,1.255613,0.901679,0.0,0.0,0.0,1.215138,3.079946,0.411075,2.209914,3.261062,0.904303,0.0
5,1.902286,1.574629,1.095839,1.033586,1.713621,0.51411,0.0,0.346828,0.131415,0.128778,1.144768,0.824821,0.18092,0.51156,1.977356,0.530334,0.844016,0.125209
6,0.794985,0.903569,0.624191,0.796253,1.328886,0.707922,0.069964,1.507285,0.151029,0.591996,0.231319,0.778656,0.841593,1.364797,0.601704,0.813913,1.212485,0.370021
7,1.766754,1.489088,0.54459,0.974714,0.970504,0.891242,0.152079,1.297684,0.320011,0.454165,1.234844,0.935007,0.658309,1.029954,1.366947,1.442446,1.457699,0.693907
8,5.473925,2.985774,0.0,0.160636,0.392875,1.787031,0.0,1.047665,0.0,0.0,0.296526,0.0,0.203073,0.8613,3.673624,2.368152,2.454536,1.096219
9,2.377857,3.15308,0.0,0.0,2.205831,0.0,0.0,1.332481,0.0,0.0,1.210813,0.0,0.0,4.476151,2.109464,1.268191,2.901304,0.0
10,1.169797,0.884312,0.615012,0.51268,1.223925,0.896404,0.242463,1.567056,0.099694,0.781551,0.450923,1.134129,1.219996,1.56555,0.711237,1.55244,1.600719,0.546171


Test 1: User 8 likes Action (2.88), Sci-Fi (1.31), Drama (1.24), Thriller (1.24) and Adventure (1.18)

Test 1 (TFIDF): User 8 likes Action (5.47), Sci-Fi (3.67), Adventure (2.99), War (2.45) and Thriller (2.37)

In [None]:
predict_most_similar_items_per_user(8)

[252, 636, 164, 831, 358, 271, 172, 50, 181, 82]

TFIDF is the movie profile matrix after TFIDF weighting.

In [None]:
TFIDF.loc[252]

Action         1.902286
Adventure      2.522464
Animation      0.000000
Children       0.000000
Comedy         0.000000
Crime          0.000000
Documentary    0.000000
Drama          0.000000
Fantasy        0.000000
Film-Noir      0.000000
Horror         0.000000
Musical        0.000000
Mystery        0.000000
Romance        0.000000
Sci-Fi         2.812618
Thriller       1.902286
War            0.000000
Western        0.000000
Name: 252, dtype: float64

In [None]:
TFIDF.loc[636]

Action         1.902286
Adventure      2.522464
Animation      0.000000
Children       0.000000
Comedy         0.000000
Crime          0.000000
Documentary    0.000000
Drama          0.000000
Fantasy        0.000000
Film-Noir      0.000000
Horror         0.000000
Musical        0.000000
Mystery        0.000000
Romance        0.000000
Sci-Fi         2.812618
Thriller       1.902286
War            0.000000
Western        0.000000
Name: 636, dtype: float64

In [None]:
TFIDF.loc[164]

Action         1.902286
Adventure      2.522464
Animation      0.000000
Children       0.000000
Comedy         0.000000
Crime          0.000000
Documentary    0.000000
Drama          0.000000
Fantasy        0.000000
Film-Noir      0.000000
Horror         0.000000
Musical        0.000000
Mystery        0.000000
Romance        0.000000
Sci-Fi         2.812618
Thriller       1.902286
War            0.000000
Western        0.000000
Name: 164, dtype: float64

Test 2: User 2 likes Drama (2.31), Romance (1.12) and Comedy (0.88)

In [None]:
predict_most_similar_items_per_user(2)


[1682, 378, 387, 1252, 1255, 1256, 1257, 1260, 1261, 1263]

In [None]:
movie_profile.loc[1682]

Action         0
Adventure      0
Animation      0
Children       0
Comedy         0
Crime          0
Documentary    0
Drama          1
Fantasy        0
Film-Noir      0
Horror         0
Musical        0
Mystery        0
Romance        0
Sci-Fi         0
Thriller       0
War            0
Western        0
Name: 1682, dtype: int64

In [None]:
movie_profile.loc[900]

Action         0
Adventure      0
Animation      0
Children       0
Comedy         0
Crime          0
Documentary    0
Drama          1
Fantasy        0
Film-Noir      0
Horror         0
Musical        0
Mystery        0
Romance        0
Sci-Fi         0
Thriller       0
War            0
Western        0
Name: 900, dtype: int64

<b> Recommend movies to user (Approach 2) </b>

The recommended items are the ones with the highest predicted rating. The predicted rating of a candidate item is computed from the ratings the user gave to the rated items that are most similar to it.
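A compact sketch of this prediction rule on toy arrays. The function `predict_from_neighbors` and its inputs are hypothetical and leave out the minimum-neighbor check used later in the notebook.

```python
import numpy as np

def predict_from_neighbors(similarities, user_ratings, k=2):
    """Similarity-weighted mean of the user's ratings on the k most similar rated items.

    similarities: similarity of each item to the candidate item.
    user_ratings: the user's ratings, where 0 means "not rated".
    """
    rated = user_ratings > 0
    # Positions of the k most similar items among the rated ones.
    order = np.argsort(similarities[rated])[::-1][:k]
    top_sim = similarities[rated][order]
    top_ratings = user_ratings[rated][order]
    if top_sim.sum() == 0:
        return 0.0  # no usable neighbor, so no prediction
    return float(top_sim @ top_ratings / top_sim.sum())

sims = np.array([0.9, 0.8, 0.5, 0.1])        # similarity to the candidate item
ratings_u = np.array([5.0, 0.0, 3.0, 1.0])   # the second item is unrated
pred = predict_from_neighbors(sims, ratings_u, k=2)
```

The unrated item is skipped even though it is the second most similar, so the prediction is (0.9·5 + 0.5·3) / (0.9 + 0.5) ≈ 4.29.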

In [43]:
user_x_movie_n = user_x_movie.copy()
user_x_movie_n.fillna(0, inplace=True)

In [44]:
user_x_movie_n

user_id,1,2,3,4,5,6,7,8,9,10,...,934,935,936,937,938,939,940,941,942,943
item_id,,,,,,,,,,,,,,,,,,,,,
1,5.0,4.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,4.0,...,2.0,3.0,4.0,0.0,4.0,0.0,0.0,5.0,0.0,0.0
2,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0
3,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,3.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,4.0,...,5.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0
5,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1678,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1679,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1680,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1681,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
# add cold-start movies (movies with no ratings in the training set) to user_x_movie_n
new_items = set(movie_titles.item_id) - set(ratings.item_id)
for item in new_items:
    user_x_movie_n.loc[item] = 0.0
user_x_movie_n.sort_index(inplace=True)
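The same cold-start padding can also be expressed with `DataFrame.reindex`, which adds the missing rows and keeps the requested order in one call. This is a sketch on a hypothetical toy matrix, not a change to the cell above.

```python
import pandas as pd

# Hypothetical rating matrix with rows only for items 1 and 3.
toy_matrix = pd.DataFrame(
    {1: [5.0, 0.0], 2: [0.0, 3.0]},
    index=pd.Index([1, 3], name="item_id"),
)
all_items = [1, 2, 3, 4]  # full catalogue, including unrated items 2 and 4

# Rows for items missing from the matrix are created and filled with 0.0.
full_matrix = toy_matrix.reindex(all_items, fill_value=0.0)
```

Applied to the notebook's data, the equivalent call would reindex `user_x_movie_n` on `movie_profile.index`.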

In [None]:
user_x_movie.shape

(1680, 943)

In [None]:
user_x_movie_n.shape

(1682, 943)

In [None]:
#Calculate movie-movie similarity
movie_sim_df = pd.DataFrame(cosine_similarity(movie_profile, movie_profile), index=movie_profile.index, columns=movie_profile.index)
movie_sim_df.head()

item_id,1,2,3,4,5,6,7,8,9,10,...,1673,1674,1675,1676,1677,1678,1679,1680,1681,1682
item_id,,,,,,,,,,,,,,,,,,,,,
1,1.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.666667,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.57735,0.0
2,0.0,1.0,0.57735,0.333333,0.333333,0.0,0.0,0.0,0.0,0.0,...,0.816497,0.0,0.0,0.0,0.0,0.0,0.408248,0.0,0.0,0.0
3,0.0,0.57735,1.0,0.0,0.57735,0.0,0.0,0.0,0.0,0.0,...,0.707107,0.0,0.0,0.0,0.0,0.0,0.707107,0.0,0.0,0.0
4,0.333333,0.333333,0.0,1.0,0.333333,0.57735,0.408248,0.666667,0.57735,0.408248,...,0.408248,0.57735,0.57735,0.57735,0.57735,0.57735,0.0,0.408248,0.57735,0.57735
5,0.0,0.333333,0.57735,0.333333,1.0,0.57735,0.408248,0.333333,0.57735,0.408248,...,0.408248,0.57735,0.57735,0.57735,0.57735,0.57735,0.408248,0.408248,0.0,0.57735


In [None]:
# Calculate user-user similarity (not used here, but can serve as a basis for a collaboration-via-content approach)
user_sim_df = pd.DataFrame(cosine_similarity(user_profile, user_profile), index=user_profile.index, columns=user_profile.index)
user_sim_df.head()

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,934,935,936,937,938,939,940,941,942,943
1,1.0,0.911481,0.908018,0.809669,0.843344,0.947947,0.973742,0.790195,0.879929,0.956627,...,0.981526,0.889031,0.96519,0.966781,0.957452,0.962849,0.987071,0.707501,0.941258,0.959096
2,0.911481,1.0,0.897913,0.694302,0.589617,0.957313,0.898124,0.605312,0.79458,0.960745,...,0.896082,0.845464,0.975581,0.950683,0.821146,0.889553,0.910086,0.436053,0.892038,0.8442
3,0.908018,0.897913,1.0,0.897797,0.660807,0.882541,0.920141,0.734967,0.714381,0.93157,...,0.868038,0.799062,0.907339,0.871623,0.871181,0.873332,0.888926,0.591,0.874328,0.87254
4,0.809669,0.694302,0.897797,1.0,0.698315,0.680812,0.829362,0.860469,0.633726,0.772275,...,0.7601,0.787456,0.734288,0.700567,0.878967,0.831271,0.764576,0.764853,0.777376,0.859781
5,0.843344,0.589617,0.660807,0.698315,1.0,0.718614,0.810537,0.730547,0.791778,0.711176,...,0.859016,0.709725,0.714165,0.738536,0.877032,0.793255,0.843354,0.865604,0.769791,0.851785


In [None]:
def get_similar_movie(movie_id):
    if movie_id not in movie_profile.index:
        return None, None
    # drop the movie itself explicitly rather than assuming it sorts first, which can
    # fail when another movie has an identical genre profile (similarity 1.0)
    sim = movie_sim_df[movie_id].drop(movie_id).sort_values(ascending=False)
    return sim.index, sim.tolist()

In [None]:
# predict the rating that user_id would give to movie_id
def predict_rating(user_id, movie_id, max_neighbor=10):
    movies, scores = get_similar_movie(movie_id)
    if movies is None:
        return 0.0
    movie_arr = np.array(movies)
    sim_arr = np.array(scores)

    # keep only the movies the user has already rated
    filtering = (user_x_movie_n[user_id].loc[movie_arr] > 0).to_numpy()
    top_sim = sim_arr[filtering][:max_neighbor]
    top_movies = movie_arr[filtering][:max_neighbor]

    # predicted score: similarity-weighted mean of the user's ratings on the nearest rated movies;
    # don't estimate the rating from fewer than 4 rated neighbors with positive similarity
    s = 0.0
    if np.sum(top_sim) > 0.0 and np.count_nonzero(sim_arr[filtering] > 0.0) > 3:
        s = np.dot(top_sim, user_x_movie_n[user_id].loc[top_movies]) / np.sum(top_sim)

    return s

In [None]:
# recommend the top movies for a user, based on similarity to the movies the user rated
def get_recommendation(user_id, n_movies=5):
    predicted_rating = np.array([predict_rating(user_id, _movie) for _movie in user_x_movie_n.index])

    # don't recommend something that the user has already rated
    temp = pd.DataFrame({'predicted': predicted_rating, 'movie_id': user_x_movie_n.index})
    filtering = (user_x_movie_n[user_id] == 0.0)
    temp = temp.loc[filtering.values].sort_values(by='predicted', ascending=False)

    # return the n_movies movies with the highest predicted rating
    return temp[:n_movies]

Test 1: User 8 likes Action (2.88), Sci-Fi (1.31), Drama (1.24), Thriller (1.24) and Adventure (1.18)

In [None]:
get_recommendation(8)

Unnamed: 0,predicted,movie_id
120,4.794528,121
30,4.530553,31
270,4.513776,271
1109,4.513766,1110
443,4.512635,444


In [None]:
movie_profile.loc[121]

Action         1
Adventure      0
Animation      0
Children       0
Comedy         0
Crime          0
Documentary    0
Drama          0
Fantasy        0
Film-Noir      0
Horror         0
Musical        0
Mystery        0
Romance        0
Sci-Fi         1
Thriller       0
War            1
Western        0
Name: 121, dtype: int64

In [None]:
movie_profile.loc[271]

Action         1
Adventure      1
Animation      0
Children       0
Comedy         0
Crime          0
Documentary    0
Drama          0
Fantasy        0
Film-Noir      0
Horror         0
Musical        0
Mystery        0
Romance        0
Sci-Fi         1
Thriller       0
War            1
Western        0
Name: 271, dtype: int64

Test 2: User 2 likes Drama (2.31), Romance (1.12) and Comedy (0.88)

In [None]:
get_recommendation(2)

Unnamed: 0,predicted,movie_id
1137,4.530053,1138
1646,4.50685,1647
1103,4.50685,1104
402,4.343513,403
654,4.327842,655


In [None]:
movie_profile.loc[1138]

Action         1
Adventure      0
Animation      0
Children       0
Comedy         1
Crime          1
Documentary    0
Drama          1
Fantasy        0
Film-Noir      0
Horror         0
Musical        0
Mystery        0
Romance        0
Sci-Fi         0
Thriller       0
War            0
Western        0
Name: 1138, dtype: int64

In [None]:
movie_profile.loc[1647]

Action         0
Adventure      0
Animation      0
Children       0
Comedy         1
Crime          1
Documentary    0
Drama          1
Fantasy        0
Film-Noir      0
Horror         0
Musical        0
Mystery        0
Romance        0
Sci-Fi         0
Thriller       0
War            0
Western        0
Name: 1647, dtype: int64

<b> Test similarity between movies </b>

In [None]:
movie_titles[movie_titles['movie title'].str.contains("Star Wars")]

Unnamed: 0,item_id,movie title,release date,video release date,IMDb URL,unknown,Action,Adventure,Animation,Children,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
49,50,Star Wars (1977),01-Jan-1977,,http://us.imdb.com/M/title-exact?Star%20Wars%2...,0,1,1,0,0,...,0,0,0,0,0,1,1,0,1,0


In [None]:
res_movie = get_similar_movie(50)[0]

In [None]:
res_movie

Int64Index([ 181,  172,  271,  498,  373,  449,  241,  230,  229,   62,
            ...
             737,  735,  734,  729,  726,  725,  723,  722,  721, 1682],
           dtype='int64', name='item_id', length=1681)

In [None]:
for x in res_movie:
    print(movie_titles[movie_titles['item_id'] == x]['movie title'])

180    Return of the Jedi (1983)
Name: movie title, dtype: object
171    Empire Strikes Back, The (1980)
Name: movie title, dtype: object
270    Starship Troopers (1997)
Name: movie title, dtype: object
497    African Queen, The (1951)
Name: movie title, dtype: object
372    Judge Dredd (1995)
Name: movie title, dtype: object
448    Star Trek: The Motion Picture (1979)
Name: movie title, dtype: object
240    Last of the Mohicans, The (1992)
Name: movie title, dtype: object
229    Star Trek IV: The Voyage Home (1986)
Name: movie title, dtype: object
228    Star Trek III: The Search for Spock (1984)
Name: movie title, dtype: object
61    Stargate (1994)
Name: movie title, dtype: object
227    Star Trek: The Wrath of Khan (1982)
Name: movie title, dtype: object
896    Time Tracers (1995)
Name: movie title, dtype: object
226    Star Trek VI: The Undiscovered Country (1991)
Name: movie title, dtype: object
221    Star Trek: First Contact (1996)
Name: movie title, dtype: object
81    Jurassi