# The MovieLens DataSet

We will be using the MovieLens dataset for this purpose. It was collected by the GroupLens Research Project at the University of Minnesota. The MovieLens 100K dataset can be downloaded from http://grouplens.org/datasets/movielens/100k/. It consists of:

* 100,000 ratings (1-5) from 943 users on 1682 movies.
* Each user has rated at least 20 movies.
* Simple demographic info for the users (age, gender, occupation, zip)
* Genre information of movies

Let’s load this data into Python. The ml-100k.zip archive contains many files; let’s load the three most important ones to get a sense of the data. I also recommend reading the README, which describes the different files in detail.

In [18]:
#load libraries
import pandas as pd
import numpy as np

In [2]:
#Reading users file:
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users = pd.read_csv(r'C:\Users\user\Desktop\program\Recommenadtion syste\ml-100k\u.user', sep='|', names=u_cols,  encoding='latin-1')

In [167]:
#Reading ratings file:
r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv(r'C:\Users\user\Desktop\program\Recommenadtion syste\ml-100k\u.data', sep='\t', names=r_cols,
 encoding='latin-1')

In [24]:
#Reading items file:
i_cols = ['movie_id', 'movie title' ,'release date','video release date', 'IMDb URL', 'unknown', 'Action', 'Adventure',
 'Animation', 'Children\'s', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy',
 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']
items = pd.read_csv(r'C:\Users\user\Desktop\program\Recommenadtion syste\ml-100k\u.item', sep='|', names=i_cols,
 encoding='latin-1')

In [9]:
print(users.shape)

(943, 5)


In [5]:
users.head()

Unnamed: 0,user_id,age,sex,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213


This confirms that there are 943 users with 5 features each: unique ID, age, gender, occupation and the zip code they live in.

# Ratings

In [6]:
print (ratings.shape)

(100000, 4)


In [123]:
ratings.head()

Unnamed: 0,user_id,movie_id,rating,unix_timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


This confirms that there are 100K ratings for different user and movie combinations. Also notice that each rating has a timestamp associated with it.

# Items

In [8]:
print (items.shape)

(1682, 24)


In [43]:
items.head()

Unnamed: 0,movie_id,movie title,release date,video release date,IMDb URL,unknown,Action,Adventure,Animation,Children's,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


This dataset contains the attributes of the 1682 movies. Of its 24 columns, the last 19 are genre indicators: a value of 1 means the movie belongs to that genre and 0 means it does not.

Now we have to divide the ratings data into train and test sets for building models. Luckily, GroupLens provides a pre-divided split in which the test set has 10 ratings for each user, i.e. 9430 rows in total. Let’s load it:

In [15]:
r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']

ratings_base = pd.read_csv(r'C:\Users\user\Desktop\program\Recommenadtion syste\ml-100k\ua.base', sep='\t', names=r_cols, encoding='latin-1')

ratings_test = pd.read_csv(r'C:\Users\user\Desktop\program\Recommenadtion syste\ml-100k\ua.test', sep='\t', names=r_cols, encoding='latin-1')

ratings_base.shape, ratings_test.shape

((90570, 4), (9430, 4))
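
To make the idea of a per-user hold-out concrete, here is a minimal sketch on toy data. It is not the procedure GroupLens used to produce `ua.base` and `ua.test`; the frame `ratings_toy` and the hold-out size `n_test` are assumptions made for illustration.

```python
import pandas as pd

# Toy ratings: 3 hypothetical users with 4 ratings each
ratings_toy = pd.DataFrame({
    "user_id":  [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "movie_id": [10, 11, 12, 13, 10, 12, 14, 15, 11, 13, 14, 16],
    "rating":   [5, 3, 4, 2, 4, 4, 1, 5, 3, 5, 2, 4],
})

n_test = 2  # ratings held out per user (ua.test holds out 10)

# Randomly pick n_test rows per user for the test set; the rest is training data
test_toy = ratings_toy.groupby("user_id").sample(n=n_test, random_state=0)
train_toy = ratings_toy.drop(test_toy.index)

print(test_toy.groupby("user_id").size().tolist())  # rows held out per user
print(len(train_toy), len(test_toy))
```

Every user contributes the same number of test rows, which is why the real test file has exactly 943 × 10 = 9430 rows.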

# IMDB popularity model


$$\mathrm{WR} = \frac{v}{v+m} \cdot R + \frac{m}{v+m} \cdot C$$

where

- R = average rating for the movie (rating)
- v = number of votes for the movie (members)
- m = minimum number of votes required to be listed in the Top 250 (here set to the 80th percentile of the per-movie vote counts)
- C = the average rating across the whole dataset (in the code below, the mean of the per-movie average ratings)
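
To see how the formula damps ratings that are backed by only a few votes, here is a small worked example. The numbers and the helper name `weighted_rating` are made up for illustration and do not come from the MovieLens data.

```python
def weighted_rating(R, v, m, C):
    """IMDB-style weighted rating: pulls R toward C when v is small relative to m."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Hypothetical values: two movies with the same raw average but different vote counts
C = 3.0  # assumed dataset-wide mean rating
m = 50   # assumed vote threshold

few_votes = weighted_rating(R=4.5, v=10, m=m, C=C)
many_votes = weighted_rating(R=4.5, v=1000, m=m, C=C)
print(round(few_votes, 3), round(many_votes, 3))
```

The movie with 10 votes ends up close to the global mean, while the movie with 1000 votes keeps almost all of its raw average.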

### weight_rating

In [19]:
def weight_rating(R, v, m, C):
    return (v / (v + m)) * R + (m / (v + m)) * C

### Popular movie

In [34]:
moive_rating = ratings.groupby("movie_id", as_index=False).agg({"user_id": "count", "rating": "mean"})
moive_rating.columns = ["movie_id","num_of_vote", "avg_rating"]
moive_rating
C = np.mean(moive_rating["avg_rating"])
m = np.percentile(moive_rating["num_of_vote"], 80)
R = moive_rating["avg_rating"]
v = moive_rating["num_of_vote"]
moive_rating['weight_rating'] = weight_rating(R, v, m, C)
moive_rating.sort_values("weight_rating", ascending=False)
#moive_rating.shape

Unnamed: 0,movie_id,num_of_vote,avg_rating,weight_rating
49,50,583,4.358491,4.170724
317,318,298,4.466443,4.117097
63,64,283,4.445230,4.087740
482,483,243,4.456790,4.054240
126,127,413,4.283293,4.047962
...,...,...,...,...
987,988,86,2.313953,2.723680
242,243,132,2.439394,2.713812
686,687,69,2.188406,2.713636
687,688,44,1.840909,2.698642


In [49]:
import matplotlib.pyplot as plt
import seaborn as sns
popular_items = moive_rating.merge(items, on = "movie_id", how='left')
popular_items
popular_items = popular_items.loc[:, ['movie_id','movie title','num_of_vote', 'weight_rating']].sort_values("weight_rating", ascending=False)
popular_items.head()['movie title']


49                     Star Wars (1977)
317             Schindler's List (1993)
63     Shawshank Redemption, The (1994)
482                   Casablanca (1942)
126               Godfather, The (1972)
Name: movie title, dtype: object

---

# Content Based

In [56]:
from sklearn.metrics.pairwise import cosine_similarity
items_genre = items.drop(['movie title','release date','video release date', 'IMDb URL'], axis = 1)
items_genre
corr_mat = cosine_similarity(items_genre)
corr_mat

array([[1.        , 0.37796447, 0.47434165, ..., 0.49999982, 0.50029735,
        0.49999991],
       [0.37796447, 1.        , 0.83666003, ..., 0.75592868, 0.75592881,
        0.75592881],
       [0.47434165, 0.83666003, 1.        , ..., 0.94868296, 0.94868313,
        0.94868313],
       ...,
       [0.49999982, 0.75592868, 0.94868296, ..., 1.        , 0.99999947,
        0.99999982],
       [0.50029735, 0.75592881, 0.94868313, ..., 0.99999947, 1.        ,
        0.99999965],
       [0.49999991, 0.75592881, 0.94868313, ..., 0.99999982, 0.99999965,
        1.        ]])

In [81]:
def top_k_items(item_id, top_k, corr_mat, map_name):
    # argsort is ascending, so take the last top_k positions and reverse them
    top_items = corr_mat[item_id, :].argsort()[-top_k:][::-1]
    # translate row positions through the supplied mapping
    top_items = [map_name[e] for e in top_items]

    return top_items

# map row positions to DataFrame index labels and back
ind2name = {ind: name for ind, name in enumerate(items_genre.index)}
name2ind = {v: k for k, v in ind2name.items()}
similar_items = top_k_items(name2ind[50],
                            top_k = 10,
                            corr_mat = corr_mat,
                            map_name = ind2name)
display(items.loc[items['movie_id'].isin(similar_items)])

Unnamed: 0,movie_id,movie title,release date,video release date,IMDb URL,unknown,Action,Adventure,Animation,Children's,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
34,35,Free Willy 2: The Adventure Home (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Free%20Willy%...,0,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
49,50,Star Wars (1977),01-Jan-1977,,http://us.imdb.com/M/title-exact?Star%20Wars%2...,0,1,1,0,0,...,0,0,0,0,0,1,1,0,1,0
50,51,Legends of the Fall (1994),01-Jan-1994,,http://us.imdb.com/M/title-exact?Legends%20of%...,0,0,0,0,0,...,0,0,0,0,0,1,0,0,1,1
67,68,"Crow, The (1994)",01-Jan-1994,,"http://us.imdb.com/M/title-exact?Crow,%20The%2...",0,1,0,0,0,...,0,0,0,0,0,1,0,1,0,0
95,96,Terminator 2: Judgment Day (1991),01-Jan-1991,,http://us.imdb.com/M/title-exact?Terminator%20...,0,1,0,0,0,...,0,0,0,0,0,0,1,1,0,0
123,124,Lone Star (1996),21-Jun-1996,,http://us.imdb.com/M/title-exact?Lone%20Star%2...,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
129,130,Kansas City (1996),16-Aug-1996,,http://us.imdb.com/M/title-exact?Kansas%20City...,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
131,132,"Wizard of Oz, The (1939)",01-Jan-1939,,http://us.imdb.com/M/title-exact?Wizard%20of%2...,0,0,1,0,1,...,0,0,0,1,0,0,0,0,0,0
155,156,Reservoir Dogs (1992),01-Jan-1992,,http://us.imdb.com/M/title-exact?Reservoir%20D...,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
170,171,Delicatessen (1991),01-Jan-1991,,http://us.imdb.com/M/title-exact?Delicatessen%...,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
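
Two details of the cell above affect this result. `items_genre` still contains the `movie_id` column, so the ID itself takes part in the cosine similarity, and `name2ind[50]` selects row position 50, which holds movie_id 51; the returned row positions are then matched against `movie_id`. The sketch below shows one possible way to keep IDs and row positions apart. It uses a toy frame (`toy_items`) and a hypothetical helper (`similar_movie_ids`) rather than the real `items` data, and computes cosine similarity with NumPy.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `items`: hypothetical titles with three genre flags
toy_items = pd.DataFrame({
    "movie_id": [1, 2, 3, 4],
    "movie title": ["A", "B", "C", "D"],
    "Action": [1, 1, 0, 0],
    "Comedy": [0, 0, 1, 1],
    "Sci-Fi": [1, 1, 0, 1],
})

# Index by movie_id so the ID is a label, not a feature
genres = toy_items.set_index("movie_id").drop(columns=["movie title"])

# Cosine similarity between genre vectors
X = genres.to_numpy(dtype=float)
norms = np.linalg.norm(X, axis=1, keepdims=True)
norms[norms == 0] = 1.0  # guard against movies with no genre flag
sim = (X / norms) @ (X / norms).T

def similar_movie_ids(movie_id, top_k):
    pos = genres.index.get_loc(movie_id)           # movie_id -> row position
    order = np.argsort(-sim[pos], kind="stable")   # most similar first
    order = order[order != pos][:top_k]            # drop the query movie itself
    return genres.index[order].tolist()            # row positions -> movie_ids

print(similar_movie_ids(1, top_k=2))
```

Returning IDs from the helper means the result can be passed straight to `isin` on the `movie_id` column.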



---


# Collaborative Filtering
## Memory Based

For a user-based variant, use the same approach as below but pass `mat` instead of `mat.T` to `cosine_similarity`.
With the resulting user-user similarity matrix you can find a group of similar users, average their movie ratings, sort the averages, and recommend the highest-rated movies.
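
As a rough illustration of the user-based steps described above, here is a minimal NumPy sketch on a toy matrix. The matrix `R`, the neighbourhood size and the plain averaging are assumptions chosen for brevity, not a tuned implementation.

```python
import numpy as np

# Toy user-item matrix (rows: 4 hypothetical users, columns: 5 items, 0 = unrated)
R = np.array([
    [5, 4, 0, 0, 1],
    [4, 5, 0, 1, 0],
    [1, 0, 5, 4, 0],
    [5, 5, 4, 0, 0],
], dtype=float)

def cosine_sim(M):
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for empty rows
    U = M / norms
    return U @ U.T

user_sim = cosine_sim(R)  # user-user similarity: rows of R, not R.T

# Find the most similar users to the target user
target, n_neighbors = 0, 2
order = np.argsort(-user_sim[target], kind="stable")
neighbors = order[order != target][:n_neighbors]

# Average the neighbours' ratings per item, ignoring unrated entries
nb = R[neighbors]
counts = (nb > 0).sum(axis=0)
mean_rating = np.divide(nb.sum(axis=0), counts,
                        out=np.zeros(R.shape[1]), where=counts > 0)

# Recommend only items the target user has not rated, best first
mean_rating[R[target] > 0] = -np.inf
recommended = [int(i) for i in np.argsort(-mean_rating, kind="stable")
               if mean_rating[i] > 0]
print(neighbors.tolist(), recommended)
```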

### item-based

In [191]:
from scipy.sparse import csr_matrix
row = ratings['user_id']
col = ratings["movie_id"]
data = ratings["rating"]
#ratings['user_id']
#print(max(row))

# init user-item matrix
mat = csr_matrix((data, (row, col)), shape=(944, 1683))
mat.eliminate_zeros()

# share of user-item cells that hold a rating (the matrix density)
density = float(len(mat.nonzero()[0]))
density /= (mat.shape[0] * mat.shape[1])
density *= 100
print(f'Density: {density:4.2f}%. This means that {density:4.2f}% of the user-item ratings have a value.')

item_corr_mat = cosine_similarity(mat.T)
similar_items = top_k_items(name2ind[50],
                            top_k = 10,
                            corr_mat = item_corr_mat,
                            map_name = ind2name)
display(items.loc[items['movie_id'].isin(similar_items)])

Density: 6.29%. This means that 6.29% of the user-item ratings have a value.


Unnamed: 0,movie_id,movie title,release date,video release date,IMDb URL,unknown,Action,Adventure,Animation,Children's,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
49,50,Star Wars (1977),01-Jan-1977,,http://us.imdb.com/M/title-exact?Star%20Wars%2...,0,1,1,0,0,...,0,0,0,0,0,1,1,0,1,0
97,98,"Silence of the Lambs, The (1991)",01-Jan-1991,,http://us.imdb.com/M/title-exact?Silence%20of%...,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
99,100,Fargo (1996),14-Feb-1997,,http://us.imdb.com/M/title-exact?Fargo%20(1996),0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
120,121,Independence Day (ID4) (1996),03-Jul-1996,,http://us.imdb.com/M/title-exact?Independence%...,0,1,0,0,0,...,0,0,0,0,0,0,1,0,1,0
126,127,"Godfather, The (1972)",01-Jan-1972,,"http://us.imdb.com/M/title-exact?Godfather,%20...",0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
171,172,"Empire Strikes Back, The (1980)",01-Jan-1980,,http://us.imdb.com/M/title-exact?Empire%20Stri...,0,1,1,0,0,...,0,0,0,0,0,1,1,0,1,0
173,174,Raiders of the Lost Ark (1981),01-Jan-1981,,http://us.imdb.com/M/title-exact?Raiders%20of%...,0,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
180,181,Return of the Jedi (1983),14-Mar-1997,,http://us.imdb.com/M/title-exact?Return%20of%2...,0,1,1,0,0,...,0,0,0,0,0,1,1,0,1,0
209,210,Indiana Jones and the Last Crusade (1989),01-Jan-1989,,http://us.imdb.com/M/title-exact?Indiana%20Jon...,0,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0


---

# Model-Based

In [196]:
user_item_matrix = mat
print(user_item_matrix)

  (1, 1)	5
  (1, 2)	3
  (1, 3)	4
  (1, 4)	3
  (1, 5)	3
  (1, 6)	5
  (1, 7)	4
  (1, 8)	1
  (1, 9)	5
  (1, 10)	3
  (1, 11)	2
  (1, 12)	5
  (1, 13)	5
  (1, 14)	5
  (1, 15)	5
  (1, 16)	5
  (1, 17)	3
  (1, 18)	4
  (1, 19)	5
  (1, 20)	4
  (1, 21)	1
  (1, 22)	4
  (1, 23)	4
  (1, 24)	3
  (1, 25)	4
  :	:
  (943, 739)	4
  (943, 756)	2
  (943, 763)	4
  (943, 765)	3
  (943, 785)	2
  (943, 794)	3
  (943, 796)	3
  (943, 808)	4
  (943, 816)	4
  (943, 824)	4
  (943, 825)	3
  (943, 831)	2
  (943, 840)	4
  (943, 928)	5
  (943, 941)	1
  (943, 943)	5
  (943, 1011)	2
  (943, 1028)	2
  (943, 1044)	3
  (943, 1047)	2
  (943, 1067)	2
  (943, 1074)	4
  (943, 1188)	3
  (943, 1228)	3
  (943, 1330)	3


## TruncatedSVD

In [194]:
import numpy as np
from sklearn.decomposition import TruncatedSVD

'''
the user_item_matrix will look like this
|        | item 1 | ... | item m |
|--------|--------|-----|--------|
| user 1 | 3      | 0   | 0      |
| ...    | 0      | 4   | 5      |
| user n | 2      | 0   | 0      |
'''

# initial hyperparameter
epsilon = 1e-9
n_latent_factors = 10

# generate item latent features
item_svd = TruncatedSVD(n_components = n_latent_factors)
item_features = item_svd.fit_transform(user_item_matrix.transpose()) + epsilon

# generate user latent features
user_svd = TruncatedSVD(n_components = n_latent_factors)
user_features = user_svd.fit_transform(user_item_matrix) + epsilon

item_corr_mat = cosine_similarity(item_features)
similar_items = top_k_items(name2ind[50],
                            top_k = 10,
                            corr_mat = item_corr_mat,
                            map_name = ind2name)
display(items.loc[items['movie_id'].isin(similar_items)])


Unnamed: 0,movie_id,movie title,release date,video release date,IMDb URL,unknown,Action,Adventure,Animation,Children's,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
6,7,Twelve Monkeys (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Twelve%20Monk...,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
49,50,Star Wars (1977),01-Jan-1977,,http://us.imdb.com/M/title-exact?Star%20Wars%2...,0,1,1,0,0,...,0,0,0,0,0,1,1,0,1,0
126,127,"Godfather, The (1972)",01-Jan-1972,,"http://us.imdb.com/M/title-exact?Godfather,%20...",0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
150,151,Willy Wonka and the Chocolate Factory (1971),01-Jan-1971,,http://us.imdb.com/M/title-exact?Willy%20Wonka...,0,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
171,172,"Empire Strikes Back, The (1980)",01-Jan-1980,,http://us.imdb.com/M/title-exact?Empire%20Stri...,0,1,1,0,0,...,0,0,0,0,0,1,1,0,1,0
173,174,Raiders of the Lost Ark (1981),01-Jan-1981,,http://us.imdb.com/M/title-exact?Raiders%20of%...,0,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
180,181,Return of the Jedi (1983),14-Mar-1997,,http://us.imdb.com/M/title-exact?Return%20of%2...,0,1,1,0,0,...,0,0,0,0,0,1,1,0,1,0
221,222,Star Trek: First Contact (1996),22-Nov-1996,,http://us.imdb.com/M/title-exact?Star%20Trek:%...,0,1,1,0,0,...,0,0,0,0,0,0,1,0,0,0
256,257,Men in Black (1997),04-Jul-1997,,http://us.imdb.com/M/title-exact?Men+in+Black+...,0,1,1,0,0,...,0,0,0,0,0,0,1,0,0,0


## Funk Matrix Factorization

In [200]:
MF_ratings = ratings.drop("unix_timestamp", axis = 1)
MF_ratings

Unnamed: 0,user_id,movie_id,rating
0,196,242,3
1,186,302,3
2,22,377,1
3,244,51,2
4,166,346,1
...,...,...,...
99995,880,476,3
99996,716,204,5
99997,276,1090,1
99998,13,225,2


In [206]:
from surprise import SVD, accuracy
from surprise import Dataset, Reader
from surprise.model_selection import cross_validate
from surprise.model_selection.split import train_test_split
from collections import defaultdict

def pred2dict(predictions, top_k=None):

    rec_dict = defaultdict(list)
    for user_id, item_id, actual_rating, pred_rating, _ in predictions:
        rec_dict[user_id].append((item_id, pred_rating))

    return rec_dict

def get_top_k_recommendation(rec_dict, user_id, top_k, ind2name):

    pred_ratings = rec_dict[user_id]
    # sort descendingly by pred_rating
    pred_ratings = sorted(pred_ratings, key=lambda x: x[1], reverse=True)
    pred_ratings = pred_ratings[:top_k]
    recs = [ind2name[e[0]] for e in pred_ratings]

    return recs

# prepare the dataset
reader = Reader(rating_scale=(1, 5))  # tell surprise the range of the rating scale
data = Dataset.load_from_df(MF_ratings, reader)  # needs exactly 3 columns (user, item, rating) plus the reader

svd = SVD(verbose=True, n_epochs=10, n_factors=50)  # use SVD to factorize the matrix
cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=3, verbose=True)  # 3-fold cross-validation to check the training results


Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Evaluating RMSE, MAE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.9548  0.9496  0.9496  0.9513  0.0025  
MAE (testset)     0.7554  0.7528  0.7518  0.7533  0.0015  
Fit time          0.28    0.27    0.29    0.28    0.01    
Test time         0.55    0.16    0.16    0.29    0.19    


{'test_rmse': array([0.95480449, 0.94963833, 0.94957241]),
 'test_mae': array([0.75535341, 0.75276973, 0.75177341]),
 'fit_time': (0.2808196544647217, 0.26914262771606445, 0.2949366569519043),
 'test_time': (0.5535461902618408, 0.15776586532592773, 0.15616178512573242)}
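
For intuition about what this kind of model fits, here is a minimal NumPy sketch of Funk-style matrix factorization trained with stochastic gradient descent. It is a simplification: it omits the user and item bias terms that surprise's `SVD` also learns, and the toy triples, learning rate and regularization values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy (user, item, rating) triples with hypothetical 0-based IDs
triples = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 2), (2, 2, 5)]
n_users, n_items, n_factors = 3, 3, 2
lr, reg, n_epochs = 0.05, 0.02, 200

P = rng.normal(0, 0.1, (n_users, n_factors))  # user latent factors
Q = rng.normal(0, 0.1, (n_items, n_factors))  # item latent factors

def rmse():
    errs = [r - P[u] @ Q[i] for u, i, r in triples]
    return float(np.sqrt(np.mean(np.square(errs))))

rmse_before = rmse()
for _ in range(n_epochs):
    for u, i, r in triples:
        err = r - P[u] @ Q[i]   # prediction error on one observed rating
        pu = P[u].copy()        # keep the old user factors for the item update
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * pu - reg * Q[i])
rmse_after = rmse()
print(rmse_before, rmse_after)
```

Only observed ratings enter the updates, which is the main difference from running a plain SVD on a matrix whose missing entries are filled with zeros.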

In [208]:
uid = 3
mid = 50
pred = svd.predict(uid, mid)
print(pred)

user: 3          item: 50         r_ui = None   est = 4.16   {'was_impossible': False}


In [209]:
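train, test = train_test_split(data, test_size=.2, random_state=42)
# svd was last fitted inside cross_validate (on the final fold's training split),
# so part of `test` may have been seen in training; call svd.fit(train) first for a clean hold-out score
pred = svd.test(test)
record = pred2dict(pred)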

In [212]:
# predict the top-k recommendations for user 3
pred = []
for i in range(1, 1684):
    pred.append(svd.predict(3, i))
record = pred2dict(pred)
top_k = get_top_k_recommendation(record, 3, 10, ind2name)
display(items.loc[items['movie_id'].isin(top_k)])

Unnamed: 0,movie_id,movie title,release date,video release date,IMDb URL,unknown,Action,Adventure,Animation,Children's,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
11,12,"Usual Suspects, The (1995)",14-Aug-1995,,http://us.imdb.com/M/title-exact?Usual%20Suspe...,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
49,50,Star Wars (1977),01-Jan-1977,,http://us.imdb.com/M/title-exact?Star%20Wars%2...,0,1,1,0,0,...,0,0,0,0,0,1,1,0,1,0
63,64,"Shawshank Redemption, The (1994)",01-Jan-1994,,http://us.imdb.com/M/title-exact?Shawshank%20R...,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
168,169,"Wrong Trousers, The (1993)",01-Jan-1993,,http://us.imdb.com/M/title-exact?Wrong%20Trous...,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
177,178,12 Angry Men (1957),01-Jan-1957,,http://us.imdb.com/M/title-exact?12%20Angry%20...,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
190,191,Amadeus (1984),01-Jan-1984,,http://us.imdb.com/M/title-exact?Amadeus%20(1984),0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
317,318,Schindler's List (1993),01-Jan-1993,,http://us.imdb.com/M/title-exact?Schindler's%2...,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
407,408,"Close Shave, A (1995)",28-Apr-1996,,http://us.imdb.com/M/title-exact?Close%20Shave...,0,0,0,1,0,...,0,0,0,0,0,0,0,1,0,0
426,427,To Kill a Mockingbird (1962),01-Jan-1962,,http://us.imdb.com/M/title-exact?To%20Kill%20a...,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
482,483,Casablanca (1942),01-Jan-1942,,http://us.imdb.com/M/title-exact?Casablanca%20...,0,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0


---

## Neural Collaborative Filtering (NCF)

In [None]:
from recommenders.models.ncf.ncf_singlenode import NCF
from recommenders.models.ncf.dataset import Dataset as NCFDataset
from recommenders.datasets.python_splitters import python_chrono_split
from recommenders.utils.constants import SEED as DEFAULT_SEED

# Initial parameters
TOP_K = 10
EPOCHS = 50
BATCH_SIZE = 1024
SEED = DEFAULT_SEED

'''
create NCFDataset
the train and test dataframe will look like this
| user_id | item_id | rating          |
|---------|---------|-----------------|
| 1       | 1       | 5               |
| ...     | ...     | ...             |
| n       | m       | 3               |
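Required: all test users must also appear in the train set
'''

# NOTE: train and test here must be pandas DataFrames in the layout above,
# not the surprise trainset/testset created earlier
data = NCFDataset(train = train, test = test, seed=SEED)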

model = NCF (
    n_users=data.n_users, 
    n_items=data.n_items,
    model_type="NeuMF",
    n_factors=4,
    layer_sizes=[16,8,4],
    n_epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    learning_rate=1e-3,
    verbose=1,
    seed=SEED
)

# fitting the model
model.fit(data)

# predict the data in the test set
predictions = [[row.userID, row.itemID, model.predict(row.userID, row.itemID)]
               for (_, row) in test.iterrows()]

---

Since we’ll be using GraphLab, let’s convert the train and test ratings into SFrames.

In [29]:
import graphlab

In [30]:
graphlab.SFrame()

In [31]:
train_data = graphlab.SFrame(ratings_base)
test_data = graphlab.SFrame(ratings_test)

We can now use this data for training and testing. Note that we have user behaviour as well as attributes of the users and movies, so we can build content-based as well as collaborative filtering algorithms.

# A Simple Popularity Model

Let’s start with a popularity-based model, i.e. one where every user gets the same recommendations based on the most popular choices. We’ll use the graphlab recommender function popularity_recommender for this.

We can train the recommender as:

In [32]:
popularity_model = graphlab.popularity_recommender.create(train_data, user_id='user_id', item_id='movie_id', target='rating')

Arguments:

* train_data: the SFrame which contains the required data
* user_id: the column name which represents each user ID
* item_id: the column name which represents each item to be recommended
* target: the column name representing scores/ratings given by the user

Let’s use this model to make the top 5 recommendations for the first 5 users and see what comes out:

In [33]:
#Get recommendations for first 5 users and print them
#users = range(1,6) specifies user ID of first 5 users
#k=5 specifies top 5 recommendations to be given
popularity_recomm = popularity_model.recommend(users=range(1,6),k=5)
popularity_recomm.print_rows(num_rows=25)

+---------+----------+-------+------+
| user_id | movie_id | score | rank |
+---------+----------+-------+------+
|    1    |   1599   |  5.0  |  1   |
|    1    |   1201   |  5.0  |  2   |
|    1    |   1189   |  5.0  |  3   |
|    1    |   1122   |  5.0  |  4   |
|    1    |   814    |  5.0  |  5   |
|    2    |   1599   |  5.0  |  1   |
|    2    |   1201   |  5.0  |  2   |
|    2    |   1189   |  5.0  |  3   |
|    2    |   1122   |  5.0  |  4   |
|    2    |   814    |  5.0  |  5   |
|    3    |   1599   |  5.0  |  1   |
|    3    |   1201   |  5.0  |  2   |
|    3    |   1189   |  5.0  |  3   |
|    3    |   1122   |  5.0  |  4   |
|    3    |   814    |  5.0  |  5   |
|    4    |   1599   |  5.0  |  1   |
|    4    |   1201   |  5.0  |  2   |
|    4    |   1189   |  5.0  |  3   |
|    4    |   1122   |  5.0  |  4   |
|    4    |   814    |  5.0  |  5   |
|    5    |   1599   |  5.0  |  1   |
|    5    |   1201   |  5.0  |  2   |
|    5    |   1189   |  5.0  |  3   |
|    5    | 

Did you notice something? The recommendations are the same for all users: 1599, 1201, 1189, 1122, 814, in the same order. This can be verified by checking the movies with the highest mean ratings in our ratings_base data set:

In [34]:
ratings_base.groupby(by='movie_id')['rating'].mean().sort_values(ascending=False).head(20)

movie_id
1500    5.000000
1293    5.000000
1122    5.000000
1189    5.000000
1656    5.000000
1201    5.000000
1599    5.000000
814     5.000000
1467    5.000000
1536    5.000000
1449    4.714286
1642    4.500000
1463    4.500000
1594    4.500000
1398    4.500000
114     4.491525
408     4.480769
169     4.476636
318     4.475836
483     4.459821
Name: rating, dtype: float64

This confirms that all the recommended movies have an average rating of 5, i.e. all the users who watched the movie gave it a top rating. Thus we can see that our popularity system works as expected. But is it good enough? We’ll analyze it in detail later.
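
A mean of 5.0 can come from a single rating, so it can be informative to look at the number of ratings next to the mean. The sketch below does this on toy data; the frame `toy` and the threshold `min_ratings` are assumptions for illustration and say nothing about the actual counts in `ratings_base`.

```python
import pandas as pd

# Toy ratings: movie 2 has a perfect mean but only one rating
toy = pd.DataFrame({
    "movie_id": [1, 1, 1, 1, 2, 3, 3],
    "rating":   [5, 4, 5, 4, 5, 2, 3],
})

# Mean rating and number of ratings per movie, best mean first
stats = (toy.groupby("movie_id")["rating"]
            .agg(mean_rating="mean", n_ratings="count")
            .sort_values("mean_rating", ascending=False))
print(stats)

# Keep only movies with enough ratings before ranking
min_ratings = 2  # assumed threshold
filtered = stats[stats["n_ratings"] >= min_ratings].index.tolist()
print(filtered)
```

Running the same aggregation on `ratings_base` would show how many ratings stand behind each of the 5.0 averages.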

## A Collaborative Filtering Model

Let’s start by understanding the basics of a collaborative filtering algorithm. The core idea works in 2 steps:

1. Find similar items by using a similarity metric
2. For a user, recommend the items most similar to the items they already like

To give you a high-level overview, this is done by building an item-item matrix in which we keep a record of the pairs of items that were rated together.

In this case, an item is a movie. Once we have the matrix, we use it to determine the best recommendations for a user based on the movies they have already rated. Note that there are a few more things to take care of in an actual implementation, which would require deeper mathematical treatment and which I’ll skip for now.
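
The two steps can be sketched in a few lines of NumPy. This is a simplified illustration on a toy matrix, not how graphlab implements its item similarity recommender; the matrix `R` and the scoring rule (similarity-weighted sum of the user's ratings) are assumptions.

```python
import numpy as np

# Toy user-item matrix (rows: users, columns: items, 0 = unrated)
R = np.array([
    [5, 4, 0, 0],
    [4, 5, 1, 0],
    [0, 1, 5, 4],
    [1, 0, 4, 5],
], dtype=float)

# Step 1: item-item cosine similarity (the columns of R are the items)
norms = np.linalg.norm(R, axis=0, keepdims=True)
norms[norms == 0] = 1.0
Rn = R / norms
item_sim = Rn.T @ Rn
np.fill_diagonal(item_sim, 0.0)  # an item should not vote for itself

# Step 2: score every item for one user by its similarity to what they rated
user = 0
scores = item_sim @ R[user]
scores[R[user] > 0] = -np.inf    # hide items the user has already rated
best = int(np.argmax(scores))
print(best)
```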

I would just like to mention that there are 3 types of item similarity metrics supported by graphlab. These are:

1. Jaccard Similarity:
   * Similarity is the number of users who have rated both items A and B divided by the number of users who have rated either A or B
   * It is typically used where we don’t have a numeric rating but just a boolean value, like a product being bought or an ad being clicked

2. Cosine Similarity:
   * Similarity is the cosine of the angle between the item vectors of A and B
   * The closer the vectors, the smaller the angle and the larger the cosine

3. Pearson Similarity:
   * Similarity is the Pearson correlation coefficient between the two vectors
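
The three metrics can be compared on a pair of toy item vectors. The sketch below uses NumPy only; the vectors are made up, and computing Pearson on co-rated users only is one common convention that may differ from graphlab's exact definition.

```python
import numpy as np

# Ratings given to items A and B by the same six hypothetical users (0 = not rated)
a = np.array([5, 3, 0, 4, 0, 1], dtype=float)
b = np.array([4, 0, 0, 5, 2, 1], dtype=float)

# Jaccard: users who rated both / users who rated either
rated_a, rated_b = a > 0, b > 0
jaccard = (rated_a & rated_b).sum() / (rated_a | rated_b).sum()

# Cosine: cosine of the angle between the two rating vectors
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Pearson: correlation computed on the co-rated users only
both = rated_a & rated_b
pearson = np.corrcoef(a[both], b[both])[0, 1]

print(round(float(jaccard), 3), round(float(cosine), 3), round(float(pearson), 3))
```

Jaccard ignores the rating values entirely, which is why it suits purely boolean signals such as purchases or clicks.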

Let’s create a model based on item similarity as follows (for background on these metrics, see https://hquach.github.io/Machine-Learning-Similarity-or-Distance-Metrics/):


In [35]:
#Train Model
item_sim_model = graphlab.item_similarity_recommender.create(train_data, user_id='user_id', item_id='movie_id', target='rating', similarity_type='pearson')

#Make Recommendations:
item_sim_recomm = item_sim_model.recommend(users=range(1,6),k=5)
item_sim_recomm.print_rows(num_rows=25)

With this model the recommendations differ from user to user, so personalization exists. But how good is this model? We need some means of evaluating a recommendation engine. Let’s focus on that in the next section.

## Evaluating Recommendation Engines

For evaluating recommendation engines, we can use the concept of precision-recall. You must be familiar with this in terms of classification and the idea is very similar. Let me define them in terms of recommendations.

* Recall:
  * What fraction of the items that a user likes were actually recommended?
  * If a user likes, say, 5 items and the recommender shows 3 of them, then the recall is 0.6

* Precision:
  * Out of all the recommended items, how many did the user actually like?
  * If 5 items were recommended to the user, of which they liked 4, then the precision is 0.8

Now if we think about recall, how can we maximize it? If we simply recommend all the items, they will definitely cover the items the user likes, so we have 100% recall! But think about precision for a second. If we recommend, say, 1000 items and the user likes only 10 of them, then precision is 1%. This is really low. Our aim is to maximize both precision and recall.

An ideal recommender system is one that recommends only the items the user likes. In this case precision = recall = 1. This is the optimal recommender, and we should try to get as close to it as possible.
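
The definitions above translate directly into code. The sketch below is a minimal illustration; the helper name `precision_recall_at_k` and the toy item lists are assumptions.

```python
def precision_recall_at_k(recommended, liked, k):
    """Precision and recall of the first k recommendations against the liked items."""
    top_k = recommended[:k]
    hits = len(set(top_k) & set(liked))
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(liked) if liked else 0.0
    return precision, recall

# The user likes 5 items; 3 of the 5 recommendations are among them
liked = ["a", "b", "c", "d", "e"]
recommended = ["a", "x", "b", "c", "y"]
p, r = precision_recall_at_k(recommended, liked, k=5)
print(p, r)

# Recommending the whole catalogue: perfect recall, very low precision
catalogue = list(range(1000))
liked_ids = list(range(10))
p_all, r_all = precision_recall_at_k(catalogue, liked_ids, k=1000)
print(p_all, r_all)
```

The second call reproduces the trade-off described above: recommending everything maximizes recall at the cost of precision.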

Let’s compare both the models we have built so far based on their precision-recall characteristics:

In [36]:
model_performance = graphlab.compare(test_data, [popularity_model, item_sim_model])
graphlab.show_comparison(model_performance,[popularity_model, item_sim_model])

PROGRESS: Evaluate model M0

Precision and recall summary statistics by cutoff
+--------+-------------------+-------------------+
| cutoff |   mean_precision  |    mean_recall    |
+--------+-------------------+-------------------+
|   1    |        0.0        |        0.0        |
|   2    | 0.000530222693531 | 0.000106044538706 |
|   3    | 0.000353481795688 | 0.000106044538706 |
|   4    | 0.000530222693531 | 0.000212089077413 |
|   5    | 0.000424178154825 | 0.000212089077413 |
|   6    | 0.000353481795688 | 0.000212089077413 |
|   7    | 0.000302984396304 | 0.000212089077413 |
|   8    | 0.000265111346766 | 0.000212089077413 |
|   9    | 0.000235654530458 | 0.000212089077413 |
|   10   | 0.000212089077413 | 0.000212089077413 |
+--------+-------------------+-------------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M1

Precision and recall summary statistics by cutoff
+--------+-------------------+-------------------+
| cutoff |   mean_precision  |    mean_recall    |
+--

Here we can make 2 very quick observations:

1. The item similarity model is definitely better than the popularity model (by at least 10x)
2. On an absolute level, even the item similarity model appears to have a poor performance. It is far from being a useful recommendation system.

There is a lot of room for improvement here, but I leave it up to you to figure out how to improve this further. I would like to give a couple of tips:

1. Try leveraging the additional context information which we have
2. Explore more sophisticated algorithms like matrix factorization

Finally, I would like to mention that along with GraphLab, you can also use other open source Python packages like Crab. Crab is still under development and supports only basic collaborative filtering techniques for now, but it is something to watch out for in the future.

## End Notes

In this article, we walked through the process of making a basic recommendation engine in Python using GraphLab. We started by understanding the fundamentals of recommendations. Then we went on to load the MovieLens 100K data set for the purpose of experimentation.

Subsequently we made a first model as a simple popularity model in which the most popular movies were recommended for each user. Since this lacked personalization, we made another model based on collaborative filtering and observed the impact of personalization.

Finally, we discussed precision-recall as evaluation metrics for recommendation systems and on comparison found the collaborative filtering model to be more than 10x better than the popularity model.
