## Collaborative Filtering

*Prepared by:*
**Jude Michael Teves**  
Faculty, Software Technology Department  
College of Computer Studies - De La Salle University

This notebook shows how to build a recommender system using collaborative filtering.

## Preliminaries

### Import libraries

In [2]:
import numpy as np
import pandas as pd

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import jaccard_score
from scipy.stats import pearsonr

### Load Data

We will use the MovieLens dataset. I have already preprocessed the data so that it is easier to work with later on.

In [3]:
df_ratings = pd.read_csv('https://raw.githubusercontent.com/Cyntwikip/data-repository/main/movielens_movie_ratings.csv')
df_ratings.head()

| | userId | movieId | rating |
|---|---|---|---|
| 0 | 1 | 1 | 4.0 |
| 1 | 1 | 3 | 4.0 |
| 2 | 1 | 6 | 4.0 |
| 3 | 1 | 47 | 5.0 |
| 4 | 1 | 50 | 5.0 |


In [4]:
df_genres = pd.read_csv('https://raw.githubusercontent.com/Cyntwikip/data-repository/main/movielens_movie_genres.csv')
df_genres.head()

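| | movieId | title | Action | Adventure | Animation | Children | Comedy | Crime | Documentary | Drama | ... | Film-Noir | Horror | IMAX | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Toy Story (1995) | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2 | Jumanji (1995) | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 3 | Grumpier Old Men (1995) | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 4 | Waiting to Exhale (1995) | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 5 | Father of the Bride Part II (1995) | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |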


## User-based Collaborative Filtering

### Build User-Item Matrix

In [5]:
user_id = 3

In [330]:
df_user = df_ratings.pivot(index='userId', columns='movieId', values='rating')
df_user

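| userId \ movieId | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 193565 | 193567 | 193571 | 193573 | 193579 | 193581 | 193583 | 193585 | 193587 | 193609 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 4.0 | NaN | 4.0 | NaN | NaN | 4.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | 4.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 606 | 2.5 | NaN | NaN | NaN | NaN | NaN | 2.5 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 607 | 4.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 608 | 2.5 | 2.0 | 2.0 | NaN | NaN | NaN | NaN | NaN | NaN | 4.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 609 | 3.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 4.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |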
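To make the reshaping step concrete, here is a minimal sketch on a made-up ratings table. The `toy_ratings` and `toy_user` names and all values are hypothetical; only the `pivot` call mirrors the cell above.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, movie) pair.
toy_ratings = pd.DataFrame({
    'userId':  [1, 1, 2, 3, 3],
    'movieId': [10, 20, 10, 20, 30],
    'rating':  [4.0, 5.0, 3.0, 2.0, 4.5],
})

# Rows become users, columns become movies; unrated pairs become NaN.
toy_user = toy_ratings.pivot(index='userId', columns='movieId', values='rating')
print(toy_user)
```

Note that `pivot` raises an error if the same (userId, movieId) pair appears more than once; `pivot_table` with an aggregation function is the usual alternative in that case.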


### Retrieve *k* most similar users

#### Preprocessing - Mean Imputation

In [7]:
df_user_filled = df_user.apply(lambda x: x.fillna(x.mean()), axis=1)
df_user_filled.head()

| userId \ movieId | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 193565 | 193567 | 193571 | 193573 | 193579 | 193581 | 193583 | 193585 | 193587 | 193609 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 4.0 | 4.366379 | 4.0 | 4.366379 | 4.366379 | 4.0 | 4.366379 | 4.366379 | 4.366379 | 4.366379 | ... | 4.366379 | 4.366379 | 4.366379 | 4.366379 | 4.366379 | 4.366379 | 4.366379 | 4.366379 | 4.366379 | 4.366379 |
| 2 | 3.948276 | 3.948276 | 3.948276 | 3.948276 | 3.948276 | 3.948276 | 3.948276 | 3.948276 | 3.948276 | 3.948276 | ... | 3.948276 | 3.948276 | 3.948276 | 3.948276 | 3.948276 | 3.948276 | 3.948276 | 3.948276 | 3.948276 | 3.948276 |
| 3 | 2.435897 | 2.435897 | 2.435897 | 2.435897 | 2.435897 | 2.435897 | 2.435897 | 2.435897 | 2.435897 | 2.435897 | ... | 2.435897 | 2.435897 | 2.435897 | 2.435897 | 2.435897 | 2.435897 | 2.435897 | 2.435897 | 2.435897 | 2.435897 |
| 4 | 3.555556 | 3.555556 | 3.555556 | 3.555556 | 3.555556 | 3.555556 | 3.555556 | 3.555556 | 3.555556 | 3.555556 | ... | 3.555556 | 3.555556 | 3.555556 | 3.555556 | 3.555556 | 3.555556 | 3.555556 | 3.555556 | 3.555556 | 3.555556 |
| 5 | 4.0 | 3.636364 | 3.636364 | 3.636364 | 3.636364 | 3.636364 | 3.636364 | 3.636364 | 3.636364 | 3.636364 | ... | 3.636364 | 3.636364 | 3.636364 | 3.636364 | 3.636364 | 3.636364 | 3.636364 | 3.636364 | 3.636364 | 3.636364 |
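The effect of row-wise mean imputation is easier to see on a tiny matrix. The sketch below uses hypothetical data; the first call is the same expression as in the cell above, and the second is one possible vectorized equivalent, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical 2-user, 3-movie matrix with missing ratings.
toy_user = pd.DataFrame(
    [[4.0, np.nan, 5.0],
     [np.nan, 2.0, np.nan]],
    index=pd.Index([1, 2], name='userId'),
    columns=pd.Index([10, 20, 30], name='movieId'),
)

# Same call as in the notebook: each row (user) is filled with that user's mean.
toy_filled = toy_user.apply(lambda x: x.fillna(x.mean()), axis=1)

# A vectorized alternative: fillna matches a Series to column labels,
# so transpose first to put users on the columns, then transpose back.
row_means = toy_user.mean(axis=1)
toy_filled_alt = toy_user.T.fillna(row_means).T
print(toy_filled)
```

User 1's missing rating becomes 4.5 (the mean of 4.0 and 5.0), and user 2's two missing ratings become 2.0.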


#### Similarity Computation

In [8]:
k = 10
reference_user = df_user_filled.loc[user_id]
user_similarities = df_user_filled.apply(lambda x: pearsonr(x, reference_user)[0], axis=1)
similar_users = user_similarities.drop(user_id, axis=0).nlargest(k)
similar_users



userId
441    0.117418
496    0.067878
549    0.064006
231    0.061159
527    0.058456
537    0.058072
313    0.055313
518    0.050288
244    0.049511
246    0.048314
dtype: float64
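As a small illustration of the neighbour search, the sketch below ranks users of a made-up, already-imputed matrix by Pearson correlation with a reference user. All data are hypothetical, and `np.corrcoef` is swapped in for `scipy.stats.pearsonr` to keep the example dependency-light; both compute the same coefficient.

```python
import numpy as np
import pandas as pd

# Hypothetical dense (already imputed) user-item matrix.
toy_filled = pd.DataFrame(
    [[5.0, 4.0, 1.0, 2.0],
     [4.0, 5.0, 2.0, 1.0],
     [1.0, 2.0, 5.0, 4.0],
     [3.0, 3.5, 2.5, 3.0]],
    index=pd.Index([1, 2, 3, 4], name='userId'),
    columns=pd.Index([10, 20, 30, 40], name='movieId'),
)
user_id, k = 1, 2
reference_user = toy_filled.loc[user_id]

# np.corrcoef returns the 2x2 correlation matrix; [0, 1] is Pearson's r.
toy_similarities = toy_filled.apply(
    lambda x: np.corrcoef(x, reference_user)[0, 1], axis=1
)

# Drop the reference user (similarity 1 with itself) before taking the top k.
toy_similar_users = toy_similarities.drop(user_id).nlargest(k)
print(toy_similar_users)
```

Users 2 and 4 rate in the same direction as user 1 and are kept; user 3 has the opposite taste (r = -1) and is excluded.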

### Get average rating of similar users

In [9]:
predicted_ratings = df_user.loc[similar_users.index].mean().sort_values(ascending=False)
predicted_ratings

movieId
2450      5.0
68954     5.0
68486     5.0
2683      5.0
1199      5.0
         ... 
193581    NaN
193583    NaN
193585    NaN
193587    NaN
193609    NaN
Length: 9724, dtype: float64

#### Recommend items

In [10]:
user_unrated_items = df_user.loc[user_id].isna()
recommended_items = predicted_ratings[user_unrated_items].head(10)
recommended_items

movieId
2450     5.0
68954    5.0
68486    5.0
2683     5.0
1199     5.0
1200     5.0
1997     5.0
3153     5.0
66371    5.0
1213     5.0
dtype: float64

Let's see how the similar users rated those items.

In [11]:
df_user.loc[similar_users.index, recommended_items.index]

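| userId \ movieId | 2450 | 68954 | 68486 | 2683 | 1199 | 1200 | 1997 | 3153 | 66371 | 1213 |
|---|---|---|---|---|---|---|---|---|---|---|
| 441 | NaN | NaN | NaN | 5.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 496 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 549 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 231 | NaN | NaN | NaN | NaN | NaN | 5.0 | NaN | NaN | NaN | NaN |
| 527 | 5.0 | NaN | NaN | NaN | NaN | NaN | 5.0 | NaN | NaN | NaN |
| 537 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 313 | NaN | NaN | NaN | NaN | 5.0 | 5.0 | 5.0 | 5.0 | NaN | 5.0 |
| 518 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 244 | NaN | NaN | NaN | NaN | NaN | 5.0 | NaN | NaN | NaN | NaN |
| 246 | NaN | 5.0 | 5.0 | NaN | NaN | NaN | NaN | NaN | 5.0 | NaN |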
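The table shows that most of the 5.0 predictions rest on a single neighbour's rating. The following sketch, on hypothetical data, computes the same neighbour mean but also reports how many neighbours contributed to it; the `n_ratings` column is an illustrative addition, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings of three neighbours (rows) on four movies (columns).
toy_neighbours = pd.DataFrame(
    [[5.0, np.nan, 3.0, np.nan],
     [np.nan, np.nan, 4.0, np.nan],
     [np.nan, 2.0, 5.0, np.nan]],
    index=pd.Index([7, 8, 9], name='userId'),
    columns=pd.Index([10, 20, 30, 40], name='movieId'),
)
# The target user has already rated movie 30 only.
target_ratings = pd.Series([np.nan, np.nan, 4.5, np.nan],
                           index=toy_neighbours.columns)

summary = pd.DataFrame({
    'mean_rating': toy_neighbours.mean(),   # NaN ratings are skipped
    'n_ratings': toy_neighbours.count(),    # how many neighbours rated it
})

# Keep only movies the target user has not rated, best predictions first.
unrated = target_ratings.isna()
candidates = summary[unrated].sort_values('mean_rating', ascending=False)
print(candidates)
```

Sorting on `mean_rating` alone puts movie 10 first even though only one neighbour rated it, which is the same pattern visible in the table above.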


### Variation: Get weighted average of similar users

In [12]:
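def get_weighted_similarity(x):
    # x: ratings of one movie by the k similar users (NaN where unrated)
    weighted_similarity = x*similar_users
    # normalize only by the similarities of users who actually rated the movie
    norm = similar_users[~weighted_similarity.isna()].sum()
    # no similar user rated this movie: return NaN instead of dividing 0 by 0
    if norm == 0:
        return np.nan
    rating = weighted_similarity.sum()/norm
    return rating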

predicted_ratings = df_user.loc[similar_users.index].apply(get_weighted_similarity, axis=0)
predicted_ratings = predicted_ratings.sort_values(ascending=False)
predicted_ratings


movieId
1333      5.0
1982      5.0
3071      5.0
1961      5.0
2450      5.0
         ... 
193581    NaN
193583    NaN
193585    NaN
193587    NaN
193609    NaN
Length: 9724, dtype: float64
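Written out, the weighted average computed by `get_weighted_similarity` can be read as the formula below. The symbols are introduced here for illustration: $s_{u,v}$ is the Pearson similarity between the target user $u$ and neighbour $v$, $r_{v,i}$ is neighbour $v$'s rating of movie $i$, and $N_i(u)$ is the subset of the $k$ most similar users who actually rated movie $i$.

$$
\hat{r}_{u,i} = \frac{\sum_{v \in N_i(u)} s_{u,v} \, r_{v,i}}{\sum_{v \in N_i(u)} s_{u,v}}
$$

When no neighbour has rated movie $i$, $N_i(u)$ is empty, the expression becomes $0/0$, and the prediction is NaN. When exactly one neighbour has rated it, the similarity cancels and the prediction is simply that neighbour's rating.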

#### Recommend items

In [13]:
user_unrated_items = df_user.loc[user_id].isna()
recommended_items = predicted_ratings[user_unrated_items].head(10)
recommended_items

movieId
1333    5.0
1982    5.0
3071    5.0
1961    5.0
2450    5.0
2118    5.0
2355    5.0
2137    5.0
4638    5.0
1035    5.0
dtype: float64

Let's see how the similar users rated those items.

In [14]:
df_user.loc[similar_users.index, recommended_items.index]

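| userId \ movieId | 1333 | 1982 | 3071 | 1961 | 2450 | 2118 | 2355 | 2137 | 4638 | 1035 |
|---|---|---|---|---|---|---|---|---|---|---|
| 441 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 496 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 549 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 231 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 527 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 |
| 537 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 313 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 518 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 244 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 246 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |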


## Item-based Collaborative Filtering

### Build Item-User Matrix

In [15]:
user_id = 3
item_id = 1

In [16]:
df_item = df_ratings.pivot(index='movieId', columns='userId', values='rating')
df_item

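| movieId \ userId | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 601 | 602 | 603 | 604 | 605 | 606 | 607 | 608 | 609 | 610 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 4.0 | NaN | NaN | NaN | 4.0 | NaN | 4.5 | NaN | NaN | NaN | ... | 4.0 | NaN | 4.0 | 3.0 | 4.0 | 2.5 | 4.0 | 2.5 | 3.0 | 5.0 |
| 2 | NaN | NaN | NaN | NaN | NaN | 4.0 | NaN | 4.0 | NaN | NaN | ... | NaN | 4.0 | NaN | 5.0 | 3.5 | NaN | NaN | 2.0 | NaN | NaN |
| 3 | 4.0 | NaN | NaN | NaN | NaN | 5.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2.0 | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | NaN | 3.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | NaN | NaN | NaN | NaN | NaN | 5.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 3.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 193581 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 193583 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 193585 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 193587 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |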


### Retrieve *k* most similar items

#### Preprocessing - Mean Imputation

In [17]:
df_item_filled = df_item.apply(lambda x: x.fillna(x.mean()), axis=1)
df_item_filled.head()

| movieId \ userId | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 601 | 602 | 603 | 604 | 605 | 606 | 607 | 608 | 609 | 610 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 4.0 | 3.92093 | 3.92093 | 3.92093 | 4.0 | 3.92093 | 4.5 | 3.92093 | 3.92093 | 3.92093 | ... | 4.0 | 3.92093 | 4.0 | 3.0 | 4.0 | 2.5 | 4.0 | 2.5 | 3.0 | 5.0 |
| 2 | 3.431818 | 3.431818 | 3.431818 | 3.431818 | 3.431818 | 4.0 | 3.431818 | 4.0 | 3.431818 | 3.431818 | ... | 3.431818 | 4.0 | 3.431818 | 5.0 | 3.5 | 3.431818 | 3.431818 | 2.0 | 3.431818 | 3.431818 |
| 3 | 4.0 | 3.259615 | 3.259615 | 3.259615 | 3.259615 | 5.0 | 3.259615 | 3.259615 | 3.259615 | 3.259615 | ... | 3.259615 | 3.259615 | 3.259615 | 3.259615 | 3.259615 | 3.259615 | 3.259615 | 2.0 | 3.259615 | 3.259615 |
| 4 | 2.357143 | 2.357143 | 2.357143 | 2.357143 | 2.357143 | 3.0 | 2.357143 | 2.357143 | 2.357143 | 2.357143 | ... | 2.357143 | 2.357143 | 2.357143 | 2.357143 | 2.357143 | 2.357143 | 2.357143 | 2.357143 | 2.357143 | 2.357143 |
| 5 | 3.071429 | 3.071429 | 3.071429 | 3.071429 | 3.071429 | 5.0 | 3.071429 | 3.071429 | 3.071429 | 3.071429 | ... | 3.071429 | 3.071429 | 3.071429 | 3.0 | 3.071429 | 3.071429 | 3.071429 | 3.071429 | 3.071429 | 3.071429 |


#### Similarity Computation

In [18]:
k = 5
reference_item = df_item_filled.loc[item_id]
item_similarities = df_item_filled.apply(lambda x: pearsonr(x, reference_item)[0], axis=1)
user_rated_items = df_item.loc[:, user_id].dropna().index.tolist()
item_similarities = item_similarities.drop(item_id, axis=0).loc[user_rated_items]
similar_items = item_similarities.nlargest(k)
similar_items



movieId
1275    0.112876
2080    0.112139
2424    0.110641
688     0.092266
2288    0.081103
dtype: float64

### Get weighted average of similar items

We now predict how `user_id = 3` would rate `movieId = 1`, using that user's own ratings of the *k* most similar movies.

In [19]:
df_similar_items = df_item.loc[similar_items.index, user_id]
df_similar_items

movieId
1275    3.5
2080    0.5
2424    0.5
688     0.5
2288    4.0
Name: 3, dtype: float64

In [20]:
def get_item_weighted_similarity(x):
    weighted_similarity = x*similar_items
    norm = similar_items[~weighted_similarity.isna()].sum()
    rating = weighted_similarity.sum()/norm
    return rating

get_item_weighted_similarity(df_similar_items)

1.7229008908010734
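The result can be checked by hand from the two outputs above. This sketch copies the displayed (rounded) similarities and ratings into new Series and repeats the arithmetic, so the value agrees with the notebook's only up to rounding.

```python
import pandas as pd

# Values copied (rounded) from the similar_items and df_similar_items outputs.
similar_items = pd.Series(
    [0.112876, 0.112139, 0.110641, 0.092266, 0.081103],
    index=[1275, 2080, 2424, 688, 2288],
)
user_ratings = pd.Series([3.5, 0.5, 0.5, 0.5, 4.0], index=similar_items.index)

# Similarity-weighted sum of the user's ratings, divided by the total similarity.
numerator = (user_ratings * similar_items).sum()
denominator = similar_items.sum()
predicted = numerator / denominator
print(round(predicted, 4))
```

The three 0.5 ratings pull the prediction well below the two higher ratings, because all five similarities are of comparable size.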

## Latent Factor Models

### Train-Test Split

In [337]:
train_size = 0.9

matrix = df_user.copy()
row_boundary, col_boundary = (np.array(matrix.shape) * train_size).astype(int)

# hide the test values in train set
train_matrix = matrix.copy()
train_matrix.iloc[row_boundary:, col_boundary:] = np.nan

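# build the test matrix: blank everything, then copy back the top-left block
# NOTE: this block is also visible in train_matrix, so it is not held-out data
test_matrix = matrix.copy()
test_matrix.iloc[:, :] = np.nan 
test_matrix.iloc[:row_boundary, :col_boundary] = matrix.iloc[:row_boundary, :col_boundary]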

Sanity check.

The test set section of the train matrix should be null.

In [338]:
train_matrix.iloc[row_boundary:, col_boundary:]

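| userId \ movieId | 128838 | 128842 | 128852 | 128900 | 128902 | 128908 | 128914 | 128944 | 128968 | 128975 | ... | 193565 | 193567 | 193571 | 193573 | 193579 | 193581 | 193583 | 193585 | 193587 | 193609 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 550 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 551 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 552 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 553 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 554 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 606 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 607 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 608 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 609 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |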


In [339]:
def check_not_null_count(values:np.array):
    return np.isfinite(values).sum()    

check_not_null_count(train_matrix.iloc[row_boundary:, col_boundary:].values)

0

Sanity check.

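The block kept in the test matrix, `[:row_boundary, :col_boundary]`, should contain non-null values. Note that this is the top-left block, which the train matrix also retains; the block hidden from the train matrix is the bottom-right one.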

In [340]:
# test_matrix.iloc[:row_boundary, :col_boundary]

In [341]:
check_not_null_count(test_matrix.iloc[:row_boundary, :col_boundary].values)

82959

The non-null count for this section should match the non-null count for the whole test matrix, since everything outside it was set to NaN.

In [342]:
check_not_null_count(test_matrix.values)

82959
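The split above copies the top-left block into `test_matrix`, and that block is still present in `train_matrix`, so scores on it measure fit rather than generalization. One possible variant, sketched here on a hypothetical 10 x 10 matrix with made-up names (`toy_train`, `toy_test`), evaluates on exactly the block that was hidden from training. The outputs recorded in this notebook were not produced with this variant.

```python
import numpy as np
import pandas as pd

# Hypothetical fully observed ratings, integers from 1 to 5.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.integers(1, 6, size=(10, 10)).astype(float))

train_size = 0.8
row_b, col_b = (np.array(toy.shape) * train_size).astype(int)

# Train: everything except the bottom-right block.
toy_train = toy.copy()
toy_train.iloc[row_b:, col_b:] = np.nan

# Test: only the bottom-right block, i.e. exactly what was hidden from train.
toy_test = pd.DataFrame(np.nan, index=toy.index, columns=toy.columns)
toy_test.iloc[row_b:, col_b:] = toy.iloc[row_b:, col_b:].values

# No cell should be visible in both matrices.
overlap = (toy_train.notna() & toy_test.notna()).sum().sum()
print(overlap, toy_train.notna().sum().sum(), toy_test.notna().sum().sum())
```

With this layout the two matrices partition the observed cells, so a model fitted on `toy_train` never sees the values it is scored on.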

Our dataset is very sparse. Only this fraction of the user-item matrix is non-null:

In [392]:
print(f'Non-null values: {check_not_null_count(matrix.values) / (matrix.shape[0]*matrix.shape[1]):.3%}')

Non-null values: 1.700%


### Singular Value Decomposition

In [351]:
matrix_imputed = train_matrix.apply(lambda x: x.fillna(x.mean()), axis=1)
# matrix_imputed = train_matrix.fillna(0)
u, s, vh = np.linalg.svd(matrix_imputed, full_matrices=False)
u.shape, s.shape, vh.shape

((610, 610), (610,), (610, 9724))

In [352]:
factors = 600
reconstructed_matrix = u[:, :factors] @ np.diag(s[:factors]) @ vh[:factors, :]
reconstructed_matrix.shape

(610, 9724)
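To see what the `factors` parameter does, the sketch below applies the same truncated reconstruction to a small random matrix (hypothetical data) and measures how closely each rank reproduces the input.

```python
import numpy as np

# Hypothetical dense 8 x 12 matrix standing in for the imputed ratings.
rng = np.random.default_rng(0)
toy = rng.uniform(0.5, 5.0, size=(8, 12))

u, s, vh = np.linalg.svd(toy, full_matrices=False)

def reconstruct(factors):
    """Rank-`factors` approximation built from the leading singular triplets."""
    return u[:, :factors] @ np.diag(s[:factors]) @ vh[:factors, :]

# RMSE between the input and its reconstruction for a few ranks.
errors = {f: np.sqrt(np.mean((toy - reconstruct(f)) ** 2)) for f in (1, 4, 8)}
print(errors)
```

The error shrinks as more factors are kept and is numerically zero at full rank. In the cell above, `factors = 600` out of a maximum of 610 is close to full rank, so the reconstruction nearly reproduces the imputed input matrix; a smaller value forces the model to compress and generalize.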

In [353]:
# train_matrix.iloc[:10,:10]

#### Train Set Score

In [354]:
train_ratings = train_matrix.reset_index().melt(id_vars=['userId'])
train_ratings.rename({'value':'actual'}, inplace=True, axis=1)
train_ratings['pred'] = reconstructed_matrix.T.flatten()
train_ratings

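| | userId | movieId | actual | pred |
|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 3.998999 |
| 1 | 2 | 1 | NaN | 3.945500 |
| 2 | 3 | 1 | NaN | 2.436001 |
| 3 | 4 | 1 | NaN | 3.554769 |
| 4 | 5 | 1 | 4.0 | 3.998107 |
| ... | ... | ... | ... | ... |
| 5931635 | 606 | 193609 | NaN | 3.657400 |
| 5931636 | 607 | 193609 | NaN | 3.786097 |
| 5931637 | 608 | 193609 | NaN | 3.134173 |
| 5931638 | 609 | 193609 | NaN | 3.270429 |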


In [355]:
# true_ratings = train_matrix.values.flatten()
# predicted_ratings = reconstructed_matrix.flatten()
# print(f"Original Size: {len(true_ratings)}")
# mask = np.argwhere(~np.isnan(true_ratings))
# true_ratings = true_ratings[mask]
# predicted_ratings = predicted_ratings[mask]
# print(f"After Filtering Nulls Size: {len(true_ratings)}")

In [356]:
from sklearn.metrics import mean_squared_error

# rmse = mean_squared_error(true_ratings, predicted_ratings, squared=False)

print(f"Original Size: {train_ratings.shape[0]}")
train_ratings_filtered = train_ratings.dropna()
print(f"After Filtering Nulls Size: {train_ratings_filtered.shape[0]}")
# squared=False was removed in scikit-learn 1.6; take the square root explicitly
rmse = np.sqrt(mean_squared_error(train_ratings_filtered['actual'], train_ratings_filtered['pred']))
print(f'RMSE: {rmse}')

Original Size: 5931640
After Filtering Nulls Size: 100411
RMSE: 0.003960286952709188
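The melt-and-`dropna` route above scores only the cells that hold an actual rating. The same idea can be expressed directly on arrays; `masked_rmse` below is a hypothetical helper written for illustration, not a function from the notebook or from scikit-learn.

```python
import numpy as np

def masked_rmse(actual, pred):
    """RMSE over the entries where `actual` is not NaN."""
    actual = np.asarray(actual, dtype=float)
    pred = np.asarray(pred, dtype=float)
    mask = ~np.isnan(actual)  # True where a real rating exists
    return np.sqrt(np.mean((actual[mask] - pred[mask]) ** 2))

# Two observed ratings (4.0 and 2.0); the other two cells are ignored.
actual = np.array([[4.0, np.nan], [np.nan, 2.0]])
pred = np.array([[3.0, 1.0], [5.0, 2.0]])
score = masked_rmse(actual, pred)
print(score)
```

Called as `masked_rmse(train_matrix.values, reconstructed_matrix)`, it should give the same number as the cell above, without having to line up the flattened predictions with the melted row order.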


#### Test Set Score

In [357]:
test_ratings = test_matrix.reset_index().melt(id_vars=['userId'])
test_ratings.rename({'value':'actual'}, inplace=True, axis=1)
test_ratings['pred'] = reconstructed_matrix.T.flatten()

print(f"Original Size: {test_ratings.shape[0]}")
test_ratings_filtered = test_ratings.dropna()
print(f"After Filtering Nulls Size: {test_ratings_filtered.shape[0]}")
rmse = np.sqrt(mean_squared_error(test_ratings_filtered['actual'], test_ratings_filtered['pred']))
print(f'RMSE: {rmse}')

Original Size: 5931640
After Filtering Nulls Size: 82959
RMSE: 0.004128927247791236


### Non-Negative Matrix Factorization

In [370]:
from sklearn.decomposition import non_negative_factorization

matrix_imputed = train_matrix.apply(lambda x: x.fillna(x.mean()), axis=1)
W, H, n_iter = non_negative_factorization(matrix_imputed, n_components=500, 
                                          init='nndsvd', random_state=0, max_iter=50)
W.shape, H.shape, n_iter



((610, 500), (500, 9724), 50)

In [371]:
reconstructed_matrix = W @ H
reconstructed_matrix.shape

(610, 9724)
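For intuition about what the factorization does, here is a minimal NumPy sketch of non-negative matrix factorization using Lee-Seung multiplicative updates. This is a swapped-in, simpler algorithm than the solver used by `non_negative_factorization`, and the function name, data, and parameters are all hypothetical.

```python
import numpy as np

def nmf_multiplicative(X, n_components, n_iter=200, seed=0, eps=1e-10):
    """Factor non-negative X (m x n) into W (m x r) and H (r x n), both >= 0."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.uniform(0.1, 1.0, size=(m, n_components))
    H = rng.uniform(0.1, 1.0, size=(n_components, n))
    for _ in range(n_iter):
        # Lee-Seung updates for the squared Frobenius loss ||X - WH||^2.
        # Multiplying by non-negative ratios keeps W and H non-negative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical dense 6 x 8 matrix of ratings between 0.5 and 5.
rng = np.random.default_rng(1)
toy = rng.uniform(0.5, 5.0, size=(6, 8))
W, H = nmf_multiplicative(toy, n_components=3)
toy_rmse = np.sqrt(np.mean((toy - W @ H) ** 2))
print(W.shape, H.shape, toy_rmse)
```

As with SVD, the product `W @ H` is a low-rank reconstruction of the input; the difference is that every entry of both factors is constrained to be non-negative.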

#### Train Set Score

In [372]:
train_ratings = train_matrix.reset_index().melt(id_vars=['userId'])
train_ratings.rename({'value':'actual'}, inplace=True, axis=1)
train_ratings['pred'] = reconstructed_matrix.T.flatten()

print(f"Original Size: {train_ratings.shape[0]}")
train_ratings_filtered = train_ratings.dropna()
print(f"After Filtering Nulls Size: {train_ratings_filtered.shape[0]}")
rmse = np.sqrt(mean_squared_error(train_ratings_filtered['actual'], train_ratings_filtered['pred']))
print(f'RMSE: {rmse}')

Original Size: 5931640
After Filtering Nulls Size: 100411
RMSE: 0.3433503967285106


#### Test Set Score

In [373]:
test_ratings = test_matrix.reset_index().melt(id_vars=['userId'])
test_ratings.rename({'value':'actual'}, inplace=True, axis=1)
test_ratings['pred'] = reconstructed_matrix.T.flatten()

print(f"Original Size: {test_ratings.shape[0]}")
test_ratings_filtered = test_ratings.dropna()
print(f"After Filtering Nulls Size: {test_ratings_filtered.shape[0]}")
rmse = np.sqrt(mean_squared_error(test_ratings_filtered['actual'], test_ratings_filtered['pred']))
print(f'RMSE: {rmse}')

Original Size: 5931640
After Filtering Nulls Size: 82959
RMSE: 0.34632106907456445


### Using Surprise library

For more details, visit the [Surprise documentation](http://surpriselib.com/).

In [394]:
from surprise import SVD
from surprise import Dataset
from surprise.model_selection import cross_validate

# Load the movielens-100k dataset (download it if needed).
data = Dataset.load_builtin('ml-100k')
data

<surprise.dataset.DatasetAutoFolds at 0x1708501cc40>

In [401]:
len(data.raw_ratings)

100000

In [402]:
# Use the famous SVD algorithm.
algo = SVD()

# Run 5-fold cross-validation and print results.
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9392  0.9316  0.9377  0.9332  0.9401  0.9364  0.0034  
MAE (testset)     0.7369  0.7352  0.7406  0.7357  0.7389  0.7375  0.0020  
Fit time          6.81    6.64    6.55    6.56    6.78    6.67    0.11    
Test time         0.22    0.20    0.21    0.43    0.22    0.26    0.09    


{'test_rmse': array([0.93921411, 0.93158093, 0.93773454, 0.93322255, 0.94010429]),
 'test_mae': array([0.73691424, 0.73521979, 0.74056718, 0.73572648, 0.7388908 ]),
 'fit_time': (6.809737205505371,
  6.639822244644165,
  6.547652721405029,
  6.564387321472168,
  6.781553506851196),
 'test_time': (0.22304987907409668,
  0.20055198669433594,
  0.20606541633605957,
  0.4340980052947998,
  0.2170572280883789)}

In [408]:
from surprise import NormalPredictor
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import cross_validate

# A reader is still needed, but only the rating_scale param is required.
reader = Reader(rating_scale=(0, 5))

# The columns must correspond to user id, item id and ratings (in that order).
data = Dataset.load_from_df(df_ratings[['userId', 'movieId', 'rating']], reader)

In [450]:
from surprise.model_selection import train_test_split
from surprise import accuracy

trainset, testset = train_test_split(data, test_size=.2, random_state=0)

# We'll use the famous SVD algorithm.
algo = SVD(n_factors=2)

# Train the algorithm on the trainset, and predict ratings for the testset
algo.fit(trainset)
predictions = algo.test(testset)

# Then compute RMSE
accuracy.rmse(predictions)

RMSE: 0.8645


0.8644530531029808

In [451]:
algo.pu.shape, algo.qi.shape

((610, 2), (8979, 2))

In [455]:
(algo.pu @ algo.qi.T).shape

(610, 8979)
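The product `algo.pu @ algo.qi.T` is only the interaction part of Surprise's SVD prediction; the full estimate also adds the global mean and the user and item biases. The sketch below reproduces that formula with small made-up arrays standing in for `algo.pu`, `algo.qi`, `algo.bu`, `algo.bi`, and `trainset.global_mean`; none of the numbers come from the fitted model.

```python
import numpy as np

# Hypothetical stand-ins for algo.pu, algo.qi, algo.bu, algo.bi and
# trainset.global_mean; the numbers are made up for illustration.
pu = np.array([[0.1, -0.2], [0.3, 0.0]])              # (n_users, n_factors)
qi = np.array([[0.5, 0.1], [-0.4, 0.2], [0.0, 0.3]])  # (n_items, n_factors)
bu = np.array([0.2, -0.1])                            # user biases
bi = np.array([0.3, 0.0, -0.5])                       # item biases
global_mean = 3.5

# r_hat[u, i] = global_mean + bu[u] + bi[i] + pu[u] . qi[i]
r_hat = global_mean + bu[:, None] + bi[None, :] + pu @ qi.T

# Clip to the rating scale, here (0, 5) as set in the Reader above.
r_hat = np.clip(r_hat, 0, 5)
print(r_hat)
```

Rows and columns of these arrays follow Surprise's inner ids rather than the raw `userId` and `movieId` values, so `trainset.to_inner_uid` and `trainset.to_inner_iid` are needed to look up a specific user or movie.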

In [442]:
len(df_ratings['movieId'].unique())

9724

In [434]:
trainset.n_users, trainset.n_items

(610, 8979)

In [443]:
len(testset)

20168

In [409]:
# We can now use this dataset as we please, e.g. calling cross_validate
cross_validate(NormalPredictor(), data, cv=2)

{'test_rmse': array([1.42208342, 1.42297205]),
 'test_mae': array([1.13396628, 1.13534907]),
 'fit_time': (0.11360979080200195, 0.13002991676330566),
 'test_time': (0.48670101165771484, 0.6636519432067871)}

In [410]:
# Use the famous SVD algorithm.
algo = SVD()

# Run 5-fold cross-validation and print results.
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8703  0.8745  0.8623  0.8820  0.8800  0.8738  0.0070  
MAE (testset)     0.6679  0.6711  0.6621  0.6784  0.6773  0.6714  0.0060  
Fit time          6.47    6.43    6.48    6.46    6.48    6.47    0.02    
Test time         0.23    0.20    0.38    0.20    0.20    0.24    0.07    


{'test_rmse': array([0.87034096, 0.87445257, 0.86233985, 0.88196824, 0.87996387]),
 'test_mae': array([0.66792047, 0.67110855, 0.66207446, 0.67837767, 0.67732784]),
 'fit_time': (6.474883794784546,
  6.434120416641235,
  6.478055953979492,
  6.459465742111206,
  6.483584403991699),
 'test_time': (0.22705650329589844,
  0.19904518127441406,
  0.378084659576416,
  0.19904470443725586,
  0.19805026054382324)}

## References

1. F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. https://doi.org/10.1145/2827872

## End
<sup>Made by **Jude Michael Teves**</sup> <br>
<sup>For comments, corrections, or suggestions, please email:</sup><sup> judemichaelteves@gmail.com or jude.teves@dlsu.edu.ph</sup><br>