Introduction

Matrix factorization is a class of collaborative filtering algorithms used in recommender systems. These algorithms decompose the user-item interaction matrix into the product of two lower-dimensional rectangular matrices, which represents users and items in a shared low-dimensional latent space. In this notebook, the items are movies.

Our aim is to fill in the missing ratings in the user-movie matrix. The weights of the latent features, which are assumed to exist for both users and movies, are learned from the ratings that are available. These weights are then used to estimate the missing ratings.
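The idea can be sketched in a few lines of plain Python. This is a toy illustration, not what the Surprise library does internally: the ratings, the number of latent features, the learning rate and the regularisation strength are all made-up values, and the model is a bare dot product without bias terms.

```python
import random

# Toy ratings as (user, item, rating); the missing pairs are what we want to estimate.
ratings = [
    (0, 0, 5.0), (0, 1, 4.0),
    (1, 0, 4.0), (1, 2, 1.0),
    (2, 1, 5.0), (2, 2, 2.0),
]
n_users, n_items, n_factors = 3, 3, 2

rng = random.Random(0)
# Latent feature weights for users (P) and items (Q), initialised to small random values.
P = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
Q = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_items)]

def predict(u, i):
    """Estimated rating: dot product of the user's and the item's latent vectors."""
    return sum(pf * qf for pf, qf in zip(P[u], Q[i]))

def mse():
    """Mean squared error over the known ratings."""
    return sum((r - predict(u, i)) ** 2 for u, i, r in ratings) / len(ratings)

initial_error = mse()
lr, reg = 0.02, 0.02  # hypothetical learning rate and regularisation strength
for _ in range(1000):
    for u, i, r in ratings:
        err = r - predict(u, i)
        for f in range(n_factors):
            pu, qi = P[u][f], Q[i][f]
            # Stochastic gradient step on the regularised squared error.
            P[u][f] += lr * (err * qi - reg * pu)
            Q[i][f] += lr * (err * pu - reg * qi)
final_error = mse()

# Estimate for a pair that was never rated: user 0, item 2.
estimate = predict(0, 2)
print(f"error before: {initial_error:.3f}, after: {final_error:.3f}, estimate: {estimate:.2f}")
```

The weights are fitted only on the known ratings, yet `predict` can be called for any user-item pair, which is how the empty cells get filled.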

In [1]:
!pip install surprise
!pip install openpyxl
import pandas as pd
from surprise import Reader, SVD, Dataset, accuracy
from surprise.model_selection import GridSearchCV, train_test_split, cross_validate
pd.set_option('display.max_columns', None)



In [2]:
movie = pd.read_csv('movie.csv')
rating = pd.read_csv('rating.csv')
# A left merge keeps every movie, including any without ratings (their rating columns become NaN).
df = movie.merge(rating, how="left", on="movieId")
df.head()

   movieId             title                                       genres  userId  rating            timestamp
0        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy     3.0     4.0  1999-12-11 13:36:47
1        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy     6.0     5.0  1997-03-13 17:50:52
2        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy     8.0     4.0  1996-06-05 13:37:51
3        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy    10.0     4.0  1999-11-25 02:44:47
4        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy    11.0     4.5  2009-01-02 01:13:41


Step 1: Preparing the Data Set

In [5]:
# Titles are listed in the same order as their ids.
movie_ids = [130219, 356, 4422, 541]
movies = ["The Dark Knight (2011)",
          "Forrest Gump (1994)",
          "Cries and Whispers (Viskningar och rop) (1972)",
          "Blade Runner (1982)"]

In [7]:
sample_df = df[df.movieId.isin(movie_ids)]
sample_df.shape

(97343, 6)

In [9]:
sample_df.head()

         movieId                title                    genres  userId  rating            timestamp
2457839      356  Forrest Gump (1994)  Comedy|Drama|Romance|War     4.0     4.0  1996-08-24 09:28:42
2457840      356  Forrest Gump (1994)  Comedy|Drama|Romance|War     7.0     4.0  2002-01-16 19:02:55
2457841      356  Forrest Gump (1994)  Comedy|Drama|Romance|War     8.0     5.0  1996-06-05 13:44:19
2457842      356  Forrest Gump (1994)  Comedy|Drama|Romance|War     9.0     4.0  2001-07-01 20:26:38
2457843      356  Forrest Gump (1994)  Comedy|Drama|Romance|War    10.0     3.0  1999-11-25 02:32:02


In [13]:
user_movie_df = sample_df.pivot_table(index=["userId"], columns=["title"], values="rating")
user_movie_df.shape

(76918, 4)
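The two shapes reported so far give a rough sense of how many cells of the user-movie matrix are empty. The arithmetic below is a back-of-the-envelope sketch; it assumes every row of `sample_df` is one distinct (user, movie) rating, which the notebook does not verify.

```python
# Figures reported above: rows in sample_df and the shape of user_movie_df.
n_ratings = 97343
n_users, n_movies = 76918, 4

# Assumption: each row of sample_df is one distinct (user, movie) rating.
density = n_ratings / (n_users * n_movies)
sparsity = 1 - density
print(f"filled: {density:.1%}, empty: {sparsity:.1%}")
```

Under that assumption, roughly a third of the cells hold a rating and the rest are the blanks the model is meant to estimate.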

In [15]:
user_movie_df.head()

title   Blade Runner (1982)  Cries and Whispers (Viskningar och rop) (1972)  Forrest Gump (1994)  The Dark Knight (2011)
userId
1.0                     4.0                                             NaN                  NaN                     NaN
2.0                     5.0                                             NaN                  NaN                     NaN
3.0                     5.0                                             NaN                  NaN                     NaN
4.0                     NaN                                             NaN                  4.0                     NaN
7.0                     NaN                                             NaN                  4.0                     NaN


In [17]:
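# Estimates are clipped to rating_scale; the data has half-star ratings (e.g. 4.5), so use (0.5, 5) if the minimum is 0.5.
reader = Reader(rating_scale=(1, 5))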

In [19]:
data = Dataset.load_from_df(sample_df[['userId', 'movieId', 'rating']], reader)

Step 2: Modelling

In [21]:
trainset, testset = train_test_split(data, test_size=.25)
svd_model = SVD()
svd_model.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x182e41851c0>

In [23]:
predictions = svd_model.test(testset)

accuracy.rmse(predictions)

RMSE: 0.9349


0.9348564168050575
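To make the metric concrete, here is how RMSE (and MAE, which appears in the cross-validation below) can be computed by hand. The pairs are invented for illustration; with Surprise, the true and estimated ratings would come from the `r_ui` and `est` fields of each prediction.

```python
import math

# Hypothetical (true rating, estimated rating) pairs standing in for `predictions`.
pairs = [(4.0, 3.5), (5.0, 4.5), (3.0, 4.0), (2.0, 2.5)]

# RMSE: square root of the mean squared difference between true and estimated ratings.
squared_errors = [(true - est) ** 2 for true, est in pairs]
rmse = math.sqrt(sum(squared_errors) / len(squared_errors))

# MAE: mean absolute difference; less sensitive to a few large errors than RMSE.
mae = sum(abs(true - est) for true, est in pairs) / len(pairs)

print(f"RMSE: {rmse:.4f}, MAE: {mae:.4f}")
```

An RMSE of about 0.93 therefore means the estimates are off by a little under one rating point in the root-mean-square sense.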

In [25]:
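# cross_validate fits the algorithm separately on each of the 5 folds.
cross_validate(svd_model, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

# Item 213 is not one of the four selected movie_ids, so the model has no item factors for it.
svd_model.predict(uid=12.0, iid=213, verbose=True)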

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9343  0.9347  0.9315  0.9334  0.9420  0.9352  0.0036  
MAE (testset)     0.7239  0.7229  0.7217  0.7219  0.7277  0.7236  0.0022  
Fit time          1.70    1.68    1.74    1.69    1.74    1.71    0.03    
Test time         0.38    0.41    0.24    0.36    0.23    0.33    0.07    
user: 12.0       item: 213        r_ui = None   est = 4.06   {'was_impossible': False}


Prediction(uid=12.0, iid=213, r_ui=None, est=4.062953451043339, details={'was_impossible': False})
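Item 213 is not among the four movies in the data, yet the prediction is still reported as possible. The sketch below shows why, using the prediction rule documented for Surprise's SVD: global mean plus user bias plus item bias plus the dot product of the latent vectors, with the bias and factors of an unknown user or item treated as zero. All numbers and the helper name `estimate` are hypothetical, not values taken from the fitted model.

```python
def estimate(mu, b_u, b_i, p_u, q_i, scale=(1, 5)):
    """Biased matrix factorization estimate, clipped to the rating scale."""
    raw = mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))
    low, high = scale
    return min(high, max(low, raw))

mu = 3.9  # hypothetical global mean rating

# Known user and known item: every term contributes.
known = estimate(mu, b_u=0.2, b_i=0.3, p_u=[0.1, -0.2], q_i=[0.5, 0.4])

# Known user, unknown item: the item bias and item factors are zero,
# so the estimate falls back to the global mean plus the user bias.
unknown_item = estimate(mu, b_u=0.2, b_i=0.0, p_u=[0.1, -0.2], q_i=[0.0, 0.0])

# Estimates outside the scale are clipped to its bounds.
clipped = estimate(4.9, b_u=0.5, b_i=0.5, p_u=[], q_i=[])

print(round(known, 2), round(unknown_item, 2), clipped)
```

An estimate for a movie outside the training data therefore says something about the user's general rating level, not about that movie.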

In [26]:
svd_model.predict(uid=1.0, iid=541, verbose=True)

user: 1.0        item: 541        r_ui = None   est = 4.24   {'was_impossible': False}


Prediction(uid=1.0, iid=541, r_ui=None, est=4.238569060995527, details={'was_impossible': False})

Step 3: Model Tuning

In [27]:
param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005]}

gs = GridSearchCV(SVD,
                  param_grid,
                  measures=['rmse', 'mae'],
                  cv=3,
                  n_jobs=-1,
                  joblib_verbose=True)

gs.fit(data)
gs.best_score['rmse']

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of  12 | elapsed:    1.2s remaining:    6.5s
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    2.5s finished


0.9313252770800768

In [28]:
gs.best_params['rmse']

{'n_epochs': 5, 'lr_all': 0.002}
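The grid search tries every combination of the listed values and cross-validates each one. The following sketch enumerates the combinations with `itertools.product`; the resulting count is consistent with the 12 tasks shown in the joblib log above.

```python
from itertools import product

param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005]}
cv = 3

# Every combination of the listed values, as keyword-argument dictionaries.
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

# Each combination is fitted once per cross-validation fold.
n_fits = len(combinations) * cv

for params in combinations:
    print(params)
print("total fits:", n_fits)
```

Adding values to the grid multiplies the number of fits, so the search cost grows quickly with each extra parameter.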

Step 4: Final Model and Prediction

In [29]:
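svd_model = SVD(**gs.best_params['rmse'])

# Train the final model on all available ratings.
full_trainset = data.build_full_trainset()
svd_model.fit(full_trainset)

# Item 1 (Toy Story) is not among the selected movie_ids, so no item factors were learned for it.
svd_model.predict(uid=1.0, iid=1, verbose=True)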

user: 1.0        item: 1          r_ui = None   est = 4.06   {'was_impossible': False}


Prediction(uid=1.0, iid=1, r_ui=None, est=4.06027583063687, details={'was_impossible': False})

In [32]:
import pickle
with open('svd_model.pkl', 'wb') as file:
    pickle.dump(svd_model, file)

print("Pickle file 'svd_model.pkl' created successfully.")

Pickle file 'svd_model.pkl' created successfully.
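A saved model is only useful if it can be read back. The sketch below shows the round trip with `pickle.load`. It uses a plain dictionary as a stand-in for the fitted model so that it runs without the data files, and it writes to a temporary directory rather than the notebook's working directory; both choices are assumptions made for illustration.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted model; any picklable object round-trips the same way.
model_stub = {'n_epochs': 5, 'lr_all': 0.002}

# Write to a temporary directory so the sketch leaves the working directory untouched.
path = os.path.join(tempfile.mkdtemp(), 'svd_model.pkl')
with open(path, 'wb') as file:
    pickle.dump(model_stub, file)

# Read the object back; files must be opened in binary mode for pickle.
with open(path, 'rb') as file:
    restored = pickle.load(file)

print(restored == model_stub)
```

Unpickling executes code embedded in the file, so only load pickle files from a source you trust.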
