<strong style="color:#2B3467;font-size:60px;font-family:Georgia;text-align:center;">Model Based Matrix Factorization </strong>


<div style="float:center;margin-left:100px;max-width:100%;">
<img src="https://i.imgur.com/hmptLkK.gif"></div>

We come across recommendations many times a day: when deciding what to watch on Netflix or YouTube, in item recommendations on shopping sites, song suggestions on Spotify, friend recommendations on Instagram, job recommendations on LinkedIn, and so on. Recommender systems aim to predict the “rating” or “preference” a user would give to an item. These predicted ratings are used to determine what a user might like and to make informed suggestions.


<strong style="color:#2B3467;font-size:30px;font-family:Georgia;text-align:center;">Matrix Factorization </strong>

A recommender system has two entities: users and items. Let’s say we have m users and n items.
The goal of our recommendation system is to build an m × n matrix (called the utility matrix) that holds the rating (or preference) for each user-item pair.
Initially, this matrix is usually very sparse because we only have ratings for a limited number of user-item pairs.
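
As a minimal sketch of the utility matrix described above, the snippet below fills a small m × n array from a handful of made-up (user, item, rating) triples and uses NaN for missing ratings. The triples, the sizes and the variable names are illustrative assumptions, not data from this notebook.

```python
import numpy as np

# Hypothetical (user_index, item_index, rating) triples; not from the dataset.
ratings = [(0, 0, 5.0), (0, 2, 3.0), (1, 1, 4.0), (2, 0, 4.0), (3, 2, 1.0)]
m, n = 4, 3  # m users, n items

# Start with every entry unknown (NaN), then fill in the observed ratings.
utility = np.full((m, n), np.nan)
for user, item, rating in ratings:
    utility[user, item] = rating

# Fraction of user-item pairs that have no rating yet.
observed = ~np.isnan(utility)
sparsity = 1.0 - observed.sum() / utility.size

print(utility)
print(f"observed entries: {observed.sum()}, sparsity: {sparsity:.2f}")
```

Even in this toy case more than half of the entries are missing; on real data the share of missing entries is typically far higher.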

Now, our goal is to populate this matrix by finding similarities between users and items.
To build some intuition, suppose User3 and User4 gave the same rating to Batman.
We can then assume the two users are similar and would feel the same way about Spiderman, and so predict that User3 would give Spiderman a rating of 4.
In practice, however, this is not as straightforward because there are many users interacting with many different items.

The matrix is populated by decomposing (or factorizing) the utility matrix into two tall and skinny matrices.
The decomposition is given by:

<div style="float:center;margin-left:100px;max-width:50%;">
<img src="https://miro.medium.com/v2/resize:fit:344/format:webp/1*VsL3stRmAnK_cmz-7j2mbA.png"></div>

where U is m × k and V is n × k. U is a representation of the users in a low-dimensional space, and V is a representation of the items.
For a user i, uᵢ gives the representation of that user, and for an item e, vₑ gives the representation of that item.


* The rating prediction for a user-item pair is simply:

<div style="float:center;margin-left:100px;max-width:50%;">
<img src="https://miro.medium.com/v2/resize:fit:408/format:webp/1*Z0vVFgUDn-xbhcyWY0DD0g.png"></div>
<div style="float:center;margin-left:100px;max-width:500%;">
<img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/1*n-vTVJgQahPXRI09mMpKNA.png"></div>


Our goal is to find the optimal embeddings for each user and each item.
We can then use these embeddings to make predictions for any user-item pair by taking the dot product of the user embedding and the item embedding.
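
The dot-product prediction can be sketched in a few lines of NumPy. The embeddings here are random placeholders rather than learned values, and the sizes (4 users, 3 items, k = 2) are assumptions chosen only to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 4, 3, 2             # users, items, embedding size (illustrative)
U = rng.normal(size=(m, k))   # one k-dimensional row per user
V = rng.normal(size=(n, k))   # one k-dimensional row per item

# Prediction for a single user-item pair: dot product of the two embeddings.
i, e = 1, 2
single = U[i] @ V[e]

# Predictions for every pair at once: an m x n matrix.
all_predictions = U @ V.T

print(all_predictions.shape)
print(np.isclose(single, all_predictions[i, e]))
```

Computing `U @ V.T` gives the fully populated utility matrix in one step, and each entry equals the dot product of the corresponding user and item rows.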

Cost function: We minimize the mean squared error between the known ratings in the utility matrix and the corresponding predictions. Here N is the number of non-blank elements in the utility matrix.

<div style="float:center;margin-left:100px;max-width:50%;">
<img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/1*ahOZMlf7tiEiWj6dhZb_2w.png"></div>
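
As a rough sketch of this cost, assuming it is the plain average of squared errors over the non-blank entries (with blanks stored as NaN), the function below evaluates it for a tiny hand-made example. The helper name `mse_observed` and the numbers are hypothetical.

```python
import numpy as np

def mse_observed(R, U, V):
    """Mean squared error over the non-NaN entries of R for the model U @ V.T."""
    mask = ~np.isnan(R)
    errors = (R - U @ V.T)[mask]
    return float(np.mean(errors ** 2))

# Two observed ratings, two blanks.
R = np.array([[5.0, np.nan],
              [np.nan, 3.0]])
U = np.array([[1.0, 2.0],
              [0.5, 1.0]])
V = np.array([[1.0, 1.5],
              [2.0, 0.0]])

cost = mse_observed(R, U, V)
print(cost)
```

Here the two observed entries are predicted as 4 and 1, so the squared errors are 1 and 4 and the cost is 2.5 with N = 2. The blank entries do not contribute.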

<strong style="color:#2B3467;font-size:30px;font-family:Georgia;text-align:center;">Gradient Descent: </strong>

The gradient descent equations are:

<div style="float:center;margin-left:60px;max-width:60%;">
<img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/1*aU9HD6gWwJJFSEXYFE2jZQ.png"></div>
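
A minimal sketch of one such update, assuming the cost is the unregularized mean squared error over the observed entries, so that the gradients are −(2/N)·E·V for U and −(2/N)·Eᵀ·U for V, where E holds the residuals and is zero for blank entries. The function names, learning rate and toy matrix are assumptions for illustration.

```python
import numpy as np

def mse(R, U, V):
    """Mean squared error over the observed (non-NaN) entries of R."""
    mask = ~np.isnan(R)
    return float(np.mean((R - U @ V.T)[mask] ** 2))

def gradient_step(R, U, V, lr):
    """One plain gradient descent step on the unregularized MSE."""
    mask = ~np.isnan(R)
    N = mask.sum()
    # Residuals r - u.v, set to 0 where the rating is missing.
    E = np.where(mask, R - U @ V.T, 0.0)
    grad_U = (-2.0 / N) * (E @ V)
    grad_V = (-2.0 / N) * (E.T @ U)
    return U - lr * grad_U, V - lr * grad_V

R = np.array([[5.0, 3.0, np.nan],
              [4.0, np.nan, 1.0],
              [np.nan, 2.0, 5.0]])
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(3, 2))
V = rng.normal(scale=0.1, size=(3, 2))

cost_before = mse(R, U, V)
for _ in range(500):
    U, V = gradient_step(R, U, V, lr=0.02)
cost_after = mse(R, U, V)

print(cost_before, cost_after)
```

Both gradients are computed from the same residuals before either matrix is updated, so U and V move simultaneously at every step.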

I’ve used momentum in my implementation, a method that accelerates gradient descent in the relevant direction and dampens oscillations, which leads to faster convergence. I’ve also added regularization so that the model does not overfit the training data. Hence, the gradient descent equations in my code are slightly more complex than the ones above.

The regularized cost function is:

<div style="float:center;margin-left:70px;max-width:80%;">
<img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/1*6h35XmJVDT1zmWW_PeD09Q.png"></div>
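
The sketch below shows one way momentum and regularization can be combined. It is not the implementation referred to above: it assumes an L2 penalty of the form λ(‖U‖² + ‖V‖²) added to the mean squared error, and an exponential-moving-average form of momentum. The names `lam`, `beta` and `fit_with_momentum`, and all hyperparameter values, are assumptions.

```python
import numpy as np

def regularized_gradients(R, U, V, lam):
    """Gradients of MSE over observed entries plus lam * (|U|^2 + |V|^2)."""
    mask = ~np.isnan(R)
    N = mask.sum()
    E = np.where(mask, R - U @ V.T, 0.0)  # residuals, 0 where unobserved
    grad_U = (-2.0 / N) * (E @ V) + 2.0 * lam * U
    grad_V = (-2.0 / N) * (E.T @ U) + 2.0 * lam * V
    return grad_U, grad_V

def fit_with_momentum(R, k=2, lr=0.02, beta=0.9, lam=0.01, steps=1000, seed=0):
    rng = np.random.default_rng(seed)
    m, n = R.shape
    U = rng.normal(scale=0.1, size=(m, k))
    V = rng.normal(scale=0.1, size=(n, k))
    vel_U, vel_V = np.zeros_like(U), np.zeros_like(V)
    for _ in range(steps):
        grad_U, grad_V = regularized_gradients(R, U, V, lam)
        # Momentum: keep a running average of past gradients.
        vel_U = beta * vel_U + (1 - beta) * grad_U
        vel_V = beta * vel_V + (1 - beta) * grad_V
        U = U - lr * vel_U
        V = V - lr * vel_V
    return U, V

R = np.array([[5.0, 3.0, np.nan],
              [4.0, np.nan, 1.0],
              [np.nan, 2.0, 5.0]])
U, V = fit_with_momentum(R)

observed = ~np.isnan(R)
fitted_mse = float(np.mean((R - U @ V.T)[observed] ** 2))
baseline_mse = float(np.mean(R[observed] ** 2))  # error of predicting 0 everywhere
print(fitted_mse, baseline_mse)
```

The penalty term pulls the embeddings toward zero, which is what limits overfitting, while the velocity terms smooth the update direction from step to step.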


<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p></p><div class="list-group" id="list-tab" role="tablist">
  <h3 class="list-group-item list-group-item-action active" data-toggle="list" role="tab" aria-controls="home" style = "background-color:#FCFFE7;font-family:Georgia;color:#2B3467;font-size:200%;text-align:LEFT;border-radius:20px 40px;overflow:hidden;border-style:dotted;border-width:1.8px;border-color:#2B3467;">...Table of Contents...</h3>
  <a class="list-group-item list-group-item-action" data-toggle="list" href="#1" role="tab" aria-controls="home" target="_self" style = "color:#000000; font-family:Charter;font-size:120%;">Data Preparation <span class="badge badge-primary badge-pill">1</span></a>
  <a class="list-group-item list-group-item-action" data-toggle="list" href="#2" role="tab" aria-controls="messages" target="_self" style = "color:#000000; font-family:Charter;font-size:120%;">Model <span class="badge badge-primary badge-pill">2</span></a>
  <a class="list-group-item list-group-item-action" data-toggle="list" href="#3" role="tab" aria-controls="settings" target="_self" style = "color:#000000; font-family:Charter;font-size:120%;">Model Tuning<span class="badge badge-primary badge-pill">3</span></a>
  <a class="list-group-item list-group-item-action" data-toggle="list" href="#4" role="tab" aria-controls="settings" target="_self" style = "color:#000000; font-family:Charter;font-size:120%;">Final Model & Prediction<span class="badge badge-primary badge-pill">4</span></a> 
</div>
</div>
</div>

<a id = "1"></a>
<strong style="color:#2B3467;font-size:20px;font-family:Georgia;text-align:center;">  Packages-Libraries-Settings 📜 ⚙️ </strong>

In [1]:
import numpy as np
import pandas as pd
from surprise import Reader, SVD, Dataset, accuracy
from surprise.model_selection import GridSearchCV, train_test_split, cross_validate

<a id = "1"></a>
<strong style="color:#2B3467;font-size:20px;font-family:Georgia;text-align:center;">  Data Preparation </strong>

In [2]:
movie = pd.read_csv('/kaggle/input/movielens-20m-dataset/movie.csv')
rating = pd.read_csv('/kaggle/input/movielens-20m-dataset/rating.csv')
df = movie.merge(rating, how="left", on="movieId")
df.head()

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,3.0,4.0,1999-12-11 13:36:47
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,6.0,5.0,1997-03-13 17:50:52
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,8.0,4.0,1996-06-05 13:37:51
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,10.0,4.0,1999-11-25 02:44:47
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,11.0,4.5,2009-01-02 01:13:41
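
The `userId` values above are shown as floats (3.0, 6.0, ...). A likely explanation, stated here as an assumption rather than something checked against the dataset, is that the left merge keeps movies that have no ratings, which introduces NaN into `userId` and makes pandas store the column as float. The toy frames below (hypothetical data) illustrate the effect.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables.
movie_demo = pd.DataFrame({"movieId": [1, 2], "title": ["A", "B"]})
rating_demo = pd.DataFrame({"userId": [3, 6],
                            "movieId": [1, 1],
                            "rating": [4.0, 5.0]})

# Left merge keeps movie 2 even though nobody rated it.
left = movie_demo.merge(rating_demo, how="left", on="movieId")
# Inner merge keeps only movies that have at least one rating.
inner = movie_demo.merge(rating_demo, how="inner", on="movieId")

print(left)
print(left["userId"].dtype, inner["userId"].dtype)
```

If unrated movies are not needed for modelling, an inner merge (or dropping rows with a missing `userId`) keeps the ID column as an integer.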


In [3]:
# Reduce memory usage by downcasting numeric columns to smaller dtypes.
# On this dataframe the run below reports a 25% reduction.
def reduce_memory_usage(df, verbose=True):
    numerics = ["int8", "int16", "int32", "int64", "float16", "float32", "float64"]
    start_mem = df.memory_usage().sum() / 1024 ** 2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == "int":
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if (
                    c_min > np.finfo(np.float16).min
                    and c_max < np.finfo(np.float16).max
                ):
                    df[col] = df[col].astype(np.float16)
                elif (
                    c_min > np.finfo(np.float32).min
                    and c_max < np.finfo(np.float32).max
                ):
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024 ** 2
    if verbose:
        print(
            "Mem. usage decreased to {:.2f} Mb ({:.1f}% reduction)".format(
                end_mem, 100 * (start_mem - end_mem) / start_mem
            )
        )
    return df

df = reduce_memory_usage(df, verbose=True)

Mem. usage decreased to 801.12 Mb (25.0% reduction)
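
One caveat with range-based downcasting: `reduce_memory_usage` only checks that the minimum and maximum fit into the smaller type, not that every value is stored exactly. The short check below uses made-up ID values to show that float16 cannot represent every integer above 2048, while half-step ratings survive the conversion unchanged.

```python
import numpy as np

# Hypothetical integer-like IDs stored as floats.
ids = np.array([2047.0, 2049.0, 4097.0])
as_f16 = ids.astype(np.float16)
as_f32 = ids.astype(np.float32)

# Ratings on a half-step scale from 0.5 to 5.0.
half_steps = np.arange(0.5, 5.5, 0.5)
ratings_f16 = half_steps.astype(np.float16)

print(as_f16)   # 2049 and 4097 are rounded to the nearest representable value
print(as_f32)   # float32 keeps these IDs exactly
print(np.array_equal(ratings_f16.astype(np.float64), half_steps))
```

This is why float16 is a reasonable fit for the `rating` column, whereas identifier columns are safer in an integer type or at least float32.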


In [4]:
def check_df(dataframe, head=5):
    print(" SHAPE ".center(70,'-'))
    print('Rows: {}'.format(dataframe.shape[0]))
    print('Columns: {}'.format(dataframe.shape[1]))
    print(" TYPES ".center(70,'-'))
    print(dataframe.dtypes)
    print(" HEAD ".center(70,'-'))
    print(dataframe.head(head))
    print(" TAIL ".center(70,'-'))
    print(dataframe.tail(head))
    print(" MISSING VALUES ".center(70,'-'))
    print(dataframe.isnull().sum())
    print(" DUPLICATED VALUES ".center(70,'-'))
    print(dataframe.duplicated().sum())
    print(" DESCRIBE ".center(70,'-'))
    print(dataframe.describe([0, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 1]).T)
    
check_df(df)

------------------------------- SHAPE --------------------------------
Rows: 20000797
Columns: 6
------------------------------- TYPES --------------------------------
movieId        int32
title         object
genres        object
userId       float32
rating       float16
timestamp     object
dtype: object
-------------------------------- HEAD --------------------------------
   movieId             title                                       genres  \
0        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy   
1        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy   
2        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy   
3        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy   
4        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy   

   userId  rating            timestamp  
0     3.0     4.0  1999-12-11 13:36:47  
1     6.0     5.0  1997-03-13 17:50:52  
2     8.0     4.0  1996-06-05 13:37

In [5]:
# IDs and titles are listed in matching order.
movie_ids = [130219, 356, 4422, 541]
movies = ["The Dark Knight (2011)",
          "Forrest Gump (1994)",
          "Cries and Whispers (Viskningar och rop) (1972)",
          "Blade Runner (1982)"]

sample_df = df[df.movieId.isin(movie_ids)]
sample_df.head()

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
2457839,356,Forrest Gump (1994),Comedy|Drama|Romance|War,4.0,4.0,1996-08-24 09:28:42
2457840,356,Forrest Gump (1994),Comedy|Drama|Romance|War,7.0,4.0,2002-01-16 19:02:55
2457841,356,Forrest Gump (1994),Comedy|Drama|Romance|War,8.0,5.0,1996-06-05 13:44:19
2457842,356,Forrest Gump (1994),Comedy|Drama|Romance|War,9.0,4.0,2001-07-01 20:26:38
2457843,356,Forrest Gump (1994),Comedy|Drama|Romance|War,10.0,3.0,1999-11-25 02:32:02


In [6]:
sample_df.shape

(97343, 6)

In [7]:
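# MovieLens 20M ratings range from 0.5 to 5.0, so rating_scale=(0.5, 5) would
# match the data more closely; with (1, 5), predictions are clipped to at least 1.
reader = Reader(rating_scale=(1, 5))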

data = Dataset.load_from_df(sample_df[['userId',
                                       'movieId',
                                       'rating']], reader)

<a id = "2"></a>
<strong style="color:#2B3467;font-size:20px;font-family:Georgia;text-align:center;">  Model </strong>

In [8]:
trainset, testset = train_test_split(data, test_size=.25)
svd_model = SVD()
svd_model.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7fd9a6c34c10>

In [9]:
predictions = svd_model.test(testset)

accuracy.rmse(predictions)

RMSE: 0.9399


0.9398671047356251
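
For reference, RMSE is the square root of the mean squared difference between the true rating (`r_ui`) and the estimate (`est`) of each prediction. A minimal NumPy sketch with made-up values (the helper `rmse` below is hypothetical and separate from `accuracy.rmse`):

```python
import numpy as np

def rmse(true_ratings, estimates):
    """Root mean squared error between true ratings and estimates."""
    true_ratings = np.asarray(true_ratings, dtype=float)
    estimates = np.asarray(estimates, dtype=float)
    return float(np.sqrt(np.mean((true_ratings - estimates) ** 2)))

# Three hypothetical test ratings and their predicted values.
value = rmse([4.0, 3.0, 5.0], [3.5, 3.0, 4.0])
print(round(value, 4))
```

An RMSE of about 0.94, as reported above, means the predictions are off by a little under one rating point in the root-mean-square sense.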

In [10]:
svd_model.predict(uid=1.0, iid=541, verbose=True)


user: 1.0        item: 541        r_ui = None   est = 4.26   {'was_impossible': False}


Prediction(uid=1.0, iid=541, r_ui=None, est=4.263744283993758, details={'was_impossible': False})

In [11]:
svd_model.predict(uid=1.0, iid=356, verbose=True)


user: 1.0        item: 356        r_ui = None   est = 4.05   {'was_impossible': False}


Prediction(uid=1.0, iid=356, r_ui=None, est=4.048781885767558, details={'was_impossible': False})
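
The `est` value comes from Surprise's `SVD` prediction rule, which adds the global mean, a user bias and an item bias to the dot product of the user and item factors, and by default clips the result to the rating scale. The sketch below reproduces that rule with made-up numbers; the function name and all values are hypothetical and are not taken from the fitted model.

```python
import numpy as np

def svd_estimate(mu, b_u, b_i, p_u, q_i, low=1.0, high=5.0):
    """mu + b_u + b_i + q_i . p_u, clipped to the rating scale [low, high]."""
    raw = mu + b_u + b_i + float(np.dot(q_i, p_u))
    return float(np.clip(raw, low, high))

# Hypothetical global mean, biases and 2-dimensional factors.
est = svd_estimate(mu=3.5, b_u=0.2, b_i=0.4,
                   p_u=np.array([0.1, -0.3]),
                   q_i=np.array([0.5, 0.2]))
print(round(est, 2))

# A raw score above the scale is clipped to the upper bound.
clipped = svd_estimate(mu=4.8, b_u=0.5, b_i=0.5,
                       p_u=np.zeros(2), q_i=np.zeros(2))
print(clipped)
```

For a user or item that was not in the training set, Surprise treats the corresponding bias and factors as zero, so the estimate falls back toward the global mean plus whichever bias is known.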

In [12]:
sample_df[sample_df["userId"] == 1]


Unnamed: 0,movieId,title,genres,userId,rating,timestamp
3612352,541,Blade Runner (1982),Action|Sci-Fi|Thriller,1.0,4.0,2005-04-02 23:30:03


<a id = "3"></a>
<strong style="color:#2B3467;font-size:20px;font-family:Georgia;text-align:center;">  Model Tuning </strong>


In [13]:
param_grid = {'n_epochs': [5, 10, 20],
              'lr_all': [0.002, 0.005, 0.007]}


gs = GridSearchCV(SVD,
                  param_grid,
                  measures=['rmse', 'mae'],
                  cv=3,
                  n_jobs=-1,
                  joblib_verbose=True)

gs.fit(data)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  27 out of  27 | elapsed:  1.3min finished
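
The "27 out of 27" in the log corresponds to every parameter combination being fitted once per fold: 3 values of `n_epochs` × 3 values of `lr_all` × 3 folds. The standard-library sketch below enumerates the combinations to make that count explicit; it only mirrors the bookkeeping and does not train anything.

```python
from itertools import product

param_grid = {"n_epochs": [5, 10, 20],
              "lr_all": [0.002, 0.005, 0.007]}
cv = 3

# Every combination of the listed values, as a list of dicts.
combinations = [dict(zip(param_grid, values))
                for values in product(*param_grid.values())]
total_fits = len(combinations) * cv

print(len(combinations), total_fits)
for combo in combinations[:3]:
    print(combo)
```

Adding another hyperparameter multiplies this count by its number of candidate values, so run time grows quickly as the grid widens.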


In [14]:
gs.best_score['rmse']


0.930229238035991

In [15]:
gs.best_params['rmse']


{'n_epochs': 5, 'lr_all': 0.005}

<a id = "4"></a>
<strong style="color:#2B3467;font-size:20px;font-family:Georgia;text-align:center;"> Final Model & Prediction </strong>


In [16]:
svd_model.n_epochs

20

In [17]:
svd_model = SVD(**gs.best_params['rmse'])

# Refit on every available rating; build_full_trainset() returns a Trainset.
full_trainset = data.build_full_trainset()
svd_model.fit(full_trainset)



<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7fd9230a4750>

In [18]:
svd_model.predict(uid=1.0, iid=541, verbose=True)


user: 1.0        item: 541        r_ui = None   est = 4.16   {'was_impossible': False}


Prediction(uid=1.0, iid=541, r_ui=None, est=4.163107435859443, details={'was_impossible': False})


<p style="padding:15px;
background-color:#f4ebdc;
margin:0; color:#B3005E; border:2px dotted #C689C6; font-family:Charter; font-weight: bold; font-size:250%; text-align:center; overflow:hidden; font-weight:500">
FOR MORE:</p> 

https://www.linkedin.com/in/serdar-ozturk/

https://github.com/StanleyHopson

https://medium.com/@serdar.f95

<p style="padding:15px;
background-color:#f4ebdc;
margin:0; color:#B3005E; border:2px dotted #C689C6; font-family:Charter; font-weight: bold; font-size:250%; text-align:center; overflow:hidden; font-weight:500">CREDITS:</p> 

[https://numpy.org/](https://numpy.org/)

[https://seaborn.pydata.org/](https://seaborn.pydata.org/)

[https://pandas.pydata.org/](https://pandas.pydata.org/)

[https://learning.miuul.com](https://learning.miuul.com)