<a href="https://colab.research.google.com/github/PriyankaGPawar/MachineLearningWith_Python/blob/master/Class_Notebooks/Recommendation_Engine_Part_II_Singular_Value_Decomposition_(SVD).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="https://github.com/insaid2018/Term-1/blob/master/Images/INSAID_Full%20Logo.png?raw=true" width="240" height="360" />

# Recommendation Engine Part II - Singular Value Decomposition (SVD)

## Table of Contents

1. [Matrix Factorization](#section1)<br>
2. [Let's understand what is SVD](#section2)<br>
3. [What is the use of it?](#section3)<br>
4. [Loading Data](#section4)<br>
5. [Implementing Singular Value Decomposition](#section5)<br>
6. [Setting up SVD](#section6)<br>
7. [Making Recommendations using SVD](#section7)<br>
8. [Model Evaluation](#section8)<br>
9. [Conclusion](#section9)<br>

#### **Note: Kindly run this notebook on Google Colab to avoid Memory issues.**

<a id=section1></a>
## 1. Matrix Factorization

**The most important technique in recommendation systems**<br><br>
- When users give feedback on movies they have seen, that feedback can be collected in the form of a matrix.
- Each row represents a user.
- Each column represents a movie.
- The matrix will be sparse, since not everyone is going to watch every movie.

<img src = "https://raw.githubusercontent.com/insaid2018/Term-4/master/images/rec14.png">
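To make the user-by-movie matrix and its sparsity concrete, here is a minimal sketch on a made-up feedback table. The data and the names `toy`, `toy_matrix` and `sparsity` are assumptions for illustration only, not part of the MovieLens data used later.

```python
import pandas as pd

# Hypothetical toy feedback: (user, movie, rating) triples.
toy = pd.DataFrame({
    "user_id":  [1, 1, 2, 3, 3],
    "movie_id": [10, 20, 20, 10, 30],
    "rating":   [5, 3, 4, 2, 5],
})

# One row per user, one column per movie; unrated cells stay NaN.
toy_matrix = toy.pivot(index="user_id", columns="movie_id", values="rating")

# Sparsity = share of user/movie pairs that have no rating.
sparsity = toy_matrix.isna().to_numpy().mean()
print(toy_matrix)
print(f"sparsity: {sparsity:.2f}")
```

Here 4 of the 9 user/movie pairs are missing. Real rating matrices are usually far sparser than this toy one.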

The idea behind such models is that the preferences of a user can be determined by a small number of hidden factors. We can call these factors **Embeddings**.<br><br>

<a id=section2></a>
## 2. Let's understand what is SVD

Singular Value Decomposition (SVD) is a matrix factorization technique that is widely used for dimensionality reduction: we represent the data as a matrix and then reduce the number of columns it has, shrinking the dimensionality as much as possible while losing as little of the variability in the data as possible.<br>
Why wouldn’t the data be lost? The answer to that question is the essence of SVD.

Basically, SVD breaks a matrix $A$ into three other matrices, $U$, $D$, and $V$, such that $A = U D V^{T}$:

1- A is the real matrix with m*n elements.

2- U is an Orthogonal matrix with m*m elements

3- V is an Orthogonal matrix with n*n elements.

4- D is a diagonal matrix with m*n elements.

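An orthogonal matrix is a square matrix whose columns are mutually perpendicular unit vectors, so its transpose is its inverse ($U^{T}U = I$). Multiplying a vector by an orthogonal matrix rotates or reflects it without changing its length.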

<img src = "https://raw.githubusercontent.com/insaid2018/Term-4/master/images/svd.png">
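The shapes and properties listed above can be checked numerically. The following is a small sketch using `np.linalg.svd` on an arbitrary random matrix; the matrix and its size are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))          # arbitrary real m x n matrix (m=4, n=3)

# full_matrices=True returns U (m x m) and Vt (n x n); d holds the singular values.
U, d, Vt = np.linalg.svd(A, full_matrices=True)

# Place the singular values on the diagonal of an m x n matrix D.
D = np.zeros(A.shape)
D[:len(d), :len(d)] = np.diag(d)

print(np.allclose(U.T @ U, np.eye(4)))    # checks that U is orthogonal
print(np.allclose(Vt @ Vt.T, np.eye(3)))  # checks that V is orthogonal
print(np.allclose(U @ D @ Vt, A))         # checks that U D V^T reproduces A
```

Note that NumPy returns $V^{T}$ rather than $V$, and returns the diagonal of $D$ as a 1-D array.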

<a id=section3></a>
## 3. What is the use of it?

When we decompose our matrix A into (U, D, V), the first few columns of U and V, together with the largest values in D, carry almost all the information we need to recover our actual data. For example, we might keep 92% of the information in just 5% of the total columns, which is a pretty good deal given that the size of the data set has been reduced tremendously.

This means that SVD found some relation between the columns of the matrix A and represented the same information with fewer columns.

With far fewer dimensions to work with, the curse of dimensionality has much less effect on performance.
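One common way to quantify "how much information a few columns keep" is the share of the squared singular values they account for. The sketch below uses synthetic data (a rank-3 signal plus a little noise, an assumption made purely for illustration), so the exact percentage says nothing about the ratings data in this notebook.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic data: a rank-3 signal plus a little noise (illustrative assumption).
signal = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 60))
data = signal + 0.01 * rng.normal(size=(200, 60))

singular_values = np.linalg.svd(data, compute_uv=False)

# Share of the total squared singular values captured by the first k components.
energy = np.cumsum(singular_values**2) / np.sum(singular_values**2)
k = 3
print(f"top {k} of {len(singular_values)} components keep {energy[k - 1]:.4f} of the total")
```

Because the data was built from three underlying factors, three components out of sixty recover almost everything. Real data rarely has such a clean cut-off.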

**Matrix decomposition can be formulated as an optimization problem with loss functions and constraints**

We can understand embeddings as low-dimensional hidden factors for items and users.<br>
Let's say we have 5-dimensional (D or n_factors = 5) embeddings for both items and users. Then for user-X and movie-A, those 5 numbers might represent 5 different characteristics of the movie, such as:
- How sci-fi intense is movie-A?
- How recent is the movie?
- How many special effects are in the movie?
- How dialogue-driven is the movie?


Likewise, some numbers in the user embedding matrix might represent:
- How much does user-X like sci-fi movies?
- How much does user-X like recent movies?
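A user embedding and a movie embedding are typically combined with a dot product: factors where the user's taste and the movie's traits agree push the score up. The sketch below uses four hand-picked factors and made-up numbers. The factor names are purely hypothetical, since real latent factors are learned and carry no labels.

```python
import numpy as np

# Hypothetical factor names, only for readability; real latent factors are unlabeled.
factors = ["sci-fi", "recency", "special effects", "dialogue"]

movie_a = np.array([0.9, 0.2, 0.8, 0.1])    # how strongly movie-A expresses each factor
user_x = np.array([1.0, 0.0, 0.5, -0.5])    # how much user-X cares about each factor

# Per-factor contributions, then the overall affinity score (their sum).
for name, contribution in zip(factors, user_x * movie_a):
    print(f"{name:>15}: {contribution:+.2f}")

score = float(user_x @ movie_a)
print(f"affinity score: {score:.2f}")
```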

<img src= "https://raw.githubusercontent.com/insaid2018/Term-4/master/images/Shubham's%20crap.PNG">

- Source: [https://www.youtube.com/watch?v=ZspR5PZemcs](https://www.youtube.com/watch?v=ZspR5PZemcs)

<a id=section4></a>
## 4. Loading Data

In [0]:
# pip install scikit-surprise

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Reading ratings file
# Ignore the timestamp column
ratings = pd.read_csv('https://raw.githubusercontent.com/insaid2018/Term-4/master/Data/Assignment/ratings.csv', sep='\t', encoding='latin-1', usecols=['user_id', 'movie_id', 'rating'])


# Reading movies file
movies = pd.read_csv('https://raw.githubusercontent.com/insaid2018/Term-4/master/Data/Assignment/movies.csv', sep='\t', encoding='latin-1', usecols=['movie_id', 'title', 'genres'])

<a id=section5></a>
## 5. Implementing Singular Value Decomposition

#### Using Ratings Data

In [0]:
n_users = ratings.user_id.unique().shape[0]
n_movies = ratings.movie_id.unique().shape[0]
print ('Number of users = ' + str(n_users) + ' | Number of movies = ' + str(n_movies))

Number of users = 6040 | Number of movies = 3706


In [0]:
n_users

6040

In [0]:
n_movies

3706

- We want the format of our ratings matrix to be one row per user and one column per movie. 
- We'll pivot *ratings* to get that and call the new variable *Ratings* (with a capital *R*).

In [0]:
Ratings = ratings.pivot(index = 'user_id', columns ='movie_id', values = 'rating').fillna(0)
Ratings.head()

user_id,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,...,3913,3914,3915,3916,3917,3918,3919,3920,3921,3922,3923,3924,3925,3926,3927,3928,3929,3930,3931,3932,3933,3934,3935,3936,3937,3938,3939,3940,3941,3942,3943,3944,3945,3946,3947,3948,3949,3950,3951,3952
1,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,4.0,0.0,4.0,0.0,3.0,0.0,0.0,3.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We need to de-mean the data (subtract each user's mean rating from that user's row) and convert it from a dataframe to a numpy array.

In [0]:
Ratings.values

array([[5., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [3., 0., 0., ..., 0., 0., 0.]])

In [0]:
Ratings.values.shape

(6040, 3706)

In [0]:
np.mean(Ratings.values, axis = 1)

array([0.05990286, 0.12924987, 0.05369671, ..., 0.02050729, 0.1287102 ,
       0.3291959 ])

In [0]:
R = Ratings.values
user_ratings_mean = np.mean(R, axis = 1)
Ratings_demeaned = R - user_ratings_mean.reshape(-1, 1)

In [0]:
user_ratings_mean.shape

(6040,)

In [0]:
Ratings_demeaned.shape

(6040, 3706)

In [0]:
Ratings_demeaned[0]

array([ 4.94009714, -0.05990286, -0.05990286, ..., -0.05990286,
       -0.05990286, -0.05990286])

- With the ratings matrix properly formatted and normalized, we can do some dimensionality reduction.
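Note that the cells above average over every column, including the zeros that stand for "not rated", which is why the per-user means are so small (about 0.06 for the first user). A common variant, which is not what this notebook does, averages only over the movies a user actually rated. A minimal sketch on a toy matrix, assuming every user has at least one rating:

```python
import numpy as np

# Toy ratings matrix; 0 marks "not rated", as in the pivot above.
R_toy = np.array([[5.0, 0.0, 3.0],
                  [0.0, 4.0, 0.0],
                  [1.0, 2.0, 0.0]])

rated = R_toy > 0

# Mean over all columns (the notebook's approach) vs. mean over rated entries only.
mean_all = R_toy.mean(axis=1)
mean_rated = R_toy.sum(axis=1) / rated.sum(axis=1)

# Subtract the per-user mean only where a rating exists; leave missing cells at 0.
demeaned_rated = np.where(rated, R_toy - mean_rated[:, None], 0.0)
print(mean_all)
print(mean_rated)
print(demeaned_rated)
```

Which variant works better depends on the data. The rest of the notebook keeps the all-columns mean.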

<a id=section6></a>
## 6. Setting Up SVD

Scipy and Numpy both have functions to do the singular value decomposition. We're going to use the Scipy function *svds* because it lets us choose how many latent factors we want to use to approximate the original ratings matrix (instead of having to truncate it afterwards).
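To see how the two functions relate, the sketch below runs both on a small random matrix (an assumption for illustration) and checks that the `k` values from `svds` match the `k` largest values from `np.linalg.svd`. The values are sorted before comparing, because the two functions do not return them in the same order.

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(1)
M = rng.normal(size=(30, 20))

k = 5
U_k, s_k, Vt_k = svds(M, k=k)                  # only k singular triplets
s_full = np.linalg.svd(M, compute_uv=False)    # all 20 singular values, descending

print(U_k.shape, s_k.shape, Vt_k.shape)
# Sorting removes any dependence on the order in which svds returns the values.
print(np.allclose(np.sort(s_k)[::-1], s_full[:k]))
```

`svds` requires `k` to be smaller than the smaller dimension of the matrix, which holds comfortably for `k = 50` on a 6040 x 3706 matrix.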

In [0]:
from scipy.sparse.linalg import svds
U, sigma, Vt = svds(Ratings_demeaned, k = 50)

In [0]:
U

array([[-4.97801875e-03,  5.86971868e-03, -1.18186843e-02, ...,
        -2.95139320e-03, -1.95703358e-03,  5.46889776e-03],
       [-1.10526375e-03, -4.04545890e-03,  1.16776791e-02, ...,
        -9.18855171e-04,  2.17034433e-03,  1.04359614e-02],
       [ 9.44963839e-03, -1.43519545e-02, -3.93638119e-04, ...,
         2.89764529e-03,  2.86504507e-03,  6.13985002e-03],
       ...,
       [-1.05731072e-02, -6.80807641e-03, -2.63392883e-03, ...,
         4.84286245e-05, -1.89077440e-03,  1.52456048e-03],
       [ 6.34420788e-03, -9.45269844e-03,  2.69929558e-03, ...,
         1.07208981e-02, -1.88878158e-02,  6.87143535e-03],
       [-1.84854679e-02,  1.28388950e-02,  8.46988257e-03, ...,
         1.89987575e-03, -4.15563933e-02,  1.92850979e-02]])

In [0]:
U.shape

(6040, 50)

In [0]:
sigma.shape

(50,)

In [0]:
sigma

array([ 147.18581225,  147.62154312,  148.58855276,  150.03171353,
        151.79983807,  153.96248652,  154.29956787,  154.54519202,
        156.1600638 ,  157.59909505,  158.55444246,  159.49830789,
        161.17474208,  161.91263179,  164.2500819 ,  166.36342107,
        166.65755956,  167.57534795,  169.76284423,  171.74044056,
        176.69147709,  179.09436104,  181.81118789,  184.17680849,
        186.29341046,  192.15335604,  192.56979125,  199.83346621,
        201.19198515,  209.67692339,  212.55518526,  215.46630906,
        221.6502159 ,  231.38108343,  239.08619469,  244.8772772 ,
        252.13622776,  256.26466285,  275.38648118,  287.89180228,
        315.0835415 ,  335.08085421,  345.17197178,  362.26793969,
        415.93557804,  434.97695433,  497.2191638 ,  574.46932602,
        670.41536276, 1544.10679346])

In [0]:
Vt

array([[-0.07028629,  0.02415349, -0.01883837, ...,  0.00380736,
        -0.00049127,  0.00061123],
       [ 0.03681506,  0.00346263, -0.01264234, ..., -0.00965995,
        -0.00513455, -0.02377963],
       [ 0.03495646,  0.00904907,  0.00823098, ...,  0.00157338,
        -0.00234513,  0.00802561],
       ...,
       [-0.03287652,  0.01185799, -0.01107445, ..., -0.00114772,
        -0.00294575, -0.02222119],
       [ 0.01776333,  0.03068092,  0.01786526, ..., -0.00087071,
        -0.0012666 , -0.00435186],
       [ 0.07625855,  0.01650222,  0.00468327, ..., -0.00852744,
        -0.01020778,  0.00425656]])

In [0]:
Vt.shape

(50, 3706)

As we're going to leverage matrix multiplication to get predictions, we'll convert $\Sigma$ (currently a 1-D array of singular values) to diagonal matrix form.

In [0]:
sigma = np.diag(sigma)

In [0]:
sigma

array([[ 147.18581225,    0.        ,    0.        , ...,    0.        ,
           0.        ,    0.        ],
       [   0.        ,  147.62154312,    0.        , ...,    0.        ,
           0.        ,    0.        ],
       [   0.        ,    0.        ,  148.58855276, ...,    0.        ,
           0.        ,    0.        ],
       ...,
       [   0.        ,    0.        ,    0.        , ...,  574.46932602,
           0.        ,    0.        ],
       [   0.        ,    0.        ,    0.        , ...,    0.        ,
         670.41536276,    0.        ],
       [   0.        ,    0.        ,    0.        , ...,    0.        ,
           0.        , 1544.10679346]])

In [0]:
sigma.shape

(50, 50)

<a id=section7></a>
## 7. Making Recommendations using SVD

Now, we have everything we need to make movie ratings predictions for every user. We can do it all at once by following the math and matrix multiply $U$, $\Sigma$, and $V^{T}$ back to get the rank $k=50$ approximation of $A$.

We also need to add the user means back to get the actual star rating predictions.

In [0]:
all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) + user_ratings_mean.reshape(-1, 1)
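How much is lost by keeping only $k$ factors? For a truncated SVD, the Frobenius norm of the approximation error equals the square root of the sum of the squared singular values that were dropped. The sketch below checks this on a small random matrix; the matrix and `k = 2` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(8, 6))

U_f, s_f, Vt_f = np.linalg.svd(X, full_matrices=False)

k = 2
# Rank-k approximation: keep the k largest singular triplets.
X_k = U_f[:, :k] @ np.diag(s_f[:k]) @ Vt_f[:k, :]

# Frobenius error vs. the root of the sum of squared discarded singular values.
error = np.linalg.norm(X - X_k, ord="fro")
expected = np.sqrt(np.sum(s_f[k:] ** 2))
print(np.linalg.matrix_rank(X_k), np.isclose(error, expected))
```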

With the predictions matrix for every user, we can build a function to recommend movies for any user. We return the list of movies the user has already rated, for the sake of comparison.

In [0]:
preds = pd.DataFrame(all_user_predicted_ratings, columns = Ratings.columns)
preds.head()

movie_id,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,...,3913,3914,3915,3916,3917,3918,3919,3920,3921,3922,3923,3924,3925,3926,3927,3928,3929,3930,3931,3932,3933,3934,3935,3936,3937,3938,3939,3940,3941,3942,3943,3944,3945,3946,3947,3948,3949,3950,3951,3952
0,4.288861,0.143055,-0.19508,-0.018843,0.012232,-0.176604,-0.07412,0.141358,-0.059553,-0.19595,0.512867,-0.089172,0.310181,-0.002005,-0.052401,-0.189827,0.23836,0.006466,-0.099315,-0.069682,-0.321492,0.111577,0.034795,0.320576,-0.118217,-0.012647,0.065573,-0.098318,0.064081,-0.005914,0.091936,0.180563,-0.009566,2.641693,-0.012495,0.765179,0.019784,0.002917,0.053079,0.014856,...,0.01881,-0.018782,0.022249,0.227852,-0.067653,-0.046039,-0.023574,-0.019405,-0.005116,-0.032921,-0.008259,-0.019157,0.007527,-0.008687,-0.02563,-0.013563,0.01524,-0.044665,-0.009568,-0.043549,-0.003131,-0.008221,-0.005948,0.031885,-0.003424,-0.001159,-0.002124,-0.002827,0.010393,-0.001068,0.027807,0.00164,0.026395,-0.022024,-0.085415,0.403529,0.105579,0.031912,0.05045,0.08891
1,0.744716,0.169659,0.335418,0.000758,0.022475,1.35305,0.051426,0.071258,0.161601,1.567246,0.772656,0.046179,-0.054562,0.042344,0.04839,0.347313,1.074905,-0.099782,0.008163,0.250869,2.186638,0.018789,-0.002199,0.218934,0.824475,0.139274,-0.007135,0.053071,-0.156952,0.044739,-0.00296,0.453298,-0.007484,0.920325,0.016566,1.335129,-0.015066,-0.045602,0.034649,0.12201,...,-0.042363,-0.137822,-0.112071,0.380783,-0.036273,-0.016174,0.00292,-0.148021,-0.017614,-0.033474,0.086133,0.008153,-0.126819,0.109208,0.001798,0.151866,0.014118,0.032897,0.005764,0.042259,0.022404,0.00326,0.010556,0.137181,-0.042184,0.006759,-0.005789,0.00034,0.002024,0.016013,-0.056502,-0.013733,-0.01058,0.062576,-0.016248,0.15579,-0.418737,-0.101102,-0.054098,-0.140188
2,1.818824,0.456136,0.090978,-0.043037,-0.025694,-0.158617,-0.131778,0.098977,0.030551,0.73547,-0.023476,0.034796,0.065942,0.008661,0.110348,-0.002952,-0.122061,0.063974,0.061033,0.081799,0.329471,0.149579,0.095352,-0.161493,0.022545,-0.009284,-0.002677,-0.14271,0.012345,-0.085331,0.076139,-0.355795,-0.008579,1.046871,-0.088946,0.383583,-0.018144,-0.038618,0.113984,0.006942,...,0.007233,-0.047221,0.066474,-0.179455,0.097428,0.034113,0.008098,-0.024784,-0.012749,-0.007394,-0.01722,0.004719,0.113348,-0.074943,-0.145795,0.128619,0.112567,0.0455,-0.018027,-0.058946,-0.00277,-0.035276,-0.008085,0.132182,-0.017005,0.014383,0.006598,-0.006217,-0.000342,0.000518,0.040481,-0.005301,0.012832,0.029349,0.020866,0.121532,0.076205,0.012345,0.015148,-0.109956
3,0.408057,-0.07296,0.039642,0.089363,0.04195,0.237753,-0.049426,0.009467,0.045469,-0.11137,-0.375831,0.068658,0.011199,0.069699,-0.037529,-0.238788,0.060607,-0.043418,0.053152,0.078237,0.357185,-0.096005,-0.028243,-0.067169,0.246164,-0.020379,0.034461,-0.022225,-0.012327,0.009182,0.01473,0.215893,-0.019687,-0.293933,-0.011511,0.145326,-0.029213,0.030029,-0.045409,-0.030684,...,-0.015077,-0.030208,0.028357,-0.072643,-0.135727,-0.053318,-0.012962,-0.054465,0.00587,-0.018048,-0.006836,-0.008222,-0.027214,-0.071677,-0.094072,-0.010745,-0.103191,-0.031297,-0.02392,-0.015053,-0.017914,-0.029561,-0.024299,-0.057678,-0.11145,-0.015473,-0.007123,-0.007416,-0.011508,-0.010038,0.008571,-0.005425,-0.0085,-0.003417,-0.083982,0.094512,0.057557,-0.02605,0.014841,-0.034224
4,1.574272,0.021239,-0.0513,0.246884,-0.032406,1.552281,-0.19963,-0.01492,-0.060498,0.450512,-0.251178,0.012337,-0.084051,0.258937,0.01657,0.980536,1.267869,0.275619,-0.008139,-0.038832,1.849627,0.107649,-0.168424,0.386541,1.790343,0.192379,-0.054356,0.267566,1.027817,0.374665,-0.010445,1.94798,0.017468,2.784035,0.274397,1.422393,0.040553,0.022926,1.3458,0.104507,...,0.075475,0.330767,0.15047,-0.261636,0.085163,-0.014229,-0.029247,0.124172,0.092875,0.061895,0.034757,0.054386,0.047055,0.048403,0.082926,0.129035,-0.174646,0.102727,0.024732,0.04728,0.017818,0.041451,0.041595,-0.007138,-0.080448,0.018639,0.034068,0.026941,0.035905,0.024459,0.110151,0.04601,0.006934,-0.01594,-0.05008,-0.052539,0.507189,0.03383,0.125706,0.199244


As a reminder, here are a few rows from the movies and ratings datasets.

In [0]:
movies.head(2)

Unnamed: 0,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy


In [0]:
ratings.head(2)

Unnamed: 0,user_id,movie_id,rating
0,1,1193,5
1,1,661,3


Now, we write a function to return the movies with the highest predicted rating that the specified user hasn't already rated. Though we didn't use any explicit movie content features (such as genre or title), we'll merge in that information to get a more complete picture of the recommendations.

In [0]:
userID=3
user_row_number = userID - 1 # User ID starts at 1, not 0
sorted_user_predictions = preds.iloc[user_row_number].sort_values(ascending=False)
sorted_user_predictions

movie_id
1198    4.382961
1197    4.124136
260     3.703313
1196    3.658492
1210    3.242291
          ...   
858    -0.609292
2662   -0.664298
1301   -0.686389
1221   -0.721101
1253   -0.721257
Name: 2, Length: 3706, dtype: float64

In [0]:
user_data = ratings[ratings.user_id == (userID)]
user_data

Unnamed: 0,user_id,movie_id,rating
182,3,3421,4
183,3,1641,2
184,3,648,3
185,3,1394,4
186,3,3534,3
187,3,104,4
188,3,2735,4
189,3,1210,4
190,3,1431,3
191,3,3868,3


In [0]:
user_full = (user_data.merge(movies, how = 'left', left_on = 'movie_id', right_on = 'movie_id').
                     sort_values(['rating'], ascending=False)
                 )
user_full

Unnamed: 0,user_id,movie_id,rating,title,genres
38,3,1304,5,Butch Cassidy and the Sundance Kid (1969),Action|Comedy|Western
12,3,1615,5,"Edge, The (1997)",Adventure|Thriller
31,3,2355,5,"Bug's Life, A (1998)",Animation|Children's|Comedy
32,3,1197,5,"Princess Bride, The (1987)",Action|Adventure|Comedy|Romance
33,3,1198,5,Raiders of the Lost Ark (1981),Action|Adventure
19,3,260,5,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Fantasy|Sci-Fi
34,3,1378,5,Young Guns (1988),Action|Comedy|Western
16,3,2167,5,Blade (1998),Action|Adventure|Horror
37,3,3552,5,Caddyshack (1980),Comedy
14,3,1259,5,Stand by Me (1986),Adventure|Comedy|Drama


In [0]:
user_full.shape

(51, 5)

In [0]:
movies[~movies['movie_id'].isin(user_full['movie_id'])]

Unnamed: 0,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
3878,3948,Meet the Parents (2000),Comedy
3879,3949,Requiem for a Dream (2000),Drama
3880,3950,Tigerland (2000),Drama
3881,3951,Two Family House (2000),Drama


In [0]:
num_recommendations = 20
recommendations = (movies[~movies['movie_id'].isin(user_full['movie_id'])].
         merge(pd.DataFrame(sorted_user_predictions).reset_index(), how = 'left',
               left_on = 'movie_id',
               right_on = 'movie_id').
         rename(columns = {user_row_number: 'Predictions'}).
         sort_values('Predictions', ascending = False).
                       iloc[:num_recommendations, :-1]
                      )
recommendations

Unnamed: 0,movie_id,title,genres
2807,2918,Ferris Bueller's Day Off (1986),Comedy
2682,2791,Airplane! (1980),Comedy
2520,2628,Star Wars: Episode I - The Phantom Menace (1999),Action|Adventure|Fantasy|Sci-Fi
898,919,"Wizard of Oz, The (1939)",Adventure|Children's|Drama|Musical
0,1,Toy Story (1995),Animation|Children's|Comedy
2695,2804,"Christmas Story, A (1983)",Comedy|Drama
107,110,Braveheart (1995),Action|Drama|War
2608,2716,Ghostbusters (1984),Comedy|Horror
1264,1307,When Harry Met Sally... (1989),Comedy|Romance
2290,2396,Shakespeare in Love (1998),Comedy|Romance


In [0]:
def recommend_movies(predictions, userID, movies, original_ratings, num_recommendations):
    
    # Get and sort the user's predictions
    user_row_number = userID - 1 # User ID starts at 1, not 0
    sorted_user_predictions = predictions.iloc[user_row_number].sort_values(ascending=False)
    
    # Get the user's data and merge in the movie information.
    user_data = original_ratings[original_ratings.user_id == (userID)]
    user_full = (user_data.merge(movies, how = 'left', left_on = 'movie_id', right_on = 'movie_id').
                     sort_values(['rating'], ascending=False)
                 )

    print('User {0} has already rated {1} movies.'.format(userID, user_full.shape[0]))
    print('Recommending highest {0} predicted ratings movies not already rated.'.format(num_recommendations))
    
    # Recommend the highest predicted rating movies that the user hasn't seen yet.
    recommendations = (movies[~movies['movie_id'].isin(user_full['movie_id'])].
         merge(pd.DataFrame(sorted_user_predictions).reset_index(), how = 'left',
               left_on = 'movie_id',
               right_on = 'movie_id').
         rename(columns = {user_row_number: 'Predictions'}).
         sort_values('Predictions', ascending = False).
                       iloc[:num_recommendations, :-1]
                      )

    return user_full, recommendations
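The core of the function, "take the highest predicted scores among movies the user has not rated", can also be expressed with a boolean mask. This is a stripped-down sketch on made-up scores for a single user, not a replacement for the merge-based function above. The names `pred_scores` and `already_rated_mask` are hypothetical.

```python
import numpy as np

# Toy predicted scores for one user over 6 movies, and which ones were already rated.
pred_scores = np.array([4.2, 0.3, 3.9, 1.1, 4.8, 2.5])
already_rated_mask = np.array([True, False, False, False, True, False])

# Push rated movies to the bottom, then take the indices of the top-N scores.
masked = np.where(already_rated_mask, -np.inf, pred_scores)
top_n = 3
top_idx = np.argsort(masked)[::-1][:top_n]
print(top_idx.tolist())
```

The two highest raw scores (positions 0 and 4) are skipped because the user has already rated those movies.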

Let's try to recommend 20 movies for user with ID 1310.

In [0]:
already_rated, predictions = recommend_movies(preds, 1310, movies, ratings, 20)

User 1310 has already rated 24 movies.
Recommending highest 20 predicted ratings movies not already rated.


In [0]:
# Top 10 movies that User 1310 has rated 
already_rated.head(10)

Unnamed: 0,user_id,movie_id,rating,title,genres
5,1310,2248,5,Say Anything... (1989),Comedy|Drama|Romance
6,1310,2620,5,This Is My Father (1998),Drama|Romance
7,1310,3683,5,Blood Simple (1984),Drama|Film-Noir
15,1310,1704,5,Good Will Hunting (1997),Drama
1,1310,1293,5,Gandhi (1982),Drama
12,1310,3101,4,Fatal Attraction (1987),Thriller
11,1310,1343,4,Cape Fear (1991),Thriller
20,1310,2000,4,Lethal Weapon (1987),Action|Comedy|Crime|Drama
18,1310,3526,4,Parenthood (1989),Comedy|Drama
17,1310,3360,4,Hoosiers (1986),Drama


In [0]:
# Top 20 movies that User 1310 hopefully will enjoy
predictions

Unnamed: 0,movie_id,title,genres
1618,1674,Witness (1985),Drama|Romance|Thriller
1880,1961,Rain Man (1988),Drama
1187,1210,Star Wars: Episode VI - Return of the Jedi (1983),Action|Adventure|Romance|Sci-Fi|War
1216,1242,Glory (1989),Action|Drama|War
1202,1225,Amadeus (1984),Drama
1273,1302,Field of Dreams (1989),Drama
1220,1246,Dead Poets Society (1989),Drama
1881,1962,Driving Miss Daisy (1989),Drama
1877,1957,Chariots of Fire (1981),Drama
1938,2020,Dangerous Liaisons (1988),Drama|Romance


- It's good to see that, although we didn't actually use the genre of the movie as a feature, the truncated matrix factorization features "picked up" on the underlying tastes and preferences of the user. 


- We've recommended some comedy, drama, and romance movies - all of which were genres of some of this user's top rated movies.

<a id=section8></a>
## 8. Model Evaluation

We will use the [Surprise](https://pypi.python.org/pypi/scikit-surprise) library, a Python scikit for building and analyzing recommender systems. It provides various ready-to-use prediction algorithms, including SVD, and we will use it to evaluate the **RMSE (Root Mean Squared Error)** of SVD on the MovieLens dataset.
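RMSE is the square root of the average squared difference between the true and the predicted ratings, so larger errors are penalized more heavily. A minimal hand-rolled version, with made-up ratings and a hypothetical helper name `rmse`:

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Hypothetical true ratings and model estimates.
print(rmse([5, 3, 4, 1], [4, 3, 5, 3]))
```

The errors here are 1, 0, -1 and -2, so the mean squared error is 1.5 and the RMSE is its square root.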

In [0]:
!pip install scikit-surprise


In [0]:
# Import the pieces of Surprise we need
from surprise import Reader, Dataset, SVD

# Reader with the default rating scale (1 to 5), which matches these ratings
reader = Reader()

# Wrap the ratings DataFrame (columns in user, item, rating order) in a Surprise Dataset
data = Dataset.load_from_df(ratings[['user_id', 'movie_id', 'rating']], reader)

# The 5-fold split is handled by cross_validate in a later cell

In [0]:
# from sklearn import model_selection
from surprise.model_selection import cross_validate

In [0]:
# Use the SVD algorithm.
svd = SVD()

# Compute the RMSE of the SVD algorithm.
cross_validate(svd, data, measures=['RMSE'], cv=5, verbose=True)

Evaluating RMSE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8728  0.8754  0.8745  0.8742  0.8731  0.8740  0.0009  
Fit time          50.25   51.19   51.15   55.17   51.12   51.78   1.73    
Test time         2.82    2.77    2.35    2.74    2.71    2.68    0.17    


{'fit_time': (50.25301694869995,
  51.18598914146423,
  51.150800466537476,
  55.173394203186035,
  51.12203049659729),
 'test_rmse': array([0.87280177, 0.87541449, 0.87450958, 0.87418007, 0.87312575]),
 'test_time': (2.81990909576416,
  2.767259359359741,
  2.348268508911133,
  2.742522954940796,
  2.7103917598724365)}

- The mean RMSE across the 5 folds is 0.8740, so on a 1 to 5 rating scale the predictions are typically off by less than one star, which is pretty good.


- Now train on the dataset and arrive at predictions.

In [0]:
trainset = data.build_full_trainset()
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7fb9cbd76518>

We'll again pick the user with ID 1310 and check the ratings they have given.

In [0]:
ratings[ratings['user_id'] == 1310][:5]

Unnamed: 0,user_id,movie_id,rating
215928,1310,2988,3
215929,1310,1293,5
215930,1310,1295,2
215931,1310,1299,4
215932,1310,2243,4


Now let's use SVD to predict the rating that the user with ID 1310 would give to a movie, say the one with Movie ID 2988, which this user actually rated 3.

In [0]:
svd.predict(1310, 2988)

Prediction(uid=1310, iid=2988, r_ui=None, est=3.3680773794422, details={'was_impossible': False})

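For the movie with ID 2988, we get an estimated rating of about 3.368, close to the rating of 3 that this user actually gave. Keep in mind that the model was fitted on the full dataset, so this rating was part of the training data. The recommender system works purely on the basis of the assigned user and movie IDs and predicts ratings based on how other users have rated the movie.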
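As far as the Surprise documentation describes it, its `SVD` class is not a plain truncated SVD like the one in sections 6 and 7. It is a biased matrix factorization fitted by stochastic gradient descent, with estimates of the form $\hat{r}_{ui} = \mu + b_u + b_i + q_i^{T} p_u$, clipped to the rating scale. The sketch below evaluates that formula by hand. Every number in it is made up, and `predict_rating` is a hypothetical helper, not a Surprise function.

```python
import numpy as np

def predict_rating(mu, b_u, b_i, p_u, q_i, low=1.0, high=5.0):
    """Biased matrix-factorization estimate, clipped to the rating scale."""
    raw = mu + b_u + b_i + float(np.dot(q_i, p_u))
    return min(high, max(low, raw))

# All numbers below are made up for illustration; they are not fitted values.
mu = 3.58                          # global mean rating
b_u, b_i = -0.20, 0.10             # user and item biases
p_u = np.array([0.3, -0.1])        # user factors
q_i = np.array([0.5, 0.4])         # item factors

print(round(predict_rating(mu, b_u, b_i, p_u, q_i), 3))
```

The bias terms explain why the model can still produce a sensible estimate for a user or movie with very few ratings: the estimate falls back towards the global mean.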

<a id=section9></a>
## 9. Conclusion

In this notebook, we built a movie recommendation system based on latent features from a low-rank matrix factorization method called SVD. Because it captures the underlying features driving the raw data, it can scale significantly better to massive datasets and make better recommendations based on a user's tastes.

However, we likely still lose some meaningful signal by using a low-rank approximation. There is also an interpretability problem, as each singular vector is a linear combination of all input columns or rows, and a lack of sparsity, as the singular vectors are usually quite dense. Finally, the SVD approach is limited to linear projections.

In [0]:
import pandas as pd
# on_bad_lines='skip' replaces error_bad_lines=False, which newer pandas versions no longer accept
user = pd.read_csv('/content/drive/My Drive/Insaid Notes/ML3/BX-CSV-Dump/BX-Users.csv', sep=';', on_bad_lines='skip', encoding="latin-1")
user.columns = ['userID', 'Location', 'Age']
rating = pd.read_csv('/content/drive/My Drive/Insaid Notes/ML3/BX-CSV-Dump/BX-Book-Ratings.csv', sep=';', on_bad_lines='skip', encoding="latin-1")
rating.columns = ['userID', 'ISBN', 'bookRating']
df = pd.merge(user, rating, on='userID', how='inner')
df.drop(['Location', 'Age'], axis=1, inplace=True)
df.head()

Unnamed: 0,userID,ISBN,bookRating
0,2,195153448,0
1,7,34542252,0
2,8,2005018,5
3,8,60973129,0
4,8,374157065,0


In [0]:
min_book_ratings = 50
filter_books = df['ISBN'].value_counts() > min_book_ratings
filter_books = filter_books[filter_books].index.tolist()

min_user_ratings = 50
filter_users = df['userID'].value_counts() > min_user_ratings
filter_users = filter_users[filter_users].index.tolist()

df_new = df[(df['ISBN'].isin(filter_books)) & (df['userID'].isin(filter_users))]
print('The original data frame shape:\t{}'.format(df.shape))
print('The new data frame shape:\t{}'.format(df_new.shape))

The original data frame shape:	(1149780, 3)
The new data frame shape:	(140516, 3)


In [0]:
# Import the algorithms to benchmark from the Surprise package
from surprise import Reader, Dataset, SVD, SVDpp, SlopeOne, NMF, NormalPredictor, KNNBaseline, KNNBasic, KNNWithMeans, KNNWithZScore, BaselineOnly, CoClustering
from surprise.model_selection import cross_validate

# Reader with an explicit rating scale for the book ratings
# ((0, 9) is kept from the original run; check it against the actual range of bookRating)
reader = Reader(rating_scale=(0, 9))

data = Dataset.load_from_df(df_new[['userID', 'ISBN', 'bookRating']], reader)

In [0]:
benchmark = []
# Iterate over all algorithms
for algorithm in [SVD(), SVDpp(), SlopeOne(), NormalPredictor(), KNNBaseline(), KNNBasic(), KNNWithMeans(), KNNWithZScore(), BaselineOnly(), CoClustering()]:
    # Perform cross validation
    results = cross_validate(algorithm, data, measures=['RMSE'], cv=3, verbose=False)
    
    # Get results & append algorithm name
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    # Series.append was removed in pandas 2.0; pd.concat does the same job
    tmp = pd.concat([tmp, pd.Series([type(algorithm).__name__], index=['Algorithm'])])
    benchmark.append(tmp)
    
pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse')  

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...


Algorithm,test_rmse,fit_time,test_time
BaselineOnly,3.377431,0.255715,0.43658
CoClustering,3.464651,2.31177,0.50203
SlopeOne,3.479234,1.033143,4.728151
KNNWithMeans,3.486449,0.896143,5.96635
KNNBaseline,3.497189,1.118566,7.187921
KNNWithZScore,3.508383,0.923181,6.386557
SVD,3.543866,5.656676,0.489534
KNNBasic,3.730869,0.856467,5.714518
SVDpp,3.784652,133.97609,6.498695
NormalPredictor,4.671169,0.164197,0.508767


In [0]:
print('Using ALS')
bsl_options = {'method': 'als',
               'n_epochs': 5,
               'reg_u': 12,
               'reg_i': 5
               }
algo = BaselineOnly(bsl_options=bsl_options)
cv = cross_validate(algo, data, measures=['RMSE'], cv=3, verbose=False)


Using ALS
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...


{'fit_time': (0.12901949882507324, 0.18054628372192383, 0.1624438762664795),
 'test_rmse': array([3.36719391, 3.38312107, 3.36877497]),
 'test_time': (0.41590285301208496, 0.3586757183074951, 0.3678450584411621)}
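`BaselineOnly` predicts $\mu + b_u + b_i$, i.e. the global mean plus a user bias and an item bias. With the ALS method, the biases are estimated by alternating between the two sets while holding the other fixed, and `reg_u` / `reg_i` shrink the biases of users and items with few ratings. The following is a simplified sketch of that idea on toy data, written from scratch. It is an assumption about the general scheme, not Surprise's actual implementation, and `als_baselines` is a hypothetical name.

```python
import numpy as np

def als_baselines(ratings, n_users, n_items, n_epochs=5, reg_u=12, reg_i=5):
    """Estimate user and item biases by alternating regularized averages.

    ratings: list of (user_index, item_index, rating) triples.
    """
    mu = np.mean([r for _, _, r in ratings])
    b_u = np.zeros(n_users)
    b_i = np.zeros(n_items)
    for _ in range(n_epochs):
        # Update item biases with user biases held fixed.
        dev_i, cnt_i = np.zeros(n_items), np.zeros(n_items)
        for u, i, r in ratings:
            dev_i[i] += r - mu - b_u[u]
            cnt_i[i] += 1
        b_i = dev_i / (reg_i + cnt_i)
        # Update user biases with item biases held fixed.
        dev_u, cnt_u = np.zeros(n_users), np.zeros(n_users)
        for u, i, r in ratings:
            dev_u[u] += r - mu - b_i[i]
            cnt_u[u] += 1
        b_u = dev_u / (reg_u + cnt_u)
    return mu, b_u, b_i

# Toy data: user 0 rates generously, user 1 rates harshly.
toy_ratings = [(0, 0, 5), (0, 1, 4), (1, 0, 2), (1, 1, 1), (2, 1, 3)]
mu, b_u, b_i = als_baselines(toy_ratings, n_users=3, n_items=2)
print(mu, b_u, b_i)
```

The generous user ends up with a positive bias and the harsh user with a negative one, both pulled towards zero by the regularization.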

In [0]:
from surprise import accuracy
from surprise.model_selection import train_test_split

# Use Surprise's own train_test_split: scikit-learn's version does not accept a Surprise Dataset
trainset, testset = train_test_split(data, test_size=0.25)

# Fit the ALS baseline on the training split and score it on the held-out split
algo = BaselineOnly(bsl_options=bsl_options)
predictions = algo.fit(trainset).test(testset)
accuracy.rmse(predictions)

In [0]:
def get_Iu(uid):
    """ return the number of items rated by given user
    args: 
      uid: the id of the user
    returns: 
      the number of items rated by the user
    """
    try:
        return len(trainset.ur[trainset.to_inner_uid(uid)])
    except ValueError: # user was not part of the trainset
        return 0
    
def get_Ui(iid):
    """ return number of users that have rated given item
    args:
      iid: the raw id of the item
    returns:
      the number of users that have rated the item.
    """
    try: 
        return len(trainset.ir[trainset.to_inner_iid(iid)])
    except ValueError:
        return 0
    
df = pd.DataFrame(predictions, columns=['uid', 'iid', 'rui', 'est', 'details'])
df['Iu'] = df.uid.apply(get_Iu)
df['Ui'] = df.iid.apply(get_Ui)
df['err'] = abs(df.est - df.rui)
best_predictions = df.sort_values(by='err')[:10]
worst_predictions = df.sort_values(by='err')[-10:]