## Model 4 - Singular Value Decomposition

Collaborative filtering is the most sophisticated and useful of our methods, because it learns from rating behaviour rather than from hand-picked features. To make sure the system provides the best recommendations, we build a second collaborative filtering model, this time based on Singular Value Decomposition (SVD).

In theory this model is similar to the Keras-based model: movies are scored using user-user filtering. It also addresses two shortcomings of the deep learning model:

<pre>
        i. With the Keras model, computation gets more and more expensive as the number of users and movies goes up, so the model doesn't scale well. Matrix factorization lets us scale up.
        ii. The user rating matrix is sparse, which limits the operations we can usefully apply to it. To overcome this, this method decomposes the matrix into 3 denser matrices.
</pre>
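
To make the sparsity point concrete, here is a minimal sketch that measures the fraction of unrated entries in a small made-up matrix. The values and the helper name `sparsity` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

# Toy user x movie matrix; 0 stands for "not rated" (illustrative values only).
toy_ratings = np.array([
    [4.0, 0.0, 0.0, 5.0],
    [0.0, 0.0, 3.0, 0.0],
    [0.0, 2.0, 0.0, 0.0],
])

def sparsity(matrix):
    """Return the fraction of entries that are zero (unrated)."""
    return 1.0 - np.count_nonzero(matrix) / matrix.size

print(f"sparsity: {sparsity(toy_ratings):.2f}")
```

The same helper should work on the notebook's rating matrix once it is built, for example `sparsity(um_rating.to_numpy())`.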

### How does SVD work?

Using SVD, we are able to find the latent features in our data. 
<br><br>
For example, let's say user A likes the movies Harry Potter and the Chamber of Secrets, Toy Story 2, and Charlie and the Chocolate Factory. 
<br>There is a common thread here that is not immediately evident: the user seems to enjoy children's movies a lot. 
Our model would in turn recommend Cars and Trolls. 
Even though we didn't specify the genre of the movie as a feature, the algorithm picks up on important latent features and uses them for recommendation.

Breaking down the math behind the algorithm:
<pre>
                A = U x Sigma x V^T
              
            where A is the matrix containing the user ratings
                  U is the user-feature matrix, i.e. how much a user likes a particular feature (comedy, thriller)
                  Sigma is the diagonal matrix that contains the weight/strength of each of these features
                  V^T is the transposed movie-feature matrix, i.e. how much each of these features applies to a movie
</pre>
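
The factorization can be checked on a tiny example. The sketch below uses `np.linalg.svd` (the dense NumPy routine, not the `svds` call used later in this notebook) on a made-up 3 x 4 matrix: keeping every singular value reproduces the matrix, while keeping only the `k` largest gives a low-rank approximation whose Frobenius error equals the norm of the discarded singular values.

```python
import numpy as np

# Made-up ratings: users 0 and 1 share tastes, user 2 differs.
A = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [0.0, 1.0, 5.0, 4.0],
])

# Compact SVD: U is 3x3, s holds 3 singular values (largest first), Vt is 3x4.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Using every singular value reproduces A up to floating-point error.
A_full = U @ np.diag(s) @ Vt

# Keeping only the k largest singular values gives a rank-k approximation.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print("exact reconstruction:", np.allclose(A, A_full))
print("rank-2 error:", np.linalg.norm(A - A_k))
print("discarded singular value:", s[k])
```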

The steps involved in creating an SVD model :
<pre>
    i. Create a matrix of user ratings
    ii. Calculate U, Sigma, V from the user rating matrix
    iii. Multiply the matrices together to recreate a modified version of the rating matrix
    iv. For a user, sort the new matrix and find the top 10 movies
</pre>
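
The four steps above can be sketched as a single function. This is only an outline under stated assumptions: the function name `top_n_from_svd`, its parameters, and the toy ratings table are hypothetical and exist purely for illustration; the column names `userId`, `movieId` and `rating` match the ones used in this notebook.

```python
import numpy as np
import pandas as pd
from scipy.sparse.linalg import svds

def top_n_from_svd(ratings_long, user_id, k=2, n=3):
    """Run steps i-iv on a long-format table with userId, movieId, rating."""
    # i. user x movie matrix; unrated entries become 0
    matrix = ratings_long.pivot(index='userId', columns='movieId',
                                values='rating').fillna(0)
    # ii. truncated SVD; the third return value is already transposed
    U, sigma, Vt = svds(matrix.to_numpy(dtype=float), k=k)
    # iii. low-rank reconstruction of the rating matrix
    pred = pd.DataFrame(U @ np.diag(sigma) @ Vt,
                        index=matrix.index, columns=matrix.columns)
    # iv. highest predicted scores for the requested user
    return pred.loc[user_id].sort_values(ascending=False).head(n)

# Toy data (hypothetical): 4 users, 5 movies.
toy = pd.DataFrame({
    'userId':  [1, 1, 1, 2, 2, 3, 3, 3, 4, 4],
    'movieId': [10, 20, 30, 10, 20, 30, 40, 50, 40, 50],
    'rating':  [5.0, 4.0, 1.0, 4.0, 5.0, 1.0, 5.0, 4.0, 4.0, 5.0],
})
result = top_n_from_svd(toy, user_id=1)
print(result)
```

Note that `svds` requires `k` to be smaller than both dimensions of the matrix, which is why the toy example uses `k=2`.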

### Creating user ratings matrix

In [1]:
## Import
import pandas as pd
import numpy as np
from scipy.sparse.linalg import svds

In [2]:
# ratings.csv provides userId, movieId and rating; movies.csv provides movieId and title
rating = pd.read_csv('Data/ratings.csv')
movies = pd.read_csv('Data/movies.csv')

In [3]:
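# Pivot to a user x movie matrix: one row per user, one column per movie
um_rating = rating.pivot(index='userId', columns='movieId', values='rating')
# Movies a user has not rated become 0
um_rating = um_rating.fillna(0)
um_rating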

userId \ movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
1,4.0,0.0,4.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606,2.5,0.0,0.0,0.0,0.0,0.0,2.5,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
607,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
608,2.5,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
609,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Calculate U, Sigma, V from the user rating matrix

In [4]:
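# Keep the k = 20 largest singular values. Note that the third value returned
# by svds is already transposed (V^T), even though it is named V here.
U, Sigma, V = svds(um_rating.to_numpy(), k=20)
# svds returns the singular values as a 1-D array; turn them into a diagonal matrix
s_diag_matrix = np.diag(Sigma)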
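
As a quick sanity check on what `svds` hands back, the sketch below runs it on a small random stand-in matrix (the matrix `demo` and its size are assumptions for illustration). It prints the shapes of the three factors and then reorders them explicitly so the strongest factor comes first, rather than relying on any particular return order.

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
demo = rng.random((6, 8))          # stand-in for the user x movie matrix

U, sigma, Vt = svds(demo, k=3)
print(U.shape, sigma.shape, Vt.shape)   # (6, 3) (3,) (3, 8)

# Sort the factors by singular value, largest first.
order = np.argsort(sigma)[::-1]
U, sigma, Vt = U[:, order], sigma[order], Vt[order, :]

# The product is unchanged by the reordering and has the original shape.
approx = U @ np.diag(sigma) @ Vt
print(approx.shape)
```

Reordering matters only when the individual factors are inspected; the reconstructed matrix is the same either way.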

### Multiply the matrices together to recreate a modified version of the rating matrix

In [5]:
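# V already holds V^T as returned by svds, so the product has shape (n_users, n_movies)
X_pred = np.dot(np.dot(U, s_diag_matrix), V)
# Rows keep a default 0-based index, which is why they are looked up by position (iloc) below
X_pred_df = pd.DataFrame(X_pred)
# Label the columns with the original movieIds
X_pred_df.columns = um_rating.columns.values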

In [16]:
X_pred_df

,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
0,2.290336,1.460203,1.033507,-0.061334,-0.002275,1.243261,0.029650,0.056161,0.036220,1.442856,...,-0.008584,-0.007358,-0.009810,-0.009810,-0.008584,-0.009810,-0.008584,-0.008584,-0.008584,-0.038606
1,0.038570,0.015272,0.016968,0.002944,0.019201,-0.005821,-0.025436,0.000918,0.010531,-0.117149,...,0.010662,0.009139,0.012186,0.012186,0.010662,0.012186,0.010662,0.010662,0.010662,0.015610
2,-0.015220,0.049067,0.047202,-0.004936,-0.035349,0.052758,-0.012911,0.010422,-0.002532,-0.014094,...,0.000029,0.000025,0.000033,0.000033,0.000029,0.000033,0.000029,0.000029,0.000029,-0.002412
3,2.238621,0.060011,0.039384,0.066455,0.221806,0.487591,0.318594,-0.057422,0.016371,0.234273,...,0.002029,0.001739,0.002319,0.002319,0.002029,0.002319,0.002029,0.002029,0.002029,-0.007359
4,1.358363,0.970071,0.340939,0.121053,0.479936,0.628346,0.504583,0.136293,0.040721,1.122003,...,0.000348,0.000299,0.000398,0.000398,0.000348,0.000398,0.000348,0.000348,0.000348,0.001611
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
605,-0.617336,0.556016,-0.374855,0.162583,-0.155438,-1.403045,2.364098,-0.205127,-0.444244,0.380738,...,-0.046865,-0.040170,-0.053560,-0.053560,-0.046865,-0.053560,-0.046865,-0.046865,-0.046865,-0.077927
606,2.056401,1.216670,0.593186,-0.006625,-0.020369,1.678307,0.261799,0.060570,0.025766,1.289120,...,-0.012653,-0.010845,-0.014460,-0.014460,-0.012653,-0.014460,-0.012653,-0.012653,-0.012653,-0.030033
607,2.369716,1.838958,1.577564,-0.131902,0.362084,3.628608,0.248347,0.278704,0.125466,3.895638,...,-0.043875,-0.037607,-0.050143,-0.050143,-0.043875,-0.050143,-0.043875,-0.043875,-0.043875,0.005026
608,0.809741,0.651456,0.297184,0.081167,0.334388,0.577311,0.362697,0.091491,0.067186,0.940384,...,0.000254,0.000217,0.000290,0.000290,0.000254,0.000290,0.000254,0.000254,0.000254,0.001664


### For a user, sort the new matrix and find the top 10 movies

In [7]:
# Enter User ID here
userId = 10

In [8]:
# movieIds of the user's five highest-rated movies; .loc looks the user up by userId label
user_fav_movies = um_rating.loc[userId].sort_values(ascending=False).head(5).index

In [9]:
# Print the titles of the user's five highest-rated movies
for movie_id in user_fav_movies:
    print(movies.loc[movies['movieId'] == movie_id].title)

6352    Holiday, The (2006)
Name: title, dtype: object
7466    King's Speech, The (2010)
Name: title, dtype: object
4948    Troy (2004)
Name: title, dtype: object
7371    Despicable Me (2010)
Name: title, dtype: object
7156    Education, An (2009)
Name: title, dtype: object


In [14]:
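# movieIds with the ten highest predicted scores for this user.
# This ranking does not exclude movies the user has already rated.
user_fav_pred = X_pred_df.iloc[userId-1].sort_values(ascending=False).head(10).index
user_fav_pred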

Int64Index([4306, 7153, 6377, 68954, 58559, 356, 79132, 5952, 4993, 6539], dtype='int64')
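
A ranked list of movieIds like this one can be turned into titles in a single lookup that preserves the ranking order. The sketch below uses a made-up `toy_movies` table and `ranked_ids` list (both hypothetical); with the notebook's data the same pattern would use `movies` and `user_fav_pred`.

```python
import pandas as pd

# Hypothetical stand-in for the movies table (movieId, title).
toy_movies = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['A (2000)', 'B (2001)', 'C (2002)'],
})
ranked_ids = [3, 1]   # best prediction first

# Index by movieId, then select the ids in ranked order.
titles = toy_movies.set_index('movieId').loc[ranked_ids, 'title'].tolist()
print(titles)
```

This lookup raises a `KeyError` if a ranked id is missing from the movies table, which is a useful check that the two files agree.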

In [24]:
# The user's actual ratings, highest first (0 means "not rated")
user_rating = um_rating.loc[userId].sort_values(ascending=False)
user_rating

movieId
49286    5.0
81845    5.0
7458     5.0
79091    5.0
71579    5.0
        ... 
52375    0.0
52328    0.0
52319    0.0
52299    0.0
1        0.0
Name: 10, Length: 9724, dtype: float64

In [30]:
user_predicted_rating = X_pred_df.iloc[userId-1]
user_predicted_rating

1         1.214372
2         0.378873
3        -0.052351
4         0.015189
5         0.251362
            ...   
193581    0.014074
193583    0.012315
193585    0.012315
193587    0.012315
193609    0.021499
Name: 9, Length: 9724, dtype: float64

In [12]:
# Print the titles of the top five predicted movies
for movie_id in user_fav_pred[:5]:
    print(movies.loc[movies['movieId'] == movie_id].title)

3194    Shrek (2001)
Name: title, dtype: object
4800    Lord of the Rings: The Return of the King, The...
Name: title, dtype: object
4360    Finding Nemo (2003)
Name: title, dtype: object
7039    Up (2009)
Name: title, dtype: object
6710    Dark Knight, The (2008)
Name: title, dtype: object
