# SVD for Movie Recommendation System

Collaborative Filtering (CF), which relies on past user-item interaction data, is the most common approach to building recommendation systems. Two popular CF approaches are **latent factor models**, which extract features from the user-item matrix, and **neighborhood models**, which find similarities between items or between users.

In this notebook, I am going to use a **latent factor model**, **Singular Value Decomposition (SVD)**, to extract features and correlations from the user-item matrix.

SVD will let me apply **dimensionality reduction** to derive tastes and preferences from the raw data, a technique also known as low-rank matrix factorization. Why reduce dimensions?

1. I can discover hidden correlations / features in the raw data.
2. I can remove redundant and noisy features that are not useful.
3. I can interpret and visualize the data more easily.
4. I can store and process the data more cheaply.

With that goal, I'll be using Singular Value Decomposition (SVD), a powerful dimensionality reduction technique that is used heavily in modern model-based CF recommender systems.

Dataset: https://www.kaggle.com/datasets/odedgolden/movielens-1m-dataset

### Import necessary packages

In [2]:
import pandas as pd
import numpy as np
import os
########## display full outputs in Jupyter Notebook, not only the last command's output
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
from collections import Counter

# reference: https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas
pd.options.mode.chained_assignment = None  # default='warn'

# Set pandas view
pd.set_option('display.max_columns', 30)

### Loading the Dataset

In [3]:
movies = pd.read_csv('D:/DATASET/Movie Recommendation/movies.csv',encoding='unicode escape',sep='\t').drop('Unnamed: 0', axis=1)
movies

users = pd.read_csv('D:/DATASET/Movie Recommendation/users.csv',sep='\t').drop('Unnamed: 0', axis=1)
users

ratings = pd.read_csv('D:/DATASET/Movie Recommendation/ratings.csv',sep='\t').drop('Unnamed: 0', axis=1)
ratings

Unnamed: 0,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
3878,3948,Meet the Parents (2000),Comedy
3879,3949,Requiem for a Dream (2000),Drama
3880,3950,Tigerland (2000),Drama
3881,3951,Two Family House (2000),Drama


Unnamed: 0,user_id,gender,age,occupation,zipcode,age_desc,occ_desc
0,1,F,1,10,48067,Under 18,K-12 student
1,2,M,56,16,70072,56+,self-employed
2,3,M,25,15,55117,25-34,scientist
3,4,M,45,7,02460,45-49,executive/managerial
4,5,M,25,20,55455,25-34,writer
...,...,...,...,...,...,...,...
6035,6036,F,25,15,32603,25-34,scientist
6036,6037,F,45,1,76006,45-49,academic/educator
6037,6038,F,56,1,14706,56+,academic/educator
6038,6039,F,45,0,01060,45-49,other or not specified


Unnamed: 0,user_id,movie_id,rating,timestamp,user_emb_id,movie_emb_id
0,1,1193,5,978300760,0,1192
1,1,661,3,978302109,0,660
2,1,914,3,978301968,0,913
3,1,3408,4,978300275,0,3407
4,1,2355,5,978824291,0,2354
...,...,...,...,...,...,...
1000204,6040,1091,1,956716541,6039,1090
1000205,6040,1094,5,956704887,6039,1093
1000206,6040,562,5,956704746,6039,561
1000207,6040,1096,4,956715648,6039,1095


### Preprocessing & EDA

In [115]:
ratings['rating'].unique()

array([5, 3, 4, 2, 1], dtype=int64)

So users in this dataset rated movies on a scale from 1 to 5.

In [4]:
number_user = len(ratings['user_id'].unique())
print(f'We have {number_user} users')

number_movie = len(ratings['movie_id'].unique())
print(f'We have {number_movie} rated movies')

We have 6040 users
We have 3706 rated movies


In [117]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000209 entries, 0 to 1000208
Data columns (total 6 columns):
 #   Column        Non-Null Count    Dtype
---  ------        --------------    -----
 0   user_id       1000209 non-null  int64
 1   movie_id      1000209 non-null  int64
 2   rating        1000209 non-null  int64
 3   timestamp     1000209 non-null  int64
 4   user_emb_id   1000209 non-null  int64
 5   movie_emb_id  1000209 non-null  int64
dtypes: int64(6)
memory usage: 45.8 MB


In [5]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3883 entries, 0 to 3882
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   movie_id  3883 non-null   int64 
 1   title     3883 non-null   object
 2   genres    3883 non-null   object
dtypes: int64(1), object(2)
memory usage: 91.1+ KB


**These two datasets look clean: they have no null values, and all of their numeric columns are integers.**

Now I want the format of my ratings matrix to be one row per user and one column per movie. To do so, I'll pivot ratings to get that and call the new variable **ratings_pivot**.

In [6]:
ratings_pivot = ratings.pivot(index='user_id', columns='movie_id', values='rating')
ratings_pivot

movie_id,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,...,3938,3939,3940,3941,3942,3943,3944,3945,3946,3947,3948,3949,3950,3951,3952
user_id,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,5.0,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,
2,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,
3,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,
4,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,
5,,,,,,2.0,,,,,,,,,,...,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6036,,,,2.0,,3.0,,,,,3.0,,,,,...,,,,,,,,,,,,,,,
6037,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,
6038,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,
6039,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,


A NaN value means that the user hasn't rated that movie, so I will replace NaN values with 0 for the later analysis.

In [7]:
ratings_pivot = ratings_pivot.fillna(0)
ratings_pivot

movie_id,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,...,3938,3939,3940,3941,3942,3943,3944,3945,3946,3947,3948,3949,3950,3951,3952
user_id,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6036,0.0,0.0,0.0,2.0,0.0,3.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6037,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6038,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6039,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
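The pivot-and-fill steps above can be sketched on a tiny table to make the reshaping easier to see. The data and the names `toy_ratings`, `toy_pivot`, and `toy_filled` are made up for illustration and are not part of the MovieLens dataset.

```python
import pandas as pd

# Toy long-format ratings (hypothetical data, not from MovieLens)
toy_ratings = pd.DataFrame({
    "user_id":  [1, 1, 2, 3, 3],
    "movie_id": [10, 20, 20, 10, 30],
    "rating":   [5, 3, 4, 2, 1],
})

# One row per user, one column per movie; unrated pairs become NaN
toy_pivot = toy_ratings.pivot(index="user_id", columns="movie_id", values="rating")

# Replace NaN (unrated) with 0, as in the notebook
toy_filled = toy_pivot.fillna(0)
print(toy_filled)
```

Note that `pivot` raises an error if the same (user, movie) pair appears more than once, so this sketch assumes each user rated each movie at most once.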


### Normalize the data (subtract each user's mean) and convert it from a DataFrame to a NumPy array

In [12]:
ratings_matrix = ratings_pivot.values
ratings_matrix.shape
ratings_matrix

(6040, 3706)

array([[5., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [3., 0., 0., ..., 0., 0., 0.]])

### Mean

In [13]:
ratings_mean = np.mean(ratings_matrix, axis=1)
ratings_mean.shape
ratings_mean

(6040,)

array([0.05990286, 0.12924987, 0.05369671, ..., 0.02050729, 0.1287102 ,
       0.3291959 ])

### Reshape the Mean

In [14]:
ratings_mean = ratings_mean.reshape(-1,1)
ratings_mean.shape
ratings_mean

(6040, 1)

array([[0.05990286],
       [0.12924987],
       [0.05369671],
       ...,
       [0.02050729],
       [0.1287102 ],
       [0.3291959 ]])

### Demean the matrix

In [15]:
ratings_demeaned = ratings_matrix - ratings_mean
ratings_demeaned.shape
ratings_demeaned

(6040, 3706)

array([[ 4.94009714, -0.05990286, -0.05990286, ..., -0.05990286,
        -0.05990286, -0.05990286],
       [-0.12924987, -0.12924987, -0.12924987, ..., -0.12924987,
        -0.12924987, -0.12924987],
       [-0.05369671, -0.05369671, -0.05369671, ..., -0.05369671,
        -0.05369671, -0.05369671],
       ...,
       [-0.02050729, -0.02050729, -0.02050729, ..., -0.02050729,
        -0.02050729, -0.02050729],
       [-0.1287102 , -0.1287102 , -0.1287102 , ..., -0.1287102 ,
        -0.1287102 , -0.1287102 ],
       [ 2.6708041 , -0.3291959 , -0.3291959 , ..., -0.3291959 ,
        -0.3291959 , -0.3291959 ]])
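Because unrated movies were filled with 0, `np.mean(ratings_matrix, axis=1)` averages over all 3,706 columns, which is why the row means above are so small (about 0.06 for the first user). One possible alternative, which is an assumption on my part and not what this notebook does, is to average only over the movies each user actually rated. A minimal sketch on made-up data:

```python
import numpy as np

# Toy matrix: 0 marks "not rated" (hypothetical data)
toy = np.array([
    [5.0, 0.0, 3.0, 0.0],
    [0.0, 4.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 3.0],
])

rated = toy > 0                                  # mask of observed ratings
counts = rated.sum(axis=1, keepdims=True)        # number of ratings per user
# Mean over rated entries only (guard against users with no ratings)
user_mean = toy.sum(axis=1, keepdims=True) / np.maximum(counts, 1)

# Center only the observed entries; leave unrated cells at 0
toy_centered = np.where(rated, toy - user_mean, 0.0)

print(user_mean.ravel())    # per-user mean of rated movies
print(toy.mean(axis=1))     # mean over all columns, as in the notebook
```

Switching to this centering would change every downstream number in the notebook, so the outputs shown here still correspond to the all-columns mean.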

With my ratings matrix properly formatted and normalized, I'm ready to do some dimensionality reduction. But first, let's go over the math.

### Model-Based Collaborative Filtering
Model-based Collaborative Filtering is based on matrix factorization (MF), which is widely known as an unsupervised learning method for latent variable decomposition and dimensionality reduction. Matrix factorization is widely used in recommender systems because it handles scalability and sparsity better than memory-based CF:

- The goal of MF is to learn the latent preferences of users and the latent attributes of items from known ratings (learn features that describe the characteristics of ratings) to then predict the unknown ratings through the dot product of the latent features of users and items.
- When you have a very sparse matrix with a lot of dimensions, matrix factorization lets you restructure the user-item matrix into a low-rank form: you represent it as the product of two low-rank matrices whose rows contain the latent vectors.
- You fit this matrix to approximate your original matrix, as closely as possible, by multiplying the low-rank matrices together, which fills in the entries missing in the original matrix.
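To make the "dot product of latent features" idea concrete, here is a minimal sketch with k = 2 latent factors. The factor values and the names `user_factors` and `movie_factors` are hypothetical; in practice they would be learned from the known ratings.

```python
import numpy as np

# Hypothetical latent factors with k = 2 (values chosen for illustration)
user_factors = np.array([
    [1.0, 0.5],   # user 0
    [0.2, 1.5],   # user 1
])
movie_factors = np.array([
    [2.0, 0.0],   # movie 0
    [0.0, 2.0],   # movie 1
    [1.0, 1.0],   # movie 2
])

# Predicted score for every (user, movie) pair: dot product of their vectors
predicted = user_factors @ movie_factors.T

print(predicted.shape)   # one row per user, one column per movie
print(predicted)
```

The full 2 x 3 score matrix is stored as two small factor matrices, which is the low-rank structure the bullets describe.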

For example, let's check the sparsity of the ratings dataset:

In [124]:
sparsity = round(1 - len(ratings)/float(number_user * number_movie), 3)
print(f'The sparsity level of MovieLens1M dataset is {sparsity * 100:.1f}%')

The sparsity level of MovieLens1M dataset is 95.5%


### Singular Value Decomposition (SVD)

A well-known matrix factorization method is Singular Value Decomposition (SVD). At a high level, SVD factors a matrix A into three matrices, and truncating that factorization gives the best lower-rank (i.e. smaller/simpler) approximation of the original matrix A. Mathematically, it decomposes A into two unitary matrices and a diagonal matrix:
![svd.png](attachment:svd.png)
where A is the input data matrix (users' ratings), U holds the left singular vectors (user "features" matrix), Σ is the diagonal matrix of singular values (essentially the weight/strength of each concept), and Vᵀ holds the right singular vectors (movie "features" matrix). U and V have orthonormal columns, and they represent different things: U represents how much each user "likes" each feature, and Vᵀ represents how relevant each feature is to each movie.

To get the lower rank approximation, I take these matrices and **keep only the top k features (in this case, our k/rank = 50)**, which can be thought of as the underlying tastes and preferences vectors.
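Since the figure is an image attachment, here is the same factorization written out as a sketch in standard notation. The dimensions m (users), n (movies), and r (rank), and the symbols σᵢ, uᵢ, vᵢ for the i-th singular value and singular vectors, are notation I'm assuming here rather than names used elsewhere in the notebook:

$$
A_{m \times n} = U_{m \times r}\, \Sigma_{r \times r}\, V^{T}_{r \times n}
$$

Keeping only the k largest singular values gives the rank-k approximation:

$$
A_k = U_k \Sigma_k V_k^{T} = \sum_{i=1}^{k} \sigma_i\, u_i v_i^{T}
$$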

In [16]:
from scipy.sparse.linalg import svds
U, sigma, Vt = svds(ratings_demeaned, k = 50)

**SciPy and NumPy both** have functions to do the singular value decomposition. I **used the SciPy function `svds`** because it lets me choose how many latent factors I want to use to approximate the original ratings matrix (instead of having to truncate it afterwards).
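As a sanity check on that choice, the sketch below compares `svds` with a full `np.linalg.svd` that is truncated afterwards, on a small random matrix (made-up data). In the `sigma` output of this notebook the singular values happen to come back smallest-first, so the sketch sorts them explicitly instead of relying on the return order.

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 6))      # small dense toy matrix (hypothetical data)
k = 3

# Truncated SVD: only k singular triplets are computed
U_k, s_k, Vt_k = svds(A, k=k)

# Sort explicitly into descending order rather than relying on return order
order = np.argsort(s_k)[::-1]
U_k, s_k, Vt_k = U_k[:, order], s_k[order], Vt_k[order, :]

# Full SVD, truncated afterwards, for comparison
U, s, Vt = np.linalg.svd(A, full_matrices=False)

A_k_svds = U_k @ np.diag(s_k) @ Vt_k
A_k_full = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(np.allclose(s_k, s[:k]))          # do the top-k singular values agree?
print(np.allclose(A_k_svds, A_k_full))  # do the rank-k approximations agree?
```

The individual singular vectors may differ in sign between the two routines, but the sign cancels in the product, so the rank-k matrices can be compared directly.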

In [17]:
sigma

array([ 147.18581225,  147.62154312,  148.58855276,  150.03171353,
        151.79983807,  153.96248652,  154.29956787,  154.54519202,
        156.1600638 ,  157.59909505,  158.55444246,  159.49830789,
        161.17474208,  161.91263179,  164.2500819 ,  166.36342107,
        166.65755956,  167.57534795,  169.76284423,  171.74044056,
        176.69147709,  179.09436104,  181.81118789,  184.17680849,
        186.29341046,  192.15335604,  192.56979125,  199.83346621,
        201.19198515,  209.67692339,  212.55518526,  215.46630906,
        221.6502159 ,  231.38108343,  239.08619469,  244.8772772 ,
        252.13622776,  256.26466285,  275.38648118,  287.89180228,
        315.0835415 ,  335.08085421,  345.17197178,  362.26793969,
        415.93557804,  434.97695433,  497.2191638 ,  574.46932602,
        670.41536276, 1544.10679346])

Let's convert **sigma** into a diagonal matrix so that I can use matrix multiplication to get predictions.

In [18]:
sigma = np.diag(sigma)
sigma.shape
sigma

(50, 50)

array([[ 147.18581225,    0.        ,    0.        , ...,    0.        ,
           0.        ,    0.        ],
       [   0.        ,  147.62154312,    0.        , ...,    0.        ,
           0.        ,    0.        ],
       [   0.        ,    0.        ,  148.58855276, ...,    0.        ,
           0.        ,    0.        ],
       ...,
       [   0.        ,    0.        ,    0.        , ...,  574.46932602,
           0.        ,    0.        ],
       [   0.        ,    0.        ,    0.        , ...,    0.        ,
         670.41536276,    0.        ],
       [   0.        ,    0.        ,    0.        , ...,    0.        ,
           0.        , 1544.10679346]])

### Making Predictions from the Decomposed Matrices

I now have everything I need to make movie rating predictions for every user. I can do it all at once by following the math and multiplying U, Σ, and Vᵀ back together to get the rank k = 50 approximation of A.

**Matrix multiply U, Σ, and Vᵀ:**

In [23]:
k_rank_matrix = np.dot(np.dot(U, sigma), Vt)
k_rank_matrix.shape
k_rank_matrix

(6040, 3706)

array([[ 4.22895775,  0.0831523 , -0.25498236, ..., -0.02799091,
        -0.00945311,  0.02900747],
       [ 0.615466  ,  0.0404094 ,  0.20616821, ..., -0.23035193,
        -0.18334807, -0.26943833],
       [ 1.76512711,  0.40243952,  0.0372813 , ..., -0.04135219,
        -0.03854919, -0.16365266],
       ...,
       [ 0.59858142, -0.18227588,  0.08623077, ..., -0.03387677,
        -0.05086158, -0.1354428 ],
       [ 1.37489463, -0.16491781, -0.28997837, ..., -0.13961427,
        -0.16735769, -0.29706963],
       [ 1.66705226, -0.51518305, -0.4856741 , ..., -0.3358365 ,
        -0.20212877, -0.04419478]])

**Add the user means back to get the actual star ratings prediction.**

In [25]:
predicted_ratings = k_rank_matrix + ratings_mean
predicted_ratings.shape
predicted_ratings

(6040, 3706)

array([[ 4.28886061,  0.14305516, -0.1950795 , ...,  0.03191195,
         0.05044975,  0.08891033],
       [ 0.74471587,  0.16965927,  0.33541808, ..., -0.10110207,
        -0.0540982 , -0.14018846],
       [ 1.81882382,  0.45613623,  0.09097801, ...,  0.01234452,
         0.01514752, -0.10995596],
       ...,
       [ 0.61908871, -0.16176859,  0.10673806, ..., -0.01336948,
        -0.0303543 , -0.11493552],
       [ 1.50360483, -0.03620761, -0.16126817, ..., -0.01090407,
        -0.03864749, -0.16835943],
       [ 1.99624816, -0.18598715, -0.1564782 , ..., -0.00664061,
         0.12706713,  0.28500112]])
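The whole pipeline (center, decompose, truncate, multiply back, add the means) can be sketched end to end on a toy matrix. This uses `np.linalg.svd` on made-up data, and the helper `reconstruct` is a hypothetical name for illustration. It shows that the reconstruction error shrinks as k grows and vanishes once no singular values are dropped.

```python
import numpy as np

# Toy zero-filled ratings matrix (hypothetical data)
toy = np.array([
    [5.0, 0.0, 3.0, 0.0],
    [4.0, 0.0, 0.0, 1.0],
    [0.0, 2.0, 0.0, 5.0],
    [1.0, 0.0, 4.0, 4.0],
])

row_mean = toy.mean(axis=1, keepdims=True)   # same centering as the notebook
centered = toy - row_mean

U, s, Vt = np.linalg.svd(centered, full_matrices=False)

def reconstruct(k):
    """Rank-k approximation of the centered matrix, with row means added back."""
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :] + row_mean

for k in (1, 2, 4):
    error = np.linalg.norm(toy - reconstruct(k))
    print(f"k={k}: reconstruction error = {error:.4f}")
```

A small k deliberately does not reproduce the input exactly: the entries it "fills in" for unrated movies are what the notebook uses as predicted scores.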

### Before & After SVD

In [28]:
# Before SVD
ratings_pivot

movie_id,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,...,3938,3939,3940,3941,3942,3943,3944,3945,3946,3947,3948,3949,3950,3951,3952
user_id,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6036,0.0,0.0,0.0,2.0,0.0,3.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6037,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6038,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6039,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [27]:
# After SVD
prediction_ratings = pd.DataFrame(predicted_ratings, columns = ratings_pivot.columns)
prediction_ratings

movie_id,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,...,3938,3939,3940,3941,3942,3943,3944,3945,3946,3947,3948,3949,3950,3951,3952
0,4.288861,0.143055,-0.195080,-0.018843,0.012232,-0.176604,-0.074120,0.141358,-0.059553,-0.195950,0.512867,-0.089172,0.310181,-0.002005,-0.052401,...,-0.001159,-0.002124,-0.002827,0.010393,-0.001068,0.027807,0.001640,0.026395,-0.022024,-0.085415,0.403529,0.105579,0.031912,0.050450,0.088910
1,0.744716,0.169659,0.335418,0.000758,0.022475,1.353050,0.051426,0.071258,0.161601,1.567246,0.772656,0.046179,-0.054562,0.042344,0.048390,...,0.006759,-0.005789,0.000340,0.002024,0.016013,-0.056502,-0.013733,-0.010580,0.062576,-0.016248,0.155790,-0.418737,-0.101102,-0.054098,-0.140188
2,1.818824,0.456136,0.090978,-0.043037,-0.025694,-0.158617,-0.131778,0.098977,0.030551,0.735470,-0.023476,0.034796,0.065942,0.008661,0.110348,...,0.014383,0.006598,-0.006217,-0.000342,0.000518,0.040481,-0.005301,0.012832,0.029349,0.020866,0.121532,0.076205,0.012345,0.015148,-0.109956
3,0.408057,-0.072960,0.039642,0.089363,0.041950,0.237753,-0.049426,0.009467,0.045469,-0.111370,-0.375831,0.068658,0.011199,0.069699,-0.037529,...,-0.015473,-0.007123,-0.007416,-0.011508,-0.010038,0.008571,-0.005425,-0.008500,-0.003417,-0.083982,0.094512,0.057557,-0.026050,0.014841,-0.034224
4,1.574272,0.021239,-0.051300,0.246884,-0.032406,1.552281,-0.199630,-0.014920,-0.060498,0.450512,-0.251178,0.012337,-0.084051,0.258937,0.016570,...,0.018639,0.034068,0.026941,0.035905,0.024459,0.110151,0.046010,0.006934,-0.015940,-0.050080,-0.052539,0.507189,0.033830,0.125706,0.199244
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6035,2.392388,0.233964,0.413676,0.443726,-0.083641,2.192294,1.168936,0.145237,-0.046551,0.560895,2.239887,0.260693,0.457274,0.877058,-0.099574,...,0.001079,0.040538,0.006475,0.028787,-0.001650,0.188493,-0.004439,-0.042271,-0.090101,0.276312,0.133806,0.732374,0.271234,0.244983,0.734771
6036,2.070760,0.139294,-0.012666,-0.176990,0.261243,1.074234,0.083999,0.013814,-0.030179,-0.084956,0.240129,0.113375,0.056177,-0.133446,-0.023538,...,-0.002727,-0.011607,0.003313,-0.013968,-0.015826,-0.161548,0.001184,-0.029223,-0.047087,0.099036,-0.192653,-0.091265,0.050798,-0.113427,0.033283
6037,0.619089,-0.161769,0.106738,0.007048,-0.074701,-0.079953,0.100220,-0.034013,0.007671,0.001280,0.182847,0.008752,-0.024851,-0.020687,-0.032327,...,0.003093,-0.000203,0.004458,0.004425,0.008262,-0.053546,0.005835,0.007551,-0.024082,-0.010739,-0.008863,-0.099774,-0.013369,-0.030354,-0.114936
6038,1.503605,-0.036208,-0.161268,-0.083401,-0.081617,-0.143517,0.106668,-0.054404,-0.008826,0.205801,0.328057,0.038232,-0.028300,0.066413,0.069179,...,-0.014116,-0.018968,-0.010119,-0.024114,-0.005999,-0.006104,0.008933,0.007595,-0.037800,0.050743,0.024052,-0.172466,-0.010904,-0.038647,-0.168359


Now I'll write a function that recommends to a user the movies with the highest predicted ratings among those that user has not rated yet.

### Movies Recommendation Function

In [37]:
def recommend_movies(userID, number_of_movies_for_recommendation):
    # Get and sort the user's predicted ratings
    user_row_number = userID - 1 # Row position; assumes user IDs run 1..N with no gaps
    sorted_user_predictions = prediction_ratings.iloc[user_row_number].sort_values(ascending=False).reset_index()
    
    user_data = ratings[ratings.user_id == (userID)]
    user_data = (user_data.merge(movies, how = 'left', left_on = 'movie_id', right_on = 'movie_id').
                     sort_values(['rating'], ascending=False))
    
    # List of movies that the user hasn't watched
    unseen_movies = movies[~movies['movie_id'].isin(user_data['movie_id'])]
    
    # Recommend the highest predicted rating movies that the user hasn't seen yet.
    recommendations = unseen_movies.merge(sorted_user_predictions, how = 'left', left_on = 'movie_id', right_on = 'movie_id')
    recommendations = recommendations.rename(columns = {user_row_number: 'score'})
    recommendations = recommendations.sort_values('score', ascending = False)
#     recommendations['score'] = recommendations['score'].apply(lambda x: round(x,2))
    recommendations = recommendations.iloc[:number_of_movies_for_recommendation, :-1].reset_index(drop=True)
    
    return user_data, recommendations
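The function above finds a user's row by position (`userID - 1`), which works here because the user IDs run from 1 to 6040 without gaps. If that did not hold, a label-based lookup would be safer. The sketch below shows the idea on made-up data; `toy_predictions`, `toy_rated`, and `top_n_unseen` are hypothetical names, and the predictions frame is assumed to be indexed by `user_id`.

```python
import pandas as pd

# Hypothetical predicted scores: rows are user IDs, columns are movie IDs
toy_predictions = pd.DataFrame(
    [[4.2, 0.1, 3.5, 2.0],
     [0.3, 3.9, 1.1, 2.8]],
    index=pd.Index([7, 42], name="user_id"),
    columns=pd.Index([10, 20, 30, 40], name="movie_id"),
)

# Hypothetical rating history in the same long format as `ratings`
toy_rated = pd.DataFrame({"user_id": [7, 42, 42], "movie_id": [10, 20, 40]})

def top_n_unseen(user_id, n):
    """Return the n highest-scored movie IDs the user has not rated yet."""
    seen = toy_rated.loc[toy_rated["user_id"] == user_id, "movie_id"].tolist()
    scores = toy_predictions.loc[user_id]          # label-based, not positional
    return scores.drop(index=seen).sort_values(ascending=False).head(n)

print(top_n_unseen(7, 2))
print(top_n_unseen(42, 2))
```

In this notebook, one way to get such a frame would be to pass `index=ratings_pivot.index` when building `prediction_ratings`, though that would also require changing the positional `iloc` lookup inside `recommend_movies`.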

#### Let's test with user 1673 to see 20 recommended movies

In [43]:
already_rated, recommendation = recommend_movies(userID = 1673, number_of_movies_for_recommendation = 20)
recommendation

Unnamed: 0,movie_id,title,genres
0,329,Star Trek: Generations (1994),Action|Adventure|Sci-Fi
1,1372,Star Trek VI: The Undiscovered Country (1991),Action|Adventure|Sci-Fi
2,2393,Star Trek: Insurrection (1998),Action|Sci-Fi
3,1584,Contact (1997),Drama|Sci-Fi
4,2571,"Matrix, The (1999)",Action|Sci-Fi|Thriller
5,3793,X-Men (2000),Action|Sci-Fi
6,2011,Back to the Future Part II (1989),Comedy|Sci-Fi
7,1371,Star Trek: The Motion Picture (1979),Action|Adventure|Sci-Fi
8,1265,Groundhog Day (1993),Comedy|Romance
9,2640,Superman (1978),Action|Adventure|Sci-Fi


**Let's check whether the movie 'Star Trek: Generations (1994)' is in the already_rated dataframe. If it is not, then I think I have succeeded in creating a recommendation system for movies.**

In [40]:
already_rated[already_rated['title'] == 'Star Trek: Generations (1994)'] #1

Unnamed: 0,user_id,movie_id,rating,timestamp,user_emb_id,movie_emb_id,title,genres


In [42]:
## and other movies in the recommendation dataframe as well

already_rated[already_rated['title'] == 'Bug\'s Life, A (1998)'] #2
already_rated[already_rated['title'] == 'Star Trek VI: The Undiscovered Country (1991)'] #3
already_rated[already_rated['title'] == 'Star Trek: Insurrection (1998)'] #4
already_rated[already_rated['title'] == 'Contact (1997)'] #5

Unnamed: 0,user_id,movie_id,rating,timestamp,user_emb_id,movie_emb_id,title,genres


Unnamed: 0,user_id,movie_id,rating,timestamp,user_emb_id,movie_emb_id,title,genres


Unnamed: 0,user_id,movie_id,rating,timestamp,user_emb_id,movie_emb_id,title,genres


Unnamed: 0,user_id,movie_id,rating,timestamp,user_emb_id,movie_emb_id,title,genres


**There we go, this is how we apply SVD to create a Recommendation System.** In the next project, I will use user-based and item-based Collaborative Filtering to make movie recommendations from users' ratings data.