### Singular Value Decomposition

So far in this lesson, you have gained some exposure to Singular Value Decomposition.  In this notebook, you will get some hands on practice with this technique.

Let's get started by reading in our libraries and setting up the data we will be using throughout this notebook

`1.` Run the cell below to create the **user_movie_subset** dataframe.  This will be the dataframe you will be using for the first part of this notebook.


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import svd_tests as t
%matplotlib inline

# Read in the datasets
movies = pd.read_csv('./data/movies_clean.csv')
reviews = pd.read_csv('./data/reviews_clean.csv')

del movies['Unnamed: 0']
del reviews['Unnamed: 0']

# Create user-by-item matrix
user_items = reviews[['user_id', 'movie_id', 'rating']]
user_by_movie = user_items.groupby(['user_id', 'movie_id'])['rating'].max().unstack()

user_movie_subset = user_by_movie[[75314,  68646, 99685]].dropna(axis=0)
print(user_movie_subset)

movie_id  75314  68646  99685
user_id                      
2213        7.0   10.0    8.0
2223        6.0   10.0    7.0
2942        8.0    9.0    8.0
3298        8.0   10.0   10.0
3424        9.0    9.0    9.0
5205        8.0    9.0    9.0


In [2]:

# match each letter to the best statement in the dictionary below - each will be used at most once
a = 6
b = 99685
c = 'The Godfather'
d = 'Goodfellas'
e = 3298
f = 30685
g = 3

sol_1_dict = {
    'the number of users in the user_movie_subset': a,
    'the number of movies in the user_movie_subset': g,
    'the user_id with the highest average ratings given': e,
    'the movie_id with the highest average ratings received': b,
    'the name of the movie that received the highest average rating': d
}


#test dictionary here
t.test1(sol_1_dict)

That's right!  There are 6 users in the dataset, which is given by the number of rows. There are 3 movies in the dataset given by the number of columns.  You can find the movies or users with the highest average ratings by taking the mean of each row or column.  Using the movies table, you can find the movie names associated with each id.  This shows the top rated movie is Goodfellas!


In [4]:
# match each letter in the dictionary below - a letter may appear more than once.
a = 'a number that you can choose as the number of latent features to keep'
b = 'the number of users'
c = 'the number of movies'
d = 'the sum of the number of users and movies'
e = 'the product of the number of users and movies'

sol_2_dict = {
    'the number of rows in the U matrix': b,#enter a letter,
    'the number of columns in the U matrix': a, #enter a letter,
    'the number of rows in the V transpose matrix': a, #enter a letter,
    'the number of columns in the V transpose matrix': c #enter a letter
}

#test dictionary here
t.test2(sol_2_dict)

That's right!  We will now put this to use, so you can see how the dot product of these matrices come together to create our user item matrix.  The number of latent features will control the sigma matrix as well, and this will a square matrix that will at most be the minimum of the number of users and number of movies (in our case the minimum is the 4 movies).


In [5]:
u, s, vt = np.linalg.svd(user_movie_subset)
s.shape, u.shape, vt.shape

((3,), (6, 6), (3, 3))

In [6]:
# Run this cell for our thoughts on the questions posted above
t.question4thoughts()


Looking at the dimensions of the three returned objects, we can see the following:

 1. The u matrix is a square matrix with the number of rows and columns equaling the number of users. 

 2. The v transpose matrix is also a square matrix with the number of rows and columns equaling the number of items.

 3. The sigma matrix is actually returned as just an array with 4 values.  

 In order to set up the matrices in a way that they can be multiplied together, we have a few steps to perform: 

 1. Turn sigma into a square matrix with the number of latent features we would like to keep. 

 2. Change the columns of u and the rows of v transpose to match this number of dimensions. 

 If we would like to exactly re-create the user-movie matrix, we could choose to keep all of the latent features.


In [7]:
np.dot(np.dot(u, s), vt)


ValueError: shapes (6,6) and (3,) not aligned: 6 (dim 1) != 3 (dim 0)

In [9]:
s_new = np.zeros(shape=(len(s), len(s)))
s_new[:len(s), :len(s)] = np.diag(s)

In [11]:
# Change the dimensions of u, s, and vt as necessary to use three latent features
# update the shape of u and store in u_new
u_new = u[:, :len(s)] #implement your code here

# update the shape of s and store in s_new
s_new = np.zeros(shape=(len(s), len(s)))
s_new[:len(s), :len(s)] = np.diag(s)

#implement your code here

# Because we are using 3 latent features and there are only 3 movies,
# vt and vt_new are the same
vt_new = vt #implement your code here


np.dot(np.dot(u_new, s_new), vt_new) #

array([[ 7., 10.,  8.],
       [ 6., 10.,  7.],
       [ 8.,  9.,  8.],
       [ 8., 10., 10.],
       [ 9.,  9.,  9.],
       [ 8.,  9.,  9.]])