# Recommendation System

Model based collaborative filtering for recommendations on the MovieLens dataset. 

Singular Vector Decomposition (SVD) is used as dimensionality reduction technique that is used in recommender systems.

## Dataset

In [15]:
import warnings
warnings.filterwarnings("ignore")

In [49]:
# Import libraries
import numpy as np
import pandas as pd

# Reading ratings file
ratings = pd.read_csv(r'movielens\ratings.csv', sep='\t', encoding='latin-1', usecols=['user_id', 'movie_id', 'rating', 'timestamp'])

# Reading users file
users = pd.read_csv(r'movielens\users.csv', sep='\t', encoding='latin-1', usecols=['user_id', 'gender', 'zipcode', 'age_desc', 'occ_desc'])

# Reading movies file
movies = pd.read_csv(r'movielens\movies.csv', sep='\t', encoding='latin-1', usecols=['movie_id', 'title', 'genres'])

In [3]:
movies.head()

Unnamed: 0,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


In [4]:
ratings.head()

Unnamed: 0,user_id,movie_id,rating,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


In [51]:
# count of unique users and movies

users_ = ratings.user_id.unique().shape[0]
movies_ = ratings.movie_id.unique().shape[0]
print('Number of users = ' + str(users_))
print('Number of movies = ' + str(movies_))

Number of users = 6040
Number of movies = 3706


In [12]:
# using pivot table

ratings_pivot = ratings.pivot(index = 'user_id', columns ='movie_id', values = 'rating').fillna(0)
print('Shape of pivot: ',ratings_pivot.shape)
ratings_pivot.head()

Shape of pivot:  (6040, 3706)


movie_id,1,2,3,4,5,6,7,8,9,10,...,3943,3944,3945,3946,3947,3948,3949,3950,3951,3952
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Denormalize the data (normalize by each users mean) and convert it from a dataframe to a numpy array. Denormalizing to increse the performance of model.

In [16]:
matrix = ratings_pivot.as_matrix()
user_ratings_mean = np.mean(matrix, axis = 1)
ratings_new = matrix - user_ratings_mean.reshape(-1, 1)
print('Shape of denormalized matrix: ', ratings_new.shape)

Shape of denormalized matrix:  (6040, 3706)


## Model-Based Collaborative Filtering using SVD

Scipy function svds let's me choose how many latent factors I want to use to approximate the original ratings matrix (instead of having to truncate it after).

In [17]:
from scipy.sparse.linalg import svds
U, sigma, Vt = svds(Ratings_demeaned, k = 50)

In [18]:
sigma = np.diag(sigma)
print(sigma)

[[ 147.18581225    0.            0.         ...    0.
     0.            0.        ]
 [   0.          147.62154312    0.         ...    0.
     0.            0.        ]
 [   0.            0.          148.58855276 ...    0.
     0.            0.        ]
 ...
 [   0.            0.            0.         ...  574.46932602
     0.            0.        ]
 [   0.            0.            0.         ...    0.
   670.41536276    0.        ]
 [   0.            0.            0.         ...    0.
     0.         1544.10679346]]


### Predicting from U, Sigma and VT

- multiply $U$, $\Sigma$, and $V^{T}$ back to get 50 approximation of $A$.

Need to add the user means back to get the actual star ratings prediction.

In [19]:
user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) + user_ratings_mean.reshape(-1, 1)

With the predictions matrix for every user, I can build a function to recommend movies for any user. I return the list of movies the user has already rated, for the sake of comparison.

In [32]:
preds = pd.DataFrame(user_predicted_ratings, columns = ratings_pivot.columns)
preds.head()

movie_id,1,2,3,4,5,6,7,8,9,10,...,3943,3944,3945,3946,3947,3948,3949,3950,3951,3952
0,4.288861,0.143055,-0.19508,-0.018843,0.012232,-0.176604,-0.07412,0.141358,-0.059553,-0.19595,...,0.027807,0.00164,0.026395,-0.022024,-0.085415,0.403529,0.105579,0.031912,0.05045,0.08891
1,0.744716,0.169659,0.335418,0.000758,0.022475,1.35305,0.051426,0.071258,0.161601,1.567246,...,-0.056502,-0.013733,-0.01058,0.062576,-0.016248,0.15579,-0.418737,-0.101102,-0.054098,-0.140188
2,1.818824,0.456136,0.090978,-0.043037,-0.025694,-0.158617,-0.131778,0.098977,0.030551,0.73547,...,0.040481,-0.005301,0.012832,0.029349,0.020866,0.121532,0.076205,0.012345,0.015148,-0.109956
3,0.408057,-0.07296,0.039642,0.089363,0.04195,0.237753,-0.049426,0.009467,0.045469,-0.11137,...,0.008571,-0.005425,-0.0085,-0.003417,-0.083982,0.094512,0.057557,-0.02605,0.014841,-0.034224
4,1.574272,0.021239,-0.0513,0.246884,-0.032406,1.552281,-0.19963,-0.01492,-0.060498,0.450512,...,0.110151,0.04601,0.006934,-0.01594,-0.05008,-0.052539,0.507189,0.03383,0.125706,0.199244


Now I write a function to return the movies with the highest predicted rating that the specified user hasn't already rated. Though I didn't use any explicit movie content features (such as genre or title), I'll merge in that information to get a more complete picture of the recommendations.

In [73]:
def recommend_movies(predictions, userID, movies, original_ratings, num_recommendations):
    
    # Get and sort the user's predictions
    row_num = (userID - 1)   # User ID starts at 1, not 0
    user_pred = preds.iloc[row_num].sort_values(ascending=False) # User ID starts at 1
    
    # Get the user's data and merge in the movie information.
    user_data = original_ratings[original_ratings.user_id == (userID)]
    user_all_data = (user_data.merge(movies, how = 'left', left_on = 'movie_id', right_on = 'movie_id')
                 .sort_values(['rating'], ascending=False))
    
    # Recommend the highest predicted rating movies that the user hasn't seen yet.
    recommendations = (movies[movies['movie_id'].isin(user_all_data['movie_id'])].
         merge(pd.DataFrame(user_pred).reset_index(), how = 'left',
               left_on = 'movie_id',
               right_on = 'movie_id').
         rename(columns = {row_num: 'Predictions'}).
         sort_values('Predictions', ascending = False).
                       iloc[:num_recommendations, :-1]
                      )

    return user_all_data, recommendations

In [81]:
# recommending 10 movies for id : 1511
already_rated, predictions = recommend_movies(preds, 1511, movies, ratings, 10)

In [82]:
# Top 20 movies that User 1511 has rated 
already_rated.head(20)

Unnamed: 0,user_id,movie_id,rating,timestamp,title,genres
0,1511,2987,5,974841692,Who Framed Roger Rabbit? (1988),Adventure|Animation|Film-Noir
60,1511,1927,5,974747741,All Quiet on the Western Front (1930),War
69,1511,3730,5,974748154,"Conversation, The (1974)",Drama|Mystery
26,1511,913,5,974748190,"Maltese Falcon, The (1941)",Film-Noir|Mystery
27,1511,919,5,974841476,"Wizard of Oz, The (1939)",Adventure|Children's|Drama|Musical
29,1511,924,5,974748154,2001: A Space Odyssey (1968),Drama|Mystery|Sci-Fi|Thriller
32,1511,940,5,974841641,"Adventures of Robin Hood, The (1938)",Action|Adventure
62,1511,1947,5,974841578,West Side Story (1961),Musical|Romance
38,1511,260,5,974747907,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Fantasy|Sci-Fi
76,1511,1212,5,974748154,"Third Man, The (1949)",Mystery|Thriller


In [84]:
# Top 10 movies that User 1511 hopefully will enjoy
predictions

Unnamed: 0,movie_id,title,genres
26,1198,Raiders of the Lost Ark (1981),Action|Adventure
8,858,"Godfather, The (1972)",Action|Crime|Drama
1,260,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Fantasy|Sci-Fi
60,2028,Saving Private Ryan (1998),Action|Drama|War
25,1196,Star Wars: Episode V - The Empire Strikes Back...,Action|Adventure|Drama|Sci-Fi|War
33,1221,"Godfather: Part II, The (1974)",Action|Crime|Drama
13,912,Casablanca (1942),Drama|Romance|War
19,969,"African Queen, The (1951)",Action|Adventure|Romance|War
15,919,"Wizard of Oz, The (1939)",Adventure|Children's|Drama|Musical
27,1204,Lawrence of Arabia (1962),Adventure|War
