###  SVD for Movie Recommendations

       

    I previously implemented user-based and item-based collaborative filtering to make movie recommendations from users' ratings data. Because of the limited computing power of my machine, I ran it on only 20,000 ratings, which is about 2% of the entire dataset, and got a very high RMSE. A high RMSE means the predicted ratings are far from the actual ones, so the recommendations are poorly tailored and the recommender is not of high quality.
    
    Memory-based collaborative filtering approaches have the following two major issues:

    a) They don't scale particularly well to massive datasets.
    b) The raw ratings matrix may overfit to noisy representations of user tastes and preferences.

    I therefore applied a dimensionality reduction technique to do low-rank matrix factorization.
    
    Reasons to reduce the dimensions:

    a) It can uncover hidden correlations / features in the raw data.
    b) It removes redundant and noisy features that are not useful.
    
    In this notebook I will implement Singular Value Decomposition (SVD), a powerful dimensionality reduction technique that is used heavily in modern model-based CF recommender systems.

In [17]:
import pandas as pd
import numpy as np

In [13]:
ratings = pd.read_csv('data_movies/ratings.csv',sep = '\t', index_col=0)


In [14]:
ratings.head(5)

Unnamed: 0,user_id,movie_id,rating,timestamp,user_emb_id,movie_emb_id
0,1,1193,5,978300760,0,1192
1,1,661,3,978302109,0,660
2,1,914,3,978301968,0,913
3,1,3408,4,978300275,0,3407
4,1,2355,5,978824291,0,2354


In [6]:
users = pd.read_csv('data_movies/users.csv',sep = '\t', index_col=0)


In [7]:
users.head(5)

Unnamed: 0,user_id,gender,age,occupation,zipcode,age_desc,occ_desc
0,1,F,1,10,48067,Under 18,K-12 student
1,2,M,56,16,70072,56+,self-employed
2,3,M,25,15,55117,25-34,scientist
3,4,M,45,7,2460,45-49,executive/managerial
4,5,M,25,20,55455,25-34,writer


In [8]:
movies = pd.read_csv('data_movies/movies.csv',sep = '\t',encoding='latin-1', index_col=0)

In [9]:
movies.head(5)

Unnamed: 0,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


##### Now I will pivot the ratings so that each user is a row and each movie is a column

In [15]:
ratings1 = ratings.pivot(index = 'user_id', columns ='movie_id', values = 'rating').fillna(0)
ratings1.head()

user_id,1,2,3,4,5,6,7,8,9,10,...,3943,3944,3945,3946,3947,3948,3949,3950,3951,3952
1,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
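
To make the pivot step easier to follow, here is a minimal sketch on a tiny, made-up ratings table (the `toy` data and the name `toy_matrix` are hypothetical, not part of the dataset). It uses the same `pivot(...).fillna(0)` pattern as the cell above.

```python
import pandas as pd

# Hypothetical toy ratings in the same long format as ratings.csv
toy = pd.DataFrame({
    "user_id":  [1, 1, 2, 3, 3],
    "movie_id": [10, 20, 10, 20, 30],
    "rating":   [5, 3, 4, 2, 1],
})

# One row per user, one column per movie; unrated pairs become 0
toy_matrix = toy.pivot(index="user_id", columns="movie_id", values="rating").fillna(0)
print(toy_matrix)
```

Note that after `fillna(0)` a 0 stands for "not rated", not for a rating of zero.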


##### Normalizing the data by subtracting each user's mean rating

In [18]:
rating_matrix = ratings1.to_numpy()  # as_matrix() is deprecated and was removed in pandas 1.0
user_ratings_mean = np.mean(rating_matrix, axis = 1)
rating_norm = rating_matrix - user_ratings_mean.reshape(-1, 1)
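
One thing worth keeping in mind about this normalization: because unrated movies were filled with 0, the row mean is taken over all movies, not only the rated ones. The sketch below shows the difference on a hypothetical 2x4 matrix (all names and numbers are made up for illustration).

```python
import numpy as np

# Hypothetical user-item matrix; 0 means "not rated"
toy = np.array([[5.0, 3.0, 0.0, 0.0],
                [0.0, 4.0, 0.0, 2.0]])

# Mean over every column, zeros included (what np.mean(..., axis=1) computes)
row_mean_all = toy.mean(axis=1)

# Mean over rated entries only, shown for comparison
rated = toy > 0
row_mean_rated = toy.sum(axis=1) / rated.sum(axis=1)

# Mean-centering as in the notebook: subtract each row's mean from that row
centered = toy - row_mean_all.reshape(-1, 1)

print(row_mean_all, row_mean_rated)
```

With a matrix that is mostly zeros, the all-column mean is pulled strongly toward zero, so the centered values stay close to the raw ones.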


##### Collaborative filtering based on matrix factorization (MF)

    Model-based collaborative filtering built on matrix factorization is generally favoured over memory-based CF because it copes better with scalability and with sparsity in the ratings matrix. Further advantages are listed below:
    
    
    1. The goal of MF is to learn latent features that describe the characteristics of the observed ratings; unknown ratings are then predicted from those features.
    2. When you have a very sparse matrix with a lot of dimensions, matrix factorization lets you restructure the user-item matrix into a low-rank structure: the matrix is approximated by the product of two low-rank matrices whose rows contain the latent vectors of users and items.
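
The low-rank idea in the list above can be sketched as follows. This is only an illustration with random numbers: the sizes, `P`, `Q` and `R_hat` are hypothetical names, and nothing here is fitted to the movie data.

```python
import numpy as np

rng = np.random.default_rng(0)

n_users, n_items, k = 6, 8, 2        # hypothetical toy sizes, k latent features

P = rng.normal(size=(n_users, k))    # one latent vector per user (rows)
Q = rng.normal(size=(n_items, k))    # one latent vector per item (rows)

# Every predicted rating is the dot product of a user vector and an item vector
R_hat = P @ Q.T

dense_params = n_users * n_items           # numbers stored by the full matrix
factored_params = k * (n_users + n_items)  # numbers stored by the two factors

print(R_hat.shape, dense_params, factored_params)
```

The saving grows with the matrix size, since the factors scale with the sum of users and items rather than their product.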
    
#### Let's check the sparsity of the matrix

In [19]:
n_users = ratings.user_id.unique().shape[0]
n_movies = ratings.movie_id.unique().shape[0]

In [22]:
sparsity = round(1.0 - len(ratings) / float(n_users * n_movies), 3)
print ('The sparsity level of dataset is ' +  str(sparsity * 100) + '%')

The sparsity level of dataset is 95.5%


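#### Singular Value Decomposition (SVD)

A well-known matrix factorization method is singular value decomposition (SVD). At a high level, SVD factorizes a matrix $A$ into three matrices, and keeping only the $k$ largest singular values gives the best rank-$k$ (i.e. smaller/simpler) approximation of the original matrix $A$ in the least-squares sense. Mathematically, it decomposes $A$ into two unitary matrices and a diagonal matrix:

$$A = U \Sigma V^{T}$$

Here $U$ holds the user latent vectors, $\Sigma$ is the diagonal matrix of singular values, and $V^{T}$ holds the movie latent vectors; they correspond to `U`, `sigma` and `Vt` in the code below.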

In [24]:
from scipy.sparse.linalg import svds
U, sigma, Vt = svds(rating_norm, k = 50)

#### Converting the singular values into a diagonal matrix

In [25]:
sigma = np.diag(sigma)

In [26]:
all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) + user_ratings_mean.reshape(-1, 1)
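
To see what the truncation does, here is a small sketch on a random matrix. It deliberately uses `np.linalg.svd` (a full SVD) instead of `scipy.sparse.linalg.svds`, so it is not the notebook's exact call; the matrix `A` and the choice `k = 2` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 5))          # hypothetical toy matrix

# Full (thin) SVD; singular values come back sorted in descending order
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values and their vectors
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Frobenius error of the rank-k approximation ...
err = np.linalg.norm(A - A_k)
# ... equals the root of the sum of squares of the discarded singular values
expected = np.sqrt(np.sum(s[k:] ** 2))

print(err, expected)
```

The two printed numbers agree, which is the sense in which the rank-k reconstruction is the "best" approximation.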

In [27]:
preds = pd.DataFrame(all_user_predicted_ratings, columns = ratings1.columns)  # reuse the movie_id columns of the pivoted matrix
preds.head()

movie_id,1,2,3,4,5,6,7,8,9,10,...,3943,3944,3945,3946,3947,3948,3949,3950,3951,3952
0,4.288861,0.143055,-0.19508,-0.018843,0.012232,-0.176604,-0.07412,0.141358,-0.059553,-0.19595,...,0.027807,0.00164,0.026395,-0.022024,-0.085415,0.403529,0.105579,0.031912,0.05045,0.08891
1,0.744716,0.169659,0.335418,0.000758,0.022475,1.35305,0.051426,0.071258,0.161601,1.567246,...,-0.056502,-0.013733,-0.01058,0.062576,-0.016248,0.15579,-0.418737,-0.101102,-0.054098,-0.140188
2,1.818824,0.456136,0.090978,-0.043037,-0.025694,-0.158617,-0.131778,0.098977,0.030551,0.73547,...,0.040481,-0.005301,0.012832,0.029349,0.020866,0.121532,0.076205,0.012345,0.015148,-0.109956
3,0.408057,-0.07296,0.039642,0.089363,0.04195,0.237753,-0.049426,0.009467,0.045469,-0.11137,...,0.008571,-0.005425,-0.0085,-0.003417,-0.083982,0.094512,0.057557,-0.02605,0.014841,-0.034224
4,1.574272,0.021239,-0.0513,0.246884,-0.032406,1.552281,-0.19963,-0.01492,-0.060498,0.450512,...,0.110151,0.04601,0.006934,-0.01594,-0.05008,-0.052539,0.507189,0.03383,0.125706,0.199244


In [29]:
def recommend_movies(predictions, userID, movies, original_ratings, num_recommendations):
    
    # Get and sort the user's predictions (use the DataFrame passed in, not the global preds)
    user_row_number = userID - 1 # User ID starts at 1, row index starts at 0
    sorted_user_predictions = predictions.iloc[user_row_number].sort_values(ascending=False)
    
    # Get the user's data and merge in the movie information.
    user_data = original_ratings[original_ratings.user_id == (userID)]
    user_full = (user_data.merge(movies, how = 'left', left_on = 'movie_id', right_on = 'movie_id').
                     sort_values(['rating'], ascending=False)
                 )

    print ('User {0} has already rated {1} movies.'.format(userID, user_full.shape[0]))
    print ('Recommending highest {0} predicted ratings movies not already rated.'.format(num_recommendations))
    
    # Recommend the highest predicted rating movies that the user hasn't seen yet.
    recommendations = (movies[~movies['movie_id'].isin(user_full['movie_id'])].
         merge(pd.DataFrame(sorted_user_predictions).reset_index(), how = 'left',
               left_on = 'movie_id',
               right_on = 'movie_id').
         rename(columns = {user_row_number: 'Predictions'}).
         sort_values('Predictions', ascending = False).
                       iloc[:num_recommendations, :-1]
                      )

    return user_full, recommendations
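
The core of the function above (rank the predictions and skip what the user has already rated) can be sketched with plain NumPy. The helper `top_n_unrated` and the toy arrays are hypothetical and only illustrate the idea; they are not a replacement for `recommend_movies`.

```python
import numpy as np

def top_n_unrated(pred_row, rated_mask, n):
    """Return column indices of the n highest predictions among unrated items."""
    # Push already-rated items to the bottom of the ranking
    scores = np.where(rated_mask, -np.inf, pred_row)
    order = np.argsort(scores)[::-1]           # highest score first
    n_unrated = int((~rated_mask).sum())
    return order[:min(n, n_unrated)]

# Hypothetical predictions for one user over five movies
pred_row = np.array([4.2, 0.3, 3.9, 1.1, 2.5])
rated_mask = np.array([True, False, False, False, True])

top = top_n_unrated(pred_row, rated_mask, 2)
print(top)
```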

In [31]:
already_rated, predictions = recommend_movies(preds, 2310, movies, ratings, 20)

User 2310 has already rated 101 movies.
Recommending highest 20 predicted ratings movies not already rated.


In [32]:
already_rated.head(5)

Unnamed: 0,user_id,movie_id,rating,timestamp,user_emb_id,movie_emb_id,title,genres
23,2310,1617,5,974487726,2309,1616,L.A. Confidential (1997),Crime|Film-Noir|Mystery|Thriller
68,2310,2761,5,974488323,2309,2760,"Iron Giant, The (1999)",Animation|Children's
31,2310,3424,5,974486093,2309,3423,Do the Right Thing (1989),Comedy|Drama
60,2310,3361,5,974486898,2309,3360,Bull Durham (1988),Comedy
88,2310,1217,5,974485729,2309,1216,Ran (1985),Drama|War


In [33]:
predictions.head(5)

Unnamed: 0,movie_id,title,genres
1152,1197,"Princess Bride, The (1987)",Action|Adventure|Comedy|Romance
286,296,Pulp Fiction (1994),Crime|Drama
1045,1079,"Fish Called Wanda, A (1988)",Comedy
722,745,"Close Shave, A (1995)",Animation|Comedy|Thriller
1897,2020,Dangerous Liaisons (1988),Drama|Romance


In [38]:
from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate

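# Reader describes how ratings are parsed; its default rating scale is 1 to 5
reader = Reader()

# Build a Surprise Dataset from the DataFrame (columns must be in user, item, rating order)
data = Dataset.load_from_df(ratings[['user_id', 'movie_id', 'rating']], reader)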



In [49]:
algo = SVD()

# Run 5-fold cross-validation and print results
cross_validate(algo, data, measures=['RMSE'], cv=5, verbose=True)

Evaluating RMSE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8752  0.8773  0.8726  0.8719  0.8737  0.8741  0.0019  
Fit time          52.37   53.92   53.57   55.49   57.72   54.62   1.85    
Test time         2.46    2.04    2.48    2.39    2.29    2.33    0.16    


{'test_rmse': array([0.8751831 , 0.87727691, 0.87261096, 0.87189236, 0.87369257]),
 'fit_time': (52.37338995933533,
  53.92311882972717,
  53.565634965896606,
  55.4949688911438,
  57.72338080406189),
 'test_time': (2.4581949710845947,
  2.0448989868164062,
  2.477572202682495,
  2.3861241340637207,
  2.290879964828491)}

In [47]:
from surprise.model_selection import train_test_split
from surprise import Dataset
from surprise import accuracy



In [None]:
data = Dataset.load_from_df(ratings[['user_id', 'movie_id', 'rating']], reader)

In [43]:
trainset, testset = train_test_split(data, test_size=.25)
algo.fit(trainset)


<surprise.prediction_algorithms.matrix_factorization.SVD at 0x1109a9f28>

In [44]:
predictions = algo.test(testset)  # algo was already fitted on trainset in the previous cell

In [48]:
accuracy.rmse(predictions)

RMSE: 0.8768


0.8767757639859081
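
For reference, RMSE is the square root of the mean squared difference between actual and predicted ratings. A minimal sketch with made-up numbers (the function `rmse` here is my own helper, not Surprise's `accuracy.rmse`):

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Hypothetical actual and predicted ratings
print(rmse([4, 3, 5, 2], [3.5, 3.0, 4.0, 3.0]))  # 0.75
```

An RMSE of about 0.88 therefore means the predictions are off by a bit less than one star in this root-mean-square sense.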

#### The RMSE is 0.8768, which is much better than the memory-based approach

#### Train on the entire dataset

In [50]:
trainset = data.build_full_trainset()

In [51]:
algo = SVD()
algo.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x12424e630>

In [52]:
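# Raw ids must have the same type as in the DataFrame given to Dataset.load_from_df.
# user_id and movie_id are integers there, so the string ids below are unknown to the model;
# use uid = 196 and iid = 302 for a personalised estimate.
uid = str(196)  # raw user id, passed as a string as in the recorded run
iid = str(302)  # raw item id, passed as a string as in the recorded run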

# get a prediction for specific users and items.
pred = algo.predict(uid, iid, r_ui=4, verbose=True)

user: 196        item: 302        r_ui = 4.00   est = 3.58   {'was_impossible': False}


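#### A rating of 3.58 is estimated for user id 196 and movie id 302. Because the ids were passed as strings while the DataFrame stores integers, both are unknown to the model, so this estimate falls back to the global mean of the training ratings rather than being a personalised prediction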
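
To make the meaning of `est` concrete, here is a minimal sketch of the prediction rule of a biased matrix-factorization model, the family that Surprise's `SVD` belongs to: global mean plus user bias plus item bias plus the dot product of the latent vectors. This is my own illustration, not Surprise's code; the function `predict_rating` and every number in it are hypothetical.

```python
import numpy as np

def predict_rating(mu, user=None, item=None):
    """Estimate a rating as mu + b_u + b_i + q_i . p_u.

    `user` is a (bias, latent_vector) pair or None if the user is unknown;
    `item` likewise. Terms that need an unknown id are simply left out.
    """
    est = mu
    if user is not None:
        est += user[0]
    if item is not None:
        est += item[0]
    if user is not None and item is not None:
        est += float(np.dot(item[1], user[1]))
    return est

mu = 3.58                                   # hypothetical global mean rating
known_user = (0.2, np.array([0.5, -0.2]))   # hypothetical bias and factors
known_item = (-0.1, np.array([0.4, 0.1]))

print(predict_rating(mu, known_user, known_item))  # personalised estimate
print(predict_rating(mu))                          # unknown user and item
```

When neither id is known, only the global mean is left, which is why checking the id types before calling `predict` matters.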