## Movielens - 100K Dataset 

## Data Description


**Ratings**    -- The full u data set, 100000 ratings by 943 users on 1682 items.
              Each user has rated at least 20 movies.  Users and items are
              numbered consecutively from 1.  The data is randomly
              ordered. This is a comma separated list of 
	         user id | item id | rating | timestamp. 
              The time stamps are unix seconds since 1/1/1970 UTC   


**Movie Information**   -- Information about the items (movies); this is a comma separated
              list of
              movie id | movie title | release date | unknown | Action | Adventure | Animation |
              Children's | Comedy | Crime | Documentary | Drama | Fantasy |
              Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi |
              Thriller | War | Western |
              The last 19 fields are the genres, a 1 indicates the movie
              is of that genre, a 0 indicates it is not; movies can be in
              several genres at once.


**User Demographics**    -- Demographic information about the users; this is a comma
              separated list of
              user id | age | gender | occupation | zip code

## Table of Content

[1. Reading Dataset](#Reading-Dataset)

[2. Merging Movie information to ratings dataframe](#merge)

[3. Creating train and test data & setting evaluation metric](#eval)

[7. Importing Surprise & Loading Dataset](#dataload)

[8. Fitting SVD Model with 100 latent factors on train set and checking performance on test set](#svdfit)

[9. Examining Item and User matrices](#examine)

[10. Grid Search for better performance with SVD](#gridsearch)


## 1. Reading Dataset <a class="anchor" id="Reading-Dataset"></a>

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split

In [2]:
#Reading ratings file:
ratings = pd.read_csv('ratings.csv')

#Reading Movie Info File
movie_info = pd.read_csv('movie_info.csv')

## 2.  Merging Movie information to ratings dataframe <a class="anchor" id="merge"></a>

The movie names are contained in a separate file. Let's merge that data with ratings and store it in ratings dataframe. The idea is to bring movie title information in ratings dataframe as it would be useful later on

In [3]:
ratings = ratings.merge(movie_info[['movie id','movie title']], how='left', left_on = 'movie_id', right_on = 'movie id')

In [4]:
ratings.head()

Unnamed: 0,user_id,movie_id,rating,unix_timestamp,movie id,movie title
0,196,242,3,881250949,242,Kolya (1996)
1,186,302,3,891717742,302,L.A. Confidential (1997)
2,22,377,1,878887116,377,Heavyweights (1994)
3,244,51,2,880606923,51,Legends of the Fall (1994)
4,166,346,1,886397596,346,Jackie Brown (1997)


Lets also combine movie id and movie title separated by ': ' and store it in a new column named movie

In [5]:
ratings['movie'] = ratings['movie_id'].map(str) + str(': ') + ratings['movie title'].map(str)

In [6]:
ratings.columns

Index(['user_id', 'movie_id', 'rating', 'unix_timestamp', 'movie id',
       'movie title', 'movie'],
      dtype='object')

In [7]:
ratings.head()

Unnamed: 0,user_id,movie_id,rating,unix_timestamp,movie id,movie title,movie
0,196,242,3,881250949,242,Kolya (1996),242: Kolya (1996)
1,186,302,3,891717742,302,L.A. Confidential (1997),302: L.A. Confidential (1997)
2,22,377,1,878887116,377,Heavyweights (1994),377: Heavyweights (1994)
3,244,51,2,880606923,51,Legends of the Fall (1994),51: Legends of the Fall (1994)
4,166,346,1,886397596,346,Jackie Brown (1997),346: Jackie Brown (1997)


Keeping the columns movie, user_id and rating in the ratings dataframe and drop all others

In [8]:
ratings = ratings.drop(['movie id', 'movie title', 'movie_id','unix_timestamp'], axis = 1)

In [9]:
ratings = ratings[['user_id','movie','rating']]

In [10]:
ratings.head()

Unnamed: 0,user_id,movie,rating
0,196,242: Kolya (1996),3
1,186,302: L.A. Confidential (1997),3
2,22,377: Heavyweights (1994),1
3,244,51: Legends of the Fall (1994),2
4,166,346: Jackie Brown (1997),1


## 3. Creating Train & Test Data & Setting Evaluation Metric <a class="anchor" id="eval"></a>
In order to test how well we do with a given rating prediction method, we would first need to define our train and test set, we will only use the train set to build different models and evaluate our model using the test set.

In [11]:
#Assign X as the original ratings dataframe
X = ratings.copy()

#Split into training and test datasets
X_train, X_test = train_test_split(X, test_size = 0.25, random_state=42)

In [12]:
#Function that computes the root mean squared error (or RMSE)
def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

## 4. Importing Surprise & Loading Dataset <a class="anchor" id="dataload"></a>

In [13]:
#Importing functions to be used in this notebook from Surprise Package
from surprise import Dataset, Reader, SVD
from surprise.model_selection import GridSearchCV

To load a dataset from a pandas dataframe within Surprise, you will need the load_from_df() method. 
1. You will also need a `Reader` object and the `rating_scale` parameter must be specified. 
2. The dataframe here must have three columns, corresponding to the user (raw) ids, the item (raw) ids, and the ratings in this order. 
3. Each row thus corresponds to a given rating. This is not restrictive as you can reorder the columns of your dataframe easily.

In [14]:
#Reader object to import ratings from X_train
reader = Reader(rating_scale=(1, 5))

#Storing Data in surprise format from X_train
data = Dataset.load_from_df(X_train[['user_id','movie','rating']], reader)

## 5. Fitting SVD Model with 100 latent factors on train set and checking performance on test set <a class="anchor" id="svdfit"></a>

Here we first fit an arbitrary model with 100 factors and check performance on the test set

In [15]:
# Train a new SVD with 100 latent features (number was chosen arbitrarily)
model = SVD(n_factors=100)

#Build full trainset will essentially fits the knnwithmeans on the complete train set instead of a part of it
#like we do in cross validation
model.fit(data.build_full_trainset())

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x2301408f088>

In [16]:
#id pairs for test set
id_pairs = zip(X_test['user_id'], X_test['movie'])

#Making predictions for test set using predict method from Surprise
y_pred = [model.predict(uid = user, iid = movie)[3] for (user, movie) in id_pairs]

#Actual rating values for test set
y_true = X_test['rating']

# Checking performance on test set
rmse(y_true, y_pred)

0.9411966325472734

We clearly see merit in using SVD algorithm as it performs even better than the user based or item based techniques discussed earlier. Surprise provides ways to extract and read the user matrix and the item matrix separately as well.

## 6. Examining the user and item matrices <a class="anchor" id="examine"></a>

Surprise SVD stores the item matrix under the `model.qi` attribute and user matrix in `model.pu` attribute. First let us check the number of unique movies and users in the train data.

In [21]:
#Number of movies & users in train data
X_train.movie.nunique(), X_train.user_id.nunique()

(1642, 943)

We see that there are 1642 movies with 943 users in the train set we created. Now let us check the shape of use and item matrices separately.

In [22]:
# 1642*100 (movie matrix)  943*100 (user matrix) # 1642*943 (user movie matrix)
model.qi.shape, model.pu.shape,X_train.movie.nunique(), X_train.user_id.nunique() 

((1642, 100), (943, 100), 1642, 943)

We could also find the reduction in the dimensionality of our original rating matrix and calculate the percentage reduction from these shapes.

In [23]:
#Percentage reduction in size wrt user item matrix
(1642*943 - 943*100 - 1642*100)/(1642*943)*100

83.30541214642672

Turns out its massive 83% means there is significant reduction in memory usage as compared to the simple neighbourhood based techniques with similar performance. Now, lets check the factors for a given movie. 

Surprise assigns its own ids to the items and users and we need to first extract that in order to identify the indec for a given movie

In [24]:
#Extracting id for Toy story within qi matrix
movie_row_idx = model.trainset._raw2inner_id_items['1: Toy Story (1995)']
np.array(model.qi[movie_row_idx])

array([ 0.09191765, -0.32029703,  0.28216844,  0.0497652 ,  0.0067574 ,
        0.0176916 ,  0.04810066,  0.08444062,  0.10252175, -0.12763247,
        0.07391934, -0.0268707 , -0.01647942, -0.14861366, -0.10888956,
        0.26088914,  0.22984806, -0.06534786, -0.12639092, -0.04067131,
        0.10124812, -0.22559521,  0.10821559, -0.04150409, -0.04659582,
        0.08439676, -0.12463503,  0.13427298,  0.02585698, -0.00465552,
       -0.29874661, -0.08029293, -0.02654757,  0.0595778 , -0.08772409,
       -0.22445005,  0.25253278,  0.14755908, -0.14532826,  0.14273256,
        0.24059268, -0.25264822,  0.04254642, -0.15271079, -0.17550436,
        0.26088351,  0.04226278, -0.0506748 ,  0.245662  , -0.32658025,
       -0.21349492,  0.12001323, -0.12424371, -0.14022601,  0.109795  ,
        0.12746833, -0.13125641,  0.0139525 ,  0.31184841, -0.03946827,
       -0.14424481,  0.23298815, -0.15293573, -0.23914927, -0.00052753,
       -0.20653038,  0.18316886,  0.03364704, -0.19460192,  0.00

In [25]:
#Latent factors learnt from Funk SVD
ts_vector = np.array(model.qi[movie_row_idx])

In [26]:
#Extracting id for Wizard of Oz within qi matrix
movie_row_idx = model.trainset._raw2inner_id_items['132: Wizard of Oz, The (1939)']
woz_vector = np.array(model.qi[movie_row_idx])

In [28]:
#Checking the similarity in latent factors for wizard of oz & Toy Story
from scipy import spatial
1 - spatial.distance.cosine(ts_vector,woz_vector)
spatial.distance.cosine(ts_vector,woz_vector)

0.9691690296619897

## 7. Grid Search for better performance with SVD <a class="anchor" id="gridsearch"></a>
We will try to optimize for number of factors and check cross validation score with 5 folds. We also need to set the random state here as the initialization clearly depends on that.

In [29]:
#Defining the parameter grid for SVD and fixing the random state
param_grid = {'n_factors':list(range(1,50,5)), 'n_epochs': [5, 10, 20], 'random_state': [42]}

#Defining the grid search with the parameter grid and SVD algorithm optimizing for RMSE
gs = GridSearchCV(SVD, 
                  param_grid, 
                  measures=['rmse'], 
                  cv=5, 
                  n_jobs = -1)

#Fitting the mo
gs.fit(data)
 
#Printing the best score
print(gs.best_score['rmse'])

#Printing the best set of parameters
print(gs.best_params['rmse'])

0.9440714892501321
{'n_factors': 26, 'n_epochs': 20, 'random_state': 42}


In [30]:
#Fitting the model on train data with the best parameters
model = SVD(n_factors = 11, n_epochs = 20, random_state = 42)

#Build full trainset will essentially fits the SVD on the complete train set instead of a part of it
#like we do in cross validation for grid search
model.fit(data.build_full_trainset())

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x2301964de48>

In [31]:
#id pairs for test set
id_pairs = zip(X_test['user_id'], X_test['movie'])

#Making predictions for test set using predict method from Surprise
y_pred = [model.predict(uid = user, iid = movie)[3] for (user, movie) in id_pairs]

#Actual rating values for test set
y_true = X_test['rating']

# Checking performance on test set
rmse(y_true, y_pred)

0.9390125163978545