## Collaborative Filtering

- Ratings are sampled by number of users.

- Users filtered: only users who have rated a minimum of 20 movies.

- Movies filtered: only movies with at least 50 ratings.

- Surprise library:
    - Memory-based
    - Model-based

- Comparison between different algorithms:
    - 1000 users
    - cv = 5
    - metric: RMSE
    - selected the model with the least test_rmse

- Evaluations (separate notebook):
    - Precision/Recall@K, k = 5
    - Personalization score: 5-fold
    - Personal diversity: top 10 movies
    - General diversity: # of users = 20, top 10 recommendations
    - Average of average ratings: # of users = 20, top 10 recommendations

- Top two models compared:
    - SVDpp()
    - KNNBaseline

| Model | Personalization | Precision@10 | Recall@10 | Personal diversity | Global diversity | Average rating
| --- | --- | --- | --- | --- | --- | --- 
| SVDpp | 0.79 | 0.87 | 0.26 | 0.36 | 1495.15 | 4.07
| KNNBaseline | 0.8 | 0.84 | 0.27 | 0.42 | 86.2 | 3.6
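
The Precision/Recall@K columns can be computed per user from a list of predictions. The sketch below is not taken from the evaluation notebook: it assumes a relevance threshold of 3.5 stars and predictions shaped like Surprise's `(uid, iid, r_ui, est, ...)` tuples. The function name and the threshold are illustrative choices.

```python
from collections import defaultdict


def precision_recall_at_k(predictions, k=5, threshold=3.5):
    """Return per-user precision@k and recall@k dictionaries.

    predictions: iterable of tuples whose first four fields are
    (uid, iid, true_rating, estimated_rating).
    threshold: ratings >= threshold count as relevant (assumed value).
    """
    by_user = defaultdict(list)
    for pred in predictions:
        uid, _iid, true_r, est = pred[:4]
        by_user[uid].append((est, true_r))

    precisions, recalls = {}, {}
    for uid, ratings in by_user.items():
        # Rank this user's items by estimated rating, best first
        ratings.sort(key=lambda pair: pair[0], reverse=True)
        top_k = ratings[:k]

        n_relevant = sum(true_r >= threshold for _, true_r in ratings)
        n_recommended = sum(est >= threshold for est, _ in top_k)
        n_hits = sum(est >= threshold and true_r >= threshold
                     for est, true_r in top_k)

        # Define the ratio as 0 when its denominator is empty
        precisions[uid] = n_hits / n_recommended if n_recommended else 0.0
        recalls[uid] = n_hits / n_relevant if n_relevant else 0.0
    return precisions, recalls


toy = [("u1", "a", 4.0, 4.5), ("u1", "b", 2.0, 4.0), ("u1", "c", 5.0, 3.0)]
precisions, recalls = precision_recall_at_k(toy, k=2)
```

Averaging the per-user values gives one number per model, which is presumably how a single table entry is obtained.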

# Model Selected: KNNBaseline

- Build the train dataset using users_ratings[['userId', 'movieId', 'rating']].

- The model is trained on the complete data (user-item).

- Build the test set using build_anti_testset: it creates all the user-item combinations
  not present in the actual dataset (trainset). It returns a list of (uid, iid, r_ui) tuples.
  Here the user and the item are known but the true rating r_ui is unknown, so r_ui is
  filled with the global mean of all ratings.

- Predictions: an estimated rating is predicted for each user-item combination in the test set.

- Recommendation:
    - create user_profile: filter all the user-item ratings for the given user_id from the test set
    - recommend the top_n movies to the user
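
To make the anti-testset step concrete, here is a plain-Python sketch of the idea. It is not Surprise's implementation: `anti_testset_sketch` is a hypothetical helper that works on `(uid, iid, rating)` tuples and fills the unknown rating with the global mean, as described above.

```python
def anti_testset_sketch(ratings, fill=None):
    """List every (user, item) pair that has no rating yet.

    ratings: list of (uid, iid, rating) tuples (the train data).
    fill: placeholder for the unknown true rating; defaults to the
          global mean of all known ratings.
    """
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    seen = {(u, i) for u, i, _ in ratings}

    if fill is None:
        fill = sum(r for _, _, r in ratings) / len(ratings)

    # All user-item combinations minus the ones already rated
    return [(u, i, fill) for u in users for i in items if (u, i) not in seen]


known = [(1, 10, 4.0), (1, 20, 2.0), (2, 10, 3.0)]
unseen = anti_testset_sketch(known)
```

With U users and I items this list has up to U x I entries, which is why the saved predictions file for 1000 users is so large.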
        



In [34]:
import pandas as pd
import os
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as datetime
import operator
import scipy.spatial.distance as distance
from sklearn import metrics 
import random
import pickle
import time
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import fastparquet
from scipy.sparse import csr_matrix
from surprise import SVD,Dataset,Reader
from collections import defaultdict
from surprise.model_selection import cross_validate,KFold
from surprise import SVDpp, SlopeOne, NMF, NormalPredictor, KNNBaseline, KNNBasic, KNNWithMeans, KNNWithZScore, BaselineOnly, CoClustering

# Reading all the files

In [16]:
df = pd.read_parquet('processed_files/processed_df.parq')
ratings = pd.read_parquet('processed_files/ratings_sample_useradd.parq')
ratings = ratings.reset_index()
movies_raitings = pd.read_parquet('processed_files/movies_ratings.parq')
movies_raitings = movies_raitings.rename(columns={"avg": "Average_Ratings"})

In [17]:
with open('processed_files/sparse_metadata', "rb") as f:
    cols = pickle.load(f)
    movieIds = pickle.load(f)

# Filtering movies rated at least by 50 users

In [18]:
# Note: '>' keeps movies with strictly more than 50 ratings; use '>=' to include exactly 50
filtered_movies = movies_raitings[movies_raitings['cnt']>50].movieId.values
moviesfilters = ratings[ratings.movieId.isin(filtered_movies)]

print(f' Number of movies at least rated by 50 users: {len(filtered_movies)}')

 Number of movies at least rated by 50 users: 11840
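
The two thresholds from the summary (users with at least 20 ratings, movies with at least 50 ratings) can be sketched without pandas as below. This is an illustration under assumptions, not the notebook's code: `filter_by_min_count` is a hypothetical helper, both counts are taken on the unfiltered input, and the comparisons are inclusive (`>=`), whereas the cell above uses a strict `> 50`.

```python
from collections import Counter


def filter_by_min_count(ratings, min_user_ratings=20, min_movie_ratings=50):
    """Keep ratings whose user and movie both meet a minimum rating count.

    ratings: list of (userId, movieId, rating) tuples.
    Counts are computed once on the unfiltered input (an assumption).
    """
    user_counts = Counter(u for u, _, _ in ratings)
    movie_counts = Counter(m for _, m, _ in ratings)
    return [
        (u, m, r)
        for u, m, r in ratings
        if user_counts[u] >= min_user_ratings
        and movie_counts[m] >= min_movie_ratings
    ]


toy = [(1, "a", 4), (1, "b", 3), (2, "a", 5), (3, "c", 2)]
kept = filter_by_min_count(toy, min_user_ratings=2, min_movie_ratings=2)
```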


# Creating users list

In [24]:
users_list = ratings.groupby('userId')['userId'].count().reset_index(name="rating_count")
print(f' Number of users who rated at least 20 movies: {len(users_list)}')


 Number of users who rated at least 20 movies: 40634


array([     2,      8,     14, ..., 162523, 162524, 162534])

# Model trained using n_users ratings data

In [25]:
random.seed(42)
n_users = 1000
users_list = set(users_list.userId.unique())
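# random.sample() no longer accepts a set (Python 3.11+); sorting also fixes the order the sample is drawn from
random_users = random.sample(sorted(users_list), n_users)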
users_ratings = moviesfilters[moviesfilters.userId.isin(random_users)]
print(f' ratings for {n_users} is {len(users_ratings)}')

 ratings for 1000 is 150062


# Filtering movie data for getting movie metadata

In [26]:
movies_filters = np.unique(users_ratings.movieId.values)
movie_rating = movies_raitings[movies_raitings.movieId.isin(movies_filters)]

In [27]:
users_ratings = users_ratings[['userId','movieId','rating']]

# The section below trains the model. The model is already trained and saved.

# getting the min max rating 

In [9]:
min_rat = users_ratings.rating.min()
max_rat = users_ratings.rating.max()

# specify the range of rating

In [10]:
reader = Reader(rating_scale=(min_rat,max_rat))

# Loading users_ratings using load_from_df for model comparison: 
The columns must correspond to user id, item id and ratings (in that order)

In [11]:
data = Dataset.load_from_df(users_ratings, reader)

## Models compared 

Models are compared using 1000 users ratings data

- SVD()

- SVDpp()

- SlopeOne() 

- NMF()

- NormalPredictor()

- KNNBaseline()

- KNNBasic()

- KNNWithMeans()

- KNNWithZScore()

- BaselineOnly()

- CoClustering()

The block below was executed once and is now commented out. The results are stored and can be read from the parq file.

In [None]:
"""
results_test_df = []
# Iterate over all surprise algorithms

for algorithm in [SVD(), SVDpp(), SlopeOne(), NMF(), NormalPredictor(), KNNBaseline(), KNNBasic(), KNNWithMeans(), KNNWithZScore(), BaselineOnly(), CoClustering()]:
    # Perform cross validation cv =5
    results = cross_validate(algorithm, data, measures=['RMSE'], cv=5, verbose=False)
    
    # Get results & append into results_test_df
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    # Series.append was removed in pandas 2.0, so pd.concat is used instead
    tmp = pd.concat([tmp, pd.Series([str(algorithm).split(' ')[0].split('.')[-1]], index=['Algorithm'])])
    results_test_df.append(tmp)
    
pd.DataFrame(results_test_df).set_index('Algorithm').sort_values('test_rmse')  

"""

# df_test_score contains the comparison statistics of the above models 

In [324]:
#df_test_score = pd.DataFrame(results_test_df).set_index('Algorithm').sort_values('test_rmse')  

# Storing the comparison results

In [325]:
#df_test_score.to_parquet('test_rmse_score_comparison.parq', engine = 'fastparquet', compression = 'GZIP')

# getting the model with least rmse

In [7]:
df_test_score = pd.read_parquet('processed_files/test_rmse_score_comparison.parq')
df_test_score

| Algorithm | test_rmse | fit_time | test_time |
| --- | --- | --- | --- |
| SVDpp | 0.870838 | 750.548304 | 8.109057 |
| KNNBaseline | 0.878081 | 2.111808 | 11.520009 |
| BaselineOnly | 0.878819 | 0.696858 | 0.565961 |
| SVD | 0.884845 | 6.801081 | 0.555114 |
| KNNWithZScore | 0.893763 | 1.255122 | 7.649508 |
| KNNWithMeans | 0.895202 | 1.20556 | 8.98496 |
| SlopeOne | 0.895398 | 247.660159 | 18.841247 |
| NMF | 0.920075 | 27.935698 | 1.137187 |
| CoClustering | 0.939221 | 4.204486 | 0.627891 |
| KNNBasic | 0.943295 | 1.103818 | 8.500007 |
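
Each test_rmse value is the RMSE averaged over the 5 cross-validation folds. The sketch below shows that computation in plain Python; it does not use Surprise. `kfold_rmse` and `global_mean_model` are hypothetical names, and the global-mean predictor is only a stand-in for a real algorithm.

```python
import random
from math import sqrt


def rmse(pairs):
    """Root mean squared error over (true, estimated) rating pairs."""
    pairs = list(pairs)
    return sqrt(sum((t - e) ** 2 for t, e in pairs) / len(pairs))


def kfold_rmse(ratings, fit, k=5, seed=42):
    """Average test RMSE over k folds.

    ratings: list of (uid, iid, rating) tuples.
    fit: function taking the training ratings and returning
         a predictor (uid, iid) -> estimated rating.
    """
    idx = list(range(len(ratings)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint index sets

    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [ratings[i] for i in idx if i not in held_out]
        predict = fit(train)
        scores.append(rmse(
            (ratings[i][2], predict(ratings[i][0], ratings[i][1]))
            for i in fold
        ))
    return sum(scores) / k


def global_mean_model(train):
    """Stand-in model: always predicts the mean training rating."""
    mean = sum(r for _, _, r in train) / len(train)
    return lambda uid, iid: mean


constant = [(u, 1, 4.0) for u in range(10)]
score = kfold_rmse(constant, global_mean_model, k=5)
```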


# Comparing the similarity option

In [26]:
# Comparing the similarity option 

item_based = {'name': 'cosine',
               'user_based': False  # compute  similarities between items
               }

user_based = {'name': 'pearson_baseline',
               'shrinkage': 0  # no shrinkage
               }


In [31]:
results_comp = []

for algorithm in [KNNBaseline(sim_options=user_based), KNNBaseline(sim_options=item_based)]:
    # Perform cross validation cv =5
    results = cross_validate(algorithm, data, measures=['RMSE','MAE'], cv=5, verbose=False)
    
    # Get results & append into results_test_df
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    # Series.append was removed in pandas 2.0, so pd.concat is used instead
    tmp = pd.concat([tmp, pd.Series([str(algorithm).split(' ')[0].split('.')[-1]], index=['Algorithm'])])
    results_comp.append(tmp)
    
pd.DataFrame(results_comp).set_index('Algorithm').sort_values('test_rmse')  

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing simila

| Algorithm | test_rmse | test_mae | fit_time | test_time |
| --- | --- | --- | --- | --- |
| KNNBaseline | 0.862782 | 0.656062 | 1.125165 | 3.005531 |
| KNNBaseline | 0.872462 | 0.667737 | 12.736901 | 9.376762 |


# Final Model KNNBaseline

- User-User 
- using pearson_baseline

### The model code has been commented out; the fitted model is dumped into a pickle file.

In [12]:
'''
user_based = {'name': 'pearson_baseline',
               'shrinkage': 0  # no shrinkage
               }

KNN = KNNBaseline(sim_options=user_based)

'''

## Building a train set using the complete data

In [13]:
#trainset = data.build_full_trainset()

## fit the trainset

In [14]:
#KNN.fit(trainset)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBaseline at 0x7fe4a2a0e310>

## save the model 

In [15]:
#filename = 'processed_files/KNNBaseline_model_pearson_baseline.pkl'
#pickle.dump(KNN, open(filename, 'wb'))

## load the model from the dump

In [8]:
#filename = 'processed_files/KNNBaseline_model_pearson_baseline.pkl'
#KNNBas = pickle.load(open(filename, 'rb'))
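
The save and load cells call `open()` without closing the file. A small sketch with context managers closes the handle even if pickling fails. `save_model` and `load_model` are hypothetical helpers, and a plain dict stands in for the fitted model.

```python
import os
import pickle
import tempfile


def save_model(model, filename):
    """Pickle a fitted model; the file is closed on exit from the block."""
    with open(filename, "wb") as f:
        pickle.dump(model, f)


def load_model(filename):
    """Load a pickled model from disk."""
    with open(filename, "rb") as f:
        return pickle.load(f)


# Round trip with a stand-in object in a temporary directory
stand_in = {"name": "pearson_baseline", "shrinkage": 0}
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
save_model(stand_in, path)
restored = load_model(path)
```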

## Recommendation:


1) testset: create the user-item combinations that are not available in the train set.

2) predictions: predict ratings for the user-item pairs in the test set.

3) Save all the predictions into the predicted_ratings DataFrame, which is written to KNN_predictions_df.parq.



In [17]:
#testset = trainset.build_anti_testset()

In [18]:
#predictions = KNNBas.test(testset)

In [19]:
#predicted_ratings = pd.DataFrame(predictions)

In [20]:
#predicted_ratings.to_parquet('Predictions/KNN_predictions_df.parq', compression='gzip')

# Recommendation starts here 

In [37]:
predicted_ratings = pd.read_parquet('Predictions/KNN_predictions_df.parq')

In [38]:
predicted_ratings[['uid','iid','est']].rename(columns = {'est':'prediction', 'uid':'userId', 'iid':'movieId'})

|  | userId | movieId | prediction |
| --- | --- | --- | --- |
| 0 | 75 | 2 | 3.055887 |
| 1 | 75 | 3 | 3.112601 |
| 2 | 75 | 7 | 3.104419 |
| 3 | 75 | 11 | 3.766334 |
| 4 | 75 | 21 | 3.148158 |
| ... | ... | ... | ... |
| 57627410 | 162514 | 30996 | 3.448362 |
| 57627411 | 162514 | 47330 | 3.448362 |
| 57627412 | 162514 | 160684 | 3.340844 |
| 57627413 | 162514 | 65588 | 3.469857 |


In [12]:
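def collaborative_filtering_model(userId, movie_rating, predicted_ratings, top_n):
    """
    Recommend the top_n movies with the highest predicted rating for a user.

    userId: id of the user to recommend for
    movie_rating: movie metadata (movieId, title_eng, Average_Ratings, cnt)
    predicted_ratings: predictions with the columns uid, iid and est
    top_n: number of movies to recommend
    """
    # Keep only the predictions made for this user
    single_user = predicted_ratings[predicted_ratings['uid'] == userId]
    # Ids of the top_n movies by estimated rating
    top_nmovies = single_user.sort_values(by='est', ascending=False)[:top_n]['iid']

    # Attach the movie metadata to the recommended ids
    recommendations = pd.merge(top_nmovies, movie_rating, how='left', left_on='iid', right_on='movieId')
    recommendations = recommendations[['movieId', 'title_eng', 'Average_Ratings', 'cnt']]

    return recommendations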

# Get top_n recommendation for a user

1) Give userId

2) top_n: # of recommendations
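
The same top-n selection can be done for every user at once, without sorting each user's full list. A sketch, assuming predictions shaped like Surprise's `(uid, iid, r_ui, est, ...)` tuples; `get_top_n` is a hypothetical helper and is not used elsewhere in this notebook.

```python
import heapq
from collections import defaultdict


def get_top_n(predictions, n=10):
    """Map each user id to its n best (item id, estimate) pairs."""
    per_user = defaultdict(list)
    for pred in predictions:
        uid, iid, _r_ui, est = pred[:4]
        per_user[uid].append((iid, est))

    # nlargest avoids a full sort when n is much smaller than the list
    return {
        uid: heapq.nlargest(n, items, key=lambda pair: pair[1])
        for uid, items in per_user.items()
    }


toy = [
    (75, 2, 3.5, 3.05),
    (75, 11, 3.5, 3.77),
    (75, 3, 3.5, 3.11),
    (80, 2, 3.5, 4.0),
]
top = get_top_n(toy, n=2)
```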

In [33]:
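# random.sample() no longer accepts a set (Python 3.11+), so sample from a sorted list of unique ids
user_id_list = sorted(set(users_ratings.userId.values))
random_userId = random.sample(user_id_list, 1)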
top_n = 10
recommendations = collaborative_filtering_model(random_userId[0],movie_rating,predicted_ratings,top_n)


In [None]:
recommendations