# Collaborative Filtering Recommender System 

![](https://images.unsplash.com/photo-1560169897-fc0cdbdfa4d5?ixlib=rb-1.2.1&ixid=MXwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHw%3D&auto=format&fit=crop&w=2552&q=80)


In the last notebook we created a content-based recommender system that recommends similar movies to the user. One disadvantage of that algorithm is that it always recommends content similar to what the user has already watched, so it does not help the user discover new content. It may also misjudge what the user likes. In this notebook, we will work with a collaborative filtering algorithm.

## What does Collaborative Filtering mean?
Collaborative filtering recommends a list of movies based on people who like the same things as you, but who also like something that you haven’t yet consumed. It focuses on the relationship between users and items. In the item-based case, the similarity of two items is determined by the similarity of the ratings given to them by the users who have rated both items.

There are two types of collaborative filtering:
1. User-based, which measures the similarity between the target user and other users.
2. Item-based, which measures the similarity between the items that the target user rates or interacts with and other items.
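
To make the two variants concrete, here is a minimal sketch on a made-up rating matrix (the values are invented for illustration, and 0 stands for "not rated"). User-based filtering compares rows of the matrix, while item-based filtering compares columns.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy ratings: rows are users, columns are movies, 0 = not rated.
toy_ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 0, 1],
    [1, 0, 5, 4],
])

# User-based: similarity between rows (users).
user_sim = cosine_similarity(toy_ratings)

# Item-based: similarity between columns (movies), so transpose first.
item_sim = cosine_similarity(toy_ratings.T)

print(user_sim.shape)  # (3, 3)
print(item_sim.shape)  # (4, 4)
print(user_sim[0, 1] > user_sim[0, 2])  # True
```

In this toy data the first two users rate the same movies in a similar way, so they are closer to each other than either is to the third user.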

In this notebook we will work with the user-based recommendation technique, as follows:
1. Import the data and split it into training and testing data.
2. Find the 20 nearest neighbors using the cosine metric.
3. Collect the movies watched by the neighbors but not yet watched by the target user.
4. Calculate the user's predicted ratings.
5. Create recommendations with the top 20 predicted movies.

We will start by importing the data and splitting it into training and testing sets. We will use the testing data to evaluate the recommendations later on.

In [None]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from sklearn.metrics import mean_squared_error

In [None]:
movies=pd.read_csv('../input/movies-dataset/movies.csv')
ratings=pd.read_csv('../input/movies-dataset/ratings.csv')

In [None]:
# check for missing values
ratings.isnull().sum()

We will round the ratings to integer values.

In [None]:
ratings['rating']=round(ratings['rating'])

In [None]:
ratings.head(10)

In [None]:
from sklearn.model_selection import train_test_split
training, testing = train_test_split(ratings,
                                     stratify=ratings['userId'],
                                     test_size=0.20,
                                     random_state=42)


We fill missing ratings with zeros so that we can apply nearest neighbors, but we must replace the zeros with NaN again when we calculate the means.
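
As a small illustration of why the zeros matter, the sketch below pivots a made-up ratings table into a user-by-movie matrix and compares the mean computed with zeros against the mean over rated movies only. The toy values and the `toy_` names are assumptions for illustration, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings in the same long format as ratings.csv.
toy = pd.DataFrame({
    'userId':  [1, 1, 2, 2, 2],
    'movieId': [10, 20, 10, 20, 30],
    'rating':  [4.0, 2.0, 5.0, 3.0, 1.0],
})

# One row per user, one column per movie; unrated pairs become NaN.
toy_matrix = toy.pivot(index='userId', columns='movieId', values='rating')

# Zero-filled copy, suitable as input to nearest neighbors.
toy_filled = toy_matrix.fillna(0)

# Mean with zeros counted as ratings vs. mean over rated movies only.
mean_with_zeros = toy_filled.loc[1].mean()
mean_rated_only = toy_filled.loc[1].replace(0, np.nan).mean()

print(mean_with_zeros)  # 2.0
print(mean_rated_only)  # 3.0
```

User 1 rated only two movies (4 and 2), so the zero for the unrated movie drags the mean down from 3.0 to 2.0.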

In [None]:
df= training.pivot(index='userId',columns='movieId',values='rating')
df.fillna(0,inplace=True)

We will use cosine as the metric, so the nearest neighbors are computed from the cosine similarity between users.
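
The sketch below shows, on a made-up matrix, what `kneighbors` returns with the cosine metric. The values and the `toy_` names are assumptions for illustration. Note that scikit-learn returns cosine distances, so the similarity is recovered as 1 minus the distance, and that a user queried against the fitted data comes back as their own closest match.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical zero-filled user-by-movie matrix (rows are users).
toy_matrix = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 0.0, 1.0],
    [1.0, 0.0, 5.0, 4.0],
])

toy_knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=2)
toy_knn.fit(toy_matrix)

# Query with the first user's row; the closest match is that same row.
distances, indices = toy_knn.kneighbors(toy_matrix[[0]], n_neighbors=2)

# Cosine distance = 1 - cosine similarity, so convert back.
similarities = 1 - distances[0]

print(indices[0])       # [0 1]
print(similarities[1])  # about 0.976
```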

In [None]:
from sklearn.neighbors import NearestNeighbors
model_knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=20, n_jobs=-1)
model_knn.fit(df)

We will use this formula to calculate the predicted rating:
![Predicted rating formula](https://i.ibb.co/hf2wDdf/Screen-Shot-2021-03-04-at-11-35-24-AM.png)
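
For readers who cannot load the image, the following is a reconstruction inferred from the code in the next cell, not a transcription of the image. The notation is an assumption: $\hat{r}_{u,i}$ is the predicted rating of user $u$ for movie $i$, $\bar{r}_u$ is the mean rating of user $u$, $N(u)$ is the set of nearest neighbors of $u$, and $\operatorname{sim}(u,v)$ is the cosine similarity between users $u$ and $v$.

$$
\hat{r}_{u,i} = \bar{r}_u + \frac{\sum_{v \in N(u)} \operatorname{sim}(u,v)\,\left(r_{v,i} - \bar{r}_v\right)}{\sum_{v \in N(u)} \operatorname{sim}(u,v)}
$$

In the code, neighbors who have not rated movie $i$ contribute nothing to the numerator, while the denominator sums the similarities of all neighbors.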
    
    

In [None]:
def getRecommendations(userId):
    # movies the target user has already rated (non-zero entries)
    user_movies=df.columns[df.loc[userId].to_numpy().nonzero()[0]]
    # mean over rated movies only: zeros stand for "not rated"
    user_mean=df.loc[userId].replace(0, np.nan).mean()
    # query with the target user's own row, looked up by userId
    distances, indices = model_knn.kneighbors([df.loc[userId]], n_neighbors=20)
    # drop the first result, which is expected to be the user itself
    neighbors=indices[0][1:]
    # find the similarity between user and neighbors (1 - cosine distance)
    sim=1-distances[0][1:]
    
    # find movies rated by neighbors and not yet rated by the user
    neighbors_movies=df.iloc[neighbors,:].sum()
    neighbors_movies=neighbors_movies.loc[neighbors_movies>0].index
    neighbors_movies= set(neighbors_movies)-set(user_movies)
    
    # neighbors' ratings with zeros turned back into NaN, and their mean ratings
    neighbors_ratings=df.iloc[neighbors]
    neighbors_ratings = neighbors_ratings.replace(0, np.nan)
    neighbors_mean= neighbors_ratings.mean(axis=1)
    
    reco_list=[]
    for movie in neighbors_movies:
        neighbors_rating=neighbors_ratings.loc[:,movie]
        # similarity-weighted sum of mean-centered ratings (NaN entries are skipped)
        weighted_sum=((neighbors_rating-neighbors_mean)*sim).sum()
        prediction=user_mean+(weighted_sum/sim.sum())
        reco_list.append([movie,round(prediction)])
        
    # sort movies by estimated rating, highest first
    reco_df = pd.DataFrame(reco_list, columns=['movieId','est_rating']).sort_values(by='est_rating',ascending=False)
    return reco_df    
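
To see the arithmetic for a single movie, here is a worked example with made-up numbers for one target user and three neighbors. All values are hypothetical. It follows the same steps as the loop in `getRecommendations`: mean-center each neighbor's rating, weight it by similarity, skip neighbors who have not rated the movie, and divide by the total similarity.

```python
import numpy as np

# Hypothetical numbers for one target user and three neighbors.
user_mean = 3.5                                  # target user's mean rating
sim = np.array([0.9, 0.6, 0.5])                  # similarity to each neighbor
neighbor_means = np.array([4.0, 3.0, 2.5])       # each neighbor's mean rating
neighbor_ratings = np.array([5.0, 2.0, np.nan])  # ratings for one movie; NaN = not rated

# Weighted sum of mean-centered ratings; nansum skips the neighbor without a rating.
weighted_sum = np.nansum((neighbor_ratings - neighbor_means) * sim)

# Normalize by the total similarity of all neighbors, as getRecommendations does.
prediction = user_mean + weighted_sum / sim.sum()

print(round(prediction, 3))  # 3.65
```

The first neighbor rated the movie one point above their own mean and the second one point below, so the prediction ends up only slightly above the target user's mean.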

### Surprise KNNWithMeans Model

There is a well-known recommendation library called Surprise. Its KNNWithMeans model does almost the same thing as our model. I will train it and compare its results with the results from our model.

In [None]:
from surprise import KNNWithMeans
from surprise import Dataset
from surprise import accuracy
from surprise.model_selection import train_test_split
from surprise import Reader
reader = Reader()
ratingsSet = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

sim_options = {'name': 'cosine',
               'user_based': True  # compute similarities between users
               }
# Note: this split is made independently of the scikit-learn split above,
# so some rows of `testing` may end up in `train_ratings`.
train_ratings, test_ratings = train_test_split(ratingsSet, test_size=.2, random_state = 42)
# Set user_based to True for user-based or False for item-based collaborative filtering
algo = KNNWithMeans(k=20, sim_options=sim_options)
algo.fit(train_ratings)

We will use user 22 for testing.

In [None]:
test_user_id=22
recommendations=getRecommendations(userId=test_user_id)
recommendations.head(10)

In [None]:
# keep only recommended movies that the test user actually rated in the testing set
result=recommendations.merge(testing[testing['userId']==test_user_id],on='movieId',how='inner')

In [None]:
items = result['movieId'].values
topn = []
for iid in items:
    # predict the rating of the same test user with the Surprise model
    est = algo.predict(test_user_id, iid).est
    topn.append([iid,est])
result_surprise = pd.DataFrame(topn, columns=['movieId','est_sur_rating'])  
result_surprise['est_sur_rating']=round(result_surprise['est_sur_rating'])

In [None]:
result=result.merge(result_surprise,on='movieId',how='inner')

In [None]:
result.head(10)

In [None]:
# RMSE of our model (the squared=False argument is not available in recent scikit-learn versions)
np.sqrt(mean_squared_error(result['rating'], result['est_rating']))

In [None]:
# RMSE of the Surprise model
np.sqrt(mean_squared_error(result['rating'], result['est_sur_rating']))
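
As a reminder of what the two cells above measure, here is a minimal sketch of the RMSE computed by hand on made-up ratings (the numbers are hypothetical). A lower value means the predictions are closer to the actual ratings.

```python
import numpy as np

# Hypothetical true and predicted ratings for four movies.
actual = np.array([4.0, 3.0, 5.0, 2.0])
predicted = np.array([3.0, 3.0, 4.0, 4.0])

# RMSE: square root of the mean of the squared errors.
rmse = np.sqrt(np.mean((actual - predicted) ** 2))

print(rmse)  # about 1.225
```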

The RMSE of the Surprise predictions is larger than that of our model for this user, which means that our model gives more accurate results for this test user. This is definitely not always the case: try changing the test user id and check the result.

The final recommendations for the user are:

In [None]:
movies.merge(recommendations,on='movieId',how='inner').sort_values(by='est_rating',ascending=False)[:30]
