# Collaborative Filtering

This type of filtering takes a user's preferences and compares them with the preferences of other users. Collaborative filtering is further divided into two categories:
- **User-Based Filtering**: finds users who are similar to the target user and recommends what those users liked.
- **Item-Based Filtering**: finds items that are similar to the items the target user likes.

User-based filtering is vulnerable to fake user accounts, which makes its recommendations less reliable. This notebook focuses on **Item-Based Filtering** and recommends the top 5 movies for a given movie.
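As a rough illustration of the item-based idea, the sketch below compares items by the ratings they received. The ratings matrix is made up for the example and `item_similarity` is a hypothetical helper, not part of this notebook's pipeline.

```python
import numpy as np

# Toy ratings matrix (made-up numbers): rows are items, columns are users.
# A 0 means the user has not rated the item.
ratings_matrix = np.array([
    [5.0, 4.0, 0.0, 0.0],   # item 0
    [4.0, 5.0, 0.0, 1.0],   # item 1
    [0.0, 0.0, 5.0, 4.0],   # item 2
])

def item_similarity(matrix):
    """Cosine similarity between every pair of item rows."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    unit = matrix / norms          # scale every row to unit length
    return unit @ unit.T           # dot products of unit vectors

sim = item_similarity(ratings_matrix)

liked = 0                                   # the item the target user likes
ranked = np.argsort(-sim[liked])            # most similar first
most_similar = [int(i) for i in ranked if i != liked]
print(most_similar)
```

Items 0 and 1 were rated highly by the same users, so item 1 ranks ahead of item 2 as a recommendation for someone who liked item 0.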

In [39]:
#importing required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [40]:
#loading the dataset.
ratings = pd.read_csv('ratings_small.csv')
movies = pd.read_csv('tmdb_5000_movies.csv')

In [41]:
ratings.head()

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 31 | 2.5 | 1260759144 |
| 1 | 1 | 1029 | 3.0 | 1260759179 |
| 2 | 1 | 1061 | 3.0 | 1260759182 |
| 3 | 1 | 1129 | 2.0 | 1260759185 |
| 4 | 1 | 1172 | 4.0 | 1260759205 |


In [42]:
ratings.shape

(100004, 4)

To make the data easier to work with, we pivot it into a new dataframe in which each row is a unique `movieId`, each column is a unique `userId`, and each cell holds the rating (NaN where the user has not rated the movie).

In [43]:
df = ratings.pivot(index='movieId',columns='userId', values='rating')
df

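| movieId / userId | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 662 | 663 | 664 | 665 | 666 | 667 | 668 | 669 | 670 | 671 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | NaN | NaN | NaN | NaN | NaN | NaN | 3.0 | NaN | 4.0 | NaN | ... | NaN | 4.0 | 3.5 | NaN | NaN | NaN | NaN | NaN | 4.0 | 5.0 |
| 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 5.0 | NaN | NaN | 3.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN | NaN | 4.0 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 3.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 3.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 161944 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 162376 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 162542 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 162672 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |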


In [44]:
#Filling all the Null values with 0.
df.fillna(0, inplace=True)
df

| movieId / userId | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 662 | 663 | 664 | 665 | 666 | 667 | 668 | 669 | 670 | 671 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | 4.0 | 0.0 | ... | 0.0 | 4.0 | 3.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | 5.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 5.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 161944 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 162376 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 162542 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 162672 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


**Filtering the data based on popular movies and highly engaged users**
- Movies with very few ratings give the model too little evidence to compute a credible similarity.
- Similarly, users who have rated very few movies reveal little about their preferences.
- Hence, we filter the dataset on the following criteria:
  - To qualify as a **popular movie**, a movie must have been rated by **more than 10 users**.
  - To qualify as a **highly engaged user**, a user must have rated **more than 50 movies**.
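The same kind of thresholds can also be applied to the long-format ratings table before pivoting. The sketch below uses a made-up six-row table and hypothetical cut-offs of 1 (`min_movie_ratings`, `min_user_ratings`) so that the effect is visible on tiny data; it is an alternative illustration, not the approach this notebook takes.

```python
import pandas as pd

# Made-up ratings in the same long format as ratings_small.csv.
toy_ratings = pd.DataFrame({
    'userId':  [1, 1, 1, 2, 2, 3],
    'movieId': [10, 20, 30, 10, 20, 10],
    'rating':  [4.0, 3.0, 5.0, 2.0, 4.5, 3.5],
})

# Hypothetical thresholds, chosen small to suit the toy data.
min_movie_ratings = 1
min_user_ratings = 1

# Count how many ratings each movie received and each user gave.
movie_counts = toy_ratings.groupby('movieId')['rating'].count()
user_counts = toy_ratings.groupby('userId')['rating'].count()

# Keep only the ids whose count is strictly above the threshold.
popular_movies = movie_counts[movie_counts > min_movie_ratings].index
engaged_users = user_counts[user_counts > min_user_ratings].index

filtered = toy_ratings[
    toy_ratings['movieId'].isin(popular_movies)
    & toy_ratings['userId'].isin(engaged_users)
]
print(filtered)
```

Note that `>` is a strict comparison: a movie with exactly `min_movie_ratings` ratings is dropped, whereas `>=` would keep it.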

In [45]:
#Computing the number of ratings each user has given
users = ratings.groupby('userId')['rating'].agg('count')
users

userId
1       20
2       76
3       51
4      204
5      100
      ... 
667     68
668     20
669     37
670     31
671    115
Name: rating, Length: 671, dtype: int64

From the above result, we can see that user 1 has rated 20 movies, user 2 has rated 76 movies, and so on.\
We will keep only the users who have rated more than 50 movies.

In [46]:
#Computing the number of ratings a movie has received
movies_filtered = ratings.groupby('movieId')['rating'].agg('count')
movies_filtered

movieId
1         247
2         107
3          59
4          13
5          56
         ... 
161944      1
162376      1
162542      1
162672      1
163949      1
Name: rating, Length: 9066, dtype: int64

In [47]:
#Filtering based on our conditions
df = df.loc[movies_filtered[movies_filtered > 10].index,:] 
df = df.loc[:,users[users > 50].index]

In [48]:
df.shape

(2083, 421)

We can see a significant reduction in the dimensions of our dataframe. The dataframe now contains more reliable information on which we can make our predictions.

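**Handling data sparsity** <br>
Most entries in our matrix are zero, because each user rates only a small fraction of the movies. Storing the matrix in SciPy's CSR (compressed sparse row) format keeps only the non-zero entries, which saves memory and computation; with a dense matrix, a large dataset could exhaust the available computational resources.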
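To make the storage argument concrete, here is a small sketch that compares the memory used by a dense array and by its CSR equivalent. The matrix is randomly generated for the example (its size and fill rate are assumptions, not this notebook's data).

```python
import numpy as np
from scipy.sparse import csr_matrix

# Made-up 1000 x 500 matrix with at most 5000 non-zero ratings.
rng = np.random.default_rng(0)
dense = np.zeros((1000, 500))
rows = rng.integers(0, 1000, size=5000)
cols = rng.integers(0, 500, size=5000)
dense[rows, cols] = 4.0

sparse = csr_matrix(dense)

# Fraction of cells that are zero.
sparsity = 1.0 - sparse.nnz / (dense.shape[0] * dense.shape[1])

# Memory used by the dense array versus the three CSR arrays.
dense_bytes = dense.nbytes
sparse_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes

print(f"sparsity: {sparsity:.3f}")
print(f"dense: {dense_bytes} bytes, CSR: {sparse_bytes} bytes")
```

The CSR form stores one value and one column index per non-zero entry, plus one row pointer per row, so its footprint grows with the number of ratings rather than with the full matrix size.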

In [49]:
from scipy.sparse import csr_matrix
csr_data = csr_matrix(df.values)
df.reset_index(inplace=True)

**Using KNN with cosine distance to calculate similarity**
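For reference, the cosine distance between two movies' rating vectors $a$ and $b$ (one entry per user $u$) is one minus their cosine similarity:

$$
d(a, b) = 1 - \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
        = 1 - \frac{\sum_u a_u b_u}{\sqrt{\sum_u a_u^2} \, \sqrt{\sum_u b_u^2}}
$$

As a made-up example, take $a = (5, 4, 0, 0)$ and $b = (4, 5, 0, 1)$:

$$
d(a, b) = 1 - \frac{40}{\sqrt{41} \, \sqrt{42}} \approx 1 - 0.964 = 0.036
$$

A distance near 0 means the two movies were rated in a similar pattern by the same users; a distance of 1 means they share no raters.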

In [50]:
from sklearn.neighbors import NearestNeighbors

In [51]:
knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=20, n_jobs=-1)
knn.fit(csr_data)

# Recommender function

In [52]:
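def recommend(movie):
    # Row position of the title in `movies`. Using it as a row of csr_data assumes
    # the two are aligned, which the filtering above does not guarantee.
    movie_index = movies[movies['original_title'] == movie].index[0]
    # Ask for 6 neighbours: normally the query movie itself plus 5 others.
    distances, indices = knn.kneighbors(csr_data[movie_index], n_neighbors=6)
    # Sort by distance; [:0:-1] drops the closest entry (the query) and reverses
    # the rest, so titles are printed from least to most similar.
    recommended_list = sorted(zip(indices.squeeze().tolist(), distances.squeeze().tolist()), key=lambda x: x[1])[:0:-1]
    for items in recommended_list:
        print(movies.iloc[items[0]]['original_title'])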

In [53]:
recommend('Iron Man')

Need for Speed
Dreamcatcher
Iron Man 2
Eragon
X-Men: The Last Stand
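One way to keep the title lookup and the matrix rows aligned is to translate through the matrix's own row order. The sketch below does this on made-up data: the `titles` dictionary, the `row_of` mapping and `recommend_by_id` are all hypothetical names, and a real version would need a reliable mapping between the `movieId` values in the ratings and the titles in `movies`, which this notebook does not show.

```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Made-up pivot table: rows are movieIds, columns are userIds.
toy = pd.DataFrame(
    [[5.0, 4.0, 0.0, 0.0],
     [4.0, 5.0, 0.0, 1.0],
     [0.0, 0.0, 5.0, 4.0],
     [0.0, 1.0, 4.0, 5.0]],
    index=pd.Index([10, 20, 30, 40], name='movieId'),
    columns=pd.Index([1, 2, 3, 4], name='userId'),
)
# Hypothetical movieId -> title mapping.
titles = {10: 'A', 20: 'B', 30: 'C', 40: 'D'}

matrix = csr_matrix(toy.values)
model = NearestNeighbors(metric='cosine', algorithm='brute').fit(matrix)

# movieId -> row position in the matrix that the model was fitted on.
row_of = {int(movie_id): row for row, movie_id in enumerate(toy.index)}

def recommend_by_id(movie_id, k=2):
    """Return the titles of the k nearest movies, most similar first."""
    row = row_of[movie_id]
    distances, indices = model.kneighbors(matrix[row], n_neighbors=k + 1)
    # Drop the query movie itself, then keep the k closest remaining rows.
    neighbours = [int(i) for i in indices[0] if int(i) != row][:k]
    return [titles[int(toy.index[i])] for i in neighbours]

print(recommend_by_id(10))
```

Because both the lookup and the neighbour indices refer to rows of the same matrix, the returned titles always correspond to the vectors that were actually compared.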


In [56]:
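import pickle
#Saving the sparse matrix and the filtered dataframe; `with` closes each file
with open('csr_data', 'wb') as f:
    pickle.dump(csr_data, f)
with open('ratingsdf', 'wb') as f:
    pickle.dump(df, f)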
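Loading the saved objects back works the same way in reverse. The sketch below round-trips a small made-up sparse matrix through a temporary file (the path is an assumption for the example) to show that the restored object matches the original.

```python
import os
import pickle
import tempfile

import numpy as np
from scipy.sparse import csr_matrix

# Made-up sparse matrix standing in for csr_data.
matrix = csr_matrix(np.array([[0.0, 3.0], [4.0, 0.0]]))

# Write to a temporary location so the example leaves the working directory alone.
path = os.path.join(tempfile.mkdtemp(), 'csr_data')
with open(path, 'wb') as f:
    pickle.dump(matrix, f)

# Read it back; the file must be opened in binary mode.
with open(path, 'rb') as f:
    restored = pickle.load(f)

# The element-wise "not equal" comparison has no stored entries
# when both matrices hold the same values.
print((restored != matrix).nnz == 0)
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.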