

# Item-based filtering
This approach is usually preferred because movies don't change much. We can rerun an item-based model about once a week, whereas a user-based model has to be rerun frequently.

In this kernel, we look at an implementation of item-based filtering.

In [None]:
import pandas as pd
import numpy as np

from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

import matplotlib.pyplot as plt
import seaborn as sns



In [None]:
# Read the datasets
movies = pd.read_csv('data/movies.csv')    # movie ids, titles and genres
ratings = pd.read_csv('data/ratings.csv')  # one row per rating a user gave to a movie

In [None]:
#print the head of the ratings to see some of its features
ratings.head()

The ratings dataset has:
* userId - unique for each user
* movieId - the key we use to look up a movie's title in the movies dataset
* rating - the rating a user gave to a movie; we use these ratings to find the top 10 similar movies

In [None]:
# Print the head of movies to see some of its features
movies.head()

The movies dataset has:
* movieId - once the recommendation is done, we get a list of similar movieId values and look up the title of each movie in this dataset
* genres - not required for this filtering approach
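
The title lookup described above can be sketched on a made-up table. The frame `toy_movies`, its titles and the list `recommended_ids` are hypothetical stand-ins, not values from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for movies.csv
toy_movies = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Movie A (1995)', 'Movie B (1999)', 'Movie C (2008)'],
    'genres': ['Comedy', 'Drama', 'Action'],
})

# Hypothetical ids returned by a recommender, in ranked order
recommended_ids = [3, 1]

# Index the titles by movieId, then look the ids up in that order
title_by_id = toy_movies.set_index('movieId')['title']
recommended_titles = title_by_id.loc[recommended_ids].tolist()
print(recommended_titles)
```

Indexing by movieId keeps the lookup independent of the row order of the movies table.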

In [None]:
# Reshape the ratings from the long format shown in ratings.head()
# into a matrix with one row per movie (index = movieId),
# one column per user (userId) and the rating as the cell value.
final_dataset = ratings.pivot(index='movieId', columns='userId', values='rating')
final_dataset.head()
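
To make the reshaping concrete, here is a minimal sketch on a made-up three-row table. The frame `toy_ratings` and its values are hypothetical; only the `pivot` call mirrors the cell above.

```python
import pandas as pd

# Hypothetical ratings in the same long format as ratings.csv
toy_ratings = pd.DataFrame({
    'userId':  [1, 1, 2],
    'movieId': [10, 20, 10],
    'rating':  [4.0, 3.5, 5.0],
})

# One row per movie, one column per user; unrated pairs become NaN
toy_matrix = toy_ratings.pivot(index='movieId', columns='userId', values='rating')
print(toy_matrix)
```

User 2 never rated movie 20, so that cell is NaN. This is where the missing values in the full matrix come from.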

In [None]:
# How should you handle missing values here?
# Movies a user has not rated show up as NaN; for now, just fill them with 0
final_dataset.fillna(0,inplace=True)
final_dataset.head()

In the real world, ratings are very sparse, and most data points come from very popular movies and highly engaged users. To reduce the noise, we add some filters that movies and users must pass to qualify for the final dataset:
* To qualify, a movie must have been rated by more than 10 users.
* To qualify, a user must have rated more than 50 movies.
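
The two filters can be sketched on a tiny made-up table. The data and the scaled-down thresholds are hypothetical, chosen only so the effect is visible. The comparisons are strict (`>`), like the filtering cells in this notebook.

```python
import pandas as pd

# Hypothetical ratings: 3 users, 3 movies
toy = pd.DataFrame({
    'userId':  [1, 1, 1, 2, 2, 3],
    'movieId': [10, 20, 30, 10, 20, 10],
    'rating':  [4.0, 3.0, 5.0, 2.0, 4.5, 3.5],
})
movie_threshold = 1  # stands in for 10 users per movie
user_threshold = 1   # stands in for 50 movies per user

# Count ratings per movie and per user
users_per_movie = toy.groupby('movieId')['rating'].count()
movies_per_user = toy.groupby('userId')['rating'].count()

# Keep only the ids whose counts exceed the thresholds
keep_movies = users_per_movie[users_per_movie > movie_threshold].index
keep_users = movies_per_user[movies_per_user > user_threshold].index

filtered = toy[toy['movieId'].isin(keep_movies) & toy['userId'].isin(keep_users)]
print(filtered)
```

Movie 30 (one rating) and user 3 (one rating) are dropped, leaving four rows.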


In [None]:
# Count the Number of Users Who Voted for Each Movie:
# Group the DataFrame by movieId and count the number of ratings for each movie.

no_user_voted = ratings.groupby('movieId')['rating'].agg('count')





In [None]:
#Count the Number of Movies Each User Has Voted For:
#Group the DataFrame by userId and count the number of ratings each user has made.
no_movie_voted=ratings.groupby('userId')['rating'].agg('count')

In [None]:
#Visualize the Number of Users Who Voted for Each Movie:
f,ax = plt.subplots(1,1,figsize=(16,4))
plt.scatter(no_user_voted.index,no_user_voted, color='mediumseagreen') #scatter plot the number of users voted
plt.axhline(y=10,color='r') #add this line as a threshold
plt.xlabel('MovieId')
plt.ylabel('No. of users voted')
plt.show()

In [None]:
# Keep only the movies that were voted for by more than 10 users
final_dataset = final_dataset.loc[no_user_voted[no_user_voted>10].index,:]

In [None]:
f,ax = plt.subplots(1,1,figsize=(16,4))
plt.scatter(no_movie_voted.index,no_movie_voted,color='mediumseagreen') #scatter plot the number of movies each user voted for
plt.axhline(y=50,color='r') #threshold: users must have voted for more than 50 movies
plt.xlabel('UserId')
plt.ylabel('No. of votes by user')
plt.show()

In [None]:
# Keep only the users who have voted for more than 50 movies
final_dataset=final_dataset.loc[:,no_movie_voted[no_movie_voted>50].index]
final_dataset 

Our final_dataset has dimensions of **2121 * 378**, and most of its values are zero. I used only a small dataset, but for the original large MovieLens dataset, which has more than **100000** features, a dense matrix like this will surely hang our system when it is fed to the model. To store the sparse data efficiently, we use csr_matrix from the scipy library. Here is an example of how it works.

In [None]:
sample = np.array([[0,0,3,0,0],[4,0,0,0,2],[0,0,0,0,1]])
sparsity = 1.0 - ( np.count_nonzero(sample) / float(sample.size) )
print(sparsity)

In [None]:
csr_sample = csr_matrix(sample)
print(csr_sample)

* As you can see, csr_sample stores no zero values; each stored value is listed with its row and column index. For example, the value at row 0, column 2 is 3, which matches its position in the original array. Using the todense method, you can convert it back to the original dense form.
* Most of sklearn works with sparse matrices, so this should improve our performance.
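
As a small sketch of what CSR stores, the snippet below inspects the three underlying arrays for the same sample array and converts it back. The variable names `dense`, `sparse` and `restored` are illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([[0, 0, 3, 0, 0],
                  [4, 0, 0, 0, 2],
                  [0, 0, 0, 0, 1]])
sparse = csr_matrix(dense)

# CSR keeps three arrays: the non-zero values, the column index of each
# value, and row pointers marking where each row starts in the other two
print(sparse.data)     # non-zero values, row by row
print(sparse.indices)  # column index of each stored value
print(sparse.indptr)   # row i occupies data[indptr[i]:indptr[i+1]]

# Converting back reproduces the original dense array
restored = sparse.toarray()
```

Only 4 of the 15 entries are stored, which is where the memory saving comes from on a ratings matrix that is mostly zeros.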

In [None]:
csr_data = csr_matrix(final_dataset.values)
final_dataset.reset_index(inplace=True)

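We use the cosine distance metric, which is very fast and preferable to the Pearson coefficient here. Please don't use Euclidean distance: it depends on the magnitude of the rating vectors rather than on their direction, and it cannot rank neighbours that are equidistant from the query.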
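
The difference between the two metrics can be illustrated with a minimal sketch. The rating vectors and the helper `cosine_distance` are hypothetical and written only for this comparison.

```python
import numpy as np

def cosine_distance(a, b):
    """1 minus the cosine of the angle between a and b."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical rating vectors of three movies across four users
movie_a = np.array([1.0, 2.0, 1.0, 2.0])
movie_b = np.array([2.0, 4.0, 2.0, 4.0])  # same pattern as movie_a, doubled
movie_c = np.array([2.0, 1.0, 2.0, 1.0])  # opposite preferences

cos_ab = cosine_distance(movie_a, movie_b)
cos_ac = cosine_distance(movie_a, movie_c)
euc_ab = np.linalg.norm(movie_a - movie_b)
euc_ac = np.linalg.norm(movie_a - movie_c)

print(f"cosine:    a-b {cos_ab:.3f}, a-c {cos_ac:.3f}")
print(f"euclidean: a-b {euc_ab:.3f}, a-c {euc_ac:.3f}")
```

Cosine distance treats `movie_b` as the closest match to `movie_a` because the rating pattern is the same. Euclidean distance ranks `movie_c` closer, simply because its ratings are similar in size.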

In [None]:
knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=20, n_jobs=-1)
knn.fit(csr_data) #fit the sparse data

In [None]:
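def get_movie_recommendation(movie_name):
    """Return the 10 nearest neighbours of the first title containing movie_name."""
    n_movies_to_recommend = 10
    # regex=False treats movie_name as plain text, not as a regular expression
    movie_list = movies[movies['title'].str.contains(movie_name, regex=False)]
    if len(movie_list) == 0:
        return "No movies found. Please check your input"

    movie_id = movie_list.iloc[0]['movieId']
    # The movie may have been removed by the filters on vote counts
    matches = final_dataset[final_dataset['movieId'] == movie_id]
    if len(matches) == 0:
        return "No movies found. Please check your input"
    movie_idx = matches.index[0]

    # Ask for one extra neighbour: the closest match is normally the query movie itself
    distances, indices = knn.kneighbors(csr_data[movie_idx], n_neighbors=n_movies_to_recommend + 1)
    # Sort by distance, then reverse and drop the first (closest) entry,
    # so the list runs from the farthest to the nearest neighbour
    rec_movie_indices = sorted(zip(indices.squeeze().tolist(), distances.squeeze().tolist()),
                               key=lambda x: x[1])[:0:-1]

    recommend_frame = []
    for row_idx, distance in rec_movie_indices:
        rec_movie_id = final_dataset.iloc[row_idx]['movieId']
        title = movies.loc[movies['movieId'] == rec_movie_id, 'title'].values[0]
        recommend_frame.append({'Title': title, 'Distance': distance})
    return pd.DataFrame(recommend_frame, index=range(1, n_movies_to_recommend + 1))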

In [None]:
get_movie_recommendation('Iron Man')

In [None]:
get_movie_recommendation('Memento')

Our model returns sensible recommendations based on users' rating behaviour. So we conclude our collaborative filtering here.

#### Now, we will try k-means clustering


In [None]:
from sklearn.cluster import KMeans

# Drop the movieId column so that only the user ratings remain as features
final_dataset.reset_index(drop=True, inplace=True)
X = final_dataset.drop('movieId', axis=1)

csr_data = csr_matrix(X.values)

In [None]:
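# Normalize each movie's rating vector to unit length (L2 norm by default)
# Note: the KMeans cells below are fitted on X, not on normalized_data
from sklearn.preprocessing import normalize
normalized_data = normalize(csr_data)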


#### Why normalize?
1. Reduces popularity bias
2. Makes distances comparable across movies
3. Generally helps KMeans, which relies on Euclidean distance
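
A minimal sketch of what row normalization does, on a made-up matrix. The array `toy` is hypothetical; `normalize` is the same sklearn function used in the cell above.

```python
import numpy as np
from sklearn.preprocessing import normalize

# Hypothetical movie-by-user ratings: rows 0 and 1 share a pattern,
# but row 0 has ratings twice as large
toy = np.array([[4.0, 4.0, 0.0, 2.0],
                [2.0, 2.0, 0.0, 1.0],
                [0.0, 0.0, 5.0, 0.0]])

unit = normalize(toy)  # default norm='l2', applied to each row
row_norms = np.linalg.norm(unit, axis=1)

# Before normalization the two similar movies are 3.0 apart;
# afterwards they coincide
dist_before = np.linalg.norm(toy[0] - toy[1])
dist_after = np.linalg.norm(unit[0] - unit[1])

# For unit vectors, squared Euclidean distance equals 2 * cosine distance
sq_dist = np.sum((unit[0] - unit[2]) ** 2)
cos_dist = 1.0 - np.dot(unit[0], unit[2])
print(row_norms, dist_before, dist_after, sq_dist, 2 * cos_dist)
```

After normalization every row has length 1, so Euclidean distances reflect the rating pattern rather than the size of the ratings.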

Now apply the elbow method to find the best K (number of clusters).

In [None]:
# Fit KMeans for k = 1..19 and record the inertia
# (within-cluster sum of squared distances) for each k
clusters = []
for i in range(1, 20):
    kmeans = KMeans(n_clusters=i, random_state=42)
    kmeans.fit(X)
    clusters.append(kmeans.inertia_)

In [None]:
clusters

## Plotting the elbow point graph of the k-means algorithm

In [None]:
plt.figure(figsize=(8,5))
plt.plot(range(1,20),clusters)
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.title('Elbow Method For Optimal k') 
plt.show()

In this elbow graph, the curve flattens out after n_clusters=5, so we take 5 as the number of clusters for k-means.
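
Reading an elbow can also be done numerically by looking at how much each extra cluster reduces the inertia. The sketch below uses synthetic data from `make_blobs` with three known groups, not the movie ratings; the centres and `n_init=10` are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2D points in three well-separated groups
points, _ = make_blobs(n_samples=300,
                       centers=[[0, 0], [10, 10], [-10, 10]],
                       cluster_std=1.0, random_state=42)

ks = list(range(1, 7))
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(points).inertia_
            for k in ks]

# Reduction in inertia gained by each additional cluster
drops = -np.diff(inertias)
for k, drop in zip(ks[1:], drops):
    print(f"k={k}: inertia reduced by {drop:.1f}")
```

On this data the reduction is large up to k=3 and small afterwards, which is the flattening that the elbow plot shows visually.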

## Training the model with the optimal number of clusters found in the elbow point graph

In [None]:
km_sample=KMeans(n_clusters=5,init='k-means++',random_state=42)
km_sample.fit_predict(X)

In [None]:
KM_labels=km_sample.labels_
KM_labels

## Plotting the clusters as a scatter plot

In [None]:
from sklearn.decomposition import PCA

# Reduce dimensions to 2D
pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X)

# Scatter plot
plt.figure(figsize=(10,7))
plt.scatter(X_pca[:,0], X_pca[:,1], c=KM_labels, cmap='rainbow', alpha=0.6, s=50)

# Plot centroids in PCA space
centroids_pca = pca.transform(km_sample.cluster_centers_)
plt.scatter(centroids_pca[:,0], centroids_pca[:,1], s=200, c='black', marker='X', label='Centroids')

plt.title('KMeans Clustering')
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.legend()
plt.show()


## Calculate the silhouette score to see if the clusters are good 

In [None]:
from sklearn.metrics import silhouette_score

sil_score=silhouette_score(X,KM_labels)
sil_score
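
For a sense of scale, the silhouette score ranges from -1 to 1, and higher values mean tighter, better separated clusters. The sketch below compares a KMeans labelling with random labels on synthetic `make_blobs` data; the data and names are hypothetical and unrelated to the movie ratings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic 2D points in three well-separated groups
points, _ = make_blobs(n_samples=300,
                       centers=[[0, 0], [10, 10], [-10, 10]],
                       cluster_std=1.0, random_state=42)

# Labels found by KMeans
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(points)

# Baseline: labels assigned at random, unrelated to the geometry
rng = np.random.default_rng(42)
random_labels = rng.integers(0, 3, size=len(points))

kmeans_score = silhouette_score(points, kmeans_labels)
random_score = silhouette_score(points, random_labels)
print(f"KMeans labels: {kmeans_score:.2f}, random labels: {random_score:.2f}")
```

A score close to the random baseline suggests that the clusters overlap heavily, which is a useful check for the score computed on the ratings above.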

In [None]:
get_movie_recommendation('Iron Man')