This project is a basic movie recommendation system that uses item-based collaborative filtering with the K-Nearest Neighbors (KNN) algorithm.

What is collaborative filtering?

Collaborative filtering is a technique that predicts a user's interests from the preferences of other users; in that sense, the users are collaborating. There are two main variants: user-based collaborative filtering and item-based collaborative filtering.

This project focuses on item-based collaborative filtering, which recommends movies based on how similar they are to one another, with similarity measured from the ratings users gave them. The dataset used in this project is the MovieLens 20M Dataset, obtained from Kaggle.
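The two variants differ mainly in which axis of the ratings table is compared. Here is a minimal sketch on a made-up table (the users, movie names, and ratings are hypothetical, not taken from MovieLens): the same cosine similarity is applied to rows for the user-based view and to columns for the item-based view.

```python
import math

# Hypothetical ratings: one row per user, one column per movie; 0 means "not rated".
movie_names = ["Movie A", "Movie B", "Movie C"]
toy_ratings = {
    "user1": [5.0, 4.0, 0.0],
    "user2": [4.0, 5.0, 1.0],
    "user3": [0.0, 1.0, 5.0],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# User-based view: compare two users through the movies they rated (rows).
user_sim = cosine_similarity(toy_ratings["user1"], toy_ratings["user2"])

# Item-based view: compare two movies through the users who rated them (columns).
columns = list(zip(*toy_ratings.values()))  # one tuple of ratings per movie
item_sim_ab = cosine_similarity(columns[0], columns[1])
item_sim_ac = cosine_similarity(columns[0], columns[2])

print(f"user1 vs user2: {user_sim:.3f}")
print(f"{movie_names[0]} vs {movie_names[1]}: {item_sim_ab:.3f}")
print(f"{movie_names[0]} vs {movie_names[2]}: {item_sim_ac:.3f}")
```

In this toy table, Movie A and Movie B are liked by the same users, so they come out far more similar to each other than Movie A and Movie C.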

Let's get started!

In [1]:
#importing libraries
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

In [84]:
#importing the dataset
movies = pd.read_csv('movie.csv')
ratings = pd.read_csv('rating.csv')

In [85]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [86]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,2,3.5,2005-04-02 23:53:47
1,1,29,3.5,2005-04-02 23:31:16
2,1,32,3.5,2005-04-02 23:33:39
3,1,47,3.5,2005-04-02 23:32:07
4,1,50,3.5,2005-04-02 23:29:40


In [87]:
len(movies['title'].unique()) #Total of 27,262 unique movie titles

27262

In [88]:
len(ratings['userId'].unique()) #Total of 138,493 unique users

138493

In [89]:
ratings['rating'].count() #Total of 20,000,263 ratings

20000263

In [68]:
#extract the year from the title column and create a new column "year"

In [69]:
movies['year'] = movies['title'].str.extract(r'\((\d{4})\)')

In [70]:
#keep only the text before the first " (" (raw string so that \( is not read as a string escape)
movies['title'] = movies['title'].str.split(r' \(', n = 1, expand = True)[0]

In [71]:
movies.head()

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995
1,2,Jumanji,Adventure|Children|Fantasy,1995
2,3,Grumpier Old Men,Comedy|Romance,1995
3,4,Waiting to Exhale,Comedy|Drama|Romance,1995
4,5,Father of the Bride Part II,Comedy,1995
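Some titles carry more than one parenthesised group, for example an alternate title before the year. The sketch below uses Python's `re` module to mirror the two string operations above on a few sample strings, so their behaviour in that case is easy to check. `split_title_year` is a hypothetical helper name, and the second and third sample strings are illustrative inputs, not rows verified against the file.

```python
import re

YEAR_PATTERN = re.compile(r"\((\d{4})\)")

def split_title_year(raw_title):
    """Return (title, year): text before the first ' (' and the first '(dddd)' group."""
    match = YEAR_PATTERN.search(raw_title)
    year = match.group(1) if match else None  # pandas would give NaN here
    title = re.split(r" \(", raw_title, maxsplit=1)[0]
    return title, year

samples = [
    "Toy Story (1995)",
    "Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)",
    "A Title Without A Year",
]
for raw in samples:
    print(split_title_year(raw))
```

Splitting at the first ` (` drops the alternate title together with the year, and a title without a four-digit year in parentheses ends up with a missing `year`.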


# Data Preprocessing

In [72]:
#join the two datasets using movieId which is the common column in both datasets

In [73]:
movies_df_ratings = pd.merge(movies, ratings, how = 'left', on = 'movieId')

In [74]:
movies_df_ratings.head()

Unnamed: 0,movieId,title,genres,year,userId,rating,timestamp
0,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995,3.0,4.0,1999-12-11 13:36:47
1,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995,6.0,5.0,1997-03-13 17:50:52
2,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995,8.0,4.0,1996-06-05 13:37:51
3,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995,10.0,4.0,1999-11-25 02:44:47
4,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995,11.0,4.5,2009-01-02 01:13:41
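The `userId` values above appear as floats (3.0, 6.0, ...). One plausible cause, sketched below on made-up frames, is the left join: a movie with no ratings still gets a row, its rating columns are filled with `NaN`, and pandas upcasts the integer `userId` column to float to hold the missing value. The frame contents and the names `toy_movies` and `toy_ratings` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical data: movie 3 has no ratings at all.
toy_movies = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]})
toy_ratings = pd.DataFrame(
    {"userId": [10, 11, 10], "movieId": [1, 1, 2], "rating": [4.0, 5.0, 3.5]}
)

# Left join keeps every movie; inner join keeps only movies that were rated.
left = pd.merge(toy_movies, toy_ratings, how="left", on="movieId")
inner = pd.merge(toy_movies, toy_ratings, how="inner", on="movieId")

print(left)
print("left userId dtype:", left["userId"].dtype)
print("inner userId dtype:", inner["userId"].dtype)
```

With an inner join the unrated movie disappears and `userId` stays an integer column. Either choice leads to the same popular-movie subset later, because movies without ratings cannot pass a rating-count threshold.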


Since we want recommendations drawn from widely rated movies, let's set the rating count threshold to 10000, which means movies with fewer than 10000 ratings in total will be removed.

In [12]:
#here we have a dataset with the movies and the total rating count for each movie
ratings_group = ratings.groupby(['movieId'])['rating'].count().to_frame().reset_index()

In [13]:
ratings_group = ratings_group.rename(columns = {'rating':'rating_count'})
ratings_group.head()

Unnamed: 0,movieId,rating_count
0,1,49695
1,2,22243
2,3,12735
3,4,2756
4,5,12161


In [14]:
popular_rated_movies = ratings_group[ratings_group['rating_count'] >= 10000]
popular_rated_movies.head()

Unnamed: 0,movieId,rating_count
0,1,49695
1,2,22243
2,3,12735
4,5,12161
5,6,23899
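The count-then-filter step can also be expressed without building a separate counts table. The sketch below is one possible alternative on a made-up ratings frame, using `groupby(...).transform('count')` to attach each movie's rating count to every one of its rows. The data and the small threshold of 2 are assumptions standing in for the real ratings and the 10000 cutoff.

```python
import pandas as pd

# Hypothetical ratings: movie 10 has 3 ratings, movie 20 has 2, movie 30 has 1.
toy_ratings = pd.DataFrame(
    {
        "userId": [1, 2, 3, 1, 2, 1],
        "movieId": [10, 10, 10, 20, 20, 30],
        "rating": [4.0, 3.5, 5.0, 2.0, 4.5, 3.0],
    }
)
threshold = 2  # stand-in for the 10000 used in the notebook

# Count ratings per movie and broadcast that count back onto every row.
counts = toy_ratings.groupby("movieId")["rating"].transform("count")

# Keep only the rows whose movie meets the threshold.
popular = toy_ratings[counts >= threshold]
print(popular["movieId"].unique().tolist())
```

This keeps the ratings rows themselves, so no later join against a counts table is needed. The two-step version used in the notebook has the advantage of leaving an inspectable `rating_count` column.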


In [15]:
#let's join this with our dataset using an inner join. With this, we would have a dataset with only popular movies

In [16]:
popular_movies_df_ratings = pd.merge(movies_df_ratings, popular_rated_movies, on = 'movieId', how = 'inner')
popular_movies_df_ratings.head()

Unnamed: 0,movieId,title,genres,year,userId,rating,timestamp,rating_count
0,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995,3.0,4.0,1999-12-11 13:36:47,49695
1,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995,6.0,5.0,1997-03-13 17:50:52,49695
2,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995,8.0,4.0,1996-06-05 13:37:51,49695
3,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995,10.0,4.0,1999-11-25 02:44:47,49695
4,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995,11.0,4.5,2009-01-02 01:13:41,49695


In [17]:
#dropping columns that are unnecessary
popular_movies_df_ratings = popular_movies_df_ratings.drop(['timestamp', 'rating_count'], axis = 1)

# Item Based Recommendation System

In [19]:
#create a Pivot matrix
movies_features = popular_movies_df_ratings.pivot_table(index = 'title', columns = 'userId', values = 'rating').fillna(0)
movies_features.head()

title / userId,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0,...,138484.0,138485.0,138486.0,138487.0,138488.0,138489.0,138490.0,138491.0,138492.0,138493.0
10 Things I Hate About You,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12 Angry Men,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,4.5,0.0,0.0,0.0,4.0
2001: A Space Odyssey,3.5,5.0,5.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
28 Days Later,3.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,4.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0
300,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.5


Since our pivot matrix has a large number of zero elements, storing it in CSR (compressed sparse row) format allows more efficient memory usage and faster matrix operations, because only the non-zero elements and their positions are stored.
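As a rough illustration of the CSR layout, the sketch below builds its three arrays by hand for a tiny made-up matrix. `to_csr` is a hypothetical helper written for this example, not part of SciPy; `scipy.sparse.csr_matrix` exposes arrays with the same roles through its `data`, `indices`, and `indptr` attributes.

```python
def to_csr(dense):
    """Return (data, indices, indptr) for a dense list-of-lists matrix."""
    data = []      # the non-zero values, row by row
    indices = []   # the column of each non-zero value
    indptr = [0]   # indptr[r]:indptr[r + 1] is the slice of data for row r
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)
                indices.append(col)
        indptr.append(len(data))
    return data, indices, indptr

# Hypothetical 3 x 4 ratings matrix with only two ratings present.
dense = [
    [0.0, 0.0, 4.5, 0.0],
    [3.5, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
data, indices, indptr = to_csr(dense)
print("data:   ", data)
print("indices:", indices)
print("indptr: ", indptr)
```

Twelve dense cells shrink to two stored values plus their positions. How much is saved on the real pivot matrix depends on the fraction of entries that are non-zero.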

In [20]:
movies_features_matrix = csr_matrix(movies_features.values)

In [21]:
#fit the knn model
knn_model = NearestNeighbors(metric = 'cosine', algorithm = 'brute')
knn_model.fit(movies_features_matrix)

In this item-based recommendation system, the K-nearest neighbors algorithm is used to identify similar items (movies) by comparing their features (the ratings each movie received from every user). Similarity is measured with the cosine of the angle between two movies' rating vectors. A smaller angle means the two vectors point in the same or nearly the same direction, which indicates a higher degree of similarity between the movies.
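For reference, the cosine similarity between the rating vectors $a$ and $b$ of two movies, and the cosine distance that scikit-learn reports when `metric = 'cosine'` is used, can be written as follows. The symbols $a$, $b$, $u$, and $\theta$ are notation introduced here, not taken from the notebook.

```latex
\cos\theta
  = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
  = \frac{\sum_{u} a_u b_u}{\sqrt{\sum_{u} a_u^{2}} \, \sqrt{\sum_{u} b_u^{2}}},
\qquad
d(a, b) = 1 - \cos\theta
```

Here $u$ runs over the users. Because ratings are non-negative, $\cos\theta$ lies between 0 and 1, so $d(a, b)$ also lies between 0 (same direction) and 1 (no users in common). Smaller distances therefore mean more similar movies.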

In [30]:
movies_features.shape

(462, 137658)

In [32]:
#pick a random index number from the dataset
movie_index = np.random.choice(movies_features.shape[0])
print(movie_index)

255


How can we find similar movies to the movie in the selected index above? 

Good question. Look at the code below

In [56]:
distances, indices = knn_model.kneighbors(movies_features.iloc[movie_index,:].values.reshape(1, -1), n_neighbors = 11)

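This code returns the distances from the selected movie to its 11 nearest neighbors, together with the row indices of those neighbors. The nearest neighbor is the selected movie itself (its distance is effectively zero), so the remaining 10 entries are the 10 nearest neighbors we are interested in.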

In [57]:
print(distances)
print(indices)

[[1.54321000e-13 3.90567097e-01 4.95576345e-01 5.20765743e-01
  5.24458902e-01 5.46146201e-01 5.49929750e-01 5.50813342e-01
  5.52300595e-01 5.54463434e-01 5.63885427e-01]]
[[255 373 336 158  56 234 274  23 271 427 235]]
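To see why the first distance is essentially zero and why `indices` starts with the selected row, here is a small brute-force version of the same kind of query in plain Python. The titles and rating vectors are made up, and `nearest` is a hypothetical helper for illustration, not scikit-learn's implementation.

```python
import math

# Hypothetical movies and their rating vectors (one value per user).
titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
features = [
    [5.0, 4.0, 0.0, 0.0],
    [4.0, 5.0, 0.0, 1.0],
    [0.0, 0.0, 5.0, 4.0],
    [5.0, 3.0, 1.0, 0.0],
]

def cosine_distance(u, v):
    """1 minus the cosine similarity, clamped at 0 to hide rounding noise."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return max(0.0, 1.0 - dot / norms)

def nearest(query_row, n_neighbors):
    """Rank every row by cosine distance to the query row (brute force)."""
    scored = [
        (cosine_distance(features[query_row], row), i)
        for i, row in enumerate(features)
    ]
    scored.sort()
    return scored[:n_neighbors]

query = 0
result = nearest(query, n_neighbors=3)  # the query itself plus its 2 neighbors
for rank, (dist, idx) in enumerate(result):
    label = "selected" if rank == 0 else f"neighbor {rank}"
    # The returned index is a row position, so it is used to look up the title.
    print(f"{label}: {titles[idx]} (row {idx}, distance {dist:.3f})")
```

The query row always comes back first because nothing can be closer to a vector than the vector itself, which is why asking for 11 neighbors yields 10 recommendations.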


In [58]:
#let's see the selected movie and its neighbors
#titles are looked up through `indices`; position 0 is the selected movie itself
neighbor_rows = indices.flatten()
for i, row in enumerate(neighbor_rows):
    if i == 0:
        print(f'The selected movie is "{movies_features.index[row]}"')
        print('The recommendations for this movie are:')
    else:
        print(f'{i}: {movies_features.index[row]}')

This prints the title stored at row 255 of `movies_features`, followed by the titles at rows 373, 336, 158, 56, 234, 274, 23, 271, 427 and 235, ordered from most to least similar.


In [59]:
#let's see the distances of these movies to the selected movie
for i in range(0, len(distances.flatten())):
    if i > 0:
        print(f'{i}: {distances.flatten()[i]}')

1: 0.39056709710992843
2: 0.4955763450721625
3: 0.5207657431248094
4: 0.5244589016753476
5: 0.5461462008131518
6: 0.5499297502022229
7: 0.5508133420795396
8: 0.5523005954955912
9: 0.5544634336839096
10: 0.5638854274760241
