# KNN on MovieLens dataset
Filippo Fantinato 2041620

This notebook contains my experiments with KNN on the MovieLens dataset.

In [3]:
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

Let's download and unzip the dataset, then load the movies and ratings tables.

In [4]:
!wget https://files.grouplens.org/datasets/movielens/ml-latest-small.zip
!unzip ml-latest-small.zip

--2022-12-29 11:00:00--  https://files.grouplens.org/datasets/movielens/ml-latest-small.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 978202 (955K) [application/zip]
Saving to: ‘ml-latest-small.zip’


2022-12-29 11:00:00 (6.73 MB/s) - ‘ml-latest-small.zip’ saved [978202/978202]

Archive:  ml-latest-small.zip
   creating: ml-latest-small/
  inflating: ml-latest-small/links.csv  
  inflating: ml-latest-small/tags.csv  
  inflating: ml-latest-small/ratings.csv  
  inflating: ml-latest-small/README.txt  
  inflating: ml-latest-small/movies.csv  


In [5]:
movies = pd.read_csv('ml-latest-small/movies.csv')
movies.head()

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [6]:
ratings = pd.read_csv('ml-latest-small/ratings.csv')
ratings = ratings.drop('timestamp', axis=1)
ratings.head()

,userId,movieId,rating
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0
3,1,47,5.0
4,1,50,5.0


## Preprocessing

Since the recommendations are computed from user ratings, I create a dataframe with one row per movie and one column per user. Each cell holds the rating that user gave to that movie, with $0$ for the unrated ones.

In [7]:
ratings_pivot = ratings.pivot(index='movieId', columns='userId', values='rating')
ratings_pivot.fillna(0, inplace=True)
ratings_pivot.head()

userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
movieId
1,4.0,0.0,0.0,0.0,4.0,0.0,4.5,0.0,0.0,0.0,...,4.0,0.0,4.0,3.0,4.0,2.5,4.0,2.5,3.0,5.0
2,0.0,0.0,0.0,0.0,0.0,4.0,0.0,4.0,0.0,0.0,...,0.0,4.0,0.0,5.0,3.5,0.0,0.0,2.0,0.0,0.0
3,4.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0
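
To make the reshaping concrete, here is a minimal sketch on a hypothetical toy table. The `toy_ratings` values are invented for illustration and are not taken from MovieLens.

```python
import pandas as pd

# Hypothetical toy ratings in the same long format as ratings.csv
toy_ratings = pd.DataFrame({
    'userId':  [1, 1, 2, 3],
    'movieId': [10, 20, 10, 30],
    'rating':  [4.0, 5.0, 3.0, 2.5],
})

# One row per movie, one column per user; unrated pairs become NaN
toy_pivot = toy_ratings.pivot(index='movieId', columns='userId', values='rating')

# Encode "not rated" as 0, as done for ratings_pivot
toy_pivot = toy_pivot.fillna(0)
print(toy_pivot)
```

Note that a $0$ here means "no rating", not "the lowest possible rating", so the two cases cannot be told apart once the table is filled.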


Before training the KNN model, I filter the dataframe, keeping only the movies with more than 10 ratings.

In [8]:
num_user_vote = ratings.groupby('movieId')['rating'].agg('count')
num_user_vote.head()

movieId
1    215
2    110
3     52
4      7
5     49
Name: rating, dtype: int64

In [9]:
num_movies_voted = ratings.groupby('userId')['rating'].agg('count')
num_movies_voted.head()

userId
1    232
2     29
3     39
4    216
5     44
Name: rating, dtype: int64

In [10]:
# Keep only the movies with more than 10 ratings:
ratings_pivot = ratings_pivot.loc[num_user_vote[num_user_vote > 10].index,:]
ratings_pivot.reset_index(inplace=True)
ratings_pivot.shape

(2121, 611)
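
The filter above uses a strict `>` comparison, so a movie with exactly 10 ratings is dropped. The sketch below shows the difference between `>` and `>=` on hypothetical toy data. The threshold `min_ratings` and the `toy_ratings` values are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical toy ratings: movie 10 has 3 ratings, movie 20 has 2, movie 30 has 1
toy_ratings = pd.DataFrame({
    'userId':  [1, 2, 3, 1, 2, 1],
    'movieId': [10, 10, 10, 20, 20, 30],
    'rating':  [4.0, 3.0, 5.0, 2.0, 4.5, 1.0],
})

# Number of ratings received by each movie
counts = toy_ratings.groupby('movieId')['rating'].count()

min_ratings = 2  # hypothetical threshold

# Strictly more than the threshold (the comparison used in the notebook)
strictly_more = counts[counts > min_ratings].index.tolist()

# At least the threshold
at_least = counts[counts >= min_ratings].index.tolist()

print(strictly_more, at_least)
```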

Finally, we can create the sparse matrix and proceed with the KNN training.

In [11]:
csr_data = csr_matrix(ratings_pivot.values)
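
Because `reset_index` turned `movieId` into an ordinary column (hence the 611 columns for 610 users), `ratings_pivot.values` also contains the movie identifiers, which then enter the distance computation as if they were a rating. A minimal sketch, on a hypothetical toy frame, of building the matrix from the rating columns only and measuring how sparse it is:

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical toy frame shaped like ratings_pivot after reset_index:
# a 'movieId' column followed by one column per user
toy_pivot = pd.DataFrame({
    'movieId': [10, 20, 30],
    1: [4.0, 5.0, 0.0],
    2: [3.0, 0.0, 0.0],
    3: [0.0, 0.0, 2.5],
})

# Keep only the rating columns so the identifier is not used as a feature
rating_columns = toy_pivot.drop(columns='movieId')
toy_csr = csr_matrix(rating_columns.values)

# Fraction of stored (non-zero) entries in the matrix
density = toy_csr.nnz / (toy_csr.shape[0] * toy_csr.shape[1])
print(toy_csr.shape, toy_csr.nnz, density)
```

Whether to drop the identifier column is a modelling choice. This sketch assumes that only ratings should influence similarity.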

## Recommendation with KNN

I chose to use a KNN model with metric "cosine", algorithm "brute", and number of neighbors equal to 20.

In [12]:
knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=20)
knn.fit(csr_data)

NearestNeighbors(algorithm='brute', metric='cosine', n_neighbors=20)
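
With `metric='cosine'`, the distance between two rating vectors $a$ and $b$ is $1 - \frac{a \cdot b}{\|a\|\,\|b\|}$, so movies rated in similar proportions by the same users end up close to each other. The sketch below checks this on a few hypothetical vectors. The matrix `X` and the helper `cosine_distance` are illustrative assumptions, not part of the notebook.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical rating vectors (rows = movies, columns = users)
X = np.array([
    [4.0, 0.0, 5.0],
    [4.0, 0.0, 4.0],
    [0.0, 3.0, 0.0],
])

def cosine_distance(a, b):
    """Cosine distance: 1 minus the cosine similarity of a and b."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Same configuration as the notebook, but with 2 neighbors on the toy data
toy_knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=2)
toy_knn.fit(X)

# Query with the first movie: its nearest neighbor is itself, then movie 1
distances, indices = toy_knn.kneighbors(X[[0]])
print(indices, distances)
```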

The following function uses the fitted KNN model to recommend n movies, given the name of a favorite film.

In [13]:
def get_recommendation(fav_movie_name, n_movies):
  # Match the name literally, so titles containing characters such as "(" are handled
  fav_movies = movies[movies['title'].str.contains(fav_movie_name, regex=False)]

  if not fav_movies.empty:
    movie_id = int(fav_movies.index[0])
    movie_idx = ratings_pivot[ratings_pivot['movieId'] == movie_id].index[0]
    
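    # Use KNN to get the n closest films to the favorite movie, plus the query itself.
    # The [:0:-1] slice below drops the first hit (normally the query movie)
    # and reverses the order, so results run from farthest to nearest.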
    distances, indices = knn.kneighbors(csr_data[movie_idx], n_neighbors=n_movies+1)
    rec_movie_indices = list(
            zip(
                indices.squeeze().tolist(), distances.squeeze().tolist()
            )
        )[:0:-1]

    # Arrange the results in a dataframe with film name and distance from the favorite movie
    recommend_frame = []
    for row_pos, distance in rec_movie_indices:
        # Map the row position in ratings_pivot back to its movieId, then to its title
        rec_movie_id = ratings_pivot.iloc[row_pos]['movieId']
        title = movies.loc[movies['movieId'] == rec_movie_id, 'title'].values[0]
        recommend_frame.append({
          'Title': title,
          'Distance': distance
        })

    return pd.DataFrame(recommend_frame, index = range(1,n_movies+1))
  else:
    return "No movies with such name"

In [14]:
get_recommendation('Jumanji', 10)

,Title,Distance
1,Father of the Bride Part II (1995),0.694991
2,"American President, The (1995)",0.688471
3,Ace Ventura: When Nature Calls (1995),0.674948
4,Seven (a.k.a. Se7en) (1995),0.649004
5,"Usual Suspects, The (1995)",0.646938
6,Heat (1995),0.625282
7,GoldenEye (1995),0.611565
8,Babe (1995),0.59337
9,Jumanji (1995),0.589176
10,Twelve Monkeys (a.k.a. 12 Monkeys) (1995),0.553816
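
In the output above, Jumanji itself appears among the recommendations for 'Jumanji'. A likely cause is that `fav_movies.index[0]` returns the row label in `movies` (1 for Jumanji, as shown by `movies.head()`) rather than its `movieId` (2), so the query may resolve to a different movie. Below is a minimal sketch of a lookup that reads the `movieId` column instead. `find_movie_id` is a hypothetical helper, and `toy_movies` reuses the first three rows shown earlier.

```python
import pandas as pd

# First three rows of movies.csv as displayed by movies.head()
toy_movies = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
})

def find_movie_id(movies_df, name):
    """Return the movieId of the first title containing `name`, or None."""
    matches = movies_df[movies_df['title'].str.contains(name, regex=False)]
    if matches.empty:
        return None
    return int(matches['movieId'].iloc[0])

# Row label of the match (what .index[0] returns) versus its movieId
row_label = int(
    toy_movies[toy_movies['title'].str.contains('Jumanji', regex=False)].index[0]
)
movie_id = find_movie_id(toy_movies, 'Jumanji')
print(row_label, movie_id)
```

The two values differ because the row labels start at 0 while the `movieId` values start at 1 and are not contiguous in the full dataset.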
