This notebook builds a movie recommendation system from user ratings. It uses the k-nearest neighbors (KNN) algorithm with the cosine distance metric to find movies whose rating patterns are similar. Not all ratings carry equal weight, however. For instance, a movie with 100 five-star ratings is regarded as more successful than a movie with only one five-star rating. For this reason, we introduce `user_threshold`: a movie qualifies only if more than `user_threshold` users have rated it. Similarly, someone who has rated 10 movies is generally assumed to have a better-founded opinion than someone who has rated only one. We therefore also introduce `movie_threshold`: a user's ratings count only if that user has rated more than `movie_threshold` movies. Lastly, `number_to_recommend` is the number of movies the recommendation system should output.
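
As a rough illustration of how the two thresholds act, the sketch below applies them to a tiny made-up ratings table. The names `toy_ratings`, `ratings_per_movie`, `ratings_per_user`, `qualified_movies` and `qualified_users` are hypothetical and are not used elsewhere in this notebook. The strict `>` comparison mirrors the one used in the recommendation function.

```python
import pandas as pd

# Toy ratings table (made-up values) with the same column names as ratings.csv
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3],
    "movieId": [10, 20, 30, 10, 20, 10],
    "rating":  [4.0, 5.0, 3.0, 4.5, 2.0, 5.0],
})

# Number of ratings each movie received, and number of ratings each user gave
ratings_per_movie = toy_ratings.groupby("movieId")["rating"].count()
ratings_per_user = toy_ratings.groupby("userId")["rating"].count()

user_threshold = 1   # keep movies rated by more than 1 user
movie_threshold = 1  # keep users who rated more than 1 movie

qualified_movies = ratings_per_movie[ratings_per_movie > user_threshold].index.tolist()
qualified_users = ratings_per_user[ratings_per_user > movie_threshold].index.tolist()

print(qualified_movies)  # [10, 20]
print(qualified_users)   # [1, 2]
```

Movie 30 drops out because only one user rated it, and user 3 drops out because they rated only one movie.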

In [7]:
# source code https://www.analyticsvidhya.com/blog/2020/11/create-your-own-movie-movie-recommendation-system/
# datasets https://grouplens.org/datasets/movielens/latest/

In [8]:
# import necessary packages
import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt
import seaborn as sns

In [9]:
# import movies dataset from https://grouplens.org/datasets/movielens/latest/
movies = pd.read_csv("movies.csv")

# import ratings dataset from https://grouplens.org/datasets/movielens/latest/
ratings = pd.read_csv("ratings.csv")

# build a movie-by-user rating matrix from the ratings data
# rows represent each unique movieId
# columns represent each unique userId
dataset = ratings.pivot(index='movieId',columns='userId',values='rating')

# replace NaN with 0
dataset.fillna(0,inplace=True)

# count the number of users who rated each movie and the number of movies each user rated
no_user_voted = ratings.groupby('movieId')['rating'].agg('count')
no_movies_voted = ratings.groupby('userId')['rating'].agg('count')
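
To make the reshaping step concrete, here is a minimal sketch of the same `pivot` and `fillna` calls on a made-up table. `toy_ratings` and `toy_matrix` are hypothetical names chosen for this illustration. The sketch assumes that real ratings are always positive, so a 0 in the matrix can be read as "not rated".

```python
import pandas as pd

# Toy ratings table (made-up values) with the same column names as ratings.csv
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 3],
    "movieId": [10, 20, 10, 20],
    "rating":  [4.0, 5.0, 3.5, 2.0],
})

# Rows are movies, columns are users; missing (movie, user) pairs become NaN
toy_matrix = toy_ratings.pivot(index="movieId", columns="userId", values="rating")

# Replace NaN with 0, so 0 stands for "not rated"
toy_matrix = toy_matrix.fillna(0)

print(toy_matrix)
```

Note that `pivot` raises an error if the same (movieId, userId) pair appears more than once, so this step relies on each user rating a movie at most once.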

In [27]:
movies

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
9739,193585,Flint (2017),Drama
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation


In [28]:
ratings

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
...,...,...,...,...
100831,610,166534,4.0,1493848402
100832,610,168248,5.0,1493850091
100833,610,168250,5.0,1494273047
100834,610,168252,5.0,1493846352


In [29]:
dataset

movieId \ userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
1,4.0,0.0,0.0,0.0,4.0,0.0,4.5,0.0,0.0,0.0,...,4.0,0.0,4.0,3.0,4.0,2.5,4.0,2.5,3.0,5.0
2,0.0,0.0,0.0,0.0,0.0,4.0,0.0,4.0,0.0,0.0,...,0.0,4.0,0.0,5.0,3.5,0.0,0.0,2.0,0.0,0.0
3,4.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
193581,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
193583,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
193585,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
193587,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [10]:
# this is the function we will use to recommend a desired number of movies based on certain threshold criteria
# user_threshold is the number of users who must have rated a movie for that movie to qualify
# movie_threshold is the number of movies a user must have rated for their ratings to qualify
# movie_name is the title (or part of the title) of the movie to find recommendations for
# number_to_recommend is the number of movies we want the recommendation system to output
# plot = 'off' is the default setting; change it to 'on' to visualize the rating counts against the thresholds

def function(user_threshold, movie_threshold, movie_name, number_to_recommend, plot = 'off'):
    
    if plot == 'on':
        # visualize the number of users who voted with the [user threshold] 
        f,ax = plt.subplots(1,1,figsize=(8,4))
        plt.scatter(no_user_voted.index,no_user_voted,color='blue')
        plt.axhline(y=user_threshold,color='r')
        plt.xlabel('MovieId')
        plt.ylabel('No. of users voted')
        plt.show()

        # visualize the number of votes by each user with the [movie threshold]
        f,ax = plt.subplots(1,1,figsize=(8,4))
        plt.scatter(no_movies_voted.index,no_movies_voted,color='green')
        plt.axhline(y=movie_threshold,color='r')
        plt.xlabel('UserId')
        plt.ylabel('No. of votes by user')
        plt.show()
    
    # filter the dataset based on the user threshold
    # to qualify a movie, more than [user threshold] users must have rated that movie
    final_dataset = dataset.loc[no_user_voted[no_user_voted > user_threshold].index,:]
    
    # filter the dataset based on the movie threshold
    # to qualify a user, that user must have rated more than [movie threshold] movies
    final_dataset=final_dataset.loc[:,no_movies_voted[no_movies_voted > movie_threshold].index]
    
    # the matrix is mostly zeros, so store it in compressed sparse row (CSR) format
    csr_data = csr_matrix(final_dataset.values)
    final_dataset.reset_index(inplace=True)
    
    # using the KNN algorithm to compute similarity with cosine distance metric 
    knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=20, n_jobs=-1)
    knn.fit(csr_data)
    
    # movie recommendation
    # check if the movie name input is in the database
    # if it is, use the fitted KNN model to find similar movies
    # sort these by their cosine distance from the input movie
    # output [number to recommend] movies with their distances, ordered from farthest to closest
    movie_list = movies[movies['title'].str.contains(movie_name)]
    if len(movie_list):
        # use the first title that matches the input
        movie_id = movie_list.iloc[0]['movieId']
        matches = final_dataset[final_dataset['movieId'] == movie_id].index
        # the movie can be in movies.csv and still be filtered out by user_threshold
        if len(matches) == 0:
            return "Movie found, but it does not pass user_threshold. Try a lower value"
        movie_idx = matches[0]
        distances, indices = knn.kneighbors(csr_data[movie_idx], n_neighbors=number_to_recommend+1)
        # sort ascending by distance; [:0:-1] drops the first entry (normally the input movie itself) and reverses the rest
        rec_movie_indices = sorted(list(zip(indices.squeeze().tolist(), distances.squeeze().tolist())), key=lambda x: x[1])[:0:-1]
        recommend_frame = []
        for val in rec_movie_indices:
            movie_id = final_dataset.iloc[val[0]]['movieId']
            idx = movies[movies['movieId'] == movie_id].index
            recommend_frame.append({'Title': movies.iloc[idx]['title'].values[0], 'Distance': val[1]})
        df = pd.DataFrame(recommend_frame, index=range(1, number_to_recommend+1))
        return df
    else:
        return "No movies found. Please check your input"
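
The nearest-neighbor lookup and the `[:0:-1]` slice can be easier to follow on a small example. The sketch below runs the same kind of `NearestNeighbors` query on a made-up 4-movie by 4-user matrix. `toy`, `csr_toy`, `pairs` and `recommended` are hypothetical names for this illustration only.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Made-up movie x user rating matrix; 0 means "not rated"
toy = np.array([
    [5.0, 4.0, 0.0, 0.0],   # movie 0
    [4.0, 5.0, 0.0, 0.0],   # movie 1: rated similarly to movie 0
    [0.0, 0.0, 5.0, 4.0],   # movie 2: rated by a disjoint set of users
    [5.0, 0.0, 0.0, 1.0],   # movie 3: partial overlap with movie 0
])
csr_toy = csr_matrix(toy)

knn = NearestNeighbors(metric="cosine", algorithm="brute")
knn.fit(csr_toy)

# Ask for 3 neighbours of movie 0; the closest hit is movie 0 itself
distances, indices = knn.kneighbors(csr_toy[0], n_neighbors=3)
pairs = sorted(zip(indices.squeeze().tolist(), distances.squeeze().tolist()),
               key=lambda x: x[1])

# [:0:-1] drops position 0 (the query movie) and reverses the rest,
# so the closest remaining match ends up last
recommended = pairs[:0:-1]
print([idx for idx, _ in recommended])  # [3, 1]
```

Movie 1 has a cosine distance of 1 - 40/41 (about 0.024) from movie 0, so it is the closest match and appears last, which is why the result tables in this notebook list distances in descending order.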

In [11]:
# this is Dan's Netflix viewing history
NVHD = pd.read_csv('NetflixViewingHistoryDan.csv')
NVHD

Unnamed: 0,Title,Date
0,Big Fish,3/28/22
1,Monsters vs. Aliens,3/19/22
2,Brand New Cherry Flavor: Limited Series: Tadpo...,3/10/22
3,Brand New Cherry Flavor: Limited Series: I Exist,3/10/22
4,Starship Troopers,3/9/22
...,...,...
973,Battle for Haditha,3/14/15
974,Blackfish,3/14/15
975,The Immigrant,2/24/15
976,Elsa & Fred,2/23/15


In [14]:
# this is Dan's Netflix viewing history (Home)
NVHH = pd.read_csv('NetflixViewingHistoryHome.csv')
NVHH  

Unnamed: 0,Title,Date
0,Bridgerton: Season 2: Capital R Rake,3/29/22
1,The Adam Project,3/26/22
2,Bridgerton: Season 1: After the Rain,3/25/22
3,Shameless (U.S.): Season 1: Frank Gallagher: L...,3/24/22
4,Cobra Kai: Season 2: Pulpo,3/20/22
...,...,...
2785,Shake It Up: Season 1: Start It Up!,10/8/12
2786,MythBusters: Collection 1: Exploding Toilet,10/5/12
2787,Iron Man 2,10/4/12
2788,Hachi: A Dog's Tale,10/4/12


In [26]:
# user_threshold, movie_threshold, movie_name, number_to_recommend, plot = 'off'
function(100, 100, 'Shrek', 10, plot = 'off')

Unnamed: 0,Title,Distance
1,"Matrix, The (1999)",0.29701
2,Harry Potter and the Sorcerer's Stone (a.k.a. ...,0.283498
3,Spider-Man (2002),0.28284
4,"Lord of the Rings: The Two Towers, The (2002)",0.260674
5,"Lord of the Rings: The Return of the King, The...",0.256306
6,Pirates of the Caribbean: The Curse of the Bla...,0.246572
7,"Monsters, Inc. (2001)",0.239565
8,"Lord of the Rings: The Fellowship of the Ring,...",0.227745
9,"Incredibles, The (2004)",0.227605
10,Finding Nemo (2003),0.220623


In [30]:
function(10, 10, 'Shrek', 10, plot = 'off')

Unnamed: 0,Title,Distance
1,"Beautiful Mind, A (2001)",0.394709
2,"Lord of the Rings: The Two Towers, The (2002)",0.391296
3,Harry Potter and the Sorcerer's Stone (a.k.a. ...,0.388015
4,"Lord of the Rings: The Return of the King, The...",0.377855
5,"Lord of the Rings: The Fellowship of the Ring,...",0.353577
6,Pirates of the Caribbean: The Curse of the Bla...,0.353282
7,Shrek 2 (2004),0.325528
8,"Incredibles, The (2004)",0.320871
9,"Monsters, Inc. (2001)",0.318305
10,Finding Nemo (2003),0.298565


In [32]:
function(20, 50, 'Toy Story', 10, plot = 'off')

Unnamed: 0,Title,Distance
1,Groundhog Day (1993),0.399495
2,"Lion King, The (1994)",0.398578
3,Pulp Fiction (1994),0.398293
4,Star Wars: Episode VI - Return of the Jedi (1983),0.390663
5,Apollo 13 (1995),0.384633
6,Shrek (2001),0.381472
7,Star Wars: Episode IV - A New Hope (1977),0.380789
8,Toy Story 2 (1999),0.371637
9,Forrest Gump (1994),0.356542
10,Jurassic Park (1993),0.334884
