Libraries and imports 

In [60]:
import pandas as pd
import numpy as np
import re


from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


import ipywidgets as wd
from IPython.display import display

Required datasets
- https://www.kaggle.com/datasets/grouplens/movielens-20m-dataset (a newer version of this dataset, ml-25m, is used here)

In [61]:
# load the data (the originals are kept untouched; we work on copies)

original_m_data = pd.read_csv(r"C:\Users\nrhhe\Downloads\ml-25m\ml-25m\movies.csv")
movies = original_m_data.copy()

original_r_data = pd.read_csv(r"C:\Users\nrhhe\Downloads\ml-25m\ml-25m\ratings.csv")
ratings = original_r_data.copy()


Taking a look at our datasets 

In [62]:
# movies dataset
# all the columns are useful to us
movies

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
62418,209157,We (2018),Drama
62419,209159,Window of the Soul (2001),Documentary
62420,209163,Bad Poems (2018),Comedy|Drama
62421,209169,A Girl Thing (2001),(no genres listed)


In [63]:
# ratings dataset
# columns: id of the user, id of the movie that user rated, the rating value, and the timestamp of the rating

ratings

Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
1,1,306,3.5,1147868817
2,1,307,5.0,1147868828
3,1,665,5.0,1147878820
4,1,899,3.5,1147868510
...,...,...,...,...
25000090,162541,50872,4.5,1240953372
25000091,162541,55768,2.5,1240951998
25000092,162541,56176,2.0,1240950697
25000093,162541,58559,4.0,1240953434


Data Preprocessing

In [64]:
# check for null values
movies.isnull().sum()


movieId    0
title      0
genres     0
dtype: int64

In [65]:
ratings.isnull().sum()

userId       0
movieId      0
rating       0
timestamp    0
dtype: int64

In [66]:
# remove special characters from the title column and store the result in a new column called title_cleaned
# this simplifies the strings before they are vectorized

movies['title_cleaned'] = movies['title'].astype(str).apply(lambda x: re.sub(r"[^a-zA-Z0-9 ]", "", x))

In [67]:
movies.head()

Unnamed: 0,movieId,title,genres,title_cleaned
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Toy Story 1995
1,2,Jumanji (1995),Adventure|Children|Fantasy,Jumanji 1995
2,3,Grumpier Old Men (1995),Comedy|Romance,Grumpier Old Men 1995
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,Waiting to Exhale 1995
4,5,Father of the Bride Part II (1995),Comedy,Father of the Bride Part II 1995


Since processing all 25 million ratings takes too long in the App.py file, the app uses a 12 million row sample of the ratings instead.
- The timestamp column is also dropped since it is irrelevant to our recommender system (this reduces the file size further).

In [None]:
#ratings.drop('timestamp', axis=1, inplace=True)
#ratings = ratings.sample(n=12000000, random_state=42)
#ratings.to_csv(r'C:\Users\nrhhe\Downloads\ml-25m\ml-25m\ratings_cleaned.csv', index=False)

In [68]:
# create a TF-IDF vectorizer over single words and word pairs, and fit it on the cleaned titles
vect = TfidfVectorizer(analyzer='word', ngram_range=(1, 2))
movies_tfidf = vect.fit_transform(movies['title_cleaned'])

In [69]:
# transform a test query into a TF-IDF vector
title2 = "Your Name"
tfidf = vect.transform([title2])

In [70]:
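# cosine similarity between every title and the query
t_sim = cosine_similarity(movies_tfidf, tfidf).flatten()
# indices of the 10 most similar titles, reversed so the best match comes first
ind = np.argsort(t_sim)[-10:]
test_result = movies['title_cleaned'].iloc[ind][::-1]
test_result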

42103                                 Your Name 2016
55023                              Tell Me Your Name
56130                            Burn Your Name 2016
44566                      Call Me by Your Name 2017
49627                                   Name Me 2014
51867                              Without Name 2017
2376     Name of the Rose The Name der Rose Der 1986
56382                             A Womans Name 2018
44393                            In the Name of 2013
16003                          Remember My Name 1978
Name: title_cleaned, dtype: object

In [71]:
# turn the above process into a function
# given a movie title, return the most similar movies

def recommend(title):
    # clean the query the same way the titles were cleaned
    title = re.sub(r"[^a-zA-Z0-9 ]", "", title)
    title_tfidf = vect.transform([title])
    title_sim = cosine_similarity(title_tfidf, movies_tfidf).flatten()
    # indices of the 10 highest similarities, in ascending order
    indices = np.argsort(title_sim)[-10:]

    # reverse so the best match comes first
    movie_recommend = movies.iloc[indices][::-1]

    return movie_recommend

recommend("your name")

Unnamed: 0,movieId,title,genres,title_cleaned
42103,163134,Your Name. (2016),Animation|Drama|Fantasy|Romance,Your Name 2016
55023,190937,Tell Me Your Name,Horror|Thriller,Tell Me Your Name
56130,193495,Burn Your Name (2016),(no genres listed),Burn Your Name 2016
44566,168492,Call Me by Your Name (2017),Drama|Romance,Call Me by Your Name 2017
49627,179277,Name Me (2014),Adventure|Drama,Name Me 2014
51867,184067,Without Name (2017),Drama|Horror|Mystery,Without Name 2017
2376,2467,"Name of the Rose, The (Name der Rose, Der) (1986)",Crime|Drama|Mystery|Thriller,Name of the Rose The Name der Rose Der 1986
56382,194046,A Woman's Name (2018),Drama,A Womans Name 2018
44393,168128,In the Name of... (2013),Drama|Thriller,In the Name of 2013
16003,84479,Remember My Name (1978),Drama|Thriller,Remember My Name 1978


The top result will most likely be the movie the user entered.
Using it, we can get the movieId.
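
A minimal sketch of that lookup, assuming the search returns a DataFrame ordered best match first, as recommend does above. The small results frame is a stand-in built from the first rows of the output above, and the helper name top_movie_id is hypothetical.

```python
import pandas as pd

# Stand-in for the output of recommend(): best match first.
results = pd.DataFrame({
    "movieId": [163134, 190937, 193495],
    "title": ["Your Name. (2016)", "Tell Me Your Name", "Burn Your Name (2016)"],
})

def top_movie_id(search_results):
    """Return the movieId of the first (best) match, or None if there is none."""
    if search_results.empty:
        return None
    return int(search_results.iloc[0]["movieId"])

movie_id = top_movie_id(results)
print(movie_id)
```

Returning None for an empty result lets the caller decide what to show when nothing matches.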

Collaborative Filtering
- using user ratings for a movie

Aim
- we search for a movie
- the system finds the other users who liked that movie (users who rated it above 4)
- the system then collects the set of movies those users also liked (other movies they rated above 4)
- and stores, for each movie, the percentage of those users who liked it (x% of these users recommend this movie)
- the system then takes all users who liked (rated above 4) any movie in that filtered set and computes, for each movie, the percentage of them who liked it
- dividing the first percentage by the second gives a score for each movie
- we sort the movies by score; the higher the score, the more likely the user will like the movie
- the idea is that the larger the ratio between the two percentages, the more likely the user will like the movie (the score will be higher)
- the system returns the movies with the highest scores
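
The steps above can be sketched on a handful of made-up ratings. Everything here (toy_ratings, toy_scores, the ids) is invented for illustration, and the 10% cutoff used later in the notebook is left out to keep the sketch short.

```python
import pandas as pd

# Toy ratings, invented for illustration only.
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3, 3, 4, 4],
    "movieId": [10, 20, 30, 10, 20, 20, 30, 30, 10],
    "rating":  [5.0, 4.5, 3.0, 4.5, 5.0, 5.0, 4.5, 5.0, 2.0],
})

def toy_scores(ratings, movie_id, min_rating=4):
    # Only ratings above the threshold count as "liked".
    liked = ratings[ratings["rating"] > min_rating]
    # Users who liked the searched movie.
    fans = liked.loc[liked["movieId"] == movie_id, "userId"].unique()
    # Share of those fans who liked each movie.
    fan_share = liked[liked["userId"].isin(fans)]["movieId"].value_counts() / len(fans)
    # Share of all users who liked any of those movies, per movie.
    pool = liked[liked["movieId"].isin(fan_share.index)]
    all_share = pool["movieId"].value_counts() / pool["userId"].nunique()
    # Ratio of the two shares, best first.
    return (fan_share / all_share).sort_values(ascending=False)

print(toy_scores(toy_ratings, 10))
```

In this toy data, movie 20 is liked by every fan of movie 10 but also by everyone else in the pool, so its ratio stays at 1, while movie 10 itself scores higher.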

In [72]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
1,1,306,3.5,1147868817
2,1,307,5.0,1147868828
3,1,665,5.0,1147878820
4,1,899,3.5,1147868510


In [73]:
# set the movie id (1 is Toy Story; 163134 is Your Name.)
movie_id = 1

In [74]:
# users who liked the movie, i.e. gave it a rating above 4

users_alike = ratings[(ratings["movieId"] == movie_id) & (ratings["rating"] > 4)]["userId"].unique()

In [75]:
# number of users who liked the movie
users_alike.shape

(18835,)

In [76]:
# movies liked (rated above 4) by the users found above,
# i.e. by the users who also liked our searched movie
movies_alike = ratings[(ratings["userId"].isin(users_alike)) & (ratings["rating"] > 4)]["movieId"]

movies_alike

5101            1
5105           34
5111          110
5114          150
5127          260
            ...  
24998854    60069
24998861    67997
24998876    78499
24998884    81591
24998888    88129
Name: movieId, Length: 1358326, dtype: int64

In [77]:
movies_alike.shape

(1358326,)

In [78]:
movies_alike=movies_alike.value_counts()

In [79]:
# number of distinct movies liked (rated above 4) by the users who also liked our searched movie
movies_alike.shape

(19282,)

In [80]:
users_alike.shape

(18835,)

In [81]:
movies_alike=movies_alike/len(users_alike)
movies_alike

movieId
1         1.000000
318       0.445607
260       0.403770
356       0.370215
296       0.367295
            ...   
128478    0.000053
125125    0.000053
119701    0.000053
107563    0.000053
7625      0.000053
Name: count, Length: 19282, dtype: float64

Dividing the number of alike users who liked a specific movie by the total number of alike users gives a score/percentage
- 1.0000 means 100% of the alike users recommend it
- 0.4444 means 44.44% of the alike users recommend it
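
As a rough sanity check, the shares printed above can be turned back into user counts. This is only a sketch: the helper implied_count is hypothetical, and because the printed shares are rounded the counts are approximate.

```python
def implied_count(share, n_users):
    """Approximate number of users behind a printed share."""
    return round(share * n_users)

n_alike = 18835  # size of users_alike in the output above

for share in (1.000000, 0.445607, 0.000053):
    print(f"{share:.6f} of {n_alike} users is about {implied_count(share, n_alike)} users")
```

The smallest share in the output, 0.000053, corresponds to a single user out of 18835, which is why so many movies sit at exactly that value.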

In [82]:
# keep only the movies liked by more than 10% of the alike users
movies_alike = movies_alike[movies_alike > .1]
movies_alike

movieId
1        1.000000
318      0.445607
260      0.403770
356      0.370215
296      0.367295
           ...   
953      0.103053
551      0.101195
1222     0.100876
745      0.100345
48780    0.100186
Name: count, Length: 113, dtype: float64

Filtering out the movies recommended by 10% or fewer of the alike users reduces the number of movies considerably (from 19282 to 113).

In [83]:
# put the result in a dataframe to inspect it
movies_alike_df = pd.DataFrame({'movieId': movies_alike.index, 'also_liked%': movies_alike.values})
movies_alike_df.head()


Unnamed: 0,movieId,also_liked%
0,1,1.0
1,318,0.445607
2,260,0.40377
3,356,0.370215
4,296,0.367295


In [84]:
# add the movie titles to the dataframe to inspect it
movies_alike_df = pd.merge(movies_alike_df, movies, on='movieId', how='inner')
movies_alike_df.head()

Unnamed: 0,movieId,also_liked%,title,genres,title_cleaned
0,1,1.0,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Toy Story 1995
1,318,0.445607,"Shawshank Redemption, The (1994)",Crime|Drama,Shawshank Redemption The 1994
2,260,0.40377,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Sci-Fi,Star Wars Episode IV A New Hope 1977
3,356,0.370215,Forrest Gump (1994),Comedy|Drama|Romance|War,Forrest Gump 1994
4,296,0.367295,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller,Pulp Fiction 1994


Here you can see Toy Story (the movie we hardcoded), but the rest of the top movies shown are not similar to it at all. Granted, all the movies shown are great, but they appear because they are widely liked, not because they resemble Toy Story.

We need a way to get much more similar movies.

In [85]:
# all ratings above 4, by any user, for the filtered movies in movies_alike
users_all = ratings[ratings["movieId"].isin(movies_alike.index) & (ratings["rating"] > 4)]

In [86]:
users_all

Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
29,1,4973,4.5,1147869080
48,1,7361,5.0,1147880055
72,2,110,5.0,1141416589
76,2,260,5.0,1141417172
...,...,...,...,...
25000062,162541,5618,4.5,1240953299
25000065,162541,5952,5.0,1240952617
25000078,162541,7153,5.0,1240952613
25000081,162541,7361,4.5,1240953484


Let's see what percentage of all these users recommend each movie

In [87]:
# divide each movie's number of likes by the number of distinct users in users_all
user_recom=users_all["movieId"].value_counts()/len(users_all["userId"].unique())
user_recom

movieId
318      0.342220
296      0.284674
2571     0.244033
356      0.235266
593      0.225909
           ...   
551      0.040918
50872    0.039111
745      0.037031
78499    0.035131
2355     0.025091
Name: count, Length: 113, dtype: float64

Put both percentages together

In [88]:
r_percentages=pd.concat([movies_alike,user_recom],axis=1)
r_percentages.columns=["also_liked%","all_user_recom%"]
r_percentages

movieId,also_liked%,all_user_recom%
1,1.000000,0.124728
318,0.445607,0.342220
260,0.403770,0.222207
356,0.370215,0.235266
296,0.367295,0.284674
...,...,...
953,0.103053,0.045792
551,0.101195,0.040918
1222,0.100876,0.066877
745,0.100345,0.037031


Create a score (the ratio between the two percentages)

In [89]:
r_percentages["score"]=r_percentages["also_liked%"]/r_percentages["all_user_recom%"]
r_percentages

movieId,also_liked%,all_user_recom%,score
1,1.000000,0.124728,8.017414
318,0.445607,0.342220,1.302105
260,0.403770,0.222207,1.817089
356,0.370215,0.235266,1.573604
296,0.367295,0.284674,1.290232
...,...,...,...
953,0.103053,0.045792,2.250441
551,0.101195,0.040918,2.473085
1222,0.100876,0.066877,1.508376
745,0.100345,0.037031,2.709748


The first percentage is divided by the second, and the resulting score is stored in another column
- this score represents how relevant the movie is to the input movie
- the higher the value, the more relevant it is
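
To see why the ratio helps, here is a small sketch that recomputes the score for two rows, using the rounded percentages printed in this notebook. The helper name relevance_score is hypothetical, and the results differ slightly from the table because the inputs are rounded.

```python
def relevance_score(also_liked, all_user_recom):
    """Ratio of the share among alike users to the share among all users."""
    return also_liked / all_user_recom

# (also_liked%, all_user_recom%) as printed in the outputs of this notebook
rows = {
    "Toy Story 2 (1999)": (0.280648, 0.053706),
    "Shawshank Redemption, The (1994)": (0.445607, 0.342220),
}

for title, (also_liked, all_user_recom) in rows.items():
    print(title, round(relevance_score(also_liked, all_user_recom), 2))
```

Shawshank Redemption is liked by more of the alike users than Toy Story 2, but it is liked by almost as large a share of everyone else, so its ratio stays near 1.3. Toy Story 2 is about five times more common among Toy Story fans than among users in general.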

In [90]:
#sort by score
r_percentages=r_percentages.sort_values(by="score",ascending=False)
r_percentages

movieId,also_liked%,all_user_recom%,score
1,1.000000,0.124728,8.017414
3114,0.280648,0.053706,5.225654
2355,0.110539,0.025091,4.405452
78499,0.152960,0.035131,4.354038
4886,0.235147,0.070811,3.320783
...,...,...,...
2858,0.216724,0.167634,1.292845
296,0.367295,0.284674,1.290232
79132,0.166817,0.131384,1.269693
4973,0.142501,0.112405,1.267747


Merge with the relevant movie metadata using movieId
- the top 10 rows represent the 10 movies most similar to the input movie

In [91]:
# get the top 10 movies
# and merge with movies to get the movie titles

r_top=r_percentages.head(10).merge(movies,left_index=True,right_on="movieId",how="inner")
r_top

Unnamed: 0,also_liked%,all_user_recom%,score,movieId,title,genres,title_cleaned
0,1.0,0.124728,8.017414,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Toy Story 1995
3021,0.280648,0.053706,5.225654,3114,Toy Story 2 (1999),Adventure|Animation|Children|Comedy|Fantasy,Toy Story 2 1999
2264,0.110539,0.025091,4.405452,2355,"Bug's Life, A (1998)",Adventure|Animation|Children|Comedy,Bugs Life A 1998
14813,0.15296,0.035131,4.354038,78499,Toy Story 3 (2010),Adventure|Animation|Children|Comedy|Fantasy|IMAX,Toy Story 3 2010
4780,0.235147,0.070811,3.320783,4886,"Monsters, Inc. (2001)",Adventure|Animation|Children|Comedy|Fantasy,Monsters Inc 2001
580,0.216618,0.067513,3.208539,588,Aladdin (1992),Adventure|Animation|Children|Comedy|Musical,Aladdin 1992
6258,0.228139,0.072268,3.156862,6377,Finding Nemo (2003),Adventure|Animation|Children|Comedy,Finding Nemo 2003
587,0.1794,0.059977,2.99115,595,Beauty and the Beast (1991),Animation|Children|Fantasy|Musical|Romance|IMAX,Beauty and the Beast 1991
8246,0.203504,0.068453,2.972889,8961,"Incredibles, The (2004)",Action|Adventure|Animation|Children|Comedy,Incredibles The 2004
359,0.253411,0.085764,2.954762,364,"Lion King, The (1994)",Adventure|Animation|Children|Drama|Musical|IMAX,Lion King The 1994


Now it's much better: the movies are very similar to Toy Story (animated films)

Now we will combine all of this into one function

In [92]:
def get_movies(movie_id):
    # users who rated the searched movie above 4
    users_alike = ratings[(ratings["movieId"] == movie_id) & (ratings["rating"] > 4)]["userId"].unique()
    # movies those users rated above 4, as a share of users_alike
    movies_alike = ratings[(ratings["userId"].isin(users_alike)) & (ratings["rating"] > 4)]["movieId"]
    movies_alike = movies_alike.value_counts() / len(users_alike)

    # keep only the movies liked by more than 10% of those users
    movies_alike = movies_alike[movies_alike > .10]
    # ratings above 4, by any user, for the remaining movies
    users_all = ratings[(ratings["movieId"].isin(movies_alike.index)) & (ratings["rating"] > 4)]
    # share of those users who liked each movie
    user_recom = users_all["movieId"].value_counts() / len(users_all["userId"].unique())
    r_percentages = pd.concat([movies_alike, user_recom], axis=1)
    r_percentages.columns = ["similar", "all"]

    # ratio of the two shares: higher means more specific to fans of the searched movie
    r_percentages["score"] = r_percentages["similar"] / r_percentages["all"]
    r_percentages = r_percentages.sort_values("score", ascending=False)
    return r_percentages.head(10).merge(movies, left_index=True, right_on="movieId")[["score", "title", "genres","movieId"]]


In [93]:
movies[movies['movieId'] == 318]

Unnamed: 0,movieId,title,genres,title_cleaned
314,318,"Shawshank Redemption, The (1994)",Crime|Drama,Shawshank Redemption The 1994


In [94]:
test=get_movies(1)
test

Unnamed: 0,score,title,genres,movieId
0,8.017414,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1
3021,5.225654,Toy Story 2 (1999),Adventure|Animation|Children|Comedy|Fantasy,3114
2264,4.405452,"Bug's Life, A (1998)",Adventure|Animation|Children|Comedy,2355
14813,4.354038,Toy Story 3 (2010),Adventure|Animation|Children|Comedy|Fantasy|IMAX,78499
4780,3.320783,"Monsters, Inc. (2001)",Adventure|Animation|Children|Comedy|Fantasy,4886
580,3.208539,Aladdin (1992),Adventure|Animation|Children|Comedy|Musical,588
6258,3.156862,Finding Nemo (2003),Adventure|Animation|Children|Comedy,6377
587,2.99115,Beauty and the Beast (1991),Animation|Children|Fantasy|Musical|Romance|IMAX,595
8246,2.972889,"Incredibles, The (2004)",Action|Adventure|Animation|Children|Comedy,8961
359,2.954762,"Lion King, The (1994)",Adventure|Animation|Children|Drama|Musical|IMAX,364


In [95]:
test2=get_movies(163134)
test2

Unnamed: 0,score,title,genres,movieId
42103,187.893432,Your Name. (2016),Animation|Drama|Fantasy|Romance,163134
19639,58.873276,Wolf Children (Okami kodomo no ame to yuki) (2...,Animation|Fantasy,101962
20129,37.178913,"Wind Rises, The (Kaze tachinu) (2013)",Animation|Drama|Romance,104283
40153,34.193219,The Handmaiden (2016),Drama|Romance|Thriller,158783
41846,34.081616,Kubo and the Two Strings (2016),Adventure|Animation|Children|Fantasy,162578
12093,31.403537,"Girl Who Leapt Through Time, The (Toki o kaker...",Animation|Comedy|Drama|Romance|Sci-Fi,57504
40240,24.997518,Captain Fantastic (2016),Drama,158966
51785,23.222784,Isle of Dogs (2018),Animation|Comedy,183897
42923,22.023117,La La Land (2016),Comedy|Drama|Romance,164909
48958,20.604793,Coco (2017),Adventure|Animation|Children,177765
