<a href="https://www.kaggle.com/code/iahhel/recommender-system-user-filtering-content-based?scriptVersionId=130509556" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

This project is a movie recommender system built on IMDB's dataset of movies and a dataset of user ratings. It uses both content-based and user filtering methods to suggest movies based on a user's preferences and past viewing habits. The content-based method recommends movies similar to ones the user has liked in the past, while the user filtering (collaborative) method recommends movies that users with similar preferences have enjoyed.

Importing packages we're going to use.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns



Loading the datasets.

In [2]:
movies = pd.read_csv('/kaggle/input/movies/movies.csv') # movies dataset
ratings = pd.read_csv('/kaggle/input/imdb-user-ratings/ratings.csv') # user ratings dataset
print(movies.shape, ratings.shape)
movies.head()

(9742, 3) (100836, 4)


Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


Cleaning the data and defining some new variables.

In [3]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


In [4]:
# Extracting the four-digit year from the 'title' column into a new 'year' column (raw string avoids invalid-escape warnings)
movies['year'] = movies.title.str.extract(r'\((\d{4})\)', expand=False)

# Removing the '(YYYY)' part from the 'title' column; regex=True is needed because newer pandas treats the pattern as a literal string by default
movies['title'] = movies.title.str.replace(r'\(\d{4}\)', '', regex=True)

# Splitting the genres column on '|' and converting it to a list
movies['genres'] = movies.genres.str.split('|')

# Removing leading and trailing whitespace from the 'title' column
movies['title'] = movies['title'].str.strip()

# Dropping any rows with missing values and printing the number of missing values in each column
movies.dropna(inplace=True)
print(movies.isna().sum(),'\n----')

movies.sample(6)

movieId    0
title      0
genres     0
year       0
dtype: int64 
----



Unnamed: 0,movieId,title,genres,year
1877,2495,"Fantastic Planet, The (Planète sauvage, La)","[Animation, Sci-Fi]",1973
8368,109295,Cold Comes the Night,"[Crime, Drama, Thriller]",2013
2950,3955,"Ladies Man, The",[Comedy],2000
5308,8804,"Story of Women (Affaire de femmes, Une)",[Drama],1988
1175,1564,For Roseanna (Roseanna's Grave),"[Comedy, Drama, Romance]",1997
5253,8617,Butterfield 8,[Drama],1960
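
To make the title cleanup easier to check, here is a minimal sketch of the same idea on a few made-up titles. The sample values and the names `titles`, `years` and `clean_titles` are only for illustration and are not part of the notebook's data.

```python
import pandas as pd

# Hypothetical sample titles in the same "Title (YYYY)" layout as movies.csv
titles = pd.Series(["Toy Story (1995)", "Jumanji (1995)", "Untitled Film"])

# Capture only the four digits inside the parentheses; titles without a year yield NaN
years = titles.str.extract(r"\((\d{4})\)", expand=False)

# Drop the "(YYYY)" part and any surrounding whitespace; regex=True makes pandas treat the pattern as a regex
clean_titles = titles.str.replace(r"\s*\(\d{4}\)", "", regex=True).str.strip()

print(years.tolist())
print(clean_titles.tolist())
```

Rows whose title has no year end up with a missing `year`, which is why the cell above drops missing values afterwards.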


In [5]:
moviegenre = movies.copy() # making a new dataframe with genres as dummy features

for index, row in movies.iterrows():
    for genre in row['genres']:
        moviegenre.at[index, genre] = 1

moviegenre = moviegenre.fillna(0) # replacing nan values with 0
moviegenre.head(5)

Unnamed: 0,movieId,title,genres,year,Adventure,Animation,Children,Comedy,Fantasy,Romance,...,Horror,Mystery,Sci-Fi,War,Musical,Documentary,IMAX,Western,Film-Noir,(no genres listed)
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995,1.0,1.0,1.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995,1.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,Grumpier Old Men,"[Comedy, Romance]",1995,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]",1995,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,Father of the Bride Part II,[Comedy],1995,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
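
The row-by-row loop works, but the same kind of genre indicator table can probably be built without iterating. This is a sketch on a tiny hypothetical frame (`sample` is made up); it uses `Series.str.join` and `Series.str.get_dummies`, and it returns integer 0/1 columns in alphabetical order instead of the float columns shown above.

```python
import pandas as pd

# Hypothetical sample with genres already split into lists, as in the 'movies' dataframe
sample = pd.DataFrame({
    "movieId": [1, 2, 3],
    "genres": [["Adventure", "Comedy"], ["Adventure"], ["Comedy", "Romance"]],
})

# Join each list back into a '|'-separated string, then let pandas build the 0/1 indicator columns
dummies = sample["genres"].str.join("|").str.get_dummies(sep="|")

# Put the indicator columns next to the movie ids
genre_matrix = pd.concat([sample[["movieId"]], dummies], axis=1)
print(genre_matrix)
```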


Taking a quick look at the ratings dataframe and cleaning it up.

In [6]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [7]:
ratings.drop('timestamp', axis=1, inplace=True)
print(ratings.isna().sum())
ratings.info()

userId     0
movieId    0
rating     0
dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 3 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   userId   100836 non-null  int64  
 1   movieId  100836 non-null  int64  
 2   rating   100836 non-null  float64
dtypes: float64(1), int64(2)
memory usage: 2.3 MB


In [8]:
inp = [
            {'title':'Breakfast Club, The', 'rating':5},
            {'title':'Toy Story', 'rating':3.5},
            {'title':'Jumanji', 'rating':2},
            {'title':"Pulp Fiction", 'rating':5},
            {'title':'Akira', 'rating':4.5}
         ]
userinp = pd.DataFrame(inp) # making a sample dataframe as our input.

Making some separate dataframes to work with and dropping unnecessary features to save memory.

First I'm going to apply the content-based method, next I'll try the collaborative one, and finally I'll put together a proper report.

In [9]:
inpid = movies[movies['title'].isin(userinp['title'].tolist())] # getting the movieId of the specified movies
userinput = pd.merge(inpid, userinp)
userinput.drop(['genres','year'], axis=1, inplace=True)

userMovies = moviegenre[moviegenre['movieId'].isin(userinput['movieId'].tolist())]
userMovies = userMovies.reset_index(drop=True)
userMovies.drop(['movieId','title','genres','year'], axis=1, inplace=True)

In [10]:
userprofile = userMovies.T.dot(userinput['rating']) # getting the weights of each feature and storing it in the userprofile
userprofile

Adventure             10.0
Animation              8.0
Children               5.5
Comedy                13.5
Fantasy                5.5
Romance                0.0
Drama                 10.0
Action                 4.5
Crime                  5.0
Thriller               5.0
Horror                 0.0
Mystery                0.0
Sci-Fi                 4.5
War                    0.0
Musical                0.0
Documentary            0.0
IMAX                   0.0
Western                0.0
Film-Noir              0.0
(no genres listed)     0.0
dtype: float64
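
The dot product above sums, for each genre, the ratings of the input movies that carry that genre. A small sketch with made-up numbers (the names `genre_flags`, `user_ratings` and `profile` are hypothetical) shows the arithmetic:

```python
import pandas as pd

# Hypothetical genre indicators for three rated movies (rows) and three genres (columns)
genre_flags = pd.DataFrame(
    {"Adventure": [1, 1, 0], "Comedy": [1, 0, 1], "Drama": [0, 0, 1]}
)
# Hypothetical ratings for the same three movies, in the same row order
user_ratings = pd.Series([3.5, 2.0, 5.0])

# Each genre weight is the sum of the ratings of the movies that have that genre
profile = genre_flags.T.dot(user_ratings)
print(profile)
```

Here Adventure gets 3.5 + 2.0 = 5.5, Comedy gets 3.5 + 5.0 = 8.5 and Drama gets 5.0. The result is only meaningful when the rows of the genre table and the ratings are in the same order, which is the case in the cell above because both follow the order of `movies`.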

In [11]:
genreTable = moviegenre.set_index(moviegenre['movieId'])
genreTable.drop(['movieId','title','genres','year'], axis=1, inplace=True) # making a genre dataframe with movieId as its index

# Scoring every movie: the sum of the user's weights for the movie's genres, divided by the total weight, gives a value between 0 and 1
recommendationTable_df = ((genreTable*userprofile).sum(axis=1))/(userprofile.sum())
recommendationTable_df = recommendationTable_df.sort_values(ascending=False)

top_movie_ids = recommendationTable_df.head(20).keys()

# Select the rows from the movies dataframe that correspond to the top movie IDs,
# and preserve the order of the movie IDs in the recommendation table
content_based_res = movies[movies['movieId'].isin(top_movie_ids)].merge(
    pd.DataFrame({'movieId': top_movie_ids, 'rank': range(1, len(top_movie_ids)+1)}),
    on='movieId'
).sort_values(by='rank')
content_based_res.reset_index(inplace=True, drop=True)
content_based_res

Unnamed: 0,movieId,title,genres,year,rank
0,134853,Inside Out,"[Adventure, Animation, Children, Comedy, Drama...",2015,1
1,148775,Wizards of Waverly Place: The Movie,"[Adventure, Children, Comedy, Drama, Fantasy, ...",2009,2
2,117646,Dragonheart 2: A New Beginning,"[Action, Adventure, Comedy, Drama, Fantasy, Th...",2000,3
3,6902,Interstate 60,"[Adventure, Comedy, Drama, Fantasy, Mystery, S...",2002,4
4,81132,Rubber,"[Action, Adventure, Comedy, Crime, Drama, Film...",2010,5
5,2987,Who Framed Roger Rabbit?,"[Adventure, Animation, Children, Comedy, Crime...",1988,6
6,51939,TMNT (Teenage Mutant Ninja Turtles),"[Action, Adventure, Animation, Children, Comed...",2007,7
7,673,Space Jam,"[Adventure, Animation, Children, Comedy, Fanta...",1996,8
8,32031,Robots,"[Adventure, Animation, Children, Comedy, Fanta...",2005,9
9,108932,The Lego Movie,"[Action, Adventure, Animation, Children, Comed...",2014,10
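
As a rough illustration of the scoring step, assuming a small made-up profile and three hypothetical candidate movies, the score is the share of the user's total genre weight that a movie covers:

```python
import pandas as pd

# Hypothetical user profile (genre weights)
profile = pd.Series({"Adventure": 5.5, "Comedy": 8.5, "Drama": 5.0})

# Hypothetical candidate movies indexed by movieId
candidates = pd.DataFrame(
    {"Adventure": [1, 0, 1], "Comedy": [1, 1, 0], "Drama": [1, 0, 0]},
    index=pd.Index([10, 20, 30], name="movieId"),
)

# Multiply each genre flag by its weight, sum per movie, and normalise by the total weight
scores = (candidates * profile).sum(axis=1) / profile.sum()
ranked = scores.sort_values(ascending=False)
print(ranked)
```

A movie that has every genre in the profile scores 1.0, so this kind of score tends to favour movies tagged with many genres, which may explain why the list above is dominated by titles with five or more genres.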


# Collaborative filtering recommendation system.

In [12]:
colabdf = movies.drop('genres',axis=1)
# Picking all the users who rated the same movies as our "userinput" sample data.
userSubset = ratings[ratings['movieId'].isin(userinput['movieId'].tolist())]
# Grouping by a single column name (not a one-element list) keeps each group key a plain userId.
userSubsetGroup = userSubset.groupby('userId')
# Sorting the groups so the users who share the most movies with our sample input come first.
userSubsetGroup = sorted(userSubsetGroup,  key=lambda x: len(x[1]), reverse=True)


In [13]:
# quick comparison
print(userSubsetGroup[0:1])
userinput

[(91,        userId  movieId  rating
14121      91        1     4.0
14122      91        2     3.0
14173      91      296     4.5
14316      91     1274     5.0
14383      91     1968     3.0)]


Unnamed: 0,movieId,title,rating
0,1,Toy Story,3.5
1,2,Jumanji,2.0
2,296,Pulp Fiction,5.0
3,1274,Akira,4.5
4,1968,"Breakfast Club, The",5.0


In [14]:
userSubsetGroup = userSubsetGroup[0:100] # keeping our top 100 matches

Store the Pearson correlation in a dictionary, where the key is the user ID and the value is the coefficient.

In [15]:
# Initialize an empty dictionary to store Pearson correlation coefficients
pearson_correlation_dict = {}

# Iterate over each user subset group
for name, group in userSubsetGroup:
    # Sort the movies in the group by their IDs
    group = group.sort_values(by='movieId')
    # Sort the input movies by their IDs
    input_movies = userinput.sort_values(by='movieId')
    # Get the number of ratings for this group
    n_ratings = len(group)
    # Filter the input movies to only those in this group
    temp_df = input_movies[input_movies['movieId'].isin(group['movieId'].tolist())]
    # Create lists of ratings for the input movies and the group
    temp_rating_list = temp_df['rating'].tolist()
    temp_group_list = group['rating'].tolist()
    # Compute the centered sum of squares of the input movie ratings
    sxx = sum([i**2 for i in temp_rating_list]) - pow(sum(temp_rating_list), 2) / float(n_ratings)
    # Compute the centered sum of squares of the group ratings
    syy = sum([i**2 for i in temp_group_list]) - pow(sum(temp_group_list), 2) / float(n_ratings)
    # Compute the centered sum of the products of the input movie ratings and the group ratings
    sxy = sum(i * j for i, j in zip(temp_rating_list, temp_group_list)) - sum(temp_rating_list) * sum(temp_group_list) / float(n_ratings)
    # Compute the Pearson correlation coefficient for this user and add it to the dictionary
    if sxx != 0 and syy != 0:
        pearson_correlation_dict[name] = sxy / np.sqrt(sxx * syy)
    else:
        pearson_correlation_dict[name] = 0
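
As a sanity check on the loop above, here is a small sketch that applies the same sxx / syy / sxy sums to user 91 (ratings copied from the comparison output earlier) and compares the result with `np.corrcoef`. The helper name `pearson` is hypothetical.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation from the centered sums sxx, syy and sxy."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    sxx = np.sum(x**2) - np.sum(x) ** 2 / n
    syy = np.sum(y**2) - np.sum(y) ** 2 / n
    sxy = np.sum(x * y) - np.sum(x) * np.sum(y) / n
    # A constant rating vector has zero variance, so the coefficient is undefined; fall back to 0
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / np.sqrt(sxx * syy)

# Ratings for movieIds 1, 2, 296, 1274, 1968, sorted by movieId
input_ratings = [3.5, 2.0, 5.0, 4.5, 5.0]   # sample input
user_91 = [4.0, 3.0, 4.5, 5.0, 3.0]         # user 91

r = pearson(input_ratings, user_91)
print(round(r, 4))                                        # about 0.4385
print(round(np.corrcoef(input_ratings, user_91)[0, 1], 4))  # same value from NumPy
```

Both values agree with the 0.4385 stored for user 91. Note that users who share only two movies with the input can only get a coefficient of 1, -1 or 0, so a perfect score does not necessarily mean a close match.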

In [16]:
pearson_correlation_dict.items()

dict_items([(91, 0.43852900965351443), (177, 0.0), (219, 0.45124262819713973), (274, 0.716114874039432), (298, 0.9592712306918567), (414, 0.9376144618769914), (474, 0.11720180773462392), (477, 0.4385290096535153), (480, 0.7844645405527362), (483, 0.08006407690254357), (599, 0.7666866491579839), (608, 0.920736884379251), (50, 0.15713484026367722), (57, -0.7385489458759964), (68, 0.0), (103, 0.5222329678670935), (135, 0.8703882797784892), (182, 0.9428090415820635), (202, 0.5222329678670935), (217, 0.30151134457776363), (226, 0.9438798074485389), (288, 0.6005325641789633), (307, 0.9655810287305759), (318, 0.44486512077567225), (322, 0.5057805388588731), (330, 0.9035942578600878), (357, 0.5606119105813882), (434, 0.9864036607532465), (448, 0.30151134457776363), (469, 0.8164965809277261), (561, 0.5222329678670935), (600, 0.18442777839082938), (606, 0.9146591207600472), (610, -0.47140452079103173), (18, 1.0), (19, -0.5), (21, 0), (45, 0.5000000000000009), (63, -0.4999999999999982), (64, 0.0)

In [17]:
# Convert the Pearson correlation dictionary to a DataFrame
pearsonDF = pd.DataFrame.from_dict(pearson_correlation_dict, orient='index')
pearsonDF.columns = ['similarityIndex']
pearsonDF['userId'] = pearsonDF.index
pearsonDF.index = range(len(pearsonDF))
# Print the first five rows of the DataFrame
pearsonDF.head()

Unnamed: 0,similarityIndex,userId
0,0.438529,91
1,0.0,177
2,0.451243,219
3,0.716115,274
4,0.959271,298


In [18]:
# Sort the users by their Pearson correlation coefficients and select the top 50
topUsers=pearsonDF.sort_values(by='similarityIndex', ascending=False)[0:50]
# Print the first five rows of the top users DataFrame
topUsers.head()

Unnamed: 0,similarityIndex,userId
43,1.0,132
34,1.0,18
63,1.0,305
82,1.0,489
86,1.0,525


In [19]:
# Merge the top users DataFrame with the ratings DataFrame
topUsersRating=topUsers.merge(ratings, left_on='userId', right_on='userId', how='inner')
# Print the first five rows of the merged DataFrame
topUsersRating.head()

Unnamed: 0,similarityIndex,userId,movieId,rating
0,1.0,132,1,2.0
1,1.0,132,17,3.0
2,1.0,132,29,2.0
3,1.0,132,32,3.0
4,1.0,132,34,1.5


In [20]:
# Compute the weighted rating for each movie by multiplying the similarity index and the user's rating
topUsersRating['weightedRating'] = topUsersRating['similarityIndex']*topUsersRating['rating']
# Print the first five rows of the DataFrame with weighted ratings
topUsersRating.head()

Unnamed: 0,similarityIndex,userId,movieId,rating,weightedRating
0,1.0,132,1,2.0,2.0
1,1.0,132,17,3.0,3.0
2,1.0,132,29,2.0,2.0
3,1.0,132,32,3.0,3.0
4,1.0,132,34,1.5,1.5


In [21]:
# Group the DataFrame by movie ID and compute the sum of the similarity indices and the sum of the weighted ratings
tempTopUsersRating = topUsersRating.groupby('movieId').sum()[['similarityIndex','weightedRating']]
tempTopUsersRating.columns = ['sum_similarityIndex','sum_weightedRating']
# Print the first five rows of the grouped DataFrame
tempTopUsersRating.head()

movieId,sum_similarityIndex,sum_weightedRating
1,36.354096,133.167946
2,31.005292,94.904257
3,8.783859,26.381456
4,0.866025,1.732051
5,7.165336,19.775255


In [22]:
# create an empty dataframe to hold the recommendation scores for each movie
recommendation_df = pd.DataFrame()
# calculate the weighted average recommendation score for each movie using the sum of weighted ratings and sum of similarity indices
recommendation_df['weighted average recommendation score'] = tempTopUsersRating['sum_weightedRating']/tempTopUsersRating['sum_similarityIndex']
# set the movieId column to the index of the dataframe
recommendation_df['movieId'] = tempTopUsersRating.index
recommendation_df.head()

movieId,weighted average recommendation score,movieId
1,3.66308,1
2,3.060905,2
3,3.003402,3
4,2.0,4
5,2.75985,5
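
The weighted average can be followed on a tiny made-up example. The `neighbours` frame below is hypothetical; the column names mirror the ones used in the cells above.

```python
import pandas as pd

# Hypothetical neighbours: similarity to the input user and their ratings for two movies
neighbours = pd.DataFrame({
    "movieId": [1, 1, 2, 2],
    "similarityIndex": [1.0, 0.5, 1.0, 0.5],
    "rating": [4.0, 2.0, 3.0, 5.0],
})
neighbours["weightedRating"] = neighbours["similarityIndex"] * neighbours["rating"]

# Per movie: sum of similarities and sum of similarity-weighted ratings
sums = neighbours.groupby("movieId")[["similarityIndex", "weightedRating"]].sum()

# Weighted average rating = sum(similarity * rating) / sum(similarity)
score = sums["weightedRating"] / sums["similarityIndex"]
print(score)
```

For movie 1 this gives (4.0 + 1.0) / 1.5, roughly 3.33, and for movie 2 (3.0 + 2.5) / 1.5, roughly 3.67. A movie rated by a single neighbour simply gets that neighbour's rating as its score.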


In [23]:
# sort the recommendations by descending order of weighted average recommendation score
recommendation_df = recommendation_df.sort_values(by='weighted average recommendation score', ascending=False)
# get the top 10 movies from the recommendations dataframe and retrieve their metadata from the movies dataframe
collab_res = movies.loc[movies['movieId'].isin(recommendation_df.head(10)['movieId'].tolist())].reset_index(drop=True)
# display the top 10 recommendations with their weighted average recommendation score
recommendation_df.head(10)

movieId,weighted average recommendation score,movieId
3310,5.0,3310
7579,5.0,7579
905,5.0,905
1211,5.0,1211
140627,5.0,140627
4298,5.0,4298
152711,5.0,152711
633,5.0,633
5537,5.0,5537
5485,5.0,5485
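
One thing to keep in mind: selecting rows with `isin` returns them in the order of `movies`, not in ranking order, so `collab_res` comes back sorted by position in `movies`. A possible way to keep the ranking, sketched here on made-up frames (`movies_demo` and `ranked` are hypothetical), is to merge the metadata onto the ranked table instead:

```python
import pandas as pd

# Hypothetical movie metadata and a ranked list of recommendations (best first)
movies_demo = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]})
ranked = pd.DataFrame({"movieId": [3, 1], "score": [5.0, 4.2]})

# A left merge keeps the row order of the left frame, so the ranking survives the join
top = ranked.merge(movies_demo, on="movieId", how="left")
print(top)
```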


In [24]:
content_based_res.head(10)

Unnamed: 0,movieId,title,genres,year,rank
0,134853,Inside Out,"[Adventure, Animation, Children, Comedy, Drama...",2015,1
1,148775,Wizards of Waverly Place: The Movie,"[Adventure, Children, Comedy, Drama, Fantasy, ...",2009,2
2,117646,Dragonheart 2: A New Beginning,"[Action, Adventure, Comedy, Drama, Fantasy, Th...",2000,3
3,6902,Interstate 60,"[Adventure, Comedy, Drama, Fantasy, Mystery, S...",2002,4
4,81132,Rubber,"[Action, Adventure, Comedy, Crime, Drama, Film...",2010,5
5,2987,Who Framed Roger Rabbit?,"[Adventure, Animation, Children, Comedy, Crime...",1988,6
6,51939,TMNT (Teenage Mutant Ninja Turtles),"[Action, Adventure, Animation, Children, Comed...",2007,7
7,673,Space Jam,"[Adventure, Animation, Children, Comedy, Fanta...",1996,8
8,32031,Robots,"[Adventure, Animation, Children, Comedy, Fanta...",2005,9
9,108932,The Lego Movie,"[Action, Adventure, Animation, Children, Comed...",2014,10


In [25]:
collab_res

Unnamed: 0,movieId,title,genres,year
0,633,Denise Calls Up,[Comedy],1995
1,905,It Happened One Night,"[Comedy, Romance]",1934
2,1211,"Wings of Desire (Himmel über Berlin, Der)","[Drama, Fantasy, Romance]",1987
3,3310,"Kid, The","[Comedy, Drama]",1921
4,4298,Rififi (Du rififi chez les hommes),"[Crime, Film-Noir, Thriller]",1955
5,5485,Tadpole,"[Comedy, Drama, Romance]",2002
6,5537,Satin Rouge,"[Drama, Musical]",2002
7,7579,Pride and Prejudice,"[Comedy, Drama, Romance]",1940
8,140627,Battle For Sevastopol,"[Drama, Romance, War]",2015
9,152711,Who Killed Chea Vichea?,[Documentary],2010
