I'm a data scientist at FlixGenius, a popular movie streaming service. Our management has recently decided to invest in enhancing our recommendation engine to provide more personalized movie recommendations to our users. We've found that users are more likely to continue using our service if they receive movie recommendations that match their personal preferences.

As a data scientist, I've been tasked with building a model that provides top 5 movie recommendations to a user, based on their ratings of other movies. The model will take into account the user's past viewing history and ratings, as well as the ratings and viewing history of other users with similar preferences.

For this project, I've been provided with a dataset called MovieLens. The data comes from the GroupLens research lab at the University of Minnesota and includes user ratings of movies, as well as information about the movies themselves. My job is to use this data to build a recommendation model that will provide personalized recommendations to users.

In [131]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

In [132]:
movies_df = pd.read_csv('https://gist.githubusercontent.com/MichalOst3389/e3913bbb6921ea9475660d58d280d55c/raw/3b8861ea300bbdd6b689bf853dfce94524b39301/movies.csv')
links_df = pd.read_csv('https://gist.githubusercontent.com/MichalOst3389/cfc2c59e9f323d11b7afb8f3224229f3/raw/ce13331097cbff6abcd941e8388db941220876fb/links.csv')
ratings_df = pd.read_csv('https://gist.githubusercontent.com/MichalOst3389/d8ed774a84197205b2d7e53ce8345aae/raw/064966f6d7c5f45f3aa404ca45d5ee9b9fed0ece/ratings.csv')
tags_df = pd.read_csv('https://gist.githubusercontent.com/MichalOst3389/9ff4a3740c440a3391d891af2ccac50a/raw/05572aac39b0fd8d278228158ad5a4cb20ecaa9c/tags.csv')

# Data Cleaning

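The first step is to combine the tables into a single dataset. I merged the ratings, movies, and tags tables on their common identifier, movieId. Because both merges are inner joins, each rating row is repeated once for every tag attached to the same movie, and ratings of movies that have no tags are dropped.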

In [133]:
# Merge datasets using movieId as the key
merged_df_1 = pd.merge(ratings_df, movies_df, on='movieId', how='inner')
merged_df = pd.merge(merged_df_1, tags_df, on='movieId', how='inner')
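
To see what an inner join on movieId alone does to the row count, here is a minimal sketch on made-up data. The toy frames and the suffix names `_rating` and `_tag` are illustrative assumptions, not part of the MovieLens files.

```python
import pandas as pd

# Toy stand-ins for ratings_df and tags_df (values are made up for illustration)
toy_ratings = pd.DataFrame({
    'userId': [1, 2, 3],
    'movieId': [10, 10, 20],
    'rating': [4.0, 3.5, 5.0],
    'timestamp': [100, 101, 102],
})
toy_tags = pd.DataFrame({
    'userId': [7, 8],
    'movieId': [10, 10],
    'tag': ['funny', 'pixar'],
    'timestamp': [200, 201],
})

# Inner join on movieId alone: movie 10 has 2 ratings x 2 tags = 4 rows,
# and movie 20 (which has no tags) is dropped entirely
toy_merged = pd.merge(toy_ratings, toy_tags, on='movieId', how='inner',
                      suffixes=('_rating', '_tag'))

print(len(toy_ratings), '->', len(toy_merged))
print(toy_merged.columns.tolist())
```

Passing explicit suffixes makes the origin of each overlapping column easier to read than the default `_x` and `_y`.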

In [134]:
merged_df.isnull()

        userId_x  movieId  rating  timestamp_x  title  genres  userId_y    tag  timestamp_y
0          False    False   False        False  False   False     False  False        False
1          False    False   False        False  False   False     False  False        False
2          False    False   False        False  False   False     False  False        False
3          False    False   False        False  False   False     False  False        False
4          False    False   False        False  False   False     False  False        False
...          ...      ...     ...          ...    ...     ...       ...    ...          ...
233208     False    False   False        False  False   False     False  False        False
233209     False    False   False        False  False   False     False  False        False
233210     False    False   False        False  False   False     False  False        False
233211     False    False   False        False  False   False     False  False        False


In [135]:
print(merged_df.isnull().sum())

userId_x       0
movieId        0
rating         0
timestamp_x    0
title          0
genres         0
userId_y       0
tag            0
timestamp_y    0
dtype: int64


### There are no null values

In [136]:
merged_df.duplicated()

0         False
1         False
2         False
3         False
4         False
          ...  
233208    False
233209    False
233210    False
233211    False
233212    False
Length: 233213, dtype: bool

In [137]:
print(merged_df.duplicated().sum())

0


### There are no duplicated rows.

In [138]:
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 233213 entries, 0 to 233212
Data columns (total 9 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   userId_x     233213 non-null  int64  
 1   movieId      233213 non-null  int64  
 2   rating       233213 non-null  float64
 3   timestamp_x  233213 non-null  int64  
 4   title        233213 non-null  object 
 5   genres       233213 non-null  object 
 6   userId_y     233213 non-null  int64  
 7   tag          233213 non-null  object 
 8   timestamp_y  233213 non-null  int64  
dtypes: float64(1), int64(5), object(3)
memory usage: 17.8+ MB


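There are no duplicate rows. However, the merge left pairs of columns that represent the same kind of information from different source tables, for example userId_x (from ratings_df) and userId_y (from tags_df). pandas adds the _x and _y suffixes when both sides of a merge have a column with the same name.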

## One hot encoding tags

One-hot encoding the tag column turns each distinct tag into its own indicator column, so the tags applied to each movie can be used as numeric input to a recommendation algorithm.

# One-hot encode the 'tag' column
tags_encoded = pd.get_dummies(merged_df['tag'])

# Merge the encoded tags back into the original dataframe
merged_df = pd.concat([merged_df, tags_encoded], axis=1)

# Drop the original 'tag' column
merged_df.drop('tag', axis=1, inplace=True)

## One hot encoding genres

# One-hot encode the 'genres' column
genres_encoded = pd.get_dummies(merged_df['genres'])

# Merge the encoded genres back into the original dataframe
merged_df = pd.concat([merged_df, genres_encoded], axis=1)

# Drop the original 'genres' column
merged_df.drop('genres', axis=1, inplace=True)
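
In MovieLens, the genres column stores several genres in one pipe-separated string (for example Adventure|Animation|Children), so `pd.get_dummies` on the raw column creates one indicator per distinct combination rather than one per genre. A possible alternative, sketched here on made-up rows, is `Series.str.get_dummies` with `sep='|'`. The frame `toy_movies` is an assumption for illustration.

```python
import pandas as pd

# Made-up rows in the same format as the MovieLens genres column
toy_movies = pd.DataFrame({
    'movieId': [1, 2, 3],
    'genres': ['Adventure|Animation|Children', 'Comedy|Romance', 'Comedy'],
})

# One column per distinct genre string (combinations stay fused together)
per_combination = pd.get_dummies(toy_movies['genres'])

# One column per individual genre, split on the '|' separator
per_genre = toy_movies['genres'].str.get_dummies(sep='|')

print(per_combination.columns.tolist())
print(per_genre.columns.tolist())
```

With the split version, a movie tagged Comedy|Romance shares the Comedy column with a pure Comedy movie, which the combination-level encoding cannot express.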

In [139]:
merged_df = merged_df.drop('userId_y', axis=1)
merged_df = merged_df.drop('timestamp_y', axis=1)

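I removed these columns because they come from tags_df and record who applied a tag and when. They have the same names and meaning as the columns from ratings_df but different values, and the rating-based model below only needs userId_x.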

In [143]:
# keep one row per (user, movie) pair; the merge with tags repeated each rating once per tag
merged_df.drop_duplicates(subset=['userId_x', 'movieId'], inplace=True)

# pivot the dataframe to have users as rows and movies as columns;
# movies a user has not rated are filled with 0
pivot_df = merged_df.pivot(index='userId_x', columns='movieId', values='rating').fillna(0)

# compute the cosine similarity matrix between all users
user_similarities = cosine_similarity(pivot_df)
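
Cosine similarity between two users' rating vectors u and v is u·v / (‖u‖ ‖v‖). A minimal sketch on a made-up 3-user, 3-movie matrix (the frame `toy_pivot` and its ids are assumptions for illustration) checks `cosine_similarity` against this formula and shows that users with no rated movies in common get a similarity of 0.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Toy user x movie rating matrix; 0 stands for "not rated", as in pivot_df
toy_pivot = pd.DataFrame(
    [[5.0, 4.0, 0.0],
     [4.0, 5.0, 0.0],
     [0.0, 0.0, 5.0]],
    index=pd.Index([11, 12, 13], name='userId_x'),
    columns=pd.Index([101, 102, 103], name='movieId'),
)

# Full user-by-user similarity matrix (rows and columns follow toy_pivot.index order)
sim = cosine_similarity(toy_pivot)

# Same quantity computed by hand for users 11 and 12
u = toy_pivot.loc[11].to_numpy()
v = toy_pivot.loc[12].to_numpy()
manual = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(round(sim[0, 1], 4), round(manual, 4))
print(sim[0, 2])
```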

In [160]:
# pivot_df and user_similarities from the previous cell are reused here

# define the user_id for which we want to provide recommendations
user_id = 20

# get the similarity scores for the target user compared to all other users
# (the similarity matrix is indexed by row position in pivot_df, not by userId)
user_sim_scores = user_similarities[user_id]

# find the positions of the top 5 most similar users, skipping the first entry,
# which is normally the same user (similarity 1)
top_users = np.argsort(-user_sim_scores)[1:6]

# for each movie, take the highest rating given by any of the 5 similar users,
# then keep the 5 movies with the highest such rating
top_movies = pivot_df.iloc[top_users].max().sort_values(ascending=False)[:5]

# map the movie ids to movie titles
top_movie_titles = [movies_df.loc[movies_df['movieId'] == movie_id, 'title'].iloc[0] for movie_id in top_movies.index]

# print the top 5 recommended movie titles
print(top_movie_titles)

['Toy Story (1995)', 'My Neighbor Totoro (Tonari no Totoro) (1988)', 'L.A. Confidential (1997)', 'Men in Black (a.k.a. MIB) (1997)', 'Face/Off (1997)']
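
Because `cosine_similarity` returns a plain NumPy array, its rows follow the order of `pivot_df.index` rather than the userId values themselves. One way to make that lookup explicit, and to skip movies the target user has already rated, is sketched below on made-up data. The function name `recommend_for_user`, its parameters, and the choice to rank by the neighbours' mean rating are assumptions for illustration, not the method used in the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

def recommend_for_user(pivot, user_id, n_neighbours=5, n_items=5):
    """Hypothetical helper: rank unseen movies by the neighbours' mean rating."""
    sim = cosine_similarity(pivot)
    pos = pivot.index.get_loc(user_id)      # row position of this userId
    scores = sim[pos].copy()
    scores[pos] = -np.inf                   # never pick the user as their own neighbour
    neighbours = np.argsort(-scores)[:n_neighbours]

    # Mean rating per movie among the neighbours (unrated entries count as 0 here)
    candidate_scores = pivot.iloc[neighbours].mean()

    # Keep only movies the target user has not rated yet
    unseen = pivot.loc[user_id] == 0
    return candidate_scores[unseen].sort_values(ascending=False).head(n_items)

# Toy user x movie matrix; user ids deliberately do not start at 0
toy_pivot = pd.DataFrame(
    [[5.0, 4.0, 0.0, 0.0],
     [5.0, 5.0, 4.0, 0.0],
     [4.0, 4.0, 5.0, 1.0],
     [0.0, 0.0, 1.0, 5.0]],
    index=pd.Index([20, 21, 22, 23], name='userId_x'),
    columns=pd.Index([101, 102, 103, 104], name='movieId'),
)

recs = recommend_for_user(toy_pivot, user_id=20, n_neighbours=2, n_items=2)
print(recs)
```

In this toy example, user 20 sits at row position 0, so indexing the similarity matrix with 20 directly would fail, while `get_loc` resolves the correct row.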


In [159]:
# compute the cosine similarity matrix between all users
user_similarities = cosine_similarity(pivot_df)

# define the user_id for which we want to evaluate the recommendations
user_id = 1

# look up the row position of user_id in pivot_df, since the similarity matrix
# is indexed by position rather than by userId
user_pos = pivot_df.index.get_loc(user_id)

# get the similarity scores for the target user compared to all other users
user_sim_scores = user_similarities[user_pos]

# find the positions of the top 5 most similar users, skipping the first entry,
# which is normally the target user itself
top_users = np.argsort(-user_sim_scores)[1:6]

# get the movies that the top 5 similar users rated the highest
top_movies = pivot_df.iloc[top_users].max().sort_values(ascending=False)[:5]

# define the ground truth dataset
ground_truth = merged_df[merged_df['userId_x'] == user_id]

# define the set of recommended items
recommended_items = top_movies.index.tolist()

# compute precision, recall, and F1-score, treating ratings of 4 or more as relevant
relevant_items = ground_truth[ground_truth['rating'] >= 4]['movieId'].tolist()
tp = len(set(recommended_items) & set(relevant_items))
fp = len(set(recommended_items) - set(relevant_items))
fn = len(set(relevant_items) - set(recommended_items))

# guard each ratio against a zero denominator
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1_score = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print('Precision:', precision)
print('Recall:', recall)
print('F1-score:', f1_score)
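
The precision, recall, and F1 arithmetic can also be wrapped in a small helper that guards against empty sets, since any of the denominators can be zero when nothing is recommended or nothing is relevant. The function name `precision_recall_f1` and the example movie ids are hypothetical.

```python
def precision_recall_f1(recommended, relevant):
    """Hypothetical helper: set-based precision, recall, and F1 with zero guards."""
    recommended, relevant = set(recommended), set(relevant)
    tp = len(recommended & relevant)                     # items both recommended and relevant
    precision = tp / len(recommended) if recommended else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# 5 recommended movies, 4 relevant ones, 2 in common (ids are made up)
p, r, f1 = precision_recall_f1([1, 2, 3, 4, 5], [2, 4, 6, 8])
print('Precision:', p)
print('Recall:', r)
print('F1-score:', round(f1, 4))
```

Since the ground truth here comes from the same ratings used to build the similarity matrix, holding out part of each user's ratings before computing similarities would likely give a stricter estimate of recommendation quality.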