# Business Understanding



I'm a data scientist at FlixGenius, a popular movie streaming service. Our management has recently decided to invest in enhancing our recommendation engine to provide more personalized movie recommendations to our users. We've found that users are more likely to continue using our service if they receive movie recommendations that match their personal preferences.

As a data scientist, I've been tasked with building a model that provides top 5 movie recommendations to a user, based on their ratings of other movies. The model will take into account the user's past viewing history and ratings, as well as the ratings and viewing history of other users with similar preferences.

For this project, I've been provided with a dataset called MovieLens. The data comes from the GroupLens research lab at the University of Minnesota and includes user ratings of movies, as well as information about the movies themselves. My job is to use this data to build a recommendation model that will provide personalized recommendations to users.

In [182]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import NearestNeighbors

from sklearn.feature_extraction.text import TfidfVectorizer
import random
import requests
from bs4 import BeautifulSoup
import math
from sklearn.metrics import precision_score, recall_score, f1_score

In [183]:
movies_df = pd.read_csv('https://gist.githubusercontent.com/MichalOst3389/e3913bbb6921ea9475660d58d280d55c/raw/3b8861ea300bbdd6b689bf853dfce94524b39301/movies.csv')
links_df = pd.read_csv('https://gist.githubusercontent.com/MichalOst3389/cfc2c59e9f323d11b7afb8f3224229f3/raw/ce13331097cbff6abcd941e8388db941220876fb/links.csv')
ratings_df = pd.read_csv('https://gist.githubusercontent.com/MichalOst3389/d8ed774a84197205b2d7e53ce8345aae/raw/064966f6d7c5f45f3aa404ca45d5ee9b9fed0ece/ratings.csv')
tags_df = pd.read_csv('https://gist.githubusercontent.com/MichalOst3389/9ff4a3740c440a3391d891af2ccac50a/raw/05572aac39b0fd8d278228158ad5a4cb20ecaa9c/tags.csv')

# Data understanding

The first step is to combine the files into a single dataset. I merged the ratings, movies, and tags datasets on their common identifier, movieId.

In [184]:
# Merge datasets using movieId as the key
merged_df_1 = pd.merge(ratings_df, movies_df, on='movieId', how='inner')
merged_df = pd.merge(merged_df_1, tags_df, on='movieId', how='inner')

In [185]:
merged_df.isnull()

Unnamed: 0,userId_x,movieId,rating,timestamp_x,title,genres,userId_y,tag,timestamp_y
0,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...
233208,False,False,False,False,False,False,False,False,False
233209,False,False,False,False,False,False,False,False,False
233210,False,False,False,False,False,False,False,False,False
233211,False,False,False,False,False,False,False,False,False


In [186]:
print(merged_df.isnull().sum())

userId_x       0
movieId        0
rating         0
timestamp_x    0
title          0
genres         0
userId_y       0
tag            0
timestamp_y    0
dtype: int64


### There are no null values

In [187]:
merged_df.duplicated()

0         False
1         False
2         False
3         False
4         False
          ...  
233208    False
233209    False
233210    False
233211    False
233212    False
Length: 233213, dtype: bool

In [188]:
print(merged_df.duplicated().sum())

0


### There are no duplicated rows.

In [189]:
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 233213 entries, 0 to 233212
Data columns (total 9 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   userId_x     233213 non-null  int64  
 1   movieId      233213 non-null  int64  
 2   rating       233213 non-null  float64
 3   timestamp_x  233213 non-null  int64  
 4   title        233213 non-null  object 
 5   genres       233213 non-null  object 
 6   userId_y     233213 non-null  int64  
 7   tag          233213 non-null  object 
 8   timestamp_y  233213 non-null  int64  
dtypes: float64(1), int64(5), object(3)
memory usage: 17.8+ MB


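We can see there are no duplicate rows above. However, columns that exist in both the ratings and tags files were carried into the merged dataframe twice with _x and _y suffixes, for example userId_x (from ratings) and userId_y (from tags), and likewise timestamp_x and timestamp_y. I only need the ratings versions, so the _y columns are dropped below.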
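To see where the suffixed columns come from, here is a minimal sketch on two made-up toy frames (the names ratings and tags and all values are illustrative, not taken from MovieLens). It shows that pandas renames overlapping non-key columns during a merge, and that an inner merge on movieId alone repeats each rating once for every tag on that movie:

```python
import pandas as pd

# Toy stand-ins for ratings_df and tags_df (values are made up for illustration)
ratings = pd.DataFrame({
    "userId": [1, 2],
    "movieId": [10, 10],
    "rating": [4.0, 3.0],
    "timestamp": [100, 200],
})
tags = pd.DataFrame({
    "userId": [7, 8, 9],
    "movieId": [10, 10, 10],
    "tag": ["funny", "classic", "pixar"],
    "timestamp": [300, 400, 500],
})

merged = pd.merge(ratings, tags, on="movieId", how="inner")

# 2 ratings x 3 tags on the same movie gives 6 rows
print(merged.shape)
# userId_x comes from ratings, userId_y comes from tags
print(merged.columns.tolist())
```

This row multiplication is also why the merged dataframe is much larger than the ratings file, and why a later step has to reduce it back to one row per user-movie pair.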

# Data Preparation

In [190]:
merged_df = merged_df.drop('userId_y', axis=1)
merged_df = merged_df.drop('timestamp_y', axis=1)

In [191]:
# drop duplicates from merged_df
merged_df.drop_duplicates(subset=['userId_x', 'movieId'], inplace=True)

# pivot the dataframe to have users as rows and movies as columns
pivot_df = merged_df.pivot(index='userId_x', columns='movieId', values='rating').fillna(0)

# compute the cosine similarity matrix between all users
user_similarities = cosine_similarity(pivot_df)

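I used the "drop_duplicates" method to remove any rows that have the same combination of 'userId_x' and 'movieId'. The merge with the tags file repeated each rating once per tag on the movie, so this step ensures that we only have one rating per user-movie pair in our dataset (keeping the first tag for each pair).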

Next, I pivot the DataFrame so that each user is a row and each movie is a column to make the data easier to work with during collaborative filtering.  
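The order of these two steps matters, because pivot cannot reshape a frame in which the same (user, movie) pair appears more than once. A minimal sketch on a made-up toy frame (column names follow the notebook, values are illustrative):

```python
import pandas as pd

# Toy frame with a repeated (user, movie) pair, as the tag merge can produce
df = pd.DataFrame({
    "userId_x": [1, 1, 2],
    "movieId": [10, 10, 20],
    "rating": [4.0, 4.0, 5.0],
})

# pivot() refuses to reshape when an (index, column) pair occurs more than once
try:
    df.pivot(index="userId_x", columns="movieId", values="rating")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# After de-duplication the pivot works; unrated cells become 0
deduped = df.drop_duplicates(subset=["userId_x", "movieId"])
pivot = deduped.pivot(index="userId_x", columns="movieId", values="rating").fillna(0)
print(pivot)
```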

Finally, I compute the cosine similarity matrix between all users, which measures how similar each pair of users is based on their rating vectors. Cosine similarity ranges from -1 to 1 in general, but because every entry here is non-negative (a rating, or 0 for an unrated movie), the values fall between 0 and 1. A value of 1 means two users' rating vectors point in the same direction (for example, identical ratings for all movies), and 0 means they have rated no movies in common.

By computing this similarity matrix, we can identify users who have similar tastes in movies, which can be used to make personalized recommendations. For example, if a user has not seen a particular movie but has similar tastes to another user who gave that movie a high rating, we might recommend that movie to the first user.
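As a small worked example of the similarity measure, the sketch below uses three made-up users. It compares scikit-learn's result with the textbook formula cos(a, b) = a·b / (|a| |b|) computed by hand:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Rows are users, columns are movies; 0 stands for "not rated" (toy values)
ratings = np.array([
    [5.0, 4.0, 0.0],   # user A
    [5.0, 4.0, 0.0],   # user B: same ratings as A
    [0.0, 0.0, 3.0],   # user C: no movies in common with A
])

sim = cosine_similarity(ratings)

# cos(a, c) = a.c / (|a| * |c|), computed by hand for users A and C
a, c = ratings[0], ratings[2]
by_hand = a @ c / (np.linalg.norm(a) * np.linalg.norm(c))

print(np.round(sim, 2))
```

Users A and B get a similarity of 1, while A and C get 0 because they share no rated movies. With non-negative ratings no pair can score below 0.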

In [192]:
#changing column name to user_id from userId_x
merged_df = merged_df.rename(columns={'userId_x': 'user_id'})

merged_df = merged_df.rename(columns={'timestamp_x': 'time_stamp'})

In [193]:
merged_df

Unnamed: 0,user_id,movieId,rating,time_stamp,title,genres,tag
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,pixar
3,5,1,4.0,847434962,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,pixar
6,7,1,4.5,1106635946,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,pixar
9,15,1,2.5,1510577970,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,pixar
12,17,1,4.5,1305696483,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,pixar
...,...,...,...,...,...,...,...
233204,567,176419,3.0,1525287581,Mother! (2017),Drama|Horror|Mystery|Thriller,allegorical
233207,599,176419,3.5,1516604655,Mother! (2017),Drama|Horror|Mystery|Thriller,allegorical
233210,594,7023,4.5,1108972356,"Wedding Banquet, The (Xi yan) (1993)",Comedy|Drama|Romance,In Netflix queue
233211,606,6107,4.0,1171324428,Night of the Shooting Stars (Notte di San Lore...,Drama|War,World War II


# Collaborative filtering
Collaborative filtering recommends items to a user based on the preferences of other users who are similar to them.

## Model

In [194]:
# pivot the dataframe to have users as rows and movies as columns
pivot_df = merged_df.pivot(index='user_id', columns='movieId', values='rating').fillna(0)

# compute the cosine similarity matrix between all users
user_similarities = cosine_similarity(pivot_df)

# define the user_id for which we want to provide recommendations
user_id = 1

# get the similarity scores for the target user compared to all other users
# (user_similarities is a NumPy array, so this selects the row at position user_id in pivot_df)
user_sim_scores = user_similarities[user_id]

# find the indices of the top 5 most similar users
top_users = np.argsort(-user_sim_scores)[1:6]

# get the movies that the top 5 similar users rated the highest
top_movies = pivot_df.iloc[top_users].max().sort_values(ascending=False)[:5]

# map the movie ids to movie titles
top_movie_titles = [movies_df.loc[movies_df['movieId'] == movie_id, 'title'].iloc[0] for movie_id in top_movies.index]

# print the top 5 recommended movie titles
print(top_movie_titles)

['Dead Poets Society (1989)', 'Gone Girl (2014)', 'The Butterfly Effect (2004)', 'Dark Knight, The (2008)', 'Godfather: Part II, The (1974)']


This code uses collaborative filtering to recommend movies based on the ratings of users with similar tastes. It pivots the DataFrame to have users as rows and movies as columns, computes the cosine similarity matrix between all users, and finds the 5 users most similar to the target row. For each movie it then takes the highest rating given by any of those 5 users, keeps the 5 movies with the highest values, and maps their movie ids to titles. Movies the target user has already rated are not filtered out, and the similarity row is selected by position rather than by user id, so both are worth checking before relying on the output.
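One detail that is easy to get wrong in this kind of code is the mapping between user ids and row positions, since cosine_similarity returns a plain NumPy array. The following is a minimal sketch on a made-up 3-user matrix, not the notebook's exact method: it looks up the row with pivot.index.get_loc, removes the target's own row explicitly, and scores only movies the target has not rated. The names pivot, target_user and scores are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Toy user x movie matrix; the index holds user ids, 0 means "not rated"
pivot = pd.DataFrame(
    [[5.0, 4.0, 0.0, 0.0],
     [5.0, 4.0, 5.0, 0.0],
     [0.0, 1.0, 0.0, 5.0]],
    index=pd.Index([1, 2, 3], name="user_id"),
    columns=pd.Index([10, 20, 30, 40], name="movieId"),
)
sims = cosine_similarity(pivot)

target_user = 1
row = pivot.index.get_loc(target_user)       # user id -> row position (0 here)

# Rank rows by similarity and drop the target's own row explicitly
order = np.argsort(-sims[row])
neighbor_rows = [r for r in order if r != row][:1]
neighbor_ids = pivot.index[neighbor_rows]    # row positions -> user ids

# Average the neighbors' ratings, keeping only movies the target has not rated
unseen = pivot.columns[(pivot.loc[target_user] == 0).to_numpy()]
scores = pivot.loc[neighbor_ids, unseen].mean().sort_values(ascending=False)
print(scores.head())
```

Here user 2 is the nearest neighbor of user 1, and movie 30 ranks first because user 2 rated it and user 1 has not.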

# Evaluation
I need to evaluate the above model now.

In [195]:
X_train, X_test, y_train, y_test = train_test_split(pivot_df, pivot_df[user_id], test_size=0.2, random_state=42)

In [196]:
model = NearestNeighbors(n_neighbors=5, metric='cosine')
model.fit(X_train)

predictions = []
for i in range(X_test.shape[0]):
    distances, indices = model.kneighbors(X_test.iloc[i,:].values.reshape(1, -1), n_neighbors=5)
    user_indices = indices.flatten()
    user_ratings = X_train.iloc[user_indices, user_id]
    prediction = user_ratings.mean()
    predictions.append(prediction)

mae = mean_absolute_error(y_test, predictions)
rmse = np.sqrt(mean_squared_error(y_test, predictions))
print("MAE:", mae)
print("RMSE:", rmse)

MAE: 1.4647540983606557
RMSE: 1.9706244329495421


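This code evaluates a k-nearest-neighbors rating predictor for a single movie column.

First, the users are split 80/20 into training and test sets, and a NearestNeighbors model with the cosine distance metric is fit on the training users.
Next, the code loops over the test users, finds the 5 nearest training users for each one, and uses the mean of those neighbors' ratings for the movie as the prediction.
The mean absolute error (MAE) and root mean squared error (RMSE) are calculated by comparing the predicted ratings to the values in y_test.

Two things should be kept in mind when reading these numbers. Unrated movies are stored as 0, so they count as ratings of 0 in both the predictions and the targets. Also, the target is selected by column label (pivot_df[user_id], i.e. movieId 1), while the neighbors' ratings are read by column position (iloc[..., user_id], i.e. the second column), so the two refer to different movies.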
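For comparison, a common way to set up this kind of evaluation is to hide some known ratings, predict them from the remaining data, and compare only on those hidden entries. The sketch below does this on a made-up 4-user matrix. The helper predict, the held_out pairs and k=2 are assumptions chosen for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.metrics.pairwise import cosine_similarity

# Toy user x movie ratings; 0 means "not rated" (values are made up)
full = pd.DataFrame(
    [[5.0, 4.0, 1.0],
     [5.0, 5.0, 1.0],
     [1.0, 2.0, 5.0],
     [4.0, 4.0, 2.0]],
    index=[1, 2, 3, 4], columns=[10, 20, 30],
)

# Hide a few known ratings; these (user, movie) pairs form the test set
held_out = [(1, 20), (3, 30), (4, 10)]
train = full.copy()
for user, movie in held_out:
    train.loc[user, movie] = 0.0

sims = pd.DataFrame(cosine_similarity(train), index=train.index, columns=train.index)

def predict(user, movie, k=2):
    """Mean rating of the k most similar users who rated the movie in train."""
    raters = train.index[(train[movie] > 0).to_numpy() & (train.index != user)]
    neighbors = sims.loc[user, raters].nlargest(k).index
    return train.loc[neighbors, movie].mean()

y_true = [full.loc[u, m] for u, m in held_out]
y_pred = [predict(u, m) for u, m in held_out]

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print("MAE:", mae, "RMSE:", rmse)
```

Because only real, hidden ratings are scored, the zeros that stand for "not rated" never enter the error.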



# Content Based Filtering
In content-based filtering, the system recommends items to a user based on their past behavior and the attributes of items they have previously interacted with.
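To make the idea concrete, here is a minimal sketch on a made-up three-movie catalogue (titles, genres and tags are invented for illustration). Each movie is described by its genres and tag, the text is turned into TF-IDF vectors, and movies are ranked by cosine similarity to one the user liked:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy catalogue; titles, genres and tags are made up for illustration
movies = pd.DataFrame({
    "title": ["Pirate Quest", "High Seas", "Office Laughs"],
    "genres": ["Action|Adventure", "Adventure|Romance", "Comedy"],
    "tag": ["swashbuckler", "swashbuckler", "workplace"],
})

# One text "document" per movie: genres (pipes turned into spaces) plus tag
text = movies["genres"].str.replace("|", " ", regex=False) + " " + movies["tag"]

tfidf = TfidfVectorizer().fit_transform(text)
sims = pd.DataFrame(cosine_similarity(tfidf), index=movies["title"], columns=movies["title"])

# Movies most similar to one the user liked, excluding the movie itself
liked = "Pirate Quest"
ranked = sims[liked].drop(liked).sort_values(ascending=False)
print(ranked)
```

"High Seas" ranks first because it shares a genre and a tag with the liked movie, while "Office Laughs" shares no terms and scores 0.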

## Data Preparation

In [197]:
# Merge the dataframes
content_df = pd.merge(ratings_df, movies_df, on='movieId', how='inner')
content_df = pd.merge(content_df, tags_df, on='movieId', how='inner')

In [198]:
# Remove the columns duplicated by the merge, plus the rating timestamp
content_df = content_df.drop('userId_y', axis=1)
content_df = content_df.drop('timestamp_y', axis=1)
content_df = content_df.drop('timestamp_x', axis=1)

In [199]:
#changing column name to user_id from userId_x
content_df = content_df.rename(columns={'userId_x': 'user_id'})

In [200]:
content_df = content_df.groupby(['user_id', 'title']).first().reset_index()

In [201]:
content_df

Unnamed: 0,user_id,title,movieId,rating,genres,tag
0,1,"Adventures of Robin Hood, The (1938)",940,5.0,Action|Adventure|Romance,swashbuckler
1,1,Alice in Wonderland (1951),1032,5.0,Adventure|Animation|Children|Fantasy|Musical,Disney
2,1,Alien (1979),1214,4.0,Horror|Sci-Fi,aliens
3,1,American History X (1998),2329,5.0,Crime|Drama,thought-provoking
4,1,Apocalypse Now (1979),1208,4.0,Action|Drama|War,Vietnam
...,...,...,...,...,...,...
48282,610,X-Men: The Last Stand (2006),45499,3.0,Action|Sci-Fi|Thriller,Halle Berry
48283,610,X2: X-Men United (2003),6333,4.0,Action|Adventure|Sci-Fi|Thriller,Jean Grey
48284,610,Zack and Miri Make a Porno (2008),62434,3.5,Comedy|Drama|Romance,Seth Rogen
48285,610,Zombieland (2009),71535,3.5,Action|Comedy|Horror,Bill Murray


The dataframe above now has the relevant data I'm looking for.

### Back to modeling

In [202]:
# Select the user and rating threshold
selected_user = 1
rating_threshold = 3.5

# Filter the content_df for movies rated highly by the user
user_ratings = content_df[(content_df['user_id'] == selected_user) & (content_df['rating'] >= rating_threshold)]

# Get the genres of the highly rated movies
user_genres = list(set('|'.join(user_ratings['genres']).split('|')))

# Filter the content_df for movies that belong to the same genres as the highly rated movies
relevant_movies = content_df[content_df['genres'].str.contains('|'.join(user_genres))].copy()

# Concatenate the genres and tag columns
relevant_movies['genres_and_tags'] = relevant_movies['genres'] + ' ' + relevant_movies['tag']

# Use TfidfVectorizer to convert the combined text into a matrix of TF-IDF weights
vectorizer = TfidfVectorizer()
movie_matrix = vectorizer.fit_transform(relevant_movies['genres_and_tags'])

# Calculate the cosine similarity between the movie matrix rows
similarity_matrix = cosine_similarity(movie_matrix)

# Set the cosine similarity threshold
similarity_threshold = 0.6

# Get the indices of movies whose similarity to the first row of relevant_movies meets the threshold
top_indices = [i for i, sim in enumerate(similarity_matrix[0]) if sim >= similarity_threshold]

# Randomly draw from the candidates until 5 unique titles are collected
recommended_movies = set()
while len(recommended_movies) < 5 and top_indices:
    random_index = random.choice(top_indices)
    recommended_movies.add(relevant_movies.iloc[random_index]['title'])
    top_indices.remove(random_index)

print(f"Recommended movies for user {selected_user}:")
for movie in recommended_movies:
    print(movie)

Recommended movies for user 1:
Adventures of Robin Hood, The (1938)
Mark of Zorro, The (1940)
Pirates of the Caribbean: The Curse of the Black Pearl (2003)
Captain Blood (1935)
Sinbad: Legend of the Seven Seas (2003)


In [203]:
# Illustrative hard-coded actual and predicted ratings (placeholder values, not output of the model above)
actual_ratings = [3, 4, 2, 5, 2]
predicted_ratings = [3.2, 3.8, 2.5, 4.5, 1.8]

# Calculate the MAE and RMSE
mae = mean_absolute_error(actual_ratings, predicted_ratings)
rmse = math.sqrt(mean_squared_error(actual_ratings, predicted_ratings))

# Print the results
print(f"MAE: {mae:.3f}")
print(f"RMSE: {rmse:.3f}")

MAE: 0.320
RMSE: 0.352


# Hybrid Approach

In [204]:
# Pivot table of movie ratings by users
pivot_table = pd.pivot_table(content_df, index='user_id', columns='title', values='rating')

# Replace NaN values with 0.0
pivot_table = pivot_table.fillna(0.0)

# Calculate cosine similarity between users
similarity_matrix = cosine_similarity(pivot_table)

pd.pivot_table(content_df, index='user_id', columns='title', values='rating'): This creates a pivot table of movie ratings by users, where the rows are the user IDs, the columns are the movie titles, and the values are the movie ratings.

pivot_table = pivot_table.fillna(0.0): This replaces the NaN (missing) values in the pivot table with 0.0, so 0.0 marks a movie the user has not rated.

cosine_similarity(pivot_table): This calculates the cosine similarity between users based on their movie ratings. The result is a matrix of similarity scores, where each element represents the similarity score between two users.

In [205]:
# Function to get top 5 movie recommendations for a user
def get_movie_recommendations(user_id):
    # Get the similarity scores for the user
    user_similarity_scores = similarity_matrix[user_id]

    # Sort the scores in descending order and get the top 5 similar users
    top_similar_users = user_similarity_scores.argsort()[::-1][1:6]

    # Get the movies the user has not seen yet
    unseen_movies = pivot_table.loc[user_id][pivot_table.loc[user_id] == 0.0].index

    # Get the average rating of each movie by the top 5 similar users
    movie_ratings = pivot_table.loc[top_similar_users][unseen_movies].mean()

    # Sort the ratings in descending order and get the top 5 recommended movies
    top_movies = movie_ratings.sort_values(ascending=False)[:5]

    return top_movies.index.tolist()

In [206]:
get_movie_recommendations(1)

['Shawshank Redemption, The (1994)',
 'Lord of the Rings: The Two Towers, The (2002)',
 'Animal House (1978)',
 'Terminator 2: Judgment Day (1991)',
 'Stand by Me (1986)']

def get_movie_recommendations(user_id): This defines a function that takes a user ID as input and returns a list of the top 5 recommended movies for that user.

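user_similarity_scores = similarity_matrix[user_id]: This gets the similarity scores between one user and all other users. Because similarity_matrix is a NumPy array, user_id is used here as a row position, not as a user_id label.

top_similar_users = user_similarity_scores.argsort()[::-1][1:6]: This sorts the similarity scores in descending order and gets the row positions of the top 5 similar users, skipping the first entry, which is normally the user's own row (similarity 1). The later pivot_table.loc lookups treat these positions as user_id labels, so the two only agree if positions and labels line up.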

unseen_movies = pivot_table.loc[user_id][pivot_table.loc[user_id] == 0.0].index: This gets the movies that the input user has not rated, i.e. the movies whose entry is 0.0 after the fillna step.

movie_ratings = pivot_table.loc[top_similar_users][unseen_movies].mean(): This gets the average rating of each movie by the top 5 similar users.

top_movies = movie_ratings.sort_values(ascending=False)[:5]: This sorts the movie ratings in descending order and gets the top 5 recommended movies.

return top_movies.index.tolist(): This returns the titles of the top 5 recommended movies as a list (the index of top_movies holds the movie titles).

# Hybrid Approach Evaluation

In [207]:
# Hold out 20% of the ratings, rebuild the user-similarity model on the rest,
# and check how many recommended movies the user rated highly in the held-out part.
train_df, test_df = train_test_split(content_df, test_size=0.2, random_state=42)

train_table = pd.pivot_table(train_df, index='user_id', columns='title', values='rating').fillna(0.0)
train_sim = pd.DataFrame(cosine_similarity(train_table),
                         index=train_table.index, columns=train_table.index)

def recommend_from_train(user_id, k=5):
    # Label-based lookups keep user ids and matrix rows aligned
    neighbors = train_sim[user_id].drop(user_id).nlargest(5).index
    user_row = train_table.loc[user_id]
    unseen_movies = user_row[user_row == 0.0].index
    return train_table.loc[neighbors, unseen_movies].mean().nlargest(k).index.tolist()

hits = n_recommended = n_relevant = 0

# Loop through all users that have training ratings
for user_id in train_table.index:
    user_test = test_df[test_df['user_id'] == user_id]
    # Held-out movies the user liked (reuses rating_threshold = 3.5 from above)
    relevant = set(user_test.loc[user_test['rating'] >= rating_threshold, 'title'])
    recommended = set(recommend_from_train(user_id))
    hits += len(recommended & relevant)
    n_recommended += len(recommended)
    n_relevant += len(relevant)

# Calculate the precision, recall, and F1-score
precision = hits / n_recommended if n_recommended else 0.0
recall = hits / n_relevant if n_relevant else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)