# Business Understanding



I'm a data scientist at FlixGenius, a popular movie streaming service. Our management has recently decided to invest in enhancing our recommendation engine to provide more personalized movie recommendations to our users. We've found that users are more likely to continue using our service if they receive movie recommendations that match their personal preferences.

As a data scientist, I've been tasked with building a model that provides top 5 movie recommendations to a user, based on their ratings of other movies. The model will take into account the user's past viewing history and ratings, as well as the ratings and viewing history of other users with similar preferences.

For this project, I've been provided with a dataset called MovieLens. The data comes from the GroupLens research lab at the University of Minnesota and includes user ratings of movies, as well as information about the movies themselves. My job is to use this data to build a recommendation model that will provide personalized recommendations to users.

In [329]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import NearestNeighbors

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import CountVectorizer

In [330]:
movies_df = pd.read_csv('https://gist.githubusercontent.com/MichalOst3389/e3913bbb6921ea9475660d58d280d55c/raw/3b8861ea300bbdd6b689bf853dfce94524b39301/movies.csv')
links_df = pd.read_csv('https://gist.githubusercontent.com/MichalOst3389/cfc2c59e9f323d11b7afb8f3224229f3/raw/ce13331097cbff6abcd941e8388db941220876fb/links.csv')
ratings_df = pd.read_csv('https://gist.githubusercontent.com/MichalOst3389/d8ed774a84197205b2d7e53ce8345aae/raw/064966f6d7c5f45f3aa404ca45d5ee9b9fed0ece/ratings.csv')
tags_df = pd.read_csv('https://gist.githubusercontent.com/MichalOst3389/9ff4a3740c440a3391d891af2ccac50a/raw/05572aac39b0fd8d278228158ad5a4cb20ecaa9c/tags.csv')

# Data understanding

The first step is to combine the files into a single dataset. I merge them on their common identifier, movieId.

In [331]:
# Merge datasets using movieId as the key
merged_df_1 = pd.merge(ratings_df, movies_df, on='movieId', how='inner')
merged_df = pd.merge(merged_df_1, tags_df, on='movieId', how='inner')

In [332]:
merged_df.isnull()

Unnamed: 0,userId_x,movieId,rating,timestamp_x,title,genres,userId_y,tag,timestamp_y
0,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...
233208,False,False,False,False,False,False,False,False,False
233209,False,False,False,False,False,False,False,False,False
233210,False,False,False,False,False,False,False,False,False
233211,False,False,False,False,False,False,False,False,False


In [333]:
print(merged_df.isnull().sum())

userId_x       0
movieId        0
rating         0
timestamp_x    0
title          0
genres         0
userId_y       0
tag            0
timestamp_y    0
dtype: int64


### There are no null values

In [334]:
merged_df.duplicated()

0         False
1         False
2         False
3         False
4         False
          ...  
233208    False
233209    False
233210    False
233211    False
233212    False
Length: 233213, dtype: bool

In [335]:
print(merged_df.duplicated().sum())

0


### There are no duplicated rows.

In [336]:
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 233213 entries, 0 to 233212
Data columns (total 9 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   userId_x     233213 non-null  int64  
 1   movieId      233213 non-null  int64  
 2   rating       233213 non-null  float64
 3   timestamp_x  233213 non-null  int64  
 4   title        233213 non-null  object 
 5   genres       233213 non-null  object 
 6   userId_y     233213 non-null  int64  
 7   tag          233213 non-null  object 
 8   timestamp_y  233213 non-null  int64  
dtypes: float64(1), int64(5), object(3)
memory usage: 17.8+ MB


The check above shows no duplicate rows. However, the merge produced two pairs of suffixed columns: userId_x and timestamp_x come from the ratings data, while userId_y and timestamp_y come from the tags data. Only the rating-side columns are needed for the recommender.
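To see where the suffixed columns come from, here is a small sketch on made-up data. The `ratings` and `tags` frames below are toy stand-ins for ratings_df and tags_df, not the real files. It also shows that merging on movieId alone pairs every rating of a movie with every tag on that movie, which is why the merged frame is so much longer than the ratings table.

```python
import pandas as pd

# Toy stand-ins for ratings_df and tags_df (values are made up for illustration)
ratings = pd.DataFrame({
    "userId": [1, 2],
    "movieId": [10, 10],
    "rating": [4.0, 3.0],
    "timestamp": [100, 200],
})
tags = pd.DataFrame({
    "userId": [7, 8, 9],
    "movieId": [10, 10, 10],
    "tag": ["pixar", "fun", "classic"],
    "timestamp": [300, 400, 500],
})

merged = pd.merge(ratings, tags, on="movieId", how="inner")

# Non-key columns present in both frames get suffixes:
# _x for the left frame (ratings), _y for the right frame (tags).
print(merged.columns.tolist())

# Each rating is paired with every tag on the same movie: 2 ratings x 3 tags.
print(len(merged))
```

If the tag author and tag time are not needed, dropping the _y columns, as done in the next section, is a reasonable cleanup.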

# Data Preparation

In [337]:
merged_df = merged_df.drop('userId_y', axis=1)
merged_df = merged_df.drop('timestamp_y', axis=1)

In [338]:
# drop duplicates from merged_df
merged_df.drop_duplicates(subset=['userId_x', 'movieId'], inplace=True)

# pivot the dataframe to have users as rows and movies as columns
pivot_df = merged_df.pivot(index='userId_x', columns='movieId', values='rating').fillna(0)

# compute the cosine similarity matrix between all users
user_similarities = cosine_similarity(pivot_df)

I use the "drop_duplicates" method to remove any rows that have the same combination of 'userId_x' and 'movieId'. Because the merge with the tags repeated each rating once for every tag on the same movie, this keeps the first of those rows and leaves exactly one rating per user-movie pair.

Next, I pivot the DataFrame so that each user is a row and each movie is a column to make the data easier to work with during collaborative filtering.  
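As a rough illustration of the reshaping step, the sketch below pivots a tiny long-format table into a user-by-movie matrix. The `long_df` frame and its values are made up. Note that `pivot` raises a ValueError if the same user-movie pair appears twice, which is why the duplicates have to be removed first.

```python
import pandas as pd

# Made-up long-format ratings: one row per (user, movie) pair
long_df = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3],
    "movieId": [10, 20, 10, 20, 30],
    "rating": [4.0, 5.0, 3.0, 2.0, 4.5],
})

# One row per user, one column per movie; unrated pairs become NaN, then 0
user_item = long_df.pivot(index="user_id", columns="movieId", values="rating").fillna(0)
print(user_item)
```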

Finally, I compute the cosine similarity matrix between all users. Each entry measures how similar two users' rating vectors are. Cosine similarity can in general range from -1 to 1, but because every rating here is non-negative (unrated movies are filled with 0), the values fall between 0 and 1. A value of 1 means two users' rating vectors point in the same direction, for example identical ratings for all movies, and 0 means the two users have no rated movies in common.

By computing this similarity matrix, we can identify users who have similar tastes in movies, which can be used to make personalized recommendations. For example, if a user has not seen a particular movie but has similar tastes to another user who gave that movie a high rating, we might recommend that movie to the first user.
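The following sketch, on made-up ratings, shows how cosine similarity behaves on a zero-filled rating matrix. It compares the scikit-learn result with the definition, dot(u, v) / (||u|| * ||v||), and illustrates that users with proportional ratings score 1 while users with no movies in common score 0.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Rows are users, columns are movies; 0 means "not rated" (made-up values)
ratings = np.array([
    [5.0, 4.0, 0.0, 0.0],  # user A
    [2.5, 2.0, 0.0, 0.0],  # user B: same pattern as A at half the scale
    [0.0, 0.0, 3.0, 4.0],  # user C: no movies in common with A or B
])

sim = cosine_similarity(ratings)

# The same value for A and B computed directly from the definition
u, v = ratings[0], ratings[1]
manual = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(np.round(sim, 3))
print(round(float(manual), 3))
```

One consequence worth keeping in mind is that cosine similarity ignores scale: user B rates everything lower than user A but is still treated as a perfect match.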

In [339]:
#changing column name to user_id from userId_x
merged_df = merged_df.rename(columns={'userId_x': 'user_id'})

merged_df = merged_df.rename(columns={'timestamp_x': 'time_stamp'})

In [340]:
merged_df

Unnamed: 0,user_id,movieId,rating,time_stamp,title,genres,tag
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,pixar
3,5,1,4.0,847434962,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,pixar
6,7,1,4.5,1106635946,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,pixar
9,15,1,2.5,1510577970,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,pixar
12,17,1,4.5,1305696483,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,pixar
...,...,...,...,...,...,...,...
233204,567,176419,3.0,1525287581,Mother! (2017),Drama|Horror|Mystery|Thriller,allegorical
233207,599,176419,3.5,1516604655,Mother! (2017),Drama|Horror|Mystery|Thriller,allegorical
233210,594,7023,4.5,1108972356,"Wedding Banquet, The (Xi yan) (1993)",Comedy|Drama|Romance,In Netflix queue
233211,606,6107,4.0,1171324428,Night of the Shooting Stars (Notte di San Lore...,Drama|War,World War II


# Modeling

##### Collaborative filtering
Collaborative filtering recommends items to a user based on the preferences of other users who are similar to them.

In [341]:
# pivot the dataframe to have users as rows and movies as columns
pivot_df = merged_df.pivot(index='user_id', columns='movieId', values='rating').fillna(0)

# compute the cosine similarity matrix between all users
user_similarities = cosine_similarity(pivot_df)

# define the user_id for which we want to provide recommendations
user_id = 1

# get the similarity scores for the target user compared to all other users
user_sim_scores = user_similarities[user_id]

# find the indices of the top 5 most similar users
top_users = np.argsort(-user_sim_scores)[1:6]

# get the movies that the top 5 similar users rated the highest
top_movies = pivot_df.iloc[top_users].max().sort_values(ascending=False)[:5]

# map the movie ids to movie titles
top_movie_titles = [movies_df.loc[movies_df['movieId'] == movie_id, 'title'].iloc[0] for movie_id in top_movies.index]

# print the top 5 recommended movie titles
print(top_movie_titles)

['Dead Poets Society (1989)', 'Gone Girl (2014)', 'The Butterfly Effect (2004)', 'Dark Knight, The (2008)', 'Godfather: Part II, The (1974)']


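This code uses user-based collaborative filtering to recommend movies for a target user from the ratings of users with similar tastes. It pivots the DataFrame so that users are rows and movies are columns, computes the cosine similarity between all pairs of users, and takes the 5 users most similar to the target. It then ranks movies by the highest rating any of those 5 users gave them, keeps the top 5, and maps the movie ids to movie titles. Two caveats apply. First, user_similarities[user_id] indexes by row position, so user_id = 1 selects the second row of pivot_df rather than the row labelled 1; pivot_df.index.get_loc(user_id) would give the matching position. Second, movies the target user has already rated are not filtered out, so some recommendations may be films they have already seen.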
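Below is a minimal sketch of the same idea on a made-up matrix. It looks the target user up by label rather than by position and skips movies the user has already rated. The `recommend` helper, its parameters, and the choice to average the neighbours' ratings (instead of taking the maximum) are my own assumptions for illustration, not part of the notebook above.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Made-up user-item matrix: index = user IDs, columns = movie IDs, 0 = unrated
pivot = pd.DataFrame(
    [[5.0, 4.0, 0.0, 0.0],
     [5.0, 4.0, 4.5, 0.0],
     [4.0, 5.0, 0.0, 3.0],
     [0.0, 0.0, 1.0, 5.0]],
    index=[1, 2, 3, 4],
    columns=[10, 20, 30, 40],
)

def recommend(pivot, user_id, n_neighbors=2, n_items=2):
    """Hypothetical helper: user-based top-N from the most similar users."""
    sims = cosine_similarity(pivot)
    row = pivot.index.get_loc(user_id)  # translate the user label into a row position
    scores = pd.Series(sims[row], index=pivot.index).drop(user_id)
    neighbors = scores.nlargest(n_neighbors).index
    # Average the neighbours' ratings, treating 0 as "unrated" rather than a score
    predicted = pivot.loc[neighbors].replace(0, np.nan).mean()
    # Only consider movies the target user has not rated yet
    unseen = pivot.columns[(pivot.loc[user_id] == 0).to_numpy()]
    return predicted.loc[unseen].dropna().nlargest(n_items)

top = recommend(pivot, user_id=1)
print(top)
```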

# Evaluation
I need to evaluate the above model now.

In [342]:
X_train, X_test, y_train, y_test = train_test_split(pivot_df, pivot_df[user_id], test_size=0.2, random_state=42)

In [343]:
model = NearestNeighbors(n_neighbors=5, metric='cosine')
model.fit(X_train)

predictions = []
for i in range(X_test.shape[0]):
    distances, indices = model.kneighbors(X_test.iloc[i,:].values.reshape(1, -1), n_neighbors=5)
    user_indices = indices.flatten()
    user_ratings = X_train.iloc[user_indices, user_id]
    prediction = user_ratings.mean()
    predictions.append(prediction)

mae = mean_absolute_error(y_test, predictions)
rmse = np.sqrt(mean_squared_error(y_test, predictions))

print("MAE:", mae)
print("RMSE:", rmse)

MAE: 1.4647540983606557
RMSE: 1.9706244329495421


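This code evaluates a k-nearest-neighbors predictor built on the user-item matrix.

First, the users (rows of pivot_df) are split 80/20 into training and test sets, and a NearestNeighbors model with the cosine metric is fit on the training users.
Next, the code loops over the test users, finds the 5 nearest training users for each, and averages those neighbors' ratings for one movie.
The mean absolute error (MAE) and root mean squared error (RMSE) then compare these predictions with the test users' values in the target column.

These numbers should be read with caution. Because pivot_df has movies as columns, pivot_df[user_id] selects the ratings of movieId 1 rather than anything specific to user 1, while X_train.iloc[user_indices, user_id] reads the column at position 1, which is the second movie column, so predictions and targets refer to different movies. Unrated movies are stored as 0 and count as ratings in both, and the target column is also part of the features used to find neighbors.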
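One common alternative is to hide a few known ratings, predict them from similar users, and score only those held-out values. The sketch below does this on a made-up matrix. The `held_out` pairs, the `predict` helper, and k=2 are assumptions chosen for illustration, so the resulting numbers say nothing about the MovieLens data.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.metrics.pairwise import cosine_similarity

# Made-up user-item matrix (0 = unrated)
pivot = pd.DataFrame(
    [[5.0, 4.0, 1.0, 0.0],
     [5.0, 5.0, 0.0, 1.0],
     [4.0, 4.0, 2.0, 1.0],
     [1.0, 0.0, 5.0, 4.0],
     [2.0, 1.0, 4.0, 5.0]],
    index=[1, 2, 3, 4, 5],
    columns=[10, 20, 30, 40],
)

# Hold out a few known (user, movie) ratings as the test set
held_out = [(1, 20), (3, 30), (5, 10)]
train = pivot.copy()
for user, movie in held_out:
    train.loc[user, movie] = 0.0  # hide the rating from the model

sims = pd.DataFrame(cosine_similarity(train), index=train.index, columns=train.index)

def predict(user, movie, k=2):
    """Hypothetical helper: mean rating of the k most similar users who rated the movie."""
    raters = train.index[(train[movie] > 0).to_numpy()].drop(user, errors="ignore")
    neighbors = sims.loc[user, raters].nlargest(k).index
    return train.loc[neighbors, movie].mean()

y_true = [pivot.loc[u, m] for u, m in held_out]
y_pred = [predict(u, m) for u, m in held_out]

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print("MAE:", mae)
print("RMSE:", rmse)
```

Because only real ratings are held out, the errors are on the rating scale and are not diluted by the zeros that stand for unrated movies.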



# Modeling
##### Content Based Filtering
In content-based filtering, the system recommends items to a user based on their past behavior and the attributes of items they have previously interacted with.
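A minimal sketch of this idea, assuming genres are the only item attribute: each movie's pipe-separated genre string is turned into a TF-IDF vector, and movies are compared by cosine similarity. The titles, the `similar_movies` helper, and the custom tokenizer are my own illustrative choices. The tokenizer splits on '|' because the default TfidfVectorizer pattern would break a genre such as Sci-Fi into two tokens.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# One row per movie, genres in the pipe-separated MovieLens style (titles are made up)
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "genres": ["Action|Sci-Fi", "Action|Sci-Fi|Thriller", "Comedy|Romance", "Comedy"],
})

# Split on '|' so that multi-part genres such as 'Sci-Fi' stay a single token
vectorizer = TfidfVectorizer(
    tokenizer=lambda s: s.split("|"), token_pattern=None, lowercase=False
)
genre_matrix = vectorizer.fit_transform(movies["genres"])

# Movie-by-movie similarity, labelled with titles for easy lookup
item_sim = pd.DataFrame(
    cosine_similarity(genre_matrix), index=movies["title"], columns=movies["title"]
)

def similar_movies(title, n=1):
    """Hypothetical helper: the n titles whose genre vectors are closest to `title`."""
    return item_sim[title].drop(title).nlargest(n).index.tolist()

print(similar_movies("Movie A"))
print(similar_movies("Movie D"))
```

To recommend for a user, one option is to call such a helper on the movies that user rated highly and merge the results.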

In [344]:
#creating a new dataframe called content_df
content_df = pd.DataFrame(data=merged_df[['user_id', 'title', 'rating', 'genres', 'tag']])

In [355]:
# Select a subset of the data: the most frequently rated titles and the most active users
num_movies = 5000
movie_subset = content_df['title'].value_counts().head(num_movies).index.tolist()
user_subset = content_df['user_id'].value_counts().head(1000).index.tolist()  # column is 'user_id' after the rename
content_df = content_df[content_df['title'].isin(movie_subset) & content_df['user_id'].isin(user_subset)]

# Keep one row per movie so that the similarity matrix is movie-by-movie
movie_content = content_df.drop_duplicates(subset='title').reset_index(drop=True)

# Convert the genre strings into a TF-IDF feature matrix (one row per movie)
tfidf_vectorizer = TfidfVectorizer()
movie_features = tfidf_vectorizer.fit_transform(movie_content['genres'].fillna(''))

# Compute the cosine similarities between movies (the sparse matrix can be passed directly)
cosine_similarities = cosine_similarity(movie_features)

We create a user-item matrix from the content_df dataframe, where each row represents a user and each column represents a movie. Each cell holds the rating the user gave to the movie, and missing values are filled with 0 for movies the user has not rated.

We calculate the cosine similarity matrix between the movies using the cosine_similarity function from scikit-learn.

The collaborative_filtering_recommender function takes a user ID as input and generates the top 5 movie recommendations for that user using item-based collaborative filtering. It reads the user's ratings from the user-item matrix and scores every movie by its cosine similarity to the movies the user has rated.

It then takes the indices of the 5 highest-scoring movies, uses the movie_to_idx mapping to recover their titles, looks up the titles and genres in the content_df dataframe, and prints the recommendations.
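The steps described above could look roughly like the sketch below. The ratings are made up, and this version of collaborative_filtering_recommender is a hypothetical reconstruction from the description, not the original function. It labels the similarity matrix with titles directly, so no separate movie_to_idx mapping is needed, and it scores each unseen movie by a similarity-weighted sum of the user's ratings.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Made-up long-format ratings, standing in for content_df
ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2, 3, 3, 3, 4, 4],
    "title": ["A", "B", "A", "B", "C", "A", "C", "D", "C", "D"],
    "rating": [5.0, 4.0, 4.0, 5.0, 4.0, 2.0, 5.0, 4.0, 4.0, 5.0],
})

# User-item matrix: rows = users, columns = movies, 0 = unrated
user_item = ratings.pivot_table(index="user_id", columns="title", values="rating").fillna(0)

# Item-item similarity: compare movie columns, hence the transpose
item_sim = pd.DataFrame(
    cosine_similarity(user_item.T), index=user_item.columns, columns=user_item.columns
)

def collaborative_filtering_recommender(user_id, n=5):
    """Hypothetical version: rank unseen movies by similarity-weighted ratings."""
    user_ratings = user_item.loc[user_id]
    # Unrated movies are 0, so only the user's rated movies contribute to each score
    scores = item_sim.dot(user_ratings)
    unseen = user_ratings[user_ratings == 0].index
    return scores.loc[unseen].nlargest(n).index.tolist()

print(collaborative_filtering_recommender(1, n=2))
```

The unnormalised weighted sum is used here only for ranking; it is not a predicted rating on the 0.5 to 5 scale.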