## **Movie Recommendation System**

### *Business Understanding*
#### Project Overview

The goal of this project is to develop a machine learning model that learns from viewers' historical movie consumption and then generates the most relevant recommendations based on individual preferences and interests. With advances in technology, viewers now have access to large collections of movies, which can make it difficult to decide what to watch, a scenario known as choice overload. Such viewers may pick a movie they do not like or lose interest completely, which leads to customer dissatisfaction. A movie recommendation system addresses this challenge by recommending movies that the individual is likely to enjoy based on historical user activity. This can improve customer engagement, retention and satisfaction.

#### Objectives
- Build a model that provides the top 5 movie recommendations for each individual user.
- Create a feature that will allow users to rate other movies that they have watched.
- Develop a validation strategy that ensures that the model will perform well on unseen data.

### *Data Understanding*
This project makes use of the MovieLens dataset from the GroupLens research lab at the University of Minnesota.
The MovieLens dataset contains historical movie ratings given by different viewers (~100,000 user ratings). The folder contains 4 files: links.csv, movies.csv, ratings.csv and tags.csv. Together, these files contain the following features, among others:
- User Id
- Movie Id
- Movie Title
- Movie Rating (0.5 to 5 star scale)
- Genres
- Tags

These features let the model learn each user's preferences, such as the genres of movies they like to watch, so that it can make informed decisions about which movies to recommend to each individual.
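
As a rough illustration of what a genre preference looks like in this data, the sketch below averages one user's ratings per genre. The two movies and two ratings are the first rows of movies.csv and ratings.csv; the names `toy_movies`, `toy_ratings` and `genre_profile` are hypothetical and are not used anywhere else in this notebook.

```python
from collections import defaultdict

# Toy rows copied from the first lines of movies.csv and ratings.csv.
toy_movies = {
    1: "Adventure|Animation|Children|Comedy|Fantasy",  # Toy Story (1995)
    3: "Comedy|Romance",                               # Grumpier Old Men (1995)
}
toy_ratings = [(1, 1, 4.0), (1, 3, 4.0)]  # (userId, movieId, rating)


def genre_profile(user_id, ratings, movies):
    """Return the mean rating the user gave to each genre (hypothetical helper)."""
    per_genre = defaultdict(list)
    for uid, movie_id, rating in ratings:
        if uid != user_id:
            continue
        # Genres are stored as a single pipe-separated string.
        for genre in movies[movie_id].split("|"):
            per_genre[genre].append(rating)
    return {genre: sum(vals) / len(vals) for genre, vals in per_genre.items()}


profile = genre_profile(1, toy_ratings, toy_movies)
print(profile)
```

Note that the collaborative filtering models fitted later in this notebook only consume the (userId, movieId, rating) triples; a profile like this one would matter for a content-based approach.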

### *Data Preparation*
We will use the pandas library to read the files into a format we can work with. We will then clean and preprocess the data before fitting a machine learning model. Other useful libraries include numpy for numerical operations, and matplotlib and seaborn for visualization.

In [563]:
# Loading the relevant libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


In [564]:
# Reading in the dataset
links = pd.read_csv('../data/links.csv')
print(links.head())
print('\n')
movies = pd.read_csv('../data/movies.csv')
print(movies.head())
print('\n')
ratings = pd.read_csv('../data/ratings.csv')
print(ratings.head())
print('\n')
tags = pd.read_csv('../data/tags.csv')
tags.head()

   movieId  imdbId   tmdbId
0        1  114709    862.0
1        2  113497   8844.0
2        3  113228  15602.0
3        4  114885  31357.0
4        5  113041  11862.0


   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                         Comedy|Drama|Romance  
4                                       Comedy  


   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224
3       1       47     5.0  964983815
4       1       50     5.0  964982931




   userId  movieId              tag   timestamp
0       2    60756            funny  1445714994
1       2    60756  Highly quotable  1445714996
2       2    60756     will ferrell  1445714992
3       2    89774     Boxing story  1445715207
4       2    89774              MMA  1445715200


In [565]:
# checking for missing values
print(links.isnull().sum())
print('\n')
print(movies.isnull().sum())
print('\n')
print(ratings.isnull().sum())
print('\n')
print(tags.isnull().sum())

movieId    0
imdbId     0
tmdbId     8
dtype: int64


movieId    0
title      0
genres     0
dtype: int64


userId       0
movieId      0
rating       0
timestamp    0
dtype: int64


userId       0
movieId      0
tag          0
timestamp    0
dtype: int64


In [566]:
# checking for duplicates
print(links.duplicated().sum())
print(movies.duplicated().sum())
print(ratings.duplicated().sum())
print(tags.duplicated().sum())

0
0
0
0


In [567]:
# dropping irrelevant columns (the timestamps are not used by the models below)
ratings = ratings.drop(['timestamp'], axis=1)
tags = tags.drop(['timestamp'], axis=1)
# checking the datatypes
print(links.dtypes)
print('\n')
print(movies.dtypes)
print('\n')
print(ratings.dtypes)
print('\n')
print(tags.dtypes)


movieId      int64
imdbId       int64
tmdbId     float64
dtype: object


movieId     int64
title      object
genres     object
dtype: object


userId       int64
movieId      int64
rating     float64
dtype: object


userId      int64
movieId     int64
tag        object
dtype: object


### *Modeling*

We will build several models iteratively, starting from a baseline, to find the one that best fits our data. We will use RMSE and MAE to compare models; a lower value of either metric implies a better model.
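
To make the two metrics concrete, here is a minimal sketch of how RMSE and MAE are computed from pairs of true and estimated ratings. The ratings in `toy_pairs` are made up for illustration, and the helper names are hypothetical; in the cells below the metrics come from `surprise.accuracy` instead.

```python
import math


def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    return math.sqrt(sum((true - est) ** 2 for true, est in pairs) / len(pairs))


def mae(pairs):
    """Mean absolute error over (true_rating, estimated_rating) pairs."""
    return sum(abs(true - est) for true, est in pairs) / len(pairs)


# Made-up ratings, for illustration only.
toy_pairs = [(4.0, 3.5), (5.0, 4.0), (3.0, 3.0), (2.0, 4.0)]
toy_rmse = rmse(toy_pairs)
toy_mae = mae(toy_pairs)
print(toy_rmse, toy_mae)
```

Because RMSE squares each error before averaging, the single 2-star miss in the toy data pulls it further above the MAE than the smaller errors do.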

In [568]:
# importing surprise library for building recommendation system
from surprise import Dataset, Reader
from surprise.model_selection import train_test_split, GridSearchCV, cross_validate
from surprise.prediction_algorithms import SVD,knns
from surprise import accuracy

In [569]:
# preparing the data for surprise library
ratings_data = Dataset.load_from_df(ratings,Reader(rating_scale = (0.5,5.0)))
ratings_training_data, ratings_test_data = train_test_split(ratings_data, test_size=0.3, random_state =42)

In [570]:
# getting the number of users and items in the dataset
dataset = ratings_data.build_full_trainset()
print(dataset.n_users)
print(dataset.n_items)  

610
9724


In [571]:
# checking the types of training and test data
print("ratings_training_data:", type(ratings_training_data),"\n")
print("ratings_test_data:", type(ratings_test_data))

ratings_training_data: <class 'surprise.trainset.Trainset'> 

ratings_test_data: <class 'list'>


In [572]:
# checking the size of the test set
print("length:",len(ratings_test_data))

length: 30251


In [573]:
# similarity settings for the KNN models: cosine similarity between users (user-based)
similarity_options = {'name': "cosine", "user_based": True}
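
The sketch below shows, under simplifying assumptions, what a user-based cosine similarity measures: the cosine of the angle between two users' rating vectors, restricted to the movies both have rated. The dict-based input format and the function name are hypothetical, and the library's own implementation has additional options that are not reproduced here.

```python
import math


def cosine_similarity(ratings_u, ratings_v):
    """Cosine similarity of two users over the movies both have rated.

    Each argument is a dict mapping movieId -> rating (illustrative format).
    Returns 0.0 when the users share no movies.
    """
    common = set(ratings_u) & set(ratings_v)
    if not common:
        return 0.0
    dot = sum(ratings_u[m] * ratings_v[m] for m in common)
    norm_u = math.sqrt(sum(ratings_u[m] ** 2 for m in common))
    norm_v = math.sqrt(sum(ratings_v[m] ** 2 for m in common))
    return dot / (norm_u * norm_v)


# Made-up users: user_b rates the shared movies at exactly half of user_a's ratings.
user_a = {1: 4.0, 3: 4.0, 6: 4.0}
user_b = {1: 2.0, 3: 2.0, 47: 5.0}
similarity = cosine_similarity(user_a, user_b)
print(similarity)
```

In this toy case the two users are treated as essentially identical even though one rates consistently lower, because cosine similarity ignores the overall scale of the ratings.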

In [574]:
# Building and evaluating the KNNBasic model
knn_basic_model = knns.KNNBasic(sim_options = similarity_options)
results = cross_validate(knn_basic_model, ratings_data, measures=['RMSE','MAE'], cv=5, verbose=True)  
print("Average RMSE score for the test sets:", results['test_rmse'].mean())
knn_basic_model.fit(ratings_training_data)
predictions = knn_basic_model.test(ratings_test_data)
print(accuracy.rmse(predictions))
print(accuracy.mae(predictions))

Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBasic on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9747  0.9742  0.9719  0.9749  0.9724  0.9736  0.0012  
MAE (testset)     0.7503  0.7510  0.7467  0.7509  0.7498  0.7497  0.0016  
Fit time          0.86    0.85    0.95    0.88    0.75    0.86    0.06    
Test time         3.32    2.73    2.44    2.67    2.80    2.79    0.29    
Average RMSE score for the test sets: 0.97361864403863
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 0.9805
0.9804637193602495
MAE:  0.7553
0.75525745

In [575]:
# Building and evaluating the KNNWithMeans model
knn_with_means_model = knns.KNNWithMeans(sim_options = similarity_options)
results = cross_validate(knn_with_means_model,ratings_data,measures = ['RMSE','MAE'],cv=5,verbose=True)
print("Average RMSE score for the test sets:", results['test_rmse'].mean())
knn_with_means_model.fit(ratings_training_data)
predictions = knn_with_means_model.test(ratings_test_data)
print(accuracy.rmse(predictions))
print(accuracy.mae(predictions))

Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNWithMeans on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8973  0.9005  0.8995  0.8984  0.9104  0.9012  0.0047  
MAE (testset)     0.6840  0.6901  0.6873  0.6890  0.6964  0.6894  0.0041  
Fit time          0.96    1.01    0.80    1.20    0.81    0.95    0.15    
Test time         3.51    3.36    3.17    3.25    3.16    3.29    0.13    
Average RMSE score for the test sets: 0.9012314572152217
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 0.9072
0.9072139261357854
MAE:  0.6940
0.69
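
One plausible reason KNNWithMeans scores better than KNNBasic here is that it corrects for users who rate systematically high or low. The sketch below contrasts the two estimates in simplified form: a similarity-weighted average of neighbor ratings versus the target user's mean plus a similarity-weighted average of the neighbors' deviations from their own means. The function names and neighbor values are hypothetical and made up for illustration.

```python
def knn_basic_estimate(neighbors):
    """Similarity-weighted average of the neighbors' ratings.

    neighbors: list of (similarity, neighbor_rating, neighbor_mean) tuples.
    """
    total_sim = sum(sim for sim, _, _ in neighbors)
    return sum(sim * rating for sim, rating, _ in neighbors) / total_sim


def knn_with_means_estimate(user_mean, neighbors):
    """User mean plus the weighted average of the neighbors' deviations from their means."""
    total_sim = sum(sim for sim, _, _ in neighbors)
    deviation = sum(sim * (rating - mean) for sim, rating, mean in neighbors) / total_sim
    return user_mean + deviation


# Made-up case: a strict rater (mean 3.0) whose neighbors are generous raters.
toy_neighbors = [(0.9, 5.0, 4.5), (0.6, 4.0, 4.0)]
basic = knn_basic_estimate(toy_neighbors)
with_means = knn_with_means_estimate(3.0, toy_neighbors)
print(basic, with_means)
```

The basic estimate inherits the neighbors' generous scale, while the mean-centered estimate stays close to the target user's own rating habits.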

In [576]:
# Hyperparameter tuning using GridSearchCV
parameter_grid = {"n_factors":[20,100],
                  "n_epochs":[5,10],
                  "lr_all":[0.002,0.005],
                  "reg_all":[0.4,0.6]}
gs_model = GridSearchCV(SVD,parameter_grid,measures=["RMSE","MAE"],n_jobs=-1,joblib_verbose=5,cv=3)
gs_model.fit(ratings_data)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  35 out of  48 | elapsed:   16.4s remaining:    6.0s
[Parallel(n_jobs=-1)]: Done  45 out of  48 | elapsed:   24.5s remaining:    1.5s
[Parallel(n_jobs=-1)]: Done  48 out of  48 | elapsed:   25.5s finished
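
The "48 out of 48" in the log above can be checked by hand: the grid has two values for each of four hyperparameters, and every combination is fitted once per cross-validation fold. The sketch below enumerates the combinations with `itertools.product`; the variable names are for illustration only.

```python
from itertools import product

grid = {
    "n_factors": [20, 100],
    "n_epochs": [5, 10],
    "lr_all": [0.002, 0.005],
    "reg_all": [0.4, 0.6],
}
n_folds = 3  # matches cv=3 in the grid search above

# Every combination of one value per hyperparameter.
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]
n_fits = len(combinations) * n_folds
print(len(combinations), n_fits)
```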


In [577]:
# getting the best model parameters
gs_model.best_params

{'rmse': {'n_factors': 20, 'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4},
 'mae': {'n_factors': 20, 'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4}}

In [578]:
# getting the best model score
gs_model.best_score

{'rmse': 0.8937770063205047, 'mae': 0.6920981602094599}

In [579]:
# Building and evaluating the SVD model with the best parameters
svd_model = SVD(n_factors=20, n_epochs=10, lr_all=0.005, reg_all=0.4)
svd_model.fit(ratings_training_data)
predictions = svd_model.test(ratings_test_data)
print(accuracy.rmse(predictions))

RMSE: 0.8929
0.8929004979183419
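
For intuition about what the fitted SVD model computes, the sketch below shows the usual biased matrix-factorization estimate: the global mean rating, plus a user bias, plus a movie bias, plus the dot product of the user and movie factor vectors (whose length is `n_factors`), clipped to the rating scale. All numbers are made up and `svd_estimate` is a hypothetical helper, not part of the library.

```python
def svd_estimate(global_mean, user_bias, item_bias, user_factors, item_factors,
                 rating_scale=(0.5, 5.0)):
    """Biased matrix-factorization estimate, clipped to the rating scale."""
    interaction = sum(p * q for p, q in zip(user_factors, item_factors))
    raw = global_mean + user_bias + item_bias + interaction
    low, high = rating_scale
    return min(high, max(low, raw))


# Made-up parameters with two latent factors, for illustration only.
toy_estimate = svd_estimate(3.5, 0.3, 0.2, [0.5, -0.2], [0.4, 0.1])
clipped_estimate = svd_estimate(3.5, 1.0, 1.0, [1.0, 1.0], [1.0, 1.0])
print(toy_estimate, clipped_estimate)
```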


Making a prediction for a single user and movie

In [580]:
# predicted rating for user 1 on movie 300
user1item300 = svd_model.predict(uid=1,iid=300)
user1item300.est

4.100393827441175
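
Single predictions like the one above can be turned into the top 5 recommendations named in the objectives by scoring every movie the user has not rated yet and keeping the highest estimates. The sketch below is one way to do this, not the notebook's final implementation: `top_n_for_user` is a hypothetical helper, and `estimate` is any callable returning a predicted rating. With the model above it could be `lambda uid, iid: svd_model.predict(uid, iid).est`; a made-up lookup table stands in for it here so the example runs on its own.

```python
import heapq


def top_n_for_user(user_id, candidate_movie_ids, already_rated, estimate, n=5):
    """Return the n unseen movies with the highest estimated rating.

    estimate: callable (user_id, movie_id) -> float (assumed interface).
    Returns a list of (movie_id, estimated_rating), best first.
    """
    unseen = (m for m in candidate_movie_ids if m not in already_rated)
    scored = ((estimate(user_id, m), m) for m in unseen)
    return [(movie_id, score) for score, movie_id in heapq.nlargest(n, scored)]


# Made-up estimates standing in for a fitted model.
fake_scores = {10: 4.5, 20: 3.0, 30: 4.9, 40: 2.0, 50: 4.0, 60: 3.7, 70: 4.2}
top5 = top_n_for_user(
    user_id=1,
    candidate_movie_ids=fake_scores,
    already_rated={30},
    estimate=lambda uid, movie_id: fake_scores[movie_id],
)
print(top5)
```

Movie 30 has the highest made-up score but is excluded because the user has already rated it.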