# Movie Recommendation System

---

## Overview
The objective of this project is to build a movie recommendation system that provides personalized recommendations to users based on their ratings of other movies. The system will utilize machine learning algorithms to analyze user ratings and similarities between movies to generate a list of top 5 movie recommendations for each user.

---


## Business understanding

In the age of streaming services and an abundance of movie choices, customers often struggle to find films that match their preferences. A movie recommendation system addresses this problem by using user ratings to generate individualized movie recommendations. The target audience for this project is companies that provide movie streaming services, such as Netflix, Amazon Prime Video, or Hulu, which can use recommendation systems to increase customer engagement and retention. By applying machine learning techniques to user data, these businesses can offer personalized suggestions that adapt to their audience's varied preferences, supporting growth and competitive advantage in the entertainment industry.

---

## Data Understanding

### Source:
This dataset was obtained from GroupLens: https://grouplens.org/datasets/movielens/latest/

### Details on the data set:

This dataset (ml-latest-small) describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 100836 ratings and 3683 tag applications across 9742 movies. These data were created by 610 users between March 29, 1996 and September 24, 2018. This dataset was generated on September 26, 2018.

Users were selected at random for inclusion. All selected users had rated at least 20 movies. Each user is represented by an id, and no other information is provided.

The data are contained in the following files: **links.csv**, **movies.csv**, **ratings.csv** and **tags.csv**. 

**Ratings Data File Structure (ratings.csv):** All ratings are contained in the file ratings.csv. Each line of this file after the header row represents one rating of one movie by one user, and has the following format:

userId,movieId,rating,timestamp

The lines within this file are ordered first by userId, then, within user, by movieId.

Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).



**Tags Data File Structure (tags.csv):** All tags are contained in the file tags.csv. Each line of this file after the header row represents one tag applied to one movie by one user, and has the following format:

userId,movieId,tag,timestamp

The lines within this file are ordered first by userId, then, within user, by movieId.


**Movies Data File Structure (movies.csv):** Movie information is contained in the file movies.csv. Each line of this file after the header row represents one movie, and has the following format:

movieId,title,genres

Movie titles are entered manually or imported from https://www.themoviedb.org/, and include the year of release in parentheses.



**Links Data File Structure (links.csv):** Identifiers that can be used to link to other sources of movie data are contained in the file links.csv. Each line of this file after the header row represents one movie, and has the following format:

movieId,imdbId,tmdbId


### Description of columns
**User Ids:** MovieLens users were selected at random for inclusion. Their ids have been anonymized. User ids are consistent between ratings.csv and tags.csv (i.e., the same id refers to the same user across the two files).

**Movie Ids:** Only movies with at least one rating or tag are included in the dataset. Movie ids are consistent between ratings.csv, tags.csv, movies.csv, and links.csv (i.e., the same id refers to the same movie across these four data files).

**Timestamps:** represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.

**Tags:** are user-generated metadata about movies. Each tag is typically a single word or short phrase. The meaning, value, and purpose of a particular tag is determined by each user.

**Genres:** are a pipe-separated list, and are selected from the following:

Action, Adventure, Animation, Children's, Comedy, Crime, Documentary, Drama, Fantasy, Film-Noir, Horror, Musical, Mystery, Romance, Sci-Fi, Thriller, War, Western, (no genres listed)
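
The timestamp and genre encodings described above can be decoded with a few lines of standard-library Python. This is a minimal sketch rather than part of the original notebook; the helper names `to_utc` and `split_genres` are hypothetical, and the sample values are taken from the first rows of ratings.csv and movies.csv shown later in this document.

```python
from datetime import datetime, timezone


def to_utc(timestamp):
    """Convert seconds since the Unix epoch into a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)


def split_genres(genres):
    """Split a pipe-separated genre string into a list of genre names.

    The placeholder '(no genres listed)' is mapped to an empty list.
    """
    if genres == "(no genres listed)":
        return []
    return genres.split("|")


# First rating in ratings.csv (userId 1, movieId 1)
print(to_utc(964982703).isoformat())

# Genres of Toy Story (1995) in movies.csv
print(split_genres("Adventure|Animation|Children|Comedy|Fantasy"))
```

With pandas, the same conversions would typically be written as `pd.to_datetime(ratings['timestamp'], unit='s', utc=True)` and `movies['genres'].str.split('|')`.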


### Citation
F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. https://doi.org/10.1145/2827872

---



## Importing necessary libraries for the project

In [14]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import sparse 
from surprise import Reader, Dataset
from sklearn.metrics.pairwise import cosine_similarity
from surprise.model_selection import cross_validate
from surprise.prediction_algorithms import SVD
from surprise.prediction_algorithms import KNNWithMeans, KNNBasic, KNNBaseline
from surprise.model_selection import GridSearchCV

## Loading the data sets

In [2]:
ratings = pd.read_csv("ml-latest-small/ratings.csv")
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [3]:
links = pd.read_csv("ml-latest-small/links.csv")
links.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [4]:
movies = pd.read_csv("ml-latest-small/movies.csv")
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [5]:
tags = pd.read_csv("ml-latest-small/tags.csv")
tags.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


## Exploratory data analysis and data pre-processing

### Ratings dataset

In [6]:
ratings.shape

(100836, 4)

In [7]:
# Checking for null values in the dataset
ratings.isnull().sum()

userId       0
movieId      0
rating       0
timestamp    0
dtype: int64

The ratings csv file has 4 columns and 100,836 rows, none of which have any null values.

In [8]:
# Drop unnecessary columns
ratings = ratings.drop(columns='timestamp')
ratings.head()

Unnamed: 0,userId,movieId,rating
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0
3,1,47,5.0
4,1,50,5.0


### Links dataset

In [33]:
links.shape

(9742, 3)

In [34]:
# Checking for null values in the dataset
links.isnull().sum()

movieId    0
imdbId     0
tmdbId     8
dtype: int64

The links csv file has 3 columns and 9,742 rows. The movieId and imdbId columns have no null values, while tmdbId has 8.

### Movies dataset

In [35]:
movies.shape

(9742, 3)

In [36]:
# Checking for null values in the dataset
movies.isnull().sum()

movieId    0
title      0
genres     0
dtype: int64

The movies csv file has 3 columns and 9,742 rows, none of which have any null values.

### Tags dataset

In [37]:
tags.shape

(3683, 4)

In [38]:
# Checking for null values in the dataset
tags.isnull().sum()

userId       0
movieId      0
tag          0
timestamp    0
dtype: int64

The tags csv file has 4 columns and 3,683 rows, none of which have any null values.

In [12]:
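# Reader() defaults to rating_scale=(1, 5); MovieLens ratings start at 0.5,
# so Reader(rating_scale=(0.5, 5)) would describe this data more precisely.
reader = Reader()
data = Dataset.load_from_df(ratings,reader)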

In [13]:
# Finding out how many users and items that are in the dataset.
dataset = data.build_full_trainset()
print('Number of users: ', dataset.n_users, '\n')
print('Number of items: ', dataset.n_items)

Number of users:  610 

Number of items:  9724
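
These counts show how sparse the rating matrix is. The short calculation below is an added illustration, not a cell from the original notebook; it uses only the figures reported above (100,836 ratings, 610 users, 9,724 rated movies).

```python
# Figures reported earlier in this notebook
n_ratings = 100836
n_users = 610
n_items = 9724

# Every user could in principle rate every movie
possible_pairs = n_users * n_items

# Fraction of the user-movie matrix that actually holds a rating
density = n_ratings / possible_pairs

print(f"Possible user-movie pairs: {possible_pairs}")
print(f"Observed ratings: {n_ratings} ({density:.2%} of the matrix)")
```

Roughly 98% of the matrix is empty, which is why the models below learn from observed ratings only rather than from a dense matrix.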


## Determining the best model

### SVD
Singular value decomposition (SVD) is a matrix factorization technique that can be used to decompose a matrix into its constituent parts. In the context of recommendation systems, SVD can be used to decompose a user-item matrix into the product of three matrices: a user matrix, a diagonal matrix of singular values, and an item matrix. These matrices can then be used to make recommendations to users.
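
The `SVD` class in Surprise does not compute an exact three-matrix decomposition. It follows the Funk-style approach, learning user factors, item factors and bias terms by stochastic gradient descent on the observed ratings, and predicting with `mu + b_u + b_i + q_i · p_u`. The sketch below illustrates that idea in plain Python under simplifying assumptions. The function name `train_funk_svd`, the hyperparameter values and the toy ratings are all hypothetical and are not taken from the notebook or from Surprise's source code.

```python
import random


def train_funk_svd(ratings, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit r_hat(u, i) = mu + b_u[u] + b_i[i] + dot(p[u], q[i]) by SGD.

    ratings: list of (user, item, rating) tuples. Returns a predict(u, i) function.
    """
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    b_u = {u: 0.0 for u in users}  # user biases
    b_i = {i: 0.0 for i in items}  # item biases
    # small random latent factors for every user and item
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    for _ in range(n_epochs):
        for u, i, r in ratings:
            pred = mu + b_u[u] + b_i[i] + sum(a * b for a, b in zip(p[u], q[i]))
            err = r - pred
            # gradient steps with L2 regularisation
            b_u[u] += lr * (err - reg * b_u[u])
            b_i[i] += lr * (err - reg * b_i[i])
            for f in range(n_factors):
                pu, qi = p[u][f], q[i][f]
                p[u][f] += lr * (err * qi - reg * pu)
                q[i][f] += lr * (err * pu - reg * qi)

    def predict(u, i):
        # unknown users or items fall back to whichever terms are known
        est = mu + b_u.get(u, 0.0) + b_i.get(i, 0.0)
        if u in p and i in q:
            est += sum(a * b for a, b in zip(p[u], q[i]))
        return est

    return predict


# Toy data: movies "A" and "B" are liked, movie "C" is disliked
toy = [(1, "A", 5.0), (1, "B", 4.0), (2, "A", 4.5),
       (2, "C", 1.0), (3, "B", 4.0), (3, "C", 1.5)]
predict = train_funk_svd(toy)
print(round(predict(1, "C"), 2))  # estimate for a pair with no observed rating
```

In this sketch, `n_factors` and `reg` play the same roles as the `n_factors` and `reg_all` arguments tuned later in this section.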



In [39]:
#Instantiate the SVD model
svd = SVD()

In [40]:
# Run 5-fold cross-validation, reporting RMSE and MAE for each fold
cv_svd = cross_validate(svd, data, cv=5, n_jobs=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8839  0.8691  0.8769  0.8729  0.8738  0.8753  0.0050  
MAE (testset)     0.6763  0.6692  0.6749  0.6690  0.6714  0.6721  0.0029  
Fit time          5.25    5.15    4.86    4.04    3.56    4.57    0.66    
Test time         0.53    0.57    0.60    0.33    0.26    0.46    0.14    
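
For reference, the two error metrics in this table can be computed by hand. The sketch below is an added illustration with made-up (true, estimated) rating pairs; the helper names `rmse` and `mae` are hypothetical. It also recomputes the Mean and Std columns of the RMSE row from the five fold values printed above.

```python
import math
import statistics


def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    return math.sqrt(sum((t - e) ** 2 for t, e in pairs) / len(pairs))


def mae(pairs):
    """Mean absolute error over (true_rating, estimated_rating) pairs."""
    return sum(abs(t - e) for t, e in pairs) / len(pairs)


# Made-up predictions, purely for illustration
pairs = [(4.0, 3.5), (5.0, 4.0), (3.0, 3.5), (2.0, 3.0)]
print(f"RMSE = {rmse(pairs):.4f}, MAE = {mae(pairs):.4f}")

# RMSE per fold, copied from the cross-validation table above
fold_rmse = [0.8839, 0.8691, 0.8769, 0.8729, 0.8738]
fold_mean = statistics.mean(fold_rmse)
fold_std = statistics.pstdev(fold_rmse)  # population standard deviation
print(f"mean = {fold_mean:.4f}, std = {fold_std:.4f}")
```

RMSE squares each error before averaging, so it penalises large misses more heavily than MAE does, which is why the RMSE values are consistently higher.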


In [44]:
## Performing a grid search over the number of factors and the regularization rate for SVD

params = {'n_factors': [20, 50, 100],
         'reg_all': [0.02, 0.05, 0.1]}
g_s_svd = GridSearchCV(SVD,param_grid=params,n_jobs=-1)
g_s_svd.fit(data)

In [45]:
print(g_s_svd.best_score)
print(g_s_svd.best_params)

{'rmse': 0.8692617749287408, 'mae': 0.6677655734870045}
{'rmse': {'n_factors': 100, 'reg_all': 0.05}, 'mae': {'n_factors': 20, 'reg_all': 0.02}}
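
A grid search evaluates every combination of the listed values. The sketch below is an added illustration of how the `params` dictionary expands into candidate settings; the helper `expand_grid` is hypothetical and is not part of Surprise.

```python
from itertools import product

params = {'n_factors': [20, 50, 100],
          'reg_all': [0.02, 0.05, 0.1]}


def expand_grid(grid):
    """Return one dict per combination of the parameter values in grid."""
    keys = list(grid)
    return [dict(zip(keys, values))
            for values in product(*(grid[k] for k in keys))]


combos = expand_grid(params)
print(len(combos), "candidate settings")
for combo in combos:
    print(combo)
```

Each candidate is cross-validated and scored, so the best setting can differ by metric. Here RMSE favours `n_factors=100, reg_all=0.05`, while MAE favours `n_factors=20, reg_all=0.02`.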


### KNN
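K-nearest neighbors (KNN) is a neighborhood-based approach. To predict a rating, it finds the K users (or items) most similar to the target, as measured by a similarity metric such as Pearson correlation, and combines their ratings. The models below are user-based and use Pearson similarity.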
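
As an illustration of the idea, here is a minimal user-based KNN sketch in plain Python. It is written in the spirit of a basic KNN recommender and is not Surprise's implementation. The helpers `pearson` and `knn_predict` and the toy ratings are hypothetical.

```python
import math


def pearson(ratings_a, ratings_b):
    """Pearson correlation over the items both users rated (0.0 if undefined)."""
    common = sorted(set(ratings_a) & set(ratings_b))
    if len(common) < 2:
        return 0.0
    a = [ratings_a[i] for i in common]
    b = [ratings_b[i] for i in common]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in a)
                    * sum((y - mean_b) ** 2 for y in b))
    return num / den if den else 0.0


def knn_predict(user, item, data, k=2):
    """Similarity-weighted mean of the k most similar users' ratings of item."""
    neighbours = [(pearson(data[user], other), other[item])
                  for name, other in data.items()
                  if name != user and item in other]
    # keep only positively correlated neighbours, most similar first
    top = sorted((n for n in neighbours if n[0] > 0), reverse=True)[:k]
    if not top:
        return None  # no usable neighbour rated this item
    return sum(sim * r for sim, r in top) / sum(sim for sim, _ in top)


# Toy ratings: user -> {movie: rating}
data = {
    "ann": {"A": 5, "B": 3, "C": 4},
    "bob": {"A": 4, "B": 2, "C": 3, "D": 5},
    "cat": {"A": 1, "B": 5, "C": 2, "D": 1},
    "dan": {"A": 5, "B": 3, "C": 5, "D": 4},
}
print(knn_predict("ann", "D", data))
```

In this toy data, "cat" has tastes opposite to "ann" (negative correlation), so that user is ignored and the estimate for movie "D" comes from "bob" and "dan".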

In [46]:
# cross validating with KNNBasic
knn_basic = KNNBasic(sim_options={'name':'pearson', 'user_based':True})
cv_knn_basic = cross_validate(knn_basic, data, n_jobs=-1)

In [18]:
for i in cv_knn_basic.items():
    print(i)
print('-----------------------')
print(np.mean(cv_knn_basic['test_rmse']))

('test_rmse', array([0.96688093, 0.97700004, 0.9714279 , 0.97173629, 0.9776347 ]))
('test_mae', array([0.74804563, 0.75345248, 0.74727763, 0.75238314, 0.7557333 ]))
('fit_time', (2.5239949226379395, 2.4369943141937256, 2.225999116897583, 1.8369674682617188, 1.4569990634918213))
('test_time', (4.955236911773682, 5.026327848434448, 5.126237630844116, 5.378239393234253, 1.8388299942016602))
-----------------------
0.9729359714073269


In [19]:
# cross validating with KNNBaseline
knn_baseline = KNNBaseline(sim_options={'name':'pearson', 'user_based':True})
cv_knn_baseline = cross_validate(knn_baseline,data)

Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.


In [20]:
for i in cv_knn_baseline.items():
    print(i)

np.mean(cv_knn_baseline['test_rmse'])

('test_rmse', array([0.8802581 , 0.87202833, 0.88052622, 0.87296949, 0.87910178]))
('test_mae', array([0.67164352, 0.66476771, 0.6706068 , 0.66642034, 0.67376557]))
('fit_time', (1.9525933265686035, 3.2988760471343994, 2.2432944774627686, 3.886241912841797, 1.6895074844360352))
('test_time', (6.854063987731934, 4.462336540222168, 4.400030136108398, 3.2341959476470947, 3.4076898097991943))


0.8769767873012702

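Based on these outputs, the best-performing model is SVD. The tuned SVD reached a cross-validated RMSE of about 0.869, compared with about 0.877 for KNNBaseline and 0.973 for KNNBasic. The grid search found its lowest RMSE at n_factors = 100 with a regularization rate (reg_all) of 0.05; the model fitted in the next section uses n_factors = 50 with the same regularization rate.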

## Making recommendations

In [21]:
#Fit the SVD model
svd = SVD(n_factors= 50, reg_all=0.05)
svd.fit(dataset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x2484f963ac0>

In [22]:
#Simple prediction
svd.predict(2, 4)

Prediction(uid=2, iid=4, r_ui=None, est=3.032774878466421, details={'was_impossible': False})

In [26]:
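def movie_rater(movies, num, genre=None):
    # Collect `num` ratings interactively for a new user with a placeholder id
    userID = 1000
    rating_list = []
    while num > 0:
        if genre:
            # sample one random movie whose genre string contains `genre`
            movie = movies[movies['genres'].str.contains(genre)].sample(1)
        else:
            movie = movies.sample(1)
        print(movie)
        rating = input('How do you rate this movie on a scale of 1-5, press n if you have not seen :\n')
        if rating == 'n':
            # skip unseen movies without counting them towards `num`
            continue
        else:
            # store the rating as a number rather than the raw input string
            rating_one_movie = {'userId':userID,'movieId':movie['movieId'].values[0],'rating':float(rating)}
            rating_list.append(rating_one_movie)
            num -= 1
    return rating_list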

In [27]:
user_rating = movie_rater(movies, 4, 'Comedy')

      movieId                   title  genres
4767     7093  Front Page, The (1974)  Comedy
      movieId                                       title  genres
3256     4402  Dr. Goldfoot and the Bikini Machine (1965)  Comedy
      movieId                      title  genres
6608    55729  King of California (2007)  Comedy
     movieId             title          genres
106      122  Boomerang (1992)  Comedy|Romance


## Making Predictions With the New Ratings

In [28]:
## add the new ratings to the original ratings DataFrame
user_ratings = pd.DataFrame(user_rating)
new_ratings_df = pd.concat([ratings, user_ratings], axis=0)
new_data = Dataset.load_from_df(new_ratings_df,reader)

In [29]:
# train a model using the new combined DataFrame
svd_ = SVD(n_factors= 50, reg_all=0.05)
svd_.fit(new_data.build_full_trainset())

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x2484f6f42e0>

In [30]:
# make predictions for the user

# .est holds the estimated rating of each Prediction
list_of_movies = [(m_id, svd_.predict(1000, m_id).est) for m_id in ratings['movieId'].unique()]

In [31]:
# order the predictions from highest to lowest rated
ranked_movies = sorted(list_of_movies, key=lambda x:x[1], reverse=True)
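
Sorting the full list works, but when only a handful of recommendations are needed, `heapq.nlargest` avoids ordering every prediction. The sketch below is an optional variation, not the notebook's method. It also filters out movies the user has already rated, which the cell above does not do. The helper `top_n` and the sample predictions are hypothetical.

```python
import heapq


def top_n(predictions, n=5, exclude=()):
    """Return the n (movie_id, estimate) pairs with the highest estimates.

    predictions: iterable of (movie_id, estimated_rating) pairs.
    exclude: movie ids to leave out, e.g. movies the user already rated.
    """
    seen = set(exclude)
    candidates = ((m_id, est) for m_id, est in predictions if m_id not in seen)
    return heapq.nlargest(n, candidates, key=lambda pair: pair[1])


# Made-up predictions, purely for illustration
sample_predictions = [(1, 3.9), (2, 4.6), (3, 4.1), (4, 4.8), (5, 2.5)]
print(top_n(sample_predictions, n=2, exclude=[4]))
```

Applied to this notebook, the call would look like `top_n(list_of_movies, n=5, exclude=user_ratings['movieId'])`, assuming those variables are defined as in the cells above.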

In [32]:
# return the top n recommendations using the following function
def recommended_movies(user_ratings, movie_title_df, n):
    # user_ratings is a list of (movieId, estimated rating) pairs, best first
    for idx, rec in enumerate(user_ratings[:n]):
        # .iloc[0] extracts the title string instead of printing the whole Series
        title = movie_title_df.loc[movie_title_df['movieId'] == int(rec[0]), 'title'].iloc[0]
        print('Recommendation # ', idx+1, ': ', title, '\n')

recommended_movies(ranked_movies, movies, 5)

Recommendation #  1 :  Philadelphia Story, The (1940)

Recommendation #  2 :  Lawrence of Arabia (1962)

Recommendation #  3 :  Guess Who's Coming to Dinner (1967)

Recommendation #  4 :  Princess Bride, The (1987)

Recommendation #  5 :  Streetcar Named Desire, A (1951)

