### Project Title: Customer Recommendation System
#### Business Understanding: 
The movie data is available from Netflix and is available at the [Kaggle website](https://www.kaggle.com/datasets/netflix-inc/netflix-prize-data) as part Netflix prize. The business expects to use the [SURPRISE](https://surprise.readthedocs.io/en/stable/index.html)/ and other recommendation methodologies to develop a customer recommendation system. Compare various algorithms and systems must be able to recommend movies that the customer has not watched earlier. Also, tune the algorithm with the available parameters to minimize the error. 

#### Business Goal
The Business goal is to construct a recommendation system based on movie data. The system shall provide movie recommendations to customers, compare different algorithms, and select the best algorithm for recommendation with the least error. 

#### Data Understanding:

The movie rating data is downloaded from [Kaggle website](https://www.kaggle.com/datasets/netflix-inc/netflix-prize-data), The movie rating files contain over 100 million ratings from 480 thousand randomly chosen, anonymous customers over 17 thousand movie titles. The data were collected between October 1998 and December 2005 and reflect the distribution of all ratings received during this period. The dataset was trimmed to support the local execution of algorithms, and all predictions and comparisons/ tuning are done on the trimmed dataset, this may compromise on error measure/ accuracy of the algorithms.   

The rating data is provided by Netflix, the dataset contains all the movie ratings ranging from 1-5 provided by users. Also, a separate file is provided which contains the movie title/ID for a user.

The consolidated file  data attributes are: 

|Feature Name  | Description                                                | Feature Type  |
|------------- |------------------------------------------------------------|---------------|
|MovieID       | A unique number for a Movie                                | Integer       |
|CustomerID    | A unique number given to identify a customer               | Integer       |                          
|Title         | Movie title                                                | Categorical   |
|YearOfRelease | Movie release date                                         | Date          |
|Ratings       | Movie ratings on a five-star (integral) scale from 1 to 5  | Integer       |
|Rating Date   | Movie rating date by customer                              | Date          |




##### References :
1. [Kaggle website](https://www.kaggle.com/datasets/netflix-inc/netflix-prize-data)
2. [SURPRISE](https://surprise.readthedocs.io/en/stable/index.html)
3. [Keras](https://keras.io/)




### Table of Content
***

1. [Import Libraries](#1-import-libraries)

2. [Data Analysis & Preparation](#2-data-analysis--preparation) 
    - 2.1 -  [Convert Data Format](#21-convert-data-format-origianally-provided-by-netflix)
    - 2.2 -  [Read Movie Title Dataset ](#22-load-the-title-movie-dataset)
    - 2.3 -  [Load Training Dataset](#23-load-training-dataset-provided-by-netflix)
    - 2.4 -  [Data Formatting](#24-data-formatting)
    - 2.5 -  [Data Information](#25--dataset-information)
    - 2.6 -  [Prepare Dataset For SURPRiSE Algorithm](#26-create-dataset-for-surprise-algorithms)
    - 2.7 -  [Upload Dataset using SURPRISE Methods](#27-create-test-train-split-using-surprise-methods)
    - 2.8 -  [Input data preparation for Neural Layer](#28-create-the-dataset-for-neural-layers)
    
3. [Recommendation System Algorithm ](#3-compare-the-recommendation-system-algorithms)
    - 3.1 - [SVD Algorithm - Matrix Factorization](#31-svd-algorithm---matrix-factorization)
        - 3.1.1 - [Execute SVD & Calculate Scores using CV](#311-svd--calculate-mean-and-mae-score-with-default-parameter-and-cv)
        - 3.1.2 - [Optimize SVD using GridSearch](#312-optimize-svd)
    - 3.2 - [SVD++ Algorithm - Matrix Factorization](#32-svd---matrix-factorization)
        - 3.2.1 - [Execute SVD++ & Calculate Scores using CV](#321-svd--calculate-mean-and-mae-score-with-default-parameter-and-cv)
        - 3.2.2 - [Optimize SVD++ using GridSearch](#322-optimize-svd)
    - 3.3 - [NMF Algorithm](#33-nmf---matrix-factorization)
        - 3.3.1 - [Execute NMF & Calculate Scores using CV](#331-nmf--calculate-mean-and-mae-score-with-default-parameter-and-cv)
        - 3.3.2 - [Optimize NFM using GridSearch](#332-optimize-nmf)
    - 3.4 - [SlopeOne Algorithm](#34-slopeone---collaborative-filtering-algorithm)
        - 3.4.1 - [Execute SlopeOne & Calculate Scores using CV](#341-slopeone--calculate-mean-and-mae-score-with-default-parameter-and-cv)
    - 3.5 - [Co-Clustering Algorithm ](#35-co-clustering---collaborative-filtering-algorithm)
        - 3.5.1 - [Execute NMF & Calculate Scores using CV](#351-co-clustering--calculate-mean-and-mae-score-with-default-parameter-and-cv)
        - 3.5.2 - [Optimize Co-Clustering using GridSearch](#352-optimize-co-clustering)
    - 3.6 - [KNN - Nearest Neighbor Approach](#36-knn---nearest-neighbour-approach)
        - 3.6.1 - [Execute KNNBasic & Calculate Scores using CV](#361-knnbasic--basic-nearest-neighbors-approach)
        - 3.6.2 - [Optimize KNNBasic using GridSearch](#362-optimize-the-knn-basic)
        - 3.6.3 - [Execute KNNWithMean & Calculate Scores using CV](#363-knnwithmeans---a-basic-collaborative-filtering-algorithm)
        - 3.6.4 - [Optimize KNNwithMeans using GridSearch](#364---optimize-the-knnwithmeans-algorithm)
        - 3.6.5 - [Execute KNN Baseline & Calculate Scores using CV](#365-knn-baseline--basic-collaborative-filtering-algorithm-taking-into-account-a-baseline-rating)
        - 3.6.6 - [Optimize KNNwithMeans using GridSearch](#366-optimize-knn-baseline)
    - 3.7 - [Normal Predictor - Basic Algorithm](#37-normalpredictor---basic-random-rating-based-algorithm)
    - 3.8 - [Execute Neural layer](#38-neural-network-approach)
    - 3.9 - [Compare Performance of Algorithms](#39-compare-performance-scoreexecution-timebest-parameters)
4. [Recommendation Results](#40-customer-cluster-analysis)
    - 4.1 - [Top N Movies Recommendation for a User ](#41-return-top-movie-recommendation-for-users)
    - 4.2 - [Top 10 Nearest Neighbor For A Movie ](#42-nearest-neighbour-for-given-movie)
    - 4.3 - [Top 10 Nearest Neighbor For A User](#43-nearest-neighbour-for-given-customer)
5. [Conclusion](#50-conclusion)

### 1. Import Libraries

In [2]:
## Import the required libraries
import io
import numpy as np
import pandas as pd
from pandas import to_datetime
import plotly.express as px
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from surprise import Dataset, Reader, SVD, accuracy
from surprise.model_selection import cross_validate
from surprise.model_selection import GridSearchCV
from collections import defaultdict
from surprise import Dataset, SVD, SVDpp, NMF, SlopeOne, KNNBasic, KNNWithMeans, KNNBaseline, CoClustering, NormalPredictor, BaselineOnly
import keras
import keras.layers
import keras.losses
import keras.metrics
import tensorflow as tf 


### 2. Data Analysis & Preparation

#### 2.1 Convert Data Format Origianally Provided by Netflix

In [3]:
## Format converter  
## This code is used one time to convert the format of training datsset provided by Netflix.
## Uncomment the part of code with single '#' if data set has to be pointed to original training file 

## Import the CSV module 
# import csv

## Open the file from local, read the movie ID, append to rest of the data and write to local output file. 
 
# with open('combined_data_1.txt') as f:
#    csv_reader = csv.reader(f, delimiter=',') # w use, reader method to read csv
#    head_count = 1
#    for row in csv_reader:
#        if row[0] == f'{head_count}:':
#            print('Headcount', head_count)
#            head_count += 1
#        else:            
#            row.append(f'{head_count-1}') 
#            x = ",".join(row)
#            with open('combinedata1.txt','a') as f:
#                f.write(x+'\n')

#### 2.2 Load the Title Movie Dataset 

In [4]:
## Upload the movie title dataset 

# Use read_csv function to load the movie title form  
df_title = pd.read_csv('movie_titles.csv')

# print the dataset  
df_title

Unnamed: 0,MovieID,YearOfRelease,Title
0,1,2003.0,Dinosaur Planet
1,2,2004.0,Isle of Man TT 2004 Review
2,3,1997.0,Character
3,4,1994.0,Paula Abdul's Get Up & Dance
4,5,2004.0,The Rise and Fall of ECW
...,...,...,...
17765,17766,2002.0,Where the Wild Things Are and Other Maurice Se...
17766,17767,2004.0,Fidel Castro: American Experience
17767,17768,2000.0,Epoch
17768,17769,2003.0,The Company


#### 2.3 Load Training Dataset Provided By Netflix

In [5]:
## Load the training data provided by Netflix. There are 4 files provided, however due to computational constrain i am loading only 
## first part of the dataset, This contains ~27 million records. 

# Load the first training file after converting the format using read csv function.  
df_comdata1 = pd.read_csv('combinedata1.txt')

# Set the column value for the dataframe
df_comdata1.columns = ['CustID','Rating','Rating_Date','MovieID']

# Print the dataframe 
df_comdata1

Unnamed: 0,CustID,Rating,Rating_Date,MovieID
0,822109,5,2005-05-13,1
1,885013,4,2005-10-19,1
2,30878,4,2005-12-26,1
3,823519,3,2004-05-03,1
4,893988,3,2005-11-17,1
...,...,...,...,...
24053758,2591364,2,2005-02-16,4499
24053759,1791000,2,2005-02-10,4499
24053760,512536,5,2005-07-27,4499
24053761,988963,3,2005-12-20,4499


#### 2.4 Data Formatting

In [6]:
## Data formatting and imputation 
# 
# Prepare dataset by joining the training set and movie title.  
df = df_comdata1.set_index('MovieID').join(df_title.set_index('MovieID'), on='MovieID', how='left')

# Set the rating date column to DateTime  
df['Rating_Date'] = to_datetime(df['Rating_Date'])

# Fill the YearOfRelease with mean date  
df = df.fillna(df['YearOfRelease'].mean())

# Convert Year of release
df['YearOfRelease'] = df['YearOfRelease'].astype(int) 

# Copy index to MovieID column 
df['MovieID'] = df.index.values
 

#### 2.5  Dataset Information

In [7]:
# Get dataset information
df.head()

Unnamed: 0_level_0,CustID,Rating,Rating_Date,YearOfRelease,Title,MovieID
MovieID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1,822109,5,2005-05-13,2003,Dinosaur Planet,1
1,885013,4,2005-10-19,2003,Dinosaur Planet,1
1,30878,4,2005-12-26,2003,Dinosaur Planet,1
1,823519,3,2004-05-03,2003,Dinosaur Planet,1
1,893988,3,2005-11-17,2003,Dinosaur Planet,1


#### 2.6 Create Dataset for SURPRISE Algorithms

In [8]:
## Reducing number of records from original dataset so as to compute locally. The parameter can be tweaked if we have to run the algorithms with more ## set of data records. 

# Import sklearn train test split
from sklearn.model_selection import train_test_split

## Reduce the records to execute the dataset in local    

# Use the Sklearn dataset to get the random set  
df_neural,df_surprise = train_test_split(df,test_size=0.001, train_size=0.001,random_state=42) 

# Print dataset information 
df_neural


Unnamed: 0_level_0,CustID,Rating,Rating_Date,YearOfRelease,Title,MovieID
MovieID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1826,1243711,2,2005-03-21,1996,Terminal,1826
2578,2333491,4,2003-04-11,2001,Y Tu Mama Tambien,2578
1665,2470099,3,2005-04-29,1998,Orgazmo,1665
2161,2171187,3,2005-06-06,1993,Six Degrees of Separation,2161
1561,1865677,4,2005-11-28,2003,American Wedding,1561
...,...,...,...,...,...,...
2290,1018101,3,2005-07-23,1992,Aladdin: Platinum Edition,2290
1367,1501337,3,2000-02-06,1993,The Piano,1367
3371,604517,3,2005-08-29,2003,Whale Rider,3371
1073,2087880,5,2005-11-22,2005,Coach Carter,1073


#### 2.7 Create Test Train Split Using Surprise Methods

In [9]:
## Create the test train split  
from surprise.model_selection import train_test_split

# Initialize the reader 
reader = Reader(rating_scale=(1, 5))

# Set the dataset for Surprise methods 
data = Dataset.load_from_df(df_surprise[['CustID','MovieID','Rating']], reader)

# Create test train split 
trainset, testset = train_test_split(data, test_size=0.25) 


#### 2.8 Create The Dataset For Neural Layers 

In [10]:
# Set the seed value to lock the data
tf.random.set_seed(42)

# Define the keras sequential processing neural model
model = keras.models.Sequential()

# Define the input layer for the model
user_id = keras.layers.Input(shape=(1,), name='user_id')
movie_id = keras.layers.Input(shape=(1,), name='movie_id') 

# Generate the input dimension for user & movie and define the output dimension

# Calculate the user dataset dimension
user_dim   =  (df['CustID'].max())+1
# Calculate the movie dimension
movie_dim  =  (df['MovieID'].max())+1
# Define the value for output dimension 
output_dim = 50

# Define user embedding
user_layer = keras.layers.Flatten()(keras.layers.Embedding(input_dim=user_dim, output_dim=output_dim, input_length=1,name="user_layers")(user_id))

# Define movie embedding
movie_layer = keras.layers.Flatten()(keras.layers.Embedding(input_dim=movie_dim, output_dim=output_dim, input_length=1,name="movie_layer")(movie_id))

# Concatenate the user and movie embedding to generate input for dense layer
input_data = keras.layers.Concatenate()([user_layer, movie_layer])

### 3. Compare the Recommendation System Algorithms 

#### 3.1 SVD Algorithm - Matrix Factorization

##### 3.1.1 SVD : Calculate MEAN and MAE Score With Default Parameter and CV

In [11]:
## Measure the SVD algorithm  

# Initialize the SVD algorithm
algo_svd = SVD()

# Try cross validation to measure the RMSE and MAE score
cv = cross_validate(algo_svd, data, measures=['RMSE', 'MAE'], cv=5)

# Capture RMSE and MAE score along with fit time 
perf_svd = pd.DataFrame(cv)

# Print mean score
perf_svd.describe().loc[['mean']]

Unnamed: 0,test_rmse,test_mae,fit_time,test_time
mean,1.03639,0.839113,0.296463,0.027009


##### 3.1.2 Optimize SVD 

In [12]:
## Optimize the SVD algorithm using GridSearchCV

# Set the parameter grid for SVD algorithm
param_grid = {"n_factors": [50, 100, 200], "n_epochs": [5, 10], "lr_all": [0.002, 0.005], "reg_all": [0.02, 0.05] }

# Initialize the GridSearchCV
grid_search = GridSearchCV(SVD, param_grid, measures=["rmse", "MAE"], cv=3)

# Fit the data to GridSearchCV
grid_search.fit(data)

# Get the best RMSE score
grid_search.best_estimator["rmse"]

# Convert the results of GridSearch to DataFrame
perf_grid_svd = pd.DataFrame(grid_search.cv_results)

# Initialize the score list
score = []
score_comp = []

# Get the best RMSE and MAE score
for index,row in (perf_grid_svd.iterrows()):
    if row['rank_test_mae'] == 1:
        score.append(['SVD', 'MAE', np.round(row['mean_test_mae'], 4), np.round(row['mean_fit_time'], 4), np.round(row['mean_test_time'],4),row['params']])
    if row['rank_test_rmse'] == 1:
        score.append(['SVD', 'RMSE', np.round(row['mean_test_rmse'],4), np.round(row['mean_fit_time'], 4), np.round(row['mean_test_time'],4),row['params']])

# Append the score to the list
score_comp = score_comp + score

# Print the score
score

[['SVD',
  'MAE',
  0.8501,
  0.0786,
  0.0329,
  {'n_factors': 50, 'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.05}],
 ['SVD',
  'RMSE',
  1.0422,
  0.0786,
  0.0329,
  {'n_factors': 50, 'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.05}]]

#### 3.2 SVD++ - Matrix Factorization  

##### 3.2.1 SVD++ : Calculate MEAN and MAE Score With Default Parameter and CV

In [13]:
## Measure the performance of SVD++ algorithm 
algo_svdpp = SVDpp()

# Initialize the SVD algorithm
cv = cross_validate(algo_svdpp, data, measures=['RMSE', 'MAE'], cv=5)

# Capture RMSE and MAE score along with fit time 
perf_svd = pd.DataFrame(cv)

# Print mean score
perf_svd.describe().loc[['mean']]

Unnamed: 0,test_rmse,test_mae,fit_time,test_time
mean,1.03732,0.845861,0.221224,0.030902


##### 3.2.2 Optimize SVD++

In [14]:
## OPtimize the SVD++ algorithm using GridSearchCV

# Set the parameter grid for SVD algorithm
param_grid = {"n_factors": [10, 20, 50, 100], "n_epochs": [10, 20, 30], "lr_all": [0.002, 0.005], "reg_all": [0.02, 0.05], "cache_ratings":[False, True] } 

# Initialize the GridSearchCV
grid_search = GridSearchCV(SVDpp, param_grid, measures=["rmse", "MAE"], cv=3)

# Fit the data to GridSearchCV
grid_search.fit(data)

# Get the best RMSE score
grid_search.best_estimator["rmse"]

# Convert the results of GridSearch to DataFrame
perf_grid_svd = pd.DataFrame(grid_search.cv_results)

# Initialize the score list
score = []

# Get the best RMSE and MAE score
for index,row in (perf_grid_svd.iterrows()):
    if row['rank_test_mae'] == 1:
        score.append(['SVD++','MAE', np.round(row['mean_test_mae'], 4), np.round(row['mean_fit_time'], 4), np.round(row['mean_test_time'],4),row['params']])
    if row['rank_test_rmse'] == 1:
        score.append(['SVD++', 'RMSE', np.round(row['mean_test_rmse'],4), np.round(row['mean_fit_time'], 4), np.round(row['mean_test_time'],4),row['params']])

# Append the score to the list
score_comp = score_comp + score

# Print the score
score

[['SVD++',
  'MAE',
  0.8434,
  0.1847,
  0.1143,
  {'n_factors': 10,
   'n_epochs': 30,
   'lr_all': 0.005,
   'reg_all': 0.05,
   'cache_ratings': False}],
 ['SVD++',
  'RMSE',
  1.0382,
  0.1847,
  0.1143,
  {'n_factors': 10,
   'n_epochs': 30,
   'lr_all': 0.005,
   'reg_all': 0.05,
   'cache_ratings': False}]]

#### 3.3 NMF - Matrix Factorization  

##### 3.3.1 NMF : Calculate MEAN and MAE Score With Default Parameter and CV

In [15]:
## Measure the performance of NMF algorithm

# Initialize the NMF algorithm
algo_nmf = NMF()

# Try cross validation to measure the RMSE and MAE score
cv = cross_validate(algo_nmf, data, measures=['RMSE', 'MAE'], cv=5)

# Capture RMSE and MAE score along with fit time
perf_svd = pd.DataFrame(cv)

# Print mean score
perf_svd.describe().loc[['mean']]

Unnamed: 0,test_rmse,test_mae,fit_time,test_time
mean,1.109958,0.92407,1.481435,0.028439


##### 3.3.2 Optimize NMF

In [16]:
## OPtimize the NMF algorithm using GridSearchCV

# Define the parmameter grid for NMF algorithm
param_grid = {"n_factors": [5, 15, 20, 25], "n_epochs": [20, 50, 70] }

# Initialize the GridSearchCV
grid_search = GridSearchCV(NMF, param_grid, measures=["rmse", "MAE"], cv=3)

# Fit the data to GridSearchCV
grid_search.fit(data)

# Get the best RMSE score
grid_search.best_estimator["rmse"]

# Convert the results of GridSearch to DataFrame
perf_grid_svd = pd.DataFrame(grid_search.cv_results)

# Initialize the score list
score = []

# Get the best RMSE and MAE score
for index,row in (perf_grid_svd.iterrows()):
    if row['rank_test_mae'] == 1:
        score.append(['NMF', 'MAE', np.round(row['mean_test_mae'], 4), np.round(row['mean_fit_time'], 4), np.round(row['mean_test_time'],4),row['params']])
    if row['rank_test_rmse'] == 1:
        score.append(['NMF', 'RMSE', np.round(row['mean_test_rmse'],4), np.round(row['mean_fit_time'], 4), np.round(row['mean_test_time'],4),row['params']])

# Append the score to the list
score_comp = score_comp + score

# Print the score
score

[['NMF', 'MAE', 0.9167, 0.5871, 0.0393, {'n_factors': 20, 'n_epochs': 20}],
 ['NMF', 'RMSE', 1.1008, 1.851, 0.0431, {'n_factors': 25, 'n_epochs': 50}]]

#### 3.4 SlopeOne - Collaborative filtering algorithm

##### 3.4.1 SlopeOne : Calculate MEAN and MAE Score With Default Parameter and CV

In [17]:
## Measure the performance of SlopeOne algorithm
algo_so = SlopeOne()

# Try cross validation to measure the RMSE and MAE score
cv = cross_validate(algo_so, data, measures=['RMSE', 'MAE'], cv=5)

# Capture RMSE and MAE score along with fit time
perf_svd = pd.DataFrame(cv)

# Print mean score
temp = perf_svd.describe().loc[['mean']] 

# Append the score to the list
score_comp = score_comp + [['SlopeOne', 'RMSE', np.round((temp['test_rmse']['mean']),4), np.round(temp['fit_time']['mean'],4) , np.round(temp['test_time']['mean'],4) , 'NA'], ['SlopeOne', 'MAE', np.round((temp['test_mae']['mean']),4), np.round(temp['fit_time']['mean'],4) , np.round(temp['test_time']['mean'],4) , 'NA']]

perf_svd.describe().loc[['mean']]

Unnamed: 0,test_rmse,test_mae,fit_time,test_time
mean,1.115905,0.918874,0.186121,0.032755


#### 3.5 Co-Clustering - Collaborative Filtering Algorithm 

##### 3.5.1 Co-Clustering : Calculate MEAN and MAE Score With Default Parameter and CV

In [18]:
## Measure the Co-Clustering algorithm
algo_coc = CoClustering()

# Try cross validation to measure the RMSE and MAE score
cv = cross_validate(algo_coc, data, measures=['RMSE', 'MAE'], cv=5)

# Capture RMSE and MAE score along with fit time
perf_svd = pd.DataFrame(cv)

# Print the score
perf_svd.describe().loc[['mean']]

Unnamed: 0,test_rmse,test_mae,fit_time,test_time
mean,1.11078,0.921386,1.934233,0.018136


##### 3.5.2 Optimize Co-Clustering

In [19]:
## Optimize the Co-Clustering algorithm using GridSearchCV

# Define the parmameter grid for Co-Clustering algorithm
param_grid = {"n_cltr_u" : [2,3,5], "n_cltr_i": [2,3,5], "n_epochs": [10,20,30]}

# Initialize the GridSearchCV
grid_search = GridSearchCV(CoClustering, param_grid, measures=["rmse", "MAE"], cv=3)
# Fit the data to GridSearchCV
grid_search.fit(data)
# Get the best RMSE score
grid_search.best_estimator["rmse"]
# Convert the results of GridSearch to DataFrame
perf_grid_svd = pd.DataFrame(grid_search.cv_results)

# Initialize the score list
score = []
# Get the best RMSE and MAE score
for index,row in (perf_grid_svd.iterrows()):
    if row['rank_test_mae'] == 1:
        score.append(['Co-Clustering', 'MAE', np.round(row['mean_test_mae'], 4), np.round(row['mean_fit_time'], 4), np.round(row['mean_test_time'],4), row['params']])
    if row['rank_test_rmse'] == 1:
        score.append(['Co-Clustering', 'RMSE', np.round(row['mean_test_rmse'],4), np.round(row['mean_fit_time'], 4), np.round(row['mean_test_time'],4),row['params']])

# Append the score to the list
score_comp = score_comp + score

# Print the score
score

[['Co-Clustering',
  'MAE',
  0.9184,
  1.0203,
  0.0338,
  {'n_cltr_u': 5, 'n_cltr_i': 5, 'n_epochs': 10}],
 ['Co-Clustering',
  'RMSE',
  1.1045,
  1.0203,
  0.0338,
  {'n_cltr_u': 5, 'n_cltr_i': 5, 'n_epochs': 10}]]

#### 3.6 KNN - Nearest Neighbour Approach

##### 3.6.1 KNNBasic : Basic Nearest Neighbors Approach

In [20]:
## Measure the performance of KNN Basicr algorithm
sim_options = {"name": "pearson_baseline", "user_based": False}

# Initialize the KNN Basic algorithm
algo_knn = KNNBasic(sim_options=sim_options)

# Try cross validation to measure the RMSE and MAE score
cv = cross_validate(algo_knn, data, measures=['RMSE', 'MAE'], cv=3)

# Capture RMSE and MAE score along with fit time
perf_svd = pd.DataFrame(cv)

# Print mean score
print(algo_coc.__class__.__name__)                 
perf_svd.describe().loc[['mean']] 

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
CoClustering


Unnamed: 0,test_rmse,test_mae,fit_time,test_time
mean,1.081001,0.907577,0.132089,0.033199


##### 3.6.2 Optimize the KNN Basic

In [21]:
## Optimize the KNN Basic algorithm using GridSearchCV

# Define the parmameter grid for KNN Basic algorithm
param_grid = {"k": [10, 20, 40], "min_k": [1,2,5], 'sim_options': {'name': ['cosine', 'pearson'], 'user_based': [False]  } }

# Initialize the GridSearchCV
grid_search = GridSearchCV(KNNBasic, param_grid, measures=["rmse", "MAE"], cv=3)

# Fit the data to GridSearchCV
grid_search.fit(data)

# Get the best RMSE score
grid_search.best_estimator["rmse"]

# Convert the results of GridSearch to DataFrame
perf_grid_svd = pd.DataFrame(grid_search.cv_results)

# Initialize the score list
score = []

# Get the best RMSE and MAE score
for index,row in (perf_grid_svd.iterrows()):
    if row['rank_test_mae'] == 1:
        score.append(['KNNBasic', 'MAE', np.round(row['mean_test_mae'], 4), np.round(row['mean_fit_time'], 4), np.round(row['mean_test_time'],4),row['params']])
    if row['rank_test_rmse'] == 1:
        score.append(['KNNBasic','RMSE', np.round(row['mean_test_rmse'],4), np.round(row['mean_fit_time'], 4), np.round(row['mean_test_time'],4),row['params']])

# Append the score to the list
score_comp = score_comp + score

# Print the score
score

Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Comput

[['KNNBasic',
  'MAE',
  0.9076,
  0.0508,
  0.0344,
  {'k': 20,
   'min_k': 2,
   'sim_options': {'name': 'cosine', 'user_based': False}}],
 ['KNNBasic',
  'RMSE',
  1.0811,
  0.0508,
  0.0344,
  {'k': 20,
   'min_k': 2,
   'sim_options': {'name': 'cosine', 'user_based': False}}]]

##### 3.6.3 KNNwithMeans - A Basic Collaborative Filtering Algorithm

In [22]:
## Measure the performance of KNN Means Only algorithm
sim_options = {"name": "pearson_baseline", "user_based": False}

# Initialize the KNN Means Only algorithm
algo_knn = KNNWithMeans(sim_options=sim_options)

# Try cross validation to measure the RMSE and MAE score
cv = cross_validate(algo_knn, data, measures=['RMSE', 'MAE'], cv=3)

# Capture RMSE and MAE score along with fit time
perf_svd = pd.DataFrame(cv)

# Print mean score                  
perf_svd.describe().loc[['mean']] 

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


Unnamed: 0,test_rmse,test_mae,fit_time,test_time
mean,1.082804,0.90432,0.134117,0.043786


##### 3.6.4 - Optimize the KNNwithMeans Algorithm

In [23]:
## OPtimize the KNN Means Only algorithm using GridSearchCV

# Define the parmameter grid for KNN Means Only algorithm
param_grid = {"k": [10, 20, 40], "min_k": [1,2,5], 'sim_options': {'name': ['cosine', 'pearson'], 'user_based': [False]  }}

# Initialize the GridSearchCV
grid_search = GridSearchCV(KNNWithMeans, param_grid, measures=["rmse", "MAE"], cv=3)
# Fit the data to GridSearchCV
grid_search.fit(data)
# Get the best RMSE score
grid_search.best_estimator["rmse"]
# Convert the results of GridSearch to DataFrame
perf_grid_svd = pd.DataFrame(grid_search.cv_results)

# Initialize the score list
score = []
# Get the best RMSE and MAE score
for index,row in (perf_grid_svd.iterrows()):
    if row['rank_test_mae'] == 1:
        score.append(['KNNwithMeans', 'MAE', np.round(row['mean_test_mae'], 4), np.round(row['mean_fit_time'], 4), np.round(row['mean_test_time'],4),row['params']])
    if row['rank_test_rmse'] == 1:
        score.append(['KNNwithMeans', 'RMSE', np.round(row['mean_test_rmse'],4), np.round(row['mean_fit_time'], 4), np.round(row['mean_test_time'],4),row['params']])

# Append the score to the list
score_comp = score_comp + score

# Print the score
score

Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Comput

[['KNNwithMeans',
  'MAE',
  0.9042,
  0.0693,
  0.0361,
  {'k': 20,
   'min_k': 2,
   'sim_options': {'name': 'cosine', 'user_based': False}}],
 ['KNNwithMeans',
  'RMSE',
  1.0827,
  0.0693,
  0.0361,
  {'k': 20,
   'min_k': 2,
   'sim_options': {'name': 'cosine', 'user_based': False}}]]

##### 3.6.5 KNN Baseline- Basic Collaborative Filtering Algorithm Taking Into Account a Baseline Rating

In [24]:
## Measure the performance of KNN baseline Only algorithm
sim_options = {"name": "pearson_baseline", "user_based": False}

# Initialize the KNN baseline Only algorithm
algo_knn = KNNBaseline(sim_options=sim_options)

# Try cross validation to measure the RMSE and MAE score
cv = cross_validate(algo_knn, data, measures=['RMSE', 'MAE'], cv=3)

# Capture RMSE and MAE score along with fit time
perf_svd = pd.DataFrame(cv)

# Print mean score                  
perf_svd.describe().loc[['mean']] 

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


Unnamed: 0,test_rmse,test_mae,fit_time,test_time
mean,1.038414,0.844238,0.121918,0.044801


##### 3.6.6 Optimize KNN Baseline

In [25]:
## OPtimize the KNNBaseline only algorithm using GridSearchCV

# Define the parmameter grid for KNN Means Only algorithm
param_grid = {"k": [10, 40, 50], "min_k": [1,2,5], 'sim_options': {'name': ['cosine', 'pearson'], 'user_based': [False] }}

# Initialize the GridSearchCV
grid_search = GridSearchCV(KNNBaseline, param_grid, measures=["rmse", "MAE"], cv=3)
# Fit the data to GridSearchCV
grid_search.fit(data)
# Get the best RMSE score
grid_search.best_estimator["rmse"]
# Convert the results of GridSearch to DataFrame
perf_grid_svd = pd.DataFrame(grid_search.cv_results)

# Initialize the score list
score = []    
# Get the best RMSE and MAE score
for index,row in (perf_grid_svd.iterrows()):
    if row['rank_test_mae'] == 1:
        score.append(['KNNBaseline' ,'MAE', np.round(row['mean_test_mae'], 4), np.round(row['mean_fit_time'], 4), np.round(row['mean_test_time'],4),row['params']])
    if row['rank_test_rmse'] == 1:
        score.append(['KNNBaseline', 'RMSE', np.round(row['mean_test_rmse'],4), np.round(row['mean_fit_time'], 4), np.round(row['mean_test_time'],4),row['params']])


# Append the score to the list
score_comp = score_comp + score

# Print the score
score 

Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Com

[['KNNBaseline',
  'MAE',
  0.8462,
  0.1001,
  0.0394,
  {'k': 40,
   'min_k': 2,
   'sim_options': {'name': 'cosine', 'user_based': False}}],
 ['KNNBaseline',
  'RMSE',
  1.0399,
  0.1001,
  0.0394,
  {'k': 40,
   'min_k': 2,
   'sim_options': {'name': 'cosine', 'user_based': False}}]]

In [26]:
score_comp

[['SVD',
  'MAE',
  0.8501,
  0.0786,
  0.0329,
  {'n_factors': 50, 'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.05}],
 ['SVD',
  'RMSE',
  1.0422,
  0.0786,
  0.0329,
  {'n_factors': 50, 'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.05}],
 ['SVD++',
  'MAE',
  0.8434,
  0.1847,
  0.1143,
  {'n_factors': 10,
   'n_epochs': 30,
   'lr_all': 0.005,
   'reg_all': 0.05,
   'cache_ratings': False}],
 ['SVD++',
  'RMSE',
  1.0382,
  0.1847,
  0.1143,
  {'n_factors': 10,
   'n_epochs': 30,
   'lr_all': 0.005,
   'reg_all': 0.05,
   'cache_ratings': False}],
 ['NMF', 'MAE', 0.9167, 0.5871, 0.0393, {'n_factors': 20, 'n_epochs': 20}],
 ['NMF', 'RMSE', 1.1008, 1.851, 0.0431, {'n_factors': 25, 'n_epochs': 50}],
 ['SlopeOne', 'RMSE', 1.1159, 0.1861, 0.0328, 'NA'],
 ['SlopeOne', 'MAE', 0.9189, 0.1861, 0.0328, 'NA'],
 ['Co-Clustering',
  'MAE',
  0.9184,
  1.0203,
  0.0338,
  {'n_cltr_u': 5, 'n_cltr_i': 5, 'n_epochs': 10}],
 ['Co-Clustering',
  'RMSE',
  1.1045,
  1.0203,
  0.0338,
  {'n_cltr_u': 5,

### 3.7 NormalPredictor - Basic Random Rating based Algorithm

In [27]:
## Measure the performance of KNN baseline Only algorithm
sim_options = {"name": "pearson_baseline", "user_based": False}

# Initialize the KNN baseline Only algorithm
algo_np = NormalPredictor()

# Try cross validation to measure the RMSE and MAE score
cv = cross_validate(algo_np, data, measures=['RMSE', 'MAE'], cv=3)

# Capture RMSE and MAE score along with fit time
perf_svd = pd.DataFrame(cv)

# Create the score list    
temp = perf_svd.describe().loc[['mean']]

# Append the score to the list
score_comp = score_comp + [['NormalPredictor', 'RMSE', np.round((temp['test_rmse']['mean']),4), np.round(temp['fit_time']['mean'],4) , np.round(temp['test_time']['mean'],4) , 'NA'], ['NormalPredictor', 'MAE', np.round((temp['test_mae']['mean']),4), np.round(temp['fit_time']['mean'],4) , np.round(temp['test_time']['mean'],4) , 'NA']]

# Print the score
perf_svd.describe().loc[['mean']]

Unnamed: 0,test_rmse,test_mae,fit_time,test_time
mean,1.457546,1.168555,0.02757,0.043323


### 3.8 Neural Network Approach 

In [None]:
# Set the seed 
tf.random.set_seed(42)
np.random.seed(42)

# Input the data to dense layers

# Dense layer1 with with 128 nodes and activation function as relu   
dense_input1 = keras.layers.Dense(128, activation="relu")(input_data)

# Hidden layer2 with 128 nodes and activation function as relu
dense_input2 = keras.layers.Dense(128, activation="relu")(dense_input1)

# Hidden layer3 with 128 nodes and activation function as relu
dense_input3 = keras.layers.Dense(128, activation="relu")(dense_input2)

# Final output layer with one node and activation function as linear
final_output = keras.layers.Dense(1, activation="linear")(dense_input3)

# Group the layers into an object with training/output layer
model = keras.Model(inputs=[user_id, movie_id], outputs=final_output)

# Compile step for model and set the compile parameters
# optimizer == 'Adam' , loss = MSE and metrics = MAE
tf.random.set_seed(42)
np.random.seed(42) 

model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3), loss="mean_squared_error", metrics=['mae','RootMeanSquaredError'])

# Print the model summary 
model.summary()

# Create input for the model
X = [np.array(df_surprise['CustID'].astype('int32')), np.array(df_surprise['MovieID']).astype('int32')]
y = np.array(df_surprise['Rating'].astype('int32'))

tf.random.set_seed(42)
np.random.seed(42)

# Fit the model on input data with 5 epocs
# Set the Epoch values
epoch = 20
history= model.fit(X, y, epochs=epoch, batch_size=100)

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 user_id (InputLayer)           [(None, 1)]          0           []                               
                                                                                                  
 movie_id (InputLayer)          [(None, 1)]          0           []                               
                                                                                                  
 user_layers (Embedding)        (None, 1, 50)        132471500   ['user_id[0][0]']                
                                                                                                  
 movie_layer (Embedding)        (None, 1, 50)        225000      ['movie_id[0][0]']               
                                                                                              

### 3.9 Compare Performance Score/Execution Time/Best Parameters

In [29]:
## Measure the performance algorithm

# Initialize the Normal Predictor algorithm
score_comp = pd.DataFrame(score_comp) 

# Set the column names
score_comp.columns = ['Algorithm', 'Score_Type','Score', 'Mean_Fit_Time', 'Mean_Test_Time', 'Best_Parameters']

# Plot the RMSE score
fig = px.bar(score_comp.query('Score_Type == "RMSE"'), x='Algorithm', y='Score', title='RMSE Score Comparison', color='Algorithm', text_auto=True)
fig.update_layout(xaxis={'categoryorder': 'total ascending'})
fig.show()

# Plot the MAE score
fig = px.bar(score_comp.query('Score_Type == "MAE"'), x='Algorithm', y='Score', title='MAE Score Comparison', color='Algorithm', text_auto=True)
fig.update_layout(xaxis={'categoryorder': 'total ascending'})
fig.show()

# Plot the Mean Fit Time
fig = px.bar(score_comp, x='Algorithm', y='Mean_Fit_Time', title='Mean Fit Time Comparison', color='Algorithm', text_auto=True)
fig.update_layout(xaxis={'categoryorder': 'total ascending'})
fig.show()

# Plot the Mean Test Time
fig = px.bar(score_comp, x='Algorithm', y='Mean_Test_Time', title='Mean Test Time Comparison', color='Algorithm',text_auto=True)
fig.update_layout(xaxis={'categoryorder': 'total ascending'})
fig.show()

# Set Epochs =5 and list(range(1,epoch+1,1)) 
epoch = 20
x_range = list(range(1,epoch+1,1))

# Plot Loss value for Epochs
fig = px.line(y=history.history['loss'], x=x_range, title='Epochs vs Loss(Neural Model)',labels={'x': 'Epochs', 'y':'Loss'}, width=600, height=400)
fig.show()

# Plot MAE value for Epochs
fig = px.bar(x=list(range(1,epoch+1,1)), y=history.history['mae'],color=x_range, title='MAE Score vs Epochs(Neural Model)',labels={'x': 'Epochs', 'y':'MAE'}, width=600, height=400) 
fig.update_layout(xaxis={'categoryorder': 'total ascending'})
fig.show()

#Plot RMSE for Epochs
fig = px.bar(x=list(range(1,epoch+1,1)), y=history.history['root_mean_squared_error'],color=x_range, title='RMSE Score vs Epochs(Neural Model)',labels={'x': 'Epochs', 'y':'RMSE'}, width=600, height=400) 
fig.update_layout(xaxis={'categoryorder': 'total ascending'})
fig.show()

# Print the best parameters
#
display('BEST Parameters in cross validation')
pd.set_option('display.max_colwidth', 100)
score_comp[['Algorithm','Score_Type','Best_Parameters']]

# Print 

'BEST Parameters in cross validation'

Unnamed: 0,Algorithm,Score_Type,Best_Parameters
0,SVD,MAE,"{'n_factors': 50, 'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.05}"
1,SVD,RMSE,"{'n_factors': 50, 'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.05}"
2,SVD++,MAE,"{'n_factors': 10, 'n_epochs': 30, 'lr_all': 0.005, 'reg_all': 0.05, 'cache_ratings': False}"
3,SVD++,RMSE,"{'n_factors': 10, 'n_epochs': 30, 'lr_all': 0.005, 'reg_all': 0.05, 'cache_ratings': False}"
4,NMF,MAE,"{'n_factors': 20, 'n_epochs': 20}"
5,NMF,RMSE,"{'n_factors': 25, 'n_epochs': 50}"
6,SlopeOne,RMSE,
7,SlopeOne,MAE,
8,Co-Clustering,MAE,"{'n_cltr_u': 5, 'n_cltr_i': 5, 'n_epochs': 10}"
9,Co-Clustering,RMSE,"{'n_cltr_u': 5, 'n_cltr_i': 5, 'n_epochs': 10}"


### 4.0 Recommendation Results

#### 4.1 Return top Movie Recommendation for Users

In [None]:
## Return the top-N recommendation for each user from a set of predictions.

def get_top_n(predictions, n=3): 

    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the k highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]
    
    return top_n

# Than predict ratings for all pairs (u, i) that are NOT in the training set.
testset = trainset.build_anti_testset()
predictions = algo_svdpp.test(testset)

top_n = get_top_n(predictions, n=3)

# Print the recommended items for each user
# Capture the top N recommendation for each user
# Initialize the dictionary to capture the recommendation 

recommendation = [] 
for uid, user_ratings in top_n.items():
#    print(uid, [iid for (iid, _) in user_ratings])
    recommendation.append([uid, [df_surprise['Title'].loc[iid].unique()[0] for (iid, _) in user_ratings]])

# Create the dataframe to capture the recommendation
recommendation = pd.DataFrame(recommendation)

# Set the column names
recommendation.columns = ['UserID', 'Movie_Recommendation']

# Sort the dataframe by UserID
recommendation.sort_values(by='UserID', inplace=True)
# Reset the index
recommendation.reset_index(drop=True, inplace=True)
# Print the recommendation
recommendation


Unnamed: 0,UserID,Movie_Recommendation
0,7,"[Alias: Season 1, CSI: Season 1, The Simpsons: Season 6]"
1,352,"[Braveheart, Alias: Season 1, Rabbit-Proof Fence]"
2,857,"[Alias: Season 1, Star Trek: Voyager: Season 1, CSI: Season 1]"
3,1070,"[Alias: Season 1, Bringing Up Baby, The Simpsons: Season 6]"
4,1188,"[The Simpsons: Season 6, Alias: Season 1, CSI: Season 1]"
...,...,...
17071,2648874,"[CSI: Season 1, Star Trek: The Next Generation: Season 7, Bringing Up Baby]"
17072,2648885,"[Pride and Prejudice, Alias: Season 1, CSI: Season 1]"
17073,2649049,"[CSI: Season 1, The Simpsons: Season 1, Star Trek: Voyager: Season 1]"
17074,2649376,"[Six Feet Under: Season 4, Firefly, Stargate SG-1: Season 3]"


#### 4.2 Nearest Neighbour For Given Movie 

In [31]:
## Retrieve the nearest neighbor for a specific movie
# First, train the algorithm to compute the similarities between items

# Set the movie name
movie_name = 'Jingle All the Way'

# Initialize the KNN Baseline algorithm
sim_options = {"name": "pearson_baseline", "user_based": False}
algo_kb = KNNBaseline(sim_options=sim_options)
# Fit the data to the algorithm
algo_kb.fit(trainset)

# Read the mappings raw id <-> movie name
rid_to_name, name_to_rid = df_surprise['Title'].to_dict(), df_surprise.set_index('Title')['MovieID'].to_dict()

## Retrieve inner id of the movie Toy Story
movie_raw_id = name_to_rid[movie_name] 
movie_inner_id = algo_kb.trainset.to_inner_iid(movie_raw_id)

# Retrieve inner ids of the nearest neighbors of Toy Story.
movie_neighbors = algo_kb.get_neighbors(movie_inner_id, k=10)

# Convert inner ids of the neighbors into names.
movie_neighbors = (
    algo_kb.trainset.to_raw_iid(inner_id) for inner_id in movie_neighbors
)

# Convert raw ids into movie names.
movie_neighbors = (rid_to_name[rid] for rid in movie_neighbors)
# Print the 10 nearest neighbors of Toy Story
print()
print('The 10 nearest neighbors of : 'f'{movie_name}')
print()
pd.DataFrame(movie_neighbors, columns=['Movie Name'])
#for movie in movie_neighbors:
#    print(movie)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.

The 10 nearest neighbors of : Jingle All the Way



Unnamed: 0,Movie Name
0,Touched by an Angel: Season 1
1,Road to Perdition
2,The Crazies
3,When Harry Met Sally
4,101 Dalmatians II: Patch's London Adventure
5,The Alamo
6,About Schmidt
7,Don't Say a Word
8,About a Boy
9,Barbershop


#### 4.3 Nearest Neighbour For Given Customer

In [32]:
## Retrieve the nearest neighbor for a specific User

# First, train the algorithm to compute the similarities between items

# Set the User ID
User_ID = 1083252

# Initialize the KNN Baseline algorithm
sim_options = {"name": "pearson_baseline", "user_based": False}
algo_kb = KNNBaseline(sim_options=sim_options)
# Fit the data to the algorithm
algo_kb.fit(trainset)

# Read the mappings raw id <-> CustID
rid_to_name, name_to_rid = df_surprise['CustID'].to_dict(), df_surprise.set_index('CustID')['MovieID'].to_dict()

## Retrieve inner id of the movie Toy Story
movie_raw_id = name_to_rid[User_ID] 
movie_inner_id = algo_kb.trainset.to_inner_iid(movie_raw_id)

# Retrieve inner ids of the nearest neighbors of Toy Story.
movie_neighbors = algo_kb.get_neighbors(movie_inner_id, k=10)

# Convert inner ids of the neighbors into names.
movie_neighbors = (
    algo_kb.trainset.to_raw_iid(inner_id) for inner_id in movie_neighbors
)

# Convert raw ids into movie names.
movie_neighbors = (rid_to_name[rid] for rid in movie_neighbors)
# Print the 10 nearest neighbors of Toy Story
print()
print('The 10 nearest neighbors of : 'f'{User_ID}')
print()
pd.DataFrame(movie_neighbors, columns=['Nearest Neighbor Customer ID'])
#for movie in movie_neighbors:
#    print(movie)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.

The 10 nearest neighbors of : 1083252



Unnamed: 0,Nearest Neighbor Customer ID
0,1680362
1,2598710
2,1049640
3,2606295
4,1144671
5,727924
6,2385435
7,73330
8,664443
9,987810


#### 4.4 Recommendation from Neural model

In [38]:
## The function calculates consine distance between the customer unwatched movies and return the ## top 3 recommendation. The 
tf.random.set_seed(42)
np.random.seed(42)

def get_top_3_rec(customer_id):

    arr= np.array(df_surprise)
#   Watched movie list 
    val,dt= np.where(arr[0:,0:1].astype(int) == customer_id)
    cus_watched_movie = arr[val][0:,5:].flatten().astype(int)

# Movies not watched by customer   
    all_unique_movies= np.sort(np.unique(arr[0:,5:].flatten().astype(int)))
    mask = np.invert(np.isin(all_unique_movies,cus_watched_movie))
    cus_not_watched= all_unique_movies[mask]
#  
#    Extract the weights for User and movie
#   
    user_weights= np.array(model.get_layer('user_layers').get_weights())
    movie_weights= np.array(model.get_layer('movie_layer').get_weights())

# Extract the weights for customer   
    X= user_weights[0][customer_id]
# Extract the weights for list of movies customer has not watched    
    y= movie_weights[0][list(cus_not_watched.astype('int32'))]
# Evaluate the cosine similarity for customer and list of movies
    cosine = np.dot(X,np.transpose(y))/np.linalg.norm(X)*np.linalg.norm(np.transpose(y))
# Prepare the   
    movie_distance= np.concatenate([cosine.reshape(-1,1),cus_not_watched.reshape(-1,1)],axis=1)
    movie_sorted = movie_distance[movie_distance[:,0].argsort()]
    return df_surprise.loc[list(np.flip(np.unique(movie_sorted, axis=0))[:3,:1].flatten().astype(int))]['Title'].unique()

rec= get_top_3_rec(352)

print("Top 3 recommendation for user is: ")
print("")
print(rec)


Top 3 recommendation for user is: 

['Wild Things' 'Cries and Whispers' 'Christmas with The Simpsons']


### 5.0 Conclusion

Applied different types of algorithms to generate recommendations for the customer. On comparing the scores, the neural network performed better; however, the model learning and execution time were very high. In the surprise library, the SVD++ showed good results for recommendation and tried to extract the nearest neighbor for movies and users based on similarity.    