# Assignment 2  FIT5212 
# Recommender System

**Student Name:**  Eddy

**Student ID:**    33495608

---------

## Importing the Library

In [2]:
#Importing the required library
%matplotlib inline
import pandas as pd
import random
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet

#Command to install the surprise library if it's not available
#!pip install surprise

from surprise import Reader, Dataset, SVD, KNNBasic
from surprise.model_selection import cross_validate, train_test_split, GridSearchCV
reader = Reader()

#Setting the seed for the entire code
np.random.seed(42)
random.seed(42)


import warnings; warnings.simplefilter('ignore')

## Processing the dataset

The dataset is processed here so that it can be used by the KNN and SVD models.

In [None]:
#Loading the train dataset 
df_train = pd.read_csv('train.csv')

#Making a copy of the train  dataset to be preprocessed
df_train_p = df_train.copy()

#Checking if it's copied properly
df_train_p.head()

In [None]:
#Loading the test dataset, then checking that it's loaded properly
df_test = pd.read_csv('test.csv')

#Making a copy of the test dataset to be preprocessed
df_test_p = df_test.copy()

#Checking if it's copied properly
df_test_p.head()

In [None]:
#Doing Exploratory Data Analysis to see the shape of the train dataset
print('The row and columns for the df_train is =', df_train.shape)

In [None]:
#Doing Exploratory Data Analysis to see the info of the train dataset
df_train.info()

In [None]:
#Looking at the unique values of the rating column
print('The rating column consists of =', df_train['rating'].unique())

In [None]:
#Looking if there's any NaN in any of the columns
print('Checking if any of the column has missing values')
df_train.isnull().any()

In [None]:
#Checking the product id for a specific product name
prod_big = df_train[df_train['product_name'].isin(['Big'])][['user_id','product_id','product_name']]

print("The Big product name is associated with product id which is")
prod_big['product_id'].unique()

The EDA on the Big product name shows that 3 product IDs are associated with it. This violates Amazon's rule that a single product name must have exactly 1 product ID.

To resolve this issue, for each product name the product ID with the most votes is chosen as the correct product ID. This wrangling is applied to both the train and test datasets.



In [10]:
# Grouping the product_name then find the product_id with the most votes
most_voted_id = df_train_p.groupby(['product_name', 'product_id'])['votes'].max()
most_voted_id = most_voted_id.groupby(level=0).idxmax().apply(lambda x: x[1])

# Converting the result to dictionary
product_name_to_id = most_voted_id.to_dict()

# Map the 'product_name' to the 'product_id' with the most votes
df_train_p['product_id'] = df_train_p['product_name'].map(product_name_to_id)

# Applying the dictionary to the other dataset
df_test_p['product_id'] = df_test_p['product_name'].map(product_name_to_id)
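
The mapping step above can be illustrated on a toy table. This is a minimal sketch, not the notebook's data: the IDs, names and vote counts are made up, and the `fillna` fallback for product names that never appear in training is an assumption added for illustration (without it, `map` leaves `NaN` for such rows).

```python
import pandas as pd

# Toy data (hypothetical values): the name "Big" appears under three product ids.
toy_train = pd.DataFrame({
    "product_name": ["Big", "Big", "Big", "Up"],
    "product_id":   ["B001", "B002", "B003", "B010"],
    "votes":        [3, 12, 5, 7],
})

# For each product_name, keep the product_id of the row with the highest vote count.
best_rows = toy_train.loc[toy_train.groupby("product_name")["votes"].idxmax()]
name_to_id = dict(zip(best_rows["product_name"], best_rows["product_id"]))

# A test set may contain names never seen in training; map() yields NaN for those,
# so this sketch falls back to the row's original product_id.
toy_test = pd.DataFrame({
    "product_name": ["Big", "Unseen"],
    "product_id":   ["B003", "B999"],
})
toy_test["product_id"] = (
    toy_test["product_name"].map(name_to_id).fillna(toy_test["product_id"])
)
print(name_to_id)
print(toy_test)
```

Here "Big" resolves to the ID with 12 votes, while the unseen name keeps its original ID.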

In [11]:
#Loading the ratings through the Reader class from the Surprise package
#The column order (user, product, rating) gives user-based collaborative filtering
data = Dataset.load_from_df(df_train[['user_id', 'product_id', 'rating']], reader) #Pre-processed data
data_p = Dataset.load_from_df(df_train_p[['user_id', 'product_id', 'rating']], reader) #Processed data

#Loading the ratings through the Reader class from the Surprise package
#The order of user and product is reversed, which gives item-based collaborative filtering
data_rev = Dataset.load_from_df(df_train[['product_id', 'user_id', 'rating']], reader) #Pre-processed data
data_rev_p = Dataset.load_from_df(df_train_p[['product_id', 'user_id', 'rating']], reader) #Processed data

## K-Nearest Neighbor (KNN)

In this section, the K-Nearest Neighbor algorithm is applied and then tuned to predict the rating. First, the pre-processed data (the data before the product ID fix) is used.

**Note that `KNNBasic` takes no random state and `cross_validate` is called with an integer `cv`, so the folds are not explicitly seeded; the results may differ slightly between runs.**
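
To make the roles of `k` and `min_k` concrete, the following is a rough sketch of how a basic KNN rating estimate can be aggregated: a similarity-weighted average over the `k` most similar neighbours, falling back to the global mean when fewer than `min_k` usable neighbours exist. The function name `knn_estimate`, the toy similarities and the global mean of 3.5 are assumptions for illustration; this is not the Surprise implementation itself.

```python
def knn_estimate(neighbors, k=20, min_k=3, global_mean=3.5):
    """Estimate a rating from (similarity, rating) pairs of neighbours.

    neighbors: list of (similarity, rating) tuples for users who rated the item.
    """
    # Keep the k most similar neighbours, ignoring non-positive similarities.
    usable = [n for n in neighbors if n[0] > 0]
    top = sorted(usable, key=lambda n: n[0], reverse=True)[:k]

    # Too few neighbours: fall back to the global mean rating.
    if len(top) < min_k:
        return global_mean

    # Similarity-weighted average of the neighbours' ratings.
    weighted_sum = sum(sim * rating for sim, rating in top)
    return weighted_sum / sum(sim for sim, _ in top)


three_neighbors = [(0.9, 5), (0.5, 3), (0.1, 1)]
two_neighbors = [(0.9, 5), (0.5, 3)]
print(knn_estimate(three_neighbors))  # weighted average of the three ratings
print(knn_estimate(two_neighbors))    # fewer than min_k neighbours, so the global mean
```

Raising `min_k` trades coverage for reliability: estimates built on only one or two neighbours are replaced by the global mean.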

In [None]:
np.random.seed(42)
random.seed(42)
# The base KNN algorithm from surprise package is loaded
algo_knn = KNNBasic()
# The algorithm is run on the pre-processed data with 5-fold CV and the evaluation metrics are printed (RMSE, MAE)
cross_validate(algo_knn, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

In [None]:
np.random.seed(42)
random.seed(42)
#Setting up the parameter grid to explore the best parameters
#For the purpose of this code, only the optimal parameters are supplied here
#(named param_grid to avoid shadowing the built-in dict)
param_grid = {'k': [20], 'min_k' : [3]}

#Loading up the GridSearchCV algorithm, then running it on the pre-processed dataset
grid_knn_pre = GridSearchCV(KNNBasic, param_grid, measures=['rmse', 'mae'], cv=5)
grid_knn_pre.fit(data)

#Printing the result
print("Best RMSE score for pre-processed data:", round(grid_knn_pre.best_score['rmse'],4))
print("Best parameters for RMSE:", grid_knn_pre.best_params['rmse'])

print("Best MAE score for pre-processed data:", round(grid_knn_pre.best_score['mae'],4))
print("Best parameters for MAE:", grid_knn_pre.best_params['mae'])

**Evaluation of the KNN model on the Pre-processed Data**

|Model| Hyperparameters | RMSE | MAE |
| --- | --- | --- | --- |
|Base KNN | k = 40, min_k = 1 | 1.0181 | 0.7270 |
|Finetuned KNN | k = 20, min_k = 3 | 0.9576 | 0.7103 |
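
For readers unfamiliar with the two metrics in the table, here is a small worked example with made-up ratings (the numbers are illustrative and unrelated to the results above). RMSE squares each error before averaging, so it penalizes large misses more heavily than MAE does.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    absolute = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return sum(absolute) / len(absolute)


actual = [5, 3, 4, 1]      # hypothetical true ratings
predicted = [4, 3, 5, 3]   # hypothetical predicted ratings

# Errors are 1, 0, -1, -2: the single miss of 2 dominates the RMSE.
print("RMSE:", round(rmse(actual, predicted), 4))
print("MAE: ", mae(actual, predicted))
```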

Next the model is applied to the processed dataset

In [None]:
np.random.seed(42)
random.seed(42)
# The algorithm is run on the processed data with 5 CV and the evaluation metric is printed (RMSE, MAE)
cross_validate(algo_knn, data_p, measures=['RMSE', 'MAE'], cv=5, verbose=True)

In [None]:
np.random.seed(42)
random.seed(42)
#Setting up the parameter grid to explore the best parameters
#For the purpose of this code, only the optimal parameters are supplied here
#(named param_grid to avoid shadowing the built-in dict)
param_grid = {'k': [20], 'min_k' : [3]}

#Loading up the GridSearchCV algorithm, then running it on the processed dataset
grid_knn_post = GridSearchCV(KNNBasic, param_grid, measures=['rmse', 'mae'], cv=5)
grid_knn_post.fit(data_p)

#Printing the result
print("Best RMSE score for processed data:", round(grid_knn_post.best_score['rmse'],4))
print("Best parameters for RMSE:", grid_knn_post.best_params['rmse'])

print("Best MAE score for processed data:", round(grid_knn_post.best_score['mae'],4))
print("Best parameters for MAE:", grid_knn_post.best_params['mae'])

**Evaluation of the KNN model on the Processed Data**

|Model| HyperParameter| RMSE | MAE |
| --- | --- | --- | --- |
|Base KNN | k = 40, min_k = 1 | 0.9616 | 0.6713 |
|Finetuned KNN | k = 20, min_k = 3 | 0.9057 | 0.6517 |

In [None]:
#Building the full trainsets so that they can be passed to the algorithm
trainset_pre = data.build_full_trainset()
trainset_post = data_p.build_full_trainset()

#Fitting the preprocessed and processed train dataset into the KNN algorithm
#The hyperparameter is set to the finetuned KNN
pred_knn = KNNBasic(k = 20, min_k = 3 ).fit(trainset_pre)
pred_knn_post = KNNBasic(k = 20, min_k = 3 ).fit(trainset_post)

#Copying the test dataset
df_test_knn = df_test.copy()
df_test_p_knn = df_test_p.copy()

#Applying the model to the preprocessed and processed test dataset
df_test_knn['rating'] = df_test_knn.apply(lambda x : pred_knn.predict(x.user_id, x.product_id).est, axis = 1)
df_test_p_knn['rating'] = df_test_p_knn.apply(lambda x : pred_knn_post.predict(x.user_id, x.product_id).est, axis = 1)

In [17]:
#Dropping the irrelevant column and writing to csv file for preprocessed dataset
df_test_knn = df_test_knn.drop(columns=['user_id', 'product_id', 'product_name'])
df_test_knn.to_csv('KNN_predictions_pre.csv', index=False)

#Dropping the irrelevant column and writing to csv file for processed dataset
df_test_p_knn = df_test_p_knn.drop(columns=['user_id', 'product_id', 'product_name'])
df_test_p_knn.to_csv('KNN_predictions_post1.csv', index=False)

**Kaggle Score**

|Model|Processed | HyperParameter| Score |
| --- | --- | --- | --- | 
|Base KNN | No | k = 40, min_k = 1 | 1.03142 |
|Finetuned KNN | No | k = 20, min_k = 3 | 0.96901 |
|Base KNN | Yes | k = 40, min_k = 1 |  0.98305 |
|Finetuned KNN | Yes | k = 20, min_k = 3 | 0.92402 |


## SVD

In this section, the Singular Value Decomposition (SVD) algorithm is applied and then tuned to predict the rating. First, the pre-processed data (the data before the product ID fix) is used.

**Note that `cross_validate` is called with an integer `cv`, so the folds are not explicitly seeded; the results may differ slightly between runs.**
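
For reference, the model behind Surprise's `SVD` class, as described in its documentation, predicts a rating from a global mean, two bias terms and a dot product of latent factor vectors, and is fit by minimizing a regularized squared error. The symbols below follow that documentation and are not defined elsewhere in this notebook.

```latex
\hat{r}_{ui} = \mu + b_u + b_i + q_i^{\top} p_u

\min_{b,\, p,\, q} \sum_{r_{ui} \in R_{\text{train}}}
  \left( r_{ui} - \hat{r}_{ui} \right)^2
  + \lambda \left( b_i^2 + b_u^2 + \lVert q_i \rVert^2 + \lVert p_u \rVert^2 \right)
```

In terms of the hyperparameters tuned in this section, `n_factors` is the length of the vectors $p_u$ and $q_i$, `reg_all` plays the role of $\lambda$, `lr_all` is the step size of the stochastic gradient descent, and `n_epochs` is the number of passes over the training ratings.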

In [None]:
np.random.seed(42)
random.seed(42)
# Loading the base SVD model
algo_svd = SVD(random_state=42)
# Running the base SVD model with cross validation to get the RMSE and MAE
cross_validate(algo_svd, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

In [None]:
np.random.seed(42)
random.seed(42)
#Setting up the parameter grid to explore the best parameters
#For the purpose of this code, only the optimal parameters are supplied here
#(named param_grid to avoid shadowing the built-in dict)
param_grid = {'n_factors': [200], 'lr_all' : [0.1], 'reg_all' : [0.022], 'n_epochs' : [150], 'random_state' : [42]}

#Loading up the GridSearchCV algorithm, then running it on the pre-processed dataset
grid_svd_pre = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=5)
grid_svd_pre.fit(data)

#Printing the result
print("Best RMSE score for pre-processed data:", round(grid_svd_pre.best_score['rmse'],4))
print("Best parameters for RMSE:", grid_svd_pre.best_params['rmse'])

print("Best MAE score for pre-processed data:", round(grid_svd_pre.best_score['mae'],4))
print("Best parameters for MAE:", grid_svd_pre.best_params['mae'])

**Evaluation of the SVD model on the Pre-processed Data**

|Model| HyperParameter| RMSE | MAE |
| --- | --- | --- | --- |
|Base SVD | n_factors=100, n_epochs=20, lr_all=0.005, reg_all=0.02 | 0.9381 | 0.6936 |
|Finetuned SVD | n_factors=200, n_epochs=150, lr_all=0.1, reg_all=0.022 | 0.8668 | 0.6285 |

Next the processed dataset will be used

In [None]:
np.random.seed(42)
random.seed(42)
# Loading the base SVD model
algo_svd = SVD(random_state=42)
# Running the base SVD model with cross validation to get the RMSE and MAE
cross_validate(algo_svd, data_p, measures=['RMSE', 'MAE'], cv=5, verbose=True)

In [None]:
np.random.seed(42)
random.seed(42)
#Setting up the parameter grid to explore the best parameters
#For the purpose of this code, only the optimal parameters are supplied here
#(named param_grid to avoid shadowing the built-in dict)
param_grid = {'n_factors': [200], 'lr_all' : [0.1], 'reg_all' : [0.022], 'n_epochs' : [150], 'random_state' : [42]}

#Loading up the GridSearchCV algorithm, then running it on the processed dataset
grid_svd_post = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=5)
grid_svd_post.fit(data_p)

#Printing the result
print("Best RMSE score for processed data:", round(grid_svd_post.best_score['rmse'],4))
print("Best parameters for RMSE:", grid_svd_post.best_params['rmse'])

print("Best MAE score for processed data:", round(grid_svd_post.best_score['mae'],4))
print("Best parameters for MAE:", grid_svd_post.best_params['mae'])

**Evaluation of the SVD model on the Processed Data**

|Model| HyperParameter| RMSE | MAE |
| --- | --- | --- | --- |
|Base SVD | n_factors=100, n_epochs=20, lr_all=0.005, reg_all=0.02 | 0.8679 | 0.6276 |
|Finetuned SVD | n_factors=200, n_epochs=150, lr_all=0.1, reg_all=0.022 | 0.7964 | 0.5289 |

In [22]:
#Fitting the preprocessed and processed train dataset into the SVD algorithm
#The hyperparameter is set to the finetuned SVD
pred_svd = SVD(n_factors=200, lr_all=0.1, reg_all=0.022, n_epochs=150, random_state=42).fit(trainset_pre)
pred_svd_post = SVD(n_factors=200, lr_all=0.1, reg_all=0.022, n_epochs=150, random_state=42).fit(trainset_post)

#Copying the test dataset
df_test_svd = df_test.copy()
df_test_p_svd = df_test_p.copy()

#Applying the model to the preprocessed and processed test dataset
df_test_svd['rating'] = df_test_svd.apply(lambda x : pred_svd.predict(x.user_id, x.product_id).est, axis = 1)
df_test_p_svd['rating'] = df_test_p_svd.apply(lambda x : pred_svd_post.predict(x.user_id, x.product_id).est, axis = 1)

In [23]:
#Dropping the irrelevant column and writing to csv file for preprocessed dataset
df_test_svd = df_test_svd.drop(columns=['user_id', 'product_id', 'product_name'])
df_test_svd.to_csv('SVD_predictions_pre1.csv', index=False)

#Dropping the irrelevant column and writing to csv file for processed dataset
df_test_p_svd = df_test_p_svd.drop(columns=['user_id', 'product_id', 'product_name'])
df_test_p_svd.to_csv('SVD_predictions_post1.csv', index=False)

**Kaggle Score**

|Model|Processed | HyperParameter| Score |
| --- | --- | --- | --- | 
|Base SVD | No |  n_factors=100, n_epochs=20, lr_all=0.005, reg_all=0.02 | 0.96019 |
|Finetuned SVD | No | n_factors=200, n_epochs=150, lr_all=0.1, reg_all=0.022 | 0.89576 |
|Base SVD | Yes |  n_factors=100, n_epochs=20, lr_all=0.005, reg_all=0.02 |  0.89279 |
|Finetuned SVD | Yes | n_factors=200, n_epochs=150, lr_all=0.1, reg_all=0.022 | 0.83558 |


From the results, it's clear that SVD is the better model. The next step is to fine-tune the individual learning rates that `lr_all` sets jointly:

- lr_bu
- lr_bi
- lr_pu
- lr_qi

and the individual regularization terms that `reg_all` sets jointly:

- reg_bu
- reg_bi
- reg_pu
- reg_qi

The optimal values have already been found, so they are supplied directly in the code. The remaining parameters are kept at the fine-tuned values from above.
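
To show where each of these eight parameters acts, here is a minimal sketch of one stochastic gradient descent update for a single observed rating, following the update rules given in the Surprise documentation for its SVD model. The function `sgd_step`, the dictionary keys and the starting values are assumptions for illustration; the learning rates are the tuned ones from this notebook and the regularization is `reg_all = 0.022` applied to every term.

```python
import numpy as np


def sgd_step(r_ui, mu, b_u, b_i, p_u, q_i, lr, reg):
    """One SGD update of the biases and factor vectors for one rating r_ui."""
    # Prediction error under the current parameters.
    err = r_ui - (mu + b_u + b_i + q_i @ p_u)

    # Each parameter group has its own learning rate and regularization term.
    new_b_u = b_u + lr["bu"] * (err - reg["bu"] * b_u)
    new_b_i = b_i + lr["bi"] * (err - reg["bi"] * b_i)
    new_p_u = p_u + lr["pu"] * (err * q_i - reg["pu"] * p_u)
    new_q_i = q_i + lr["qi"] * (err * p_u - reg["qi"] * q_i)
    return new_b_u, new_b_i, new_p_u, new_q_i


lr = {"bu": 0.091, "bi": 0.085, "pu": 0.085, "qi": 0.106}
reg = {"bu": 0.022, "bi": 0.022, "pu": 0.022, "qi": 0.022}

# Hypothetical starting point: global mean 3.5, zero biases, two latent factors.
mu, b_u, b_i = 3.5, 0.0, 0.0
p_u = np.array([0.1, 0.1])
q_i = np.array([0.1, 0.1])

before = mu + b_u + b_i + q_i @ p_u
b_u, b_i, p_u, q_i = sgd_step(5.0, mu, b_u, b_i, p_u, q_i, lr, reg)
after = mu + b_u + b_i + q_i @ p_u
print("prediction before:", before, "after:", after)
```

After the step, the prediction moves toward the observed rating of 5; a larger learning rate for one group makes that group absorb more of the error.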

In [None]:
np.random.seed(42)
random.seed(42)
#Setting up the parameter grid to explore the best parameters
#For the purpose of this code, only the optimal parameters are supplied here
#(named param_grid to avoid shadowing the built-in dict)
param_grid = {'n_factors': [200], 'lr_bu' : [0.091], 'lr_bi' : [0.085], 'lr_pu' : [0.085], 
              'lr_qi' : [0.106], 'reg_all' : [0.022], 'n_epochs' : [150], 'random_state' : [42]}

#Loading up the GridSearchCV algorithm, then running it on the processed dataset
grid_svd_lr = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=5)
grid_svd_lr.fit(data_p)

#Printing the result
print("Best RMSE score for finetuning lr data:", round(grid_svd_lr.best_score['rmse'],4))
print("Best parameters for RMSE:", grid_svd_lr.best_params['rmse'])

print("Best MAE score for finetuning lr data:", round(grid_svd_lr.best_score['mae'],4))
print("Best parameters for MAE:", grid_svd_lr.best_params['mae'])

In [None]:
np.random.seed(42)
random.seed(42)
#Setting up the parameter grid to explore the best parameters
#For the purpose of this code, only the optimal parameters are supplied here
#(named param_grid to avoid shadowing the built-in dict)
param_grid = {'n_factors': [200], 'reg_bu' : [0.021], 'reg_bi' : [0.02], 'reg_pu' : [0.022], 
              'reg_qi' : [0.022], 'lr_all' : [0.1], 'n_epochs' : [150], 'random_state' : [42]}

#Loading up the GridSearchCV algorithm, then running it on the processed dataset
grid_svd_reg = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=5)
grid_svd_reg.fit(data_p)

#Printing the result
print("Best RMSE score for finetuning reg data:", round(grid_svd_reg.best_score['rmse'],4))
print("Best parameters for RMSE:", grid_svd_reg.best_params['rmse'])

print("Best MAE score for finetuning reg data:", round(grid_svd_reg.best_score['mae'],4))
print("Best parameters for MAE:", grid_svd_reg.best_params['mae'])

In [None]:
np.random.seed(42)
random.seed(42)
# Loading the fine-tuned SVD model to be used for the item-based collaborative filtering
algo_svd_rev = SVD(n_factors=200, lr_bu=0.091, lr_bi=0.085, lr_pu=0.085, lr_qi=0.106, reg_all=0.022, n_epochs=200, random_state=42)
# Running the fine-tuned SVD model on the reversed (product, user) data with cross validation to get the RMSE and MAE
cross_validate(algo_svd_rev, data_rev_p, measures=['RMSE', 'MAE'], cv=5, verbose=True)

**SVD, further Learning Rates and Regularization finetuning**

|Model| HyperParameter| RMSE | MAE |
| --- | --- | --- | --- | 
|Finetuning LR |  lr_bu=0.091, lr_bi=0.085, lr_pu=0.085, lr_qi=0.106 | 0.7965 | 0.5308 |
|Finetuning REG |  reg_bu=0.021, reg_bi=0.02, reg_pu=0.022, reg_qi=0.022 | 0.7985 | 0.5251 |
|With Item-Based | n_factors=200, n_epochs=150, lr_all=0.1, reg_all=0.022 | 0.8070 | 0.5509 |

We see here that further fine-tuning yields diminishing returns. Next, the user-based and item-based ratings are averaged.
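
The averaging step can be sketched on toy predictions. The `ID` column mirrors the one used in the submission files; the rating values are made up for illustration. Merging two frames that share a `rating` column makes pandas suffix them as `rating_x` and `rating_y`, which are then averaged row by row.

```python
import pandas as pd

# Hypothetical predictions from the user-based and item-based models.
user_pred = pd.DataFrame({"ID": [0, 1, 2], "rating": [4.2, 3.0, 1.8]})
item_pred = pd.DataFrame({"ID": [0, 1, 2], "rating": [3.8, 3.4, 2.2]})

# Align the two prediction sets on ID, then average the two rating columns.
blended = pd.merge(user_pred, item_pred, on="ID")
blended["rating"] = blended[["rating_x", "rating_y"]].mean(axis=1)
blended = blended[["ID", "rating"]]
print(blended)
```

Merging on `ID` rather than relying on row order keeps the average correct even if the two frames are sorted differently.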

In [12]:
#Fitting the processed train dataset, in user-based and item-based (reversed) order, into the SVD algorithm
#The hyperparameters are set to the finetuned SVD values
trainset_lr = data_p.build_full_trainset()
trainset_lr_rev = data_rev_p.build_full_trainset()
pred_lr = SVD(n_factors=200, lr_bu=0.091, lr_bi=0.085, lr_pu=0.085, lr_qi=0.106, n_epochs=150, random_state=42).fit(trainset_lr)
pred_lr_rev = SVD(n_factors=200, lr_all=0.1, reg_all=0.022, n_epochs=150, random_state=42).fit(trainset_lr_rev)

#Copying the test dataset
df_test_lr = df_test_p.copy()
df_test_rev_lr = df_test_p.copy()

#Applying the user-based and item-based models to the processed test dataset
df_test_lr['rating'] = df_test_lr.apply(lambda x : pred_lr.predict(x.user_id, x.product_id).est, axis = 1)
df_test_rev_lr['rating'] = df_test_rev_lr.apply(lambda x : pred_lr_rev.predict(x.product_id, x.user_id).est, axis = 1)

In [None]:
#Dropping the irrelevant column for merging the dataset
df_test_lr = df_test_lr.drop(columns=['user_id', 'product_id', 'product_name'])
df_test_rev_lr = df_test_rev_lr.drop(columns=['user_id', 'product_id', 'product_name'])

#Merge the dataset then average the rating, then see the result
df_test_comb = pd.merge(df_test_lr, df_test_rev_lr, on='ID')
df_test_comb['rating'] = df_test_comb[['rating_x','rating_y']].mean(axis=1)
df_test_comb.head()

In [17]:
#Drop the irrelevant column then save the prediction into a CSV
df_test_comb = df_test_comb.drop(columns=['rating_x','rating_y'])
df_test_comb.to_csv('combined_mean_prediction.csv', index=False)

The combination of user-based and item-based ratings achieved a score of 0.82644 on Kaggle, which is the best of all the scores listed above.