# **Recommender Systems**

# **Task 1: Data Preprocessing**

In [1]:
!pip install scikit-surprise

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting scikit-surprise
  Downloading scikit-surprise-1.1.3.tar.gz (771 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.3-cp39-cp39-linux_x86_64.whl size=3193647 sha256=5259ff1c4aada0288e61b0051d7295ee9b0df4ef73cac1a705cf0ed0ade74a94
  Stored in directory: /root/.cache/pip/wheels/c6/3a/46/9b17b3512bdf283c6cb84f59929cdd5199d4e754d596d22784
Successfully built scikit-surprise
Installing collected packages: scikit-surprise
Successfully installed scikit-surprise-1.1.3


In [2]:
# import all relevant libraries
import pandas as pd
import numpy as np
from surprise import BaselineOnly, Dataset, Reader, SVD
from surprise import accuracy 
import surprise
from surprise.model_selection import train_test_split

# Dataset 

-  'ratings.csv' file

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
# Load the data and check column details in table using .head()
df = pd.read_csv('/content/drive/MyDrive/ratings.csv')
df.head()

|   | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 964982703 |
| 1 | 1 | 3 | 4.0 | 964981247 |
| 2 | 1 | 6 | 4.0 | 964982224 |
| 3 | 1 | 47 | 5.0 | 964983815 |
| 4 | 1 | 50 | 5.0 | 964982931 |


### **Convert the data into the utility matrix**

In [5]:
# path to dataset file
file_path = '/content/drive/MyDrive/ratings.csv'

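# Columns: 'userId', 'movieId', 'rating', 'timestamp', separated by commas; skip_lines=1 skips the header row.
# Note: Reader's default rating_scale is (1, 5); pass rating_scale explicitly if the ratings fall outside it.
reader = Reader(line_format="user item rating timestamp", sep=",", skip_lines=1)

# Load the ratings into a Surprise Dataset
data = Dataset.load_from_file(file_path, reader=reader)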
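
For intuition, the utility matrix is a users × items table of ratings in which most cells are empty. Below is a minimal sketch in plain Python using the five rows shown by `df.head()` above. The helper name `build_utility_matrix` is hypothetical, and Surprise keeps its own internal representation rather than a nested dict like this one.

```python
from collections import defaultdict

# (userId, movieId, rating) triples taken from the df.head() output above
rows = [
    (1, 1, 4.0),
    (1, 3, 4.0),
    (1, 6, 4.0),
    (1, 47, 5.0),
    (1, 50, 5.0),
]


def build_utility_matrix(triples):
    """Map each user to a {item: rating} dict; unrated items are simply absent."""
    matrix = defaultdict(dict)
    for user, item, rating in triples:
        matrix[user][item] = rating
    return dict(matrix)


utility = build_utility_matrix(rows)
print(utility[1][47])     # 5.0: the rating user 1 gave movie 47
print(utility[1].get(2))  # None: user 1 has not rated movie 2
```

Storing only the observed ratings is what makes this representation practical when the full matrix is very sparse.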

In [6]:
print(data)

<surprise.dataset.DatasetAutoFolds object at 0x7f02008cb610>


### **Split the dataset into training and testing sets with a ratio of 80:20**

## Train Test Split (Train:Test = 80:20)
`surprise.model_selection.split.train_test_split(data, test_size=0.2, train_size=None, random_state=None, shuffle=True)`

Split a dataset into trainset and testset.

In [7]:
trainset, testset = train_test_split(data, test_size=0.20, random_state=0)
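
Conceptually, an 80:20 split shuffles the ratings and holds out the last fifth for testing. The sketch below illustrates that idea in plain Python. It is not Surprise's implementation, and `shuffle_split` and the toy data are hypothetical.

```python
import random


def shuffle_split(ratings, test_size=0.2, seed=0):
    """Shuffle the ratings and split off the last `test_size` fraction as the test set."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    n_train = len(shuffled) - n_test
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical toy data: ten (user, item, rating) triples
toy = [(u, i, 3.0) for u in range(2) for i in range(5)]
train, test = shuffle_split(toy, test_size=0.2, seed=0)
print(len(train), len(test))  # 8 2
```

Fixing the seed (as `random_state=0` does in the cell above) makes the split repeatable, so RMSE values from different models are computed on the same held-out ratings.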

## Trainset class 
A Trainset contains all the useful data that constitutes a training set.
Reference: https://surprise.readthedocs.io/en/stable/trainset.html

In [8]:
# Number of distinct users in the training set
print("Number of users", trainset.n_users)

Number of users 610


In [9]:
# Number of distinct items in the training set
print("Number of items:", trainset.n_items)

Number of items: 8979


# **Collaborative Filtering Algorithm**

## Train the model

## KNN With Means 

In [10]:
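from surprise import KNNWithMeans
# user_based=True selects user-user similarity.
# Note: "shrinkage" only affects the pearson_baseline similarity; plain pearson ignores it.
sim_options = {"name": "pearson", "user_based": True, "shrinkage": 100}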
algo1 = KNNWithMeans(sim_options=sim_options)
algo1.fit(trainset)
predictions1 = algo1.test(testset)


Computing the pearson similarity matrix...
Done computing similarity matrix.
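
To make the idea behind user-based KNN with means concrete, here is a simplified sketch: the estimate is the target user's mean rating plus a similarity-weighted average of how far each neighbor's rating of the item sits from that neighbor's own mean. The function names and toy ratings are hypothetical, and Surprise's actual implementation differs in details (for example, it precomputes the full similarity matrix).

```python
import math


def pearson(ratings_u, ratings_v):
    """Pearson correlation over the items both users rated (0.0 if undefined)."""
    common = sorted(set(ratings_u) & set(ratings_v))
    if len(common) < 2:
        return 0.0
    mean_u = sum(ratings_u[i] for i in common) / len(common)
    mean_v = sum(ratings_v[i] for i in common) / len(common)
    cov = sum((ratings_u[i] - mean_u) * (ratings_v[i] - mean_v) for i in common)
    var_u = sum((ratings_u[i] - mean_u) ** 2 for i in common)
    var_v = sum((ratings_v[i] - mean_v) ** 2 for i in common)
    if var_u == 0 or var_v == 0:
        return 0.0
    return cov / math.sqrt(var_u * var_v)


def predict_with_means(user, item, ratings, k=40):
    """User-based, mean-centered KNN estimate of the rating `user` would give `item`."""
    mean = {u: sum(r.values()) / len(r) for u, r in ratings.items()}
    # Candidate neighbors: other users who rated the item
    candidates = [
        (pearson(ratings[user], ratings[v]), v)
        for v in ratings
        if v != user and item in ratings[v]
    ]
    # Keep the k most similar neighbors with positive similarity
    neighbors = sorted((s, v) for s, v in candidates if s > 0)[-k:]
    if not neighbors:
        return mean[user]  # fall back to the user's own mean
    num = sum(s * (ratings[v][item] - mean[v]) for s, v in neighbors)
    den = sum(s for s, _ in neighbors)
    return mean[user] + num / den


# Hypothetical toy ratings
ratings = {
    "alice": {"A": 5.0, "B": 3.0, "C": 4.0},
    "bob": {"A": 4.0, "B": 2.0, "C": 3.0, "D": 4.0},
    "carol": {"A": 1.0, "B": 5.0, "D": 2.0},
}
print(predict_with_means("alice", "D", ratings))  # 4.75
```

Here bob correlates perfectly with alice and rated item D 0.75 above his own mean, so alice's estimate is her mean (4.0) plus 0.75. Carol is negatively correlated and is left out.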


## KNN with ZScore 

In [11]:
from surprise import KNNWithZScore
# user_based=False selects item-item similarity; "shrinkage" applies to pearson_baseline.
sim_options = {"name": "pearson_baseline", "user_based": False, "shrinkage": 100}
algo2 = KNNWithZScore(sim_options=sim_options)
algo2.fit(trainset)
predictions2 = algo2.test(testset)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
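
For reference, the item-based z-score prediction can be written as below. The notation is introduced here rather than taken from the notebook: $\mu_i$ and $\sigma_i$ are the mean and standard deviation of the ratings of item $i$, and $N_u^k(i)$ is the set of the $k$ items most similar to $i$ that user $u$ has rated.

```latex
\hat{r}_{ui} = \mu_i + \sigma_i \,
  \frac{\sum_{j \in N_u^k(i)} \operatorname{sim}(i, j)\,\dfrac{r_{uj} - \mu_j}{\sigma_j}}
       {\sum_{j \in N_u^k(i)} \operatorname{sim}(i, j)}
```

Compared with KNN with means, each neighbor's deviation from its mean is additionally divided by that neighbor's standard deviation, and the weighted result is rescaled by $\sigma_i$.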


### **Calculate the RMSE (root mean squared error) for both algorithms using the testing set**

## Evaluate the model 

In [12]:
RMSE1 = accuracy.rmse(predictions1)
RMSE2 = accuracy.rmse(predictions2)

RMSE: 0.8888
RMSE: 0.8798


In [13]:
print("User-based RMSE: ", round(RMSE1,4))
print("Item-based RMSE: ", round(RMSE2,4))

User-based RMSE:  0.8888
Item-based RMSE:  0.8798
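
As a reminder of what the metric measures, RMSE is the square root of the mean squared difference between actual and predicted ratings. A minimal sketch follows; the `rmse` helper and the toy pairs are hypothetical and independent of `surprise.accuracy`.

```python
import math


def rmse(pairs):
    """Root mean squared error over (actual, estimated) rating pairs."""
    return math.sqrt(sum((actual - est) ** 2 for actual, est in pairs) / len(pairs))


# Hypothetical toy pairs: (actual rating, predicted rating)
pairs = [(4.0, 3.0), (2.0, 3.0), (5.0, 5.0), (3.0, 1.0)]
print(rmse(pairs))  # squared errors 1, 1, 0, 4 -> sqrt(6 / 4)
```

Because errors are squared before averaging, a single prediction that is off by two stars contributes as much as four predictions that are each off by one.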


In [14]:
# First five predictions from the user-based collaborative filtering model (KNN with Means)
predictions1[0:5]

[Prediction(uid='548', iid='1196', r_ui=3.5, est=4.4677715198217145, details={'actual_k': 40, 'was_impossible': False}),
 Prediction(uid='590', iid='1252', r_ui=3.0, est=3.954750769092146, details={'actual_k': 40, 'was_impossible': False}),
 Prediction(uid='8', iid='32', r_ui=3.0, est=4.057075161246585, details={'actual_k': 40, 'was_impossible': False}),
 Prediction(uid='217', iid='2993', r_ui=3.0, est=3.1264745449375506, details={'actual_k': 14, 'was_impossible': False}),
 Prediction(uid='51', iid='2613', r_ui=5.0, est=3.2044931148625877, details={'actual_k': 7, 'was_impossible': False})]

In [15]:
# First five predictions from the item-based collaborative filtering model (KNN with ZScore)
predictions2[0:5]

[Prediction(uid='548', iid='1196', r_ui=3.5, est=4.17127245021777, details={'actual_k': 5, 'was_impossible': False}),
 Prediction(uid='590', iid='1252', r_ui=3.0, est=4.250017833238, details={'actual_k': 40, 'was_impossible': False}),
 Prediction(uid='8', iid='32', r_ui=3.0, est=4.148155591835363, details={'actual_k': 15, 'was_impossible': False}),
 Prediction(uid='217', iid='2993', r_ui=3.0, est=3.2804308903720845, details={'actual_k': 40, 'was_impossible': False}),
 Prediction(uid='51', iid='2613', r_ui=5.0, est=3.296068220979953, details={'actual_k': 40, 'was_impossible': False})]

### **My findings and recommendations for improving the recommender system**

> Comparing the two models, the results differ only slightly, but **item-based KNN with ZScore has the lower RMSE at 0.8798**, compared with 0.8888 for user-based KNN with Means. One plausible reason is that users tend to have several distinct tastes while items are simpler to characterize. Note that the two runs differ in algorithm, similarity measure (`pearson_baseline` vs. `pearson`), and user/item orientation at the same time, so the gap cannot be attributed to a single factor.
>
> To improve the recommender system, **hyperparameter tuning, more advanced algorithms, more user data, and better data quality** can all help:
>
> - **Hyperparameter tuning:** try different similarity metrics, vary the number of neighbors used for predictions, adjust the shrinkage parameter, and use cross-validation with grid or random search to find the best settings.
> - **Advanced algorithms:** a hybrid method or a content-based recommender can raise accuracy while easing the cold-start problem (first raters, new users, and new items) that collaborative filtering alone cannot handle.
> - **More user data:** collect as many ratings as possible so that fewer users and items are left without matching neighbors.
> - **Better data quality:** increase variety by incentivizing users to rate or give feedback on items they have used or interacted with, and by drawing on multiple data sources.
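
The grid search with cross-validation mentioned above can be sketched independently of any library. The example below is a toy under stated assumptions: the model (`fit_damped_item_means`, an item mean shrunk toward the global mean), its `reg` parameter, and the random toy ratings are all hypothetical stand-ins chosen to keep the code self-contained. Surprise ships its own `GridSearchCV` and `cross_validate` in `surprise.model_selection`, which would be the natural choice for the KNN models in this notebook.

```python
import itertools
import math
import random


def k_folds(data, n_splits, seed=0):
    """Yield (train, test) pairs for k-fold cross-validation."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::n_splits] for i in range(n_splits)]
    for i in range(n_splits):
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield train, folds[i]


def fit_damped_item_means(train, reg):
    """Toy model: item mean shrunk toward the global mean by `reg` pseudo-ratings."""
    global_mean = sum(r for _, _, r in train) / len(train)
    sums, counts = {}, {}
    for _, item, r in train:
        sums[item] = sums.get(item, 0.0) + (r - global_mean)
        counts[item] = counts.get(item, 0) + 1

    def predict(user, item):
        if item not in counts:
            return global_mean  # unseen item: fall back to the global mean
        return global_mean + sums[item] / (counts[item] + reg)

    return predict


def rmse(predict, test):
    """RMSE of a predict(user, item) function on (user, item, rating) triples."""
    return math.sqrt(sum((predict(u, i) - r) ** 2 for u, i, r in test) / len(test))


def grid_search(data, param_grid, n_splits=3):
    """Return (best_params, best_mean_rmse) over every combination in param_grid."""
    names = sorted(param_grid)
    best = None
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        scores = [
            rmse(fit_damped_item_means(train, **params), test)
            for train, test in k_folds(data, n_splits)
        ]
        mean_score = sum(scores) / len(scores)
        if best is None or mean_score < best[1]:
            best = (params, mean_score)
    return best


# Hypothetical toy ratings: (user, item, rating)
rng = random.Random(1)
toy = [(u, i, float(rng.randint(1, 5))) for u in range(20) for i in range(6)]
best_params, best_rmse = grid_search(toy, {"reg": [0, 5, 25]})
print(best_params, round(best_rmse, 4))
```

Averaging the score over several folds, rather than tuning against the single 80:20 test split, keeps the test set untouched for the final comparison.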







# **Notes on Improving the Recommender System**

### **Implement a hybrid recommender system that combines user-based and item-based collaborative filtering algorithms.**

In [16]:
# Create the matrix factorization model
mf = SVD()

# Refit the user-based (algo1) and item-based (algo2) models defined above on the training set
algo1.fit(trainset)
algo2.fit(trainset)

# Train the matrix factorization model on the training set
mf.fit(trainset)

# Define a function to combine the predictions of the user-based, item-based,
# and matrix factorization models
def hybrid(user_id, item_id):
    # Get the predicted rating from the user-based model
    user_pred = algo1.predict(user_id, item_id).est

    # Get the predicted rating from the item-based model
    item_pred = algo2.predict(user_id, item_id).est

    # Get the predicted rating from the matrix factorization model
    mf_pred = mf.predict(user_id, item_id).est

    # Combine the three predictions with a simple (equally weighted) average
    hybrid_pred = (user_pred + item_pred + mf_pred) / 3
    return hybrid_pred


Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
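
The `hybrid` function above gives each model the same weight. If some models are more reliable than others, the average can be weighted explicitly. The sketch below shows one way to do that; `weighted_blend`, the example estimates, and the weights are hypothetical and were not used to produce the results in this notebook.

```python
def weighted_blend(predictions, weights):
    """Weighted average of per-model rating estimates."""
    if len(predictions) != len(weights):
        raise ValueError("predictions and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(p * w for p, w in zip(predictions, weights)) / total


# Hypothetical estimates from the user-based, item-based, and SVD models
estimates = [4.0, 3.5, 4.5]
print(weighted_blend(estimates, [1, 1, 1]))  # equal weights, i.e. the simple mean: 4.0
print(weighted_blend(estimates, [1, 1, 2]))  # SVD counted twice: 4.125
```

Any such weights should be chosen on a validation split or by cross-validation rather than on the test set, so that the reported RMSE stays an honest estimate.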


In [17]:
# Calculate the RMSE of the hybrid model on the testing set
hybrid_predictions = []
actual_ratings = []
for user_id, item_id, actual_rating in testset:
    hybrid_predictions.append(hybrid(user_id, item_id))
    actual_ratings.append(actual_rating)
hybrid_rmse = np.sqrt(np.mean((np.array(hybrid_predictions) - np.array(actual_ratings)) ** 2))

# Calculate the RMSE of the user-based model on the testing set
user_predictions = algo1.test(testset)
user_rmse = surprise.accuracy.rmse(user_predictions)

# Calculate the RMSE of the item-based model on the testing set
item_predictions = algo2.test(testset)
item_rmse = surprise.accuracy.rmse(item_predictions)

RMSE: 0.8888
RMSE: 0.8798


### **Compare the performance of the hybrid system with the individual user-based and item-based algorithms.**

In [18]:
# Print the RMSEs of all three models
print("User-based RMSE: ", round(user_rmse,4))
print("Item-based RMSE: ", round(item_rmse,4))
print("Hybrid RMSE: ", round(hybrid_rmse,4))

User-based RMSE:  0.8888
Item-based RMSE:  0.8798
Hybrid RMSE:  0.8517


### **Result from my implementation together with my findings.**

> My implementation starts by creating three recommendation models: a user-based collaborative filtering model (KNNWithMeans), an item-based collaborative filtering model (KNNWithZScore), and a matrix factorization model (SVD). All three are trained on the same training set. The `hybrid` function takes a user ID and an item ID and returns the simple (equally weighted) average of the three models' predicted ratings, with the aim of improving accuracy by combining them. Finally, I calculated the RMSE of the user-based, item-based, and hybrid models on the test set to compare their performance.
>
> Of the three results, the hybrid model has the lowest RMSE at 0.8517, compared with 0.8888 for the user-based model and 0.8798 for the item-based model. Combining the user-based and item-based algorithms (together with SVD) in a hybrid model therefore improves accuracy. Because the hybrid also includes SVD, whose standalone RMSE was not measured here, part of the gain may come from SVD itself rather than from the combination.



