# Collaborative Filtering with [Surprise](https://surpriselib.com/) Toolkit

In [1]:
import pandas as pd
import numpy as np
import random
from surprise import Dataset, Reader, accuracy, dump
from surprise import KNNBasic, KNNWithMeans, KNNWithZScore, KNNBaseline
from surprise import SVD, SVDpp, BaselineOnly
from surprise.model_selection import cross_validate, train_test_split, GridSearchCV
import surprise_helper

## Preparation

In [2]:
rates = pd.read_csv("rates.csv")
rates.head(3)

Unnamed: 0,user_id,display_name,course_id,rate
0,14177,Jacynthe,1055720,5.0
1,24654,Norval,1055720,5.0
2,14484,Jany,1055720,4.0


In [3]:
courses = pd.read_csv("courses.csv")
courses.head(3)

Unnamed: 0,id,title,category,course_url
0,9287,Microsoft Excel 2010 Course Beginners/ Interme...,Office Productivity,/course/excel-tutorial/
1,9385,Microsoft Excel 2010: Advanced Training,Office Productivity,/course/advanced-excel/
2,9711,Beginner PHP and MySQL Tutorial,Development,/course/php-mysql-tutorial/


In [4]:
reader = Reader(rating_scale=(0, 5))
data = Dataset.load_from_df(rates[["user_id", "course_id", "rate"]], reader)

## Model Selection

### Basic Collaborative Filtering Algorithms
#### BaselineOnly

$baseline = b_{ui} = \text{average rating of all courses + bias of user u + bias of item i}$

The biases $b_u$ and $b_i$ are learned by minimizing a regularized squared-error objective.
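
As a rough illustration of how such biases can be learned, here is a minimal sketch of a stochastic gradient descent loop on a toy dataset. The function names, the toy ratings and the default hyperparameter values are assumptions made for this example; it is not Surprise's implementation.

```python
def fit_baselines(ratings, n_epochs=20, lr=0.005, reg=0.03):
    """Estimate mu, b_u and b_i from (user, item, rating) tuples with plain SGD."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    b_u = {u: 0.0 for u, _, _ in ratings}
    b_i = {i: 0.0 for _, i, _ in ratings}
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - (mu + b_u[u] + b_i[i])
            # Gradient step on the squared error plus an L2 penalty on each bias.
            b_u[u] += lr * (err - reg * b_u[u])
            b_i[i] += lr * (err - reg * b_i[i])
    return mu, b_u, b_i


def predict_baseline(mu, b_u, b_i, u, i):
    # Unknown users or items fall back to a zero bias.
    return mu + b_u.get(u, 0.0) + b_i.get(i, 0.0)


# Hypothetical toy ratings: ann rates high, bob rates low.
toy = [("ann", "c1", 5.0), ("ann", "c2", 4.0), ("bob", "c1", 3.0), ("bob", "c2", 2.0)]
mu, b_u, b_i = fit_baselines(toy)
print(mu, predict_baseline(mu, b_u, b_i, "ann", "c1"))
```

With these toy ratings the bias of `ann` ends up positive and the bias of `bob` negative, so the baseline prediction for `ann` lies above the global mean.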

In [5]:
_ = cross_validate(BaselineOnly(verbose=False), data, measures=['RMSE', 'MAE'], cv=3, verbose=True)

Evaluating RMSE, MAE of algorithm BaselineOnly on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.6919  0.6939  0.6884  0.6914  0.0023  
MAE (testset)     0.4729  0.4753  0.4727  0.4736  0.0012  
Fit time          0.51    0.58    0.61    0.57    0.04    
Test time         0.28    0.38    0.35    0.34    0.04    
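
For reference, the two accuracy measures in the table above can be written out in a few lines. This is a minimal sketch on made-up (true rating, estimate) pairs; the function names and the sample values are assumptions for illustration only.

```python
import math


def rmse(pairs):
    """Root mean squared error over (true_rating, estimate) pairs."""
    return math.sqrt(sum((r - est) ** 2 for r, est in pairs) / len(pairs))


def mae(pairs):
    """Mean absolute error over (true_rating, estimate) pairs."""
    return sum(abs(r - est) for r, est in pairs) / len(pairs)


# Hypothetical predictions with errors of 1, 0 and 2 rating points.
pairs = [(5.0, 4.0), (3.0, 3.0), (4.0, 2.0)]
print(rmse(pairs), mae(pairs))
```

RMSE squares the errors before averaging, so it penalizes large misses more heavily than MAE and is never smaller than MAE on the same predictions.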


### Memory Based Collaborative Filtering

#### Basic KNN

A simple k-NN method that predicts a rating as the similarity-weighted average of the ratings of the k nearest neighbours.

User based: Example -> users similar to you bought Y

$\large est_{ui} = \frac{\text{sum_nearest_k_v(similarity between user u and v * rate given to item i by user v)}}{\text{sum_nearest_k_v(similarity between user u and v)}}$

Item based: Example ->  users who bought X, also bought Y

$\large est_{ui} = \frac{\text{sum_nearest_k_j(similarity between item i and j * rate given to item j by user u)}}{\text{sum_nearest_k_j(similarity between item i and j)}}$
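
The item-based formula above can be sketched in plain Python as follows. The function name, the dictionary-based inputs and the toy values are assumptions for this example; only neighbours with positive similarity contribute to the average.

```python
import heapq


def knn_basic_item(user_ratings, sims_to_target, k=40):
    """Similarity-weighted average over the k most similar rated items.

    user_ratings:   {item_j: rating given by user u}
    sims_to_target: {item_j: similarity between target item i and item j}
    Returns None when no neighbour has a positive similarity.
    """
    neighbours = [(sims_to_target.get(j, 0.0), r) for j, r in user_ratings.items()]
    top_k = heapq.nlargest(k, neighbours, key=lambda t: t[0])
    num = sum(s * r for s, r in top_k if s > 0)
    den = sum(s for s, _ in top_k if s > 0)
    return num / den if den > 0 else None


# Hypothetical user who rated three courses, and their similarities to the target course.
user_ratings = {"c1": 5.0, "c2": 3.0, "c3": 1.0}
sims = {"c1": 0.9, "c2": 0.1, "c3": 0.0}
est = knn_basic_item(user_ratings, sims, k=2)
print(est)
```

Here the estimate is pulled towards the rating of `c1`, because `c1` is by far the most similar neighbour.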

In [6]:
_ = cross_validate(KNNBasic(sim_options={"name": "msd", "user_based": False}, verbose=False), data, measures=['RMSE', 'MAE'], cv=3, verbose=True)
_ = cross_validate(KNNBasic(sim_options={"name": "cosine", "user_based": False}, verbose=False), data, measures=['RMSE', 'MAE'], cv=3, verbose=True)
_ = cross_validate(KNNBasic(sim_options={"name": "pearson", "user_based": False}, verbose=False), data, measures=['RMSE', 'MAE'], cv=3, verbose=True)
_ = cross_validate(KNNBasic(sim_options={"name": "pearson_baseline", "user_based": False}, verbose=False), data, measures=['RMSE', 'MAE'], cv=3, verbose=True)

Evaluating RMSE, MAE of algorithm KNNBasic on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.7971  0.8081  0.7949  0.8000  0.0058  
MAE (testset)     0.4737  0.4786  0.4713  0.4745  0.0030  
Fit time          0.15    0.20    0.20    0.19    0.02    
Test time         0.70    0.73    0.69    0.71    0.02    
Evaluating RMSE, MAE of algorithm KNNBasic on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.7994  0.7972  0.7928  0.7965  0.0028  
MAE (testset)     0.4748  0.4745  0.4735  0.4743  0.0006  
Fit time          0.26    0.30    0.31    0.29    0.02    
Test time         0.63    0.72    0.70    0.68    0.04    
Evaluating RMSE, MAE of algorithm KNNBasic on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.7519  0.7516  0.7500  0.7512  0.0008  
MAE (testset)     0.4867  0.4874  0.4869  0.4870  0.0003  
Fit time          0.34    0.42    0.39    0.38    0.03  

#### KNN with Means

A k-NN method similar to the previous one, but it takes into account the mean rating of each user (or of each item, in the item-based variant).

User based:

$\large est_{ui} = \text{mean rating of user u} + \frac{\text{sum_nearest_k_v(similarity between user u and v * (rate given to item i by user v - mean rating of user v))}}{\text{sum_nearest_k_v(similarity between user u and v)}}$

Item based:

$\large est_{ui} = \text{mean rating of item i} + \frac{\text{sum_nearest_k_j(similarity between item i and j * (rate given to item j by user u - mean rating of item j))}}{\text{sum_nearest_k_j(similarity between item i and j)}}$
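
A minimal sketch of the item-based, mean-centred prediction is shown below. The function name, the tuple layout and the numbers are assumptions for illustration; the neighbours passed in are taken to be the k nearest ones already.

```python
def knn_with_means_item(target_mean, neighbours):
    """Mean-centred item-based estimate.

    target_mean: mean rating of the target item i
    neighbours:  list of (similarity, rating_by_u, mean_rating_of_item_j)
    Falls back to the item mean when no neighbour has a positive similarity.
    """
    num = sum(s * (r - m) for s, r, m in neighbours if s > 0)
    den = sum(s for s, _, _ in neighbours if s > 0)
    if den == 0:
        return target_mean
    return target_mean + num / den


# Hypothetical neighbours: one rated above its own mean, one rated below.
est = knn_with_means_item(4.0, [(0.5, 5.0, 4.5), (0.5, 3.0, 4.0)])
print(est)
```

Because each neighbour contributes only its deviation from its own mean, an item that everyone rates highly does not inflate the estimate by itself.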

In [7]:
_ = cross_validate(KNNWithMeans(sim_options={"name": "msd", "user_based": False}, verbose=False), data, measures=['RMSE', 'MAE'], cv=3, verbose=True)
_ = cross_validate(KNNWithMeans(sim_options={"name": "cosine", "user_based": False}, verbose=False), data, measures=['RMSE', 'MAE'], cv=3, verbose=True)
_ = cross_validate(KNNWithMeans(sim_options={"name": "pearson", "user_based": False}, verbose=False), data, measures=['RMSE', 'MAE'], cv=3, verbose=True)
_ = cross_validate(KNNWithMeans(sim_options={"name": "pearson_baseline", "user_based": False}, verbose=False), data, measures=['RMSE', 'MAE'], cv=3, verbose=True)

Evaluating RMSE, MAE of algorithm KNNWithMeans on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.7830  0.7834  0.7812  0.7825  0.0010  
MAE (testset)     0.4827  0.4835  0.4819  0.4827  0.0007  
Fit time          0.20    0.25    0.26    0.23    0.03    
Test time         0.74    0.74    0.69    0.72    0.03    
Evaluating RMSE, MAE of algorithm KNNWithMeans on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.7787  0.7795  0.7812  0.7798  0.0010  
MAE (testset)     0.4815  0.4821  0.4820  0.4818  0.0003  
Fit time          0.31    0.35    0.37    0.34    0.03    
Test time         0.75    0.76    0.76    0.76    0.00    
Evaluating RMSE, MAE of algorithm KNNWithMeans on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.7391  0.7392  0.7439  0.7407  0.0022  
MAE (testset)     0.4811  0.4806  0.4826  0.4814  0.0009  
Fit time          0.38    0.43    0.44    0.

#### KNN with Z Score

A k-NN method similar to the basic k-NN one, but it takes into account the z-score normalization of each user (or of each item, in the item-based variant).

User based:

$\large est_{ui} = \text{mean rating of user u} + \text{std of user u} * \frac{\text{sum_nearest_k_v(similarity between user u and v * (rate given to item i by user v - mean rating of user v) / std of user v)}}{\text{sum_nearest_k_v(similarity between user u and v)}}$

Item based:

$\large est_{ui} = \text{mean rating of item i} + \text{std of item i} * \frac{\text{sum_nearest_k_j(similarity between item i and j * (rate given to item j by user u - mean rating of item j) / std of item j)}}{\text{sum_nearest_k_j(similarity between item i and j)}}$
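
The z-score variant can be sketched in the same way. The function name, the input layout and the sample ratings are assumptions for this example, and the sketch assumes every neighbour has a non-zero standard deviation.

```python
import statistics


def knn_zscore_item(target_stats, neighbours):
    """Z-score normalized item-based estimate.

    target_stats: (mean, std) of the target item i
    neighbours:   list of (similarity, rating_by_u, mean_j, std_j), with std_j > 0
    """
    mean_i, std_i = target_stats
    num = sum(s * (r - m) / sd for s, r, m, sd in neighbours if s > 0)
    den = sum(s for s, _, _, _ in neighbours if s > 0)
    if den == 0:
        return mean_i
    return mean_i + std_i * num / den


# Hypothetical ratings received by a neighbouring course j.
ratings_j = [5.0, 3.0, 4.0, 4.0]
mean_j = statistics.mean(ratings_j)
std_j = statistics.pstdev(ratings_j)  # population standard deviation

# The user rated course j one point above its mean.
est = knn_zscore_item((3.5, 1.0), [(0.8, 5.0, mean_j, std_j)])
print(est)
```

Dividing by the neighbour's standard deviation expresses the rating in units of spread, and multiplying by the target's standard deviation converts it back to the target item's scale.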

In [8]:
_ = cross_validate(KNNWithZScore(sim_options={"name": "msd", "user_based": False}, verbose=False), data, measures=['RMSE', 'MAE'], cv=3, verbose=True)
_ = cross_validate(KNNWithZScore(sim_options={"name": "cosine", "user_based": False}, verbose=False), data, measures=['RMSE', 'MAE'], cv=3, verbose=True)
_ = cross_validate(KNNWithZScore(sim_options={"name": "pearson", "user_based": False}, verbose=False), data, measures=['RMSE', 'MAE'], cv=3, verbose=True)
_ = cross_validate(KNNWithZScore(sim_options={"name": "pearson_baseline", "user_based": False}, verbose=False), data, measures=['RMSE', 'MAE'], cv=3, verbose=True)

Evaluating RMSE, MAE of algorithm KNNWithZScore on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.7924  0.7965  0.7914  0.7935  0.0022  
MAE (testset)     0.4795  0.4809  0.4772  0.4792  0.0015  
Fit time          0.33    0.37    0.38    0.36    0.02    
Test time         0.79    0.77    0.79    0.79    0.01    
Evaluating RMSE, MAE of algorithm KNNWithZScore on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.7931  0.7858  0.7946  0.7911  0.0038  
MAE (testset)     0.4814  0.4780  0.4800  0.4798  0.0014  
Fit time          0.45    0.48    0.50    0.48    0.02    
Test time         0.78    0.77    0.72    0.76    0.03    
Evaluating RMSE, MAE of algorithm KNNWithZScore on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.7437  0.7424  0.7413  0.7425  0.0010  
MAE (testset)     0.4806  0.4790  0.4807  0.4801  0.0008  
Fit time          0.51    0.59    0.56   

#### KNN with Baseline

A k-NN method similar to the basic k-NN one, but it takes into account a *baseline* rating.

$baseline = b_{ui} = \text{average rating of all courses + bias of user u + bias of item i}$

User based:

$\large est_{ui} = \text{baseline rating of user u to item i} + \frac{\text{sum_nearest_k_v(similarity between user u and v * (rate given to item i by user v - baseline rating of user v to item i))}}{\text{sum_nearest_k_v(similarity between user u and v)}}$

Item based:

$\large est_{ui} = \text{baseline rating of user u to item i} + \frac{\text{sum_nearest_k_j(similarity between item i and j * (rate given to item j by user u - baseline rating of user u to item j))}}{\text{sum_nearest_k_j(similarity between item i and j)}}$
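
The item-based formula with baselines can be sketched as follows. All names and numbers are assumptions for illustration, the biases are taken as given, and the k cutoff is omitted for brevity, so every rated item with a positive similarity is used.

```python
def baseline(mu, b_u, b_i, u, i):
    """Baseline rating b_ui = mu + b_u + b_i."""
    return mu + b_u[u] + b_i[i]


def knn_baseline_item(mu, b_u, b_i, u, i, user_ratings, sims_to_i):
    """Baseline-centred item-based estimate for user u and item i.

    user_ratings: {item_j: rating given by user u}
    sims_to_i:    {item_j: similarity between item i and item j}
    """
    b_ui = baseline(mu, b_u, b_i, u, i)
    num = den = 0.0
    for j, r_uj in user_ratings.items():
        s = sims_to_i.get(j, 0.0)
        if s > 0:
            # Each neighbour contributes its deviation from its own baseline.
            num += s * (r_uj - baseline(mu, b_u, b_i, u, j))
            den += s
    return b_ui + num / den if den > 0 else b_ui


# Hypothetical global mean and biases.
mu = 3.5
b_u = {"u": 0.5}
b_i = {"i": 0.25, "j1": 0.5, "j2": -0.5}
est = knn_baseline_item(mu, b_u, b_i, "u", "i",
                        {"j1": 5.0, "j2": 3.0}, {"j1": 0.75, "j2": 0.25})
print(est)
```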

In [9]:
_ = cross_validate(KNNBaseline(sim_options={"name": "msd", "user_based": False}, verbose=False), data, measures=['RMSE', 'MAE'], cv=3, verbose=True)
_ = cross_validate(KNNBaseline(sim_options={"name": "cosine", "user_based": False}, verbose=False), data, measures=['RMSE', 'MAE'], cv=3, verbose=True)
_ = cross_validate(KNNBaseline(sim_options={"name": "pearson", "user_based": False}, verbose=False), data, measures=['RMSE', 'MAE'], cv=3, verbose=True)
_ = cross_validate(KNNBaseline(sim_options={"name": "pearson_baseline", "user_based": False}, verbose=False), data, measures=['RMSE', 'MAE'], cv=3, verbose=True)

Evaluating RMSE, MAE of algorithm KNNBaseline on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.7764  0.7775  0.7845  0.7795  0.0036  
MAE (testset)     0.4774  0.4778  0.4804  0.4785  0.0014  
Fit time          0.70    0.72    0.74    0.72    0.01    
Test time         0.80    0.81    0.80    0.80    0.01    
Evaluating RMSE, MAE of algorithm KNNBaseline on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.7772  0.7768  0.7743  0.7761  0.0013  
MAE (testset)     0.4789  0.4781  0.4758  0.4776  0.0013  
Fit time          0.79    0.84    0.83    0.82    0.02    
Test time         0.80    0.79    0.73    0.78    0.03    
Evaluating RMSE, MAE of algorithm KNNBaseline on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.7399  0.7312  0.7319  0.7344  0.0039  
MAE (testset)     0.4768  0.4733  0.4726  0.4742  0.0018  
Fit time          0.86    0.93    0.94    0.91 

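For all four k-NN models, the best results are achieved with the `pearson` similarity metric (mean RMSE 0.7512 for KNNBasic, 0.7407 for KNNWithMeans, 0.7425 for KNNWithZScore and 0.7344 for KNNBaseline). The best overall result is achieved with KNNBaseline.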

### Model-based Collaborative Filtering (Matrix Factorization)

#### SVD

$\text{overall interest} = o_i = q_i^T\cdot p_u, q_i = \text{vector for item i}, p_u = \text{vector for user u}$

$est_{ui} = \text{average rating of all courses + bias of item i + bias of user u + overall interest}$

The parameters ($b_u$, $b_i$, $p_u$ and $q_i$) are learned by minimizing a regularized squared-error objective.
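
To make the prediction rule and the learning step concrete, here is a minimal sketch of one stochastic gradient descent update for a single rating. The function names, the two-factor vectors and the starting values are assumptions for illustration; it is not Surprise's code.

```python
def svd_predict(mu, b_u, b_i, p_u, q_i):
    """Global mean + user bias + item bias + dot product of the factor vectors."""
    return mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))


def svd_sgd_step(r, mu, b_u, b_i, p_u, q_i, lr=0.005, reg=0.03):
    """One SGD update for a single observed rating r; returns the new parameters."""
    err = r - svd_predict(mu, b_u, b_i, p_u, q_i)
    new_b_u = b_u + lr * (err - reg * b_u)
    new_b_i = b_i + lr * (err - reg * b_i)
    # Both factor updates use the old values of p_u and q_i.
    new_p_u = [p + lr * (err * q - reg * p) for p, q in zip(p_u, q_i)]
    new_q_i = [q + lr * (err * p - reg * q) for p, q in zip(p_u, q_i)]
    return new_b_u, new_b_i, new_p_u, new_q_i


# Hypothetical starting point with two latent factors.
mu, b_u, b_i = 3.5, 0.0, 0.0
p_u, q_i = [0.1, -0.2], [0.3, 0.1]
before = svd_predict(mu, b_u, b_i, p_u, q_i)
b_u, b_i, p_u, q_i = svd_sgd_step(5.0, mu, b_u, b_i, p_u, q_i)
after = svd_predict(mu, b_u, b_i, p_u, q_i)
print(before, after)
```

After the update the prediction moves towards the observed rating of 5.0, which is the behaviour the objective is meant to produce when repeated over all ratings for several epochs.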

In [10]:
_ = cross_validate(SVD(verbose=False), data, measures=['RMSE', 'MAE'], cv=3, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.6949  0.7002  0.7005  0.6985  0.0026  
MAE (testset)     0.4722  0.4717  0.4738  0.4726  0.0009  
Fit time          1.78    1.77    1.83    1.80    0.02    
Test time         0.47    0.53    0.54    0.51    0.03    


#### SVD++

An extension of SVD taking into account implicit ratings. For this implementation, an implicit rating describes the fact that a user u rated an item j, regardless of the rating value.
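
For comparison with the SVD formula, the SVD++ prediction as described in the Surprise documentation can be written as follows. The symbols $I_u$ (the set of items rated by user $u$) and $y_j$ (an additional factor vector for item $j$ that captures the implicit ratings) do not appear elsewhere in this notebook and are introduced here only for this formula.

$est_{ui} = \mu + b_u + b_i + q_i^T\cdot\left(p_u + |I_u|^{-\frac{1}{2}} \sum_{j \in I_u} y_j\right)$

Here $\mu$ is the average rating of all courses, and $b_u$, $b_i$, $p_u$ and $q_i$ have the same meaning as in SVD.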

In [11]:
_ = cross_validate(SVDpp(verbose=False), data, measures=['RMSE', 'MAE'], cv=3, verbose=True)

Evaluating RMSE, MAE of algorithm SVDpp on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.6999  0.6950  0.6938  0.6962  0.0027  
MAE (testset)     0.4698  0.4676  0.4673  0.4682  0.0011  
Fit time          2.00    2.03    2.04    2.02    0.01    
Test time         1.25    1.24    1.26    1.25    0.01    


These matrix factorization models outperformed the memory-based methods (mean RMSE 0.6985 for SVD and 0.6962 for SVD++, compared with 0.7344 for the best k-NN model).

We tested seven collaborative filtering methods and chose three of them to fine-tune: BaselineOnly, KNNBaseline and SVD. We selected SVD over SVD++ because it is faster (mean fit time 1.80 vs 2.02, mean test time 0.51 vs 1.25) with nearly the same accuracy.

### Parameter Search

In [12]:
param_grid_base = {"bsl_options": {
    "method": ["sgd"], 
    "reg":[0.01, 0.02, 0.03], 
    "learning_rate": [0.005, 0.01, 0.02], 
    "n_epochs": [10, 20, 30]},
    "verbose": [False]
}
gs_base = GridSearchCV(BaselineOnly, param_grid_base, measures=["rmse", "mae"], cv=5, joblib_verbose=0)
gs_base.fit(data)
print("BaselineOnly SGD Grid Search")
print(gs_base.best_score["rmse"])
print(gs_base.best_params["rmse"])

BaselineOnly SGD Grid Search
0.6885029389204965
{'bsl_options': {'method': 'sgd', 'reg': 0.03, 'learning_rate': 0.005, 'n_epochs': 20}, 'verbose': False}


Best hyperparameter values for baseline are *{'method': 'sgd', 'reg': 0.03, 'learning_rate': 0.005, 'n_epochs': 20}*.
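
A grid search evaluates every combination of the listed values, so the cost grows multiplicatively. The sketch below counts the combinations in the `bsl_options` grid used above; the variable names are assumptions for this example.

```python
from itertools import product

# Same value lists as the "bsl_options" part of param_grid_base.
bsl_grid = {
    "method": ["sgd"],
    "reg": [0.01, 0.02, 0.03],
    "learning_rate": [0.005, 0.01, 0.02],
    "n_epochs": [10, 20, 30],
}

keys = list(bsl_grid)
combos = [dict(zip(keys, values)) for values in product(*(bsl_grid[k] for k in keys))]

n_folds = 5  # cv=5 in the grid search above
n_fits = len(combos) * n_folds
print(len(combos), n_fits)
```

With 27 combinations and 5 folds, the search fits the model 135 times, which is worth keeping in mind before widening the grid.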

In [13]:
"""
param_grid_knn = {
    "k": [20,30,40], 
    "sim_options": {"name": ["pearson"], "user_based": [False]},
    "verbose": [False]
}
gs_knn = GridSearchCV(KNNBaseline, param_grid_knn, measures=["rmse", "mae"], cv=5)
gs_knn.fit(data)
print("KNNWithBaseline Grid Search")
print(gs_knn.best_score["rmse"])
print(gs_knn.best_params["rmse"])
"""

# Parameter tuning for KNNBaseline could not be done due to memory limitations.
# Instead, we cross-validated with the tuned baseline hyperparameters to see whether they improve the model.

bsl_opt = {'method': 'sgd', 'reg': 0.03, 'learning_rate': 0.005, 'n_epochs': 20}
sim_opt = {"name": "pearson", "user_based": False}

_ = cross_validate(KNNBaseline(k=20, bsl_options=bsl_opt, sim_options=sim_opt, verbose=False), data, measures=['RMSE', 'MAE'], cv=3, verbose=True)
_ = cross_validate(KNNBaseline(k=30, bsl_options=bsl_opt, sim_options=sim_opt, verbose=False), data, measures=['RMSE', 'MAE'], cv=3, verbose=True)
_ = cross_validate(KNNBaseline(k=40, bsl_options=bsl_opt, sim_options=sim_opt, verbose=False), data, measures=['RMSE', 'MAE'], cv=3, verbose=True)

Evaluating RMSE, MAE of algorithm KNNBaseline on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.7307  0.7310  0.7358  0.7325  0.0024  
MAE (testset)     0.4707  0.4700  0.4728  0.4712  0.0012  
Fit time          1.20    1.24    1.23    1.22    0.02    
Test time         0.64    0.63    0.64    0.64    0.00    
Evaluating RMSE, MAE of algorithm KNNBaseline on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.7306  0.7325  0.7388  0.7339  0.0035  
MAE (testset)     0.4694  0.4717  0.4751  0.4721  0.0023  
Fit time          1.19    1.24    1.25    1.23    0.03    
Test time         0.69    0.68    0.77    0.72    0.04    
Evaluating RMSE, MAE of algorithm KNNBaseline on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.7299  0.7352  0.7357  0.7336  0.0026  
MAE (testset)     0.4693  0.4722  0.4730  0.4715  0.0016  
Fit time          1.20    1.22    1.24    1.22 

The KNNBaseline results show that changing the k value and applying the tuned baseline options had almost no effect on accuracy: the mean RMSE stayed between 0.7325 and 0.7339, compared with 0.7344 for the default settings.

In [14]:
param_grid_svd = {"n_factors": [100, 200],
                  "n_epochs": [10, 20], 
                  "lr_all": [0.005, 0.01, 0.02],
                  "reg_all": [0.01, 0.02, 0.03],
}
gs_svd = GridSearchCV(SVD, param_grid_svd, measures=["rmse", "mae"], cv=5)
gs_svd.fit(data)
print("SVD Grid Search")
print(gs_svd.best_score["rmse"])
print(gs_svd.best_params["rmse"])

SVD Grid Search
0.6945597792676461
{'n_factors': 100, 'n_epochs': 20, 'lr_all': 0.005, 'reg_all': 0.03}


Best hyperparameter values for SVD are *{'n_factors': 100, 'n_epochs': 20, 'lr_all': 0.005, 'reg_all': 0.03}*.

***

Let's test these models on a separate train/test split.

In [15]:
trainset, testset = train_test_split(data, random_state=78, test_size=0.2)

In [16]:
pred = BaselineOnly(bsl_options={'method': 'sgd', 'reg': 0.03, 'learning_rate': 0.005, 'n_epochs': 20}, verbose=False).fit(trainset).test(testset)
print("RMSE values of BaselineOnly:", accuracy.rmse(pred))

RMSE: 0.6949
RMSE values of BaselineOnly: 0.6948999191927382


In [17]:
pred = KNNBaseline(bsl_options=bsl_opt, sim_options=sim_opt, verbose=False).fit(trainset).test(testset)
print("RMSE values of KNNBaseline:", accuracy.rmse(pred))

RMSE: 0.7538
RMSE values of KNNBaseline: 0.7538073470017157


In [18]:
pred = SVD(n_factors=100, n_epochs=20, lr_all=0.005, reg_all=0.03).fit(trainset).test(testset)
print("RMSE values of SVD:", accuracy.rmse(pred))

RMSE: 0.7006
RMSE values of SVD: 0.7006290645236952


### Recommendation

We decided to check SVD and BaselineOnly to see how well they recommend courses to users.

In [19]:
full_trainset = data.build_full_trainset()

In [20]:
bsl = BaselineOnly(bsl_options={'method': 'sgd', 'reg': 0.03, 'learning_rate': 0.005, 'n_epochs': 20}, verbose=False)
svd = SVD(n_factors=100, n_epochs=20, lr_all=0.005, reg_all=0.03)
bsl.fit(full_trainset)
svd.fit(full_trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x232239adb20>

In [21]:
# causes memory errors, instead calculate anti testset for the user
#anti_testset = full_trainset.build_anti_testset()
anti_testset_user, user_id = surprise_helper.get_anti_testset_user(full_trainset)
print("User ID:", user_id)
print("Display Name:", rates[rates["user_id"] == user_id]["display_name"].unique()[0])

User ID: 17954
Display Name: Kirubel


In [22]:
user_pred_svd = svd.test(anti_testset_user)
user_pred_bsl = bsl.test(anti_testset_user)

In [23]:
courses_taken = rates[rates["user_id"] == user_id]["course_id"]
courses_taken_df = courses[courses["id"].isin(courses_taken)]
print("Courses Taken By The User")
courses_taken_df

Courses Taken By The User


Unnamed: 0,id,title,category,course_url
28,24823,Java Tutorial for Complete Beginners,Development,/course/java-tutorial/
81,65330,Web Development By Doing: HTML / CSS From Scratch,Development,/course/web-development-learn-by-doing-html5-c...
745,762616,The Complete SQL Bootcamp 2022: Go from Zero t...,Business,/course/the-complete-sql-bootcamp/
1015,958532,Bash Scripting and Shell Programming (Linux Co...,Development,/course/bash-scripting/
1046,980086,Deep Learning Prerequisites: The Numpy Stack i...,Business,/course/deep-learning-prerequisites-the-numpy-...
1056,984734,Website Hacking / Penetration Testing,IT & Software,/course/learn-website-hacking-penetration-test...
1504,1326292,Complete Guide to TensorFlow for Deep Learning...,Development,/course/complete-guide-to-tensorflow-for-deep-...
1895,1668050,Presentation Skills: Master Confident Presenta...,Business,/course/presentations-mastery/
2326,2485240,Python : Master Programming and Development wi...,IT & Software,/course/python-complete-bootcamp-2019-learn-by...


In [24]:
user_pred_df = pd.DataFrame(user_pred_svd)
user_pred_df.sort_values(by=['est'],inplace=True,ascending = False)
recom_list = user_pred_df.head(5)['iid'].to_list()
print("Recommended Courses by SVD")
courses[courses["id"].isin(recom_list)]

Recommended Courses by SVD


Unnamed: 0,id,title,category,course_url
994,948840,Java y BlueJ | Introducción a las Bases de la ...,Development,/course/programacion/
995,948866,Excel Essentials: The Complete Excel Series - ...,Office Productivity,/course/excel-essentials-the-complete-series-l...
1114,1033544,TypeScript: Tu completa guía y manual de mano.,Development,/course/typescript-guia-completa/
2438,2887266,Microservices with Node JS and React,Development,/course/microservices-with-node-js-and-react/
2506,3581711,Fluent Grammar for IELTS Speaking,Teaching & Academics,/course/fluency-for-ielts-speaking/


In [25]:
user_pred_df = pd.DataFrame(user_pred_bsl)
user_pred_df.sort_values(by=['est'],inplace=True,ascending = False)
recom_list = user_pred_df.head(5)['iid'].to_list()
print("Recommended Courses by BaselineOnly")
courses[courses["id"].isin(recom_list)]

Recommended Courses by BaselineOnly


Unnamed: 0,id,title,category,course_url
909,882422,Curso Maestro de Python: De Cero a Programador...,Development,/course/python-3-al-completo-desde-cero/
1305,1177664,Crystal Healing Certificate Course - Energy He...,Lifestyle,/course/crystal-healing/
1604,1384236,Uygulama Geliştirerek C# Öğrenin: A'dan Z'ye E...,Development,/course/sifirdan-ileri-seviye-csharp-programlama/
1985,1774828,Clickfunnels & Sales Funnels MASTERY in 2022 +...,Business,/course/clickfunnelsninjamasterclass/
2385,2649080,Curso de Scrum Básico - para TODOS,Business,/course/curso-scrum-basico-para-todos/
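
Note that filtering `courses` with `isin` returns rows in the order of the `courses` table, not in order of predicted rating. A minimal sketch of a ranked top-N selection is shown below; the `Prediction` stand-in mirrors the field names of Surprise predictions (`uid`, `iid`, `r_ui`, `est`, `details`), while the function name and the sample values are assumptions.

```python
import heapq
from collections import namedtuple

# Stand-in with the same field names as a Surprise prediction.
Prediction = namedtuple("Prediction", ["uid", "iid", "r_ui", "est", "details"])


def top_n(predictions, n=5):
    """Return (item id, estimate) pairs for the n highest estimates, best first."""
    best = heapq.nlargest(n, predictions, key=lambda p: p.est)
    return [(p.iid, p.est) for p in best]


# Hypothetical predictions for one user.
preds = [
    Prediction("u1", "c1", None, 4.2, {}),
    Prediction("u1", "c2", None, 4.9, {}),
    Prediction("u1", "c3", None, 3.1, {}),
]
ranked = top_n(preds, n=2)
print(ranked)
```

Keeping the estimate next to each item id also makes it easier to see how close the top candidates are to each other.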


The recommended courses sometimes look wrong. Because these methods consider only ratings, the recommended courses are simply the ones with the highest predicted rating. Recommendations could be improved by using additional information, such as course and user attributes (as in *content-based filtering*) or implicit ratings.

### Similar Courses with KNN

We can use k-NN based methods to see the similar courses & users.
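
To show what "similar" means under the `pearson` metric, here is a minimal sketch that ranks items by the Pearson correlation of their ratings over common users. The function names and the toy ratings are assumptions for illustration; `surprise_helper.get_nearest_neighbors` is not shown in this notebook and may work differently.

```python
import math


def pearson(ratings_a, ratings_b):
    """Pearson correlation between two items, computed over their common users.

    ratings_a, ratings_b: {user: rating}. Returns 0.0 when it is undefined.
    """
    common = sorted(set(ratings_a) & set(ratings_b))
    if len(common) < 2:
        return 0.0
    mean_a = sum(ratings_a[u] for u in common) / len(common)
    mean_b = sum(ratings_b[u] for u in common) / len(common)
    cov = sum((ratings_a[u] - mean_a) * (ratings_b[u] - mean_b) for u in common)
    var_a = sum((ratings_a[u] - mean_a) ** 2 for u in common)
    var_b = sum((ratings_b[u] - mean_b) ** 2 for u in common)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def most_similar(target, item_ratings, k=5):
    """Ids of the k items most similar to the target item, most similar first."""
    sims = [(pearson(item_ratings[target], r), j)
            for j, r in item_ratings.items() if j != target]
    return [j for _, j in sorted(sims, reverse=True)[:k]]


# Hypothetical ratings per course: c2 follows c1, c3 moves in the opposite direction.
item_ratings = {
    "c1": {"a": 5.0, "b": 3.0, "c": 1.0},
    "c2": {"a": 4.0, "b": 3.0, "c": 2.0},
    "c3": {"a": 1.0, "b": 3.0, "c": 5.0},
}
print(most_similar("c1", item_ratings, k=1))
```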

In [26]:
knn = KNNBaseline(sim_options={"name": "pearson", "user_based": False}, verbose=False)
knn.fit(full_trainset)

<surprise.prediction_algorithms.knns.KNNBaseline at 0x232298503a0>

In [27]:
similar_courses_raw_ids, course_id = surprise_helper.get_nearest_neighbors(full_trainset, knn)

In [28]:
print("Course")
courses[courses["id"]==course_id]

Course


Unnamed: 0,id,title,category,course_url
1617,1393266,Lead Generation Machine: Cold Email B2B Sales ...,Business,/course/lead-generation-machine-the-cold-email...


In [29]:
print("Similar Courses")
courses[courses["id"].isin(similar_courses_raw_ids)]

Similar Courses


Unnamed: 0,id,title,category,course_url
220,258316,Complete C# Unity Game Developer 2D,Development,/course/unitycourse/
661,684824,Spring Framework Master Class - Java Spring th...,IT & Software,/course/spring-tutorial-for-beginners/
949,914296,The Complete Digital Marketing Course - 12 Cou...,Marketing,/course/learn-digital-marketing-course/
989,946194,Tableau 2022 Advanced: Master Tableau in Data ...,Business,/course/tableau10-advanced/
1143,1055720,Selenium Webdriver-How to Do Mouse and Keyboar...,IT & Software,/course/selenium-webdriver-how-to-do-mouse-and...
