In [1]:
import numpy as np
import math

Let's illustrate how the equation works with a simple example. From the figure above, suppose we want to predict the rating of `user6` for the `Machine Learning Capstone` course. After computing similarities, we find the k = 4 nearest neighbors, `user2, user3, user4, user5`, with similarities stored in the array `knn_sims`:


In [2]:
knn_sims = np.array([0.8, 0.92, 0.75, 0.83])

Their ratings for the `Machine Learning Capstone` course are:


In [3]:
knn_ratings = np.array([3.0, 3.0, 2.0, 3.0])

So the predicted rating of `user6` for the `Machine Learning Capstone` course is the similarity-weighted average of the neighbors' ratings:


In [4]:
r_u6_ml = np.dot(knn_sims, knn_ratings) / np.sum(knn_sims)
r_u6_ml

2.7727272727272725
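
Written out term by term, the weighted average computed above looks as follows. The symbols $\hat{r}$, $s_v$ (similarity of neighbor $v$) and $r_v$ (rating given by neighbor $v$) are notation introduced here for illustration and may differ from the notation used in the figure:

```latex
\[
\hat{r}_{u_6}
  = \frac{\sum_{v} s_v \, r_v}{\sum_{v} s_v}
  = \frac{0.8 \cdot 3 + 0.92 \cdot 3 + 0.75 \cdot 2 + 0.83 \cdot 3}{0.8 + 0.92 + 0.75 + 0.83}
  = \frac{9.15}{3.30}
  \approx 2.77
\]
```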

If we already know that the true rating is 3.0, the prediction error, measured as RMSE (Root Mean Squared Error), is:


In [5]:
true_rating = 3.0
rmse = math.sqrt((true_rating - r_u6_ml)**2)
rmse

0.22727272727272751

The predicted rating is about 2.77 (close to the true rating of 3.0, with an RMSE of about 0.23), which indicates that `user6` is also likely to complete the `Machine Learning Capstone` course. As such, we may recommend it to `user6` with high confidence.
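
With a single prediction, the RMSE reduces to the absolute error, which is why the value above equals |3.0 - 2.7727...|. The following is a minimal sketch of the general case over several predictions. The helper name `compute_rmse` and the batch of three ratings are made up for illustration and do not come from the dataset.

```python
import numpy as np

def compute_rmse(true_ratings, predicted_ratings):
    """Root mean squared error between two equal-length rating arrays."""
    true_ratings = np.asarray(true_ratings, dtype=float)
    predicted_ratings = np.asarray(predicted_ratings, dtype=float)
    return float(np.sqrt(np.mean((true_ratings - predicted_ratings) ** 2)))

# Single prediction from the example above: RMSE equals the absolute error.
single_rmse = compute_rmse([3.0], [2.7727272727272725])

# Hypothetical batch of three predictions (values invented for illustration).
batch_rmse = compute_rmse([3.0, 2.0, 3.0], [2.5, 2.5, 3.0])

print(single_rmse, batch_rmse)
```

Squaring before averaging means that a few large errors raise the RMSE more than many small ones do.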


### Load and explore the dataset


In [6]:
import pandas as pd

In [7]:
rating_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-ML0321EN-Coursera/labs/v2/module_3/ratings.csv"
rating_df = pd.read_csv(rating_url)

In [8]:
rating_df.head()

Unnamed: 0,user,item,rating
0,1889878,CC0101EN,5
1,1342067,CL0101EN,3
2,1990814,ML0120ENv3,5
3,380098,BD0211EN,5
4,779563,DS0101EN,3


The dataset contains three columns: `user` (learner ID), `item` (course ID), and `rating` (enrollment mode).

Note that the ratings are stored in the dense (vertical) form, with one row per user-item rating. You can convert them to a sparse user-item matrix using `pivot`:


In [9]:
rating_sparse_df = rating_df.pivot(index='user', columns='item', values='rating').fillna(0).reset_index().rename_axis(index=None, columns=None)
rating_sparse_df.head()

Unnamed: 0,user,AI0111EN,BC0101EN,BC0201EN,BC0202EN,BD0101EN,BD0111EN,BD0115EN,BD0121EN,BD0123EN,...,SW0201EN,TA0105,TA0105EN,TA0106EN,TMP0101EN,TMP0105EN,TMP0106,TMP107,WA0101EN,WA0103EN
0,2,0.0,4.0,0.0,0.0,5.0,4.0,0.0,5.0,3.0,...,0.0,5.0,0.0,4.0,0.0,3.0,3.0,0.0,5.0,0.0
1,4,0.0,0.0,0.0,0.0,5.0,3.0,4.0,5.0,3.0,...,0.0,4.0,0.0,0.0,0.0,3.0,3.0,0.0,3.0,3.0
2,5,3.0,5.0,5.0,0.0,4.0,0.0,0.0,0.0,3.0,...,0.0,0.0,4.0,4.0,4.0,4.0,4.0,5.0,0.0,3.0
3,7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,8,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Usually, the dense format is preferred because it saves a lot of storage and memory. The benefit of the sparse matrix is that it is already in natural matrix form, so you can apply computations such as cosine similarity to it directly.

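
The trade-off between the two formats can be seen on a tiny example. This is a sketch only: `toy_df` and its ratings are hypothetical and are not taken from `ratings.csv`, and the cosine similarity is computed with plain `numpy` rather than a library helper.

```python
import numpy as np
import pandas as pd

# Hypothetical dense (vertical) ratings: one row per user-item rating.
toy_df = pd.DataFrame({
    "user": [1, 1, 2, 2, 3],
    "item": ["A", "B", "A", "B", "C"],
    "rating": [5, 3, 5, 3, 4],
})

# Sparse user-item matrix: one row per user, one column per item, 0 for "no rating".
toy_matrix = toy_df.pivot(index="user", columns="item", values="rating").fillna(0)

# Cosine similarity between user rows, computed directly on the matrix.
# Assumes every user has at least one rating, so no row norm is zero.
values = toy_matrix.to_numpy(dtype=float)
norms = np.linalg.norm(values, axis=1, keepdims=True)
cosine_sim = (values @ values.T) / (norms @ norms.T)

print(len(toy_df), "stored ratings vs", toy_matrix.size, "matrix cells")
print(cosine_sim.round(2))
```

Users 1 and 2 gave identical ratings, so their similarity is 1, while user 3 shares no rated item with them and gets a similarity of 0.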
Next, you need to perform KNN-based collaborative filtering on the user-item interaction matrix.
You may choose one of the following two implementation options:
- The first is to use `scikit-surprise`, a popular and easy-to-use Python library for recommender systems.
- The second is to implement it yourself with standard `numpy`, `pandas`, and `sklearn`. You may need to write a lot of low-level code along the way.
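
To give a feel for the low-level work the second option involves, here is a minimal `numpy`-only sketch of user-based KNN prediction. It is not the lab's reference solution: the function name `knn_predict`, the toy matrix, and the choice of cosine similarity over full rating rows are all assumptions made for illustration.

```python
import numpy as np

def knn_predict(ratings, user_idx, item_idx, k=2):
    """Predict ratings[user_idx, item_idx] from the k most similar users who rated the item.

    Assumes 0 means "no rating" and that every user has at least one rating.
    """
    norms = np.linalg.norm(ratings, axis=1)
    # Cosine similarity between the target user and every user.
    sims = ratings @ ratings[user_idx] / (norms * norms[user_idx])
    sims[user_idx] = -np.inf                    # exclude the target user
    sims[ratings[:, item_idx] == 0] = -np.inf   # exclude users who did not rate the item
    neighbors = np.argsort(sims)[::-1][:k]
    neighbors = neighbors[np.isfinite(sims[neighbors])]
    weight_sum = np.sum(sims[neighbors])
    if neighbors.size == 0 or weight_sum == 0:
        return None                             # no usable neighbor
    # Similarity-weighted average of the neighbors' ratings.
    return float(np.dot(sims[neighbors], ratings[neighbors, item_idx]) / weight_sum)

# Hypothetical 4-user x 3-item matrix; user 0 has not rated item 2.
toy_ratings = np.array([
    [5.0, 3.0, 0.0],
    [5.0, 3.0, 4.0],
    [4.0, 3.0, 5.0],
    [0.0, 0.0, 2.0],
])
predicted = knn_predict(toy_ratings, user_idx=0, item_idx=2, k=2)
print(predicted)
```

The last line of the function is the same weighted average used in the `user6` example, applied to neighbors chosen by the code instead of given by hand.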


## Implementation Option 1: Use the **Surprise** library (recommended)


In [10]:
from surprise import KNNBasic
from surprise import Dataset, Reader
from surprise.model_selection import train_test_split
from surprise import accuracy

In [11]:
data = Dataset.load_builtin('ml-100k', prompt=False)

In [12]:
trainset, testset = train_test_split(data, test_size=.25)

In [13]:
algo = KNNBasic()
algo.fit(trainset)

Computing the msd similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBasic at 0x7e0a76f51450>

In [14]:
predictions = algo.test(testset)

In [15]:
accuracy.rmse(predictions)

RMSE: 0.9829


0.9828607618907637

As you can see, just a few lines of code are enough to apply KNN collaborative filtering to the sample MovieLens dataset. The main evaluation metric is Root Mean Squared Error (RMSE), a popular rating-estimation error metric used in recommender systems as well as in evaluating many regression models.


In [16]:
rating_df.to_csv('course_ratings.csv', index=False)

In [17]:
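# Take the rating scale from the data, since Surprise clips predictions to this range
reader = Reader(line_format='user item rating', sep=',', skip_lines=1, rating_scale=(rating_df['rating'].min(), rating_df['rating'].max()))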

In [18]:
course_dataset = Dataset.load_from_file('course_ratings.csv', reader=reader)

In [19]:
trainset, testset = train_test_split(course_dataset, test_size=.3)

Then check how many users and items we can use to fit a KNN model:


In [20]:
print(f"Total {trainset.n_users} users and {trainset.n_items} items in the training set")

Total 31303 users and 125 items in the training set


### TASK: Perform KNN-based collaborative filtering on the user-item interaction matrix


In [21]:
model = KNNBasic()

In [None]:
model.fit(trainset)

In [None]:
predictions = model.test(testset)

In [None]:
accuracy.rmse(predictions)

## Implementation Option 2: Use `numpy`, `pandas`, and `sklearn`
