In [1]:
!wget -nc http://files.grouplens.org/datasets/movielens/ml-100k.zip
!unzip -n ml-100k.zip

--2023-12-03 21:24:48--  http://files.grouplens.org/datasets/movielens/ml-100k.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4924029 (4.7M) [application/zip]
Saving to: ‘ml-100k.zip’


2023-12-03 21:24:48 (16.8 MB/s) - ‘ml-100k.zip’ saved [4924029/4924029]

Archive:  ml-100k.zip
   creating: ml-100k/
  inflating: ml-100k/allbut.pl       
  inflating: ml-100k/mku.sh          
  inflating: ml-100k/README          
  inflating: ml-100k/u.data          
  inflating: ml-100k/u.genre         
  inflating: ml-100k/u.info          
  inflating: ml-100k/u.item          
  inflating: ml-100k/u.occupation    
  inflating: ml-100k/u.user          
  inflating: ml-100k/u1.base         
  inflating: ml-100k/u1.test         
  inflating: ml-100k/u2.base         
  inflating: ml-100k/u2.test         
  inflating: ml-100k/u3.base    

In [34]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import mean_squared_error
from math import sqrt
import numpy as np

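# I. Data Loading and Exploration
# u.data is tab-separated with one rating per line: user, movie, rating, timestamp
users_columns = ['user_id', 'movie_id', 'rating', 'timestamp']
users = pd.read_csv('ml-100k/u.data', sep='\t', names=users_columns)

# II. Data Preprocessing
# User x movie rating matrix; movies a user has not rated are filled with 0
user_data = users.pivot(index='user_id', columns='movie_id', values='rating').fillna(0)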

# III. Data Split
train_data, test_data = train_test_split(users, test_size=0.2, random_state=42)

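# IV. User-Based Collaborative filtering
# Note: the model is fit on the full matrix (user_data), which also contains the
# ratings held out for testing; fitting on the training matrix would avoid that leakage.
knn_model = NearestNeighbors(metric='cosine', algorithm='brute')
knn_model.fit(user_data)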

# V. Evaluation
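# Sorted list (rather than a set) so the column order is deterministic
all_movie_ids = sorted(users['movie_id'].unique())

# Rebuild the train/test matrices so that both contain a column for every movie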
train_user_data = train_data.pivot(index='user_id', columns='movie_id', values='rating').fillna(0).reindex(columns=all_movie_ids, fill_value=0)
test_user_data = test_data.pivot(index='user_id', columns='movie_id', values='rating').fillna(0).reindex(columns=all_movie_ids, fill_value=0)

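# k-neighbors for each user in the test set
# Note: the returned indices are row positions in user_data (the matrix the model
# was fit on); they line up with train_user_data only if every user is in the training split.
_, indices = knn_model.kneighbors(test_user_data, n_neighbors=5)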

# predicted ratings
user_predicted_ratings = np.zeros(test_user_data.shape)

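# Predict ratings for each user in the test set
for i in range(len(test_user_data)):
    # Training ratings of this user's 5 nearest neighbors (0 = not rated)
    neighbor_ratings = train_user_data.iloc[indices[i]].values
    # Similarity between the user's test vector and each neighbor, shape (1, 5)
    user_similarity = cosine_similarity([test_user_data.iloc[i].values], neighbor_ratings)
    # Similarity-weighted average per movie; unrated (0) entries are averaged in as well
    user_predicted_ratings[i] = np.dot(user_similarity, neighbor_ratings) / np.sum(np.abs(user_similarity))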

# VI. Evaluation Metrics
# Flattening arrays
actual_ratings = test_user_data.values.flatten()
predicted_ratings = user_predicted_ratings.flatten()

# Keep only the entries that have an actual test rating (non-zero)
non_zero_indices = actual_ratings.nonzero()
actual_ratings = actual_ratings[non_zero_indices]
predicted_ratings = predicted_ratings[non_zero_indices]

# RMSE
rmse = sqrt(mean_squared_error(actual_ratings, predicted_ratings))
print("Root Mean Squared Error (RMSE):", rmse)

# Display actual vs predicted values
results_df = pd.DataFrame({
    'user_id': np.repeat(test_user_data.index, test_user_data.shape[1])[non_zero_indices],
    'item_id': np.tile(test_user_data.columns, len(test_user_data))[non_zero_indices],
    'actual_rating': actual_ratings,
    'predicted_rating': predicted_ratings})

print(results_df.head())


Root Mean Squared Error (RMSE): 2.3274645886947405
   user_id  item_id  actual_rating  predicted_rating
0        1        1            5.0          4.000000
1        1        4            3.0          2.358972
2        1        6            5.0          0.000000
3        1        8            1.0          0.242476
4        1       20            4.0          0.927535


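The RMSE I got (about 2.33) is high; lower values are better. Because ratings range from 1 to 5, the largest possible error on a single rating is 4, so an RMSE above 2 means the predictions are often far off. The actual vs. predicted table shows how inconsistent they are: some predictions are close (4.0 against an actual 5), while others are far from the actual values (0.0 against an actual 5). One reason visible in the code is that unrated movies are stored as 0 and are included in the weighted average of the neighbors' ratings, which pulls many predictions toward 0.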
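To illustrate how zero-filled entries can drag a prediction down, here is a minimal sketch with made-up numbers. The arrays `neighbor_ratings` and `similarities` are toy assumptions, not values from the run above, and the "masked" variant is one possible alternative rather than what the cell above does.

```python
import numpy as np

# Toy data (made up for illustration): 3 neighbors x 2 movies, 0 = "not rated".
neighbor_ratings = np.array([
    [4.0, 0.0],
    [5.0, 0.0],
    [3.0, 4.0],
])
similarities = np.array([0.9, 0.8, 0.3])

# Same form as the loop above: zeros are averaged in like real ratings.
naive = similarities @ neighbor_ratings / np.abs(similarities).sum()

# Alternative: for each movie, only count neighbors who actually rated it.
rated = neighbor_ratings > 0
weight_sums = (similarities[:, None] * rated).sum(axis=0)
masked = np.divide(
    similarities @ neighbor_ratings,
    weight_sums,
    out=np.zeros_like(weight_sums),  # stays 0 where no neighbor rated the movie
    where=weight_sums > 0,
)

print("naive: ", naive.round(2))
print("masked:", masked.round(2))
```

For the second movie, only one neighbor has a rating (4.0). The naive average comes out at about 0.6 because the two zeros are counted, while the masked average stays at 4.0.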
Strengths: User-based CF is simple and easy to implement, and it needs only the rating data to make predictions. It suggests items that are liked by similar users, potentially introducing users to new and unexpected items.
Weaknesses: The model handles new users and new items poorly (the cold-start problem). Without sufficient historical data it cannot find reliable neighbors, so recommendations may stay inaccurate until the user or item has built up a history of interactions. Also, as the number of users and items grows, computing user similarities becomes more resource-intensive. For large datasets, this can impact both training and prediction times.
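Both weaknesses can be seen in a small sketch. The rating vectors and the helper `cosine` below are hypothetical and only for illustration; the user count of 943 is the number of users in MovieLens 100K.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector has no ratings."""
    norm = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / norm) if norm > 0 else 0.0

# Hypothetical rating vectors over 4 movies (0 = not rated).
existing_user = np.array([5.0, 3.0, 0.0, 4.0])
similar_user = np.array([4.0, 3.0, 0.0, 5.0])
new_user = np.zeros(4)  # cold start: no ratings yet

sim_known = cosine(existing_user, similar_user)
sim_new = cosine(new_user, similar_user)
print("existing user vs. neighbor:", round(sim_known, 2))
print("new user vs. neighbor:     ", sim_new)

# Scaling: the number of user pairs grows quadratically with the user count.
n_users = 943  # users in MovieLens 100K
n_pairs = n_users * (n_users - 1) // 2
print("user pairs to compare:", n_pairs)
```

A user without ratings has similarity 0 to everyone, so there is no basis for choosing neighbors. The pair count shows why a full similarity matrix gets expensive as the user base grows.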

Below, I implemented the same task with the Surprise library, using its KNNBasic algorithm for user-based collaborative filtering.
It gives far more accurate results than the implementation above: the RMSE drops from about 2.33 to about 1.02. Surprise is designed specifically for building recommender systems and provides a high-level interface for collaborative filtering and other recommendation algorithms. In particular, KNNBasic predicts a rating only from neighbors who have actually rated the item, instead of averaging in zeros for unrated movies.
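For reference, the prediction rule that the Surprise documentation gives for KNNBasic in the user-based case is a similarity-weighted average. The notation follows the Surprise docs and is not defined elsewhere in this notebook:

```latex
$$
\hat{r}_{ui} = \frac{\sum_{v \in N_i^k(u)} \mathrm{sim}(u, v) \cdot r_{vi}}
                    {\sum_{v \in N_i^k(u)} \mathrm{sim}(u, v)}
$$
```

Here $N_i^k(u)$ is the set of (at most) $k$ nearest neighbors of user $u$ that have rated item $i$, so unrated items never enter the average. The cell below does not set `k`, so KNNBasic uses its default of 40 neighbors, compared with 5 in the scikit-learn version above.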

In [12]:
pip install scikit-surprise

Collecting scikit-surprise
  Downloading scikit-surprise-1.1.3.tar.gz (771 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.3-cp310-cp310-linux_x86_64.whl size=3163755 sha256=da848a08863dcd8f046230355bf3b13846174b28cca23fb1a1f0f34147d8ed8d
  Stored in directory: /root/.cache/pip/wheels/a5/ca/a8/4e28def53797fdc4363ca4af740db15a9c2f1595ebc51fb445
Successfully built scikit-surprise
Installing collected packages: scikit-surprise
Successfully installed scikit-surprise-1.1.3


In [37]:
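from surprise import Dataset, KNNBasic
# Note: this shadows the scikit-learn train_test_split imported in the earlier cell
from surprise.model_selection import train_test_split
from surprise import accuracy

# Data Loading and Exploration
# The built-in loader already knows the file format and the 1-5 rating scale,
# so no custom Reader is needed
data = Dataset.load_builtin('ml-100k')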

# Data Split
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

# User-Based Collaborative filtering using Surprise
sim_options = {'name': 'cosine', 'user_based': True}
knn_model = KNNBasic(sim_options=sim_options)
knn_model.fit(trainset)

# Predictions
predictions = knn_model.test(testset)

# Evaluation
# accuracy.rmse prints the RMSE itself (verbose=True by default) and returns it
rmse = accuracy.rmse(predictions)

print(f'Root Mean Squared Error: {rmse}')

# Display actual and predicted scores
# Each prediction is a tuple: (user id, item id, actual rating, estimate, details)
result_df = pd.DataFrame(predictions, columns=['user_id', 'item_id', 'actual_rating', 'predicted_rating', 'details'])
result_df['actual_rating'] = result_df['actual_rating'].astype(int)
result_df['predicted_rating'] = result_df['predicted_rating'].round(1)
print(result_df[['user_id', 'item_id', 'actual_rating', 'predicted_rating']].head())


Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 1.0194
Root Mean Squared Error: 1.0193536815834319
  user_id item_id  actual_rating  predicted_rating
0     907     143              5               4.0
1     371     210              4               4.0
2     218      42              4               3.9
3     829     170              4               4.3
4     733     277              1               3.4
