<a href="https://colab.research.google.com/github/HMy2912/LTSSUD-RecommenderSys-ColabFiltering/blob/main/Group_7_Seminar_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CSC14116 - Group 7 - Parallel Collaborative Filtering Recommender System
**Week 1 (6/9/2025 – 6/14/2025)**  
**Member**: Đăng Hoàn Mỹ - 191272216  
**Project**: User-user Neighborhood-based Collaborative Filtering (NBCF) Recommender System using MovieLens 100K dataset.  
**Objective**: Build a movie recommender system with sequential (V1), Numba (V2), CUDA (V3), and CUDA with shared memory (V4) implementations, targeting a 10× speedup and MAE < 1.2.

**References**
* MovieLens Datasets: https://grouplens.org/datasets/movielens/
* Viblo Tutorial: Basics of Collaborative Filtering.
* Machine Learning Cơ Bản: NBCF with MovieLens examples.
* Lei Mao’s Blog: Cosine Similarity vs. Pearson Correlation.

## 1. Environment Setup
Set up Google Colab with the required libraries (`pandas`, `numpy`, `scipy`, `scikit-learn`, `numba`) and mount Google Drive for data storage. Keeping the data on Drive makes runs reproducible across sessions, and a Colab GPU runtime provides the hardware needed for the later CUDA implementations (V3, V4).

In [1]:
import pandas as pd
import numpy as np
import scipy.sparse as sp
from sklearn.metrics.pairwise import cosine_similarity
from numba import jit, prange, cuda
import time
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# Verify environment
import numba
print("Numba version:", numba.__version__)  # Check compatibility (e.g., 0.60.0)
!nvcc --version  # Print the installed CUDA toolkit version
!nvidia-smi  # List the attached GPU (e.g., T4); fails if no GPU runtime is attached

Numba version: 0.60.0
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Jun__6_02:18:23_PDT_2024
Cuda compilation tools, release 12.5, V12.5.82
Build cuda_12.5.r12.5/compiler.34385749_0
/bin/bash: line 1: nvidia-smi: command not found
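
The `nvidia-smi: command not found` message above usually means the notebook is running on a CPU-only runtime, even though `nvcc` is present in the image. A minimal sketch of a check that reports missing pieces without raising is shown below; the helper name `check_environment` and its default package and tool lists are illustrative assumptions, not part of the notebook.

```python
import importlib.util
import shutil

def check_environment(packages=("numba", "numpy", "scipy"),
                      tools=("nvcc", "nvidia-smi")):
    """Return {name: bool} saying which packages and CLI tools are visible."""
    report = {}
    for name in packages:
        # find_spec returns None when a top-level package is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        report[tool] = shutil.which(tool) is not None
    return report

report = check_environment()
for name, found in report.items():
    print(f"{name:12s} {'found' if found else 'MISSING'}")
if not report["nvidia-smi"]:
    print("No GPU driver tool found; the CUDA versions (V3, V4) likely need a GPU runtime.")
```

Numba also exposes `numba.cuda.is_available()`, which can serve as a second check once the package is imported.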


## 2. Understanding the NBCF Algorithm

### Overview

User-user Neighborhood-based Collaborative Filtering (NBCF) predicts a user’s movie ratings from the ratings of similar users. It suits MovieLens 100K (943 users, 1682 movies) because there are fewer users than items, so the user-user similarity matrix (943×943) is cheaper to compute than the item-item one (1682×1682).

### Steps
1. **Load Dataset**: Create user-item matrix `R` (943×1682) from MovieLens 100K.
2. **Normalization**: Mean-center ratings to remove user bias, producing `R_norm`.
3. **Similarity Computation**: Compute user-user cosine similarities (V1: sequential, V2: Numba, V3: CUDA, V4: CUDA with shared memory).
4. **K-Nearest Neighbors (K-NN)**: Select 20 most similar users per user.
5. **Recommendation**: Predict ratings for unrated movies, recommend top-10.
6. **Evaluation**: Compute MAE (<1.2) and Precision@10 (~4%) on `u1.test`.
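
Steps 4 and 5 can be sketched on a toy matrix as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the prediction rule (user mean plus a similarity-weighted average of neighbors' mean-centered ratings) is the commonly used NBCF formula, neighbors are restricted to users who rated the target item, and the function name `predict_rating`, the toy ratings, and the hand-set similarity values are all hypothetical.

```python
import numpy as np

def predict_rating(R, sim, u, i, k=20):
    """Predict user u's rating of item i from the k most similar users who rated i.

    R   : dense ratings array, 0 meaning "unrated" (toy stand-in for the CSR matrix)
    sim : precomputed user-user similarity matrix
    """
    rated = R > 0
    # Per-user mean over rated items only (step 2: normalization).
    means = R.sum(axis=1) / np.maximum(rated.sum(axis=1), 1)
    # Candidate neighbors: other users who rated item i.
    candidates = [v for v in range(R.shape[0]) if v != u and rated[v, i]]
    # Step 4: keep the k candidates most similar to u.
    neighbors = sorted(candidates, key=lambda v: sim[u, v], reverse=True)[:k]
    weights = np.array([sim[u, v] for v in neighbors])
    denom = np.abs(weights).sum()
    if denom == 0:
        return float(means[u])  # no usable neighbors: fall back to the user's mean
    # Step 5: similarity-weighted average of the neighbors' mean-centered ratings.
    deviations = np.array([R[v, i] - means[v] for v in neighbors])
    return float(means[u] + weights @ deviations / denom)

# Hypothetical toy data: 3 users x 3 items, 0 = unrated.
R = np.array([[5, 4, 0],
              [4, 5, 2],
              [1, 2, 5]], dtype=float)
# Hand-set similarities, for illustration only.
sim = np.array([[ 1.0,  0.9, -0.8],
                [ 0.9,  1.0, -0.5],
                [-0.8, -0.5,  1.0]])

pred = predict_rating(R, sim, u=0, i=2, k=2)
print(round(pred, 2))  # 2.52
```

Recommending the top-10 movies then amounts to calling such a function for every unrated item of a user and keeping the ten highest predictions.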

### Cosine Similarity

For users `u` and `v`, cosine similarity is:

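$$ \text{sim}(u,v) = \frac{R_{\text{norm},u} \cdot R_{\text{norm},v}}{\lVert R_{\text{norm},u} \rVert \, \lVert R_{\text{norm},v} \rVert} $$

where $R_{\text{norm},u}$ is the mean-centered rating vector of user $u$ and $\lVert \cdot \rVert$ is the Euclidean norm. This measures how similar two users' rating patterns are while ignoring the magnitude of the vectors.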
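
A small NumPy sketch of the formula, assuming a dense toy matrix in which 0 marks an unrated item; the helper names `mean_center` and `cosine_similarity_matrix` and the toy ratings are illustrative assumptions rather than the notebook's V1 code.

```python
import numpy as np

def mean_center(R):
    """Subtract each user's mean rating from their rated items; unrated stay 0."""
    rated = R > 0
    counts = np.maximum(rated.sum(axis=1), 1)  # avoid division by zero
    means = R.sum(axis=1) / counts
    R_norm = np.where(rated, R - means[:, None], 0.0)
    return R_norm, means

def cosine_similarity_matrix(R_norm):
    """Cosine similarity between every pair of rows of R_norm."""
    norms = np.linalg.norm(R_norm, axis=1)
    norms[norms == 0] = 1.0  # all-zero rows end up with similarity 0
    unit = R_norm / norms[:, None]
    return unit @ unit.T

# Hypothetical toy data: 3 users x 3 items, 0 = unrated.
R = np.array([[5, 4, 0],
              [4, 5, 2],
              [1, 2, 5]], dtype=float)
R_norm, means = mean_center(R)
sim = cosine_similarity_matrix(R_norm)
print(np.round(sim, 3))
```

The result is symmetric with ones on the diagonal, so an implementation only needs to compute the upper triangle.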

## 3. Dataset Description

### MovieLens 100K
* Source: https://grouplens.org/datasets/movielens/100k/
* Files:
    * `u.data`: 100,000 ratings (tab-separated, columns: `user_id`, `item_id`, `rating`, `timestamp`).
    * `u.item`: 1682 movies (pipe-separated, columns: `item_id`, `title`, ...; use first two).
    * `u1.test`: ~20,000 test ratings (same format as `u.data`).
* Stats:
    * Users: 943
    * Movies: 1682
    * Ratings: ~100,000 (1–5 scale)
    * Sparsity: ~6.3% non-zero entries ($\frac{100{,}000}{943 \times 1682} \approx 0.063$)
* Relevance: Ideal for user-user NBCF due to fewer users than movies, reducing similarity matrix size (943×943 vs. 1682×1682).


### Why Sparse Matrix?

The user-item matrix `R` (943×1682) has ~6.3% non-zero entries, so dense storage (~12.7 MB with 8-byte values) spends most of its space on zeros. A sparse CSR (Compressed Sparse Row) matrix stores only the rated entries and reduces memory usage to ~1.2 MB, which keeps the data small for the CUDA versions (V3, V4) on Colab’s T4 GPU (16 GB VRAM).
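
The memory figures can be checked with a short calculation. The sketch assumes 8-byte values and 4-byte (`int32`) CSR indices; actual sizes depend on the dtypes SciPy picks for a given matrix.

```python
n_users, n_items, nnz = 943, 1682, 100_000
value_bytes = 8  # assumed: float64 or int64 values
index_bytes = 4  # assumed: int32 column indices and row pointers

# Dense: one value per (user, item) cell.
dense_bytes = n_users * n_items * value_bytes

# CSR: one value and one column index per rating, plus n_users + 1 row pointers.
csr_bytes = nnz * value_bytes + nnz * index_bytes + (n_users + 1) * index_bytes

print(f"dense: {dense_bytes / 1e6:.1f} MB")  # dense: 12.7 MB
print(f"CSR:   {csr_bytes / 1e6:.1f} MB")    # CSR:   1.2 MB
```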

## Step 1: Load Data and Create User-Item Matrix
Load MovieLens 100K (`u.data`, `u.item`) into pandas DataFrames, create sparse CSR matrix `R` (943×1682), and save to Google Drive for reuse.

In [3]:
# Load data
data_url = 'https://files.grouplens.org/datasets/movielens/ml-100k/u.data'
item_url = 'https://files.grouplens.org/datasets/movielens/ml-100k/u.item'
ratings = pd.read_csv(data_url, sep='\t', names=['user_id', 'item_id', 'rating', 'timestamp'])
movies = pd.read_csv(item_url, sep='|', encoding='latin-1', usecols=[0, 1], names=['item_id', 'title'])
print("Ratings shape:", ratings.shape)  # (100000, 4)
print("Movies shape:", movies.shape)  # (1682, 2)

Ratings shape: (100000, 4)
Movies shape: (1682, 2)


In [4]:
# Save to Drive
ratings.to_csv('/content/drive/MyDrive/2025/HK3/LTSSUD/Data/ml-100k_ratings.csv', index=False)
movies.to_csv('/content/drive/MyDrive/2025/HK3/LTSSUD/Data/ml-100k_movies.csv', index=False)

In [5]:
# Create user-item matrix
n_users, n_items = 943, 1682
R = sp.csr_matrix((ratings['rating'], (ratings['user_id'] - 1, ratings['item_id'] - 1)), shape=(n_users, n_items))
print("User-item matrix shape:", R.shape, "Non-zero entries:", R.nnz)  # (943, 1682), ~100000
print("Sparsity:", R.nnz / (n_users * n_items))  # ~0.063
# Save in SciPy's sparse .npz format; np.save would pickle the matrix as an object array
sp.save_npz('/content/drive/MyDrive/2025/HK3/LTSSUD/Data/R_sparse.npz', R)

User-item matrix shape: (943, 1682) Non-zero entries: 100000
Sparsity: 0.06304669364224531
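
To make the CSR layout concrete, the sketch below builds the three CSR arrays by hand from a few `(user_id, item_id, rating)` triplets, shifting the 1-based MovieLens IDs to 0-based indices as the cell above does with `- 1`. The function name `triplets_to_csr` and the toy triplets are hypothetical; `sp.csr_matrix` does this work internally.

```python
def triplets_to_csr(triplets, n_rows):
    """Build CSR arrays from (user_id, item_id, rating) triplets with 1-based IDs."""
    # Shift to 0-based indices and sort by (row, column).
    rows = sorted((u - 1, i - 1, r) for u, i, r in triplets)
    data = [r for _, _, r in rows]      # rating values, row by row
    indices = [i for _, i, _ in rows]   # column index of each value
    indptr = [0] * (n_rows + 1)         # indptr[u] = start of row u in data
    for u, _, _ in rows:
        indptr[u + 1] += 1
    for u in range(n_rows):
        indptr[u + 1] += indptr[u]
    return data, indices, indptr

toy = [(1, 1, 5), (1, 3, 4), (2, 2, 3), (3, 1, 1), (3, 3, 2)]
data, indices, indptr = triplets_to_csr(toy, n_rows=3)
print(data)     # [5, 4, 3, 1, 2]
print(indices)  # [0, 2, 1, 0, 2]
print(indptr)   # [0, 2, 3, 5]

# Ratings of user 3 (row 2) are the slice between consecutive row pointers.
print(data[indptr[2]:indptr[3]])  # [1, 2]
```

Because each user's ratings sit in one contiguous slice, row-wise operations such as per-user means and user-user dot products map naturally onto this layout.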


In [6]:
# Load test data (for later use)
test_url = 'https://files.grouplens.org/datasets/movielens/ml-100k/u1.test'
test_ratings = pd.read_csv(test_url, sep='\t', names=['user_id', 'item_id', 'rating', 'timestamp'])
test_ratings.to_csv('/content/drive/MyDrive/2025/HK3/LTSSUD/Data/ml-100k_test.csv', index=False)
print("Test ratings shape:", test_ratings.shape)  # (~20000, 4)

Test ratings shape: (20000, 4)
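
The test ratings feed the two evaluation metrics named in the plan, MAE and Precision@10. A minimal sketch of both follows; the function names, the toy numbers, and the notion of a "relevant" item (for example, a test rating at or above some threshold) are assumptions, since the notebook does not define them at this point.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted ratings."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_at_k(recommended, relevant, k=10):
    """Fraction of the top-k recommended items that are in the relevant set."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k

# Hypothetical values, for illustration only.
mae = mean_absolute_error([4, 3, 5, 2], [3.5, 3, 4, 3])
print(mae)  # 0.625

p_at_5 = precision_at_k([10, 20, 30, 40, 50], relevant={20, 50, 70}, k=5)
print(p_at_5)  # 0.4
```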


In [7]:
print("Sample ratings (user 1):", R[0].toarray()[0, :5])  # First 5 items

Sample ratings (user 1): [5 3 4 3 3]
