<a href="https://colab.research.google.com/github/Ernesto16/AI-Saturdays/blob/master/Week%209%20Assessment/Week_9_Assessment_Recommender_System.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 📘 Week 9 Assessment: Recommender System

- **Dataset**: [Goodbooks-10k Ratings Dataset](https://github.com/zygmuntz/goodbooks-10k)
- **Approach Used**: Collaborative Filtering (Matrix Factorization using TruncatedSVD)
- **Evaluation Metric**: RMSE
- **Libraries**: pandas, numpy, scikit-learn, scipy

### **Model Summary**
- Converted the dataset into a user–item rating matrix.
- Used `TruncatedSVD` to factorize the matrix into latent user and item features.
- Predicted missing ratings by taking dot products of user and item factors.
- Evaluated predictions on the test split using **Root Mean Squared Error (RMSE)**.
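
The factorization and dot-product steps listed above can be sketched on a toy matrix. This is a minimal illustration, not the notebook's actual run: the rating values and `n_components=2` are assumptions chosen to keep the example small.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Toy 4-user x 5-item rating matrix (illustrative values, not from Goodbooks-10k).
# Zeros stand for "not rated".
toy_ratings = csr_matrix(np.array([
    [5, 4, 0, 1, 0],
    [4, 0, 0, 1, 1],
    [1, 1, 0, 5, 4],
    [0, 1, 5, 4, 0],
], dtype=float))

svd = TruncatedSVD(n_components=2, random_state=0)
user_factors = svd.fit_transform(toy_ratings)   # shape (n_users, k)
item_factors = svd.components_.T                # shape (n_items, k)

# The predicted score for every (user, item) pair is a dot product of factors.
predicted = user_factors @ item_factors.T       # shape (n_users, n_items)
print(predicted.shape)
print(np.round(predicted, 2))
```

Each row of `predicted` holds one user's scores for all items, including the ones that user never rated.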

### **Results**
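- RMSE ≈ 2.7938 on the 20% test split (printed by the evaluation cell). This is high for a 1–5 rating scale; a likely cause is that `TruncatedSVD` treats unrated entries as zeros, which pulls predicted scores toward 0 before they are clipped to the 1–5 range.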
- Generated top-10 book recommendations for a sample user.
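
For reference, RMSE is the square root of the mean squared difference between true and predicted ratings. The sketch below works through that arithmetic; the helper name `rmse` and the four example ratings are hypothetical and are not taken from the dataset.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical ratings and predictions, chosen so the arithmetic is easy to follow:
# errors are 1, 0, 2, 0, so the mean squared error is (1 + 0 + 4 + 0) / 4 = 1.25.
example_true = [5, 3, 4, 1]
example_pred = [4, 3, 2, 1]
print(rmse(example_true, example_pred))
```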

### **Submission Details**
- **Approach implemented:** Collaborative Filtering  
- **Evaluation metric:** RMSE  
- **Dataset link:** [Goodbooks-10k Ratings](https://github.com/zygmuntz/goodbooks-10k)  
- **Code uploaded to GitHub repository:** [Ernesto16/AI-Saturdays](https://github.com/Ernesto16/AI-Saturdays/tree/master)

In [3]:
# STEP 1: Install dependencies (no 'surprise' needed)
!pip install pandas numpy scikit-learn scipy --quiet

# STEP 2: Import required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics import mean_squared_error
from math import sqrt
from scipy.sparse import csr_matrix

In [4]:
# Using the Goodbooks-10k dataset (ratings.csv) directly from GitHub (no download needed)
url = "https://raw.githubusercontent.com/zygmuntz/goodbooks-10k/master/ratings.csv"
df = pd.read_csv(url)
df.head()

   user_id  book_id  rating
0        1      258       5
1        2     4081       4
2        2      260       5
3        2     9296       5
4        2     2318       3


In [5]:
# Quick summary of dataset
print("Shape of dataset:", df.shape)
print("Columns:", df.columns.tolist())
print(df.describe())
print(df.isnull().sum())

Shape of dataset: (5976479, 3)
Columns: ['user_id', 'book_id', 'rating']
            user_id       book_id        rating
count  5.976479e+06  5.976479e+06  5.976479e+06
mean   2.622446e+04  2.006477e+03  3.919866e+00
std    1.541323e+04  2.468499e+03  9.910868e-01
min    1.000000e+00  1.000000e+00  1.000000e+00
25%    1.281300e+04  1.980000e+02  3.000000e+00
50%    2.593800e+04  8.850000e+02  4.000000e+00
75%    3.950900e+04  2.973000e+03  5.000000e+00
max    5.342400e+04  1.000000e+04  5.000000e+00
user_id    0
book_id    0
rating     0
dtype: int64


In [6]:
# Keep only the most active users and most-rated books to limit matrix size
max_users = 5000
max_items = 5000
top_users = df['user_id'].value_counts().nlargest(max_users).index
top_items = df['book_id'].value_counts().nlargest(max_items).index

df_small = df[df['user_id'].isin(top_users) & df['book_id'].isin(top_items)].copy()
print("Reduced dataset size:", df_small.shape)

# Map raw user/book IDs to contiguous row/column indices
user_ids = df_small['user_id'].unique()
item_ids = df_small['book_id'].unique()
user_map = {u: i for i, u in enumerate(user_ids)}
item_map = {b: j for j, b in enumerate(item_ids)}

# Build the full sparse ratings matrix (all interactions, train and test).
# It is not used for training; the model is fit on R_train.
rows = df_small['user_id'].map(user_map)
cols = df_small['book_id'].map(item_map)
vals = df_small['rating'].astype(float)
R = csr_matrix((vals, (rows, cols)), shape=(len(user_ids), len(item_ids)))

Reduced dataset size: (735210, 3)


In [7]:
# Split into train and test interactions
train_df, test_df = train_test_split(df_small, test_size=0.2, random_state=42)

train_rows = train_df['user_id'].map(user_map)
train_cols = train_df['book_id'].map(item_map)
train_vals = train_df['rating'].astype(float)
R_train = csr_matrix((train_vals, (train_rows, train_cols)), shape=(len(user_ids), len(item_ids)))

In [8]:
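# Train a collaborative filtering model using matrix factorization.
# Entries absent from R_train are implicit zeros as far as TruncatedSVD is concerned.
n_components = 50
svd = TruncatedSVD(n_components=n_components, random_state=42)
user_factors = svd.fit_transform(R_train)  # shape (n_users, n_components)
item_factors = svd.components_.T           # shape (n_items, n_components)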

In [9]:
# Predict function
def predict_single(user_raw_id, item_raw_id):
    """Return the predicted score for one (user, book) pair, or NaN if either ID is unknown."""
    if (user_raw_id not in user_map) or (item_raw_id not in item_map):
        return np.nan
    u = user_map[user_raw_id]
    i = item_map[item_raw_id]
    return float(user_factors[u].dot(item_factors[i]))

# Evaluate RMSE on test set
y_true = []
y_pred = []

for _, row in test_df.iterrows():
    true_r = row['rating']
    pred_r = predict_single(row['user_id'], row['book_id'])
    if not np.isnan(pred_r):
        y_true.append(true_r)
        y_pred.append(pred_r)

y_pred = np.clip(y_pred, 1, 5)
rmse = sqrt(mean_squared_error(y_true, y_pred))
print(f"✅ RMSE on test set: {rmse:.4f}")

✅ RMSE on test set: 2.7938
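
The row-by-row evaluation loop can be slow on several hundred thousand test interactions. Below is a vectorized sketch of the same predict-and-clip step, assuming the `user_map` / `item_map` / factor-array layout used in this notebook. The helper `batch_predict` and all toy values are hypothetical, introduced only for illustration.

```python
import numpy as np
import pandas as pd

def batch_predict(pairs, user_map, item_map, user_factors, item_factors):
    """Return clipped predictions for known (user, book) pairs plus a validity mask."""
    u_idx = pairs['user_id'].map(user_map)
    i_idx = pairs['book_id'].map(item_map)
    # Pairs with an unmapped user or book get NaN from .map and are skipped.
    known = (u_idx.notna() & i_idx.notna()).to_numpy()
    u = u_idx[known].astype(int).to_numpy()
    i = i_idx[known].astype(int).to_numpy()
    # Row-wise dot product of the selected user and item factor rows.
    scores = np.einsum('ij,ij->i', user_factors[u], item_factors[i])
    return np.clip(scores, 1, 5), known

# Toy inputs (hypothetical): 2 users, 3 books, 2 latent factors.
toy_user_map = {10: 0, 20: 1}
toy_item_map = {7: 0, 8: 1, 9: 2}
toy_user_factors = np.array([[1.0, 2.0], [0.5, 0.0]])
toy_item_factors = np.array([[1.0, 1.0], [2.0, 0.0], [0.0, 4.0]])
toy_pairs = pd.DataFrame({
    'user_id': [10, 20, 99],   # user 99 is unknown and gets masked out
    'book_id': [7, 9, 8],
    'rating':  [3, 1, 4],
})

preds, known = batch_predict(toy_pairs, toy_user_map, toy_item_map,
                             toy_user_factors, toy_item_factors)
y_true = toy_pairs.loc[known, 'rating'].to_numpy(dtype=float)
toy_rmse = float(np.sqrt(np.mean((y_true - preds) ** 2)))
print(preds, known, toy_rmse)
```

With the notebook's objects, the call would be `batch_predict(test_df, user_map, item_map, user_factors, item_factors)`; it should give the same RMSE as the loop, since the dot products and clipping are unchanged.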


In [10]:
# Recommend top N items for a sample user
def recommend_top_n(user_raw_id, N=10):
    if user_raw_id not in user_map:
        return []
    u = user_map[user_raw_id]
    scores = item_factors.dot(user_factors[u])
    rated_items = set(df_small[df_small['user_id']==user_raw_id]['book_id'])
    candidates = [(item_ids[i], scores[i]) for i in range(len(item_ids)) if item_ids[i] not in rated_items]
    top = sorted(candidates, key=lambda x: x[1], reverse=True)[:N]
    return top

sample_user = df_small['user_id'].iloc[0]
print(f"Top 10 recommendations for user {sample_user}:")
print(recommend_top_n(sample_user, 10))

Top 10 recommendations for user 75:
[(np.int64(101), np.float64(2.4085453901420597)), (np.int64(33), np.float64(2.281399753744734)), (np.int64(119), np.float64(2.004813321938901)), (np.int64(45), np.float64(1.9853508401828814)), (np.int64(60), np.float64(1.9333812321608446)), (np.int64(30), np.float64(1.735070444057039)), (np.int64(7), np.float64(1.5111353754849992)), (np.int64(167), np.float64(1.4403392319159622)), (np.int64(117), np.float64(1.4249269302142487)), (np.int64(111), np.float64(1.4148947244683385))]
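
The raw list above is hard to read because each value is wrapped in `np.int64(...)` or `np.float64(...)`. One way to tidy it, sketched here under the assumption that the input is a list of `(book_id, score)` tuples as returned by `recommend_top_n`, is to convert it to a small DataFrame. The helper name `recommendations_to_frame` is hypothetical.

```python
import pandas as pd

def recommendations_to_frame(recs):
    """Convert [(book_id, score), ...] into a ranked DataFrame with plain Python types."""
    rows = [(int(book_id), float(score)) for book_id, score in recs]
    frame = pd.DataFrame(rows, columns=['book_id', 'score'])
    frame.insert(0, 'rank', list(range(1, len(frame) + 1)))
    return frame

# First three pairs copied from the output above, scores rounded to 4 decimals for brevity.
sample_recs = [(101, 2.4085), (33, 2.2814), (119, 2.0048)]
sample_frame = recommendations_to_frame(sample_recs)
print(sample_frame.to_string(index=False))
```

In the notebook this would be called as `recommendations_to_frame(recommend_top_n(sample_user, 10))`.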
