# Group 17
### Eyad Medhat 221100279 / Hady Aly 221101190 / Mohamed Mahfouz 221101743 / Omar Mady 221100745

## Part 3. Item-Based Collaborative Filtering System

### Calculating Item Similarity

We calculate the pairwise similarity between items using Cosine Similarity on the columns of the $R_{centered}$ matrix. Because each rating is centered by its user's mean, this is the adjusted cosine similarity, which compensates for users who rate on different scales. (Pearson correlation between items would instead center each item by its own mean rating.)
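As a small illustration of the idea, the sketch below computes cosine similarity between the item columns of a user-mean-centered matrix. The toy ratings and the names `R_toy`, `R_c`, and `item_sim` are made up for this example and are not part of the notebook's pipeline.

```python
import numpy as np

# Toy ratings: rows are users, columns are items; 0 means "not rated".
R_toy = np.array([
    [5.0, 3.0, 0.0],
    [4.0, 0.0, 2.0],
    [1.0, 5.0, 4.0],
])

rated = R_toy > 0
# Per-user mean over rated items only.
user_mean = R_toy.sum(axis=1) / rated.sum(axis=1)
# Subtract each user's mean from that user's rated entries; unrated stay 0.
R_c = np.where(rated, R_toy - user_mean[:, None], 0.0)

# Cosine similarity between item columns of the centered matrix.
norms = np.linalg.norm(R_c, axis=0)
norms[norms == 0] = 1e-9  # guard against items with no usable ratings
item_sim = (R_c.T @ R_c) / np.outer(norms, norms)
np.fill_diagonal(item_sim, 0.0)  # an item is not its own neighbor

print(np.round(item_sim, 3))
```

In this toy case items 0 and 1 come out negatively related, because the users who rated item 0 above their own average rated item 1 below it.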

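### Prediction Mechanism

For a given user $u$ and target item $i$, we predict the rating by adding to the user's mean $\mu_u$ a similarity-weighted average of the deviations $r_{uj} - \mu_u$ over the most similar items $j$ that the user has already rated.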

In [9]:
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, diags
from scipy.sparse.linalg import svds
import os
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import time

# Set random seed for reproducibility
np.random.seed(42)

RESULTS_DIR = "../results"
if not os.path.exists(RESULTS_DIR):
    os.makedirs(RESULTS_DIR)
    print(f"Created results directory: {RESULTS_DIR}")
else:
    print(f"Results directory exists: {RESULTS_DIR}")
    
print("Libraries imported successfully.")

Results directory exists: ../results
Libraries imported successfully.


## 1. Data Loading

In [10]:
file_path = '../data/Amazon_health&household_label_encoded.csv'
print(f"\n[Step 1] Loading data from: {file_path}...")
start_time = time.time()
df = pd.read_csv(file_path)
elapsed = time.time() - start_time
print(f"Data loaded in {elapsed:.4f} seconds.")

# Display first few rows
print(f"Data Shape: {df.shape}")
print("First 5 rows:")
display(df.head())


[Step 1] Loading data from: ../data/Amazon_health&household_label_encoded.csv...
Data loaded in 0.0728 seconds.
Data Shape: (14554, 7)
First 5 rows:


| | user_id | item | item_id_encoded | rating | price | text | is_green |
|---|---|---|---|---|---|---|---|
| 0 | AEV4DP5E3FJH6FHDLXUYQTDEQYCQ | Sonic Handheld Percussion Massage Gun - Deep T... | 827 | 2 | 79.99 | This product worked great when it worked, but ... | 1 |
| 1 | AEHWUHNTB5FX32HJ7UBOZ2WWUX3Q | DUDE Wipes On-The-Go Flushable Wet Wipes - 1 P... | 255 | 5 | 6.48 | These are amazing for travel or for keeping in... | 1 |
| 2 | AEHWUHNTB5FX32HJ7UBOZ2WWUX3Q | Sleep Mask for Side Sleeper, 100% Blackout 3D ... | 814 | 5 | 12.69 | These are great! My other sleep mask pressed o... | 0 |
| 3 | AEHWUHNTB5FX32HJ7UBOZ2WWUX3Q | Cottonelle Freshfeel Flushable Wet Wipes, Adul... | 226 | 5 | 15.79 | These are really good quality. Do not tear lik... | 1 |
| 4 | AEHWUHNTB5FX32HJ7UBOZ2WWUX3Q | Silk Sleep Eye Mask for Men Women, Comfortable... | 808 | 5 | 8.88 | Great for travel. Super soft and silky. Has ad... | 0 |


## 2. Preprocessing: Matrix Construction & Normalization

### 2.1 Interaction Matrix Construction

We transform the raw interaction logs into a sparse User-Item Matrix ($R$), where rows correspond to unique users and columns to unique items.
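A minimal sketch of this construction on a toy log is shown below. The column names mirror the dataset, but the rows and the names `logs`, `user_pos`, and `item_pos` are hypothetical and only serve the illustration.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical toy log; column names mirror the dataset used here.
logs = pd.DataFrame({
    "user_id": ["u2", "u1", "u2", "u3"],
    "item_id_encoded": [10, 10, 30, 20],
    "rating": [4, 5, 2, 3],
})

# Map raw IDs to contiguous row/column positions (sorted for determinism).
user_pos = {u: i for i, u in enumerate(sorted(logs["user_id"].unique()))}
item_pos = {it: i for i, it in enumerate(sorted(logs["item_id_encoded"].unique()))}

rows = logs["user_id"].map(user_pos).to_numpy()
cols = logs["item_id_encoded"].map(item_pos).to_numpy()
vals = logs["rating"].to_numpy(dtype=float)

# Only the observed (user, item) pairs are stored.
R_toy = csr_matrix((vals, (rows, cols)), shape=(len(user_pos), len(item_pos)))
print(R_toy.toarray())
```

Each row of the printed array is one user and each column one item; a 0 marks a pair with no rating.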

### 2.2 User Bias Correction

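We apply **Mean-Centering** to the matrix. By computing the average rating ($\mu_u$) of every user over the items they rated and subtracting it from those ratings, we isolate each user's preference relative to their own baseline. Without this step, a user who rates everything high and a user who rates everything low would distort the item similarities even though neither expresses a relative preference.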
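The sketch below illustrates mean-centering on a tiny sparse matrix, assuming that only stored entries count as ratings. The matrix and variable names are made up for the example; it is one possible way to do the subtraction in place on the CSR data array.

```python
import numpy as np
from scipy.sparse import csr_matrix

R_toy = csr_matrix(np.array([
    [5.0, 3.0, 0.0],
    [4.0, 0.0, 0.0],
    [0.0, 2.0, 4.0],
]))

# Mean over stored (rated) entries only, not over the whole row.
counts = np.diff(R_toy.indptr)
sums = np.asarray(R_toy.sum(axis=1)).ravel()
means = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)

# CSR stores data row by row, so repeating each mean by its row count
# lines the means up with the stored ratings.
centered = R_toy.copy()
centered.data -= np.repeat(means, counts)

print(means)
print(centered.toarray())
```

Note that the second user, who has a single rating, ends up with a deviation of exactly 0. Users with only one rating therefore contribute nothing to centered item similarities.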

In [11]:
def build_user_item_matrix(df_interactions):
    """
    Constructs sparse User-Item Rating Matrix (Rows=Users, Cols=Items).
    Returns: Matrix R, user_map (id->idx), item_map (id->idx)
    """
    print("\n[Step 2a] Building User-Item Matrix...")
    start_time = time.time()
    
    user_col = 'user_id'
    item_col = 'item_id_encoded'
    rating_col = 'rating'
    
    # Unique IDs
    users = sorted(df_interactions[user_col].unique())
    items = sorted(df_interactions[item_col].unique())
    print(f"Found {len(users)} unique users and {len(items)} unique items.")
    
    user_map = {u: i for i, u in enumerate(users)}
    item_map = {it: i for i, it in enumerate(items)}
    
    # Map Data
    row_indices = df_interactions[user_col].map(user_map)
    col_indices = df_interactions[item_col].map(item_map)
    ratings = df_interactions[rating_col].values
    
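    # Build CSR. Note: csr_matrix sums duplicate (row, col) entries, so repeated
    # (user, item) pairs in the input are added together rather than overwritten.
    R = csr_matrix((ratings, (row_indices, col_indices)), shape=(len(users), len(items)))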
    elapsed = time.time() - start_time
    
    print(f"Matrix Built in {elapsed:.4f}s. Shape: {R.shape} with {R.nnz} ratings.")
    print(f"Sparsity: {1.0 - R.nnz / (R.shape[0] * R.shape[1]):.6%}")
    
    return R, user_map, item_map

def center_ratings_by_user(R):
    """
    Subtracts row mean from non-zero elements.
    Returns: Centered Matrix R_centered, User Means vector
    """
    print("\n[Step 2b] Mean-Centering Ratings (Bias Removal)...")
    start_time = time.time()
    
    row_sums = np.array(R.sum(axis=1)).flatten()
    row_counts = np.diff(R.indptr)
    
    with np.errstate(divide='ignore', invalid='ignore'):
        user_means = row_sums / row_counts
    user_means[~np.isfinite(user_means)] = 0.0
    
    print(f"User Means Computed. Example means: {user_means[:5]}")
    
    # Create Centered Matrix
    R_coo = R.tocoo()
    rows = R_coo.row
    cols = R_coo.col
    data = R_coo.data
    
    new_data = data - user_means[rows]
    
    R_centered = csr_matrix((new_data, (rows, cols)), shape=R.shape)
    elapsed = time.time() - start_time
    
    print(f"Mean Centering complete in {elapsed:.4f}s.")
    return R_centered, user_means

# Execute
R, user_map, item_map = build_user_item_matrix(df)
R_centered, user_means = center_ratings_by_user(R)

# [Added] Maps for Lookup
inv_user_map = {v: k for k, v in user_map.items()}
inv_item_map = {v: k for k, v in item_map.items()}
# title map from df
unique_items = df[['item_id_encoded', 'item']].drop_duplicates()
item_title_map = dict(zip(unique_items.item_id_encoded, unique_items.item))



[Step 2a] Building User-Item Matrix...
Found 10000 unique users and 1000 unique items.
Matrix Built in 0.0106s. Shape: (10000, 1000) with 14346 ratings.
Sparsity: 99.856540%

[Step 2b] Mean-Centering Ratings (Bias Removal)...
User Means Computed. Example means: [5. 5. 5. 5. 5.]
Mean Centering complete in 0.0020s.


## 3. Item-Based Collaborative Filtering

### Subtask 4: Compute item–item similarity

We calculate the Cosine Similarity between distinct **columns** of $R_{centered}$.
Since $R_{centered}$ has user biases removed, this is the adjusted cosine similarity (Pearson correlation would center by item means instead).

### Subtask 6 & 7: Predict ratings and Handle Edge Cases

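We predict the rating of user $u$ on item $i$ as the user's mean plus a similarity-weighted average of the user's deviations on the $k$ most similar items they have rated, $N_k(i; u)$:
$$ \hat{r}_{ui} = \mu_u + \frac{\sum_{j \in N_k(i;u)} s_{ij}\,(r_{uj} - \mu_u)}{\sum_{j \in N_k(i;u)} |s_{ij}|} $$
If the user has no ratings, or none of the rated items is a neighbor of $i$, the prediction falls back to $\mu_u$. The result is clipped to $[1, 5]$.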
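To make the arithmetic concrete, here is a minimal sketch of the weighted-deviation prediction for a single user and target item, following the formula in the `predict_item_based_cf` docstring. The helper `predict_toy` and its dictionary inputs are hypothetical simplifications, not the notebook's function.

```python
import numpy as np

def predict_toy(user_ratings, user_mean, sims_to_target, k=2):
    """Weighted-deviation prediction for one user and one target item.

    user_ratings: dict {item_index: rating} of items the user has rated.
    sims_to_target: dict {item_index: similarity to the target item}.
    """
    # Keep only neighbors the user has rated, strongest |similarity| first.
    neighbors = [(sims_to_target[j], r) for j, r in user_ratings.items()
                 if sims_to_target.get(j, 0.0) != 0.0]
    neighbors.sort(key=lambda t: abs(t[0]), reverse=True)
    neighbors = neighbors[:k]

    denom = sum(abs(s) for s, _ in neighbors)
    if denom == 0:
        return user_mean  # no usable neighbors: fall back to the user's mean

    num = sum(s * (r - user_mean) for s, r in neighbors)
    return float(np.clip(user_mean + num / denom, 1.0, 5.0))

# User rated item 0 with 5 and item 1 with 3, so the mean is 4.
pred = predict_toy({0: 5.0, 1: 3.0}, user_mean=4.0,
                   sims_to_target={0: 0.5, 1: 0.25})
print(round(pred, 4))
```

Here the numerator is $0.5 \cdot 1 + 0.25 \cdot (-1) = 0.25$ and the denominator is $0.75$, so the prediction is $4 + 1/3$.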

In [12]:
def compute_item_similarity_matrix(R_centered):
    """
    Computes Cosine Similarity between columns (Items).
    Since ratings are centered by user mean, this is adjusted cosine similarity.
    """
    print("\n[Step 3] Computing Item-Item Similarity matrix...")
    start_time = time.time()
    
    # Transpose so Items are Rows
    M = R_centered.T.tocsr()
    print(f"Transposed Matrix Shape: {M.shape}")
    
    # Compute Norms
    print("Calculating Item Norms...")
    item_norms = np.sqrt(np.array(M.power(2).sum(axis=1)).flatten())
    item_norms[item_norms == 0] = 1e-9
    inv_norms = 1.0 / item_norms
    
    # Normalize
    M_normalized = diags(inv_norms) @ M
    
    # Dot Product
    print("Computing Dot Product (Similarity)...")
    sim_matrix = M_normalized.dot(M_normalized.T)
    
    # Set diagonal
    sim_matrix.setdiag(0.0)
    sim_matrix.eliminate_zeros()
    
    elapsed = time.time() - start_time
    print(f"Similarity Matrix Computed in {elapsed:.4f}s. Shape: {sim_matrix.shape}")
    return sim_matrix

sim_matrix = compute_item_similarity_matrix(R_centered)


[Step 3] Computing Item-Item Similarity matrix...
Transposed Matrix Shape: (1000, 10000)
Calculating Item Norms...
Computing Dot Product (Similarity)...
Similarity Matrix Computed in 0.0025s. Shape: (1000, 1000)


In [13]:
def predict_item_based_cf(user_idx, item_idx, R, sim_matrix, user_means, k=20):
    """
    Predicts rating: mu_u + [Sum(sim * (r - mu_u)) / Sum(|sim|)]
    """
    # This function is called many times, so we minimize print here to avoid flooding
    sim_row = sim_matrix.getrow(item_idx)
    user_row = R.getrow(user_idx)
    
    rated_indices = user_row.indices
    rated_values = user_row.data
    
    if len(rated_indices) == 0:
        return user_means[user_idx]
    
    rating_map = {idx: val for idx, val in zip(rated_indices, rated_values)}
    neighbor_indices = sim_row.indices
    neighbor_scores = sim_row.data
    
    candidates = []
    for j, score in zip(neighbor_indices, neighbor_scores):
        if j in rating_map:
            candidates.append((score, rating_map[j]))
            
    candidates.sort(key=lambda x: abs(x[0]), reverse=True)
    top_k = candidates[:k]
    
    if not top_k:
        return user_means[user_idx]
        
    weighted_sum = 0.0
    sum_abs_sim = 0.0
    mu_u = user_means[user_idx]
    
    for score, r_val in top_k:
        dev = r_val - mu_u
        weighted_sum += score * dev
        sum_abs_sim += abs(score)
        
    if sum_abs_sim == 0:
        return mu_u
        
    pred = mu_u + (weighted_sum / sum_abs_sim)
    return max(1.0, min(5.0, pred))

## 4. Evaluation (Item-Based CF)

In [14]:
print("\n[Step 4] Starting Evaluation (Train/Test Split)...")
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
print(f"Train Samples: {len(train_df)}")
print(f"Test Samples: {len(test_df)}")

print("Rebuilding matrices on TRAIN data to avoid leakage...")
start_time = time.time()
R_train, user_map_train, item_map_train = build_user_item_matrix(train_df)
R_train_centered, user_means_train = center_ratings_by_user(R_train)
sim_matrix_train = compute_item_similarity_matrix(R_train_centered)
elapsed = time.time() - start_time
print(f"Matrices rebuilt in {elapsed:.4f}s.")

def evaluate_cf(test_df, R, sim, user_means, user_map, item_map):
    errors = []
    print(f"Evaluating predictions on {len(test_df)} test cases...")
    start_eval = time.time()
    count = 0
    
    for _, row in test_df.iterrows():
        uid = row['user_id']
        iid = row['item_id_encoded']
        true_r = row['rating']
        
        if uid in user_map and iid in item_map:
            u_idx = user_map[uid]
            i_idx = item_map[iid]
            pred = predict_item_based_cf(u_idx, i_idx, R, sim, user_means, k=20)
            errors.append((true_r - pred) ** 2)
        else:
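            # Cold start: user or item unseen in training, so fall back to a
            # hard-coded rating of 4.4 (deriving it from the training data would be more robust)
            errors.append((true_r - 4.4) ** 2)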
        
        count += 1
        if count % 500 == 0:
            print(f"  Processed {count}/{len(test_df)}...")
            
    rmse = np.sqrt(np.mean(errors))
    print(f"Evaluation done in {time.time() - start_eval:.4f}s.")
    return rmse

rmse_cf = evaluate_cf(test_df, R_train, sim_matrix_train, user_means_train, user_map_train, item_map_train)
print(f"\n>>> Item-Based CF (Centered Cosine) RMSE: {rmse_cf:.4f} <<<")

# --- [Added] Generate Top-10 CF Recommendations for Top 5 Active Users ---
print("\n[Step 4.1] Generating Item-Based CF Recommendations for Top 5 Users...")
# Identify Top 5 active users
top_5_users = df['user_id'].value_counts().head(5).index.tolist()

cf_recs = []

for uid in top_5_users:
    if uid in user_map:
        u_idx = user_map[uid]
        # Get rated items to exclude
        rated_indices = R.getrow(u_idx).indices
        rated_set = set(rated_indices)
        
        # Predict for all items
        user_predictions = []
        for i_idx in range(R.shape[1]):
            if i_idx not in rated_set:
                # Use existing function using sim_matrix
                score = predict_item_based_cf(u_idx, i_idx, R, sim_matrix, user_means, k=20)
                user_predictions.append((score, i_idx))
        
        # Sort and pick Top 10
        user_predictions.sort(key=lambda x: x[0], reverse=True)
        top_10 = user_predictions[:10]
        
        for rank, (score, i_idx) in enumerate(top_10, 1):
            item_code = inv_item_map.get(i_idx, "Unknown")
            title = item_title_map.get(item_code, "Unknown Title")
            
            cf_recs.append({
                'User': uid,
                'Rank': rank,
                'Item_ID': item_code,
                'Title': title,
                'Predicted_Rating': round(score, 4)
            })

df_cf_recs = pd.DataFrame(cf_recs)
save_path_cf = "../results/cf_item_based_recommendations.csv"
df_cf_recs.to_csv(save_path_cf, index=False)
print(f"Saved CF Recommendations to {save_path_cf}")
print(df_cf_recs.head())



[Step 4] Starting Evaluation (Train/Test Split)...
Train Samples: 11643
Test Samples: 2911
Rebuilding matrices on TRAIN data to avoid leakage...

[Step 2a] Building User-Item Matrix...
Found 8485 unique users and 997 unique items.
Matrix Built in 0.0080s. Shape: (8485, 997) with 11517 ratings.
Sparsity: 99.863858%

[Step 2b] Mean-Centering Ratings (Bias Removal)...
User Means Computed. Example means: [5.         5.         5.         5.         4.33333333]
Mean Centering complete in 0.0010s.

[Step 3] Computing Item-Item Similarity matrix...
Transposed Matrix Shape: (997, 8485)
Calculating Item Norms...
Computing Dot Product (Similarity)...
Similarity Matrix Computed in 0.0010s. Shape: (997, 997)
Matrices rebuilt in 0.0120s.
Evaluating predictions on 2911 test cases...
  Processed 500/2911...
  Processed 1000/2911...
  Processed 1500/2911...
  Processed 2000/2911...
  Processed 2500/2911...
Evaluation done in 0.1576s.

>>> Item-Based CF (Centered Cosine) RMSE: 1.4285 <<<

[Step 4.1] G

## 5. Latent Factor Model (SVD)

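### 5.1 Dimensionality Reduction via SVD

We use **Truncated SVD** to decompose the centered rating matrix into $k$ latent factors, keeping only the largest singular values. Unobserved entries are treated as zeros, which after centering corresponds to the user's own mean, so the low-rank reconstruction smooths noise and yields scores for items the user has not rated.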
$$ R_{centered} \approx U \cdot \Sigma \cdot V^T $$
The reconstructed rating is estimated as: $\hat{r}_{ui} = (U_u \cdot \Sigma \cdot V_i^T) + \mu_u$
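The sketch below shows the decomposition and a single reconstructed entry on a small random matrix, which stands in for a centered rating matrix. The data, the size, and the names `A`, `A_k`, and `r_hat_24` are assumptions made for the illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
# Hypothetical small "centered" matrix with roughly half the entries missing (0).
dense = rng.normal(size=(8, 6))
dense[rng.random(dense.shape) < 0.5] = 0.0
A = csr_matrix(dense)

k = 3  # must be smaller than min(A.shape)
U, s, Vt = svds(A, k=k)

# Sort factors by descending singular value instead of relying on svds order.
order = np.argsort(s)[::-1]
U, s, Vt = U[:, order], s[order], Vt[order, :]

# Rank-k approximation, and one entry computed directly from the factors.
A_k = U @ np.diag(s) @ Vt
r_hat_24 = U[2] @ np.diag(s) @ Vt[:, 4]  # centered score for user 2, item 4

print(s)
print(np.isclose(r_hat_24, A_k[2, 4]))
```

Adding the user's mean $\mu_u$ to such a centered score gives the predicted rating on the original scale.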

### Scalable Prediction Strategy (Memory Efficiency)

**Problem**: Reconstructing the full dense prediction matrix requires memory proportional to $Users \times Items$, which can exhaust available RAM on larger datasets.

**Solution**: We use a **Row-by-Row Prediction** pattern. We compute the dense prediction vector for one target user at a time, extract the top items, and then discard the vector. Peak memory for prediction is then proportional to the number of items and independent of the number of users.
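A minimal sketch of the row-by-row pattern follows. The helper `top_n_for_user`, the random factors, and the use of `np.argpartition` for selection are illustrative assumptions, not the notebook's implementation; singular values are passed as a 1-D vector `s` here.

```python
import numpy as np

def top_n_for_user(u_idx, U, s, Vt, user_mean, rated, n=3):
    """Score all items for one user and return the n best unrated item indices."""
    # One dense vector of length n_items; no (n_users x n_items) matrix is built.
    scores = (U[u_idx] * s) @ Vt + user_mean
    scores = np.clip(scores, 1.0, 5.0)
    # Exclude items the user has already rated.
    scores[np.array(list(rated), dtype=int)] = -np.inf
    top = np.argpartition(-scores, n - 1)[:n]   # unordered top n
    return top[np.argsort(-scores[top])]        # ordered by descending score

rng = np.random.default_rng(1)
U = rng.normal(size=(4, 2))      # hypothetical user factors
s = np.array([2.0, 1.0])         # hypothetical singular values
Vt = rng.normal(size=(2, 6))     # hypothetical item factors

top = top_n_for_user(0, U, s, Vt, user_mean=3.0, rated={1, 4}, n=3)
print(top)
```

`np.argpartition` avoids sorting every item when only a few top entries are needed, which matters more as the catalog grows.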

In [15]:
def perform_svd(matrix, k):
    print(f"\n[Step 5] Calculating SVD with k={k}...")
    start_time = time.time()
    # Cast to float for svds (astype is available across SciPy versions)
    matrix_f = matrix.astype(np.float64)
    
    U, sigma, Vt = svds(matrix_f, k=k)
    
    # Sort descending
    U = U[:, ::-1]
    sigma = sigma[::-1]
    Vt = Vt[::-1, :]
    
    sigma_diag = np.diag(sigma)
    print(f"SVD Done in {time.time() - start_time:.4f}s.")
    print(f"U: {U.shape}, Sigma: {sigma_diag.shape}, Vt: {Vt.shape}")
    return U, sigma_diag, Vt

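def calculate_rmse_reconstruction(original_centered, U, sigma_diag, Vt):
    # Rank-k approximation of the centered matrix.
    # Note: this materializes the full dense (n_users x n_items) matrix.
    print("Calculating Reconstruction Error...")
    approx = np.dot(np.dot(U, sigma_diag), Vt)
    
    # The error is averaged over every cell, including unobserved (zero) entries,
    # so it is not comparable to the held-out RMSE reported for item-based CF.
    diff = original_centered - approx
    rmse = np.sqrt(np.mean(np.square(diff)))
    return rmse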

# Evaluate both k=10 and k=20
print("\n{:<5} {:<25}".format("k", "RMSE (Reconstruction)"))
print("-"*30)

for k in [10, 20]:
    U, S, Vt = perform_svd(R_centered, k)
    rmse_svd = calculate_rmse_reconstruction(R_centered, U, S, Vt)
    print("{:<5} {:<25.4f}".format(k, rmse_svd))


k     RMSE (Reconstruction)    
------------------------------

[Step 5] Calculating SVD with k=10...
SVD Done in 0.0097s.
U: (10000, 10), Sigma: (10, 10), Vt: (10, 1000)
Calculating Reconstruction Error...
10    0.0235                   

[Step 5] Calculating SVD with k=20...
SVD Done in 0.0202s.
U: (10000, 20), Sigma: (20, 20), Vt: (20, 1000)
Calculating Reconstruction Error...
20    0.0225                   


In [16]:
print("\n[Step 6] Generating efficient SVD Recommendations (k=20)...")

# Efficient SVD Prediction Function (Row-by-Row)
def predict_for_user_svd(u_idx, U, S, Vt, user_mean):
    """
    Computes dense prediction vector for a single user to save memory.
    Result shape: (n_items, )
    P = (User_Latent . Sigma . Vt) + Mean
    """
    # User latent vector: U[u_idx] -> shape (1, k)
    user_vec = U[u_idx, :].reshape(1, -1)
    
    # w = User_Latent . S
    w = np.dot(user_vec, S)
    
    # Preds = w . Vt
    preds = np.dot(w, Vt).flatten() + user_mean
    return np.clip(preds, 1.0, 5.0)

# Use k=20 (lower reconstruction error of the two values tried)
k_best = 20
U, S, Vt = perform_svd(R_centered, k_best)

svd_recs = []

# Reuse Top 5 users
top_5_users = df['user_id'].value_counts().head(5).index.tolist()

print(f"Generating SVD recommendations for {len(top_5_users)} users...")

for uid in top_5_users:
    if uid in user_map:
        u_idx = user_map[uid]
        user_mean = user_means[u_idx]
        
        # Predict all items for this user efficiently
        preds = predict_for_user_svd(u_idx, U, S, Vt, user_mean)
        
        # Accessing R sparse row again to filter out already rated items
        rated_indices = set(R.getrow(u_idx).indices)
        
        # Create list of (score, idx) for unrated items only
        candidates = []
        for i_idx in range(len(preds)):
            if i_idx not in rated_indices:
                candidates.append((preds[i_idx], i_idx))
        
        # Sort by score descending
        candidates.sort(key=lambda x: x[0], reverse=True)
        top_20 = candidates[:20]
        
        for rank, (score, i_idx) in enumerate(top_20, 1):
            item_code = inv_item_map.get(i_idx, "Unknown")
            title = item_title_map.get(item_code, "Unknown Title")
            
            svd_recs.append({
                'User': uid,
                'Rank': rank,
                'Item_ID': item_code,
                'Title': title,
                'Predicted_Rating': round(score, 4),
                'Method': f'SVD (k={k_best})'
            })

df_svd_recs = pd.DataFrame(svd_recs)
save_path_svd = "../results/svd_recommendations.csv"
df_svd_recs.to_csv(save_path_svd, index=False)
print(f"Saved SVD Recommendations to {save_path_svd}")
print(df_svd_recs.head())


[Step 6] Generating efficient SVD Recommendations (k=20)...

[Step 5] Calculating SVD with k=20...
SVD Done in 0.0229s.
U: (10000, 20), Sigma: (20, 20), Vt: (20, 1000)
Generating SVD recommendations for 5 users...
Saved SVD Recommendations to ../results/svd_recommendations.csv
                           User  Rank  Item_ID  \
0  AEBOWVW4PMZZICMOGENXNVFTJUCQ     1      191   
1  AEBOWVW4PMZZICMOGENXNVFTJUCQ     2      410   
2  AEBOWVW4PMZZICMOGENXNVFTJUCQ     3      295   
3  AEBOWVW4PMZZICMOGENXNVFTJUCQ     4      972   
4  AEBOWVW4PMZZICMOGENXNVFTJUCQ     5       77   

                                               Title  Predicted_Rating  \
0  Charmin Ultra Gentle Toilet Paper, 18 Mega Rol...            4.3757   
1  Glade Automatic Spray Refill, Air Freshener fo...            4.3102   
2  Drive Medical RTL12004KD Handicap Bathroom Sto...            4.2759   
3  Yogasleep Rohm Portable White Noise Machine fo...            4.2705   
4  Amazon Brand - Solimo Sandwich Storage Bags, 3.