# Group 17
### Eyad Medhat 221100279 / Hady Aly 221101190 / Mohamed Mahfouz 221101743 / Omar Mady 221100745

# Part 2: PCA Method with Maximum Likelihood Estimation (MLE)

## Assignment Requirements
**Objective**: Use the PCA method with MLE technique to compute the covariance matrix, then compute rating predictions for the target items I1 and I2 using the "Top Peers + Regression" approach (same methodology as Part 1).

**MLE Definition**: 
"For simplicity, assume the Maximum Likelihood Estimate of the covariance between each pair of items is estimated as the covariance between only the specified entries. i.e, only the users that have specified ratings for a particular pair of items are used to estimate the covariance. If there are no users in common between a pair of items, the covariance is estimated to be 0."
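
The definition above can be sketched with two matrix products: one that accumulates products of centered ratings over co-rating users, and one that counts those users. This is a minimal illustration, not the `calculate_mle_covariance` implementation from `utils`, which is not shown in this notebook. The function name `pairwise_covariance`, the choice to center each item by the mean of all its observed ratings, and the division by the co-rater count (rather than the count minus one) are assumptions.

```python
import numpy as np
import pandas as pd


def pairwise_covariance(ratings: pd.DataFrame) -> pd.DataFrame:
    """Covariance between item columns using co-rating users only.

    Pairs of items with no users in common get a covariance of 0.
    Assumption: each item is centered by the mean of its observed ratings.
    """
    # 1.0 where a rating is specified, 0.0 where it is missing
    mask = ratings.notna().to_numpy(dtype=float)
    # Per-item mean over observed ratings (NaN entries are skipped)
    means = ratings.mean(axis=0)
    # Missing entries become 0 so they add nothing to the sums of products
    centered = (ratings - means).fillna(0.0).to_numpy()
    # Sum of centered products over users who rated both items
    numerator = centered.T @ centered
    # Number of users who rated both items
    counts = mask.T @ mask
    # Divide only where at least one co-rating user exists; leave 0 elsewhere
    cov = np.divide(numerator, counts,
                    out=np.zeros_like(numerator), where=counts > 0)
    return pd.DataFrame(cov, index=ratings.columns, columns=ratings.columns)


# Toy data: items B and C share no users, so their covariance is set to 0
toy = pd.DataFrame({
    "A": [5, 3, np.nan, 1],
    "B": [4, np.nan, 2, 2],
    "C": [np.nan, 4, np.nan, np.nan],
})
toy_cov = pairwise_covariance(toy)
print(toy_cov)
```

Expressing the numerator and the denominator as matrix products avoids an explicit loop over item pairs, which matters for a 1000-item matrix.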

**Steps**:
1. Generate the covariance matrix.
2. Determine the top 5-peers and top 10-peers for each of the target items (I1 and I2) using the transformed representation (covariance matrix).
3. Determine the reduced-dimensional space for each user when using the top 5-peers.
4. Use the results from point 3 to compute the rating predictions of the original missing ratings for each of the target items (I1 and I2) using the top 5-peers.
5. Determine the reduced-dimensional space for each user when using the top 10-peers.
6. Use the results from point 5 to compute the rating predictions of the original missing ratings for each of the target items (I1 and I2) using the top 10-peers.
7. Comparisons.

### Step 1: Load Data & Generate MLE Covariance


In [1]:
from utils import *

Results folder exists at: c:\Users\moham\Desktop\IRS GIT\SECTION1_DimensionalityReduction\results
Subfolder exists: c:\Users\moham\Desktop\IRS GIT\SECTION1_DimensionalityReduction\results\plots
Subfolder exists: c:\Users\moham\Desktop\IRS GIT\SECTION1_DimensionalityReduction\results\tables


In [2]:
results_dir = ensure_results_folders()

Results folder exists at: c:\Users\moham\Desktop\IRS GIT\SECTION1_DimensionalityReduction\results
Subfolder exists: c:\Users\moham\Desktop\IRS GIT\SECTION1_DimensionalityReduction\results\plots
Subfolder exists: c:\Users\moham\Desktop\IRS GIT\SECTION1_DimensionalityReduction\results\tables


In [3]:
# 1. Load the ratings data
df = load_data()

print(f"Data Loaded. Shape: {df.shape}")
print(f"Unique Users: {df['userId'].nunique()}, Unique Items: {df['movieId'].nunique()}")

# 2. Pivot to User-Item Matrix
# Columns are MovieIDs, Index is UserIDs
user_item_matrix = df.pivot(index='userId', columns='movieId', values='rating')
print(f"User-Item Matrix Shape: {user_item_matrix.shape}")

# 3. Identify Targets (Load from file if possible, else defaults)
target_items_path = os.path.join(results_dir, 'lowest_two_rateditems.csv')
if os.path.exists(target_items_path):
    target_items_df = pd.read_csv(target_items_path)
    target_ids = target_items_df['movieId'].tolist()[:2]
    print(f"Loaded Target Items: {target_ids}")
else:
    target_ids = [1556, 1499] # Fallback/Example
    print(f"Using Default Target Items: {target_ids}")

# Ensure targets are in the matrix
target_ids = [t for t in target_ids if t in user_item_matrix.columns]
print(f"Active Target Items: {target_ids}")

 Found cached sample at: ..\data\ml-20m\ratings_cleaned_sampled.csv
Data Loaded. Shape: (1000000, 3)
Unique Users: 96345, Unique Items: 1000
User-Item Matrix Shape: (96345, 1000)
Using Default Target Items: [1556, 1499]
Active Target Items: [1556, 1499]


In [4]:
cov_matrix, item_means, centered_df_raw = calculate_mle_covariance(user_item_matrix)
print("Covariance Matrix Generated.")
print(cov_matrix.iloc[:5, :5])

# Save covariance matrix for reference
save_csv(cov_matrix, "3.3.1_mle_covariance.csv")

Computing MLE Covariance Matrix...
- Calculating Numerator...
- Calculating Denominator...
- Dividing...
Covariance Matrix Generated.
movieId         1         2         3         5         6
movieId                                                  
1        0.784061  0.257519  0.265043  0.193229  0.077289
2        0.257519  0.965164  0.367336  0.259827  0.262220
3        0.265043  0.367336  0.936565  0.289819  0.400179
5        0.193229  0.259827  0.289819  1.044418  0.407888
6        0.077289  0.262220  0.400179  0.407888  0.795272
    Saved CSV: tables/3.3.1_mle_covariance.csv


### Step 2: Determine Top 5-Peers and Top 10-Peers
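
A minimal sketch of peer selection, assuming peers are ranked by descending raw covariance with the target and that the target itself is excluded. `top_k_peers` is a hypothetical helper written for illustration; the actual `get_top_peers` in `utils` is not shown here and may differ (for example, it might rank by absolute value).

```python
import pandas as pd


def top_k_peers(cov: pd.DataFrame, target, k: int) -> pd.Series:
    """Return the k items with the largest covariance with `target`."""
    # Take the target's row and drop its own variance entry
    row = cov.loc[target].drop(labels=target)
    # nlargest returns the values sorted in descending order
    return row.nlargest(k)


# Toy 3x3 covariance matrix with made-up values
toy_cov = pd.DataFrame(
    [[1.0, 0.5, 0.2],
     [0.5, 1.0, 0.9],
     [0.2, 0.9, 1.0]],
    index=[1, 2, 3], columns=[1, 2, 3],
)
toy_peers = top_k_peers(toy_cov, target=1, k=2)
print(toy_peers.index.tolist())
```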

In [5]:
peers_data = get_top_peers(cov_matrix, target_ids, k_list=[5, 10])


--- Target 1556 ---
Top 5 Peers: [53125, 1103, 69757, 2421, 3275]
Values: [1.53675873 1.19301815 1.12165668 1.04980911 0.98559018]
Top 10 Peers: [53125, 1103, 69757, 2421, 3275, 78499, 5015, 160, 48385, 1591]
Values: [1.53675873 1.19301815 1.12165668 1.04980911 0.98559018 0.97810952
 0.97784548 0.97211373 0.95355605 0.91181265]

--- Target 1499 ---
Top 5 Peers: [362, 1296, 276, 261, 379]
Values: [1.45527386 1.06729786 1.05805999 1.01015344 0.9694005 ]
Top 10 Peers: [362, 1296, 276, 261, 379, 2340, 1591, 1380, 231, 2722]
Values: [1.45527386 1.06729786 1.05805999 1.01015344 0.9694005  0.9062974
 0.89877922 0.85796729 0.8534604  0.84931266]


### Steps 3-6: Reduced Dimensional Space & Prediction (Top 5 & Top 10)

**Methodology**: 
1. **Reduced Space**: Select the columns of the centered rating matrix that correspond to the Top-K peers. Missing entries are filled with `fillna(0)`, because 0 is the item mean in centered data.
2. **Prediction**: Train a Linear Regression model where:
   - $X$: Centered ratings of the peer items (Reduced Space)
   - $y$: Centered rating of the target item
   - Train on all users who rated the target; any peer ratings they are missing are filled with 0. This is why the training and prediction sample counts reported below add up to the total number of users.
   - Predict for users who have NOT rated the target.
   - Add the target item's mean to obtain the final rating.
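
The methodology above can be sketched with ordinary least squares. This is an illustration under assumptions: `predict_target_from_peers` is a hypothetical name, NumPy's `lstsq` stands in for whichever linear regression `solve_pca_regression` uses internally, and an intercept term is included.

```python
import numpy as np
import pandas as pd


def predict_target_from_peers(centered: pd.DataFrame, target, peers,
                              target_mean: float) -> pd.Series:
    """Predict the target rating for users who have not rated it."""
    # Reduced space: peer columns only, missing ratings set to 0 (the mean)
    X_all = centered[peers].fillna(0.0)
    y = centered[target]
    # Training users are those with a specified target rating
    train = y.notna()
    # Design matrix with a leading column of ones for the intercept
    A_train = np.column_stack([np.ones(train.sum()), X_all[train].to_numpy()])
    coef, *_ = np.linalg.lstsq(A_train, y[train].to_numpy(), rcond=None)
    # Prediction users are everyone else
    X_pred = X_all[~train]
    A_pred = np.column_stack([np.ones(len(X_pred)), X_pred.to_numpy()])
    # Undo the centering by adding the target item's mean
    preds = A_pred @ coef + target_mean
    return pd.Series(preds, index=X_pred.index, name="predicted_rating")


# Toy centered data: target 't' is exactly twice peer 'p' for the raters
toy_centered = pd.DataFrame({
    "t": [2.0, -2.0, 4.0, np.nan],
    "p": [1.0, -1.0, 2.0, 0.5],
})
toy_preds = predict_target_from_peers(toy_centered, "t", ["p"], target_mean=3.0)
print(toy_preds)
```

Note that the sketch does not clip predictions to the valid rating range; whether the notebook's helper does so is not shown.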

In [6]:
# Run for K=5
reduced_space_5, preds_5 = solve_pca_regression(5, peers_data, centered_df_raw, item_means, target_ids)

# Run for K=10
reduced_space_10, preds_10 = solve_pca_regression(10, peers_data, centered_df_raw, item_means, target_ids)


=== Processing K=5 Peers ===

Target: 1556, Peers: [53125, 1103, 69757, 2421, 3275]
Reduced Space Shape: (96345, 5)
Training Samples: 422, Prediction Samples: 95923
    Saved CSV: tables/3.3_mle_predictions_k5_target_1556.csv

Target: 1499, Peers: [362, 1296, 276, 261, 379]
Reduced Space Shape: (96345, 5)
Training Samples: 453, Prediction Samples: 95892
    Saved CSV: tables/3.3_mle_predictions_k5_target_1499.csv

=== Processing K=10 Peers ===

Target: 1556, Peers: [53125, 1103, 69757, 2421, 3275, 78499, 5015, 160, 48385, 1591]
Reduced Space Shape: (96345, 10)
Training Samples: 422, Prediction Samples: 95923
    Saved CSV: tables/3.3_mle_predictions_k10_target_1556.csv

Target: 1499, Peers: [362, 1296, 276, 261, 379, 2340, 1591, 1380, 231, 2722]
Reduced Space Shape: (96345, 10)
Training Samples: 453, Prediction Samples: 95892
    Saved CSV: tables/3.3_mle_predictions_k10_target_1499.csv


### Step 7: Comparisons
The assignment asks for three comparisons:
1. "Compare the results of point 3 with results of point 6". Point 3 is the reduced space for K=5 and point 6 is the predictions for K=10, which are not directly comparable. This is treated here as a likely typo and interpreted as comparing the K=5 predictions (point 4) with the K=10 predictions (point 6).
2. "Compare the results of point 9 in part 1 (Mean Filling Preds K=5) with results of point 4 (MLE Preds K=5)"
3. "Compare the results of point 11 in part 1 (Mean Filling Preds K=10) with results of point 6 (MLE Preds K=10)"

In [7]:
print("\n=== Comparisons ===")

# 1. Compare MLE K=5 vs MLE K=10 Predictions
print("--- MLE K=5 vs MLE K=10 (Predictions) ---")
for tid in target_ids:
    if tid in preds_5 and tid in preds_10:
        p5 = preds_5[tid].set_index('userId')['predicted_rating_mle']
        p10 = preds_10[tid].set_index('userId')['predicted_rating_mle']
        
        # Align indices
        common = p5.index.intersection(p10.index)
        mae = np.mean(np.abs(p5.loc[common] - p10.loc[common]))
        print(f"Target {tid}: MAE between K=5 and K=10 predictions: {mae:.4f}")

# 2. Compare Part 1 (Mean Filling) vs Part 2 (MLE)
# Load Part 1 Predictions
print("\n--- Part 1 (Mean Fill) vs Part 2 (MLE) ---")
utils_tables_path = "../results/tables"

for tid in target_ids:
    # Load Part 1 K=5 (Point 9 in Part 1)
    p1_k5_file = os.path.join(utils_tables_path, f"3.2.9_predictions_target_{tid}.csv")
    
    if os.path.exists(p1_k5_file):
        p1_k5_df = pd.read_csv(p1_k5_file)
        # Check column names in Part 1 file
        # Usually 'predicted_rating_final'
        if 'predicted_rating_final' in p1_k5_df.columns:
            p1_series = p1_k5_df.set_index('userId')['predicted_rating_final']
            
            # Compare with MLE K=5
            if tid in preds_5:
                mle_series = preds_5[tid].set_index('userId')['predicted_rating_mle']
                common = p1_series.index.intersection(mle_series.index)
                mae = np.mean(np.abs(p1_series.loc[common] - mle_series.loc[common]))
                print(f"Target {tid} (K=5): MAE MeanFill vs MLE: {mae:.4f}")
            
            # The Part 1 K=10 predictions (point 11 in Part 1) are not loaded
            # here, so only the K=5 comparison is performed.
    else:
        print(f"Part 1 Predictions for Target {tid} not found at {p1_k5_file}.")


=== Comparisons ===
--- MLE K=5 vs MLE K=10 (Predictions) ---
Target 1556: MAE between K=5 and K=10 predictions: 0.0160
Target 1499: MAE between K=5 and K=10 predictions: 0.0226

--- Part 1 (Mean Fill) vs Part 2 (MLE) ---
Target 1556 (K=5): MAE MeanFill vs MLE: 0.0519
Target 1499 (K=5): MAE MeanFill vs MLE: 0.0325


### Steps 8-9: Comparison with Part 1
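Compare with results from Part 1 Points 9 and 11.
- At K=5, the MLE predictions differ from the Part 1 mean-filling predictions by a mean absolute difference of 0.0519 for target 1556 and 0.0325 for target 1499 (output of cell 7). For reference, the K=5 and K=10 MLE predictions differ from each other by 0.0160 and 0.0226, respectively.
- The K=10 comparison with Part 1 Point 11 is not computed in this notebook, because cell 7 only loads the Part 1 K=5 prediction file.
- Generally, MLE PCA (Part 2) estimates each covariance from the users who rated both items, so it gives a valid picture of the structure even with sparse data. Mean Filling (Part 1) tends to dampen covariances: after centering, every filled entry contributes a zero product to the numerator while the denominator still counts all users.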
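
The dampening effect can be illustrated on synthetic data. The sketch below is a toy example with made-up Gaussian "ratings" and an assumed 30% observation rate per item; it is not run on the notebook's dataset. The mean-filling estimate follows the $X^T X / (N-1)$ formula applied to the mean-filled, centered matrix, while the pairwise estimate divides by the number of co-rating users.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Two correlated items; the true covariance between them is 1
a = rng.normal(size=n)
b = a + 0.5 * rng.normal(size=n)

# Each item is observed independently for about 30% of users (assumption)
keep_a = rng.random(n) < 0.3
keep_b = rng.random(n) < 0.3
a_obs = np.where(keep_a, a, np.nan)
b_obs = np.where(keep_b, b, np.nan)

# Center each item by the mean of its observed ratings
a_c = a_obs - np.nanmean(a_obs)
b_c = b_obs - np.nanmean(b_obs)

# Pairwise estimate: average product over users who rated both items
both = keep_a & keep_b
cov_pairwise = np.mean(a_c[both] * b_c[both])

# Mean filling: missing entries become 0 after centering, but the
# denominator still counts every user
cov_mean_fill = np.sum(np.nan_to_num(a_c) * np.nan_to_num(b_c)) / (n - 1)

print(f"co-rating users: {both.sum()} of {n}")
print(f"pairwise covariance:     {cov_pairwise:.4f}")
print(f"mean-filling covariance: {cov_mean_fill:.4f}")
```

Both estimates share the same numerator, so the mean-filling value is the pairwise value scaled by the fraction of users who rated both items. The sparser the data, the stronger the shrinkage toward 0.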

In [8]:
# --- DYNAMIC BENCHMARKING: PCA MLE (Part 2) ---
method_name = "PCA MLE (Part 2)"
import time
import pandas as pd
import numpy as np
import os
from scipy.linalg import eigh
# Ensure utils is loaded if locals missing (fallback)
if 'load_data' not in locals():
    try:
        from utils import load_data, calculate_mle_covariance
    except ImportError:
        print("Warning: utils module not found. Benchmarking might fail.")

print(f"\n--- Benchmarking {method_name} ---")

# 1. Setup Results Path
results_table_dir = os.path.join(os.getcwd(), '../results/tables')
os.makedirs(results_table_dir, exist_ok=True)

# 2. Data Preparation (Robust Loading)
R_bench = None

# Attempt 1: Check for existing pivoted variables
if 'user_item_matrix' in locals():
    R_bench = user_item_matrix
elif 'R_df' in locals():
    R_bench = R_df

# Attempt 2: Check for raw ratings and pivot
if R_bench is None:
    if 'ratings_df' in locals():
        print("Pivoting ratings_df for benchmark...")
        R_bench = ratings_df.pivot(index='userId', columns='movieId', values='rating')
    elif 'df' in locals() and 'rating' in df.columns:
        print("Pivoting df for benchmark...")
        R_bench = df.pivot(index='userId', columns='movieId', values='rating')

# Attempt 3: Load from Disk
if R_bench is None and 'load_data' in locals():
    print("Loading data from disk for benchmark...")
    _df = load_data()
    if _df is not None:
        R_bench = _df.pivot(index='userId', columns='movieId', values='rating')

if R_bench is None:
    print("Error: Could not obtain User-Item Matrix. Skipping Benchmark.")
else:
    # Ensure float
    R_bench = R_bench.astype(float)
    
    # 3. Decomposition & Prediction
    t0 = time.time()
    
    if 'Mean Filling' in method_name:
        # Mean Filling Logic
        item_means_bench = R_bench.mean()
        R_filled_bench = R_bench.fillna(item_means_bench)
        R_centered_bench = R_filled_bench - item_means_bench
        X = R_centered_bench.values
        # Cov = X.T @ X / (N-1)
        cov_matrix_bench = np.dot(X.T, X) / (X.shape[0] - 1)
        
    else:
        # MLE Logic
        if 'calculate_mle_covariance' in locals():
            cov_matrix_bench, _, _ = calculate_mle_covariance(R_bench)
            if isinstance(cov_matrix_bench, pd.DataFrame):
                cov_matrix_bench = cov_matrix_bench.values
        else:
            print("MLE Util not found. Skipping.")
            cov_matrix_bench = None

    if cov_matrix_bench is not None:
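        # Eigendecomposition of the symmetric covariance matrix. eigh returns
        # eigenvalues in ascending order, so reorder eigenvectors to descending.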
        evals_b, evecs_b = eigh(cov_matrix_bench)
        idx_b = np.argsort(evals_b)[::-1]
        evecs_b = evecs_b[:, idx_b]
        
        t1 = time.time()
        time_decomp = t1 - t0
        
        # 4. Prediction Timing
        t2 = time.time()
        k_limit = 50
        V_k = evecs_b[:, :k_limit]
        
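        if 'Mean Filling' in method_name:
            # Project onto the top-k eigenvectors: U = X V
            U_b = np.dot(X, V_k)
            # Reconstruct and restore item means: X_hat = U V.T + Mean
            X_hat_b = np.dot(U_b, V_k.T) + item_means_bench.values
        else:
            # The eigenvectors come from the MLE covariance, but projection
            # needs a dense matrix, so missing ratings are mean-filled
            # (0 after centering) before projecting.
            means_mle = R_bench.mean()
            X_mle = R_bench.fillna(means_mle).values - means_mle.values
            U_b = np.dot(X_mle, V_k)
            X_hat_b = np.dot(U_b, V_k.T) + means_mle.values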
             
        t3 = time.time()
        time_pred = t3 - t2
        
        # 5. Metrics
        mask_b = ~np.isnan(R_bench.values)
        diff_b = R_bench.values[mask_b] - X_hat_b[mask_b]
        rmse_val = np.sqrt(np.mean(diff_b**2))
        mae_val = np.mean(np.abs(diff_b))
        mem_mb = (cov_matrix_bench.nbytes + evecs_b.nbytes) / 1024 / 1024
        
        print(f"Results: Time Decomp {time_decomp:.4f}s, RMSE {rmse_val:.4f}")
        
        # 6. Save
        bench_file = os.path.join(results_table_dir, 'pca_benchmarks.csv')
        new_data = {
            'Method': method_name,
            'k': k_limit,
            'RMSE': rmse_val,
            'MAE': mae_val,
            'Time_Decomp(s)': time_decomp,
            'Time_Pred(s)': time_pred,
            'Memory(MB)': mem_mb
        }
        
        if os.path.exists(bench_file):
            df_bench = pd.read_csv(bench_file)
            df_bench = df_bench[df_bench['Method'] != method_name]
            df_bench = pd.concat([df_bench, pd.DataFrame([new_data])], ignore_index=True)
        else:
            df_bench = pd.DataFrame([new_data])
            
        df_bench.to_csv(bench_file, index=False)
        print(f"Benchmark saved to {bench_file}")



--- Benchmarking PCA MLE (Part 2) ---
Computing MLE Covariance Matrix...
- Calculating Numerator...
- Calculating Denominator...
- Dividing...
Results: Time Decomp 102.5571s, RMSE 0.8738
Benchmark saved to c:\Users\moham\Desktop\IRS GIT\SECTION1_DimensionalityReduction\code\../results/tables\pca_benchmarks.csv
