# Part 2: PCA Method with Maximum Likelihood Estimation (MLE)

## Assignment Requirements
**Objective**: Use the PCA method with the MLE technique to compute the covariance matrix, then use it to compute rating predictions.

**Reduced Scale**: We rely on `load_and_sample_data(..., n_samples=100000)` to work with a subset of 100k ratings (approximately 10% of the default). This keeps the covariance calculation fast on typical hardware.

**Note**: If you have made changes to `utils.py`, please **Restart Kernel** to ensure they are loaded.

### Step 1: Load Data & Generate MLE Covariance


In [1]:
import numpy as np  # used explicitly in Step 7 instead of relying on the wildcard import

from utils import *

In [2]:
# 1. Load Reduced Data (100k samples)
df = load_data()


print(f"Data Loaded. Shape: {df.shape}")
print(f"Unique Users: {df['userId'].nunique()}, Unique Items: {df['movieId'].nunique()}")

# 2. Pivot to User-Item Matrix
user_item_matrix = df.pivot(index='userId', columns='movieId', values='rating')
print(f"User-Item Matrix Shape: {user_item_matrix.shape}")

# 3. Identify Targets
target_ids = [1, 2]
# Fall back to the first two items if either default target is missing
if not set(target_ids).issubset(user_item_matrix.columns):
    target_ids = user_item_matrix.columns[:2].tolist()
print(f"Target Items: {target_ids}")

 Found cached sample at: ..\data\ml-20m\ratings_cleaned_sampled.csv
Data Loaded. Shape: (1000000, 3)
Unique Users: 96345, Unique Items: 1000
User-Item Matrix Shape: (96345, 1000)
Target Items: [1, 2]


In [3]:
cov_matrix, item_means = calculate_mle_covariance(user_item_matrix)
print("Covariance Matrix Generated.")
print(cov_matrix.iloc[:5, :5])

Computing MLE Covariance Matrix...
Covariance Matrix Generated.
movieId         1         2         3         5         6
movieId                                                  
1        0.784061  0.257519  0.265043  0.193229  0.077289
2        0.257519  0.965164  0.367336  0.259827  0.262220
3        0.265043  0.367336  0.936565  0.289819  0.400179
5        0.193229  0.259827  0.289819  1.044418  0.407888
6        0.077289  0.262220  0.400179  0.407888  0.795272
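
Since `utils.py` is not shown here, the following is only a minimal sketch of how a covariance estimate that skips missing ratings could be computed. The name `pairwise_mle_covariance` is hypothetical, and three choices are assumptions rather than a description of `calculate_mle_covariance`: item means are taken over each item's observed ratings, each covariance is divided by the number of users who rated both items (the MLE divisor N rather than N-1), and pairs with no common raters are set to zero.

```python
import numpy as np
import pandas as pd


def pairwise_mle_covariance(ratings: pd.DataFrame):
    """Hypothetical stand-in for calculate_mle_covariance (utils.py is not shown)."""
    # Item means over observed ratings only (pandas skips NaN by default).
    item_means = ratings.mean(axis=0)

    # Center each item; missing ratings stay NaN at this point.
    centered = ratings - item_means

    # 1.0 where a rating exists, 0.0 where it is missing.
    observed = ratings.notna().astype(float)

    # Replace NaN with 0 so missing entries add nothing to the product sums.
    filled = centered.fillna(0.0)

    # Sum of centered products, and number of co-rating users, per item pair.
    sums = filled.T @ filled
    counts = observed.T @ observed

    # Divide by N (not N-1); pairs without common raters become NaN, then 0.
    cov = (sums / counts.where(counts > 0)).fillna(0.0)
    return cov, item_means
```

Called as `pairwise_mle_covariance(user_item_matrix)`, this sketch would return a square item-by-item DataFrame and a Series of item means, the same shapes the cell above works with.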


### Step 2: Determine Top 5-Peers and Top 10-Peers (Eigen Analysis)

In [None]:
eigenvalues, eigenvectors = get_eigen_pairs(cov_matrix)
print(f"Top 5 Eigenvalues: {eigenvalues[:5]}")
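
The eigen analysis step can be sketched as follows. `sorted_eigen_pairs` is a hypothetical name, and the assumption is that `get_eigen_pairs` returns eigenvalues in descending order with matching eigenvector columns (which is what the `eigenvalues[:5]` slice above suggests). `np.linalg.eigh` is used because a covariance matrix is symmetric; it returns eigenvalues in ascending order, so they are reversed here.

```python
import numpy as np


def sorted_eigen_pairs(cov):
    """Hypothetical stand-in for get_eigen_pairs: eigenpairs sorted by descending eigenvalue."""
    # eigh is intended for symmetric matrices and returns real eigenvalues, ascending.
    values, vectors = np.linalg.eigh(np.asarray(cov, dtype=float))

    # Reorder so the largest eigenvalue (most variance explained) comes first.
    order = np.argsort(values)[::-1]
    return values[order], vectors[:, order]
```

Each column `vectors[:, i]` is the unit-length eigenvector that belongs to `values[i]`, so the first `k` columns span the top-`k` principal directions.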

### Steps 3-6: Predictions (PCA Projection)

In [None]:
# K=5
reduced_5, preds_5 = predict_pca(5, user_item_matrix, item_means, eigenvectors, target_ids)
# K=10
reduced_10, preds_10 = predict_pca(10, user_item_matrix, item_means, eigenvectors, target_ids)
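
As a rough illustration of what a PCA projection followed by prediction can look like, the sketch below centers the ratings, projects them onto the top-`k` eigenvectors, and reconstructs the target columns. `pca_reconstruct` is a hypothetical name; treating missing ratings as zero deviation from the item mean before projecting, and predicting by reconstruction, are assumptions and may differ from what `predict_pca` in `utils.py` does.

```python
import numpy as np
import pandas as pd


def pca_reconstruct(ratings, item_means, eigenvectors, k, target_ids):
    """Hypothetical sketch: project onto top-k components, then reconstruct targets."""
    # Align the means with the column order of the rating matrix.
    means = item_means.reindex(ratings.columns)

    # Center the ratings; missing entries are treated as zero deviation (assumption).
    centered = (ratings - means).fillna(0.0).to_numpy()

    # Keep the first k eigenvector columns (assumed sorted by descending eigenvalue).
    top_k = np.asarray(eigenvectors)[:, :k]

    # Reduced representation: one k-dimensional score vector per user.
    scores = centered @ top_k

    # Map the scores back to item space and add the item means again.
    reconstructed = scores @ top_k.T + means.to_numpy()
    recon = pd.DataFrame(reconstructed, index=ratings.index, columns=ratings.columns)
    return scores, recon[target_ids]
```

With `k` equal to the number of items and no missing ratings, the reconstruction reproduces the input exactly; smaller `k` keeps only the directions of largest variance.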

### Step 7: Comparison

In [None]:
print("\n--- Comparison: k=5 vs k=10 ---")
for tid in target_ids:
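    # Mean absolute difference between the two prediction sets (not an error against ground truth)
    mean_abs_diff = np.mean(np.abs(preds_5[tid] - preds_10[tid]))
    print(f"Item {tid}: Mean Absolute Difference (k=5 vs k=10): {mean_abs_diff:.4f}")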

### Steps 8-9: Comparison with Part 1
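Compare these predictions with the results from Part 1, Points 9 and 11.
- Reference the output of your previous Part 1 run to make specific comments.
- In general, the MLE covariance (Part 2) ignores missing entries, so it gives a valid view of the item structure even when the data are sparse. Mean filling (Part 1) tends to dampen correlations, because every filled-in entry sits exactly at the item mean: it contributes zero to the sum of centered products but still counts toward the number of samples.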
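
The dampening effect of mean filling can be checked on a toy example. The data below are made up for illustration only (two hypothetical items `A` and `B`, six users, two of whom did not rate `A`); the sketch compares the covariance and correlation computed from co-rating users against the same quantities after filling missing ratings with the item mean.

```python
import numpy as np
import pandas as pd

# Toy ratings (illustrative values only): the last two users did not rate item A.
ratings = pd.DataFrame({
    "A": [1.0, 2.0, 4.0, 5.0, np.nan, np.nan],
    "B": [1.0, 2.0, 4.0, 5.0, 1.0, 5.0],
})


def mle_cov(x, y):
    """Covariance with divisor N (the MLE form)."""
    return float(np.mean((x - x.mean()) * (y - y.mean())))


# Observed-only: use just the users who rated both items.
both = ratings.dropna()
cov_observed = mle_cov(both["A"], both["B"])
corr_observed = float(np.corrcoef(both["A"], both["B"])[0, 1])

# Mean filling: replace each missing rating with that item's observed mean.
filled = ratings.fillna(ratings.mean())
cov_filled = mle_cov(filled["A"], filled["B"])
corr_filled = float(np.corrcoef(filled["A"], filled["B"])[0, 1])

print(f"observed-only: cov={cov_observed:.3f}, corr={corr_observed:.3f}")
print(f"mean-filled:   cov={cov_filled:.3f}, corr={corr_filled:.3f}")
```

In this toy case the filled rows add nothing to the product sum for `A` but enlarge the sample, so both the covariance and the correlation come out smaller after mean filling.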