# Group 17
### Eyad Medhat 221100279 / Hady Aly 221101190 / Mohamed Mahfouz 221101743 / Omar Mady 221100745

# PCA with Maximum Likelihood Estimation (MLE) Approach

This notebook implements PCA-based collaborative filtering using the MLE covariance estimation method to handle sparse rating matrices.

In [1]:
from utils import *

Results folder exists at: d:\University\semester 9\IRS\AIE425_FinalProject_Group17\SECTION1_DimensionalityReduction\results
Subfolder exists: d:\University\semester 9\IRS\AIE425_FinalProject_Group17\SECTION1_DimensionalityReduction\results\plots
Subfolder exists: d:\University\semester 9\IRS\AIE425_FinalProject_Group17\SECTION1_DimensionalityReduction\results\tables


In [2]:
results_dir = ensure_results_folders()

Results folder exists at: d:\University\semester 9\IRS\AIE425_FinalProject_Group17\SECTION1_DimensionalityReduction\results
Subfolder exists: d:\University\semester 9\IRS\AIE425_FinalProject_Group17\SECTION1_DimensionalityReduction\results\plots
Subfolder exists: d:\University\semester 9\IRS\AIE425_FinalProject_Group17\SECTION1_DimensionalityReduction\results\tables


## Data Loading and Preprocessing

Load the dataset and create the user-item matrix. Identify target items and users for prediction.

In [3]:
ratings_df = load_data()

# Fallback to local file if utility function fails
if ratings_df is None:
    local_file = 'ratings_cleaned_sampled.csv'
    if os.path.exists(local_file):
        ratings_df = pd.read_csv(local_file)
    else:
        raise FileNotFoundError("Dataset not found! Check file paths.")

print(f"Loaded dataset with shape: {ratings_df.shape}")

# Create user-item matrix
ui_matrix = ratings_df.pivot(index='userId', columns='movieId', values='rating')
print(f"User-Item Matrix dimensions: {ui_matrix.shape}")

# Load the target items for prediction from the saved results table
targets_df = load_data(table_name='lowest_two_rateditems.csv')

target_items = targets_df['movieId'].tolist()[:2]
print(f"Loaded target items from file: {target_items}")


# Filter to active targets in the matrix
active_targets = [item for item in target_items if item in ui_matrix.columns]
print(f"Active targets in matrix: {active_targets}")

# Identify users who haven't rated at least one target item
users_with_missing = ui_matrix[ui_matrix[active_targets].isna().any(axis=1)].index.tolist()
print(f"Users with missing target ratings: {len(users_with_missing)}")

 Found cached sample at: ..\data\ml-20m\ratings_cleaned_sampled.csv
Loaded dataset with shape: (1000000, 3)
User-Item Matrix dimensions: (96345, 1000)
 Found requested table at: ..\results\tables\lowest_two_rateditems.csv
Loaded target items from file: [1556, 1499]
Active targets in matrix: [1556, 1499]
Users with missing target ratings: 96329
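
The shapes printed above imply a very sparse matrix: 1,000,000 ratings in a 96,345 × 1,000 grid fill roughly 1% of the cells. The following is a minimal sketch of the same pivot and missing-target filter on a toy table, so the effect of each step is visible. The toy ratings, the `toy_` names and the target list `[10, 30]` are illustrative assumptions, not values from the dataset.

```python
import pandas as pd

# Toy ratings table with the same three columns as the real dataset (values are made up).
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2, 3, 3],
    "movieId": [10, 20, 10, 10, 30],
    "rating": [4.0, 3.5, 5.0, 2.0, 1.0],
})

# pivot() raises if a (userId, movieId) pair appears twice, so it also checks uniqueness.
toy_matrix = toy_ratings.pivot(index="userId", columns="movieId", values="rating")

# Share of cells that actually hold a rating.
density = toy_matrix.notna().to_numpy().mean()

# Users missing at least one of the (hypothetical) target items 10 and 30.
toy_targets = [10, 30]
missing_users = toy_matrix[toy_matrix[toy_targets].isna().any(axis=1)].index.tolist()

print(f"density={density:.3f}, users with a missing target: {missing_users}")
```

User 3 rated both toy targets and is therefore excluded, which mirrors how only a few users drop out of `users_with_missing` in the real run.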


## Step 1: MLE-based Covariance Matrix Construction

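Compute the MLE covariance between every pair of items. Each item is first centered on its own mean rating. The covariance of a pair is then the average product of the centered ratings over only those users who rated both items, dividing by the co-rating count $n$ rather than $n-1$. Pairs with no co-rating users are assigned a covariance of 0.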

In [4]:
# Center the data by subtracting item means
item_avg = ui_matrix.mean(axis=0)
ui_centered = ui_matrix - item_avg

# Extract values and dimensions
matrix_vals = ui_centered.values
n_items = ui_centered.shape[1]
item_indices = ui_centered.columns

# Initialize covariance matrix
cov_matrix = np.zeros((n_items, n_items), dtype=float)

print("Building MLE Covariance Matrix...")
# Compute pairwise covariances
for i in range(n_items):
    col_i = matrix_vals[:, i]
    for j in range(i, n_items):
        col_j = matrix_vals[:, j]
        
        # Find users who rated both items
        valid_mask = ~np.isnan(col_i) & ~np.isnan(col_j)
        count = int(valid_mask.sum())

        # Calculate MLE covariance
        if count == 0:
            cov_val = 0.0
        else:
            cov_val = float(np.dot(col_i[valid_mask], col_j[valid_mask]) / count)

        # Store symmetric values
        cov_matrix[i, j] = cov_val
        cov_matrix[j, i] = cov_val

# Convert to DataFrame
cov_df = pd.DataFrame(cov_matrix, index=item_indices, columns=item_indices)

print("Step 1: TRUE MLE Covariance Matrix Generated.")
save_csv(cov_df, "pca_mle_cov_matrix.csv")
print("Step 1 Output Saved: pca_mle_cov_matrix.csv")
print(cov_df.head())

Building MLE Covariance Matrix...
Step 1: TRUE MLE Covariance Matrix Generated.
    Saved CSV: tables/pca_mle_cov_matrix.csv
Step 1 Output Saved: pca_mle_cov_matrix.csv
movieId     1         2         3         5         6         7         10     \
movieId                                                                         
1        0.784061  0.257519  0.265043  0.193229  0.077289  0.010011  0.095170   
2        0.257519  0.965164  0.367336  0.259827  0.262220  0.475421  0.207785   
3        0.265043  0.367336  0.936565  0.289819  0.400179  0.192354  0.137546   
5        0.193229  0.259827  0.289819  1.044418  0.407888  0.375331  0.513870   
6        0.077289  0.262220  0.400179  0.407888  0.795272  0.180567  0.128871   

movieId     11        14        16     ...     70286     71535     72998  \
movieId                                ...                                 
1        0.245748 -0.044494 -0.069877  ...  0.192374 -0.057573  0.058382   
2        0.248989  0.294459 -0.0384
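
The nested loop in Step 1 visits every item pair separately. As a hedged alternative, the same quantity can be obtained with two matrix products: one over zero-filled centered ratings (the sums of products) and one over the observation mask (the co-rating counts). The function name `pairwise_mle_cov` and the toy ratings are assumptions made for illustration; this is a sketch, not the notebook's implementation.

```python
import numpy as np

def pairwise_mle_cov(centered):
    """Pairwise MLE covariance for a (users x items) array that uses NaN for missing ratings."""
    observed = ~np.isnan(centered)
    filled = np.where(observed, centered, 0.0)   # missing entries contribute 0 to every product
    mask = observed.astype(float)
    sums = filled.T @ filled                     # sum of products over co-rating users
    counts = mask.T @ mask                       # number of co-rating users per item pair
    cov = np.zeros_like(sums)
    np.divide(sums, counts, out=cov, where=counts > 0)  # pairs with no overlap stay at 0
    return cov, counts

# Toy example: 4 users, 3 items (values are made up).
ratings = np.array([
    [5.0, 3.0, np.nan],
    [4.0, np.nan, 2.0],
    [1.0, 2.0, np.nan],
    [np.nan, 4.0, 5.0],
])
centered = ratings - np.nanmean(ratings, axis=0)  # center each item on its own mean
cov, counts = pairwise_mle_cov(centered)
print(np.round(cov, 4))
print(counts)
```

Because each entry is averaged over a different subset of users, a matrix built this way is not guaranteed to be positive semidefinite, so small negative eigenvalues can appear in the decomposition that follows.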

## Step 2: Latent Space Construction and Peer Discovery

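Apply eigen-decomposition to the covariance matrix, keep the smallest number of leading components whose cumulative explained variance reaches 75%, and identify the items most similar to each target in that latent space.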

In [5]:
print("Performing Eigen-decomposition...")

# Eigenvalue decomposition
eigenvalues, eigenvectors = np.linalg.eigh(cov_df.values)

# Sort by eigenvalues (descending)
sort_idx = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[sort_idx]
eigenvectors = eigenvectors[:, sort_idx]

# Calculate variance explained
total_var = np.sum(eigenvalues)
var_ratio = eigenvalues / total_var
cumulative_var = np.cumsum(var_ratio)

# Determine components for 75% variance
k_components = np.argmax(cumulative_var >= 0.75) + 1
print(f"\nNumber of components to explain 75% variance: {k_components}")
print(f"Cumulative variance at k={k_components}: {cumulative_var[k_components-1]:.4f}")

print(f"Selected Top {k_components} Eigenvalues based on 75% Variance:")
print(eigenvalues[:k_components])

# compute_latent_neighbors (imported via `from utils import *`) finds similar items in the latent space

# Get all item IDs
all_items = list(cov_df.columns)

# Find Top-5 and Top-10 neighbors
neighbors_5 = compute_latent_neighbors(
    n_comp=k_components, n_neighbors=5,
    all_items=all_items, target_list=active_targets, evec_matrix=eigenvectors
)

neighbors_10 = compute_latent_neighbors(
    n_comp=k_components, n_neighbors=10,
    all_items=all_items, target_list=active_targets, evec_matrix=eigenvectors
)

# Create summary table
peer_summary = []
for target in active_targets:
    if target in neighbors_5 and target in neighbors_10:
        peer_summary.append({
            "TargetItem": target,
            "Top5_Peers_MLE": str(neighbors_5[target]["top_ids"]),
            "Top10_Peers_MLE": str(neighbors_10[target]["top_ids"])
        })

peers_table = pd.DataFrame(peer_summary)
save_csv(peers_table, "pca_mle_peers.csv")
print("Step 2 Output Saved: pca_mle_peers.csv")

# Display top-5 peers for each target
for target in active_targets:
    if target in neighbors_5:
        print(f"Target {target} Top-5 Peers: {neighbors_5[target]['top_ids']}")

Performing Eigen-decomposition...

Number of components to explain 75% variance: 22
Cumulative variance at k=22: 0.7541
Selected Top 22 Eigenvalues based on 75% Variance:
[143.69768713  44.48094567  34.93584832  28.03784742  27.24015097
  26.09275962  24.73083927  23.83401048  23.55443214  23.05144599
  22.92263611  22.28116497  22.05522298  21.50783826  21.31747887
  21.11426852  21.07358691  20.56912542  20.32480027  20.07447692
  20.06109814  19.9205765 ]
    Saved CSV: tables/pca_mle_peers.csv
Step 2 Output Saved: pca_mle_peers.csv
Target 1556 Top-5 Peers: [628, 2167, 1722, 1917, 1801]
Target 1499 Top-5 Peers: [2804, 1296, 420, 3623, 4855]
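
The notebook does not show the body of `compute_latent_neighbors`. The sketch below is one plausible reading, under the assumption that each item is represented by its loadings on the top `n_comp` eigenvectors and that peers are ranked by cosine similarity. The name `latent_neighbors_sketch`, the cosine choice and the toy covariance matrix are all assumptions; only the argument names and the `top_ids` / `sim_series` keys are taken from the calls above.

```python
import numpy as np
import pandas as pd

def latent_neighbors_sketch(n_comp, n_neighbors, all_items, target_list, evec_matrix):
    """Rank items by cosine similarity to each target in the top-n_comp eigenvector space."""
    coords = np.asarray(evec_matrix)[:, :n_comp]          # one row of latent coordinates per item
    norms = np.linalg.norm(coords, axis=1, keepdims=True)
    unit = coords / np.where(norms == 0, 1.0, norms)      # guard against zero-length rows
    position = {item: i for i, item in enumerate(all_items)}
    result = {}
    for target in target_list:
        if target not in position:
            continue
        sims = pd.Series(unit @ unit[position[target]], index=all_items)
        sims = sims.drop(target).sort_values(ascending=False)  # exclude the target itself
        result[target] = {"top_ids": sims.index[:n_neighbors].tolist(), "sim_series": sims}
    return result

# Toy covariance: items 10/20 move together, as do items 30/40 (values are made up).
toy_cov = np.array([
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.7],
    [0.0, 0.1, 0.7, 1.0],
])
toy_items = [10, 20, 30, 40]

vals, vecs = np.linalg.eigh(toy_cov)          # eigh returns eigenvalues in ascending order
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]
k = int(np.argmax(np.cumsum(vals) / vals.sum() >= 0.75) + 1)  # first k reaching 75% variance

neighbors = latent_neighbors_sketch(k, 2, toy_items, [10], vecs)
print(k, neighbors[10]["top_ids"])
```

Cosine similarity is unaffected by the arbitrary sign of each eigenvector, since flipping a column changes every item's coordinate on that axis at once.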


## Step 3: Reduced Dimensional Space (Top 5 Neighbors)

Record the top 5 latent peers of each target with their similarities, then build each user's feature vector from their mean-centered ratings on those peers.

In [6]:
# Build reduced space table
reduced_space_5 = []
for target in active_targets:
    if target not in neighbors_5:
        continue
    peer_list = neighbors_5[target]['top_ids'][:5]
    similarity_vals = neighbors_5[target]['sim_series']

    for position, peer_id in enumerate(peer_list, 1):
        reduced_space_5.append({
            'TargetItem': target,
            'Peer_Rank': position,
            'Peer_ItemID': peer_id,
            'Latent_Similarity': float(similarity_vals[peer_id]),
            'Space_Type': 'Top5_MLE'
        })

reduced_5_df = pd.DataFrame(reduced_space_5)
save_csv(reduced_5_df, 'pca_mle_reduced_space_top5.csv')
print("Step 3 Output Saved: pca_mle_reduced_space_top5.csv")

# Build user feature vectors in reduced space
user_features_5 = []
for target in active_targets:
    if target not in neighbors_5:
        continue
    peer_items = neighbors_5[target]['top_ids'][:5]

    for user in users_with_missing:
        if user not in ui_centered.index:
            continue
        
        # Extract user's ratings on peer items
        user_vector = ui_centered.loc[user, peer_items].values
        feature_dict = {'UserID': user, 'TargetItem': target}
        
        for idx, peer in enumerate(peer_items, 1):
            feature_dict[f'Peer{idx}_{peer}'] = user_vector[idx-1]
        
        user_features_5.append(feature_dict)

user_vec_5_df = pd.DataFrame(user_features_5)
save_csv(user_vec_5_df, 'pca_mle_user_reduced_vectors_top5.csv')
print("Step 3 (extra) Output Saved: pca_mle_user_reduced_vectors_top5.csv")

print(reduced_5_df.head())

    Saved CSV: tables/pca_mle_reduced_space_top5.csv
Step 3 Output Saved: pca_mle_reduced_space_top5.csv
    Saved CSV: tables/pca_mle_user_reduced_vectors_top5.csv
Step 3 (extra) Output Saved: pca_mle_user_reduced_vectors_top5.csv
   TargetItem  Peer_Rank  Peer_ItemID  Latent_Similarity Space_Type
0        1556          1          628           0.650902   Top5_MLE
1        1556          2         2167           0.643302   Top5_MLE
2        1556          3         1722           0.611007   Top5_MLE
3        1556          4         1917           0.578528   Top5_MLE
4        1556          5         1801           0.562561   Top5_MLE


## Step 4: Rating Predictions Using Top 5 Neighbors

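Generate rating predictions using similarity-weighted averaging over the top 5 latent neighbors. The weighted average of the user's mean-centered peer ratings is added to the target item's mean; users who rated none of the peers receive the item mean unchanged.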

In [7]:
predictions_5 = []

for user in users_with_missing:
    if user not in ui_matrix.index:
        continue

    for target in active_targets:
        if target not in neighbors_5:
            continue

        # Check if rating exists
        rating_status = "Existing" if pd.notna(ui_matrix.loc[user, target]) else "Missing"

        peer_items = neighbors_5[target]['top_ids'][:5]
        similarity_weights = neighbors_5[target]['sim_series']

        # Compute weighted prediction
        numerator = 0.0
        denominator = 0.0

        for peer in peer_items:
            user_rating = ui_centered.loc[user, peer]
            if pd.notna(user_rating):
                weight = float(similarity_weights[peer])
                numerator += weight * float(user_rating)
                denominator += abs(weight)

        # Add back the item mean
        baseline = float(item_avg[target]) if target in item_avg.index else 0.0
        predicted_rating = (baseline + numerator / denominator) if denominator != 0 else baseline

        predictions_5.append({
            'UserID': user, 
            'ItemID': target, 
            'Pred_MLE_Top5': predicted_rating, 
            'Status': rating_status
        })

predictions_5_df = pd.DataFrame(predictions_5)
save_csv(predictions_5_df, "pca_mle_preds_top5.csv")
print("Step 4 Output Saved: pca_mle_preds_top5.csv")
print(predictions_5_df.head())

    Saved CSV: tables/pca_mle_preds_top5.csv
Step 4 Output Saved: pca_mle_preds_top5.csv
   UserID  ItemID  Pred_MLE_Top5   Status
0       1    1556       1.919431  Missing
1       1    1499       1.986574  Missing
2       2    1556       1.919431  Missing
3       2    1499       2.059603  Missing
4       3    1556       1.919431  Missing
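
The prediction rule in Step 4 can be isolated as a small function, which makes the fallback to the item mean explicit. This is a sketch of the same arithmetic on toy numbers; the function name `predict_from_peers`, the item IDs 101 to 103, the similarity weights and the item mean of 2.0 are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def predict_from_peers(user_centered, peer_items, sim_series, item_mean):
    """Item mean plus the similarity-weighted average of the user's centered peer ratings."""
    numerator = 0.0
    denominator = 0.0
    for peer in peer_items:
        rating = user_centered.get(peer, np.nan)
        if pd.notna(rating):                      # skip peers the user has not rated
            weight = float(sim_series[peer])
            numerator += weight * float(rating)
            denominator += abs(weight)
    if denominator == 0:                          # no rated peers: fall back to the item mean
        return item_mean
    return item_mean + numerator / denominator

toy_sims = pd.Series({101: 0.6, 102: 0.4, 103: 0.5})
toy_peers = [101, 102, 103]

user_a = pd.Series({101: 1.0, 102: np.nan, 103: -0.5})     # rated two of the three peers
user_b = pd.Series({101: np.nan, 102: np.nan, 103: np.nan})  # rated none of them

pred_a = predict_from_peers(user_a, toy_peers, toy_sims, item_mean=2.0)
pred_b = predict_from_peers(user_b, toy_peers, toy_sims, item_mean=2.0)
print(round(pred_a, 4), pred_b)
```

For `user_a` the numerator is 0.6·1.0 + 0.5·(−0.5) = 0.35 and the denominator is 1.1, so the prediction is 2.0 + 0.35/1.1. For `user_b` the result is exactly the item mean, which is why many rows in the output above share one value per target.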


## Step 5: Reduced Dimensional Space (Top 10 Neighbors)

Expand the peer set to the top 10 latent neighbors of each target and build the corresponding 10-dimensional user feature vectors.

In [8]:
# Build reduced space table for top 10
reduced_space_10 = []
for target in active_targets:
    if target not in neighbors_10:
        continue
    peer_list = neighbors_10[target]['top_ids'][:10]
    similarity_vals = neighbors_10[target]['sim_series']

    for position, peer_id in enumerate(peer_list, 1):
        reduced_space_10.append({
            'TargetItem': target,
            'Peer_Rank': position,
            'Peer_ItemID': peer_id,
            'Latent_Similarity': float(similarity_vals[peer_id]),
            'Space_Type': 'Top10_MLE'
        })

reduced_10_df = pd.DataFrame(reduced_space_10)
save_csv(reduced_10_df, 'pca_mle_reduced_space_top10.csv')
print("Step 5 (extra) Output Saved: pca_mle_reduced_space_top10.csv")


# Build user feature vectors with 10 dimensions
user_features_10 = []
for target in active_targets:
    if target not in neighbors_10:
        continue
    peer_items = neighbors_10[target]['top_ids'][:10]

    for user in users_with_missing:
        if user not in ui_centered.index:
            continue
        
        # Extract user's ratings on peer items
        user_vector = ui_centered.loc[user, peer_items].values
        feature_dict = {'UserID': user, 'TargetItem': target}
        
        for idx, peer in enumerate(peer_items, 1):
            feature_dict[f'Peer{idx}_{peer}'] = user_vector[idx-1]
        
        user_features_10.append(feature_dict)

user_vec_10_df = pd.DataFrame(user_features_10)
save_csv(user_vec_10_df, 'pca_mle_user_reduced_vectors_top10.csv')
print("Step 5 (extra) Output Saved: pca_mle_user_reduced_vectors_top10.csv")

print(reduced_10_df.head(10))

    Saved CSV: tables/pca_mle_reduced_space_top10.csv
Step 5 (extra) Output Saved: pca_mle_reduced_space_top10.csv
    Saved CSV: tables/pca_mle_user_reduced_vectors_top10.csv
Step 5 (extra) Output Saved: pca_mle_user_reduced_vectors_top10.csv
   TargetItem  Peer_Rank  Peer_ItemID  Latent_Similarity Space_Type
0        1556          1          628           0.650902  Top10_MLE
1        1556          2         2167           0.643302  Top10_MLE
2        1556          3         1722           0.611007  Top10_MLE
3        1556          4         1917           0.578528  Top10_MLE
4        1556          5         1801           0.562561  Top10_MLE
5        1556          6          748           0.550894  Top10_MLE
6        1556          7         5481           0.525898  Top10_MLE
7        1556          8          196           0.524533  Top10_MLE
8        1556          9          160           0.517657  Top10_MLE
9        1556         10         4643           0.499900  Top10_MLE


## Step 6: Rating Predictions Using Top 10 Neighbors

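Repeat the similarity-weighted prediction from Step 4 with the top 10 latent neighbors. A larger neighborhood gives each user more chances of having rated a peer, which may improve accuracy.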

In [9]:
predictions_10 = []

for user in users_with_missing:
    if user not in ui_matrix.index:
        continue

    for target in active_targets:
        if target not in neighbors_10:
            continue

        # Determine rating status
        rating_status = "Existing" if pd.notna(ui_matrix.loc[user, target]) else "Missing"

        peer_items = neighbors_10[target]['top_ids'][:10]
        similarity_weights = neighbors_10[target]['sim_series']

        # Calculate weighted average
        numerator = 0.0
        denominator = 0.0

        for peer in peer_items:
            user_rating = ui_centered.loc[user, peer]
            if pd.notna(user_rating):
                weight = float(similarity_weights[peer])
                numerator += weight * float(user_rating)
                denominator += abs(weight)

        # Add item baseline
        baseline = float(item_avg[target]) if target in item_avg.index else 0.0
        predicted_rating = (baseline + numerator / denominator) if denominator != 0 else baseline

        predictions_10.append({
            'UserID': user, 
            'ItemID': target, 
            'Pred_MLE_Top10': predicted_rating, 
            'Status': rating_status
        })

predictions_10_df = pd.DataFrame(predictions_10)
save_csv(predictions_10_df, "pca_mle_preds_top10.csv")
print("Step 6 Output Saved: pca_mle_preds_top10.csv")
print(predictions_10_df.head(10))

    Saved CSV: tables/pca_mle_preds_top10.csv
Step 6 Output Saved: pca_mle_preds_top10.csv
   UserID  ItemID  Pred_MLE_Top10   Status
0       1    1556        1.919431  Missing
1       1    1499        1.986574  Missing
2       2    1556        1.919431  Missing
3       2    1499        2.059603  Missing
4       3    1556        1.919431  Missing
5       3    1499        2.059603  Missing
6       4    1556        1.919431  Missing
7       4    1499        2.301512  Missing
8       5    1556        1.919431  Missing
9       5    1499        2.059603  Missing
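
Many rows above repeat the same value per target, which is what the fallback to the item mean produces when a user has rated none of the peers. One way to quantify this, sketched here on toy data, is to count rated peers per user. The helper name `peer_coverage`, the item IDs and the toy ratings are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

def peer_coverage(centered, peer_items):
    """Number of peer items each user has rated (non-NaN entries per row)."""
    return centered[peer_items].notna().sum(axis=1)

# Toy centered ratings for 3 users on 3 hypothetical peer items.
toy_centered = pd.DataFrame(
    {101: [0.5, np.nan, np.nan], 102: [np.nan, np.nan, -1.0], 103: [1.0, np.nan, np.nan]},
    index=pd.Index([1, 2, 3], name="userId"),
)

coverage = peer_coverage(toy_centered, [101, 102, 103])
fallback_share = float((coverage == 0).mean())   # share of users predicted by the item mean alone
print(coverage.tolist(), round(fallback_share, 3))
```

Applied to the real data, the equivalent call would be `peer_coverage(ui_centered.loc[users_with_missing], neighbors_10[target]['top_ids'])` for each target.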


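## Step 7/8/9: Comparison Analysis

Print the Top-5 reduced space (Step 3) alongside the Top-10 predictions (Step 6) and save summary counts and averages. Then compare the PCA MLE predictions with the linear regression predictions for the Top-5 (Step 8) and Top-10 (Step 9) neighborhoods.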

In [10]:
# Display the reduced dimensional space from Step 3 (Top 5)
print("Step 3 - Reduced Dimensional Space (Top 5 Peers):")
print(reduced_5_df)
print("\n" + "="*80 + "\n")

# Display the prediction results from Step 6 (Top 10)
print("Step 6 - Rating Predictions (Top 10 Peers):")
print(predictions_10_df.head(20))
print("\n" + "="*80 + "\n")

# Comparison Analysis
print("Comparison Analysis:")
print(f"Number of peer items in reduced space (Step 3): {len(reduced_5_df)}")
print(f"Number of predictions made (Step 6): {len(predictions_10_df)}")
print(f"\nStep 3 uses {reduced_5_df['Space_Type'].iloc[0]} for dimensionality reduction")
print(f"Step 6 predictions are based on Top 10 neighbors")


# Save comparison summary
comparison_summary = {
    'Step3_Peers_Count': len(reduced_5_df),
    'Step6_Predictions_Count': len(predictions_10_df),
    'Step3_Avg_Similarity': reduced_5_df['Latent_Similarity'].mean(),
    'Step6_Avg_Prediction': predictions_10_df['Pred_MLE_Top10'].mean()
}
summary_df = pd.DataFrame([comparison_summary])
print(f"\nAverage prediction value (Step 6): {predictions_10_df['Pred_MLE_Top10'].mean():.4f}")
print(f"Average latent similarity (Step 3): {reduced_5_df['Latent_Similarity'].mean():.4f}")

save_csv(summary_df, "pca_mle_comparison.csv")
print("\nStep 7 Output Saved: pca_mle_comparison.csv")

Step 3 - Reduced Dimensional Space (Top 5 Peers):
   TargetItem  Peer_Rank  Peer_ItemID  Latent_Similarity Space_Type
0        1556          1          628           0.650902   Top5_MLE
1        1556          2         2167           0.643302   Top5_MLE
2        1556          3         1722           0.611007   Top5_MLE
3        1556          4         1917           0.578528   Top5_MLE
4        1556          5         1801           0.562561   Top5_MLE
5        1499          1         2804           0.628761   Top5_MLE
6        1499          2         1296           0.610479   Top5_MLE
7        1499          3          420           0.587618   Top5_MLE
8        1499          4         3623           0.542076   Top5_MLE
9        1499          5         4855           0.539109   Top5_MLE


Step 6 - Rating Predictions (Top 10 Peers):
    UserID  ItemID  Pred_MLE_Top10   Status
0        1    1556        1.919431  Missing
1        1    1499        1.986574  Missing
2        2    1556      

In [11]:
# Step 8: Compare Linear Regression Method with PCA MLE Method

# Load linear regression predictions from Task 9 (from results folder)
lr_pred_1499 = load_data(table_name='3.2.9_predictions_target_1499.csv')
lr_pred_1556 = load_data(table_name='3.2.9_predictions_target_1556.csv')

# Combine both target items
lr_predictions = pd.concat([lr_pred_1499, lr_pred_1556], ignore_index=True)

# Rename columns to match PCA naming convention
lr_predictions.rename(columns={
    'userId': 'UserID',
    'movieId': 'ItemID',
    'predicted_rating_final': 'Predicted_Rating'
}, inplace=True)

# Merge with PCA MLE predictions from Step 4
comparison_df = pd.merge(
    lr_predictions[['UserID', 'ItemID', 'Predicted_Rating']],
    predictions_5_df[['UserID', 'ItemID', 'Pred_MLE_Top5']],
    on=['UserID', 'ItemID'],
    how='inner'
)

# Rename columns for clarity
comparison_df.rename(columns={
    'Predicted_Rating': 'LinearRegression_Pred',
    'Pred_MLE_Top5': 'PCA_MLE_Pred'
}, inplace=True)

# Calculate differences
comparison_df['Diff'] = comparison_df['LinearRegression_Pred'] - comparison_df['PCA_MLE_Pred']
comparison_df['AbsDiff'] = comparison_df['Diff'].abs()

# Save comparison results
save_csv(comparison_df, "pca_mle_method_comparison.csv")
print("Step 8 Output Saved: pca_mle_method_comparison.csv")

# Display results
print("\n" + "="*80)
print("STEP 8: Comparison of Linear Regression vs PCA MLE Top-5 Predictions")
print("="*80)
print("\nFirst 10 comparisons:")
print(comparison_df.head(10))

print("\n" + "-"*80)
print("Statistical Summary:")
print("-"*80)
print(f"Total predictions compared: {len(comparison_df)}")
print(f"\nLinear Regression Method:")
print(f"  Mean Prediction: {comparison_df['LinearRegression_Pred'].mean():.6f}")
print(f"  Std Deviation:   {comparison_df['LinearRegression_Pred'].std():.6f}")
print(f"\nPCA MLE Method:")
print(f"  Mean Prediction: {comparison_df['PCA_MLE_Pred'].mean():.6f}")
print(f"  Std Deviation:   {comparison_df['PCA_MLE_Pred'].std():.6f}")
print(f"\nDifference Analysis:")
print(f"  Mean Absolute Difference: {comparison_df['AbsDiff'].mean():.6f}")
print(f"  Max Absolute Difference:  {comparison_df['AbsDiff'].max():.6f}")
print(f"  Min Absolute Difference:  {comparison_df['AbsDiff'].min():.6f}")
print(f"  Std Dev of Differences:   {comparison_df['Diff'].std():.6f}")
print("="*80)

 Found requested table at: ..\results\tables\3.2.9_predictions_target_1499.csv
 Found requested table at: ..\results\tables\3.2.9_predictions_target_1556.csv
    Saved CSV: tables/pca_mle_method_comparison.csv
Step 8 Output Saved: pca_mle_method_comparison.csv

STEP 8: Comparison of Linear Regression vs PCA MLE Top-5 Predictions

First 10 comparisons:
   UserID  ItemID  LinearRegression_Pred  PCA_MLE_Pred      Diff   AbsDiff
0       1  1499.0               2.059603      1.986574  0.073028  0.073028
1       2  1499.0               2.059603      2.059603  0.000000  0.000000
2       3  1499.0               2.059603      2.059603  0.000000  0.000000
3       4  1499.0               2.059603      2.301512 -0.241909  0.241909
4       5  1499.0               2.059603      2.059603  0.000000  0.000000
5       7  1499.0               2.059603      2.059603  0.000000  0.000000
6       8  1499.0               2.059603      2.059603  0.000000  0.000000
7       9  1499.0               2.059603      

In [12]:
# Step 9: Compare Linear Regression Top-10 with PCA MLE Top-10

# Load linear regression Top-10 predictions
lr_pred_1499_top10 = load_data(table_name='3.2.11_predictions_target_1499_top10.csv')
lr_pred_1556_top10 = load_data(table_name='3.2.11_predictions_target_1556_top10.csv')

# Combine both target items
lr_predictions_top10 = pd.concat([lr_pred_1499_top10, lr_pred_1556_top10], ignore_index=True)

# Rename columns to match PCA naming convention
lr_predictions_top10.rename(columns={
    'userId': 'UserID',
    'movieId': 'ItemID',
    'predicted_rating_final': 'Predicted_Rating'
}, inplace=True)

# Merge with PCA MLE Top-10 predictions from Step 6
comparison_df = pd.merge(
    lr_predictions_top10[['UserID', 'ItemID', 'Predicted_Rating']],
    predictions_10_df[['UserID', 'ItemID', 'Pred_MLE_Top10']],
    on=['UserID', 'ItemID'],
    how='inner'
)

# Rename columns for clarity
comparison_df.rename(columns={
    'Predicted_Rating': 'LinearRegression_Top10_Pred',
    'Pred_MLE_Top10': 'PCA_MLE_Top10_Pred'
}, inplace=True)

# Calculate differences
comparison_df['Diff'] = comparison_df['LinearRegression_Top10_Pred'] - comparison_df['PCA_MLE_Top10_Pred']
comparison_df['AbsDiff'] = comparison_df['Diff'].abs()

# Save comparison results
save_csv(comparison_df, "pca_mle_method_comparison_top10.csv")
print("Step 9 Output Saved: pca_mle_method_comparison_top10.csv")

# Display results
print("\n" + "="*80)
print("STEP 9: Comparison of Linear Regression Top-10 vs PCA MLE Top-10 Predictions")
print("="*80)
print("\nFirst 10 comparisons:")
print(comparison_df.head(10))

print("\n" + "-"*80)
print("Statistical Summary:")
print("-"*80)
print(f"Total predictions compared: {len(comparison_df)}")
print(f"\nLinear Regression Top-10 Method:")
print(f"  Mean Prediction: {comparison_df['LinearRegression_Top10_Pred'].mean():.6f}")
print(f"  Std Deviation:   {comparison_df['LinearRegression_Top10_Pred'].std():.6f}")
print(f"\nPCA MLE Top-10 Method:")
print(f"  Mean Prediction: {comparison_df['PCA_MLE_Top10_Pred'].mean():.6f}")
print(f"  Std Deviation:   {comparison_df['PCA_MLE_Top10_Pred'].std():.6f}")
print(f"\nDifference Analysis:")
print(f"  Mean Absolute Difference: {comparison_df['AbsDiff'].mean():.6f}")
print(f"  Max Absolute Difference:  {comparison_df['AbsDiff'].max():.6f}")
print(f"  Min Absolute Difference:  {comparison_df['AbsDiff'].min():.6f}")
print(f"  Std Dev of Differences:   {comparison_df['Diff'].std():.6f}")
print("="*80)

 Found requested table at: ..\results\tables\3.2.11_predictions_target_1499_top10.csv
 Found requested table at: ..\results\tables\3.2.11_predictions_target_1556_top10.csv
    Saved CSV: tables/pca_mle_method_comparison_top10.csv
Step 9 Output Saved: pca_mle_method_comparison_top10.csv

STEP 9: Comparison of Linear Regression Top-10 vs PCA MLE Top-10 Predictions

First 10 comparisons:
   UserID  ItemID  LinearRegression_Top10_Pred  PCA_MLE_Top10_Pred      Diff  \
0       1  1499.0                     2.059603            1.986574  0.073028   
1       2  1499.0                     2.059603            2.059603  0.000000   
2       3  1499.0                     2.059603            2.059603  0.000000   
3       4  1499.0                     2.059603            2.301512 -0.241909   
4       5  1499.0                     2.059603            2.059603  0.000000   
5       7  1499.0                     2.059603            2.059603  0.000000   
6       8  1499.0                     2.059603      
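
Steps 8 and 9 repeat the same rename, merge and difference logic, and the merged `ItemID` column prints as `1499.0`, which suggests the two inputs carry different dtypes for that key. The sketch below folds the shared logic into one helper and casts both keys to integers before merging. The helper name `compare_predictions` and the toy frames are assumptions; the column names `userId`, `movieId` and `predicted_rating_final` are taken from the cells above.

```python
import pandas as pd

def compare_predictions(lr_df, pca_df, pca_col):
    """Inner-join linear regression and PCA predictions on (UserID, ItemID) and add differences."""
    lr = lr_df.rename(columns={
        "userId": "UserID",
        "movieId": "ItemID",
        "predicted_rating_final": "LR_Pred",
    }).copy()
    pca = pca_df.rename(columns={pca_col: "PCA_Pred"}).copy()
    for frame in (lr, pca):
        # Align key dtypes; this cast would raise if a key column contained NaN.
        frame["UserID"] = frame["UserID"].astype("int64")
        frame["ItemID"] = frame["ItemID"].astype("int64")
    merged = lr[["UserID", "ItemID", "LR_Pred"]].merge(
        pca[["UserID", "ItemID", "PCA_Pred"]], on=["UserID", "ItemID"], how="inner"
    )
    merged["Diff"] = merged["LR_Pred"] - merged["PCA_Pred"]
    merged["AbsDiff"] = merged["Diff"].abs()
    return merged

# Toy inputs (values are made up); the LR table stores item IDs as floats.
lr_toy = pd.DataFrame({
    "userId": [1, 2, 9],
    "movieId": [1499.0, 1499.0, 1499.0],
    "predicted_rating_final": [2.0, 2.5, 3.0],
})
pca_toy = pd.DataFrame({
    "UserID": [1, 2, 3],
    "ItemID": [1499, 1499, 1499],
    "Pred_MLE_Top5": [1.5, 2.5, 2.0],
})

result = compare_predictions(lr_toy, pca_toy, "Pred_MLE_Top5")
print(result)
print(f"Mean absolute difference: {result['AbsDiff'].mean():.4f}")
```

With a helper like this, Step 8 and Step 9 would differ only in the tables loaded and the `pca_col` argument, and each result could keep its own variable name instead of overwriting `comparison_df`.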