In [2]:
!pip install ppca

Collecting ppca
  Downloading ppca-0.0.4-py3-none-any.whl.metadata (400 bytes)
Downloading ppca-0.0.4-py3-none-any.whl (6.7 kB)
Installing collected packages: ppca
Successfully installed ppca-0.0.4


In [3]:
import numpy as np
from ppca import PPCA
from sklearn.impute import SimpleImputer

In [4]:
# Step I: Synthetic dataset + Missing values
np.random.seed(42)
N, D, k = 500, 10, 3  # samples, features, latent dimension

# Generate low-rank data
Z_true = np.random.randn(N, k)
W_true = np.random.randn(D, k)
X_full = Z_true @ W_true.T + 0.1 * np.random.randn(N, D)  # add small noise

# Save full data for error comparison
X_true = X_full.copy()

# Introduce 10% missing values at random
mask = np.random.rand(*X_full.shape) < 0.1
X_miss = X_full.copy()
X_miss[mask] = np.nan

print("Missing values ratio:", np.isnan(X_miss).mean())


Missing values ratio: 0.0964
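
The ratio is 0.0964 rather than exactly 0.1 because each entry is masked independently with probability 0.1, so the realized fraction fluctuates around the target. A quick back-of-the-envelope check, treating the mask as N*D independent Bernoulli draws (the variable names here are only illustrative):

```python
import math

N, D, p = 500, 10, 0.1           # same sizes and masking probability as above
n_entries = N * D

# Standard error of the realized missing fraction under Bernoulli(p) masking
std_err = math.sqrt(p * (1 - p) / n_entries)

observed = 0.0964                # ratio printed by the cell above
z = (observed - p) / std_err     # distance from the target in standard errors

print(f"standard error: {std_err:.5f}")
print(f"z-score of observed ratio: {z:.2f}")
```

A z-score within one standard error of zero indicates the observed ratio is ordinary sampling noise, not a masking bug.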


In [5]:
# Step II: Apply PPCA (handles NaNs in EM)
ppca = PPCA()
ppca.fit(X_miss, d=k)

# Get imputed data (reconstructed full matrix)
X_imputed = ppca.data
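
One thing worth verifying before trusting the error below: as far as I can tell from reading the `ppca` 0.0.4 source, `fit` standardizes each column (using `np.nanmean` and `np.nanstd`) and stores the standardized, filled-in matrix in `ppca.data`. That is an assumption about the package internals, not something this notebook confirms. If it holds, `X_imputed` is in standardized units while `X_true` is in original units, and the MSE would mix the two scales. A minimal sketch of the rescaling step, using a hypothetical helper `to_original_units` and a toy matrix in place of the fitted model:

```python
import numpy as np

def to_original_units(data_std, means, stds):
    """Undo column-wise standardization: x = z * std + mean (hypothetical helper)."""
    return data_std * stds + means

# Toy stand-in for what the package is assumed to do internally
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(6, 3))
X[0, 1] = np.nan                      # one missing entry

means = np.nanmean(X, axis=0)         # per-column mean over observed entries
stds = np.nanstd(X, axis=0)           # per-column std over observed entries
X_std = (X - means) / stds            # standardized copy (NaN stays NaN)

X_back = to_original_units(X_std, means, stds)
observed = ~np.isnan(X)
print(np.allclose(X_back[observed], X[observed]))
```

With the fitted model this would look like `to_original_units(ppca.data, ppca.means, ppca.stds)`, assuming the `means` and `stds` attributes exist as described. A quick way to check is to print `ppca.data.mean(axis=0)` and `ppca.data.std(axis=0)`: values near 0 and 1 suggest the matrix is standardized.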

In [6]:
# Step III: Evaluation
ppca_error = np.mean((X_imputed[mask] - X_true[mask])**2)
print(f"PPCA Imputation MSE on missing entries: {ppca_error:.4f}")


PPCA Imputation MSE on missing entries: 0.7920


In [7]:
# Baseline (Mean Imputation)
imp = SimpleImputer(strategy="mean")
X_mean_imp = imp.fit_transform(X_miss)
mean_error = np.mean((X_mean_imp[mask] - X_true[mask])**2)
print(f"Mean Imputation MSE on missing entries: {mean_error:.4f}")

Mean Imputation MSE on missing entries: 2.7325
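
For reference, the mean-imputation baseline is simple enough to write by hand. The sketch below fills each NaN with the mean of the observed values in its column, which is what `SimpleImputer(strategy="mean")` computes; the toy matrix is made up for illustration:

```python
import numpy as np

# Toy matrix with one missing entry per column
X = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
])

# Column means over observed entries only
col_means = np.nanmean(X, axis=0)

# Broadcast the column means into the NaN positions, keep observed values
X_filled = np.where(np.isnan(X), col_means, X)

print(col_means)
print(X_filled)
```

Every missing value in a column receives the same number, regardless of what the other features in that row look like.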


In [None]:
'''📊 Output Recap

Missing values ratio: 0.0964 → about 9.6% of entries were randomly set to NaN, close to the 10% target.

PPCA Imputation MSE: 0.7920

Mean Imputation MSE: 2.7325

🔹 What this means

Mean Imputation

Replaces each missing entry with the mean of the observed values in that column.

It ignores relationships between features.

Its error (2.73) is therefore high: always predicting the column mean leaves an MSE roughly equal to that column’s variance.

PPCA Imputation

PPCA learns a low-dimensional latent space that explains correlations among features.

During EM, it estimates the missing values so that they stay consistent with the low-rank structure.

That’s why its error (0.79) is much smaller.

🔹 Interpretation

Your synthetic data was generated from a low-rank model (k = 3).

PPCA fits a model of the same form (here with d = k = 3), so it reconstructs missing entries much closer to the true values.

In short:

PPCA ≫ Mean imputation for correlated/multivariate data.

If the features were independent, there would be no shared structure to exploit, and PPCA would offer little or no improvement over mean imputation.'''
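
The closing claim about independent features can be probed with a small experiment. The sketch below does not use the `ppca` package: it swaps in iterative truncated-SVD imputation, a simpler low-rank method, as a stand-in, and the helper names `svd_impute` and `compare` are made up for illustration. It applies the same 10% random masking to low-rank data and to independent Gaussian features, then compares each against mean imputation.

```python
import numpy as np

def svd_impute(X, rank, n_iter=50):
    """Iterative truncated-SVD imputation (a stand-in for PPCA, not the ppca package)."""
    missing = np.isnan(X)
    # Start from mean imputation
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(n_iter):
        mu = filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(filled - mu, full_matrices=False)
        # Rank-limited reconstruction of the current filled matrix
        approx = (U[:, :rank] * s[:rank]) @ Vt[:rank] + mu
        # Keep observed entries, update only the missing ones
        filled = np.where(missing, approx, X)
    return filled

def compare(X_true, rank, rng, frac=0.1):
    """Return (mean-imputation MSE, SVD-imputation MSE) on the masked entries."""
    mask = rng.random(X_true.shape) < frac
    X_miss = X_true.copy()
    X_miss[mask] = np.nan
    mean_fill = np.where(mask, np.nanmean(X_miss, axis=0), X_miss)
    svd_fill = svd_impute(X_miss, rank)

    def mse(A):
        return np.mean((A[mask] - X_true[mask]) ** 2)

    return mse(mean_fill), mse(svd_fill)

rng = np.random.default_rng(0)
N, D, k = 500, 10, 3

# Same generative recipe as Step I: rank-k signal plus small noise
low_rank = (rng.standard_normal((N, k)) @ rng.standard_normal((D, k)).T
            + 0.1 * rng.standard_normal((N, D)))
# No shared structure: every feature is independent noise
independent = rng.standard_normal((N, D))

mean_lr, svd_lr = compare(low_rank, k, rng)
mean_ind, svd_ind = compare(independent, k, rng)

print(f"low-rank data:    mean MSE = {mean_lr:.4f}, SVD MSE = {svd_lr:.4f}")
print(f"independent data: mean MSE = {mean_ind:.4f}, SVD MSE = {svd_ind:.4f}")
```

On the low-rank data the SVD-based fill should land far below the mean baseline. On independent features it should be no better, and can be somewhat worse because the low-rank fit chases noise.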