## 🔹 Objective:
__Apply PCA to reduce dimensionality and train classic ML models (SVC, Random Forest, etc.)__

## 🟩 Step 1: Load Preprocessed Data
- Split train_df into features X (pixel columns) and labels y
- Normalize pixel values to [0, 1] by dividing by 255 (skip this if the data is already normalized)

In [1]:
import pandas as pd

train_df = pd.read_csv('../data/train.csv')
test_df = pd.read_csv('../data/test.csv')

In [2]:
X = train_df.drop("label", axis=1)
y = train_df["label"]
X = X / 255.0
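The "if not already" condition in Step 1 can be made explicit so that re-running the cell does not divide by 255 twice. This is a minimal sketch: `normalize_pixels` is a hypothetical helper, not part of the notebook, and the small frame below is made-up data standing in for the pixel columns of train_df.

```python
import pandas as pd


def normalize_pixels(X: pd.DataFrame) -> pd.DataFrame:
    """Scale pixel values to [0, 1] unless they already are (hypothetical helper)."""
    # If the maximum is already <= 1, assume the data was normalized earlier.
    if X.to_numpy().max() <= 1.0:
        return X.astype("float64")
    return X / 255.0


# Tiny made-up frame standing in for the pixel columns of train_df.
raw = pd.DataFrame({"pixel0": [0, 128, 255], "pixel1": [0, 0, 64]})
once = normalize_pixels(raw)
twice = normalize_pixels(once)  # second call leaves the values unchanged
print(once.max().max(), twice.max().max())
```

The max-based check is a heuristic: a raw 0-255 dataset whose brightest pixel is 0 or 1 would be misread as already normalized, so it is only a guard against accidental double scaling.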

In [3]:
# ✅ 1. Drop Sparse/Dead Pixels
# Remove pixel columns where more than 95% of values are 0.

zero_mask = (X == 0).sum(axis=0) / len(X) > 0.95
X_reduced = X.loc[:, ~zero_mask]  # Keep only informative pixels

print(f"Reduced from {X.shape[1]} to {X_reduced.shape[1]} features")

Reduced from 784 to 398 features
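Any data scored later (for example test_df) needs the same columns as X_reduced, so the mask should be computed on the training data once and reused. The sketch below shows that pattern on made-up frames; the `_demo` names and the six-column layout are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Made-up stand-ins for the normalized train and test pixel frames.
cols = [f"pixel{i}" for i in range(6)]
X_train_demo = pd.DataFrame(rng.random((200, 6)), columns=cols)
X_test_demo = pd.DataFrame(rng.random((50, 6)), columns=cols)
X_train_demo["pixel0"] = 0.0             # a column that is always zero
X_train_demo.loc[:194, "pixel1"] = 0.0   # zero in 195 of 200 rows (97.5%)

# Compute the mask on the training data only ...
zero_mask_demo = (X_train_demo == 0).mean(axis=0) > 0.95
keep_cols = X_train_demo.columns[~zero_mask_demo]

# ... and select the same columns from both frames.
X_train_kept = X_train_demo[keep_cols]
X_test_kept = X_test_demo[keep_cols]
print(list(keep_cols))
```

Selecting by column name rather than recomputing the mask on the test frame keeps the two feature sets aligned even if the test data has a different sparsity pattern.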


## 🟨 Step 2: Apply PCA
- Use sklearn.decomposition.PCA
- Try both:
  - n_components = 0.95 (retain 95% variance)
  - n_components = 0.98
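The two variance targets listed above can be sketched as follows. Passing a float in (0, 1) as n_components asks scikit-learn to choose the number of components itself, which requires the full SVD solver rather than the randomized one. The example uses scikit-learn's bundled 8x8 digits data as a stand-in for the notebook's 28x28 images, so the component counts it prints will not match the real data.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Bundled 8x8 digits (64 features) as a stand-in for the notebook's data.
X_demo = load_digits().data / 16.0

results = {}
for target in (0.95, 0.98):
    # A float in (0, 1) selects the smallest number of components whose
    # cumulative explained variance exceeds `target`.
    pca = PCA(n_components=target, svd_solver="full")
    pca.fit(X_demo)
    retained = pca.explained_variance_ratio_.sum()
    results[target] = (pca.n_components_, retained)
    print(f"target={target:.2f} -> {pca.n_components_} components, "
          f"retained={retained:.4f}")
```

The fitted `n_components_` attribute reports how many components were kept, and the sum of `explained_variance_ratio_` confirms how much variance they retain.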

In [7]:
from sklearn.decomposition import PCA

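# Fixed at 100 components with the randomized solver.
# Note: this is a component count, not a 95% variance threshold. For a threshold,
# use PCA(n_components=0.95, svd_solver='full'). The "_95" names are kept because
# later cells refer to them; pca_95.explained_variance_ratio_.sum() gives the
# fraction of variance these 100 components actually retain.
pca_95 = PCA(n_components=100, svd_solver='randomized', random_state=42)
X_pca_95 = pca_95.fit_transform(X_reduced)

print(f"🔍 PCA → {X_pca_95.shape[1]} components")

# Optional: Save transformed data for reuse (needs `import numpy as np`)
# np.save('data/preprocessed/X_pca_95.npy', X_pca_95)

🔍 PCA → 100 components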
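The optional save step above can be paired with saving the fitted PCA object, since new data can only be projected with the same fitted transform. A minimal sketch, assuming joblib is available (it is installed alongside scikit-learn); the file names, the temporary directory, and the random data are all placeholders rather than the notebook's real paths or features.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X_demo = rng.random((300, 20))  # made-up data in place of X_reduced

pca_demo = PCA(n_components=5, svd_solver="randomized", random_state=42)
X_demo_pca = pca_demo.fit_transform(X_demo)

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)  # hypothetical location; the notebook's own path may differ
    np.save(out_dir / "X_pca.npy", X_demo_pca)      # transformed features
    joblib.dump(pca_demo, out_dir / "pca.joblib")   # fitted transformer

    X_loaded = np.load(out_dir / "X_pca.npy")
    pca_loaded = joblib.load(out_dir / "pca.joblib")

# The reloaded PCA projects rows exactly as the original does.
same_projection = np.allclose(pca_loaded.transform(X_demo[:10]),
                              pca_demo.transform(X_demo[:10]))
print(X_loaded.shape, same_projection)
```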


## 🟧 Step 3: Train Classical ML Models
Tune each model with 3-fold cross-validated grid search, then check accuracy on the held-out validation set:
- Support Vector Classifier (SVC)
- Random Forest
- k-Nearest Neighbors

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X_pca_95, y, test_size=0.2, random_state=42, stratify=y)
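In the cells above, PCA is fitted on every row before the split, so the validation rows influence the projection. One possible alternative, offered as a sketch rather than as the notebook's method, is to split first and place PCA inside a scikit-learn Pipeline so it is refitted on the training folds only. The digits data, the 20-component setting, and the small C grid below are illustrative assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Bundled digits data as a stand-in for the notebook's features and labels.
X_demo, y_demo = load_digits(return_X_y=True)
X_demo = X_demo / 16.0

# Split first, so the validation rows never reach PCA.fit.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

pipe = Pipeline([
    ("pca", PCA(n_components=20, random_state=42)),  # illustrative setting
    ("svc", SVC()),
])

# Step-prefixed names ("svc__C") route each parameter to the right pipeline step.
grid = GridSearchCV(pipe, {"svc__C": [1, 10]}, cv=3, scoring="accuracy")
grid.fit(X_tr, y_tr)

val_acc = grid.score(X_va, y_va)
print(grid.best_params_, round(val_acc, 3))
```

With this layout the PCA step is also refitted inside each cross-validation fold, so the CV score and the validation score are both computed on rows the projection has not seen.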

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

svc = SVC()

svc_params = {
    'C': [1, 10, 100],
    'kernel': ['rbf', 'poly'],
    'gamma': ['scale', 'auto']
}

grid_svc = GridSearchCV(svc, svc_params, cv=3, scoring='accuracy', verbose=2, n_jobs=-1)
grid_svc.fit(X_train, y_train)

print("🔍 Best SVC Params:", grid_svc.best_params_)
print("✅ SVC Best CV Accuracy:", grid_svc.best_score_)

# Evaluate on validation set
svc_val_preds = grid_svc.predict(X_val)
print("🎯 SVC Validation Accuracy:", accuracy_score(y_val, svc_val_preds))

Fitting 3 folds for each of 12 candidates, totalling 36 fits
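Beyond best_params_ and best_score_, a fitted GridSearchCV keeps the score of every candidate in cv_results_, which is useful for seeing how close the runners-up were. The sketch below loads it into a DataFrame; it uses the bundled digits data and a reduced four-candidate grid as stand-ins, so the numbers are not the notebook's results.

```python
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Bundled digits data as a stand-in for X_train / y_train.
X_demo, y_demo = load_digits(return_X_y=True)
X_demo = X_demo / 16.0

grid_demo = GridSearchCV(
    SVC(),
    {"C": [1, 10], "kernel": ["rbf", "poly"]},  # reduced illustrative grid
    cv=3,
    scoring="accuracy",
)
grid_demo.fit(X_demo, y_demo)

# One row per candidate, best first.
columns = ["param_C", "param_kernel", "mean_test_score",
           "std_test_score", "rank_test_score"]
results = (
    pd.DataFrame(grid_demo.cv_results_)[columns]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(results)
```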


In [None]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state=42)

rf_params = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5]
}

grid_rf = GridSearchCV(rf, rf_params, cv=3, scoring='accuracy', verbose=2, n_jobs=-1)
grid_rf.fit(X_train, y_train)

print("🔍 Best RF Params:", grid_rf.best_params_)
print("✅ RF Best CV Accuracy:", grid_rf.best_score_)

# Evaluate on validation
rf_val_preds = grid_rf.predict(X_val)
print("🎯 RF Validation Accuracy:", accuracy_score(y_val, rf_val_preds))
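Step 3 also lists k-Nearest Neighbors. A sketch of how it could be tuned with the same GridSearchCV pattern is shown here; the parameter grid is a hypothetical starting point, and the bundled digits data stands in for the PCA-transformed X_train and X_val, so the printed accuracy says nothing about the real dataset.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Bundled digits data as a stand-in for the PCA-transformed features.
X_demo, y_demo = load_digits(return_X_y=True)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo / 16.0, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

knn_params = {  # hypothetical grid, not tuned for the real data
    "n_neighbors": [3, 5, 7],
    "weights": ["uniform", "distance"],
}

grid_knn = GridSearchCV(KNeighborsClassifier(), knn_params, cv=3, scoring="accuracy")
grid_knn.fit(X_tr, y_tr)

# Evaluate on the held-out validation split
knn_val_preds = grid_knn.predict(X_va)
knn_val_acc = accuracy_score(y_va, knn_val_preds)
print("Best kNN params:", grid_knn.best_params_)
print("kNN validation accuracy:", round(knn_val_acc, 3))
```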

In [None]:
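from sklearn.metrics import classification_report

# Per-class precision, recall and F1 on the validation set for each tuned model
print("SVC classification report")
print(classification_report(y_val, svc_val_preds))
print('-------------------------------------------------------------')
print("RF classification report")
print(classification_report(y_val, rf_val_preds))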
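A confusion matrix complements the reports above by showing which classes get mistaken for which. The sketch below finds the most frequent misclassification; the label and prediction arrays are made up for illustration and stand in for y_val and a model's validation predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels and predictions standing in for y_val and a model's predictions.
y_true_demo = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 2])
y_pred_demo = np.array([0, 1, 1, 1, 2, 2, 2, 1, 1, 2])

cm = confusion_matrix(y_true_demo, y_pred_demo)  # rows = true, columns = predicted

# Zero the diagonal so only misclassifications remain, then find the largest cell.
errors = cm.copy()
np.fill_diagonal(errors, 0)
true_label, pred_label = np.unravel_index(errors.argmax(), errors.shape)

print(cm)
print(f"Most common confusion: true {true_label} predicted as {pred_label} "
      f"({errors[true_label, pred_label]} times)")
```

Row and column positions equal the class labels here only because the labels are the integers 0 to 2 with none missing; for other label sets, pass `labels=` to confusion_matrix and map indices back through that list.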