# Artificial Vision & Feature Separability — 00 · Intro to Features, PCA & K-Means (CIFAR-10)

**Goal.** Build core intuition about feature geometry and separability using CIFAR-10 images.  
**Outputs.** Saved plots in `results/` and a brief summary.

In [4]:
# --- Reproducibility & Environment (with SSL fix) ---
import os, random, numpy as np, certifi

SEED = 42
random.seed(SEED)
np.random.seed(SEED)

# ensure consistent results/outputs folder
os.makedirs("results", exist_ok=True)
os.makedirs("data", exist_ok=True)

# ---- SSL fix: trust certifi CA bundle everywhere ----
os.environ["SSL_CERT_FILE"] = certifi.where()
print("SSL_CERT_FILE set to:", os.environ["SSL_CERT_FILE"])

print("Seed set to", SEED)

SSL_CERT_FILE set to: /Library/Frameworks/Python.framework/Versions/3.13/lib/python3.13/site-packages/certifi/cacert.pem
Seed set to 42


In [5]:
# --- Imports ---
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import torch
import torchvision
from torchvision import transforms

from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

## 1. Data (CIFAR-10)
We load CIFAR-10 via `torchvision.datasets`.  
For compact demos and fast plots, we restrict the data to **two classes** (here `frog` vs `ship`) and cap the sample size at 2,000 training and 1,000 test images.

In [6]:
# transforms -> tensor only (no normalization here; we'll Standardize for scikit later)
tensor_transform = transforms.Compose([transforms.ToTensor()])

trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=tensor_transform)
testset  = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=tensor_transform)

# map numeric labels to class names for readability
label_names = trainset.classes
def label_to_name(y): return label_names[y]

print("Train size:", len(trainset), " Test size:", len(testset))

URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1018)>
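
The traceback above suggests the certifi bundle was not picked up when the download ran. One possible workaround, not taken from the original notebook, is to install a global `urllib` opener with an explicit SSL context. This is a sketch under two assumptions: that the dataset download goes through `urllib.request`, and that the hypothetical helper `install_https_opener` is called before the dataset cell.

```python
import ssl
import urllib.request

def install_https_opener(cafile=None):
    """Install a global urllib opener that verifies HTTPS against `cafile`.

    `install_https_opener` is a hypothetical helper name. With cafile=None,
    the system default trust store is used instead of an explicit bundle.
    """
    context = ssl.create_default_context(cafile=cafile)
    opener = urllib.request.build_opener(
        urllib.request.HTTPSHandler(context=context)
    )
    urllib.request.install_opener(opener)  # later urlopen() calls use this opener
    return context

# In this notebook one would pass the certifi bundle instead:
#   install_https_opener(certifi.where())
context = install_https_opener()
print("Certificate verification required:", context.verify_mode == ssl.CERT_REQUIRED)
```

If the download still fails after this, the CA bundle itself may be missing the issuer certificate, which this sketch does not address.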

In [None]:
# Build a small two-class subset for quick demo
CLASSES = ("frog", "ship")  # adjust as needed; a tuple keeps the class order deterministic
def filter_indices(ds, classes):
    # Read labels from ds.targets so no image is decoded or transformed while filtering
    return [i for i, y in enumerate(ds.targets) if label_to_name(y) in classes]

train_idx = filter_indices(trainset, CLASSES)[:2000]  # cap for speed
test_idx  = filter_indices(testset,  CLASSES)[:1000]

# tensors (C,H,W) -> flattened numpy rows in (H,W,C) order; labels come from .targets
# so each image is decoded only once
X_train = np.stack([trainset[i][0].numpy().transpose(1,2,0).reshape(-1) for i in train_idx])
y_train_names = np.array([label_to_name(trainset.targets[i]) for i in train_idx])

X_test  = np.stack([testset[i][0].numpy().transpose(1,2,0).reshape(-1) for i in test_idx])
y_test_names  = np.array([label_to_name(testset.targets[i]) for i in test_idx])

print("Subset shapes ->", X_train.shape, X_test.shape)
np.unique(y_train_names, return_counts=True)

## 2. Methods
Standardize pixels → PCA (2D, for visualization) → K-Means on the PCA projection → logistic regression baseline on the standardized pixels.

In [None]:
scaler = StandardScaler()
Xz_train = scaler.fit_transform(X_train)
Xz_test  = scaler.transform(X_test)

pca = PCA(n_components=2, random_state=SEED)
Xp_train = pca.fit_transform(Xz_train)
Xp_test  = pca.transform(Xz_test)

print("Explained variance ratio:", pca.explained_variance_ratio_)
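
Two components are convenient for plotting, but the explained variance ratio printed above only covers those two. To see how many components a given share of the variance would need, one can fit a full PCA and accumulate the ratios. The sketch below uses synthetic data as a stand-in for `Xz_train`; the helper name `components_for_variance` and the 0.90 target are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

def components_for_variance(X, target=0.90):
    """Return (k, cumulative): the smallest number of principal components whose
    cumulative explained variance ratio reaches `target`, and the full curve."""
    pca = PCA().fit(X)  # keep all components
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    # first index where the cumulative ratio is >= target, converted to a count
    k = int(np.searchsorted(cumulative, target) + 1)
    return min(k, len(cumulative)), cumulative

# Synthetic stand-in for Xz_train: 5 features with very different scales
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5)) * np.array([10.0, 5.0, 1.0, 0.5, 0.1])

k_demo, cumulative_demo = components_for_variance(X_demo, target=0.90)
print("Components needed:", k_demo)
print("Cumulative explained variance:", np.round(cumulative_demo, 3))
```

In the notebook, the same call would be `components_for_variance(Xz_train)`, which fits a PCA over all 3,072 pixel features and may take noticeably longer than the two-component fit.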

## 3. Results

### 3.1 PCA scatter by true label (train subset)

In [None]:
plt.figure()
# encode labels to ints for the colormap; sort=True makes the class-to-color mapping deterministic
labels_enc = pd.factorize(y_train_names, sort=True)[0]
plt.scatter(Xp_train[:,0], Xp_train[:,1], c=labels_enc, s=8)
plt.title("CIFAR-10 (subset) PCA — true labels")
plt.xlabel("PC1"); plt.ylabel("PC2")
plt.tight_layout(); plt.savefig("results/00_pca_true_labels_cifar.png", dpi=150); plt.show()

### 3.2 K-Means on PCA space

In [None]:
n_clusters = len(np.unique(y_train_names))
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=SEED)
km_train = kmeans.fit_predict(Xp_train)

plt.figure()
plt.scatter(Xp_train[:,0], Xp_train[:,1], c=km_train, s=8)
plt.title("K-Means clusters (PCA space) — train subset")
plt.xlabel("PC1"); plt.ylabel("PC2")
plt.tight_layout(); plt.savefig("results/00_kmeans_clusters_cifar.png", dpi=150); plt.show()
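
The scatter plot shows the clusters, but not how closely they match the true classes. Two common summaries are the adjusted Rand index (from `sklearn.metrics`) and cluster purity. The sketch below computes both on synthetic blobs standing in for `Xp_train` and `y_train_names`; `cluster_purity` is a hypothetical helper written for illustration, and the perfectly separated demo data says nothing about what CIFAR-10 would score.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def cluster_purity(labels_true, labels_pred):
    """Fraction of samples that belong to the majority true class of their cluster."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    correct = 0
    for cluster in np.unique(labels_pred):
        members = labels_true[labels_pred == cluster]
        _, counts = np.unique(members, return_counts=True)
        correct += counts.max()  # size of the majority class in this cluster
    return correct / len(labels_true)

# Synthetic stand-in for (Xp_train, y_train_names): two well-separated 2-D blobs
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(-5.0, 1.0, size=(100, 2)),
    rng.normal(5.0, 1.0, size=(100, 2)),
])
y_demo = np.array(["frog"] * 100 + ["ship"] * 100)

clusters_demo = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_demo)
ari_demo = adjusted_rand_score(y_demo, clusters_demo)
purity_demo = cluster_purity(y_demo, clusters_demo)
print(f"ARI: {ari_demo:.3f}  purity: {purity_demo:.3f}")
```

In the notebook, the equivalent calls would be `adjusted_rand_score(y_train_names, km_train)` and `cluster_purity(y_train_names, km_train)`. An adjusted Rand index near 0 indicates agreement no better than chance, while values near 1 indicate clusters that closely follow the labels.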

### 3.3 Logistic regression baseline

In [None]:
# Supervised linear baseline on standardized pixel features
clf = LogisticRegression(max_iter=500, random_state=SEED)
clf.fit(Xz_train, y_train_names)

pred_test = clf.predict(Xz_test)

acc = accuracy_score(y_test_names, pred_test)
cm = confusion_matrix(y_test_names, pred_test, labels=sorted(list(CLASSES)))
print(f"Test Accuracy: {acc:.3f}")
print(classification_report(y_test_names, pred_test))

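plt.figure()
plt.imshow(cm, aspect="auto")
plt.title("Confusion Matrix (test)")
plt.xlabel("Pred"); plt.ylabel("True")
# label the axes with class names, in the same sorted order used to build cm
class_order = sorted(CLASSES)
plt.xticks(range(len(class_order)), class_order); plt.yticks(range(len(class_order)), class_order)
plt.colorbar(); plt.tight_layout(); plt.savefig("results/00_confusion_matrix_cifar.png", dpi=150); plt.show()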
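
Raw counts in a confusion matrix are hard to compare when the classes have different sizes. Passing `normalize="true"` to `confusion_matrix` divides each row by its total, so the diagonal reads as per-class recall. The labels below are a small made-up example, not results from this notebook.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only
y_true_demo = np.array(["frog", "frog", "frog", "frog", "ship", "ship", "ship", "ship"])
y_pred_demo = np.array(["frog", "frog", "frog", "ship", "ship", "ship", "frog", "frog"])
labels_demo = ["frog", "ship"]

# Rows are true classes, columns are predicted classes
cm_counts = confusion_matrix(y_true_demo, y_pred_demo, labels=labels_demo)
# normalize="true" divides each row by the number of true samples in that class
cm_recall = confusion_matrix(y_true_demo, y_pred_demo, labels=labels_demo, normalize="true")

print(cm_counts)
print(cm_recall)
```

In the notebook, the same keyword could be added to the existing `confusion_matrix(y_test_names, pred_test, ...)` call before plotting.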

## 4. Takeaways
- PCA on standardized pixel features reveals some structure but is limited for complex images.
- K-Means clusters may partially align with labels, depending on class separability.
- A linear model on standardized raw-pixel features sets a conservative baseline we'll surpass with CNNs later.