# 02 - Feature Extraction

**Goal:** Extract high-level feature representations from CIFAR-10 using a pretrained ResNet50.

## Overview
In this notebook, we will:
1. Load the CIFAR-10 dataset and select a training subset
2. Extract 2048-dimensional features using ResNet50
3. Apply PCA to reduce the features to 50 dimensions
4. Visualize the reduced feature space
5. Save the features for topological data analysis (TDA)
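
The PCA step in the list above can be sketched with NumPy alone. This is a minimal illustration on synthetic data, not the project's `DimensionalityReducer`: the helper name `pca_reduce` and the random 2048-dimensional inputs are assumptions made for the example.

```python
import numpy as np

def pca_reduce(features, n_components=50):
    """Project features onto their top principal components via SVD.

    Hypothetical helper for illustration; it is not part of src.models.
    """
    # PCA operates on mean-centered data
    centered = features - features.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing singular value
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    reduced = centered @ vt[:n_components].T
    # Fraction of total variance retained by the kept components
    variances = singular_values ** 2
    explained = variances[:n_components].sum() / variances.sum()
    return reduced, explained

# Synthetic stand-in for ResNet50 output: 200 samples x 2048 dimensions
rng = np.random.default_rng(0)
fake_features = rng.normal(size=(200, 2048))

reduced, explained = pca_reduce(fake_features, n_components=50)
print(reduced.shape)
print(f"Variance retained: {explained:.1%}")
```

On real ResNet50 features the retained variance is usually worth inspecting before fixing the target dimension at 50, since the synthetic data here has no structure to compress.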

**Why Feature Extraction?**
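- Raw pixels (32×32×3 = 3,072 dimensions per image) are noisy and high-dimensional
- A pretrained ResNet50 encodes semantic features (objects, textures, shapes)
- Distances between these feature vectors tend to track semantic similarity better than pixel distances do, which makes them better suited for TDA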
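
To make the dimension counts above concrete, here is a small sketch that flattens a dummy batch of CIFAR-10-sized images. The all-zero batch is a placeholder, not real data.

```python
import numpy as np

# Dummy batch: 4 images of 32x32 pixels with 3 color channels
batch = np.zeros((4, 32, 32, 3), dtype=np.uint8)

# Flatten each image into a single raw-pixel vector
flat = batch.reshape(len(batch), -1)
raw_dims = flat.shape[1]

print(flat.shape)  # (4, 3072)
print(f"Raw pixels: {raw_dims} dims, ResNet50 features: 2048 dims, after PCA: 50 dims")
```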

### Setup and Imports

In [1]:
```python
import sys
sys.path.append('../')
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
from sklearn.manifold import TSNE

# Our modules
from src.data.cifar10 import load_cifar10
from src.models.feature_extractor import FeatureExtractor, DimensionalityReducer
```

### Load the CIFAR-10 Dataset

We'll use the first 5,000 training samples for faster experimentation. For the final paper, we'll scale up to the full 45,000 training samples.

To try a different subset size, change both occurrences of `5000` in the cell below.

In [3]:
```python
(
    (training_images, training_labels),
    (validation_images, validation_labels),
    (testing_images, testing_labels),
    class_names,
) = load_cifar10()

training_images_subset = training_images[:5000]   # 5000 images for quick experiment
training_labels_subset = training_labels[:5000]   # 5000 labels for quick experiment

print(f"\nUsing {len(training_images_subset):,} training samples for feature extraction")
```


```text
Final splits:
Training:   45000 samples
Validation: 5000 samples
Testing:    10000 samples

Using 5,000 training samples for feature extraction
```
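
Because the subset is a plain slice of the first samples, it may be worth confirming that all ten classes are reasonably represented before extracting features. The sketch below assumes integer class labels (not one-hot vectors) and uses randomly generated labels in place of `training_labels_subset`; `class_counts` is a hypothetical helper, not part of the project's modules.

```python
import numpy as np

def class_counts(labels, num_classes=10):
    """Count samples per class; assumes integer labels of shape (N,) or (N, 1)."""
    labels = np.asarray(labels).ravel().astype(int)
    return np.bincount(labels, minlength=num_classes)

# Synthetic stand-in for training_labels_subset
rng = np.random.default_rng(0)
fake_labels = rng.integers(0, 10, size=(5000, 1))

counts = class_counts(fake_labels)
for class_index, count in enumerate(counts):
    print(f"class {class_index}: {count} samples")
```

If the counts turn out to be heavily skewed, shuffling or stratified sampling before slicing would be one way to address it.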
