## Data Loading & Preprocessing

The data is loaded and then split into training and test sets.
Standard scaling is also applied to the image features.

### Download Data & Load as NumPy Arrays

To avoid downloading the dataset manually, use **sklearn.datasets.fetch_lfw_people**, which fetches the data on first use, caches it locally, and returns it as NumPy arrays.

In [1]:
from sklearn.datasets import fetch_lfw_people


people = fetch_lfw_people(min_faces_per_person=70, resize=0.4)

# Introspect the image arrays
n_samples, h, w = people.images.shape

X = people.data
n_features = X.shape[1]

y = people.target
target_names = people.target_names
n_classes = target_names.shape[0]

print("Total Dataset Size:")
print(f"n_samples: {n_samples}")
print(f"n_features: {n_features}")
print(f"n_classes: {n_classes}")


Total Dataset Size:
n_samples: 1288
n_features: 1850
n_classes: 7
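The link between `people.images` and `people.data` can be sketched without downloading anything. The example below uses random arrays as a stand-in for the real images and assumes a 50 x 37 image shape, which is consistent with the 1850 features reported above; the names `images`, `data`, and `restored` are illustrative only.

```python
import numpy as np

# Synthetic stand-in for people.images: 5 grayscale images of 50 x 37 pixels.
# (Assumed shape; 50 * 37 = 1850 matches n_features in the output above.)
rng = np.random.default_rng(0)
images = rng.random((5, 50, 37))

n_samples, h, w = images.shape

# Flatten each image row by row into a single feature vector.
data = images.reshape(n_samples, h * w)
print(data.shape)  # (5, 1850)

# Reshaping a feature vector back to (h, w) recovers the original image.
restored = data[0].reshape(h, w)
print(np.array_equal(restored, images[0]))  # True
```

This is why `h` and `w` are kept: they are needed later to turn any 1850-element vector back into a displayable image.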


### Split Data into Training & Test Sets

Split the data 75% training | 25% testing, then apply standardised scaling.
The scaler is fitted on the training set only and reused on the test set.

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


# 'random_state' seeds the shuffle that is applied before the split.
# The value 42 is arbitrary; fixing it makes the split reproducible across runs.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Remove mean & scale to unit variance to standardise features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
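As a rough check of what the split and scaling steps do, the sketch below repeats them on synthetic data (the `*_demo` and short variable names are placeholders, not part of the notebook). It shows that a fixed `random_state` reproduces the same split, and that only the training set ends up with exactly zero mean and unit variance, because the scaler statistics come from the training data alone.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features with a non-zero mean and a large spread.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=100.0, scale=20.0, size=(200, 4))
y_demo = rng.integers(0, 2, size=200)

# Same seed, same split: the two calls return identical arrays.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)
print(X_tr.shape, X_te.shape)       # (150, 4) (50, 4)
print(np.array_equal(X_tr, X_tr2))  # True

# Fit on the training set only, then reuse those statistics on the test set.
scaler_demo = StandardScaler()
X_tr_s = scaler_demo.fit_transform(X_tr)
X_te_s = scaler_demo.transform(X_te)

# Training features are standardised exactly (up to floating-point error).
print(np.allclose(X_tr_s.mean(axis=0), 0.0))  # True
print(np.allclose(X_tr_s.std(axis=0), 1.0))   # True

# Test features are only approximately standardised.
print(X_te_s.mean(axis=0))
```

Fitting the scaler on the training set alone keeps information about the test set out of preprocessing.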

## Exploratory Data Analysis

### Dimensionality Reduction w/ Principal Component Analysis (PCA)

In [9]:
from time import time
from sklearn.decomposition import PCA


# Compute PCA
n_components = 150
t0 = time()

pca = PCA(
    n_components=n_components,
    svd_solver="randomized",
    whiten=True,
).fit(X_train)

print(f"Extracting the top {n_components} from {X_train.shape[0]} faces...")
print(f"done in {time() - t0}s")


# Reshape each principal component into an (h, w) image: the "eigenfaces"
eigenfaces = pca.components_.reshape((n_components, h, w))

print("Projecting input data on the eigenfaces orthonormal basis")

t0 = time()

X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

print(f"done in {time() - t0}s")

Extracting the top 150 from 966 faces...
done in 1.545753002166748s
Projecting input data on the eigenfaces orthonormal basis
done in 0.06431031227111816s
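To illustrate what the PCA step produces, here is a minimal sketch on synthetic data. It assumes tiny 8 x 5 "images" and 10 components, and it uses `svd_solver="full"` rather than the randomized solver above so that the checks hold to floating-point precision; all `*_demo` names are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in: 300 flattened "images" of 8 x 5 = 40 pixels each.
rng = np.random.default_rng(0)
h_demo, w_demo = 8, 5
X_demo = rng.normal(size=(300, h_demo * w_demo))

n_components_demo = 10
pca_demo = PCA(
    n_components=n_components_demo,
    svd_solver="full",  # exact solver, chosen here only to make the checks exact
    whiten=True,
).fit(X_demo)

# Each component is a direction in pixel space and reshapes back to an image.
eigenfaces_demo = pca_demo.components_.reshape(
    (n_components_demo, h_demo, w_demo)
)
print(eigenfaces_demo.shape)  # (10, 8, 5)

# The components form an orthonormal basis.
gram = pca_demo.components_ @ pca_demo.components_.T
print(np.allclose(gram, np.eye(n_components_demo)))  # True

# With whiten=True, each projected training feature has unit sample variance.
Z_demo = pca_demo.transform(X_demo)
print(Z_demo.shape)                                    # (300, 10)
print(np.allclose(Z_demo.var(axis=0, ddof=1), 1.0))    # True

# Fraction of total variance retained by the kept components.
print(pca_demo.explained_variance_ratio_.sum())
```

On the real data, the same `explained_variance_ratio_` sum indicates how much of the pixel variance the 150 eigenfaces retain, which can help judge whether `n_components` is a sensible choice.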
