# PCA Issues

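Standard PCA can be slow on large datasets, and it needs the whole training set in memory.

Randomized PCA is faster when you keep far fewer components than there are features. Incremental PCA uses less memory because it processes the data in mini-batches.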

PCA uses Singular Value Decomposition (SVD) to find the principal components:
each one is the orthogonal axis that captures the most remaining variance.
The full SVD solver costs O(m*n^2 + n^3) for m instances and n features.
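To get a feel for that bound, the sketch below plugs in the MNIST sizes used in this notebook (m = 60,000 instances, n = 784 features, d = 154 components). The O(m*d^2 + d^3) cost for the randomized solver is not stated above; it is the commonly quoted estimate and is an assumption here. Big-O counts ignore constant factors, so read the ratio as a rough indication, not a predicted speedup.

```python
# Rough operation counts from the big-O formulas (constant factors ignored).
m, n, d = 60_000, 784, 154  # instances, features, components kept

full_cost = m * n**2 + n**3  # full SVD: O(m*n^2 + n^3)
rand_cost = m * d**2 + d**3  # randomized SVD (assumed): O(m*d^2 + d^3)

print(f"full SVD:       {full_cost:.2e}")
print(f"randomized SVD: {rand_cost:.2e}")
print(f"ratio:          {full_cost / rand_cost:.1f}")
```

Measured wall-clock gains are usually much smaller than this ratio, because the constant factors and library overhead that big-O hides are significant at this scale.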

In [1]:
# Load MNIST: 60,000 training images of 28x28 pixels each.
from tensorflow.keras.datasets import mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()
# Flatten each image into a 784-element row vector.
num_pixels = 784
X_train1D = X_train.reshape(X_train.shape[0], num_pixels)

import time
from sklearn.decomposition import PCA
pca1 = PCA(n_components=154,svd_solver='full')
start = time.time()
pca1.fit(X_train1D)
done = time.time()
elapsed1 = done-start
explained1 = pca1.explained_variance_ratio_
explained1[:5]

array([0.09704664, 0.07095924, 0.06169089, 0.05389419, 0.04868797])

In [14]:
pca2 = PCA(n_components=154,svd_solver='randomized')
start = time.time()
pca2.fit(X_train1D)
done = time.time()
elapsed2 = done-start
explained2 = pca2.explained_variance_ratio_
explained2[:5]

array([0.09704664, 0.07095924, 0.06169089, 0.05389419, 0.04868797])

In [15]:
# The differences are on the order of 1e-15, i.e. floating-point rounding error.
(explained1-explained2)[:5]

array([1.73472348e-15, 1.36002321e-15, 9.71445147e-16, 9.29811783e-16,
       8.11850587e-16])
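The element-wise subtraction can also be written as a tolerance check. The following is a minimal sketch on small synthetic data, not on MNIST: `X_demo`, its shape, the rank-5 structure, and the `1e-6` tolerance are all hypothetical choices made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in data: 500 instances, 30 features, rank 5 plus small noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 30))
X_demo += 0.01 * rng.normal(size=X_demo.shape)

pca_full = PCA(n_components=5, svd_solver="full").fit(X_demo)
# random_state fixes the random projections so the run is repeatable.
pca_rand = PCA(n_components=5, svd_solver="randomized", random_state=0).fit(X_demo)

# Largest absolute gap between the two sets of explained-variance ratios.
max_diff = np.max(np.abs(pca_full.explained_variance_ratio_
                         - pca_rand.explained_variance_ratio_))
# True when every pair of ratios agrees within the chosen tolerance.
same = np.allclose(pca_full.explained_variance_ratio_,
                   pca_rand.explained_variance_ratio_, atol=1e-6)
print(max_diff, same)
```

Passing `random_state` is optional, but without it the randomized solver can return slightly different values from run to run.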

In [16]:
# The randomized solver was about 3.4 seconds (roughly 30%) faster on this run.
elapsed1,elapsed2

(11.489128112792969, 8.100416898727417)

## Incremental PCA (IPCA)
When the training set is too large to fit in RAM, use scikit-learn's IncrementalPCA, which fits the model one mini-batch at a time.

### Solution 1: numpy mmap
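Use numpy.memmap to treat a binary file on disk as an array without loading it all into memory.
Then create IncrementalPCA(batch_size=100) so that only a few instances are processed at a time.
IncrementalPCA.fit() can then be called as usual; it works through the data in batches of batch_size.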
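A minimal sketch of this approach, on small synthetic data so that it runs quickly. The file name `X_demo.dat`, the array shape, and `n_components=5` are hypothetical. With real data, the file would be written once ahead of time and only reopened here.

```python
import os
import tempfile

import numpy as np
from sklearn.decomposition import IncrementalPCA

# Hypothetical stand-in data: 1,000 instances x 20 features.
m, n = 1000, 20
rng = np.random.default_rng(42)
X = rng.normal(size=(m, n)).astype(np.float32)

# Write the data to a raw binary file through a writable memmap.
path = os.path.join(tempfile.mkdtemp(), "X_demo.dat")  # hypothetical file name
X_out = np.memmap(path, dtype=np.float32, mode="w+", shape=(m, n))
X_out[:] = X
X_out.flush()
del X_out  # drop the writable mapping

# Reopen read-only. A raw file stores neither dtype nor shape,
# so both must be passed again and must match what was written.
X_mm = np.memmap(path, dtype=np.float32, mode="r", shape=(m, n))

# fit() splits the input into batches of batch_size rows internally.
ipca_mm = IncrementalPCA(n_components=5, batch_size=100)
ipca_mm.fit(X_mm)

print(ipca_mm.explained_variance_ratio_)
```

The `batch_size` must be at least `n_components`, since each batch has to contain enough instances to update that many components.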

### Solution 2: numpy array_split

In [20]:
# Split the training set into batches and feed them to partial_fit one at a time.
from sklearn.decomposition import IncrementalPCA
import numpy as np

ipca = IncrementalPCA(n_components=154)
num_batches=100
for X_batch in np.array_split(X_train1D,num_batches):
    ipca.partial_fit(X_batch)

In [21]:
explained3 = ipca.explained_variance_ratio_
explained3[:5]

array([0.09704663, 0.07095923, 0.06169087, 0.05389418, 0.04868795])

In [23]:
# The results are not exactly the same as full PCA, but they agree to within about 2.5e-8.
(explained1-explained3)[:5]

array([8.85068449e-09, 1.37314708e-08, 1.68055410e-08, 1.91886324e-08,
       2.41079584e-08])