# Principal Component Analysis (PCA)

## Notes about PCA

Fit PCA on the training set only, then use the fitted model to transform both the training set and the test set.

The rule is the same as for any fitted preprocessing step: **avoid data leakage**.
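One way to make this rule hard to break is to wrap the steps in a scikit-learn pipeline. The sketch below is only an illustration: it assumes the small bundled `load_digits` dataset instead of MNIST, and `max_iter=1000` is an arbitrary choice, not a value taken from this notebook.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small bundled dataset, used here only to keep the sketch fast (assumption).
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# fit() learns the scaler and PCA parameters from the training data only;
# score() and predict() reuse those fitted parameters on the test data.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```

Because the test set only ever passes through `score` or `predict`, it cannot influence the fitted scaler or the PCA components.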

**n_components_**: number of components chosen.

**explained_variance_ratio_**: fraction of the total variance explained by each of the selected components.
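A minimal sketch of reading these two attributes after fitting, assuming a small synthetic dataset (the array `X` and its column scales are made up for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 200 synthetic samples with 5 features of decreasing spread (assumption).
X = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

pca = PCA(n_components=2).fit(X)

print(pca.n_components_)                    # number of components kept
print(pca.explained_variance_ratio_)        # one fraction per kept component
print(pca.explained_variance_ratio_.sum())  # total fraction of variance retained
```

The ratios are fractions between 0 and 1, so multiply by 100 to read them as percentages.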

## PCA quickly



```
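from sklearn.decomposition import PCA

pca = PCA(0.99)                      # keep enough components for 99% of the variance
# pca = PCA(2)                       # or keep exactly 2 components
pca.fit(train_set)                   # fit on the training set only

train_set = pca.transform(train_set)
test_set = pca.transform(test_set)   # reuse the PCA fitted on the training set

variance = pca.explained_variance_ratio_.sum()
nb_components = pca.n_components_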
```



Tutorial: https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60

In [72]:
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784')

## Split Data into Training and Test Sets

In [73]:
from sklearn.model_selection import train_test_split
train_img, test_img, train_lbl, test_lbl = train_test_split(
    mnist.data, mnist.target, test_size=1/7.0, random_state=0
)

## Standardize the Data

**Note:** PCA is affected by scale, so standardize the features before applying it. Here the data is transformed to unit scale (mean = 0 and variance = 1), which many machine learning algorithms also need in order to perform well.
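To see why scale matters, the sketch below compares PCA on raw and standardized data. It assumes two independent synthetic features whose scales differ by a factor of 1000; the data is made up purely for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two independent synthetic features; the second has a much larger scale.
X = np.column_stack([
    rng.normal(scale=1.0, size=1000),
    rng.normal(scale=1000.0, size=1000),
])

raw_ratio = PCA(n_components=2).fit(X).explained_variance_ratio_

X_scaled = StandardScaler().fit_transform(X)
scaled_ratio = PCA(n_components=2).fit(X_scaled).explained_variance_ratio_

print("unscaled:    ", raw_ratio)     # dominated by the large-scale feature
print("standardized:", scaled_ratio)  # variance shared far more evenly
```

Without standardization, the first component mostly reflects the units of the second feature rather than any structure in the data.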

In [74]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Fit on training set only.
scaler.fit(train_img)
# Apply transform to both the training set and the test set.
train_img = scaler.transform(train_img)
test_img = scaler.transform(test_img)

In [None]:
train_img

In [None]:
train_lbl

## Applying PCA

**Note:** PCA(.95) tells scikit-learn to choose the minimum number of principal components such that 95% of the variance is retained.

**Note:** We fit PCA on the training set only.
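The component count that `PCA(.95)` selects can be reproduced by hand from the cumulative explained variance. The following is a sketch on synthetic data (the array `X` is an assumption, not the MNIST data used in this notebook):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 500 synthetic samples, 20 features with gradually decreasing spread.
X = rng.normal(size=(500, 20)) * np.linspace(10.0, 1.0, 20)

# Fit with all components and accumulate the variance fractions.
full = PCA().fit(X)
cumulative = np.cumsum(full.explained_variance_ratio_)

# Position of the first cumulative value that reaches 95%, as a count.
manual_count = int(np.argmax(cumulative >= 0.95)) + 1

pca95 = PCA(n_components=0.95).fit(X)
print(manual_count, pca95.n_components_)
```

Plotting `cumulative` against the component index is a common way to choose the threshold visually.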

In [75]:
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
import time

## PCA with 99% Variance

In [86]:
variances = []
nb_components = []
accuracies = []
times = []

In [88]:
train_img99 = train_img
test_img99 = test_img
pca99 = PCA(0.99)

start = time.time()

# PCA
pca99.fit(train_img99)
train_img99 = pca99.transform(train_img99)
test_img99 = pca99.transform(test_img99)

# Logistic Regression
# all parameters not specified are set to their defaults
# solver is set to 'lbfgs' explicitly: the older default ('liblinear') was very slow here
# ('lbfgs' has been the default solver since scikit-learn 0.22)
logisticRegr = LogisticRegression(solver = 'lbfgs')
logisticRegr.fit(train_img99, train_lbl)

end = time.time()

# Accuracy
accuracy = logisticRegr.score(test_img99, test_lbl)

variances.append(pca99.explained_variance_ratio_.sum())
nb_components.append(pca99.n_components_)
accuracies.append(accuracy)
times.append(end - start)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
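The warning above means the solver stopped at its iteration limit (100 by default) before converging. A minimal sketch of the suggested remedy, raising `max_iter`, is shown below. It assumes the small bundled `load_digits` dataset to stay fast, and the value 1000 is an arbitrary choice that may need tuning on MNIST.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Scale using statistics from the training set only.
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

max_iter = 1000  # assumption: the default is 100
clf = LogisticRegression(solver="lbfgs", max_iter=max_iter)
clf.fit(X_train, y_train)

# n_iter_ holds the iterations actually used; a value equal to max_iter
# suggests the solver stopped at the limit rather than converging.
print("iterations used:", clf.n_iter_.max(), "of", max_iter)
print("test accuracy:", clf.score(X_test, y_test))
```

A higher limit increases training time, so the timings recorded in this notebook would change if it were applied here.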


## PCA with 95% Variance

In [89]:
train_img95 = train_img
test_img95 = test_img
pca95 = PCA(0.95)

start = time.time()

# Applying PCA
pca95.fit(train_img95)
train_img95 = pca95.transform(train_img95)
test_img95 = pca95.transform(test_img95)

logisticRegr = LogisticRegression(solver = 'lbfgs')
logisticRegr.fit(train_img95, train_lbl)

end = time.time()

# Accuracy
accuracy = logisticRegr.score(test_img95, test_lbl)

variances.append(pca95.explained_variance_ratio_.sum())
nb_components.append(pca95.n_components_)
accuracies.append(accuracy)
times.append(end - start)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


## PCA with 90% Variance

In [90]:
train_img90 = train_img
test_img90 = test_img
pca90 = PCA(0.90)

start = time.time()

# Applying PCA
pca90.fit(train_img90)
train_img90 = pca90.transform(train_img90)
test_img90 = pca90.transform(test_img90)

logisticRegr = LogisticRegression(solver = 'lbfgs')
logisticRegr.fit(train_img90, train_lbl)

end = time.time()

# Accuracy
accuracy = logisticRegr.score(test_img90, test_lbl)

variances.append(pca90.explained_variance_ratio_.sum())
nb_components.append(pca90.n_components_)
accuracies.append(accuracy)
times.append(end - start)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


## PCA with 85% Variance

In [91]:
train_img85 = train_img
test_img85 = test_img
pca85 = PCA(0.85)

start = time.time()

# Applying PCA
pca85.fit(train_img85)
train_img85 = pca85.transform(train_img85)
test_img85 = pca85.transform(test_img85)

logisticRegr = LogisticRegression(solver = 'lbfgs')
logisticRegr.fit(train_img85, train_lbl)

end = time.time()

# Accuracy
accuracy = logisticRegr.score(test_img85, test_lbl)

variances.append(pca85.explained_variance_ratio_.sum())
nb_components.append(pca85.n_components_)
accuracies.append(accuracy)
times.append(end - start)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


## Speed comparison

In [96]:
import pandas as pd

pca_performance_metrics = {
    'Variance': variances,
    'N Components': nb_components,
    'Time': times,
    'Accuracy': accuracies
    }

pca_metrics_df = pd.DataFrame(pca_performance_metrics)

In [97]:
pca_metrics_df

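|   | Variance | N Components | Time (s) | Accuracy |
|---|----------|--------------|----------|----------|
| 0 | 0.990041 | 538 | 46.332413 | 0.9173 |
| 1 | 0.950201 | 327 | 42.465435 | 0.9201 |
| 2 | 0.900681 | 234 | 29.369252 | 0.9199 |
| 3 | 0.850359 | 182 | 27.171834 | 0.9191 |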


### Predict the labels of new data (new images)

In [None]:
# Predict for one observation (image).
# logisticRegr was last fitted on the 85%-variance PCA features,
# so it must be given PCA-transformed data (test_img85), not test_img.
logisticRegr.predict(test_img85[0].reshape(1, -1))

# The line below predicts for multiple observations at once
# logisticRegr.predict(test_img85[0:10])

array(['0'], dtype=object)
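New raw images have to pass through the same fitted scaler and PCA before they reach the classifier. The sketch below illustrates that chain under stated assumptions: it uses the bundled `load_digits` dataset rather than MNIST, and names such as `X_new` are chosen for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_new, y_train, y_new = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit every step on the training data only.
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=0.95).fit(scaler.transform(X_train))
clf = LogisticRegression(max_iter=1000)
clf.fit(pca.transform(scaler.transform(X_train)), y_train)

# One raw image: reshape to (1, n_features), then scale and project it.
one_image = X_new[0].reshape(1, -1)
single_prediction = clf.predict(pca.transform(scaler.transform(one_image)))

# Several raw images at once go through the same chain.
batch_predictions = clf.predict(pca.transform(scaler.transform(X_new[:10])))

print(single_prediction, batch_predictions)
```

Passing unprojected data to the classifier would fail, because the number of features would not match what it was trained on.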

## Visualizing The Results