# Principal Component Analysis (PCA) to Speed Up Machine Learning Algorithms

Principal component analysis (PCA) can speed up a learning algorithm by reducing the number of features it has to process. We are going to use the MNIST database of handwritten digits, which has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image.


It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal effort on preprocessing and formatting.

In [5]:
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

### Download and Load the Data

We have already downloaded the data, so we only have to load it from disk. An alternative way to get the data is shown afterwards.

In [13]:
## Load the training set
train_data = np.load('MNIST/train_data.npy')
train_labels = np.load('MNIST/train_labels.npy')

## Load the testing set
test_data = np.load('MNIST/test_data.npy')
test_labels = np.load('MNIST/test_labels.npy')

In [14]:
# Alternative way to get the data: download it with scikit-learn.
# Pass the parameter data_home to choose where the data is downloaded.
    # from sklearn.datasets import fetch_openml
    # mnist = fetch_openml('mnist_784', version=1)
# These are the images
    # mnist.data.shape
# These are the labels
    # mnist.target.shape

### Split Data into Training and Test Sets

This step only applies if you downloaded the data with the alternative method above. The dataset we loaded from disk is already split.

In [15]:
# test_size: what proportion of original data is used for test set
    # train_img, test_img, train_lbl, test_lbl = train_test_split(mnist.data, mnist.target, test_size=1/7.0, random_state=0)
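
As a rough illustration of what `train_test_split` returns, the sketch below splits a small synthetic array instead of MNIST. The array contents, the sample count, and `test_size=0.25` are assumptions chosen only to keep the example small.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for mnist.data / mnist.target: 40 samples, 784 features.
rng = np.random.RandomState(0)
images = rng.rand(40, 784)
labels = rng.randint(0, 10, size=40)

# test_size=0.25 holds out a quarter of the rows; random_state fixes the shuffle.
train_img, test_img, train_lbl, test_lbl = train_test_split(
    images, labels, test_size=0.25, random_state=0
)

print(train_img.shape, test_img.shape)  # (30, 784) (10, 784)
```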

In [16]:
print(train_data.shape)
print(train_labels.shape)
print(test_data.shape)
print(test_labels.shape)

(7500, 784)
(7500,)
(1000, 784)
(1000,)


### Standardizing the Data


Since PCA yields a feature subspace that maximizes the variance along the axes, it makes sense to standardize the data, especially if the features were measured on different scales.

Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (zero mean and unit variance).

In [17]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# Fit on training set only.
scaler.fit(train_data)

# Apply transform to both the training set and the test set.
train_data = scaler.transform(train_data)
test_data = scaler.transform(test_data)
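
To make the scaling step concrete, here is a minimal NumPy sketch of what `StandardScaler` computes, checked against scikit-learn on a tiny made-up matrix. The arrays `X_train` and `X_test` are hypothetical. The constant first column mimics pixels that never change, such as an always-black image border.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 3 training rows, 1 test row, 3 features.
X_train = np.array([[0.0, 10.0, 5.0],
                    [0.0, 20.0, 7.0],
                    [0.0, 30.0, 9.0]])
X_test = np.array([[0.0, 40.0, 6.0]])

# Statistics come from the training set only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)   # population standard deviation (ddof=0)
std[std == 0.0] = 1.0       # constant features are centered but not rescaled

manual_train = (X_train - mean) / std
manual_test = (X_test - mean) / std

scaler = StandardScaler().fit(X_train)
print(np.allclose(manual_train, scaler.transform(X_train)))
print(np.allclose(manual_test, scaler.transform(X_test)))
```

The test row is scaled with the training mean and standard deviation, which is why the scaler is fitted on the training set only.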

### PCA to Speed up Machine Learning Algorithms (Logistic Regression)

**Step 0**: Import and use PCA. After PCA, you will apply a machine learning algorithm of your choice to the transformed data.

In [18]:
from sklearn.decomposition import PCA

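Make an instance of the model. Passing .95 as n_components asks scikit-learn to choose the minimum number of principal components such that 95% of the variance is retained.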

In [20]:
pca = PCA(.95)

Fit PCA on the training set only, so that no information from the test set leaks into the transformation.

In [21]:
pca.fit(train_data)

PCA(copy=True, iterated_power='auto', n_components=0.95, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)

In [22]:
pca.n_components_

274
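
The component count reported above comes from the cumulative explained variance. The sketch below reproduces that selection by hand on synthetic data; the data, its shape, and the explicit `svd_solver='full'` setting are assumptions made for illustration, so the resulting count will differ from the MNIST value.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 200 samples, 20 correlated features plus a little noise.
rng = np.random.RandomState(0)
X = rng.randn(200, 5) @ rng.randn(5, 20) + 0.1 * rng.randn(200, 20)

# Fit with all components and accumulate the explained-variance ratios.
full = PCA(svd_solver='full').fit(X)
cumulative = np.cumsum(full.explained_variance_ratio_)

# Smallest k whose first k components explain more than 95% of the variance.
k = int(np.argmax(cumulative > 0.95)) + 1

# Let scikit-learn pick the count from the same threshold.
pca_95 = PCA(0.95, svd_solver='full').fit(X)
print(k, pca_95.n_components_)
```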

Apply the mapping to both the training set and the test set.

In [23]:
train_data = pca.transform(train_data)
test_data = pca.transform(test_data)
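
For intuition, with the default `whiten=False` the `transform` call amounts to subtracting the training mean and projecting onto the principal axes. The following sketch checks this on random data; the array sizes and `n_components=3` are arbitrary assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 10 features reduced to 3 components.
rng = np.random.RandomState(0)
X_train = rng.randn(100, 10)
X_test = rng.randn(20, 10)

pca = PCA(n_components=3).fit(X_train)

# Center with the mean learned from the training set, then project
# onto the principal axes stored in pca.components_.
manual = (X_test - pca.mean_) @ pca.components_.T

print(manual.shape)
print(np.allclose(manual, pca.transform(X_test)))
```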

**Step 1**: Import the model we want to use

In [25]:
from sklearn.linear_model import LogisticRegression

**Step 2**: Make an instance of the Model

All parameters not specified are set to their defaults. The default solver in the scikit-learn version used here is incredibly slow on this data, so it is better to change it.

In [26]:
logisticRegr = LogisticRegression(solver = 'lbfgs')

**Step 3**: Train the model on the data, storing the information learned from the data.

In [28]:
logisticRegr.fit(train_data,train_labels)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)

**Step 4**: Predict the labels of new data

In [31]:
# We can predict a single observation; the result is a NumPy array
print(logisticRegr.predict(test_data[0].reshape(1,-1)))

# Or we can predict for multiple observations at once
print(logisticRegr.predict(test_data[0:10]))

[0]
[0 2 6 5 9 6 0 7 1 1]
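
The `reshape(1,-1)` in the single-observation call is needed because scikit-learn estimators expect a 2-D array of shape (n_samples, n_features). A small NumPy sketch, using a made-up array in place of the real `test_data`:

```python
import numpy as np

# Hypothetical stand-in for test_data: 3 samples, 4 features.
test_data = np.arange(12.0).reshape(3, 4)

single = test_data[0]
print(single.shape)                 # (4,)   1-D: a bare feature vector
print(single.reshape(1, -1).shape)  # (1, 4) 2-D: one sample, four features
print(test_data[0:1].shape)         # (1, 4) slicing keeps the array 2-D
```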


### Measuring Model Performance

We are going to use accuracy: the number of correct predictions divided by the total number of data points.
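
As a small worked example of this definition, the sketch below computes accuracy by hand and compares it with `accuracy_score` from scikit-learn. The label arrays are made up for illustration and are not taken from the MNIST run.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels: the last prediction is wrong, the other nine are right.
y_true = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3])
y_pred = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 8])

# Fraction of positions where prediction and label agree.
accuracy = np.mean(y_pred == y_true)
print(accuracy)  # 0.9
print(accuracy == accuracy_score(y_true, y_pred))
```

The `score` method of a scikit-learn classifier reports this same mean accuracy.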

In [32]:
score = logisticRegr.score(test_data, test_labels)
print(score)

0.905
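
To check whether PCA actually shortens training, one option is to time the fit with and without it. The sketch below does this on the small 8x8 digits dataset bundled with scikit-learn rather than on MNIST; the dataset choice, `max_iter=1000`, and the helper `fit_and_score` are assumptions for illustration. On data this small the timing difference may be negligible or even reversed, so treat it as a template rather than a benchmark.

```python
import time
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Small bundled dataset (64 features), used here instead of MNIST.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Standardize, fitting the scaler on the training set only.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Keep enough components to retain 95% of the variance.
pca = PCA(.95).fit(X_train)
X_train_pca, X_test_pca = pca.transform(X_train), pca.transform(X_test)

def fit_and_score(train_X, test_X):
    """Hypothetical helper: return (fit time in seconds, test accuracy)."""
    model = LogisticRegression(solver='lbfgs', max_iter=1000)
    start = time.perf_counter()
    model.fit(train_X, y_train)
    elapsed = time.perf_counter() - start
    return elapsed, model.score(test_X, y_test)

time_raw, acc_raw = fit_and_score(X_train, X_test)
time_pca, acc_pca = fit_and_score(X_train_pca, X_test_pca)

print("features:", X_train.shape[1], "->", X_train_pca.shape[1])
print("fit time (s):", time_raw, "->", time_pca)
print("accuracy:", acc_raw, "->", acc_pca)
```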
