# Objective:

### Use the MNIST dataset and apply PCA to measure its impact on model training time and model performance
### The work is taken from https://github.com/mGalarnyk/Python_Tutorials/blob/master/Sklearn/PCA/PCA_to_Speed-up_Machine_Learning_Algorithms.ipynb

In [1]:
# Setup
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_openml


# Download and Load the Data


In [2]:
from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784', version=1, cache=True)
mnist.target = mnist.target.astype(np.int8) # fetch_openml() returns targets as strings

X, y = mnist["data"], mnist["target"]

# Split data into train/test

In [3]:
# Split the dataset into 80% training and 20% test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# View Data Dimension

In [4]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape


((56000, 784), (14000, 784), (56000,), (14000,))
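
The split above is random, so the exact rows (and every number downstream) change from run to run. A minimal sketch, assuming reproducibility and balanced classes are wanted: `random_state` and `stratify` are standard `train_test_split` parameters, while the toy arrays `X_demo` and `y_demo` are hypothetical stand-ins for the MNIST data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 2 features, two balanced classes.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.repeat([0, 1], 50)

# random_state fixes the shuffle; stratify keeps the class proportions
# the same in the training and test parts.
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print(Xd_train.shape, Xd_test.shape)
print(np.bincount(yd_train), np.bincount(yd_test))
```

With a fixed seed, reruns of the notebook compare like with like when only the PCA setting changes.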

# Standardizing the Data

Since PCA yields a feature subspace that maximizes the variance along the axes, it makes sense to standardize the data, especially if the features were measured on different scales.

Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data.

Example covering the importance of feature scaling: http://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html#sphx-glr-auto-examples-preprocessing-plot-scaling-importance-py
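
To make the standardization step concrete, here is a small sketch of what `StandardScaler` computes: each feature is shifted by the training mean and divided by the training standard deviation. The arrays `train` and `test` are hypothetical toy values, not MNIST data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales.
train = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
test = np.array([[2.0, 200.0]])

# The scaler learns its mean and standard deviation from the training rows only.
scaler_demo = StandardScaler().fit(train)

# Manual z-score using the training mean and population standard deviation.
manual = (test - train.mean(axis=0)) / train.std(axis=0)

print(scaler_demo.transform(test))
print(np.allclose(scaler_demo.transform(test), manual))
```

Reusing the training statistics for the test rows keeps information from the test set out of the preprocessing step.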


In [5]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# Fit on training set only.
scaler.fit(X_train)

# Apply transform to both the training set and the test set.
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

#X_train.shape, X_test.shape, y_train.shape, y_test.shape


In [6]:
# To see what the scaled values look like, uncomment the lines below
#from scipy.stats import describe
#describe(X_train)[1]

In [7]:
from sklearn.decomposition import PCA
# Specify the fraction of variance PCA should retain (a float between 0 and 1)
#pca = PCA('mle')
pca = PCA(0.9)

pca.fit(X_train)


PCA(copy=True, iterated_power='auto', n_components=0.9, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)

# Look at components

In [8]:
pca.n_components_


233
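
When `n_components` is a float between 0 and 1, PCA keeps the smallest number of components whose cumulative explained variance ratio exceeds that fraction. The sketch below reproduces the count by hand. It assumes the small 8x8 digits dataset bundled with scikit-learn as a stand-in for MNIST so that it runs without a download; the names `X_digits`, `k_manual`, and `k_sklearn` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled 8x8 digits set stands in for MNIST.
X_digits = StandardScaler().fit_transform(load_digits().data)

# Fit with all components to obtain the full variance spectrum.
full_pca = PCA(svd_solver="full").fit(X_digits)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio exceeds the target.
target = 0.9
k_manual = int(np.searchsorted(cumulative, target, side="right") + 1)

# The same request expressed through the float n_components shortcut.
k_sklearn = PCA(n_components=target, svd_solver="full").fit(X_digits).n_components_

print(k_manual, k_sklearn)
```

Plotting `cumulative` against the component index is a common way to choose the variance threshold before committing to one.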

# Apply the mapping (transform) to both the training set and the test set.



In [9]:
X_train = pca.transform(X_train)
X_test = pca.transform(X_test)

# Build a linear model and measure the model fitting time.

In [10]:
from sklearn.linear_model import LogisticRegression
logisticRegr = LogisticRegression(multi_class='auto')

import datetime
start = datetime.datetime.now()
logisticRegr.fit(X_train, y_train)
end = datetime.datetime.now()



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
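
The message above means the solver stopped at its iteration cap rather than at convergence, so the measured time partly reflects the cap. A hedged sketch of how one might detect this and raise `max_iter`: it assumes the bundled digits dataset instead of MNIST, and the helper `fit_and_time` is a hypothetical name introduced here.

```python
import time
import warnings
from sklearn.datasets import load_digits
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled digits set stands in for MNIST.
digits = load_digits()
X_small = StandardScaler().fit_transform(digits.data)
y_small = digits.target

def fit_and_time(max_iter):
    """Fit a classifier and return (seconds, iterations used, warned?)."""
    clf = LogisticRegression(max_iter=max_iter)
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        start = time.perf_counter()
        clf.fit(X_small, y_small)
        elapsed = time.perf_counter() - start
    warned = any(issubclass(w.category, ConvergenceWarning) for w in caught)
    return elapsed, int(clf.n_iter_.max()), warned

# A very low cap triggers the warning; a higher cap gives the solver room.
for limit in (5, 1000):
    seconds, n_iter, warned = fit_and_time(limit)
    print(f"max_iter={limit}: {seconds:.3f}s, iterations={n_iter}, warning={warned}")
```

`time.perf_counter()` is used here because it is a monotonic, high-resolution clock, which suits short timing measurements.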


In [11]:

print(end-start)
#logisticRegr.predict(X_train[0].reshape(1,-1))


0:00:15.620661


# Measuring Model Performance

In [12]:
score = logisticRegr.score(X_test, y_test)
print(score)

0.9245714285714286
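
The scale, project, and classify steps can also be chained in a single scikit-learn pipeline, which fits the scaler and PCA on the training data only and reapplies them at scoring time. This is a sketch under stated assumptions, not the notebook's original method: it uses the bundled digits dataset rather than MNIST, and `max_iter=1000` is an arbitrary choice to give the solver room.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled digits set stands in for MNIST.
Xp, yp = load_digits(return_X_y=True)
Xp_train, Xp_test, yp_train, yp_test = train_test_split(
    Xp, yp, test_size=0.2, random_state=0
)

# Scale, keep 90% of the variance, then classify.
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.9, svd_solver="full"),
    LogisticRegression(max_iter=1000),
)
pipe.fit(Xp_train, yp_train)
pipe_score = pipe.score(Xp_test, yp_test)

print(pipe.named_steps["pca"].n_components_, pipe_score)
```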


# Number of Components, Variance, Time Table


In [13]:
pd.DataFrame(data = [[1.00, 784, 48.94, .9158],
                     [.99, 541, 34.69, .9169],
                     [.95, 330, 13.89, .92],
                     [.90, 236, 10.56, .9168],
                     [.85, 184, 8.85, .9156]], 
             columns = ['Variance Retained',
                      'Number of Components', 
                      'Time (seconds)',
                      'Accuracy'])

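|   | Variance Retained | Number of Components | Time (seconds) | Accuracy |
|---|-------------------|----------------------|----------------|----------|
| 0 | 1.0               | 784                  | 48.94          | 0.9158   |
| 1 | 0.99              | 541                  | 34.69          | 0.9169   |
| 2 | 0.95              | 330                  | 13.89          | 0.92     |
| 3 | 0.9               | 236                  | 10.56          | 0.9168   |
| 4 | 0.85              | 184                  | 8.85           | 0.9156   |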
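
A table like the one above can be generated with a loop rather than typed in by hand. The sketch below is one possible way to do it, assuming the bundled digits dataset in place of MNIST, so its numbers will differ from those in the table; the variable names are illustrative.

```python
import time
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled digits set stands in for MNIST.
Xs, ys = load_digits(return_X_y=True)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    Xs, ys, test_size=0.2, random_state=0
)
sweep_scaler = StandardScaler().fit(Xs_train)
Xs_train, Xs_test = sweep_scaler.transform(Xs_train), sweep_scaler.transform(Xs_test)

rows = []
for variance in (0.99, 0.95, 0.90, 0.85):
    pca_k = PCA(n_components=variance, svd_solver="full").fit(Xs_train)
    clf = LogisticRegression(max_iter=1000)
    start = time.perf_counter()           # time only the classifier fit
    clf.fit(pca_k.transform(Xs_train), ys_train)
    elapsed = time.perf_counter() - start
    accuracy = clf.score(pca_k.transform(Xs_test), ys_test)
    rows.append([variance, pca_k.n_components_, elapsed, accuracy])

results = pd.DataFrame(rows, columns=["Variance Retained",
                                      "Number of Components",
                                      "Time (seconds)",
                                      "Accuracy"])
print(results)
```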


In [14]:
# My own results

# PCA(0.99): n_components = mle (81), acc = 0.9167857142857143
# PCA(0.85): n_components = 150, acc = 0.916
# PCA(0.7): n_components = 53, acc = 0.906

