**PCA + Logistic Regression (MNIST)**
The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image.
It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal effort on preprocessing and formatting.

| Parameter | Value |
| --- | --- |
| Classes | 10 |
| Samples per class | ~7,000 |
| Samples total | 70,000 |
| Dimensionality | 784 |
| Features | integer values from 0 to 255 |

The MNIST database of handwritten digits is available at:
http://yann.lecun.com/exdb/mnist/

Tutorial Link: https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60

Github Link: https://github.com/mGalarnyk/Python_Tutorials/blob/master/Sklearn/PCA/PCA_Image_Reconstruction_and_such.ipynb

Sebastian Article link: https://sebastianraschka.com/Articles/2015_pca_in_3_steps.html

**Logistic Regression and PCA**

One of the most important applications of PCA is speeding up machine learning algorithms. Using the Iris dataset would be impractical here because it has only 150 rows and 4 feature columns. The MNIST database of handwritten digits is more suitable: it has 784 feature columns (784 dimensions), a training set of 60,000 examples, and a test set of 10,000 examples.

**Download and Load the Data**
You can also add a data_home parameter to fetch_openml to change where the data is downloaded.

In [None]:
from sklearn.datasets import fetch_openml

# as_frame=False returns NumPy arrays instead of a pandas DataFrame
mnist = fetch_openml('mnist_784', version=1, as_frame=False)

The images that you downloaded are contained in mnist.data, which has a shape of (70000, 784): 70,000 images with 784 dimensions (784 features) each.

The labels (the digits 0–9, stored as strings) are contained in mnist.target. Each 784-dimensional feature vector is a flattened 28 x 28 image.
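To make the link between the 784 features and the 28 x 28 image concrete, here is a minimal sketch. It assumes each row is flattened row by row, and it uses random pixel values as a stand-in for a real row of mnist.data (the name `flat_image` is made up for this example).

```python
import numpy as np

# Stand-in for one row of mnist.data: 784 pixel intensities in 0..255.
# (Random values here; a real row would come from the downloaded dataset.)
rng = np.random.default_rng(0)
flat_image = rng.integers(0, 256, size=784)

# Reshape the flat vector back into a 28 x 28 grid of pixels.
image = flat_image.reshape(28, 28)

print(flat_image.shape)  # (784,)
print(image.shape)       # (28, 28)
```

The second row of the image starts at position 28 of the flat vector, so `image[1, 0]` and `flat_image[28]` are the same pixel.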

**Split Data into Training and Test Sets**
Typically the train/test split is 80% training and 20% test. In this case, I chose 6/7 of the data for training and 1/7 for the test set.

In [None]:
from sklearn.model_selection import train_test_split

# test_size: what proportion of original data is used for test set
train_img, test_img, train_lbl, test_lbl = train_test_split(
    mnist.data, mnist.target, test_size=1/7.0, random_state=0)
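As a quick sanity check on what test_size=1/7.0 does, the sketch below runs the same call on a small made-up array (70 rows instead of 70,000; the names `X` and `y` are placeholders, not part of the tutorial's data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 70 samples with 4 features each.
X = np.arange(70 * 4).reshape(70, 4)
y = np.arange(70) % 10

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/7.0, random_state=0
)

# Roughly one seventh of the rows end up in the test set.
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```

Fixing random_state makes the shuffle reproducible, so the same rows land in the test set on every run.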

**Standardize the Data**
PCA is affected by scale, so you need to scale the features in the data before applying PCA. You can transform the data onto unit scale (mean = 0 and variance = 1), which is a requirement for the optimal performance of many machine learning algorithms. StandardScaler helps standardize the dataset's features. Note that you fit on the training set only, then transform both the training and test sets. If you want to see the negative effect that not scaling your data can have, scikit-learn has a section on the effects of not standardizing your data.

**Importance of Feature Scaling**

https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html#sphx-glr-auto-examples-preprocessing-plot-scaling-importance-py

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Fit on training set only.
scaler.fit(train_img)

# Apply transform to both the training set and the test set.
train_img = scaler.transform(train_img)
test_img = scaler.transform(test_img)
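The following sketch spells out what "fit on the training set, transform both" means. It uses small made-up arrays (`train`, `test`) rather than MNIST, with one constant column to mimic pixels that are blank in every image.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical data: two pixel-like features plus one constant column.
train = np.column_stack([
    rng.integers(0, 256, size=100),
    rng.integers(0, 256, size=100),
    np.zeros(100),
]).astype(float)
test = np.column_stack([
    rng.integers(0, 256, size=20),
    rng.integers(0, 256, size=20),
    np.zeros(20),
]).astype(float)

scaler = StandardScaler().fit(train)   # statistics come from the training set only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)   # reuses the training mean and scale

# The same transform written out by hand for the non-constant columns.
manual = (test[:, :2] - train[:, :2].mean(axis=0)) / train[:, :2].std(axis=0)
print(np.allclose(test_scaled[:, :2], manual))
print(train_scaled.mean(axis=0).round(6))
```

The constant column has zero variance, so StandardScaler leaves it at zero instead of dividing by zero.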

**Import and Apply PCA**
Notice the code below passes .95 as the number of components parameter. This means that scikit-learn chooses the minimum number of principal components such that 95% of the variance is retained.

In [None]:
from sklearn.decomposition import PCA

# Make an instance of the Model
pca = PCA(.95)

In [None]:
#Fit PCA on training set. Note: you are fitting PCA on the training set only.

pca.fit(train_img)

PCA(copy=True, iterated_power='auto', n_components=0.95, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)

Note: After fitting the model, you can find out how many components PCA chose using pca.n_components_. In this case, retaining 95% of the variance takes 330 principal components.
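To see how a variance fraction turns into a component count, the sketch below compares PCA(n_components=0.95) with a manual count based on the cumulative explained variance ratio. The data is synthetic (the array `X` and its shape are assumptions for illustration), so the count it prints is not the MNIST figure.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical data: 200 samples, 10 features with decreasing spread.
X = rng.normal(size=(200, 10)) * np.arange(10, 0, -1)

# Full decomposition, to see how the variance accumulates per component.
full = PCA(svd_solver="full").fit(X)
cumulative = np.cumsum(full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio exceeds 0.95.
k_manual = int(np.searchsorted(cumulative, 0.95, side="right") + 1)

pca = PCA(n_components=0.95, svd_solver="full").fit(X)
print(pca.n_components_, k_manual)
print(pca.explained_variance_ratio_.sum())
```

Both counts should agree, and the retained components together explain at least 95% of the variance.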

**Apply the mapping (transform) to both the training set and the test set.**

In [None]:
train_img = pca.transform(train_img)
test_img = pca.transform(test_img)

**Apply Logistic Regression to the Transformed Data**

Step 1: Import the model you want to use.
In sklearn, all machine learning models are implemented as Python classes.

In [None]:
from sklearn.linear_model import LogisticRegression

Step 2: Make an instance of the Model.

In [None]:
# all parameters not specified are set to their defaults
# solver is set to 'lbfgs' explicitly: before scikit-learn 0.22 the default
# solver was 'liblinear', which is incredibly slow on this data

logisticRegr = LogisticRegression(max_iter=2000, solver='lbfgs')

Step 3: Train the model on the data, storing the information learned from the data.

The model is learning the relationship between digits and labels.

With the default setting I was getting an error asking me to increase the maximum number of iterations, which is why max_iter=2000 is set in the code above.

In [None]:
logisticRegr.fit(train_img, train_lbl)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=2000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
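The scaling, PCA, and logistic regression steps can also be chained so that fitting on training data only happens automatically. The sketch below is one possible arrangement using scikit-learn's make_pipeline, not the method used in this notebook. It assumes load_digits (the small 8 x 8 digits set bundled with scikit-learn) as a stand-in for MNIST so that it runs quickly without a download.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: load_digits stands in for MNIST (1,797 images, 64 features).
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/7.0, random_state=0
)

# fit() fits the scaler and PCA on the training data, then the classifier.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),
    LogisticRegression(max_iter=2000, solver="lbfgs"),
)
model.fit(X_train, y_train)

# score() applies the already-fitted scaler and PCA to the test data.
print(model.named_steps["pca"].n_components_)
print(model.score(X_test, y_test))
```

Keeping the steps in one object makes it harder to accidentally fit the scaler or PCA on test data.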

Step 4: Predict the labels of new data (new images)

Uses the information the model learned during the model training process

The code below predicts for one observation

In [None]:
# Predict for One Observation (image)
logisticRegr.predict(test_img[0].reshape(1,-1))

array(['0'], dtype=object)
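The reshape(1,-1) call is needed because predict expects a 2-D array with one row per sample, while indexing a single row gives a 1-D array. A minimal sketch with a zero-filled stand-in for the transformed test set (the 5 x 330 shape is assumed for illustration):

```python
import numpy as np

# Stand-in for the PCA-transformed test set: 5 samples, 330 components each.
test_img = np.zeros((5, 330))

single = test_img[0]             # 1-D: the features of one sample
as_row = single.reshape(1, -1)   # 2-D: one sample, all of its features

print(single.shape)          # (330,)
print(as_row.shape)          # (1, 330)
print(test_img[0:1].shape)   # (1, 330): slicing keeps the sample axis
```

Slicing with test_img[0:1] is an equivalent way to get a single-row 2-D array.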

In [None]:
# Predict for One Observation (image)
logisticRegr.predict(test_img[1].reshape(1,-1))

array(['4'], dtype=object)

In [None]:
# Predict for multiple observations (images) at once
logisticRegr.predict(test_img[0:10])

array(['0', '4', '1', '2', '4', '7', '7', '1', '1', '7'], dtype=object)

In [None]:
# Predict for the entire test set
Predictions = logisticRegr.predict(test_img)

**Measuring Model Performance**

While accuracy is not always the best metric for machine learning algorithms (precision, recall, F1 score, ROC curve, etc. are often more informative), it is used here for simplicity.

ROC: https://towardsdatascience.com/receiver-operating-characteristic-curves-demystified-in-python-bd531a4364d0

Accuracy is defined as the fraction of correct predictions: correct predictions / total number of data points.
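The definition can be worked through by hand on a few labels. The arrays below are made up for illustration and stored as strings, like the predicted labels shown earlier.

```python
import numpy as np

# Hypothetical true and predicted labels for ten images.
true_labels = np.array(['0', '4', '1', '2', '4', '7', '7', '1', '1', '7'])
predicted = np.array(['0', '4', '1', '2', '9', '7', '7', '1', '1', '1'])

# Count the positions where prediction and truth agree.
correct = (true_labels == predicted).sum()
accuracy = correct / len(true_labels)

print(correct, accuracy)  # 8 0.8
```

Two of the ten predictions differ from the true labels, so the accuracy is 8 / 10.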

In [None]:
logisticRegr.score(test_img, test_lbl)

0.9184

Because logistic regression is a classification model, unlike linear regression, reporting accuracy alone can be misleading.
**Confusion Matrix** is one way to evaluate the performance of your model. Checking the counts of true positives and false negatives (Type II errors) is really important.

**ROC Curve**: The receiver operating characteristic (ROC) curve summarizes the model's performance by evaluating the trade-off between the true positive rate (sensitivity) and the false positive rate (1 - specificity).
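ROC analysis is defined for two classes, so with ten digits one common approach is one-vs-rest: each digit is scored against all the others and the results are averaged. The sketch below shows this with roc_auc_score. It assumes load_digits as a small stand-in for MNIST and skips the PCA step, so its number is not comparable to the results in this notebook.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: load_digits stands in for MNIST so the example runs quickly.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

scaler = StandardScaler().fit(X_train)
clf = LogisticRegression(max_iter=2000).fit(scaler.transform(X_train), y_train)

# Class probabilities, one column per digit.
proba = clf.predict_proba(scaler.transform(X_test))

# One-vs-rest area under the ROC curve, macro-averaged over the ten digits.
auc = roc_auc_score(y_test, proba, multi_class="ovr")
print(auc)
```

An area of 1.0 would mean every digit is perfectly separated from the rest, and 0.5 corresponds to random guessing.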

**Confusion Matrix**

A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known. In this section, I am just showing two Python packages (Seaborn and Matplotlib) for making confusion matrices more understandable and visually appealing.

https://intellipaat.com/blog/confusion-matrix-python/

In [None]:
from sklearn import metrics

cm = metrics.confusion_matrix(test_lbl, Predictions)
print(cm)

[[ 965    0    2    2    1   10    9    1    5    1]
 [   0 1105   13    1    1    6    0    4    9    2]
 [   3   15  929   19   13    4   14   12   27    4]
 [   1    7   39  889    1   29    1   13   20   13]
 [   1    3    8    0  901    0   11    7    4   27]
 [   7    2    9   29    7  759   15    3   27    5]
 [   8    2    9    0   13   14  935    1    5    2]
 [   4    4   16    2   11    5    0  977    6   39]
 [   3   19    8   20    7   25    7    2  858   14]
 [   4    4    3   11   30   10    1   34    6  866]]
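The matrix above is enough to recompute the headline metrics by hand. In scikit-learn's layout, rows are the true digits and columns are the predicted digits, so the diagonal holds the correct predictions. A short sketch using the values copied from the output above:

```python
import numpy as np

# Confusion matrix copied from the printed output
# (rows: true digit, columns: predicted digit).
cm = np.array([
    [965, 0, 2, 2, 1, 10, 9, 1, 5, 1],
    [0, 1105, 13, 1, 1, 6, 0, 4, 9, 2],
    [3, 15, 929, 19, 13, 4, 14, 12, 27, 4],
    [1, 7, 39, 889, 1, 29, 1, 13, 20, 13],
    [1, 3, 8, 0, 901, 0, 11, 7, 4, 27],
    [7, 2, 9, 29, 7, 759, 15, 3, 27, 5],
    [8, 2, 9, 0, 13, 14, 935, 1, 5, 2],
    [4, 4, 16, 2, 11, 5, 0, 977, 6, 39],
    [3, 19, 8, 20, 7, 25, 7, 2, 858, 14],
    [4, 4, 3, 11, 30, 10, 1, 34, 6, 866],
])

accuracy = np.trace(cm) / cm.sum()         # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)      # correct / all true members of a class
precision = np.diag(cm) / cm.sum(axis=0)   # correct / all predictions of a class

print(accuracy)  # 0.9184
print(recall.round(2))
print(precision.round(2))
```

The accuracy matches the value returned by logisticRegr.score, and the off-diagonal entries show which digits are confused with each other, for example 39 true 7s predicted as 9.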


In [None]:
# Use a separate name so the metrics.classification_report function is not shadowed
report = metrics.classification_report(test_lbl, Predictions)
print(report)

              precision    recall  f1-score   support

           0       0.97      0.97      0.97       996
           1       0.95      0.97      0.96      1141
           2       0.90      0.89      0.89      1040
           3       0.91      0.88      0.90      1013
           4       0.91      0.94      0.93       962
           5       0.88      0.88      0.88       863
           6       0.94      0.95      0.94       989
           7       0.93      0.92      0.92      1064
           8       0.89      0.89      0.89       963
           9       0.89      0.89      0.89       969

    accuracy                           0.92     10000
   macro avg       0.92      0.92      0.92     10000
weighted avg       0.92      0.92      0.92     10000
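The last two rows of the report differ only in how the per-class values are averaged: the macro average treats every digit equally, while the weighted average weights each digit by its support. A sketch using the precision column and supports copied from the report (the precision values are already rounded to two decimals, so the averages are approximate):

```python
import numpy as np

# Per-class precision and support, copied from the report.
precision = np.array([0.97, 0.95, 0.90, 0.91, 0.91, 0.88, 0.94, 0.93, 0.89, 0.89])
support = np.array([996, 1141, 1040, 1013, 962, 863, 989, 1064, 963, 969])

macro_avg = precision.mean()                           # every class counts equally
weighted_avg = np.average(precision, weights=support)  # classes weighted by size

print(round(float(macro_avg), 2), round(float(weighted_avg), 2))  # 0.92 0.92
```

The two averages agree here because the ten classes are close to balanced.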

