### What are the main motivations for reducing a dataset’s dimensionality? 

+ Speed up a subsequent training algorithm
+ Visualize the data and gain insight into the most important features
+ Save space (compression)



### What are the main drawbacks? 

+ Some info is lost, possibly degrading performance
+ Computationally intensive
+ Adds some complexity to your ML pipelines
+ Transformed features can be hard to interpret

### What are other applications of PCA (other than visualizing data)?

- Data compression
- Image processing
- Exploratory data analysis
- Pattern recognition
- Time series prediction

### What are the limitations of PCA?

 - Some information is lost when the low-variance components are discarded.
 - PCA only captures linear correlations between variables, so it can miss nonlinear structure in the data.
 - PCA relies on the mean and covariance alone, so it falls short when those are not enough to describe the dataset.
 - The number of principal components to keep is not known in advance; in practice, rules of thumb are applied.
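
One common rule of thumb for the last point is to keep the smallest number of components whose cumulative explained variance reaches a chosen threshold. The following is a minimal sketch of that idea using only NumPy on synthetic data; the helper name `components_for_variance` and the synthetic matrix `X_demo` are illustrative assumptions, not part of the exercise.

```python
import numpy as np

def components_for_variance(X, threshold=0.95):
    """Return the smallest k whose cumulative explained variance ratio
    reaches `threshold`, along with the cumulative ratios themselves."""
    X_centered = X - X.mean(axis=0)
    # Squared singular values of the centered data are proportional
    # to the variance captured by each principal component.
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    explained_ratio = singular_values**2 / np.sum(singular_values**2)
    cumulative = np.cumsum(explained_ratio)
    # First index where the cumulative ratio reaches the threshold (0-based), plus one.
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(cumulative)), cumulative

# Synthetic data (assumption): 3 strong latent directions embedded in 20 features, plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 20))
X_demo = latent @ mixing + 0.05 * rng.normal(size=(500, 20))

k, cumulative = components_for_variance(X_demo, threshold=0.95)
print(k, cumulative[:k])
```

Because the synthetic data has only three meaningful directions, the chosen `k` should be at most 3 even though there are 20 features.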

### Load the MNIST dataset (given below) 


In [1]:
import pandas as pd
import numpy as np

In [2]:
from sklearn.datasets import fetch_openml

In [6]:
X, y = fetch_openml("mnist_784", version=1, return_X_y=True)

In [7]:
print(X.shape, y.shape)

(70000, 784) (70000,)


### Split it into a training set and a test set
### Take the first 60,000 instances for training, and the remaining 10,000 for testing.

In [8]:
X_train = X[:60000]
X_test = X[60000:]
y_train = y[:60000]
y_test = y[60000:]

### Train a Random Forest classifier on the dataset and time how long it takes, 
### then evaluate the resulting model on the test set. 

In [9]:
import time
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100)

# Time only the call to fit()
start_time = time.time()
rf.fit(X_train, y_train)

print("Time taken to train RandomForest model: " + str(time.time() - start_time))

Time taken to train RandomForest model: 41.1946496963501
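
`time.time()` reports wall-clock time, which can shift if the system clock is adjusted. For interval timing, `time.perf_counter()` is usually the safer choice. Below is a minimal sketch of a reusable timer; the `timer` context manager is a hypothetical helper, not part of the original notebook, and the dummy workload stands in for the `fit` call.

```python
import time
from contextlib import contextmanager

@contextmanager
def timer(label):
    """Measure the enclosed block with perf_counter and print the elapsed time."""
    result = {}
    start = time.perf_counter()
    try:
        yield result
    finally:
        # Store the elapsed time so callers can compare runs afterwards.
        result["seconds"] = time.perf_counter() - start
        print(f"{label}: {result['seconds']:.2f} s")

# Dummy workload (assumption) in place of rf.fit(X_train, y_train)
with timer("Dummy workload") as t:
    total = sum(i * i for i in range(100_000))
```

In the notebook, the same pattern could wrap the training call, for example `with timer("RandomForest fit"): rf.fit(X_train, y_train)`.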


In [10]:
from sklearn.metrics import classification_report, confusion_matrix

# Predict on the held-out test set
pred = rf.predict(X_test)

print("Classification Report")
print(classification_report(y_test, pred))

print("Confusion Report")
print(confusion_matrix(y_test, pred))

Classification Report
              precision    recall  f1-score   support

           0       0.97      0.99      0.98       980
           1       0.99      0.99      0.99      1135
           2       0.97      0.97      0.97      1032
           3       0.96      0.97      0.96      1010
           4       0.98      0.97      0.97       982
           5       0.97      0.96      0.97       892
           6       0.98      0.98      0.98       958
           7       0.97      0.96      0.96      1028
           8       0.96      0.95      0.96       974
           9       0.96      0.95      0.95      1009

    accuracy                           0.97     10000
   macro avg       0.97      0.97      0.97     10000
weighted avg       0.97      0.97      0.97     10000

Confusion Report
[[ 971    1    0    0    0    4    2    1    1    0]
 [   0 1124    2    3    0    2    2    0    2    0]
 [   7    0  998    5    3    0    3   10    6    0]
 [   0    0    9  977    0    5    0    9  

### Next, use PCA to reduce the dataset’s dimensionality, with an explained variance ratio of 95%.

In [11]:
from sklearn.decomposition import PCA

# A float in (0, 1) asks PCA to keep enough components to explain that share of the variance
pca = PCA(n_components=0.95)

X_data = pca.fit_transform(X)
X_pca = pd.DataFrame(X_data)

In [15]:
print(X_pca.shape)

(70000, 154)
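
Note that the cell above fits PCA on all 70,000 instances, so the projection has already seen the test set. A leak-free variant would fit PCA on the training split only and then apply the same projection to the test split. The sketch below illustrates the pattern on scikit-learn's small bundled digits dataset, an assumption made to keep the example fast; it is not MNIST, and the variable names are illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

# Small 8x8 digits dataset (64 features), used here only as a stand-in for MNIST
X_small, y_small = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_small, y_small, test_size=0.2, random_state=42
)

pca_small = PCA(n_components=0.95)
# Learn the mean and principal components from the training split only
X_tr_reduced = pca_small.fit_transform(X_tr)
# Reuse the fitted projection on the test split; do not refit here
X_te_reduced = pca_small.transform(X_te)

print(X_tr_reduced.shape, X_te_reduced.shape)
print(pca_small.n_components_)
```

Applied to this exercise, that would mean calling `fit_transform` on `X_train` and `transform` on `X_test`, which may select a slightly different number of components than the 154 reported above.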


### Train a new Random Forest classifier on the reduced dataset and see how long it takes.
### Was training much faster? 

In [12]:
X_train_pca = X_pca[:60000]
X_test_pca = X_pca[60000:]

In [13]:
rf_pca = RandomForestClassifier(n_estimators=100)

# Time only the call to fit() on the reduced features
start_time = time.time()
rf_pca.fit(X_train_pca, y_train)

print("Time taken to train RandomForest model with PCA: " + str(time.time() - start_time))

Time taken to train RandomForest model with PCA: 80.49040150642395


### Next evaluate the classifier on the test set: how does it compare to the previous classifier?

In [14]:
# Predict on the PCA-reduced test set
pred_pca = rf_pca.predict(X_test_pca)

print("Classification Report with PCA")
print(classification_report(y_test, pred_pca))

print("Confusion Report with PCA")
print(confusion_matrix(y_test, pred_pca))

Classification Report with PCA
              precision    recall  f1-score   support

           0       0.95      0.98      0.97       980
           1       0.98      0.99      0.98      1135
           2       0.94      0.94      0.94      1032
           3       0.93      0.94      0.94      1010
           4       0.94      0.96      0.95       982
           5       0.95      0.93      0.94       892
           6       0.97      0.97      0.97       958
           7       0.96      0.95      0.96      1028
           8       0.93      0.91      0.92       974
           9       0.95      0.92      0.93      1009

    accuracy                           0.95     10000
   macro avg       0.95      0.95      0.95     10000
weighted avg       0.95      0.95      0.95     10000

Confusion Report with PCA
[[ 965    0    2    1    0    4    5    1    2    0]
 [   0 1118    5    3    0    0    3    0    6    0]
 [  12    0  967   10    6    2    2    8   24    1]
 [   1    0   12  954    

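Observation:
- Training time without PCA: about 41 seconds
- Training time with PCA: about 80 seconds
- Accuracy without PCA: 0.97
- Accuracy with PCA: 0.95
- Training was not faster: on the reduced dataset it took roughly twice as long, and accuracy dropped by about two points. The Random Forest without PCA performed better on both counts; the benefit of PCA here is that the feature count fell from 784 to 154, which saves memory.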
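
The size reduction can be quantified from the shapes printed earlier (784 features before PCA, 154 after). The following sketch does the arithmetic; the byte counts assume 8-byte floating point values per entry, which is an assumption about storage rather than something shown in the notebook.

```python
n_samples = 70_000
n_original, n_reduced = 784, 154

# Share of the original feature count that remains after PCA
ratio = n_reduced / n_original
print(f"Features kept: {ratio:.1%}")

# Approximate dense storage, assuming float64 (8 bytes per value)
bytes_original = n_samples * n_original * 8
bytes_reduced = n_samples * n_reduced * 8
print(f"Original: {bytes_original / 1e6:.1f} MB, reduced: {bytes_reduced / 1e6:.1f} MB")
```

Under that assumption, the reduced dataset needs roughly one fifth of the storage of the original.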