# Homework #4: Unsupervised Learning - Part 2

Divide the MNIST dataset into a training and a testing set using 60,000 images for training and
the remaining for testing.

• Train a Random Forest classifier using the training data and evaluate how long it takes to
train.

• Test the classifier with the testing data and generate a confusion matrix and compute the
overall accuracy.

• Use PCA to reduce the dataset’s dimensionality, using explained variance ratios of 95%, 90%,
and 85%, respectively.

• Train Random Forest classifiers using the dimensionally-reduced data and evaluate how long
it takes to train. Discuss how the explained variance ratio influences training time, along
with a comparison to the initial training time.

• Evaluate the classifiers using the testing set, generating the confusion matrices and overall
accuracy, for each case. Discuss the performance differences.

In [1]:
import time

import numpy as np
from keras.datasets import mnist
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

In [2]:
(X_train, Y_train),(X_test, Y_test) = mnist.load_data()

X_train = X_train.reshape(-1,X_train.shape[1]*X_train.shape[2])
X_test = X_test.reshape(-1,X_test.shape[1]*X_test.shape[2])

print(X_train.shape)
print(Y_train.shape)
print(X_test.shape)
print(Y_test.shape)


Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
(60000, 784)
(60000,)
(10000, 784)
(10000,)


In [3]:
classifier = RandomForestClassifier()

In [4]:
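def calculate_time_for_training(cf, X_train, Y_train):
  """Fit `cf` on the training data, print the elapsed time, and return the fitted classifier."""
  print()
  print('Training Started')
  start_time = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
  cf = cf.fit(X_train, Y_train)
  end_time = time.perf_counter()
  print('Time taken for training : ', end_time - start_time)
  print('Training completed')

  return cf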

In [5]:
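def calculate_accuracy(cf, Y_test, y_pred):
  # `cf` is unused; it is kept so the existing calls keep working.
  print()
  # accuracy_score already returns a single float, so no averaging is needed.
  print('Accuracy {0}'.format(accuracy_score(Y_test, y_pred)))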

In [6]:
def generate_confusion_matrix(Y_test, y_pred):
  print()
  print("Confusion Matrix")
  cm = confusion_matrix(Y_test, y_pred)
  print(cm)

In [7]:
cf = calculate_time_for_training(classifier, X_train, Y_train) # Calculating time taken

y_pred = cf.predict(X_test)
calculate_accuracy(cf, Y_test, y_pred) # Calculating Accuracy
generate_confusion_matrix(Y_test, y_pred) # Generating confusion matrix


Training Started
Time taken for training :  44.99418258666992
Training completed

Accuracy 0.9689

Confusion Matrix
[[ 970    0    0    0    0    1    3    1    4    1]
 [   0 1122    2    3    1    2    3    1    1    0]
 [   5    0 1002    6    2    0    3    8    6    0]
 [   0    0   12  972    0    6    0   10    8    2]
 [   1    0    1    0  954    0    5    0    3   18]
 [   5    1    0   14    3  854    5    1    6    3]
 [   7    3    0    0    4    4  937    0    3    0]
 [   0    3   22    1    0    0    0  987    3   12]
 [   5    0    5    7    3    7    3    3  932    9]
 [   5    6    3   11   12    4    1    4    4  959]]
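
The overall accuracy can also be read directly off the confusion matrix: the diagonal holds the correctly classified images, so accuracy is the trace divided by the total count. The sketch below uses the baseline matrix printed above; the helper name `summarize_confusion` is illustrative and not part of the notebook.

```python
import numpy as np

# Confusion matrix copied from the baseline Random Forest output
# (rows = true digit, columns = predicted digit).
cm = np.array([
    [970,    0,    0,   0,   0,   1,   3,   1,   4,   1],
    [  0, 1122,    2,   3,   1,   2,   3,   1,   1,   0],
    [  5,    0, 1002,   6,   2,   0,   3,   8,   6,   0],
    [  0,    0,   12, 972,   0,   6,   0,  10,   8,   2],
    [  1,    0,    1,   0, 954,   0,   5,   0,   3,  18],
    [  5,    1,    0,  14,   3, 854,   5,   1,   6,   3],
    [  7,    3,    0,   0,   4,   4, 937,   0,   3,   0],
    [  0,    3,   22,   1,   0,   0,   0, 987,   3,  12],
    [  5,    0,    5,   7,   3,   7,   3,   3, 932,   9],
    [  5,    6,    3,  11,  12,   4,   1,   4,   4, 959],
])

def summarize_confusion(cm):
    """Return overall accuracy and per-class recall from a confusion matrix."""
    cm = np.asarray(cm, dtype=float)
    accuracy = np.trace(cm) / cm.sum()      # correct predictions / all predictions
    recall = np.diag(cm) / cm.sum(axis=1)   # correct per true class / class size
    return accuracy, recall

accuracy, recall = summarize_confusion(cm)
print("Accuracy:", round(accuracy, 4))
print("Digit with the lowest recall:", int(np.argmin(recall)))
```

Per-class recall shows which digits are hardest for the classifier, which the single accuracy figure hides.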


**Using PCA to reduce dimensionality with explained variance ratios of 95%, 90% and 85%**
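
When `PCA` is given a float between 0 and 1, scikit-learn keeps the smallest number of components whose cumulative explained variance ratio exceeds that value. The sketch below shows this selection by hand and compares it with what `PCA` reports. It is an illustration under stated assumptions: it uses scikit-learn's bundled 8x8 digits dataset as a small stand-in for MNIST, so the component counts will not match the ones in this notebook.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Assumption: small bundled digits dataset (1797 x 64) instead of MNIST.
X, _ = load_digits(return_X_y=True)

# Fit PCA with all components and accumulate the explained variance ratios.
full = PCA(svd_solver="full").fit(X)
cumulative = np.cumsum(full.explained_variance_ratio_)

components = {}
for target in (0.95, 0.90, 0.85):
    # Smallest k whose cumulative ratio exceeds the target.
    k_manual = int(np.searchsorted(cumulative, target, side="right") + 1)
    # Number of components scikit-learn selects for the same target.
    k_sklearn = int(PCA(n_components=target, svd_solver="full").fit(X).n_components_)
    components[target] = (k_manual, k_sklearn)
    print(f"target={target:.2f}  manual={k_manual}  sklearn={k_sklearn}")
```

A lower target always needs the same number of components or fewer, which is why the feature count shrinks as the ratio goes from 95% to 85%.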

In [8]:
print("Processing with data dimensionality reduced with variance of 95%")
print()
pca = PCA(.95).fit(X_train)
X_train_95 = pca.transform(X_train)
X_test_95 = pca.transform(X_test)

print(X_train_95.shape)
print(X_test_95.shape)

cf = calculate_time_for_training(classifier, X_train_95, Y_train)

y_pred = cf.predict(X_test_95)
calculate_accuracy(cf, Y_test, y_pred)
generate_confusion_matrix(Y_test, y_pred)

Processing with data dimensionality reduced with variance of 95%

(60000, 154)
(10000, 154)

Training Started
Time taken for training :  118.5601453781128
Training completed

Accuracy 0.9486

Confusion Matrix
[[ 964    0    2    0    1    3    6    1    3    0]
 [   0 1118    5    3    0    0    4    0    4    1]
 [   8    0  958   15    8    1    6   11   23    2]
 [   2    1    8  950    1   15    1    9   19    4]
 [   1    3    4    0  939    0    9    1    2   23]
 [   5    1    2   29    4  832    7    2    6    4]
 [   7    3    2    0    4    7  932    0    2    1]
 [   1    6   17    3    7    1    0  973    1   19]
 [   7    0    9   24   11   17    5    8  885    8]
 [   7    5    2   13   26    3    0   12    6  935]]


In [9]:
print("Processing with data dimensionality reduced with variance of 90%")
print()
pca = PCA(.90).fit(X_train)
X_train_90 = pca.transform(X_train)
X_test_90 = pca.transform(X_test)

print(X_train_90.shape)
print(X_test_90.shape)

cf = calculate_time_for_training(classifier, X_train_90, Y_train)

y_pred = cf.predict(X_test_90)
calculate_accuracy(cf, Y_test, y_pred)
generate_confusion_matrix(Y_test, y_pred)

Processing with data dimensionality reduced with variance of 90%

(60000, 87)
(10000, 87)

Training Started
Time taken for training :  89.26274871826172
Training completed

Accuracy 0.9495

Confusion Matrix
[[ 965    0    2    0    1    3    6    1    2    0]
 [   0 1122    2    4    0    0    4    0    3    0]
 [   9    0  968   11    9    1    3    8   23    0]
 [   1    0    9  957    1   15    3    6   12    6]
 [   1    2    6    0  930    0    7    4    5   27]
 [   4    0    2   23    4  833   12    2    5    7]
 [   7    3    0    0    3    5  939    0    1    0]
 [   1    8   21    1    6    0    0  965    3   23]
 [   8    0    9   18   11   23    4    8  885    8]
 [   4    6    2   13   25    4    1   13   10  931]]


In [10]:
print("Processing with data dimensionality reduced with variance of 85%")
print()
pca = PCA(.85).fit(X_train)
X_train_85 = pca.transform(X_train)
X_test_85 = pca.transform(X_test)

print(X_train_85.shape)
print(X_test_85.shape)

cf = calculate_time_for_training(classifier, X_train_85, Y_train)

y_pred = cf.predict(X_test_85)
calculate_accuracy(cf, Y_test, y_pred)
generate_confusion_matrix(Y_test, y_pred)

Processing with data dimensionality reduced with variance of 85%

(60000, 59)
(10000, 59)

Training Started
Time taken for training :  68.8673906326294
Training completed

Accuracy 0.9536

Confusion Matrix
[[ 965    0    2    0    0    5    5    2    1    0]
 [   0 1119    4    4    0    1    3    0    3    1]
 [   9    0  973   13    5    2    2    8   19    1]
 [   1    0   11  946    1   15    3    7   21    5]
 [   1    1    4    0  942    2    7    1    3   21]
 [   3    1    4   24    6  838    6    1    6    3]
 [   6    2    1    0    3    3  941    0    1    1]
 [   1    4   18    1    4    0    1  975    2   22]
 [   8    0    8   15    8   17    4    7  900    7]
 [   6    6    2   11   23    8    0    9    7  937]]


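| Explained variance | Training time (s) | Accuracy | Features |
|--------------------|-------------------|----------|----------|
| Original (no PCA)  | 44.99             | 96.89%   | 784      |
| 95%                | 118.56            | 94.86%   | 154      |
| 90%                | 89.26             | 94.95%   | 87       |
| 85%                | 68.87             | 95.36%   | 59       |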
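
The rows of a summary like this can also be collected in a loop instead of by hand. The sketch below is one way to do that, not the notebook's actual procedure: it creates a fresh classifier with a fixed `random_state` for every run so the runs are comparable, and it again uses scikit-learn's small bundled digits dataset as a stand-in for MNIST. The function name `run_experiment` and the choice of 50 trees are assumptions made for the example.

```python
import time

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Assumption: bundled 8x8 digits instead of MNIST, to keep the example fast.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

def run_experiment(variance=None):
    """Fit a fresh forest, optionally after PCA, and return one table row."""
    if variance is None:
        Xtr, Xte = X_train, X_test
    else:
        # Fit PCA on the training data only, then project both sets.
        pca = PCA(n_components=variance, svd_solver="full").fit(X_train)
        Xtr, Xte = pca.transform(X_train), pca.transform(X_test)
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    start = time.perf_counter()
    clf.fit(Xtr, y_train)
    elapsed = time.perf_counter() - start
    return {
        "variance": "original" if variance is None else variance,
        "features": Xtr.shape[1],
        "time_s": elapsed,
        "accuracy": accuracy_score(y_test, clf.predict(Xte)),
    }

rows = [run_experiment(v) for v in (None, 0.95, 0.90, 0.85)]
for row in rows:
    print(f"{row['variance']!s:>8}  features={row['features']:>3}  "
          f"time={row['time_s']:.2f}s  accuracy={row['accuracy']:.4f}")
```

Timings from a single run are noisy, so repeating each configuration a few times and averaging would give a fairer comparison.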

- As the results above show, the original dataset with 784 features takes about 45 seconds to train and reaches an accuracy of 96.89%. It is both the most accurate and the fastest of the four configurations.

- As the explained variance ratio decreases, the number of retained features (dimensions) also decreases: 154 at 95%, 87 at 90%, and 59 at 85%. This change is reflected in the training time and accuracy at each ratio.

- Since the number of features drops from 784 to 154, 87, and 59, we would expect training time to decrease and accuracy to hold up or even improve. The results contradict this expectation.

- With reduced dimensions, training takes longer than on the original dataset in all three cases (118.56 s, 89.26 s, and 68.87 s versus 44.99 s), and accuracy drops by roughly 1.5 to 2 percentage points.

- The slowdown is largest at 0.95 explained variance. Reducing the ratio further to 0.90 and 0.85 shortens training time and slightly raises accuracy (94.86%, 94.95%, and 95.36%). Neither metric catches up with the original dataset, although accuracy comes close.

- A plausible reason for this unexpected behaviour is the PCA projection itself. PCA keeps all 60,000 training samples but discards the low-variance directions, so the classifier sees only part of the information in the original dataset.

- Some of the discarded information might have helped the trees find good splits. Without it, each tree may need more splits to separate the classes, which would increase training time.

- At every node the classifier searches a random subset of the available features for the best split. If the information that would have given a clean split was removed or spread across several components, the best available split is weaker, so the trees likely grow larger and make more mistakes. This would explain why training time increases and accuracy goes down. A further likely factor is that principal components are dense and continuous-valued, whereas many raw MNIST pixels are constant or take only a few distinct values, which makes candidate splits cheaper to evaluate on the raw pixels.

- Overall, the Random Forest classifiers trained on PCA-reduced data performed worse than the one trained on the original data, for the reasons above. Although each decision tree was expected to build faster with fewer features, the split search apparently has to do more work when the most informative features are no longer directly available. Among the PCA runs, the 0.90 and 0.85 settings train faster than 0.95 most likely because they have fewer features to evaluate (87 and 59 versus 154), while their accuracies stay within about half a percentage point of each other.

- Hence, PCA with a well-chosen explained variance ratio (and therefore number of dimensions) might help train the model in less time, although none of the ratios tried here beat the original dataset.
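
The explanation above can be probed by inspecting the fitted forests. The sketch below is a diagnostic under stated assumptions, not a result from this notebook: it compares how many distinct values a typical feature takes, and how large the trees grow, on raw pixels versus PCA-reduced features. It uses scikit-learn's bundled 8x8 digits dataset as a small stand-in for MNIST, and the helper name `describe` is illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

# Assumption: bundled 8x8 digits instead of MNIST, so numbers will differ.
X, y = load_digits(return_X_y=True)
X_pca = PCA(n_components=0.95, svd_solver="full").fit_transform(X)

def describe(name, features):
    """Fit a small forest and report feature granularity and tree size."""
    clf = RandomForestClassifier(n_estimators=20, random_state=0).fit(features, y)
    # Median number of distinct values per feature column.
    distinct = int(np.median([np.unique(col).size for col in features.T]))
    # Total node count and maximum depth across all trees in the forest.
    nodes = sum(est.tree_.node_count for est in clf.estimators_)
    depth = max(est.tree_.max_depth for est in clf.estimators_)
    print(f"{name:>10}: features={features.shape[1]:>3}  "
          f"median distinct values={distinct:>5}  nodes={nodes}  max depth={depth}")
    return {"distinct": distinct, "nodes": nodes, "depth": depth}

raw_stats = describe("raw pixels", X)
pca_stats = describe("PCA 95%", X_pca)
```

If the PCA forest turns out to have noticeably more nodes or deeper trees, that would support the idea that weaker splits, rather than the feature count alone, drive the longer training time.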