# Exercise - 9

Load the MNIST dataset (introduced in chapter 3) and split it into a training set and a test set (take the first 60,000 instances for training, and the remaining 10,000 for testing). Train a Random Forest classifier on the dataset and time how long it takes, then evaluate the resulting model on the test set. Next, use PCA to reduce the dataset's dimensionality, with an explained variance ratio of 95%. Train a new Random Forest classifier on the reduced dataset and see how long it takes. Was training much faster? Next evaluate the classifier on the test set: how does it compare to the previous classifier? Try again with an SGDClassifier. How much does PCA help now? 

In [4]:
from numpy import ndarray
from sklearn.datasets import fetch_openml
from sklearn.utils import Bunch
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDClassifier

## Loading Dataset

In [5]:
mnist: Bunch = fetch_openml('mnist_784', as_frame= False, parser= 'auto')
X_train, y_train = mnist['data'][:60000], mnist['target'][:60000]
X_test, y_test = mnist['data'][60000:], mnist['target'][60000:]

## Without PCA

In [6]:
rnd_forest = RandomForestClassifier(random_state= 42)
%time rnd_forest.fit(X_train, y_train)

CPU times: user 54.6 s, sys: 67 ms, total: 54.7 s
Wall time: 54.8 s


In [7]:
rnd_forest.score(X_test, y_test)

0.9705

## With PCA

In [8]:
pca = PCA(n_components= 0.95, random_state= 42)
X_reduced: ndarray = pca.fit_transform(X_train)

In [9]:
X_reduced.shape

(60000, 154)
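
The shape above shows that PCA kept 154 of the 784 dimensions. As a minimal sketch of how to check what a float `n_components` actually selected, the snippet below reads the fitted `n_components_` and `explained_variance_ratio_` attributes. It assumes scikit-learn's small bundled digits dataset (`load_digits`) as a stand-in for MNIST so that it runs quickly; the names `pca_demo`, `n_kept` and `variance_kept` are illustrative only.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Assumption: a small stand-in dataset (1,797 8x8 digit images), not MNIST.
X_digits, _ = load_digits(return_X_y=True)

# A float n_components asks PCA to keep enough components
# to preserve that fraction of the total variance.
pca_demo = PCA(n_components=0.95, random_state=42)
X_digits_reduced = pca_demo.fit_transform(X_digits)

# Number of components PCA decided to keep.
n_kept = pca_demo.n_components_
# Fraction of the total variance preserved by those components.
variance_kept = pca_demo.explained_variance_ratio_.sum()

print(n_kept, round(float(variance_kept), 4))
```

The same two attributes are available on the `pca` object fitted on MNIST above.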

In [10]:
rnd_forest_pca = RandomForestClassifier(random_state= 42)
%time rnd_forest_pca.fit(X_reduced, y_train)

CPU times: user 2min 8s, sys: 15.1 ms, total: 2min 8s
Wall time: 2min 8s


Training time more than doubled after dimensionality reduction (2 min 8 s versus about 55 s). Dimensionality reduction does not guarantee faster training: the effect depends on the dataset and on the algorithm used.
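
The `%time` magic only works in IPython or Jupyter. As a rough sketch of the same comparison in a plain script, the hypothetical helper `timed_fit` below wraps `fit` with `time.perf_counter`. It assumes the small bundled digits dataset and a reduced `n_estimators=20` to keep the run short, so the timings it prints say nothing about MNIST.

```python
import time

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier


def timed_fit(model, X, y):
    """Fit model on (X, y) and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    model.fit(X, y)
    return time.perf_counter() - start


# Assumption: small stand-in dataset, not MNIST.
X_small, y_small = load_digits(return_X_y=True)
X_small_reduced = PCA(n_components=0.95, random_state=42).fit_transform(X_small)

# Same model configuration on the raw and on the PCA-reduced features.
t_raw = timed_fit(RandomForestClassifier(n_estimators=20, random_state=42),
                  X_small, y_small)
t_pca = timed_fit(RandomForestClassifier(n_estimators=20, random_state=42),
                  X_small_reduced, y_small)

print(f"raw: {t_raw:.2f}s, PCA: {t_pca:.2f}s")
```

Which of the two runs is faster can vary with the dataset, the model and the machine, which is the point made above.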

In [11]:
X_reduced_test: ndarray = pca.transform(X_test)
rnd_forest_pca.score(X_reduced_test, y_test)

0.9481

It is common for performance to drop slightly when reducing dimensionality, because we do lose some potentially useful signal in the process. However, the drop is rather severe in this case: accuracy fell from 0.9705 to 0.9481, so the error rate rose from about 3% to about 5%. PCA really did not help here: it slowed down training and reduced performance.
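
One way to make the lost signal concrete is to project the data back to the original space with `inverse_transform` and measure the reconstruction error. The sketch below again assumes the small bundled digits dataset rather than MNIST, and the variable names are illustrative. The ratio of reconstruction error to total variance should equal the fraction of variance that PCA discarded.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Assumption: small stand-in dataset, not MNIST.
X_imgs, _ = load_digits(return_X_y=True)

pca_loss = PCA(n_components=0.95, random_state=42)
X_compressed = pca_loss.fit_transform(X_imgs)

# Map the compressed data back to the original 64-dimensional space.
X_recovered = pca_loss.inverse_transform(X_compressed)

# Mean squared error between the originals and their reconstructions.
reconstruction_mse = np.mean((X_imgs - X_recovered) ** 2)
# Mean squared deviation from the per-feature mean (total variance, same scale).
total_variance = np.mean((X_imgs - X_imgs.mean(axis=0)) ** 2)

# Fraction of the variance that did not survive the round trip.
lost_fraction = reconstruction_mse / total_variance
print(round(float(lost_fraction), 4))
```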

## Let's try SGDClassifier

In [16]:
sgd_clf = SGDClassifier(random_state= 42)
%time sgd_clf.fit(X_train, y_train)

CPU times: user 3min 10s, sys: 118 ms, total: 3min 10s
Wall time: 3min 10s


In [17]:
sgd_clf.score(X_test, y_test)

0.874

In [18]:
sgd_clf_pca = SGDClassifier(random_state= 42)
%time sgd_clf_pca.fit(X_reduced, y_train)

CPU times: user 43.2 s, sys: 185 ms, total: 43.4 s
Wall time: 43.2 s


In [19]:
sgd_clf_pca.score(X_reduced_test, y_test)

0.8959

With SGDClassifier, PCA not only cut training time (from 3 min 10 s to about 43 s) but also improved test accuracy (from 0.874 to 0.8959).
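
Since PCA helps this classifier, the two steps can be chained so that the test set is transformed automatically before scoring. This is a sketch using scikit-learn's `make_pipeline`, which the notebook above does not use; it assumes the small bundled digits dataset and a `train_test_split` hold-out instead of the MNIST split, and the names `pca_sgd` and `accuracy` are illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Assumption: small stand-in dataset with a random hold-out split, not MNIST.
X_all, y_all = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42
)

# PCA is fitted on the training data only; the classifier then trains
# on the reduced features.
pca_sgd = make_pipeline(
    PCA(n_components=0.95, random_state=42),
    SGDClassifier(random_state=42),
)
pca_sgd.fit(X_tr, y_tr)

# score() applies the fitted PCA transform to X_te before predicting.
accuracy = pca_sgd.score(X_te, y_te)
print(round(float(accuracy), 4))
```

Keeping both steps in one object avoids forgetting to call `pca.transform` on the test set, as had to be done by hand for `X_reduced_test` above.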