Load the MNIST dataset using `fetch_openml('mnist_784')` and split it into:

- training set (60000)
- test set (10000)

Train a Random Forest classifier and time how long it takes (use time, as shown in the slides).
Also check the score. Apply PCA with an explained variance of 0.95. Train a Random Forest
classifier on the reduced dataset and time how long it takes and check the score. What do you
notice? You can try to repeat this for other explained variance numbers as well.

In [1]:
from sklearn import clone
from sklearn.datasets import fetch_openml
from sklearn.utils import Bunch
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA

import time

# Load
Load the MNIST dataset using `fetch_openml('mnist_784')` and split it into:

- training set (60000)
- test set (10000)

In [2]:
b: Bunch = fetch_openml('mnist_784', parser="auto")

X = b.get("data")
y = b.get('target')

## Split

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=60000)
X_train.shape, X_test.shape

((60000, 784), (10000, 784))
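The call above draws a different random split on every run, so the timings and scores that follow will vary slightly between runs. A minimal sketch of a repeatable, class-balanced split is shown here on small synthetic arrays standing in for MNIST; the `random_state=42` seed, the `stratify` argument, and the `X_demo`/`y_demo` names are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for MNIST: 700 samples, 784 features, 10 balanced classes.
rng = np.random.default_rng(0)
X_demo = rng.random((700, 784))
y_demo = np.repeat(np.arange(10), 70)

# A fixed seed makes the split repeatable; stratify keeps the class
# proportions the same in the training and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=600, random_state=42, stratify=y_demo
)
print(X_tr.shape, X_te.shape)
```

The same two keyword arguments could be added to the MNIST call with `train_size=60000`.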

# Train

Train a Random Forest classifier and time how long it takes (use time, as shown in the slides).

In [4]:
clf = RandomForestClassifier(n_estimators=50, random_state=42, verbose=1, n_jobs=1)
clf

In [5]:
t1 = time.time()
clf.fit(X_train, y_train)
t2 = time.time()

f"Random Forest Classifier without PCA fitting took {t2 - t1} seconds by using {clf.n_jobs} concurrent workers"

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  50 out of  50 | elapsed:   21.8s finished


'Random Forest Classifier without PCA fitting took 22.049708604812622 seconds by using 1 concurrent workers'
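Because the same `t1`/`t2` pattern is repeated for every step, it can be wrapped in a small helper. The sketch below is one possible way to do that with the standard library only; the names `timer` and `timings` are hypothetical, and `time.sleep` stands in for the call to `clf.fit`. It uses `time.perf_counter`, a clock intended for measuring intervals, instead of `time.time`.

```python
import time
from contextlib import contextmanager


@contextmanager
def timer(label, results):
    """Store the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the enclosed block raises an exception.
        results[label] = time.perf_counter() - start


timings = {}
with timer("sleep", timings):
    time.sleep(0.05)  # stand-in for clf.fit(X_train, y_train)

print(f"sleep took {timings['sleep']:.2f} seconds")
```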

## Score

In [6]:
clf.score(X_test, y_test)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  50 out of  50 | elapsed:    0.1s finished


0.9652

# Applying PCA

Apply PCA with an explained variance of $0.95$.

In [7]:
pca = PCA(n_components=0.95)

In [8]:
t1 = time.time()
pca.fit(X_train)  # PCA is unsupervised, so the labels are not needed
t2 = time.time()

f"fitting the PCA took {t2 - t1} seconds"

'fitting the PCA took 5.921201467514038 seconds'
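When `n_components` is a float between 0 and 1, PCA keeps the smallest number of components whose cumulative explained variance ratio reaches that fraction. The sketch below illustrates this on scikit-learn's small bundled digits dataset (an assumption made so that it runs quickly and offline; it is not MNIST), by computing the count by hand and comparing it with the fitted `n_components_` attribute.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X_digits, _ = load_digits(return_X_y=True)  # 1797 samples, 64 features

# Full PCA: keep every component so the variance ratios can be accumulated.
full = PCA(svd_solver="full").fit(X_digits)
cumulative = np.cumsum(full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 0.95.
k = int(np.argmax(cumulative >= 0.95)) + 1

# Passing the fraction directly lets PCA make the same choice internally.
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(X_digits)
print(k, pca_95.n_components_)
```

On the notebook's own data, `pca.n_components_` reports how many of the 784 pixel features were kept.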

## Train with PCA
Train a Random Forest classifier on the reduced dataset and time how long it takes and check the score.

In [9]:
clf_with_PCA = clone(clf)
clf_with_PCA

In [10]:
t1 = time.time()
X_train_PCA = pca.transform(X_train)
t2 = time.time()

f"PCA transform took {t2 - t1} seconds"

'PCA transform took 0.49166321754455566 seconds'

In [11]:
t1 = time.time()
clf_with_PCA.fit(X_train_PCA, y_train)
t2 = time.time()

f"Random Forest Classifier with PCA fitting took {t2 - t1} seconds by using {clf_with_PCA.n_jobs} concurrent workers"

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  50 out of  50 | elapsed:  1.3min finished


'Random Forest Classifier with PCA fitting took 75.3734838962555 seconds by using 1 concurrent workers'

## Score

In [12]:
clf_with_PCA.score(pca.transform(X_test), y_test)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  50 out of  50 | elapsed:    0.1s finished


0.9406
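The test set has to pass through the same fitted PCA before scoring, which is easy to forget. One way to make that automatic is to chain both steps in a scikit-learn pipeline. The sketch below is an illustration under assumptions: it uses the small bundled digits dataset instead of MNIST, and the names `pipe`, `Xd_train`, and so on are hypothetical.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X_digits, y_digits = load_digits(return_X_y=True)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_digits, y_digits, test_size=0.2, random_state=42
)

# fit() fits the PCA and trains the forest on the reduced data;
# score() applies the fitted PCA to the test data before predicting.
pipe = make_pipeline(
    PCA(n_components=0.95),
    RandomForestClassifier(n_estimators=50, random_state=42),
)
pipe.fit(Xd_train, yd_train)
pipeline_score = pipe.score(Xd_test, yd_test)
print(f"pipeline accuracy: {pipeline_score:.4f}")
```

Note that timing `pipe.fit` measures the PCA fit, the transform, and the forest training together, so it is not directly comparable to the separate timings above.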

What do you notice? You can try to repeat this for other explained variance numbers as well.

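*With PCA at 0.95 explained variance, the Random Forest takes longer to fit (about 75 s versus about 22 s, plus about 6 s to fit and apply the PCA) and scores lower on the test set (0.9406 versus 0.9652). Reducing the number of features does not guarantee faster training: it depends on the dataset and the model, and here the trees appear to find useful splits less easily on the dense principal components than on the raw pixels.*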
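To repeat the experiment for other explained variance values, the PCA and training steps can be placed in a loop. The following is a minimal sketch, again on the small bundled digits dataset so that it runs in seconds (an assumption; substituting the MNIST `X_train`, `X_test`, `y_train`, `y_test` reproduces the exercise). The variance values in the tuple and the `results` list are chosen for illustration only.

```python
import time

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_digits, y_digits = load_digits(return_X_y=True)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_digits, y_digits, test_size=0.2, random_state=42
)

results = []
for variance in (0.80, 0.90, 0.95, 0.99):
    # Fit the PCA on the training data only, then reduce both sets.
    pca_v = PCA(n_components=variance).fit(Xd_train)
    Xr_train = pca_v.transform(Xd_train)
    Xr_test = pca_v.transform(Xd_test)

    forest = RandomForestClassifier(n_estimators=50, random_state=42)

    # Time only the forest training, as in the cells above.
    t1 = time.time()
    forest.fit(Xr_train, yd_train)
    t2 = time.time()

    score = forest.score(Xr_test, yd_test)
    results.append((variance, pca_v.n_components_, t2 - t1, score))
    print(f"{variance:.2f}: {pca_v.n_components_} components, "
          f"fit {t2 - t1:.2f} s, accuracy {score:.4f}")
```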