Task:

Load the MNIST dataset (introduced in Chapter 3) and split it into a training set and a test set (take the first 60,000 instances for training, and the remaining 10,000 for testing). Train a random forest classifier on the dataset and time how long it takes, then evaluate the resulting model on the test set. Next, use PCA to reduce the dataset’s dimensionality, with an explained variance ratio of 95%. Train a new random forest classifier on the reduced dataset and see how long it takes. Was training much faster? Next, evaluate the classifier on the test set. How does it compare to the previous classifier? Try again with an SGDClassifier. How much does PCA help now?

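The following code loads the MNIST dataset, converts the targets to integers, and splits the data into a training set of 60,000 instances and a test set of 10,000 instances. Note that train_test_split() shuffles the data, so this is a random 60,000/10,000 split rather than the first 60,000 and remaining 10,000 instances that the exercise asks for.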

In [1]:
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

mnist = fetch_openml('mnist_784', version=1)
X = mnist['data']
y = mnist['target'].astype(int)

# Hold out 10,000 instances for testing; the other 60,000 form the training set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=10000, random_state=42)
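The exercise asks for the first 60,000 instances as the training set, whereas train_test_split() shuffles before splitting. A minimal sketch of the positional split is shown here on tiny stand-in arrays. The helper name split_first_n is hypothetical, and the sketch assumes X and y are NumPy arrays (for example, loaded with fetch_openml(..., as_frame=False)). The timings and accuracies reported in this notebook come from the shuffled split, so a positional split may give slightly different numbers.

```python
import numpy as np

def split_first_n(X, y, n_train):
    """Positional split: first n_train rows for training, the rest for testing."""
    return X[:n_train], X[n_train:], y[:n_train], y[n_train:]

# Tiny stand-in arrays (7 "images" of 4 pixels each). With MNIST you would
# pass the full X and y and use n_train=60000.
X_demo = np.arange(28).reshape(7, 4)
y_demo = np.arange(7)

X_tr, X_te, y_tr, y_te = split_first_n(X_demo, y_demo, n_train=5)
print(X_tr.shape, X_te.shape)
```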


Now, let's train a random forest classifier on the dataset and time how long it takes:

In [8]:
from sklearn.ensemble import RandomForestClassifier
import time

rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)

start_time = time.time()
rf_clf.fit(X_train, y_train)
end_time = time.time()

print("Training time:", end_time - start_time)

Training time: 38.919782638549805
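Since the same start/fit/stop pattern is repeated for every model below, it can be wrapped in a small helper. This is only a sketch: timed_fit and SleepyEstimator are hypothetical names, and it uses time.perf_counter(), a monotonic clock intended for measuring intervals, instead of time.time(), which can jump if the system clock is adjusted.

```python
import time

def timed_fit(estimator, X, y):
    """Fit the estimator on (X, y) and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    estimator.fit(X, y)
    return time.perf_counter() - start

class SleepyEstimator:
    """Hypothetical stand-in with a fit() method, used only to exercise the timer."""
    def fit(self, X, y):
        time.sleep(0.05)  # pretend to train for about 50 ms
        return self

elapsed = timed_fit(SleepyEstimator(), X=None, y=None)
print(f"Training time: {elapsed:.2f} s")
```

In the notebook, the call would look like timed_fit(rf_clf, X_train, y_train).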


Next, let's evaluate the resulting model on the test set:

In [3]:
from sklearn.metrics import accuracy_score

y_pred = rf_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Accuracy:", accuracy)

Accuracy: 0.9674


Now, let's use PCA to reduce the dataset's dimensionality, keeping just enough principal components to preserve 95% of the variance:

In [4]:
from sklearn.decomposition import PCA

pca = PCA(n_components=0.95, random_state=42)
X_train_reduced = pca.fit_transform(X_train)
X_test_reduced = pca.transform(X_test)
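It can be useful to check how many dimensions PCA actually kept. The sketch below does this on scikit-learn's small bundled digits dataset (load_digits, 8x8 images), used here as a quick stand-in for MNIST so the example runs without a download. The variable names are illustrative, and the numbers it prints do not apply to MNIST.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Stand-in data: 1,797 images with 64 features each (not MNIST).
X_digits, _ = load_digits(return_X_y=True)

# A float n_components keeps the smallest number of components whose
# cumulative explained variance ratio exceeds that fraction.
pca_demo = PCA(n_components=0.95, random_state=42)
X_digits_reduced = pca_demo.fit_transform(X_digits)

kept = pca_demo.n_components_
explained = pca_demo.explained_variance_ratio_.sum()
print(f"Kept {kept} of {X_digits.shape[1]} dimensions, "
      f"explaining {explained:.1%} of the variance")
```

For the notebook's fitted pca object, the same two attributes (pca.n_components_ and pca.explained_variance_ratio_) report the number of dimensions kept for MNIST.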

Let's train a new random forest classifier on the reduced dataset and time how long it takes:

In [9]:
rf_clf_reduced = RandomForestClassifier(n_estimators=100, random_state=42)

start_time = time.time()
rf_clf_reduced.fit(X_train_reduced, y_train)
end_time = time.time()

print("Training time (with PCA):", end_time - start_time)

Training time (with PCA): 93.09499001502991


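Training was not faster: it took about 93 seconds with PCA, versus about 39 seconds without it, so more than twice as long. Let's evaluate the new classifier on the test set: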

In [6]:
y_pred_reduced = rf_clf_reduced.predict(X_test_reduced)
accuracy_reduced = accuracy_score(y_test, y_pred_reduced)

print("Accuracy (with PCA):", accuracy_reduced)

Accuracy (with PCA): 0.9469


Accuracy dropped as well, from 96.74% to 94.69%, so PCA did not help the random forest here. Finally, let's train an SGDClassifier on the reduced dataset:

In [7]:
from sklearn.linear_model import SGDClassifier

sgd_clf_reduced = SGDClassifier(random_state=42)

start_time = time.time()
sgd_clf_reduced.fit(X_train_reduced, y_train)
end_time = time.time()

print("Training time with SGDClassifier (with PCA):", end_time - start_time)

y_pred_sgd = sgd_clf_reduced.predict(X_test_reduced)
accuracy_sgd = accuracy_score(y_test, y_pred_sgd)

print("Accuracy with SGDClassifier (with PCA):", accuracy_sgd)

Training time with SGDClassifier (with PCA): 25.190171241760254
Accuracy with SGDClassifier (with PCA): 0.8907
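The exercise asks how much PCA helps the SGDClassifier, which requires a baseline trained on the full 784 features; the run above only covers the reduced dataset. A minimal sketch of such a side-by-side comparison follows. The helper fit_and_score is a hypothetical name, and the small load_digits dataset is used as a stand-in so the example runs quickly, so its printed numbers say nothing about MNIST.

```python
import time

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def fit_and_score(clf, X_tr, y_tr, X_te, y_te):
    """Train clf, then return (training seconds, test accuracy)."""
    start = time.perf_counter()
    clf.fit(X_tr, y_tr)
    seconds = time.perf_counter() - start
    return seconds, accuracy_score(y_te, clf.predict(X_te))

# Stand-in data (not MNIST): 8x8 digit images with 64 features.
X_d, y_d = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_d, y_d, test_size=0.25, random_state=42)

# Fit PCA on the training set only, then apply it to both sets.
pca_d = PCA(n_components=0.95, random_state=42)
X_tr_red = pca_d.fit_transform(X_tr)
X_te_red = pca_d.transform(X_te)

results = {
    "full": fit_and_score(SGDClassifier(random_state=42), X_tr, y_tr, X_te, y_te),
    "pca": fit_and_score(SGDClassifier(random_state=42), X_tr_red, y_tr, X_te_red, y_te),
}
for name, (seconds, acc) in results.items():
    print(f"{name}: {seconds:.2f} s, accuracy {acc:.4f}")
```

To answer the question for MNIST, the same helper could be called once with X_train and X_test and once with X_train_reduced and X_test_reduced from the cells above.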
