# Exercise 9
Load the MNIST dataset (introduced in Chapter 3) and split it into a training set and a test set (take the first 60,000 instances for training, and the remaining 10,000 for testing). Train a Random Forest classifier on the dataset and time how long it takes, then evaluate the resulting model on the test set. Next, use PCA to reduce the dataset’s dimensionality, with an explained variance ratio of 95%. Train a new Random Forest classifier on the reduced dataset and see how long it takes. Was training much faster? Next, evaluate the classifier on the test set. How does it compare to the previous classifier?

In [1]:
# load libraries
from sklearn.datasets import fetch_openml
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
import warnings
# ignore warnings because they're annoying
warnings.filterwarnings("ignore")
# load MNIST dataset
mnist = fetch_openml('mnist_784', version=1)
# set up data
X = np.array(mnist.data)
y = np.array(mnist.target).astype(int)

# split the data
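# random shuffled split: holds out 10,000 of the 70,000 samples for testing
# (note: this is not the "first 60,000 / last 10,000" split the exercise asks for)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/7)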
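
The exercise statement asks for the first 60,000 instances for training and the remaining 10,000 for testing, which is a plain slice rather than a random split. A minimal sketch of that slicing follows; `split_first_n` and the `*_demo` arrays are hypothetical names used only for illustration, and the tiny stand-in data is there so the snippet runs without downloading MNIST.

```python
import numpy as np

def split_first_n(X, y, n_train):
    """Use the first n_train rows for training and the rest for testing."""
    return X[:n_train], X[n_train:], y[:n_train], y[n_train:]

# Tiny stand-in arrays (7 rows) instead of the real 70,000-row MNIST arrays.
X_demo = np.arange(14).reshape(7, 2)
y_demo = np.arange(7)

X_tr, X_te, y_tr, y_te = split_first_n(X_demo, y_demo, n_train=6)
print(X_tr.shape, X_te.shape)
```

With the MNIST arrays loaded above, `split_first_n(X, y, 60000)` would give the split the exercise describes. The accuracy and timing figures reported in this notebook come from the random split, so they would likely differ slightly.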

In [2]:
rdm_clf = RandomForestClassifier(random_state=42)
# time the training time of our classifier
import time
start_time = time.time()
rdm_clf.fit(X_train, y_train)
print("--- %s seconds ---" % (time.time() - start_time))

--- 63.44660663604736 seconds ---


In [3]:
# test the accuracy of our classifier
y_pred = rdm_clf.predict(X_test)
print(accuracy_score(y_test, y_pred))

0.9695


In [4]:
# reduce dimensionality with PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=0.95) # capture 95% of the variance in the dataset
X_reduced = pca.fit_transform(X_train)
# create new random forest classifier
rdm_clf2 = RandomForestClassifier(random_state = 42)
start_time = time.time()
rdm_clf2.fit(X_reduced, y_train)
print("--- %s seconds ---" % (time.time() - start_time))

--- 124.11359477043152 seconds ---


In [5]:
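# the new model expects the PCA-reduced features (154 here), so project the
# test set with the PCA that was already fitted on the training set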
X_test_reduced = pca.transform(X_test)
# test accuracy of our second classifier
y_pred = rdm_clf2.predict(X_test_reduced)
print(accuracy_score(y_test, y_pred))

0.9493


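Our results show that the Random Forest performs worse after PCA and also takes longer to train: accuracy drops from 0.9695 to 0.9493, and training time roughly doubles, from about 63 seconds to about 124 seconds. Dimensionality reduction does not guarantee faster training; whether it helps depends on the dataset, the model, and the training algorithm.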
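
When interpreting a result like this, it can help to check how many components PCA actually kept and how much variance they explain. The sketch below does that on scikit-learn's small built-in `load_digits` dataset, used here as a stand-in so the example runs quickly; the `*_small` names are illustrative and not part of the notebook above.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Small 8x8 digits dataset (1,797 samples, 64 features), a stand-in for MNIST.
X_small, _ = load_digits(return_X_y=True)

# A float n_components keeps the fewest components whose cumulative
# explained variance ratio reaches that fraction.
pca_small = PCA(n_components=0.95)
X_small_reduced = pca_small.fit_transform(X_small)

kept = pca_small.n_components_
explained = pca_small.explained_variance_ratio_.sum()
print(f"kept {kept} of {X_small.shape[1]} features, "
      f"explained variance = {explained:.3f}")
```

The same two attributes, `pca.n_components_` and `pca.explained_variance_ratio_`, are available on the `pca` object fitted on `X_train` above.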

# Exercise 10
Use t-SNE to reduce the MNIST dataset down to two dimensions and plot the result using Matplotlib. You can use a scatterplot using 10 different colors to represent each image’s target class. Alternatively, you can replace each dot in the scatterplot with the corresponding instance’s class (a digit from 0 to 9), or even plot scaled-down versions of the digit images themselves (if you plot all digits, the visualization will be too cluttered, so you should either draw a random sample or plot an instance only if no other instance has already been plotted at a close distance). You should get a nice visualization with well-separated clusters of digits. Try using other dimensionality reduction algorithms such as PCA, LLE, or MDS and compare the resulting visualizations.

In [None]:
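from sklearn.manifold import TSNE

# reduce MNIST to two dimensions; running t-SNE on all 70,000 images is slow
tsne = TSNE(n_components=2, random_state=42)
manifold = tsne.fit_transform(X)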
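
The exercise also asks for a scatterplot with 10 colors, one per target class. A minimal plotting sketch under stated assumptions: `plot_embedding` is a hypothetical helper, not part of the notebook, and it is exercised here on random stand-in points so it runs without waiting for t-SNE.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_embedding(X_2d, labels, ax=None):
    """Scatter-plot a 2D embedding with one color per digit class (0-9)."""
    if ax is None:
        _, ax = plt.subplots(figsize=(8, 8))
    scatter = ax.scatter(X_2d[:, 0], X_2d[:, 1], c=labels,
                         cmap="tab10", vmin=0, vmax=9, s=5)
    # Colorbar doubles as the legend mapping colors to digits.
    ax.figure.colorbar(scatter, ax=ax, ticks=list(range(10)), label="digit")
    ax.set_xticks([])
    ax.set_yticks([])
    return ax

# Random stand-in data: 200 points in 2D with labels 0-9.
rng = np.random.default_rng(42)
demo_points = rng.normal(size=(200, 2))
demo_labels = rng.integers(0, 10, size=200)
ax = plot_embedding(demo_points, demo_labels)
```

For the real data, one option is to embed only a random sample, as the exercise suggests, for example `idx = np.random.default_rng(42).choice(len(X), size=5000, replace=False)`, then call `plot_embedding(tsne.fit_transform(X[idx]), y[idx])` followed by `plt.show()`. The sample size of 5,000 is an arbitrary choice to keep the run time manageable.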