# **Dimensionality Reduction**
Week 3 — live class coding 💻

# Setup 📋

In [3]:
import warnings
warnings.filterwarnings('ignore')

# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

# Scikit-Learn ≥0.20 is required
import sklearn
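# Compare version numbers as integers: comparing the raw strings is lexicographic
assert tuple(int(part) for part in sklearn.__version__.split(".")[:2]) >= (0, 20)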

# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "dim_reduction"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)
os.makedirs(IMAGES_PATH, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

# Exercises ✏️

**Load the MNIST dataset and split it into a training set and a test set (take the first 60,000 instances for training, and the remaining 10,000 for testing).**

In [4]:
from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784', version=1, as_frame=False)
mnist.target = mnist.target.astype(np.uint8)

In [10]:
from sklearn.model_selection import train_test_split

In [11]:
# Enter the code here
x_train, x_test, y_train, y_test = train_test_split(mnist.data, mnist.target, train_size=60000, shuffle=False)
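
A quick check can confirm that the split does what the exercise asks. The sketch below uses a small synthetic array as a stand-in for `mnist.data` (an assumption, so that it runs without downloading MNIST; the `*_demo` names are hypothetical). It shows that with `shuffle=False` the first rows go to the training set and the remaining rows to the test set, in their original order.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for mnist.data / mnist.target: 70 rows instead of 70,000
X_demo = np.arange(70 * 4).reshape(70, 4)
y_demo = np.arange(70)

X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, train_size=60, shuffle=False
)

print(X_tr_demo.shape, X_te_demo.shape)
# With shuffle=False, the first 60 rows form the training set and the last 10 the test set
print(np.array_equal(y_tr_demo, np.arange(60)))
print(np.array_equal(y_te_demo, np.arange(60, 70)))
```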

**Train a Random Forest classifier on the dataset and time how long it takes, then evaluate the resulting model on the test set.**

In [17]:
from sklearn.ensemble import RandomForestClassifier

# Enter code here
clf = RandomForestClassifier(n_estimators=100, random_state=42)

In [18]:
import time

# Enter code here
t0 = time.time()
clf.fit(x_train, y_train)
t1 = time.time()

In [19]:
print("Training took {:.2f}s".format(t1 - t0))

Training took 48.96s
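
The same timing pattern is repeated several times in this notebook, so it could be wrapped in a small helper. The sketch below is one possible way to do that; `timed` is a hypothetical helper, not part of the original notebook, and it uses `time.perf_counter()`, which is intended for measuring short durations.

```python
import time

def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Toy workload standing in for clf.fit(x_train, y_train)
total, elapsed = timed(sum, range(1_000_000))
print("Summing took {:.4f}s".format(elapsed))
```

With a fitted-model workflow, the call would look like `_, elapsed = timed(clf.fit, x_train, y_train)`.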


In [20]:
from sklearn.metrics import accuracy_score

# Enter code here
accuracy_score(y_test, clf.predict(x_test))

0.9705

**Next, use PCA to reduce the dataset's dimensionality, with an explained variance ratio of 95%.**

In [34]:
from sklearn.decomposition import PCA

# Enter code here
pca = PCA(n_components=0.95)
x_train_red = pca.fit_transform(x_train)
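
When `n_components` is a float between 0 and 1, PCA keeps the smallest number of components whose cumulative explained variance reaches that fraction. The sketch below shows how to inspect the result. It uses Scikit-Learn's small bundled digits dataset as a stand-in for MNIST (an assumption, to keep it fast and download-free), so the numbers it prints will differ from those for MNIST.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Stand-in dataset (assumption): 8x8 digit images, 64 features per instance
X_digits, _ = load_digits(return_X_y=True)

pca_demo = PCA(n_components=0.95)
X_digits_red = pca_demo.fit_transform(X_digits)

print("Original dimensions:", X_digits.shape[1])
print("Components kept:", pca_demo.n_components_)
print("Variance explained: {:.4f}".format(pca_demo.explained_variance_ratio_.sum()))
```

On the notebook's own objects, the equivalent checks are `pca.n_components_` and `pca.explained_variance_ratio_.sum()`.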

**Train a new Random Forest classifier on the reduced dataset and see how long it takes. Was training much faster?**

In [24]:
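# Enter code here
# Note: calling fit() again discards the previous model, so clf is now trained on the reduced data only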
t0 = time.time()
clf.fit(x_train_red, y_train)
t1 = time.time()

In [25]:
print("Training took {:.2f}s".format(t1 - t0))

Training took 101.88s


🛑 Oh no! Training is actually more than twice as slow now! How can that happen? 🛑

Dimensionality reduction does not always lead to faster training time: it depends on the dataset, the model and the training algorithm.
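
For tree-based models, one thing worth inspecting is whether the trees grown on the PCA features differ in size or depth from those grown on the original features. The sketch below is only a way to look at this, not an explanation confirmed by this notebook. It uses the small bundled digits dataset as a stand-in (an assumption), so its numbers may not carry over to MNIST, and `mean_tree_size` is a hypothetical helper.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

# Stand-in dataset (assumption): the small 8x8 digits set bundled with Scikit-Learn
X_digits, y_digits = load_digits(return_X_y=True)
X_digits_red = PCA(n_components=0.95).fit_transform(X_digits)

def mean_tree_size(X, y):
    """Train a small forest; return the mean node count and mean depth of its trees."""
    forest = RandomForestClassifier(n_estimators=20, random_state=42)
    forest.fit(X, y)
    nodes = [est.tree_.node_count for est in forest.estimators_]
    depths = [est.tree_.max_depth for est in forest.estimators_]
    return sum(nodes) / len(nodes), sum(depths) / len(depths)

nodes_full, depth_full = mean_tree_size(X_digits, y_digits)
nodes_red, depth_red = mean_tree_size(X_digits_red, y_digits)

print("Original features: {:.0f} nodes, depth {:.1f}".format(nodes_full, depth_full))
print("PCA features:      {:.0f} nodes, depth {:.1f}".format(nodes_red, depth_red))
```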

If we try a softmax classifier instead of a random forest classifier, we may find that the training time is reduced when using PCA, but let's first check the accuracy of the random forest classifier.

**Next, evaluate the classifier on the test set: how does it compare to the previous classifier?**

In [26]:
# Enter code here
y_pred = clf.predict(pca.transform(x_test))
accuracy_score(y_test, y_pred)

0.9481

It is common for performance to drop slightly when reducing dimensionality, because we do lose some useful signal in the process. However, the drop is rather severe in this case: accuracy fell from 97.05% to 94.81%. So PCA really did not help: it slowed down training and reduced performance ☹️
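
The signal lost by PCA can be made concrete by projecting the data back to the original space with `inverse_transform` and measuring what is missing. The sketch below does this on the small bundled digits dataset (an assumption, used as a stand-in for MNIST). The fraction of the centered data's squared norm that is not reconstructed should be close to one minus the explained variance ratio, so at most about 5% here.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Stand-in dataset (assumption): 8x8 digit images, 64 features per instance
X_digits, _ = load_digits(return_X_y=True)

pca_demo = PCA(n_components=0.95)
X_red = pca_demo.fit_transform(X_digits)

# Map the reduced data back to 64 dimensions
X_rec = pca_demo.inverse_transform(X_red)

# Share of the total (centered) squared norm that the reconstruction misses
lost = np.sum((X_digits - X_rec) ** 2)
total = np.sum((X_digits - X_digits.mean(axis=0)) ** 2)
lost_fraction = lost / total

print("Fraction of variance lost: {:.4f}".format(lost_fraction))
print("1 - explained variance:    {:.4f}".format(1 - pca_demo.explained_variance_ratio_.sum()))
```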

Let's see if it helps when using softmax regression:

In [27]:
from sklearn.linear_model import LogisticRegression

# Enter code here
log_cls = LogisticRegression(multi_class='multinomial', solver='lbfgs', random_state=42)
t0 = time.time()
log_cls.fit(x_train, y_train)
t1 = time.time()

In [28]:
print("Training took {:.2f}s".format(t1 - t0))

Training took 17.00s


In [29]:
# Enter code here
accuracy_score(y_test, log_cls.predict(x_test))

0.9255

Softmax regression performs worse on the test set than the random forest did (92.55% versus 97.05% accuracy). But that's not what we are interested in right now; we want to see how much PCA can help softmax regression. Let's train the softmax regression model using the reduced dataset:

In [30]:
# Enter code here
log_cls_2 = LogisticRegression(multi_class='multinomial', solver='lbfgs', random_state=42)
t0 = time.time()
log_cls_2.fit(x_train_red, y_train)
t1 = time.time()

In [31]:
print("Training took {:.2f}s".format(t1 - t0))

Training took 4.80s


Let's check the model's accuracy:

In [35]:
# Enter code here
accuracy_score(y_test, log_cls_2.predict(pca.transform(x_test)))

0.9201
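
The PCA step and the classifier can also be chained in a single pipeline, so that the test set is transformed automatically at prediction time and there is no need to call `pca.transform` by hand. The sketch below is one possible arrangement, not the notebook's own method. It assumes the small bundled digits dataset as a stand-in for MNIST, a random 80/20 split, and `max_iter=1000` (an arbitrary choice to give the solver room to converge). In recent Scikit-Learn versions the default `lbfgs` solver fits a multinomial (softmax) model on multiclass data.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Stand-in dataset and split (assumptions, to keep the example fast)
X_digits, y_digits = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_digits, y_digits, test_size=0.2, random_state=42
)

# PCA is fitted on the training data only, then reused to transform the test data
pca_softmax = make_pipeline(
    PCA(n_components=0.95),
    LogisticRegression(max_iter=1000, random_state=42),
)
pca_softmax.fit(X_tr, y_tr)

test_accuracy = pca_softmax.score(X_te, y_te)
print("Test accuracy: {:.4f}".format(test_accuracy))
```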