# Dimensionality Reduction

## Daniel Wilcox: 19147414

This example problem can be found in Chapter 8 of "Hands-On Machine Learning with Scikit-Learn and TensorFlow" by Aurélien Géron.

This project investigates the theory behind dimensionality reduction and how to implement it.

In [1]:
#General imports for operating system, unzip and URLs
import os
import urllib.request
from scipy.io import loadmat
#fetch_mldata (used in the commented-out block below) was removed in scikit-learn 0.22
import time

#Graphics
import matplotlib
import matplotlib.pyplot as plt

#Array Manipulation
import numpy as np

from sklearn.linear_model import SGDClassifier

#Shuffles data to test/train sets that represent the original data
from sklearn.model_selection import StratifiedKFold

#Cross-validation
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict

from sklearn.metrics import confusion_matrix

from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score

#random forest 
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score


from sklearn.base import clone

#Creating custom Transformers
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import FunctionTransformer

import random

#Model Tuning
from sklearn.model_selection import GridSearchCV

#Image shifting
from scipy.ndimage import shift

In [2]:
#The Location to save the dataset
MNIST_PATH = "datasets/MNIST"
MNIST_URL = "https://github.com/amplab/datascience-sp14/raw/master/lab7/mldata/mnist-original.mat"
MNIST_MAT = "/mnist-original.mat"

In [3]:
def load_MNIST_data(mnist_path=MNIST_PATH, mnist_mat=MNIST_MAT):
    mnist_raw = loadmat(mnist_path+mnist_mat)
    #The .mat file stores one image per column, so transpose to one image per row
    mnist = {"data": mnist_raw["data"].T,
             "target": mnist_raw["label"][0],
             "Col_names": ["target", "data"],
             "DESCR": "mldata.org dataset: mnist-original",
            }
    print("Data Successfully extracted from mnist.mat!")
    return mnist
        
    
def get_MNIST_data(mnist_path=MNIST_PATH, mnist_url=MNIST_URL, mnist_mat=MNIST_MAT):

    print("Checking if directory exists...")
    if not os.path.isdir(mnist_path):
        os.makedirs(mnist_path)
        print("Creating directory")
    else:
        print("Directory exists")

    #------------------------------------------------------------------
    #uncomment if connected to internet
    #try:
        #print("\nAttempting to get MNIST data from mldata.org ...")
        #mnist = fetch_mldata('MNIST original')
        #print("\nSuccess!")
        #return mnist

    #except urllib.error.HTTPError as ex:
        #print("\nCan't reach mldata.org, attempting alternative...")
        #print("Checking if mnist.mat file exists...")

    #------------------------------------------------------------------
    #the following if/else should fall under 'except'

    #This check runs whether or not the directory already existed, so a
    #first run downloads the file instead of returning None
    if os.path.isfile(mnist_path+mnist_mat):
        print("mnist.mat file does exists...")
        print("extracting data from mnist.mat...")

        mnist = load_MNIST_data(mnist_path, mnist_mat)
        print("\nSuccess!")
        return mnist

    else:
        print("mnist.mat file doesn't exists...")
        print("downloading mnist.mat file...")
        url_response = urllib.request.urlopen(mnist_url)

        print("\nCreating .mat file")
        with open(mnist_path+mnist_mat, "wb") as f:
            contents = url_response.read()
            f.write(contents)
        mnist = load_MNIST_data(mnist_path, mnist_mat)
        print("\nSuccess!")
        return mnist
            

# Exercises:
## 8. 
Load the MNIST dataset (introduced in Chapter 3) and split it into a training set and a test set (take the first 60,000 instances for training, and the remaining 10,000 for testing). Train a Random Forest classifier on the dataset and time how long it takes, then evaluate the resulting model on the test set. Next, use PCA to reduce the dataset’s dimensionality, with an explained variance ratio of 95%. Train a new Random Forest classifier on the reduced dataset and see how long it takes. Was training much faster? Next evaluate the classifier on the test set: how does it compare to the previous classifier?

In [4]:
mnist = get_MNIST_data(MNIST_PATH, MNIST_URL, MNIST_MAT)
mnist            


Checking if directory exists...
Directory exists
mnist.mat file does exists...
extracting data from mnist.mat...
Data Successfully extracted from mnist.mat!

Success!


{'data': array([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
 'target': array([0., 0., 0., ..., 9., 9., 9.]),
 'Col_names': ['target', 'data'],
 'DESCR': 'mldata.org dataset: mnist-original'}

In [5]:
X, y = mnist["data"], mnist["target"]
print("Shape of \"Data\": {}\nShape of \"target\": {}\n".format(X.shape,y.shape))

Shape of "Data": (70000, 784)
Shape of "target": (70000,)



In [6]:
#MNIST is already ordered: the first 60,000 instances are the training set, the remaining 10,000 the test set
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]


In [7]:
#Shuffle the training set to guarantee that cross-validation folds are similar.
shuffle_index = np.random.permutation(60000)
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]

In [8]:
forest = RandomForestClassifier(n_estimators=10)

#Time Training
t_start = time.time()
forest.fit(X_train, y_train)
t_end = time.time()

t = t_end - t_start

print("Random forest took {:.2f}s to fit the data".format(t))

Random forest took 5.29s to fit the data


In [9]:
#Evaluate Results:
y_pred = forest.predict(X_test)
acc = accuracy_score(y_test, y_pred)

print("Random forest accuracy score: {:.2f}%".format(acc*100))

Random forest accuracy score: 94.90%


In [10]:
forest_pca = RandomForestClassifier(n_estimators=10)

#PCA with 95%:
pca = PCA(n_components=0.95)
X_reduced_tr = pca.fit_transform(X_train)
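
Passing a float such as `0.95` as `n_components` asks PCA to keep the smallest number of components whose cumulative explained variance ratio reaches that fraction. The sketch below reproduces that choice by hand. It is a minimal illustration, not part of the original notebook: it uses scikit-learn's small `load_digits` dataset as a stand-in for MNIST, and the names `X_digits`, `pca_full`, `cumsum`, `d` and `pca_95` are chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Small 8x8 digits dataset (64 features), used here as a stand-in for MNIST
X_digits, _ = load_digits(return_X_y=True)

# Fit PCA with every component kept, then accumulate the variance ratios
pca_full = PCA().fit(X_digits)
cumsum = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 95%
d = int(np.argmax(cumsum >= 0.95)) + 1

# A float in (0, 1) lets PCA make the same choice internally
pca_95 = PCA(n_components=0.95).fit(X_digits)

print("Components needed by hand: {}".format(d))
print("Components kept by PCA(n_components=0.95): {}".format(pca_95.n_components_))
```

On the real data, `pca.n_components_` after the cell above would report how many of the 784 pixel features survive the reduction.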

In [11]:
#Time Training with PCA
t_start = time.time()
forest_pca.fit(X_reduced_tr, y_train)
t_end = time.time()

t = t_end - t_start

print("Random forest took {:.2f}s to fit the data".format(t))

Random forest took 9.37s to fit the data


In [12]:
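#Training on the PCA-reduced data took longer (9.37s vs 5.29s above):
#dimensionality reduction does not guarantee faster training.
#Evaluation: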
X_reduced_test = pca.transform(X_test)

y_pred_pca = forest_pca.predict(X_reduced_test)
acc_pca = accuracy_score(y_test, y_pred_pca)

print("Random forest with PCA's accuracy score: {:.2f}%".format(acc_pca*100))

Random forest with PCA's accuracy score: 89.64%
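
The comparison above keeps PCA and the classifier as separate objects, so the test set has to be transformed by hand before predicting. One possible alternative, sketched here under assumptions, is to chain both steps in a scikit-learn `Pipeline` so that PCA is always fitted on the training data only and reapplied automatically at prediction time. The sketch uses `load_digits` instead of MNIST to stay small, and `fit_and_score`, `plain` and `with_pca` are hypothetical names. Unlike the timing cells above, the pipeline's time includes fitting PCA itself.

```python
import time
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Small stand-in dataset (assumption: not the MNIST data used in the notebook)
X_digits, y_digits = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_digits, y_digits, test_size=0.2, random_state=42, stratify=y_digits)

def fit_and_score(model):
    """Fit on the training split; return (seconds to fit, test accuracy)."""
    t_start = time.perf_counter()
    model.fit(X_tr, y_tr)
    elapsed = time.perf_counter() - t_start
    return elapsed, model.score(X_te, y_te)

plain = RandomForestClassifier(n_estimators=10, random_state=42)
with_pca = make_pipeline(
    PCA(n_components=0.95),
    RandomForestClassifier(n_estimators=10, random_state=42))

results = {name: fit_and_score(model)
           for name, model in [("plain", plain), ("pca", with_pca)]}

for name, (secs, acc) in results.items():
    print("{}: fit in {:.2f}s, accuracy {:.2f}%".format(name, secs, acc * 100))
```

Timings and accuracies on this small dataset will not match the MNIST figures recorded above; the sketch only shows the structure of the comparison.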


## 9.
Use t-SNE to reduce the MNIST dataset down to two dimensions and plot the result using
Matplotlib. You can use a scatterplot using 10 different colors to represent each image’s target
class. Alternatively, you can write colored digits at the location of each instance, or even plot
scaled-down versions of the digit images themselves (if you plot all digits, the visualization will
be too cluttered, so you should either draw a random sample or plot an instance only if no other
instance has already been plotted at a close distance). You should get a nice visualization with
well-separated clusters of digits. Try using other dimensionality reduction algorithms such as
PCA, LLE, or MDS and compare the resulting visualizations.
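
The last part of the exercise asks for a comparison with PCA, LLE and MDS. A minimal sketch of how those three reducers could be run side by side follows. It is an illustration under assumptions, not the notebook's own solution: it uses 300 samples from scikit-learn's `load_digits` dataset rather than MNIST (MDS in particular scales poorly with sample count), and the names `reducers` and `embeddings` and the choice of `n_neighbors=10` are made here for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import MDS, LocallyLinearEmbedding

# Small sample so that MDS finishes quickly
X_digits, y_digits = load_digits(return_X_y=True)
X_small, y_small = X_digits[:300], y_digits[:300]

# One 2-D reducer per algorithm named in the exercise
reducers = {
    "PCA": PCA(n_components=2, random_state=42),
    "LLE": LocallyLinearEmbedding(n_components=2, n_neighbors=10, random_state=42),
    "MDS": MDS(n_components=2, random_state=42),
}

# Each value is an array of shape (n_samples, 2), ready for a scatter plot
embeddings = {name: reducer.fit_transform(X_small)
              for name, reducer in reducers.items()}

for name, emb in embeddings.items():
    print("{}: embedding shape {}".format(name, emb.shape))
```

Each array in `embeddings` can be drawn with `plt.scatter(emb[:, 0], emb[:, 1], c=y_small, cmap="jet")` to compare cluster separation visually.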

In [13]:
#Use 10% of data
frac_per = 0.10
frac = round(frac_per*60000)
frac_idx = np.random.permutation(60000)[:frac]

X_less = X_train[frac_idx]
y_less = y_train[frac_idx]

In [14]:
#t-SNE reduction to 2-D:
tsne = TSNE(n_components=2)
X_2D = tsne.fit_transform(X_less)

KeyboardInterrupt: 
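
The `KeyboardInterrupt` above indicates that the t-SNE run was stopped before it finished. One common way to shorten t-SNE is to reduce the data with PCA first and run t-SNE on the smaller representation. The sketch below shows that arrangement as a pipeline. It is an assumption about what might help, not something measured in this notebook, and it uses 500 samples of `load_digits` rather than MNIST; `pca_tsne` and `X_2D_small` are names chosen here.

```python
import time
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.pipeline import make_pipeline

# Small stand-in dataset so the example runs in a few seconds
X_digits, _ = load_digits(return_X_y=True)
X_small = X_digits[:500]

# PCA keeps 95% of the variance, then t-SNE maps the result to 2-D
pca_tsne = make_pipeline(
    PCA(n_components=0.95),
    TSNE(n_components=2, random_state=42))

t_start = time.perf_counter()
X_2D_small = pca_tsne.fit_transform(X_small)
elapsed = time.perf_counter() - t_start

print("PCA + t-SNE took {:.2f}s, output shape {}".format(elapsed, X_2D_small.shape))
```

Reducing `frac_per` in the sampling cell is the other obvious lever, since t-SNE's cost grows quickly with the number of instances.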

In [None]:
plt.figure(figsize=(13,10))
plt.scatter(X_2D[:, 0], X_2D[:, 1], c=y_less, cmap="jet")
plt.axis('off')
plt.colorbar()
plt.show()
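
The exercise also suggests plotting an instance only if no other instance has already been plotted close to it. A minimal sketch of that filter follows, using random 2-D points in place of a real embedding. The function name `select_spread_out` and the `min_distance` value are hypothetical; with a real t-SNE output, the coordinates would first need rescaling (for example to the unit square) so that the threshold has a meaningful scale.

```python
import numpy as np

def select_spread_out(points_2d, min_distance):
    """Greedily keep indices whose point lies at least min_distance
    away from every point that has already been kept."""
    kept = []
    for i, point in enumerate(points_2d):
        if not kept:
            kept.append(i)
            continue
        # Distance from this point to each point kept so far
        dists = np.linalg.norm(points_2d[kept] - point, axis=1)
        if dists.min() >= min_distance:
            kept.append(i)
    return np.array(kept)

# Random points in the unit square stand in for a normalized embedding
rng = np.random.default_rng(0)
demo_points = rng.uniform(0, 1, size=(1000, 2))

kept_idx = select_spread_out(demo_points, min_distance=0.05)
print("Kept {} of {} points".format(len(kept_idx), len(demo_points)))
```

The returned indices can be used to pick which digits to draw, for example with `plt.text` at each kept location.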

In [None]:
#Previously seen that 3's and 5's got confused, comparing these two digits:
idx_3_5 = (y == 3) | (y == 5)
X_3_5 = X[idx_3_5]
y_3_5 = y[idx_3_5]

tsne_3_5 = TSNE(n_components=2)
X_2D_3_5 = tsne_3_5.fit_transform(X_3_5)

#Same "jet" colormap as the previous plot, so 3 and 5 keep their colors
cmap = plt.get_cmap("jet")

plt.figure(figsize=(13,10))
for val in (3, 5):
    plt.scatter(X_2D_3_5[y_3_5 == val, 0], X_2D_3_5[y_3_5 == val, 1], c=[cmap(val/9)], label=str(val))
plt.axis('off')
#A legend replaces the colorbar: each scatter call uses one fixed color
plt.legend()
plt.show()