### 1) What are the main motivations for reducing a dataset’s dimensionality? - 0.5 points
### What are the main drawbacks? - 0.5 points

#### Main motivations:
- A dataset with, say, 10,000 features can often be reduced to a few hundred dimensions. This keeps the most informative directions in the data and drops the rest.
- Fewer features usually means less computation, so training is often faster, although this depends on the model and the dataset.
- Many unnecessary features add noise, and a model that fits this noise may overfit.

#### Main drawbacks:
- Reducing the dimensions discards some of the variance in the data, so some information is lost. Patterns that were present in the original dataset may no longer be available to the model during training.
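
The trade-off described above can be sketched on a small dataset. The example below is illustrative only: it assumes scikit-learn's bundled `load_digits` data (64 features) as a stand-in and a 95% variance target, and the variable names are made up for the sketch. It projects the data with PCA, maps it back with `inverse_transform`, and measures how much of the original signal is not recovered.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Stand-in dataset (assumption): 1,797 images of 8x8 digits, 64 features each
X = load_digits().data

# Keep enough principal components to explain 95% of the variance
pca_demo = PCA(n_components=0.95)
X_reduced = pca_demo.fit_transform(X)

# Map the reduced data back to the original 64-dimensional space
X_recovered = pca_demo.inverse_transform(X_reduced)

# Mean squared reconstruction error: the information lost by the projection
reconstruction_mse = np.mean((X - X_recovered) ** 2)

print("Original features:", X.shape[1])
print("Reduced features: ", X_reduced.shape[1])
print("Reconstruction MSE:", reconstruction_mse)
```

The reduced feature count should come out smaller than 64, and the reconstruction error should be greater than zero. That error corresponds to the variance that was discarded.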

### Load the MNIST dataset (given below) 


In [1]:
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1)
mnist.keys()

dict_keys(['data', 'target', 'frame', 'categories', 'feature_names', 'target_names', 'DESCR', 'details', 'url'])

### 2) Split it into a training set and a test set
### Take the first 60,000 instances for training, and the remaining 10,000 for testing. - 1 point

In [2]:
# Split data into train and test
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(mnist.data, mnist.target, test_size=10000, shuffle=False)

### 3) Train a Random Forest classifier on the dataset and time how long it takes, - 1 point
### then evaluate the resulting model on the test set. - 1 point

In [3]:
import time
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

clf = RandomForestClassifier()

# Time only the call to fit(), not the prediction step
rfc_start = time.perf_counter()
clf.fit(x_train, y_train)
rfc_stop = time.perf_counter()
time_sec = rfc_stop - rfc_start

# Evaluate on the held-out test set
predict = clf.predict(x_test)
print("Time taken to train: ", time_sec)
print("Classifier accuracy: ", accuracy_score(y_test, predict))


Time taken to train:  38.381765604019165
Classifier accuracy:  0.9698


### 4) Next, use PCA to reduce the dataset’s dimensionality, with an explained variance ratio of 95%. - 2 + 2 points

In [4]:
from sklearn.decomposition import PCA

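# A float in (0, 1) keeps just enough components to explain that fraction of the variance
pca = PCA(n_components=0.95)

# Fit PCA on the training set only and project it onto the kept components
transformed = pca.fit_transform(x_train)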
print("Total explained variance ratio: ", pca.explained_variance_ratio_.sum())

Total explained variance ratio:  0.9501960192613034
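
When a float between 0 and 1 is passed as `n_components`, scikit-learn chooses the number of components itself. A minimal sketch of how to inspect that choice follows. It uses the bundled `load_digits` data as an assumed stand-in for MNIST so that it runs quickly, and the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data  # stand-in dataset (assumption), 64 features

pca_95 = PCA(n_components=0.95).fit(X)

# Number of components PCA selected to reach the 95% target
n_kept = pca_95.n_components_

# Running total of explained variance, one entry per kept component
cumulative = np.cumsum(pca_95.explained_variance_ratio_)

print("Components kept:", n_kept)
print("Variance explained by all kept components:", cumulative[-1])
print("Variance explained without the last one:  ", cumulative[-2])
```

In this notebook the same attribute is available on the fitted object as `pca.n_components_`, which gives the number of columns in `transformed`.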


In [5]:
transformed

array([[ 123.93258866, -312.67426202,  -24.51405176, ...,   55.01899792,
         -20.08327427,   39.58995229],
       [1011.71837587, -294.85703827,  596.33956104, ...,    7.24129874,
         -12.45780869,  -12.7432306 ],
       [ -51.84960805,  392.17315286, -188.50974943, ...,  -54.19582221,
          48.47979747,  -73.27826256],
       ...,
       [-178.0534496 ,  160.07821109, -257.61308227, ...,   55.54485537,
          87.99883556,   -5.78979735],
       [ 130.60607208,   -5.59193642,  513.85867395, ...,   23.30835402,
           5.06237836,  -65.26525587],
       [-173.43595244,  -24.71880226,  556.01889393, ...,   52.4956069 ,
          12.63192292,  -45.74001227]])

### 5)  Train a new Random Forest classifier on the reduced dataset and see how long it takes. - 1 point
### Was training much faster?
- Answer: No. Training on the PCA-reduced data was slower, taking roughly twice as long (about 78.8 s versus 38.4 s).

In [6]:
# Fit a fresh classifier on the PCA-reduced training data
clf = RandomForestClassifier()
rfc_start = time.perf_counter()
clf.fit(transformed, y_train)
rfc_stop = time.perf_counter()
time_sec = rfc_stop - rfc_start
print("Time taken to train with PCA: ", time_sec)

Time taken to train with PCA:  78.82187390327454


### 6) Next evaluate the classifier on the test set: how does it compare to the previous classifier? - 1 point
- Answer: The accuracy wasn't affected much. It dropped by about 2 percentage points with the PCA data, from 0.9698 to 0.9492.

In [7]:
# Project the test set with the PCA already fitted on the training data (no refit)
transformed_test = pca.transform(x_test)

predict = clf.predict(transformed_test)
print("Classifier accuracy of PCA trained data: ", accuracy_score(y_test, predict))

Classifier accuracy of PCA trained data:  0.9492
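
Whether PCA speeds up training depends on the dataset and the model, so it is worth measuring rather than assuming. The sketch below repeats the comparison from steps 5 and 6 side by side. It assumes the small bundled `load_digits` dataset as a stand-in, a fixed `random_state`, and a hypothetical helper named `timed_fit`, so its timings and accuracies will not match the MNIST numbers reported above.

```python
import time
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def timed_fit(model, X, y):
    """Fit the model and return (fitted_model, elapsed_seconds)."""
    start = time.perf_counter()
    model.fit(X, y)
    return model, time.perf_counter() - start


X, y = load_digits(return_X_y=True)  # stand-in dataset (assumption)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit PCA on the training split only, then apply it to both splits
pca_demo = PCA(n_components=0.95).fit(X_tr)
X_tr_pca, X_te_pca = pca_demo.transform(X_tr), pca_demo.transform(X_te)

raw_model, raw_time = timed_fit(RandomForestClassifier(random_state=0), X_tr, y_tr)
pca_model, pca_time = timed_fit(RandomForestClassifier(random_state=0), X_tr_pca, y_tr)

# Mean accuracy on the held-out split for each model
raw_acc = raw_model.score(X_te, y_te)
pca_acc = pca_model.score(X_te_pca, y_te)

print(f"Raw features: {raw_time:.2f} s, accuracy {raw_acc:.4f}")
print(f"PCA features: {pca_time:.2f} s, accuracy {pca_acc:.4f}")
```

Keeping the timing in one helper means both models are measured the same way, and fixing `random_state` makes the accuracy comparison repeatable between runs.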
