# Exercise 9
# Task

Load the MNIST dataset (introduced in Chapter 3) and split it into a training set
and a test set (take the first 60,000 instances for training, and the remaining
10,000 for testing). Train a Random Forest classifier on the dataset and time how
long it takes, then evaluate the resulting model on the test set. Next, use PCA to
reduce the dataset’s dimensionality, with an explained variance ratio of 95%.
Train a new Random Forest classifier on the reduced dataset and see how long it
takes. Was training much faster? Next, evaluate the classifier on the test set.

# Downloading and splitting the data

In [1]:
from sklearn.datasets import fetch_openml

mnist = fetch_openml("mnist_784", version=1)

In [2]:
x_train = mnist["data"][:60000]
x_test = mnist["data"][60000:]
y_train = mnist["target"][:60000]
y_test = mnist["target"][60000:]

# Training an estimator on the original dataset

In [4]:
from sklearn.ensemble import RandomForestClassifier
import numpy as np

forest_clf = RandomForestClassifier(n_estimators=100,
                                    random_state=42)

In [5]:
import time

t0 = time.time()
forest_clf.fit(x_train, y_train)
t1 = time.time()

print("Time for training on original dataset: ", t1 - t0)

Time for training on original dataset:  32.832902669906616
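
The `t0`/`t1` pattern above recurs in every timing cell. A minimal sketch of a reusable helper follows, assuming a hypothetical name `timed_fit` (it is not part of the notebook or of scikit-learn) and using `time.perf_counter`, which is intended for measuring intervals. The bundled digits dataset stands in for MNIST so the example runs quickly.

```python
import time

from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier


def timed_fit(estimator, X, y):
    """Fit `estimator` on (X, y) and return (fitted estimator, elapsed seconds).

    Hypothetical helper for illustration; not part of the original notebook.
    """
    start = time.perf_counter()
    estimator.fit(X, y)
    elapsed = time.perf_counter() - start
    return estimator, elapsed


# Small bundled digits dataset as a stand-in for MNIST (assumption for speed).
X_demo, y_demo = load_digits(return_X_y=True)
demo_clf, seconds = timed_fit(
    RandomForestClassifier(n_estimators=10, random_state=42), X_demo, y_demo
)
print(f"Training took {seconds:.2f} s")
```

The same call should work with `forest_clf`, `x_train`, and `y_train` from the cells above.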


In [6]:
from sklearn.metrics import accuracy_score

y_hat = forest_clf.predict(x_test)
accuracy_score(y_test, y_hat)

0.9705

# Training an estimator on the transformed dataset

In [7]:
from sklearn.decomposition import PCA

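# A float n_components keeps just enough components to explain 95% of the variance.
pca = PCA(n_components=0.95)
# PCA is unsupervised, so the labels are not needed for fitting.
pca.fit(x_train)
x_train_transformed = pca.transform(x_train)

forest_clf = RandomForestClassifier(n_estimators=100,
                                    random_state=42)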

In [8]:
t0 = time.time()
forest_clf.fit(x_train_transformed, y_train)
t1 = time.time()
print("Time for training on transformed dataset: ", t1 - t0)

Time for training on transformed dataset:  80.17137789726257


In [9]:
x_test_transformed = pca.transform(x_test)
y_hat = forest_clf.predict(x_test_transformed)
accuracy_score(y_test, y_hat)

0.9481
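
Passing a float to `n_components` asks PCA to keep the smallest number of components whose cumulative explained variance exceeds that fraction. A short sketch of how one might inspect this, using scikit-learn's bundled digits dataset as a stand-in for MNIST (an assumption made so the example runs without a download; the variable names are illustrative):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X_digits, _ = load_digits(return_X_y=True)

pca_demo = PCA(n_components=0.95)
X_digits_reduced = pca_demo.fit_transform(X_digits)

# Number of components kept and the variance they retain together.
n_kept = pca_demo.n_components_
retained = pca_demo.explained_variance_ratio_.sum()
print(f"{X_digits.shape[1]} features -> {n_kept} components "
      f"({retained:.1%} of the variance)")

# Cross-check: read the same count off a full PCA's cumulative variance curve.
cumsum = np.cumsum(PCA().fit(X_digits).explained_variance_ratio_)
n_needed = int(np.argmax(cumsum > 0.95)) + 1
print("Components needed according to the cumulative curve:", n_needed)
```

On the notebook's fitted `pca`, the same two attributes (`n_components_` and `explained_variance_ratio_`) show how many of the 784 pixel features survive.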

# Training a softmax model

In [10]:
from sklearn.linear_model import LogisticRegression

log_clf = LogisticRegression(multi_class="multinomial",
                             solver="lbfgs",
                             random_state=42)

t0 = time.time()
log_clf.fit(x_train, y_train)
t1 = time.time()

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [11]:
print("Time for training on original dataset: ", t1 - t0)

Time for training on original dataset:  12.373929500579834


In [12]:
y_hat = log_clf.predict(x_test)
accuracy_score(y_test, y_hat)

0.9255

In [13]:
log_clf2 = LogisticRegression(multi_class="multinomial",
                              solver="lbfgs",
                              random_state=42)

t0 = time.time()
log_clf2.fit(x_train_transformed, y_train)
t1 = time.time()

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [14]:
print("Time for training on transformed dataset: ", t1 - t0)

Time for training on transformed dataset:  3.9129631519317627


In [15]:
y_hat = log_clf2.predict(x_test_transformed)
accuracy_score(y_test, y_hat)

0.9201
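
Both softmax fits above stopped at the default iteration limit, and the warning suggests scaling the data or raising `max_iter`. A minimal sketch of that suggestion follows, assuming a `StandardScaler` step and `max_iter=1000` (both illustrative choices, not tuned values). It uses the bundled digits dataset rather than MNIST so that it runs quickly.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_digits, y_digits = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_digits, y_digits, test_size=0.2, random_state=42
)

# Standardize the features, then fit the classifier with a higher iteration cap.
# With the lbfgs solver and a multiclass target, a multinomial (softmax)
# model is fitted by default.
scaled_log_clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="lbfgs", max_iter=1000, random_state=42),
)
scaled_log_clf.fit(X_tr, y_tr)

# n_iter_ reports the iterations actually used by the solver.
iterations_used = int(scaled_log_clf[-1].n_iter_.max())
test_accuracy = scaled_log_clf.score(X_te, y_te)
print(f"iterations: {iterations_used}, test accuracy: {test_accuracy:.3f}")
```

If `iterations_used` comes out below `max_iter`, the solver stopped on its own tolerance rather than on the cap.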

# Exercise 10
# Task

Use t-SNE to reduce the MNIST dataset down to two dimensions and plot the result using Matplotlib. You can use a scatterplot using 10 different colors to represent each image's target class.

In [48]:
mnist.data.loc[[1]]

Unnamed: 0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,pixel9,pixel10,...,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783,pixel784
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [78]:
np.random.seed(42)

# t-SNE is slow, so work on a random sample of 10,000 training images.
m = 10000
idx = np.random.permutation(60000)[:m]

X = mnist['data'].loc[idx]
y = mnist['target'].loc[idx]

In [64]:
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, random_state=42)
X_reduced = tsne.fit_transform(X)



In [65]:
X_reduced.shape

(10000, 2)

In [84]:
y.shape

(10000,)
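
The task's remaining step is the scatterplot. A sketch of one way to draw it, assuming a hypothetical helper `plot_embedding` (not from the notebook) and the `tab10` colormap for the ten classes. The demo uses a small slice of scikit-learn's bundled digits dataset so that it runs quickly, but the same call should work with `X_reduced` and `y` from the cells above.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE


def plot_embedding(X_2d, labels):
    """Scatter a 2-D embedding, one color per digit class (hypothetical helper)."""
    # MNIST targets from fetch_openml are strings, so cast to int for coloring.
    labels = np.asarray(labels).astype(int)
    fig, ax = plt.subplots(figsize=(8, 6))
    points = ax.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, cmap="tab10",
                        vmin=-0.5, vmax=9.5, s=8)
    fig.colorbar(points, ax=ax, ticks=range(10), label="digit")
    ax.set_xlabel("t-SNE dimension 1")
    ax.set_ylabel("t-SNE dimension 2")
    return fig, ax


# Stand-in data: 500 samples of the bundled 8x8 digits, not MNIST itself.
X_digits, y_digits = load_digits(return_X_y=True)
X_small, y_small = X_digits[:500], y_digits[:500]
X_small_2d = TSNE(n_components=2, random_state=42).fit_transform(X_small)

fig, ax = plot_embedding(X_small_2d, y_small)
```

In a notebook the figure renders at the end of the cell; in a script, a call to `plt.show()` would be needed.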

# Stopped here