For this first exercise we are going to use the [MNIST dataset](http://yann.lecun.com/exdb/mnist/), a classic one. Make sure you read the documentation for the dataset and explore it to make sure you know how to tackle the exercise.

We load the dataset

In [29]:
import numpy as np
from sklearn.datasets import fetch_mldata

mnist = fetch_mldata('MNIST original')

In [3]:
mnist.data.shape

(70000, 784)

To speed things up, we will get a sample of 10000 observations. I set numpy's random seed for reproducibility.

In [92]:
sample_size = 10000
np.random.seed(42)
random_sample_index = np.random.randint(0, mnist.data.shape[0], sample_size)
random_sample_index

array([15795,   860, 54886, ..., 49222, 48115, 40647])

In [78]:
mnist_sample_data = mnist.data[random_sample_index]
mnist_sample_target = mnist.target[random_sample_index]

### Use PCA to reduce the mnist sample dimensionality and create a classifier that correctly classifies the images, using as the scorer the F1 score (there are multiple f1 scores for multiclass, choose the one you think fits best the problem, the [documentation can help](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html)). A good classifier would have a F1 of above 0.99

first we check how the target classes are distributed

In [91]:
pd.Series(mnist_sample_target).value_counts(True)

1.0    0.1152
3.0    0.1040
7.0    0.1038
0.0    0.1036
4.0    0.1017
2.0    0.0982
9.0    0.0974
8.0    0.0942
6.0    0.0941
5.0    0.0878
dtype: float64

We see there is no class imbalance, so we can use F1 micro average

In [79]:
from sklearn.decomposition import PCA

In [80]:
pca = PCA(0.8)

In [81]:
mnist_pca = pca.fit_transform(mnist_data)

In [88]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()

In [89]:
from sklearn.model_selection import cross_val_score
clf_results = cross_val_score(clf,
                              mnist_pca,
                              mnist_target,
                              scoring="f1_micro",
                              cv=3
                             )

In [90]:
clf_results.mean()

0.99689989994800321