# MNIST Classification with Sklearn

Importing the libraries and the datasets

In [48]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt  # the plotting API lives in matplotlib.pyplot
import seaborn as sns
from sklearn.datasets import fetch_openml
from sklearn.linear_model import SGDClassifier

We are going to load the classic MNIST dataset of handwritten digits (0 to 9) with the fetch_openml function from sklearn.datasets. We then read its DESCR attribute to see information about the dataset.

In [43]:
mnist = fetch_openml('mnist_784', version=1, as_frame=False)  # as_frame=False returns NumPy arrays rather than a DataFrame

mnist.DESCR

"**Author**: Yann LeCun, Corinna Cortes, Christopher J.C. Burges  \n**Source**: [MNIST Website](http://yann.lecun.com/exdb/mnist/) - Date unknown  \n**Please cite**:  \n\nThe MNIST database of handwritten digits with 784 features, raw data available at: http://yann.lecun.com/exdb/mnist/. It can be split in a training set of the first 60,000 examples, and a test set of 10,000 examples  \n\nIt is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image. It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal efforts on preprocessing and formatting. The original black and white (bilevel) images from NIST were size normalized to fit in a 20x20 pixel box while preserving their aspect ratio. The resulting images contain grey levels as a result of the anti-aliasing technique used by the normalization algorithm. the images were centered in a 28x28 

Now we are going to define X (the features) and y (the target), and use the shape attribute to see the number of features (predictor variables) and the number of instances available for training and testing.

In [44]:
X, y = mnist.data, mnist.target

print(' X -- Instances, columns = ', X.shape, '\n', 'y -- Instances = ', y.shape)

 X -- Instances, columns =  (70000, 784) 
 y -- Instances =  (70000,)


The features are the 784 pixels of a 28x28 image. Each value is a grayscale intensity from 0 (empty background) to 255 (fully filled), not a binary flag, and the values are stored as floats.

In [45]:
X.dtype

dtype('float64')
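To check what values the pixels take, one option is to look at the minimum, the maximum, and the number of distinct values. The sketch below is only illustrative: `pixel_summary` is a hypothetical helper, and `X_demo` is a tiny made-up array standing in for X so that the example runs without downloading the data.

```python
import numpy as np

def pixel_summary(X):
    """Return (min, max, number of distinct values) for an array of pixels."""
    X = np.asarray(X)
    return X.min(), X.max(), np.unique(X).size

# Hypothetical stand-in for X: 2 tiny "images" with 4 pixels each.
X_demo = np.array([[0.0, 128.0, 255.0, 64.0],
                   [0.0, 0.0, 255.0, 32.0]])

lo, hi, n_distinct = pixel_summary(X_demo)
print(lo, hi, n_distinct)
```

Calling `pixel_summary(X)` on the real data reports the actual range of intensities stored in the dataset.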

The target variable, however, is not numeric: the labels are stored as strings (dtype object). We convert y to a numeric type so that later comparisons with numbers, such as y == 7, behave as expected.

In [46]:
print('y before transformation --> ', y.dtype)

y = y.astype('float')

print('y after transformation --> ', y.dtype)

y before transformation -->  object
y after transformation -->  float64
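A small sketch of why the cast matters, assuming the labels arrive as strings in an object array (which is what the object dtype above suggests). `y_demo` is a made-up stand-in for y.

```python
import numpy as np

# Hypothetical stand-in for the raw target: digit labels stored as strings.
y_demo = np.array(['7', '3', '7', '0'], dtype=object)

# Comparing string labels with a number never matches.
before = (y_demo == 7)

# After the cast, the same comparison picks out the sevens.
y_num = y_demo.astype('float')
after = (y_num == 7)

print(before)
print(after)
```

Without the cast, a mask built with `y == 7` would silently select nothing.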


Now we are going to split the instances into a training set and a test set:

In [47]:
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]  # first 60,000 instances for training, the remaining 10,000 for testing
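This split is purely positional: it relies on the dataset already being ordered so that the first 60,000 rows form the training set, as the dataset description states. A minimal sketch of the same idea on toy data follows; `split_head_tail` is a hypothetical helper, not part of sklearn.

```python
import numpy as np

def split_head_tail(X, y, n_train):
    """Positional split: first n_train rows for training, the rest for testing."""
    return X[:n_train], X[n_train:], y[:n_train], y[n_train:]

# Toy data: 10 instances with 2 features each.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = split_head_tail(X_demo, y_demo, n_train=6)
print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)
```

For data without a predefined ordering, a shuffled split such as sklearn's `train_test_split` would usually be the safer choice.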

## Binary classifier

First we are going to train a binary classifier that detects whether the target digit is a 7 or not, so we replace the float values of y with booleans (True if y equals 7, False for any other digit). For this purpose we choose a stochastic gradient descent (SGD) classifier: although it may perform slightly worse than a regular gradient descent algorithm, it is better suited to big datasets like MNIST.

In [52]:
y_train7 = (y_train == 7) 
y_test7 = (y_test == 7)
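Before training, it can help to see how many positives a mask like this contains. The sketch below uses a made-up label vector `y_demo` rather than the real target.

```python
import numpy as np

# Hypothetical labels standing in for y_train.
y_demo = np.array([7., 1., 7., 3., 9., 0., 7., 2., 4., 5.])

# Boolean target: True where the digit is a 7.
y_demo7 = (y_demo == 7)

n_pos = int(y_demo7.sum())   # number of sevens
frac = y_demo7.mean()        # fraction of sevens

print(y_demo7.dtype, n_pos, frac)
```

The same two lines applied to `y_train7` give the share of sevens in the training set.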

In [53]:
sgd_clf = SGDClassifier()

sgd_clf.fit(X_train, y_train7)

SGDClassifier()
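SGDClassifier shuffles the training data, so results can vary slightly from run to run unless `random_state` is fixed. The sketch below shows a reproducible fit on a small synthetic dataset. The data from `make_classification` and the seed values are assumptions for illustration, not the setup used above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Small synthetic binary problem standing in for (X_train, y_train7).
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=0)

# Fixing random_state makes the shuffling, and so the fitted model, repeatable.
clf = SGDClassifier(random_state=42)
clf.fit(X_demo, y_demo)

# Predict the class of the first five instances.
preds = clf.predict(X_demo[:5])
print(preds)
```

With its default settings the classifier fits a linear model with hinge loss, so `clf.coef_` holds one weight per feature.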

Now that our binary classifier is trained, let's see how it performs by calling its 'score' method, which returns the mean accuracy on the given data:

In [54]:
sgd_clf.score(X_test, y_test7)

0.9626
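For a classifier, `score` is the fraction of correct predictions, so the number is easier to read next to a trivial baseline. The sketch below checks both on synthetic, imbalanced data. The dataset, the class weights, and the seeds are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

# Synthetic binary problem with roughly 10% positives.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, weights=[0.9, 0.1], random_state=0
)

clf = SGDClassifier(random_state=42).fit(X_demo, y_demo)

# score() is the mean accuracy: the fraction of correct predictions.
score = clf.score(X_demo, y_demo)
manual = np.mean(clf.predict(X_demo) == y_demo)

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
baseline_score = baseline.score(X_demo, y_demo)

print(score, manual, baseline_score)
```

Since roughly one digit in ten is a 7, a model that always answers "not 7" would already reach about 90% accuracy on MNIST, so the score above is best compared with that baseline.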