# Logistic Regression: OCR with Scikit-Learn on the Digits dataset

## What are we going to do?
- We will load the handwritten digits dataset and classify it, as an example of OCR ("optical character recognition").
- We will preprocess the dataset using Scikit-learn methods.
- We will train a multiclass classification model using Scikit-learn.

OCR is a set of techniques, usually based on machine learning, deep learning or neural networks, that attempt to recognise printed or handwritten characters in images.

As the set of classes is relatively small (10 digits), it is a problem that we can often solve with simple models such as logistic regression or SVM.

- You can find the features of the dataset here: [Optical recognition of handwritten digits dataset](https://scikit-learn.org/stable/datasets/toy_dataset.html#digits-dataset)
- You can load it with this function: [sklearn.datasets.load_digits](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html)
- You can use this notebook as a reference: [Recognising hand-written digits](https://scikit-learn.org/stable/auto_examples/classification/plot_digits_classification.html)

Repeat the steps of the previous exercise to train an OCR ML model on this dataset with Scikit-learn's [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) class:

In [None]:
# TODO: Import all the necessary modules into this cell

## Load the Digits dataset

Before starting to work with the dataset, plot some of the examples and their associated classes or digits:

In [None]:
# TODO: Load the Digits dataset as X and Y arrays and plot some of the examples
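
One possible way to approach this cell is sketched below. It assumes `matplotlib` is available for plotting; the variable names (`digits`, `X`, `Y`) and the 2x5 grid are illustrative choices, not part of the exercise statement.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits

# Load the dataset bundled with Scikit-learn (no download needed)
digits = load_digits()
X, Y = digits.data, digits.target  # X: flattened 8x8 images, Y: digit labels

print("X shape:", X.shape)
print("Y shape:", Y.shape)

# Plot the first 10 examples with their labels (grid size is an arbitrary choice)
fig, axes = plt.subplots(2, 5, figsize=(8, 4))
for ax, image, label in zip(axes.ravel(), digits.images, digits.target):
    ax.imshow(image, cmap="gray_r")
    ax.set_title(f"Digit: {label}")
    ax.set_axis_off()
plt.show()
```

Each row of `X` is an 8x8 image flattened into 64 features, so `digits.images` is the more convenient attribute for plotting.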

## Preprocess the data

Preprocess the data using Scikit-learn methods, as you did in the Scikit-learn linear regression exercise:

- Randomly reorder the data.
- Normalise the data, if necessary.
- Divide the dataset into training and test subsets.

On this occasion, we will use K-fold cross-validation, as the dataset is small (1,797 examples).

In [None]:
# TODO: Randomly reorder the data, normalise it only if necessary, and divide it into training and test subsets.
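
The preprocessing steps listed above could be sketched as follows. The 25% test size, the `random_state` value, stratified sampling and the use of `StandardScaler` are assumptions made for illustration; other choices (for example, dividing the pixel values by their maximum) would be equally valid.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, Y = load_digits(return_X_y=True)

# Shuffle and split in one step; stratify keeps the class balance in both subsets
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, shuffle=True, stratify=Y, random_state=42
)

# Fit the scaler on the training subset only, then apply it to both subsets,
# so that no information from the test subset leaks into training
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Training examples:", X_train_scaled.shape[0])
print("Test examples:", X_test_scaled.shape[0])
```

No separate validation subset is held out here, since K-fold cross-validation will later reuse the training subset for that purpose.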

## Train an initial model
- Train an initial model on the training subset without regularisation.
- Test the suitability of the model and retrain it if necessary.

The Scikit-learn class that you can use is [sklearn.linear_model.LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) with an OvR scheme ("one-vs-rest", one class versus the rest).

Evaluate it on the test subset using its `model.score()` method:

In [None]:
# TODO: Train your model without regularisation on the training subset and evaluate it on the test subset
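
A minimal sketch of this step follows, under two assumptions. First, the OvR scheme is obtained by wrapping the estimator in `sklearn.multiclass.OneVsRestClassifier`, because the `multi_class` parameter of `LogisticRegression` is deprecated in recent Scikit-learn versions. Second, "no regularisation" is approximated with a very large `C` (here `1e4`, an arbitrary value), since the way to disable the penalty entirely differs between versions.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, Y = load_digits(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, stratify=Y, random_state=42
)

# Assumption: a very large C stands in for "no regularisation"
model = make_pipeline(
    StandardScaler(),
    OneVsRestClassifier(LogisticRegression(C=1e4, max_iter=5000)),
)
model.fit(X_train, Y_train)

# score() returns the mean accuracy for classifiers
train_score = model.score(X_train, Y_train)
test_score = model.score(X_test, Y_test)
print("Training accuracy:", train_score)
print("Test accuracy:", test_score)
```

A large gap between the training and test accuracy would suggest overfitting, which is what the regularisation search in the next section tries to address.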

## Find the optimal regularisation using cross-validation
- Train a model for each regularisation value to be considered.
- Train and evaluate each one on the folds of the training subset using K-fold cross-validation.
- Choose the optimal model and its regularisation.

The LogisticRegression class applies L2 regularisation by default, controlled by the *C* parameter, which is the inverse of *lambda* (a smaller *C* means stronger regularisation):

In [None]:
# TODO: Train a model for each value of C and evaluate it using K-fold cross-validation
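
The search could be sketched as below, using `cross_val_score` with a stratified 5-fold split. The grid of `C` values, the number of folds and the `random_state` are illustrative assumptions; the scaler is placed inside the pipeline so that it is refitted on each training fold.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, Y = load_digits(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, stratify=Y, random_state=42
)

# Candidate values of C (inverse of lambda); an arbitrary logarithmic grid
C_values = [0.01, 0.1, 1.0, 10.0, 100.0]
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

mean_scores = []
for C in C_values:
    model = make_pipeline(
        StandardScaler(),
        OneVsRestClassifier(LogisticRegression(C=C, max_iter=5000)),
    )
    # Accuracy on each validation fold, computed within the training subset only
    scores = cross_val_score(model, X_train, Y_train, cv=cv)
    mean_scores.append(scores.mean())
    print(f"C={C}: mean CV accuracy = {scores.mean():.4f}")

best_C = C_values[int(np.argmax(mean_scores))]
print("Best C:", best_C)
```

The test subset is not touched during the search, so it remains an unbiased estimate for the final evaluation. `GridSearchCV` would be a more compact alternative to the explicit loop.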

## Finally, evaluate the model on the test subset

- Display the coefficients and intercept of the best model.
- Evaluate the best model on the initial test subset.
- Calculate the hits and misses on the test subset and plot them.

As this dataset is very visual, also try to display the examples that the model misclassified, and consider whether you would be able to recognise each digit yourself.

*Sometimes even a human would have trouble deciphering it based on the handwriting of the writer 8-).*

In [None]:
# TODO: Evaluate the best model on the initial test subset and plot its misses graphically.
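
A sketch of the final evaluation is shown below. The value of `best_C` is a hypothetical placeholder to be replaced with the one found by cross-validation, and the limit of 10 displayed misses is arbitrary. Because the model is wrapped in `OneVsRestClassifier` (an assumption carried over from the OvR requirement), the coefficients are collected from its per-class binary estimators.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, Y = load_digits(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, stratify=Y, random_state=42
)

best_C = 1.0  # hypothetical placeholder: use the value chosen by cross-validation
model = make_pipeline(
    StandardScaler(),
    OneVsRestClassifier(LogisticRegression(C=best_C, max_iter=5000)),
)
model.fit(X_train, Y_train)

# One binary classifier per digit: stack their coefficients and intercepts
ovr = model[-1]
coefs = np.vstack([est.coef_ for est in ovr.estimators_])
intercepts = np.concatenate([est.intercept_ for est in ovr.estimators_])
print("Coefficients shape:", coefs.shape)
print("Intercepts:", intercepts)

# Hits and misses on the test subset
Y_pred = model.predict(X_test)
hits = int((Y_pred == Y_test).sum())
misses = len(Y_test) - hits
print("Test accuracy:", model.score(X_test, Y_test))

plt.bar(["hits", "misses"], [hits, misses])
plt.title("Predictions on the test subset")
plt.show()

# Show up to 10 misclassified digits with their true and predicted labels
miss_idx = np.flatnonzero(Y_pred != Y_test)
n_show = min(len(miss_idx), 10)
if n_show > 0:
    fig, axes = plt.subplots(1, n_show, figsize=(1.6 * n_show, 2), squeeze=False)
    for ax, i in zip(axes.ravel(), miss_idx[:n_show]):
        ax.imshow(X_test[i].reshape(8, 8), cmap="gray_r")
        ax.set_title(f"true {Y_test[i]}, pred {Y_pred[i]}")
        ax.set_axis_off()
    plt.show()
```

The unscaled `X_test` is kept for plotting, since the pipeline applies the scaling internally when predicting.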