# Introduction to Scikit-learn

In this lab session we will start using the scikit-learn library for supervised learning.
We use a simple toy dataset that is included with scikit-learn and consists of low-resolution images of handwritten digits.
The task is to classify each image according to the digit it shows, from zero to nine.

## Data loading and splitting
Before we start with the machine learning, it's a good idea to have a look at the data first.

In [None]:
from sklearn.datasets import load_digits
import numpy as np
digits = load_digits()

The ``digits`` object is similar to a dictionary and contains the data and some information about the dataset. The most important attributes are ``digits.data`` and ``digits.target``, both of which are NumPy arrays:

In [None]:
digits.keys()
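
As a small aside on the dictionary-like behaviour: the object returned by ``load_digits`` is a scikit-learn ``Bunch``, so each key can also be read as an attribute. A minimal sketch (the names ``same_data`` and ``same_target`` are only for illustration):

```python
from sklearn.datasets import load_digits

digits = load_digits()

# Key access and attribute access return the same underlying array object.
same_data = digits["data"] is digits.data
same_target = digits["target"] is digits.target
print(same_data, same_target)
```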

### Task
Find out the number of images stored in ``digits.data`` (keeping in mind that the rows correspond to samples and columns correspond to features) and confirm that there are ten separate classes in ``digits.target``.

In [None]:
# your solution here ...

You can see that there are 64 features in the dataset. These represent the grayscale values of an 8x8 pixel image. We can visualize them using matplotlib's ``matshow``:

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

plt.matshow(digits.data[0].reshape(8, 8), cmap=plt.cm.Greys)
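
The link between the 64 features and the 8x8 image can also be checked without plotting. The sketch below assumes the ``digits.images`` attribute, which scikit-learn provides alongside ``digits.data`` and which holds the same pixel values in unflattened form; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# Each row of digits.data is one 8x8 image flattened row by row.
first_flat = digits.data[0]
first_image = digits.images[0]
print(first_flat.shape, first_image.shape)

# Reshaping the flat row recovers the image exactly.
matches = np.array_equal(first_flat.reshape(8, 8), first_image)
print(matches)
```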

### Task
Plot the first 10 images in the dataset and set the title of the plot to the corresponding class.

In [None]:
fig, axes = plt.subplots(2, 5,
                         subplot_kw={'xticks': (), 'yticks': ()})
for ax, im, label in zip(axes.ravel(), digits.data, digits.target):
    ax.imshow(im.reshape(8, 8), cmap=plt.cm.Greys)
    ax.set_title(label)


In [None]:
for i in range(10):
    image = digits.data[i]
    plt.matshow(image.reshape(8, 8), cmap=plt.cm.Greys)
    plt.title(digits.target[i])

Next, we need to split the data into a training set for building the model and a test set for evaluating the model. We can use the ``train_test_split`` function from the ``model_selection`` module for that:

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=0)
print(X_train.shape)
print(X_test.shape)
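
To see what ``train_test_split`` does with its default settings, here is a minimal sketch on hypothetical toy data (the ``*_toy`` names are not part of the lab and are chosen so they do not overwrite the split above). When no ``test_size`` is given, a quarter of the samples, rounded up, is held out for testing.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with 2 features each, one label per sample.
# Row i is [2*i, 2*i + 1] and its label is i.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

X_toy_train, X_toy_test, y_toy_train, y_toy_test = train_test_split(
    X_toy, y_toy, random_state=0)

# With no test_size given, 25% of the samples (rounded up) go to the test set.
print(len(X_toy_train), len(X_toy_test))

# Samples are shuffled, but each row of X stays paired with its label.
still_aligned = np.array_equal(X_toy_train[:, 0], 2 * y_toy_train)
print(still_aligned)
```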

## Building our first classifier
We will use the ``LogisticRegression`` model (which, despite its name, is a classifier). We will discuss the model in more depth later today.

The first step to using a model is importing it:

In [None]:
from sklearn.linear_model import LogisticRegression

Then, we instantiate the model. The ``lr`` object we create contains the logic for fitting the model and making predictions, and it will also store the model parameters learned from the data.

In [None]:
lr = LogisticRegression()

Now, we can fit the model on the training data:

In [None]:
lr.fit(X_train, y_train)
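
To make the idea of stored parameters concrete, here is a minimal sketch of what a fitted ``LogisticRegression`` holds. It is self-contained and uses the name ``demo_lr`` so that it does not overwrite ``lr``. The setting ``max_iter=5000`` is an assumption made here to give the solver room to converge on unscaled pixel values; it is not part of the lab.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()
X_tr, X_te, y_tr, y_te = train_test_split(
    digits.data, digits.target, random_state=0)

# max_iter=5000 is an assumption for this sketch, not the lab's setting.
demo_lr = LogisticRegression(max_iter=5000)
demo_lr.fit(X_tr, y_tr)

# Attributes ending in an underscore are learned during fit.
print(demo_lr.classes_)          # the class labels seen in training
print(demo_lr.coef_.shape)       # one weight per class and per pixel
print(demo_lr.intercept_.shape)  # one offset per class
```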

Next, we apply the model to the training data and compare its predictions with the true labels:

In [None]:
print(lr.predict(X_train))
print(y_train)

For classification, the ``score`` method computes the accuracy, which is the fraction of correctly classified examples.

In [None]:
lr.score(X_train, y_train)
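
The definition of accuracy given above can be sketched with plain NumPy. The labels below are hypothetical values made up for illustration, not output from the model.

```python
import numpy as np

# Hypothetical labels for eight samples, not output from the model above.
y_true_toy = np.array([0, 1, 2, 3, 4, 5, 6, 7])
y_pred_toy = np.array([0, 1, 2, 3, 4, 5, 9, 1])

# Comparing elementwise gives a boolean array; its mean is the accuracy.
accuracy = np.mean(y_pred_toy == y_true_toy)
print(accuracy)

# Positions where prediction and truth disagree.
wrong = np.flatnonzero(y_pred_toy != y_true_toy)
print(wrong)
```

Six of the eight hypothetical predictions match, so the accuracy is 0.75, and the two mismatches sit at positions 6 and 7.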

### Task
Compute the accuracy of the model on the test set. Then, use the ``predict`` method to get the test set predictions, and compute the accuracy yourself using numpy. It should yield the same result.

Next, find 10 examples from the test set that have been misclassified, and plot each image together with its true and predicted class. Do the mistakes made by the classifier make sense?

In [None]:
# solution here..

## Another classifier

### Task
Try out a different classifier, the ``KNeighborsClassifier`` from the ``sklearn.neighbors`` module. As before, train the model on the training set, then evaluate it on the test set.

In [None]:
# solution here ...
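
As a hint on how the classifier behaves, here is a minimal sketch on hypothetical one-dimensional toy data rather than the digits. It follows the same instantiate, ``fit``, ``predict``, ``score`` pattern as before. The value ``n_neighbors=3`` and all ``toy`` names are choices made for this illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: two well-separated groups on a line.
X_toy = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_knn = KNeighborsClassifier(n_neighbors=3)
toy_knn.fit(X_toy, y_toy)

# Each new point gets the majority class of its three nearest training points.
toy_pred = toy_knn.predict(np.array([[1.5], [10.5]]))
print(toy_pred)

# score works exactly as for LogisticRegression and returns the accuracy.
toy_score = toy_knn.score(X_toy, y_toy)
print(toy_score)
```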

# Regression
Next we will look at a regression problem. The dataset we are using is the "Boston housing dataset", where the task is to predict house prices in different Boston neighborhoods in the 1970s.
Let's start by loading the data:

In [None]:
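# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn version to run.
from sklearn.datasets import load_boston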
boston = load_boston()
print(boston.keys())

The ``boston.DESCR`` attribute provides a description of the dataset:

In [None]:
print(boston.DESCR)

### Task
Confirm the number of samples and features in the ``boston.data`` attribute. Then look at the target attribute ``boston.target`` and plot its distribution using ``plt.hist``.

In [None]:
# solution here ...

### Task

Proceed as we did for classification, and split the data into a training and a test set. Then fit a ``LinearRegression`` model from the ``sklearn.linear_model`` module on the training set, and evaluate it on the test set. The metric provided by the ``score`` method for regression is the $R^2$ score, with 1 being a perfect score and 0 being the score of a constant prediction equal to the mean of the targets.

Create a scatter plot of the predictions against the ground truth on the test set. This plot can often be a helpful analysis tool in regression.
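
The two reference values of the $R^2$ score mentioned in this task can be checked with a few lines of NumPy. This is a minimal sketch on hypothetical target values; the helper ``r2_manual`` is written here for illustration and uses the standard definition $R^2 = 1 - SS_{res} / SS_{tot}$.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical target values, not taken from the housing data.
y_demo = np.array([10.0, 20.0, 30.0, 40.0])

perfect = r2_manual(y_demo, y_demo)
constant = r2_manual(y_demo, np.full_like(y_demo, y_demo.mean()))
reversed_order = r2_manual(y_demo, y_demo[::-1])
print(perfect, constant, reversed_order)
```

A perfect prediction gives 1 and always predicting the mean gives 0. The third case shows that the score can become negative when a model does worse than the constant mean prediction.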