In [None]:
%matplotlib inline

import matplotlib.pyplot as plt

import numpy as np
import sklearn

# Machine Learning in Python with scikit-learn

scikit-learn (`sklearn`) provides a user-friendly but powerful way to do machine learning
in Python. It covers most standard machine learning tasks - training many kinds of
classifiers, predicting new results, obtaining cross-validated scores, extracting features -
with one consistent API, so you only have to learn one way of calling the tools.

Here, we give a very brief introduction to scikit-learn using a digit classification task.

In [None]:
# load the `digits` dataset
from sklearn.datasets import load_digits
data = load_digits()

In [None]:
print(data["DESCR"])

In [None]:
data.keys()

In machine learning, the outcomes (what we want to predict) are typically called `y` and the predictors (what we predict from) `X`.

In [None]:
X, y = data["data"], data["target"]
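
To make the `X`/`y` convention concrete, the following minimal sketch inspects the shapes of the two arrays. The variable name `classes` is introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_digits

data = load_digits()
X, y = data["data"], data["target"]

# X holds one row per image and one column per pixel
print(X.shape)  # (1797, 64)

# y holds one label per image
print(y.shape)  # (1797,)

# the labels are the ten digits 0..9
classes = np.unique(y)
print(classes)
```

Each row of `X` lines up with the entry of `y` at the same index, which is why the two arrays are always passed to scikit-learn together.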

Let's look at the predictors (with matplotlib):

In [None]:
plt.imshow(data["images"][0], cmap="Greys")

The same values as in `data["images"]` are also stored in `data["data"]`, and thus now in `X`,
albeit in a different shape (each 8x8 image is flattened into a 64-element vector):

In [None]:
plt.imshow(X[0].reshape((8, 8)), cmap="Greys")
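
As a quick check of this relationship, the sketch below flattens every 8x8 image and compares the result with `data["data"]`. The names `flattened` and `same_values` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_digits

data = load_digits()
images, X = data["images"], data["data"]

# images: (n_samples, 8, 8); X: (n_samples, 64)
print(images.shape, X.shape)

# flatten each 8x8 image row by row into a 64-element vector
flattened = images.reshape((len(images), -1))

# the flattened images contain exactly the same values as X
same_values = np.array_equal(flattened, X)
print(same_values)
```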

What are the corresponding outcomes?

In [None]:
y[0]

Some more ...

In [None]:
plt.imshow(X[1].reshape((8, 8)), cmap="Greys")

In [None]:
y[1]

Plotting a few ...

In [None]:
fig, axes = plt.subplots(ncols=10, nrows=3, figsize=(15, 5))

for ii, (ax, im, y_) in enumerate(zip(axes.flatten(), X, y)):
    ax.imshow(im.reshape((8, 8)), cmap="Greys")
    ax.set_title(y_)
    ax.axis("off")

We will train a simple linear classifier to predict $y$ from $X$, i.e., to read the digit from the pixel values.

In [None]:
from sklearn.linear_model import LogisticRegression

We create an instance of a Logistic Regression Classifier ...

In [None]:
est = LogisticRegression(dual=False)

... and fit it to the training set (i.e., we learn the patterns required to predict digits).

In [None]:
est.fit(X, y)

What has the classifier learned? We can visualise the learned patterns:

In [None]:
est.coef_.shape

In [None]:
fig, axes = plt.subplots(ncols=10, nrows=1, figsize=(15, 3))
for ax, pattern in zip(axes.flatten(), est.coef_):
    ax.imshow((pattern).reshape((8, 8)), cmap="RdBu_r", vmin=-1, vmax=1)
    ax.axis("off")
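
As a rough sketch of how these weight patterns are used: for each image, the linear classifier computes one score per digit from the weights in `coef_` and the offsets in `intercept_`, and predicts the digit with the highest score. The names `scores`, `manual_predictions`, and `agreement`, as well as the `max_iter=2000` setting, are assumptions made for this illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

data = load_digits()
X, y = data["data"], data["target"]

# max_iter is raised only to give the solver room to converge (assumption)
est = LogisticRegression(max_iter=2000).fit(X, y)

# one score per (image, digit): weighted sum of the pixels plus a class offset
scores = X @ est.coef_.T + est.intercept_

# the predicted digit is the class with the highest score
manual_predictions = est.classes_[np.argmax(scores, axis=1)]

# fraction of images where the manual rule matches est.predict
agreement = np.mean(manual_predictions == est.predict(X))
print(scores.shape, agreement)
```

This is why each row of `coef_` can be shown as an 8x8 image: pixels with a positive weight raise the score of that digit, and pixels with a negative weight lower it.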

The classifier has a `fit` and a `predict` method. We can show its predictions as the title of each image:

In [None]:
fig, axes = plt.subplots(ncols=10, nrows=3, figsize=(15, 5))

for ii, (ax, im) in enumerate(zip(axes.flatten(), X)):
    ax.imshow(im.reshape((8, 8)), cmap="Greys")
    predicted = est.predict(im[np.newaxis, :])[0]
    ax.set_title(predicted)
    ax.axis("off")

But this is testing on the training set, so the predictions look better than they would on unseen digits.
We need to separate testing and training! For that, we can use the `cross_val_predict` helper:

In [None]:
from sklearn.model_selection import cross_val_predict, cross_val_score

In [None]:
predictions = cross_val_predict(est, X, y)
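
To illustrate what the helper does, here is a minimal sketch that builds cross-validated predictions by hand: the data is split into folds, and each fold is predicted by a model trained only on the other folds. The explicit `StratifiedKFold(n_splits=5)` splitter, the `max_iter=2000` setting, and all variable names are assumptions chosen for this example.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

data = load_digits()
X, y = data["data"], data["target"]
est = LogisticRegression(max_iter=2000)

manual_predictions = np.empty_like(y)
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    # fit a fresh copy of the estimator on the training part only
    fold_est = clone(est).fit(X[train_idx], y[train_idx])
    # predict the held-out part, which this copy has never seen
    manual_predictions[test_idx] = fold_est.predict(X[test_idx])

# the helper with the same splitter should give matching predictions
helper_predictions = cross_val_predict(est, X, y, cv=StratifiedKFold(n_splits=5))
agreement = np.mean(manual_predictions == helper_predictions)
held_out_accuracy = np.mean(manual_predictions == y)
print(agreement, held_out_accuracy)
```

Every sample ends up with exactly one prediction, made by a model that was not trained on it.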

In [None]:
fig, axes = plt.subplots(ncols=10, nrows=3, figsize=(15, 5))

for ii, (ax, im, y_pred) in enumerate(zip(axes.flatten(), X, predictions)):
    ax.imshow(im.reshape((8, 8)), cmap="Greys")
    ax.set_title(y_pred)
    ax.axis("off")

And we can use the `cross_val_score` function to check the predictive accuracy:

In [None]:
cross_val_score(est, X, y)
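
The result is an array with one score per fold. The sketch below passes `cv=5` explicitly and summarises the folds by their mean and standard deviation. The names `scores`, `mean_score`, and `std_score`, and the `max_iter=2000` setting, are assumptions for this illustration. For classifiers, the default score is the accuracy.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

data = load_digits()
X, y = data["data"], data["target"]
est = LogisticRegression(max_iter=2000)

# one accuracy value per fold
scores = cross_val_score(est, X, y, cv=5)
print(scores)

# summarise the folds with their mean and spread
mean_score, std_score = scores.mean(), scores.std()
print(f"accuracy: {mean_score:.3f} +/- {std_score:.3f}")
```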

What if we want to try a more powerful classifier?

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
rf = RandomForestClassifier(n_estimators=1000)

... same API

In [None]:
cross_val_score(rf, X, y)

sklearn also provides feature preprocessing tools, again with a similar API.

In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

In [None]:
plt.hist(X.flatten())

In [None]:
x_scaled = MinMaxScaler((0, 1)).fit_transform(X)
plt.hist(X.flatten(), label="original")
plt.hist(x_scaled.flatten(), label="scaled")
plt.legend()
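
As a small sketch of what `MinMaxScaler((0, 1))` computes: each feature (column) is shifted and rescaled so that its minimum maps to 0 and its maximum to 1. The toy array and the names `toy`, `scaled`, and `manual` are made up for this illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# hypothetical toy data: 3 samples, 2 features on different scales
toy = np.array([[0.0, 10.0],
                [4.0, 20.0],
                [8.0, 40.0]])

scaled = MinMaxScaler((0, 1)).fit_transform(toy)

# the same rescaling written out by hand, column by column
col_min, col_max = toy.min(axis=0), toy.max(axis=0)
manual = (toy - col_min) / (col_max - col_min)

print(scaled)
print(np.allclose(scaled, manual))
```

The hand-written formula breaks down for features whose minimum equals their maximum, because it divides by zero. This is one reason to prefer the scaler over the manual version on real data.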

Now for something more interesting ... remember what the digits look like!

In [None]:
fig, axes = plt.subplots(ncols=10, nrows=3, figsize=(15, 5))

for ii, (ax, im, y_) in enumerate(zip(axes.flatten(), X, y)):
    ax.imshow(im.reshape((8, 8)), cmap="Greys")
    ax.set_title(y_)
    ax.axis("off")

We can use non-negative matrix factorization for unsupervised extraction of consistent patterns in the data.

In [None]:
from sklearn.decomposition import PCA, NMF

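# decompose the scaled images into 10 non-negative components
nmf = NMF(n_components=10)
# weights of the 10 components for every image
X2 = nmf.fit(x_scaled).transform(x_scaled)
# reconstruct approximate images from the weights and the components
X2 = nmf.inverse_transform(X2)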

In [None]:
nmf.components_.shape
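
To spell out the factorization: NMF approximates the data matrix as a product `W @ H`, where `H` (stored in `components_`) holds the patterns and `W` holds the non-negative weights of each pattern for each image. The sketch below checks this on the scaled digits. The names `W`, `H`, and `reconstruction`, and the `init`, `max_iter`, and `random_state` settings, are assumptions chosen for this example.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import NMF
from sklearn.preprocessing import MinMaxScaler

X = load_digits()["data"]
x_scaled = MinMaxScaler((0, 1)).fit_transform(X)

# settings below are illustrative choices, not the notebook's defaults
nmf = NMF(n_components=10, init="nndsvda", max_iter=500, random_state=0)

W = nmf.fit_transform(x_scaled)  # (n_samples, 10): weights per image
H = nmf.components_              # (10, 64): one pattern per component

# each image is approximated by a non-negative mix of the 10 patterns
reconstruction = W @ H
print(W.shape, H.shape, reconstruction.shape)

# inverse_transform performs the same matrix product
print(np.allclose(nmf.inverse_transform(W), reconstruction))
```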

In [None]:
fig, axes = plt.subplots(ncols=10, nrows=4, figsize=(15, 8))

for ii, (ax, im, y_) in enumerate(zip(axes.flatten()[:30], X2, y)):
    ax.imshow(im.reshape((8, 8)), cmap="Greys")
    ax.set_title(y_)
    ax.axis("off")

# show the 10 NMF components in the last row of the grid
for ax, comp in zip(axes.flatten()[30:], nmf.components_):
    ax.imshow(comp.reshape((8, 8)), cmap="Greys")
    ax.axis("off")

And we can chain preprocessing stages and various classifiers ...

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

est_pipeline = make_pipeline(MinMaxScaler((0, 1)), NMF(n_components=35), LinearSVC())

In [None]:
cross_val_score(est_pipeline, X, y)
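
A pipeline behaves like a single estimator: calling `fit` fits every stage in order on the training data only, and `predict` or `score` pushes new data through the fitted stages. The sketch below uses a held-out split to show this. The names `pipe` and `step_names`, the `train_test_split` settings, and `max_iter=500` for NMF are assumptions made for this illustration.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import NMF
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC

data = load_digits()
X, y = data["data"], data["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

pipe = make_pipeline(
    MinMaxScaler((0, 1)), NMF(n_components=35, max_iter=500), LinearSVC()
)

# make_pipeline names each step after its class, in lowercase
step_names = [name for name, _ in pipe.steps]
print(step_names)  # ['minmaxscaler', 'nmf', 'linearsvc']

# scaler, NMF, and classifier are all fitted on the training data only
pipe.fit(X_train, y_train)

# the test data passes through the already fitted stages
test_accuracy = pipe.score(X_test, y_test)
print(test_accuracy)
```

Because the preprocessing is fitted inside the pipeline, `cross_val_score` refits it on each training fold, so no information from the held-out fold leaks into the scaling or the factorization.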

... and with the same API, you can apply machine learning to many other kinds of data!