In [None]:
%matplotlib inline
%config InlineBackend.figure_format ='retina'

In [None]:
# A bit of setup
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
#%matplotlib widget
plt.rcParams['figure.figsize'] = (6.0, 6.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
#plt.rcParams['image.cmap'] = 'gray'

np.set_printoptions(precision=2)

<img src="images/scikit-learn.jpg" style="float:right; width: 200px; "/>

## Python for Scientific Programming

# Introduction to Scikit-learn

#### S. Caillou, EIT&AIC Master, 10-2019

# Plan

- Introduction
- Representation of data
- Supervised learning
    - Regression
    - Classification
- Unsupervised learning
    - Clustering
    - Dimensionality reduction
- Model selection
    - Cross-validation: evaluating estimator performance
    - Hyperparameters and Model Validation
    - Model evaluation
    - Model persistence

## Introduction

#### What is scikit-learn?

- Simple and efficient tools for data mining and data analysis
- Accessible to everybody, and reusable in various contexts
- Built on NumPy, SciPy, and matplotlib
- Open source, commercially usable - BSD license


#### What can we do with scikit-learn?

- Preprocessing: 
    - Extraction of features
    - Normalization
- Supervised Learning
     - Regression (linear and polynomial, Ridge, Lasso, ...)
     - Classification (SVM, nearest neighbors, random forest, ...)
- Unsupervised learning
     - Clustering (k-Means, spectral clustering, mean shift, ...)
     - Dimensionality reduction: PCA, selection of variables ...
- Evaluation and model selection
     - Evaluate and compare models
     - Optimization of parameters


- General-purpose library
- Better suited to some areas than to others:
    - Deep neural networks => PyTorch, TensorFlow, Keras ...
    - Massive, distributed data => Apache Spark MLlib

### About machine learning

- Automatic **extraction of knowledge** from data.
- **Construction of a model** representing this knowledge.
- **Predictions** on new data not yet observed from this model.

In general, a learning problem considers a set of n samples of data and then tries to predict properties of unknown data. 

If each sample is more than a single number and, for instance, a multi-dimensional entry (aka multivariate data), it is said to have several attributes or features.


Learning problems fall into a few categories:

- Supervised learning
- Unsupervised learning
- Reinforcement learning
- ...

#### Supervised learning

Supervised learning consists in learning the link between two datasets: 

- the observed data X 
- and an external variable y that we are trying to predict, usually called “target” or “labels”. Most often, y is a 1D array of length n_samples.

<img src="figures/supervised_workflow.svg" width="50%">

**Classification**: samples belong to two or more classes, and we want to learn from already labeled data how to predict the class of unlabeled data.

An example of a classification problem is handwritten digit recognition, in which the aim is to assign each input vector to one of a finite number of discrete categories. 

Another way to think of classification is as a discrete (as opposed to continuous) form of supervised learning where one has a limited number of categories and for each of the n samples provided, one is to try to label them with the correct category or class.


<img src="figures/classification.png" width="40%">

**Regression** : if the desired output consists of one or more continuous variables, then the task is called regression. 

An example of a regression problem would be the prediction of the length of a salmon as a function of its age and weight.


<img src="figures/regression2.png" width="40%">


#### Unsupervised learning

In unsupervised learning, the training data consists of a set of input vectors x without any corresponding target values. 

**Clustering** : Discover groups (or classes) of similar examples within the data


<img width="40%" src='figures/cluster.png' style="float:left"/>
<img width="40%" src='figures/cluster2.png' style="float:right"/>

**Dimensionality reduction** : Project the data from a high-dimensional space down to fewer dimensions for the purpose of visualization or computational efficiency.

#### Simplified ontology

<img src="figures/ml_taxonomy.png" width="80%">

## Representation of data

- The data are represented in scikit-learn by a 2D matrix of $\mathbb{R}^{m \times n}$ 
    - Rows = m samples
    - Columns = n features

$$\mathbf{X} = \begin{bmatrix}
    x_{1}^{(1)} & x_{2}^{(1)} & x_{3}^{(1)}  & \dots & x_{n}^{(1)} \\
    x_{1}^{(2)} & x_{2}^{(2)} & x_{3}^{(2)}  & \dots & x_{n}^{(2)} \\
    \vdots & \vdots & \vdots & \ddots & \vdots \\
    x_{1}^{(m)} & x_{2}^{(m)} & x_{3}^{(m)}  & \dots & x_{n}^{(m)}
\end{bmatrix}.$$

For supervised learning tasks we also have the labels or targets:

- Targets are represented by a 2D matrix  $\mathbb{R}^{m \times k}$ 
    - Rows = m labels
    - Columns = k label classes

$$\mathbf{y} = \begin{bmatrix}
    y_{1}^{(1)} & y_{2}^{(1)} & y_{3}^{(1)} & \dots & y_{k}^{(1)} \\
    y_{1}^{(2)} & y_{2}^{(2)} & y_{3}^{(2)} & \dots & y_{k}^{(2)} \\
    \vdots & \vdots & \vdots & \ddots & \vdots\\
    y_{1}^{(m)} & y_{2}^{(m)} & y_{3}^{(m)} & \dots & y_{k}^{(m)}
\end{bmatrix}.$$

In most cases we have k = 1:

- targets are represented by a column vector of $\mathbb{R}^{m}$ (in scikit-learn, usually a 1D array of length m)
    - Rows = m labels
    
$$\mathbf{y} = \begin{bmatrix}
    y^{(1)}  \\
    y^{(2)}  \\
    \vdots  \\
    y^{(m) }
\end{bmatrix}.$$
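
As a minimal sketch of these conventions (the toy numbers and the names `X_toy`, `y_toy` below are made up for illustration), a feature matrix is a 2D NumPy array of shape `(m, n)`, while the target is usually stored as a 1D array of length `m`:

```python
import numpy as np

# Toy feature matrix: m = 4 samples (rows), n = 3 features (columns)
X_toy = np.array([[5.1, 3.5, 1.4],
                  [4.9, 3.0, 1.4],
                  [6.2, 2.9, 4.3],
                  [5.9, 3.0, 5.1]])

# Toy targets: one label per sample, stored as a 1D array
y_toy = np.array([0, 0, 1, 2])

print(X_toy.shape)  # (4, 3)
print(y_toy.shape)  # (4,)

# A single feature must still be passed as a 2D column of shape (m, 1)
X_single = X_toy[:, 0].reshape(-1, 1)
print(X_single.shape)  # (4, 1)
```

The `reshape(-1, 1)` call does the same job as the `X[:, None]` and `x[:, np.newaxis]` idioms used later in this notebook.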

Scikit-learn provides [many datasets](http://scikit-learn.org/stable/datasets/#dataset-loading-utilities) to test machine learning algorithms:

- **Packaged Data**: ``sklearn.datasets.load_*``
- **Downloadable Data**: ``sklearn.datasets.fetch_*``
- **Generated Data**: ``sklearn.datasets.make_*``

In [None]:
from sklearn.datasets import load_iris
iris = load_iris()

<img src="images/iris_classes.jpg" width="10%" style="float:right"/>
- Variables in the Iris database:
     - sepal length in cm
     - sepal width in cm
     - petal length in cm
     - petal width in cm
- Target classes to predict:
     - Iris Setosa
     - Iris Versicolour
     - Iris Virginica
  

In [None]:
import pandas as pd
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
pd.plotting.scatter_matrix(iris_df, c=iris.target, figsize=(8, 8));

In [None]:
iris.keys()

In [None]:
print(iris.feature_names)

In [None]:
n_samples, n_features = iris.data.shape
print('Number of samples:', n_samples)
print('Number of features:', n_features)

print(iris.data[0]) # sepal length, sepal width, petal length and petal width of the first sample (first flower)

In [None]:
print(iris.target)

In [None]:
np.bincount(iris.target)


Using NumPy's `bincount` function (above), we can see that the classes are distributed uniformly in this dataset: there are 50 flowers from each species, where

- class 0: Iris-Setosa
- class 1: Iris-Versicolor
- class 2: Iris-Virginica


In [None]:
print(iris.target_names)

In [None]:
import pandas as pd

iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['species'] = pd.Series([iris.target_names[k] for k in iris.target])
#iris_df['species'] = iris.target
iris_df.head(3)

In [None]:
x_index = 3
colors = ['orange', 'blue', 'green']

for label, color in zip(range(len(iris.target_names)), colors):
    plt.hist(iris.data[iris.target==label, x_index], 
             label=iris.target_names[label],
             color=color)

plt.xlabel(iris.feature_names[x_index])
plt.legend(loc='upper right')
plt.show()

In [None]:
iris.data[iris.target==0].shape

In [None]:
x_index = 2
y_index = 0

colors = ['orange', 'blue', 'green']

for label, color in zip(range(len(iris.target_names)), colors):
    plt.scatter(iris.data[iris.target==label, x_index], 
                iris.data[iris.target==label, y_index],
                label=iris.target_names[label],
                c=color)

plt.xlabel(iris.feature_names[x_index])
plt.ylabel(iris.feature_names[y_index])
plt.legend(loc='upper left')
plt.show()

Instead of looking at the data one plot at a time, a common tool that analysts use is called the **scatterplot matrix**.

Scatterplot matrices show scatter plots between all features in the data set, as well as histograms to show the distribution of each feature.

In [None]:
import pandas as pd
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
pd.plotting.scatter_matrix(iris_df, c=iris.target, figsize=(10, 10));

## Supervised learning

https://scikit-learn.org/stable/supervised_learning.html#supervised-learning

### Regression task

#### Linear Regression

https://scikit-learn.org/stable/modules/linear_model.html

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
a = 0.5
b = 20.0
X_max = 100
# x from 0 to X_max
X = X_max * np.random.random(100)
# y = a*x + b with noise
Y = a*X + b + 3*np.random.normal(size=X.shape)

In [None]:
# create a linear regression model
clf = LinearRegression()

All supervised estimators in scikit-learn implement a fit(X, y) method to fit the model and a predict(X) method that, given unlabeled observations X, returns the predicted labels y.
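
The `fit`/`predict` interface described above can be sketched on a tiny noise-free dataset. The names `X_demo`, `y_demo` and `reg_demo` are illustrative only, and the data are chosen to lie exactly on the line y = 2x + 1 so the result is easy to check:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Noise-free toy data lying exactly on the line y = 2x + 1
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])  # shape (n_samples, n_features)
y_demo = 2.0 * X_demo.ravel() + 1.0

# fit() estimates the parameters from (X, y) and returns the estimator itself
reg_demo = LinearRegression().fit(X_demo, y_demo)

# predict() applies the fitted model to new, unlabeled observations
y_pred_demo = reg_demo.predict(np.array([[4.0], [5.0]]))
print(y_pred_demo)  # approximately [ 9. 11.]
```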

In [None]:
clf.fit(X[:, None], Y)

In [None]:
# predict y
X_new = np.linspace(0, X_max, 10)
Y_new = clf.predict(X_new[:, None])

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(15, 5), sharey=True)

axes[0].scatter(X, Y, edgecolors='black', alpha=0.8, label="training points")
axes[0].legend()
# plot the results
axes[1].scatter(X, Y, edgecolors='black', alpha=0.8, label="training points")
axes[1].plot(X_new, Y_new, 'o--', color='orange', alpha=0.8, label="model predictions")
axes[1].legend()

In [None]:
print('Weight coefficients: ', clf.coef_)
print('y-axis intercept: ', clf.intercept_)

#### Polynomial regression: extending linear models with basis functions

One common pattern within machine learning is to use linear models trained on nonlinear functions of the data. 

This approach maintains the generally fast performance of linear methods, while allowing them to fit a much wider range of data.

In [None]:
from sklearn.preprocessing import PolynomialFeatures
import numpy as np
X = np.arange(6).reshape(3, 2)
X

In [None]:
poly = PolynomialFeatures(degree=2)
poly.fit_transform(X)

The features of X have been transformed from $[x_{1},x_{2}]$ to $[1,x_{1},x_{2},x_{1}^{2},x_{1}x_{2},x_{2}^{2}]$, and can now be used within any linear model.
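
To make the column order concrete, here is a small sketch that rebuilds the degree-2 features by hand and compares them with the output of `PolynomialFeatures`. The names `X_demo`, `X_poly` and `X_manual` are illustrative:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_demo = np.arange(6).reshape(3, 2)  # rows: [0, 1], [2, 3], [4, 5]
X_poly = PolynomialFeatures(degree=2).fit_transform(X_demo)

# Build the same columns by hand: [1, x1, x2, x1^2, x1*x2, x2^2]
x1, x2 = X_demo[:, 0], X_demo[:, 1]
X_manual = np.column_stack([np.ones(len(X_demo)), x1, x2,
                            x1 ** 2, x1 * x2, x2 ** 2])

print(np.allclose(X_poly, X_manual))  # True
print(X_poly[1])  # row for [2, 3]: 1, 2, 3, 4, 6, 9
```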


This sort of preprocessing can be streamlined with the Pipeline tools. A single object representing a simple polynomial regression can be created and used as follows:

In [None]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
import numpy as np
model = Pipeline([('poly', PolynomialFeatures(degree=3)),
                  ('linear', LinearRegression(fit_intercept=False))])
# fit to an order-3 polynomial data
x = np.arange(5)
y = 3 - 2 * x + x ** 2 - x ** 3
model = model.fit(x[:, np.newaxis], y)
model.named_steps['linear'].coef_

In [None]:
from sklearn.pipeline import make_pipeline

lw = 2  # line width used for the fitted curves
X = np.linspace(-2.2,2.2,100)
Y = 1/24*X**4 - 1/4*X**2 + 1/8 + 0.02*np.random.normal(size=X.shape)
X = X[:, np.newaxis]

fig, axes = plt.subplots(1, 2, figsize=(15, 5), sharey=True)

axes[0].plot(X, Y, color='cornflowerblue', label="ground truth")
axes[0].scatter(X, Y, color='navy', s=30, marker='o', label="training points")
axes[0].legend(loc='upper left')
axes[1].plot(X, Y, color='cornflowerblue', label="ground truth")
axes[1].scatter(X, Y, color='navy', s=30, marker='o', label="training points")

for count, degree in enumerate([1, 2, 3, 4]):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()) 
    model.fit(X, Y)
    y_predict = model.predict(X)
    axes[1].plot(X, y_predict, linewidth=lw,
             label="degree %d" % degree)

axes[1].legend(loc='upper left')

plt.show()

In [None]:
from sklearn.pipeline import make_pipeline

lw = 2  # line width used for the fitted curves

def f(x):
    """ function to approximate by polynomial interpolation"""
    return x * np.sin(x)

# generate points used to plot
x_plot = np.linspace(0, 10, 100)

# generate points and keep a subset of them
x = np.linspace(0, 10, 200)
rng = np.random.RandomState(0)
rng.shuffle(x)
x = np.sort(x[:40])
y = f(x)

# create matrix versions of these arrays
X = x[:, np.newaxis]
X_plot = x_plot[:, np.newaxis]

fig, axes = plt.subplots(1, 2, figsize=(15, 5), sharey=True)
axes[0].plot(x_plot, f(x_plot), color='cornflowerblue', label="ground truth", )
axes[0].scatter(x, y, color='navy', s=30, marker='o', label="training points")
axes[0].legend()

axes[1].plot(x_plot, f(x_plot), color='cornflowerblue', label="ground truth")
axes[1].scatter(x, y, color='navy', s=30, marker='o', label="training points")
axes[1].legend()

for count, degree in enumerate([3, 4, 5]):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    y_plot = model.predict(X_plot)
    axes[1].plot(x_plot, y_plot, linewidth=lw,
             label="degree %d" % degree)

plt.legend(loc='lower left')

plt.show()

#### Nearest Neighbors regression

In [None]:
# Create some training data and test points
np.random.seed(0)
X = np.sort(5 * np.random.rand(40, 1), axis=0)
T = np.linspace(0, 5, 500)[:, np.newaxis]
y = np.sin(X).ravel()

# Add noise to targets
y[::5] += 1 * (0.5 - np.random.rand(8))

##### Radius Neighbors Regressor

Learning based on the neighbors within a fixed radius r of the query point, where r is a floating-point value specified by the user.
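
As a rough sketch of the idea (with uniform weights, which is assumed here to be the default), the prediction at a query point is the mean target of all training points within distance r. The toy data and the names `X_demo`, `y_demo`, `x_query` are made up for illustration:

```python
import numpy as np
from sklearn.neighbors import RadiusNeighborsRegressor

X_demo = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y_demo = np.array([0.0, 1.0, 4.0, 9.0, 16.0])
x_query = np.array([[2.0]])
r = 1.5

# By hand: average the targets of all training points within distance r
dist = np.abs(X_demo.ravel() - x_query[0, 0])
manual_pred = y_demo[dist <= r].mean()  # neighbors at x = 1, 2, 3

# Same computation through the scikit-learn estimator
sk_pred = RadiusNeighborsRegressor(radius=r).fit(X_demo, y_demo).predict(x_query)[0]
print(manual_pred, sk_pred)
```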

In [None]:
from sklearn.neighbors import RadiusNeighborsRegressor
radius = 1.0
knn = RadiusNeighborsRegressor(radius=radius)
# fit the model, then predict on the dense test grid T
y_ = knn.fit(X, y).predict(T)

plt.scatter(X, y, label='data')
plt.plot(T, y_, label='prediction', color='orange')
plt.legend()
plt.title(f"RadiusNeighborsRegressor radius={radius}")
plt.show()

##### K nearest neighbors regressor

Learning based on the k nearest neighbors of each query point, where k is an integer value specified by the user.
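
A minimal sketch of the same idea with k neighbors and uniform weights: sort the training points by distance to the query, keep the k closest, and average their targets. The toy data and names below are illustrative only:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X_demo = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y_demo = np.array([0.0, 1.0, 4.0, 9.0, 16.0])
x_query = np.array([[1.2]])
k = 2

# By hand: indices of the k training points closest to the query
dist = np.abs(X_demo.ravel() - x_query[0, 0])
nearest = np.argsort(dist)[:k]        # the points at x = 1 and x = 2
manual_pred = y_demo[nearest].mean()  # (1 + 4) / 2 = 2.5

# Same computation through the scikit-learn estimator
sk_pred = KNeighborsRegressor(n_neighbors=k).fit(X_demo, y_demo).predict(x_query)[0]
print(manual_pred, sk_pred)
```

With `weights='distance'`, closer neighbors would instead contribute more to the average.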

In [None]:
from sklearn import neighbors
# Fit regression model
n_neighbors = 5

fig, axes = plt.subplots(1, 2, figsize=(15, 5), sharey=True)
for i, weights in enumerate(['uniform', 'distance']):
    knn = neighbors.KNeighborsRegressor(n_neighbors, weights=weights)
    y_ = knn.fit(X, y).predict(T)

    axes[i].scatter(X, y, label='data')
    axes[i].plot(T, y_, label='prediction', color='orange')
    axes[i].legend()
    axes[i].set_title("KNeighborsRegressor (k = %i, weights = '%s')" % (n_neighbors, weights))

plt.tight_layout()
plt.show()


#### A lot more...

### Classification task

#### Create some data

In [None]:
from sklearn.datasets import make_blobs

# we create 400 points in two roughly separable blobs
X, y = make_blobs(n_samples=400, centers=2,
                      random_state=0, cluster_std=0.9)

print('X ~ n_samples x n_features:', X.shape)
print('y ~ n_samples:', y.shape)

print('\nFirst 5 samples:\n', X[:5, :])
print('\nFirst 5 labels:', y[:5])

**When doing classification in scikit-learn, y is a vector of integers or strings.**

#### Plot the data

In [None]:
ax = plt.axes()
ax.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Spectral ,edgecolors='black', alpha=0.6)
plt.show()

#### Split the data between train and test

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.25,
                                                    random_state=1234,
                                                    stratify=y)

#### Logistic regression

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
classifier = LogisticRegression()

In [None]:
classifier.fit(X_train, y_train)

In [None]:
from figures import plot_2d_separator

ax = plt.axes()
ax.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Spectral ,edgecolors='black', alpha=0.6)
plot_2d_separator(classifier, X)

plt.show()

In [None]:
print(classifier.coef_)
print(classifier.intercept_)

In [None]:
prediction = classifier.predict(X_test)

In [None]:
print(prediction)
print(y_test)

In [None]:
np.mean(prediction == y_test)

In [None]:
classifier.score(X_test, y_test)

In [None]:
classifier.score(X_train, y_train)

#### K Nearest Neighbors

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
knn = KNeighborsClassifier(n_neighbors=1)

In [None]:
knn.fit(X_train, y_train)

In [None]:
ax = plt.axes()
ax.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Spectral ,edgecolors='black', alpha=0.6)
plot_2d_separator(knn, X)

plt.show()

In [None]:
knn.score(X_test, y_test)

In [None]:
knn.score(X_train, y_train)

#### Support Vector Machines (SVM)

In [None]:
from sklearn.svm import SVC, NuSVC
from sklearn.linear_model import SGDClassifier

# Choose the SVM classifier and its parameters
clf = SVC(C=1,gamma='auto')
#clf = NuSVC()
#clf = SGDClassifier(loss="hinge", alpha=0.01, max_iter=200, fit_intercept=True)

# fit the model
clf.fit(X_train, y_train)

In [None]:
# plot the line, the points, and the nearest vectors to the plane
xx = np.linspace(-1, 5, 10)
yy = np.linspace(-1, 5, 10)

X1, X2 = np.meshgrid(xx, yy)
Z = np.empty(X1.shape)
for (i, j), val in np.ndenumerate(X1):
    x1 = val
    x2 = X2[i, j]
    p = clf.decision_function([[x1, x2]])
    Z[i, j] = p[0]
levels = [-1.0, 0.0, 1.0]
linestyles = ['dashed', 'solid', 'dashed']
colors = 'k'

ax = plt.axes()
ax.contour(X1, X2, Z, levels, colors=colors, linestyles=linestyles)
ax.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Spectral ,edgecolors='black', alpha=0.6)

plt.show()

In [None]:
clf.score(X_test, y_test)

In [None]:
clf.score(X_train, y_train)

https://towardsdatascience.com/support-vector-machine-vs-logistic-regression-94cc2975433f

#### Generating some (not easily linearly separable) data

We generate 2D synthetic data to be classified, preferably not easily linearly separable so that it's a bit more exciting and that more elaborate models appear relevant to the classification task.

In [None]:
np.random.seed(0)
N = 200 # number of points per class
D_in = 2   # dimensionality
D_out = 3   # number of classes
X = np.zeros((N*D_out,D_in))
y = np.zeros(N*D_out, dtype='uint8')
print(y.shape)
for j in range(D_out):
    ix = range(N*j,N*(j+1))
    r = np.linspace(0.0,1,N) # radius
    t = np.linspace(j*4,(j+1)*4,N) + np.random.randn(N)*0.2 # theta
    X[ix] = np.c_[r*np.sin(t), r*np.cos(t)]
    y[ix] = j
Y = np.atleast_2d(y).T.astype(float) # Vector made out of y, for convenience
#print (Y)
#plt.style.use('default')
fig = plt.figure()
plt.scatter(X[:, 0], X[:, 1], c=y, s=40, cmap=plt.cm.Spectral, alpha=0.6
            , edgecolors='black', linewidths=0.2)
plt.axis('equal') ; plt.xlim([-1.5,1.5]) ; plt.ylim([-1.5,1.5]) ;

#### Logistic regression

In [None]:
clf = LogisticRegression()

In [None]:
clf.fit(X, y)  # y is the 1D integer label array expected by scikit-learn

In [None]:
# plot the resulting classifier
h = 0.02
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))
X_test = np.array(list(zip(xx.reshape(-1), yy.reshape(-1))))

z = clf.predict(X_test)
Z = z.reshape(xx.shape)

fig = plt.figure()
plt.contourf(xx, yy, Z, cmap=plt.cm.Spectral, alpha=0.8, antialiased= True)
plt.scatter(X[:, 0], X[:, 1], c=y, s=40, cmap=plt.cm.Spectral, alpha=0.6
            , edgecolors='black', linewidths=0.1)
plt.axis('equal') ; plt.xlim([-1.5,1.5]) ; plt.ylim([-1.5,1.5]) ;

#### Training SVMs with different kernels

In [None]:
from sklearn.svm import SVC
# Define models
C = 100.0  # SVM regularization parameter
models = (SVC(kernel='linear', C=C),
          SVC(kernel='poly', degree=3, C=C),
          SVC(kernel='poly', degree=5, C=C),
          SVC(kernel='rbf', gamma='auto', C=C)) # N.B.: "RBF" = "Gaussian kernel" try gamma = 10.
models = [model.fit(X, y) for model in models]  # fit each model once and keep them in a list

# Define model titles
titles = ('SVC with linear kernel',
          'SVC with polynomial (degree 3) kernel',
          'SVC with polynomial (degree 5) kernel',
          'SVC with RBF kernel')

In [None]:
# Plot resulting classifiers
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
i=0
j=0
for clf, title in zip(models, titles):
    X0, X1 = X[:, 0], X[:, 1]
    h = 0.02
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    axes[i][j].contourf(xx, yy, Z, cmap=plt.cm.Spectral, alpha=0.8)
    axes[i][j].scatter(X[:, 0], X[:, 1], c=y, s=40, cmap=plt.cm.Spectral, alpha=0.6
            , edgecolors='black', linewidths=0.1)
    axes[i][j].axis("equal") ; axes[i][j].set_xlim(xx.min(), xx.max()) ; axes[i][j].set_ylim(yy.min(), yy.max())
    axes[i][j].set_title(title)
    if j==1:
        j=0
        i+=1
    else:
        j+=1
    
plt.show()

## Unsupervised learning

### Clustering

In [None]:
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples= 1000,centers=5, random_state=42, cluster_std=1)
X.shape

In [None]:
plt.scatter(X[:, 0], X[:, 1],  s=15, edgecolors='black', alpha=0.8)

In [None]:
from sklearn.cluster import KMeans

In [None]:
kmeans = KMeans(n_clusters=5, random_state=42)

In [None]:
labels = kmeans.fit_predict(X)
labels[:10]

In [None]:
np.all(y == labels)

In [None]:
plt.scatter(X[:, 0], X[:, 1], c=labels, s=15,  edgecolors='black', alpha=0.8);

In [None]:
plt.scatter(X[:, 0], X[:, 1], c=y,s=15 ,  edgecolors='black', alpha=0.8);

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score

print('Accuracy score:', accuracy_score(y, labels))
print(confusion_matrix(y, labels))

In [None]:
np.mean(y == labels)

In [None]:
from sklearn.metrics import adjusted_rand_score

adjusted_rand_score(y, labels)

In [None]:
kmeans.cluster_centers_

In [None]:
distortions = []
for i in range(1, 11):
    km = KMeans(n_clusters=i, 
                random_state=0)
    km.fit(X)
    distortions.append(km.inertia_)

plt.plot(range(1, 11), distortions, marker='o', alpha=0.8)
plt.xlabel('Number of clusters')
plt.ylabel('Distortion')
plt.show()

### Dimensionality reduction

Reducing the number of random variables to consider.

Applications: Visualization, Increased efficiency

#### PCA: principal component analysis

PCA is used to decompose a multivariate dataset into a set of successive orthogonal components that explain a maximum amount of the variance. 

In scikit-learn, PCA is implemented as a transformer object that learns n components in its fit method, and can be used on new data to project it on these components.
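
To illustrate what "components that explain a maximum amount of the variance" means, here is a sketch that computes the components with a plain NumPy SVD of the centered data and compares the result with `PCA`. This is one standard way to obtain the same quantities, not necessarily the exact solver scikit-learn uses internally. The names `X_demo`, `ratio_manual` and `X_proj_manual` are illustrative:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X_demo = load_iris().data

# Center the data, then take its singular value decomposition
X_centered = X_demo - X_demo.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Variance carried by each component, as a fraction of the total variance
ratio_manual = (S ** 2) / np.sum(S ** 2)

pca_demo = PCA(n_components=2).fit(X_demo)
print(ratio_manual[:2])
print(pca_demo.explained_variance_ratio_)

# Projection onto the first two components (defined up to the sign of each axis)
X_proj_manual = X_centered @ Vt[:2].T
print(X_proj_manual.shape)  # (150, 2)
```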

Below is an example on the iris dataset, which has 4 features, projected onto the 2 dimensions that explain the most variance:

In [None]:
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = datasets.load_iris()

X = iris.data
y = iris.target
target_names = iris.target_names
X.shape

In [None]:
pca = PCA(n_components=2)
X_r = pca.fit(X).transform(X)

# Percentage of variance explained by each component
print('explained variance ratio (first two components): %s'
      % str(pca.explained_variance_ratio_))

In [None]:
X_r.shape

In [None]:
plt.figure()
lw = 2
colors=['orange', 'blue', 'green']

for color, i, target_name in zip(colors, [0, 1, 2], target_names):
    plt.scatter(X_r[y == i, 0], X_r[y == i, 1], color=color, alpha=.8, lw=lw,
                label=target_name)
plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.title('PCA of IRIS dataset')

plt.show()

#### Gaussian random projection

In [None]:
import numpy as np
from sklearn import random_projection
X = np.random.rand(100, 10000)
transformer = random_projection.GaussianRandomProjection()
X_new = transformer.fit_transform(X)
X_new.shape

#### Sparse random projection

In [None]:
import numpy as np
from sklearn import random_projection
X = np.random.rand(100, 10000)
transformer = random_projection.SparseRandomProjection()
X_new = transformer.fit_transform(X)
X_new.shape

## Model selection

### Cross-validation: evaluating estimator performance

Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data. 

This situation is called **overfitting**. 

To avoid it, it is common practice when performing a (supervised) machine learning experiment to hold out part of the available data as a test set X_test, y_test. 


<img src="figures/train_test_split.svg" width="68%" style="float:left; margin-left:74px">

In [None]:
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm

iris = datasets.load_iris()
iris.data.shape, iris.target.shape

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

In [None]:
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)

In [None]:
clf.score(X_test, y_test)     

When evaluating different settings (“hyperparameters”) for estimators, such as the C setting that must be manually set for an SVM, there is still a risk of overfitting on the test set because the parameters can be tweaked until the estimator performs optimally. This way, knowledge about the test set can “leak” into the model and evaluation metrics no longer report on generalization performance. 

To solve this problem, yet another part of the dataset can be held out as a so-called “validation set”: training proceeds on the training set, after which evaluation is done on the validation set, and when the experiment seems to be successful, final evaluation can be done on the test set.
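
A three-way split of this kind can be sketched with two successive calls to `train_test_split`. The split sizes (30 test, 30 validation, 90 training samples) and the `_demo` variable names are arbitrary choices made for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_all, y_all = load_iris(return_X_y=True)  # 150 samples

# First hold out the final test set (30 samples) ...
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_all, y_all, test_size=30, random_state=0)

# ... then carve a validation set (30 samples) out of what remains
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_rest, y_rest, test_size=30, random_state=0)

print(len(X_train_demo), len(X_val_demo), len(X_test_demo))  # 90 30 30
```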

However, by partitioning the available data into three sets, we drastically reduce the number of samples which can be used for learning the model, and the results can depend on a particular random choice for the pair of (train, validation) sets.

A solution to this problem is a procedure called cross-validation (CV for short). 

A test set should still be held out for final evaluation, but the validation set is no longer needed when doing CV. In the basic approach, called k-fold CV, the training set is split into **k smaller sets** (other approaches are described below, but generally follow the same principles). The following procedure is followed for each of the k “folds”:

- A model is trained using k-1 of the folds as training data;
- the resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).


<img src="figures/cross_validation.svg" width="76%"  style="float:left">

In [None]:
from sklearn.model_selection import KFold

kf = KFold(n_splits=5)
i=0
for train, test in kf.split(iris.data, iris.target):
    print ("Fold number %i" % i)
    i+=1
    print("train : %s \ntest : %s" % (train, test))
    print("---------------------------------")

The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop. 

This approach can be computationally expensive, but does not waste too much data (as is the case when fixing an arbitrary validation set), which is a major advantage in problems such as inverse inference where the number of samples is very small.
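
The k-fold procedure described above can be sketched as an explicit loop: train on k-1 folds, score on the remaining one, then average. The fold count, the shuffling seed and the names `kf_demo`, `fold_scores` are illustrative choices:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

X_all, y_all = load_iris(return_X_y=True)
kf_demo = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, val_idx in kf_demo.split(X_all):
    # Train on k-1 folds ...
    model = SVC(kernel='linear', C=1).fit(X_all[train_idx], y_all[train_idx])
    # ... and validate on the remaining fold
    fold_scores.append(model.score(X_all[val_idx], y_all[val_idx]))

print("per-fold accuracy:", np.round(fold_scores, 3))
print("mean accuracy:", np.mean(fold_scores))

# cross_val_score runs an equivalent loop when given the same splitter
auto_scores = cross_val_score(SVC(kernel='linear', C=1), X_all, y_all, cv=kf_demo)
```

Because `kf_demo` uses a fixed integer `random_state`, it yields the same folds each time it is used, so the manual and automatic scores should agree.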

Here is a flowchart of typical cross validation workflow in model training. The best parameters can be determined by grid search techniques.

<img src="images/grid_search_workflow.png" width="50%">

<img src="figures/grid_search_cross_validation.svg" width="76%"  style="float:left">

#### Computing cross-validated metrics

The following example demonstrates how to estimate the accuracy of a linear kernel support vector machine on the iris dataset by splitting the data, fitting a model and computing the score 5 consecutive times (with different splits each time):

In [None]:
from sklearn.model_selection import cross_val_score
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
scores            




The mean score and an approximate 95% confidence interval of the score estimate (taken here as two standard deviations around the mean) are hence given by:

In [None]:
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

By default, the score computed at each CV iteration is the score method of the estimator. It is possible to change this by using the scoring parameter:

In [None]:
from sklearn import metrics
scores = cross_val_score(
    clf, iris.data, iris.target, cv=5, scoring='f1_macro')
scores       

When the cv argument is an integer, cross_val_score uses the KFold or StratifiedKFold strategies by default, the latter being used if the estimator derives from ClassifierMixin.
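
This default can be checked with a short sketch: for a classifier such as `SVC`, passing `cv=5` should give the same scores as passing an explicit, unshuffled `StratifiedKFold(n_splits=5)`. The names `clf_demo`, `scores_int` and `scores_skf` are illustrative:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X_all, y_all = load_iris(return_X_y=True)
clf_demo = SVC(kernel='linear', C=1)

# Integer cv: the splitting strategy is chosen automatically
scores_int = cross_val_score(clf_demo, X_all, y_all, cv=5)

# Explicit stratified splitter with the same number of folds
scores_skf = cross_val_score(clf_demo, X_all, y_all,
                             cv=StratifiedKFold(n_splits=5))

print(np.allclose(scores_int, scores_skf))  # True
```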

It is also possible to use other cross validation strategies by passing a cross validation iterator instead, for instance:

In [None]:
from sklearn.model_selection import ShuffleSplit
n_samples = iris.data.shape[0]
cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
cross_val_score(clf, iris.data, iris.target, cv=cv)

#### Cross validation iterators

In [None]:
from sklearn.model_selection import KFold, StratifiedKFold, ShuffleSplit

In [None]:
def plot_cv(cv, features, labels):
    masks = []
    for train, test in cv.split(features, labels):
        mask = np.zeros(len(labels), dtype=bool)
        mask[test] = 1
        masks.append(mask)
    
    plt.matshow(masks, cmap='gray_r')

In [None]:
plot_cv(KFold(n_splits=5), iris.data, iris.target)

In [None]:
plot_cv(StratifiedKFold(n_splits=5), iris.data, iris.target)

In [None]:
plot_cv(ShuffleSplit(n_splits=5, test_size=.2), iris.data, iris.target)

##### A lot more...

### Tuning the hyper-parameters of an estimator

<img src="figures/plot_kneigbors_regularization.png" width="80%">

In [None]:
from sklearn.model_selection import cross_val_score, KFold
from sklearn.neighbors import KNeighborsRegressor
# generate toy dataset:
x = np.linspace(-3, 3, 100)
rng = np.random.RandomState(42)
y = np.sin(4 * x) + x + rng.normal(size=len(x))
X = x[:, np.newaxis]

cv = KFold(shuffle=True)

# for each parameter setting do cross-validation:
for n_neighbors in [1, 3, 5, 10, 20]:
    scores = cross_val_score(KNeighborsRegressor(n_neighbors=n_neighbors), X, y, cv=cv)
    print("n_neighbors: %d, average score: %f" % (n_neighbors, np.mean(scores)))

In [None]:
from sklearn.model_selection import validation_curve
n_neighbors = [1, 3, 5, 10, 20, 50]
train_scores, test_scores = validation_curve(KNeighborsRegressor(), X, y, param_name="n_neighbors",
                                             param_range=n_neighbors, cv=cv)
plt.plot(n_neighbors, train_scores.mean(axis=1), 'b', label="train score")
plt.plot(n_neighbors, test_scores.mean(axis=1), 'g', label="test score")
plt.ylabel('Score (R^2)')
plt.xlabel('Number of neighbors')
plt.xlim([50, 0])
plt.legend(loc="best");

In [None]:
from sklearn.model_selection import cross_val_score, KFold
from sklearn.svm import SVR

# for each parameter setting do cross-validation:
for C in [0.001, 0.01, 0.1, 1, 10]:
    for gamma in [0.001, 0.01, 0.1, 1]:
        scores = cross_val_score(SVR(C=C, gamma=gamma), X, y, cv=cv)
        print("C: %f, gamma: %f, average score: %f" % (C, gamma, np.mean(scores)))

In [None]:
from sklearn.model_selection import GridSearchCV
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10], 'gamma': [0.001, 0.01, 0.1, 1]}

grid = GridSearchCV(SVR(), param_grid=param_grid, cv=cv)

In [None]:
grid.fit(X, y)

In [None]:
grid.predict(X)

In [None]:
print(grid.best_score_)

In [None]:
print(grid.best_params_)

In [None]:
print(grid.cv_results_.keys())

In [None]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd

cv_results = pd.DataFrame(grid.cv_results_)
cv_results.head(3)

In [None]:
cv_results_tiny = cv_results[['param_C', 'param_gamma', 'mean_test_score']]
cv_results_tiny.sort_values(by='mean_test_score', ascending=False).head(3)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10], 'gamma': [0.001, 0.01, 0.1, 1]}
cv = KFold(n_splits=10, shuffle=True)

grid = GridSearchCV(SVR(), param_grid=param_grid, cv=cv)

grid.fit(X_train, y_train)
grid.score(X_test, y_test)

### Model evaluation

#### Model score

In [None]:
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

digits = load_digits()

In [None]:
print(digits.data.shape)

import matplotlib.pyplot as plt 
plt.gray() 
plt.matshow(digits.images[0]) 
plt.show() 

In [None]:
X, y = digits.data, digits.target
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    random_state=1,
                                                    stratify=y,
                                                    test_size=0.25)

classifier = LinearSVC(random_state=1).fit(X_train, y_train)
y_test_pred = classifier.predict(X_test)

print("Accuracy: {}".format(classifier.score(X_test, y_test)))

#### Confusion matrix

In [None]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_test_pred)

In [None]:
plt.matshow(confusion_matrix(y_test, y_test_pred), cmap="Blues")
plt.colorbar(shrink=0.8)
plt.xticks(range(10))
plt.yticks(range(10))
plt.xlabel("Predicted label")
plt.ylabel("True label");

#### Model report

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_test_pred))
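
The per-class figures in the report can be recomputed from a confusion matrix. The sketch below assumes scikit-learn's layout (rows are true labels, columns are predicted labels) and uses a small hypothetical 3-class matrix rather than the digits results:

```python
import numpy as np

# Hypothetical confusion matrix: rows = true labels, columns = predicted labels
cm = np.array([[5, 1, 0],
               [2, 6, 2],
               [0, 1, 8]])

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)  # column sums: samples predicted as each class
recall = true_positives / cm.sum(axis=1)     # row sums: samples truly in each class
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_positives.sum() / cm.sum()

print("precision:", precision)
print("recall:   ", recall)
print("f1-score: ", f1)
print("accuracy: ", accuracy)
```

Replacing `cm` with `confusion_matrix(y_test, y_test_pred)` should reproduce the precision, recall and f1-score columns of the report.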

#### ROC curve

In [None]:
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.svm import SVC

# roc_curve needs a binary target: here "is the digit a 3?" (the choice of 3 is arbitrary)
y_binary = (y == 3)
X_train, X_test, y_train, y_test = train_test_split(X, y_binary, random_state=42)

for gamma in [.01, .05, 1]:
    svm = SVC(gamma=gamma).fit(X_train, y_train)
    decision_function = svm.decision_function(X_test)
    fpr, tpr, _ = roc_curve(y_test, decision_function)
    acc = svm.score(X_test, y_test)
    auc = roc_auc_score(y_test, decision_function)
    plt.plot(fpr, tpr, label="acc:%.2f auc:%.2f" % (acc, auc), linewidth=3)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate (recall)")
plt.legend(loc="best");
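
Each point of a ROC curve corresponds to one decision threshold applied to the scores. As a minimal sketch, on four hypothetical samples (labels and scores are made up for illustration), the points can be computed by hand with NumPy:

```python
import numpy as np

y_true = np.array([0, 0, 1, 1])            # hypothetical binary labels
scores = np.array([0.1, 0.4, 0.35, 0.8])   # hypothetical decision scores

def roc_point(threshold):
    """Return (false positive rate, true positive rate) at one threshold."""
    predicted_positive = scores >= threshold
    tpr = np.sum(predicted_positive & (y_true == 1)) / np.sum(y_true == 1)
    fpr = np.sum(predicted_positive & (y_true == 0)) / np.sum(y_true == 0)
    return fpr, tpr

# Sweep the threshold from the highest score down to the lowest
points = [roc_point(t) for t in sorted(scores, reverse=True)]
for fpr, tpr in points:
    print("FPR: %.2f  TPR: %.2f" % (fpr, tpr))
```

Lowering the threshold can only add predicted positives, so both rates grow monotonically until the curve reaches (1, 1).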

### Model persistence

#### Using Pickle

In [None]:
from sklearn import svm
from sklearn import datasets
clf = svm.SVC(gamma='scale')
iris = datasets.load_iris()
X, y = iris.data, iris.target
clf.fit(X, y)  

import pickle
s = pickle.dumps(clf)
clf2 = pickle.loads(s)
clf2.predict(X[0:1])

y[0]

#### Using Joblib

For scikit-learn estimators, joblib's replacement for pickle (`dump` and `load`) may be a better choice. It is more efficient on objects that carry large NumPy arrays internally, as fitted scikit-learn estimators often do. The trade-off is that it can only pickle to disk, not to a string:

In [None]:
import os
from joblib import dump, load
os.makedirs('model', exist_ok=True)  # dump() does not create missing directories
dump(clf, 'model/filename.joblib')



Later you can load back the pickled model (possibly in another Python process) with:

In [None]:
clf = load('model/filename.joblib') 
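
As a quick sanity check, a dump/load round trip should give back an estimator with identical predictions. The sketch below is self-contained and writes to a temporary directory; the file name `model.joblib` is only an illustrative choice:

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn import datasets, svm

X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC(gamma='scale').fit(X, y)

# Write the fitted model to a temporary directory and read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model.joblib')  # hypothetical file name
    dump(clf, path)
    restored = load(path)

same_predictions = np.array_equal(clf.predict(X), restored.predict(X))
print(same_predictions)
```

As with pickle, a saved model should only be loaded from a trusted source and, ideally, with the same scikit-learn version that produced it.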

## Practice!


<img src="images/scikitlearn_algorithm.png" width="80%">

<div class="alert alert-success">

<b>EXERCISE: Predict survival on the Titanic</b>
</div>

<div class="alert alert-success">

<b>Load and explore</b>
<ul>
  <li>Load the titanic data</li>
  <li>Explore the data with Pandas</li>
  <li>Plot histograms of the data</li>
  <li>Look for correlations with corr() and a scatter matrix</li>
</ul>
</div>

<div class="alert alert-success">

<b>Prepare the data</b>
<ul>
  <li>Create the labels from the "Survived" data column</li>
  <li>Drop useless columns with drop(...)</li>
  <li>Encode the "Sex" column as binary with get_dummies(...)</li>
  <li>Deal with missing values with dropna() from pandas or SimpleImputer from scikit-learn</li>
  <li>Scale the data with Pandas or Scikit-learn methods</li>
  <li>Split the data into train and test sets</li>
</ul>
</div>
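
The preparation steps above can be sketched on a tiny synthetic table. This is not the Titanic file: the column names and values below are assumptions made up for illustration, and the real data will need its own choices (which columns to drop, which imputation strategy to use):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Titanic table (hypothetical columns and values)
df = pd.DataFrame({
    'Survived': [0, 1, 1, 0, 1, 0, 0, 1],
    'Sex': ['male', 'female', 'female', 'male', 'male', 'female', 'male', 'female'],
    'Age': [22.0, 38.0, np.nan, 35.0, 27.0, np.nan, 54.0, 14.0],
    'Fare': [7.25, 71.28, 7.92, 8.05, 11.13, 8.46, 51.86, 30.07],
})

# Labels from the "Survived" column, features from the rest
labels = df['Survived']
features = pd.get_dummies(df.drop(columns=['Survived']),
                          columns=['Sex'], drop_first=True, dtype=float)

# Split first, so the imputer and scaler are fitted on training data only
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=0, stratify=labels)

imputer = SimpleImputer(strategy='median')  # fill missing ages with the median
scaler = StandardScaler()                   # zero mean, unit variance per column
X_train_prepared = scaler.fit_transform(imputer.fit_transform(X_train))
X_test_prepared = scaler.transform(imputer.transform(X_test))

print(X_train_prepared.shape, X_test_prepared.shape)
```

Splitting before imputing and scaling keeps statistics of the test set from leaking into the preprocessing.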

<div class="alert alert-success">

<b>Select and train a model</b>
<ul>
  <li>Test with a simple model</li>
  <li>Use scikit-learn tools to search for the best model and hyperparameters</li>
  <li>Evaluate the model on the test data set</li>
  <li>Plot some results of this evaluation</li>
</ul>
</div>