# The Python ecosystem - The scikit-learn library

Scikit-learn is probably the most widely used machine learning library in Python. It is built on NumPy, SciPy, and matplotlib. The library offers simple and efficient tools for data mining and data analysis. Scikit-learn provides a consistent API across different models and applications, which makes it one of the best tools in Python for shallow learning algorithms.
 

## Modeling pipeline in scikit-learn

* Loading The Data
* Training And Test Data
* Preprocessing The Data
* Create the Model
* Model Fitting
* Prediction
* Evaluate the Model's Performance
* Tune the Model

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

**Add the `src` directory as one where we can import modules**

In [None]:
import os
import sys

# add the 'src' directory as one where we can import modules
src_dir = os.path.abspath(os.path.join(os.getcwd(), 'src'))
sys.path.append(src_dir)
print(src_dir)

In [None]:
import helper_funcs as hf

## Loading The Data

Your data needs to be numeric and stored as NumPy arrays or SciPy sparse matrices. Other types that are convertible to numeric arrays, such as Pandas `DataFrame`, are also acceptable.

**Data for a Machine Learning model may look like this...**

In [None]:
import numpy as np
X = np.random.random((10,5))
X[X < 0.3] = 0
print("Features array: {}\n".format(X.shape), X)
y = np.array(['M','M','F','F','F','M','M','F','F','F'])
print("Labels: {}\n".format(y.shape),y)

In addition, scikit-learn provides rich [dataset loading utilities](http://scikit-learn.org/stable/datasets/index.html). It comes with easy-to-load toy datasets, sample images and sample generators.

**Toy datasets:**

Boston house-prices dataset (regression; removed in scikit-learn 1.2)

    load_boston() 	
    
Iris dataset (classification)

    load_iris()

Diabetes dataset (regression)

    load_diabetes() 

Digits dataset (classification)

    load_digits()

Linnerud dataset (multivariate regression) 

    load_linnerud()
 
Wine dataset (classification)
  
    load_wine() 	

Breast cancer Wisconsin dataset (classification)
  
    load_breast_cancer() 	
    

In [None]:
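# load_boston was removed in scikit-learn 1.2, so the diabetes
# data set is used here as the regression example instead
from sklearn.datasets import load_diabetes
ds = load_diabetes()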

In [None]:
print(ds.DESCR)

In [None]:
print("Features array: {}\n".format(ds.data.shape), ds.data)
# Print the targets of the loaded data set (not the `y` array defined earlier)
print("Labels: {}\n".format(ds.target.shape), ds.target)

**Sample generators:**

In [None]:
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=25, n_features=2, n_redundant=0, 
                           n_informative=2, n_clusters_per_class=1, 
                           random_state=33)

In [None]:
print("Features array: {}\n".format(X.shape), X)
print("Labels: {}\n".format(y.shape),y)

In [None]:
fig, ax = plt.subplots()
ax.scatter(X[:, 0], X[:, 1], marker='o', c=y, s=25, edgecolor='k');

## Training And Test Data

In [None]:
X, y = make_classification(n_samples=1000, n_features=2, n_redundant=0, 
                           n_informative=2, n_clusters_per_class=1, 
                           weights=[0.096], 
                           random_state=33)
fig, ax = plt.subplots()
ax.scatter(X[:, 0], X[:, 1], marker='o', c=y, s=25, edgecolor='k');

In [None]:
np.bincount(y)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4,  
                                                    shuffle=True, random_state=42)

In [None]:
print("Train set size:" , X_train.shape, "\nTest set size:", X_test.shape)

In [None]:
print("Train label size:" , y_train.shape, "\nTest label size:", y_test.shape)

In [None]:
pd.Series(y_train).value_counts()

In [None]:
pd.Series(y_test).value_counts()

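Because the classes are imbalanced, a purely random split does not guarantee that both sets contain the classes in the same proportions. The `stratify` parameter of `train_test_split` makes a split so that the proportion of values in each resulting sample is the same as the proportion of values in the array passed to `stratify`.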

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    test_size=0.4, random_state=42)

print(pd.Series(y_train).value_counts())
print(pd.Series(y_test).value_counts())

## Preprocessing The Data

In general, __learning algorithms benefit from standardization of the data set__. The `sklearn.preprocessing` package provides several utility functions and transformer classes to change raw feature vectors.


In [None]:
from sklearn import preprocessing

In [None]:
X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

In [None]:
scaler = preprocessing.StandardScaler()
scaler

In [None]:
scaler.fit(X_train)

In [None]:
X_train_scaled = scaler.transform(X_train) 
X_train_scaled

In [None]:
print("Mean: ",np.mean(X_train_scaled, axis=0))
print("\nStandard deviation: ", np.std(X_train_scaled, axis=0))
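
In practice the scaler is fitted on the training data only and then reused, unchanged, on new data. The following is a minimal sketch of that idea; the array `X_new` is a hypothetical stand-in for held-out samples and is not part of the original example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_fit = np.array([[1., -1., 2.],
                  [2., 0., 0.],
                  [0., 1., -1.]])
# Hypothetical held-out samples (assumption, for illustration only)
X_new = np.array([[1., 0., 1.],
                  [3., 2., -2.]])

scaler_demo = StandardScaler().fit(X_fit)        # learns mean_ and scale_ from X_fit only
X_new_scaled = scaler_demo.transform(X_new)      # applies the stored statistics

# Equivalent manual computation with the stored statistics
manual = (X_new - scaler_demo.mean_) / scaler_demo.scale_
print(np.allclose(X_new_scaled, manual))
```

Note that the scaled new data will in general not have zero mean and unit variance, because the statistics come from the data the scaler was fitted on.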

An alternative standardization is __scaling features to lie between a given minimum and maximum value, often between zero and one, or so that the maximum absolute value of each feature is scaled to unit size__.

In [None]:
X_train 

In [None]:
min_max_scaler = preprocessing.MinMaxScaler()

In [None]:
min_max_scaler.fit(X_train)

In [None]:
X_train_scaled = min_max_scaler.transform(X_train) 
X_train_scaled

In [None]:
print("Min: ", np.min(X_train_scaled, axis=0))
print("\nMax: ", np.max(X_train_scaled, axis=0))
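
The second variant mentioned above, scaling by the maximum absolute value, could be sketched with `MaxAbsScaler`. The array name `X_demo` is chosen here only to avoid overwriting the variables used in the cells above.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

X_demo = np.array([[1., -1., 2.],
                   [2., 0., 0.],
                   [0., 1., -1.]])

max_abs_scaler = MaxAbsScaler()
X_demo_scaled = max_abs_scaler.fit_transform(X_demo)

# Each column is divided by its largest absolute value, so signs and zeros are preserved
print(X_demo_scaled)
print("Max abs per feature:", np.max(np.abs(X_demo_scaled), axis=0))
```

Because the data is only divided and never shifted, this scaler keeps zero entries at zero, which is why it is often preferred for sparse data.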

###  Encoding categorical features

Often features are not given as continuous values but as categories. For example, a person could have features `"male"`, `"female"` or `"high"`, `"low"`, `"medium"`. Such features can be efficiently coded as integers, such as `0`, `1` (_nominal_) or `2`, `0`, `1` (_ordinal_).

In [None]:
X_train = np.array([['Male', 1, 0.76], ['Female', 3, 0.22], ['Female', 2, 0.57]])
X_train

In [None]:
X_train[:,0]

#### Label encoding

In [None]:
le = preprocessing.LabelEncoder()
le.fit(X_train[:,0])

In [None]:
list(le.classes_)

In [None]:
le.transform(X_train[:,0]) 

In [None]:
X_encoded = np.copy(X_train)
X_encoded[:,0] = le.transform(X_train[:,0])
X_encoded = X_encoded.astype(float)
X_encoded

In [None]:
le.inverse_transform(X_encoded[:,0].astype(int))

#### One-Hot encoding

In [None]:
X_encoded[:,0].reshape(-1,1)

In [None]:
ohe = preprocessing.OneHotEncoder()
ohe.fit(X_encoded[:,0].reshape(-1,1))

In [None]:
X_train[:,0]

In [None]:
ohe.transform(X_encoded[:,0].reshape(-1,1)).toarray()


_Note that since scikit-learn version 0.20, `OneHotEncoder` accepts string categories directly and `OrdinalEncoder` is available for ordinal encoding, so categorical features can be converted by __one-hot encoding__ or __ordinal encoding__ in one step, without a prior `LabelEncoder` pass_.
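
In scikit-learn 0.20 and later, the one-step conversion is available through `OneHotEncoder` and `OrdinalEncoder`, which accept string categories directly. A minimal sketch, assuming two hypothetical categorical columns (`gender` and `level`) that are not part of the data used above:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical categorical columns: a nominal one and an ordinal one
gender = np.array([['Male'], ['Female'], ['Female']])
level = np.array([['low'], ['high'], ['medium']])

# Nominal feature: one binary column per category (categories are sorted alphabetically)
onehot = OneHotEncoder()
gender_onehot = onehot.fit_transform(gender).toarray()
print(onehot.categories_)
print(gender_onehot)

# Ordinal feature: pass the category order explicitly so that low < medium < high
ordinal = OrdinalEncoder(categories=[['low', 'medium', 'high']])
level_ordinal = ordinal.fit_transform(level)
print(level_ordinal)
```

Passing `categories` explicitly matters for ordinal data: without it the encoder would order the strings alphabetically (`high`, `low`, `medium`), which does not reflect their meaning.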

In [None]:
import sklearn
sklearn.__version__

### Imputation of missing values

Many real world datasets contain missing values, often encoded as blanks, `NaN`s or other placeholders. A basic strategy to use incomplete datasets is to discard entire rows and/or columns containing missing values. Another strategy is to impute the missing values, i.e., to infer them from the known part of the data.

In scikit-learn the `SimpleImputer` class (which replaced `preprocessing.Imputer` in version 0.20) provides basic strategies for imputing missing values, using either __the mean__, __the median__ or __the most frequent value__ of the column in which the missing values are located. 

In [None]:
X_train = np.array([[1, 2], [np.nan, 3], [7, 6]])
X_train

In [None]:
from sklearn.impute import SimpleImputer

# Replace each NaN with the mean of its column
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit(X_train)

In [None]:
np.nanmean(X_train, axis=0)

In [None]:
imp.transform(X_train)

## Create the Model

Scikit-learn ships with many different supervised and unsupervised models as well as some basic neural network models; among those are:

* ### Linear models
  * Ordinary Least Squares
  * Polynomial regression
  * Ridge Regression
  * Lasso
  * Elastic Net
  * Logistic regression

* ### Support Vector Machines

* ### Nearest Neighbors

* ### Gaussian Processes

* ### Decision Trees

* ### Ensemble methods
  * Random Forest
  * Gradient Boosting

Visit the [scikit-learn documentation website](http://scikit-learn.org/stable/index.html) for a comprehensive list of available models and techniques. 

### Loading a toy data set

For the sake of this tutorial we load the _wine dataset_ provided by scikit-learn.

In [None]:
from sklearn.datasets import load_wine

In [None]:
ds = load_wine()
print(ds.DESCR)

In [None]:
ds.data

In [None]:
ds.target

### Subsetting the data set to two classes

In [None]:
y = ds.target[ds.target < 2]
y.shape

In [None]:
X = ds.data[ds.target < 2]
X.shape

### Train-test split

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

In [None]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

### Creation of a model instance 

For the purpose of this tutorial we introduce [**Logistic regression**](https://en.wikipedia.org/wiki/Logistic_regression), also known as **logit regression**, or **logit model**, a probabilistic linear model for dichotomous data. The response variable is a binary variable (nominal variable), which means the variable has two categories or two values; Class 1 vs. Class 2, True vs. False, or $1$ vs. $0$, or success vs. failure, with the probabilities of $\pi$ and $1-\pi$, respectively. 

The output of a logistic regression is a probability $(\pi)$, thus a value between $0$ and $1$. A naive first attempt would model this probability directly as a linear function of the known covariates $x_i$, which is just another word for the features in our data set:
$$\pi =\beta_0+ \beta_1x_1+ \beta_2x_2+ ... +\beta_kx_k$$

However, the right-hand side of the equation can take any real value, whereas the left-hand side is a probability on the scale $0$ to $1$. To reconcile the two scales we apply a so-called **link function**, which maps the probability onto the real number line so that it can be equated with the linear term.

For the logistic regression model this link function is the [**logit function**](https://en.wikipedia.org/wiki/Logit). The logit function maps probabilities from the range $(0, 1)$ to the entire real number range $(-\infty, \infty)$. It is written as 

$$\eta = logit(\pi)\text{,}$$

where $\pi$ is the probability. 

To understand the logit we first introduce the **odds** (not to be confused with the [**odds ratio**](https://en.wikipedia.org/wiki/Odds_ratio), which compares the odds of two groups). The odds (o) can be written as 

$$o = \frac{\pi}{1-\pi}\text{,}$$

where $\pi$ is the probability that an event occurs. If the probability of an event is $0.5$, the odds are one-to-one or even $\left(\frac{0.5}{1-0.5}=1\right)$. We further define the **logit** or **log-odds**, which is the logarithm of the odds:
$$\eta = logit(\pi)= log \left( \frac{\pi}{1-\pi}\right)$$

This logarithmic function has the effect of removing the floor restriction, thus the function, the [**logit function**](https://en.wikipedia.org/wiki/Logit), our link function, transforms values in the range $0$ to $1$ to values over the entire real number range $(-\infty, \infty)$. If the probability is $1/2$ the odds are even and the logit is zero. Negative logits represent probabilities below one half and positive logits correspond to probabilities above one half.
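
A few numbers may make the relationship between probability, odds and log-odds more tangible. The helper names `odds` and `log_odds` below are chosen for this sketch and are not part of any library.

```python
import numpy as np

def odds(p):
    """Odds of an event that occurs with probability p."""
    return p / (1 - p)

def log_odds(p):
    """Logit (log-odds) of an event that occurs with probability p."""
    return np.log(odds(p))

p = np.array([0.1, 0.25, 0.5, 0.75, 0.9])
for p_i, o_i, eta_i in zip(p, odds(p), log_odds(p)):
    print("p = {:.2f}  odds = {:.3f}  logit = {:+.3f}".format(p_i, o_i, eta_i))

# The logit is symmetric around p = 0.5: logit(1 - p) = -logit(p)
print(np.allclose(log_odds(1 - p), -log_odds(p)))
```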

The inverse of the logit function is called the [logistic function](https://en.wikipedia.org/wiki/Logistic_function), often simply referred to as the [**sigmoid function**](https://en.wikipedia.org/wiki/Sigmoid_function) due to its characteristic S-shape. It allows us to go back from logits to probabilities.

$$\pi =logit^{-1}(\eta)= \frac{e^{\eta}}{1+e^{\eta}}=\frac{1}{1+e^{-\eta}}=\frac{1}{1+e^{-(\beta_0+ \beta_1x_1+ \beta_2x_2+ ... +\beta_kx_k)}}$$

The logistic function for the interval $[-6,6]$ is shown below. For values of $\eta$ in the range from $-\infty$ to $\infty$, $\pi$ lies in the range from $0$ to $1$.


In [None]:
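def sigmoid(interval=(-6, 6)):
    """Return x values on the interval and the logistic function evaluated at them."""
    x = np.linspace(interval[0], interval[1])
    y = 1 / (1 + np.exp(-x))
    return x, y

# Evaluate the function once and plot the resulting curve
x_vals, y_vals = sigmoid()
fig, ax = plt.subplots(figsize=(8,4))
ax.plot(x_vals, y_vals)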
ax.axvline(0, color="k", linestyle="dashed", linewidth=0.5)
ax.axhline(0.5, color="k", linestyle="dashed", linewidth=0.5)
ax.set_title("The logistic function", size=18);

The logit function maps probabilities to values over the entire real number range. Thus, the probability that an event/outcome/success is true $(y=1)$, given the set of predictors $x_i$, which is our data, is modeled as

$$logit(P(y=1|x_i))= \beta_0+ \beta_1x_1+ \beta_2x_2+ ... +\beta_kx_k\text{.}$$

To simplify the notation we express the inverse of the function above as 

$$\phi(\eta) = \frac{1}{1+e^{-\eta}}\text{,}$$

where $\eta$ is the linear combination of coefficients $(\beta_i)$ and predictor variables $(x_i)$, calculated as $\eta = \beta_0+ \beta_1x_1+ \beta_2x_2+ ... +\beta_kx_k$.

The parameters $(\beta_i)$ of the logit model are estimated by the [**method of maximum likelihood**](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation). However, there is no closed-form solution, so the maximum likelihood estimates are obtained by using iterative algorithms such as [gradient descent](https://en.wikipedia.org/wiki/Gradient_descent), among others. 
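
To illustrate the idea of iterative maximum likelihood estimation, the following sketch fits a logistic model with plain gradient descent in NumPy. It is not how scikit-learn estimates the coefficients internally; the synthetic data, the coefficient values in `beta_true`, the learning rate and the number of iterations are all assumptions made for this example.

```python
import numpy as np

rng = np.random.RandomState(0)

# Synthetic data drawn from a known logistic model (assumed coefficients)
n_samples = 500
X_sim = rng.normal(size=(n_samples, 2))
beta_true = np.array([-0.5, 2.0, -1.0])                   # beta_0, beta_1, beta_2
X_design = np.column_stack([np.ones(n_samples), X_sim])   # prepend intercept column
p_true = 1 / (1 + np.exp(-X_design @ beta_true))
y_sim = rng.binomial(1, p_true)

def neg_log_likelihood(beta):
    """Mean negative log-likelihood of the logistic model."""
    eta = X_design @ beta
    # log(1 + exp(eta)) - y * eta, using logaddexp for numerical stability
    return np.mean(np.logaddexp(0, eta) - y_sim * eta)

beta = np.zeros(3)
learning_rate = 0.5
for _ in range(2000):
    p_hat = 1 / (1 + np.exp(-X_design @ beta))
    gradient = X_design.T @ (p_hat - y_sim) / n_samples    # gradient of the mean NLL
    beta -= learning_rate * gradient

print("estimated coefficients:", np.round(beta, 2))
print("negative log-likelihood:", neg_log_likelihood(beta))
```

Maximizing the likelihood is the same as minimizing the negative log-likelihood, which is why each step moves against the gradient.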

The output of the sigmoid function is interpreted as the probability of a particular observation belonging to class 1. It is written as $\phi(\eta)=P(y=1|x_i,\beta_i)$, the probability of success $(y=1)$ given the predictor variables $x_i$ parameterized by the coefficients $\beta_i$. For example, if we compute $\phi(\eta)=0.65$ for a particular observation, the chance that this observation belongs to class 1 is 65%. Similarly, the probability that this observation belongs to the other class $(y=0)$ is calculated as $P(y=0|x_i,\beta_i)= 1 - P(y=1|x_i,\beta_i)=1-\phi(\eta)=1-0.65=0.35$ or 35%. For class assignment the predicted probability is then converted into a binary outcome via a unit step function:

$$
\hat y =
\begin{cases}
1,  & \text{if $\phi(\eta) \ge$ 0.5} \\
0, & \text{otherwise}
\end{cases}
$$
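
The decision rule above can be sketched in a few lines of NumPy. The function name `predict_class` and the example values of $\eta$ are made up for illustration.

```python
import numpy as np

def predict_class(eta, threshold=0.5):
    """Convert linear predictors into class labels via the logistic function."""
    phi = 1 / (1 + np.exp(-eta))
    return (phi >= threshold).astype(int)

eta_values = np.array([-2.0, -0.1, 0.0, 0.6, 3.0])
print(predict_class(eta_values))
```

Since $\phi(0)=0.5$ and $\phi$ is increasing, thresholding the probability at $0.5$ is equivalent to checking whether $\eta \ge 0$.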

In [None]:
from sklearn import linear_model
logistic = linear_model.LogisticRegression()
logistic

## Model Fitting

In [None]:
logistic.fit(X_train, y_train)

## Prediction

### In-sample prediction

In [None]:
logistic.predict(X_train)

### Out of sample prediction

In [None]:
y_pred = logistic.predict(X_test)
y_pred

In [None]:
logistic.predict_proba(X_test)[:5]

##  Evaluate the Model's Performance

In scikit-learn there are 3 different APIs for evaluating the quality of a model’s predictions:

* __Estimator score method__: Estimators have a score method providing a default evaluation criterion for the problem they are designed to solve. 
* __Scoring parameter__: Model-evaluation tools using cross-validation rely on an internal scoring strategy. 
* __Metric functions__: The metrics module implements functions assessing prediction error for specific purposes. Functions ending with `_score` return a value to maximize (the higher the better), and functions ending with `_error` or `_loss` return a value to minimize (the lower the better).
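
The scoring parameter is not demonstrated elsewhere in this tutorial, so here is a hedged sketch using `cross_val_score`. It reloads the two-class wine subset to stay self-contained, and it wraps the model in a `StandardScaler` pipeline, which is a choice made for this example to help the solver converge.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

wine = load_wine()
mask = wine.target < 2                       # two-class subset of the wine data
X_two, y_two = wine.data[mask], wine.target[mask]

model = make_pipeline(StandardScaler(), LogisticRegression())

# The same estimator evaluated with two different scoring strategies
acc = cross_val_score(model, X_two, y_two, cv=5, scoring='accuracy')
f1 = cross_val_score(model, X_two, y_two, cv=5, scoring='f1')
print("accuracy per fold:", acc)
print("F1 per fold:", f1)
```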

### Estimator score method

For classifiers, the `score` method returns the mean accuracy on the given test data and labels.

$$\text{accuracy}(y, \hat y) = \frac{1}{n_{\text{samples}}} \sum_{i=0}^{n_{\text{samples}}-1}\mathbb1(\hat y_i = y_i)$$

In [None]:
logistic.score(X_test, y_test)

In [None]:
np.sum(y_pred == y_test)/len(y_test)

### Metric functions

In [None]:
from sklearn.metrics import accuracy_score
# The signature is accuracy_score(y_true, y_pred)
accuracy_score(y_test, y_pred)

Note that there are many more [model metrics](http://scikit-learn.org/stable/modules/model_evaluation.html) available in scikit-learn.  

In [None]:
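import sklearn.metrics
# List the public names of the metrics module
[name for name in dir(sklearn.metrics) if not name.startswith('_')]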

## Tune the Model

Be aware that there are two types of parameters: 
* __model parameters__ and 
* __hyperparameters__. 

Model parameters (e.g. regression coefficients) are learned from the data. Hyperparameters, however, are parameters whose values are set before the learning process begins. Different model training algorithms require different hyperparameters; some simple algorithms require none. 

The optimal hyperparameter configuration for a particular modeling task is unknown. Hence, we apply different techniques, such as grid search, random search or Bayesian optimization, to approximate the best hyperparameter configuration (referred to as [hyperparameter optimization](https://en.wikipedia.org/wiki/Hyperparameter_optimization)).
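
As a complement to grid search, random search can be sketched with `RandomizedSearchCV`, which evaluates only a fixed number of randomly drawn candidates. The synthetic data set and the candidate range for `C` (the inverse regularization strength of `LogisticRegression`) are arbitrary choices for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Candidate values for C on a logarithmic scale (arbitrary range for illustration)
param_distributions = {'C': np.logspace(-3, 3, 50).tolist()}

search = RandomizedSearchCV(LogisticRegression(max_iter=1000), param_distributions,
                            n_iter=10, cv=5, random_state=0)
search.fit(X_demo, y_demo)
print(search.best_params_, search.best_score_)
```

Only `n_iter=10` of the 50 candidates are evaluated, which is the main appeal of random search when the search space is large.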

During the process of model optimization we want to avoid [overfitting](https://en.wikipedia.org/wiki/Overfitting). Hence, in machine learning a technique referred to as [k-fold cross-validation][1] is applied. Cross-validation reliably estimates the performance of a model-building method by training and evaluating the model multiple times, each time holding out a different part of the data for evaluation.

#### K-fold cross-validation (Source: [Wikipedia][1])
![](./_img/K-fold_cross_validation_EN.png)
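
The splitting scheme can be made explicit with `KFold`. The toy array below is an assumption for illustration; it only serves to show which sample indices end up in the training and test part of each fold.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(20).reshape(10, 2)    # 10 toy samples with 2 features
kf = KFold(n_splits=5, shuffle=True, random_state=0)

for fold, (train_idx, test_idx) in enumerate(kf.split(X_demo)):
    print("fold {}: train={} test={}".format(fold, train_idx, test_idx))

# Across the k folds every sample is used for testing exactly once
all_test = np.concatenate([test_idx for _, test_idx in kf.split(X_demo)])
print(np.sort(all_test))
```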


[1]: https://en.wikipedia.org/wiki/Cross-validation_(statistics)

### Model creation

For the purpose of this tutorial we apply support vector machines (SVMs) for classification of the subsetted _wine data set_ (see above). An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. 


The advantages of support vector machines are:

* Effective in high dimensional spaces.
* Still effective in cases where the number of dimensions is greater than the number of samples.
* Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
* Versatile: different Kernel functions can be specified for the decision function. 

The disadvantages of support vector machines include:

* If the number of features is much greater than the number of samples, avoiding over-fitting when choosing the kernel function and the regularization term is crucial.
* SVMs do not directly provide probability estimates.
* SVMs do not scale well to large numbers of samples.
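
Regarding probability estimates: by default an `SVC` only exposes decision scores. A minimal sketch on synthetic data (an assumption for illustration) shows the difference between `decision_function` and the optional, more expensive `probability=True` setting.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=0)

# decision_function returns signed scores relative to the separating surface, not probabilities
svc_plain = SVC(kernel='linear').fit(X_demo, y_demo)
scores = svc_plain.decision_function(X_demo[:3])
print(scores)

# probability=True adds an internally cross-validated calibration step, which slows down fitting
svc_proba = SVC(kernel='linear', probability=True, random_state=0).fit(X_demo, y_demo)
proba = svc_proba.predict_proba(X_demo[:3])
print(proba)
```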

The mathematics of the algorithm are beyond the scope of this tutorial. If you are interested, we suggest diving into the [scikit-learn documentation](http://scikit-learn.org/stable/modules/svm.html) or watching the informative [video](https://www.youtube.com/watch?v=-Z4aojJ-pdg) by Brandon Rohrer.

In [None]:
from sklearn import svm
svc = svm.SVC()
svc

### Hyperparameter optimization using `GridSearchCV`

* `C`: Penalty parameter `C` of the error term. The parameter allows one to trade off training error vs. model complexity. Hence, the `C` parameter trades off misclassification of training examples against simplicity of the decision surface. A low `C` makes the decision surface smooth, while a high `C` aims at classifying all training examples correctly by giving the model freedom to select more samples as support vectors.

* `kernel`:  Specifies the kernel type to be used in the algorithm. It must be one of `linear`, `poly`, `rbf`, `sigmoid`, `precomputed` or a callable. 

* `gamma`: Kernel coefficient for `rbf`, `poly` and `sigmoid`. Intuitively, the `gamma` parameter defines how far the influence of a single training example reaches, with low values meaning _far_ and high values meaning _close_. The `gamma` parameter can be seen as the inverse of the radius of influence of samples selected by the model as support vectors.
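
One way to get a feeling for `gamma` is to fit the same model with a few values and inspect the number of support vectors and the training accuracy. The data set and the chosen `gamma` values below are assumptions for illustration, and the sketch makes no claim about which value is best.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=200, n_features=2, n_redundant=0,
                                     n_informative=2, n_clusters_per_class=1,
                                     random_state=33)

results = []
for gamma in [0.01, 1, 100]:
    model = SVC(kernel='rbf', C=1.0, gamma=gamma).fit(X_demo, y_demo)
    # n_support_ holds the number of support vectors per class
    n_sv = int(model.n_support_.sum())
    results.append((gamma, n_sv, model.score(X_demo, y_demo)))

for gamma, n_sv, train_acc in results:
    print("gamma={:<6} support vectors={:<4} training accuracy={:.3f}".format(
        gamma, n_sv, train_acc))
```

Training accuracy alone says nothing about generalization, which is why the search below relies on cross-validation.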

In [None]:
from sklearn.model_selection import GridSearchCV

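# gamma has no effect on the linear kernel, so that kernel gets its own grid
param_grid = [{'kernel': ['linear'], 'C': [0.1, 1, 10, 100]},
              {'kernel': ['rbf'], 'C': [0.1, 1, 10, 100],
               'gamma': [0.01, 0.001, 0.0001]}]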

clf = GridSearchCV(svc, param_grid, cv=5, verbose=1, return_train_score=True)

### Model fitting

In [None]:
clf.fit(X_train, y_train)

### Cross validation results

In [None]:
pd.DataFrame(clf.cv_results_ )

`best_params_`: Parameter setting that gave the best results on the held-out data.

In [None]:
clf.best_params_

`best_estimator_`: Estimator that was chosen by the search, i.e. estimator which gave highest score (or smallest loss if specified) on the left out data. 

In [None]:
clf.best_estimator_

`best_score_`: Mean cross-validated score of the `best_estimator_`.

In [None]:
clf.best_score_ 

### Model prediction

In [None]:
y_pred = clf.best_estimator_.predict(X_test)
y_pred

### Model evaluation

**Accuracy**

In [None]:
from sklearn.metrics import accuracy_score
# The signature is accuracy_score(y_true, y_pred)
accuracy_score(y_test, y_pred)

**Confusion matrix**

In [None]:
from sklearn.metrics import confusion_matrix
cnf_matrix = confusion_matrix(y_test, y_pred)
cnf_matrix

In [None]:
# Plot non-normalized confusion matrix
hf.plot_confusion_matrix(cnf_matrix, classes=["class 0", "class 1"],
                      title='Confusion matrix, without normalization')

**Selected classification metrics**

\begin{equation} 
\text{precision} = \frac{TP}{TP+FP}
\end{equation}

\begin{equation} 
\text{recall} = \frac{TP}{TP+FN}
\end{equation}

\begin{equation} 
\text{F1-score} = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision}+\text{recall}}
\end{equation}

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
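
To connect the formulas with the confusion matrix, the metrics can be computed by hand and compared with scikit-learn's metric functions. The label arrays below are hypothetical and chosen only to illustrate the arithmetic.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Hypothetical labels, chosen only to illustrate the formulas
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0])

# For binary labels, ravel() returns the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)

# The hand-computed values agree with the library functions
print(np.isclose(precision, precision_score(y_true_demo, y_pred_demo)),
      np.isclose(recall, recall_score(y_true_demo, y_pred_demo)),
      np.isclose(f1, f1_score(y_true_demo, y_pred_demo)))
```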