# Quick intro to Machine Learning with Scikit



In general, a learning problem considers a set of $n$ samples of data and then tries to predict properties of unknown data. If each sample is more than a single number (for instance a multi-dimensional entry, aka multivariate data), it is said to have several attributes or features.

For those interested, we strongly suggest following one or several of the scikit-learn [tutorials](https://scikit-learn.org/stable/tutorial).

## Generalities

### Basic regression

$n$ samples for
- dependent variable: $y_n$
- explanatory variable (regressor): $x_{n,i}$ for $i\in[1,I]$

A linear regression consists in finding coefficients $a_i$ such that:

$$y_n = \sum_i a_{i} x_{n,i} + \epsilon_n$$

Given new data points $x_{k,i}$:
- predict $y_k$ by $\sum_i a_{i} x_{k,i}$
- (or the distribution of $y_k$ by $\sum_i a_{i} x_{k,i} + \epsilon_k$)
- interpret $a_i$: marginal effect of variable $x_{n,i}$

Test:
- properties of $\epsilon_n$, $x_{n,i}$
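
As a minimal sketch of the steps above, the coefficients $a_i$ can be estimated by ordinary least squares on synthetic data. The data, the "true" coefficients `a_true` and the noise level are all assumptions made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): n samples, I regressors
n, I = 200, 3
a_true = np.array([1.5, -2.0, 0.5])   # hypothetical "true" coefficients
x = rng.normal(size=(n, I))           # x[n, i]
eps = 0.1 * rng.normal(size=n)        # error term
y = x @ a_true + eps

# Least-squares estimate of the coefficients a_i
a_hat, *_ = np.linalg.lstsq(x, y, rcond=None)

# Prediction for new data points x[k, i]
x_new = rng.normal(size=(5, I))
y_pred = x_new @ a_hat

# Residuals, whose properties can then be tested
residuals = y - x @ a_hat
```

Each entry of `a_hat` is read as the marginal effect of the corresponding regressor.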

### Machine learning: supervised

$n$ samples for:

- data (labels, targets): $y_n$
- *features*: $x_{n,i}$ for $i\in[1,I]$

Construct a (possibly nonlinear) model $f$ with unknown parameters $\theta$ such that:

$$y_n \approx f(x_{n,1}, \dots, x_{n,I};\theta)$$

The process of finding good parameters $\theta$ is called "training", aka "learning".

There are two kinds of supervised learning, depending on the kind of output. If the output $y_n$ is:
- continuous: regression
- discrete: classification

A classification model can be built from a regression model:
- for binary outcomes: $c_n\approx \sigma(f(x_{n,1}, \dots, x_{n,I};\theta))$ where $\sigma$ is the cumulative distribution function of a random shock
- for nonbinary outcomes: encode the target as dichotomous variables first, then use a multinomial logit or a voting algorithm

```python
from sklearn.preprocessing import LabelBinarizer

# Encode a three-class target as dichotomous (one-hot) variables
y = [0, 0, 1, 1, 2]
y = LabelBinarizer().fit_transform(y)
y
```

```
array([[1, 0, 0],
       [1, 0, 0],
       [0, 1, 0],
       [0, 1, 0],
       [0, 0, 1]])
```
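
For the binary case, here is a minimal sketch assuming a logistic shock, so that $\sigma$ is the logistic function and $f$ is linear. The synthetic data are an assumption for illustration. The check at the end recomputes the class probabilities by applying $\sigma$ to the fitted linear index.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic binary outcome (assumed): linear index plus a logistic shock
x = rng.normal(size=(300, 2))
shock = 0.3 * rng.logistic(size=300)
c = (x[:, 0] + 0.5 * x[:, 1] + shock > 0).astype(int)

clf = LogisticRegression().fit(x, c)

f = clf.decision_function(x)     # fitted linear index f(x; theta)
p = 1.0 / (1.0 + np.exp(-f))     # sigma(f): probability that c = 1
p_sklearn = clf.predict_proba(x)[:, 1]
accuracy = clf.score(x, c)
```

`p` and `p_sklearn` should agree up to floating-point error.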

### Unsupervised learning

Only features $x_{n,i}$. Possible objectives:
- create categories (clustering)
- evaluate a distribution (density estimation)
- reduce dimension
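
Two of these objectives can be sketched on synthetic data (two well-separated clouds of points, an assumption made for illustration): k-means for clustering and PCA for dimension reduction.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Only features, no labels: two clouds of 50 points in 3 dimensions
x = np.vstack([
    rng.normal(loc=0.0, size=(50, 3)),
    rng.normal(loc=8.0, size=(50, 3)),
])

# Clustering: assign each sample to one of two categories
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(x)

# Dimension reduction: project the 3 features onto 1 principal component
x_reduced = PCA(n_components=1).fit_transform(x)
```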

### Reinforcement learning

Off topic here.

Maximize the expected sum of discounted future rewards:

$$R = E_0 \sum_{t=0}^{\infty} \gamma^t r_t$$

by choosing actions $a$ in state $s$. The model provides transition probabilities $P(s,s^{\prime}|a)$ and (stochastic) rewards $R(s,s^{\prime})$.

Looks suspiciously like the problem of an economic rational agent ;-)

RL: heuristics to learn the optimal policy rule $a(s)$.
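
When the transition probabilities and rewards are known, the optimal policy can be computed directly by value iteration. That is dynamic programming rather than an RL heuristic, and it is shown here only to make the objective concrete. The two-state, two-action problem below is entirely made up for illustration, with deterministic transitions and rewards.

```python
import numpy as np

gamma = 0.9  # discount factor

# P[a, s, s2]: probability of moving from s to s2 under action a (hypothetical)
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],   # action 0: stay in the current state
    [[0.0, 1.0], [1.0, 0.0]],   # action 1: switch to the other state
])
# R[s, s2]: reward of 1 whenever the next state is state 1 (hypothetical)
R = np.array([[0.0, 1.0], [0.0, 1.0]])

V = np.zeros(2)
for _ in range(500):
    # Q[a, s] = sum over s2 of P[a, s, s2] * (R[s, s2] + gamma * V[s2])
    Q = np.einsum("ast,st->as", P, R + gamma * V[None, :])
    V_new = Q.max(axis=0)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

policy = Q.argmax(axis=0)  # policy rule a(s)
```

RL methods target the same policy rule without assuming that `P` and `R` are known in advance.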

## Statistical learning

### Preprocess data
- cleanup
- normalize

### Estimator

Scikit estimators are used in the following way:
- `estimator = Estimator(**parameters)`
- `estimator.fit(data)` (or `estimator.fit(x, y)` for supervised learning)
- `estimator.predict(testdata)`
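
As a concrete instance of this pattern, here is a sketch using a k-nearest-neighbors classifier on a tiny made-up dataset. Any other estimator could be substituted.

```python
from sklearn.neighbors import KNeighborsClassifier

# Tiny training set (hypothetical): one feature, two classes
x_train = [[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]]
y_train = [0, 0, 0, 1, 1, 1]

# 1. Create the estimator with its parameters
estimator = KNeighborsClassifier(n_neighbors=3)

# 2. Fit it on the training data
estimator.fit(x_train, y_train)

# 3. Predict on new data
predictions = estimator.predict([[1.5], [10.5]])
```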

### Training set / test set

The standard procedure consists in splitting a dataset into
- a training set used to determine coefficients
- a test set used to measure prediction accuracy

The test set is typically a random fraction of the whole sample (e.g. 10%); the rest is used for training.


Accuracy can be measured both inside the training set, to check goodness of fit, and outside it (on the test set), to detect overfitting.

![](./Precisionrecall.png)

Scikit's supervised estimators expose a `.score()` method for that purpose.
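
A minimal sketch of the split-and-score procedure, assuming synthetic regression data and a 10% test set. For a regressor, `.score()` returns the $R^2$ coefficient.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration)
x = rng.normal(size=(500, 2))
y = x @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=500)

# Hold out a random 10% of the sample as the test set
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.1, random_state=0
)

model = LinearRegression().fit(x_train, y_train)

r2_train = model.score(x_train, y_train)  # fit inside the training set
r2_test = model.score(x_test, y_test)     # accuracy outside the training set
```

A training score much higher than the test score would be a sign of overfitting.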

To test the validity of a given model, we can split the data into several subsets of equal size (e.g. ten subsets of 10% each). Then, for each subset (fold), train the model on the remaining data and test it on the fold.

This is called K-fold cross-validation.

The random samples can be selected:
- using numpy: `np.split`, `np.random.permutation`
- with scikit: `sklearn.model_selection.KFold`

K-fold cross-validation can also be done with `sklearn.model_selection.cross_val_score`.
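
A minimal sketch combining `KFold` and `cross_val_score` for a 10-fold cross-validation. The synthetic data and the choice of a linear regression are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration)
x = rng.normal(size=(200, 2))
y = x @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=200)

# Ten folds, each holding a random 10% of the sample
folds = KFold(n_splits=10, shuffle=True, random_state=0)
fold_sizes = [len(test_idx) for _, test_idx in folds.split(x)]

# Train on the remaining data and score on each fold in turn
scores = cross_val_score(LinearRegression(), x, y, cv=folds)
mean_score = scores.mean()
```

The spread of `scores` across folds gives an idea of how stable the out-of-sample accuracy is.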

## Algorithms

Remark: many algorithms amount to the optimization of a loss function

- Classification/regression

    - k nearest neighbors classifier
    - support vector machines
        - linear
        - polynomial
        - rbf
    - sparse regressions
        - lasso
        - ridge
    - Decision Trees

- Clustering / dim reduction
    - k-means
    - PCA
    - ...

Look at the [map](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html)


![Many classifiers](many_classifiers.png)

### Sparse regressions

- lasso (sparsifies, sets some coeffs to 0):

$$\min_{\beta} \frac{1}{N} || y- X\beta||^2_2+\lambda ||\beta||_1$$

- ridge (shrinks coeffs towards 0):

$$\min_{\beta} \frac{1}{N} || y- X\beta||^2_2+\lambda ||\beta||^2_2$$

- elastic net:

$$\min_{\beta} \frac{1}{N} || y- X\beta||^2_2 + \lambda_1 ||\beta||_1 + \lambda_2 ||\beta||^2_2$$

Formulation as an objective to minimize is typical of machine learning.
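
The difference between the two penalties can be sketched on synthetic data where only two of ten regressors matter (an assumption made for illustration). In scikit-learn the `alpha` parameter plays the role of $\lambda$, up to the library's own scaling of the objective.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)

# Synthetic data (assumed): only the first 2 of 10 coefficients are nonzero
n, p = 200, 10
beta_true = np.zeros(p)
beta_true[:2] = [3.0, -2.0]
X = rng.normal(size=(n, p))
y = X @ beta_true + 0.1 * rng.normal(size=n)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)   # squared L2 penalty

# Count the coefficients set exactly to zero by each method
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
```

Comparing `n_zero_lasso` with `n_zero_ridge` shows which penalty produces exact zeros.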

## Primer on neural networks

... time allowing