# Types of ML

## Supervised learning

```{image} https://cdn-images-1.medium.com/max/1600/1*Iz7bCLrPTImnBDOOEyE3LA.png
:alt: supervised-learning
:class: bg-primary mb-1
:width: 500px
:align: center
```

Supervised learning is a popular category of machine learning algorithms that involves training a model on labeled data to make predictions or decisions. In this approach, the algorithm learns from a given set of input-output pairs and uses this knowledge to predict the output for new, unseen inputs. The goal is to find a mapping function that generalizes well to unseen data.

Let us now state this more formally. Denote

* training **dataset** $\mathcal D = \{(\boldsymbol x_i, y_i)\}_{i=1}^N$;
* **features** $\boldsymbol x \in \mathcal X$ (usually $\mathcal X = \mathbb R^D$);
* **targets** (**labels**) $y_i \in \mathcal Y$.

The goal of supervised learning is to find a mapping $f\colon \mathcal X \to \mathcal Y$ that minimizes the **cost** (**loss**) **function**

$$
\mathcal L = \frac 1N \sum\limits_{i=1}^N \ell(y_i, f(\boldsymbol x_i)).
$$

Note that the loss $\ell(y_i, f(\boldsymbol x_i))$ is calculated separately on each training object $(\boldsymbol x_i, y_i)$, and then averaged over the whole training dataset.
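
The averaging step can be sketched in a few lines of NumPy. This is only an illustration: the helper names `empirical_loss` and `absolute_error` are hypothetical, and the absolute error is just one possible choice of the per-object loss $\ell$.

```python
import numpy as np

def empirical_loss(per_object_loss, y_true, y_pred):
    """Average a per-object loss over the whole dataset."""
    losses = [per_object_loss(y, y_hat) for y, y_hat in zip(y_true, y_pred)]
    return float(np.mean(losses))

def absolute_error(y, y_hat):
    """One possible per-object loss (an assumption made for this sketch)."""
    return abs(y - y_hat)

y_true = np.array([1.0, 2.0, 3.0])   # targets y_i
y_pred = np.array([1.5, 2.0, 2.0])   # predictions f(x_i)

loss_value = empirical_loss(absolute_error, y_true, y_pred)
print(loss_value)  # (0.5 + 0.0 + 1.0) / 3 = 0.5
```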

### Predictive model

The mapping $f_{\boldsymbol \theta}\colon \mathcal X \to \mathcal Y$ is usually taken from some parametric family 

$$
\mathcal F = \{f_{\boldsymbol \theta}(\boldsymbol x) \vert \boldsymbol \theta \in \mathbb R^n\}
$$

which is also called a **model**.

To **fit** a model means to find $\boldsymbol \theta$ which minimizes the loss function

$$
    \mathcal L(\boldsymbol \theta) = \frac 1N \sum\limits_{i=1}^N \ell(y_i, f_{\boldsymbol \theta}(\boldsymbol x_i))
$$
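
As a rough sketch of what fitting can look like in practice, the snippet below minimizes $\mathcal L(\boldsymbol \theta)$ by plain gradient descent. Everything specific here is an assumption made for illustration: the one-parameter model $f_\theta(x) = \theta x$, the squared per-object loss, the toy data, the learning rate and the number of steps.

```python
import numpy as np

# Toy data generated by y = 2 * x (an assumption made for this sketch)
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

def model(theta, x):
    """Parametric family f_theta(x) = theta * x with a single parameter."""
    return theta * x

def loss(theta):
    """Average squared loss over the training set."""
    return float(np.mean((y - model(theta, x)) ** 2))

def loss_gradient(theta):
    """Derivative of the average squared loss with respect to theta."""
    return float(np.mean(2 * (model(theta, x) - y) * x))

theta = 0.0            # initial guess
learning_rate = 0.05   # step size chosen by hand for this toy problem
for _ in range(200):
    theta -= learning_rate * loss_gradient(theta)

print(theta, loss(theta))  # theta approaches 2, the loss approaches 0
```

Gradient descent is only one way to minimize the loss; some models admit a closed-form solution.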

### Classification

```{image} https://miro.medium.com/max/1400/1*biZq-ihFzq1I6Ssjz7UtdA.jpeg
:alt: cats-vs-dogs
:class: bg-primary mb-1
:width: 500px
:align: center
```

**Binary classification**

* $\mathcal Y = \{0, 1\}$ or $\mathcal Y = \{-1, +1\}$
* a typical loss function is the **misclassification rate**, where the bracket $[\cdot]$ equals $1$ if the condition inside holds and $0$ otherwise

    $$
        \mathcal L(\boldsymbol \theta) = \frac 1N \sum\limits_{i=1}^N \big[y_i \ne f_{\boldsymbol \theta}(\boldsymbol x_i)\big]
    $$

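* this loss is not smooth, so gradient-based optimization cannot be applied to it directly; instead, the model usually outputs $\hat y_i = f_{\boldsymbol \theta}(\boldsymbol x_i) \in [0, 1]$, which is treated as the probability of class $1$, and is trained with the **cross-entropy loss** (written here for $y_i \in \{0, 1\}$):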

$$
\mathcal L(\boldsymbol \theta) = -\frac 1N \sum\limits_{i=1}^N \big(y_i \log(\hat y_i) + (1-y_i) \log(1 - \hat y_i)\big)
$$
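
Both binary losses are easy to compute with NumPy. The sketch below assumes labels in $\{0, 1\}$, a decision threshold of $0.5$ for turning probabilities into class labels, and a small clipping constant `eps` to avoid $\log 0$; the function names are illustrative.

```python
import numpy as np

def misclassification_rate(y_true, y_label):
    """Fraction of objects whose predicted label differs from the true one."""
    return float(np.mean(y_true != y_label))

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Average binary cross-entropy; y_prob is the predicted P(class 1)."""
    y_prob = np.clip(y_prob, eps, 1 - eps)  # keep the logarithms finite
    per_object = y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)
    return float(-np.mean(per_object))

y_true = np.array([1, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.4, 0.1])   # predicted probabilities of class 1
y_label = (y_prob >= 0.5).astype(int)     # threshold 0.5 is an assumption

error_rate = misclassification_rate(y_true, y_label)
bce = binary_cross_entropy(y_true, y_prob)
print(error_rate)  # one of four objects is misclassified: 0.25
print(bce)
```

Note that the misclassification rate only sees the thresholded labels, while the cross-entropy also penalizes correct but unconfident predictions.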

```{image} https://miro.medium.com/max/1400/1*JAXmOAImcf683aXaBDPPVg.jpeg
:alt: multiclass
:class: bg-primary mb-1
:width: 500px
:align: center
```

**Multiclass classification**
* $\mathcal Y = \{1, 2, \ldots, K\}$ 
* one-hot encoding: $\boldsymbol y_i \in \{0, 1\}^K$, $\sum\limits_{k=1}^K y_{ik} = 1$
* $\hat{\boldsymbol y}_i = f_{\boldsymbol \theta}(\boldsymbol x_i) \in [0, 1]^K$ is now a vector whose $k$-th component is the predicted probability that $\boldsymbol x_i$ belongs to class $k$:

    $$
        \hat y_{ik} = \mathbb P(\boldsymbol x_i \in \text{ class }k)
    $$
* the cross-entropy loss is now written as follows:

$$
\mathcal L(\boldsymbol \theta) = -\frac 1N \sum\limits_{i=1}^N \sum\limits_{k=1}^Ky_{ik} \log(\hat y_{ik})
$$
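
A minimal NumPy sketch of one-hot encoding and the multiclass cross-entropy follows. The helper names are hypothetical, the predicted probabilities are made up for illustration, and classes are numbered from $0$ rather than from $1$ to match Python indexing.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer labels 0..K-1 into rows of a one-hot matrix."""
    encoded = np.zeros((len(labels), num_classes))
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

def cross_entropy(y_onehot, y_prob, eps=1e-12):
    """Average multiclass cross-entropy between one-hot targets and probabilities."""
    log_prob = np.log(np.clip(y_prob, eps, 1.0))  # avoid log(0)
    return float(-np.mean(np.sum(y_onehot * log_prob, axis=1)))

labels = np.array([0, 2, 1])                 # true classes of three objects
y_onehot = one_hot(labels, num_classes=3)
y_prob = np.array([[0.7, 0.2, 0.1],          # each row sums to 1
                   [0.1, 0.3, 0.6],
                   [0.2, 0.5, 0.3]])

ce = cross_entropy(y_onehot, y_prob)
print(y_onehot)
print(ce)  # equals -(log 0.7 + log 0.6 + log 0.5) / 3
```

Because each target row contains a single $1$, only the log-probability of the true class contributes to the loss of each object.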

### Regression

* $\mathcal Y = \mathbb R$ or $\mathcal Y = \mathbb R^n$
* the common choice is the quadratic loss 

    $$
        \ell_2(y, \hat y) = (y - \hat y)^2
    $$
* the overall loss function is then the **mean squared error** (MSE):

    $$
    \mathcal L(\boldsymbol \theta) = \mathrm{MSE}(\boldsymbol \theta) = \frac 1N\sum\limits_{i=1}^N (y_i - f_{\boldsymbol \theta}(\boldsymbol x_i))^2
    $$

If the model is linear in the features, i.e. $f_{\boldsymbol \theta}(\boldsymbol x_i) = \boldsymbol \theta^\top \boldsymbol x_i + b$, then it is called **linear regression**.
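
As an illustration, linear regression with the MSE loss can be fitted by ordinary least squares. The sketch below uses `np.linalg.lstsq` on synthetic, noise-free data; the data, the "true" parameters and the variable names are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic noise-free data: y = theta^T x + b with known parameters
true_theta = np.array([2.0, -1.0])
true_b = 0.5
X = rng.normal(size=(100, 2))
y = X @ true_theta + true_b

# Append a column of ones so that the bias b is fitted together with theta
X_aug = np.column_stack([X, np.ones(len(X))])
solution, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
theta, b = solution[:-1], solution[-1]

# Mean squared error of the fitted model on the training data
mse = float(np.mean((y - (X @ theta + b)) ** 2))
print(theta, b, mse)
```

Since the data contain no noise, the fitted parameters should match the true ones up to floating-point error, and the MSE should be close to zero.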

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.path import Path
import matplotlib.patches as patches
from matplotlib import rc
from scipy.special import expit

rc('text', usetex=True)
# A second rc('text.latex', preamble=...) call would overwrite the first one,
# so the preamble is set only once; all labels below are plain ASCII or math.
rc('text.latex', preamble=r'\usepackage[utf8]{inputenc}')

font = {'family' : 'monospace',
        'size'   : 24,
        'weight' : 'heavy'
       }

rc('font', **font)

%config InlineBackend.figure_formats = ['svg']

def plot_sigmoid(xmin, xmax, ymin, ymax):
    text_size = 24
    legend_size = 20
    eps=0.2
    fig, ax = plt.subplots(figsize=(11, 6))
    xs = np.linspace(xmin, xmax, num=500)
    
    
    ax.spines['bottom'].set_position('zero')
    ax.spines['left'].set_position('zero')

    # Remove top and right spines
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)

    ax.spines['bottom'].set_linewidth(2)
    ax.spines['left'].set_linewidth(2)

    ax.text(xmax + eps, -.2, r"$x$", size=text_size)
    ax.text(0.1, ymax, r"$y$", size=text_size)
    
    arrow_fmt = dict(markersize=6, color='black', clip_on=False)
    ax.plot((1), (0), marker='>', transform=ax.get_yaxis_transform(), **arrow_fmt)
    ax.plot((0), (1), marker='^', transform=ax.get_xaxis_transform(), **arrow_fmt)
    
    ax.plot(xs, expit(xs), c='r', lw=3, label= r'$\sigma(x) = \frac{1}{1+e^{-x}}$')
    # plt.plot(xs, np.maximum(0.2*xs, xs), c='m', lw=3, label= r'$\mathrm{LReLU}(x)$')
    
    ax.plot([0, xmax], [1, 1], c='k', ls='--', lw=2)
    ax.plot([xmin, 0], [0, 0], c='k', ls='--', lw=2)  # the lower asymptote of the sigmoid is y = 0
    
    ax.text(-0.18, 0.05, r"0")
    
    ax.legend(fontsize=legend_size);
    ax.grid(ls=':')
    ax.set_xlim(xmin-eps, xmax+eps)
    ax.set_ylim(ymin - eps/2, ymax+eps/2)
    yticks = np.arange(ymin, ymax+1)
    xticks = np.arange(xmin, xmax+1)
    ax.set_yticks(yticks[yticks != 0]);
    ax.set_xticks(xticks[xticks != 0])
    ax.set_yticklabels(yticks[yticks != 0], size=legend_size)
    ax.set_xticklabels(xticks[xticks != 0], size=legend_size);

## Unsupervised learning

```{image} https://cdn-images-1.medium.com/max/1440/1*YUl_BcqFPgX49sSb5yrk3A.jpeg
:alt: unsupervised-learning
:class: bg-primary mb-1
:width: 500px
:align: center
```

There are no targets anymore: the training dataset $\mathcal D = \{\boldsymbol x_i\}_{i=1}^N$ contains only features.

Examples of unsupervised learning tasks:
* clustering
* dimensionality reduction
* discovering latent factors
* searching for association rules
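
As one concrete example from this list, dimensionality reduction can be sketched with principal component analysis computed via the SVD. The choice of PCA, the synthetic data and all variable names are assumptions made for illustration; note that no targets are used anywhere.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 3-dimensional points that lie exactly in a 2-dimensional plane
latent = rng.normal(size=(200, 2))
mixing = np.array([[1.0, 0.0, 1.0],
                   [0.0, 1.0, 1.0]])
X = latent @ mixing                      # shape (200, 3), rank 2

# PCA: center the data and take the top right singular vectors
mean = X.mean(axis=0)
X_centered = X - mean
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

Z = X_centered @ Vt[:2].T                # 2-dimensional representation
X_restored = Z @ Vt[:2] + mean           # map back to the original space

print(Z.shape)                           # (200, 2)
print(np.allclose(X, X_restored))        # the plane is recovered without loss
```

On real data the points rarely lie exactly in a low-dimensional subspace, so the reconstruction is only approximate.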

## Semi-supervised learning


```{image} https://cdn-images-1.medium.com/max/1600/1*0TUC4m6yB7HUuPNO2SXEBw.png
:alt: semisupervised-learning
:class: bg-primary mb-1
:width: 500px
:align: center
```

**Semi-supervised learning** applies when the dataset contains both labeled and unlabeled data. It is often used when obtaining labels is expensive, time-consuming, or otherwise challenging.

## Reinforcement learning

**Reinforcement learning** is a machine learning paradigm where an agent learns to make sequential decisions by interacting with an environment. It aims to maximize a cumulative reward signal by exploring actions and learning optimal strategies through trial and error.