# Types of supervised learning

Given a labeled training dataset 

$$
\mathcal D = \{(\boldsymbol x_i, y_i)\}_{i=1}^n, \quad \boldsymbol x_i \in \mathbb R^d, \quad y_i \in \mathcal Y,
$$

we want to build a **predictive model** $f_{\boldsymbol \theta}\colon \mathbb R^d\to \mathcal Y$ which is usually taken from some parametric family 

$$
\mathcal F = \{f_{\boldsymbol \theta}(\boldsymbol x) \mid \boldsymbol \theta \in \mathbb R^m\}.
$$

To **fit** a model means to find a value of $\boldsymbol \theta$ which minimizes the loss function

$$
    \mathcal L(\boldsymbol \theta) = \frac 1n \sum\limits_{i=1}^n \ell(f_{\boldsymbol \theta}(\boldsymbol x_i), y_i) \to \min \limits_{\boldsymbol \theta}
$$
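
As a minimal sketch of this objective, assume a linear family $f_{\boldsymbol \theta}(\boldsymbol x) = \boldsymbol \theta^\top \boldsymbol x$ and a squared per-sample loss $\ell$. Both choices, the toy numbers, and the names `predict`, `per_sample_loss` and `empirical_loss` are illustrative assumptions rather than anything fixed by the definition above.

```python
import numpy as np

def predict(theta, X):
    """Hypothetical linear model: f_theta(x) = theta^T x for each row of X."""
    return X @ theta

def per_sample_loss(y_pred, y_true):
    """Squared error, chosen here only as an example of the per-sample loss."""
    return (y_pred - y_true) ** 2

def empirical_loss(theta, X, y):
    """L(theta) = (1/n) * sum_i loss(f_theta(x_i), y_i)."""
    return np.mean(per_sample_loss(predict(theta, X), y))

# Toy dataset with n = 3 samples and d = 2 features (made-up numbers).
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])

loss_good = empirical_loss(np.array([1.0, 2.0]), X, y)  # reproduces y exactly
loss_bad = empirical_loss(np.array([0.0, 0.0]), X, y)   # predicts 0 everywhere
```

Fitting the model then amounts to searching over `theta` for the smallest value of `empirical_loss`; on this toy data `loss_good` is smaller than `loss_bad`.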

Depending on the set of targets $\mathcal Y$, supervised learning is split into **classification** and **regression**.

## Classification

If $\mathcal Y$ is a finite set of categories, the predictive model $f_{\boldsymbol \theta}\colon \mathbb R^d\to \mathcal Y$ is often called a **classifier**. It assigns each input to one of the categories in $\mathcal Y$.

### Binary classification

```{figure} https://miro.medium.com/max/1400/1*biZq-ihFzq1I6Ssjz7UtdA.jpeg
:alt: cats-vs-dogs
:align: center
```

In **binary classification** problems there are only two classes, which are often called *positive* and *negative*. The target set $\mathcal Y$ in this case consists of two elements and is usually denoted as $\mathcal Y = \{0, 1\}$ or $\mathcal Y = \{-1, +1\}$.
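
The two label conventions carry the same information and are related by the map $y \mapsto 2y - 1$. A small NumPy sketch of the conversion in both directions (the label values are made up for illustration):

```python
import numpy as np

y01 = np.array([0, 1, 1, 0, 1])   # labels in {0, 1}

y_pm = 2 * y01 - 1                 # the same labels in {-1, +1}
y_back = (y_pm + 1) // 2           # and back to {0, 1}
```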

### Examples

* spam filtering (`1 = spam`, `0 = not spam`)
* medical diagnosis (`1 = sick`, `0 = healthy`)
* sentiment analysis (`1 = positive`, `0 = negative`)
* credit card fraud detection (`1 = fraudulent transaction`, `0 = legitimate transaction`)
* customer churn prediction (`1 = customer leaves`, `0 = customer stays`)

```{note}
There’s no inherent rule that "positive" must correspond to a "good" outcome. Instead, it usually refers to the class of greater interest.
```

### Loss function

Let $\hat y_i = f_{\boldsymbol \theta}(\boldsymbol x_i)$ be the prediction of the model on the $i$-th sample.
A typical loss function for binary classification is the **misclassification rate** (or **error rate**), i.e. the fraction of incorrect predictions:

```{math}
    :label: mis-rate
    \mathcal L(\boldsymbol \theta) = \frac 1n \sum\limits_{i=1}^n \mathbb I\big[y_i \ne \hat y_i\big]
```
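
In code, the indicator becomes an elementwise comparison, and the misclassification rate is the mean of the resulting boolean array. A minimal NumPy sketch, with labels and predictions invented for illustration:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # made-up true labels
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])   # made-up hard predictions

# (y_true != y_pred) plays the role of the indicator I[y_i != y_hat_i];
# averaging it gives the fraction of incorrect predictions.
error_rate = np.mean(y_true != y_pred)
```

Two of the eight predictions differ from the true labels, so `error_rate` equals `0.25`.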

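This loss is not a smooth function of $\boldsymbol \theta$ (it takes only the values $0, \frac 1n, \ldots, 1$), so gradient-based optimization cannot be applied to it directly. For this reason the model is often made to predict a number $\hat y \in (0, 1)$, which is treated as the probability of class $1$, and is trained with the **cross-entropy loss**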

```{math}
        :label: binary-cross-entropy
    \mathcal L(\boldsymbol \theta) = -\frac 1n \sum\limits_{i=1}^n \big(y_i \log(\hat y_i) + (1-y_i) \log(1 - \hat y_i)\big)
```
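
To make the formula concrete, here is a minimal NumPy sketch. The labels and predicted probabilities are made up, and the helper name `binary_cross_entropy` and the 0.5 threshold are assumptions used only for illustration.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob):
    """Mean binary cross-entropy; assumes every y_prob lies strictly in (0, 1)."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    per_sample = y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)
    return -np.mean(per_sample)

y_true = np.array([1, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.6, 0.4])   # predicted probabilities of class 1

loss = binary_cross_entropy(y_true, y_prob)

# Thresholding at 0.5 (an assumed cutoff) turns probabilities into hard labels.
y_hard = (y_prob > 0.5).astype(int)
error_rate = np.mean(y_hard != y_true)
```

Here every thresholded prediction is correct, so `error_rate` is zero, yet `loss` stays positive: cross-entropy also penalizes low confidence (a probability of 0.6 for the true class costs more than 0.9).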

```{admonition} Notation
:class: important

1. By convention $0\log 0 = 0$

2. By default each $\log$ has base $e$

3. **Indicator** $\mathbb I[P]$, where $P$ is some logical expression (**predicate**), is defined as

$$
    \mathbb I[P] = \begin{cases}
        1, & P \text{ is true}, \\
        0, & P \text{ is false}.
    \end{cases}
$$

```
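
Floating-point arithmetic does not follow the convention $0\log 0 = 0$ on its own: in NumPy the product `0.0 * np.log(0.0)` evaluates to `nan`. One possible way to enforce the convention is sketched below; the helper name `xlogy` is an assumption made for this example.

```python
import numpy as np

def xlogy(x, y):
    """Return x * log(y) elementwise, using the convention 0 * log(0) = 0."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        raw = x * np.log(y)              # gives nan where x == 0 and y == 0
    return np.where(x == 0, 0.0, raw)    # overwrite those entries with 0

with np.errstate(divide="ignore", invalid="ignore"):
    naive = 0.0 * np.log(0.0)            # nan, since 0 * (-inf) is undefined

fixed = float(xlogy(0.0, 0.0))           # 0.0 by the convention
```

SciPy's `scipy.special.xlogy` follows the same convention, so it can be used in place of a hand-written helper when SciPy is available.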

````{admonition} Example
Suppose that true labels $y$ and predictions $\hat y$ are as follows:

```{table} Binary classification
:name: binary-metrics

|$y$ | $\hat y$|
|:---:|:------:|
|$0$| $0$  |
|$0$| $1$  |
|$1$| $0$  |
|$1$| $1$  |
|$0$| $0$  |
```

Calculate the misclassification rate {eq}`mis-rate` and cross-entropy loss {eq}`binary-cross-entropy`.
````

<span style="display:none" id="binary_loss">W3sicXVlc3Rpb24iOiAiQ2FsY3VsYXRlIHRoZSBtaXNjbGFzc2lmaWNhdGlvbiByYXRlIiwgInR5cGUiOiAibnVtZXJpYyIsICJhbnN3ZXJzIjogW3sidHlwZSI6ICJ2YWx1ZSIsICJ2YWx1ZSI6IDAuNCwgImNvcnJlY3QiOiB0cnVlLCAiZmVlZGJhY2siOiAiUmlnaHQgeW91IGFyZSEgSGVyZSAkTj01JCwgdHdvIGluY29ycmVjdCBwcmVkaWN0aW9ucywgc28gdGhlIGxvc3MgaXMgJFxcZnJhYyAyNSQifSwgeyJ0eXBlIjogImRlZmF1bHQiLCAiZmVlZGJhY2siOiAiTm9wZSJ9XX0sIHsicXVlc3Rpb24iOiAiQ2FsY3VsYXRlIHRoZSBjcm9zcy1lbnRyb3B5IGxvc3MiLCAidHlwZSI6ICJtYW55X2Nob2ljZSIsICJhbnN3ZXJzIjogW3siYW5zd2VyIjogIiQwJCIsICJjb3JyZWN0IjogZmFsc2UsICJmZWVkYmFjayI6ICJUb28gbG93In0sIHsiYW5zd2VyIjogIiRcXGZyYWMgMTUgXFxsb2cgMiQiLCAiY29ycmVjdCI6IGZhbHNlLCAiZmVlZGJhY2siOiAiRG9lc24ndCBsb29rIGNvcnJlY3QifSwgeyJhbnN3ZXIiOiAiJFxcZnJhYyAxNSBcXGxvZyAzJCIsICJjb3JyZWN0IjogZmFsc2UsICJmZWVkYmFjayI6ICJOb3BlIn0sIHsiYW5zd2VyIjogIiQrXFxpbmZ0eSQiLCAiY29ycmVjdCI6IHRydWUsICJmZWVkYmFjayI6ICJPaCwgeWVhaC4uLiJ9XX1d</span>

```python
from jupyterquiz import display_quiz
display_quiz("#binary_loss")
```