# 1. Intro

Suppose a sample $x \in X$ can belong to one of $K$ classes, where $K > 2$. There are two ways to classify it:

* use a classifier that supports multiclass classification directly
* reduce the problem to a set of binary classification problems

# 1.1 Multiclass logistic regression

In binary logistic regression we apply the sigmoid function to a linear score:
$$b(x)=\sigma(\langle w,x \rangle)= \frac{1}{1+ \exp(-\langle w,x \rangle)}$$

Assume that for each class $k$ we build a linear model $b_k(x)$ that answers the question of whether object $x$ belongs to class $k$ (or measures how confident we are that it does). We can then construct the following vector:

$$(b_1(x), b_2(x), \dots, b_K(x))$$

A distinctive feature of logistic regression is that it returns the probability that an object belongs to a certain class. How do we turn the previous vector into probabilities?

Use a soft-max operator!

$$SoftMax \left(x_1, \dots, x_K \right) = \left(\frac{\exp \left(x_1\right)}{\sum\limits^{K}_{i=1}\exp \left(x_i\right)},\dots, \frac{\exp \left(x_K \right)}{\sum\limits^{K}_{i=1}\exp \left(x_i \right)} \right)$$

Each component of the result is the probability of belonging to the corresponding class, i.e. the probability of belonging to class $k$ is:
$$P(y=k| x,w) = \frac{\exp \left( \langle w_k,x \rangle + w_{0k} \right)}{\sum\limits^{K}_{j=1}\exp \left( \langle w_j,x \rangle + w_{0j} \right)}$$
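
The softmax formula above can be sketched in NumPy as follows. The weights `W`, intercepts `w0` and points `X_toy` are hypothetical toy values chosen only for illustration; subtracting the row maximum before exponentiating is a common numerical safeguard and does not change the result.

```python
import numpy as np

def softmax(scores):
    """Map each row of scores to a probability vector."""
    # Subtracting the row maximum leaves the result unchanged
    # but avoids overflow in exp for large scores.
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

# Hypothetical toy parameters: K = 3 classes, 2 features.
W = np.array([[1.0, -1.0],
              [0.5, 0.5],
              [-1.0, 2.0]])       # row k holds w_k
w0 = np.array([0.0, 0.1, -0.1])   # intercepts w_{0k}
X_toy = np.array([[1.0, 2.0],
                  [3.0, -1.0]])

scores = X_toy @ W.T + w0         # scores[i, k] = <w_k, x_i> + w_{0k}
probs = softmax(scores)           # probs[i, k] = P(y = k | x_i, w)
print(probs)
print(probs.sum(axis=1))          # each row sums to 1 (up to rounding)
```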

The objective function is again the log-likelihood:

$$\sum\limits_{i=1}^{l}\log P(y=y_i|x_i,w) \to \max_{w_1, \dots, w_K}$$
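
As a rough illustration, the log-likelihood can be computed from predicted class probabilities. The sketch below assumes a synthetic dataset (`X_demo`, `y_demo` are illustrative names) and a fitted `LogisticRegression`; note that scikit-learn adds an L2 penalty by default, so the fitted weights maximize a regularized version of this objective.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

# Synthetic 3-class data, labels are 0, 1, 2
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     n_informative=5, n_classes=3,
                                     random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)
proba = clf.predict_proba(X_demo)      # proba[i, k] = P(y = k | x_i, w)

# Take the probability assigned to the true class of each sample
# and sum the logarithms over the sample.
true_class_proba = proba[np.arange(len(y_demo)), y_demo]
log_likelihood = np.log(true_class_proba).sum()

# log_loss is the negative *mean* log-likelihood
mean_nll = log_loss(y_demo, proba)
print(log_likelihood, -mean_nll * len(y_demo))
```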

```python

sklearn.linear_model.LogisticRegression(multi_class='multinomial')
```

--------

# 1.2 Multiclass SVM

[Paper about multiclass SVM](http://jmlr.csail.mit.edu/papers/volume2/crammer01a/crammer01a.pdf)

Consider the following classifier:
$$a(x) = \arg\max_k \langle w_k, x \rangle, \quad k = 1, \dots, K$$

The multiclass SVM problem can then be written as follows:

\begin{equation}
 \begin{cases}
   \frac{1}{2}\|W\|^2 + C \sum\limits_{i=1}^{l} \varepsilon_i \to \min_{W} \\
   \langle w_{y_i}, x_i \rangle + \left[ y_i=k \right] - \langle w_{k}, x_i \rangle \geq 1 - \varepsilon_i, i = 1, \dots, l; k=1, \dots,K \\
   \varepsilon_i \geq 0, i = 1, \dots, l
 \end{cases}
\end{equation}
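
To make the constraints concrete, here is a minimal sketch that evaluates this objective for a fixed weight matrix. For a given $W$, the smallest feasible slack is $\varepsilon_i = \max_k \left( \langle w_k, x_i \rangle + 1 - [y_i = k] - \langle w_{y_i}, x_i \rangle \right)$, which follows from rearranging the constraints. The function name and the toy values are hypothetical, and this only evaluates the objective; it does not solve the optimization problem.

```python
import numpy as np

def crammer_singer_objective(W, X, y, C=1.0):
    """Evaluate the multiclass SVM objective for fixed weights W (K x d)."""
    idx = np.arange(len(y))
    scores = X @ W.T                               # scores[i, k] = <w_k, x_i>
    true_scores = scores[idx, y]                   # <w_{y_i}, x_i>
    margins = scores - true_scores[:, None] + 1.0  # <w_k, x_i> + 1 - <w_{y_i}, x_i>
    margins[idx, y] -= 1.0                         # the [y_i = k] term, gives 0 for k = y_i
    slack = margins.max(axis=1)                    # smallest eps_i satisfying all K constraints
    objective = 0.5 * np.sum(W ** 2) + C * slack.sum()
    return objective, slack

# Hypothetical toy problem: K = 3 classes, 2 features, 4 samples
W_toy = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [-1.0, -1.0]])
X_toy = np.array([[2.0, 0.0],
                  [0.0, 2.0],
                  [-1.0, -1.0],
                  [0.5, 0.4]])
y_toy = np.array([0, 1, 2, 1])

objective, slack = crammer_singer_objective(W_toy, X_toy, y_toy, C=1.0)
predictions = (X_toy @ W_toy.T).argmax(axis=1)     # a(x) = argmax_k <w_k, x>
print(objective, slack, predictions)
```

Only the last sample violates the margin here, so it is the only one with a nonzero slack.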

```python

sklearn.svm.LinearSVC(multi_class='crammer_singer')
```

--------

# 1.3 Reduction to the set of binary classifiers 
## 1.3.1 One-vs-All (One-vs-Rest)

**Idea:** 

Create a set of classifiers, where the $k$-th classifier answers the question of whether a sample belongs to class $k$ or not.
    
$$b_k(x)= \langle w_k,x \rangle + w_{0k}$$

We train the $k$-th classifier on the relabeled sample $(x_i, 2I(y_i=k) -1)$, i.e. objects of class $k$ get label $+1$ and all other objects get label $-1$.

In total we get $K$ classifiers, one per class. The final classifier is defined as follows:

$$a(x) = \arg\max_{k \in \{1, \dots, K\}} b_k(x)$$

**Problem**: each classifier is trained on its own relabeled sample, so the outputs may be on different scales and are not directly comparable.
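
The one-vs-rest scheme above can be written out by hand in a few lines. This is a minimal sketch on synthetic data (`X_demo`, `y_demo` and the other names are illustrative), using binary `LogisticRegression` models and their `decision_function` as the scores $b_k(x)$.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     n_informative=5, n_classes=3,
                                     random_state=0)
classes = np.unique(y_demo)

# One binary classifier per class, trained on labels 2 * I(y = k) - 1
binary_models = []
for k in classes:
    y_binary = 2 * (y_demo == k).astype(int) - 1
    model = LogisticRegression(max_iter=1000).fit(X_demo, y_binary)
    binary_models.append(model)

# Column k holds b_k(x) = <w_k, x> + w_{0k}, the score of the "+1" label
scores = np.column_stack([m.decision_function(X_demo) for m in binary_models])

# Final answer: the class whose classifier gives the largest score
y_pred = classes[scores.argmax(axis=1)]
print("training accuracy:", (y_pred == y_demo).mean())
```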

```python
sklearn.svm.LinearSVC(multi_class='ovr')

sklearn.linear_model.LogisticRegression(multi_class='ovr')

sklearn.multiclass.OneVsRestClassifier

```

In [5]:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, n_classes=3, random_state=1)

With LogisticRegression (built-in one-vs-rest):

In [9]:
lr = LogisticRegression(multi_class='ovr')
lr.fit(X, y)
y_predict = lr.predict(X)

print(np.unique(y_predict))

[0 1 2]


OneVsRestClassifier + LogisticRegression:

In [13]:
from sklearn.multiclass import OneVsRestClassifier

lr = LogisticRegression()
ovr = OneVsRestClassifier(lr)
ovr.fit(X, y)

y_predict = ovr.predict(X)

print(np.unique(y_predict))

[0 1 2]


------

## 1.3.2 One-vs-One

Train $C^2_K = \frac{K(K-1)}{2}$ classifiers, one for each pair of classes:
$$a_{ij}(x); \quad i < j; \quad i,j = 1, \dots, K$$

Each of them is trained on $X_{ij} = \left\{ (x_n, y_n) \in X \mid y_n = i \text{   or   } y_n = j \right\}$ and outputs either $i$ or $j$.

The answer for a new sample is the class that collects the most votes:

$$a(x) = \arg\max_{k \in \{1, \dots, K\}} \sum\limits_{i=1}^{K}\sum\limits_{j > i} I(a_{ij}(x)=k)$$
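
The voting rule can be sketched by hand as follows. The data and all variable names are illustrative, and ties are broken in favor of the class with the smallest index because of how `argmax` works, which is a simplification rather than the strategy scikit-learn uses internally.

```python
from itertools import combinations

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     n_informative=5, n_classes=3,
                                     random_state=0)
classes = np.unique(y_demo)
rows = np.arange(len(X_demo))

votes = np.zeros((len(X_demo), len(classes)), dtype=int)
pair_models = {}
for i, j in combinations(range(len(classes)), 2):
    # Train a_ij only on the samples of classes i and j
    mask = (y_demo == classes[i]) | (y_demo == classes[j])
    model = SVC().fit(X_demo[mask], y_demo[mask])
    pair_models[(i, j)] = model

    # Each pairwise classifier casts one vote per sample
    pred = model.predict(X_demo)
    votes[rows, np.searchsorted(classes, pred)] += 1

# Final answer: the class with the most votes
y_pred = classes[votes.argmax(axis=1)]
print("number of pairwise classifiers:", len(pair_models))
print("training accuracy:", (y_pred == y_demo).mean())
```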

```python
sklearn.svm.SVC

sklearn.multiclass.OneVsOneClassifier

```

With SVC (built-in one-vs-one):

In [16]:
from sklearn.svm import SVC

svm = SVC(decision_function_shape='ovo')
svm.fit(X, y)

y_predict = svm.predict(X)

print(np.unique(y_predict))

[0 1 2]


OneVsOneClassifier + SVC:

In [17]:
from sklearn.multiclass import OneVsOneClassifier

svm = SVC()
ovo = OneVsOneClassifier(svm)
ovo.fit(X, y)

y_predict = ovo.predict(X)

print(np.unique(y_predict))

[0 1 2]


# 2. Metrics for multiclass classification

For multiclass classification we can use the same metrics as for binary classification. Most of them are based on TP, TN, FP, FN. The final metrics, such as precision, can be aggregated over classes in $2$ ways:
* micro: calculate TP, TN, ... for each class, average them over classes, then calculate the final metric (precision, recall, etc.) from the averaged counts
* macro: calculate precision, recall, etc. for each class, then average them over classes.

**Examples:**

* micro-precision: $\frac{\overline{TP}}{\overline{TP}+\overline{FP}}$, where $\overline{TP} = \frac{1}{K}\sum\limits_{k=1}^{K}TP_k$
* macro-precision: $precision = \frac{1}{K}\sum\limits_{k=1}^{K}precision_k,$ where $precision_k = \frac{TP_k}{TP_k+FP_k}$

**Problem:** if the classes are imbalanced, micro-averaging is dominated by the large classes, so small classes hardly contribute to the final result. Macro-averaging gives every class equal weight and does not have this problem.
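
The difference between the two averaging modes can be checked by hand on a small imbalanced example. The labels below are made up for illustration; the per-class counts are read off the confusion matrix and the results are compared with `precision_score`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score

# Made-up imbalanced labels: class 0 is much larger than classes 1 and 2
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 2])
y_hat = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 2, 0])

cm = confusion_matrix(y_true, y_hat)   # rows: true class, columns: predicted class
tp = np.diag(cm)                       # TP_k
fp = cm.sum(axis=0) - tp               # FP_k: predicted as k, but true class differs

# micro: aggregate the counts first (the 1/K factors cancel), then compute precision
micro_precision = tp.sum() / (tp.sum() + fp.sum())

# macro: compute precision per class, then average
macro_precision = np.mean(tp / (tp + fp))

print(micro_precision, precision_score(y_true, y_hat, average='micro'))
print(macro_precision, precision_score(y_true, y_hat, average='macro'))
```

Here the micro value is driven mostly by class 0, while the macro value gives the two small classes the same weight as the large one.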

```python
sklearn.metrics.roc_auc_score(average='micro')
sklearn.metrics.roc_auc_score(average='macro')
sklearn.metrics.recall_score(average='micro')
sklearn.metrics.recall_score(average='macro')
sklearn.metrics.precision_score(average='micro')
sklearn.metrics.precision_score(average='macro')
```

In [23]:
from sklearn.metrics import recall_score, precision_score

recall_score(y, y_predict, average='micro')

0.89

In [22]:
recall_score(y, y_predict, average='macro')

0.8901173556501455