# Classification

## Definition



$X_{i\cdot} = (x_{i1},\ldots,x_{im})$ - single observation \
$X = (X_{1\cdot},\ldots,X_{n\cdot})^T = \begin{bmatrix}x_{11}&x_{12}&\cdots &x_{1m}\\
x_{21} & x_{22} &\cdots &x_{2m} \\
\vdots & \vdots & \ddots & \vdots \\ 
x_{n1} & x_{n2} &\cdots &x_{nm}\end{bmatrix}$ - dataset \
$C_1,\ldots,C_k$ - possible classes \
$P\left(C=C_i|X=x\right)=p_i$ - conditional probability / model \
$C = \operatorname{argmax}_{c\in\{C_1,\ldots,C_k\}}{P(c|X=x)}$ - predicted class

In [16]:
import numpy as np

In [17]:
import pandas as pd

In [18]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('darkgrid')

## Training and Test datasets

![](media\dataset.webp)

In [19]:
def plot_data(data, name):
    neg_df = data.copy()
    neg_df['negx1'] = -neg_df.x1
    sns.scatterplot(data=data, x='x1', y='x2', hue='y')
    sns.lineplot(data=neg_df, x='x1', y='negx1', color='r')
    plt.title(name)
    plt.show()

## Representation

### Binary

$c\in\left\{C_1,C_2\right\}\Rightarrow \tilde{C_1}=0,\:\tilde{C_2}=1$ \
$\hat{p}=P(c=C_2|X=x)$ \
$\hat{c}=\begin{cases}C_1,&\hat{p}\le\frac{1}{2}\\C_2,&\hat{p}>\frac{1}{2}\end{cases}$

### Multiclass

$c\in\{C_1,\ldots,C_k\}\Rightarrow \tilde{C_1}=(1,0,\ldots,0),\:\tilde{C_2}=(0,1,\ldots,0),\:\ldots,\:\tilde{C_k}=(0,0,\ldots,1)$ - one hot encoding \
$\hat{p_i}=P(c=C_i|X=x)$ \
$\hat{p}=(\hat{p_1},\hat{p_2},\ldots,\hat{p_k})$ \
$\hat{I}=\operatorname{argmax}_{i\in\{1,\ldots,k\}}{\hat{p_i}}$ \
$\hat{c}=C_{\hat{I}}$
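
The multiclass representation above can be sketched in NumPy. The class names and probability values are made up for illustration; only the one-hot encoding and the argmax step come from the formulas.

```python
import numpy as np

# Hypothetical class labels C_1..C_k (illustrative only).
classes = np.array(["cat", "dog", "bird"])

# One-hot encoding: class i maps to the i-th row of the identity matrix.
one_hot = np.eye(len(classes), dtype=int)

# Made-up predicted probabilities for two observations (each row sums to 1).
p_hat = np.array([[0.2, 0.7, 0.1],
                  [0.5, 0.1, 0.4]])

# Index of the most probable class per observation, then the class itself.
i_hat = p_hat.argmax(axis=1)
c_hat = classes[i_hat]
print(c_hat)
```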

## scikit-learn
[![](media\sklearn.png)](https://scikit-learn.org/stable/)

In [20]:
!pip install scikit-learn



## Algorithms

![](media\classifiers.png)

### Logistic Regression

$$\forall_{i\in\{1,\ldots,n\}}\:p_i=\frac{1}{1+e^{-\left(\beta_0+\beta_1x_{i1}+\ldots+\beta_mx_{im}\right)}}=\frac{1}{1+e^{-X_{i\cdot}\beta}}$$
$X=\begin{bmatrix}x_{11}&x_{12}&\cdots &x_{1m}\\
x_{21} & x_{22} &\cdots &x_{2m} \\
\vdots & \vdots & \ddots & \vdots \\ 
x_{n1} & x_{n2} &\cdots &x_{nm}\end{bmatrix}$ \
$\beta=(\beta_0,\ldots,\beta_m)^T$
![](media\logistic_regression.png)

$y_i\in\{0,1\}$ - real "class" of ith observation \
$L=\sum_i^n{\left(\hat{p_i}(1-y_i)+\left(1-\hat{p_i}\right)y_i\right)}$ - loss function \
$\hat{\beta}=\operatorname{argmin}_{\beta}{L}$

#### How to minimize the LOSS?
#### Gradient Descent!
![](media\gradient_descent.png)
More about Gradient Descent next time

In [21]:
from sklearn.linear_model import LogisticRegression
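
As a rough check of the sigmoid formula above, the sketch below fits `LogisticRegression` on synthetic data and recomputes the predicted probabilities by hand from the fitted coefficients. The dataset parameters and the random seed are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem with two features (parameter values are assumptions).
x, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

model = LogisticRegression().fit(x, y)

# Linear score beta_0 + beta_1 * x_i1 + ... + beta_m * x_im for every observation.
score = model.intercept_ + x @ model.coef_.ravel()

# The sigmoid of the score is the probability of the second class.
p_manual = 1.0 / (1.0 + np.exp(-score))
p_sklearn = model.predict_proba(x)[:, 1]

print(np.allclose(p_manual, p_sklearn))
```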

In [22]:
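def model_plane(model, points=100):
    # Relies on the notebook-level variables train_x and train_y.
    x = np.array(train_x)
    # Regular grid spanning the range of the two training features.
    g1, g2 = np.meshgrid(np.linspace(x[:, 0].min(), x[:, 0].max(), points),
                         np.linspace(x[:, 1].min(), x[:, 1].max(), points))
    x1, x2 = g1.flatten(), g2.flatten()
    # Predicted probability of the second class at every grid point.
    y = model.predict_proba(np.stack((x1, x2)).T)
    sns.scatterplot(x=x1, y=x2, hue=y[:, 1])
    # Overlay the training points, coloured by their true class.
    sns.scatterplot(x=x[:, 0], y=x[:, 1], hue=np.array(train_y))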

### Regularized logistic regression
$$L=\sum_i^n{\left(\hat{p_i}(1-y_i)+\left(1-\hat{p_i}\right)y_i\right)}+\alpha\|\beta\|_l$$
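
The effect of the penalty term can be sketched with scikit-learn, where `LogisticRegression` applies an L2 penalty by default and `C` is the inverse of the regularization strength (so a small `C` plays the role of a large $\alpha$). The dataset and the two `C` values are assumptions picked for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem (parameter values are assumptions).
x, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

# Large C means weak regularization, small C means strong regularization.
weak = LogisticRegression(C=100.0).fit(x, y)
strong = LogisticRegression(C=0.01).fit(x, y)

# Stronger regularization shrinks the coefficient vector towards zero.
norm_weak = np.linalg.norm(weak.coef_)
norm_strong = np.linalg.norm(strong.coef_)
print(norm_weak, norm_strong)
```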

In [23]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
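
A minimal sketch of how these two helpers are typically combined: generate a synthetic dataset, then hold out part of it for testing. The sample size, test fraction, and seeds are assumptions; the names `train_x` and `train_y` follow the plotting helper used earlier, while `test_x` and `test_y` are introduced here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic two-feature binary problem (all parameter values are assumptions).
x, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

# Hold out 25% of the observations for testing.
train_x, test_x, train_y, test_y = train_test_split(
    x, y, test_size=0.25, random_state=0)

print(train_x.shape, test_x.shape)
```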

## K-Nearest Neighbours

$x$ - data to be predicted \
$d_i = \|x-X_{i\cdot}\|$ - distance from the ith observation \
$c_i=y_i|d_{i:n}$ - class of the ith nearest observation \
$NN=(c_1,\ldots,c_k)$ - classes of the $k$ nearest neighbours \
$\hat{c}=\operatorname{argmax}_{j\in\{1,\ldots,K\}}{\sum_{i=1}^k{1_{c_i=C_j}}}$ - most frequent class among the $k$ nearest neighbours \
![](media\knn.png)

In [24]:
from sklearn.neighbors import KNeighborsClassifier
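
The voting rule above can be sketched directly in NumPy. This is an illustrative implementation, not the one used by `KNeighborsClassifier`; the function name `knn_predict` and the toy data are made up.

```python
import numpy as np

def knn_predict(train_x, train_y, x, k=3):
    """Predict the class of x by majority vote among its k nearest neighbours."""
    # Euclidean distance d_i from x to every training observation.
    d = np.linalg.norm(train_x - x, axis=1)
    # Classes c_1..c_k of the k closest observations.
    nearest = train_y[np.argsort(d)[:k]]
    # Most frequent class among the neighbours (ties go to the smallest label).
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[np.argmax(counts)]

# Toy data (made up): two well-separated groups on a line.
train_x = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
train_y = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(train_x, train_y, np.array([1.5])))
print(knn_predict(train_x, train_y, np.array([9.0])))
```

In scikit-learn the number of neighbours is set with `KNeighborsClassifier(n_neighbors=k)`.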

### Support Vector Machine

![](media\svm.webp)
$w^TX_{i\cdot}-b=0$ - decision boundary \
$y_i\in\{-1,1\}$ \
$y_i(w^TX_{i\cdot}-b)\ge1$ - constraint \
$w=\operatorname{argmax}_{w^{*}}{\frac{2}{\left\|w^{*}\right\|}}$ - margin size

### Soft margins
$$L=\lambda\|w\|^2+\frac{1}{n}\sum_i^n{\max{\left(0,1-y_i\left(w^TX_{i\cdot}-b\right)\right)}}$$
$w=\operatorname{argmin}_{w^{*}}{L}$
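
The soft-margin loss can be evaluated by hand on a few points. The sketch below assumes labels in $\{-1,1\}$ as above; the data, weights, and value of $\lambda$ are made up.

```python
import numpy as np

def soft_margin_loss(w, b, x, y, lam):
    """lam * ||w||^2 plus the mean hinge loss, with labels y in {-1, 1}."""
    margins = y * (x @ w - b)
    hinge = np.maximum(0.0, 1.0 - margins)
    return lam * np.dot(w, w) + hinge.mean()

# Three made-up observations; the last one lies inside the margin.
x = np.array([[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]])
y = np.array([1.0, -1.0, 1.0])
w = np.array([1.0, 0.0])
b = 0.0

loss = soft_margin_loss(w, b, x, y, lam=0.1)
print(loss)
```

Only the third point contributes to the hinge term, because its margin $y_i(w^TX_{i\cdot}-b)=0.5$ is below 1.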

In [25]:
from sklearn.svm import SVC
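
A minimal usage sketch of `SVC` with a linear kernel on made-up, clearly separable data. The points, labels, and `C=1.0` are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Two made-up groups that a straight line separates easily.
x = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

model = SVC(kernel="linear", C=1.0).fit(x, y)

# One new point near each group.
pred = model.predict(np.array([[0.5, 0.5], [3.5, 3.5]]))
print(pred)
```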

### Kernel trick
![](media\kernel.ppm)
$k(X_{i\cdot},X_{j\cdot})=\phi(X_{i\cdot})\cdot\phi(X_{j\cdot})$ - kernel (inner product in the feature space) \
$w=\sum_i^n{c_iy_i\phi(X_{i\cdot})}$

![](media\kernels_formula.png)
![](media\kernels_plot.png)
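
To illustrate the kernel trick, the sketch below uses the homogeneous quadratic kernel $k(u,v)=(u\cdot v)^2$ in two dimensions, chosen here as an example, and checks that it equals the inner product of an explicit feature map $\phi$. The vectors are made up.

```python
import numpy as np

def phi(v):
    """Explicit feature map of the homogeneous quadratic kernel in 2D."""
    return np.array([v[0] ** 2, np.sqrt(2.0) * v[0] * v[1], v[1] ** 2])

def kernel(u, v):
    """The same kernel computed in the original 2D space."""
    return np.dot(u, v) ** 2

u = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])

explicit = np.dot(phi(u), phi(v))  # inner product in the 3D feature space
implicit = kernel(u, v)            # no feature map needed
print(explicit, implicit)
```

The kernel gives the same number without ever building $\phi$, which matters when the feature space is very large.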

## Decision Trees

![](media\decision_tree.png)

$R_p$ - parent node \
$R_l,R_r$ - left and right nodes \
$S:R_p\rightarrow \{R_l,R_r\}$ - split \
$p_i^{(a)}=P(c=C_i|X\in R_a)$ - probability of ith class in $R_a$ node \
$G(R_a)=1-\sum_i^K{\left(p_i^{(a)}\right)^2}$ - Gini Impurity \
$E(R_a)=-\sum_i^K{p_i^{(a)}\log{p_i^{(a)}}}$ - Entropy \
$IG(S)=f(R_p)-\left(P(R_l)f(R_l)+P(R_r)f(R_r)\right);\: f(x)=G(x)\lor f(x)=E(x)$ - Information Gain \
$S=\operatorname{argmax}_{S^{*}}IG\left(S^{*}\right)$

In [26]:
from sklearn.tree import DecisionTreeClassifier, plot_tree
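
The impurity measures and the information gain of a split can be sketched in NumPy. Here each child's impurity is weighted by its share of the parent's observations; the function names and the toy labels are made up for illustration.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Entropy (natural logarithm) of the class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def information_gain(parent, left, right, impurity=gini):
    """Parent impurity minus the size-weighted impurity of the children."""
    w_left = len(left) / len(parent)
    w_right = len(right) / len(parent)
    return impurity(parent) - (w_left * impurity(left) + w_right * impurity(right))

parent = np.array([0, 0, 1, 1])
perfect = information_gain(parent, parent[:2], parent[2:])     # pure children
useless = information_gain(parent, parent[::2], parent[1::2])  # children as mixed as the parent
print(perfect, useless)
```

`DecisionTreeClassifier` selects between the two measures with `criterion="gini"` or `criterion="entropy"`.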

## Naive Bayes
![](media\bayes.png)
$$P(c=C_i|X=x)=\frac{P(c=C_i,X=x)}{P(X=x)}\propto P(c=C_i,X=x)=\\=P(c=C_i)P(X_{\cdot 1}=x_1,\ldots,X_{\cdot m}=x_m|c=C_i)=\\=P(c=C_i)P(X_{\cdot 1}=x_1|X_{\cdot 2}=x_2,\ldots,X_{\cdot m}=x_m,c=C_i)\ldots P(X_{\cdot m}=x_m|c=C_i)$$

$$P(X_{\cdot j}=x_j|X_{\cdot -j}=x_{-j},c=C_i)=P(X_{\cdot j}=x_j|c=C_i)$$

$$P(c=C_i|X=x)\propto P(c=C_i)\prod_{j=1}^m{P(X_{\cdot j}=x_j|c=C_i)}$$

Depending on the data type, different likelihood functions are used.

### Gaussian (continuous variable)
![](https://wikimedia.org/api/rest_v1/media/math/render/svg/8967b34cca6aeffe1820bc5f2624cee311dccaeb)

### Bernoulli (binary variable)
![](https://wikimedia.org/api/rest_v1/media/math/render/svg/2b23b8affe1fa31b1ce499d5d2944d9763ff2e6e)

### Multinomial (categorical variable)
![](https://wikimedia.org/api/rest_v1/media/math/render/svg/8967b34cca6aeffe1820bc5f2624cee311dccaeb)

In [27]:
from sklearn.naive_bayes import GaussianNB
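
The product formula with Gaussian likelihoods can be sketched from scratch in NumPy. This is an illustrative version working with log-probabilities; it omits the variance smoothing that `GaussianNB` applies, and the function names and toy data are made up.

```python
import numpy as np

def fit_gaussian_nb(x, y):
    """Estimate the prior and the per-feature mean and variance of each class."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([x[y == c].mean(axis=0) for c in classes])
    variances = np.array([x[y == c].var(axis=0) for c in classes])
    return classes, priors, means, variances

def predict_gaussian_nb(params, x_new):
    """Return the class with the highest log P(c) + sum_j log P(x_j | c)."""
    classes, priors, means, variances = params
    # Sum of per-feature Gaussian log-densities, one value per class.
    log_lik = -0.5 * (np.log(2.0 * np.pi * variances)
                      + (x_new - means) ** 2 / variances).sum(axis=1)
    return classes[np.argmax(np.log(priors) + log_lik)]

# Toy data (made up): two well-separated groups.
x = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
              [5.0, 6.0], [5.2, 5.8], [4.8, 6.2]])
y = np.array([0, 0, 0, 1, 1, 1])

params = fit_gaussian_nb(x, y)
print(predict_gaussian_nb(params, np.array([1.1, 2.1])))
print(predict_gaussian_nb(params, np.array([5.1, 5.9])))
```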

### Probability Calibration
[![](media\calibration.png)](https://scikit-learn.org/stable/modules/calibration.html)

### Others
[Gaussian Processes](https://scikit-learn.org/stable/modules/gaussian_process.html) \
[LDA/QDA](https://scikit-learn.org/stable/modules/lda_qda.html)

## Metrics
![](media\confusion_matrix.jpeg) \
$F_{\beta}=\left(1+\beta^2\right)\frac{\operatorname{precision}\cdot\operatorname{recall}}{\beta^2\operatorname{precision}+\operatorname{recall}}$ \
$F_1=2\frac{\operatorname{precision}\cdot\operatorname{recall}}{\operatorname{precision}+\operatorname{recall}}$

In [28]:
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix
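
A small worked example of these metrics on made-up labels, including a manual check of the $F_{\beta}$ formula above. `fbeta_score` is an additional scikit-learn function not imported in the cell above.

```python
from sklearn.metrics import (confusion_matrix, f1_score, fbeta_score,
                             precision_score, recall_score)

# Made-up ground truth and predictions: TP=2, FN=2, FP=1, TN=3.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)
f2 = fbeta_score(y_true, y_pred, beta=2)

# F_beta computed by hand from precision and recall.
beta = 2
f2_manual = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(precision, recall, f1, f2)
print(cm)
```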

## ROC/AUC

In [29]:
from sklearn.metrics import RocCurveDisplay, PrecisionRecallDisplay
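
The display classes above draw the curves. The underlying numbers can be obtained with `roc_curve` and `roc_auc_score`, two further scikit-learn functions used here as a sketch on made-up scores.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up true labels and predicted scores for the positive class.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# False positive rate and true positive rate at each score threshold.
fpr, tpr, thresholds = roc_curve(y_true, scores)

# Area under the ROC curve: the share of (positive, negative) pairs
# in which the positive observation gets the higher score.
auc = roc_auc_score(y_true, scores)
print(auc)
```

Here three of the four positive/negative pairs are ranked correctly, so the AUC is 0.75.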