# 4 Classification


We now turn to classification problems.


## K-Nearest Neighbors (KNN)

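For each data point, find the $k$ training points closest to it and take the most frequent class among them; the data point is then assigned to that class.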


$k$ is a hyperparameter; candidate values can be compared by cross-validation.
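The voting rule and the choice of $k$ can be sketched as follows. This is a minimal illustration, not a tuned implementation: the function names `knn_predict` and `loo_accuracy`, the Euclidean distance, the toy data, and the use of leave-one-out as the cross-validation scheme are all assumptions made here.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=3):
    """Classify each row of X_query by majority vote among its k nearest training points."""
    # pairwise squared Euclidean distances, shape (n_query, n_train)
    dist2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    # indices of the k nearest training points for each query point
    nearest = np.argsort(dist2, axis=1)[:, :k]
    votes = y_train[nearest]                      # (n_query, k)
    n_classes = y_train.max() + 1
    counts = np.stack([(votes == c).sum(axis=1) for c in range(n_classes)], axis=1)
    return counts.argmax(axis=1)                  # ties go to the smallest label

def loo_accuracy(X, y, k):
    """Leave-one-out accuracy: predict each point from all the others."""
    hits = 0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        hits += int(knn_predict(X[mask], y[mask], X[i:i + 1], k)[0] == y[i])
    return hits / len(y)

# toy data (assumed for illustration): two well-separated clusters
X_train = np.array([[0., 0.], [0., 1.], [1., 0.], [5., 5.], [5., 6.], [6., 5.]])
y_train = np.array([0, 0, 0, 1, 1, 1])
pred = knn_predict(X_train, y_train, np.array([[0.5, 0.5], [5.5, 5.5]]), k=3)
scores = {k: loo_accuracy(X_train, y_train, k) for k in (1, 3, 5)}
```

On this toy set, $k=5$ does badly under leave-one-out because every held-out point is outvoted by the other cluster, which is the kind of effect cross-validation is meant to expose.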

<br>

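For regression problems the same idea applies in principle: the value at each data point is estimated by the average of its neighbors' values.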
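A short sketch of the regression variant, under the same assumptions as before (Euclidean distance, unweighted average; `knn_regress` and the data are made up for illustration):

```python
import numpy as np

def knn_regress(X_train, y_train, X_query, k=2):
    """Predict each query value as the mean target of its k nearest training points."""
    dist2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(dist2, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

# toy 1-D data (assumed): y = x^2 sampled at x = 0, 1, 2, 3
X_train = np.array([[0.], [1.], [2.], [3.]])
y_train = np.array([0., 1., 4., 9.])
# the two nearest neighbors of x = 1.4 are x = 1 and x = 2, so the estimate is (1 + 4) / 2
pred = knn_regress(X_train, y_train, np.array([[1.4]]), k=2)
```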


## Linear Discriminant Analysis (LDA)

Suppose $X=x$ belongs to one of $k$ classes $c_1,\dotsc,c_k$. If for each $c_i$ we can obtain the density of $X$ within that class, $f_i(x) = f(X=x|c_i)$, then we can assign $x$ to the class whose density at $x$ is largest.


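We can go further and use Bayes' formula to obtain the probability of each class given $x$:
$$\mathbb P(c_j|X=x) = \frac{\mathbb P(c_j) f(x|c_j)}{\sum_i \mathbb P(c_i)f(x|c_i)}$$
where $\mathbb P(c_i)$ is the prior probability of class $i$ (it can be estimated by the frequency of each class in the training set), and $\sum_i \mathbb P(c_i) = 1$.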
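To make the formula concrete, here is a small numerical sketch. The priors and density values are made-up numbers, and `posterior` is a hypothetical helper name.

```python
import numpy as np

def posterior(prior, density_at_x):
    """Bayes' formula: P(c_j | x) is proportional to P(c_j) * f(x | c_j)."""
    joint = prior * density_at_x
    return joint / joint.sum()

prior = np.array([0.3, 0.7])         # P(c_1), P(c_2), assumed values
density_at_x = np.array([0.5, 0.1])  # f(x | c_1), f(x | c_2), assumed values
post = posterior(prior, density_at_x)  # equals [0.15, 0.07] / 0.22
```

Here the class with the smaller prior still wins, because its density at $x$ is five times larger.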

<br>

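Now assume that within each class $c_j$, $X$ is normally distributed:
$$f(x|c_j) = \frac{1}{{(2\pi )}^{p/2}|\Sigma_j|^\frac 12}\exp\left\{-\frac 12(x-\mu_j)^T\Sigma_j^{-1}(x-\mu_j)\right\}$$
where $\mu_j$ and $\Sigma_j$ are the mean vector and covariance matrix of the distribution. Both parameters can be estimated from the training data first.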
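One possible way to estimate $\mu_j,\Sigma_j$ and evaluate the log-density is sketched below. The helper names `fit_gaussians` and `log_density`, the synthetic data, and the choice of the maximum-likelihood covariance (dividing by the class size) are assumptions of this sketch.

```python
import numpy as np

def fit_gaussians(X, y):
    """Per-class mean and covariance estimated from the training data."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)
        Sigma = np.cov(Xc, rowvar=False, bias=True)  # MLE: divide by n_c
        params[int(c)] = (mu, Sigma)
    return params

def log_density(x, mu, Sigma):
    """log f(x | c_j) for a multivariate normal N(mu, Sigma)."""
    p = mu.size
    d = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (p * np.log(2 * np.pi) + logdet + d @ np.linalg.solve(Sigma, d))

# synthetic data (assumed): two Gaussian clouds in R^2
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(50, 2)), rng.normal(size=(50, 2)) + 3.0])
y = np.repeat([0, 1], 50)
params = fit_gaussians(X, y)

# assign a point to the class with the largest class-conditional density
x_new = np.array([2.8, 3.1])
label = max(params, key=lambda c: log_density(x_new, *params[c]))
```

Working with the log-density avoids underflow and uses `slogdet` and `solve` rather than an explicit determinant and inverse.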


If we further assume that all classes share the same covariance matrix, $\Sigma_1=\dotsc = \Sigma_k = \Sigma$, then the normalizing constants and the quadratic terms $x^T\Sigma^{-1}x$ cancel, and the log posterior odds of two classes are:
$$\begin{aligned}\log\frac{\mathbb P(c_{j_1}|X=x)}{\mathbb P(c_{j_2}|X=x)}&
=-\frac 12(x-\mu_{j_1})^T\Sigma ^{-1}(x-\mu_{j_1})+\frac 12(x-\mu_{j_2})^T\Sigma ^{-1}(x-\mu_{j_2})
+\log \frac{\mathbb P(c_{j_1})}{\mathbb P(c_{j_2})}
\\ &
=(\mu_{j_1}-\mu_{j_2})^T\Sigma ^{-1}x-\frac 12(\mu_{j_1}+\mu_{j_2})^T\Sigma ^{-1}(\mu_{j_1}-\mu_{j_2})+\log \frac{\mathbb P(c_{j_1})}{\mathbb P(c_{j_2})}.
\end{aligned}
$$

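This is a linear function of $x$. In other words, for any two classes, the preference for one over the other is decided by a linear classifier, hence the name linear discriminant analysis.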
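The linearity can be checked numerically. The sketch below compares the linear form $w^Tx+b$, with $w=\Sigma^{-1}(\mu_{j_1}-\mu_{j_2})$, against the difference of the two quadratic forms. The function names and all parameter values are assumptions chosen for illustration.

```python
import numpy as np

def lda_log_odds(x, mu1, mu2, Sigma, prior1, prior2):
    """Linear form of the log posterior odds under a shared covariance matrix."""
    w = np.linalg.solve(Sigma, mu1 - mu2)  # Sigma^{-1} (mu1 - mu2)
    b = -0.5 * (mu1 + mu2) @ w + np.log(prior1 / prior2)
    return w @ x + b

def direct_log_odds(x, mu1, mu2, Sigma, prior1, prior2):
    """Same quantity computed from the two Gaussian exponents directly."""
    q1 = (x - mu1) @ np.linalg.solve(Sigma, x - mu1)
    q2 = (x - mu2) @ np.linalg.solve(Sigma, x - mu2)
    return -0.5 * q1 + 0.5 * q2 + np.log(prior1 / prior2)

# assumed parameters for two classes in R^2
mu1, mu2 = np.array([0., 0.]), np.array([2., 2.])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
x = np.array([1.0, 0.5])

linear = lda_log_odds(x, mu1, mu2, Sigma, 0.4, 0.6)
direct = direct_log_odds(x, mu1, mu2, Sigma, 0.4, 0.6)
# with equal priors, the midpoint of the two means lies on the decision boundary
midpoint = lda_log_odds((mu1 + mu2) / 2, mu1, mu2, Sigma, 0.5, 0.5)
```

A positive value favors the first class and a negative value the second; the decision boundary is the hyperplane where the value is zero.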

## Naive Bayes (NB)

As in linear discriminant analysis (LDA), Bayes' formula gives

$$\mathbb P(Y=c_j|X=x) = \frac{\mathbb P(Y=c_j) f(x|Y=c_j)}{\sum_i \mathbb P(Y=c_i)f(x|Y=c_i)}$$

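Now impose the conditional independence assumption: given the class, the components of the vector $X\in\mathbb R^p$ are mutually independent, so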
$$f (X = x|Y = c) = \prod_{i=1}^p f (X^{(i)}=x^{(i)}|Y=c)$$

If each component of $X$ takes discrete values, $f (X^{(i)}=x^{(i)}|Y=c)$ can be estimated by the corresponding frequencies in the training set.

For any $x$, choose the class with the largest posterior probability (the denominator $f(x)$ does not depend on $y$ and can be dropped):
$$\begin{aligned}\hat y &= {\rm argmax}_y\ \mathbb P(Y= y|X =x)= {\rm argmax}_y \ \frac{\mathbb P(Y=y) f(x|Y=y)}{f(x)}\\ &=  {\rm argmax}_y \ {\mathbb P(Y=y) f(x|Y=y)}\\& = {\rm argmax}_y \ \mathbb P(Y=y) \prod_{i=1}^p f (X^{(i)}=x^{(i)}|Y=y)\end{aligned}$$

<br>


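**Theorem** Suppose $(X,Y)$ is drawn from a joint distribution on the probability space $(\Omega,\mathcal F,\mathbb P)$. Then the naive Bayes rule above (posterior maximization) is equivalent to expected-risk minimization under the 0-1 loss, where the expected risk is: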
$$\begin{aligned}R(\varphi) &= \mathbb E(\mathbb I_{Y\neq \varphi(X)})=\int \mathbb E(\mathbb I_{Y\neq \varphi(X)}
|X=x)f(X = x)dx\\ &
=\int \mathbb P(Y\neq \varphi(x)|X=x)f(X = x)dx\\ & 
=\int \left( 1 -\mathbb  P(Y=\varphi(x)|X=x)\right) f(X = x)dx.
\end{aligned}$$

Hence minimizing $R(\varphi)$ amounts to maximizing $\mathbb P(Y = \varphi(x)|X = x)$ for every $x$, which is exactly the naive Bayes rule.
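The statement can be verified by brute force on a small discrete example. The joint table below is made up for illustration; the sketch enumerates every classifier $\varphi:\{0,1,2\}\to\{0,1\}$ and checks that the posterior-maximizing rule attains the smallest 0-1 risk.

```python
import itertools
import numpy as np

# joint[x, y] = P(X = x, Y = y); an assumed table whose entries sum to 1
joint = np.array([[0.10, 0.20],
                  [0.25, 0.05],
                  [0.15, 0.25]])

def risk(phi):
    """Expected 0-1 risk: P(Y != phi(X)) = 1 - sum_x P(X = x, Y = phi(x))."""
    return 1.0 - sum(joint[x, phi[x]] for x in range(joint.shape[0]))

# posterior maximization: argmax_y P(y | x) = argmax_y P(x, y), since P(x) is constant in y
map_rule = tuple(int(c) for c in joint.argmax(axis=1))

# brute force over all 2^3 classifiers
all_rules = list(itertools.product(range(joint.shape[1]), repeat=joint.shape[0]))
best_rule = min(all_rules, key=risk)
```

Note that this check uses the exact joint distribution, so it illustrates the optimality of posterior maximization itself and does not depend on the conditional independence assumption.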

In [2]:
from typing import List, Optional

import numpy as np
import pandas as pd


class NaiveBayes:
    def __init__(self) -> None:
        """Naive Bayes Classifier. See function `fit` for details."""
        self.feature_prob = None
        self.class_prob = None
    
    def fit(self, 
            X: np.ndarray, 
            y: np.ndarray, 
            class_num: Optional[np.ndarray] = None, 
            class_num_y: Optional[int] = None, 
            smooth: float = 1.
        ):
        """
        Fit on data (X, y).

        Parameters
        -------
        X: ndarray[int], shape (N, k)
            Feature matrix to fit on. k is the number of features.
        y: ndarray[int], shape (N,)
            Ground truth labels, integers within 0 ~ classnum - 1.
        class_num: Optional[np.ndarray]
            Each feature has c_i classes (0 <= i < k). 
            Provide the class number of each feature.
            If None, it is inferred as the maximum index + 1 in the training dataset.
        class_num_y: Optional[int]
            The class number of `y`. If None, it is inferred as the maximum index + 1 of `y`.
        smooth: float
            Parameter for Bayes smoothing. Defaults to 1 (Laplacian smoothing).
        """

        # calculate the number of classes of each feature
        if class_num is None:
            class_num = np.max(X, axis = 0) + 1
        if class_num_y is None:
            class_num_y = np.max(y) + 1

        N, K, p = X.shape[0], X.shape[1], class_num.max()
        self.feature_prob = np.zeros((class_num_y, K, p))
        self.class_prob   = np.zeros(class_num_y)
        for c in range(class_num_y):
            self.class_prob[c] = (y == c).sum()
            for l in range(p):
                # conditional count
                self.feature_prob[c, :, l] = ((X == l) & np.tile((y == c).reshape((N, 1)), (1, K))).sum(axis = 0)
        
        self.feature_prob = (self.feature_prob + smooth) / \
            (np.tile(np.reshape(self.class_prob, (-1, 1)), (1, K)) + np.reshape(class_num * smooth, (1, K))).reshape((-1, K, 1))
        self.class_prob = (self.class_prob + smooth) / (N + class_num_y * smooth)

        # take logarithm to prevent underflow in multiplication
        self.feature_prob = np.transpose(np.log(self.feature_prob), (1, 2, 0)) # -> (K, p, c)
        self.class_prob = np.log(self.class_prob)

    def predict(self, 
            X: np.ndarray, 
            return_prob: bool = False
        ):
        """
        Make prediction on `X`. 
        
        Parameters
        -------
        X: ndarray[int], shape (N, k)
            Feature matrix on which to make prediction.
        return_prob: bool
            If True, return the log of the unnormalized posterior,
            log P(Y=c) + sum_i log P(X_i=x_i | Y=c), with shape (N, c).
            If False, return the class with maximum predicted probability.
        """
        # we use summation because we have taken logarithm on probability
        prob = self.feature_prob[np.arange(self.feature_prob.shape[0]), X].sum(axis = 1) # (N, c)
        prob += self.class_prob.reshape((1, -1))

        if return_prob:
            return prob
        return np.argmax(prob, axis = 1)

    def test(self, 
            X: np.ndarray,
            y: np.ndarray, 
            label_names: Optional[List[str]] = None
        ):
        """
        Make test on `X` given labels `y`. Return a cross table.
        
        Parameters
        -------
        X: ndarray[int], shape (N, k)
            Feature matrix on which to make prediction.
        y: ndarray[int], shape (N,)
            Ground truth labels, integers within 0 ~ classnum - 1.
        label_names: Optional[List[str]]
            Name of each label.
        """
        pred_y = self.predict(X)
        p = self.class_prob.size
        
        if label_names is None:
            label_names = [str(i) for i in range(p)]

        cross_tab = np.zeros((p+1, p+1), dtype = 'int32')
        for i in range(p):
            for j in range(p):
                cross_tab[i,j] = ((pred_y == j) & (y == i)).sum()
        cross_tab[-1,:] = cross_tab.sum(axis = 0)
        cross_tab[:,-1] = cross_tab.sum(axis = 1)
        
        print('Accuracy = %.2f%%\n'%(100. * (pred_y == y).mean()) + '=' * 60)
        print('True \\ Pred')
        label_names = label_names + ['Total']
        cross_tab = pd.DataFrame(cross_tab, columns = label_names, index = label_names)
        return cross_tab

In [3]:
# example borrowed from 李航 (Li Hang), 统计学习方法 (Statistical Learning Methods), p. 52
nb = NaiveBayes()
X = np.array([[1,1,1,1,1,2,2,2,2,2,3,3,3,3,3], [1,2,2,1,1,1,2,2,3,3,3,2,2,3,3], [1,1,2,2,1,1,1,2,2,2,2,2,2,2,1]]) - 1
X, y = X[:-1].T, X[-1]
print('X.T =\n%s\ny   =\n %s'%(X.T, y))
nb.fit(X, y)

pred_X = np.array([[2, 1]]) - 1
pred_prob = np.exp(nb.predict(pred_X, return_prob = True))
print('\nPred X =\n%s\nProb   =\n%s'%(pred_X, pred_prob))

X.T =
[[0 0 0 0 0 1 1 1 1 1 2 2 2 2 2]
 [0 1 1 0 0 0 1 1 2 2 2 1 1 2 2]]
y   =
 [0 0 1 1 0 0 0 1 1 1 1 1 1 1 0]

Pred X =
[[1 0]]
Prob   =
[[0.06100218 0.03267974]]
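
As a cross-check, the two printed numbers can be reproduced by hand from the smoothed frequency estimates, assuming the defaults used above (`smooth = 1`, 15 samples of which 6 are in class 0 and 9 in class 1, and three possible values per feature). They are the unnormalized products $\mathbb P(Y=c)\prod_i \mathbb P(X^{(i)}=x^{(i)}|Y=c)$, so they do not sum to 1:

$$\begin{aligned}
c=0:&\quad \frac{6+1}{15+2}\cdot\frac{2+1}{6+3}\cdot\frac{3+1}{6+3}=\frac{7}{17}\cdot\frac 39\cdot\frac 49\approx 0.0610,\\
c=1:&\quad \frac{9+1}{15+2}\cdot\frac{3+1}{9+3}\cdot\frac{1+1}{9+3}=\frac{10}{17}\cdot\frac 4{12}\cdot\frac 2{12}\approx 0.0327.
\end{aligned}$$

Class 0 has the larger value, so the query point is assigned to class 0.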


In [5]:
nb.test(X, y, label_names = ['Feature 1', 'Feature 2'])

Accuracy = 73.33%
True \ Pred


| | Feature 1 | Feature 2 | Total |
|---|---|---|---|
| Feature 1 | 3 | 3 | 6 |
| Feature 2 | 1 | 8 | 9 |
| Total | 4 | 11 | 15 |
