# Binary Classification with a Bayes Classifier (Part 2)

Probabilistic generative model

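Task: predict whether a person's income exceeds $50,000.

A Bayes classifier may "fill in" solutions that the data alone does not show, because it assumes a form for the class distributions. With little data, a generative model may perform better; with more data, a discriminative model usually performs better. Although the discriminative model can be derived from the generative one, the discriminative model does no such "filling in".

The first few steps are the same as for logistic regression.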

Kaggle: https://www.kaggle.com/c/ml2020spring-hw2

Simple baseline: 0.88675

Strong baseline: 0.89102

Score: 0.87575

**Keywords:** Generative model; Binary classification; Probabilistic; Bayes' theorem

In [1]:
# Probabilistic generative model
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# np.float and np.int were removed in NumPy 1.24; the builtin float and int are equivalent here.
x = pd.read_csv('./data/X_train.csv', index_col=0).astype(float).to_numpy()
y = pd.read_csv('./data/Y_train.csv', index_col=0).astype(int).to_numpy().flatten()
x_test = pd.read_csv('./data/X_test.csv', index_col=0).astype(float).to_numpy()
EPS = 1e-8

dim = x.shape[1]

Read the data.

In [2]:
def normalize(x):
    x_mean = np.mean(x, axis=0)
    x_std = np.std(x, axis=0)
    return (x - x_mean) / (x_std + EPS)


x = normalize(x)

Standardize each feature to zero mean and unit variance.
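Note that `normalize` derives the mean and standard deviation from whichever array it is given, so a test set passed through it is scaled by its own statistics. A common alternative is to reuse the training statistics for the test set. The sketch below shows that variant on toy data; the helper names `fit_normalizer` and `apply_normalizer` and the toy arrays are assumptions for illustration, not part of the original notebook.

```python
import numpy as np

EPS = 1e-8  # same small constant as in the notebook, avoids division by zero


def fit_normalizer(x_train):
    """Hypothetical helper: per-feature mean and std computed on the training set."""
    return np.mean(x_train, axis=0), np.std(x_train, axis=0)


def apply_normalizer(x, mean, std):
    """Hypothetical helper: standardize any array with previously fitted statistics."""
    return (x - mean) / (std + EPS)


# Toy data standing in for X_train / X_test (assumed shape: samples x features).
rng = np.random.default_rng(0)
x_train_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
x_test_demo = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

mean, std = fit_normalizer(x_train_demo)
x_train_norm = apply_normalizer(x_train_demo, mean, std)
x_test_norm = apply_normalizer(x_test_demo, mean, std)  # scaled with training statistics
```

Whether this changes the Kaggle score for this dataset is not something the notebook reports, so treat it as an option to try rather than a guaranteed improvement.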

In [3]:
x_train_0 = np.array([a for a, b in zip(x, y) if b == 0])
x_train_1 = np.array([a for a, b in zip(x, y) if b == 1])

mean_0 = np.mean(x_train_0, axis=0)  # mean of label 0, used as mu_1 below
mean_1 = np.mean(x_train_1, axis=0)  # mean of label 1, used as mu_2 below

Split the samples by class and compute each class mean. Since this is binary classification, there are two classes.

Bayes classification: $P(C_1|x) = \dfrac{P(x|C_1)P(C_1)}{\sum_{i=1}^{n} P(x|C_i)P(C_i)}$
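For two classes the sum in the denominator has only two terms, and dividing numerator and denominator by the numerator turns the posterior into a sigmoid. This is the standard intermediate step, written out here because the notebook does not show it:

$$
P(C_1|x) = \dfrac{P(x|C_1)P(C_1)}{P(x|C_1)P(C_1)+P(x|C_2)P(C_2)} = \dfrac{1}{1+e^{-z}} = \sigma(z), \qquad z = \ln\dfrac{P(x|C_1)P(C_1)}{P(x|C_2)P(C_2)}
$$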

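Because this generative model lets both class Gaussians share one covariance matrix, the decision boundary between the two classes becomes linear.

The shared covariance is the weighted average of the two class covariances: $\Sigma = \dfrac{N_1\cdot\Sigma_1+N_2\cdot\Sigma_2}{N_1+N_2}$, where $N_1$ and $N_2$ are the numbers of samples in the two classes.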

Carrying out the calculation gives the quantity used for prediction:

$z = (\mu_1-\mu_2)^T\Sigma^{-1}x-\dfrac12(\mu_1)^T\Sigma^{-1}\mu_1+\dfrac12(\mu_2)^T\Sigma^{-1}\mu_2+\ln\dfrac{N_1}{N_2}$

That is, $z = \omega^T x + b$ with: $\omega^T = (\mu_1-\mu_2)^T\Sigma^{-1}\\ b = -\dfrac12(\mu_1)^T\Sigma^{-1}\mu_1+\dfrac12(\mu_2)^T\Sigma^{-1}\mu_2+\ln\dfrac{N_1}{N_2}$

So in the generative model, the values that need to be computed are $N_1,N_2,\mu_1,\mu_2,\Sigma$.
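As a sanity check on the expression for $z$, the sketch below compares $\sigma(z)$ with the posterior obtained by applying Bayes' rule directly to two Gaussians that share $\Sigma$. All parameter values and the helper `log_gaussian` are made up for illustration; the priors are assumed to be the class frequencies $N_1/(N_1+N_2)$ and $N_2/(N_1+N_2)$.

```python
import numpy as np


def log_gaussian(x, mu, cov_inv, cov_logdet):
    """Log density of a multivariate Gaussian at a single point x."""
    d = x - mu
    k = x.shape[0]
    return -0.5 * (k * np.log(2 * np.pi) + cov_logdet + d @ cov_inv @ d)


# Toy parameters (assumed values, only for illustration).
mu_1 = np.array([1.0, 0.0])
mu_2 = np.array([-1.0, 0.5])
sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
N_1, N_2 = 30, 70

sigma_inv = np.linalg.inv(sigma)
_, sigma_logdet = np.linalg.slogdet(sigma)

# Closed form: z = w^T x + b.
w = sigma_inv @ (mu_1 - mu_2)
b = (-0.5 * mu_1 @ sigma_inv @ mu_1
     + 0.5 * mu_2 @ sigma_inv @ mu_2
     + np.log(N_1 / N_2))

x_point = np.array([0.2, -0.4])
posterior_sigmoid = 1.0 / (1.0 + np.exp(-(w @ x_point + b)))

# Direct Bayes rule with class-frequency priors.
joint_1 = np.exp(log_gaussian(x_point, mu_1, sigma_inv, sigma_logdet)) * N_1 / (N_1 + N_2)
joint_2 = np.exp(log_gaussian(x_point, mu_2, sigma_inv, sigma_logdet)) * N_2 / (N_1 + N_2)
posterior_bayes = joint_1 / (joint_1 + joint_2)

print(posterior_sigmoid, posterior_bayes)  # the two values should agree
```

The quadratic term $x^T\Sigma^{-1}x$ appears in both class densities and cancels in the ratio, which is why only a linear function of $x$ remains.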

In [4]:
N1 = x_train_0.shape[0]
N2 = x_train_1.shape[0]

cov_0 = np.zeros((dim, dim))
cov_1 = np.zeros((dim, dim))

for i in x_train_0:
    cov_0 += np.dot(np.transpose([i - mean_0]), [i - mean_0]) / N1
for i in x_train_1:
    cov_1 += np.dot(np.transpose([i - mean_1]), [i - mean_1]) / N2

# shared covariance. Use weighted average of individual in-class covariance.
cov = (cov_0 * N1 + cov_1 * N2) / (N1 + N2)

Compute the shared covariance. **The loops accumulate one `dim × dim` outer product per sample, so this step is noticeably slow.**
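The per-sample loop can be replaced by a single matrix product on the centered data. The sketch below, on random stand-in arrays, computes the same population covariance (division by N, as in the loop) and the same weighted average; the function names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
x_class_0 = rng.normal(size=(200, 5))         # stand-in for x_train_0
x_class_1 = rng.normal(size=(120, 5)) + 1.0   # stand-in for x_train_1


def class_covariance(x_class):
    """Population covariance (divide by N) via one matrix product."""
    centered = x_class - x_class.mean(axis=0)
    return centered.T @ centered / x_class.shape[0]


def loop_covariance(x_class):
    """Reference: per-sample outer-product accumulation, as in the cell above."""
    mean = x_class.mean(axis=0)
    cov = np.zeros((x_class.shape[1], x_class.shape[1]))
    for row in x_class:
        cov += np.outer(row - mean, row - mean) / x_class.shape[0]
    return cov


n_0, n_1 = x_class_0.shape[0], x_class_1.shape[0]
cov_0 = class_covariance(x_class_0)
cov_1 = class_covariance(x_class_1)

# Shared covariance: weighted average of the in-class covariances.
shared_cov = (n_0 * cov_0 + n_1 * cov_1) / (n_0 + n_1)
```

`np.cov(x_class, rowvar=False, bias=True)` gives the same per-class matrix; `bias=True` selects division by N instead of N - 1.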

In [5]:
# Compute inverse of covariance matrix.
# Since covariance matrix may be nearly singular, np.linalg.inv() may give a large numerical error.
# Via SVD decomposition, one can get matrix inverse efficiently and accurately.
u, s, v = np.linalg.svd(cov, full_matrices=False)
inv = np.matmul(v.T * 1 / s, u.T)

Compute the inverse matrix. The comments in the cell explain why `np.linalg.svd()` is used.
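To make the SVD step less opaque: `np.linalg.svd` returns `u`, `s`, `vh` with `cov == u @ np.diag(s) @ vh`, so the inverse is `vh.T @ np.diag(1 / s) @ u.T`. The sketch below checks this on a small, well-conditioned stand-in matrix (an assumption; the real `cov` may be close to singular) and compares it with `np.linalg.pinv`.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=(50, 4))
cov_demo = a.T @ a / a.shape[0]   # small symmetric stand-in for cov

# np.linalg.svd returns (u, s, vh) with cov_demo == u @ np.diag(s) @ vh.
u, s, vh = np.linalg.svd(cov_demo, full_matrices=False)

# Inverse via SVD: vh.T @ diag(1 / s) @ u.T.
# Broadcasting "vh.T / s" scales column j of vh.T by 1 / s[j].
inv_svd = (vh.T / s) @ u.T

# np.linalg.pinv also works through the SVD, but discards singular values
# below a relative cutoff instead of dividing by them.
inv_pinv = np.linalg.pinv(cov_demo)

print(np.allclose(inv_svd @ cov_demo, np.eye(4)))
```

For a matrix with singular values near zero, dividing by `s` directly can still produce very large entries, which is the case where the cutoff in `np.linalg.pinv` matters.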

In [6]:
w = np.dot(inv, mean_0 - mean_1)
b = (-0.5) * np.dot(mean_0, np.dot(inv, mean_0)) + 0.5 * np.dot(mean_1, np.dot(inv, mean_1)) + np.log(N1/N2)

y_train_pred = 1 - np.round(1 / (1 + np.exp(-(np.dot(x, w) + b))))
print(f'Training accuracy: {1 - np.mean(np.abs(y - y_train_pred))}')

Training accuracy: 0.8731753170156296


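The `1 -` looks like a label mix-up, but it is needed: `mean_0` and `N1` (label 0) are plugged in as $\mu_1$ and $N_1$, so the sigmoid output is the probability of label 0, and the predicted label is one minus its rounded value. Swapping the roles of the two classes in `w` and `b` would remove the `1 -`.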
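The role of the `1 -` can be checked on synthetic data. The sketch below uses two made-up, well-separated Gaussian classes (not the Kaggle data), keeps the notebook's orientation with label 0 in the role of $C_1$, and shows that the flipped rounding recovers the labels. It also shows an equivalent decision taken directly on the sign of $z$.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic, well-separated classes (assumed data, for illustration only).
x0 = rng.normal(loc=-2.0, size=(300, 2))   # label 0
x1 = rng.normal(loc=+2.0, size=(200, 2))   # label 1
x_demo = np.vstack([x0, x1])
y_demo = np.concatenate([np.zeros(300, dtype=int), np.ones(200, dtype=int)])

m0, m1 = x0.mean(axis=0), x1.mean(axis=0)
c0 = (x0 - m0).T @ (x0 - m0) / len(x0)
c1 = (x1 - m1).T @ (x1 - m1) / len(x1)
cov_inv = np.linalg.inv((len(x0) * c0 + len(x1) * c1) / len(x_demo))

# Same orientation as the notebook: label 0 is plugged in as C_1.
w = cov_inv @ (m0 - m1)
b = -0.5 * m0 @ cov_inv @ m0 + 0.5 * m1 @ cov_inv @ m1 + np.log(len(x0) / len(x1))

z = x_demo @ w + b
p_label_0 = 1.0 / (1.0 + np.exp(-z))      # this is P(label = 0 | x)

pred_flipped = 1 - np.round(p_label_0)    # what the notebook does
pred_direct = (z < 0).astype(int)         # same decision without the sigmoid (ties at z = 0 aside)

accuracy = np.mean(pred_flipped == y_demo)
print(f'Demo accuracy: {accuracy}')
```

Thresholding `z` directly avoids evaluating `np.exp` at all, which also sidesteps overflow warnings when `-z` is very large.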

In [8]:
x_test = normalize(x_test)
predict = 1 - np.round(1 / (1 + np.exp(-(np.dot(x_test, w) + b))))
rs = pd.DataFrame(predict, columns=["label"]).astype(int)  # np.int was removed in NumPy 1.24
rs.to_csv("ans3.csv")
print("over")
rs

over


| | label |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |
| ... | ... |
| 27617 | 1 |
| 27618 | 0 |
| 27619 | 1 |
| 27620 | 0 |


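### Summary and Outlook

* I do not yet fully understand the matrix operations involved, especially the covariance computation and the function `np.linalg.svd()`.
* I now have a rough grasp of the idea and workflow of a generative model.
* Because this dataset is fairly large, the generative model performs worse than logistic regression.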