
# AdaBoost

In [104]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Input: a training dataset $T = \{ (x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N) \}$, where $x_i \in \mathcal{X} \subseteq \mathbb{R}^n$ and $y_i \in \mathcal{Y} = \{ -1, +1 \}$; a weak learning algorithm.

Output: the final classifier $G(x)$.

## Round 1

Initialize the weight distribution of the training data:
$$ D_1 = (w_{11}, \cdots, w_{1i}, \cdots, w_{1N}) , \quad w_{1i} = \frac{1}{N} , \quad i = 1, 2, \cdots, N $$


In [105]:
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1, -1])
# initial weights: the uniform distribution D_1
D_1 = np.ones(x.shape) / len(x)


Learn from the training data under the weight distribution $D_1$ to obtain the base classifier
$$
G_1(x) : \mathcal{X} \rightarrow \{-1, +1\}
$$

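Learn a classifier $G_1$ that minimizes the weighted error
$e_1 = P(G_1(x_i) \neq y_i) = \sum_{i=1}^{N} w_{1i} I(G_1(x_i) \neq y_i)$. Assume the classifier has the form
$$
G_1(x) =
\begin{cases}
1, & \quad x \leq b \\
-1, & \quad x > b
\end{cases}
$$
We find the threshold $b$ (the boundary) by exhaustive search. First define $G$, its reversed counterpart $G_r$ (which swaps the two labels), and a sign helper: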

In [106]:
def G(b, x):
    if isinstance(x, np.ndarray):
        return np.where(x <= b, 1, -1)
    else:
        return 1 if x <= b else -1

# return the reverse value of G
def G_r(b, x):
    if isinstance(x, np.ndarray):
        return np.where(x <= b, -1, 1)
    else:
        return -1 if x <= b else 1

def sign_function(x):
    if isinstance(x, np.ndarray):
        return np.sign(x)
    else:
        return 1 if x > 0 else -1 if x < 0 else 0

Next, define some lists to store the result of each round. Because round indices start at 1, we first append a placeholder value at index 0.

In [107]:
err_list = []
b_list = []
type_list = [] # the type function, 0 means G, 1 means G_r
err_list.append(0)
b_list.append(0)
type_list.append(-1)

In [108]:
min_err = np.inf
min_b = -1
min_type = -1
# try every value in x as the threshold b (here x[i] == i, so range(len(x)) covers them)
for b in range(len(x)):
  err = 0
  for i in range(len(x)):
    if G(b, x[i]) != y[i]:
      err += D_1[i]
  if err < min_err:
    min_err = err
    min_b = b
    min_type = 0
  # compute the reverse error
  err = 0
  for i in range(len(x)):
    if G_r(b, x[i]) != y[i]:
      err += D_1[i]
  if err < min_err:
    min_err = err
    min_b = b
    min_type = 1

print(min_err, min_b, min_type)

# put result in list
err_list.append(min_err)
b_list.append(min_b)
type_list.append(min_type)

0.30000000000000004 2 0
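The search in the cell above can also be written as a small reusable function. The following is a minimal sketch, not the notebook's own code: the name `best_stump` is hypothetical, and it assumes the candidate thresholds are exactly the values in `x`. It relies on the fact that $G_r$ flips every prediction of $G$, so its weighted error is the total weight minus the error of $G$.

```python
import numpy as np

def best_stump(x, y, D):
    """Return (error, b, kind) of the stump with the smallest weighted error.

    kind 0 is G (x <= b -> +1); kind 1 is the reversed stump G_r.
    Ties keep the first candidate found, like the strict `<` in the loops above.
    """
    best = (np.inf, None, None)
    total = D.sum()
    for b in x:
        pred = np.where(x <= b, 1, -1)   # predictions of G(b, x)
        err = D[pred != y].sum()         # weighted error of G
        # G_r misclassifies exactly the points that G gets right
        for kind, e in ((0, err), (1, total - err)):
            if e < best[0]:
                best = (e, int(b), kind)
    return best

x = np.arange(10)
y = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1, -1])
D_1 = np.ones(len(x)) / len(x)

err, b, kind = best_stump(x, y, D_1)
print(err, b, kind)
```

Passing a different weight vector in place of `D_1` reuses the same search in later rounds.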


The result is
$$
G_1(x) =
\begin{cases}
1, & \quad x \leq 2 \\
-1, & \quad x > 2
\end{cases}
$$
The error $e_1$ is 0.3.


Compute the coefficient of $G_1(x)$:
\begin{equation}
\alpha_1 = \frac{1}{2}\log{\frac{1-e_1}{e_1}}
\end{equation}

In [109]:
alpha_list = []
alpha_list.append(0)
# calculate alpha
alpha1 = 0.5 * np.log((1 - err_list[1]) / err_list[1])
alpha_list.append(alpha1)
print(alpha1)

0.4236489301936017
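To see how the coefficient depends on the error, the formula can be wrapped in a helper and evaluated at a few error rates. This is only a sketch: `stump_weight` and its `eps` guard are assumptions added for illustration (the formula is undefined at an error of exactly 0 or 1). The coefficient is positive when the error is below 0.5, zero at 0.5, and grows as the error shrinks.

```python
import numpy as np

def stump_weight(err, eps=1e-10):
    """alpha = 0.5 * log((1 - err) / err), with err clipped away from 0 and 1."""
    err = np.clip(err, eps, 1 - eps)  # eps is an illustrative guard, not part of the formula
    return 0.5 * np.log((1 - err) / err)

for e in (0.1, 0.3, 0.5, 0.7):
    print(e, stump_weight(e))
```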


This gives
$$
f_1(x) = \alpha_1G_1(x)
$$

The classifier is $\textbf{sign}[f_1(x)]$. Count the misclassified points:

In [110]:
err_count = 0
for i in range(len(x)):
  if sign_function(alpha_list[1] * G(b_list[1], x[i])) != y[i]:
    err_count += 1
print(err_count)

3


## Round 2

Update the weight distribution of the training data set
\begin{equation}
D_{m+1} = (w_{m+1,1}, \cdots, w_{m+1,i}, \cdots, w_{m+1,N})
\end{equation}

\begin{equation}
w_{m+1,i} = \frac{w_{mi}}{Z_m}\exp(-\alpha_m y_i G_m(x_i)), \quad i = 1,2,\cdots,N
\end{equation}

where $Z_m$ is the normalization factor
\begin{equation}
Z_m = \sum_{i=1}^{N}w_{mi}\exp(-\alpha_m y_i G_m(x_i))
\end{equation}

This normalization makes $D_{m+1}$ a probability distribution.
Here we compute $D_2$ from the results of the previous round.

In [111]:
D_2 = D_1 * np.exp(-alpha_list[1] * y * G(b_list[1], x))
D_2 = D_2 / D_2.sum()
print(D_2)

[0.07142857 0.07142857 0.07142857 0.07142857 0.07142857 0.07142857
 0.16666667 0.16666667 0.16666667 0.07142857]
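A useful property of this update: with $\alpha_m = \frac{1}{2}\log\frac{1-e_m}{e_m}$, the points misclassified by $G_m$ carry exactly half of the total weight after normalization, which is why the next round has to focus on them. The sketch below checks this numerically for round 1; the helper name `update_weights` is hypothetical, and the threshold 2 is taken from the round 1 result.

```python
import numpy as np

def update_weights(D, alpha, y, pred):
    """One reweighting step: w_i * exp(-alpha * y_i * G(x_i)), then normalize."""
    w = D * np.exp(-alpha * y * pred)
    return w / w.sum()

x = np.arange(10)
y = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1, -1])
D_1 = np.ones(len(x)) / len(x)

pred_1 = np.where(x <= 2, 1, -1)          # G_1 found in round 1
e_1 = D_1[pred_1 != y].sum()              # weighted error of G_1
alpha_1 = 0.5 * np.log((1 - e_1) / e_1)

D_2 = update_weights(D_1, alpha_1, y, pred_1)
wrong_mass = D_2[pred_1 != y].sum()       # total weight on the misclassified points
print(D_2)
print(wrong_mass)
```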


In [112]:
min_err = np.inf
min_b = -1
min_type = -1
# try every value in x as the threshold b (here x[i] == i, so range(len(x)) covers them)
for b in range(len(x)):
  err = 0
  for i in range(len(x)):
    if G(b, x[i]) != y[i]:
      err += D_2[i]
  if err < min_err:
    min_err = err
    min_b = b
    min_type = 0
  # compute the reverse error
  err = 0
  for i in range(len(x)):
    if G_r(b, x[i]) != y[i]:
      err += D_2[i]
  if err < min_err:
    min_err = err
    min_b = b
    min_type = 1

print(min_err, min_b, min_type)

# put result in list
err_list.append(min_err)
b_list.append(min_b)
type_list.append(min_type)

0.21428571428571427 8 0


In [113]:
alpha2 = 0.5 * np.log((1 - err_list[2]) / err_list[2])
alpha_list.append(alpha2)
print(alpha2)

0.6496414920651304


The result is
$$
G_2(x) =
\begin{cases}
1, & \quad x \leq 8 \\
-1, & \quad x > 8
\end{cases}
$$
The error $e_2$ is 0.2143, and
$$
f_2(x) = \alpha_1G_1(x) + \alpha_2G_2(x)
$$
The classifier is $\textbf{sign}[f_2(x)]$. Count the misclassified points:

In [114]:
err_count = 0
for i in range(len(x)):
  if sign_function(alpha_list[2] * G(b_list[2], x[i]) + alpha_list[1] * G(b_list[1], x[i])) != y[i]:
    err_count += 1
print(err_count)

3


## Round 3

In [115]:
D_3 = D_2 * np.exp(-alpha_list[2] * y * G(b_list[2], x))
D_3 = D_3 / D_3.sum()
print(D_3)

[0.04545455 0.04545455 0.04545455 0.16666667 0.16666667 0.16666667
 0.10606061 0.10606061 0.10606061 0.04545455]


In [116]:
min_err = np.inf
min_b = -1
min_type = -1
# try every value in x as the threshold b (here x[i] == i, so range(len(x)) covers them)
for b in range(len(x)):
  err = 0
  for i in range(len(x)):
    if G(b, x[i]) != y[i]:
      err += D_3[i]
  if err < min_err:
    min_err = err
    min_b = b
    min_type = 0
  # compute the reverse error
  err = 0
  for i in range(len(x)):
    if G_r(b, x[i]) != y[i]:
      err += D_3[i]
  if err < min_err:
    min_err = err
    min_b = b
    min_type = 1


print(min_err, min_b, min_type)

# put result in list
err_list.append(min_err)
b_list.append(min_b)
type_list.append(min_type)

0.18181818181818185 5 1


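The resulting classifier is
$$
G_3(x) =
\begin{cases}
-1, & \quad x \leq 5 \\
1, & \quad x > 5
\end{cases}
$$
Note that this stump is the reversed type ($G_r$): points at or below the threshold are negative and points above it are positive. The error $e_3$ is 0.1818.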

In [117]:
alpha3 = 0.5 * np.log((1 - err_list[3]) / err_list[3])
alpha_list.append(alpha3)
print(alpha3)

0.752038698388137


The updated distribution $D_4$ is

In [118]:
D_4 = D_3 * np.exp(-alpha_list[3] * y * G_r(b_list[3], x))
D_4 = D_4 / D_4.sum()
print(D_4)

[0.125      0.125      0.125      0.10185185 0.10185185 0.10185185
 0.06481481 0.06481481 0.06481481 0.125     ]


This gives $f_3(x) = \alpha_1 G_1(x) + \alpha_2 G_2(x) + \alpha_3 G_3(x)$,
and the decision function is
$$
\textbf{sign}[f_3(x)]
$$
Count the misclassified points:

In [119]:
err_count = 0
for i in range(len(x)):
  if sign_function(alpha_list[3] * G_r(b_list[3], x[i]) + alpha_list[2] * G(b_list[2], x[i]) + alpha_list[1] * G(b_list[1], x[i])) != y[i]:
    err_count += 1
print(err_count)

0


There are no misclassified points, so training stops.
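The three rounds above repeat the same steps, so they can be folded into a single loop. The following is a minimal sketch under stated assumptions, not a definitive implementation: the names `stump_predict` and `fit_adaboost`, the `max_rounds` cap, and the small floor on the error are all introduced here for illustration. It uses the same threshold stumps and the same tie-breaking order as the cells above.

```python
import numpy as np

def stump_predict(x, b, kind):
    """kind 0: G (x <= b -> +1); kind 1: G_r (x <= b -> -1)."""
    pred = np.where(x <= b, 1, -1)
    return pred if kind == 0 else -pred

def fit_adaboost(x, y, max_rounds=10):
    """Boost threshold stumps until the training error is 0 or max_rounds is hit.

    Returns a list of (alpha, b, kind) tuples, one per round.
    """
    D = np.ones(len(x)) / len(x)          # D_1: uniform weights
    f = np.zeros(len(x))                  # running value of f_m(x) on the training set
    stumps = []
    for _ in range(max_rounds):
        # exhaustive search over thresholds and both stump orientations
        best_err, best_b, best_kind = np.inf, None, None
        for b in x:
            for kind in (0, 1):
                err = D[stump_predict(x, b, kind) != y].sum()
                if err < best_err:
                    best_err, best_b, best_kind = err, b, kind
        best_err = max(best_err, 1e-10)   # assumed guard against log of a zero error
        alpha = 0.5 * np.log((1 - best_err) / best_err)
        pred = stump_predict(x, best_b, best_kind)
        stumps.append((alpha, int(best_b), best_kind))
        f += alpha * pred
        if np.all(np.sign(f) == y):       # no misclassified points: stop
            break
        D = D * np.exp(-alpha * y * pred) # reweight and normalize
        D /= D.sum()
    return stumps

x = np.arange(10)
y = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1, -1])
stumps = fit_adaboost(x, y)
for alpha, b, kind in stumps:
    print(alpha, b, kind)
```

On this dataset the loop stops after three rounds with the same thresholds and stump types as the step-by-step cells.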