## 数据准备

In [1]:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

In [2]:
data=load_breast_cancer()
X=data.data
Y=data.target

X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.2)
print(X)

[[1.799e+01 1.038e+01 1.228e+02 ... 2.654e-01 4.601e-01 1.189e-01]
 [2.057e+01 1.777e+01 1.329e+02 ... 1.860e-01 2.750e-01 8.902e-02]
 [1.969e+01 2.125e+01 1.300e+02 ... 2.430e-01 3.613e-01 8.758e-02]
 ...
 [1.660e+01 2.808e+01 1.083e+02 ... 1.418e-01 2.218e-01 7.820e-02]
 [2.060e+01 2.933e+01 1.401e+02 ... 2.650e-01 4.087e-01 1.240e-01]
 [7.760e+00 2.454e+01 4.792e+01 ... 0.000e+00 2.871e-01 7.039e-02]]


数据$X$是一个$(n{\times}m)$的矩阵，每一行是一个样本，每一列代表一个特征：

In [3]:
n=X_train.shape[0]
m=X_train.shape[1]

标签$Y$是一个列向量，其行数与$X$相同：

In [4]:
Y_train = Y_train.reshape((n, 1))
Y_test = Y_test.reshape((-1, 1))

## 粗略模型

模型表达式为：
$$
\hat{Y}=\sigma{(XW+b)}
$$
其中
$$
\sigma(x)=\frac{1}{1+e^{-x}}
$$
权重系数$W$的形状为$(m,1)$，偏置系数$b$为单变量系数。这里注意sigmoid函数的曲线，**只有当$XW+b$处于非常有限的范围内如$[-5,5]$时才能被sigmoid函数划分到$[0,1]$区间，而$XW+b$相当于是对原始数据的一个线性回归，想要线性回归的结果落在$[-5,5]$的范围内，需要对数据做预处理，或者缩小初始的$W$与$b$。实际发现缩小$W$的效果远优于对数据的预处理。**

In [5]:
def sigmoid(x):
    return 1/(1+np.exp(-x))

# 缩小的初始权重参数
W = 0.001*np.random.randn(m).reshape((m, 1))  # 权重
b = 0  # 偏置

Y_hat=sigmoid(np.dot(X_train, W)+b)

模型的损失函数为：
$$
\begin{align}
L&=-\sum\limits_{i=1}^n[y^{(i)}\ln{\hat{y}^{(i)}}+(1-y^{(i)})\ln{(1-\hat{y}^{(i)})}] \\
&=-\frac{1}{n}[Y^{T}\ln{\hat{Y}}+(1-Y)^{T}\ln{(1-\hat{Y})}] \\
\end{align}
$$
损失函数关于参数$W$与$b$的梯度可以求得：
$$
\begin{align}
\frac{\partial{L}}{\partial{W}}&=\frac{1}{n}X^{T}{\cdot}(\hat{Y}-Y) \\
\frac{\partial{L}}{\partial{b}}&=\frac{1}{n}{\cdot}[1,1,...,1](\hat{Y}-Y) \\
\end{align}
$$

In [6]:
dW = X_train.T.dot(Y_hat - Y_train) / n
db = np.sum(Y_hat - Y_train) / n

参数的迭代更新公式：
$$
W:=W-{\alpha}\frac{\partial{L}}{\partial{W}}, \quad b:b-{\alpha}\frac{\partial{L}}{\partial{b}}
$$

In [7]:
max_iter=2000
alpha=0.00001        # 注意学习率过大会导致震荡，然后误差越来越大

for i in range(max_iter):
    Y_hat=sigmoid(np.dot(X_train, W)+b)
    
    dW = X_train.T.dot(Y_hat - Y_train) / n
    db = np.sum(Y_hat - Y_train) / n
    
    W = W - alpha * dW
    b = b - alpha * db

使用该模型分别对训练集与预测集做预测。**注意这里有一个实现上的坑，就是经过信号函数处理过的输出无法用于计算logistics regression的交叉熵计算，因为$ln(x)$函数不能接受0作为参数。**所以说如果要设计一个函数可以计算训练模型的交叉熵损失，必须提供模型的$W$与$b$，使用模型的原始输出概率进行计算。

In [8]:
threshold = 0.5

# 注意以下输出值不能用于计算交叉熵，只能用于计算准确率
Y_pred_train = np.squeeze(
    np.where(sigmoid(np.dot(X_train, W)+b) > threshold, 1, 0))
Y_pred_test = np.squeeze(
    np.where(sigmoid(np.dot(X_test, W)+b) > threshold, 1, 0))

定义一个Precision函数来评价模型的表现：

In [20]:
def ACC(Y_true,Y_pred):
    return np.sum(Y_true==Y_pred)/len(Y_true)

print(ACC(np.squeeze(Y_train),Y_pred_train),ACC(np.squeeze(Y_test),Y_pred_test))

0.9186813186813186 0.9298245614035088


模型简单打包：

In [10]:
def logit_reg(X,Y,alpha=0.0001,max_iter=2000,threshold=0.5):
    n=X.shape[0]
    m=X.shape[1]
    
    W = 0.001*np.random.rand(m).reshape((m, 1))  # 权重
    b = 0  # 偏置

    for i in range(max_iter):
        Y_hat=sigmoid(np.dot(X_train, W)+b)

        dW = X.T.dot(Y_hat - Y) / n
        db = np.sum(Y_hat - Y) / n

        W = W - alpha * dW
        b = b - alpha * db

        if i%200==200-1:
            Y_hat=sigmoid(np.dot(X_train, W)+b)
            L=np.sum(-np.dot(Y.T,np.log(Y_hat))-np.dot((1-Y).T,np.log(1-Y_hat)))/n
            print(L,end=' ')

    return W,b

W,b=logit_reg(X_train,Y_train)

14.969459649871379 0.8889286352618059 0.6024099789774738 1.0113621935047468 1.3403396616472885 inf 0.3922188407667311 0.6791125734979867 0.36252126288375014 0.6171411628445601 



## 数据归一化
**Normalization：**
$$
x=\frac{x-x_{min}}{x_{max}-x_{min}}
$$

In [11]:
X=np.row_stack((X_train,X_test))

X_max=X.max(axis=0)
X_min=X.min(axis=0)

X_train_norm=(X_train-X_min)/(X_max-X_min)
X_test_norm=(X_test-X_min)/(X_max-X_min)

对数据归一化之后再测试模型表现：

In [12]:
W,b=logit_reg(X_train_norm,Y_train)

0.9434003602974719 0.6903347450947397 0.6799181219280664 0.6667887984429572 0.6539288433696916 0.6417908644238418 0.6303663828274414 0.6196028332373115 0.6094476975017823 0.5998527891866081 

因为数据做了归一化，整个数据集上的梯度分布得到了改良，所以可以调大学习率，由此可以看出数据标准化在logistic regression上的威力：

In [13]:
W,b=logit_reg(X_train_norm,Y_train,alpha=0.1)

5.690098513474843 0.19688483730940615 0.1693336426731653 0.15250373609580045 0.14363072553377784 0.13676765162018478 0.13101335899922253 0.12610817931191173 0.12187660983343392 0.1181909766881168 

**Standardization：**
$$
x=\frac{x-x_{\mu}}{\sigma}
$$

In [14]:
X=np.row_stack((X_train,X_test))

X_avg=X.mean(axis=0)
X_std=X.std(axis=0)

X_train_std=(X_train-X_avg)/X_std
X_test_std=(X_test-X_avg)/X_std

In [15]:
W,b=logit_reg(X_train_std,Y_train)

0.7687581313007381 4.81839763375368 10.351627426927369 15.890988362318414 21.430422382596937 26.96985809088612 32.50929385037887 38.04872961160914 43.588165372901095 49.12760113419542 

In [16]:
W,b=logit_reg(X_train_std,Y_train,alpha=0.0001)

0.8097929714536234 4.756176249460121 10.2890356963611 15.828393099019062 21.367827045001125 26.907262751116185 32.44669851053608 37.98613427176376 43.5255700330556 49.06500579434992 