# Logistic Regression

## Logistic Regression
### binomial logistic regression model
$P(Y=1|x)=\frac{\exp(w\cdot x+b)}{1+\exp(w\cdot x+b)}$

$P(Y=0|x)=\frac{1}{1+\exp(w\cdot x+b)}$

$\operatorname{logit}(p)=\log\frac{p}{1-p}=w\cdot x+b$, where $p=P(Y=1|x)$
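
As a quick numerical check of the formulas above, the following minimal sketch evaluates $P(Y=1|x)$ and confirms that the log-odds are linear in $x$. The parameter values `w`, `b` and the input `x` are made up for illustration.

```python
import math

def p_y1(x, w, b):
    """P(Y=1|x) for the binomial logistic regression model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    # 1 / (1 + exp(-z)) is algebraically equal to exp(z) / (1 + exp(z))
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Log-odds of a probability p."""
    return math.log(p / (1.0 - p))

# Hypothetical parameters and input, chosen so that w.x + b = 0.25
w, b = [0.5, -1.0], 0.25
x = [2.0, 1.0]

p1 = p_y1(x, w, b)   # P(Y=1|x)
p0 = 1.0 - p1        # P(Y=0|x)
print(p1, p0, logit(p1))
```

The printed log-odds should match $w\cdot x+b$ up to floating-point error.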

### multi-nomial logistic regression model
$P(Y=k|x)=\frac{\exp(w_k\cdot x)}{1+\sum_{j=1}^{K-1}\exp(w_j\cdot x)},\ k=1,2,\cdots,K-1$

$P(Y=K|x)=\frac{1}{1+\sum_{j=1}^{K-1}\exp(w_j\cdot x)}$
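
A minimal sketch of these probabilities, assuming $K-1$ weight vectors with class $K$ acting as the reference class. The weights in `W` are hypothetical.

```python
import math

def multinomial_probs(x, W):
    """Return [P(Y=1|x), ..., P(Y=K|x)] given K-1 weight vectors in W."""
    scores = [math.exp(sum(wi * xi for wi, xi in zip(w_k, x))) for w_k in W]
    denom = 1.0 + sum(scores)  # the leading 1 belongs to the reference class K
    return [s / denom for s in scores] + [1.0 / denom]

# Hypothetical weights for classes 1 and 2, so K = 3
W = [[1.0, 0.0], [0.0, 1.0]]
probs = multinomial_probs([0.0, 0.0], W)
print(probs)
```

At $x=0$ every score equals 1, so the three classes are equally likely and the probabilities sum to 1.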

## maximum entropy model

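&emsp; Given a training dataset, we can determine the empirical distribution of the joint distribution $P(X, Y)$ and of the marginal distribution $P(X)$, denoted $\tilde{P}(X, Y)$ and $\tilde{P}(X)$ respectively:

$\tilde{P}(X=x, Y=y)=\frac{\nu(X=x, Y=y)}{N}$

$\tilde{P}(X=x)=\frac{\nu(X=x)}{N}$

&emsp; Here $\nu(X=x, Y=y)$ is the number of times the pair $(x, y)$ occurs in the training data, $\nu(X=x)$ is the number of times the input $x$ occurs, and $N$ is the number of training samples.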
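
Both empirical distributions are just normalized counts. The sketch below computes them on a made-up four-sample dataset (the `data` list is hypothetical).

```python
from collections import Counter

# Hypothetical training set of (x, y) pairs
data = [("sunny", "play"), ("sunny", "play"), ("rainy", "stay"), ("sunny", "stay")]
N = len(data)

joint_counts = Counter(data)               # counts of each (x, y) pair
x_counts = Counter(x for x, _ in data)     # counts of each input x

P_xy = {xy: c / N for xy, c in joint_counts.items()}   # empirical joint
P_x = {x: c / N for x, c in x_counts.items()}          # empirical marginal
print(P_xy)
print(P_x)
```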

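&emsp; A feature function $f(x, y)$ describes some fact relating the input $x$ and the output $y$. It is defined as

$$f(x, y)=\begin{cases}
    1, & \text{if } x \text{ and } y \text{ satisfy the fact}\\
    0, & \text{otherwise}
    \end{cases}$$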

&emsp; The expected value of the feature function $f(x, y)$ with respect to the empirical distribution $\tilde{P}(X, Y)$ is denoted $E_{\tilde{P}}(f)$:

$E_{\tilde{P}}(f)=\sum_{x, y}\tilde{P}(x, y)f(x, y)$

&emsp; The expected value of the feature function $f(x, y)$ with respect to the model $P(Y|X)$ and the empirical distribution $\tilde{P}(X)$ is denoted $E_{P}(f)$:

$E_{P}(f)=\sum_{x, y}\tilde{P}(x)P(y|x)f(x, y)$
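
To make the two expectations concrete, here is a small sketch on a made-up dataset. The data, the feature `f`, and the uniform `model` are all hypothetical choices for illustration.

```python
from collections import Counter

# Hypothetical training set and label set
data = [("sunny", "play"), ("sunny", "play"), ("rainy", "stay"), ("sunny", "stay")]
labels = ["play", "stay"]
N = len(data)

P_xy = {xy: c / N for xy, c in Counter(data).items()}             # empirical joint
P_x = {x: c / N for x, c in Counter(x for x, _ in data).items()}  # empirical marginal

def f(x, y):
    """Hypothetical feature: fires when it is sunny and the label is 'play'."""
    return 1 if (x == "sunny" and y == "play") else 0

def model(y, x):
    """Hypothetical model P(y|x): uniform over the labels."""
    return 1.0 / len(labels)

# Expectation under the empirical joint distribution
E_emp = sum(p * f(x, y) for (x, y), p in P_xy.items())
# Expectation under the empirical marginal combined with the model
E_model = sum(P_x[x] * model(y, x) * f(x, y) for x in P_x for y in labels)
print(E_emp, E_model)
```

Here the two values differ (0.5 versus 0.375), which means the uniform model does not reproduce this statistic of the training data.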

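&emsp; If the model is able to capture the information in the training data, we can assume that these two expected values are equal and use this equality as a constraint on learning. Learning the maximum entropy model, that is, maximizing the conditional entropy of $P(Y|X)$ under these constraints, is then equivalent to the following constrained optimization problem, written as minimization of the negative entropy: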

$$
\begin{aligned}
\min_{P\in C}\ &\sum_{x, y}\tilde{P}(x)P(y|x)\log P(y|x)\\
\text{s.t.}\ &E_{P}(f_i)=E_{\tilde{P}}(f_i),\quad i=1,2,\cdots,n\\
      &\sum_yP(y|x)=1
\end{aligned}
$$
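
The sketch below evaluates the objective and the expectation constraint for two candidate models on a toy problem. All numbers (`P_x`, `E_emp`, the feature `f`, and both models) are hypothetical.

```python
import math

labels = ["play", "stay"]
P_x = {"sunny": 0.75, "rainy": 0.25}   # hypothetical empirical marginal
E_emp = 0.5                             # hypothetical empirical expectation of f

def f(x, y):
    """Hypothetical feature: fires when it is sunny and the label is 'play'."""
    return 1 if (x == "sunny" and y == "play") else 0

def neg_entropy(model):
    """Objective: sum over x, y of P~(x) P(y|x) log P(y|x)."""
    return sum(P_x[x] * model[x][y] * math.log(model[x][y])
               for x in P_x for y in labels)

def expectation(model):
    """Model expectation E_P(f)."""
    return sum(P_x[x] * model[x][y] * f(x, y) for x in P_x for y in labels)

uniform = {x: {"play": 0.5, "stay": 0.5} for x in P_x}
constrained = {"sunny": {"play": 2 / 3, "stay": 1 / 3},
               "rainy": {"play": 0.5, "stay": 0.5}}

for name, m in [("uniform", uniform), ("constrained", constrained)]:
    print(name, neg_entropy(m), expectation(m))
```

The uniform model has the smaller objective (higher entropy) but violates the constraint, while the second model satisfies $E_P(f)=E_{\tilde{P}}(f)$ and stays uniform wherever the constraint leaves it free.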

## Newton's method and quasi-Newton methods

$x^{(k+1)}=x^{(k)}-H_{k}^{-1}\nabla f(x^{(k)})$

&emsp; Here $H_k$ is the Hessian matrix of $f$ evaluated at $x^{(k)}$.
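
A minimal sketch of the Newton iteration, assuming a hypothetical test function $f(x)=e^{x_1}-x_1+x_2^2$ whose minimizer is the origin. The function names `grad`, `hessian`, and `newton` are illustrative.

```python
import numpy as np

def grad(x):
    """Gradient of the hypothetical f(x) = exp(x1) - x1 + x2**2."""
    return np.array([np.exp(x[0]) - 1.0, 2.0 * x[1]])

def hessian(x):
    """Hessian of the same function."""
    return np.array([[np.exp(x[0]), 0.0], [0.0, 2.0]])

def newton(x0, tol=1e-10, max_iter=50):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        # Solve H p = g instead of forming the inverse Hessian explicitly
        x = x - np.linalg.solve(hessian(x), g)
    return x

x_star = newton([1.0, 1.0])
print(x_star)
```

Solving the linear system is usually preferred over inverting $H_k$, since it is cheaper and numerically more stable.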

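&emsp; Quasi-Newton methods use a matrix $G_k$ as an approximation of $H_k^{-1}$ and require it to satisfy the quasi-Newton (secant) condition: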

$G_{k+1}(\nabla f(x^{(k+1)})-\nabla f(x^{(k)}))=x^{(k+1)}-x^{(k)}$

### BFGS

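* Let $g$ be the gradient of $f$ and let $B$ approximate the Hessian of $f$. Choose an initial point $w^{(0)}$ and take $B_0$ to be a symmetric positive definite matrix

* Compute $g_k=g(w^{(k)})$; if $g_k=0$, stop

* Solve $B_kp_k=-g_k$ for $p_k$, then run a line search for $\lambda_k=\arg\min_{\lambda \geq 0}f(w^{(k)}+\lambda p_k)$

* Set $w^{(k+1)}=w^{(k)}+\lambda_k p_k$

* Update $B_{k+1}=B_k+\frac{y_ky_k^T}{y_k^T\delta_k}-\frac{B_k\delta_k\delta_k^TB_k}{\delta_k^TB_k\delta_k},\ y_k=g_{k+1}-g_k,\ \delta_k=w^{(k+1)}-w^{(k)}$, then increase $k$ by one and return to the second step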
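
The steps above can be sketched as follows. Two things here are assumptions rather than part of the listed algorithm: the objective is a hypothetical convex quadratic, and the exact line search is replaced by a simple backtracking (Armijo) search.

```python
import numpy as np

# Hypothetical convex quadratic f(w) = 0.5 w^T A w - b^T w
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])

def f(w):
    return 0.5 * w @ A @ w - b @ w

def g(w):
    return A @ w - b

def bfgs(w0, tol=1e-6, max_iter=100):
    w = np.asarray(w0, dtype=float)
    B = np.eye(len(w))                     # B_0: symmetric positive definite
    g_k = g(w)
    for _ in range(max_iter):
        if np.linalg.norm(g_k) < tol:      # stop when the gradient vanishes
            break
        p = np.linalg.solve(B, -g_k)       # solve B_k p_k = -g_k
        lam = 1.0
        # Backtracking line search (assumption: stands in for the exact search)
        while f(w + lam * p) > f(w) + 1e-4 * lam * (g_k @ p) and lam > 1e-12:
            lam *= 0.5
        delta = lam * p
        w_new = w + delta
        g_new = g(w_new)
        y = g_new - g_k
        if y @ delta > 1e-12:              # keep B positive definite
            Bd = B @ delta
            B = B + np.outer(y, y) / (y @ delta) - np.outer(Bd, Bd) / (delta @ Bd)
        w, g_k = w_new, g_new
    return w

w_star = bfgs([0.0, 0.0])
print(w_star)
```

For this quadratic the exact minimizer solves $Aw=b$, so the result can be compared against `np.linalg.solve(A, b)`.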

In [1]:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

In [2]:
X, y = load_iris(return_X_y=True)
# Keep the first 100 samples (classes 0 and 1) to get a binary problem
X, y = X[:100], y[:100]
print(X.shape, y.shape)

(100, 4) (100,)


In [3]:
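class LR():
    """Binary linear classifier over the scores [1, exp(w.x)] (no bias term)."""

    def __init__(self, lr=0.01, max_iter=100):
        self.max_iter = max_iter  # number of full-batch update steps
        self.lr = lr              # step size
        self.weight = None        # set by fit()

    def fit(self, X, y):
        N, D = X.shape
        self.weight = np.zeros((D, ))
        for i in range(self.max_iter):
            # NOTE: predict() returns hard 0/1 labels, so this is a
            # perceptron-style update. The gradient of the logistic loss
            # would use sigmoid probabilities in place of y_hat.
            y_hat = self.predict(X)
            dw = np.matmul(X.T, y_hat - y)
            self.weight -= self.lr * dw

    def predict(self, X):
        if self.weight is None:
            raise RuntimeError("call fit() before predict()")

        # Unnormalized class scores: 1 for class 0, exp(w.x) for class 1
        s = np.exp(np.matmul(X, self.weight)).reshape((-1, 1))
        s = np.concatenate((np.ones((X.shape[0], 1)), s), axis=1)
        # Predict class 1 exactly when exp(w.x) > 1, i.e. w.x > 0
        return np.argmax(s, axis=1)

    def score(self, X, y):
        # Fraction of correctly classified samples
        y_hat = self.predict(X)
        return np.sum(y_hat == y) / y.shape[0]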
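
The `fit` method in the class feeds hard 0/1 predictions into the weight update. For comparison, here is a minimal sketch of gradient descent on the logistic loss itself, using sigmoid probabilities and a bias term. The function names, the hyperparameters, and the synthetic two-cluster data are assumptions made for illustration, not part of the notebook.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, max_iter=1000):
    """Full-batch gradient descent on the mean logistic loss."""
    N, D = X.shape
    w, b = np.zeros(D), 0.0
    for _ in range(max_iter):
        p = sigmoid(X @ w + b)            # P(Y=1|x) for every sample
        w -= lr * (X.T @ (p - y)) / N     # gradient with respect to w
        b -= lr * np.mean(p - y)          # gradient with respect to b
    return w, b

# Hypothetical, well-separated synthetic data: two Gaussian clusters
rng = np.random.default_rng(0)
X0 = rng.normal(loc=-2.0, scale=0.5, size=(50, 2))
X1 = rng.normal(loc=2.0, scale=0.5, size=(50, 2))
X_demo = np.vstack([X0, X1])
y_demo = np.concatenate([np.zeros(50), np.ones(50)])

w_fit, b_fit = fit_logistic(X_demo, y_demo)
accuracy = np.mean((sigmoid(X_demo @ w_fit + b_fit) >= 0.5) == y_demo)
print(w_fit, b_fit, accuracy)
```

Averaging the gradient over the $N$ samples keeps the step size independent of the dataset size, which tends to make a fixed learning rate easier to choose.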

In [5]:
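# Shuffle the 100 sample indices and split them 90/10 into train and test sets
indices = np.arange(100)
np.random.shuffle(indices)
X_train, y_train = X[indices[:90]], y[indices[:90]]
X_test, y_test = X[indices[90:]], y[indices[90:]]

logit = LogisticRegression()  # scikit-learn reference model
model = LR()                  # the classifier defined in this notebook
model.fit(X_train, y_train)
logit.fit(X_train, y_train)

# Test accuracy of both models
print(model.score(X_test, y_test), logit.score(X_test, y_test))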

1.0 1.0


