For a binary response, we have two classes, say 0 and 1. We want to estimate the conditional probability $Pr(1|X)$. Given $X$, if $Pr(1|X) > 0.5$, we assign the observation to class 1; otherwise, we assign it to class 0. How do we model the relationship between $X$ and $Pr(1|X)$?

### Logistic regression
If we start from linear regression, we could model $Pr(1| X) = {\bf X}\beta$. However, the fitted value of $Pr(1| X)$ may then be larger than $1$ or negative. To avoid this problem, we have to model $Pr(1| X)$ with a function whose outputs lie in the interval $[0,1]$. If we use the _logistic function_, we get logistic regression.

Logistic regression is formed as $p(x)=Pr(1| X)=\frac{e^{\beta_0+\beta_1x_1+\cdots+\beta_nx_n}}{1+e^{\beta_0+\beta_1x_1+\cdots+\beta_nx_n}}$.
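
As a minimal sketch of why the logistic function helps, the snippet below compares an unbounded linear score with its logistic transform. The coefficients `beta0`, `beta1` and the inputs `x` are hypothetical values chosen only for illustration; they do not come from any data set used here.

```python
import numpy as np

def sigmoid(z):
    """Logistic function e^z / (1 + e^z), written as 1 / (1 + e^{-z})."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients and inputs (illustration only).
beta0, beta1 = -1.0, 2.0
x = np.array([-3.0, 0.0, 0.5, 3.0])

linear_score = beta0 + beta1 * x   # unbounded: can be negative or exceed 1
p = sigmoid(linear_score)          # always strictly between 0 and 1

print(linear_score)
print(p)
```

The linear score ranges outside $[0,1]$, while every entry of `p` is a valid probability; a score of exactly zero maps to $0.5$.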

Here, I prefer to understand logistic regression through its decision boundary.

For a binary response, the decision boundary is the set of points where $Pr(1|X)=Pr(0|X)$, or equivalently where $\log(\frac{Pr(1|X)}{Pr(0|X)})=0$ (__log-odds__). From the logistic regression model expressed above, the log-odds are linear in $x$:

$\text{log-odds} = \log(\frac{Pr(1|X)}{Pr(0|X)}) = \beta_0+\beta_1x_1+\cdots+\beta_nx_n$

Thus the decision boundary is the set of points for which the log-odds are zero, that is, the hyperplane defined by $\{x|\beta_0+\beta_1x_1+\cdots+\beta_nx_n=0\}$.
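
The relationship between the log-odds and the decision rule can be checked numerically. This is a small sketch with hypothetical coefficients (`beta0`, `beta`) and made-up points `X`; it only illustrates that thresholding the probability at $0.5$ is the same as checking the sign of the linear score.

```python
import numpy as np

# Hypothetical coefficients for a two-feature model (illustration only).
beta0 = -1.0
beta = np.array([2.0, -1.0])

# Made-up points; the third one lies exactly on the boundary.
X = np.array([[0.0, 0.0],
              [1.0, 0.5],
              [0.5, 0.0],
              [2.0, 1.0]])

score = beta0 + X @ beta              # linear log-odds
p = 1.0 / (1.0 + np.exp(-score))      # Pr(1 | x)
log_odds = np.log(p / (1.0 - p))      # recovers the linear score

# Class 1 when the log-odds are positive, i.e. when p > 0.5.
predicted_class = (score > 0).astype(int)

print(score)
print(predicted_class)
```

Points with a positive score fall on the class 1 side of the hyperplane, and a point on the boundary has probability exactly $0.5$.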

##### Fitting logistic regression models
We still need to estimate the coefficients. In general, this model is fitted by _maximum likelihood_.

For the two-class case, $Pr(1|X)=1-Pr(0|X)$. The log-likelihood function can be written as $l(\beta)=\sum_{i=1}^{n}\{y_i\log(p(x_i))+(1-y_i)\log(1-p(x_i))\}$, where $n$ now counts the observations rather than the predictors. To estimate $\beta$, we have to maximize the log-likelihood function.
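
To make the log-likelihood concrete, here is a minimal sketch that evaluates it for two candidate coefficient vectors. The helper name `log_likelihood` and the tiny data set are assumptions made for illustration, and `X` is assumed to carry a leading column of ones for the intercept.

```python
import numpy as np

def log_likelihood(beta, X, y):
    """Sum over observations of y_i*log(p_i) + (1 - y_i)*log(1 - p_i)."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Tiny made-up data set: intercept column plus one feature.
X = np.array([[1.0, -2.0],
              [1.0, -1.0],
              [1.0,  1.0],
              [1.0,  2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

# With beta = 0 every observation gets probability 0.5.
ll_zero = log_likelihood(np.zeros(2), X, y)
# A positive slope assigns higher probability to the observed labels.
ll_better = log_likelihood(np.array([0.0, 1.0]), X, y)

print(ll_zero, ll_better)
```

The second coefficient vector gives a larger (less negative) log-likelihood, which is the sense in which maximum likelihood prefers it.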

Setting the derivative of the log-likelihood function with respect to $\beta$ to zero, we get $\sum_{i=1}^{n}x_i(y_i-p(x_i))=0$, where each $x_i$ includes the constant term $1$ for the intercept. These equations are nonlinear in $\beta$, so we solve them with the _Newton-Raphson_ algorithm, which also requires the second derivative, $-\sum_{i=1}^{n}x_ix_i^{T}p(x_i)(1-p(x_i))$. For convenience, we rewrite the first and second derivatives in matrix notation.

The first derivative: ${\bf X}^T({\bf y}-{\bf p})$, where ${\bf p}$ is the vector of fitted probabilities $p(x_i)$.

The second derivative: $-{\bf X}^T{\bf W}{\bf X}$, where ${\bf W}$ is an $n\times n$ diagonal matrix of weights with $i$th diagonal element $p(x_i)(1-p(x_i))$.
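
One way to sanity-check the two matrix expressions is to compare the analytic gradient with a finite-difference approximation of the log-likelihood, and to confirm that the Hessian has no positive eigenvalues. The sketch below does this on synthetic data; the function names, the random data, and the evaluation point `beta` are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(beta, X, y):
    p = sigmoid(X @ beta)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient(beta, X, y):
    """First derivative in matrix form: X^T (y - p)."""
    return X.T @ (y - sigmoid(X @ beta))

def hessian(beta, X):
    """Second derivative in matrix form: -X^T W X."""
    p = sigmoid(X @ beta)
    W = np.diag(p * (1 - p))
    return -X.T @ W @ X

# Synthetic data: intercept column plus two random features.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.normal(size=(20, 2))])
y = (rng.random(20) < 0.5).astype(float)
beta = np.array([0.1, -0.2, 0.3])   # arbitrary evaluation point

# Central finite differences along each coordinate direction.
eps = 1e-6
numeric_grad = np.array([
    (log_likelihood(beta + eps * e, X, y)
     - log_likelihood(beta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])

max_grad_error = np.max(np.abs(numeric_grad - gradient(beta, X, y)))
hessian_eigenvalues = np.linalg.eigvalsh(hessian(beta, X))

print(max_grad_error)
print(hessian_eigenvalues)
```

Because the weights $p(x_i)(1-p(x_i))$ are positive, the Hessian is negative semi-definite, so the log-likelihood is concave and a stationary point is a maximum.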

The Newton step at the $j$th iteration, with ${\bf p}$ and ${\bf W}$ evaluated at $\beta^{(j)}$, is:

$\beta^{(j+1)}=\beta^{(j)}+({\bf X}^T{\bf W}{\bf X})^{-1}{\bf X}^T({\bf y} - {\bf p})$
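
The update can be sketched in a few lines of NumPy. This is an illustrative implementation under stated assumptions, not the routine scikit-learn uses: the function name `fit_logistic_newton`, the iteration cap, the tolerance, and the synthetic data are all hypothetical, and there is no step-size control, so it may fail on perfectly separable data.

```python
import numpy as np

def fit_logistic_newton(X, y, n_iter=25, tol=1e-8):
    """Plain Newton-Raphson for logistic regression.

    X is assumed to include a leading column of ones for the intercept.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))   # fitted probabilities
        w = p * (1 - p)                         # diagonal of W
        grad = X.T @ (y - p)                    # X^T (y - p)
        XtWX = X.T @ (X * w[:, None])           # X^T W X without forming W
        step = np.linalg.solve(XtWX, grad)      # (X^T W X)^{-1} X^T (y - p)
        beta = beta + step
        if np.max(np.abs(step)) < tol:          # stop once the update is tiny
            break
    return beta

# Synthetic, noisy data drawn from a known (hypothetical) model.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([np.ones_like(x), x])
true_beta = np.array([-0.5, 1.5])
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-(X @ true_beta)))).astype(float)

beta_hat = fit_logistic_newton(X, y)
grad_at_solution = X.T @ (y - 1.0 / (1.0 + np.exp(-(X @ beta_hat))))

print(beta_hat)
print(grad_at_solution)
```

At convergence the gradient is numerically zero, which is the first-order condition stated earlier. The estimate will not exactly match `LogisticRegression()` from scikit-learn, because that class applies an L2 penalty by default.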

_(stop here, may add more details about newton methods later)_




In [38]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline 
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_digits
from sklearn import metrics

In [40]:
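def logistic_regression(x, y):
    """Fit logistic regression on a 75/25 split; return test accuracy and confusion matrix."""
    # Hold out 25% of the data for testing; the fixed seed makes the split reproducible
    dataMat_train, dataMat_test, label_train, label_test = train_test_split(
        x, y, test_size=0.25, random_state=42)

    # Fit on the training split with scikit-learn's default settings
    logisticreg = LogisticRegression()
    logisticreg.fit(dataMat_train, label_train)

    # Predict the held-out labels
    predictions = logisticreg.predict(dataMat_test)

    # Mean accuracy on the test split, plus the confusion matrix
    score = logisticreg.score(dataMat_test, label_test)
    cm = metrics.confusion_matrix(label_test, predictions)
    return score, cm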

In [43]:
df=pd.read_csv('../testSetRBF2.txt',sep='\t',header=None)
x=df.iloc[:,0:2]
y=df.iloc[:,2]
score1,cm1 = logistic_regression(x,y)
print(score1)
print(cm1)

0.68
[[8 4]
 [4 9]]


In [44]:
digits = load_digits()
score2,cm2=logistic_regression(digits.data,digits.target)
print(score2)
print(cm2)

0.966666666667
[[42  0  0  0  1  0  0  0  0  0]
 [ 0 36  0  0  0  0  0  0  1  0]
 [ 0  0 38  0  0  0  0  0  0  0]
 [ 0  0  0 44  0  1  0  0  1  0]
 [ 0  1  0  0 54  0  0  0  0  0]
 [ 0  0  1  0  0 56  0  0  1  1]
 [ 0  0  0  0  0  1 44  0  0  0]
 [ 0  0  0  0  0  0  0 40  0  1]
 [ 0  1  0  0  0  1  0  0 36  0]
 [ 0  0  0  0  0  0  0  0  3 45]]
