# Classification and Logistic Regression

The classification problem is much like the [regression problem](/notebooks/machine-learning/supervised-learning/linear-regression.ipynb), except that the values $y$ we want to predict take only a small number of discrete values.

## Logistic regression

Using the [linear regression](/notebooks/machine-learning/supervised-learning/linear-regression.ipynb) approach to predict $y$ given $x$ may perform very poorly. The example below shows why.

In [None]:
import pylab
import numpy as np
from sklearn import linear_model

# First round: fit a line to the six original points
reg = linear_model.LinearRegression()

x = np.array([2, 3, 4, 5, 6, 7])
y = np.array([0, 0, 0, 1, 1, 1])

reg.fit(x.reshape(-1, 1), y)
lx = np.linspace(1, 8, 10)
ly = reg.predict(lx.reshape(-1, 1))

pylab.plot(lx, ly, 'blue')
pylab.plot(x, y, 'bo')

# Second round: refit after adding the point (15, 1)
reg = linear_model.LinearRegression()

x = np.array([2, 3, 4, 5, 6, 7, 15])
y = np.array([0, 0, 0, 1, 1, 1, 1])

reg.fit(x.reshape(-1, 1), y)
lx = np.linspace(1, 15, 10)
ly = reg.predict(lx.reshape(-1, 1))

pylab.plot(lx, ly, 'g')
pylab.plot([15], [1], 'go')

Suppose that we are performing linear regression over a training set containing 6 entries such that:

| y | 0 | 0 | 0 | 1 | 1 | 1 |
|:-:|---|---|---|---|---|---|
| x | 2 | 3 | 4 | 5 | 6 | 7 |

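The fitted blue line is a reasonable predictor: thresholding it at 0.5 places the decision boundary at $x = 4.5$, which separates the two classes. Now suppose we add one more point, $(15, 1)$. It agrees with the existing pattern, so it should not change our classification, yet it pulls the fitted line toward itself (see the green line) and moves the 0.5 crossing to about $x \approx 5.1$, so the training point $x = 5$ is now predicted as 0.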
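To put numbers on the two fits, here is a minimal sketch that computes where each least-squares line crosses 0.5. It assumes that a class label is obtained by thresholding the linear prediction at 0.5; the helper name `threshold_crossing` is hypothetical, and `np.polyfit` is used instead of scikit-learn only to keep the example short.

```python
import numpy as np

def threshold_crossing(x, y, level=0.5):
    """Fit y ~ slope * x + intercept and return the x where the line equals `level`."""
    slope, intercept = np.polyfit(x, y, 1)
    return (level - intercept) / slope

# The six original training points
x6 = np.array([2, 3, 4, 5, 6, 7], dtype=float)
y6 = np.array([0, 0, 0, 1, 1, 1], dtype=float)

# The same points plus the extra observation (15, 1)
x7 = np.append(x6, 15.0)
y7 = np.append(y6, 1.0)

boundary_before = threshold_crossing(x6, y6)
boundary_after = threshold_crossing(x7, y7)
print(boundary_before, boundary_after)
```

If the second boundary lands to the right of $x = 5$, that training point falls on the wrong side, even though the added point was consistent with the labels.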

To fix this, let's change the form of our hypothesis $h_{\theta}(x)$. We will choose

$$h_{\theta}(x) = g(\theta^{T}x) = \frac{1}{1+e^{-\theta^{T}x}}$$

where

$$g(z)=\frac{1}{1+e^{-z}}$$

is called the logistic function or the sigmoid function.
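As a small illustration, the logistic function can be written in a few lines of NumPy. This is only a sketch; the name `sigmoid` is our own choice.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z)), applied element-wise."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

# g maps large negative inputs close to 0, zero to 0.5, and large positive inputs close to 1
z = np.array([-5.0, 0.0, 5.0])
g = sigmoid(z)
print(g)
```

Because the output always lies strictly between 0 and 1, it can be read as a probability, which is how $h_{\theta}(x)$ is used in the rest of this section.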

Other functions that smoothly increase from 0 to 1 can also be used, but the sigmoid function is a natural choice. Before moving on, here's a useful property of the derivative of the sigmoid function, which we write as $g'$:

$$
\begin{align}
g'(z) &= \frac{\partial}{\partial{z}} \frac{1}{1+ e^{-z}} \\
&= \frac{1}{(1+e^{-z})^2} e^{-z} \\
&= \frac{1}{1+e^{-z}} \big(1- \frac{1}{1+e^{-z}}\big) \\
&= g(z)(1-g(z)) \\
\end{align}
$$
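The identity $g'(z) = g(z)(1-g(z))$ can be sanity-checked numerically. The sketch below compares it with a central finite difference; the step size `eps` and the grid of test points are arbitrary choices made for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-4.0, 4.0, 9)
eps = 1e-6

# Central difference approximation of g'(z)
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2.0 * eps)

# Closed form derived above: g'(z) = g(z) * (1 - g(z))
analytic = sigmoid(z) * (1.0 - sigmoid(z))

max_err = np.max(np.abs(numeric - analytic))
print(max_err)
```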

So, given the logistic regression model, how do we fit $\theta$ for it?

Let's assume that

$$
\begin{cases}
\begin{align}
P(y=1|x;\theta) &= h_{\theta}(x) \\
P(y=0|x;\theta) &= 1- h_{\theta}(x) \\
\end{align}
\end{cases}
\implies p(y|x;\theta)=(h_{\theta}(x))^y (1-h_{\theta}(x))^{1-y}
$$

Assuming that the $m$ training examples were generated independently, we can write down the likelihood $L(\theta)$ and then maximize the log likelihood $l(\theta)$:

$$
\begin{align}
L(\theta) &= p(\vec{y} | X; \theta) \\
&= \prod_{i=1}^{m} p(y^{(i)} | x^{(i)}; \theta) \\
&= \prod_{i=1}^{m} (h_{\theta}(x^{(i)}))^{y^{(i)}} (1 - h_{\theta}(x^{(i)}))^{1-y^{(i)}} \\
 \\
l(\theta) &= \log L(\theta) \\
&= \sum_{i=1}^{m} \big( y^{(i)} \log h_{\theta}(x^{(i)}) + (1 - y^{(i)}) \log(1 - h_{\theta}(x^{(i)})) \big)
\end{align}
$$
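The log likelihood translates directly into code. The following sketch evaluates $l(\theta)$ on the six-point toy set from the beginning of this section, assuming a leading column of ones in `X` so that $\theta_0$ acts as an intercept. The function name `log_likelihood` and the second parameter vector are illustrative choices, not fitted values.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    """l(theta) = sum_i [ y_i * log h_i + (1 - y_i) * log(1 - h_i) ]."""
    h = sigmoid(X @ theta)
    return np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

# Toy data: intercept column plus the single feature x
x = np.array([2, 3, 4, 5, 6, 7], dtype=float)
X = np.column_stack([np.ones_like(x), x])
y = np.array([0, 0, 0, 1, 1, 1], dtype=float)

# theta = 0 predicts 0.5 everywhere, so l(theta) = m * log(0.5)
ll_zero = log_likelihood(np.zeros(2), X, y)

# A hand-picked theta whose decision boundary sits at x = 4.5
ll_better = log_likelihood(np.array([-4.5, 1.0]), X, y)
print(ll_zero, ll_better)
```

A parameter vector that separates the two classes should score a higher log likelihood than the uninformative $\theta = 0$.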

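How do we maximize the likelihood? Applying [gradient ascent](/notebooks/math/gradient-descent.ipynb) to a single training example $(x^{(i)}, y^{(i)})$, and using $g'(z) = g(z)(1-g(z))$, we end up, perhaps surprisingly, with the same update rule as in linear regression: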

$$
\theta_{j} := \theta_{j} + \alpha \big(y^{(i)} - h_{\theta}(x^{(i)})\big) x_j^{(i)}
$$

Although this looks the same as the [LMS algorithm](/notebooks/machine-learning/supervised-learning/linear-regression.ipynb), it is a different algorithm, because $h_{\theta}(x^{(i)})$ is now a non-linear function of $\theta^T x^{(i)}$. Nonetheless, the resemblance is not a coincidence.
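The update rule can be sketched as a short stochastic gradient ascent loop. This is an illustrative implementation under several assumptions that are not in the text above: the learning rate `alpha`, the number of passes `epochs`, the fixed visiting order, and centering the feature at 4.5 (done only to keep the steps well scaled). The data are the seven toy points from the start of this section.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def sga_logistic(X, y, alpha=0.1, epochs=200):
    """Stochastic gradient ascent: one update per training example."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            # theta_j := theta_j + alpha * (y_i - h_theta(x_i)) * x_ij, for all j at once
            theta += alpha * (y_i - sigmoid(x_i @ theta)) * x_i
    return theta

x = np.array([2, 3, 4, 5, 6, 7, 15], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1, 1], dtype=float)

# Intercept column plus the centered feature (centering is an assumption of this sketch)
X = np.column_stack([np.ones_like(x), x - 4.5])

theta = sga_logistic(X, y)
print(theta)
```

Predicted labels can then be obtained by thresholding `sigmoid(X @ theta)` at 0.5. Note that the only difference from an LMS loop is the call to `sigmoid` inside the update.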

### Example

Let's implement a locally weighted version of logistic regression. The data file [logistic-regression-data.tar.gz](https://drive.google.com/open?id=1ZfyFgzkZYxtA_O5zv4Y7oEIBAdlE4p1N) contains a set of points $(x, y)$ that we want to learn from in order to make new predictions. The problem is to maximize:

$$
l(\theta) = - \frac{\lambda}{2}\theta^{T}\theta + \sum_{i=1}^{m} w^{(i)} \bigg( y^{(i)} \log h_{\theta}(x^{(i)}) + (1-y^{(i)}) \log (1-h_{\theta}(x^{(i)}))  \bigg)
$$

The term $- \frac{\lambda}{2}\theta^{T}\theta$ is a regularization term, with regularization parameter $\lambda$; it is needed for Newton's method to perform well on this task. Here we will use $\lambda = 0.0001$.

Using this definition, the gradient of $l(\theta)$ is given by

$$\nabla_{\theta} l(\theta) = X^{T}z - \lambda \theta$$

where $z \in \mathbb{R}^{m}$ is defined by

$$z_{i} = w^{(i)} (y^{(i)} - h_{\theta}(x^{(i)}))$$

And the Hessian is given by

$$H = X^T D X - \lambda I$$

where $D \in \mathbb{R}^{m \times m}$ is the diagonal matrix with 

$$D_{ii} = - w^{(i)} h_{\theta}(x^{(i)})(1-h_{\theta}(x^{(i)}))$$

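Given a query point $x$, we compute the weights

$$w^{(i)} = \exp \Big(- \frac{\Vert x-x^{(i)} \Vert^2}{2\tau^2} \Big)$$

where $\tau$ is a bandwidth parameter: a small $\tau$ concentrates the weight on training points close to $x$, while a large $\tau$ weights all points almost equally.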
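A minimal sketch of the weight computation follows. The helper name `gaussian_weights` and the three sample points are hypothetical and chosen only to make the effect of $\tau$ visible.

```python
import numpy as np

def gaussian_weights(X, x_query, tau):
    """w_i = exp(-||x_query - x_i||^2 / (2 * tau^2)), one weight per row of X."""
    sq_dist = np.sum((X - x_query) ** 2, axis=1)
    return np.exp(-sq_dist / (2.0 * tau ** 2))

# Three made-up training points at distance 0, 1 and 5 from the query
X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]])
x_query = np.array([0.0, 0.0])

w_narrow = gaussian_weights(X, x_query, tau=0.5)
w_wide = gaussian_weights(X, x_query, tau=5.0)
print(w_narrow)
print(w_wide)
```

Note the parentheses around `2.0 * tau ** 2`: without them, Python's operator precedence would multiply by $\tau^2$ instead of dividing by it.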

In [17]:
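%run logistic-regression-functions.py

def lwlr(X, y, x, tau):
    """Locally weighted logistic regression (unfinished: no Newton update yet)."""
    m, n = X.shape
    theta = np.zeros(n)

    # Gaussian weights w_i = exp(-||x - x_i||^2 / (2 * tau^2)), Euclidean norm
    w = np.exp(-np.linalg.norm(X - x, axis=1) ** 2 / (2 * tau ** 2))

    # Placeholder gradient; the loop is meant to run until its norm is small
    grad = np.ones(n)
    while np.linalg.norm(grad, ord=1) > 0.000001:
        h = 1 / (1 + np.exp(-np.dot(X, theta)))
        print(h)
        break  # TODO: compute the gradient and Hessian, then apply the Newton update

    return theta

X, y = load_dataset()

# The scalar query point is broadcast across every feature of X
theta = lwlr(X, y, 0.15, 5)

print(theta)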

Downloading 1ZfyFgzkZYxtA_O5zv4Y7oEIBAdlE4p1N into /tmp/tmp61nwb_7h/logistic-regression-data.tar.gz... Done.
[0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5]
[0. 0.]
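The weights, gradient and Hessian defined above can be combined into a Newton iteration $\theta := \theta - H^{-1} \nabla_{\theta} l(\theta)$. The following is a sketch under stated assumptions, not the finished example for the data file: the toy data, the leading column of ones, the tolerance `tol`, the iteration cap `max_iter` and all function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def lwlr_gradient(theta, X, y, w, lam):
    """Gradient X^T z - lambda * theta, with z_i = w_i * (y_i - h_i)."""
    h = sigmoid(X @ theta)
    return X.T @ (w * (y - h)) - lam * theta

def lwlr_hessian(theta, X, w, lam):
    """Hessian X^T D X - lambda * I, with D_ii = -w_i * h_i * (1 - h_i)."""
    h = sigmoid(X @ theta)
    d = -w * h * (1.0 - h)
    return X.T @ (d[:, None] * X) - lam * np.eye(X.shape[1])

def lwlr_newton(X, y, x_query, tau, lam=1e-4, tol=1e-8, max_iter=50):
    """Fit theta for one query point with plain Newton steps (no line search)."""
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2.0 * tau ** 2))
    theta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        grad = lwlr_gradient(theta, X, y, w, lam)
        if np.linalg.norm(grad) < tol:
            break
        # Newton update: theta := theta - H^{-1} grad
        theta = theta - np.linalg.solve(lwlr_hessian(theta, X, w, lam), grad)
    return theta, w

# Made-up, non-separable toy data: intercept column plus one feature
x1 = np.array([-2.0, -1.0, -1.0, 0.0, 0.0, 1.0, 1.0, 2.0])
X = np.column_stack([np.ones_like(x1), x1])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0])

x_query = np.array([1.0, 0.0])
theta, w = lwlr_newton(X, y, x_query, tau=1.0)
p_query = sigmoid(x_query @ theta)
print(theta, p_query)
```

Because the weights depend on the query point, $\theta$ has to be refitted for every new prediction. Solving the linear system with `np.linalg.solve` avoids forming $H^{-1}$ explicitly.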


TODO:

- finish example
- stochastic gradient descent
- Newton's method