### Setup and Imports

In [1]:
import numpy as np

In [11]:
def f(x):
    theta = np.array([3, 1, 3])
    return 1/(1 + np.exp(-1*np.dot(x, theta)))



### Logistic Regression

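Classification takes a set of input features and maps it to a discrete value; for binary classification, $y \in \{0, 1\}$. We therefore want a function that smoothly squashes any real-valued input into the interval $(0, 1)$, so that its output can be thresholded to either $0$ or $1$. There are a couple of candidates, such as $\tanh$ (after rescaling, since its range is $(-1, 1)$) and $\sigma$. Logistic regression uses the logistic function $\sigma$: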

$$h_\theta(x) = \sigma(\theta_n \cdot X) = \frac{1}{1 + e^{-(\theta_1 \cdot x_1 + \ldots + \theta_n \cdot x_n)}}$$

Note: the derivative of $\sigma$ has a convenient closed form, which the gradient derivation relies on.
$$\sigma'(x) = \sigma(x) \cdot (1 - \sigma(x))$$
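
The identity above is easy to sanity-check numerically. The following is a minimal sketch, not part of the original notebook: the helper names `sigmoid` and `sigmoid_prime` and the test points are assumptions chosen for illustration. It compares the closed form against a central finite difference.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Closed-form derivative sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# Compare against a central finite difference at a few arbitrary points.
z = np.array([-2.0, 0.0, 0.5, 3.0])
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
max_abs_diff = np.max(np.abs(numeric - sigmoid_prime(z)))
print(max_abs_diff)  # should be tiny if the identity holds
```

At $x = 0$ the function is $0.5$ and its slope is $0.25$, which is the steepest point of the curve.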

### Loss 
Let's interpret the output of our function as a probability, such that

$$P(y=1; x, \theta) = h_\theta(x)$$
and 
$$P(y=0; x, \theta) = 1 - h_\theta(x)$$

Now we want to maximize the probability that our classifier is right, so we create a function which returns the probability our classifier assigned to the correct class label.

$$P(y; x, \theta) = h_\theta(x)^y \cdot (1 - h_\theta(x))^{1-y}$$

E.g., if the correct class label is $1$, the second term becomes $(1 - h_\theta(x))^0 = 1$ and we return $h_\theta(x)$, which is exactly the probability (according to our classifier) that $y = 1$. The same holds in reverse for $y = 0$.
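
A small sketch of this function may make the exponent trick concrete. The name `prob_of_label` and the feature vector `x` are hypothetical; the weights reuse the `theta` from `f` at the top of the notebook.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def prob_of_label(y, x, theta):
    """P(y; x, theta) = h^y * (1 - h)^(1 - y) for y in {0, 1}."""
    h = sigmoid(np.dot(x, theta))
    return h ** y * (1.0 - h) ** (1 - y)

theta = np.array([3.0, 1.0, 3.0])  # same weights as f above
x = np.array([0.5, -1.0, 0.2])     # hypothetical feature vector

h = sigmoid(np.dot(x, theta))
p1 = prob_of_label(1, x, theta)    # the (1 - h) factor has exponent 0
p0 = prob_of_label(0, x, theta)    # the h factor has exponent 0
print(h, p1, p0)
```

Whichever label is passed in, exactly one of the two factors survives, so the two probabilities sum to $1$.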

### Gradient Derivation
The likelihood over $n$ independent r.v.'s (our training set), where $(x^{(i)}, y^{(i)})$ denotes the $i$-th training example, would then be
$$\prod_{i=1}^n P(y^{(i)}; x^{(i)}, \theta) = \prod_{i=1}^n h_\theta(x^{(i)})^{y^{(i)}} \cdot (1 - h_\theta(x^{(i)}))^{1-y^{(i)}}$$

We can see, though, that this is pretty hard to differentiate. Since we are dealing with multiplication and exponents and would rather see addition and coefficients, and because $\log$ is strictly increasing (so maximizing the log of a function also maximizes the function itself), we can instead maximize the log-likelihood, given by

$$J_\theta = \log P(y; x, \theta) = \log \prod_{i=1}^n h_\theta(x^{(i)})^{y^{(i)}} \cdot (1 - h_\theta(x^{(i)}))^{1-y^{(i)}} =$$
$$\sum_{i=1}^n \left[ y^{(i)} \cdot \log(h_\theta(x^{(i)})) + (1- y^{(i)}) \cdot \log(1 - h_\theta(x^{(i)})) \right]$$
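
To see that the sum really is the log of the product, here is a minimal sketch. The toy data `X`, `y` and the weights `theta` are made up for illustration, and `log_likelihood` is a hypothetical helper name.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    """Sum over examples of y*log(h) + (1 - y)*log(1 - h)."""
    h = sigmoid(X @ theta)
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

# Hypothetical toy data: 4 examples, 3 features each.
X = np.array([[1.0, 0.5, -1.0],
              [1.0, -1.5, 0.3],
              [1.0, 2.0, 0.8],
              [1.0, -0.2, -0.7]])
y = np.array([0, 0, 1, 1])
theta = np.array([0.1, 0.4, -0.2])

h = sigmoid(X @ theta)
likelihood = np.prod(h ** y * (1 - h) ** (1 - y))  # the raw product
J = log_likelihood(theta, X, y)
print(np.log(likelihood), J)  # the two values should agree
```

Besides being easier to differentiate, the sum form avoids multiplying many numbers smaller than $1$, which can underflow for large training sets.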

Now we want to maximize $J_\theta$ by following its gradient, knowing that gradient ascent looks something like
$$\theta_j := \theta_j + \alpha \cdot \frac{\partial}{\partial \theta_j} J_\theta$$
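
As an illustration of the update rule on its own, here is a sketch on a toy concave objective, $J(\theta) = -(\theta - 2)^2$, which is an assumption made purely for demonstration and is not the log-likelihood. The helper `gradient_ascent` and the values of `alpha` and `steps` are likewise hypothetical.

```python
def gradient_ascent(grad, theta, alpha=0.1, steps=100):
    """Repeat theta := theta + alpha * grad(theta)."""
    for _ in range(steps):
        theta = theta + alpha * grad(theta)
    return theta

# Toy concave objective J(theta) = -(theta - 2)^2, maximized at theta = 2.
def grad_J(theta):
    return -2.0 * (theta - 2.0)

theta_hat = gradient_ascent(grad_J, theta=0.0)
print(theta_hat)  # should end up very close to 2
```

Note the plus sign: ascent moves along the gradient to increase $J_\theta$, whereas descent would subtract it.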

For a single training example (the full gradient is the sum of these terms over all $n$ examples), the partial derivative is

$$\frac{\partial}{\partial \theta_j}\left[ y \cdot \log(h_\theta(x)) + (1- y) \cdot \log(1 - h_\theta(x)) \right] = 
\frac{y}{h_\theta(x)} \cdot \frac{\partial h_\theta(x)}{\partial \theta_j} + 
\frac{(1 -y)}{1 - h_\theta(x)} \cdot \frac{-\partial h_\theta(x)}{\partial \theta_j}
$$

$$ = \left(\frac{y}{h_\theta(x)} - \frac{(1 -y)}{1 - h_\theta(x)}\right) \cdot \frac{\partial h_\theta(x)}{\partial \theta_j}$$

To make taking the partial derivative easier, we then substitute $h_\theta(x) = \sigma(\theta_n \cdot X)$ and apply the chain rule together with $\sigma' = \sigma \cdot (1 - \sigma)$

$$ = \left(\frac{y}{\sigma(\theta_n \cdot X)} - \frac{(1 -y)}{1 - \sigma(\theta_n \cdot X)}\right) \cdot \frac{\partial \sigma(\theta_n \cdot X)}{\partial \theta_j} = 
\left(\frac{y}{\sigma(\theta_n \cdot X)} - \frac{(1 -y)}{1 - \sigma(\theta_n \cdot X)}\right) \cdot \sigma(\theta_n \cdot X) \cdot (1 - \sigma(\theta_n \cdot X)) \cdot \frac{\partial \theta_n \cdot X}{\partial \theta_j}$$

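Since $\theta_n \cdot X = \theta_1 \cdot x_1 + \ldots + \theta_n \cdot x_n$, partially differentiating the last factor with respect to $\theta_j$ yields $x_j$:
$$\left(\frac{y}{\sigma(\theta_n \cdot X)} - \frac{(1 -y)}{1 - \sigma(\theta_n \cdot X)}\right) \cdot \sigma(\theta_n \cdot X) \cdot (1 - \sigma(\theta_n \cdot X)) \cdot x_j$$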

Distributing terms gives us

$$(y \cdot (1 - \sigma(\theta_n \cdot X)) - (1-y) \cdot \sigma(\theta_n \cdot X)) \cdot x_j = 
(y - \sigma(\theta_n \cdot X)) \cdot x_j 
$$

$$= (y - h_\theta(x)) \cdot x_j$$

Funnily enough, this is the same result we got when computing the gradient of our previous $J_\theta$.
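
The derived gradient can be checked against finite differences and then plugged into the ascent rule. What follows is a sketch under stated assumptions: the toy data, the starting `theta`, the step size `alpha = 0.1` and the 200 iterations are all made up for illustration, and `log_likelihood` and `gradient` are hypothetical helper names.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    h = sigmoid(X @ theta)
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

def gradient(theta, X, y):
    """Component j is the sum over examples of (y - h(x)) * x_j."""
    return X.T @ (y - sigmoid(X @ theta))

# Hypothetical toy data: 6 examples, 3 features (first column is a bias).
X = np.array([[1.0, 0.5, -1.0],
              [1.0, -1.5, 0.3],
              [1.0, 2.0, 0.8],
              [1.0, -0.2, -0.7],
              [1.0, 0.9, 0.1],
              [1.0, -0.4, 1.2]])
y = np.array([0, 0, 1, 1, 0, 1])
theta = np.array([0.1, 0.4, -0.2])

# Central finite differences, one coordinate direction at a time.
eps = 1e-6
numeric = np.array([
    (log_likelihood(theta + eps * e, X, y)
     - log_likelihood(theta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])
grad_error = np.max(np.abs(numeric - gradient(theta, X, y)))

# A few steps of gradient ascent: theta := theta + alpha * gradient.
alpha = 0.1
J_start = log_likelihood(theta, X, y)
for _ in range(200):
    theta = theta + alpha * gradient(theta, X, y)
J_end = log_likelihood(theta, X, y)
print(grad_error, J_start, J_end)
```

If the derivation is right, `grad_error` should be close to zero and `J_end` should be larger than `J_start`.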

In [5]:
# let's test it out
def sigmoid(x):
    # logistic function 1 / (1 + e^{-x}); the exponent must be negated
    return 1 / (1 + np.exp(-x))

In [4]:
def simple_gradient_descent():
    theta = np.zeros(3)  # assumed shape: one weight per feature, matching the 3-element theta in f