# Theory

### TODO 

Explain why it is that we use likelihood (you can't just optimize the probability of values, we need to optimize some relationship between our model output and the training output)

Logistic regression is a technique for approximating the probability that a random variable $Y$ takes a certain value given some experimental data.  It uses the logistic (a.k.a. sigmoid) function, $\theta(t) = \frac{1}{1+e^{-t}}$, to map a linear combination of the features to the range (0,1).  In short, for a binary outcome we can express the probability of $Y$ taking the value 1 given feature data $\vec{x}$ as:  $$ p(Y=1|\vec{x}) = \frac{1}{1+e^{-(\vec{w}\cdot\vec{x} + b)}}$$
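
As a minimal sketch of this formula, the snippet below evaluates the model for a single example. The weights `w`, bias `b`, and features `x` are made-up values chosen only for illustration.

```python
import numpy as np

def sigmoid(t):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))

# Illustrative values only (not taken from any dataset)
w = np.array([0.5, -1.0])
b = 0.25
x = np.array([2.0, 1.0])

# Linear combination of the features, then squashed by the sigmoid
t = np.dot(w, x) + b   # 0.5*2 - 1.0*1 + 0.25 = 0.25
p = sigmoid(t)         # estimated probability of the positive class
```

A linear score of 0 maps to a probability of exactly 0.5, which is why 0.5 is the natural decision threshold.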

To fit this model we note that our hypothesis fits nicely with the idea of likelihood, which is a measure of how well a model's parameters explain some sample data.  The likelihood of the parameters of our model, $\vec{w}$, given our training examples $X$ can be written as $L(\vec{w} | X)$.  The likelihood is then the probability of our training labels given our model's output for each training example.  Now, let us derive this more formally for the case of binary classification.

Let us define the following terms:

- $X$ : A set of training examples (each with a constant 1 appended, so that the bias is absorbed into the weights)
- $Y$ : The labels for our training examples
- $\vec{w}$ : The parameters for our logistic regression model
- $g_{\vec{w}}(x) = \frac{1}{1+e^{-(\vec{w}\cdot x)}}$ : The output of our model for an example $x$


As stated earlier, we can express $p(Y=1|X;\vec{w}) = g_{\vec{w}}(X)$.  Since this is a binary classification (A or not A), we can write $p(Y=0|X;\vec{w}) = 1 - g_{\vec{w}}(X)$.

Noting that this is a Bernoulli distribution, we can use the differentiable form, $*^1$, of its probability mass function, $f(k;p) = p^k(1-p)^{1-k}$, to express $P(y|X;\vec{w})$.  This yields $$P(y|X;\vec{w}) = g_{\vec{w}}(X)^y(1-g_{\vec{w}}(X))^{1-y}$$
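
The following sketch checks that the closed form reduces to $p$ or $1-p$ for the two possible outcomes. The helper name `bernoulli_pmf` and the value `p = 0.7` are illustrative assumptions.

```python
def bernoulli_pmf(k, p):
    """Closed form p^k (1-p)^(1-k), meant for k in {0, 1}."""
    return p**k * (1 - p)**(1 - k)

p = 0.7  # illustrative probability of the positive outcome

prob_one = bernoulli_pmf(1, p)   # exponent of (1-p) is 0, so this reduces to p
prob_zero = bernoulli_pmf(0, p)  # exponent of p is 0, so this reduces to 1 - p
```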


With this equation we can express our likelihood as a function of our training data:


$$\begin{align}
 L(\vec{w} | X) & = P(Y | X;\vec{w}) \\
 & = \prod_{i=1}^{N} P(Y = y_i | x_i;\vec{w}) \\
 & = \prod_{i=1}^{N} g_{\vec{w}}(x_i)^{y_i}(1-g_{\vec{w}}(x_i))^{1-y_i}
\end{align}$$

Since we want to maximize the likelihood and this product would be a monster to differentiate, one often maximizes the log likelihood instead; because the logarithm is strictly increasing, both have the same maximizer.  Using the properties of logarithms, we can express our log likelihood as:

$$\begin{align}
 \log(L(\vec{w} | X)) & = \log\left(\prod_{i=1}^{N} P(Y = y_i | x_i;\vec{w})\right) \\
 & = \sum_{i=1}^{N} \log(P(Y = y_i | x_i;\vec{w})) \\
 & = \sum_{i=1}^{N} \log\left(g_{\vec{w}}(x_i)^{y_i}(1-g_{\vec{w}}(x_i))^{1-y_i}\right) \\
 & = \sum_{i=1}^{N} \left[ \log(g_{\vec{w}}(x_i)^{y_i}) + \log((1-g_{\vec{w}}(x_i))^{1-y_i}) \right] \\
 & = \sum_{i=1}^{N} \left[ y_i\log(g_{\vec{w}}(x_i)) + (1-y_i)\log(1-g_{\vec{w}}(x_i)) \right]
\end{align}$$
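
As a rough numerical check of this derivation, the sketch below computes the log likelihood as a sum and compares it with the logarithm of the product form. The toy data `X`, `y` and the weights `w` are made up for illustration; `X` already carries a trailing column of ones for the bias.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def log_likelihood(w, X, y):
    """Sum over examples of y_i*log(g_w(x_i)) + (1-y_i)*log(1-g_w(x_i))."""
    g = sigmoid(X @ w)
    return np.sum(y * np.log(g) + (1 - y) * np.log(1 - g))

# Toy data (made up): 4 examples, 1 feature plus a bias column of ones
X = np.array([[0.5, 1.0], [1.5, 1.0], [-1.0, 1.0], [2.0, 1.0]])
y = np.array([0, 1, 0, 1])
w = np.array([0.8, -0.3])

# Product form of the likelihood, for comparison
g = sigmoid(X @ w)
likelihood = np.prod(g**y * (1 - g)**(1 - y))

ll = log_likelihood(w, X, y)
```

Here `ll` agrees with `np.log(likelihood)` up to floating-point error. On larger datasets the product of many probabilities can underflow to zero, which is a practical reason to work with the sum.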


Before determining the gradient (derivative with respect to the weights) of the entire equation, let us derive some of the embedded gradients : 

$$\begin{align}
\frac{d}{d\vec{w}}g_{\vec{w}}(x_i) & = g_{\vec{w}}(x_i)(1-g_{\vec{w}}(x_i))\frac{d}{d\vec{w}}(\vec{w}\cdot x_i) \\
& = g_{\vec{w}}(x_i)(1-g_{\vec{w}}(x_i))x_i
\end{align}$$

And the derivative of the logarithm of some function $u(\vec{w})$:
$$\frac{d}{d\vec{w}}\log(u) = \frac{u'}{u} $$
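
The first building block relies on the identity that the sigmoid's derivative with respect to its argument is the sigmoid times one minus the sigmoid; the chain rule then contributes the factor $x_i$. A minimal sketch comparing that identity with a central finite difference, with the grid of points and the step size `h` chosen arbitrarily:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

t = np.linspace(-4.0, 4.0, 9)  # arbitrary evaluation points
h = 1e-6                       # finite-difference step

analytic = sigmoid(t) * (1 - sigmoid(t))
numeric = (sigmoid(t + h) - sigmoid(t - h)) / (2 * h)

max_abs_diff = np.max(np.abs(analytic - numeric))
```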

Using these two building blocks, and writing $\theta = g_{\vec{w}}(x_i)$ for the sake of brevity (so that $\theta' = \frac{d}{d\vec{w}}g_{\vec{w}}(x_i)$), the derivative of our log likelihood with respect to the weights $\vec{w}$ is:

$$\begin{align}
\frac{d}{d\vec{w}}\log(L(\vec{w} | X)) & = \frac{d}{d\vec{w}}\sum_{i=1}^{N} \left[ y_i\log(\theta) + (1-y_i)\log(1-\theta) \right] \\
& = \sum_{i=1}^{N} \left[ \frac{d}{d\vec{w}}y_i\log(\theta) + \frac{d}{d\vec{w}}(1-y_i)\log(1-\theta) \right] \\
& = \sum_{i=1}^{N} \left[ y_i\frac{\theta'}{\theta} + (1-y_i)\frac{-\theta'}{1 - \theta} \right] \\
& = \sum_{i=1}^{N} \left[ y_i\frac{\theta(1-\theta)x_i}{\theta} + (1-y_i)\frac{-\theta(1-\theta)x_i}{1 - \theta} \right] \\
& = \sum_{i=1}^{N} \left[ y_i(1-\theta)x_i + (1-y_i)(-\theta)x_i \right] \\
& = \sum_{i=1}^{N} x_i(y_i-\theta)
\end{align}$$
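
One way to sanity-check the final expression is to compare it against a finite-difference approximation of the log likelihood. The sketch below does this on made-up toy data; the function names, the data, and the step size `h` are all illustrative assumptions.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def log_likelihood(w, X, y):
    g = sigmoid(X @ w)
    return np.sum(y * np.log(g) + (1 - y) * np.log(1 - g))

def gradient(w, X, y):
    """Analytic gradient from the derivation: sum_i x_i (y_i - g_w(x_i))."""
    return X.T @ (y - sigmoid(X @ w))

# Toy data (made up): 4 examples, 1 feature plus a bias column of ones
X = np.array([[0.5, 1.0], [1.5, 1.0], [-1.0, 1.0], [2.0, 1.0]])
y = np.array([0, 1, 0, 1])
w = np.array([0.8, -0.3])

# Central finite differences, one weight at a time
h = 1e-6
numeric = np.zeros_like(w)
for j in range(len(w)):
    step = np.zeros_like(w)
    step[j] = h
    numeric[j] = (log_likelihood(w + step, X, y)
                  - log_likelihood(w - step, X, y)) / (2 * h)

analytic = gradient(w, X, y)
```

The two vectors should agree to several decimal places; a large discrepancy would point to a sign or indexing error in the derivation.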


#### Footnotes

- $*^1$ : The Bernoulli probability mass function is defined piecewise: it takes the value $p$ for one outcome and the complementary value, $1-p$, for the other outcome.  We note that $p^k(1-p)^{1-k}$ is a clever way of writing this same thing as a single differentiable expression, since one of the exponents will be zero for each of the 2 possible outcomes, thus reducing the probability to $p$ or $1-p$.

## Implementation

## Binary Classifier

In [1]:
import numpy as np

def sigmoid(x):
    """Logistic function, applied elementwise."""
    return 1 / (1 + np.exp(-x))

class logistic_regression:
    
    def __init__(self, max_iterations=100, learning_factor=0.01):
        self._max_iterations = max_iterations
        self._learning_rate = learning_factor
        self._weights = []
        
    def predict(self, x):
        """Return the predicted class (0 or 1) for a single example x."""
        if len(self._weights) == 0:
            raise RuntimeError('The model has not been fitted, prediction is impossible')
        # append a 1 so that the last weight acts as the bias
        return 0 if sigmoid(np.dot(self._weights, np.append(x, 1))) < 0.5 else 1
        
    def fit(self, raw_features, labels):
        features = np.append(raw_features, np.ones((len(raw_features), 1)), axis=1)  # add bias column
        self._weights = np.random.rand(features.shape[1])
        
        # batch gradient ascent on the log likelihood
        for epoch in range(self._max_iterations):
            predictions = sigmoid(np.dot(features, self._weights))
            error = labels - predictions
            # gradient of the log likelihood: sum_i x_i * (y_i - g_w(x_i))
            gradient = np.dot(features.T, error)
            
            self._weights += gradient * self._learning_rate
        
        

In [2]:
#This is mostly for my machine learning homework assignment
import numpy as np
from sklearn import datasets

iris_ds = datasets.load_iris()

features = iris_ds['data']
labels = iris_ds['target']

labels[labels > 1] = 1   #binary classification: merge classes 1 and 2, leaving class 0 vs. the rest

LR = logistic_regression(500,0.001)

LR.fit(features,labels)

#list [prediction, label] for every misclassified training example
predictions = [LR.predict(x) for x in features]
[[p, y] for p, y in zip(predictions, labels) if p != y]


[]
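
One way to check that the update in `fit` really performs gradient ascent is to track the log likelihood across iterations. The sketch below re-implements the loop standalone on made-up, non-separable toy data; the data, the zero initialization, the learning rate, and the `history` list are illustrative assumptions rather than part of the class above.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def log_likelihood(w, X, y):
    g = sigmoid(X @ w)
    return np.sum(y * np.log(g) + (1 - y) * np.log(1 - g))

# Made-up toy data: one feature plus a trailing bias column of ones
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
X = np.column_stack([x, np.ones_like(x)])
y = np.array([0, 0, 1, 0, 1, 1])

w = np.zeros(X.shape[1])   # deterministic start (the class above uses random weights)
learning_rate = 0.05
history = [log_likelihood(w, X, y)]

for epoch in range(200):
    error = y - sigmoid(X @ w)
    w += learning_rate * (X.T @ error)   # same update rule as in fit
    history.append(log_likelihood(w, X, y))
```

With a sufficiently small learning rate the values in `history` should never decrease; if they oscillate or drop, the step size is likely too large for the data.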