# Logistic Regression


Reference:
- https://github.com/bamtak/machine-learning-implemetation-python/blob/master/Logistic%20Regression.ipynb

In logistic regression, we model the probability of a binary outcome as a function of a linear combination of input features. For example, we could try to predict the outcome of an election (win/lose) using information about how much money a candidate spent on campaigning, how much time they spent campaigning, etc.

### Model

Logistic regression works as follows.

Given:

- dataset $\{(\boldsymbol{x}^{(1)}, y^{(1)}), ..., (\boldsymbol{x}^{(m)}, y^{(m)})\}$
- with $\boldsymbol{x}^{(i)}$ being a $d$-dimensional vector $\boldsymbol{x}^{(i)} = (x^{(i)}_1, ..., x^{(i)}_d)$
- $y^{(i)}$ being a binary target variable, $y^{(i)} \in \{0,1\}$

The logistic regression model can be interpreted as a very simple neural network:

- it has a real-valued weight vector $\boldsymbol{w}= (w_1, ..., w_d)$
- it has a real-valued bias $b$
- it uses a sigmoid function as its activation function



### Training

Unlike linear regression, logistic regression has no closed-form solution. However, the cost function is convex, so we can train the model using gradient descent. Because of this convexity there are no spurious local minima: gradient descent (like other standard optimization algorithms) converges toward the global minimum, provided the learning rate is small enough and enough training iterations are used.

Training a logistic regression model proceeds in several steps. In the beginning (step 0) the parameters are initialized. The remaining steps are repeated for a specified number of training iterations or until the parameters converge.

**Step 0**: Initialize the weight vector and bias with zeros (or small random values).

----

**Step 1**: Compute a linear combination of the input features and weights. This can be done in one step for all training examples, using vectorization and broadcasting:

$\boldsymbol{a} = \boldsymbol{X} \cdot \boldsymbol{w} + b$

where $\boldsymbol{X}$ is a matrix of shape $(m, d)$ that holds all training examples (one example per row), and $\cdot$ denotes the matrix-vector product.
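As a minimal sketch of this step in NumPy (the toy values for `X`, `w`, and `b` are made up for illustration and are not part of the original text), the scalar bias is broadcast across all examples:

```python
import numpy as np

# Toy data (assumed for illustration): m = 4 examples, d = 2 features.
X = np.array([[0.5, 1.0],
              [1.5, -0.5],
              [-1.0, 2.0],
              [2.0, 0.0]])
w = np.array([0.2, -0.4])  # weight vector, shape (d,)
b = 0.1                    # scalar bias

# X @ w has shape (m,); adding the scalar b broadcasts it to every example.
a = X @ w + b
print(a.shape)
```

Each entry of `a` is the linear score for one training example, so no explicit loop over the dataset is needed.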

----

**Step 2**: Apply the sigmoid activation function, which returns values between 0 and 1:

$\boldsymbol{\hat{y}} = \sigma(\boldsymbol{a}) = \frac{1}{1 + \exp(-\boldsymbol{a})}$
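The sigmoid can be sketched as a small element-wise NumPy function. The function name `sigmoid` and the sample inputs are illustrative choices, not taken from the original text:

```python
import numpy as np

def sigmoid(a):
    """Element-wise logistic function 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

# Sample scores (assumed for illustration).
a = np.array([-2.0, 0.0, 2.0])
y_hat = sigmoid(a)
print(y_hat)  # every value lies strictly between 0 and 1
```

A score of 0 maps to a probability of exactly 0.5, and the function is symmetric in the sense that $\sigma(-a) = 1 - \sigma(a)$. This direct form can emit overflow warnings for strongly negative inputs; `scipy.special.expit` is a numerically more robust alternative if SciPy is available.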

----

**Step 3**: Compute the cost over the whole training set. We want to model the probability of the target values being 0 or 1. So during training we want to adapt our parameters such that our model outputs high values for examples with a positive label (true label being 1) and small values for examples with a negative label (true label being 0). This is reflected in the cost function:

$J(\boldsymbol{w},b) = - \frac{1}{m} \sum_{i=1}^m \Big[ y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \Big]$
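This cost is commonly called the binary cross-entropy. The following sketch computes it with NumPy; the helper name `binary_cross_entropy`, the clipping constant `eps`, and the sample predictions are assumptions added for illustration. Clipping is not part of the formula itself and is only there to avoid evaluating `log(0)`:

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Mean binary cross-entropy between labels y and predictions y_hat."""
    # Keep predictions away from exactly 0 and 1 so the logarithms stay finite.
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

y = np.array([1.0, 0.0, 1.0, 0.0])
good = np.array([0.9, 0.1, 0.8, 0.2])  # confident and correct
bad = np.array([0.1, 0.9, 0.2, 0.8])   # confident and wrong

cost_good = binary_cross_entropy(y, good)
cost_bad = binary_cross_entropy(y, bad)
print(cost_good, cost_bad)
```

Predictions that agree with the labels yield a low cost, while confident wrong predictions are penalized heavily, which is the behavior described above.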

----

**Step 4**: Compute the gradient of the cost function with respect to the weight vector and bias. A detailed explanation of this derivation can be found here.

The general formula is given by:

$\frac{\partial J}{\partial w_j} = \frac{1}{m}\sum_{i=1}^m\left[\hat{y}^{(i)}-y^{(i)}\right]\,x_j^{(i)}$
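
As a brief sketch of where this formula comes from, consider a single example, drop the index $(i)$, and write $a = \boldsymbol{w}^\top \boldsymbol{x} + b$, $\hat{y} = \sigma(a)$ and $\ell = -\big[y \log(\hat{y}) + (1-y)\log(1-\hat{y})\big]$ (the symbol $\ell$ for the per-example loss is introduced here for convenience). The chain rule gives three factors:

$\frac{\partial \ell}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}} = \frac{\hat{y}-y}{\hat{y}(1-\hat{y})}$

$\frac{\partial \hat{y}}{\partial a} = \sigma(a)\big(1-\sigma(a)\big) = \hat{y}(1-\hat{y})$

$\frac{\partial a}{\partial w_j} = x_j$

Multiplying them, the factor $\hat{y}(1-\hat{y})$ cancels:

$\frac{\partial \ell}{\partial w_j} = (\hat{y}-y)\,x_j$

Averaging this over all $m$ training examples yields the expression above.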

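For the bias, the input $x_j^{(i)}$ is replaced by 1, which gives $\frac{\partial J}{\partial b} = \frac{1}{m}\sum_{i=1}^m\left[\hat{y}^{(i)}-y^{(i)}\right]$.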
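In vectorized form, the gradients for all weights can be computed with a single matrix product. The sketch below assumes the predictions `y_hat` have already been computed; the function name `gradients` and the toy inputs are illustrative:

```python
import numpy as np

def gradients(X, y, y_hat):
    """Gradients of the cost with respect to the weights and the bias."""
    m = X.shape[0]
    error = y_hat - y        # shape (m,), one residual per example
    dw = X.T @ error / m     # shape (d,), averages error * x_j over examples
    db = np.mean(error)      # the bias sees a constant input of 1
    return dw, db

# Toy inputs (assumed for illustration): two examples, two features.
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
y = np.array([1.0, 0.0])
y_hat = np.array([0.5, 0.5])  # what the model predicts with all-zero parameters

dw, db = gradients(X, y, y_hat)
print(dw, db)
```

Here `X.T @ error` performs the sum over examples for every feature at once, which mirrors the per-weight formula above.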

----

**Step 5**: Update the weights and bias:

$\boldsymbol{w} = \boldsymbol{w} - \eta \, \nabla_w J$

$b = b - \eta \, \nabla_b J$

where $\eta$ is the learning rate.
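
Putting steps 0 to 5 together, a training loop might look like the following sketch. It is one possible implementation under stated assumptions, not the referenced notebook's code: the function name `train_logistic_regression`, the default hyperparameters, and the synthetic two-cluster dataset are all illustrative choices.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_logistic_regression(X, y, learning_rate=0.1, n_iters=1000):
    """Batch gradient descent for logistic regression (illustrative sketch)."""
    m, d = X.shape
    w = np.zeros(d)  # Step 0: initialize weights and bias with zeros
    b = 0.0
    costs = []
    for _ in range(n_iters):
        a = X @ w + b                    # Step 1: linear combination
        y_hat = sigmoid(a)               # Step 2: sigmoid activation
        p = np.clip(y_hat, 1e-12, 1.0 - 1e-12)  # avoid log(0)
        costs.append(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))  # Step 3
        error = y_hat - y                # Step 4: gradients
        dw = X.T @ error / m
        db = np.mean(error)
        w = w - learning_rate * dw       # Step 5: parameter update
        b = b - learning_rate * db
    return w, b, costs

# Synthetic data (assumed for illustration): two Gaussian clusters in 2D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
               rng.normal(1.0, 1.0, size=(50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])

w, b, costs = train_logistic_regression(X, y)
predictions = (sigmoid(X @ w + b) >= 0.5).astype(float)
accuracy = np.mean(predictions == y)
print(costs[0], costs[-1], accuracy)
```

With zero-initialized parameters every prediction starts at 0.5, so the first recorded cost equals $\log 2$; it should then decrease as training proceeds. Thresholding the predicted probabilities at 0.5 turns them into class labels.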
