## Logistic Regression

[Logistic regression](https://en.wikipedia.org/wiki/Logistic_regression) is a statistical model used to predict a binary dependent variable, and is commonly used as a binary classifier. Logistic regression is similar to a [single layer perceptron](https://en.wikipedia.org/wiki/Feedforward_neural_network#Single-layer_perceptron), which is the basis for feedforward [artificial neural networks](https://en.wikipedia.org/wiki/Artificial_neural_network).

The mathematical definition of logistic regression is presented below, along with the basics of the [backpropagation](https://en.wikipedia.org/wiki/Backpropagation) algorithm used for training. Snippets of the equivalent Python code using NumPy are also provided.

Single layer perceptrons can be combined into a network to build a [multilayer perceptron](https://en.wikipedia.org/wiki/Multilayer_perceptron) for [deep learning](https://en.wikipedia.org/wiki/Deep_learning). More sophisticated structures, such as [Convolutional Neural Networks](https://en.wikipedia.org/wiki/Convolutional_neural_network) (CNNs), [Recurrent Neural Networks](https://en.wikipedia.org/wiki/Recurrent_neural_network) (RNNs), or [autoencoders](https://en.wikipedia.org/wiki/Autoencoder), can also be assembled as networks of single layer perceptrons.

### Forward Propagation

Given a weight vector $W$, bias scalar $b$, activation function $\sigma$, and an input $x$ holding $m$ examples, the activation $a$ produced by forward propagation is given by:

$$ a = \sigma (W^T x + b) = (a_1, a_2, ..., a_{m-1}, a_m) \tag{1}$$

The equivalent Python code is:

`a = sigmoid(np.dot(W.T, x) + b)`
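The one-liner above relies on a `sigmoid` function and on array shapes that are not spelled out here. The following is a minimal sketch under assumed shapes: `x` stores one example per column with shape `(n, m)`, and `W` is a column vector of shape `(n, 1)`. The sizes `n` and `m` and the random inputs are illustrative only.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

# Assumed layout: n features, m examples stored as the columns of x.
n, m = 3, 4
rng = np.random.default_rng(0)
x = rng.normal(size=(n, m))   # illustrative inputs, one example per column
W = np.zeros((n, 1))          # weight vector as a column
b = 0.0                       # bias scalar, broadcast across all examples

# Equation (1): one activation per example, shape (1, m).
a = sigmoid(np.dot(W.T, x) + b)
```

With all-zero weights and bias, every activation is 0.5, which is a convenient sanity check before training.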

Other activation functions, such as a [rectified linear unit](https://en.wikipedia.org/wiki/Rectifier_(neural_networks)) (ReLU) or [softmax](https://en.wikipedia.org/wiki/Softmax_function), can also be used in place of [sigmoid functions](https://en.wikipedia.org/wiki/Sigmoid_function). ReLU is far less prone to the [vanishing gradient problem](https://en.wikipedia.org/wiki/Vanishing_gradient_problem) than the traditional sigmoid function, because its gradient is 1 for any positive input instead of shrinking toward zero. For this reason, ReLU is often preferred in the hidden layers of deep neural networks. Softmax is often used at the output layer of a multi-class classification model.
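To make the comparison concrete, here is a small sketch of the three activation functions and of the sigmoid and ReLU gradients at a few sample points. The helper names (`relu`, `softmax`) and the sample values in `z` are chosen for illustration, and the softmax assumes that examples are stored as columns.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    # Subtract the per-column maximum for numerical stability;
    # each column is treated as one example.
    shifted = z - np.max(z, axis=0, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=0, keepdims=True)

z = np.array([-10.0, 0.0, 10.0])

# Sigmoid gradient: peaks at 0.25 and decays toward 0 as |z| grows.
sigmoid_grad = sigmoid(z) * (1.0 - sigmoid(z))

# ReLU gradient: 1 for positive inputs, 0 otherwise.
relu_grad = (z > 0).astype(float)

# Softmax turns each column of scores into probabilities that sum to 1.
scores = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]])
probs = softmax(scores)
```

The sigmoid gradient never exceeds 0.25, so multiplying many such factors across layers shrinks the signal, while the ReLU gradient stays at 1 wherever the unit is active.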

### Cost Function

Training through backpropagation is an iterative process of computing the weight vector $W$ and bias $b$. Training attempts to minimize the _loss_, or _cost function_, $J$, where $y_i$ is the true label of example $i$ and $a_i$ is its predicted activation:

$$ J = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log(a_i) + (1 - y_i) \log(1 - a_i) \right] \tag{2}$$

`j = (-1/m)*(y.dot(np.log(a.T)) + (1-y).dot(np.log(1 - a.T)))`
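As a hedged illustration, the cost can be wrapped in a small function and evaluated on hand-picked values. The function name `cost`, the `eps` clipping (added here to avoid `log(0)`; it is not part of the original snippet), and the sample arrays are all assumptions of this sketch. Both `a` and `y` are assumed to have shape `(1, m)`.

```python
import numpy as np

def cost(a, y, eps=1e-12):
    """Cross-entropy cost of equation (2) for a and y of shape (1, m)."""
    m = y.shape[1]
    # Clipping is an addition of this sketch: it keeps log() finite
    # when an activation saturates at exactly 0 or 1.
    a = np.clip(a, eps, 1.0 - eps)
    return float(-np.sum(y * np.log(a) + (1 - y) * np.log(1 - a)) / m)

y = np.array([[1, 0, 1, 0]])
a_uninformed = np.full((1, 4), 0.5)         # every prediction is 0.5
a_good = np.array([[0.9, 0.1, 0.8, 0.2]])   # predictions close to the labels

j_uninformed = cost(a_uninformed, y)
j_good = cost(a_good, y)
```

Predicting 0.5 for every example gives a cost of $\log 2$, and predictions closer to the labels give a lower cost, which is the behavior training tries to exploit.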

### Gradients

Gradients are partial derivatives of the cost with respect to the weights and bias, and are used to update them iteratively. The _learning rate_ $\alpha$ is a hyperparameter tuned for optimal training. This is the most basic form of gradient descent. More sophisticated algorithms, such as [Adaptive Moment Estimation](https://en.wikipedia.org/wiki/Stochastic_gradient_descent#Adam) (Adam), often converge faster.
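The gradient formulas for this model follow from the chain rule. As a short derivation sketch, write $z_i = W^T x_i + b$ and $a_i = \sigma(z_i)$ for a single example $x_i$ (the symbols $z_i$ and $L_i$ are introduced here for the derivation only), and let $L_i$ be the term of (2) belonging to that example:

$$ L_i = -\left[ y_i \log(a_i) + (1 - y_i) \log(1 - a_i) \right], \qquad \frac{\partial L_i}{\partial a_i} = -\frac{y_i}{a_i} + \frac{1 - y_i}{1 - a_i} $$

$$ \frac{\partial a_i}{\partial z_i} = a_i (1 - a_i) \quad\Rightarrow\quad \frac{\partial L_i}{\partial z_i} = -y_i (1 - a_i) + (1 - y_i)\, a_i = a_i - y_i $$

Since $\partial z_i / \partial W = x_i$ and $\partial z_i / \partial b = 1$, averaging over the $m$ examples gives the $(a - y)$ factor that appears in both (3) and (4).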

$$ \frac{\partial J}{\partial W} = \frac{1}{m} x \cdot (a - y)^T \tag{3}$$

`dw = (1/m) * np.dot(x, (a - y).T)`

$$ \frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} (a_i - y_i) \tag{4}$$

`db = (1/m) * np.sum(a - y)`

$$ W_{n+1} = W_n - \alpha \frac{\partial J}{\partial W}\tag{5}$$

`W = W - learning_rate * dw`

$$ b_{n+1} = b_n - \alpha \frac{\partial J}{\partial b}\tag{6}$$

`b = b - learning_rate * db`
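Putting the snippets together, a training loop might look like the sketch below. It is an illustration under stated assumptions, not a reference implementation: the function name `train`, the default `learning_rate` and `num_iterations`, the zero initialization, and the toy data are all hypothetical choices, and `x` is assumed to have shape `(n, m)` with `y` of shape `(1, m)`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(x, y, learning_rate=0.1, num_iterations=1000):
    """Plain gradient descent for logistic regression."""
    n, m = x.shape
    W = np.zeros((n, 1))   # zero initialization is fine for a single unit
    b = 0.0
    for _ in range(num_iterations):
        a = sigmoid(np.dot(W.T, x) + b)       # forward pass, equation (1)
        dw = (1 / m) * np.dot(x, (a - y).T)   # equation (3)
        db = (1 / m) * np.sum(a - y)          # equation (4)
        W = W - learning_rate * dw            # equation (5)
        b = b - learning_rate * db            # equation (6)
    return W, b

# Toy data: one feature, label is 1 when the feature is positive.
x = np.array([[-2.0, -1.0, 1.0, 2.0]])
y = np.array([[0, 0, 1, 1]])

W, b = train(x, y)

# Threshold the activations at 0.5 to obtain class predictions.
predictions = (sigmoid(np.dot(W.T, x) + b) > 0.5).astype(int)
```

On this linearly separable toy set the learned weight is positive and the thresholded predictions match the labels. On real data, the learning rate and iteration count would need tuning.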