# Logistic Regression Theory

This workbook covers and demonstrates the theory behind Logistic Regression, including Gradient Descent and various Cost Functions.

## 1. Logistic Regression Overview

Logistic regression is used to solve classification problems where the outcome is a discrete variable. Usually it is used to solve __Binary Classification__ problems.\
\
We utilize the sigmoid function to map any real-valued input into a bounded interval, using the formula below:\
\
$$ y = g(z) = \frac {e^z} {1 + e^z} = \frac {1} {1 + e^{-z}} $$
\
This formula represents the probability of observing output y = 1 from a Bernoulli random variable. Essentially, it squeezes any real number into the (0, 1) interval.



![image.png](images/logistic_activation_function.png)

Applying the sigmoid function to 0 gives 0.5. The output approaches 1 as the input approaches $\infty$. Conversely, the output approaches 0 as the input approaches $-\infty$.
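
As a minimal sketch of this behaviour, the sigmoid can be written in plain Python. The function name `sigmoid` and the branch on the sign of `z` (used only to avoid floating-point overflow for large inputs) are choices made for this example, not part of the formula itself.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic sigmoid g(z) = 1 / (1 + e^(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form e^z / (1 + e^z)
    # so that math.exp never receives a large positive argument.
    ez = math.exp(z)
    return ez / (1.0 + ez)

# The output is 0.5 at z = 0 and moves towards 0 or 1 as |z| grows.
for z in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(z, sigmoid(z))
```

Both branches compute the same function; the split only keeps the exponent non-positive.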

The sigmoid function for the hypothesis is defined below:

$$ h_\theta(x) = \frac {1} {1 + e^{-\theta^T x}} $$

The hypothesis function gives the estimated probability that the actual output equals 1, where $z = \theta^T x$:

$$ p(y = 1 | \theta, x) = g(z) = \frac {1} {1 + e^{-\theta^T x}} $$
$$ p(y = 0 | \theta, x) = 1 - g(z) = 1 - \frac {1} {1 + e^{-\theta^T x}} = \frac {1} {1 + e^{\theta^T x}} $$
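
To make the two probabilities concrete, here is a small sketch that evaluates the hypothesis for a single observation. The values of `theta` and `x` are hypothetical, and treating the first feature as a constant 1 (an intercept term) is an assumption of this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(theta, x):
    """Sigmoid of the dot product theta^T x: the estimated P(y = 1 | theta, x)."""
    z = sum(t * v for t, v in zip(theta, x))
    return sigmoid(z)

# Hypothetical parameters and one observation (first feature fixed at 1).
theta = [-1.0, 2.0]
x = [1.0, 0.5]

p1 = hypothesis(theta, x)  # estimated P(y = 1 | theta, x)
p0 = 1.0 - p1              # estimated P(y = 0 | theta, x)
print(p1, p0)
```

The two values always sum to 1, since they are the probabilities of the two outcomes of a Bernoulli variable.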

## 2. Logistic Regression Cost Function

__The cost function summarizes how well the model is behaving__. We use the cost function to measure how close the model's predictions are to the actual output. In linear regression, we use mean squared error as the cost function, but for Logistic Regression this may give a wavy, non-convex cost surface with many local optima:

![image.png](images/lr_with_mse.png)

Instead, we use a logarithmic function to represent the cost of logistic regression. With __binary classification__ the logarithmic cost depends on the value of y:

![image.png](images/lr_binary_classification_cost.png)


When the actual output is y = 1, the cost is 0 for $h_\theta(x) = 1$ and grows without bound as $h_\theta(x)$ approaches 0. The case y = 0 mirrors this behaviour.
Because y is either 0 or 1, the two cases can be combined into a single expression:

$$ cost(h_\theta(x^i), y^i) = -y^i \times \log(h_\theta(x^i)) - (1 - y^i) \times \log(1 - h_\theta(x^i)) $$

Over all m observations, we calculate the cost as:

$$ J(\theta) = - \frac {1} {m} \sum_{i=1}^m [y^i \times \log(h_\theta(x^i)) + (1 - y^i) \times \log(1 - h_\theta(x^i))] $$
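
The averaged logarithmic cost can be sketched in plain Python as follows. Each observation contributes `-log(h)` when y = 1 and `-log(1 - h)` when y = 0. The toy data set, the constant first feature, and the small clipping constant `eps` (which keeps `log` finite) are all assumptions of this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(theta, X, y, eps=1e-12):
    """Average logarithmic cost J(theta) over the m rows of X."""
    m = len(X)
    total = 0.0
    for xi, yi in zip(X, y):
        h = sigmoid(sum(t * v for t, v in zip(theta, xi)))
        # Clip h away from 0 and 1 so that log() stays finite.
        h = min(max(h, eps), 1.0 - eps)
        total += yi * math.log(h) + (1 - yi) * math.log(1.0 - h)
    return -total / m

# Hypothetical toy data: first feature is a constant 1, second is the input.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 1]

print(cost([0.0, 0.0], X, y))  # every prediction is 0.5
print(cost([0.0, 2.0], X, y))  # predictions agree with the labels
```

With all parameters at zero, every prediction is 0.5 and the cost equals log(2); parameters that separate the two classes give a lower cost.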

## 3. Minimizing the Cost with Gradient Descent

__Gradient Descent is an iterative optimization algorithm which finds the minimum of a differentiable function.__ In this process, we start from some parameter values and update them repeatedly to find the optimal ones. We can apply this method to the cost function of logistic regression, finding an optimal solution minimizing the cost over model parameters:
$$ \min_\theta J(\theta) $$

Assume that we have a total of _n_ features. In this case, we have _n_ parameters for the $\theta$ vector. To minimize our cost function, we need to run gradient descent on each parameter $\theta_j$, where $\alpha$ is the learning rate:
$$ \theta_j \leftarrow \theta_j - \alpha \frac {\partial} {\partial \theta_j} J(\theta) $$

Furthermore, we need to update all parameters simultaneously in each iteration. To complete the full algorithm, we need the value of $\frac {\partial} {\partial \theta_j} J(\theta)$:
$$ \frac {\partial} {\partial \theta_j} J(\theta) = \frac {1} {m} \sum_{i=1}^m (h_\theta(x^i) - y^i) x_j^i $$

Substituting this back into the gradient descent update rule gives:

$$ \theta_j \leftarrow \theta_j - \alpha \frac {1} {m} \sum_{i=1}^m (h_\theta(x^i) - y^i) x_j^i $$
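
The update rule can be sketched as a short batch gradient descent loop in plain Python. The toy data, the constant first feature, the learning rate `alpha=0.1`, the fixed count of 1000 iterations, and the zero initialization are all assumptions chosen for illustration, not recommended settings.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(theta, x):
    """Hypothesis h_theta(x): sigmoid of the dot product theta^T x."""
    return sigmoid(sum(t * v for t, v in zip(theta, x)))

def gradient_descent(X, y, alpha=0.1, iterations=1000):
    m, n = len(X), len(X[0])
    theta = [0.0] * n  # start with all parameters at zero
    for _ in range(iterations):
        # Prediction error h_theta(x^i) - y^i for every observation.
        errors = [predict_proba(theta, xi) - yi for xi, yi in zip(X, y)]
        # Partial derivative of J with respect to each theta_j.
        gradient = [sum(e * xi[j] for e, xi in zip(errors, X)) / m
                    for j in range(n)]
        # Build a new list so that all parameters update simultaneously.
        theta = [t - alpha * g for t, g in zip(theta, gradient)]
    return theta

# Hypothetical toy data: first feature is a constant 1, second is the input.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 1]

theta = gradient_descent(X, y)
predictions = [1 if predict_proba(theta, xi) >= 0.5 else 0 for xi in X]
print(theta, predictions)
```

Computing the full gradient before assigning the new `theta` is what makes the update simultaneous: every partial derivative is evaluated at the same parameter values.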