# Logistic Regression Algorithm <br >

Logistic regression is a classification algorithm used to assign observations to a discrete set of classes. Unlike linear regression, which outputs continuous values, logistic regression passes its output through the logistic sigmoid function to return a probability, which can then be mapped to two or more discrete classes.


**Sigmoid Function**

To map predicted values to probabilities, we use the sigmoid function, which maps any real value to a value between 0 and 1. Mathematically, the sigmoid function is defined as follows:


\begin{eqnarray}
h_{\theta}(x)= \frac{1}{1 + e^{-\theta x}}=S(z) = \frac{1}{1 + e^{-z}}
\end{eqnarray}

**Note** <br>
S(z) = output value between 0 and 1 (probability estimate)<br>
z = input to the function (here $z = \theta x$) <br>
e = base of natural log <br>
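
The definition above translates almost directly into code. The following is a minimal sketch using NumPy; the function names `sigmoid` and `hypothesis` are illustrative choices, not names taken from this notebook.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid S(z) = 1 / (1 + e^(-z)), applied element-wise."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = S(theta . x) for a parameter vector theta and features x."""
    return sigmoid(np.dot(x, theta))

print(sigmoid(0.0))               # 0.5
print(sigmoid([-5.0, 0.0, 5.0]))  # close to 0, exactly 0.5, close to 1
```

For inputs with a very large magnitude, `np.exp` may emit an overflow warning; a production implementation would typically guard against that.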

**The sigmoid graph**
<img src="sigmoid.png">


**Decision Boundary**

The sigmoid function returns a probability score between 0 and 1. To map this score to a discrete class (true/false), we select a threshold value: scores at or above the threshold are classified as class 1, and scores below it as class 0.<br>

For example, in the above plot we chose a threshold of 0.5 to separate the two classes. <br>

* p $\geq$ 0.5 , class 1
* p < 0.5 , class 0
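
The thresholding rule can be sketched as a one-line comparison. The helper name `classify` and the sample probabilities are made up for illustration.

```python
import numpy as np

def classify(probabilities, threshold=0.5):
    """Return 1 where p >= threshold and 0 otherwise."""
    p = np.asarray(probabilities, dtype=float)
    return (p >= threshold).astype(int)

probs = [0.1, 0.5, 0.73, 0.49]
print(classify(probs))       # [0 1 1 0]
print(classify(probs, 0.7))  # [0 0 1 0]
```

Raising the threshold makes the classifier more conservative about predicting class 1, which is sometimes useful when false positives are costly.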

**Cost Function** <br>
A cost function's main purpose is to penalize bad choices of the parameters being optimized and reward good ones. The cost function for logistic regression is written with logarithms; the argument for this log form comes from the statistical derivation of maximum likelihood estimation for the probabilities.<br >

Instead of Mean Squared Error, we use a cost function called Cross-Entropy, also known as Log Loss. Cross-entropy loss can be divided into two separate cost functions: one for y=1 and one for y=0.

\begin{eqnarray}
J({\theta}) = \frac{1}{m} \sum_{i=1}^m Cost(h_{\theta}(x),y)
\end{eqnarray}

\begin{eqnarray}
Cost(h_{\theta}(x),y) =
\begin{cases}
    -log(h_{\theta}(x)) & \text{if } y = 1\\
    -log(1-h_{\theta}(x)) & \text{if } y = 0
\end{cases}
\end{eqnarray}

Thus, we can combine the two cases as follows:
\begin{eqnarray}
J(\theta) = -\frac{1}{m}\sum_{i=1}^m [y \cdot log(h_{\theta}(x)) + (1-y) \cdot log(1-h_{\theta}(x))]
\end{eqnarray}

Multiplying by y and (1-y) in the above equation is a neat trick that lets us use the same equation for both the y=1 and y=0 cases. If y=0, the first term vanishes. If y=1, the second term vanishes. In both cases only the term we need remains.
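
As a rough illustration of the combined formula, the sketch below computes the cross-entropy cost for a handful of made-up predictions. The function name `cross_entropy_cost` and the clipping constant `eps` are assumptions added here (the clipping only avoids evaluating log(0)); they are not part of the derivation.

```python
import numpy as np

def cross_entropy_cost(h, y, eps=1e-12):
    """J = -(1/m) * sum(y*log(h) + (1-y)*log(1-h)).

    h: predicted probabilities, y: labels in {0, 1}.
    eps is an assumption added here to keep the logs finite.
    """
    h = np.clip(np.asarray(h, dtype=float), eps, 1.0 - eps)
    y = np.asarray(y, dtype=float)
    return -np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

y = np.array([1, 0, 1, 0])
good = np.array([0.9, 0.1, 0.8, 0.2])  # confident and correct
bad = np.array([0.1, 0.9, 0.2, 0.8])   # confident and wrong
print(cross_entropy_cost(good, y))  # small cost
print(cross_entropy_cost(bad, y))   # much larger cost
```

Confident wrong predictions are penalized heavily because the logarithm grows without bound as the predicted probability of the true class approaches 0.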

**Gradient Descent** <br>

To minimize the cost function we use the gradient descent method.

<img src="gradient_descent.png">

**Mathematics behind gradient descent:** <br >

The purpose of gradient descent is to find the parameters that minimize the cost function. The algorithm updates the parameters as follows:

\begin{eqnarray}
\theta_j := \theta_j - \alpha\frac{\partial }{\partial \theta_j}J(\theta)
\end{eqnarray}

$\alpha$ is the learning rate. It sets the size of each update step and therefore how quickly the algorithm converges. If $\alpha$ is too small, the algorithm may take a long time to converge; if it is too large, the algorithm may overshoot the minimum and fail to converge. Refer to the clip that shows the concepts.
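
The effect of the learning rate can be seen on a toy problem. The sketch below applies the update rule to the one-dimensional cost $\theta^2$, which is an assumption chosen only because its gradient $2\theta$ is easy to follow; it is not the logistic regression cost. The helper name `gradient_descent_1d` is likewise hypothetical.

```python
def gradient_descent_1d(grad, theta0, alpha, steps):
    """Repeat theta := theta - alpha * grad(theta) and return the final theta."""
    theta = theta0
    for _ in range(steps):
        theta = theta - alpha * grad(theta)
    return theta

def grad(theta):
    """Derivative of the toy cost theta**2, whose minimum is at theta = 0."""
    return 2.0 * theta

# A small rate creeps toward 0, a moderate rate gets there quickly,
# and a rate that is too large makes the iterates grow instead of shrink.
for alpha in (0.01, 0.1, 1.1):
    print(alpha, gradient_descent_1d(grad, theta0=1.0, alpha=alpha, steps=50))
```

On this toy cost each step multiplies $\theta$ by $(1 - 2\alpha)$, so the iterates shrink only when that factor has magnitude below 1.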

To compute this derivative, we first simplify our cost function: <br>
\begin{eqnarray}
J(\theta) = -\frac{1}{m}\sum_{i=1}^m [y \cdot log(h_{\theta}(x)) + (1-y) \cdot log(1-h_{\theta}(x))]
\end{eqnarray}



First term:<br>

\begin{eqnarray}
log(h_{\theta}(x)) = log(\frac{1}{1+e^{-\theta x}}) = -log(1+e^{-\theta x})
\end{eqnarray}

Second term: <br>
\begin{eqnarray}
log(1 - h_{\theta}(x)) = log(1 - \frac{1}{1+e^{-\theta x}}) = log(e^{-\theta x}) - log(1+e^{-\theta x}) = -\theta x - log(1+e^{-\theta x})
\end{eqnarray}


Plugging the simplified terms into the cost function, we get:<br>

\begin{eqnarray}
J(\theta) = -\frac{1}{m}\sum_{i=1}^m[-y \cdot log(1+e^{-\theta x}) + (1-y)(-\theta x - log(1+e^{-\theta x}))]
\end{eqnarray}

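which simplifies to (the last step uses $\theta x + log(1+e^{-\theta x}) = log(e^{\theta x}) + log(1+e^{-\theta x}) = log(1+e^{\theta x})$): <br>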
\begin{eqnarray}
J(\theta) = -\frac{1}{m}\sum_{i=1}^m [y\theta x -\theta x - log(1+e^{-\theta x})] = -\frac{1}{m}\sum_{i=1}^m [y\theta x - log(1+e^{\theta x})]
\end{eqnarray}
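
A quick numerical check can confirm that the simplified form agrees with the original cross-entropy expression. The sketch below uses randomly generated data and an arbitrary scalar $\theta$; all values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=20)          # one feature per example (made-up data)
y = rng.integers(0, 2, size=20)  # labels in {0, 1}
theta = 0.7                      # arbitrary scalar parameter

h = 1.0 / (1.0 + np.exp(-theta * x))

# Original cross-entropy form.
original = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
# Simplified form: -(1/m) * sum(y*theta*x - log(1 + e^(theta*x))).
simplified = -np.mean(y * theta * x - np.log(1 + np.exp(theta * x)))

print(np.isclose(original, simplified))  # True
```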



Now we can compute the partial derivatives:<br>
\begin{eqnarray}
\frac{\partial }{\partial \theta} y\theta x = yx
\end{eqnarray}

\begin{eqnarray}
\frac{\partial }{\partial \theta} log(1+e^{\theta x}) = \frac{xe^{\theta x}}{1+e^{\theta x}} = xh_{\theta}(x)
\end{eqnarray}
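
Combining the two partial derivatives above with the leading $-\frac{1}{m}$ gives the gradient $\frac{\partial}{\partial \theta}J(\theta) = \frac{1}{m}\sum_{i=1}^m (h_{\theta}(x) - y)x$. The sketch below plugs this gradient into the update rule. It is one possible vectorized arrangement, not a definitive implementation: the function names, the synthetic data, and the values of `alpha` and `steps` are all assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """Simplified cost: -(1/m) * sum(y*z - log(1 + e^z)) with z = X @ theta."""
    z = X @ theta
    return -np.mean(y * z - np.logaddexp(0.0, z))  # logaddexp(0, z) = log(1 + e^z)

def gradient(theta, X, y):
    """(1/m) * X^T (h - y): one partial derivative per parameter."""
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

def fit(X, y, alpha=0.1, steps=2000):
    """Plain batch gradient descent starting from theta = 0."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        theta = theta - alpha * gradient(theta, X, y)
    return theta

# Made-up, overlapping data: a column of ones for the intercept plus one feature.
rng = np.random.default_rng(0)
feature = np.concatenate([rng.normal(-1.0, 1.0, 50), rng.normal(1.0, 1.0, 50)])
X = np.column_stack([np.ones(100), feature])
y = np.concatenate([np.zeros(50), np.ones(50)])

theta = fit(X, y)
print(cost(np.zeros(2), X, y), cost(theta, X, y))  # cost before and after training
```

Comparing `gradient` against a finite-difference approximation of `cost` is a simple way to catch sign or indexing mistakes before training.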

## <span style="color:red">Now we are ready to implement our Machine Learning algorithm.</span>