# Logistic Regression Algorithm

Logistic regression is one of the best-known machine learning algorithms. It is a classification algorithm that resembles linear regression in many ways; the main difference is that logistic regression is used for classification problems, while linear regression is used to predict continuous values. Assume we have a training dataset of cancer cells, where some of the cells are benign and some are malignant. 

![lin1.png](lin1.png)

One thing we can do with this training dataset is apply the linear regression algorithm and try to fit a straight line. After doing that, we get a line like $(2)$![lin2.png](lin2.png) To make predictions, we could threshold the output at 0.5: if the hypothesis outputs a value greater than or equal to 0.5 we predict $y=1$, and if it is less than 0.5 we predict $y=0$ $(3)$ ![lin3.png](lin3.png) In this particular example, linear regression seems to be doing something reasonable even though this is a classification problem. But now, let's change the problem a bit and add one more training example to the dataset.

![lin4.png](lin4.png) Once we add that extra example and run linear regression again, we get a new straight line like $(4)$. If we threshold this hypothesis at 0.5, we end up with a decision boundary around $(5)$. This is a noticeably worse classifier. Adding a single example caused linear regression to change the line it fits to the training set, moving the decision boundary from $(3)$ to $(5)$ and giving us a worse hypothesis. Therefore, applying linear regression to a classification problem is usually not a good idea. Instead, we will use a new algorithm called logistic regression, which has the property that its predictions are always between $0$ and $1$. We would like to find a continuous hypothesis function that satisfies $0\leq h(x) \leq 1$. For linear regression, the hypothesis had the form $h_\theta (x) = \theta^T x$; for logistic regression, we modify this form slightly: <br> <br>
$$h_\theta (x) = g(\theta^T x) \qquad,\qquad g(z) = \frac{1}{1 + e^{-z}}$$
<br><br>
$$\Longrightarrow h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$$ <br> <br>
where $g(z)$ is called the <b>sigmoid function</b> (or logistic function) and has the graph shown below:![sigmoid.png](sigmoid.png)<br><br>
(source: https://ayearofai.com/rohan-1-when-would-i-even-use-a-quadratic-equation-in-the-real-world-13f379edab3b)<br><br>
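
The sigmoid and the hypothesis $h_\theta(x) = g(\theta^T x)$ can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `sigmoid` and `hypothesis` are chosen here for the example, and the split on the sign of `z` is an added precaution so that `math.exp` is never called with a large positive argument.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function g(z) = 1 / (1 + e^(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form e^z / (1 + e^z),
    # so exp() only ever receives a non-positive argument.
    ez = math.exp(z)
    return ez / (1.0 + ez)

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x) for two plain sequences of equal length."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

print(sigmoid(0.0))   # 0.5: the sigmoid crosses 0.5 exactly at z = 0
print(sigmoid(6.0))   # close to 1
print(sigmoid(-6.0))  # close to 0
```

Whatever the value of $z$, the output stays between $0$ and $1$, which is the property we asked of the hypothesis.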
By choosing the values of $\theta$, we change the shape of $h_\theta(x) = g(\theta^T x)$ and therefore move the decision boundary. As a result, we get better predictions from that decision boundary.
![sig_func2.png](sig_func2.png)
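
The earlier observation, that a single extra example can shift the 0.5 threshold of a fitted straight line, can be reproduced with a small sketch. The data below are hypothetical (they are not the values behind the figures), and the helper name `threshold_crossing` is made up for this example; the fit uses NumPy's `polyfit` for ordinary least squares.

```python
import numpy as np

def threshold_crossing(x, y):
    """Fit y ~ a*x + b by least squares and return the x where the line equals 0.5."""
    a, b = np.polyfit(x, y, deg=1)  # coefficients, highest degree first
    return (0.5 - b) / a

# Hypothetical tumor sizes (x) and labels (y): 0 = benign, 1 = malignant.
x = np.array([1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)

before = threshold_crossing(x, y)

# Add one clearly malignant example far to the right and fit again.
after = threshold_crossing(np.append(x, 30.0), np.append(y, 1.0))

print(before)  # about 5.0, between the two classes
print(after)   # larger than 6, so the malignant example at x = 6 is now misclassified
```

The new example is labelled correctly by both lines, yet it still drags the threshold to the right. Nothing about the classification problem changed; only the least squares fit did.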


# How do we choose parameters $\theta$ ?

The goal of learning is essentially to estimate parameters in order to make predictions. Just as the least squares method is used to estimate parameters in linear regression and fit a hypothesis, logistic regression uses the maximum likelihood method to estimate its parameters. Suppose we have a training set $$\{(x_1,y_1),(x_2,y_2),\ldots,(x_k,y_k)\}$$ <br>with $k$ examples, where each label is either $y=0$ or $y=1$. <br><br>
For the samples labelled 1: We need to estimate $\theta$ such that $p(x_i) = \frac{1}{1 + e^{-\theta^T x_i}}$ is as close to 1 as possible (or $\prod p(x_i)$ as close to 1 as possible).<br><br>
For the samples labelled 0: We need to estimate $\theta$ such that $1-p(x_i) = 1-\frac{1}{1 + e^{-\theta^T x_i}}$ is as close to 1 as possible (or $\prod (1 - p(x_i))$ as close to 1 as possible), <br> where $x_i$ is the feature vector of the $i^{th}$ sample. <br> Combining these two requirements, we want to find the parameters $\theta$ such that:<br><br> $$L(\theta) = \prod_{i \,:\, y_i=1} p(x_i) \times \prod_{i \,:\, y_i = 0}(1-p(x_i)) $$ <br><br> is maximum. <br>
The function we need to maximize is called the likelihood function. Since each $y_i$ is either 0 or 1, we can combine the two products into one:<br><br>
$$L(\theta) = \prod_{i=1}^{k} p(x_i)^{y_i} \times (1 - p( x_i))^{1-y_i}$$ <br><br>
and then take the logarithm, which converts the product into a summation:<br><br>
$$l(\theta) = \sum_{i=1}^{k} \left[ y_i \log (p(x_i)) + (1-y_i)\log (1-p(x_i)) \right]$$ <br><br>
where $l$ denotes the log-likelihood. If we substitute $p(x_i)$ with $\frac{1}{1 + e^{-\theta^T x_i}}$, so that $1 - p(x_i) = \frac{e^{-\theta^T x_i}}{1 + e^{-\theta^T x_i}}$, we have:<br><br>
$$\Longrightarrow l(\theta) = \sum_{i=1}^{k} \left\{ y_i \log \left(\frac{1}{1 + e^{-\theta^T x_i}}\right) + (1 - y_i) \log \left(\frac{e^{-\theta^T x_i}}{1 + e^{-\theta^T x_i}}\right) \right\}$$ <br><br>
$$\Longrightarrow l(\theta) = \sum_{i=1}^{k} \left\{ y_i \left[\log \left(\frac{1}{1 + e^{-\theta^T x_i}}\right) - \log \left(\frac{e^{-\theta^T x_i}}{1 + e^{-\theta^T x_i}}\right)\right] + \log \left(\frac{e^{-\theta^T x_i}}{1 + e^{-\theta^T x_i}}\right) \right\}$$
<br><br>
$$\Longrightarrow l(\theta) = \sum_{i=1}^{k} \left\{ y_i \log \left(e^{\theta^T x_i}\right) + \log \left( \frac{e^{-\theta^T x_i}}{1 + e^{-\theta^T x_i}} \times \frac {e^{\theta^T x_i}}{e^{\theta^T x_i}} \right) \right\}$$ <br><br>
$$\Longrightarrow l(\theta) = \sum_{i=1}^{k} \left\{ y_i \theta^T x_i + \log \left( \frac{1}{1 + e^{\theta^T x_i}}\right) \right\}$$ <br><br>

and we end up with the final form of the log-likelihood function: <br><br>
$$\boxed{ l(\theta) = \sum_{i=1}^{k} \left[ y_i \theta^T x_i - \log \left({1 + e^{\theta^T x_i}}\right) \right] }$$ <br><br>
The value of $\theta$ that maximizes this function is found by numerical methods (for example, the Newton-Raphson method or gradient ascent, which is equivalent to gradient descent on the negative log-likelihood).
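
As an illustration of the numerical step, here is a minimal sketch of batch gradient ascent on the log-likelihood. It relies on the standard gradient $\nabla l(\theta) = \sum_i (y_i - p(x_i))\, x_i$. The toy data, the function names, the learning rate and the iteration count are all assumptions made for this example; they do not come from the text above.

```python
import numpy as np

def log_likelihood(theta, X, y):
    """l(theta) = sum_i [ y_i * theta^T x_i - log(1 + exp(theta^T x_i)) ]."""
    z = X @ theta
    # np.logaddexp(0, z) evaluates log(1 + e^z) without overflow.
    return float(np.sum(y * z - np.logaddexp(0.0, z)))

def fit_gradient_ascent(X, y, learning_rate=0.1, n_iters=5000):
    """Maximize l(theta) with plain batch gradient ascent, starting from zeros."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-(X @ theta)))  # p(x_i) for every sample
        gradient = X.T @ (y - p)                # sum_i (y_i - p(x_i)) x_i
        theta += learning_rate * gradient / len(y)
    return theta

# Hypothetical toy data: a column of ones (intercept) plus one feature.
# The classes overlap, so the maximum-likelihood estimate is finite.
X = np.array([[1, 0.5], [1, 1.0], [1, 1.5], [1, 2.0],
              [1, 2.5], [1, 3.0], [1, 3.5], [1, 4.0]], dtype=float)
y = np.array([0, 0, 1, 0, 1, 0, 1, 1], dtype=float)

theta = fit_gradient_ascent(X, y)
print(theta)
print(log_likelihood(np.zeros(2), X, y), log_likelihood(theta, X, y))
```

The second printed line compares the log-likelihood at the starting point with the value after fitting; the fitted value should be the larger of the two. Newton-Raphson would typically need far fewer iterations, at the cost of computing second derivatives.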

# Decision Boundary

Suppose our hypothesis is:
$$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$$
$$y \in \{0,1\}$$
for the given training dataset below:

![lin5.png](lin5.png)
After maximizing the likelihood function, we obtain $\theta = 
\left [ \begin{matrix} 
-3 \\
1 \\
1 
\end{matrix} \right ]$. Now that we know $\theta$, let's figure out where our hypothesis predicts $y=1$ and where it predicts $y=0$. <br><br>We know that $y=1$ is the more likely label, that is, $h_\theta(x) \geq 0.5$, whenever $\theta^T x \geq 0$. With this $\theta$ and $x = \left[ 1, x_1, x_2 \right]^T$: <br><br>  $\theta^T x = -3 + x_1 + x_2$ <br> <br>
So, for any example with features $x_1$ and $x_2$ that satisfy $-3 + x_1 + x_2 \geq 0$, our hypothesis will predict $y=1$. We can also move $-3$ to the other side and rewrite the inequality as $x_1 + x_2 \geq 3$. <br>
Now let's see what that means in the figure. The equation $x_1 + x_2 = 3$ defines a straight line, and if we draw it on the figure, it looks like this:
![lin6.png](lin6.png)
So everything on or to the upper right of the line is the region where our hypothesis predicts $y=1$, and everything to the lower left is the region where it predicts $y=0$. <br>
The decision boundary is a property of the hypothesis and its parameters $\theta_0$, $\theta_1$, $\theta_2$, and <b>not</b> of the dataset. We use the training set to determine the values of the parameters, and those values then define the decision boundary.
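
The decision rule for this example can be written out directly. The sketch below hard-codes $\theta = (-3, 1, 1)$ from the text; the names `THETA`, `predict` and `predict_proba` are chosen for illustration only.

```python
import math

THETA = (-3.0, 1.0, 1.0)  # (theta_0, theta_1, theta_2) from the example above

def predict_proba(x1, x2, theta=THETA):
    """h_theta(x) = g(theta_0 + theta_1 * x1 + theta_2 * x2)."""
    z = theta[0] + theta[1] * x1 + theta[2] * x2
    return 1.0 / (1.0 + math.exp(-z))

def predict(x1, x2, theta=THETA):
    """Predict y = 1 when theta^T x >= 0, which here means x1 + x2 >= 3."""
    z = theta[0] + theta[1] * x1 + theta[2] * x2
    return 1 if z >= 0 else 0

print(predict(1.0, 1.0))        # 0, because 1 + 1 < 3
print(predict(2.0, 2.0))        # 1, because 2 + 2 >= 3
print(predict_proba(1.5, 1.5))  # 0.5, the point lies exactly on the boundary
```

Note that no training data appears anywhere in this code: once $\theta$ is fixed, the boundary $x_1 + x_2 = 3$ is fully determined.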