# $$Logistic~~Regression$$


Despite the word *Regression* in its name, Logistic Regression belongs to a different category of **Supervised Learning**: **Classification**.


# **Classification**


**Classification** is the process of categorizing a given set of data into **classes**. It can be performed on both structured and unstructured data. The process starts with predicting the class of given data points. The classes are often referred to as **targets**, **labels** or **categories**.

Classification predictive modeling is the task of approximating the mapping function from input variables to discrete output variables. The main goal is to identify which **class**/**category** the new data will fall into.

For example:

- Given the size of the cancer tumour, we can predict whether this is a benign or malignant tumour.
- Given the annual income of an individual, banks can predict whether that individual is capable of paying the debt.


As we can see here, the label that we want to predict takes on a small number of discrete values. For **Binary Classification**, the label $y$ only takes on two values: $0$ (malignant, cannot pay debt) and $1$ (benign, can pay debt).


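We can attempt to use linear regression and map all predictions greater than $0.5$ to $1$ and all less than $0.5$ to $0$. But this doesn't work well: the output of a linear function is not confined to the range between $0$ and $1$, and the relationship between the inputs and a discrete label is generally not linear.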
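One concrete way to see the problem is to fit an ordinary least-squares line to binary labels and look at its outputs. The sketch below uses made-up tumor sizes and labels (the data and the variable names are assumptions for illustration only) and shows that some predictions fall outside the range from $0$ to $1$:

```python
import numpy as np

# Hypothetical toy data: tumor sizes and binary labels (made up for illustration).
sizes = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
labels = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

# Ordinary least-squares line: labels ~ slope * sizes + intercept.
slope, intercept = np.polyfit(sizes, labels, deg=1)
predictions = slope * sizes + intercept

# The fitted line is unbounded, so some predictions leave the [0, 1] range.
print(predictions.min(), predictions.max())
```

On this toy data the smallest prediction is below $0$ and the largest is above $1$, so the raw output cannot be read as a probability.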


# Definition


## **General formula**


We want our hypothesis function to output values in the range $0 \leqslant f_\theta(x) \leqslant 1$, which cannot be guaranteed by a plain linear function $f_\theta(x) = \theta^{T} x$ . That is why we pass the linear term through the **Sigmoid function**:
$$g(z) = \frac{1}{1 + e^{-z}}$$
$$\Rightarrow f_\theta(x) = g(\theta^{T}x) = \frac{1}{1 + e^{-\theta^{T}x}}$$
The **Sigmoid function** is also known as the **Logistic function**, so the two names can be used interchangeably. This is where Logistic Regression gets its name.

Here is what the Sigmoid function looks like:

![Sigmoid function illustration](https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/1WFqZHntEead-BJkoDOYOw_2413fbec8ff9fa1f19aaf78265b8a33b_Logistic_function.png?expiry=1638662400000&hmac=8EA0eH0xwVK5nkS2SMV-L8FDqmb3Sc_vHJ-PG5r-gz0)
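A minimal NumPy sketch of the sigmoid and the hypothesis function defined above. The function names `sigmoid` and `hypothesis` are chosen here for illustration and are not part of any library:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    """f_theta(x) = g(theta^T x) for 1-D arrays theta and x."""
    return sigmoid(np.dot(theta, x))

print(sigmoid(0.0))  # 0.5
# Large negative inputs approach 0, large positive inputs approach 1.
print(sigmoid(np.array([-10.0, 0.0, 10.0])))
```

Because `np.exp` works element-wise, the same `sigmoid` accepts a single number or a whole array of values.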


## **Usage**


The output of the hypothesis function $f_\theta(x)$ can be interpreted as the probability that the label $y = 1$ given the input $x$ .
For example, if:
$$x = \begin{bmatrix} x_0 \\ x_1 \end{bmatrix} = \begin{bmatrix} 1 \\ \text{Tumor Size} \end{bmatrix} $$
$$\Rightarrow f_\theta(x) = 0.7 $$
Then this tells us that there is a 70% chance that $y = 1$, which under the labeling above means a 70% chance that the patient's tumor is benign.

Statistically, this can be written as:
$$f_\theta(x) = P(y = 1 | x; \theta)$$
which reads: the probability that $y = 1$, given $x$, parameterized by $\theta$.

Since there are only two possible outcomes, the two probabilities sum to one:
$$P(y = 0 | x; \theta) + P(y = 1 | x; \theta) = 1$$
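As a small illustration of this probabilistic reading, the sketch below computes both probabilities for one input. The parameter values and the tumor size are hypothetical, picked only to produce a readable number:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-4.0, 1.0])  # hypothetical parameters, not fitted to real data
x = np.array([1.0, 5.0])       # x_0 = 1 (intercept term), x_1 = tumor size (made up)

p_y1 = sigmoid(theta @ x)      # P(y = 1 | x; theta)
p_y0 = 1.0 - p_y1              # P(y = 0 | x; theta)
print(p_y1, p_y0)
```

Here $\theta^{T}x = 1$, so $P(y = 1 | x; \theta) = g(1) \approx 0.73$ and $P(y = 0 | x; \theta) \approx 0.27$.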


# Decision Boundary


## **Definition**

To get our discrete $0$ and $1$ , we can translate the output of the hypothesis function as follows:
$$y = \begin{cases} 1, & \text{if $f_\theta(x) \geqslant 0.5$} \\
                    0, & \text{if $f_\theta(x) < 0.5$    } \end{cases}$$

We already know that $g(z) \geqslant 0.5$ exactly when $z \geqslant 0$ , since $g(0) = 0.5$ and $g$ is increasing.

So by setting $z = \theta^{T}x,$ we can conclude that $f_\theta(x) = g(\theta^{T}x) \geqslant 0.5$ whenever $\theta^{T}x \geqslant 0$ .

To sum up:
$$y = \begin{cases} 1, & \text{if $\theta^{T}x \geqslant 0$} \\
                    0, & \text{if $\theta^{T}x < 0$    } \end{cases}$$

The **decision boundary** is the line that separates the area where $y = 0$ and where $y = 1.$ It is created by our hypothesis function $f_\theta(x).$ 

For example:
$$\theta = \begin{bmatrix} 5 \\ -1 \\ 0 \end{bmatrix}~~~~~x = \begin{bmatrix} 1 \\ x_1 \\ x_2 \end{bmatrix}$$
$$\text{So } \theta^{T}x = 5 + (-1)x_1 + 0x_2$$
$$\text{And because } y = 1 \text{ if } \theta^{T}x \geqslant 0$$
$$\text{Then } 5 - x_1 \geqslant 0$$
$$\Rightarrow x_1 \leqslant 5$$

In this case, our decision boundary is a straight vertical line where $x_1 = 5$ , and everything to the left of that denotes $y = 1,$ while everything to the right denotes $y = 0.$
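The example above can be checked with a few lines of Python. The helper name `predict` is an assumption made for this sketch, and the $x_2$ values are arbitrary because $\theta_2 = 0$:

```python
import numpy as np

theta = np.array([5.0, -1.0, 0.0])  # parameters from the example above

def predict(theta, x):
    """Return 1 if theta^T x >= 0 (equivalently f_theta(x) >= 0.5), else 0."""
    return int(np.dot(theta, x) >= 0)

# Each point is [1, x_1, x_2].
print(predict(theta, np.array([1.0, 3.0, 7.0])))   # x_1 = 3, left of the boundary
print(predict(theta, np.array([1.0, 5.0, -2.0])))  # x_1 = 5, on the boundary
print(predict(theta, np.array([1.0, 6.0, 0.0])))   # x_1 = 6, right of the boundary
```

Points with $x_1 \leqslant 5$ are classified as $1$ and points with $x_1 > 5$ as $0$, regardless of $x_2$.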

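**One important note:** The input to the sigmoid function $g(z)$ **does not** need to be linear in the input features.

For example, it could be a function that describes a circle: $z = \theta_0 + \theta_1x_1^{2} + \theta_2x_2^{2}$ . With $\theta_0 = -1$ and $\theta_1 = \theta_2 = 1$ , the decision boundary $z = 0$ is the unit circle $x_1^{2} + x_2^{2} = 1$ .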
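A short sketch of such a non-linear boundary, assuming the hypothetical parameters $\theta_0 = -1$, $\theta_1 = \theta_2 = 1$ (these values and the helper name `predict_circle` are chosen purely for illustration). With them, $z \geqslant 0$ exactly when the point lies on or outside the unit circle:

```python
import numpy as np

# Hypothetical parameters: the boundary z = 0 becomes x_1^2 + x_2^2 = 1.
theta = np.array([-1.0, 1.0, 1.0])

def predict_circle(theta, x1, x2):
    """Classify a point using the squared features [1, x_1^2, x_2^2]."""
    features = np.array([1.0, x1 ** 2, x2 ** 2])
    return int(theta @ features >= 0)

print(predict_circle(theta, 0.0, 0.0))  # centre of the circle
print(predict_circle(theta, 2.0, 0.0))  # well outside the circle
```

The model is still linear in $\theta$; only the features fed into it are non-linear in $x_1$ and $x_2$.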

# Cost Function

## **Definition**

We cannot use the same cost function that we use for linear regression (the squared error), because combining it with the non-linear Logistic Function makes $J(\theta)$ wavy, with many local optima. In other words, it will not be a convex function. This is why we have to come up with another way to define the **Cost Function**.

The **Cost Function** for **Logistic Regression** looks like this:
$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m}{\text{Cost}(f_\theta(x^{(i)}),  y^{(i)})}$$
With: 
$$\text{Cost}(f_\theta(x^{(i)}),  y^{(i)}) = \begin{cases} -\log(f_\theta(x)) & \text{if $y = 1$} \\
                                                           -\log(1 - f_\theta(x)) & \text{if $y = 0$} \end{cases}$$
Here the superscript $^{(i)}$ is omitted on the right-hand side for simplicity, since the definition is the same for every training example.

From the given definition, we can conclude that:
$$\text{Cost}(f_\theta(x), y) = 0 \text{ if } f_\theta(x) = y$$
$$\text{Cost}(f_\theta(x), y) \to \infty \text{ if } y = 0 \text{ and } f_\theta(x) \to 1$$
$$\text{Cost}(f_\theta(x), y) \to \infty \text{ if } y = 1 \text{ and } f_\theta(x) \to 0$$

The last two equations capture the intuition that when the prediction $f_\theta(x)$ is confidently wrong (the opposite of $y$ ), we penalize the learning algorithm with a very large cost.

**For example**, if a patient with a malignant tumor is predicted to have a benign one, the consequences could be severe.

**Writing the Cost Function this way guarantees that $J(\theta)$ is convex for Logistic Regression**.
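The limiting behaviour of the two-case definition can be checked numerically. This is a minimal sketch; the function name `cost` is an assumption, and it expects $0 < f < 1$ so that the logarithms stay finite:

```python
import numpy as np

def cost(f, y):
    """Per-example cost: -log(f) if y == 1, -log(1 - f) if y == 0."""
    return -np.log(f) if y == 1 else -np.log(1.0 - f)

# As the prediction f moves towards 1, the cost shrinks when y = 1
# and grows without bound when y = 0.
for f in (0.5, 0.9, 0.999):
    print(f, cost(f, 1), cost(f, 0))
```

The cost for $y = 1$ falls towards $0$ as $f_\theta(x) \to 1$, while the cost for $y = 0$ keeps growing.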

## **Simplifying the Cost Function**

In practice, the two-case form of this **Cost Function** is inconvenient to work with. However, since $y$ can only take the values $0$ and $1$ , it can be written as a single expression:
$$\text{Cost}(f_\theta(x), y) = -y\log(f_\theta(x)) - (1 - y)\log(1 - f_\theta(x))$$
It might look intimidating at first. But if you examine it closely, you will see that:
* If $y = 1$ then the second term vanishes and $\text{Cost}(f_\theta(x), y)$ = $-\log(f_\theta(x))$
* If $y = 0$ then the first term vanishes and $\text{Cost}(f_\theta(x), y)$ = $-\log(1 - f_\theta(x))$

It's that simple!
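As a quick sanity check, the sketch below compares the single-expression form with the two-case definition of the cost for a few prediction values. The function names are assumptions made for this example:

```python
import numpy as np

def cost_piecewise(f, y):
    """Two-case definition: -log(f) if y == 1, -log(1 - f) if y == 0."""
    return -np.log(f) if y == 1 else -np.log(1.0 - f)

def cost_combined(f, y):
    """Single-expression form: -y*log(f) - (1 - y)*log(1 - f)."""
    return -y * np.log(f) - (1 - y) * np.log(1.0 - f)

# Both forms should agree for every label and prediction value.
for y in (0, 1):
    for f in (0.1, 0.5, 0.9):
        assert np.isclose(cost_piecewise(f, y), cost_combined(f, y))
```

The assertions pass because one of the two terms is always multiplied by zero.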

Now that we have the simplified $\text{Cost}(f_\theta(x), y),$ we can plug it back into $J(\theta):$

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m}{\text{Cost}(f_\theta(x^{(i)}),  y^{(i)})}
          = -\frac{1}{m} \sum_{i=1}^{m}{[y^{(i)}\log(f_\theta(x^{(i)})) + (1 - y^{(i)})\log(1 - f_\theta(x^{(i)}))]}$$
This **Cost Function** can be derived from statistics using the principle of **Maximum Likelihood Estimation**, a method for efficiently finding the parameters of different models from data. It also has the nice property of being **convex**.
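A minimal vectorized sketch of $J(\theta)$ in NumPy. It assumes a design matrix `X` with one row per training example and a leading column of ones for the intercept; the data below are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_function(theta, X, y):
    """J(theta) for X of shape (m, n) with a leading column of ones, y of shape (m,)."""
    f = sigmoid(X @ theta)  # f_theta(x^(i)) for every example at once
    return -np.mean(y * np.log(f) + (1 - y) * np.log(1 - f))

# Made-up data: intercept column plus one feature.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 4.0], [1.0, 5.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

print(cost_function(np.zeros(2), X, y))  # log(2), about 0.693, when theta = 0
```

With $\theta = 0$ every prediction is $0.5$, so each example contributes $-\log(0.5) = \log 2$. This sketch does not guard against predictions of exactly $0$ or $1$, where the logarithm is undefined.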

To fit the parameters $\theta,$ we have to minimize the **Cost Function**: $\underset{\theta}{\text{min }}J(\theta).$ Minimizing it gives us the optimized parameters $\theta,$ which can be put into the hypothesis function $f_\theta(x)$ to make a prediction for a new $x.$

To find the optimal $\theta,$ we will use **Gradient Descent**.
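As a rough sketch of what this looks like in code, the block below runs batch gradient descent on $J(\theta)$. It assumes the standard update $\theta := \theta - \alpha \nabla J(\theta)$ with $\nabla J(\theta) = \frac{1}{m} X^{T}(g(X\theta) - y)$, which is not derived in the text above. The learning rate `alpha`, the iteration count, and the data are illustrative choices, not recommendations:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_function(theta, X, y):
    f = sigmoid(X @ theta)
    return -np.mean(y * np.log(f) + (1 - y) * np.log(1 - f))

def gradient_descent(X, y, alpha=0.1, iterations=1000):
    """Batch gradient descent; alpha and iterations are illustrative defaults."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        gradient = X.T @ (sigmoid(X @ theta) - y) / m  # gradient of J at theta
        theta -= alpha * gradient
    return theta

# Made-up data: intercept column plus one feature.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 4.0], [1.0, 5.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

theta = gradient_descent(X, y)
print(theta, cost_function(theta, X, y))
```

Starting from $\theta = 0$, where $J(\theta) = \log 2$, the cost after training should be lower than that starting value.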

# Gradient Descent

It is the same as 