#### Logistic Regression

Sources:  
  
https://www.stat.purdue.edu/~zhanghao/MAS/handout/Likelihood.pdf  
https://www.kdnuggets.com/2022/11/comparing-linear-logistic-regression.html  
https://math.unm.edu/~schrader/biostat/bio1/notes/lecture15.pdf  
https://online.stat.psu.edu/stat501/lesson/13/13.2  
https://www.geeksforgeeks.org/machine-learning/understanding-logistic-regression/  
https://www.geeksforgeeks.org/data-science/likelihood-function/  
https://medium.com/@robdelacruz/logistic-regression-from-scratch-7db690fb380b  
https://medium.com/@nicolasanti_43152/ml-loss-and-likelihood-e0a5ff4ae594  

For binary data $y\in\{0, 1\}$ with $\pi=p(y=1)$ and $1-\pi=p(y=0)$, the probability mass function (pmf) is:

$$
p(y, \pi) = \pi^y(1-\pi)^{1-y}
$$

This is commonly referred to as the Bernoulli distribution.
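As a quick check of the pmf above, here is a minimal sketch in plain Python. The function name `bernoulli_pmf` is chosen for illustration and does not come from any library.

```python
def bernoulli_pmf(y, pi):
    """Return p(y; pi) = pi**y * (1 - pi)**(1 - y) for y in {0, 1}."""
    if y not in (0, 1):
        raise ValueError("y must be 0 or 1")
    if not 0.0 <= pi <= 1.0:
        raise ValueError("pi must lie in [0, 1]")
    return pi**y * (1 - pi)**(1 - y)


# The exponents act as a switch: y = 1 selects pi, y = 0 selects 1 - pi.
p_one = bernoulli_pmf(1, 0.25)
p_zero = bernoulli_pmf(0, 0.25)
```

The two outcomes are the only ones possible, so `p_one + p_zero` should equal 1.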

Given m predictors and n samples, we can arrange our predictors into an $n \times (m+1)$ matrix $X$. For convenience, the first column vector of $X$, $x_0$, is filled with $1$s so that it pairs with the intercept parameter.

$$
X = 
\begin{pmatrix}
1 & x_{11} & \cdots & x_{1m} \\
1 & x_{21} & \cdots & x_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
1 & x_{n1} & \cdots & x_{nm}
\end{pmatrix}
$$

Each row vector, $x_i$, represents the feature values corresponding to the i'th observation, and each column vector, $x_j$, contains the predictor values for a single feature across all n samples.
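The layout of the matrix displayed above (one row per observation, a leading column of ones) can be sketched with NumPy as follows. The feature values and the variable names `features`, `row_i` and `col_j` are made up for illustration.

```python
import numpy as np

# Hypothetical raw features: n = 4 samples, m = 2 predictors.
features = np.array([
    [0.5, 1.2],
    [1.5, 0.3],
    [2.0, 2.2],
    [3.1, 0.7],
])

n, m = features.shape

# Prepend a column of ones so that theta_0 acts as the intercept.
X = np.column_stack([np.ones(n), features])

row_i = X[1]      # leading 1 plus the feature values of the second observation
col_j = X[:, 2]   # values of the second predictor across all n samples
```

With this layout `X.shape` is `(n, m + 1)`, so a parameter vector of length m+1 can be multiplied against every row at once.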

Logistic regression assumes that the log-odds of $y_i=1$ are a linear function of the predictors:

$$
\log\left(\frac{p_i}{1-p_i}\right)=\sum_{j=0}^m x_{ij} \theta_j=x_i^T\Theta \quad \Rightarrow \quad p(y_i=1 | x_i, \Theta) = \frac{e^{x_i^T\Theta}}{1 + e^{x_i^T\Theta}}
$$

where $p_i = \pi_i = p(y_i=1 | x_i, \Theta)$, $x_{i0}=1$, and $\Theta$ is the parameter vector, so each $\theta_j$ is the scalar parameter corresponding to the predictor vector $x_j$ (with $\theta_0$ the intercept).
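To make the mapping from the linear predictor to a probability concrete, here is a minimal NumPy sketch. The helper names `sigmoid` and `logit`, the matrix `X` and the vector `theta` are illustrative assumptions, not values from the text.

```python
import numpy as np


def sigmoid(z):
    """Logistic function e^z / (1 + e^z), written as 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))


def logit(p):
    """Log-odds log(p / (1 - p)), the inverse of the sigmoid."""
    return np.log(p / (1.0 - p))


# Hypothetical design matrix (leading column of ones) and parameter vector.
X = np.array([
    [1.0, 0.5, 1.2],
    [1.0, 1.5, 0.3],
    [1.0, 2.0, 2.2],
])
theta = np.array([-1.0, 0.8, 0.1])

z = X @ theta     # linear predictor x_i^T Theta for every row
p = sigmoid(z)    # p(y_i = 1 | x_i, Theta) for every row
```

Applying `logit` to `p` recovers `z`, which is the left-hand side of the relation above read in reverse.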


For intuition, note that

$$
\lim_{p\to 1^-}\log\left(\frac{p}{1-p}\right)=+\infty
$$

and 

$$
\lim_{p\to 0^+}\log\left(\frac{p}{1-p}\right)=-\infty
$$

so, simply put, when $x_i^T\Theta$ is large and positive it indicates a high probability that $y_i=1$, and when it is very negative it indicates a high probability that $y_i=0$.
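The saturation described above is easy to see numerically. This small sketch uses only the standard library; the grid of values for the linear predictor is arbitrary.

```python
import math


def sigmoid(z):
    """Probability that y = 1 for a given value z of the linear predictor."""
    return 1.0 / (1.0 + math.exp(-z))


# Evaluate the probability on an arbitrary grid of linear-predictor values.
grid = (-10.0, -2.0, 0.0, 2.0, 10.0)
probs = [sigmoid(z) for z in grid]

for z, p in zip(grid, probs):
    print(f"z = {z:+6.1f}  ->  p(y=1) = {p:.5f}")
```

A value of zero for the linear predictor corresponds to even odds, and the probability moves monotonically toward 0 or 1 as the predictor moves away from zero.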





Now, let's introduce the likelihood function.
It is the probability of the observed data, viewed as a function of the parameters $\Theta$ with the data held fixed.

Assuming the n samples are independent, the likelihood for binary logistic regression is the product of the per-sample Bernoulli pmfs:

$$
L(\Theta | \mathbf{Y,X})=\prod_{i=1}^n p(y_i| \pi_i) = \prod_{i=1}^n \pi_i^{y_i}(1-\pi_i)^{1-y_i}
$$

$$
=\prod_{i=1}^n \left(\frac{e^{x_i^T\Theta}}{1 + e^{x_i^T\Theta}}\right)^{y_i}\left(\frac{1}{1 + e^{x_i^T\Theta}}\right)^{1-y_i}
$$
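As a sketch of how the product of Bernoulli terms could be evaluated, the snippet below computes the likelihood for two candidate parameter vectors. The function name `likelihood`, the toy data and both parameter vectors are assumptions made for illustration.

```python
import numpy as np


def likelihood(theta, X, y):
    """Product over samples of pi_i^y_i * (1 - pi_i)^(1 - y_i)."""
    pi = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return np.prod(pi**y * (1.0 - pi)**(1 - y))


# Hypothetical toy data: leading column of ones, one predictor.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0, 0, 1, 1])

L_flat = likelihood(np.array([0.0, 0.0]), X, y)  # every pi_i equals 0.5
L_fit = likelihood(np.array([0.0, 2.0]), X, y)   # slope agrees with the labels
```

Parameters that assign high probability to the labels actually observed yield a larger likelihood, so `L_fit` exceeds `L_flat` here. Because each factor is below 1, the product shrinks quickly as n grows and can underflow in floating point.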


Taking the logarithm turns the product into a sum, which gives the log-likelihood:

$$
\log L(\Theta | \mathbf{Y,X}) = \sum_{i=1}^n [y_i \log(\pi_i) + (1-y_i) \log(1-\pi_i)]
$$

Because the logarithm is strictly increasing, the values of $\Theta$ that maximize the log-likelihood also maximize the likelihood, i.e. the probability of the observed data under the model.
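One possible way to carry out the maximization is plain gradient ascent, sketched below. The text does not prescribe an optimizer, so the choice of gradient ascent, the function names, the learning rate, the step count and the toy data are all assumptions for illustration. The gradient of the log-likelihood with respect to $\Theta$ is $X^T(\mathbf{Y}-\pi)$.

```python
import numpy as np


def log_likelihood(theta, X, y):
    """Sum of y_i*log(pi_i) + (1 - y_i)*log(1 - pi_i), in a stable form."""
    z = X @ theta
    # Algebraically equal to the sum above: y_i*z_i - log(1 + e^{z_i}).
    return np.sum(y * z - np.logaddexp(0.0, z))


def fit_gradient_ascent(X, y, lr=0.1, steps=5000):
    """Repeatedly step along the averaged gradient X^T (y - pi) / n."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        pi = 1.0 / (1.0 + np.exp(-(X @ theta)))
        theta += lr * (X.T @ (y - pi)) / len(y)
    return theta


# Hypothetical toy data with overlapping classes, so a finite maximum exists.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 0.0],
              [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0, 0, 1, 0, 1, 1])

theta_hat = fit_gradient_ascent(X, y)
ll_start = log_likelihood(np.zeros(2), X, y)
ll_hat = log_likelihood(theta_hat, X, y)
```

The classes in the toy data overlap on purpose: with perfectly separable data the log-likelihood has no finite maximizer and the parameters would keep growing. In practice a library optimizer or Newton-type method would usually replace this hand-written loop.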

In [1]:
import numpy as np
import matplotlib.pyplot as plt