# Logistic Regression (LR)
---

1. Linear regression is a supervised machine learning model that fits a linear function between the input variables and the label. The output can be any continuous value, so the model is used for regression.
2. The model has a weight vector and a bias as parameters. The output of the model is the dot product of the weight vector and the input vector plus a bias value. 
3. The linear regression model is trained by minimizing a cost function that measures the difference between the model's outputs and the correct labels. The cost function for linear regression is the mean squared error (MSE), which is the mean of the squared prediction errors. MSE for linear regression is convex in the parameters, so every local minimum is a global minimum, and convex optimization methods or gradient descent with a suitable step size will reach it.
4. Logistic regression is similar to linear regression, but the output is a probability value between 0 and 1, so it is used for binary classification instead of regression.
5. A sigmoid (logistic) function is applied to the output of linear regression so that logistic regression outputs a probability. Instead of MSE, the cost function is binary cross entropy, whose value grows without bound as the predicted probability approaches the wrong label.

## Preliminary
---

### Optimization

#### Convex function 
> TODO

#### Gradient descent 
> TODO

## Linear regression for regression problems
---

### Linear function

If a function with single input and output $y = f(x)$ is a linear function, it represents a straight line on the coordinate plane. Thus it has the form:

$$ y = wx + b $$

where $w$ is the slope and $b$ is the y-intercept of the line. 

A linear function can also take multiple inputs. We can represent $d$ inputs with a vector $\mathbf{x} \in \mathbb{R}^{d}$, and the equation becomes:

$$ y = \mathbf{w} \cdot \mathbf{x} + b $$

where $\mathbf{w} \in \mathbb{R}^{d}$ is called **weights** or the weight vector and $b \in \mathbb{R}$ is called **bias**.

Linear regression is a supervised regression model that simply fits a linear function between the inputs $\mathbf{x}$ and output $y$. 
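
As a minimal sketch of the multi-input linear function above, the prediction is just a dot product plus the bias. The function name `predict_linear` and the parameter values are made up for illustration.

```python
def predict_linear(w, x, b):
    """Return w . x + b for one input vector x (w and x are equal-length lists)."""
    if len(w) != len(x):
        raise ValueError("w and x must have the same length")
    return sum(w_j * x_j for w_j, x_j in zip(w, x)) + b


# Hypothetical parameters for d = 3 inputs.
w = [0.5, -1.0, 2.0]
b = 0.25
x = [2.0, 1.0, 0.5]

y = predict_linear(w, x, b)  # 0.5*2 - 1.0*1 + 2.0*0.5 + 0.25 = 1.25
```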

### Loss function

A loss function measures the error between the labels predicted by the current model and the true labels for a set of instances (the training set).

Linear regression minimizes a particular loss function called the **mean squared error (MSE)**. Given the training labels $\mathbf{y} = (y_{1}, y_{2}, \dots, y_{n})$ for $n$ training instances and the predicted labels $\hat{\mathbf{y}} = (\hat{y}_{1}, \hat{y}_{2}, \dots, \hat{y}_{n})$ for the same instances, MSE is defined as:

$$ \operatorname{MSE}(\hat{\mathbf{y}}, \mathbf{y}) = \frac{1}{n} \sum_{i=1}^{n}(\hat{y}_{i} - y_{i})^{2} $$
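
The MSE definition translates almost directly into code. This is a small sketch on plain Python lists; the function name `mse` and the sample values are illustrative only.

```python
def mse(y_pred, y_true):
    """Mean of the squared differences between predictions and labels."""
    n = len(y_true)
    if n == 0 or len(y_pred) != n:
        raise ValueError("y_pred and y_true must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / n


y_true = [1.0, 2.0, 3.0]
y_pred = [1.5, 2.0, 2.0]

loss = mse(y_pred, y_true)  # (0.25 + 0.0 + 1.0) / 3
```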

Thus, fitting a linear regression model for a training set $\mathbf{X} \in \mathbb{R}^{n \times d}$ with labels $\mathbf{y} \in \mathbb{R}^{n}$ is a convex optimization problem:

$$
\begin{align}
\min_{\mathbf{w}, b} & \quad \frac{1}{n} \lVert (\mathbf{X}\mathbf{w} + b) - \mathbf{y} \rVert_{2}^{2} \\
= \min_{\mathbf{w}, b} & \quad \frac{1}{n} \sum_{i=1}^{n} ((\mathbf{w} \cdot \mathbf{x}_{i} + b) - y_{i})^{2} \\
\end{align}
$$
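
Because the objective is convex, one way to minimize it is plain gradient descent. The sketch below uses the gradients $\frac{2}{n}\sum_{i}(\hat{y}_{i} - y_{i})\mathbf{x}_{i}$ for $\mathbf{w}$ and $\frac{2}{n}\sum_{i}(\hat{y}_{i} - y_{i})$ for $b$. The function name `fit_linear_gd`, the learning rate `lr`, and the step count `steps` are assumptions chosen for this toy data, not tuned values.

```python
def fit_linear_gd(X, y, lr=0.05, steps=2000):
    """Minimize (1/n) * sum((w . x_i + b - y_i)^2) with batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        # Prediction error for every training instance.
        errors = [
            sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b - y_i
            for x_i, y_i in zip(X, y)
        ]
        # Gradients of the MSE objective w.r.t. each weight and the bias.
        grad_w = [
            2.0 / n * sum(e * x_i[j] for e, x_i in zip(errors, X))
            for j in range(d)
        ]
        grad_b = 2.0 / n * sum(errors)
        w = [w_j - lr * g_j for w_j, g_j in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


# Toy data generated from y = 2x + 1 without noise.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 3.0, 5.0, 7.0]

w, b = fit_linear_gd(X, y)  # w[0] approaches 2 and b approaches 1
```

A learning rate that is too large makes the iteration diverge, so `lr` has to be adapted to the scale of the inputs.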

### Regularization

Regularization is a technique for reducing overfitting in machine learning models. It works by adding a penalty term to the loss function that penalizes aspects of the model other than the prediction error.

$$ L_{\text{new}} = L_{\text{old}} + \lambda L_{\text{reg}} $$

where $\lambda$ is a hyperparameter that adjusts the trade-off between the primary objective and the regularization.

For linear regression and many other models, $L_{2}$ regularization (Ridge) and $L_{1}$ regularization (Lasso) are commonly used.
- Ridge penalizes the squared $L_{2}$ norm of the weights

$$ 
\begin{align}
L_{\text{ridge}} & = \lVert \mathbf{w} \rVert_{2}^{2} \\
& = \sum_{i=1}^{\lvert \mathbf{w} \rvert} w_{i}^{2} \\
\end{align}
$$

- Lasso penalizes the $L_{1}$ norm of the weights

$$ 
\begin{align}
L_{\text{lasso}} & = \lVert \mathbf{w} \rVert_{1} \\
& = \sum_{i=1}^{\lvert \mathbf{w} \rvert} \lvert w_{i} \rvert \\
\end{align}
$$
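
The two penalties and the combined loss $L_{\text{new}} = L_{\text{old}} + \lambda L_{\text{reg}}$ can be sketched in a few lines. The helper names and the numbers are hypothetical and only meant to make the formulas concrete.

```python
def ridge_penalty(w):
    """Squared L2 norm of the weights."""
    return sum(w_j ** 2 for w_j in w)


def lasso_penalty(w):
    """L1 norm of the weights."""
    return sum(abs(w_j) for w_j in w)


def regularized_loss(base_loss, w, lam, penalty):
    """L_new = L_old + lambda * L_reg."""
    return base_loss + lam * penalty(w)


w = [3.0, -4.0]

ridge = ridge_penalty(w)  # 9 + 16 = 25
lasso = lasso_penalty(w)  # 3 + 4 = 7
total = regularized_loss(1.0, w, lam=0.1, penalty=ridge_penalty)  # 1 + 0.1 * 25
```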

Both Ridge and Lasso penalize the magnitude of the weights. Large weights tend to overfit the training dataset because
> TODO

## Solving linear regression
---

Here we show how we can analytically solve linear regression to get a closed form solution.

We first fold $b$ into $\mathbf{w}$ to simplify the derivation, by adding $b$ as an extra weight in the $\mathbf{w}$ vector. The resulting weight vector $\mathbf{\hat{w}} \in \mathbb{R}^{d + 1}$ has one extra dimension:

$$ \mathbf{\hat{w}} = (b, w_{1}, w_{2}, \dots, w_{d}). $$

Then we add a dummy input $x_{0} = 1$ to all input instances, so that 

$$ \mathbf{\hat{x}} = (1, x_{1}, x_{2}, \dots, x_{d}). $$

As a result, we have

$$ \mathbf{\hat{w}} \cdot \mathbf{\hat{x}} = \mathbf{w} \cdot \mathbf{x} + b. $$

The objective that we want to minimize is

$$ \min \quad \frac{1}{n} \sum_{i=1}^{n} (\mathbf{\hat{w}} \cdot \mathbf{\hat{x}}_{i} - y_{i})^{2} + \lambda \lVert \mathbf{w} \rVert_{2}^{2} $$

Since this objective is convex, it can be solved directly by setting its derivative w.r.t. its parameters ($\mathbf{\hat{w}}$) to 0.
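
Carrying that step out in matrix form gives the linear system $(\mathbf{\hat{X}}^{T}\mathbf{\hat{X}} + n \lambda \mathbf{D}) \mathbf{\hat{w}} = \mathbf{\hat{X}}^{T}\mathbf{y}$, where the rows of $\mathbf{\hat{X}}$ are the augmented inputs $\mathbf{\hat{x}}_{i}$ and $\mathbf{D}$ is the identity matrix with the bias entry set to 0, because the penalty above covers only $\mathbf{w}$. The NumPy sketch below solves that system. The function name `fit_ridge_closed_form`, the matrix name `D`, and the synthetic data are assumptions introduced for this example.

```python
import numpy as np


def fit_ridge_closed_form(X, y, lam):
    """Solve min (1/n) * ||X_hat @ w_hat - y||^2 + lam * ||w||^2 in closed form.

    Returns w_hat = (b, w_1, ..., w_d). The bias b is not penalized,
    matching the objective above, where the penalty covers only w.
    """
    n, d = X.shape
    # Prepend the dummy input x_0 = 1 to every instance.
    X_hat = np.hstack([np.ones((n, 1)), X])
    # Identity matrix with the bias entry zeroed so that b stays unpenalized.
    D = np.eye(d + 1)
    D[0, 0] = 0.0
    # Zero gradient: (X_hat^T X_hat + n * lam * D) w_hat = X_hat^T y.
    return np.linalg.solve(X_hat.T @ X_hat + n * lam * D, X_hat.T @ y)


# Synthetic, noiseless data generated from y = 2*x1 - x2 + 0.5 (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([2.0, -1.0]) + 0.5

w_hat_plain = fit_ridge_closed_form(X, y, lam=0.0)  # recovers (0.5, 2, -1) up to rounding
w_hat_ridge = fit_ridge_closed_form(X, y, lam=0.1)  # weights shrunk toward zero
```

Using `np.linalg.solve` instead of forming an explicit inverse is a common numerical choice; the system is singular when $\lambda = 0$ and the columns of $\mathbf{\hat{X}}$ are linearly dependent.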

## Solving linear regression

Here we show how we can analytically solve linear regression with $L_{2}$ regularization to get a closed form solution.

$$ \min \quad \frac{1}{n} \sum_{i=1}^{n} ((\mathbf{w} \cdot \mathbf{x}_{i} + b) - y_{i})^{2} + \lambda \lVert \mathbf{w} \rVert_{2}^{2} $$

Since this objective is convex, it can be solved directly by setting its derivatives w.r.t. its parameters ($\mathbf{w}$ and $b$) to 0.

$$
\begin{align}
\frac{\partial}{\partial w_{j}} \left( \frac{1}{n} \sum_{i=1}^{n} ((\mathbf{w} \cdot \mathbf{x}_{i} + b) - y_{i})^{2} + \lambda \lVert \mathbf{w} \rVert_{2}^{2} \right) & = 0 \\
\frac{\partial}{\partial w_{j}} \left( \frac{1}{n} \sum_{i=1}^{n} (\mathbf{w} \cdot \mathbf{x}_{i} + (b - y_{i}))^{2} + \lambda \lVert \mathbf{w} \rVert_{2}^{2} \right) & = 0 \\
\frac{2}{n} \sum_{i=1}^{n} x_{i, j} (\mathbf{w} \cdot \mathbf{x}_{i} + (b - y_{i})) + 2 \lambda w_{j} & = 0 \\
\end{align}
$$

$$
\begin{align}
\frac{\partial}{\partial \mathbf{w}} \left( \frac{1}{n} \sum_{i=1}^{n} ((\mathbf{w} \cdot \mathbf{x}_{i} + b) - y_{i})^{2} + \lambda \lVert \mathbf{w} \rVert_{2}^{2} \right) & = 0 \\
\frac{\partial}{\partial \mathbf{w}} \left( \frac{1}{n} \sum_{i=1}^{n} (\mathbf{w} \cdot \mathbf{x}_{i} + (b - y_{i}))^{2} + \lambda \lVert \mathbf{w} \rVert_{2}^{2} \right) & = 0 \\
\frac{\partial}{\partial \mathbf{w}} \left( \frac{1}{n} \sum_{i=1}^{n} \left( (\mathbf{w} \cdot \mathbf{x}_{i})^{2} + 2(\mathbf{w} \cdot \mathbf{x}_{i})(b - y_{i}) + (b - y_{i})^{2} \right) + \lambda \lVert \mathbf{w} \rVert_{2}^{2} \right) & = 0 \\
\frac{1}{n} \sum_{i=1}^{n} \left( 2 \mathbf{x}_{i} \mathbf{x}_{i}^{T} \mathbf{w} + 2\mathbf{x}_{i}(b - y_{i}) \right) + 2 \lambda \mathbf{w} & = 0 \\
\frac{2}{n} \left( \sum_{i=1}^{n} \mathbf{x}_{i} \mathbf{x}_{i}^{T} \right) \mathbf{w} + \frac{2b}{n} \sum_{i=1}^{n} \mathbf{x}_{i} - \frac{2}{n} \sum_{i=1}^{n} \mathbf{x}_{i} y_{i} + 2 \lambda \mathbf{w} & = 0 \\
\end{align}
$$

> TODO

$$
\begin{align}
\frac{\partial}{\partial b} \left( \frac{1}{n} \sum_{i=1}^{n} ((\mathbf{w} \cdot \mathbf{x}_{i} + b) - y_{i})^{2} + \lambda \lVert \mathbf{w} \rVert_{2}^{2} \right) & = 0 \\
\frac{\partial}{\partial b} \left( \frac{1}{n} \sum_{i=1}^{n} (b + (\mathbf{w} \cdot \mathbf{x}_{i} - y_{i}))^{2} + \lambda \lVert \mathbf{w} \rVert_{2}^{2} \right) & = 0 \\
\frac{\partial}{\partial b} \left( \frac{1}{n} \sum_{i=1}^{n} \left( b^{2} + 2b(\mathbf{w} \cdot \mathbf{x}_{i} - y_{i}) + (\mathbf{w} \cdot \mathbf{x}_{i} - y_{i})^{2} \right) + \lambda \lVert \mathbf{w} \rVert_{2}^{2} \right) & = 0 \\
\frac{1}{n} \sum_{i=1}^{n} \left( 2b + 2(\mathbf{w} \cdot \mathbf{x}_{i} - y_{i}) \right) & = 0 \\
2b + \frac{2}{n} \sum_{i=1}^{n} (\mathbf{w} \cdot \mathbf{x}_{i} - y_{i}) & = 0 \\
b & = - \frac{1}{n} \sum_{i=1}^{n} (\mathbf{w} \cdot \mathbf{x}_{i} - y_{i}) \\
\end{align}
$$
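
The last line says that, for a fixed $\mathbf{w}$, the optimal bias is the mean of $y_{i} - \mathbf{w} \cdot \mathbf{x}_{i}$, which equals $\bar{y} - \mathbf{w} \cdot \bar{\mathbf{x}}$. A small sketch on made-up data (the function name `optimal_bias` and the values are hypothetical):

```python
def optimal_bias(w, X, y):
    """Return b = -(1/n) * sum(w . x_i - y_i) for a fixed weight vector w."""
    n = len(y)
    residual_sum = sum(
        sum(w_j * x_j for w_j, x_j in zip(w, x_i)) - y_i
        for x_i, y_i in zip(X, y)
    )
    return -residual_sum / n


# Hypothetical toy data: three instances with two features each.
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
y = [3.0, 6.0, 9.0]
w = [1.0, 0.5]

b = optimal_bias(w, X, y)  # w . x_i is 2, 5, 8, so every residual is -1 and b = 1
```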

## Logistic regression for classification problems
---

### From regression to classification using sigmoid function

How can we use linear regression on a binary classification problem where the labels are 0 and 1?

Answer: take the output of a linear regression model and pass it to a **sigmoid** (logistic) function:

$$ \sigma(x) = \mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}} $$

The sigmoid function has the following characteristics that make it suitable for binary classification:
1. The sigmoid function maps the range $(-\infty, \infty)$ to the range $(0, 1)$, so its output can be interpreted as the probability of being class 1.
1. Positive inputs map to outputs larger than 0.5 and negative inputs map to outputs less than 0.5.

Thus, given an instance $\mathbf{x}$, the binary output can be derived by applying a threshold $\theta$ (usually set to $0.5$) to the output of the logistic regression model,

$$ \hat{y} = \sigma(\mathbf{w} \cdot \mathbf{x} + b) $$

$$ 
\hat{y}_{\text{label}} = 
\begin{cases}
1, & \hat{y} \geq \theta \\
0, & \hat{y} < \theta \\
\end{cases}
$$
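
The prediction rule above can be sketched as follows. The branch on the sign of `z` is a common trick to avoid overflow in `math.exp` and is an implementation choice of this example; the names `sigmoid`, `predict_label`, and the parameter values are assumptions.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_label(w, x, b, threshold=0.5):
    """Return (probability of class 1, thresholded 0/1 label) for one instance."""
    z = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    p = sigmoid(z)
    return p, int(p >= threshold)


# Hypothetical parameters and input.
w = [1.0, -2.0]
b = 0.5
x = [2.0, 0.5]

p, label = predict_label(w, x, b)  # z = 2.0 - 1.0 + 0.5 = 1.5, so p > 0.5 and label = 1
```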

Note that another commonly used function is the **logit** function,

$$ \sigma^{-1}(x) = \mathrm{logit}(x) = \log \frac{x}{1 - x} $$

The logit function is the inverse of the sigmoid function, which can be derived by exchanging the input and the output of the sigmoid function and solving for $y$:

$$
\begin{align}
x &= \frac{1}{1 + e^{-y}} \\
\frac{1}{x} &= 1 + e^{-y} \\
e^{-y} &= \frac{1 - x}{x} \\
e^{y} &= \frac{x}{1 - x} \\
y &= \log\frac{x}{1 - x} \\
\end{align}
$$
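
The inverse relationship can be checked numerically with a round trip through both functions. This is only a sanity check on a few arbitrary inputs, not a proof.

```python
import math


def sigmoid(z):
    """Logistic function; fine for the moderate inputs used here."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Inverse of the sigmoid, defined for 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return math.log(p / (1.0 - p))


zs = [-3.0, -0.5, 0.0, 2.0]
round_trip = [logit(sigmoid(z)) for z in zs]  # each entry matches zs up to rounding
```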

### Binary cross entropy (log loss) instead of mean squared error 

Although the sigmoid function makes binary classification possible, it does not work well with the MSE loss. The primary reason is that MSE composed with the sigmoid function is no longer convex in the parameters.

$$ L_{\text{MSE}} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{1}{1 + e^{-(\mathbf{w} \cdot \mathbf{x}_{i} + b)}} - y_{i} \right)^{2} = \frac{1}{n} \sum_{i=1}^{n} (\sigma(\mathbf{w} \cdot \mathbf{x}_{i} + b) - y_{i})^{2} $$

One way to check whether a function is convex is to test whether its second derivative (the Hessian of $L_{\text{MSE}}$ w.r.t. $\mathbf{w}$) is positive semidefinite everywhere.

Writing $\sigma_{i} = \sigma(\mathbf{w} \cdot \mathbf{x}_{i} + b)$ for the prediction on instance $i$:

$$
\begin{align}
\frac{\partial L_{\text{MSE}}}{\partial \mathbf{w}} & = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial (\sigma_{i} - y_{i})^{2}}{\partial \sigma_{i}} \frac{\partial \sigma_{i}}{\partial \mathbf{w}} & \text{[chain rule]} \\
& = \frac{1}{n} \sum_{i=1}^{n} 2 (\sigma_{i} - y_{i}) \sigma_{i}(1 - \sigma_{i}) \mathbf{x}_{i} & \text{[$\sigma' = \sigma(1 - \sigma)$]} \\
& = \frac{2}{n} \sum_{i=1}^{n} (\sigma_{i}^{2} - \sigma_{i}^{3} - y_{i}\sigma_{i} + y_{i}\sigma_{i}^{2}) \mathbf{x}_{i} \\
\end{align}
$$

$$
\begin{align}
\frac{\partial^{2} L_{\text{MSE}}}{\partial \mathbf{w}^{2}} & = \frac{\partial}{\partial \mathbf{w}} \left( \frac{2}{n} \sum_{i=1}^{n} (\sigma_{i}^{2} - \sigma_{i}^{3} - y_{i}\sigma_{i} + y_{i}\sigma_{i}^{2}) \mathbf{x}_{i} \right) \\
& = \frac{2}{n} \sum_{i=1}^{n} \mathbf{x}_{i} \frac{\partial}{\partial \sigma_{i}} \left( \sigma_{i}^{2} - \sigma_{i}^{3} - y_{i}\sigma_{i} + y_{i}\sigma_{i}^{2} \right) \left( \frac{\partial \sigma_{i}}{\partial \mathbf{w}} \right)^{T} \\
& = \frac{2}{n} \sum_{i=1}^{n} (2\sigma_{i} - 3\sigma_{i}^{2} - y_{i} + 2y_{i}\sigma_{i}) \sigma_{i}(1 - \sigma_{i}) \mathbf{x}_{i} \mathbf{x}_{i}^{T} \\
\end{align}
$$

> TODO: prove the hessian matrix is not positive semidefinite.
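
Short of a proof, a numerical probe already shows negative curvature. The sketch below assumes the simplest possible setting, a single instance with scalar input $x = 1$, label $y = 1$, and $b = 0$, and estimates the second derivative of the loss w.r.t. $w$ by central finite differences. The helper names and the step size `h` are choices made for this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mse_single(w, x=1.0, y=1.0):
    """Squared error of sigmoid(w * x) against label y for one instance."""
    return (sigmoid(w * x) - y) ** 2


def second_derivative(f, w, h=1e-4):
    """Central finite-difference estimate of f''(w)."""
    return (f(w + h) - 2.0 * f(w) + f(w - h)) / (h * h)


curv_at_minus_2 = second_derivative(mse_single, -2.0)  # negative: the loss is not convex here
curv_at_zero = second_derivative(mse_single, 0.0)      # positive
```

A convex function of one variable never has a negative second derivative, so a single negative value is enough to rule out convexity in this setting.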

Thus, instead of MSE, **binary cross entropy** (BCE) loss (log loss) is used with sigmoid to create a convex objective.

$$ \mathrm{BCE} = -\frac{1}{n}\sum_{i=1}^{n} \left( y_{i}\log(\hat{y}_{i}) + (1-y_{i})\log(1-\hat{y}_{i}) \right) $$

BCE assumes both inputs $y$ and $\hat{y}$ are in the range $[0, 1]$. Since the labels $y_{i}$ are normally 0 or 1, the loss for each prediction and label pair decomposes into two cases:

$$ 
\begin{cases}
-\log(\hat{y}_{i}) &  y_{i} = 1 \\
-\log(1 - \hat{y}_{i}) & y_{i} = 0 \\
\end{cases}
$$
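
A direct implementation of BCE is sketched below. Clipping the predictions with a small `eps` to avoid `log(0)` is a practical safeguard added for this example and is not part of the definition; the function name and sample values are illustrative.

```python
import math


def binary_cross_entropy(y_pred, y_true, eps=1e-12):
    """Mean binary cross entropy between predicted probabilities and 0/1 labels."""
    n = len(y_true)
    if n == 0 or len(y_pred) != n:
        raise ValueError("y_pred and y_true must be non-empty and of equal length")
    total = 0.0
    for p, t in zip(y_pred, y_true):
        p = min(max(p, eps), 1.0 - eps)  # assumption: clip to keep log() finite
        total += t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return -total / n


y_true = [1, 0, 1]
y_pred = [0.9, 0.2, 0.6]

loss = binary_cross_entropy(y_pred, y_true)  # -(log 0.9 + log 0.8 + log 0.6) / 3

# A confident wrong prediction is penalized far more than a hesitant one.
confident_wrong = binary_cross_entropy([0.01], [1])
hesitant_wrong = binary_cross_entropy([0.4], [1])
```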

```python
from IPython.display import IFrame

# Embed the linked Desmos calculator in the notebook output.
IFrame("https://www.desmos.com/calculator/ojpmcptvt0?embed", width=500, height=500)
```

## References
---

1. https://towardsdatascience.com/why-not-mse-as-a-loss-function-for-logistic-regression-589816b5e03c
1. https://www.cs.toronto.edu/~rgrosse/courses/csc311_f20/readings/notes_on_linear_regression.pdf