Logistic Regression

1) Logistic regression basics/details
* Motivation:
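    * With linear regression, we are modeling a continuous response and finding the linear function that gives the best fit
    * But what happens if we try linear regression on a *binary* response (e.g. yes/no)? The fitted line is unbounded, so it can predict values below 0 or above 1, which cannot be read as probabilities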
![lin_reg_binary](linear_regression_binary.png)
* Logistic regression properties:
    * Takes continuous input (any value in $(-\infty,\infty)$)
    * Produces output between 0 and 1
    * Transitions from outputting 0 to outputting 1 quickly
    * Has interpretable coefficients (similar to linear regression)
![log_reg](logistic_regression.png)
* **Sigmoid (logistic) function** - an S-shaped curve (sigmoid curve) whose domain is all real numbers and whose output increases monotonically, most often from 0 to 1 or alternatively from −1 to 1, depending on convention
    * Equation: $$S(x)=\frac{e^x}{e^x + 1}=\frac{1}{1+e^{-x}}$$
    * For logistic regression: $$p(y)=\frac{1}{1+e^{-X\beta}}$$ <center> where $p(y)$ denotes the probability of success of $y$ (mean of the response) </center>
    * How do we get this function?
        * **Link function** - provides the relationship between a linear combination of our inputs ($X\beta$) and the mean of the response ($p(y)$)
        * Equation: $$\begin{align} \ln\left(\frac{p(y)}{1-p(y)}\right) & = X\beta \\
            \frac{p(y)}{1-p(y)} & = e^{X\beta} \\
            p(y) & = (1-p(y))e^{X\beta} \\
            p(y) & = e^{X\beta}-p(y)e^{X\beta} \\
            p(y)+p(y)e^{X\beta} & = e^{X\beta} \\
            p(y)(1+e^{X\beta}) & = e^{X\beta} \\
            p(y) & = \frac{e^{X\beta}}{1+e^{X\beta}} \\
            p(y) & = \frac{\frac{e^{X\beta}}{e^{X\beta}}}{\frac{1+e^{X\beta}}{e^{X\beta}}} \\
            p(y) & = \frac{1}{1+e^{-X\beta}}
            \end{align}$$
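
The derivation above says the sigmoid and the log-odds (logit) link are inverses of each other. A minimal sketch that checks this numerically; the function names `sigmoid` and `logit` are chosen here for illustration, and the code is not hardened against overflow for very large negative inputs:

```python
import math

def sigmoid(z):
    """Map a real-valued linear predictor z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Log-odds link: ln(p / (1 - p)), the inverse of the sigmoid."""
    return math.log(p / (1.0 - p))

# Round trip: applying the link to the sigmoid output recovers z.
for z in (-4.0, 0.0, 4.0):
    p = sigmoid(z)
    print(f"z={z:+.1f}  p={p:.4f}  logit(p)={logit(p):+.4f}")
```

Here `z` plays the role of $X\beta$ for a single observation.
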
* Linear vs Logistic regression assumptions:
    * In linear regression, we assume: $y_i | X \sim N(X\beta,\sigma^2)$ (Gaussian/Normal distribution)
    * In logistic regression, we assume: $y_i | X \sim \text{Bernoulli}(p)$ (Bernoulli distribution)
        * In binary classification: $y_i = \begin{cases} 
        1  & \text{if event occurs} \\
        0 & \text{if event doesn't occur}
        \end{cases}$
* Estimating through **Maximum Likelihood Estimation (MLE)** - the parameters of logistic regression are estimated through maximum likelihood
    1. Each individual observation follows a **Bernoulli distribution**: $$y_i | X \sim \text{Bernoulli}(p) \rightarrow P(y_i) = p^{y_i}(1-p)^{1-y_i}$$
    2. Given the distribution type, the **likelihood** of our $\beta$ vector is the product of the individual probabilities: $$L(\beta|y)=\prod_{i=1}^N p(y_i)^{y_i}(1-p(y_i))^{(1-y_i)}$$
    3. Then, convert to the **log likelihood**: $$l=\sum_{i=1}^N \left[ y_i \log(p(y_i))+(1-y_i)\log(1-p(y_i)) \right]$$
        * Unfortunately, there is **no closed-form solution**, so iterative methods are typically used (e.g. **SGD**)
        * These methods use the first (and sometimes second) derivatives of the log likelihood to take clever steps towards an optimal solution, starting from a random guess
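
To make the iterative fitting concrete, here is a minimal sketch of batch gradient ascent on the log likelihood for one predictor plus an intercept. It is an illustration under stated assumptions rather than how any particular library fits the model: the data are synthetic, `true_beta`, `lr`, and `steps` are arbitrary choices, and the start point is zero instead of a random guess so the run is reproducible. It uses the fact that the gradient of $l$ with respect to $\beta_j$ is $\sum_i (y_i - p(y_i))x_{ij}$.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(beta, xs, ys):
    """l(beta) = sum_i [ y_i*log(p_i) + (1 - y_i)*log(1 - p_i) ]."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(beta[0] + beta[1] * x)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

def fit(xs, ys, lr=0.1, steps=1000):
    """Batch gradient ascent on the (averaged) log likelihood."""
    beta = [0.0, 0.0]  # assumption: start at zero, not a random guess
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(beta[0] + beta[1] * x)  # y_i - p_i
            g0 += err        # gradient w.r.t. the intercept
            g1 += err * x    # gradient w.r.t. the slope
        beta[0] += lr * g0 / n
        beta[1] += lr * g1 / n
    return beta

# Synthetic data drawn from a known (hypothetical) model.
random.seed(0)
true_beta = (-1.0, 2.0)
xs = [random.uniform(-3, 3) for _ in range(300)]
ys = [1 if random.random() < sigmoid(true_beta[0] + true_beta[1] * x) else 0
      for x in xs]

beta_hat = fit(xs, ys)
print("estimated beta:", beta_hat)
print("log likelihood at estimate:", log_likelihood(beta_hat, xs, ys))
print("log likelihood at zero:    ", log_likelihood([0.0, 0.0], xs, ys))
```

SGD follows the same update but computes the gradient from one observation (or a small batch) at a time; methods that also use second derivatives typically need fewer steps.
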
* **Interpreting the Results:**
    * Example: Fit logistic regression model with outcome/response as **whether or not a person works** ($1$ or $0$ $\rightarrow$ y or n) and only **one predictor, income**: $$p(y)=\frac{1}{1+e^{-(\beta_0+X_{income}\beta_{income})}}$$
        * To interpret the coefficients, use **link function**: $$\begin{align} ln(\frac{p(y)}{1-p(y)}) & = \beta_0+X_{income}\beta_{income} \\
                \frac{p(y)}{1-p(y)} & = e^{\beta_0+X_{income}\beta_{income}} \\
                \frac{p(y)}{1-p(y)} & = e^{\beta_0}e^{X_{income}\beta_{income}} \\
                \end{align}$$
        * The **Odds**: $$\frac{p(y)}{1-p(y)}$$
        * Interpretation: for a one-unit increase in $X_{income}$, the odds are multiplied by $e^{\beta_{income}}$ (the **odds ratio**)
            * $\beta_{income}=0.00001 \rightarrow$ a one-unit increase in income, $\$1$, multiplies the odds of somebody working by $e^{0.00001}\approx 1.00001$
            * Basically, for each additional dollar that a person makes, we expect a $0.001\%$ increase in the odds that they are working
            * For an additional $\$1000$ that a person makes, we expect roughly a $1\%$ increase in the odds that they work: $e^{0.00001 \times 1000}=e^{0.01}\approx 1.01$
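
The arithmetic in the income example can be checked with a few lines. This is a small sketch using the coefficient value from the example; the helper name `odds_multiplier` is made up for illustration:

```python
import math

beta_income = 0.00001  # coefficient from the example above (per dollar)

def odds_multiplier(beta, delta_x):
    """Factor by which the odds change when the predictor rises by delta_x."""
    return math.exp(beta * delta_x)

for dollars in (1, 1000):
    factor = odds_multiplier(beta_income, dollars)
    pct = (factor - 1) * 100
    print(f"+${dollars}: odds multiplied by {factor:.6f} ({pct:.4f}% change)")
```

Because the effect is multiplicative, the change for $\$1000$ is $e^{0.01}$ rather than exactly 1000 times the one-dollar change, although the two are very close for coefficients this small.
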

2) Classification metrics and the Confusion matrix
![log_reg_matrix](log_reg_matrix.png)

| True $\rightarrow$<br/>$\downarrow$Predicted  	| Positive 	| Negative 	|  	|
|----------------------------------------------:	|:---------------------------------------------:	|:--------------------------------------:	|:---------------------------------------:	|
| **Predicted Positive** 	| True Positive (TP) 	| False Positive (FP)<br/>(Type I error) 	| **Precision** /<br/>Positive Predictive Value 	|
| **Predicted Negative** 	| False Negative (FN)<br/>(Type II error) 	| True Negative (TN) 	|  	|
|  	| **Recall** /<br/>True Positive Rate /<br/>Sensitivity 	| False Positive Rate<br/> 	| **$F_1$ Score** 	|
|  	| False Negative Rate 	| True Negative Rate /<br/>Specificity 	| Accuracy 	|
* Metrics:
![classification_metrics_confusion_matrix](classification_metrics_confusion_matrix.png)
    * **Accuracy** - what fraction of all observations were labelled correctly (for positive and negative labels)?
        $$\frac{\text{True Positives + True Negatives}}{\text{True Positives + False Negatives + True Negatives + False Positives}}$$
    * **Recall / True Positive Rate / Sensitivity** - Of those observations that are truly positive, what fraction were correctly labelled positive?
        $$\frac{\text{True Positives}}{\text{True Positives + False Negatives}}$$
    * **True Negative Rate / Specificity** - Of those observations that are truly negative, what fraction were correctly labelled negative?
        $$\frac{\text{True Negatives}}{\text{True Negatives + False Positives}}$$
    * **Precision / Positive Predictive Value** - Of those observations that are labelled positive, what fraction are actually positive?
        $$\frac{\text{True Positives}}{\text{True Positives + False Positives}}$$
    * **False Positive Rate** - Of those observations that are truly negative, what fraction were incorrectly labelled positive?
        $$\frac{\text{False Positives}}{\text{True Negatives + False Positives}}$$
    * **$F_1$ Score** - the harmonic mean of precision and recall, so it accounts for both Type I and Type II errors
        $$\frac{2}{ \frac{1}{\text{Recall}}+\frac{1}{\text{Precision}} } = \frac{2\text{(Precision)(Recall)}}{\text{Precision + Recall}}$$
        $$F_{\beta} = (1+\beta^2) \frac{\text{(Precision)(Recall)}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$$
        $$\beta > 1 \rightarrow \text{ recall is weighted more heavily}$$
        $$\beta < 1 \rightarrow \text{ precision is weighted more heavily}$$
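
The metric definitions above can be computed directly from the four confusion-matrix counts. A minimal sketch with made-up counts (`tp=40, fp=10, fn=20, tn=30` are illustrative, not from any real model); it assumes every denominator is non-zero:

```python
def classification_metrics(tp, fp, fn, tn, beta=1.0):
    """Compute the metrics listed above from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "recall": recall,                      # true positive rate / sensitivity
        "specificity": tn / (tn + fp),         # true negative rate
        "precision": precision,                # positive predictive value
        "false_positive_rate": fp / (tn + fp),
        "f_beta": (1 + b2) * precision * recall / (b2 * precision + recall),
    }

# Hypothetical counts for illustration only.
m = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in m.items():
    print(f"{name:>20}: {value:.4f}")
```

With the default `beta=1.0`, `f_beta` is the $F_1$ score; passing `beta=2.0` would weight recall more heavily, as described above.
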
* **Receiver Operating Characteristic Curve (ROC Curve)** - visualizes the performance of a given *binary classifier* by showing how the **True Positive Rate** changes with the **False Positive Rate** as the classification threshold is varied
<img src=ROC_curve.png text="ROC Curve" width=60% />
    * Aim to choose the model that minimizes the FPR and maximizes the TPR (i.e. the model whose curve lies closest to the top-left corner)
    * Compare across model curves to determine which model gives the best TPR for a given FPR
    * **Area Under Curve (AUC)** - examine the area under the curve to try to differentiate one model from another
        * A greater area under the curve is typically better, but the choice also depends on what *true/false positive rates you are willing to accept*
        * *Random guessing traces the 45-degree line (AUC of 0.5)*, so we should aim for a model whose curve lies above that line
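
As a rough illustration of how the ROC curve and AUC are obtained, the sketch below sweeps a threshold over predicted scores, records the (FPR, TPR) pair at each threshold, and integrates with the trapezoidal rule. The labels and scores are toy values, the function names are chosen here, and the code assumes both classes are present:

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs from sweeping the threshold over the scores."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing labelled positive
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Trapezoidal area under the piecewise-linear ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Toy labels and predicted probabilities, for illustration only.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
print("ROC points (FPR, TPR):", points)
print("AUC:", auc(points))
```

Each threshold corresponds to one confusion matrix, which is why moving along the curve trades a higher TPR for a higher FPR.
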