# Logistic Regression

$\newcommand{\b}[1]{\mathbf{#1}} \newcommand{\c}[1]{\mathcal{#1}}$
## 1. Introduction

Logistic regression is a classification model that predicts whether a particular datapoint belongs to a class or not. Its form can be motivated by modeling the class-conditional densities $P(\b{x}|\c{C}_k)$ and the priors $P(\c{C}_k)$, and using them to find the posterior probability $P(\c{C}_1|\b{x})$.

From Bayes' Rule:
$$\begin{align}
P(\c{C}_1|\b{x}) &= \frac{P(\b{x}|\c{C}_1)P(\c{C}_1)}{P(\b{x}|\c{C}_1)P(\c{C}_1) + P(\b{x}|\c{C}_2)P(\c{C}_2)} \\
&= \frac{1}{1 + e^{-a}} \\
\text{where } a &= \ln \left( \frac{P(\b{x}|\c{C}_1)P(\c{C}_1)}{P(\b{x}|\c{C}_2)P(\c{C}_2)} \right)
\end{align}$$

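Note how this expresses $P(\c{C}_1|\b{x})$ as a sigmoid function of $a$. The inverse of the sigmoid is the logit, $a = \ln \left( \frac{P(\c{C}_1|\b{x})}{P(\c{C}_2|\b{x})} \right)$, which is the log of the ratio of the two posterior probabilities (the log odds).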
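As a quick numerical check of the identity above, the following sketch computes the posterior two ways: directly from Bayes' rule, and as a sigmoid of the log-ratio $a$. The 1-D Gaussian class-conditionals, their means, and the priors are illustrative assumptions, not values from the text.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid: 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def logit(p):
    """Inverse of the sigmoid: the log odds ln(p / (1 - p))."""
    return np.log(p / (1.0 - p))

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian (an assumed class-conditional density)."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

# Assumed toy setup: two classes with 1-D Gaussian densities and unequal priors.
prior_1, prior_2 = 0.3, 0.7
x = np.linspace(-3.0, 5.0, 9)
joint_1 = gaussian_pdf(x, mean=2.0, std=1.0) * prior_1  # P(x|C1) P(C1)
joint_2 = gaussian_pdf(x, mean=0.0, std=1.0) * prior_2  # P(x|C2) P(C2)

posterior_bayes = joint_1 / (joint_1 + joint_2)  # Bayes' rule
a = np.log(joint_1 / joint_2)                    # log-ratio of the joints
posterior_sigmoid = sigmoid(a)

print(np.allclose(posterior_bayes, posterior_sigmoid))  # do the two forms agree?
print(np.allclose(logit(posterior_sigmoid), a))         # does logit undo sigmoid?
```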

For more than two classes, the sigmoid generalizes to the softmax function:
$$\begin{align}
P(\c{C}_k|\b{x}) &= \frac{P(\b{x}|\c{C}_k)P(\c{C}_k)}{\sum_i P(\b{x}|\c{C}_i)P(\c{C}_i)} \\
&= \frac{e^{a_k}}{\sum_i e^{a_i}} \\
\text{where } a_k &= \ln \left( P(\b{x}|\c{C}_k)P(\c{C}_k) \right)
\end{align}$$
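The multi-class case can be checked the same way. This sketch assumes three classes with made-up values of the joint $P(\b{x}|\c{C}_k)P(\c{C}_k)$ at a single point, and compares the softmax of their logs with the directly normalized posterior. Subtracting the maximum before exponentiating is a common numerical-stability trick, not something the derivation requires.

```python
import numpy as np

def softmax(a):
    """Softmax over the last axis, shifted by the max for numerical stability."""
    shifted = a - np.max(a, axis=-1, keepdims=True)
    exp_a = np.exp(shifted)
    return exp_a / np.sum(exp_a, axis=-1, keepdims=True)

# Assumed joint values P(x|C_k) P(C_k) for three classes at one point x.
joint = np.array([0.02, 0.10, 0.08])

posterior_bayes = joint / joint.sum()  # normalize the joints directly
a = np.log(joint)                      # a_k = ln(P(x|C_k) P(C_k))
posterior_softmax = softmax(a)

print(posterior_bayes)
print(posterior_softmax)
```

With only two classes, `softmax([a_1, a_2])[0]` reduces to a sigmoid of the difference `a_1 - a_2`, which ties this back to the two-class formula.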

### 1.1. Gaussian Class-Conditional Density

Let's take the simple case of two classes whose class-conditional densities are Gaussian with the same covariance matrix $\b{\Sigma}$:
$$\begin{align}
P(\b{x}|\c{C}_k) &= \frac{1}{(2\pi)^{D/2}|\b{\Sigma}|^{1/2}} \exp \left\{ -\frac{1}{2}(\b{x} - \b{\mu}_k)^T \b{\Sigma}^{-1} (\b{x} - \b{\mu}_k) \right\}
\end{align}$$

Plugging this into the expression for $a$ gives us

$$\begin{align}
a &= -\frac{1}{2}\left( (\b{x} - \b{\mu}_1)^T \b{\Sigma}^{-1} (\b{x} - \b{\mu}_1) \right) + \frac{1}{2}\left( (\b{x} - \b{\mu}_2)^T \b{\Sigma}^{-1} (\b{x} - \b{\mu}_2) \right) + \ln \frac{P(\c{C}_1)}{P(\c{C}_2)}
\end{align}$$

Note that 
1. $\b{\Sigma}$ is symmetric, hence $\b{\Sigma}^{-1}$ is symmetric as well
2. Hence, $\b{x}^T \b{\Sigma}^{-1} \b{\mu}_i = \b{\mu}_i^T \b{\Sigma}^{-1} \b{x}$

<sub>(This is not spelled out in Bishop; it is included here because it might help.)</sub>
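
For reference, the intermediate step can be sketched as follows: each quadratic form is expanded, and the symmetry noted above merges the two cross terms into one.

$$\begin{align}
(\b{x} - \b{\mu}_k)^T \b{\Sigma}^{-1} (\b{x} - \b{\mu}_k) &= \b{x}^T\b{\Sigma}^{-1}\b{x} - 2\b{\mu}_k^T\b{\Sigma}^{-1}\b{x} + \b{\mu}_k^T\b{\Sigma}^{-1}\b{\mu}_k
\end{align}$$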

This gives us
$$a = \left( \b{\Sigma}^{-1}(\b{\mu}_1 - \b{\mu}_2)\right)^T\b{x} - \frac{1}{2}\b{\mu}_1^T\b{\Sigma}^{-1}\b{\mu}_1 + \frac{1}{2}\b{\mu}_2^T\b{\Sigma}^{-1}\b{\mu}_2 + \ln \frac{P(\c{C}_1)}{P(\c{C}_2)}$$

This is nicely visualized here:

![sing](single_boundary.png)

Note that there are no quadratic terms in $\b{x}$ here: they cancel out because both classes share the same covariance matrix. If the covariance matrices were different, $a$ would contain a $\frac{1}{2} \left( \b{x}^T (\b{\Sigma}_2^{-1} - \b{\Sigma}_1^{-1}) \b{x} \right)$ term as well, giving us a quadratic decision boundary, which would look something like this:

![mult](multiple_boundary.png)

(Pictures taken from Bishop. The red and green distributions have the same covariance matrix and hence a linear decision boundary between them, whereas the boundaries involving the blue distribution are quadratic.)
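
The shared-covariance result can also be checked numerically. This sketch builds $\b{w} = \b{\Sigma}^{-1}(\b{\mu}_1 - \b{\mu}_2)$ and the constant $w_0 = -\frac{1}{2}\b{\mu}_1^T\b{\Sigma}^{-1}\b{\mu}_1 + \frac{1}{2}\b{\mu}_2^T\b{\Sigma}^{-1}\b{\mu}_2 + \ln \frac{P(\c{C}_1)}{P(\c{C}_2)}$, then compares $\sigma(\b{w}^T\b{x} + w_0)$ with the posterior computed directly from Bayes' rule. The names `w` and `w0`, the helper `gaussian_density`, and all parameter values are illustrative assumptions.

```python
import numpy as np

def gaussian_density(x, mean, cov):
    """Multivariate Gaussian density evaluated at each row of x."""
    d = mean.shape[0]
    diff = x - mean
    cov_inv = np.linalg.inv(cov)
    # Row-wise quadratic form (x - mean)^T cov^{-1} (x - mean)
    quad = np.einsum('ni,ij,nj->n', diff, cov_inv, diff)
    norm = (2.0 * np.pi) ** (d / 2.0) * np.sqrt(np.linalg.det(cov))
    return np.exp(-0.5 * quad) / norm

# Assumed toy parameters: two 2-D Gaussian classes sharing one covariance matrix.
mu_1 = np.array([1.0, 1.0])
mu_2 = np.array([-1.0, 0.0])
sigma = np.array([[1.0, 0.3],
                  [0.3, 0.5]])
prior_1, prior_2 = 0.4, 0.6

sigma_inv = np.linalg.inv(sigma)

# Coefficients of a = w^T x + w0 for the shared-covariance case.
w = sigma_inv @ (mu_1 - mu_2)
w0 = (-0.5 * mu_1 @ sigma_inv @ mu_1
      + 0.5 * mu_2 @ sigma_inv @ mu_2
      + np.log(prior_1 / prior_2))

# Evaluate at a few random points (fixed seed so the check is repeatable).
rng = np.random.default_rng(0)
points = rng.normal(size=(5, 2))

posterior_linear = 1.0 / (1.0 + np.exp(-(points @ w + w0)))

# The same posterior computed directly from Bayes' rule.
joint_1 = gaussian_density(points, mu_1, sigma) * prior_1
joint_2 = gaussian_density(points, mu_2, sigma) * prior_2
posterior_bayes = joint_1 / (joint_1 + joint_2)

print(np.allclose(posterior_linear, posterior_bayes))
```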


### 1.2. Maximum Likelihood Estimation



### 1.3. General Linear Model 

## 2. Logistic Regression with 2 Classes

We use a processed version of the [Cleveland Heart Disease dataset](https://archive.ics.uci.edu/ml/datasets/heart+disease) for this task.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```



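```python
from sklearn.model_selection import train_test_split

# Load the data; 'num' is the target column
df = pd.read_csv('../data/heart_disease/cleveland.csv').set_index('SNO')
# Hold out 30% for testing, stratified so both splits keep the class balance
X_train, X_test, y_train, y_test = train_test_split(df.drop('num', axis=1), df['num'], stratify=df['num'], test_size=0.3)
```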
