# Lesson

## Import

In [None]:
import warnings

import pandas as pd
import numpy as np

import statsmodels.api as sm
from statsmodels.tools.sm_exceptions import HessianInversionWarning

In [None]:
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore', HessianInversionWarning)

## Code

In [None]:
def compute_model(y_name, X_name, data):
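    # Cast to float: recent pandas versions return boolean dummy columns,
    # which statsmodels may not accept directly
    y = data.loc[:, y_name].astype(float)
    X = sm.add_constant(data.loc[:, X_name].astype(float).values)
    model = sm.Logit(y, X).fit(disp=0)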
    
    return show_table(model, X_name)

In [None]:
def show_table(model, X_name):
    index_name = ['Intercept']
    if isinstance(X_name, str):
        index_name.append(X_name)
    elif isinstance(X_name, list):
        index_name = index_name + X_name
    
    # summary2() stores its tables as DataFrames; tables[1] holds the coefficient estimates
    df = model.summary2().tables[1].copy()
    df.index = index_name
    
    return df

In [None]:
default = pd.read_csv('data/Default.csv')

In [None]:
# table 4.1
default_dummy = pd.get_dummies(default, columns=['default'])
compute_model('default_Yes', 'balance', default_dummy)

In [None]:
# table 4.2
default_dummy = pd.get_dummies(default, columns=['default', 'student'])
compute_model('default_Yes', 'student_Yes', default_dummy)

In [None]:
# table 4.3
default_dummy = pd.get_dummies(default, columns=['default', 'student'])
compute_model('default_Yes', ['balance', 'income', 'student_Yes'], default_dummy)

<div style='border:1px solid black; padding:10px'>

**Confounder** (also **confounding variable**, **confounding factor**, or **lurking variable**): a variable that influences both the dependent variable and independent variable, causing a spurious association.

**Spurious relationship** (or **spurious correlation**): a mathematical relationship in which two or more events or variables are associated but not causally related, due to either coincidence or the presence of a certain third, unseen factor.

**Example**: Two events can cause grass to be wet: an active sprinkler or rain. Rain also has a direct effect on the use of the sprinkler (when it rains, the sprinkler is usually not active).
> <img src='images/bayesian_network.png' width='430px'>

</div>

# Linear discriminant analysis (LDA)

Logistic regression involves directly modeling $\Pr{(Y=k | X=x)}$ using the logistic function. 

Consider an alternative approach where we model the distribution of $X$ separately in each of the response classes, and then use Bayes' theorem to flip these around into estimates for $\Pr{(Y=k | X=x)}$.

Why do we need another method?

* When the classes are well-separated, the parameter estimates for the logistic regression model are surprisingly unstable. LDA does not suffer from this problem.
* If $n$ is small and the distribution of the predictors $X$ is approximately normal in each of the classes, LDA is more stable than logistic regression.

## Using Bayes' theorem for classification

**Left:** Two normal density functions, $f_1(x)$ (with $\mu_1 = -1.25$ and $\sigma_1^2 = 1$) and $f_2(x)$ (with $\mu_2 = 1.25$ and $\sigma_2^2 = 1$).

<div style='float:right; border:1px solid black; padding:10px; padding-left:15px; width:260px'>

**Posterior probability**: the probability that an event will happen after all evidence or background information has been taken into account. It is closely related to the prior probability, which is the probability that an event will happen before any new evidence is taken into account.

</div>

<div style='margin-right:280px'>
Suppose that we classify an observation into one of $K$ classes ($K \geq 2$).

* Let $p_k(X) = \Pr{(Y=k | X)}$ be the posterior probability that an observation $X = x$ belongs to the $k$th class.
* Let $\pi_k$ represent the overall or prior probability that a randomly chosen observation is associated with the $k$th category of the response variable $Y$.
* Let $f_k(x) \equiv \Pr{(X=x | Y=k)}$ denote the density function of $X$ for an observation that comes from the $k$th class.
    * $f_k(x)$ large $\Rightarrow$ very likely that an observation in the $k$th class has $X \approx x$.
    * $f_k(x)$ small $\Rightarrow$ very unlikely that an observation in the $k$th class has $X \approx x$.
</div>

**Bayes' theorem**: 
$\Pr{(Y=k | X=x)} = \dfrac{\pi_k f_k(x)}{\sum\limits_{l=1}^K{\pi_l f_l(x)}}$

* This equation suggests that instead of directly computing $p_k(X)$, we can plug in estimates for $\pi_k$ and $f_k(X)$.
    * Estimating $\pi_k$ is easy if we have a random sample of $Y$s from the population.
        * Compute the fraction of the training observations that belong to the $k$th class.
    * Estimating $f_k(X)$ is more challenging, unless we assume some simple forms for these densities.
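
The plug-in idea can be sketched in a few lines. The example below is only an illustration: it assumes Gaussian class densities, and the function names (`normal_pdf`, `posterior`) and the parameter values are made up for this sketch.

```python
import math

def normal_pdf(x, mu, sigma2):
    """Density of N(mu, sigma2) evaluated at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def posterior(x, priors, mus, sigma2s):
    """Posterior Pr(Y=k | X=x) for every class k, via Bayes' theorem."""
    # Numerators pi_k * f_k(x), one per class
    weighted = [pi * normal_pdf(x, mu, s2) for pi, mu, s2 in zip(priors, mus, sigma2s)]
    total = sum(weighted)  # denominator: sum over all classes
    return [w / total for w in weighted]

# Assumed values: two classes with equal priors and unit variances
probs = posterior(0.0, priors=[0.5, 0.5], mus=[-1.25, 1.25], sigma2s=[1.0, 1.0])
print(probs)
```

At $x = 0$ the two weighted densities are equal, so each class receives half of the posterior mass.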

## LDA for $p=1$

Assume we only have one predictor. Hence, $p=1$. We want to obtain an estimate for $f_k(x)$ to plug into Bayes' theorem to estimate $p_k(x)$.

Suppose we assume that $f_k(x)$ is *normal* or *Gaussian*.

Normal density: $f_k(x) = \dfrac{1}{\sqrt{2\pi}\sigma_k} \exp{\left(-\dfrac{1}{2\sigma_k^2}(x - \mu_k)^2\right)}$
    
&emsp;where $\mu_k$ and $\sigma_k^2$ are the mean and variance parameters for the $k$th class.
    
Assume that $\sigma_1^2 = \dots = \sigma_K^2$, that is, there is a shared variance term across all $K$ classes. For simplicity, denote it by $\sigma^2$.
    
&emsp;&emsp;$p_k(x) 
= \dfrac{\pi_k \dfrac{1}{\sqrt{2\pi}\sigma} \exp{\left(-\dfrac{1}{2\sigma^2}(x - \mu_k)^2\right)}}{\sum\limits_{l=1}^K{\pi_l \dfrac{1}{\sqrt{2\pi}\sigma} \exp{\left(-\dfrac{1}{2\sigma^2}(x - \mu_l)^2\right)}}} \hspace{1cm}(*)$

The Bayes classifier involves assigning an observation $X=x$ to the class for which (\*) is largest.

Taking the log of (\*) and rearranging the terms, it is not hard to show that this is equivalent to assigning the observation to the class for which the following equation is largest.

&emsp;&emsp;$\delta_k(x) = x \cdot \dfrac{\mu_k}{\sigma^2} - \dfrac{\mu_k^2}{2\sigma^2} + \log{(\pi_k)} \hspace{1cm}$
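
One way to sketch the rearrangement is shown below. The symbol $C(x)$ is introduced here only for illustration; it collects every term that does not depend on $k$.

&emsp;&emsp;$\begin{split}
\log{p_k(x)}
&= \log{(\pi_k)} - \dfrac{(x - \mu_k)^2}{2\sigma^2} - \log{(\sqrt{2\pi}\sigma)} - \log{\left(\sum\limits_{l=1}^K{\pi_l \dfrac{1}{\sqrt{2\pi}\sigma} \exp{\left(-\dfrac{1}{2\sigma^2}(x - \mu_l)^2\right)}}\right)} \\
&= \log{(\pi_k)} + x \cdot \dfrac{\mu_k}{\sigma^2} - \dfrac{\mu_k^2}{2\sigma^2} - \dfrac{x^2}{2\sigma^2} - \log{(\sqrt{2\pi}\sigma)} - \log{\left(\sum\limits_{l=1}^K{\pi_l \dfrac{1}{\sqrt{2\pi}\sigma} \exp{\left(-\dfrac{1}{2\sigma^2}(x - \mu_l)^2\right)}}\right)} \\
&= x \cdot \dfrac{\mu_k}{\sigma^2} - \dfrac{\mu_k^2}{2\sigma^2} + \log{(\pi_k)} + C(x)
\end{split}$

Because $C(x)$ is the same for every class, the class that maximizes $p_k(x)$ is also the class that maximizes the first three terms.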

**Example**: $K=2$ and $\pi_1 = \pi_2$
> The Bayes classifier assigns an observation to class 1 if $2x(\mu_1 - \mu_2) > \mu_1^2 - \mu_2^2$ and to class 2 otherwise.
>
> The Bayes decision boundary corresponds to the point where: $x = \dfrac{\mu_1^2 - \mu_2^2}{2(\mu_1 - \mu_2)} = \dfrac{\mu_1 + \mu_2}{2}$
>
> <img src='images/4.4.PNG' width='550px'>
>
> The mean and variance parameters for the two density functions are: $\mu_1 = -1.25$, $\mu_2 = 1.25$, and $\sigma_1^2 = \sigma_2^2 = 1$. 
> 
> If we assume that an observation is equally likely to come from either class, i.e., $\pi_1 = \pi_2 = 0.5$, then the Bayes classifier assigns an observation to class 1 if $x<0$ and to class 2 otherwise.
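
The example can be checked numerically. The sketch below assumes the parameters are known; the helper names (`delta`, `classify`, `boundary`) are hypothetical. The `boundary` formula generalizes the midpoint rule to unequal priors by solving $\delta_1(x) = \delta_2(x)$.

```python
import math

def delta(x, mu, sigma2, prior):
    """Linear discriminant for one class, with shared variance sigma2."""
    return x * mu / sigma2 - mu ** 2 / (2 * sigma2) + math.log(prior)

def classify(x, mus, sigma2, priors):
    """Return the 1-based index of the class with the largest discriminant."""
    scores = [delta(x, mu, sigma2, pi) for mu, pi in zip(mus, priors)]
    return scores.index(max(scores)) + 1

def boundary(mu1, mu2, sigma2, pi1, pi2):
    """Point where delta_1(x) = delta_2(x) when K = 2."""
    return (mu1 + mu2) / 2 + sigma2 * math.log(pi2 / pi1) / (mu1 - mu2)

mus, sigma2 = [-1.25, 1.25], 1.0
print(classify(-0.3, mus, sigma2, [0.5, 0.5]))  # x < 0, so class 1
print(boundary(-1.25, 1.25, sigma2, 0.5, 0.5))  # midpoint of the two means
print(boundary(-1.25, 1.25, sigma2, 0.8, 0.2))  # shifts toward the mean of the rarer class
```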

In reality, we are not able to calculate the Bayes classifier, because the true values of $\pi_1,...,\pi_K$, $\mu_1,...,\mu_K$, and $\sigma^2$ are unknown and have to be estimated.

The LDA method approximates the Bayes classifier by plugging estimates for $\pi_k$, $\mu_k$, and $\sigma^2$ into $\delta_k(x)$.

&emsp;&emsp;
$\begin{cases}
\hat{\mu}_k = \dfrac{1}{n_k} \sum\limits_{i:y_i=k}{x_i} \\
\hat{\sigma}^2 = \dfrac{1}{n-K} \sum\limits_{k=1}^K \sum\limits_{i:y_i=k} {(x_i - \hat{\mu}_k)^2}
\end{cases} \hspace{1cm}$ 

&emsp;where: $\bullet\hspace{0.2cm}$$n$: the total number of training observations <br>
&emsp;$\hspace{1.1cm}\bullet\hspace{0.2cm}$$n_k$: the number of training observations in the $k$th class

LDA estimates $\pi_k$ using the proportion of the training observations that belong to the $k$th class: $\hat{\pi}_k = \dfrac{n_k}{n}$. <br>
Plugging $\hat{\mu}_k$, $\hat{\sigma}^2$, and $\hat{\pi}_k$ into $\delta_k(x)$ gives:
$\hat{\delta}_k(x) = x \cdot \dfrac{\hat{\mu}_k}{\hat{\sigma}^2} - \dfrac{\hat{\mu}_k^2}{2\hat{\sigma}^2} + \log{(\hat{\pi}_k)}$
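
A minimal sketch of these plug-in estimates for a single predictor follows. The toy data and the function names (`fit_lda_1d`, `predict_lda_1d`) are made up for illustration; this is not a substitute for a library implementation.

```python
import math

def fit_lda_1d(x, y):
    """Return (class means, pooled variance, class priors) for one-predictor LDA."""
    classes = sorted(set(y))
    n, K = len(x), len(classes)
    groups = {k: [xi for xi, yi in zip(x, y) if yi == k] for k in classes}
    mu = {k: sum(g) / len(g) for k, g in groups.items()}   # class means
    prior = {k: len(g) / n for k, g in groups.items()}     # n_k / n
    # Pooled variance: within-class squared deviations divided by n - K
    sigma2 = sum((xi - mu[k]) ** 2 for k, g in groups.items() for xi in g) / (n - K)
    return mu, sigma2, prior

def predict_lda_1d(x0, mu, sigma2, prior):
    """Assign x0 to the class with the largest estimated discriminant."""
    score = {k: x0 * mu[k] / sigma2 - mu[k] ** 2 / (2 * sigma2) + math.log(prior[k])
             for k in mu}
    return max(score, key=score.get)

# Toy data (made up for illustration)
x = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
y = ["a", "a", "a", "b", "b", "b"]
mu, sigma2, prior = fit_lda_1d(x, y)
print(mu, sigma2, prior)
print(predict_lda_1d(0.4, mu, sigma2, prior), predict_lda_1d(0.6, mu, sigma2, prior))
```

With equal estimated priors, the estimated boundary is the midpoint of the two class means, here $0.5$.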

## LDA for $p>1$

Assume that $X = (X_1,X_2,...,X_p)$ is drawn from a multivariate Gaussian distribution, with a class-specific mean vector and a common covariance matrix.

The multivariate Gaussian distribution assumes that each individual predictor follows a one-dimensional normal distribution, with some correlation between each pair of predictors.

<img src='images/4.5.PNG' width='500px'>

* The height of the surface at any particular point represents the probability that both $X_1$ and $X_2$ fall in a small region around that point.
* Cross-sections obtained by cutting the surface along the $X_1$ axis or the $X_2$ axis have the shape of a one-dimensional normal distribution.
* If $X_1$ and $X_2$ are uncorrelated and have equal variances, the base of the bell is a circle. If they are correlated or have unequal variances, the base is an ellipse.

Write $X \sim N(\mu, \Sigma)$ to indicate that a $p$-dimensional random variable $X$ has a multivariate Gaussian distribution.
* $E(X) = \mu$: the mean of $X$ (a vector with $p$ components)
* $\text{Cov}(X) = \Sigma$: the $p \times p$ covariance matrix of $X$

**Multivariate Gaussian density**:
$f(x) = \dfrac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp{\left(-\dfrac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)\right)}$

With more than one predictor:

* The LDA classifier assumes that the observations in the $k$th class are drawn from a multivariate Gaussian distribution $N(\mu_k, \Sigma)$, where $\mu_k$ is a class-specific mean vector and $\Sigma$ is a covariance matrix common to all $K$ classes.
* Plugging the density for the $k$th class, $f_k(X=x)$, into Bayes' theorem $\Rightarrow$ the Bayes classifier assigns an observation $X=x$ to the class for which the following equation is largest.

    &emsp;&emsp;$\delta_k(x) = x^T \Sigma^{-1} \mu_k - \dfrac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log{\pi_k}$
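
To make the role of the shared covariance matrix concrete, here is a small sketch for $p = 2$. It uses only the standard library, so the $2 \times 2$ inverse is written out by hand. All parameter values and helper names (`inv2`, `quad`, `delta`, `classify`) are assumptions for illustration, not estimates from the Default data.

```python
import math

def inv2(S):
    """Inverse of a 2x2 matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = S
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def quad(u, M, v):
    """Compute u^T M v for 2-vectors u, v and a 2x2 matrix M."""
    return sum(u[i] * M[i][j] * v[j] for i in range(2) for j in range(2))

def delta(x, mu, S_inv, prior):
    """Multivariate linear discriminant with a shared covariance matrix."""
    return quad(x, S_inv, mu) - 0.5 * quad(mu, S_inv, mu) + math.log(prior)

def classify(x, mus, S, priors):
    """Return the 1-based index of the class with the largest discriminant."""
    S_inv = inv2(S)
    scores = [delta(x, mu, S_inv, pi) for mu, pi in zip(mus, priors)]
    return scores.index(max(scores)) + 1

mus = [[1.0, 0.0], [-1.0, 0.0]]   # assumed class means
priors = [0.5, 0.5]
x = [0.5, 3.0]
print(classify(x, mus, [[1.0, 0.0], [0.0, 1.0]], priors))  # uncorrelated predictors
print(classify(x, mus, [[1.0, 0.5], [0.5, 1.0]], priors))  # correlation 0.5
```

The same point lands in different classes under the two covariance matrices, because correlation between the predictors tilts the linear decision boundary.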

Fitting an LDA model to the 10000 training samples of the Default data set results in a *training* error rate of 2.75\%. <br>
Two caveats:
* Training error rates will usually be lower than test error rates.
    * Because we adjust the parameters of our model to do well on the training data.
    * The higher the ratio of parameters $p$ to number of samples $n$, the more we expect this *overfitting* to play a role.
* Since 3.33\% of the individuals in the training sample defaulted, a simple but useless classifier that always predicts that each individual will not default, regardless of credit card balance and student status, will result in an error rate of 3.33\%.

**Example**: A confusion matrix showing the LDA predictions on the Default data set.
<table class="tg" style="float:right">
<thead>
  <tr>
    <th class="tg-0lax" colspan="2" rowspan="2"></th>
    <th class="tg-c3ow" colspan="3">True default status</th>
  </tr>
  <tr>
    <td class="tg-baqh">No</td>
    <td class="tg-baqh">Yes</td>
    <td class="tg-baqh">Total</td>
  </tr>
</thead>
<tbody>
  <tr>
    <th class="tg-baqh" rowspan="3">Predicted<br>default<br>status</th>
    <td class="tg-c3ow">No</td>
    <td class="tg-c3ow">9644</td>
    <td class="tg-c3ow">252</td>
    <td class="tg-c3ow">9896</td>
  </tr>
  <tr>
    <td class="tg-c3ow">Yes</td>
    <td class="tg-c3ow">23</td>
    <td class="tg-c3ow">81</td>
    <td class="tg-c3ow">104</td>
  </tr>
  <tr>
    <td class="tg-c3ow">Total</td>
    <td class="tg-c3ow">9667</td>
    <td class="tg-c3ow">333</td>
    <td class="tg-c3ow">10000</td>
  </tr>
</tbody>
</table>

<blockquote style='margin-right:250px'> 
    
* LDA predicted that a total of 104 people would default.
    * 81 actually defaulted
    * 23 did not

&emsp;$\Rightarrow$ 23 out of 9667 of the individuals who did not default were incorrectly labeled.

* Out of 333 individuals who defaulted, 252 were missed by LDA.

$\Rightarrow$ while the overall error rate is low, the error rate among individuals who defaulted is very high.

From the perspective of a credit card company that is trying to identify high-risk individuals, an error rate of $252\big/333 = 75.7\%$ among individuals who default may well be unacceptable.

</blockquote>
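
The error rates quoted above follow directly from the four cells of the confusion matrix. A short sketch (the variable names are chosen for this example):

```python
# Counts from the confusion matrix above (rows: predicted, columns: true)
tn, fn = 9644, 252   # predicted "No":  truly No / truly Yes
fp, tp = 23, 81      # predicted "Yes": truly No / truly Yes
n = tn + fn + fp + tp

overall_error = (fp + fn) / n                # all misclassified observations
error_among_defaulters = fn / (fn + tp)      # defaulters that LDA missed
error_among_non_defaulters = fp / (fp + tn)  # non-defaulters flagged as defaulters
null_error = (fn + tp) / n                   # classifier that always predicts "No"

print(f"overall: {overall_error:.2%}")
print(f"among defaulters: {error_among_defaulters:.1%}")
print(f"among non-defaulters: {error_among_non_defaulters:.2%}")
print(f"always predict No: {null_error:.2%}")
```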

<br>

*Why does it have such a low sensitivity?*

* LDA is trying to approximate the Bayes classifier, which has the lowest **total** error rate out of all classifiers (if the Gaussian model is correct), i.e., the Bayes classifier will yield the smallest possible total number of misclassified observations, regardless of which class the errors come from.

A credit card company might particularly wish to avoid incorrectly classifying an individual who will default, whereas incorrectly classifying an individual who will not default is less problematic.

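* One way to address this is to lower the posterior probability threshold for predicting default from 0.5 to, say, 0.2, so that an individual is assigned to the default class if

    &emsp;&emsp;$\Pr{(\text{default} = \text{Yes} | X=x)} > 0.2$
    
    $\Rightarrow$ the error rate among individuals who default decreases from $75.7\%$ to $41.4\%$, while the overall error rate increases to $3.73\%$.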
    
<img src='images/4.7.PNG' width='450px'>
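
The trade-off can be illustrated with a handful of made-up posterior probabilities. The data and the helper `confusion_counts` are hypothetical; only the thresholding rule comes from the text.

```python
def confusion_counts(posteriors, labels, threshold):
    """Count (tn, fp, fn, tp) when predicting positive if posterior > threshold."""
    tn = fp = fn = tp = 0
    for p, truth in zip(posteriors, labels):
        predicted = p > threshold
        if predicted and truth:
            tp += 1
        elif predicted:
            fp += 1
        elif truth:
            fn += 1
        else:
            tn += 1
    return tn, fp, fn, tp

# Made-up posterior probabilities of default and true labels (1 = default)
posteriors = [0.05, 0.10, 0.15, 0.25, 0.30, 0.35, 0.45, 0.60, 0.70, 0.90]
labels = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]

for t in (0.5, 0.2):
    tn, fp, fn, tp = confusion_counts(posteriors, labels, t)
    print(f"threshold={t}: missed defaulters={fn}, false alarms={fp}")
```

Lowering the threshold reduces the number of missed defaulters at the cost of more false alarms.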

*How can we decide which threshold value is best?*

* **ROC** (or **Receiver Operating Characteristics**) curve.
    * The overall performance of a classifier, summarized over all possible thresholds, is given by the area under the ROC curve (AUC).
* An ideal ROC curve will hug the top left corner, so the larger the AUC the better the classifier. 
* ROC curves are useful for comparing different classifiers, since they take into account all possible thresholds.

For the Default data, the AUC is 0.95, which is close to the maximum of one, so it would be considered very good.
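
As a rough sketch of how an ROC curve and its AUC can be computed, the code below sweeps the threshold over a small set of made-up scores. It assumes there are no tied scores; `roc_points` and `auc` are names chosen for this example.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs obtained by sweeping the threshold over all scores."""
    P = sum(labels)
    N = len(labels) - P
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Lowering the threshold admits observations in order of decreasing score
    for _, truth in sorted(zip(scores, labels), reverse=True):
        if truth:
            tp += 1
        else:
            fp += 1
        points.append((fp / N, tp / P))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Made-up classifier scores and true labels (1 = positive)
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0, 1, 0]
print(auc(roc_points(scores, labels)))
```

A classifier that ranks every positive above every negative would reach an AUC of one, while random ranking gives about 0.5.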

<table class="tg">
<thead>
  <tr>
    <th class="tg-0pky" colspan="2" rowspan="2"></th>
    <th class="tg-c3ow" colspan="3" style="text-align:center">Predicted class</th>
  </tr>
  <tr>
    <td class="tg-c3ow" style="text-align:left">- or Null</td>
    <td class="tg-c3ow" style="text-align:left">+ or Non-null</td>
    <td class="tg-c3ow" style="text-align:left">Total</td>
  </tr>
</thead>
<tbody>
  <tr>
    <th class="tg-0pky" rowspan="3" style="text-align:left">True class</th>
    <td class="tg-c3ow" style="text-align:left">- or Null</td>
    <td class="tg-c3ow" style="text-align:left">True Neg. (TN)</td>
    <td class="tg-c3ow" style="text-align:left">False Pos. (FP)</td>
    <td class="tg-c3ow" style="text-align:center">N</td>
  </tr>
  <tr>
    <td class="tg-c3ow" style="text-align:left">+ or Non-null</td>
    <td class="tg-c3ow" style="text-align:left">False Neg. (FN)</td>
    <td class="tg-c3ow" style="text-align:left">True Pos. (TP)</td>
    <td class="tg-c3ow" style="text-align:center">P</td>
  </tr>
  <tr>
    <td class="tg-c3ow" style="text-align:left">Total</td>
    <td class="tg-c3ow" style="text-align:center">N*</td>
    <td class="tg-c3ow" style="text-align:center">P*</td>
    <td class="tg-c3ow" style="text-align:left"></td>
  </tr>
</tbody>
</table>

<table class="tg" style='text-align: left'>
<thead>
  <tr>
    <th class="tg-0lax" style="text-align:left">Name</th>
    <th class="tg-0lax" style="text-align:left">Definition</th>
    <th class="tg-0lax" style="text-align:left">Synonyms</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-0lax" style="text-align:left">False Pos. rate</td>
    <td class="tg-0lax" style="text-align:center">FP/N</td>
    <td class="tg-0lax" style="text-align:left">Type I error, 1-Specificity</td>
  </tr>
  <tr>
    <td class="tg-0lax" style="text-align:left">True Pos. rate</td>
    <td class="tg-0lax" style="text-align:center">TP/P</td>
    <td class="tg-0lax" style="text-align:left">Recall, 1-Type II error, power, sensitivity</td>
  </tr>
  <tr>
    <td class="tg-0lax" style="text-align:left">Pos. Pred. value</td>
    <td class="tg-0lax" style="text-align:center">TP/P*</td>
    <td class="tg-0lax" style="text-align:left">Precision, 1-false discovery proportion</td>
  </tr>
  <tr>
    <td class="tg-0lax" style="text-align:left">Neg. Pred. value</td>
    <td class="tg-0lax" style="text-align:center">TN/N*</td>
    <td class="tg-0lax" style="text-align:left"></td>
  </tr>
</tbody>
</table>
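
The definitions in the table translate directly into code. The sketch below applies them to the LDA confusion matrix for the Default data shown earlier (threshold 0.5); the function name and dictionary keys are chosen for this example.

```python
def classification_rates(tn, fp, fn, tp):
    """Compute the four rates from the confusion-matrix counts."""
    N, P = tn + fp, fn + tp             # totals by true class
    N_star, P_star = tn + fn, fp + tp   # totals by predicted class
    return {
        "false_positive_rate": fp / N,
        "true_positive_rate": tp / P,
        "positive_predictive_value": tp / P_star,
        "negative_predictive_value": tn / N_star,
    }

rates = classification_rates(tn=9644, fp=23, fn=252, tp=81)
for name, value in rates.items():
    print(f"{name}: {value:.4f}")
```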

## Quadratic discriminant analysis (QDA)

The QDA classifier assumes that the observations from each class are drawn from a Gaussian distribution and that each class has its own covariance matrix. That is, an observation from the $k$th class is of the form $X \sim N(\mu_k, \Sigma_k)$.

The Bayes classifier assigns an observation $X=x$ to the class for which the following equation is largest.

&emsp;&emsp;$\begin{split}
\delta_k(x)
&= \textstyle -\dfrac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) - \dfrac{1}{2} \log(|\Sigma_k|) + \log(\pi_k) \\
&= \textstyle -\dfrac{1}{2} x^T \Sigma_k^{-1} x + x^T \Sigma_k^{-1} \mu_k - \dfrac{1}{2} \mu_k^T \Sigma_k^{-1} \mu_k - \dfrac{1}{2} \log(|\Sigma_k|) + \log(\pi_k)
\end{split}$
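
For a single predictor the determinant reduces to the class variance and the inverse to its reciprocal, which makes the quadratic term easy to see. The sketch below uses assumed parameters (two classes with the same mean but different variances) and hypothetical helper names.

```python
import math

def qda_delta_1d(x, mu, sigma2, prior):
    """One-predictor QDA discriminant with a class-specific variance."""
    return -(x - mu) ** 2 / (2 * sigma2) - 0.5 * math.log(sigma2) + math.log(prior)

def qda_classify_1d(x, params):
    """params holds one (mu, sigma2, prior) tuple per class; returns a 1-based class index."""
    scores = [qda_delta_1d(x, *p) for p in params]
    return scores.index(max(scores)) + 1

# Assumed parameters: same mean, variances 1 and 9, equal priors
params = [(0.0, 1.0, 0.5), (0.0, 9.0, 0.5)]
print([qda_classify_1d(x, params) for x in (-5.0, 0.0, 5.0)])
```

Class 1 wins near the common mean and class 2 wins in both tails, so the decision region is not a single cut point. A linear rule cannot produce this pattern.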

*Why does it matter whether or not we assume that the $K$ classes share a common covariance matrix?*

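* When there are $p$ predictors, estimating a covariance matrix requires estimating $p(p+1)\big/2$ parameters. QDA estimates a separate covariance matrix for each class, for a total of $Kp(p+1)\big/2$ parameters, which quickly becomes a large number. LDA estimates only one shared matrix, so it is a much less flexible classifier with substantially lower variance.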
* If LDA’s assumption that the $K$ classes share a common covariance matrix is badly off, then LDA can suffer from high bias.
* LDA tends to be a better bet than QDA if there are relatively few training observations and so reducing variance is crucial. 
* QDA is recommended if the training set is very large, so that the variance of the classifier is not a major concern, or if the assumption of a common covariance matrix for the $K$ classes is clearly untenable.
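
To get a feel for the difference in model size, the covariance parameters can simply be counted. The values $p = 50$ and $K = 2$ below are picked for illustration.

```python
def covariance_parameters(p):
    """Free parameters in one symmetric p x p covariance matrix."""
    return p * (p + 1) // 2

p, K = 50, 2
lda_count = covariance_parameters(p)       # one shared matrix
qda_count = K * covariance_parameters(p)   # one matrix per class
print(lda_count, qda_count)
```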

# A comparison of classification methods

## LINEAR METHODS

Consider 2-class setting with $p=1$ predictor, and let $p_1(x)$ and $p_2(x) = 1 - p_1(x)$ be the probabilities that the observation $X=x$ belongs to class 1 and class 2, respectively.

* Log odds of LDA: $\log{\left(\dfrac{p_1(x)}{1 - p_1(x)}\right)} = \log{\left(\dfrac{p_1(x)}{p_2(x)}\right)} = c_0 + c_1x$

    where $c_0$ and $c_1$ are functions of $\mu_1$, $\mu_2$, and $\sigma^2$. <br><br>
    
* Log odds of logistic regression: $\log{\left(\dfrac{p_1}{1-p_1}\right)} = \beta_0 + \beta_1x$

Same: Both are linear functions of $x$.

Different: $\beta_0$ and $\beta_1$ are estimated using maximum likelihood, while $c_0$ and $c_1$ are computed using the estimated mean and variance from a normal distribution.
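
The linearity of the LDA log odds can be verified numerically. Subtracting the two one-predictor discriminants gives $c_1 = (\mu_1 - \mu_2)\big/\sigma^2$ and $c_0 = \log{(\pi_1 / \pi_2)} - (\mu_1^2 - \mu_2^2)\big/(2\sigma^2)$, so the priors enter $c_0$ as well. The sketch below uses assumed parameter values and hypothetical function names.

```python
import math

def lda_log_odds_coefficients(mu1, mu2, sigma2, pi1, pi2):
    """Intercept c0 and slope c1 of the LDA log odds for two classes."""
    c1 = (mu1 - mu2) / sigma2
    c0 = math.log(pi1 / pi2) - (mu1 ** 2 - mu2 ** 2) / (2 * sigma2)
    return c0, c1

def normal_pdf(x, mu, sigma2):
    """Density of N(mu, sigma2) evaluated at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

# Assumed parameter values
mu1, mu2, sigma2, pi1, pi2 = -1.25, 1.25, 1.0, 0.3, 0.7
c0, c1 = lda_log_odds_coefficients(mu1, mu2, sigma2, pi1, pi2)

x = 0.4
# Posterior of class 1 from Bayes' theorem, then its log odds
num1 = pi1 * normal_pdf(x, mu1, sigma2)
num2 = pi2 * normal_pdf(x, mu2, sigma2)
p1 = num1 / (num1 + num2)
log_odds = math.log(p1 / (1 - p1))
print(log_odds, c0 + c1 * x)  # the two values agree up to rounding
```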

## NON-LINEAR METHODS

KNN: 
* To make a prediction for an observation $X=x$, the $K$ training observations that are closest to $x$ are identified. Then $X$ is assigned to the class to which the plurality of these observations belong.
* No assumptions are made about the shape of the decision boundary.
* Does not tell us which predictors are important.
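
A bare-bones KNN classifier might look like the sketch below. It assumes Euclidean distance, which the text does not specify, and the training points are made up.

```python
from collections import Counter

def knn_predict(x0, X_train, y_train, k):
    """Classify x0 by plurality vote among its k nearest training points."""
    # Squared Euclidean distance preserves the ordering of neighbours
    dist = [sum((a - b) ** 2 for a, b in zip(x, x0)) for x in X_train]
    nearest = sorted(range(len(X_train)), key=dist.__getitem__)[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Made-up training data with two predictors
X_train = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (0.9, 1.2), (1.1, 0.8)]
y_train = ["blue", "blue", "orange", "orange", "orange"]
print(knn_predict((0.1, 0.1), X_train, y_train, k=3))
print(knn_predict((1.0, 0.9), X_train, y_train, k=3))
```

Nothing in the procedure fits coefficients, which is why it gives no direct information about the importance of individual predictors.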

QDA:
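* Assumes a quadratic decision boundary.
* Less flexible than KNN, but can perform better in the presence of a limited number of training observations because it makes some assumptions about the form of the decision boundary.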

## SCENARIOS

There are $p=2$ predictors.

<img src='images/4.10.PNG' width='750px'>

* **Scenario 1**: 20 training observations in each class. The observations within each class were uncorrelated random normal variables with a different mean in each class.

* **Scenario 2**: Same as scenario 1, except that within each class, the two predictors had a correlation of -0.5.

* **Scenario 3**: $X_1$ and $X_2$ are generated from the $t$-distribution with 50 observations per class. The $t$-distribution has a similar shape to the normal distribution but it has a tendency to yield more extreme points.

<img src='images/4.11.PNG' width='750px'>

* **Scenario 4**: The data were generated from a normal distribution with a correlation of 0.5 between the predictors in the first class, and correlation of -0.5 between the predictors in the second class.

* **Scenario 5**: The data were generated from a normal distribution with uncorrelated predictors. However, the responses were sampled from the logistic function using $X_1^2$, $X_2^2$, and $X_1 \times X_2$ as predictors.

* **Scenario 6**: Same as scenario 5, but the responses were sampled from a more complicated non-linear function.
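
Data resembling scenario 2 could be simulated along the lines below. This is only a sketch: the class means and the unit variances are assumptions, since the description above does not state them, and the function names are hypothetical.

```python
import random

def sample_class(n, mean, rho, rng):
    """Draw n bivariate normal points with unit variances and correlation rho."""
    points = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x1 = mean[0] + z1
        # Mixing z1 into the second coordinate induces correlation rho
        x2 = mean[1] + rho * z1 + (1 - rho ** 2) ** 0.5 * z2
        points.append((x1, x2))
    return points

def correlation(points):
    """Sample correlation between the two coordinates."""
    n = len(points)
    m1 = sum(p[0] for p in points) / n
    m2 = sum(p[1] for p in points) / n
    cov = sum((p[0] - m1) * (p[1] - m2) for p in points)
    v1 = sum((p[0] - m1) ** 2 for p in points)
    v2 = sum((p[1] - m2) ** 2 for p in points)
    return cov / (v1 * v2) ** 0.5

rng = random.Random(0)
class1 = sample_class(20, mean=(0.0, 0.0), rho=-0.5, rng=rng)  # assumed mean
class2 = sample_class(20, mean=(1.0, 1.0), rho=-0.5, rng=rng)  # assumed mean

# With a much larger sample, the empirical correlation settles near -0.5
large = sample_class(20000, mean=(0.0, 0.0), rho=-0.5, rng=rng)
print(round(correlation(large), 2))
```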