# Introduction to Statistical Learning - Chapter 3

- [3. Classification](#3.-Classification)
    * [3.1. Logistic Regression](#3.1.-Logistic-Regression)
        + [3.1.1. The Logistic Model](#3.1.1.-The-Logistic-Model)
        + [3.1.2. Estimating the Regression Coefficients](#3.1.2.-Estimating-the-Regression-Coefficients)
        + [3.1.3. Making Predictions](#3.1.3.-Making-Predictions)
        + [3.1.4. Multiple Logistic Regression](#3.1.4.-Multiple-Logistic-Regression)
    * [3.2. Linear Discriminant Analysis](#3.2.-Linear-Discriminant-Analysis)
        + [3.2.1. Using Bayes' Theorem for Classification](#3.2.1.-Using-Bayes'-Theorem-for-Classification)
        + [3.2.2. Linear Discriminant Analysis (LDA) for p = 1](#3.2.2.-Linear-Discriminant-Analysis-(LDA)-for-p-=-1)
        + [3.2.3 Linear Discriminant Analysis for p >1](#3.2.3-Linear-Discriminant-Analysis-for-p->1)
        + [3.2.4 Quadratic Discriminant Analysis (QDA)](#3.2.4-Quadratic-Discriminant-Analysis-(QDA))
    * [3.3. Comparison of Classification Methods](#3.3.-Comparison-of-Classification-Methods)

# 3. Classification

- A situation where the response variable is `qualitative` instead of quantitative
    * Techniques used to predict a qualitative response include:
        + Logistic regression
        + Linear discriminant analysis
        + K-nearest neighbours
    * Other more computer-intensive methods include:
        + Generalized additive models
        + Trees, random forests, and boosting
        + Support vector machines
- Forcing a qualitative response into a regression model by coding its categories as numbers wrongly assumes that the categories are ordered and that the difference between adjacent categories is the same
- Two types of qualitative data:
    * Ordinal
        + Variables with a specific order (Low, Medium, High)
    * Nominal
        + Variables with no specific order (Book, Television, Car)

## 3.1. Logistic Regression

- Logistic regression models the probability that the response $Y$ belongs to a particular category
    * Values will range between 0 and 1
    
$$ p(X) = Pr(Y = 1|X) $$

### 3.1.1. The Logistic Model

- To model $p(X)$ so that it gives outputs between 0 and 1 for all values of $X$, we can use the logistic function

$$ p(X) = \frac{e^{\beta_{0} + \beta_{1}X}}{1 + e^{\beta_{0} + \beta_{1}X}} $$

Rearranging gives $\frac{p(X)}{1-p(X)} = e^{\beta_{0} + \beta_{1}X}$, where $\frac{p(X)}{1-p(X)}$ is the `odds` and can take on any value between 0 and $\infty$

- Values of the odds close to 0 and $\infty$ indicate *very low* and *very high* probabilities of the response belonging to a particular category, respectively
- By taking the logarithm of both sides, we get the log-odds or logit

$$ log(\frac{p(X)}{1-p(X)}) = \beta_{0} + \beta_{1}X $$

where increasing $X$ by one unit changes the log odds by $\beta_{1}$.

- However, $\beta_{1}$ does not correspond to the change in $p(X)$ associated with a one-unit increase in $X$: because $p(X)$ is not linear in $X$, that change depends on the current value of $X$
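
The relationship between probability, odds, and log-odds can be checked numerically. The following is a minimal sketch; the coefficient values are hypothetical and chosen only for illustration. It shows that a one-unit increase in $X$ always shifts the log-odds by $\beta_{1}$, while the change in $p(X)$ depends on where $X$ starts.

```python
import math

def logistic(x, beta0, beta1):
    """p(X) for a single predictor under the logistic model."""
    z = beta0 + beta1 * x
    return 1.0 / (1.0 + math.exp(-z))  # same as e^z / (1 + e^z)

def odds(p):
    """Odds p / (1 - p), ranging over (0, infinity)."""
    return p / (1.0 - p)

def log_odds(p):
    """Logit: the logarithm of the odds."""
    return math.log(odds(p))

# Hypothetical coefficients, not taken from any fitted model.
beta0, beta1 = -2.0, 0.5

for x in (0.0, 1.0, 4.0, 5.0):
    p = logistic(x, beta0, beta1)
    print(f"x={x:.0f}  p={p:.4f}  odds={odds(p):.4f}  log-odds={log_odds(p):.4f}")

# Log-odds change by beta1 per unit of x, wherever we start ...
shift_low = log_odds(logistic(1.0, beta0, beta1)) - log_odds(logistic(0.0, beta0, beta1))
shift_high = log_odds(logistic(5.0, beta0, beta1)) - log_odds(logistic(4.0, beta0, beta1))
# ... but the change in p(X) itself is not constant.
dp_low = logistic(1.0, beta0, beta1) - logistic(0.0, beta0, beta1)
dp_high = logistic(5.0, beta0, beta1) - logistic(4.0, beta0, beta1)
```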

### 3.1.2. Estimating the Regression Coefficients

- Maximum likelihood is used to fit a logistic regression model
    * Seeks estimates for $\beta_{0}$ and $\beta_{1}$ such that the predicted probability $\hat{p}(x_{i})$ corresponds as closely as possible to the observed response
        + A one-unit increase in the predictor $X$ is associated with a change in the log odds of the response ($Y=1$) of $\hat{\beta}_{1}$
- To measure the accuracy of the coefficient estimates, the standard error can be computed
    * The *z-statistic* associated with $\beta_{1}$ is equal to $\hat{\beta}_{1}/SE(\hat{\beta}_{1})$
        + A large absolute value of the *z-statistic* (a small p-value, e.g. p < 0.05) indicates evidence against the null hypothesis $H_{0}: \beta_{1} = 0$
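
As a rough illustration of maximum likelihood fitting, the sketch below fits a one-predictor logistic model with Newton-Raphson and derives standard errors from the inverse of the information matrix. The optimizer choice, the function names, and the toy data are all assumptions made for this example; statistical software may use a different but equivalent procedure.

```python
import math

def _information(xs, b0, b1):
    """Fitted probabilities and the 2x2 information matrix entries."""
    ps = [1.0 / (1.0 + math.exp(-(b0 + b1 * x))) for x in xs]
    ws = [p * (1.0 - p) for p in ps]
    h00 = sum(ws)
    h01 = sum(w * x for w, x in zip(ws, xs))
    h11 = sum(w * x * x for w, x in zip(ws, xs))
    return ps, h00, h01, h11

def fit_logistic(xs, ys, n_iter=50, tol=1e-10):
    """Maximize the log-likelihood of p(X) = e^(b0+b1 X) / (1 + e^(b0+b1 X))."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        ps, h00, h01, h11 = _information(xs, b0, b1)
        # Gradient of the log-likelihood with respect to (b0, b1).
        g0 = sum(y - p for y, p in zip(ys, ps))
        g1 = sum((y - p) * x for x, y, p in zip(xs, ys, ps))
        # Newton step: solve the 2x2 linear system by hand.
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    # Standard errors: square roots of the diagonal of the inverse information.
    _, h00, h01, h11 = _information(xs, b0, b1)
    det = h00 * h11 - h01 * h01
    se0 = math.sqrt(h11 / det)
    se1 = math.sqrt(h00 / det)
    return b0, b1, se0, se1

# Toy data (made up): the classes overlap, so the estimates are finite.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
ys = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

b0, b1, se0, se1 = fit_logistic(xs, ys)
z1 = b1 / se1  # z-statistic for the null hypothesis beta1 = 0
print(f"beta0_hat={b0:.3f}  beta1_hat={b1:.3f}  SE(beta1_hat)={se1:.3f}  z={z1:.3f}")
```

With so few observations the z-statistic here is only illustrative; the normal approximation behind the p-value is more reliable for larger samples.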

### 3.1.3. Making Predictions

- Qualitative predictors can be used with the logistic regression model using the dummy variable approach
    * We can assign a dummy variable $X$ that takes on a value of 1 or 0

**For a value of 1 ($X = 1$):**

$$ p(X) = \frac{e^{\beta_{0} + \beta_{1}}}{1 + e^{\beta_{0} + \beta_{1}}} $$

**For a value of 0 ($X = 0$):**

$$ p(X) = \frac{e^{\beta_{0}}}{1 + e^{\beta_{0}}} $$
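
The two cases can be evaluated directly once coefficient estimates are available. This is a small sketch with hypothetical coefficient values; the function name is also just an illustrative choice.

```python
import math

def predict_prob_dummy(dummy, beta0, beta1):
    """Predicted probability when the only predictor is a 0/1 dummy variable."""
    z = beta0 + beta1 * dummy
    return math.exp(z) / (1.0 + math.exp(z))

# Hypothetical coefficient estimates, for illustration only.
beta0, beta1 = -3.5, 0.4

p_one = predict_prob_dummy(1, beta0, beta1)   # uses beta0 + beta1
p_zero = predict_prob_dummy(0, beta0, beta1)  # uses beta0 alone
print(f"p(dummy=1)={p_one:.4f}  p(dummy=0)={p_zero:.4f}")
```

Because $\beta_{1}$ is positive in this example, the group coded as 1 gets the higher predicted probability.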

### 3.1.4. Multiple Logistic Regression

$$ log(\frac{p(X)}{1-p(X)}) = \beta_{0} + \beta_{1}X_{1} + ... + \beta_{p}X_{p} $$

where $X = (X_{1}, ... , X_{p})$ are $p$ predictors. This can also be rewritten as

$$ p(X) = \frac{e^{\beta_{0} + \beta_{1}X_{1} + ... + \beta_{p}X_{p}}}{1 + e^{\beta_{0} + \beta_{1}X_{1} + ... + \beta_{p}X_{p}}} $$

- As predictors can be correlated, a regression involving a single predictor can lead to drastically different conclusions than a regression involving several predictors
    * This is the effect of confounding variables

## 3.2. Linear Discriminant Analysis

- Model the distribution of the predictors X separately in each of the response classes (**Y**)
    * Use Bayes' theorem to flip these around into estimates for $Pr(Y=k|X=x)$
- Benefits of LDA:
    * When classes are well-separated, parameter estimates for the logistic regression model are surprisingly unstable
        + LDA does not suffer from this problem
    * When n is small and the distribution of the predictors X is approximately normal in each of the classes, LDA is more stable than logistic regression
- LDA is the more popular choice when there are more than two response classes

### 3.2.1. Using Bayes' Theorem for Classification

$$ Pr(Y=k|X=x) = \frac{\pi_{k}f_{k}(x)}{\sum^{K}_{l=1}\pi_{l}f_{l}(x)} $$

where  
$\pi_{k}$ represents the overall or prior probability that a randomly chosen observation comes from the kth class;  
$f_{k}(x) = Pr(X=x|Y=k)$ denotes the density function of X for an observation that comes from the kth class;  
$p_{k}(X) = Pr(Y= k|X)$ is the posterior probability that an observation belongs to the kth class given its predictor value
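
Bayes' theorem here is just a normalization of prior times density. The following sketch computes the posterior for each class at a single point $x$; the priors and density values are made-up numbers for illustration.

```python
def posterior(priors, densities):
    """Posterior Pr(Y=k | X=x) from priors pi_k and densities f_k(x)."""
    joint = [pi * f for pi, f in zip(priors, densities)]
    total = sum(joint)  # the denominator: sum over l of pi_l * f_l(x)
    return [j / total for j in joint]

# Hypothetical values for K = 3 classes at one point x.
priors = [0.5, 0.3, 0.2]      # pi_k, must sum to 1
densities = [0.1, 0.4, 0.2]   # f_k(x) evaluated at the same x

post = posterior(priors, densities)
predicted_class = max(range(len(post)), key=lambda k: post[k])
print([round(p, 3) for p in post], "-> class index", predicted_class)
```

Note that the class with the largest prior is not necessarily the one with the largest posterior; the density at $x$ matters as well.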

### 3.2.2. Linear Discriminant Analysis (LDA) for p = 1

$$ f_{k}(x) = \frac{1}{\sqrt{2\pi}\sigma_{k}}exp\left(-\frac{1}{2\sigma_{k}^{2}}(x-\mu_{k})^{2}\right)$$

where $\mu_{k}$ and $\sigma_{k}^{2}$ are the mean and variance parameters for the kth class

Assuming a shared variance term $\sigma^{2}$ across all **K** classes, the posterior probability can be calculated as

$$ p_{k}(x) = \dfrac{\pi_{k}\frac{1}{\sqrt{2\pi}\sigma}exp\left(-\frac{1}{2\sigma^{2}}(x-\mu_{k})^{2}\right)}{\sum^{K}_{l=1}\pi_{l}\frac{1}{\sqrt{2\pi}\sigma}exp\left(-\frac{1}{2\sigma^{2}}(x-\mu_{l})^{2}\right)} $$ 

- The LDA classifier assumes that the observations within each class come from a normal distribution with a class-specific mean and a common variance $\sigma^{2}$, and plugs estimates for these parameters into the Bayes classifier
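
A one-predictor LDA classifier can be sketched in a few lines: estimate the priors, class means, and a shared variance, then plug them into the posterior formula above. The pooled variance estimate (dividing by $n - K$) and the toy data are assumptions of this example.

```python
import math

def fit_lda_1d(xs, ys):
    """Estimate priors, class means, and a pooled (shared) variance."""
    classes = sorted(set(ys))
    n = len(xs)
    priors, means = {}, {}
    ss = 0.0  # within-class sum of squares, accumulated over classes
    for k in classes:
        xk = [x for x, y in zip(xs, ys) if y == k]
        priors[k] = len(xk) / n
        means[k] = sum(xk) / len(xk)
        ss += sum((x - means[k]) ** 2 for x in xk)
    var = ss / (n - len(classes))
    return priors, means, var

def lda_posterior(x, priors, means, var):
    """p_k(x): prior times normal density, normalized over the classes."""
    joint = {
        k: priors[k]
        * math.exp(-((x - means[k]) ** 2) / (2.0 * var))
        / math.sqrt(2.0 * math.pi * var)
        for k in priors
    }
    total = sum(joint.values())
    return {k: v / total for k, v in joint.items()}

# Toy data (made up): two classes with equal sizes.
xs = [1.0, 1.5, 2.0, 2.5, 4.0, 4.5, 5.0, 5.5]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

priors, means, var = fit_lda_1d(xs, ys)
post_mid = lda_posterior(3.25, priors, means, var)  # midpoint of the two means
post_low = lda_posterior(2.0, priors, means, var)
print(means, round(var, 4), post_low)
```

With equal priors and a shared variance, the decision boundary sits at the midpoint of the two class means, which is why the posterior at 3.25 is split evenly.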

### 3.2.3 Linear Discriminant Analysis for p >1

- Assume that X is drawn from a multivariate Gaussian distribution, with a class-specific mean vector and a common covariance matrix
- To indicate that a $p$-dimensional random variable X has a multivariate Gaussian distribution, we write
    * $X \sim N(\mu,\Sigma)$
        + $E(X) = \mu$ is the mean of **X** (a vector with p components)
        + $Cov(X) = \Sigma$ is the $p \times p$ covariance matrix of **X**

$$ f(x) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}exp\left(-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\right) $$

- The Bayes decision boundaries then divide the predictor space into regions, each assigned to one class
- To determine the types of error made in a classification problem, several metrics can be evaluated
    * Confusion matrix 
    * Sensitivity
    * Specificity
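
These metrics are simple counts over true and predicted labels. The sketch below computes them for a binary problem where class 1 is treated as the positive class; the label vectors are made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn

def sensitivity(tp, fn):
    """Fraction of true positives that are correctly identified."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Fraction of true negatives that are correctly identified."""
    return tn / (tn + fp)

# Made-up labels: 4 positives, 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
error_rate = (fp + fn) / len(y_true)
print(f"TP={tp} FP={fp} TN={tn} FN={fn}  sens={sens:.3f}  spec={spec:.3f}  err={error_rate:.3f}")
```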

**Why LDA might not be as good**
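- LDA tries to approximate the Bayes classifier
    * The Bayes classifier has the lowest total error rate out of all classifiers (if the Gaussian model is correct)
- LDA targets the total error rate and does not distinguish which class the errors come from
    * This can result in low sensitivity
        + Lowering the posterior probability threshold (below the default of 0.5 for two classes) raises sensitivity, but the trade-off is typically a higher overall error rate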
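
The threshold trade-off can be seen on a small example. In this sketch the posterior probabilities are invented numbers standing in for classifier output; lowering the threshold from 0.5 to 0.2 catches every positive but misclassifies more negatives.

```python
def evaluate_threshold(probs, labels, threshold):
    """Return (sensitivity, overall error rate) when predicting 1 if prob > threshold."""
    preds = [1 if p > threshold else 0 for p in probs]
    tp = sum(1 for y, yhat in zip(labels, preds) if y == 1 and yhat == 1)
    fn = sum(1 for y, yhat in zip(labels, preds) if y == 1 and yhat == 0)
    errors = sum(1 for y, yhat in zip(labels, preds) if y != yhat)
    return tp / (tp + fn), errors / len(labels)

# Invented posterior probabilities of class 1 for 12 observations.
probs = [0.9, 0.6, 0.4, 0.3, 0.45, 0.35, 0.25, 0.1, 0.1, 0.05, 0.05, 0.02]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

sens_half, err_half = evaluate_threshold(probs, labels, 0.5)
sens_low, err_low = evaluate_threshold(probs, labels, 0.2)
print(f"threshold 0.5: sensitivity={sens_half:.2f}, error={err_half:.3f}")
print(f"threshold 0.2: sensitivity={sens_low:.2f}, error={err_low:.3f}")
```

Which threshold is preferable depends on the relative cost of the two kinds of error, which is a domain decision rather than something the classifier determines.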

### 3.2.4 Quadratic Discriminant Analysis (QDA)

- Quadratic Discriminant Analysis is an alternative to LDA
- Assumes each class has its own covariance matrix, where $\Sigma_{k}$ is the covariance matrix for the kth class
    * The Bayes classifier assigns an observation to the class for which the posterior probability is the largest
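
To make the class-specific covariance idea concrete, the sketch below evaluates the multivariate Gaussian density for $p = 2$ with a hand-written 2x2 inverse, then classifies a point by the largest posterior. The means, covariance matrices, and priors are hypothetical values that would normally be estimated from training data.

```python
import math

def gaussian_density_2d(x, mu, cov):
    """Bivariate normal density; cov is a 2x2 matrix [[a, b], [c, d]]."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx0, dx1 = x[0] - mu[0], x[1] - mu[1]
    # Quadratic form (x - mu)^T cov^{-1} (x - mu), using the 2x2 inverse.
    quad = (d * dx0 * dx0 - (b + c) * dx0 * dx1 + a * dx1 * dx1) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))

def qda_posterior(x, priors, means, covs):
    """Posterior for each class, with a separate covariance matrix per class."""
    joint = [pi * gaussian_density_2d(x, mu, cov)
             for pi, mu, cov in zip(priors, means, covs)]
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical parameters for K = 2 classes.
priors = [0.5, 0.5]
means = [(0.0, 0.0), (3.0, 3.0)]
covs = [
    [[1.0, 0.0], [0.0, 1.0]],   # class 0: identity covariance
    [[2.0, 0.5], [0.5, 1.0]],   # class 1: its own covariance
]

post_origin = qda_posterior((0.0, 0.0), priors, means, covs)
post_far = qda_posterior((3.0, 3.0), priors, means, covs)
print([round(p, 4) for p in post_origin], [round(p, 4) for p in post_far])
```

Setting every entry of `covs` to the same matrix would reduce this to the LDA assumption of a common covariance.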

**Bias-variance trade-off between LDA vs QDA**
- QDA:
    * By estimating a separate covariance matrix for each class, we are estimating $Kp(p+1)/2$ covariance parameters
    * Better than LDA when the training set is large or the assumption of a common covariance matrix is clearly untenable, so that the decision boundary is non-linear
- LDA:
    * Assuming a common covariance matrix makes the model linear in $x$
    * LDA is a much less flexible classifier than QDA and has substantially lower variance
    * Better when there are few training observations
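
The difference in flexibility shows up in the number of covariance parameters each method estimates. A short sketch, with illustrative function names:

```python
def covariance_params(p):
    """Free parameters in one symmetric p x p covariance matrix."""
    return p * (p + 1) // 2

def lda_covariance_params(p):
    """LDA: a single covariance matrix shared by all classes."""
    return covariance_params(p)

def qda_covariance_params(p, K):
    """QDA: one covariance matrix per class."""
    return K * covariance_params(p)

for p in (2, 10, 50):
    print(f"p={p:>2}  LDA={lda_covariance_params(p):>5}  QDA (K=3)={qda_covariance_params(p, 3):>5}")
```

The count grows quadratically in $p$ for both methods, but QDA multiplies it by the number of classes, which is why it needs more training observations to keep variance under control.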

## 3.3. Comparison of Classification Methods

**LDA and Logistic Regression**
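- LDA and logistic regression both produce linear decision boundaries: in both, the log odds are a linear function of $x$
- LDA estimates its coefficients using the estimated mean and variance from a normal distribution, while logistic regression estimates them by maximum likelihood
- Whether the Gaussian assumption holds is key in determining which method will outperform: LDA tends to do better when it approximately holds, logistic regression when it does not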

**QDA and KNN**
- QDA serves as a compromise between the non-parametric KNN method and the linear LDA and logistic regression approaches
    * Assumes a quadratic decision boundary
        + Good when the decision boundary is moderately non-linear
- KNN is superior when the decision boundary is highly non-linear
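
For comparison, a minimal K-nearest-neighbours classifier is sketched below on a made-up data set whose classes sit in opposite corners, so no single straight line separates them. The Euclidean distance, the majority vote, and the choice of `k=3` are assumptions of this example.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    # Sort training points by Euclidean distance to x.
    dists = sorted((math.dist(xi, x), yi) for xi, yi in zip(train_X, train_y))
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Made-up XOR-like data: class 0 near (0,0) and (1,1), class 1 near (0,1) and (1,0).
train_X = [
    (0.0, 0.0), (0.1, 0.1), (0.0, 0.1),
    (1.0, 1.0), (0.9, 0.9), (1.0, 0.9),
    (0.0, 1.0), (0.1, 0.9), (0.0, 0.9),
    (1.0, 0.0), (0.9, 0.1), (1.0, 0.1),
]
train_y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

pred_a = knn_predict(train_X, train_y, (0.05, 0.05))
pred_b = knn_predict(train_X, train_y, (0.95, 0.05))
pred_c = knn_predict(train_X, train_y, (0.95, 0.95))
print(pred_a, pred_b, pred_c)
```

Because KNN makes no assumption about the shape of the decision boundary, it can follow this pattern, at the cost of needing enough nearby training points and giving no coefficient estimates to interpret.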