# 1. Discriminant Analysis 

- Bayes Theorem : $Pr(Y=k|X=x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^K \pi_l f_l(x)}$
- The density for $X$ in class $k$ : $f_k(x) = \frac{1}{\sqrt{2\pi}\sigma_k}e^{-\frac{(x-\mu_k)^2}{2\sigma_k^2}}$
- The prior probability for class $k$ : $\pi_k = Pr(Y=k)$
- If the distribution of $X$ is approximately normal in each class, LDA and QDA are more stable than logistic regression.
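
As a quick numeric check of the Bayes formula above, here is a minimal sketch for the one-dimensional Gaussian case. The function name `posterior_1d` and all parameter values are hypothetical and chosen only for illustration.

```R
# Posterior Pr(Y = k | X = x) for 1-D Gaussian class densities.
posterior_1d <- function(x, mu, sigma, prior) {
  num <- prior * dnorm(x, mean = mu, sd = sigma)  # pi_k * f_k(x) for every class
  num / sum(num)                                   # divide by sum_l pi_l * f_l(x)
}

mu    <- c(-1, 2)     # class means (made up)
sigma <- c(1, 1)      # class standard deviations (made up)
prior <- c(0.5, 0.5)  # prior probabilities (made up)

p_mid   <- posterior_1d(0.5, mu, sigma, prior)  # x halfway between the two means
p_right <- posterior_1d(1.5, mu, sigma, prior)  # x closer to the second mean
print(p_mid)
print(p_right)
```

With equal priors and equal variances, the two posteriors at the midpoint of the means are both 0.5, and the posterior shifts toward the nearer class as $x$ moves away from the midpoint.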

## 1.1 LDA(Linear Discriminant Analysis) 

- Assume that $\sigma_1 = \sigma_2 = \dots = \sigma_K = \sigma$
- $f_k(x)$ 
    - p = 1 : $f_k(x)$ is the density of $N(\mu_k, \sigma^2)$
    - p > 1 : $f_k(x)$ is the density of $N(\mu_k, \Sigma)$
- $p_k(x) =  \frac{\pi_k \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu_k)^2}{2\sigma^2}}}{\sum_{l=1}^K \pi_l \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu_l)^2}{2\sigma^2}}}$
- To classify at the value $X=x$, see which of the $p_k(x)$ is largest. 
- Because the denominator of $p_k(x)$ is the same for every class, we only need to compare the numerators. Taking logs and dropping the terms that do not depend on $k$ gives the discriminant score.
    - Discriminant score : 
        - p = 1 : $\delta_k(x) = x\frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k)$
        - p > 1 : $\delta_k(x) = x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log(\pi_k)$
    - $\delta_k(x)$ is a linear function of $x$. 
- When $K=2$, $p=1$ and $\pi_1 = \pi_2$, the decision boundary is at $x = \frac{\hat{\mu}_1 + \hat{\mu}_2}{2}$
- Need to estimate : $(\mu_1, ..., \mu_K), (\pi_1, ..., \pi_K), \Sigma$
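
The discriminant score for p > 1 can be computed by hand. The following is a minimal sketch on the built-in `iris` data using base R only. It assumes the priors are estimated by the class proportions and the common covariance by the pooled within-class covariance with divisor $n - K$; all variable names are illustrative.

```R
# LDA discriminant scores by hand (base R only)
X <- as.matrix(iris[, 1:4])
y <- iris$Species
n <- nrow(X)
classes <- levels(y)
K <- length(classes)

# Estimates: priors, class means (K x p), pooled covariance (p x p)
pi_hat <- as.numeric(table(y)) / n
mu_hat <- t(sapply(classes, function(k) colMeans(X[y == k, , drop = FALSE])))
Sigma_hat <- Reduce(`+`, lapply(classes, function(k) {
  Xk <- scale(X[y == k, , drop = FALSE], center = TRUE, scale = FALSE)
  crossprod(Xk)  # within-class sum of squares and cross-products
})) / (n - K)
Sigma_inv <- solve(Sigma_hat)

# Score: x' S^{-1} mu_k - 0.5 * mu_k' S^{-1} mu_k + log(pi_k)
delta <- sapply(seq_len(K), function(k) {
  m <- mu_hat[k, ]
  as.numeric(X %*% Sigma_inv %*% m) -
    0.5 * as.numeric(t(m) %*% Sigma_inv %*% m) + log(pi_hat[k])
})
colnames(delta) <- classes

# Classify each observation to the class with the largest score
manual_class <- factor(classes[apply(delta, 1, which.max)], levels = classes)
manual_error <- mean(manual_class != y)
print(manual_error)
```

Each column of `delta` is linear in the rows of `X`, so the boundary between any two classes is a hyperplane.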

## 1.2 [Ex] Iris Data 

**Training model with LDA** 

```R
# open the iris dataset 
data(iris)
str(iris)
summary(iris)
plot(iris[, -5], col=as.numeric(iris$Species) + 1)

# Apply LDA for iris data
library(MASS)
g <- lda(Species ~., data=iris)
plot(g)
plot(g, dimen=1)
```

![](Img/LDA02.png)

**Compute Misclassification Rate for training sets** 

```R
# Compute misclassification error for training sets
pred <- predict(g)
table(pred$class, iris$Species)
mean(pred$class!=iris$Species)
```

![](Img/LDA01.png)

**Calculate test error of validation set** 

```R
# Randomly separate training sets and test sets
set.seed(1234)
tran <- sample(nrow(iris), size=floor(nrow(iris)*2/3))
g <- lda(Species ~., data=iris, subset=tran)

# Compute misclassification error for test sets
pred <- predict(g, iris)$class[-tran]
test <- iris$Species[-tran]
table(pred, test)
mean(pred!=test)

# Posterior probability
post <- predict(g, iris)$posterior[-tran,]
post[1:10,]
apply(post, 1, which.max)
as.numeric(pred)
``` 
![](Img/LDA03.png)

- We can get the estimated posterior probability $\hat{p}_k(x)$ with `predict(g, iris)$posterior[-tran, ]`
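
To make the link between the posterior matrix and the predicted class explicit, here is a small sketch on the full `iris` data (not the train/test split used above). It checks that each row of the posterior sums to 1 and that the predicted class is the column with the largest posterior. The variable names are illustrative.

```R
library(MASS)
data(iris)

fit  <- lda(Species ~ ., data = iris)
out  <- predict(fit)          # list with class, posterior, x
post <- out$posterior         # n x K matrix of estimated p_k(x)

row_sums  <- rowSums(post)                               # each row should sum to 1
top_class <- colnames(post)[apply(post, 1, which.max)]   # argmax over classes
agree     <- all(top_class == as.character(out$class))

print(range(row_sums))
print(agree)
```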

**Performance comparison between LDA vs Multinomial Regression** 

```R
library(nnet)
set.seed(1234)
K <- 100
RES <- array(0, c(K, 2))
for (i in 1:K) {
    tran.num <- sample(nrow(iris), size=floor(nrow(iris)*2/3))
    tran <- as.logical(rep(0, nrow(iris)))
    tran[tran.num] <- TRUE
    g1 <- lda(Species ~., data=iris, subset=tran)
    g2 <- multinom(Species ~., data=iris, subset=tran, trace=FALSE)
    pred1 <- predict(g1, iris[!tran,])$class
    pred2 <- predict(g2, iris[!tran,])
    RES[i, 1] <- mean(pred1!=iris$Species[!tran])
    RES[i, 2] <- mean(pred2!=iris$Species[!tran])
}
apply(RES, 2, mean)

# 0.0254 0.0436
```

## 1.3 [Ex] Default Dataset 

```R
# Importing library 
library(ISLR)
data(Default)
attach(Default)
library(MASS)

# Training model 
g <- lda(default~., data=Default)
pred <- predict(g, Default)  # newdata must be the data frame, not the response vector
table(pred$class, default)
mean(pred$class!=default)
```

![](Img/LDA04.png)

- False Positive rate : The fraction of negative examples that are classified as positive (0.22%)
- False Negative rate : The fraction of positive examples that are classified as negative (76.2%)
- If we ignored the predictors and always predicted the majority class "No", we would make 333/10000 errors, an error rate of only 3.33%. 
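
The two error rates can be computed directly from predicted and true labels. Below is a minimal sketch with a hypothetical helper `error_rates` and a made-up toy label vector; it does not use the Default data or reproduce the numbers above.

```R
# Hypothetical helper: overall error, false positive rate, false negative rate
error_rates <- function(pred, truth, positive = "Yes") {
  fp <- sum(pred == positive & truth != positive)  # negatives predicted positive
  fn <- sum(pred != positive & truth == positive)  # positives predicted negative
  c(overall = mean(pred != truth),
    FPR     = fp / sum(truth != positive),
    FNR     = fn / sum(truth == positive))
}

# Toy labels, made up for illustration only
truth <- c("No", "No", "No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes")
pred  <- c("No", "No", "No", "No", "No", "Yes", "Yes", "No", "No", "No")

rates <- error_rates(pred, truth)
print(rates)
```

In this toy example the overall error is 0.4, while the false negative rate (0.75) is much higher than the false positive rate (1/6), the same kind of imbalance the Default example shows.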

# 2. Assessment of the Performance of Classifier 

## 2.1 Two types of Misclassification errors 

- We can change the two error rates by changing the threshold from 0.5 to some other value in [0, 1].
    - $\hat{Pr}(Default = Yes|Balance, Student) \ge \alpha$
    - $\alpha$ is a threshold. 
    - If $\alpha$ ↑, FN increases while FP decreases. 
    - If $\alpha$ ↓, FN decreases while FP increases.

## 2.2 [Ex] Changes in errors along with Thresholds 

```R
thresholds <- seq(0, 1, 0.01) 
res <- matrix(NA, length(thresholds), 3) 

# Compute overall error, false positive, false negatives
for (i in 1:length(thresholds)) {
    decision <- rep("No", length(default))
    decision[pred$posterior[,2] >= thresholds[i]] <- "Yes"
    res[i, 1] <- mean(decision != default)
    res[i, 2] <- mean(decision[default=="No"]=="Yes")
    res[i, 3] <- mean(decision[default=="Yes"]=="No")
}

k <- 1:51
matplot(thresholds[k], res[k,], col=c(1,"orange",4), lty=c(1,4,2), type="l", xlab="Threshold", ylab="Error Rate", lwd=2)
legend("top", c("Overall Error", "False Positive", "False Negative"), col=c(1,"orange",4), lty=c(1,4,2), cex=1.2)
apply(res, 2, which.min)
``` 

![](Img/Confusion1.png)

- The overall error seems to decrease as $\alpha$ increases, because there are only 22 FPs.  
- However, it increases slightly after the turning point. 

## 2.3 Confusion Matrix 

![](Img/Confusion2.png)

## 2.4 ROC curve 

![](Img/ROC1.png)

- Class-specific performance in medicine and biology : Sensitivity(TPR) and Specificity(TNR)
- The ROC(Receiver Operating Characteristic) curve plots TPR against FPR (= 1 - TNR) as the threshold $\alpha$ varies. 
- ($\alpha=1$, TPR=0, TNR=1) is the lower-left point, ($\alpha=0$, TPR=1, TNR=0) is the upper-right point. 
- The overall performance of a classifier : AUC(The area under the ROC curve) 
    - The larger the AUC, the better the classifier 
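
The ROC curve and its AUC can be sketched from scratch on a toy example. The helper names (`roc_points`, `auc_trapezoid`, `auc_rank`) and the scores are hypothetical; the rank-based formula is the standard Mann-Whitney form of the AUC and is included only as a cross-check.

```R
# ROC points: sweep the threshold from high to low
roc_points <- function(score, is_pos) {
  thr <- c(Inf, sort(unique(score), decreasing = TRUE))
  tpr <- sapply(thr, function(t) mean(score[is_pos] >= t))   # sensitivity
  fpr <- sapply(thr, function(t) mean(score[!is_pos] >= t))  # 1 - specificity
  data.frame(threshold = thr, FPR = fpr, TPR = tpr)
}

# Area under the ROC curve by the trapezoidal rule
auc_trapezoid <- function(roc) {
  sum(diff(roc$FPR) * (head(roc$TPR, -1) + tail(roc$TPR, -1)) / 2)
}

# Same quantity from ranks (Mann-Whitney form)
auc_rank <- function(score, is_pos) {
  r  <- rank(score)
  n1 <- sum(is_pos)
  n0 <- sum(!is_pos)
  (sum(r[is_pos]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Toy scores and labels, made up for illustration
score  <- c(0.1, 0.4, 0.35, 0.8)
is_pos <- c(FALSE, FALSE, TRUE, TRUE)

roc  <- roc_points(score, is_pos)
auc1 <- auc_trapezoid(roc)
auc2 <- auc_rank(score, is_pos)
print(roc)
print(c(auc1, auc2))
```

For this toy data both calculations give 0.75: three of the four (positive, negative) pairs are ranked correctly.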

## 2.5 [Ex] ROC curve 


**Way1 : Drawing ROC curve** 

```R
# Prerequisites
library(ISLR)
data(Default)
attach(Default)
library(MASS)

# Train model 
g <- lda(default~., data=Default)
pred <- predict(g, Default)  # newdata must be the data frame, not the response vector

# Error grids
thre <- seq(0,1,0.001)
Sen <- Spe <- NULL
RES <- matrix(NA, length(thre), 4)

# Classification metrics 
colnames(RES) <- c("TP", "TN", "FP", "FN")
for (i in 1:length(thre)) {
  decision <- rep("No", length(default))
  decision[pred$posterior[,2] >= thre[i]] <- "Yes"
  Sen[i] <- mean(decision[default=="Yes"] == "Yes")
  Spe[i] <- mean(decision[default=="No"] == "No")
  RES[i,1] <- sum(decision[default=="Yes"] == "Yes")
  RES[i,2] <- sum(decision[default=="No"] == "No")
  RES[i,3] <- sum(decision=="Yes") - RES[i,1]
  RES[i,4] <- sum(default=="Yes") - RES[i,1]
}

# Visualize ROC curve 
plot(1-Spe, Sen, type="b", pch=20, xlab="False positive rate",
     col="darkblue", ylab="True positive rate", main="ROC Curve")
abline(0, 1, lty=3, col="gray")
```

![](Img/ROC2.png)

**Way2 : Drawing ROC curve**
```R
# Way 2 : Calculating TPR, TNR
TPR <- RES[,1] / (RES[,1] + RES[,4])
TNR <- RES[,2] / (RES[,2] + RES[,3])

plot(1-TNR, TPR, type="b", pch=20, xlab="False positive rate",
col="darkblue", ylab="True positive rate", main="ROC Curve")
abline(0, 1, lty=3, col="gray")
```

## 2.6 [Ex] ROC curve with ROCR package

```R
library(ROCR)

# Compute ROC curve
label <- factor(default, levels=c("Yes","No"),
labels=c("TRUE","FALSE"))
preds <- prediction(pred$posterior[,2], label)
perf <- performance(preds, "tpr", "fpr" )

# Visualization 
plot(perf, lwd=4, col="darkblue")
abline(a=0, b=1, lty=2)
slotNames(perf)

k <- 1:100
# X - axis values : FPR 
list(perf@x.name, perf@x.values[[1]][k])
# Y - axis values : TPR 
list(perf@y.name, perf@y.values[[1]][k])
# alpha - cutoffs 
list(perf@alpha.name, perf@alpha.values[[1]][k])

# Compute AUC
performance(preds, "auc")@y.values
```

![](Img/AUROC.png)

# 3. QDA(Quadratic Discriminant Analysis) 

- QDA assumes that each class has its own covariance matrix, $X \sim N(\mu_k, \Sigma_k)$
- LDA vs QDA
    - Probability : $P(y_i=k|x)$
    - X : $N(\mu_k,\Sigma)$ vs $N(\mu_k, \Sigma_k)$ 
    - Parameters : $\mu_1, ..., \mu_K, \Sigma$ vs $\mu_1, ..., \mu_K, \Sigma_1, ..., \Sigma_K$
    - Number of parameters : $PK + \frac{P(P+1)}{2}$ vs $PK + K\frac{P(P+1)}{2}$
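
As a worked example of the two counts above, the sketch below evaluates them for a problem of the size of iris ($P = 4$ predictors, $K = 3$ classes). The function name `n_params` is hypothetical, and the counts follow the formulas above (class means plus covariance entries, priors not included).

```R
# Number of mean and covariance parameters for LDA and QDA
n_params <- function(P, K) {
  cov_one <- P * (P + 1) / 2           # free entries of one symmetric P x P matrix
  c(LDA = P * K + cov_one,             # K mean vectors + one shared covariance
    QDA = P * K + K * cov_one)         # K mean vectors + K covariances
}

counts_iris <- n_params(P = 4, K = 3)
print(counts_iris)   # LDA 22, QDA 42
print(n_params(P = 50, K = 3))
```

The gap grows quickly with $P$, which is why QDA needs more training data than LDA to be estimated reliably.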

## 3.1 LDA vs QDA 

**Classification Error Rate of LDA** 

```R
# Import library 
library(ISLR) 
data(Default) 
attach(Default) 
library(MASS) 

# Train-test split
set.seed(1234)
n <- length(default) 
train <- sample(1:n, n*0.7) 
test <- setdiff(1:n, train) 

# Classification error rate of LDA 
g1 <- lda(default~., data=Default, subset=train)
pred1 <- predict(g1, Default) 
table(pred1$class[test], Default$default[test]) 
mean(pred1$class[test]!=Default$default[test])
``` 

![](Img/QDA1.png)

**Classification Error Rate of QDA** 
```R
# Classification error rate of QDA
g2 <- qda(default~., data=Default, subset=train)
pred2 <- predict(g2, Default)
table(pred2$class[test], Default$default[test])
mean(pred2$class[test]!=Default$default[test])
```

![](Img/QDA2.png)

**AUC Score between LDA and QDA** 

```R
# AUC comparison between LDA and QDA
library(ROCR)
label <- factor(default[test], levels=c("Yes","No"),
                labels=c("TRUE","FALSE"))
preds1 <- prediction(pred1$posterior[test,2], label)
preds2 <- prediction(pred2$posterior[test,2], label)
performance(preds1, "auc")@y.values
performance(preds2, "auc")@y.values
```

![](Img/QDA3.png)

- Here, LDA performs better than QDA.

**Simulation study of LDA and QDA iterating 100 times** 

```R
# Simulation Study 
set.seed(123)
N <- 100
CER <- AUC <- matrix(NA, N, 2)
for (i in 1:N) {
  train <- sample(1:n, n*0.7)
  test <- setdiff(1:n, train)
  y.test <- Default$default[test]
  g1 <- lda(default~., data=Default, subset=train)
  g2 <- qda(default~., data=Default, subset=train)
  pred1 <- predict(g1, Default)
  pred2 <- predict(g2, Default)
  CER[i,1] <- mean(pred1$class[test]!=y.test)
  CER[i,2] <- mean(pred2$class[test]!=y.test)
  label <- factor(default[test], levels=c("Yes","No"), labels=c("TRUE","FALSE"))
  preds1 <- prediction(pred1$posterior[test,2], label)
  preds2 <- prediction(pred2$posterior[test,2], label)
  AUC[i,1] <- as.numeric(performance(preds1, "auc")@y.values)
  AUC[i,2] <- as.numeric(performance(preds2, "auc")@y.values)
}
apply(CER, 2, mean)
apply(AUC, 2, mean)
```

![](Img/QDA4.png)

# 4. Naive Bayes Method 

- Assumes that features are independent in each class. 
- Useful when p is large. 
- Gaussian naive Bayes assumes each $\Sigma_k$ is diagonal. 
- Despite strong assumptions, NB method often produces good classification results. 
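
The independence assumption means the class density is a product of one-dimensional densities. Here is a minimal sketch of Gaussian naive Bayes by hand on `iris`, using base R only. It assumes each feature is normal within each class, with mean and standard deviation estimated per class; variable names are illustrative, and this is not how `e1071::naiveBayes` is implemented internally.

```R
X <- as.matrix(iris[, 1:4])
y <- iris$Species
classes <- levels(y)

# Unnormalized log posterior: log(pi_k) + sum_j log f_kj(x_j)
log_post <- sapply(classes, function(k) {
  Xk <- X[y == k, , drop = FALSE]
  mu <- colMeans(Xk)        # per-feature mean in class k
  s  <- apply(Xk, 2, sd)    # per-feature sd in class k (diagonal covariance)
  ll <- rowSums(sapply(seq_len(ncol(X)), function(j) {
    dnorm(X[, j], mean = mu[j], sd = s[j], log = TRUE)
  }))
  ll + log(mean(y == k))    # add the log prior
})

# Classify to the class with the largest log posterior
nb_class <- factor(classes[apply(log_post, 1, which.max)], levels = classes)
nb_error <- mean(nb_class != y)
print(table(nb_class, y))
print(nb_error)
```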

## 4.1 [Ex] Naive Bayes of Iris dataset 


**Calculate Train Error**

```R
# Import library and data
data(iris)
library(e1071) 

# Train model 
# g1 <- naiveBayes(Species ~ ., data=iris) 
g1 <- naiveBayes(iris[,-5], iris[,5])
pred <- predict(g1, iris[,-5]) 
table(pred, iris[,5]) 
mean(pred!=iris$Species) 
```

![](Img/NB1.png)

**Validation Set** 

```R
# Randomly separate training sets and test sets
set.seed(1234)
tran <- sample(nrow(iris), size=floor(nrow(iris)*2/3))

# Compute misclassification error for test sets
g2 <- naiveBayes(Species ~ ., data=iris, subset=tran)
pred2 <- predict(g2, iris)[-tran]
test <- iris$Species[-tran]
table(pred2, test)
mean(pred2!=test)
```

![](Img/NB2.png)

## 4.2 [Ex] Naive Bayes of default dataset 

**Calculate Misclassification Error Rate of Test-set**

```R
# Import dataset 
data(Default)

# Train-test split 
set.seed(1234)
n <- nrow(Default)
train <- sample(1:n, n*0.7)
test <- setdiff(1:n, train)

# Train model and calculate misclassification rate 
g3 <- naiveBayes(default ~ ., data=Default, subset=train)
pred3 <- predict(g3, Default)[test]
table(pred3, Default$default[test])
mean(pred3!=Default$default[test])
```

![](Img/NB3.png)

**AUC of Naive Bayes** 

```R
# AUC of Naive Bayes 
library(ROCR)
label <- factor(Default$default[test], levels=c("Yes","No"), labels=c("TRUE","FALSE"))
pred4 <- predict(g3, Default, type="raw")
preds <- prediction(pred4[test, 2], label)
performance(preds, "auc")@y.values
```

- AUC : 0.9454898

# 5. KNN(K-Nearest Neighbors)

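- In theory, we would like to predict a qualitative response using the Bayes classifier, but the conditional distribution of $Y$ given $X$ is unknown for real data. 
- The KNN classifier estimates this conditional distribution as 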
    - $Pr(Y=j|X=x_0) = \frac{1}{K}\sum_{i \in N_0} I(y_i=j)$
    - $x_0$ : a test observation 
    - $N_0$ : a set of K points in the training data that are closest to $x_0$.
    

![](Img/KNN1.png) 

- KNN decision boundary(black) and the Bayes decision boundary(purple). 
- The choice of $K$ has a drastic effect on the KNN classifier. 
- When K=1, the decision boundary overfits(low bias + high variance). 
- When K=100, the decision boundary underfits(high bias + low variance). 
- We need to find the best $K$, the one that minimizes the test error rate. 
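
The KNN estimate of $Pr(Y=j|X=x_0)$ is simply the class proportion among the $K$ nearest training points. Below is a minimal base R sketch with a hypothetical helper `knn_prob`, Euclidean distance, and made-up toy data; the examples that follow use `knn()` from the `class` package instead.

```R
# Estimated Pr(Y = j | X = x0): class proportions among the K nearest neighbors
knn_prob <- function(train_x, train_y, x0, K) {
  d  <- sqrt(rowSums(sweep(train_x, 2, x0)^2))  # Euclidean distance to x0
  N0 <- order(d)[seq_len(K)]                    # indices of the K closest points
  prop.table(table(train_y[N0]))                # (1/K) * sum of I(y_i = j)
}

# Toy training data, made up for illustration
train_x <- matrix(c(0, 0,
                    0, 1,
                    1, 0,
                    5, 5,
                    5, 6,
                    6, 5), ncol = 2, byrow = TRUE)
train_y <- factor(c("A", "A", "A", "B", "B", "B"))
x0 <- c(0.5, 0.5)

p3 <- knn_prob(train_x, train_y, x0, K = 3)  # only the three nearby "A" points
p5 <- knn_prob(train_x, train_y, x0, K = 5)  # two far-away "B" points join in
print(p3)
print(p5)
```

With $K=3$ the estimate for class A is 1, and with $K=5$ it drops to 0.6, which shows how a larger $K$ smooths the estimate by averaging over more distant points.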

## 5.1 [Ex] Caravan Insurance Data 

**Prepare dataset** 

```R
# Importing library and data 
library(ISLR)
data(Caravan)
dim(Caravan)
str(Caravan)
attach(Caravan)

# only 6% of people purchased caravan insurance.
summary(Purchase)
mean(Purchase=="Yes")
```

**Training logistic regression** 

```R
# Logistic regression 
g0 <- glm(Purchase~., data=Caravan, family="binomial")
summary(g0)
```

**Training glmnet with CV** 

```R
library(glmnet)
y <- Purchase
x <- as.matrix(Caravan[,-86])

# glmnet with cross validation
set.seed(123)
g1.cv <- cv.glmnet(x, y, alpha=1, family="binomial")
plot(g1.cv)

# Extract the value of lambda of model g1.cv 
g1.cv$lambda.min
g1.cv$lambda.1se

# Check coefficients 
coef1 <- coef(g1.cv, s="lambda.min")
coef2 <- coef(g1.cv, s="lambda.1se")
cbind(coef1, coef2)

# Degrees of freedom (number of nonzero coefficients, excluding the intercept)
sum(coef1!=0)-1
sum(coef2!=0)-1
```

![](Img/KNN2.png)

- Degrees of freedom 
    - lambda.min = 31
    - lambda.1se = 8 
    
    
**KNN methods** 

```R
library(class)  # provides knn()

# Standardize data so that mean=0 and variance=1.
X <- scale(Caravan[,-86])
apply(Caravan[,1:5], 2, var)
apply(X[,1:5], 2, var)

# Separate training sets and test sets
test <- 1:1000
train.X <- X[-test, ]
test.X <- X[test, ]
train.Y <- Purchase[-test]
test.Y <- Purchase[test]

## Classification error rate of KNN
set.seed(1)

knn.pred <- knn(train.X, test.X, train.Y, k=1)
mean(test.Y!=knn.pred)
mean(test.Y!="No")
table(knn.pred, test.Y)

knn.pred <- knn(train.X, test.X, train.Y, k=3)
table(knn.pred, test.Y)
mean(test.Y!=knn.pred)

knn.pred <- knn(train.X, test.X, train.Y, k=5)
table(knn.pred, test.Y)
mean(test.Y!=knn.pred)

knn.pred <- knn(train.X, test.X, train.Y, k=10)
table(knn.pred, test.Y)
mean(test.Y!=knn.pred)
```

![](Img/KNN4.png)

## 5.2 [Ex] KNN with Hyperparameter Tuning of K based on Validation Set 

```R
# Import dataset 
library(ISLR)
data(Caravan)
attach(Caravan)
library(class)

# Train-Test Splitting : Train-Validation-Test set 
set.seed(1234)
n <- nrow(Caravan) 
s <- sample(rep(1:3, length=n))
tran <- s==1
valid <- s==2 
test <- s==3 

# Largest value of K to try 
K <- 100 

# Standardize the predictors, then split into train / validation / test sets 
X <- scale(Caravan[,-86])
y <- Caravan[,86]

train.X <- X[tran,]
valid.X <- X[valid,]
test.X <- X[test,]
train.y <- y[tran]
valid.y <- y[valid]
test.y <- y[test]

# Calculate misclassification error rate of validation set 
miss <- rep(0, K)
for (i in 1:K) { 
  knn.pred <- knn(train.X, valid.X, train.y, k=i)
  miss[i] <- mean(valid.y != knn.pred)
}
miss
wm <- which.min(miss)

# Calculate misclassification error rate of test set
miss_test <- knn(train.X, test.X, train.y, k=wm)
mean(test.y != miss_test)
```

- The value of K selected on the validation set is 8, and its test misclassification error rate is 0.06134021

## 5.3 [Ex] KNN Simulation Study 


```R
# Import dataset 
library(ISLR)
data(Caravan)
attach(Caravan)
library(class)

# Simulation 
library(mnormt)
set.seed(1010)
sigma <- matrix(c(1, 0.5, 0.5, 1), 2, 2)
x.tran1 <- rmnorm(100, c(0, 0.8), sigma)
x.tran2 <- rmnorm(100, c(0.8, 0), sigma)
x.test1 <- rmnorm(3430, c(0, 0.8), sigma)
x.test2 <- rmnorm(3430, c(0.8 ,0), sigma)
x.tran <- rbind(x.tran1, x.tran2)
x.test <- rbind(x.test1, x.test2)
y.tran <- factor(rep(0:1, each=100))
mn <- min(x.tran)
mx <- max(x.tran)
px1 <- seq(mn, mx, length.out=70)
px2 <- seq(mn, mx, length.out=98)
gd <- expand.grid(x=px1, y=px2)

# Training model 
g1 <- knn(x.tran, gd, y.tran, k = 1, prob=TRUE)
g2 <- knn(x.tran, gd, y.tran, k = 10, prob=TRUE)
g3 <- knn(x.tran, gd, y.tran, k = 100, prob=TRUE)

# Visualization of K=1 
par(mfrow=c(1,3))
prob1 <- attr(g1, "prob")
prob1 <- ifelse(g1=="1", prob1, 1-prob1)
pp1 <- matrix(prob1, length(px1), length(px2))
contour(px1, px2, pp1, levels=0.5, labels="", xlab="", ylab="",
        main="KNN: K=1", axes=FALSE)
points(x.tran, col=ifelse(y.tran==1, "cornflowerblue", "coral"))
co1 <- ifelse(pp1>0.5, "cornflowerblue", "coral")
points(gd, pch=".", cex=1.2, col=co1)
box()

# Visualization of K=10
prob2 <- attr(g2, "prob")
prob2 <- ifelse(g2=="1", prob2, 1-prob2)
pp2 <- matrix(prob2, length(px1), length(px2))
contour(px1, px2, pp2, levels=0.5, labels="", xlab="", ylab="",
        main="KNN: K=10", axes=FALSE)
points(x.tran, col=ifelse(y.tran==1, "cornflowerblue", "coral"))
co2 <- ifelse(pp2>0.5, "cornflowerblue", "coral")
points(gd, pch=".", cex=1.2, col=co2)
box()

# Visualization of K = 100 
prob3 <- attr(g3, "prob")
prob3 <- ifelse(g3=="1", prob3, 1-prob3)
pp3 <- matrix(prob3, length(px1), length(px2))
contour(px1, px2, pp3, levels=0.5, labels="", xlab="", ylab="",
        main="KNN: K=100", axes=FALSE)
points(x.tran, col=ifelse(y.tran==1, "cornflowerblue", "coral"))
co3 <- ifelse(pp3>0.5, "cornflowerblue", "coral")
points(gd, pch=".", cex=1.2, col=co3)
box()
```

Comparing the plots across values of K, the decision boundary overfits when K = 1. When K = 100, the decision boundary is overly simple and underfits. K = 10 gives an appropriate decision boundary. 

![](Img/KNN5.png)