# 1. Support Vector Machine 

- Find a plane that separates the classes in feature space.
- When no such plane exists, soften what we mean by "separates", and enrich and enlarge the feature space so that separation becomes possible.
- Three methods 
    - Maximum margin classifier 
    - Support vector classifier (SVC)
    - Support vector machine (SVM)

## 1.1 Hyperplane 

- A **hyperplane** in p dimensions is a flat affine subspace of dimension p-1. 
- General equation for a hyperplane : $\beta_0 + \beta_1 X_1 + ... + \beta_p X_p = 0$
    - If $X = (X_1, ..., X_p)^T$ satisfies the equation, then $X$ lies on the hyperplane.
    - If $X = (X_1, ..., X_p)^T$ does not satisfy the equation, then $X$ lies to one side of the hyperplane.

## 1.2 Separating Hyperplanes 

- If $f(X) = \beta_0 + \beta_1 X_1 + ... + \beta_p X_p$, then $f(X) > 0$ for points on one side of the hyperplane, and $f(X) < 0$ for points on the other.
- If a separating hyperplane exists, we can use it to construct a very natural classifier. 
- For a test observation $x^{*}$ : $f(x^*) = \beta_0 + \beta_1 x_1^* + ... + \beta_p x_p^*$
    - If $f(x^*) > 0$, we assign the test observation to class 1. 
    - If $f(x^*) < 0$, we assign the test observation to class 2.
- If $|f(x^*)|$ is relatively large, $x^*$ lies far from the hyperplane and the class assignment is confident.
- If $|f(x^*)|$ is relatively small, $x^*$ lies close to the hyperplane and the class assignment is less confident.
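
As a minimal sketch of this rule, the base R snippet below evaluates $f(x^*)$ for a hypothetical hyperplane and assigns a class from its sign. The coefficients `beta0` and `beta`, the helpers `f_hyperplane` and `classify_by_hyperplane`, and the test points are made up for illustration; they are not estimated from data.

```R
# Hypothetical hyperplane in p = 2 dimensions: 1 + 2*X1 - 3*X2 = 0
beta0 <- 1
beta  <- c(2, -3)

# Evaluate f(x) = beta0 + beta1*x1 + ... + betap*xp for each row of xnew
f_hyperplane <- function(xnew, beta0, beta) {
  drop(beta0 + xnew %*% beta)
}

# Assign class 1 when f(x) > 0 and class 2 otherwise
classify_by_hyperplane <- function(xnew, beta0, beta) {
  ifelse(f_hyperplane(xnew, beta0, beta) > 0, 1, 2)
}

xstar <- rbind(c(2, 1),    # f = 1 + 4 - 3   =  2
               c(0, 2),    # f = 1 + 0 - 6   = -5
               c(1, 1.1))  # f = 1 + 2 - 3.3 = -0.3
f_val <- f_hyperplane(xstar, beta0, beta)
cls   <- classify_by_hyperplane(xstar, beta0, beta)
data.frame(f = f_val, class = cls, abs_f = abs(f_val))
```

The third point has the smallest $|f(x^*)|$, so its assignment to class 2 is the least confident of the three.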

# 2. Maximal Margin Classifier 

- The **maximal margin hyperplane** is the separating hyperplane that is farthest from the training observations. 
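- Among all separating hyperplanes, find the one that makes the biggest gap, or margin, between the two classes. 
- Constrained optimization problem : 
    - $\text{maximize}_{\beta_0, \beta_1, ..., \beta_p} M$ subject to $\sum_{j=1}^p \beta_j^2 = 1$,
    - $y_i(\beta_0 + \beta_1 x_{i1} + ... + \beta_p x_{ip}) \geq M$ for all $i = 1, ..., n$
- The function svm() in the R package e1071 solves this problem efficiently. 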

<img src="Img/SVM1.png" width="600" height="400">
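
To make the margin concrete, the sketch below computes the signed distance $y_i f(x_i) / \lVert\beta\rVert$ from each training observation to a candidate separating hyperplane and takes the minimum as that hyperplane's margin. The toy data, the helper `margin`, and the two candidate hyperplanes are hypothetical; the maximal margin classifier searches over all hyperplanes rather than comparing two.

```R
# Toy separable data: class -1 on the left, class +1 on the right
x <- rbind(c(-2, 0), c(-1, 1), c(-1, -1),
           c( 1, 0), c( 2, 1), c( 2, -1))
y <- c(-1, -1, -1, 1, 1, 1)

# Margin of the hyperplane beta0 + beta'x = 0:
# the smallest signed distance y_i * f(x_i) / ||beta|| over the training set
margin <- function(x, y, beta0, beta) {
  min(y * drop(beta0 + x %*% beta)) / sqrt(sum(beta^2))
}

m_vertical <- margin(x, y, beta0 = 0,   beta = c(1, 0))  # hyperplane X1 = 0
m_shifted  <- margin(x, y, beta0 = 0.5, beta = c(1, 0))  # hyperplane X1 = -0.5
c(vertical = m_vertical, shifted = m_shifted)
```

Both hyperplanes separate the classes (their margins are positive), but the one at $X_1 = 0$ has the larger margin and would be preferred.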

## 2.1 The Non-separable Case

- The maximal margin classifier is a very natural way to perform classification. 
- However, in many cases no separating hyperplane exists, and so there is no maximal margin classifier. 
- If no separating hyperplane exists, the optimization problem has no solution with $M > 0$. 
- However, we can develop a hyperplane that almost separates the classes, using a so-called soft margin. 
- The generalization of the maximal margin classifier to the non-separable case is known as the support vector classifier. 

# 3. Support Vector Classifier 

- We need to consider a classifier based on a hyperplane that does not perfectly separate the two classes, in the interest of 
    - Greater robustness to individual observations
    - Better classification of most of the training observations
- It could be worthwhile to misclassify a few training observations in order to do a better job in classifying the remaining observations. 
- The **support vector classifier (soft margin classifier)** allows some observations to be on the incorrect side of the margin, or even on the incorrect side of the hyperplane. 
- The margin is soft because it can be violated by some of the training observations. 
- Optimization of SVC : 
    - $\text{maximize}_{\beta_0, \beta_1, ..., \beta_p, \epsilon_1, ..., \epsilon_n} M$ subject to $\sum_{j=1}^p \beta_j^2 = 1$,
    - $y_i(\beta_0 + \beta_1 x_{i1} + ... + \beta_p x_{ip}) \geq M(1-\epsilon_i)$,
    - $\epsilon_i \geq 0$, and $\sum_{i=1}^n \epsilon_i \leq C$
    - $C$ is a nonnegative tuning parameter that bounds the total amount of margin violation.

## 3.1 Margins and Slack Variables 

- $M$ is the width of the margin. 
- $\epsilon_1, ..., \epsilon_n$ are slack variables. 
    - If $\epsilon_i = 0$, the $i$th observation is on the correct side of the margin. 
    - If $\epsilon_i > 0$, the $i$th observation is on the wrong side of the margin (it violates the margin). 
    - If $\epsilon_i > 1$, the $i$th observation is on the wrong side of the hyperplane. 
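
The three cases can be checked numerically. The sketch below fixes a hypothetical unit-norm hyperplane and margin $M$ (neither is fitted), computes the smallest slack that satisfies $y_i f(x_i) \geq M(1-\epsilon_i)$, namely $\epsilon_i = \max(0, 1 - y_i f(x_i)/M)$, and labels each observation accordingly. All data values are made up for illustration.

```R
# Hypothetical unit-norm hyperplane X1 = 0 (beta0 = 0, beta = (1, 0)) with margin M = 1
beta0 <- 0
beta  <- c(1, 0)
M     <- 1

x <- rbind(c( 2.0, 0),   # class +1, outside the margin
           c( 0.5, 1),   # class +1, inside the margin
           c(-0.5, 0),   # class +1, wrong side of the hyperplane
           c(-3.0, 1))   # class -1, outside the margin
y <- c(1, 1, 1, -1)

# Smallest slack satisfying y_i * f(x_i) >= M * (1 - eps_i) with eps_i >= 0
f   <- drop(beta0 + x %*% beta)
eps <- pmax(0, 1 - y * f / M)

# eps = 0: correct side of margin; 0 < eps <= 1: margin violated; eps > 1: misclassified
position <- cut(eps, breaks = c(-Inf, 0, 1, Inf),
                labels = c("correct side of margin",
                           "wrong side of margin",
                           "wrong side of hyperplane"))
data.frame(y, f, eps, position)
sum(eps)  # total slack; the constraint requires this to be at most C
```

Here the total slack is 2, so this hyperplane and margin are feasible only if the budget $C$ is at least 2.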

## 3.2 [Ex] Support Vector Classifier 

```R
# Simple example (simulate data set)
set.seed(1)
x <- matrix(rnorm(20*2), ncol=2)
y <- c(rep(-1, 10), rep(1, 10))
x[y==1, ] <- x[y==1, ] + 1
plot(x, col=(3-y), pch=19, xlab="X1", ylab="X2")
```

```R
# Support vector classifier with cost=10
library(e1071)
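# svm() performs classification only when the response is a factor
dat <- data.frame(x, y=as.factor(y))
# cost is the penalty for margin violations: a large cost gives a narrow margin
# (it acts inversely to the budget C in the optimization problem)
svmfit <- svm(y~., data=dat, kernel="linear", cost=10, scale=FALSE)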
plot(svmfit, dat)
summary(svmfit)
```

<img src="Img/SVM2.png" width="600">

## 3.3 [Ex] Support Vector Classifier with different margins

```R
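# Support vector classifier with a smaller cost (cost=0.1), giving a wider margin
svmfit <- svm(y~., data=dat, kernel="linear", cost=0.1, scale=FALSE)
svmfit$index  # indices of the support vectors
# Recover the hyperplane: e1071 uses f(x) = sum(coefs * <sv, x>) - rho,
# so beta0 below holds rho, the negative of the intercept
beta <- drop(t(svmfit$coefs)%*%x[svmfit$index,])
beta0 <- svmfit$rho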

# Visualize results
plot(x, col=(3-y), pch=19, xlab="X1", ylab="X2")
points(x[svmfit$index, ], pch=5, cex=2)
abline(beta0 / beta[2], -beta[1] / beta[2])
abline((beta0 - 1) / beta[2], -beta[1] / beta[2], lty = 2)
abline((beta0 + 1) / beta[2], -beta[1] / beta[2], lty = 2)
```

<img src="Img/SVM3.png" width="400">

## 3.4 [Ex] Support Vector Classifier with Hyper parameter tuning

```R
# Cross validation to find the optimal tuning parameter (cost) 
# Training using tune function 
set.seed(1)
tune.out <- tune(svm, y~., data=dat, kernel="linear",
                 ranges=list(cost=c(0.001, 0.01, 0.1, 1, 5, 10, 100)))
summary(tune.out)
bestmod <- tune.out$best.model
summary(bestmod)

# Generate test set 
set.seed(4321)
xtest <- matrix(rnorm(20*2), ncol=2)
ytest <- sample(c(-1,1), 20, rep=TRUE)
xtest[ytest==1, ] <- xtest[ytest==1, ] + 1
testdat <- data.frame(xtest, y=as.factor(ytest))

# Compute the test misclassification rate for the optimal model
ypred <- predict(bestmod, testdat)
table(predict=ypred, truth=testdat$y)
mean(ypred!=testdat$y)
```

- Misclassification error rate : 0.15
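
By default `tune()` selects the cost by 10-fold cross-validation. The sketch below spells out that procedure by hand for the same simulated data, assuming the e1071 package is installed. The helper `cv_error` is hypothetical and written only for illustration; its random folds differ from those drawn inside `tune()`, so the errors need not match `summary(tune.out)` exactly.

```R
library(e1071)

# Simulated two-class data, generated the same way as in section 3.2
set.seed(1)
x <- matrix(rnorm(20 * 2), ncol = 2)
y <- c(rep(-1, 10), rep(1, 10))
x[y == 1, ] <- x[y == 1, ] + 1
dat <- data.frame(x, y = as.factor(y))

# Hypothetical helper: k-fold cross-validation error for one value of cost
cv_error <- function(dat, cost, k = 10) {
  fold <- sample(rep(seq_len(k), length.out = nrow(dat)))  # random fold labels
  errs <- sapply(seq_len(k), function(j) {
    fit  <- svm(y ~ ., data = dat[fold != j, ], kernel = "linear", cost = cost)
    pred <- predict(fit, dat[fold == j, ])
    mean(pred != dat$y[fold == j])   # error on the held-out fold
  })
  mean(errs)
}

set.seed(1)
costs  <- c(0.001, 0.01, 0.1, 1, 5, 10, 100)
cv_err <- sapply(costs, function(cc) cv_error(dat, cc))
data.frame(cost = costs, cv_error = cv_err)
costs[which.min(cv_err)]  # cost with the smallest cross-validation error
```

With only 20 observations each fold holds two points, so the estimated errors are noisy; this is one reason the later simulation study repeats the comparison many times.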

## 3.5 [Ex] Simulation study of Support Vector Classifier

```R
# Prepare package 
library(mnormt)
library(e1071)

# Function for calculating Misclassification Rate of SVC
SVC.MCR <- function(x.tran, x.test, y.tran, y.test,
                    cost=c(0.01,0.1,1,10,100)) {
  dat <- data.frame(x.tran, y=as.factor(y.tran))
  testdat <- data.frame(x.test, y=as.factor(y.test))
  MCR <- rep(0, length(cost)+1)
  for (i in 1:length(cost)) {
    svmfit <- svm(y~., data=dat, kernel="linear",
                  cost=cost[i])
    MCR[i] <- mean(predict(svmfit, testdat)!=testdat$y)
  }
  tune.out <- tune(svm, y~., data=dat, kernel="linear",
                   ranges=list(cost=cost))
  pred <- predict(tune.out$best.model, testdat)
  MCR[length(cost)+1] <- mean(pred!=testdat$y)
  MCR
}

# Simulation test for 100 replications
set.seed(123)
K <- 100
RES <- matrix(NA, K, 6)
colnames(RES) <- c("0.01", "0.1" ,"1" , "10" ,"100", "CV")
for (i in 1:K) {
  x.A <- rmnorm(100, rep(0, 2), matrix(c(1,-0.5,-0.5,1),2))
  x.B <- rmnorm(100, rep(1, 2), matrix(c(1,-0.5,-0.5,1),2))
  x.tran <- rbind(x.A[1:50, ], x.B[1:50, ])
  x.test <- rbind(x.A[-c(1:50), ], x.B[-c(1:50), ])
  y.tran <- factor(rep(0:1, each=50))
  y.test <- factor(rep(0:1, each=50))
  RES[i,] <- SVC.MCR(x.tran, x.test, y.tran, y.test)
}
apply(RES, 2, summary)
boxplot(RES, boxwex=0.5, col=2:7,
        names=c("0.01", "0.1", "1", "10", "100", "CV"),
        main="", ylab="Classification Error Rates")
```

<img src="Img/SVM4.png" width="400" height="400"> 

- The models with cost = 10 and cost = 100 achieve the lowest misclassification error rates on the test sets.

# 4. Non-linear Support Vector Classifier 

- In practice, we can be faced with non-linear class boundaries. 
- We need to consider enlarging the feature space using functions of the predictors such as quadratic and cubic terms, in order to address this non-linearity. 
- In the case of the support vector classifier, we can also enlarge the feature space using quadratic, cubic, and even higher-order polynomial functions of the predictors. 

## 4.1 Feature Expansion

- Enlarge the space of features by including transformations : $X_1^2, X_1^3, ...$. 
- For example, $\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1^2 + \beta_4 X_2^2 + \beta_5 X_1 X_2 = 0$. This leads to non-linear decision boundaries in the original space.
- The support vector classifier in the enlarged space solves the problem in the lower-dimensional space. 
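
A one-dimensional toy case illustrates the idea. In the sketch below (made-up data, base R only), no threshold on $X_1$ separates the classes, but after adding $X_1^2$ as a second feature a hyperplane that is linear in $(X_1, X_1^2)$ separates them perfectly. The hyperplane coefficients are chosen by hand, not fitted.

```R
# One-dimensional toy data: class +1 when |X1| > 1, class -1 otherwise
x1 <- c(-2, -1.5, -0.5, 0, 0.5, 1.5, 2)
y  <- ifelse(abs(x1) > 1, 1, -1)

# Try a grid of thresholds t with both orientations of the rule;
# every threshold misclassifies at least one point
thresholds <- seq(-3, 3, by = 0.25)
errors <- sapply(thresholds, function(t) {
  min(sum(sign(x1 - t) != y), sum(sign(t - x1) != y))
})
min(errors)

# After adding X1^2, the hyperplane -1 + 0*X1 + 1*X1^2 = 0 is linear in
# (X1, X1^2) and separates the classes perfectly
z <- cbind(x1, x1^2)
f <- drop(-1 + z %*% c(0, 1))
all(sign(f) == y)
```

In the original space the boundary $X_1^2 = 1$ corresponds to the two points $X_1 = \pm 1$, which is a non-linear boundary.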

## 4.2 Computational Issues for High-dimensional Polynomials

- However, high-dimensional polynomials get wild rather fast. 
- There are many possible ways to enlarge the feature space, but computations would become unmanageable for a huge number of features. 
- There is a more elegant and controlled way to introduce nonlinearities in support vector classifiers, using kernels. 
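
To see how quickly the enlarged feature space grows, the sketch below counts the monomials of degree 1 through $d$ in $p$ predictors, which is $\binom{p+d}{d} - 1$. The helper name `n_poly_features` is hypothetical.

```R
# Number of monomials of degree 1..d in p predictors (constant term excluded)
n_poly_features <- function(p, d) choose(p + d, d) - 1

p <- c(2, 10, 100)
data.frame(p = p,
           degree2 = n_poly_features(p, 2),
           degree3 = n_poly_features(p, 3))
```

For $p = 2$ and $d = 2$ this gives the 5 terms of the quadratic example in section 4.1, while $p = 100$ and $d = 3$ already require 176850 features.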

## 4.3 Kernels and Support Vector Machines 

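- The linear support vector classifier can be represented as $f(x) = \beta_0 + \sum_{i \in S} \hat{\alpha}_i \langle x, x_i \rangle$; replacing the inner product with a kernel gives $f(x) = \beta_0 + \sum_{i \in S} \hat{\alpha}_i K(x, x_i)$.
- $S$ is the support set of indices $i$ such that $\hat{\alpha}_i > 0$.
- Parameters : 
    - $\alpha_1, ..., \alpha_n$
    - $\beta_0$
    - Estimating them requires only the $\binom{n}{2}$ inner products $\langle x_i, x_{i'} \rangle$ between all pairs of training observations.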
- A generalization of the inner product of the form $K(x_i, x_{i'})$ is called a kernel.
- A kernel is a function that quantifies the similarity of two observations. 

## 4.4 Examples of Kernels 

- Standard linear kernel : $K(x_i, x_{i'}) = \sum_{j=1}^p x_{ij}x_{i'j}$
- Polynomial kernel of degree $d > 1$ : $K(x_i, x_{i'}) = (1 + \sum_{j=1}^p x_{ij}x_{i'j})^d$
    - We need to tune the parameter $d$.
- Radial kernel : $K(x_i, x_{i'}) = \exp(-\gamma \sum_{j=1}^p(x_{ij} - x_{i'j})^2)$

![](Img/NLSVM1.png)

- Left : SVM with a polynomial kernel of degree 3. 
- Right : SVM with a radial kernel. 
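
The three kernels above can be written directly in base R. This is a sketch of the formulas as stated in these notes; the function names `k_linear`, `k_poly`, and `k_radial` and the two example observations are hypothetical.

```R
# Kernel functions for two observations x and z (numeric vectors of length p)
k_linear <- function(x, z) sum(x * z)
k_poly   <- function(x, z, d = 2) (1 + sum(x * z))^d
k_radial <- function(x, z, gamma = 1) exp(-gamma * sum((x - z)^2))

xi <- c(1, 2)
xj <- c(3, -1)

k_linear(xi, xj)               # 1*3 + 2*(-1) = 1
k_poly(xi, xj, d = 3)          # (1 + 1)^3 = 8
k_radial(xi, xj, gamma = 0.5)  # exp(-0.5 * (4 + 9)) = exp(-6.5)
k_radial(xi, xi, gamma = 0.5)  # identical observations give 1
```

Note that `svm()` in e1071 parameterizes the polynomial kernel as `(gamma * u'v + coef0)^degree` with `coef0 = 0` by default, so matching the form used here would require setting `gamma = 1` and `coef0 = 1`.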

## 4.5 Computational Advantages 

- One advantage of using kernels over enlarging the feature space with functions of the original features is computational efficiency. 
- This is important because in many applications of SVMs, the enlarged feature space is so large that computations are intractable. 
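
The saving can be verified on a small case. For $p = 2$, the degree-2 polynomial kernel $(1 + \langle x, z \rangle)^2$ equals an ordinary inner product in a 6-dimensional enlarged space. The sketch below checks this numerically; the feature map `phi` is one standard choice written out for illustration, and the two observations are made up.

```R
# Explicit degree-2 feature map for p = 2 whose inner product equals (1 + x'z)^2
phi <- function(x) {
  c(1, sqrt(2) * x[1], sqrt(2) * x[2], x[1]^2, x[2]^2, sqrt(2) * x[1] * x[2])
}

x <- c(1, 2)
z <- c(3, -1)

explicit <- sum(phi(x) * phi(z))   # inner product in the 6-dimensional space
kernel   <- (1 + sum(x * z))^2     # same value from one 2-dimensional inner product
c(explicit = explicit, kernel = kernel)
```

Both computations give 4, but the kernel version never constructs the enlarged features, which is what keeps the computation manageable when the enlarged space is very large.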

# 5. Non-linear Support Vector Classifier 

## 5.1 [Ex] Non-linear SVMs with radial kernel 

```R
# Simulate non-linear data set
library(e1071)
set.seed(1)
x <- matrix(rnorm(200*2), ncol=2)
x[1:100, ] <- x[1:100, ] + 2
x[101:150, ] <- x[101:150, ] - 2
y <- c(rep(-1, 150), rep(1, 50))
dat <- data.frame(x, y=as.factor(y))
plot(x, col=y+3, pch=19)

# Train an SVM with a radial kernel (gamma=0.5, cost=0.1)
fit <- svm(y~.,data=dat, kernel="radial", gamma=0.5, cost=0.1)
plot(fit, dat)
summary(fit)

# Train an SVM with a radial kernel (gamma=0.5, cost=5)
fit <- svm(y~.,data=dat, kernel="radial", gamma=0.5, cost=5)
plot(fit, dat)
summary(fit)

# Visualize the fit on a test grid for the radial kernel (gamma=0.5, cost=1)
fit <- svm(y~.,data=dat, kernel="radial", gamma=0.5, cost=1)
px1 <- seq(round(min(x[,1]),1), round(max(x[,1]),1), 0.1)
px2 <- seq(round(min(x[,2]),1), round(max(x[,2]),1), 0.1)
xgrid <- expand.grid(X1=px1, X2=px2)
ygrid <- as.numeric(predict(fit, xgrid))
ygrid[ygrid==1] <- -1
ygrid[ygrid==2] <- 1
plot(xgrid, col=ygrid+3, pch = 20, cex = .2)
points(x, col = y+3, pch = 19)
# Draw the decision boundary as the zero contour of the decision values
pred <- predict(fit, xgrid, decision.values=TRUE)
func <- attr(pred, "decision.values")
contour(px1, px2, matrix(func, length(px1), length(px2)),
        levels=0, col="purple", lwd=2, lty=2, add=TRUE)
```

<img src="Img/NLSVM2.png">

## 5.2 [Ex] Optimizing non-linear SVMs(radial kernel) using validation set 

```R
# Calculate the misclassification error on a validation set 
# Separate training and test sets 
set.seed(1234)
tran <- sample(200, 100)
test <- setdiff(1:200, tran)

# Training with hyperparameter tuning of gamma, C
gamma <- c(0.5, 1, 5, 10)
cost <- c(0.01, 1, 10, 100)
R <- NULL
for (i in 1:length(gamma)) {
  for (j in 1:length(cost)) {
    svmfit <- svm(y~., data=dat[tran, ], kernel="radial",
                  gamma=gamma[i] , cost=cost[j])
    pred <- predict(svmfit, dat[test, ])
    R0 <- c(gamma[i], cost[j], mean(pred!=dat[test, "y"]))
    R <- rbind(R, R0)
  }
}

# Check results 
colnames(R) <- c("gamma", "cost", "error")
rownames(R) <- seq(dim(R)[1])
R

# Training with hyperparameter tuning of gamma, C using tune function 
set.seed(1)
tune.out <- tune(svm, y~., data=dat[tran, ], kernel="radial",
                 ranges=list(gamma=gamma, cost=cost))
summary(tune.out)
tune.out$best.parameters

# Calculate the misclassification error rate on the test set
pred <- predict(tune.out$best.model, dat[test,])
table(pred=pred, true=dat[test, "y"])
mean(pred!=dat[test, "y"])
```

- Best parameters : gamma = 0.5, cost = 1
- Misclassification error rate : 0.09

## 5.3 [Ex] Optimizing non-linear SVMs(polynomial) using validation set 

```R
degree <- c(1, 2, 3, 4)
R <- NULL

for (i in 1:length(degree)) {
  for (j in 1:length(cost)) {
    svmfit <- svm(y~., data=dat[tran, ], kernel="polynomial",
                  degree=degree[i] , cost=cost[j])
    pred <- predict(svmfit, dat[test, ])
    R0 <- c(degree[i], cost[j], mean(pred!=dat[test, "y"]))
    R <- rbind(R, R0)
  }
}
colnames(R) <- c("degree", "cost", "error")
rownames(R) <- seq(dim(R)[1])
R


tune.out <- tune(svm, y~., data=dat[tran, ], kernel="polynomial",
                 ranges=list(degree=degree, cost=cost))
summary(tune.out)
tune.out$best.parameters

pred <- predict(tune.out$best.model, dat[test,])
table(pred=pred, true=dat[test, "y"])
mean(pred!=dat[test, "y"])
```

- Best parameters : degree = 2, cost = 10
- Misclassification error rate : 0.16

## 5.4 [Ex] Optimizing non-linear SVMs(sigmoid) using validation set 

```R
R <- NULL
for (i in 1:length(gamma)) {
  for (j in 1:length(cost)) {
    svmfit <- svm(y~., data=dat[tran, ], kernel="sigmoid",
                  gamma=gamma[i] , cost=cost[j])
    pred <- predict(svmfit, dat[test, ])
    R0 <- c(gamma[i], cost[j], mean(pred!=dat[test, "y"]))
    R <- rbind(R, R0)
  }
}

colnames(R) <- c("gamma", "cost", "error")
rownames(R) <- seq(dim(R)[1])
R

tune.out <- tune(svm, y~., data=dat[tran, ], kernel="sigmoid",
                        ranges=list(gamma=gamma, cost=cost))
summary(tune.out)
tune.out$best.parameters

pred <- predict(tune.out$best.model, dat[test,])
table(pred=pred, true=dat[test, "y"])
mean(pred!=dat[test, "y"])
```

- Best parameters : gamma = 0.5, cost = 0.01
- Misclassification error rate : 0.29

## 5.5 [Ex] Simulation study comparing kernels over 20 replications 

```R
# Set reps and RES matrix 
set.seed(123)
N <- 20
RES <- matrix(0, N, 3)
colnames(RES) <- c("radial", "poly", "sigmoid")

# Tune each kernel on a random training half and record the test misclassification error rate 
for (i in 1:N) {
  tran <- sample(200, 100)
  test <- setdiff(1:200, tran)
  tune1 <- tune(svm, y~., data=dat[tran, ], kernel="radial",
                ranges=list(gamma=gamma, cost=cost))
  pred1 <- predict(tune1$best.model, dat[test,])
  RES[i, 1] <- mean(pred1!=dat[test, "y"])
  tune2 <- tune(svm, y~., data=dat[tran, ], kernel="polynomial",
                ranges=list(degree=degree, cost=cost))
  pred2 <- predict(tune2$best.model, dat[test,])
  RES[i, 2] <- mean(pred2!=dat[test, "y"])
  tune3 <- tune(svm, y~., data=dat[tran, ], kernel="sigmoid",
                ranges=list(gamma=gamma, cost=cost))
  pred3 <- predict(tune3$best.model, dat[test,])
  RES[i, 3] <- mean(pred3!=dat[test, "y"])
}
# Check statistical reports 
apply(RES, 2, summary)
```

|Statistics|radial|poly|sigmoid|
|:---:|:---:|:---:|:---:|
|Min.|0.07|0.14|0.17|
|1st Qu.|0.1|0.1675|0.2375|
|Median|0.12|0.1950|0.26|
|Mean|0.123|0.209|0.2705|
|3rd Qu.|0.1425|0.235|0.29|
|Max.|0.18|0.31|0.43|

# 6. Non-linear Support Vector Classifier on Heart.csv

### Step 1 : Prepare Heart Dataset

```R
# Prepare Heart dataset 
url.ht <- "https://www.statlearning.com/s/Heart.csv"
Heart <- read.csv(url.ht, header=TRUE)
summary(Heart)
Heart <- Heart[, colnames(Heart)!="X"]
Heart[,"Sex"] <- factor(Heart[,"Sex"], 0:1, c("female", "male"))
Heart[,"Fbs"] <- factor(Heart[,"Fbs"], 0:1, c("false", "true"))
Heart[,"ExAng"] <- factor(Heart[,"ExAng"], 0:1, c("no", "yes"))
Heart[,"ChestPain"] <- as.factor(Heart[,"ChestPain"])
Heart[,"Thal"] <- as.factor(Heart[,"Thal"])
Heart[,"AHD"] <- as.factor(Heart[,"AHD"])
summary(Heart)
dim(Heart)
sum(is.na(Heart))
Heart <- na.omit(Heart)
dim(Heart)
summary(Heart)
```

### Step 2 : Separate training and test sets 

```R
# Separate training and test sets 
set.seed(123)
train <- sample(1:nrow(Heart), nrow(Heart)/2)
test <- setdiff(1:nrow(Heart), train)
```

### Step 3 : Training using SVMs 

```R
# SVM with a linear kernel
tune.out <- tune(svm, AHD~., data=Heart[train, ],
                 kernel="linear", ranges=list(
                   cost=c(0.001, 0.01, 0.1, 1, 5, 10, 100)))
heart.pred <- predict(tune.out$best.model, Heart[test,])
table(heart.pred, Heart$AHD[test])
mean(heart.pred!=Heart$AHD[test])

# SVM with a radial kernel
tune.out <- tune(svm, AHD~., data=Heart[train, ],
                 kernel="radial", ranges=list(
                   cost=c(0.1,1,10,100), gamma=c(0.5,1,2,3)))
heart.pred <- predict(tune.out$best.model, Heart[test,])
table(heart.pred, Heart$AHD[test])
mean(heart.pred!=Heart$AHD[test])

# SVM with a polynomial kernel
tune.out <- tune(svm, AHD~.,data=Heart[train, ],
                 kernel="polynomial", ranges=list(
                   cost=c(0.1,1,10,100), degree=c(1,2,3)))
heart.pred <- predict(tune.out$best.model, Heart[test,])
table(heart.pred, Heart$AHD[test])
mean(heart.pred!=Heart$AHD[test])

# SVM with a sigmoid kernel
tune.out <- tune(svm, AHD~.,data=Heart[train, ],
                 kernel="sigmoid", ranges=list(
                   cost=c(0.1,1,10,100), gamma=c(0.5,1,2,3)))
heart.pred <- predict(tune.out$best.model, Heart[test,])
table(heart.pred, Heart$AHD[test])
mean(heart.pred!=Heart$AHD[test])
```

|Kernel|Misclassification error rate|
|:---:|:---:|
|linear|0.2214765|
|radial|0.2080537|
|polynomial|0.2080537|
|sigmoid|0.2147651|

### Step 4 : Simulation study comparing random forest and SVM kernels over 20 replications 

```R
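# Compare a random forest with SVMs over 20 random train/test splits
library(randomForest)
library(e1071)
set.seed(123)
N <- 20
Err <- matrix(0, N, 5)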

for (i in 1:N) {
  train <- sample(1:nrow(Heart), floor(nrow(Heart)*2/3))
  test <- setdiff(1:nrow(Heart), train)
  # Random forest baseline; row 500 of err.rate is the test error after the default 500 trees
  g1 <- randomForest(x=Heart[train,-14], y=Heart[train,14],
                     xtest=Heart[test,-14], ytest=Heart[test,14], mtry=4)
  Err[i,1] <- g1$test$err.rate[500,1]
  g2 <- tune(svm, AHD~., data=Heart[train, ], kernel="linear",
             ranges=list(cost=c(0.001, 0.01, 0.1, 1, 5, 10, 100)))
  p2 <- predict(g2$best.model, Heart[test,])
  Err[i,2] <- mean(p2!=Heart$AHD[test])
  g3 <- tune(svm, AHD~., data=Heart[train, ], kernel="radial",
             ranges=list(cost=c(0.1,1,10,100), gamma=c(0.5,1,2,3)))
  p3 <- predict(g3$best.model, Heart[test,])
  Err[i,3] <- mean(p3!=Heart$AHD[test])
  
  g4 <- tune(svm, AHD~.,data=Heart[train, ],kernel="polynomial",
             ranges=list(cost=c(0.1,1,10,100), degree=c(1,2,3)))
  p4 <- predict(g4$best.model, Heart[test,])
  Err[i,4] <- mean(p4!=Heart$AHD[test])
  g5 <- tune(svm, AHD~.,data=Heart[train, ],kernel="sigmoid",
             ranges=list(cost=c(0.1,1,10,100), gamma=c(0.5,1,2,3)))
  p5 <- predict(g5$best.model, Heart[test,])
  Err[i,5] <- mean(p5!=Heart$AHD[test])
}

# Visualize results 
labels <- c("RF","SVM.linear","SVM.radial","SVM.poly","SVM.sig")
boxplot(Err, boxwex=0.5, main="Random Forest and SVM", col=2:6,
        names=labels, ylab="Classification Error Rates",
        ylim=c(0,0.4))
colnames(Err) <- labels
apply(Err, 2, summary)
```

<img src="Img/NLSVM3.png" width="400">