# Statistical Learning

- **Statistical Learning** is a set of tools for modeling and understanding complex datasets. 
- **Supervised Statistical Learning** builds a statistical model for predicting or estimating an output based on one or more inputs.
- **Unsupervised Statistical Learning** learns relationships and structure from data that has inputs but no supervising output. 

## Supervised Learning Problem 

- Outcome measurement $Y$ (dependent variable, response, target)
- Vector of $p$ predictor measurements $X$ (independent variables, inputs, regressors, features)
- Regression problem : $Y$ is quantitative. 
- Classification problem : $Y$ takes values in a finite, unordered set. 

## Statistical Learning vs Machine Learning

- **Machine Learning** has a greater emphasis on large scale applications and prediction accuracy. 
- **Statistical Learning** emphasizes models, interpretability, precision and uncertainty. 

## [Ex] Advertising Data

<img src = "Img/Advertising01.png"/>

```R
## Open the dataset linked to the book website
url.ad <- "https://www.statlearning.com/s/Advertising.csv"
Advertising <- read.csv(url.ad, header=TRUE)
attach(Advertising)

## Least squares fit for simple linear regression
par(mfrow = c(1,3))
plot(sales~TV, col=2, xlab="TV", ylab="Sales")
abline(lm(sales~TV)$coef, lwd=3, col="darkblue")

plot(sales~radio, col=2, xlab="Radio", ylab="Sales")
abline(lm(sales~radio)$coef, lwd=3, col="darkblue")

plot(sales~newspaper, col=2, xlab="Newspaper", ylab="Sales")
abline(lm(sales~newspaper)$coef, lwd = 3, col="darkblue")
```

- Sales is the response, or target, that we wish to predict. 
- TV is a feature (input, predictor), which we can name $X_1$. 

# Supervised Learning

## Model  

- Ideal model : $Y = f(X) + \epsilon$
- With a good $f$, we can make predictions of $Y$ at new points $X = x$. 
- Statistical Learning refers to a set of approaches for estimating the function $f(X)$.
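
As a minimal sketch of the model above, the following simulation draws data from a known $f$ plus noise and then estimates $f$. The function `f_true`, the noise level, and the sample size are assumptions chosen purely for illustration; with real data, $f$ is unknown.

```R
set.seed(1)

## Hypothetical true function (an assumption for illustration only)
f_true <- function(x) 2 + 3 * x

n   <- 200
x   <- runif(n, 0, 10)              # inputs X
eps <- rnorm(n, mean = 0, sd = 1)   # noise term, independent of X
y   <- f_true(x) + eps              # Y = f(X) + epsilon

## Estimate f with a linear model and predict Y at a new point X = 5
fit   <- lm(y ~ x)
y_hat <- predict(fit, newdata = data.frame(x = 5))
coef(fit)
y_hat
```

Because the simulated $f$ is linear, the fitted coefficients should land close to the assumed intercept and slope.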

```R
## Drop the first column (the row index) 
AD <- Advertising[, -1] 

## Multiple linear regression 
lm.fit <- lm(sales ~., AD) 
summary(lm.fit)
names(lm.fit) 
coef(lm.fit)
confint(lm.fit) 

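## Visualizing models: the four standard diagnostic plots 
par(mfrow=c(2,2))
plot(lm.fit) 

dev.off()                                   # Close the device to reset the layout
plot(predict(lm.fit), residuals(lm.fit))    # Residuals vs fitted values
plot(predict(lm.fit), rstudent(lm.fit))     # Studentized residuals vs fitted values
plot(hatvalues(lm.fit))                     # Leverage of each observation
which.max(hatvalues(lm.fit))                # Observation with the highest leverage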
```

## Estimation of $f$ for Prediction

- $\hat{Y} = \hat{f}(X)$
- $\hat{f}$ : Estimate of $f$. 
- $\hat{Y}$ : Prediction for $Y$. 
- The ideal function is the regression function $f(x) = E(Y|X=x)$.
- Reducible error : $E[(f(x) - \hat{f}(x))^2]$
- Irreducible error : $\epsilon = Y - f(x)$, whose variance $Var(\epsilon)$ cannot be reduced by any choice of $\hat{f}$
- <font color = 'red'>Statistical learning techniques estimate $f$ by minimizing the reducible error.</font>
- <font color = 'red'>Statistical learning is a way of finding the $\hat{f}$ that is closest to $f$.</font>
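
The split into reducible and irreducible error can be written out explicitly. The derivation below is a sketch that treats $\hat{f}$ and $x$ as fixed and assumes $E(\epsilon) = 0$ (an assumption not stated in the list above):

$$
\begin{aligned}
E[(Y - \hat{Y})^2] &= E[(f(x) + \epsilon - \hat{f}(x))^2] \\
&= [f(x) - \hat{f}(x)]^2 + 2[f(x) - \hat{f}(x)]E(\epsilon) + E(\epsilon^2) \\
&= \underbrace{[f(x) - \hat{f}(x)]^2}_{\text{reducible}} + \underbrace{Var(\epsilon)}_{\text{irreducible}}
\end{aligned}
$$

The cross term vanishes because $E(\epsilon) = 0$, and $E(\epsilon^2) = Var(\epsilon)$ for the same reason.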

## [Ex] Income 

![](Img/Income1.png)

```R
## Load Datasets
url.in <- "https://www.statlearning.com/s/Income1.csv"
Income <- read.csv(url.in, header=TRUE)

## Polynomial regression fit 
par(mfrow = c(1,2)) 
plot(Income~Education, col=2, pch=19, xlab="Years of Education", 
     ylab="Income", data=Income) 

g <- lm(Income ~ poly(Education, 3), data=Income) 
plot(Income~Education, col=2, pch=19, xlab="Years of Education", 
     ylab="Income", data=Income)
lines(Income$Education, fitted(g), col="darkblue", lwd=4)

## Training MSE computed in two equivalent ways
y <- Income$Income
mean((predict(g) - y)^2) 
mean(residuals(g)^2)
```

![](Img/Income2.png)
```R 
dist <- numeric(12)   # Training MSE for each polynomial degree
par(mfrow=c(3,4)) 
for (k in 1:12) { 
    g <- lm(Income ~ poly(Education, k), data=Income) 
    dist[k] <- mean(residuals(g)^2)
    plot(Income~Education, col=2, pch=19, xlab="Years of Education", ylab="Income",
         data=Income, main=paste("k =", k)) 
    lines(Income$Education, fitted(g), col="darkblue", lwd=3)
}
```

![](Img/Income3.png)

```R
dev.new()   # Open a new graphics device (portable alternative to x11())
plot(dist, type="b", xlab="Degree of Polynomial", 
     ylab="Mean squared distance")
```

# Parametric and Non-Parametric Methods 

- **Parametric methods** : make an assumption about the functional form or shape of $f$. 
- **Non-parametric methods** : do not make explicit assumptions about the functional form of $f$.
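
The contrast can be sketched on simulated data. The sine-shaped signal, the noise level, and the choice `df = 8` below are assumptions for illustration; `lm()` stands in for a parametric method and `smooth.spline()` for a non-parametric one.

```R
set.seed(2)

## Simulated data from a hypothetical nonlinear function (assumption)
x <- sort(runif(100, 0, 10))
y <- sin(x) + rnorm(100, sd = 0.3)

## Parametric: assume f is linear, so only two coefficients are estimated
fit_param <- lm(y ~ x)

## Non-parametric: a smoothing spline assumes no explicit form for f
fit_nonpar <- smooth.spline(x, y, df = 8)

## Training MSE of each fit
mse_param  <- mean((y - fitted(fit_param))^2)
mse_nonpar <- mean((y - predict(fit_nonpar, x)$y)^2)
c(parametric = mse_param, nonparametric = mse_nonpar)
```

Here the linear assumption is wrong by construction, so the spline tracks the data more closely. When the assumed form is roughly right, the parametric fit needs far fewer observations to estimate well.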

# Flexibility and Interpretability 

- **Flexibility** increases as we increase the **degrees of freedom (df)**.
- **Less flexible (restrictive)** models are much more **interpretable**. 
- When only prediction matters, a more flexible model is often preferable, although a highly flexible model can overfit. 

![](Img/Flexibility_01.png)

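<center> In other words, flexibility and interpretability trade off against each other, so we choose a model according to how much we value interpreting its parameters versus its predictive performance. </center>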

# Assessing Model Accuracy

- **Quantitative** : MSE (mean squared error) 
- **Qualitative** : Classification error rate
- Types of data sets
    - **Training set** : to fit statistical learning models 
    - **Validation set** : to select the optimal tuning parameter
    - **Test set** : to assess the performance of the selected model 

## MSE 

- Suppose we fit a model $\hat{f}(x)$ on a training data set $(x_i, y_i)$.
- $MSE_{train} = \frac{1}{n_1}\sum_{i \in {train}}[y_i - \hat{f}(x_i)]^2$, where $n_1$ is the number of training observations
- $MSE_{test} = \frac{1}{n_2}\sum_{i \in {test}}[y_i - \hat{f}(x_i)]^2$, where $n_2$ is the number of test observations
- The best $\hat{f}(x)$ is the model that minimizes $MSE_{test}$.
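
The two formulas can be sketched in a few lines of R. The quadratic signal, the half-and-half split, and the helper name `mse` are assumptions introduced for this illustration.

```R
set.seed(3)

## Simulated data (assumption): quadratic signal plus noise
n <- 200
x <- runif(n, -2, 2)
y <- 1 + x^2 + rnorm(n, sd = 0.5)
dat <- data.frame(x = x, y = y)

## Random split into n1 training and n2 test observations
train_id <- sample(n, n / 2)
train <- dat[train_id, ]
test  <- dat[-train_id, ]

## Hypothetical helper: mean squared error of a fitted model on a data set
mse <- function(fit, data) mean((data$y - predict(fit, newdata = data))^2)

fit_lin  <- lm(y ~ x, data = train)
fit_quad <- lm(y ~ poly(x, 2), data = train)

mse_tab <- rbind(
  linear    = c(train = mse(fit_lin, train),  test = mse(fit_lin, test)),
  quadratic = c(train = mse(fit_quad, train), test = mse(fit_quad, test))
)
mse_tab
```

Both models are fitted on the training rows only; the test column is what the selection rule above compares.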


## [Ex] Cubic Model MSE

![](Img/MSE01.png)

- red curve : $MSE_{test}$
- grey curve : $MSE_{train}$

```R
set.seed(12345)
## Simulate x and y based on a known function
fun1 <- function(x) -(x-100)*(x-30)*(x+15)/13^4+6
x <- runif(50,0,100)
y <- fun1(x) + rnorm(50)

## Plot linear regression and splines
par(mfrow=c(1,2))
plot(x, y, xlab="X", ylab="Y", ylim=c(1,13))
plot(x, y, xlab="X", ylab="Y", ylim=c(1,13))
lines(sort(x), fun1(sort(x)), col=1, lwd=2)
abline(lm(y~x)$coef, col="orange", lwd=2)
lines(smooth.spline(x,y, df=5), col="blue", lwd=2)
lines(smooth.spline(x,y, df=23), col="green", lwd=2)
legend("topleft", lty=1, col=c(1, "orange", "blue", "green"),
       legend=c("True", "df = 1", "df = 5", "df = 23"), lwd=2)


set.seed(45678)
## Simulate training and test data (x, y)
tran.x <- runif(50,0,100)
test.x <- runif(50,0,100)
tran.y <- fun1(tran.x) + rnorm(50)
test.y <- fun1(test.x) + rnorm(50)

## Compute MSE along with different df
df <- 2:40
MSE <- matrix(0, length(df), 2)
for (i in seq_along(df)) {
    tran.fit <- smooth.spline(tran.x, tran.y, df=df[i])
    MSE[i,1] <- mean((tran.y - predict(tran.fit, tran.x)$y)^2)   # Training MSE
    MSE[i,2] <- mean((test.y - predict(tran.fit, test.x)$y)^2)   # Test MSE
}

## Plot both test and training errors
matplot(df, MSE, type="l", col=c("gray", "red"),
        xlab="Flexibility", ylab="Mean Squared Error",
        lwd=2, lty=1, ylim=c(0,4))
abline(h=1, lty=2)   # Var(epsilon) = 1, the irreducible error
legend("top", lty=1, col=c("red", "gray"), lwd=2,
       legend=c("Test MSE", "Training MSE"))
abline(v=df[which.min(MSE[,1])], lty=3, col="gray")
abline(v=df[which.min(MSE[,2])], lty=3, col="red")
```

# Bias Variance Trade-off 

- $E(y_0 - \hat{f}(x_0))^2 = Var(\hat{f}(x_0)) + [Bias(\hat{f}(x_0))]^2 + Var(\epsilon)$
- $Bias(\hat{f}(x_0)) = E[\hat{f}(x_0)] - f(x_0)$

![](Img/TradeOff_01.png)

- As the **flexibility** of $\hat{f}$ increases, **variance** increases and **bias** decreases.
- As the **flexibility** of $\hat{f}$ decreases, **variance** decreases and **bias** increases.
- The best performance of a statistical learning method : <font color='red'>Low Variance + Low Bias</font>

![](Img/TradeOff_02.png)

- For the best performance of a statistical learning method, we need to choose the model that minimizes $MSE_{test}$.
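
The trade-off can be checked numerically with a small simulation. Everything specific here is an assumption for illustration: the true function `f_true`, the evaluation point `x0`, the noise level, and the polynomial degrees compared. The sketch refits each model on many fresh training sets and estimates the squared bias and variance of $\hat{f}(x_0)$.

```R
set.seed(4)

## Hypothetical true function and evaluation point (assumptions)
f_true <- function(x) sin(2 * x)
x0     <- 1
n_sim  <- 500   # number of simulated training sets
n      <- 50    # size of each training set
sigma  <- 1     # sd of epsilon

## Fit a polynomial of the given degree on fresh training sets,
## recording the prediction at x0 each time
pred_at_x0 <- function(degree) {
  replicate(n_sim, {
    x <- runif(n, 0, 3)
    y <- f_true(x) + rnorm(n, sd = sigma)
    fit <- lm(y ~ poly(x, degree))
    predict(fit, newdata = data.frame(x = x0))
  })
}

## Estimated squared bias and variance of f-hat(x0) for one degree
summarize <- function(degree) {
  p <- pred_at_x0(degree)
  c(degree   = degree,
    bias_sq  = (mean(p) - f_true(x0))^2,
    variance = var(p))
}

bv <- t(sapply(c(1, 3, 8), summarize))
bv
```

Under these assumptions the degree-1 fit should show the largest squared bias and the degree-8 fit the largest variance, in line with the pattern described above.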

# Cross Validation

- In practice, a separate test set for computing $MSE_{test}$ is usually not available. 
- Instead, we split the available training data into a training part and a held-out part. 
- **Test-set Error Estimation**
    - **Mathematical Adjustment** : adjust the training error, using the $C_p$ statistic, $AIC$, or $BIC$.
    - **Hold out** : hold out a subset of the training set and evaluate on it. 
        - Validation set approach
        - K-fold Cross Validation
        - LOOCV (leave-one-out), LpOCV (leave-p-out)
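
For the mathematical-adjustment route, base R provides `AIC()` and `BIC()` in the `stats` package. The sketch below compares three polynomial fits; the built-in `mtcars` data and the `mpg ~ hp` model are stand-ins chosen for illustration, not part of the examples in these notes.

```R
## Built-in mtcars data is used only as a convenient stand-in (assumption)
fits <- list(
  deg1 = lm(mpg ~ poly(hp, 1), data = mtcars),
  deg2 = lm(mpg ~ poly(hp, 2), data = mtcars),
  deg3 = lm(mpg ~ poly(hp, 3), data = mtcars)
)

## Smaller AIC or BIC indicates a better penalized fit
ic <- data.frame(
  AIC = sapply(fits, AIC),
  BIC = sapply(fits, BIC)
)
ic
```

Both criteria penalize the number of parameters, so they can be computed from a single fit on the full training data without holding anything out.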

## Validation Set Approach

- Divide the data set into two parts : <font color='red'>Training set + Validation set</font>
- Regression problem : MSE 
- Classification problem : Misclassification Rate 
- <font color='red'>The validation set must not take part in training the statistical model.</font>

## [Ex] Validation Set Approach 

![](Img/ValidationSetApproach01.png) 

```R
# Dataset Preparation 
library(ISLR) 
data(Auto) 
str(Auto) 
summary(Auto) 

# Extract target 
mpg <- Auto$mpg
horsepower <- Auto$horsepower

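# Polynomial degrees to try 
dg <- 1:9
u <- order(horsepower)   # Sort order of horsepower, for drawing fitted curves 

# Fit and plot each polynomial degree on the full data 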
par(mfrow=c(3,3))
for (k in 1:length(dg)) {
    g <- lm(mpg ~ poly(horsepower, dg[k]))
    plot(mpg~horsepower, col=2, pch=20, xlab="Horsepower",
    ylab="mpg", main=paste("dg =", dg[k]))
    lines(horsepower[u], g$fit[u], col="darkblue", lwd=3)
}
```

![](Img/ValidationSetApproach02.png)

```R
# Single Split 
set.seed(1)
n <- nrow(Auto)

## training set
tran <- sample(n, n/2)
MSE <- NULL
for (k in 1:length(dg)) {
    g <- lm(mpg ~ poly(horsepower, dg[k]), subset=tran)
    MSE[k] <- mean((mpg - predict(g, Auto))[-tran]^2)
}

# Plot the validation-set MSE for each degree
plot(dg, MSE, type="b", col=2, xlab="Degree of Polynomial",
     ylab="Mean Squared Error", ylim=c(15,30), lwd=2, pch=19)
abline(v=dg[which.min(MSE)], lty=2)
```

## K-fold Cross Validation

- K-fold cross-validation divides the data into $K$ roughly equal-sized parts. We leave out part $k$, fit the model to the other $K - 1$ parts, and then obtain predictions for the left-out $k$th part. This is repeated for $k = 1, \dots, K$. 
- If we evaluate 10 models with 5-fold CV, we fit $5 \times 10 = 50$ models and obtain 50 fold-level scores. 
- For each model, we average its $K$ fold scores and compare these averages across models. 
- $CVE = \frac{1}{n}\sum_{k=1}^{K}(n_k MSE_k)$, where $n_k$ is the number of observations in part $k$ and $MSE_k$ is the MSE on that part 
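
The weighted formula can be sketched directly. The built-in `mtcars` data, the `mpg ~ hp` model, and `K = 5` are assumptions chosen only to keep the example self-contained.

```R
set.seed(5)

## Built-in mtcars data as a stand-in (assumption); K = 5 folds
K    <- 5
n    <- nrow(mtcars)
fold <- sample(rep(seq_len(K), length.out = n))   # random fold label per row

n_k    <- numeric(K)   # fold sizes
mse_k  <- numeric(K)   # per-fold MSE
sq_err <- numeric(n)   # squared error of each held-out observation

for (k in seq_len(K)) {
  test <- which(fold == k)
  fit  <- lm(mpg ~ hp, data = mtcars[-test, ])   # fit on the other K - 1 folds
  err  <- mtcars$mpg[test] - predict(fit, newdata = mtcars[test, ])
  n_k[k]       <- length(test)
  mse_k[k]     <- mean(err^2)
  sq_err[test] <- err^2
}

## CVE = (1/n) * sum_k n_k * MSE_k
cve <- sum(n_k * mse_k) / n
cve
```

Weighting each $MSE_k$ by $n_k$ makes the CVE identical to the plain average of all $n$ held-out squared errors, which is convenient when the folds are not exactly the same size.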

## [Ex] K-fold Cross Validation

```R
# 10-fold cross validation
K <- 10 
MSE <- matrix(0, n, length(dg)) # one row per observation, one column per degree 1:9

# Randomly assign each data point to one of the K folds 
# e.g. [1, 3, 3, 5, 6, ..., 10] (length n) 
set.seed(1234) 
u <- sample(rep(seq(K), length.out=n)) 

# Model training 
# Each held-out fold gives one MSE per polynomial degree (f1, ..., f9): 
#        f1  f2  f3  f4  f5  ...  f9 
# MSE1
# MSE2
# ...
# MSE10 
for (k in 1:K) {
    tran <- which(u!=k) 
    test <- which(u==k) 
    for (i in 1:length(dg)) { 
        g <- lm(mpg ~ poly(horsepower, dg[i]), subset=tran) 
        MSE[test, i] <- (mpg - predict(g, Auto))[test]^2 
    } 
}
# Averaging the squared errors over all n points equals (1/n) * sum(n_k * MSE_k)
CVE <- apply(MSE, 2, mean) 

# Visualization
plot(dg, CVE, type="b", col="darkblue",
xlab="Degree of Polynomial", ylab="Mean Squared Error",
ylim=c(18,25), lwd=2, pch=19)
abline(v=which.min(CVE), lty=2)
```

![](Img/Kfold01.png) 

- The most appropriate polynomial degree is 2 when the goal is to generalize to the population.
- This is the elbow point of the CVE plot.

## LOOCV

- Setting $K=n$ yields leave-one-out cross validation (LOOCV): each observation is held out once, and the model is fit to the remaining $n - 1$ observations. 
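
For least-squares linear models, LOOCV does not actually require $n$ refits: the leave-one-out residual equals the ordinary residual divided by $1 - h_i$, where $h_i$ is the leverage returned by `hatvalues()`. The sketch below checks this identity against a brute-force loop; the built-in `mtcars` data and the quadratic `mpg ~ hp` model are assumptions used only for illustration.

```R
## Built-in mtcars data as a stand-in (assumption)
fit <- lm(mpg ~ poly(hp, 2), data = mtcars)

## Shortcut: rescale each ordinary residual by 1 - h_i (its leverage)
loocv_shortcut <- mean((residuals(fit) / (1 - hatvalues(fit)))^2)

## Brute force: refit n times, leaving out one observation each time
n <- nrow(mtcars)
sq_err <- numeric(n)
for (i in seq_len(n)) {
  fit_i <- lm(mpg ~ poly(hp, 2), data = mtcars[-i, ])
  sq_err[i] <- (mtcars$mpg[i] - predict(fit_i, newdata = mtcars[i, ]))^2
}
loocv_brute <- mean(sq_err)

c(shortcut = loocv_shortcut, brute_force = loocv_brute)
```

The shortcut applies to models fitted by `lm()`; for other learning methods the explicit loop is still needed.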

## [Ex] LOOCV

```R
# Auto Data : LOOCV
# Set the polynomial degrees and the result matrix 
n <- nrow(Auto)
dg <- 1:9
MSE <- matrix(0, n, length(dg))

for (i in 1:n) {
    for (k in 1:length(dg)) {
        g <- lm(mpg ~ poly(horsepower, k), subset=(1:n)[-i])
        MSE[i, k] <- mean((mpg - predict(g, Auto))[i]^2)
    }
}
# Calculate CVE 
aMSE <- apply(MSE, 2, mean)

# Visualization
par(mfrow=c(1, 2))
plot(dg, aMSE, type="b", col="darkblue",
     xlab="Degree of Polynomial", ylab="Mean Squared Error",
     ylim=c(18,25), lwd=2, pch=19)
abline(v=which.min(aMSE), lty=2)
```