# 1. Non-Linear Models 

- Kinds of non-linear models : 
    - Polynomials 
    - Step functions 
    - Splines 
    - Local regression
    - Generalized additive models 
- Population model : $y_i = \sum_{j=1}^p f_j(x_{ij}) + \epsilon_i$ 
- Fitted model : $\hat{y}_i = \sum_{j=1}^p \hat{f}_j(x_{ij})$

# 2. Polynomial Regression 

- We are not really interested in the individual coefficients.  
- We are more interested in the fitted function values at a point $x_0$. 
- $\hat{f}(x_0) = \hat{\beta}_0 + \hat{\beta}_1x_0 + \hat{\beta}_2x_0^2 + \hat{\beta}_3x_0^3 + \hat{\beta}_4x_0^4$
- Compute the fit and its pointwise standard error to form an approximate 95% confidence interval : $\hat{f}(x_0) \pm 2 \cdot se[\hat{f}(x_0)]$. 
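
The pointwise standard error can be computed by hand from the coefficient covariance matrix, since $Var[\hat{f}(x_0)] = \ell_0^{'} \widehat{Cov}(\hat{\beta}) \ell_0$ with $\ell_0 = (1, x_0, x_0^2, x_0^3, x_0^4)$. The sketch below is only an illustration on simulated data (the variables `x`, `y`, and the target point `x0` are made up, not taken from the Wage data) and checks the result against `predict()`.

```R
# Simulated data (hypothetical, not the Wage data)
set.seed(1)
x <- runif(200, -2, 2)
y <- 1 + x - 0.5 * x^3 + rnorm(200, sd = 0.5)

# Degree-4 polynomial fit with raw powers of x
fit <- lm(y ~ x + I(x^2) + I(x^3) + I(x^4))

# Vector l0 = (1, x0, x0^2, x0^3, x0^4) at the target point x0
x0 <- 0.7
l0 <- x0^(0:4)

# Fitted value and its standard error: sqrt(l0' Cov(beta_hat) l0)
f0  <- sum(l0 * coef(fit))
se0 <- sqrt(drop(t(l0) %*% vcov(fit) %*% l0))

# Approximate 95% pointwise interval
band <- c(lower = f0 - 2 * se0, upper = f0 + 2 * se0)

# The same quantities from predict()
p <- predict(fit, newdata = data.frame(x = x0), se.fit = TRUE)
c(manual = se0, predict = unname(p$se.fit))
```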

## 2.1 [Ex] Polynomial Regression of wage dataset 

```R
# install.packages('ISLR')

# Import library 
library(ISLR)
attach(Wage)

# Orthogonal polynomials:
# Each column is a linear combination of age, age^2, age^3 and age^4,
# and the columns are orthogonal to each other
fit <- lm(wage ~ poly(age, 4), data=Wage)
summary(fit)
# Direct power of age
fit2 <- lm(wage ~ poly(age, 4, raw=T), data=Wage)
summary(fit2)
fit2a <- lm(wage ~ age + I(age^2) + I(age^3) + I(age^4), Wage)
fit2b <- lm(wage ~ cbind(age, age ^2, age^3, age^4), Wage)
round(data.frame(fit=coef(fit), fit2=coef(fit2),
                 fit2a=coef(fit2a), fit2b=coef(fit2b)), 5)

# Grid of ages at which to evaluate the fit 
age.grid <- seq(min(age), max(age))
# se=TRUE also returns the standard error of each fitted value  
preds <- predict(fit, newdata=list(age=age.grid), se=TRUE)

# Calculate the confidence band (fit +/- 2 standard errors)
se.bands <- cbind(preds$fit+2*preds$se.fit, preds$fit-2*preds$se.fit)

# Visualize results
plot(age, wage, xlim=range(age), cex=.5, col="darkgrey")
title("Degree-4 Polynomial")
lines(age.grid, preds$fit, lwd=2, col="darkblue")
matlines(age.grid, se.bands, lwd=2, col="darkblue", lty=2)
```

![](Img/Poly1.png)

- The narrower the confidence band, the more stable the fitted values are. 
- In the plot, the band is narrow over the ages where the training data are dense, and it widens toward the extremes of the age range, where there are few observations and the fit is less reliable. 

```R
# Orthogonal vs. Non-orthogonal polynomial regression
preds2 <- predict(fit2, newdata=list(age=age.grid), se=TRUE)
data.frame(fit=preds$fit, fit2=preds2$fit)
sum(abs(preds$fit-preds2$fit))
```

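- The sum of absolute differences between the predictions of the orthogonal and non-orthogonal polynomial regressions is just 1.103956e-09, so the two parameterizations give the same fitted curve up to rounding error.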

## 2.2 [Ex] Choose optimal polynomial terms : Statistics 

```R
# Anova test to find the optimal polynomial degree
fit.1 <- lm(wage ~ age, data=Wage)
fit.2 <- lm(wage ~ poly(age, 2), data=Wage)
fit.3 <- lm(wage ~ poly(age, 3), data=Wage)
fit.4 <- lm(wage ~ poly(age, 4), data=Wage)
fit.5 <- lm(wage ~ poly(age, 5), data=Wage)
g <- anova(fit.1, fit.2, fit.3, fit.4, fit.5)
g
```
<img src="Img/Poly2.png" width="400" height="200">

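- `anova()` compares a sequence of nested models: each row tests the model on that row against the previous, smaller one with an F-test. 
- For example, for Model 2 versus Model 1, 
    - $H_0 : \beta_2 = 0$ : the simpler Model 1 is sufficient. 
    - $H_1 : \beta_2 \neq 0$ : Model 2 is preferable.
- We keep adding polynomial terms as long as the added term is significant, and choose the degree from this test. 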
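
As a rough illustration of what the F-test does for two nested models, the statistic can be rebuilt from the two residual sums of squares. The data below are simulated and the names `m1`, `m2` are made up for this sketch. Note that when more than two models are passed to `anova()`, the denominator uses the residual mean square of the largest model, so the numbers in a longer table will differ from this two-model calculation.

```R
# Simulated data with a genuine quadratic effect (hypothetical example)
set.seed(2)
x <- runif(300, -2, 2)
y <- 1 + x - 0.8 * x^2 + rnorm(300)

m1 <- lm(y ~ x)            # smaller model
m2 <- lm(y ~ poly(x, 2))   # adds the quadratic term

rss1 <- sum(resid(m1)^2)
rss2 <- sum(resid(m2)^2)
df1  <- df.residual(m1)
df2  <- df.residual(m2)

# F = [(RSS1 - RSS2) / (df1 - df2)] / [RSS2 / df2]
F_manual <- ((rss1 - rss2) / (df1 - df2)) / (rss2 / df2)
p_manual <- pf(F_manual, df1 - df2, df2, lower.tail = FALSE)

# Same test from anova()
tab <- anova(m1, m2)
c(F_manual = F_manual, F_anova = tab$F[2])
```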

```R
# Perform T-test
coef(summary(fit.5))
round(coef(summary(fit.5)), 5)

# With orthogonal polynomials, the squared t-statistics equal the ANOVA F-statistics
summary(fit.5)$coef[-c(1, 2), 3]
summary(fit.5)$coef[-c(1, 2), 3]^2
g$F[-1]
```

![](Img/Poly3.png)

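- The t-tests on the coefficients of `fit.5` can also be used to choose the polynomial degree, and they agree with the ANOVA above. 
- The preferable polynomial degree is 3: the cubic term ($\beta_3$) is the highest-order term that is clearly significant.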

## 2.3 [Ex] Choose optimal polynomial terms : K-fold CV simulation

```R
# 10-fold cross-validation to choose the optimal polynomial
set.seed(1111)
N <- 10 # simulation replications
K <- 10 # 10-fold CV

# Simulation Error matrix 
CVE <- matrix(0, N, 10)

# Model training - Calculate test error 
for (k in 1:N) {
    gr <- sample(rep(seq(K), length=nrow(Wage)))
    # Cross Validation Error matrix 
    pred <- matrix(NA, nrow(Wage), 10)
    for (i in 1:K) {
        tran <- (gr != i)
        test <- (gr == i)
        for (j in 1:10) {
            g <- lm(wage ~ poly(age, j), data=Wage, subset=tran)
            # Predict for every row; predict() reuses the basis stored in g
            yhat <- predict(g, newdata=Wage)
            sqerr <- (Wage$wage - yhat)^2
            pred[test, j] <- sqerr[test]
        }
    }
    CVE[k, ] <- apply(pred, 2, mean)
}
RES <- apply(CVE, 2, mean)
RES

# Visualize Result
par(mfrow=c(1,2))
matplot(t(CVE), type="l",xlab="Degrees of Polynomials ",
        ylab="Mean Squared Error")
plot(seq(10), RES, type="b", col=2, pch=20, 
     xlab="Degrees of Polynomials ", ylab="Mean Squared Error")
```

![](Img/NonLinear1.png)

- The graph shows that the CV error drops sharply when the polynomial degree increases from 1 to 2, and changes little for higher degrees.
- When building a model, it is better to choose the least complex model whose terms are statistically significant and whose CV error is close to the minimum, rather than simply the one with the lowest CV error. 

## 2.4 [Ex] Logistic Regression

```R
# Logistic regression using binary response
# 1 for wage > 250 and 0 for wage <= 250
fit <- glm(I(wage>250) ~ poly(age, 4), Wage, family="binomial")

# Predict on the logit scale (the default for glm) with standard errors 
preds <- predict(fit, newdata=list(age=age.grid), se=T)
# Inverse logit transformation: log-odds -> probability 
pfit <- exp(preds$fit) / (1 + exp(preds$fit))
# Confidence bands are computed on the logit scale, then transformed 
se.bands.logit <- cbind(preds$fit + 2*preds$se.fit, preds$fit - 2*preds$se.fit)
se.bands <- exp(se.bands.logit)/(1 + exp(se.bands.logit))

# Alternatively, type="response" returns the probabilities directly 
preds2 <- predict(fit, newdata=list(age=age.grid), type="response", se=T)
cbind(pfit, preds2$fit)

# Visualize result of logistic regression 
plot(age , I(wage > 250), xlim=range(age), type="n", ylim=c(0, .2))
points(jitter(age), I((wage>250)/5), cex=.5, pch="|", col="darkgrey")
lines(age.grid, pfit, lwd=2, col="darkblue")
matlines(age.grid, se.bands, lwd=2, col="darkblue", lty=2)
```

![](Img/NonLinear2.png)

When `wage` is recoded as a TRUE/FALSE classification target (above or below 250) and the model is refit, the confidence band is narrow over the ages that are well represented in the training dataset, but it widens considerably toward the edges of the age range, where there are few observations. 
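
The reason for building the bands on the logit scale and transforming afterwards can be sketched on simulated data. The example below is an assumption-laden illustration (the rare-event data and the names `bands_logit`, `bands_resp` are invented here); `plogis()` is base R's inverse logit, equivalent to `exp(z)/(1 + exp(z))`. Bands built on the logit scale always stay inside (0, 1), while `fit ± 2 se` on the probability scale carries no such guarantee.

```R
# Simulated rare binary outcome (hypothetical, not the Wage data)
set.seed(3)
x <- runif(500, 18, 80)
y <- rbinom(500, 1, plogis(-5 + 0.04 * x))

fit  <- glm(y ~ x, family = binomial)
grid <- data.frame(x = 18:80)

# Bands built on the logit scale, then mapped through the inverse logit
lp <- predict(fit, newdata = grid, se.fit = TRUE)   # type = "link" is the default
bands_logit <- plogis(cbind(lp$fit - 2 * lp$se.fit, lp$fit + 2 * lp$se.fit))

# Bands built directly on the probability scale
pr <- predict(fit, newdata = grid, type = "response", se.fit = TRUE)
bands_resp <- cbind(pr$fit - 2 * pr$se.fit, pr$fit + 2 * pr$se.fit)

range(bands_logit)   # guaranteed to lie inside (0, 1)
range(bands_resp)    # not guaranteed; can fall below 0 for rare events
```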

# 3. Step Functions

- Polynomial regression extends the linear model by adding extra predictors, obtained by raising each of the original predictors to a power. Cubic regression uses three variables $X$, $X^2$, $X^3$ as predictors. This is a simple way to provide a non-linear fit to the data. 
- Step functions cut the range of a variable into K distinct regions in order to produce a qualitative variable. This has the effect of fitting a piecewise constant function. 

Using polynomial functions of the features as predictors imposes a global structure on the non-linear function of X. Instead, we could use step functions to avoid such global structure. Here we break X into bins, and fit a different constant in each bin. Essentially, we create the bins by selecting K cut-points in the range of X, and then construct K+1 new variables, which behave like dummy variables.    
   
 
**Source from :** https://rstudio-pubs-static.s3.amazonaws.com/24589_7552e489485b4c2790ea6634e1afd68d.html

- Cut the variable into distinct regions : $C_0(X) = I(X < c_1), C_1(X) = I(c_1 \leq X < c_2), ..., C_K(X) = I(c_K \leq X)$ 
- For any value of $X$, $C_0(X) + C_1(X) + ... + C_K(X) = 1$, since $X$ falls in exactly one region. 
- Choice of cutpoints or **knots** can be problematic. 
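
A small sketch of these indicator variables, using simulated data and cut points chosen arbitrarily for illustration (35, 50, 65 are assumptions, not values from the notes). It shows that the indicators sum to one in every row and that regressing on them simply reproduces the mean of the response within each bin.

```R
# Simulated data (hypothetical)
set.seed(4)
x <- runif(300, 18, 80)
y <- 90 + 20 * (x > 35) - 15 * (x > 65) + rnorm(300, sd = 8)

# Three cut points give four bins of the form [c_k, c_{k+1})
bins <- cut(x, breaks = c(-Inf, 35, 50, 65, Inf), right = FALSE)

# One indicator column per bin; every row contains exactly one 1
C <- model.matrix(~ bins - 1)
table(rowSums(C))

# Regression on the bins (the first bin is absorbed into the intercept)
fit <- lm(y ~ bins)

# The fitted value in each bin is the bin mean of y
fit_by_bin  <- tapply(fitted(fit), bins, mean)
mean_by_bin <- tapply(y, bins, mean)
cbind(fitted = fit_by_bin, bin_mean = mean_by_bin)
```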

## 3.1 [Ex] Step Functions with linearity

```R
# cut() automatically picked the cut points 
table(cut(age, 4)) 
# (17.9, 33.5] (33.5, 49] (49, 64.5] (64.5, 80.1]

# Linear regression
fit <- lm(wage ~ cut(age, 4), data=Wage) 
# Logistic regression 
fit2 <- glm(I(wage > 250) ~ cut(age, 4), data=Wage, family="binomial") 

# The age < 33.5 category is left out. 
# The first category is recognized as reference. 
coef(summary(fit)) 
```

![](Img/NonLinear4.png)

```R
# Fitted values along with confidence bands. 
# confidence bands of linear regression 
age.grid <- seq(min(age), max(age))
preds <- predict(fit, newdata=list(age=age.grid), se=TRUE)
se.bands <- cbind(preds$fit + 2*preds$se.fit, preds$fit - 2*preds$se.fit)

# confidence bands of logistic regression 
preds2 <- predict(fit2, newdata=list(age=age.grid), se=T)
pfit <- exp(preds2$fit)/(1 + exp(preds2$fit))
se.bands.logit <- cbind(preds2$fit + 2*preds2$se.fit, preds2$fit - 2*preds2$se.fit)
se.bands2 <- exp(se.bands.logit)/(1 + exp(se.bands.logit))

# Visualize result 
# Linear regression 
par(mfrow=c(1,2), mar=c(4.5 ,4.5 ,1 ,1), oma=c(0, 0, 4, 0))
plot(age, wage, xlim=range(age), cex=.5, col="darkgrey")
title("Degree-4 Step Functions", outer=T)
lines(age.grid, preds$fit, lwd=3, col="darkgreen")
matlines(age.grid, se.bands, lwd=2, col="darkgreen", lty=2)

# Logistic regression 
plot(age , I(wage > 250), xlim=range(age), type="n", ylim=c(0, .2))
points(jitter(age), I((wage >250)/5), cex=.5, pch="|", col="darkgrey")
lines(age.grid, pfit, lwd=3, col="darkgreen")
matlines(age.grid, se.bands2, lwd=2, col="darkgreen", lty=2)
```
![](Img/NonLinear3.png)

## 3.2 Piecewise Polynomials 

- Instead of a single polynomial in $X$ over its whole range, we can use different polynomials in regions defined by knots. 
- The range of $X$ is split at the knots, and a separate polynomial regression is fit within each region. 
- For a single knot at $c$, a piecewise cubic polynomial takes the form : 
    - $y_i = \beta_{01} + \beta_{11}x_i + \beta_{21}x_i^2 + \beta_{31}x_i^3 + \epsilon_i$, if $x_i < c$
    - $y_i = \beta_{02} + \beta_{12}x_i + \beta_{22}x_i^2 + \beta_{32}x_i^3 + \epsilon_i$, if $x_i \geq c$

## 3.3 [Ex] Step Functions with non-linearity


```R
# Randomly sample 200 of the 3000 observations 
set.seed(19) 
ss <- sample(3000, 200)
nWage <- Wage[ss,]

# testing sets 
age.grid <- seq(min(nWage$age), max(nWage$age)) 

# Training model in piecewise polynomials (one cubic on each side of age 50) 
g1 <- lm(wage ~ poly(age, 3), data=nWage, subset=(age < 50))
g2 <- lm(wage ~ poly(age, 3), data=nWage, subset=(age >= 50)) 

# Predict on testing dataset
pred1 <- predict(g1, newdata=list(age=age.grid[age.grid < 50]))
pred2 <- predict(g2, newdata=list(age=age.grid[age.grid >= 50]))
         
# Visualize result
par(mfrow = c(1, 2))
plot(nWage[, 2], nWage[, 11], col="darkgrey", xlab="Age",
ylab="Wage")
title(main = "Piecewise Cubic")
lines(age.grid[age.grid < 50], pred1, lwd=2, col="darkblue")
lines(age.grid[age.grid >= 50], pred2, lwd=2, col="darkblue")
abline(v=50, lty=2)
```

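- With separate piecewise polynomials, the fitted curve is discontinuous at each knot, because nothing ties the two pieces together there.
- To solve this problem, a continuous piecewise polynomial is built from the two hockey-stick functions `LHS(x)` $= (50 - x)_{+}$ and `RHS(x)` $= (x - 50)_{+}$. Both are zero at the knot, so the two pieces share a single intercept (the effect of removing one coefficient) and the fitted curve is continuous at $x = 50$.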

```R
# Define the two hockey-stick functions
LHS <- function(x) ifelse(x < 50, 50-x, 0)
RHS <- function(x) ifelse(x < 50, 0, x-50)

# Fit continuous piecewise polynomials
g3 <- lm(wage ~ poly(LHS(age), 3) + poly(RHS(age), 3), nWage)
pred3 <- predict(g3, newdata=list(age=age.grid))
plot(nWage[, 2], nWage[, 11], col="darkgrey", xlab="Age",
     ylab="Wage")
title(main="Continuous Piecewise Cubic")
lines(age.grid, pred3, lwd=2, col="darkgreen")
abline(v=50, lty=2)

summary(g1)
summary(g2)
summary(g3)
```

![](Img/NonLinear5.png)

# 4. Splines 

Regression splines are more flexible than polynomials and step functions, and are actually an extension of the two. They divide the range of X into K distinct regions. Within each region, a polynomial function is fit to the data; however, the polynomials are constrained so that they join smoothly at the region boundaries, or knots. 

For regression splines, instead of fitting a high-degree polynomial over the entire range of X, we can fit several low-degree polynomials over different regions of X. Each of these functions can be fit using least squares.

## 4.1 Linear Splines 

- A linear spline with knots at $\zeta_k$, $k = 1, ..., K$ is a piecewise linear polynomial that is continuous at each knot. 
- $y_i = \beta_0 + \beta_1b_1(x_i) + \beta_2b_2(x_i) + ... + \beta_{K+1}b_{K+1}(x_i) + \epsilon_i$ 
    - $b_1(x_i) = x_i$
    - $b_{k+1}(x_i) = (x_i - \zeta_k)_{+}$, $k = 1, ..., K$
    - $(x_i - \zeta_k)_{+} = x_i - \zeta_k$ if $x_i > \zeta_k$, and $0$ otherwise
    
    
```R
# Linear spline (bs() comes from the splines package)
library(splines)
g6 <- lm(wage ~ bs(age, knots=50, degree=1), data=nWage)
pred6 <- predict(g6, newdata=list(age=age.grid))

plot(nWage[, 2], nWage[, 11], col="darkgrey", xlab="Age",
     ylab="Wage")
title(main="Linear Spline")
lines(age.grid, pred6, lwd=2, col="darkred")
abline(v=50, lty=2)
```

![](Img/NonLinear7.png)
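
The truncated basis $b_1(x) = x$, $b_2(x) = (x - \zeta)_{+}$ written above can also be built by hand. The following sketch uses simulated data with a single knot at 50 (all variable names are illustrative) and suggests that it spans the same functions as `bs(..., degree=1)`: the coefficients differ, but the fitted values agree.

```R
library(splines)

# Simulated data with a change of slope at 50 (hypothetical)
set.seed(5)
x <- runif(200, 18, 80)
y <- 60 + 1.5 * pmin(x, 50) - 0.5 * pmax(x - 50, 0) + rnorm(200, sd = 5)

knot <- 50
# Truncated power basis of degree 1: b1(x) = x, b2(x) = (x - knot)_+
b1 <- x
b2 <- pmax(x - knot, 0)
fit_manual <- lm(y ~ b1 + b2)

# Same function space through the B-spline basis
fit_bs <- lm(y ~ bs(x, knots = knot, degree = 1))

# The two bases span the same functions, so the fitted values agree
max_diff <- max(abs(fitted(fit_manual) - fitted(fit_bs)))
max_diff

# Slope left of the knot is the b1 coefficient;
# right of the knot it is the sum of the b1 and b2 coefficients
coef(fit_manual)
```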

## 4.2 Cubic Splines 

- A cubic spline with knots at $\zeta_k$, $k = 1, ..., K$ is a piecewise cubic polynomial with continuous derivatives up to order 2 at each knot. 
- $y_i = \beta_0 + \beta_1b_1(x_i) + \beta_2b_2(x_i) + ... + \beta_{K+3}b_{K+3}(x_i) + \epsilon_i$ 
    - $b_1(x_i) = x_i$
    - $b_2(x_i) = x_i^2$
    - $b_3(x_i) = x_i^3$
    - $b_{k+3}(x_i) = (x_i - \zeta_k)_{+}^3$, $k = 1, ..., K$
    - $(x_i - \zeta_k)_{+}^3 = (x_i - \zeta_k)^3$ if $x_i > \zeta_k$, and $0$ otherwise
    
```R
# Truncated power basis functions 
d <- 3
knots <- 50 
x1 <- outer(nWage$age, 1:d, "^") # columns x, x^2, x^3 
x2 <- outer(nWage$age, knots, ">") * outer(nWage$age, knots, "-")^d # column (x-50)_+^3 
x <- cbind(x1, x2) # design matrix: x, x^2, x^3, (x-50)_+^3

# Train models 
g4 <- lm(wage ~ x, data=nWage)

# Make testing set and predictions 
nx1 <- outer(age.grid, 1:d, "^")
nx2 <- outer(age.grid, knots, ">") * outer(age.grid, knots, "-")^d
nx <- cbind(nx1, nx2)
pred4 <- predict(g4, newdata=list(x=nx))

# Visualize 
par(mfrow=c(1,2))
plot(nWage[, 2], nWage[, 11], col="darkgrey", xlab="Age", ylab="Wage")
title(main = "Cubic Spline")
lines(age.grid, pred4, lwd=2, col="red")
abline(v=50, lty = 2)

# Make automatic splines using bs function 
library(splines)
g5 <- lm(wage ~ bs(age, knots=50), data=nWage)
pred5 <- predict(g5, newdata=list(age=age.grid))
plot(nWage[, 2], nWage[, 11], col="darkgrey", xlab="Age", ylab="Wage")
title(main="Cubic Spline")
lines(age.grid, pred5, lwd=2, col="red")
abline(v=50, lty=2)
```
![](Img/NonLinear6.png)

## 4.3 Natural Cubic Splines 

- Regression splines have high variance at the outer range of the predictors. 
- A natural spline is a regression spline with additional boundary constraints. 
- The function is required to be linear beyond the boundary knots, which reduces the variance there. 
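
The boundary constraint can be checked numerically. In this sketch (simulated data, arbitrary knots at 30, 45, 60) both fits are evaluated beyond the right edge of the data; the second differences of the natural-spline predictions are zero up to rounding, which is what a straight line gives, whereas the ordinary cubic spline is free to keep curving.

```R
library(splines)

# Simulated data on [20, 70] (hypothetical)
set.seed(6)
x <- runif(300, 20, 70)
y <- 100 + 25 * sin(x / 10) + rnorm(300, sd = 8)

# Same interior knots; boundary knots default to range(x)
fit_bs <- lm(y ~ bs(x, knots = c(30, 45, 60)))
fit_ns <- lm(y ~ ns(x, knots = c(30, 45, 60)))

# Evaluate beyond the right boundary, where there is no data
grid <- data.frame(x = seq(71, 80, by = 1))
p_bs <- suppressWarnings(predict(fit_bs, newdata = grid))  # bs() warns when extrapolating
p_ns <- predict(fit_ns, newdata = grid)

# Second differences of equally spaced values are zero for a straight line
curv_ns <- max(abs(diff(p_ns, differences = 2)))
curv_bs <- max(abs(diff(p_bs, differences = 2)))
c(natural = curv_ns, cubic = curv_bs)
```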

```R
# Cubic Spline
fit <- lm(wage ~ bs(age, knots=c(25 ,40 ,60)), data=nWage)
pred <- predict(fit, newdata=list(age=age.grid), se=T)

# Natural spline, fit with the ns() function 
fit2 <- lm(wage ~ ns(age, knots=c(25 ,40 ,60)), data=nWage)
pred2 <- predict(fit2, newdata=list(age=age.grid), se=T)
plot(nWage[, 2], nWage[, 11], col="darkgrey", xlab="Age", ylab="Wage")
lines(age.grid, pred$fit, lwd=2, col=4)
lines(age.grid, pred$fit + 2*pred$se, lty="dashed", col=4)
lines(age.grid, pred$fit - 2*pred$se, lty="dashed", col=4)
lines(age.grid, pred2$fit, lwd=2, col=2)
lines(age.grid, pred2$fit + 2*pred2$se, lty="dashed", col=2)
lines(age.grid, pred2$fit - 2*pred2$se, lty="dashed", col=2)
abline(v=c(25, 40, 60), lty=2)
legend("topright", c("Natural Cubic Spline", "Cubic Spline"), lty=1, lwd=2, col=c(2, 4))
```

![](Img/NonLinear8.png)

### [Ex] Natural Cubic Splines with degree of freedom 

- Instead of giving the knots explicitly, we can let `ns()` place them at percentiles of the data by specifying the degrees of freedom, `ns(age, df=k)`. 
- With `df=4` we fit a natural cubic spline with three interior knots, where the knot locations are chosen automatically as the 25th, 50th, and 75th percentiles of `age`. 

```R
# Use a complete Wage data 
age <- Wage$age 
wage <- Wage$wage 

# Make test dataset 
age.grid <- seq(min(age), max(age)) 

# Training model with natural cubic splines 
g1 <- lm(wage ~ ns(age, df=4), data=Wage) 
pred1 <- predict(g1, newdata=list(age=age.grid), se=T) 
se.bands1 <- cbind(pred1$fit + 2*pred1$se.fit, pred1$fit - 2*pred1$se.fit) 

# Training model with natural cubic splines for the binary problem (wage > 250) 
g2 <- glm(I(wage > 250) ~ ns(age, df=4), data=Wage, family="binomial") 
pred2 <- predict(g2, newdata=list(age=age.grid), se=T) 
pfit <- exp(pred2$fit)/(1 + exp(pred2$fit)) 
se.bands.logit <- cbind(pred2$fit + 2*pred2$se.fit, pred2$fit - 2*pred2$se.fit)
se.bands2 <- exp(se.bands.logit)/(1 + exp(se.bands.logit)) 

# Visualize predicted confidence bands : lm fit 
par(mfrow=c(1,2), mar=c(4.5 ,4.5 ,1 ,1), oma=c(0, 0, 4, 0))
plot(age, wage, cex=.5, col="darkgrey", xlab="Age", ylab="Wage")
title("Natural Cubic Spline", outer=T)
lines(age.grid, pred1$fit, lwd=3, col=2)
matlines(age.grid, se.bands1, lwd=2, col=2, lty=2)
ncs <- ns(age, df=4)
attr(ncs, "knots")
abline(v=attr(ncs, "knots"), lty=3)

# Visualize predicted confidence bands : glm fit 
plot(age, I(wage > 250), type ="n", ylim=c(0, .2), xlab="Age",
     ylab="Pr(Wage>250 | Age)")
points(jitter(age), I((wage >250)/5), cex=.5, pch ="|",
       col="darkgrey")
lines(age.grid, pfit, lwd=3, col=2)
matlines(age.grid, se.bands2, lwd=2, col=2, lty=2)
abline(v=attr(ncs, "knots"), lty=3)
```

- A cubic spline with $K$ knots has $K+4$ parameters (degrees of freedom), counting the intercept. 
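
The parameter count can be read off the basis matrices. This is a minimal sketch on a made-up predictor grid: with $K = 3$ interior knots, `bs()` returns $K+3$ columns (the intercept is added by `lm()`, giving $K+4$), while `ns()` returns two fewer. It also shows where `ns(df=4)` puts its knots.

```R
library(splines)

# Hypothetical predictor values
x <- seq(18, 80, length.out = 500)
K <- 3
knots <- quantile(x, c(0.25, 0.50, 0.75))

# Cubic spline: K + 3 basis columns, plus the intercept = K + 4 parameters
n_bs <- ncol(bs(x, knots = knots))

# Natural cubic spline: K + 1 basis columns, plus the intercept
n_ns <- ncol(ns(x, knots = knots))

c(bs_columns = n_bs, ns_columns = n_ns)

# Asking for df = 4 in ns() places df - 1 = 3 interior knots at quantiles of x
auto_knots <- attr(ns(x, df = 4), "knots")
auto_knots
```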

<img src="Img/NonLinear9.png" width=400 height=200>

### [Ex] Optimized Natural Cubic Splines 

- Place more knots where the function might vary most rapidly. 
- Place fewer knots where it seems more stable. 
- How many knots should we use : Using a cross-validation

```R
set.seed(1111)
CVE <- matrix(0, 20, 10) 

# Iterate 20 times of Cross Validation 
for (k in 1:20) { 
    gr <- sample(rep(seq(10), length=nrow(Wage))) 
    pred <- matrix(NA, nrow(Wage), 10) 
    # Applying 10 K-folds Cross Validation 
    for (i in 1:10) {
        tran <- (gr != i) 
        test <- (gr == i) 
        # Natural cubic spline with j degrees of freedom; 
        # predict() rebuilds the basis from the knots stored in g 
        for (j in 1:10) {
            g <- lm(wage ~ ns(age, df=j), data=Wage, subset=tran) 
            sqerr <- (Wage$wage - predict(g, newdata=Wage))^2
            pred[test, j] <- sqerr[test]
        } 
    }
    CVE[k, ] <- apply(pred, 2, mean)
}
RES <- apply(CVE, 2, mean)
plot(seq(10), RES, type="b", col=2, pch=20, xlab="Degrees of Freedom of Natural Spline", 
     ylab="Mean Squared Error")   
```

![](Img/NonLinear10.png)

### [Ex] Comparison to Polynomial Regression 

- Regression splines often give superior results to polynomial regression. 
- The extra flexibility in the polynomial produces undesirable results at the boundaries, while the natural cubic spline still provides a reasonable fit to the data. 

```R
# Natural cubic spline with 15 df vs. polynomial of degree 15 
g1 <- lm(wage ~ ns(age, df=15), data=Wage)
g2 <- lm(wage ~ poly(age, 15), data=Wage)

# Make prediction of testing set
pred1 <- predict(g1, newdata=list(age=age.grid), se=T)
pred2 <- predict(g2, newdata=list(age=age.grid), se=T)

# Visualize comparison results 
plot(age, wage, cex=.5, col="darkgrey", xlab="Age", ylab="Wage")
lines(age.grid, pred1$fit, lwd=2, col=2)
lines(age.grid, pred1$fit + 2*pred1$se, lty="dashed", col=2)
lines(age.grid, pred1$fit - 2*pred1$se, lty="dashed", col=2)
lines(age.grid, pred2$fit, lwd=2, col=4)
lines(age.grid, pred2$fit + 2*pred2$se, lty="dashed", col=4)
lines(age.grid, pred2$fit - 2*pred2$se, lty="dashed", col=4)
legend("topleft", c("Natural Cubic Spline", "Polynomial"), lty=1, lwd=2, col=c(2, 4))
```

![](Img/NonLinear11.png)

## 4.4 Smoothing Splines 

- Find a function $g(x)$ that makes $RSS$ small, but that is also smooth. 
- The function $g(x)$ that minimizes the criterion below is a smoothing spline. 
    - $\sum_{i=1}^n (y_i - g(x_i))^2 + \lambda \int g''(t)^2dt$ 
    - $\lambda$ is a non-negative tuning parameter. 
    - $\lambda$ controls how wiggly $g(x)$ is.    
    - The smaller $\lambda$, the more wiggly the function. 
    - As $\lambda \to \infty$, the function $g(x)$ becomes linear. 
- The solution is a natural cubic spline, with a knot at every unique value of $x_i$. 
    - Smoothing splines avoid the knot-selection issue, leaving a single $\lambda$ to be chosen. 
    - The function smooth.spline() will fit a smoothing spline. 
    - The leave-one-out cross-validation (LOOCV) error can be computed very efficiently for smoothing splines, from a single fit to all the data. 
    - $LOOCV_{\lambda} = \sum_{i=1}^n(y_i - \hat{g}_{\lambda}^{(-i)}(x_i))^2 = \sum_{i=1}^n\left[\frac{y_i - \hat{g}_{\lambda}(x_i)}{1 - \{S_{\lambda}\}_{ii}}\right]^2$, where $S_{\lambda}$ is the smoother matrix with $\hat{g}_{\lambda} = S_{\lambda}y$.
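
The LOOCV shortcut holds for any penalized least-squares smoother with a fixed basis and penalty. The sketch below does not use `smooth.spline()`; as a stand-in it assumes a B-spline basis with a second-difference penalty (a P-spline-style smoother, with `lambda`, `Omega`, and the data all invented for illustration) so that the smoother matrix $S_{\lambda}$ can be formed explicitly and the shortcut compared with brute-force refitting.

```R
library(splines)

# Simulated data (hypothetical)
set.seed(8)
n <- 60
x <- sort(runif(n, 0, 10))
y <- sin(x) + rnorm(n, sd = 0.3)

# Fixed B-spline basis and a second-difference roughness penalty
# (a stand-in for the smoothing-spline penalty, not smooth.spline itself)
B      <- matrix(bs(x, df = 12, intercept = TRUE), nrow = n)
D      <- diff(diag(ncol(B)), differences = 2)
Omega  <- crossprod(D)
lambda <- 1

# Smoother matrix S_lambda: fitted values are S %*% y
S    <- B %*% solve(crossprod(B) + lambda * Omega, t(B))
yhat <- drop(S %*% y)

# Shortcut: one fit, residuals rescaled by 1 - S_ii
loocv_fast <- sum(((y - yhat) / (1 - diag(S)))^2)

# Brute force: refit n times, each time leaving one point out
loo_res <- sapply(seq_len(n), function(i) {
  Bi   <- B[-i, , drop = FALSE]
  beta <- solve(crossprod(Bi) + lambda * Omega, crossprod(Bi, y[-i]))
  y[i] - sum(B[i, ] * beta)
})
loocv_slow <- sum(loo_res^2)

c(shortcut = loocv_fast, brute_force = loocv_slow)
```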
    
    
```R
# Preparing dataset
library(ISLR)
library(splines)
data(Wage)
age <- Wage$age
wage <- Wage$wage
age.grid <- seq(min(age), max(age)) 

# Fit smoothing splines 
# Smoothing spline with the effective degrees of freedom fixed at 16
fit <- smooth.spline(age, wage, df=16) 

# Smoothing spline with lambda chosen by leave-one-out cross validation
fit2 <- smooth.spline(age, wage, cv=TRUE) 
fit2$df 

# Natural cubic splines with df=7
fit3 <- lm(wage ~ ns(age, df=7), data=Wage)
pred3 <- predict(fit3, newdata=list(age=age.grid)) 

# Visualize fits 
plot(age, wage, cex=.5, col="darkgrey") 
title("Smoothing Spline vs Natural Spline") 
lines(fit, col="red", lwd=2)
lines(fit2, col="blue", lwd=2) 
lines(age.grid, pred3, col="green", lwd=2) 
legend("topright", legend=c("SS (DF=16)", "SS (DF=6.8)", "NS (DF=7)"), col=c("red","blue","green"), lty=1, lwd=2)

# Cross Validation Simulation
set.seed(1234)
N <- 10 # Simulation replications
K <- 10 # 10-fold CV
df <- seq(2, 20) ## Degrees of freedom
CVE <- matrix(0, N, length(df))
for (k in 1:N) {
    gr <- sample(rep(seq(K), length=nrow(Wage)))
    pred <- matrix(NA, nrow(Wage), length(df))
    for (i in 1:K) {
        tran <- (gr != i)
        test <- (gr == i)
        for (j in 1:length(df)) {
            fit <- smooth.spline(age[tran], wage[tran], df=df[j])
            mse <- (wage-predict(fit, age)$y)^2
            pred[test, j] <- mse[test]
        }
    }
    CVE[k, ] <- apply(pred, 2, mean)
}
RES <- apply(CVE, 2, mean)

# Visualize Cross Validation Error 
par(mfrow=c(1,2))
matplot(t(CVE), type="b", col=2, lty=2, pch=20, ylab="CV errors", xlab="Degrees of freedom")
plot(df, RES, type="b", col=2, pch=20, ylab="averaged CV errors", xlab="Degrees of freedom")
abline(v=df[which.min(RES)], col="grey", lty=4)
```

![](Img/NonLinear12.png)

```R
# Smoothing splines vs Natural Cubic Splines 
set.seed(1357)
MSE1 <- matrix(0, 100, 2)
for (i in 1:100) {
    tran <- sample(nrow(Wage), size=floor(nrow(Wage)*2/3))
    test <- setdiff(1:nrow(Wage), tran)
    g1 <- smooth.spline(age[tran], wage[tran], df=7)
    g2 <- lm(wage ~ ns(age, df=7), data=Wage, subset=tran)
    mse1 <- (wage-predict(g1, age)$y)[test]^2
    mse2 <- (wage-predict(g2, Wage))[test]^2
    MSE1[i,] <- c(mean(mse1), mean(mse2))
}
apply(MSE1, 2, mean)
```

# 5. Local Regression

- **Local regression** or **local polynomial regression**, also known as moving regression, is a generalization of the moving average and polynomial regression. 
- Local regression computes the fit at a target point $x_0$ using only the nearby training observations. 
- With a sliding weight function, we fit separate linear fits over the range of $x$ by weighted least squares. 
- OLS : $\hat{\beta}^{OLS} = (X^{'}X)^{-1}X^{'}y$
- WLS : $\hat{\beta}^{WLS} = (X^{'}WX)^{-1}X^{'}Wy$ 
- GLS : $\hat{\beta}^{GLS} = (X^{'}\Omega^{-1} X)^{-1}X^{'}\Omega^{-1}y$ 

![](Img/NonLinear13.png)
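
A single local fit can be sketched directly from the WLS formula. The example below is illustrative only: the data are simulated, and the tricube kernel, the span of 0.3, and the target point `x0 = 40` are assumptions of this sketch rather than a reproduction of what `loess()` does internally.

```R
# Simulated data (hypothetical)
set.seed(9)
n <- 200
x <- runif(n, 18, 80)
y <- 80 + 30 * sin(x / 15) + rnorm(n, sd = 10)

x0   <- 40    # target point
span <- 0.3   # fraction of observations used in the local fit

# Keep the k nearest neighbours of x0 and weight them with a tricube kernel
k <- ceiling(span * n)
d <- abs(x - x0)
h <- sort(d)[k]                          # distance to the k-th nearest point
w <- ifelse(d <= h, (1 - (d / h)^3)^3, 0)

# Weighted least squares: beta = (X' W X)^{-1} X' W y
X    <- cbind(1, x)
beta <- solve(t(X) %*% (w * X), t(X) %*% (w * y))
fit0 <- sum(c(1, x0) * beta)             # local linear fit at x0

# Same estimate from lm() with weights
fit_lm  <- lm(y ~ x, weights = w)
fit0_lm <- unname(predict(fit_lm, newdata = data.frame(x = x0)))
c(manual = fit0, lm = fit0_lm)
```

Repeating this for every point on a grid of `x0` values traces out the local regression curve.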

## 5.1 [Ex] Local Regression 

```R
# Prepare datasets
data(Wage)
age <- Wage$age
wage <- Wage$wage
age.grid <- seq(min(age), max(age)) 

# Training model (two different spans) 
fit1 <- loess(wage ~ age, span=.2, data=Wage)
fit2 <- loess(wage ~ age, span=.7, data=Wage) 

# Visualize fitted local regression model 
plot(age, wage, cex =.5, col = "darkgrey")
title("Local Linear Regression")
lines(age.grid, predict(fit1, data.frame(age=age.grid)),
col="red", lwd=2)
lines(age.grid, predict(fit2, data.frame(age=age.grid)),
col="blue", lwd=2)
legend("topright", legend = c("Span = 0.2", "Span = 0.7"),
col=c("red", "blue"), lty=1, lwd=2)
```

![](Img/NonLinear14.png)

## 5.2 [Ex] Comparison of Smoothing Spline, Natural Cubic Spline, Local Regression

```R
set.seed(1357)
MSE2 <- matrix(0, 100, 2)
for (i in 1:100) {
    # Train-Test split 
    tran <- sample(nrow(Wage), size=floor(nrow(Wage)*2/3))
    test <- setdiff(1:nrow(Wage), tran)
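    # Training models on the training rows only, for two values of span 
    g1 <- loess(wage ~ age, span=.2, data=Wage, subset=tran)
    g2 <- loess(wage ~ age, span=.7, data=Wage, subset=tran)
    # Calculate MSE result of testing set 
    # (loess returns NA outside the training age range, hence na.rm) 
    mse1 <- (wage-predict(g1, Wage))[test]^2
    mse2 <- (wage-predict(g2, Wage))[test]^2
    MSE2[i,] <- c(mean(mse1, na.rm=TRUE), mean(mse2, na.rm=TRUE))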
}
MSE <- cbind(MSE1, MSE2)
apply(MSE, 2, mean)
apply(MSE, 2, sd)

# Visualize results 
boxplot(MSE, boxwex=0.5, col=4:7, ylim=c(1200, 2000), 
        names= c("Smoothing Spline", "Natural Cubic", "Local (S=0.2)", "Local (S=0.7)"), 
        ylab="Mean Squared Errors")
```

![](Img/NonLinear15.png) 

# 6. Generalized Additive Models

- **Generalized additive models (GAMs)** allow non-linear functions of each of the variables, while maintaining additivity.
- $y_i = \beta_0 + f_1(x_{i1}) + f_2(x_{i2}) + ... + f_p(x_{ip}) + \epsilon_i$
- It is called an additive model because we calculate a separate $f_j$ for each $x_j$, and then add together all of their contributions. 
- **GAMs** provide a useful compromise between linear and fully nonparametric models.
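
Additivity can be made concrete by splitting a fitted model into its per-variable contributions. The sketch below assumes simulated data with two predictors (`x1`, `x2` are made-up names) and a natural-spline fit through `lm()` rather than the gam package. `predict(type="terms")` returns one centred column per term, and the fitted values are the constant plus the row sums of those columns.

```R
library(splines)

# Simulated data with two additive non-linear effects (hypothetical)
set.seed(10)
n  <- 400
x1 <- runif(n, 0, 10)
x2 <- runif(n, -2, 2)
y  <- 5 + sin(x1) + x2^2 + rnorm(n, sd = 0.3)

# Additive fit: one natural-spline block per predictor
fit <- lm(y ~ ns(x1, df = 5) + ns(x2, df = 4))

# Per-term contributions f1_hat(x1) and f2_hat(x2), each centred at zero
terms_hat <- predict(fit, type = "terms")
beta0     <- attr(terms_hat, "constant")

# Fitted value = constant + sum of the separate contributions
recon    <- beta0 + rowSums(terms_hat)
max_diff <- max(abs(recon - fitted(fit)))
max_diff

# Each column can be plotted against its own predictor, e.g.
# plot(x1, terms_hat[, 1]); plot(x2, terms_hat[, 2])
```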

## 6.1 [Ex] GAMs using gam package 

```R
# Prepare library and dataset 
library(ISLR)
library(splines) 
data(Wage)

# Training model 
gam1 <- lm(wage ~ ns(year, 4) + ns(age, 5) + education, data=Wage)
summary(gam1)
library(gam)

# s() : smoothing spline
gam <- gam(wage ~ s(year, 4)+s(age, 5)+education, data=Wage)

# Visualize fitted model 
par(mfrow =c(1,3))
plot(gam, se=TRUE, col="blue", scale=70)
plot.Gam(gam1, se = TRUE, col = "red")
```

![](Img/NonLinear16.png)

```R
# Significance test: is year needed at all, linearly, or non-linearly? 
gam.m1 <- gam(wage ~ s(age, 5) + education, data=Wage)
gam.m2 <- gam(wage ~ year + s(age, 5) + education, data=Wage)
anova(gam.m1, gam.m2, gam, test = "F")
summary(gam)
```

- Features can be selected by comparing nested models with the ANOVA F-test. 
- When there are many candidate terms, using the group lasso is recommended. 

## 6.2 Simulation Study 

```R
set.seed(1357)
MSE3 <- matrix(0, 100, 3)
for (i in 1:100) {
    tran <- sample(nrow(Wage), size=floor(nrow(Wage)*2/3))
    test <- setdiff(1:nrow(Wage), tran)
    g1 <- gam(wage ~ s(age, 5) + education, data=Wage, subset=tran)
    g2 <- gam(wage ~ year + s(age, 5) + education, data=Wage, subset=tran)
    g3 <- gam(wage ~ s(year, 4) + s(age, 5) + education, data=Wage, subset=tran)
    mse1 <- (wage - predict(g1, Wage))[test]^2
    mse2 <- (wage - predict(g2, Wage))[test]^2
    mse3 <- (wage - predict(g3, Wage))[test]^2
    MSE3[i,] <- c(mean(mse1), mean(mse2), mean(mse3))
}
apply(MSE3, 2, mean)
```

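- The mean test MSEs in MSE3 are 1265.564, 1261.940, and 1262.823 for the three models.
- The model without `year`, gam(wage ~ s(age, 5) + education), performs worst; the model with a linear `year` term has the lowest test MSE, marginally ahead of the one with s(year, 4). 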