# 8 Moving Beyond Linearity and Nonparametric Regression


## Moving Beyond Linearity

Here we introduce some nonlinear models.

### Generalized Linear Model

Recall that a generalized linear model specifies the conditional distribution of the response through a linear predictor. Let $\beta$ be a vector of parameters; then
$$\mathbb P(y_i|x_i) = f(x_i^T\beta).$$
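
As one concrete instance, logistic regression takes $f$ to be the sigmoid function, so that $\mathbb P(y_i=1|x_i)=1/(1+e^{-x_i^T\beta})$. The following is a minimal sketch; the helper name `logistic_prob` and the numbers are illustrative assumptions, not part of the notes.

```python
import numpy as np

def logistic_prob(X, beta):
    """Return P(y = 1 | x) = sigmoid(x^T beta) for each row x of X."""
    eta = X @ beta                      # linear predictor x_i^T beta
    return 1.0 / (1.0 + np.exp(-eta))   # inverse of the logit link

# Illustrative values only: an intercept column plus one feature.
X = np.array([[1.0, -2.0], [1.0, 0.0], [1.0, 2.0]])
beta = np.array([0.0, 1.0])
probs = logistic_prob(X, beta)
```

The probabilities increase with the linear predictor, which is the only channel through which $x_i$ enters the model.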

### Polynomial Regression

The response might be a continuous function of $x_i$ plus noise:

$$y_i=f(x_i)+\epsilon_i.$$


Continuous functions on a closed bounded interval can be approximated arbitrarily well by polynomials (the Weierstrass approximation theorem). So we can assume $f$ is a polynomial, e.g.
$$y_i=\beta_0+\beta_1 x_i+\beta_2 x_i^2+\dotsc +\beta_p x_i^p+\epsilon_i,$$
which can be fitted by multiple linear regression, treating $x_i, x_i^2,\dotsc,x_i^p$ as separate predictors. The difficulty lies in choosing the hyperparameter $p$, the degree of the polynomial.
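
A minimal sketch of this fit with NumPy, using ordinary least squares on the powers of $x$. The function name `fit_polynomial` and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def fit_polynomial(x, y, p):
    """Least-squares fit of y ~ beta_0 + beta_1 x + ... + beta_p x^p."""
    # Design matrix with columns 1, x, x^2, ..., x^p.
    X = np.vander(x, N=p + 1, increasing=True)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Synthetic example: a cubic signal with small Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 50)
y = 1.0 - 2.0 * x + 0.5 * x**3 + 0.01 * rng.standard_normal(x.size)
beta_hat = fit_polynomial(x, y, p=3)   # estimates of (beta_0, ..., beta_3)
```

In practice $p$ is unknown; one common approach (not covered here) is to compare several degrees by cross-validation.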

### Step Functions

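Continuous functions on a closed bounded interval can also be approximated arbitrarily well by step functions. If $x\in [a,b)$ and we divide the interval with $a=c_0<c_1<\dotsc <c_n=b$, then we can form the model 
$$y_i = \beta_0 + \beta_1 \mathbb I_{c_1\leqslant x_i<c_2}+\beta_2\mathbb I_{c_2\leqslant x_i<c_3}+\dotsc 
+\beta_{n-1} \mathbb I_{c_{n-1}\leqslant x_i<c_n}+\epsilon_i.$$
The indicator of the first bin $[c_0,c_1)$ is omitted because the indicators of all $n$ bins sum to one, which would make them collinear with the intercept; $\beta_0$ is then the mean response on the first bin.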
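
A minimal sketch of the step-function fit with NumPy. The helper name `step_design` and the data are illustrative assumptions. Because the bin indicators sum to one, this sketch leaves out the indicator of the first bin $[c_0,c_1)$ so that the design matrix has full column rank alongside the intercept.

```python
import numpy as np

def step_design(x, cuts):
    """Design matrix [1, I(c_1 <= x < c_2), ..., I(c_{n-1} <= x < c_n)].

    `cuts` holds c_0 < c_1 < ... < c_n. The first bin [c_0, c_1) has no
    indicator column; it is absorbed into the intercept.
    """
    cuts = np.asarray(cuts)
    # k[i] is the bin index of x[i]: c_k <= x < c_{k+1}.
    k = np.digitize(x, cuts[1:-1])
    n_bins = len(cuts) - 1
    X = np.zeros((len(x), n_bins))
    X[:, 0] = 1.0                        # intercept column
    rows = np.nonzero(k > 0)[0]
    X[rows, k[rows]] = 1.0               # indicator of bin k for k >= 1
    return X

# Illustrative data on [0, 3) with cut points 0 < 1 < 2 < 3.
x = np.array([0.5, 1.5, 2.5, 0.2, 1.7])
y = np.array([1.0, 3.0, 6.0, 1.0, 3.0])
X = step_design(x, [0.0, 1.0, 2.0, 3.0])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```

With this parameterization the intercept is the mean response in the first bin, and each remaining coefficient is the difference between its bin mean and that baseline.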

### Basis Functions

More generally, we can select some functions $f_1,f_2,\dotsc,f_n$ and form the following model
$$y_i = \beta_0+\sum_{k=1}^n\beta_k f_k(x_i)+\epsilon_i.$$

Polynomial regression and step functions are special cases of this model. Other choices of basis include wavelets, Fourier bases, and polynomial splines. See more about cubic splines in the course \<Numerical Algorithm and Case Studies II>.
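
The general basis-function model is still linear in $\beta$, so it can be fitted by least squares once the basis functions are evaluated. The sketch below assumes a small Fourier basis chosen purely for illustration; `fit_basis` is a hypothetical helper name.

```python
import numpy as np

def fit_basis(x, y, basis):
    """Least-squares fit of y ~ beta_0 + sum_k beta_k f_k(x)."""
    # One column of ones for beta_0, then one column per basis function.
    X = np.column_stack([np.ones_like(x)] + [f(x) for f in basis])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, X

# Illustrative Fourier basis: sin(t), cos(t), sin(2t), cos(2t).
basis = [np.sin, np.cos, lambda t: np.sin(2 * t), lambda t: np.cos(2 * t)]
x = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
y = 2.0 + 3.0 * np.sin(x) - 1.5 * np.cos(2 * x)    # noise-free signal
beta_hat, X = fit_basis(x, y, basis)
```

Swapping the list `basis` for powers of $x$ or bin indicators recovers polynomial regression and step functions, respectively.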

## Kernel Regression

Kernel regression is nonparametric. To start with, we review two concepts.

### Bias-Variance Trade-off

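Recall that the mean squared error of a prediction decomposes into an irreducible noise term, the squared bias, and the variance. Writing $f(X)=\mathbb E(Y|X)$ and assuming the new response $Y$ is independent of the fitted $\hat Y$ given $X$: 

$$\mathbb E\left\{(Y -  \hat Y)^2|X\right\}
=\mathbb E\left\{(Y -  f(X))^2|X\right\}+\left(\mathbb E(\hat Y|X)-f(X)\right)^2+\mathbb E\left\{(\hat Y-\mathbb E(\hat Y|X))^2|X\right\}.$$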
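
The trade-off can be illustrated by simulation. The sketch below repeatedly refits polynomials of a low and a high degree to fresh noisy samples and splits the error of the fitted value at a fixed point into squared bias and variance. The target function $\sin(2\pi x)$, the noise level, and the degrees are all assumptions chosen for illustration.

```python
import numpy as np

def simulate_fit_at(x0, degree, n=30, sigma=0.3, reps=500, seed=0):
    """Refit a polynomial of the given degree on `reps` fresh noisy samples
    of f(x) = sin(2 pi x) and return the fitted values at x0."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n)
    X = np.vander(x, N=degree + 1, increasing=True)
    v0 = np.vander(np.array([x0]), N=degree + 1, increasing=True)[0]
    preds = np.empty(reps)
    for r in range(reps):
        y = np.sin(2 * np.pi * x) + sigma * rng.standard_normal(n)
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        preds[r] = v0 @ beta
    return preds

def decompose(preds, truth):
    """Return (mse, squared bias, variance) of the estimates around `truth`."""
    mse = np.mean((preds - truth) ** 2)
    bias2 = (np.mean(preds) - truth) ** 2
    var = np.var(preds)                  # population variance (ddof=0)
    return mse, bias2, var

x0 = 0.25
truth = np.sin(2 * np.pi * x0)
results = {d: decompose(simulate_fit_at(x0, d), truth) for d in (1, 9)}
```

For each degree the empirical error around the true $f(x_0)$ equals squared bias plus variance. The rigid degree-1 fit is dominated by bias, while the flexible degree-9 fit trades that bias for a larger variance.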

### Hölder Class

Define the Hölder class $H_d(\beta,L)$, a class of functions on $\mathbb R^d$, as follows:
$$H_d(\beta ,L) = \left\{g: \Vert D^s g(x) - D^sg(y)\Vert \leqslant L\Vert x - y\Vert,\quad \forall x,y\in\mathbb R^d,\quad \forall s \text{ with } s_1+\dotsc+s_d=\beta - 1\right\}$$
where $D^s=\frac{\partial^{s_1+\dotsc+s_d}}{\partial x_1^{s_1}\dotsm\partial x_d^{s_d}}$ is the derivative operator.

For instance, $H_d(1,L)$ is the class of $L$-Lipschitz functions, since the condition is then imposed on $g$ itself ($s=0$).
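
As a quick illustration of the defining inequality when no derivatives are taken (all $s_k=0$), consider $g(x)=\sin x$ on $\mathbb R$. By the mean value theorem there is some $\xi$ between $x$ and $y$ with
$$|\sin x-\sin y| = |\cos \xi|\,|x-y|\leqslant |x-y|,$$
so the condition holds with $L=1$. By contrast, $g(x)=x^2$ satisfies $|x^2-y^2|=|x+y|\,|x-y|$, and $|x+y|$ is unbounded on $\mathbb R$, so no finite $L$ works.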

### Nadaraya–Watson Kernel Regression

Let $K$ be a kernel function, which is large at the origin and decays away from it, e.g. $K(u) = e^{-u^2}$. Then we estimate the response at a point $x$ by 
$$\hat y  = \sum_{i=1}^n \frac{K\left(\frac{\Vert x-x_i\Vert}{h}\right)}{\sum_{j=1}^n K\left(\frac{\Vert x-x_j\Vert}{h}\right)}y_i.$$

This is called the (Nadaraya-Watson) [kernel regression](https://bookdown.org/egarpor/PM-UC3M/npreg-kre.html#npreg-kre-nw). The parameter $h>0$ is called the bandwidth.

The formula assigns a weight to each $y_i$, and data points closer to $x$ receive larger weights. Commonly used kernels include the Gaussian kernel $\frac{1}{\sqrt{2\pi}}e^{-x^2/2}$, the box kernel $\frac 12\mathbb I_{|x|\leqslant 1}$, the Epanechnikov kernel $\frac{3}{4}(1-x^2)\mathbb I_{|x|\leqslant 1}$, etc.
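
A minimal NumPy sketch of the Nadaraya-Watson estimator with a Gaussian kernel follows. The function names and the toy data are assumptions for illustration; a kernel with compact support (box, Epanechnikov) would additionally need a guard for query points where all weights are zero.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard Gaussian density evaluated at u."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def nadaraya_watson(x_query, x_train, y_train, h, kernel=gaussian_kernel):
    """Nadaraya-Watson estimate at each row of x_query.

    x_query: (m, d), x_train: (n, d), y_train: (n,), h: bandwidth > 0.
    """
    # Pairwise Euclidean distances ||x - x_i||, shape (m, n).
    diff = x_query[:, None, :] - x_train[None, :, :]
    dist = np.linalg.norm(diff, axis=2)
    w = kernel(dist / h)
    # Normalize so the weights of each query point sum to one.
    w = w / w.sum(axis=1, keepdims=True)
    return w @ y_train

# Toy one-dimensional data on a grid.
x_train = np.linspace(0.0, 1.0, 11).reshape(-1, 1)
y_train = np.sin(2 * np.pi * x_train).ravel()
x_query = np.array([[0.25], [0.5]])
y_hat = nadaraya_watson(x_query, x_train, y_train, h=0.1)
```

The bandwidth controls the smoothing: a very large $h$ makes all weights nearly equal, so the estimate approaches the sample mean of the $y_i$, while a very small $h$ makes the estimate at a training point approach the observed $y_i$ there.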

### Local Linear Regression