# Physics 494/594
## Feature Maps for Non-Linear Functions

In [None]:
# %load ./include/header.py
import numpy as np
import matplotlib.pyplot as plt
import sys
from tqdm import trange,tqdm
sys.path.append('./include')
import ml4s
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
plt.style.use('./include/notebook.mplstyle')
np.set_printoptions(linewidth=120)
ml4s.set_css_style('./include/bootstrap.css')
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']

### Last Time

#### [Notebook Link: 08_Linear_Regression_Exercise.ipynb](./08_Linear_Regression_Exercise.ipynb)

- Cost functions and formulating a machine learning task as an optimization problem
- Multidimensional linear regression 

### Today

- Learn how linear regression can fit non-linear functions using feature maps
- Understand model complexity and the bias-variance tradeoff

We can generalize everything we learned about linear regression to non-linear models composed of a linear combination of non-linear parts, *i.e.* feature maps.  Let us change our notation slightly (as we would like to be fully general).  As before, our goal is to predict a scalar *target* $y$ as a function of $\vec{x}$ given a dataset of pairs $\mathcal{D} = \{(\vec{x}^{(n)},y^{(n)})\}_{n=1}^N$.  Here the $\vec{x}^{(n)}$ are inputs and the $y^{(n)}$ are targets or observations.  We now let our model be more general:

\begin{equation}
\boxed{
F(\vec{x},\vec{w}) = \vec{\varphi}(\vec{x}) \cdot \vec{w} }
\end{equation}

which is **linear** in the *weights* (here we have incorporated a potential bias as $w_0$): 

\begin{equation}
\vec{w} = \left( \begin{array}{c}
w_0 \\
w_1 \\
\vdots \\
w_{M-1}
\end{array}
\right)
\end{equation}

but can be **non-linear** in $\vec{x}$, depending on the choice of basis functions, or *features* (think of them as a pre-processing layer):

\begin{equation}
\vec{\varphi}(\vec{x}) = \left(
\varphi_0(\vec{x}), 
\varphi_1(\vec{x}),
\dots,
\varphi_{M-1}(\vec{x})
\right)
\end{equation}

where $\varphi_j : \mathbb{R}^D \to \mathbb{R}$ for all $j \in \{0,\dots,M-1\}$.
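
As a minimal sketch of this model, assume a scalar input ($D = 1$) and polynomial features. The helper names `phi` and `F` below are hypothetical and only illustrate the structure; they are not part of the course's `ml4s` module.

```python
import numpy as np

def phi(x, M=4):
    """Hypothetical polynomial feature map for a scalar x: (1, x, ..., x^(M-1))."""
    return np.array([x**j for j in range(M)])

def F(x, w):
    """Model prediction: a dot product that is linear in w but non-linear in x."""
    return np.dot(phi(x, M=len(w)), w)

w = np.array([1.0, 0.0, -2.0, 0.5])   # arbitrary illustrative weights
print(F(2.0, w))                      # 1 - 2*4 + 0.5*8 = -3.0
```

Doubling every weight doubles the prediction, while doubling $x$ does not: this is what "linear in the weights" means in practice.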

### Examples of Basis Functions
1. **Identity Transformation:** $\varphi_j(\vec{x}) = x_j$; this is just the linear regression we have already seen. Here $M = D$.
2. **Polynomial Decomposition:** $\varphi_j(\vec{x}) = |\vec{x}|^j$; note that these basis functions are global: they behave non-trivially on the entire domain of $x$.
3. **Sigmoid Basis:** $\varphi_j(\vec{x}) = \sigma((x_j-\mu_j)/s)$ where $\sigma(z) = 1/(1+\mathrm{e}^{-z})$; note that each basis function changes appreciably only in a local region of the domain (of width $\sim s$ around $\mu_j$).
4. **Fourier Basis:** $\varphi_j(\vec{x}) = \mathrm{e}^{i 2\pi x_j j}$; each basis function is local in the frequency domain.
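
The sigmoid and Fourier bases can be sketched for a scalar input as follows. This is an illustration under stated assumptions: the input is one-dimensional, the centers `mu`, the width `s`, and the function names `sigmoid_features` and `fourier_features` are all hypothetical choices, and the Fourier index $j$ is used as an integer frequency.

```python
import numpy as np

def sigmoid_features(x, mu, s=0.1):
    """Column j holds sigma((x - mu_j)/s); mu and s are assumed, illustrative values."""
    x = np.asarray(x, dtype=float)
    z = (x[:, None] - mu[None, :]) / s
    return 1.0 / (1.0 + np.exp(-z))

def fourier_features(x, M):
    """Column j holds exp(i 2 pi j x) for integer frequencies j = 0, ..., M-1."""
    x = np.asarray(x, dtype=float)
    freqs = np.arange(M)
    return np.exp(1j * 2 * np.pi * x[:, None] * freqs[None, :])

x = np.linspace(0, 1, 5)
mu = np.array([0.25, 0.5, 0.75])

S = sigmoid_features(x, mu)
Z = fourier_features(x, 4)
print(S.shape, Z.shape)   # one row per sample, one column per basis function
print(S[1, 0])            # x = mu_0 = 0.25, so sigma(0) = 0.5
```

Each sigmoid column switches from near 0 to near 1 around its own center, whereas every Fourier column has unit modulus over the whole domain.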

#### Consider a 3rd order polynomial in the scalar $x$ (i.e. $D = 1$):

In [None]:
N = [1,4,1]
labels = [[r'$x$'],['$1$','$x$','$x^2$','$x^3$'],[r'$F(x,\vec{w})$']]
ml4s.draw_network(N,node_labels=labels, weights=[['' for i in range(N[1])],[f'$w_{i}$' for i in range(N[1])]], biases=[])

All our previous derivations go through unchanged provided we replace our design matrix $\mathbf{X}$ with a new feature matrix $\mathbf{\Phi}$:

\begin{equation}
\mathbf{\Phi} = \left( \begin{array}{cccc}
        \varphi_0(\vec{x}^{(1)}) & \varphi_1(\vec{x}^{(1)}) & \cdots & \varphi_{M-1}(\vec{x}^{(1)}) \\
        \varphi_0(\vec{x}^{(2)}) & \varphi_1(\vec{x}^{(2)}) & \cdots & \varphi_{M-1}(\vec{x}^{(2)}) \\
\vdots        &      \vdots    & \ddots & \vdots \\
\varphi_0(\vec{x}^{(N)}) & \varphi_1(\vec{x}^{(N)}) & \cdots & \varphi_{M-1}(\vec{x}^{(N)}) \\
\end{array}
\right)
\end{equation}
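
For polynomial features of a scalar input, the feature matrix is a Vandermonde matrix. The sketch below uses illustrative inputs (not the course data) and compares an explicit column loop with NumPy's `np.vander`, which needs `increasing=True` to put the constant column first.

```python
import numpy as np

x = np.linspace(0, 1, 5)      # illustrative inputs
M = 4                         # number of features (cubic polynomial)

# explicit loop over columns: Φ[n, j] = (x^(n))^j
Φ_loop = np.zeros([len(x), M])
for j in range(M):
    Φ_loop[:, j] = x**j

# equivalent construction as a Vandermonde matrix with increasing powers
Φ_vander = np.vander(x, M, increasing=True)

print(Φ_vander.shape)                    # (5, 4): N rows, M columns
print(np.allclose(Φ_loop, Φ_vander))     # True
```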

which, when minimizing the squared error costs across the entire dataset:

\begin{equation}
\boxed{
\mathcal{C} = \frac{1}{2N} \sum_{n=1}^N  \lvert \lvert F(\vec{x}^{(n)},\vec{w}) - y^{(n)} \rvert \rvert^2
}
\end{equation}

yields the optimal parameters:

\begin{equation}
\mathbf{W}^\ast = \left(\mathbf{\Phi}^{\sf T} \mathbf{\Phi}\right)^{-1} \mathbf{\Phi}^{\sf T} \mathbf{y},
\end{equation}

where 
\begin{equation}
\mathbf{y} = \left(
\begin{array}{c}
y^{(1)} \\
\vdots \\
y^{(N)}
\end{array}
\right)
\end{equation}

is the vector of targets (corresponding to each sample in the dataset $\mathcal{D}$).
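
The optimum follows from setting the gradient of the cost to zero: $\nabla_{\vec{w}}\, \mathcal{C} = \frac{1}{N} \mathbf{\Phi}^{\sf T} \left(\mathbf{\Phi} \vec{w} - \mathbf{y}\right) = 0$, which gives the normal equations $\mathbf{\Phi}^{\sf T} \mathbf{\Phi}\, \vec{w} = \mathbf{\Phi}^{\sf T} \mathbf{y}$. A small numerical check of this statement, using synthetic stand-in data (an assumption, not the course dataset) and a hypothetical helper `grad_C`:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2*np.pi*x) + rng.normal(scale=0.15, size=x.size)   # synthetic stand-in data

# cubic polynomial features: columns 1, x, x^2, x^3
Φ = np.vander(x, 4, increasing=True)

def grad_C(w):
    """Gradient of the squared error cost: Φ^T (Φ w - y) / N."""
    return Φ.T @ (Φ @ w - y) / len(y)

# solve the normal equations Φ^T Φ w = Φ^T y
w_opt = np.linalg.solve(Φ.T @ Φ, Φ.T @ y)

print(np.linalg.norm(grad_C(w_opt)))        # numerically zero at the optimum
print(np.linalg.norm(grad_C(np.zeros(4))))  # non-zero at an arbitrary w
```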


## Example

Load the data from disk: `../data/poly_regression.dat`.


<!--
x = np.linspace(0,1,10)
header = f"{'x':>13s}\t{'y':>15s}"
data_out = np.column_stack([x,np.sin(2*np.pi*x)+np.random.normal(loc=0,scale=0.15,size=x.size)])
np.savetxt('../data/poly_regression.dat', data_out,fmt='% 15.8e', header=header, delimiter='\t')
-->

In [None]:
!head ../data/poly_regression.dat

### Plot the input data

In [None]:
x,y = np.loadtxt('../data/poly_regression.dat',unpack=True)
plt.plot(x,y, 'o')
plt.xlabel('x')
plt.ylabel('y')

### What should our model look like? 

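We can guess the required polynomial order from the number of zero crossings: a polynomial of order $p$ has at most $p$ real roots, so the number of crossings sets a lower bound on the order.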

In [None]:
poly_order = 3
Φ = np.zeros([len(x),poly_order+1])
for j in range(Φ.shape[1]):
    Φ[:,j] = x**j

In [None]:
# normal equations: W* = (Φ^T Φ)^{-1} Φ^T y
W_opt = np.linalg.inv(Φ.T @ Φ) @ Φ.T @ y
# cost at the optimum: C = (1/2N) Σ_n (Φ W* - y)_n^2
C_opt = 0.5*np.average((Φ @ W_opt - y)**2)

print(f'W_opt = {W_opt}')
print(f'C_opt = {C_opt}')
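
Forming the explicit inverse of $\mathbf{\Phi}^{\sf T} \mathbf{\Phi}$ works for this small problem, but it can lose accuracy when the feature matrix is poorly conditioned (for example at high polynomial order). One commonly used alternative is `np.linalg.lstsq`, which solves the same least-squares problem directly from $\mathbf{\Phi}$. The sketch below runs on synthetic stand-in data (an assumption, not the file loaded above) and checks that the two approaches agree.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 10)
y = np.sin(2*np.pi*x) + rng.normal(scale=0.15, size=x.size)   # synthetic stand-in data
Φ = np.vander(x, 4, increasing=True)                           # cubic polynomial features

# explicit normal-equation solution, as in the cell above
W_inv = np.linalg.inv(Φ.T @ Φ) @ Φ.T @ y

# least-squares solver; it also reports the rank and singular values of Φ
W_lstsq, residuals, rank, sv = np.linalg.lstsq(Φ, y, rcond=None)

print(np.allclose(W_inv, W_lstsq))   # the two solutions should agree here
print(rank)                          # number of linearly independent features
```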

Again, we can compare this with the result of NumPy's `np.polyfit` function:

In [None]:
np.polyfit(x,y,poly_order)

<div class="span alert alert-warning">
Remember: <tt>np.polyfit()</tt> returns the coefficients in reverse order (highest power first).
</div>
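
A quick way to make the comparison explicit is to reverse the `np.polyfit` output before comparing it with the weights. The sketch below uses synthetic stand-in data (an assumption) so that it runs on its own.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 10)
y = np.sin(2*np.pi*x) + rng.normal(scale=0.15, size=x.size)   # synthetic stand-in data

poly_order = 3
Φ = np.vander(x, poly_order + 1, increasing=True)   # columns 1, x, x^2, x^3
W_opt, *_ = np.linalg.lstsq(Φ, y, rcond=None)       # weights ordered w_0, ..., w_3

p = np.polyfit(x, y, poly_order)                    # coefficients ordered x^3, ..., x^0

print(np.allclose(p[::-1], W_opt))                  # same fit once the order is reversed
print(np.allclose(np.polyval(p, x), Φ @ W_opt))     # identical predictions
```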


### Plot the data with the optimized fit

We can now evaluate our model at any point inside the fitting domain (*interpolation*).

In [None]:
plt.plot(x,y, 'o', label='data')

x_fit = np.linspace(np.min(x),np.max(x),100)
Φ_fit = np.zeros([len(x_fit),poly_order+1])

# we use the fact that x^0 = 1
for j in range(Φ_fit.shape[1]):
    Φ_fit[:,j] = x_fit**j

plt.plot(x_fit,Φ_fit @ W_opt,'-', color=colors[0], label='fit')
plt.xlabel('x')
plt.ylabel('y')
plt.legend();