# Physics 494/594
## Linear Regression - Finding the Optimal Parameters

In [None]:
# %load ./include/header.py
import numpy as np
import matplotlib.pyplot as plt
import sys
from tqdm import trange,tqdm
sys.path.append('./include')
import ml4s
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
plt.style.use('./include/notebook.mplstyle')
np.set_printoptions(linewidth=120)
ml4s.set_css_style('./include/bootstrap.css')
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']

## Last Time

### [Notebook Link: 06_Linear_Regression_Setup.ipynb](./06_Linear_Regression_Setup.ipynb)

- Cost functions and formulating a machine learning task as an optimization problem
- Understand linear regression 

## Today
- Generalize the linear model from $1$ to $D$ dimensions
- Minimize the cost function to extract the optimal parameter set

### Example from Last Time: Steady-State One-Dimensional Heat Conduction

Fourier's law of heat conduction for a bar of constant cross-sectional area connected between two reservoirs in the steady-state limit gives a simple differential equation for the spatial dependence of the temperature $T$:

\begin{align}
\frac{d^2 T(x)}{d x^2} &= 0 \\
\frac{d T(x)}{dx} &= w \\
T(x) &= w x + b 
\end{align}

Load the experimental data from `../data/rod_temperature.dat` using the convenient `np.loadtxt()` function:

In [None]:
x,T,ΔT = np.loadtxt('../data/rod_temperature.dat', unpack=True)
plt.errorbar(x,T,ΔT, marker='o', linestyle='')
plt.xlabel('x  (m)')
plt.ylabel('T  (°C)');

## Goal

We want to predict a scalar $T$ as a function of a scalar $x$ given a dataset of pairs $\{(x^{(n)},T^{(n)})\}_{n=1}^N$.  Here the $x^{(n)}$ are inputs and the $T^{(n)}$ are targets or observations. From physics, we have a model:

\begin{equation}
F(x) = w x + b
\end{equation}

i.e. $F^{(n)} = w x^{(n)} + b$.

We can think of this as the simplest possible **shallow** neural network: no hidden layer and no non-linearity, i.e. the activation function is the identity, $a(z) = z$.

In [None]:
labels = [[r'$x$'],[r'$F(x) = wx + b$']]
ml4s.draw_network([1,1],weights=[np.array(['w'])],biases=[np.array(['b'])], node_labels=labels, annotate=True)

We want to *learn* the **parameters** (weight $w$ and bias $b$) based on the **prediction** $F$ (here a linear function).  We will do this by minimizing (optimizing) a **loss** function. For a single data point (observation) this is defined to be:

\begin{equation}
\mathcal{L}^{(n)} = \frac{1}{2} \lvert \lvert F^{(n)} - T^{(n)} \rvert \rvert^2
\end{equation}

which quantifies the goodness of fit over our **hypothesis** space (all values of the parameters).  

$F-T$ is the residual, which we want to make as small as possible. We do this by minimizing the **cost** function, defined as the loss function averaged over all training examples (input data):

\begin{equation}
\boxed{
\mathcal{C} = \frac{1}{2N} \sum_{n=1}^N  \lvert \lvert F^{(n)} - T^{(n)} \rvert \rvert^2
}
\end{equation}
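
As a minimal sketch of the boxed formula, the cost for the 1D model can be evaluated in a few lines. The helper name `cost` and the arrays `x_demo`, `T_demo` are made up for illustration (they are not the rod measurements); the data are generated from an exact line so the result is easy to check by hand.

```python
import numpy as np

def cost(w, b, x, T):
    """Squared-error cost C = 1/(2N) * sum_n (F^(n) - T^(n))^2 for F = w*x + b."""
    F = w * x + b                        # model prediction for every data point
    return 0.5 * np.mean((F - T) ** 2)   # average of the per-sample loss

# Hypothetical noise-free data generated from T = 2x + 1
x_demo = np.array([0.0, 1.0, 2.0, 3.0])
T_demo = 2.0 * x_demo + 1.0

C_exact = cost(2.0, 1.0, x_demo, T_demo)   # the generating parameters
C_wrong = cost(1.0, 1.0, x_demo, T_demo)   # a poorer guess for the slope
print(C_exact, C_wrong)
```

With the generating parameters every residual vanishes, so the cost is zero; any other choice of $(w,b)$ gives a strictly larger value.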

### Moving beyond 1 dimension

Although we have looked at a simple 1D example, nothing stops us from generalizing our model to an arbitrary number of dimensions $D$ (think of the number of input neurons).  We don't even need to change our notation if we use matrix-vector multiplication.  Now $F : \mathbb{R}^D \to \mathbb{R}$:

\begin{align}
F &= \sum_j x_j w_j + b \\
&= \vec{x} \cdot \vec{w} + b \\
\end{align}

where $\vec{w}^{\top} = (w_1,\dots, w_{D})$ and $\vec{x} = (x_1,\dots, x_{D})$.   Note that our code `np.dot(x,w)+b` doesn't even need to change!  

Consider the case $D = 10$:

In [None]:
D = 10
N = [D,1]
labels = [[r'$x_{' + f'{i}' + r'}$' for i in range(1,N[0]+1)],[r'$F$']]
ml4s.draw_network(N,node_labels=labels, weights=[np.array([r'$w_{' + f'{i}' + r'}$' for i in range(1,N[0]+1)])], biases=['b'])

Furthermore, we can use the batch processing techniques we already know, where each **row** of $\mathbf{x}$ now corresponds to a training example:

\begin{equation}
\mathbf{x} \cdot \mathbf{w} + b \vec{1} = 
\left(
\begin{array}{ccc}
x_1^{(1)} & \dots & x_D^{(1)} \\
& \vdots & \\
x_1^{(N)} & \dots &x_D^{(N)} \\
\end{array}
\right)
\left(
\begin{array}{c}
w_1\\
\vdots \\
w_D \\
\end{array}
\right)
+ 
b
\left(
\begin{array}{c}
1  \\
\vdots \\
1 
\end{array}
\right)
= 
\left(
\begin{array}{c}
\mathbf{x}^{(1)} \cdot \mathbf{w} + b \\
\mathbf{x}^{(2)} \cdot \mathbf{w} + b \\
\vdots \\
\mathbf{x}^{(N)} \cdot \mathbf{w} + b
\end{array}\right) = 
\left(
\begin{array}{c}
F^{(1)} \\
F^{(2)} \\
\vdots \\
F^{(N)}
\end{array}
\right) = \mathbf{F}
\end{equation}

Here $\vec{1}$ is the $N \times 1$ column vector of ones.
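
The batch expression above can be checked numerically. The following is only a sketch with randomly generated inputs; the sizes and the names `x_batch`, `w_demo`, `b_demo` are assumptions chosen for illustration. It compares a single matrix-vector product against an explicit loop over training examples.

```python
import numpy as np

rng = np.random.default_rng(0)
N_demo, D_demo = 5, 3                          # hypothetical batch size and input dimension
x_batch = rng.normal(size=(N_demo, D_demo))    # each row is one training example
w_demo = rng.normal(size=D_demo)
b_demo = 0.5

# Batch prediction: one matrix-vector product; the scalar b is broadcast,
# which plays the role of b times the vector of ones
F_batch = np.dot(x_batch, w_demo) + b_demo

# The same predictions computed one example at a time
F_loop = np.array([np.dot(x_batch[n], w_demo) + b_demo for n in range(N_demo)])

print(F_batch.shape)
print(np.allclose(F_batch, F_loop))
```

NumPy broadcasting adds `b_demo` to every entry, so the vector $\vec{1}$ never has to be built explicitly.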

We can simplify our notation even further by absorbing the bias into the weights: we tack on a dummy input $x_0$ that always takes the value $1$, so that $w_0$ can be interpreted as the bias.  We introduce new notation ($\mathbf{X}$ and $\mathbf{W}$):

\begin{equation}
\mathbf{X} = 
\left(
\begin{array}{cccc}
1 & x_1^{(1)} & \dots & x_D^{(1)} \\
\vdots & \vdots &  \vdots & \vdots \\
1 & x_1^{(N)} & \dots &x_D^{(N)} \\
\end{array}
\right) 
\quad
\text{ and } 
\quad
\mathbf{W} = 
\left(
\begin{array}{c}
b \\
w_1\\
\vdots \\
w_D \\
\end{array}
\right)
\end{equation}

This allows us to write:

\begin{equation}
\mathbf{X} \cdot \mathbf{W} = \mathbf{F}
\end{equation}

where we now interpret $w_0 = b$.
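
A short sketch of this bookkeeping, assuming random illustrative data (the names `X_aug` and `W_aug` are not from the notebook): prepending a column of ones to the inputs and placing $b$ in front of the weights reproduces the original prediction.

```python
import numpy as np

rng = np.random.default_rng(1)
x_batch = rng.normal(size=(4, 3))   # hypothetical: N = 4 examples, D = 3 inputs
w_demo = rng.normal(size=3)
b_demo = -0.7

# Prepend the dummy input x_0 = 1 to every example and put b in front of w
X_aug = np.column_stack([np.ones(x_batch.shape[0]), x_batch])   # shape (N, D+1)
W_aug = np.concatenate([[b_demo], w_demo])                      # shape (D+1,)

F_aug = X_aug @ W_aug               # X . W
F_ref = x_batch @ w_demo + b_demo   # x . w + b
print(np.allclose(F_aug, F_ref))
```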

In [None]:
N = [D+1,1]
labels = [[r'$x_{' + f'{i}' + r'}$' for i in range(N[0])],[r'$F$']]
labels[0][0] = '1'

ml4s.draw_network(N,node_labels=labels, weights=[np.array([r'$w_{' + f'{i}' + r'}$' for i in range(N[0])])], biases=[' '])

We can then compute the squared error cost across the entire data set:

\begin{align}
C &= \frac{1}{2N} \lvert\lvert \mathbf{F} - \mathbf{T}\rvert\rvert^2 \\
&= \frac{1}{2N} \lvert\lvert \mathbf{X} \cdot \mathbf{W} - \mathbf{T} \rvert\rvert^2
\end{align}

without modifying any of our Python code.

## Solving the Optimization Problem

Recall, we are interested in finding the values of the weights and biases which minimize the cost function.  For the case of linear regression this can be done explicitly via calculus.

\begin{equation}
\frac{\partial C}{\partial w_j} = \frac{1}{N} \sum_{n=1}^{N}\left(F^{(n)} - T^{(n)} \right) x_j^{(n)} = \left \langle \left(F^{(n)} - T^{(n)} \right) x_j^{(n)} \right\rangle
\end{equation}
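
One way to gain confidence in this expression is to compare it with a finite-difference estimate of the derivative. The sketch below assumes the augmented notation (a leading column of ones in $\mathbf{X}$) and uses random illustrative data; the helper names `cost` and `grad` are hypothetical.

```python
import numpy as np

def cost(W, X, T):
    """C = 1/(2N) ||X.W - T||^2"""
    return 0.5 * np.mean((X @ W - T) ** 2)

def grad(W, X, T):
    """dC/dw_j = <(F - T) x_j>, evaluated for all j at once."""
    return X.T @ (X @ W - T) / X.shape[0]

rng = np.random.default_rng(2)
X_demo = np.column_stack([np.ones(6), rng.normal(size=(6, 2))])  # dummy input x_0 = 1
T_demo = rng.normal(size=6)
W_demo = rng.normal(size=3)

# Central finite differences, one parameter at a time
eps = 1e-6
g_fd = np.zeros(3)
for j in range(3):
    dW = np.zeros(3)
    dW[j] = eps
    g_fd[j] = (cost(W_demo + dW, X_demo, T_demo) - cost(W_demo - dW, X_demo, T_demo)) / (2 * eps)

print(np.allclose(grad(W_demo, X_demo, T_demo), g_fd, atol=1e-6))
```

Because the cost is quadratic in the parameters, the central difference agrees with the analytic gradient up to rounding error.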

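Since the cost is a convex quadratic function of the parameters, its minimum occurs where these derivatives vanish for every $j$ (including $j=0$, with $x_0^{(n)} = 1$).  Provided $\mathbf{X}^{\top} \mathbf{X}$ is invertible, this yields a convenient closed-form solution (**you will derive this in the homework**):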

\begin{equation}
\mathbf{W}^\ast = \left(\mathbf{X}^{\top} \mathbf{X}\right)^{-1} \mathbf{X}^{\top} \mathbf{T}
\end{equation}

We can check this for our simple example of the rod temperature above.

In [None]:
X = np.zeros([len(x),2]) # number of samples by 1+1 = 2 columns (one extra for the dummy input)
X[:,0] = 1 # the first column is the dummy input x_0 = 1 for all samples
X[:,1] = x

In [None]:
W_opt = np.dot(np.dot(np.linalg.inv(np.dot(X.transpose(),X)),X.transpose()),T)

# We can also write this more succinctly using the matrix multiplication operator @
W_opt = np.linalg.inv(X.T @ X) @ X.T @ T

C_opt = 0.5*np.average((np.dot(X,W_opt)-T)**2)

print(f'W_opt = {W_opt}')
print(f'C_opt = {C_opt}')
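
Forming the explicit inverse is fine for a $2\times 2$ problem, but it is usually preferable to solve the linear system $\mathbf{X}^{\top}\mathbf{X}\,\mathbf{W} = \mathbf{X}^{\top}\mathbf{T}$ directly, or to call a least-squares routine. The sketch below compares the three approaches on synthetic data (the arrays `x_demo`, `T_demo` are made up for illustration, not the rod measurements).

```python
import numpy as np

rng = np.random.default_rng(3)
x_demo = np.linspace(0.0, 1.0, 20)
# Hypothetical data: a line with slope 3 and intercept 2 plus small noise
T_demo = 3.0 * x_demo + 2.0 + 0.01 * rng.normal(size=x_demo.size)
X_demo = np.column_stack([np.ones_like(x_demo), x_demo])

# 1. Explicit inverse, as in the closed-form expression
W_inv = np.linalg.inv(X_demo.T @ X_demo) @ X_demo.T @ T_demo

# 2. Solve the linear system (X^T X) W = X^T T without forming the inverse
W_solve = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ T_demo)

# 3. Least-squares solver acting on X directly
W_lstsq, *_ = np.linalg.lstsq(X_demo, T_demo, rcond=None)

print(W_inv, W_solve, W_lstsq)
```

All three should agree to within floating-point precision for a well-conditioned problem like this one; they can differ when $\mathbf{X}^{\top}\mathbf{X}$ is nearly singular.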

This can be compared with the result of the `np.polyfit` function:

In [None]:
np.polyfit(x,T,1)

<div class="span alert alert-warning">
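Note that <code>np.polyfit</code> returns the coefficients from the highest power to the lowest, i.e. <code>[w, b]</code>, so the bias is packed at the end rather than at the start!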
</div>
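
To make the ordering concrete, here is a small sketch on an exact line (made-up data with $w = 2$, $b = 1$). Reversing the `np.polyfit` output gives the `[b, w]` ordering used for $\mathbf{W}$ in this notebook, and `np.polynomial.polynomial.polyfit` returns the lowest power first.

```python
import numpy as np

# Hypothetical exact line: w = 2, b = 1
x_demo = np.array([0.0, 1.0, 2.0, 3.0])
T_demo = 2.0 * x_demo + 1.0

p = np.polyfit(x_demo, T_demo, 1)   # highest power first: [w, b]
W_from_polyfit = p[::-1]            # reversed to match W = [b, w]

# The newer polynomial module orders coefficients from lowest power to highest
c = np.polynomial.polynomial.polyfit(x_demo, T_demo, 1)

print(p, W_from_polyfit, c)
```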

We can also compare the optimal cost with a brute-force search that evaluates the cost function on a grid of parameter values.

In [None]:
grid_size = 100 
weights,biases = np.meshgrid(np.linspace(400,1200,grid_size),np.linspace(-1,18,grid_size))
C = np.zeros_like(weights)

for i in range(grid_size):
    for j in range(grid_size):
        F = np.dot(x,weights[i,j]) + biases[i,j]
        C[i,j] = 0.5*np.average((F-T)**2)
print(f'C_min = {np.min(C)}')
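
The double loop can also be replaced by broadcasting. This is a sketch on synthetic data (an exact line with $w = 3$, $b = 2$; the grid ranges and the `_demo`/`_grid` names are assumptions), which also locates the grid point of minimum cost.

```python
import numpy as np

# Hypothetical noise-free data from T = 3x + 2
x_demo = np.linspace(0.0, 1.0, 10)
T_demo = 3.0 * x_demo + 2.0

grid_size_demo = 51
w_grid, b_grid = np.meshgrid(np.linspace(0, 6, grid_size_demo),
                             np.linspace(0, 4, grid_size_demo))

# Add a trailing axis so every (w, b) pair is combined with every data point:
# F_grid has shape (grid_size, grid_size, N)
F_grid = w_grid[..., None] * x_demo + b_grid[..., None]
C_grid = 0.5 * np.mean((F_grid - T_demo) ** 2, axis=-1)

# Index of the grid point with the smallest cost
i, j = np.unravel_index(np.argmin(C_grid), C_grid.shape)
print(w_grid[i, j], b_grid[i, j], C_grid[i, j])
```

Keep in mind that a grid search can only return a point on the grid, so its minimum cost is never below the closed-form optimum and approaches it only as the grid is refined.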

### Show the result on the cost function

In [None]:
plt.contour(weights,biases,C, cmap='Spectral_r', levels=100)
plt.plot(W_opt[1],W_opt[0], 'x', ms=10, color='k')

plt.xlabel('w / (°C/m)')
plt.ylabel('b / °C')
plt.colorbar(label='Cost Function')

## Let's plot the optimal linear regression:

In [None]:
plt.errorbar(x,T,ΔT,marker='o', linestyle='', label='Exp. Data')

x_fit = np.linspace(np.min(x),np.max(x),100)
X_fit = np.zeros([x_fit.shape[0],2])
X_fit[:,0] = 1
X_fit[:,1] = x_fit

# compute the model prediction and plot
F = np.dot(X_fit,W_opt)
plt.plot(x_fit, F, color=colors[0], label='Linear Regression' )

plt.xlabel('x (m)')
plt.ylabel('T (°C)')
plt.legend()

### Let's put this to use!