## 1a) 
We need to estimate the parameters (weights) of the model. With one input feature $x$ and one target $t$, the model has two weights, $w_0$ and $w_1$: $f(x, \mathbf{w}) = w_0 + w_1 x$.

## 1b) 
The optimal point estimate for $\mathbf{w}$ is the one with the smallest sum of squared errors $E(\mathbf{w}) = \frac{1}{2} \sum_i (\mathbf{w}^T \mathbf{x}_i - t_i)^2$. It can be found by setting the gradient of $E$ to zero:

$\nabla E(\mathbf{w}) = \sum_i \mathbf{w}^T \mathbf{x}_i \mathbf{x}_i^T - \sum_i t_i \mathbf{x}_i^T = (0, 0)$

with $\mathbf{x}_i = (1, x_i)^T$ and $\mathbf{w} = (w_0, w_1)^T$. This yields

$\mathbf{w}^T \sum_i \mathbf{x}_i \mathbf{x}_i^T = \sum_i t_i \mathbf{x}_i^T$

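The sums do not depend on $\mathbf{w}$, so they can be computed from the data up front. Writing $a_{j,k}$ for the entries of $\sum_i \mathbf{x}_i \mathbf{x}_i^T$ and $(b_1, b_2) = \sum_i t_i \mathbf{x}_i^T$, the equation becomes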

$(w_0, w_1) \begin{bmatrix}a_{1,1} & a_{1,2}\\a_{2,1} & a_{2,2}\end{bmatrix} = (b_1, b_2)$

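This is a system of two linear equations in the two unknowns $w_0$ and $w_1$. It has a unique solution, the optimal $\mathbf{w}$, whenever the matrix is invertible, which is the case as soon as the $x_i$ are not all equal.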
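
The steps above can be sketched in NumPy. This is a minimal illustration, not part of the exercise: the function name `fit_normal_equations` and the toy data are assumptions made up for the example.

```python
import numpy as np

def fit_normal_equations(x, t):
    """Solve w^T A = b for w, with A = sum_i x_i x_i^T and b = sum_i t_i x_i^T."""
    X = np.column_stack([np.ones_like(x), x])  # row i is x_i^T = (1, x_i)
    A = X.T @ X                                # 2x2 matrix sum_i x_i x_i^T
    b = X.T @ t                                # vector sum_i t_i x_i
    # A is symmetric, so w^T A = b^T is the same system as A w = b.
    return np.linalg.solve(A, b)

# Hypothetical toy data (not from the exercise), generated by t = 1 + 2x
x_demo = np.array([0.0, 1.0, 2.0, 3.0])
t_demo = np.array([1.0, 3.0, 5.0, 7.0])
w_demo = fit_normal_equations(x_demo, t_demo)
print(w_demo)
```

`np.linalg.solve` raises `LinAlgError` if the matrix is singular, which happens when all $x_i$ are equal.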

## 1c)
### Maximum Likelihood Estimation

**Assumptions**:
1. $p(x)$ is the same for all $x$
2. The residuals $(t - f(x, \mathbf{w}))$ are normally distributed
  1. The mean of this distribution is zero
  2. The variance $\sigma^2$ of this distribution is fixed. It does not depend on $\mathbf{w}$ or $x$.
  
$\mathbf{w}^* = \underset{w}{\text{argmax}} \sum_i \log p(t_i | x_i, \mathbf{w})$

with $p(t_i | x_i, \mathbf{w}) = \dfrac{1}{\sqrt{2\pi \sigma^2}} \exp\Big(-\frac{(t_i - f(x_i, \mathbf{w}))^2}{2\sigma^2}\Big)$

$\mathbf{w}^* = \underset{w}{\text{argmax}} \sum_i  \Big[\log \dfrac{1}{\sqrt{2\pi \sigma^2}} - \frac{(t_i - f(x_i, \mathbf{w}))^2}{2\sigma^2}\Big]$

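Since $\sigma^2$ is fixed, the first term and the positive factor in front of the squared residuals are constant in $\mathbf{w}$ and do not change the location of the optimum. Dropping them and flipping the sign, we get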

$\mathbf{w}^* = \underset{w}{\text{argmin}} \sum_i (t_i - f(x_i, \mathbf{w}))^2$

which means we need to minimize the sum of squared errors, the same criterion as in 1b).
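
As a numerical sanity check of this equivalence, the sketch below evaluates the Gaussian log-likelihood at the least-squares weights and at a perturbed weight vector, for two different values of $\sigma^2$. The data, the perturbation, and the helper name `log_likelihood` are assumptions chosen for illustration.

```python
import numpy as np

def log_likelihood(w, x, t, sigma2):
    """Sum of Gaussian log densities of the residuals t_i - (w_0 + w_1 x_i)."""
    residuals = t - (w[0] + w[1] * x)
    n = len(t)
    return -0.5 * n * np.log(2 * np.pi * sigma2) - np.sum(residuals**2) / (2 * sigma2)

# Hypothetical noisy observations of a straight line
x_demo = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
t_demo = np.array([0.9, 3.2, 4.8, 7.1, 9.0])

# Least-squares weights (minimizer of the sum of squared errors)
X = np.column_stack([np.ones_like(x_demo), x_demo])
w_ls, *_ = np.linalg.lstsq(X, t_demo, rcond=None)

# The least-squares weights beat a perturbed weight vector for any fixed sigma^2
for sigma2 in (0.5, 2.0):
    best = log_likelihood(w_ls, x_demo, t_demo, sigma2)
    shifted = log_likelihood(w_ls + np.array([0.1, -0.05]), x_demo, t_demo, sigma2)
    print(sigma2, best > shifted)
```

The value of $\sigma^2$ changes the log-likelihood itself but not which $\mathbf{w}$ maximizes it.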

### Maximum A-Posteriori Estimation

**Assumptions**:
1. $p(x)$ is the same for all $x$
2. The residuals $(t - f(x, \mathbf{w}))$ are normally distributed
  1. The mean of this distribution is zero
  2. The variance $\sigma^2$ of this distribution is fixed. It does not depend on $\mathbf{w}$ or $x$.
3. The weights are normally distributed: $\mathbf{w} \sim \mathcal{N}((0, 0)^T, \mathbb{I}_2)$

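$\mathbf{w}^* = \underset{w}{\text{argmax}} \Big[\sum_i \log p(t_i | x_i, \mathbf{w}) + \log p(\mathbf{w})\Big]$

$= \underset{w}{\text{argmin}} \Big[\frac{1}{2\sigma^2} \sum_i (t_i - f(x_i, \mathbf{w}))^2 - \log p(\mathbf{w})\Big]$

$= \underset{w}{\text{argmin}} \Big[\frac{1}{N} \sum_i (t_i - f(x_i, \mathbf{w}))^2 + \lambda \lVert \mathbf{w} \rVert^2_2\Big]$

with the fixed $\lambda = \sigma^2 / N$, using $-\log p(\mathbf{w}) = \frac{1}{2} \lVert \mathbf{w} \rVert^2_2 + \text{const}$ for the standard normal prior and multiplying the objective by $2\sigma^2 / N$. The prior enters only once, not once per data point. So we need to minimize the mean squared error plus an L2 penalty on the weights.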
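
Under the assumptions listed above (Gaussian noise with variance $\sigma^2$, prior $\mathcal{N}(\mathbf{0}, \mathbb{I}_2)$), the MAP objective is quadratic in $\mathbf{w}$, and setting its gradient to zero gives $(\sum_i \mathbf{x}_i \mathbf{x}_i^T + \sigma^2 \mathbb{I}_2)\, \mathbf{w} = \sum_i t_i \mathbf{x}_i$. A minimal sketch, with a hypothetical function name and made-up data:

```python
import numpy as np

def map_estimate(x, t, sigma2):
    """MAP weights for noise variance sigma2 and prior w ~ N(0, I)."""
    X = np.column_stack([np.ones_like(x), x])  # row i is (1, x_i)
    A = X.T @ X + sigma2 * np.eye(2)           # normal-equation matrix plus prior term
    return np.linalg.solve(A, X.T @ t)

# Hypothetical toy data (not from the exercise), generated by t = 1 + 2x
x_demo = np.array([0.0, 1.0, 2.0, 3.0])
t_demo = np.array([1.0, 3.0, 5.0, 7.0])

w_map = map_estimate(x_demo, t_demo, sigma2=1.0)
w_ls = map_estimate(x_demo, t_demo, sigma2=1e-12)  # almost no prior influence
print(w_map, w_ls)
```

Because the prior covers both weights, the penalty also shrinks the intercept $w_0$. As $\sigma^2 \to 0$ the data dominate and the estimate approaches the least-squares solution.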

### Bayesian Inference

**Assumptions**:
1. $p(x)$ is the same for all $x$
2. The residuals $(t - f(x, \mathbf{w}))$ are normally distributed
  1. The mean of this distribution is zero
  2. The variance $\sigma^2$ of this distribution is fixed. It does not depend on $\mathbf{w}$ or $x$.
3. The weights are normally distributed: $\mathbf{w} \sim \mathcal{N}((0, 0)^T, \mathbb{I}_2)$

With our dataset $D = \{(x_i, t_i)\}_{i=1}^{N}$, we calculate the posterior

$p(\mathbf{w}|D) = \dfrac{p(D|\mathbf{w}) \cdot p(\mathbf{w})}{p(D)} = \dfrac{p(D|\mathbf{w}) \cdot p(\mathbf{w})}{\int p(D|\mathbf{w}) \cdot p(\mathbf{w}) d\mathbf{w}}$

with $p(D|\mathbf{w}) = \prod_i p(t_i|x_i,\mathbf{w})$. 

$p(t|x,\mathbf{w})$ and $p(\mathbf{w})$ are defined as above. 

We can then use this posterior for predicting $t'$ for $x'$:

$p(t'|x', D) = \int p(t'|x', \mathbf{w}) \cdot p(\mathbf{w} | D) d \mathbf{w}$
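
For this particular combination of Gaussian likelihood and Gaussian prior, both integrals have a well-known closed form: the posterior is $\mathcal{N}(\mathbf{m}, \mathbf{S})$ with $\mathbf{S} = (\mathbb{I}_2 + \frac{1}{\sigma^2} \sum_i \mathbf{x}_i \mathbf{x}_i^T)^{-1}$ and $\mathbf{m} = \frac{1}{\sigma^2} \mathbf{S} \sum_i t_i \mathbf{x}_i$, and the predictive distribution is Gaussian with mean $\mathbf{m}^T \mathbf{x}'$ and variance $\sigma^2 + \mathbf{x}'^T \mathbf{S} \mathbf{x}'$. These formulas are the standard conjugate result, not derived in these notes. A minimal sketch, with hypothetical function names and made-up data:

```python
import numpy as np

def posterior(x, t, sigma2):
    """Mean m and covariance S of p(w | D) for prior N(0, I) and noise variance sigma2."""
    X = np.column_stack([np.ones_like(x), x])         # row i is (1, x_i)
    S = np.linalg.inv(np.eye(2) + X.T @ X / sigma2)   # posterior covariance
    m = S @ X.T @ t / sigma2                          # posterior mean
    return m, S

def predict(x_new, m, S, sigma2):
    """Mean and variance of the Gaussian predictive p(t' | x', D)."""
    phi = np.array([1.0, x_new])                      # x' with the constant feature
    return phi @ m, sigma2 + phi @ S @ phi

# Hypothetical noisy observations of a straight line
x_demo = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
t_demo = np.array([0.9, 3.2, 4.8, 7.1, 9.0])

m, S = posterior(x_demo, t_demo, sigma2=0.25)
pred_mean, pred_var = predict(2.5, m, S, sigma2=0.25)
print(m, pred_mean, pred_var)
```

The posterior mean coincides with the MAP estimate, and the predictive variance is always larger than the noise variance because it adds the remaining uncertainty about $\mathbf{w}$.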

## 3a)

In [None]:
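%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the '3d' projection on older matplotlib)

# Observed 3D positions, one row (x, y, z) per time step
x = np.array([
    [2, 0, 1],
    [1.08, 1.68, 2.38],
    [-0.83, 1.82, 2.49],
    [-1.97, 0.28, 2.15],
    [-1.31, -1.51, 2.59],
    [0.57, -1.91, 4.32]
])

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('z')

# Connect the positions in time order and mark each observation
ax.plot(x[:,0], x[:,1], x[:,2], marker='o')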

## 3b)

In [None]:
# Calculate the distance per time interval
deltas = x[1:] - x[:-1]
print('deltas', deltas)

distances = np.linalg.norm(deltas, axis=1)
print('distances', distances)

# The least squares fit is still to be added; for now, plot the distance per interval

plt.plot(distances)
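
The cell above stops before the least squares fit. One way the missing step could look is sketched below, under the assumption that a straight line is fitted to the distance per interval as a function of the interval index; the exercise may intend a different model. The names `positions`, `k`, `slope` and `intercept` are chosen for this example.

```python
import numpy as np

# Same positions as in 3a), repeated so the sketch is self-contained
positions = np.array([
    [2, 0, 1],
    [1.08, 1.68, 2.38],
    [-0.83, 1.82, 2.49],
    [-1.97, 0.28, 2.15],
    [-1.31, -1.51, 2.59],
    [0.57, -1.91, 4.32]
])

# Euclidean distance covered in each of the five time intervals
distances = np.linalg.norm(np.diff(positions, axis=0), axis=1)

# Assumed model: distance = intercept + slope * k, with k the interval index
k = np.arange(len(distances))
slope, intercept = np.polyfit(k, distances, 1)  # degree-1 least squares fit
fitted = intercept + slope * k
print('slope', slope, 'intercept', intercept)
```

`np.diff(positions, axis=0)` gives the same array as `x[1:] - x[:-1]` in the cell above.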

## Questions
1. In the Bayesian approach to linear regression we have 
  
  $p(w|x,t) = \dfrac{p(x, t|w) \cdot p(w)}{p(x, t)}$
  
  However, the slides use 
  
  $p(w|x,t) = \dfrac{p(t|x, w) \cdot p(w)}{p(t|x)}$
  
  Since 
  
  $p(x,t|w) = p(t|x,w) \cdot p(x|w) = p(t|x,w) \cdot p(x)$ 
  
  do the slides implicitly assume that $p(x)$ is the same for all $x$?
    
2. Does MLE / MAP count as "a probabilistic way"?