# Linear Regression - some Theory
* [Other comparative examples](https://www.analyticsvidhya.com/blog/2017/06/a-comprehensive-guide-for-linear-ridge-and-lasso-regression/)

For linear regression, the closed-form solution is known via the so-called normal equation.

Todo:
1. Specify model
2. Specify loss function
3. Write down log-likelihood function
4. Minimize + get normal equation

### Approach 1: Non-probabilistic using loss function
The easiest approach to linear regression is to use the squared loss function,
$$l = \sum_{i=1}^N (y_i - \hat y_i)^2,$$
where $\hat y_i$ is the predicted value and $y_i$ is the true value of sample $i$. This loss is combined with a linear, non-probabilistic model for the prediction. For a feature vector $x_i$ (whose first component is by convention the constant one, which provides the intercept), the prediction in scalar product notation is
$$\hat y_i =  x_i^T \theta,$$
with coefficient vector $\theta$. Rewriting this in matrix notation to account for all samples (the $x_i^T$ are the rows of $X$, the $y_i$ the entries of $Y$), the loss together with the linear model becomes
$$l = (Y - X\theta)^T (Y - X\theta)$$
Now we would like to find $\theta^*$ that _minimizes_ the loss. 
Setting the derivative of the loss function with respect to $\theta$, which is $-2X^T(Y - X\theta)$, to zero gives the normal equation
$$X^T(Y - X\theta) = 0.$$
If $X^T X$ is non-singular, the normal equation has a unique solution. This is the case exactly when the columns of $X$ are linearly independent, because then $X^T X$ is positive definite; having at least as many training examples as features is necessary for this, but not sufficient. The solution is

$$\theta ^* =  \left( X^T X  \right)^{-1} X^T Y$$

Note that this approach is non-probabilistic and thus does not explicitly account for uncertainty (in the form of probability distributions) in the data and coefficients.
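
The closed-form solution above can be sketched in a few lines of NumPy. This is a minimal illustration, not the notebook's actual data pipeline: the synthetic data, the noise level, and the names `theta_true`, `theta_star`, and `gradient_term` are assumptions made for the example. The normal equation is solved with `np.linalg.solve` rather than by forming the inverse explicitly, and the result is cross-checked against `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): N samples, 2 features plus intercept.
N = 200
features = rng.normal(size=(N, 2))
X = np.column_stack([np.ones(N), features])   # first column is the constant one
theta_true = np.array([1.0, 2.0, -3.0])
Y = X @ theta_true + 0.1 * rng.normal(size=N)

# Solve the normal equation X^T X theta = X^T Y without an explicit inverse.
theta_star = np.linalg.solve(X.T @ X, X.T @ Y)

# Cross-check against NumPy's least-squares solver.
theta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

# At the optimum the residual is orthogonal to every column of X.
gradient_term = X.T @ (Y - X @ theta_star)

print("normal equation:", theta_star)
print("lstsq          :", theta_lstsq)
```

Both solvers should agree up to floating-point error, and `gradient_term` should vanish numerically, which is exactly the statement of the normal equation.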
### Approach 2: Probabilistic  +  generative approach using max likelihood
This approach introduces a probability distribution but does not explicitly consider a loss function. The response is modelled via a Normal distribution (Gaussian error) with constant standard deviation $\sigma$:
$$y_i = x_i^T \theta + \epsilon_i, \qquad \epsilon_i \sim \mathcal N (0, \sigma^2).$$
In other words, the conditional distribution $p(y_i \mid x_i, \theta, \sigma^2) = \mathcal N (y_i \mid x_i^T \theta, \sigma^2)$ is a Normal distribution.
The likelihood function is just the joint pdf of __all__ data points; assuming they are i.i.d. leads to the factorization
$$\mathcal L = \prod_{i=1}^N p(y_i \mid x_i, \theta, \sigma^2).$$
As we are aiming to optimize with respect to $\theta$, we can apply a strictly monotonic transformation to the likelihood function, because it leaves the location of the optimum unchanged. The standard procedure is thus to consider the logarithm of the likelihood function:
$$\mathcal L_l = \sum_{i=1}^N \log p(y_i \mid x_i, \theta, \sigma^2).$$
Evaluating this expression for the Normal distribution gives
$$\mathcal L_l =  - \frac{1}{2\sigma^2}\sum_{i=1}^N (y_i - x_i^T \theta)^2  - \frac{N}{2}\log(2\pi\sigma^2) $$
Now, we would like to _maximise_ the likelihood, and thus the log likelihood, with respect to $\theta$. This is equivalent to _minimizing_ its negative. Dropping the terms that do not depend on $\theta$, as well as the positive factor $\frac{1}{2\sigma^2}$, gives the function whose minimizer we are looking for. That is, we want to solve
$$\text{argmin}_\theta\left( Y - X \theta \right )^T \left( Y - X \theta \right ),$$
where we have again rewritten the sum of squares over all training data in matrix notation. But this is exactly the same problem as in approach one, and it gives the same solution (under the same conditions),

$$\theta ^* =  \left( X^T X  \right)^{-1} X^T Y$$

**Remarks**
* Note that this procedure only puts a probability distribution on the response $y$ while treating the remaining ingredients as variables (via the conditional pdf ansatz).
* This already implies that in this context the solution $\theta^*$ just tells us how the pdf is parametrized (and not even completely, as we did not consider the optimum value of $\sigma$).
* In either case, this approach does not really tell us how to predict a specific value $y$ for a given $x$. It just tells us the corresponding distribution of $y$. The fundamental reason is that we did not make any use of a loss function in this approach.
* Pragmatically, and in practice of course, the prediction is made by plugging into the linear model, as in the first approach.
* Note that a constant value for $\sigma$ is called homoscedasticity. This implies that, within the Normal model for the response, only the mean may be a function of the features, not the variance.
* Furthermore, note that this approach does not consider uncertainties in the parameters. This would eventually require a Bayesian approach.
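
As a rough numerical check of the maximum likelihood argument, the sketch below evaluates the Gaussian log likelihood at the normal-equation solution and at randomly perturbed coefficient vectors. The synthetic data and the helper `log_likelihood` are assumptions made for this example. It also computes the maximum likelihood estimate of $\sigma^2$, which the derivation above leaves out; the standard result used here is the mean squared residual (dividing by $N$).

```python
import numpy as np

def log_likelihood(theta, sigma2, X, Y):
    """Gaussian log likelihood of the linear model with constant variance sigma2."""
    residuals = Y - X @ theta
    n = len(Y)
    return -0.5 * residuals @ residuals / sigma2 - 0.5 * n * np.log(2 * np.pi * sigma2)

rng = np.random.default_rng(1)

# Synthetic data (assumed for illustration): intercept plus one feature.
N = 500
X = np.column_stack([np.ones(N), rng.normal(size=N)])
theta_true = np.array([0.5, 2.0])
sigma_true = 0.3
Y = X @ theta_true + sigma_true * rng.normal(size=N)

# Maximizer with respect to theta: the normal-equation solution.
theta_star = np.linalg.solve(X.T @ X, X.T @ Y)

# Maximum likelihood estimate of sigma^2: mean squared residual (divide by N).
sigma2_ml = np.mean((Y - X @ theta_star) ** 2)

best = log_likelihood(theta_star, sigma2_ml, X, Y)

# Any perturbation of theta_star should lower the log likelihood.
perturbed = [
    log_likelihood(theta_star + delta, sigma2_ml, X, Y)
    for delta in 0.05 * rng.normal(size=(20, 2))
]

print("log likelihood at theta_star:", best)
print("largest perturbed value     :", max(perturbed))
print("sigma^2 (ML estimate)       :", sigma2_ml)
```

Because the log likelihood is a concave quadratic in $\theta$ for fixed $\sigma^2$, every perturbed value lies below the value at `theta_star`.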


### Approach 4: Uncorrelated Bayesian Approach
Bayesian linear regression is a straightforward generalization.
#### The model:
$$\begin{aligned}
y_i &= x_i^T \theta + \epsilon_i, \qquad p(y_i \mid x_i, \theta, \sigma^2) = \mathcal N (y_i  \mid x_i^T \theta, \sigma^2)\\
p(\theta) &= \mathcal N (\theta \mid 0, \lambda ^{-1} \mathbb 1 )
\end{aligned}$$

* This model takes the parameters $\lambda$ and $\sigma^2$ as input.
* The prior of $\theta$ is a Normal distribution parametrized by $\lambda$ (its precision).
* This also lends the model its name, as the prior is uncorrelated (i.e., its covariance matrix $\Sigma = \lambda^{-1} \mathbb 1$ is diagonal).
* **Todo: in order to make contact with the non-Bayesian formulations above we need to replace $X \rightarrow X^T$** (in this section the columns of $X$ are the samples, so $XX^T$ plays the role of $X^T X$ above).

#### Learning
* Learning corresponds to determining the posterior probability distribution of $\theta$:
$$\mathcal N \left(\theta ~|~ \left(XX^T + \lambda \sigma^2 \mathbb 1\right)^{-1}XY, \left( \lambda \mathbb 1 + \frac{1}{\sigma^2} XX^T\right)^{-1} \right)$$
In other words, the posterior of $\theta$ follows a Normal distribution with
- **mean**
$$\left(XX^T + \lambda \sigma^2 \mathbb 1\right)^{-1}XY$$
- **covariance matrix**
$$\left( \lambda \mathbb 1 + \frac{1}{\sigma^2} XX^T\right)^{-1}$$
* Note the feature correlation term $XX^T$ that makes the posterior correlated!
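
The learning step can be sketched numerically as follows. This is an illustration under stated assumptions: the data are synthetic, the columns of `X` are the samples (shape `(d, N)`), and the posterior used is the standard Gaussian result with mean $\left(XX^T + \lambda \sigma^2 \mathbb 1\right)^{-1}XY$ and covariance $\left( \lambda \mathbb 1 + \frac{1}{\sigma^2} XX^T\right)^{-1}$. The names `mean_post`, `cov_post`, and `mean_alt` are chosen for this example only.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data (assumed for illustration); columns of X are the samples.
d, N = 3, 50
X = rng.normal(size=(d, N))
theta_true = np.array([1.0, -2.0, 0.5])
sigma2 = 0.25      # noise variance, treated as known
lam = 2.0          # prior precision
Y = X.T @ theta_true + np.sqrt(sigma2) * rng.normal(size=N)

# Posterior covariance: inverse of prior precision plus data precision.
precision = lam * np.eye(d) + (X @ X.T) / sigma2
cov_post = np.linalg.inv(precision)

# Posterior mean, solved as a linear system instead of via an explicit inverse.
mean_post = np.linalg.solve(X @ X.T + lam * sigma2 * np.eye(d), X @ Y)

# Equivalent expression: covariance times X Y / sigma^2.
mean_alt = cov_post @ (X @ Y) / sigma2

print("posterior mean:", mean_post)
print("posterior covariance:\n", cov_post)
```

The off-diagonal entries of `cov_post` are generally non-zero even though the prior covariance is diagonal; they come from the $XX^T$ term.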

#### Inference
* Inference corresponds to predicting a new value $y^*$ for a new $x^*$, where $w$ denotes the posterior mean of $\theta$ from the learning step:
$$p(y^* ~|~ x^*, \sigma^2, \lambda, \mathcal D) = 
\mathcal N \left( y^* ~|~ w^Tx^*, ~  \sigma^2  + x^{*T}\left(\lambda \mathbb1 + \frac{1}{\sigma^2} XX^T\right)^{-1}x^{*}\right)$$
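
A minimal sketch of the predictive distribution, assuming synthetic data with samples stored as columns of `X` and with `w` taken to be the posterior mean of $\theta$. The helpers `posterior` and `predictive` are hypothetical names for this example. The predictive variance is the noise variance $\sigma^2$ plus the quadratic form of the new input with the posterior covariance.

```python
import numpy as np

def posterior(X, Y, sigma2, lam):
    """Posterior mean and covariance of theta; columns of X are samples."""
    d = X.shape[0]
    cov = np.linalg.inv(lam * np.eye(d) + (X @ X.T) / sigma2)
    mean = cov @ (X @ Y) / sigma2
    return mean, cov

def predictive(x_new, mean, cov, sigma2):
    """Mean and variance of the Normal predictive distribution at x_new."""
    pred_mean = mean @ x_new
    pred_var = sigma2 + x_new @ cov @ x_new   # noise + parameter uncertainty
    return pred_mean, pred_var

rng = np.random.default_rng(3)

# Synthetic data (assumed for illustration): first row of X is the intercept.
d, N = 2, 30
X = np.vstack([np.ones(N), rng.normal(size=N)])
theta_true = np.array([1.0, 2.0])
sigma2, lam = 0.25, 1.0
Y = X.T @ theta_true + np.sqrt(sigma2) * rng.normal(size=N)

w, cov = posterior(X, Y, sigma2, lam)

x_near = np.array([1.0, 0.0])    # inside the range of the training inputs
x_far = np.array([1.0, 10.0])    # far outside it

mean_near, var_near = predictive(x_near, w, cov, sigma2)
mean_far, var_far = predictive(x_far, w, cov, sigma2)

print("near:", mean_near, var_near)
print("far :", mean_far, var_far)
```

The predictive variance never drops below $\sigma^2$ and grows for inputs far from the training data, which is the uncertainty information that the point estimate of the earlier approaches does not provide.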

#### Ridge Regression as limiting case
* Learning
    * $\lambda \sigma ^2 \rightarrow \lambda$ (regularization parameter)
    * $\langle p(\theta ~|~ \mathcal D, \lambda, \sigma^2) \rangle =$ estimator of ridge regression
* Prediction
    * $\langle p(y^* ~|~ x^*, \sigma^2, \lambda, \mathcal D)\rangle = w^T x^*$ (= ridge regression prediction)
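
The correspondence with ridge regression can be checked numerically. The sketch below is an illustration on synthetic data; `X_rows` (rows are samples, as in approaches 1 and 2), `X_cols` (columns are samples, as in the Bayesian section), and `alpha` are names assumed for this example. It compares the posterior mean with the usual ridge estimate $\left(X^T X + \alpha \mathbb 1\right)^{-1} X^T Y$ for $\alpha = \lambda \sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic data (assumed for illustration).
N, d = 40, 3
X_rows = rng.normal(size=(N, d))      # rows are samples (approaches 1 and 2)
X_cols = X_rows.T                     # columns are samples (Bayesian section)
Y = X_rows @ np.array([1.0, 0.0, -1.0]) + 0.5 * rng.normal(size=N)

sigma2, lam = 0.25, 4.0
alpha = lam * sigma2                  # ridge regularization parameter

# Posterior mean in the columns-as-samples convention.
mean_post = np.linalg.solve(
    X_cols @ X_cols.T + lam * sigma2 * np.eye(d), X_cols @ Y
)

# Ridge estimate in the rows-as-samples convention.
theta_ridge = np.linalg.solve(X_rows.T @ X_rows + alpha * np.eye(d), X_rows.T @ Y)

# With vanishing regularization, ridge reduces to the normal-equation solution.
theta_ols = np.linalg.solve(X_rows.T @ X_rows, X_rows.T @ Y)
theta_ridge_tiny = np.linalg.solve(
    X_rows.T @ X_rows + 1e-10 * np.eye(d), X_rows.T @ Y
)

print("posterior mean:", mean_post)
print("ridge estimate:", theta_ridge)
print("OLS estimate  :", theta_ols)
```

The first two vectors agree, and the ridge estimate has a smaller norm than the unregularized one, reflecting the shrinkage towards the zero prior mean.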



### Remarks
Approaches: 
- Via max likelihood https://www.quantstart.com/articles/Maximum-Likelihood-Estimation-for-Linear-Regression
- Minimize quadratic error directly

# Get the data

# Linear Regression via the Normal equation - with Tensorflow

# Neural Network with Tensorflow / Keras 

# Using the estimator API
* Within the estimator API, the number of training steps is controlled via the `dataset`,
* more precisely via the `repeat` keyword.

# Sklearn 

# Comparison of results