# Linear models for regression

## Goal of regression
Predict the value of one or more continuous target variables from a set of input variables.

## What are linear regression models?
The simplest linear regression model is a linear function of the input variables. However, much better results can be obtained by using **linear combinations of a fixed set of nonlinear functions of the inputs, known as basis functions**. Even though the basis functions may be nonlinear w.r.t. the input variables, **the model is still a linear function of the parameters**, which is why it is called a linear model.



## How does regression make predictions?
Given a training data set of observations together with their target values, the goal is to predict the target value for a new observation.

Modeling the predictive distribution p(target|input) expresses the uncertainty in each prediction and makes it possible to choose predictions that minimize the expected loss.

______________

## Basis Functions

### Linear regression

$$ y(x, w) = w_0 + w_1x_1 + ... + w_Dx_D $$

The key property of linear regression is that it is not only a linear function of the parameters, but also of the input variables.



### Other basis functions

More generally, linear models use linear combinations of fixed nonlinear functions (basis functions) of the input variables, which take the form:

$$ y(x, w) = w_0 + \sum_{j=1}^{M-1}w_j\phi_j(x) $$

The parameter $w_{0}$ is called the bias parameter and allows for any fixed offset in the data.

### Polynomial regression
$$ \phi_j(x) = x^j $$

A limitation of polynomial basis functions is that they are global functions of the input variable: a change in one region of input space affects all other regions.

This can be addressed by dividing the input space into regions and fitting a different polynomial in each region, which leads to spline functions.

### Gaussian basis functions
$$ \phi_j(x) = \exp \left\{ -\frac{(x - \mu_j)^2}{2s^2} \right\} $$

$\mu_j$ governs the locations of the basis functions in input space, and the parameter $s$ governs their spatial scale. Although these functions are called 'Gaussian', they do not need to have a probabilistic interpretation.

### Sigmoidal basis function

$$ \phi_j(x) = \sigma \left( \frac{x - \mu_j}{s} \right) $$

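where the sigmoid function is defined by $\sigma(a) = \frac{1}{1 + \exp(-a)}$. Equivalently, we can use the tanh function, because $\tanh(a) = 2\sigma(2a) - 1$, so a linear combination of tanh functions can represent the same functions as a linear combination of logistic sigmoids.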
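The three families of basis functions above can be sketched in NumPy as follows. This is a minimal illustration for scalar inputs, not a reference implementation: the function names (`polynomial_basis`, `gaussian_basis`, `sigmoidal_basis`, `add_bias`) are made up for this example, and the Gaussian basis uses the usual negative exponent so that each function peaks at its centre.

```python
import numpy as np

def polynomial_basis(x, degree):
    """Columns are x**0, x**1, ..., x**degree (the x**0 column acts as the bias)."""
    x = np.asarray(x, dtype=float)
    return np.vander(x, degree + 1, increasing=True)

def gaussian_basis(x, centers, s):
    """phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)), one column per centre mu_j."""
    x = np.asarray(x, dtype=float)[:, None]
    centers = np.asarray(centers, dtype=float)[None, :]
    return np.exp(-((x - centers) ** 2) / (2.0 * s ** 2))

def sigmoid(a):
    """Logistic sigmoid 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def sigmoidal_basis(x, centers, s):
    """phi_j(x) = sigmoid((x - mu_j) / s), one column per centre mu_j."""
    x = np.asarray(x, dtype=float)[:, None]
    centers = np.asarray(centers, dtype=float)[None, :]
    return sigmoid((x - centers) / s)

def add_bias(Phi):
    """Prepend a column of ones so that w_0 plays the role of the bias."""
    return np.hstack([np.ones((Phi.shape[0], 1)), Phi])

# Example: design matrix with a bias column and three Gaussian basis functions.
x = np.linspace(0.0, 1.0, 5)
Phi = add_bias(gaussian_basis(x, centers=[0.0, 0.5, 1.0], s=0.2))
```

Each row of `Phi` holds the basis function values for one input, so the model prediction for all inputs is `Phi @ w`.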

______________

## Maximum likelihood and least squares

In essence, **minimizing the sum-of-squares error function is equivalent to maximizing the likelihood under the assumption that the noise follows a Gaussian distribution.**

_Note: the primary goal in supervised learning is to model the relationship between inputs and outputs, rather than to model the probability distribution p(x) of the inputs themselves._

A function can be fit to the data by minimizing a sum-of-squares error function. The steps below show why this choice follows from a Gaussian noise model.

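1. Assume the target variable t is given by a deterministic function y(x, w) with additive Gaussian noise:
$$ t = y(x,w) + \epsilon $$
where $\epsilon$ is a zero-mean Gaussian random variable with precision (inverse variance) $\beta$.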


2. Thus, the conditional distribution of the target can be written as:
$$ p(t|x,w, \beta) = \mathcal{N} \left( t|y(x,w), \beta^{-1} \right) $$
that is, the target values follow a Gaussian density centered on the model prediction y(x, w) with variance $\beta^{-1}$.


3. If the inputs $X = \{x_1, ..., x_N\}$ with corresponding target values $t_1, ..., t_N$ are assumed to be drawn independently from this distribution, then the **likelihood function of the adjustable parameters w and $\beta$** is:
$$p(t|X, w, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n | w^T \phi(x_n), \beta^{-1} )$$
    
    
4. Next, we take the logarithm of the likelihood function because it simplifies computations:
$$\ln p(t|w, \beta) = \sum_{n=1}^N \ln \mathcal{N}(t_n | w^T \phi(x_n), \beta^{-1} ) = \frac{N}{2}\ln \beta - \frac{N}{2}\ln(2\pi) - \beta E_D(w)$$

where $E_D$ is the sum-of-squares error function, defined by:
$$E_D(w) = \frac{1}{2} \sum_{n=1}^N \left(t_n - w^T\phi(x_n)\right)^2$$


5. Finding the maximum likelihood estimate (MLE) of the parameters w and $\beta$ is equivalent to minimizing the negative log-likelihood:
$$-\ln p(t|w, \beta) = -\frac{N}{2}\ln \beta + \frac{N}{2}\ln(2\pi) + \beta E_D(w)$$
The first two terms do not depend on w, so they can be dropped when optimizing with respect to w: maximizing the likelihood over w is then the same as minimizing $E_D(w)$. These terms are additive constants rather than scaling factors, and they are still needed for determining $\beta$.
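As a numerical sanity check of the steps above, the sketch below evaluates the log-likelihood in two ways: from the closed form with $E_D(w)$, and as a direct sum of log Gaussian densities. The data and the two weight vectors `w_a` and `w_b` are arbitrary values invented for the check.

```python
import numpy as np

def sum_of_squares_error(w, Phi, t):
    """E_D(w) = 1/2 * sum_n (t_n - w^T phi(x_n))^2."""
    return 0.5 * np.sum((t - Phi @ w) ** 2)

def log_likelihood(w, beta, Phi, t):
    """Closed form: N/2 ln(beta) - N/2 ln(2 pi) - beta * E_D(w)."""
    n = len(t)
    return (0.5 * n * np.log(beta)
            - 0.5 * n * np.log(2.0 * np.pi)
            - beta * sum_of_squares_error(w, Phi, t))

def log_likelihood_direct(w, beta, Phi, t):
    """Sum of log Gaussian densities with mean Phi @ w and variance 1 / beta."""
    variance = 1.0 / beta
    log_density = (-0.5 * np.log(2.0 * np.pi * variance)
                   - (t - Phi @ w) ** 2 / (2.0 * variance))
    return np.sum(log_density)

# Arbitrary toy data (hypothetical values, only used for the check).
rng = np.random.default_rng(0)
Phi = np.column_stack([np.ones(20), rng.normal(size=20)])
t = rng.normal(size=20)
w_a, w_b, beta = np.array([0.0, 1.0]), np.array([0.5, -0.5]), 2.0

# Both ways of computing the log-likelihood agree.
same_value = np.isclose(log_likelihood(w_a, beta, Phi, t),
                        log_likelihood_direct(w_a, beta, Phi, t))

# Comparing two weight vectors, the constant terms cancel and only beta * E_D remains.
gap_in_log_likelihood = log_likelihood(w_a, beta, Phi, t) - log_likelihood(w_b, beta, Phi, t)
gap_in_error = sum_of_squares_error(w_b, Phi, t) - sum_of_squares_error(w_a, Phi, t)
```

The gap in log-likelihood between two weight vectors equals `beta * gap_in_error`, which is why ranking weight vectors by likelihood and by sum-of-squares error gives the same order.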

Setting the derivatives of the log likelihood with respect to w and $\beta$ to zero gives their maximum likelihood values.
- the bias $w_0$ compensates for the difference between the average of the target values and the weighted sum of the averages of the basis function values.
- the inverse of the noise precision, $1/\beta$, is given by the residual variance of the target values around the regression function.
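For this model the maximum likelihood weights satisfy the normal equations $\Phi^T\Phi\, w_{ML} = \Phi^T t$, where $\Phi$ is the design matrix with entries $\phi_j(x_n)$, and $1/\beta_{ML}$ is the mean squared residual. A minimal sketch, assuming a design matrix that already contains a bias column; the function name `fit_maximum_likelihood` and the synthetic line $t = 1 + 2x$ are invented for illustration:

```python
import numpy as np

def fit_maximum_likelihood(Phi, t):
    """Return (w_ml, beta_ml) for the model t = Phi @ w + Gaussian noise."""
    Phi = np.asarray(Phi, dtype=float)
    t = np.asarray(t, dtype=float)
    # Least-squares solution of Phi @ w = t; lstsq avoids forming the
    # inverse of Phi.T @ Phi explicitly.
    w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    residuals = t - Phi @ w_ml
    # 1 / beta_ML is the mean squared residual around the fitted function.
    beta_ml = 1.0 / np.mean(residuals ** 2)
    return w_ml, beta_ml

# Synthetic data: hypothetical line t = 1 + 2x with noise standard deviation 0.1,
# so the true precision is 1 / 0.1**2 = 100.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
t = 1.0 + 2.0 * x + rng.normal(0.0, 0.1, size=200)
Phi = np.column_stack([np.ones_like(x), x])

w_ml, beta_ml = fit_maximum_likelihood(Phi, t)
```

With enough data, `w_ml` should land close to the generating weights and `beta_ml` close to the true precision, although neither is recovered exactly from a finite sample.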

______________

## Stochastic gradient descent

It is a technique in which data points are considered one at a time and the model parameters are updated after each such presentation.

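When applied to the sum-of-squares error function, it gives:
$$w^{(\tau +1)} = w^{(\tau)} + \eta (t_n - w^{(\tau)T}\phi_n)\phi_n$$

where $\phi_n = \phi(x_n)$ and $\eta$ is the learning rate parameter. The sign is positive because the update moves against the gradient of the error for data point n, which is $-(t_n - w^{(\tau)T}\phi_n)\phi_n$. This is also known as the least-mean-squares (LMS) algorithm.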
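A minimal sketch of this update rule, which steps against the gradient of the per-point error, i.e. $w \leftarrow w + \eta\,(t_n - w^T\phi_n)\,\phi_n$. The function name `lms_fit`, the learning rate, the number of epochs and the noise-free toy data are all assumptions chosen so that the loop converges quickly; real data usually needs a tuned or decaying learning rate.

```python
import numpy as np

def lms_fit(Phi, t, eta=0.05, epochs=50, seed=0):
    """Least-mean-squares: one weight update per data point, in shuffled order."""
    Phi = np.asarray(Phi, dtype=float)
    t = np.asarray(t, dtype=float)
    rng = np.random.default_rng(seed)
    w = np.zeros(Phi.shape[1])
    for _ in range(epochs):
        for n in rng.permutation(len(t)):
            error = t[n] - w @ Phi[n]       # t_n - w^T phi_n
            w = w + eta * error * Phi[n]    # step against the per-point gradient
    return w

# Noise-free toy data from the hypothetical line t = 0.5 - 1.5 x.
x = np.linspace(-1.0, 1.0, 50)
Phi = np.column_stack([np.ones_like(x), x])
t = 0.5 - 1.5 * x

w_sgd = lms_fit(Phi, t)
```

Because the toy targets are noise-free, the iterates settle on the generating weights; with noisy targets and a fixed learning rate they would instead fluctuate around the least-squares solution.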

______________

## Regularized least squares

Adding a regularization term to an error function helps to control overfitting. The regularized form of error function takes the form:
$$E_D(w) + \lambda E_W(w)$$

where $\lambda$ is the regularization coefficient that controls the relative importance of the data-dependent error $E_D(w)$ and the regularization term $E_W(w)$.


### Weight decay

A simple regularizer is the sum of squares of the weight vector elements: $$E_W(w) = \frac{1}{2}w^Tw$$

If we apply it to the sum-of-squares error then the total error function becomes:

$$\frac{1}{2} \sum_{n=1}^N \left(t_n - w^T\phi(x_n)\right)^2 + \frac{\lambda}{2}w^Tw$$


This is called weight decay, because it encourages weight values to decay towards zero. It's an example of a parameter shrinkage method.
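Because the total error stays quadratic in w, its minimizer has the closed form $w = (\lambda I + \Phi^T\Phi)^{-1}\Phi^T t$. A minimal sketch, with the hypothetical name `fit_weight_decay`; note that as written it also penalizes the bias weight, which is often excluded from the penalty in practice.

```python
import numpy as np

def fit_weight_decay(Phi, t, lam):
    """Minimize 1/2 * ||t - Phi w||^2 + lam/2 * w^T w in closed form."""
    Phi = np.asarray(Phi, dtype=float)
    t = np.asarray(t, dtype=float)
    A = lam * np.eye(Phi.shape[1]) + Phi.T @ Phi
    # Solve A w = Phi^T t rather than inverting A explicitly.
    return np.linalg.solve(A, Phi.T @ t)

# Degree-9 polynomial on 10 noisy samples of sin(2 pi x): prone to overfitting.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.2, size=10)
Phi = np.vander(x, 10, increasing=True)

w_weak = fit_weight_decay(Phi, t, lam=1e-3)
w_strong = fit_weight_decay(Phi, t, lam=1.0)

# The larger regularization coefficient gives the smaller weight norm.
print(np.linalg.norm(w_weak), np.linalg.norm(w_strong))
```

Increasing `lam` trades a worse fit to the training points for smaller weights, which is the shrinkage effect described above.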

### Lasso / $L_1$ regularization

$$\frac{1}{2} \sum_{n=1}^N \left(t_n - w^T\phi(x_n)\right)^2 + \lambda \sum_{j=1}^{p}|w_j|$$

The term $\sum_{j=1}^{p}|w_j|$ is the $L_1$ norm of w, which encourages sparsity. If $\lambda$ is sufficiently large, some of the coefficients $w_j$ are driven to exactly zero, so the lasso effectively performs feature selection. This makes it useful for models with many irrelevant features.
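The lasso has no closed-form solution in general, so it is fitted iteratively. The sketch below uses proximal gradient descent (ISTA) with soft-thresholding, which is one common choice and not something these notes prescribe; the names `soft_threshold` and `fit_lasso_ista`, the iteration count and the synthetic data (five features, two of them relevant) are assumptions for illustration.

```python
import numpy as np

def soft_threshold(z, threshold):
    """Shrink each entry towards zero by `threshold`, clipping at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - threshold, 0.0)

def fit_lasso_ista(Phi, t, lam, n_iter=5000):
    """Minimize 1/2 * ||t - Phi w||^2 + lam * sum_j |w_j| by proximal gradient."""
    Phi = np.asarray(Phi, dtype=float)
    t = np.asarray(t, dtype=float)
    # Step size 1 / L, where L is the largest eigenvalue of Phi^T Phi.
    step = 1.0 / np.linalg.norm(Phi, 2) ** 2
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = Phi.T @ (Phi @ w - t)                         # gradient of the squared error
        w = soft_threshold(w - step * grad, step * lam)      # proximal step for the L1 term
    return w

# Synthetic data: only features 0 and 2 influence the target.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(100, 5))
t = 3.0 * Phi[:, 0] - 2.0 * Phi[:, 2] + rng.normal(0.0, 0.1, size=100)

w_lasso = fit_lasso_ista(Phi, t, lam=10.0)
print(w_lasso)
```

The soft-thresholding step is what produces exact zeros: any coordinate whose gradient step stays inside the threshold is set to zero rather than merely made small.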

### Ridge / $L_2$ regularization

$$\frac{1}{2} \sum_{n=1}^N \left(t_n - w^T\phi(x_n)\right)^2 + \lambda \sum_{j=1}^{p}w_j^2$$

Ridge regression modifies the sum-of-squares error by adding an $L_2$ penalty, the sum of squared weights, which encourages smaller weight values. Up to the factor of $\frac{1}{2}$ on $\lambda$, this is the same penalty as weight decay above. Unlike Lasso, Ridge shrinks weights towards zero but does not force them to be exactly zero.

______________

## What to do if there are multiple target variables?

### Univariate Approach

The idea is to train a separate regression model for each target variable. This leads to multiple independent regression problems, but it ignores any relationships between the target variables.

### Multivariate Approach
Extend traditional regression by treating both the inputs and the outputs as matrices, so that instead of predicting a single target variable, the model predicts multiple dependent variables simultaneously.
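A minimal sketch of both approaches, assuming the same basis functions are shared by all targets and plain least squares is used. The targets are stacked into an N x K matrix `T` and the weights into an M x K matrix; all names and the synthetic data are invented for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 30, 2

# Shared design matrix (N x M) with a bias column and two input features.
Phi = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
W_true = np.array([[1.0, -1.0],
                   [2.0, 0.5],
                   [0.0, 3.0]])                      # M x K, hypothetical
T = Phi @ W_true + rng.normal(0.0, 0.05, size=(N, K))  # N x K target matrix

# Multivariate approach: one least-squares solve for all targets at once.
W_joint, *_ = np.linalg.lstsq(Phi, T, rcond=None)

# Univariate approach: a separate solve for each target column.
W_separate = np.column_stack(
    [np.linalg.lstsq(Phi, T[:, k], rcond=None)[0] for k in range(K)]
)

print(np.allclose(W_joint, W_separate))
```

Under these assumptions the least-squares problem decouples across the target columns, so the two fits agree and the joint form mainly saves computation. Exploiting relationships between targets takes a model that actually couples them, which this sketch does not attempt.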


______________

## Bias-Variance Tradeoff 

It describes the balance between underfitting and overfitting and helps explain why models may generalize well or poorly to new data.

The model's expected loss can be decomposed into three components:

$$\text{Total Error} = \text{Bias}^2 + \text{Variance} + \text{Noise}$$

### Bias
- Represents the extent to which the average prediction over all data sets differs from the desired regression function, i.e. how wrong the model is on average.
- High bias means the model makes strong assumptions about the data
- Leads to **underfitting** - oversimplified models.

### Variance 
- Measures the extent to which the solutions for individual data sets vary around their average.
- Shows how much the model’s predictions change when trained on different data samples.
- High variance means the model is too sensitive to training data.
- Leads to **overfitting** - memorizing noise instead of learning patterns.

The goal is to find a sweet spot between them.
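The two quantities defined above can be estimated by simulation when the true function is known. The sketch below is a toy experiment under stated assumptions: the true regression function is taken to be $\sin(2\pi x)$, the inputs are kept fixed across data sets, only the noise is resampled, and the model is a regularized least-squares fit on Gaussian basis functions. The names `gaussian_design` and `bias_variance` and all constants are made up for this example.

```python
import numpy as np

def gaussian_design(x, centers, s):
    """Bias column followed by one Gaussian basis function per centre."""
    Phi = np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2.0 * s ** 2))
    return np.hstack([np.ones((len(x), 1)), Phi])

def bias_variance(lam, n_datasets=100, n_points=25, noise_std=0.3, seed=0):
    """Estimate (bias^2, variance) of a regularized fit over many data sets."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n_points)
    h = np.sin(2.0 * np.pi * x)                     # assumed true regression function
    Phi = gaussian_design(x, np.linspace(0.0, 1.0, 9), 0.1)
    A = lam * np.eye(Phi.shape[1]) + Phi.T @ Phi
    predictions = np.empty((n_datasets, n_points))
    for i in range(n_datasets):
        t = h + rng.normal(0.0, noise_std, size=n_points)   # new noisy data set
        w = np.linalg.solve(A, Phi.T @ t)                   # regularized least squares
        predictions[i] = Phi @ w
    mean_prediction = predictions.mean(axis=0)
    bias_sq = np.mean((mean_prediction - h) ** 2)    # average prediction vs. truth
    variance = np.mean(predictions.var(axis=0))      # spread around the average
    return bias_sq, variance

for lam in (1e-3, 1.0, 100.0):
    bias_sq, variance = bias_variance(lam)
    print(f"lambda={lam:g}  bias^2={bias_sq:.4f}  variance={variance:.4f}")
```

In this setup a large regularization coefficient gives a stiff model with high bias and low variance, and a small one gives the opposite, so the regularization coefficient is the knob that moves along the trade-off.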

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/9/9f/Bias_and_variance_contributing_to_total_error.svg/2560px-Bias_and_variance_contributing_to_total_error.svg.png" width="800">


**If bias is too high:**
- Use a more complex model.
- Add more features.
- Reduce regularization.

**If variance is too high:**
- Use a simpler model.
- Get more training data.
- Apply regularization.

However, the bias-variance trade-off is of limited practical value, because it is based on averages with respect to ensembles of data sets, whereas in practice we have only a single observed data set.

______________