# Introduction to Quant Finance

## Module 2.2: Modelling Techniques

### 2.2.1 Linear regression models

In this module we will go through the Linear Regression models again in more detail, before reviewing ARMA and then enhancing it with ARIMA and GARCH.


### Representing Data

This section covers the types and forms of data, and the $X$ matrix expected by scikit-learn style estimators.

Our data is usually called $X$, specifically a two-dimensional matrix, where the rows correspond to the samples in our dataset (such as a stock's data for one day, a company, or a person), and the columns correspond to features, or variables (such as the closing price, the number of employees or the person's height).

Therefore, value $X_{i, j}$ is the value of the $j$th variable for the $i$th sample.

Normally, we use the term $n$ to describe how many samples we have, and $k$ to describe how many variables. If no other information is given, assume these values. That said, lots of other terms are used as well (for instance, $m$ is usually used if a second set of data is included for the number of samples in that second set).

Therefore, our matrix $X$ has shape $n \times k$.

Next, we have our coefficients, $\beta$. There is one value for each variable, so $\beta$ has shape $k$. Often, it is much more useful for this to be a 2D array of size $k \times 1$. This allows us to compute $X\beta$, which gives an $n \times 1$ result with one value per sample, the same shape as $Y$.
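To make the shapes above concrete, here is a minimal NumPy sketch. The sizes ($n = 5$, $k = 3$), the random data and the coefficient values are all made up for illustration.

```python
import numpy as np

n, k = 5, 3  # hypothetical sizes: 5 samples, 3 variables

rng = np.random.default_rng(0)
X = rng.normal(size=(n, k))              # data matrix, shape (n, k)
beta = np.array([[0.5], [-1.0], [2.0]])  # made-up coefficients, shape (k, 1)

y_col = X @ beta           # column-vector beta gives shape (n, 1)
y_flat = X @ beta.ravel()  # 1D beta of shape (k,) gives shape (n,)

print(X.shape, beta.shape, y_col.shape, y_flat.shape)
```

Both forms hold the same numbers. The $k \times 1$ form keeps everything two-dimensional, which avoids surprises from broadcasting when we later subtract predictions from a column vector $Y$.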


### Question:

If $u$ is the error term associated with each sample, what would its shape be?

### Solution:

$u$ is of shape $n \times 1$, as there is an error term for each sample. A shape of $n$ is also an allowed answer, but note that it is usually a column vector.

### Samples versus Population, and Notation

To date, these modules have been a little slack with terminology. We will address that here before we continue further.

If we remember our equation for our Linear Regression model, it has been presented so far like this:

$Y = \beta X + u$

In this case, we have $X$ as our independent variables in the format above. $\beta$ are the *actual parameters for mapping the linear relationship*. Note that we may not know these values, and we may never know them. (They may not actually exist either, but a linear model assumes they do.) With the matrix shapes from the previous section, the product is written in the order $X\beta$ so that the dimensions line up.

Finally, $u$ is the error of the model, which is simply what is left over once we take the values of $\beta$ and compute $u = Y - \beta X$. These are the errors, but they are the *actual population errors*, which again are quite theoretical, and you may never know what they are. Further, the values of $u$ are not learned; they are the remaining error.

Almost always you are working with sample data. This means that we don't have all $X$ values (some use the notation $x$ for the sample). At best, then, we can estimate the true values for $\beta$, so we use the notation $\hat{\beta}$ to denote that we have an estimate of $\beta$, not the true values.

As we do not know the true values for $\beta$, the errors we get in the model are just estimated too, so we use the notation $\hat{u}$. Then, finally, we come to our $Y$ values being estimated, so we use the notation $\hat{Y}$ to describe them, or $\hat{y}$ to describe the predicted value for a given $y$ value.

So while OLS is aiming to get as close to the true values as possible, what we are really learning is:

$\hat{Y} = \hat{\beta}X$

The estimated errors are what is left over, $\hat{u} = Y - \hat{Y}$, so that $Y = \hat{\beta}X + \hat{u}$.
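The difference between the true and estimated quantities can be seen on simulated data, where (unlike in practice) we choose $\beta$ ourselves. The sketch below is an illustration under assumptions, not the method used elsewhere in these notebooks: the data, the true coefficients and the noise level are made up, there is no intercept term, and `np.linalg.lstsq` stands in for whichever OLS routine you prefer. The product is written in the matrix order $X\hat{\beta}$ so that the shapes line up.

```python
import numpy as np

rng = np.random.default_rng(42)
n, k = 200, 2  # hypothetical sample size and number of variables

# Simulated population relationship: we only know beta_true and u
# because we generate the data ourselves.
X = rng.normal(size=(n, k))
beta_true = np.array([[1.5], [-0.7]])   # made-up true coefficients, shape (k, 1)
u = rng.normal(scale=0.5, size=(n, 1))  # true errors, shape (n, 1)
Y = X @ beta_true + u

# OLS estimate of beta from the sample (no intercept in this sketch).
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

Y_hat = X @ beta_hat  # fitted values
u_hat = Y - Y_hat     # estimated errors (residuals)

print("true beta:     ", beta_true.ravel())
print("estimated beta:", beta_hat.ravel())
print("Y equals Y_hat + u_hat:", np.allclose(Y, Y_hat + u_hat))
```

The estimate lands close to, but not exactly on, the true coefficients, and the residuals `u_hat` differ slightly from the true errors `u` for the same reason.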

Other notation may use $\textbf{b}$ as the learned and estimated values for the true values of $\beta$.

Having separate symbols for effectively the same thing makes the notation confusing. For these notebooks, I won't be too particular about the terminology here. Know, however, that if you ever publish a paper in an academic journal, you'll need to ensure you have all the symbols correctly identified.

At the end of the day, notation is just notation - simply describe what your variables are upfront if you do not know the normal terminology to use.

### Baseline - the mean

A simple baseline is to compute the mean and use that for prediction. This is the best estimator if you have just one variable, i.e. if we have *just* the heights for lots of dogs, and want to make the best guess for the next dog's height, we just use the mean. In the squared-error sense, this is the best estimate you can get: the expected value.


Goodness of fit is measured by the difference between this prediction and the actual values, which gives the errors, or residuals, of the model.

Our model is therefore of the form:

$y = \bar{x} + u$

Where $\bar{x}$ is the mean of our single variable, which serves as the prediction $\hat{y}$ for every sample, and $u$ is the errors, the residuals. The sum of squared error (SSE) is therefore:

$SSE = \sum{u^2}$

The mean is the value that minimises the SSE for a single variable, that is, when one constant is used as the prediction for every sample.
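As a quick numerical check of this claim, the sketch below tries many constant guesses and finds which one gives the smallest SSE. The dog heights are hypothetical values, and `sse` is a helper defined here purely for illustration.

```python
import numpy as np

# Hypothetical dog heights in cm (made-up values).
heights = np.array([45.0, 52.0, 38.0, 60.0, 55.0])


def sse(guess, values):
    """Sum of squared errors when every value is predicted by `guess`."""
    return float(np.sum((values - guess) ** 2))


mean = heights.mean()

# Try a fine grid of constant guesses between the smallest and largest height.
candidates = np.linspace(heights.min(), heights.max(), 2001)
errors = np.array([sse(c, heights) for c in candidates])
best = candidates[errors.argmin()]

print("mean:", mean)
print("best guess on the grid:", best)
print("SSE at the mean:", sse(mean, heights))
print("SSE at the median:", sse(np.median(heights), heights))
```

The best guess on the grid sits next to the mean (up to the grid spacing), and any other constant, such as the median, gives a larger SSE.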

The goal of linear regression is to minimise the SSE when we use multiple independent variables to predict a dependent variable.
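To see how this relates to the baseline, the sketch below compares the SSE of the mean prediction with the SSE of a linear regression fitted by least squares. Everything here is an assumption for illustration: the data is simulated, the coefficients are made up, and a column of ones is added to $X$ so that the model includes an intercept.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100  # hypothetical number of samples

# Simulated data: two independent variables and one dependent variable.
X = rng.normal(size=(n, 2))
y = 3.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=n)

# Baseline: predict the mean of y for every sample.
sse_baseline = np.sum((y - y.mean()) ** 2)

# Linear regression: add a column of ones for the intercept, then fit by OLS.
X1 = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
sse_ols = np.sum((y - X1 @ beta_hat) ** 2)

print("baseline SSE:", sse_baseline)
print("OLS SSE:     ", sse_ols)
```

With an intercept in the model, the fitted regression can never have a larger SSE than the mean baseline on the data it was fitted to, because predicting the mean corresponds to setting every slope to zero.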

### Linear Models

