# 3 Linear Regression
- we want to determine each feature's contribution to the target (if there is one)
- we also need to determine the accuracy of our predictions
- and we want to check whether there is an interaction effect between the features

## 3.1 Simple Linear Regression

- predicting a quantitative response $Y$ on the basis of a single predictor variable $X$
- we assume a linear relationship between predictor and target: $Y ≈ β_0 + β_1 X$
    - where $β_0$ and $β_1$ are two unknown constants that represent the *intercept* and *slope* terms; together they are known as the model's *coefficients* or *parameters*
    
- we want to produce estimates $\hat{\beta}_0$ and $\hat{\beta}_1$, so that $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$
    - where $\hat{y}$ indicates a prediction of $Y$ on the basis of $X = x$

### 3.1.1 Estimating the Coefficients

- the real $β_0$ and $β_1$ are unknown
- we want to find an intercept $\hat{\beta}_0$ and a slope $\hat{\beta}_1$ such that the resulting line is as close as possible to the $n$ data points

#### Cost function 
- most common way to measure that closeness is **least squares**, which relates to the MSE

- let $\hat{y_i} = \hat{\beta}_0 + \hat{\beta}_1 x_i$ be the prediction for $Y$ based on the ith value of $X$:
    - then $e_i = y_i − \hat{y_i}$ represents the ith *residual* (difference between `y` and `y_pred` at the ith observation)
    - and the **residual sum of squares** (RSS) is the sum of all the squared residuals: $RSS = e_1^2 + e_2^2 + \cdots + e_n^2$

        - $RSS = (y_1 - \hat{\beta}_0 - \hat{\beta}_1 x_1)^2 + (y_2 - \hat{\beta}_0 - \hat{\beta}_1 x_2)^2 + \cdots + (y_n - \hat{\beta}_0 - \hat{\beta}_1 x_n)^2$

        - we square the residuals for three reasons: positive and negative residuals cannot cancel each other out, larger errors are penalized more heavily, and the result is a smooth, differentiable function, so we can use calculus (taking derivatives and setting them to zero) to find the minimum of the RSS efficiently

            - side note: in contrast, the *Mean Absolute Error* (MAE) is continuous but not differentiable where a residual equals zero, so there is no closed-form solution analogous to the normal equations; minimizing it produces a median-based regression line rather than a mean-based one
    
    - the MSE (**Mean Squared Error**) is built on the RSS (**Residual Sum of Squares**); both summarize the error over all (`y_true`, `y_pred`) pairs

        - the RSS is the raw sum of squared residuals, which measures the total error across all data points: <br>

            $RSS = \sum_{i=1}^n (y_i - \hat{y}_i)^2$ <br>
            (loss function)

        - the MSE normalizes the RSS by dividing by the number of data points ($n$); it represents the average squared error per data point: <br>

            $MSE = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2$ <br>
            (cost function)
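
The residual, RSS and MSE definitions above can be sketched in a few lines of NumPy. The data and the candidate coefficients `b0` and `b1` are made-up values for illustration, not taken from any real data set.

```python
import numpy as np

# Hypothetical toy data (not from the source): five observations.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_true = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Hypothetical candidate coefficients for the line y_hat = b0 + b1 * x.
b0, b1 = 0.0, 2.0

y_pred = b0 + b1 * x            # predictions y_hat_i
residuals = y_true - y_pred     # e_i = y_i - y_hat_i
rss = np.sum(residuals ** 2)    # residual sum of squares
mse = rss / len(y_true)         # RSS averaged over the n observations

print(f"RSS = {rss:.4f}, MSE = {mse:.4f}")
```

Because the MSE is just the RSS divided by the constant $n$, both are minimized by the same coefficients.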

#### Optimization method: normal equation
- the **normal equations** provide a closed-form solution to the least-squares problem, unlike iterative optimization algorithms such as gradient descent

    - there is one normal equation per coefficient, and a matrix representation collects them all: <br>
    $X^T X \hat{\beta} = X^T y$

    - the normal equations are derived from the RSS

- to find the values of $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize the loss function (RSS), we use a closed-form solution that gives us an exact estimate without iteration, through calculus and solving the resulting system of equations
    -  the loss function and the optimization are so closely related here because the goal of optimization is to minimize the loss function

- the RSS can also be written as $\text{RSS}(\beta) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \| y - X\beta \|^2$

    - the last term denotes the squared Euclidean norm and can also be written in matrix form: $\| y - X\beta \|^2 = (y−Xβ)^T(y−Xβ)$

- from the RSS formula, we can derive the following by taking partial derivatives with respect to the coefficients $\hat{\beta}_0$ and $\hat{\beta}_1$ and setting them equal to zero (treating the observed $x_i$ and $y_i$ as fixed): <br>

    $\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}$

    $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$

    - where $\bar{y} \equiv \frac{1}{n} \sum_{i=1}^{n} y_i$ and $\bar{x} \equiv \frac{1}{n} \sum_{i=1}^{n} x_i$ are the sample means
        - the symbol $\equiv$ (three horizontal bars) means "is defined as": it indicates that two expressions are equal by definition
        - the summation in the numerator of the first formula applies to the entire product

    - this is the (non-iterative, closed form) OLS optimisation for simple linear regression
        - this would be the more general form for multiple linear regression (see section below): 
        $$
        y \approx X \hat{\beta}
        $$
        
        $$
        X^T y = X^T X \hat{\beta}
        $$

        $$
        \hat{\beta}  = (X^T X)^{-1} X^T y
        $$

    - intuition on how this works:

        - the derivative of a function represents its rate of change

        - setting the derivative of the RSS with respect to $\beta$ equal to $0$ corresponds to finding the stationary points, which may correspond to the minimum (or maximum)

        - the RSS is a convex function of the coefficients (a bowl-shaped paraboloid), so a stationary point is always a global minimum, and there is only one when $X^TX$ is invertible
        
    - YouTube playlist that [visualises OLS with normal equation](https://www.youtube.com/watch?v=3g-e2aiRfbU&list=PLjgDp12yUmpw7lsyCKzh11ppUFJfzOjfY) and the above formulas
        - the videos show a linear algebra take on the normal equations; the calculus view and the linear algebra view seem to be essentially two perspectives on the same process

- optimizing via the normal equation is based on the assumption that $X^TX$ is invertible, meaning that $X$ needs to have full column rank, i.e. no perfect multicollinearity among the features
    - in this case the normal equation yields an exact, unique solution

    - if there is perfect multicollinearity between features ($X$ is not full rank), $X^TX$ is singular: the normal equations then have infinitely many solutions instead of a unique one, $(X^TX)^{-1}$ does not exist, and we need other techniques (e.g. the pseudoinverse, dropping redundant features, or regularization)

- we can find a best-fit line using OLS, even though we know we have an irreducible error, because it gives us the best approximation to the true relationship between the predictors $x$ and the response $y$, by minimizing the total squared error (RSS) across all data points
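
As a rough check of the formulas in this section, the sketch below fits the same simulated data three ways: with the closed-form expressions for $\hat{\beta}_0$ and $\hat{\beta}_1$, by solving the normal equations $X^T X \hat{\beta} = X^T y$, and with NumPy's least-squares solver. The function names, the simulated data and the "true" values 1.0 and 2.0 are assumptions made for illustration.

```python
import numpy as np

def fit_simple_ols(x, y):
    """Closed-form least-squares estimates for the line y_hat = b0 + b1 * x."""
    x_bar, y_bar = x.mean(), y.mean()
    b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    b0 = y_bar - b1 * x_bar
    return b0, b1

def fit_normal_equations(X, y):
    """Solve X^T X beta = X^T y; X must already contain a column of ones."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Simulated data (assumed for illustration): true intercept 1.0, true slope 2.0.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 1.0 + 2.0 * x + rng.normal(0, 1, size=50)

b0, b1 = fit_simple_ols(x, y)

X = np.column_stack([np.ones_like(x), x])   # design matrix with intercept column
beta_ne = fit_normal_equations(X, y)

# lstsq solves the same least-squares problem without forming (X^T X)^{-1}.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("simple formulas :", b0, b1)
print("normal equations:", beta_ne)
print("np.linalg.lstsq :", beta_lstsq)

# A perfectly collinear extra column makes X^T X singular (rank 2 instead of 3).
X_bad = np.column_stack([X, 2.0 * x])
print("rank of X_bad   :", np.linalg.matrix_rank(X_bad))
```

All three fits should agree up to floating-point error. The last lines illustrate the full-rank assumption: once a column is an exact multiple of another, the design matrix loses rank and $X^TX$ can no longer be inverted.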

#### Supplementary Insights: loss function, objective function, cost function

- **loss function**: applies to an individual observation
    - RSS however is an aggregate loss function (a sum is used as a summary statistic); but it is simply a summed version of the individual squared losses and thus can serve as a loss function here
    - in most supervised learning models in scikit-learn, the loss function is aggregated across all data points because these models aim to minimize an overall prediction error (aggregated loss)

- **objective function**: aggregates the loss over all observations and might include additional terms (e.g., regularization)

    - **cost function**: specific type of objective function that represents the aggregated loss over the entire dataset, but typically does not include other terms such as regularization
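
The distinction between the three terms can be made concrete with a small sketch. The function names, the toy values and the choice of an L2 penalty as the "additional term" are assumptions for illustration; libraries differ in how they scale the loss and the penalty.

```python
import numpy as np

def squared_loss(y_true, y_pred):
    """Loss: one value per observation."""
    return (y_true - y_pred) ** 2

def mse_cost(y_true, y_pred):
    """Cost: the per-observation losses averaged over the dataset."""
    return np.mean(squared_loss(y_true, y_pred))

def ridge_style_objective(y_true, y_pred, coefs, alpha):
    """Objective: cost plus a regularization term (here an L2 penalty)."""
    return mse_cost(y_true, y_pred) + alpha * np.sum(coefs ** 2)

# Hypothetical values for illustration only.
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.0, 8.0])
coefs = np.array([2.0])

losses = squared_loss(y_true, y_pred)
cost = mse_cost(y_true, y_pred)
objective = ridge_style_objective(y_true, y_pred, coefs, alpha=0.1)

print(losses)      # three individual losses
print(cost)        # one aggregated number
print(objective)   # cost plus penalty
```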

#### `LinearRegression` in scikit-learn

- in scikit-learn a `LinearRegression()` model always uses least squares (MSE) as the loss function and does not provide an option to directly choose another loss function like MAE or Huber loss
    - for Huber loss, we could use `HuberRegressor()` and for a quantile loss the `QuantileRegressor()`
    - a way to use MAE as a loss in scikit-learn would be to use `QuantileRegressor(quantile=0.5)`; its `alpha` parameter adds an L1 penalty by default, so set `alpha=0` for a plain MAE fit

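- the optimisation for a `LinearRegression()` model does not switch methods depending on dataset size: it solves the OLS problem directly with a least-squares solver (`scipy.linalg.lstsq` for dense data, `scipy.sparse.linalg.lsqr` for sparse data) rather than with gradient descent; for an iterative, gradient-based fit there is `SGDRegressor()`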

### 3.1.2 Assessing the Accuracy of the Coefficient Estimates

- the model is now trained, but before we can use it, we need to quantify the uncertainty of the estimated parameters ($\hat{\beta}_0$ and $\hat{\beta}_1$) using **standard error**, **p-values**, and **hypothesis testing**

    - this is about assessing the precision and significance of the coefficients before interpreting them

- our estimates for $\hat{\beta}_0$ and $\hat{\beta}_1$ will most likely not be exactly equal to ${\beta}_0$ and ${\beta}_1$, but they will not systematically over- or under-estimate the true parameters: if we could average the estimates obtained over a huge number of data sets, then the average of these estimates would converge to the true parameters

    - this property of an estimator is called unbiasedness

- this is comparable to estimating the population mean $\mu$ from a sample mean $\bar{x}$
    - the estimate is then denoted with $\hat{\mu}$

- to measure how accurate an estimate is, we compute the **standard error (SE)**

    - the standard error quantifies the average amount an estimate differs from the actual value
        - that's loosely connected to the variance in a model's predictions, but here we are talking about the standard error in a single estimate

    - smaller standard errors indicate greater confidence in the estimate

    - e.g. the standard error for $\hat{\mu}$ is calculated from: <br>

        $\text{Var}(\hat{\mu}) = SE(\hat{\mu})^2 = \frac{\sigma^2}{n}$

        - where $\sigma$ is the standard deviation of each of the realizations $y_i$ of $Y$, so $\sigma^2$ is the variance of the population distribution (the spread of the actual data points around the mean), and $n$ is the sample size

            - "sample size" $n$ here means the number of observations that are averaged to form $\hat{\mu}$; when the quantity being averaged is a per-fold score from cross validation, $n$ is the number of folds, not the number of samples inside each fold
            
            - in order to guess the population distribution while only the sample distribution is known, we assume a distribution (for instance normal distribution)

        - the standard error is the square root of the variance

        - this deviation shrinks with n: the more observations we have, the smaller the standard error of $\hat{\mu}$ 

        - in general, $\text{Variance}(\text{something}) = \frac{\text{Sum of Squares}}{\text{Number of those things}} = \text{Average Sum of Squares}$
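
The relation $SE(\hat{\mu}) = \sigma / \sqrt{n}$ can be checked with a small simulation. This is a sketch under assumed values: a normal population with mean 5 and standard deviation 2, and $n$ taken as the number of observations that go into each sample mean.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 5.0, 2.0      # assumed population mean and standard deviation
n = 100                   # observations averaged into each estimate
n_repeats = 20_000        # number of independent samples drawn

# Each row is one sample of size n; each row mean is one estimate mu_hat.
samples = rng.normal(mu, sigma, size=(n_repeats, n))
mu_hats = samples.mean(axis=1)

print("average of the estimates :", mu_hats.mean())       # close to mu
print("spread of the estimates  :", mu_hats.std(ddof=1))  # empirical standard error
print("sigma / sqrt(n)          :", sigma / np.sqrt(n))   # theoretical standard error
```

The average of the estimates landing near `mu` illustrates unbiasedness, and the spread of the estimates should land near `sigma / np.sqrt(n)`.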

- similarly, the estimated coefficients $\hat{\beta}_0$ and $\hat{\beta}_1$ follow a distribution as well, because they are random variables derived from the data (which is itself a random sample from a population)

- assuming that our irreducible error $\epsilon$ is normally distributed with mean $0$ and standard deviation $\sigma$, then $\hat{\beta}_0$ and $\hat{\beta}_1$ will follow a normal distribution as well, centered around their true values ${\beta}_0$ and ${\beta}_1$
    - because the estimated coefficients are linear combinations of the responses $y_i$ (and therefore of the errors $\epsilon_i$), and a linear combination of normally distributed variables is itself normal; if the errors are not normal, the central limit theorem still gives approximate normality for large $n$
    - or, in other words: random variability in $\hat{\beta}_0$ and $\hat{\beta}_1$ arises from the error terms (which are included in our estimates, because we cannot separate them), and their influence is captured through a linear transformation

    - the variance of these coefficients depends on the design matrix $X$ (or $x_i$ in simple regression) and the sample size $n$
        - because the design matrix defines how spread out the data is and the more spread out, the more reliably we can estimate the true coefficients
        - and because the more data points $n$ we have, the easier it is to figure out the true line
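
The two claims above (more spread in $x$ and more data points both make the slope estimate more reliable) can be illustrated by refitting the line on many simulated data sets. The helper name, the true values and the three $x$ grids are assumptions chosen for illustration.

```python
import numpy as np

def slope_estimates(x, beta0, beta1, sigma, n_repeats, rng):
    """Refit the simple regression on n_repeats noisy copies of y; return the slopes."""
    x_centered = x - x.mean()
    noise = rng.normal(0.0, sigma, size=(n_repeats, x.size))
    y = beta0 + beta1 * x + noise          # one simulated data set per row
    # Closed-form slope; the mean of y drops out because x_centered sums to zero.
    return (y @ x_centered) / np.sum(x_centered ** 2)

rng = np.random.default_rng(7)
beta0, beta1, sigma = 1.0, 2.0, 1.0        # assumed true values

designs = [
    ("narrow, n=30", np.linspace(4, 6, 30)),    # little spread in x
    ("wide, n=30", np.linspace(0, 10, 30)),     # large spread in x
    ("wide, n=300", np.linspace(0, 10, 300)),   # same spread, ten times the data
]

results = {}
for label, x in designs:
    slopes = slope_estimates(x, beta0, beta1, sigma, n_repeats=5000, rng=rng)
    results[label] = (slopes.mean(), slopes.std(ddof=1))
    print(f"{label:>13}: mean slope = {results[label][0]:.3f}, "
          f"spread = {results[label][1]:.3f}")
```

In each design the slopes should average out near the true value, while their spread shrinks as the $x$ values are spread further apart and as $n$ grows.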

- these are the formulas for the **standard error (SE)** for the estimated coefficients: </br>

    $\text{SE}(\hat{\beta}_0)^2 = \sigma^2 \left[ \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^n (x_i - \bar{x})^2} \right]$ </br>
        - tells us how much we should expect our guess for $\hat{\beta}_0$ to vary from the real value $\beta_0$

    $\text{SE}(\hat{\beta}_1)^2 = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2}$ </br>
        - looks at how the points are spread out in the x-direction: the more spread out the points are, the more we trust our guess for the steepness of the line

    - where $\sigma^2 = \text{Var}(\epsilon)$

    - answering the question, what $\epsilon$ has to do with all of this: since it represents the irreducible error or the noise in the data, $\text{Var}(\epsilon)$ tells us how spread out or variable this error is: if the variance is small, it means the errors are small and the model predictions are generally close to the actual data points; if the variance is large, it means the errors are large and the predictions are less reliable

        - the variance in the data has two components: the variance explained by the model and the variance from the irreducible error ($\epsilon$); a better model can shrink the unexplained part, but not below $\text{Var}(\epsilon)$

- the estimate of $\sigma$ is known as the **residual standard error**, and is given by the formula: $\text{RSE} = \sqrt{\frac{\text{RSS}}{n - 2}}$
    - the RSE is also a way to evaluate a model's accuracy (its lack of fit): it is roughly the average amount by which the observed responses deviate from the fitted line; we divide by $n - 2$ because two coefficients were estimated from the data

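- standard errors can be used to compute **confidence intervals**; an approximate 95% confidence interval for $\beta_1$ is $\hat{\beta}_1 \pm 2 \cdot \text{SE}(\hat{\beta}_1)$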
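
Putting the formulas of this section together, the sketch below computes the RSE, both standard errors and approximate 95% confidence intervals for one simulated data set. The data, the seed and the "estimate ± 2 standard errors" rule of thumb are assumptions of this example; an exact interval would use a quantile of the t-distribution with $n - 2$ degrees of freedom instead of the factor 2.

```python
import numpy as np

# Simulated data (assumed): true intercept 1.0, true slope 2.0, noise sd 1.0.
rng = np.random.default_rng(3)
n = 200
x = rng.uniform(0, 10, size=n)
y = 1.0 + 2.0 * x + rng.normal(0, 1.0, size=n)

# Least-squares estimates (closed form for simple regression).
x_bar, y_bar = x.mean(), y.mean()
sxx = np.sum((x - x_bar) ** 2)
b1 = np.sum((x - x_bar) * (y - y_bar)) / sxx
b0 = y_bar - b1 * x_bar

# Residual standard error: estimate of sigma with n - 2 degrees of freedom.
rss = np.sum((y - (b0 + b1 * x)) ** 2)
rse = np.sqrt(rss / (n - 2))

# Standard errors, with the unknown sigma replaced by the RSE.
se_b0 = rse * np.sqrt(1.0 / n + x_bar ** 2 / sxx)
se_b1 = rse / np.sqrt(sxx)

# Approximate 95% confidence intervals: estimate +/- 2 standard errors.
ci_b0 = (b0 - 2 * se_b0, b0 + 2 * se_b0)
ci_b1 = (b1 - 2 * se_b1, b1 + 2 * se_b1)

print(f"RSE = {rse:.3f}")
print(f"b0 = {b0:.3f}, SE = {se_b0:.3f}, approx. 95% CI = ({ci_b0[0]:.3f}, {ci_b0[1]:.3f})")
print(f"b1 = {b1:.3f}, SE = {se_b1:.3f}, approx. 95% CI = ({ci_b1[0]:.3f}, {ci_b1[1]:.3f})")
```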

### 3.1.3 Assessing the Accuracy of the Model
- the StatQuest [video on Linear Regression](https://www.youtube.com/watch?v=7ArmBVF2dCs) explains very nicely what $R^2$ is and how to determine whether it is statistically significant using the *p-value*
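
For orientation, $R^2$ can be computed from the RSS and the total sum of squares (TSS) as $R^2 = 1 - \text{RSS} / \text{TSS}$. The sketch below uses made-up values and a hypothetical helper name; it covers only the $R^2$ part, not the p-value.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - RSS / TSS, the share of the variance in y explained by the model."""
    rss = np.sum((y_true - y_pred) ** 2)
    tss = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - rss / tss

# Hypothetical values for illustration only.
y_true = np.array([2.0, 4.0, 6.0, 8.0])
y_pred = np.array([2.5, 3.5, 6.5, 7.5])

r2 = r_squared(y_true, y_pred)
print(r2)
```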