# 3 Linear Regression
- we want to determine each feature's contribution to the target (if there is one)
- we also need to determine the accuracy of our predictions
- and we want to see whether there is an interaction effect between the features

## 3.1 Simple Linear Regression

- predicting a quantitative response $Y$ on the basis of a single predictor variable $X$
- we assume a linear relationship between predictor and target: $Y ≈ β_0 + β_1 X$
    - where $β_0$ and $β_1$ are two unknown constants that represent the *intercept* and *slope* terms; together they are known as the model's *coefficients* or *parameters*
    
- we want to produce estimates $\hat{\beta}_0$ and $\hat{\beta}_1$, so that $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$
    - where $\hat{y}$ indicates a prediction of $Y$ on the basis of $X = x$

### 3.1.1 Estimating the Coefficients

- the real $β_0$ and $β_1$ are unknown
- we want to find an intercept $\hat{\beta}_0$ and a slope $\hat{\beta}_1$ such that the resulting line is as close as possible to the $n$ data points

#### Cost function 
- most common way to measure that closeness is **least squares**, which relates to the MSE

- let $\hat{y_i} = \hat{\beta}_0 + \hat{\beta}_1 x_i$ be the prediction for $Y$ based on the ith value of $X$:
    - then $e_i = y_i − \hat{y_i}$ represents the ith *residual* (difference between `y` and `y_pred` at the ith observation)
    - and the **residual sum of squares** (RSS) is the sum of all the squared residuals: $RSS = e_1^2 + e_2^2 + \cdots + e_n^2$

        - $RSS = (y_1 - \hat{\beta}_0 - \hat{\beta}_1 x_1)^2 + (y_2 - \hat{\beta}_0 - \hat{\beta}_1 x_2)^2 + \cdots + (y_n - \hat{\beta}_0 - \hat{\beta}_1 x_n)^2$

        - we square the residuals for three reasons: positive and negative residuals cannot cancel each other out, larger errors are penalised more heavily, and the result is a smooth, differentiable function, which allows us to use calculus-based optimization (taking derivatives) to find the minimum of the RSS efficiently

            - side note: in contrast, the *Mean Absolute Error* (MAE) is continuous but not differentiable where a residual is zero, so there is no closed-form solution like the normal equation; minimizing it produces a median-based regression line rather than a mean-based one
    
    - The MSE (**Mean Squared Error**) uses the RSS (**Residual Sum of Squares**) to compute a score over all (`y_true`, `y_pred`) pairs

        - the RSS is the raw sum of squared residuals, which measures the total error across all data points: </br>

            $RSS = \sum_{i=1}^n (y_i - \hat{y}_i)^2$ </br>
            (loss function)

        - the MSE normalizes the RSS by dividing by the number of data points ($n$); it represents the average squared error per data point:</br>
        
            $MSE = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2$ </br>
            (cost function)
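
The two quantities above can be computed in a few lines. This is a minimal sketch with NumPy on made-up toy data; the candidate line (`beta0_hat`, `beta1_hat`) is an arbitrary guess chosen for illustration, not the least squares fit.

```python
import numpy as np

# Hypothetical toy data, made up for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 5.0, 8.0])

# An arbitrary candidate line (not the least squares fit).
beta0_hat, beta1_hat = 0.0, 2.0

y_pred = beta0_hat + beta1_hat * x   # predictions
residuals = y - y_pred               # e_i = y_i - y_hat_i
rss = np.sum(residuals ** 2)         # residual sum of squares (loss)
mse = rss / len(y)                   # RSS averaged over n observations (cost)

print(rss, mse)
```

Only the third point is missed (by 1), so the RSS is 1 and the MSE is a quarter of that.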

#### Optimization method: normal equation
- the **normal equations** provide a closed-form solution to the least squares problem, unlike iterative optimization algorithms such as gradient descent

    - they are derived directly from the RSS which can also be written as $\text{RSS}(\beta) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \| y - X\beta \|^2 = (y−Xβ)^T(y−Xβ)$

        - where the middle term denotes the squared Euclidean norm and can also be written in matrix form, which is the last term

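- there is one normal equation per coefficient, and a matrix representation sums them all up: starting from the system $y = X \hat{\beta}$ (which usually cannot be satisfied exactly), multiplying both sides by $X^T$ gives a system that can be solved for $\hat{\beta}$: </br>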

$$
y = X \hat{\beta}
$$

$$
X^T y = X^T X \hat{\beta}
$$

$$
\hat{\beta}  = (X^T X)^{-1} X^T y
$$

- we can calculate our $\beta$ as follows:

    $\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}$

    which, for centered data ($\bar{x} = \bar{y} = 0$), simplifies to $\frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2} = \frac{x^T y}{x^T x} = (x^T x)^{-1} x^T y$

    $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$

    - where $\bar{y} \equiv \frac{1}{n} \sum_{i=1}^{n} y_i$ and $\bar{x} \equiv \frac{1}{n} \sum_{i=1}^{n} x_i$ are the sample means
        - the three horizontal bars ($\equiv$) are the "equivalent" symbol, used to indicate that two expressions are defined to be equal in a specific context
        - the summation in the numerator of the first formula applies to the entire product
        
    - the first equation calculates the slope by re-centering the points to their respective means, removing the offset (intercept), so that we can calculate how changes in $x$ correspond to changes in $y$; the numerator is (up to a factor of $\frac{1}{n-1}$) the sample [covariance](https://en.wikipedia.org/wiki/Covariance) of $x$ and $y$, and the denominator is (up to the same factor) the sample variance of $x$ (see section on the $R²$ statistic below)

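    - the second equation describes a translation: it shifts the line so that it passes through the point of means $(\bar{x}, \bar{y})$, and it can be evaluated by plugging in the slope from the first one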

- this is a closed form solution to find the values of $\hat{\beta}_0$​ and $\hat{\beta}_1$​ that minimize the loss function (RSS)

    -  the loss function and the optimization are so closely related here because the goal of optimization is to minimize the loss function


- ISL states that the normal equations are derived from the RSS by taking partial derivatives with respect to the coefficients $\hat{\beta}_0$​ and $\hat{\beta}_1$ and setting them equal to zero (while all $x_i$ and $y_i$ are fixed)

    - however, I cannot see where the derivative taking happens, since everything seems to be explainable algebraically ... (*scratching head*)

    - still, I have found some information to build an intuition on how this works (while not being sure if it really applies to OLS):

        - taking derivative of a function represents its rate of change

        - setting the derivative of the RSS with respect to $\beta$ equal to $0$ corresponds to finding the stationary points, which may correspond to the minimum (or maximum)

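        - the RSS is a convex function of the coefficients (a parabola-shaped curve, or a bowl-shaped surface for two coefficients), so there is only one global minimum and the stationary point is that minimum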
    
- YouTube playlist that [visualises OLS with normal equation](https://www.youtube.com/watch?v=3g-e2aiRfbU&list=PLjgDp12yUmpw7lsyCKzh11ppUFJfzOjfY) and the above formulas
    - the videos show a linear algebra take on using the normal equations, and it seems that these are essentially two perspectives on the same process

- optimizing via the normal equation is based on the assumption that $X^TX$ is invertible, meaning $X$ needs to have full column rank, meaning no perfect multicollinearity among features
    - in this case the normal equation yields an exact, unique solution

    - if there is perfect multicollinearity between features ($X$ does not have full column rank), $X^TX$ is singular: the normal equation then has infinitely many solutions instead of a unique one, and we need other techniques (e.g. a pseudo-inverse, regularization, or dropping redundant features)

- we can find a best-fit line using OLS, even though we know we have an irreducible error, because it gives us the best approximation to the true relationship between the predictors $x$ and the response $y$, by minimizing the total squared error (RSS) across all data points
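
To see where the derivatives enter: the partial derivatives of the RSS are $\frac{\partial RSS}{\partial \hat{\beta}_0} = -2 \sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)$ and $\frac{\partial RSS}{\partial \hat{\beta}_1} = -2 \sum_{i=1}^n x_i (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)$, and setting both to zero and solving yields the closed-form formulas. The following is a minimal sketch, assuming NumPy and simulated data (the "true" line $3 + 2x$ is made up), that checks numerically that the closed-form estimates, the matrix form and the vanishing derivatives all agree.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3.0 + 2.0 * x + rng.normal(0, 1, size=50)  # assumed true line plus noise

# Closed-form estimates from the centered sums.
x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

# Same estimates from the matrix form; the column of ones carries the intercept.
X = np.column_stack([np.ones_like(x), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Partial derivatives of the RSS, evaluated at the estimates.
residuals = y - (beta0_hat + beta1_hat * x)
d_rss_d_beta0 = -2 * np.sum(residuals)
d_rss_d_beta1 = -2 * np.sum(x * residuals)

print(beta0_hat, beta1_hat, beta_hat)
print(d_rss_d_beta0, d_rss_d_beta1)  # both zero up to floating point error
```

`np.linalg.solve` is used instead of explicitly inverting $X^TX$; for this small, well-conditioned example both would give the same result.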

#### Supplementary Insights: loss function, objective function, cost function

- **loss function**: applies to an individual observation
    - RSS however is an aggregate loss function (a sum is used as a summary statistic); but it is simply a summed version of the individual squared losses and thus can serve as a loss function here
    - in most supervised learning models in scikit-learn, the loss function is aggregated across all data points because these models aim to minimize an overall prediction error (aggregated loss)

- **objective function**: aggregates the loss over all observations and might include additional terms (e.g., regularization)

    - **cost function**: specific type of objective function that represents the aggregated loss over the entire dataset, but typically does not include other terms like the regularisation

#### `LinearRegression` in scikit-learn

- in scikit-learn a `LinearRegression()` model always uses least squares (MSE) as the loss function and does not provide an option to directly choose another loss function like MAE or Huber loss
    - for Huber loss, we could use `HuberRegressor()` and for a quantile loss the `QuantileRegressor()`
    - a way to use MAE as a loss in scikit-learn would be to use `QuantileRegressor(quantile=0.5)`

- the optimisation for a `LinearRegression()` model is based on OLS, but instead of evaluating the normal equation literally it calls a least squares solver (`scipy.linalg.lstsq` for dense data, `scipy.sparse.linalg.lsqr` for sparse data); this gets around the bottleneck of inverting $X^TX$ (`lsqr` is iterative and approximates the minimum rather than computing it in closed form)
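
Why a least squares solver is preferable to inverting $X^TX$ shows up most clearly when features are perfectly collinear. The sketch below uses NumPy's `np.linalg.lstsq` as a stand-in for such a solver (it is not scikit-learn's own code path), on made-up data where the second feature is exactly twice the first.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=20)
x2 = 2.0 * x1                      # perfectly collinear with x1
X = np.column_stack([np.ones(20), x1, x2])
y = 1.0 + 3.0 * x1 + rng.normal(0, 0.1, size=20)

# X has 3 columns but only rank 2, so X^T X is singular.
rank = np.linalg.matrix_rank(X)

# lstsq works from a decomposition of X and returns the minimum-norm solution.
beta_hat, _, rank_lstsq, _ = np.linalg.lstsq(X, y, rcond=None)
y_pred = X @ beta_hat

print(rank, rank_lstsq, beta_hat)
```

The individual coefficients of `x1` and `x2` are not uniquely determined here, but the fitted values are: they equal those of a regression on `x1` alone.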

### 3.1.2 Assessing the Accuracy of the Coefficient Estimates

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ </br>
                                                                                      
This part is much longer than the book, because I had to learn basic statistics for understanding it. Therefore, here I describe everything very thoroughly.             
                                                                                     
Helpful introductory statistics material:                                                                 
Podcast: [Statistics Made Simple](https://chartable.com/podcasts/statistics-for-the-social-sciences); </br>
[Statistics Foundations](https://www.linkedin.com/learning/statistics-foundations-3-using-data-sets/discover-samples-confidence-intervals-and-hypothesis-testing?u=72605090) Course on LinkedinLearning </br>

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

- the model is now trained, but before we use it, we need to quantify the uncertainty of the estimated parameters ($\hat{\beta}_0$ and $\hat{\beta}_1$): we calculate the **standard error** to get **confidence intervals**, and we do **hypothesis testing** to get **p-values**

    - this is about assessing the precision (confidence we have in our prediction) and significance (from hypothesis testing) of the coefficients before interpreting them

- our estimates for $\hat{\beta}_0$ and $\hat{\beta}_1$ will most likely not be exactly equal to ${\beta}_0$ and ${\beta}_1$, but they should not systematically over- or under-estimate the true parameters: if we could average the estimates obtained over a huge number of data sets, then the average of these estimates should converge to the true parameters

    - this property of an estimator is called unbiasedness

- this is comparable to estimating the population mean $\mu$ from a sample mean $\bar{x}$ from introductory statistics courses:
    - the estimate is then denoted with $\hat{\mu}$
    - our estimate is a statistic (single value calculated from the data)

- to measure how **precise** our estimate is, we compute the **standard error (SE)**

    - the standard error quantifies the average amount by which the estimate varies if we repeatedly sample from the population
        - it's calculated like a standard deviation, but it is the standard deviation of a [sampling distribution](https://en.wikipedia.org/wiki/Sampling_distribution) rather than of the data points in a single sample; the sampling distribution can be approximated empirically from a single sample, for instance by [bootstrapping](https://en.wikipedia.org/wiki/Bootstrapping_(statistics))

    - smaller standard errors indicate greater confidence in the estimate

    - e.g. the standard error for $\hat{\mu}$ would be calculated </br>
    
        $\text{Var}(\hat{\mu}) = SE(\hat{\mu})^2 = \frac{\sigma^2}{n}$

        - where $\sigma$ is the standard deviation of each of the realizations from the distribution, $\sigma^2$ is the variance of the population distribution (measures the spread or variability of the actual data points around the mean) and $n$ is the sample size

            - "sample size" $n$ here means the number of observations in the sample from which the estimate is computed, not how often we have re-sampled the data (e.g. the number of bootstrap rounds or cross-validation folds)
            
            - in order to guess the population distribution while only the sample distribution is known, we assume a distribution (for instance normal distribution)

        - the standard error is the square root of the variance

        - as sample size increases, we have more data to reliably estimate the mean, reducing the variability (and the standard error) of the sample mean $\hat{\mu}$ 

        - in general, $\text{Variance}(\text{something}) = \frac{\text{Sum of Squares}}{\text{Number of those things}} = \text{Average Sum of Squares}$ and the standard deviation is the square root of that number and is in the same unit as the data (or the statistic we are interested in)

- note that $\hat{\mu}$ is an unbiased estimate of $\mu$: it does not systematically over- or under-estimate, and if the sample size were large enough, we could estimate $\mu$ from $\hat{\mu}$ almost perfectly
    - we also wish to have this unbiasedness for the coefficient estimates of our models, but we only have it if the model assumptions hold (in particular, the error has mean zero and is unrelated to the predictors)

- similarly, the estimated coefficients $\hat{\beta}_0$ and $\hat{\beta}_1$ follow a distribution as well, because they are statistics derived from the data (which is a random sample from a population as well)

- our coefficients ($\hat{\beta}$) are point estimates from a larger distribution that is a normal distribution centered around their true values ${\beta}_0$ and ${\beta}_1$ as long as the sample size we have trained on is large enough ([central limit theorem](https://en.wikipedia.org/wiki/Central_limit_theorem) applies)

    - the variance of these coefficients depends on the design matrix $X$ (or $x_i$ in simple regression) and the sample size $n$
        - because the design matrix defines how spread out the data is and the more spread out, the more reliably we can estimate the true coefficients
        - the more data points $n$ we have, the easier it is to figure out the true line, because there is less variance

    - side note: hopefully also our irreducible error $\epsilon$ is normally distributed with mean $0$ and standard deviation $\sigma$ (we assume this); there is a connection between the normal distribution of the coefficients and the irreducible error, because the coefficients are linear combinations of the dependent variables and the errors  (which are included in our estimates, because we cannot separate them)
        - this assumption is critical for calculating the coefficients' variance analytically (see below)

- there are multiple ways to calculate the **standard error (SE)** for an estimate

    - a traditional approach (ISL uses it) is to calculate the standard error analytically, e.g. for the sample mean with the formula: $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$

        - this particular formula gives the standard error of the mean; other statistics have their own analytical formulas

    - but we can also estimate it empirically by repeatedly re-sampling the data and re-computing the estimate (for instance by [bootstrapping](https://en.wikipedia.org/wiki/Bootstrapping_(statistics))); the resulting estimates form a [sampling distribution](https://en.wikipedia.org/wiki/Sampling_distribution), whose standard deviation is the standard error

    - these are the (analytical) formulas for the standard error (SE) for the estimated coefficients (in simple linear regression) that ISL presents us with: </br>

        $\text{SE}(\hat{\beta}_0)^2 = \sigma^2 \left[ \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^n (x_i - \bar{x})^2} \right]$ </br>
            - tells us how much we should expect our guess for $\hat{\beta}_0$ to vary from the real value $\beta_0$ </br>
            - $\text{SE}(\hat{\beta}_0)$ would be the same as $\text{SE}(\hat{\mu})$ if $\bar{x}$ were zero (in which case $\hat{\beta}_0$ would be equal to $\bar{y}$)

        $\text{SE}(\hat{\beta}_1)^2 = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2}$ </br>
            - looks at how the points are spread out in the x-direction: the more spread out the points are, the more we trust our guess for the steepness of the line

        - where $\bar{x}$ represents the mean of the predictors $x_i$​, and ${\sum_{i=1}^n (x_i - \bar{x})^2}$ measures the spread (variance) of the predictors

        - and where $\sigma^2 = \text{Var}(\epsilon)$
            - question: "What's $\epsilon$ gotta do with it? What's $\epsilon$ but a second hand variable?"

                - attempt of answering that question: since it represents the irreducible error or the noise in the data, $\text{Var}(\epsilon)$ tells us how spread out or variable this error is: if the variance is small, it means the errors are small and the model predictions are generally close to the actual data points; if the variance is large, it means the errors are large and the predictions are less reliable

                - the variance in the data has two components: the variance explained by the model and the variance from the irreducible error ($\epsilon$); reducing the overall error variance is the goal (while we can only really reduce the residual error variance)

        - these formulas are based on several assumptions:

            - relationship between $X$ and $y$ is linear

            - normal distribution of $\epsilon$

            - error is not correlated with any (other) feature

            - error's variance is constant across all observations (homoscedasticity)

                - if not all the conditions are met, these formulas are often still a reasonable approximation

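- standard errors can be used to compute **confidence intervals** around our estimate:

    - for instance, the interval from two standard errors below to two standard errors above the estimate ($\hat{\beta}_1 \pm 2 \cdot \text{SE}(\hat{\beta}_1)$) contains the true value of the parameter with approximately 95% probability

- standard errors can also be used for **hypothesis testing** on the coefficients: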

    - one way for hypothesis testing would be randomized sampling as in `sklearn.model_selection.permutation_test_score()`, where the correlating feature (in this case the target) is randomly shuffled

    - ISL suggests another method that involves calculating the standard error analytically and then (assuming normal distribution) we can know which percentile rank a certain value falls onto

    - Null Hypothesis is: $H_0: \beta_1 = 0$ ("There is no relationship between $X$ and $Y$.")

    - Alternative Hypothesis: $H_a: \beta_1 \neq 0$ ("There is some relationship between $X$ and $Y$.")

    - to test the null hypothesis, we need to determine whether $\hat{\beta}_1$ , our estimate for $\beta_1$ , is sufficiently far from zero that we can be confident that $\beta_1$ is non-zero

        - If $\text{SE}(\hat{\beta}_1)$ is small, then even relatively small values of $\hat{\beta}_1$ may provide strong evidence that $\beta_1 \neq 0$, and hence that there is a relationship between $X$ and $Y$.
        - If $\text{SE}(\hat{\beta}_1)$ is large, then $\hat{\beta}_1$ must be large in absolute value in order for us to reject the null hypothesis. 

    - in practice, we use the *t-statistic*, which measures the number of standard deviations that $\hat{\beta}_1$ is away from 0: </br>

        $t = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)}$

        - it's like the z-score, but it is compared against a t-distribution (with $n - 2$ degrees of freedom) instead of the standard normal distribution, because $\sigma$ has to be estimated from the data; this matters most for small sample sizes
        - the more degrees of freedom, the more the t-distribution converges to the standard normal distribution

- from the t value, we can compute the probability of observing any number equal to $|t|$ or larger in absolute value, assuming $\beta_1 = 0$: the **p-value**

    - a small p-value indicates that it is unlikely to observe such a substantial association between the predictor and the response due to chance: then we reject the null hypothesis

- the **significance** of such a hypothesis test is a yes-or-no decision: a priori we define a threshold for the p-value (the significance level, commonly 5% or 1%), below which we reject the null hypothesis and above which we fail to reject it; the test is significant if the p-value falls below that threshold
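
The analytical standard errors, the t-statistic and the "two standard errors" interval from this section can be sketched as follows. This assumes NumPy and simulated data (the true line $3 + 2x$ and $\sigma = 2$ are made up); since $\sigma$ is unknown in practice, it is replaced by its estimate from the residuals (the RSE). A pairs bootstrap is added as the empirical alternative; the number of 2000 rounds is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = rng.uniform(0, 10, size=n)
y = 3.0 + 2.0 * x + rng.normal(0, 2.0, size=n)  # assumed true model

x_bar = x.mean()
sxx = np.sum((x - x_bar) ** 2)
beta1_hat = np.sum((x - x_bar) * (y - y.mean())) / sxx
beta0_hat = y.mean() - beta1_hat * x_bar

# sigma is unknown in practice; estimate it by the residual standard error.
rss = np.sum((y - beta0_hat - beta1_hat * x) ** 2)
sigma_hat = np.sqrt(rss / (n - 2))

# Analytical standard errors for simple linear regression.
se_beta0 = sigma_hat * np.sqrt(1 / n + x_bar ** 2 / sxx)
se_beta1 = sigma_hat / np.sqrt(sxx)

# t-statistic for H0: beta1 = 0, and an approximate 95% confidence interval.
t_stat = (beta1_hat - 0) / se_beta1
ci_beta1 = (beta1_hat - 2 * se_beta1, beta1_hat + 2 * se_beta1)

# Empirical alternative: bootstrap the slope by resampling (x, y) pairs.
boot_slopes = []
for _ in range(2000):
    idx = rng.integers(0, n, size=n)
    xb, yb = x[idx], y[idx]
    slope = np.sum((xb - xb.mean()) * (yb - yb.mean())) / np.sum((xb - xb.mean()) ** 2)
    boot_slopes.append(slope)
se_beta1_boot = np.std(boot_slopes, ddof=1)

print(se_beta0, se_beta1, se_beta1_boot)
print(t_stat, ci_beta1)
```

The analytical and the bootstrapped standard error of the slope should land in the same ballpark, which is a useful sanity check when the assumptions behind the formulas are in doubt.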

#### Supplementary Insights: true population size $N$ in formulas
- I was wondering why the true population size $N$ is never taken into account in any of the formulas

    - because I can be more confident about an estimate if I have sampled 100 observations from a population of size 150 than if I had sampled 100 observations from a population of size 1000000000

- this concept is addressed by the **finite population correction** (FPC) factor in statistics, but it's not always applied because many statistical formulas assume the population is "effectively infinite"

    - in real world applications, populations are often extremely large compared to the sample size, so the correction is negligible

- formulas for confidence intervals and standard errors are simplified to avoid including population size when it doesn't significantly affect the result

- similarly, in large populations, the difference between sampling with and without replacement becomes negligible
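
For sampling without replacement, the usual textbook form of the correction multiplies the standard error by $\sqrt{\frac{N - n}{N - 1}}$, so that $SE_{\text{FPC}}(\hat{\mu}) = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}}$. The formula is not part of these notes' source and is added here as an assumption about which correction is meant. A minimal sketch for the two population sizes from the example above (the helper name `fpc` is made up):

```python
import numpy as np

def fpc(population_size, sample_size):
    """Finite population correction factor for sampling without replacement."""
    return np.sqrt((population_size - sample_size) / (population_size - 1))

small = fpc(150, 100)            # sample covers most of the population
large = fpc(1_000_000_000, 100)  # population is effectively infinite

print(small, large)
```

For the small population the standard error shrinks noticeably, while for the huge population the factor is practically 1, which is why it is usually dropped.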

### 3.1.3 Assessing the Accuracy of the Model
- StatQuest [video on Linear Regression](https://www.youtube.com/watch?v=7ArmBVF2dCs) explains very nicely, what $R²$ is and how to calculate if it's statistically significant using the *p-value*

- after we have made a significance test for our coefficients (for instance using `sklearn.model_selection.permutation_test_score()`), we want to measure the extent to which our model fits the data

- in linear regression two related quantities are used for that end: the **residual standard error** (RSE) and the $R²$ statistic

#### Residual Standard Error (RSE)

- is a sample-based estimate of the standard deviation of the irreducible error $\epsilon$

- provides a way to evaluate how well the model fits the data: the smaller the RSE, the better the model fits the data

- $\text{RSE} = \sqrt{\frac{1}{n - 2} \text{RSS}} = \sqrt{\frac{1}{n - 2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$

    - where $\text{RSS}$ is the residual sum of squares
    
    - the subtraction of 2 accounts for the two parameters $\hat{\beta}_0$ and $\hat{\beta}_1$ (reducing degrees of freedom by 2)

    - this formula is - again - derived analytically

    - it represents a standard deviation of the errors, because:
        - RSS sums the squared *residuals*: this is analogous to how variance is calculated (summing squared deviations from the mean)
        - dividing by $n - 2$ (number of residuals minus 2 degrees of freedom) is analogous to how variance is divided by the number of independent observations
        - taking the square root converts this into the same units as the original data, making it a standard deviation of the residuals

- RSE is measured in the units of the target variable

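- the closest scikit-learn metric is `sklearn.metrics.root_mean_squared_error`

    - it divides by $n$ instead of $n - 2$, so it doesn't account for the reduced degrees of freedom; in practice with a large $n$, the difference is negligible (statsmodels, in contrast, divides by the residual degrees of freedom: see `mse_resid` of a fitted OLS model)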
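
The difference between the RSE and a plain root mean squared error is only the divisor. A minimal sketch with NumPy on simulated data (all numbers made up), using `np.polyfit` for the least squares line:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
x = rng.uniform(0, 10, size=n)
y = 1.0 + 0.5 * x + rng.normal(0, 1.0, size=n)

beta1_hat, beta0_hat = np.polyfit(x, y, deg=1)  # highest power first
rss = np.sum((y - (beta0_hat + beta1_hat * x)) ** 2)

rse = np.sqrt(rss / (n - 2))   # divides by the residual degrees of freedom
rmse = np.sqrt(rss / n)        # divides by n, as a plain RMSE does

print(rse, rmse)
```

The RSE is always slightly larger; the two differ by the factor $\sqrt{n / (n - 2)}$, which approaches 1 as $n$ grows.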

#### $R²$ statistic

- is the proportion of total variance in $Y$ that is explained by the regression model
    - we calculate it by dividing the difference between the total variance and the **un**explained variance by the total variance

- value is between 0 and 1 (for a least squares fit with intercept, evaluated on the training data)

- is independent of the scale of $Y$ (RSE is not, since it has the same units as the target variable)

- $R^2 = \frac{\text{TSS} - \text{RSS}}{\text{TSS}} = 1 - \frac{\text{RSS}}{\text{TSS}}$

    - where $\text{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2$ is the total sum of squares, i.e. the RSS we would get if $X$ had no effect and we always predicted $\bar{y}$

        - TSS measures the total variance in the response $Y$ , and can be thought of as the amount of variability inherent in the response before the regression is performed

    - and $RSS = \sum_{i=1}^n (y_i - \hat{y}_i)^2$ is the residual sum of squares
        - RSS measures the amount of variability that is left unexplained after performing the regression

    - $\text{TSS} - \text{RSS}$ measures the amount of variability in the response that is explained (or removed) by performing the regression

    - $R^2$ measures the proportion of variability in $Y$ that can be explained using $X$

- a score near 1 indicates a large proportion of variability explained by the model

- a score near 0 indicates that the regression does not explain much of the variability in the response; this might occur because the linear model is wrong, or the error variance $\sigma²$ is high, or both 


##### Comparison between  $R²$ statistic and correlation

- like correlation, the $R²$ statistic is a measure of the linear relationship between $X$ and $Y$: </br>

    $r = \text{Cor}(X, Y) = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}}$

    - where $\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})$ is (up to a factor of $\frac{1}{n-1}$) the **covariance** between $X$ and $Y$, which quantifies how $X$ and $Y$ vary together

        - the covariance is not simply the product of both variances, but a measure of joint variability (how both random variables $X$ and $Y$ vary together *per pair*)

    - and ${\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}}$ is (up to the same factor, which therefore cancels) the product of the standard deviations of $X$ and $Y$, meaning this term normalises the covariance

    - see the StatQuest video on the [Pearson's Correlation, Clearly Explained!!!](https://www.youtube.com/watch?v=xZ_z8KWkhXE)

    - it can be shown that in simple linear regression $R^2 = r^2$

        - this doesn't scale to multiple linear regression though, since the correlation $r$ is a pairwise measure and can only be used on two random variables

        - the $R²$ statistic in contrast, is able to capture proportion of variance in the target that can be explained by the entire set of predictors (it doesn't account for possible multicollinearity though)
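
The identity $R^2 = r^2$ for simple linear regression can be checked numerically. A minimal sketch, assuming NumPy and simulated data (the coefficients are made up):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=200)
y = 1.0 + 0.8 * x + rng.normal(0, 1.0, size=200)

# Least squares line and its predictions.
beta1_hat, beta0_hat = np.polyfit(x, y, deg=1)
y_pred = beta0_hat + beta1_hat * x

tss = np.sum((y - y.mean()) ** 2)   # total sum of squares
rss = np.sum((y - y_pred) ** 2)     # residual sum of squares
r_squared = 1 - rss / tss

r = np.corrcoef(x, y)[0, 1]         # Pearson correlation

print(r_squared, r ** 2)
```

Both printed numbers agree up to floating point error.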


## 3.2 Multiple Linear Regression
- fitting a separate simple linear regression model for each predictor doesn't take correlation between the predictors into account and might thus lead to misleading estimates of the associations between predictors and target

    - we could not see whether the correlation between predictor and target is due to the predictor itself or due to unmeasured or ignored features correlating with the predictor

- instead we include a separate slope coefficient for each predictor in our model:

    $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon$

    - where $X_j$ represents the $j$th predictor and $β_j$ quantifies the association between that variable and the response

    - we interpret $β_j$ as the average effect on $Y$ of a one unit increase in $X_j$, *holding all other predictors fixed*

### 3.2.1 Estimating the Regression Coefficients
- as in simple linear regression, we estimate the coefficients, calculate the predictions and try to minimize the sum of squared residuals: 
    
    $RSS = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n \left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \hat{\beta}_2 x_{i2} - \cdots - \hat{\beta}_p x_{ip} \right)^2$

### 3.2.2 Some Important Questions
#### One: Is There a Relationship Between the Response and any of the Predictors?

- for hypothesis testing, we test: </br>

    $H_0 : \beta_1 = \beta_2 = \cdots = \beta_p = 0$ against $H_a$ : at least one $β_j$ is non-zero

- this hypothesis test is performed by computing the $F$-statistic
    
    $F = \frac{(TSS - RSS)/p}{RSS / (n - p - 1)}$

    - where $p$ is the number of predictors (excluding the intercept) and as before $TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2$ and $RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

    - under $H_0$ (and with normally distributed errors) it is a ratio of two independent chi-square distributed quantities, each divided by its degrees of freedom, which is why it follows an F-distribution:

    - the numerator $(TSS - RSS)/p$ measures the variance explained by the model (how much variation in the response variable is explained by the predictors) per predictor

    - the denominator $(RSS / (n - p - 1))$ measures the variance of the residuals, accounting for the number of data points $n$ and the number of predictors $p$

    - the $F$-statistic tests whether the model with predictors is significantly better than a model with no predictors (i.e., only the intercept term)

- if the null hypothesis is rejected (the F-statistic is large enough), we can conclude that at least one of the predictors is significantly related to the response variable

    - the $F$-statistic can range from $[0, ∞)$

    - a larger $F$-statistic suggests that the predictors explain more of the variance in the response variable compared to a null model

        - when there is no relationship between the response and predictors, one would expect the $F$-statistic to take on a value close to 1

        - if $H_a$ is true, then we expect $F$ to be greater than 1

- how large the $F$-statistic needs to be for the model to be significant depends on the number of samples $n$ we used to train it (and on $p$):

    - if $n$ is large, even an $F$-statistic a little larger than 1 can provide evidence against $H_0$

    - with a smaller sample size $n$, we need a larger $F$-statistic to reject $H_0$ because of the increased variability

    - computing the *p-value* spares us from determining the threshold ourselves: we can compute the p-value of the $F$-statistic from the F-distribution

- we can also use the $F$-statistic to calculate the significance of only one particular predictor

    - in this case the Null hypothesis states $H_0 : \beta_p = 0$ and $H_a$  is that $β_p$ is non-zero

    - we would calculate $F = \frac{(RSS_0 - RSS)/1}{RSS / (n - p - 1)}$

        - where $RSS_0$ is the RSS of a second (restricted) model that uses all the variables except the ones we are interested in ($β_p$ in this case) and $RSS$ is the residual sum of squares for the full model (with the predictor)

        - the $1$ in the numerator is because you're testing only one predictor at a time

        - this $F$-statistic reports the partial effect of adding that variable to the model

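- looking at the individual p-values alone is not enough: even if no predictor is truly related to the response, each coefficient has a 5% chance of getting a p-value below 0.05 purely by chance, so with many predictors we expect some of them to look significant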

- the $F$-statistic doesn't suffer from this problem because it adjusts for the number of predictors

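- in order to use the $F$-statistic though, the number of predictors $p$ needs to be smaller than the number of samples $n$ (otherwise the model cannot even be fit with least squares); if $p > n$, we should reduce the number of predictors first (e.g. by dimensionality reduction)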
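
Both versions of the $F$-statistic from this section, the overall one and the partial one for a single predictor, can be sketched as follows. This assumes NumPy and simulated data in which, by construction, only the first two of three predictors matter; the helper `rss_of_fit` is a made-up name.

```python
import numpy as np

def rss_of_fit(X, y):
    """RSS of the least squares fit of y on the columns of X."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta_hat) ** 2)

rng = np.random.default_rng(4)
n, p = 200, 3
features = rng.normal(size=(n, p))
# Assumed truth: only the first two predictors matter.
y = 1.0 + 2.0 * features[:, 0] - 1.0 * features[:, 1] + rng.normal(size=n)

X_full = np.column_stack([np.ones(n), features])
rss = rss_of_fit(X_full, y)
tss = np.sum((y - y.mean()) ** 2)

# Overall F-statistic: all p predictors against the intercept-only model.
f_overall = ((tss - rss) / p) / (rss / (n - p - 1))

# Partial F-statistic: drop the last predictor and compare.
rss_0 = rss_of_fit(X_full[:, :-1], y)
f_partial = ((rss_0 - rss) / 1) / (rss / (n - p - 1))

print(f_overall, f_partial)
```

The overall statistic comes out far above 1, while the partial statistic for the pure-noise predictor stays small. A p-value could then be read from the F-distribution, for instance with `scipy.stats.f.sf(f_overall, p, n - p - 1)`.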

#### Two: Deciding on Important Variables
- the $F$-statistic and the p-values for the several coefficients don't give us absolute certainty about which coefficients are actually most important, as they are sample statistics

- ideally, we would like to perform variable selection by trying out a lot of different models, each containing a different subset of the predictors, and then select the best model, resulting in $2^p$ combinations (which is impracticable if we have many predictors)

    - but what does *best* mean?

    - Mallows' $C_p$, Akaike information criterion (AIC), Bayesian information criterion (BIC), and adjusted $R^2$ are measures of model quality and are discussed in chapter 6

- since trying out all possible $2^p$ combinations of predictors is infeasible unless $p$ is small, there are some automatable approaches:

    - forward selection: start with a *null model* (only intercept) and then keep adding the variable whose inclusion results in the lowest $RSS$ until a stopping criterion is reached

    - backward selection: start with all variables in the model, and remove the variable with the largest p-value until a stopping criterion is reached

    - mixed selection: mixture of both

- these are theoretical concepts; practically more computationally efficient methods are used

    - in scikit-learn these would be Recursive Feature Elimination (RFE), L1 Regularization (Lasso), Cross-Validation and Grid Search
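
Forward selection as described above can be sketched in plain NumPy. This is an illustrative toy version, not scikit-learn's implementation: the function names are made up, the stopping criterion is simply a fixed number of predictors (`n_select`), and the data are simulated so that predictor 2 is the strongest and predictor 0 the second strongest.

```python
import numpy as np

def rss_of_fit(X, y):
    """RSS of the least squares fit of y on the columns of X."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta_hat) ** 2)

def forward_selection(features, y, n_select):
    """Greedy forward selection: add the predictor giving the lowest RSS."""
    n, p = features.shape
    selected = []
    while len(selected) < n_select:
        best_j, best_rss = None, np.inf
        for j in range(p):
            if j in selected:
                continue
            # Intercept plus the already selected predictors plus candidate j.
            X = np.column_stack([np.ones(n), features[:, selected + [j]]])
            rss = rss_of_fit(X, y)
            if rss < best_rss:
                best_j, best_rss = j, rss
        selected.append(best_j)
    return selected

rng = np.random.default_rng(5)
features = rng.normal(size=(300, 5))
# Assumed truth: predictor 2 is strongest, predictor 0 is weaker, rest is noise.
y = 3.0 * features[:, 2] + 1.0 * features[:, 0] + rng.normal(size=300)

order = forward_selection(features, y, n_select=2)
print(order)
```

scikit-learn also ships `sklearn.feature_selection.SequentialFeatureSelector` (with `direction="forward"` or `"backward"`), which follows the same greedy idea but ranks candidates by a cross-validated score instead of the training RSS.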

#### Three: Model Fit
- **RSE** and $R^2$ are computed and interpreted in the same fashion as for simple linear regression

- $R²$ is the square of the correlation between the response (target) and the fitted linear model: $\text{Cor}(Y, \hat{Y})^2$

- $R²$ will always increase (or at least never decrease) when more variables are added to the model, even if those variables are only weakly associated with the response

    -  this is due to the fact that adding another variable always results in a decrease in the residual sum of squares on the training data (though not necessarily the testing data)

    - variables that only add a tiny increase in the $R²$ score should be dropped, as they provide no real improvement in the model fit to the training samples, and their inclusion will likely lead to poor results on independent test samples due to overfitting

- we should also plot the data (predictor-response-wise) to see if one of the predictors might not be linearly correlated to the response
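
- both properties of $R^2$ can be checked numerically; the following is a sketch on synthetic data (an assumption), where `fit_r2` is a hypothetical helper and `noise` is a column that has nothing to do with the response

```python
import numpy as np

def fit_r2(X, y):
    """Return (R^2, fitted values) of a least squares fit with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    y_hat = A @ beta
    rss = np.sum((y - y_hat) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1.0 - rss / tss, y_hat

rng = np.random.default_rng(1)
n = 100
x = rng.normal(size=(n, 1))
y = 2.0 * x[:, 0] + rng.normal(size=n)
noise = rng.normal(size=(n, 1))          # unrelated to y by construction

r2_small, y_hat = fit_r2(x, y)
r2_big, _ = fit_r2(np.hstack([x, noise]), y)

print(r2_small, r2_big)                   # training R^2 does not go down
print(np.corrcoef(y, y_hat)[0, 1] ** 2)   # matches r2_small
```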

#### Four: Predictions
- when we predict, we need to be aware that there are three causes of uncertainty in the process:

    1. the coefficient estimates $\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_p$ are only estimates for $\beta_0, \beta_1, \dots, \beta_p$; that is, the least squares plane $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_p X_p$ is only an estimate of the true population regression plane $f(X) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$; the inaccuracy in the coefficient estimates is related to the reducible error

    2. another reducible error which we call *model bias*: maybe reality is not linear

    3. even if we knew $f(X)$, there is still the irreducible error $\epsilon$; to account for this, we use *prediction intervals*, which are always wider than confidence intervals

- *confidence intervals* are used to quantify the uncertainty in the estimated coefficients, and therefore in the estimated average response $f(X)$, under certain assumptions about the model (e.g., linearity, normality of residuals, etc.)

- *prediction intervals* are used to quantify the uncertainty in predicting future individual outcomes in the real world, taking into account the irreducible error
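
- the difference between the two intervals can be sketched for simple linear regression; this assumes synthetic data and uses `t_crit = 2.0` as a rough stand-in for the exact $t$ quantile of a 95% interval (all variable names are illustrative)

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
x = rng.uniform(0, 10, size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=1.5, size=n)

# least squares estimates for simple linear regression
x_bar = x.mean()
sxx = np.sum((x - x_bar) ** 2)
b1 = np.sum((x - x_bar) * (y - y.mean())) / sxx
b0 = y.mean() - b1 * x_bar

resid = y - (b0 + b1 * x)
rse = np.sqrt(np.sum(resid ** 2) / (n - 2))

x0 = 5.0
y0_hat = b0 + b1 * x0

# standard error of the estimated average response at x0 (coefficient uncertainty only)
se_mean = rse * np.sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)
# standard error for one new observation at x0 (adds the irreducible error)
se_pred = rse * np.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / sxx)

t_crit = 2.0  # approximation, not the exact t quantile
conf_int = (y0_hat - t_crit * se_mean, y0_hat + t_crit * se_mean)
pred_int = (y0_hat - t_crit * se_pred, y0_hat + t_crit * se_pred)

print(conf_int)
print(pred_int)
```

- both intervals are centered at the same prediction; the extra `1 +` under the square root is what makes the prediction interval wider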

## 3.3 Other Considerations in the Regression Model
### 3.3.1 Qualitative Predictors

- we **can** use qualitative (categorical) predictors as if they were quantitative predictors, but they must be properly encoded

    - for example the feature "own house" for a prediction of credit card debt by coding 0 for not owning a house and 1 for owning a house

    - the decision to code owners as 1 and non-owners as 0 is arbitrary, and has no effect on the regression fit, but does alter the interpretation of the coefficients
        - we could have also coded it as -1 and 1

- another option is to split the dataset by the levels of a categorical variable and fit separate models for each group

    - for instance fit one model on credit card debt for house owners and another for non-house owners
    
    - but if one of the groups ends up with too few samples, the p-values may be high, which means we cannot be statistically confident about any differences

- if the categorical feature has more than two levels (e.g., "region" with values "North," "South," "East," "West"), a feature with $k$ levels must be transformed into $k - 1$ dummy variables (one-hot encoding with one level dropped)

    - for example: create three binary variables: "is_North," "is_South," "is_East"

        - "is_West" is omitted to avoid redundancy (the omitted level is called the reference category)
    
    - without such an encoding, there is no clear interpretation of how the different categories influence the response variable
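
- a sketch of the dummy coding described above, assuming NumPy and made-up response values per region; with "West" as the reference category, the intercept is the mean response for "West" and each dummy coefficient is the difference between that region's mean and the "West" mean

```python
import numpy as np

regions = np.array(["North", "South", "East", "West"] * 5)

# made-up response (assumption): a different average level per region plus noise
rng = np.random.default_rng(3)
level = {"North": 10.0, "South": 12.0, "East": 9.0, "West": 11.0}
y = np.array([level[r] for r in regions]) + rng.normal(scale=0.5, size=len(regions))

# k - 1 = 3 dummy variables; "West" is omitted and acts as the reference category
dummies = ["North", "South", "East"]
D = np.column_stack([(regions == name).astype(float) for name in dummies])
A = np.column_stack([np.ones(len(y)), D])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

west_mean = y[regions == "West"].mean()
north_mean = y[regions == "North"].mean()
print(beta[0], west_mean)                  # intercept equals the mean for "West"
print(beta[1], north_mean - west_mean)     # is_North coefficient equals the mean difference
```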

### 3.3.2 Extensions of the Linear Model
- linear regression models make the assumption that the relationship between the predictors and the response is *additive* and *linear*

    - an *additive* relationship means that the association between a predictor $X_j$ and the response $Y$ does not depend on the values of the other predictors (no interaction effects between features)

    - a *linear* relationship means that the change in the response $Y$ associated with a one-unit change in $X_j$ is constant, regardless of the value of $X_j$

- we can extend the linear model by relaxing these assumptions

- the following methods are some of the simpler ones (in later chapters more complex methods will be described)

    - in scikit-learn, we have a transformer `PolynomialFeatures` that does both at the same time: generate polynomial and interaction features

#### Removing the Additive Assumption
- if the effect of one feature on the target depends on the value of another feature, this is called an **interaction effect**

    - for example, sales might increase more when radio and television advertising are combined than the two separate effects would suggest

- we can account for this by including interaction terms (e.g., $X_1 \times X_2$) in the model, for predictors whose effects on $Y$ depend on one another

    - the model would then be $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \epsilon$

    - which can be re-written as $Y = \beta_0 + (\beta_1 + \beta_3 X_2)X_1 + \beta_2 X_2 + \epsilon$
        - where $\tilde{\beta}_1 = \beta_1 + \beta_3 X_2$ is a modified version of $\beta_1$

        - $\tilde{\beta}_1$ is no longer constant because it depends on $X_2$

        - meaning the relationship between $X_1$ and $Y$ changes depending on the value of $X_2$

        - similarly, we could construct $\tilde{\beta}_2$ from this

        - a change in the value of $X_2$ will change the association between $X_1$ and $Y$ and the other way around, a change in the value of $X_1$ will change the association between $X_2$ and $Y$

- if the p-value of the interaction term is small, i.e. its coefficient is statistically significant, the interaction term should be included in the model

- the hierarchical principle states that if we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant
    
    - the rationale for this principle is that if $X_1 × X_2$ is related to the response, then whether or not the coefficients of $X_1$ or $X_2$ are exactly zero is of little interest
        
        - this implies that even features that appear insignificant on their own might have significant explanatory power when combined through interaction terms

- this is still a linear model: it is linear in the coefficients, and the interaction term is simply treated as a new feature, so the model is still a linear combination of features and their coefficients; it is just no longer additive in $X_1$ and $X_2$
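
- the interaction model can be fit with ordinary least squares by adding the product $X_1 X_2$ as an extra column; a sketch on synthetic data (the true coefficients below are assumptions chosen for the illustration), where `slope_x1` is a hypothetical helper for $\tilde{\beta}_1 = \beta_1 + \beta_3 X_2$

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
x1 = rng.uniform(0, 10, size=n)
x2 = rng.uniform(0, 10, size=n)
y = 1.0 + 2.0 * x1 + 0.5 * x2 + 0.3 * x1 * x2 + rng.normal(scale=1.0, size=n)

# main effects plus the interaction column (hierarchical principle: keep x1 and x2)
A = np.column_stack([np.ones(n), x1, x2, x1 * x2])
b0, b1, b2, b3 = np.linalg.lstsq(A, y, rcond=None)[0]

def slope_x1(x2_value):
    """Estimated change in y per unit of x1, for a given value of x2."""
    return b1 + b3 * x2_value

print(b3)
print(slope_x1(0.0), slope_x1(10.0))   # the effect of x1 grows with x2
```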

#### Non-linear Relationships
- **polynomial regression** can be used to relax the linearity assumption while still fitting a linear model

    - it extends linear regression to model a non-linear relationship between a single predictor $X$ and the response $Y$ by adding new features that are higher powers of $X$

    - for example, the miles per gallon a car can drive depend on its horsepower in a non-linear way:

        $\text{mpg} = \beta_0 + \beta_1 \times \text{horsepower} + \beta_2 \times \text{horsepower}^2 + \epsilon$

    - this is still a linear model, but with $X_1 = \text{horsepower}$ and $X_2 = \text{horsepower}^2$
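
- a sketch of this quadratic fit, assuming NumPy and synthetic numbers that only mimic a curved mpg/horsepower relationship (they are not real car data); `fit_rss` is a hypothetical helper

```python
import numpy as np

rng = np.random.default_rng(5)
hp = rng.uniform(50, 200, size=200)
# synthetic response (assumption): curved in horsepower
mpg = 60.0 - 0.45 * hp + 0.0012 * hp ** 2 + rng.normal(scale=2.0, size=200)

def fit_rss(A, y):
    """Least squares coefficients and residual sum of squares."""
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta, float(np.sum((y - A @ beta) ** 2))

ones = np.ones_like(hp)
beta_lin, rss_lin = fit_rss(np.column_stack([ones, hp]), mpg)
# X_1 = horsepower, X_2 = horsepower^2: still linear in the coefficients
beta_quad, rss_quad = fit_rss(np.column_stack([ones, hp, hp ** 2]), mpg)

print(rss_lin, rss_quad)
print(beta_quad)
```

- the squared column is built by hand here; scikit-learn's `PolynomialFeatures(degree=2)` would generate it as part of a pipeline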

### 3.3.3 Potential Problems
- identifying and overcoming these problems is as much an art as a science

#### 1. Non-linearity of the Data
- *residual plots* (i.e. plotting the residuals, $e_i = y_i - \hat{y}_i$ versus the predictor $x_i$) can help to discover this

- in multiple regression we instead plot the residuals versus the predicted (or fitted) values $\hat{y}_i$

- the presence of a pattern (for instance a U-shape) may indicate a problem with some aspect of the linear model

- if the residual plot indicates that there are non-linear associations in the data, then a simple approach is to use non-linear transformations of the predictors, such as $\log X$, $\sqrt{X}$, and $X^2$ in the regression model
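
- without plotting, a pattern in the residuals can also be spotted by averaging them over ranges of the predictor; a sketch on synthetic data where the true relationship is logarithmic (an assumption), with hypothetical helpers `residuals` and `mean_residual_by_third`

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(1, 100, size=300)
y = 2.0 + 3.0 * np.log(x) + rng.normal(scale=0.3, size=300)

def residuals(feature, y):
    """Residuals of a simple linear regression of y on the given feature."""
    A = np.column_stack([np.ones(len(y)), feature])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return y - A @ beta

def mean_residual_by_third(feature, resid):
    """Average residual in the low, middle and high third of the feature."""
    order = np.argsort(feature)
    return [float(part.mean()) for part in np.array_split(resid[order], 3)]

raw = mean_residual_by_third(x, residuals(x, y))
logged = mean_residual_by_third(x, residuals(np.log(x), y))

print(raw)      # systematic pattern: negative, positive, negative
print(logged)   # close to zero everywhere after transforming the predictor
```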

#### 2. Correlation of Error Terms
- an assumption of the linear model is that the error terms per sample are uncorrelated

    - for instance, if the errors are uncorrelated, then the fact that $\epsilon_i$ is positive provides little or no information about the sign of $\epsilon_{i+1}$

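- if the error terms are in fact correlated, the estimated standard errors tend to underestimate the true standard errors; confidence and prediction intervals become narrower than they should be, p-values become lower than they should be, and we may have an unwarranted sense of confidence in our model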

- correlations between the error terms frequently occur in the context of time series data, where observations that are obtained at adjacent time points will often have positively correlated errors

    - to detect that, we can plot the residuals from our model as a function of time and inspect for patterns
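
- besides plotting, the lag-1 correlation of the residuals gives a quick numeric check; a sketch with simulated errors (an assumption) in which each error carries over 80% of the previous one, and the hypothetical helper `lag1_residual_corr`

```python
import numpy as np

rng = np.random.default_rng(7)
n = 300
t = np.arange(n, dtype=float)

# simulated correlated errors: eps_i = 0.8 * eps_{i-1} + fresh noise
eps = np.zeros(n)
for i in range(1, n):
    eps[i] = 0.8 * eps[i - 1] + rng.normal()

def lag1_residual_corr(y, t):
    """Correlation between consecutive residuals of a linear fit of y on t."""
    A = np.column_stack([np.ones(len(t)), t])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(np.corrcoef(resid[:-1], resid[1:])[0, 1])

lag1_correlated = lag1_residual_corr(5.0 + 0.1 * t + eps, t)
lag1_independent = lag1_residual_corr(5.0 + 0.1 * t + rng.normal(size=n), t)

print(lag1_correlated)     # clearly positive
print(lag1_independent)    # close to zero
```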

#### 3. Non-constant Variance of Error Terms
- if the *residual plot* has a funnel shape, this indicates heteroscedasticity (non-constant variance), meaning the residuals don't have constant variance

- if the errors increase with the value of the response, we could transform the target with a concave function such as $\log Y$ or $\sqrt{Y}$

- special case: if categorical variables divide our data into separate sub-groups, each sub-group might have a different variance associated with the outcome, leading to heteroscedasticity

    - if we know the variance in each sub-group, we can use it to apply weights to our regression model to account for the fact that some groups have more variable data than others, where the weights are inversely proportional to the variance within each sub-group

    - this is called weighted least squares
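
- a sketch of weighted least squares with NumPy, assuming two sub-groups whose error standard deviations are known (0.5 and 3.0 are made-up values); scaling each row by $\sqrt{w_i}$ turns the weighted problem into an ordinary least squares problem

```python
import numpy as np

rng = np.random.default_rng(8)
n = 200
x = rng.uniform(0, 10, size=n)
group = rng.integers(0, 2, size=n)            # two hypothetical sub-groups
sigma = np.where(group == 0, 0.5, 3.0)        # assumed known std. dev. per group
y = 1.0 + 2.0 * x + rng.normal(size=n) * sigma

A = np.column_stack([np.ones(n), x])
w = 1.0 / sigma ** 2                          # weights inversely proportional to the variance

# weighted least squares = ordinary least squares on rows scaled by sqrt(w_i)
sw = np.sqrt(w)
beta_wls, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)

# cross-check with the weighted normal equations (A^T W A) beta = A^T W y
beta_check = np.linalg.solve(A.T @ (A * w[:, None]), A.T @ (w * y))

print(beta_wls)
print(beta_check)
```

- in scikit-learn the same idea is available through the `sample_weight` argument of `LinearRegression.fit`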

#### 4. Outliers
- even if an outlier does not have a large effect on the fitted model and its predictions, it can have a large effect on the residual standard error (RSE), and therefore on the p-values and the confidence intervals

- *residual plots* can be used to identify outliers

- we can also plot the *studentized residuals*, computed by dividing each residual $e_i$ by its estimated standard error
    - observations whose studentized residuals are greater than 3 in absolute value are possible outliers
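
- a sketch of this check, assuming synthetic data with one artificially shifted observation; it uses one common variant (internally studentized residuals), where the estimated standard error of $e_i$ is $\text{RSE} \cdot \sqrt{1 - h_i}$ and $h_i$ is the leverage of observation $i$

```python
import numpy as np

rng = np.random.default_rng(9)
n = 60
x = rng.uniform(0, 10, size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=n)
y[10] += 12.0                                  # inject one artificial outlier

A = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ beta

# leverage values: diagonal of the hat matrix H = A (A^T A)^{-1} A^T
h = np.diag(A @ np.linalg.solve(A.T @ A, A.T))

rse = np.sqrt(np.sum(resid ** 2) / (n - A.shape[1]))
studentized = resid / (rse * np.sqrt(1.0 - h))

flagged = np.flatnonzero(np.abs(studentized) > 3)
print(flagged)
```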

#### 5. High Leverage Points
- a high leverage point is an observation with an unusual predictor value $x_i$ (not necessarily an outlier in the response), lying in a region where there are few other observations

- in contrast to an outlier, what is unusual is the value of the predictor, not the value of the target

- to identify high leverage points, we can simply look for observations for which the predictor value is outside of the normal range of the observations

-  in a multiple linear regression with many predictors, it is possible to have an observation that is well within the range of each individual predictor’s values, but that is unusual in terms of the full set of predictors

- we can identify high leverage points with scatter plots, either between feature and target (in simple linear regression) or between several predictors (in multiple linear regression)

- we can also calculate the *leverage statistic*: $h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{i' = 1}^{n} (x_{i'} - \bar{x})^2}$ (for simple linear regression)

- a data point that is both an outlier and a high leverage point is a particularly dangerous combination that we need to take care of
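
- the leverage statistic can be computed directly from the formula above; a sketch with a tiny made-up predictor vector whose last value lies far from the rest, cross-checked against the diagonal of the hat matrix; $h_i$ always lies between $1/n$ and $1$ and the average leverage is $(p + 1)/n$, so values far above that average are suspicious

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 20.0])   # last point is far from the others
n = len(x)

# leverage statistic for simple linear regression
h = 1.0 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)

# cross-check: diagonal of the hat matrix H = A (A^T A)^{-1} A^T
A = np.column_stack([np.ones(n), x])
h_hat = np.diag(A @ np.linalg.solve(A.T @ A, A.T))

avg_leverage = (1 + 1) / n                       # (p + 1) / n with p = 1 predictor

print(h)
print(h.sum(), avg_leverage)
```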

#### 6. Collinearity
- is present when two or more predictor variables are closely related to one another: correlation between predictors

- it results in a great deal of uncertainty in the coefficient estimates: their standard errors grow, so the t-statistics decline and we may fail to reject $H_0: \beta_j = 0$

- a simple way to detect collinearity is to look at the correlation matrix of the predictors

    - but only collinearity between pairs of features can be detected this way

    - it is possible for collinearity to exist between three or more variables even if no pair of variables has a particularly high correlation, which is called multicollinearity

- to detect multicollinearity we can compute the **variance inflation factor** (VIF), which is the ratio of the variance of $\hat{\beta}_j$ when fitting the full model divided by the variance of $\hat{\beta}_j$ if fit on its own

- the VIF for each variable can be computed using the formula: $\text{VIF}(\hat{\beta}_j) = \frac{1}{1 - R^2_{X_j | X_{-j}}}$

    - where $R^2_{X_j | X_{-j}}$ is the $R^2$ from a regression of $X_j$ onto all of the other predictors

    - a VIF of 1 indicates no collinearity and a value above 5 or 10 would indicate that there is too much of it

- solutions:

    - drop one of the problematic variables

    - combine the collinear variables together into a single predictor
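
- a sketch of the VIF formula with NumPy, assuming synthetic predictors where `x3` is almost a linear combination of `x1` and `x2`; the function name `vif` is hypothetical and regresses each column on all the others

```python
import numpy as np

def vif(X):
    """Variance inflation factor for every column of the predictor matrix X."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        # regress X_j onto all other predictors (with intercept)
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((X[:, j] - X[:, j].mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(10)
n = 300
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + rng.normal(scale=0.1, size=n)   # nearly x1 + x2
x4 = rng.normal(size=n)                        # unrelated to the others

vifs = vif(np.column_stack([x1, x2, x3, x4]))
print(vifs)
```

- here `x1`, `x2` and `x3` should all get a VIF far above 10, while `x4` stays close to 1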