## 3 Linear Regression
- we want to determine each feature's contribution to the target (if any)
- but we also need to determine the accuracy of our prediction
- observe whether there is an interaction effect between the features

### 3.1 Simple Linear Regression

- predicting a quantitative response $Y$ on the basis of a single predictor variable $X$
- we assume a linear relationship between predictor and target: $Y ≈ β_0 + β_1 X$
    - where $β_0$ and $β_1$ are two unknown constants that represent the *intercept* and *slope* terms; together they are known as the model's *coefficients* or *parameters*
    
- we want to produce estimates $\hat{\beta}_0$ and $\hat{\beta}_1$, so that $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$
    - where $\hat{y}$ indicates a prediction of $Y$ on the basis of $X = x$

#### 3.1.1 Estimating the Coefficients

- the real $β_0$ and $β_1$ are unknown
- we want to find an intercept $\hat{\beta}_0$ and a slope $\hat{\beta}_1$ such that the resulting line is as close as possible to the $n$ data points

##### Cost function 
- most common way to measure that closeness is **least squares**

- let $\hat{y_i} = \hat{\beta}_0 + \hat{\beta}_1 x_i$ be the prediction for $Y$ based on the ith value of $X$:
    - then $e_i = y_i − \hat{y_i}$ represents the ith *residual* (difference between `y` and `y_pred` at the ith observation)
    - and the **residual sum of squares** (RSS) is the sum of all the squared residuals: $RSS = e_1^2 + e_2^2 + \cdots + e_n^2$

        - $RSS = (y_1 - \hat{\beta}_0 - \hat{\beta}_1 x_1)^2 + (y_2 - \hat{\beta}_0 - \hat{\beta}_1 x_2)^2 + \cdots + (y_n - \hat{\beta}_0 - \hat{\beta}_1 x_n)^2$
        - we square the residuals for three reasons: positive and negative residuals cannot cancel each other out, larger errors are penalized more heavily, and the resulting function is smooth and differentiable, so calculus-based optimization (taking derivatives) can find the minimum of the RSS efficiently
    
    - The MSE (**Mean Squared Error**) is built on the RSS (**Residual Sum of Squares**) and summarizes the error over all predictions `y_pred`.

        - the RSS is the raw sum of squared residuals, which measures the total error across all data points: <br>

            $RSS = \sum_{i=1}^n (y_i - \hat{y}_i)^2$ <br>
            (loss function)

        - the MSE takes the RSS and normalizes it by dividing by the number of data points ($n$); it represents the average squared error per data point: <br>
        
            $MSE = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2$ <br>
            (cost function)
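
The relationship between residuals, RSS and MSE can be sketched in a few lines of NumPy. The data and the candidate coefficients below are hypothetical values chosen for illustration, not taken from these notes.

```python
import numpy as np

# Hypothetical toy data (assumption): five (x, y) observations.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Assumed candidate coefficients; any intercept/slope pair can be scored this way.
beta0_hat, beta1_hat = 0.0, 2.0

y_pred = beta0_hat + beta1_hat * x   # predictions y_hat_i
residuals = y - y_pred               # e_i = y_i - y_hat_i
rss = np.sum(residuals ** 2)         # residual sum of squares (total error)
mse = rss / len(y)                   # RSS averaged over the n observations

print(f"RSS = {rss:.3f}, MSE = {mse:.3f}")
```

Because the MSE is just the RSS divided by the constant $n$, both are minimized by the same coefficients.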

##### Optimization method: normal equation
- the normal equation is a closed-form (analytical) solution rather than an iterative optimization algorithm

- to find the values of $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize the loss function (RSS), we use a closed-form solution: calculus yields a system of equations whose solution gives the exact minimizer without iteration
    - the loss function and the optimization are closely related here because the goal of the optimization is to minimize the loss function

- the RSS can also be written as $\text{RSS}(\beta) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \| y - X\beta \|^2$

    - the last term denotes the squared Euclidean norm and can also be written in matrix form: $\| y - X\beta \|^2 = (y - X\beta)^T (y - X\beta)$
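
As a worked step (a sketch, assuming $X$ contains a column of ones so that the intercept is part of $\beta$), expanding the matrix form and differentiating with respect to $\beta$ gives:

$$
\text{RSS}(\beta) = y^T y - 2\,\beta^T X^T y + \beta^T X^T X \beta
$$

$$
\frac{\partial\, \text{RSS}}{\partial \beta} = -2\,X^T y + 2\,X^T X \beta = 0
\quad\Longrightarrow\quad
X^T X \hat{\beta} = X^T y
$$

The last equality is the system usually called the normal equation; solving it for $\hat{\beta}$ requires $X^T X$ to be invertible.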

- from the RSS formula, we can derive the following by taking derivatives with respect to the coefficients $\beta$ and setting them equal to zero: <br>

    $\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}$

    $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$

    - where $\bar{y} \equiv \frac{1}{n} \sum_{i=1}^{n} y_i$ and $\bar{x} \equiv \frac{1}{n} \sum_{i=1}^{n} x_i$ are the sample means
        - the symbol $\equiv$ (three horizontal bars) means "is defined as": it indicates that the two expressions are equal by definition

    - this is the (non-iterative, closed form) OLS optimisation for simple linear regression
        - this would be the more general form for multiple linear regression: $\hat{\beta} = (X^T X)^{-1} X^T y$

    - intuition on how this works:
        - the derivative of a function represents its rate of change
        - setting the derivative of the RSS with respect to $\beta$ equal to $0$ corresponds to finding the stationary points, which may correspond to the minimum (or maximum)
        - why not take derivatives with respect to $x$ or $y$? --> both are fixed, observed data; only the coefficients can be varied
        - the RSS is a convex function of the coefficients (a bowl-shaped paraboloid in $\hat{\beta}_0$ and $\hat{\beta}_1$), so the stationary point is the global minimum; it is unique as long as the $x_i$ are not all identical
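
The closed-form estimates can be checked numerically. The sketch below uses hypothetical toy data and compares the simple-regression formulas with the matrix normal equation. It solves the linear system with `np.linalg.solve` instead of forming $(X^T X)^{-1}$ explicitly, which is a numerical choice made here rather than something these notes prescribe.

```python
import numpy as np

# Hypothetical toy data (assumption, for illustration only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Closed-form estimates for simple linear regression.
x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

# Same fit via the normal equation X^T X beta = X^T y,
# with a column of ones so the intercept is part of beta.
X = np.column_stack([np.ones_like(x), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# At the minimum the gradient -2 X^T (y - X beta) vanishes.
gradient = -2 * X.T @ (y - X @ beta_hat)

print("scalar formulas:", beta0_hat, beta1_hat)
print("normal equation:", beta_hat)
```

Both routes should agree up to floating-point error, and the gradient at the solution should be numerically zero.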

- optimizing via the normal equation is based on the assumption that $X^TX$ is invertible, meaning $X$ needs to have full column rank
    - in this case the normal equation yields an exact, unique solution

    - if there is perfect multicollinearity between features ($X$ is not full rank), $X^TX$ is singular: the normal equation no longer has a unique solution (infinitely many coefficient vectors give the same minimal RSS), and we need to use other techniques
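
A small sketch of the rank-deficient case, using a hypothetical design matrix in which one feature is an exact multiple of another. `np.linalg.lstsq` is used here as one possible alternative technique (it returns the minimum-norm least-squares solution); the notes themselves do not name a specific method.

```python
import numpy as np

# Hypothetical data (assumption): the third column is exactly 2x the second,
# so the columns of X are linearly dependent (perfect multicollinearity).
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X = np.column_stack([np.ones_like(x1), x1, 2.0 * x1])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

print("columns:", X.shape[1], "rank:", np.linalg.matrix_rank(X))

# SVD-based least squares still works and picks the minimum-norm solution.
beta_min_norm, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)

# Shifting weight between the two collinear columns changes the coefficients
# but not the predictions, so the RSS is identical.
beta_other = beta_min_norm + np.array([0.0, 2.0, -1.0])

rss_a = np.sum((y - X @ beta_min_norm) ** 2)
rss_b = np.sum((y - X @ beta_other) ** 2)
print("RSS values:", rss_a, rss_b)
```

The two coefficient vectors differ but fit the data equally well, which is why the individual coefficients are not identifiable under perfect multicollinearity.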

- we can still find a best-fit line using OLS even though there is an irreducible error: among all lines, it gives the best approximation (in the least-squares sense) to the relationship between the predictors $x$ and the response $y$, because it minimizes the total squared error (RSS) across all data points

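- in scikit-learn, `LinearRegression` fits OLS with a direct least-squares solver (`scipy.linalg.lstsq` for dense data) rather than by explicitly inverting $X^TX$; this choice does not depend on dataset size, and for very large datasets an iterative estimator such as `SGDRegressor` is a separate option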

##### Supplementary Insights: loss function, objective function, cost function

- **loss function**: applies to an individual observation
    - RSS however is an aggregate loss function (a sum is used as a summary statistic); but it is simply a summed version of the individual squared losses and thus can serve as a loss function here
    - in most supervised learning models in scikit-learn, the loss function is aggregated across all data points because these models aim to minimize an overall prediction error (aggregated loss)

- **objective function**: aggregates the loss over all observations and might include additional terms (e.g., regularization)

    - **cost function**: specific type of objective function that represents the aggregated loss over the entire dataset, but typically does not include other terms such as regularisation
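
The three terms can be told apart in code. This is a minimal sketch with hypothetical numbers. The L2 (ridge-style) penalty and the strength `lam` are assumptions chosen only to show an objective with an extra term, since the notes mention regularization without naming a specific form.

```python
import numpy as np

# Hypothetical targets, predictions and coefficients (assumptions).
y = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.0, 8.0])
beta = np.array([1.0, 2.0])   # [intercept, slope]
lam = 0.1                     # assumed regularization strength

def squared_loss(y_i, y_hat_i):
    """Loss for a single observation."""
    return (y_i - y_hat_i) ** 2

# Loss function: one value per observation.
losses = squared_loss(y, y_pred)

# Cost function: the losses aggregated over the dataset (here the MSE).
cost = losses.mean()

# Objective function: cost plus an extra term, here an L2 penalty
# on the slope (the intercept is left unpenalized).
objective = cost + lam * np.sum(beta[1:] ** 2)

print("losses:", losses)
print("cost:", cost)
print("objective:", objective)
```

Without the penalty term (`lam = 0`), the objective and the cost coincide.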

#### 3.1.2 Assessing the Accuracy of the Coefficient Estimates
-