# Chapter 3 - Linear Regression


## What is Linear Regression
- A simple supervised learning method.
- Linear regression assumes that the dependence of Y on the predictors X is linear.
- We assume the model: Y = intercept + slope*X + error term
- If the true relationship is linear, the error term has mean 0.

## Questions we want answered

- Is there a relationship between X and Y?
- How strong is the relationship?
- Which X contribute most to Y?
- How accurately can we predict future values of Y?
- Is the relationship linear?

## Estimation of parameters by least squares

- Let ^yi be the prediction for the ith observation yi.
- Then ei = yi - ^yi is the ith residual.
- We define the residual sum of squares (RSS) as e1^2 + e2^2 + ... + en^2


- The least squares approach chooses intercept and slope that minimise the RSS.
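
For a single predictor, the intercept and slope that minimise the RSS have a closed form. A minimal sketch in plain Python; the function names `fit_simple_ols` and `rss` and the toy data are illustrative, not part of the notes:

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope) that minimise the RSS for one predictor."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # slope = sum((xi - x_mean) * (yi - y_mean)) / sum((xi - x_mean)^2)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    # The fitted line always passes through (x_mean, y_mean).
    intercept = y_mean - slope * x_mean
    return intercept, slope


def rss(x, y, intercept, slope):
    """Residual sum of squares for a given line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


# Toy data lying close to y = 1 + 2x (made up for illustration).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.0]
intercept, slope = fit_simple_ols(x, y)
print(intercept, slope, rss(x, y, intercept, slope))
```

Nudging either coefficient away from the returned values can only increase the RSS, which is a quick way to sanity-check the fit.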

## Confidence Interval Of Model Parameters
Note: Y = f(x) + e is the real model.
Let f(x) = Ax + B be the true model.
Let y = ax + b be our estimate of f(x).

With a different data set we would get a different regression line each time.
So how far could any one fitted line be from the true model?

We can build an approximate 95% confidence interval for each true parameter by taking the estimate plus or minus 2 standard errors (SE, the standard deviation of the estimate across data sets).
- a +- 2*SE(a)
- b +- 2*SE(b)

With approximately 95% confidence, the true values A and B lie in these ranges.
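
As an illustration of the +-2 rule above, the sketch below estimates the standard error of each coefficient from a single data set, using the standard formulas for simple linear regression (these assume uncorrelated errors with constant variance). The function names and the toy data are made up for illustration.

```python
import math


def simple_ols_with_se(x, y):
    """Fit y = intercept + slope * x and return estimates with standard errors."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sxx
    intercept = y_mean - slope * x_mean

    # Estimate the error variance from the residuals (n - 2 degrees of freedom).
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    sigma2 = rss / (n - 2)

    se_slope = math.sqrt(sigma2 / sxx)
    se_intercept = math.sqrt(sigma2 * (1.0 / n + x_mean ** 2 / sxx))
    return intercept, slope, se_intercept, se_slope


def approx_95_interval(estimate, se):
    """Approximate 95% interval: estimate +/- 2 standard errors."""
    return estimate - 2 * se, estimate + 2 * se


# Toy data lying close to y = 1 + 2x (made up for illustration).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.9, 5.2, 6.8, 9.1, 11.2, 12.8]
intercept, slope, se_intercept, se_slope = simple_ols_with_se(x, y)
print("slope interval:", approx_95_interval(slope, se_slope))
print("intercept interval:", approx_95_interval(intercept, se_intercept))
```

The factor 2 is an approximation; an exact interval would use a quantile of the t-distribution with n - 2 degrees of freedom.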



## Hypothesis Testing

h0: There is no relationship between X and Y.

h1: There is a relationship between X and Y.
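
In the notation of the previous section, where a is the estimated slope and A the true slope, h0 corresponds to A = 0. The usual way to test it is the t-statistic for a regression coefficient, stated here as a reminder rather than taken from the notes:

\[
t = \frac{a - 0}{\mathrm{SE}(a)}
\]

Under h0, t follows a t-distribution with n - 2 degrees of freedom. The p-value is the probability of observing a value at least as large as |t| if h0 were true.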

## Assessing the Overall Accuracy of the Model

Residual sum of squares (RSS): sum of i=1 to n (yi - ^yi)^2
  - How much variation is left unexplained by your model

Residual standard error (RSE): sqrt(RSS / (n - 2))
  - The average amount by which the observed values differ from the regression line (typical error size)
  - Why n - 2? Because we estimated 2 parameters (intercept and slope).
  - A small RSE means the points are close to the regression line.
  - This is an estimate of the SD of e (the irreducible error).

Total sum of squares (TSS): sum of i=1 to n (yi- _y)^2
  - Where _y is the mean of the y
  - How much variation is in the response variable Y before considering the model.
  
R-squared (R^2): (TSS - RSS) / TSS
  - Statistical measure of how well your model fits the data
  - How much of the total variation is explained by the model.
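
The four quantities above can be computed directly from the observed and fitted values. A minimal sketch; the function name `fit_metrics` and the numbers are hypothetical, and the fitted values stand in for the output of some regression line:

```python
import math


def fit_metrics(y, y_hat, n_params=2):
    """Return RSS, RSE, TSS and R^2 for observed y and fitted values y_hat.

    n_params is the number of estimated coefficients (2 for a line:
    intercept and slope), which gives the n - 2 in the RSE.
    """
    n = len(y)
    y_mean = sum(y) / n
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    tss = sum((yi - y_mean) ** 2 for yi in y)
    rse = math.sqrt(rss / (n - n_params))
    r_squared = (tss - rss) / tss
    return {"RSS": rss, "RSE": rse, "TSS": tss, "R2": r_squared}


# Made-up observations and hypothetical fitted values from some line.
y = [3.0, 5.0, 7.0, 9.0]
y_hat = [3.5, 4.5, 7.5, 8.5]
metrics = fit_metrics(y, y_hat)
print(metrics)
```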


### Notes
- e.g. R^2 = 0.9: the model explains 90% of the variability in Y through the predictors.
- TSS - RSS is the reduction in training error achieved by the model; the bigger the drop, the better the fit on the training data.
- If the p-value is high (say above 0.05), we cannot reject h0: the data are consistent with there being no relationship. The smaller the p-value, the stronger the evidence against h0.



## Multiple Linear Regression

- Y = a + bX1 + cX2 + dX3 + ... + e

### Estimating model for multiple regression

- We minimise RSS.
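
With several predictors, minimising the RSS leads to the normal equations (X^T X) beta = X^T y, where X is the data matrix with a leading column of 1s for the intercept. The sketch below solves them in plain Python with Gaussian elimination; the helper names and the data are illustrative, and it assumes X^T X is invertible (no perfectly collinear predictors). In practice a numerical library would normally do this step.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build the augmented matrix [a | b] so the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_multiple_ols(rows, y):
    """Least squares coefficients [intercept, b1, ..., bp] via the normal equations."""
    design = [[1.0] + list(r) for r in rows]  # leading 1 for the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Made-up data generated exactly from y = 1 + 2*x1 + 3*x2 (no noise).
rows = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 5.0)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]
coefficients = fit_multiple_ols(rows, y)
print(coefficients)
```

Because the toy data contain no noise, the recovered coefficients should match the generating ones up to rounding error.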



## More Questions

1. Is at least 1 predictor useful?
    - We use the F-statistic: F = ((TSS - RSS) / p) / (RSS / (n-p-1))
    - p is the number of predictors (coefficients of the X variables; the intercept is not counted)
    - n is the sample size
    - F = explained variance per predictor / unexplained variance per residual degree of freedom
    - If F is close to 1, it is likely that no predictors are useful; an F much larger than 1 is evidence that at least one is.

2. Deciding on Important Variables
    - Forward selection
        - Begin with null model (a model that contains y intercept but no predictors)
        - We then fit a simple linear regression for each of the p predictors and add to the null model the variable that results in the lowest RSS.
        - Then add the variable that gives the lowest RSS among all two-variable models containing the first one.
        - Continue until some stopping rule is satisfied, for example when all remaining variables have a p-value above some threshold.
    - Backward selection
        - Begin with all variables in the model.
        - Remove the variable with the largest p-value - that is the variable that is the least statistically significant.
        - Continue until a stopping rule is reached.
        - This can't be used if p > n, because the full least squares model cannot be fit. Forward selection can always be used.

3. Model Fit
    - RSE and R-squared are the most common measures of model fit.
    - In simple linear regression, R^2 is the square of the correlation between the response and the predictor.
    - In multiple linear regression, R^2 = Cor(Y, ^Y)^2 (the square of the correlation between the response and the fitted values).
    - R^2 never decreases on training data when more variables are added, because the RSS never increases. (This may not hold for test data.)

4. Predictions 
    - Y = f(X) + e
    - We can find a 95% confidence interval for the true value of f(X).
    - We can also find a 95% prediction interval for an individual response Y. This includes both the error in the estimate of f(X) and the irreducible noise, which is why it is wider than the confidence interval.
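
The F-statistic from question 1 above can be computed from the observed and fitted values alone. A minimal sketch, using a one-predictor least squares fit on made-up data so that the block stays self-contained; the function names are illustrative:

```python
def f_statistic(y, y_hat, p):
    """F = ((TSS - RSS) / p) / (RSS / (n - p - 1)) for a model with p predictors."""
    n = len(y)
    y_mean = sum(y) / n
    tss = sum((yi - y_mean) ** 2 for yi in y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    return ((tss - rss) / p) / (rss / (n - p - 1))


def fit_line(x, y):
    """Least squares (intercept, slope) for a single predictor."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sxx
    return y_mean - slope * x_mean, slope


# Toy data lying close to y = 1 + 2x (made up for illustration).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.9, 5.2, 6.8, 9.1, 11.2, 12.8]
intercept, slope = fit_line(x, y)
y_hat = [intercept + slope * xi for xi in x]
f_value = f_statistic(y, y_hat, p=1)
print(f_value)
```

Here the predictor explains almost all of the variation, so F comes out far above 1. Turning F into a p-value requires the F-distribution with p and n - p - 1 degrees of freedom, which is not implemented in this sketch.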



## Qualitative Predictors

- There can be cases where we have qualitative predictors.

- If the qualitative predictor has only 2 values (i.e. it is binary)
    - We can transform it into a quantitative variable using 1s and 0s.
    ![](../images/qualitative_predictors.png)

- What if the predictor has more than 2 values?
    - Create additional dummy variables.
    - x1 = 1 if Asian, 0 if not Asian
    - x2 = 1 if African, 0 if not African
    - If neither x1 nor x2 is 1, then American.
    - With k levels, we create k - 1 dummy variables.
    - The baseline is the level with no dummy variable, in this case American. The intercept is the predicted value for the baseline level.
    - Comparisons are made against the baseline.
    - The choice of baseline does not change the fit (the RSS is the same), but the contrasts change, and so do the coefficient p-values. e.g. a coefficient of +10 on x1 means Asian is 10 higher than American.
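
The k - 1 dummy-variable encoding described above can be sketched as follows. The function name `dummy_encode` and the sample values are illustrative:

```python
def dummy_encode(values, baseline):
    """Encode a qualitative predictor with k levels as k - 1 dummy (0/1) columns.

    The baseline level gets no column: it is the row where every dummy is 0.
    """
    # Keep the non-baseline levels in order of first appearance.
    levels = list(dict.fromkeys(v for v in values if v != baseline))
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


ethnicity = ["Asian", "African", "American", "Asian", "American"]
levels, rows = dummy_encode(ethnicity, baseline="American")
print(levels)
for value, row in zip(ethnicity, rows):
    print(value, row)
```

Passing a different `baseline` changes which level maps to the all-zero row, and therefore what each coefficient is compared against.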


## Assumptions
- The relationship between the response and the predictors is additive and linear.
- Additive: the association between a predictor X and the response Y does not depend on the values of the other predictors. (In practice this often does not hold.)
- Linear: a 1-unit increase in X always changes Y by the same amount, regardless of the value of X.

### Removing the Additive Assumption
- standard linear model with 2 variables: Y = a + bx1 + cx2 + e
- add an interaction term
    - Y = a + bx1 + cx2 + dx1x2 + e
- We can also have an interaction term between a qualitative and a quantitative variable.
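
Regrouping the two-variable interaction model shows why it is no longer additive: the effect of x1 on Y now depends on x2.

\[
Y = a + (b + d x_2) x_1 + c x_2 + e
\]

A 1-unit increase in x1 changes Y by b + d x2, so d measures how much the slope of x1 changes per unit of x2.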

### Removing the Linear Assumption
- We can add a polynomial term as an extra variable.
- y = a + bx1 + cx1^2 + e
- The model is still linear in the coefficients, so least squares still applies.


### Correlation of Error Terms
- We assume that the error terms are uncorrelated. If they are in fact correlated, the estimated standard errors will underestimate the true standard errors, and the actual 95% confidence intervals will be wider than the ones we calculated.

### Non-constant Variance of Error terms
- Another assumption of the linear regression model is that the error terms have a constant variance, Var(ei) = σ^2. (Non-constant variance is called heteroscedasticity.)

### High Leverage Points (HLP)
- Outliers: unusual value of y given the predictor x.
- High leverage points: unusual value of the predictor x.
- It's harder to spot HLPs with more dimensions (predictors).
- There is a formula to calculate the leverage.
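
For simple linear regression the leverage of observation i is h_i = 1/n + (x_i - x_mean)^2 / sum_j (x_j - x_mean)^2, and the average leverage is (p + 1)/n. A minimal sketch with made-up data; the function name and the "more than twice the average" cutoff are assumptions for illustration (a common rule of thumb, not a rule from the notes):

```python
def leverage_simple(x):
    """Leverage h_i of each observation in a simple linear regression.

    h_i = 1/n + (x_i - x_mean)^2 / sum_j (x_j - x_mean)^2
    """
    n = len(x)
    x_mean = sum(x) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    return [1.0 / n + (xi - x_mean) ** 2 / sxx for xi in x]


# Made-up predictor values; the last one is far from the rest.
x = [1.0, 2.0, 3.0, 4.0, 20.0]
h = leverage_simple(x)
average_leverage = 2 / len(x)  # (p + 1) / n with p = 1 predictor
for xi, hi in zip(x, h):
    # Assumed rule of thumb: flag points above twice the average leverage.
    flag = "high" if hi > 2 * average_leverage else ""
    print(xi, round(hi, 3), flag)
```

Each leverage lies between 1/n and 1, and together they sum to p + 1, so one very large value necessarily stands out against the others.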

### Collinearity
- Two or more predictor variables are closely related to one another.
- VIFj = 1 / (1 - Rj^2), where Rj^2 is the R^2 from regressing Xj on all the other predictors
    - VIFj = 1: no correlation with others
    - VIFj > 5: moderate collinearity
    - VIFj > 10: severe multicollinearity
- It becomes difficult for the regression model to tell which variable is actually responsible for the change in the response y.
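
With exactly two predictors, regressing one on the other is a simple regression, so Rj^2 is just their squared correlation and the VIF is easy to compute by hand. The sketch below covers only that special case; with more predictors, Rj^2 would come from a multiple regression of Xj on all the others. Function names and data are made up for illustration.

```python
def correlation(u, v):
    """Pearson correlation between two equal-length sequences."""
    n = len(u)
    u_mean = sum(u) / n
    v_mean = sum(v) / n
    cov = sum((a - u_mean) * (b - v_mean) for a, b in zip(u, v))
    var_u = sum((a - u_mean) ** 2 for a in u)
    var_v = sum((b - v_mean) ** 2 for b in v)
    return cov / (var_u * var_v) ** 0.5


def vif_two_predictors(x1, x2):
    """VIF when there are exactly two predictors.

    Regressing x1 on x2 is a simple regression, so Rj^2 is the squared
    correlation and both predictors share the same VIF.
    """
    r_squared = correlation(x1, x2) ** 2
    return 1.0 / (1.0 - r_squared)


# Made-up predictors: x2 is nearly a multiple of x1, x3 is not.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
x3 = [3.0, 1.0, 4.0, 1.0, 5.0, 2.0]
print(vif_two_predictors(x1, x2))  # large: strongly collinear
print(vif_two_predictors(x1, x3))  # close to 1: weakly related
```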




## Comparison of Linear Regression and K-Nearest Neighbors
This section compares **Linear Regression (parametric)** and **K-Nearest Neighbors (KNN) (non-parametric)** regression methods.

Both aim to estimate the relationship:
\[
Y = f(X_1, X_2, \dots, X_p) + \epsilon
\]
but use very different approaches.

## Linear Regression

### Concept
Assumes the relationship between predictors and response is **linear**:
\[
f(X) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p
\]

### Characteristics
- **Parametric**: Specifies a fixed form for \( f(X) \)
- **Low variance**, but potentially **high bias** when the true relationship is not linear
- Performs well if the true relationship is approximately linear
- Easy to interpret, but cannot capture complex nonlinear patterns

### Pros
✅ Simple and interpretable  
✅ Efficient with small data  
✅ Degrades less than KNN as the number of predictors grows  

### Cons
❌ High bias if the true relationship is nonlinear  
❌ Limited flexibility

## K-Nearest Neighbors (KNN)

### Concept
A **non-parametric** method: no assumption about the form of \( f(X) \).

For a test point \( x_0 \):
1. Identify the **K nearest** observations in the training set.
2. Predict:
   \[
   \hat{f}(x_0) = \frac{1}{K} \sum_{i \in \mathcal{N}_K(x_0)} y_i
   \]
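
The two steps above can be sketched in a few lines of plain Python. Euclidean distance is an assumption here (the notes do not name a distance), and the function name and data are illustrative:

```python
import math


def knn_predict(train_x, train_y, x0, k):
    """Predict the response at x0 as the mean of the k nearest training responses."""
    # Euclidean distance from x0 to every training point, paired with its response.
    distances = [(math.dist(xi, x0), yi) for xi, yi in zip(train_x, train_y)]
    distances.sort(key=lambda pair: pair[0])
    nearest = distances[:k]
    return sum(yi for _, yi in nearest) / k


# Made-up one-dimensional training set following y = x^2.
train_x = [(1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
train_y = [1.0, 4.0, 9.0, 16.0, 25.0]
print(knn_predict(train_x, train_y, (2.6,), k=1))
print(knn_predict(train_x, train_y, (2.6,), k=3))
```

With k=1 the prediction is simply the response of the closest point; a larger k averages over more neighbours, which smooths the prediction at the cost of some bias.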

### Characteristics
- **Flexible** and can adapt to complex patterns
- **Low bias**, **high variance** (especially for small K)
- Sensitive to the choice of K and dimensionality of data

### Pros
✅ Captures nonlinear relationships  
✅ No need to specify functional form  

### Cons
❌ Poor interpretability  
❌ Sensitive to noise and irrelevant features  
❌ Suffers from the **curse of dimensionality**

## ⚖️ Bias–Variance Tradeoff

| Model | Bias | Variance | Flexibility |
|--------|------|-----------|--------------|
| Linear Regression | High | Low | Low |
| KNN (small K) | Low | High | High |
| KNN (large K) | Higher | Lower | Less flexible |

**Goal:** Minimize the **Test Mean Squared Error (MSE)** by balancing bias and variance.

\[
\text{Test MSE} = [\text{Bias}(\hat{f}(x_0))]^2 + \text{Var}(\hat{f}(x_0)) + \text{Var}(\epsilon)
\]

## 📊 Performance Insights

- **Linear Regression**: better when the true relationship is roughly linear or data are limited.  
- **KNN Regression**: better when the relationship is complex and nonlinear, and you have lots of data.  
- In **high dimensions**, KNN performs poorly due to sparse data (curse of dimensionality).
- As the number of predictors p increases, the MSE for KNN increases significantly.