## Purpose of this Notebook

This notebook contains theoretical foundations of linear models for Machine Learning 1.
The focus is on mathematical understanding, statistical inference, and structured derivations.
Implementation examples are provided in separate notebooks.

# Machine Learning 1 — Linear Models

**Author:** Elena Fuchs  
**Course:** ML1  
**Topic:** Linear Regression Foundations  

---

# What is a Linear Model?

A linear model assumes that a response variable $Y$ depends linearly on one or more predictors. With a single predictor $X$, the model is:

$$
Y = \beta_0 + \beta_1 X + \varepsilon
$$

Where:

- $\beta_0$ = intercept  
- $\beta_1$ = slope  
- $\varepsilon$ = error term  

For $n$ observations, indexed by $i = 1, \dots, n$:

$$
Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i
$$

---

# Multiple Linear Regression

If we have multiple predictors $X_1, X_2, \dots, X_p$, the model becomes:

$$
Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \dots + \beta_p X_{ip} + \varepsilon_i
$$

Or more compactly:

$$
Y_i = \beta_0 + \sum_{j=1}^{p} \beta_j X_{ij} + \varepsilon_i
$$

---

# Matrix Form

Linear regression can be written in matrix notation as:

$$
\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}
$$

Where:

$$
\mathbf{y} =
\begin{pmatrix}
Y_1 \\
Y_2 \\
\vdots \\
Y_n
\end{pmatrix},
\quad
\mathbf{X} =
\begin{pmatrix}
1 & X_{11} & \dots & X_{1p} \\
1 & X_{21} & \dots & X_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
1 & X_{n1} & \dots & X_{np}
\end{pmatrix},
\quad
\boldsymbol{\beta} =
\begin{pmatrix}
\beta_0 \\
\beta_1 \\
\vdots \\
\beta_p
\end{pmatrix}
$$
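
As a small illustration of this notation, the design matrix can be built by prepending a column of ones to the raw predictors. The sketch below uses NumPy; the data values and the coefficient vector `beta` are made up for the example and are not taken from the course material.

```python
import numpy as np

# Hypothetical raw predictors: n = 4 observations, p = 2 predictors.
X_raw = np.array([
    [1.0, 2.0],
    [2.0, 0.5],
    [3.0, 1.5],
    [4.0, 3.0],
])
n, p = X_raw.shape

# Prepend a column of ones so that beta_0 acts as the intercept.
X = np.column_stack([np.ones(n), X_raw])

# Hypothetical coefficient vector (beta_0, beta_1, beta_2).
beta = np.array([0.5, 2.0, -1.0])

# Noise-free part of the model: one value of X @ beta per observation.
mean_y = X @ beta

print(X.shape)   # (n, p + 1)
print(mean_y)
```

Each row of `X` corresponds to one observation, so `X @ beta` evaluates $\beta_0 + \sum_j \beta_j X_{ij}$ for all $i$ at once.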

---

# Estimation via Ordinary Least Squares (OLS)

The parameters $\boldsymbol{\beta}$ are estimated by minimizing the sum of squared residuals:

$$
\min_{\boldsymbol{\beta}} \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2
$$

For a candidate $\boldsymbol{\beta}$, the fitted values are:

$$
\hat{Y}_i = \beta_0 + \sum_{j=1}^{p} \beta_j X_{ij}
$$

so the objective becomes:

$$
\min_{\boldsymbol{\beta}} \sum_{i=1}^{n} 
\left( Y_i - \beta_0 - \sum_{j=1}^{p} \beta_j X_{ij} \right)^2
$$
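
A brief sketch of the minimization step, using the matrix notation from the previous section: the objective is the residual sum of squares

$$
\text{RSS}(\boldsymbol{\beta}) =
(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})
$$

and setting its gradient with respect to $\boldsymbol{\beta}$ to zero gives the normal equations:

$$
\nabla_{\boldsymbol{\beta}} \text{RSS}(\boldsymbol{\beta})
= -2\, \mathbf{X}^\top (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) = \mathbf{0}
\quad \Longrightarrow \quad
\mathbf{X}^\top \mathbf{X}\, \boldsymbol{\beta} = \mathbf{X}^\top \mathbf{y}
$$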

Provided $\mathbf{X}^\top \mathbf{X}$ is invertible, the closed-form OLS solution is:

$$
\hat{\boldsymbol{\beta}} = 
(\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}
$$
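
The formula can be checked numerically. The following is a minimal sketch on simulated data (the sample size, coefficients, and noise level are assumptions chosen for the demo). It solves the linear system $\mathbf{X}^\top \mathbf{X} \boldsymbol{\beta} = \mathbf{X}^\top \mathbf{y}$ rather than forming the inverse explicitly, and compares the result with NumPy's least-squares routine.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: n observations, p predictors (values are assumptions).
n, p = 200, 3
X_raw = rng.normal(size=(n, p))
X = np.column_stack([np.ones(n), X_raw])          # add intercept column
beta_true = np.array([1.0, 2.0, -0.5, 0.0])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Closed-form OLS: solve (X^T X) beta = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from NumPy's least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("normal equations:", np.round(beta_hat, 3))
print("np.linalg.lstsq: ", np.round(beta_lstsq, 3))
```

In practice a least-squares solver such as `np.linalg.lstsq` is usually preferred over the explicit inverse, since it behaves better when $\mathbf{X}^\top \mathbf{X}$ is close to singular.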

---

# Residuals

The residual for observation $i$ is:

$$
e_i = Y_i - \hat{Y}_i
$$

In vector form:

$$
\mathbf{e} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}
$$
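
A short numerical sketch of the residual vector, using a made-up toy data set with one predictor. It also illustrates a consequence of the OLS fit: the residuals are orthogonal to every column of $\mathbf{X}$, so with an intercept column they sum to zero (up to floating-point error).

```python
import numpy as np

# Hypothetical toy data: one predictor, five observations.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

X = np.column_stack([np.ones_like(x), x])   # design matrix with intercept
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ beta_hat    # fitted values
e = y - y_hat           # residual vector e = y - X beta_hat

print("beta_hat:", np.round(beta_hat, 3))
print("residuals:", np.round(e, 3))
print("sum of residuals:", e.sum())   # numerically close to zero
print("X^T e:", X.T @ e)              # numerically close to the zero vector
```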

---

# Core Assumptions

For classical linear regression inference to hold:

1. **Linearity**
   $$
   \mathbb{E}[Y_i \mid X_i] = \beta_0 + \sum_{j=1}^{p} \beta_j X_{ij}
   $$

2. **Independence (uncorrelated errors)**
   $$
   \text{Cov}(\varepsilon_i, \varepsilon_j) = 0 \quad \text{for } i \neq j
   $$

3. **Homoscedasticity**
   $$
   \text{Var}(\varepsilon_i \mid X_i) = \sigma^2
   $$

4. **Normality (for inference)**
   $$
   \varepsilon_i \sim \mathcal{N}(0, \sigma^2)
   $$

5. **No Perfect Multicollinearity**
   $$
   \mathbf{X}^\top \mathbf{X} \text{ is invertible}
   $$
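
To make the last assumption concrete, the sketch below builds a hypothetical design matrix in which one predictor is an exact multiple of another. The matrix then has fewer linearly independent columns than coefficients, so $\mathbf{X}^\top \mathbf{X}$ is singular and the OLS solution is not unique.

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = 2.0 * x1   # hypothetical predictor that is an exact multiple of x1

X = np.column_stack([np.ones_like(x1), x1, x2])

n_columns = X.shape[1]
rank = np.linalg.matrix_rank(X)

print("columns:", n_columns)
print("rank:   ", rank)
# rank < number of columns means X^T X cannot be inverted.
print("perfect multicollinearity:", rank < n_columns)
```

A rank check like this only detects exact linear dependence; strongly but not perfectly correlated predictors leave $\mathbf{X}^\top \mathbf{X}$ invertible while still inflating the variance of the estimates.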

---

# Interpretation

- $\beta_0$ = expected value of $Y$ when all predictors equal zero  
- $\beta_j$ = expected change in $Y$ for a one-unit increase in $X_j$, holding other predictors constant  
- $\sigma^2$ = variance of the error term  

---

# Geometric Interpretation

The OLS estimator projects $\mathbf{y}$ onto the column space of $\mathbf{X}$:

$$
\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}
$$

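This is the orthogonal projection of $\mathbf{y}$ onto the linear subspace spanned by the columns of $\mathbf{X}$ (the intercept column and the predictors), so the residual vector $\mathbf{y} - \hat{\mathbf{y}}$ is orthogonal to that subspace.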
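
The projection can be written as $\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}$ with $\mathbf{H} = \mathbf{X}(\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top$. The name $\mathbf{H}$ (often called the hat matrix) is introduced here for the example and does not appear elsewhere in this notebook. The sketch below checks, on simulated data, the properties that characterize an orthogonal projection.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data (sizes are assumptions chosen for the demo).
n, p = 20, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = rng.normal(size=n)

# H = X (X^T X)^{-1} X^T, formed explicitly only for illustration.
H = X @ np.linalg.solve(X.T @ X, X.T)

y_hat = H @ y     # projection of y onto the column space of X
e = y - y_hat     # component of y orthogonal to that space

print("symmetric: ", np.allclose(H, H.T))
print("idempotent:", np.allclose(H @ H, H))
print("trace equals p + 1:", np.isclose(np.trace(H), p + 1))
print("y_hat orthogonal to e:", np.isclose(y_hat @ e, 0.0))
```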

---

# Key Takeaway

Linear regression is a foundational model in statistical learning:

- It formalizes prediction as a projection problem  
- It connects geometry, probability, and optimization  
- It underlies many machine learning models  

---

# Statistical Inference in Linear Regression

After estimating the model

$$
\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}
$$

we often want to test hypotheses about the parameters.

---

# 1. $t$-Tests for Individual Coefficients

To test whether a single coefficient $\beta_j$ differs from zero:

## Null and Alternative Hypotheses

$$
H_0: \beta_j = 0
$$

$$
H_1: \beta_j \neq 0
$$

---

## Test Statistic

The $t$-statistic is defined as:

$$
t_j = \frac{\hat{\beta}_j}{\operatorname{SE}(\hat{\beta}_j)}
$$

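where, with the rows and columns of $(\mathbf{X}^\top \mathbf{X})^{-1}$ indexed from $0$ to $p$ to match $\boldsymbol{\beta}$,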

$$
\operatorname{SE}(\hat{\beta}_j)
=
\sqrt{\hat{\sigma}^2 \left[(\mathbf{X}^\top \mathbf{X})^{-1}\right]_{jj}}
$$

and

$$
\hat{\sigma}^2 =
\frac{1}{n - p - 1}
\sum_{i=1}^{n} e_i^2
$$

Under the null hypothesis and the normality assumption:

$$
t_j \sim t_{n - p - 1}
$$
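
The quantities above can be computed directly with NumPy. This is a minimal sketch on simulated data in which the second predictor has a true coefficient of zero; the data-generating values are assumptions for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: the second predictor has no effect (an assumption).
n, p = 100, 2
X_raw = rng.normal(size=(n, p))
X = np.column_stack([np.ones(n), X_raw])
beta_true = np.array([1.0, 2.0, 0.0])
y = X @ beta_true + rng.normal(size=n)

# OLS fit and residuals.
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
e = y - X @ beta_hat

# Unbiased estimate of the error variance with n - p - 1 degrees of freedom.
df = n - p - 1
sigma2_hat = e @ e / df

# Standard errors from the diagonal of (X^T X)^{-1}, then t-statistics.
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))
t_stats = beta_hat / se

for j in range(p + 1):
    print(f"beta_{j}: estimate={beta_hat[j]:.3f}, SE={se[j]:.3f}, t={t_stats[j]:.2f}")
```

If SciPy is available, a two-sided p-value for each coefficient could be obtained as `2 * scipy.stats.t.sf(np.abs(t_stats), df)`.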

---

# 2. $F$-Test for Overall Model Significance

We test whether *all* predictors jointly have no effect:

$$
H_0: \beta_1 = \beta_2 = \dots = \beta_p = 0
$$

$$
H_1: \text{At least one } \beta_j \neq 0
$$

---

## $F$-Statistic

The $F$-statistic compares the reduction in the residual sum of squares achieved by the predictors, per predictor, to the residual variance estimate of the full model:

$$
F =
\frac{
\left( \text{RSS}_0 - \text{RSS}_1 \right) / p
}{
\text{RSS}_1 / (n - p - 1)
}
$$

Where:

- $\text{RSS}_0$ = residual sum of squares of the null (intercept-only) model, $\sum_{i=1}^{n} (Y_i - \bar{Y})^2$  
- $\text{RSS}_1$ = residual sum of squares of the full model  

Under $H_0$ and the normality assumption:

$$
F \sim F_{p,\, n - p - 1}
$$
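
A minimal sketch of the overall $F$-statistic on simulated data (the coefficients and sample size are assumptions for the demo). The null model is fitted by the sample mean of $\mathbf{y}$, and the full model by least squares.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data with p = 3 predictors (values are assumptions).
n, p = 100, 3
X_raw = rng.normal(size=(n, p))
X = np.column_stack([np.ones(n), X_raw])
y = 1.0 + X_raw @ np.array([0.5, 0.0, -0.3]) + rng.normal(size=n)

# Full model: all p predictors plus intercept.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
rss1 = np.sum((y - X @ beta_hat) ** 2)

# Null model: intercept only, so the fitted value is the sample mean.
rss0 = np.sum((y - y.mean()) ** 2)

F = ((rss0 - rss1) / p) / (rss1 / (n - p - 1))

print(f"RSS_0={rss0:.2f}, RSS_1={rss1:.2f}, F={F:.2f}")
```

If SciPy is available, the corresponding p-value could be computed as `scipy.stats.f.sf(F, p, n - p - 1)`.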

---

# Nested Model Comparison

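More generally, to compare two nested models:

- Reduced model: $\mathcal{M}_R$, with residual sum of squares $\text{RSS}_R$ and residual degrees of freedom $df_R$
- Full model: $\mathcal{M}_F$, with residual sum of squares $\text{RSS}_F$ and residual degrees of freedom $df_F$

The statistic is: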

$$
F =
\frac{
(\text{RSS}_R - \text{RSS}_F)/(df_R - df_F)
}{
\text{RSS}_F/df_F
}
$$
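
The formula translates into a small helper function. The function name `nested_f_statistic` and the numbers in the usage example are hypothetical. The sketch assumes that `df_r` and `df_f` are residual degrees of freedom (number of observations minus number of estimated coefficients), which makes the overall $F$-test the special case $df_R = n - 1$, $df_F = n - p - 1$.

```python
def nested_f_statistic(rss_r, df_r, rss_f, df_f):
    """F statistic for a reduced model nested in a full model.

    df_r and df_f are residual degrees of freedom, so df_r > df_f.
    """
    if df_r <= df_f:
        raise ValueError("the reduced model must have more residual degrees of freedom")
    numerator = (rss_r - rss_f) / (df_r - df_f)   # RSS reduction per added coefficient
    denominator = rss_f / df_f                    # residual variance of the full model
    return numerator / denominator


# Hypothetical numbers: n = 100, reduced model with 3 coefficients,
# full model with 5 coefficients.
F = nested_f_statistic(rss_r=120.0, df_r=97, rss_f=100.0, df_f=95)
print(F)   # (20 / 2) / (100 / 95), approximately 9.5
```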

---

# 3. Interaction Terms

An interaction term allows the effect of one variable to depend on another.

For two predictors $X_1$ and $X_2$:

$$
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 (X_1 X_2) + \varepsilon
$$

---

## Interpretation

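The marginal effect of $X_1$ on the expected response is:

$$
\frac{\partial\, \mathbb{E}[Y \mid X_1, X_2]}{\partial X_1}
=
\beta_1 + \beta_3 X_2
$$

The effect of $X_1$ now depends on $X_2$. In particular, $\beta_1$ alone is the effect of $X_1$ only when $X_2 = 0$.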
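
A small worked example of this dependence, with hypothetical coefficient values chosen only for illustration. Because the mean function is linear in $X_1$ for fixed $X_2$, a one-unit increase in $X_1$ changes the expected response by exactly $\beta_1 + \beta_3 X_2$.

```python
def mean_response(x1, x2, b0, b1, b2, b3):
    """Conditional mean of Y in the two-predictor interaction model."""
    return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2


def marginal_effect_x1(x2, b1, b3):
    """Derivative of the conditional mean with respect to x1."""
    return b1 + b3 * x2


# Hypothetical coefficients (beta_0, beta_1, beta_2, beta_3).
b0, b1, b2, b3 = 1.0, 2.0, 0.5, -0.5

for x2 in (0.0, 2.0, 4.0):
    effect = marginal_effect_x1(x2, b1, b3)
    # Change in the mean when x1 goes from 0 to 1 at this value of x2.
    diff = mean_response(1.0, x2, b0, b1, b2, b3) - mean_response(0.0, x2, b0, b1, b2, b3)
    print(f"x2={x2}: marginal effect={effect}, one-unit change={diff}")
```

With these values the effect of $X_1$ is positive at $X_2 = 0$, shrinks as $X_2$ grows, and vanishes at $X_2 = 4$.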

---

## Principle of Marginality

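If an interaction term $X_1 X_2$ is included, the main effects $X_1$ and $X_2$ should also remain in the model, even if their individual coefficients are not statistically significant. Without them, the fitted model is no longer invariant to shifting the origin of the predictors.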

---

# 4. Bias–Variance Decomposition

Suppose the true data-generating process is:

$$
Y = f(X) + \varepsilon
$$

with

$$
\varepsilon \sim \mathcal{N}(0, \sigma^2)
$$

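The decomposition below only requires $\mathbb{E}[\varepsilon] = 0$ and $\operatorname{Var}(\varepsilon) = \sigma^2$; normality is not needed.

Let $\hat{f}(x_0)$ denote the prediction at a point $x_0$, and let $Y$ be a new response observed at $X = x_0$, independent of the training data.

The expected prediction error, taken over both the training data and the new noise term, can be decomposed as: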

$$
\mathbb{E} \left[ (Y - \hat{f}(x_0))^2 \right]
=
\underbrace{\left( \operatorname{Bias}[\hat{f}(x_0)] \right)^2}_{\text{Bias}^2}
+
\underbrace{\operatorname{Var}[\hat{f}(x_0)]}_{\text{Variance}}
+
\underbrace{\sigma^2}_{\text{Irreducible Error}}
$$

---

## Bias

$$
\operatorname{Bias}[\hat{f}(x_0)]
=
\mathbb{E}[\hat{f}(x_0)] - f(x_0)
$$

Bias measures systematic error: how far the average prediction, taken over training sets, lies from the true value $f(x_0)$.

---

## Variance

$$
\operatorname{Var}[\hat{f}(x_0)]
=
\mathbb{E} \left[
\left( \hat{f}(x_0) - \mathbb{E}[\hat{f}(x_0)] \right)^2
\right]
$$

Variance measures how much the prediction at $x_0$ changes from one training set to another.

---

# Trade-Off

- Simple models → typically high bias, low variance  
- Complex models → typically low bias, high variance  

Prediction error is minimized at the model complexity that balances the two.
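
The decomposition and the trade-off can be illustrated with a small Monte Carlo experiment. Everything in this sketch is an assumption made for the demo: the true function $f(x) = \sin(2\pi x)$, the fixed design on $[0, 1]$, the noise level, and the two polynomial degrees being compared. Many training sets are simulated, each model is refitted on every one, and the bias, variance, and prediction error at $x_0$ are estimated empirically.

```python
import numpy as np

rng = np.random.default_rng(42)


def f(x):
    # Hypothetical true regression function (an assumption for this demo).
    return np.sin(2 * np.pi * x)


n, sigma, reps = 30, 0.3, 4000
x = np.linspace(0.0, 1.0, n)   # fixed design points
x0 = 0.25                      # prediction point

# Each column of Y is one simulated training response vector.
Y = f(x)[:, None] + rng.normal(scale=sigma, size=(n, reps))
# One independent new response at x0 per training set.
y0_new = f(x0) + rng.normal(scale=sigma, size=reps)

results = {}
for degree in (1, 5):
    V = np.vander(x, degree + 1, increasing=True)       # polynomial design matrix
    v0 = np.vander([x0], degree + 1, increasing=True)   # features of x0
    coefs, *_ = np.linalg.lstsq(V, Y, rcond=None)       # one OLS fit per column of Y
    preds = (v0 @ coefs).ravel()                        # f_hat(x0) for each training set

    bias_sq = (preds.mean() - f(x0)) ** 2
    variance = preds.var()
    mse = np.mean((y0_new - preds) ** 2)                # empirical prediction error
    results[degree] = (bias_sq, variance, mse)

    print(f"degree={degree}: bias^2={bias_sq:.4f}, variance={variance:.4f}, "
          f"bias^2+variance+sigma^2={bias_sq + variance + sigma**2:.4f}, "
          f"empirical error={mse:.4f}")
```

The straight-line fit cannot follow the curvature of $f$ and so carries a large bias at $x_0$, while the degree-5 polynomial has little bias but a larger variance. In both cases the sum of the three components should be close to the empirical prediction error, up to Monte Carlo noise.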

---

# Big Picture

Linear regression connects:

- Optimization  
- Probability  
- Geometry  
- Statistical inference  
- Machine learning theory  

It is the foundation for:

- Logistic regression  
- Ridge / Lasso  
- Generalized linear models  
- Neural networks (linear layers)  
- Deep learning  

---