# Linear Regression



Each data point $\mathbf{x}_i \in \mathbb{R}^m$ is a vector of $m$ features with an associated output $y_i$. We stack the $n$ data points into a feature matrix $\mathbf{X} \in \mathbb{R}^{n \times (m+1)}$, where the extra column of ones accounts for the intercept term. Assuming the relationship between the features and the output is approximately linear, we can predict $y$ from a given feature vector $\mathbf{x}$ with a linear model:

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_mx_m+ \epsilon $$

Indexing by data point, $x_{ij}$ denotes feature $j$ of data point $i$. For the second data point, for example:

$$ y_2 = \beta_0 + \beta_1 x_{21} + \beta_2 x_{22} + \dots + \beta_mx_{2m} + \epsilon_2 $$

For multiple data points:

$$
\begin{aligned}
&\text{First data point:} \quad y_1 = \beta_0 + \beta_1 x_{11} + \beta_2 x_{12} + \dots + \beta_mx_{1m} + \epsilon_1 \\
&\text{Second data point:} \quad y_2 = \beta_0 + \beta_1 x_{21} + \beta_2 x_{22} + \dots + \beta_mx_{2m} + \epsilon_2 \\
&\text{Third data point:} \quad y_3 = \beta_0 + \beta_1 x_{31} + \beta_2 x_{32} + \dots + \beta_mx_{3m} + \epsilon_3
\end{aligned}
$$

and so on, up to the $n$-th data point.

## Matrix Representation

Stacking these $n$ equations gives the model in matrix notation:

$$ \mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon} $$

where:
- $\mathbf{Y} \in \mathbb{R}^{n}$ is the vector of observed values,
- $\mathbf{X} \in \mathbb{R}^{n \times (m+1)}$ is the matrix of input features, with a leading column of ones,
- $\boldsymbol{\beta} \in \mathbb{R}^{m+1}$ is the vector of coefficients, including the intercept $\beta_0$,
- $\boldsymbol{\epsilon} \in \mathbb{R}^{n}$ is the vector of error terms.
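
As a minimal sketch of the matrix form above, the snippet below builds $\mathbf{X}$ with its column of ones and evaluates $\mathbf{X}\boldsymbol{\beta}$ using NumPy. The toy data and the coefficient values are made up purely for illustration.

```python
import numpy as np

# Hypothetical toy data: n = 4 data points, m = 2 features each.
features = np.array([
    [1.0, 2.0],
    [2.0, 0.0],
    [3.0, 1.0],
    [4.0, 3.0],
])
n, m = features.shape

# Prepend a column of ones so that beta[0] acts as the intercept.
X = np.column_stack([np.ones(n), features])  # shape (n, m + 1)

# Illustrative coefficients (beta_0, beta_1, beta_2), chosen arbitrarily.
beta = np.array([0.5, 2.0, -1.0])

# Noise-free part of the model: one predicted value per data point.
y_hat = X @ beta
print(X.shape, y_hat)
```

Each entry of `y_hat` equals $\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}$ for the corresponding row, which is the per-point equation written out earlier without the error term.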

## Error in Prediction

For each data point:

$$ e_i = y_i - \hat{y}_i $$

where $\hat{y}_i$ is the value predicted by the model, i.e. the $i$-th entry of $\mathbf{X}\boldsymbol{\beta}$.

## Residual Sum of Squares (RSS)

The residual sum of squares (RSS) is the sum of the squared prediction errors over all $n$ data points:

$$ RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$

Since the predictions are $\hat{\mathbf{Y}} = \mathbf{X}\boldsymbol{\beta}$, RSS is a function of $\boldsymbol{\beta}$:

$$ RSS = (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})^T (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}) $$
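
The two expressions for RSS can be checked against each other numerically. The sketch below assumes NumPy; the function names `rss_sum` and `rss_matrix` and the small dataset are hypothetical, chosen only to illustrate the equivalence.

```python
import numpy as np

def rss_sum(X, Y, beta):
    """RSS as an explicit sum of squared residuals."""
    residuals = Y - X @ beta
    return float(np.sum(residuals ** 2))

def rss_matrix(X, Y, beta):
    """RSS as the quadratic form (Y - X beta)^T (Y - X beta)."""
    r = Y - X @ beta
    return float(r @ r)

# Illustrative data: intercept column plus one feature.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
Y = np.array([1.0, 3.0, 4.0])
beta = np.array([1.0, 1.0])  # candidate coefficients, not the optimum

print(rss_sum(X, Y, beta), rss_matrix(X, Y, beta))
```

For these values the predictions are 1, 2, 3, so the residuals are 0, 1, 1 and both functions return 2.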

## Minimizing RSS

To minimize RSS, we take the derivative with respect to $\boldsymbol{\beta}$ and set it to zero:

$$ \frac{\partial RSS}{\partial \boldsymbol{\beta}} = -2\mathbf{X}^T(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}) = 0 $$
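
For readers who want the intermediate steps, the derivative above can be obtained by expanding the quadratic form. This sketch relies on two standard matrix-calculus identities, $\partial(\mathbf{a}^T\boldsymbol{\beta})/\partial\boldsymbol{\beta} = \mathbf{a}$ and $\partial(\boldsymbol{\beta}^T\mathbf{A}\boldsymbol{\beta})/\partial\boldsymbol{\beta} = 2\mathbf{A}\boldsymbol{\beta}$ for symmetric $\mathbf{A}$:

$$
\begin{aligned}
RSS &= \mathbf{Y}^T\mathbf{Y} - 2\boldsymbol{\beta}^T\mathbf{X}^T\mathbf{Y} + \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} \\
\frac{\partial RSS}{\partial \boldsymbol{\beta}} &= -2\mathbf{X}^T\mathbf{Y} + 2\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} = -2\mathbf{X}^T(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})
\end{aligned}
$$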

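Rearranging gives the normal equations $\mathbf{X}^T \mathbf{X} \boldsymbol{\beta} = \mathbf{X}^T \mathbf{Y}$. Solving for $\boldsymbol{\beta}$ yields the least-squares estimate:

$$ \hat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y} $$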
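
The closed-form solution can be sketched in a few lines of NumPy. The helper name `fit_normal_equations` and the data are hypothetical, and the sketch assumes $\mathbf{X}$ has full column rank. It solves the linear system with `np.linalg.solve` rather than forming the inverse explicitly, which is generally the more numerically stable choice, and cross-checks the result against `np.linalg.lstsq`.

```python
import numpy as np

def fit_normal_equations(X, Y):
    """Solve X^T X beta = X^T Y. Assumes X has full column rank."""
    return np.linalg.solve(X.T @ X, X.T @ Y)

# Illustrative data generated exactly from y = 1 + 2x (no noise).
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])  # intercept column + feature
Y = 1.0 + 2.0 * x

beta_hat = fit_normal_equations(X, Y)

# Cross-check with NumPy's built-in least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

# At the minimizer, the gradient -2 X^T (Y - X beta) should vanish.
gradient = -2.0 * X.T @ (Y - X @ beta_hat)
print(beta_hat, beta_lstsq, gradient)
```

Because the data are noise-free, the recovered coefficients should match the intercept 1 and slope 2 used to generate them, up to floating-point error.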

### Conditions for Existence

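For $(\mathbf{X}^T \mathbf{X})^{-1}$ to exist, $\mathbf{X}^T \mathbf{X}$ must be invertible, which holds exactly when $\mathbf{X}$ has full column rank (its columns are linearly independent). If it is not invertible, two common remedies are:

- **Regularization**: adding a term that makes the matrix invertible, for example ridge regression, which replaces $\mathbf{X}^T \mathbf{X}$ with $\mathbf{X}^T \mathbf{X} + \lambda \mathbf{I}$ for some $\lambda > 0$.
- **Moore-Penrose pseudoinverse**: using $\boldsymbol{\beta} = \mathbf{X}^+ \mathbf{Y}$. When $\mathbf{X}$ has full column rank, the pseudoinverse reduces to

  $$ \mathbf{X}^+ = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T $$

  In the rank-deficient case this formula does not apply, and $\mathbf{X}^+$ is instead computed from the singular value decomposition of $\mathbf{X}$.

The pseudoinverse always exists, so it yields a solution (the minimum-norm least-squares solution) even if $\mathbf{X}^T \mathbf{X}$ is not full rank.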
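
To illustrate both remedies, the sketch below constructs a deliberately rank-deficient $\mathbf{X}$ by duplicating a feature column, then fits it with `np.linalg.pinv` and with a ridge-style system. The data, the cutoffs `1e-10`, and the penalty `lam` are arbitrary illustrative choices. For simplicity the ridge term here also penalizes the intercept, which many implementations avoid.

```python
import numpy as np

# Illustrative data from y = 1 + 2x, with the feature column duplicated
# so that the columns of X are linearly dependent.
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x, x])
Y = 1.0 + 2.0 * x

# X has 3 columns but only rank 2, so X^T X is singular.
rank = np.linalg.matrix_rank(X, tol=1e-10)

# Pseudoinverse (computed via SVD): minimum-norm least-squares solution.
beta_pinv = np.linalg.pinv(X, rcond=1e-10) @ Y

# Ridge-style regularization: X^T X + lam * I is invertible for lam > 0.
lam = 1e-3  # assumed penalty strength
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)

print(rank, beta_pinv, beta_ridge)
```

Any coefficients with $\beta_1 + \beta_2 = 2$ fit these data equally well. The pseudoinverse picks the smallest-norm choice, splitting the slope evenly between the two identical columns, and the ridge solution lands close to it for small `lam`.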

---
