This notebook demonstrates that maximum likelihood (ML) estimation under Gaussian noise is equivalent to least-squares linear regression.

It also justifies MSE as a loss function for linear regression.

We will cover other loss functions in the tutorial.

# Equivalence between Maximum Likelihood and MSE in Linear Regression

In linear regression, we assume a linear relationship between input features $\mathbf{x}$ and the target variable $y$, with normally distributed errors.

Consider the dataset $\mathcal{D} = [(\mathbf{x}^{(1)}, y^{(1)}), (\mathbf{x}^{(2)}, y^{(2)}), \ldots, (\mathbf{x}^{(m)}, y^{(m)})] $.

- The target variable $ y^{(i)} $ can be written as:
     $ y^{(i)} = \mathbf{w}^T \mathbf{x}^{(i)} + b + \epsilon^{(i)} $
   - The errors $ \epsilon^{(i)} $ are i.i.d. and normally distributed, $\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$, and account for the residual error of the regression.

- The joint probability of observing the dataset $ \mathcal{D}  $ is:

    $$ P(\mathcal{D} ) = P\left((\mathbf{x}^{(1)},y^{(1)} ), (\mathbf{x}^{(2)},y^{(2)}), \ldots, (\mathbf{x}^{(m)},y^{(m)})\right) $$

- Assuming independent and identically distributed samples:
   $$ P(\mathcal{D} ) =  \prod_{i=1}^{m}P((\mathbf{x}^{(i)},y^{(i)})) $$

- Using the chain rule of probability, each joint term factors into a conditional and a marginal:
     $$ P(\mathcal{D} ) = \prod_{i=1}^{m} P(y^{(i)} | \mathbf{x}^{(i)})P(\mathbf{x}^{(i)}) $$

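- The marginal $ P(\mathbf{x}^{(i)}) $ does not depend on the parameters, so the likelihood $ L(\mathbf{w}, b) $ keeps only the conditional terms: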
     $$ L(\mathbf{w}, b) = \prod_{i=1}^{m} P(y^{(i)} | \mathbf{x}^{(i)}; \mathbf{w}, b) $$

- Since $ \epsilon^{(i)} $ is normal, each term is a normal density with mean $ \mathbf{w}^T \mathbf{x}^{(i)} + b $ and variance $ \sigma^2 $:
     $$ P(y^{(i)} | \mathbf{x}^{(i)}; \mathbf{w}, b) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y^{(i)} - (\mathbf{w}^T \mathbf{x}^{(i)} + b))^2}{2\sigma^2}\right)  $$

- Taking the natural log of the likelihood:
   
     $$\ell(\mathbf{w}, b) = \sum_{i=1}^{m} \left( -\frac{1}{2} \log(2\pi\sigma^2) - \frac{(y^{(i)} - (\mathbf{w}^T \mathbf{x}^{(i)} + b))^2}{2\sigma^2} \right) $$

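- Dropping the additive constant and the positive factor $ \frac{1}{2\sigma^2} $, neither of which depends on $ \mathbf{w} $ or $ b $, leaves the same maximizer:
     $$ \arg\max_{\mathbf{w}, b} \ell(\mathbf{w}, b) = \arg\max_{\mathbf{w}, b} \left( -\sum_{i=1}^{m} (y^{(i)} - (\mathbf{w}^T \mathbf{x}^{(i)} + b))^2 \right) $$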

- **Mean Squared Error (MSE)** is defined as:

     $$ \text{MSE}(\mathbf{w}, b) = \frac{1}{m} \sum_{i=1}^{m} (y^{(i)} - (\mathbf{w}^T \mathbf{x}^{(i)} + b))^2 $$


- Since $ \frac{1}{m} $ is a positive constant, minimizing MSE is equivalent to maximizing the log-likelihood:
     $$  \min_{\mathbf{w}, b} \text{MSE}(\mathbf{w}, b) \Leftrightarrow \max_{\mathbf{w}, b} \ell(\mathbf{w}, b) $$
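
The equivalence above can be checked numerically. The following is a minimal sketch, assuming a single scalar feature and synthetic data; the names `fit_least_squares`, `TRUE_W`, `TRUE_B`, and `SIGMA` are illustrative choices, not part of the derivation. It verifies that the log-likelihood equals $-\frac{m}{2}\log(2\pi\sigma^2) - \frac{m}{2\sigma^2}\text{MSE}$ and that nudging the least-squares parameters raises the MSE and lowers the log-likelihood.

```python
import math
import random


def mse(w, b, xs, ys):
    """Mean squared error of the line w*x + b on the data."""
    m = len(xs)
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / m


def log_likelihood(w, b, xs, ys, sigma):
    """Gaussian log-likelihood of the data given w, b and noise std sigma."""
    const = -0.5 * math.log(2 * math.pi * sigma ** 2)
    return sum(
        const - (y - (w * x + b)) ** 2 / (2 * sigma ** 2)
        for x, y in zip(xs, ys)
    )


def fit_least_squares(xs, ys):
    """Closed-form least-squares fit for one feature plus an intercept."""
    m = len(xs)
    x_mean = sum(xs) / m
    y_mean = sum(ys) / m
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    w = sxy / sxx
    b = y_mean - w * x_mean
    return w, b


# Synthetic data (assumed values, for illustration only).
random.seed(0)
TRUE_W, TRUE_B, SIGMA = 2.0, -1.0, 0.5
xs = [random.uniform(-3, 3) for _ in range(200)]
ys = [TRUE_W * x + TRUE_B + random.gauss(0, SIGMA) for x in xs]
m = len(xs)

w_hat, b_hat = fit_least_squares(xs, ys)

# The log-likelihood is an affine, decreasing function of the MSE.
ll_hat = log_likelihood(w_hat, b_hat, xs, ys, SIGMA)
ll_from_mse = (
    -0.5 * m * math.log(2 * math.pi * SIGMA ** 2)
    - m * mse(w_hat, b_hat, xs, ys) / (2 * SIGMA ** 2)
)

# Any small perturbation of the least-squares fit should do worse on both.
perturbed = [
    (w_hat + dw, b_hat + db)
    for dw in (-0.1, 0.0, 0.1)
    for db in (-0.1, 0.0, 0.1)
    if (dw, db) != (0.0, 0.0)
]
mse_hat = mse(w_hat, b_hat, xs, ys)
all_worse = all(
    mse(w, b, xs, ys) > mse_hat
    and log_likelihood(w, b, xs, ys, SIGMA) < ll_hat
    for w, b in perturbed
)
print(f"w_hat={w_hat:.3f}, b_hat={b_hat:.3f}, all perturbations worse: {all_worse}")
```

The value of `SIGMA` changes the log-likelihood values but not which parameters maximize them, which is why it can be dropped from the objective.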

This derivation justifies using the MSE loss to train the parameters of a linear regression model when the errors are assumed to be Gaussian.
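
As one illustration of training with the MSE loss, the sketch below runs plain gradient descent on the MSE and then checks that the gradient of the Gaussian log-likelihood vanishes at the result. It assumes a single scalar feature and synthetic data; the learning rate, step count, and helper names (`mse_gradient`, `log_likelihood_gradient`, `train`) are hypothetical choices made for this example.

```python
import random


def mse_gradient(w, b, xs, ys):
    """Gradient of the MSE with respect to (w, b)."""
    m = len(xs)
    grad_w = sum(-2 * (y - (w * x + b)) * x for x, y in zip(xs, ys)) / m
    grad_b = sum(-2 * (y - (w * x + b)) for x, y in zip(xs, ys)) / m
    return grad_w, grad_b


def log_likelihood_gradient(w, b, xs, ys, sigma):
    """Gradient of the Gaussian log-likelihood with respect to (w, b)."""
    grad_w = sum((y - (w * x + b)) * x for x, y in zip(xs, ys)) / sigma ** 2
    grad_b = sum((y - (w * x + b)) for x, y in zip(xs, ys)) / sigma ** 2
    return grad_w, grad_b


def train(xs, ys, lr=0.05, steps=2000):
    """Gradient descent on the MSE, starting from w = b = 0."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        grad_w, grad_b = mse_gradient(w, b, xs, ys)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Synthetic data (assumed values, for illustration only).
random.seed(1)
SIGMA = 0.3
xs = [random.uniform(-2, 2) for _ in range(100)]
ys = [1.5 * x + 0.5 + random.gauss(0, SIGMA) for x in xs]
m = len(xs)

# At any point, the two gradients differ only by the factor -m / (2 sigma^2).
gw_mse0, gb_mse0 = mse_gradient(0.0, 0.0, xs, ys)
gw_ll0, gb_ll0 = log_likelihood_gradient(0.0, 0.0, xs, ys, SIGMA)
scale = -m / (2 * SIGMA ** 2)

# Minimizing the MSE lands on a stationary point of the log-likelihood.
w_gd, b_gd = train(xs, ys)
gw_ll, gb_ll = log_likelihood_gradient(w_gd, b_gd, xs, ys, SIGMA)
print(f"w={w_gd:.3f}, b={b_gd:.3f}, log-likelihood gradient=({gw_ll:.2e}, {gb_ll:.2e})")
```

Because the gradients are proportional with a negative factor, a descent step on the MSE is an ascent step on the log-likelihood, up to a rescaling of the learning rate.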

