$$
\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3
$$


$$
\hat{y} = \text{predicted value of } y
$$

##### Matrix Representation of Predictions

$$
\hat{Y} =
\begin{bmatrix}
\hat{y}_1 \\
\hat{y}_2 \\
\vdots \\
\hat{y}_{100}
\end{bmatrix}
=
\begin{bmatrix}
1 & x_{11} & x_{12} & x_{13} \\
1 & x_{21} & x_{22} & x_{23} \\
\vdots & \vdots & \vdots & \vdots \\
1 & x_{100,1} & x_{100,2} & x_{100,3}
\end{bmatrix}
\begin{bmatrix}
\beta_0 \\
\beta_1 \\
\beta_2 \\
\beta_3
\end{bmatrix}
$$


##### General Matrix Form (n rows, m features)

$$
\hat{Y}_{(n \times 1)} =
\begin{bmatrix}
\hat{y}_1 \\
\hat{y}_2 \\
\vdots \\
\hat{y}_n
\end{bmatrix}
=
\begin{bmatrix}
1 & x_{11} & x_{12} & \dots & x_{1m} \\
1 & x_{21} & x_{22} & \dots & x_{2m} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_{n1} & x_{n2} & \dots & x_{nm}
\end{bmatrix}
\begin{bmatrix}
\beta_0 \\
\beta_1 \\
\beta_2 \\
\vdots \\
\beta_m
\end{bmatrix}
$$


$$
\hat{Y} = X\beta
$$
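The matrix form above can be sketched in NumPy. This is a minimal illustration, assuming randomly generated features and a made-up coefficient vector `beta` (neither comes from a real dataset); the column of ones is what lets \( \beta_0 \) act as the intercept.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 100, 3                                   # 100 rows, 3 features
features = rng.normal(size=(n, m))              # hypothetical feature values

# Design matrix X: a leading column of ones, then the feature columns.
X = np.column_stack([np.ones(n), features])     # shape (n, m + 1)

# Hypothetical coefficients [beta_0, beta_1, beta_2, beta_3].
beta = np.array([1.0, 2.0, -1.0, 0.5])

# All predictions at once: Y_hat = X beta
y_hat = X @ beta                                # shape (n,)

# The first prediction written out term by term, for comparison.
y_hat_first = (beta[0]
               + beta[1] * features[0, 0]
               + beta[2] * features[0, 1]
               + beta[3] * features[0, 2])
```

Each entry of `y_hat` is the scalar equation applied to one row, so `y_hat[0]` and `y_hat_first` agree.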

##### Error (Residual) Vector

$$
e = Y - \hat{Y}
$$

### Squared Error (Loss)

$$
E = e^T e                        
$$

Written as a sum over the n samples, the same quantity is:
$$
E = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$
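A small numeric check that \( e^T e \) and the element-wise sum are the same number. The three target and prediction values below are made up for illustration.

```python
import numpy as np

# Hypothetical targets and predictions for three samples.
y = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.5, 5.5, 6.0])

e = y - y_hat                      # residual vector: [0.5, -0.5, 1.0]

E_matrix = e @ e                   # e^T e
E_sum = sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat))  # explicit sum

print(E_matrix, E_sum)             # both are 0.25 + 0.25 + 1.0 = 1.5
```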

##### Expansion of Squared Error Term

We start with the total squared error:

$$
E = e^T e = (Y - \hat{Y})^T (Y - \hat{Y})
$$

---

##### Substitute \( \hat{Y} = X\beta \)

$$
E = (Y - X\beta)^T (Y - X\beta)
$$

---

##### Apply transpose property

Using:
$$
(A - B)^T = A^T - B^T
$$

We get:

$$
E = (Y^T - (X\beta)^T)(Y - X\beta)
$$

---

##### Expand the multiplication

$$
E = Y^T Y - Y^T X\beta - (X\beta)^T Y + (X\beta)^T X\beta
$$

---

##### Final expanded form

Using \( (X\beta)^T = \beta^T X^T \):

$$
E = Y^T Y
\;-\; Y^T X\beta
\;-\; \beta^T X^T Y
\;+\; \beta^T X^T X\beta
$$


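### Notes

- **E** is a scalar value  
- **YᵀXβ** and **(Xβ)ᵀY** are both scalars, and each is the transpose of the other. A scalar equals its own transpose, so the two terms are equal and combine into **−2YᵀXβ**  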
- This expanded error expression is used to:
  - Take the derivative with respect to **β**
  - Derive the **Normal Equation**


### Error Function (Quadratic Form)

$$
E = Y^T Y - 2Y^T X\beta + \beta^T X^T X \beta
$$
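The quadratic form can be checked numerically against the direct definition of the squared error. This is a sketch on random data; `X`, `Y`, and `beta` are arbitrary values chosen only for the check.

```python
import numpy as np

rng = np.random.default_rng(1)

n, m = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, m))])  # design matrix
Y = rng.normal(size=n)                                      # arbitrary targets
beta = rng.normal(size=m + 1)                               # arbitrary coefficients

# Direct definition: (Y - X beta)^T (Y - X beta)
residual = Y - X @ beta
direct = residual @ residual

# The two cross terms from the expansion.
cross_1 = Y @ X @ beta            # Y^T X beta
cross_2 = (X @ beta) @ Y          # (X beta)^T Y

# Quadratic form: Y^T Y - 2 Y^T X beta + beta^T X^T X beta
quadratic = Y @ Y - 2 * cross_1 + beta @ X.T @ X @ beta
```

`cross_1` equals `cross_2`, and `direct` equals `quadratic` up to floating-point rounding.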

---

### Derivative with respect to \( \beta \)

$$
\frac{\partial E}{\partial \beta}
=
\frac{\partial}{\partial \beta}
\left[
Y^T Y - 2Y^T X\beta + \beta^T X^T X \beta
\right]
$$

---

### Derivative of each term

**1. First term**

$$
\frac{\partial}{\partial \beta}(Y^T Y) = 0
$$

(because it does not depend on \( \beta \))

---

**2. Second term**

$$
\frac{\partial}{\partial \beta}(-2Y^T X\beta) = -2X^T Y
$$

(using \( \frac{\partial}{\partial \beta}(a^T \beta) = a \) with \( a = X^T Y \))

---

**3. Third term**

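$$
\frac{\partial}{\partial \beta}(\beta^T X^T X \beta) = 2X^T X\beta
$$

(using \( \frac{\partial}{\partial \beta}(\beta^T A \beta) = (A + A^T)\beta \), which reduces to \( 2A\beta \) because \( A = X^T X \) is symmetric)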

---

### Combine all terms

$$
\frac{\partial E}{\partial \beta}
=
-2X^T Y + 2X^T X\beta
$$
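One way to gain confidence in this gradient is to compare it with a finite-difference approximation. The sketch below uses random data; the helper names (`squared_error`, `analytic_gradient`, `numeric_gradient`) are illustrative and not from any library.

```python
import numpy as np

def squared_error(beta, X, y):
    """E = (y - X beta)^T (y - X beta)."""
    residual = y - X @ beta
    return residual @ residual

def analytic_gradient(beta, X, y):
    """dE/dbeta = -2 X^T y + 2 X^T X beta, as derived above."""
    return -2 * X.T @ y + 2 * X.T @ X @ beta

def numeric_gradient(beta, X, y, h=1e-6):
    """Central finite differences, one coordinate of beta at a time."""
    grad = np.zeros_like(beta)
    for j in range(beta.size):
        step = np.zeros_like(beta)
        step[j] = h
        grad[j] = (squared_error(beta + step, X, y)
                   - squared_error(beta - step, X, y)) / (2 * h)
    return grad

rng = np.random.default_rng(2)
n, m = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, m))])
y = rng.normal(size=n)
beta = rng.normal(size=m + 1)

grad_analytic = analytic_gradient(beta, X, y)
grad_numeric = numeric_gradient(beta, X, y)
```

The two gradients should match to several decimal places; a mismatch would point to a sign or transpose error in the derivation.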

---

### Set derivative to zero (minimization)

$$
-2X^T Y + 2X^T X\beta = 0
$$

Divide both sides by 2 and move the \( X^T Y \) term to the right-hand side:

$$
X^T X\beta = X^T Y
$$

---

### Normal Equation

If \( X^T X \) is invertible, multiply both sides by its inverse:

$$
\boxed{
\beta = (X^T X)^{-1} X^T Y
}
$$
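A minimal sketch of the normal equation in NumPy, assuming synthetic data generated from a known `true_beta` (an illustrative choice) plus a little noise. Instead of forming the inverse explicitly, it solves the linear system \( X^T X\beta = X^T Y \) with `np.linalg.solve`, which gives the same \( \beta \) when \( X^T X \) is invertible and is usually the more stable choice.

```python
import numpy as np

rng = np.random.default_rng(3)

n, m = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, m))])
true_beta = np.array([1.0, 2.0, -1.0, 0.5])           # assumed for the demo
y = X @ true_beta + rng.normal(scale=0.1, size=n)     # targets with small noise

# Normal equation, solved as a linear system: (X^T X) beta = X^T y
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# At the minimizer the gradient -2 X^T y + 2 X^T X beta should vanish.
gradient_at_solution = -2 * X.T @ y + 2 * X.T @ X @ beta_normal
```

`beta_normal` should be close to `true_beta`, agree with `beta_lstsq`, and make the gradient numerically zero.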


---

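##### Why Gradient Descent?

The normal equation depends on the matrix below, which is the source of most of the problems listed in this section:

$$
\boxed{(X^T X)}
$$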

##### 1. Matrix inversion is computationally expensive
Solving the normal equation requires forming \( X^T X \), which costs O(nm²), and then inverting it, which costs about O(m³) with standard methods. This becomes slow when the number of features m is large.

##### 2. High memory requirements
The normal equation needs the entire dataset and intermediate matrices (\( X \), \( X^T \), \( X^T X \)) to be stored in memory, which is not practical for large datasets.

##### 3. Non-invertible matrix issue
If features are highly correlated, duplicated, or if the number of features exceeds the number of samples, \( X^T X \) may not be invertible. Gradient descent needs no inverse, so it still runs in this case, although the minimizer it reaches is then not unique.
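The invertibility problem is easy to reproduce. In the sketch below the third column is an exact multiple of the second (a made-up duplicated feature), so \( X^T X \) loses rank. The pseudoinverse (`np.linalg.pinv`) is brought in here only to show that a least-squares fit still exists; it is not part of the derivation above.

```python
import numpy as np

rng = np.random.default_rng(4)

n = 50
x1 = rng.normal(size=n)

# Third column is exactly 2 * x1, so the columns are linearly dependent.
X = np.column_stack([np.ones(n), x1, 2 * x1])
y = 1.0 + 3.0 * x1                       # noiseless targets for the demo

gram = X.T @ X                           # 3 x 3 matrix, but not full rank
rank = np.linalg.matrix_rank(gram)       # 2 instead of 3

# (X^T X)^{-1} does not exist here. The pseudoinverse returns the
# minimum-norm coefficient vector among the infinitely many that fit.
beta_pinv = np.linalg.pinv(X, rcond=1e-10) @ y
fitted = X @ beta_pinv
```

Many different coefficient vectors give the same `fitted` values here, which is why the inverse cannot single one out.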

##### 4. Better scalability for large datasets
Gradient descent works iteratively and supports batch, mini-batch, and stochastic updates, making it suitable for very large datasets.

##### 5. Generalization to other machine learning models
The normal equation applies only to linear regression, whereas gradient descent is a general optimization technique used in logistic regression, neural networks, and deep learning.
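For comparison, here is a minimal batch gradient descent sketch on synthetic data. It minimizes the mean squared error \( E / n \) rather than \( E \), an assumption made here so that the learning rate does not depend on the number of samples; the minimizer is the same. The function name, learning rate, and iteration count are illustrative choices.

```python
import numpy as np

def gradient_descent(X, y, learning_rate=0.1, n_iterations=2000):
    """Batch gradient descent on the mean squared error (E / n)."""
    n = X.shape[0]
    beta = np.zeros(X.shape[1])
    for _ in range(n_iterations):
        # Gradient of E / n: (2 / n) * X^T (X beta - y)
        gradient = (2.0 / n) * (X.T @ (X @ beta - y))
        beta = beta - learning_rate * gradient
    return beta

rng = np.random.default_rng(5)
n, m = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, m))])
true_beta = np.array([1.0, 2.0, -1.0, 0.5])           # assumed for the demo
y = X @ true_beta + rng.normal(scale=0.1, size=n)

beta_gd = gradient_descent(X, y)

# Closed-form answer from the normal equation, for comparison.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)
```

On this small, well-conditioned problem the iterative result should agree with the normal equation. No matrix is inverted at any point; each step needs only matrix-vector products.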




| Aspect                 | Normal Equation | Gradient Descent   |
| ---------------------- | --------------- | ------------------ |
| Time complexity        | O(nm² + m³)     | O(mn × iterations) |
| Memory usage           | High            | Low                |
| Large datasets         | ❌               | ✅                  |
| Invertibility required | Yes             | No                 |
| General-purpose        | ❌               | ✅                  |


[Computational complexity of mathematical operations (see the matrix inversion entry)](https://en.wikipedia.org/wiki/Computational_complexity_of_mathematical_operations)

### OLS Linear Regression vs SGDRegressor

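**OLS (Ordinary Least Squares) Linear Regression**
- Minimizes the squared error directly; the **Normal Equation** is its closed-form solution
- Computes an exact solution (up to floating-point error)
- In closed form, requires the inverse \( (X^T X)^{-1} \)
- Computationally expensive for large feature sets
- `LinearRegression()` in sklearn solves the same least-squares problem with a direct solver (`scipy.linalg.lstsq`) rather than by explicitly inverting \( X^T X \)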

**SGDRegressor**
- Uses **Stochastic Gradient Descent**
- Finds an approximate solution iteratively
- Does not require matrix inversion
- Scales well to large datasets
- Suitable for online and streaming data
