## Linear Regression
- Linear Regression is a type of supervised machine learning in which we predict a continuous dependent variable from one or more independent variables.

- **Goal:**
    - To create a mathematical model that captures the linear relationship between the dependent and independent variables.
    - To generate predictions.
    - To assist in decision making by quantifying how the expected outcome changes when the independent (predictor) variables change.

- **Regression Types**:
    1. `Simple Linear Regression`
    2. `Multiple Linear Regression`
    3. `Simple Polynomial Regression`
    4. `Multiple Polynomial Regression`

### 1. Simple Linear Regression

- In Simple Linear Regression, we consider a single independent variable and a single dependent variable.
- It helps to figure out the relationship between 2 variables i.e. independent variable (say x) and dependent variable (say y).
- The mathematical model of simple linear regression takes the form of a straight line.

- **Mathematically,**
    - `Y = β0 + β1X` 
        - β0: the intercept
        - β1: the slope
        - X: an independent variable (the variable used to predict Y)
        - Y: dependent variable (the variable we want to predict)
- Here β0 and β1 are the model parameters. They are estimated by minimizing an objective (cost) function, either in closed form or with an optimization technique such as gradient descent.

- **Visually (β0=38423, β1=821)**
    - `Y = 38423 + 821X`
    - <img src='images/1.png' width='400'>
    - Here, we fit the linear regression line to the scatter data.
    - Afterwards, we can make a prediction for a new X by projecting it onto the line and reading off the corresponding Y value.
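
As a minimal sketch of this prediction step, the snippet below evaluates the fitted line `Y = 38423 + 821X` at a few new inputs. The function name `predict` and the sample X values are illustrative assumptions; the text does not state the units of X or Y.

```python
def predict(x, beta0=38423.0, beta1=821.0):
    """Return the point on the fitted line Y = beta0 + beta1 * X."""
    return beta0 + beta1 * x


# Hypothetical new inputs, chosen only for illustration.
for x_new in (0, 10, 25):
    print(x_new, predict(x_new))
```

At `X = 0` the prediction is the intercept itself, and each unit increase in X adds the slope (821) to the predicted Y.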


### **Simple Linear Regression (Training Process Using Ordinary Least Squares (OLS))**
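- OLS stands for Ordinary Least Squares.
- OLS chooses β0 and β1 by minimizing the sum of squared errors (SSE). Minimizing the mean squared error (MSE) gives the same parameters, because MSE is just SSE divided by the constant N.
- Sum of Squared Errors, i.e. SSE(β0, β1)
    - Σ(Yi - (β0 + β1 * Xi))² 
- Mean Squared Error, i.e. MSE(β0, β1)
    - SSE(β0, β1) / N
- For simple linear regression, OLS has a closed-form solution: β0 and β1 can be computed directly from the data with formulae.
- Aim: To select the best-fit line (optimal β0 and β1) that minimizes the error between predicted and actual values.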
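
To make the two cost functions concrete, here is a small sketch that evaluates SSE and MSE for a candidate line. The helper names `sse` and `mse` and the toy data (which lies exactly on `y = 1 + 2x`) are assumptions made for illustration.

```python
def sse(xs, ys, beta0, beta1):
    """Sum of squared errors of the line beta0 + beta1 * x on the data."""
    return sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))


def mse(xs, ys, beta0, beta1):
    """Mean squared error: SSE divided by the number of data points."""
    return sse(xs, ys, beta0, beta1) / len(xs)


# Toy data lying exactly on y = 1 + 2x (illustrative only).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

print(sse(xs, ys, 1.0, 2.0))  # the true line: every residual is zero
print(sse(xs, ys, 0.0, 2.0))  # intercept off by 1: every residual is 1
print(mse(xs, ys, 0.0, 2.0))
```

The line with the smaller SSE also has the smaller MSE, which is why both lead to the same best-fit parameters.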

### Derivation of OLS Estimate for Simple Linear Regression

We start with the simple linear regression model:


$$
Y = \beta_0 + \beta_1 X
$$


Where:
- Y is the observed dependent variable.
- X  is the independent variable.
- β0 is the intercept.
- β1 is the slope.

**Step 1: Define the Cost Function**

The goal is to minimize the sum of squared errors (SSE), which is the sum of the squared differences between the observed values $Y_i$ and the predicted values $\hat{Y}_i = \beta_0 + \beta_1 X_i$:

$$
SSE = \sum_{i=1}^{N} (Y_i - \hat{Y}_i)^2
$$


**Step 2: Minimize SSE by Finding $(\beta_0)$**

To minimize SSE, we differentiate it with respect to $( \beta_0 )$ and set the derivative equal to 0, since the derivative of the error function is 0 at its minimum.

Differentiate SSE with respect to $(\beta_0)$: 

$$
\frac{\partial SSE}{\partial \beta_0} = \frac{\partial \left( \sum_{i=1}^{N} (Y_i - (\beta_0 + \beta_1X_i))^2 \right)}{\partial \beta_0}
$$

Apply the chain rule and simplify:

$$
\frac{\partial SSE}{\partial \beta_0} = -2 \sum_{i=1}^{N} (Y_i - (\beta_0 + \beta_1X_i))
$$

Set the derivative equal to 0 and solve for $(\beta_0)$:  
$$
-2 \sum_{i=1}^{N} (Y_i - (\beta_0 + \beta_1X_i)) = 0  
$$  

$$
\sum_{i=1}^{N} (Y_i - (\beta_0 + \beta_1X_i)) = 0
$$

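Now, solving for $(\beta_0)$. Expanding the sum gives $\sum_{i=1}^{N} Y_i - N\beta_0 - \beta_1 \sum_{i=1}^{N} X_i = 0$; dividing by $N$ and rearranging:

$$
\beta_0 = \bar{Y} - \beta_1\bar{X} \qquad \text{(i)}
$$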


**Step 3: Minimize SSE by Finding $(\beta_1)$**

To minimize SSE, we differentiate it with respect to $( \beta_1 )$ and set the derivative equal to 0, since the derivative of the error function is 0 at its minimum.

Differentiate SSE with respect to $(\beta_1)$: 

$$
\frac{\partial SSE}{\partial \beta_1} = \frac{\partial \left( \sum_{i=1}^{N} (Y_i - (\beta_0 + \beta_1X_i))^2 \right)}{\partial \beta_1}
$$

Apply the chain rule and simplify:

$$
\frac{\partial SSE}{\partial \beta_1} = -2 \sum_{i=1}^{N} (Y_i - (\beta_0 + \beta_1X_i))X_i
$$

Set the derivative equal to 0 and solve for $( \beta_1 )$:

$$
-2 \sum_{i=1}^{N} (Y_i - (\beta_0 + \beta_1X_i))X_i = 0
$$

$$
\sum_{i=1}^{N} (Y_i - (\beta_0 + \beta_1X_i))X_i = 0
$$

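Solving for $( \beta_1)$ by substituting $\beta_0 = \bar{Y} - \beta_1\bar{X}$:

$$
\sum_{i=1}^{N} \left( (Y_i - \bar{Y}) - \beta_1 (X_i - \bar{X}) \right) X_i = 0
$$

Because $\sum_{i=1}^{N} (Y_i - \bar{Y}) = 0$ and $\sum_{i=1}^{N} (X_i - \bar{X}) = 0$, replacing the trailing $X_i$ with $(X_i - \bar{X})$ does not change either sum, which gives:

$$
\beta_1 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{N} (X_i - \bar{X})^2}
$$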


Hence,

$$
\beta_1 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{N} (X_i - \bar{X})^2}
$$  


$$
\beta_0 = \bar{Y} - \beta_1\bar{X}          
$$
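
The closed-form estimates can be sketched in a few lines of plain Python: the slope is the sum of cross-deviations of X and Y divided by the sum of squared deviations of X, and the intercept follows from the two means. The function name `fit_ols` and the sample data are assumptions made for illustration.

```python
def fit_ols(xs, ys):
    """Return (beta0, beta1) minimizing SSE for the line beta0 + beta1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator: sum of (X_i - X_bar) * (Y_i - Y_bar)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: sum of (X_i - X_bar)^2
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("all X values are identical; the slope is undefined")
    beta1 = s_xy / s_xx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1


# Illustrative data, roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.2, 4.1, 6.2, 7.9, 10.1]

beta0, beta1 = fit_ols(xs, ys)
print(beta0, beta1)  # approximately 0.22 and 1.96
```

No iteration or learning rate is needed here, because the two formulae give the minimizer of SSE directly.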


### Simple Linear Regression (using OLS)
1. Define the Linear Model
    - Y = β0 + β1*X  
    
    - Y is the observed dependent variable.
    - X  is the independent variable.
    - β0 is the intercept.
    - β1 is the slope.

2. Define the objective function
    - The objective in linear regression is to find the β0 and β1 that minimize the sum of squared errors (SSE).
    - The error for each data point is the difference between the observed value and the predicted value:
    - `e(i) = Y(i) - (β0 + β1*X(i))`
    - SSE = Σ(Yi - (β0 + β1 * Xi))²

3. Minimize the Objective function
    - To minimize SSE, we take the partial derivatives of SSE with respect to β0 and β1, set them equal to zero, and solve for β0 and β1.
    - Refer to the derivation section above for the details.

4. Final OLS Equation
    - $\hat{Y}$ = β0 + β1*X
    - where β0 and β1 are estimated using the formulae derived in the section above.

### **Simple Linear Regression (Training Process using Gradient Descent)**
>Gradient descent is an iterative optimization technique that can be used to train a linear regression model: it finds the coefficients (intercept and slope) that minimize the cost function (typically MSE, the Mean Squared Error).

**Step 1:** Initialize the model parameters randomly.
- Initialize the intercept (β₀) and slope (β₁) with random values.

**Step 2:** Compute the gradients of the cost function (MSE) with respect to the parameters β₀ and β₁.
- `Mean Squared Error (MSE) Formula`
  - MSE is a measure of the average squared difference between predicted and actual values in the dataset and is calculated as follows:
    - MSE = (1/N) * Σ(yᵢ - (β₀ + β₁ * xᵢ))² for i = 1 to N, where N is the number of data points.


- Calculate the gradients using partial derivatives:
  - ∂MSE/∂β₀ = -2/N * Σ(yᵢ - (β₀ + β₁ * xᵢ))
  - ∂MSE/∂β₁ = -2/N * Σxᵢ * (yᵢ - (β₀ + β₁ * xᵢ)), where N is the number of data points.

**Step 3:** Update the parameters of the model by taking steps in the opposite direction of the gradients.
- Update the intercept (β₀) and slope (β₁) using the learning rate α (alpha):
  - β₀ = β₀ - α * ∂MSE/∂β₀
  - β₁ = β₁ - α * ∂MSE/∂β₁

**Step 4:** Repeat steps 2 and 3 iteratively until convergence.
- Iterate through steps 2 and 3 until a stopping criterion is met (e.g., a maximum number of iterations or a small change in MSE). The goal is to find the optimal values of β₀ and β₁ that minimize the MSE and provide the best fit for the simple linear regression model.

In summary, the gradient descent algorithm for simple linear regression aims to find the optimal intercept (β₀) and slope (β₁) by iteratively adjusting these parameters in the direction that minimizes the mean squared error (MSE) between the predicted and actual values of the target variable. This process continues until convergence is achieved or a predefined stopping condition is met.
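
The four steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not a tuned implementation: the function name `gradient_descent`, the learning rate `alpha=0.05`, the iteration cap, the tolerance, and the toy data are all hypothetical choices, and the parameters start at zero rather than at random values so that the run is reproducible.

```python
def gradient_descent(xs, ys, alpha=0.05, n_iters=5000, tol=1e-12):
    """Fit y = beta0 + beta1 * x by gradient descent on the MSE."""
    n = len(xs)
    # Step 1: initialize the parameters (zeros here, for reproducibility).
    beta0, beta1 = 0.0, 0.0
    prev_mse = float("inf")
    for _ in range(n_iters):
        residuals = [y - (beta0 + beta1 * x) for x, y in zip(xs, ys)]
        mse = sum(r * r for r in residuals) / n
        # Step 4: stop once the MSE barely changes between iterations.
        if abs(prev_mse - mse) < tol:
            break
        prev_mse = mse
        # Step 2: gradients of the MSE with respect to beta0 and beta1.
        grad_b0 = -2.0 / n * sum(residuals)
        grad_b1 = -2.0 / n * sum(x * r for x, r in zip(xs, residuals))
        # Step 3: move against the gradient, scaled by the learning rate.
        beta0 -= alpha * grad_b0
        beta1 -= alpha * grad_b1
    return beta0, beta1


# Toy data lying exactly on y = 1 + 2x (illustrative only).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

b0, b1 = gradient_descent(xs, ys)
print(b0, b1)  # close to 1 and 2
```

The learning rate matters: if `alpha` is too large for the scale of the data, the updates overshoot and the MSE grows instead of shrinking, so in practice it is tuned or the features are rescaled first.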


### 2. Multiple Linear Regression
- Unlike simple linear regression, multiple linear regression has several independent variables, i.e. x1, x2, x3, and so on.
- The term "linear" means the fitted model is a line in 2D, a plane in 3D, and a hyperplane in higher dimensions.
- The equation of Multiple Linear Regression is given by,
$$
\hat{Y} = a + b_1 X_1 + b_2 X_2 + b_3 X_3 + b_4 X_4
$$

Where:
- $\hat{Y}$ is the response variable.
- $X_1, X_2, X_3, X_4$ are predictor variables 1 to 4.
- $a$ is the intercept.
- $b_1, b_2, b_3, b_4$ are the coefficients of predictor variables 1 to 4.


- The algorithm discussed in simple linear regression to get optimal parameters can also be applied to multiple linear regression. The only difference is we now have more independent variables meaning more parameters. But the idea remains the same.
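
As a sketch of how the same idea carries over, the snippet below runs gradient descent on the MSE with one coefficient per predictor plus an intercept. The function name `fit_multiple`, the learning rate, the iteration count, and the two-predictor toy data (generated from `y = 1 + 2*x1 + 3*x2`) are assumptions made for illustration.

```python
def fit_multiple(X, y, alpha=0.1, n_iters=2000):
    """Fit y_hat = a + b[0]*x1 + b[1]*x2 + ... by gradient descent on MSE."""
    n, p = len(X), len(X[0])
    a = 0.0          # intercept
    b = [0.0] * p    # one coefficient per predictor
    for _ in range(n_iters):
        # Residual for each row: actual minus predicted.
        residuals = []
        for row, yi in zip(X, y):
            y_hat = a + sum(bj * xj for bj, xj in zip(b, row))
            residuals.append(yi - y_hat)
        # Gradients of the MSE with respect to a and each b[j].
        grad_a = -2.0 / n * sum(residuals)
        grad_b = [
            -2.0 / n * sum(r * row[j] for r, row in zip(residuals, X))
            for j in range(p)
        ]
        # Update every parameter against its gradient.
        a -= alpha * grad_a
        b = [bj - alpha * gj for bj, gj in zip(b, grad_b)]
    return a, b


# Toy data generated from y = 1 + 2*x1 + 3*x2 (illustrative only).
X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1], [1, 2]]
y = [1, 3, 4, 6, 8, 9]

a, b = fit_multiple(X, y)
print(a, b)  # close to 1 and [2, 3]
```

The update rule is identical to the single-variable case; only the number of gradients grows with the number of predictors.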

## Evaluation Metrics (Regression Models)
- Evaluation metrics are useful for explaining the performance of the model.
- They compare the predicted values with the actual values to quantify how well the model is performing.

- **Error:**
    - The error of a regression model is a measure of how far the data points are from the fitted regression line.
    
- **Different evaluation metrics are:**
    - `Mean Absolute Error (MAE)`
    - `Mean Squared Error (MSE)`
    - `Root Mean Squared Error (RMSE)`
    - `Coefficient of Determination (R^2)`


**1. Mean Absolute Error (MAE)**
- Mean absolute error is the mean of the absolute value of the errors.

$$
MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|
$$

Where:
- $n$ is the number of data points.
- $y_i$ is the actual value of the target variable for the i-th data point.
- $\hat{y}_i$ is the predicted value for the i-th data point.



`Example:`  
Suppose we have a dataset with three data points, and we want to calculate the MAE for a simple linear regression model:

- Actual values ($y_i$): [2, 4, 7]  
- Predicted values ($\hat{y}_i$): [2.5, 3.8, 6.2]  

Using the formula:  
$$
MAE = \frac{1}{3} \left(|2 - 2.5| + |4 - 3.8| + |7 - 6.2|\right) = \frac{1}{3} \left(0.5 + 0.2 + 0.8\right) = \frac{1.5}{3} = 0.5
$$


- The absolute value is used because otherwise positive and negative errors would cancel each other out.
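
The MAE for the three-point example can be computed with a short sketch like the one below. The function name `mean_absolute_error` is an illustrative choice.

```python
def mean_absolute_error(y_true, y_pred):
    """Mean of the absolute differences between actual and predicted values."""
    return sum(abs(y - y_hat) for y, y_hat in zip(y_true, y_pred)) / len(y_true)


y_true = [2, 4, 7]
y_pred = [2.5, 3.8, 6.2]

mae = mean_absolute_error(y_true, y_pred)
print(mae)  # approximately 0.5, up to floating-point rounding
```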


**2. Mean Squared Error (MSE)**  
- Mean squared error is the mean of the squared error.  

Mathematical Notation:
$$
MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$

Where:
- $n$ is the number of data points.
- $y_i$ is the actual value of the target variable for the i-th data point.
- $\hat{y}_i$ is the predicted value for the i-th data point.

`Example:`
Using the same dataset as before:

$$
MSE = \frac{1}{3} \left((2 - 2.5)^2 + (4 - 3.8)^2 + (7 - 6.2)^2\right) = \frac{1}{3} \left(0.25 + 0.04 + 0.64\right) = \frac{0.93}{3} = 0.31
$$

So, the MSE for this model is 0.31.



**3. Root Mean Squared Error (RMSE)**
- Root mean squared error is the square root of the mean squared error.

Mathematical Notation:
$$
RMSE = \sqrt{MSE}
$$


**Example:**
Continuing from the previous example, where the squared errors sum to 0.93 over 3 points (MSE = 0.31):

$$
RMSE = \sqrt{0.31} \approx 0.5568
$$

So, the RMSE for this model is approximately 0.5568.
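
Both MSE and RMSE for the same three data points can be sketched as follows. The function names `mean_squared_error` and `root_mean_squared_error` are illustrative choices.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """Square root of the MSE, expressed in the same units as y."""
    return math.sqrt(mean_squared_error(y_true, y_pred))


y_true = [2, 4, 7]
y_pred = [2.5, 3.8, 6.2]

mse_value = mean_squared_error(y_true, y_pred)
rmse_value = root_mean_squared_error(y_true, y_pred)
print(mse_value)   # approximately 0.31
print(rmse_value)  # approximately 0.5568
```

Because the errors are squared before averaging, the largest error (0.8) contributes most of the MSE, which is why MSE and RMSE are more sensitive to outliers than MAE.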

In summary, these three error metrics (MAE, MSE, and RMSE) help us assess the accuracy of regression models, with lower values indicating better performance.

**4. Coefficient of Determination (R^2)**
- $R^2$ is also called the Coefficient of Determination.
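- It measures the proportion of the variation in the dependent variable that the fitted regression line explains.
- For an OLS fit with an intercept, evaluated on the training data, it takes a value between 0 and 1. On other data, or for other models, it can be negative when the model predicts worse than simply using the mean.
    - 0 indicates a model that explains none of the variation, and 1 indicates a perfect fit.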

The formula for the coefficient of determination $R^2$ in linear regression is given by:

$$R^2 = 1 - \frac{SSR}{SST}$$

Where:
- $R^2$ is the coefficient of determination (R-squared).
- $SSR$ is the sum of squared residuals (also known as the sum of squared errors or SSE), which represents the variation in the dependent variable that the model leaves unexplained.
    - The formula for SSR is:

    $$SSR = \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$

    Where:  
    - $n$ is the number of data points.
    - $(\hat{y}_i)$ is the predicted value for the i-th data point.
    - $y_i$ is the actual value of the dependent variable for the i-th data point.
   
- $SST$ is the total sum of squares, which represents the total variation in the dependent variable.
    - The formula for SST is:

  $$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

  Where:  
   - $n$ is the number of data points.
   - $y_i$ is the actual value of the dependent variable for the i-th data point.
   - $(\bar{y})$ is the mean (average) of the actual values $(y_i)$.
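
A minimal sketch of the $R^2$ computation, reusing the three data points from the earlier metric examples, might look like this. The function name `r_squared` is an illustrative choice.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SSR / SST."""
    y_bar = sum(y_true) / len(y_true)
    # SSR: variation left unexplained by the model.
    ssr = sum((y_hat - y) ** 2 for y, y_hat in zip(y_true, y_pred))
    # SST: total variation of the actual values around their mean.
    sst = sum((y - y_bar) ** 2 for y in y_true)
    if sst == 0:
        raise ValueError("all actual values are identical; R^2 is undefined")
    return 1 - ssr / sst


y_true = [2, 4, 7]
y_pred = [2.5, 3.8, 6.2]

r2 = r_squared(y_true, y_pred)
print(r2)  # approximately 0.9266

# A baseline that always predicts the mean explains none of the variation.
mean_value = sum(y_true) / len(y_true)
r2_baseline = r_squared(y_true, [mean_value] * len(y_true))
print(r2_baseline)
```

The baseline comparison shows what $R^2 = 0$ means in practice: the model does no better than predicting the mean of the actual values for every point.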


### References: 
- https://www.youtube.com/watch?v=KZ1mWboXE6g
- https://www.coursera.org/learn/data-analysis-with-python/lecture/Wlyce/linear-regression-and-multiple-linear-regression
- https://www.geeksforgeeks.org/ml-linear-regression/