# Multiple Linear Regression

**Multiple Linear Regression (MLR)** is a statistical method used to examine the relationship between one **dependent variable** and two or more **independent variables**. It extends simple linear regression by using several predictors, which often improves the fit when the target depends on more than one input.

---

### 📌 Mathematical Equation

The general equation for a multiple linear regression model is:

$$
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n
$$
$$
Y = [1 \quad X_1 \quad X_2 \quad \cdots \quad X_n]
\begin{bmatrix}
\beta_0 \\
\beta_1 \\
\beta_2 \\
\vdots \\
\beta_n
\end{bmatrix}
$$
$$
\mathbf{Y} = \mathbf{X} \boldsymbol{\beta}
$$

Where:

- **Y**: Dependent variable (the target you're trying to predict)  
- **X₁, X₂, ..., Xₙ**: Independent variables (input features)  
- **β₀**: Intercept (constant term)  
- **β₁, β₂, ..., βₙ**: Coefficients (weights/slopes for each independent variable)
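As a minimal sketch of the matrix form, the NumPy snippet below evaluates Y = Xβ for two observations. The coefficients and feature values are made-up illustrative numbers, not taken from the dataset used later.

```python
import numpy as np

# Hypothetical coefficients: [beta_0 (intercept), beta_1, beta_2]
beta = np.array([1.0, 2.0, 3.0])

# Two observations with two features each (illustrative values)
features = np.array([[1.0, 2.0],
                     [4.0, 0.5]])

# Prepend a column of 1s so the intercept is handled by the same product
X = np.column_stack([np.ones(len(features)), features])

# Matrix form: Y = X @ beta
Y = X @ beta

# The same prediction written out term by term for the first observation
y_first = beta[0] + beta[1] * features[0, 0] + beta[2] * features[0, 1]

print(Y)        # predictions for both observations
print(y_first)  # matches Y[0]
```

The column of 1s is what lets the intercept β₀ sit inside the same vector as the slopes.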

---

### 📌 Normal Equation in Linear Regression

The equation:

$$
\boldsymbol{\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}
$$

is the formula to compute the **best-fit coefficients** $\boldsymbol{\beta}$ (also called **weights**) for a linear regression model using the **least squares method**.

---

### 🔍 Explanation

- **X** (`X`): Feature matrix (design matrix), including a column of 1s for the intercept  
- **Xᵀ** (`X^T`): Transpose of the feature matrix  
- **(Xᵀ X)⁻¹** (`(X^T X)^-1`): Inverse of the product of the transposed and original feature matrix  
- **y** (`y`): Target variable (dependent variable) vector  
- **β** (`beta`): Coefficient vector including the intercept and slopes
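The pieces listed above can be put together in a few lines of NumPy. This is only a sketch: the helper name `fit_normal_equation` and the toy data are assumptions made for illustration, and it calls `np.linalg.solve` on the system (XᵀX)β = Xᵀy rather than forming the inverse explicitly, which gives the same β when XᵀX is invertible.

```python
import numpy as np

def fit_normal_equation(features, y):
    """Return beta solving (X^T X) beta = X^T y, with an intercept column added.

    Hypothetical helper for illustration; not part of any library.
    """
    features = np.asarray(features, dtype=float)
    y = np.asarray(y, dtype=float)
    # Design matrix: a column of 1s followed by the feature columns
    X = np.column_stack([np.ones(len(features)), features])
    # Solve the linear system instead of computing the inverse directly
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy data generated exactly by y = 1 + 2*x1 + 3*x2 (illustrative only)
features = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]])
y = 1 + 2 * features[:, 0] + 3 * features[:, 1]

beta = fit_normal_equation(features, y)
print(beta)  # close to [1, 2, 3] up to floating-point error
```

Because the toy targets lie exactly on a plane, the recovered coefficients match the ones used to generate them.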

---

The coefficients given by this equation minimize the **sum of squared differences** between the actual values $y_i$ and the predicted values $\hat{y}_i$ over the $m$ observations:

$$
\min \sum_{i=1}^m (y_i - \hat{y}_i)^2
$$

It provides the **best-fit line (or hyperplane)** that models the relationship between the independent variables and the dependent variable.
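For readers who want the intermediate step, the following derivation sketch shows how the normal equation follows from this minimization. It is the standard least-squares argument and assumes that $\mathbf{X}^T \mathbf{X}$ is invertible; the symbol $S$ for the objective is introduced here only for convenience.

$$
\begin{aligned}
S(\boldsymbol{\beta}) &= \sum_{i=1}^m (y_i - \hat{y}_i)^2 = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \\
\nabla_{\boldsymbol{\beta}} S &= -2\,\mathbf{X}^T (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) = \mathbf{0} \\
\mathbf{X}^T \mathbf{X}\,\boldsymbol{\beta} &= \mathbf{X}^T \mathbf{y} \\
\boldsymbol{\beta} &= (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}
\end{aligned}
$$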



## Step 1: Import the required Modules

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

## Step 2: Read the csv as DataFrame

I use a preprocessed dataset in this project. If you want to see the preprocessing steps, refer to https://github.com/Namachivayam2001/ML-Tutorials/blob/main/Simple_Linear_Regression.ipynb

In [4]:
df = pd.read_csv('https://github.com/Namachivayam2001/Public_Datasets/raw/main/preprocessed_synthetic_online_retail_data.csv')

In [5]:
df.head()

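| | customer_id | order_date | product_id | category_id | category_name | product_name | quantity | price | payment_method | city | review_score | gender | age | order_month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13542 | 2024-12-17 | 784 | 10 | 1 | 16 | 2 | 373.36 | 2 | 522 | 1.0 | 0 | 56 | 12 |
| 1 | 23188 | 2024-06-01 | 682 | 50 | 4 | 18 | 5 | 299.34 | 2 | 668 | 5.0 | 1 | 59 | 6 |
| 2 | 55098 | 2025-02-04 | 684 | 50 | 4 | 22 | 5 | 23.0 | 2 | 942 | 5.0 | 0 | 64 | 2 |
| 3 | 65208 | 2024-10-28 | 204 | 40 | 0 | 19 | 2 | 230.11 | 0 | 244 | 5.0 | 1 | 34 | 10 |
| 4 | 63872 | 2024-05-10 | 202 | 20 | 2 | 15 | 4 | 176.72 | 2 | 281 | 1.0 | 0 | 33 | 5 |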


#### The goal is to predict a customer's age from category_name, quantity and price
#### Only the first 10 records are used so that the calculations can be followed by hand

In [6]:
x_train = df[['category_name', 'quantity', 'price']][:10]
y_train = df['age'][:10]

**Input data**

In [8]:
df[['category_name', 'quantity', 'price', 'age']][:10]

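| | category_name | quantity | price | age |
|---|---|---|---|---|
| 0 | 1 | 2 | 373.36 | 56 |
| 1 | 4 | 5 | 299.34 | 59 |
| 2 | 4 | 5 | 23.0 | 64 |
| 3 | 0 | 2 | 230.11 | 34 |
| 4 | 2 | 4 | 176.72 | 33 |
| 5 | 1 | 4 | 196.16 | 21 |
| 6 | 1 | 5 | 272.75 | 57 |
| 7 | 4 | 2 | 292.9 | 60 |
| 8 | 3 | 3 | 429.11 | 69 |
| 9 | 3 | 4 | 191.39 | 34 |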


# 📘 Multiple Linear Regression: Step-by-Step

## 🎯 Goal  
Predict `age` using:
- `category_name`
- `quantity`
- `price`

---

## 🧮 Step 1: Design Matrix `X` and Target Vector `y`

We include a column of 1s for the **intercept** term in matrix `X`:

**Design Matrix (X):**

$$
X = \begin{bmatrix}
1 & 1 & 2 & 373.36 \\
1 & 4 & 5 & 299.34 \\
1 & 4 & 5 & 23.00 \\
1 & 0 & 2 & 230.11 \\
1 & 2 & 4 & 176.72 \\
1 & 1 & 4 & 196.16 \\
1 & 1 & 5 & 272.75 \\
1 & 4 & 2 & 292.90 \\
1 & 3 & 3 & 429.11 \\
1 & 3 & 4 & 191.39
\end{bmatrix}
$$


**Target Vector (y):**

$$
y = \begin{bmatrix}
56 \\
59 \\
64 \\
34 \\
33 \\
21 \\
57 \\
60 \\
69 \\
34
\end{bmatrix}
$$
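A short NumPy sketch of this step, with the ten training rows typed in by hand so the snippet runs without downloading the CSV. The variable names `features`, `X` and `y` are chosen here for illustration.

```python
import numpy as np

# The ten training rows: category_name, quantity, price
features = np.array([
    [1, 2, 373.36], [4, 5, 299.34], [4, 5, 23.00], [0, 2, 230.11],
    [2, 4, 176.72], [1, 4, 196.16], [1, 5, 272.75], [4, 2, 292.90],
    [3, 3, 429.11], [3, 4, 191.39],
])

# Target vector: age for the same ten rows
y = np.array([56, 59, 64, 34, 33, 21, 57, 60, 69, 34], dtype=float)

# Prepend the intercept column of 1s to obtain the 10 x 4 design matrix
X = np.column_stack([np.ones(len(features)), features])

print(X.shape, y.shape)  # (10, 4) (10,)
```

With a DataFrame at hand, the same `features` array could instead be taken from `x_train.to_numpy()`.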


---

## 🧠 Step 2: Normal Equation

The formula to compute the coefficients (β):

$$
\boldsymbol{\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}
$$

---

## 🔢 Step 3: Matrix Calculations

### 🔹 Xᵀ X:

$$
\begin{bmatrix}
10.00 & 23.00 & 36.00 & 2484.84 \\
23.00 & 73.00 & 88.00 & 5518.17 \\
36.00 & 88.00 & 144.00 & 8312.60 \\
2484.84 & 5518.17 & 8312.60 & 733138.938
\end{bmatrix}
$$



### 🔹 (Xᵀ X)⁻¹:

$$
\begin{bmatrix}
2.922 & -0.047 & -0.436 & -0.00461 \\
-0.047 & 0.055 & -0.0207 & -0.00002 \\
-0.436 & -0.0207 & 0.0992 & 0.00051 \\
-0.00461 & -0.00002 & 0.00051 & 0.0000114
\end{bmatrix}
$$



### 🔹 Xᵀ y:

$$
\begin{bmatrix}
487.0 \\
1241.0 \\
1759.0 \\
127052.68
\end{bmatrix}
$$
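These intermediate matrices can be checked numerically. The sketch below recomputes XᵀX, Xᵀy and the inverse with NumPy from the ten training rows. The names `XtX`, `Xty` and `XtX_inv` are chosen here for illustration, and the printed inverse may differ from the rounded entries above in the last digits.

```python
import numpy as np

# The ten training rows: category_name, quantity, price
features = np.array([
    [1, 2, 373.36], [4, 5, 299.34], [4, 5, 23.00], [0, 2, 230.11],
    [2, 4, 176.72], [1, 4, 196.16], [1, 5, 272.75], [4, 2, 292.90],
    [3, 3, 429.11], [3, 4, 191.39],
])
y = np.array([56, 59, 64, 34, 33, 21, 57, 60, 69, 34], dtype=float)
X = np.column_stack([np.ones(len(features)), features])

XtX = X.T @ X                  # 4 x 4 symmetric matrix
Xty = X.T @ y                  # vector of length 4
XtX_inv = np.linalg.inv(XtX)   # inverse used in the normal equation

np.set_printoptions(suppress=True, precision=5)
print(XtX)
print(Xty)
print(XtX_inv)
```

The first row of `XtX` holds the column sums of `X` (10, 23, 36, 2484.84), which is a quick sanity check against the table of input data.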


---

## ✅ Step 4: Final Coefficients (β)

After applying the formula:

$$
\boldsymbol{\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}
$$
$$
= \begin{bmatrix}
12.6509 \\
6.3997 \\
1.1389 \\
0.0693 \\
\end{bmatrix}
$$

---

## 📈 Final Regression Equation

The estimated linear regression model is:

**age = 12.65 + 6.40 × category_name + 1.14 × quantity + 0.069 × price**
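To reproduce this step numerically, the sketch below solves the normal equation for the ten training rows and wraps the resulting equation in a small function. The name `predict_age` is a hypothetical helper introduced for illustration, and `np.linalg.solve` is used in place of an explicit inverse. The printed coefficients can be compared with the rounded values in the equation above.

```python
import numpy as np

# The ten training rows: category_name, quantity, price
features = np.array([
    [1, 2, 373.36], [4, 5, 299.34], [4, 5, 23.00], [0, 2, 230.11],
    [2, 4, 176.72], [1, 4, 196.16], [1, 5, 272.75], [4, 2, 292.90],
    [3, 3, 429.11], [3, 4, 191.39],
])
y = np.array([56, 59, 64, 34, 33, 21, 57, 60, 69, 34], dtype=float)
X = np.column_stack([np.ones(len(features)), features])

# Solve (X^T X) beta = X^T y for the coefficient vector
beta = np.linalg.solve(X.T @ X, X.T @ y)
intercept, b_category, b_quantity, b_price = beta

def predict_age(category_name, quantity, price):
    """Evaluate the fitted regression equation for one record (hypothetical helper)."""
    return (intercept + b_category * category_name
            + b_quantity * quantity + b_price * price)

print(np.round(beta, 4))            # intercept followed by the three slopes
print(predict_age(1, 2, 373.36))    # prediction for the first training row
```

After fitting `LinearRegression` on the same rows, `intercept_` and `coef_` should hold the same numbers up to floating-point error, since scikit-learn also solves an ordinary least-squares problem.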


---

### Let's Create a model

In [9]:
multiple_linear_model = LinearRegression()

### Train the model

In [11]:
multiple_linear_model.fit(x_train, y_train)

### Predict the output using training data

In [15]:
y_pred_train = multiple_linear_model.predict(x_train)

✅ **Error Metrics in Regression Models**

Here are the most common ways to evaluate a regression model:

---

### ✅ 1. **Mean Absolute Error (MAE)**

$$
MAE = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|
$$

- **MAE** is the average of the absolute differences between actual and predicted values.
- It’s simple to understand and **less sensitive to outliers** than MSE, because errors are not squared.

---

### ✅ 2. **Mean Squared Error (MSE)**

$$
MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$

- **MSE** squares the errors, so **larger errors have more impact**.
- It’s commonly used during **model training/optimization**.

---

### ✅ 3. **Root Mean Squared Error (RMSE)**

$$
RMSE = \sqrt{MSE}
$$

- **RMSE** brings the error back to the same **unit as the target variable**.
- Easier to interpret than MSE.
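As a small worked example of the three error metrics so far, the sketch below computes MAE, MSE and RMSE by hand with NumPy. The actual and predicted values are made-up numbers chosen so the arithmetic is easy to follow.

```python
import numpy as np

# Illustrative actual and predicted values (not from the dataset)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 13.0])

errors = y_true - y_pred          # [1, 0, -1, -4]

mae = np.mean(np.abs(errors))     # (1 + 0 + 1 + 4) / 4 = 1.5
mse = np.mean(errors ** 2)        # (1 + 0 + 1 + 16) / 4 = 4.5
rmse = np.sqrt(mse)               # square root of 4.5, about 2.12

print(mae, mse, rmse)
```

The single error of 4 contributes 16 of the 18 squared-error units, which is why MSE and RMSE react more strongly to it than MAE does.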

---

### ✅ 4. **R² Score (Coefficient of Determination)**

$$
R^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}
$$

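- Measures the **proportion of variance** in the target variable that is captured by the model.
- Is **at most 1** (closer to 1 means better fit). A value of 0 matches always predicting the mean, and the score can be negative when the model does worse than that, for example on unseen data.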
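The R² formula can be evaluated directly, as in the sketch below. The function name `r2_manual` and the sample values are assumptions made for illustration. The second call uses the mean of the targets as the prediction for every row, which serves as a reference baseline.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, following the formula above (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

# Illustrative values (not from the dataset)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
good_pred = np.array([2.5, 5.5, 6.5, 9.5])          # each prediction is off by 0.5
mean_pred = np.full_like(y_true, y_true.mean())     # always predict the mean (6.0)

print(r2_manual(y_true, good_pred))  # about 0.95: SS_res = 1, SS_tot = 20
print(r2_manual(y_true, mean_pred))  # 0.0: SS_res equals SS_tot
```

For the same inputs, `r2_score(y_true, y_pred)` from scikit-learn follows the same definition.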


### Evaluate the Model
### Check training accuracy

In [16]:
evaluation_data = {}
mae = mean_absolute_error(y_train, y_pred_train)
mse = mean_squared_error(y_train, y_pred_train)
rmse = np.sqrt(mse)
r2 = r2_score(y_train, y_pred_train)

evaluation_data['Train Metrics'] = dict(zip(['MAE', 'MSE', 'RMSE', 'R2_SCORE'], [mae, mse, rmse, r2]))
pd.DataFrame(evaluation_data)

| | Train Metrics |
|---|---|
| MAE | 9.536274 |
| MSE | 124.889244 |
| RMSE | 11.175386 |
| R2_SCORE | 0.489852 |


In [17]:
x_test, y_test = df[['category_name', 'quantity', 'price']][10:15], df['age'][10:15]

### Predict the output using testing data

In [18]:
y_pred_test = multiple_linear_model.predict(x_test)

### Check testing accuracy

In [19]:
mae = mean_absolute_error(y_test, y_pred_test)
mse = mean_squared_error(y_test, y_pred_test)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred_test)

evaluation_data['Test Metrics'] = dict(zip(['MAE', 'MSE', 'RMSE', 'R2_SCORE'], [mae, mse, rmse, r2]))
pd.DataFrame(evaluation_data)

| | Train Metrics | Test Metrics |
|---|---|---|
| MAE | 9.536274 | 14.426353 |
| MSE | 124.889244 | 234.719004 |
| RMSE | 11.175386 | 15.320542 |
| R2_SCORE | 0.489852 | 0.098067 |
