### 🧠 Multiple Linear Regression (Definition)

**Multiple Linear Regression (MLR)** is a statistical technique that models the relationship between one dependent variable and two or more independent variables by fitting a linear equation to observed data.

It extends simple linear regression by allowing multiple predictors to explain variations in the target variable.  
The general form of the equation is:

\[
Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_nX_n + \epsilon
\]

Where:  
- \( Y \): Dependent variable  
- \( X_1, X_2, \dots, X_n \): Independent variables  
- \( \beta_0 \): Intercept term  
- \( \beta_1, \beta_2, \dots, \beta_n \): Coefficients of the independent variables  
- \( \epsilon \): Error term  

**Goal:** To find the best-fitting hyperplane (a plane when there are two predictors), meaning the coefficients that minimize the sum of squared differences between the actual and predicted values of \( Y \).
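
As a minimal sketch of how the equation turns feature values into a prediction, the snippet below evaluates it for a single observation. The helper `predict_one` and all numeric values are made up for illustration; they are not taken from the dataset used later.

```python
# Evaluate Y_hat = beta_0 + beta_1*X_1 + ... + beta_n*X_n for one observation.
# All coefficient and feature values here are hypothetical.

def predict_one(intercept, coefs, features):
    """Return the linear model's prediction for a single row of features."""
    if len(coefs) != len(features):
        raise ValueError("need exactly one coefficient per feature")
    return intercept + sum(b * x for b, x in zip(coefs, features))

intercept = 2.0          # beta_0 (hypothetical)
coefs = [3.0, 0.5]       # beta_1, beta_2 (hypothetical)
features = [4.0, 10.0]   # X_1, X_2 for one observation

y_hat = predict_one(intercept, coefs, features)
print(y_hat)  # 2.0 + 3.0*4.0 + 0.5*10.0 = 19.0
```

The error term \( \epsilon \) does not appear in the prediction: it is the part of \( Y \) that the linear combination does not explain.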


### 🧮 Finding the Beta Coefficients (β)

In **Multiple Linear Regression**, we estimate the coefficients that best fit the data.

The model in matrix form is written as:

$Y = X\beta + \epsilon$

Where:  
- $Y$ → vector of dependent variable values  
- $X$ → matrix of independent variables (including a column of ones for the intercept)  
- $\beta$ → vector of coefficients  
- $\epsilon$ → vector of error terms

The **Ordinary Least Squares (OLS)** method finds $\beta$ that minimizes the sum of squared errors:

$\text{Minimize } \; (Y - X\beta)^T (Y - X\beta)$
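
For readers who want the intermediate step, here is a sketch of the standard derivation, assuming $X^T X$ is invertible. Call the objective $S(\beta)$ (a label introduced here for convenience), expand it, and set its gradient with respect to $\beta$ to zero:

$$S(\beta) = (Y - X\beta)^T (Y - X\beta) = Y^T Y - 2\beta^T X^T Y + \beta^T X^T X \beta$$

$$\frac{\partial S}{\partial \beta} = -2X^T Y + 2X^T X \beta = 0 \;\Longrightarrow\; X^T X \beta = X^T Y$$

Multiplying both sides on the left by $(X^T X)^{-1}$ isolates $\beta$.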

The solution (normal equation) is:

$\hat{\beta} = (X^T X)^{-1} X^T Y$

Where:  
- $\hat{\beta}$ → estimated coefficients  
- $X^T$ → transpose of matrix $X$  
- $(X^T X)^{-1}$ → inverse of the product $X^T X$  
- $X^T Y$ → product of $X^T$ and $Y$

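This formula gives the values of $\beta$ that minimize the sum of squared differences between the actual and predicted values of $Y$. It requires $X^T X$ to be invertible, which fails when one column of $X$ is an exact linear combination of the others.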
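
A small worked example may make the formula more concrete. The sketch below builds a tiny, made-up dataset from known coefficients (no noise), applies the normal equation, and checks that the same coefficients come back. The arrays `X_raw` and `true_beta` are hypothetical and unrelated to the student dataset.

```python
import numpy as np

# Four made-up observations with two predictors each.
X_raw = np.array([[1.0, 2.0],
                  [2.0, 1.0],
                  [3.0, 4.0],
                  [4.0, 3.0]])
true_beta = np.array([1.0, 2.0, 3.0])  # beta_0, beta_1, beta_2 (hypothetical)

# Prepend the column of ones so that beta_0 acts as the intercept.
X = np.column_stack([np.ones(len(X_raw)), X_raw])

# Noise-free targets, so OLS should recover true_beta up to rounding error.
Y = X @ true_beta

# Normal equation: beta_hat = (X^T X)^{-1} X^T Y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ Y

print(np.round(beta_hat, 6))            # close to [1. 2. 3.]
print(np.allclose(beta_hat, true_beta))
```

With real, noisy data the estimate will not match any "true" coefficients exactly; the noise-free setup is used here only so the result is easy to verify.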


In [44]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

In [45]:
data = pd.read_csv('Student_Performance.csv')

In [46]:
data.head()

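| | Hours Studied | Previous Scores | Extracurricular Activities | Sleep Hours | Sample Question Papers Practiced | Performance Index |
|---|---|---|---|---|---|---|
| 0 | 7 | 99 | Yes | 9 | 1 | 91.0 |
| 1 | 4 | 82 | No | 4 | 2 | 65.0 |
| 2 | 8 | 51 | Yes | 7 | 2 | 45.0 |
| 3 | 5 | 52 | Yes | 5 | 2 | 36.0 |
| 4 | 7 | 75 | No | 8 | 5 | 66.0 |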


In [47]:
data["Extracurricular Activities"] = data["Extracurricular Activities"].map({"Yes": 1, "No": 0})


In [48]:
data.head()

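| | Hours Studied | Previous Scores | Extracurricular Activities | Sleep Hours | Sample Question Papers Practiced | Performance Index |
|---|---|---|---|---|---|---|
| 0 | 7 | 99 | 1 | 9 | 1 | 91.0 |
| 1 | 4 | 82 | 0 | 4 | 2 | 65.0 |
| 2 | 8 | 51 | 1 | 7 | 2 | 45.0 |
| 3 | 5 | 52 | 1 | 5 | 2 | 36.0 |
| 4 | 7 | 75 | 0 | 8 | 5 | 66.0 |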


In [49]:
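# Features: the first four columns (Hours Studied, Previous Scores,
# Extracurricular Activities, Sleep Hours). This slice leaves out
# 'Sample Question Papers Practiced'.
X = data.iloc[:, :4].values
# Target: the column to predict
Y = data['Performance Index'].values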

### 🎯 Note on Feature Selection

In this project, I am **not performing any feature selection** or preprocessing such as scaling or removing correlated features. The only adjustment is mapping the `Extracurricular Activities` column from Yes/No to 1/0 so that every input is numeric.

The main objective here is to **implement the Multiple Linear Regression algorithm completely from scratch**, focusing purely on the **mathematical and computational steps** behind the model, not on optimizing feature inputs.

Hence, the features are used as they are: `X` is simply the first four columns of the dataset (`data.iloc[:, :4]`), so `Sample Question Papers Practiced` is not part of the model.  
This approach helps in understanding **how the algorithm works internally**, rather than relying on pre-built libraries or preprocessing methods.


In [50]:
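class MultipleRegression:
    """Multiple linear regression fitted with the OLS normal equation."""

    def __init__(self):
        # Both are set by fit()
        self.intercept = None
        self.coef = None

    def fit(self, x_train, y_train):
        # Prepend a column of ones so the first beta is the intercept
        x_train = np.insert(x_train, 0, 1, axis=1)

        # beta_hat = (X^T X)^{-1} X^T Y
        betas = np.linalg.inv(np.dot(x_train.T, x_train))
        betas = betas.dot(x_train.T)
        betas = betas.dot(y_train)

        self.intercept = betas[0]
        self.coef = betas[1:]

    def predict(self, x_test):
        # x_test has no ones column, so the intercept is added separately
        y_pred = np.dot(x_test, self.coef) + self.intercept
        return y_pred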
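
One way to gain confidence in a from-scratch estimator like this is to compare it with an independent solver. The sketch below is an illustration rather than part of the original notebook: it repeats the normal-equation steps in a hypothetical helper, `fit_normal_equation`, and checks the result against NumPy's `np.linalg.lstsq` on synthetic data generated with a fixed seed.

```python
import numpy as np

def fit_normal_equation(x, y):
    """Hypothetical helper: OLS via the normal equation, intercept first."""
    x1 = np.insert(x, 0, 1, axis=1)          # add the column of ones
    return np.linalg.inv(x1.T @ x1) @ x1.T @ y

# Synthetic data: 200 rows, 3 features, small Gaussian noise (all made up).
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 3))
y = 5.0 + x @ np.array([1.5, -2.0, 0.7]) + rng.normal(scale=0.1, size=200)

betas = fit_normal_equation(x, y)

# Reference solution from NumPy's least-squares solver on the same design matrix.
x1 = np.insert(x, 0, 1, axis=1)
betas_ref, *_ = np.linalg.lstsq(x1, y, rcond=None)

print(np.allclose(betas, betas_ref))  # True when both solvers agree
```

Explicitly inverting $X^T X$ mirrors the formula, which suits a from-scratch exercise. When the features are strongly correlated, a least-squares solver such as `np.linalg.lstsq` is generally the more numerically robust choice.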
    
        
         

In [51]:
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

In [52]:
lr = MultipleRegression()

In [53]:
lr.fit(x_train, y_train)
y_pred = lr.predict(x_test)
y_pred

array([55.21428305, 21.94861365, 47.83051038, ..., 17.27564575,
       63.26890279, 46.05829465])

In [54]:
pd.DataFrame({"MSE": [mean_squared_error(y_test, y_pred)],
 "R2 Score": [r2_score(y_test, y_pred)],
 "RMSE": [np.sqrt(mean_squared_error(y_test, y_pred))]})

| | MSE | R2 Score | RMSE |
|---|---|---|---|
| 0 | 4.422438 | 0.988066 | 2.102959 |
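
In keeping with the from-scratch theme, the three metrics can also be computed directly from their definitions. The sketch below uses a hypothetical helper, `regression_metrics`, and a tiny made-up example rather than the notebook's `y_test` and `y_pred`. The R² formula used, 1 − SS_res / SS_tot, is the usual coefficient of determination.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Hypothetical helper: MSE, RMSE and R^2 computed from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    mse = np.mean(residuals ** 2)                    # mean squared error
    rmse = np.sqrt(mse)                              # same units as the target
    ss_res = np.sum(residuals ** 2)                  # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    r2 = 1.0 - ss_res / ss_tot
    return {"MSE": mse, "RMSE": rmse, "R2 Score": r2}

# Made-up values: residuals are 1, 0, -1, 0
metrics = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0])
print(metrics)  # MSE = 0.5, RMSE ≈ 0.7071, R2 Score = 0.9
```

RMSE is the easiest of the three to interpret because it is expressed in the same units as `Performance Index`.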
