## Multiple Linear Regression
Multiple linear regression has one dependent feature (the target feature) and more than one independent feature.
![image.png](attachment:image.png)
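
Written out, the model described above takes the following general form. The symbols $x_1, \dots, x_n$ (independent features) and $\hat{y}$ (predicted target) are notation assumed here for illustration; $B_0, \dots, B_n$ are the coefficients used in the rest of this notebook.

$$
\hat{y} = B_0 + B_1 x_1 + B_2 x_2 + \dots + B_n x_n
$$
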

## Coefficients B0, B1, B2, ... Bn need to be calculated in such a way that the model has the least squared error
![image.png](attachment:image.png)
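
As a minimal sketch of what "least squared error" means in practice, the coefficients can be found with NumPy's least-squares solver. The data below is synthetic and made up for illustration (it is generated from B0 = 1, B1 = 2, B2 = 3), and the variable names are assumptions, not part of this notebook's dataset.

```python
import numpy as np

# Synthetic data (hypothetical): y = 1 + 2*x1 + 3*x2 exactly
X_demo = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y_demo = 1.0 + 2.0 * X_demo[:, 0] + 3.0 * X_demo[:, 1]

# Prepend a column of ones so that the first coefficient acts as the intercept B0
X_with_intercept = np.column_stack([np.ones(len(X_demo)), X_demo])

# Solve for the coefficients that minimise the sum of squared residuals
coefficients, *_ = np.linalg.lstsq(X_with_intercept, y_demo, rcond=None)
print(coefficients)  # approximately B0 = 1, B1 = 2, B2 = 3
```

Because the synthetic target has no noise, the recovered coefficients match the generating values up to floating-point error.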

    Loss function:
    the error (residual) for a single data point
    error/residual: yactual - ypredicted

    Cost function:
    the error aggregated over all the data points
    cost: Summation((yactual - ypredicted)^2)
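
The distinction can be sketched in a few lines of NumPy. The function names `loss` and `cost` and the three sample values are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import numpy as np


def loss(y_actual, y_predicted):
    """Residual for a single data point."""
    return y_actual - y_predicted


def cost(y_actual, y_predicted):
    """Sum of squared residuals over all data points."""
    residuals = np.asarray(y_actual) - np.asarray(y_predicted)
    return float(np.sum(residuals ** 2))


# Hypothetical values for illustration
y_actual = np.array([3.0, 5.0, 7.0])
y_predicted = np.array([2.5, 5.0, 8.0])

print(loss(y_actual[0], y_predicted[0]))  # 0.5
print(cost(y_actual, y_predicted))        # 0.25 + 0.0 + 1.0 = 1.25
```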

## Gradient Descent Algorithm
1. Initialize the coefficients randomly.
2. Calculate the squared errors to determine the cost function.
3. Calculate the derivative of the cost function and multiply it by a learning rate. Learning rates between 0.001 and 0.1 are preferred.
4. Calculate the new coefficients by subtracting that product from the current coefficients.
5. Recalculate the cost function and repeat. The optimised coefficients are those at the global minimum, where the squared error is smallest.


![image.png](attachment:image.png)
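
The steps above can be sketched in NumPy as follows. This is an illustrative implementation under stated assumptions, not what scikit-learn's `LinearRegression` does internally: the function name `gradient_descent`, the synthetic data, the fixed iteration count, and the use of the mean (rather than the sum) of squared errors as the cost are all choices made here for the example.

```python
import numpy as np


def gradient_descent(X, y, learning_rate=0.01, n_iterations=20000, seed=0):
    """Fit intercept and coefficients by minimising the mean squared error."""
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    # Column of ones so that coefs[0] plays the role of the intercept B0
    Xb = np.column_stack([np.ones(n_samples), X])

    # Step 1: initialise the coefficients randomly
    coefs = rng.normal(size=Xb.shape[1])

    for _ in range(n_iterations):
        # Step 2: residuals, from which the squared-error cost is built
        residuals = y - Xb @ coefs
        # Step 3: derivative of the mean squared error w.r.t. each coefficient
        gradient = (-2.0 / n_samples) * (Xb.T @ residuals)
        # Step 4: move the coefficients against the gradient
        coefs = coefs - learning_rate * gradient
    # Step 5: after enough iterations the cost has settled near its minimum
    return coefs


# Synthetic data (hypothetical): y = 1 + 2*x1 + 3*x2 exactly
X_demo = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y_demo = 1.0 + 2.0 * X_demo[:, 0] + 3.0 * X_demo[:, 1]

fitted = gradient_descent(X_demo, y_demo)
print(fitted)  # approximately [1, 2, 3]
```

The learning rate of 0.01 sits inside the preferred range given above. On features with large magnitudes, such as the dollar amounts in the dataset used below, the features would usually need scaling before a rate in that range converges.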

## Read the dataset

In [1]:
import pandas as pd

df = pd.read_csv(
    "https://raw.githubusercontent.com/Sindhura-tr/Datasets/refs/heads/main/50_Startups.csv"
)
df.head()

| | RND | ADMIN | MKT | STATE | PROFIT |
|---|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York | 192261.83 |
| 1 | 162597.7 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |


## Perform basic data quality checks

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   RND     50 non-null     float64
 1   ADMIN   50 non-null     float64
 2   MKT     50 non-null     float64
 3   STATE   50 non-null     object 
 4   PROFIT  50 non-null     float64
dtypes: float64(4), object(1)
memory usage: 2.1+ KB


In [4]:
df.shape

(50, 5)

In [5]:
df.isna().sum()

RND       0
ADMIN     0
MKT       0
STATE     0
PROFIT    0
dtype: int64

In [6]:
df.duplicated().sum()

np.int64(0)

## There are no missing values or duplicated rows in this dataset

## Separate X and Y features
    Y => PROFIT
    X => ADMIN,RND

In [7]:
X = df[["ADMIN", "RND"]]
Y = df[["PROFIT"]]

In [8]:
X.head()

| | ADMIN | RND |
|---|---|---|
| 0 | 136897.8 | 165349.2 |
| 1 | 151377.59 | 162597.7 |
| 2 | 101145.55 | 153441.51 |
| 3 | 118671.85 | 144372.41 |
| 4 | 91391.77 | 142107.34 |


In [9]:
Y.head()

| | PROFIT |
|---|---|
| 0 | 192261.83 |
| 1 | 191792.06 |
| 2 | 191050.39 |
| 3 | 182901.99 |
| 4 | 166187.94 |


## Model Building

In [10]:
from sklearn.linear_model import LinearRegression

In [11]:
model = LinearRegression()
model.fit(X, Y)

## Evaluate the model

In [13]:
model.score(X, Y)

0.9478129385009173

In [14]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

In [17]:
def evaluate_model(model, x, y):
    """Print MSE, MAE, RMSE and R2 for the model's predictions on (x, y)."""
    ypred = model.predict(x)

    mse = mean_squared_error(y, ypred)
    mae = mean_absolute_error(y, ypred)
    r2 = r2_score(y, ypred)
    rmse = mse ** (1 / 2)  # RMSE is the square root of MSE

    print(f"MSE: {mse}")
    print(f"MAE: {mae}")
    print(f"RMSE: {rmse}")
    print(f"R2 score: {r2}")

In [18]:
evaluate_model(model, X, Y)

MSE: 83086833.25816332
MAE: 6691.397424314962
RMSE: 9115.197927536368
R2 score: 0.9478129385009173


## With an R2 score of about 0.948 on the data it was fitted on, which is greater than 0.8, the above model can be considered for final predictions
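
To make the four metrics printed above concrete, here is a minimal sketch that computes them directly from the residuals with NumPy. The function name `regression_metrics` and the three sample values are hypothetical; the formulas are the standard definitions of MSE, MAE, RMSE and R2.

```python
import numpy as np


def regression_metrics(y_actual, y_predicted):
    """Return MSE, MAE, RMSE and R2 computed from the residuals."""
    y_actual = np.asarray(y_actual, dtype=float)
    y_predicted = np.asarray(y_predicted, dtype=float)
    residuals = y_actual - y_predicted

    mse = np.mean(residuals ** 2)      # mean squared error
    mae = np.mean(np.abs(residuals))   # mean absolute error
    rmse = np.sqrt(mse)                # root mean squared error
    # R2 = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return {"mse": mse, "mae": mae, "rmse": rmse, "r2": r2}


# Hypothetical values for illustration
metrics = regression_metrics([3.0, 5.0, 7.0], [2.5, 5.0, 8.0])
print(metrics)
```

For these sample values, the residuals are 0.5, 0 and -1, so MAE is 0.5 and R2 is 1 - 1.25/8 = 0.84375.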

In [19]:
model.intercept_

array([54886.62062756])

In [20]:
model.coef_

array([[-0.05299543,  0.86211798]])

## Profit_predicted = B0 + B1*ADMIN + B2*RND => below are the optimised values of the coefficients
    PROFIT_PREDICTED = 54886.62062756 + (-0.05299543)*ADMIN + (0.86211798)*RND
    If ADMIN and RND are both 0, then PROFIT_PREDICTED = 54886.62062756


In [21]:
sample = [160000, 19000]  # [ADMIN, RND], in the same column order as X

In [23]:
model.predict([sample])



array([[62787.59310165]])

In [24]:
sample2 = [[91391.77, 142107.34]]  # ADMIN and RND of row 4 in the dataset (actual PROFIT: 166187.94)
model.predict(sample2)



array([[172556.56688596]])
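
As a sanity check, both predictions above can be reproduced by hand from the printed intercept and coefficients. This sketch hard-codes the rounded values shown in the outputs of `model.intercept_` and `model.coef_`, so the results agree with `model.predict` only up to rounding; the helper name `predict_profit` is hypothetical.

```python
import numpy as np

# Values copied from the model.intercept_ and model.coef_ outputs above
intercept = 54886.62062756
coefficients = np.array([-0.05299543, 0.86211798])  # order: [ADMIN, RND]


def predict_profit(admin, rnd):
    """Apply PROFIT_PREDICTED = B0 + B1*ADMIN + B2*RND."""
    return intercept + coefficients @ np.array([admin, rnd])


manual_1 = predict_profit(160000, 19000)
manual_2 = predict_profit(91391.77, 142107.34)
print(manual_1, manual_2)
```

Both values should land within about a cent of the `model.predict` outputs, with the small gap coming from the coefficients being printed to eight decimal places.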