## Simple Linear Regression
Focuses on one dependent variable and one independent variable.

## Steps:
1. Read the dataset
2. Perform basic quality checks
3. Perform data cleaning: handle missing data and remove duplicated rows
4. EDA - Statistical analysis
5. Descriptive analysis
    - Univariate analysis
    - Bivariate analysis
    - Multivariate analysis
6. Separate the X and Y features (independent and dependent features)
7. Model building
8. Model evaluation
9. Consider the model for final predictions

## Read the dataset

In [20]:
import pandas as pd

df = pd.read_csv(
    "https://raw.githubusercontent.com/Sindhura-tr/Datasets/refs/heads/main/50_Startups.csv"
)
df.head()

|   | RND | ADMIN | MKT | STATE | PROFIT |
|---|-----|-------|-----|-------|--------|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York | 192261.83 |
| 1 | 162597.7 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |


## Perform the basic data quality checks

In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   RND     50 non-null     float64
 1   ADMIN   50 non-null     float64
 2   MKT     50 non-null     float64
 3   STATE   50 non-null     object 
 4   PROFIT  50 non-null     float64
dtypes: float64(4), object(1)
memory usage: 2.1+ KB


In [22]:
# check for missing data
df.isna().sum()

RND       0
ADMIN     0
MKT       0
STATE     0
PROFIT    0
dtype: int64

In [23]:
# check for duplicated rows
df.duplicated().sum()

np.int64(0)

## There are no missing values or duplicated rows in this dataset.
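
This dataset needs no cleaning, so step 3 is never exercised here. As a minimal sketch of what that step could look like, the block below builds a hypothetical toy frame (`toy` and its values are invented for illustration, not taken from the startups data) with one missing value and one duplicated row, then drops both.

```python
import pandas as pd

# Hypothetical toy frame (NOT the startups data):
# row 2 has a missing RND value, row 3 repeats row 1
toy = pd.DataFrame(
    {
        "RND": [100.0, 200.0, None, 200.0],
        "PROFIT": [10.0, 20.0, 30.0, 20.0],
    }
)

n_missing = int(toy.isna().sum().sum())    # total missing cells
n_duplicates = int(toy.duplicated().sum())  # fully duplicated rows
print(n_missing, n_duplicates)

# Drop rows with any missing value, then drop exact duplicate rows
cleaned = toy.dropna().drop_duplicates().reset_index(drop=True)
print(cleaned.shape)
```

Dropping rows is only one option. Filling missing values (for example with a column mean) may be preferable when rows are scarce, as they are in a 50-row dataset.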

## Separate X and Y features
    X : Independent feature : ADMIN
    Y : Target feature, Dependent feature : PROFIT

In [24]:
X = df[["ADMIN"]]
Y = df[["PROFIT"]]

In [25]:
X.head()

|   | ADMIN |
|---|-------|
| 0 | 136897.8 |
| 1 | 151377.59 |
| 2 | 101145.55 |
| 3 | 118671.85 |
| 4 | 91391.77 |


In [26]:
Y.head()

|   | PROFIT |
|---|--------|
| 0 | 192261.83 |
| 1 | 191792.06 |
| 2 | 191050.39 |
| 3 | 182901.99 |
| 4 | 166187.94 |


## Model Building - Linear model
Uses `LinearRegression` from scikit-learn (`sklearn.linear_model`).

In [27]:
from sklearn.linear_model import LinearRegression

In [28]:
model = LinearRegression()
model.fit(X, Y)

## Ypredictions = B0 + B1*X
    Profit_predictions = yintercept + slope * ADMIN
    B0 => y-intercept
    B1 => slope => coefficient => weight
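
For a single feature, the least-squares values of B0 and B1 have a closed form: the slope is the covariance of X and Y divided by the variance of X, and the intercept makes the line pass through the point of means. The sketch below works this out by hand on hypothetical toy data (`xs` and `ys` are invented for illustration), which is the same least-squares problem that `LinearRegression` solves.

```python
# Hypothetical toy data (NOT the startups data), roughly following y = 1 + 2x
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n

# Slope B1 = sum of cross-deviations / sum of squared deviations of x
b1 = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum(
    (x - x_mean) ** 2 for x in xs
)

# Intercept B0 = mean(y) - B1 * mean(x)
b0 = y_mean - b1 * x_mean

print(f"B0 (intercept) : {b0:.2f}")
print(f"B1 (slope)     : {b1:.2f}")
```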

In [29]:
model.coef_

array([[0.2887492]])

In [30]:
model.intercept_

array([76974.47130542])

## Profit_predictions = 76974.47130542 + 0.2887492 * ADMIN
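
As a quick sanity check, the equation can be evaluated by hand. The sketch below plugs in the ADMIN value of the first row shown by `df.head()`. The coefficients are the rounded values printed above, so the result should agree with `model.predict` only up to small rounding differences.

```python
# Coefficients as printed by model.intercept_ and model.coef_ (rounded)
b0 = 76974.47130542
b1 = 0.2887492

admin = 136897.8  # ADMIN value of row 0 in df.head()

# Profit_predictions = B0 + B1 * ADMIN
profit_prediction = b0 + b1 * admin
print(f"Predicted PROFIT : {profit_prediction:.2f}")
```

The actual PROFIT for that row is 192261.83, so this single-feature model misses it by a wide margin.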

In [31]:
# R2 score, computed on the same data the model was fitted on
model.score(X, Y)

0.04028714077757223

## Model Evaluation
    Metrics:
    Mean Absolute Error (MAE)
    Mean Squared Error (MSE)
    Root Mean Squared Error (RMSE)
    R2 score
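
To make the four metrics concrete, the sketch below computes each one from its definition in plain Python. The `y_true` and `y_pred` values are hypothetical and chosen only to keep the arithmetic easy to follow. The scikit-learn functions `mean_absolute_error`, `mean_squared_error` and `r2_score` compute the same quantities.

```python
import math

# Hypothetical actual and predicted values, for illustration only
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 11.0]

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_pred)]

mae = sum(abs(e) for e in errors) / n   # average absolute error
mse = sum(e ** 2 for e in errors) / n   # average squared error
rmse = math.sqrt(mse)                   # back in the units of y

# R2 = 1 - (residual sum of squares / total sum of squares)
y_mean = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((t - y_mean) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(f"MAE  : {mae}")
print(f"MSE  : {mse}")
print(f"RMSE : {rmse:.4f}")
print(f"R2   : {r2:.2f}")
```

MAE and RMSE are in the same units as the target, which makes them easier to read than MSE. R2 is unitless: 1 is a perfect fit, and 0 means the model does no better than always predicting the mean.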

In [32]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [33]:
ypreds = model.predict(X)
ypreds[0:5]

array([[116503.6018596 ],
       [120684.62967237],
       [106180.1681897 ],
       [111240.87333494],
       [103363.77199475]])

In [34]:
Y.head()

|   | PROFIT |
|---|--------|
| 0 | 192261.83 |
| 1 | 191792.06 |
| 2 | 191050.39 |
| 3 | 182901.99 |
| 4 | 166187.94 |


In [35]:
MSE = mean_squared_error(Y, ypreds)
MAE = mean_absolute_error(Y, ypreds)
r2 = r2_score(Y, ypreds)
RMSE = MSE ** (1 / 2)

In [36]:
print(f"Mean Squared Error : {MSE}")
print(f"Mean Absolute Error : {MAE}")
print(f"Root Mean Squared Error : {RMSE}")
print(f"R2 Score : {r2}")

Mean Squared Error : 1527955397.744143
Mean Absolute Error : 30659.814789071817
Root Mean Squared Error : 39089.07005473708
R2 Score : 0.04028714077757223


## Consider another independent feature for X
X => RND, Y => PROFIT

In [37]:
X1 = df[["RND"]]

In [38]:
model2 = LinearRegression()
model2.fit(X1, Y)

In [40]:
ypreds2 = model2.predict(X1)

In [41]:
ypreds2[:5]

array([[190289.29389289],
       [187938.71118575],
       [180116.65707807],
       [172369.00320589],
       [170433.97345032]])

In [42]:
Y.head()

|   | PROFIT |
|---|--------|
| 0 | 192261.83 |
| 1 | 191792.06 |
| 2 | 191050.39 |
| 3 | 182901.99 |
| 4 | 166187.94 |


In [48]:
MSE2 = mean_squared_error(Y, ypreds2)
MAE2 = mean_absolute_error(Y, ypreds2)
r2_2 = r2_score(Y, ypreds2)
RMSE2 = MSE2 ** (1 / 2)

In [49]:
print(f"Mean Squared Error : {MSE2}")
print(f"Mean Absolute Error : {MAE2}")
print(f"Root Mean Squared Error : {RMSE2}")
print(f"R2 Score : {r2_2}")

Mean Squared Error : 85120931.32706906
Mean Absolute Error : 6910.98435457961
Root Mean Squared Error : 9226.100548285232
R2 Score : 0.9465353160804393
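
The two models can be compared side by side, which covers the last step (choosing the model for final predictions). The sketch below copies the RMSE and R2 values from the printed outputs above into a small dictionary. The `metrics` structure and variable names are illustrative, not part of the original notebook.

```python
# Metrics copied from the printed outputs of the two models
metrics = {
    "ADMIN": {"RMSE": 39089.07005473708, "R2": 0.04028714077757223},
    "RND": {"RMSE": 9226.100548285232, "R2": 0.9465353160804393},
}

# Pick the feature whose model has the highest R2
best = max(metrics, key=lambda name: metrics[name]["R2"])

# Relative drop in RMSE when switching from ADMIN to RND
rmse_reduction = 1 - metrics["RND"]["RMSE"] / metrics["ADMIN"]["RMSE"]

print(f"Best single feature by R2 : {best}")
print(f"RMSE reduction vs ADMIN   : {rmse_reduction:.1%}")
```

On these numbers, RND explains about 95% of the variance in PROFIT while ADMIN explains about 4%, so the RND model is the better candidate for final predictions. Both scores were computed on the data used for fitting.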
