# Simple Linear Regression

Largely inspired by https://www.kaggle.com/code/sushaldevasari1306/simplelinearregression

* Matplotlib is replaced with Plotly Express for simpler code and better visualization
* The dataset is loaded from a local file
* Minor changes to the code

In [41]:
import numpy as np
import pandas as pd
import plotly.express as px

In [42]:
df = pd.read_csv('./salary-data.csv')

In [43]:
df.head()

   Unnamed: 0  Experience Years  Salary
0           1               1.1   39343
1           2               1.2   42774
2           3               1.3   46205
3           4               1.5   37731
4           5               2.0   43525


In [44]:
## removing the unnecessary index column (columns= already implies axis=1)
df.drop(columns=["Unnamed: 0"], inplace=True)
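An `Unnamed: 0` column usually appears when a CSV was written with the DataFrame index included. As a minimal sketch, assuming that is what happened here, the column can be turned into the index at load time instead of being dropped afterwards. The in-memory CSV below is a hypothetical stand-in for `salary-data.csv`, built from the first rows shown by `df.head()`.

```python
import io
import pandas as pd

# Hypothetical stand-in for the CSV file (assumption: the real file has an
# unnamed first column holding a saved index, as the head() output suggests).
csv_text = """,Experience Years,Salary
1,1.1,39343
2,1.2,42774
3,1.3,46205
"""

# index_col=0 makes the first column the index instead of a data column,
# so no "Unnamed: 0" column is created and nothing needs to be dropped.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(sample.columns))
```

With the real file, the equivalent call would be `pd.read_csv('./salary-data.csv', index_col=0)`, provided the first column really is a saved index.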

In [45]:
## Checking for Null Values
df.isnull().sum()

Experience Years    0
Salary              0
dtype: int64

In [46]:
## check for correlation
df.corr()

                  Experience Years    Salary
Experience Years          1.000000  0.977692
Salary                    0.977692  1.000000
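`df.corr()` returns the Pearson correlation coefficient by default. The sketch below computes it by hand with NumPy so the formula is visible. It uses only the five rows printed by `df.head()`, so its result will differ from the 0.977692 obtained on the full dataset.

```python
import numpy as np

# First five rows from df.head(); the full dataset is not reproduced here.
x = np.array([1.1, 1.2, 1.3, 1.5, 2.0])
y = np.array([39343, 42774, 46205, 37731, 43525], dtype=float)

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)),
# where dx and dy are deviations from the respective means.
dx = x - x.mean()
dy = y - y.mean()
r = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# np.corrcoef returns the full correlation matrix; [0, 1] is r(x, y).
print(r, np.corrcoef(x, y)[0, 1])
```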


In [47]:
## visualizing data points using plotly express
plot = px.scatter(
    df,
    x='Experience Years',
    y='Salary',
    title='Salary vs Experience Years')
plot.show()

In [48]:
## independent and dependent features
X = df[['Experience Years']]  ## double brackets keep X as a 2-D DataFrame, as scikit-learn expects for features
y = df['Salary']

In [49]:
## Performing train test split
from sklearn.model_selection import train_test_split

In [50]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [51]:
## Standardization 
from sklearn.preprocessing import StandardScaler

In [52]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)  ## transform only: reuse the training mean/std to avoid data leakage
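To make the fit/transform distinction concrete, here is a minimal NumPy sketch of what standardization does. The arrays are hypothetical illustration values, not the salary data. The mean and standard deviation come from the training set only and are then reused for the test set. `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np

# Hypothetical feature values, shaped (n_samples, 1) like X_train / X_test.
train = np.array([[1.0], [2.0], [3.0], [4.0]])
test = np.array([[2.5], [6.0]])

# "fit": learn the statistics from the training data only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # ddof=0, the convention StandardScaler uses

# "transform": apply the same training statistics to both sets.
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.ravel())
print(test_scaled.ravel())
```

The scaled training set has mean 0 and standard deviation 1. The test set generally does not, because it is scaled with statistics it did not contribute to.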

In [53]:
## Applying Linear Regression
from sklearn.linear_model import LinearRegression

regression = LinearRegression()

In [54]:
regression.fit(X_train, y_train)

In [55]:
print("Coefficient or slope: ", regression.coef_)
print("Intercept: ", regression.intercept_)

Coefficient or slope:  [24735.17118781]
Intercept:  74915.20000000001
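Because the feature was standardized, the slope above is the salary change per one standard deviation of experience, and the intercept equals the mean training salary. The sketch below shows one way to convert such coefficients back to original units. It uses hypothetical data and `np.polyfit` as a stand-in for `LinearRegression`, so the numbers are illustrative only.

```python
import numpy as np

# Hypothetical data (not the salary dataset).
years = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
salary = np.array([40000.0, 45000.0, 52000.0, 58000.0, 65000.0])

# Standardize the feature the same way StandardScaler would (ddof=0).
mean, std = years.mean(), years.std()
z = (years - mean) / std

# Degree-1 polyfit returns [slope, intercept].
slope_z, intercept_z = np.polyfit(z, salary, 1)          # fit on scaled feature
slope_raw, intercept_raw = np.polyfit(years, salary, 1)  # fit on raw feature

# Undo the scaling: salary per year, and salary at zero years of experience.
slope_back = slope_z / std
intercept_back = intercept_z - slope_back * mean

print(slope_back, slope_raw)
print(intercept_back, intercept_raw)
```

In this notebook the same conversion would read `regression.coef_[0] / scaler.scale_[0]` for the slope per year, with `scaler.mean_[0]` used for the intercept.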


In [62]:
## plotting the training data with the best-fit line using plotly
plot = px.scatter(
    x=X_train.flatten(),
    y=y_train,
    title='Salary vs Experience Years (training data)',
    labels={'x': 'Experience Years (standardized)', 'y': 'Salary'}
)
plot.add_scatter(x=X_train.flatten(), y=regression.predict(X_train), mode='lines', name='Best Fit Line')
plot.show()

In [57]:
## Prediction
y_pred = regression.predict(X_test)

In [58]:
## Performance Metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error

In [59]:
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mse)
print("MSE: ", mse)
print("MAE: ", mae)
print("RMSE (root mean squared error): ", rmse)

MSE:  40884772.62082657
MAE:  5793.506861825283
RMSE (root mean squared error):  6394.120160024096
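The three metrics are simple functions of the residuals. The following sketch computes them by hand with NumPy on small hypothetical arrays, to show what `mean_squared_error` and `mean_absolute_error` calculate.

```python
import numpy as np

# Hypothetical targets and predictions (not the notebook's y_test / y_pred).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 11.0])

residuals = y_true - y_hat

mse = np.mean(residuals ** 2)     # mean squared error
mae = np.mean(np.abs(residuals))  # mean absolute error
rmse = np.sqrt(mse)               # root mean squared error, in the units of y

print(mse, mae, rmse)
```

RMSE and MAE are expressed in the same unit as the target (here, salary), which makes them easier to interpret than MSE.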


In [60]:
from sklearn.metrics import r2_score

## printing R-squared
score = r2_score(y_test, y_pred)
print(score)

## printing adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)
n, p = X_test.shape  ## number of test samples and number of features
print(1 - (1 - score) * (n - 1) / (n - p - 1))

0.9426749383532557
0.9355093056474126
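Adjusted R-squared penalizes R-squared for the number of features `p` relative to the number of samples `n`. Below is a small helper sketch; the function name `adjusted_r2` is hypothetical. The value `n = 10` is inferred from the two scores printed above and is not stated in the notebook.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R-squared for n samples and p features.

    Formula: 1 - (1 - r2) * (n - 1) / (n - p - 1)
    """
    if n - p - 1 <= 0:
        raise ValueError("need more samples than features plus one")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# R-squared printed above; n = 10 is an inferred test-set size (assumption).
r2 = 0.9426749383532557
adj = adjusted_r2(r2, n=10, p=1)
print(adj)
```

With a single feature the penalty is small, so the adjusted score stays close to the plain R-squared.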
