### Linear Regression Implementation Using scikit-learn

https://www.geeksforgeeks.org/ml-linear-regression/

Regression analysis is one of the most widely used methods of prediction. Linear regression is appropriate when the relationship between the variables is approximately linear. As the name suggests, simple linear regression has one independent variable (predictor) and one dependent variable (response).

The simple linear regression equation is y = a + bx, where x is the explanatory variable, y is the dependent variable, b is the coefficient (slope), and a is the intercept.
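
As a minimal sketch of how a and b can be estimated, the least-squares formulas b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and a = ȳ − b·x̄ can be evaluated directly with NumPy. The toy data below is made up for illustration (it lies exactly on y = 1 + 2x) and is not part of the salary dataset.

```python
import numpy as np

# Hypothetical toy data lying exactly on the line y = 1 + 2x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

# Least-squares slope (b) and intercept (a)
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

print("Slope b:", b)
print("Intercept a:", a)
```

Because the toy points sit exactly on a line, the formulas recover its slope and intercept; with real data they give the line that minimizes the squared errors.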

For regression analysis, first we have to import libraries.

In [None]:
#Importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import plotly.express as px
import plotly.graph_objs as go
from plotly.offline import iplot

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


After importing libraries, the dataset is to be imported.

In [None]:
#Importing dataset
dataset = pd.read_csv('/content/drive/MyDrive/Dataset Files/Salary_Data.csv')

To see the first five rows of the dataset we can use dataset.head().

# Data Analysis

In [None]:
#To see first 5 rows of the dataset
dataset.head().style.background_gradient()

| | YearsExperience | Salary |
|---|---|---|
| 0 | 1.1 | 39343.0 |
| 1 | 1.3 | 46205.0 |
| 2 | 1.5 | 37731.0 |
| 3 | 2.0 | 43525.0 |
| 4 | 2.2 | 39891.0 |


In [None]:
# To see information of dataset
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 2 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   YearsExperience  30 non-null     float64
 1   Salary           30 non-null     float64
dtypes: float64(2)
memory usage: 608.0 bytes


Both columns are numeric (float64).

In [None]:
dataset.isnull().sum().sum()

0

The dataset contains no null values.

Summary statistics of the dataset, such as the number of observations, mean, standard deviation, and quartiles, can be obtained with dataset.describe().

In [None]:
#Statistical Analysis
dataset.describe().style.background_gradient()

| | YearsExperience | Salary |
|---|---|---|
| count | 30.0 | 30.0 |
| mean | 5.313333 | 76003.0 |
| std | 2.837888 | 27414.429785 |
| min | 1.1 | 37731.0 |
| 25% | 3.2 | 56720.75 |
| 50% | 4.7 | 65237.0 |
| 75% | 7.7 | 100544.75 |
| max | 10.5 | 122391.0 |
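
The statistics that describe() reports can be reproduced by hand. The sketch below does so for the YearsExperience column with NumPy, using the 30 values typed in from the X array printed later in this notebook. It assumes the pandas defaults of a sample standard deviation (ddof=1) and linearly interpolated quartiles.

```python
import numpy as np

# YearsExperience values copied from the X array printed in this notebook
years = np.array([1.1, 1.3, 1.5, 2.0, 2.2, 2.9, 3.0, 3.2, 3.2, 3.7,
                  3.9, 4.0, 4.0, 4.1, 4.5, 4.9, 5.1, 5.3, 5.9, 6.0,
                  6.8, 7.1, 7.9, 8.2, 8.7, 9.0, 9.5, 9.6, 10.3, 10.5])

count = years.size
mean = years.mean()
std = years.std(ddof=1)  # sample standard deviation (divides by n - 1)
q25, q50, q75 = np.percentile(years, [25, 50, 75])  # linear interpolation

print(count, round(mean, 6), round(std, 6))
print(q25, q50, q75)
```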


Pairwise relationships between variables can be visualized with sns.pairplot(dataset). This is especially useful for multiple linear regression, because it shows each feature's relationship with the response.

# Heat Map

In [None]:
#px.imshow: This plotly.express function displays the input data as an image-like plot (here, a heatmap of the correlation matrix).
#It automatically scales the colors to the range of the correlation coefficients, which makes relationships between variables easy to spot.

#.corr(): This method calculates the correlation matrix for the DataFrame dataset. The correlation matrix is a table showing the correlation coefficients between
#variables (columns) in the dataset. The correlation coefficient ranges from -1 to 1, where:
#1 indicates a perfect positive linear relationship.
#-1 indicates a perfect negative linear relationship.
#0 indicates no linear relationship.

fig = px.imshow(dataset.corr())
fig.show()
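
To make the correlation coefficient concrete, here is a small sketch that computes Pearson's r by hand and compares it with np.corrcoef. It uses only the five rows shown by dataset.head(), so the value it prints describes that small sample, not the full dataset.

```python
import numpy as np

# First five rows of the dataset, as shown by dataset.head()
years = np.array([1.1, 1.3, 1.5, 2.0, 2.2])
salary = np.array([39343.0, 46205.0, 37731.0, 43525.0, 39891.0])

# Pearson r: co-variation divided by the product of the individual variations
dx = years - years.mean()
dy = salary - salary.mean()
r_manual = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

# The same quantity from NumPy's correlation matrix (off-diagonal entry)
corr_matrix = np.corrcoef(years, salary)
r_numpy = corr_matrix[0, 1]

print(r_manual, r_numpy)
```

The diagonal of the correlation matrix is always 1, because every variable is perfectly correlated with itself.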

# Visualization

## Scatter Plot

In [None]:
scatter = [go.Scatter(x = dataset['YearsExperience'],
                      y = dataset['Salary'],
                      mode ='markers')]

fig = go.Figure(scatter)

iplot(fig)

## Histogram

In [None]:
hist = [go.Histogram(x = dataset['YearsExperience'],
                     marker=dict(color ='#AFE400',line = dict(color='black',width=2)))]


fig = go.Figure(data = hist)

iplot(fig)

In [None]:
hist = [go.Histogram(x = dataset['Salary'],
                     marker=dict(color ='#0FE400',line = dict(color='black',width=2)))]

fig = go.Figure(data = hist)

iplot(fig)

# Assigning dependent variable to y and independent variable to X.

In [None]:
#Assigning values to X and y
#iloc: This is a pandas indexer that selects rows and columns by integer position.
#[:, :-1]:
#:: Refers to all rows in the dataset.
#:-1: Refers to all columns except the last one.
#The negative index -1 means "the last column." So :-1 means "from the start up to, but not including, the last column."
#.values: This converts the selected data into a NumPy array. The result is a 2D array (matrix) containing all rows and all columns except the last one.

X = dataset.iloc[:, :-1].values

#iloc[:, -1]:
#:: Refers to all rows.
#-1: Refers to the last column only.
#.values: Converts the selected data into a NumPy array. The result is a 1D array (vector) containing all the values from the last column.
y = dataset.iloc[:, -1].values
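
The difference between the two slices is easiest to see from the array shapes. The sketch below rebuilds a small DataFrame from the five rows shown by dataset.head() (the names df, X_demo and y_demo are illustrative only) and applies the same two iloc expressions.

```python
import pandas as pd

# First five rows of the dataset, as shown by dataset.head()
df = pd.DataFrame({
    "YearsExperience": [1.1, 1.3, 1.5, 2.0, 2.2],
    "Salary": [39343.0, 46205.0, 37731.0, 43525.0, 39891.0],
})

X_demo = df.iloc[:, :-1].values  # 2D array: every column except the last
y_demo = df.iloc[:, -1].values   # 1D array: the last column only

print(X_demo.shape)  # (5, 1)
print(y_demo.shape)  # (5,)
```

scikit-learn estimators expect the features as a 2D array, which is why X keeps its column dimension even though there is only one feature.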

In [None]:
print(X)

[[ 1.1]
 [ 1.3]
 [ 1.5]
 [ 2. ]
 [ 2.2]
 [ 2.9]
 [ 3. ]
 [ 3.2]
 [ 3.2]
 [ 3.7]
 [ 3.9]
 [ 4. ]
 [ 4. ]
 [ 4.1]
 [ 4.5]
 [ 4.9]
 [ 5.1]
 [ 5.3]
 [ 5.9]
 [ 6. ]
 [ 6.8]
 [ 7.1]
 [ 7.9]
 [ 8.2]
 [ 8.7]
 [ 9. ]
 [ 9.5]
 [ 9.6]
 [10.3]
 [10.5]]


In [None]:
print(y)

[ 39343.  46205.  37731.  43525.  39891.  56642.  60150.  54445.  64445.
  57189.  63218.  55794.  56957.  57081.  61111.  67938.  66029.  83088.
  81363.  93940.  91738.  98273. 101302. 113812. 109431. 105582. 116969.
 112635. 122391. 121872.]


The dataset has to be split into a training set and a test set. This can be done with the train_test_split function from the model_selection module of the scikit-learn library.


# Splitting Dataset

In [None]:
#Splitting the data into X_train,X_test,y_train,y_test
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=.33,random_state=2)

Now the data set will be divided into X_train,X_test,y_train,y_test based on the test_size we have provided as input.

Here the dataset has 30 observations and test_size is 0.33, i.e. 33% of the observations. The test set therefore gets 0.33 × 30 = 9.9, rounded up to 10 observations, and the training set gets the remaining 20. random_state is set to a fixed value so that the random shuffle, and therefore the split, is reproducible from run to run.
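
A quick check of this arithmetic, assuming (as scikit-learn does for a fractional test_size) that the test count is rounded up and the remaining rows go to the training set:

```python
import math

n_samples = 30
test_size = 0.33

# Assumption: the test count is rounded up, the rest goes to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_test, n_train)  # 10 20
```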

In [None]:
print(X_train)

[[ 2.9]
 [ 9.6]
 [ 4. ]
 [ 2.2]
 [ 3.9]
 [ 5.1]
 [10.3]
 [ 9. ]
 [ 5.3]
 [ 1.5]
 [ 3.2]
 [ 9.5]
 [ 8.7]
 [ 5.9]
 [ 4. ]
 [ 7.9]
 [10.5]
 [ 4.1]
 [ 4.9]
 [ 3.2]]


In [None]:
print(X_test)

[[1.3]
 [1.1]
 [4.5]
 [3.7]
 [7.1]
 [6. ]
 [8.2]
 [3. ]
 [2. ]
 [6.8]]


In [None]:
print(y_train)

[ 56642. 112635.  56957.  39891.  63218.  66029. 122391. 105582.  83088.
  37731.  54445. 116969. 109431.  81363.  55794. 101302. 121872.  57081.
  67938.  64445.]


In [None]:
print (y_test)

[ 46205.  39343.  61111.  57189.  98273.  93940. 113812.  60150.  43525.
  91738.]


random_state seeds the random shuffle that is applied before splitting, so the same value always reproduces the same training and test sets. If we use a different value, such as random_state=47, the dataset will be divided in a different random way.
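
The effect of random_state can be checked with a small sketch. The array data below is a hypothetical stand-in for X (the integers 0 to 29), chosen so that the rows are easy to recognize.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for X: 30 rows, one column
data = np.arange(30).reshape(-1, 1)

# Same seed twice: the two splits are identical
train_a, test_a = train_test_split(data, test_size=0.33, random_state=2)
train_b, test_b = train_test_split(data, test_size=0.33, random_state=2)

# Different seed: the rows are shuffled with a different random sequence
train_c, test_c = train_test_split(data, test_size=0.33, random_state=47)

print(np.array_equal(test_a, test_b))  # True
print(len(train_a), len(test_a))       # 20 10
print(test_a.ravel())
print(test_c.ravel())
```

Fixing the seed is what makes results such as the coefficient and the MSE repeatable across runs of the notebook.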

# Linear Regression

To perform linear regression, the LinearRegression class is imported from the linear_model module of the scikit-learn library. The simple regression model will be an instance of this class.

In [None]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()

#fit: This method is used to train the linear regression model on the training data.
#The linear regression model finds the best-fitting line (or hyperplane in higher dimensions) through the training data by minimizing the sum of
#squared differences between the observed outcomes (y_train) and the outcomes the model predicts for X_train.
#The model computes the coefficients (also called weights) for each feature in X_train that define the relationship between the features and the target variable.
#If there's a single feature, the model fits a line in 2D space (a simple linear regression). If there are multiple features, it fits a hyperplane in multidimensional
#space (a multiple linear regression).
lr.fit(X_train, y_train)
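
To see what fit produces, here is a minimal sketch on made-up data whose answer is known in advance. The names X_toy, y_toy and model are illustrative; the points lie exactly on y = 1 + 2x, so the fitted slope and intercept should recover those values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data on the line y = 1 + 2x
X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])  # 2D: one column per feature
y_toy = np.array([3.0, 5.0, 7.0, 9.0])

model = LinearRegression().fit(X_toy, y_toy)

print(model.coef_)       # slope(s), one entry per feature
print(model.intercept_)  # intercept
print(model.predict(np.array([[5.0]])))  # prediction for x = 5
```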

# Predict

lr.predict(X_test): The predict method is used to generate predictions using the trained model. It takes the test data X_test as input and outputs the predicted values for the target variable.

y_pred: This variable stores the predicted values for the target variable based on the input features X_test. These predictions can then be compared with the actual target values (if available) to evaluate the performance of the model.

In [None]:
#Predicting y from the linear regression model
y_pred = lr.predict(X_test)

np.linspace: This function generates evenly spaced numbers over a specified range.

X.min(): Finds the minimum value in X, the full feature array (training and test rows together).

X.max(): Finds the maximum value in X.

100: Specifies that 100 evenly spaced values should be generated between X.min() and X.max().

x_range: This is a 1D array of 100 values ranging from the minimum to the maximum value of X. This range will be used to predict corresponding y values and plot the regression line.

lr.predict(): This method uses the trained linear regression model (lr) to predict the output values for the given input values.

x_range.reshape(-1, 1): Reshapes x_range into a 2D array with one column and as many rows as needed. This is necessary because the predict method expects a 2D array as input, even if there’s only one feature.

y_range: This array contains the predicted y values corresponding to each x value in x_range. It represents the line of best fit predicted by the model.

go.Figure: Creates a new figure object to hold the plot.

go.Scatter: Each call to go.Scatter adds a trace (a set of data points) to the figure.

First go.Scatter (Training Data):
x=X_train.squeeze(): The X values for the training data, flattened into a 1D array.

y=y_train: The y values for the training data.

name='train': The name of the trace, shown in the plot legend.

mode='markers': Plots the data points as markers (dots) without connecting them with lines.

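Second go.Scatter (Test Data):
x=X_test.squeeze(), y=y_test: The X and y values for the test data, plotted as markers with name='test'.

Third go.Scatter (Prediction Line):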
x=x_range: The evenly spaced x values from the x_range array.

y=y_range: The predicted y values from the linear regression model.

name='prediction': The name of the trace.

This trace will draw the predicted line of best fit across the data.

The training data (X_train, y_train) and the test data (X_test, y_test) are plotted as markers, and the line is the regression line, or line of best fit. Each trace gets its own color, which is identified in the plot legend.
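
The np.linspace, reshape, and squeeze calls described above only change the shape of the data, which a short sketch makes visible. The endpoints 1.1 and 10.5 are the minimum and maximum of YearsExperience reported by describe().

```python
import numpy as np

# 100 evenly spaced values between the min and max of YearsExperience
x_range = np.linspace(1.1, 10.5, 100)

# predict() needs a 2D array: one row per sample, one column per feature
x_column = x_range.reshape(-1, 1)

# squeeze() drops the length-1 dimension again, which is what the plot needs
x_flat = x_column.squeeze()

print(x_range.shape)   # (100,)
print(x_column.shape)  # (100, 1)
print(x_flat.shape)    # (100,)
```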

# ScatterPlot

In [None]:
x_range = np.linspace(X.min(), X.max(), 100)
y_range = lr.predict(x_range.reshape(-1, 1))

fig = go.Figure([
        go.Scatter(x=X_train.squeeze(), y=y_train,
                   name='train', mode='markers'),
        go.Scatter(x=X_test.squeeze(), y=y_test,
                   name='test', mode='markers'),
        go.Scatter(x=x_range, y=y_range,
                   name='prediction')
    ])

fig.show()

# Coefficient and Intercept

To write down the linear regression equation, we need the coefficient and the intercept. Both are available as attributes of the fitted model, as shown below.

In [None]:
#Assigning Coefficient (slope) to b
b = lr.coef_

In [None]:
print("Coefficient  :" , b)

Coefficient  : [9512.94498763]


In [None]:
#Assigning Y-intercept to a
a = lr.intercept_

In [None]:
print("Intercept : ", a)

Intercept :  23707.81324657549
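
As a cross-check, the same slope and intercept can be recovered without scikit-learn. The sketch below uses np.polyfit with degree 1 on the training data printed earlier in this notebook. This is an independent least-squares fit, not what LinearRegression calls internally, but it should agree with the values above up to rounding.

```python
import numpy as np

# Training data as printed earlier in this notebook
x_train = np.array([2.9, 9.6, 4.0, 2.2, 3.9, 5.1, 10.3, 9.0, 5.3, 1.5,
                    3.2, 9.5, 8.7, 5.9, 4.0, 7.9, 10.5, 4.1, 4.9, 3.2])
y_train = np.array([56642., 112635., 56957., 39891., 63218., 66029., 122391.,
                    105582., 83088., 37731., 54445., 116969., 109431., 81363.,
                    55794., 101302., 121872., 57081., 67938., 64445.])

# A degree-1 polynomial fit returns [slope, intercept]
slope, intercept = np.polyfit(x_train, y_train, 1)

print("Slope:", slope)
print("Intercept:", intercept)
```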


# Predicting Unknown Values

For this model, the linear regression equation will be:

Predicted Salary = Coefficient × (years of experience) + Intercept

For 11 years of experience, the predicted salary can be calculated as:

9512.945 × (11) + 23707.813 ≈ 128350.21

y(11) can be predicted with the model as below.


In [None]:
# y_pred = 9512.94498763 × (years of experience) + 23707.81324657549
# Predict the salary for 11 years of experience
print(lr.predict([[11]]))

[128350.20811048]
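
The same prediction can be reproduced by plugging the printed coefficient and intercept into the regression equation. This is a plain-Python sketch; the variable names are illustrative.

```python
coefficient = 9512.94498763    # b, as printed above
intercept = 23707.81324657549  # a, as printed above

years_of_experience = 11
predicted_salary = coefficient * years_of_experience + intercept

print(predicted_salary)
```

The result agrees with lr.predict([[11]]) up to the rounding of the printed coefficient.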


Mean Squared Error (MSE) is one of the regression evaluation metrics. It is calculated as the average squared difference between the predicted values and the actual values. The mathematical equation for MSE is:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

where $n$ is the number of observations, $y_i$ is the actual value, and $\hat{y}_i$ is the predicted value.

MSE can be calculated with the metrics module of the scikit-learn library.

# Evaluation

In [None]:
#Mean Squared Error (MSE)
from sklearn import metrics

In [None]:
print('Mean Squared Error (MSE)  : ', metrics.mean_squared_error(y_test, y_pred))

Mean Squared Error (MSE)  :  60451409.832681164
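
The MSE can also be computed by hand from its definition. The sketch below uses the test rows printed earlier and rebuilds the predictions from the printed coefficient and intercept, so it should match the value above up to rounding. It also takes the square root (RMSE), which is in the same units as Salary and therefore easier to interpret.

```python
import numpy as np

# Test data as printed earlier in this notebook
x_test = np.array([1.3, 1.1, 4.5, 3.7, 7.1, 6.0, 8.2, 3.0, 2.0, 6.8])
y_test = np.array([46205., 39343., 61111., 57189., 98273.,
                   93940., 113812., 60150., 43525., 91738.])

# Predictions from the fitted line: y = a + b * x
y_hat = 9512.94498763 * x_test + 23707.81324657549

mse = np.mean((y_test - y_hat) ** 2)  # average squared error
rmse = np.sqrt(mse)                   # same units as Salary

print("MSE :", mse)
print("RMSE:", rmse)
```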


In [None]:
import statsmodels.api as sm

The Ordinary Least Squares (OLS) estimator can be called from statsmodels.api to get a regression summary.

Purpose: sm.add_constant(X_train) adds a constant (intercept) term to the independent variables (X_train).

Explanation:
sm.add_constant() is a function from the statsmodels library that adds a column of ones to your input data X_train. This is important because, in linear regression, we often include an intercept term (β₀) in the model.

Without this constant, the regression model would be forced through the origin (0,0), which might not be appropriate for your data.

The new variable X_stat now includes the constant term and the original features from X_train.

Purpose: sm.OLS(y_train, X_stat).fit() fits an Ordinary Least Squares (OLS) regression model using the training data.

Explanation:
sm.OLS(y_train, X_stat) creates an OLS regression model object. Here, y_train is the dependent variable (target), and X_stat contains the independent variables (features) along with the added constant.

.fit() fits the model to the data by finding the best-fitting line through the training data. It computes the coefficients (slopes for each feature and the intercept) that minimize the sum of the squared residuals (differences between observed and predicted values).

The result is stored in the variable regsummary. Despite its name, this is the fitted results object, which holds detailed information about the regression and provides the summary() method.

Purpose: regsummary.summary() generates a comprehensive summary of the OLS regression results.

Explanation:
regsummary.summary() produces a detailed table with the regression results, including coefficients, standard errors, t-statistics, p-values, R-squared value, and other statistics that help evaluate the model's performance.

This summary is crucial for understanding the significance of each variable, the overall fit of the model, and diagnosing any potential issues (e.g., multicollinearity, heteroscedasticity).

In [None]:
X_stat = sm.add_constant(X_train)
regsummary = sm.OLS(y_train, X_stat).fit()
regsummary.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.972 |
| Model: | OLS | Adj. R-squared: | 0.97 |
| Method: | Least Squares | F-statistic: | 616.8 |
| Date: | Sun, 18 Aug 2024 | Prob (F-statistic): | 2.23e-15 |
| Time: | 15:31:46 | Log-Likelihood: | -197.13 |
| No. Observations: | 20 | AIC: | 398.3 |
| Df Residuals: | 18 | BIC: | 400.3 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 2.371e+04 | 2468.760 | 9.603 | 0.000 | 1.85e+04 | 2.89e+04 |
| x1 | 9512.9450 | 383.048 | 24.835 | 0.000 | 8708.190 | 1.03e+04 |

| | | | |
|---|---|---|---|
| Omnibus: | 1.535 | Durbin-Watson: | 1.98 |
| Prob(Omnibus): | 0.464 | Jarque-Bera (JB): | 1.193 |
| Skew: | 0.567 | Prob(JB): | 0.551 |
| Kurtosis: | 2.616 | Cond. No. | 14.9 |
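
What sm.add_constant and OLS do can be imitated with plain NumPy. The sketch below prepends a column of ones to the training feature, mirroring the default behaviour of sm.add_constant, and solves the least-squares problem with np.linalg.lstsq. It is an illustration of the idea, not the statsmodels implementation; X_design and beta are illustrative names.

```python
import numpy as np

# Training data as printed earlier in this notebook
x_train = np.array([2.9, 9.6, 4.0, 2.2, 3.9, 5.1, 10.3, 9.0, 5.3, 1.5,
                    3.2, 9.5, 8.7, 5.9, 4.0, 7.9, 10.5, 4.1, 4.9, 3.2])
y_train = np.array([56642., 112635., 56957., 39891., 63218., 66029., 122391.,
                    105582., 83088., 37731., 54445., 116969., 109431., 81363.,
                    55794., 101302., 121872., 57081., 67938., 64445.])

# Column of ones first (the constant), then the feature
X_design = np.column_stack([np.ones_like(x_train), x_train])

# Least-squares solution: beta[0] is the intercept, beta[1] is the slope
beta, *_ = np.linalg.lstsq(X_design, y_train, rcond=None)

print(X_design[:3])
print("const:", beta[0])
print("x1   :", beta[1])
```

The two entries of beta correspond to the const and x1 rows of the coefficient table above.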


The R-squared and adjusted R-squared values can be obtained as below.

Adjusted R-squared (regsummary.rsquared_adj) is a modified version of the R-squared that takes into account the number of predictors (independent variables) in the model. It adjusts for the degrees of freedom, penalizing the addition of non-significant predictors.

R-squared (regsummary.rsquared) is the proportion of the variance in the dependent variable that is explained by the independent variables. A value of 1 means the model explains all of the variance.

In [None]:
print("Adjusted R-Square : ",regsummary.rsquared_adj)
print("R-Square : ",regsummary.rsquared)

Adjusted R-Square :  0.9700678765774073
R-Square :  0.9716432514943859


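If only the R-squared value is needed, r2_score can be imported from the metrics module of the scikit-learn library. Note that the call below scores the training data, which is why it matches the statsmodels R-squared above; passing y_test and y_pred instead would measure the fit on unseen data.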

In [None]:
from sklearn.metrics import r2_score

In [None]:
r2_score(y_train, lr.predict(X_train))

0.9716432514943859
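
r2_score implements R² = 1 − SS_res / SS_tot. As a sketch, the same formula can be applied by hand to the held-out test rows, rebuilding the predictions from the coefficient and intercept printed earlier. This gives the R-squared on unseen data rather than on the training data.

```python
import numpy as np

# Test data as printed earlier in this notebook
x_test = np.array([1.3, 1.1, 4.5, 3.7, 7.1, 6.0, 8.2, 3.0, 2.0, 6.8])
y_test = np.array([46205., 39343., 61111., 57189., 98273.,
                   93940., 113812., 60150., 43525., 91738.])

# Predictions from the fitted line: y = a + b * x
y_hat = 9512.94498763 * x_test + 23707.81324657549

ss_res = np.sum((y_test - y_hat) ** 2)          # unexplained variation
ss_tot = np.sum((y_test - y_test.mean()) ** 2)  # total variation
r2_test = 1 - ss_res / ss_tot

print("Test R-squared:", r2_test)
```

A test R-squared somewhat below the training value is expected, since the model was fitted to the training rows only.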