# Linear Regression

## Steps of LR
- Understand the data and the problem
- Clean the data: missing values, transformations
- Load libraries
- Set up the data: train/test split
- Create the model
- View the model statistics
- Predict; check RMSE and model fit
- Check the assumptions:
  - The relationship between the predictors and the response is linear
  - The residuals are approximately normally distributed
  - There is little or no multicollinearity among the predictors
  - There is little or no autocorrelation in the residuals
  - The residuals are homoscedastic (constant variance)
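
Two of these assumptions can be checked with a few lines of arithmetic. The sketch below is a hand-rolled illustration in plain Python, not a library API: `pearson_r` and `durbin_watson` are hypothetical helper names. A correlation between predictors close to +1 or -1 hints at multicollinearity, and a Durbin-Watson statistic near 2 suggests little first-order autocorrelation in the residuals. The inputs are only the first five mtcars rows (and their residuals from the model fitted later), so the numbers illustrate the calculation and say nothing conclusive about the full model.

```python
import math

def pearson_r(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def durbin_watson(residuals):
    """Durbin-Watson statistic: values near 2 mean little first-order autocorrelation."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Illustration only: first five mtcars rows (wt, hp) and their residuals
wt = [2.62, 2.875, 2.32, 3.215, 3.44]
hp = [110, 110, 93, 110, 175]
resid = [-2.572329, -1.583483, -2.475819, 0.134980, 0.372733]

print("corr(wt, hp):", round(pearson_r(wt, hp), 3))
print("Durbin-Watson:", round(durbin_watson(resid), 3))
```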

In [3]:
#pip install pydataset

### Method-1 : sklearn - linear_model

In [4]:
from sklearn import linear_model as lm
# from sklearn.linear_model import LinearRegression
from pydataset import data
import pandas as pd
import numpy as np

initiated datasets repo at: C:\Users\Hirak\.pydataset/


In [5]:
df = data('mtcars')
df.head()

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2


In [7]:
print(df.columns) #names of col
print(df.shape)  #rows and columns
print(df.describe())  #describe
print(df.dtypes)

Index(['mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec', 'vs', 'am', 'gear',
       'carb'],
      dtype='object')
(32, 11)
             mpg        cyl        disp          hp       drat         wt  \
count  32.000000  32.000000   32.000000   32.000000  32.000000  32.000000   
mean   20.090625   6.187500  230.721875  146.687500   3.596563   3.217250   
std     6.026948   1.785922  123.938694   68.562868   0.534679   0.978457   
min    10.400000   4.000000   71.100000   52.000000   2.760000   1.513000   
25%    15.425000   4.000000  120.825000   96.500000   3.080000   2.581250   
50%    19.200000   6.000000  196.300000  123.000000   3.695000   3.325000   
75%    22.800000   8.000000  326.000000  180.000000   3.920000   3.610000   
max    33.900000   8.000000  472.000000  335.000000   4.930000   5.424000   

            qsec         vs         am       gear     carb  
count  32.000000  32.000000  32.000000  32.000000  32.0000  
mean   17.848750   0.437500   0.406250   3.687500   2.8125  

In [16]:
df1 = df[['mpg','wt','hp']]
df1.head(3)

Unnamed: 0,mpg,wt,hp
Mazda RX4,21.0,2.62,110
Mazda RX4 Wag,21.0,2.875,110
Datsun 710,22.8,2.32,93


Another way to split data into X, y

X, y = df[['wt', 'hp']], df.mpg

In [18]:
#IV and DVs
X = df1[['wt','hp']]  #IV
y = df1['mpg'] #DV
print(X.head(3), '\n', y.head(3))

                  wt   hp
Mazda RX4      2.620  110
Mazda RX4 Wag  2.875  110
Datsun 710     2.320   93 
 Mazda RX4        21.0
Mazda RX4 Wag    21.0
Datsun 710       22.8
Name: mpg, dtype: float64


In [19]:
# Predict mpg on basis of wt & hp
lm1 = lm.LinearRegression()
lm1.fit(X,y)
# y = c + m1* x1 + m2 * x2

LinearRegression()

In [None]:
# To explore the fitted model's attributes and methods,
# type `lm1.` and press the Tab key

In [20]:
# R2 value - coefficient of determination
lm1.score(X,y)
# About 82.7% of the variation in the response variable (y, mpg) is explained by
# the two predictor variables (X: wt and hp) in the model.
# Note: this is R2 on the data used for fitting, not accuracy on new data.

0.8267854518827915

In [21]:
# Coefficients - one per IV, in the column order of X
lm1.coef_
# order: wt, hp

array([-3.87783074, -0.03177295])

In [22]:
# Intercept
lm1.intercept_

37.22727011644721

In [28]:
print(df1.max(axis=0), df1.min(axis=0))

mpg     33.900
wt       5.424
hp     335.000
dtype: float64 mpg    10.400
wt      1.513
hp     52.000
dtype: float64


In [30]:
# few values
print(df1.head(1))
print(X.head(3))
print(X.head(1)) # select this value and put in equation
print(y.head(1)) # this is actual mileage for above car

            mpg    wt   hp
Mazda RX4  21.0  2.62  110
                  wt   hp
Mazda RX4      2.620  110
Mazda RX4 Wag  2.875  110
Datsun 710     2.320   93
             wt   hp
Mazda RX4  2.62  110
Mazda RX4    21.0
Name: mpg, dtype: float64


In [31]:
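# equation, with coefficients rounded from lm1.intercept_ and lm1.coef_:
# mpg = 37.23 - 3.88 * wt - 0.032 * hp
# the hand calculation below rounds more coarsely (37, 3.7, 0.03 and wt = 2.6), so it is only approximate
mpg1A  = 37 - 3.7 * 2.6 - .03 * 110   # Mazda Rx4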
mpg1B = lm1.predict(X.head(1))
print(mpg1A, mpg1B)
print('\n Input Values -', X.head(1), '\n Actual MPG -  ', y.head(1), '\n From Formula - ', mpg1A, '\n From Predict Function -', mpg1B)
print('\n Formula and Predict Function - Almost the same Value')
print('\n Difference Between Actual and Predicted Values - Residuals')

24.08 [23.5723294]

 Input Values -              wt   hp
Mazda RX4  2.62  110 
 Actual MPG -   Mazda RX4    21.0
Name: mpg, dtype: float64 
 From Formula -  24.08 
 From Predict Function - [23.5723294]

 Formula and Predict Function - Almost the same Value

 Difference Between Actual and Predicted Values - Residuals
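
The hand calculation above lands at 24.08 rather than 23.57 because it rounds the coefficients and the weight. As a minimal sketch, plugging in the full-precision values printed by `lm1.intercept_` and `lm1.coef_` reproduces the output of `predict` to several decimal places; `predict_mpg` is a hypothetical helper written for this illustration.

```python
# Intercept and coefficients copied from lm1.intercept_ and lm1.coef_ above
intercept = 37.22727011644721
coef_wt, coef_hp = -3.87783074, -0.03177295

def predict_mpg(wt, hp):
    """Evaluate the fitted plane: mpg = intercept + coef_wt * wt + coef_hp * hp."""
    return intercept + coef_wt * wt + coef_hp * hp

# Mazda RX4: wt = 2.62, hp = 110
mazda_rx4 = predict_mpg(wt=2.62, hp=110)
print(round(mazda_rx4, 4))   # agrees with lm1.predict(X.head(1)) to about four decimals
```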


In [32]:
#predict mpg based on linear equation on input values of X
lm1.predict(X)

array([23.5723294 , 22.58348256, 25.27581872, 21.26502011, 18.32726664,
       20.47381631, 15.59904236, 22.88706734, 21.99367331, 19.97945988,
       19.97945988, 15.72536854, 17.04383099, 16.84993945, 10.35520459,
        9.36273257,  9.19248699, 26.59902798, 29.31238022, 28.04620915,
       24.58644148, 18.81136386, 19.14097947, 14.55202805, 16.75674519,
       27.62665313, 26.03737415, 27.76976919, 16.54648866, 20.92541324,
       12.73947713, 22.98364943])

In [35]:
print(len(lm1.predict(X)))

32


In [36]:
df1.shape

(32, 3)

In [37]:
X.shape

(32, 2)

In [34]:
# see them together
np.set_printoptions(formatter={'float': lambda x: "{0:0.1f}".format(x)})
np.column_stack((X,y, lm1.predict(X).round(2), y-lm1.predict(X)))
# columns: wt - hp - mpg - predicted mpg - residual

array([[2.6, 110.0, 21.0, 23.6, -2.6],
       [2.9, 110.0, 21.0, 22.6, -1.6],
       [2.3, 93.0, 22.8, 25.3, -2.5],
       [3.2, 110.0, 21.4, 21.3, 0.1],
       [3.4, 175.0, 18.7, 18.3, 0.4],
       [3.5, 105.0, 18.1, 20.5, -2.4],
       [3.6, 245.0, 14.3, 15.6, -1.3],
       [3.2, 62.0, 24.4, 22.9, 1.5],
       [3.1, 95.0, 22.8, 22.0, 0.8],
       [3.4, 123.0, 19.2, 20.0, -0.8],
       [3.4, 123.0, 17.8, 20.0, -2.2],
       [4.1, 180.0, 16.4, 15.7, 0.7],
       [3.7, 180.0, 17.3, 17.0, 0.3],
       [3.8, 180.0, 15.2, 16.9, -1.6],
       [5.2, 205.0, 10.4, 10.4, 0.0],
       [5.4, 215.0, 10.4, 9.4, 1.0],
       [5.3, 230.0, 14.7, 9.2, 5.5],
       [2.2, 66.0, 32.4, 26.6, 5.8],
       [1.6, 52.0, 30.4, 29.3, 1.1],
       [1.8, 65.0, 33.9, 28.1, 5.9],
       [2.5, 97.0, 21.5, 24.6, -3.1],
       [3.5, 150.0, 15.5, 18.8, -3.3],
       [3.4, 150.0, 15.2, 19.1, -3.9],
       [3.8, 245.0, 13.3, 14.6, -1.3],
       [3.8, 175.0, 19.2, 16.8, 2.4],
       [1.9, 66.0, 27.3, 27.6, -0.3],
       [2

In [51]:
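# residuals of an OLS fit with an intercept sum to (approximately) zero;
# only the last expression in a cell is displayed, so this value is not shown
sum(y-lm1.predict(X))
# squared residuals would be (y - lm1.predict(X))**2  (** is power; ^ is bitwise XOR in Python)
from math import pow
# math.pow works on scalars only, so it cannot be applied to a whole Series
#int(pow((y - lm1.predict(X)),2))
pow(5,2)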

25.0

In [38]:
# MSE and RMSE
import sklearn.metrics
import math
mse = sklearn.metrics.mean_squared_error(y, lm1.predict(X))  # mean squared error
print("MSE:", mse, "RMSE:", round(math.sqrt(mse),2))

MSE: 6.0952423356708145 RMSE: 2.47


In [40]:
# MSE (take the square root to get RMSE)
from sklearn.metrics import mean_squared_error
print('MSE',  mean_squared_error(y, lm1.predict(X)))

MSE 6.0952423356708145


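RMSE (root mean squared error) is the square root of the value returned by the mean squared error function. It measures the typical difference between the predicted and actual values of the response, in the same units as the response (here, mpg).
RMSE is a convenient measure of how well the model predicts: the lower, the better.
Because RMSE depends on the scale of the response, there is no universal cutoff (such as 180) for a good score. Judge it against the spread of the response instead: here the RMSE of 2.47 mpg is well below the standard deviation of mpg (6.03). If the RMSE is too high for the problem at hand, consider feature selection and, for models that have hyper-parameters, tuning them.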
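
The relationship between MSE, RMSE and the sum of squared residuals can be checked by hand. This is a minimal sketch in plain Python that starts from the MSE printed above; `rmse_from_values` is a hypothetical helper, and the residual standard error line assumes three estimated parameters (intercept, wt, hp).

```python
import math

n = 32                       # observations in mtcars
k = 3                        # estimated parameters: intercept, wt, hp (assumption stated above)
mse = 6.0952423356708145     # value printed by mean_squared_error above

sse = mse * n                    # sum of squared residuals
rmse = math.sqrt(mse)            # root mean squared error, in mpg
rse = math.sqrt(sse / (n - k))   # residual standard error: divides by n - k instead of n

def rmse_from_values(actual, predicted):
    """RMSE computed directly from paired actual and predicted values."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

print(round(sse, 2), round(rmse, 2), round(rse, 2))
```

The residual standard error is slightly larger than the RMSE because it corrects for the parameters that were estimated from the same data.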

### Some Functions Still Missing
scikit-learn offers few built-in functions for summarizing a regression model, since it is mainly used for prediction.
So far we still don't know the overall F-statistic of the model, the p-values of the individual regression coefficients, or other metrics that help us judge how well the model fits the dataset. statsmodels provides these.

### Method-2 : statsmodels - OLS

In [52]:
import statsmodels.api as sm

In [53]:
#define response variable
y = df['mpg']

#define predictor variables
x = df[['wt', 'hp']]
print(x.head(), '\n', y.head())

                      wt   hp
Mazda RX4          2.620  110
Mazda RX4 Wag      2.875  110
Datsun 710         2.320   93
Hornet 4 Drive     3.215  110
Hornet Sportabout  3.440  175 
 Mazda RX4            21.0
Mazda RX4 Wag        21.0
Datsun 710           22.8
Hornet 4 Drive       21.4
Hornet Sportabout    18.7
Name: mpg, dtype: float64


In [54]:
# another way to show two objects side by side, as columns
pd.concat([x.head(), y.head()], axis=1)

Unnamed: 0,wt,hp,mpg
Mazda RX4,2.62,110,21.0
Mazda RX4 Wag,2.875,110,21.0
Datsun 710,2.32,93,22.8
Hornet 4 Drive,3.215,110,21.4
Hornet Sportabout,3.44,175,18.7


In [57]:
#add constant to predictor variables
x = sm.add_constant(x)
#fit linear regression model
lm2 = sm.OLS(y, x).fit()

In [56]:
x.head()

Unnamed: 0,const,wt,hp
Mazda RX4,1.0,2.62,110
Mazda RX4 Wag,1.0,2.875,110
Datsun 710,1.0,2.32,93
Hornet 4 Drive,1.0,3.215,110
Hornet Sportabout,1.0,3.44,175


In [58]:
#view model summary - Very important output
print(lm2.summary())

                            OLS Regression Results                            
Dep. Variable:                    mpg   R-squared:                       0.827
Model:                            OLS   Adj. R-squared:                  0.815
Method:                 Least Squares   F-statistic:                     69.21
Date:                Sat, 25 Feb 2023   Prob (F-statistic):           9.11e-12
Time:                        22:52:06   Log-Likelihood:                -74.326
No. Observations:                  32   AIC:                             154.7
Df Residuals:                      29   BIC:                             159.0
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         37.2273      1.599     23.285      0.0

#### Output Summary
p-values for each individual predictor variable:
- wt: .000 < .05, significant
- hp: .001 < .05, significant

Overall F-statistic:
- p-value 9.11e-12 < .05: the model is significant; at least one IV predicts the DV to some extent

AIC value:
- 154.7: when comparing models fitted to the same data, the lower the AIC, the better the model

R2:
- R2 = .827; Adj R2 = .815. Adjusted R2 penalizes the number of predictors, so it is the one to report for multiple regression
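
Several figures in the summary follow from R2, the log-likelihood and the sample size alone. The sketch below recomputes them with the standard formulas, taking R2 from `lm1.score` and the log-likelihood from the summary table. Counting the constant as a parameter in AIC and BIC is an assumption here, chosen because it reproduces the reported values.

```python
import math

n, p = 32, 2                  # observations and predictors (wt, hp)
r2 = 0.8267854518827915       # from lm1.score(X, y)
log_likelihood = -74.326      # from the summary table

# Adjusted R2 and the overall F-statistic, both derived from R2
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
f_stat = (r2 / p) / ((1 - r2) / (n - p - 1))

# Information criteria; k counts two slopes plus the constant (assumption)
k = p + 1
aic = 2 * k - 2 * log_likelihood
bic = k * math.log(n) - 2 * log_likelihood

print(round(adj_r2, 3), round(f_stat, 2), round(aic, 1), round(bic, 1))
```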

In [59]:
# Residuals : Actual (DV) - Predicted
lm2.resid.head()

Mazda RX4           -2.572329
Mazda RX4 Wag       -1.583483
Datsun 710          -2.475819
Hornet 4 Drive       0.134980
Hornet Sportabout    0.372733
dtype: float64
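
The sign convention is easy to verify by hand. A minimal sketch using the first five actual and predicted values copied from the outputs above: subtracting the prediction from the actual value reproduces `lm2.resid`, so a negative residual means the model over-predicted.

```python
# First five rows, copied from the outputs above
actual = [21.0, 21.0, 22.8, 21.4, 18.7]
predicted = [23.5723294, 22.58348256, 25.27581872, 21.26502011, 18.32726664]

# Residual = actual - predicted
residuals = [a - p for a, p in zip(actual, predicted)]
for r in residuals:
    print(round(r, 6))
```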

In [60]:
#pip install bioinfokit
#from bioinfokit import visuz
import seaborn as sns
import matplotlib.pyplot as plt

# add predicted values and residuals as new columns
# (.loc["predicted"] would add a row, not a column; df1 is a slice of df, so copy it first)
df1 = df1.copy()
df1["predicted"] = lm2.predict(x)
df1["residuals"] = lm2.resid
print(df1.head())

                    mpg     wt   hp  predicted  residuals
Mazda RX4          21.0  2.620  110  23.572329  -2.572329
Mazda RX4 Wag      21.0  2.875  110  22.583483  -1.583483
Datsun 710         22.8  2.320   93  25.275819  -2.475819
Hornet 4 Drive     21.4  3.215  110  21.265020   0.134980
Hornet Sportabout  18.7  3.440  175  18.327267   0.372733


In [62]:
#sns.scatterplot(data=df1, x="predicted", y="residuals")
#plt.axhline(y=0)

In [None]:
### Some Terms
- Best Fit - the straight line through a plot that minimizes the distance between the line and the scattered data points
- Coefficient - also known as a parameter; the factor that multiplies a variable. In linear regression, a coefficient is the expected change in the response variable for a one-unit change in that predictor, holding the other predictors constant
- Coefficient of Determination - R2, the proportion of the variation in the response that the model explains. In simple linear regression it equals the square of the correlation coefficient. It describes the degree of fit
- Correlation - the strength and direction of the linear association between two variables. The values range from -1.0 to 1.0
- Dependent Feature - the variable represented as y in the line equation y=ax+b. Also referred to as an output or a response
- Estimated Regression Line - the straight line that best fits a set of scattered data points
- Independent Feature - the variable represented by x in the line equation y=ax+b. Also referred to as an input or a predictor
- Intercept - the point where the regression line crosses the y-axis, indicated by b in the line equation y=ax+b
- Least Squares - a method for finding the best fit to data by minimizing the sum of the squared differences between observed and estimated values
- Mean - the average of a group of numbers. In linear regression, the mean of the response at given predictor values is modeled as a linear function of the predictors
- OLS (Ordinary Least Squares) - the most common method of estimating a linear regression; the two terms are often used interchangeably
- Residual - the vertical distance between a data point and the regression line (actual minus predicted)
- Regression - an estimate of how a variable is expected to change in relation to changes in other variables
- Regression Model - the fitted equation that relates the response to the predictors
- Response Variables - covers both the predicted response (the value predicted by the regression) and the actual response (the observed value of the data point)
- Slope - the steepness of a regression line. The linear relationship between two variables can be described using slope and intercept: y=ax+b
- Simple Linear Regression - a linear regression with a single independent variable
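
Several of these terms (slope, intercept, least squares, residual) come together in the closed-form solution for simple linear regression. The sketch below is a plain-Python illustration; `fit_simple_ols` is a hypothetical helper, and the data are only the wt and mpg values of the first five mtcars rows, so the fitted line is not the model estimated earlier.

```python
def fit_simple_ols(x, y):
    """Least-squares slope a and intercept b for the line y = a*x + b."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    a = sxy / sxx              # slope
    b = mean_y - a * mean_x    # intercept: the line passes through the point of means
    return a, b

# Illustration only: wt and mpg for the first five mtcars rows
wt = [2.62, 2.875, 2.32, 3.215, 3.44]
mpg = [21.0, 21.0, 22.8, 21.4, 18.7]

a, b = fit_simple_ols(wt, mpg)
residuals = [yi - (a * xi + b) for xi, yi in zip(wt, mpg)]
print("slope:", round(a, 3), "intercept:", round(b, 3))
print("sum of residuals is close to zero:", abs(sum(residuals)) < 1e-9)
```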

In [64]:
# Split the data into train and test sets
from sklearn.model_selection import train_test_split

In [65]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
print('Train Set ', pd.concat([X_train, y_train], axis=1).shape)
print('Test Set ', pd.concat([X_test, y_test], axis=1).shape)

Train Set  (24, 3)
Test Set  (8, 3)
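
What the split does can be sketched in plain Python. This is a hand-rolled analogue, not scikit-learn's implementation; `split_indices` and its `seed` parameter are hypothetical names. In scikit-learn itself, passing `random_state` to `train_test_split` plays the role of the seed and makes the split reproducible from run to run.

```python
import math
import random

def split_indices(n_rows, test_size=0.25, seed=0):
    """Shuffle row positions and cut them into train and test lists."""
    n_test = math.ceil(n_rows * test_size)
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)   # fixed seed: same split on every run
    return positions[n_test:], positions[:n_test]

train_idx, test_idx = split_indices(32, test_size=0.25, seed=0)
print(len(train_idx), len(test_idx))   # 24 8
```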


In [None]:
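# lm is the linear_model module, so create an estimator before fitting
lm3 = lm.LinearRegression()
lm3.fit(X_train, y_train)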