

# Multiple Linear Regression

**Program to implement the multiple linear regression technique on a standard, publicly available dataset and evaluate its performance.**



The description for all the columns containing data for air pollutants, temperature, relative humidity and absolute humidity is provided below.


|Columns|Description|
|-|-|
|PT08.S1(CO)|PT08.S1 (tin oxide) hourly averaged sensor response (nominally $\text{CO}$ targeted)|
|C6H6(GT)|True hourly averaged Benzene concentration in $\frac{\mu g}{m^3}$|
|PT08.S2(NMHC)|PT08.S2 (titania) hourly averaged sensor response (nominally $\text{NMHC}$ targeted)|
|PT08.S3(NOx)|PT08.S3 (tungsten oxide) hourly averaged sensor response (nominally $\text{NO}_x$ targeted)|
|PT08.S4(NO2)|PT08.S4 (tungsten oxide) hourly averaged sensor response (nominally $\text{NO}_2$ targeted)|
|PT08.S5(O3)|PT08.S5 (indium oxide) hourly averaged sensor response (nominally $\text{O}_3$ targeted)|
|T|Temperature in °C|
|RH|Relative Humidity (%)|
|AH|Absolute Humidity|

---

#### Multiple Linear Regression Model Using `sklearn` Module


In [None]:
# Load the dataset and display its first five rows. The GitHub link is as follows:
# https://raw.githubusercontent.com/jiss-sngce/air/main/airquality.csv.csv
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/jiss-sngce/air/main/airquality.csv.csv')
df.head()

| |DateTime|PT08.S1(CO)|C6H6(GT)|PT08.S2(NMHC)|PT08.S3(NOx)|PT08.S4(NO2)|PT08.S5(O3)|T|RH|AH|Year|Month|Day|Day Name|
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|0|2004-03-10 18:00:00|1360.0|11.9|1046.0|1056.0|1692.0|1268.0|13.6|48.9|0.7578|2004|3|10|Wednesday|
|1|2004-03-10 19:00:00|1292.0|9.4|955.0|1174.0|1559.0|972.0|13.3|47.7|0.7255|2004|3|10|Wednesday|
|2|2004-03-10 20:00:00|1402.0|9.0|939.0|1140.0|1555.0|1074.0|11.9|54.0|0.7502|2004|3|10|Wednesday|
|3|2004-03-10 21:00:00|1376.0|9.2|948.0|1092.0|1584.0|1203.0|11.0|60.0|0.7867|2004|3|10|Wednesday|
|4|2004-03-10 22:00:00|1272.0|6.5|836.0|1205.0|1490.0|1110.0|11.2|59.6|0.7888|2004|3|10|Wednesday|


In [None]:
# Display the columns in the DataFrame
df.columns

Index(['DateTime', 'PT08.S1(CO)', 'C6H6(GT)', 'PT08.S2(NMHC)', 'PT08.S3(NOx)',
       'PT08.S4(NO2)', 'PT08.S5(O3)', 'T', 'RH', 'AH', 'Year', 'Month', 'Day',
       'Day Name'],
      dtype='object')

In [None]:
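# Note: `df.info` without parentheses only references the method, so the notebook
# displays the bound method (including the DataFrame repr) shown below.
# Call `df.info()` to print the column dtypes and non-null counts instead.
df.info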

<bound method DataFrame.info of                  DateTime  PT08.S1(CO)  C6H6(GT)  ...  Month  Day   Day Name
0     2004-03-10 18:00:00       1360.0      11.9  ...      3   10  Wednesday
1     2004-03-10 19:00:00       1292.0       9.4  ...      3   10  Wednesday
2     2004-03-10 20:00:00       1402.0       9.0  ...      3   10  Wednesday
3     2004-03-10 21:00:00       1376.0       9.2  ...      3   10  Wednesday
4     2004-03-10 22:00:00       1272.0       6.5  ...      3   10  Wednesday
...                   ...          ...       ...  ...    ...  ...        ...
9352  2005-04-04 10:00:00       1314.0      13.5  ...      4    4     Monday
9353  2005-04-04 11:00:00       1163.0      11.4  ...      4    4     Monday
9354  2005-04-04 12:00:00       1142.0      12.4  ...      4    4     Monday
9355  2005-04-04 13:00:00       1003.0       9.5  ...      4    4     Monday
9356  2005-04-04 14:00:00       1071.0      11.9  ...      4    4     Monday

[9357 rows x 14 columns]>

In [None]:
# Select the features for the sklearn linear regression model: every column except DateTime, Year, Month, Day, Day Name and RH (RH is the target).
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

features = list(df.columns.values[1:-4])

features.remove('RH')
print(features)

X = df[features]
y = df['RH']

['PT08.S1(CO)', 'C6H6(GT)', 'PT08.S2(NMHC)', 'PT08.S3(NOx)', 'PT08.S4(NO2)', 'PT08.S5(O3)', 'T', 'AH']
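
The positional slice `[1:-4]` in the cell above depends on the column order. The sketch below works on the column list printed by `df.columns` and shows a name-based selection that yields the same features; the `excluded` set and the variable names are illustrative and not part of the original notebook.

```python
# Column names copied from the `df.columns` output above.
columns = ['DateTime', 'PT08.S1(CO)', 'C6H6(GT)', 'PT08.S2(NMHC)', 'PT08.S3(NOx)',
           'PT08.S4(NO2)', 'PT08.S5(O3)', 'T', 'RH', 'AH', 'Year', 'Month', 'Day',
           'Day Name']

# Positional selection, as in the notebook cell: drop the first column
# and the last four, then remove the target 'RH'.
positional = columns[1:-4]
positional.remove('RH')

# Name-based selection (hypothetical alternative): list the exclusions explicitly.
excluded = {'DateTime', 'Year', 'Month', 'Day', 'Day Name', 'RH'}
by_name = [name for name in columns if name not in excluded]

print(by_name == positional)  # both approaches select the same 8 features
```

With the real DataFrame, `X = df[by_name]` would then give the same feature matrix, and the selection would keep working if columns were reordered.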


In [None]:
# Splitting the DataFrame into the train and test sets.
# Test set will have 33% of the values.

XTrain, XTest, yTrain, yTest = train_test_split(X, y, test_size = 0.33, random_state = 42)
yTrainReshaped = yTrain.values.reshape(-1,1)
yTestReshaped = yTest.values.reshape(-1,1)

# yTrainReshaped.shape
# XTrain.shape
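
To check how many rows end up in each set, the split can be reproduced on placeholder arrays of the same size. This is a minimal sketch: `X_demo` and `y_demo` are zero-filled stand-ins invented for illustration, with the row count taken from the DataFrame repr above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 9357                      # row count reported for the dataset above
X_demo = np.zeros((n_rows, 8))     # placeholder features (8 columns, like X)
y_demo = np.zeros(n_rows)          # placeholder target

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

# The test-set size is rounded up; the remaining rows go to the training set.
print(X_tr.shape, X_te.shape)
```

Under these assumptions the split should leave 6269 rows for training and 3088 rows for testing.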

In [None]:
# Build a linear regression model using the 'sklearn.linear_model' module.

linReg = LinearRegression()
linReg.fit(XTrain, yTrainReshaped)

# Print the value of the intercept.

print('INTERCEPT--> ', linReg.intercept_)


INTERCEPT-->  [34.57305436]
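
The intercept prints as a one-element array because the target was reshaped to a column before fitting. The sketch below uses synthetic data invented for illustration to show how the shape of `y` changes the shapes of `intercept_` and `coef_`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data: y = 1 + 2*x0 + 3*x1 (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 1.0 + 2.0 * X_demo[:, 0] + 3.0 * X_demo[:, 1]

model_1d = LinearRegression().fit(X_demo, y_demo)                  # 1-D target
model_2d = LinearRegression().fit(X_demo, y_demo.reshape(-1, 1))   # column target

# 1-D target: scalar intercept, coef_ of shape (n_features,).
print(np.shape(model_1d.intercept_), model_1d.coef_.shape)
# Column target: intercept_ of shape (1,), coef_ of shape (1, n_features).
print(np.shape(model_2d.intercept_), model_2d.coef_.shape)
```

Either form fits the same model; passing the 1-D `yTrain` directly would simply give a scalar intercept and a flat coefficient array.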


In [None]:
# Print the names of the features along with the values of their corresponding coefficients.
# coef_ is 2-D here (one row of 8 values) because the target was reshaped to a column, hence coef_[0].

print("COEFFICIENT--> ",linReg.coef_)
for name, coef in zip(X.columns, linReg.coef_[0]):
  print(name," --> ", coef)

COEFFICIENT-->  [[ 1.03230559e-02 -4.69470990e-01 -4.20228581e-04  6.66680494e-04
   9.24589119e-03 -4.54822557e-04 -2.38521173e+00  3.76150339e+01]]
PT08.S1(CO)  -->  0.01032305594357359
C6H6(GT)  -->  -0.46947099047417595
PT08.S2(NMHC)  -->  -0.0004202285808548034
PT08.S3(NOx)  -->  0.0006666804942972449
PT08.S4(NO2)  -->  0.009245891185224564
PT08.S5(O3)  -->  -0.0004548225571515685
T  -->  -2.385211734787753
AH  -->  37.615033910204126
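
A fitted multiple linear regression predicts the target as the intercept plus the sum of each coefficient times its feature value. As a rough check, the sketch below applies that formula by hand to the first row shown by `df.head()`, using the intercept and coefficients printed above; the dictionary names are illustrative.

```python
# Intercept and coefficients copied from the printed output above.
intercept = 34.57305436
coefficients = {
    'PT08.S1(CO)': 0.01032305594357359,
    'C6H6(GT)': -0.46947099047417595,
    'PT08.S2(NMHC)': -0.0004202285808548034,
    'PT08.S3(NOx)': 0.0006666804942972449,
    'PT08.S4(NO2)': 0.009245891185224564,
    'PT08.S5(O3)': -0.0004548225571515685,
    'T': -2.385211734787753,
    'AH': 37.615033910204126,
}

# Feature values of the first row displayed by df.head().
first_row = {
    'PT08.S1(CO)': 1360.0, 'C6H6(GT)': 11.9, 'PT08.S2(NMHC)': 1046.0,
    'PT08.S3(NOx)': 1056.0, 'PT08.S4(NO2)': 1692.0, 'PT08.S5(O3)': 1268.0,
    'T': 13.6, 'AH': 0.7578,
}
actual_rh = 48.9  # RH recorded for that row

# y_hat = intercept + sum(coefficient_i * feature_i)
predicted_rh = intercept + sum(
    coefficients[name] * first_row[name] for name in coefficients
)
print(f"predicted RH: {predicted_rh:.2f}, actual RH: {actual_rh}")
```

For this row the formula gives a relative humidity of roughly 54.4 against a recorded 48.9. Whether the row fell into the training or the test set is not known from the notebook output.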


In [None]:
# Evaluate the linear regression model using the 'r2_score', 'mean_squared_error' & 'mean_absolute_error' functions of the 'sklearn' module.
from sklearn.metrics import r2_score, mean_squared_error,mean_absolute_error
import numpy as np

yTrainPred = linReg.predict(XTrain)
yTestPred = linReg.predict(XTest)

print('TRAIN DATA SET')
print('R2-SCORE --> ',r2_score(yTrainReshaped, yTrainPred))
print('MEAN SQUARED ERROR --> ',mean_squared_error(yTrainReshaped, yTrainPred))
# np.sqrt(mean_squared_error(yTrainReshaped, yTrainPred))
print('MEAN ABSOLUTE ERROR --> ',mean_absolute_error(yTrainReshaped, yTrainPred))

print("\n-----------------------------------------------------------------------------------------------")

print('\nTEST DATA SET')
print('R2-SCORE --> ',r2_score(yTestReshaped, yTestPred))
print('MEAN SQUARED ERROR --> ',mean_squared_error(yTestReshaped, yTestPred))
print('MEAN ABSOLUTE ERROR --> ',mean_absolute_error(yTestReshaped, yTestPred))

TRAIN DATA SET
R2-SCORE -->  0.8697289778155132
MEAN SQUARED ERROR -->  37.67070677136997
MEAN ABSOLUTE ERROR -->  4.688976086075844

-----------------------------------------------------------------------------------------------

TEST DATA SET
R2-SCORE -->  0.871930547463685
MEAN SQUARED ERROR -->  36.63938906768081
MEAN ABSOLUTE ERROR -->  4.6619032116666705
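
To make the three metrics concrete, the sketch below computes them from their definitions in plain Python on a tiny toy example, then converts the mean squared errors reported above into root mean squared errors, which are in the same units as RH. The function name `regression_metrics` and the toy values are invented for illustration; the notebook itself uses `sklearn.metrics`.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return R2, MSE, MAE and RMSE computed from their definitions."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)              # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    mse = ss_res / n
    mae = sum(abs(r) for r in residuals) / n
    return {"r2": 1 - ss_res / ss_tot, "mse": mse, "mae": mae, "rmse": math.sqrt(mse)}

# Toy values (not from the air-quality data).
toy_true = [3.0, 5.0, 7.0, 9.0]
toy_pred = [2.5, 5.5, 7.0, 8.0]
metrics = regression_metrics(toy_true, toy_pred)
print(metrics)

# RMSE derived from the MSE values printed above.
train_rmse = math.sqrt(37.67070677136997)
test_rmse = math.sqrt(36.63938906768081)
print(f"train RMSE: {train_rmse:.2f}, test RMSE: {test_rmse:.2f}")
```

The train and test scores reported above are close to each other, so these results show no sign of the model fitting the training set markedly better than unseen data.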
