<a href="https://colab.research.google.com/github/Ashuto7h/ML-basics/blob/main/Multiple_Linear_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Multiple Linear Regression

In most cases, a prediction depends on more than one feature. Multiple linear regression fits a linear model to data points in a multi-dimensional feature space.


## About the Dataset
The data used in this tutorial comes from the UCI Machine Learning Repository.

Link: https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant

The dataset contains 9568 data points collected from a Combined Cycle Power Plant over 6 years (2006-2011), when the power plant was set to work with full load. The features are the hourly average ambient variables Temperature (AT), Ambient Pressure (AP), Relative Humidity (RH) and Exhaust Vacuum (V), which are used to __predict the net hourly electrical energy output (PE)__ of the plant.

#### Attribute Information:

The features are hourly averages of the ambient variables, and the last item is the target:
- Temperature (AT), in the range 1.81°C to 37.11°C
- Ambient Pressure (AP), in the range 992.89 to 1033.30 millibar
- Relative Humidity (RH), in the range 25.56% to 100.16%
- Exhaust Vacuum (V), in the range 25.36 to 81.56 cm Hg
- Net hourly electrical energy output (PE), in the range 420.26 to 495.76 MW

The averages are taken from various sensors located around the plant that record the ambient variables every second. The variables are given without normalization. 

In [None]:
import pandas
path = r"C:\Users\Home\jupyter works\CSV\CCPP_data_set.csv"
dataframe = pandas.read_csv(path)
dataframe.info()
dataframe.head(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9568 entries, 0 to 9567
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   AT      9568 non-null   float64
 1   V       9568 non-null   float64
 2   AP      9568 non-null   float64
 3   RH      9568 non-null   float64
 4   PE      9568 non-null   float64
dtypes: float64(5)
memory usage: 373.9 KB


|   | AT | V | AP | RH | PE |
|---|---|---|---|---|---|
| 0 | 14.96 | 41.76 | 1024.07 | 73.17 | 463.26 |
| 1 | 25.18 | 62.96 | 1020.04 | 59.08 | 444.37 |
| 2 | 5.11 | 39.4 | 1012.16 | 92.14 | 488.56 |
| 3 | 20.86 | 57.32 | 1010.24 | 76.64 | 446.48 |
| 4 | 10.82 | 37.5 | 1009.23 | 96.62 | 473.9 |
| 5 | 26.27 | 59.44 | 1012.23 | 58.77 | 443.67 |
| 6 | 15.89 | 43.96 | 1014.02 | 75.24 | 467.35 |
| 7 | 9.48 | 44.71 | 1019.12 | 66.43 | 478.42 |
| 8 | 14.64 | 45.0 | 1021.78 | 41.25 | 475.98 |
| 9 | 11.74 | 43.56 | 1015.14 | 70.72 | 477.5 |


In [None]:
# extracting independent and dependent variables
x = dataframe.iloc[:,:-1].values  # all columns except the last, selected by position
y = dataframe.loc[:,"PE"].values  # the PE column, selected by label
print(x[0],"\n",y[0])

[  14.96   41.76 1024.07   73.17] 
 463.26


### Feature Scaling
Multiple linear regression does not need feature scaling either. Ordinary least squares absorbs each feature's scale into its coefficient, so the predictions are the same with or without scaling.
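
The claim about scaling can be checked numerically. The sketch below is only an illustration: it uses synthetic data (not the CCPP set) and NumPy's least-squares solver as a stand-in for `LinearRegression`, and the helper name `fit_predict` is hypothetical. It fits the same model on raw and on standardized features and compares the predictions.

```python
import numpy as np

def fit_predict(features, target):
    """Fit least squares with an intercept and return in-sample predictions."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return design @ coef

rng = np.random.default_rng(0)
# Synthetic features on very different scales (made-up data for illustration)
features = rng.normal(size=(200, 2)) * np.array([1.0, 1000.0]) + np.array([20.0, 1000.0])
target = 450.0 - 2.0 * features[:, 0] + 0.05 * features[:, 1] + rng.normal(size=200)

# Standardize every column to zero mean and unit variance
scaled = (features - features.mean(axis=0)) / features.std(axis=0)

pred_raw = fit_predict(features, target)
pred_scaled = fit_predict(scaled, target)

# Largest difference between the two sets of predictions
max_gap = float(np.max(np.abs(pred_raw - pred_scaled)))
print(max_gap)
```

The gap should be at the level of floating-point noise. The coefficients themselves do change with scaling; only the predictions stay the same.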

In [None]:
#  splitting data
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size = 0.2, random_state=0)

# fitting model to training set
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x_train,y_train)

LinearRegression()

In [None]:
# predictions
y_pred = model.predict(x_test)
print("actual     predicted")
for i in range(0,5):
    print("{:.2f}     {:.2f}".format(y_test[i], y_pred[i]))

actual     predicted
431.23     431.43
460.01     458.56
461.14     462.75
445.90     448.60
451.29     457.87


In [None]:
# Scores
r_sq = model.score(x,y)
r_sq_train = model.score(x_train,y_train)
r_sq_test = model.score(x_test,y_test)
print("r_sq : ",r_sq, r_sq_train, r_sq_test)

# 1 - r_sq is the fraction of the variance in y that the model leaves unexplained

print('intercept :', model.intercept_ )
print('slope :', model.coef_)


r_sq :  0.9286947104407257 0.9277253998587902 0.9325315554761303
intercept : 452.8410371616384
slope : [-1.97313099 -0.23649993  0.06387891 -0.15807019]


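The `r_sq` values show that the model explains about 92.87% of the variance in PE on the whole dataset, 92.77% on the training set and 93.25% on the test set. Note that R² measures explained variance, not accuracy in the classification sense.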
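
For reference, the score returned by `model.score` is the coefficient of determination, R² = 1 - SS_res / SS_tot. A minimal sketch of that formula in plain Python follows; the helper name `r_squared` is hypothetical, and the inputs are the five test-set rows printed earlier, so the result will differ from the score on the full test set.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    # Sum of squared residuals between the actual and predicted values
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total sum of squares around the mean of the actual values
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# The five actual/predicted pairs printed in the predictions cell
actual = [431.23, 460.01, 461.14, 445.90, 451.29]
predicted = [431.43, 458.56, 462.75, 448.60, 457.87]

sample_r_sq = r_squared(actual, predicted)
print(sample_r_sq)
```

A perfect prediction gives 1, and always predicting the mean of `actual` gives 0.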

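The equation of the dependent variable is
<br>
$y = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_n X_n$
<br>
where $b_0$ is the intercept (`model.intercept_`) and
<br>
the coefficients $b_1, b_2, \dots, b_n$ are returned as an array (`model.coef_`).
<br>
Predictions are made by substituting the feature values into this equation.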
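
As a rough illustration of that substitution, the sketch below plugs the first data row into the equation by hand. It assumes the intercept and coefficients copied from the printed output above (the coefficients are rounded to eight decimals there) and that they follow the column order AT, V, AP, RH. The helper name `predict` is hypothetical.

```python
# Values copied from the printed output; the coefficients are rounded
intercept = 452.8410371616384
coefficients = [-1.97313099, -0.23649993, 0.06387891, -0.15807019]  # AT, V, AP, RH

def predict(features):
    """Evaluate y = b0 + b1*X1 + b2*X2 + ... for one row of features."""
    return intercept + sum(b * x for b, x in zip(coefficients, features))

first_row = [14.96, 41.76, 1024.07, 73.17]  # x[0]; the measured PE is 463.26
manual_prediction = predict(first_row)
print(round(manual_prediction, 2))
```

Up to the rounding of the coefficients, this should agree with `model.predict(x[:1])`.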

# Backward Elimination Method

Backward elimination is a feature selection technique that removes features which do not contribute significantly to the prediction.

Steps:
- Select a significance level (SL), generally SL = 0.05
- Fit the model with all possible predictors
- Find the p-value of each predictor
- If the highest p-value is greater than SL, remove that predictor and fit the model again; repeat until every remaining p-value is at or below SL



In [None]:
import statsmodels.regression.linear_model as sm
import numpy
print(x[0])
# prepend a column of ones so that OLS also estimates an intercept
be_x = numpy.append(arr = numpy.ones((x.shape[0],1)), values = x, axis=1)
print(be_x[0:5])


[  14.96   41.76 1024.07   73.17]
[[1.00000e+00 1.49600e+01 4.17600e+01 1.02407e+03 7.31700e+01]
 [1.00000e+00 2.51800e+01 6.29600e+01 1.02004e+03 5.90800e+01]
 [1.00000e+00 5.11000e+00 3.94000e+01 1.01216e+03 9.21400e+01]
 [1.00000e+00 2.08600e+01 5.73200e+01 1.01024e+03 7.66400e+01]
 [1.00000e+00 1.08200e+01 3.75000e+01 1.00923e+03 9.66200e+01]]


In [None]:
# fit OLS on all predictors and inspect their p-values
x_opt = be_x[:,[0,1,2,3,4]]
ols = sm.OLS(endog = y, exog = x_opt).fit()
ols.summary()

# every p-value is below 0.05, so every variable of this dataset is necessary

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.929 |
| Model: | OLS | Adj. R-squared: | 0.929 |
| Method: | Least Squares | F-statistic: | 31140.0 |
| Date: | Sun, 25 Oct 2020 | Prob (F-statistic): | 0.0 |
| Time: | 17:31:18 | Log-Likelihood: | -28088.0 |
| No. Observations: | 9568 | AIC: | 56190.0 |
| Df Residuals: | 9563 | BIC: | 56220.0 |
| Df Model: | 4 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 454.6093 | 9.749 | 46.634 | 0.000 | 435.500 | 473.718 |
| x1 | -1.9775 | 0.015 | -129.342 | 0.000 | -2.007 | -1.948 |
| x2 | -0.2339 | 0.007 | -32.122 | 0.000 | -0.248 | -0.220 |
| x3 | 0.0621 | 0.009 | 6.564 | 0.000 | 0.044 | 0.081 |
| x4 | -0.1581 | 0.004 | -37.918 | 0.000 | -0.166 | -0.150 |

| | | | |
|---|---|---|---|
| Omnibus: | 892.002 | Durbin-Watson: | 2.033 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 4086.777 |
| Skew: | -0.352 | Prob(JB): | 0.0 |
| Kurtosis: | 6.123 | Cond. No. | 213000.0 |
