# Assignment 1 

Student Name: Brendan Lai<br>
Student ID: 19241173

### 1. Project Summary
The dataset contains 9568 data points collected from a Combined Cycle Power Plant over 6 years (2006-2011), when the plant was set to work with a full load. Each record holds the hourly averages of four ambient variables, Temperature (T), Ambient Pressure (AP), Relative Humidity (RH), and Exhaust Vacuum (V), which are used to predict the net hourly electrical energy output (EP) of the plant.

The features are hourly averages of the ambient variables:

- Temperature (T) in the range 1.81-37.11 °C
- Ambient Pressure (AP) in the range 992.89-1033.30 millibar
- Relative Humidity (RH) in the range 25.56-100.16%
- Exhaust Vacuum (V) in the range 25.36-81.56 cm Hg

The target is the net hourly electrical energy output (EP), in the range 420.26-495.76 MW.

### 2. Data Pre-processing 


##### Importing necessary libraries

In [18]:
import numpy as np 
import pandas as pd 
from matplotlib import pyplot as plt
import seaborn as sns

##### Importing the Dataset

In [19]:
dataset = pd.read_csv('Power Plant Data.csv')

##### Showing the Dataset in a Table

In [20]:
dataset.head()

| | Ambient Temperature (C) | Exhaust Vacuum (cm Hg) | Ambient Pressure (milibar) | Relative Humidity (%) | Hourly Electrical Energy output (MW) |
|---|---|---|---|---|---|
| 0 | 14.96 | 41.76 | 1024.07 | 73.17 | 463.26 |
| 1 | 25.18 | 62.96 | 1020.04 | 59.08 | 444.37 |
| 2 | 5.11 | 39.4 | 1012.16 | 92.14 | 488.56 |
| 3 | 20.86 | 57.32 | 1010.24 | 76.64 | 446.48 |
| 4 | 10.82 | 37.5 | 1009.23 | 96.62 | 473.9 |


##### A Quick Review of the Data

In [21]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9568 entries, 0 to 9567
Data columns (total 5 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   Ambient Temperature (C)               9568 non-null   float64
 1   Exhaust Vacuum (cm Hg)                9568 non-null   float64
 2   Ambient Pressure (milibar)            9568 non-null   float64
 3   Relative Humidity (%)                 9568 non-null   float64
 4   Hourly Electrical Energy output (MW)  9568 non-null   float64
dtypes: float64(5)
memory usage: 373.9 KB


##### Separating the Inputs and Outputs
All columns are numeric, so no categorical encoding is needed. We only separate the input features from the output.

In [22]:
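# Note: 0:3 selects only the first three columns, so Relative Humidity (column 3) is not used as an input
x = dataset.iloc[:, 0:3]
# The last column is the target: hourly electrical energy output (MW)
y = dataset.iloc[:, -1]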
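
Selecting columns by position is easy to get off by one. As a minimal sketch, assuming the column names shown in the table above and using the first three rows of `dataset.head()` as a stand-in for the CSV file, the features could instead be selected by name so that every non-target column is kept. The names `sample`, `x_by_position`, `x_by_name`, and `y_by_name` are hypothetical.

```python
import pandas as pd

# Stand-in for the CSV file: the first three rows shown by dataset.head()
sample = pd.DataFrame({
    "Ambient Temperature (C)": [14.96, 25.18, 5.11],
    "Exhaust Vacuum (cm Hg)": [41.76, 62.96, 39.40],
    "Ambient Pressure (milibar)": [1024.07, 1020.04, 1012.16],
    "Relative Humidity (%)": [73.17, 59.08, 92.14],
    "Hourly Electrical Energy output (MW)": [463.26, 444.37, 488.56],
})

target = "Hourly Electrical Energy output (MW)"

# Positional slice 0:3 keeps columns 0, 1 and 2 only
x_by_position = sample.iloc[:, 0:3]

# Dropping the target by name keeps all four ambient variables
x_by_name = sample.drop(columns=[target])
y_by_name = sample[target]

print(list(x_by_position.columns))
print(list(x_by_name.columns))
```

Whether Relative Humidity should be included is a modelling choice; the results reported in this notebook were obtained with the three-column selection.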

##### Showing the Input Data in a Table Format

In [23]:
x.head()

| | Ambient Temperature (C) | Exhaust Vacuum (cm Hg) | Ambient Pressure (milibar) |
|---|---|---|---|
| 0 | 14.96 | 41.76 | 1024.07 |
| 1 | 25.18 | 62.96 | 1020.04 |
| 2 | 5.11 | 39.4 | 1012.16 |
| 3 | 20.86 | 57.32 | 1010.24 |
| 4 | 10.82 | 37.5 | 1009.23 |


##### A Quick Check of the Output Data

In [24]:
y.head()

0    463.26
1    444.37
2    488.56
3    446.48
4    473.90
Name: Hourly Electrical Energy output (MW), dtype: float64

##### Splitting the Dataset into the Training Set and Test Set

In [25]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)
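
To make the effect of `test_size=0.2` concrete, the sketch below splits a placeholder array with the same number of rows as the dataset (9568) and checks the resulting sizes. The array contents and the `*_demo` names are hypothetical; only the row count comes from the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the power plant dataset
n_rows = 9568
x_demo = np.arange(n_rows * 3).reshape(n_rows, 3)
y_demo = np.arange(n_rows)

x_tr, x_te, y_tr, y_te = train_test_split(x_demo, y_demo, test_size=0.2, random_state=1)

# scikit-learn rounds the test size up: ceil(0.2 * 9568) = 1914 rows
print(x_tr.shape, x_te.shape)

# Fixing random_state makes the split reproducible
_, x_te_again, _, _ = train_test_split(x_demo, y_demo, test_size=0.2, random_state=1)
print(np.array_equal(x_te, x_te_again))
```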

### 3. Training and Testing Predictive Models

### Method 1 - Multiple Linear Regression


In [10]:
from sklearn.linear_model import LinearRegression 

Model = LinearRegression() 
linearRegressionModel = Model.fit(x_train, y_train)
linRegress_pred = linearRegressionModel.predict(x_test)


In [11]:
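from sklearn.metrics import mean_squared_error

# Root mean squared error on the test set, in MW
mse = mean_squared_error(y_test, linRegress_pred)
rmse = mse ** 0.5
print(f"The root mean squared error is: {rmse}")

# Squared Pearson correlation between actual and predicted values.
# This can differ from sklearn.metrics.r2_score, which also penalises biased predictions.
corr_matrix = np.corrcoef(y_test, linRegress_pred)
corr = corr_matrix[0,1]
R_sq = corr**2
print(R_sq)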

The root mean squared error is: 4.824469302054682
0.9224748485647559
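
The $R^{2}$ above is computed as a squared correlation, which is not always the same as the coefficient of determination returned by `sklearn.metrics.r2_score`. The sketch below uses small hypothetical arrays (not taken from the dataset) to show how the two can diverge when the predictions are systematically offset.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical values: predictions follow the trend perfectly but are offset by +10
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = y_true + 10.0

# Squared Pearson correlation ignores the constant offset
corr_sq = np.corrcoef(y_true, y_pred)[0, 1] ** 2

# r2_score compares squared errors with the variance of y_true, so the offset is penalised
r2 = r2_score(y_true, y_pred)

print(corr_sq)
print(r2)
```

For a least-squares linear model the two numbers are usually close, but `r2_score` is the safer choice when comparing different models.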


##### Scaling features
Now what happens if we scale our data for the linear regression model?


In [26]:
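from sklearn.preprocessing import StandardScaler
sc = StandardScaler()

# Fit the scaler on the training set only, then reuse its mean and standard deviation on the test set.
# Note: the outputs recorded below came from a run that refit the scaler on the test set (fit_transform).
x_train_scaled = sc.fit_transform(x_train)
x_test_scaled = sc.transform(x_test)

# Fit a separate model on the scaled data and predict with that same model
scaled_linearRegressionModel = LinearRegression().fit(x_train_scaled, y_train)
scaled_linRegress_pred = scaled_linearRegressionModel.predict(x_test_scaled)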

In [28]:
mse = mean_squared_error(y_test, scaled_linRegress_pred)
rmse = mse ** 0.5
print(f"The root mean squared error is: {rmse}")

corr_matrix = np.corrcoef(y_test, scaled_linRegress_pred)
corr = corr_matrix[0,1]
R_sq = corr**2
print(f"The R squared value of the linear regression model applied on scaled data is: {R_sq}")

The root mean squared error is: 4.8324180134414725
The R squared value of the linear regression model applied on scaled data is: 0.9224772517166426


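As we can see, the $R^{2}$ value is essentially unchanged between the scaled and unscaled data (about 0.9225 in both cases). This is expected, because a linear regression with an intercept makes the same predictions after a linear rescaling of its inputs. The small difference in RMSE (4.824 versus 4.832) arises because, in the recorded run, the test set was scaled with its own mean and standard deviation rather than those of the training set.
An $R^{2}$ of about 0.92 and an RMSE of about 4.8 MW, against outputs in the 420-496 MW range, indicate that the linear model fits the test data well.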
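
The invariance of linear regression to feature scaling can be checked directly. The sketch below is a minimal illustration on synthetic data (the feature ranges loosely mimic the dataset, but the values and coefficients are hypothetical); it fits the scaler on the training set only and compares the predictions with those of the unscaled model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): three features on very different scales
rng = np.random.default_rng(0)
x_demo = np.column_stack([
    rng.uniform(2, 37, 500),       # temperature-like
    rng.uniform(25, 82, 500),      # vacuum-like
    rng.uniform(993, 1033, 500),   # pressure-like
])
y_demo = 500 - 2.0 * x_demo[:, 0] - 0.3 * x_demo[:, 1] + 0.05 * x_demo[:, 2] + rng.normal(0, 1, 500)

x_tr, x_te, y_tr, y_te = train_test_split(x_demo, y_demo, test_size=0.2, random_state=1)

# Unscaled fit
pred_raw = LinearRegression().fit(x_tr, y_tr).predict(x_te)

# Scaled fit: the scaling statistics come from the training set only
scaler = StandardScaler().fit(x_tr)
pred_scaled = LinearRegression().fit(scaler.transform(x_tr), y_tr).predict(scaler.transform(x_te))

# The two sets of predictions agree up to floating-point error
print(np.allclose(pred_raw, pred_scaled))
```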

### Method 2 - Support Vector Regression


In [29]:
# Support vector regression
from sklearn.svm import SVR

classifier = SVR(kernel = 'rbf')
classifier.fit(x_train, y_train)
svm_pred = classifier.predict(x_test)

In [30]:
from sklearn.metrics import r2_score

# Use a dedicated name so the StandardScaler stored in `sc` is not overwritten
svr_r2 = r2_score(y_test, svm_pred)
print(f"The R squared value for the non-scaled SVR model is: {svr_r2}")

The R squared value for the non-scaled SVR model is: 0.37168449856535546


##### Scaled results?

In [32]:
classifier.fit(x_train_scaled, y_train)
svm_pred_scaled = classifier.predict(x_test_scaled)

# R squared of the SVR model trained on scaled inputs
svr_scaled_r2 = r2_score(y_test, svm_pred_scaled)
print(svr_scaled_r2)

0.9402429999585048


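Scaling the inputs raised the SVR's $R^{2}$ from about 0.37 to about 0.94, a drastic improvement in the model's predictive capability. This is expected, because the RBF kernel depends on distances between samples, so features with large numeric ranges dominate when the inputs are left unscaled.

Comparing the two methods, the scaled SVR ($R^{2} \approx 0.94$) was more accurate than the linear regression ($R^{2} \approx 0.92$). Note that the two values were computed differently (`r2_score` for the SVR, squared correlation for the linear regression), so the comparison is approximate.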
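
One way to keep scaling and model fitting together is a scikit-learn pipeline, which fits the scaler on the training data and reuses it at prediction time. The sketch below is an illustration on synthetic data, not a rerun of the results above; the feature names `small` and `large` and all numeric values are hypothetical.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data (hypothetical): the target depends mostly on the small-scale feature
rng = np.random.default_rng(0)
small = rng.normal(0.0, 1.0, 400)
large = rng.normal(1000.0, 50.0, 400)
x_demo = np.column_stack([small, large])
y_demo = 3.0 * small + 0.01 * large + rng.normal(0.0, 0.1, 400)

x_tr, x_te, y_tr, y_te = train_test_split(x_demo, y_demo, test_size=0.2, random_state=1)

# RBF SVR on raw inputs: the large-scale feature dominates the kernel distances
r2_unscaled = r2_score(y_te, SVR(kernel="rbf").fit(x_tr, y_tr).predict(x_te))

# The pipeline standardises the inputs before they reach the SVR
scaled_svr = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
r2_scaled = r2_score(y_te, scaled_svr.fit(x_tr, y_tr).predict(x_te))

print(r2_unscaled, r2_scaled)
```

Because the pipeline only ever calls `fit` on the training data, it avoids accidentally refitting the scaler on the test set.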

### Some further data analysis