# Assignment 1 - Power Plant Prediction
Kyle Ah Von #57862609

## Summary
Predict the net hourly electrical energy output (EP) of a power plant using two models: multiple linear regression and support vector regression (SVR).

## Dataset Details

- **Total Data Points:** 9,568
- **Time Period:** 2006-2011
- **Frequency:** Hourly averages

## Features

The dataset has four input features (T, AP, RH, V) and one target variable (EP):

1. **Temperature (T)**
   - **Description:** The temperature of the ambient environment.
   - **Range:** 1.81°C to 37.11°C

2. **Ambient Pressure (AP)**
   - **Description:** The ambient pressure at the location of the power plant.
   - **Range:** 992.89 to 1033.30 millibar

3. **Relative Humidity (RH)**
   - **Description:** The relative humidity of the ambient environment.
   - **Range:** 25.56% to 100.16%

4. **Exhaust Vacuum (V)**
   - **Description:** The vacuum pressure in the exhaust system.
   - **Range:** 25.36 to 81.56 cm Hg

5. **Net Hourly Electrical Energy Output (EP)**
   - **Description:** The net electrical energy output of the power plant per hour.
   - **Range:** 420.26 to 495.76 MW



## Importing Libraries

In [11]:
# Data Manipulation
import pandas as pd  # For data manipulation and analysis
import numpy as np   # For numerical operations

# Data Visualization
import matplotlib.pyplot as plt  # For creating static, interactive, and animated visualizations
import seaborn as sns            # For statistical data visualization

# Machine Learning
from sklearn.model_selection import train_test_split  # For splitting the dataset into training and testing sets
from sklearn.preprocessing import StandardScaler, MinMaxScaler  # For feature scaling
from sklearn.linear_model import LinearRegression  # For linear regression
from sklearn.svm import SVR  # For Support Vector Regression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score  # For model evaluation metrics

## Import Dataset

In [5]:
dataset = pd.read_csv('Power Plant Data.csv')

### Visualize dataset in table

In [8]:
# Show floats with 2 decimal places; the data has at most 2 digits after the decimal point
pd.options.display.float_format = '{:.2f}'.format

# Display the first few rows as sample
dataset.head()

| | Ambient Temperature (C) | Exhaust Vacuum (cm Hg) | Ambient Pressure (milibar) | Relative Humidity (%) | Hourly Electrical Energy output (MW) |
|---|---|---|---|---|---|
| 0 | 14.96 | 41.76 | 1024.07 | 73.17 | 463.26 |
| 1 | 25.18 | 62.96 | 1020.04 | 59.08 | 444.37 |
| 2 | 5.11 | 39.40 | 1012.16 | 92.14 | 488.56 |
| 3 | 20.86 | 57.32 | 1010.24 | 76.64 | 446.48 |
| 4 | 10.82 | 37.50 | 1009.23 | 96.62 | 473.90 |


All five columns are numeric. The first four are the model inputs and the last one (energy output) is the target, so no column needs to be dropped.

## Review of Data
The main goal here is to check whether any values are missing.

In [9]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9568 entries, 0 to 9567
Data columns (total 5 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   Ambient Temperature (C)               9568 non-null   float64
 1   Exhaust Vacuum (cm Hg)                9568 non-null   float64
 2   Ambient Pressure (milibar)            9568 non-null   float64
 3   Relative Humidity (%)                 9568 non-null   float64
 4   Hourly Electrical Energy output (MW)  9568 non-null   float64
dtypes: float64(5)
memory usage: 373.9 KB


Conveniently, every column has 9568 non-null values, so there is no missing data in any row, woohoo!
Also, every column is `float64`, so there is no categorical data and no need to encode anything.
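The same two checks can also be made explicitly instead of being read off the `info()` output. The snippet below is only a minimal sketch: `sample` is a small stand-in frame built from the first three rows shown earlier, and the names `missing_per_column` and `non_numeric_columns` are illustrative. In the notebook itself the same calls would be applied to `dataset`.

```python
import pandas as pd

# Stand-in frame built from the first three rows shown by dataset.head();
# in the notebook, `dataset` would be used instead.
sample = pd.DataFrame({
    "Ambient Temperature (C)": [14.96, 25.18, 5.11],
    "Exhaust Vacuum (cm Hg)": [41.76, 62.96, 39.40],
    "Ambient Pressure (milibar)": [1024.07, 1020.04, 1012.16],
    "Relative Humidity (%)": [73.17, 59.08, 92.14],
    "Hourly Electrical Energy output (MW)": [463.26, 444.37, 488.56],
})

# Count missing values per column; all zeros means nothing to impute or drop.
missing_per_column = sample.isna().sum()

# Collect non-numeric columns; an empty list means no encoding is needed.
non_numeric_columns = sample.select_dtypes(exclude="number").columns.tolist()

print(missing_per_column)
print("Non-numeric columns:", non_numeric_columns)
```

If either check reported something, that would be the point to decide on imputation or encoding before modelling.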

## Separate Inputs and Outputs

In [19]:
Y = dataset.iloc[:, 4] # sets last column as output
X = dataset.iloc[:, :4] # sets the first 4 columns as input
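The positional `iloc` split above relies on the target being the fifth column. As a hedged alternative, the same split can be written by column name, which keeps working if the column order ever changes. This is a sketch on a two-row stand-in frame; `sample`, `target_column`, `X_named` and `Y_named` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Two-row stand-in with the same column names as the dataset.
sample = pd.DataFrame({
    "Ambient Temperature (C)": [14.96, 25.18],
    "Exhaust Vacuum (cm Hg)": [41.76, 62.96],
    "Ambient Pressure (milibar)": [1024.07, 1020.04],
    "Relative Humidity (%)": [73.17, 59.08],
    "Hourly Electrical Energy output (MW)": [463.26, 444.37],
})

target_column = "Hourly Electrical Energy output (MW)"

# Inputs: every column except the target. Output: the target column only.
X_named = sample.drop(columns=[target_column])
Y_named = sample[target_column]

# For this column order, the name-based split matches the positional one.
print(X_named.equals(sample.iloc[:, :4]), Y_named.equals(sample.iloc[:, 4]))
```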


## Split Data into Training and Testing Sets

In [23]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=1)

# Preview the training input data
X_train.head()

| | Ambient Temperature (C) | Exhaust Vacuum (cm Hg) | Ambient Pressure (milibar) | Relative Humidity (%) |
|---|---|---|---|---|
| 9061 | 6.61 | 38.91 | 1015.77 | 92.31 |
| 6937 | 22.72 | 65.61 | 1014.64 | 70.53 |
| 5631 | 10.06 | 39.61 | 1018.22 | 70.22 |
| 6218 | 27.53 | 67.83 | 1009.40 | 53.73 |
| 1362 | 23.89 | 48.41 | 1010.48 | 62.31 |
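The shuffled index values in the preview show that `train_test_split` randomizes the rows before splitting. A quick way to sanity-check the 80/20 split is to look at the resulting sizes. The sketch below uses placeholder arrays of zeros with the dataset's shape (9,568 rows, 4 inputs) rather than the real data; `X_demo`, `y_demo` and the `*_tr` / `*_te` names are illustrative. With `test_size=0.2`, this should give 7,654 training rows and 1,914 test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the dataset's shape; the values do not matter here.
X_demo = np.zeros((9568, 4))
y_demo = np.zeros(9568)

# Same arguments as the split above: 20% held out, fixed seed for repeatability.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

print("train:", X_tr.shape, y_tr.shape)
print("test: ", X_te.shape, y_te.shape)
```

In the notebook, printing `X_train.shape` and `X_test.shape` gives the same check on the real data.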
