# Section: Linear Regression

## <font color='#4073FF'>Project: Power output prediction</font>

###  <font color='#14AAF5'>The business problem is predicting the power output of a peaker power plant as a function of the environmental conditions. </font>

### Project Brief:

Power generation is a complex process, and understanding and predicting power output is an important element in managing a plant and its connection to the power grid. The operators of a regional power grid create predictions of power demand based on historical information and environmental factors (e.g., temperature). They then compare the predictions against available resources (e.g., coal, natural gas, nuclear, solar, wind, hydro power plants). Power generation technologies such as solar and wind are highly dependent on environmental conditions, and all generation technologies are subject to planned and unplanned maintenance. 

 
**The power output of a peaker power plant varies with environmental conditions, so the business problem is to predict that output as a function of those conditions. Such a prediction would enable the grid operator to make economic tradeoffs about the number of peaker plants to turn on (or whether to buy expensive power from another grid).**

### 1. Dataset
The dataset contains 9568 data points collected from a Combined Cycle Power Plant over 6 years (2006-2011), when the power plant was set to work with full load. Features consist of hourly average ambient variables Temperature (T), Ambient Pressure (AP), Relative Humidity (RH) and Exhaust Vacuum (V) to predict the net hourly electrical energy output (EP) of the plant. 

A combined cycle power plant (CCPP) is composed of gas turbines (GT), steam turbines (ST) and heat recovery steam generators. In a CCPP, the electricity is generated by gas and steam turbines, which are combined in one cycle, and is transferred from one turbine to another. While the Vacuum is collected from and has an effect on the Steam Turbine, the other three ambient variables affect the GT performance.

Features/Columns consist of hourly average ambient variables. 

- Temperature (T) in the range 1.81°C to 37.11°C

- Ambient Pressure (AP) in the range 992.89 to 1033.30 millibar

- Relative Humidity (RH) in the range 25.56% to 100.16%

- Exhaust Vacuum (V) in the range 25.36 to 81.56 cm Hg

- Net hourly electrical energy output (EP) in the range 420.26 to 495.76 MW

The averages are taken from various sensors located around the plant that record the ambient variables every second. The variables are given without normalization. 

In [1]:
# Importing Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid")

### 2. Data Collection and Exploration

In [2]:
df = pd.read_csv(r"combined_cycle_power_plant.csv",sep=';')
df.shape

(9568, 5)

In [3]:
# Looking at data
df.head()

|   | temperature | exhaust_vacuum | ambient_pressure | relative_humidity | energy_output |
|---|---|---|---|---|---|
| 0 | 9.59 | 38.56 | 1017.01 | 60.10 | 481.30 |
| 1 | 12.04 | 42.34 | 1019.72 | 94.67 | 465.36 |
| 2 | 13.87 | 45.08 | 1024.42 | 81.69 | 465.48 |
| 3 | 13.72 | 54.30 | 1017.89 | 79.08 | 467.05 |
| 4 | 15.14 | 49.64 | 1023.78 | 75.00 | 463.58 |


In [4]:
# checking for column information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9568 entries, 0 to 9567
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   temperature        9568 non-null   float64
 1   exhaust_vacuum     9568 non-null   float64
 2   ambient_pressure   9568 non-null   float64
 3   relative_humidity  9568 non-null   float64
 4   energy_output      9568 non-null   float64
dtypes: float64(5)
memory usage: 373.9 KB


In [5]:
# Descriptive statistics
df.describe()

|   | temperature | exhaust_vacuum | ambient_pressure | relative_humidity | energy_output |
|---|---|---|---|---|---|
| count | 9568.0 | 9568.0 | 9568.0 | 9568.0 | 9568.0 |
| mean | 19.651231 | 54.305804 | 1013.259078 | 73.308978 | 454.365009 |
| std | 7.452473 | 12.707893 | 5.938784 | 14.600269 | 17.066995 |
| min | 1.81 | 25.36 | 992.89 | 25.56 | 420.26 |
| 25% | 13.51 | 41.74 | 1009.1 | 63.3275 | 439.75 |
| 50% | 20.345 | 52.08 | 1012.94 | 74.975 | 451.55 |
| 75% | 25.72 | 66.54 | 1017.26 | 84.83 | 468.43 |
| max | 37.11 | 81.56 | 1033.3 | 100.16 | 495.76 |


### 3. Data Cleaning

In [6]:
# Looking for duplicate entries
df.duplicated().sum()

41

In [7]:
# Dropping duplicates
df.drop_duplicates(inplace=True)

In [8]:
# No more duplicates
df.duplicated().sum()

0

In [9]:
df.isnull().sum()

temperature          0
exhaust_vacuum       0
ambient_pressure     0
relative_humidity    0
energy_output        0
dtype: int64

In [10]:
# Skewness
df.skew()

temperature         -0.136107
exhaust_vacuum       0.196819
ambient_pressure     0.273846
relative_humidity   -0.435138
energy_output        0.305791
dtype: float64

### 4. Feature Engineering

### Correlation Analysis

    if corr(x,y) is between -0.1 and +0.1 = weak correlation

    if corr(x,y) is between +0.1 and +0.5 = moderate correlation
    if corr(x,y) is between -0.5 and -0.1 = moderate correlation

    if corr(x,y) > +0.5 = strong correlation
    if corr(x,y) < -0.5 = strong correlation

In [11]:
corr = df.corr()

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# Plot the correlation matrix as a heatmap, using the mask to hide the upper triangle



# write your code here
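
One possible way to fill in this cell is seaborn's `heatmap`, passing the upper-triangle mask so each pair of variables appears only once. The sketch below runs on a small synthetic stand-in for `df` (the `demo` frame, its random values, and the styling arguments are assumptions for illustration, not the plant data or the notebook's intended solution).

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic stand-in for df (assumption: random values, not the plant data)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "temperature": rng.uniform(2, 37, 200),
    "exhaust_vacuum": rng.uniform(25, 81, 200),
    "ambient_pressure": rng.uniform(993, 1033, 200),
    "relative_humidity": rng.uniform(26, 100, 200),
})
# Made-up target that falls as temperature rises, plus noise
demo["energy_output"] = 500 - 2.0 * demo["temperature"] + rng.normal(0, 3, 200)

corr = demo.corr()

# Boolean mask that is True on and above the diagonal (those cells are hidden)
mask = np.triu(np.ones_like(corr, dtype=bool))

fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(corr, mask=mask, annot=True, fmt=".2f", cmap="coolwarm",
            vmin=-1, vmax=1, square=True, ax=ax)
ax.set_title("Correlation matrix")
plt.show()
```

With the real data, replacing `demo` by `df` should be enough; the row for `energy_output` is the one to read when judging each feature against the thresholds above.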



#### Outcome: all four features are important

In [12]:
# creating x and y sets
x = df.drop('energy_output',axis=1)
y = df['energy_output']

### 5. Predictive Modelling - Combined Cycle Power Plant

#### **Methodology – Linear Regression** 

Regression analysis is a predictive modelling technique that investigates the relationship between a dependent variable (the target) and one or more independent variables (the predictors). It is used for forecasting, time series modelling, and studying cause-and-effect relationships between variables. Here we will be predicting a continuous output, `energy_output`.
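
For the four features in this dataset, the linear model can be written as below. The coefficient symbols $\beta_0, \dots, \beta_4$ are notation introduced here for illustration; the variable abbreviations are those from the dataset description.

$$
\widehat{EP} = \beta_0 + \beta_1\, T + \beta_2\, V + \beta_3\, AP + \beta_4\, RH
$$

Ordinary least squares, which is what scikit-learn's `LinearRegression` fits, chooses the coefficients that minimise the sum of squared residuals over the $n$ training rows:

$$
\min_{\beta_0, \dots, \beta_4} \; \sum_{i=1}^{n} \left( EP_i - \widehat{EP}_i \right)^2
$$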

#### Train test split

In [13]:
# Split dataset into 80% training data and 20% testing data.

from sklearn.model_selection import train_test_split

# write your code here
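
A minimal sketch of the 80/20 split, shown on synthetic stand-ins for `x` and `y` so that it runs on its own. The names `xtest` and `ytest` and the `random_state=42` seed are assumptions; only `xtrain` and `ytrain` are named later in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for x and y (assumption: random values, not the plant data)
rng = np.random.default_rng(0)
x_demo = pd.DataFrame(
    rng.normal(size=(100, 4)),
    columns=["temperature", "exhaust_vacuum", "ambient_pressure", "relative_humidity"],
)
y_demo = pd.Series(rng.normal(size=100), name="energy_output")

# test_size=0.2 keeps 20% of the rows for testing; random_state makes the split repeatable
xtrain, xtest, ytrain, ytest = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=42
)
print(xtrain.shape, xtest.shape)
```

In the notebook itself, `x_demo` and `y_demo` would be replaced by the `x` and `y` created earlier.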



### 6. Model Fitting

In [14]:
# create model

from sklearn.linear_model import LinearRegression

# write your code here



In [15]:
# train the model with the train dataset - xtrain, ytrain

# write your code here
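
Creating and training the model could look like the sketch below. It uses synthetic training data with known, made-up coefficients (`true_coefs` and the intercept 450 are assumptions) so that the fitted values can be checked against them; in the notebook the call would be `model.fit(xtrain, ytrain)`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic training data with known coefficients (assumption: made-up values)
rng = np.random.default_rng(0)
xtrain_demo = rng.normal(size=(200, 4))
true_coefs = np.array([-2.0, -0.3, 0.1, -0.15])
ytrain_demo = 450.0 + xtrain_demo @ true_coefs + rng.normal(0, 0.01, 200)

# Create the model with default settings (an intercept is fitted by default)
model = LinearRegression()

# Train the model: estimates one coefficient per feature plus the intercept
model.fit(xtrain_demo, ytrain_demo)

print("intercept:", model.intercept_)
print("coefficients:", model.coef_)
```

After fitting, `model.coef_` holds one value per feature in column order, which makes it easy to see the direction and size of each feature's effect.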


In [16]:
# making predictions with new data

new_data = [22.1,71.29,1008.2,75.38]

# write your code here
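
`predict` expects a 2-D input with one row per observation, so a flat list such as `new_data` has to be wrapped first. The sketch below, which fits a throwaway model on synthetic data (the `x_demo`/`y_demo` values and the relation `400 + 0.5 * temperature` are assumptions), wraps the values in a one-row DataFrame with the same column names used for training. This assumes the values in `new_data` are in the same order as the columns of `x`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

cols = ["temperature", "exhaust_vacuum", "ambient_pressure", "relative_humidity"]

# Throwaway model fitted on synthetic data (assumption: made-up relation)
rng = np.random.default_rng(0)
x_demo = pd.DataFrame(rng.uniform(0, 100, size=(50, 4)), columns=cols)
y_demo = 400 + 0.5 * x_demo["temperature"]
model = LinearRegression().fit(x_demo, y_demo)

new_data = [22.1, 71.29, 1008.2, 75.38]

# One row, four columns, same names and order as the training features
new_df = pd.DataFrame([new_data], columns=cols)

prediction = model.predict(new_df)
print(prediction)
```

Passing a DataFrame rather than a nested list also avoids the feature-name warning scikit-learn emits when a model trained on named columns is given unnamed input.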



### 7. Evaluating the model

In [17]:
# Making predictions on test set


# write your code here
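
A sketch of predicting on the held-out rows and lining the predictions up against the actual values. Everything here is synthetic and self-contained (the data, `ypred`, and the `comparison` frame are assumed names); in the notebook only the `model.predict(xtest)` line would be needed.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data with a linear relation plus noise (assumption: made-up values)
rng = np.random.default_rng(1)
x_demo = pd.DataFrame(rng.normal(size=(100, 4)), columns=list("abcd"))
y_demo = 450 + x_demo @ np.array([-2.0, -0.3, 0.1, -0.15]) + rng.normal(0, 1, 100)

xtrain, xtest, ytrain, ytest = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=42
)
model = LinearRegression().fit(xtrain, ytrain)

# Predictions for rows the model has not seen during training
ypred = model.predict(xtest)

# Side-by-side view of actual and predicted values
comparison = pd.DataFrame({"actual": ytest.to_numpy(), "predicted": ypred})
print(comparison.head())
```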



In [18]:
# Finding the r2 score

from sklearn import metrics

# write your code here
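
The R² score compares the model's squared error with that of always predicting the mean: R² = 1 - SS_res / SS_tot. The sketch below checks `metrics.r2_score` against that formula on four hand-picked numbers (the `y_true` and `y_pred` values are illustrative, not notebook results); in the notebook the arguments would be the test targets and the test-set predictions.

```python
import numpy as np
from sklearn import metrics

# Small illustrative arrays (assumption: not plant data)
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Library call: the true values go first, the predictions second
r2 = metrics.r2_score(y_true, y_pred)

# The same quantity from the definition
ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2, r2_manual)
```

Argument order matters: swapping `y_true` and `y_pred` changes SS_tot and therefore the score.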



In [19]:
# Finding adjusted r2 score


# write your code here
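
scikit-learn has no built-in adjusted R², so it is usually computed from R² with the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of rows scored and p the number of features. The helper name `adjusted_r2` and the example numbers below are assumptions for illustration.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2: penalises R^2 for the number of predictors used."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Illustrative numbers only: R^2 of 0.90 on 100 rows with 4 features
example = adjusted_r2(0.90, n_samples=100, n_features=4)
print(round(example, 4))
```

For this notebook, `n_samples` would be the number of rows in the set used for the R² score and `n_features` would be 4. With thousands of rows and only four features, the adjusted value will sit very close to the plain R².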



### 8. Export the model for deployment

In [20]:
# Export the model for deployment


# write your code here
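
One common way to export a fitted scikit-learn model is to serialise it with Python's `pickle` module and load it back where predictions are needed. The sketch below uses a throwaway model on synthetic data and writes to the system temp directory; the file name `ccpp_linear_model.pkl` is a hypothetical choice.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Throwaway model on synthetic data (assumption: made-up values)
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(50, 4))
y_demo = 450 + x_demo @ np.array([-2.0, -0.3, 0.1, -0.15])
model = LinearRegression().fit(x_demo, y_demo)

# Hypothetical file name, written to the temp directory for this demo
path = os.path.join(tempfile.gettempdir(), "ccpp_linear_model.pkl")

# Save the fitted model to disk
with open(path, "wb") as f:
    pickle.dump(model, f)

# Load it back and confirm it predicts the same values
with open(path, "rb") as f:
    loaded = pickle.load(f)

print(np.allclose(loaded.predict(x_demo), model.predict(x_demo)))
```

Pickled models should be loaded with the same scikit-learn version that saved them, and pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.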

