**Combined Cycle Power Plant**

This notebook addresses a regression problem based on the article "Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods" (Pınar Tüfekci, 2014). The dataset was obtained from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant).

**Problem Description**



# Exploratory Data Analysis

## Importing Libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

## Importing the Dataset

In [None]:
dataset = pd.read_csv('Combined_Cycle_Power_Plant.csv')

In [None]:
dataset.shape

In [None]:
dataset.columns

In [None]:
dataset.head(5)

In [None]:
dataset.describe()

AT = Ambient Temperature

AP = Atmospheric Pressure

RH = Relative Humidity

V = Vacuum (Exhaust Steam Pressure)

PE = Full load electrical power output
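As a quick sanity check on the variables listed above, a minimal sketch like the following confirms that the expected columns are present and counts missing values per column. The helper name `check_columns` is hypothetical, and the small DataFrame holds made-up placeholder values so the snippet runs on its own; in this notebook, `dataset` would be passed instead.

```python
import pandas as pd

# Expected column names, taken from the variable list above
EXPECTED_COLUMNS = ['AT', 'V', 'AP', 'RH', 'PE']

def check_columns(df, expected=EXPECTED_COLUMNS):
    """Return the expected columns absent from df and the per-column NaN counts."""
    missing = [c for c in expected if c not in df.columns]
    nan_counts = df.isna().sum()
    return missing, nan_counts

# Placeholder values for illustration only (not taken from the real dataset)
sample = pd.DataFrame({
    'AT': [10.0, 20.0, 30.0],
    'V': [40.0, 50.0, None],   # one deliberately missing value
    'AP': [1010.0, 1015.0, 1020.0],
    'RH': [60.0, 70.0, 80.0],
    'PE': [480.0, 460.0, 440.0],
})

missing, nan_counts = check_columns(sample)
print('Missing columns:', missing)
print(nan_counts)
```

Running the same check on `dataset` before plotting makes it easier to spot a misnamed column or incomplete rows early.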

## Histograms of numerical variables

In [None]:
plt.figure(figsize=(25,15))
plt.suptitle('Histograms of numerical variables', fontsize = 20)
for i in range(1, dataset.shape[1] + 1):
    plt.subplot(2, 3, i)
    f = plt.gca()
    #f.set_title(dataset.columns.values[i-1])
    sns.histplot(dataset.iloc[:, i-1], color = '#3F5D7D', kde= True)

## Pair Plot of numerical variables

In [None]:
g = sns.PairGrid(data=dataset, vars=['AT', 'V', 'AP', 'RH','PE'], hue='PE')
g.map(sns.scatterplot,  color = '#3F5D7D')

## Scatter plot between the target variable and features

In [None]:
plt.figure(figsize=(25,15))
plt.suptitle('Scatter plot and linear regression', fontsize = 20)
for i in range(1, dataset.shape[1]):
    plt.subplot(2, 2, i)
    #f = plt.gca()
    sns.regplot(data=dataset, x=dataset.iloc[:, i-1], y=dataset['PE'], scatter=False, fit_reg=True)
    plt.scatter(dataset.iloc[:, i-1], dataset['PE'], color = 'r', marker='+' )


plt.show()

## Correlation with the response

In [None]:
dataset.drop(columns='PE').corrwith(dataset.PE).plot.bar(figsize = (20,10),
                                            title = 'Correlation with response variable',
                                            fontsize = 15, rot = 45, grid = True, color = '#5F5D7D')
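To make explicit what the bars represent: `corrwith` computes the Pearson correlation coefficient by default. The sketch below computes it by hand and compares it with the pandas result. The function name `pearson_r` and the toy values are assumptions introduced for illustration; they are not taken from the dataset.

```python
import numpy as np
import pandas as pd

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()  # centre both variables
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

# Made-up toy values for illustration only
toy = pd.DataFrame({
    'AT': [5.0, 10.0, 15.0, 20.0, 25.0],
    'PE': [490.0, 478.0, 470.0, 455.0, 441.0],
})

r_manual = pearson_r(toy['AT'], toy['PE'])
r_pandas = toy[['AT']].corrwith(toy['PE'])['AT']
print(r_manual, r_pandas)
```

A value close to -1 or 1 indicates a strong linear relationship with the response, while a value near 0 indicates a weak one.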

## Correlation matrix between the features

In [None]:
sns.set(style='white', font_scale=1)
corr = dataset.drop(columns='PE').corr()  # correlation between the numerical features
# Generate a boolean mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)  # array of False values with the same shape as corr
mask[np.triu_indices_from(mask)] = True  # hide the upper triangle, including the diagonal
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(20, 10))
f.suptitle('Correlation Matrix', fontsize=40)
# Generate a custom diverging color map
cmap = sns.diverging_palette(50, 0, as_cmap=True)
# Draw the heatmap with the mask and the correct aspect ratio
sns.heatmap(corr, mask=mask, annot=True, cmap=cmap, vmax=1, center=0,
            square=True, linewidths=5, cbar_kws={'shrink': .5})
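The masking step can be easier to follow on a small example. The sketch below builds an upper-triangle mask for a 3x3 correlation matrix and prints which cells the heatmap would hide. The toy DataFrame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up toy values for illustration only
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [2.0, 1.0, 4.0, 3.0],
    'c': [4.0, 3.0, 2.0, 1.0],
})
corr = toy.corr()

# True marks the cells that seaborn will NOT draw
mask = np.triu(np.ones(corr.shape, dtype=bool))
print(mask)

# Cells that remain visible: the strict lower triangle
visible = corr.where(~mask)
print(visible)
```

Because a correlation matrix is symmetric and its diagonal is always 1, hiding the upper triangle and the diagonal removes only redundant cells.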

# Building the model (Exhaustive analysis)

## Subsets with one independent variable

### AT vs PE