<h3>Course Project - Combined Cycle Power Plant (CCPP)</h3>

A combined cycle power plant, also known as a CCGT (Combined Cycle Gas Turbine) plant, is a type of power plant that uses a combination of gas turbine and steam turbine technology to generate electricity. In these power plants, a gas turbine is first used to generate electricity. The waste heat generated in the process is then used to produce steam, which drives a steam turbine to generate additional electricity.

This type of power plant is particularly efficient because it utilizes the waste heat generated in the process of electricity generation with gas turbines, instead of releasing it unused into the atmosphere. As a result, CCGT plants can achieve higher efficiencies than conventional power plants that are based on only one of these technologies. They are also more flexible in operation and can respond more quickly to changes in electricity demand than traditional steam power plants.
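The efficiency gain described above can be illustrated with a short calculation. This is a minimal sketch under an idealized assumption: all heat rejected by the gas turbine is available to the steam cycle. The function name and the two efficiency values are illustrative assumptions, not figures from the dataset or from a specific plant.

```python
def combined_cycle_efficiency(eta_gas: float, eta_steam: float) -> float:
    """Idealized combined efficiency.

    The gas turbine converts a fraction eta_gas of the fuel energy into
    electricity. The remaining (1 - eta_gas) is assumed to reach the steam
    cycle in full, which converts a fraction eta_steam of it.
    """
    return eta_gas + (1.0 - eta_gas) * eta_steam


# Illustrative (assumed) efficiencies for the two individual cycles
eta_cc = combined_cycle_efficiency(eta_gas=0.38, eta_steam=0.35)
print(f"Combined efficiency: {eta_cc:.3f}")
```

Under this assumption the combined value is always higher than either individual efficiency, which is the point made in the paragraph above. Real plants lose some heat between the two cycles, so actual values are lower.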

<h4>Instructions:</h4>

1. In this project, you apply what you have learned about the modeling process to develop and evaluate a machine learning model on a provided CCPP dataset.
2. The project involves executing essential steps in the modeling process, including training and assessing the machine learning model.
3. The final submission will be a brief video presentation (maximum 5 minutes) demonstrating the modeling approach, featuring a demo or screenshot of the model, and explaining the evaluation method for the final model.

<h4>Grading Criteria:</h4>

1. Modeling approach - did you correctly identify the type of modeling task, features to use, and possible algorithms to use?
2. Model building - did you compare at least two different models (different algorithms, different combinations of features, or different hyperparameter combinations) using a validation set or cross-validation to optimize your model?
3. Model evaluation - did you set a reasonable evaluation metric to determine the performance of your model, and then calculate it on the test set?
4. Model interpretation - did you correctly interpret and clearly communicate the performance of your model?


In [155]:
import pandas as pd
import numpy as np

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

In [156]:
# Import the dataset
df_ccpp = pd.read_csv("CCPP_data.csv", encoding='utf-8')

# A quick overview of the available features 
df_ccpp.head()

| | T | V | AP | RH | PE |
|---|---|---|---|---|---|
| 0 | 14.96 | 41.76 | 1024.07 | 73.17 | 463.26 |
| 1 | 25.18 | 62.96 | 1020.04 | 59.08 | 444.37 |
| 2 | 5.11 | 39.4 | 1012.16 | 92.14 | 488.56 |
| 3 | 20.86 | 57.32 | 1010.24 | 76.64 | 446.48 |
| 4 | 10.82 | 37.5 | 1009.23 | 96.62 | 473.9 |


The dataset contains hourly averages of four ambient variables plus the target:
- Temperature (T), in the range 1.81°C to 37.11°C
- Exhaust Vacuum (V), in the range 25.36 to 81.56 cm Hg
- Ambient Pressure (AP), in the range 992.89 to 1033.30 millibar
- Relative Humidity (RH), in the range 25.56% to 100.16%
- Net hourly electrical energy output (PE), in the range 420.26 to 495.76 MW (the target we are trying to predict)
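The documented ranges can serve as a quick sanity check on the data. The sketch below uses only the five rows shown by `df_ccpp.head()` so that it runs on its own. The helper `find_out_of_range` is a hypothetical name introduced for illustration.

```python
# Documented value ranges per column (taken from the list above)
RANGES = {
    "T": (1.81, 37.11),
    "V": (25.36, 81.56),
    "AP": (992.89, 1033.30),
    "RH": (25.56, 100.16),
    "PE": (420.26, 495.76),
}

# The five rows displayed by df_ccpp.head()
COLUMNS = ["T", "V", "AP", "RH", "PE"]
sample_rows = [
    [14.96, 41.76, 1024.07, 73.17, 463.26],
    [25.18, 62.96, 1020.04, 59.08, 444.37],
    [5.11, 39.4, 1012.16, 92.14, 488.56],
    [20.86, 57.32, 1010.24, 76.64, 446.48],
    [10.82, 37.5, 1009.23, 96.62, 473.9],
]


def find_out_of_range(rows, columns, ranges):
    """Return (row_index, column, value) for every value outside its range."""
    problems = []
    for i, row in enumerate(rows):
        for col, value in zip(columns, row):
            low, high = ranges[col]
            if not (low <= value <= high):
                problems.append((i, col, value))
    return problems


violations = find_out_of_range(sample_rows, COLUMNS, RANGES)
print("Out-of-range values:", violations)
```

On the full DataFrame, the same check could be written per column with `df_ccpp[col].between(low, high).all()`.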

In [157]:
# Concise summary of our data for initial data exploration
df_ccpp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9568 entries, 0 to 9567
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   T       9568 non-null   float64
 1   V       9568 non-null   float64
 2   AP      9568 non-null   float64
 3   RH      9568 non-null   float64
 4   PE      9568 non-null   float64
dtypes: float64(5)
memory usage: 373.9 KB


In [158]:
# Statistical summary of the numerical columns
df_ccpp.describe()

| | T | V | AP | RH | PE |
|---|---|---|---|---|---|
| count | 9568.0 | 9568.0 | 9568.0 | 9568.0 | 9568.0 |
| mean | 19.651231 | 54.305804 | 1013.259078 | 73.308978 | 454.365009 |
| std | 7.452473 | 12.707893 | 5.938784 | 14.600269 | 17.066995 |
| min | 1.81 | 25.36 | 992.89 | 25.56 | 420.26 |
| 25% | 13.51 | 41.74 | 1009.1 | 63.3275 | 439.75 |
| 50% | 20.345 | 52.08 | 1012.94 | 74.975 | 451.55 |
| 75% | 25.72 | 66.54 | 1017.26 | 84.83 | 468.43 |
| max | 37.11 | 81.56 | 1033.3 | 100.16 | 495.76 |


In [159]:
# Selecting features and target variable for the model
X = df_ccpp[['T', 'V', 'AP', 'RH']]
y = df_ccpp['PE']

# Split the dataset into training (80%) and test (20%) sets.
# Note: no random_state is set, so the split (and the metrics below) vary between runs.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8)

# Creating a Linear Regression model instance
model_lr = LinearRegression()

# Creating a Decision Tree Regression model instance
model_dt = DecisionTreeRegressor(max_depth=4, min_samples_leaf=0.1, random_state=3)

# Creating a Random Forest model instance
model_rf = RandomForestRegressor(random_state=42)

# Fit the linear regression model on the training data
model_lr.fit(X_train, y_train)

# Fit the decision tree model on the training data
model_dt.fit(X_train, y_train)

# Fit the random forest model on the training data
model_rf.fit(X_train, y_train)
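The grading criteria ask for models to be compared on a validation set or with cross-validation rather than on the test set. A minimal sketch of such a comparison follows. To keep it runnable on its own, it uses a synthetic stand-in for the training data (not the CCPP data), and the names `X_demo`, `y_demo`, `candidates`, and `cv_rmse` are assumptions introduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the training split (NOT the CCPP data):
# four features and a noisy linear target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = X_demo @ np.array([-2.0, -0.3, 0.1, -0.2]) + rng.normal(scale=0.5, size=300)

# Same model types and hyperparameters as in the cell above
# (fewer trees in the forest to keep the sketch fast).
candidates = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(
        max_depth=4, min_samples_leaf=0.1, random_state=3),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

# 5-fold cross-validation; scikit-learn reports negated RMSE so that
# higher is better, hence the sign flip below.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_rmse = {}
for name, model in candidates.items():
    scores = cross_val_score(
        model, X_demo, y_demo, cv=cv, scoring="neg_root_mean_squared_error")
    cv_rmse[name] = -scores.mean()
    print(f"{name}: mean CV RMSE = {cv_rmse[name]:.3f}")

best_name = min(cv_rmse, key=cv_rmse.get)
print("Lowest CV RMSE:", best_name)
```

In the notebook, `X_demo` and `y_demo` would be replaced by `X_train` and `y_train`, so that the test set is used only once, to evaluate the model selected here.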

In [160]:
# Linear Regression: MSE, RMSE, R-Square, MAE
y_pred_linear = model_lr.predict(X_test)

linear_mse = mean_squared_error(y_test, y_pred_linear)
linear_rmse = np.sqrt(linear_mse)
linear_r_square = model_lr.score(X_test, y_test)
linear_mae = mean_absolute_error(y_test, y_pred_linear)

# Decision Tree Regression: MSE, RMSE, R-Square, MAE
y_pred_dt = model_dt.predict(X_test)

dt_mse = mean_squared_error(y_test, y_pred_dt)
dt_rmse = np.sqrt(dt_mse)
dt_r_square = model_dt.score(X_test, y_test)
dt_mae = mean_absolute_error(y_test, y_pred_dt)

# Random Forest Regression: MSE, RMSE, R-Square, MAE
y_pred_rf = model_rf.predict(X_test)

rf_mse = mean_squared_error(y_test, y_pred_rf)
rf_rmse = np.sqrt(rf_mse)
rf_r_square = model_rf.score(X_test, y_test)
rf_mae = mean_absolute_error(y_test, y_pred_rf)

# Collect all metrics, grouped by metric type (three models per group)
comparison_results = {
    "Linear MSE": linear_mse,
    "Decision Tree MSE": dt_mse,
    "Random Forest MSE": rf_mse,

    "Linear RMSE": linear_rmse,
    "Decision Tree RMSE": dt_rmse,
    "Random Forest RMSE": rf_rmse,

    "Linear R-Square": linear_r_square,
    "Decision Tree R-Square": dt_r_square,
    "Random Forest R-Square": rf_r_square,

    "Linear MAE": linear_mae,
    "Decision Tree MAE": dt_mae,
    "Random Forest MAE": rf_mae,
}

# Print the metrics, with a blank line after each group of three
for i, (metric, value) in enumerate(comparison_results.items(), start=1):
    print(f"{metric}: {value}")
    if i % 3 == 0:
        print()


Linear MSE: 19.887672979361724
Decision Tree MSE: 31.194023107529635
Random Forest MSE: 10.432210448834946

Linear RMSE: 4.459559729318773
Decision Tree RMSE: 5.585160974182359
Random Forest RMSE: 3.229893256569781

Linear R-Square: 0.9325073807799998
Decision Tree R-Square: 0.8941371207319624
Random Forest R-Square: 0.964596300020782

Linear MAE: 3.663660894808539
Decision Tree MAE: 4.419752364221531
Random Forest MAE: 2.413862852664589
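To help interpret these numbers, the sketch below recomputes RMSE from the reported MSE values and relates each RMSE to the spread of the target. It uses the standard deviation of PE from `df_ccpp.describe()`, which covers the full dataset rather than the test split, so the ratio is only an approximate guide. The names `reported`, `pe_std`, and `relative_rmse` are introduced here for illustration.

```python
import math

# Test-set metrics copied from the output above (units: MW^2 for MSE, MW for RMSE)
reported = {
    "Linear": {"mse": 19.887672979361724, "rmse": 4.459559729318773},
    "Decision Tree": {"mse": 31.194023107529635, "rmse": 5.585160974182359},
    "Random Forest": {"mse": 10.432210448834946, "rmse": 3.229893256569781},
}

# Standard deviation of PE over the full dataset, from df_ccpp.describe()
pe_std = 17.066995

relative_rmse = {}
for name, metrics in reported.items():
    # RMSE is the square root of MSE, so it is in the same unit as PE (MW)
    rmse_from_mse = math.sqrt(metrics["mse"])
    # A ratio well below 1 means the model's typical error is small
    # compared with the natural variation of the target
    relative_rmse[name] = metrics["rmse"] / pe_std
    print(
        f"{name}: sqrt(MSE) = {rmse_from_mse:.4f} MW, "
        f"reported RMSE = {metrics['rmse']:.4f} MW, "
        f"RMSE / std(PE) = {relative_rmse[name]:.3f}"
    )
```

On this split, the random forest has the lowest MSE, RMSE, and MAE and the highest R-Square of the three models. Because the split is random, the exact values change from run to run.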

