# Regression exercise

This notebook works through a regression exercise completed as part of the Artificial Intelligence Project Management course from Duke University on Coursera.

In [3]:
# Import libraries
import pandas as pd
import numpy as np
import sklearn as sk

In [4]:
# import data
file_path = "data/CCPP_data.csv"
df = pd.read_csv(file_path, sep=',')

print(df.head(10))
print(f"shape: {df.shape}")

      AT      V       AP     RH      PE
0  14.96  41.76  1024.07  73.17  463.26
1  25.18  62.96  1020.04  59.08  444.37
2   5.11  39.40  1012.16  92.14  488.56
3  20.86  57.32  1010.24  76.64  446.48
4  10.82  37.50  1009.23  96.62  473.90
5  26.27  59.44  1012.23  58.77  443.67
6  15.89  43.96  1014.02  75.24  467.35
7   9.48  44.71  1019.12  66.43  478.42
8  14.64  45.00  1021.78  41.25  475.98
9  11.74  43.56  1015.14  70.72  477.50
shape: (9568, 5)


In this project we will build a model to **predict the electrical energy output** of a **Combined Cycle Power Plant**, which uses a combination of gas turbines, steam turbines, and heat recovery steam generators to generate power. We have a set of 9568 hourly average ambient environmental readings from sensors at the power plant which we will use in our model.

The columns in the data consist of hourly average ambient variables:
- Temperature (AT) in the range 1.81°C to 37.11°C
- Ambient Pressure (AP) in the range 992.89 to 1033.30 millibar
- Relative Humidity (RH) in the range 25.56% to 100.16%
- Exhaust Vacuum (V) in the range 25.36 to 81.56 cm Hg
- Net hourly electrical energy output (PE) in the range 420.26 to 495.76 MW (**Target we are trying to predict**)
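A quick sanity check of these ranges could look like the sketch below. The `expected_ranges` dictionary and the `out_of_range` helper are illustrative names that are not part of the original notebook, and the small sample frame just reuses the first three rows printed above; on the real data you would pass `df` instead.

```python
import pandas as pd

# Documented ranges per column (taken from the list above)
expected_ranges = {
    "AT": (1.81, 37.11),
    "AP": (992.89, 1033.30),
    "RH": (25.56, 100.16),
    "V": (25.36, 81.56),
    "PE": (420.26, 495.76),
}

def out_of_range(frame, ranges):
    """Return, per column, how many values fall outside the documented range."""
    counts = {}
    for col, (low, high) in ranges.items():
        # between() is inclusive on both ends by default
        counts[col] = int((~frame[col].between(low, high)).sum())
    return counts

# First three rows of the data set, as printed by df.head(10)
sample = pd.DataFrame({
    "AT": [14.96, 25.18, 5.11],
    "V": [41.76, 62.96, 39.40],
    "AP": [1024.07, 1020.04, 1012.16],
    "RH": [73.17, 59.08, 92.14],
    "PE": [463.26, 444.37, 488.56],
})

print(out_of_range(sample, expected_ranges))
```

A non-zero count for any column would suggest either a data-entry problem or a documented range that does not match the file.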

To complete the project, you must complete each of the steps below in the modeling process.

1) For the problem described in the Project Topic section above, determine what type of machine learning approach is needed and select an appropriate output metric to evaluate performance in accomplishing the task.

2) Determine which possible features we may want to use in the model, and identify the different algorithms we might consider.

3) Split your data to create a test set to evaluate the performance of your final model.  Then, using your training set, determine a validation strategy for comparing different models - a fixed validation set or cross-validation.  Depending on whether you are using Excel, Python or AutoML for your model building, you may need to manually split your data to create the test set and validation set / cross validation folds.

4) Use your validation approach to compare at least two different models (which may be either 1) different algorithms, 2) the same algorithm with different combinations of features, or 3) the same algorithm and features with different values for hyperparameters).  From among the models you compare, select the model with the best performance on your validation set as your final model.

5) Evaluate the performance of your final model using the output metric you defined earlier. 
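Step 3 mentions cross-validation as an alternative to a fixed validation set. This notebook uses a fixed validation set, but for comparison, a minimal k-fold sketch is shown here. It assumes an ordinary least-squares model fitted with NumPy and uses synthetic data; `k_fold_mse`, `X_demo` and `y_demo` are hypothetical names introduced only for illustration.

```python
import numpy as np

def k_fold_mse(X, y, k=5, seed=42):
    """Average validation MSE of an ordinary least-squares fit over k folds."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    folds = np.array_split(indices, k)
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # add an intercept column and solve the least-squares problem
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        A_val = np.column_stack([np.ones(len(val_idx)), X[val_idx]])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        scores.append(np.mean((A_val @ coef - y[val_idx]) ** 2))
    return float(np.mean(scores))

# Synthetic data (an assumption for illustration, not the CCPP data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=200)

cv_mse = k_fold_mse(X_demo, y_demo, k=5)
print(f"5-fold MSE: {cv_mse:.4f}")
```

Every sample is used for validation exactly once, which gives a more stable estimate than a single hold-out set at the cost of fitting the model k times.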

Note: data can be obtained at https://storage.googleapis.com/aipi_datasets/CCPP_data.csv

### Set up vectors

First, let's create the input and output vectors and separate the data into 3 sets:
- training (train), 70%
- validation (cv), 20%, a fixed hold-out set that the rest of this notebook refers to as the cross validation set
- test, 10%

### 3) Data Split

In [39]:
# Let's create 3 data sets using scikit-learn train_test_split
from sklearn.model_selection import train_test_split

# first, we'll use a temp set that we'll then split into cv and test
X_train, X_temp, y_train, y_temp = train_test_split(df.drop('PE', axis=1), df['PE'], test_size=0.3, random_state=42)
X_cv, X_test, y_cv, y_test = train_test_split(X_temp, y_temp, test_size=1/3, random_state=42)

# Let's validate the size of the data sets (number of samples and then number of features)
assert X_train.shape[0] + X_cv.shape[0] + X_test.shape[0] == df.shape[0]
assert X_train.shape[0] == y_train.shape[0]
assert X_cv.shape[0] == y_cv.shape[0]
assert X_test.shape[0] == y_test.shape[0]
assert X_train.shape[1] == df.shape[1] - 1

# validate size of sets:
print(f"Validate {X_train.shape[0]} == {round(df.shape[0]*0.7,0)}")
print(f"Validate {X_cv.shape[0]} == {round(df.shape[0]*0.2,0)}")
print(f"Validate {X_test.shape[0]} == {round(df.shape[0]*0.1,0)}")

Validate 6697 == 6698.0
Validate 1914 == 1914.0
Validate 957 == 957.0
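The training set has 6697 rows rather than the rounded 6698 because of how the split sizes are derived. The sketch below reproduces the arithmetic, assuming that the held-out share is rounded up and the remainder goes to the other side, which is how scikit-learn treats a float `test_size` to the best of my knowledge. `split_sizes` is a hypothetical helper, not part of the notebook.

```python
import math

def split_sizes(n_samples, first_test_size=0.3, second_test_size=1/3):
    """Reproduce the sizes of the two-stage split used above.

    Assumption: the held-out share is rounded up (ceiling) and the
    remainder is assigned to the other side of the split.
    """
    n_temp = math.ceil(n_samples * first_test_size)   # 30% held out first
    n_train = n_samples - n_temp
    n_test = math.ceil(n_temp * second_test_size)     # one third of the 30%
    n_cv = n_temp - n_test
    return n_train, n_cv, n_test

sizes = split_sizes(9568)
print(sizes)
```

With 9568 rows, 30% is 2870.4, which rounds up to 2871 held-out rows and leaves 6697 for training; one third of 2871 is 957 for the test set and the remaining 1914 form the validation set.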


### 1-2) Modeling approach

We know that this model needs to **predict a continuous output** (the energy output in MW), not assign a class label, so this is a regression problem and classification algorithms are ruled out.

#### Feature selection

To define what features to use, however, we need to gain some insight into the problem. If we look at the source information, we can gather that "The base load operation of a power plant is influenced by four main parameters: ambient temperature, atmospheric pressure, relative humidity, and exhaust steam pressure". This means that **all 4 features are relevant to the problem**.

#### Model Selection

Because this is a regression problem, it makes sense to start with a **Linear Regression** model as the baseline against which to compare other models. We should also evaluate tree-based ensembles, like **random forest** and **XGBoost**. To measure how much these two ensembles improve on a single **Decision Tree**, we evaluate that model as well.

#### Model evaluation

To evaluate the models, we need to measure the error between the predictions of our model and the actual data in the cross validation and test sets. We can expect the solution to be affected by outliers, and we want large deviations to weigh heavily, so we should **prefer a measure of error like MSE**. A user who knows the capacity of the generator might get a clearer idea of the error from a deviation expressed in MW (for example, the square root of the MSE). However, for someone not familiar with the generator, a percentage error metric like MAPE can be more revealing. We should also use **R-squared** to determine how much of the variability of the target variable (y) is captured by the model.
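For reference, the three metrics can be written as below. The notation is my own assumption rather than the notebook's: $y_i$ is an actual value, $\hat{y}_i$ the corresponding prediction, $\bar{y}$ the mean of the actual values, and $n$ the number of samples.

```latex
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2
\qquad
\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
```

MSE is expressed in squared units of the target (MW² here), MAPE is a unitless ratio usually reported as a percentage, and $R^2$ equals 1 for a perfect fit and 0 for a model that always predicts $\bar{y}$.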

### 4) Model Comparison

Let's start with **Linear Regression** using scikit-learn:

In [107]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_percentage_error

# let's train the model
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

# now, we'll predict the values for the output with the cross validation set
y_hat_lr = lin_reg.predict(X_cv)

# and now we can evaluate the MSE, R-squared and MAPE for the prediction
# (R-squared is computed here as the squared Pearson correlation between y and y_hat)
mse_lin_reg = (np.square(y_hat_lr - y_cv)).mean(axis=0)
r_2_lin_reg = np.corrcoef(y_cv,y_hat_lr)[0,1]**2
mape_lin_reg = mean_absolute_percentage_error(y_cv,y_hat_lr)

print(f"MSE linear_regression: {mse_lin_reg:.2f}")
print(f"R-squared linear regression: {r_2_lin_reg:.2f}")
print(f"MAPE linear regression: {mape_lin_reg:.2%}")

MSE linear_regression: 21.13
R-squared linear regression: 0.93
MAPE linear regression: 0.81%
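The cells in this notebook compute R-squared as the squared Pearson correlation between actual and predicted values. That quantity matches the coefficient of determination (1 minus residual over total sum of squares) for a least-squares linear fit evaluated on its own training data, but the two can differ elsewhere, for example when predictions are systematically biased. The sketch below illustrates the difference with made-up numbers; `r2_determination` and `r2_correlation` are hypothetical helper names.

```python
import numpy as np

def r2_determination(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def r2_correlation(y_true, y_pred):
    """Squared Pearson correlation, as computed in the notebook cells."""
    return np.corrcoef(y_true, y_pred)[0, 1] ** 2

# Made-up values: predictions follow the trend exactly but are biased by +5
y_true = np.array([440.0, 450.0, 460.0, 470.0, 480.0])
y_biased = y_true + 5.0

print(r2_correlation(y_true, y_biased))
print(r2_determination(y_true, y_biased))
```

In this example the squared correlation stays at (numerically) 1 because the bias does not affect correlation, while the coefficient of determination drops to 0.875. When the two values are close, as is likely for the models compared here, the choice makes little practical difference.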


**Decision Tree**

We will evaluate 2 different depths. As the depth increases, the model becomes more complex and may overfit. Let's compare a max_depth of 5 and 15.
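One way to spot overfitting is to compare training and validation error as the depth grows. The sketch below does this on synthetic noisy data (an assumption for illustration, not the CCPP data); the variable names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data (an assumption for illustration, not the CCPP data)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(600, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.3, size=600)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

results = {}
for depth in (2, 5, 15):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    tree.fit(X_tr, y_tr)
    train_mse = np.mean((tree.predict(X_tr) - y_tr) ** 2)
    val_mse = np.mean((tree.predict(X_val) - y_val) ** 2)
    results[depth] = (train_mse, val_mse)
    print(f"depth {depth:>2}: train MSE {train_mse:.3f}, validation MSE {val_mse:.3f}")
```

Training error keeps falling as the depth increases. A widening gap between training and validation error is the usual sign that the extra depth is fitting noise rather than signal.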

In [108]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_percentage_error

# let's train the model
tree_reg = DecisionTreeRegressor(max_depth=5)
tree_reg.fit(X_train, y_train)

# now, we'll predict the values for the output with the cross validation set
y_hat_dt = tree_reg.predict(X_cv)

# and finally we can evaluate the MSE and R-squared for the prediction
mse_dt = (np.square(y_hat_dt - y_cv)).mean(axis=0)
r_2_dt = np.corrcoef(y_cv,y_hat_dt)[0,1]**2
mape_dt = mean_absolute_percentage_error(y_cv,y_hat_dt)

print(f"MSE Decision_Tree depth 5: {mse_dt:.2f}")
print(f"R-squared decision tree depth 5: {r_2_dt:.2f}")
print(f"MAPE decision tree depth 5: {mape_dt:.2%}")

MSE Decision_Tree depth 5: 19.99
R-squared decision tree depth 5: 0.93
MAPE decision tree depth 5: 0.77%


In [109]:
# let's train the model
tree_reg = DecisionTreeRegressor(max_depth=15)
tree_reg.fit(X_train, y_train)

# now, we'll predict the values for the output with the cross validation set
y_hat_dt = tree_reg.predict(X_cv)

# and finally we can evaluate the MSE and R-squared for the prediction
mse_dt = (np.square(y_hat_dt - y_cv)).mean(axis=0)
r_2_dt = np.corrcoef(y_cv,y_hat_dt)[0,1]**2
mape_dt = mean_absolute_percentage_error(y_cv,y_hat_dt)

print(f"MSE Decision_Tree depth 15: {mse_dt:.2f}")
print(f"R-squared decision tree depth 15: {r_2_dt:.2f}")
print(f"MAPE decision tree depth 15: {mape_dt:.2%}")

MSE Decision_Tree depth 15: 19.06
R-squared decision tree depth 15: 0.94
MAPE decision tree depth 15: 0.67%


The MSE drops from 21.13 with Linear Regression to 19.99 at depth 5 and 19.06 at depth 15, so increasing the depth helps on this validation set. However, the gain over a simple Linear Regression is modest.

#### Random Forest

We will now try a Random Forest regression with 500 trees and a maximum of 20 leaf nodes per tree (`max_leaf_nodes=20`). Averaging more trees should make the predictions more stable, and capping the number of leaf nodes keeps each tree simple, which limits overfitting.

In [110]:
from sklearn.ensemble import RandomForestRegressor

# let's train the model
rnd_fst = RandomForestRegressor(n_estimators = 500, max_leaf_nodes=20, n_jobs=-1)
rnd_fst.fit(X_train, y_train)

# now, we'll predict with the cross validation set
y_hat_rnd_fst = rnd_fst.predict(X_cv)

# evaluate the MSE and R-squared for the prediction
mse_rnd = (np.square(y_hat_rnd_fst - y_cv)).mean(axis=0)
r_2_rnd = np.corrcoef(y_cv,y_hat_rnd_fst)[0,1]**2
mape_rnd = mean_absolute_percentage_error(y_cv,y_hat_rnd_fst)

print(f"MSE Random Forest: {mse_rnd:.2f}")
print(f"R-squared Random Forest: {r_2_rnd:.2f}")
print(f"MAPE Random Forest: {mape_rnd:.2%}")

MSE Random Forest: 19.08
R-squared Random Forest: 0.94
MAPE Random Forest: 0.75%


The performance is good, but it is essentially the same as that of the best Decision Tree above (MSE 19.08 versus 19.06).

#### XGBoost

Now we'll try the XGBoost algorithm, which builds shallow trees in sequence, each one trained to correct the errors of the trees before it.
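To make the idea of sequential error correction concrete, here is a minimal sketch of plain gradient boosting for squared error, using depth-1 trees (stumps) on a single feature. It is not XGBoost itself, which adds regularization and other refinements; all function names and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single threshold split on x that best fits the residuals."""
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the max so both sides are non-empty
        left = residual[x <= threshold]
        right = residual[x > threshold]
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    return best[1:]

def predict_stump(stump, x):
    """Predict the left or right leaf value depending on the threshold."""
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def boost(x, y, n_rounds=50, learning_rate=0.1):
    """Sequentially fit stumps to the residuals of the current ensemble."""
    base = y.mean()
    prediction = np.full_like(y, base)
    stumps = []
    for _ in range(n_rounds):
        residual = y - prediction          # errors left by the previous trees
        stump = fit_stump(x, residual)
        prediction = prediction + learning_rate * predict_stump(stump, x)
        stumps.append(stump)
    return base, stumps, prediction

# Synthetic data (an assumption for illustration, not the CCPP data)
rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 10, size=200)
y_demo = np.sin(x_demo) + rng.normal(scale=0.1, size=200)

base, stumps, fitted = boost(x_demo, y_demo)
mse_start = np.mean((y_demo - y_demo.mean()) ** 2)
mse_end = np.mean((y_demo - fitted) ** 2)
print(f"MSE before boosting: {mse_start:.3f}, after 50 rounds: {mse_end:.3f}")
```

Each round fits a new stump to what the current ensemble still gets wrong and adds a damped version of it, so the training error shrinks step by step.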

In [111]:
from xgboost import XGBRegressor

# let's train the model
xgb_tree = XGBRegressor()
xgb_tree.fit(X_train, y_train)

# now, we'll predict with the cross validation set
y_hat_xgb = xgb_tree.predict(X_cv)

# evaluate the MSE and R-squared for the prediction
mse_xgb = (np.square(y_hat_xgb - y_cv)).mean(axis=0)
r_2_xgb = np.corrcoef(y_cv,y_hat_xgb)[0,1]**2
mape_xgb = mean_absolute_percentage_error(y_cv,y_hat_xgb)

print(f"MSE XGBoost: {mse_xgb:.2f}")
print(f"R-squared XGBoost: {r_2_xgb:.2f}")
print(f"MAPE XGBoost: {mape_xgb:.2%}")

MSE XGBoost: 10.11
R-squared XGBoost: 0.97
MAPE XGBoost: 0.51%


This algorithm greatly improves both the level of error (MSE 10.11, roughly half that of the other models) and the share of variance captured by the model (R-squared 0.97).

### 5) Final model - performance evaluation

In [112]:
# test output with test set
y_hat = xgb_tree.predict(X_test)

# evaluate the MSE and R-squared for the prediction
mse = (np.square(y_hat - y_test)).mean(axis=0)
r_2 = np.corrcoef(y_test,y_hat)[0,1]**2
mape = mean_absolute_percentage_error(y_test,y_hat)

print(f"MSE XGBoost test: {mse:.2f}")
print(f"R-squared XGBoost test: {r_2:.2f}")
print(f"MAPE XGBoost test: {mape:.2%}")

MSE XGBoost test: 10.96
R-squared XGBoost test: 0.96
MAPE XGBoost test: 0.50%


The performance on the test set is close to that on the cross validation set (MSE 10.96 versus 10.11), so the model holds up on data it has not seen.

#### **Is a tree-based algorithm a good solution for this problem?**

We know that tree-based models are not hard to train and are quite useful when we don't have extensive data. These models will capture patterns easily even in small data sets like this one.

However, we also know that, given the way tree-based algorithms work, they might not extrapolate well when input data falls beyond the boundaries defined by the training set. 
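The extrapolation limit can be illustrated with a small sketch. It assumes a made-up linear relationship observed only for inputs between 0 and 10, and compares a decision tree with a linear regression on inputs well outside that range; the data and variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Made-up linear relationship, observed only for x between 0 and 10
X_seen = np.linspace(0, 10, 101).reshape(-1, 1)
y_seen = 2.0 * X_seen[:, 0] + 1.0

tree = DecisionTreeRegressor(max_depth=5, random_state=42).fit(X_seen, y_seen)
line = LinearRegression().fit(X_seen, y_seen)

# Inputs far beyond the training range
X_outside = np.array([[20.0], [40.0]])
tree_pred = tree.predict(X_outside)
line_pred = line.predict(X_outside)
print("tree:  ", tree_pred)
print("linear:", line_pred)
```

The tree returns the value of its right-most leaf for both inputs, so its predictions never exceed the largest target seen in training, while the linear model continues the trend. The same behaviour applies to ensembles of trees, since they combine leaf values learned from the training data.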

In this case, we are trying to predict the output of a machine, so it is reasonably safe to assume that the algorithm only needs to predict behaviour within the boundaries of normal operation. If the data set includes sufficient data to properly reflect normal working conditions, then this algorithm should provide a good solution.