The model is developed with the Python 'pandas' and 'scikit-learn' libraries.

# Steps

1. Load Data 
2. Understand & Prepare Data
3. Select Features
4. Choose Algorithm(s)
5. Train Model(s)
6. Evaluate Model(s)



# Loading Data

In [None]:
import pandas as pd
import numpy as np
import os
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('ggplot')



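# Resolve the data file relative to the notebook's working directory
# (__file__ is not defined inside a notebook, so use the current directory)
notebook_dir = os.getcwd()
# No leading slash on the file name, otherwise os.path.join discards notebook_dir
ccpp = pd.read_csv(os.path.join(notebook_dir, "CCPP_data.csv"))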

# Basic Data Analysis 

In [None]:
ccpp.head(10)
# provides first 10 rows


### Columns 

- AT - Ambient Temperature
- V - Exhaust Vacuum
- AP - Ambient Pressure
- RH - Relative Humidity 
- PE - Energy Output

In [None]:
ccpp.rename(columns = {'AT':'Temp', 'V':'Vacuum', 'AP': 'Pressure', 'RH':'Humidity', 'PE': 'Energy'}, inplace=True)

In [None]:
ccpp.info()
# provides information such as number of columns, data types and null values

In [None]:
ccpp.corr()

In [None]:
sns.pairplot(ccpp, vars = ['Temp', 'Pressure', 'Vacuum', 'Humidity', 'Energy'])
plt.show()

# Features and Correlation 

### Strong Correlation

- (+) Temp and Vacuum
- (-) Energy and Temp
- (-) Energy and Vacuum

### Moderate Correlation

- (-) Temp and Pressure
- (-) Temp and Humidity
- (+) Pressure and Energy
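
As a hedged sketch of how such pairs can be ranked numerically, the snippet below sorts each feature's Pearson correlation with the target by absolute value. The `demo` DataFrame holds synthetic, made-up values (not the CCPP measurements), and the name `target_corr` is only illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT the CCPP measurements): Energy falls as Temp rises,
# and Vacuum rises with Temp, mimicking the signs listed above.
rng = np.random.default_rng(0)
temp = rng.uniform(2, 37, size=500)
demo = pd.DataFrame({
    "Temp": temp,
    "Vacuum": 25 + 1.2 * temp + rng.normal(0, 5, size=500),
    "Humidity": rng.uniform(25, 100, size=500),
    "Energy": 500 - 2.0 * temp + rng.normal(0, 4, size=500),
})

# Pearson correlation of every feature with the target, strongest first.
target_corr = (
    demo.corr()["Energy"]
    .drop("Energy")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(target_corr)
```

On the real data, the same expression applied to `ccpp` would list the features in order of their linear association with 'Energy'.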
 

# Models

- Linear Regression
- Lasso Regression
- Ridge Regression
- Support Vector Regression (SVR)
- Random Forest Regression

# Output Metrics

- RMSE
- R^2
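
To make the two metrics concrete, here is a minimal sketch that computes them by hand with NumPy. The helper names `rmse` and `r2` and the four sample values are assumptions chosen for illustration; they are not taken from the dataset.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up values, in the same units as the target
y_true = [450.0, 460.0, 470.0, 480.0]
y_pred = [452.0, 458.0, 471.0, 479.0]

print(rmse(y_true, y_pred))
print(r2(y_true, y_pred))
```

RMSE is expressed in the units of the target, so lower is better, while R^2 is unitless and approaches 1 as the predictions explain more of the target's variance.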


In [None]:
# Split the data into features and target variable
X = ccpp.drop('Energy', axis=1)
y = ccpp['Energy']

In [None]:
# splitting the data into training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
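
The scaler is fitted on the training split only and then reused on the test split, so the test rows are standardized with the training means and standard deviations. A small sketch with made-up toy arrays (`train` and `test` are illustrative, not CCPP data) shows the effect:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up toy values, two columns, only for illustration
train = np.array([[10.0, 1000.0], [20.0, 1010.0], [30.0, 1020.0]])
test = np.array([[40.0, 1030.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train, then scales it
test_scaled = scaler.transform(test)        # reuses the training statistics

print(scaler.mean_)               # per-column means learned from the training rows
print(train_scaled.mean(axis=0))  # approximately zero for every column
print(test_scaled)                # not centered on zero: scaled with training statistics
```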

# General Model Training Approach

1. Create the model object with scikit-learn
2. Pass it to the 'cross_val_score' function, which trains and validates the model on each fold of the training data
3. Use 5-fold cross-validation by passing the parameter 'cv=5'
4. Select the output metric with the 'scoring' parameter ('neg_root_mean_squared_error' or 'r2')
5. Average the fold scores (flipping the sign of the negated RMSE) and add the result to a table (pandas DataFrame) to display at the end
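
The steps above can be sketched as one reusable helper. This is a variant, not the notebook's exact code: it uses `cross_validate` to collect both metrics in a single run and wraps the model in a pipeline so that scaling is refitted inside each fold. The helper name `cv_scores` and the synthetic `make_regression` data are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def cv_scores(model, X, y, cv=5):
    """Hypothetical helper: mean RMSE and R^2 from one k-fold cross-validation run."""
    # Scaling inside the pipeline is fitted on each training fold only.
    pipeline = make_pipeline(StandardScaler(), model)
    scores = cross_validate(
        pipeline, X, y, cv=cv,
        scoring=("neg_root_mean_squared_error", "r2"),
    )
    # scikit-learn reports RMSE negated so that higher is always better; flip the sign.
    mean_rmse = -np.mean(scores["test_neg_root_mean_squared_error"])
    mean_r2 = np.mean(scores["test_r2"])
    return mean_rmse, mean_r2

# Synthetic regression data standing in for the real features and target
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=5.0, random_state=0)
demo_rmse, demo_r2 = cv_scores(LinearRegression(), X_demo, y_demo)
print(demo_rmse, demo_r2)
```

Collecting both metrics from one call avoids fitting every model twice, which matters most for the slower models such as SVR and random forests.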

# Linear Regression 

In [11]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

# Use a distinct variable name so the LinearRegression class is not overwritten
linear_model = LinearRegression()

linear_RMSE = -np.mean(cross_val_score(linear_model, X_train, y_train, cv=5, scoring='neg_root_mean_squared_error'))
linear_r2 = np.mean(cross_val_score(linear_model, X_train, y_train, cv=5, scoring='r2'))

print(linear_RMSE)
print(linear_r2)

# Creating a dataframe to add the results to 

results = pd.DataFrame({
    'Model': ['Linear Regression'],
    'RMSE': [linear_RMSE],
    'R2': [linear_r2]
})


4.573755155822281
0.927565122121436


# Lasso Regression 

In [12]:
from sklearn.linear_model import Lasso
LassoRegression = Lasso(alpha=0.1)

lasso_RMSE = -np.mean(cross_val_score(LassoRegression, X_train, y_train, cv=5, scoring='neg_root_mean_squared_error'))
lasso_r2 = np.mean(cross_val_score(LassoRegression, X_train, y_train, cv=5, scoring='r2'))

print(lasso_RMSE)
print(lasso_r2)

new_row = {'Model':'Lasso Regression', 'RMSE':lasso_RMSE,'R2':lasso_r2}
results.loc[len(results)] = new_row

4.5784552331727655
0.927413186699208


In [13]:
results

| | Model | RMSE | R2 |
|---|---|---|---|
| 0 | Linear Regression | 4.573755 | 0.927565 |
| 1 | Lasso Regression | 4.578455 | 0.927413 |


# Ridge Regression

In [14]:
from sklearn.linear_model import Ridge
RidgeRegression = Ridge(alpha=1.0)

ridge_RMSE = -np.mean(cross_val_score(RidgeRegression, X_train, y_train, cv=5, scoring='neg_root_mean_squared_error'))
ridge_r2 = np.mean(cross_val_score(RidgeRegression, X_train, y_train, cv=5, scoring='r2'))

print(ridge_RMSE)
print(ridge_r2)

new_row = {'Model':'Ridge Regression', 'RMSE':ridge_RMSE,'R2':ridge_r2}
results.loc[len(results)] = new_row

4.573759127918946
0.927564972249615


In [15]:
results

| | Model | RMSE | R2 |
|---|---|---|---|
| 0 | Linear Regression | 4.573755 | 0.927565 |
| 1 | Lasso Regression | 4.578455 | 0.927413 |
| 2 | Ridge Regression | 4.573759 | 0.927565 |


# SVR 

In [16]:
from sklearn.svm import SVR

# Use a distinct variable name so the SVR class is not overwritten
svr_model = SVR(kernel='rbf')

svr_RMSE = -np.mean(cross_val_score(svr_model, X_train, y_train, cv=5, scoring='neg_root_mean_squared_error'))
svr_r2 = np.mean(cross_val_score(svr_model, X_train, y_train, cv=5, scoring='r2'))

print(svr_RMSE)
print(svr_r2)

new_row = {'Model':'SVR', 'RMSE':svr_RMSE,'R2':svr_r2}
results.loc[len(results)] = new_row

4.22398763806733
0.9381987406704825


In [17]:
results

| | Model | RMSE | R2 |
|---|---|---|---|
| 0 | Linear Regression | 4.573755 | 0.927565 |
| 1 | Lasso Regression | 4.578455 | 0.927413 |
| 2 | Ridge Regression | 4.573759 | 0.927565 |
| 3 | SVR | 4.223988 | 0.938199 |


# Random Forest Regression

In [18]:
from sklearn.ensemble import RandomForestRegressor

RandomForestRegression = RandomForestRegressor(n_estimators=100, random_state=42)

rfr_RMSE = -np.mean(cross_val_score(RandomForestRegression, X_train, y_train, cv=5, scoring='neg_root_mean_squared_error'))
rfr_r2 = np.mean(cross_val_score(RandomForestRegression, X_train, y_train, cv=5, scoring='r2'))

print(rfr_RMSE)
print(rfr_r2)

new_row = {'Model':'Random Forest Regression', 'RMSE':rfr_RMSE,'R2':rfr_r2}
results.loc[len(results)] = new_row

3.514742994716528
0.9572023910460457


In [19]:
results

| | Model | RMSE | R2 |
|---|---|---|---|
| 0 | Linear Regression | 4.573755 | 0.927565 |
| 1 | Lasso Regression | 4.578455 | 0.927413 |
| 2 | Ridge Regression | 4.573759 | 0.927565 |
| 3 | SVR | 4.223988 | 0.938199 |
| 4 | Random Forest Regression | 3.514743 | 0.957202 |


# Conclusion 

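The 5-fold cross-validation on the training set favors **'Random Forest Regression'** (RMSE of about 3.51, R^2 of about 0.957) as the most accurate of the five machine learning models for predicting the energy output (PE) from the given inputs. SVR ranks second (RMSE of about 4.22), while Linear, Lasso and Ridge Regression perform almost identically (RMSE of about 4.57 to 4.58).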
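
The held-out test split created earlier could be used to check the chosen model on unseen data. The following is a minimal sketch under stated assumptions: it uses synthetic `make_regression` data rather than the CCPP file, and the names `X_tr`, `X_te`, `test_rmse` and `test_r2` are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with four features (NOT the CCPP measurements)
X_demo, y_demo = make_regression(n_samples=300, n_features=4, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

# Fit on the training split only, then score once on the untouched test split
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_tr, y_tr)
pred = model.predict(X_te)

test_rmse = float(np.sqrt(mean_squared_error(y_te, pred)))
test_r2 = r2_score(y_te, pred)
print(test_rmse, test_r2)
```

A test-set score close to the cross-validation score would suggest that the cross-validated ranking generalizes; a much worse score would point to overfitting.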