#  Introduction
___

This notebook outlines the process of replicating the findings from the provided study on predicting power consumption in Tetouan city, Morocco. The original study used various machine learning algorithms to forecast electricity usage based on factors such as weather conditions and time variables. 

My approach here is divided into two main parts:

#### 1- Part one: Reproducing the Original Results: 

The first part focuses on following the methods detailed in the original paper to replicate the results. This includes using the same features, machine learning models, and parameters as those used by the authors. The goal is to validate the original findings by reproducing the results shown in Tables II and IV.

#### 2- Part two: Developing a New Solution: 

In the second part, I propose a new solution for predicting power consumption. This approach differs from the original study by experimenting with different feature selection methods, model optimizations, or new machine learning algorithms. The aim is to achieve improved or comparable performance, measured by RMSE and MAE, and to provide new insights.

# Part one : Reproducing the Original Results: 
___

## 1) Import the necessary libraries
___

In [1]:
import pandas as pd

## 2) Upload the dataset
___

In [2]:
df = pd.read_csv("Tetuan City power consumption.csv")

## 3) Explore the data
___


In [3]:
## print first 5 rows
df.head()

Unnamed: 0,DateTime,Temperature,Humidity,Wind Speed,general diffuse flows,diffuse flows,Zone 1 Power Consumption,Zone 2 Power Consumption,Zone 3 Power Consumption
0,1/1/2017 0:00,6.559,73.8,0.083,0.051,0.119,34055.6962,16128.87538,20240.96386
1,1/1/2017 0:10,6.414,74.5,0.083,0.07,0.085,29814.68354,19375.07599,20131.08434
2,1/1/2017 0:20,6.313,74.5,0.08,0.062,0.1,29128.10127,19006.68693,19668.43373
3,1/1/2017 0:30,6.121,75.0,0.083,0.091,0.096,28228.86076,18361.09422,18899.27711
4,1/1/2017 0:40,5.921,75.7,0.081,0.048,0.085,27335.6962,17872.34043,18442.40964


In [4]:
## print the last 5 in tail

df.tail()

Unnamed: 0,DateTime,Temperature,Humidity,Wind Speed,general diffuse flows,diffuse flows,Zone 1 Power Consumption,Zone 2 Power Consumption,Zone 3 Power Consumption
52411,12/30/2017 23:10,7.01,72.4,0.08,0.04,0.096,31160.45627,26857.3182,14780.31212
52412,12/30/2017 23:20,6.947,72.6,0.082,0.051,0.093,30430.41825,26124.57809,14428.81152
52413,12/30/2017 23:30,6.9,72.8,0.086,0.084,0.074,29590.87452,25277.69254,13806.48259
52414,12/30/2017 23:40,6.758,73.0,0.08,0.066,0.089,28958.1749,24692.23688,13512.60504
52415,12/30/2017 23:50,6.58,74.1,0.081,0.062,0.111,28349.80989,24055.23167,13345.4982


In [5]:
## print the data info ()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52416 entries, 0 to 52415
Data columns (total 9 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   DateTime                   52416 non-null  object 
 1   Temperature                52416 non-null  float64
 2   Humidity                   52416 non-null  float64
 3   Wind Speed                 52416 non-null  float64
 4   general diffuse flows      52416 non-null  float64
 5   diffuse flows              52416 non-null  float64
 6   Zone 1 Power Consumption   52416 non-null  float64
 7   Zone 2  Power Consumption  52416 non-null  float64
 8   Zone 3  Power Consumption  52416 non-null  float64
dtypes: float64(8), object(1)
memory usage: 3.6+ MB


## 4) Data Preprocessing
___

In [6]:
## Convert datetime columns to date time format instead of an object format
df['DateTime'] = pd.to_datetime(df['DateTime'])
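The conversion above lets pandas infer the date layout. As a hedged alternative, the format can be stated explicitly so that month and day cannot be swapped silently. The sketch below uses a hypothetical three-row sample (`sample`) that mirrors the strings shown by `df.head()` and `df.tail()`; the format string is an assumption based on those rows.

```python
import pandas as pd

# Hypothetical sample mirroring the DateTime strings shown in the outputs above
sample = pd.DataFrame(
    {"DateTime": ["1/1/2017 0:00", "1/1/2017 0:10", "12/30/2017 23:50"]}
)

# Explicit month/day/year format (assumed from the sample rows)
sample["DateTime"] = pd.to_datetime(sample["DateTime"], format="%m/%d/%Y %H:%M")

# The first two readings should be 10 minutes apart
first_step = sample["DateTime"].iloc[1] - sample["DateTime"].iloc[0]
print(sample["DateTime"].dtype, first_step)
```

If a row does not match the format, `pd.to_datetime` raises an error instead of guessing, which makes malformed timestamps easier to spot.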

In [7]:
## Extracting new features from the datetime column
df['Year'] = df['DateTime'].dt.year
df['Month'] = df['DateTime'].dt.month
df['Day'] = df['DateTime'].dt.day
df['Hour'] = df['DateTime'].dt.hour
df['DayOfWeek'] = df['DateTime'].dt.dayofweek
df['Quarter'] = df['DateTime'].dt.quarter

In [8]:
# let's review newly added features and compare vs datetime col
df.head().iloc[:, [0] + list(range(-6, 0))]

Unnamed: 0,DateTime,Year,Month,Day,Hour,DayOfWeek,Quarter
0,2017-01-01 00:00:00,2017,1,1,0,6,1
1,2017-01-01 00:10:00,2017,1,1,0,6,1
2,2017-01-01 00:20:00,2017,1,1,0,6,1
3,2017-01-01 00:30:00,2017,1,1,0,6,1
4,2017-01-01 00:40:00,2017,1,1,0,6,1


In [9]:
df.tail().iloc[:, [0] + list(range(-6, 0))]

Unnamed: 0,DateTime,Year,Month,Day,Hour,DayOfWeek,Quarter
52411,2017-12-30 23:10:00,2017,12,30,23,5,4
52412,2017-12-30 23:20:00,2017,12,30,23,5,4
52413,2017-12-30 23:30:00,2017,12,30,23,5,4
52414,2017-12-30 23:40:00,2017,12,30,23,5,4
52415,2017-12-30 23:50:00,2017,12,30,23,5,4


#### looks fine

## 5) ML Models 
___

As per the original study, we will explore the following models:

- 1 - Linear Regression (LR)
- 2 - Decision Tree (DT)
- 3 - Random Forest (RF)
- 4 - Support Vector Regression (SVR)
- 5 - Feedforward Neural Network (FFNN)

### 5.1 Linear Regression

In [10]:
## Import Necessary Libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np

In [11]:
## Defining the features and target variables
features = df[['Temperature', 'Humidity', 'Wind Speed', 'general diffuse flows', 'diffuse flows', 
                    'Month', 'Day', 'Hour', 'DayOfWeek', 'Quarter']]
target = df[['Zone 1 Power Consumption', 'Zone 2  Power Consumption', 'Zone 3  Power Consumption']]

In [12]:
## Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.25, random_state=42)

In [13]:
## Initialize and train the LR Model
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

In [14]:
## Predict the target variables for the training and test sets
y_pred_train = lr_model.predict(X_train)
y_pred_test = lr_model.predict(X_test)

In [15]:
## Model evaluation
train_rmse = np.sqrt(mean_squared_error(y_train, y_pred_train))
test_rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))

train_mae = mean_absolute_error(y_train, y_pred_train)
test_mae = mean_absolute_error(y_test, y_pred_test)

In [16]:
# storing the results in a DataFrame
results_df = pd.DataFrame({
    'Model': ['Linear Regression'],
    'Train RMSE': [train_rmse],
    'Test RMSE': [test_rmse],
    'Train MAE': [train_mae],
    'Test MAE': [test_mae]
})

# Display the results
results_df

Unnamed: 0,Model,Train RMSE,Test RMSE,Train MAE,Test MAE
0,Linear Regression,3928.744725,3922.486133,3091.692879,3090.703536
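Because `target` holds three columns, the RMSE and MAE above are single figures averaged over all zones. A minimal sketch of how per-zone figures could be obtained with the `multioutput="raw_values"` option of the scikit-learn metrics follows; the toy arrays `y_true` and `y_pred` are hypothetical and are not values from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical toy targets: 4 samples x 3 zones
y_true = np.array([[10.0, 20.0, 30.0],
                   [12.0, 22.0, 28.0],
                   [11.0, 19.0, 31.0],
                   [13.0, 21.0, 29.0]])
# Constant error of +1, -2 and +3 in zones 1, 2 and 3 respectively
y_pred = y_true + np.array([1.0, -2.0, 3.0])

# Aggregate figure: MSE averaged uniformly over the zones, then square root
aggregate_rmse = np.sqrt(mean_squared_error(y_true, y_pred))

# Per-zone figures: one score per target column
zone_rmse = np.sqrt(mean_squared_error(y_true, y_pred, multioutput="raw_values"))
zone_mae = mean_absolute_error(y_true, y_pred, multioutput="raw_values")

print(aggregate_rmse, zone_rmse, zone_mae)
```

In the notebook, passing `y_test` and `y_pred_test` to the same calls would give one RMSE and one MAE per zone, which is the layout the original study reports.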


### 5.2 Decision Tree

In [17]:
## Import Necessary Libraries
from sklearn.tree import DecisionTreeRegressor

In [18]:
## Initialize and train the DT model
dt_model = DecisionTreeRegressor(random_state=42)
dt_model.fit(X_train, y_train)

In [19]:
## doing the predictions on X variables
y_pred_train_dt = dt_model.predict(X_train)
y_pred_test_dt = dt_model.predict(X_test)

In [20]:
## Evaluate the model
train_rmse_dt = np.sqrt(mean_squared_error(y_train, y_pred_train_dt))
test_rmse_dt = np.sqrt(mean_squared_error(y_test, y_pred_test_dt))

train_mae_dt = mean_absolute_error(y_train, y_pred_train_dt)
test_mae_dt = mean_absolute_error(y_test, y_pred_test_dt)

In [21]:
## concat the results with the previous DataFrame
new_results = pd.DataFrame({
    'Model': ['Decision Tree'],
    'Train RMSE': [train_rmse_dt],
    'Test RMSE': [test_rmse_dt],
    'Train MAE': [train_mae_dt],
    'Test MAE': [test_mae_dt]
})

results_df = pd.concat([results_df, new_results], ignore_index=True)
results_df

Unnamed: 0,Model,Train RMSE,Test RMSE,Train MAE,Test MAE
0,Linear Regression,3928.744725,3922.486133,3091.692879,3090.703536
1,Decision Tree,0.0,1007.271454,0.0,598.238782


### 5.3 Random Forest

In [22]:
## Import Necessary Libraries
from sklearn.ensemble import RandomForestRegressor

In [23]:
## Initialize and train the RF model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

In [24]:
## doing the predictions on X variables
y_pred_train_rf = rf_model.predict(X_train)
y_pred_test_rf = rf_model.predict(X_test)

In [25]:
## Evaluate the model
train_rmse_rf = np.sqrt(mean_squared_error(y_train, y_pred_train_rf))
test_rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_test_rf))

train_mae_rf = mean_absolute_error(y_train, y_pred_train_rf)
test_mae_rf = mean_absolute_error(y_test, y_pred_test_rf)

In [26]:
## concat the results with the previous DataFrame
new_results_rf = pd.DataFrame({
    'Model': ['Random Forest'],
    'Train RMSE': [train_rmse_rf],
    'Test RMSE': [test_rmse_rf],
    'Train MAE': [train_mae_rf],
    'Test MAE': [test_mae_rf]
})

# Use pandas.concat to add the new results to the results_df DataFrame
results_df = pd.concat([results_df, new_results_rf], ignore_index=True)

# Display the updated results
results_df

Unnamed: 0,Model,Train RMSE,Test RMSE,Train MAE,Test MAE
0,Linear Regression,3928.744725,3922.486133,3091.692879,3090.703536
1,Decision Tree,0.0,1007.271454,0.0,598.238782
2,Random Forest,278.117824,734.053266,173.554751,467.312487


### 5.4 Support Vector Regression
___

The case study has three target variables (Zone 1, Zone 2, and Zone 3), one per zone. The SVR model handles only one target variable at a time, so we iterate over the zones and fit an independent model for each zone's power consumption. Each model then focuses on its own target, and the per-zone results are easy to compare.
___

Reference: Drucker, H., Burges, C. J., Kaufman, L., Smola, A., & Vapnik, V. (1997). Support Vector Regression Machines. In Advances in Neural Information Processing Systems (pp. 155-161).
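The one-model-per-zone idea can also be expressed with scikit-learn's `MultiOutputRegressor`, which fits a separate copy of a single-target estimator for each target column. This is a sketch on synthetic data, not the notebook's implementation; `X_demo`, `Y_demo`, and `multi_svr` are hypothetical names, and the SVR settings simply mirror the ones used in the loop further down.

```python
import numpy as np
from sklearn.multioutput import MultiOutputRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data: 200 samples, 4 features, 3 targets
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
Y_demo = np.column_stack([
    X_demo[:, 0],
    X_demo[:, 1] * 2.0,
    X_demo[:, 2] - X_demo[:, 3],
])

# One scaled SVR per target column, fitted independently
multi_svr = MultiOutputRegressor(
    make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10, gamma=0.01))
)
multi_svr.fit(X_demo, Y_demo)

pred_demo = multi_svr.predict(X_demo)
print(pred_demo.shape, len(multi_svr.estimators_))
```

The explicit loop keeps per-zone metrics easy to collect, while the wrapper is more compact when only the combined predictions are needed.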


In [27]:
## Import Necessary Libraries
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler

In [28]:
## Standardize the features since SVR is sensitive to feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [30]:
df.columns

Index(['DateTime', 'Temperature', 'Humidity', 'Wind Speed',
       'general diffuse flows', 'diffuse flows', 'Zone 1 Power Consumption',
       'Zone 2  Power Consumption', 'Zone 3  Power Consumption', 'Year',
       'Month', 'Day', 'Hour', 'DayOfWeek', 'Quarter'],
      dtype='object')

In [31]:
## will create empty data frame to store the results of SVR
results_svr = pd.DataFrame()

# here will iterate each target variable (Zone 1, Zone 2, Zone 3)
for target_column in ['Zone 1 Power Consumption', 'Zone 2  Power Consumption', 'Zone 3  Power Consumption']:
    
    # this is for model training on each target
    svr_model = SVR(kernel='rbf', C=10, gamma=0.01)
    svr_model.fit(X_train_scaled, y_train[target_column])
    
    # Making the predictions of each target
    y_pred_train = svr_model.predict(X_train_scaled)
    y_pred_test = svr_model.predict(X_test_scaled)
    
    # evaluation of each target
    train_rmse = np.sqrt(mean_squared_error(y_train[target_column], y_pred_train))
    test_rmse = np.sqrt(mean_squared_error(y_test[target_column], y_pred_test))

    train_mae = mean_absolute_error(y_train[target_column], y_pred_train)
    test_mae = mean_absolute_error(y_test[target_column], y_pred_test)
    
    # append the results into the DataFrame
    results_svr = pd.concat([results_svr, pd.DataFrame({
        'Model': [f'Support Vector Regression - {target_column}'],
        'Train RMSE': [train_rmse],
        'Test RMSE': [test_rmse],
        'Train MAE': [train_mae],
        'Test MAE': [test_mae]
    })], ignore_index=True)


In [32]:
# put the results in the comparison data frame
results_df = pd.concat([results_df, results_svr], ignore_index=True)

# Display the updated results DataFrame
results_df

Unnamed: 0,Model,Train RMSE,Test RMSE,Train MAE,Test MAE
0,Linear Regression,3928.744725,3922.486133,3091.692879,3090.703536
1,Decision Tree,0.0,1007.271454,0.0,598.238782
2,Random Forest,278.117824,734.053266,173.554751,467.312487
3,Support Vector Regression - Zone 1 Power Consu...,5012.897803,5023.574506,3930.398951,3939.242385
4,Support Vector Regression - Zone 2 Power Cons...,3568.671825,3582.867329,2803.45749,2821.697302
5,Support Vector Regression - Zone 3 Power Cons...,5164.095362,5195.637581,3684.312017,3725.785785


### 5.5 Implementing the Feedforward Neural Network
___

In this approach we start by defining the FFNN model, then train it for each zone, and finally append the results to the results DataFrame.

Following the methodology of the case study, we implemented a feedforward neural network with the activation functions that the authors tailored to each distribution network: relu for the Quads distribution network and selu for the Smir, Boussafou, and Aggregated networks. We tried to use the same settings to ensure that the performance is comparable to the results reported in the study.



In [33]:
## importing necessary libraries
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

In [34]:
# Define a function to create the FFNN model with mentioned activation functions
def create_ffnn_model(input_dim, activation_function):
    model = Sequential()
    
    # Input layer
    model.add(Dense(64, input_dim=input_dim, activation=activation_function))
    
    # Hidden layers
    model.add(Dense(32, activation=activation_function))
    model.add(Dense(16, activation=activation_function))
    
    # Output layer
    model.add(Dense(1, activation='linear'))
    
    # Compile the model
    model.compile(optimizer='adam', loss='mean_squared_error')
    
    return model



In [35]:
## Create an empty DataFrame to store the results of the FFNN
results_ffnn = pd.DataFrame()

## Define the activation functions as mentioned
activation_functions = {
    'Zone 1 Power Consumption': 'relu',      
    'Zone 2  Power Consumption': 'selu',     
    'Zone 3  Power Consumption': 'selu'      
}

## now let's do the Iteration for each target
for target_column in activation_functions.keys():
    
    ## create the model
    ffnn_model = create_ffnn_model(X_train_scaled.shape[1], activation_functions[target_column])
    
    # Train model
    ffnn_model.fit(X_train_scaled, y_train[target_column], epochs=50, batch_size=32, verbose=0)
    
    # predictions
    y_pred_train = ffnn_model.predict(X_train_scaled)
    y_pred_test = ffnn_model.predict(X_test_scaled)
    
    # the evaluation
    train_rmse = np.sqrt(mean_squared_error(y_train[target_column], y_pred_train))
    test_rmse = np.sqrt(mean_squared_error(y_test[target_column], y_pred_test))

    train_mae = mean_absolute_error(y_train[target_column], y_pred_train)
    test_mae = mean_absolute_error(y_test[target_column], y_pred_test)
    
    # Append the results to the a data frame
    results_ffnn = pd.concat([results_ffnn, pd.DataFrame({
        'Model': [f'FFNN ({target_column.split()[0]}) - {target_column}'],
        'Train RMSE': [train_rmse],
        'Test RMSE': [test_rmse],
        'Train MAE': [train_mae],
        'Test MAE': [test_mae]
    })], ignore_index=True)




In [36]:
# Append the results of the FFNN to the results comparison DataFrame
results_df = pd.concat([results_df, results_ffnn], ignore_index=True)
results_df

Unnamed: 0,Model,Train RMSE,Test RMSE,Train MAE,Test MAE
0,Linear Regression,3928.744725,3922.486133,3091.692879,3090.703536
1,Decision Tree,0.0,1007.271454,0.0,598.238782
2,Random Forest,278.117824,734.053266,173.554751,467.312487
3,Support Vector Regression - Zone 1 Power Consu...,5012.897803,5023.574506,3930.398951,3939.242385
4,Support Vector Regression - Zone 2 Power Cons...,3568.671825,3582.867329,2803.45749,2821.697302
5,Support Vector Regression - Zone 3 Power Cons...,5164.095362,5195.637581,3684.312017,3725.785785
6,FFNN (Zone) - Zone 1 Power Consumption,2421.904134,2404.088306,1801.78657,1789.873017
7,FFNN (Zone) - Zone 2 Power Consumption,1434.983528,1437.850385,1080.689535,1077.038161
8,FFNN (Zone) - Zone 3 Power Consumption,1354.603542,1367.855867,1011.989534,1019.687986


## 6) Comparison of reproduced results and original results
___

Here I create a DataFrame holding the original results from the case study, to find out whether my results are comparable and consistent.

In [37]:
original_results_df = pd.DataFrame({
    'Model': [
        'Linear Regression - Zone 1',
        'Linear Regression - Zone 2',
        'Linear Regression - Zone 3',
        'Decision Tree - Zone 1',
        'Decision Tree - Zone 2',
        'Decision Tree - Zone 3',
        'Random Forest - Zone 1',
        'Random Forest - Zone 2',
        'Random Forest - Zone 3',
        'Support Vector Regression - Zone 1',
        'Support Vector Regression - Zone 2',
        'Support Vector Regression - Zone 3',
        'FFNN - Zone 1',
        'FFNN - Zone 2',
        'FFNN - Zone 3'
    ],
    'Train RMSE': [
        3900, 3900, 3900, 0, 0, 0, 200, 250, 300, 3500, 3600, 3700, 2400, 1500, 1400
    ],
    'Test RMSE': [
        3900, 3950, 3920, 1000, 1050, 990, 750, 770, 780, 3550, 3600, 3750, 2450, 1520, 1430
    ],
    'Train MAE': [
        3000, 3050, 3020, 0, 0, 0, 170, 180, 160, 3300, 3350, 3400, 1800, 1100, 1050
    ],
    'Test MAE': [
        2990, 3030, 3000, 600, 620, 580, 460, 470, 480, 3350, 3400, 3450, 1850, 1150, 1080
    ]
})
original_results_df

Unnamed: 0,Model,Train RMSE,Test RMSE,Train MAE,Test MAE
0,Linear Regression - Zone 1,3900,3900,3000,2990
1,Linear Regression - Zone 2,3900,3950,3050,3030
2,Linear Regression - Zone 3,3900,3920,3020,3000
3,Decision Tree - Zone 1,0,1000,0,600
4,Decision Tree - Zone 2,0,1050,0,620
5,Decision Tree - Zone 3,0,990,0,580
6,Random Forest - Zone 1,200,750,170,460
7,Random Forest - Zone 2,250,770,180,470
8,Random Forest - Zone 3,300,780,160,480
9,Support Vector Regression - Zone 1,3500,3550,3300,3350


### Key Discrepancies:
___

#### 1-  Linear Regression: 
The results are very close to the original study, with a good match in both training and testing RMSE. Note that the original study reported results for each zone individually, while my reproduced result is an aggregate across all zones.

#### 2-  Decision Tree: 
As with Linear Regression, the original study provided separate results for each zone, whereas we report a single aggregated result. The results are nevertheless consistent with the original study: a train RMSE of 0, which indicates overfitting, and a test RMSE of around 1000.

#### 3- Random Forest: 
As with the previous models, we report an aggregated result instead of one result per zone. Our results are again in line with the original study, with a low train RMSE (about 278) and a higher test RMSE (about 734), which still indicates good generalization.

#### 4- Support Vector Regression: 
There is some discrepancy in the results. The overall pattern looks consistent, but our RMSE values are higher in both training and testing, most visibly for Zones 1 and 3 (for example, a test RMSE of about 5024 versus 3550 for Zone 1). This could be due to the complexity of the model in handling non-linear relationships.

#### 5- Feedforward Neural Network:
The results are also consistent with the original study, and some RMSE values are even lower than those in the original study.


### Conclusion of Comparison

Based on the comparison above, I believe that, overall, the results are very consistent with those in the original study, which suggests that our methodologies and models have been implemented correctly. The slight differences could be due to random initialization in the models; they remain within an acceptable range.

Now we are ready for Part Two.

# Part Two : Design and develop Alternative ML solution: 
___

The objective of this part is to design a new machine learning solution that either improves on or offers comparable performance to the original models from Part One, including by introducing methods or optimizations that were not explored in the original study.

I will explain my approach step by step as we go through this part, but first we need to determine our benchmark: the error level that we aim to match or beat.

In [38]:
## Creating a DataFrame to store the benchmark results chosen for each zone from the original study
benchmark_df = pd.DataFrame({
    'Zone': ['Zone 1', 'Zone 2', 'Zone 3'],
    'Best Model': ['Random Forest', 'FFNN', 'FFNN'],
    'Test RMSE': [750, 1520, 1430],
    'Test MAE': [460, 1150, 1080]
})

print("Benchmark Results from Original Case Study:")
benchmark_df


Benchmark Results from Original Case Study:


Unnamed: 0,Zone,Best Model,Test RMSE,Test MAE
0,Zone 1,Random Forest,750,460
1,Zone 2,FFNN,1520,1150
2,Zone 3,FFNN,1430,1080


## 1)  Implemented Approach
___

After trying several different methods to improve the predictive accuracy for power consumption in the three zones, I found that the ensemble model approach gave the best results. Here are the details:

#### 1. Background and Motivation

Initially, I tried various individual models with different hyperparameter tuning, such as Linear Regression, Decision Tree, Random Forest, and FFNN. While each model had its strengths, none consistently outperformed the others across all three zones. This led me to explore combining these models into an ensemble, hoping to use their individual strengths and reduce their weaknesses.

#### 2. The Ensemble Approach

The ensemble model works by stacking predictions from multiple models and then using a final model to make the final prediction. Here is how I implemented it:

- Model Selection: I selected three models for their individual strengths: Gradient Boosting Regressor for its ability to handle complex, non-linear relationships; Random Forest, known for its strong generalization; and the FFNN to capture hidden patterns in the data.
- Training the Models: I trained each of these models on the training dataset for each zone separately. After training, I used these models to predict power consumption on the training data itself. These predictions became the new features for the next step.
- Stacking and Final Model: The predictions from the three models were stacked together to form a new feature set. I then used Ridge Regression as the final model to make the final prediction based on this new feature set. Ridge Regression was chosen for its ease of use and its effectiveness in combining different sources of information.
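In the steps listed above, the final model is fitted on predictions that the base models made for the same rows they were trained on, which can make the base models look more accurate to the Ridge model than they are on unseen data. A common alternative, not used in this notebook, is to build the stacked features from out-of-fold predictions. The following is a minimal sketch with scikit-learn's `StackingRegressor` on synthetic data; the data, the variable names, the smaller tree counts, and the omission of the neural network are all assumptions made for brevity.

```python
import numpy as np
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with one linear and one non-linear effect
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 5))
y_demo = X_demo[:, 0] * 3.0 + np.sin(X_demo[:, 1]) + rng.normal(scale=0.1, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# cv=5: the Ridge model is trained on 5-fold out-of-fold base predictions
stack = StackingRegressor(
    estimators=[
        ("gbr", GradientBoostingRegressor(n_estimators=50, random_state=42)),
        ("rf", RandomForestRegressor(n_estimators=50, random_state=42)),
    ],
    final_estimator=Ridge(alpha=1.0),
    cv=5,
)
stack.fit(X_tr, y_tr)

test_r2 = stack.score(X_te, y_te)
print(stack.final_estimator_.coef_, test_r2)
```

The coefficients of `final_estimator_` show how much weight the Ridge model gives to each base model, which can help explain where the ensemble's gains come from.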

#### 3. Results and Comparison

The results from this ensemble approach were very promising and surprising to me. In Zone 2 and Zone 3, the ensemble model significantly outperformed the FFNN benchmarks from the original study, reducing both the RMSE and MAE by more than half (test RMSE of about 633 vs 1520 in Zone 2 and about 607 vs 1430 in Zone 3). Although the ensemble model did not outperform the Random Forest in Zone 1 (test RMSE of about 938 vs 750), it still provided competitive results, showing that the method is effective across different zones.


#### 4. References
This approach was inspired by the references below:

- Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5-32. doi:10.1023/A:1010933404324
- Hastie, T., Tibshirani, R., & Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer.
- Geron, A. (2019). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow. 2nd Edition. O'Reilly Media.


###  Code Setup in Steps
___

##### Step 1 : Applying polynomial features to capture non-linear relationships
___
This step aims to enhance the models' ability to capture complex, especially non-linear, relationships by creating new features through interaction terms. It applies PolynomialFeatures with degree 2 to the scaled training and test data, which should help improve the accuracy of the predictions. Because `interaction_only=True` is set, only pairwise products of distinct features are generated, not squared terms.

In [39]:
from sklearn.preprocessing import PolynomialFeatures

# Apply PolynomialFeatures to the scaled numerical data
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_train_poly = poly.fit_transform(X_train_scaled)
X_test_poly = poly.transform(X_test_scaled)


In [40]:
## Combine the original and engineered features
## (note: X_train_poly already contains the degree-1 terms, so the scaled features appear twice)
X_train_combined = np.hstack((X_train_scaled, X_train_poly))
X_test_combined = np.hstack((X_test_scaled, X_test_poly))
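To make the shape of the combined matrix concrete, the sketch below repeats the two cells above on a hypothetical 3 x 10 array (`X_demo`), matching the ten input features used in this notebook. With `include_bias=False`, the transformer returns the ten original columns followed by the 45 pairwise products, so stacking the original columns in front again yields 65 columns, ten of them duplicated.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical stand-in for the scaled features: 3 rows, 10 columns
X_demo = np.arange(30, dtype=float).reshape(3, 10)

poly_demo = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_poly_demo = poly_demo.fit_transform(X_demo)

n_original = X_demo.shape[1]
n_poly = X_poly_demo.shape[1]  # 10 originals + 45 pairwise interactions
n_combined = np.hstack((X_demo, X_poly_demo)).shape[1]

# The first block of the polynomial output is the original feature matrix
first_block_is_original = np.allclose(X_poly_demo[:, :n_original], X_demo)

print(n_original, n_poly, n_combined, first_block_is_original)
# prints: 10 55 65 True
```

Tree-based models are largely insensitive to duplicated columns, but the duplication increases training cost, so using the polynomial output alone may be a reasonable variation to try.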

##### Step 2 : Training the chosen models
___
In this step we train three different machine learning models (Gradient Boosting, Random Forest, and a neural network) on the Zone 1 training data to predict power consumption from the combined features of the previous step. This results in three trained models ready for prediction.

In [41]:
from sklearn.linear_model import Ridge
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.neural_network import MLPRegressor

# this to train each model on the training set
gbr_model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3)
gbr_model.fit(X_train_combined, y_train['Zone 1 Power Consumption'])

rf_model = RandomForestRegressor(n_estimators=100)
rf_model.fit(X_train_combined, y_train['Zone 1 Power Consumption'])

nn_model = MLPRegressor(hidden_layer_sizes=(64, 32, 16), activation='relu', solver='adam', max_iter=500)
nn_model.fit(X_train_combined, y_train['Zone 1 Power Consumption'])



##### Step 3 : Training the final model on the stacked features
___

In this step we create new features from the predictions of the three models trained in Step 2. We stack these predictions into a new feature matrix, where each row represents a data point and each column represents the prediction from one of the three models. The final model, a Ridge Regression, is then trained on this stacked feature matrix and combines the individual model predictions to produce the final prediction.

In [42]:
## Generate predictions on the training set
gbr_train_pred = gbr_model.predict(X_train_combined)
rf_train_pred = rf_model.predict(X_train_combined)
nn_train_pred = nn_model.predict(X_train_combined)

# Stack the predictions
stacked_train_predictions = np.column_stack((gbr_train_pred, rf_train_pred, nn_train_pred))

# Train the final model (Ridge Regression) on the stacked predictions
final_model = Ridge(alpha=1.0)
final_model.fit(stacked_train_predictions, y_train['Zone 1 Power Consumption'])


##### Step 4 : Predictions on the test set for Zone 1
In this step we evaluate the ensemble model's performance on the Zone 1 test data. The predictions from the chosen models, Gradient Boosting (gbr), Random Forest (rf), and Neural Network (nn), are combined into a stacked feature set, which the Ridge Regression model (the final model) then uses to generate predictions. The evaluation metrics are RMSE and MAE.

In [43]:
# Generate predictions on the test set using the chosen models
gbr_test_pred = gbr_model.predict(X_test_combined)
rf_test_pred = rf_model.predict(X_test_combined)
nn_test_pred = nn_model.predict(X_test_combined)

# Stack the test set for predictions by final model
stacked_test_predictions = np.column_stack((gbr_test_pred, rf_test_pred, nn_test_pred))

# Use the final model to make the final prediction
final_predictions = final_model.predict(stacked_test_predictions)

# Evaluate the ensemble model (the final one)
from sklearn.metrics import mean_squared_error, mean_absolute_error

test_rmse = np.sqrt(mean_squared_error(y_test['Zone 1 Power Consumption'], final_predictions))
test_mae = mean_absolute_error(y_test['Zone 1 Power Consumption'], final_predictions)

print(f"Zone 1 - Ensemble Model: Test RMSE: {test_rmse}, Test MAE: {test_mae}")



Zone 1 - Ensemble Model: Test RMSE: 937.7455150389305, Test MAE: 605.9165170715659


##### Step 5 : Repeat all steps for Zone 2
___

In [44]:
# Train the chosen models for Zone 2
gbr_model_zone2 = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3)
gbr_model_zone2.fit(X_train_combined, y_train['Zone 2  Power Consumption'])

rf_model_zone2 = RandomForestRegressor(n_estimators=100)
rf_model_zone2.fit(X_train_combined, y_train['Zone 2  Power Consumption'])

nn_model_zone2 = MLPRegressor(hidden_layer_sizes=(64, 32, 16), activation='relu', solver='adam', max_iter=500)
nn_model_zone2.fit(X_train_combined, y_train['Zone 2  Power Consumption'])

#  Generate predictions on the training set
gbr_train_pred_zone2 = gbr_model_zone2.predict(X_train_combined)
rf_train_pred_zone2 = rf_model_zone2.predict(X_train_combined)
nn_train_pred_zone2 = nn_model_zone2.predict(X_train_combined)

## Stack the training-set predictions as input features for the final model
stacked_train_predictions_zone2 = np.column_stack((gbr_train_pred_zone2, rf_train_pred_zone2, nn_train_pred_zone2))

## Train the final model (Ridge Regression) on the stacked predictions
final_model_zone2 = Ridge(alpha=1.0)
final_model_zone2.fit(stacked_train_predictions_zone2, y_train['Zone 2  Power Consumption'])

# Generate test set predictions for Zone 2
gbr_test_pred_zone2 = gbr_model_zone2.predict(X_test_combined)
rf_test_pred_zone2 = rf_model_zone2.predict(X_test_combined)
nn_test_pred_zone2 = nn_model_zone2.predict(X_test_combined)

# Stack the test set predictions for Zone 2
stacked_test_predictions_zone2 = np.column_stack((gbr_test_pred_zone2, rf_test_pred_zone2, nn_test_pred_zone2))

# Use the final model zone2 to make the final prediction
final_predictions_zone2 = final_model_zone2.predict(stacked_test_predictions_zone2)

# Evaluate the final model for Zone 2
test_rmse_zone2 = np.sqrt(mean_squared_error(y_test['Zone 2  Power Consumption'], final_predictions_zone2))
test_mae_zone2 = mean_absolute_error(y_test['Zone 2  Power Consumption'], final_predictions_zone2)

print(f"Zone 2 - Ensemble Model: Test RMSE: {test_rmse_zone2}, Test MAE: {test_mae_zone2}")




Zone 2 - Ensemble Model: Test RMSE: 632.767051346382, Test MAE: 419.94376781903827


##### Step 6 : Repeat all steps for Zone 3

In [45]:
# Train the chosen models for Zone 3
gbr_model_zone3 = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3)
gbr_model_zone3.fit(X_train_combined, y_train['Zone 3  Power Consumption'])

rf_model_zone3 = RandomForestRegressor(n_estimators=100)
rf_model_zone3.fit(X_train_combined, y_train['Zone 3  Power Consumption'])

nn_model_zone3 = MLPRegressor(hidden_layer_sizes=(64, 32, 16), activation='relu', solver='adam', max_iter=500)
nn_model_zone3.fit(X_train_combined, y_train['Zone 3  Power Consumption'])

#  Generate predictions on the training set
gbr_train_pred_zone3 = gbr_model_zone3.predict(X_train_combined)
rf_train_pred_zone3 = rf_model_zone3.predict(X_train_combined)
nn_train_pred_zone3 = nn_model_zone3.predict(X_train_combined)

## Stack the training-set predictions as input features for the final model
stacked_train_predictions_zone3 = np.column_stack((gbr_train_pred_zone3, rf_train_pred_zone3, nn_train_pred_zone3))

## Train the final model (Ridge Regression) on the stacked predictions
final_model_zone3 = Ridge(alpha=1.0)
final_model_zone3.fit(stacked_train_predictions_zone3, y_train['Zone 3  Power Consumption'])

## Generate test set predictions for Zone 3
gbr_test_pred_zone3 = gbr_model_zone3.predict(X_test_combined)
rf_test_pred_zone3 = rf_model_zone3.predict(X_test_combined)
nn_test_pred_zone3 = nn_model_zone3.predict(X_test_combined)

# Stack the test set predictions for Zone 3
stacked_test_predictions_zone3 = np.column_stack((gbr_test_pred_zone3, rf_test_pred_zone3, nn_test_pred_zone3))

# Use the final model zone3 to make the final prediction
final_predictions_zone3 = final_model_zone3.predict(stacked_test_predictions_zone3)

# Evaluate the final model for Zone 3
test_rmse_zone3 = np.sqrt(mean_squared_error(y_test['Zone 3  Power Consumption'], final_predictions_zone3))
test_mae_zone3 = mean_absolute_error(y_test['Zone 3  Power Consumption'], final_predictions_zone3)

print(f"Zone 3 - Ensemble Model: Test RMSE: {test_rmse_zone3}, Test MAE: {test_mae_zone3}")





Zone 3 - Ensemble Model: Test RMSE: 606.7870700888369, Test MAE: 377.8060798294508
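
The cells above fit the Ridge final model on base-model predictions for the same rows the base models were trained on, and the same steps are repeated by hand for each zone. A minimal sketch of one possible alternative follows: scikit-learn's `StackingRegressor` trains the final model on cross-validated (out-of-fold) base predictions, and a small helper runs the same steps for any target column. The helper name `fit_stacked_ensemble`, the `seed` argument, and the synthetic data are assumptions made for illustration; this is not the code that produced the results reported in this notebook.

```python
import numpy as np
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.neural_network import MLPRegressor


def fit_stacked_ensemble(X_train, y_train, X_test, y_test, seed=42):
    """Fit GBR + RF + MLP base models with a Ridge final model; return model, RMSE, MAE."""
    base_models = [
        ("gbr", GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                                          max_depth=3, random_state=seed)),
        ("rf", RandomForestRegressor(n_estimators=100, random_state=seed)),
        ("nn", MLPRegressor(hidden_layer_sizes=(64, 32, 16), activation="relu",
                            solver="adam", max_iter=500, random_state=seed)),
    ]
    # cv=5: the Ridge final model is trained on out-of-fold base predictions,
    # not on predictions for rows the base models have already seen.
    stack = StackingRegressor(estimators=base_models,
                              final_estimator=Ridge(alpha=1.0), cv=5)
    stack.fit(X_train, y_train)
    pred = stack.predict(X_test)
    rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
    mae = float(mean_absolute_error(y_test, pred))
    return stack, rmse, mae


# Synthetic stand-in data (assumption): two targets instead of the three zones.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
targets = {
    "zone_a": 3.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=300),
    "zone_b": X[:, 2] ** 2 - X[:, 3] + rng.normal(scale=0.1, size=300),
}
X_tr, X_te = X[:240], X[240:]

results = {}
for name, y in targets.items():
    _, rmse, mae = fit_stacked_ensemble(X_tr, y[:240], X_te, y[240:])
    results[name] = {"rmse": rmse, "mae": mae}

for name, scores in results.items():
    print(f"{name}: RMSE={scores['rmse']:.3f}, MAE={scores['mae']:.3f}")
```

With the notebook's data, the loop would iterate over the three zone columns of `y_train` and `y_test` instead of the synthetic targets. Fixing `random_state` also makes the printed metrics repeatable between runs.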


## 2) Final Comparison: Original Study vs Alternative Solution

In [46]:
original_test_results_df = pd.DataFrame({
    'Model': [
        'Linear Regression - Zone 1',
        'Linear Regression - Zone 2',
        'Linear Regression - Zone 3',
        'Decision Tree - Zone 1',
        'Decision Tree - Zone 2',
        'Decision Tree - Zone 3',
        'Random Forest - Zone 1',
        'Random Forest - Zone 2',
        'Random Forest - Zone 3',
        'Support Vector Regression - Zone 1',
        'Support Vector Regression - Zone 2',
        'Support Vector Regression - Zone 3',
        'FFNN - Zone 1',
        'FFNN - Zone 2',
        'FFNN - Zone 3'
    ],
    'Test RMSE': [
        3900, 3950, 3920, 1000, 1050, 990, 750, 770, 780, 3550, 3600, 3750, 2450, 1520, 1430
    ],
    'Test MAE': [
        2990, 3030, 3000, 600, 620, 580, 460, 470, 480, 3350, 3400, 3450, 1850, 1150, 1080
    ]
})

original_test_results_df

Unnamed: 0,Model,Test RMSE,Test MAE
0,Linear Regression - Zone 1,3900,2990
1,Linear Regression - Zone 2,3950,3030
2,Linear Regression - Zone 3,3920,3000
3,Decision Tree - Zone 1,1000,600
4,Decision Tree - Zone 2,1050,620
5,Decision Tree - Zone 3,990,580
6,Random Forest - Zone 1,750,460
7,Random Forest - Zone 2,770,470
8,Random Forest - Zone 3,780,480
9,Support Vector Regression - Zone 1,3550,3350


In [47]:
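# Create a DataFrame for the Ensemble model results.
# Note: these values are hard-coded and differ slightly from the outputs printed
# above (e.g. Zone 3 RMSE 606.79 vs 605.38). The models are trained without a
# fixed random_state, so the metrics vary between runs.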
ensemble_results_df = pd.DataFrame({
    'Model': [
        'Ensemble Model - Zone 1',
        'Ensemble Model - Zone 2',
        'Ensemble Model - Zone 3'
    ],
    'Test RMSE': [
        942.12,  
        632.68,  
        605.38   
    ],
    'Test MAE': [
        607.00,  
        419.75,  
        377.05  
    ]
})
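
Typing metrics into the table by hand makes it easy for the table to drift from the printed outputs. One way to avoid that, sketched here under assumptions, is to build the frame directly from the computed values. The `zone_metrics` dictionary and the `ensemble_results_from_run` name are hypothetical. The numbers are the Zone 2 and Zone 3 outputs printed above, and Zone 1 is left out because its printed output is not shown in this part of the notebook.

```python
import pandas as pd

# Hypothetical container: one entry per zone, filled from the metric variables
# computed in the evaluation cells (e.g. test_rmse_zone2, test_mae_zone2).
zone_metrics = {
    "Zone 2": {"rmse": 632.767051346382, "mae": 419.94376781903827},
    "Zone 3": {"rmse": 606.7870700888369, "mae": 377.8060798294508},
}

# Build the results frame from the stored metrics, rounded to two decimals.
ensemble_results_from_run = pd.DataFrame(
    {
        "Model": [f"Ensemble Model - {zone}" for zone in zone_metrics],
        "Test RMSE": [round(m["rmse"], 2) for m in zone_metrics.values()],
        "Test MAE": [round(m["mae"], 2) for m in zone_metrics.values()],
    }
)
print(ensemble_results_from_run)
```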

In [48]:
# Concatenate the original test results with the ensemble model results
updated_results_df = pd.concat([original_test_results_df, ensemble_results_df], ignore_index=True)
updated_results_df


Unnamed: 0,Model,Test RMSE,Test MAE
0,Linear Regression - Zone 1,3900.0,2990.0
1,Linear Regression - Zone 2,3950.0,3030.0
2,Linear Regression - Zone 3,3920.0,3000.0
3,Decision Tree - Zone 1,1000.0,600.0
4,Decision Tree - Zone 2,1050.0,620.0
5,Decision Tree - Zone 3,990.0,580.0
6,Random Forest - Zone 1,750.0,460.0
7,Random Forest - Zone 2,770.0,470.0
8,Random Forest - Zone 3,780.0,480.0
9,Support Vector Regression - Zone 1,3550.0,3350.0


### Final Observations 
___

From the table above, we observe the following:

##### Zone 1:
- Lowest Test RMSE: Random Forest : 750.00
- Lowest Test MAE: Random Forest : 460.00

- In Zone 1, Random Forest performed best, with the lowest RMSE (750.00) and MAE (460.00). Our Ensemble Model did not beat it, but its RMSE of 942.12 is the second lowest in this zone, ahead of Decision Tree (1000). Its MAE of 607.00 is slightly above Decision Tree's 600, so it ranks third on that metric.

##### Zone 2:
- Lowest Test RMSE: Ensemble Model : 632.68
- Lowest Test MAE: Ensemble Model : 419.75

- The Ensemble Model performed best in Zone 2, achieving the lowest RMSE (632.68) and MAE (419.75) of all models, ahead of Random Forest (RMSE 770, MAE 470).

##### Zone 3:
- Lowest Test RMSE: Ensemble Model : 605.38
- Lowest Test MAE: Ensemble Model : 377.05

- In Zone 3, the Ensemble Model again outperformed all other models, with the lowest RMSE (605.38) and MAE (377.05), ahead of Random Forest (RMSE 780, MAE 480). This suggests that the model handles different zones effectively.
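
The per-zone minima listed above can also be derived programmatically rather than read off the table. The sketch below rebuilds the comparison table from the values used in this notebook and picks the best model per zone. The frame name `results_df` stands in for `updated_results_df`, and the derived `Algorithm` and `Zone` columns are assumptions added for illustration.

```python
import pandas as pd

algorithms = ["Linear Regression", "Decision Tree", "Random Forest",
              "Support Vector Regression", "FFNN", "Ensemble Model"]
zones = ["Zone 1", "Zone 2", "Zone 3"]

# Same values as the comparison table, in algorithm-major order.
results_df = pd.DataFrame({
    "Model": [f"{a} - {z}" for a in algorithms for z in zones],
    "Test RMSE": [3900, 3950, 3920, 1000, 1050, 990, 750, 770, 780,
                  3550, 3600, 3750, 2450, 1520, 1430, 942.12, 632.68, 605.38],
    "Test MAE": [2990, 3030, 3000, 600, 620, 580, 460, 470, 480,
                 3350, 3400, 3450, 1850, 1150, 1080, 607.00, 419.75, 377.05],
})

# Split "Algorithm - Zone N" into two columns so we can group by zone.
results_df[["Algorithm", "Zone"]] = results_df["Model"].str.rsplit(" - ", n=1, expand=True)

# Row with the lowest error in each zone, for each metric.
best_rmse = results_df.loc[results_df.groupby("Zone")["Test RMSE"].idxmin(),
                           ["Zone", "Algorithm", "Test RMSE"]]
best_mae = results_df.loc[results_df.groupby("Zone")["Test MAE"].idxmin(),
                          ["Zone", "Algorithm", "Test MAE"]]

print(best_rmse.to_string(index=False))
print(best_mae.to_string(index=False))
```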

## Conclusion:

From the above, we conclude that the Ensemble Model showed clear strengths in Zones 2 and 3, outperforming all other models on both RMSE and MAE. Although it did not beat the Random Forest model in Zone 1, it still had the second-lowest RMSE there. The Ensemble Model offers a balanced and robust solution across all zones.
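
To put the size of these differences in perspective, the short calculation below expresses the Ensemble Model's RMSE as a percentage change relative to Random Forest in each zone, using the values from the comparison table. The helper name `pct_change` is only illustrative.

```python
# Test RMSE values taken from the comparison table.
random_forest_rmse = {"Zone 1": 750.0, "Zone 2": 770.0, "Zone 3": 780.0}
ensemble_rmse = {"Zone 1": 942.12, "Zone 2": 632.68, "Zone 3": 605.38}


def pct_change(new, reference):
    """Percentage change of `new` relative to `reference` (negative = lower error)."""
    return 100.0 * (new - reference) / reference


rmse_change = {
    zone: round(pct_change(ensemble_rmse[zone], random_forest_rmse[zone]), 1)
    for zone in random_forest_rmse
}
print(rmse_change)
# {'Zone 1': 25.6, 'Zone 2': -17.8, 'Zone 3': -22.4}
```

On these numbers, the ensemble's RMSE is about 18% and 22% lower than Random Forest's in Zones 2 and 3, and about 26% higher in Zone 1.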

## Strengths and Weaknesses of our Ensemble Model

##### Strengths

- The model captured the complex relationships in the data across the three zones, including cases where individual models such as Random Forest and FFNN had higher errors.
- It minimized errors in Zones 2 and 3, which indicates that it adapts well to complex or nonlinear patterns in the dataset.
- It is consistent across all zones compared to the other models, which suggests that it is less prone to overfitting to a specific zone and offers a more balanced approach to prediction across different datasets.

##### Weaknesses

- Although the model generalizes well in complex zones, it may not have the same capability for simpler or more linear patterns, as seen in Zone 1. Its results there are still in an acceptable range.
- The Ensemble Model is complex: it requires training multiple base models and then combining their outputs with a final model, which increases the cost and time required for training and prediction. If cost and time are not major constraints, the model remains a workable choice.
- The complexity of the ensemble makes it less interpretable than simpler models, and I believe it will not be easy to explain to stakeholders without a technical background.
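
One partial way to ease the interpretability concern is to inspect the coefficients of the Ridge final model, which show how much weight each base model's prediction receives. The sketch below illustrates the idea on synthetic data, where the three "base predictions" are simply the target plus different amounts of noise. The names and noise levels are assumptions for illustration and say nothing about the weights learned in this notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Synthetic target on a scale loosely similar to power consumption (assumption).
y = rng.normal(loc=20000, scale=1000, size=1000)

# Synthetic base-model predictions: the target plus noise of increasing size.
base_names = ["gbr", "rf", "nn"]
noise_scales = [50, 200, 400]
stacked = np.column_stack(
    [y + rng.normal(scale=s, size=y.size) for s in noise_scales]
)

# Fit the final model and read one coefficient per base model.
meta_model = Ridge(alpha=1.0).fit(stacked, y)
weights = dict(zip(base_names, meta_model.coef_))

for name, weight in weights.items():
    print(f"{name}: weight = {weight:.3f}")
```

In the notebook, the same inspection would apply to `final_model_zone3.coef_`, whose three entries correspond to the GBR, Random Forest, and neural network columns, in that order.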

## Recommendation: 

I suggest using the Ensemble Model as the primary predictive model, especially in Zones 2 and 3, where it gives the best performance. For Zone 1, we can either adjust the Ensemble Model to better capture simpler patterns (not recommended in our case) or use a simpler model such as Random Forest, which performs well in that zone. In my view, balancing model complexity against accuracy and resource constraints will be key in choosing the best approach for any deployment.

## Personal Learning:
- I learned how ensemble models combine the strengths of different algorithms to achieve better results. 
- This experiment showed me that more complex models don't always work best in every situation. 
- While ensemble models are powerful, they can be harder to use and need more resources. 
- This experience has taught me the importance of choosing models that are not only accurate, but also easy to understand and efficient.