### Summary of the Electricity Grid Dataset

 > Dataset: Electrical Grid Stability Simulated (EGSS) data
 
 > Electrical grids require a balance between electricity supply and demand in order to be stable. Conventional systems achieve this balance through demand-driven electricity production. For future grids with a high share of inflexible (i.e., renewable) energy sources, demand response is a promising solution: consumers adjust their electricity consumption in response to changes in the electricity price. In this work, we’ll build a binary classification model to predict whether a grid is stable or unstable using the UCI Electrical Grid Stability Simulated dataset.
 
 > Predictive features:

- 'tau1' to 'tau4': the reaction time of each network participant, a real value within the range 0.5 to 10 ('tau1' corresponds to the supplier node, 'tau2' to 'tau4' to the consumer nodes);
- 'p1' to 'p4': nominal power produced (positive) or consumed (negative) by each network participant, a real value within the range -2.0 to -0.5 for consumers ('p2' to 'p4'). As the total power consumed equals the total power generated, p1 (supplier node) = -(p2 + p3 + p4);
- 'g1' to 'g4': price elasticity coefficient for each network participant, a real value within the range 0.05 to 1.00 ('g1' corresponds to the supplier node, 'g2' to 'g4' to the consumer nodes; 'g' stands for 'gamma').

> Dependent variables:

- 'stab': the maximum real part of the characteristic differential equation root (if positive, the system is linearly unstable; if negative, linearly stable);
- 'stabf': a categorical (binary) label ('stable' or 'unstable').
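
The two relationships stated above (the power balance for 'p1' and the link between 'stab' and 'stabf') can be checked directly. The following is a minimal sketch that hardcodes the first five rows of the dataset as printed by `data.head()` in this notebook; the tolerance passed to `np.allclose` is an assumption chosen to absorb the six-decimal rounding of the printed values.

```python
import numpy as np
import pandas as pd

# First five rows of the dataset, copied from the data.head() output in this notebook.
sample = pd.DataFrame({
    "p1": [3.763085, 5.067812, 3.405158, 3.963791, 3.525811],
    "p2": [-0.782604, -1.940058, -1.207456, -1.027473, -1.125531],
    "p3": [-1.257395, -1.872742, -1.277210, -1.938944, -1.845975],
    "p4": [-1.723086, -1.255012, -0.920492, -0.997374, -0.554305],
    "stab": [0.055347, -0.005957, 0.003471, 0.028871, 0.049860],
    "stabf": ["unstable", "stable", "unstable", "unstable", "unstable"],
})

# Power balance: p1 = -(p2 + p3 + p4). The tolerance is an assumption that
# allows for the rounding of the printed values.
balance_ok = np.allclose(
    sample["p1"], -(sample["p2"] + sample["p3"] + sample["p4"]), atol=1e-5
)

# 'stabf' follows the sign of 'stab': positive -> unstable, negative -> stable.
derived_label = np.where(sample["stab"] > 0, "unstable", "stable")
labels_ok = bool((derived_label == sample["stabf"]).all())

print(balance_ok, labels_ok)
```

Because 'stabf' is fully determined by the sign of 'stab', keeping 'stab' among the features would leak the label into the model, which is why the notebook drops that column before training.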


In [162]:
import pandas as pd
from xgboost import XGBClassifier
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

In [163]:
data = pd.read_csv('electricity_grid_data.csv', encoding='latin-1')

In [164]:
data.head()

Unnamed: 0,tau1,tau2,tau3,tau4,p1,p2,p3,p4,g1,g2,g3,g4,stab,stabf
0,2.95906,3.079885,8.381025,9.780754,3.763085,-0.782604,-1.257395,-1.723086,0.650456,0.859578,0.887445,0.958034,0.055347,unstable
1,9.304097,4.902524,3.047541,1.369357,5.067812,-1.940058,-1.872742,-1.255012,0.413441,0.862414,0.562139,0.78176,-0.005957,stable
2,8.971707,8.848428,3.046479,1.214518,3.405158,-1.207456,-1.27721,-0.920492,0.163041,0.766689,0.839444,0.109853,0.003471,unstable
3,0.716415,7.6696,4.486641,2.340563,3.963791,-1.027473,-1.938944,-0.997374,0.446209,0.976744,0.929381,0.362718,0.028871,unstable
4,3.134112,7.608772,4.943759,9.857573,3.525811,-1.125531,-1.845975,-0.554305,0.79711,0.45545,0.656947,0.820923,0.04986,unstable


In [165]:
data.shape

(10000, 14)

In [166]:
data.describe()

Unnamed: 0,tau1,tau2,tau3,tau4,p1,p2,p3,p4,g1,g2,g3,g4,stab
count,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0
mean,5.25,5.250001,5.250004,5.249997,3.75,-1.25,-1.25,-1.25,0.525,0.525,0.525,0.525,0.015731
std,2.742548,2.742549,2.742549,2.742556,0.75216,0.433035,0.433035,0.433035,0.274256,0.274255,0.274255,0.274255,0.036919
min,0.500793,0.500141,0.500788,0.500473,1.58259,-1.999891,-1.999945,-1.999926,0.050009,0.050053,0.050054,0.050028,-0.08076
25%,2.874892,2.87514,2.875522,2.87495,3.2183,-1.624901,-1.625025,-1.62496,0.287521,0.287552,0.287514,0.287494,-0.015557
50%,5.250004,5.249981,5.249979,5.249734,3.751025,-1.249966,-1.249974,-1.250007,0.525009,0.525003,0.525015,0.525002,0.017142
75%,7.62469,7.624893,7.624948,7.624838,4.28242,-0.874977,-0.875043,-0.875065,0.762435,0.76249,0.76244,0.762433,0.044878
max,9.999469,9.999837,9.99945,9.999443,5.864418,-0.500108,-0.500072,-0.500025,0.999937,0.999944,0.999982,0.99993,0.109403



In [168]:
data.drop(columns=['stab'], inplace=True)

In [169]:
data.columns

Index(['tau1', 'tau2', 'tau3', 'tau4', 'p1', 'p2', 'p3', 'p4', 'g1', 'g2',
       'g3', 'g4', 'stabf'],
      dtype='object')

In [170]:
X = data.drop(columns=['stabf'])
X.head(2)

Unnamed: 0,tau1,tau2,tau3,tau4,p1,p2,p3,p4,g1,g2,g3,g4
0,2.95906,3.079885,8.381025,9.780754,3.763085,-0.782604,-1.257395,-1.723086,0.650456,0.859578,0.887445,0.958034
1,9.304097,4.902524,3.047541,1.369357,5.067812,-1.940058,-1.872742,-1.255012,0.413441,0.862414,0.562139,0.78176


In [172]:
y = data['stabf']
y.values

array(['unstable', 'stable', 'unstable', ..., 'stable', 'unstable',
       'unstable'], dtype=object)

In [173]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

### What is the accuracy on the test set using the random forest classifier? (To 4 decimal places)

In [174]:
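# Data normalization: standardize each feature to zero mean and unit variance.
Xtr_scaler = StandardScaler().fit(X_train).transform(X_train)
# Note: the test set is standardized here with its own mean and standard deviation.
# The usual practice is to reuse the scaler fitted on X_train for X_test.
Xtest_scaler = StandardScaler().fit_transform(X_test)

Xtr_data = pd.DataFrame(data=Xtr_scaler, columns=X_train.columns)
Xtest_data = pd.DataFrame(data=Xtest_scaler, columns=X_test.columns)
Xtr_data.head(2)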

Unnamed: 0,tau1,tau2,tau3,tau4,p1,p2,p3,p4,g1,g2,g3,g4
0,0.367327,-0.986042,0.650447,1.547527,-0.29149,0.061535,1.293862,-0.845074,0.160918,0.339859,0.585568,0.492239
1,-0.064659,0.089437,1.035079,-1.641494,0.619865,-0.067235,-1.502925,0.486613,-0.293143,-1.558488,1.429649,-1.443521
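
The cell above standardizes the training and test features with two separate scalers. A common alternative is to fit one scaler on the training split and apply it unchanged to the test split, so both are expressed in the same units. The sketch below illustrates that pattern on synthetic data; the `*_demo` arrays are hypothetical stand-ins, not the grid dataset, and the results reported in this notebook were not produced this way.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for X_train and X_test (hypothetical data).
rng = np.random.default_rng(1)
X_train_demo = rng.uniform(0.5, 10.0, size=(800, 3))
X_test_demo = rng.uniform(0.5, 10.0, size=(200, 3))

# Learn the mean and standard deviation from the training split only.
scaler = StandardScaler().fit(X_train_demo)

X_train_scaled = scaler.transform(X_train_demo)
# Apply the same training statistics to the test split.
X_test_scaled = scaler.transform(X_test_demo)

# Training columns have mean ~0 and std ~1 by construction; test columns are
# close to that but generally not exact, because they reuse the training statistics.
print(X_train_scaled.mean(axis=0).round(6))
print(X_test_scaled.mean(axis=0).round(3))
```

Tree-based models such as the ones used below are largely insensitive to monotonic feature scaling, so the choice mainly matters if a scale-sensitive model is added later.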


In [175]:
# Helper function that prints the evaluation metrics for a model's predictions
def metrics_evaluation(y_test, y_pred, model):
    print("\n classification report: \n  {} ".format(classification_report(y_test, y_pred)))
    print("The accuracy score of the model is {} \n".format(round(accuracy_score(y_test, y_pred), 4)))
    print("\n confusion matrix \n {}".format(confusion_matrix(y_test, y_pred)))

In [176]:
# Label-encode the binary target as integers. LabelEncoder sorts the classes
# alphabetically, so 'stable' -> 0 and 'unstable' -> 1.
encoding = LabelEncoder()
Ytr_encoded = encoding.fit_transform(y_train)
Ytest_encoded = encoding.transform(y_test)  # reuse the mapping learned from y_train
Ytr_data = pd.DataFrame(data=Ytr_encoded).reset_index(drop=True)
Ytest_data = pd.DataFrame(data=Ytest_encoded).reset_index(drop=True)
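
To read the classification reports and confusion matrices that follow, it helps to know which integer stands for which label. The sketch below, using a small hypothetical list of labels, shows how `LabelEncoder` assigns codes and how to recover the mapping from `classes_`.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of target labels.
demo_labels = ["unstable", "stable", "unstable", "unstable", "stable"]

encoder = LabelEncoder().fit(demo_labels)

# classes_ holds the labels in sorted order; the position of a label is its code.
mapping = {
    label: int(code)
    for label, code in zip(encoder.classes_, encoder.transform(encoder.classes_))
}
print(mapping)

# inverse_transform turns integer codes back into the original strings.
decoded = list(encoder.inverse_transform([1, 0]))
print(decoded)
```

With this mapping, class 0 in the reports below is 'stable' and class 1 is 'unstable'.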

In [177]:
#building the model
model_rf = RandomForestClassifier(n_estimators =100, random_state=1)
model_rf.fit(Xtr_data, Ytr_data)
y_pred = model_rf.predict(Xtest_data)
y_pred[0:20]

array([1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0])

In [178]:
#getting the evaluation metrics by calling the defined function
metrics_evaluation(Ytest_data, y_pred, model_rf)


 classification report: 
                precision    recall  f1-score   support

           0       0.92      0.88      0.90       712
           1       0.93      0.96      0.94      1288

    accuracy                           0.93      2000
   macro avg       0.93      0.92      0.92      2000
weighted avg       0.93      0.93      0.93      2000
 
The accuracy score of the model is 0.928 


 confusion matrix 
 [[ 624   88]
 [  56 1232]]
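
The accuracy, precision and recall in the report can be recomputed from the confusion matrix alone. This sketch uses the random forest matrix printed above and assumes scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Random forest confusion matrix reported above
# (rows = true class, columns = predicted class).
cm = np.array([[624, 88],
               [56, 1232]])

accuracy = np.trace(cm) / cm.sum()        # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)     # per true class
precision = np.diag(cm) / cm.sum(axis=0)  # per predicted class

print(round(float(accuracy), 4))
print(recall.round(2), precision.round(2))
```

The same three lines apply to the confusion matrices of the other models in this notebook.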


### What is the accuracy on the test set using the XGBoost classifier? (To 4 decimal places)

In [179]:
# Using a boosting ensemble method (XGBoost)
model_xg = XGBClassifier()
model_xg.fit(Xtr_data, Ytr_data)
y_pred_xg = model_xg.predict(Xtest_data)
y_pred_xg[0:20]

array([1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0])

In [180]:
metrics_evaluation(Ytest_data, y_pred_xg, model_xg)


 classification report: 
                precision    recall  f1-score   support

           0       0.94      0.91      0.92       712
           1       0.95      0.97      0.96      1288

    accuracy                           0.95      2000
   macro avg       0.94      0.94      0.94      2000
weighted avg       0.95      0.95      0.95      2000
 
The accuracy score of the model is 0.946 


 confusion matrix 
 [[ 647   65]
 [  43 1245]]


### What is the accuracy on the test set using the LightGBM classifier? (To 4 decimal places)

In [181]:
model_lgbm = LGBMClassifier()
model_lgbm.fit(Xtr_data, Ytr_data)
y_pred_lgbm = model_lgbm.predict(Xtest_data)
y_pred_lgbm[0:20]


array([1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0])

In [182]:
metrics_evaluation(Ytest_data, y_pred_lgbm, model_lgbm)


 classification report: 
                precision    recall  f1-score   support

           0       0.93      0.89      0.91       712
           1       0.94      0.96      0.95      1288

    accuracy                           0.94      2000
   macro avg       0.93      0.93      0.93      2000
weighted avg       0.94      0.94      0.94      2000
 
The accuracy score of the model is 0.9365 


 confusion matrix 
 [[ 636   76]
 [  51 1237]]


### Train a new ExtraTreesClassifier model with the best hyperparameters from RandomizedSearchCV (with random_state = 1). Is the accuracy of the new optimal model higher or lower than that of the initial ExtraTreesClassifier model with no hyperparameter tuning?

In [183]:
model_extraTree = ExtraTreesClassifier()
model_extraTree.fit(Xtr_data, Ytr_data)
ypred_extraTree = model_extraTree.predict(Xtest_data)


In [184]:
# Hyperparameter search space for the ExtraTreesClassifier.
# Note: 'auto' for max_features was deprecated in scikit-learn 1.1 and removed in 1.3;
# drop it from the list when running on newer versions.
parameters = {'n_estimators': [50, 100, 300, 500, 1000], 'min_samples_leaf': [1, 2, 4, 6, 8],
              'min_samples_split': [2, 3, 5, 7, 9], 'max_features': ['auto', 'sqrt', 'log2', None]}
hyper_tuning = RandomizedSearchCV(estimator=model_extraTree, param_distributions=parameters, cv=5, n_iter=10,
                                  scoring='accuracy', n_jobs=-1, verbose=1, random_state=1)
hyper_tuning.fit(Xtr_data, Ytr_data)


Fitting 5 folds for each of 10 candidates, totalling 50 fits



RandomizedSearchCV(cv=5, estimator=ExtraTreesClassifier(), n_jobs=-1,
                   param_distributions={'max_features': ['auto', 'sqrt', 'log2',
                                                         None],
                                        'min_samples_leaf': [1, 2, 4, 6, 8],
                                        'min_samples_split': [2, 3, 5, 7, 9],
                                        'n_estimators': [50, 100, 300, 500,
                                                         1000]},
                   random_state=1, scoring='accuracy', verbose=1)

In [185]:
ypred_tuning = hyper_tuning.predict(Xtest_data)

In [186]:
hyper_tuning.best_params_

{'n_estimators': 1000,
 'min_samples_split': 2,
 'min_samples_leaf': 8,
 'max_features': None}
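
The parameters above are the combination with the highest mean cross-validated accuracy on the training folds, which is not the same thing as test accuracy. The sketch below shows how `best_params_`, `best_score_` and the refitted estimator relate; it runs on a small synthetic problem from `make_classification` with a reduced, hypothetical search space so that it finishes quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic problem (hypothetical stand-in for the scaled grid features).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1)

# Reduced search space, chosen only to keep the example fast.
search_space = {
    "n_estimators": [10, 20, 30],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2", None],
}
search = RandomizedSearchCV(
    estimator=ExtraTreesClassifier(random_state=1),
    param_distributions=search_space,
    n_iter=4,
    cv=3,
    scoring="accuracy",
    random_state=1,
)
search.fit(X_demo, y_demo)

# The sampled combination with the highest mean cross-validated accuracy.
print(search.best_params_)
# Mean cross-validated accuracy of that combination (not a test-set score).
print(round(search.best_score_, 4))

# With the default refit=True, predict() uses best_estimator_,
# which has been refitted on all of X_demo.
demo_predictions = search.predict(X_demo)
```

Since the search only samples `n_iter` combinations and scores them on training folds, the selected model is not guaranteed to beat the default model on a held-out test set.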

In [187]:
# Building the ExtraTreesClassifier with the best hyperparameters found by the search
optimal_Tree = ExtraTreesClassifier(n_estimators=1000, min_samples_split=2, 
                                   min_samples_leaf=8, max_features=None)
optimal_Tree.fit(Xtr_data, Ytr_data)



ExtraTreesClassifier(max_features=None, min_samples_leaf=8, n_estimators=1000)

In [188]:
optimal_Tree_pred = optimal_Tree.predict(Xtest_data)

In [189]:
metrics_evaluation(Ytest_data, ypred_extraTree, model_extraTree)


 classification report: 
                precision    recall  f1-score   support

           0       0.95      0.84      0.89       712
           1       0.92      0.98      0.95      1288

    accuracy                           0.93      2000
   macro avg       0.93      0.91      0.92      2000
weighted avg       0.93      0.93      0.93      2000
 
The accuracy score of the model is 0.929 


 confusion matrix 
 [[ 601  111]
 [  31 1257]]


In [190]:
# Getting the evaluation metrics of the model after hyperparameter tuning
metrics_evaluation(Ytest_data, ypred_tuning, hyper_tuning)


 classification report: 
                precision    recall  f1-score   support

           0       0.92      0.87      0.90       712
           1       0.93      0.96      0.95      1288

    accuracy                           0.93      2000
   macro avg       0.93      0.92      0.92      2000
weighted avg       0.93      0.93      0.93      2000
 
The accuracy score of the model is 0.9285 


 confusion matrix 
 [[ 622   90]
 [  53 1235]]


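> Note: The tuned model's test accuracy (0.9285) is marginally lower than that of the untuned ExtraTreesClassifier (0.929), i.e. 1857 versus 1858 correct predictions out of 2000. Hyperparameter tuning did not improve test accuracy here; a difference this small is likely within run-to-run variation, since the ExtraTreesClassifier instances above do not fix random_state.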
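
The test accuracies reported in the cells above can be placed side by side for comparison. This is a small sketch that only collects the printed scores; the model labels used as index names are chosen here for illustration.

```python
import pandas as pd

# Test accuracies as printed by metrics_evaluation in the cells above.
reported = pd.Series({
    "RandomForest": 0.928,
    "XGBoost": 0.946,
    "LightGBM": 0.9365,
    "ExtraTrees (default)": 0.929,
    "ExtraTrees (tuned)": 0.9285,
})
n_test = 2000  # size of the test set (support total in the reports)

summary = pd.DataFrame({
    "accuracy": reported,
    # Number of correctly classified test samples implied by each accuracy.
    "correct": (reported * n_test).round().astype(int),
}).sort_values("accuracy", ascending=False)

print(summary)
```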

### Which features are the most and least important, respectively?

In [194]:
feature_importances = optimal_Tree.feature_importances_
feature_importances_data = pd.DataFrame(data=feature_importances, index=Xtr_data.columns, columns=['Importance'])
feature_importances_data

Unnamed: 0,Importance
tau1,0.137238
tau2,0.140416
tau3,0.134604
tau4,0.135635
p1,0.003841
p2,0.005397
p3,0.005433
p4,0.005113
g1,0.102822
g2,0.107793


In [202]:
# Getting the most and least important features
min_feature_importance = feature_importances_data['Importance'].idxmin()
print("The most important feature using the optimal ExtraTreesClassifier model is: {}".format(feature_importances_data['Importance'].idxmax()))
print(f'The least important feature using the optimal ExtraTreesClassifier model is: {min_feature_importance}')

The most important feature using the optimal ExtraTreesClassifier model is: tau2
The least important feature using the optimal ExtraTreesClassifier model is: p1
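
Sorting the importances gives the full ranking rather than only the two extremes. The sketch below hardcodes the values from the importance table above; the rows for 'g3' and 'g4' do not appear in that listing, so they are left out here rather than guessed.

```python
import pandas as pd

# Importances copied from the table above (only the rows shown there).
importances = pd.Series({
    "tau1": 0.137238, "tau2": 0.140416, "tau3": 0.134604, "tau4": 0.135635,
    "p1": 0.003841, "p2": 0.005397, "p3": 0.005433, "p4": 0.005113,
    "g1": 0.102822, "g2": 0.107793,
})

# Rank features from most to least important.
ranked = importances.sort_values(ascending=False)
print(ranked)

most_important = ranked.index[0]
least_important = ranked.index[-1]
```

In the listed values, the reaction times ('tau') carry the most weight, the elasticity coefficients ('g') come next, and the nominal powers ('p') contribute very little.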
