<h2>Machine Learning Classification Managing the Quality Metric of Global Ecological Footprint</h2>

<h4>Stability of the Grid System</h4>
Electrical grids require a balance between electricity supply and demand in order to be stable. Conventional systems achieve this balance through demand-driven electricity production. For future grids with a high share of inflexible (i.e., renewable) energy sources, demand response is a promising solution: electricity consumption changes in response to changes in the electricity price. In this work, we build a binary classification model to predict whether a grid is stable or unstable, using the UCI Electrical Grid Stability Simulated dataset.

Question 1: What is the F1 Score of the Classifier?

In [1]:
# Total instances (given): n = 2000
# Counts implied by the formulas below: TP = 355, FP = 1480, FN = 45
# F1 = 2 * (Precision * Recall) / (Precision + Recall)
Precision = 355 / (355 + 1480)   # TP / (TP + FP)
Recall = 355 / (355 + 45)        # TP / (TP + FN)
F1_Score = 2 * (Precision * Recall) / (Precision + Recall)
print('The F1 Score of the Classifier {}'.format(round(F1_Score, 4)))

The F1 Score of the Classifier 0.3177
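
As a cross-check, the same score can be computed directly from the counts. This is a minimal sketch that assumes the numbers in the cell above are read as TP = 355, FP = 1480 and FN = 45; the helper `f1_from_counts` and the variable names are illustrative and not part of the original notebook. It uses the equivalent form F1 = 2TP / (2TP + FP + FN).

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 from raw confusion-matrix counts: 2TP / (2TP + FP + FN)."""
    return 2 * tp / (2 * tp + fp + fn)

# Counts as read from the precision and recall expressions (an interpretation)
tp, fp, fn = 355, 1480, 45

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_harmonic = 2 * precision * recall / (precision + recall)  # harmonic-mean form
f1_counts = f1_from_counts(tp, fp, fn)                       # count-based form

print(round(f1_harmonic, 4), round(f1_counts, 4))
```

Both forms should agree up to floating-point error, since the count-based expression is just the harmonic mean with the common terms cancelled.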


In [2]:
#importing necessary packages 
import numpy as np
import pandas as pd

<h1>Load Dataset</h1>

In [3]:
#loading the dataset

df = pd.read_csv('Data_for_UCI_named.csv')

The dataset has 12 primary predictive features and two dependent variables.

<h3>Predictive features:</h3>

'tau1' to 'tau4': the reaction time of each network participant, a real value within the range 0.5 to 10 ('tau1' corresponds to the supplier node, 'tau2' to 'tau4' to the consumer nodes);
'p1' to 'p4': nominal power produced (positive) or consumed (negative) by each network participant, a real value within the range -2.0 to -0.5 for consumers ('p2' to 'p4'). As the total power consumed equals the total power generated, p1 (supplier node) = - (p2 + p3 + p4);
'g1' to 'g4': price elasticity coefficient for each network participant, a real value within the range 0.05 to 1.00 ('g1' corresponds to the supplier node, 'g2' to 'g4' to the consumer nodes; 'g' stands for 'gamma');
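
The power-balance relation p1 = -(p2 + p3 + p4) can be checked by hand. The sketch below uses the values of the first row as they appear in the `df.head()` output of this notebook; the tolerance is an assumption that allows for the six-decimal rounding of the displayed values.

```python
import math

# First row of the df.head() output (values as displayed, rounded to 6 decimals)
p1, p2, p3, p4 = 3.763085, -0.782604, -1.257395, -1.723086

# The supplier's production should offset the total consumption of the three consumers
balance_holds = math.isclose(p1, -(p2 + p3 + p4), abs_tol=1e-5)
print(balance_holds)
```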
<h3>Dependent variables:</h3>

'stab': the maximum real part of the characteristic differential equation root (if positive, the system is linearly unstable; if negative, linearly stable);
'stabf': a categorical (binary) label ('stable' or 'unstable').

<h2>Exploratory Data Analysis</h2>

In [4]:
df.head()

,tau1,tau2,tau3,tau4,p1,p2,p3,p4,g1,g2,g3,g4,stab,stabf
0,2.95906,3.079885,8.381025,9.780754,3.763085,-0.782604,-1.257395,-1.723086,0.650456,0.859578,0.887445,0.958034,0.055347,unstable
1,9.304097,4.902524,3.047541,1.369357,5.067812,-1.940058,-1.872742,-1.255012,0.413441,0.862414,0.562139,0.78176,-0.005957,stable
2,8.971707,8.848428,3.046479,1.214518,3.405158,-1.207456,-1.27721,-0.920492,0.163041,0.766689,0.839444,0.109853,0.003471,unstable
3,0.716415,7.6696,4.486641,2.340563,3.963791,-1.027473,-1.938944,-0.997374,0.446209,0.976744,0.929381,0.362718,0.028871,unstable
4,3.134112,7.608772,4.943759,9.857573,3.525811,-1.125531,-1.845975,-0.554305,0.79711,0.45545,0.656947,0.820923,0.04986,unstable


In [5]:
#checking for null values
df.isna().sum()

tau1     0
tau2     0
tau3     0
tau4     0
p1       0
p2       0
p3       0
p4       0
g1       0
g2       0
g3       0
g4       0
stab     0
stabf    0
dtype: int64

There are no null values in the dataset.

In [6]:
#getting the information of the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   tau1    10000 non-null  float64
 1   tau2    10000 non-null  float64
 2   tau3    10000 non-null  float64
 3   tau4    10000 non-null  float64
 4   p1      10000 non-null  float64
 5   p2      10000 non-null  float64
 6   p3      10000 non-null  float64
 7   p4      10000 non-null  float64
 8   g1      10000 non-null  float64
 9   g2      10000 non-null  float64
 10  g3      10000 non-null  float64
 11  g4      10000 non-null  float64
 12  stab    10000 non-null  float64
 13  stabf   10000 non-null  object 
dtypes: float64(13), object(1)
memory usage: 1.1+ MB


In [7]:
df.shape

(10000, 14)

All columns are floats except 'stabf', the categorical dependent variable, which is stored as an object column.

In [8]:
#dropping the stab column according to the instruction given 
df = df.drop(columns = 'stab')

In [9]:
df.head()

,tau1,tau2,tau3,tau4,p1,p2,p3,p4,g1,g2,g3,g4,stabf
0,2.95906,3.079885,8.381025,9.780754,3.763085,-0.782604,-1.257395,-1.723086,0.650456,0.859578,0.887445,0.958034,unstable
1,9.304097,4.902524,3.047541,1.369357,5.067812,-1.940058,-1.872742,-1.255012,0.413441,0.862414,0.562139,0.78176,stable
2,8.971707,8.848428,3.046479,1.214518,3.405158,-1.207456,-1.27721,-0.920492,0.163041,0.766689,0.839444,0.109853,unstable
3,0.716415,7.6696,4.486641,2.340563,3.963791,-1.027473,-1.938944,-0.997374,0.446209,0.976744,0.929381,0.362718,unstable
4,3.134112,7.608772,4.943759,9.857573,3.525811,-1.125531,-1.845975,-0.554305,0.79711,0.45545,0.656947,0.820923,unstable


In [10]:
df.describe()

,tau1,tau2,tau3,tau4,p1,p2,p3,p4,g1,g2,g3,g4
count,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0
mean,5.25,5.250001,5.250004,5.249997,3.75,-1.25,-1.25,-1.25,0.525,0.525,0.525,0.525
std,2.742548,2.742549,2.742549,2.742556,0.75216,0.433035,0.433035,0.433035,0.274256,0.274255,0.274255,0.274255
min,0.500793,0.500141,0.500788,0.500473,1.58259,-1.999891,-1.999945,-1.999926,0.050009,0.050053,0.050054,0.050028
25%,2.874892,2.87514,2.875522,2.87495,3.2183,-1.624901,-1.625025,-1.62496,0.287521,0.287552,0.287514,0.287494
50%,5.250004,5.249981,5.249979,5.249734,3.751025,-1.249966,-1.249974,-1.250007,0.525009,0.525003,0.525015,0.525002
75%,7.62469,7.624893,7.624948,7.624838,4.28242,-0.874977,-0.875043,-0.875065,0.762435,0.76249,0.76244,0.762433
max,9.999469,9.999837,9.99945,9.999443,5.864418,-0.500108,-0.500072,-0.500025,0.999937,0.999944,0.999982,0.99993


Because of the direct relationship between 'stab' and 'stabf' ('stabf' = 'stable' if 'stab' <= 0, 'unstable' otherwise), 'stab' is dropped and 'stabf' remains as the sole dependent variable (binary classification).
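
The rule linking the two dependent variables can be written as a one-line function. This is only an illustrative sketch: `label_from_stab` is a hypothetical helper, and the three (stab, stabf) pairs are copied from the first rows of the initial `df.head()` output.

```python
def label_from_stab(stab: float) -> str:
    """Map the numeric stability value to the binary label stored in 'stabf'."""
    return 'stable' if stab <= 0 else 'unstable'

# (stab, stabf) pairs from the first three rows of the initial df.head() output
rows = [(0.055347, 'unstable'), (-0.005957, 'stable'), (0.003471, 'unstable')]

# Every derived label should equal the label stored in the dataset
labels_match = all(label_from_stab(stab) == stabf for stab, stabf in rows)
print(labels_match)
```

Since 'stabf' is fully determined by 'stab', keeping 'stab' as a feature would leak the answer to the classifier.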

<h2>Splitting the dataset</h2>

In [11]:
#splitting the data into predictive features and dependent variable
x = df.iloc[:,:-1]
y = df.iloc[:,-1]

In [12]:
x.head()

,tau1,tau2,tau3,tau4,p1,p2,p3,p4,g1,g2,g3,g4
0,2.95906,3.079885,8.381025,9.780754,3.763085,-0.782604,-1.257395,-1.723086,0.650456,0.859578,0.887445,0.958034
1,9.304097,4.902524,3.047541,1.369357,5.067812,-1.940058,-1.872742,-1.255012,0.413441,0.862414,0.562139,0.78176
2,8.971707,8.848428,3.046479,1.214518,3.405158,-1.207456,-1.27721,-0.920492,0.163041,0.766689,0.839444,0.109853
3,0.716415,7.6696,4.486641,2.340563,3.963791,-1.027473,-1.938944,-0.997374,0.446209,0.976744,0.929381,0.362718
4,3.134112,7.608772,4.943759,9.857573,3.525811,-1.125531,-1.845975,-0.554305,0.79711,0.45545,0.656947,0.820923


In [13]:
y.head()

0    unstable
1      stable
2    unstable
3    unstable
4    unstable
Name: stabf, dtype: object

In [14]:
#Applying 80-20 train-test split with a random state of 1.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split( x, y, test_size=0.2, random_state=1)

In [15]:
# Fitting the StandardScaler on the training features (x_train) only, then applying the same transformation to the test features (x_test)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
transformed_x_train = scaler.fit_transform(x_train)
transformed_x_train = pd.DataFrame(transformed_x_train, columns = x_train.columns)


transformed_x_test = scaler.transform(x_test)
transformed_x_test = pd.DataFrame(transformed_x_test, columns = x_test.columns)
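
To make the fit/transform distinction concrete, here is a minimal pure-Python sketch of standardization for a single feature. The toy values and the helper names `fit_standardizer` and `standardize` are hypothetical; the sketch uses the population standard deviation, which is what `StandardScaler` uses.

```python
from statistics import fmean, pstdev

def fit_standardizer(values):
    """Return (mean, std) of the training values; pstdev is the population std."""
    return fmean(values), pstdev(values)

def standardize(values, mean, std):
    """Shift by the mean and divide by the standard deviation."""
    return [(v - mean) / std for v in values]

# Hypothetical toy values for one feature, not taken from the dataset
train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 10.0]

mean, std = fit_standardizer(train)          # statistics come from the training data only
train_scaled = standardize(train, mean, std)
test_scaled = standardize(test, mean, std)   # the test data reuses the training statistics
print(train_scaled, test_scaled)
```

Reusing the training mean and standard deviation for the test set keeps information about the test data out of the preprocessing step.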

In [16]:
transformed_x_test.head()

,tau1,tau2,tau3,tau4,p1,p2,p3,p4,g1,g2,g3,g4
0,0.593951,-0.412733,1.503924,1.116943,0.403423,-1.492971,-0.785033,1.566781,-0.901007,1.167203,-1.50733,1.084726
1,0.20219,0.374416,-0.1888,-0.522268,-0.225967,-1.058483,0.420047,1.028627,-1.625721,-0.39566,1.414651,1.226011
2,-1.079044,-0.313745,-0.884634,0.01708,-0.943122,0.112653,0.801335,0.733004,1.457108,-1.438495,0.651821,-1.682168
3,-0.08312,-1.107327,0.372805,-1.708152,0.75399,-1.637972,0.403805,-0.088036,0.083322,-1.672322,-0.357714,1.055865
4,0.873921,1.438466,0.086662,1.715037,-0.15388,-0.007015,-0.197053,0.472315,0.136549,-1.469731,0.956396,-0.819727


In [17]:
transformed_x_train.head()

,tau1,tau2,tau3,tau4,p1,p2,p3,p4,g1,g2,g3,g4
0,0.367327,-0.986042,0.650447,1.547527,-0.29149,0.061535,1.293862,-0.845074,0.160918,0.339859,0.585568,0.492239
1,-0.064659,0.089437,1.035079,-1.641494,0.619865,-0.067235,-1.502925,0.486613,-0.293143,-1.558488,1.429649,-1.443521
2,-1.46785,1.298418,-0.502536,1.166046,-0.180521,0.490603,0.68256,-0.855302,1.39935,1.451534,-1.045743,0.492489
3,0.820081,0.52992,1.299657,-1.141975,-0.812854,-0.763632,1.521579,0.65878,-0.958319,1.361958,1.60414,0.275303
4,0.665424,-1.425627,0.3123,0.919137,-1.614296,0.760315,1.422019,0.639243,1.676895,0.69566,1.137504,-1.312575


In [18]:
#Training for a random forest
from sklearn.ensemble import RandomForestClassifier #importing our classifier and fitting the data
rand_classifier = RandomForestClassifier(random_state=1)
rand_classifier.fit(transformed_x_train,y_train)

RandomForestClassifier(random_state=1)

In [19]:
predi = rand_classifier.predict(transformed_x_test)

In [20]:
from sklearn.metrics import recall_score, accuracy_score, precision_score, f1_score

print("Accuracy score {}".format(round(accuracy_score(y_test, predi), 4)))
print("Precision score  %.3f" % (precision_score(y_test, predi, pos_label='stable')))
print("Recall score {}".format(round(recall_score(y_test, predi, pos_label='stable'), 4)))
print("F1 score %.3f" % (f1_score(y_test, predi, pos_label='stable')))

Accuracy score 0.929
Precision score  0.919
Recall score 0.8778
F1 score 0.898
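
The `pos_label='stable'` argument tells the metric which class counts as the positive one. The sketch below shows, on hypothetical toy labels (not taken from the dataset), how precision and recall change with that choice; `precision_recall` is an illustrative helper, not a scikit-learn function.

```python
def precision_recall(y_true, y_pred, pos_label):
    """Precision and recall, treating `pos_label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == pos_label and p == pos_label for t, p in pairs)
    fp = sum(t != pos_label and p == pos_label for t, p in pairs)
    fn = sum(t == pos_label and p != pos_label for t, p in pairs)
    return tp / (tp + fp), tp / (tp + fn)

# Hypothetical toy labels for illustration only
y_true = ['stable', 'stable', 'stable', 'unstable', 'unstable', 'unstable']
y_pred = ['stable', 'stable', 'unstable', 'unstable', 'unstable', 'unstable']

stable_scores = precision_recall(y_true, y_pred, pos_label='stable')
unstable_scores = precision_recall(y_true, y_pred, pos_label='unstable')
print(stable_scores, unstable_scores)
```

The same predictions give different precision and recall depending on which class is treated as positive, which is why the choice is made explicit when the labels are strings.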


<h3>Question 14: What is the accuracy on the test set using the random forest classifier, to 4 decimal places?</h3>
<h4>0.9290</h4>

<h3>Question 15: What is the accuracy on the test set using the XGBoost classifier, to 4 decimal places?</h3>

In [21]:
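#xgboost
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

# Recent XGBoost releases expect integer class labels, so encode 'stable'/'unstable' first
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
xgb_class = XGBClassifier(random_state=1)
xgb_class.fit(transformed_x_train, y_train_encoded)
# Map the integer predictions back to the original string labels
xgb_pred = label_encoder.inverse_transform(xgb_class.predict(transformed_x_test))

round(accuracy_score(y_test, xgb_pred), 4)
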
0.9455

<h3>Question 16: What is the accuracy on the test set using the LGBM classifier, to 4 decimal places?</h3>

In [22]:
import lightgbm as lgbm
# Use a separate name for the model so the lightgbm module is not shadowed
lgbm_class = lgbm.LGBMClassifier(random_state=1)
lgbm_class.fit(transformed_x_train, y_train)
lgbm_pred = lgbm_class.predict(transformed_x_test)

round(accuracy_score(y_test, lgbm_pred), 4)

0.9395

<h3>Question 17: To improve the Extra Trees Classifier, use the following parameters (number of estimators, minimum number of samples to split a node, minimum number of samples at a leaf node, and the number of features to consider when looking for the best split) as the hyperparameter grid for a randomized cross-validation search (RandomizedSearchCV).</h3>

In [23]:
#Training an extra trees classifier
from sklearn.ensemble import ExtraTreesClassifier

Extra_tree_class = ExtraTreesClassifier(random_state=1)

In [24]:
# Applying the given parameters
n_estimators = [50, 100, 300, 500, 1000]    # These values are given in the question
min_samples_split = [2, 3, 5, 7, 9]
min_samples_leaf = [1, 2, 4, 6, 8]
# Note: 'auto' is part of the given grid, but it was deprecated and later removed in
# newer scikit-learn releases; drop it if the search reports an invalid parameter
max_features = ['auto', 'sqrt', 'log2', None]

hyperparameter_grid = {'n_estimators': n_estimators,
                       'min_samples_leaf': min_samples_leaf,
                       'min_samples_split': min_samples_split,
                       'max_features': max_features}

In [25]:
from sklearn.model_selection import RandomizedSearchCV

rand_domized = RandomizedSearchCV(estimator = Extra_tree_class, param_distributions= hyperparameter_grid, random_state=1,cv = 5, n_iter=10,scoring='accuracy',n_jobs=1, verbose=1)

In [None]:
#Execute Search
search = rand_domized.fit(transformed_x_train,y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
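
The "50 fits" figure follows from simple arithmetic: the randomized search samples `n_iter=10` candidates from the grid and evaluates each with `cv=5` folds. The sketch below reproduces that count and also shows how many combinations an exhaustive grid search would have to cover; the variable names `total_combinations` and `total_fits` are illustrative.

```python
from math import prod

# The grid given in the question
hyperparameter_grid = {
    'n_estimators': [50, 100, 300, 500, 1000],
    'min_samples_leaf': [1, 2, 4, 6, 8],
    'min_samples_split': [2, 3, 5, 7, 9],
    'max_features': ['auto', 'sqrt', 'log2', None],
}
n_iter, cv = 10, 5  # settings passed to RandomizedSearchCV above

# Number of distinct settings in the full grid: 5 * 5 * 5 * 4
total_combinations = prod(len(values) for values in hyperparameter_grid.values())
# Number of model fits the randomized search performs during cross-validation
total_fits = n_iter * cv
print(total_combinations, total_fits)
```

The randomized search therefore tries only a small sample of the 500 possible settings, which is why its best result is not guaranteed to beat the untuned default model.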


In [None]:
#Summarize result
print('Best Score: %s' % search.best_score_)
print('Best Hyperparameters: %s' %search.best_params_)

<h4>Question 18: Train a new ExtraTreesClassifier model with the new hyperparameters from the RandomizedSearchCV (with random_state = 1). Is the accuracy of the new optimal model higher or lower than that of the initial ExtraTreesClassifier model with no hyperparameter tuning?

Ans: lower</h4>

In [None]:
# random_state=1 as the question requires; the remaining hyperparameters are hard-coded here
New_Tree_Class = ExtraTreesClassifier(n_estimators=1000, min_samples_split=2, min_samples_leaf=8, max_features=None, random_state=1)
New_Tree_Class.fit(transformed_x_train, y_train)
# Keep the predictions in their own variable so the fitted model is not overwritten
new_tree_pred = New_Tree_Class.predict(transformed_x_test)

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, new_tree_pred, digits=4))
print('\n')
print("Accuracy score {}".format(accuracy_score(y_test, new_tree_pred)))

In [None]:
# Comparing this result with the original Extra Trees classifier without tuning

Extra_tree_class.fit(transformed_x_train,y_train)
tree_pred = Extra_tree_class.predict(transformed_x_test)

print(classification_report(y_test,tree_pred))

<h4>Question 20: Find the feature importance using the optimal ExtraTreesClassifier model. Which features are the most and least important respectively?</h4>

In [None]:
#computing the importance of each feature
feature_importance = search.best_estimator_.feature_importances_
print('Feature Importance', feature_importance)

In [None]:
sorted(zip(feature_importance, x.columns), reverse=True)
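
The ranking idiom used above can be seen in isolation on toy data. The names and importance values below are hypothetical and chosen only to illustrate how the most and least important features are read off the sorted list.

```python
# Hypothetical feature names and importances, for illustration only
names = ['a', 'b', 'c', 'd']
importances = [0.10, 0.45, 0.05, 0.40]

# Pair each importance with its feature name and sort from largest to smallest
ranked = sorted(zip(importances, names), reverse=True)

most_important = ranked[0][1]    # first entry has the largest importance
least_important = ranked[-1][1]  # last entry has the smallest importance
print(most_important, least_important)
```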

In [None]:
# Plotting a bar graph to compare the feature importances
import matplotlib.pyplot as plt

plt.bar(x.columns, feature_importance)
plt.xlabel('Feature Labels')
plt.ylabel('Feature Importances')
plt.title('Comparison of different Feature Importances')
plt.show()

<h4>Ans: most important feature - tau2

least important feature - p1</h4>