# Stage C Quiz Solution

Oladimeji Williams
© ellipsis

---

I, **Oladimeji WILLIAMS**, confirm, by submitting this document, that the solutions in this notebook are the result of my own work and that I abide by the [Code of Conduct](https://drive.google.com/file/d/1sbR80aowp1daCnElwx3kNm0fxids0e6b/view) contained therein.


### Overview: Machine Learning: Classification - Managing the Quality Metric of Global Ecological Footprint
> The dataset for the remainder of this quiz is the Stability of the Grid System data [here](https://archive.ics.uci.edu/ml/machine-learning-databases/00471/Data_for_UCI_named.csv). Electrical grids require a balance between electricity supply and demand in order to be stable. Conventional systems achieve this balance through demand-driven electricity production. For future grids with a high share of inflexible (i.e., renewable) energy sources, the concept of demand response is a promising solution. This implies changes in electricity consumption in relation to electricity price changes. In this work, we’ll build a binary classification model to predict if a grid is stable or unstable using the UCI Electrical Grid Stability Simulated dataset.

The dataset has 12 primary predictive features and two dependent variables.

Predictive features:

1. `'tau1'` to `'tau4'`: the reaction time of each network participant, a real value within the range 0.5 to 10 ('tau1' corresponds to the supplier node, 'tau2' to 'tau4' to the consumer nodes);
2. `'p1'` to `'p4'`: nominal power produced (positive) or consumed (negative) by each network participant, a real value within the range -2.0 to -0.5 for consumers ('p2' to 'p4'). As the total power consumed equals the total power generated, p1 (supplier node) = - (p2 + p3 + p4);
3. `'g1'` to `'g4'`: price elasticity coefficient for each network participant, a real value within the range 0.05 to 1.00 ('g1' corresponds to the supplier node, 'g2' to 'g4' to the consumer nodes; 'g' stands for 'gamma');
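The power-balance constraint in item 2 can be checked against the rows that `df.head()` prints in this notebook. This is a minimal sketch: the `rows` list is copied by hand from that output and the tolerance is an illustrative choice, so neither is part of the dataset itself.

```python
# (p1, p2, p3, p4) for the first three rows shown by df.head()
rows = [
    (3.763085, -0.782604, -1.257395, -1.723086),
    (5.067812, -1.940058, -1.872742, -1.255012),
    (3.405158, -1.207456, -1.277210, -0.920492),
]

# The supplier output p1 should cancel the total consumer demand,
# so p1 + (p2 + p3 + p4) should be (numerically) zero for every row.
residuals = [p1 + (p2 + p3 + p4) for p1, p2, p3, p4 in rows]
balanced = all(abs(r) < 1e-5 for r in residuals)
print(balanced)
```

Because `p1` is fully determined by the other three power values, it carries no independent information for a model.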

Dependent variables:

1. `'stab'`: the maximum real part of the characteristic differential equation root (if positive, the system is linearly unstable; if negative, linearly stable);
2. `'stabf'`: a categorical (binary) label (`'stable'` or `'unstable'`).


Because of the direct relationship between 'stab' and 'stabf' ('stabf' = 'stable' if 'stab' <= 0, 'unstable' otherwise), 'stab' should be dropped and 'stabf' will remain as the sole dependent variable (binary classification).
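The rule linking `'stab'` to `'stabf'` can be written as a one-line function and checked against the five rows printed by `df.head()`. This is only a sketch: `label_from_stab` is a hypothetical helper name and the `samples` list is copied by hand from that output.

```python
def label_from_stab(stab: float) -> str:
    """Map the continuous stability score to the binary label."""
    return "stable" if stab <= 0 else "unstable"

# (stab, stabf) pairs copied from the df.head() output
samples = [
    (0.055347, "unstable"),
    (-0.005957, "stable"),
    (0.003471, "unstable"),
    (0.028871, "unstable"),
    (0.049860, "unstable"),
]

# Collect any rows where the rule disagrees with the recorded label
mismatches = [(s, f) for s, f in samples if label_from_stab(s) != f]
print(mismatches)
```

An empty list means the rule reproduces every label in the sample, which is why keeping `'stab'` as a feature would leak the target.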

Split the data into an 80-20 train-test split with a random state of 1. Fit the standard scaler on the train set and use it to transform both the train features (x_train) and the test features (x_test). Use scikit-learn to train a random forest classifier and an extra trees classifier, and use xgboost and lightgbm to train an extreme gradient boosting model and a light gradient boosting model. Use random_state = 1 for training all models, and evaluate each on the test set.
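The instructions above (80-20 split, standard scaling, `random_state = 1`) can be sketched as follows. This is an illustration on synthetic data, not the notebook's actual run: the random feature matrix, the toy label rule, and the names `x_train_scaled`, `x_test_scaled`, `models` and `scores` are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 12 predictive features (assumption: not the real data)
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 12))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# 80-20 split with random_state=1, as the instructions specify
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Fit the scaler on the training features only, then apply it to both sets
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

# Train each model with random_state=1 and score it on the held-out test set
models = {
    "random_forest": RandomForestClassifier(random_state=1),
    "extra_trees": ExtraTreesClassifier(random_state=1),
}
scores = {}
for name, model in models.items():
    model.fit(x_train_scaled, y_train)
    scores[name] = model.score(x_test_scaled, y_test)
print(scores)
```

Fitting the scaler on the training split alone keeps test-set statistics out of the preprocessing step. `XGBClassifier` and `LGBMClassifier` could be added to the `models` dictionary in the same way, since the notebook calls the same `fit` and `score` methods on them.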

# Preliminaries

In [1]:
# Import the libraries used in this notebook
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn.utils
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import LeaveOneOut
from sklearn.metrics import recall_score, accuracy_score, precision_score, f1_score, confusion_matrix
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import RandomizedSearchCV

In [2]:
# Load the dataset
df = pd.read_csv('Data_for_UCI_named.csv')

In [3]:
# Copy the dataset into another dataframe
df_copy = df.copy()

In [4]:
# Peek at the first few observations of the dataset
df.head()

| | tau1 | tau2 | tau3 | tau4 | p1 | p2 | p3 | p4 | g1 | g2 | g3 | g4 | stab | stabf |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.95906 | 3.079885 | 8.381025 | 9.780754 | 3.763085 | -0.782604 | -1.257395 | -1.723086 | 0.650456 | 0.859578 | 0.887445 | 0.958034 | 0.055347 | unstable |
| 1 | 9.304097 | 4.902524 | 3.047541 | 1.369357 | 5.067812 | -1.940058 | -1.872742 | -1.255012 | 0.413441 | 0.862414 | 0.562139 | 0.78176 | -0.005957 | stable |
| 2 | 8.971707 | 8.848428 | 3.046479 | 1.214518 | 3.405158 | -1.207456 | -1.27721 | -0.920492 | 0.163041 | 0.766689 | 0.839444 | 0.109853 | 0.003471 | unstable |
| 3 | 0.716415 | 7.6696 | 4.486641 | 2.340563 | 3.963791 | -1.027473 | -1.938944 | -0.997374 | 0.446209 | 0.976744 | 0.929381 | 0.362718 | 0.028871 | unstable |
| 4 | 3.134112 | 7.608772 | 4.943759 | 9.857573 | 3.525811 | -1.125531 | -1.845975 | -0.554305 | 0.79711 | 0.45545 | 0.656947 | 0.820923 | 0.04986 | unstable |


In [5]:
# Peek at the datatypes
df.dtypes

tau1     float64
tau2     float64
tau3     float64
tau4     float64
p1       float64
p2       float64
p3       float64
p4       float64
g1       float64
g2       float64
g3       float64
g4       float64
stab     float64
stabf     object
dtype: object

## Question 1

`TN` = **98%**, `FP` = **2%**, `FN` = **18%**, `TP` = **82%**
This option satisfies both the required recall rate and the required false positive rate.
It is also the option with the lowest `False Positive` rate, which gives the minimum business cost.
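The recall and false positive rate implied by this option can be computed directly. The sketch below assumes the percentages are normalised per actual class (so `TN + FP = 100%` and `FN + TP = 100%`); the variable names are illustrative.

```python
# Percentages from the chosen option
tn, fp, fn, tp = 98, 2, 18, 82

# Recall (true positive rate): share of actual positives that are caught
recall = tp / (tp + fn)

# False positive rate: share of actual negatives that are flagged wrongly
false_positive_rate = fp / (fp + tn)

print(recall, false_positive_rate)
```

Under that assumption the option corresponds to a recall of 82% and a false positive rate of 2%.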

## Question 2

\begin{equation}
\text{F1 Score} = \frac{2TP}{2TP + FP + FN}
\end{equation}

\begin{equation}
\text{F1 Score} = \frac{2 \times 255}{(2 \times 255) + 1380 + 45} = \frac{510}{1935}
\end{equation}

\begin{equation}
\text{F1 Score} \approx 0.2636
\end{equation}
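The same arithmetic can be checked in a few lines of Python. The sketch uses the counts from the equations above (`TP = 255`, `FP = 1380`, `FN = 45`) and also confirms that the count-based formula agrees with the harmonic mean of precision and recall; the variable names are illustrative.

```python
tp, fp, fn = 255, 1380, 45

# Precision and recall from the raw counts
precision = tp / (tp + fp)
recall = tp / (tp + fn)

# F1 computed two equivalent ways
f1_from_counts = 2 * tp / (2 * tp + fp + fn)
f1_from_rates = 2 * precision * recall / (precision + recall)

print(round(f1_from_counts, 4))
```

The low score comes from the precision term: recall is high, but the large number of false positives drags the harmonic mean down.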

## Question 3

## Question 4

## Question 5

## Question 6

## Question 7

## Question 8

## Question 9

## Question 10

## Question 11

## Question 12

\begin{equation}
\text{entropy} = -\frac{3}{7}\log\left(\frac{3}{7}\right) - \frac{4}{7}\log\left(\frac{4}{7}\right)
\end{equation}
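The expression above can be evaluated numerically. The equation does not state the base of the logarithm, so the sketch below assumes base 2 (entropy in bits) by default and also shows the natural-log variant; `entropy` is a hypothetical helper written for this example.

```python
import math

def entropy(counts, base=2):
    """Shannon entropy of a class-count distribution."""
    total = sum(counts)
    return -sum(
        (c / total) * math.log(c / total, base) for c in counts if c > 0
    )

# 3 samples of one class and 4 of the other, as in the equation
h_bits = entropy([3, 4])               # base 2
h_nats = entropy([3, 4], base=math.e)  # natural logarithm
print(round(h_bits, 4), round(h_nats, 4))
```

With base 2 the value is about 0.985, close to the maximum of 1 bit for a two-class split, which reflects how nearly balanced the 3-versus-4 split is.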

## Question 13

## Question 14

In [6]:
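# Encode the string labels ('stable' / 'unstable') as integers
encoder = LabelEncoder()
# Drop both dependent variables from the feature matrix
X = df.drop(columns=["stab", "stabf"])
y = df["stabf"]
y_encoded = encoder.fit_transform(y)
# Note: this cell uses a 70-30 split with random_state=0
x_train, x_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.3, random_state=0)
# Random forest with default hyperparameters
rf = RandomForestClassifier()
rf.fit(x_train, y_train)
print(f"Accuracy on test set: {round(rf.score(x_test, y_test), 4)}")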

Accuracy on test set: 0.9167


## Question 15

In [7]:
# XGBoost classifier with default hyperparameters, trained on the same split
xgboost = XGBClassifier()
xgboost.fit(x_train, y_train)
print(f"Accuracy on test set: {round(xgboost.score(x_test, y_test), 4)}")

Accuracy on test set: 0.9373


## Question 16

In [8]:
# LightGBM classifier with default hyperparameters, trained on the same split
lgbm = LGBMClassifier()
lgbm.fit(x_train, y_train)
print(f"Accuracy on test set: {round(lgbm.score(x_test, y_test), 4)}")

Accuracy on test set: 0.9353


## Question 17

In [9]:
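# Baseline extra trees model with default hyperparameters (reused in Question 18)
xtree = ExtraTreesClassifier()
xtree.fit(x_train, y_train)
# Candidate hyperparameter values for the randomized search
param_grid1 = {
    "n_estimators": [100, 300, 500, 1000],
    "min_samples_split": [2, 5, 7, 10],
    "min_samples_leaf": [4, 6, 8, 16],
    # "auto" was removed in scikit-learn 1.3; for classifiers it was equivalent to "sqrt"
    "max_features": ["sqrt", "log2", None]
}
# verbose=1 rather than -1: scikit-learn expects a non-negative verbosity level
cv = RandomizedSearchCV(estimator=xtree, param_distributions=param_grid1, cv=5, n_iter=10, scoring="accuracy", n_jobs=-1, verbose=1, random_state=1)
cv.fit(x_train, y_train)
print(cv.best_params_)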

{'n_estimators': 300, 'min_samples_split': 2, 'min_samples_leaf': 8, 'max_features': None}


## Question 18

In [10]:
# Test accuracy of the baseline (default) extra trees model
print(f"Accuracy on test set: {round(xtree.score(x_test, y_test), 4)}")

Accuracy on test set: 0.9207


In [11]:
# Tuned model; note that n_estimators=1000 here, while the search output above reports 300
xtree2 = ExtraTreesClassifier(n_estimators=1000, min_samples_split=2, min_samples_leaf=8, max_features=None, random_state=1)
xtree2.fit(x_train, y_train)
print(f"Accuracy on test set: {round(xtree2.score(x_test, y_test), 4)}")

Accuracy on test set: 0.92


## Question 19

## Question 20

In [12]:
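# Feature importances of the baseline model `xtree` (not the tuned `xtree2`)
weights_xtree = pd.DataFrame(data={"weights":xtree.feature_importances_}, index=x_train.columns)
# idxmax / idxmin return the feature names with the largest and smallest weight
print(f"max: {weights_xtree.idxmax()}")
print(f"min: {weights_xtree.idxmin()}")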

max: weights    tau2
dtype: object
min: weights    p1
dtype: object
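For a full ranking rather than just the extremes, the importances can be stored in a `pandas.Series` and sorted. The sketch below runs on synthetic data with hypothetical feature names (`f0` to `f3`) and a toy label that depends only on `f0`, so it illustrates the pattern and says nothing about the grid dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic data with hypothetical feature names (assumption: not the real data)
rng = np.random.default_rng(1)
columns = [f"f{i}" for i in range(4)]
X = pd.DataFrame(rng.normal(size=(300, 4)), columns=columns)
y = (X["f0"] > 0).astype(int)  # only f0 drives the label

model = ExtraTreesClassifier(n_estimators=100, random_state=1).fit(X, y)

# One importance per feature, indexed by name and sorted from high to low
importances = pd.Series(model.feature_importances_, index=X.columns)
ranked = importances.sort_values(ascending=False)
most_important, least_important = ranked.index[0], ranked.index[-1]
print(most_important, least_important)
```

Using a `Series` makes `ranked.index[0]` and `ranked.index[-1]` plain strings, which avoids the `dtype: object` wrapper seen in the output above. The same pattern would apply to either fitted model in this notebook.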
