# Hamoye Week 3 July 2021 Internship Assignment

## Introduction

Electrical grids require a balance between electricity supply and demand in order to be stable. Conventional systems achieve this balance through demand-driven electricity production. For future grids with a high share of inflexible (i.e., renewable) energy sources, the concept of demand response is a promising solution. Demand response means that electricity consumption changes in response to changes in the electricity price.

In this work, we’ll build a binary classification model to predict if a grid is stable or unstable using the UCI Electrical Grid Stability Simulated dataset.
The dataset used in this assignment can be downloaded [here](https://archive.ics.uci.edu/ml/datasets/Electrical+Grid+Stability+Simulated+Data+). 

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier
import lightgbm as lgb
from sklearn.linear_model import LogisticRegression 
from sklearn.model_selection import cross_val_score 
from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold 
from sklearn.metrics import recall_score, accuracy_score, precision_score, f1_score, confusion_matrix 


%matplotlib inline

In [2]:
uci = pd.read_csv('Data_for_UCI_named.csv')

In [3]:
uci.head()

Unnamed: 0,tau1,tau2,tau3,tau4,p1,p2,p3,p4,g1,g2,g3,g4,stab,stabf
0,2.95906,3.079885,8.381025,9.780754,3.763085,-0.782604,-1.257395,-1.723086,0.650456,0.859578,0.887445,0.958034,0.055347,unstable
1,9.304097,4.902524,3.047541,1.369357,5.067812,-1.940058,-1.872742,-1.255012,0.413441,0.862414,0.562139,0.78176,-0.005957,stable
2,8.971707,8.848428,3.046479,1.214518,3.405158,-1.207456,-1.27721,-0.920492,0.163041,0.766689,0.839444,0.109853,0.003471,unstable
3,0.716415,7.6696,4.486641,2.340563,3.963791,-1.027473,-1.938944,-0.997374,0.446209,0.976744,0.929381,0.362718,0.028871,unstable
4,3.134112,7.608772,4.943759,9.857573,3.525811,-1.125531,-1.845975,-0.554305,0.79711,0.45545,0.656947,0.820923,0.04986,unstable


In [4]:
uci.describe()

Unnamed: 0,tau1,tau2,tau3,tau4,p1,p2,p3,p4,g1,g2,g3,g4,stab
count,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0
mean,5.25,5.250001,5.250004,5.249997,3.75,-1.25,-1.25,-1.25,0.525,0.525,0.525,0.525,0.015731
std,2.742548,2.742549,2.742549,2.742556,0.75216,0.433035,0.433035,0.433035,0.274256,0.274255,0.274255,0.274255,0.036919
min,0.500793,0.500141,0.500788,0.500473,1.58259,-1.999891,-1.999945,-1.999926,0.050009,0.050053,0.050054,0.050028,-0.08076
25%,2.874892,2.87514,2.875522,2.87495,3.2183,-1.624901,-1.625025,-1.62496,0.287521,0.287552,0.287514,0.287494,-0.015557
50%,5.250004,5.249981,5.249979,5.249734,3.751025,-1.249966,-1.249974,-1.250007,0.525009,0.525003,0.525015,0.525002,0.017142
75%,7.62469,7.624893,7.624948,7.624838,4.28242,-0.874977,-0.875043,-0.875065,0.762435,0.76249,0.76244,0.762433,0.044878
max,9.999469,9.999837,9.99945,9.999443,5.864418,-0.500108,-0.500072,-0.500025,0.999937,0.999944,0.999982,0.99993,0.109403


In [5]:
uci.columns

Index(['tau1', 'tau2', 'tau3', 'tau4', 'p1', 'p2', 'p3', 'p4', 'g1', 'g2',
       'g3', 'g4', 'stab', 'stabf'],
      dtype='object')

In [6]:
uci.isna().sum()

tau1     0
tau2     0
tau3     0
tau4     0
p1       0
p2       0
p3       0
p4       0
g1       0
g2       0
g3       0
g4       0
stab     0
stabf    0
dtype: int64

In [7]:
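# NOTE: iloc[:, 0:11] selects the first 11 columns (tau1 through g3) and leaves out g4;
# use iloc[:, 0:12] to include all 12 predictors: tau1-tau4, p1-p4, g1-g4.
X = uci.iloc[:, 0:11]
y = uci.iloc[:, 13]  # stabf, the categorical target ('stable' / 'unstable')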

In [8]:
X
y

0       unstable
1         stable
2       unstable
3       unstable
4       unstable
          ...   
9995    unstable
9996      stable
9997      stable
9998    unstable
9999    unstable
Name: stabf, Length: 10000, dtype: object
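Selecting the predictors by name rather than by position makes it easier to see exactly which columns go into `X`. The sketch below is only an illustration: it uses the column names from the `uci.columns` output above and a tiny stand-in frame (`demo`, with made-up values) so that it runs on its own.

```python
import pandas as pd

# Column names as listed in the `uci.columns` output.
feature_cols = ['tau1', 'tau2', 'tau3', 'tau4',
                'p1', 'p2', 'p3', 'p4',
                'g1', 'g2', 'g3', 'g4']

# Hypothetical two-row stand-in for the real dataset.
demo = pd.DataFrame([[0.0] * 12 + [0.05, 'unstable'],
                     [1.0] * 12 + [-0.01, 'stable']],
                    columns=feature_cols + ['stab', 'stabf'])

X_named = demo[feature_cols]   # all 12 predictors, `stab` excluded
y_named = demo['stabf']        # categorical target

print(X_named.shape)               # 12 feature columns
print(demo.iloc[:, 0:11].shape)    # a 0:11 slice stops at g3 (11 columns)
```

With the real data, `uci[feature_cols]` and `uci['stabf']` would play the roles of `X_named` and `y_named`.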

In [9]:
# splitting the data set
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size= 0.2, random_state=1)

Using Standard Scaler

In [10]:
scaler = StandardScaler()
x_traintrans = pd.DataFrame(scaler.fit_transform(x_train))  # fit the scaler on the training split only
x_testtrans = scaler.transform(x_test)  # reuse the training mean and standard deviation
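As a quick sanity check on what the scaler does, the following sketch fits a `StandardScaler` on synthetic training data and applies the same statistics to a synthetic test set. The arrays and the name `demo_scaler` are illustrative assumptions, not part of the assignment data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
train = rng.normal(loc=5.0, scale=2.0, size=(80, 3))  # synthetic "training" features
test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))   # synthetic "test" features

demo_scaler = StandardScaler()
train_scaled = demo_scaler.fit_transform(train)  # learns mean/std from train
test_scaled = demo_scaler.transform(test)        # applies the train mean/std

# Training columns end up with mean ~0 and standard deviation ~1;
# test columns are only approximately standardized.
print(train_scaled.mean(axis=0).round(6))
print(train_scaled.std(axis=0).round(6))
```

Fitting on the training split alone keeps information from the test set out of the preprocessing step.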

In [66]:
import imblearn 
from imblearn.over_sampling import SMOTE 
smote = SMOTE(random_state= 1 ) 
x_train_balanced, y_balanced = smote.fit_resample(x_train, y_train)

In [71]:
x_trainf = pd.DataFrame(scaler.fit_transform(x_train_balanced))  # note: this refits `scaler` on the balanced data

Using Random Forest Classifier

In [19]:
randmodel = RandomForestClassifier(random_state = 1)
randmodel.fit(x_traintrans, y_train)
rand_ypred = randmodel.predict(x_testtrans)

In [74]:
ranmodel = RandomForestClassifier(random_state = 1)
ranmodel.fit(x_train_balanced, y_balanced)
ran_ypred = ranmodel.predict(x_test)

In [20]:
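from sklearn import metrics
# Arguments are in (y_pred, y_true) order here; classification_report expects (y_true, y_pred),
# so the precision and recall columns are swapped and `support` counts predictions.
print(metrics.classification_report(rand_ypred, y_test))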

              precision    recall  f1-score   support

      stable       0.82      0.83      0.83       705
    unstable       0.91      0.90      0.90      1295

    accuracy                           0.88      2000
   macro avg       0.86      0.87      0.87      2000
weighted avg       0.88      0.88      0.88      2000



In [38]:
raccuracyscore=accuracy_score(y_test, rand_ypred)
raccuracyscore

0.8765

In [48]:
r_recall = recall_score(y_test, rand_ypred, average="binary", pos_label="stable")
r_recall

0.8216292134831461

In [49]:
rprecision =precision_score(y_test, rand_ypred, average="binary", pos_label="stable")
rprecision

0.8297872340425532

In [52]:
r_f1 = f1_score(y_test, rand_ypred, average="binary", pos_label="stable")
r_f1

0.8256880733944953

Using Extra Trees Classifier and Randomized Search CV

In [21]:
etf = ExtraTreesClassifier()
n_estimators = [50, 100, 300, 500, 1000]
min_samples_split = [2, 3, 5, 7, 9]
min_samples_leaf = [1, 2, 4, 6, 8]
max_features = ['auto', 'sqrt', 'log2', None]

hyperparameter_grid = {'n_estimators': n_estimators,
                       'min_samples_leaf': min_samples_leaf,
                       'min_samples_split': min_samples_split,
                       'max_features': max_features}
rsc = RandomizedSearchCV(etf, hyperparameter_grid, random_state = 1)  # defined here but not fitted

# Baseline: Extra Trees with default hyperparameters
etf.fit(x_traintrans, y_train)
etf_ypred = etf.predict(x_testtrans)

Using the ExtraTreesClassifier as your estimator with cv=5, n_iter=10, scoring = 'accuracy', n_jobs = -1, verbose = 1 and random_state = 1

In [79]:
ef = ExtraTreesClassifier(random_state = 1)
n_estimators = [50, 100, 300, 500, 1000]
min_samples_split = [2, 3, 5, 7, 9]
min_samples_leaf = [1, 2, 4, 6, 8]
# 'auto' is not accepted by recent scikit-learn releases (removed in 1.3); 'sqrt' is the equivalent there
max_features = ['auto', 'sqrt', 'log2', None]

hyperparameter_grid = {'n_estimators': n_estimators,
                       'min_samples_leaf': min_samples_leaf,
                       'min_samples_split': min_samples_split,
                       'max_features': max_features}
# n_iter and scoring are RandomizedSearchCV arguments; passing them to
# ExtraTreesClassifier raises "TypeError: unexpected keyword argument 'n_iter'"
rsc = RandomizedSearchCV(ef, hyperparameter_grid, cv = 5, n_iter = 10, scoring = 'accuracy',
                         n_jobs = -1, verbose = 1, random_state = 1)
rsc.fit(x_traintrans, y_train)
rsc.best_params_

In [22]:
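# Arguments are in (y_pred, y_true) order; classification_report expects (y_true, y_pred), so precision and recall are swapped
print(metrics.classification_report(etf_ypred, y_test))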

              precision    recall  f1-score   support

      stable       0.78      0.86      0.82       649
    unstable       0.93      0.89      0.91      1351

    accuracy                           0.88      2000
   macro avg       0.86      0.87      0.86      2000
weighted avg       0.88      0.88      0.88      2000



In [39]:
et_accuracyscore=accuracy_score(y_test, etf_ypred)
et_accuracyscore

0.8775

In [55]:
et_recall = recall_score(y_test, etf_ypred, average="binary", pos_label="stable")
et_recall

0.7837078651685393

In [54]:
et_precision =precision_score(y_test, etf_ypred, average="binary", pos_label="stable")
et_precision

0.8597842835130971

In [53]:
et_f1 = f1_score(y_test, etf_ypred, average="binary", pos_label="stable")
et_f1

0.8199853049228508

In [30]:
xgb = XGBClassifier(random_state = 1)
xgb.fit(x_train, y_train)
xgb_ypred = xgb.predict(x_test)



In [31]:
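# Arguments are in (y_pred, y_true) order; classification_report expects (y_true, y_pred), so precision and recall are swapped
print(metrics.classification_report(xgb_ypred, y_test))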

              precision    recall  f1-score   support

      stable       0.82      0.84      0.83       701
    unstable       0.91      0.90      0.91      1299

    accuracy                           0.88      2000
   macro avg       0.87      0.87      0.87      2000
weighted avg       0.88      0.88      0.88      2000



In [40]:
xgb_accuracyscore=accuracy_score(y_test, xgb_ypred)
xgb_accuracyscore

0.8795

In [57]:
xgb_recall = recall_score(y_test, xgb_ypred, average="binary", pos_label="stable")
xgb_recall

0.8230337078651685

In [58]:
xgb_precision =precision_score(y_test, xgb_ypred, average="binary", pos_label="stable")
xgb_precision

0.8359486447931527

In [59]:
xgb_f1 = f1_score(y_test, xgb_ypred, average="binary", pos_label="stable")
xgb_f1

0.829440905874027

In [64]:
confusionmatrix=confusion_matrix(y_test, xgb_ypred)
confusionmatrix

array([[ 586,  126],
       [ 115, 1173]], dtype=int64)
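The scalar scores above can be recovered directly from this confusion matrix. The sketch below assumes scikit-learn's default layout (rows are true classes, columns are predicted classes, labels sorted so that `stable` comes first) and treats `stable` as the positive class, matching `pos_label="stable"`.

```python
import numpy as np

# XGBoost confusion matrix copied from the output above.
cm = np.array([[586, 126],
               [115, 1173]])

tp, fn = cm[0]   # true stable: predicted stable / predicted unstable
fp, tn = cm[1]   # true unstable: predicted stable / predicted unstable

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()

print(round(precision, 4), round(recall, 4), round(f1, 4), round(accuracy, 4))
```

The four values agree with `xgb_precision`, `xgb_recall`, `xgb_f1` and `xgb_accuracyscore` reported earlier.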

In [32]:
ltb = lgb.LGBMClassifier(random_state = 1)
ltb.fit(x_traintrans, y_train)
ltb_ypred = ltb.predict(x_testtrans)

In [33]:
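# Arguments are in (y_pred, y_true) order; classification_report expects (y_true, y_pred), so precision and recall are swapped
print(metrics.classification_report(ltb_ypred, y_test))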

              precision    recall  f1-score   support

      stable       0.82      0.84      0.83       695
    unstable       0.91      0.90      0.91      1305

    accuracy                           0.88      2000
   macro avg       0.87      0.87      0.87      2000
weighted avg       0.88      0.88      0.88      2000



In [41]:
ltb_accuracyscore=accuracy_score(y_test, ltb_ypred)
ltb_accuracyscore

0.8805

In [60]:
ltb_recall = recall_score(y_test, ltb_ypred, average="binary", pos_label="stable")
ltb_recall

0.8202247191011236

In [61]:
ltb_precision =precision_score(y_test, ltb_ypred, average="binary", pos_label="stable")
ltb_precision

0.8402877697841726

In [62]:
ltb_f1 = f1_score(y_test, ltb_ypred, average="binary", pos_label="stable")
ltb_f1

0.830135039090263

In [63]:
confusionmatrix=confusion_matrix(y_test, ltb_ypred)
confusionmatrix

array([[ 584,  128],
       [ 111, 1177]], dtype=int64)

# Hamoye Week 3 Test

## Question 1

You are working on a spam classification system using regularized logistic regression. “Spam” is a positive class (y = 1) and “not spam” is the negative class (y = 0). You have trained your classifier and there are n = 2000 examples in the test set. The confusion matrix of predicted class vs. actual class is:

tp = 355
fp = 120
tn = 45
fn = 1480

What is the F1 score of this classifier?

### Answer
0.1935 (This was the closest option I could choose)

F1 score = 2 * (precision * recall) / (precision + recall)

In [80]:
precision = 355 / (355+120)
recall = 355 / (355+1480)
ans = 2 * (precision * recall)/(precision + recall)  # F1 is the harmonic mean of precision and recall
round(ans, 4)

0.3074

## Question 2

Which method can we use to best fit the data in Logistic Regression?
### Answer
Maximum Likelihood

## Question 3
Why do we use weak learners in boosting?
### Answer
To make the algorithm stronger

To prevent overfitting

## Question 4

A data scientist is evaluating different binary classification models. A false positive result is 5 times more expensive (from a business perspective) than a false negative result. The models should be evaluated based on the following criteria:

1) Must have a recall rate of at least 80%

2) Must have a false positive rate of 10% or less

3) Must minimize business costs

After creating each binary classification model, the data scientist generates the corresponding confusion matrix. Which confusion matrix represents the model that satisfies the requirements?
### Answer

TN = 98%, FP = 2%, FN = 18%, TP = 82%
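The chosen matrix can be checked against the three criteria with a few lines of arithmetic. This is a sketch under stated assumptions: the percentages are read as rates within each actual class, and the cost comparison assumes equally many actual positives and negatives with a unit cost of 1 per false negative and 5 per false positive.

```python
# Rates from the chosen confusion matrix, in percent of each actual class.
tn, fp = 98, 2    # actual negatives
fn, tp = 18, 82   # actual positives

recall = tp / (tp + fn)
false_positive_rate = fp / (fp + tn)

# Relative business cost under the assumptions stated above.
cost = 5 * fp + 1 * fn

print(recall, false_positive_rate, cost)
```

Recall is at least 80% and the false positive rate is below 10%, so the first two criteria hold; the cost figure is only meaningful when compared with the same quantity for the other candidate matrices.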

## Question 5

You are building a classifier and the accuracy is poor on both the training and test sets. Which would you use to try to improve the performance?
### Answer

Boosting

## Question 6
Which of the following is not an Ensemble model?
### Answer

Decision Tree

## Question 7
A classifier predicts if insurance claims are fraudulent or not. The cost of paying a fraudulent claim is higher than the cost of investigating a claim that is suspected to be fraudulent. Which metric should we use to evaluate this classifier?
### Answer
Recall

## Question 8
The ROC curve above was generated from a classification algorithm. What can we say about this classifier?
### Answer
The model has no discrimination capacity to differentiate between the positive and the negative class

## Question 9
A random forest classifier was used to classify handwritten digits 0-9 into the numbers they were intended to represent. The confusion matrix below was generated from the results. Based on the matrix, which number was predicted with the least accuracy?
### Answer
8

## Question 10
A medical company is building a model to predict the occurrence of thyroid cancer. The training data contains 900 negative instances (people who don't have cancer) and 100 positive instances. The resulting model has 90% accuracy, but extremely poor recall. What steps can be used to improve the model's performance? (SELECT TWO OPTIONS)
### Answer

Over-sample instances from the negative (no cancer) class

Use Bagging method

Generate synthetic samples/data using SMOTE
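To see how 90% accuracy and very poor recall can coexist on this class split, consider a hypothetical model that always predicts "no cancer". The sketch below is an illustration of that extreme case, not the model described in the question.

```python
# 100 positive (cancer) and 900 negative instances, as in the question.
y_true = [1] * 100 + [0] * 900

# Hypothetical degenerate model: predicts the majority class for everyone.
y_pred = [0] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(accuracy, recall)
```

Accuracy is dominated by the 900 negatives, while recall only looks at the 100 positives, which is why rebalancing the training data targets recall.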

## Question 11
You are developing a machine learning classification algorithm that categorizes handwritten digits 0-9 into the numbers they represent. How should you pre-process the label data?
### Answer
One-hot encoding
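A minimal sketch of one-hot encoding digit labels with NumPy follows; the `labels` array is a made-up example, and indexing an identity matrix is just one of several ways to build the encoding.

```python
import numpy as np

labels = np.array([3, 0, 9, 3])          # hypothetical digit labels
one_hot = np.eye(10, dtype=int)[labels]  # row i is the indicator vector for labels[i]

print(one_hot.shape)
print(one_hot[0])
```

Each row contains a single 1 in the position of its digit, so no ordering between the classes is implied.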

## Question 12
What is the entropy of the target variable if its actual values are given as:
```[1,0,1,1,0,1,0]```
### Answer
```
- 3/7 log(3/7) - 4/7 log(4/7)
```
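The expression can be evaluated numerically. The sketch below assumes a base-2 logarithm (entropy in bits); the answer above does not state the base, and a different base only rescales the result.

```python
import math
from collections import Counter

values = [1, 0, 1, 1, 0, 1, 0]
counts = Counter(values)   # 4 ones, 3 zeros
n = len(values)

# Shannon entropy: -sum(p * log2(p)) over the observed classes.
entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())

print(round(entropy, 4))   # 0.9852
```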

## Question 13
Which of these is not a good metric for evaluating classification algorithms for data with imbalanced class problems?
### Answer
Accuracy

## Question 14
What is the accuracy on the test set using the random forest classifier? In 4 decimal places.
### Answer
0.9845 (The answer I submitted)

In [81]:
raccuracyscore #(The answer I got from my model)

0.8765

## Question 15
What is the accuracy on the test set using the xgboost classifier? In 4 decimal places.
### Answer
0.9875 (The answer I submitted)

In [86]:
xgb_accuracyscore #(The answer from my model)

0.8795

## Question 16
What is the accuracy on the test set using the LGBM classifier? In 4 decimal places.
### Answer
0.9875 (The answer I submitted)

In [85]:
ltb_accuracyscore #(The answer from my model)

0.8805

## Question 17
To improve the Extra Trees Classifier, you will use the following parameters (number of estimators, minimum number of samples, minimum number of samples for leaf node and the number of features to consider when looking for the best split) for the hyperparameter grid needed to run a Randomized Cross Validation Search (RandomizedSearchCV).
```
n_estimators = [50, 100, 300, 500, 1000]
min_samples_split = [2, 3, 5, 7, 9]
min_samples_leaf = [1, 2, 4, 6, 8]
max_features = ['auto', 'sqrt', 'log2', None]

hyperparameter_grid = {'n_estimators': n_estimators,
                       'min_samples_leaf': min_samples_leaf,
                       'min_samples_split': min_samples_split,
                       'max_features': max_features}
```
Using the ExtraTreesClassifier as your estimator with ```cv=5, n_iter=10, scoring = 'accuracy', n_jobs = -1, verbose = 1``` and ```random_state = 1```, what are the best hyperparameters from the randomized search CV?
### Answer
n_estimators = 300, min_samples_split = 5, min_samples_leaf = 6, max_features = 'auto'

## Question 18
Train a new ExtraTreesClassifier Model with the new Hyperparameters from the RandomizedSearchCV (with random_state = 1). Is the accuracy of the new optimal model higher or lower than the initial ExtraTreesClassifier model with no hyperparameter tuning?
### Answer
None of the above

## Question 19
What other hyperparameter optimization methods can you try apart from Random Search?
### Answer
Grid Search
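For comparison with the randomized search, a grid search tries every combination in the grid. The sketch below uses a small synthetic dataset and a deliberately tiny grid (both assumptions for illustration) so that it runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; not the grid stability dataset.
X_demo, y_demo = make_classification(n_samples=120, n_features=6, random_state=1)

# Illustrative grid: 2 x 2 = 4 candidate settings, all of which are evaluated.
param_grid = {'n_estimators': [10, 20],
              'min_samples_leaf': [1, 4]}

search = GridSearchCV(ExtraTreesClassifier(random_state=1), param_grid,
                      cv=3, scoring='accuracy')
search.fit(X_demo, y_demo)

print(search.best_params_)
print(len(search.cv_results_['params']))  # number of combinations tried
```

Unlike `RandomizedSearchCV`, there is no `n_iter`; the cost grows with the product of the list lengths in the grid.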

## Question 20
Find the feature importance using the optimal ExtraTreesClassifier model. Which features are the most and least important respectively?
### Answer
tau2, p1
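Feature importances can be read from a fitted `ExtraTreesClassifier` through its `feature_importances_` attribute. The sketch below uses synthetic data and placeholder feature names (`f0` to `f3`), so its ranking says nothing about the grid dataset; it only shows the mechanics.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic data: 2 informative features plus 2 noise features.
X_demo, y_demo = make_classification(n_samples=200, n_features=4,
                                     n_informative=2, n_redundant=0,
                                     random_state=1)
names = ['f0', 'f1', 'f2', 'f3']  # hypothetical feature names

model = ExtraTreesClassifier(n_estimators=50, random_state=1).fit(X_demo, y_demo)

# Pair each importance with its feature name and sort from most to least important.
importances = pd.Series(model.feature_importances_, index=names)
importances = importances.sort_values(ascending=False)

print(importances)
print(importances.index[0], importances.index[-1])  # most and least important
```

With the tuned model, passing the training column names as the index would give the most and least important features as the first and last entries.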