<a href="https://colab.research.google.com/github/Akshat-kumar-jain/Hamoye_Internship/blob/main/Hamoye_Stage_C_of_Internship.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# By Akshat Kumar Jain (iakshatkumarjain@gmail.com)

# **Preparing the Data for the Quiz**

**Stability of the Grid System**

Electrical grids require a balance between electricity supply and demand in order to be stable. Conventional systems achieve this balance through demand-driven electricity production. For future grids with a high share of inflexible (i.e., renewable) energy sources, the concept of demand response is a promising solution. This implies changes in electricity consumption in relation to electricity price changes. In this work, we'll build a binary classification model to predict if a grid is stable or unstable using the UCI Electrical Grid Stability Simulated dataset.

Dataset: https://archive.ics.uci.edu/ml/datasets/Electrical+Grid+Stability+Simulated+Data+

It has 12 primary predictive features and two dependent variables.

**Predictive features:**

'tau1' to 'tau4': the reaction time of each network participant, a real value within the range 0.5 to 10 ('tau1' corresponds to the supplier node, 'tau2' to 'tau4' to the consumer nodes);

'p1' to 'p4': nominal power produced (positive) or consumed (negative) by each network participant, a real value within the range -2.0 to -0.5 for consumers ('p2' to 'p4'). As the total power consumed equals the total power generated, p1 (supplier node) = - (p2 + p3 + p4);

'g1' to 'g4': price elasticity coefficient for each network participant, a real value within the range 0.05 to 1.00 ('g1' corresponds to the supplier node, 'g2' to 'g4' to the consumer nodes; 'g' stands for 'gamma');
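
The power-balance relation for 'p1' can be checked against a single record. The sketch below uses the first row that `df.head()` prints in this notebook; the variable names and the tolerance are illustrative choices, not part of the quiz.

```python
import math

# Supplier power (p1) and consumer powers (p2, p3, p4) from the first row of the dataset
p1 = 3.763085
consumers = [-0.782604, -1.257395, -1.723086]

# The supplier must generate exactly what the consumers draw
expected_p1 = -sum(consumers)

# Compare with a small tolerance, since the printed values are rounded to six decimals
balanced = math.isclose(p1, expected_p1, abs_tol=1e-6)
print(round(expected_p1, 6), balanced)
```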

**Dependent variables:**

'stab': the maximum real part of the characteristic differential equation root (if positive, the system is linearly unstable; if negative, linearly stable);
'stabf': a categorical (binary) label ('stable' or 'unstable').

Because of the direct relationship between 'stab' and 'stabf' ('stabf' = 'stable' if 'stab' <= 0, 'unstable' otherwise), 'stab' should be dropped and 'stabf' will remain as the sole dependent variable (binary classification).
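
The rule linking 'stab' to 'stabf' is simple enough to write down directly. This is a minimal sketch (the function name `stability_label` is hypothetical), applied to the 'stab' values of the first three rows of the dataset.

```python
def stability_label(stab: float) -> str:
    """Map the 'stab' value to the 'stabf' label using the rule stated above."""
    return "stable" if stab <= 0 else "unstable"

# 'stab' values taken from the first three rows of the dataset
for stab in (0.055347, -0.005957, 0.003471):
    print(stab, stability_label(stab))
```

Because the label is a deterministic function of 'stab', keeping 'stab' as a feature would leak the answer to the classifier.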


Split the data into an 80-20 train-test split with a random state of 1. Use the standard scaler to transform the training features (x_train) and the test features (x_test). Use scikit-learn to train a random forest and an extra trees classifier, and use xgboost and lightgbm to train an extreme gradient boosting model and a light gradient boosting model. Use random_state = 1 for training all models and evaluate on the test set.


Also, to improve the Extra Trees Classifier, you will use the following parameters (number of estimators, minimum number of samples required to split a node, minimum number of samples for a leaf node, and the number of features to consider when looking for the best split) for the hyperparameter grid needed to run a Randomized Cross Validation Search (RandomizedSearchCV).


n_estimators = [50, 100, 300, 500, 1000]

min_samples_split = [2, 3, 5, 7, 9]

min_samples_leaf = [1, 2, 4, 6, 8]

max_features = ['auto', 'sqrt', 'log2', None]

hyperparameter_grid = {'n_estimators': n_estimators,

'min_samples_leaf': min_samples_leaf,

'min_samples_split': min_samples_split,

'max_features': max_features}

In [None]:
#importing the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
# Copying the url and loading (reading) as dataframe

url = 'https://raw.githubusercontent.com/moreira-presh/HAMOYE-INTERNSHIP-2020/master/Data_for_UCI_named.csv'
df = pd.read_csv(url)

In [None]:
# checking the head of the data
df.head()

Unnamed: 0,tau1,tau2,tau3,tau4,p1,p2,p3,p4,g1,g2,g3,g4,stab,stabf
0,2.95906,3.079885,8.381025,9.780754,3.763085,-0.782604,-1.257395,-1.723086,0.650456,0.859578,0.887445,0.958034,0.055347,unstable
1,9.304097,4.902524,3.047541,1.369357,5.067812,-1.940058,-1.872742,-1.255012,0.413441,0.862414,0.562139,0.78176,-0.005957,stable
2,8.971707,8.848428,3.046479,1.214518,3.405158,-1.207456,-1.27721,-0.920492,0.163041,0.766689,0.839444,0.109853,0.003471,unstable
3,0.716415,7.6696,4.486641,2.340563,3.963791,-1.027473,-1.938944,-0.997374,0.446209,0.976744,0.929381,0.362718,0.028871,unstable
4,3.134112,7.608772,4.943759,9.857573,3.525811,-1.125531,-1.845975,-0.554305,0.79711,0.45545,0.656947,0.820923,0.04986,unstable


In [None]:
# checking for missing values
df.isna().sum()

tau1     0
tau2     0
tau3     0
tau4     0
p1       0
p2       0
p3       0
p4       0
g1       0
g2       0
g3       0
g4       0
stab     0
stabf    0
dtype: int64

In [None]:
# Per the instructions, drop the 'stab' column
df = df.drop(columns = 'stab')

In [None]:
# Splitting the data into the Predictors(Features) and Labels(Response)

X = df.drop(['stabf'],axis = 1)
y = df['stabf']

In [None]:
df.stabf.value_counts() # before the split was carried out

unstable    6380
stable      3620
Name: stabf, dtype: int64

In [None]:
# Assigning/Splitting the data into testing and training sets

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=1)

In [None]:
y_train.value_counts() # after split 

unstable    5092
stable      2908
Name: stabf, dtype: int64

In [None]:
y_test.value_counts() # after split

unstable    1288
stable       712
Name: stabf, dtype: int64
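
The split above was not stratified, so it is worth checking that the class balance is similar in each subset. The sketch below only reuses the counts printed by `value_counts()`; the dictionary layout is an illustrative choice.

```python
# Class counts reported by value_counts() for the full data, the train split and the test split
counts = {
    "full": {"unstable": 6380, "stable": 3620},
    "train": {"unstable": 5092, "stable": 2908},
    "test": {"unstable": 1288, "stable": 712},
}

# Share of the 'unstable' class in each subset
unstable_share = {
    name: c["unstable"] / (c["unstable"] + c["stable"]) for name, c in counts.items()
}
for name, share in unstable_share.items():
    print(f"{name}: {share:.4f}")
```

The shares are close to each other, so the random split kept roughly the original balance. If exact proportions mattered, `train_test_split` accepts a `stratify` argument, although the quiz does not ask for it.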

In [None]:
X_train.head(3) # checking the Xtrain dataframe before transformation is done to it

Unnamed: 0,tau1,tau2,tau3,tau4,p1,p2,p3,p4,g1,g2,g3,g4
2694,6.255995,2.542401,7.024714,9.476518,3.529888,-1.224881,-0.688228,-1.61678,0.568221,0.618403,0.685739,0.660088
5140,5.070581,5.490253,8.075688,0.761075,4.220888,-1.280596,-1.902185,-1.038107,0.443515,0.097244,0.916955,0.129254
2568,1.220072,8.804028,3.874283,8.433949,3.614027,-1.039236,-0.953566,-1.621224,0.908353,0.923594,0.238881,0.660156


In [None]:
X_test.head(3) # checking the Xtest dataframe before transformation is done to it

Unnamed: 0,tau1,tau2,tau3,tau4,p1,p2,p3,p4,g1,g2,g3,g4
9953,6.877876,4.11382,9.356768,8.299753,4.056779,-1.89747,-1.590581,-0.568728,0.276567,0.845536,0.11244,0.822562
3850,5.802841,6.271371,4.73154,3.819867,3.579569,-1.70948,-1.067511,-0.802579,0.077527,0.416478,0.912846,0.861306
4962,2.286998,4.385142,2.830232,5.29388,3.035814,-1.202764,-0.902011,-0.931039,0.924216,0.130186,0.703887,0.063811


In [None]:
# As instructed, standardize the features with the StandardScaler

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

In [None]:
# Transforming the X_train (feature)data

transformed_X_train = scaler.fit_transform(X_train)
transformed_X_train = pd.DataFrame(transformed_X_train, columns = X_train.columns)

In [None]:
transformed_X_train.head() # after applying transformation on the data

Unnamed: 0,tau1,tau2,tau3,tau4,p1,p2,p3,p4,g1,g2,g3,g4
0,0.367327,-0.986042,0.650447,1.547527,-0.29149,0.061535,1.293862,-0.845074,0.160918,0.339859,0.585568,0.492239
1,-0.064659,0.089437,1.035079,-1.641494,0.619865,-0.067235,-1.502925,0.486613,-0.293143,-1.558488,1.429649,-1.443521
2,-1.46785,1.298418,-0.502536,1.166046,-0.180521,0.490603,0.68256,-0.855302,1.39935,1.451534,-1.045743,0.492489
3,0.820081,0.52992,1.299657,-1.141975,-0.812854,-0.763632,1.521579,0.65878,-0.958319,1.361958,1.60414,0.275303
4,0.665424,-1.425627,0.3123,0.919137,-1.614296,0.760315,1.422019,0.639243,1.676895,0.69566,1.137504,-1.312575


In [None]:
# Transforming the X_test (feature)data
transformed_X_test = scaler.transform(X_test)
transformed_X_test = pd.DataFrame(transformed_X_test, columns = X_test.columns)
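
Note that the test set is transformed with the mean and standard deviation learned from the training set, not with its own. The following is a minimal pure-Python sketch of that idea on toy numbers rather than the notebook's data; the function names `fit_scaler` and `transform` are hypothetical.

```python
from statistics import fmean, pstdev

def fit_scaler(values):
    """Learn the mean and population standard deviation of one column."""
    return fmean(values), pstdev(values)

def transform(values, mean, std):
    """Standardize values with previously learned statistics."""
    return [(v - mean) / std for v in values]

train = [2.0, 4.0, 6.0, 8.0]   # toy training column
test = [5.0, 9.0]              # toy test column

mean, std = fit_scaler(train)              # statistics come from the training data only
scaled_train = transform(train, mean, std)
scaled_test = transform(test, mean, std)   # reuse the training statistics on the test data
print(scaled_train, scaled_test)
```

Fitting the scaler on the training data alone keeps information about the test set out of the preprocessing step.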


In [None]:
transformed_X_test.head() # after applying transformation on the data

Unnamed: 0,tau1,tau2,tau3,tau4,p1,p2,p3,p4,g1,g2,g3,g4
0,0.593951,-0.412733,1.503924,1.116943,0.403423,-1.492971,-0.785033,1.566781,-0.901007,1.167203,-1.50733,1.084726
1,0.20219,0.374416,-0.1888,-0.522268,-0.225967,-1.058483,0.420047,1.028627,-1.625721,-0.39566,1.414651,1.226011
2,-1.079044,-0.313745,-0.884634,0.01708,-0.943122,0.112653,0.801335,0.733004,1.457108,-1.438495,0.651821,-1.682168
3,-0.08312,-1.107327,0.372805,-1.708152,0.75399,-1.637972,0.403805,-0.088036,0.083322,-1.672322,-0.357714,1.055865
4,0.873921,1.438466,0.086662,1.715037,-0.15388,-0.007015,-0.197053,0.472315,0.136549,-1.469731,0.956396,-0.819727


# **Now the Data is ready for the Quiz**

# **Question 1**

## What is the F1 Score of the Classifier?

In [None]:
# Given total instances (n) = 2000
# F1 is given as 2 * (Precision*Recall)/(Precision + Recall)
# From the confusion matrix we have;

Precision = (355/ ( 355+1480)) 
Recall =  (355 /(355+45)) 
F1_Score = 2 * (Precision*Recall)/(Precision + Recall)
print(round(F1_Score,4))

0.3177
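
The same result can be reached without computing precision and recall separately, since F1 simplifies to 2TP / (2TP + FP + FN). A small sketch, with the hypothetical helper name `f1_from_counts`:

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 expressed directly in confusion-matrix counts: 2TP / (2TP + FP + FN)."""
    return 2 * tp / (2 * tp + fp + fn)

# Counts used in the cell above: TP = 355, FP = 1480, FN = 45
print(round(f1_from_counts(355, 1480, 45), 4))  # 0.3177
```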


# **Question 2**

## What method can we use to best fit the data in Logistic Regression?


**Answer:** Maximum Likelihood. Just as Least Squares Error is to Linear Regression, Maximum Likelihood is to Logistic Regression.
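
To make the idea concrete, the quantity that maximum likelihood tries to maximize can be computed by hand. The sketch below is an illustration on hypothetical toy data with a one-feature logistic model; it is not how a library fits the model, only how the objective is evaluated.

```python
import math

def log_likelihood(weight, bias, xs, ys):
    """Bernoulli log-likelihood of labels ys (0 or 1) under a one-feature logistic model."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-(weight * x + bias)))  # predicted P(y = 1 | x)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

# Hypothetical toy data: larger x values tend to have label 1
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

# A weight that matches the trend scores a higher likelihood than a flat model
print(log_likelihood(0.0, 0.0, xs, ys), log_likelihood(1.0, 0.0, xs, ys))
```

Fitting a logistic regression amounts to searching for the weight and bias that make this value as large as possible.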

# **Question 3**

## Why do we use weak learners in Boosting?

**Answer:** To Prevent Overfitting

# **Question 4**


## Which Confusion matrix represents the model that satisfies the requirement?


We are told that a false positive is five times more expensive than a false negative.

#### ***The conditions given were:***
#### Must have a recall rate of at least 80%
#### Must have a false positive rate of 10% or less
#### Must minimize business cost

**Answer:** TN = 98%, FP = 2%, FN = 18%, TP = 82%

In [None]:
# Recall is given as TP / (TP + FN)

# Condition 1: with TN = 98%, FP = 2%, FN = 18%, TP = 82%,
# Recall = TP / (TP + FN) = 82 / (82 + 18) = 0.82, which is above 80%.

# Condition 2: the FP rate here is 2%, which is within the 10% limit.

# Condition 3: a false positive costs five times as much as a false negative,
# so the FP contribution is 2 * 5 = 10, the smallest compared to the other options.
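
The three checks can also be expressed as a small function. This is a sketch under stated assumptions: a false negative is given a cost of one unit and a false positive five units, and the percentages are treated as counts per 100 actual negatives and 100 actual positives. The helper name `evaluate` is hypothetical.

```python
def evaluate(tn, fp, fn, tp, fp_cost=5.0, fn_cost=1.0):
    """Return recall, false positive rate and weighted error cost for one confusion matrix."""
    recall = tp / (tp + fn)
    false_positive_rate = fp / (fp + tn)
    cost = fp_cost * fp + fn_cost * fn
    return recall, false_positive_rate, cost

# The chosen option: TN = 98%, FP = 2%, FN = 18%, TP = 82%
recall, fpr, cost = evaluate(tn=98, fp=2, fn=18, tp=82)
meets_requirements = recall >= 0.80 and fpr <= 0.10
print(recall, fpr, cost, meets_requirements)
```

Running the same function on each candidate matrix and keeping the cheapest one that meets both requirements reproduces the selection logic described in the comments.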


# **Question 5**

**Answer:** Boosting


# **Question 6**

## Which of the following is not an Ensemble model?

**Answer:** Decision Tree      
In case of decision tree, we build a single tree and no ensembling is required.

# **Question 7**

## Metric to evaluate classifier ( Whether Fraudulent or not)

**Answer:** Recall

# **Question 8**

## The ROC curve above was generated from a classification algorithm. What can we say about this classifier?

**Answer:** The model has no discrimination capacity to differentiate between the positive and the negative class

# **Question 9**

## Based on the matrix, which number was predicted with the least accuracy?

**Answer:** 8

# **Question 10**

##  The resulting model has 90% accuracy, but extremely poor recall. What steps can be used to improve the model's performance? (SELECT TWO OPTIONS)

**Answer:**

Over-sample instances from the negative (no cancer) class

Generate synthetic samples/data using SMOTE


# **Question 11**

## You are developing a machine learning classification algorithm that categorizes handwritten digits 0-9 into the numbers they represent. How should you pre-process the label data?

**Answer:** One-hot encoding
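
As an illustration of what one-hot encoding does to a digit label, here is a minimal pure-Python sketch (the helper name `one_hot` is hypothetical; libraries provide equivalent utilities).

```python
def one_hot(label: int, num_classes: int = 10) -> list:
    """Return a vector with a 1 at the label's index and 0 elsewhere."""
    return [1 if i == label else 0 for i in range(num_classes)]

print(one_hot(3))  # [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
```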

# **Question 12**

## What is the entropy of the target variable if its actual values are given as:

## [1,0,1,1,0,1,0]


**Answer:** **- 3/7 log(3/7) - 4/7 log(4/7)**

### We have a total of 7 observations (3 zeros and 4 ones)
### Entropy is given as -Σ p(x) * log p(x)
### Substituting p(0) = 3/7 and p(1) = 4/7 gives - 3/7 log(3/7) - 4/7 log(4/7)
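
The expression can be evaluated numerically. The answer leaves the logarithm base unspecified, so the sketch below assumes base 2 (entropy in bits); the function name `entropy` is an illustrative choice.

```python
import math
from collections import Counter

def entropy(values, base=2):
    """Shannon entropy of a sequence of discrete values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log(c / total, base) for c in counts.values())

target = [1, 0, 1, 1, 0, 1, 0]
print(round(entropy(target), 4))  # 0.9852
```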

# **Question 13**

## Which of this is not a good metric for evaluating classification algorithms for data with imbalanced class problems?

**Answer:** **Accuracy**

#### A high accuracy does not mean the model has high predictive power. On imbalanced data it is important not to depend solely on accuracy, because it does not provide enough information about the model.
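
The test split used in this notebook gives a concrete example. A trivial model that always predicts the majority class still reaches a respectable accuracy while being useless for the minority class. The sketch reuses the counts from `y_test.value_counts()`.

```python
# Test-set class counts from y_test.value_counts()
n_unstable, n_stable = 1288, 712

# A trivial model that always predicts 'unstable'
accuracy = n_unstable / (n_unstable + n_stable)
stable_recall = 0 / n_stable  # it never predicts 'stable', so no stable case is recovered

print(accuracy, stable_recall)  # 0.644 0.0
```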


# **Question 14**

## What is the accuracy on the test set using the random forest classifier? In 4 decimal places.


**Answer:** 0.9295


In [None]:
from sklearn.ensemble import RandomForestClassifier  # importing our classifier and fitting it on the training data
Random_C = RandomForestClassifier(random_state=1)
Random_C.fit(transformed_X_train,y_train)


RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=1, verbose=0,
                       warm_start=False)

In [None]:
predict = Random_C.predict(transformed_X_test)

In [None]:
from sklearn.metrics import recall_score, accuracy_score, precision_score, f1_score, confusion_matrix

print("Accuracy score {}".format(round(accuracy_score(y_test, predict), 4)))
print("Precision score for label stable %.3f" % (precision_score(y_test, predict, pos_label='stable')))
print("Recall score for label stable {}".format(round(recall_score(y_test, predict, pos_label='stable'), 4)))
print("F1 score %.3f" % (f1_score(y_test, predict, pos_label='stable')))

Accuracy score 0.929
Precision score for label stable 0.919
Recall score for label stable 0.8778
F1 score 0.898


# **Question 15**

## What is the accuracy on the test set using the xgboost classifier? In 4 decimal places.

**Answer:** 0.9195



In [None]:
#xgboost
from xgboost import XGBClassifier
extreme = XGBClassifier(random_state =1)
extreme.fit(transformed_X_train, y_train)
extreme_pred = extreme.predict(transformed_X_test)
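
Recent XGBoost releases expect class labels encoded as integers starting at 0, so fitting directly on the strings 'stable' and 'unstable' may raise an error depending on the installed version. One way around this, sketched here with a plain dictionary (the names `label_to_int`, `encode` and `decode` are hypothetical), is to encode the labels before fitting and decode the predictions afterwards.

```python
# Hypothetical mapping between the string labels and integer classes
label_to_int = {"stable": 0, "unstable": 1}
int_to_label = {v: k for k, v in label_to_int.items()}

def encode(labels):
    """Convert string labels to integer classes."""
    return [label_to_int[label] for label in labels]

def decode(classes):
    """Convert integer classes back to string labels."""
    return [int_to_label[int(c)] for c in classes]

sample = ["unstable", "stable", "unstable"]
encoded = encode(sample)
print(encoded, decode(encoded))
```

With pandas, `y_train.map(label_to_int)` gives the same encoding for a Series.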

In [None]:
#classification report
from sklearn.metrics import classification_report

print(classification_report(y_test, extreme_pred))

              precision    recall  f1-score   support

      stable       0.92      0.85      0.88       712
    unstable       0.92      0.96      0.94      1288

    accuracy                           0.92      2000
   macro avg       0.92      0.90      0.91      2000
weighted avg       0.92      0.92      0.92      2000



In [None]:
round(accuracy_score(y_test,extreme_pred),4) # Giving our accuracy using Xgboost classifier in 4DP

0.9195

# **Question 16**

## What is the accuracy on the test set using the LGBM classifier? In 4 decimal places.

**Answer:** 0.9375


In [None]:
import lightgbm as lgbm

In [None]:
lgbm_clf = lgbm.LGBMClassifier(random_state=1)  # separate name so the lightgbm module is not overwritten
lgbm_clf.fit(transformed_X_train,y_train)
lgbm_predict  = lgbm_clf.predict(transformed_X_test)

In [None]:
round(accuracy_score(y_test, lgbm_predict),4) # giving our value of accuracy in 4DP

0.9375

# **Question 17**

## Using the ExtraTreesClassifier as your estimator with cv=5, n_iter=10, scoring = 'accuracy', n_jobs = -1, verbose = 1 and random_state = 1. What are the best hyperparameters from the randomized search CV?


**Answer:** n_estimators = 1000, min_samples_split = 2, min_samples_leaf = 8, max_features = None


In [None]:
from sklearn.ensemble import ExtraTreesClassifier

Tree_CLass = ExtraTreesClassifier (random_state = 1)   # recall random state of 1 was used throughout the quiz

In [None]:
n_estimators = [50, 100, 300, 500, 1000]    # We are given these parameters to answer the question

min_samples_split = [2, 3, 5, 7, 9]

min_samples_leaf = [1, 2, 4, 6, 8]

max_features = ['auto', 'sqrt', 'log2', None] 

hyperparameter_grid = {'n_estimators': n_estimators,

                       'min_samples_leaf': min_samples_leaf,

                       'min_samples_split': min_samples_split,

                       'max_features': max_features}

In [None]:
from sklearn.model_selection import RandomizedSearchCV     # importing RandomizedSearchCV as instructed for the quiz

In [None]:
# According to the parameters given to instantiate

Rand_search = RandomizedSearchCV(estimator = Tree_CLass, param_distributions= hyperparameter_grid, random_state=1,cv = 5, n_iter=10,scoring='accuracy',n_jobs=1, verbose=1)

In [None]:
search = Rand_search.fit(transformed_X_train,y_train)     # fitting the randomized search on the training data


Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  50 out of  50 | elapsed:  1.9min finished
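
The "50 fits" in the log follows from the search settings: 10 sampled candidates, each evaluated on 5 folds. For comparison, the sketch below counts how many candidates an exhaustive search over the same grid would need; it only uses the grid values given in the quiz.

```python
from itertools import product

n_estimators = [50, 100, 300, 500, 1000]
min_samples_split = [2, 3, 5, 7, 9]
min_samples_leaf = [1, 2, 4, 6, 8]
max_features = ['auto', 'sqrt', 'log2', None]

# Every combination an exhaustive grid search would have to evaluate
all_combinations = list(product(n_estimators, min_samples_split, min_samples_leaf, max_features))

n_iter, cv = 10, 5
print(len(all_combinations), n_iter * cv)  # 500 50
```

The randomized search therefore looks at 10 of the 500 possible combinations, which is why its best parameters are not guaranteed to be the best in the whole grid.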


In [None]:
#checking for the best parameter for the model
search.best_params_

{'max_features': None,
 'min_samples_leaf': 8,
 'min_samples_split': 2,
 'n_estimators': 1000}

# **Question 18**

## Train a new ExtraTreesClassifier Model with the new Hyperparameters from the RandomizedSearchCV (with random_state = 1). Is the accuracy of the new optimal model higher or lower than the initial ExtraTreesClassifier model with no hyperparameter tuning?


 **Answer:** Lower



In [None]:
# training a new model with the best hyperparameters found by the search
best_Tree_Class = ExtraTreesClassifier(n_estimators=1000, min_samples_split=2,
                                       min_samples_leaf=8, max_features=None)
best_Tree_Class.fit(transformed_X_train, y_train)
best_Tree_pred = best_Tree_Class.predict(transformed_X_test)  # keep the predictions separate from the model

In [None]:
print(classification_report(y_test, best_Tree_pred, digits=4))  # digits=4 reports the metrics to 4 d.p.
print('\n')
print("Accuracy score {}".format(accuracy_score(y_test, best_Tree_pred)))

              precision    recall  f1-score   support

      stable     0.9202    0.8750    0.8970       712
    unstable     0.9327    0.9581    0.9452      1288

    accuracy                         0.9285      2000
   macro avg     0.9265    0.9165    0.9211      2000
weighted avg     0.9283    0.9285    0.9281      2000



Accuracy score 0.9285


In [None]:
# Comparing this result with the original Extra Trees classifier without tuning

Tree_CLass.fit(transformed_X_train,y_train)
Tree_predict = Tree_CLass.predict(transformed_X_test)

print(classification_report(y_test,Tree_predict))   # We can see here that the accuracy of the previous (New Optimal Model) is Lower than that without Hyperparameter Tuning



              precision    recall  f1-score   support

      stable       0.94      0.85      0.89       712
    unstable       0.92      0.97      0.95      1288

    accuracy                           0.93      2000
   macro avg       0.93      0.91      0.92      2000
weighted avg       0.93      0.93      0.93      2000



# **Question 19**

## What other hyperparameter optimization methods can you try apart from Random Search?


**Answer:** All of the above

#### Bayesian Optimization as well as Grid Search are some of the hyperparameter optimization methods available.

# **Question 20**


## Find the feature importance using the optimal ExtraTreesClassifier model. Which features are the most and least important respectively?


**Answer:**  tau2, p1


In [None]:

feature_importance = search.best_estimator_.feature_importances_
print ('Feature Importance :\n', feature_importance)  # using this code we are able to generate the feature importance  using the "best_estimator_"

Feature Importance :
 [0.13723975 0.1405075  0.13468029 0.13541676 0.00368342 0.00533686
 0.00542927 0.00496249 0.10256244 0.10757765 0.11306268 0.10954089]


In [None]:
sorted (zip(feature_importance,X), reverse = True)  # the zip function helps us to map the values to their corresponding column names (Features)


[(0.14050750384993677, 'tau2'),
 (0.13723974766109256, 'tau1'),
 (0.1354167630909727, 'tau4'),
 (0.13468028520386593, 'tau3'),
 (0.11306267999167334, 'g3'),
 (0.10954089174337298, 'g4'),
 (0.10757764577478764, 'g2'),
 (0.10256244080927947, 'g1'),
 (0.005429268421191957, 'p3'),
 (0.005336864710946151, 'p2'),
 (0.004962486591192238, 'p4'),
 (0.003683422151688322, 'p1')]

We can see here that the most important feature is "tau2", while the least important feature is "p1".
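
An alternative to sorting tuples is to look the extremes up by name. The sketch below hard-codes the importances printed above, in the column order of X, so that it runs on its own; in the notebook the same dictionary could be built from `feature_importance` and `X.columns`.

```python
# Importances printed above, in the column order of X
features = ['tau1', 'tau2', 'tau3', 'tau4', 'p1', 'p2', 'p3', 'p4', 'g1', 'g2', 'g3', 'g4']
importances = [0.13723975, 0.1405075, 0.13468029, 0.13541676, 0.00368342, 0.00533686,
               0.00542927, 0.00496249, 0.10256244, 0.10757765, 0.11306268, 0.10954089]

ranking = dict(zip(features, importances))
most_important = max(ranking, key=ranking.get)
least_important = min(ranking, key=ranking.get)
print(most_important, least_important)  # tau2 p1
```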

# **THANK YOU!!!**