# Machine Learning Classification

In [2]:
import zipfile

def extract_zip(zip_file, output_dir):
  """
  Extracts a zip file to a specified directory.

  Args:
    zip_file: The path to the zip file to extract.
    output_dir: The path to the directory where the zip file should be extracted.
  """

  with zipfile.ZipFile(zip_file, 'r') as zip_ref:
    zip_ref.extractall(output_dir)

if __name__ == "__main__":
  zip_file = 'electrical+grid+stability+simulated+data.zip'
  output_dir = 'dataset'
  extract_zip(zip_file, output_dir)

In [23]:
# data analysis
import pandas as pd

# model training
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from lightgbm import LGBMClassifier

# data preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# model evaluation
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, confusion_matrix

In [24]:
df = pd.read_csv('dataset/Data_for_UCI_named.csv')
df.head()

| | tau1 | tau2 | tau3 | tau4 | p1 | p2 | p3 | p4 | g1 | g2 | g3 | g4 | stab | stabf |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.95906 | 3.079885 | 8.381025 | 9.780754 | 3.763085 | -0.782604 | -1.257395 | -1.723086 | 0.650456 | 0.859578 | 0.887445 | 0.958034 | 0.055347 | unstable |
| 1 | 9.304097 | 4.902524 | 3.047541 | 1.369357 | 5.067812 | -1.940058 | -1.872742 | -1.255012 | 0.413441 | 0.862414 | 0.562139 | 0.78176 | -0.005957 | stable |
| 2 | 8.971707 | 8.848428 | 3.046479 | 1.214518 | 3.405158 | -1.207456 | -1.27721 | -0.920492 | 0.163041 | 0.766689 | 0.839444 | 0.109853 | 0.003471 | unstable |
| 3 | 0.716415 | 7.6696 | 4.486641 | 2.340563 | 3.963791 | -1.027473 | -1.938944 | -0.997374 | 0.446209 | 0.976744 | 0.929381 | 0.362718 | 0.028871 | unstable |
| 4 | 3.134112 | 7.608772 | 4.943759 | 9.857573 | 3.525811 | -1.125531 | -1.845975 | -0.554305 | 0.79711 | 0.45545 | 0.656947 | 0.820923 | 0.04986 | unstable |


**Data Preprocessing**

Building Training and Test sets and Scaling the dataset

In [33]:
X = df.drop(['stab', 'stabf'], axis = 1)
y = df['stabf']

# splitting the data to training and testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

scaler = StandardScaler()

# fit and transform the training set
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)

# transform the test set
X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)

**Question 1**

In a certain ML use case, a false positive is six times costlier than a false negative. As a data scientist, you trained four models to solve it.

Keep the following evaluation criteria in mind:

1) Must have a recall rate of at least 80% 

2) Must have a false positive rate of 8% or less 

3) Must minimize business costs

After creating each binary classification model, you generated the corresponding confusion matrix. Which confusion matrix represents the model that satisfies the requirements?

Answer: TN = 96%, FP = 4%, FN = 10%, TP = 90%
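
The first two criteria can be checked directly from the reported rates. The sketch below assumes the percentages are row-normalized (TN + FP covers all actual negatives, FN + TP covers all actual positives). The helper names and the `relative_cost` weighting, which treats both classes as equally sized, are illustrative assumptions rather than part of the question.

```python
def recall(tp, fn):
    """Share of actual positives that the model catches."""
    return tp / (tp + fn)


def false_positive_rate(fp, tn):
    """Share of actual negatives that are wrongly flagged as positive."""
    return fp / (fp + tn)


def relative_cost(fp, fn, fp_weight=6.0):
    """Cost in units of one false negative; a false positive costs six of them.

    Assumption: both classes are the same size, so the rates can be added directly.
    """
    return fp_weight * fp + fn


# Rates from the chosen confusion matrix, in percent.
tn, fp, fn, tp = 96, 4, 10, 90

rec = recall(tp, fn)
fpr = false_positive_rate(fp, tn)
cost = relative_cost(fp, fn)

print(f"recall = {rec:.2f}, meets the 80% floor: {rec >= 0.80}")
print(f"false positive rate = {fpr:.2f}, within the 8% cap: {fpr <= 0.08}")
print(f"relative cost = {cost}")
```

The same three functions can be applied to each candidate matrix; among those passing the recall and false positive rate checks, the one with the lowest relative cost is the one to pick.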

**Question 2**

You are working on a spam classification system using regularized logistic regression. “Spam” is a positive class (y = 1) and “not spam” is the negative class (y = 0). You have trained your classifier and there are n = 2000 examples in the test set. The confusion matrix of predicted class vs. actual class gives TP = 255, FP = 1380, FN = 45 and TN = 320.

What is the F1 score of this classifier?

Answer: 0.2636 (0.26357 rounded to four decimal places)

In [41]:
precision = 255 / (255 + 1380)
recall = 255 / (255 + 45)

f1_score_ = 2 / ((1 / precision) + (1 / recall))
f1_score_

0.26356589147286824

**Question 3**

Jack is working on classification modelling. While evaluating the model, he saw that the difference between test and training error is a big positive number with a low training error. Which of the following is he currently facing?

Answer: Jack is facing overfitting. A low training error combined with a much higher test error means the model has fit the training data too closely, noise included, and fails to generalize to new data. This typically happens when the model is too complex for the amount of data available or when the training data is not representative of the data seen at test time.
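
The reasoning can be written as a small rule of thumb. This is only a sketch: `diagnose_fit` is a hypothetical helper, and the `high_error` and `large_gap` thresholds are arbitrary illustrative values that would need tuning for a real problem.

```python
def diagnose_fit(train_error, test_error, high_error=0.2, large_gap=0.1):
    """Label a model's fit from its training and test error rates.

    Both thresholds are illustrative assumptions, not standard values.
    """
    gap = test_error - train_error
    if train_error >= high_error:
        # The model cannot even fit the data it was trained on.
        return "underfitting"
    if gap >= large_gap:
        # Low training error, but much worse on unseen data.
        return "overfitting"
    return "reasonable fit"


print(diagnose_fit(train_error=0.02, test_error=0.30))  # Jack's situation
print(diagnose_fit(train_error=0.35, test_error=0.37))
print(diagnose_fit(train_error=0.05, test_error=0.07))
```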

**Question 4**

Which of the following metrics is generally NOT useful for a classification problem?

Answer: RMSE Value

**Question 5**

Why do we use weak learners in boosting, instead of strong learners?

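Answer: To make the algorithm stronger. Boosting trains weak learners sequentially, each one concentrating on the examples its predecessors got wrong, so the combined ensemble becomes a strong learner. Starting from weak learners also limits overfitting, because strong learners leave little error to correct and tend to memorize the training data.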
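
To make this concrete, here is a minimal AdaBoost-style sketch that uses one-threshold decision stumps as the weak learners. The toy data, the helper names (`fit_stump`, `stump_predict`, `adaboost`, `predict`) and the choice of three rounds are illustrative assumptions, not part of the quiz.

```python
import math


def stump_predict(x, threshold, polarity):
    """A decision stump: one threshold, labels in {-1, +1}."""
    return polarity if x <= threshold else -polarity


def fit_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(
                w for w, x, y in zip(weights, xs, ys)
                if stump_predict(x, threshold, polarity) != y
            )
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best


def adaboost(xs, ys, rounds=3):
    """Train `rounds` stumps, reweighting the examples after each one."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, threshold, polarity = fit_stump(xs, ys, weights)
        err = max(err, 1e-10)  # guard against division by zero
        alpha = 0.5 * math.log((1 - err) / err)  # vote of this stump
        ensemble.append((alpha, threshold, polarity))
        # Misclassified examples gain weight, correct ones lose weight.
        weights = [
            w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
            for w, x, y in zip(weights, xs, ys)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Weighted vote of all stumps."""
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1


# Toy 1-D data that no single threshold can separate.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

_, t, p = fit_stump(xs, ys, [0.1] * 10)
single_correct = sum(stump_predict(x, t, p) == y for x, y in zip(xs, ys))

ensemble = adaboost(xs, ys, rounds=3)
ensemble_correct = sum(predict(ensemble, x) == y for x, y in zip(xs, ys))

print(f"best single stump: {single_correct}/10 correct")
print(f"three boosted stumps: {ensemble_correct}/10 correct")
```

Each stump on its own is only slightly better than guessing on the reweighted data, yet their weighted vote fits this toy set exactly.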

**Question 6**

You are building a classifier and the accuracy is poor on both the training and test sets. Which would you use to try to improve the performance?

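Answer: Boosting. Poor accuracy on both the training and test sets points to underfitting (high bias), and boosting primarily reduces bias.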

**Question 7**

You are building a classifier and the accuracy is poor on both the training and test sets. Which would you use to try to improve the performance?

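Answer: Boosting. Poor accuracy on both sets indicates high bias; bagging mainly reduces variance, whereas boosting reduces bias.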

**Question 8**

A classifier predicts if insurance claims are fraudulent or not. The cost of paying a fraudulent claim is higher than the cost of investigating a claim that is suspected to be fraudulent. Which metric should we use to evaluate this classifier?

Answer: Recall

**Question 9**

Which of the following metrics is generally NOT useful for a classification problem?

Answer: RMSE Value

**Question 10**

![Image title](roc.jpg)
The ROC curve above was generated from a classification algorithm. What can we say about this classifier?

Answer: The model has no discrimination capacity to differentiate between the positive and the negative class
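
Assuming the curve in the image runs along the diagonal, the area under it (AUC) is about 0.5. The sketch below computes AUC as the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. The `auc` helper and the toy labels and scores are illustrative assumptions.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct ranking.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [1, 0, 1, 0, 1, 0]

# A model that gives every example the same score cannot rank anything.
uninformative = auc(labels, [0.5] * 6)

# A model that scores every positive above every negative.
perfect = auc(labels, [0.9, 0.1, 0.8, 0.2, 0.7, 0.3])

print(uninformative, perfect)
```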

**Question 11**

A random forest classifier was used to classify handwritten digits 0-9 into the numbers they were intended to represent. The confusion matrix below was generated from the results. Based on the matrix, which number was predicted with the least accuracy?
![Image title](conf.png)

Answer: 8
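
Reading the weakest class off a confusion matrix amounts to dividing each diagonal entry by its row total. The sketch below assumes rows are actual classes and columns are predicted classes. The 3x3 matrix is made up for illustration and is not the matrix from the image.

```python
def per_class_accuracy(matrix):
    """Correct predictions for each class divided by that class's row total."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


# Hypothetical confusion matrix: rows = actual class, columns = predicted class.
matrix = [
    [50, 2, 3],
    [4, 40, 6],
    [10, 15, 25],
]

scores = per_class_accuracy(matrix)
worst_class = min(range(len(scores)), key=scores.__getitem__)

print([round(s, 3) for s in scores])
print(f"least accurate class: {worst_class}")
```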

**Question 12**

A medical company is building a model to predict the occurrence of thyroid cancer. The training data contains 900 negative instances (people who don't have cancer) and 100 positive instances. The resulting model has 90% accuracy, but extremely poor recall. What steps can be used to improve the model's performance? (SELECT TWO)

Answer: <br>
- Use Bagging method
- Generate synthetic samples/data using SMOTE
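
The idea behind SMOTE is to create new minority-class points by interpolating between an existing minority point and one of its nearest minority neighbours. The sketch below is a simplified illustration of that idea, not the reference algorithm; `smote_like` and the toy points are assumptions. In practice the `SMOTE` class from the imbalanced-learn package would normally be used instead.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority points.

    Simplified sketch: pick a random minority point, pick one of its k nearest
    minority neighbours, and sample a point on the segment between them.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest minority neighbours of `base`, by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        u = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(a + u * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Toy minority-class samples with two features each.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like(minority, n_new=6)

for point in new_points:
    print(tuple(round(v, 3) for v in point))
```

Because every synthetic point lies on a segment between two real minority points, the new samples stay inside the region the minority class already occupies.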

**Question 13**

You are developing a machine learning classification algorithm that categorizes handwritten digits 0-9 into the numbers they represent. How should you pre-process the label data?

Answer: One-hot encoding
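
A minimal sketch of one-hot encoding for digit labels, written in plain Python so the mechanics are visible. The `one_hot` helper and the sample labels are illustrative assumptions.

```python
def one_hot(labels, num_classes):
    """Turn integer class labels into vectors with a single 1 per row."""
    encoded = []
    for label in labels:
        row = [0] * num_classes
        row[label] = 1  # mark the position of the true class
        encoded.append(row)
    return encoded


digits = [3, 0, 9]
encoded = one_hot(digits, num_classes=10)

for digit, row in zip(digits, encoded):
    print(digit, row)
```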

**Question 14**

What is the entropy of the target variable if its actual values are given as:

[1,0,1,1,0,1,0]

Answer: <br>
Formula of entropy: entropy = -sum(p(x)*log2(p(x))) <br><br>
where p(x) is the probability of the target variable taking on value x.

In this case, the target variable has 2 possible values: 0 and 1. The probability of the target variable taking on value 0 is 3/7, and the probability of the target variable taking on value 1 is 4/7. Therefore, the entropy of the target variable is: <br><br>
entropy = -(3/7 * log2(3/7) + 4/7 * log2(4/7)) ≈ 0.9852
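
The same calculation as a short sketch using only the standard library; the `entropy` helper name is an assumption.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


target = [1, 0, 1, 1, 0, 1, 0]
h = entropy(target)
print(round(h, 4))  # 0.9852
```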

**Question 15**

Which of these is not a good metric for evaluating classification algorithms for data with imbalanced class problems?

Answer: Accuracy
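
A quick illustration of why accuracy misleads on imbalanced data, borrowing the 900 negative / 100 positive split from Question 12. The always-negative "model" is a hypothetical baseline.

```python
# 900 negatives and 100 positives, as in Question 12.
y_true = [0] * 900 + [1] * 100

# A useless baseline that predicts the majority class for every example.
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
recall = tp / (tp + fn)

print(f"accuracy = {accuracy:.2f}")  # looks good
print(f"recall   = {recall:.2f}")    # not a single positive case is found
```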

**Question 16**

What is the accuracy on the test set using the random forest classifier? Give the answer to 4 decimal places.

Answer: 0.9290

In [35]:
rf = RandomForestClassifier(random_state=1)

rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
acc_score = round(accuracy_score(y_test, y_pred), 4)
print(f'Accuracy score of the Random Forest Classifier is {acc_score}')

Accuracy score of the Random Forest Classifier is 0.929


**Question 17**

What is the accuracy on the test set using the XGBoost classifier? Give the answer to 4 decimal places.

Answer: 0.9455

In [40]:
import warnings

# silence library warnings to keep the cell output readable
warnings.filterwarnings('ignore')

xgb_model = xgb.XGBClassifier(random_state=1)

xgb_model.fit(X_train, y_train)
y_pred = xgb_model.predict(X_test)
acc_score = round(accuracy_score(y_test, y_pred), 4)
print(f'Accuracy score of the XGBOOST Classifier is {acc_score}')

Accuracy score of the XGBOOST Classifier is 0.9455


**Question 18**

What is the accuracy on the test set using the LGBM classifier? Give the answer to 4 decimal places.

Answer: 0.9395

In [44]:
lgbm_clf = LGBMClassifier(random_state=1)

lgbm_clf.fit(X_train, y_train)
lgbm_pred = lgbm_clf.predict(X_test)
acc_score = round(accuracy_score(y_test, lgbm_pred), 4)
print(f'Accuracy score of the LGBM Classifier is {acc_score}')

Accuracy score of the LGBM Classifier is 0.9395
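
Accuracy is the only metric reported above, although `precision_score`, `recall_score`, `f1_score` and `confusion_matrix` were imported as well. The sketch below shows how they could be applied to string labels such as `stable` / `unstable`. The short label lists are made up for illustration, and treating `stable` as the positive class is an assumption.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical true and predicted labels, standing in for y_test and y_pred.
y_true = ['stable', 'unstable', 'unstable', 'stable', 'unstable', 'unstable']
y_hat = ['stable', 'unstable', 'stable', 'unstable', 'unstable', 'unstable']

# Rows are actual classes, columns are predicted classes, in the order given.
cm = confusion_matrix(y_true, y_hat, labels=['stable', 'unstable'])

acc = accuracy_score(y_true, y_hat)
prec = precision_score(y_true, y_hat, pos_label='stable')
rec = recall_score(y_true, y_hat, pos_label='stable')
f1 = f1_score(y_true, y_hat, pos_label='stable')

print(cm)
print(f'accuracy={acc:.4f} precision={prec:.4f} recall={rec:.4f} f1={f1:.4f}')
```

With the notebook's own variables, the same calls would take `y_test` and each model's predictions in place of the toy lists.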
