# Evaluation Metrics for Classification Problems

<img src="https://www.researchgate.net/publication/328148379/figure/fig1/AS:679514740895744@1539020347601/Model-performance-metrics-Visual-representation-of-the-classification-model-metrics.png" height=500 width=500>

In [1]:
import opendatasets as od
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px

In [2]:
df = pd.read_csv('personal-key-indicators-of-heart-disease/2020/heart_2020_cleaned.csv')

In [3]:
df

|  | HeartDisease | BMI | Smoking | AlcoholDrinking | Stroke | PhysicalHealth | MentalHealth | DiffWalking | Sex | AgeCategory | Race | Diabetic | PhysicalActivity | GenHealth | SleepTime | Asthma | KidneyDisease | SkinCancer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | No | 16.60 | Yes | No | No | 3.0 | 30.0 | No | Female | 55-59 | White | Yes | Yes | Very good | 5.0 | Yes | No | Yes |
| 1 | No | 20.34 | No | No | Yes | 0.0 | 0.0 | No | Female | 80 or older | White | No | Yes | Very good | 7.0 | No | No | No |
| 2 | No | 26.58 | Yes | No | No | 20.0 | 30.0 | No | Male | 65-69 | White | Yes | Yes | Fair | 8.0 | Yes | No | No |
| 3 | No | 24.21 | No | No | No | 0.0 | 0.0 | No | Female | 75-79 | White | No | No | Good | 6.0 | No | No | Yes |
| 4 | No | 23.71 | No | No | No | 28.0 | 0.0 | Yes | Female | 40-44 | White | No | Yes | Very good | 8.0 | No | No | No |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 319790 | Yes | 27.41 | Yes | No | No | 7.0 | 0.0 | Yes | Male | 60-64 | Hispanic | Yes | No | Fair | 6.0 | Yes | No | No |
| 319791 | No | 29.84 | Yes | No | No | 0.0 | 0.0 | No | Male | 35-39 | Hispanic | No | Yes | Very good | 5.0 | Yes | No | No |
| 319792 | No | 24.24 | No | No | No | 0.0 | 0.0 | No | Female | 45-49 | Hispanic | No | Yes | Good | 6.0 | No | No | No |
| 319793 | No | 32.81 | No | No | No | 0.0 | 0.0 | No | Female | 25-29 | Hispanic | No | No | Good | 12.0 | No | No | No |


In [5]:
from sklearn.model_selection import train_test_split

In [6]:
train_val_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
train_df, val_df = train_test_split(train_val_df, test_size=0.25, random_state=42)
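The two calls above chain a 20% test split with a 25% validation split of the remainder, which works out to roughly 60/20/20. A minimal sketch to check that arithmetic, assuming a small synthetic DataFrame (`toy_df` and its columns are hypothetical stand-ins, not the heart-disease data):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real DataFrame: 1,000 rows, one feature, one target.
toy_df = pd.DataFrame({
    "feature": np.arange(1000),
    "target": np.where(np.arange(1000) % 10 == 0, "Yes", "No"),
})

# Same two-step split as above: 20% test, then 25% of the remaining 80% for validation.
train_val_toy, test_toy = train_test_split(toy_df, test_size=0.2, random_state=42)
train_toy, val_toy = train_test_split(train_val_toy, test_size=0.25, random_state=42)

# Fraction of the full data that ends up in each split.
fractions = {
    name: len(part) / len(toy_df)
    for name, part in [("train", train_toy), ("val", val_toy), ("test", test_toy)]
}
print(fractions)
```

Because the target here is imbalanced, passing `stratify=` to `train_test_split` is one option for keeping the class ratio the same in every split. The original notebook does not do this, so it is mentioned only as a possibility.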

In [12]:
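target_col = 'HeartDisease'
# Every column except the target is an input feature (does not rely on column position)
input_cols = [col for col in train_df.columns if col != target_col]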

In [13]:
train_inputs = train_df[input_cols].copy()
train_targets = train_df[target_col].copy()
val_inputs = val_df[input_cols].copy()
val_targets = val_df[target_col].copy()
test_inputs = test_df[input_cols].copy()
test_targets = test_df[target_col].copy()

In [14]:
from sklearn.preprocessing import StandardScaler

In [17]:
numerics = train_inputs.select_dtypes(include=np.number).columns.tolist()
categoricals = train_inputs.select_dtypes('object').columns.tolist()

In [15]:
scaler = StandardScaler()

In [18]:
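# Fit on the training split only so validation/test statistics do not leak into the scaler
scaler.fit(train_inputs[numerics])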

train_inputs[numerics] = scaler.transform(train_inputs[numerics])
val_inputs[numerics] = scaler.transform(val_inputs[numerics])
test_inputs[numerics] = scaler.transform(test_inputs[numerics])

In [19]:
from sklearn.preprocessing import OneHotEncoder

In [20]:
# scikit-learn >= 1.2 names this parameter sparse_output; older releases use sparse=False
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

In [24]:
# Learn categories from the training split; handle_unknown='ignore' covers unseen values
encoder.fit(train_inputs[categoricals])

encoded_cols = list(encoder.get_feature_names_out(categoricals))

train_inputs[encoded_cols] = encoder.transform(train_inputs[categoricals])
val_inputs[encoded_cols] = encoder.transform(val_inputs[categoricals])
test_inputs[encoded_cols] = encoder.transform(test_inputs[categoricals])



In [25]:
from sklearn.linear_model import LogisticRegression

In [28]:
model = LogisticRegression()
model.fit(train_inputs[numerics + encoded_cols], train_targets)

In [30]:
Train = train_inputs[numerics + encoded_cols]
Val = val_inputs[numerics + encoded_cols]
Test = test_inputs[numerics + encoded_cols]

In [31]:
test_preds = model.predict(Test)

## Types of Evaluation Metrics

scikit-learn provides several evaluation metrics for classification models, and the right choice depends on the requirements of your problem. The most commonly used ones are:

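1. **Accuracy**: Accuracy is the ratio of correctly predicted observations to total observations, (TP + TN) / (TP + TN + FP + FN). It is the most straightforward metric, but it can be misleading on imbalanced datasets, where a model that always predicts the majority class still scores highly.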

<img src="https://wiki.cloudfactory.com/media/pages/docs/mp-wiki/metrics/accuracy/bc5dda9c32-1684142766/12.webp" height=500 width=500>

```python
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_true, y_pred)
```

In [57]:
5/9

0.5555555555555556

In [58]:
from sklearn.metrics import accuracy_score

In [62]:
test_targets.value_counts()

No     58367
Yes     5592
Name: HeartDisease, dtype: int64

In [59]:
accuracy_score(test_targets, test_preds)

0.9138979658843946
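To put the 0.9139 accuracy in context, it helps to compare it with a trivial baseline. The sketch below rebuilds a label array from the class counts printed by `test_targets.value_counts()` above and scores a hypothetical model that always predicts `'No'`. The names `y_baseline` and `baseline_accuracy` are illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Class counts copied from test_targets.value_counts() above.
n_no, n_yes = 58367, 5592

# Label array with the same class balance as the test set (row order does not matter).
y_true = np.array(["No"] * n_no + ["Yes"] * n_yes)
# Hypothetical baseline: predict the majority class for every row.
y_baseline = np.full(y_true.shape, "No")

baseline_accuracy = accuracy_score(y_true, y_baseline)
print(f"majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

The baseline reaches about 0.9126, only slightly below the logistic regression's 0.9139, which is why accuracy alone says little about this model.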

2. **Precision**: Precision is the ratio of correctly predicted positive observations to the total predicted positive observations, TP / (TP + FP). It answers the question: of everything the model flagged as positive, how much really was positive?

<img src="https://miro.medium.com/max/700/1*pDx6oWDXDGBkjnkRoJS6JA.png" height=500 width=500>

```python
from sklearn.metrics import precision_score
precision = precision_score(y_true, y_pred)
```

In [63]:
from sklearn.metrics import precision_score

In [36]:
test_preds

array(['No', 'No', 'No', ..., 'No', 'No', 'No'], dtype=object)

In [37]:
test_targets

271884    No
270361    No
219060    No
24010     No
181930    No
          ..
181387    No
13791     No
180164    No
94526     No
107129    No
Name: HeartDisease, Length: 63959, dtype: object

In [40]:
pos_label = 'Yes'

In [64]:
precision_score(test_targets, test_preds, pos_label=pos_label)

0.5416258570029383
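The same number can be reproduced by hand. A minimal sketch, using the true-positive and false-positive counts that the confusion matrix later in this notebook reports for the test set (the variable names are illustrative):

```python
# Counts from the notebook's test-set confusion matrix
# (rows = actual, columns = predicted, label order 'No', 'Yes').
tp = 553  # actual 'Yes', predicted 'Yes'
fp = 468  # actual 'No', predicted 'Yes'

# Precision = TP / (TP + FP): the share of 'Yes' predictions that were correct.
predicted_positive = tp + fp
precision_by_hand = tp / predicted_positive
print(f"{tp} of {predicted_positive} 'Yes' predictions were correct: {precision_by_hand:.4f}")
```

So the model predicts `'Yes'` for only 1,021 of the 63,959 test rows, and a little over half of those predictions are right.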

3. **Recall (Sensitivity)**: Recall is the ratio of correctly predicted positive observations to all observations in the actual positive class, TP / (TP + FN). It measures how many of the actual positives the model finds.

<img src="https://blog.roboflow.com/content/images/2022/03/recall_formula.png" height=500 width=500>

```python
from sklearn.metrics import recall_score
recall = recall_score(y_true, y_pred)
```

In [42]:
from sklearn.metrics import recall_score

In [65]:
recall_score(test_targets, test_preds, pos_label=pos_label)

0.09889127324749643
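A recall near 0.10 means the model misses roughly nine out of ten actual positives at the default 0.5 decision threshold. One possible lever is the threshold itself. The sketch below is an illustration under stated assumptions: it uses synthetic imbalanced data from `make_classification` rather than the heart-disease dataset, and the thresholds and variable names are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (about 10% positives); not the heart-disease dataset.
X, y = make_classification(n_samples=5000, n_features=10, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# Predicted probability of the positive class (label 1) for each test row.
pos_proba = clf.predict_proba(X_test)[:, 1]

results = {}
for threshold in (0.5, 0.3, 0.1):
    # Predict positive whenever the probability reaches the chosen threshold.
    y_pred = (pos_proba >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_test, y_pred, zero_division=0),
        recall_score(y_test, y_pred),
    )
    precision, recall = results[threshold]
    print(f"threshold={threshold}: precision={precision:.3f}, recall={recall:.3f}")
```

Lowering the threshold can only add positive predictions, so recall never decreases, while precision typically falls. Which trade-off is acceptable depends on the application.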

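4. **F1 Score**: F1 Score is the harmonic mean of precision and recall, 2 × precision × recall / (precision + recall). Because the harmonic mean is pulled toward the smaller of the two values, a high F1 requires both precision and recall to be high.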

<img src="https://miro.medium.com/v2/resize:fit:1400/1*vjM46BRmBYQLt-uzSmUeag.png" height=500 width=500>

```python
from sklearn.metrics import f1_score
f1 = f1_score(y_true, y_pred)
```

In [None]:
# Notes: hyperparameter tuning; class balancing as a preprocessing step

In [45]:
from sklearn.metrics import f1_score

In [66]:
f1_score(test_targets, test_preds, pos_label=pos_label)

0.16724633298049296
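The F1 value follows directly from the precision and recall computed earlier. The short sketch below plugs those two reported numbers into the harmonic-mean formula and, for contrast, also computes their arithmetic mean (the variable names are illustrative):

```python
# Precision and recall as reported by the cells above.
precision = 0.5416258570029383
recall = 0.09889127324749643

# Harmonic mean: dominated by the smaller of the two values.
f1 = 2 * precision * recall / (precision + recall)
# Arithmetic mean, shown only for comparison.
arithmetic_mean = (precision + recall) / 2

print(f"F1 (harmonic mean): {f1:.4f}")
print(f"arithmetic mean:    {arithmetic_mean:.4f}")
```

The arithmetic mean would suggest a score of about 0.32, while F1 stays close to the weak recall at about 0.17. That penalty for imbalance between the two is the reason F1 uses the harmonic mean.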

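5. **Confusion Matrix**: A confusion matrix is a table that summarizes a classifier's predictions against the actual values by counting each combination. In scikit-learn, rows are actual classes and columns are predicted classes, with labels in sorted order, so for the 'No'/'Yes' labels used here the layout is [[TN, FP], [FN, TP]].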

<img src="https://i0.wp.com/replicationindex.com/wp-content/uploads/2021/01/fdf0b-17eyyla6xlxsgbcf77j_roa.png?w=604&ssl=1" height=500 width=500>

```python
from sklearn.metrics import confusion_matrix
conf_matrix = confusion_matrix(y_true, y_pred)
```

In [47]:
from sklearn.metrics import confusion_matrix

In [48]:
confusion_matrix(test_targets, test_preds)

array([[57899,   468],
       [ 5039,   553]])
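Reading the four cells correctly depends on the label order. A small sketch on toy labels (made up for illustration, not taken from the dataset) shows how to pin the order with `labels=`, unpack the cells with `ravel()`, and row-normalize with `normalize='true'`:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels for illustration only.
y_true = ["No", "No", "No", "Yes", "Yes", "No", "Yes", "No"]
y_pred = ["No", "Yes", "No", "Yes", "No", "No", "Yes", "No"]

# Fix the label order explicitly so the cell positions are unambiguous:
# row 0 / column 0 is 'No' (negative), row 1 / column 1 is 'Yes' (positive).
cm = confusion_matrix(y_true, y_pred, labels=["No", "Yes"])
tn, fp, fn, tp = cm.ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")

# normalize='true' divides each row by its total, so the diagonal holds per-class recall.
cm_row_normalized = confusion_matrix(y_true, y_pred, labels=["No", "Yes"], normalize="true")
print(np.round(cm_row_normalized, 3))
```

Applied to the matrix above, the same reading gives TN=57899, FP=468, FN=5039, TP=553.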

6. **ROC Curve and AUC Score**: The Receiver Operating Characteristic (ROC) curve plots the true positive rate against the false positive rate of a binary classifier as its decision threshold is varied. The Area Under the Curve (AUC) summarizes performance across all possible thresholds: 1.0 corresponds to a perfect ranking of positives above negatives, and 0.5 is no better than chance.

<img src="https://www.researchgate.net/publication/276079439/figure/fig2/AS:614187332034565@1523445079168/An-example-of-ROC-curves-with-good-AUC-09-and-satisfactory-AUC-065-parameters.png" height=300 width=300>

```python
from sklearn.metrics import roc_curve, auc
# y_scores: predicted probabilities or decision scores for the positive class, not hard labels
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
roc_auc = auc(fpr, tpr)
```
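The notebook does not run this metric on the heart-disease model, so here is a self-contained sketch on synthetic data. It assumes 0/1 labels from `make_classification` and takes the scores from `predict_proba`. The dataset and all variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data; not the heart-disease dataset.
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# ROC needs continuous scores: use the probability of the positive class (label 1).
y_scores = clf.predict_proba(X_test)[:, 1]

# One (fpr, tpr) point per candidate threshold.
fpr, tpr, thresholds = roc_curve(y_test, y_scores)
roc_auc = auc(fpr, tpr)

# roc_auc_score computes the same area in a single call.
roc_auc_direct = roc_auc_score(y_test, y_scores)
print(f"AUC via roc_curve + auc: {roc_auc:.4f}")
print(f"AUC via roc_auc_score:   {roc_auc_direct:.4f}")
```

For the model in this notebook, the scores would presumably come from `model.predict_proba(Test)[:, 1]` together with `pos_label='Yes'` in `roc_curve`, assuming `model.classes_` lists `'Yes'` second (scikit-learn sorts class labels).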

7. **Specificity (True Negative Rate)**: Specificity is the ratio of true negative predictions to the total actual negative instances (true negatives and false positives), TN / (TN + FP). It measures the ability of the model to correctly identify negative instances.
   
<img src="https://www.aaronswansonpt.com/wp-content/uploads/2011/08/Sensitivity-and-Specificity.png" height=500 width=500>
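scikit-learn has no dedicated specificity function that I am aware of, but specificity is simply the recall of the negative class. The sketch below rebuilds label arrays from the confusion-matrix counts reported above and computes it two ways (the array construction and variable names are illustrative):

```python
import numpy as np
from sklearn.metrics import recall_score

# Cell counts from the confusion matrix reported above for the test set.
tn, fp, fn, tp = 57899, 468, 5039, 553

# Rebuild label arrays that reproduce those four counts (row order is irrelevant here).
y_true = np.array(["No"] * (tn + fp) + ["Yes"] * (fn + tp))
y_pred = np.array(["No"] * tn + ["Yes"] * fp + ["No"] * fn + ["Yes"] * tp)

# Direct formula: TN / (TN + FP).
specificity_by_hand = tn / (tn + fp)
# Same quantity via recall_score, treating 'No' as the class of interest.
specificity_sklearn = recall_score(y_true, y_pred, pos_label="No")

print(f"specificity: {specificity_by_hand:.4f}")
```

A specificity of about 0.992 next to a recall of about 0.099 shows the model is very good at recognizing negatives and poor at finding positives.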

These are some of the most commonly used evaluation metrics for classification models. Choose among them based on the requirements of your problem and the characteristics of your dataset, in particular how imbalanced the classes are and which kind of error (false positive or false negative) is more costly.
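When several of these metrics are needed at once, `classification_report` prints per-class precision, recall, and F1 in a single call. A minimal sketch on toy labels (made up for illustration):

```python
from sklearn.metrics import classification_report

# Toy labels for illustration only; not taken from the heart-disease data.
y_true = ["No", "No", "No", "Yes", "Yes", "No", "Yes", "No"]
y_pred = ["No", "Yes", "No", "Yes", "No", "No", "Yes", "No"]

# Text table with per-class precision, recall, F1 and support.
print(classification_report(y_true, y_pred))

# The same numbers as a nested dict, convenient for logging or comparisons.
report = classification_report(y_true, y_pred, output_dict=True)
yes_recall = report["Yes"]["recall"]
print(f"recall for 'Yes': {yes_recall:.3f}")
```

In this notebook the equivalent call would be `classification_report(test_targets, test_preds)`, which reports both the `'No'` and `'Yes'` classes without needing `pos_label`.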