# Recap

Each observation of the dataset is a numerically represented heartbeat, taken from a patient's electrocardiogram (ECG). The target is binary and indicates whether the heartbeat is at risk of cardiovascular disease [1] or not [0]. 


The **task** is to build a model that can **flag at-risk observations**.

In [0]:
import pandas as pd
data = pd.read_csv("data/electrocardiograms.csv")

data.head()

Unnamed: 0,x_1,x_2,x_3,x_4,x_5,x_6,x_7,x_8,x_9,x_10,...,x_179,x_180,x_181,x_182,x_183,x_184,x_185,x_186,x_187,target
0,0.0,0.041199,0.11236,0.146067,0.202247,0.322097,0.363296,0.413858,0.426966,0.485019,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
1,1.0,0.901786,0.760714,0.610714,0.466071,0.385714,0.364286,0.346429,0.314286,0.305357,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
2,0.9942,1.0,0.951276,0.903712,0.917633,0.900232,0.803944,0.656613,0.421114,0.288863,...,0.294664,0.295824,0.301624,0.0,0.0,0.0,0.0,0.0,0.0,1
3,0.984472,0.962733,0.663043,0.21118,0.0,0.032609,0.100932,0.177019,0.270186,0.313665,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
4,0.619217,0.489324,0.327402,0.11032,0.0,0.060498,0.108541,0.108541,0.145907,0.192171,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1


In [0]:
data["target"].value_counts()

0    18117
1     1448
Name: target, dtype: int64

> The dataset is very imbalanced: only about 7% of the heartbeats are at risk.

---
Missing an at-risk heartbeat is the costliest mistake, so we want to miss as few as possible.  
It is better to flag a healthy heartbeat by mistake (a false positive) than to overlook a real case (a false negative).  
Our goal is therefore a very high recall for class 1 `(don't miss any disease)`, and we accept a lower precision `(some healthy heartbeats will be flagged as at risk)`.
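
To see why accuracy alone would be misleading here, the sketch below rebuilds a target vector with the same class counts as the dataset and scores a hypothetical "model" that always predicts class 0. The variable names are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Rebuild a target vector with the same class counts as the dataset
y_true = np.array([0] * 18117 + [1] * 1448)

# Hypothetical baseline: always predict the majority class (0)
y_always_healthy = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_always_healthy)
baseline_recall = recall_score(y_true, y_always_healthy)  # recall for class 1

print(f"accuracy: {baseline_accuracy:.3f}")        # equals 18117 / 19565
print(f"recall (class 1): {baseline_recall:.3f}")  # no at-risk heartbeat is found
```

Such a baseline looks accurate while detecting no at-risk heartbeat at all, which is why the rest of the recap tracks recall.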

## Base Logistic Regression

`Recall` measures the model's ability to detect a class: the share of actual members of the class that the model finds.  
We need a `high recall` on class 1 so that patients at risk of cardiovascular disease are detected.

👇 Cross-validate the recall score of a Logistic Regression model

In [0]:
# YOUR CODE HERE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Prepare X and y
X = data.loc[:, 'x_1':'x_187']
y = data['target']

# Cross validate model
log_cv_results = cross_validate(LogisticRegression(max_iter=1000),
                                X, y, 
                                cv=10, 
                                scoring = ['recall'])

#  Recall
log_cv_results['test_recall'].mean()

0.32938697318007665

> The recall score is very low because the dataset is imbalanced.  
> Class 1 (patients with the disease) has far fewer samples, so it is harder for the model to predict this class correctly.

👇 Generate a classification report for the model.

[`classification_report` documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html)

In [0]:
# YOUR CODE HERE
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_predict

y_pred = cross_val_predict(LogisticRegression(max_iter=1000), X, y)

print(classification_report(y,y_pred))

              precision    recall  f1-score   support

           0       0.95      0.99      0.97     18117
           1       0.68      0.33      0.45      1448

    accuracy                           0.94     19565
   macro avg       0.82      0.66      0.71     19565
weighted avg       0.93      0.94      0.93     19565



> The model predicts class 0 far more often than class 1.

In [0]:
pd.DataFrame(y_pred).value_counts()

0    18867
1      698
dtype: int64

`precision`: of all the observations predicted as a given class, the share that truly belong to it. Precision is the metric to optimize when false alarms are costly and missing some patients with the disease is acceptable.
> `TP / (TP + FP)` 
> - 0.95 for class 0 means that 95% of the class 0 predictions were correct.  
> - 0.68 for class 1 means that only 68% of the class 1 predictions were correct.  

`recall`: of all the observations that truly belong to a given class, the share that the model detects. Recall is the metric to optimize when we want to catch ALL the patients with the disease.  
   What matters is predicting disease whenever it is present. We don't want to miss any case.  
   Patients could complain about our company if we failed to detect their disease.
> `TP / (TP + FN)`
> 0.33 for class 1 means that, of all the patients with the disease, we detect only 33%. This is very bad!     
> We will have many problems with patients who later find out they have a disease that we did not detect.  

> 0.99 for class 0 means that, of all the patients without the disease, we correctly identify 99%. This good result matters less to us: the remaining 1% are healthy patients wrongly flagged as at risk, which is a much less serious error here.

---

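> Toy example for class 1: every observation is predicted as 1, so all 4 actual positives are found, along with 2 false positives.
> - 100% Recall (4 / 4)
> - 67% Precision (4 / 6)

| y_pred | y_true |
| ----------- | ----------- |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 0 |
| 1 | 0 |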
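
The toy table above can be checked with scikit-learn. This is a minimal sketch: the two lists simply transcribe the table, and the manual formulas are compared with `precision_score` and `recall_score`.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Transcription of the toy table: six predictions of 1, four of them correct
y_true_toy = [1, 1, 1, 1, 0, 0]
y_pred_toy = [1, 1, 1, 1, 1, 1]

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

recall_manual = tp / (tp + fn)     # TP / (TP + FN)
precision_manual = tp / (tp + fp)  # TP / (TP + FP)

print(recall_manual, recall_score(y_true_toy, y_pred_toy))
print(precision_manual, precision_score(y_true_toy, y_pred_toy))
```

Predicting 1 everywhere maximizes recall but lets false positives drag precision down.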

## Weighted classes Logistic Regression

👇 Cross-validate the recall score of a Logistic Regression model with weighted classes

## class_weight = "balanced" : hyperparameter from LogisticRegression

In [0]:
import numpy as np



In [0]:
y.value_counts()

0    18117
1     1448
Name: target, dtype: int64

In [0]:
# n_samples / (n_classes * np.bincount(y))
19565 / (2 * np.bincount(y))

array([0.53996247, 6.75587017])

`class_weight = "balanced"` gives more weight to the class with fewer samples, using the formula above (here about 6.76 for class 1 versus 0.54 for class 0).   
This lets the algorithm learn the under-represented class properly.  
This is an alternative to SMOTE.
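
As a cross-check, scikit-learn exposes the same computation through `compute_class_weight`. The sketch below assumes a target vector rebuilt from the class counts shown earlier (the name `y_counts` is illustrative) and compares the helper with the manual formula.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Rebuild a target vector with the same class counts as the dataset
y_counts = np.array([0] * 18117 + [1] * 1448)

# Manual formula: n_samples / (n_classes * np.bincount(y))
manual_weights = len(y_counts) / (2 * np.bincount(y_counts))

# Same weights computed by scikit-learn's helper
sk_weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_counts
)

print(manual_weights)
print(sk_weights)
```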

In [0]:
# YOUR CODE HERE
log_cv_results = cross_validate(LogisticRegression(max_iter=1000, class_weight = "balanced"),
                                X, y, 
                                cv=10, 
                                scoring = ['recall'])

#  Recall
log_cv_results['test_recall'].mean()

0.8638793103448276

👇 Generate a classification report for the model

In [0]:
# YOUR CODE HERE
y_pred = cross_val_predict(LogisticRegression(max_iter=1000,class_weight = "balanced"), X, y)

print(classification_report(y,y_pred))

              precision    recall  f1-score   support

           0       0.99      0.84      0.91     18117
           1       0.31      0.87      0.45      1448

    accuracy                           0.85     19565
   macro avg       0.65      0.86      0.68     19565
weighted avg       0.94      0.85      0.88     19565



> `class_weight = "balanced"` yields many more predictions for class 1.

In [0]:
pd.DataFrame(y_pred).value_counts()

0    15494
1     4071
dtype: int64

---

 🔥 Before going to the last step of the recap, let's review what we have done so far:  
 
 We want a good recall score, to build a model that misses as few at-risk heartbeats as possible.
 
 > 1. We had a very poor recall score (33%) because of an imbalanced dataset (far more observations without disease than with disease).
 > 2. We discovered the magic of `class_weight = "balanced"`, which gives more weight to the class with fewer observations.
 > The consequence is a much better recall score (87%).
 
🏁 Last Goal:
> We want a way to reach the recall score we choose. We can do it using the `Threshold Adjustment` technique!


## Threshold Adjustment

👇 Find the threshold that would guarantee a 95% recall

# 1. Predict probabilities

In [0]:
# Use cross val predict to predict probabilities of observation belonging to class 0 or 1


data["y_pred_probas_0"], data["y_pred_probas_1"] = cross_val_predict(LogisticRegression(max_iter=1000,class_weight = "balanced"),
                                                     X, y,
                                                     method = "predict_proba").T


Inspect the predicted probabilities (you can compare them with those obtained without `class_weight = "balanced"`)

In [0]:
data.head()

Unnamed: 0,x_1,x_2,x_3,x_4,x_5,x_6,x_7,x_8,x_9,x_10,...,x_181,x_182,x_183,x_184,x_185,x_186,x_187,target,y_pred_probas_0,y_pred_probas_1
0,0.0,0.041199,0.11236,0.146067,0.202247,0.322097,0.363296,0.413858,0.426966,0.485019,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0.635416,0.364584
1,1.0,0.901786,0.760714,0.610714,0.466071,0.385714,0.364286,0.346429,0.314286,0.305357,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0.218121,0.781879
2,0.9942,1.0,0.951276,0.903712,0.917633,0.900232,0.803944,0.656613,0.421114,0.288863,...,0.301624,0.0,0.0,0.0,0.0,0.0,0.0,1,0.024926,0.975074
3,0.984472,0.962733,0.663043,0.21118,0.0,0.032609,0.100932,0.177019,0.270186,0.313665,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0.72699,0.27301
4,0.619217,0.489324,0.327402,0.11032,0.0,0.060498,0.108541,0.108541,0.145907,0.192171,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0.495548,0.504452


# 2. Generate recall and thresholds using probabilities for class 1

🔆
> Keep in mind that there is always a trade-off between precision and recall.
The higher the recall, the lower the precision tends to be.
As a Data Scientist, you can control this trade-off yourself with the `precision_recall_curve` tool.  
For each candidate threshold, it returns the corresponding precision and recall scores.

`precision_recall_curve` takes `y_true` and the predicted probabilities of the class whose threshold you want to adjust.
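
To make the trade-off concrete, here is a small sketch on made-up labels and scores (not the ECG data). It prints the precision and recall that `precision_recall_curve` associates with each threshold. The exact number of rows may differ between scikit-learn versions, so no output is shown.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Made-up labels and class 1 probabilities, for illustration only
y_true_toy = np.array([0, 0, 1, 1])
scores_toy = np.array([0.1, 0.4, 0.35, 0.8])

p_toy, r_toy, t_toy = precision_recall_curve(y_true_toy, scores_toy)

# zip stops at the shortest array, so the final (precision=1, recall=0)
# point, which has no threshold, is left out of the printout
for p, r, t in zip(p_toy, r_toy, t_toy):
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```

Thresholds are returned in increasing order, and recall can only stay flat or drop as the threshold rises.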



In [0]:
from sklearn.metrics import precision_recall_curve


precision, recall, thresholds = precision_recall_curve(y, data["y_pred_probas_1"])



In [0]:
# Candidate thresholds, from about 1.75% to 99.998%
# We want the threshold corresponding to a 95% recall score
thresholds

array([0.01754717, 0.01754845, 0.01755864, ..., 0.99992154, 0.99992835,
       0.99997848])

# 3. Populate dataframe with recall and threshold

In [0]:
len(recall)

16213

In [0]:
len(thresholds)

16212

`recall` contains one more element than `thresholds` (the final value, recall = 0, has no associated threshold). That's why we exclude the last element of `recall` to build the DataFrame

In [0]:
df_recall = pd.DataFrame({"recall" : recall[:-1], "threshold" : thresholds})

df_recall


Unnamed: 0,recall,threshold
0,1.000000,0.017547
1,0.999309,0.017548
2,0.999309,0.017559
3,0.999309,0.017561
4,0.999309,0.017561
...,...,...
16207,0.002762,0.999856
16208,0.002072,0.999868
16209,0.002072,0.999922
16210,0.001381,0.999928


# 4. Find out which threshold guarantees a 95% recall score 

In [0]:
df_recall[df_recall['recall'] >= 0.95]

Unnamed: 0,recall,threshold
0,1.000000,0.017547
1,0.999309,0.017548
2,0.999309,0.017559
3,0.999309,0.017561
4,0.999309,0.017561
...,...,...
9830,0.950276,0.269746
9831,0.950276,0.270036
9832,0.950276,0.270044
9833,0.950276,0.270127


In [0]:
df_recall[df_recall['recall'] >= 0.95]['threshold'].max()

0.2703028302701805

In [0]:

new_threshold = df_recall[df_recall['recall'] >= 0.95]['threshold'].max()

new_threshold

0.2703028302701805
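
The same lookup can be done without an intermediate DataFrame. The helper below, `threshold_for_recall`, is a hypothetical name introduced for illustration; it is a sketch run on made-up scores rather than on the ECG data.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, recall_score

def threshold_for_recall(y_true, scores, target_recall):
    """Return the largest threshold whose recall is at least target_recall."""
    _, recall, thresholds = precision_recall_curve(y_true, scores)
    # recall has one more element than thresholds, so drop its last value
    mask = recall[:-1] >= target_recall
    return thresholds[mask].max()

# Made-up labels and class 1 probabilities, for illustration only
y_true_toy = np.array([0, 0, 1, 1])
scores_toy = np.array([0.1, 0.4, 0.35, 0.8])

toy_threshold = threshold_for_recall(y_true_toy, scores_toy, 0.95)
toy_recall = recall_score(y_true_toy, (scores_toy >= toy_threshold).astype(int))
print(toy_threshold, toy_recall)
```

Taking the largest qualifying threshold keeps precision as high as possible while still meeting the recall target.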

# 5. Let's set the threshold at 27% to get a 95% recall score

 🔆
 > To do that, build a function that returns a prediction of 1 when the probability is greater than or equal to 27% (`new_threshold`)

In [0]:
def custom_predict(proba):
    if proba >= new_threshold:
        val = 1
    else:
        val = 0
    return val

# Create a column to record the prediction, after applying the custom_predict function
data['custom_predict'] = data["y_pred_probas_1"].apply(custom_predict)

data.head()

Unnamed: 0,x_1,x_2,x_3,x_4,x_5,x_6,x_7,x_8,x_9,x_10,...,x_182,x_183,x_184,x_185,x_186,x_187,target,y_pred_probas_0,y_pred_probas_1,custom_predict
0,0.0,0.041199,0.11236,0.146067,0.202247,0.322097,0.363296,0.413858,0.426966,0.485019,...,0.0,0.0,0.0,0.0,0.0,0.0,1,0.635416,0.364584,1
1,1.0,0.901786,0.760714,0.610714,0.466071,0.385714,0.364286,0.346429,0.314286,0.305357,...,0.0,0.0,0.0,0.0,0.0,0.0,1,0.218121,0.781879,1
2,0.9942,1.0,0.951276,0.903712,0.917633,0.900232,0.803944,0.656613,0.421114,0.288863,...,0.0,0.0,0.0,0.0,0.0,0.0,1,0.024926,0.975074,1
3,0.984472,0.962733,0.663043,0.21118,0.0,0.032609,0.100932,0.177019,0.270186,0.313665,...,0.0,0.0,0.0,0.0,0.0,0.0,1,0.72699,0.27301,1
4,0.619217,0.489324,0.327402,0.11032,0.0,0.060498,0.108541,0.108541,0.145907,0.192171,...,0.0,0.0,0.0,0.0,0.0,0.0,1,0.495548,0.504452,1
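
The row-by-row `apply` can also be written as a single vectorized comparison. In this sketch the first four probabilities are copied from the output above, the fifth is a made-up value below the threshold, and the names `probas`, `threshold` and `flags` are illustrative.

```python
import pandas as pd

# First four values come from the table above; 0.10 is a made-up example
probas = pd.Series([0.364584, 0.781879, 0.975074, 0.27301, 0.10])
threshold = 0.2703028302701805  # value found in step 4

# Vectorized thresholding: True/False converted to 1/0
flags = (probas >= threshold).astype(int)

# Equivalent row-by-row version, for comparison
flags_apply = probas.apply(lambda p: 1 if p >= threshold else 0)

print(flags.tolist())
```

Both versions give the same flags, and the vectorized form avoids a Python-level call per row.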


## Custom Predict

👇 Generate a classification report for the model

In [0]:
# YOUR CODE HERE
print(classification_report(y, data["custom_predict"]))

              precision    recall  f1-score   support

           0       0.99      0.72      0.84     18117
           1       0.22      0.95      0.35      1448

    accuracy                           0.74     19565
   macro avg       0.61      0.84      0.59     19565
weighted avg       0.94      0.74      0.80     19565



🎉 
> Congratulations, you now know how to trade off recall and precision depending on what you need!