## Setting up to Work

The first part of the process is importing the libraries and dependencies.

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import mutual_info_classif
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix


Load the dataset after downloading it from Kaggle: https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset.

In [2]:
df = pd.read_csv("diabetes_prediction_dataset.csv")

## Exploratory Data Analysis

The first step is to understand the data. To do so, we start by getting some general information about the dataset through `head()`, `info()` and `describe()`.

In [3]:
df[(df['gender'] == 'Male') & (df['age'] == 23.0) & (df['diabetes'] == 1)]

Unnamed: 0,gender,age,hypertension,heart_disease,smoking_history,bmi,HbA1c_level,blood_glucose_level,diabetes
42197,Male,23.0,0,0,No Info,27.32,5.7,145,1
57911,Male,23.0,0,0,No Info,27.59,7.0,159,1
76268,Male,23.0,0,0,never,22.46,6.5,140,1
79982,Male,23.0,0,0,No Info,40.33,8.2,220,1
82655,Male,23.0,0,0,never,31.41,6.2,140,1


In [4]:
df.head()

Unnamed: 0,gender,age,hypertension,heart_disease,smoking_history,bmi,HbA1c_level,blood_glucose_level,diabetes
0,Female,80.0,0,1,never,25.19,6.6,140,0
1,Female,54.0,0,0,No Info,27.32,6.6,80,0
2,Male,28.0,0,0,never,27.32,5.7,158,0
3,Female,36.0,0,0,current,23.45,5.0,155,0
4,Male,76.0,1,1,current,20.14,4.8,155,0


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 9 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   gender               100000 non-null  object 
 1   age                  100000 non-null  float64
 2   hypertension         100000 non-null  int64  
 3   heart_disease        100000 non-null  int64  
 4   smoking_history      100000 non-null  object 
 5   bmi                  100000 non-null  float64
 6   HbA1c_level          100000 non-null  float64
 7   blood_glucose_level  100000 non-null  int64  
 8   diabetes             100000 non-null  int64  
dtypes: float64(3), int64(4), object(2)
memory usage: 6.9+ MB


In [6]:
df.describe()

Unnamed: 0,age,hypertension,heart_disease,bmi,HbA1c_level,blood_glucose_level,diabetes
count,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0
mean,41.885856,0.07485,0.03942,27.320767,5.527507,138.05806,0.085
std,22.51684,0.26315,0.194593,6.636783,1.070672,40.708136,0.278883
min,0.08,0.0,0.0,10.01,3.5,80.0,0.0
25%,24.0,0.0,0.0,23.63,4.8,100.0,0.0
50%,43.0,0.0,0.0,27.32,5.8,140.0,0.0
75%,60.0,0.0,0.0,29.58,6.2,159.0,0.0
max,80.0,1.0,1.0,95.69,9.0,300.0,1.0


So, after a brief analysis we may conclude:
- There is plenty of data available: 100,000 cases.
- The data doesn't contain explicit missing values such as NAs that would raise errors, but `smoking_history` has a `No Info` class.
- There are two categorical variables, `gender` and `smoking_history`; both will have to be transformed into numerical values.
- `diabetes`, the target, is a binary value with a mean of 0.085. This means that only 8.5% of the cases actually have diabetes, which implies an imbalanced dataset.
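The imbalance and the hidden missing values noted above can also be checked directly rather than read off the summary table. The following is a minimal sketch on a small made-up frame (`toy` and its values are hypothetical, used only for illustration); on the real dataset the same calls would be applied to `df`.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset.
toy = pd.DataFrame({
    "smoking_history": ["never", "No Info", "current", "No Info", "former",
                        "never", "No Info", "never", "ever", "never"],
    "diabetes": [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
})

# Share of each target class; for a 0/1 target the share of 1s equals the mean.
class_share = toy["diabetes"].value_counts(normalize=True)

# True NaNs versus a placeholder category that hides missing data.
nan_count = int(toy.isna().sum().sum())
no_info_share = float((toy["smoking_history"] == "No Info").mean())

print(class_share)
print(nan_count, no_info_share)
```

A placeholder such as `No Info` is not counted by `isna()`, which is why `info()` reports no null values even though part of the column is effectively missing.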

Next, we are going to analyse the `smoking_history` feature, as it appears to be problematic: it is a categorical feature with missing values.

The first step is to check the possible values this feature can take and their respective frequencies. As shown in the next cell, this feature has a couple of issues:
- `No Info` appears in 35,816 cases, meaning that more than one-third of the cases have an unspecified value for this feature (missing data).
- There are ambiguous and overlapping categories. For example, `not current` could mean the same as `former` or `never`, and the criterion that distinguishes `ever` from `current` is poorly defined.

Considering the high number of unknown values and the ambiguity in the class definitions, further evaluation is needed to assess this feature's impact on the target prediction. This will help justify the effort of keeping this feature, or determine whether it should be discarded.

In [7]:
df['smoking_history'].value_counts()

smoking_history
No Info        35816
never          35095
former          9352
current         9286
not current     6447
ever            4004
Name: count, dtype: int64
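To put these counts in proportion, the relative frequencies can be computed from them. This sketch simply re-enters the counts printed above into a `Series` (on the real data, `df['smoking_history'].value_counts(normalize=True)` should give the same shares directly).

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({
    "No Info": 35816, "never": 35095, "former": 9352,
    "current": 9286, "not current": 6447, "ever": 4004,
})

# Relative frequency of each category.
shares = (counts / counts.sum()).round(4)
print(shares)
```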

To evaluate further, we'll define a `ColumnTransformer` and transform the categorical values through One-Hot Encoding. With the categorical data transformed, we can then use feature metrics, such as the Mutual Information score and correlation, to understand each feature's relevance to the target.

In [8]:
# Separates the features from the target.
X = df.drop(['diabetes'], axis=1)
y = df['diabetes']

CT = ColumnTransformer(
    transformers = [ 
        ('onehot', OneHotEncoder(sparse_output=False, categories='auto'), ['gender', 'smoking_history']),
        # Alternative kept for reference: ordinal encoding of smoking_history.
        #('ordinal', OrdinalEncoder(categories=[['never','No Info', 'not current', 'former', 'current', 'ever']]), ['smoking_history'])
    ],
    remainder='passthrough'
)

In [9]:
# Applies the OneHot onto categorical features.
X_encoded = CT.fit_transform(X[['gender', 'smoking_history']])
encoded_cols = CT.get_feature_names_out(['gender', 'smoking_history'])

X_encoded_df = pd.DataFrame(X_encoded, columns=encoded_cols, index=X.index)
X_features = pd.concat([X.drop(['gender', 'smoking_history'], axis=1), X_encoded_df], axis=1)

### Correlation
Analyzing the correlation between the features and the target variable allows us to see which features have the strongest linear relationship with the target.

In [10]:
df_trans = pd.concat([X_features, y], axis=1)
df_trans.corr()['diabetes'].sort_values(ascending=False)


diabetes                               1.000000
blood_glucose_level                    0.419558
HbA1c_level                            0.400660
age                                    0.258008
bmi                                    0.214357
hypertension                           0.197823
heart_disease                          0.171727
onehot__smoking_history_former         0.097917
onehot__gender_Male                    0.037666
onehot__smoking_history_never          0.027267
onehot__smoking_history_ever           0.024080
onehot__smoking_history_not current    0.020734
onehot__smoking_history_current        0.019606
onehot__gender_Other                  -0.004090
onehot__gender_Female                 -0.037553
onehot__smoking_history_No Info       -0.118939
Name: diabetes, dtype: float64

### Mutual Information
Mutual Information can capture many types of relationships between each variable and the target; unlike correlation, it is not limited to linear associations.

In [11]:
# Computes the MI scores
mi_scores = mutual_info_classif(X_features, y)

# Displays the results
mi_df = pd.DataFrame({'Feature': X_features.columns, 'MI Score': mi_scores})
print(mi_df.sort_values(by='MI Score', ascending=False))


                                Feature  MI Score
4                           HbA1c_level  0.130564
5                   blood_glucose_level  0.114293
0                                   age  0.039955
3                                   bmi  0.027423
6                 onehot__gender_Female  0.017034
9       onehot__smoking_history_No Info  0.015438
1                          hypertension  0.012685
7                   onehot__gender_Male  0.009468
2                         heart_disease  0.009383
13        onehot__smoking_history_never  0.007803
12       onehot__smoking_history_former  0.005175
11         onehot__smoking_history_ever  0.002307
10      onehot__smoking_history_current  0.000733
8                  onehot__gender_Other  0.000000
14  onehot__smoking_history_not current  0.000000
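To illustrate why Mutual Information can rank features differently from correlation, here is a small sketch on synthetic data (the variables `x` and `label` are made up for this example). The label depends on `x` only through its absolute value, so the linear correlation is close to zero while the MI estimate is clearly positive. Passing `random_state` is an addition here; it should make the estimate reproducible between runs.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)

# Synthetic feature and a label that depends on it in a non-linear, symmetric way.
x = rng.uniform(-1.0, 1.0, size=5000)
label = (np.abs(x) > 0.5).astype(int)

# Pearson correlation only measures the linear part of the relationship.
corr = np.corrcoef(x, label)[0, 1]

# Mutual information also picks up the non-linear dependence.
mi = mutual_info_classif(x.reshape(-1, 1), label, random_state=0)[0]

print(f"correlation: {corr:.3f}, mutual information: {mi:.3f}")
```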


### PCA
Using PCA, we may obtain valuable information about components built as linear combinations of the available features.

In [12]:
X_filtered = X_features[["age", "hypertension", "heart_disease", "bmi", "HbA1c_level", "blood_glucose_level"]]
X_scaled = (X_filtered - X_filtered.mean(axis=0)) / X_filtered.std(axis=0)

pca = PCA(n_components=5)
X_pca = pca.fit_transform(X_scaled)

comp_names = [f"PC{i+1}" for i in range(X_pca.shape[1])]
X_pca = pd.DataFrame(X_pca, columns=comp_names)

loadings = pd.DataFrame(
    pca.components_.T,
    columns=comp_names,
    index=X_filtered.columns,  
)
print(loadings)

                          PC1       PC2       PC3       PC4       PC5
age                  0.566264 -0.265468 -0.057950 -0.156845 -0.023003
hypertension         0.421370 -0.175895  0.029841  0.864849  0.076444
heart_disease        0.352297 -0.143349  0.818735 -0.296400 -0.018912
bmi                  0.456550 -0.218234 -0.569784 -0.360819 -0.056236
HbA1c_level          0.284349  0.656053  0.000659  0.058274 -0.696617
blood_glucose_level  0.297298  0.632461 -0.027746 -0.077440  0.710515
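A complementary view to the loadings is the share of variance captured by each component, available through the `explained_variance_ratio_` attribute of a fitted `PCA`. The sketch below uses synthetic data (two strongly correlated columns plus one independent column, all hypothetical) rather than the notebook's features.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: columns 0 and 1 share a common signal, column 2 is independent.
base = rng.normal(size=500)
data = np.column_stack([
    base + 0.1 * rng.normal(size=500),
    base + 0.1 * rng.normal(size=500),
    rng.normal(size=500),
])

# Standardize the columns, as done in the notebook before applying PCA.
scaled = (data - data.mean(axis=0)) / data.std(axis=0)

pca_toy = PCA(n_components=3).fit(scaled)
ratios = pca_toy.explained_variance_ratio_

# Per-component share of variance and its running total.
print(ratios.round(3), ratios.cumsum().round(3))
```

Because the first two columns are nearly redundant, most of the variance should concentrate in the first component and very little should remain in the last one.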


In [13]:
mi_scores = mutual_info_classif(X_pca, y)
mi_df = pd.DataFrame({'Feature': X_pca.columns, 'MI Score': mi_scores})
print(mi_df)

  Feature  MI Score
0     PC1  0.111954
1     PC2  0.073373
2     PC3  0.043316
3     PC4  0.052707
4     PC5  0.037834


## Feature Selection and Preprocessing
Now it's time to select the features that we'll use for our prediction. Since the analysis above suggests that smoking history isn't a particularly good feature to invest in for this case, we'll exclude it from the selected features. After that, we'll split our dataset, using 75% of it for training and the other 25% for testing our prediction model.

In [14]:
# Separates the features from the target.
X = df.drop(['diabetes', 'smoking_history'], axis=1)
y = df['diabetes']

CT = ColumnTransformer(
    transformers = [ 
        ('onehot', OneHotEncoder(sparse_output=False, categories='auto'), ['gender']),
    ],
    remainder='passthrough'
)

# Applies the OneHot onto categorical features.
X_encoded = CT.fit_transform(X[['gender']])
encoded_cols = CT.get_feature_names_out(['gender'])

X_encoded_df = pd.DataFrame(X_encoded, columns=encoded_cols, index=X.index)
X_features = pd.concat([X.drop(['gender'], axis=1), X_encoded_df], axis=1)


As we saw earlier in the EDA, the dataset we're using is imbalanced. The target `diabetes` has the value 0 much more frequently than 1, with 91.5% of the cases being 0 and 8.5% being 1. To ensure that the data is split in a way that keeps the original distribution, we stratify the split on the target values.

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X_features, y, stratify=y)
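The effect of `stratify` can be verified on a synthetic target with the same 8.5% positive rate. In this sketch, `target` and `features` are made up, and `test_size=0.25` and `random_state=0` are written out explicitly: 0.25 is the default test share when neither size is given, while fixing the seed is an addition that makes the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic target with 8.5% positives, mirroring the dataset's class ratio.
target = np.array([1] * 850 + [0] * 9150)
features = np.arange(len(target)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    features, target, test_size=0.25, stratify=target, random_state=0
)

# Both partitions should keep (almost exactly) the original positive rate.
print(len(y_tr), len(y_te))
print(round(y_tr.mean(), 4), round(y_te.mean(), 4))
```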

## Modeling
Here we're going to train a Gradient Boosting classifier with 400 decision trees, fitting it to the training data.

In [16]:
model_gboost_baseline = GradientBoostingClassifier(n_estimators = 400, random_state=0).fit(X_train, y_train)

It is very important to read the classification report carefully. Precision isn't the only thing that matters, even more so in this case, as we have an imbalanced dataset. The recall and f1-score tell us that the model may be producing more false negatives than it appears when looking only at the precision score.

In [17]:
y_pred = model_gboost_baseline.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.97      1.00      0.98     22875
           1       0.98      0.68      0.80      2125

    accuracy                           0.97     25000
   macro avg       0.97      0.84      0.89     25000
weighted avg       0.97      0.97      0.97     25000

[[22841    34]
 [  677  1448]]
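The class-1 row of the report can be recomputed by hand from the confusion matrix, which makes the role of false negatives explicit. This sketch re-enters the matrix printed above (rows are true classes, columns are predicted classes, as returned by `confusion_matrix`).

```python
import numpy as np

# Confusion matrix copied from the output above: [[TN, FP], [FN, TP]].
cm = np.array([[22841, 34],
               [677, 1448]])
tn, fp, fn, tp = cm.ravel()

# Precision: of the cases predicted as diabetic, how many really are.
precision = tp / (tp + fp)
# Recall: of the truly diabetic cases, how many the model found.
recall = tp / (tp + fn)
# F1: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), round(f1, 2))
```

The 34 false positives barely affect precision, while the 677 false negatives are what pulls recall down.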


In fact, our model correctly identifies only 68% of the diabetic cases (recall for class 1). We may trade some false negatives for false positives by changing the confidence our model needs in order to predict diabetes, using a probability threshold of 0.3 instead of 0.5 to predict `True`.

#### Adjusting Confidence


In [18]:
y_proba = model_gboost_baseline.predict_proba(X_test)[:, 1]
y_pred = (y_proba >= 0.3).astype(int)

print(classification_report(y_test, y_pred))
cm = confusion_matrix(y_test, y_pred)
print(cm)

              precision    recall  f1-score   support

           0       0.98      0.99      0.98     22875
           1       0.86      0.73      0.79      2125

    accuracy                           0.97     25000
   macro avg       0.92      0.86      0.88     25000
weighted avg       0.97      0.97      0.97     25000

[[22621   254]
 [  579  1546]]
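The same precision/recall trade-off can be explored over several thresholds at once. The sketch below uses synthetic scores (`y_toy` and `proba_toy` are made up, not the model's output) and a hypothetical helper, `precision_recall_at`, to show that lowering the threshold can only keep or raise recall, usually at the cost of precision.

```python
import numpy as np

def precision_recall_at(y_true, proba, threshold):
    """Hypothetical helper: precision and recall of class 1 at a given threshold."""
    pred = proba >= threshold
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return float(precision), float(recall)

rng = np.random.default_rng(0)

# Synthetic imbalanced labels and noisy scores that are higher for positives.
y_toy = (rng.random(5000) < 0.085).astype(int)
proba_toy = np.clip(0.15 + 0.45 * y_toy + rng.normal(0.0, 0.2, size=5000), 0.0, 1.0)

results = {t: precision_recall_at(y_toy, proba_toy, t) for t in (0.3, 0.4, 0.5)}
for t, (p, r) in results.items():
    print(f"threshold={t:.1f} precision={p:.2f} recall={r:.2f}")
```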


In this context, the trade-off is justified because false negatives in medical diagnoses usually represent a bigger problem than false positives. Another alternative is to assign a weight to each case, applying smaller weights to cases of the class that occurs more often.

#### Assign Weights
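After assigning weights to each class, with the weight for class 1 cases five times higher than the weight for class 0 cases, we got even fewer false negatives, almost half of the original amount (358 versus 677), in exchange for about 27 times as many false positives (925 versus 34). Note that this model also enables early stopping (`n_iter_no_change=10`), so it differs from the baseline in more than the weights.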


In [19]:
sample_weights = np.where(y_train == 1, 5, 1)
model_gboost_weighted = GradientBoostingClassifier(n_estimators = 400, random_state=0, n_iter_no_change=10, tol=1e-4).fit(X_train, y_train, sample_weight=sample_weights)

In [20]:
y_pred = model_gboost_weighted.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.98      0.96      0.97     22875
           1       0.66      0.83      0.73      2125

    accuracy                           0.95     25000
   macro avg       0.82      0.90      0.85     25000
weighted avg       0.96      0.95      0.95     25000

[[21950   925]
 [  358  1767]]
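The 5:1 ratio above is a manual choice. As a point of comparison, and not something used in this notebook, scikit-learn's `compute_sample_weight` with `class_weight="balanced"` derives weights inversely proportional to class frequencies. The sketch below uses a synthetic target (`y_toy`) with the same 8.5% positive rate.

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# Synthetic target with 8.5% positives.
y_toy = np.array([1] * 85 + [0] * 915)

# Manual weighting, as in the cell above: positives count five times as much.
manual = np.where(y_toy == 1, 5, 1)

# "balanced" weighting: n_samples / (n_classes * class_count) for each sample.
balanced = compute_sample_weight(class_weight="balanced", y=y_toy)
ratio = balanced[y_toy == 1][0] / balanced[y_toy == 0][0]

print(manual.sum(), round(float(ratio), 2))
```

With this class ratio, the balanced scheme weights positives roughly twice as heavily as the manual 5:1 choice, so it would likely push the model further towards recall.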


#### Undersampling
We may employ undersampling to reduce the model's bias towards predicting cases as non-diabetic (`False`).
In other words, by reducing how many non-diabetic samples the model sees during training, we decrease the model's tendency to predict values near 0.

First, we repeat the process of splitting the dataset and applying our column transformer.

In [21]:
# Keeps all diabetic cases and drops a random half of the non-diabetic ones.
df_with_diabetes = df[df['diabetes'] == 1]
df_without_diabetes = df[~(df['diabetes'] == 1)]
drop_sample_indexes = df_without_diabetes.sample(frac=1/2, random_state=42).index
df_u = pd.concat([df_with_diabetes, df_without_diabetes.drop(drop_sample_indexes)], axis=0)

X_u = df_u.drop(['diabetes', 'smoking_history'], axis=1)
y_u = df_u['diabetes']

# Applies the OneHot onto categorical features.
X_encoded_u = CT.fit_transform(X_u[['gender']])
X_encoded_df_u = pd.DataFrame(X_encoded_u, columns=encoded_cols, index=X_u.index)
X_features_u = pd.concat([X_u.drop(['gender'], axis=1), X_encoded_df_u], axis=1)

In [22]:
X_train, X_test, y_train, y_test = train_test_split(X_features_u, y_u, stratify=y_u)

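Then we train our model again. This time, the non-diabetic samples were reduced by half. Note that the undersampling was applied before the split, so the test set is also undersampled and its metrics are not directly comparable to the previous runs.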

In [23]:
model_gboost_undersampling = GradientBoostingClassifier(n_estimators = 400, random_state=0).fit(X_train, y_train)

In [24]:
y_pred = model_gboost_undersampling.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.95      0.99      0.97     11438
           1       0.93      0.74      0.82      2125

    accuracy                           0.95     13563
   macro avg       0.94      0.86      0.90     13563
weighted avg       0.95      0.95      0.95     13563

[[11322   116]
 [  563  1562]]


In [25]:
y_proba = model_gboost_undersampling.predict_proba(X_test)[:, 1]
y_pred = (y_proba >= 0.3).astype(int)

print(classification_report(y_test, y_pred))
cm = confusion_matrix(y_test, y_pred)
print(cm)

              precision    recall  f1-score   support

           0       0.97      0.96      0.96     11438
           1       0.79      0.83      0.81      2125

    accuracy                           0.94     13563
   macro avg       0.88      0.90      0.89     13563
weighted avg       0.94      0.94      0.94     13563

[[10966   472]
 [  356  1769]]


## Modeling (Neural Network Alternative)
In our first attempt, we used a tree-based approach: `GradientBoostingClassifier`, which relies on decision trees. As an alternative to tree-based models, deep neural networks (DNN) can be employed. In this case, we're going to use the TensorFlow library, so we must first import it.

In [26]:
import tensorflow as tf
from tensorflow import keras
from keras import layers


To ensure we're working with the complete, non-undersampled data, we perform a fresh train-test split.

In [27]:
X_train, X_test, y_train, y_test = train_test_split(X_features, y, stratify=y)

Since our problem falls within the classification domain, we must pay attention to several metrics beyond accuracy, such as false negatives, false positives, and true positives, which are much more relevant to a health-related application such as a diagnostic model.

Therefore, we need to define which metrics are the most relevant for this application and keep track of them as the model learns.

In [28]:
relevant_classification_metrics = [
      keras.metrics.TruePositives(name='tp'),
      keras.metrics.FalsePositives(name='fp'),
      keras.metrics.TrueNegatives(name='tn'),
      keras.metrics.FalseNegatives(name='fn'), 
      keras.metrics.BinaryAccuracy(name='accuracy'),
      keras.metrics.Precision(name='precision'),
      keras.metrics.Recall(name='recall'),
      keras.metrics.AUC(name='auc'),
      keras.metrics.AUC(name='prc', curve='PR'), # precision-recall curve
]



Then we define the function that specifies the model's architecture, allowing us to instantiate multiple models to test alternative configurations (as we did with tree-based models).

In [29]:
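def build_model():
    # Note: every hidden Dense layer uses a linear activation, so the only
    # non-linear activation in this network is the final sigmoid.
    return tf.keras.Sequential([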
        layers.BatchNormalization(),
        layers.Dense(units=X_train.shape[-1], activation='linear'),
        layers.BatchNormalization(),
        layers.Dropout(0.3),
        layers.Dense(units=256, activation='linear'),
        layers.Dropout(0.3),
        layers.BatchNormalization(),
        layers.Dense(units=64, activation='linear'),
        layers.Dense(units=1, activation='sigmoid'),
    ])

We must also define some important parameters, such as the batch size (`batch_size`) and the maximum number of epochs (`epochs_max`) for training. We can set a high number of epochs, since we will use early stopping to prevent overfitting and avoid spending unnecessary time on training.

In [30]:
epochs_max = 100
batch_size = 32

early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_prc', 
    verbose=1,
    patience=5,
    mode='max',
    restore_best_weights=True)
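The patience logic configured above can be illustrated without TensorFlow. The function below is a simplified sketch of the idea in `mode='max'`, not the Keras implementation; `simulate_early_stopping` and the metric values are hypothetical.

```python
def simulate_early_stopping(values, patience):
    """Return (stop_epoch, best_epoch), both 0-indexed, for a metric to maximize."""
    best_value = float("-inf")
    best_epoch = -1
    wait = 0
    for epoch, value in enumerate(values):
        if value > best_value:
            # Improvement: remember it and reset the patience counter.
            best_value, best_epoch, wait = value, epoch, 0
        else:
            # No improvement: stop once patience is exhausted.
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Training ran through every epoch without triggering the stop.
    return len(values) - 1, best_epoch

# Made-up validation scores: the best value is at epoch 1, then no improvement.
val_scores = [0.80, 0.81, 0.809, 0.808, 0.8095, 0.807, 0.806, 0.82]
stop_epoch, best_epoch = simulate_early_stopping(val_scores, patience=5)
print(stop_epoch, best_epoch)
```

With `restore_best_weights=True`, the weights kept at the end would be those from the best epoch rather than from the epoch at which training stopped. In this toy run, the later improvement at the last position is never reached, which is the price of a finite patience.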


Finally, we can call our `build_model()` and compile the DNN model. Fitting a TensorFlow model returns a history of the training, which we can keep in order to plot the model's training data (such as the loss at each epoch).

### Baseline DNN Model

In [31]:
model_dnn_baseline = build_model()

model_dnn_baseline.compile(
      optimizer=keras.optimizers.Adam(learning_rate=1e-3),
      loss=keras.losses.BinaryCrossentropy(),
      metrics=relevant_classification_metrics
)

history_baseline = model_dnn_baseline.fit(
    X_train, y_train,
    batch_size=batch_size,
    epochs=epochs_max,
    callbacks=[early_stopping],
    validation_data=(X_test, y_test)
)

Epoch 1/100
2344/2344 ━━━━━━━━━━━━━━━━━━━━ 30s 11ms/step - accuracy: 0.9015 - auc: 0.8554 - fn: 1749.8175 - fp: 1244.3569 - loss: 0.2525 - prc: 0.4796 - precision: 0.4892 - recall: 0.4524 - tn: 33070.8594 - tp: 1470.9454 - val_accuracy: 0.9571 - val_auc: 0.9576 - val_fn: 1032.0000 - val_fp: 41.0000 - val_loss: 0.1248 - val_prc: 0.8035 - val_precision: 0.9638 - val_recall: 0.5144 - val_tn: 22834.0000 - val_tp: 1093.0000
Epoch 2/100
2344/2344 ━━━━━━━━━━━━━━━━━━━━ 17s 7ms/step - accuracy: 0.9463 - auc: 0.9412 - fn: 1491.8989 - fp: 506.1467 - loss: 0.1452 - prc: 0.7133 - precision: 0.7673 - recall: 0.5293 - tn: 33847.1055 - tp: 1690.8273 - val_accuracy: 0.9597 - val_auc: 0.9593 - val_fn: 882.0000 - val_fp: 126.0000 - val_loss: 0.1168 - val_prc: 0.8105 - val_precision: 0.9080 - val_recall: 0.5849 - val_tn: 22749.0000 - val_tp: 1243.0000
Epoch 3/100
2344/2344 ━━━━━━━━━━━━━━━━━━━━ 18s 8ms/step - accuracy

#### Evaluation
Now we make predictions using the DNN-based model to evaluate its performance on the validation data. Comparing the results, even when changing the confidence threshold, we observed a slightly worse performance from the DNN-based approach.

In [32]:
y_pred_probs = model_dnn_baseline.predict(X_test)
y_pred = (y_pred_probs > 0.5).astype("int32")

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

782/782 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step
              precision    recall  f1-score   support

           0       0.96      0.99      0.98     22875
           1       0.90      0.59      0.71      2125

    accuracy                           0.96     25000
   macro avg       0.93      0.79      0.85     25000
weighted avg       0.96      0.96      0.96     25000

[[22735   140]
 [  872  1253]]


In [33]:
y_pred_probs = model_dnn_baseline.predict(X_test)
y_pred = (y_pred_probs > 0.3).astype("int32")

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

782/782 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step
              precision    recall  f1-score   support

           0       0.97      0.98      0.98     22875
           1       0.77      0.69      0.73      2125

    accuracy                           0.96     25000
   macro avg       0.87      0.83      0.85     25000
weighted avg       0.95      0.96      0.95     25000

[[22440   435]
 [  668  1457]]



If we had a larger dataset, the DNN approach might outperform the tree-based model, given its ability to learn more complex patterns. Since DNNs tend to benefit from more data, we will not repeat the undersampling test previously applied to the tree-based approach, as it would shrink the training set further. However, we will test an alternative version of the model trained with different class weights, as we did before.

### Alternative DNN Model (Using class weights)

In [34]:
model_dnn_weighted = build_model()

model_dnn_weighted.compile(
      optimizer=keras.optimizers.Adam(learning_rate=1e-3),
      loss=keras.losses.BinaryCrossentropy(),
      metrics=relevant_classification_metrics
)

history_weighted = model_dnn_weighted.fit(
    X_train, y_train,
    batch_size=batch_size,
    epochs=epochs_max,
    callbacks=[early_stopping],
    validation_data=(X_test, y_test),
    class_weight={0: 1.0, 1: 5.0} # Applies a 5x higher weight to diabetic samples.
)

Epoch 1/100
2344/2344 ━━━━━━━━━━━━━━━━━━━━ 26s 9ms/step - accuracy: 0.9154 - auc: 0.9250 - fn: 1656.4418 - fp: 3802.2798 - loss: 0.5098 - prc: 0.6197 - precision: 0.5054 - recall: 0.6768 - tn: 53449.6836 - tp: 3627.5732 - val_accuracy: 0.9222 - val_auc: 0.9590 - val_fn: 391.0000 - val_fp: 1555.0000 - val_loss: 0.1925 - val_prc: 0.8029 - val_precision: 0.5272 - val_recall: 0.8160 - val_tn: 21320.0000 - val_tp: 1734.0000
Epoch 2/100
2344/2344 ━━━━━━━━━━━━━━━━━━━━ 18s 8ms/step - accuracy: 0.9115 - auc: 0.9448 - fn: 723.2631 - fp: 2637.4797 - loss: 0.3701 - prc: 0.7132 - precision: 0.4874 - recall: 0.7776 - tn: 31703.1582 - tp: 2472.0784 - val_accuracy: 0.9455 - val_auc: 0.9602 - val_fn: 555.0000 - val_fp: 808.0000 - val_loss: 0.1636 - val_prc: 0.8113 - val_precision: 0.6602 - val_recall: 0.7388 - val_tn: 22067.0000 - val_tp: 1570.0000
Epoch 3/100
2344/2344 ━━━━━━━━━━━━━━━━━━━━ 18s 8ms/ste
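To make the effect of `class_weight={0: 1.0, 1: 5.0}` more concrete, the sketch below computes a class-weighted binary cross-entropy in NumPy. It assumes the weighted per-sample losses are simply averaged, which is a simplification of what Keras does internally; `weighted_bce` is a hypothetical helper, not a library function.

```python
import numpy as np

def weighted_bce(y_true, p_pred, class_weight):
    """Hypothetical helper: mean binary cross-entropy with per-class weights."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities to avoid log(0).
    p_pred = np.clip(np.asarray(p_pred, dtype=float), 1e-7, 1 - 1e-7)
    per_sample = -(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))
    # Each sample's loss is scaled by the weight of its true class.
    weights = np.where(y_true == 1, class_weight[1], class_weight[0])
    return float(np.mean(weights * per_sample))

weights_5x = {0: 1.0, 1: 5.0}

# An equally confident mistake on each class: a positive scored 0.2, a negative scored 0.8.
missed_positive = weighted_bce([1], [0.2], weights_5x)
missed_negative = weighted_bce([0], [0.8], weights_5x)
print(missed_positive / missed_negative)
```

Under this weighting, a missed diabetic case costs five times as much as an equally confident false alarm, which nudges the model towards higher recall on class 1.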

#### Evaluation
As we saw before with the tree-based models, the class weights did reduce the false negatives, with the trade-off of increasing the false positives.

In [35]:
y_pred_probs = model_dnn_weighted.predict(X_test)
y_pred = (y_pred_probs > 0.5).astype("int32")

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

782/782 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step
              precision    recall  f1-score   support

           0       0.98      0.94      0.96     22875
           1       0.56      0.80      0.66      2125

    accuracy                           0.93     25000
   macro avg       0.77      0.87      0.81     25000
weighted avg       0.94      0.93      0.93     25000

[[21532  1343]
 [  432  1693]]


In [36]:
y_pred_probs = model_dnn_weighted.predict(X_test)
y_pred = (y_pred_probs > 0.3).astype("int32")

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

782/782 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step
              precision    recall  f1-score   support

           0       0.99      0.86      0.92     22875
           1       0.37      0.90      0.53      2125

    accuracy                           0.86     25000
   macro avg       0.68      0.88      0.72     25000
weighted avg       0.94      0.86      0.89     25000

[[19652  3223]
 [  219  1906]]


### Models... Ensemble!
Now that we have multiple models, we are going to create an ensemble using all of them (both tree-based and DNN-based). To prevent data leakage, we should first update our training code to ensure that all the models are trained using the same train-test split.