## Setting up to Work

The first step is to import the libraries and dependencies.

In [1]:
import pandas as pd
from sklearn.feature_selection import mutual_info_classif
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer

Load the dataset after downloading it from Kaggle: https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset.

In [2]:
df = pd.read_csv("diabetes_prediction_dataset.csv")


## Exploratory Data Analysis

The first step is to understand the data. To do so, we start by getting some general information about the dataset with `head()`, `info()` and `describe()`.

In [3]:
df.head()

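|   | gender | age | hypertension | heart_disease | smoking_history | bmi | HbA1c_level | blood_glucose_level | diabetes |
|---|--------|-----|--------------|---------------|-----------------|-----|-------------|---------------------|----------|
| 0 | Female | 80.0 | 0 | 1 | never | 25.19 | 6.6 | 140 | 0 |
| 1 | Female | 54.0 | 0 | 0 | No Info | 27.32 | 6.6 | 80 | 0 |
| 2 | Male | 28.0 | 0 | 0 | never | 27.32 | 5.7 | 158 | 0 |
| 3 | Female | 36.0 | 0 | 0 | current | 23.45 | 5.0 | 155 | 0 |
| 4 | Male | 76.0 | 1 | 1 | current | 20.14 | 4.8 | 155 | 0 |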


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 9 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   gender               100000 non-null  object 
 1   age                  100000 non-null  float64
 2   hypertension         100000 non-null  int64  
 3   heart_disease        100000 non-null  int64  
 4   smoking_history      100000 non-null  object 
 5   bmi                  100000 non-null  float64
 6   HbA1c_level          100000 non-null  float64
 7   blood_glucose_level  100000 non-null  int64  
 8   diabetes             100000 non-null  int64  
dtypes: float64(3), int64(4), object(2)
memory usage: 6.9+ MB


In [5]:
df.describe()

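|       | age | hypertension | heart_disease | bmi | HbA1c_level | blood_glucose_level | diabetes |
|-------|-----|--------------|---------------|-----|-------------|---------------------|----------|
| count | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 |
| mean | 41.885856 | 0.07485 | 0.03942 | 27.320767 | 5.527507 | 138.05806 | 0.085 |
| std | 22.51684 | 0.26315 | 0.194593 | 6.636783 | 1.070672 | 40.708136 | 0.278883 |
| min | 0.08 | 0.0 | 0.0 | 10.01 | 3.5 | 80.0 | 0.0 |
| 25% | 24.0 | 0.0 | 0.0 | 23.63 | 4.8 | 100.0 | 0.0 |
| 50% | 43.0 | 0.0 | 0.0 | 27.32 | 5.8 | 140.0 | 0.0 |
| 75% | 60.0 | 0.0 | 0.0 | 29.58 | 6.2 | 159.0 | 0.0 |
| max | 80.0 | 1.0 | 1.0 | 95.69 | 9.0 | 300.0 | 1.0 |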


So, after a brief analysis we may conclude:
- There is plenty of data available: 100,000 cases.
- The data doesn't contain any explicit missing values such as NAs, but `smoking_history` has a `No Info` class.
- There are two categorical variables, `gender` and `smoking_history`; both will have to be transformed to numerical values.
- `diabetes`, the target, is a binary value with a mean of 0.085, meaning that only 8.5% of the cases actually have diabetes, which implies an imbalanced dataset.
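
The checks behind these conclusions can be sketched as follows. This is a minimal example on a small made-up frame: `toy` and its values are assumptions for illustration, not rows from the dataset, and the same calls would apply to `df`.

```python
import pandas as pd

# Toy stand-in for `df` (values are made up for illustration).
toy = pd.DataFrame({
    "smoking_history": ["never", "No Info", "current", "No Info"],
    "diabetes": [0, 0, 1, 0],
})

# Count true missing values (NaN) per column. `No Info` is a regular string,
# so it is not counted as missing here.
missing_per_column = toy.isna().sum()

# Share of each target class; for a 0/1 target the share of 1s equals the mean.
class_share = toy["diabetes"].value_counts(normalize=True)

# Count the placeholder category explicitly.
no_info_count = (toy["smoking_history"] == "No Info").sum()

print(missing_per_column)
print(class_share)
print(no_info_count)
```

Because `No Info` is stored as an ordinary string, `isna()` does not flag it, which is why it has to be counted separately.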

Next, we are going to analyse the `smoking_history` feature, as it appears to be problematic: it is a categorical feature with missing values.

The first step is to check the possible values this feature can take and their respective frequencies. As shown in the next cell, this feature has a couple of issues:
- `No Info` appears in 35,816 cases, meaning that more than one-third of the cases have an unspecified value for this feature (missing data).
- There are ambiguous and overlapping categories. For example, `not current` could mean the same as `former` or `never`, and the criteria that distinguish `ever` from `current` are poorly defined.

Considering the high number of unknown values and the ambiguity in the class definitions, further evaluation is needed to assess this feature's impact on the target prediction. This will help justify the effort of keeping this feature, or determine whether it should be discarded.

In [6]:
df['smoking_history'].value_counts()

smoking_history
No Info        35816
never          35095
former          9352
current         9286
not current     6447
ever            4004
Name: count, dtype: int64
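
To put these counts in proportion, the sketch below converts them to shares of the total. The numbers are copied from the output above; the names `counts` and `shares` are only illustrative.

```python
import pandas as pd

# Counts copied from the `value_counts()` output above.
counts = pd.Series({
    "No Info": 35816,
    "never": 35095,
    "former": 9352,
    "current": 9286,
    "not current": 6447,
    "ever": 4004,
})

# Share of each category relative to the total number of cases.
shares = counts / counts.sum()
print(shares.round(4))
```

On the real data, `df['smoking_history'].value_counts(normalize=True)` should give the same shares directly.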

To evaluate further, we'll define a `ColumnTransformer` and transform the categorical values through One-Hot Encoding. With the categorical data transformed, we can then use feature metrics, such as the Mutual Information score and correlation, to understand each feature's relevance to the target.

In [13]:
# Separates the features from the target.
X = df.drop(['diabetes'], axis=1)
y = df['diabetes']

CT = ColumnTransformer(
    transformers = [ 
        ('onehot', OneHotEncoder(sparse_output=False, categories='auto'), ['gender', 'smoking_history']),
        #('ordinal', OrdinalEncoder(categories=[['never','No Info', 'not current', 'former', 'current', 'ever']]), ['smoking_history'])
    ],
    remainder='passthrough'
)

In [16]:
# Applies one-hot encoding to the categorical features.
X_encoded = CT.fit_transform(X[['gender', 'smoking_history']])
encoded_cols = CT.get_feature_names_out(['gender', 'smoking_history'])

X_encoded_df = pd.DataFrame(X_encoded, columns=encoded_cols, index=X.index)
X_features = pd.concat([X.drop(['gender', 'smoking_history'], axis=1), X_encoded_df], axis=1)

### Correlation
Analyzing the correlation between the features and the target variable allows us to see which features are most strongly linearly related to the target.

In [17]:
df_trans = pd.concat([X_features, y], axis=1)
df_trans.corr()['diabetes'].sort_values(ascending=False)


diabetes                               1.000000
blood_glucose_level                    0.419558
HbA1c_level                            0.400660
age                                    0.258008
bmi                                    0.214357
hypertension                           0.197823
heart_disease                          0.171727
onehot__smoking_history_former         0.097917
onehot__gender_Male                    0.037666
onehot__smoking_history_never          0.027267
onehot__smoking_history_ever           0.024080
onehot__smoking_history_not current    0.020734
onehot__smoking_history_current        0.019606
onehot__gender_Other                  -0.004090
onehot__gender_Female                 -0.037553
onehot__smoking_history_No Info       -0.118939
Name: diabetes, dtype: float64
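
By default, `corr()` computes the Pearson coefficient. As a rough illustration of what that number measures, the sketch below computes it by hand for a numeric feature and a 0/1 target and compares it with pandas. The series `x` and `y` are made-up values, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up numeric feature and binary target, for illustration only.
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([0, 0, 1, 0, 1])

# Pearson r: covariance of the centered values divided by the product of their spreads.
x_centered = x - x.mean()
y_centered = y - y.mean()
manual_r = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered ** 2).sum() * (y_centered ** 2).sum()
)

# The same quantity as computed by pandas (Pearson is the default method).
pandas_r = x.corr(y)

print(manual_r, pandas_r)
```

Since the coefficient only reflects linear association, a feature with a low value here may still carry information about the target.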

### Mutual Information
Mutual Information can capture many types of relationships between each variable and the target; it is not limited to linear associations.

In [18]:
# Computes the MI scores
mi_scores = mutual_info_classif(X_features, y)

# Displays the results
mi_df = pd.DataFrame({'Feature': X_features.columns, 'MI Score': mi_scores})
print(mi_df.sort_values(by='MI Score', ascending=False))


                                Feature  MI Score
4                           HbA1c_level  0.131486
5                   blood_glucose_level  0.112953
0                                   age  0.040393
3                                   bmi  0.025215
9       onehot__smoking_history_No Info  0.016837
6                 onehot__gender_Female  0.015522
1                          hypertension  0.013440
2                         heart_disease  0.009916
7                   onehot__gender_Male  0.009840
13        onehot__smoking_history_never  0.007329
12       onehot__smoking_history_former  0.004160
10      onehot__smoking_history_current  0.000882
14  onehot__smoking_history_not current  0.000441
8                  onehot__gender_Other  0.000000
11         onehot__smoking_history_ever  0.000000
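
To illustrate why Mutual Information complements correlation, the sketch below builds a synthetic case where the target depends on the feature in a non-linear way. The data is entirely made up (an evenly spaced `x` and a target that depends only on `|x|`), and `random_state=0` is an arbitrary choice to make the estimate repeatable.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Synthetic feature, symmetric around zero (made up for illustration).
x = np.linspace(-1.0, 1.0, 1001)

# The target depends on |x| only, so the relationship is strong but not linear.
y = (np.abs(x) > 0.5).astype(int)

# Pearson correlation between the feature and the target.
pearson_r = np.corrcoef(x, y)[0, 1]

# MI estimate for the single feature; the estimator expects a 2-D feature matrix.
mi = mutual_info_classif(x.reshape(-1, 1), y, random_state=0)[0]

print(f"Pearson r: {pearson_r:.4f}")
print(f"MI score:  {mi:.4f}")
```

In this construction the correlation is essentially zero while the MI score is clearly positive. Passing `random_state` also matters for the notebook cell above: `mutual_info_classif` adds small random noise to continuous features, so scores can vary slightly between runs when it is not set.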
