# Intro

Here's the [official source](https://archive.ics.uci.edu/dataset/296/diabetes+130-us+hospitals+for+years+1999-2008) of the dataset. It requires some preprocessing, so in this notebook we show what we did and why.

# Some Notebook Options

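Here we activate a couple of options that make this notebook easier to use: the first displays the result of every expression in a cell (not only the last one), and the second makes pandas show all the columns of a dataframe. Curious about the details? You can ask ChatGPT :P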

In [None]:
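import pandas as pd  # imported here too, because pd.set_option is called before the Imports cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"  # display every expression's result, not just the last one
pd.set_option('display.max_columns', None)  # never truncate the column list when displaying a dataframe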

# Imports

In [None]:
import pandas as pd
from ucimlrepo import fetch_ucirepo 

# Read the data

In [None]:
# run this once, before the Imports cell, if ucimlrepo is not installed yet
!pip install ucimlrepo

In [17]:
# fetch dataset 
diabetes_130_us_hospitals_for_years_1999_2008 = fetch_ucirepo(id=296) 
  
# data (as pandas dataframes) 
X = diabetes_130_us_hospitals_for_years_1999_2008.data.features 
y = diabetes_130_us_hospitals_for_years_1999_2008.data.targets 
ids = diabetes_130_us_hospitals_for_years_1999_2008.data.ids

In [21]:
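# join features, target and ids side by side, then index the rows by encounter
dataset = pd.concat([X, y, ids], axis=1).set_index('encounter_id')
# move patient_nbr to the first column
dataset = dataset[['patient_nbr'] + [col for col in dataset.columns if col != 'patient_nbr']]
dataset.head(3)
dataset.shape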

encounter_id,patient_nbr,race,gender,age,weight,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,payer_code,medical_specialty,num_lab_procedures,num_procedures,num_medications,number_outpatient,number_emergency,number_inpatient,diag_1,diag_2,diag_3,number_diagnoses,max_glu_serum,A1Cresult,metformin,repaglinide,nateglinide,chlorpropamide,glimepiride,acetohexamide,glipizide,glyburide,tolbutamide,pioglitazone,rosiglitazone,acarbose,miglitol,troglitazone,tolazamide,examide,citoglipton,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed,readmitted
2278392,8222157,Caucasian,Female,[0-10),,6,25,1,1,,Pediatrics-Endocrinology,41,0,1,0,0,0,250.83,,,1,,,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,NO
149190,55629189,Caucasian,Female,[10-20),,1,1,7,3,,,59,0,18,0,0,0,276.0,250.01,255,9,,,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,Up,No,No,No,No,No,Ch,Yes,>30
64410,86047875,AfricanAmerican,Female,[20-30),,1,1,7,2,,,11,5,13,2,0,1,648.0,250.0,V27,6,,,No,No,No,No,No,No,Steady,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,Yes,NO


(101766, 49)

So we have an index with the id of each encounter, followed by the patient number, a set of features and, in the last column, the target we want to predict (`readmitted`).
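To make the reshaping step above concrete, here is a minimal sketch on a tiny made-up dataframe. The values and the `toy_*` names are invented for illustration; only the `pd.concat` / `set_index` / column reordering pattern mirrors what we did on the real data.

```python
import pandas as pd

# Toy stand-ins for X, y and ids (made-up values, for illustration only)
toy_X = pd.DataFrame({"race": ["Caucasian", "AfricanAmerican"], "age": ["[0-10)", "[20-30)"]})
toy_y = pd.DataFrame({"readmitted": ["NO", ">30"]})
toy_ids = pd.DataFrame({"encounter_id": [101, 102], "patient_nbr": [1, 2]})

# Glue the three pieces side by side and use the encounter id as the index
toy = pd.concat([toy_X, toy_y, toy_ids], axis=1).set_index("encounter_id")

# Move patient_nbr to the front, keeping the other columns in their original order
toy = toy[["patient_nbr"] + [col for col in toy.columns if col != "patient_nbr"]]

print(toy.columns.tolist())
```

Because `encounter_id` becomes the index, it no longer counts as a column, which is why the real dataset reports 49 columns.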

# Preprocessing

In [34]:
def preprocess_dataset(df):
    return df

The function above is where our preprocessing will live. I left it empty on purpose so that we can build it together, one step at a time.

# 1. Dealing with NaNs

First question: do we have NaNs (values that aren't filled in)?

In [24]:
dataset.isna().head(3)

encounter_id,patient_nbr,race,gender,age,weight,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,payer_code,medical_specialty,num_lab_procedures,num_procedures,num_medications,number_outpatient,number_emergency,number_inpatient,diag_1,diag_2,diag_3,number_diagnoses,max_glu_serum,A1Cresult,metformin,repaglinide,nateglinide,chlorpropamide,glimepiride,acetohexamide,glipizide,glyburide,tolbutamide,pioglitazone,rosiglitazone,acarbose,miglitol,troglitazone,tolazamide,examide,citoglipton,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed,readmitted
2278392,False,False,False,False,True,False,False,False,False,True,False,False,False,False,False,False,False,False,True,True,False,True,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
149190,False,False,False,False,True,False,False,False,False,True,True,False,False,False,False,False,False,False,False,False,False,True,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
64410,False,False,False,False,True,False,False,False,False,True,True,False,False,False,False,False,False,False,False,False,False,True,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False


`.isna()` asks each value in the dataframe whether it is a NaN and returns either `True` or `False`.
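If this is new to you, a small sketch on a made-up dataframe may help (the `toy` values are invented for illustration). It shows that the mask has the same shape as the dataframe, and that summing it counts the NaNs per column because `True` counts as 1.

```python
import numpy as np
import pandas as pd

# Made-up values, for illustration only
toy = pd.DataFrame({
    "weight": [np.nan, "[75-100)", np.nan],
    "gender": ["Female", "Male", "Female"],
})

mask = toy.isna()        # same shape as toy, True where a value is missing
nan_counts = mask.sum()  # True counts as 1, so the sum is the number of NaNs per column

print(mask)
print(nan_counts)
```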

In [27]:
dataset.isna().sum(axis='rows')

patient_nbr                     0
race                         2273
gender                          0
age                             0
weight                      98569
admission_type_id               0
discharge_disposition_id        0
admission_source_id             0
time_in_hospital                0
payer_code                  40256
medical_specialty           49949
num_lab_procedures              0
num_procedures                  0
num_medications                 0
number_outpatient               0
number_emergency                0
number_inpatient                0
diag_1                         21
diag_2                        358
diag_3                       1423
number_diagnoses                0
max_glu_serum               96420
A1Cresult                   84748
metformin                       0
repaglinide                     0
nateglinide                     0
chlorpropamide                  0
glimepiride                     0
acetohexamide                   0
glipizide     

A few columns seem to have a lot of NaNs. Let's list them automatically, as a fraction of the rows:

In [31]:
(dataset.isna().sum(axis='rows')
 .div(len(dataset)) # divide by the number of datapoints to get frequency
 .where(lambda x: x > 0).dropna() # select only the columns with at least 1 nan
 .round(2)
 .sort_values(ascending=False)
)

weight               0.97
max_glu_serum        0.95
A1Cresult            0.83
medical_specialty    0.49
payer_code           0.40
race                 0.02
diag_3               0.01
diag_1               0.00
diag_2               0.00
dtype: float64
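As a side note, the fraction of missing values per column can also be computed with `.isna().mean()`, since the mean of a boolean column is the share of `True` values. Here is a minimal sketch on made-up data (the `toy` dataframe is hypothetical):

```python
import numpy as np
import pandas as pd

# Made-up values, for illustration only
toy = pd.DataFrame({
    "weight": [np.nan, np.nan, np.nan, "[75-100)"],
    "race": ["Caucasian", np.nan, "Asian", "Hispanic"],
    "gender": ["Female", "Male", "Female", "Male"],
})

# Long form, as in the cell above: count NaNs, then divide by the number of rows
long_form = toy.isna().sum(axis="rows").div(len(toy))

# Short form: the mean of a boolean column is the fraction of True values
short_form = toy.isna().mean()

# Keep only the columns with at least one NaN, largest fraction first
nan_fraction = short_form[short_form > 0].sort_values(ascending=False)
print(nan_fraction)
```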

We will drop `weight`, `max_glu_serum` and `payer_code`, which are missing in 97%, 95% and 40% of the rows respectively.

In [72]:
dataset['medical_specialty'].value_counts()

medical_specialty
InternalMedicine                    14635
Emergency/Trauma                     7565
Family/GeneralPractice               7440
Cardiology                           5352
Surgery-General                      3099
                                    ...  
Perinatology                            1
Neurophysiology                         1
Psychiatry-Addictive                    1
Pediatrics-InfectiousDiseases           1
Surgery-PlasticwithinHeadandNeck        1
Name: count, Length: 72, dtype: int64

`medical_specialty` has 72 distinct values, several of which appear only once. Overall, only a few columns have at least one NaN. For example, 97% of the datapoints have no weight information.

In [40]:
columns_with_nans = (dataset.isna().sum(axis='rows')
                     .div(len(dataset))
                     .where(lambda x: x > 0).dropna() # everything is the same as above until this point
                     .index # but we care about the column names that have nans (which are in the index of this series)
                    )

dataset[columns_with_nans].dtypes

race                 object
weight               object
payer_code           object
medical_specialty    object
diag_1               object
diag_2               object
diag_3               object
max_glu_serum        object
A1Cresult            object
dtype: object

So these columns all seem to hold categories (strings). Let's take a look at a couple of them:

In [41]:
dataset['weight'].value_counts()

weight
[75-100)     1336
[50-75)       897
[100-125)     625
[125-150)     145
[25-50)        97
[0-25)         48
[150-175)      35
[175-200)      11
>200            3
Name: count, dtype: int64

So when the weight is filled in, which happens very rarely, it is not a number but a string representing a weight range.

In [42]:
dataset['A1Cresult'].value_counts()

A1Cresult
>8      8216
Norm    4990
>7      3812
Name: count, dtype: int64

Something similar happens with `A1Cresult`: we were expecting numerical values, but they are already encoded into 3 categories.

This is pretty useful because it gives us an opening to fill in all NaNs with a string that simply acts as one more category. Let's call it `Unkown`.

In [76]:
def preprocess_dataset(df):
    df = df.drop(columns=['weight', 'max_glu_serum', 'payer_code'])
    df = df.fillna(value='Unkown')
    return df

Let's test it out (without overwriting our original dataframe)

In [45]:
preprocess_dataset(dataset).head(3)

encounter_id,patient_nbr,race,gender,age,weight,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,payer_code,medical_specialty,num_lab_procedures,num_procedures,num_medications,number_outpatient,number_emergency,number_inpatient,diag_1,diag_2,diag_3,number_diagnoses,max_glu_serum,A1Cresult,metformin,repaglinide,nateglinide,chlorpropamide,glimepiride,acetohexamide,glipizide,glyburide,tolbutamide,pioglitazone,rosiglitazone,acarbose,miglitol,troglitazone,tolazamide,examide,citoglipton,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed,readmitted
2278392,8222157,Caucasian,Female,[0-10),Unkown,6,25,1,1,Unkown,Pediatrics-Endocrinology,41,0,1,0,0,0,250.83,Unkown,Unkown,1,Unkown,Unkown,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,NO
149190,55629189,Caucasian,Female,[10-20),Unkown,1,1,7,3,Unkown,Unkown,59,0,18,0,0,0,276.0,250.01,255,9,Unkown,Unkown,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,Up,No,No,No,No,No,Ch,Yes,>30
64410,86047875,AfricanAmerican,Female,[20-30),Unkown,1,1,7,2,Unkown,Unkown,11,5,13,2,0,1,648.0,250,V27,6,Unkown,Unkown,No,No,No,No,No,No,Steady,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,Yes,NO


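We can already see that the `NaN` values were replaced with the string `Unkown`, for example in `medical_specialty` and `A1Cresult`. Awesome! Note that this output still shows `weight`, `payer_code` and `max_glu_serum`, so it appears to have been captured before the `drop` step was added to the function; rerunning the cell with the function as defined above would leave those three columns out.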
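One way to sanity-check a preprocessing step like this is to run it on a tiny made-up dataframe and assert that no missing values remain. The sketch below redefines the function so that it runs on its own; the toy values are invented for illustration.

```python
import numpy as np
import pandas as pd

def preprocess_dataset(df):
    # Same two steps as in the notebook: drop three columns, then label the remaining NaNs
    df = df.drop(columns=['weight', 'max_glu_serum', 'payer_code'])
    df = df.fillna(value='Unkown')
    return df

# Made-up values, for illustration only
toy = pd.DataFrame({
    'race': ['Caucasian', np.nan],
    'weight': [np.nan, '[75-100)'],
    'max_glu_serum': [np.nan, np.nan],
    'payer_code': [np.nan, 'MC'],
    'A1Cresult': [np.nan, '>8'],
    'time_in_hospital': [1, 3],
})

clean = preprocess_dataset(toy)

assert not clean.isna().any().any()   # no NaNs are left in the result
assert toy['race'].isna().sum() == 1  # the original dataframe is untouched
print(clean)
```

Both `drop` and `fillna` return new dataframes by default, which is why the original is left as it was.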

# Aggregate Medications

We have a lot of columns about medications, and most of them are rarely used.

In [57]:
dataset.columns

Index(['patient_nbr', 'race', 'gender', 'age', 'weight', 'admission_type_id',
       'discharge_disposition_id', 'admission_source_id', 'time_in_hospital',
       'payer_code', 'medical_specialty', 'num_lab_procedures',
       'num_procedures', 'num_medications', 'number_outpatient',
       'number_emergency', 'number_inpatient', 'diag_1', 'diag_2', 'diag_3',
       'number_diagnoses', 'max_glu_serum', 'A1Cresult', 'metformin',
       'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride',
       'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide-metformin', 'glipizide-metformin',
       'glimepiride-pioglitazone', 'metformin-rosiglitazone',
       'metformin-pioglitazone', 'change', 'diabetesMed', 'readmitted'],
      dtype='object')

In [46]:
meds = ['metformin',
       'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride',
       'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide-metformin', 'glipizide-metformin',
       'glimepiride-pioglitazone', 'metformin-rosiglitazone',
       'metformin-pioglitazone']

In [58]:
# let's take a look at the first few
for med in meds[:5]:
    dataset[med].value_counts(normalize=True).round(2)
# also let's take a look at insulin
dataset['insulin'].value_counts(normalize=True).round(2)

metformin
No        0.80
Steady    0.18
Up        0.01
Down      0.01
Name: proportion, dtype: float64

repaglinide
No        0.98
Steady    0.01
Up        0.00
Down      0.00
Name: proportion, dtype: float64

nateglinide
No        0.99
Steady    0.01
Up        0.00
Down      0.00
Name: proportion, dtype: float64

chlorpropamide
No        1.0
Steady    0.0
Up        0.0
Down      0.0
Name: proportion, dtype: float64

glimepiride
No        0.95
Steady    0.05
Up        0.00
Down      0.00
Name: proportion, dtype: float64

insulin
No        0.47
Steady    0.30
Down      0.12
Up        0.11
Name: proportion, dtype: float64

As you can see, most of the time these medications aren't used. Only insulin is used more often (which we expect given our cohort of patients)

With the help of Bernardo Neves MD, we decided to aggregate these 23 columns in the following way:
1. keep the `insulin` column
2. aggregate all the others into a single column: whether or not there was a change in any of the `oral antidiabetics`. We understand this is a limited way to represent the data, but it replaces 22 columns with one, which could help with the [curse of dimensionality](https://towardsdatascience.com/curse-of-dimensionality-a-curse-to-machine-learning-c122ee33bfeb), for example. It probably still encodes most of the original information for the ML development.


In the end we decided to simplify the dataset in this way, but that doesn't mean it is what gives the best results in Machine Learning. We can only know what works best in each case by comparing different approaches.

In [60]:
# same as the meds list but without insulin
oral_antidiabetics = ['metformin',
       'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride',
       'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton',
       'glyburide-metformin', 'glipizide-metformin',
       'glimepiride-pioglitazone', 'metformin-rosiglitazone',
       'metformin-pioglitazone']

In [68]:
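# True where the drug was prescribed at all (any value other than 'No')
prescribed_oral_antidiabetics = (dataset[oral_antidiabetics] != 'No')
# True where the value is anything other than 'Steady'
not_steady_oral_antidiabetics = (dataset[oral_antidiabetics] != 'Steady')

# neither 'No' nor 'Steady' leaves 'Up' or 'Down', i.e. a dosage change
change_in_oral_antidiabetics = (prescribed_oral_antidiabetics & not_steady_oral_antidiabetics).any(axis='columns')

# share of encounters with at least one oral antidiabetic prescribed (this still includes 'Steady')
prescribed_oral_antidiabetics.any(axis='columns').value_counts(normalize=True)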

False    0.534245
True     0.465755
Name: proportion, dtype: float64

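About 47% of the encounters have at least one oral antidiabetic prescribed (a value other than `No`). Note that this still counts `Steady`, so it is an upper bound on the share of encounters where an oral antidiabetic actually changed.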
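The distinction between "prescribed" and "changed" is easy to blur, so here is a small sketch on made-up rows. It assumes, as the value counts above suggest, that `Up` and `Down` mark a dosage change while `Steady` means the drug was kept unchanged; the `toy` data is hypothetical.

```python
import pandas as pd

# Made-up values, for illustration only
toy = pd.DataFrame(
    {"metformin": ["No", "Steady", "Up"], "glipizide": ["No", "No", "Steady"]},
    index=pd.Index([1, 2, 3], name="encounter_id"),
)

# Prescribed: at least one drug has a value other than 'No'
prescribed = (toy != "No").any(axis="columns")

# Changed: at least one drug is neither 'No' nor 'Steady' ...
changed = ((toy != "No") & (toy != "Steady")).any(axis="columns")

# ... which, under the assumption above, is the same as being 'Up' or 'Down'
changed_alt = toy.isin(["Up", "Down"]).any(axis="columns")

print(pd.DataFrame({"prescribed": prescribed, "changed": changed}))
```

Encounter 2 is the interesting row: a drug is prescribed (`Steady`) but nothing changed.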

Let's create the code that looks for a change in any of these columns

In [64]:
dataset[oral_antidiabetics]

encounter_id,metformin,repaglinide,nateglinide,chlorpropamide,glimepiride,acetohexamide,glipizide,glyburide,tolbutamide,pioglitazone,rosiglitazone,acarbose,miglitol,troglitazone,tolazamide,examide,citoglipton,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone
2278392,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No
149190,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No
64410,No,No,No,No,No,No,Steady,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No
500364,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No
16680,No,No,No,No,No,No,Steady,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
443847548,Steady,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No
443847782,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No
443854148,Steady,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No
443857166,No,No,No,No,No,No,Steady,No,No,Steady,No,No,No,No,No,No,No,No,No,No,No,No


In [67]:
(dataset[oral_antidiabetics] != 'No').any(axis='columns').value_counts()

False    54368
True     47398
Name: count, dtype: int64