# Intro

Here's the [official source](https://archive.ics.uci.edu/dataset/296/diabetes+130-us+hospitals+for+years+1999-2008) of the dataset. It requires some preprocessing, so in this notebook we show what we did and why.

# Imports

In [None]:
!pip install ucimlrepo

In [1]:
import pandas as pd
from ucimlrepo import fetch_ucirepo 

# Some Notebook Options

Here we enable a couple of options that make this notebook easier to use: every expression in a cell is displayed (not only the last one), and pandas shows all the columns of a dataframe instead of truncating them.

In [2]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
pd.set_option('display.max_columns', None)

# Read the data

In [3]:
# fetch dataset 
diabetes_130_us_hospitals_for_years_1999_2008 = fetch_ucirepo(id=296) 
  
# data (as pandas dataframes) 
X = diabetes_130_us_hospitals_for_years_1999_2008.data.features 
y = diabetes_130_us_hospitals_for_years_1999_2008.data.targets 
ids = diabetes_130_us_hospitals_for_years_1999_2008.data.ids



In [4]:
dataset = pd.concat([X, y,ids], axis=1).set_index('encounter_id')
dataset = dataset[['patient_nbr'] + [col for col in dataset.columns if col != 'patient_nbr']]
dataset.head(3)
dataset.shape

encounter_id,patient_nbr,race,gender,age,weight,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,payer_code,medical_specialty,num_lab_procedures,num_procedures,num_medications,number_outpatient,number_emergency,number_inpatient,diag_1,diag_2,diag_3,number_diagnoses,max_glu_serum,A1Cresult,metformin,repaglinide,nateglinide,chlorpropamide,glimepiride,acetohexamide,glipizide,glyburide,tolbutamide,pioglitazone,rosiglitazone,acarbose,miglitol,troglitazone,tolazamide,examide,citoglipton,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed,readmitted
2278392,8222157,Caucasian,Female,[0-10),,6,25,1,1,,Pediatrics-Endocrinology,41,0,1,0,0,0,250.83,,,1,,,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,NO
149190,55629189,Caucasian,Female,[10-20),,1,1,7,3,,,59,0,18,0,0,0,276.0,250.01,255,9,,,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,Up,No,No,No,No,No,Ch,Yes,>30
64410,86047875,AfricanAmerican,Female,[20-30),,1,1,7,2,,,11,5,13,2,0,1,648.0,250.0,V27,6,,,No,No,No,No,No,No,Steady,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,Yes,NO


(101766, 49)

So the index holds the id of each encounter, the first column is the patient number, then come the features, and the last column (`readmitted`) is the target we want to predict.
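As a small illustration of the assembly step above, here is a sketch on a made-up three-part toy dataset (the values and the names `features`, `targets`, `toy_ids` and `toy` are assumptions for the example, not part of the real data). It joins the parts side by side, promotes `encounter_id` to the index and moves `patient_nbr` to the front.

```python
import pandas as pd

# Toy stand-ins for the features, targets and ids frames (values are made up).
features = pd.DataFrame({"age": ["[0-10)", "[10-20)"], "gender": ["Female", "Male"]})
targets = pd.DataFrame({"readmitted": ["NO", ">30"]})
toy_ids = pd.DataFrame({"encounter_id": [101, 102], "patient_nbr": [7, 8]})

# Join side by side (rows are aligned on the default 0..n-1 index),
# then promote the encounter id to the index.
toy = pd.concat([features, targets, toy_ids], axis=1).set_index("encounter_id")

# Move patient_nbr to the front and keep the remaining columns in order.
toy = toy[["patient_nbr"] + [col for col in toy.columns if col != "patient_nbr"]]
print(toy)
```

Note that `pd.concat(..., axis=1)` aligns rows by index, so this only works as intended when the three frames share the same row order and index.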

# Preprocessing

In [5]:
def preprocess_dataset(df):
    return df

The function above will be our function to preprocess the dataset. I started it empty on purpose so that we can walk through building it one step at a time.

# 1. Dealing with NaNs

First question: do we have NaNs (values that aren't filled in)?

In [6]:
dataset.isna().head(3)

encounter_id,patient_nbr,race,gender,age,weight,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,payer_code,medical_specialty,num_lab_procedures,num_procedures,num_medications,number_outpatient,number_emergency,number_inpatient,diag_1,diag_2,diag_3,number_diagnoses,max_glu_serum,A1Cresult,metformin,repaglinide,nateglinide,chlorpropamide,glimepiride,acetohexamide,glipizide,glyburide,tolbutamide,pioglitazone,rosiglitazone,acarbose,miglitol,troglitazone,tolazamide,examide,citoglipton,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed,readmitted
2278392,False,False,False,False,True,False,False,False,False,True,False,False,False,False,False,False,False,False,True,True,False,True,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
149190,False,False,False,False,True,False,False,False,False,True,True,False,False,False,False,False,False,False,False,False,False,True,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
64410,False,False,False,False,True,False,False,False,False,True,True,False,False,False,False,False,False,False,False,False,False,True,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False


`.isna()` checks each value in the dataframe and returns `True` if it is a NaN and `False` otherwise.
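To make this concrete, here is a minimal sketch on a made-up dataframe (the frame `toy` and its values are assumptions for illustration). It shows the boolean mask produced by `.isna()` and how summing that mask counts the missing values per column, since `True` counts as 1.

```python
import numpy as np
import pandas as pd

# Made-up data: one missing race and two missing weights.
toy = pd.DataFrame(
    {
        "race": ["Caucasian", None, "Asian"],
        "weight": [np.nan, np.nan, "[50-75)"],
    }
)

mask = toy.isna()        # same shape as toy, True where a value is missing
nan_counts = mask.sum()  # True counts as 1, so this counts NaNs per column

print(mask)
print(nan_counts)
```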

In [7]:
dataset.isna().sum(axis='rows')

patient_nbr                     0
race                         2273
gender                          0
age                             0
weight                      98569
admission_type_id               0
discharge_disposition_id        0
admission_source_id             0
time_in_hospital                0
payer_code                  40256
medical_specialty           49949
num_lab_procedures              0
num_procedures                  0
num_medications                 0
number_outpatient               0
number_emergency                0
number_inpatient                0
diag_1                         21
diag_2                        358
diag_3                       1423
number_diagnoses                0
max_glu_serum               96420
A1Cresult                   84748
metformin                       0
repaglinide                     0
nateglinide                     0
chlorpropamide                  0
glimepiride                     0
acetohexamide                   0
glipizide     

A few columns seem to have a lot of NaNs, so let's compute the fraction of missing values per column and keep only the columns that have at least one.

In [8]:
(dataset.isna().sum(axis='rows')
 .div(len(dataset)) # divide by the number of datapoints to get frequency
 .where(lambda x: x > 0).dropna() # select only the columns with at least 1 nan
 .round(2)
 .sort_values(ascending=False)
)

weight               0.97
max_glu_serum        0.95
A1Cresult            0.83
medical_specialty    0.49
payer_code           0.40
race                 0.02
diag_3               0.01
diag_1               0.00
diag_2               0.00
dtype: float64
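As a side note, the fraction of missing values can also be computed with `.isna().mean()`, because the mean of a boolean column is the share of `True` values. The sketch below uses a made-up frame (`toy` and its values are assumptions) and checks that both formulations agree.

```python
import numpy as np
import pandas as pd

# Made-up data: weight is missing in 3 of 4 rows, race in 1 of 4, gender never.
toy = pd.DataFrame(
    {
        "weight": [np.nan, np.nan, np.nan, "[50-75)"],
        "race": ["Caucasian", np.nan, "Asian", "Hispanic"],
        "gender": ["Female", "Male", "Male", "Female"],
    }
)

# Share of missing values per column, computed in two equivalent ways.
share_via_mean = toy.isna().mean()
share_via_sum = toy.isna().sum().div(len(toy))

# Keep only the columns that have at least one NaN, largest share first.
nan_share = share_via_mean[share_via_mean > 0].sort_values(ascending=False)
print(nan_share)
```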

We will drop `weight`, `max_glu_serum` and `payer_code`.

In [9]:
dataset['medical_specialty'].value_counts()

medical_specialty
InternalMedicine          14635
Emergency/Trauma           7565
Family/GeneralPractice     7440
Cardiology                 5352
Surgery-General            3099
                          ...  
Psychiatry-Addictive          1
Proctology                    1
Dermatology                   1
SportsMedicine                1
Speech                        1
Name: count, Length: 72, dtype: int64

So only a few columns have at least one NaN. For example, 97% of the datapoints have no weight information.

In [10]:
columns_with_nans = (dataset.isna().sum(axis='rows')
                     .div(len(dataset))
                     .where(lambda x: x > 0).dropna() # everything is the same as above until this point
                     .index # but we care about the column names that have nans (which are in the index of this series)
                    )

dataset[columns_with_nans].dtypes

race                 object
weight               object
payer_code           object
medical_specialty    object
diag_1               object
diag_2               object
diag_3               object
max_glu_serum        object
A1Cresult            object
dtype: object

So these columns all seem to hold categories (strings). We can take a look at a couple of them, for example:

In [11]:
dataset['weight'].value_counts()

weight
[75-100)     1336
[50-75)       897
[100-125)     625
[125-150)     145
[25-50)        97
[0-25)         48
[150-175)      35
[175-200)      11
>200            3
Name: count, dtype: int64

So when the weight is filled in, which happens very rarely, it is not a number but a string representing a weight range.

In [12]:
dataset['A1Cresult'].value_counts()

A1Cresult
>8      8216
Norm    4990
>7      3812
Name: count, dtype: int64

Something similar happens with `A1Cresult`: we were expecting numerical values, but they are already encoded into 3 categories.

This is pretty useful because it gives us an opening to fill in all NaNs with a string. Let's call it `Unknown`.

In [13]:
def preprocess_dataset(df):
    df = df.drop(columns=['weight', 'max_glu_serum', 'payer_code'])
    df = df.fillna(value='Unknown')
    return df

Let's test it out (without overwriting our original dataframe)

In [14]:
preprocess_dataset(dataset).head(3)

encounter_id,patient_nbr,race,gender,age,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,medical_specialty,num_lab_procedures,num_procedures,num_medications,number_outpatient,number_emergency,number_inpatient,diag_1,diag_2,diag_3,number_diagnoses,A1Cresult,metformin,repaglinide,nateglinide,chlorpropamide,glimepiride,acetohexamide,glipizide,glyburide,tolbutamide,pioglitazone,rosiglitazone,acarbose,miglitol,troglitazone,tolazamide,examide,citoglipton,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed,readmitted
2278392,8222157,Caucasian,Female,[0-10),6,25,1,1,Pediatrics-Endocrinology,41,0,1,0,0,0,250.83,Unknown,Unknown,1,Unknown,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,NO
149190,55629189,Caucasian,Female,[10-20),1,1,7,3,Unknown,59,0,18,0,0,0,276.0,250.01,255,9,Unknown,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,Up,No,No,No,No,No,Ch,Yes,>30
64410,86047875,AfricanAmerican,Female,[20-30),1,1,7,2,Unknown,11,5,13,2,0,1,648.0,250,V27,6,Unknown,No,No,No,No,No,No,Steady,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,Yes,NO


We can already see that in columns such as `medical_specialty` and `A1Cresult` the `NaN` values were replaced with the string `Unknown`. Awesome!
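The reason the test above does not modify the original dataframe is that both `drop` and `fillna` return new dataframes by default. Here is a minimal sketch of that behaviour on made-up data (the helper `drop_and_fill`, the frame `toy` and its values are hypothetical names used only for this illustration).

```python
import numpy as np
import pandas as pd

def drop_and_fill(df):
    # Both calls return new objects, so the caller's dataframe is left untouched.
    df = df.drop(columns=["weight"])
    return df.fillna("Unknown")

# Made-up data with one NaN in each column.
toy = pd.DataFrame(
    {"weight": [np.nan, "[50-75)"], "A1Cresult": [">8", np.nan]}
)

cleaned = drop_and_fill(toy)
print(cleaned)
print(toy)  # still has the weight column and its NaNs
```

In the real dataset only string columns contain NaNs, so filling with a string does not mix types inside the numeric columns.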

# Aggregate Medications

We have a lot of columns about the medications, and most of them are rarely used.

In [15]:
dataset.columns

Index(['patient_nbr', 'race', 'gender', 'age', 'weight', 'admission_type_id',
       'discharge_disposition_id', 'admission_source_id', 'time_in_hospital',
       'payer_code', 'medical_specialty', 'num_lab_procedures',
       'num_procedures', 'num_medications', 'number_outpatient',
       'number_emergency', 'number_inpatient', 'diag_1', 'diag_2', 'diag_3',
       'number_diagnoses', 'max_glu_serum', 'A1Cresult', 'metformin',
       'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride',
       'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide-metformin', 'glipizide-metformin',
       'glimepiride-pioglitazone', 'metformin-rosiglitazone',
       'metformin-pioglitazone', 'change', 'diabetesMed', 'readmitted'],
      dtype='object')

In [16]:
meds = ['metformin',
       'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride',
       'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide-metformin', 'glipizide-metformin',
       'glimepiride-pioglitazone', 'metformin-rosiglitazone',
       'metformin-pioglitazone']

In [17]:
# let's take a look at the first few
for med in meds[:5]:
    dataset[med].value_counts(normalize=True).round(2)
# also let's take a look at insulin
dataset['insulin'].value_counts(normalize=True).round(2)

metformin
No        0.80
Steady    0.18
Up        0.01
Down      0.01
Name: proportion, dtype: float64

repaglinide
No        0.98
Steady    0.01
Up        0.00
Down      0.00
Name: proportion, dtype: float64

nateglinide
No        0.99
Steady    0.01
Up        0.00
Down      0.00
Name: proportion, dtype: float64

chlorpropamide
No        1.0
Steady    0.0
Up        0.0
Down      0.0
Name: proportion, dtype: float64

glimepiride
No        0.95
Steady    0.05
Up        0.00
Down      0.00
Name: proportion, dtype: float64

insulin
No        0.47
Steady    0.30
Down      0.12
Up        0.11
Name: proportion, dtype: float64

As you can see, most of the time these medications aren't used. Only insulin is used more often (which we expect given our cohort of patients)

With the help of Bernardo Neves MD, we decided to aggregate these 23 columns in the following way:
1. keep the insulin information
2. aggregate all the others into summary columns: whether any of the `oral antidiabetics` was prescribed and whether there was a change in any of them. We understand this is a limited way to represent the data, but it replaces 22 columns with 2, which could help with the [curse of dimensionality](https://towardsdatascience.com/curse-of-dimensionality-a-curse-to-machine-learning-c122ee33bfeb), for example. It also probably preserves most of the original information that matters for ML development.


In the end we decided to simplify the dataset in this way, which doesn't mean that it gives the best results in Machine Learning. We can only know what works best in each case by comparing different approaches.

First, we will deal with the insulin feature

In [29]:
# let's check what insulin looks like
dataset['insulin']

encounter_id
2278392          No
149190           Up
64410            No
500364           Up
16680        Steady
              ...  
443847548      Down
443847782    Steady
443854148      Down
443857166        Up
443867222        No
Name: insulin, Length: 101766, dtype: object

In [30]:
# map this variable to Yes/No depending on whether insulin was prescribed
dataset['insulin'].apply(lambda x: 'Yes' if x in ['Steady', 'Up', 'Down'] else 'No')

encounter_id
2278392       No
149190       Yes
64410         No
500364       Yes
16680        Yes
            ... 
443847548    Yes
443847782    Yes
443854148    Yes
443857166    Yes
443867222     No
Name: insulin, Length: 101766, dtype: object

In [31]:
# map this variable to Yes/No depending on whether the insulin dosage changed
dataset['insulin'].apply(lambda x: 'Yes' if x in ['Up', 'Down'] else 'No')

encounter_id
2278392       No
149190       Yes
64410         No
500364       Yes
16680         No
            ... 
443847548    Yes
443847782     No
443854148    Yes
443857166    Yes
443867222     No
Name: insulin, Length: 101766, dtype: object
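The two mappings above can also be written without `apply`. The following is only a sketch of an alternative, not what the notebook uses: it combines `Series.isin` with `numpy.where` on a made-up series (`insulin`, `insulin_taken` and `insulin_change` here are toy variables) and checks that the result matches the `apply` version.

```python
import numpy as np
import pandas as pd

# Made-up series covering the four values found in the insulin column.
insulin = pd.Series(["No", "Up", "Steady", "Down"], name="insulin")

# Vectorized mapping: "Yes" where the value is in the list, "No" elsewhere.
insulin_taken = pd.Series(
    np.where(insulin.isin(["Steady", "Up", "Down"]), "Yes", "No"),
    index=insulin.index,
)
insulin_change = pd.Series(
    np.where(insulin.isin(["Up", "Down"]), "Yes", "No"),
    index=insulin.index,
)

# Same result as the apply-based version used in the notebook.
via_apply = insulin.apply(lambda x: "Yes" if x in ["Up", "Down"] else "No")
print((insulin_change == via_apply).all())
```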

In [32]:
# update the preprocess function
def preprocess_dataset(df):
    # Drop unnecessary columns
    df = df.drop(columns=['weight', 'max_glu_serum', 'payer_code'])

    # Fill missing values
    df = df.fillna(value='Unknown')

    # Process insulin column to create new columns
    df['insulin_taken'] = df['insulin'].apply(lambda x: 'Yes' if x in ['Steady', 'Up', 'Down'] else 'No')
    df['insulin_change'] = df['insulin'].apply(lambda x: 'Yes' if x in ['Up', 'Down'] else 'No')
    df = df.drop(columns=['insulin'])

    return df

In [33]:
preprocess_dataset(dataset).head(3)

encounter_id,patient_nbr,race,gender,age,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,medical_specialty,num_lab_procedures,num_procedures,num_medications,number_outpatient,number_emergency,number_inpatient,diag_1,diag_2,diag_3,number_diagnoses,A1Cresult,metformin,repaglinide,nateglinide,chlorpropamide,glimepiride,acetohexamide,glipizide,glyburide,tolbutamide,pioglitazone,rosiglitazone,acarbose,miglitol,troglitazone,tolazamide,examide,citoglipton,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed,readmitted,insulin_taken,insulin_change
2278392,8222157,Caucasian,Female,[0-10),6,25,1,1,Pediatrics-Endocrinology,41,0,1,0,0,0,250.83,Unknown,Unknown,1,Unknown,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,NO,No,No
149190,55629189,Caucasian,Female,[10-20),1,1,7,3,Unknown,59,0,18,0,0,0,276.0,250.01,255,9,Unknown,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,Ch,Yes,>30,Yes,Yes
64410,86047875,AfricanAmerican,Female,[20-30),1,1,7,2,Unknown,11,5,13,2,0,1,648.0,250,V27,6,Unknown,No,No,No,No,No,No,Steady,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,Yes,NO,No,No


Now we will focus on the other diabetes medications.

In [34]:
# same as the meds list but without insulin
oral_antidiabetics = ['metformin',
       'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride',
       'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton',
       'glyburide-metformin', 'glipizide-metformin',
       'glimepiride-pioglitazone', 'metformin-rosiglitazone',
       'metformin-pioglitazone']

In [35]:
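# True wherever a medication was prescribed (any value other than 'No')
prescribed_oral_antidiabetics = (dataset[oral_antidiabetics] != 'No')
# True wherever the value is anything other than 'Steady'
not_steady_oral_antidiabetics = (dataset[oral_antidiabetics] != 'Steady')

# prescribed and not steady means the dosage went 'Up' or 'Down'
change_in_oral_antidiabetics = (prescribed_oral_antidiabetics & not_steady_oral_antidiabetics).any(axis='columns')

# share of encounters with at least one oral antidiabetic prescribed
(dataset[oral_antidiabetics] != 'No').any(axis='columns').value_counts(normalize=True)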

False    0.534245
True     0.465755
Name: proportion, dtype: float64

About 47% of the encounters have at least one oral antidiabetic prescribed (a value other than `No`). Note that this counts prescriptions, not dosage changes.

Let's create the code that looks for a change in any of these columns
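Before doing so, it helps to see how "prescribed" and "changed" differ as boolean masks. The sketch below uses a made-up two-column frame (the frame `toy` and its values are assumptions) and assumes the columns only contain `No`, `Steady`, `Up` and `Down`, as in the value counts shown earlier.

```python
import pandas as pd

# Made-up rows: nothing prescribed, steady dose only, and one dosage increase.
toy = pd.DataFrame(
    {"metformin": ["No", "Steady", "Up"], "glipizide": ["No", "No", "Steady"]},
    index=[1, 2, 3],
)

# Prescribed: at least one column is different from "No".
prescribed = (toy != "No").any(axis="columns")

# Changed: at least one column is neither "No" nor "Steady", i.e. "Up" or "Down".
changed = ((toy != "No") & (toy != "Steady")).any(axis="columns")

# With only four possible values, this is the same as testing for "Up"/"Down".
changed_via_isin = toy.isin(["Up", "Down"]).any(axis="columns")

print(prescribed.tolist(), changed.tolist())
```

Row 2 is the interesting case: a medication is prescribed, but its dosage did not change.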

In [43]:
dataset[oral_antidiabetics].head(3)

encounter_id,metformin,repaglinide,nateglinide,chlorpropamide,glimepiride,acetohexamide,glipizide,glyburide,tolbutamide,pioglitazone,rosiglitazone,acarbose,miglitol,troglitazone,tolazamide,examide,citoglipton,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone
2278392,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No
149190,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No
64410,No,No,No,No,No,No,Steady,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No


In [37]:
(dataset[oral_antidiabetics] != 'No').any(axis='columns').value_counts()

False    54368
True     47398
Name: count, dtype: int64

In [39]:
# build a Yes/No variable: was any of these medications prescribed?
dataset[oral_antidiabetics].apply(lambda row: 'Yes' if any(val in ['Steady', 'Up', 'Down'] for val in row) else 'No', axis=1)

encounter_id
2278392       No
149190        No
64410        Yes
500364        No
16680        Yes
            ... 
443847548    Yes
443847782     No
443854148    Yes
443857166    Yes
443867222     No
Length: 101766, dtype: object

In [40]:
# build a Yes/No variable: did the dosage of any of these medications change?
dataset[oral_antidiabetics].apply(lambda row: 'Yes' if any(val in ['Up', 'Down'] for val in row) else 'No', axis=1)

encounter_id
2278392      No
149190       No
64410        No
500364       No
16680        No
             ..
443847548    No
443847782    No
443854148    No
443857166    No
443867222    No
Length: 101766, dtype: object
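Row-wise `apply` runs a Python function once per row, which can be slow on a dataframe with about 100,000 rows. As a possible alternative (a sketch only, the notebook keeps the `apply` version), the same Yes/No column can be built from a boolean mask. The frame `toy` and the names `slow` and `fast` are made up for this comparison.

```python
import numpy as np
import pandas as pd

# Made-up rows using the same four values as the medication columns.
toy = pd.DataFrame(
    {
        "metformin": ["No", "Steady", "Up", "No"],
        "glipizide": ["No", "No", "Steady", "Down"],
    }
)

# Row-wise version, as in the cell above.
slow = toy.apply(
    lambda row: "Yes" if any(val in ["Up", "Down"] for val in row) else "No",
    axis=1,
)

# Vectorized version: build the boolean mask once, then map it to Yes/No.
fast = pd.Series(
    np.where(toy.isin(["Up", "Down"]).any(axis="columns"), "Yes", "No"),
    index=toy.index,
)

print((slow == fast).all())
```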

In [41]:
# new update of the preprocess function
def preprocess_dataset(df):
    # Drop unnecessary columns
    df = df.drop(columns=['weight', 'max_glu_serum', 'payer_code'])
    
    # Fill missing values
    df = df.fillna(value='Unknown')
    
    # Process insulin column to create new columns
    df['insulin_taken'] = df['insulin'].apply(lambda x: 'Yes' if x in ['Steady', 'Up', 'Down'] else 'No')
    df['insulin_change'] = df['insulin'].apply(lambda x: 'Yes' if x in ['Up', 'Down'] else 'No')
    df=df.drop(columns=['insulin'])
    
    # List of oral antidiabetic medications
    oral_antidiabetics = [
        'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride',
        'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide', 'pioglitazone',
        'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone', 'tolazamide',
        'examide', 'citoglipton', 'glyburide-metformin', 'glipizide-metformin',
        'glimepiride-pioglitazone', 'metformin-rosiglitazone', 'metformin-pioglitazone'
    ]
    
    # Create the other_meds and other_meds_change columns
    df['other_meds'] = df[oral_antidiabetics].apply(lambda row: 'Yes' if any(val in ['Steady', 'Up', 'Down'] for val in row) else 'No', axis=1)
    df['other_meds_change'] = df[oral_antidiabetics].apply(lambda row: 'Yes' if any(val in ['Up', 'Down'] for val in row) else 'No', axis=1)
    df = df.drop(columns=oral_antidiabetics)
    df = df.drop(columns=['change']) # the change column is similar to the ones we have created
    
    return df

In [54]:
preprocess_dataset(dataset).head()

encounter_id,patient_nbr,race,gender,age,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,medical_specialty,num_lab_procedures,num_procedures,num_medications,number_outpatient,number_emergency,number_inpatient,diag_1,diag_2,diag_3,number_diagnoses,A1Cresult,diabetesMed,readmitted,insulin_taken,insulin_change,other_meds,other_meds_change
2278392,8222157,Caucasian,Female,[0-10),6,25,1,1,Pediatrics-Endocrinology,41,0,1,0,0,0,250.83,Unknown,Unknown,1,Unknown,No,NO,No,No,No,No
149190,55629189,Caucasian,Female,[10-20),1,1,7,3,Unknown,59,0,18,0,0,0,276.0,250.01,255,9,Unknown,Yes,>30,Yes,Yes,No,No
64410,86047875,AfricanAmerican,Female,[20-30),1,1,7,2,Unknown,11,5,13,2,0,1,648.0,250,V27,6,Unknown,Yes,NO,No,No,Yes,No
500364,82442376,Caucasian,Male,[30-40),1,1,7,2,Unknown,44,1,16,0,0,0,8.0,250.43,403,7,Unknown,Yes,NO,Yes,Yes,No,No
16680,42519267,Caucasian,Male,[40-50),1,1,7,1,Unknown,51,0,8,0,0,0,197.0,157,250,5,Unknown,Yes,NO,Yes,No,Yes,No


Now we transform the target column `readmitted` into a binary label:

- `<30` becomes 1, because the patient was readmitted within 30 days
- everything else (`>30` or `NO`) becomes 0

In [55]:
dataset['readmitted'].value_counts()

readmitted
NO     54864
>30    35545
<30    11357
Name: count, dtype: int64

In [58]:
54864+35545

90409

In [56]:
# Transform the readmitted column
dataset['readmitted'].apply(lambda x: 1 if x == '<30' else 0)

encounter_id
2278392      0
149190       0
64410        0
500364       0
16680        0
            ..
443847548    0
443847782    0
443854148    0
443857166    0
443867222    0
Name: readmitted, Length: 101766, dtype: int64

In [57]:
dataset['readmitted'].apply(lambda x: 1 if x == '<30' else 0).value_counts()

readmitted
0    90409
1    11357
Name: count, dtype: int64
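The counts above show that the positive class is a small minority. As a quick sketch, the snippet below binarizes a made-up series with a vectorized comparison (an alternative to `apply`, not the notebook's choice) and computes the share of positives from the counts reported above. The names `readmitted`, `target`, `counts` and `positive_rate` are introduced only for this example.

```python
import pandas as pd

# Made-up series with the three values found in the readmitted column.
readmitted = pd.Series(["NO", ">30", "<30", "NO"], name="readmitted")

# Vectorized binarization: True/False cast to 1/0.
target = readmitted.eq("<30").astype(int)
print(target.tolist())

# Class balance computed from the value counts reported above.
counts = {"NO": 54864, ">30": 35545, "<30": 11357}
positive_rate = counts["<30"] / sum(counts.values())
print(round(positive_rate, 3))
```

Roughly 11% of the encounters are labelled 1, which is worth keeping in mind when choosing evaluation metrics later on.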

In [59]:
# new update of the preprocess function
def preprocess_dataset(df):
    # Drop unnecessary columns
    df = df.drop(columns=['weight', 'max_glu_serum', 'payer_code'])
    
    # Fill missing values
    df = df.fillna(value='Unknown')
    
    # Process insulin column to create new columns
    df['insulin_taken'] = df['insulin'].apply(lambda x: 'Yes' if x in ['Steady', 'Up', 'Down'] else 'No')
    df['insulin_change'] = df['insulin'].apply(lambda x: 'Yes' if x in ['Up', 'Down'] else 'No')
    df=df.drop(columns=['insulin'])
    
    # List of oral antidiabetic medications
    oral_antidiabetics = [
        'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride',
        'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide', 'pioglitazone',
        'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone', 'tolazamide',
        'examide', 'citoglipton', 'glyburide-metformin', 'glipizide-metformin',
        'glimepiride-pioglitazone', 'metformin-rosiglitazone', 'metformin-pioglitazone'
    ]
    
    # Create the other_meds and other_meds_change columns
    df['other_meds'] = df[oral_antidiabetics].apply(lambda row: 'Yes' if any(val in ['Steady', 'Up', 'Down'] for val in row) else 'No', axis=1)
    df['other_meds_change'] = df[oral_antidiabetics].apply(lambda row: 'Yes' if any(val in ['Up', 'Down'] for val in row) else 'No', axis=1)
    df = df.drop(columns=oral_antidiabetics)
    df = df.drop(columns=['change']) # the change column is similar to the ones we have created

    # Transform the readmitted column
    df['Readmitted'] = df['readmitted'].apply(lambda x: 1 if x == '<30' else 0)
    df = df.drop(columns=['readmitted']) # readmitted is dropped because the transformed Readmitted column replaces it and is appended as the last column
    
    return df

In [62]:
preprocess_dataset(dataset).head()

encounter_id,patient_nbr,race,gender,age,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,medical_specialty,num_lab_procedures,num_procedures,num_medications,number_outpatient,number_emergency,number_inpatient,diag_1,diag_2,diag_3,number_diagnoses,A1Cresult,diabetesMed,insulin_taken,insulin_change,other_meds,other_meds_change,Readmitted
2278392,8222157,Caucasian,Female,[0-10),6,25,1,1,Pediatrics-Endocrinology,41,0,1,0,0,0,250.83,Unknown,Unknown,1,Unknown,No,No,No,No,No,0
149190,55629189,Caucasian,Female,[10-20),1,1,7,3,Unknown,59,0,18,0,0,0,276.0,250.01,255,9,Unknown,Yes,Yes,Yes,No,No,0
64410,86047875,AfricanAmerican,Female,[20-30),1,1,7,2,Unknown,11,5,13,2,0,1,648.0,250,V27,6,Unknown,Yes,No,No,Yes,No,0
500364,82442376,Caucasian,Male,[30-40),1,1,7,2,Unknown,44,1,16,0,0,0,8.0,250.43,403,7,Unknown,Yes,Yes,Yes,No,No,0
16680,42519267,Caucasian,Male,[40-50),1,1,7,1,Unknown,51,0,8,0,0,0,197.0,157,250,5,Unknown,Yes,Yes,No,Yes,No,0


In [64]:
preprocess_dataset(dataset)['medical_specialty'].value_counts()

medical_specialty
Unknown                             49949
InternalMedicine                    14635
Emergency/Trauma                     7565
Family/GeneralPractice               7440
Cardiology                           5352
                                    ...  
Surgery-PlasticwithinHeadandNeck        1
Psychiatry-Addictive                    1
Proctology                              1
Dermatology                             1
SportsMedicine                          1
Name: count, Length: 73, dtype: int64