<a href="https://colab.research.google.com/github/Jihyun-Eun/Python-ML-Bio/blob/main/Predicting_Hospital_Readmission_with_Diabetes_Dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### This project is based on the "Prediction on Hospital Readmission" project by ABHISHEK SHARMA

https://www.kaggle.com/code/iabhishekofficial/prediction-on-hospital-readmission#Data-Preparation-&-Exploration

You can get more information from the link above.

## About the dataset

"Diabetes 130 US hospitals for years 1999-2008"
Diabetes - readmission

https://www.kaggle.com/datasets/brandao/diabetes/code

"The data contains such attributes as patient number, race, gender, age, admission type, time in hospital, medical specialty of admitting physician, number of lab test performed, HbA1c test result, diagnosis, number of medication, diabetic medications, number of outpatient, inpatient, and emergency visits in the year before the hospitalization, etc."

You can find more information in the following research article:
"Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records"

## Goal

1. What factors are the strongest predictors of hospital readmission in diabetic patients?
2. How well can we predict hospital readmission in this dataset with limited features?




## Variables




**Encounter ID** Unique identifier of an encounter

**Patient number** Unique identifier of a patient

**Race** Values: Caucasian, Asian, African American, Hispanic, and other

**Gender** Values: male, female, and unknown/invalid

**Age** Grouped in 10-year intervals: [0, 10), [10, 20), …, [90, 100)

**Weight** Weight in pounds

**Admission type** Integer identifier corresponding to 9 distinct values, for example, emergency, urgent, elective, newborn, and not available

**Discharge disposition** Integer identifier corresponding to 29 distinct values, for example, discharged to home, expired, and not available

**Admission source** Integer identifier corresponding to 21 distinct values, for example, physician referral, emergency room, and transfer from a hospital

**Time in hospital** Integer number of days between admission and discharge

**Payer code** Integer identifier corresponding to 23 distinct values, for example, Blue Cross/Blue Shield, Medicare, and self-pay

**Medical specialty** Integer identifier of a specialty of the admitting physician, corresponding to 84 distinct values, for example, cardiology, internal medicine, family/general practice, and surgeon

**Number of lab procedures** Number of lab tests performed during the encounter

**Number of procedures** Number of procedures (other than lab tests) performed during the encounter

**Number of medications** Number of distinct generic names administered during the encounter

**Number of outpatient visits** Number of outpatient visits of the patient in the year preceding the encounter

**Number of emergency visits** Number of emergency visits of the patient in the year preceding the encounter

**Number of inpatient visits** Number of inpatient visits of the patient in the year preceding the encounter

**Diagnosis 1** The primary diagnosis (coded as first three digits of ICD9); 848 distinct values

**Diagnosis 2** Secondary diagnosis (coded as first three digits of ICD9); 923 distinct values

**Diagnosis 3** Additional secondary diagnosis (coded as first three digits of ICD9); 954 distinct values

**Number of diagnoses** Number of diagnoses entered to the system

**Glucose serum test result** Indicates the range of the result or if the test was not taken.
* “>200,” “>300,” “normal,” and “none” if not measured

**A1c test result** Indicates the range of the result or if the test was not taken.
* “>8” if the result was greater than 8%
* “>7” if the result was greater than 7% but less than 8%
* “normal” if the result was less than 7%
* “none” if not measured

**Change of medications** Indicates if there was a change in diabetic medications (either dosage or generic name). Values: “change” and “no change”

**Diabetes medications** Indicates if there was any diabetic medication prescribed. Values: “yes” and “no”

**24 features for medications**

For the generic names: metformin, repaglinide, nateglinide, chlorpropamide, glimepiride, acetohexamide, glipizide, glyburide, tolbutamide, pioglitazone, rosiglitazone, acarbose, miglitol, troglitazone, tolazamide, examide, sitagliptin, insulin, glyburide-metformin, glipizide-metformin, glimepiride-pioglitazone, metformin-rosiglitazone, and metformin-pioglitazone.

The feature indicates whether the drug was prescribed or there was a change in the dosage.

* “up” if the dosage was increased during the encounter,

* “down” if the dosage was decreased,

* “steady” if the dosage did not change

* “no” if the drug was not prescribed

**Readmitted** Days to inpatient readmission.

*   “<30” if the patient was readmitted in less than 30 days
*   “>30” if the patient was readmitted in more than 30 days
*    “No” for no record of readmission








## Data Cleansing

In [5]:
#Loading libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [6]:
#loading Dataset
df = pd.read_csv("/content/diabetic_data.csv")
df.head(10).T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
encounter_id,2278392,149190,64410,500364,16680,35754,55842,63768,12522,15738
patient_nbr,8222157,55629189,86047875,82442376,42519267,82637451,84259809,114882984,48330783,63555939
race,Caucasian,Caucasian,AfricanAmerican,Caucasian,Caucasian,Caucasian,Caucasian,Caucasian,Caucasian,Caucasian
gender,Female,Female,Female,Male,Male,Male,Male,Male,Female,Female
age,[0-10),[10-20),[20-30),[30-40),[40-50),[50-60),[60-70),[70-80),[80-90),[90-100)
weight,?,?,?,?,?,?,?,?,?,?
admission_type_id,6,1,1,1,1,2,3,1,2,3
discharge_disposition_id,25,1,1,1,1,1,1,1,1,3
admission_source_id,1,7,7,7,7,2,2,7,4,4
time_in_hospital,1,3,2,2,1,3,4,5,13,12


In [7]:
#checking shape of the dataset
df.shape

(101766, 50)

In [8]:
#Checking data types of each variable
df.dtypes

encounter_id                 int64
patient_nbr                  int64
race                        object
gender                      object
age                         object
weight                      object
admission_type_id            int64
discharge_disposition_id     int64
admission_source_id          int64
time_in_hospital             int64
payer_code                  object
medical_specialty           object
num_lab_procedures           int64
num_procedures               int64
num_medications              int64
number_outpatient            int64
number_emergency             int64
number_inpatient             int64
diag_1                      object
diag_2                      object
diag_3                      object
number_diagnoses             int64
max_glu_serum               object
A1Cresult                   object
metformin                   object
repaglinide                 object
nateglinide                 object
chlorpropamide              object
glimepiride         

In [9]:
#Checking for missing values in dataset
#In the dataset missing values are represented as '?' sign
for col in df.columns:
    if df[col].dtype == object:
        print(col, df[col][df[col] == '?'].count())

race 2273
gender 0
age 0
weight 98569
payer_code 40256
medical_specialty 49949
diag_1 21
diag_2 358
diag_3 1423
max_glu_serum 0
A1Cresult 0
metformin 0
repaglinide 0
nateglinide 0
chlorpropamide 0
glimepiride 0
acetohexamide 0
glipizide 0
glyburide 0
tolbutamide 0
pioglitazone 0
rosiglitazone 0
acarbose 0
miglitol 0
troglitazone 0
tolazamide 0
examide 0
citoglipton 0
insulin 0
glyburide-metformin 0
glipizide-metformin 0
glimepiride-pioglitazone 0
metformin-rosiglitazone 0
metformin-pioglitazone 0
change 0
diabetesMed 0
readmitted 0


In [10]:
# gender was coded differently so we use a custom count for this one
print('gender', df['gender'][df['gender'] == 'Unknown/Invalid'].count())

gender 3


### Dealing with Missing Values
The "weight" variable is missing in about 97% of records (98,569 of 101,766), so imputing it would not be meaningful and we dropped it. "payer_code" and "medical_specialty" are missing in about 40% and 49% of records respectively, so we dropped these variables as well.

"race", "diag_1", "diag_2", "diag_3" and "gender" have far fewer missing values than the dropped variables, so for these we dropped only the rows in which the value is missing. The same step also removes encounters with discharge_disposition_id 11.
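
The missing-value shares discussed here can be computed directly rather than read off raw counts. The snippet below is only a sketch on a made-up toy frame (the name `toy` and its values are assumptions for illustration, not rows from the dataset); it relies on the dataset's convention that '?' marks a missing value.

```python
import pandas as pd

# Toy frame standing in for the real data; '?' marks a missing value.
toy = pd.DataFrame({
    "weight": ["?", "?", "?", "[75-100)"],
    "race": ["Caucasian", "?", "AfricanAmerican", "Caucasian"],
    "age": ["[0-10)", "[10-20)", "[20-30)", "[30-40)"],
})

# Share of '?' entries per column, in percent.
missing_pct = (toy == "?").mean() * 100
print(missing_pct.sort_values(ascending=False))
```

Running the same expression on `df` before any columns are dropped should give the per-column percentages used to decide which variables to remove.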

In [11]:
#dropping columns with large number of missing values
df = df.drop(['weight','payer_code','medical_specialty'], axis = 1)

In [12]:
# Drop encounters with a missing diagnosis or race, an unknown gender,
# or discharge_disposition_id 11 (coded as "Expired" in the dataset's ID mapping)
drop_mask = (
    (df['diag_1'] == '?') | (df['diag_2'] == '?') | (df['diag_3'] == '?')
    | (df['race'] == '?')
    | (df['discharge_disposition_id'] == 11)
    | (df['gender'] == 'Unknown/Invalid')
)
# .copy() avoids chained-assignment warnings when new columns are added later
df = df[~drop_mask].copy()

In [13]:
# For some variables (drugs named citoglipton and examide), all records have the same value
# So essentially these cannot provide any interpretive or discriminatory information for predicting readmission so we decided to drop these two variables
df = df.drop(['citoglipton', 'examide'], axis = 1)
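
One way to find such constant columns programmatically, instead of spotting them by eye, is to count distinct values per column. This is a minimal sketch on an invented toy frame (the frame and the variable `constant_cols` are illustrative assumptions), not the procedure used in the referenced notebook.

```python
import pandas as pd

# Toy frame: two drug columns never vary, one does.
toy = pd.DataFrame({
    "examide": ["No", "No", "No"],
    "citoglipton": ["No", "No", "No"],
    "insulin": ["No", "Up", "Steady"],
})

# A column with a single distinct value cannot help separate the classes.
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]
print(constant_cols)

toy = toy.drop(columns=constant_cols)
```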

In [14]:
#Checking for missing values in the data
for col in df.columns:
    if df[col].dtype == object:
        print(col, df[col][df[col] == '?'].count())

print('gender', df['gender'][df['gender'] == 'Unknown/Invalid'].count())

race 0
gender 0
age 0
diag_1 0
diag_2 0
diag_3 0
max_glu_serum 0
A1Cresult 0
metformin 0
repaglinide 0
nateglinide 0
chlorpropamide 0
glimepiride 0
acetohexamide 0
glipizide 0
glyburide 0
tolbutamide 0
pioglitazone 0
rosiglitazone 0
acarbose 0
miglitol 0
troglitazone 0
tolazamide 0
insulin 0
glyburide-metformin 0
glipizide-metformin 0
glimepiride-pioglitazone 0
metformin-rosiglitazone 0
metformin-pioglitazone 0
change 0
diabetesMed 0
readmitted 0
gender 0


## Feature engineering


This part relies on prior knowledge of health care services, so it can be subjective. I referred to the material below and followed its steps in order.

https://www.kaggle.com/code/iabhishekofficial/prediction-on-hospital-readmission#Data-Preparation-&-Exploration



1. Service utilization: The data contains the number of inpatient admissions, emergency room visits, and outpatient visits for each patient in the previous year. These are (crude) measures of how much hospital/clinic services a person has used in the past year. We added these three to create a new variable called service utilization (see figure below). The idea was to see which version, the three separate counts or their sum, gives us better results. Granted, we did not apply any special weighting to the three ingredients of service utilization, but we wanted to try something simple at this stage.

In [15]:
df['service_utilization'] = df['number_outpatient'] + df['number_emergency'] + df['number_inpatient']

2. Number of medication changes: The dataset contains 23 features for 23 drugs (or combinations), each indicating whether that medication was changed during the patient's current hospital stay. Two of them (citoglipton and examide) were dropped above, so 21 remain. Previous research has associated medication changes for diabetics upon admission with lower readmission rates. We counted how many changes were made in total for each encounter and used that count as a new feature. The reasoning was both to simplify the model and to possibly discover a relationship with the number of changes regardless of which drug was changed.

In [16]:
# Drug columns that remain after dropping citoglipton and examide
keys = ['metformin', 'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride', 'glipizide', 'glyburide', 'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'insulin', 'glyburide-metformin', 'tolazamide', 'metformin-pioglitazone','metformin-rosiglitazone', 'glimepiride-pioglitazone', 'glipizide-metformin', 'troglitazone', 'tolbutamide', 'acetohexamide']
# A drug counts as changed (1) unless its value is 'No' or 'Steady';
# numchange is the sum of these flags over all drug columns
df['numchange'] = 0
for col in keys:
    df['numchange'] = df['numchange'] + df[col].apply(lambda x: 0 if (x == 'No' or x == 'Steady') else 1)

df['numchange'].value_counts()

0    70142
1    24922
2     1271
3      106
4        5
Name: numchange, dtype: int64
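
The same count can also be expressed without a Python-level loop. The sketch below uses a made-up two-drug toy frame (the names `toy` and `drug_cols` are assumptions for illustration) and treats any value other than 'No' or 'Steady' as a change, matching the rule used in the cell above.

```python
import pandas as pd

# Toy frame with two drug columns; values follow the dataset's coding.
toy = pd.DataFrame({
    "metformin": ["No", "Up", "Steady", "Down"],
    "insulin": ["Up", "Down", "No", "Steady"],
})
drug_cols = ["metformin", "insulin"]

# True where the drug was changed ('Up' or 'Down'), summed across columns per row.
numchange = (~toy[drug_cols].isin(["No", "Steady"])).sum(axis=1)
print(numchange.tolist())  # [1, 2, 0, 1]
```

On the full data, passing the 21-column `keys` list in place of `drug_cols` should reproduce the `numchange` column.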