# PREDICTING EARLY READMISSIONS IN DIABETIC PATIENTS

### Abstract

Hospital readmissions for diabetic patients present a significant challenge to healthcare systems, leading to increased costs and adverse patient outcomes. Despite advancements in diabetes management, many patients experience early readmissions within 30 days of discharge due to inconsistent glycemic control and inadequate follow-up care. This study utilizes a dataset covering ten years (1999–2008) of clinical care data from 130 U.S. hospitals to develop a predictive model for early readmissions. Using machine learning techniques, we analyze patient demographics, laboratory results, medications, and clinical procedures to identify key factors influencing readmissions. The findings from this research can assist healthcare providers in implementing targeted interventions to reduce readmission rates, improve diabetes management, and enhance overall patient care.

## Introduction

Diabetes is one of the most prevalent chronic diseases worldwide, affecting millions of individuals and placing a significant burden on healthcare systems. In the United States alone, diabetes management accounts for a substantial portion of healthcare expenditures, driven by complications, hospitalizations, and frequent readmissions. Despite advancements in medical care and evidence-based interventions, many diabetic patients continue to experience suboptimal outcomes due to inconsistent management and inadequate follow-up care.

Hospital readmissions, particularly those occurring **within 30 days of discharge**, are a critical concern in diabetes care. These early readmissions not only indicate gaps in patient management but also contribute to increased healthcare costs and poorer patient outcomes. For diabetic patients, unplanned readmissions are often linked to preventable factors such as poor glycemic control, medication non-adherence, and insufficient post-discharge support. Addressing these challenges is essential to improving patient care and reducing the financial strain on healthcare systems.

## Problem statement

Despite advancements in diabetes care, a significant challenge remains: early readmissions within 30 days of discharge for diabetic patients. These readmissions are costly, not only financially, but also in terms of patient outcomes, as they often signal inadequate care and suboptimal glycemic control during the initial hospital stay.

Many factors contribute to these early readmissions, including poor diabetes management, lack of proper follow-up care, and inconsistent patient adherence to treatment protocols. Unfortunately, healthcare providers currently lack reliable predictive tools to identify high-risk patients before discharge, which prevents them from taking timely, preventive action to reduce the likelihood of readmission.

This research aims to leverage clinical data from 130 U.S. hospitals spanning 1999 to 2008 to develop a predictive model for identifying patients at risk of early readmission. By focusing on patient demographics, lab results, medications, and clinical procedures, the goal is to improve patient outcomes, reduce readmission rates, and minimize the financial burden on healthcare systems.

## Objectives

#### Primary Objectives

1. Develop a Predictive Model for Early Readmissions:
Build a machine learning model to accurately predict the likelihood of diabetic patients being readmitted to the hospital within 30 days of discharge, using clinical and demographic data.

2. Identify Key Risk Factors for Early Readmissions:
Analyze patient demographics, laboratory results, medications, and clinical procedures to determine the most significant factors contributing to early readmissions among diabetic patients.

#### Secondary Objectives

3. Evaluate Model Performance and Generalizability:
Assess the predictive model's accuracy, precision, recall, and generalizability to ensure it can be effectively applied across diverse hospital settings and patient populations.

4. Provide Actionable Insights for Healthcare Providers:
Translate the model's findings into practical recommendations for healthcare providers to improve diabetes management, reduce readmission rates, and enhance patient outcomes.

5. Optimize Resource Allocation:
Use the predictive model to help hospitals identify high-risk patients and allocate resources more efficiently, reducing unnecessary healthcare costs and improving the quality of care.
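
The metrics named in objective 3 can be illustrated with a minimal, dependency-free sketch. The helper `classification_metrics` and the label lists below are hypothetical and made up for illustration; they are not results from this study. The label `1` is assumed to mean "readmitted within 30 days".

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, and recall for binary labels (1 = early readmission)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Made-up labels: 2 of 10 patients are readmitted early.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this toy case accuracy looks high while precision and recall are only moderate, which is why accuracy alone may be a weak guide when early readmissions are the minority class.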



## Methodology / Approach

 -  Data collection
 -  Data Preprocessing
 -  Feature Selection
 -  Modeling
 -  Evaluation

## Expected outcomes

## Limitations

1. Data Quality and Completeness:

    - Missing Data: The dataset contains missing or incomplete values, which could bias the analysis or reduce model accuracy. The chosen handling strategy (e.g., imputation or exclusion) could also affect results.
    - Data Inconsistencies: Some variables may have inconsistent entries, requiring extra cleaning and preprocessing. This may affect the robustness of the predictions.

2. Data Representation and Bias:

    - Non-representative Sample: The dataset spans a specific timeframe (1999–2008) and geographic region (U.S. hospitals), which may not fully represent diabetic populations in other countries or regions. As such, the model may not generalize well to other settings.
    - Sampling Bias: Certain types of patients or hospitals may be overrepresented or underrepresented in the dataset, which can skew the results.

3. Model Limitations:

    - Overfitting or Underfitting: The machine learning models may either overfit (perform well on training data but poorly on unseen data) or underfit (not capture complex patterns), especially if hyperparameters are not tuned properly.
    - Model Generalization: The model developed may work well on the current dataset but may not generalize to other patient populations or hospital settings without further validation and adjustment.

4. Limited Scope of Data:

    - Lack of Key Variables: The dataset may not include certain variables that could be important for predicting readmissions, such as patient lifestyle factors, mental health status, or socioeconomic factors like transportation access.
    - Time Frame Limitation: The dataset spans only from 1999 to 2008, which may not account for recent changes in diabetes care practices, healthcare policies, or advancements in treatment.

5. External Factors:

    - Changes in Medical Practices: Since the dataset covers a period from 1999 to 2008, it may not account for recent medical advancements, including new medications, technologies, or treatment protocols, which could influence readmission rates.
    - Health System Variability: Variability in the healthcare systems across different hospitals and states can affect readmission rates, and not all hospitals in the dataset may have the same level of care, follow-up, or resources for managing diabetes.

6. Ethical and Privacy Concerns:

    - Data Privacy: Using healthcare data comes with privacy and ethical considerations. Although the data are anonymized, ethical use of patient data and confidentiality remain crucial.
    - Bias in Decision-Making: Machine learning models are only as good as the data fed into them. If there are biases in the data (e.g., certain demographic groups are underrepresented), the model could perpetuate or even amplify those biases.

7. Interpretation and Clinical Relevance:

    - Lack of Causality: Although predictive models can identify correlations, they cannot establish causality. Therefore, the identified risk factors for readmissions may not necessarily be causal but simply associated.
    - Clinically Meaningful Results: While predictive models may identify risk factors, translating these into actionable clinical interventions may require further expertise and validation from healthcare professionals.    
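
One way to probe the overfitting and generalization concerns in item 3 is to hold out whole patients rather than individual encounters. The sketch below assumes that `patient_nbr` identifies a patient and can repeat across encounters; the helper `split_by_patient`, its parameters, and the toy values are hypothetical and are not part of the original analysis.

```python
import numpy as np
import pandas as pd


def split_by_patient(frame, id_col="patient_nbr", test_fraction=0.25, seed=0):
    """Split rows so that each patient lands entirely in train or entirely in test."""
    rng = np.random.default_rng(seed)
    # Shuffle the distinct patient identifiers, not the rows.
    patient_ids = rng.permutation(frame[id_col].unique())
    n_test = max(1, int(round(len(patient_ids) * test_fraction)))
    test_ids = list(patient_ids[:n_test])
    in_test = frame[id_col].isin(test_ids)
    return frame[~in_test], frame[in_test]


# Toy encounters (made-up values): patients 1, 3 and 5 have two encounters each.
toy = pd.DataFrame({
    "encounter_id": range(8),
    "patient_nbr": [1, 1, 2, 3, 3, 4, 5, 5],
})
train, test = split_by_patient(toy)
print(sorted(train["patient_nbr"].unique()), sorted(test["patient_nbr"].unique()))
```

Keeping every encounter of a patient on one side of the split avoids scoring the model on patients it has already seen. scikit-learn's `GroupShuffleSplit` provides a comparable grouped split.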

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------

# 1. Data loading and inspection

In [28]:
# Importing the necessary libraries to facilitate data loading

import numpy as np
import pandas as pd

In [29]:
# Load the dataset and preview the first rows.

df = pd.read_csv("../diabetic_data.csv")

df.head()

|   | encounter_id | patient_nbr | race | gender | age | weight | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | ... | citoglipton | insulin | glyburide-metformin | glipizide-metformin | glimepiride-pioglitazone | metformin-rosiglitazone | metformin-pioglitazone | change | diabetesMed | readmitted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2278392 | 8222157 | Caucasian | Female | [0-10) | ? | 6 | 25 | 1 | 1 | ... | No | No | No | No | No | No | No | No | No | NO |
| 1 | 149190 | 55629189 | Caucasian | Female | [10-20) | ? | 1 | 1 | 7 | 3 | ... | No | Up | No | No | No | No | No | Ch | Yes | >30 |
| 2 | 64410 | 86047875 | AfricanAmerican | Female | [20-30) | ? | 1 | 1 | 7 | 2 | ... | No | No | No | No | No | No | No | No | Yes | NO |
| 3 | 500364 | 82442376 | Caucasian | Male | [30-40) | ? | 1 | 1 | 7 | 2 | ... | No | Up | No | No | No | No | No | Ch | Yes | NO |
| 4 | 16680 | 42519267 | Caucasian | Male | [40-50) | ? | 1 | 1 | 7 | 1 | ... | No | Steady | No | No | No | No | No | Ch | Yes | NO |


In [30]:
# Determine the shape of the dataset.
rows, columns = df.shape

print(f'The dataset has {rows} rows and {columns} columns')

The dataset has 101766 rows and 50 columns


In [31]:
# Display the column names.
df.columns

Index(['encounter_id', 'patient_nbr', 'race', 'gender', 'age', 'weight',
       'admission_type_id', 'discharge_disposition_id', 'admission_source_id',
       'time_in_hospital', 'payer_code', 'medical_specialty',
       'num_lab_procedures', 'num_procedures', 'num_medications',
       'number_outpatient', 'number_emergency', 'number_inpatient', 'diag_1',
       'diag_2', 'diag_3', 'number_diagnoses', 'max_glu_serum', 'A1Cresult',
       'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
       'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide-metformin', 'glipizide-metformin',
       'glimepiride-pioglitazone', 'metformin-rosiglitazone',
       'metformin-pioglitazone', 'change', 'diabetesMed', 'readmitted'],
      dtype='object')

In [32]:
#structural summary of the dataset.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101766 entries, 0 to 101765
Data columns (total 50 columns):
 #   Column                    Non-Null Count   Dtype 
---  ------                    --------------   ----- 
 0   encounter_id              101766 non-null  int64 
 1   patient_nbr               101766 non-null  int64 
 2   race                      101766 non-null  object
 3   gender                    101766 non-null  object
 4   age                       101766 non-null  object
 5   weight                    101766 non-null  object
 6   admission_type_id         101766 non-null  int64 
 7   discharge_disposition_id  101766 non-null  int64 
 8   admission_source_id       101766 non-null  int64 
 9   time_in_hospital          101766 non-null  int64 
 10  payer_code                101766 non-null  object
 11  medical_specialty         101766 non-null  object
 12  num_lab_procedures        101766 non-null  int64 
 13  num_procedures            101766 non-null  int64 
 14  num_

In [33]:
#statistical summary of the dataset.
df.describe()

|   | encounter_id | patient_nbr | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | num_lab_procedures | num_procedures | num_medications | number_outpatient | number_emergency | number_inpatient | number_diagnoses |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 101766.0 | 101766.0 | 101766.0 | 101766.0 | 101766.0 | 101766.0 | 101766.0 | 101766.0 | 101766.0 | 101766.0 | 101766.0 | 101766.0 | 101766.0 |
| mean | 165201600.0 | 54330400.0 | 2.024006 | 3.715642 | 5.754437 | 4.395987 | 43.095641 | 1.33973 | 16.021844 | 0.369357 | 0.197836 | 0.635566 | 7.422607 |
| std | 102640300.0 | 38696360.0 | 1.445403 | 5.280166 | 4.064081 | 2.985108 | 19.674362 | 1.705807 | 8.127566 | 1.267265 | 0.930472 | 1.262863 | 1.9336 |
| min | 12522.0 | 135.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 25% | 84961190.0 | 23413220.0 | 1.0 | 1.0 | 1.0 | 2.0 | 31.0 | 0.0 | 10.0 | 0.0 | 0.0 | 0.0 | 6.0 |
| 50% | 152389000.0 | 45505140.0 | 1.0 | 1.0 | 7.0 | 4.0 | 44.0 | 1.0 | 15.0 | 0.0 | 0.0 | 0.0 | 8.0 |
| 75% | 230270900.0 | 87545950.0 | 3.0 | 4.0 | 7.0 | 6.0 | 57.0 | 2.0 | 20.0 | 0.0 | 0.0 | 1.0 | 9.0 |
| max | 443867200.0 | 189502600.0 | 8.0 | 28.0 | 25.0 | 14.0 | 132.0 | 6.0 | 81.0 | 42.0 | 76.0 | 21.0 | 16.0 |


In [34]:
# Check the distribution of the target variable.
df['readmitted'].value_counts()

readmitted
NO     54864
>30    35545
<30    11357
Name: count, dtype: int64
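
Since the study targets readmission within 30 days, it may help to express these counts as class shares. The sketch below rebuilds the counts from the output above instead of reading the file; the variable names are illustrative, and the binary-target line in the comment is an assumption about how the label might later be encoded, not a step taken in this notebook.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({"NO": 54864, ">30": 35545, "<30": 11357}, name="count")

# Share of each class across all encounters.
shares = counts / counts.sum()
print(shares.round(4))

# Assumed binary target: 1 if readmitted within 30 days, else 0.
# On the real frame this could be: (df["readmitted"] == "<30").astype(int)
early_share = shares["<30"]
print(f"Early readmissions: {early_share:.1%} of encounters")
```

Early readmissions make up roughly 11% of encounters, so the positive class is a clear minority.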

### Observations drawn from data inspection


-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

# 2. Data cleaning

## 2.1 Checking and handling columns with missing values

In [35]:
df.columns

Index(['encounter_id', 'patient_nbr', 'race', 'gender', 'age', 'weight',
       'admission_type_id', 'discharge_disposition_id', 'admission_source_id',
       'time_in_hospital', 'payer_code', 'medical_specialty',
       'num_lab_procedures', 'num_procedures', 'num_medications',
       'number_outpatient', 'number_emergency', 'number_inpatient', 'diag_1',
       'diag_2', 'diag_3', 'number_diagnoses', 'max_glu_serum', 'A1Cresult',
       'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
       'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide-metformin', 'glipizide-metformin',
       'glimepiride-pioglitazone', 'metformin-rosiglitazone',
       'metformin-pioglitazone', 'change', 'diabetesMed', 'readmitted'],
      dtype='object')

In [36]:
#checking for missing values

missing_values = df.isnull().sum()

#printing missing values

print(missing_values[missing_values>0])

max_glu_serum    96420
A1Cresult        84748
dtype: int64
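
The preview in section 1 shows `?` in the `weight` column, which `isnull()` does not count because it is an ordinary string. The sketch below assumes that `?` marks an unrecorded value; the toy frame and its values (including the `[75-100)` weight bucket) are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the preview: "?" appears where a value was not recorded.
toy = pd.DataFrame({
    "race": ["Caucasian", "?", "AfricanAmerican", "Caucasian"],
    "weight": ["?", "?", "?", "[75-100)"],
    "time_in_hospital": [1, 3, 2, 2],
})

# isnull() reports nothing here because "?" is a regular string.
print(toy.isnull().sum().sum())

# Treat "?" as missing, then count again.
cleaned = toy.replace("?", np.nan)
missing = cleaned.isnull().sum()
print(missing[missing > 0])
```

The same effect can be had at load time by passing `na_values="?"` to `pd.read_csv`.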


In [37]:
#creating a function to check the distribution of columns with missing values

def print_value_counts(df, column):
    if column in df.columns:
        print(df[column].value_counts(dropna=False))  #includes NaN values
    else:
        print(f'Column {column} was not found in the DataFrame')


In [38]:
print_value_counts(df, "max_glu_serum") 

max_glu_serum
NaN     96420
Norm     2597
>200     1485
>300     1264
Name: count, dtype: int64


In [39]:
print_value_counts(df, "A1Cresult") 

A1Cresult
NaN     84748
>8       8216
Norm     4990
>7       3812
Name: count, dtype: int64


In [40]:
# Since most entries in the two columns displayed above are missing,
# we remove these columns to reduce the complexity of the dataset.

df.drop(['A1Cresult','max_glu_serum'],axis=1, inplace=True)
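
The decision to drop these columns can be made explicit by computing the share of missing entries. The sketch below reuses the counts printed above; the 0.8 cut-off is a hypothetical threshold chosen for illustration, not a value used by the original analysis.

```python
import pandas as pd

n_rows = 101766  # total rows, from df.shape above
missing_counts = pd.Series({"max_glu_serum": 96420, "A1Cresult": 84748})

# Fraction of rows with no recorded value in each column.
missing_share = missing_counts / n_rows
print(missing_share.round(3))

threshold = 0.8  # hypothetical cut-off for "too sparse to keep"
to_drop = missing_share[missing_share > threshold].index.tolist()
print(to_drop)
```

Both columns exceed the assumed threshold, which is consistent with removing them.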

## 2.2 Checking and handling duplicate columns

In [45]:
# Create a function to check whether the dataset has duplicate columns.
def find_duplicate_columns(df):
    
    duplicate_columns = {}
    columns = df.columns

    for i in range(len(columns)):
        for j in range(i + 1, len(columns)):  # Compare each column with the rest
            if df[columns[i]].equals(df[columns[j]]):  # Check if two columns are identical
                duplicate_columns[columns[j]] = columns[i]  # Store duplicate mapping

    return duplicate_columns


In [46]:
# printing the duplicate column names
duplicates = find_duplicate_columns(df)
if duplicates:
    print("Duplicate columns found:")
    for duplicate, original in duplicates.items():
        print(f"'{duplicate}' is a duplicate of '{original}'")
else:
    print("No duplicate columns found.")

Duplicate columns found:
'citoglipton' is a duplicate of 'examide'


In [47]:
#dropping the duplicated columns to improve clarity and usability.

df.drop(['citoglipton'],axis=1, inplace=True)


In [49]:
# rerunning the code to ensure that the duplicated column has been removed from the dataset.
duplicates = find_duplicate_columns(df)
if duplicates:
    print("Duplicate columns found:")
    for duplicate, original in duplicates.items():
        print(f"'{duplicate}' is a duplicate of '{original}'")
else:
    print("No duplicate columns found.")

No duplicate columns found.
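
Two columns can be identical simply because each holds a single repeated value, in which case the remaining column may carry no information either. Whether that applies to `examide` in the real data is an assumption to verify; the toy frame below uses made-up values to show one possible check.

```python
import pandas as pd

# Toy frame (made-up values): "examide" never varies, the other columns do.
toy = pd.DataFrame({
    "examide": ["No", "No", "No", "No"],
    "insulin": ["No", "Up", "Steady", "No"],
    "tolazamide": ["No", "No", "Steady", "No"],
})

# A column with a single distinct value cannot help separate patients.
n_unique = toy.nunique(dropna=False)
constant_columns = n_unique[n_unique <= 1].index.tolist()
print(constant_columns)
```

On the real frame, `df.nunique(dropna=False)` would give the same per-column counts.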


--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

# 3. Data Preprocessing.