
## Exploratory Data Analysis – Hospital Readmission Risk in Diabetic Patients


Author: muhameed razin v

Duration: 5 Days

Tools: Python, pandas, numpy, matplotlib, seaborn

## Overview

Hospital readmissions within 30 days are a critical concern for healthcare systems: they increase financial burden and resource utilization, and they can indicate inadequate post-discharge care. Diabetic patients in particular are at higher risk of complications that lead to readmission.

This project performs a structured exploratory data analysis (EDA) on hospital encounter data to identify patterns and factors influencing readmission risk. The goal is to generate data-driven insights that support early risk identification, improve discharge planning, and assist healthcare management in reducing avoidable readmissions.

## Day 1: Understanding & Initial Profiling

## Dataset Description

- Contents: hospital encounters of diabetic patients
- Source: 130 US hospitals
- Period: 1999–2008
- Granularity: each row represents one hospital visit
- Goal: analyze factors affecting readmission


In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [7]:
df = pd.read_csv(r"C:\Users\Hp\OneDrive\Desktop\EDA PROJECT\diabetic_data.csv")
df.head()


Unnamed: 0,encounter_id,patient_nbr,race,gender,age,admission_type_id,discharge_disposition_id,admission_source_id,days_in_hospital,payer_code,...,number_diagnoses,metformin,repaglinide,glimepiride,glyburide,pioglitazone,rosiglitazone,acarbose,insulin,readmitted
0,72091308,20123568,Caucasian,Female,[70-80),1,22,7,7,MC,...,9,No,No,No,No,Up,No,No,Steady,NO
1,72848634,20377854,Caucasian,Female,[60-70),2,1,1,3,MC,...,6,No,No,No,No,No,No,No,Steady,NO
2,73062156,20408121,Caucasian,Female,[90-100),1,1,7,4,MC,...,6,No,No,Steady,No,No,No,No,No,NO
3,73731852,20542797,Caucasian,Male,[70-80),1,2,7,10,MC,...,6,Steady,No,No,No,No,No,No,Steady,NO
4,81355914,7239654,Caucasian,Female,[70-80),1,3,6,12,UN,...,5,No,No,No,No,No,No,No,Steady,NO


In [8]:
df.tail()

Unnamed: 0,encounter_id,patient_nbr,race,gender,age,admission_type_id,discharge_disposition_id,admission_source_id,days_in_hospital,payer_code,...,number_diagnoses,metformin,repaglinide,glimepiride,glyburide,pioglitazone,rosiglitazone,acarbose,insulin,readmitted
27910,443739044,106595208,Caucasian,Male,[70-80),2,6,7,6,MC,...,9,No,No,No,No,No,No,No,Up,NO
27911,443793668,47293812,Caucasian,Male,[80-90),1,13,7,3,MC,...,9,No,No,No,Up,No,Steady,No,Down,NO
27912,443804570,33230016,Caucasian,Female,[70-80),1,22,7,8,MC,...,9,No,No,No,No,No,No,No,Steady,>30
27913,443816024,106392411,Caucasian,Female,[70-80),3,6,1,3,MC,...,9,Steady,No,No,No,No,No,No,Steady,NO
27914,443857166,31693671,Caucasian,Female,[80-90),2,3,7,10,MC,...,9,No,No,No,No,Steady,No,No,Up,NO


In [9]:
df.shape

(27915, 30)

In [10]:
df.columns

Index(['encounter_id', 'patient_nbr', 'race', 'gender', 'age',
       'admission_type_id', 'discharge_disposition_id', 'admission_source_id',
       'days_in_hospital', 'payer_code', 'doctors_dep', 'num_lab_procedures',
       'num_procedures', 'num_medications', 'number_outpatient',
       'number_emergency', 'number_inpatient', 'primary_diagnosis',
       'secondary_diagnosis', 'additional_diagnosis', 'number_diagnoses',
       'metformin', 'repaglinide', 'glimepiride', 'glyburide', 'pioglitazone',
       'rosiglitazone', 'acarbose', 'insulin', 'readmitted'],
      dtype='object')

In [11]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27915 entries, 0 to 27914
Data columns (total 30 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   encounter_id              27915 non-null  int64 
 1   patient_nbr               27915 non-null  int64 
 2   race                      27915 non-null  object
 3   gender                    27915 non-null  object
 4   age                       27915 non-null  object
 5   admission_type_id         27915 non-null  int64 
 6   discharge_disposition_id  27915 non-null  int64 
 7   admission_source_id       27915 non-null  int64 
 8   days_in_hospital          27915 non-null  int64 
 9   payer_code                27915 non-null  object
 10  doctors_dep               27915 non-null  object
 11  num_lab_procedures        27915 non-null  int64 
 12  num_procedures            27915 non-null  int64 
 13  num_medications           27915 non-null  int64 
 14  number_outpatient     

In [12]:
df.describe()

Unnamed: 0,encounter_id,patient_nbr,admission_type_id,discharge_disposition_id,admission_source_id,days_in_hospital,num_lab_procedures,num_procedures,num_medications,number_outpatient,number_emergency,number_inpatient,number_diagnoses
count,27915.0,27915.0,27915.0,27915.0,27915.0,27915.0,27915.0,27915.0,27915.0,27915.0,27915.0,27915.0,27915.0
mean,184115800.0,55998670.0,2.011965,2.911374,4.94057,4.294107,40.745227,1.458786,16.084077,0.28268,0.301702,0.655275,7.240587
std,83946160.0,37362420.0,0.940823,4.269688,3.512699,2.942521,19.871317,1.743732,8.579285,0.983899,1.414151,1.348572,1.973913
min,72091310.0,729.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
25%,119224000.0,24001270.0,1.0,1.0,1.0,2.0,29.0,0.0,10.0,0.0,0.0,0.0,5.0
50%,160755800.0,43228580.0,2.0,1.0,7.0,4.0,42.0,1.0,15.0,0.0,0.0,0.0,8.0
75%,230287000.0,91853230.0,3.0,3.0,7.0,6.0,54.0,2.0,20.0,0.0,0.0,1.0,9.0
max,443857200.0,189365900.0,6.0,28.0,22.0,14.0,132.0,6.0,81.0,38.0,76.0,16.0,16.0
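
The summary statistics above point to heavy right skew in the visit-count columns: `number_emergency`, for example, has a 75th percentile of 0 and a maximum of 76. One common way to quantify this is the 1.5 × IQR rule. The sketch below is an assumption about how that check could be done, not a step from the original notebook; the helper name `iqr_outlier_counts` and the toy frame are illustrative only.

```python
import pandas as pd


def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the frame's columns.
    outside = (numeric < (q1 - k * iqr)) | (numeric > (q3 + k * iqr))
    return outside.sum().sort_values(ascending=False)


# Toy data (made up) that mimics a mostly-zero count column.
toy = pd.DataFrame({
    "number_emergency": [0, 0, 0, 0, 0, 0, 0, 76],
    "days_in_hospital": [1, 2, 3, 4, 4, 5, 6, 7],
})
counts = iqr_outlier_counts(toy)
print(counts)
```

For columns whose quartiles are both 0, the IQR is 0 and every non-zero value is flagged, so the counts for such columns are better read as "how many encounters are non-zero" than as true outliers.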


In [13]:
df.nunique().sort_values()

gender                          2
readmitted                      3
acarbose                        4
rosiglitazone                   4
metformin                       4
pioglitazone                    4
glyburide                       4
glimepiride                     4
insulin                         4
repaglinide                     4
race                            6
admission_type_id               6
num_procedures                  7
age                            10
admission_source_id            14
days_in_hospital               14
number_diagnoses               15
payer_code                     16
number_inpatient               17
number_outpatient              20
discharge_disposition_id       23
number_emergency               31
doctors_dep                    57
num_medications                75
num_lab_procedures            110
primary_diagnosis             576
secondary_diagnosis           583
additional_diagnosis          594
patient_nbr                 20367
encounter_id  

In [14]:
df.isna().sum().sort_values(ascending=False)

encounter_id                0
patient_nbr                 0
insulin                     0
acarbose                    0
rosiglitazone               0
pioglitazone                0
glyburide                   0
glimepiride                 0
repaglinide                 0
metformin                   0
number_diagnoses            0
additional_diagnosis        0
secondary_diagnosis         0
primary_diagnosis           0
number_inpatient            0
number_emergency            0
number_outpatient           0
num_medications             0
num_procedures              0
num_lab_procedures          0
doctors_dep                 0
payer_code                  0
days_in_hospital            0
admission_source_id         0
discharge_disposition_id    0
admission_type_id           0
age                         0
gender                      0
race                        0
readmitted                  0
dtype: int64
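
`isna()` counts only true `NaN` values. Text columns sometimes encode missing entries with a placeholder string instead, which would not show up in the output above. Whether this file does so is not shown here, so the token `"?"` below is an assumption to verify against the data, and `placeholder_counts` is a hypothetical helper.

```python
import pandas as pd


def placeholder_counts(frame: pd.DataFrame, tokens=("?",)) -> pd.Series:
    """Count placeholder tokens per non-numeric column; keep columns with hits."""
    text_cols = frame.select_dtypes(exclude="number")
    hits = text_cols.isin(list(tokens)).sum()
    return hits[hits > 0].sort_values(ascending=False)


# Toy data (made up) with "?" standing in for missing values.
toy = pd.DataFrame({
    "race": ["Caucasian", "?", "Caucasian"],
    "payer_code": ["MC", "?", "?"],
    "days_in_hospital": [7, 3, 4],
})
found = placeholder_counts(toy)
print(found)
```

On the real data this would be called as `placeholder_counts(df)`; an empty result means none of the assumed tokens occur.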

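## Initial Observations

- `age` is stored as categorical ranges (for example `[70-80)`), not as a number.
- Some categorical columns have many unique categories: the three diagnosis columns each have more than 570 distinct values, and `doctors_dep` has 57.
- Numeric columns may contain outliers: `number_emergency` reaches 76 and `number_outpatient` reaches 38, while the 75th percentile of both is 0.
- The target variable `readmitted` has three classes and may be imbalanced.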
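
Because `age` is stored as ten-year bins, numeric summaries or correlations need a numeric stand-in. A minimal sketch, assuming every label follows the `[lo-hi)` pattern seen in the `head()` and `tail()` output, maps each bin to its midpoint. The name `age_midpoint` is illustrative and not part of the original notebook.

```python
import pandas as pd


def age_midpoint(age_labels: pd.Series) -> pd.Series:
    """Convert labels like '[70-80)' to the bin midpoint (75.0)."""
    # Two capture groups give a two-column frame: lower and upper bound.
    bounds = age_labels.str.extract(r"\[(\d+)-(\d+)\)").astype(float)
    return (bounds[0] + bounds[1]) / 2


# Labels taken from the rows shown earlier in the notebook.
ages = pd.Series(["[70-80)", "[60-70)", "[90-100)"])
mid = age_midpoint(ages)
print(mid)
```

Labels that do not match the pattern come back as `NaN`, which makes unexpected formats easy to spot with `mid.isna().sum()`.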

In [16]:
df['readmitted'].value_counts()

readmitted
NO     16006
>30     8947
<30     2962
Name: count, dtype: int64
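
Raw counts are easier to judge as proportions. The sketch below recomputes class shares from the counts printed above and shows one possible binary target for 30-day readmission. Collapsing `NO` and `>30` into a single negative class is an assumption based on the project's stated focus on readmissions within 30 days, and `within_30_days` is a hypothetical helper.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({"NO": 16006, ">30": 8947, "<30": 2962}, name="count")
shares = (counts / counts.sum()).round(3)
print(shares)


def within_30_days(readmitted: pd.Series) -> pd.Series:
    """Return 1 where the encounter was followed by a readmission in <30 days."""
    return (readmitted == "<30").astype(int)


# Small made-up sample of labels to show the mapping.
labels = pd.Series(["NO", ">30", "<30", "NO"])
target = within_30_days(labels)
print(target.tolist())
```

With these counts, roughly 10.6% of encounters fall in the `<30` class, so a binary 30-day target would be clearly imbalanced.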

## Day 1 Summary

- Dataset loaded successfully
- Structure examined
- Numerical and categorical variables separated
- Target variable distribution analyzed
- Data quality issues logged
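
The cell that separates numerical and categorical variables is not shown above. One plausible approach, sketched here as an assumption, uses `select_dtypes` and then pulls integer-coded identifiers out of the numeric group, since columns such as `admission_type_id` are codes rather than quantities. The `ID_OR_CODE` list is a guess based on the column names, and `split_columns` is a hypothetical helper.

```python
import pandas as pd

# Assumed identifier / code columns: integer dtype, but not quantities.
ID_OR_CODE = [
    "encounter_id", "patient_nbr", "admission_type_id",
    "discharge_disposition_id", "admission_source_id",
]


def split_columns(frame: pd.DataFrame, id_or_code=ID_OR_CODE) -> dict:
    """Group column names into numeric, categorical, and id/code lists."""
    ids = [c for c in frame.columns if c in id_or_code]
    numeric = [
        c for c in frame.select_dtypes(include="number").columns
        if c not in ids
    ]
    categorical = [c for c in frame.columns if c not in ids and c not in numeric]
    return {"numeric": numeric, "categorical": categorical, "id_or_code": ids}


# Toy frame (made up) using a few of the real column names.
toy = pd.DataFrame({
    "encounter_id": [1, 2],
    "race": ["Caucasian", "Caucasian"],
    "admission_type_id": [1, 2],
    "days_in_hospital": [7, 3],
    "insulin": ["Steady", "No"],
})
groups = split_columns(toy)
print(groups)
```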

In [18]:
df.to_csv("day1_output.csv", index=False)