# **Healthcare - Patient Readmission Analysis**

## **Objective**  
<small>Analyze hospital data of diabetic patients to identify factors impacting patient readmission within 30 days. Use EDA to extract key insights.</small>

## **Dataset**  
<small>
- Source: Diabetes 130 US hospitals dataset (Kaggle)  
- Records: ~100,000 hospital admissions  
- Target Variable: `readmitted` (<30, >30, NO)
</small>

## **Tools & Libraries**  
<small>
- Python  
- Pandas, NumPy - Data cleaning & manipulation  
- Matplotlib, Seaborn - Data visualization
</small>

## **Workflow**  

### 1. Data Understanding  
<small>- Load dataset  
- Check shape, columns, missing values  
- Explore target variable distribution</small>  

### 2. Data Cleaning  
<small>- Handle missing values  
- Remove duplicates  
- Convert categorical variables</small>  

### 3. Exploratory Data Analysis (EDA)  
<small>- Visualize target variable distribution  
- Explore numerical features (histograms, boxplots)  
- Explore categorical features (countplots)  
- Correlation analysis</small>  

### 4. Insights & Conclusion  
<small>- Summarize key findings  
- Highlight factors impacting readmission  
- Optional: save cleaned dataset</small>


In [None]:
# **Data Handling**

import pandas as pd
import numpy as np

# **Visualization**

import matplotlib.pyplot as plt
import seaborn as sns



In [None]:
df = pd.read_csv('/content/diabetic_data.csv')
df

# Data Understanding:


In [None]:
# **accessing first 5 rows of the data**


print("First 5 rows of the dataset are :")
df.head()


In [None]:
# **Structure of Data**
print("Structure of the dataset is :")


df.info()

In [None]:
# **Summary of Numerical Data**

print("Summary of Numerical data from dataset is :  ")
df.describe()

In [None]:
# **Checking null values in the dataset**

print("Checking null values in the dataset :")
df.isnull().sum()

In [None]:
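# Count the '?' placeholders that the dataset uses for missing values
for col in df.columns:
    placeholder_count = (df[col] == '?').sum()
    if placeholder_count > 0:
        print(f"Column '{col}' has '?' values. Count: {placeholder_count}")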


# Data Cleaning

In [None]:
# ** Replace all '?' placeholders with NaN **

# Assign the result back instead of using inplace=True, so df is never set to None
df = df.replace('?', np.nan)

df

In [None]:
# ** Count all missing values **

print("Count of missing values in each column:")
df.isnull().sum()
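
Raw counts are hard to compare across columns, so a percentage view can help decide which columns are worth keeping. The following is a minimal sketch on a small made-up frame (the values are illustrative, not taken from the dataset); `missing_summary` is a hypothetical helper, not a pandas function.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column, worst first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    # Keep only columns that actually have gaps, sorted by severity.
    return summary[summary["missing"] > 0].sort_values("percent", ascending=False)


# Toy data standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "weight": ["?", "?", "?", "[75-100)"],
    "race": ["Caucasian", "?", "Asian", "Hispanic"],
    "time_in_hospital": [1, 3, 2, 5],
}).replace("?", np.nan)

print(missing_summary(toy))
```

On the real data, the same helper could be called as `missing_summary(df)` once the `'?'` placeholders have been replaced.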

In [None]:
# Drop columns where most values are missing, starting with 'weight'

df.drop('weight', axis=1, inplace=True)
df

In [None]:
df.head(30)

In [None]:
df.head()

In [None]:
# 'medical_specialty' also has a large share of NaN values, so drop it as well

df.drop('medical_specialty', axis=1, inplace=True)
df
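
Dropping columns one by one works, but the decision can also be written as an explicit rule. The sketch below drops every column whose share of missing values exceeds a cut-off. The 40% threshold and the toy values are assumptions made for illustration; the notebook itself does not state a threshold.

```python
import numpy as np
import pandas as pd

# Assumed cut-off: drop a column when more than 40% of its values are missing.
# The 40% figure is an illustrative choice, not one taken from the notebook.
MISSING_THRESHOLD = 0.40

# Toy data standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "weight": [np.nan, np.nan, np.nan, "[75-100)", np.nan],
    "medical_specialty": [np.nan, "Cardiology", np.nan, np.nan, "Surgery"],
    "race": ["Caucasian", np.nan, "Asian", "Hispanic", "Other"],
    "num_medications": [12, 18, 9, 21, 15],
})

# isna().mean() gives the fraction of missing values in each column.
missing_fraction = toy.isna().mean()
to_drop = missing_fraction[missing_fraction > MISSING_THRESHOLD].index.tolist()
cleaned = toy.drop(columns=to_drop)

print("Dropped:", to_drop)
print("Remaining:", cleaned.columns.tolist())
```

Writing the rule down makes it easy to see how the set of dropped columns changes when the threshold is moved.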

In [None]:
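# Final shape and structure of the dataset
# (only the last expression in a cell is displayed, so print the others)

print("Shape:", df.shape)
print(df.describe())
df.info()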
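
The workflow lists duplicate removal as a cleaning step. A minimal sketch of that check follows, assuming the dataset's `encounter_id` and `patient_nbr` identifier columns; the toy rows are made up. It separates exact duplicate rows from patients who simply have more than one encounter.

```python
import pandas as pd

# Toy data standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "encounter_id": [101, 102, 103, 103, 104],
    "patient_nbr": [1, 1, 2, 2, 3],
    "readmitted": ["NO", "<30", ">30", ">30", "NO"],
})

# Rows that repeat an earlier row in every column.
exact_duplicates = int(toy.duplicated().sum())
deduplicated = toy.drop_duplicates().reset_index(drop=True)

# Patients with more than one encounter are not duplicates, but their rows
# are not independent observations either.
encounters_per_patient = deduplicated.groupby("patient_nbr")["encounter_id"].count()
repeat_patients = int((encounters_per_patient > 1).sum())

print(f"Exact duplicate rows removed: {exact_duplicates}")
print(f"Patients with multiple encounters: {repeat_patients}")
```

Whether to keep only one encounter per patient is a separate analysis decision; this sketch only reports how many patients it would affect.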

# Exploratory Data Analysis (EDA)

## Relational Plots

In [None]:
df.head()

In [None]:
sns.scatterplot(data=df, x='time_in_hospital', y='num_lab_procedures')


## This scatter plot shows the relationship between a patient's time in the hospital and the number of lab procedures they receive.

 There is no clear correlation between the two variables.

 The number of lab procedures performed for patients is widely varied, regardless of how many days they spend in the hospital.

In [None]:
# Line plot: mean num_medications for each value of time_in_hospital
sns.lineplot(data=df, x='time_in_hospital', y='num_medications')

## This line plot shows the relationship between a patient's time in the hospital and the number of medications they receive.

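 There is a clear positive relationship between the two variables.

 As the length of a hospital stay increases, the average number of medications administered to a patient also increases. The rise is steeper in the initial days of the hospital stay. Because the line shows the mean at each length of stay, it describes the average trend rather than the patient-to-patient spread.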
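
Correlation analysis is part of the planned workflow, and a numeric check can back up what the scatter and line plots suggest. The sketch below computes Pearson correlations and the per-day mean on a small made-up frame; the column names match the dataset, but the values are invented for illustration.

```python
import pandas as pd

# Toy data standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "time_in_hospital":   [1, 2, 3, 4, 5, 6, 7, 8],
    "num_lab_procedures": [40, 35, 50, 42, 38, 55, 41, 47],
    "num_medications":    [8, 10, 13, 14, 17, 18, 22, 23],
})

# Pearson correlation between every pair of numeric columns.
corr = toy.corr(numeric_only=True)
print(corr.round(2))

# Mean medications per length of stay: the quantity sns.lineplot draws by default.
mean_meds = toy.groupby("time_in_hospital")["num_medications"].mean()
print(mean_meds)
```

A rising mean line and a high row-level correlation are different things: the mean can climb steadily even when individual patients vary widely around it, so it helps to report both.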

## Distributional Plots

In [None]:
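# Histogram: frequency of each 'age' bracket
# 'age' holds text brackets such as '[70-80)', so sort the rows to keep the x-axis
# in age order; a KDE curve is not meaningful for categories and is omitted.
sns.histplot(data=df.sort_values('age'), x='age')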

## This graph illustrates the age distribution of the diabetic patients in the dataset.

The majority of patients are in the older age groups, with the largest group being between 70 and 80 years old.

The number of patients increases steadily with age, peaking in the [70-80) age range, before slightly declining in the older age groups.

The dataset contains a very small number of patients in the younger age categories, particularly below the age of 40.
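
Since `age` is stored as text brackets rather than numbers, converting it to an ordered categorical is one way to carry out the "convert categorical variables" step and keep the brackets in age order. This is a sketch on made-up values; the bracket labels follow the `[70-80)` format quoted above.

```python
import pandas as pd

# Age brackets in ascending order: '[0-10)', '[10-20)', ..., '[90-100)'.
age_order = [f"[{low}-{low + 10})" for low in range(0, 100, 10)]

# Toy data standing in for the real column (values are made up).
toy = pd.DataFrame(
    {"age": ["[70-80)", "[50-60)", "[70-80)", "[80-90)", "[20-30)", "[60-70)"]}
)

# An ordered categorical keeps brackets in age order in tables and plots.
toy["age"] = pd.Categorical(toy["age"], categories=age_order, ordered=True)

# sort=False keeps the category order instead of sorting by frequency.
age_counts = toy["age"].value_counts(sort=False)
print(age_counts)
```

With the column typed this way, comparisons such as `toy["age"] >= "[70-80)"` also become possible.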

In [None]:
# Boxplot: distribution of 'num_medications' for each 'readmitted' category, showing medians and outliers.

sns.boxplot(data = df, x = 'readmitted', y = 'num_medications')

## This box plot shows the relationship between the number of medications a patient is on and their readmission status.

The median number of medications is nearly identical across all three groups (NO, >30, <30), sitting at approximately 15.

The overall spread of the data (the box and whiskers) is also very similar for all categories.

All three groups contain outliers, indicating that a small number of patients in each category are on a significantly higher number of medications.
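
Reading medians off a box plot is approximate, so it can help to print the quartiles behind each box. The sketch below does this on a small made-up frame; the column names match the dataset, but the values are invented.

```python
import pandas as pd

# Toy data standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "readmitted": ["NO", "NO", "NO", ">30", ">30", ">30", "<30", "<30", "<30"],
    "num_medications": [10, 15, 20, 12, 15, 18, 14, 16, 21],
})

# Quartiles per readmission group: the numbers behind each box.
quartiles = (
    toy.groupby("readmitted")["num_medications"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)

# Interquartile range: the height of each box.
quartiles["iqr"] = quartiles[0.75] - quartiles[0.25]
print(quartiles)
```

If the medians and interquartile ranges are close across groups on the real data, that supports the reading that medication count alone does not separate the readmission categories.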

## Categorical Plots

In [None]:
# Countplot : Showing 'Gender' distribution

sns.countplot(data = df, x = 'gender')

## This graph illustrates the gender distribution of the patient dataset.

The number of female patients is slightly higher than male patients, though both genders are represented in nearly equal numbers.

The count for patients with an unknown or invalid gender is negligible.

In [None]:
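# Barplot: 30-day readmission rate by 'race'
# 'readmitted' holds text labels, so convert it to a 0/1 indicator first;
# the bar height is then the share of encounters readmitted within 30 days.
plot_df = df.assign(readmitted_30d=(df['readmitted'] == '<30').astype(int))
sns.barplot(data=plot_df, x='race', y='readmitted_30d')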

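## The chart compares readmission across race categories for diabetic patients.
 In this run, the lowest readmission rates appeared among Caucasian and African American patients.
 The highest readmission rates appeared in the Hispanic, Asian, and Other race categories.
 This ranking is only meaningful when the bar height is a numeric rate (for example, the share of `<30` encounters per group), and smaller groups carry more uncertainty, so it should be confirmed before it is treated as a finding.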
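
The objective is readmission within 30 days, so a per-group rate needs a binary target derived from `readmitted`. The following is a minimal sketch of that calculation. The group labels `A` and `B` and all values are made up and do not reflect real readmission rates; `readmitted_30d` is a hypothetical column name.

```python
import pandas as pd

# Toy data standing in for the real dataset (values are made up and do not
# reflect actual readmission rates for any group).
toy = pd.DataFrame({
    "race": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "readmitted": ["<30", "NO", ">30", "NO", "<30", "<30", "NO", ">30"],
})

# Binary target: 1 if readmitted within 30 days, else 0.
toy["readmitted_30d"] = (toy["readmitted"] == "<30").astype(int)

# Rate and group size per race; small groups give noisy rates.
rate_by_race = (
    toy.groupby("race")["readmitted_30d"]
    .agg(rate="mean", n="size")
    .sort_values("rate", ascending=False)
)
print(rate_by_race)
```

Reporting the group size `n` next to each rate makes it easier to judge whether a difference between groups could be due to small samples.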




# **CONCLUSION**


 Based on my analysis, I identified several factors related to hospital readmission in diabetic patients. The patient population is primarily older, with the largest group between 70 and 80 years old. The bar chart by race suggested higher readmission among Hispanic, Asian, and Other patients than among Caucasian and African American patients. I have not tested this difference statistically, so I treat it as a lead to confirm with per-group 30-day rates rather than as an established disparity.

 My analysis of treatment metrics also provided an interesting insight. I found no clear link between the number of medications a patient was on and their readmission status. This suggests that other variables, possibly quality of care or post-discharge support (neither of which is measured in this dataset), may matter more than medication count alone. Overall, the apparent differences by race are the finding most worth following up in future efforts to improve patient outcomes.