# **Hospital Readmission Prediction**

## Objectives

* Inspect and clean the dataset to prepare it for modeling.
* Handle missing values and ensure data consistency.
* Convert categorical data into numerical format for machine learning models.
* Save the cleaned dataset for further analysis.

## Inputs

* hospital_readmissions.csv: Raw dataset containing patient information and hospital readmission details.

## Outputs

* cleaned_hospital_readmissions.csv: Preprocessed dataset ready for Exploratory Data Analysis (EDA) and modeling. 

## Additional Comments

* This step ensures data quality before applying statistical tests and machine learning models.
* We will handle missing values, encode categorical variables, and check for any inconsistencies.
* Feature engineering will be done where necessary.

---

# Change working directory

* The notebooks are assumed to be stored in a subfolder, so when you run a notebook in the editor you need to change the working directory.

We need to change the working directory from the current folder to its parent folder.
* We access the current directory with os.getcwd()

In [2]:
import os
current_dir = os.getcwd()
current_dir

'/workspaces/Readmission-Prediction/jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [3]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [4]:
current_dir = os.getcwd()
current_dir

'/workspaces/Readmission-Prediction'
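
Rerunning the cell that calls os.chdir() moves the working directory up one more level each time. A minimal sketch of a guard against that, assuming the notebooks folder is named `jupyter_notebooks` as in the path printed earlier (the helper name `project_root` is hypothetical and not part of the original notebook):

```python
import os


def project_root(cwd, notebooks_dirname="jupyter_notebooks"):
    """Return the parent of cwd only if cwd is the notebooks folder.

    If cwd is already somewhere else (for example the project root),
    it is returned unchanged, so calling this twice is harmless.
    """
    normalized = os.path.normpath(cwd)
    if os.path.basename(normalized) == notebooks_dirname:
        return os.path.dirname(normalized)
    return cwd


# Possible usage inside the notebook (commented out to avoid side effects):
# os.chdir(project_root(os.getcwd()))
```

With this guard, the directory only changes when the notebook is still running from inside the notebooks subfolder.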

In [5]:
#install kaggle
! pip install kaggle


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [6]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
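
The `chmod 600` call assumes that kaggle.json already sits in the project root. A hedged sketch of the same step in plain Python, which fails with a clear message if the token file is absent (`secure_kaggle_token` is an illustrative name and not part of the Kaggle tooling):

```python
import os
import stat


def secure_kaggle_token(config_dir, filename="kaggle.json"):
    """Restrict the token file to owner read/write (the equivalent of chmod 600)."""
    token_path = os.path.join(config_dir, filename)
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"{filename} not found in {config_dir}")
    # S_IRUSR | S_IWUSR is the octal mode 0o600
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return token_path


# Possible usage inside the notebook (commented out to avoid side effects):
# secure_kaggle_token(os.getcwd())
```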

In [7]:
KaggleDatasetPath = "dubradave/hospital-readmissions"
DestinationFolder = "inputs/readmission_dataset"
!kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Dataset URL: https://www.kaggle.com/datasets/dubradave/hospital-readmissions
License(s): other


In [8]:
import zipfile
with zipfile.ZipFile(DestinationFolder + '/hospital-readmissions.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/hospital-readmissions.zip')
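
The extract-then-delete step could also be wrapped in a small helper that reports what the archive contained. This is only a sketch: `extract_and_remove` is a hypothetical name, and it relies on nothing beyond the standard `zipfile` and `os` modules.

```python
import os
import zipfile


def extract_and_remove(zip_path, destination):
    """Extract zip_path into destination, delete the archive, return member names."""
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        names = zip_ref.namelist()
        zip_ref.extractall(destination)
    # The archive is only removed after extraction finished without an error
    os.remove(zip_path)
    return names


# Possible usage inside the notebook (commented out to avoid side effects):
# extract_and_remove(DestinationFolder + '/hospital-readmissions.zip', DestinationFolder)
```

Returning the member names makes it easy to confirm that the expected CSV file was actually inside the download.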

---

# Load and Explore the Dataset

## Inputs

* Source Data: hospital_readmissions.csv (raw dataset)

## Outputs

* Cleaned Data: cleaned_hospital_readmissions.csv

In [9]:
import pandas as pd

df = pd.read_csv("inputs/readmission_dataset/hospital_readmissions.csv")

# Display basic dataset info
print(f"Dataset Shape: {df.shape}")  
df.info()
df.head()

Dataset Shape: (25000, 17)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 17 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   age                25000 non-null  object
 1   time_in_hospital   25000 non-null  int64 
 2   n_lab_procedures   25000 non-null  int64 
 3   n_procedures       25000 non-null  int64 
 4   n_medications      25000 non-null  int64 
 5   n_outpatient       25000 non-null  int64 
 6   n_inpatient        25000 non-null  int64 
 7   n_emergency        25000 non-null  int64 
 8   medical_specialty  25000 non-null  object
 9   diag_1             25000 non-null  object
 10  diag_2             25000 non-null  object
 11  diag_3             25000 non-null  object
 12  glucose_test       25000 non-null  object
 13  A1Ctest            25000 non-null  object
 14  change             25000 non-null  object
 15  diabetes_med       25000 non-null  object
 16  readmitted   

| | age | time_in_hospital | n_lab_procedures | n_procedures | n_medications | n_outpatient | n_inpatient | n_emergency | medical_specialty | diag_1 | diag_2 | diag_3 | glucose_test | A1Ctest | change | diabetes_med | readmitted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [70-80) | 8 | 72 | 1 | 18 | 2 | 0 | 0 | Missing | Circulatory | Respiratory | Other | no | no | no | yes | no |
| 1 | [70-80) | 3 | 34 | 2 | 13 | 0 | 0 | 0 | Other | Other | Other | Other | no | no | no | yes | no |
| 2 | [50-60) | 5 | 45 | 0 | 18 | 0 | 0 | 0 | Missing | Circulatory | Circulatory | Circulatory | no | no | yes | yes | yes |
| 3 | [70-80) | 2 | 36 | 0 | 12 | 1 | 0 | 0 | Missing | Circulatory | Other | Diabetes | no | no | yes | yes | yes |
| 4 | [60-70) | 1 | 42 | 0 | 7 | 0 | 0 | 0 | InternalMedicine | Other | Circulatory | Respiratory | no | no | no | yes | no |


# Data Cleaning & Preprocessing

In [10]:
import numpy as np

# Convert 'age' column from categorical brackets to numerical values
age_mapping = {
    "[0-10)": 5, "[10-20)": 15, "[20-30)": 25, "[30-40)": 35, "[40-50)": 45,
    "[50-60)": 55, "[60-70)": 65, "[70-80)": 75, "[80-90)": 85, "[90-100)": 95
}
df["age"] = df["age"].map(age_mapping)

# Convert 'readmitted' to binary (1 = Yes, 0 = No)
df["readmitted"] = df["readmitted"].map({"yes": 1, "no": 0})

# Drop duplicate rows if any
df.drop_duplicates(inplace=True)

# Display updated dataset info
print(f"Dataset Shape After Cleaning: {df.shape}")  
df.info()
df.head()


Dataset Shape After Cleaning: (25000, 17)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 17 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   age                25000 non-null  int64 
 1   time_in_hospital   25000 non-null  int64 
 2   n_lab_procedures   25000 non-null  int64 
 3   n_procedures       25000 non-null  int64 
 4   n_medications      25000 non-null  int64 
 5   n_outpatient       25000 non-null  int64 
 6   n_inpatient        25000 non-null  int64 
 7   n_emergency        25000 non-null  int64 
 8   medical_specialty  25000 non-null  object
 9   diag_1             25000 non-null  object
 10  diag_2             25000 non-null  object
 11  diag_3             25000 non-null  object
 12  glucose_test       25000 non-null  object
 13  A1Ctest            25000 non-null  object
 14  change             25000 non-null  object
 15  diabetes_med       25000 non-null  object
 16

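| | age | time_in_hospital | n_lab_procedures | n_procedures | n_medications | n_outpatient | n_inpatient | n_emergency | medical_specialty | diag_1 | diag_2 | diag_3 | glucose_test | A1Ctest | change | diabetes_med | readmitted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 75 | 8 | 72 | 1 | 18 | 2 | 0 | 0 | Missing | Circulatory | Respiratory | Other | no | no | no | yes | 0 |
| 1 | 75 | 3 | 34 | 2 | 13 | 0 | 0 | 0 | Other | Other | Other | Other | no | no | no | yes | 0 |
| 2 | 55 | 5 | 45 | 0 | 18 | 0 | 0 | 0 | Missing | Circulatory | Circulatory | Circulatory | no | no | yes | yes | 1 |
| 3 | 75 | 2 | 36 | 0 | 12 | 1 | 0 | 0 | Missing | Circulatory | Other | Diabetes | no | no | yes | yes | 1 |
| 4 | 65 | 1 | 42 | 0 | 7 | 0 | 0 | 0 | InternalMedicine | Other | Circulatory | Respiratory | no | no | no | yes | 0 |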
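
Each value in `age_mapping` is the midpoint of its bracket, for example (70 + 80) / 2 = 75. A small sketch, assuming every bracket has the form `[low-high)`, that derives the same dictionary instead of typing it by hand (`bracket_midpoint` is an illustrative helper, not part of the original notebook):

```python
import re


def bracket_midpoint(bracket):
    """Return the integer midpoint of an age bracket written as '[low-high)'."""
    match = re.fullmatch(r"\[(\d+)-(\d+)\)", bracket)
    if match is None:
        raise ValueError(f"Unexpected age bracket format: {bracket!r}")
    low, high = int(match.group(1)), int(match.group(2))
    return (low + high) // 2


# Build the ten-year brackets from [0-10) up to [90-100) and map each to its midpoint
age_mapping = {
    f"[{low}-{low + 10})": bracket_midpoint(f"[{low}-{low + 10})")
    for low in range(0, 100, 10)
}
```

Raising on an unexpected format is a deliberate choice here: a bracket that does not parse is reported immediately rather than turning into a missing value later.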


In [11]:
df.to_csv("inputs/readmission_dataset/cleaned_hospital_readmissions.csv", index=False)
print("Cleaned dataset saved successfully!")

Cleaned dataset saved successfully!
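
The saved file still contains text columns such as `change` and `diabetes_med`, which show `yes`/`no` values in the preview of the data. If later modeling needs them as numbers, one possible approach is sketched here. It assumes the listed columns hold only `yes` and `no`; since `Series.map()` yields NaN for unmapped values, the hypothetical helper `encode_yes_no` raises an error instead of silently introducing missing data. The sample frame is made-up illustration data.

```python
import pandas as pd


def encode_yes_no(df, columns):
    """Return a copy of df with the given yes/no columns mapped to 1/0."""
    encoded = df.copy()
    for column in columns:
        mapped = encoded[column].map({"yes": 1, "no": 0})
        # map() leaves NaN wherever a value is not in the dictionary
        if mapped.isna().any():
            raise ValueError(f"Column {column!r} holds values other than 'yes'/'no'")
        encoded[column] = mapped.astype(int)
    return encoded


# Illustration with a tiny hand-made frame (not taken from the dataset)
sample = pd.DataFrame({"change": ["no", "yes"], "diabetes_med": ["yes", "yes"]})
print(encode_yes_no(sample, ["change", "diabetes_med"]))
```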


---

# Checking Data Quality & Distribution

In [13]:
# Load cleaned dataset
df = pd.read_csv("inputs/readmission_dataset/cleaned_hospital_readmissions.csv")

# Display dataset shape & info
print(f"Dataset Shape: {df.shape}")  
df.info()

# Check for missing values
print("\nMissing Values:\n", df.isnull().sum())

# Quick summary statistics
print("\nSummary Statistics:")
print(df.describe())

# Save cleaned dataset
df.to_csv("inputs/readmission_dataset/cleaned_hospital_readmissions.csv", index=False)
print("\n✅ Data successfully saved for EDA.")

Dataset Shape: (25000, 17)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 17 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   age                25000 non-null  int64 
 1   time_in_hospital   25000 non-null  int64 
 2   n_lab_procedures   25000 non-null  int64 
 3   n_procedures       25000 non-null  int64 
 4   n_medications      25000 non-null  int64 
 5   n_outpatient       25000 non-null  int64 
 6   n_inpatient        25000 non-null  int64 
 7   n_emergency        25000 non-null  int64 
 8   medical_specialty  25000 non-null  object
 9   diag_1             25000 non-null  object
 10  diag_2             25000 non-null  object
 11  diag_3             25000 non-null  object
 12  glucose_test       25000 non-null  object
 13  A1Ctest            25000 non-null  object
 14  change             25000 non-null  object
 15  diabetes_med       25000 non-null  object
 16  readmitted   
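
`df.isnull()` only detects true nulls, while the preview of the raw data shows the string `Missing` in `medical_specialty`. A minimal sketch for counting such placeholder strings per column (the helper name `count_placeholder` is an assumption, and the sample frame is made-up illustration data, not taken from the dataset):

```python
import pandas as pd


def count_placeholder(df, placeholder="Missing"):
    """Count, per column, how many cells equal the placeholder string."""
    # Comparing numeric columns with a string simply yields False for every cell
    return df.eq(placeholder).sum()


# Illustration with a tiny hand-made frame
sample = pd.DataFrame({
    "medical_specialty": ["Missing", "Other", "Missing"],
    "n_procedures": [1, 2, 0],
})
counts = count_placeholder(sample)
print(counts[counts > 0])
```

Whether such placeholders should be kept as their own category or treated as missing data is a modeling decision that the notebook leaves open.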

NOTE

* You may add as many sections as you want, as long as they support your project workflow.
* All notebook cells should be run top-down. Do not create a workflow in which you need to go back to a previous cell to execute a task, such as refreshing a variable's content.

---

# Push files to Repo

* If you don't need to push files to the repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
  # create your folder here
  # os.makedirs(name='')
  pass  # placeholder so the try block is valid while the call above stays commented out
except Exception as e:
  print(e)