# 1. Business Understanding

Acute Kidney Injury (AKI) is a critical medical condition associated with high morbidity and mortality rates. Despite advances in medical care, early detection and risk stratification remain major challenges. Delayed diagnosis and intervention often result in poor patient outcomes, increased hospital stays, and higher healthcare costs.

Predicting mortality risk in AKI patients using data-driven models can help healthcare providers take proactive measures, improve clinical decision-making, and allocate resources efficiently. A robust predictive model can enable early identification of high-risk patients, allowing timely interventions that may reduce complications and improve survival rates.

This project aims to develop a machine learning-based model to predict mortality in AKI patients using relevant clinical parameters such as serum creatinine levels, urine output, demographics, comorbidities, and treatment interventions. The insights from this model can support physicians in making evidence-based decisions and improving patient outcomes.

## 1.1 Problem Statement

Acute Kidney Injury (AKI) contributes significantly to in-hospital mortality, yet current risk assessment methods are often reactive rather than predictive. Traditional clinical scoring systems may not capture the complex interactions between various risk factors, leading to suboptimal patient management.

This project seeks to develop a predictive model that leverages machine learning techniques to estimate the likelihood of mortality in AKI patients. By analyzing key patient data, the model aims to provide early warnings for high-risk cases, enabling timely clinical interventions. The goal is to improve survival rates, optimize hospital resource utilization, and enhance patient care through data-driven insights.

## 1.2 Objective
1. Predict in-hospital mortality (IHM) in a retrospective cohort of ICU patients with AKI.


# 2. Data Acquisition and Understanding

The data was downloaded from the MIMIC-III (Medical Information Mart for Intensive Care) database. Each record contains a set of variables that summarize the patient's clinical trajectory during the ICU stay. The target variable, in-hospital mortality (IHM), is binary: a value of '1' indicates that the patient died in the ICU. The variables are:
- **age**: The age of the patient, usually measured in years.

- **gender_F**: A binary indicator (0 or 1) representing whether the patient is female (1 for female, 0 for not female).

- **gender_M**: A binary indicator representing whether the patient is male (1 for male, 0 for not male).

- **bic_max**: The maximum serum bicarbonate level recorded for the patient.

- **bic_mean**: The average serum bicarbonate level of the patient during monitoring.

- **bic_min**: The minimum serum bicarbonate level recorded for the patient.

- **bilirubin**: A substance produced from the breakdown of red blood cells; high levels can indicate liver problems.

- **bp_max**: The maximum recorded blood pressure of the patient.

- **bp_mean**: The average blood pressure recorded for the patient.

- **bp_min**: The minimum recorded blood pressure of the patient.

- **bun_max**: The maximum value of Blood Urea Nitrogen (BUN), which can indicate kidney function.

- **bun_mean**: The average Blood Urea Nitrogen (BUN) level for the patient.

- **bun_min**: The minimum Blood Urea Nitrogen (BUN) level recorded.

- **Days_in_uci**: The number of days the patient spent in an Intensive Care Unit (ICU).

- **fio2**: Fraction of inspired oxygen; it indicates the percentage of oxygen in the air that a patient is breathing.

- **gcs_max**: The maximum score on the Glasgow Coma Scale, which measures consciousness levels.

- **gcs_mean**: The average Glasgow Coma Scale score during the patient's stay.

- **gcs_min**: The minimum Glasgow Coma Scale score recorded.

- **hr_max**: The maximum heart rate recorded for the patient.

- **hr_mean**: The average heart rate during monitoring.

- **hr_min**: The minimum heart rate recorded.

- **max pao2**: The maximum partial pressure of oxygen in arterial blood, indicating how well oxygen is being transferred into the blood from the lungs.

- **mean pao2**: The average partial pressure of oxygen in arterial blood during monitoring.

- **min pao2**: The minimum partial pressure of oxygen in arterial blood recorded.

- **pot_max**: The maximum potassium level in the patient's blood, which is important for heart and muscle function.

- **pot_mean**: The average potassium level in the patient's blood during monitoring.

- **pot_min**: The minimum potassium level recorded in the patient's blood.

- **sod_max**: The maximum sodium level in the patient's blood, important for fluid balance and nerve function.

- **sod_mean**: The average sodium level during monitoring.

- **sod_min**: The minimum sodium level recorded in the patient's blood.

- **temp**: The body temperature of the patient; the values in this dataset are consistent with degrees Celsius.

- **wbc_max**: The maximum white blood cell count, which can indicate infection or inflammation.

- **wbc_mean**: The average white blood cell count during monitoring.

- **wbc_min**: The minimum white blood cell count recorded.

- **IHM**: In-hospital mortality, the target variable, coded as 1 if the patient died and 0 otherwise.
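
The descriptions above imply a few invariants that can be checked before modelling: each `*_min` should not exceed its `*_mean`, which should not exceed its `*_max`, and `gender_F` and `gender_M` should sum to 1. The following is a minimal sketch of such a check, assuming the `<prefix>_min` / `<prefix>_mean` / `<prefix>_max` naming shown in the list. The helper `check_invariants` and the toy frame are hypothetical and not part of the original notebook.

```python
import pandas as pd


def check_invariants(frame: pd.DataFrame, prefixes: list[str]) -> pd.Series:
    """Count rows that violate the invariants implied by the data dictionary.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    violations = {}
    for prefix in prefixes:
        lo, mid, hi = (frame[f"{prefix}_{s}"] for s in ("min", "mean", "max"))
        # Flag rows where min <= mean <= max is broken. Comparisons with NaN
        # evaluate to False, so rows with missing values are not flagged.
        violations[prefix] = int(((lo > mid) | (mid > hi)).sum())
    # The two gender dummies should be mutually exclusive and exhaustive.
    violations["gender"] = int((frame["gender_F"] + frame["gender_M"] != 1).sum())
    return pd.Series(violations, name="violating_rows")


# Toy data for illustration only (not taken from the dataset).
toy = pd.DataFrame({
    "gender_F": [1, 0, 1],
    "gender_M": [0, 1, 1],          # third row is inconsistent on purpose
    "hr_min": [60.0, 70.0, 90.0],
    "hr_mean": [80.0, 65.0, 95.0],  # second row has mean < min on purpose
    "hr_max": [100.0, 90.0, 120.0],
})
report = check_invariants(toy, ["hr"])
print(report)
```

On the real data the call would presumably look like `check_invariants(df, ["bic", "bp", "bun", "gcs", "hr", "pot", "sod", "wbc"])`. The `pao2` columns use a different naming pattern (`max pao2` rather than `pao2_max`), so they would need to be renamed or checked separately.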

# 3. Data Preparation

In [2]:
# importing libraries
import pandas as pd
#from matplotlib import pyplot as plt

In [7]:
# import data

data_path = r"C:\Users\njamb\Desktop\DataScience\AcuteKidneyInjuryMortalityPrediction\data\raw\ihm_aki.csv"
data = pd.read_csv(data_path, index_col=0)
data.head()

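| | age | gender_F | gender_M | bic_max | bic_mean | bic_min | bilirubin | bp_max | bp_mean | bp_min | ... | pot_mean | pot_min | sod_max | sod_mean | sod_min | temp | wbc_max | wbc_mean | wbc_min | IHM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 74.63 | 1 | 0 | 40.0 | 34.62 | 30.0 | 0.4 | 154.63 | 123.28 | 98.69 | ... | 3.85 | 3.3 | 143.5 | 141.5 | 139.0 | NaN | 13.7 | 9.2 | 6.4 | 0 |
| 1 | 60.12 | 1 | 0 | 34.0 | 28.94 | 24.0 | 0.2 | 113.12 | 104.68 | 93.88 | ... | 3.76 | 2.75 | 145.0 | 141.21 | 139.0 | 38.04 | 21.3 | 17.69 | 14.4 | 0 |
| 2 | 64.12 | 1 | 0 | 26.0 | 24.07 | 21.0 | 0.3 | 126.62 | 108.91 | 77.0 | ... | 3.86 | 3.5 | 145.0 | 140.86 | 138.0 | NaN | 33.9 | 19.39 | 10.5 | 0 |
| 4 | 54.46 | 0 | 1 | 34.0 | 30.98 | 26.0 | 1.0 | 151.38 | 114.38 | 97.21 | ... | 4.17 | 3.6 | 147.5 | 140.43 | 135.0 | 36.83 | 29.0 | 13.18 | 2.5 | 1 |
| 5 | 78.22 | 0 | 1 | 29.6 | 23.1 | 18.0 | 1.0 | 166.26 | 144.62 | 124.5 | ... | 4.1 | 3.4 | 150.25 | 141.22 | 136.5 | NaN | 46.86 | 8.68 | 0.3 | 1 |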


In [9]:
print(f"The data has {data.shape[0]:,} rows and {data.shape[1]} columns.")

The data has 3,550 rows and 35 columns.
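
The preview above shows row labels 0, 1, 2, 4, 5, so the index inherited from the CSV appears to have gaps. The sketch below uses toy data to show how this could be confirmed and, if desired, how to switch to a contiguous index while keeping the original labels. The column name `source_row` is a hypothetical choice, and whether the original labels carry meaning (for example as patient or admission identifiers) is not stated in the source.

```python
import pandas as pd

# Toy frame whose labels mimic the gap seen in the preview (illustrative only).
toy = pd.DataFrame({"age": [74.63, 60.12, 64.12, 54.46]}, index=[0, 1, 2, 4])

# A contiguous index is element-wise equal to RangeIndex(0, n).
has_gaps = not toy.index.equals(pd.RangeIndex(len(toy)))
print(f"Index has gaps: {has_gaps}")

# Keep the original labels in a column (hypothetical name) and
# replace the index with a contiguous 0..n-1 range.
toy_reset = toy.rename_axis("source_row").reset_index()
print(toy_reset)
```

Positional access with `iloc` works the same either way; the reset mainly matters when later steps align frames or arrays by label.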


## 3.1 Data Cleaning

In [11]:
df = data.copy()

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3550 entries, 0 to 4430
Data columns (total 35 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   age          3550 non-null   float64
 1   gender_F     3550 non-null   int64  
 2   gender_M     3550 non-null   int64  
 3   bic_max      3550 non-null   float64
 4   bic_mean     3550 non-null   float64
 5   bic_min      3550 non-null   float64
 6   bilirubin    2893 non-null   float64
 7   bp_max       3550 non-null   float64
 8   bp_mean      3550 non-null   float64
 9   bp_min       3550 non-null   float64
 10  bun_max      3550 non-null   float64
 11  bun_mean     3550 non-null   float64
 12  bun_min      3550 non-null   float64
 13  Days_in_uci  3550 non-null   float64
 14  fio2         3152 non-null   float64
 15  gcs_max      3550 non-null   float64
 16  gcs_mean     3550 non-null   float64
 17  gcs_min      3550 non-null   float64
 18  hr_max       3550 non-null   float64
 19  hr_mean    

In [14]:
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| age | 3550.0 | 64.28662 | 15.986639 | 16.9 | 54.1125 | 66.92 | 77.2075 | 89.0 |
| gender_F | 3550.0 | 0.424507 | 0.494338 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| gender_M | 3550.0 | 0.575493 | 0.494338 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| bic_max | 3550.0 | 30.055363 | 5.032318 | 14.5 | 27.0 | 30.0 | 33.0 | 50.0 |
| bic_mean | 3550.0 | 25.606231 | 4.14798 | 12.33 | 23.125 | 25.46 | 27.85 | 44.69 |
| bic_min | 3550.0 | 20.895972 | 4.388776 | 5.0 | 18.0 | 21.0 | 23.5 | 39.0 |
| bilirubin | 2893.0 | 1.586049 | 3.16897 | 0.0 | 0.4 | 0.7 | 1.3 | 29.5 |
| bp_max | 3550.0 | 140.153299 | 18.63014 | 90.79 | 126.51 | 140.155 | 153.08 | 241.0 |
| bp_mean | 3550.0 | 122.73498 | 14.647909 | 84.62 | 111.875 | 121.635 | 132.74 | 182.93 |
| bp_min | 3550.0 | 106.023138 | 13.324255 | 44.26 | 97.04 | 104.015 | 113.7075 | 163.23 |


In [19]:
# checking duplicates 

print(f"The data has {df.duplicated().sum()} duplicate rows.")

The data has 0 duplicate rows.


In [35]:
# checking for missing values

missing = pd.DataFrame({
    "missing_count": df.isna().sum(),
    "missing_percentage": df.isna().sum() / df.shape[0] * 100,
})
missing[missing["missing_count"] > 0].sort_values(by="missing_count", ascending=False)

| | missing_count | missing_percentage |
|---|---|---|
| temp | 1225 | 34.507042 |
| bilirubin | 657 | 18.507042 |
| fio2 | 398 | 11.211268 |
| max pao2 | 133 | 3.746479 |
| mean pao2 | 133 | 3.746479 |
| min pao2 | 133 | 3.746479 |
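
One common way to handle gaps like those listed above is median imputation combined with a binary missingness flag, so that a model can still use the fact that a measurement was not taken. The following is only a sketch of that option, not the approach chosen in this notebook. The helper `impute_with_flags` and the `_missing` suffix are hypothetical, and in practice the medians should be computed on the training split only to avoid leakage.

```python
import pandas as pd


def impute_with_flags(frame: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Median-impute `columns` and add a 0/1 `<col>_missing` flag for each.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    out = frame.copy()
    for col in columns:
        # Record missingness before filling, otherwise the flag is always 0.
        out[f"{col}_missing"] = out[col].isna().astype(int)
        out[col] = out[col].fillna(out[col].median())
    return out


# Toy data for illustration only (not taken from the dataset).
toy = pd.DataFrame({
    "temp": [38.0, None, 37.0, None],
    "bilirubin": [0.4, 0.2, None, 1.0],
})
clean = impute_with_flags(toy, ["temp", "bilirubin"])
print(clean)
```

For a column such as `temp`, where roughly a third of the values are missing, dropping the column or comparing model performance with and without it may be just as reasonable; the source does not say which strategy is used.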


# 4. Modelling

# 5. Deployment