## **Delirium Data Cohort from MIMIC-IV-3.2**
Data Downloaded February 8th, 2025

# Phase 1 Data Extraction
https://colab.research.google.com/drive/1IIYkR_CSGwakQJ5g47gtY1TnQIs91T02#scrollTo=71f2964d-361e-4a8a-b59f-429624b6e1ce

# Phase 2 Extracted Dataset Mounted  
Nth Attempt Feb 22, 24



In [52]:
import os

# Define file path
file_path = "D:/MIMIC-IV-Data-Pipeline/processed_data/delirium_prediction_data_v3.csv.gz"

# Check if the file exists
print("File Exists:", os.path.exists(file_path))


File Exists: True


In [54]:
import pandas as pd

file_path = "D:/MIMIC-IV-Data-Pipeline/processed_data/delirium_prediction_data_v3.csv.gz"

# Load dataset
df = pd.read_csv(file_path, compression="gzip", low_memory=False)

print("✅ Data Loaded! Shape:", df.shape)
print(df.head())  # Show first 5 rows


✅ Data Loaded! Shape: (555244, 20)
   subject_id   hadm_id  admission_type      admission_location  \
0    10000032  22595853          URGENT  TRANSFER FROM HOSPITAL   
1    10000032  22841357        EW EMER.          EMERGENCY ROOM   
2    10000032  25742920        EW EMER.          EMERGENCY ROOM   
3    10000032  29079034        EW EMER.          EMERGENCY ROOM   
4    10000068  25022803  EU OBSERVATION          EMERGENCY ROOM   

  discharge_location insurance marital_status   race  ed_time_spent gender  \
0               HOME  Medicaid        WIDOWED  WHITE          253.0      F   
1               HOME  Medicaid        WIDOWED  WHITE          337.0      F   
2            HOSPICE  Medicaid        WIDOWED  WHITE          286.0      F   
3               HOME  Medicaid        WIDOWED  WHITE          486.0      F   
4                NaN       NaN         SINGLE  WHITE          511.0      F   

   anchor_age  anchor_year     stay_id                       last_careunit  \
0          52  

In [56]:
print("🔍 Dataset Overview:")
print(df.info())  # Check column types and memory usage
print("\nMissing Values:\n", df.isnull().sum())  # Count missing values


🔍 Dataset Overview:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 555244 entries, 0 to 555243
Data columns (total 20 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   subject_id            555244 non-null  int64  
 1   hadm_id               555244 non-null  int64  
 2   admission_type        555244 non-null  object 
 3   admission_location    555243 non-null  object 
 4   discharge_location    405420 non-null  object 
 5   insurance             545797 non-null  object 
 6   marital_status        540815 non-null  object 
 7   race                  555244 non-null  object 
 8   ed_time_spent         385225 non-null  float64
 9   gender                555244 non-null  object 
 10  anchor_age            555244 non-null  int64  
 11  anchor_year           555244 non-null  int64  
 12  stay_id               94458 non-null   float64
 13  last_careunit         94458 non-null   object 
 14  los                   94444 non-

| Feature Category | Column Names |
|---|---|
| 🔹 Identifiers | subject_id, hadm_id, stay_id |
| 🔹 Patient Demographics | gender, anchor_age, anchor_year, race, marital_status |
| 🔹 Admission & Hospitalization Details | admission_type, admission_location, discharge_location, insurance, ed_time_spent |
| 🔹 ICU Stay Details | last_careunit, icu_los |
| 🔹 Diagnoses & Comorbidities | num_comorbidities, diagnosis_list, palliative_care_flag |
| 🔹 Delirium Outcome | delirium |
| 🔹 Medication Exposure | high_risk_med |


✅ Insights from Dataset Overview
The dataset has 555,244 rows (one per admission), but several columns have missing values. The key observations are below.

🔍 Key Observations
- Missing ICU stay information:
  - stay_id, last_careunit, and los are missing in ~460,000 rows (about 83% of the dataset).
  - This suggests that most admissions did not include an ICU stay.
- Missing Emergency Department (ED) time:
  - ed_time_spent is missing for ~170,000 admissions.
  - Likely because not all patients are admitted through the ED.
- Small missingness in the diagnosis-derived columns:
  - 531 missing values in diagnosis_list, num_comorbidities, palliative_care_flag, and delirium.
  - These could be admissions without recorded diagnoses.
- Missing high-risk medication values:
  - ~82,720 rows are missing high_risk_med.
  - This likely means that those patients had no prescriptions recorded.
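
The counts above are easier to compare as percentages. A minimal sketch, using a small made-up frame rather than the real cohort (the helper name `missing_summary` is hypothetical), that lists only the columns with gaps:

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values for columns that have any."""
    n_missing = frame.isna().sum()
    summary = pd.DataFrame(
        {
            "n_missing": n_missing,
            "pct_missing": (100 * n_missing / len(frame)).round(1),
        }
    )
    # Keep only columns with at least one gap, worst first
    return summary[summary["n_missing"] > 0].sort_values(
        "pct_missing", ascending=False
    )


# Toy data for illustration only; not taken from the cohort
toy = pd.DataFrame(
    {
        "hadm_id": [1, 2, 3, 4],
        "stay_id": [10.0, np.nan, np.nan, np.nan],
        "ed_time_spent": [253.0, np.nan, 286.0, 486.0],
    }
)
summary = missing_summary(toy)
print(summary)
```

On the real data the same helper could be called as `missing_summary(df)`.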


✅ Next Steps: Handling Missing Data



In [20]:
# 1️⃣ Fill missing diagnosis_list entries
# Diagnoses drive both delirium and palliative_care_flag, so replace missing
# entries with an empty list. Values read back from the CSV are strings, not
# lists, so strings are kept too; checking for list alone would overwrite
# every row with [].
df["diagnosis_list"] = df["diagnosis_list"].apply(
    lambda x: x if isinstance(x, (list, str)) else []
)

# 2️⃣ Fill palliative_care_flag and delirium with 0
# If an admission has no diagnosis data, assume 0 for both.
df["palliative_care_flag"] = df["palliative_care_flag"].fillna(0).astype(int)
df["delirium"] = df["delirium"].fillna(0).astype(int)

# 3️⃣ Fill high_risk_med with 0
# Missing values likely mean no high-risk medications were prescribed.
df["high_risk_med"] = df["high_risk_med"].fillna(0).astype(int)
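
Columns loaded with `pd.read_csv` come back as text, so a list-valued column such as diagnosis_list is no longer a Python list after a round trip through CSV. The sketch below assumes the column was saved as the string form of a Python list (the storage format is not shown in this notebook, so this is an assumption), and the helper name `parse_list_cell` and the codes in the toy data are made up for illustration:

```python
import ast

import pandas as pd


def parse_list_cell(cell):
    """Turn a CSV cell back into a list; missing cells become an empty list."""
    if isinstance(cell, list):
        return cell
    if isinstance(cell, str):
        try:
            parsed = ast.literal_eval(cell)
        except (ValueError, SyntaxError):
            # Not a Python literal: keep the raw text as a single item
            return [cell]
        return list(parsed) if isinstance(parsed, (list, tuple)) else [parsed]
    # NaN / None / anything else is treated as "no diagnoses recorded"
    return []


# Toy values for illustration only
toy = pd.Series(["['F05', 'I10']", None, "[]"])
parsed = toy.apply(parse_list_cell)
print(parsed.tolist())
```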


In [22]:
# 4️⃣ Handle missing ICU data (stay_id, last_careunit, ICU length of stay)
# Not all patients were in the ICU, so replace missing values with "Not ICU" or 0.
df["stay_id"] = df["stay_id"].fillna("Not ICU")
df["last_careunit"] = df["last_careunit"].fillna("Not ICU")
# This didn't work, apparently because the column is named icu_los rather than
# los_icu; icu_los is filled during feature engineering instead.
# df["los_icu"] = df["los_icu"].fillna(0)


In [24]:
# 5️⃣ Handle Missing ed_time_spent
#Patients missing ed_time_spent likely didn’t enter via the ED.
#Fill missing values with 0:
df["ed_time_spent"] = df["ed_time_spent"].fillna(0)
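
Filling with 0 makes "no ED visit" indistinguishable from a genuinely zero-minute ED stay. One option, sketched here on toy data, is to record the missingness in a separate flag before filling. The column name `ed_time_missing` is hypothetical and not part of the original dataset:

```python
import numpy as np
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({"ed_time_spent": [253.0, np.nan, 0.0, 486.0]})

# Record which rows had no ED time BEFORE overwriting the gaps
toy["ed_time_missing"] = toy["ed_time_spent"].isna().astype(int)
toy["ed_time_spent"] = toy["ed_time_spent"].fillna(0)

print(toy)
```

The flag lets a model treat "did not come through the ED" separately from a short ED stay.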

In [26]:
print("🔍 Dataset Overview:")
print(df.info())  # Check column types and memory usage
print("\nMissing Values:\n", df.isnull().sum())  # Count missing values

🔍 Dataset Overview:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 555244 entries, 0 to 555243
Data columns (total 20 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   subject_id            555244 non-null  int64  
 1   hadm_id               555244 non-null  int64  
 2   admission_type        555244 non-null  object 
 3   admission_location    555243 non-null  object 
 4   discharge_location    405420 non-null  object 
 5   insurance             545797 non-null  object 
 6   marital_status        540815 non-null  object 
 7   race                  555244 non-null  object 
 8   ed_time_spent         555244 non-null  float64
 9   gender                555244 non-null  object 
 10  anchor_age            555244 non-null  int64  
 11  anchor_year           555244 non-null  int64  
 12  stay_id               555244 non-null  object 
 13  last_careunit         555244 non-null  object 
 14  icu_los               94444 non-

In [34]:
# Step 1: Feature Engineering
# 1️⃣ Encode categorical variables
# Why? One-hot encoding lets models use categorical variables in numerical form.
categorical_cols = ["admission_type", "admission_location", "discharge_location", "insurance", "race", "gender", "marital_status"]
df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

# 2️⃣ Handle missing values in numerical columns
# Why? Keeps numerical data clean for modeling.
df["icu_los"] = df["icu_los"].fillna(0)  # Fill missing ICU LOS with 0 (non-ICU admissions)
df["num_comorbidities"] = df["num_comorbidities"].fillna(0)

# 3️⃣ Drop unnecessary identifiers
# Why? These IDs don't contribute to prediction.
# Open question: repeated admissions of the same patient are not accounted for;
# should each admission be treated separately?
df = df.drop(columns=["subject_id", "hadm_id", "stay_id", "anchor_year"])
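
Several of the encoded columns (discharge_location, insurance, marital_status) still contain missing values according to the `df.info()` output. With `drop_first=True` and no missing-value indicator, a row with a missing category gets all zeros, which is the same encoding as the dropped reference category. A minimal sketch on toy data of one way to keep the two apart, using the `dummy_na` option of `pd.get_dummies`:

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({"insurance": ["Medicaid", None, "Medicare", "Medicaid"]})

# dummy_na=True adds an extra indicator column for missing values,
# so a missing category no longer looks like the dropped reference level
encoded = pd.get_dummies(
    toy, columns=["insurance"], drop_first=True, dummy_na=True, dtype=int
)
print(encoded)
```

Here "Medicaid" is the dropped reference level, and the second row is marked only in the missing-value column.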

In [36]:
# Identify categorical columns that may still contain text
categorical_cols = ["last_careunit"]

# Convert categorical columns to numerical encoding
df[categorical_cols] = df[categorical_cols].astype("category").apply(lambda x: x.cat.codes)

print("✅ Categorical variables converted to numeric!")


✅ Categorical variables converted to numeric!


In [38]:
# Drop non-numeric columns before applying SMOTE

df = df.drop(columns=["diagnosis_list"])

print("✅ Removed non-numeric columns before SMOTE.")


✅ Removed non-numeric columns before SMOTE.


In [40]:
from sklearn.model_selection import train_test_split

# Define target variable
y = df["delirium"]
X = df.drop(columns=["delirium"])

# Perform train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print("✅ Train-test split completed!")


✅ Train-test split completed!
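
The split above is row-wise, so admissions from the same patient can land in both the training and test sets. If that is a concern, a patient-level split is one alternative. The sketch below uses scikit-learn's `GroupShuffleSplit` on toy data; it assumes subject_id is still available at split time, which is not the case in the cells above because the column is dropped earlier:

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy data for illustration only: 6 patients, some with repeated admissions
toy = pd.DataFrame(
    {
        "subject_id": [1, 1, 1, 2, 2, 3, 4, 4, 5, 6],
        "anchor_age": [52, 52, 52, 19, 19, 72, 68, 68, 45, 81],
        "delirium": [0, 0, 1, 0, 0, 0, 1, 0, 0, 0],
    }
)

# Split by patient so that no subject_id appears on both sides
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(
    splitter.split(toy, toy["delirium"], groups=toy["subject_id"])
)
train, test = toy.iloc[train_idx], toy.iloc[test_idx]

shared_patients = set(train["subject_id"]) & set(test["subject_id"])
print("Patients in both sets:", shared_patients)
```

`GroupShuffleSplit` does not stratify on the outcome; scikit-learn's `StratifiedGroupKFold` is an option when both grouping and class balance matter.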


In [42]:
from imblearn.over_sampling import SMOTE

# Initialize SMOTE
smote = SMOTE(random_state=42)

# Apply SMOTE to training data
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

print("✅ SMOTE applied! New class distribution:")
print(pd.Series(y_train_resampled).value_counts(normalize=True))


[WinError 2] The system cannot find the file specified
  File "C:\Users\truly\anaconda3\Lib\site-packages\joblib\externals\loky\backend\context.py", line 257, in _count_physical_cores
    cpu_info = subprocess.run(
               ^^^^^^^^^^^^^^^
  File "C:\Users\truly\anaconda3\Lib\subprocess.py", line 548, in run
    with Popen(*popenargs, **kwargs) as process:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\truly\anaconda3\Lib\subprocess.py", line 1026, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "C:\Users\truly\anaconda3\Lib\subprocess.py", line 1538, in _execute_child
    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^


✅ SMOTE applied! New class distribution:
delirium
0    0.5
1    0.5
Name: proportion, dtype: float64
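
The `[WinError 2]` traceback above comes from joblib's loky backend while it tries to count physical CPU cores on Windows. The resampling still completed, as the balanced class distribution shows. The warning that normally accompanies this traceback suggests setting the `LOKY_MAX_CPU_COUNT` environment variable. A minimal sketch follows; the value `"4"` is a placeholder, not a setting taken from this notebook:

```python
import os

# Set this before scikit-learn / imbalanced-learn first trigger joblib's
# core detection, e.g. in the first cell of the notebook.
# "4" is a placeholder; choose the number of cores you want joblib to use.
os.environ["LOKY_MAX_CPU_COUNT"] = "4"

print("LOKY_MAX_CPU_COUNT =", os.environ["LOKY_MAX_CPU_COUNT"])
```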


In [44]:
from sklearn.preprocessing import MinMaxScaler

# Identify numerical columns
numeric_cols = ["anchor_age", "num_comorbidities", "ed_time_spent", "icu_los"]

# Apply MinMaxScaler (scales values between 0 and 1)
scaler = MinMaxScaler()
X_train_resampled[numeric_cols] = scaler.fit_transform(X_train_resampled[numeric_cols])
X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])

print("✅ Features normalized for Naïve Bayes!")


✅ Features normalized for Naïve Bayes!
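
Because the scaler is fit on the training data only, test values outside the training range map outside [0, 1] and can even become negative. That matters for models that expect non-negative inputs (MultinomialNB, for example). A small sketch on toy numbers of how `MinMaxScaler(clip=True)` bounds the transformed test values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data for illustration only: the training range is 10..30
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[5.0], [25.0], [40.0]])

plain = MinMaxScaler().fit(train)
clipped = MinMaxScaler(clip=True).fit(train)

plain_out = plain.transform(test).ravel()      # values can leave [0, 1]
clipped_out = clipped.transform(test).ravel()  # values are clipped to [0, 1]

print("Without clip:", plain_out)
print("With clip:   ", clipped_out)
```

Clipping hides how far out of range a test value was, so it is a trade-off rather than a default choice.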


In [46]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# Initialize and train Naïve Bayes model
nb_model = GaussianNB()
nb_model.fit(X_train_resampled, y_train_resampled)

# Make predictions
y_pred_nb = nb_model.predict(X_test)

# Evaluate performance
print("✅ Naïve Bayes Model Performance:")
print(classification_report(y_test, y_pred_nb))



✅ Naïve Bayes Model Performance:
              precision    recall  f1-score   support

           0       0.99      0.84      0.91    108918
           1       0.06      0.56      0.12      2131

    accuracy                           0.84    111049
   macro avg       0.53      0.70      0.51    111049
weighted avg       0.97      0.84      0.89    111049
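
The low precision for class 1 follows directly from the class imbalance. As a rough check, the sketch below rebuilds the approximate confusion counts from the rounded recall values and supports in the report above, so the results are estimates, not exact figures:

```python
# Values copied from the classification report above (recalls are rounded)
support_neg, support_pos = 108918, 2131
recall_neg, recall_pos = 0.84, 0.56

# Approximate confusion-matrix counts implied by the report
tp = recall_pos * support_pos        # delirium cases correctly flagged
fp = (1 - recall_neg) * support_neg  # non-delirium cases flagged as delirium

precision_pos = tp / (tp + fp)
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)
prevalence = support_pos / (support_pos + support_neg)

print(f"Approx. true positives:  {tp:,.0f}")
print(f"Approx. false positives: {fp:,.0f}")
print(f"Precision (class 1):     {precision_pos:.3f}")
print(f"F1 (class 1):            {f1_pos:.3f}")
print(f"Prevalence in test set:  {prevalence:.3f}")
```

With roughly 17,000 false positives against about 1,200 true positives, precision stays near 0.06 even though more than half of the delirium cases are detected. For comparison, delirium prevalence in the test set is about 0.02.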



In [48]:
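from sklearn.naive_bayes import MultinomialNB

# Train a Multinomial Naïve Bayes model on the same resampled data for comparison
nb_model = MultinomialNB()
nb_model.fit(X_train_resampled, y_train_resampled)

# Make predictions and evaluate performance
y_pred_nb = nb_model.predict(X_test)
print(classification_report(y_test, y_pred_nb))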


              precision    recall  f1-score   support

           0       0.99      0.80      0.89    108918
           1       0.06      0.69      0.12      2131

    accuracy                           0.80    111049
   macro avg       0.53      0.75      0.50    111049
weighted avg       0.97      0.80      0.87    111049

