# Supervised Modeling Pipeline for ICU Length of Stay Prediction

**Authors**: *Giuseppe Pitruzzella, Radvilė Rušaitė, Karlota Bochanaitė*

This project applies supervised learning to predict a continuous clinical variable, **Length of Stay (LOS)** in the Intensive Care Unit, using regression models based on neural networks. The approach builds on fundamental machine learning paradigms studied during the course, including probabilistic estimation techniques such as Maximum A Posteriori (MAP) estimation.

The analysis uses real-world data from **MIMIC-III** (Medical Information Mart for Intensive Care), a publicly available clinical database developed by the MIT Lab for Computational Physiology in collaboration with Beth Israel Deaconess Medical Center (Boston). The database contains de-identified health-related data associated with over 60,000 ICU admissions between 2001 and 2012, and has become a globally recognized benchmark for research in medical data science and critical care.

## Dataset Setup (Run Directly Below)

To facilitate reproducibility, the required `.csv` files from the MIMIC-III dataset are downloaded and decompressed **automatically** via the script cell provided just below. This script:

* Fetches the list of all available `.csv.gz` files from the educational [mirror site](https://www.dcc.fc.up.pt/~ines/MIMIC-III/)
* Downloads and decompresses each file
* Organizes them into the `data/raw/` folder

> Simply **run the code cell below** to fetch and prepare the full dataset.
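
The three steps listed above can be sketched roughly as follows. This is a minimal illustration, not the project's own script: it assumes the mirror serves a plain HTML directory index with one `<a href>` per file, and the names `list_csv_gz_links`, `download_and_decompress`, and `fetch_all` are hypothetical helpers introduced only for this example.

```python
import gzip
import shutil
import urllib.request
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urljoin

# Mirror URL and target folder come from the text above; the rest is illustrative.
MIRROR_URL = "https://www.dcc.fc.up.pt/~ines/MIMIC-III/"
RAW_DIR = Path("data/raw")


class _LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag in an HTML page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def list_csv_gz_links(index_html, base_url=MIRROR_URL):
    """Return absolute URLs of all .csv.gz links found in an index page."""
    parser = _LinkCollector()
    parser.feed(index_html)
    return [urljoin(base_url, h) for h in parser.hrefs if h.endswith(".csv.gz")]


def download_and_decompress(url, out_dir=RAW_DIR):
    """Stream one .csv.gz file and write the decompressed .csv into out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    csv_name = url.rsplit("/", 1)[-1][: -len(".gz")]  # "X.csv.gz" -> "X.csv"
    target = out_dir / csv_name
    with urllib.request.urlopen(url) as response:
        with gzip.GzipFile(fileobj=response, mode="rb") as gz, open(target, "wb") as out:
            shutil.copyfileobj(gz, out)
    return target


def fetch_all(index_url=MIRROR_URL, out_dir=RAW_DIR):
    """Fetch the index page, then download and decompress every listed file."""
    with urllib.request.urlopen(index_url) as response:
        index_html = response.read().decode("utf-8", errors="replace")
    links = list_csv_gz_links(index_html, index_url)
    return [download_and_decompress(url, out_dir) for url in links]
```

Calling `fetch_all()` would perform the actual download. Nothing runs at import time, so the link-parsing step can be checked offline against a saved copy of the index page.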

## Environment Setup (Run Below)

Before proceeding with the analysis, make sure all required Python libraries are available. You can automatically install any missing dependencies listed in `requirements.txt` by executing the following cell:

In [24]:
import importlib.util  # needed by is_installed() below
import subprocess
import sys

In [20]:
# Path to the requirements.txt file
req_file = "../requirements.txt"  # modify if necessary

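def read_requirements(file_path):
    """Return the package names listed in a requirements file, without "==" version pins."""
    with open(file_path, "r") as f:
        lines = (line.strip() for line in f)
        # Skip blank lines and comments (including indented ones).
        return [line.split("==")[0] for line in lines if line and not line.startswith("#")]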

def is_installed(package_name):
    return importlib.util.find_spec(package_name) is not None

def pip_install(package_name):
    print(f"Installing: {package_name}")
    subprocess.check_call([sys.executable, "-m", "pip", "install", package_name])

# Map exceptions between package name and importable module
module_map = {
    "scikit-learn": "sklearn",
    "ipython": "IPython"
}

# Load packages from requirements.txt
required_packages = read_requirements(req_file)

# Install only missing packages
for pkg in required_packages:
    module_name = module_map.get(pkg, pkg)
    if not is_installed(module_name):
        pip_install(pkg)
    else:
        print(f"Already installed: {pkg}")


Installing: torch
Collecting torch
  Downloading torch-2.2.2-cp312-none-macosx_10_9_x86_64.whl.metadata (25 kB)
Collecting filelock (from torch)
  Using cached filelock-3.18.0-py3-none-any.whl.metadata (2.9 kB)
Collecting sympy (from torch)
  Downloading sympy-1.14.0-py3-none-any.whl.metadata (12 kB)
Collecting networkx (from torch)
  Downloading networkx-3.5-py3-none-any.whl.metadata (6.3 kB)
Collecting jinja2 (from torch)
  Using cached jinja2-3.1.6-py3-none-any.whl.metadata (2.9 kB)
Collecting fsspec (from torch)
  Downloading fsspec-2025.5.1-py3-none-any.whl.metadata (11 kB)
Collecting MarkupSafe>=2.0 (from jinja2->torch)
  Downloading MarkupSafe-3.0.2-cp312-cp312-macosx_10_13_universal2.whl.metadata (4.0 kB)
Collecting mpmath<1.4,>=1.1.0 (from sympy->torch)
  Using cached mpmath-1.3.0-py3-none-any.whl.metadata (8.6 kB)
Downloading torch-2.2.2-cp312-none-macosx_10_9_x86_64.whl (150.8 MB)

## Initial Exploration

We begin by exploring the main tables in the MIMIC-III dataset. The goal of this first phase is to understand the structure of the database and identify relevant variables that may influence ICU length of stay. Key reference tables include:

* `D_ICD_DIAGNOSES.csv` – diagnosis codes (ICD-9)
* `ICUSTAYS.csv` – ICU stays metadata
* `D_ITEMS.csv` – item IDs and descriptions for time-series events
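
A first look at these tables could go along the following lines. This sketch uses only the standard library so that it does not assume any particular package is installed. The `data/raw` location is taken from the dataset setup section, while `preview_table` and `preview_all` are hypothetical helper names introduced for illustration.

```python
import csv
from itertools import islice
from pathlib import Path

RAW_DIR = Path("data/raw")  # folder populated by the dataset setup step
TABLES = ["D_ICD_DIAGNOSES.csv", "ICUSTAYS.csv", "D_ITEMS.csv"]


def preview_table(path, n_rows=3):
    """Return (column_names, first n_rows data rows) without reading the whole file."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, [])  # empty list if the file has no lines
        rows = list(islice(reader, n_rows))
    return header, rows


def preview_all(raw_dir=RAW_DIR, tables=TABLES, n_rows=3):
    """Preview each listed table that exists in raw_dir; missing files are skipped."""
    previews = {}
    for name in tables:
        path = Path(raw_dir) / name
        if path.exists():
            previews[name] = preview_table(path, n_rows)
    return previews


if __name__ == "__main__":
    for name, (header, rows) in preview_all().items():
        print(name, "columns:", header)
        for row in rows:
            print("  ", row)
```

With pandas available, `pandas.read_csv(path, nrows=...)` would be a natural alternative for the same kind of quick inspection.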