# Code

**Vital Parameters**
- Temperature: values >= 50 are assumed to be °F and converted; 30 <= x <= 45 °C (excludes 58 rows)
- Heart Rate: 20 <= x <= 240 (excludes 12 rows)
- Respiratory Rate: 1 <= x <= 70 (excludes 15 rows)
- O2 Saturation: 20 <= x <= 100 (excludes 51 rows)
- Systolic BP: 30 <= x <= 300 (excludes 63 rows)
- Diastolic BP: 25 <= x <= 150 (excludes 315 rows)

**Ethnicity Mapping**
- 33 distinct values are mapped to 5 groups: Asian, Black, Unknown, White, Other

**ICD Codes**
- One ICD-9 code (707.24) cannot be translated and the corresponding row was excluded (1 row)

**Chief Complaints**
- 27687 distinct categories were clustered into 310 (309 + "Other")
- DataFrame length: 141461, of which 646 are NaN values (0.46%). These were mostly "Transfer" or meaningless entries ("//", "-", ...) and were therefore set to pd.NA.
- Size of the 'Other' cluster: 38708 (27.36%)
- Top 20 entries with percentage share:
   - abdominal pain                  8.413166
   - chest pain                      4.741682
   - dyspnea                         4.379505
   - fever                           2.619749
   - sp fall                         2.464226
   - weakness                        1.933033
   - abnormal labs                   1.764017
   - nausea vomiting                 1.755495
   - altered mental status           1.709335
   - wound eval                      1.428825
   - bright red blood per rectum     1.303838
   - back pain                       1.234954
   - suicidal ideation               1.222881
   - headache                        0.954444
   - shortness of breath             0.946632
   - ...



# Imports

In [1]:
import pandas as pd
from datetime import datetime
from pathlib import Path
import numpy as np
from pyicd.utils.icd_tools import icd9_to_icd10
from functools import lru_cache
from rapidfuzz import process, fuzz
import re

# User Configuration

In [2]:
file_path = "data/20250301_data.csv"
icd_path = "icd10_blocks.csv"

# Utility Functions
## Temperature Preprocessing

In [3]:
def preprocess_temperature(
    df: pd.DataFrame,
    temp_column: str,
    f_threshold: float = 50.0,
    c_min: float = 30.0,
    c_max: float = 45.0,
) -> pd.Series:
    """
    Standardize a temperature column to Celsius, marking implausible or malformed
    entries as NaN.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame.
    temp_column : str
        Name of the column containing temperature readings (mixed °F/°C).
    f_threshold : float, default=50.0
        Any reading >= this is treated as °F and converted; otherwise assumed °C.
    c_min : float, default=30.0
        Minimum plausible body temperature in °C.
    c_max : float, default=45.0
        Maximum plausible body temperature in °C.

    Returns
    -------
    pd.Series
        Temperatures in °C (Pandas Nullable Float64 dtype), out-of-range or invalid as NaN (pd.NA).
    """
    if temp_column not in df.columns:
        raise KeyError(f"Column '{temp_column}' not found in DataFrame.")

    col = df[temp_column]
    temps = pd.to_numeric(col, errors="coerce")
    is_f = temps >= f_threshold

    c_temps = temps.copy()
    c_temps[is_f] = (temps[is_f] - 32) * 5.0 / 9.0

    mask_implausible = (c_temps < c_min) | (c_temps > c_max)
    c_temps[mask_implausible] = pd.NA

    c_temps = c_temps.round(1).astype('Float64')

    return c_temps

# Example usage:
# df['temperature_clean'] = preprocess_temperature(df, 'temperature')
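
To make the conversion rule concrete, here is a minimal pure-Python sketch of the same per-value logic, independent of pandas. The helper name `to_celsius` is hypothetical and not part of the notebook; the thresholds are simply the defaults of `preprocess_temperature`.

```python
from typing import Optional


def to_celsius(value: float, f_threshold: float = 50.0,
               c_min: float = 30.0, c_max: float = 45.0) -> Optional[float]:
    """Hypothetical scalar counterpart of preprocess_temperature."""
    # Readings at or above the threshold are assumed to be Fahrenheit.
    celsius = (value - 32) * 5.0 / 9.0 if value >= f_threshold else value
    # Anything outside the plausible body-temperature range is discarded.
    if celsius < c_min or celsius > c_max:
        return None
    return round(celsius, 1)


print(to_celsius(98.6))   # Fahrenheit reading, converted to Celsius
print(to_celsius(36.5))   # already Celsius, kept as is
print(to_celsius(120.0))  # about 48.9 °C after conversion, rejected
```

Note that a reading just below the threshold (for example 49.9) is treated as Celsius and then rejected by the range check, so the two rules together leave no ambiguous band.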

## Vital Signs Preprocessing

In [4]:
def preprocess_vital_signs(
    df: pd.DataFrame,
    column: str,
    min_valid: float,
    max_valid: float,
    winsorize: bool = False,
    winsor_limits: tuple = (0.01, 0.99),
) -> pd.Series:
    """
    Clean and optionally winsorize a vital sign column.

    Parameters
    ----------
    df : pd.DataFrame
        DataFrame with vital signs.
    column : str
        Column name to clean.
    min_valid : float
        Minimum plausible value.
    max_valid : float
        Maximum plausible value.
    winsorize : bool, default=False
        Whether to apply winsorizing.
    winsor_limits : tuple, default=(0.01, 0.99)
        Winsorizing limits (lower, upper percentiles).

    Returns
    -------
    pd.Series
        Cleaned (and optionally winsorized) column as nullable integer (Int64).
    """
    series = pd.to_numeric(df[column], errors='coerce')
    series[(series < min_valid) | (series > max_valid)] = pd.NA

    if winsorize:
        lower = series.quantile(winsor_limits[0])
        upper = series.quantile(winsor_limits[1])
        series = series.clip(lower, upper)

    series = series.round().astype('Int64')

    return series

# Example usage
# df['heart_rate_clean'] = preprocess_vital_signs(df, 'heart_rate', min_valid=30, max_valid=220)
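
The optional winsorizing step clips values to a lower and an upper quantile. The following stand-alone sketch shows that idea without pandas; `quantile_linear` and `winsorize` are hypothetical helpers, and the linear interpolation is an assumption chosen to mirror the default of `Series.quantile`.

```python
def quantile_linear(sorted_values, q):
    """Quantile with linear interpolation between neighbouring order statistics."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] + frac * (sorted_values[hi] - sorted_values[lo])


def winsorize(values, limits=(0.01, 0.99)):
    """Clip every value to the given lower and upper quantiles."""
    ordered = sorted(values)
    lower = quantile_linear(ordered, limits[0])
    upper = quantile_linear(ordered, limits[1])
    return [min(max(v, lower), upper) for v in values]


heart_rates = list(range(1, 102))  # illustrative values 1..101, not real data
clipped = winsorize(heart_rates)
print(min(clipped), max(clipped))  # the extremes 1 and 101 are pulled inwards
```

Unlike the plausibility filter, winsorizing keeps the row and only moves extreme values to the quantile boundary.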

## NEWS Score Calculation

In [5]:
def calculate_news_score(df: pd.DataFrame) -> pd.Series:
    """
    Compute the NEWS score (without the NEWS2 extensions, using SpO₂ scale 1,
    and without the additional points for O₂ therapy) in a fully vectorized way.
    https://pubmed.ncbi.nlm.nih.gov/23295778/

    Parameters
    ----------
    df : pd.DataFrame
        Must contain the columns 'resprate', 'o2sat', 'sbp', 'heartrate',
        'acvpu' and 'temperature'.

    Returns
    -------
    pd.Series
        A Series named 'news_score' with the same index as `df`.
        A missing value in any input column yields NA in the result.
    """
    required = {"resprate", "o2sat", "sbp", "heartrate", "acvpu", "temperature"}
    missing  = required.difference(df.columns)
    if missing:
        raise KeyError(f"DataFrame is missing the following columns: {', '.join(sorted(missing))}")

    # --- RESPIRATORY RATE ----------------------------------------------------
    rr_src = df.resprate.astype(float)
    rr = np.select(
        [rr_src <= 8,
         rr_src.between(9, 11, inclusive="both"),
         rr_src.between(21, 24, inclusive="both"),
         rr_src >= 25],
        [3, 1, 2, 3], default=0
    )

    # --- SpO₂ (Scale 1) ------------------------------------------------------
    spo2_src = df.o2sat.astype(float)
    spo2 = np.select(
        [spo2_src <= 91,
         spo2_src.between(92, 93, inclusive="both"),
         spo2_src.between(94, 95, inclusive="both"),
         spo2_src >= 96],
        [3, 2, 1, 0], default=0
    )

    # --- SYSTOLIC BLOOD PRESSURE --------------------------------------------
    sbp_src = df.sbp.astype(float)
    sbp = np.select(
        [sbp_src <= 90,
         sbp_src.between(91, 100, inclusive="both"),
         sbp_src.between(101, 110, inclusive="both"),
         sbp_src >= 220],
        [3, 2, 1, 3], default=0          # 111-219 → 0 points
    )

    # --- HEART RATE ----------------------------------------------------------
    hr_src = df.heartrate.astype(float)
    hr = np.select(
        [hr_src <= 40,
         hr_src.between(41, 50, inclusive="both"),
         hr_src.between(51, 90, inclusive="both"),
         hr_src.between(91, 110, inclusive="both"),
         hr_src.between(111, 130, inclusive="both"),
         hr_src >= 131],
        [3, 1, 0, 1, 2, 3]
    )

    # --- TEMPERATURE ---------------------------------------------------------
    temp_src = df.temperature.astype(float)
    temp = np.select(
        [temp_src <= 35.0,
         temp_src.between(35.1, 36.0, inclusive="both"),
         temp_src.between(36.1, 38.0, inclusive="both"),
         temp_src.between(38.1, 39.0, inclusive="both"),
         temp_src >= 39.1],
        [3, 1, 0, 1, 2]
    )

    # --- ACVPU ---------------------------------------------------------------
    acvpu = np.where(df.acvpu.isin(["C", "V", "P", "U"]), 3, 0)

    # --- TOTAL ---------------------------------------------------------------
    total_numeric = rr + spo2 + sbp + hr + temp + acvpu

    cat_type  = pd.CategoricalDtype(categories=range(19), ordered=True)
    news_score = pd.Series(total_numeric, index=df.index,
                           name="news_score", dtype=cat_type)

    # --- NA propagation ------------------------------------------------------
    na_mask = df[list(required)].isna().any(axis=1)
    news_score[na_mask] = pd.NA

    return news_score
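
As a cross-check for the vectorized bands, the same lookup can be written for a single patient. This is only a sketch: `news_points` is a hypothetical helper, it assumes integer vitals and temperatures with one decimal (as produced by the preprocessing functions), and the example values are made up.

```python
def news_points(resprate, o2sat, sbp, heartrate, temperature, acvpu):
    """Hypothetical per-patient version of the band lookup used above."""
    def band(value, upper_bounds, points):
        # Return the points of the first band whose upper bound covers the value.
        for bound, pts in zip(upper_bounds, points):
            if value <= bound:
                return pts
        return points[-1]

    inf = float("inf")
    rr = band(resprate, [8, 11, 20, 24, inf], [3, 1, 0, 2, 3])
    spo2 = band(o2sat, [91, 93, 95, inf], [3, 2, 1, 0])
    bp = band(sbp, [90, 100, 110, 219, inf], [3, 2, 1, 0, 3])
    hr = band(heartrate, [40, 50, 90, 110, 130, inf], [3, 1, 0, 1, 2, 3])
    temp = band(temperature, [35.0, 36.0, 38.0, 39.0, inf], [3, 1, 0, 1, 2])
    # Any level other than "A" (alert) scores 3 points.
    consciousness = 3 if acvpu in {"C", "V", "P", "U"} else 0
    return rr + spo2 + bp + hr + temp + consciousness


print(news_points(22, 93, 95, 115, 38.4, "A"))  # 2 + 2 + 2 + 2 + 1 + 0 = 9
print(news_points(16, 98, 120, 70, 36.8, "A"))  # all vitals in the 0-point bands
```

The maximum of 18 points (3 per component) matches the `range(19)` categories used for the result dtype.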


## ICD-9 Conversion

In [6]:
def convert_icd9_series_to_icd10(df: pd.DataFrame, icd_colname: str) -> pd.Series:
    """
    Take a DataFrame with an 'icd_version' column and the code column named by
    `icd_colname`, and return a pd.Series that holds the corresponding ICD-10
    code for every ICD-9 row (icd_version == 9; the last result returned by
    icd9_to_icd10) and the original, stripped code for all other rows.

    Example:
        df['icd_new'] = convert_icd9_series_to_icd10(df, icd_colname='icd_code')
    """
    # 1) Original codes as stripped strings
    codes = df[icd_colname].astype(str).str.strip()
    # 2) Mask for ICD-9 rows
    mask = df['icd_version'] == 9

    # 3) Unique codes among the ICD-9 rows
    unique_codes = codes[mask].dropna().unique()

    # 4) Mapping ICD-9 → ICD-10 (last result, or pd.NA if there is none)
    mapping: dict = {}
    for code in unique_codes:
        if not isinstance(code, str):
            mapping[code] = pd.NA
            continue
        res = icd9_to_icd10(icd_code=code, flag=None, show_flags=False)
        mapping[code] = res.iloc[-1]['icd10'] if not res.empty else pd.NA

    # 5) Create the new Series and fill it
    icd_new = pd.Series(pd.NA, index=df.index, dtype="object")
    # ICD-9 rows: mapped ICD-10 codes
    icd_new.loc[mask] = codes[mask].map(mapping)
    # all other rows: original (stripped) code
    icd_new.loc[~mask] = codes[~mask]

    return icd_new

## ICD-10 Mapping to Subgroups

In [7]:
@lru_cache(maxsize=1)
def _load_blocks(blocks_csv_path: str) -> pd.DataFrame:
    """
    Load and prepare the ICD-10 blocks CSV:
    - read with ';' as the separator
    - strip whitespace from the relevant columns
    - create a combined 'Block' column
    """
    df_blocks = pd.read_csv(blocks_csv_path, sep=';', dtype=str)
    for col in ["Start", "End", "Description"]:
        df_blocks[col] = df_blocks[col].str.strip()
    df_blocks["Block"] = df_blocks["Start"] + "-" + df_blocks["End"]
    return df_blocks

def get_icd_info(
    icd_code: str,
    blocks_csv_path: str = "icd10_blocks.csv"
) -> pd.Series:
    """
    Return the block and description for a single ICD code.

    Parameters
    ----------
    icd_code : str
        The ICD code (e.g. 'M05.3' or 'A01').
    blocks_csv_path : str, optional
        Path to the CSV file with the ICD-10 blocks (default: 'icd10_blocks.csv').

    Returns
    -------
    pandas.Series
        Series with two entries:
        - 'icd_block': e.g. 'M05-M14' or None
        - 'icd_desc' : e.g. 'Inflammatory polyarthropathies' or None
    """
    # None/NaN cases
    if icd_code is None or (isinstance(icd_code, float) and pd.isna(icd_code)):
        return pd.Series({"icd_block": None, "icd_desc": None})
    if not isinstance(icd_code, str):
        icd_code = str(icd_code)

    # Normalization: remove dots, upper-case, keep only the first 3 characters
    code3 = icd_code.replace(".", "").upper()[:3]

    # Special case
    if code3 == "M1A":
        return pd.Series({
            "icd_block": "M05-M14",
            "icd_desc":  "Inflammatory polyarthropathies"
        })

    # Load the data (cached after the first call)
    blocks = _load_blocks(blocks_csv_path)

    # Find the row where Start <= code3 <= End
    match = blocks[
        (blocks["Start"] <= code3) &
        (blocks["End"]   >= code3)
    ]
    if not match.empty:
        row = match.iloc[0]
        return pd.Series({
            "icd_block": row["Block"],
            "icd_desc":  row["Description"]
        })

    # No match
    return pd.Series({"icd_block": None, "icd_desc": None})
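
The range lookup relies on plain string comparison of the three-character code prefix. The sketch below reproduces that idea on a two-row table; `BLOCKS` is a hypothetical excerpt standing in for `icd10_blocks.csv`, and `find_block` is an illustrative helper, not the notebook's function.

```python
from typing import Optional, Tuple

# Hypothetical two-row excerpt standing in for icd10_blocks.csv.
BLOCKS = [
    ("A00", "A09", "Intestinal infectious diseases"),
    ("M05", "M14", "Inflammatory polyarthropathies"),
]


def find_block(icd_code: str) -> Optional[Tuple[str, str]]:
    """Return (block, description) for a code, or None if no block covers it."""
    # Same normalization as get_icd_info: drop dots, upper-case, keep 3 characters.
    code3 = icd_code.replace(".", "").upper()[:3]
    for start, end, description in BLOCKS:
        # Lexicographic comparison of the 3-character prefixes.
        if start <= code3 <= end:
            return f"{start}-{end}", description
    return None


print(find_block("M05.3"))
print(find_block("M1A.0"))  # sorts after "M14", so the range test misses it
```

A prefix such as `M1A` compares greater than `M14` as a string, which is presumably why `get_icd_info` handles it as a special case before the range test.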


## Chief Complaint Clustering

In [8]:
def preprocess_text(text):
    """Normalize text: lowercase, remove punctuation, expand medical abbreviations."""
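    text = str(text).lower().strip()
    # Punctuation is stripped before the abbreviation pass, so "n/v" is already
    # "nv" and "s/p" is already "sp" when the patterns below are applied;
    # patterns that contain "/" or "-" therefore never match.
    text = re.sub(r'[^a-z0-9\s]', '', text)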
    abbreviations = {
        r"\babd\b": "abdominal",
        r"\babn\b": "abnormal",
        r"\babnl\b": "abnormal",
        # NOTE: the next eight keys end in a trailing "l" (e.g. "evall"), which
        # looks like a typo; as written they match only those exact tokens, so
        # "eval", "ftt", "svt", ... stay unexpanded.
        r"\bevall\b": "evaluation",
        r"\bfttl\b": "failure to thrive",
        r"\bsvtl\b": "supraventricular tachycardia",
        r"\btibfibl\b": "tibia fibula",
        r"\baaal\b": "abdominal aortic aneurysm",
        r"\bichl\b": "intracerebral hemorrhage",
        r"\bsbol\b": "small bowel obstruction",
        r"\bsdhl\b": "subdural hematoma",
        r"\bchf\b": "congestive heart failure",
        r"\biph\b": "intraparenchymal hemorrhage",
        r"\b b \b": "bilateral",
        r"\bafib\b": "atrial fibrillation",
        r"\blbp\b": "lower back pain",
        r"\bcspine\b": "cervical spine",
        r"\bsob\b": "shortness of breath",
        r"\bn/v\b": "nausea vomiting",
        r"\bnv\b": "nausea vomiting",
        r"\bnvd\b": "nausea vomiting diarrhea",
        r"\bha\b": "headache",
        r"\bcp\b": "chest pain",
        r"\buti\b": "urinary tract infection",
        r"\bsi\b": "suicidal ideation",
        r"\bbrbpr\b": "bright red blood per rectum",
        r"\bfx\b": "fracture",
        r"\bili\b": "iliac",
        r"\blac\b": "laceration",
        r"\bloc\b": "loss of consciousness",
        r"\bgib\b": "gastrointestinal bleeding",
        r"\btia\b": "transient ischemic attack",
        r"\bdka\b": "diabetic ketoacidosis",
        r"\bams\b": "altered mental status",
        r"\betoh\b": "alcohol intoxication",
        r"\bhtn\b": "hypertension",
        r"\bdm\b": "diabetes mellitus",
        r"\bhld\b": "hyperlipidemia",
        r"\bcad\b": "coronary artery disease",
        r"\bt/a\b": "tonsillectomy and adenoidectomy",
        r"\bcva\b": "cerebrovascular accident",
        r"\bmvc\b": "motor vehicle collision",
        r"\bod\b": "overdose",
        r"\bsz\b": "seizure",
        r"\ba fib\b": "atrial fibrillation",
        r"\bpe\b": "pulmonary embolism",
        r"\bdvt\b": "deep vein thrombosis",
        r"\brh\b": "rheumatoid arthritis",
        r"\bgi\b": "gastrointestinal",
        r"\bruq\b": "right upper quadrant",
        r"\brlq\b": "right lower quadrant",
        r"\bluq\b": "left upper quadrant",
        r"\bllq\b": "left lower quadrant",
        r"\bsah\b": "subarachnoid hemorrhage",
        r"\bs\/p\b": "",
        r"\bh\/o\b": "",
        r"\bl\b": "",
        r"\br\b": "",
        r"\b-\b": "",
        r"\btransfer\b": "",
    }
    for abbr, full in abbreviations.items():
        text = re.sub(abbr, full, text)
    # collapse runs of whitespace into a single space
    text = re.sub(r"\s+", " ", text).strip()
    return text
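
Because `preprocess_text` removes punctuation before it expands abbreviations, the order of the two steps determines which patterns can fire. The sketch below replays that order on a small, hypothetical excerpt of the abbreviation table (`ABBREVIATIONS` and `normalize` are illustrative names only).

```python
import re

# Hypothetical excerpt of the abbreviation table.
ABBREVIATIONS = {
    r"\bn/v\b": "nausea vomiting",
    r"\bnv\b": "nausea vomiting",
    r"\bs\/p\b": "",
    r"\bcp\b": "chest pain",
    r"\bsob\b": "shortness of breath",
}


def normalize(text: str) -> str:
    text = text.lower().strip()
    # Punctuation is removed first, exactly as in preprocess_text.
    text = re.sub(r"[^a-z0-9\s]", "", text)
    for pattern, replacement in ABBREVIATIONS.items():
        text = re.sub(pattern, replacement, text)
    # Collapse whitespace left behind by removals.
    return re.sub(r"\s+", " ", text).strip()


print(normalize("N/V"))       # "n/v" becomes "nv", which is then expanded
print(normalize("S/P Fall"))  # "s/p" becomes "sp", which no pattern matches
print(normalize("CP, SOB"))
```

This would be consistent with the "sp fall" entry in the chief-complaint summary at the top of the notebook.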

def cluster_complaints(df, col_name, num_clusters=499, threshold=80):
    """
    Cluster complaints by normalizing them and grouping via fuzzy matching.
    
    Args:
        df (pd.DataFrame): Input DataFrame.
        col_name (str): Name of the column holding the original complaints.
        num_clusters (int): Number of base clusters (most frequent phrases).
        threshold (int): Fuzzy-matching threshold (0–100).
    Returns:
        pd.Series: Cluster label for each row.
    """
    # 1) Normalization
    norms = df[col_name].astype(str).apply(preprocess_text)
    
    # 2) Determine the most frequent phrases
    freq = norms.value_counts()
    unique_phrases = freq.index.tolist()
    base_clusters = unique_phrases[:num_clusters]
    # print(f"Base clusters: {base_clusters}")

    # 3) Fuzzy cluster assignment
    clusters = {}
    assigned = set()
    for base in base_clusters:
        if base in assigned:
            continue
        clusters[base] = base
        assigned.add(base)
        # NOTE: process.extract uses limit=5 by default, so at most five phrases
        # are attached per base; pass limit=None to attach every match above
        # the cutoff.
        matches = process.extract(
            base, unique_phrases,
            scorer=fuzz.token_set_ratio,
            score_cutoff=threshold
        )
        for match_phrase, score, _ in matches:
            if match_phrase not in assigned:
                clusters[match_phrase] = base
                assigned.add(match_phrase)
    
    # 4) Everything else goes to "Other"
    for phrase in unique_phrases:
        if phrase not in clusters:
            clusters[phrase] = "Other"
    
    # 5) Return the Series
    return norms.map(clusters)
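
The greedy scheme (most frequent phrases become bases, similar phrases attach to the first base that claims them, the rest become "Other") can be tried without rapidfuzz. The sketch below swaps in `difflib.SequenceMatcher` from the standard library as the similarity measure, which is an assumption made purely for illustration: it is not `fuzz.token_set_ratio`, scores on a 0 to 1 scale, and does not cap the number of matches per base. `greedy_cluster` and the sample phrases are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def greedy_cluster(phrases, num_bases=2, threshold=0.8):
    """Map each unique phrase to a base phrase or to "Other"."""
    # Unique phrases ordered by descending frequency.
    unique = [p for p, _ in Counter(phrases).most_common()]
    mapping = {}
    for base in unique[:num_bases]:
        if base in mapping:  # already absorbed by an earlier base
            continue
        mapping[base] = base
        for other in unique:
            similar = SequenceMatcher(None, base, other).ratio() >= threshold
            if other not in mapping and similar:
                mapping[other] = base
    # Phrases that no base claimed fall into the catch-all label.
    return {p: mapping.get(p, "Other") for p in unique}


sample = (["chest pain"] * 3 + ["headache"] * 2
          + ["chest pains", "headaches", "rash"])
print(greedy_cluster(sample))
```

Because earlier (more frequent) bases claim phrases first, the result depends on frequency order, which is also why the notebook ends up with fewer clusters than `num_clusters`.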



# Main Routine
## Load Data

In [9]:
df = pd.read_csv(file_path, low_memory=False) # Use low_memory=False to prevent dtype issues with large files
p = Path(file_path)

## Set correct dtypes

In [10]:
for col in ['resprate', 'o2sat', 'sbp', 'dbp', 'heartrate']:
    df[col] = pd.to_numeric(df[col], errors='coerce').astype(int)

for col in ['ed_intime', 'hadm_time']:
    # Parse timestamp strings. If the columns held Unix epoch seconds instead,
    # pd.to_datetime(df[col], unit='s', origin='unix') would be the right call.
    df[col] = pd.to_datetime(df[col])
    
df['icu_within_24h'] = df['icu_within_24h'].map({'Yes': True, 'No': False})
df['acvpu'] = pd.Categorical(df['acvpu'], categories=['A', 'C', 'V', 'P', 'U'], ordered=True)

## Add Additional Features

In [11]:
df['arrival_hour'] = df['ed_intime'].dt.hour
df['arrival_dayofweek'] = df['ed_intime'].dt.dayofweek
df['night_arrival'] = (df['arrival_hour'] >= 22) | (df['arrival_hour'] <= 6)
df['weekend_arrival'] = df['arrival_dayofweek'].isin([5, 6]).astype(bool)
# df['time_to_admission_hours'] = round((df['hadm_time'] - df['ed_intime']).dt.total_seconds() / 3600, 1)
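
For a single timestamp, the two derived flags reduce to the checks below. This is a stdlib sketch with a hypothetical helper name (`arrival_flags`); `datetime.weekday()` uses the same Monday=0 convention as `dt.dayofweek`.

```python
from datetime import datetime


def arrival_flags(ts: datetime) -> dict:
    """Hypothetical scalar counterpart of the night/weekend features."""
    return {
        # Night is defined as 22:00 through 06:59.
        "night_arrival": ts.hour >= 22 or ts.hour <= 6,
        # Monday=0 ... Sunday=6, so 5 and 6 are Saturday and Sunday.
        "weekend_arrival": ts.weekday() in (5, 6),
    }


print(arrival_flags(datetime(2025, 3, 1, 23, 30)))  # a Saturday, late evening
print(arrival_flags(datetime(2025, 3, 3, 12, 0)))   # a Monday, midday
```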

## Preprocess Vital Parameters

In [12]:
df['temperature'] = preprocess_temperature(df, 'temperature', f_threshold=50.0, c_min=30.0, c_max=45.0) # Thresh 50.0, c_min 30.0, c_max 45.0
df['heartrate'] = preprocess_vital_signs(df, 'heartrate', min_valid=20, max_valid=240, winsorize=False) # min_valid 20, max_valid 240
df['resprate'] = preprocess_vital_signs(df, 'resprate', min_valid=1, max_valid=70, winsorize=False) # min_valid 1, max_valid 70
df['o2sat'] = preprocess_vital_signs(df, 'o2sat', min_valid=20, max_valid=100, winsorize=False) # min_valid 20, max_valid 100
df['sbp'] = preprocess_vital_signs(df, 'sbp', min_valid=30, max_valid=300, winsorize=False) # min_valid 30, max_valid 300
df['dbp'] = preprocess_vital_signs(df, 'dbp', min_valid=25, max_valid=150, winsorize=False) # min_valid 25, max_valid 150


## NEWS Score

In [13]:
df['news_score'] = calculate_news_score(df)

## Ethnicity Mapping

In [14]:
print(f"Number of distinct ethnicity categories in 'race' before mapping: {df['race'].nunique()}")
# 1) Extract the first occurrence of any of the keywords (case-insensitive)
extracted = df['race'].str.extract(r'(?i)(asian|white|black|unknown)', expand=False)

# 2) Map back to the desired category names, fill all others as 'Other'
df['race_grouped'] = (
    extracted
      .str.lower()
      .map({
          'asian':   'Asian',
          'white':   'White',
          'black':   'Black',
          'unknown': 'Unknown',
      })
      .fillna('Other')
)
# print(df[['race', 'race_grouped']].head(50))
print(f"Number of distinct ethnicity categories in 'race_grouped' after mapping: {df['race_grouped'].nunique()}")

Number of distinct ethnicity categories in 'race' before mapping: 33
Number of distinct ethnicity categories in 'race_grouped' after mapping: 5
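
The keyword extraction can be checked on individual strings with the standard `re` module. The helper `group_race` is hypothetical, and the input strings are illustrative values rather than a listing of the 33 categories in the data.

```python
import re

PATTERN = re.compile(r"(asian|white|black|unknown)", re.IGNORECASE)


def group_race(value):
    """Hypothetical scalar counterpart of the str.extract + map step."""
    # Missing values fall through to 'Other', as with fillna('Other').
    if not isinstance(value, str):
        return "Other"
    # search() returns the leftmost keyword occurrence in the string.
    match = PATTERN.search(value)
    return match.group(1).capitalize() if match else "Other"


for raw in ["WHITE - RUSSIAN", "ASIAN - CHINESE",
            "HISPANIC/LATINO - PUERTO RICAN", None]:
    print(raw, "->", group_race(raw))
```

Since the match is a substring search, any value that merely contains one of the keywords is assigned to that group.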


## Diagnoses Preprocessing
### Map ICD-9 to ICD-10
Some codes have no mapping; for example, icd9_to_icd10(icd_code="70724", flag=None, show_flags=True) returns an empty result. The entire 707.2 group is absent because it has no corresponding mapping in ICD-10 (https://www.icd10data.com/Convert/707.24).

In [15]:
df['icd10'] = convert_icd9_series_to_icd10(df, icd_colname='icd_code')

### Map ICD-10 Codes to Subgroups

In [16]:
df[["icd_block", "icd_desc"]] = df["icd10"].apply(
    lambda code: get_icd_info(code, blocks_csv_path=icd_path)
)

## Cluster Chief Complaints

In [17]:
print(f"Number of distinct chief complaint categories in 'chiefcomplaint' before clustering: {df['chiefcomplaint'].nunique()}")
df['chiefcomplaint_clustered'] = cluster_complaints(
    df,
    col_name='chiefcomplaint',
    num_clusters=499,
    threshold=80
)
print(f"Number of distinct chief complaint categories in 'chiefcomplaint_clustered' after clustering: {df['chiefcomplaint_clustered'].nunique()}")

Number of distinct chief complaint categories in 'chiefcomplaint' before clustering: 27687
Number of distinct chief complaint categories in 'chiefcomplaint_clustered' after clustering: 310


In [18]:
# Replace empty values in 'chiefcomplaint_clustered' with pd.NA
df.loc[df['chiefcomplaint_clustered'] == '', 'chiefcomplaint_clustered'] = pd.NA

# Final Dataframe Arrangement

In [19]:
# Drop unnecessary columns
df = df.drop(columns=["race", "chiefcomplaint", "icd_version", "diagnosis_text", "subject_id", "ed_stay_id", "ed_intime", "hadm_id", 
                      "hadm_time", "arrival_hour", "arrival_dayofweek", "icd_code", "icd_desc", "icd10"])

# Rename columns to match the desired output
df = df.rename(columns={"race_grouped": "ethnicity", "chiefcomplaint_clustered": "chief_complaint", "sbp": "systolic_bp", "dbp": "diastolic_bp",
                        "acvpu": "consciousness_level", "icu_within_24h": "icu_admission_24h", "o2sat": "oxygen_saturation", "resprate": "respiratory_rate",
                        "heartrate": "heart_rate"})

# Reorder columns to match the desired output
df = df[['icu_admission_24h', 'age', 'gender', 'ethnicity', 'consciousness_level', 'temperature', 'heart_rate', 'respiratory_rate',
         'oxygen_saturation', 'systolic_bp', 'diastolic_bp', 'news_score', 'night_arrival', 'weekend_arrival', 'chief_complaint',
         'icd_block']]

## Analysis of NA Values

In [20]:
df_missing = df[df.isna().any(axis=1)].reset_index(drop=True)
# df = df.dropna(how='any').reset_index(drop=True)

print(f"Number of rows with at least one NA value: {df_missing.shape[0]}")

# Number of missing values per row
na_counts_per_row = df_missing.isna().sum(axis=1)

# Summary: how many rows have exactly 1, 2, 3, ... NAs?
na_summary_rows = na_counts_per_row.value_counts().sort_index()

print("Rows by number of missing values:")
for count, num_rows in na_summary_rows.items():
    print(f"{num_rows} rows have exactly {count} missing value(s)")

# NA counts per column (only columns with at least one NA)
print("\nSummary of NA values per column:")
na_summary_cols = df_missing.isna().sum()
na_summary_cols = na_summary_cols[na_summary_cols > 0].sort_values(ascending=False)

for col, count in na_summary_cols.items():
    print(f"{count} NA values in column '{col}'")

Number of rows with at least one NA value: 1120
Rows by number of missing values:
923 rows have exactly 1 missing value(s)
158 rows have exactly 2 missing value(s)
38 rows have exactly 3 missing value(s)
1 rows have exactly 4 missing value(s)

Summary of NA values per column:
646 NA values in column 'chief_complaint'
315 NA values in column 'diastolic_bp'
196 NA values in column 'news_score'
63 NA values in column 'systolic_bp'
58 NA values in column 'temperature'
51 NA values in column 'oxygen_saturation'
15 NA values in column 'respiratory_rate'
12 NA values in column 'heart_rate'
1 NA values in column 'icd_block'


## Export of Final Dataset

In [21]:
# Get the current timestamp in YYYYMMDD_HHMMSS format
current_timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

# Define custom text
custom_text = "_final"

file_name = f"{current_timestamp}{custom_text}.csv"
file_path = p.with_name(p.stem + "_" + file_name)

# Export DataFrame to CSV
df.to_csv(file_path, index=False)

print(f"CSV file has been saved at: {file_path}")
print(df.head())

CSV file has been saved at: data\20250301_data_20250510_122405_final.csv
   icu_admission_24h  age gender ethnicity consciousness_level  temperature  \
0              False   57      M     White                   A         37.4   
1              False   62      M     Asian                   A         36.6   
2              False   87      M     Other                   C         36.7   
3              False   69      M     White                   A         36.1   
4              False   69      M     White                   A         36.4   

   heart_rate  respiratory_rate  oxygen_saturation  systolic_bp  diastolic_bp  \
0          93                18                100          136            69   
1         109                15                100          127            75   
2          59                16                 96           95            56   
3          74                16                 97          131            77   
4          96                16                
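
The export name is built from the input file's stem, a timestamp, and a suffix. The sketch below isolates that naming step so it can be checked with a fixed timestamp; `build_export_path` is a hypothetical helper and not part of the notebook.

```python
from datetime import datetime
from pathlib import Path


def build_export_path(source: str, now: datetime, suffix: str = "_final") -> Path:
    """Hypothetical helper: <stem>_<YYYYMMDD_HHMMSS><suffix>.csv next to the source."""
    src = Path(source)
    stamp = now.strftime("%Y%m%d_%H%M%S")
    # with_name keeps the parent directory and replaces only the file name.
    return src.with_name(f"{src.stem}_{stamp}{suffix}.csv")


out = build_export_path("data/20250301_data.csv", datetime(2025, 5, 10, 12, 24, 5))
print(out.name)
```

Passing the timestamp in as an argument, rather than calling `datetime.now()` inside, keeps the naming step reproducible.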

In [22]:
df.describe(include='all')

| | icu_admission_24h | age | gender | ethnicity | consciousness_level | temperature | heart_rate | respiratory_rate | oxygen_saturation | systolic_bp | diastolic_bp | news_score | night_arrival | weekend_arrival | chief_complaint | icd_block |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 141461 | 141461.0 | 141461 | 141461 | 141461 | 141403.0 | 141449.0 | 141446.0 | 141410.0 | 141398.0 | 141146.0 | 141265.0 | 141461 | 141461 | 140815 | 141460 |
| unique | 2 | | 2 | 5 | 5 | | | | | | | 14.0 | 2 | 2 | 309 | 225 |
| top | False | | F | White | A | | | | | | | 0.0 | False | False | Other | I30-I5A |
| freq | 123931 | | 72843 | 96364 | 130030 | | | | | | | 51109.0 | 114824 | 100957 | 38708 | 6756 |
| mean | | 59.627622 | | | | 36.778854 | 86.594186 | 17.97404 | 97.82881 | 134.311857 | 75.110857 | | | | | |
| std | | 18.381901 | | | | 0.614528 | 18.999235 | 2.800053 | 2.416766 | 24.449336 | 15.701946 | | | | | |
| min | | 18.0 | | | | 30.0 | 20.0 | 1.0 | 42.0 | 50.0 | 25.0 | | | | | |
| 25% | | 47.0 | | | | 36.4 | 73.0 | 16.0 | 97.0 | 117.0 | 64.0 | | | | | |
| 50% | | 61.0 | | | | 36.7 | 85.0 | 18.0 | 98.0 | 132.0 | 75.0 | | | | | |
| 75% | | 74.0 | | | | 37.0 | 99.0 | 18.0 | 100.0 | 149.0 | 85.0 | | | | | |
