# Project: Diagnostic reasoning assistant

**Author:** Julia Parnis   
**Date:** February 14, 2026  

## Project Overview

**Goal:** Build an AI-powered diagnostic assistant that provides ranked differential diagnoses through iterative questioning, with enhanced detection of rare and overlooked conditions.

**Key Innovation:** Two-tier ML architecture combined with RAG (Retrieval-Augmented Generation) for transparent clinical reasoning and literature-backed rare disease identification.

**Dataset:** DDXPlus - 1.3M synthetic patient cases, 49 pathologies  
**Source:** [Hugging Face](https://huggingface.co/datasets/aai530-group6/ddxplus)
**Citation:** Fansi Tchango et al. (2022)

**Note:** This is a synthetic dataset (computer-generated from medical knowledge bases) designed for research and education. It provides a robust, privacy-compliant foundation for developing diagnostic AI systems.

# Notebook 2: Multi-Class Diagnosis Prediction - Baseline Models & Optimization 

**Objective:** Develop and evaluate machine learning models to predict patient pathology (49-class classification) using filtered evidence features, with emphasis on Top-k accuracy metrics and class imbalance handling.

**Success Criteria:**
- Establish strong internal baselines (frequency baseline → dummy → logistic regression)
- Beat simple baselines on Top‑k metrics (Top‑1/Top‑3/Top‑5)
- Demonstrate a leakage-safe preprocessing pipeline based on EDA findings
- Show meaningful improvement on macro metrics (macro F1 / per-class recall), especially for rare diseases
- Optional: compare to published benchmarks only when metrics + setup are directly comparable
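
To make the Top-k and macro criteria concrete, here is a minimal sketch on made-up scores. The labels and probabilities are hypothetical toy values, not DDXPlus results. It only illustrates how `top_k_accuracy_score` and a macro-averaged `recall_score` from scikit-learn react to the same predictions.

```python
import numpy as np
from sklearn.metrics import recall_score, top_k_accuracy_score

# Toy problem: 4 samples, 3 classes (labels 0, 1, 2).
y_true = np.array([0, 1, 2, 2])

# Hypothetical predicted probabilities, one row per sample.
y_score = np.array([
    [0.6, 0.3, 0.1],   # true class 0 is ranked first
    [0.5, 0.4, 0.1],   # true class 1 is ranked second
    [0.2, 0.5, 0.3],   # true class 2 is ranked second
    [0.1, 0.2, 0.7],   # true class 2 is ranked first
])

labels = [0, 1, 2]
top1 = top_k_accuracy_score(y_true, y_score, k=1, labels=labels)
top2 = top_k_accuracy_score(y_true, y_score, k=2, labels=labels)

# Macro recall averages per-class recall, so every class counts equally
# regardless of how many samples it has.
y_pred = y_score.argmax(axis=1)
macro_recall = recall_score(y_true, y_pred, average="macro", zero_division=0)

print(f"Top-1: {top1:.2f}  Top-2: {top2:.2f}  Macro recall: {macro_recall:.2f}")
```

In this toy case the true label is always within the top two ranks, yet class 1 is never predicted first. That is the kind of gap between Top-k and macro metrics the criteria above are meant to expose.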

## **Tasks**  

### **Phase 0: Setup and Reproducibility**
Project setup  
- Fix random seeds (NumPy / sklearn)
- Define a small configuration block (paths, flags: use_unique_cases, use_value_tokens, use_sample_weights)
- Create output folders (outputs/, models/, figures/)
- Log key dataset stats (rows, unique cases, split sizes) for traceability 
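
One possible way to log dataset stats for traceability is a small JSON file written next to the experiment outputs. This is only a sketch: the helper name `log_run_stats`, the file name `run_stats.json`, and the split sizes are illustrative assumptions, not part of the notebook.

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path


def log_run_stats(out_dir: Path, split_sizes: dict, seed: int) -> Path:
    """Write a small JSON record of the run (hypothetical helper)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    record = {
        "created_at": datetime.now().isoformat(timespec="seconds"),
        "random_seed": seed,
        "split_sizes": split_sizes,
        "total_rows": sum(split_sizes.values()),
    }
    path = out_dir / "run_stats.json"
    path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return path


# Usage with placeholder numbers (not the real DDXPlus split sizes).
with tempfile.TemporaryDirectory() as tmp:
    stats_path = log_run_stats(
        Path(tmp) / "outputs", {"train": 8, "val": 1, "test": 1}, seed=42
    )
    loaded = json.loads(stats_path.read_text(encoding="utf-8"))
    print(loaded["total_rows"], loaded["random_seed"])
```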

### **Phase 1: Foundation (Based on EDA Findings)**
Preprocessing pipeline implementation
- Parse list columns (EVIDENCES, DIFFERENTIAL_DIAGNOSIS) into Python lists
- Default-value filtering (correct rule):
    - Filter value-coded items only when value == default_value from the evidence mapping
    - Do not hard-code *_@_0 (defaults are evidence-specific and not always V_0) 
- Validation target:
    - mean effective base evidences ≈ 13.56 when weighted to match original distribution 
    - mean effective base evidences ≈ 13.65 on unique patterns (EDA focus) 
- Duplicate pattern handling with frequency preservation
    - Create case_hash (signature) using only model-available fields
    - Deduplicate to unique patterns but keep frequency as metadata (optional training weight)
- Leakage-safe train/val/test
    - Ensure no case_hash overlaps across splits (every pairwise intersection, train∩val, train∩test, and val∩test, must be empty)
    - Prefer “keep test fixed, remove overlaps from train/val” to preserve the official test set
- Validate preprocessing against EDA
    - counts: total unique cases ≈ 1,278,666 
    - effective evidence stats: min/max should stay within [1,36] 
    - multi-choice behavior: extra_multi_values mean ≈ 4.39
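
The two validation targets differ because one averages over unique patterns and the other re-weights each pattern by how often it occurred. A minimal sketch with hypothetical numbers (not the EDA values) shows the relationship:

```python
import numpy as np

# Hypothetical unique patterns: effective evidence count per pattern, and the
# number of original rows each pattern represented before deduplication.
num_bases_effective = np.array([10, 14, 20])
frequency = np.array([3, 1, 1])

# Mean over unique patterns: every pattern counts once.
mean_unique = num_bases_effective.mean()

# Frequency-weighted mean: reproduces the mean over the original rows.
mean_weighted = np.average(num_bases_effective, weights=frequency)

# Cross-check by expanding the patterns back into the original rows.
expanded = np.repeat(num_bases_effective, frequency)

print(f"unique: {mean_unique:.2f}  weighted: {mean_weighted:.2f}  expanded: {expanded.mean():.2f}")
```

If the weighted mean after deduplication does not match the mean computed on the original rows, the frequency column was probably lost or miscounted somewhere in the pipeline.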

### **Phase 2: Feature Engineering**
**Feature engineering and ML-ready matrices**
- Baseline encoding (recommended first): base-level effective evidences
    - Multi-hot encode the effective base codes (223 features) 
- Demographics
    - Encode SEX (one-hot)
    - Use AGE numeric + optionally an AGE_GROUP categorical feature (binning is useful for interpretation)
- Simple derived numeric features
    - num_evidences_effective (base concepts)
    - num_items_effective (value items)
    - extra_multi_values
- Document feature meaning
    - Add one small table describing each feature group and why it exists.

**Optional (Phase 2b, after baseline works):**
- Add value-level tokens (e.g., E_54=V_161) to preserve signal from categorical/multi-choice items (still filtering defaults). This expands feature count but often improves performance.

Why: base-level features are strong, interpretable, and lightweight. Value tokens are a second step when you want extra signal.
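
As a rough illustration of the base-level multi-hot encoding, the sketch below uses scikit-learn's `MultiLabelBinarizer` on a few toy cases. The four-code vocabulary is a stand-in for the 223 base codes. Fixing `classes` up front is an assumption about how one might keep columns identical across splits, not something this notebook prescribes.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Toy per-case lists of effective base codes (already default-filtered).
cases = [
    ["E_48", "E_66"],
    ["E_66"],
    ["E_48", "E_54", "E_66"],
]

# Fixed vocabulary so train/val/test would share the same column order.
vocab = ["E_48", "E_54", "E_66", "E_91"]

mlb = MultiLabelBinarizer(classes=vocab, sparse_output=True)
X = mlb.fit_transform(cases)   # SciPy sparse matrix, one column per code
dense = X.toarray()

print(X.shape)
print(dense)
```

A sparse matrix keeps memory use modest when most of the columns are zero for any given case, which matters more once value-level tokens expand the feature count.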

## Phase 0: Setup and Reproducibility

**Purpose:** Establish reproducible environment and configuration for all experiments

In [1]:
# ----------------------------------------------------------------------------
# 0.1 Import Core Libraries 
# ----------------------------------------------------------------------------

import os
import random
import json
import pickle  # fine for simple objects; joblib is often a better fit for sklearn models (optional)
from pathlib import Path
from datetime import datetime
import warnings

import numpy as np
import pandas as pd
from typing import Dict, List, Tuple, Optional
from datasets import load_dataset
import ast

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
from cycler import cycler

# Scikit-learn: preprocessing + modeling
from sklearn.preprocessing import LabelEncoder, StandardScaler, OneHotEncoder, MultiLabelBinarizer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Scikit-learn: model selection
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Scikit-learn: metrics
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    classification_report,
    confusion_matrix,
    top_k_accuracy_score
)

# Models
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import BernoulliNB  # better for binary multi-hot features

# Silence noisy warnings (optional)
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning)

# Optional: Gradient Boosting libraries
try:
    import lightgbm as lgb
    LIGHTGBM_AVAILABLE = True
except ImportError:
    LIGHTGBM_AVAILABLE = False
    print("⚠️ LightGBM not installed. (Optional) Install with: pip install lightgbm")

try:
    import xgboost as xgb
    XGBOOST_AVAILABLE = True
except ImportError:
    XGBOOST_AVAILABLE = False
    print("⚠️ XGBoost not installed. (Optional) Install with: pip install xgboost")

print("✓ Libraries imported successfully")


✓ Libraries imported successfully


In [2]:
"""
0.2: Setup - Seeds, Config, and Directories
"""

# ===== RANDOM SEEDS (for reproducibility) =====
random_seed = 42

os.environ["PYTHONHASHSEED"] = str(random_seed)  # Affects child processes only; set it before launching Python to fix hashing in this kernel
np.random.seed(random_seed)
random.seed(random_seed)

print(f"✓ Random seed set to: {random_seed}")


# ===== CONFIGURATION =====
class Config:
    """
    Central configuration for this experiment.
    Change settings here to control preprocessing and modeling.
    """

    # --- Data paths ---
    data_dir = Path("data")
    evidence_map_path = data_dir / "release_evidences.json"
    conditions_map_path = data_dir / "release_conditions.json"

    # --- Output folders ---
    output_root = Path("outputs")
    models_root = Path("models")
    figures_root = Path("figures")
    experiment_name = f"exp_{datetime.now().strftime('%Y%m%d_%H%M')}"

    # Will be set after folder creation:
    output_dir = None
    models_dir = None
    figures_dir = None

    # --- Preprocessing settings ---
    use_unique_cases = True       # deduplicate by case signature (unique patterns)
    use_sample_weights = True     # use frequency as training weight (NOT a feature)

    # IMPORTANT: default filtering rule (from EDA)
    # Do NOT hard-code '*_@_0'. Defaults are evidence-specific.
    # Correct rule: remove value-coded items when value == evidence_map[base]["default_value"].

    filter_defaults = True        # Remove default-valued evidences 
    feature_encoding = "base"     # "base" (E_66) now, or "token" (E_54=V_161) later

    # --- Modeling settings ---
    class_weight = "balanced"     # Handle class imbalance (254:1 ratio), works for LogisticRegression / RandomForest; NB ignores it
    top_k = [1, 3, 5]            # Evaluation metrics
    n_jobs = -1                   # Use all CPU cores
    random_seed = random_seed

    # --- Models to train ---
    models_to_train = [
        "dummy_most_frequent",    # Baseline: always predict most common class
        "dummy_stratified",       # Baseline: random predictions weighted by class frequency
        "logistic_regression",    # Simple, interpretable (good starting point)
        "bernoulli_nb",          # Fast, works well with binary sparse features
        "random_forest",         # Captures non-linear patterns
        "lightgbm",             # Expected best performance for tabular data
        # "xgboost",            # Uncomment to compare with LightGBM
    ]


config = Config()

# ===== CREATE EXPERIMENT DIRECTORIES =====
config.output_root.mkdir(exist_ok=True)
config.models_root.mkdir(exist_ok=True)
config.figures_root.mkdir(exist_ok=True)

config.output_dir = config.output_root / config.experiment_name
config.models_dir = config.models_root / config.experiment_name
config.figures_dir = config.figures_root / config.experiment_name

config.output_dir.mkdir(exist_ok=True)
config.models_dir.mkdir(exist_ok=True)
config.figures_dir.mkdir(exist_ok=True)

# ===== AUTO-REMOVE UNAVAILABLE LIBRARIES =====
if "lightgbm" in config.models_to_train and not LIGHTGBM_AVAILABLE:
    config.models_to_train.remove("lightgbm")
    print("⚠️  Removed 'lightgbm' from training (not installed)")

if "xgboost" in config.models_to_train and not XGBOOST_AVAILABLE:
    config.models_to_train.remove("xgboost")
    print("⚠️  Removed 'xgboost' from training (not installed)")

# ===== SUMMARY =====
print(f"✓ Experiment: {config.experiment_name}")
print(f"✓ Models to train ({len(config.models_to_train)}): {', '.join(config.models_to_train)}")
print(f"✓ Preprocessing: unique_cases={config.use_unique_cases}, encoding={config.feature_encoding}")
print(f"✓ Output directories created:")
print(f"    {config.output_dir}")
print(f"    {config.models_dir}")
print(f"    {config.figures_dir}")


✓ Random seed set to: 42
✓ Experiment: exp_20260214_2013
✓ Models to train (6): dummy_most_frequent, dummy_stratified, logistic_regression, bernoulli_nb, random_forest, lightgbm
✓ Preprocessing: unique_cases=True, encoding=base
✓ Output directories created:
    outputs\exp_20260214_2013
    models\exp_20260214_2013
    figures\exp_20260214_2013


In [3]:
"""
0.3 Display Settings
"""

# Pandas display - useful for inspecting dataframes
pd.set_option("display.max_columns", 20)
pd.set_option("display.max_rows", 60)
pd.set_option("display.max_colwidth", 120)
pd.set_option("display.precision", 4)

print("✓ Display settings configured")


✓ Display settings configured


In [4]:
# 0.4 - Visualization Settings

# Apply clean seaborn style
plt.style.use('seaborn-v0_8-white')

# IBM Design colorblind-safe palette
IBM_COLORS = {
    'blue': '#648FFF',
    'purple': '#785EF0', 
    'magenta': '#DC267F',
    'orange': '#FE6100',
    'yellow': '#FFB000',
    'teal': '#06A39B',
    'gray': '#5F6368'
}

# Figure defaults (presentation-optimized)
plt.rcParams.update({
    # Figure size and resolution
    "figure.figsize": (6, 4),          # Good for 2+ figs per slide
    "figure.dpi": 120,                 # Screen display
    "savefig.dpi": 300,                # High-quality save

    # Font sizes (readable on Zoom)
    "axes.titlesize": 16,
    "axes.labelsize": 14,
    "xtick.labelsize": 12,
    "ytick.labelsize": 12,
    "legend.fontsize": 12,

    # Appearance (subtle, professional)
    "axes.edgecolor": IBM_COLORS["gray"],
    "axes.linewidth": 1.2,
    "grid.color": "#D9D9D9",
    "grid.linewidth": 0.8,
    "grid.alpha": 0.6,
})

# Set IBM color cycle (for multi-line plots)
plt.rcParams["axes.prop_cycle"] = cycler(color=[
    IBM_COLORS["blue"],
    IBM_COLORS["teal"],
    IBM_COLORS["purple"],
    IBM_COLORS["magenta"],
    IBM_COLORS["orange"],
])

def save_fig(fig, filename: str):
    """Save a figure into this experiment's figures folder."""
    # Use the experiment-specific folder from config; fall back if it is missing or still None
    out_dir = getattr(config, "figures_dir", None) or Path("figures")
    out_dir.mkdir(parents=True, exist_ok=True)

    path = out_dir / filename
    fig.savefig(path, bbox_inches="tight", facecolor="white")
    print(f"✓ Saved: {path}")


## Phase 1: Foundation / Preprocessing

### 1.1 Pre-processing helper functions

In [23]:
"""
===============================================================================
1.1a: Evidence Parsing & Default-Filtering Helpers
===============================================================================
DDXPlus encodes evidences in two main formats:

1) Binary evidence (no value):
   - Example: "E_66"  -> means the evidence is present (e.g., "shortness of breath")

2) Value-coded evidence (categorical / multi-choice):
   - Example: "E_54_@_V_161"  -> base code is "E_54", value code is "V_161"
   - These evidences have a "default_value" in the official mapping file that represents
     "not present / not applicable / negative". Those default-valued entries must be filtered
     out to avoid treating negatives as positives.

Important:
- Filtering defaults reduces counts somewhat.
- Separately, collapsing multi-choice values to unique base codes reduces counts further.
  (So: raw item count ≠ effective item count ≠ effective base-code count.)
===============================================================================
"""

def split_ev(item: str) -> Tuple[str, Optional[str]]:
    """
    Split an evidence string into (base_code, value_code_or_None).

    Examples:
        >>> split_ev("E_54_@_V_161")
        ("E_54", "V_161")
        >>> split_ev("E_66")
        ("E_66", None)
    """
    if "_@_" in item:
        base, value = item.split("_@_", 1)
        return base, value
    return item, None


def is_default_value(base: str, value: str, evidences_map: Dict) -> bool:
    """
    Return True if (base, value) equals the evidence's default value.

    Why we need this:
    - For value-coded evidences, the dataset may store the default value explicitly.
      Example: travel history might be stored as "E_204_@_V_0" for "no travel".
      If we keep that, it looks like *everyone* has travel history recorded,
      which is misleading for ML features.

    Robustness:
    - In some datasets/mappings, the default might be stored as "0" while the data stores "V_0",
      or vice versa. This function treats those as equivalent.

    Args:
        base: evidence base code (e.g., "E_204")
        value: value code extracted from the item (e.g., "V_0" or "0")
        evidences_map: mapping loaded from release_evidences.json

    Returns:
        True if the value is the default and should be filtered out.
    """
    default = evidences_map.get(base, {}).get("default_value", None)
    if default is None:
        return False

    default_str = str(default)
    value_str = str(value)

    # Consider equivalent representations: "0" <-> "V_0"
    candidates = {default_str}
    if default_str.startswith("V_") and default_str[2:].isdigit():
        candidates.add(default_str[2:])          # "V_0" -> "0"
    if (not default_str.startswith("V_")) and default_str.isdigit():
        candidates.add(f"V_{default_str}")       # "0" -> "V_0"

    return value_str in candidates


def is_effective_item(item: str, evidences_map: Dict) -> bool:
    """
    Decide whether an evidence item should be kept ("effective") for ML.

    Rules:
    - Binary evidence (no value): keep if present in the list
    - Value-coded evidence: keep only if value != default_value
    """
    base, value = split_ev(item)
    if value is None:
        return True
    return not is_default_value(base, value, evidences_map)


def to_item_token(item: str) -> str:
    """
    Convert evidence entry into a token suitable for ML feature encoding.

    - Binary:        "E_66"         -> "E_66"
    - Value-coded:   "E_54_@_V_161" -> "E_54=V_161"
    """
    base, value = split_ev(item)
    return base if value is None else f"{base}={value}"


print("✓ Evidence parsing helpers loaded")
print("  Functions: split_ev, is_effective_item, to_item_token")


✓ Evidence parsing helpers loaded
  Functions: split_ev, is_effective_item, to_item_token
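
The filtering rule can be checked by hand on a tiny example. The sketch below is a condensed restatement of the helpers, not a replacement for them. The two-entry mapping and the helper names `strip_v_prefix` and `keep_item` are hypothetical, and the prefix handling is simplified to "ignore a leading `V_`".

```python
from typing import Dict

# Hypothetical two-entry mapping; the real one comes from release_evidences.json.
toy_map: Dict[str, dict] = {
    "E_204": {"default_value": "V_0"},   # default written with the "V_" prefix
    "E_56": {"default_value": 0},        # default written as a bare number
}


def strip_v_prefix(code: str) -> str:
    """Drop a leading 'V_' so that '0' and 'V_0' compare equal."""
    return code[2:] if code.startswith("V_") else code


def keep_item(item: str, evidences_map: Dict[str, dict]) -> bool:
    """Condensed version of is_effective_item, for illustration only."""
    if "_@_" not in item:
        return True                      # binary evidence: presence is the signal
    base, value = item.split("_@_", 1)
    default = evidences_map.get(base, {}).get("default_value")
    if default is None:
        return True
    return strip_v_prefix(str(value)) != strip_v_prefix(str(default))


raw = ["E_66", "E_204_@_V_0", "E_204_@_V_3", "E_56_@_0", "E_56_@_4"]
kept = [x for x in raw if keep_item(x, toy_map)]

# "base" encoding collapses to unique base codes; "token" keeps the value.
bases = sorted({x.split("_@_", 1)[0] for x in kept})
tokens = [x.replace("_@_", "=") for x in kept]

print(kept)
print(bases)
print(tokens)
```

Both default entries (`E_204_@_V_0` and `E_56_@_0`) are dropped even though their defaults are written differently in the mapping, which is the case the robustness note in `is_default_value` is guarding against.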


In [24]:
"""
===============================================================================
1.1b: Deduplication & Leakage Detection Helpers
===============================================================================
Functions for case hashing, duplicate removal, and preventing data leakage.
Based on EDA findings: ~2% duplicates, ~1,965 train/test overlaps.
===============================================================================
"""


def add_case_hash(df: pd.DataFrame) -> pd.DataFrame:
    """
    Create stable fingerprint for each case (for deduplication).
    
    Hash computed from: AGE, SEX, INITIAL_EVIDENCE, EVIDENCES (sorted)
    Note: PATHOLOGY excluded - same symptoms can have different diagnoses.
    
    Returns:
        DataFrame with 'case_hash' column added
    """
    df = df.copy()
    
    # Sort evidences for order-independence
    df["EVIDENCES_sorted"] = df["EVIDENCES_list"].apply(lambda lst: tuple(sorted(lst)))
    
    # Hash based on input features only
    key_cols = ["AGE", "SEX", "INITIAL_EVIDENCE", "EVIDENCES_sorted"]
    df["case_hash"] = pd.util.hash_pandas_object(df[key_cols], index=False).astype("uint64")
    
    return df


def remove_leakage_keep_test_then_val(
    train_df: pd.DataFrame,
    val_df: pd.DataFrame,
    test_df: pd.DataFrame,
    key: str = "case_hash_inputs",
    verbose: bool = True
) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """
    Remove cross-split duplicates (priority: test > val > train).
    
    Returns:
        Cleaned (train, val, test) with no overlapping cases
    """
    orig_train, orig_val = len(train_df), len(val_df)
    
    # Remove test cases from train and val
    test_keys = set(test_df[key].unique())
    train_df = train_df[~train_df[key].isin(test_keys)].copy()
    val_df = val_df[~val_df[key].isin(test_keys)].copy()
    
    # Remove val cases from train
    val_keys = set(val_df[key].unique())
    train_df = train_df[~train_df[key].isin(val_keys)].copy()
    
    if verbose:
        print(f"  Leakage removal:")
        print(f"    Train: {orig_train:,} → {len(train_df):,} (-{orig_train - len(train_df):,})")
        print(f"    Val:   {orig_val:,} → {len(val_df):,} (-{orig_val - len(val_df):,})")
        print(f"    Test:  {len(test_df):,} (unchanged)")
    
    return train_df, val_df, test_df


def dedup_with_frequency(df: pd.DataFrame, key: str = "case_hash") -> pd.DataFrame:
    """
    Remove within-split duplicates, preserve frequency for sample weighting.
    
    Example:
        Before: [Case A, Case A, Case B] → 3 rows
        After:  [Case A (freq=2), Case B (freq=1)] → 2 rows
    """
    df = df.copy()
    df["frequency"] = df.groupby(key)[key].transform("size")
    df = df.drop_duplicates(subset=[key]).reset_index(drop=True)
    return df


print("✓ Deduplication helpers loaded")
print("  Functions: add_case_hash, remove_leakage_keep_test_then_val, dedup_with_frequency")


✓ Deduplication helpers loaded
  Functions: add_case_hash, remove_leakage_keep_test_then_val, dedup_with_frequency
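
To show how the preserved frequency might be used downstream, here is a small sketch that deduplicates a toy frame and passes `frequency` as `sample_weight` to a scikit-learn model. The feature columns `has_E_66` and `has_E_48` and all values are made up for illustration. The point is only that frequency enters as a weight, never as a feature.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy rows: the first three are exact repeats of one pattern.
toy = pd.DataFrame({
    "case_hash": [11, 11, 11, 22, 33],
    "has_E_66":  [1, 1, 1, 0, 1],
    "has_E_48":  [0, 0, 0, 1, 1],
    "PATHOLOGY": ["URTI", "URTI", "URTI", "GERD", "GERD"],
})

# Same logic as dedup_with_frequency: count rows per key, keep one row each.
toy["frequency"] = toy.groupby("case_hash")["case_hash"].transform("size")
unique = toy.drop_duplicates(subset=["case_hash"]).reset_index(drop=True)

# frequency is used as a weight only; it is deliberately not a feature column.
X = unique[["has_E_66", "has_E_48"]]
y = unique["PATHOLOGY"]
model = LogisticRegression().fit(X, y, sample_weight=unique["frequency"])

print(unique[["case_hash", "frequency"]])
```

Because the frequencies sum back to the original row count, weighting by them approximates training on the full, non-deduplicated data at a fraction of the size.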


In [25]:
# Additional hashing and deduplication helpers.
# Goal: use separate hashes for (a) leakage control across splits and (b) deduplication within each split.
# - Leakage control must not include the target column ('PATHOLOGY'), as in add_case_hash above.
# - Within-split deduplication also hashes 'PATHOLOGY', so only exact record duplicates are merged.

def add_hashes(df: pd.DataFrame) -> pd.DataFrame:
    """
    Add two hashes:
      - case_hash_inputs: used for leakage control (grouping by inputs only)
      - case_hash_record: used for exact deduplication + frequency (inputs + PATHOLOGY)
    """
    df = df.copy()

    # order-independent evidence representation
    df["EVIDENCES_sorted"] = df["EVIDENCES_list"].apply(lambda lst: tuple(sorted(lst)))

    input_cols = ["AGE", "SEX", "INITIAL_EVIDENCE", "EVIDENCES_sorted"]
    df["case_hash_inputs"] = pd.util.hash_pandas_object(df[input_cols], index=False).astype("uint64")

    record_cols = input_cols + ["PATHOLOGY"]
    df["case_hash_record"] = pd.util.hash_pandas_object(df[record_cols], index=False).astype("uint64")

    return df



def dedup_with_frequency_on_record(df: pd.DataFrame) -> pd.DataFrame:
    """
    Deduplicate exact duplicates (same inputs + same pathology),
    and keep frequency as a training weight.
    """
    df = df.copy()
    df["frequency"] = df.groupby("case_hash_record")["case_hash_record"].transform("size")
    df = df.drop_duplicates("case_hash_record", keep="first").reset_index(drop=True)
    return df


In [26]:
"""
===============================================================================
1.1c: Feature Engineering Helper
===============================================================================
Batch processing: filter default-valued items + create ML tokens.

Key outputs:
- EVIDENCES_effective_items: the filtered evidence entries (keeps value codes)
- EVIDENCES_effective_tokens:
    - "base"  -> unique base codes per case (e.g., ["E_66","E_204","E_54"])
    - "token" -> value-level tokens (e.g., ["E_66","E_204=V_3","E_54=V_161"])
- num_items_effective: number of effective ITEMS (keeps multi-choice values)
- num_bases_effective: number of effective BASE CODES (collapses multi-choice)
===============================================================================
"""

def build_effective_evidences(
    df: pd.DataFrame,
    evidences_map: Dict,
    encoding: str = "base"
) -> pd.DataFrame:
    """
    Filter defaults and create ML-ready tokens.

    Args:
        df: DataFrame with column 'EVIDENCES_list'
        evidences_map: mapping loaded from release_evidences.json
        encoding:
            - "base":  unique base codes per patient (recommended baseline)
            - "token": value-level tokens (richer features, more columns)

    Returns:
        df copy with new columns:
        - EVIDENCES_effective_items
        - EVIDENCES_effective_tokens
        - num_items_effective
        - num_bases_effective
    """
    df = df.copy()

    # 1) Filter default-valued evidences (keep only informative items)
    def filter_defaults(lst: List[str]) -> List[str]:
        return [x for x in lst if is_effective_item(x, evidences_map)]

    df["EVIDENCES_effective_items"] = df["EVIDENCES_list"].apply(filter_defaults)

    # 2) Build tokens for ML
    if encoding == "base":
        # IMPORTANT: deduplicate base codes per patient
        # (multi-choice can include the same base multiple times with different values)
        df["EVIDENCES_effective_tokens"] = df["EVIDENCES_effective_items"].apply(
            lambda lst: sorted({split_ev(x)[0] for x in lst})
        )
    elif encoding == "token":
        # Keep value-level detail (still filtered for defaults)
        # Deduplicate exact tokens just in case the raw list contains repeats
        df["EVIDENCES_effective_tokens"] = df["EVIDENCES_effective_items"].apply(
            lambda lst: list(dict.fromkeys(to_item_token(x) for x in lst))
        )
    else:
        raise ValueError(f"Unknown encoding: {encoding}. Use 'base' or 'token'.")

    # 3) Counts
    df["num_items_effective"] = df["EVIDENCES_effective_items"].apply(len)
    df["num_bases_effective"] = df["EVIDENCES_effective_items"].apply(
        lambda lst: len({split_ev(x)[0] for x in lst})
    )

    return df


print("✓ Feature engineering helper loaded")
print("  Function: build_effective_evidences")


✓ Feature engineering helper loaded
  Function: build_effective_evidences


In [27]:
"""
1.2: Load Data
Purpose: Load DDXPlus dataset splits and the official mapping files.
"""

print("Loading DDXPlus dataset from Hugging Face...")

# Load dataset (first run can take longer)
dataset = load_dataset("aai530-group6/ddxplus")

# Convert to pandas DataFrames
df_train = dataset["train"].to_pandas()
df_val   = dataset["validate"].to_pandas()
df_test  = dataset["test"].to_pandas()

print(f"✓ Train: {len(df_train):,} rows")
print(f"✓ Val:   {len(df_val):,} rows")
print(f"✓ Test:  {len(df_test):,} rows")

print("\nLoading mapping files...")

# Evidence mapping (symptoms/findings with default values)
with open(config.data_dir / "release_evidences.json", "r", encoding="utf-8") as f:
    evidences_map = json.load(f)

# Condition mapping (pathologies)
with open(config.data_dir / "release_conditions.json", "r", encoding="utf-8") as f:
    conditions_map = json.load(f)

print(f"✓ Evidence definitions:   {len(evidences_map)} (expected ~223)")
print(f"✓ Condition definitions:  {len(conditions_map)} (expected ~49)")

# Quick peek
print("\nDataset columns:")
print(df_train.columns.tolist())

print("\nSample row (selected columns):")
print(df_train.head(1)[["AGE", "SEX", "PATHOLOGY", "INITIAL_EVIDENCE", "EVIDENCES"]])


Loading DDXPlus dataset from Hugging Face...
✓ Train: 1,025,602 rows
✓ Val:   132,448 rows
✓ Test:  134,529 rows

Loading mapping files...
✓ Evidence definitions:   223 (expected ~223)
✓ Condition definitions:  49 (expected ~49)

Dataset columns:
['AGE', 'DIFFERENTIAL_DIAGNOSIS', 'SEX', 'PATHOLOGY', 'EVIDENCES', 'INITIAL_EVIDENCE']

Sample row (selected columns):
   AGE SEX PATHOLOGY INITIAL_EVIDENCE  \
0   18   M      URTI             E_91   

                                                                                                                 EVIDENCES  
0  ['E_48', 'E_50', 'E_53', 'E_54_@_V_161', 'E_54_@_V_183', 'E_55_@_V_89', 'E_55_@_V_108', 'E_55_@_V_167', 'E_56_@_4', ...  


In [28]:
"""
===============================================================================
Section 1.3: Parse String-Encoded Columns
===============================================================================
Convert EVIDENCES and DIFFERENTIAL_DIAGNOSIS from strings to Python lists.

From EDA (Notebook 01, Section 2.2): These columns are stored as strings.
Fix: Use ast.literal_eval() to convert to actual Python lists.
===============================================================================
"""
import ast

print("Parsing string-encoded columns...")

# Parse EVIDENCES
df_train['EVIDENCES_list'] = df_train['EVIDENCES'].apply(ast.literal_eval)
df_val['EVIDENCES_list'] = df_val['EVIDENCES'].apply(ast.literal_eval)
df_test['EVIDENCES_list'] = df_test['EVIDENCES'].apply(ast.literal_eval)

# Parse DIFFERENTIAL_DIAGNOSIS
df_train['DIFFERENTIAL_DIAGNOSIS_list'] = df_train['DIFFERENTIAL_DIAGNOSIS'].apply(ast.literal_eval)
df_val['DIFFERENTIAL_DIAGNOSIS_list'] = df_val['DIFFERENTIAL_DIAGNOSIS'].apply(ast.literal_eval)
df_test['DIFFERENTIAL_DIAGNOSIS_list'] = df_test['DIFFERENTIAL_DIAGNOSIS'].apply(ast.literal_eval)

print("✓ Parsed EVIDENCES → EVIDENCES_list")
print("✓ Parsed DIFFERENTIAL_DIAGNOSIS → DIFFERENTIAL_DIAGNOSIS_list")

# Quick verification
sample = df_train.iloc[0]
print(f"\nVerification:")
print(f"  EVIDENCES_list: {len(sample['EVIDENCES_list'])} items (type: {type(sample['EVIDENCES_list']).__name__})")
print(f"  Sample: {sample['EVIDENCES_list'][:3]}...")


Parsing string-encoded columns...
✓ Parsed EVIDENCES → EVIDENCES_list
✓ Parsed DIFFERENTIAL_DIAGNOSIS → DIFFERENTIAL_DIAGNOSIS_list

Verification:
  EVIDENCES_list: 19 items (type: list)
  Sample: ['E_48', 'E_50', 'E_53']...
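
If the columns might already arrive as lists (for example after reloading a cached frame), a slightly more defensive parser can help. This is a sketch under that assumption. `parse_list_cell` is a hypothetical helper and not part of the notebook.

```python
import ast
from typing import Any, List


def parse_list_cell(cell: Any) -> List[str]:
    """Return a Python list whether the cell is a string literal or already list-like."""
    if isinstance(cell, str):
        parsed = ast.literal_eval(cell)   # safe: evaluates literals only
        if not isinstance(parsed, (list, tuple)):
            raise ValueError(f"Expected a list literal, got {type(parsed).__name__}")
        return list(parsed)
    return list(cell)                     # already a list, tuple, or array


print(parse_list_cell("['E_48', 'E_50', 'E_53']"))
print(parse_list_cell(["E_48", "E_50"]))
```

It could be applied the same way as above, e.g. `df['EVIDENCES'].apply(parse_list_cell)`, and makes the parsing cell safe to re-run.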


### Section 1.4 — Add Case Hashes (Fingerprints for Deduplication & Leakage Control)

Two hashes are created per case to cleanly separate two distinct concerns:

| Hash | Columns | Purpose |
|------|---------|---------|
| `case_hash_inputs` | `AGE, SEX, INITIAL_EVIDENCE, EVIDENCES_sorted` | Leakage control — same symptom presentation must not cross splits |
| `case_hash_record` | `AGE, SEX, INITIAL_EVIDENCE, EVIDENCES_sorted, PATHOLOGY` | True duplicate removal — identical case AND identical diagnosis |

**Why two hashes?**  
Using inputs-only for deduplication would collapse cases where the same symptoms 
lead to different diagnoses — distorting class frequencies and introducing hidden 
label noise. Using the full record for leakage control would allow the same 
*presentation* to appear in both train and test under different labels.

> **Note on DDXPlus:** Same-input/different-label cases likely reflect synthetic 
> generation artifacts rather than true clinical ambiguity — but the two-hash 
> approach is the more rigorous and generalizable design.

From EDA: 2% duplicates within splits, 3,925 cross-split overlaps detected.
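
The difference between the two hashes is easiest to see on a toy frame. The sketch below is an illustration with made-up rows. Unlike `add_hashes`, it joins the sorted evidences into a single string (`EVIDENCES_key`, a hypothetical column) before hashing, purely to keep the example simple.

```python
import pandas as pd

toy = pd.DataFrame({
    "AGE": [30, 30, 30],
    "SEX": ["F", "F", "F"],
    "INITIAL_EVIDENCE": ["E_91", "E_91", "E_91"],
    # Rows 0 and 1 list the same evidences in a different order.
    "EVIDENCES_list": [["E_48", "E_66"], ["E_66", "E_48"], ["E_48", "E_66"]],
    # Row 2 has the same inputs but a different label.
    "PATHOLOGY": ["URTI", "URTI", "Bronchitis"],
})

# Sort, then join, so the hash does not depend on list order.
toy["EVIDENCES_key"] = toy["EVIDENCES_list"].apply(lambda lst: "|".join(sorted(lst)))

input_cols = ["AGE", "SEX", "INITIAL_EVIDENCE", "EVIDENCES_key"]
toy["case_hash_inputs"] = pd.util.hash_pandas_object(toy[input_cols], index=False)
toy["case_hash_record"] = pd.util.hash_pandas_object(
    toy[input_cols + ["PATHOLOGY"]], index=False
)

print("unique input hashes: ", toy["case_hash_inputs"].nunique())
print("unique record hashes:", toy["case_hash_record"].nunique())
```

All three rows share one input hash, so they would have to stay in the same split. Only the first two share a record hash, so only those two would be merged during within-split deduplication.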


In [29]:
"""
===============================================================================
Section 1.4: Add Case Hashes (Fingerprints for Deduplication)
===============================================================================
Create stable fingerprints for each case to enable:
- Duplicate detection within splits
- Data leakage detection across splits

Two hashes are computed, one for leakage control across splits and one for deduplication within splits:
- case_hash_inputs: AGE, SEX, INITIAL_EVIDENCE, EVIDENCES (sorted for order-independence)
  Note: PATHOLOGY is excluded for leakage control, since the same symptoms can have different diagnoses
- case_hash_record: AGE, SEX, INITIAL_EVIDENCE, EVIDENCES (sorted for order-independence), PATHOLOGY

From EDA: Found ~2% duplicates and ~3,925 cross-split overlaps.
===============================================================================
"""

print("Adding case hashes...")

# Add hash to each split
df_train = add_hashes(df_train)
df_val = add_hashes(df_val)
df_test = add_hashes(df_test)

print("✓ Case hashes added to all splits")

# Verification: Check for duplicates within each split: record level
print("\n" + "="*70)
print("DUPLICATE DETECTION (Within Splits)")
print("="*70)

for name, df in [('Train', df_train), ('Val', df_val), ('Test', df_test)]:
    total = len(df)
    unique = df['case_hash_record'].nunique()
    duplicates = total - unique
    dup_pct = (duplicates / total) * 100
    
    print(f"{name:6} - Total: {total:,}  |  Unique: {unique:,}  |  Duplicates: {duplicates:,} ({dup_pct:.1f}%)")

# Check for leakage across splits: input level
print("\n" + "="*70)
print("LEAKAGE DETECTION (Across Splits)")
print("="*70)

train_hash_inputs = set(df_train['case_hash_inputs'].unique())
val_hash_inputs = set(df_val['case_hash_inputs'].unique())
test_hash_inputs = set(df_test['case_hash_inputs'].unique())

train_val_overlap = len(train_hash_inputs & val_hash_inputs)
train_test_overlap = len(train_hash_inputs & test_hash_inputs)
val_test_overlap = len(val_hash_inputs & test_hash_inputs)

print(f"Train ↔ Val overlap:   {train_val_overlap:,} cases")
print(f"Train ↔ Test overlap:  {train_test_overlap:,} cases")
print(f"Val ↔ Test overlap:    {val_test_overlap:,} cases")
print(f"Total leakage:         {train_val_overlap + train_test_overlap + val_test_overlap:,} cases")

if train_val_overlap or train_test_overlap or val_test_overlap:
    print("\n⚠️  Data leakage detected! Will be fixed in Section 1.5")
else:
    print("\n✓ No data leakage detected")

print("="*70)


Adding case hashes...
✓ Case hashes added to all splits

DUPLICATE DETECTION (Within Splits)
Train  - Total: 1,025,602  |  Unique: 1,012,347  |  Duplicates: 13,255 (1.3%)
Val    - Total: 132,448  |  Unique: 132,373  |  Duplicates: 75 (0.1%)
Test   - Total: 134,529  |  Unique: 134,428  |  Duplicates: 101 (0.1%)

LEAKAGE DETECTION (Across Splits)
Train ↔ Val overlap:   1,788 cases
Train ↔ Test overlap:  1,965 cases
Val ↔ Test overlap:    172 cases
Total leakage:         3,925 cases

⚠️  Data leakage detected! Will be fixed in Section 1.5
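
The "keep test, then val, then train" priority for resolving such overlaps can be sketched with plain sets. The keys below are made-up placeholders standing in for `case_hash_inputs` values.

```python
# Placeholder keys; in the notebook these would be case_hash_inputs values.
train = {"a", "b", "c", "d"}
val = {"c", "e", "f"}
test = {"d", "f", "g"}

# 1) Test is never modified.
# 2) Val loses anything that also appears in test.
val_clean = val - test
# 3) Train loses anything that appears in test or in the cleaned val.
train_clean = train - test - val_clean

print(sorted(train_clean), sorted(val_clean), sorted(test))

# All three pairwise intersections must be empty after the cleanup.
assert train_clean.isdisjoint(val_clean)
assert train_clean.isdisjoint(test)
assert val_clean.isdisjoint(test)
```

Checking each pair separately matters: an empty three-way intersection alone would not rule out a key shared by just two of the splits.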


In [30]:
"""
===============================================================================
Section 1.5: Remove Data Leakage (Cross-Split Duplicates)
===============================================================================
Remove cases that appear in multiple splits to prevent data leakage.

Priority order (most critical first):
1. Keep TEST unchanged (most important for unbiased evaluation)
2. Remove TEST cases from TRAIN and VAL
3. Keep VAL (after removing test overlaps)
4. Remove VAL cases from TRAIN

From Section 1.4: Detected 3,925 leaking cases
- Train ↔ Test: 1,965 cases (CRITICAL - corrupts final evaluation)
- Train ↔ Val:  1,788 cases (biases validation)
- Val ↔ Test:   172 cases (minor but should fix)
===============================================================================
"""

print("Removing data leakage across splits...")
print(f"\nBefore leakage removal:")
print(f"  Train: {len(df_train):,} cases")
print(f"  Val:   {len(df_val):,} cases")
print(f"  Test:  {len(df_test):,} cases")

# Remove leakage (uses helper from Cell 1.1b)
df_train, df_val, df_test = remove_leakage_keep_test_then_val(
    df_train, 
    df_val, 
    df_test,
    verbose=True
)

print(f"\nAfter leakage removal:")
print(f"  Train: {len(df_train):,} cases")
print(f"  Val:   {len(df_val):,} cases")
print(f"  Test:  {len(df_test):,} cases")

# Verify: Check that leakage is gone
print("\n" + "="*70)
print("VERIFICATION: Checking for remaining leakage...")
print("="*70)

train_hashes = set(df_train['case_hash_inputs'].unique())
val_hashes = set(df_val['case_hash_inputs'].unique())
test_hashes = set(df_test['case_hash_inputs'].unique())

train_val_overlap = len(train_hashes & val_hashes)
train_test_overlap = len(train_hashes & test_hashes)
val_test_overlap = len(val_hashes & test_hashes)

print(f"Train ↔ Val overlap:   {train_val_overlap:,} cases")
print(f"Train ↔ Test overlap:  {train_test_overlap:,} cases")
print(f"Val ↔ Test overlap:    {val_test_overlap:,} cases")

if train_val_overlap == 0 and train_test_overlap == 0 and val_test_overlap == 0:
    print("\n✅ SUCCESS: All data leakage removed!")
    print("   Splits are now completely disjoint.")
else:
    print("\n⚠️  WARNING: Some leakage remains!")

print("="*70)


Removing data leakage across splits...

Before leakage removal:
  Train: 1,025,602 cases
  Val:   132,448 cases
  Test:  134,529 cases
  Leakage removal:
    Train: 1,025,602 → 1,020,977 (-4,625)
    Val:   132,448 → 132,274 (-174)
    Test:  134,529 (unchanged)

After leakage removal:
  Train: 1,020,977 cases
  Val:   132,274 cases
  Test:  134,529 cases

VERIFICATION: Checking for remaining leakage...
Train ↔ Val overlap:   0 cases
Train ↔ Test overlap:  0 cases
Val ↔ Test overlap:    0 cases

✅ SUCCESS: All data leakage removed!
   Splits are now completely disjoint.
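
The helper `remove_leakage_keep_test_then_val` is defined in an earlier cell and is not shown here. As a minimal sketch of the priority order described in Section 1.5 (keep test, then val, then train), the logic could look like the following. The function name `drop_cross_split_overlap` and the toy data are hypothetical, and plain Python lists and sets are used instead of pandas to keep the example self-contained.

```python
def drop_cross_split_overlap(train, val, test, key="case_hash_inputs"):
    """Hypothetical sketch: keep test intact, then val, then train."""
    test_keys = {row[key] for row in test}
    # Val loses every row whose hash also appears in test.
    val_clean = [row for row in val if row[key] not in test_keys]
    val_keys = {row[key] for row in val_clean}
    # Train loses every row whose hash appears in test or in the cleaned val.
    blocked = test_keys | val_keys
    train_clean = [row for row in train if row[key] not in blocked]
    return train_clean, val_clean, list(test)


# Toy splits: "b" leaks train->test, "c" leaks train->val, "d" leaks val->test.
train = [{"case_hash_inputs": h} for h in ["a", "b", "b", "c", "e"]]
val = [{"case_hash_inputs": h} for h in ["c", "d", "f"]]
test = [{"case_hash_inputs": h} for h in ["b", "d", "g"]]

train_c, val_c, test_c = drop_cross_split_overlap(train, val, test)
print([r["case_hash_inputs"] for r in train_c])  # ['a', 'e']
print([r["case_hash_inputs"] for r in val_c])    # ['c', 'f']
print(len(test_c))                               # 3
```

In this sketch removal is row-level, so both copies of `"b"` are dropped from train. Under that assumption, the number of rows removed can exceed the number of overlapping hashes reported by the detection step.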


In [31]:
"""
===============================================================================
Section 1.6: Deduplicate Within Splits (Preserve Frequency Information)
===============================================================================
Remove duplicate cases within each split while preserving occurrence counts.

From Section 1.4: Found duplicates within splits:
- Train: 13,255 duplicates (1.3%)
- Val:   75 duplicates (0.1%)
- Test:  101 duplicates (0.1%)

Strategy:
1. Count how many times each case pattern appears (frequency)
2. Keep only one copy of each pattern
3. Use frequency as sample_weight during training

This preserves information about pattern importance without actual duplicates.

Controlled by: config.use_unique_cases (this docstring is not an f-string, so the value is printed below instead)
===============================================================================
"""

print("Processing duplicates within splits...")

if config.use_unique_cases:
    print("\n✓ Config: use_unique_cases = True (removing duplicates)\n")
    
    # Store original counts
    orig_train, orig_val, orig_test = len(df_train), len(df_val), len(df_test)
    
    # Deduplicate each split (adds 'frequency' column)
    df_train = dedup_with_frequency_on_record(df_train)
    df_val = dedup_with_frequency_on_record(df_val)
    df_test = dedup_with_frequency_on_record(df_test)
    
    # Calculate removed counts
    removed_train = orig_train - len(df_train)
    removed_val = orig_val - len(df_val)
    removed_test = orig_test - len(df_test)
    
    print(f"Deduplication results:")
    print(f"  Train: {orig_train:,} → {len(df_train):,} (removed {removed_train:,}, {removed_train/orig_train*100:.2f}%)")
    print(f"  Val:   {orig_val:,} → {len(df_val):,} (removed {removed_val:,}, {removed_val/orig_val*100:.2f}%)")
    print(f"  Test:  {orig_test:,} → {len(df_test):,} (removed {removed_test:,}, {removed_test/orig_test*100:.2f}%)")
    
    # Show frequency distribution
    print("\n" + "="*70)
    print("FREQUENCY DISTRIBUTION")
    print("="*70)
    
    for name, df in [('Train', df_train), ('Val', df_val), ('Test', df_test)]:
        freq_dist = df['frequency'].value_counts().sort_index()
        unique_cases = (df['frequency'] == 1).sum()
        duplicate_cases = (df['frequency'] > 1).sum()
        
        print(f"\n{name}:")
        print(f"  Unique patterns (freq=1):     {unique_cases:,} ({unique_cases/len(df)*100:.1f}%)")
        print(f"  Duplicate patterns (freq>1):  {duplicate_cases:,} ({duplicate_cases/len(df)*100:.1f}%)")
        
        if duplicate_cases > 0:
            max_freq = df['frequency'].max()
            print(f"  Max frequency:                {max_freq} (same case appeared {max_freq}x)")
            
            # Show distribution
            if len(freq_dist) <= 5:
                print(f"  Frequency breakdown: {dict(freq_dist)}")
    
    print("\n" + "="*70)
    print("✓ Duplicates removed, frequency information preserved")
    print("  Use config.use_sample_weights=True to leverage frequency during training")
    
else:
    print("\n✗ Config: use_unique_cases = False (keeping all duplicates)\n")
    
    # Add frequency = 1 for all cases
    df_train['frequency'] = 1
    df_val['frequency'] = 1
    df_test['frequency'] = 1
    
    print(f"Keeping all cases:")
    print(f"  Train: {len(df_train):,} cases (no deduplication)")
    print(f"  Val:   {len(df_val):,} cases (no deduplication)")
    print(f"  Test:  {len(df_test):,} cases (no deduplication)")
    print(f"\nAll frequency values set to 1 (uniform weighting)")

print("="*70)


Processing duplicates within splits...

✓ Config: use_unique_cases = True (removing duplicates)

Deduplication results:
  Train: 1,020,977 → 1,008,626 (removed 12,351, 1.21%)
  Val:   132,274 → 132,201 (removed 73, 0.06%)
  Test:  134,529 → 134,428 (removed 101, 0.08%)

FREQUENCY DISTRIBUTION

Train:
  Unique patterns (freq=1):     998,350 (99.0%)
  Duplicate patterns (freq>1):  10,276 (1.0%)
  Max frequency:                6 (same case appeared 6x)

Val:
  Unique patterns (freq=1):     132,128 (99.9%)
  Duplicate patterns (freq>1):  73 (0.1%)
  Max frequency:                2 (same case appeared 2x)
  Frequency breakdown: {1: np.int64(132128), 2: np.int64(73)}

Test:
  Unique patterns (freq=1):     134,327 (99.9%)
  Duplicate patterns (freq>1):  101 (0.1%)
  Max frequency:                2 (same case appeared 2x)
  Frequency breakdown: {1: np.int64(134327), 2: np.int64(101)}

✓ Duplicates removed, frequency information preserved
  Use config.use_sample_weights=True to leverage frequency during training
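
The helper `dedup_with_frequency_on_record` is also defined in an earlier cell. A minimal sketch of the strategy listed in Section 1.6 (count each pattern, keep one copy, store the count as `frequency`) might look like this. The name `dedup_with_frequency_sketch` and the toy rows are hypothetical, and the real helper presumably operates on a pandas DataFrame rather than a list of dicts.

```python
from collections import Counter


def dedup_with_frequency_sketch(rows, key="case_hash_record"):
    """Hypothetical sketch: keep the first copy of each pattern plus its count."""
    counts = Counter(row[key] for row in rows)
    seen = set()
    unique_rows = []
    for row in rows:
        k = row[key]
        if k in seen:
            continue  # later copies are dropped
        seen.add(k)
        unique_rows.append({**row, "frequency": counts[k]})
    return unique_rows


rows = [{"case_hash_record": h} for h in ["x", "y", "x", "z", "x"]]
deduped = dedup_with_frequency_sketch(rows)

print([(r["case_hash_record"], r["frequency"]) for r in deduped])
# [('x', 3), ('y', 1), ('z', 1)]

# The frequencies sum back to the original row count, so no information
# about how often a pattern occurred is lost.
total_weight = sum(r["frequency"] for r in deduped)
print(total_weight == len(rows))  # True
```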

In [32]:
total_length = len(df_train)+len(df_val)+len(df_test)
print("Split distribution after de-duplication: ")
print(f"\nTrain: {len(df_train)} rows; {((len(df_train)/total_length)*100):.2f}% of the dataset")
print(f"Validation: {len(df_val)} rows; {((len(df_val)/total_length)*100):.2f}% of the dataset")
print(f"Test: {len(df_test)} rows; {((len(df_test)/total_length)*100):.2f}% of the dataset")
print(f"\nTotal: {total_length} rows; 100% of the dataset")

Split distribution after de-duplication: 

Train: 1008626 rows; 79.09% of the dataset
Validation: 132201 rows; 10.37% of the dataset
Test: 134428 rows; 10.54% of the dataset

Total: 1275255 rows; 100% of the dataset


In [33]:
"""
===============================================================================
Section 1.7: Filter Default Values + Validate Evidence Counts
===============================================================================
Goal:
- Create ML-ready evidence features based on EDA findings.

IMPORTANT: There are TWO different "evidence count" concepts:

1) ITEM-level count (num_items_effective)
   - Counts every effective evidence entry kept after filtering defaults.
   - Multi-choice questions can contribute multiple items for the same base.

2) BASE-level count (num_bases_effective)
   - Counts unique evidence base codes after filtering defaults.
   - This is the number that matches the published "average number of evidences"
     (because it treats one question as one concept).

What to expect (from your EDA):
- Raw items mean ~19.77
- Effective BASE mean:
    - ~13.56 when weighted to match the original dataset distribution
    - ~13.65 on unique case patterns (deduplicated view)
- Effective ITEM mean ~17.95 (because multi-choice adds extra values)
===============================================================================
"""

print("Processing evidence lists...")

# Safety: if filter_defaults not defined in config
if not hasattr(config, "filter_defaults"):
    config.filter_defaults = True
    print("⚠️ config.filter_defaults not set, defaulting to True")

print(f"✓ Config: filter_defaults = {config.filter_defaults}")
print(f"✓ Config: feature_encoding = {config.feature_encoding}")

# -----------------------------
# BEFORE filtering (raw counts)
# -----------------------------
print("\n" + "="*70)
print("BEFORE FILTERING (Raw Evidence Item Counts)")
print("="*70)

for name, df in [("Train", df_train), ("Val", df_val), ("Test", df_test)]:
    mean_raw_items = df["EVIDENCES_list"].apply(len).mean()
    print(f"{name:6} - Mean raw evidence ITEMS: {mean_raw_items:.2f}")

# -----------------------------
# Apply filtering + tokenization
# -----------------------------
print("\n" + "="*70)
print("APPLYING DEFAULT FILTERING + TOKENIZATION")
print("="*70)

if config.filter_defaults:
    df_train = build_effective_evidences(df_train, evidences_map, config.feature_encoding)
    df_val   = build_effective_evidences(df_val, evidences_map, config.feature_encoding)
    df_test  = build_effective_evidences(df_test, evidences_map, config.feature_encoding)
    print("✓ Filtering applied to Train / Val / Test")
else:
    # Still create consistent columns even if you skip filtering
    for df in (df_train, df_val, df_test):
        df["EVIDENCES_effective_items"]  = df["EVIDENCES_list"]
        df["num_items_effective"] = df["EVIDENCES_list"].apply(len)
        df["num_bases_effective"] = df["EVIDENCES_list"].apply(lambda lst: len({split_ev(x)[0] for x in lst}))
        if config.feature_encoding == "base":
            df["EVIDENCES_effective_tokens"] = df["EVIDENCES_list"].apply(lambda lst: sorted({split_ev(x)[0] for x in lst}))
        else:
            df["EVIDENCES_effective_tokens"] = df["EVIDENCES_list"].apply(lambda lst: list(dict.fromkeys(to_item_token(x) for x in lst)))
    print("⚠️ Filtering skipped (config.filter_defaults = False)")

# -----------------------------
# AFTER filtering (report both)
# -----------------------------
print("\n" + "="*70)
print("AFTER FILTERING (Effective Evidence Counts)")
print("="*70)

for name, df in [("Train", df_train), ("Val", df_val), ("Test", df_test)]:
    mean_raw_items = df["EVIDENCES_list"].apply(len).mean()
    mean_eff_items = df["num_items_effective"].mean()
    mean_eff_bases = df["num_bases_effective"].mean()

    red_items = ((mean_raw_items - mean_eff_items) / mean_raw_items) * 100

    print(f"\n{name}:")
    print(f"  Raw mean (items):          {mean_raw_items:.2f}")
    print(f"  Effective mean (items):    {mean_eff_items:.2f}   (after filtering defaults)")
    print(f"  Effective mean (base):     {mean_eff_bases:.2f}   (unique base codes, after filtering)")
    print(f"  Defaults filtered (items): {red_items:.1f}%")

# -----------------------------
# Weighted vs unweighted means
# -----------------------------
print("\n" + "="*70)
print("VALIDATION: Weighted vs Unweighted (Unique-pattern vs Original-distribution view)")
print("="*70)

def weighted_mean(values: pd.Series, weights: pd.Series) -> float:
    return float(np.average(values.astype(float), weights=weights.astype(float)))

# Combine splits (useful for global comparison)
df_all = pd.concat([df_train, df_val, df_test], ignore_index=True)

if "frequency" in df_all.columns:
    wm_base  = weighted_mean(df_all["num_bases_effective"], df_all["frequency"])
    wm_items = weighted_mean(df_all["num_items_effective"], df_all["frequency"])
else:
    wm_base  = df_all["num_bases_effective"].mean()
    wm_items = df_all["num_items_effective"].mean()

um_base  = df_all["num_bases_effective"].mean()
um_items = df_all["num_items_effective"].mean()

print(f"Unweighted mean BASE (unique patterns view): {um_base:.2f}")
print(f"Weighted mean BASE   (original distribution): {wm_base:.2f}")
print(f"Unweighted mean ITEMS: {um_items:.2f}")
print(f"Weighted mean ITEMS:   {wm_items:.2f}")

print("\nReference (from the EDA):")
print("  - BASE weighted mean ~13.56 (paper-aligned)")
print("  - BASE unweighted mean ~13.65 (unique patterns)")
print("  - ITEMS mean ~17.95")

# Quick sanity check ranges (not strict)
if 13.3 <= wm_base <= 13.9:
    print("\n✅ BASE weighted mean is in a reasonable range.")
else:
    print("\n⚠️ BASE weighted mean is outside expected range — check default filtering logic.")

# -----------------------------
# Token sanity check (fast)
# -----------------------------
print("\n" + "="*70)
print("TOKEN SANITY CHECK")
print("="*70)

sample_tokens = df_train.iloc[0]["EVIDENCES_effective_tokens"]
print(f"Encoding type: {config.feature_encoding}")
print(f"Example tokens (first 8): {sample_tokens[:8]}")

# Check that token vocab roughly matches expectations without scanning all rows
sample_vocab = set()
for tokens in df_train["EVIDENCES_effective_tokens"].head(10000):
    sample_vocab.update(tokens)

print(f"Unique tokens observed in first 10k train rows: {len(sample_vocab)}")

if config.feature_encoding == "base":
    print(f"Evidence base codes in mapping file: {len(evidences_map)}")
    if len(sample_vocab) <= len(evidences_map):
        print("✓ Looks consistent for base encoding.")
    else:
        print("⚠️ More codes than mapping suggests — check parsing.")
else:
    print("✓ Token encoding usually has more features than base encoding (value-level detail).")

print("="*70)
print("\n✅ Section 1.7 Complete")
print("Columns available for ML:")
print("  - EVIDENCES_effective_items")
print("  - EVIDENCES_effective_tokens")
print("  - num_items_effective")
print("  - num_bases_effective")
print("  - frequency (if deduplicated)")
print("="*70)


Processing evidence lists...
✓ Config: filter_defaults = True
✓ Config: feature_encoding = base

BEFORE FILTERING (Raw Evidence Item Counts)
Train  - Mean raw evidence ITEMS: 19.90
Val    - Mean raw evidence ITEMS: 20.16
Test   - Mean raw evidence ITEMS: 20.06

APPLYING DEFAULT FILTERING + TOKENIZATION
✓ Filtering applied to Train / Val / Test

AFTER FILTERING (Effective Evidence Counts)

Train:
  Raw mean (items):          19.90
  Effective mean (items):    17.74   (after filtering defaults)
  Effective mean (base):     13.31   (unique base codes, after filtering)
  Defaults filtered (items): 10.8%

Val:
  Raw mean (items):          20.16
  Effective mean (items):    17.97   (after filtering defaults)
  Effective mean (base):     13.44   (unique base codes, after filtering)
  Defaults filtered (items): 10.8%

Test:
  Raw mean (items):          20.06
  Effective mean (items):    17.89   (after filtering defaults)
  Effective mean (base):     13.40   (unique base codes, after filtering)

**Observation:** the evidence statistics computed here do not match the EDA reference values (raw items: ~19.77 in the EDA vs 19.90–20.16 here; effective base mean: ~13.56–13.65 in the EDA vs 13.31–13.44 here). This requires investigation to determine whether our filtering is too aggressive, whether we are filtering incorrectly, or whether something has happened to the original file.
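
To put a number on the gap, the train-split means printed above can be compared with the EDA reference values quoted in the Section 1.7 docstring. This is only a quick sketch: the dictionary names are made up for illustration, and the choice of the unweighted base mean (~13.65) as the reference is an assumption, since the splits here are deduplicated.

```python
# EDA reference values (from the Section 1.7 docstring)
eda_reference = {"raw_items": 19.77, "eff_items": 17.95, "eff_bases": 13.65}

# Train-split means printed in the Section 1.7 output
observed_train = {"raw_items": 19.90, "eff_items": 17.74, "eff_bases": 13.31}

# Relative difference in percent: positive means higher than the EDA value
rel_diff_pct = {
    name: (observed_train[name] - ref) / ref * 100
    for name, ref in eda_reference.items()
}

for name, diff in rel_diff_pct.items():
    print(f"{name:10} EDA={eda_reference[name]:.2f}  "
          f"observed={observed_train[name]:.2f}  diff={diff:+.2f}%")
```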

In [34]:
# Quick diagnostics on effective filtering impact
for split_name, df in [("train", df_train), ("validate", df_val), ("test", df_test)]:
    if "n_evidence_items_raw" in df.columns:
        raw_items = df["n_evidence_items_raw"]
    else:
        raw_items = df["EVIDENCES_list"].apply(len)

    eff_items = df["num_items_effective"]
    eff_bases = df["num_bases_effective"]

    dropped = (raw_items - eff_items)

    print(f"\n[{split_name}]")
    print("  mean raw items:   ", raw_items.mean().round(2))
    print("  mean eff items:   ", eff_items.mean().round(2))
    print("  mean eff bases:   ", eff_bases.mean().round(2))
    print("  mean dropped defaults (items):", dropped.mean().round(2))
    print("  drop rate:", (dropped.mean() / raw_items.mean()).round(3))



[train]
  mean raw items:    19.9
  mean eff items:    17.74
  mean eff bases:    13.31
  mean dropped defaults (items): 2.16
  drop rate: 0.108

[validate]
  mean raw items:    20.16
  mean eff items:    17.97
  mean eff bases:    13.44
  mean dropped defaults (items): 2.19
  drop rate: 0.108

[test]
  mean raw items:    20.06
  mean eff items:    17.89
  mean eff bases:    13.4
  mean dropped defaults (items): 2.17
  drop rate: 0.108
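
The difference between the item-level and base-level counts reported above can be seen on a single toy case. The sketch below is not the notebook's `split_ev` / `is_effective_item` (those are defined in earlier cells). It assumes that evidence strings use `_@_` to separate the base code from its value and that binary evidences (no value) are always kept. The evidence codes and the mapping are made up; `canonical` mirrors the normalization used in the default-filtering check in this notebook.

```python
SEP = "_@_"  # assumed separator between evidence base code and value


def split_ev_sketch(item):
    """Return (base, value); value is None for binary evidences."""
    base, sep, value = item.partition(SEP)
    return base, (value if sep else None)


def canonical(v):
    """Make 'V_0' and '0' comparable."""
    s = str(v)
    return s[2:] if s.startswith("V_") else s


def is_effective_sketch(item, evidences_map):
    base, value = split_ev_sketch(item)
    if value is None:
        return True  # assumption: a binary evidence being present is informative
    default = evidences_map.get(base, {}).get("default_value")
    return default is None or canonical(value) != canonical(default)


# Made-up mapping and case: E_1 is multi-choice, E_2 holds its default value.
toy_map = {"E_1": {"default_value": "V_0"}, "E_2": {"default_value": 0}}
evidences = ["E_9", "E_1_@_V_3", "E_1_@_V_7", "E_2_@_0"]

effective = [e for e in evidences if is_effective_sketch(e, toy_map)]
num_items_effective = len(effective)
num_bases_effective = len({split_ev_sketch(e)[0] for e in effective})

print(effective)            # ['E_9', 'E_1_@_V_3', 'E_1_@_V_7']
print(num_items_effective)  # 3 (both E_1 values count)
print(num_bases_effective)  # 2 (E_1 counts once)
```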


In [19]:
from collections import defaultdict, Counter
import random

def canonical(v: str) -> str:
    """Make 'V_0' and '0' comparable."""
    s = str(v)
    return s[2:] if s.startswith("V_") else s

def check_default_filtering(df, evidences_map, n_rows=20000, seed=42):
    rng = random.Random(seed)
    idx = rng.sample(range(len(df)), k=min(n_rows, len(df)))
    df_s = df.iloc[idx]

    dropped_values = defaultdict(Counter)

    for lst in df_s["EVIDENCES_list"]:
        for item in lst:
            base, value = split_ev(item)
            if value is None:
                continue

            default = evidences_map.get(base, {}).get("default_value", None)
            if default is None:
                continue

            # if your code drops it, record which value got dropped
            if not is_effective_item(item, evidences_map):
                dropped_values[base][value] += 1

    # Find bases where you dropped values OTHER than the default (after canonical normalization)
    suspicious = []
    for base, counter in dropped_values.items():
        default = evidences_map.get(base, {}).get("default_value", None)
        if default is None:
            continue
        allowed = {canonical(default)}
        observed = {canonical(v) for v in counter.keys()}
        if not observed.issubset(allowed):
            suspicious.append((base, default, list(counter.items())[:5]))

    print("Suspicious bases where dropped values != default_value:", len(suspicious))
    if suspicious:
        print("Example suspicious entries (first 10):")
        for s in suspicious[:10]:
            print(s)

check_default_filtering(df_train, evidences_map)


Suspicious bases where dropped values != default_value: 0


In [20]:
print("Evidence map size:", len(evidences_map))
defaults = [v.get("default_value", None) for v in evidences_map.values()]
print("Num evidences with default_value:", sum(d is not None for d in defaults))

# Quick spot-check a few known codes (edit codes you know from your analysis)
for code in ["E_204", "E_58", "E_210"]:
    if code in evidences_map:
        print(code, "default_value =", evidences_map[code].get("default_value"))
    else:
        print(code, "not found in mapping!")


Evidence map size: 223
Num evidences with default_value: 223
E_204 default_value = V_10
E_58 default_value = 0
E_210 default_value = 0


In [35]:
# How many different pathologies share the same input signature?
n_labels_per_input = df_all.groupby("case_hash_inputs")["PATHOLOGY"].nunique()

print(n_labels_per_input.value_counts().head(10))
print("Fraction of input signatures with >1 pathology:",
      (n_labels_per_input > 1).mean().round(4))


PATHOLOGY
1    1275255
Name: count, dtype: int64
Fraction of input signatures with >1 pathology: 0.0


### Preprocessing Summary (Notebook 02)

In this section, we prepared the DDXPlus dataset for machine learning:

- Loaded train/validation/test splits and official mapping files (evidences + conditions).
- Parsed string-encoded columns into Python lists:
  - `EVIDENCES_list`
  - `DIFFERENTIAL_DIAGNOSIS_list`
- Created case hashes: `case_hash_inputs` (input-based) to detect cross-split leakage and `case_hash_record` to detect duplicates within each split.
- Removed leakage across splits (priority: keep test unchanged).
- Deduplicated within each split while preserving `frequency` for optional sample weighting.
- Filtered default-valued evidence items using the official `default_value` for each evidence.
- Created ML-ready evidence tokens:
  - Base encoding (`E_XX`) or token encoding (`E_XX=V_YY`)
- Added evidence complexity features:
  - `num_items_effective` (effective evidence *items*, keeps multi-choice values)
  - `num_bases_effective` (effective evidence *base codes*, collapses multi-choice)
- Validated preprocessing by comparing weighted vs unweighted evidence statistics.

Result: the dataset is leakage-safe, deduplicated (optional), and has evidence features ready for vectorization and baseline model training.
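
For reference, the two case hashes mentioned in the summary could be built roughly as follows. This is a sketch under stated assumptions, not the code from the earlier cells: the choice of fields (`AGE`, `SEX`, `INITIAL_EVIDENCE`, the evidence list, and `PATHOLOGY` for the record-level hash), the SHA-256 digest, and the function names are all assumptions.

```python
import hashlib
import json


def stable_hash(payload):
    """Hash a JSON-serializable payload deterministically."""
    blob = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def case_hashes_sketch(case):
    """Return (hash of inputs only, hash of inputs plus label)."""
    inputs = {
        "AGE": case["AGE"],
        "SEX": case["SEX"],
        "INITIAL_EVIDENCE": case["INITIAL_EVIDENCE"],
        "EVIDENCES": sorted(case["EVIDENCES_list"]),  # ignore evidence order
    }
    record = {**inputs, "PATHOLOGY": case["PATHOLOGY"]}
    return stable_hash(inputs), stable_hash(record)


# Made-up cases: b reorders a's evidences, c has the same inputs but another label.
a = {"AGE": 40, "SEX": "F", "INITIAL_EVIDENCE": "E_9",
     "EVIDENCES_list": ["E_9", "E_1_@_V_3"], "PATHOLOGY": "P1"}
b = {**a, "EVIDENCES_list": ["E_1_@_V_3", "E_9"]}
c = {**a, "PATHOLOGY": "P2"}

print(case_hashes_sketch(a) == case_hashes_sketch(b))        # True
print(case_hashes_sketch(a)[0] == case_hashes_sketch(c)[0])  # True
print(case_hashes_sketch(a)[1] == case_hashes_sketch(c)[1])  # False
```

Keeping the label out of the input-level hash means that two cases with identical inputs count as leakage across splits even if their pathologies differ.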


## Phase 2: Feature Engineering