# Data Preprocessing: From Raw Clinical Data to TwinWeaver Format

This tutorial demonstrates how to transform raw clinical data into the standardized TwinWeaver format required for training digital twin models.

**Key Principles:**
1. **Include as much data as possible** - We aim to capture all available clinical information first, then trim down if needed during data generation.
2. **Prefer events over constants** - Longitudinal data (events) provides richer temporal context than static data (constants). Put as much as possible into the events dataframe.

We will cover:
1. Creating synthetic raw clinical data (simulating real-world EHR exports)
2. Transforming raw data into the three required TwinWeaver dataframes:
   - `df_events`: Longitudinal patient events in long format
   - `df_constant`: Static patient demographics
   - `df_constant_description`: Metadata describing constant columns
3. Using preprocessing helper functions for data aggregation and column classification
4. Converting the processed data into instruction-tuning format

In [None]:
import pandas as pd

from twinweaver import (
    DataManager,
    Config,
    DataSplitterForecasting,
    DataSplitterEvents,
    ConverterInstruction,
    DataSplitter,
    identify_constant_and_changing_columns,
    aggregate_events_to_weeks,
)

## 1. Create Synthetic Raw Clinical Data

In real-world scenarios, you would receive data exports from electronic health records (EHR), clinical trial databases, or other clinical data sources. These typically come as wide-format tables with mixed static and longitudinal information.

We'll create two raw dataframes simulating a typical oncology dataset:
- **Raw Patient Demographics**: Contains static information like birth year, gender, and diagnosis details
- **Raw Clinical Observations**: Contains longitudinal data like lab results, treatments, and clinical assessments
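
If an export arrives in wide format (one column per measurement), it can be reshaped into the long layout used below before any further mapping. A minimal sketch with `pandas.melt`, assuming a hypothetical table `wide_labs` whose column names and values are illustrative only:

```python
import pandas as pd

# Hypothetical wide-format export: one row per visit, one column per lab test
wide_labs = pd.DataFrame(
    {
        "patient_id": ["PT001", "PT001"],
        "visit_date": ["2020-03-20", "2020-04-17"],
        "hemoglobin": [13.5, 13.2],
        "platelets": [285, 278],
    }
)

# Reshape to long format: one row per (patient, visit, observation)
long_labs = wide_labs.melt(
    id_vars=["patient_id", "visit_date"],
    var_name="observation_type",
    value_name="observation_value",
).dropna(subset=["observation_value"])  # drop tests that were not measured at a visit

long_labs = long_labs.sort_values(["patient_id", "visit_date", "observation_type"]).reset_index(drop=True)
print(long_labs)
```

The result has the same `observation_type` / `observation_value` shape as the raw observations table created in the next cells.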

In [None]:
# Raw Patient Demographics DataFrame
# This simulates a typical patient registry export with static information
raw_demographics = pd.DataFrame(
    {
        "patient_id": ["PT001", "PT002", "PT003", "PT004", "PT005"],
        "birth_year": [1958, 1965, 1972, 1949, 1961],
        "sex": ["Male", "Female", "Male", "Female", "Male"],
        "cancer_type": ["NSCLC", "NSCLC", "NSCLC", "NSCLC", "NSCLC"],
        "histology": [
            "Adenocarcinoma",
            "Squamous Cell Carcinoma",
            "Adenocarcinoma",
            "Adenocarcinoma",
            "Squamous Cell Carcinoma",
        ],
        "smoking_status": ["Former", "Never", "Current", "Former", "Current"],
        "diagnosis_date": ["2020-03-15", "2020-06-22", "2021-01-10", "2019-11-05", "2020-09-18"],
        "stage_at_diagnosis": ["IIIB", "IV", "IIIA", "IV", "IIIB"],
        "egfr_status": ["Wild Type", "Wild Type", "L858R Mutation", "Wild Type", "Wild Type"],
        "alk_status": ["Wild Type", "Wild Type", "Wild Type", "Rearrangement", "Wild Type"],
        "pdl1_expression": ["50-100%", "1-49%", "<1%", "1-49%", "50-100%"],
        # Death information: some patients died, others are censored (alive at last follow-up)
        "death_status": ["Deceased", "Alive", "Alive", "Deceased", "Alive"],
        "death_date": ["2021-02-10", None, None, "2020-08-15", None],  # None for alive patients
    }
)

print("Raw Demographics DataFrame:")
raw_demographics

In [None]:
# Raw Clinical Observations DataFrame
# This simulates longitudinal clinical data with labs, vitals, treatments, and outcomes
raw_observations = pd.DataFrame(
    {
        "patient_id": [
            # Patient PT001 - multiple visits
            "PT001",
            "PT001",
            "PT001",
            "PT001",
            "PT001",
            "PT001",
            "PT001",
            "PT001",
            "PT001",
            "PT001",
            "PT001",
            "PT001",
            "PT001",
            "PT001",
            "PT001",
            "PT001",
            # Patient PT002 - multiple visits
            "PT002",
            "PT002",
            "PT002",
            "PT002",
            "PT002",
            "PT002",
            "PT002",
            "PT002",
            "PT002",
            "PT002",
            "PT002",
            "PT002",
            "PT002",
            "PT002",
            # Patient PT003 - multiple visits
            "PT003",
            "PT003",
            "PT003",
            "PT003",
            "PT003",
            "PT003",
            "PT003",
            "PT003",
            "PT003",
            "PT003",
            "PT003",
            "PT003",
            # Patient PT004 - multiple visits
            "PT004",
            "PT004",
            "PT004",
            "PT004",
            "PT004",
            "PT004",
            "PT004",
            "PT004",
            "PT004",
            "PT004",
            "PT004",
            "PT004",
            "PT004",
            "PT004",
            # Patient PT005 - multiple visits
            "PT005",
            "PT005",
            "PT005",
            "PT005",
            "PT005",
            "PT005",
            "PT005",
            "PT005",
            "PT005",
            "PT005",
        ],
        "visit_date": [
            # PT001 visits
            "2020-03-20",
            "2020-03-20",
            "2020-03-20",
            "2020-03-20",  # Baseline
            "2020-04-17",
            "2020-04-17",
            "2020-04-17",
            "2020-04-17",  # Cycle 1
            "2020-05-15",
            "2020-05-15",
            "2020-05-15",
            "2020-05-15",  # Cycle 2
            "2020-06-12",
            "2020-06-12",
            "2020-06-12",
            "2020-06-12",  # Cycle 3
            # PT002 visits
            "2020-06-25",
            "2020-06-25",
            "2020-06-25",  # Baseline
            "2020-07-23",
            "2020-07-23",
            "2020-07-23",
            "2020-07-23",  # Cycle 1
            "2020-08-20",
            "2020-08-20",
            "2020-08-20",  # Cycle 2
            "2020-09-17",
            "2020-09-17",
            "2020-09-17",
            "2020-09-17",  # Cycle 3
            # PT003 visits
            "2021-01-15",
            "2021-01-15",
            "2021-01-15",  # Baseline
            "2021-02-12",
            "2021-02-12",
            "2021-02-12",  # Cycle 1
            "2021-03-12",
            "2021-03-12",
            "2021-03-12",  # Cycle 2
            "2021-04-09",
            "2021-04-09",
            "2021-04-09",  # Cycle 3
            # PT004 visits
            "2019-11-10",
            "2019-11-10",
            "2019-11-10",
            "2019-11-10",  # Baseline
            "2019-12-08",
            "2019-12-08",
            "2019-12-08",  # Cycle 1
            "2020-01-05",
            "2020-01-05",
            "2020-01-05",  # Cycle 2
            "2020-02-02",
            "2020-02-02",
            "2020-02-02",
            "2020-02-02",  # Cycle 3
            # PT005 visits
            "2020-09-22",
            "2020-09-22",
            "2020-09-22",  # Baseline
            "2020-10-20",
            "2020-10-20",
            "2020-10-20",  # Cycle 1
            "2020-11-17",
            "2020-11-17",
            "2020-11-17",
            "2020-11-17",  # Cycle 2
        ],
        "observation_type": [
            # PT001
            "hemoglobin",
            "platelets",
            "ecog",
            "treatment_start",
            "hemoglobin",
            "platelets",
            "ecog",
            "drug_admin",
            "hemoglobin",
            "platelets",
            "ecog",
            "drug_admin",
            "hemoglobin",
            "platelets",
            "ecog",
            "response_assessment",
            # PT002
            "hemoglobin",
            "platelets",
            "treatment_start",
            "hemoglobin",
            "platelets",
            "ecog",
            "drug_admin",
            "hemoglobin",
            "platelets",
            "drug_admin",
            "hemoglobin",
            "platelets",
            "ecog",
            "response_assessment",
            # PT003
            "hemoglobin",
            "platelets",
            "treatment_start",
            "hemoglobin",
            "platelets",
            "drug_admin",
            "hemoglobin",
            "platelets",
            "drug_admin",
            "hemoglobin",
            "platelets",
            "response_assessment",
            # PT004
            "hemoglobin",
            "platelets",
            "ecog",
            "treatment_start",
            "hemoglobin",
            "platelets",
            "drug_admin",
            "hemoglobin",
            "platelets",
            "drug_admin",
            "hemoglobin",
            "platelets",
            "ecog",
            "response_assessment",
            # PT005
            "hemoglobin",
            "platelets",
            "treatment_start",
            "hemoglobin",
            "platelets",
            "drug_admin",
            "hemoglobin",
            "platelets",
            "ecog",
            "response_assessment",
        ],
        "observation_value": [
            # PT001 - stable patient
            "13.5",
            "285",
            "1",
            "Carboplatin/Pemetrexed/Pembrolizumab",
            "13.2",
            "278",
            "1",
            "Carboplatin/Pemetrexed/Pembrolizumab",
            "12.8",
            "265",
            "1",
            "Carboplatin/Pemetrexed/Pembrolizumab",
            "12.9",
            "270",
            "0",
            "Partial Response",
            # PT002 - declining hemoglobin
            "14.1",
            "310",
            "Carboplatin/Paclitaxel/Pembrolizumab",
            "13.5",
            "295",
            "1",
            "Carboplatin/Paclitaxel/Pembrolizumab",
            "12.8",
            "280",
            "Carboplatin/Paclitaxel/Pembrolizumab",
            "12.2",
            "268",
            "1",
            "Stable Disease",
            # PT003 - EGFR+ patient on targeted therapy
            "14.8",
            "245",
            "Osimertinib",
            "14.5",
            "250",
            "Osimertinib",
            "14.3",
            "248",
            "Osimertinib",
            "14.6",
            "252",
            "Partial Response",
            # PT004 - ALK+ patient
            "11.8",
            "198",
            "2",
            "Alectinib",
            "12.1",
            "210",
            "Alectinib",
            "12.5",
            "225",
            "Alectinib",
            "12.8",
            "235",
            "1",
            "Partial Response",
            # PT005 - IO monotherapy
            "15.2",
            "320",
            "Pembrolizumab",
            "14.9",
            "315",
            "Pembrolizumab",
            "14.6",
            "308",
            "0",
            "Complete Response",
        ],
        "observation_unit": [
            # PT001
            "g/dL",
            "10^9/L",
            "",
            "",
            "g/dL",
            "10^9/L",
            "",
            "",
            "g/dL",
            "10^9/L",
            "",
            "",
            "g/dL",
            "10^9/L",
            "",
            "",
            # PT002
            "g/dL",
            "10^9/L",
            "",
            "g/dL",
            "10^9/L",
            "",
            "",
            "g/dL",
            "10^9/L",
            "",
            "g/dL",
            "10^9/L",
            "",
            "",
            # PT003
            "g/dL",
            "10^9/L",
            "",
            "g/dL",
            "10^9/L",
            "",
            "g/dL",
            "10^9/L",
            "",
            "g/dL",
            "10^9/L",
            "",
            # PT004
            "g/dL",
            "10^9/L",
            "",
            "",
            "g/dL",
            "10^9/L",
            "",
            "g/dL",
            "10^9/L",
            "",
            "g/dL",
            "10^9/L",
            "",
            "",
            # PT005
            "g/dL",
            "10^9/L",
            "",
            "g/dL",
            "10^9/L",
            "",
            "g/dL",
            "10^9/L",
            "",
            "",
        ],
    }
)

print("Raw Clinical Observations DataFrame:")
raw_observations.head(20)

## 2. Use Preprocessing Helpers to Understand Your Data

Before transforming the data, let's use the preprocessing helper functions to:
1. **Identify constant vs. changing columns** - This helps decide what goes into `df_constant` vs `df_events`
2. **Aggregate events to weeks** - This reduces noise from multiple observations on nearby days (demonstrated in Section 4, once `df_events` exists)
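
To make the first idea concrete, here is a rough pandas sketch of what "constant within a patient" means. It is an illustration only and not the implementation of `identify_constant_and_changing_columns`; the helper name `split_constant_changing` and the toy data are hypothetical.

```python
import pandas as pd


def split_constant_changing(df: pd.DataFrame, patientid_column: str, date_column: str):
    """Hypothetical helper: a column is 'constant' if every patient has at most one distinct value."""
    value_cols = [c for c in df.columns if c not in (patientid_column, date_column)]
    # Number of distinct non-null values per patient, for each column
    n_unique = df.groupby(patientid_column)[value_cols].nunique()
    constant = [c for c in value_cols if (n_unique[c] <= 1).all()]
    changing = [c for c in value_cols if c not in constant]
    return constant, changing


toy = pd.DataFrame(
    {
        "patient_id": ["A", "A", "B", "B"],
        "visit_date": ["2020-01-01", "2020-02-01", "2020-01-05", "2020-02-05"],
        "sex": ["Male", "Male", "Female", "Female"],
        "hemoglobin": [13.5, 13.2, 14.1, 14.1],
    }
)

constant, changing = split_constant_changing(toy, "patient_id", "visit_date")
print(constant)  # ['sex']
print(changing)  # ['hemoglobin']
```

Note that `hemoglobin` is classified as changing because it varies for patient A, even though patient B happens to have identical values.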

In [None]:
# First, check which columns stay constant within each patient.
# Merge a subset of the demographics onto the observations so every visit row carries those fields.

# Create a merged view for analysis
merged_for_analysis = raw_observations.merge(
    raw_demographics[["patient_id", "birth_year", "sex", "histology", "smoking_status"]], on="patient_id", how="left"
)

# Identify constant vs changing columns
constant_cols, changing_cols = identify_constant_and_changing_columns(
    merged_for_analysis, date_column="visit_date", patientid_column="patient_id"
)

print("Constant columns (same value across all visits for each patient):")
print(constant_cols)
print("\nChanging columns (values vary over time):")
print(changing_cols)

### Why Put Most Data into Events?

**Key Insight**: Even data that appears "constant" (like biomarker status) is often better represented as events because:
1. It has a specific date when it was measured
2. It could potentially change over time (e.g., acquired resistance mutations)
3. The temporal context of when information was known is clinically relevant

**Rule of thumb**: Only truly immutable patient characteristics (birth year, biological sex) should go in `df_constant`. Everything else should be an event!
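
One practical consequence of storing biomarkers as events is that the status known at any given date can be looked up. The sketch below uses plain pandas on a hypothetical patient whose EGFR result changes over time; the patient, values, dates, and the helper `status_as_of` are illustrative assumptions and not part of TwinWeaver.

```python
import pandas as pd

# Hypothetical biomarker events for one patient (values are illustrative only)
biomarker_events = pd.DataFrame(
    {
        "patientid": ["PT_X", "PT_X"],
        "date": pd.to_datetime(["2021-01-10", "2022-03-01"]),
        "event_name": ["EGFR", "EGFR"],
        "event_value": ["L858R Mutation", "L858R + T790M Mutation"],
    }
)


def status_as_of(events: pd.DataFrame, event_name: str, as_of: str):
    """Return the most recent value of event_name recorded on or before as_of, or None."""
    cutoff = pd.Timestamp(as_of)
    known = events[(events["event_name"] == event_name) & (events["date"] <= cutoff)]
    if known.empty:
        return None
    return known.sort_values("date").iloc[-1]["event_value"]


print(status_as_of(biomarker_events, "EGFR", "2021-06-01"))  # L858R Mutation
print(status_as_of(biomarker_events, "EGFR", "2022-06-01"))  # L858R + T790M Mutation
print(status_as_of(biomarker_events, "EGFR", "2020-01-01"))  # None
```

A single static column could hold only one of these values and would lose the information about when each result became known.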

## 3. Transform Raw Data into TwinWeaver Format

Now we'll convert our raw data into the three required TwinWeaver dataframes.

### 3.1 Create df_events (Longitudinal Events)

In [None]:
def transform_to_events(raw_obs: pd.DataFrame, raw_demo: pd.DataFrame) -> pd.DataFrame:
    """
    Transform raw clinical data into TwinWeaver events format.

    The events dataframe has these required columns:
    - patientid: Unique patient identifier
    - date: Date of the event
    - event_category: High-level grouping (e.g., 'lab', 'drug', 'lot', 'death')
    - event_name: Specific variable name
    - event_value: The result/value
    - event_descriptive_name: Natural language description for prompts
    - meta_data: Additional metadata (optional)
    - source: Data source identifier (optional)
    """
    events_list = []

    # --- Process clinical observations ---
    for _, row in raw_obs.iterrows():
        patient_id = row["patient_id"]
        visit_date = row["visit_date"]
        obs_type = row["observation_type"]
        obs_value = row["observation_value"]
        obs_unit = row["observation_unit"]

        # Map observation types to TwinWeaver categories
        if obs_type == "hemoglobin":
            events_list.append(
                {
                    "patientid": patient_id,
                    "date": visit_date,
                    "event_category": "lab",
                    "event_name": "hemoglobin_-_718-7",
                    "event_value": obs_value,
                    "event_descriptive_name": "hemoglobin - 718-7",
                    "meta_data": f"Test: hemoglobin, Cleaned lab units: {obs_unit}",
                    "source": "clinical_observations",
                }
            )
        elif obs_type == "platelets":
            events_list.append(
                {
                    "patientid": patient_id,
                    "date": visit_date,
                    "event_category": "lab",
                    "event_name": "platelets_-_26515-7",
                    "event_value": obs_value,
                    "event_descriptive_name": "platelets - 26515-7",
                    "meta_data": f"Test: platelets, Cleaned lab units: {obs_unit}",
                    "source": "clinical_observations",
                }
            )
        elif obs_type == "ecog":
            events_list.append(
                {
                    "patientid": patient_id,
                    "date": visit_date,
                    "event_category": "ecog",
                    "event_name": "ecog",
                    "event_value": obs_value,
                    "event_descriptive_name": "ECOG Performance Status",
                    "meta_data": None,
                    "source": "clinical_observations",
                }
            )
        elif obs_type == "treatment_start":
            # Treatment start creates Line of Therapy (LoT) events.
            # The line number is hard-coded to "1" because each synthetic patient starts exactly one line.
            events_list.append(
                {
                    "patientid": patient_id,
                    "date": visit_date,
                    "event_category": "lot",
                    "event_name": "line_number",
                    "event_value": "1",
                    "event_descriptive_name": "line number",
                    "meta_data": None,
                    "source": "clinical_observations",
                }
            )
            events_list.append(
                {
                    "patientid": patient_id,
                    "date": visit_date,
                    "event_category": "lot",
                    "event_name": "line_name",
                    "event_value": obs_value,
                    "event_descriptive_name": "line of therapy",
                    "meta_data": None,
                    "source": "clinical_observations",
                }
            )
            # Also add individual drug LoT start events
            for drug in obs_value.split("/"):
                events_list.append(
                    {
                        "patientid": patient_id,
                        "date": visit_date,
                        "event_category": "lot",
                        "event_name": drug.lower(),
                        "event_value": "LoT Start",
                        "event_descriptive_name": "LoT",
                        "meta_data": None,
                        "source": "clinical_observations",
                    }
                )
        elif obs_type == "drug_admin":
            # Drug administration events
            for drug in obs_value.split("/"):
                events_list.append(
                    {
                        "patientid": patient_id,
                        "date": visit_date,
                        "event_category": "drug",
                        "event_name": drug.lower(),
                        "event_value": "administered",
                        "event_descriptive_name": drug.lower(),
                        "meta_data": obs_value,
                        "source": "clinical_observations",
                    }
                )
        elif obs_type == "response_assessment":
            events_list.append(
                {
                    "patientid": patient_id,
                    "date": visit_date,
                    "event_category": "response",
                    "event_name": "recist_response",
                    "event_value": obs_value,
                    "event_descriptive_name": "RECIST Response",
                    "meta_data": None,
                    "source": "clinical_observations",
                }
            )

    # --- Process diagnosis and biomarker data from demographics ---
    # These are events because they have a specific date and could change over time
    for _, row in raw_demo.iterrows():
        patient_id = row["patient_id"]
        diagnosis_date = row["diagnosis_date"]

        # Initial diagnosis event
        events_list.append(
            {
                "patientid": patient_id,
                "date": diagnosis_date,
                "event_category": "main_diagnosis",
                "event_name": "initial_diagnosis",
                "event_value": row["cancer_type"],
                "event_descriptive_name": "initial cancer diagnosis",
                "meta_data": row["cancer_type"],
                "source": "demographics",
            }
        )

        # Stage at diagnosis
        events_list.append(
            {
                "patientid": patient_id,
                "date": diagnosis_date,
                "event_category": "staging",
                "event_name": "stage",
                "event_value": row["stage_at_diagnosis"],
                "event_descriptive_name": "Cancer Stage",
                "meta_data": None,
                "source": "demographics",
            }
        )

        # Biomarker results (these go into events, not constants!)
        events_list.append(
            {
                "patientid": patient_id,
                "date": diagnosis_date,
                "event_category": "basic_biomarker",
                "event_name": "EGFR",
                "event_value": row["egfr_status"],
                "event_descriptive_name": "EGFR",
                "meta_data": "NGS",
                "source": "demographics",
            }
        )
        events_list.append(
            {
                "patientid": patient_id,
                "date": diagnosis_date,
                "event_category": "basic_biomarker",
                "event_name": "ALK",
                "event_value": row["alk_status"],
                "event_descriptive_name": "ALK",
                "meta_data": "NGS",
                "source": "demographics",
            }
        )
        events_list.append(
            {
                "patientid": patient_id,
                "date": diagnosis_date,
                "event_category": "biomarker_ihc",
                "event_name": "PD-L1",
                "event_value": row["pdl1_expression"],
                "event_descriptive_name": "PD-L1 Expression (TPS)",
                "meta_data": "IHC 22C3",
                "source": "demographics",
            }
        )

        # --- Process death events ---
        # Death is a time-to-event outcome that occurs at a specific date
        if row["death_status"] == "Deceased" and pd.notna(row["death_date"]):
            events_list.append(
                {
                    "patientid": patient_id,
                    "date": row["death_date"],
                    "event_category": "death",
                    "event_name": "death",
                    "event_value": "Yes",
                    "event_descriptive_name": "Death",
                    "meta_data": None,
                    "source": "demographics",
                }
            )

    # Create DataFrame and sort by patient and date
    df_events = pd.DataFrame(events_list)
    df_events["date"] = pd.to_datetime(df_events["date"])
    df_events = df_events.sort_values(["patientid", "date"]).reset_index(drop=True)

    return df_events


# Transform the data
df_events = transform_to_events(raw_observations, raw_demographics)

print(f"Created events DataFrame with {len(df_events)} events")
print(f"Unique patients: {df_events['patientid'].nunique()}")
print(f"\nEvent categories: {df_events['event_category'].unique().tolist()}")
df_events.head(15)

### 3.2 Create df_constant (Static Patient Information)

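Only characteristics that are fixed for the purposes of the analysis should go here, and we keep this minimal. In this example that means birth year, sex, and two attributes taken as fixed at diagnosis: histology and smoking status.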

In [None]:
def transform_to_constant(raw_demo: pd.DataFrame) -> pd.DataFrame:
    """
    Extract truly constant patient information.

    Only include immutable characteristics that:
    1. Never change over time
    2. Don't have a meaningful "measurement date"
    """
    df_constant = raw_demo[["patient_id", "birth_year", "sex", "histology", "smoking_status"]].copy()

    # Rename columns to match TwinWeaver format
    df_constant = df_constant.rename(
        columns={
            "patient_id": "patientid",
            "birth_year": "birthyear",
            "sex": "gender",
        }
    )

    return df_constant


df_constant = transform_to_constant(raw_demographics)

print("Constant DataFrame (static patient information):")
df_constant

### 3.3 Create df_constant_description (Metadata for Constants)

This provides human-readable descriptions for each column in `df_constant`.
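
A quick way to confirm that every constant column has a description is to compare the column names against the `variable` column. A minimal sketch, using small hypothetical stand-in dataframes rather than the ones built in this tutorial:

```python
import pandas as pd

# Hypothetical stand-ins for df_constant and df_constant_description
constant_example = pd.DataFrame({"patientid": ["PT001"], "birthyear": [1958], "gender": ["Male"]})
description_example = pd.DataFrame(
    {
        "variable": ["patientid", "birthyear"],
        "comment": ["Unique patient identifier", "Year of birth of the patient"],
    }
)

# Columns present in the constants table but lacking a description row
undescribed = sorted(set(constant_example.columns) - set(description_example["variable"]))
print(undescribed)  # ['gender']
```

This only checks that a row exists for each column; it does not judge whether the `comment` text is meaningful.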

In [None]:
def create_constant_description(df_constant: pd.DataFrame) -> pd.DataFrame:
    """
    Create descriptions for each constant column.
    These descriptions are used in prompt generation.
    """
    descriptions = {
        "patientid": "Unique patient identifier",
        "birthyear": "Year of birth of the patient",
        "gender": "Gender of the patient",
        "histology": "Histological subtype of NSCLC",
        "smoking_status": "Smoking status at diagnosis",
    }

    # Create description for each column that exists
    rows = []
    for col in df_constant.columns:
        rows.append({"variable": col, "comment": descriptions.get(col, f"Description for {col}")})

    return pd.DataFrame(rows)


df_constant_description = create_constant_description(df_constant)

print("Constant Description DataFrame:")
df_constant_description

## 4. Apply Weekly Aggregation (Optional Preprocessing)

If your data has multiple observations on nearby days (e.g., labs taken daily), you may want to aggregate them to reduce noise. The `aggregate_events_to_weeks` function handles this automatically.
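
For intuition, weekly bucketing can be sketched in plain pandas. This is not the implementation of `aggregate_events_to_weeks`, whose exact aggregation rule is not described in this tutorial; the sketch assumes numeric values and takes the mean within each Monday-to-Sunday week, which is an illustrative choice. The data are hypothetical.

```python
import pandas as pd

# Hypothetical daily lab values for one patient
daily_labs = pd.DataFrame(
    {
        "patientid": ["PT001"] * 4,
        "date": pd.to_datetime(["2020-03-16", "2020-03-18", "2020-03-20", "2020-03-25"]),
        "event_name": ["hemoglobin"] * 4,
        "event_value": ["13.5", "13.1", "13.3", "12.9"],
    }
)

# Values are stored as strings in the events table, so convert before averaging
daily_labs["event_value"] = pd.to_numeric(daily_labs["event_value"], errors="coerce")

# Map each date to the Monday that starts its week (pandas "W" periods end on Sunday)
daily_labs["week_start"] = daily_labs["date"].dt.to_period("W").dt.start_time

# One averaged value per patient, variable, and week
weekly_labs = daily_labs.groupby(["patientid", "event_name", "week_start"], as_index=False)["event_value"].mean()
print(weekly_labs)
```

The first three measurements fall into the week starting 2020-03-16 and collapse to one row; the fourth lands in the following week.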

In [None]:
# Demonstrate weekly aggregation on lab values
df_labs_only = df_events[df_events["event_category"] == "lab"].copy()

print(f"Before aggregation: {len(df_labs_only)} lab events")

# Aggregate to weekly values
df_labs_aggregated = aggregate_events_to_weeks(
    df_labs_only,
    patientid_column="patientid",
    date_column="date",
    event_name_column="event_name",
    event_value_column="event_value",
    random_state=42,  # For reproducibility
)

print(f"After aggregation: {len(df_labs_aggregated)} lab events")
print("\nAggregated lab events (first 10):")
df_labs_aggregated.sort_values(by=["patientid", "date"]).head(10)

## 5. Validate the TwinWeaver Format

Let's verify our data is in the correct format before proceeding.

In [None]:
def validate_twinweaver_format(df_events, df_constant, df_constant_description):
    """Validate that dataframes conform to TwinWeaver requirements."""
    issues = []

    # Check df_events required columns
    events_required = ["patientid", "date", "event_category", "event_name", "event_value", "event_descriptive_name"]
    for col in events_required:
        if col not in df_events.columns:
            issues.append(f"df_events missing required column: {col}")

    # Check df_constant has patientid
    if "patientid" not in df_constant.columns:
        issues.append("df_constant missing required column: patientid")

    # Check df_constant_description structure
    if "variable" not in df_constant_description.columns:
        issues.append("df_constant_description missing required column: variable")
    if "comment" not in df_constant_description.columns:
        issues.append("df_constant_description missing required column: comment")

    # Check patient ID consistency (skipped if either dataframe lacks patientid,
    # which is already reported above and would otherwise raise a KeyError here)
    if "patientid" in df_events.columns and "patientid" in df_constant.columns:
        events_patients = set(df_events["patientid"].unique())
        constant_patients = set(df_constant["patientid"].unique())
        missing_in_events = constant_patients - events_patients
        missing_in_constant = events_patients - constant_patients
        if missing_in_events:
            issues.append(f"Patients in constant but not in events: {missing_in_events}")
        if missing_in_constant:
            issues.append(f"Patients in events but not in constant: {missing_in_constant}")

    if issues:
        print("❌ Validation issues found:")
        for issue in issues:
            print(f"  - {issue}")
    else:
        print("✅ All dataframes are in valid TwinWeaver format!")
        print(f"   - Events: {len(df_events)} rows, {df_events['patientid'].nunique()} patients")
        print(f"   - Constants: {len(df_constant)} rows, {len(df_constant.columns)} columns")
        print(f"   - Descriptions: {len(df_constant_description)} variable descriptions")

    return len(issues) == 0


validate_twinweaver_format(df_events, df_constant, df_constant_description)

## 6. Convert to Instruction-Tuning Format

Now we can use our processed data with the TwinWeaver pipeline to generate instruction-tuning examples, just like in the `01_data_preparation_for_training` tutorial.

In [None]:
# Configure TwinWeaver
config = Config()

# Set the event category used for data splitting (split around Lines of Therapy)
config.split_event_category = "lot"

# Define which event categories to forecast
config.event_category_forecast = ["lab"]

# Map event categories to the names used in time-to-event prompts (here 'death' and 'progression').
# Only needed for time-to-event prediction.
# Note: the synthetic data above contains 'death' events but no 'progression' events.
config.data_splitter_events_variables_category_mapping = {
    "death": "death",
    "progression": "next progression",  # Custom name in prompt: "next progression" instead of "progression"
}

# Define which static columns to include in prompts
config.constant_columns_to_use = [
    "birthyear",
    "gender",
    "histology",
    "smoking_status",
]

# Specify the birth year column for age calculation
config.constant_birthdate_column = "birthyear"

In [None]:
# Initialize the DataManager and load our processed data
dm = DataManager(config=config)
dm.load_indication_data(df_events=df_events, df_constant=df_constant, df_constant_description=df_constant_description)
dm.process_indication_data()
dm.setup_unique_mapping_of_events()
dm.setup_dataset_splits()
dm.infer_var_types()

print(f"Loaded {len(dm.all_patientids)} patients into DataManager")

In [None]:
# Initialize data splitters and converter
data_splitter_events = DataSplitterEvents(dm, config=config)
data_splitter_events.setup_variables()

data_splitter_forecasting = DataSplitterForecasting(
    data_manager=dm,
    config=config,
)
data_splitter_forecasting.setup_statistics()

data_splitter = DataSplitter(data_splitter_events, data_splitter_forecasting)

converter = ConverterInstruction(
    nr_tokens_budget_total=8192,
    config=config,
    dm=dm,
    variable_stats=data_splitter_forecasting.variable_stats,
)

In [None]:
# Generate instruction-tuning examples for a patient
patientid = dm.all_patientids[0]
patient_data = dm.get_patient_data(patientid)

print(f"Patient: {patientid}")
print(f"Number of events: {len(patient_data['events'])}")
print("\nPatient events:")
patient_data["events"]

In [None]:
# Generate training splits
forecasting_splits, events_splits, reference_dates = data_splitter.get_splits_from_patient_with_target(
    patient_data,
)

print(f"Generated {len(forecasting_splits)} training splits for patient {patientid}")

In [None]:
# Convert first split to instruction format
if len(forecasting_splits) > 0:
    split_idx = 0
    p_converted = converter.forward_conversion(
        forecasting_splits=forecasting_splits[split_idx],
        event_splits=events_splits[split_idx],
        override_mode_to_select_forecasting="both",
    )

    print("=" * 80)
    print("INSTRUCTION (Model Input):")
    print("=" * 80)
    print(p_converted["instruction"])
else:
    print("No training splits generated for this patient.")

In [None]:
if len(forecasting_splits) > 0:
    print("=" * 80)
    print("ANSWER (Target Output):")
    print("=" * 80)
    print(p_converted["answer"])

## Summary: Key Takeaways

### Data Format Requirements

TwinWeaver requires three dataframes:

1. **`df_events`** (Longitudinal data in long format)
   - Required columns: `patientid`, `date`, `event_category`, `event_name`, `event_value`, `event_descriptive_name`
   - Optional columns: `meta_data`, `source`

2. **`df_constant`** (Static patient information)
   - Required column: `patientid`
   - Additional columns for immutable characteristics (birthyear, gender, etc.)

3. **`df_constant_description`** (Metadata for constants)
   - Required columns: `variable`, `comment`
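
As a compact reference, the smallest possible instance of the three dataframes might look like the following. The column names come from this tutorial; the single patient and the `_min` variable names are made up for illustration.

```python
import pandas as pd

# One event row in long format (required columns plus the two optional ones)
df_events_min = pd.DataFrame(
    [
        {
            "patientid": "PT001",
            "date": pd.Timestamp("2020-03-20"),
            "event_category": "lab",
            "event_name": "hemoglobin_-_718-7",
            "event_value": "13.5",
            "event_descriptive_name": "hemoglobin - 718-7",
            "meta_data": None,
            "source": "clinical_observations",
        }
    ]
)

# One row per patient with static attributes
df_constant_min = pd.DataFrame([{"patientid": "PT001", "birthyear": 1958, "gender": "Male"}])

# One row per constant column, describing it in plain language
df_constant_description_min = pd.DataFrame(
    {
        "variable": ["patientid", "birthyear", "gender"],
        "comment": ["Unique patient identifier", "Year of birth of the patient", "Gender of the patient"],
    }
)

print(df_events_min.dtypes)
```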

### Best Practices

1. **Put as much as possible into events** - Even data that seems "constant" often has temporal context:
   - Biomarker results → events (they have a test date)
   - Staging information → events (stage at diagnosis date)
   - Demographics like birth year, biological sex → constants (truly immutable)

2. **Include all available data first** - Start with everything, then trim during data generation if needed:
   - Use the token budget in `ConverterInstruction` to control output length
   - The framework automatically prioritizes recent and relevant events

3. **Use preprocessing helpers wisely**:
   - `identify_constant_and_changing_columns()` - Helps decide what goes where
   - `aggregate_events_to_weeks()` - Reduces noise from frequent measurements

4. **Validate your data** before training to catch format issues early.
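
Beyond column presence, a few value-level checks can catch common problems such as unparsed dates, missing values, or exact duplicate rows. The helper below, `check_event_values`, is a hypothetical addition in plain pandas and not part of TwinWeaver:

```python
import pandas as pd


def check_event_values(df_events: pd.DataFrame) -> list:
    """Hypothetical helper: return a list of human-readable problems found in an events table."""
    problems = []
    # Dates should already be parsed, e.g. with pd.to_datetime
    if not pd.api.types.is_datetime64_any_dtype(df_events["date"]):
        problems.append("date column is not a datetime dtype")
    # Core columns should not contain missing values
    for col in ["patientid", "date", "event_name", "event_value"]:
        n_missing = int(df_events[col].isna().sum())
        if n_missing:
            problems.append(f"{col} has {n_missing} missing value(s)")
    # Rows that repeat an earlier row exactly on the key columns
    n_dupes = int(df_events.duplicated(subset=["patientid", "date", "event_name", "event_value"]).sum())
    if n_dupes:
        problems.append(f"{n_dupes} exact duplicate event row(s)")
    return problems


example = pd.DataFrame(
    {
        "patientid": ["PT001", "PT001", "PT002"],
        "date": pd.to_datetime(["2020-03-20", "2020-03-20", "2020-06-25"]),
        "event_name": ["ecog", "ecog", "ecog"],
        "event_value": ["1", "1", None],
    }
)
for problem in check_event_values(example):
    print(problem)
```

Whether duplicates or missing values are genuine errors depends on the source data, so the helper reports them rather than dropping rows.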