# 🏆 Challenge 1: Data Preparation for Training

**Difficulty**: ⭐⭐ (Intermediate) | **Time**: 45-60 minutes

---

## 🎯 Learning Objectives

By completing this challenge, you will:
1. Understand the data format required by TwinWeaver
2. Configure the pipeline for different prediction tasks
3. Generate training splits from patient timelines
4. Convert structured data to instruction-tuning format

## 📋 Rules
- Complete all `# TODO:` sections
- Answer quiz questions before proceeding
- Run checkpoint cells to validate your solutions
- **No peeking at the original tutorial!**

---
## Part 1: Understanding the Data

Before we start coding, let's understand what data we're working with.

In [None]:
import pandas as pd

from twinweaver import (
    DataManager,
    Config,
)

In [None]:
# Load the example data
df_events = pd.read_csv("../example_data/events.csv")
df_constant = pd.read_csv("../example_data/constant.csv")
df_constant_description = pd.read_csv("../example_data/constant_description.csv")

### 🔍 Exercise 1.1: Explore the Data

Before configuring the pipeline, you need to understand your data. Explore the three dataframes to answer the quiz questions below.
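
As a warm-up, here is a minimal sketch of the kind of exploration that helps, run on a small made-up frame rather than the real files. The `toy_events` data, its values, and the `summarize` helper are invented for illustration; the column names `patientid` and `event_category` are assumptions borrowed from later cells in this notebook, and the real files may look different.

```python
import pandas as pd

# Toy stand-in for an events table; all values are invented for illustration.
toy_events = pd.DataFrame(
    {
        "patientid": ["p1", "p1", "p2", "p2", "p3"],
        "event_category": ["lab", "drug", "lab", "condition", "lab"],
    }
)


def summarize(df, id_col, category_col):
    """Collect the basic facts an exploration usually starts with."""
    return {
        "columns": df.columns.tolist(),
        "categories": sorted(df[category_col].unique().tolist()),
        "n_patients": int(df[id_col].nunique()),
    }


summary = summarize(toy_events, "patientid", "event_category")
print(summary)
```

The same three calls (`columns`, `unique()`, `nunique()`) should carry over to the real dataframes once you know which columns to point them at.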

In [None]:
# TODO: Explore df_events - what columns does it have? What are the unique event categories?
# Write your exploration code here


In [None]:
# TODO: Explore df_constant - what patient-level information is available?


In [None]:
# TODO: Explore df_constant_description - how does this map to df_constant?


### ❓ Quiz 1: Data Understanding

Answer these questions based on your exploration:

**Q1.1**: What column in `df_events` contains the type of medical event (lab, drug, condition, etc.)?

**Q1.2**: List all unique event categories in the dataset:

**Q1.3**: How many unique patients are in the dataset?

**Q1.4**: What column in `df_constant` could be used to calculate a patient's age?

*Write your answers in the cell below:*

**Your Answers:**

Q1.1: 

Q1.2: 

Q1.3: 

Q1.4: 

---
## Part 2: Configuration Challenge

Now you need to configure the TwinWeaver pipeline. This is where understanding your data pays off!

### 🎯 Your Task

Configure the pipeline to:
1. Split patient histories around **Lines of Therapy** (treatment changes)
2. Forecast **lab values** into the future
3. Predict time-to-event for **death** and **progression**
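
To make the idea of splitting around an event concrete, here is a toy pandas sketch that cuts one patient's timeline into "history" and "future" at a marker event. It does not use TwinWeaver and is not how the library does it internally; the `date` column, the `split_marker` category value, and the `split_at_first` helper are placeholders chosen for illustration, not values from the example data.

```python
import pandas as pd

# Toy timeline for a single patient; dates and categories are invented.
timeline = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01"]
        ),
        "event_category": ["lab", "split_marker", "lab", "lab"],
    }
)


def split_at_first(df, category):
    """Split a timeline at the earliest event of the given category."""
    split_date = df.loc[df["event_category"] == category, "date"].min()
    history = df[df["date"] <= split_date]  # what the model may see
    future = df[df["date"] > split_date]  # what the model should predict
    return split_date, history, future


split_date, history, future = split_at_first(timeline, "split_marker")
print(split_date.date(), len(history), len(future))
```

Whether the marker event itself belongs to the history or the future is a design choice; this sketch puts it in the history.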

In [None]:
config = Config()

# TODO: Set the event category used for splitting patient timelines
# HINT: Look at your answer to Q1.2 - which category represents treatment lines?
config.split_event_category = None  # Replace None with the correct value

# TODO: Set which event categories should be forecasted as time-series
# HINT: We want to predict future lab values
config.event_category_forecast = None  # Replace None with a list

# TODO: Configure time-to-event prediction targets
# HINT: This should be a dictionary mapping event names to display names
# Example: {"original_name": "display name in prompt"}
config.data_splitter_events_variables_category_mapping = None  # Replace with dict

### 🏁 Checkpoint 2.1: Validate Configuration

In [None]:
# Run this cell to check your configuration
def validate_config_part1(config):
    errors = []

    if config.split_event_category is None:
        errors.append("❌ split_event_category is not set")
    elif config.split_event_category not in df_events["event_category"].unique():
        errors.append(f"❌ split_event_category '{config.split_event_category}' not found in data")
    else:
        print(f"✅ split_event_category: '{config.split_event_category}'")

    if config.event_category_forecast is None:
        errors.append("❌ event_category_forecast is not set")
    elif not isinstance(config.event_category_forecast, list):
        errors.append("❌ event_category_forecast should be a list")
    elif any([cat not in df_events["event_category"].unique() for cat in config.event_category_forecast]):
        errors.append("❌ At least one of the event_category_forecast values not found in data")
    else:
        print(f"✅ event_category_forecast: {config.event_category_forecast}")

    if config.data_splitter_events_variables_category_mapping is None:
        errors.append("❌ data_splitter_events_variables_category_mapping is not set")
    elif not isinstance(config.data_splitter_events_variables_category_mapping, dict):
        errors.append("❌ data_splitter_events_variables_category_mapping should be a dict")
    elif any(
        [
            cat not in df_events["event_category"].unique()
            for cat in config.data_splitter_events_variables_category_mapping.keys()
        ]
    ):
        errors.append("❌ At least one key in data_splitter_events_variables_category_mapping not found in data")
    else:
        print(f"✅ Event mapping: {config.data_splitter_events_variables_category_mapping}")

    if errors:
        print("\n" + "\n".join(errors))
        print("\n💡 Hint: Review Part 1 exploration to find the correct values")
    else:
        print("\n🎉 Part 2.1 Complete! Configuration looks good.")

    return len(errors) == 0


validate_config_part1(config)

### 🔧 Exercise 2.2: Configure Static Variables

Now configure which patient demographics to include in the prompts.

In [None]:
# TODO: Look at df_constant columns and decide which ones to include
# Consider: Which variables are clinically relevant for predictions?

# First, explore what's available
print("Available columns in df_constant:")
print(df_constant.columns.tolist())

In [None]:
# TODO: Select which constant columns to use (list of column names)
config.constant_columns_to_use = []  # Fill in the list

# TODO: Specify which column contains birth year/date for age calculation
config.constant_birthdate_column = None  # Set the column name

---
## Part 3: Initialize the Pipeline

With configuration complete, let's initialize the data processing components.

In [None]:
# TODO: Initialize DataManager and load data
# The DataManager needs to:
# 1. Be created with your config
# 2. Load the indication data (events, constant, constant_description)
# 3. Process the indication data
# 4. Setup unique mapping of events
# 5. Setup dataset splits
# 6. Infer variable types

dm = DataManager(config=config)

# TODO: Call the required methods in the correct order
# dm.????
# dm.????
# dm.????
# dm.????
# dm.????

### 🏁 Checkpoint 3.1: Validate DataManager

In [None]:
# Run this to verify DataManager is set up correctly
try:
    n_patients = len(dm.all_patientids)
    print(f"✅ DataManager initialized with {n_patients} patients")

    # Check if we can get patient data
    test_patient = dm.all_patientids[0]
    patient_data = dm.get_patient_data(test_patient)
    print(f"✅ Successfully retrieved data for patient {test_patient}")
    print(f"   - Events: {len(patient_data['events'])} rows")
    print(f"   - Constant: {len(patient_data['constant'])} rows")
    print("\n🎉 Part 3 Complete!")
except Exception as e:
    print(f"❌ Error: {e}")
    print("\n💡 Hint: Make sure you called all DataManager methods in the correct order")

---
## Part 4: Create Splitters and Converter

### ❓ Quiz 2: Understanding Splitters

Before creating the splitters, answer these conceptual questions:

**Q2.1**: What is the purpose of splitting a patient's timeline? Why not use the entire history?

**Q2.2**: What's the difference between `DataSplitterEvents` and `DataSplitterForecasting`?

**Q2.3**: Why do we need a token budget for the converter?

**Your Answers:**

Q2.1: 

Q2.2: 

Q2.3: 

In [None]:
# TODO: Initialize DataSplitterEvents
# This handles event prediction tasks (death, progression)
data_splitter_events = None  # Create the splitter

# TODO: Don't forget to call setup_variables() on it!

In [None]:
# TODO: Initialize DataSplitterForecasting
# This handles continuous variable forecasting (lab values)
data_splitter_forecasting = None  # Create the splitter

# TODO: Call setup_statistics() for forecasting QA and filtering

In [None]:
# TODO: Combine both splitters using DataSplitter wrapper
data_splitter = None  # Create the combined splitter

In [None]:
# TODO: Initialize ConverterInstruction
# Parameters needed:
# - nr_tokens_budget_total: How many tokens can the prompt be? (try 8192)
# - config: Your configuration object
# - dm: Your DataManager
# - variable_stats: Statistics from forecasting splitter (optional but recommended)

converter = None  # Create the converter

### 🏁 Checkpoint 4.1: Validate Pipeline Components

In [None]:
# Validate all components are created
components_valid = True

if data_splitter_events is None:
    print("❌ data_splitter_events is not initialized")
    components_valid = False
else:
    print("✅ data_splitter_events initialized")

if data_splitter_forecasting is None:
    print("❌ data_splitter_forecasting is not initialized")
    components_valid = False
else:
    print("✅ data_splitter_forecasting initialized")

if data_splitter is None:
    print("❌ data_splitter is not initialized")
    components_valid = False
else:
    print("✅ data_splitter initialized")

if converter is None:
    print("❌ converter is not initialized")
    components_valid = False
else:
    print("✅ converter initialized")

if components_valid:
    print("\n🎉 Part 4 Complete! All components ready.")

---
## Part 5: Generate Training Examples

Now let's generate actual training examples!

In [None]:
# Select a patient to work with
patientid = dm.all_patientids[4]
print(f"Working with patient: {patientid}")

# Get patient data
patient_data = dm.get_patient_data(patientid)

In [None]:
# TODO: Generate splits from this patient's data
# Use data_splitter.get_splits_from_patient_with_target()
# This returns: forecasting_splits, events_splits, reference_dates

forecasting_splits, events_splits, reference_dates = None, None, None  # Replace with actual call

### 🔍 Exercise 5.1: Analyze the Splits

Before converting, understand what the splitter produced.

In [None]:
# TODO: Answer these questions by exploring the splits:
# 1. How many splits were generated for this patient?
# 2. What dates are the reference points (split dates)?
# 3. What does each split contain?

print("Number of splits: ???")  # Fill in
print("Reference dates: ???")  # Fill in

In [None]:
# TODO: Convert the first split to instruction format
# Use converter.forward_conversion()
# Parameters:
# - forecasting_splits: the forecasting split for one time point
# - event_splits: the event split for one time point
# - override_mode_to_select_forecasting: set to "both"

split_idx = 0
p_converted = None  # Replace with actual conversion call

### 🔍 Exercise 5.2: Examine the Output

In [None]:
# TODO: Print and examine the instruction (input prompt)
# What information is included? What's the structure?


In [None]:
# TODO: Print and examine the answer (target output)
# What format is the answer in? What predictions are being made?


### ❓ Quiz 3: Output Analysis

**Q3.1**: What sections can you identify in the instruction prompt?

**Q3.2**: How are the forecasting predictions formatted in the answer?

**Q3.3**: How are the time-to-event predictions formatted?

**Your Answers:**

Q3.1: 

Q3.2: 

Q3.3: 

---
## Part 6: Reverse Conversion

An important capability is converting model outputs back to structured data.
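
The general idea of a round trip can be illustrated with a toy example: structured values are rendered as text, and a parser recovers them from that text. The sketch below is not TwinWeaver's answer format or API; the `name: value` layout, the helper names `to_text` and `from_text`, and the lab names are all made up for illustration.

```python
def to_text(values):
    """Render a dict of numeric values as one 'name: value' line each."""
    return "\n".join(f"{name}: {value}" for name, value in values.items())


def from_text(text):
    """Parse 'name: value' lines back into a dict of floats."""
    parsed = {}
    for line in text.splitlines():
        name, _, raw = line.partition(":")
        parsed[name.strip()] = float(raw)
    return parsed


# Invented values standing in for a model's structured prediction.
original = {"hemoglobin": 13.5, "creatinine": 0.9}
as_text = to_text(original)
round_trip = from_text(as_text)
print(as_text)
print(round_trip == original)
```

A real parser also has to cope with malformed model output, which this sketch ignores.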

In [None]:
# TODO: Use reverse_conversion to parse the answer back to structured data
# You'll need:
# - The answer string from p_converted
# - The data manager (dm)
# - The reference date for this split

date = reference_dates["date"][split_idx]
return_list = None  # Call converter.reverse_conversion()

In [None]:
# TODO: Examine what the reverse conversion produced
# What structure does return_list have? What's in each element?


---
## 🌟 Bonus Challenge 1: Custom Configuration

**+15 points**

Modify the configuration to predict **only drug-related events** instead of death and progression. Generate a new training example and compare the output.

In [None]:
# BONUS: Implement your custom configuration here


---
## 🌟 Bonus Challenge 2: Multi-Patient Dataset

**+25 points**

Write a function that generates training examples for ALL patients in the dataset and returns a pandas DataFrame with columns: `patientid`, `split_idx`, `instruction`, `answer`.
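
If the pandas side of this is unfamiliar, one common pattern is to collect one dict per example and build the DataFrame once at the end. The sketch below shows only that pattern: `fake_examples` is a hypothetical stand-in for the splitting and conversion steps, which you still need to write with the TwinWeaver components from Parts 4 and 5.

```python
import pandas as pd


def fake_examples(patientid):
    """Hypothetical stand-in returning (instruction, answer) pairs."""
    return [
        (f"instruction for {patientid} #{i}", f"answer {i}") for i in range(2)
    ]


rows = []
for patientid in ["p1", "p2"]:  # placeholder patient IDs
    for split_idx, (instruction, answer) in enumerate(fake_examples(patientid)):
        rows.append(
            {
                "patientid": patientid,
                "split_idx": split_idx,
                "instruction": instruction,
                "answer": answer,
            }
        )

# Passing `columns` fixes the column order even if `rows` ends up empty.
df_toy = pd.DataFrame(
    rows, columns=["patientid", "split_idx", "instruction", "answer"]
)
print(df_toy.shape)
```

Building the frame once from a list is usually cheaper than appending to a DataFrame inside the loop.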

In [None]:
# BONUS: Implement the multi-patient dataset generator


def generate_training_dataset(dm, data_splitter, converter):
    """
    Generate training examples for all patients.

    Returns:
        pd.DataFrame with columns: patientid, split_idx, instruction, answer
    """
    # TODO: Implement this function
    pass


# Test your function
# df_training = generate_training_dataset(dm, data_splitter, converter)
# print(f"Generated {len(df_training)} training examples")

---
## 🏆 Challenge Complete!

Congratulations on completing Challenge 1! You've learned how to:

- ✅ Explore and understand clinical data formats
- ✅ Configure the TwinWeaver pipeline for different tasks
- ✅ Generate training splits from patient timelines
- ✅ Convert data to instruction-tuning format
- ✅ Reverse convert predictions back to structured data

Ready for the next challenge? Move on to **Challenge 2: End-to-End LLM Fine-tuning!**