# 🔬 Featurization Pipeline Tutorial
Using ClearML and Raven to Dynamically Launch Per-Series Jobs
🕒 *Generated on 2025-07-17*

This notebook walks through configuring, launching, and monitoring a modular featurization pipeline using ClearML, Raven, and an internal CLI tool.

---

### What You'll Learn:
- How to define a multi-step pipeline using a YAML config
- Where to store configs in S3 for remote triggering
- How to run the pipeline using the `featurize` CLI
- How each step is tracked in ClearML and how logs are preserved

### Requirements:
- AWS credentials configured locally (or, at minimum, access to the S3 interface so you can upload the config manually)
- ClearML configured locally
- The `featurize` CLI, installed from source:
    - Clone https://github.com/Picture-Health/raven-features
    - Run `pip install .` from the repository root

## 1 – Define Your Pipeline in YAML

This YAML config defines everything the pipeline needs to run:
- `project_parameters`: general identifiers and owner info
- `autoscaler_parameters`: tells the launcher which queue to use
- `raven_query_parameters`: defines what to fetch from Raven
- `pipeline_steps`: ordered steps, each running in its own repo and queue
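
As a rough illustration of the top-level layout, the sketch below checks a parsed config (represented here as a plain Python dict) for the four sections listed above. The section names come from this tutorial; the helper `missing_sections` and every key and value inside `example_config` are hypothetical placeholders, not the schema the launcher actually enforces.

```python
# The four section names are taken from this tutorial.
REQUIRED_SECTIONS = (
    "project_parameters",
    "autoscaler_parameters",
    "raven_query_parameters",
    "pipeline_steps",
)


def missing_sections(config: dict) -> list[str]:
    """Return the required top-level sections that are absent from a parsed config."""
    return [name for name in REQUIRED_SECTIONS if name not in config]


# Hypothetical parsed YAML; all inner keys and values are placeholders.
example_config = {
    "project_parameters": {"project_id": "demo-project"},
    "autoscaler_parameters": {"queue": "demo-queue"},
    "raven_query_parameters": {"modality": "CT"},
    "pipeline_steps": {},
}

print(missing_sections(example_config))          # []
print(missing_sections({"pipeline_steps": {}}))  # the three other section names
```

A check like this can be run locally before uploading, so an incomplete config is caught before any jobs are queued.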

Each step includes:
- `repo`, `branch`, and `commit` to specify the code to run
- `script` to execute
- `execution_queue` for ClearML
- `parent_steps` to enforce dependency order
- `output_path` (optional) to control folder structure
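
To make the role of `parent_steps` concrete, here is a minimal sketch of how a dependency order could be derived from it with the standard library's `graphlib`. It assumes steps are keyed by name, which this tutorial does not confirm; the step names, repo URLs, scripts, and queue names are placeholders, and the real launcher may resolve dependencies differently (for example, by delegating to ClearML).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical steps keyed by name; every value below is a placeholder.
pipeline_steps = {
    "segment": {
        "repo": "https://example.com/org/segmentation.git",
        "branch": "main",
        "commit": None,
        "script": "run_segmentation.py",
        "execution_queue": "gpu-queue",
        "parent_steps": [],
    },
    "extract": {
        "repo": "https://example.com/org/extraction.git",
        "branch": "main",
        "commit": None,
        "script": "run_extraction.py",
        "execution_queue": "cpu-queue",
        "parent_steps": ["segment"],
        "output_path": "features/",
    },
    "aggregate": {
        "repo": "https://example.com/org/aggregation.git",
        "branch": "main",
        "commit": None,
        "script": "run_aggregation.py",
        "execution_queue": "cpu-queue",
        "parent_steps": ["extract"],
    },
}


def execution_order(steps: dict) -> list[str]:
    """Order step names so that every step comes after all of its parents."""
    graph = {name: step.get("parent_steps", []) for name, step in steps.items()}
    # Reject parents that are not defined as steps themselves.
    unknown = {p for parents in graph.values() for p in parents} - set(graph)
    if unknown:
        raise ValueError(f"parent_steps reference undefined steps: {sorted(unknown)}")
    # static_order() raises graphlib.CycleError if the dependencies form a cycle.
    return list(TopologicalSorter(graph).static_order())


print(execution_order(pipeline_steps))  # ['segment', 'extract', 'aggregate']
```

Running a check like this before upload surfaces typos in `parent_steps` and accidental cycles early.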

## 2 – Upload Your Config to S3

The pipeline launcher reads its config from S3. The cell below uploads your YAML to a known location. Keeping configs under `s3://px-app-bucket/config` is not _required_, but we recommend it to keep things organized.

🔁 You can re-run this cell anytime you make edits to `my_pipeline.yaml`.

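```python
import boto3

# Destination in S3; the key must match the path passed to `featurize --config-uri`.
bucket = "px-app-bucket"
key = "config/my_pipeline.yaml"

# Upload the local YAML file using your configured AWS credentials.
s3 = boto3.client("s3")
s3.upload_file("my_pipeline.yaml", bucket, key)

print(f"✅ Uploaded to s3://{bucket}/{key}")
```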


## 3 – Trigger the Pipeline from CLI

Once the config is uploaded, launch your pipeline with the CLI:

```bash
featurize --config-uri s3://px-app-bucket/config/my_pipeline.yaml
```

This will:
- Query Raven for matching studies
- Launch a ClearML pipeline per series
- Execute each pipeline step in dependency order, on the ClearML queue configured for that step
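
The value passed to `--config-uri` is an ordinary S3 URI, so its bucket and key must match the ones used for the upload. The sketch below shows one way to split such a URI with the standard library; `split_s3_uri` is a hypothetical helper written for illustration, not part of the `featurize` CLI.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split an s3://bucket/key URI into its (bucket, key) parts."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # The path keeps its leading slash; S3 object keys do not start with one.
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = split_s3_uri("s3://px-app-bucket/config/my_pipeline.yaml")
print(bucket)  # px-app-bucket
print(key)     # config/my_pipeline.yaml
```

Comparing these two values against the `bucket` and `key` used in the upload cell is a quick way to catch a mismatched filename before launching.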

## 4 – Navigating Results in ClearML

Once launched, your pipeline will leave two primary trails in ClearML:

---
### 1. Launcher Task (YAML Tracker)

- **Location**:  
  `ClearML UI → Projects → RAVEN-FEATURES → Featurization - {config_name} @ Month DD, YYYY at HH:MM [AM|PM] EDT`

- **Purpose**:  
  This task logs the config used for pipeline execution.

- **Artifacts**:
  - The full `my_pipeline.yaml` used to trigger the run
  - A list of all series selected by the Raven query
  - Status logs for each submitted pipeline run

<img src="images/pipeline_launcher_yaml_artifact.png" alt="YAML Artifact" width="700"/>
<img src="images/pipeline_launcher_logs.png" alt="Launcher Logs" width="700"/>


### 2. Pipeline Runs (Actual Executions)

- **Location**:  
  `ClearML UI → Pipelines → RAVEN-FEATURES → {project_id} → {config_name}-YYYY-MM-DD__HH-MM-SS/`

- **Contents**:  
  Each directory represents a full run of your per-series pipeline:
  - Steps are visualized as a **ClearML DAG**
  - Logs and artifacts are stored per step
  - Failed steps are clearly marked for debugging

- **Tip**: You can click into each step to inspect:
  - Input parameters
  - Downloadable artifacts (e.g., features, masks, logs)
  - Worker queue used
  - Execution time and resource usage
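
When searching the Pipelines view for a particular run, it can help to reconstruct the expected folder name from the launch time. The sketch below follows the `{config_name}-YYYY-MM-DD__HH-MM-SS` pattern shown above. It assumes a 24-hour clock and that `config_name` is the config's file name without its extension; neither is confirmed by this tutorial, and the time zone the launcher uses is also not stated. `run_folder_name` is a hypothetical helper.

```python
from datetime import datetime


def run_folder_name(config_name: str, launched_at: datetime) -> str:
    """Format a run folder name following {config_name}-YYYY-MM-DD__HH-MM-SS."""
    # Assumption: 24-hour clock, zero-padded fields.
    return f"{config_name}-{launched_at.strftime('%Y-%m-%d__%H-%M-%S')}"


print(run_folder_name("my_pipeline", datetime(2025, 7, 17, 14, 30, 5)))
# my_pipeline-2025-07-17__14-30-05
```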


<img src="images/pipeline.png" alt="Pipeline Execution" width="700"/>


Use these views to:
- Track your config lineage
- Audit all runs from a given config
- Debug or rerun failed steps individually