# ACIC 2022 – Standard Workflow (CIDL)

This notebook demonstrates a reproducible end-to-end workflow for working with **ACIC 2022** data using the CIDL module. It walks through:

1. **Setting environment variables** for S3 access (Windows + macOS/Linux)
2. **Loading simulation datasets** from the storage backend, in one of two ways:
   - a random sample of datasets (e.g. 5), or
   - specific indices (e.g. `204`, `500`, `3004`)
3. **Loading matching ground-truth files** based on the simulation indices
4. **Running a model/estimator** on each dataset
5. **Producing a standardized results table** for comparison and reporting



---

#### **Step 1** — Set up S3 environment variables

CIDL reads S3 credentials from these environment variables:

- `UHH_S3_ACCESS`
- `UHH_S3_SECRET`


1. Check whether the environment variables are set:

In [None]:
import os

print("UHH_S3_ACCESS already set? ", bool(os.getenv("UHH_S3_ACCESS")))
print("UHH_S3_SECRET already set? ", bool(os.getenv("UHH_S3_SECRET")))

If both are **True**, continue with **Step 2 - Loading simulation datasets**.

If at least one of them is **False**, set the variables in your terminal. 

2. On Windows, open PowerShell and run the following lines. `setx` stores the values for future processes only, so they are not visible in the window where you ran the command:

        setx UHH_S3_ACCESS "..."
        setx UHH_S3_SECRET "..."

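3. On macOS / Linux, open Terminal and run the following lines. `export` only affects the current shell session and programs started from it; to keep the variables across sessions, add the lines to your shell profile (e.g. `~/.zshrc` or `~/.bashrc`):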

        export UHH_S3_ACCESS="..."
        export UHH_S3_SECRET="..."

4. Restart VS Code / your IDE (and the notebook kernel) so that it picks up the new variables, then run the check from item 1 again.
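
If you would rather have the notebook report exactly which variable is missing, the check from item 1 can be wrapped in a small helper. This is a minimal sketch: `missing_env_vars` and `REQUIRED_VARS` are hypothetical names used for illustration and are not part of CIDL.

```python
import os

# The two variable names CIDL reads, as listed above.
REQUIRED_VARS = ("UHH_S3_ACCESS", "UHH_S3_SECRET")

def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All S3 environment variables are set.")
```

An empty string counts as missing here, which matches the `bool(os.getenv(...))` check used above.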


---

#### **Step 2** — Loading simulation datasets

CIDL loads simulations into a **dictionary** that maps each dataset index to its DataFrame:

* `{simulation_index: pandas.DataFrame}`

Keeping the data in a dict is helpful because the **simulation index stays attached** to each dataset as a stable identifier (useful for truth matching, logging, debugging, and standardized result tables).

There are **3400** simulation datasets available, indexed from **1 to 3400**. You can load them in different ways:

* Load **one** dataset: `load_simulation(index=204)`
* Load **multiple** datasets: `load_simulations(indices=[204, 500, 3004])`
* Load **n random** datasets (reproducible via seed): `load_random_simulations(n=10, seed=123)`
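
To see why a fixed seed makes a random selection repeatable, here is a minimal sketch using Python's standard `random` module. It only illustrates the idea: how `load_random_simulations` draws indices internally is not described here, so the indices produced by this sketch will generally differ from CIDL's. The name `sample_indices` is hypothetical.

```python
import random

N_SIMULATIONS = 3400  # indices run from 1 to 3400

def sample_indices(n, seed):
    """Draw n distinct simulation indices; the same seed gives the same draw."""
    rng = random.Random(seed)  # private generator, leaves the global random state alone
    return sorted(rng.sample(range(1, N_SIMULATIONS + 1), n))

first = sample_indices(n=10, seed=123)
second = sample_indices(n=10, seed=123)
print(first == second)  # True: identical seed, identical indices
```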

Optional: You can restrict random sampling (or load full subsets) via the built-in difficulty tiers: `"very_easy"`, `"easy"`, `"medium"`, `"hard"`, `"very_hard"`. To learn more about how datasets are rated and how this relates to the underlying data-generating processes, see the metadata [documentation on GitHub](https://github.com/JDenzel-UHH/CIDL/tree/main/src/cidl/resources/metadata).


* Random sample from one tier: `load_random_simulations(n=10, seed=123, difficulty="easy")`
* Load **all** datasets of a tier: `load_by_difficulty(difficulty="easy")`

The example below shows (A) loading 5 random datasets without restricting the difficulty level, and (B) loading three specific datasets by index. You can change the numbers freely — just keep indices within **1–3400**.


In [None]:
# Option A: Load 5 randomly selected simulations

import cidl.loaders as loaders

sims = loaders.load_random_simulations(n=5, seed=123)

print(f"Loaded {len(sims)} simulations.")
print("Indices:", sorted(sims.keys()))
print(type(sims), "keys:", list(sims))

Your output should look like this: 

Connected to bucket 'cidl-test' via https://s3-uhh.lzs.uni-hamburg.de:443 [READ-ONLY mode]
Loaded 5 simulations.
Indices: [53, 183, 2015, 2318, 3091]
<class 'dict'> keys: [53, 183, 2015, 2318, 3091]

In [None]:
# Option B: Load specific simulation indices
sims_spec = loaders.load_simulations(indices=[204, 500, 3004])

print(f"Loaded {len(sims_spec)} simulations.")
print("Indices:", sorted(sims_spec.keys()))
print(type(sims_spec), "keys:", list(sims_spec))

Your output should look like this: 

Loaded 3 simulations.
Indices: [204, 500, 3004]
<class 'dict'> keys: [204, 500, 3004]

Take a look at the dataset structure:

In [6]:
sim_204 = sims_spec[204]
sim_204.head()

For example, simulation 204 should look like this:

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>id.patient</th>
      <th>id.practice</th>
      <th>V1</th>
      <th>V2</th>
      <th>V3</th>
      <th>V4</th>
      <th>V5</th>
      <th>year</th>
      <th>Y</th>
      <th>X1</th>
      <th>X2</th>
      <th>X3</th>
      <th>X4</th>
      <th>X5</th>
      <th>X6</th>
      <th>X7</th>
      <th>X8</th>
      <th>X9</th>
      <th>Z</th>
      <th>post</th>
      <th>n.patients</th>
      <th>V1_avg</th>
      <th>V2_avg</th>
      <th>V3_avg</th>
      <th>V4_avg</th>
      <th>V5_A_avg</th>
      <th>V5_B_avg</th>
      <th>V5_C_avg</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>1</td>
      <td>1</td>
      <td>8.353</td>
      <td>6</td>
      <td>1</td>
      <td>0.297</td>
      <td>C</td>
      <td>1</td>
      <td>194.950494</td>
      <td>0</td>
      <td>C</td>
      <td>1</td>
      <td>A</td>
      <td>1</td>
      <td>22.356</td>
      <td>9.205</td>
      <td>0.088</td>
      <td>75.442</td>
      <td>1</td>
      <td>0</td>
      <td>346</td>
      <td>11.304</td>
      <td>3.012</td>
      <td>0.610</td>
      <td>0.267</td>
      <td>0.587</td>
      <td>0.197</td>
      <td>0.217</td>
    </tr>
    <tr>
      <th>1</th>
      <td>1</td>
      <td>1</td>
      <td>8.353</td>
      <td>6</td>
      <td>1</td>
      <td>0.297</td>
      <td>C</td>
      <td>2</td>
      <td>187.553758</td>
      <td>0</td>
      <td>C</td>
      <td>1</td>
      <td>A</td>
      <td>1</td>
      <td>22.356</td>
      <td>9.205</td>
      <td>0.088</td>
      <td>75.442</td>
      <td>1</td>
      <td>0</td>
      <td>372</td>
      <td>11.263</td>
      <td>2.965</td>
      <td>0.594</td>
      <td>0.217</td>
      <td>0.575</td>
      <td>0.210</td>
      <td>0.215</td>
    </tr>
    <tr>
      <th>2</th>
      <td>1</td>
      <td>1</td>
      <td>8.353</td>
      <td>6</td>
      <td>1</td>
      <td>0.297</td>
      <td>C</td>
      <td>3</td>
      <td>123.035480</td>
      <td>0</td>
      <td>C</td>
      <td>1</td>
      <td>A</td>
      <td>1</td>
      <td>22.356</td>
      <td>9.205</td>
      <td>0.088</td>
      <td>75.442</td>
      <td>1</td>
      <td>1</td>
      <td>379</td>
      <td>11.356</td>
      <td>2.955</td>
      <td>0.578</td>
      <td>0.135</td>
      <td>0.580</td>
      <td>0.211</td>
      <td>0.208</td>
    </tr>
    <tr>
      <th>3</th>
      <td>1</td>
      <td>1</td>
      <td>8.353</td>
      <td>6</td>
      <td>1</td>
      <td>0.297</td>
      <td>C</td>
      <td>4</td>
      <td>174.036918</td>
      <td>0</td>
      <td>C</td>
      <td>1</td>
      <td>A</td>
      <td>1</td>
      <td>22.356</td>
      <td>9.205</td>
      <td>0.088</td>
      <td>75.442</td>
      <td>1</td>
      <td>1</td>
      <td>449</td>
      <td>11.212</td>
      <td>2.964</td>
      <td>0.568</td>
      <td>0.083</td>
      <td>0.588</td>
      <td>0.200</td>
      <td>0.212</td>
    </tr>
    <tr>
      <th>4</th>
      <td>2</td>
      <td>1</td>
      <td>8.741</td>
      <td>3</td>
      <td>1</td>
      <td>1.370</td>
      <td>A</td>
      <td>1</td>
      <td>806.079120</td>
      <td>0</td>
      <td>C</td>
      <td>1</td>
      <td>A</td>
      <td>1</td>
      <td>22.356</td>
      <td>9.205</td>
      <td>0.088</td>
      <td>75.442</td>
      <td>1</td>
      <td>0</td>
      <td>346</td>
      <td>11.304</td>
      <td>3.012</td>
      <td>0.610</td>
      <td>0.267</td>
      <td>0.587</td>
      <td>0.197</td>
      <td>0.217</td>
    </tr>
  </tbody>
</table>
</div>



To display a preview of all loaded datasets together with their corresponding key (simulation index), use:

In [None]:
import pandas as pd

with pd.option_context("display.max_columns", None):
    for idx, df in sims_spec.items():
        print(idx)
        display(df.head())

---

#### **Step 3** — Loading matching ground-truth files

Each simulation dataset has a matching ground-truth file that describes the data-generating process and contains the true effects used for evaluation. In this step, we load the ground truth for the simulation indices selected in Step 2, so that estimated effects can be compared against the known truth in a standardized way. We continue with the datasets from Option B (`sims_spec`) and show two ways of accessing the ground-truth data.

In [None]:
import cidl.truth_matcher as truth

sims_spec_truth = truth.truth_for_simulations(simulations=sims_spec)

It is strongly recommended to pass the simulations dictionary (`sims_spec`) when loading the corresponding ground-truth data, since the indices are then taken directly from the datasets you loaded. However, you can also load ground truth by providing explicit indices:

In [None]:
sims_spec_truth_alternative = truth.load_truths(indices=[204, 500, 3004])
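
If you do load ground truth by explicit indices, a quick set comparison can catch a mistyped index before it silently skews the evaluation. This is a minimal sketch with placeholder values: `compare_indices` is a hypothetical helper, and in the notebook you would pass `sims_spec.keys()` and your explicit index list.

```python
def compare_indices(sim_indices, truth_indices):
    """Report indices that appear on one side only."""
    sims, truths = set(sim_indices), set(truth_indices)
    return {
        "missing_truth": sorted(sims - truths),  # simulations without ground truth
        "unused_truth": sorted(truths - sims),   # ground truth without a simulation
    }

# Placeholder values for illustration.
sim_keys = [204, 500, 3004]
explicit_indices = [204, 500, 3001]  # deliberate typo: 3001 instead of 3004

print(compare_indices(sim_keys, explicit_indices))
# {'missing_truth': [3004], 'unused_truth': [3001]}
```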

Take a look at the structure of the truth datasets:

In [None]:
truth_204 = sims_spec_truth.truth[204]
truth_204.head()


<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>dataset.num</th>
      <th>Confounding Strength</th>
      <th>Confounding Source</th>
      <th>Impact Heterogeneity</th>
      <th>Idiosyncrasy of Impacts</th>
      <th>variable</th>
      <th>level</th>
      <th>year</th>
      <th>id.practice</th>
      <th>SATT</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>204</td>
      <td>Strong</td>
      <td>Scenario B</td>
      <td>Large</td>
      <td>Small</td>
      <td>Overall</td>
      <td>&lt;NA&gt;</td>
      <td>&lt;NA&gt;</td>
      <td>&lt;NA&gt;</td>
      <td>6.409693</td>
    </tr>
    <tr>
      <th>1</th>
      <td>204</td>
      <td>Strong</td>
      <td>Scenario B</td>
      <td>Large</td>
      <td>Small</td>
      <td>X1</td>
      <td>0</td>
      <td>&lt;NA&gt;</td>
      <td>&lt;NA&gt;</td>
      <td>6.112640</td>
    </tr>
    <tr>
      <th>2</th>
      <td>204</td>
      <td>Strong</td>
      <td>Scenario B</td>
      <td>Large</td>
      <td>Small</td>
      <td>X1</td>
      <td>1</td>
      <td>&lt;NA&gt;</td>
      <td>&lt;NA&gt;</td>
      <td>6.680797</td>
    </tr>
    <tr>
      <th>3</th>
      <td>204</td>
      <td>Strong</td>
      <td>Scenario B</td>
      <td>Large</td>
      <td>Small</td>
      <td>X2</td>
      <td>A</td>
      <td>&lt;NA&gt;</td>
      <td>&lt;NA&gt;</td>
      <td>6.768705</td>
    </tr>
    <tr>
      <th>4</th>
      <td>204</td>
      <td>Strong</td>
      <td>Scenario B</td>
      <td>Large</td>
      <td>Small</td>
      <td>X2</td>
      <td>B</td>
      <td>&lt;NA&gt;</td>
      <td>&lt;NA&gt;</td>
      <td>-25.009869</td>
    </tr>
  </tbody>
</table>
</div>


---

#### **Step 4** —  Running a model/estimator

This part of the workflow is entirely up to you and is not part of the standardized pipeline. Once you have loaded the data as shown above, how you fit and run your model is your own choice.
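
For orientation only, the loop below shows one way to apply an estimator to every dataset while keeping the simulation index attached to each result. The estimator is a deliberately naive difference-in-differences of group means and is not a recommended method. The sketch assumes that `Z` marks treated units and `post` marks the post-treatment years, and it works on lists of row dictionaries (for a DataFrame, `df.to_dict("records")` yields this shape). The function names and the toy rows are hypothetical.

```python
from statistics import mean

def naive_did(rows):
    """Difference-in-differences of mean Y across Z (treated) and post (period)."""
    def cell_mean(z, post):
        return mean(r["Y"] for r in rows if r["Z"] == z and r["post"] == post)
    treated_change = cell_mean(1, 1) - cell_mean(1, 0)
    control_change = cell_mean(0, 1) - cell_mean(0, 0)
    return treated_change - control_change

def run_on_all(sims, estimator):
    """Apply the estimator to each dataset, returning {simulation_index: estimate}."""
    return {idx: estimator(rows) for idx, rows in sims.items()}

# Toy data standing in for loaded simulations (not real ACIC 2022 values).
toy_sims = {
    204: [
        {"Y": 10.0, "Z": 1, "post": 0}, {"Y": 15.0, "Z": 1, "post": 1},
        {"Y": 8.0, "Z": 0, "post": 0}, {"Y": 10.0, "Z": 0, "post": 1},
    ],
}
print(run_on_all(toy_sims, naive_did))  # {204: 3.0}
```

Returning a dictionary keyed by simulation index mirrors the structure used for loading, which makes it straightforward to line results up with the ground truth later.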

---

#### **Step 5** — Producing a standardized results table

To use the automated result evaluation (CI coverage and RMSE), make sure your results follow this structure:

**Columns**

| column        | type        | description                                                  |
|---------------|-------------|--------------------------------------------------------------|
| `dataset.num` | int         | Simulation index (`1` … `3400`)                              |
| `variable`    | string      | `Overall`, `X1` … `X5`, or `NA` for practice-level rows      |
| `level`       | string      | Subgroup level (`0/1` or `A/B/C`), or `NA` if not applicable |
| `year`        | int         | `3`, `4`, or `NA` if not applicable                          |
| `id.practice` | int         | `1..500`, or `NA` if not applicable                          |
| `satt`        | float       | Point estimate                                               |
| `lower90`     | float       | Lower bound of 90% interval                                  |
| `upper90`     | float       | Upper bound of 90% interval                                  |


**Example**

| dataset.num | variable | level | year | id.practice | satt | lower90 | upper90 |
|-------------|----------|-------|------|-------------|------|--------|--------|
| 1 | Overall | NA | NA | NA | `result` | `result` | `result` |
| 1 | Overall | NA | 3  | NA | `result` | `result` | `result` |
| 1 | Overall | NA | 4  | NA | `result` | `result` | `result` |
| 1 | X1 | 0 | NA | NA | `result` | `result` | `result` |
| 1 | X1 | 1 | NA | NA | `result` | `result` | `result` |
| 1 | X2 | A | NA | NA | `result` | `result` | `result` |
| 1 | X2 | B | NA | NA | `result` | `result` | `result` |
| 1 | X2 | C | NA | NA | `result` | `result` | `result` |
| 1 | X3 | 0 | NA | NA | `result` | `result` | `result` |
| 1 | X3 | 1 | NA | NA | `result` | `result` | `result` |
| 1 | X4 | A | NA | NA | `result` | `result` | `result` |
| 1 | X4 | B | NA | NA | `result` | `result` | `result` |
| 1 | X4 | C | NA | NA | `result` | `result` | `result` |
| 1 | X5 | 0 | NA | NA | `result` | `result` | `result` |
| 1 | X5 | 1 | NA | NA | `result` | `result` | `result` |
| 1 | NA | NA | NA | 1   | `result` | `result` | `result` |
| 1 | NA | NA | NA | ... | `result` | `result` | `result` |
| 1 | NA | NA | NA | 500 | `result` | `result` | `result` |
| ... | ... | ... | ... | ... | `result` | `result` | `result` |
| 3400 | NA | NA | NA | 500 | `result` | `result` | `result` |
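
The row pattern in the example can be generated programmatically so that no estimand is forgotten. The sketch below builds the identifying columns for one dataset, following the rows shown above: the overall effect, the overall effect in years 3 and 4, each level of `X1` to `X5`, and one row per practice. It assumes every dataset has practices `1..500` as in the example; `skeleton_rows` is a hypothetical helper, and `None` stands in for `NA`.

```python
# Subgroup variables and their levels, as listed in the example table.
SUBGROUP_LEVELS = {
    "X1": ["0", "1"], "X2": ["A", "B", "C"], "X3": ["0", "1"],
    "X4": ["A", "B", "C"], "X5": ["0", "1"],
}

def skeleton_rows(dataset_num, n_practices=500):
    """Return the identifying columns of every expected result row for one dataset."""
    def row(variable=None, level=None, year=None, practice=None):
        return {"dataset.num": dataset_num, "variable": variable, "level": level,
                "year": year, "id.practice": practice}
    rows = [row("Overall")]
    rows += [row("Overall", year=y) for y in (3, 4)]
    rows += [row(var, level=lvl) for var, levels in SUBGROUP_LEVELS.items() for lvl in levels]
    rows += [row(practice=p) for p in range(1, n_practices + 1)]
    return rows

rows = skeleton_rows(1)
print(len(rows))  # 515 = 1 overall + 2 yearly + 12 subgroup + 500 practice rows
```

Passing the list to `pandas.DataFrame(rows)` gives a table to which the `satt`, `lower90`, and `upper90` columns can then be added.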

TBD
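
For orientation, the two metrics named at the start of this step, RMSE and CI coverage, can be computed as follows. This is a sketch on made-up numbers: the function names are hypothetical, and CIDL's own evaluation may be implemented differently.

```python
import math

def rmse(estimates, truths):
    """Root mean squared error between point estimates and true effects."""
    squared = [(e - t) ** 2 for e, t in zip(estimates, truths)]
    return math.sqrt(sum(squared) / len(squared))

def coverage(lowers, uppers, truths):
    """Share of intervals [lower, upper] that contain the true effect."""
    hits = [lo <= t <= hi for lo, hi, t in zip(lowers, uppers, truths)]
    return sum(hits) / len(hits)

# Toy numbers for illustration only, not real ACIC 2022 results.
truths = [6.0, 7.0, -25.0]
estimates = [5.0, 7.0, -23.0]
lowers = [4.0, 6.5, -24.0]
uppers = [6.5, 7.5, -22.0]

print(rmse(estimates, truths))           # sqrt(5/3), roughly 1.29
print(coverage(lowers, uppers, truths))  # 2 of 3 intervals contain the truth
```

With 90% intervals, a well-calibrated method would be expected to reach a coverage close to 0.9 across many estimands.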