# Introduction

This notebook is designed to guide you through the key functionalities of the library, providing hands-on examples and explanations.

First, import the necessary packages.

In [1]:
from tum.dag import *
from tum.config import PROJECT_DIR

EXAMPLE_DIR = PROJECT_DIR / "docs/examples"

Choose the example (folder name) you want to run.

In [2]:
cwd = EXAMPLE_DIR / "copper"

### Defining the DAG for Data Extraction

Our data extraction workflow is structured as a *Directed Acyclic Graph* (DAG). Each node in the DAG, referred to as an *actor*, represents a specific step in the process, while the edges define the data flow between these steps. The workflow is composed of four primary steps:

1. **(id=table)**: Read and normalize the table.
2. **(id=sem_label)**: Predict semantic types.
3. **(id=sem_model)**: Create a semantic description.
4. **(id=export)**: Export the data.

#### Controlling Workflow Execution

To control which steps to execute, set the `output` argument to the desired step's ID. Only the specified step and its ancestor steps will be executed. For example:
- To test the table normalization step, set the output to `table`.
- Once the table normalization step is verified, set the output to `sem_model` or `export` to run the complete workflow.
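
The ancestor rule can be illustrated with a small, self-contained sketch. It does not use the library: the `PARENTS` mapping and the `steps_to_run` helper are hypothetical names, and the linear chain of dependencies is an assumption made for illustration, not a description of the library's internals.

```python
# Hypothetical stand-in for the workflow: each step maps to the steps it depends on.
# The linear chain is an assumption for illustration only.
PARENTS = {
    "table": [],
    "sem_label": ["table"],
    "sem_model": ["sem_label"],
    "export": ["sem_model"],
}


def steps_to_run(output: str, parents: dict[str, list[str]]) -> list[str]:
    """Return the requested step plus all of its ancestors, dependencies first."""
    ordered: list[str] = []
    seen: set[str] = set()

    def visit(step: str) -> None:
        if step in seen:
            return
        seen.add(step)
        for parent in parents[step]:
            visit(parent)
        ordered.append(step)  # appended only after all of its parents

    visit(output)
    return ordered


print(steps_to_run("table", PARENTS))   # ['table']
print(steps_to_run("export", PARENTS))  # ['table', 'sem_label', 'sem_model', 'export']
```

Requesting `table` touches only the first step, while requesting `export` pulls in every step before it.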

#### Semantic Model Integration

In the `sem_model` step:
- The predictions from the `sem_label` step are used to predict the semantic description.
- The example is uploaded to SAND for semantic description curation.

After curating the semantic description in SAND:
- Run the DAG again with the `export` step to export the data.
- The `export` step uses the curated semantic description to generate the final output, which can be imported into MinMod.

In [3]:
dag = get_dag(
    cwd,
    # pipeline for reading a table, normalizing it, and writing it to a file
    # this is a required step
    table=[
        PartialFn(read_table_from_file),
        PartialFn(select_table, idx=0),
        PartialFn(table_range_select, start_row=3, end_row=2306, end_col="BO"),
        NormTableActor(NormTableArgs()),
        PartialFn(
            matrix_to_relational_table,
            drop_cols=list(range(27, 33)) + list(range(55, 60)),
            horizontal_props=[
                {"row": 0, "col": (6, 27)},
                {"row": 1, "col": (6, 27)},
                {"row": 0, "col": (34, 55)},
                {"row": 1, "col": (34, 55)},
            ],
        ),
        PartialFn(to_column_based_table, num_header_rows=2),
        PartialFn(write_table_to_file, outdir=cwd / "output", format="csv"),
    ],
    # predict semantic types for each column
    sem_label=Flow(
        source="table",
        target=GppSemLabelActor(
            GppSemLabelArgs(
                # you can try different models here
                # DSL
                model="tum.sm.dsl.dsl_sem_label.DSLSemLabelModel",
                model_args={
                    "model": "logistic-regression",
                    "ontology_factory": "tum.dag.get_ontology",
                    "data_dir": PROJECT_DIR / "data/minmod/mos-v3",
                },
                data="tum.sm.dsl.dsl_sem_label.get_dataset",
                data_args={},
                # OpenAI
                # model="tum.sm.llm.openai_sem_label.OpenAISemLabel",
                # model_args={
                #     "model": "gpt-4.1",
                #     "api_key": "YOUR_OPENAI_KEY",
                #     "max_sampled_rows": 20,
                # },
                # data="tum.sm.llm.openai_sem_label.get_dataset",
                # data_args={},
            )
        ),
    ),
    # if provided, sand_endpoint will be used to upload table and its predicted semantic
    # description to SAND for manual curation
    sand_endpoint="http://localhost:5524",
)

2025-05-01 12:14:48.753 | INFO     | libactor.storage._global_storage:init:41 - GlobalStorage: /Volumes/research/workspace/projects/darpa-criticalmaas/ta2-table-understanding/docs/examples/copper/storage


create dag: 0.016 seconds


#### Running the DAG

To execute the DAG, use the `dag.process` function. The `input` argument specifies parameters for any actor in the DAG. It is a mapping from actor IDs to their respective parameters. Since an actor can accept multiple parameters, the values in the mapping should be tuples.
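
Because the values must be tuples, it helps to check how a tuple built from a glob behaves. The sketch below uses only the standard library; the temporary directory and the file names are made up for illustration. It shows that `tuple(folder.glob(...))` yields one element per matching file, and that a single value needs a trailing comma to form a tuple.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # Hypothetical example files; only the .xlsx file should match the pattern.
    (folder / "deposits.xlsx").touch()
    (folder / "notes.txt").touch()

    params = tuple(folder.glob("*.xlsx"))
    single = (folder / "deposits.xlsx",)      # trailing comma makes a 1-tuple
    not_a_tuple = (folder / "deposits.xlsx")  # parentheses alone do not

print(len(params))                     # 1
print(params == single)                # True
print(isinstance(not_a_tuple, tuple))  # False
```

If a folder held several matching files, the tuple would grow accordingly, so it is worth confirming that the pattern matches exactly the files you intend to pass.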

The `context` argument lets you define global parameters that can be assigned to any actor by name.

To capture the output of the DAG, use the `output` argument, which is a set of actor IDs. The return value of `dag.process` is a dictionary where the keys are the actor IDs specified in the `output` argument, and the values are lists of items. Each item in the list corresponds to a result from an invocation of the actor. The values are always lists because actors can be invoked multiple times.
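
Since every value in the returned dictionary is a list, code that reads a result usually indexes into it. A small helper can make the empty-list case explicit. The sketch below is independent of the library: `first_result` is a hypothetical helper, and `fake_output` only mimics the shape described above using placeholder strings.

```python
from typing import Any


def first_result(output: dict[str, list[Any]], actor_id: str) -> Any:
    """Return the first item produced by an actor, with clearer errors than bare indexing."""
    if actor_id not in output:
        raise KeyError(f"actor {actor_id!r} was not requested in `output`")
    items = output[actor_id]
    if not items:
        raise ValueError(f"actor {actor_id!r} produced no results")
    return items[0]


# Placeholder data with the same shape: actor ID -> list of results.
fake_output = {"table": ["normalized table"], "sem_model": ["semantic model"]}
print(first_result(fake_output, "sem_model"))  # semantic model
```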

In [4]:
output = dag.process(
    input={"table": tuple(cwd.glob("*.xlsx"))},
    output={"table", "sem_model"},
    context=get_context(cwd),
)

2025-05-01 12:14:48.799 | DEBUG    | timer:watch_and_report:74 - deserialize: 0.000 seconds


get context: 0.016 seconds


2025-05-01 12:14:54.400 | DEBUG    | timer:watch_and_report:74 - GppSemLabel.predict deserialize: 0.000 seconds


Edit the semantic model in SAND at this URL: http://localhost:5524/tables/2


In [5]:
output["sem_model"][0].value.print(env="notebook")

HTML(value='<pre>\n00.\t<span style="background: #b7eb8f; color: black; padding: 2px; border-radius: 3px;">[0]…