# Introduction

This notebook walks you through the key features of the library with hands-on examples and explanations.

First, import the necessary packages.

In [9]:
from tum.dag import *
from tum.config import PROJECT_DIR

EXAMPLE_DIR = PROJECT_DIR / "docs/examples"

Choose the example (folder name) you want to run.

In [None]:
cwd = EXAMPLE_DIR / "copper"

### Defining the DAG for Data Extraction

The workflow for extracting data from the table consists of the following four steps:

1. **(id=table)**: Read and normalize the table.
2. **(id=sem_label)**: Predict semantic types.
3. **(id=sem_model)**: Create a semantic description.
4. **(id=export)**: Export the data.

#### Controlling Workflow Execution

To control which steps to execute, set the `output` argument to the desired step's ID. Only the specified step and its ancestor steps will be executed. For example:
- To test the table normalization step, set the output to `table`.
- Once the table normalization step is verified, set the output to `sem_model` to continue through the semantic description step, or to `export` to run the complete workflow.
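
The rule above (run the requested step plus its ancestors) can be sketched in plain Python. This is an illustration only, not the library's implementation: the `PARENTS` mapping and the `steps_to_run` helper are hypothetical, and the linear dependency chain is an assumption based on the step order listed earlier.

```python
from collections import deque

# Hypothetical dependency edges (step -> steps it depends on).
# The real DAG is built by get_dag; this chain is an assumption.
PARENTS = {
    "table": [],
    "sem_label": ["table"],
    "sem_model": ["sem_label"],
    "export": ["sem_model"],
}


def steps_to_run(outputs, parents=PARENTS):
    """Return the requested steps together with all of their ancestors."""
    selected = set()
    queue = deque(outputs)
    while queue:
        step = queue.popleft()
        if step in selected:
            continue  # already visited via another path
        selected.add(step)
        queue.extend(parents[step])  # walk upstream toward the inputs
    return selected


print(sorted(steps_to_run({"table"})))      # only the table step
print(sorted(steps_to_run({"sem_model"})))  # table, sem_label, sem_model
```

Under these assumed edges, requesting `export` selects all four steps, which is why it runs the complete workflow.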

#### Semantic Model Integration

In the `sem_model` step:
- The output of the `sem_label` step is used to predict the semantic description.
- The example is uploaded to SAND for semantic description curation.

After curating the semantic description in SAND:
- Run the DAG again with the `export` step to export the data.
- The `export` step uses the curated semantic description to generate the final output, which can be imported into MinMod.

In [None]:
dag = get_dag(
    cwd,
    # pipeline for reading a table, normalizing it, and writing it to a file
    # this is a required step
    table=[
        PartialFn(read_table_from_file),
        PartialFn(select_table, idx=0),
        PartialFn(table_range_select, start_row=3, end_row=2306, end_col="BO"),
        NormTableActor(NormTableArgs()),
        PartialFn(
            matrix_to_relational_table,
            drop_cols=list(range(27, 33)) + list(range(55, 60)),
            horizontal_props=[
                {"row": 0, "col": (6, 27)},
                {"row": 1, "col": (6, 27)},
                {"row": 0, "col": (34, 55)},
                {"row": 1, "col": (34, 55)},
            ],
        ),
        PartialFn(to_column_based_table, num_header_rows=2),
        PartialFn(write_table_to_file, outdir=cwd / "output", format="csv"),
    ],
    # predict semantic types for each column
    sem_label=Flow(
        source="table",
        target=GppSemLabelActor(
            GppSemLabelArgs(
                model="tum.sm.dsl.dsl_sem_label.DSLSemLabelModel",
                model_args={
                    "model": "logistic-regression",
                    "ontology_factory": "tum.dag.get_ontology",
                    "data_dir": PROJECT_DIR / "data/minmod/mos-v3",
                },
                data="tum.sm.dsl.dsl_sem_label.get_dataset",
                data_args={},
            )
        ),
    ),
    # if provided, sand_endpoint will be used to upload table and its predicted semantic
    # description to SAND for manual curation
    sand_endpoint="http://localhost:5524",
)

2025-04-30 22:15:25.359 | INFO     | libactor.storage._global_storage:init:41 - GlobalStorage: /Volumes/research/workspace/projects/darpa-criticalmaas/ta2-table-understanding/docs/examples/copper/storage


create dag: 0.005 seconds
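
The `table` pipeline above mixes a spreadsheet column letter (`end_col="BO"`) with integer column indices (`drop_cols`, `horizontal_props`). The sketch below shows how a column letter maps to an integer position, which can help when cross-checking those values against the spreadsheet. The helper `column_letter_to_index` is hypothetical and not part of the library. The source does not say whether the library counts columns from 0 or whether range ends are inclusive, so the 0-based convention here is an assumption.

```python
def column_letter_to_index(letters: str) -> int:
    """Convert a spreadsheet column label (A, ..., Z, AA, ...) to a 0-based index.

    Hypothetical helper for illustration; 0-based indexing is an assumption.
    """
    if not letters:
        raise ValueError("column label must not be empty")
    index = 0
    for ch in letters.upper():
        if not "A" <= ch <= "Z":
            raise ValueError(f"invalid column label: {letters!r}")
        # Column labels are a bijective base-26 number with digits A=1 .. Z=26.
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index - 1  # shift from 1-based to 0-based


# Position of the last selected column in the example.
print(column_letter_to_index("BO"))

# The column indices dropped in the example, expanded for inspection.
drop_cols = list(range(27, 33)) + list(range(55, 60))
print(drop_cols)
```

Expanding `drop_cols` this way makes it easier to confirm by eye that the two dropped blocks line up with the intended spreadsheet columns.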


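#### Running the DAG

Run the DAG on every `.xlsx` file in the example folder, requesting the `table`, `sem_model`, and `export` outputs.
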
In [None]:
output = dag.process(
    input={"table": tuple(cwd.glob("*.xlsx"))},
    output={"table", "sem_model", "export"},
    context=get_context(cwd),
)

2025-04-30 22:17:08.155 | DEBUG    | timer:watch_and_report:74 - deserialize: 0.000 seconds


get context: 0.040 seconds


2025-04-30 22:17:13.910 | DEBUG    | timer:watch_and_report:74 - GppSemLabel.predict deserialize: 0.000 seconds


Edit the semantic model in SAND at this URL: http://localhost:5524/table/3


Failed to convert Literal lexical form to value. Datatype=http://www.w3.org/2001/XMLSchema#float, Converter=<class 'float'>
Traceback (most recent call last):
  File "/Volumes/research/workspace/projects/darpa-criticalmaas/ta2-table-understanding/.venv/lib/python3.12/site-packages/rdflib/term.py", line 2163, in _castLexicalToPython
    return conv_func(lexical)  # type: ignore[arg-type]
           ^^^^^^^^^^^^^^^^^^
ValueError: could not convert string to float: 'some'

  0%|          | 0/8342 [00:00<?, ?it/s]