# Step 1: Manually externalize bulk data using a BulkDataRowProcessor

In this first step, we externalize bulk data manually using a BulkDataRowProcessor.
This is exactly what happens inside a TableWriter by default: bulk-data paths
are identified, and sibling paths with the "_binary_property_url" suffix are
added, with contents of the form `<base>/<key><chunk-num>.raw:<start-offset>-<length>`.

The original key is set to None once the data has been written to disk.
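
The mechanism described above can be sketched without tlc. This is a minimal illustration, not the library's implementation: `externalize_row` is a hypothetical helper, the chunk file name is arbitrary, and the reference string only mimics the `<file>:<start-offset>-<length>` form given above.

```python
import tempfile
from pathlib import Path

import numpy as np

URL_SUFFIX = "_binary_property_url"  # suffix named in the text above


def externalize_row(row, bulk_keys, chunk_path):
    """Hypothetical helper: append each bulk array to chunk_path and
    replace it in the row with None plus a sibling reference string."""
    chunk_path = Path(chunk_path)
    out = dict(row)  # shallow copy so the caller's row is left untouched
    # Start at the current end of the chunk file (0 if it does not exist yet)
    offset = chunk_path.stat().st_size if chunk_path.exists() else 0
    with chunk_path.open("ab") as f:
        for key in bulk_keys:
            data = np.ascontiguousarray(out[key]).tobytes()
            f.write(data)
            out[key] = None
            out[key + URL_SUFFIX] = f"{chunk_path.as_posix()}:{offset}-{len(data)}"
            offset += len(data)
    return out


row = {"vertices_2d": np.arange(8, dtype=np.float32), "label": 3}
with tempfile.TemporaryDirectory() as tmp:
    result = externalize_row(row, ["vertices_2d"], Path(tmp) / "0.raw")

print(result)
```

Eight float32 values occupy 32 bytes, so the reference for the first array written to a fresh chunk file ends in `:0-32`. Non-bulk keys such as `label` pass through unchanged.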

Example:

```
input_row = {
    "vertices_2d": np.array([0, 0, 1, 1, 2, 2, 3, 3]) # 4 2d points
}

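# Illustrative call only; the actual constructor arguments are shown in the cells below
BulkDataRowProcessor(bulk_data_url="C:/Temp").process_row(input_row)

# Resulting row:
{
    "vertices_2d": None,
    "vertices_2d_binary_property_url": "C:/Temp/chunk-0.raw:0-32"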
}
```

> Note:
> This only works in practice for columns whose schemas derive from Geometry2DSchema or Geometry3DSchema, so the example above is for reference only.

In [None]:
import tlc
from pathlib import Path
from data_sources import random_array_generator
from tlc.core.helpers.bulk_data_helper import BulkDataRowProcessor
import numpy as np

Set up a schema for a column containing 2D points and an intensity value per point

In [None]:
schema = tlc.Geometry2DSchema(
    include_2d_vertices=True,
    per_vertex_schemas={
        "intensity": tlc.Float32ListSchema()
    },
    is_bulk_data=True,  # This is what sets up the "sibling" paths with the "_binary_property_url" suffix
)

In [None]:
bulk_data_paths = tlc.SchemaHelper.get_bulk_data_values(schema)
bulk_data_paths

Define a BulkDataRowProcessor configured for writing bulk data to a local path

In [None]:
local_bulk_data_folder = Path("bulk_data/1").absolute()
local_bulk_data_folder.mkdir(parents=True, exist_ok=True)

bulk_data_processor = BulkDataRowProcessor(
    table_url=None, paths=bulk_data_paths, bulk_data_url=tlc.Url(local_bulk_data_folder.as_posix())
)

Helpers for generating random arrays of given shapes

In [None]:
points_2d_generator = random_array_generator((4, 2))  # generates 4 2d points at a time
intensity_generator = random_array_generator((4,), dtype=np.float32)

In [None]:
next(points_2d_generator)

In [None]:
next(intensity_generator)
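
The `data_sources` module is not shown in this notebook. For readers without it, a generator with the same call shape might look like the sketch below. The name `random_array_generator_sketch`, the `seed` parameter, and the choice of uniform values in [0, 1) are all assumptions, not the actual helper.

```python
import numpy as np


def random_array_generator_sketch(shape, dtype=np.float64, seed=0):
    """Hypothetical stand-in: yield an endless stream of random arrays
    with the requested shape and dtype."""
    rng = np.random.default_rng(seed)
    while True:
        # Uniform samples in [0, 1), cast to the requested dtype
        yield rng.random(shape).astype(dtype)


points = next(random_array_generator_sketch((4, 2)))  # 4 2D points
intensity = next(random_array_generator_sketch((4,), dtype=np.float32))

print(points.shape, intensity.dtype)
```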

Create a single row value using the Geometry2DInstances helper dataclass

In [None]:
geo = tlc.Geometry2DInstances.create_empty(0, 0, 1, 1, per_vertex_extras_keys=["intensity"])
geo.add_instance(vertices=next(points_2d_generator), per_vertex_extras={"intensity": next(intensity_generator)})

In [None]:
geo.to_row()

This is where the externalization happens: process_row writes the bulk data to disk and returns the updated row.

In [None]:
processed_row = bulk_data_processor.process_row(geo.to_row())
processed_row


In [None]:
bulk_data_processor.close_all()  # Ensure files are closed

The input row has been visited recursively. The values at the bulk-data paths
("instances.vertices_2d", "instances.vertices_2d_additional_data.intensity")
have been written to disk and set to None, and sibling binary property URLs
pointing to the written files and offsets have been added to the row.
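
The effect on a nested row can be pictured with plain dictionaries. This sketch assumes the row is a nested dict addressed by the dotted paths quoted above; `replace_at_path` is a hypothetical helper, and the reference strings are placeholders rather than real output.

```python
URL_SUFFIX = "_binary_property_url"  # suffix named earlier in this notebook


def replace_at_path(row, dotted_path, url):
    """Hypothetical helper: set the value at dotted_path to None and add a
    sibling key carrying the reference. Modifies the row in place."""
    *parents, leaf = dotted_path.split(".")
    node = row
    for key in parents:
        node = node[key]  # descend one nesting level per path segment
    node[leaf] = None
    node[leaf + URL_SUFFIX] = url
    return row


# Assumed layout, derived only from the two paths named in the text
row = {
    "instances": {
        "vertices_2d": [[0.0, 0.0, 1.0, 1.0]],
        "vertices_2d_additional_data": {"intensity": [[0.5, 0.7]]},
    }
}

# Placeholder references, not values produced by tlc
replace_at_path(row, "instances.vertices_2d", "bulk_data/1/0.raw:0-16")
replace_at_path(
    row,
    "instances.vertices_2d_additional_data.intensity",
    "bulk_data/1/0.raw:16-8",
)

print(row)
```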

# Now write the pre-externalized data to a Table

In [None]:
table_writer = tlc.TableWriter(
    table_name="pre-externalized-table",
    dataset_name="pre-externalized-dataset",
    project_name="pre-externalized-project",
    description="Pre-externalized table",
    column_schemas={"vertices": schema},  # We use the same schema as before
    if_exists="rename",
)

In [None]:
# Calling add_row with an already-processed row triggers no further externalization
# (BulkDataRowProcessor.process_row is idempotent)
table_writer.add_row({"vertices": processed_row})
table = table_writer.finalize()

In [None]:
table

Access the table

In [None]:
table[0]["vertices"]  # The bulk-data fields in the table contain only None / empty values

Use a BulkDataAccessor if access to the underlying arrays is required in Python

In [None]:
from tlc.core.helpers.bulk_data_helper import BulkDataAccessor
accessor = BulkDataAccessor(table)
row = accessor[0]
row
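
To make the reference format concrete, the sketch below resolves a `<file>:<start-offset>-<length>` string by hand with NumPy. It is not how BulkDataAccessor is implemented: `read_reference` is a hypothetical helper, and it assumes the dtype is known separately (for example from the schema), since the reference string itself carries only a byte range.

```python
import tempfile
from pathlib import Path

import numpy as np


def read_reference(url, dtype):
    """Hypothetical helper: load the byte range named by
    '<file>:<start-offset>-<length>' as a flat array of the given dtype."""
    # Split on the last colon so Windows drive letters (C:/...) stay intact
    file_part, _, byte_range = url.rpartition(":")
    start, length = (int(v) for v in byte_range.split("-"))
    count = length // np.dtype(dtype).itemsize
    return np.fromfile(file_part, dtype=dtype, count=count, offset=start)


with tempfile.TemporaryDirectory() as tmp:
    chunk = Path(tmp) / "0.raw"
    first = np.arange(4, dtype=np.float32)      # bytes 0..15
    second = np.arange(4, 8, dtype=np.float32)  # bytes 16..31
    chunk.write_bytes(first.tobytes() + second.tobytes())

    restored = read_reference(f"{chunk.as_posix()}:16-16", np.float32)

print(restored)
```

The result is a flat array; reshaping it (for example to `(n, 2)` for 2D points) would again rely on information from the schema.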
