Hamilton's driver allows for ad-hoc materialization. This enables you to take a DAG you already have and save your data to a set of custom locations/URLs.
Note that these materializers are isomorphic in nature to the `@save_to` decorator. Materializers inject the additional node at runtime, modifying the DAG to include a data saver node, and return the metadata around materialization.
This framework is meant to be highly pluggable. While the set of available data savers is currently limited, we expect folks to build their own materializers (and, hopefully, contribute them back to the community!).
In this example we take the scikit-learn iris_loader pipeline, and materialize outputs to specific locations through a driver call. We demonstrate:
- Saving model parameters to a JSON file (using the default JSON materializer)
- Writing custom data adapters for:
  - Pickling a model to an object file
  - Saving confusion matrices to a CSV file
See `run.py` for the full example.
In this example we only pass literal values to the materializers. That said, you can use both `source` (to specify the source from an upstream node) and `value` (which is the default) to specify literals.
This will be a high-level overview; for more details, see the documentation.

`driver.materialize()` does the following:
- Processes a list of materializers to create a new DAG
- Alters the output to include the materializer nodes
- Processes a list of "additional variables" (for debugging) to return intermediate data
- Executes the DAG, including the materializers
- Returns a tuple of (materialization metadata, additional variables)
Materializers each consume:
- A `dependencies` list of nodes to materialize
- An optional `combine` parameter to combine the outputs of the dependencies (this is required if there are multiple dependencies). This is a `ResultMixin` object.
- An `id` parameter to identify the materializer, which serves as the node name in the DAG
Materializers are referenced by the `to` object in `hamilton.io.materialization`, which utilizes dynamic dispatch to create the appropriate materializer. These refer to `DataSaver` classes, which are keyed by a string (e.g. `csv`). Multiple data adapters can share the same key, each of which applies to a specific type (e.g. pandas DataFrame, numpy matrix, polars DataFrame). New data adapters are registered by calling `hamilton.registry.register_adapter`.
To define a custom materializer, all you have to do is implement the `DataSaver` class (which will allow its use in `save_to` as well). This is demonstrated in `custom_materializers.py`.
`driver.materialize` is an ad-hoc form of `save_to`. Use it when you're developing and want to do ad-hoc materialization. When you have a production ETL, you can choose between `save_to` and `materialize`. If the save location/structure is unlikely to change, you might consider using `save_to`. Otherwise, `materialize` is an idiomatic way of conducting materialization operations that cleanly separates side effects from transformations.