A small, modular data catalog.

pip install smallcat

Local catalogs can be kept in YAML files:

entries:
  foo:
    file_format: csv
    connection:
      conn_type: fs
      extra:
        base_path: /tmp/smallcat-example/
    location: foo.csv
    load_options:
      header: true
  bar:
    file_format: parquet
    connection:
      conn_type: google_cloud_platform
      extra:
        bucket: my-bucket
    location: bar.csv
    save_options:
      partition_by:
        - year
        - month

A catalog loaded from that file can save and load DataFrames by entry name:

from smallcat import Catalog
catalog = Catalog.from_path("catalog.yaml")
catalog.save_pandas("foo", df)
df2 = catalog.load_pandas("foo")

`load_pandas` and the lower-level Arrow loaders accept an optional `where`
SQL predicate, which is pushed down to DuckDB/Arrow as a filter when reading:

df = catalog.load_pandas("bar", where="event_date >= '2024-01-01'")

A catalog can also be loaded from an Airflow Variable:

from smallcat import Catalog
catalog = Catalog.from_airflow_variable("example_catalog")
df = catalog.load_pandas("bar")

Read more at the official docs.
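
Because the `where` argument shown earlier is a plain SQL string, it may be worth building it from typed values rather than by hand. The sketch below uses only the standard library; `date_predicate` is a hypothetical helper for illustration and is not part of smallcat.

```python
import re
from datetime import date

# Hypothetical helper (an assumption, not a smallcat API): it only builds
# the SQL string that would be passed as the `where` argument.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def date_predicate(column: str, start: date) -> str:
    """Return a predicate such as "event_date >= '2024-01-01'"."""
    # Accept only plain identifiers so a column name cannot carry extra SQL.
    if not _IDENTIFIER.fullmatch(column):
        raise ValueError(f"not a plain column name: {column!r}")
    # For a plain `date`, isoformat() yields YYYY-MM-DD.
    return f"{column} >= '{start.isoformat()}'"


predicate = date_predicate("event_date", date(2024, 1, 1))
print(predicate)
```

The resulting string could then be passed along as `catalog.load_pandas("bar", where=predicate)`, matching the call shown above.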
