DeepKernelLabs/smallcat

smallcat

A small, modular data catalog.

Install

pip install smallcat

Quickstart

Create Catalog

Local catalogs can be kept in YAML files.

entries:
    foo:
        file_format: csv
        connection:
            conn_type: fs
            extra:
                base_path: /tmp/smallcat-example/
        location: foo.csv
        load_options:
            header: true
    bar:
        file_format: parquet
        connection:
            conn_type: google_cloud_platform
            extra:
                bucket: my-bucket
        location: bar.parquet
        save_options:
            partition_by:
                - year
                - month
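
To make the quickstart reproducible, the catalog file can be written from Python before loading it. This is a minimal sketch using only the standard library; the helper name `write_catalog` and the temporary directory are illustrative assumptions and not part of smallcat. Only the `foo` entry from the example above is included.

```python
from pathlib import Path
import tempfile

# The `foo` entry from the example catalog, kept as plain YAML text.
CATALOG_YAML = """\
entries:
    foo:
        file_format: csv
        connection:
            conn_type: fs
            extra:
                base_path: /tmp/smallcat-example/
        location: foo.csv
        load_options:
            header: true
"""


def write_catalog(directory: Path) -> Path:
    """Write the example catalog into `directory` and return the file path.

    Hypothetical helper for illustration; smallcat does not provide it.
    """
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / "catalog.yaml"
    path.write_text(CATALOG_YAML, encoding="utf-8")
    return path


# A throwaway directory is used here; any writable location would do.
catalog_path = write_catalog(Path(tempfile.mkdtemp()))
print(catalog_path.name)
```

The resulting path can then be passed to `Catalog.from_path` as shown in the next section.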

Standalone

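import pandas as pd

from smallcat import Catalog

# Example data; any DataFrame works here.
df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})

catalog = Catalog.from_path("catalog.yaml")
catalog.save_pandas("foo", df)
df2 = catalog.load_pandas("foo")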

Filter on load

load_pandas (and the lower-level Arrow loaders) accept an optional where argument, a SQL predicate that is pushed down to DuckDB/Arrow so rows are filtered while reading:

df = catalog.load_pandas("bar", where="event_date >= '2024-01-01'")
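
When the cutoff date comes from code rather than a literal, the predicate string can be built first and then passed as `where`. The sketch below assumes only that `where` takes a SQL boolean expression as shown above; the helper `date_at_least` is hypothetical and not part of smallcat.

```python
from datetime import date


def date_at_least(column: str, start: date) -> str:
    """Build a `column >= 'YYYY-MM-DD'` predicate string.

    Hypothetical helper for illustration. The column name is checked to be
    a plain identifier so arbitrary text is not spliced into the SQL.
    """
    if not column.isidentifier():
        raise ValueError(f"not a plain column name: {column!r}")
    # date.isoformat() always yields YYYY-MM-DD, so quoting it is safe.
    return f"{column} >= '{start.isoformat()}'"


where = date_at_least("event_date", date(2024, 1, 1))
print(where)  # event_date >= '2024-01-01'

# With a catalog loaded as in the Standalone section:
# df = catalog.load_pandas("bar", where=where)
```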

With Airflow

from smallcat import Catalog

catalog = Catalog.from_airflow_variable("example_catalog")
df = catalog.load_pandas("bar")

Docs

Read more at the official docs.
