Skip to content

Codenym/ezduckdb

Repository files navigation

ezduckdb

Tools to make working with duckdb easier for codenym.

Not intended for general use, but feel free to steal code or ideas.

If you'd be super excited about this being made for general use, reach out.

Acknowledgements

There's a ton code and ideas in here from a dagster blog post

Installation

pip install ezduckdb

Usage

There are 3 classes in this library:

  • S3AwarePath: pathlib.Path + s3 paths
  • SQL: Work with sql files programatically via templating.
  • DuckDB: Connection and Query manager

S3AwarePath

S3AwarePath adds functionality to the pathlib.Path class.

  • is_s3: Is path an s3 path (ie s3://....)
  • get_s3_bucket and get_s3_prefix: Break path for use with boto3
  • Retain s3:// when cast to string (ie in f strings)
  • get_table_name: Get db table name from file name based on codenym convention
    • <schema>_<table>.<extension>
from ezduckdb import S3AwarePath

s3_path = S3AwarePath("s3://bucket/curated/s1chema_table1.csv")
assert inp.get_s3_bucket() == "bucket"
assert inp.get_s3_prefix() == "curated/s1chema_table1.csv"
assert str(inp) == "s3://bucket/curated/s1chema_table1.csv"
assert inp.is_s3()
assert inp.get_table_name() == ("s1chema", "table1")

SQL

SQL enable type based templating for programatical sql query generation for duckdb.

Non-exhaustive list of replacements:

  • pd.DataFrame is converted to df_<id> in the query to enable pandas querying
  • Str are replaced with the string value enclosed in single quotes
  • Int are replaced with the value without quotes
  • SQL replaces recusively for nested querying

Basic

from ezduckdb import SQL

example = SQL("SELECT * FROM $table WHERE id = $id", table="foo", id=1)
assert inp.to_string() == "SELECT * FROM 'foo' WHERE id = 1"

Pandas

from ezduckdb import SQL
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3]})
inp = SQL("SELECT * FROM $table", table=df)
assert inp.to_string() == "SELECT * FROM df_" + str(id(df))

Nested

from ezduckdb import SQL

example = SQL("SELECT * FROM $table", table=SQL("SELECT * FROM $table", table="foo"))
assert inp.to_string() == "SELECT * FROM (SELECT * FROM 'foo')"

DuckDB

DuckDB is a connection manager for duckdb that has some convenience methods for querying.

  • If s3_storage_used=True then query method will:
    • Load httpfs and aws duckdb extensions
    • call load_aws_credentials passing the aws_profile.
  • query method will:
    • Do all sql templating for SQL object.
    • Return a pd.DataFrame of the results if applicable
  • Provide a context manager for pure sql querying with strings

Templated Querying (Querying with SQL objects)

Basic Querying
from ezduckdb import DuckDB
import pandas as pd

db = DuckDB(s3_storage_used=False)

assert db.query(SQL("select 1")).values == pd.DataFrame([(1,)]).values
Pandas Querying
from ezduckdb import DuckDB
import pandas as pd

db = DuckDB(s3_storage_used=False)
df = pd.DataFrame({"id": [1, 2, 3]})

actual = db.query(SQL("SELECT * FROM $table", table=df))
expected = pd.DataFrame([(1,), (2,), (3,)])
assert (actual.values == expected.values).all()
S3 querying
from ezduckdb import DuckDB
import pandas as pd

db = DuckDB(s3_storage_used=True)
s3_path = "s3://codenym-automated-testing/ezduckdb/parquet/schema1_table1.parquet"

actual = db.query(SQL("SELECT * FROM read_parquet($s3_path)", s3_path=s3_path))
expected = pd.DataFrame([[1, 4], [2, 5], [3, 6]])
assert (actual.values == expected.values).all()

Context Manager (Querying with Strings)

from ezduckdb import DuckDB
import pandas as pd

with DuckDB(s3_storage_used=False) as conn:
    assert conn.query("select 1").df().values == pd.DataFrame([(1,)]).values

About

Tools to make working with duckdb easier in python

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages