Tools to make working with duckdb easier for codenym.
Not intended for general use, but feel free to steal code or ideas.
If you'd be super excited about this being made for general use, reach out.
There's a ton of code and ideas in here from a dagster blog post.
```bash
pip install ezduckdb
```
There are 3 classes in this library:

- `S3AwarePath`: `pathlib.Path` + s3 paths
- `SQL`: Work with sql files programmatically via templating
- `DuckDB`: Connection and query manager
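All three are imported directly from the package, as in the examples below:

```python
from ezduckdb import S3AwarePath, SQL, DuckDB
```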
## S3AwarePath

`S3AwarePath` adds functionality to the `pathlib.Path` class:

- `is_s3`: Is the path an s3 path (ie `s3://...`)?
- `get_s3_bucket` and `get_s3_prefix`: Break the path apart for use with boto3
- Retains the `s3://` prefix when cast to string (ie in f-strings)
- `get_table_name`: Get the db table name from the file name, based on the codenym convention `<schema>_<table>.<extension>`
```python
from ezduckdb import S3AwarePath

inp = S3AwarePath("s3://bucket/curated/s1chema_table1.csv")

assert inp.get_s3_bucket() == "bucket"
assert inp.get_s3_prefix() == "curated/s1chema_table1.csv"
assert str(inp) == "s3://bucket/curated/s1chema_table1.csv"
assert inp.is_s3()
assert inp.get_table_name() == ("s1chema", "table1")
```
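Non-s3 paths should still behave like regular `pathlib` paths. A minimal sketch, assuming a local path reports `is_s3()` as `False` and that `get_table_name` only looks at the file name (the path below is made up):

```python
from ezduckdb import S3AwarePath

# Hypothetical local path; assumes is_s3() is False for non-s3 paths and
# that get_table_name() parses the file name the same way as for s3 paths.
local = S3AwarePath("data/curated/schema1_table1.csv")

assert not local.is_s3()
assert local.get_table_name() == ("schema1", "table1")
```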
## SQL

`SQL` enables type-based templating for programmatic sql query generation for duckdb.

Non-exhaustive list of replacements:

- `pd.DataFrame` is converted to `df_<id>` in the query to enable pandas querying
- `str` values are replaced with the string value enclosed in single quotes
- `int` values are replaced with the value without quotes
- `SQL` values are replaced recursively for nested querying
```python
from ezduckdb import SQL

inp = SQL("SELECT * FROM $table WHERE id = $id", table="foo", id=1)

assert inp.to_string() == "SELECT * FROM 'foo' WHERE id = 1"
```
```python
from ezduckdb import SQL
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3]})
inp = SQL("SELECT * FROM $table", table=df)

assert inp.to_string() == "SELECT * FROM df_" + str(id(df))
```
```python
from ezduckdb import SQL

inp = SQL("SELECT * FROM $table", table=SQL("SELECT * FROM $table", table="foo"))

assert inp.to_string() == "SELECT * FROM (SELECT * FROM 'foo')"
```
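The replacement types can also be mixed in a single template. A minimal sketch, assuming the rendered string follows the replacement rules above exactly (the column names and values are made up):

```python
from ezduckdb import SQL
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# Mix a DataFrame, a str, and an int in one template (hypothetical columns/values).
inp = SQL(
    "SELECT * FROM $table WHERE name = $name AND id > $min_id",
    table=df,
    name="a",
    min_id=1,
)

assert inp.to_string() == (
    "SELECT * FROM df_" + str(id(df)) + " WHERE name = 'a' AND id > 1"
)
```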
## DuckDB

`DuckDB` is a connection manager for duckdb that has some convenience methods for querying.

- If `s3_storage_used=True`, the `query` method will:
  - Load the `httpfs` and `aws` duckdb extensions
  - Call `load_aws_credentials`, passing the `aws_profile`
- The `query` method will:
  - Do all sql templating for `SQL` objects
  - Return a `pd.DataFrame` of the results if applicable
- Provides a context manager for pure sql querying with strings
```python
from ezduckdb import DuckDB, SQL
import pandas as pd

db = DuckDB(s3_storage_used=False)

assert db.query(SQL("select 1")).values == pd.DataFrame([(1,)]).values
```
```python
from ezduckdb import DuckDB, SQL
import pandas as pd

db = DuckDB(s3_storage_used=False)

df = pd.DataFrame({"id": [1, 2, 3]})
actual = db.query(SQL("SELECT * FROM $table", table=df))
expected = pd.DataFrame([(1,), (2,), (3,)])

assert (actual.values == expected.values).all()
```
```python
from ezduckdb import DuckDB, SQL
import pandas as pd

db = DuckDB(s3_storage_used=True)

s3_path = "s3://codenym-automated-testing/ezduckdb/parquet/schema1_table1.parquet"
actual = db.query(SQL("SELECT * FROM read_parquet($s3_path)", s3_path=s3_path))
expected = pd.DataFrame([[1, 4], [2, 5], [3, 6]])

assert (actual.values == expected.values).all()
```
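The AWS profile used for `load_aws_credentials` is presumably supplied when the manager is created. A sketch, assuming `aws_profile` is accepted as a keyword argument of `DuckDB` (the profile name is a placeholder):

```python
from ezduckdb import DuckDB, SQL

# Assumption: aws_profile is a DuckDB constructor argument; "my-profile" is
# a placeholder AWS profile name.
db = DuckDB(s3_storage_used=True, aws_profile="my-profile")

s3_path = "s3://codenym-automated-testing/ezduckdb/parquet/schema1_table1.parquet"
df = db.query(SQL("SELECT * FROM read_parquet($s3_path)", s3_path=s3_path))
```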
```python
from ezduckdb import DuckDB
import pandas as pd

with DuckDB(s3_storage_used=False) as conn:
    assert conn.query("select 1").df().values == pd.DataFrame([(1,)]).values
```
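Because `S3AwarePath` keeps the `s3://` prefix when cast to string, it can be dropped straight into a templated query. A sketch, assuming the bucket and file below exist:

```python
from ezduckdb import DuckDB, S3AwarePath, SQL

# Hypothetical bucket/object; relies on S3AwarePath keeping the "s3://"
# prefix when formatted into the query string.
path = S3AwarePath("s3://bucket/curated/schema1_table1.csv")
db = DuckDB(s3_storage_used=True)

df = db.query(SQL(f"SELECT * FROM read_csv_auto('{path}')"))
```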