DuckDB MIMIC-III concepts implementation #1529
base: main
Conversation
…)...committing before gutting dead code.
…es without a large amount of RAM, and this way it can be skipped with `--skip-indexes`
Some initial thoughts from a scan of the code.
I also think there are a lot of if/else statements littered throughout, but haven't thought about it enough to simplify that yet.
I was hoping to avoid having a separate CREATE TABLE statement for DuckDB. For MIMIC-III it's not a big deal (the tables have been static for ages), so this can be left as is.
For MIMIC-IV, we are continuing to update it, so this results in some inconsistencies, which either need to be checked with a GitHub Action (which is hard) or avoided by having one source of SQL DDL for creating the tables.
I can try putting this through sqlglot; I think I made that SQL script before I started using it. At present, there is a bug (referenced in the comment in that file) that causes creation of the table to fail with the primary key in place (index creation causes OOM)... so until that's fixed, the table creation will need at least that tweak.
It would be nice to have a JSON file used to generate the DDL for Postgres/BigQuery/MySQL/DuckDB, but that also seems like a separate PR from the DuckDB concepts work.
I suppose the index OOM bug is a DuckDB bug?
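The JSON-driven DDL idea could look something like this minimal sketch; the column names, type keys, and per-dialect type maps here are all invented for illustration, not the real MIMIC schema files:

```python
import json

# Hypothetical per-dialect type maps; a real version would cover all types.
TYPE_MAP = {
    "duckdb":   {"int": "INTEGER", "text": "VARCHAR", "timestamp": "TIMESTAMP"},
    "postgres": {"int": "INTEGER", "text": "TEXT",    "timestamp": "TIMESTAMP(0)"},
    "bigquery": {"int": "INT64",   "text": "STRING",  "timestamp": "DATETIME"},
}

# One table spec in JSON, the single source of truth for every dialect.
spec = json.loads("""
{"table": "admissions",
 "columns": [["row_id", "int"], ["admittime", "timestamp"], ["diagnosis", "text"]]}
""")

def make_ddl(spec, dialect):
    """Render a CREATE TABLE statement for the given dialect."""
    cols = ", ".join(f"{name} {TYPE_MAP[dialect][typ]}"
                     for name, typ in spec["columns"])
    return f'CREATE TABLE {spec["table"]} ({cols});'

print(make_ddl(spec, "duckdb"))
# CREATE TABLE admissions (row_id INTEGER, admittime TIMESTAMP, diagnosis VARCHAR);
```

A GitHub Action could then regenerate the per-dialect DDL and fail if the committed files drift from the JSON.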
> I suppose the index OOM bug is a DuckDB bug?

Yep, it's resolved in their repo but not released yet.
For consistency, this file would be better placed at mimic-iii/concepts_duckdb/icustay_hours.sql; this seems doable with the Python script.
```python
sql = fp.read()
result = connection.execute(sql).fetchall()
for row in result:
    print(f"{row[0]}: {row[2]} records ({row[1]} expected) - {row[3]}")
```
This will print all fails if we load the demo. Might be nice to have a `--demo` flag to alternate to a demo validation check?
I was thinking about `--skip-checks`... should I just make `--run-checks` (off by default)? Don't love the idea of another set of static values... though I suppose they wouldn't change much.
I ended up with a `validate.sql` and a `validate_demo.sql` with MIMIC-IV, FWIW.
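A small helper along these lines could pick between the two files based on a `--demo` style flag; the directory layout and function name here are assumptions, mirroring the validate.sql / validate_demo.sql split described above:

```python
from pathlib import Path

def validation_script(checks_dir, demo=False):
    """Return the validation SQL file for a full or demo load.

    Hypothetical helper: file names mirror the validate.sql /
    validate_demo.sql split; the directory layout is assumed.
    """
    name = "validate_demo.sql" if demo else "validate.sql"
    return Path(checks_dir) / name

print(validation_script("mimic-iii/concepts_duckdb", demo=True))
```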
Though again, the row counts are something that screams "this should be a JSON file somewhere else in the repo", rather than being hard-coded into all of these destined-to-rot SQL files.
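A sketch of that idea, with the expected counts pulled from a JSON string instead of hard-coded SQL; the counts, table names, and helper name below are invented for illustration:

```python
import json

# Expected row counts as JSON, one file per dataset (full vs demo).
expected = json.loads('{"icustays": 61532, "admissions": 58976}')

def check_counts(actual, expected):
    """Compare actual row counts against the expected map; return failures
    as (table, expected_count, actual_count) tuples."""
    return [(table, n, actual.get(table))
            for table, n in expected.items()
            if actual.get(table) != n]

# The second count is deliberately wrong to show a failure.
failures = check_counts({"icustays": 61532, "admissions": 100}, expected)
for table, want, got in failures:
    print(f"{table}: {got} records ({want} expected) - FAIL")
```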
```python
print("Creating tables...")

# ccs_dx is an outlier...this is adapted from the BigQuery version...
```
why is it an outlier?
Yes, the comment should probably be more specific. For the record: 1. because it's the only non-query-based table (that gets loaded with data) in the concepts build process; 2. the Postgres DB puts it in the `public` schema instead of `mimiciii`... which I am guessing is because it's also applicable to the MIMIC-IV concepts?
```python
# cProfile.run('...')
# print(f"Making view {key}...")
db = concept_name_map[key].get("db", "bigquery")
if db == "duckdb":
```
I think this is unnecessarily abstract. You might as well just hard-code the make_duckdb_query function to do something special for icustay_hours. Alternatively, have the subfunction contain a hard-coded list of queries which it parses as DuckDB; that way you can expand it to more queries in the future if you plan to move more over.
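The hard-coded-list variant could be as simple as this sketch; `DUCKDB_NATIVE`, `load_query`, and the two callables are hypothetical names for illustration, not functions from the PR:

```python
# Concepts whose SQL is already written for DuckDB; everything else is
# translated from the BigQuery version. Expand this set as more move over.
DUCKDB_NATIVE = {"icustay_hours"}

def load_query(key, translate_from_bigquery, read_duckdb_file):
    """Dispatch on the hard-coded set instead of per-concept metadata."""
    if key in DUCKDB_NATIVE:
        return read_duckdb_file(key)      # use the DuckDB-flavoured file as-is
    return translate_from_bigquery(key)   # transpile the BigQuery version

sql = load_query("icustay_hours",
                 translate_from_bigquery=lambda k: f"-- translated {k}",
                 read_duckdb_file=lambda k: f"-- native {k}")
print(sql)  # -- native icustay_hours
```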
Yes... earlier I was pulling some queries from the BigQuery version and others from the Postgres version, and Postgres was actually the default. Later I got them all to work translated from BigQuery except this one... which now looks odd.
```python
sql = _duckdb_rewrite_schema(sql, schema)

try:
    conn.execute(f"CREATE VIEW {qname} AS " + sql)
```
Missing a DROP VIEW IF EXISTS -> will be fixed by harmonizing the duckdb/bigquery subfunctions.
```python
with open(qfile, "r") as fp:
    sql = fp.read()
sql = re.sub(_too_many_backslashes_re, '\\$1', sql)
try:
```
I would rewrite this by merging the BigQuery/DuckDB subfunctions into one, with an if statement that calls the SQL modification if needed, then continues on as usual. This would avoid mild divergences in behaviour, e.g. forgetting to drop the view for DuckDB but doing it for BigQuery.
Probably valid... these two were originally far more divergent but are indeed quite similar now.
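A possible shape for the merged subfunction, sketched with a hypothetical signature and a stand-in connection object (the real code would pass a live DuckDB connection):

```python
def make_concept_view(conn, qname, sql, needs_rewrite=False, rewrite=None):
    """Single code path for BigQuery- and DuckDB-sourced queries: apply the
    SQL rewrite only when needed, then always drop and recreate the view.
    Hypothetical signature, sketching the merge suggested above."""
    if needs_rewrite and rewrite is not None:
        sql = rewrite(sql)
    conn.execute(f"DROP VIEW IF EXISTS {qname}")
    conn.execute(f"CREATE VIEW {qname} AS " + sql)

# Tiny stand-in connection that records statements, for illustration only.
class FakeConn:
    def __init__(self):
        self.log = []
    def execute(self, stmt):
        self.log.append(stmt)

conn = FakeConn()
make_concept_view(conn, "icustay_hours", "SELECT 1", needs_rewrite=True,
                  rewrite=lambda s: s + " -- rewritten")
print(conn.log[0])  # DROP VIEW IF EXISTS icustay_hours
```

Because the DROP runs unconditionally, the DuckDB path can no longer silently skip it.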
```python
conn.execute(f"DROP VIEW IF EXISTS {qname}")
try:
    conn.execute(f"CREATE VIEW {qname} AS " + sql)
```
What about creating either materialized views or simply tables? A view would rerun the code over and over (and be slow). I think most users will be happy trading off the disk space for faster queries, so I'd say the default should be to create tables, and if a `--view` argument is passed it creates views instead (or two arguments, `--make-concepts` and `--make-concepts-as-views`).
Probably a good idea. Looks like DuckDB doesn't have materialized views (yet), but for this use case tables are perfectly valid.
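Since both statements share the `CREATE ... AS <query>` shape, switching between tables and views can be a one-line decision. A sketch with a hypothetical helper name; the real script's argument handling may differ:

```python
def concept_ddl(qname, sql, as_view=False):
    """Build the CREATE statement for a concept: a table by default (faster
    queries at the cost of disk), or a view when a --view style flag is set.
    Hypothetical helper, not code from the PR."""
    kind = "VIEW" if as_view else "TABLE"
    return f"CREATE {kind} {qname} AS {sql}"

print(concept_ddl("icustay_hours", "SELECT 1"))                # CREATE TABLE ...
print(concept_ddl("icustay_hours", "SELECT 1", as_view=True))  # CREATE VIEW ...
```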
Seems like there are a few valid changes; feel free to update in your own time and just ping me for a re-review when you think it's ready.
I have a very interesting problem. I noticed that the DuckDB version here is producing integer results for […]. Then, since I am actually converting the queries from the BigQuery versions, I consulted the BigQuery documentation for […]. It looks like I can do either in the DuckDB version by changing how the SQL is translated... I may even put in an option to do either... but which one should it be, if there is a "should"? If nothing else, I wanted to bring this up in case it's an issue to have the discrepancy between the two current versions.
Definitely an inconsistency, good spot. I can confirm BigQuery is giving integers after calculating a floor. Immediately I think the easiest fix is to add a […]. For DuckDB, I think returning whole integers is fine.
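To make the two options concrete, here are two possible DuckDB renderings of an hour difference; the column and table names are illustrative, and the floor variant is only one way to get whole integers (it may differ from BigQuery's boundary-counting semantics in edge cases):

```sql
-- Fractional hours (what the DuckDB version reportedly produces):
SELECT date_diff('second', intime, outtime) / 3600.0 AS hr FROM icustays;

-- Whole hours, floored to an integer:
SELECT CAST(FLOOR(date_diff('second', intime, outtime) / 3600.0) AS INTEGER) AS hr
FROM icustays;
```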
These changes expand on #1489 / #1239 by adding a Python version that ingests the data and creates the concepts views in the DuckDB database. The Python version requires only the duckdb Python package, not the command-line executable; it also requires sqlglot.
Run `python import-duckdb.py --help` for usage; it's pretty simple. Tested on the MIMIC-III demo and the full version (1.4)... it seems to work, though I'm not sure if there's a better way to test (i.e. compare actual values... I have the Postgres version running in a container if so).
Resolves #1495
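Pulling the flags discussed in this thread together, the CLI could grow along these lines. Apart from `--skip-indexes`, the options shown are proposals from the review, not implemented behaviour:

```python
import argparse

# Sketch of the flags discussed in the review, layered onto the existing
# import-duckdb.py interface. All options except --skip-indexes are proposals.
parser = argparse.ArgumentParser(prog="import-duckdb.py")
parser.add_argument("--skip-indexes", action="store_true",
                    help="skip index creation (works around the OOM bug on low-RAM machines)")
parser.add_argument("--run-checks", action="store_true",
                    help="validate row counts after loading (proposed)")
parser.add_argument("--demo", action="store_true",
                    help="use demo-sized expected counts (proposed)")
parser.add_argument("--view", action="store_true",
                    help="create concepts as views instead of tables (proposed)")

args = parser.parse_args(["--skip-indexes", "--run-checks"])
print(args.skip_indexes, args.run_checks, args.view)  # True True False
```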