-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Test that the MIMIC-IV ETL runs on the MIMIC-IV demo dataset #4
Test that the MIMIC-IV ETL runs on the MIMIC-IV demo dataset #4
Conversation
Sorry for the churn! The test now runs:
|
|
||
Tests requiring data will be skipped unless the `tests/data/` folder is populated first. | ||
|
||
To download the testing data, run the following command/s from project root: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably want to make a script for this, as we will want to also include synthetic OMOP data, but this is good for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, this could be improved. I wasn't entirely sure of the best way of implementing tests on local machines. I don't really like the idea of local tests requiring access to external resources, but this is an option (e.g. when you run the test, you pull down data from HuggingFace, PhysioNet, etc).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could package the mimic demo data in this repository (maybe just 1 patient to save space)? Or generate synthetic MIMIC format data in the same way that I have synthetic OMOP data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either of these would be good with me. II was cautious about adding too much volume to the repo, but just picking a few patients makes sense. I love the idea of a synthetic data generator.
shutil.rmtree(cls.destination_path) | ||
|
||
# Run the ETL | ||
subprocess.run(['meds_etl_mimic', cls.source_path, cls.destination_path, "--num_shards", "10"], check=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably want to test that the files actually load with HuggingFace datasets.
dataset = datasets.Dataset.from_parquet(cls.destination_path + '/data/*')
example_patient = dataset[0]
assert 'patient_id' in example_patient
assert 'events' in example_patient
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, I was really just going for the bare minimum in this pull request (i.e. do the build run and did it generate files) but I'll add your load step now. One minute...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks really good! My main request is that we want the test to at minimum try to load the resulting files with huggingface Datasets.
dataset = huggingface.Datasets.from_parquet(blah + '/data/*')
Thanks Ethan. I added a couple of simple tests as suggested. |
Hmm, looks like the build was broken by a recent update in Medical-Event-Data-Standard/meds@e93f63a So successful test? |
Good catch and sorry about that! I pushed an update to the meds package which should fix this. https://pypi.org/project/meds/0.1.1/ Now the pyproject.toml file needs to be updated to require 0.1.1 instead of 0.1 Also, we shouldn't use subprocess here since it results in terrible error messages. We should just import main from the corresponding module. |
…e destination folder. The test will only run if the MIMIC demo data is found in tests/data/
The behaviour of importlib.resources changed between Python 3.9 and 3.10. In 3.9, running a script leaves spec.origin set to None, causing the tests to fail. There are several possiblefixes, but this is the easiest one.
f6275dd
to
c9a3f5e
Compare
Thanks, pull request updated and tests are passing again.
I think this change belongs in a new pull request, but if you let me know how you'd like to implement it then I can add it here. The ETL needs refactoring beforehand really. |
Yep, let's just merge this as is. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Feel free to merge when you are ready.
Thanks for reviewing! |
As discussed in #1, we would like to add some tests to confirm that the MIMIC ETL runs as expected.
This pull request:
tests/data/mimic-iv-demo
tests/data/mimic-iv-demo
locallymeds_etl_mimic
script successfully outputs files to a temporary destination folder attests/data/mimic-iv-demo/build/