Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test that the MIMIC-IV ETL runs on the MIMIC-IV demo dataset #4

Merged

Conversation

tompollard
Copy link
Contributor

As discussed in #1, we would like to add some tests to confirm that the MIMIC ETL runs as expected.

This pull request:

  • Adds a step to the Github workflow that wgets the MIMIC-IV demo dataset from PhysioNet and puts it at: tests/data/mimic-iv-demo
  • Adds instructions for downloading the MIMIC-IV demo dataset to tests/data/mimic-iv-demo locally
  • If the dataset exists when pytest is run, checks that the meds_etl_mimic script successfully outputs files to a temporary destination folder at tests/data/mimic-iv-demo/build/

@tompollard
Copy link
Contributor Author

Sorry for the churn! The test now runs:

  • I added a few missing dependencies to pyproject.toml: jsonschema, meds, typing_extensions
  • The behaviour of importlib.resources changed between Python 3.9 and 3.10. There are several possible fixes, but the easiest one is just to drop testing on < Python 3.10.


Tests requiring data will be skipped unless the `tests/data/` folder is populated first.

To download the testing data, run the following command/s from project root:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably want to make a script for this, as we will want to also include synthetic OMOP data, but this is good for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, this could be improved. I wasn't entirely sure of the best way of implementing tests on local machines. I don't really like the idea of local tests requiring access to external resources, but this is an option (e.g. when you run the test, you pull down data from HuggingFace, PhysioNet, etc).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could package the mimic demo data in this repository (maybe just 1 patient to save space)? Or generate synthetic MIMIC format data in the same way that I have synthetic OMOP data?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either of these would be good with me. II was cautious about adding too much volume to the repo, but just picking a few patients makes sense. I love the idea of a synthetic data generator.

shutil.rmtree(cls.destination_path)

# Run the ETL
subprocess.run(['meds_etl_mimic', cls.source_path, cls.destination_path, "--num_shards", "10"], check=True)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably want to test that the files actually load with HuggingFace datasets.

dataset = datasets.Dataset.from_parquet(cls.destination_path + '/data/*')

example_patient = dataset[0]

assert 'patient_id' in example_patient
assert 'events' in example_patient

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, I was really just going for the bare minimum in this pull request (i.e. do the build run and did it generate files) but I'll add your load step now. One minute...

Copy link
Collaborator

@EthanSteinberg EthanSteinberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really good! My main request is that we want the test to at minimum try to load the resulting files with huggingface Datasets.

dataset = huggingface.Datasets.from_parquet(blah + '/data/*')

@tompollard
Copy link
Contributor Author

My main request is that we want the test to at minimum try to load the resulting files with huggingface Datasets.

Thanks Ethan. I added a couple of simple tests as suggested.

@tompollard
Copy link
Contributor Author

Hmm, looks like the build was broken by a recent update in meds:

Medical-Event-Data-Standard/meds@e93f63a

So successful test?

@EthanSteinberg
Copy link
Collaborator

Good catch and sorry about that! I pushed an update to the meds package which should fix this.

https://pypi.org/project/meds/0.1.1/

Now the pyproject.toml file needs to be updated to require 0.1.1 instead of 0.1

Also, we shouldn't use subprocess here since it results in terrible error messages.

We should just import main from the corresponding module.

@tompollard
Copy link
Contributor Author

Now the pyproject.toml file needs to be updated to require 0.1.1 instead of 0.1

Thanks, pull request updated and tests are passing again.

Also, we shouldn't use subprocess here since it results in terrible error messages. We should just import main from the corresponding module.

I think this change belongs in a new pull request, but if you let me know how you'd like to implement it then I can add it here. The ETL needs refactoring beforehand really.

@EthanSteinberg
Copy link
Collaborator

I think this change belongs in a new pull request, but if you let me know how you'd like to implement it then I can add it here. The ETL needs refactoring beforehand really.

Yep, let's just merge this as is.

Copy link
Collaborator

@EthanSteinberg EthanSteinberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Feel free to merge when you are ready.

@tompollard
Copy link
Contributor Author

Thanks for reviewing!

@tompollard tompollard merged commit fd46ddf into Medical-Event-Data-Standard:main Dec 19, 2023
1 check passed
@tompollard tompollard deleted the tp/mimic-iv-demo branch December 19, 2023 17:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants