Test that the MIMIC-IV ETL runs on the MIMIC-IV demo dataset #4

tompollard · 2023-12-15T21:10:41Z

As discussed in #1, we would like to add some tests to confirm that the MIMIC ETL runs as expected.

This pull request:

- Adds a step to the GitHub workflow that uses wget to download the MIMIC-IV demo dataset from PhysioNet and puts it at: tests/data/mimic-iv-demo
- Adds instructions for downloading the MIMIC-IV demo dataset to tests/data/mimic-iv-demo locally
- If the dataset exists when pytest is run, checks that the meds_etl_mimic script successfully outputs files to a temporary destination folder at tests/data/mimic-iv-demo/build/
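
As a rough illustration of the skip-if-missing behaviour described above, here is a minimal sketch using the standard-library unittest module (the actual test in this PR may be structured differently). The paths and the `meds_etl_mimic` arguments come from this thread; the helper names `demo_data_available` and `etl_command` are hypothetical.

```python
import shutil
import subprocess
import unittest
from pathlib import Path

# Paths taken from the PR description; the helpers below are illustrative only.
SOURCE_PATH = Path("tests/data/mimic-iv-demo")
DESTINATION_PATH = SOURCE_PATH / "build"


def demo_data_available(source: Path = SOURCE_PATH) -> bool:
    """Return True if the demo dataset folder exists and is not empty."""
    return source.is_dir() and any(source.iterdir())


def etl_command(source: Path, destination: Path, num_shards: int = 10) -> list[str]:
    """Build the meds_etl_mimic command line (hypothetical helper)."""
    return ["meds_etl_mimic", str(source), str(destination), "--num_shards", str(num_shards)]


@unittest.skipUnless(demo_data_available(), "MIMIC-IV demo data not found in tests/data/")
class TestMimicEtl(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Start from a clean destination folder, then run the ETL once.
        if DESTINATION_PATH.exists():
            shutil.rmtree(DESTINATION_PATH)
        subprocess.run(etl_command(SOURCE_PATH, DESTINATION_PATH), check=True)

    def test_etl_writes_files(self):
        # The ETL should have produced at least one file in the build folder.
        self.assertTrue(any(p.is_file() for p in DESTINATION_PATH.rglob("*")))
```

Because the skip condition is evaluated when the module is imported, a machine without the demo data reports the whole class as skipped rather than failed.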

tompollard · 2023-12-15T21:47:40Z

Sorry for the churn! The test now runs:

I added a few missing dependencies to pyproject.toml: jsonschema, meds, typing_extensions
The behaviour of importlib.resources changed between Python 3.9 and 3.10. There are several possible fixes, but the easiest one is just to drop testing on Python < 3.10.

EthanSteinberg · 2023-12-18T20:14:44Z

README.md

+
+Tests requiring data will be skipped unless the `tests/data/` folder is populated first. 
+
+To download the testing data, run the following command/s from project root:


We probably want to make a script for this, as we will want to also include synthetic OMOP data, but this is good for now.

I agree, this could be improved. I wasn't entirely sure of the best way of implementing tests on local machines. I don't really like the idea of local tests requiring access to external resources, but this is an option (e.g. when you run the test, you pull down data from HuggingFace, PhysioNet, etc).

I think we could package the mimic demo data in this repository (maybe just 1 patient to save space)? Or generate synthetic MIMIC format data in the same way that I have synthetic OMOP data?

Either of these would be good with me. I was cautious about adding too much volume to the repo, but just picking a few patients makes sense. I love the idea of a synthetic data generator.
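
To make the synthetic-data idea concrete, here is a minimal sketch of what such a generator could look like. It writes only a tiny `hosp/patients.csv.gz` using the column names of the MIMIC-IV `patients` table. The function name, the value ranges, and the assumption that a single table is enough are all illustrative; a real generator would need every table the ETL reads.

```python
import csv
import gzip
import random
from pathlib import Path

# Column names of the MIMIC-IV hosp/patients table.
PATIENT_COLUMNS = ["subject_id", "gender", "anchor_age", "anchor_year", "anchor_year_group", "dod"]


def write_synthetic_patients(root: Path, n_patients: int = 3, seed: int = 0) -> Path:
    """Write a small, fake patients table under root/hosp/ (hypothetical helper)."""
    rng = random.Random(seed)  # seeded so the generated file is reproducible
    out = root / "hosp" / "patients.csv.gz"
    out.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(out, "wt", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(PATIENT_COLUMNS)
        for i in range(n_patients):
            # All values are made up; dod (date of death) is left empty.
            writer.writerow(
                [10_000_000 + i, rng.choice(["F", "M"]), rng.randint(18, 90), 2150, "2014 - 2016", ""]
            )
    return out
```

Generating data at test time keeps the repository small, at the cost of the test data no longer being real MIMIC records.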

EthanSteinberg · 2023-12-18T20:16:13Z

tests/test_etl.py

+            shutil.rmtree(cls.destination_path)
+
+        # Run the ETL
+        subprocess.run(['meds_etl_mimic', cls.source_path, cls.destination_path, "--num_shards", "10"], check=True)


We probably want to test that the files actually load with HuggingFace datasets.

dataset = datasets.Dataset.from_parquet(cls.destination_path + '/data/*')

example_patient = dataset[0]

assert 'patient_id' in example_patient
assert 'events' in example_patient

Good idea, I was really just going for the bare minimum in this pull request (i.e. does the build run and does it generate files) but I'll add your load step now. One minute...

EthanSteinberg

This looks really good! My main request is that we want the test to at minimum try to load the resulting files with huggingface Datasets.

dataset = datasets.Dataset.from_parquet(cls.destination_path + '/data/*')

tompollard · 2023-12-18T22:18:50Z

> My main request is that we want the test to at minimum try to load the resulting files with huggingface Datasets.

Thanks Ethan. I added a couple of simple tests as suggested.

tompollard · 2023-12-18T22:25:04Z

Hmm, looks like the build was broken by a recent update in meds:

Medical-Event-Data-Standard/meds@e93f63a

So successful test?

EthanSteinberg · 2023-12-18T23:53:47Z

Good catch and sorry about that! I pushed an update to the meds package which should fix this.

https://pypi.org/project/meds/0.1.1/

Now the pyproject.toml file needs to be updated to require 0.1.1 instead of 0.1

Also, we shouldn't use subprocess here since it results in terrible error messages.

We should just import main from the corresponding module.
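
A minimal sketch of the in-process approach suggested here, assuming the ETL entry point reads its arguments from `sys.argv`. `fake_etl_main` is a stand-in written for this example, not the project's actual `main`, and `run_in_process` is a hypothetical helper.

```python
import argparse
import sys
from unittest import mock


def fake_etl_main() -> None:
    """Stand-in for the real ETL entry point (hypothetical; not the project's code)."""
    parser = argparse.ArgumentParser(prog="meds_etl_mimic")
    parser.add_argument("src")
    parser.add_argument("dest")
    parser.add_argument("--num_shards", type=int, default=10)
    args = parser.parse_args()  # reads sys.argv[1:]
    if args.num_shards < 1:
        raise ValueError(f"num_shards must be positive, got {args.num_shards}")


def run_in_process(main, argv):
    """Call a CLI entry point directly, with sys.argv temporarily replaced."""
    with mock.patch.object(sys, "argv", argv):
        main()
```

With this pattern a failure inside the ETL surfaces in the test as the original exception and traceback, whereas `subprocess.run(..., check=True)` only reports a `CalledProcessError` with the exit code.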

tompollard · 2023-12-19T15:35:22Z

> Now the pyproject.toml file needs to be updated to require 0.1.1 instead of 0.1

Thanks, pull request updated and tests are passing again.

> Also, we shouldn't use subprocess here since it results in terrible error messages. We should just import main from the corresponding module.

I think this change belongs in a new pull request, but if you let me know how you'd like to implement it then I can add it here. The ETL needs refactoring beforehand really.

EthanSteinberg · 2023-12-19T15:57:09Z

> I think this change belongs in a new pull request, but if you let me know how you'd like to implement it then I can add it here. The ETL needs refactoring beforehand really.

Yep, let's just merge this as is.

EthanSteinberg

Looks good! Feel free to merge when you are ready.

tompollard · 2023-12-19T17:19:06Z

Thanks for reviewing!

tompollard requested a review from EthanSteinberg December 18, 2023 20:09

EthanSteinberg reviewed Dec 18, 2023

EthanSteinberg requested changes Dec 18, 2023

tompollard mentioned this pull request Dec 18, 2023

Run MEDS ETLs as part of a testing framework Medical-Event-Data-Standard/meds#1

Closed

tompollard added 12 commits December 19, 2023 09:59

add folder to hold data for the testing framework.

2579ccb

download MIMIC-IV demo for tests.

9de07b4

Don't track test data.

68aca36

Add simple test that runs the ETL and checks that it adds files to th…

cc0809f

…e destination folder. The test will only run if the MIMIC demo data is found in tests/data/

install the meds etl package.

b6897c6

importlib.resources module is not available in Python <3.9

12467f5

add missing jsonschema dependency.

2bed691

add missing meds dependency.

a0de752

add missing typing_extensions dependency.

9404b87

Remove Python 3.9 test

7da7d8a

The behaviour of importlib.resources changed between Python 3.9 and 3.10. In 3.9, running a script leaves spec.origin set to None, causing the tests to fail. There are several possible fixes, but this is the easiest one.

add 🤗 datasets dependency.

1cb9c0f

add some simple tests to check the MIMIC-IV conversion.

c9a3f5e

tompollard force-pushed the tp/mimic-iv-demo branch from f6275dd to c9a3f5e Compare December 19, 2023 14:59

tompollard mentioned this pull request Dec 19, 2023

meds not defined or included in dependencies #5

Closed

Requires meds >= 0.1.1

b7a1deb

EthanSteinberg self-requested a review December 19, 2023 15:57

EthanSteinberg approved these changes Dec 19, 2023

tompollard merged commit fd46ddf into Medical-Event-Data-Standard:main Dec 19, 2023
1 check passed

tompollard deleted the tp/mimic-iv-demo branch December 19, 2023 17:19
