Use this ETL as a way to provide MIMIC in OMOP directly on the Physionet website #52

Open
vojtechhuser opened this issue Jul 25, 2018 · 34 comments


@vojtechhuser
Contributor

This ETL lets each local user download MIMIC and convert it themselves, so the conversion is repeated at many sites.

How about converting once and letting sites download the converted dataset?

This would save MIMIC users some effort and lead to MIMIC being used more (and published about, which brings credit).

@dsontag

dsontag commented Jul 25, 2018

Great idea!

@chandryou

chandryou commented Jul 26, 2018

Thank you for this great work!

@tompollard
Member

Thanks for the suggestion @vojtechhuser. The mapping needs some work, but sharing the transformed dataset is something that we'd like to do once we're happy with it. We haven't been able to give this project the time it needs just because of competing priorities (research tasks, rebuilding PhysioNet, preparing the next release of MIMIC, etc), but it's on our to-do list.

@vojtechhuser
Contributor Author

Any updates on this? We would like to use MIMIC-III in a data quality tutorial at the OHDSI Symposium and desperately need someone who has run the code from this repo and can collaborate with us.

@alistairewj
Member

As soon as we publish something on PhysioNet we have to be able to support it and the ETL isn't ready. We are currently building ETLs for other ICU datasets so that our model doesn't overfit to MIMIC.

If by data quality you mean running Achilles, then I have done that, but the results aren't that useful on MIMIC because of the unique data structure and deidentification approach (e.g. deidentified ages ~ 300).

@AEW0330

AEW0330 commented Sep 7, 2018

@alistairewj the use @vojtechhuser is referring to is for a tutorial on how to use Achilles and two other data quality tool sets designed for use with OMOP data sources. The version of MIMIC we use doesn't need to be free of defects. It just needs to be usable - i.e. it won't break the tools because there are empty or missing tables or missing required variables. To the extent that it will resemble a real world data set with typical data quality issues that the tools can identify, it will meet our needs. Before I spend the effort to get this to run, can you give your sense of how likely it is to meet those needs?
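
For anyone wanting to pre-check the "empty or missing tables" concern before a tutorial, here is a minimal sketch. It assumes the converted data has been loaded into SQLite purely for illustration (this thread does not say which database the ETL targets), and the list of required tables is a hypothetical subset, not the actual requirements of Achilles or the other tools.

```python
import sqlite3

# Hypothetical subset of OMOP CDM tables a tool might require; adjust as needed.
REQUIRED_TABLES = ["person", "observation_period", "condition_occurrence"]


def table_status(conn, tables):
    """Return {table: 'missing' | 'empty' | 'ok'} for each requested table."""
    status = {}
    for name in tables:
        found = conn.execute(
            "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
            (name,),
        ).fetchone()
        if found is None:
            status[name] = "missing"
            continue
        # Table names come from the fixed list above, not from user input.
        has_row = conn.execute(f'SELECT 1 FROM "{name}" LIMIT 1').fetchone()
        status[name] = "ok" if has_row else "empty"
    return status


def build_demo():
    """Build a toy in-memory database: one populated, one empty, one absent table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE person (person_id INTEGER PRIMARY KEY)")
    conn.execute("CREATE TABLE observation_period (person_id INTEGER)")
    conn.execute("INSERT INTO person VALUES (1)")
    return conn


if __name__ == "__main__":
    print(table_status(build_demo(), REQUIRED_TABLES))
```

Against a real converted database, the same function could be pointed at the full list of tables each tool expects; anything reported as missing or empty would be worth fixing before the tutorial.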

@alistairewj
Member

MIMIC is a real world dataset, from a real hospital, but I don't know if I can fully answer your question without knowing the ins and outs of the tools you'll use. The ETL is incomplete; there are still a lot of unmapped concepts. I ran Achilles a few months ago and the output is hopefully informative for you (see below). You'll notice that there are a lot of reported "errors" around times/dates due to our deidentification approach (we randomly shift patient data into the future, therefore doing any analysis which aggregates distinct patients over time is flawed).

Type Message
ERROR 3-Number of persons by year of birth; should not have year of birth in the future, (n=44,374)
ERROR 101-Number of persons by age, with age at first observation period; should not have age > 150, (n=1,991)
ERROR 400-Number of persons with at least one condition occurrence, by condition_concept_id; 2 concepts in data are not in vocabulary
ERROR 400-Number of persons with at least one condition occurrence, by condition_concept_id; 228 concepts in data are not in correct vocabulary
ERROR Death event outside observation period, 510-Number of death records outside valid observation period; count (n=8,980) should not be > 0
ERROR 600-Number of persons with at least one procedure occurrence, by procedure_concept_id; 39 concepts in data are not in correct vocabulary
ERROR 610-Number of procedure occurrence records outside valid observation period; count (n=883) should not be > 0
ERROR 700-Number of persons with at least one drug exposure, by drug_concept_id; 4 concepts in data are not in correct vocabulary
ERROR 706 - Distribution of age by drug_concept_id (count = 1); min value should not be negative
ERROR 710-Number of drug exposure records outside valid observation period; count (n=12,437,292) should not be > 0
ERROR 711-Number of drug exposure records with end date < start date; count (n=15,922) should not be > 0
ERROR 717 - Distribution of quantity by drug_concept_id (count = 7); min value should not be negative
ERROR 806 - Distribution of age by observation_concept_id (count = 2); min value should not be negative
ERROR 810-Number of observation records outside valid observation period; count (n=85,787) should not be > 0
ERROR 814-Number of observation records with no value (numeric, string, or concept); count (n=99,839) should not be > 0
NOTIFICATION Unmapped data over percentage threshold in:Measurement
NOTIFICATION Count of unmapped source values exceeds threshold in: drug_exposure
NOTIFICATION [GeneralPopulationOnly] Count of distinct specialties of providers in the PROVIDER table is below threshold
NOTIFICATION No body weight data in MEASUREMENT table (under concept_id 3025315 (LOINC code 29463-7))
NOTIFICATION Unmapped data over percentage threshold in:Condition
NOTIFICATION Unmapped data over percentage threshold in:Procedure
NOTIFICATION Unmapped data over percentage threshold in:DrugExposure
NOTIFICATION Unmapped data over percentage threshold in:Observation
WARNING 5-Number of persons by ethnicity; data with unmapped concepts
WARNING 101-Number of persons by age, with age at first observation period; should not have age > 125, (n=1,991)
WARNING 400-Number of persons with at least one condition occurrence, by condition_concept_id; data with unmapped concepts
WARNING 402-Number of persons by condition occurrence start month, by condition_concept_id; 2 concepts have a 100% change in monthly count of events
WARNING 420-Number of condition occurrence records by condition occurrence start month; theres a 100% change in monthly count of events
WARNING 512-Distribution of time from death to last drug (count = 1); max value should not be positive, otherwise its a zombie with data >1mo after death
WARNING 514-Distribution of time from death to last procedure (count = 1); max value should not be positive, otherwise its a zombie with data >1mo after death
WARNING 515-Distribution of time from death to last observation (count = 1); max value should not be positive, otherwise its a zombie with data >1mo after death
WARNING 600-Number of persons with at least one procedure occurrence, by procedure_concept_id; data with unmapped concepts
WARNING 602-Number of persons by procedure occurrence start month, by procedure_concept_id; 6 concepts have a 100% change in monthly count of events
WARNING 620-Number of procedure occurrence records by procedure occurrence start month; theres a 100% change in monthly count of events
WARNING 700-Number of persons with at least one drug exposure, by drug_concept_id; data with unmapped concepts
WARNING 702-Number of persons by drug exposure start month, by drug_concept_id; 22 concepts have a 100% change in monthly count of events
WARNING 717-Distribution of quantity by drug_concept_id (count = 83); max value should not be > 600
WARNING 720-Number of drug exposure records by drug exposure start month; theres a 100% change in monthly count of events
WARNING 800-Number of persons with at least one observation occurrence, by observation_concept_id; data with unmapped concepts
WARNING 802-Number of persons by observation occurrence start month, by observation_concept_id; 7 concepts have a 100% change in monthly count of events
WARNING 820-Number of observation records by observation start month; theres a 100% change in monthly count of events
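
To illustrate the date-shifting point with toy data, here is a minimal sketch. The admissions and the per-patient offsets are invented, and the offsets are fixed rather than random so the result is reproducible; this is not the actual MIMIC deidentification procedure. It shows that within-patient intervals survive the shift, while counts that aggregate different patients by calendar year do not.

```python
from datetime import date, timedelta

# Toy admissions: (patient_id, admit_date, discharge_date). Values are invented.
ADMISSIONS = [
    (1, date(2010, 3, 1), date(2010, 3, 5)),
    (2, date(2010, 3, 2), date(2010, 3, 9)),
]

# Hypothetical per-patient offsets in days; a real scheme would draw them randomly.
OFFSETS = {1: 365 * 120, 2: 365 * 150}


def shift(rows, offsets):
    """Shift every date of a patient by that patient's own offset."""
    shifted = []
    for pid, start, end in rows:
        delta = timedelta(days=offsets[pid])
        shifted.append((pid, start + delta, end + delta))
    return shifted


def length_of_stay(rows):
    """Days between admission and discharge, per patient."""
    return {pid: (end - start).days for pid, start, end in rows}


def admissions_per_year(rows):
    """Number of admissions starting in each calendar year, across patients."""
    counts = {}
    for _, start, _ in rows:
        counts[start.year] = counts.get(start.year, 0) + 1
    return counts


if __name__ == "__main__":
    shifted = shift(ADMISSIONS, OFFSETS)
    print("length of stay unchanged:",
          length_of_stay(shifted) == length_of_stay(ADMISSIONS))
    print("per-year counts before:", admissions_per_year(ADMISSIONS))
    print("per-year counts after: ", admissions_per_year(shifted))
```

Both toy admissions start in the same year before the shift and land in different years afterwards, which is the kind of pattern behind the "100% change in monthly count of events" warnings above.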

@AEW0330

AEW0330 commented Sep 8, 2018

@alistairewj this is helpful. Thanks.

@tomseinen

Any updates on sharing a complete version of MIMIC in OMOP on PhysioNet?

Especially now, during COVID-19, I would very much like to work with a proper CDM at home, as I can't access my organisation's CDM.
Alternative databases, like SynPUF, are too limited for the analyses I want to test.

Thank you, Tom

@tompollard
Member

tompollard commented Apr 17, 2020

We would be happy to share an OMOP version of MIMIC-III on PhysioNet. See also MIT-LCP/mimic-code#725.

I suggest that someone from the OMOP community takes responsibility for putting together a submission to PhysioNet. The person should:

  • make efforts to describe the dataset clearly.
  • include a snapshot of the code used to generate the dataset.
  • ensure that people who have been involved in the work are included as contributors.

Once we receive a well described version of the dataset, we can move forward with publication. For instructions on submitting the project, see: https://physionet.org/about/publish/#sharing

@vojtechhuser
Contributor Author

That is great. I will work on a proposal and am happy to revise it as many times as needed to meet all the requirements of the PhysioNet review team. (Tagging @parisni.)

@parisni
Contributor

parisni commented May 1, 2020

Hi all. Good news. I would be pleased to help make this possible.

@vojtechhuser
Contributor Author

I started a draft today.

I will add @parisni and other important people.

image

@vojtechhuser
Contributor Author

I plan to use these settings (let me know if that is wrong):
image

@tompollard
Member

@vojtechhuser those access settings are correct. Not sure about "OMOP shaped data" as the title of the dataset, but presumably this is a placeholder!

@vojtechhuser
Contributor Author

The title is changed now.
Please let me know who else wants to be invited (or does not want to be). So far, I have:

image

@vojtechhuser
Contributor Author

What do people think about the number of projects? One project will be for full data. Should we create another project that converts Demo data? (I am happy to do what MIT tells me.)

image

@jmbanda

jmbanda commented May 4, 2020

I would like an invite! I would love to be able to skip ETLing the data and get it in the OMOP format from the source.

@alistairewj
Member

If published as a credentialed project then it would be accessible to MIMIC users. The invite mentioned is for the authors of the project, i.e. those who helped create the ETL.

@tompollard
Member

One project will be for full data. Should we create another project that converts Demo data?

Yes, I think a separate project for each dataset is best. One of the benefits is that the MIMIC demo is open access (https://physionet.org/content/mimiciii-demo/1.4/), so the same permissions could be applied to the OMOP version.

@AEW0330

AEW0330 commented May 4, 2020

Excellent point Tom.

@vojtechhuser
Contributor Author

vojtechhuser commented May 4, 2020

Based on the guidance, I have now created a sister "demo" project and invited folks there too.

image

@AEW0330

AEW0330 commented May 4, 2020

I'm seeing whether the N3C project can support some of this work: pay for some of people's time and get more hands on deck. Who has a guess at the amount of work involved?

@AEW0330

AEW0330 commented May 4, 2020

The folks leading the National COVID Cohort Collaborative (N3C) seem to have some leeway with unspecified cash allocations and have indicated potential interest in supporting this. So I'm eager to respond to their question about the amount of work. I'd take a guess myself, but I'm the least qualified in this group to do so.

@tompollard
Member

Interesting, thanks Andrew. @parisni @alistairewj @aparrot89 any thoughts on whether we should be putting in additional work to improve the mapping before the dataset is shared?

@SSMK-wq

SSMK-wq commented May 5, 2020

Hi, I am interested in being part of this project and am already a registered user of PhysioNet.

@vojtechhuser
Contributor Author

Formal funding would be great.

See notes in this shared folder: https://drive.google.com/open?id=1j-x-rwuYJr2nIs5zxCW6ST_Q-vPc1tfN

For folks willing to help, please put your name next to the table that you volunteer to tackle (port to GBQ or improve).

image

@vojtechhuser
Contributor Author

I propose a plan where multiple versions are released. We need initial versions to make people aware of it, e.g., v0.1 with some tables. After that, v1.0 could use the existing mapping and v2.0 could come with an improved mapping. The perfect should not be the enemy of the good enough.

@alistairewj
Member

I can't say I agree with releasing an incomplete dataset on PhysioNet and justifying the lack of completeness with a "v0.1" tag.

@vojtechhuser
Contributor Author

The Google link permission was fixed. You can sign up for individual tables again here:
https://drive.google.com/open?id=1j-x-rwuYJr2nIs5zxCW6ST_Q-vPc1tfN (file central notes)

@epiben

epiben commented May 15, 2020

@vojtechhuser, I'd be happy to join this effort! I put myself on the measurement table.

@vojtechhuser
Contributor Author

The project description is now also in Central Notes, at this link (pick the file "central notes"):
https://drive.google.com/open?id=1j-x-rwuYJr2nIs5zxCW6ST_Q-vPc1tfN @AEW0330

@vojtechhuser
Contributor Author

The project is from now on called Argos.

This OHDSI forum thread will be used for major updates:

https://forums.ohdsi.org/t/argos-project-2020-omoped-mimic-project/10926

Technical items will still be posted here.

@tompollard
Member

What is the need for the codename? MIMIC-OMOP seems clearer.
