Discussion: forum post
Table CDM_SOURCE provides metadata.
The proposal adds a single table to the CDM specification. In phase 1, we aim to provide a mechanism for sites to capture metadata. Concept-level standardization is planned for phase 2.
new METADATA table
This table relies on concept_ids that already exist for CDM tables. In Atlas, find them by using advanced search and selecting Metadata.
The proposal encourages all CDM adopters to fully populate and utilize the existing CDM_SOURCE table.
END OF PROPOSAL
Text below only reflects some historical notes related to the proposal above.
Proposing person: Patrick Ryan, Martijn Schuemie, Ajit Londhe, & Erica Voss
(may need to be updated)
Additionally, we would like the CDM_SOURCE table to store metadata about each of the domains. Our idea is to add one column per CDM domain to the CDM_SOURCE table (e.g. CDM_SOURCE.VISIT_OCCURRENCE, CDM_SOURCE.PERSON, etc). This would allow us to display information about a specific domain on an ACHILLES report. For example, the VISIT_OCCURRENCE logic in PREMIER is fairly complex, and displaying a description of that logic at the point where someone is reviewing the data in ACHILLES would be beneficial.
Here is an example of some text for JMDC:
Database as a whole
(already has a column) JMDC database consists of data from 60 Society-Managed Health Insurances covering workers aged 18 to 65 and their dependents (children younger than 18 years old and elderly people older than 65 years old). The old people (particularly those aged 66 or older) are less representative as compared with whole population in the nation. When estimated among the people who are younger than 66 years old, the proportion of children younger than 18 years old in JMDC is approximately the same as the proportion in the whole nation. JMDC data includes data on membership status of the insured people and claims data provided by insurers under contract. Claims data are derived from monthly claims issued by clinics, hospitals and community pharmacies.
JMDC covers workers aged 18 to 65 and their dependents (children younger than 18 years old and elderly people older than 65 years old). The old people (particularly those aged 66 or older) are less representative as compared with whole population in the nation. When estimated among the people who are younger than 66 years old, the proportion of children younger than 18 years old in JMDC is approximately the same as the proportion in the whole nation.
The observation period is defined as the time of enrollment in the health insurance. If the member is a dependent, the enrollment depends on the enrollment of the main beneficiary.
Care sites in JMDC are institutions where care is provided, typically a department in a hospital.
debate about CDM_SOURCE table
improve the guidance for this table
(superseded by inclusion of the below information in the METADATA table)
Advanced Data Quality checks (inside Achilles Heel) would take advantage of this information in this new column.
"Predominantly" means that at least 51% of significant records come from a given source.
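One way to read that rule, as a minimal sketch: the function name `predominant_source` and the `min_percent` parameter are made up for illustration, and deciding which records count as "significant" is left to the caller, since the proposal does not define it.

```python
from collections import Counter


def predominant_source(record_sources, min_percent=51):
    """Return the source that supplies at least `min_percent` percent of the
    given records, or None if no single source reaches that share.

    `record_sources` holds one source label per significant record
    (hypothetical input shape, not part of the proposal).
    """
    total = len(record_sources)
    if total == 0:
        return None
    source, count = Counter(record_sources).most_common(1)[0]
    # Integer arithmetic avoids floating-point surprises at the 51% boundary.
    if count * 100 >= min_percent * total:
        return source
    return None


if __name__ == "__main__":
    print(predominant_source(["claims"] * 6 + ["ehr"] * 4))
    print(predominant_source(["claims", "ehr"]))
```

With a 50/50 split neither source qualifies, so the function returns None rather than picking one arbitrarily.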
Proposing person: Ajit Londhe, & Erica Voss
We would like to propose the following table to hold metadata:
forum thread where metadata was discussed.
Summary of CDM WG discussion from Aug 1st:
referenced this issue
Aug 1, 2017
The modified proposal adds the date and datetime fields, and is the one under consideration. While it's certainly likely metadata records would be captured at ETL time, further observations of the data set could happen during the lifecycle of that particular CDM, as it becomes more utilized through various studies by multiple researchers.
I feel it would be useful to track the date of the metadata record to help with traceability of the site's evolving understanding of the data set's nuances. I'm envisioning in AchillesWeb or eventually Atlas a set of metadata reports that would be able to expose this new information to all users of the CDM. Temporal attributes help in consuming this information more easily.
I feel that this approach is simple enough to allow us to get started with Metadata. There are more complex ways to think about Metadata, but from what I've seen they are more complicated than they are worth. Here at Janssen we have been using a similar table for a while now and it has been helpful for tracking high level information about CDMs (like in the examples above).
With source data rife with temporal trend shifts, contextual nuance, data collection idiosyncrasies, and vocabulary changes, not to mention the ETL design choices needed to wrangle the data, it can be easy to employ poor study design that overlooks these issues, because the knowledge about them is not stored in a central repository, but usually sitting in a few key researchers' brains. This is troublesome in a study utilizing one data set, but considerably more problematic when conducting a network study in which many data sets are leveraged, but none of the known nuances within each data set are acknowledged or adjusted for in some way.
A few case studies we discussed include (1) the loss of Social Security Death Master File access in a claims data set, which resulted in a significant drop-off in death records per patient in November 2011; and (2) the change in prevalence of the condition "malaise and fatigue" in that same claims data set due to the migration from ICD9CM to ICD10CM. Both are captured in the proposed metadata table below:
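As a rough illustration, the two observations might be recorded like this. The field names (`metadata_concept_id`, `metadata_type_concept_id`, `name`, `value_as_string`, `metadata_date`, `metadata_datetime`) are assumptions about the proposed layout, the concept ids are placeholders set to 0, and the recording date is an arbitrary example value, not a real one.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import List, Optional


@dataclass(frozen=True)
class MetadataRecord:
    """One human-authored observation about a data set (assumed layout)."""
    metadata_concept_id: int        # concept the note is about; 0 = placeholder
    metadata_type_concept_id: int   # kind of metadata; 0 = placeholder
    name: str
    value_as_string: str
    metadata_date: date             # when the note was recorded
    metadata_datetime: Optional[datetime] = None


# Illustrative recording date only; not taken from the proposal.
RECORDED_ON = date(2017, 8, 1)

records: List[MetadataRecord] = [
    MetadataRecord(
        0, 0, "Death records",
        "Loss of Social Security Death Master File access; death records "
        "per patient drop off significantly in November 2011.",
        RECORDED_ON,
    ),
    MetadataRecord(
        0, 0, "Malaise and fatigue",
        "Prevalence changes because of the migration from ICD9CM to ICD10CM.",
        RECORDED_ON,
    ),
]


def recorded_since(items: List[MetadataRecord], since: date) -> List[MetadataRecord]:
    """Return the notes recorded on or after `since`."""
    return [r for r in items if r.metadata_date >= since]


if __name__ == "__main__":
    for r in recorded_since(records, date(2017, 1, 1)):
        print(r.metadata_date, r.name, "-", r.value_as_string)
```

The date field is what makes a filter like `recorded_since` possible, which is the traceability argument made for the modified proposal.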
The ability to have a central place to store human-authored observations like these can benefit researchers utilizing this data set, particularly if they are exposed via our main web applications:
Additionally, the user can be warned of potential "dangers" in attempting to utilize these problematic concepts in their cohort design:
Another application enhancement made possible is the ability for data custodians to catalog design choices in building the CDM version of the data set, such as in Achilles Heel:
We cannot begin to implement these standard practices of data set annotation without a metadata repository. The one being proposed here, while not as robust as something like ISO-11179, does satisfy the core requirements of a metadata repository as specified by the National Information Standards Organization (NISO):
I feel we should adopt this proposal, begin using it, and then share our experiences with its usage through a working group.
We have two OMOPs (one inpatient and one outpatient) that we want to merge. This might be outside of this proposal, but I wonder if it would be worth adding an optional column later, named row_id, that would hold the row level metadata. The main issue is that the metadata table will be very large. I know this is out of scope for this proposal, but I wanted to see if others have an opinion. Thanks.
Oh. You want a shadow record for every other record to put some metadata on it? Really? To show provenance? Put it into the *_type_concept_id. Why do you care anyway? You have VISIT to declare whether it was outpatient or inpatient. What's the use case?
Right. We are already using the type_concept_id. There are data points, such as labs, that don't have a visit, but are in both instances. This issue of provenance is also needed for the All of Us project where we are merging essentially 20 OMOPs into one. One option is to append the source information into the source value field, but it gets pretty messy when you want to query for that information.
Looking at this more carefully, it seems like my suggestion might not fully work b/c I was thinking the metadata_concept_id was the domain_id.
We did tee this up a bit in the forums, but it didn't seem like many folks were storing multiple CDMs in one database. I think this makes it clear that we should allow for delineation of metadata records by data source. Perhaps CDM_SOURCE should be the place to identify multiple sources within a CDM, and then METADATA would need a cdm_source_id foreign key field.
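A minimal sketch of that idea, using an in-memory SQLite database from Python. The column lists are trimmed and hypothetical, and `cdm_source_id` as a key on CDM_SOURCE is an assumption made for the example, not part of the current spec.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a toy CDM_SOURCE / METADATA pair linked by cdm_source_id."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
    conn.executescript("""
        CREATE TABLE cdm_source (
            cdm_source_id   INTEGER PRIMARY KEY,  -- assumed key, not in the spec
            cdm_source_name TEXT NOT NULL
        );
        CREATE TABLE metadata (
            metadata_concept_id INTEGER NOT NULL,
            name                TEXT NOT NULL,
            value_as_string     TEXT,
            cdm_source_id       INTEGER NOT NULL
                REFERENCES cdm_source (cdm_source_id)
        );
    """)
    conn.executemany(
        "INSERT INTO cdm_source VALUES (?, ?)",
        [(1, "Inpatient"), (2, "Outpatient")],
    )
    conn.executemany(
        "INSERT INTO metadata VALUES (?, ?, ?, ?)",
        [
            (0, "Example note", "Applies to the inpatient source only.", 1),
            (0, "Example note", "Applies to the outpatient source only.", 2),
        ],
    )
    conn.commit()
    return conn


def notes_for_source(conn: sqlite3.Connection, cdm_source_id: int):
    """Return (name, value_as_string) pairs recorded for one data source."""
    rows = conn.execute(
        "SELECT name, value_as_string FROM metadata WHERE cdm_source_id = ?",
        (cdm_source_id,),
    )
    return rows.fetchall()


if __name__ == "__main__":
    db = build_demo_db()
    print(notes_for_source(db, 1))
```

With the foreign key in place, a metadata row that points at an unknown source is rejected, so every note stays attributable to one of the merged CDMs.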
Will this table have a unique identifier field (i.e.