Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fact_relationship changes #230

Open
gowthamrao opened this issue Oct 30, 2018 · 21 comments
Open

Fact_relationship changes #230

gowthamrao opened this issue Oct 30, 2018 · 21 comments
Labels

Comments

@gowthamrao
Copy link
Member

We decided to introduce concept_id's for each field in OMOP cdm. e.g. we will be assigning a concept_id for person.person_id, visit_occurrence.visit_occurrence_id etc. To be released in future version of vocabulary. e.g. cost.cost_event_field_concept_id needs concept_id for each identify of the field.

Proposal 1: This proposal is to change FACT_RELATIONSHIP table's domain_concept_id to field_concept_id, where field_concept_id is the identity of the field in omop cdm.

Field Required Type Description
domain_concept_id_1 Yes integer The concept representing the domain of fact one, from which the corresponding table can be inferred.
field_concept_id_1 Yes integer A foreign key identifier to a concept in the concept table representing the identity of the field of fact one.
fact_id_1 Yes integer The unique identifier in the table corresponding to the domain of fact one.
domain_concept_id_2 Yes integer The concept representing the domain of fact two, from which the corresponding table can be inferred.
field_concept_id_2 Yes integer A foreign key identifier to a concept in the concept table representing the identity of the field of fact two.
fact_id_2 Yes integer The unique identifier in the table corresponding to the domain of fact two.
relationship_concept_id Yes integer A foreign key to a Standard Concept ID of relationship in the Standardized Vocabularies.

Proposal 1: will help overcome domain ambiguity such as does the domain_id = 'Visit' represent visit_occurrence or visit_detail.

@gowthamrao
Copy link
Member Author

Proposal 2: relationship are generally valid for a certain interval of time, e.g. start_date and end_date. We propose adding two optional fields relationship_start_date_time and relationship_end_date_time to fact_relationship table.

Examples of relationships:

  • John (person_id fact_id_1) had Dr. Smith (provider_id fact_id_2) as PCP (relationship_concept_id) starting Jan 1st 2017 (start_datetime) to Dec 31st 2017 (end_datetime)
  • Michael (person_id fact_id_1) was residing in his home (relationship_concept_id) located at location_id (fact_id_2) starting Jan 1st 2017 (start_datetime) to Dec 31st 2017 (end_datetime)
  • Dr. Smith (provider_id fact_id_1) was credentialed as oncologist (relationship_concept_id) at care_site_id (fact_id_2) starting Jan 1st 2017 (start_datetime) to Dec 31st 2017 (end_datetime)
Field Required Type Description
field_concept_id_1 Yes integer A foreign key identifier to a concept in the concept table representing the identity of the field of fact one.
fact_id_1 Yes integer The unique identifier in the table corresponding to the domain of fact one.
field_concept_id_2 Yes integer A foreign key identifier to a concept in the concept table representing the identity of the field of fact two.
fact_id_2 Yes integer The unique identifier in the table corresponding to the domain of fact two.
relationship_concept_id Yes integer A foreign key to a Standard Concept ID of relationship in the Standardized Vocabularies.
relationship_start_datetime No datetime Datetime of start of the relationship between fact one and fact two
relationship_end_datetime No datetime Datetime of end of the relationship between fact one and fact two

@gowthamrao
Copy link
Member Author

There are some fact_relationship like tables in OMOP CDM, that may be subsumed by this proposal. e.g. Location_history. Location_history may be replaced by this new fact_relationship, because location_history is similar to fact_relationship table. See below mapping between location_history to fact_relationship.

location_history (current) fact_relationship (new) fact_relationship
location_id fact_id_2 fact_id_2
relationship_type_concept_id relationship_concept_id relationship_concept_id
domain_id domain_concept_id_2 field_concept_id_2
entity_id fact_id_1 fact_id_1
start_date   relationship_start_datetime
end_date   relationship_end_datetime
  domain_concept_id_1 field_concept_id_1

@cgreich
Copy link
Contributor

cgreich commented Oct 30, 2018

Like it. If I knew how I'd put this thumb up emoticon in. But we need a default if the relationship is eternal. Something like in the vocabs: 1-Jan-1970 to 31-Dec-2099

@gowthamrao
Copy link
Member Author

gowthamrao commented Oct 30, 2018

But we need a default if the relationship is eternal. Something like in the vocabs: 1-Jan-1970 to 31-Dec-2099

should datetime's be optional or required? if required, i agree with default.

Alternatively, we can keep them optional and assume during analytic time that if the datetimes are null then the value for the interval is from earliest observation_period_start_date to latest observation_period_end_date (computed during analysis time - similar to how we build cohorts in Atlas, where observation_period_end_date is the default cohort_end_date).

@gowthamrao
Copy link
Member Author

gowthamrao commented Oct 30, 2018

Proposal 3: when we are making changes, why not add _type_concept_id

Field Required Type Description
field_concept_id_1 Yes integer A foreign key identifier to a concept in the concept table representing the identity of the field of fact one.
fact_id_1 Yes integer The unique identifier in the table corresponding to the domain of fact one.
field_concept_id_2 Yes integer A foreign key identifier to a concept in the concept table representing the identity of the field of fact two.
fact_id_2 Yes integer The unique identifier in the table corresponding to the domain of fact two.
relationship_concept_id Yes integer A foreign key to a Standard Concept ID of relationship in the Standardized Vocabularies.
relationship_type_concept_id Yes integer A foreign key identifier to a concept in the CONCEPT table for the provenance or the source of the relationship. E.g. derived during ETL, vs. mapped from source
relationship_start_datetime No datetime Datetime of start of the relationship between fact one and fact two
relationship_end_datetime No datetime Datetime of end of the relationship between fact one and fact two

Example:

  • relationships between a person_id and provider_id may come from source data e.g. reported by person in personal health assessment, reported by providers EHR. relationship between a person_id and provider_id may be derived during ETL using an algorithm e.g. attribution algorithms
  • oncology workgroup has proposed episode table here . One requirement is to related higher level episodes to lower level events. These relationship may be from source data (EHR may capture cancer treatment cycles as episodes of care), or may be derived during ETL thru an algorithm.

@pavgra
Copy link

pavgra commented Oct 30, 2018

Although the proposal 2 itself sounds good and allows to remove location_history table, I would be very careful with the general idea of untypified relations. Having more flexibility, you lose both logical separation of concerns between tables and physical data integrity checks provided by RDBMS (e.g. you know that in location_history you can put only relations of location and 3 domains; while with fact_relationship you can do anything and it is not clear whether it is correct or not). So moving in that direction too fast can easily leave you with RDF and pure ontologies.

@rtmill
Copy link

rtmill commented Oct 30, 2018

It makes sense and its a clean design. To me its a trade off between neat and compact vs. efficient and segregated. I think there's potential for this table becoming massive but it seems like a universal means to get at these one-to-many issues that need to be addressed (e.g. multiple care sites for a provider, multiple providers for a patient, etc).

@pavgra
Copy link

pavgra commented Oct 30, 2018

I would propose to treat fact_relationship table as a POC table. So that when you have a new use-case which requires some new relations between tables (but those relations don't exist yet), you could use fact_relationship to model the relations w/o a need to modify CDM schema. But the approach has some drawbacks:

  • The table is not self-descriptive: when you see a FK clearly referring to a certain field in a certain table, it's much more clear than having a generic table w/o any constraints
  • All data from different domains go into the single table - it takes more time to query the table

So, once a POC has been validated and common use-case was figured out, a person can propose to create permanent tables so that we:

  • don't face performance issues with storing everything in a single table
  • have RDBMS data integrity checks (FKs)

So, following the idea, location_history looks like such typical use-case and is a valid standalone entity, while there are some other cases where fact_relationship should be enough for now.

@cgreich
Copy link
Contributor

cgreich commented Oct 30, 2018

@pavgra: Point taken. The FACT_RELATIONSHIP_TABLE should really be the last resort and contain the relationships for which creating a correctly modeled concrete RDBMS solution with FKs is just not worth it, because we will need it like 21 times for the 1.9B patients. All others should be modeled properly.

@gowthamrao
Copy link
Member Author

@cgreich @pavgra like everyone else, what i like about the fact_relationship table is that it is flexible. what i dont like is that it is flexible. Flexibility allows us to use one structure for an infinite number of use cases, but these uses are hard to solve in a fact_relationship table (see points that @pavgra made). Also, I like fact_realtionship table because ETL can be done in a generic way - and we dont have to revise our ETL every time for POC'ing a new use cases.

I like @pavgra comment of using fact_relationship table as the source for a POC table. Right now, like in location_history, we recommend new tables in CDM WG based on expert opinion - and not a POC. If we take a POC approach, then one requirement before approving a new table (especially that related two facts together) is a POC.

using the generic fact_relationship table, we could POC three new tables that follow proper RDBMS referential integrity - all derived from fact_relationship table. i.e. instead of location_history (that has location_id with referential integrity, but entity_id has no referential integrity), we could create person_location_history, provider_location_history, care_site_location_history each with RDBMS referential integrity - test it out in POC, build software in dev brach, validate it: and then ask the community to ratify validated tables person_location_history, provider_location_history etc. Plus these tables will be smaller in size and not go to billions of rows (i.e. performance)

In short - i like @pavgra comment of using fact_relationship as the POC.

@pavgra
Copy link

pavgra commented Oct 30, 2018

Right now, like in location_history, we recommend new tables in CDM WG based on expert opinion - and not a POC. If we take a POC approach, then one requirement before approving a new table (especially that related two facts together) is a POC.

Gowtham, that would be so great!

@rtmill
Copy link

rtmill commented Oct 30, 2018

@gowthamrao @pavgra
I get that its preferable to maintain referential integrity but at what cost? A query to check the validity of these relationships would be simple enough.

Point being, if we keep going down this path we're going to end up with an excessive number of tables.
e.g.
person_location_history
provider_location_history
care_site_location_history
person_care_site_history
person_provider_history
provider_care_site_history

And if we want to be consistent then we need to remove similar approaches in other tables. OBSERVATION_EVENT_ID and OBS_EVENT_FIELD_ID for example, would then require additional tables:
observation_procedure_history
observation_measurement_history
observation_diagnosis_history
etc

Is it worth it?

@pavgra
Copy link

pavgra commented Oct 30, 2018

@rtmill , for me it sounds that you go to extremum, why don't you want to end up then with a single table with three field - "subject", "relationship", "object" (which is RDF)? It would cover the whole CDM.

Although I'd like to have a relations table per each domain pair, I have to admit that it will expand CDM too much and too fast for regular business users. So, I just stated that the hybrid approach, where subject is typified (e.g. location_history.location_id is a concrete FK), relation is described via the table itself (location_history - relation of some domain with location) and object (person / care_site / ...) is dynamic, is the least evil between two extremums (until there is a chance to have normalized relational CDM and DW CDM).

@AEW0330
Copy link

AEW0330 commented Oct 31, 2018

I think this is such an important structural issue that it deserves a wider community conversation. I like the balance @pavgra proposes and agree we don't want fact_relationship to ingest and annotate the rest of the CDM. But I also don't know the answer to the question implicit in @rtmill 's comments: how much expansion can the CDM's primary relational structure support? And that seems to be an increasingly pressing issue.

@gowthamrao
Copy link
Member Author

@AEW0330

Welcome to OHDSI! Maybe you already have, if not, could you please introduce yourself here
http://forums.ohdsi.org/t/welcome-to-ohdsi-please-introduce-yourself/704

@AEW0330
Copy link

AEW0330 commented Oct 31, 2018

AEW0330 is my (Andrew Williams) github ID. You know me :)

@mgurley
Copy link

mgurley commented Oct 31, 2018

I like @pavgra description of FACT_RELATIONSHIP as a POC for adding foreign key relationships. In that vein, what is the recommend/conventional relationship to use in FACT_RELATIONSHIP to represent a 'has many/belongs to' relationship?

@cgreich
Copy link
Contributor

cgreich commented Nov 3, 2018

Friends:

Two things:

  1. Not sure about the POC. We cannot jerk everybody around all the time. The CDM has to move slowly. Unless by "POC" you mean something we never standardize or document in the main model. Which one?
  2. The timing for the FR: I first like the idea, then realized all the Facts already have a timing. Except PERSON, PROVIDER and CARE_SITE. Do you have these specificially in mind, @gowthamrao?

@dimshitc
Copy link
Collaborator

Where are with this?
FACT_RELATIONSHIP still has these ugly domain_concept_ids

@clairblacketer
Copy link
Contributor

Agree with @dimshitc - @gowthamrao, @cgreich is this proposal ready to discuss?

@cgreich
Copy link
Contributor

cgreich commented Apr 26, 2019

Yes, Ma'am.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

8 participants