Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bi-directional link between CARE_SITE and LOCATION_HISTORY #220

Open
pavgra opened this issue Oct 16, 2018 · 82 comments

Comments

Projects
None yet
7 participants
@pavgra
Copy link

commented Oct 16, 2018

Table care_site contains location_id field while rules for location_history also say that care_site can be a target domain:

The entities (and permissible domains) with related locations are: Persons (PERSON), Providers (PROVIDER), and Care Sites (CARE_SITE).

Therefore we get doubled link which leads to inconsistency: which location is true?

@pavgra

This comment has been minimized.

Copy link
Author

commented Oct 16, 2018

The same story is with person. But the provider table looks good. No location_id there but it can be referenced from location_history

@gowthamrao

This comment has been minimized.

Copy link
Contributor

commented Oct 17, 2018

@rtmill proposed the table with discussions here #181

I think the idea is that current location of person, provider and care_site are in location table. The historical location of person, provider and care_site are in location_history table.

i.e. if a person changes address: current address (on ETL) is in person table. The historical addresses are in location_history table with date spans (start_date and end_date).

@gklebanov

This comment has been minimized.

Copy link

commented Oct 17, 2018

I also think that the associations between person, care_site and provider is another time bound fact and these are changing throughout person's life. These should not be direct links between person and care_site and provider tables via FKs but rather through another fact table.

@gowthamrao

This comment has been minimized.

Copy link
Contributor

commented Oct 17, 2018

@gklebanov @pavgra

Would this make it less ambiguous?

Field Required Type Description
location_history_id Yes integer A unique identifier (primary key) for each location_history.
entity_id Yes integer The unique identifier for the entity. A foreign key that references either person_id, provider_id, or care_site_id, depending on location_.
location_entity_concept_id Yes varchar(50) A foreign key identifier to a concept in the CONCEPT table representing the identity of the field represented by LOCATION_ENTITY_ID. Either PERSON, PROVIDER, or CARE_SITE.
location_id Yes integer A foreign key to the location table.
relationship_type_concept_id Yes varchar(50) The type of relationship between location and entity.
start_date Yes date The date the relationship started.
end_date No date The date the relationship ended.

Changed

  • domain_id to location_entity_concept_id
  • added location_history_id as pk
  • changed some descriptions, to explicitly say entity_id is 'foreign key'
  • restructured the tables field positions

@clairblacketer @cgreich i think we should make these changes. These are simple changes and may be considered corrections

@pavgra

This comment has been minimized.

Copy link
Author

commented Oct 17, 2018

@gowthamrao , @gklebanov , location_entity_id and location_entity_concept_id should be replaced with separate many-to-many tables to simplify SQL creation and use native RDBMS mechanisms to enforce data integrity

@rtmill

This comment has been minimized.

Copy link

commented Oct 17, 2018

@pavgra @gklebanov @gowthamrao

Table care_site contains location_id field while rules for location_history also say that care_site can be a target domain:

The entities (and permissible domains) with related locations are: Persons (PERSON), Providers (PROVIDER), and Care Sites (CARE_SITE).

Therefore we get doubled link which leads to inconsistency: which location is true?

The location history table stores relationships to locations from person, care site and provider. The location_id field in each of those tables was left intentionally as a legacy structure, as it's common for implementations to use the field to represent current, or rather last known, location.

I would think the relationships of person to care site and person to provider can be inferred from visits. Are there specific use cases in mind? The only thing that occurs to me is changes to PCP over time.

@gklebanov

This comment has been minimized.

Copy link

commented Oct 17, 2018

Hi Robert,

The location_id field in each of those tables was left intentionally as a legacy structure, as it's common for implementations to use the field to represent current, or rather last known, location.

Since these are facts, why would this be any different from other facts such as drug exposure, conditions, observations etc..? It would be the same as saying let's store "last/current drug taken by a person" or "last/current observed condition". These facts can be obtained from fact tables, including the last known provider or care_site.

Don't get me wrong, I understand why people are trying to do that and that these are less volatile facts but these are facts nonetheless and I would prefer consistency vs. potential data inconsistency issues.

@gowthamrao

This comment has been minimized.

Copy link
Contributor

commented Oct 17, 2018

@gklebanov are you advocating for removing the location_id from person, care_site, provider tables?

@pavgra could you please illustrate this point in a diagram. We are not educated in the theory of relational databases - so we will need some help understanding.

location_entity_id and location_entity_concept_id should be replaced with separate many-to-many tables to simplify SQL creation and use native RDBMS mechanisms to enforce data integrity

we get doubled link which leads to inconsistency:

@rtmill

This comment has been minimized.

Copy link

commented Oct 17, 2018

@gklebanov I share your concern and agree this is something that should be addressed. Specifically, the ohdsi tendency to force many-to-many relationships into one-to-one representations, which was touched on in (point 2) this post and was the motivation for proposing the location_history table.

Perhaps I'm too influenced by @cgreich but it doesn't seem reasonable to adapt the location_history structure to accommodate these relationships if there are no use cases for them, given the added complexity from an ETL perspective and that the content can typically be inferred from visits.

I would lean towards either further specifying these fields to logically force one-to-one (e.g. Person.primary_provider_id, Provider.primary_care_site_id) or in some cases remove them altogether (what is Person.care_site_id used for?).

@pavgra

This comment has been minimized.

Copy link
Author

commented Oct 17, 2018

@cgreich

This comment has been minimized.

Copy link
Contributor

commented Oct 17, 2018

@pavgra and @gklebanov:

Nominally you are right. There is no good reason to model the same thing twice, once as the "current" in the PERSON table, and another one through linking to the LOCATION_HISTORY table. While we are not doing that with the other event tables, there, everything is in the equivalent of the History table.

Except:

  1. There really is a one-to-many relationship to Condition, Drug etc. And I mean MANY MANY. Most data have only one location, making it a de-facto one-to-one. Many have none.
  2. Location is much less fluid than the clinical events. These change every day or even within a day. Locations change in years.
  3. The consequences of information duplication is little. If somebody wants the geographic distribution today they will use PERSON, if they want to study the past or dynamics of moving around they will use the History table and union it to the location in PERSON. Works.

Bottom line: You are right, but making your change to get it right is much more costly than keeping it dirty, and no use cases suffer, which is our ultimate criterion. Cleanliness is not.

@pavgra

This comment has been minimized.

Copy link
Author

commented Oct 17, 2018

@cgreich , I clearly get why proposed many-to-many links cost more than give, but what costs so much with dropping two columns from person and care_site? (in the way, which @gowthamrao described)

@cgreich

This comment has been minimized.

Copy link
Contributor

commented Oct 17, 2018

@pavgra: Any tool/method/cohort definition would have to change.

@pavgra

This comment has been minimized.

Copy link
Author

commented Oct 17, 2018

@cgreich , That's why it is called a major release. To bring in breaking changes. But the changes are clearly reasonable and valuable.

@cgreich

This comment has been minimized.

Copy link
Contributor

commented Oct 17, 2018

Know what? Bring it on. Put out a CDM proposal, and we let the crowd decide. (Except the folks who come to that WG are ideologs, and might outweigh the common sense. I need a "directed democracy" there. :) )

@cgreich

This comment has been minimized.

Copy link
Contributor

commented Oct 17, 2018

Or is this the proposal already?

@pavgra

This comment has been minimized.

Copy link
Author

commented Oct 17, 2018

@cgreich, I believe, @gowthamrao 's post + explicit statement that location_id columns in person and care_site should be dropped = the proposal

@cgreich

This comment has been minimized.

Copy link
Contributor

commented Oct 17, 2018

Right, but that's not debatable and votable at the WG. You need to make a proposal, explain it, show the use case or reason and pros/cons. Can't throw it over the fence, particularly if it is not straightforward.

@pavgra

This comment has been minimized.

Copy link
Author

commented Oct 17, 2018

Removal of location_id column from person and care_site

Relevant tables:

  • person
  • care_site
  • location
  • location_history

Issue

There are bi-directional associations between person / care_site and location tables:

  • person.location_idlocationlocation_history.location_id & location_history.entity_idperson
  • care_site.location_idlocationlocation_history.location_id & location_history.entity_identity_id

Therefore there are two ways to get current location of person / care site: either get by location_id field or calculate via location_history.entity_id, and these two ways can give different results. Since location is a time bound fact and these are changing throughout person's life (also can change for care_site), there can be multiple locations per person / care_site. Therefore we have 1-to-many relationship, which should be implemented by foreign key in location_history table (it was already done via entity_id) and the second way of linking location to person / care_site should be removed to exclude issues with data inconsistency and have only a single source of truth.

Proposed solution

Drop fields:

  • person.location_id
  • care_site.location_id

Pros

  • Data consistency

Cons

  • Additional join to query current location

@cgreich , is this sufficient or anything else is required?

@cgreich

This comment has been minimized.

Copy link
Contributor

commented Oct 17, 2018

No pros and cons, no use cases, but fine. Will bring it on.

@gklebanov

This comment has been minimized.

Copy link

commented Oct 18, 2018

Friends,

I would propose the following changes:

  • remove care_site_id from person
  • remove provider_id from person
  • remove location_id from person
  • merge location and location history into one table and rename it to be location
  • have a one-to-one...many relationship between person, care_site, provider and location(history)

Simple and practical. It would simplify things but yes - I know - many might initially struggle with this. And yes - it is a change.

Person, Care Site or Provider NEVER exist without an address and the opposite is definitely true. In modeling, it is called the "composition". Unless you are a real estate builder- the address would never exist by itself. Moreover, who cares if you have multiple address entries in this table (get rid of Rule 1) - it is just the address stamp, think White Book.

So, to determine the LATEST address, provider or care_site - someone would just literally get the last record. Not index required, just get all and sort by date and use the latest one

here is the pathetic PowerPoint based model;)
image

@gowthamrao

This comment has been minimized.

Copy link
Contributor

commented Oct 18, 2018

What is the PK for the location_history table you are proposing? Where are the lat/long/street_name_1/city/county stored?

@pavgra

This comment has been minimized.

Copy link
Author

commented Oct 18, 2018

Additionally to @gklebanov 's proposal I'd like to add that getting current location can be as simple as WHERE end_date = NULL. And as location & location_history are merged, there is even no need in additional join

@gklebanov

This comment has been minimized.

Copy link

commented Oct 18, 2018

@gowthamrao - here is my thinking:

In my view, the PK should be a system generated ID (location_id) and entity_id would be the "FK" to person, provider or care_site

The "location" - or, simply the physical address - is what is called a composite part of the person (or care_site, provider). It is not the first grade object in OMOP CDM model. What is important is that we track the addresses where those entities reside or resided.

We are not the US Census Bureau and thus do not need to maintain a perfectly clean address book, thus this rule should not even be there

"Each address or Location is unique and is present only once in the table."

Otherwise it will lead to a complex ETL process that has to cleanse every location, do fuzzy matching etc.. just to match a physical address to a person. Never mind, I doubt that this information is clean in raw or even always present. Then, that would also imply that we need to do the Many-Many relationships - just too complex and I do not think is needed for OMOP CDM use cases.

@rtmill

This comment has been minimized.

Copy link

commented Oct 18, 2018

The convention that you have issues with "Each address or Location is unique and present once only in the table" was meant as a conceptual convention for efficiency purposes and should not break anything if you ignore it.

  • remove care_site_id from person
  • remove provider_id from person
  • remove location_id from person

All for it. I could see an eventual need for a provider_history table to record changes to PCP over time, if that's a use case people have.

  • merge location and location history into one table and rename it to be location

Strongly against this.

Person, Care Site or Provider NEVER exist without an address and the opposite is definitely true. In modeling, it is called the "composition". Unless you are a real estate builder- the address would never exist by itself. Moreover, who cares if you have multiple address entries in this table (get rid of Rule 1) - it is just the address stamp, think White Book.

In the real world sure, but I've encountered countless sources of person, care site and provider without address data. The argument that a location does not exist without reference to one of these domains is the biggest difference in our approach.

Reasoning for leaving the two tables separate:

  • A location is not a relationship, it is a distinct entity that exists outside of how it is referenced or related to clinical data. In its essence, it is a coordinate pair that is a point on the globe.
  • Storing every relationship to a location as a location itself causes duplication and inefficiency
    • Establishing equivalence between two geocoded locations is not difficult
      • IF (a.lat == b.lat & a.lon == b.lon) THEN equivalent
    • Arguably the most common operation in this realm will be to translate between person and region. For bringing spatial data into the CDM, this has two main parts:
      • a) For a given region (polygon), find me all of the locations (points) within it
      • b) For this set of locations, find all the people that lived there at a given date
    • The spatial join (point-in-polygon), in part a, is likely to be the most expensive operation. Forcing that operation on every relationship rather than the subset of unique locations is inefficient. Furthermore, part b is simple when indexes on the location_id within the location_history table are created
  • Regions and locations have attributes, independent of the CDM
    • Region examples : poverty rate of county, air quality of tract, provider shortage area, etc
    • Location examples: paper mill exists at [x,y], arsinic measurement for well water at [x,y] (this was the original use case for the GIS WG)
    • Point being, while this isn't needed for your use case, the ability to independently reference locations, outside of the context of clinical data and how they relate to it, is neccessary. This is even more relevant with the functionality being developed by the metadata and annotations work group. @alondhe Correct me if I'm wrong but I believe we'll have the ability to annotate and create metadata for locations, which would need to be replicated for every relationship in this proposed model
@cgreich

This comment has been minimized.

Copy link
Contributor

commented Oct 30, 2018

Friends:

I want to reopen the debate. Because it is so much fun (!) and because I changed my mind after some hardcore hairwashing from Greg. The problem was that we were constantly talking two simultaneous issues, and didn't really listen to each other. At least I might not have. They were:

  1. Duplication of information
  2. Denormalization or normalization of the locations

The advantage of @gklebanov and @pavgra's proposal is simplicity. We'd have one location table, and could phase out the location_id in the PERSON table over time. The disadvantage is what @rtmill points out. Essentially:

  • Assigning locations to polygons or region is compute-expensive, and you don't want to do that for the same location over and over again
  • Comparing locations for identity or calculating distances will be expensive for the same reason.

Here is a compromise: We take the denormalized approach @gklebanov sketched out, but for the compute-intensive stuff we can have an additional normalized reference table. The good thing about that is that all non geo-specific use cases it will be fast. And the little bit of space for repeating the same 3-letter zip - I don't care. Storage is cheap.

Thoughts?

@rtmill

This comment has been minimized.

Copy link

commented Oct 30, 2018

@cgreich I'm not sure I understand.

Assuming you're talking about adopting @gowthamrao 's new proposal, which I think makes sense, what else would change? Removal of location_id from person? If you're proposing representing locations as relations rather than distinct entities please elaborate.

@cgreich

This comment has been minimized.

Copy link
Contributor

commented Oct 30, 2018

@rtmill:

No, no. @gowthamrao talks about the FACT_RELATIONSHIP table. He claims that the LOCATION_HISTORY table is kind of the same thing and we don't need it twice. But I am not talking about that. I am talking about the proposal @gklebanov threw in (above with the diagram of Person, Care Site, Provider and Location). It would be a one to one relationship of Locations to those three domains, with a person_id, care_site_id and provider_id in the LOCATION table. You had issues with that, because it would denormalize that table. The same 3-letter zip would be there a gazillion times, one for each patient, hospital and provider located in that 3-letter zip. So, I understand why you hate it, but I am proposing to solve your problem with an additional normalized LOCATION_REFERENCE or so table which would shrink the repetition down to what really is in the data.

We are really pressing the balloon: If we are trying to be efficient in one place we increase the problem in another, it seems. I am just trying to come out with the optimum.

@rtmill

This comment has been minimized.

Copy link

commented Oct 30, 2018

@cgreich

I changed my mind after some hardcore hairwashing from Greg

We are really pressing the balloon

I can't tell if you're making these up as you go or if you have an endless supply of obscure metaphors 😄

It would be a one to one relationship of Locations to those three domains, with a person_id, care_site_id and provider_id in the LOCATION table
...an additional normalized LOCATION_REFERENCE or so table which would shrink the repetition down to what really is in the data.

It's not obvious to me what advantages introducing one-to-one relationships and a LOCATION_REFERENCE table would have over LOCATION_HISTORY (and assumption of unique locations in LOCATION). You would still be replicating the static data of a location, where it exists on the globe.

Six of one, half dozen of the other. I'd be happy as a clam if you can shine some light on this 😄

@pavgra

This comment has been minimized.

Copy link
Author

commented Oct 30, 2018

Let me also palp the balloon a bit 😄

While I've been hashing the geo polygon topic with @cgreich, following came to my mind:

  • A need for a hierarchy of areas (locations) is sounded both here and there. What is more, such hierarchy fits OMOP vocabularies really well. You have concepts (name of area and its type, valid start / end dates), concept types (city / county / country / etc), you have relations (city is in a country, etc), you have ancestry. So why to invent new tables while this can be easily put into existing polished vocabs mechanism? In this way there is no need for additional LOCATION_REFERENCE table, you just organize e.g. OSM data as an OMOP vocabulary and add admin_area_concept_id field to the location table. And the admin_area_concept_id could be filled e.g. during ETL process by calculating instercestion of the lat-lon from the location table and areas' polygons (e.g. taken from the same OSM). @cgreich , your thoughts?
  • We drop "location_id" field from "person" and "care_site" to make things non-ambiguous and apply @gklebanov 's proposal to reduce amount of tables (although it lacks normalization, it should be compensated by the geo vocabs and the linkage described above)
@AEW0330

This comment has been minimized.

Copy link

commented Nov 6, 2018

We could define a separate source daimon for Spatial DB: if CDM DB supports spatial functions - we specify its connection details, otherwise we bundle PostGIS, configure external connection and specify the PostGIS connection details as the daimon. <

I find this idea appealing though I have no idea what bundling really involves here. Apart from that, I am unclear on a couple of points about the implementation. Do we know what spatial functions should be classified as requirements and used as the criteria for this bundle/no-bundle switch? We might need to enumerate those functions that get used in this build and add others we think will be needed to support common GIS use cases in the future. If we define spatial functions much beyond "can do optimized spatial or spatio-temporal joins", we might not expect there to be many CDM DBs that would get the no-bundle option.

@cgreich

This comment has been minimized.

Copy link
Contributor

commented Nov 6, 2018

Friends:

Oh boy. This conversation is spinning.

  1. Why do we need a LOCATION_HISTORY table? Didn't we say long ago we will have a not-normalized LOCATION table, which will take care of the history? And in it, we'd have a person_id (if it is a person we are a locating) or a care_site_id (if that) and a provider_id.
  2. The latter you want to abolish. We can talk about it.
  3. I still don't understand why we need the vocabulary_id. The concept_id has a unique vocabulary_id. Having it here will create the contradictions @pavgra usually gets upset about.
  4. I thought we wanted to test the assumption that each location is linked to one low-level region, and the rest of the regions have a hierarchical relationship to that one in the vocabulary system.
  5. We really need to pre-calculate distances? An analytic can't take like 10 minutes?

Look: I think all we need is the LOCATION tables which includes the history like @gklebanov proposed. And a hierarchical system of regions in the vocabulary like @pavgra thinks we can build. Let's test how that works. In parallal, @rtmill et al. are going to create a complex representation and test that. And then we compare.

Deal?

@AEW0330

This comment has been minimized.

Copy link

commented Nov 6, 2018

point to polygon relations are many-to-many (not many to one) At least, due to possiblity to have areas represented by different coding systems and containing the same point <

Only if polygons of different types live in the same table. My sense from the Tuft GIS folks is that's not the typical way GIS services are supported if they accommodate multiple sources with different polygons. It might be good to make sure that is the way this will and should go. Their initial take was that we should have separate tables for each class of polygons with associated precalculated variables. I think that is to ensure timely visualization not analysis @cgreich - 10 min wait to add a visualization layer is probably not tenable. I assume that (separate tables per polygon class) is only needed when polygons don't role up. There are plenty of health relevant regional attributes that don't come in administrative or census units. We're still in the process of confirming whether that's their recommendation given the GIS WG use cases. I still don't have a clear sense of how those use cases differ from the AEGIS to ATLAS use cases to feel confident about their relevance here. But, as I wrote previously, I think this project might benefit from a clear sense of how meatadata are used to drive the references and querying from disparate GIS data sources. If that metadata driven strategy is "industry standard", it is bound to effect our thinking what data gets standardized and where it lives and what GIS functions are essential to support natively.

@pavgra

This comment has been minimized.

Copy link
Author

commented Nov 6, 2018

@cgreich ,

  1. The reasons were described above:
  • performance (of both geo relations pre-calc and clustering) with deduped locations would be much better
  • normalization
  1. As soon as you have 2 or more geo vocabs, a single FK would stop working - you'll have 2 low-level areas for one location, just because of 2 vocabs. For the same reason, the location_area.vocabulary_id field was proposed, to differentiate between lower-level area concepts for a single location (although you could derive vocab from concept, the field should enforce a person to keep in mind that only one area concept per location per vocab should be presented in the table).
  2. Analytic cannot use geospatial functions. They are not supported by PDW, Redshift, Impala. It's not just a matter of time, it's a matter of being capable to build cohorts with geo criteria.
@cgreich

This comment has been minimized.

Copy link
Contributor

commented Nov 6, 2018

I assume that (separate tables per polygon class) is only needed when polygons don't role up. There are plenty of health relevant regional attributes that don't come in administrative or census units.

@AEW0330. Look. Why don't we do this: Create a use case. And then we talk. We can always throw additional one-to-man linkers into the mix, but right now I am working off of a hierarchical region system where things role up. As the detault. Because today we do not have the ability to say "Stratify the cohort by US state".

performance (of both geo relations pre-calc and clustering) with deduped locations would be much better

I don't care about performance. We are not building an application. We are modeling the space. If you need to do some calculation, and you need to normalize the data to save time - go for it. We don't need that in the OMOP CDM. In the OMOP CDM we need the important relevant data as simple and compact as possible without the risk of creating contradictions.

As soon as you have 2 or more geo vocabs, a single FK would stop working

I know. So far I have heard the use case "hospital area". Fine. That's not patient-level data we need to harmonize. That is external data.

Analytic cannot use geospatial functions.

That is a good point. But it is not limited to geospatial. The pregnancy model also cannot be run in RDBS, it is an R script today. Here, we do what we can do in a RDBS. That's the point of this.

So, let me repeat. Let's do:

  1. The minimal model we (all of you, really) proposed, plus testing it out. We can use any data, even if it has only 3-letter zip. Use case: stratify by state.
  2. A propler geospatial model the way @rtmill and @AEW0330 are thinking of, plus testing it out. Do we have an OMOP CDM with potential geo information anywhere to test it? Use case: stratify by hospital region or by distance to care site.

Yes? :)

@gowthamrao

This comment has been minimized.

Copy link
Contributor

commented Nov 6, 2018

@cgreich

Is there suggestion to create a new schema (separate from CDM, results, vocabulary schemas) called DERIVED schema - designed for precalculated/precomputed/application specific tables, that are derived from the omop CDM? Advantage being, these are managed by the application developers - not the CDM wg?

@cgreich

This comment has been minimized.

Copy link
Contributor

commented Nov 7, 2018

Friends:

This is a war of attrition, it seems. :) Had another discussion with @pavgra and @gowthamrao, and I think we now are ready for a compromise. Here it is:

  1. The collapsing of LOCATION and LOCATION_HISTORY and the drop of location_id from PERSON and CARE_SITE will happen, but not now. Not urgent. We can postpone.
  2. We add region_concept_id to LOCATION and introduce the geo vocabs. This is for standard regions.
  3. We add the CUSTOM_AREA for @gowthamrao's dynamic areas.
  4. We add the LOCATION_DISTANCE table for precalculated distances. It will contain only Person-Care Site pairs that exist in the data, which will make the size palatable.
@pavgra

This comment has been minimized.

Copy link
Author

commented Nov 7, 2018

Two edits:

  • person.location_id and care_site.location_id are marked as deprecated
  • I would still propose to use LOCATION_AREA name instead of CUSTOM_AREA to have consistent naming

@cgreich , please confirm

@gklebanov

This comment has been minimized.

Copy link

commented Dec 5, 2018

@cgreich - how do we get this proposal out on a table and have it ratified?

@pavgra

This comment has been minimized.

Copy link
Author

commented Mar 6, 2019

@cgreich , could you tell what's the ETA for:

  1. We add region_concept_id to LOCATION
  2. We add the LOCATION_DISTANCE table for precalculated distances. It will contain only Person-Care Site pairs that exist in the data, which will make the size palatable.

?

@clairblacketer

This comment has been minimized.

Copy link
Contributor

commented Mar 6, 2019

@gklebanov @pavgra we can put this on the agenda for the April meeting

@cgreich

This comment has been minimized.

Copy link
Contributor

commented Mar 6, 2019

@pavgra:

  1. As @clairblacketer said. On the agenda for next month. However, if you want to risk it go ahead and implement it already. I don't see any issues.
  2. That will induce significant discussion. Even though you don't do the entire Cartesian product, there are still numerous Care Sites for each patient, multiplying the size of the PERSON table. And then we have the Location History. Is that in there as well? I know you want to precalculate this so it needn't be noodled at analysis time, but again, folks will have opinions.
@pavgra

This comment has been minimized.

Copy link
Author

commented Mar 6, 2019

@cgreich :

there are still numerous Care Sites for each patient, multiplying the size of the PERSON table

I doubt that each patient presented in a database will have visits to all care sites

And then we have the Location History. Is that in there as well?

It is related only for the process of getting relations between a person and care sites

@cgreich

This comment has been minimized.

Copy link
Contributor

commented Mar 7, 2019

Well, one way to help the discussion is to count the number of Care Sites per patient in a database or so, so we know what the damage would be.

It is related only for the process of getting relations between a person and care sites

You said that. And the person has a Location History. So, if your use case is to use the distance from the Person's home to the Care Site for some analytic, it would be a different place before and after the Person moved. So the total size is # Person * average # Care Sites * average # Locations per person.

What is your use case anyway?

@pavgra

This comment has been minimized.

Copy link
Author

commented Mar 7, 2019

From the OHDSI/WebAPI#649 (comment) :

Example in Atlas/circe-be: that should be supported at both cohort entry events (qualified events) and inclusion criteria (inclusion events) are:
where the care_site.location_id was within x units of distance from person.location_id (eucledian distance between two points defined by lat/long)

I am not aware of other proposed solutions to solve the problem in such way that would work for all OHDSI supported DBs. If you have such, would be nice to hear

what the damage would be

Even though the numbers are going to be quite big, search be distance intervals is quite efficient both in relational DBs (covered by B+ tree, close to logarithmic complexity) and column oriented ones (due to compression). So this should not differ so much from domain tables lookup by some concept id

@rtmill

This comment has been minimized.

Copy link

commented Mar 7, 2019

@gowthamrao @pavgra @cgreich @AEW0330

I am very sorry for dropping the ball on this conversation. No idea how I missed the comments back in November.

want be sure that @rtmill is happy.

Finally somebody with their priorities straight! 😄

Guys, this is a complicated topic. We are trying to integrate a whole new realm of content, both in terms data (non person-centric) and functionality (requirements not universally supported by all dbms flavors). These conversations are incredibly productive and I hope they continue, and I apologize for the opacity and snail's pace of the GIS WG (we're getting there), but forcing the above implementation into the CDM, in my opinion, would be a step backwards.

I understand there are deadlines to consider here but this seems to be shortsighted fix for a small handful of use cases. These sprints are invaluable for figuring out the details of this project, but do we need to rip apart the CDM to do them? If you agree that the above proposal is not a long term extensible strategy, then we are adding to the CDM now just to tear it down later.

Food for thought: What if we consider it to be a GIS 'extension', at least while we work on development. The CDM doesn't get modified and we segregate all GIS tables to a different schema or database. We could then add the connection information and type to source and source_daimon (or a similar approach), and build the functionality into ATLAS just the same. When we need specific calculations, at least those that I see in your use cases, we can always leverage R (distance 1, distance 2, location to region).

Sincerely yours,
Stick-in-the-mud

@pavgra

This comment has been minimized.

Copy link
Author

commented Mar 7, 2019

@rtmill , a couple of questions:

  • what is your proposal / what is ETA for getting the proposal?
  • how you are going to support all DB flavors? (your suggestion for using R won't work since there is a very reasonable need to use location criteria during cohort definition - see the AEGIS issue in WebAPI)

I believe, these questions were raised a while ago and there were no answers from you. So either we can do and evaluate what is simple and works or we can wait till no one knows when for a theoretically ideal solution.

@rtmill

This comment has been minimized.

Copy link

commented Mar 7, 2019

@pavgra We have been focused on other portions of the project (schema and staging data specifically) but I can take a dive into alternate approaches, if that's what you're asking for.

I'll need to do my homework and get more familiar with circe and atlas. Could you provide any more information about why R isn't an option beyond:

Limitations of geo criteria executed in R
If we imagine execution of Geo criteria in R, the following issues come out:
Geo criteria can be placed only into "Inclusion criteria" section since it should be processed separately (by R).
Geo criteria cannot be nested or cannot nest other criteria since it should be processed separately (by R).
Since geo criteria require processing by separate component, we cannot really embed them into circe expression - any change in circe expression (cohort JSON) would require reflection in Exported SQL.
These limitations are not acceptable.

Is it a matter of functionality or design principles? As in, if we brewed up a solution that did use R, would the atlas folks be amenable?

@cgreich

This comment has been minimized.

Copy link
Contributor

commented Mar 8, 2019

I need two proposals, one for each. The region_concept_id I got (even though flimsy), but we can bring it up and ratify at the next meeting, and I am sure it will go through.

The LOCATION_DISTANCE I still need. Don't get me wrong, I am not undermining it. All I am saying is there will be more discussion.

@pavgra

This comment has been minimized.

Copy link
Author

commented Mar 8, 2019

@cgreich , what are the questions which still remain? I thought I've explained why there shouldn't be performance concerns and why it doesn't differ from other domain tables. In other words, what's required?

@cgreich

This comment has been minimized.

Copy link
Contributor

commented Mar 8, 2019

Nothing. I am not the one to argue with. We will bring it to the CDM WG, and that's where you have to defend it. But you need the proposals to @clairblacketer as an Issue to the Common Data Model.

@cgreich

This comment has been minimized.

Copy link
Contributor

commented Mar 10, 2019

Also: We need a instruction sheet with the proposal to explain how that information is used, and how an ETLer determines for each LOCATION record the correct region_concept_id based on a calculation which polygon a point (address, zip code, 3-digit zip) is located in. Makes sense?

@pavgra

This comment has been minimized.

Copy link
Author

commented Mar 13, 2019

The first part - #252

@pavgra

This comment has been minimized.

Copy link
Author

commented Mar 13, 2019

The second part - #253

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.