Implements distribution of field values in CC #1449
Conversation
Greetings Devs,

Current Behavior Issues

@anthonysena and I constructed an example of defining a Prevalence and Distribution feature in Atlas under the current (pre-PR) form. Things to note about the above: in both cases, I'm defining 2 criteria features: one for DMARDs and one for a single ingredient. If you take a look at the output, you'll notice that the prevalence features each appear with the given name, but for the distribution, it's not clear what is happening to get the single row of output. I haven't checked the implementation, but I'm confused as to why the distribution editor allows you to define 2 separate criteria and name them, when nothing resembling them is displayed in the report.

In addition, I'm not clear where the result statistics are coming from: the prevalence statistics show 3.9k for methotrexate, so that could be where the 3.9k under the 'Count' column of the distribution table is coming from, but how does that result in an average of 0.04? If the 3.9k prevalence says there are 3.9k people with the exposure, wouldn't the 3.9k shown in the distribution table mean that everyone has at least 1 record?

New Distribution Designer

Item 1: multiple criteria being defined in a distribution feature

If it is the case that all the criteria get grouped together into 1 set of data (and the names of those individual criteria are never used), I suggest that we limit the criteria defined in the distribution feature to only 1 criteria.

Item 2: defining conflicting aggregation types vs. the type of criteria feature

Using the proposed UI, you can define a feature using one criteria type but apply an invalid aggregation operation on it. Example: define a criteria for drug exposure, and specify an 'average measurement result' as the aggregation. Related to Item 1: if we restrict the distribution feature to be single-criteria based, you can display the appropriate aggregation options based on the domain type of the criteria.
Here's the current UI:

Item 3: allowing multiple criteria in a single distribution analysis

If the decision is that we should allow multiple distribution features in a single feature design (ie: allow the user to add multiple criteria, with individual names, and have those statistics appear in the UI), then we should move the 'aggregation type' selector to be embedded with each criteria that is added to the feature (ie: not at the top level being applied to all criteria). The reason for this is that you may define a set of distribution features where you select a drug exposure for 'distinct concepts per person', a visit criteria for 'length of stay', and others. So the aggregation operation should really be based on the domain criteria type, and not applied at the overall level.

In either case of Item 2 or Item 3, I propose that the UI should be altered in the following way:

Note: the 'distribution chooser' will be filtered to only show options that will work for the selected criteria type, ie: Any + {domain specific operations}.
From review with @wivern and @olga-ganina:
@anthonysena @chrisknoll I've made changes regarding your suggestions and review. Would you like to review this again? I'd like to get some feedback from your side. Honestly speaking, some details are still not completely clear to me.
I'll work with @anthonysena and let you know. If you could give us a few examples of things that are not clear to you, we can try to address those too.
@chrisknoll Thank you for reviewing and commenting on this.
Here's a list:

Any

Note: the reason to perform max/min/avg on top of the 'raw' duration value is that some people may have many more records than others, so you may just want a way to get a single record per person for the distribution (ie: only use the max value, or only use the min value, or take all their records and create an average value).

Condition Era
Condition Occurrence

I don't think there's anything in this Domain that we wouldn't get from the

Death

I don't think it's worth giving the option for Death...'distribution of death for a person' doesn't make sense to me.

Drug Era
Drug Exposure
Measurement
Note on Measurement distributions: people should understand that they should specify the unit of the measurement record so that we don't mix, for example, 'weight in Kg' vs. 'weight in Lbs'. This is on the user, and we don't need to code some sort of 'intelligence' into the tool to prevent that error.

Observation
Procedure
Visit Occurrence
Should Missing mean Zero?

In some of our tools, we have the idea that 'missing means zero'. This option is used when, if you don't find a record for a person, you should use zero as the value for the person's distribution.

So, the simple approach is to only include people in the distribution who have at least one countable record. In the example of 'distinct dates', the distribution will show a min value of at least 1 because we're only going to include people that had at least 1 distinct date. The more complex approach is to provide an option for 'missing means zero', and if they use that, we should include a zero value for each person who has no matching record.

For this PR, I'd go with the simple approach and get feedback from the community about handling this use case; if we want to handle it, we just add a new field to the 'criteria dist feature'.

Examples

To illustrate the difference in results of the raw value distribution vs. the min/max/avg options: if you have a distribution of Visit Occurrence length of stay, and your cohort has 3 people in it:
Using the raw value of 'length of stay', you would produce a distribution from the values (the below would just be 1 list, but I'm grouping them by person for illustration):

min length of stay would use:

I think this approach makes sense, but I'd like to hear from @pbr6cornell if this approach makes sense.

Demographic Criteria
I don't think we'd construct a distribution criteria from 'demographic criteria', because demographic criteria just lets you put criteria around the cohort start_date. In other words, 'demographic criteria' does not yield any records by itself. In fact, it just yields the cohort record if the person for this cohort record matches the demographic criteria. You'd just get a 'single record' per person if the person matches the demographic criteria at the cohort_start_date.

Update: the following is not correct, the correction is below.

To do what you described (e.g. count an average inpatient visit duration for people with age between 30 and 50) you would use a
That would find visits only for people who are between 30 and 50 at cohort_start.
Correction

In the above example, I said that using the
To do what you want (only consider people who are age 30-50 in the cohort), we would use the 'stratify criteria' to sub-set the population into the specific group of interest. When creating these distribution criteria, they apply to everyone in the target population equally, and I think this is the correct behavior: we wouldn't want some criteria to only consider one subset of people and other criteria to apply to another subset. What's nice about the strata criteria is that it creates those sub-groups, and then all features and distributions are calculated for all people in each of those sub-groups.

End of Correction

Hope that all makes sense.
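To make the raw-value vs. min/max/avg distinction from the Examples section concrete, here is a small Python sketch. The length-of-stay values and person IDs are made up for illustration only; they are not from any real result set.

```python
# Hypothetical data: length-of-stay values per person (person_id -> list of stays).
stays = {
    1: [3, 10, 2],   # person 1 has three visits
    2: [5],          # person 2 has one visit
    3: [7, 7],       # person 3 has two visits
}

# Raw-value distribution: pool every record into one list, regardless of person.
raw = [v for person_values in stays.values() for v in person_values]

# Per-person aggregation: first reduce each person to a single value,
# then build the distribution from those per-person values. This gives
# exactly one record per person, no matter how many visits they had.
per_person_min = [min(v) for v in stays.values()]
per_person_max = [max(v) for v in stays.values()]
per_person_avg = [sum(v) / len(v) for v in stays.values()]
```

Person 1's three visits contribute three values to the raw distribution but only one value to each of the per-person distributions, which is the point of offering max/min/avg on top of the raw value.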
# Conflicts: # src/main/java/org/ohdsi/webapi/feanalysis/FeAnalysisServiceImpl.java
# Conflicts: # src/main/java/org/ohdsi/webapi/feanalysis/FeAnalysisService.java
```diff
@@ -22,5 +22,7 @@ CREATE TABLE @results_schema.cc_results
   p75_value DOUBLE PRECISION,
   p90_value DOUBLE PRECISION,
   max_value DOUBLE PRECISION,
-  cohort_definition_id BIGINT
+  cohort_definition_id BIGINT,
+  aggregate_id INTEGER,
```
Noting that these changes will require an update to the results schema when merged into master.
# Conflicts: # src/main/java/org/ohdsi/webapi/feanalysis/FeAnalysisController.java # src/main/java/org/ohdsi/webapi/feanalysis/FeAnalysisService.java
…cc-values Conflicts: src/main/java/org/ohdsi/webapi/feanalysis/FeAnalysisController.java src/main/java/org/ohdsi/webapi/feanalysis/FeAnalysisService.java src/main/java/org/ohdsi/webapi/feanalysis/FeAnalysisServiceImpl.java
I've just pushed updates to resolve conflicts. I'd appreciate it if someone else could review this to ensure I did not miss anything. Thank you.
When testing this, I attempted to ensure that I was using the correct version of SkeletonCohortCharacterization, but there are conflicts when attempting to merge master into the issue-1220-cc-values branch: https://github.com/OHDSI/SkeletonCohortCharacterization/tree/issue-1220-cc-values. However, I ran into issues where I could not determine the correct way to resolve the conflicts. Could someone with knowledge of these changes please resolve the conflicts and push updates to the issue-1220-cc-values branch in SkeletonCohortCharacterization?
```sql
CREATE SEQUENCE ${ohdsiSchema}.fe_aggregate_sequence START WITH 1;

CREATE TABLE ${ohdsiSchema}.fe_analysis_aggregate(
  id INTEGER NOT NULL,
  name VARCHAR(255) NOT NULL,
  domain VARCHAR(64),
  agg_function VARCHAR(64),
  expression CLOB,
  agg_query CLOB,
  is_default NUMBER(1) DEFAULT 0,
  CONSTRAINT pk_fe_aggregate PRIMARY KEY(id)
);
```
I have to admit this table confused me a little: when getting the code updated, I realized I need to have StandardAnalysisAPI, SkeletonCohortCharacterization, Atlas, and WebAPI all pointing to the same branches...expecting SkeletonCohortCharacterization to contain the bulk of the logic for performing the characterization, and everything else just supporting or depending on it.

But, looking at these statements, it looks like the implementation of how to perform the cohort characterization (at least from the distribution perspective) is implemented within WebAPI. Any reason why this wasn't enclosed in SkeletonCohortCharacterization (like FeatureAnalysis does)?

I can understand a link between some analysis / aggregate ID specified in SkelCohortChar that lets you present choices in the UI for options available in SkelCohortChar, but I didn't think the actual aggregation statements and join conditions would be applied here.

If someone wants to do characterization in R, do they have to import something from WebAPI, or is this information duplicated?
Let me know if I'm missing something.
Found an issue with how 'distinct start dates' was being handled. Here is the distribution design:

Focusing on the first one: getting the counts of distinct start dates of diagnoses of Diabetes. I checked the data, and these events do exist; however, the query that's generated does not select the right records (or the right dates), as I'll describe. This is the part of the query for this distribution feature:
The qualified_events CTE is correct: you want to assign an event_id to each record, partitioned by the subject_id (via ROW_NUMBER()), and then attach the observation period start/end dates to the event. Note: the 'event' is simply the cohort episode for the person...the qualified_events CTE should be the same no matter what distribution value you're getting. The first problem, though, is in the SELECT immediately following:
This is supposed to be counting distinct condition_start_dates, but it's counting observation_period_start_dates. Next, it appears that the query is attempting to perform a 'group criteria' subquery by doing this:
The SELECT 0 part is used to do operations like:
The query (if it was doing a group criteria) would have something like this at the end:
But, we don't want to do any sort of group logic in the query...so this part isn't correct...and before I propose what I think the correct form should be, let me point out the last problem. The last part of the query has the problem where it's attempting to do something with an observation period, but it's incorrect/not applicable:
The JOIN condition for A (where it looks at A.START_DATE >= DATEADD(day,-365,P.START_DATE)) is correct: this is saying 'the condition start date (A) has to be later than 365 days before the cohort start date (P.START_DATE), and the condition start date (A) has to be before the cohort start date (P.START_DATE + 0 days)'. Those are the condition occurrence records we want: count the distinct dates of condition occurrences of X that occur between 365 days before and 0 days before the cohort start date. The records from A are what we should use for counting.

But, the query is taking records from V as the records that we want to use to count our distribution value. Additionally, there is a join to the observation period where the observation_period_start_date is > E.start_date (remember: E = cohort events) and observation_period_end_date <= E.end_date. This means we're trying to find an observation period that fits inside a cohort event, which is impossible: observation periods are the longest spans of time that can appear. I think this join to observation_period and the sub-query V can be changed.

How the query functions is that the Condition criteria is built from a 'WindowCriteria' such that, given a set of cohort events (qualified_events), it finds the condition_occurrence records relative to the cohort event's start/end date. The WindowCriteria sql builder should handle all this logic...it just assumes you have a 'qualified_events' table (the CTE) to join to. The correct form of the above query should look something like this:

Note: the sub-query 'A' is the part that fetches the domain-specific records; it re-aliases the condition_start_date to 'start_date' and figures out what the condition_end_date should be (by using either an end date or start_date + 1 day). It is the records from A that you want to use for the distribution counts...that's why we do a COUNT(DISTINCT a.start_date) at the start. I'm fairly certain that there is a call that will convert the ConditionOccurrence Criteria object into the query that is between the

Let me know if anything above is not clear; I understand it's a bit to unpack.
Thanks for the hard work on this @wivern. At this point, the changes look good from a 'window criteria' perspective, such that when I was looking for quantity values or distinct start date values, it was building the queries properly (as far as I could tell), yielding the individual records. However, a concern arises with building the distribution results. This is the query:
Looking at the structure of this code, I'm guessing that you based this on distribution code that you found in Feature Extraction, such as here: https://github.com/OHDSI/FeatureExtraction/blob/master/inst/sql/sql_server/ConceptCounts.sql. @schuemie: we'd like your input on the following.

First issue is the column:

Then, in another column we have:

So, these two items have me a little confused, and I'm not sure what the correct answer is. I just wanted to be clear about what was being calculated in these statistics and also, if possible, see how they'd be used in practice. For your consideration, this is a query that replicates the distribution query without people_count or count_no_value; it simply calculates the distribution quartiles based on the input set of values:
Again, I'd like to hear from @schuemie about how Feature Extraction calculates these distribution values and what his understanding is of these. -Chris
Sorry, I don't have a lot of time, but trying to shine some light on my code: for example, take the Charlson comorbidity index. People that have none of the components that make up this score don't have a score entry. This should be interpreted as 'Charlson comorbidity index = 0'. So when computing, for example, the median, I need to make sure I account for all those zeroes that are implied by the people not having a record.
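The implied-zeros effect on a median can be sketched in a few lines of Python. The cohort size and score values below are hypothetical, chosen only to show how much the two interpretations can diverge:

```python
import statistics

# Hypothetical cohort: 5 people, but only 2 have a Charlson score record.
total_people = 5
scores = [2, 3]  # observed score values; the other 3 people have no record

# Median over observed records only ignores the implied zeros.
naive_median = statistics.median(scores)

# 'Missing means zero': pad one 0 per person without a record,
# then compute the median over the full cohort.
padded = scores + [0] * (total_people - len(scores))
padded_median = statistics.median(padded)
```

With a majority of the cohort lacking a record, the padded median collapses to 0 while the record-only median sits between the observed values, which is why the choice of interpretation matters.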
Thanks @schuemie. @wivern: I don't suggest we follow this, since for our purposes we're calculating a distribution based on records selected from a domain, and not something calculated like a score where a valid result is 0. In cases where we want a distribution of quantity, or refills, or durations, we wouldn't want to yield a 0 value as a

Thinking on this further, though, even if we wanted to count '0' values, we could just yield those records (via a left join +

I think I need a second opinion on how this should function. @pbr6cornell, you're up! Could you give us your perspective?
I'd also like to call out something in the code that I presented vs. the code from Feature extraction related to the
I agree with @chrisknoll that sometimes we want to capture 'missing = 0' and include those 0 values in distribution calculations (ex: Charlson, # of visits, etc.), and sometimes we want to exclude persons with missing values (ex: length of exposure among persons taking a drug; we don't care about those without the drug). I also have used the trick @schuemie employs (counting the number of persons with a value and persons without a value), and that can be effective and efficient for specific statistics with 'missing = 0', but there can be other approaches (in particular, when the statistic is a value, not a person count, the 0s needn't be added to the statistic, but the denominator needs to be the total population, not just those with a non-zero value).
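For the mean specifically, the two approaches mentioned above coincide: padding zeros for missing persons gives the same result as summing only the observed values and dividing by the total population. A quick sketch with hypothetical numbers:

```python
# Hypothetical: 2 of 5 people have an observed value; the rest are 'missing'.
values = [4, 6]
total_people = 5

# Approach A: materialize a 0 for each missing person, then average.
padded_mean = sum(values + [0] * (total_people - len(values))) / total_people

# Approach B: sum only the observed values, but divide by the total
# population rather than the count of people with a value.
denominator_mean = sum(values) / total_people
```

Both evaluate to the same mean, since the padded zeros contribute nothing to the numerator; the choice only changes how the query is written, not the result for this statistic.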
Thanks @pbr6cornell for meeting with me and clarifying the above details. @wivern: Patrick and I came to the conclusion that we want to balance the complexity of the functionality with making the tool easy to use and understand. So, instead of adding new functions to allow people to choose 'missing means zero', I think it is easier if we just said that 'Event Counts will include zeros for missing people; everything else does not'. 'Everything else' means distributions for duration, quantity, days supply, distinct start dates, etc. To clarify: a zero value is a valid thing to include in a distribution (like a quantity of 0, or a duration could be zero if start_date = end_date). But, in both of those cases, there was a record that yielded the value. Event counts, however, looks for records for all people, and even if a record doesn't exist, we want to indicate that with a zero for that calculation. Only 'event counts' should be treated this way. Can you think of any problems with implementing it this way?

Second item is the output table: it's not clear what the 'count' column means. It could mean distinct people, it could mean records...so we need to name this better. But, it should reflect what is actually in the value, and I'm not clear on what that is. Can you tell me what the 'count' column represents in the above table, and then we should rename it. If it is distinct people that contributed a value to the distribution, then we should name it 'persons'; if it is the total number of records that are involved in the distribution, then let's label it 'records'.

Finally, I will need to closely review what is happening in the temp tables

Let me know if you have any questions.
# Conflicts: # src/main/java/org/ohdsi/webapi/feanalysis/FeAnalysisController.java
@chrisknoll Regarding the Count column: its value is calculated in the following way:

Therefore, it's a people count, and the column title could be changed to 'People count' to be more clear.
I think 'person count' is too long, but 'Persons' would be fine. Also, we need to be very sure about where we're using the distinct person count (count_value) in later calculations, specifically where we are calculating averages, stddevs, etc.
Btw: pulling latest branch here gives me a compiler error now:
Looking at the code, I do see on lines 25 and 27 two imports for Transactional:
In addition, there's a reference to ExportUtil that's no longer in the concept set package. I can push up a commit to correct these errors. |
Ok, I went through the query and I think there are 2 problems.

The stddev calculation:

Then it calculates the stddev as follows (I'm told this is the simple calculation of std. dev):
The trick here is properly applying count_no_value to pad zeroes into the calculation...refer to the ConceptCounts.sql file above to find that.

The second problem is that the distribution logic that's used in SkeletonCC is applying the

I hate changing up the design at this point, but I'm beginning to think it would just be simpler and easier if we fetched the distribution values and put that into

Alternatively, we leave it alone, but we need a way to specify that count_no_value should be 0 (and not total_cohort_count.cnt - count(*)) if missing_means_zero is false. Which means we still need a 'missing_means_zero' flag somewhere in the distribution statistic definition...but this would be useful in the UI to show which distribution values are including 0's. We'd still need to fix the calculation for stddev, but this second option leaves most of the code you have intact; you just need to figure out how to deal with the case where you don't include zeros.

I understand that you are off on other projects and there may be other people taking over for you, but if it seems like you guys are bogged down and don't have the cycles, just say the word and I'll take this on. I just don't want to interfere with any ongoing work if you guys are on top of this.
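To illustrate the zero-padding trick for the simple std. dev calculation, here is a Python sketch that works only from the aggregates a SQL query can produce (sum, squared sum, and the two counts), without materializing the zero rows. All values below are hypothetical:

```python
import math

# Hypothetical aggregates: 3 of 5 people have a value; the 2 people
# without a record (count_no_value) are treated as implicit zeros.
values = [2.0, 4.0, 4.0]
count_no_value = 2
n = len(values) + count_no_value  # denominator covers the whole cohort

# Zeros contribute nothing to either running sum, so a query only needs
# SUM(value), SUM(value * value), COUNT(*), and the no-value count.
sum_value = sum(values)
squared_sum = sum(v * v for v in values)

mean = sum_value / n
# Simple (population) standard deviation: sqrt(E[x^2] - E[x]^2).
# The padding only shows up through the denominator n.
stddev = math.sqrt(squared_sum / n - mean * mean)
```

This matches computing the population std. dev over the explicit list with the zeros appended; if missing_means_zero is false, count_no_value would simply be set to 0 and n falls back to the count of observed records.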
@chrisknoll The guys are really confused with this task, and I'm afraid they could spend too much time investigating the problem.
Ok, thanks.
Modified migration scripts: removed length of era (use duration instead), and specified missing_means_zero for distinct_days and event count.
I've made updates to the StandardAnalysisAPI, SkeletonCC, and WebAPI branches; please pull those and review. Thank you.
Ok, the next part I'm investigating is the distribution values that come from the prespec analyses, specifically 'VisitCountByConcept', which should give a distribution report item for each visit concept ID that is in the data (ie: Inpatient, Outpatient, ER, etc). When I ran this (I don't have the official results in front of me), only one row was generated from that analysis. I need to look into the intermediate query and understand how it's copied over to
Resolves #1220