
Implements distribution of field values in CC #1449

Merged: 26 commits merged from issue-1220-cc-values into master on Nov 23, 2020

Conversation

wivern
Contributor

wivern commented Feb 4, 2020

Resolves #1220

@anthonysena self-assigned this Feb 12, 2020
@chrisknoll
Collaborator

chrisknoll commented Feb 19, 2020

Greetings Devs,
Thanks very much for putting this together. I'd like to take this opportunity to raise some questions/issues about the current functionality of distributions in Cohort Characterization before going into the details of this PR.

Current Behavior Issues

@anthonysena and I constructed an example of defining a Prevalence and Distribution feature in Atlas under the current (pre-PR) form.

Prevalence:
[image: current_prevalenceDesign]

Distribution:
[image: current_distDesign]

Things to note about the above: in both cases, I'm defining 2 criteria features, one for DMARDs and one for a single ingredient. If you take a look at the output:
[image: current_ccResult]

You'll notice that the prevalence features each appear with the given name, but for the distribution, it's not clear what is happening to get the single row of output. I haven't checked the implementation, but I'm confused as to why the distribution editor allows you to define 2 separate criteria and name them when nothing resembling them is displayed in the report.

In addition, I'm not clear where the result statistics are coming from: the prevalence statistics show 3.9k for methotrexate, so that could be where the 3.9k under the 'Count' column of the distribution table is coming from, but how does that result in an average of 0.04? If the 3.9k prevalence says there are 3.9k people with the exposure, wouldn't the 3.9k shown in the distribution table mean that everyone has at least 1 record?

New Distribution Designer

Item 1: multiple criteria being defined in a distribution feature

If it is the case that all the criteria get grouped together into 1 set of data (and the names of those individual criteria are never used), I suggest that we limit the distribution feature to a single criterion.

Item 2: Defining conflicting aggregation types vs. the type of criteria feature

Using the proposed UI, you can define a feature using one criteria type but apply an invalid aggregation operation on it. Example: define a criteria for drug exposure, and specify an 'average measurement result' as the aggregation. Related to Item 1: if we restrict the distribution feature to be single-criteria based, you can display the appropriate aggregation option based on the domain type of the criteria. Here's the current UI:
[image: new_distDesign]

Item 3: Allowing multiple criteria in a single distribution analysis

If the decision is that we should allow multiple distribution features in a single feature design (ie: allow the user to add multiple criteria, with individual names, and have those statistics appear on the UI) then we should move the 'aggregation type' selector to be embedded with each criteria that is added to the feature (ie: not at the top-level being applied to all criteria). The reason for this is that you may define a set of distribution features where you select a drug exposure for 'distinct concepts per person', a visit criteria for 'length of stay' and others. So, the aggregation operation should really be based on the domain criteria type, and not applied at the overall level.

In either case of Item 2 or Item 3, I propose that the UI should be altered in the following way:
[image: proposed_distDesign]

Note: the 'distribution chooser' will be filtered to only show options that will work for the selected criteria type, ie: Any + {domain specific operations}.

@anthonysena
Collaborator

From review with @wivern and @olga-ganina:

  • Distribution features should work in the same way as prevalence features: you can have > 1 criteria as part of the definition and the results should produce a distribution row per criteria.
  • The "Event Count" distribution will be nested insides of the criteria as shown above.
  • We need to also make sure we're clear on how distributions are displayed in the results. This needs to consider the strata (subgroups) that are part of the cohort characterization definition.

@wivern
Contributor Author

wivern commented Feb 26, 2020

@anthonysena @chrisknoll I've made changes based on your suggestions and review. Would you like to review this again? I'd like to get some feedback from your side. Honestly speaking, some details are still not completely clear to me.

@chrisknoll
Collaborator

I'll work with @anthonysena and let you know. If you could give us a few examples of things that are not clear to you, we can try to address those too.

@wivern
Contributor Author

wivern commented Feb 28, 2020

@chrisknoll Thank you for reviewing and commenting on this.
I have some doubts regarding:

  1. I collected some ideas from the discussion about which values could be used as a count value, but only a very limited number of domains were covered. Is it OK to add more when needed, or should more cases be implemented now?
  2. While implementing the feature, I assumed that Demographic criteria allow counting all available values, e.g. counting the average inpatient visit duration for people aged between 30 and 50, and so on. Am I right? Or maybe Demographic criteria should have some restrictions like other criteria?

@chrisknoll
Collaborator

chrisknoll commented Feb 28, 2020

I collected some ideas from the discussion about which values could be used as a count value, but only a very limited number of domains were covered. Is it OK to add more when needed, or should more cases be implemented now?

Here's a list:

Any

  • Event count (counts all records)
  • Distinct Start Dates (counts only unique start dates) ie: count(distinct start_date)
  • duration (ie: datediff(d, start_date, end_date))
  • max duration (ie: max(datediff(d, start_date, end_date)))
  • min duration (ie: min(datediff(d, start_date, end_date)))
  • avg duration (ie: avg(datediff(d, start_date, end_date))). We should think about rounding since we store the values as int, or we can change the result schema to store numeric and format the number on the front end as int or decimal based on whether we see any decimal values in the result.

Note: the reason to perform max/min/avg on top of the 'raw' duration value is that some people may have many more records than others, so you may just want a way to get a single record per person for the distribution (ie: only use the max value, or only use the min value, or take all their records and create an average value).

Condition Era

  • length of era (ie: datediff(d, start_date, end_date))
  • max/min/avg length of era
  • condition_occurrence_count (ie: the condition_occurrence_count column)
  • max/min/avg condition_occurrence_count

Condition Occurrence

I don't think there's anything in this Domain that we wouldn't get from the Any options above.

Death

I don't think it's worth giving the option for Death... 'distribution of death for a person' doesn't make sense to me.

Drug Era

  • length of era (ie: datediff(d, start_date,end_date))
  • max/min/avg length of era
  • drug_exposure_count (ie: drug_exposure_count )
  • max/min/avg drug_exposure_count
  • gap_days
  • max/min/avg gap_days

Drug Exposure

  • refills
  • max/min/avg refills
  • quantity
  • max/min/avg quantity
  • days_supply
  • max/min/avg days_supply

Measurement

  • value_as_number
  • max/min/avg value_as_number
  • range_high
  • max/min/avg range_high
  • range_low
  • max/min/avg range_low

Note on Measurement distributions: people should understand that they should specify the unit of the measurement record so that we don't mix, for example, 'weight in Kg vs. weight in Lbs'. This is on the user, and we don't need to code some sort of 'intelligence' into the tool to prevent that error.
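As a sketch of what that user-side discipline looks like in a criteria's generated SQL (the codesets join follows the pattern of the queries later in this thread; unit_concept_id 9529 = kilogram is an assumption for illustration):

-- keep only measurements recorded in a single unit so the values are comparable
SELECT m.person_id, m.value_as_number
FROM measurement m
JOIN codesets cs
  ON cs.concept_id = m.measurement_concept_id AND cs.codeset_id = 0
WHERE m.unit_concept_id = 9529; -- 9529 = kilogram (assumed concept id, for illustration)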

Observation

  • value_as_number
  • max/min/avg value_as_number

Procedure

  • quantity
  • max/min/avg quantity

Visit Occurrence

  • length of stay (ie: datediff(d, start_date, end_date))
  • max/min/avg length of stay

Should Missing mean Zero?

In some of our tools, we have the idea that 'missing means zero'. This option means that if you don't find a record for a person, you use zero as that person's distribution count_value. For example, if you're counting distinct dates, you may want to consider that a person with no dates has a count of 0. However, what if you were looking for a measurement value of 'blood pressure'? Not everyone has a record of this, and you definitely don't want to assign a blood pressure of 0 to people who do not have a record for it!

So, the simple approach is to only include people in the distribution who have at least one countable record. In the example of 'distinct dates', the distribution will show a min value of at least 1 because we're only going to include people that had at least 1 distinct date.

The more complex approach is to provide an option for 'missing means zero': if it is used, we should LEFT JOIN the records for each person in the cohort, and NULL gets assigned a 0 count value. If 'missing means zero' = false, then we'd filter out the records where count_value is null.
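A minimal sketch of both behaviors, assuming hypothetical tables cohort(subject_id) and person_values(person_id, value_as_number) with one value per person:

-- 'missing means zero' = true: every cohort member contributes a value; NULL becomes 0
SELECT c.subject_id AS person_id,
       COALESCE(v.value_as_number, 0) AS value_as_number
FROM cohort c
LEFT JOIN person_values v ON v.person_id = c.subject_id;
-- 'missing means zero' = false (the simple approach): add
--   WHERE v.value_as_number IS NOT NULL
-- so only people with at least one countable record remain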

For this PR, I'd go with the simple approach and get feedback from the community about handling this use case; if we want to handle it, we just add a new field to the criteria dist feature called `missingMeansZero` and default it to `false` for backwards compatibility.

Examples

To illustrate the difference in results of the raw value distribution vs. the min/max/avg options:

If you have a distribution of Visit Occurrence length of stay, and your cohort has 3 people in it:

  • Person 1 has 5 visits, lengths: [1,2,3,4,5]
  • Person 2 has 1 visit, lengths [5]
  • Person 3 has 2 visits, lengths [0,12]

Using the raw value of 'length of stay', you would produce a distribution from the values (the below would just be 1 list, but I'm grouping them by person for illustration):
[1,2,3,4,5] + [5] + [0,12]

min length of stay would use: [1,5,0]
max length of stay would use: [5,5,12]
avg length of stay would use: [3,5,6]
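As a minimal SQL sketch of those variants (visit_lengths(person_id, los) is a hypothetical table holding the values above):

-- raw: all 8 values [1,2,3,4,5] + [5] + [0,12] go into the distribution
SELECT person_id, los AS value_as_number FROM visit_lengths;

-- min: one value per person -> [1,5,0]; swap MIN for MAX -> [5,5,12], or AVG(1.0 * los) -> [3,5,6]
SELECT person_id, MIN(los) AS value_as_number
FROM visit_lengths
GROUP BY person_id;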

I think this approach makes sense, but I'd like to hear from @pbr6cornell whether he agrees.

Demographic Criteria

While implementing the feature, I assumed that Demographic criteria allow counting all available values, e.g. counting the average inpatient visit duration for people aged between 30 and 50, and so on. Am I right? Or maybe Demographic criteria should have some restrictions like other criteria?

I don't think we'd construct a distribution criteria from 'demographic criteria', because demographic criteria just let you put criteria around the cohort start_date. In other words, 'demographic criteria' does not yield any records by itself; it just yields the cohort record if the person for that cohort record matches the demographic criteria. You'd just get a 'single record' per person if the person matches the demographic criteria at the cohort_start_date.

Update: the following is not correct; the correction is below.

To do what you described (e.g. count an average inpatient visit duration for people aged between 30 and 50), you would use a VISIT_OCCURRENCE criteria, and in that criteria you specify the age between 30 and 50. This would be the 'age at the visit', and not the 'age at the cohort_start_date'. If you wanted to also say that the person needs to be aged 30-50 at the time of cohort start, then you use nested criteria of the visit to add the Demographic rule to the visit:

Visit Occurrence of <Any visit>
and having all of the following:
  Demographic criteria: age between 30 and 50

That would find visits only for people who are between 30 and 50 at cohort_start.
If you wanted to require that at the time of the visit they were still between 30 and 50:

Visit Occurrence of <Any visit>
with age between 30 and 50
and having all of the following:
  Demographic criteria: age between 30 and 50

Correction

In the above example, I said that using the nested criteria for demographic criteria would use the cohort_start date. This is INCORRECT because the nested criteria is indexed off the containing criteria (in this case the Visit Occurrence). Therefore, these are identical:

Visit Occurrence of <Any visit>
having all of the following:
  Demographic criteria: age between 30 and 50

Visit Occurrence of <Any visit>
with age between 30 and 50

To do what you want (only consider people who are age 30-50 in the cohort), we would use the 'stratify criteria' to sub-set the population into the specific group of interest. These distribution criteria apply to everyone in the target population equally, and I think this is the correct behavior: we wouldn't want some criteria to only consider one subset of people and other criteria to apply to another subset. What's nice about the strata criteria is that it creates those sub-groups, and then all features and distributions are calculated for all people in each of those sub-groups.

End of Correction

Hope that all makes sense.

# Conflicts:
#	src/main/java/org/ohdsi/webapi/feanalysis/FeAnalysisServiceImpl.java
# Conflicts:
#	src/main/java/org/ohdsi/webapi/feanalysis/FeAnalysisService.java
@@ -22,5 +22,7 @@ CREATE TABLE @results_schema.cc_results
 p75_value DOUBLE PRECISION,
 p90_value DOUBLE PRECISION,
 max_value DOUBLE PRECISION,
-cohort_definition_id BIGINT
+cohort_definition_id BIGINT,
+aggregate_id INTEGER,
Collaborator

Noting that these changes will require an update to the results schema when merged into master.

ssuvorov-fls and others added 6 commits August 25, 2020 11:01
# Conflicts:
#	src/main/java/org/ohdsi/webapi/feanalysis/FeAnalysisController.java
#	src/main/java/org/ohdsi/webapi/feanalysis/FeAnalysisService.java
…cc-values

 Conflicts:
	src/main/java/org/ohdsi/webapi/feanalysis/FeAnalysisController.java
	src/main/java/org/ohdsi/webapi/feanalysis/FeAnalysisService.java
	src/main/java/org/ohdsi/webapi/feanalysis/FeAnalysisServiceImpl.java
@chrisknoll
Collaborator

I've just pushed updates to resolve conflicts. I'd appreciate it if someone else could review this to ensure I did not miss anything.

Thank you.

@chrisknoll
Collaborator

When testing this, I attempted to ensure that I was using the correct version of SkeletonCohortCharacterization, but there are conflicts when attempting to merge master into the issue-1220-cc-values branch: https://github.com/OHDSI/SkeletonCohortCharacterization/tree/issue-1220-cc-values.

However, I ran into issues where I could not determine the correct way to resolve the conflicts. Could someone with knowledge of these changes please resolve the conflicts and push updates to the issue-1220-cc-values branch in SkeletonCohortCharacterization?

Comment on lines 1 to 12
CREATE SEQUENCE ${ohdsiSchema}.fe_aggregate_sequence START WITH 1;

CREATE TABLE ${ohdsiSchema}.fe_analysis_aggregate(
id INTEGER NOT NULL,
name VARCHAR(255) NOT NULL,
domain VARCHAR(64),
agg_function VARCHAR(64),
expression CLOB,
agg_query CLOB,
is_default NUMBER(1) DEFAULT 0,
CONSTRAINT pk_fe_aggregate PRIMARY KEY(id)
);
Collaborator

I have to admit this table confused me a little: when getting the code updated, I realized I need to have StandardAnalysisAPI, SkeletonCohortCharacterization, Atlas and WebAPI all pointing to the same branches... I expected SkeletonCohortCharacterization to contain the bulk of the logic for performing the characterization, with everything else just supporting or depending on it.

But, looking at these statements, it looks like the logic for performing the cohort characterization (at least from the distribution perspective) lives within WebAPI. Any reason why this wasn't enclosed in SkeletonCohortCharacterization (like FeatureAnalysis does)?

I can understand a link between some analysis / aggregate ID specified in SkeletonCohortCharacterization that lets you present choices in the UI for options available there, but I didn't think the actual aggregation statements and join conditions would be applied here.

If someone wants to do characterization in R, do they have to import something from WebAPI, or is this information duplicated?

Let me know if I'm missing something.

@chrisknoll
Collaborator

chrisknoll commented Oct 8, 2020

Found an issue with how 'distinct start dates' was being handled. Here is the distribution design:

[image: distribution design]

Focusing on the first one: getting the counts of distinct start dates of diagnoses of Diabetes. I checked the data, and these events do exist; however, the query that's generated does not select the right records (or the right dates), as I'll describe:

This is the part of the query for this distribution feature:

WITH qualified_events  AS (
  SELECT ROW_NUMBER() OVER (partition by E.subject_id order by E.cohort_start_date) AS event_id, E.subject_id AS person_id, E.cohort_start_date AS start_date, E.cohort_end_date AS end_date, OP.observation_period_start_date AS op_start_date, OP.observation_period_end_date AS op_end_date
  FROM [CDM_DB].cknoll1_temp.temp_cohort_xq8n9di9 E
    JOIN [CDM_DB].dbo.observation_period OP ON E.subject_id = OP.person_id AND E.cohort_start_date >= OP.observation_period_start_date AND E.cohort_start_date <= OP.observation_period_end_date
  WHERE cohort_definition_id = 4
)
 SELECT
v.person_id as person_id,
  count(DISTINCT op.observation_period_start_date) as value_as_number

FROM
( SELECT 0 as index_id, p.person_id, p.event_id
FROM qualified_events P
INNER JOIN
(
  -- Begin Condition Occurrence Criteria
SELECT C.person_id, C.condition_occurrence_id as event_id, C.condition_start_date as start_date, COALESCE(C.condition_end_date, DATEADD(day,1,C.condition_start_date)) as end_date,
       C.CONDITION_CONCEPT_ID as TARGET_CONCEPT_ID, C.visit_occurrence_id,
       C.condition_start_date as sort_date
FROM 
(
  SELECT co.* 
  FROM [CDM_DB].dbo.CONDITION_OCCURRENCE co
  JOIN [CDM_DB].cknoll1_temp.xq8n9di9_Codesets codesets on ((co.condition_concept_id = codesets.concept_id and codesets.codeset_id = 0))
) C


-- End Condition Occurrence Criteria

) A on A.person_id = P.person_id  AND A.START_DATE >= DATEADD(day,-365,P.START_DATE) AND A.START_DATE <= DATEADD(day,0,P.START_DATE) ) v
join qualified_events E on E.person_id = v.person_id join observation_period op on op.person_id = v.person_id and op.observation_period_start_date >= E.start_date and op.observation_period_end_date <= E.end_date
group by v.person_id
;

The qualified_events CTE is correct: you want to assign an event_id to each record, partitioned by the subject_id (via ROW_NUMBER()), and then attach the observation period start/end dates to the event. Note: the 'event' is simply the cohort episode for the person... the qualified_events CTE should be the same no matter what distribution value you're getting.

The first problem, though, is in the SELECT immediately following:

SELECT v.person_id as person_id, count(DISTINCT op.observation_period_start_date) as value_as_number

This is supposed to be counting distinct condition_start_dates, but it's counting observation_period_start_dates.

Next, it appears that the query is attempting to perform a 'group criteria' subquery by doing this:

( SELECT 0 as index_id, p.person_id, p.event_id
FROM qualified_events P
INNER JOIN
(
  -- Begin Condition Occurrence Criteria

The SELECT 0 part is used to do operations like:

having  all/any of the following criteria:  a condition of X (index_id 0) and a drug exposure of Y (index_id 1)

The query (if it was doing a group criteria) would have something like this at the end:

having count(index_id) = 2  / having count(index_id) > 0   -- all / any

But, we don't want to do any sort of group logic in the query... so this part isn't correct. And before I propose what I think the correct form should be, let me point out the last problem:

The last part of the query has the problem where it's attempting to do something with an observation period, but it's incorrect/not applicable:

) A on A.person_id = P.person_id  AND A.START_DATE >= DATEADD(day,-365,P.START_DATE) AND A.START_DATE <= DATEADD(day,0,P.START_DATE) ) v
join qualified_events E on E.person_id = v.person_id join observation_period op on op.person_id = v.person_id and op.observation_period_start_date >= E.start_date and op.observation_period_end_date <= E.end_date
group by v.person_id

The JOIN condition for A (where it looks at A.START_DATE >= DATEADD(day,-365,P.START_DATE)) is correct: this says 'the condition start date (A) has to be later than 365 days before the cohort start date (P.start_date), and the condition start date (A) has to be before the cohort start date (P.start_date + 0 days)'. Those are the condition occurrence records we want: count the distinct dates of condition occurrences of X that occur between 365 days before and 0 days before the cohort start date. The records from A are what we should use for counting.

But the query is taking records from V as the records to use for counting our distribution value. Additionally, there is a join to the observation period where the observation_period_start_date is >= E.start_date (remember: E = cohort events) and observation_period_end_date <= E.end_date. This means we're trying to find an observation period that fits inside a cohort event, which is impossible: observation periods are the longest spans of time that can appear. I think this join to observation_period and the sub-query V can be changed.

How the query should function is that the Condition criteria is built from a 'WindowCriteria' such that, given a set of cohort events (qualified_events), you find the condition_occurrence records relative to the cohort event's start/end date. The WindowCriteria sql builder should handle all this logic... it just assumes you have a 'qualified_events' table (the CTE) to join to.

The correct form of the above query should look something like this:

WITH qualified_events  AS (
  SELECT ROW_NUMBER() OVER (partition by E.subject_id order by E.cohort_start_date) AS event_id, E.subject_id AS person_id, E.cohort_start_date AS start_date, E.cohort_end_date AS end_date, OP.observation_period_start_date AS op_start_date, OP.observation_period_end_date AS op_end_date
  FROM [CDM_DB].cknoll1_temp.temp_cohort_xq8n9di9 E
    JOIN [CDM_DB].dbo.observation_period OP ON E.subject_id = OP.person_id AND E.cohort_start_date >= OP.observation_period_start_date AND E.cohort_start_date <= OP.observation_period_end_date
  WHERE cohort_definition_id = 4
)
SELECT  P.person_id as person_id, count(DISTINCT A.start_date) as value_as_number
FROM qualified_events P
INNER JOIN
(
-- Begin Condition Occurrence Criteria
	SELECT C.person_id, C.condition_occurrence_id as event_id, C.condition_start_date as start_date, COALESCE(C.condition_end_date, DATEADD(day,1,C.condition_start_date)) as end_date,
				 C.CONDITION_CONCEPT_ID as TARGET_CONCEPT_ID, C.visit_occurrence_id,
				 C.condition_start_date as sort_date
	FROM 
	(
		SELECT co.* 
		FROM [CDM_DB].dbo.CONDITION_OCCURRENCE co
		JOIN [CDM_DB].cknoll1_temp.xq8n9di9_Codesets codesets on ((co.condition_concept_id = codesets.concept_id and codesets.codeset_id = 0))
	) C
-- End Condition Occurrence Criteria
) A on A.person_id = P.person_id  AND A.START_DATE >= DATEADD(day,-365,P.START_DATE) AND A.START_DATE <= DATEADD(day,0,P.START_DATE)
group by P.person_id

Note: the sub-query A is the part that fetches the domain-specific records; it re-aliases the condition_start_date to 'start_date' and figures out what the condition_end_date should be (by using either an end date or start_date + 1d). The records from A are what you want to use for the distribution counts... that's why we do a count(DISTINCT A.start_date) at the start.

I'm fairly certain that there is a call that will convert the ConditionOccurrence Criteria object into the query that sits between -- Begin Condition Occurrence Criteria and -- End Condition Occurrence Criteria, and I think the WindowCriteria would build the part that joins the qualified_events P to the sub-query A (the domain query). I want to say that this should exist, but it's possible that it doesn't; I can dig into this to see if it is available or if it is something that needs to be constructed.

Let me know if anything above is not clear; I understand it's a lot to unpack.

@chrisknoll
Collaborator

chrisknoll commented Nov 3, 2020

Thanks for the hard work on this @wivern. At this point, the changes look good from a 'window criteria' perspective: when I was looking for quantity values or distinct start date values, it was building the queries properly (as far as I could tell) to yield the individual records. However, a concern arises with building the distribution results. This is the query:

CREATE TABLE  results_schema.ilf1ier8_events_dist
WITH (DISTRIBUTION = HASH(analysis_id))
AS
WITH total_cohort_count  AS (
    SELECT COUNT(*) cnt FROM results_schema.temp_cohort_ilf1ier8 where cohort_definition_id = 4
  ),
  events_max_value as (
    select max(value_as_number) as max_value from results_schema.ilf1ier8_events_count
  ),
  event_stat_values as (
    select
      count(distinct person_id) as count_value,
      min(value_as_number) as min_value,
      max(value_as_number) as max_value,
      sum(value_as_number) as sum_value,
      stdev(value_as_number) as stdev_value,
      total_cohort_count.cnt - count(*) as count_no_value,
      total_cohort_count.cnt as population_size
    from results_schema.ilf1ier8_events_count, total_cohort_count
    group by total_cohort_count.cnt
  ),
  event_prep as (select row_number() over (order by value_as_number) as rn, value_as_number, count(*) as people_count from results_schema.ilf1ier8_events_count group by value_as_number),
  events_dist as (
    select s.value_as_number, sum(p.people_count) as people_count
    from event_prep s join event_prep p on p.rn <= s.rn group by s.value_as_number
  ),
  events_p10_value as (
    select min(value_as_number) as p10 from events_dist, event_stat_values where (people_count + count_no_value) >= 0.1 * population_size
  ),
  events_p25_value as (
    select min(value_as_number) as p25 from events_dist, event_stat_values where (people_count + count_no_value) >= 0.25 * population_size
  ),
  events_median_value as (
      select min(value_as_number) as median_value from events_dist, event_stat_values where (people_count + count_no_value) >= 0.5 * population_size
  ),
  events_p75_value as (
    select min(value_as_number) as p75 from events_dist, event_stat_values where people_count + count_no_value >= 0.75 * population_size
  ),
  events_p90_value as (
    select min(value_as_number) as p90 from events_dist, event_stat_values where people_count + count_no_value >= 0.9 * population_size
  )
 SELECT
CAST('DISTRIBUTION' AS VARCHAR(255)) as type,
  CAST('CRITERIA_SET' AS VARCHAR(255)) as fa_type,
  CAST(1088 AS BIGINT) as covariate_id,
  CAST('Warfarin' AS VARCHAR(1000)) as covariate_name,
  CAST(115 AS INT) as  analysis_id, CAST('Distribution Tests' AS VARCHAR(1000)) as analysis_name,
  CAST(0 AS INT) as concept_id,
  CAST(4 AS BIGINT) as cohort_definition_id,
  CAST(132639 AS BIGINT) as cc_generation_id,
  CAST(0 AS BIGINT) as strata_id,
  CAST('' AS VARCHAR(255)) as strata_name,
  CAST(31 AS INTEGER) as aggregate_id,
  CAST('Quantity' AS VARCHAR(1000)) as aggregate_name,
  event_stat_values.count_value,
  CAST(case when count_no_value = 0 then event_stat_values.min_value else 0 end AS float) as min_value,
  event_stat_values.max_value,
  cast(event_stat_values.sum_value / (1.0 * population_size) as float) as avg_value,
  event_stat_values.stdev_value,
  case when population_size * .10 < count_no_value then 0 else events_p10_value.p10 end as p10_value,
  case when population_size * .25 < count_no_value then 0 else events_p25_value.p25 end as p25_value,
  case when population_size * .50 < count_no_value then 0 else events_median_value.median_value end as median_value,
  case when population_size * .75 < count_no_value then 0 else events_p75_value.p75 end as p75_value,
  case when population_size * .90 < count_no_value then 0 else events_p90_value.p90 end as p90_value

FROM
events_max_value, event_stat_values, events_p10_value, events_p25_value, events_median_value, events_p75_value, events_p90_value;

Looking at the structure of this code, I'm guessing that you based this on distribution code that you found in FeatureExtraction, such as here: https://github.com/OHDSI/FeatureExtraction/blob/master/inst/sql/sql_server/ConceptCounts.sql.

@schuemie : we'd like your input on the following:

First issue is the column count(distinct person_id) as count_value. I'm not sure what count_value is supposed to represent here, but when I produce my own distribution summary statistics, count_value is just the value that is going into the set of distribution records. So a count_value might be 5, 10, 20, 30 for ages, or refills, or whatever: the count value is the thing you're making your distribution from. Additionally, I would store the total number of records that were used in the distribution, because when you calculate things like 'average' you want to know how many elements contributed to the average. But storing the distinct person_id count from the distribution as the count_value, I'm not sure what value that has other than 'that's the number of unique people that contributed one or more values to the distribution'. Maybe it's a useful value (@schuemie please give us your thoughts), but I think it is more important to know how many values went into the distribution, and I'm not sure where we can capture that value in the result table.

Then, in another column we have total_cohort_count.cnt - count(*) as count_no_value. That's taking the difference between the number of people in the cohort and the number of records in your distribution. Is that a valid calculation?

So, these two items have me a little confused, and I'm not sure what the correct answer is. I just wanted to be clear about what was being calculated in these statistics and also, if possible, see how they'd be used in practice.

For your consideration, this is a query that replicates the distribution query without people_count or count_no_value; it simply calculates the distribution percentiles based on the input set of values:

select
	CAST(4 AS BIGINT) as cohort_definition_id,
	CAST(115 AS INT) as  analysis_id, CAST('Distribution Tests' AS VARCHAR(1000)) as analysis_name,
	--- other meta columns here
	o.total,
	o.avg_value,
	coalesce(o.stdev_value, 0) as stdev_value,
	o.min_value,
	MIN(case when s.accumulated >= .10 * o.total then count_value else o.max_value end) as p10_value,
	MIN(case when s.accumulated >= .25 * o.total then count_value else o.max_value end) as p25_value,
	MIN(case when s.accumulated >= .50 * o.total then count_value else o.max_value end) as median_value,
	MIN(case when s.accumulated >= .75 * o.total then count_value else o.max_value end) as p75_value,
	MIN(case when s.accumulated >= .90 * o.total then count_value else o.max_value end) as p90_value,
	o.max_value
FROM (
		select avg(1.0 * count_value) as avg_value,
		stdev(count_value) as stdev_value,
		min(count_value) as min_value,
		max(count_value) as max_value,
		count_big(*) as total
	from results_schema.ilf1ier8_events_count
	group by cohort_definition_id
) o
cross join (
	select count_value, count_big(*) as total,
		sum(count_big(*)) over (order by count_value) as accumulated
	FROM results_schema.ilf1ier8_events_count
	group by count_value
) s
group by o.total, o.min_value, o.max_value, o.avg_value, o.stdev_value

Again, I'd like to hear from @schuemie about how FeatureExtraction calculates these distribution values and what his understanding is of people_count, count_no_value and total_cohort_count.cnt - count(*).

-Chris

@schuemie
Member

schuemie commented Nov 3, 2020

Sorry, I don't have a lot of time, but I'll try to shine some light on my code:
count_value is the count of persons that have a non-zero value. count_no_value is the number of people that have no value, which, depending on the covariate you're looking at, should be interpreted as 0 or missing.

For example, take the Charlson comorbidity index. People that have none of the components that make up this score don't have a score entry. This should be interpreted as 'Charlson comorbidity index = 0'. So when computing, for example, the median, I need to make sure I account for all those zeroes that are implied by the people not having a record. (E.g. with recorded scores [2, 3] and one person with no record, the values are effectively [0, 2, 3], so the median is 2, not 2.5.)

@chrisknoll
Collaborator

chrisknoll commented Nov 3, 2020

Thanks @schuemie .

@wivern : I don't suggest we follow this, since for our purposes, we're calculating a distribution based on records selected from a domain, and not something calculated like a score where a valid result is 0.

In cases where we want a distribution of quantity, or refills, or durations, we wouldn't want to yield a 0 value as a count_value for the person. In that case, we'd just want to understand the distribution of actual quantities or actual durations of things. In other cases, maybe we want the distribution of events for those that have at least 1 event.

Thinking on this further, though: even if we wanted to count '0' values, we could just yield those records (via a left join + coalesce(stat_value, 0)) and let the distribution query work naturally. For example, if we're looking for a count of records, it may make sense to include 'people who had 0 records' as part of the distribution, because those people should contribute their '0' to the overall distribution. This is where the 'missing means zero' comment I made previously comes into play: when defining the distribution, the user has the opportunity to determine if missing should be interpreted as zero. So, if we're looking for duration of visits in the past 365 days, maybe in some cases you want the distribution of the actual visits... in other cases, maybe you want people with no visits to contribute a '0' to the distribution.

I think I need a second opinion on how this should function. @pbr6cornell, you're up! Could you give us your perspective?

@chrisknoll
Collaborator

I'd also like to call out something in the code that I presented vs. the code from FeatureExtraction related to the count_value field. It is an amusing coincidence that Martijn and I picked the same variable name to represent totally different things... my 'count_value' is a horrible name, but it represented the statistical value that was being 'counted for distribution' (again, horrible name). Therefore I present the updated query with a more reasonable variable name (count_value -> stat_value):

select
	CAST(4 AS BIGINT) as cohort_definition_id,
	CAST(115 AS INT) as  analysis_id, CAST('Distribution Tests' AS VARCHAR(1000)) as analysis_name,
	--- other meta columns here
	o.total,
	o.avg_value,
	coalesce(o.stdev_value, 0) as stdev_value,
	o.min_value,
	MIN(case when s.accumulated >= .10 * o.total then stat_value else o.max_value end) as p10_value,
	MIN(case when s.accumulated >= .25 * o.total then stat_value else o.max_value end) as p25_value,
	MIN(case when s.accumulated >= .50 * o.total then stat_value else o.max_value end) as median_value,
	MIN(case when s.accumulated >= .75 * o.total then stat_value else o.max_value end) as p75_value,
	MIN(case when s.accumulated >= .90 * o.total then stat_value else o.max_value end) as p90_value,
	o.max_value
FROM (
		select avg(1.0 * stat_value) as avg_value,
		stdev(stat_value) as stdev_value,
		min(stat_value) as min_value,
		max(stat_value) as max_value,
		count_big(*) as total
	from results_schema.ilf1ier8_events_count
	group by cohort_definition_id
) o
cross join (
	select stat_value, count_big(*) as total,
		sum(count_big(*)) over (order by stat_value) as accumulated
	FROM results_schema.ilf1ier8_events_count
	group by stat_value
) s
group by o.total, o.min_value, o.max_value, o.avg_value, o.stdev_value

@pbr6cornell
Contributor

I agree with @chrisknoll that sometimes we want to capture 'missing = 0' and include those 0 values in distribution calculations (ex: Charlson, # of visits, etc.), and sometimes we want to exclude persons with missing values (ex: length of exposure among persons taking a drug, where we don't care about those without the drug). I also have used the trick @schuemie employs (counting the number of persons with a value and persons without a value), and that can be effective and efficient for specific statistics with 'missing = 0', but there can be other approaches (in particular, when the statistic is a value, not a person count, the 0s needn't be added to the statistic, but the denominator needs to be the total population, not just those with a non-zero value).

@chrisknoll
Collaborator

Thanks @pbr6cornell for meeting with me and clarifying the above details.

@wivern : Patrick and I came to the conclusion that we want to balance the complexity of the functionality against making the tool easy to use and understand. So, instead of adding new functions to allow people to choose 'missing means zero', I think it is easier if we just say that 'Event Counts will include zeros for missing people; everything else does not'. 'Everything else' means distributions for duration, quantity, days supply, distinct start dates, etc. To clarify: a zero value is a valid thing to include in a distribution (like a quantity of 0, or a duration of zero when start_date = end_date), but in both of those cases there was a record that yielded the value. Event counts, however, look for records for all people, and even if a record doesn't exist, we want to indicate that with a zero for that calculation. Only 'event counts' should be treated this way.

Can you think of any problems with implementing it this way?

Second item is the output table:
[image: output table]

It's not clear what the 'count' column means. It could mean distinct people, or it could mean records, so we need to name this better. It should reflect what is actually in the value, and I'm not clear on what that is. Can you tell me what the 'count' column represents in the above table, and then we should rename it: if it is the distinct people that contributed a value to the distribution, we should name it 'persons'; if it is the total number of records involved in the distribution, let's label it 'records'.

Finally, I will need to closely review what is happening in the temp tables (event_counts) and the subsequent distribution queries, because I'm a little concerned that what was done for FeatureExtraction isn't exactly what we've done for the criteria-based distribution, so I need to understand count_no_value, count_value, and how the avg is being calculated. What I mean is: FeatureExtraction does a trick where it determines a number of 'zero records' that we can infer from the distinct people in the cohort minus the distinct people in the distribution records. The problem is that you can't simply do AVG(stat_column) with that; instead you have to manually calculate the avg as sum(stat_value) / (count_no_value + count_value), because you need to include the zero-people in the denominator of the avg calculation. This trick makes the other calculations very tricky, so I'd seriously recommend using my simplified form for calculating a distribution, but including the 'zero-records' in the set of values used to calculate the distribution; then your avg, stddev, etc. all work naturally.
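For clarity, a sketch of that corrected average using the column names from the event_stat_values CTE quoted earlier in this thread (illustrative only, not the final implementation):

-- AVG(value_as_number) over the stored rows would ignore the implied zeros;
-- they must appear in the denominator instead
select sum_value * 1.0 / (count_value + count_no_value) as avg_value
from event_stat_values;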

Let me know if you have any questions.

# Conflicts:
#	src/main/java/org/ohdsi/webapi/feanalysis/FeAnalysisController.java
@wivern
Contributor Author

wivern commented Nov 12, 2020

@chrisknoll Regarding the Count column: its value is calculated in the following way:

event_stat_values as (
    select
      count(distinct person_id) as count_value,

Therefore, it's a people count, and the column title could be changed to 'People count' to be more clear.
Do you agree?

@chrisknoll
Collaborator

chrisknoll commented Nov 12, 2020

I think 'person count' is too long, but 'Persons' would be fine.

Also, we need to be very sure about where we're using the distinct person count (count_value) in later calculations, specifically where we are calculating averages, stddevs, etc.

@chrisknoll
Collaborator

Btw: pulling the latest branch here gives me a compiler error now:

org/ohdsi/webapi/feanalysis/FeAnalysisController.java:[27,1] a type with the same simple name is already defined by the single-type-import of org.springframework.transaction.annotation.Transactional

Looking at the code, I do see on lines 25 and 27 two imports for Transactional:

import org.springframework.transaction.annotation.Transactional;

import javax.transaction.Transactional;

In addition, there's a reference to ExportUtil that's no longer in the concept set package. I can push up a commit to correct these errors.

@chrisknoll
Collaborator

chrisknoll commented Nov 13, 2020

Ok, I went through the query and I think there are 2 problems:

The stddev calculation, stdev(value_as_number) as stdev_value, needs to account for the 'count_no_value' part of the logic. FeatureExtraction shows where this is done: first it gets the sum of squared values:
https://github.com/OHDSI/FeatureExtraction/blob/master/inst/sql/sql_server/ConceptCounts.sql#L109

Then it calculates the stddev as follows (I'm told this is the simple calculation of std. dev):

	CAST(CASE
		WHEN t2.cnt = 1 THEN 0 
		ELSE SQRT((1.0 * t2.cnt*t2.squared_concept_count - 1.0 * t2.sum_concept_count*t2.sum_concept_count) / (1.0 * t2.cnt*(1.0 * t2.cnt - 1))) 
	END AS FLOAT) AS standard_deviation,

The trick here is properly applying count_no_value to pad zeroes into the calculation; since padded zeroes contribute nothing to the sum or the sum of squares, only the count term changes. Refer to the ConceptCounts.sql file above to see how.
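A sketch of that padded calculation under the same assumption (the padded zeros leave the sum and the sum of squares unchanged, so only the count n includes them; @count_no_value stands in for the value computed in the stats CTE):

select case when t.n <= 1 then 0
            else sqrt((1.0 * t.n * t.squared_sum - 1.0 * t.sum_value * t.sum_value)
                      / (1.0 * t.n * (t.n - 1)))
       end as stdev_value
from (
  select sum(value_as_number) as sum_value,
         sum(value_as_number * value_as_number) as squared_sum,
         count(*) + @count_no_value as n -- n counts the implied-zero persons too
  from results_schema.ilf1ier8_events_count
) t;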

The second problem is that the distribution logic used in SkeletonCC applies the 'missing means zero' logic in all cases. In the case of the 'quantity' option, anyone who doesn't have a quantity value will get added as a zero value, since you are calculating count_no_value and using it in the distribution calculation.

I hate changing up the design at this point, but I'm beginning to think it would just be simpler and easier if we fetched the distribution values and put them into the _events_count table, and then optionally (based on a missing_means_zero flag) added 0's for each person's cohort episode (remember, a person can have more than 1) if the person_id does not appear in the set of persons in _events_count. This would be 3 main changes: a) the distribution query can be simpler (you don't have to do any special work for avg/stddev, nor figure out how to pad); b) optionally, you would insert into _events_count select subject_id as person_id, 0 as value_as_number from temp_cohort where subject_id not in (select person_id from _events_count). This is optional: only perform that additional insert if the distribution statistic says 'missing means zero'; and c) we need to add a flag for 'missing means zero' to the different statistic types (count events = true, refills and quantity = false).
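For option (b), a minimal sketch of the padding insert, using the table names from the query earlier in this thread (run only when the statistic's missing_means_zero flag is true):

insert into results_schema.ilf1ier8_events_count (person_id, value_as_number)
select c.subject_id, 0
from results_schema.temp_cohort_ilf1ier8 c
where c.cohort_definition_id = 4
  and c.subject_id not in (select person_id from results_schema.ilf1ier8_events_count);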

Alternatively, we leave it alone, but we need a way to specify that count_no_value should be 0 (and not total_cohort_count.cnt - count(*)) if missing_means_zero is false. This means we still need a 'missing_means_zero' flag somewhere in the distribution statistic definition, but that would also be useful on the UI to show which distribution values include 0's. We'd still need to fix the calculation for stddev, but this second option leaves most of the code you have intact; you just need to figure out how to deal with the case where you don't include zeros.
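A sketch of that alternative, gating count_no_value inside the existing event_stat_values CTE (the @missing_means_zero flag name is an assumption, and the stddev column would still need the fix described above):

event_stat_values as (
    select
      count(distinct person_id) as count_value,
      min(value_as_number) as min_value,
      max(value_as_number) as max_value,
      sum(value_as_number) as sum_value,
      stdev(value_as_number) as stdev_value,
      case when @missing_means_zero = 1
           then total_cohort_count.cnt - count(*)
           else 0 end as count_no_value,
      total_cohort_count.cnt as population_size
    from results_schema.ilf1ier8_events_count, total_cohort_count
    group by total_cohort_count.cnt
  )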

I understand that you are off on other projects and there may be other people taking over for you, but if it seems like you are bogged down and don't have the cycles, just say the word and I'll take this on. I just don't want to interfere with any ongoing work if you are on top of this.

@wivern
Contributor Author

wivern commented Nov 13, 2020

@chrisknoll The guys are really confused by this task, and I'm afraid they could spend too much time investigating the problem.
It would be great if you could take this on. Thank you for the great assistance and the nice explanations, which I believe are extremely helpful for our team.

@chrisknoll
Collaborator

Ok, thanks.

Modified migrations scripts: remove length of era (use duration instead), and specify missing_means_zero for distinct_days and event count.
@chrisknoll
Collaborator

I've made updates to the StandardAnalysisAPI, SkeletonCC, and WebAPI branches; please pull those and review. Thank you.

@chrisknoll
Collaborator

Ok, the next part I'm investigating is the distribution values that come from the prespec analyses, specifically 'VisitCountByConcept', which should give a distribution report item for each visit concept ID in the data (ie: Inpatient, Outpatient, ER, etc). When I ran this (I don't have the official results in front of me), only one row was generated from that analysis. I need to look into the intermediate query and understand how it's copied over to cc_results.

wivern merged commit 8626859 into master Nov 23, 2020
The delete-merged-branch bot deleted the issue-1220-cc-values branch on November 23, 2020 at 15:36.