Heracles reports do not insert final results when using Spark #2112

TomWhite-MedStar · 2022-10-05T21:15:10Z

Expected behavior

After defining and generating a cohort, it should be possible to use the Reporting tab to generate Quick or Full Analyses.

Actual behavior

Report generation works fine on SQL Server, but fails on Spark (e.g. Databricks).

Steps to reproduce behavior

You need an OHDSI instance that has an OMOP data source connected via Spark.

Select any cohort
Run Generate on the cohort
Run "Quick Analysis" under the Reporting tab
After several minutes, the Job will show a Failed status.

Root Cause

After all of the initial processing, the final query looks like this:

insert into result_schema.heracles_results (cohort_definition_id, analysis_id, stratum_1, stratum_2, stratum_3, stratum_4, count_value) select cohort_definition_id, analysis_id, cast(stratum_1 as STRING), cast(stratum_2 as STRING), cast(stratum_3 as STRING), cast(stratum_4 as STRING), count_value from tmp.xrfc9243results_1 UNION ALL ...

However, Spark SQL does not support INSERT INTO with a column list that does not exactly match the full set of columns of the target table. Here the list omits stratum_5 and last_update_time.

Modifying the query does work. Either this:

insert into omop_160101_to_220731_v813_results.heracles_results (cohort_definition_id, analysis_id, stratum_1, stratum_2, stratum_3, stratum_4, stratum_5, count_value, last_update_time) select cohort_definition_id, analysis_id, cast(stratum_1 as STRING), cast(stratum_2 as STRING), cast(stratum_3 as STRING), cast(stratum_4 as STRING), '' as stratum_5, count_value, now() from tmp.xrfc9243results_1

or this:

insert into omop_160101_to_220731_v813_results.heracles_results select cohort_definition_id, analysis_id, cast(stratum_1 as STRING), cast(stratum_2 as STRING), cast(stratum_3 as STRING), cast(stratum_4 as STRING), '' as stratum_5, count_value, now() from tmp.xrfc9243results_1

Can the relevant SQL fragments be modified so that this works for Spark?

One of the relevant files appears to be:
https://github.com/OHDSI/WebAPI/blob/master/src/main/resources/resources/cohortanalysis/sql/selectHeraclesResults.sql

But it looks as though a Java file is used to orchestrate the sub-queries:
https://github.com/OHDSI/WebAPI/blob/fd108571086fceea8513389fab426a6ea8101888/src/main/java/org/ohdsi/webapi/cohortanalysis/HeraclesQueryBuilder.java

chrisknoll · 2022-10-06T14:41:59Z

Thanks for the suggestion. We're just completing a 2.12.0 release, but this is something we can push into 2.12.1 afterward if we don't have time to squeeze it in.

TomWhite-MedStar · 2022-10-07T18:04:24Z

Thanks. There is some discussion on the Forums too - https://forums.ohdsi.org/t/databricks-spark-coming-to-ohdsi-stack/14545/13

TomWhite-MedStar · 2022-10-13T11:57:18Z

Per @alondhe:
My guess is that the tasklet that executes the heracles SQL commands (https://github.com/OHDSI/WebAPI/blob/master/src/main/java/org/ohdsi/webapi/cohortanalysis/CohortAnalysisTasklet.java#L90) needs to be wrapped in SqlRender’s sparkHandleInsert function to ensure the insert command is reconstructed to have all table fields present.

chrisknoll · 2022-10-13T17:40:10Z

Shouldn't this be handled by the core SqlRender function? I'm not in favor of putting dialect-specific rules in WebAPI when they should be handled by these external libraries (so that you get the handling without having to special-case every place you use it). Is this simply a matter of SQL rendering, or is it something external to SQL rendering that has to be done directly on the connection?

alondhe · 2022-10-15T20:02:41Z

SqlRender sparkHandleInsert() can be used, but it has to be invoked during any SQL command fired off by WebAPI. This is because we need the active connection in order to then run a "describe table" to get the full column list and re-write the insert. So unfortunately, dialect-specific rules are needed in WebAPI -- and we have done this already.

Refactoring all inserts to use all columns explicitly is the cleanest approach, but also a heavy lift. But if we did that, we can just skip any Spark specific patterns in WebAPI SQL executions.
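To make the rewrite described above concrete, here is a minimal Python sketch of padding a partial INSERT out to the full column list. It is not SqlRender's sparkHandleInsert implementation: the function name `pad_insert` is hypothetical, the target column list is hard-coded (in the order used by the working insert in the original report) where real code would obtain it from a "describe table" call on the live connection, and filling missing columns with NULL is an assumption.

```python
# Assumed column order of heracles_results, taken from the working
# insert in the original report. Real code would read this from the
# database (e.g. via "describe table") instead of hard-coding it.
HERACLES_RESULTS_COLUMNS = [
    "cohort_definition_id", "analysis_id",
    "stratum_1", "stratum_2", "stratum_3", "stratum_4", "stratum_5",
    "count_value", "last_update_time",
]


def pad_insert(table, insert_columns, select_exprs, table_columns, source):
    """Rebuild an INSERT so that it names every column of the target table.

    Columns missing from the original insert list are filled with NULL
    (an assumption; a real implementation may need typed defaults).
    """
    if len(insert_columns) != len(select_exprs):
        raise ValueError("insert column list and select list differ in length")
    unknown = [c for c in insert_columns if c not in table_columns]
    if unknown:
        raise ValueError(f"columns not in target table: {unknown}")

    # Map each originally inserted column to its select expression.
    by_column = dict(zip(insert_columns, select_exprs))
    # Walk the full table column list, padding the gaps with NULL.
    padded = [by_column.get(col, f"NULL as {col}") for col in table_columns]
    return (
        f"insert into {table} ({', '.join(table_columns)}) "
        f"select {', '.join(padded)} from {source}"
    )


original_columns = [
    "cohort_definition_id", "analysis_id",
    "stratum_1", "stratum_2", "stratum_3", "stratum_4", "count_value",
]
original_exprs = [
    "cohort_definition_id", "analysis_id",
    "cast(stratum_1 as STRING)", "cast(stratum_2 as STRING)",
    "cast(stratum_3 as STRING)", "cast(stratum_4 as STRING)",
    "count_value",
]
rewritten = pad_insert(
    "result_schema.heracles_results",
    original_columns,
    original_exprs,
    HERACLES_RESULTS_COLUMNS,
    "tmp.xrfc9243results_1",
)
print(rewritten)
```

The sketch sidesteps SQL parsing by taking the column and expression lists as inputs; rewriting arbitrary SQL text at execution time is the harder part, which is why naming all columns in the source SQL is the simpler long-term fix.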

chrisknoll · 2022-10-15T20:32:20Z

Ok, it's messy, but I think we have to stick to writing cross-platform SQL (with respect to SqlRender). There's talk about creating an OHDSI-SQL specification, and part of that would include naming your columns, and even matching your INSERTs to SELECTs.

I'll take this one on over time if necessary.

TomWhite-MedStar · 2022-10-16T20:21:54Z

@chrisknoll , sounds like you are suggesting refactoring the relevant SQL. If so, I'm willing to help if I can get some guidance.

I've confirmed that all of the queries prior to the insert statement run correctly. So, specifically for this Heracles function, it appears that only four things need to change:

(1) Change contents of https://github.com/OHDSI/WebAPI/blob/master/src/main/resources/resources/cohortanalysis/sql/selectHeraclesResults.sql to
select cohort_definition_id, analysis_id, cast(stratum_1 as STRING), cast(stratum_2 as STRING), cast(stratum_3 as STRING), cast(stratum_4 as STRING), '' as stratum_5, count_value, GETDATE() from #results_@analysisId
(2) Change line 41 in https://github.com/OHDSI/WebAPI/blob/master/src/main/java/org/ohdsi/webapi/cohortanalysis/HeraclesQueryBuilder.java to
private final static String INSERT_RESULT_STATEMENT = "insert into @results_schema.heracles_results (cohort_definition_id, analysis_id, stratum_1, stratum_2, stratum_3, stratum_4, stratum_5, count_value, last_update_time)\n";
(3) Change contents of https://github.com/OHDSI/WebAPI/blob/master/src/main/resources/resources/cohortanalysis/sql/selectHeraclesDistResults.sql to
select cohort_definition_id, analysis_id, cast(stratum_1 as varchar(255)), cast(stratum_2 as varchar(255)), cast(stratum_3 as varchar(255)), cast(stratum_4 as varchar(255)), cast(stratum_5 as varchar(255)), cast(count_value as bigint), cast(min_value as float), cast(max_value as float), cast(avg_value as float), cast(stdev_value as float), cast(median_value as float), cast(p10_value as float), cast(p25_value as float), cast(p75_value as float), cast(p90_value as float), GETDATE() from #results_dist_@analysisId
(4) Change line 42 in https://github.com/OHDSI/WebAPI/blob/master/src/main/java/org/ohdsi/webapi/cohortanalysis/HeraclesQueryBuilder.java to
private final static String INSERT_DIST_RESULT_STATEMENT = "insert into @results_schema.heracles_results_dist (cohort_definition_id, analysis_id, stratum_1, stratum_2, stratum_3, stratum_4, stratum_5, count_value, min_value, max_value, avg_value, stdev_value, median_value, p10_value, p25_value, p75_value, p90_value, last_update_time)\n";
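As a rough illustration of what changes (1) and (2) are meant to produce, the Python sketch below assembles one insert header naming all nine columns, followed by one select per analysis joined with UNION ALL. It only mimics the string assembly; it is not the HeraclesQueryBuilder code. The function name `build_insert` is hypothetical, and the plain string replacement of `@results_schema` and `@analysisId` stands in for SqlRender's parameter rendering.

```python
# Insert header from change (2): every column of heracles_results is named.
INSERT_RESULT_STATEMENT = (
    "insert into @results_schema.heracles_results "
    "(cohort_definition_id, analysis_id, stratum_1, stratum_2, stratum_3, "
    "stratum_4, stratum_5, count_value, last_update_time)\n"
)

# Select fragment from change (1): one expression per inserted column.
SELECT_TEMPLATE = (
    "select cohort_definition_id, analysis_id, cast(stratum_1 as STRING), "
    "cast(stratum_2 as STRING), cast(stratum_3 as STRING), "
    "cast(stratum_4 as STRING), '' as stratum_5, count_value, GETDATE() "
    "from #results_@analysisId"
)


def build_insert(results_schema, analysis_ids):
    """Join one select per analysis with UNION ALL under a single insert."""
    if not analysis_ids:
        raise ValueError("at least one analysis id is required")
    selects = [
        SELECT_TEMPLATE.replace("@analysisId", str(analysis_id))
        for analysis_id in analysis_ids
    ]
    header = INSERT_RESULT_STATEMENT.replace("@results_schema", results_schema)
    return header + "\nUNION ALL\n".join(selects)


sql = build_insert("results", [1, 2])
print(sql)
```

Because the header and every select carry the same nine columns in the same order, the statement no longer depends on the database filling in omitted columns. The same pattern applies to heracles_results_dist in changes (3) and (4).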

Any chance that can get incorporated into the 2.12.1 release?

chrisknoll · 2022-10-17T01:47:09Z

Yes, absolutely. If we can get a PR put together with the changes, then we can incorporate that into a hotfix. If you can make the above changes, all the better.

TomWhite-MedStar · 2022-10-17T02:42:31Z

@chrisknoll , thanks. I added a pull request #2138 . I'm not able to compile and test the WebAPI myself, but was able to test the SQL statements this should generate.

chrisknoll · 2022-10-20T17:24:34Z

@TomWhite-MedStar : Hi, I need a few days to recover from all the activity, but I'm not ignoring you. Let me get back to you with final thoughts and we can push this forward then.

TomWhite-MedStar · 2022-11-30T19:20:30Z

@chrisknoll , any chance this can get into 2.12.1 release?

chrisknoll · 2022-11-30T19:30:49Z

I haven't had a chance to look into this, and I won't have time any time soon, so I left a comment on the PR saying I can approve it if the PR works for Spark and other dialects. Once it is merged to master, Odysseus can pull together a hotfix with this commit.

alex-odysseus · 2023-01-24T18:59:50Z

@ssuvorov-fls please cherry pick to 'master-2.12'

TomWhite-MedStar added a commit to TomWhite-MedStar/WebAPI that referenced this issue Oct 17, 2022

Fix issue OHDSI#2112 - Heracles reports do not insert final results when using Spark (f86808d)

chrisknoll mentioned this issue Oct 24, 2022

Fix issue #2112 - Heracles reports do not insert final results when using Spark #2138 (Merged)

anthonysena added this to the v2.12.1 milestone Jan 24, 2023

anthonysena closed this as completed in #2138 Jan 24, 2023

anthonysena pushed a commit that referenced this issue Jan 24, 2023

Fix issue #2112 - Heracles reports do not insert final results when using Spark (#2138) (b7ae10b)

ssuvorov-fls pushed a commit that referenced this issue Jan 25, 2023

Fix issue #2112 - Heracles reports do not insert final results when using Spark (#2138) (cherry picked from commit b7ae10b) (fdfdd95)

alex-odysseus added the bug label Feb 1, 2023
