Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Heracles reports do not insert final results when using Spark #2112

Closed
TomWhite-MedStar opened this issue Oct 5, 2022 · 13 comments · Fixed by #2138
Closed

Heracles reports do not insert final results when using Spark #2112

TomWhite-MedStar opened this issue Oct 5, 2022 · 13 comments · Fixed by #2138
Labels
Milestone

Comments

@TomWhite-MedStar
Copy link
Contributor

TomWhite-MedStar commented Oct 5, 2022

Expected behavior

After defining and generating a cohort, it should be possible to use the Reporting tab to generate Quick or Full Analyses.

Actual behavior

This works fine on SQL Server, but not on Spark (e.g. DataBricks).

Steps to reproduce behavior

Need an instance of OHDSI that has an OMOP data source connected via SPARK

  1. Select any cohort
  2. Run Generate on the cohort
  3. Run "Quick Analysis" under the Reporting tab
    After several minutes, the Job will show a Failed status.

Root Cause

After all of the initial processing, the final query looks like this:

insert into result_schema.heracles_results (cohort_definition_id, analysis_id, stratum_1, stratum_2, stratum_3, stratum_4, count_value) select cohort_definition_id, analysis_id, cast(stratum_1 as STRING), cast(stratum_2 as STRING), cast(stratum_3 as STRING), cast(stratum_4 as STRING), count_value from tmp.xrfc9243results_1 UNION ALL ...

However, SPARK SQL does not support INSERT INTO with a list of fields that does not exactly match the full set of columns for the final table.

Modifying the query does work. Either this:

insert into omop_160101_to_220731_v813_results.heracles_results (cohort_definition_id, analysis_id, stratum_1, stratum_2, stratum_3, stratum_4, stratum_5, count_value, last_update_time) select cohort_definition_id, analysis_id, cast(stratum_1 as STRING), cast(stratum_2 as STRING), cast(stratum_3 as STRING), cast(stratum_4 as STRING), '' as stratum_5, count_value, now() from tmp.xrfc9243results_1

or this:

insert into omop_160101_to_220731_v813_results.heracles_results select cohort_definition_id, analysis_id, cast(stratum_1 as STRING), cast(stratum_2 as STRING), cast(stratum_3 as STRING), cast(stratum_4 as STRING), '' as stratum_5, count_value, now() from tmp.xrfc9243results_1

Can the relevant SQL fragments be modified so that this works for Spark?

One of the relevant files appears to be:
https://github.com/OHDSI/WebAPI/blob/master/src/main/resources/resources/cohortanalysis/sql/selectHeraclesResults.sql

But it looks as though a Java file is used to orchestrate the sub-queries:
https://github.com/OHDSI/WebAPI/blob/fd108571086fceea8513389fab426a6ea8101888/src/main/java/org/ohdsi/webapi/cohortanalysis/HeraclesQueryBuilder.java

@chrisknoll
Copy link
Collaborator

Thanks for the suggestion. we're just completing a 2.12.0 release, but this is something we can push into 2.12.1 after if we don't have time to squeeze this in.

@TomWhite-MedStar
Copy link
Contributor Author

Thanks. There is a some discussion on Forums too - https://forums.ohdsi.org/t/databricks-spark-coming-to-ohdsi-stack/14545/13

@TomWhite-MedStar
Copy link
Contributor Author

Per @alondhe:
My guess is that the tasklet that executes the heracles SQL commands (https://github.com/OHDSI/WebAPI/blob/master/src/main/java/org/ohdsi/webapi/cohortanalysis/CohortAnalysisTasklet.java#L90) needs to be wrapped in SqlRender’s sparkHandleInsert function to ensure the insert command is reconstructed to have all table fields present.

@chrisknoll
Copy link
Collaborator

Shouldn't this be handled by the core SqlRender function? I do not feel supportive of putting dialect-specific rules in WebAPI when dialect specific rules should be taken into account with these external libaries (so that you get the handling without having to make the special case every place you use it). Is this a simple mater of sql rendering or is this something eternal to sql rendering and is something that has to be done directly on the connection?

@alondhe
Copy link
Contributor

alondhe commented Oct 15, 2022

SqlRender sparkHandleInsert() can be used, but it has to be invoked during any SQL command fired off by WebAPI. This is because we need the active connection in order to then run a "describe table" to get the full column list and re-write the insert. So unfortunately, dialect-specific rules are needed in WebAPI -- and we have done this already.

Refactoring all inserts to use all columns explicitly is the cleanest approach, but also a heavy lift. But if we did that, we can just skip any Spark specific patterns in WebAPI SQL executions.

@chrisknoll
Copy link
Collaborator

Ok, it's messy but I think we have to adhere to dealing with writing cross-platform SQL (with respect to SqlRender). There's talk about creating a OHDSI-SQL specification and part of that would include naming your columns, and even matching your INSERTS to SELECTS.

I'll take this one on over time if necessary.

@TomWhite-MedStar
Copy link
Contributor Author

TomWhite-MedStar commented Oct 16, 2022

@chrisknoll , sounds like you are suggesting refactoring the relevant SQL. If so, I'm willing to help if I can get some guidance.

I've confirmed that all of the queries prior to the insert statement run correctly. So, specifically for this Heracles function, it appears that only four things need to change:

(1) Change contents of https://github.com/OHDSI/WebAPI/blob/master/src/main/resources/resources/cohortanalysis/sql/selectHeraclesResults.sql to
select cohort_definition_id, analysis_id, cast(stratum_1 as STRING), cast(stratum_2 as STRING), cast(stratum_3 as STRING), cast(stratum_4 as STRING), '' as stratum_5, count_value, GETDATE() from #results_@analysisId
(2) Change line 41 in https://github.com/OHDSI/WebAPI/blob/master/src/main/java/org/ohdsi/webapi/cohortanalysis/HeraclesQueryBuilder.java to
private final static String INSERT_RESULT_STATEMENT = "insert into @results_schema.heracles_results (cohort_definition_id, analysis_id, stratum_1, stratum_2, stratum_3, stratum_4, stratum_5, count_value, last_update_time)\n";
(3) Change contents of https://github.com/OHDSI/WebAPI/blob/master/src/main/resources/resources/cohortanalysis/sql/selectHeraclesDistResults.sql to
select cohort_definition_id, analysis_id, cast(stratum_1 as varchar(255)), cast(stratum_2 as varchar(255)), cast(stratum_3 as varchar(255)), cast(stratum_4 as varchar(255)), cast(stratum_5 as varchar(255)), cast(count_value as bigint), cast(min_value as float), cast(max_value as float), cast(avg_value as float), cast(stdev_value as float), cast(median_value as float), cast(p10_value as float), cast(p25_value as float), cast(p75_value as float), cast(p90_value as float), GETDATE() from #results_dist_@analysisId
(4) Change line 42 in https://github.com/OHDSI/WebAPI/blob/master/src/main/java/org/ohdsi/webapi/cohortanalysis/HeraclesQueryBuilder.java to
private final static String INSERT_DIST_RESULT_STATEMENT = "insert into @results_schema.heracles_results_dist (cohort_definition_id, analysis_id, stratum_1, stratum_2, stratum_3, stratum_4, stratum_5, count_value, min_value, max_value, avg_value, stdev_value, median_value, p10_value, p25_value, p75_value, p90_value, last_update_time)\n";

Any chance that can get incorporated into the 2.12.1 release?

@chrisknoll
Copy link
Collaborator

yes, absolutely, if we can get a PR put together with changes, then we can incorporate that into a hotfix. If you can make the above changes, all the better.

TomWhite-MedStar added a commit to TomWhite-MedStar/WebAPI that referenced this issue Oct 17, 2022
@TomWhite-MedStar
Copy link
Contributor Author

@chrisknoll , thanks. I added a pull request #2138 . I'm not able to compile and test the WebAPI myself, but was able to test the SQL statements this should generate.

@chrisknoll
Copy link
Collaborator

@TomWhite-MedStar : Hi, I need a few days to recover from all the activity, but I'm not ignoring you. Let me get back to you with final thoughts and we can push this forward then.

@TomWhite-MedStar
Copy link
Contributor Author

@chrisknoll , any chance this can get into 2.12.1 release?

@chrisknoll
Copy link
Collaborator

I haven't had a chance to go into this, and I won't have time any time soon, so, I left a comment on the PR that I can approve if the PR works for spark and other dialects. Once merged to master, odyessus can pull together a hotfix with this commit.

@anthonysena anthonysena added this to the v2.12.1 milestone Jan 24, 2023
@alex-odysseus
Copy link
Contributor

@ssuvorov-fls please cherry pick to 'master-2.12'

ssuvorov-fls pushed a commit that referenced this issue Jan 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
5 participants