Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial bigquery support. #210

Merged
merged 4 commits into from
Sep 12, 2017
Merged

Initial bigquery support. #210

merged 4 commits into from
Sep 12, 2017

Conversation

myounglai
Copy link
Contributor

No description provided.

Copy link
Contributor

@t-abdul-basser t-abdul-basser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@myl-google Looks good to me.

@t-abdul-basser
Copy link
Contributor

@fdefalco @chrisknoll I thought that this might be of interest to you also. As you may remember from his excellent forum posts and commits re: Achilles on BigQuery, @myl-google has been adding BigQuery support to the OHDSI components for a few weeks. This is his first PR in WebAPI ( as opposed to Achiless) as he and I together to get Atlas/WebAPI on Big Query working.

Copy link
Collaborator

@anthonysena anthonysena left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@myl-google can you remove the webapi-bigquery profile from the pom.xml? We cannot host the WebAPI tables on BigQuery and that is what that configuration is attempting to do.

@t-abdul-basser
Copy link
Contributor

@anthonysena We are targeting initial support for BigQuery with this PR. Others are expected to follow. See Issue #231.

With flyaway support disabled, we are currently able to start WebAPI, show Atlas Data Sources, etc against BigQuery data set. Additional testing is being done but additon of bigquery profile appears warranted.

Flyway support will being addressed separately.

@chrisknoll
Copy link
Collaborator

@t-abdul-basser , I don't think there's a flyway support issue here: bigquery is for storing the CDM data, while a postgres/sqlserver/oracle instance hosts the webapi tables. The Webapi tables needs things like sequencers, foreign key support, etc, which isn't supported by bigquery. So, the change requested by @anthonysena is to remove the changes in this PR that involve introducing a new POM profile for bigquery.

@t-abdul-basser
Copy link
Contributor

We can host the WebAPI table in BigQuery. We are currently dealing with several use cases in AoU/PMI in which this is (currently) required. Missing features have been or will be addressed. For now, while other changes (embedded) may obviate this requirement, I suggest accepting new profile. I can remove later if required.

@anthonysena
Copy link
Collaborator

@t-abdul-basser - I think I'm a bit lost so apologies if I've missed something obvious. In reviewing this with @chrisknoll we presumed that the intent of including BigQuery was solely focused on hosting the CDM data. It sounds like you would like to expand that to also allow it to host the "ohdsi" schema that powers the WebAPI?

Part of my confusion stems from the ideas you presented in #217 where you also mentioned H2 for hosting the WebAPI "ohdsi" schema which sounded like it would potentially replace the use of other RDBMS in this role? I'm going to tag @fdefalco as this feels like it is also an architecture consideration.

Happy to discuss this if easier and then post the results of that discussion here for those that are interested.

@chrisknoll
Copy link
Collaborator

Hi, I'm also must apologize because I also share the same perspective as @anthonysena.

The main thrust of my comment in doing this review was this line of the POM:

 <flyway.locations>classpath:db/migration/postgres</flyway.locations>

This looks like it's saying that the big query profile uses the postgres migration scripts. Unless bigquery is some sort of fork of postgresql, I don't see how this could work considering our experience with defining migration scripts for the other platforms (every platform seems to have their own flavor of doing minor things). So, could you shed some light just on the POM configuration to use postgresql migration scripts?

-Chris

@myounglai
Copy link
Contributor Author

Sorry for the delay on responding to these comments. I've removed the webapi-bigquery profile from the pom.xml.

Up until this feedback, however, I really was thinking of this as a step towards getting bigquery working for hosting both the cdm and ohdsi schemas. Now I'm wondering if that's really worthwhile given the number of db features that the web-api depends on that are absent from bigquery.

Is running with a different DBMS backends for the ohdsi and cdm data a common, well-supported configuration in which all functionality is expected to work? If so, then I'll probably change focus to getting bigquery working well for the cdm schema in conjunction with a postgres database as the ohdsi schema.

@chrisknoll
Copy link
Collaborator

chrisknoll commented Sep 7, 2017

Is running with a different DBMS backends for the ohdsi and cdm data a common, well-supported configuration in which all functionality is expected to work? If so, then I'll probably change focus to getting bigquery working well for the cdm schema in conjunction with a postgres database as the ohdsi schema.

Yes, that's the normal configuration. We have an internal network of 10+ CDM instances all managed by a single WebAPI instance allowing us to generate cohorts and run analysis across all of them. The WebAPI db (that POM.xml references) is MSSQL (but postgressql, oracle, are supported) while our CDMs are hosted on SQL Server, PDW, Redshift. The single-cdm-webapi configuration is really only for demonstration purposes. Our ohdsi.org instance doesn't even do the single-cdm configuration anymore.

Edit: I should also point out that the connection configuration for the external CDM databases is set up in the WebAPI's database's source and source_daimon table. Even in a single-cdm configuration.

-Chris

@t-abdul-basser
Copy link
Contributor

@myl-google My apologies for not clarifying this issue earlier with @chrisknoll and @anthonysena earlier.

@chrisknoll @anthonysena Sorry not taking you up on your offer to discuss this further in another forum. I will try to clear up some the confusion that that I may have introduced with earlier comments: This PR (and the overall thrust of current BigQuery-related development) does aim to support the use case in which BigQuery contains the WebAPI tables (OHDSI schema) as well as the CDM schema, as per the original PR. In that way, we would not only be treating BigQuery like the other storage systems (MSSQL, postgres, Oracle, etc) that WebAPI supports, but also meeting **a specific requirement of the PMI/AoU project out of which that this work comes.

@anthonysena While #217/#232 would potentially alleviate the need for BigQuery (or any other supported DBMS for that matter) to host the WebAPI tables, it is not the intention of this PR be dependent on that proposal, i.e. *it is not the intention of this PR that there must be an embedded DB that contains WebAPI tables and that therefore BigQuery will not host the WebAPI tables

@chrisknoll We aware that there are some feature of WebAPI that require features uses that are missing or lacking in BigQuery in comparison to the currently supported DBMS'. @myl-google has identified several and I have independently identified these and a few others. There are workarounds for several of these. However, since 1) our immediate goal is to support initialization and CDM profiling reports (Atlas Data Source) and 2) this initial PR actually also supports basic vocab searches, concept set creation and cohort definition creation, we are proposing that this more than adequate as an initial PR.

So I move that the original PR be accepted. My apologies again @chrisknoll @anthonysena if I am still not being clear. Please let me know if there are additional questions.

@myounglai
Copy link
Contributor Author

@t-abdul-basser I've removed the change to pom.xml as @anthonysena requested. Could we merge without that part for now to unblock other changes and discuss separately adding it back as a followup PR?

@t-abdul-basser
Copy link
Contributor

t-abdul-basser commented Sep 8, 2017

@myl-google That is fine.

@anthonysena Is that acceptable?

Copy link
Collaborator

@chrisknoll chrisknoll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes approved.

pom.xml Outdated
<flyway.datasource.url>${datasource.url}</flyway.datasource.url>
<flyway.datasource.username>FLYWAY_USER</flyway.datasource.username>
<flyway.datasource.password>FLYWAY_PASS</flyway.datasource.password>
<flyway.locations>classpath:db/migration/postgres</flyway.locations>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assumes that the bigquery dialect is 100% compatable with the postgres migration scripts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, no guarantees regarding compatibility with migration script made at all :)

@myounglai
Copy link
Contributor Author

@anthonysena can you please approve to unblock the merge of this change?

@t-abdul-basser
Copy link
Contributor

Thanks @anthonysena and @chrisknoll!

@myounglai
Copy link
Contributor Author

@t-abdul-basser
Why did you close this? It doesn't seem to have been merged yet.

@t-abdul-basser
Copy link
Contributor

@myl-google Oops. Re-opened.

@chrisknoll chrisknoll merged commit 91df6c1 into OHDSI:master Sep 12, 2017
@chrisknoll
Copy link
Collaborator

chrisknoll commented Sep 12, 2017

Ok, I merged this. In the future, for big query, I suggest this:
I'd fork webapi and start a new feature branch for a complete big query implementation, Once it's fully functional, then we take it into WebAPI as a complete feature. What I would like to avoid is seeing fragments of big query support appearing across multiple releases of WebAPI. As it stands right now, the only thign this PR does is attempt to load the JDBC driver so that if you have a CDM source set up, it will be able to query against it. Is that enough? Do we need some changes to SQL Render? Can we now officially say that WebAPI can execute analysis queries against CDMs hosted in BigQuery? If the answer is no, let's keep a fork'ed feature branch open until we can say we can do this. Then once it's all ready, we can bring it in as part of a single release, and people will know exactly which release WebAPI officially supported big query.

@myounglai
Copy link
Contributor Author

@chrisknoll I've made a bunch of sql render changes already and things do appear to work with just this change. Is there some sort of overall acceptance test I should be updating before we can say that the WebAPI officially supports big query?

@chrisknoll
Copy link
Collaborator

No we don't have an integration test suite that could call out to specific analysis calls. That would be a good community contribution. For now, our internal validation involves executing some pre-defined cohorts and analysis queries (achilles is a good one) that can verify the db support. Since not everyone has the different db flavors (we use PDW and MSSQL) we have to trust the community when they introduce a new platform that the functionality works (Impala is an example). So I do see commits over in OHDSI/SqlRender land that deal with bigquery, so if you guys are saying that achilles runs and you can get cohort generations to execute (and re-execute, the results are refreshed between runs so deleting is something that should be supported) then we should be able to say that BigQuery is a supported platform to host CDM data for WebAPI analysis queries. Note, this doesn't include hosting webapi META tables (MS pdw and Redshift does not support this either) so it's fine to keep BigQuery in the same category as the other MPP platforms.

-Chris

@myounglai
Copy link
Contributor Author

I've verified that achilles runs including viewing the results in Atlas. Where can I find the pre-defined cohorts so that I can also verify those?

@chrisknoll
Copy link
Collaborator

we use some internal ones, but you can also pull one off the ohdsi website and import them into your local atlas installation, for example:
http://www.ohdsi.org/web/atlas/#/cohortdefinition/98094

That one is an implementation of PHEKB T2DM cohort identification. It's pretty beefy and uses quite a bit of logic. Also, it appears to have been generated on the ohdsi website, so I believe it's valid.

Othweriwse, you just create your own internally that does a few complex operations (multiple primary events, qualifying crteria, exit strategies) and if it works there, you're probably in good shape.

-Chris

@myounglai
Copy link
Contributor Author

Thanks! I'll see if I can get those working.

@myounglai
Copy link
Contributor Author

It took one more SqlRender change OHDSI/SqlRender#83 but http://www.ohdsi.org/web/atlas/#/cohortdefinition/98094 runs now.

A couple of followup questions:

@t-abdul-basser
Copy link
Contributor

t-abdul-basser commented Sep 19, 2017

That is excellent @myl-google! We will try to replicate.

  • Adding generation tests there would be a great contribution.
  • @chrisknoll @anthonysena, who coordinate our releases, and @leeevans, who maintains our OHDSI code repo., can better answer that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants