-
Notifications
You must be signed in to change notification settings - Fork 158
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Initial bigquery support. #210
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@myl-google Looks good to me.
@fdefalco @chrisknoll I thought that this might be of interest to you also. As you may remember from his excellent forum posts and commits re: Achilles on BigQuery, @myl-google has been adding BigQuery support to the OHDSI components for a few weeks. This is his first PR in WebAPI ( as opposed to Achiless) as he and I together to get Atlas/WebAPI on Big Query working. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@myl-google can you remove the webapi-bigquery profile from the pom.xml? We cannot host the WebAPI tables on BigQuery and that is what that configuration is attempting to do.
@anthonysena We are targeting initial support for BigQuery with this PR. Others are expected to follow. See Issue #231. With flyaway support disabled, we are currently able to start WebAPI, show Atlas Data Sources, etc against BigQuery data set. Additional testing is being done but additon of bigquery profile appears warranted. Flyway support will being addressed separately. |
@t-abdul-basser , I don't think there's a flyway support issue here: bigquery is for storing the CDM data, while a postgres/sqlserver/oracle instance hosts the webapi tables. The Webapi tables needs things like sequencers, foreign key support, etc, which isn't supported by bigquery. So, the change requested by @anthonysena is to remove the changes in this PR that involve introducing a new POM profile for bigquery. |
We can host the WebAPI table in BigQuery. We are currently dealing with several use cases in AoU/PMI in which this is (currently) required. Missing features have been or will be addressed. For now, while other changes (embedded) may obviate this requirement, I suggest accepting new profile. I can remove later if required. |
@t-abdul-basser - I think I'm a bit lost so apologies if I've missed something obvious. In reviewing this with @chrisknoll we presumed that the intent of including BigQuery was solely focused on hosting the CDM data. It sounds like you would like to expand that to also allow it to host the "ohdsi" schema that powers the WebAPI? Part of my confusion stems from the ideas you presented in #217 where you also mentioned H2 for hosting the WebAPI "ohdsi" schema which sounded like it would potentially replace the use of other RDBMS in this role? I'm going to tag @fdefalco as this feels like it is also an architecture consideration. Happy to discuss this if easier and then post the results of that discussion here for those that are interested. |
Hi, I'm also must apologize because I also share the same perspective as @anthonysena. The main thrust of my comment in doing this review was this line of the POM:
This looks like it's saying that the big query profile uses the postgres migration scripts. Unless bigquery is some sort of fork of postgresql, I don't see how this could work considering our experience with defining migration scripts for the other platforms (every platform seems to have their own flavor of doing minor things). So, could you shed some light just on the POM configuration to use postgresql migration scripts? -Chris |
Sorry for the delay on responding to these comments. I've removed the webapi-bigquery profile from the pom.xml. Up until this feedback, however, I really was thinking of this as a step towards getting bigquery working for hosting both the cdm and ohdsi schemas. Now I'm wondering if that's really worthwhile given the number of db features that the web-api depends on that are absent from bigquery. Is running with a different DBMS backends for the ohdsi and cdm data a common, well-supported configuration in which all functionality is expected to work? If so, then I'll probably change focus to getting bigquery working well for the cdm schema in conjunction with a postgres database as the ohdsi schema. |
Yes, that's the normal configuration. We have an internal network of 10+ CDM instances all managed by a single WebAPI instance allowing us to generate cohorts and run analysis across all of them. The WebAPI db (that POM.xml references) is MSSQL (but postgressql, oracle, are supported) while our CDMs are hosted on SQL Server, PDW, Redshift. The single-cdm-webapi configuration is really only for demonstration purposes. Our ohdsi.org instance doesn't even do the single-cdm configuration anymore. Edit: I should also point out that the connection configuration for the external CDM databases is set up in the WebAPI's database's source and source_daimon table. Even in a single-cdm configuration. -Chris |
@myl-google My apologies for not clarifying this issue earlier with @chrisknoll and @anthonysena earlier. @chrisknoll @anthonysena Sorry not taking you up on your offer to discuss this further in another forum. I will try to clear up some the confusion that that I may have introduced with earlier comments: This PR (and the overall thrust of current BigQuery-related development) does aim to support the use case in which BigQuery contains the WebAPI tables (OHDSI schema) as well as the CDM schema, as per the original PR. In that way, we would not only be treating BigQuery like the other storage systems (MSSQL, postgres, Oracle, etc) that WebAPI supports, but also meeting **a specific requirement of the PMI/AoU project out of which that this work comes. @anthonysena While #217/#232 would potentially alleviate the need for BigQuery (or any other supported DBMS for that matter) to host the WebAPI tables, it is not the intention of this PR be dependent on that proposal, i.e. *it is not the intention of this PR that there must be an embedded DB that contains WebAPI tables and that therefore BigQuery will not host the WebAPI tables @chrisknoll We aware that there are some feature of WebAPI that require features uses that are missing or lacking in BigQuery in comparison to the currently supported DBMS'. @myl-google has identified several and I have independently identified these and a few others. There are workarounds for several of these. However, since 1) our immediate goal is to support initialization and CDM profiling reports (Atlas Data Source) and 2) this initial PR actually also supports basic vocab searches, concept set creation and cohort definition creation, we are proposing that this more than adequate as an initial PR. So I move that the original PR be accepted. My apologies again @chrisknoll @anthonysena if I am still not being clear. Please let me know if there are additional questions. |
@t-abdul-basser I've removed the change to pom.xml as @anthonysena requested. Could we merge without that part for now to unblock other changes and discuss separately adding it back as a followup PR? |
@myl-google That is fine. @anthonysena Is that acceptable? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes approved.
pom.xml
Outdated
<flyway.datasource.url>${datasource.url}</flyway.datasource.url> | ||
<flyway.datasource.username>FLYWAY_USER</flyway.datasource.username> | ||
<flyway.datasource.password>FLYWAY_PASS</flyway.datasource.password> | ||
<flyway.locations>classpath:db/migration/postgres</flyway.locations> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This assumes that the bigquery dialect is 100% compatable with the postgres migration scripts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, no guarantees regarding compatibility with migration script made at all :)
@anthonysena can you please approve to unblock the merge of this change? |
Thanks @anthonysena and @chrisknoll! |
@t-abdul-basser |
@myl-google Oops. Re-opened. |
Ok, I merged this. In the future, for big query, I suggest this: |
@chrisknoll I've made a bunch of sql render changes already and things do appear to work with just this change. Is there some sort of overall acceptance test I should be updating before we can say that the WebAPI officially supports big query? |
No we don't have an integration test suite that could call out to specific analysis calls. That would be a good community contribution. For now, our internal validation involves executing some pre-defined cohorts and analysis queries (achilles is a good one) that can verify the db support. Since not everyone has the different db flavors (we use PDW and MSSQL) we have to trust the community when they introduce a new platform that the functionality works (Impala is an example). So I do see commits over in OHDSI/SqlRender land that deal with bigquery, so if you guys are saying that achilles runs and you can get cohort generations to execute (and re-execute, the results are refreshed between runs so deleting is something that should be supported) then we should be able to say that BigQuery is a supported platform to host CDM data for WebAPI analysis queries. Note, this doesn't include hosting webapi META tables (MS pdw and Redshift does not support this either) so it's fine to keep BigQuery in the same category as the other MPP platforms. -Chris |
I've verified that achilles runs including viewing the results in Atlas. Where can I find the pre-defined cohorts so that I can also verify those? |
we use some internal ones, but you can also pull one off the ohdsi website and import them into your local atlas installation, for example: That one is an implementation of PHEKB T2DM cohort identification. It's pretty beefy and uses quite a bit of logic. Also, it appears to have been generated on the ohdsi website, so I believe it's valid. Othweriwse, you just create your own internally that does a few complex operations (multiple primary events, qualifying crteria, exit strategies) and if it works there, you're probably in good shape. -Chris |
Thanks! I'll see if I can get those working. |
It took one more SqlRender change OHDSI/SqlRender#83 but http://www.ohdsi.org/web/atlas/#/cohortdefinition/98094 runs now. A couple of followup questions:
|
That is excellent @myl-google! We will try to replicate.
|
No description provided.