Added data_access module for managed access data. Fixes #1535 #1537

Merged
merged 20 commits into from
Mar 4, 2024

ESapenaVentura
Collaborator

@ESapenaVentura ESapenaVentura commented Nov 8, 2023

Fixes #1535

related to: ebi-ait/dcp-ingest-central#967

Release notes

For type/project/project schema:

  • Added new required field data_access

For module/ontology/data_access_ontology schema:

  • Created module

@ncalvanese1 ncalvanese1 left a comment

LGTM

@hannes-ucsc
Contributor

hannes-ucsc commented Nov 8, 2023

Can you describe what use cases this change is intended to facilitate? I assumed we were planning to use DUOS and this change doesn't seem to tie into that.

@amnonkhen
Collaborator

Can you describe what use cases this change is intended to facilitate? I assumed we were planning to use DUOS and this change doesn't seem to tie into that.

Good point @hannes-ucsc
We can have another field in addition to type and notes to hold the DUO code. It can be an enum or an ontology term of children of data use permission or data use modifier. Currently, the implementation on the ingest side takes into account the type property when authorising an API call. I don't object to using the DUO code for that, but I would like to finish system tests on the change as it is, so I would like to get this schema change live in the dev environment, for which I will need to merge it to the dev branch.

@amnonkhen amnonkhen changed the base branch from staging to develop November 9, 2023 23:00
@hannes-ucsc
Contributor

When I said DUOS, I meant https://duos.broadinstitute.org/

@ncalvanese1 would the addition of a field with the DUO ontology as described by @amnonkhen work for you when it comes to setting up snapshot permissions?

@amnonkhen
Collaborator

amnonkhen commented Nov 9, 2023

When I said DUOS, I meant https://duos.broadinstitute.org/

Thanks for the clarification.
I need a project to have an indication of whether it is open access or managed access, because I use it to allow or prevent access to metadata in ingest via its REST API.
@hannes-ucsc Are you objecting to the type and notes attributes of the dataAccess module, or are you suggesting adding something in addition to them?

@hannes-ucsc
Contributor

I'm not objecting to anything. I am asking questions.

@gabsie
Contributor

gabsie commented Nov 10, 2023

Hi, @hannes-ucsc @amnonkhen @ESapenaVentura @ncalvanese1

Basically, storing this information (managed access or open access) with us and passing it to the downstream components will allow us to differentiate how to handle those datasets.

As Amnon has said, this way we will make sure these projects are treated with the required security with us in ingest and down the line with you. So for now with the pilot we are starting, we can mark that project as 'managed access'.

There is no problem to add an additional field/note around DUO codes, but I don't know if at this stage we will always know these from contributors or can assign them correctly. I think we have to continue our current conversations about managed access implementations to decide on the DUO codes.

Hope this is okay and a good first step.

@amnonkhen
Collaborator

Thanks @gabsie for the clarifications and @hannes-ucsc for the questions. I did not say you were objecting, I was trying to understand why you were asking the questions you were asking, in order to figure out how to explain my points better.

@hannes-ucsc
Contributor

Simply marking a project as containing managed-access data is not sufficient information for determining and enforcing who should have access to that data. Assuming that in order to answer that question, we ultimately want to integrate with DUOS (not DUO) and SAM, the changes proposed here are not taking us in that direction. They will likely need to be backed out and replaced with something else.

For the pilot, why can't we just communicate the set of MA-projects to @ncalvanese1 when they are ready to be imported, instead of burdening the schema with a temporary solution? The set of people who should have access will also have to be communicated to @ncalvanese1 so giving him a few project UUIDs in addition to that set of user identities seems no big deal.

Because I can't attend the biweekly Tuesday meetings, I am likely out of the loop with respect to the pilot. If someone here is the mastermind for the pilot effort, or knows who that person is, please let me know.

@amnonkhen
Collaborator

@hannes-ucsc each component in the DCP will figure out the access rights of its own users. ingest will get information from the DAC in either push or pull mode about users who need access to the project, and protect the api calls accordingly. The same would apply for other components. The access list can change after the project has been exported by ingest, so it is not ingest's responsibility to communicate the access rights down stream. The only thing ingest would communicate is whether the project is a managed access project. Downstream components would query the DAC (or consult a local copy) for managed access projects.

In addition, we need the change in the pilot because ingest uses the schema when it exports files. This is a small change, which we can refine at a later stage.
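
A minimal sketch of the kind of check described above, assuming the project document carries a `data_access.type` value and that the access list obtained from the DAC is available as a set of user identities. The function name `is_request_allowed`, the parameter `dac_access_list`, and the literal values `"open access"` and `"managed access"` are illustrative assumptions, not ingest's actual API.

```python
from typing import Optional, Set


def is_request_allowed(project: dict, user: Optional[str], dac_access_list: Set[str]) -> bool:
    """Return True if `user` may read this project's metadata (hypothetical rule)."""
    access_type = project.get("data_access", {}).get("type")
    if access_type == "open access":
        # Open projects are readable by anyone, including anonymous callers.
        return True
    if access_type == "managed access":
        # Managed projects need an identified user who is on the DAC's list.
        return user is not None and user in dac_access_list
    # Missing or unknown type: fail closed rather than expose metadata.
    return False


managed_project = {"data_access": {"type": "managed access"}}
print(is_request_allowed(managed_project, "alice@example.org", {"alice@example.org"}))
print(is_request_allowed(managed_project, None, {"alice@example.org"}))
```

Note that the flag only tells the component that a check is needed; who is on the list still has to come from somewhere else.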

@hannes-ucsc
Contributor

each component in the DCP will figure out the access rights of its own users

That's a bit simplistic, I'm afraid, and I must point out, your opinion, not some already accepted decision of the DCP/2.

In my opinion, each component shouldn't have to "figure out access rights", it should enforce access, and do so consistently with other components across the platform, and we've already adopted the specification of how this would work in TDR and downstream. You approved the PR that added that part of the specification. Each component can only enforce access if it has the information to do so. To that extent, Azul and Data Browser implement the specification, and tie into TDR for determining who has access. What's missing is the mechanism by which TDR determines access and I thought consensus was that we would use DUOS. We need to figure out how this would work, ideally with an addition to the DCP/2 specification.

This is a small change, which we can refine at a later stage.

If you need a way to temporarily mark a project as managed-access for the pilot, you could just adopt a naming convention for the project description or its short name. That way we wouldn't need to burden the schema with a change that we already know is insufficient for a complete implementation of managed-access for HCA.

@gabsie
Contributor

gabsie commented Nov 21, 2023

Hey both!
@hannes-ucsc - shall we try and organise a short meeting to decide on these? Maybe involving Nate, us, and you?
Tell us some convenient slots for you - I think this will help us move on. Managed access has not been easy to coordinate so far. Hannes, there's also a meeting which happens bi-weekly, which we feel you should be invited to. It is organised by John Randell at the HCA exec office. Next one is Dec 7th, 4pm UK time.

@hannes-ucsc
Contributor

That works for me!

@gabsie
Contributor

gabsie commented Nov 22, 2023

I will tell them to add you to that meeting. Meanwhile, we still need to catch up among ourselves, so I suggest next week Wednesday, 29 Dec, 4 or 5pm UK time? Let me know if this works, otherwise please suggest another time.

@amnonkhen
Collaborator

@hannes-ucsc I see 2 scenarios:

  1. it is ingest's responsibility to update DUOS with the managed access information when projects are registered in ingest as "managed access"?

  2. managed access registration is recorded on DUOS (not sure yet by whom). HCA components query DUOS when they receive a request to access a project resource, so that they can enforce access to that resource

I have been under the impression that scenario 2 is the effective scenario. Do you see it differently?

Can you please clarify what you mean?

@hannes-ucsc
Contributor

Can you please clarify what you mean?

Specifically, which of my statements is unclear to you?

@gabsie
Contributor

gabsie commented Nov 27, 2023

Hi @amnonkhen and @hannes-ucsc, but also @ncalvanese1 - let's discuss this in a call. I have proposed this Wednesday, 29 Dec, 4 pm UK time. Does that work for you?

@hannes-ucsc
Contributor

I have proposed this Wednesday, 29 Dec, 4 pm UK time. Does that work for you?

Assuming you mean Nov 29, I'm hesitant to have a meeting without wranglers or leadership present. We need to determine who the authority is on deciding the conditions by which specific users are granted access to managed-access projects in HCA. I don't think this is something the audience of the meeting you proposed would be able to determine. That's why I'd rather wait for the big picture to be formed during the meeting on 12/7/2023. Then we can discuss the implementation details. OTOH, it would be helpful to hear from @ncalvanese1. If he thinks a separate meeting would be useful, I'd be happy to join.

@gabsie
Contributor

gabsie commented Dec 5, 2023

Hey Hannes, no need for now for the separate meeting.
As a start, let's get you to join the next HCA Managed access call, which will potentially clarify some aspects, specifically the role of the DACO and how they handle registrations and requests.
If we still need another meeting after, we can schedule that.

Gabby

@idazucchi
Collaborator

With the last few commits I addressed two comments which are not visible now:

  1. I’ve changed naming from data use to data restriction for all fields/descriptions
  2. Labels: I’ve made the ontology_label into an enum and added a set of rules to enforce the correct pairs of labels and ids. I think this is important because otherwise we can get cases where the ID indicates that the dataset is managed access while the label indicates open access and we don’t know which one to trust

In addition to this I’ve made the ontology id required so we can never push a project that lacks the access restrictions. I’ve also made the ontology label required so we have a field that’s human readable, which makes it easier to check that we selected the right DUO code for the project
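
The pairing rule described above can be sketched as follows. This is a minimal illustration in plain Python, not the schema rules added in the PR; the field name `ontology` for the ID, the name `ALLOWED_PAIRS`, and the function `check_restriction` are assumptions made for the example. The ID and label pairs are the DUO terms discussed in this thread.

```python
# Each DUO term ID is paired with exactly one human-readable label.
ALLOWED_PAIRS = {
    "DUO:0000004": "no restriction",
    "DUO:0000042": "general research use",
    "DUO:0000046": "non-commercial use only",
}


def check_restriction(module: dict) -> list:
    """Return a list of problems; an empty list means the module is consistent."""
    # Both fields are required, so a project can never lack its restriction.
    missing = [f for f in ("ontology", "ontology_label") if f not in module]
    if missing:
        return [f"missing required field: {name}" for name in missing]

    expected_label = ALLOWED_PAIRS.get(module["ontology"])
    if expected_label is None:
        return [f"unknown ontology ID: {module['ontology']}"]
    if module["ontology_label"] != expected_label:
        # This is the case where ID and label disagree and neither can be trusted.
        return [f"label {module['ontology_label']!r} does not match "
                f"{module['ontology']} ({expected_label!r})"]
    return []


print(check_restriction({"ontology": "DUO:0000042", "ontology_label": "no restriction"}))
```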

Contributor

@hannes-ucsc hannes-ucsc left a comment

A couple of questions:

@@ -0,0 +1,94 @@
{
"$schema": "http://json-schema.org/draft-07/schema#",
Contributor

Is the # at the end intended?

Collaborator Author

It's a bit confusing but yes, for draft 7 the meta-schema $id contains a hash at the end https://json-schema.org/draft-07/schema

Following drafts do not continue with this convention, not entirely sure why
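
For what it's worth, the trailing hash is just an empty URI fragment. A small check with Python's standard library shows this; the constant name `DRAFT_07` is only for the example.

```python
from urllib.parse import urldefrag

DRAFT_07 = "http://json-schema.org/draft-07/schema#"

# urldefrag splits a URL into the part before '#' and the fragment after it.
base, fragment = urldefrag(DRAFT_07)

print(base)            # the URL without the trailing '#'
print(repr(fragment))  # the fragment is the empty string
```

Tools may still compare the `$schema` value as a literal string, so keeping the hash as published for draft 7 seems the safer choice.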

"type": "string",
"enum": [
"no restriction",
"non-commercial use only",
Contributor

@hannes-ucsc hannes-ucsc Feb 27, 2024

IIRC, there was some talk during our meeting that NCU was to be used as a modifier in combination with GRU, at least in the form to be filled out by contributors. This PR models NCU as a stand-alone enum item. Do we need to change the form or the schema or are you OK with that inconsistency?

@NoopDog NoopDog self-assigned this Feb 27, 2024
@idazucchi
Collaborator

I restructured the enum field to combine the general research use code with the non-commercial use only modifier. Considering all the rules we’ve added, this module won’t behave like other ontology modules, so I’ve moved it to the project modules instead.

"enum": [
"DUO:0000004",
"DUO:0000042",
"DUO:0000042;DUO:0000046"
Contributor

@hannes-ucsc hannes-ucsc Feb 28, 2024

The use of a custom separator strikes me as hacky since it requires custom parsing on the consumer end. Multiplicity in JSON is natively handled as—I hesitate to bring this up again—arrays. You could still restrict the valid term combinations with allOf and require DUO:0000004.

Alternatively, we could ditch the combination from the contributor form. I personally find it confusing and it makes for a simpler "radio button" form UI instead of the more complicated "only some checkbox combinations are valid" approach.
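
To illustrate the difference on the consumer end, here is a minimal sketch in plain Python. It is not the JSON Schema `allOf` construct mentioned above; it only shows the extra parsing step the separator form needs and how the same combination rule applies to an array. The names `ALLOWED_COMBINATIONS`, `parse_separated`, and `is_valid_combination` are made up for the example.

```python
# The three term combinations allowed by the enum in this PR.
ALLOWED_COMBINATIONS = {
    frozenset({"DUO:0000004"}),
    frozenset({"DUO:0000042"}),
    frozenset({"DUO:0000042", "DUO:0000046"}),
}


def parse_separated(value: str) -> list:
    """Custom parsing every consumer would need for the 'a;b' string form."""
    return [term.strip() for term in value.split(";") if term.strip()]


def is_valid_combination(terms: list) -> bool:
    """Check an array of term IDs against the allowed combinations."""
    no_duplicates = len(terms) == len(set(terms))
    return no_duplicates and frozenset(terms) in ALLOWED_COMBINATIONS


# Separator form: parse first, then validate.
print(is_valid_combination(parse_separated("DUO:0000042;DUO:0000046")))
# Array form: the JSON value can be validated as it is.
print(is_valid_combination(["DUO:0000004", "DUO:0000046"]))
```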

@NoopDog
Contributor

NoopDog commented Feb 28, 2024

Hi folks,

I wanted to take a second to point out some prior art.

It looks like Terra and the DUOS API encode the meaning of the consents rather than consistently using the ontology terms and ontology term IDs.

DUOS API

Looking at the output of the DUOS API, you can see modeling like:

"dataUse": {
    "generalUse": true,
    "hmbResearch": false,
    "diseaseRestrictions": [],
    "populationOriginsAncestry": false,
    "commercialUse": true,
    "ethicsApprovalRequired": false,
    "collaboratorRequired": false,
    "geographicalRestrictions": "",
    "geneticStudiesOnly": false,
    "publicationResults": false
},

Here is an example where ontology term IDs are used (for disease):

"dataUse": {
    "generalUse": false,
    "hmbResearch": false,
    "diseaseRestrictions": [
        "http://purl.obolibrary.org/obo/DOID_1287"
    ],
    "populationOriginsAncestry": false,
    "commercialUse": true,
    "ethicsApprovalRequired": false,
    "collaboratorRequired": false,
    "geographicalRestrictions": "",
    "geneticStudiesOnly": false,
    "publicationResults": false
},

The DUOS API also gives a text summary for the restrictions like:

"translatedDataUse": "Samples are restricted for use under the following conditions:\nData use is limited for studying: brain cancer [DS]\nCommercial use is not prohibited.\nData use for methods development research irrespective of the specified data use limitations is not prohibited.\nRestrictions for use as a control set for diseases other than those defined were not specified."

Terra

In Terra, you can see what looks to be a UI embodiment of this concept in the workspace dataset attributes like:

image

AnVIL Explorer

In the AnVIL Explorer and Dataset Catalog, the consents are represented in a single string like:

image

AnVIL Dataset Catalog

Here is an example of a consent code with explanatory text from the AnVIL Dataset Catalog:

image

Options

Following the examples above, some options for us are:

  1. Use a single string to represent the consent e.g., "NRES," "GRU," or "GRU-NPU". This is simple, flexible, and future-proofs us to represent any consent in the future. We would validate that the consent string is in the allowed set.

  2. Use the DUOS/Broad-type approach to semantically represent the specific constraints we care about with a structure like:

NRES

{
      "noRestrictions": true,
      "generalResearchUse": false,
      "nonCommercialUseOnly": false
}

GRU

{
      "noRestrictions": false,
      "generalResearchUse": true,
      "nonCommercialUseOnly": false
}

GRU-NPU

{
      "noRestrictions": false,
      "generalResearchUse": true,
      "nonCommercialUseOnly": true
}

Translated Data Use
Optionally for both approaches above, we could add a "translatedDataUse" field, with values calculated from the input of:

NRES - No restrictions
GRU - General research use
GRU-NPU - General research use by not-for-profit organizations only.

The text would need to be validated, of course.
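
As a rough sketch of how the two options and the translated text relate, the three consents could be kept in one lookup table. This is illustrative Python only; the names `CONSENTS`, `code_from_flags`, and `translated_data_use` are assumptions for the example, and the flag names and texts are the ones listed above.

```python
# One entry per consent code: the option-2 flags and the translated text.
CONSENTS = {
    "NRES": {
        "flags": {"noRestrictions": True, "generalResearchUse": False,
                  "nonCommercialUseOnly": False},
        "text": "No restrictions",
    },
    "GRU": {
        "flags": {"noRestrictions": False, "generalResearchUse": True,
                  "nonCommercialUseOnly": False},
        "text": "General research use",
    },
    "GRU-NPU": {
        "flags": {"noRestrictions": False, "generalResearchUse": True,
                  "nonCommercialUseOnly": True},
        "text": "General research use by not-for-profit organizations only.",
    },
}


def code_from_flags(flags: dict) -> str:
    """Map an option-2 structure to its option-1 string, rejecting other combinations."""
    for code, entry in CONSENTS.items():
        if entry["flags"] == flags:
            return code
    raise ValueError(f"no consent code matches flags: {flags}")


def translated_data_use(code: str) -> str:
    """Look up the explanatory text for a consent code in the allowed set."""
    if code not in CONSENTS:
        raise ValueError(f"consent code not in allowed set: {code!r}")
    return CONSENTS[code]["text"]


print(translated_data_use("GRU"))
```

With option 2, five of the eight possible flag combinations are meaningless and have to be rejected, which is part of why the single string looks simpler.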

Given that we have basic requirements for representing consents, it seems like option 1 above (e.g., GRU-NPU) might be easiest all around.

I hope this helps. I am curious what other folks think.

Cheers,
D

@NoopDog
Contributor

NoopDog commented Feb 29, 2024

After our discussion, I wanted to propose that we use an enum with the following allowed values:

  • NRES (DUO_0000004)
  • GRU (DUO_0000042)
  • GRU-NCU (DUO_0000042 - DUO_0000046)

We could also use:

  • NRES
  • GRU
  • GRU-NCU

And tie the codes to the ontology term IDs in the description section of the enum definition.
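
On the consumer side, the second variant (short codes, with the term IDs documented alongside) could look something like this. A minimal Python sketch; the names `DUO_TERMS_BY_CODE` and `duo_terms` are hypothetical.

```python
# Short consent codes mapped to the DUO term IDs they stand for.
DUO_TERMS_BY_CODE = {
    "NRES": ("DUO_0000004",),
    "GRU": ("DUO_0000042",),
    "GRU-NCU": ("DUO_0000042", "DUO_0000046"),
}


def duo_terms(code: str) -> tuple:
    """Return the DUO term IDs for a code, rejecting values outside the enum."""
    try:
        return DUO_TERMS_BY_CODE[code]
    except KeyError:
        raise ValueError(f"unknown data use restriction: {code!r}") from None


print(duo_terms("GRU-NCU"))
```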

Cheers,
D

@NoopDog
Contributor

NoopDog commented Feb 29, 2024

If we want to model our requirements more explicitly, it could be done with two fields:

  • dataUsePermission - required, with allowed values of DUO_0000004 or DUO_0000042
  • dataUseModifier - optional, with allowed values of DUO_0000046 but only allowed if dataUsePermission is DUO_0000042

So we would have:

NRES

dataUseRestriction: {
   dataUsePermission: "DUO_0000004",
   dataUseModifier: null
}

GRU

dataUseRestriction: {
   dataUsePermission: "DUO_0000042",
   dataUseModifier: null
}

GRU-NCU

dataUseRestriction: {
   dataUsePermission: "DUO_0000042",
   dataUseModifier:  "DUO_0000046"
}

@idazucchi, @amnonkhen, would ingest be able to validate that dataUseModifier: "DUO_0000046" is only used in the context of dataUsePermission: "DUO_0000042"? For example, can we detect the following is invalid and prevent this from being entered?

NRES-NCU (Invalid)

dataUseRestriction: {
   dataUsePermission: "DUO_0000004",
   dataUseModifier:  "DUO_0000046"
}
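
The rule asked about above could be checked along these lines. This is a minimal sketch in plain Python, not ingest's actual validation; the names `MODIFIERS_BY_PERMISSION` and `validate_data_use_restriction` are assumptions for the example.

```python
# For each allowed permission, the modifiers that may accompany it.
MODIFIERS_BY_PERMISSION = {
    "DUO_0000004": set(),            # no restriction: no modifier allowed
    "DUO_0000042": {"DUO_0000046"},  # general research use: optional NCU modifier
}


def validate_data_use_restriction(restriction: dict) -> list:
    """Return a list of problems; an empty list means the object is valid."""
    permission = restriction.get("dataUsePermission")
    modifier = restriction.get("dataUseModifier")

    # dataUsePermission is required and must be one of the allowed terms.
    if permission not in MODIFIERS_BY_PERMISSION:
        allowed = sorted(MODIFIERS_BY_PERMISSION)
        return [f"dataUsePermission must be one of {allowed}"]

    # dataUseModifier is optional, but only valid for certain permissions.
    if modifier is not None and modifier not in MODIFIERS_BY_PERMISSION[permission]:
        return [f"dataUseModifier {modifier!r} is not allowed with {permission}"]
    return []


print(validate_data_use_restriction(
    {"dataUsePermission": "DUO_0000004", "dataUseModifier": "DUO_0000046"}
))
```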

@NoopDog
Contributor

NoopDog commented Feb 29, 2024

For reference, a link to the Data Use Ontology is here https://www.ebi.ac.uk/ols4/ontologies/duo

image

Contributor

@hannes-ucsc hannes-ucsc left a comment

src/schema_linter.py (outdated, resolved)
idazucchi and others added 2 commits March 1, 2024 13:56
Co-authored-by: ESapenaVentura <38617863+ESapenaVentura@users.noreply.github.com>
Co-authored-by: ESapenaVentura <38617863+ESapenaVentura@users.noreply.github.com>
Contributor

@NoopDog NoopDog left a comment

LGTM! 🚀 Thanks for working through this!

Collaborator

@amnonkhen amnonkhen left a comment

KISS! Love it!

@idazucchi idazucchi merged commit 91adce8 into staging Mar 4, 2024
3 of 5 checks passed
@idazucchi idazucchi deleted the esv-managedAccess-Issue1535 branch March 4, 2024 14:06
7 participants