Keep track of platforms and genome updates within Gemma #378

Closed
10 tasks done
arteymix opened this issue Jun 29, 2022 · 7 comments
Labels: enhancement, high priority

Comments

@arteymix
Member

arteymix commented Jun 29, 2022

These are currently stored in a spreadsheet.

  • investigate what genome release we use for blasting probes
  • expose that information through the RESTful API on a per-platform basis (the platform should know what its probes are based upon for RNA-Seq and what genome releases it targets)
  • use hard-coded values in the API for now
  • list all cases in Gemma where version tracking could be used, to guide the design of the feature
  • expose an overview of versioned external databases in the RESTful API
  • update the last-updated date in TableMaintenanceUtil when the gene2cs table is updated
  • display a brief overview of database freshness on the front page
  • include RNA-Seq external databases under their respective genome databases

Things to keep track of

  • genome releases
  • taxa (through genome releases)
  • genes from Ensembl (ID mappings are obtained from NCBI)
  • genes from NCBI
  • gene products
  • platforms
  • probes in platforms, which can be remapped to different gene products but largely remain unchanged

Tracking individual probes, genes and gene products is intractable. I need to see how they might relate to the same ExternalDatabase. Having all the genes relating to the same database would allow us to keep track of the update in a single location.

There might be redundant ExternalDatabase entries that should be merged so that we can reasonably update them.

Transitory solution

Before we decide on a way to store this metadata, we can already design the outside view of it. Genomes and gene annotations are slow-moving things in Gemma, so we can take our time to think this out.

  • add release, releaseUrl, lastUpdated attributes to the ExternalDatabase VOs
  • ensure that these fields appear within the DatabaseAccession VOs
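As a rough illustration of the outside view, here is a minimal sketch of what the two value objects could look like. The class names `ExternalDatabaseVoSketch` and `DatabaseAccessionVoSketch`, the field types and the `describe()` helper are assumptions made for this example; only the attribute names `release`, `releaseUrl` and `lastUpdated` come from the list above. It is not Gemma's actual VO code.

```java
import java.net.URI;
import java.time.Instant;

/** Hypothetical, simplified stand-in for an ExternalDatabase value object. */
final class ExternalDatabaseVoSketch {
    private final String name;
    private final String release;      // release label; null when not tracked
    private final URI releaseUrl;      // where the release is described; may be null
    private final Instant lastUpdated; // when the local copy was last refreshed; may be null

    ExternalDatabaseVoSketch(String name, String release, URI releaseUrl, Instant lastUpdated) {
        this.name = name;
        this.release = release;
        this.releaseUrl = releaseUrl;
        this.lastUpdated = lastUpdated;
    }

    String getName() { return name; }
    String getRelease() { return release; }
    URI getReleaseUrl() { return releaseUrl; }
    Instant getLastUpdated() { return lastUpdated; }

    /** Short human-readable summary; omits whatever is not tracked. */
    String describe() {
        StringBuilder sb = new StringBuilder(name);
        if (release != null) {
            sb.append(' ').append(release);
        }
        if (lastUpdated != null) {
            sb.append(" (last updated ").append(lastUpdated).append(')');
        }
        return sb.toString();
    }
}

/** Hypothetical accession VO: it nests the database VO, so the new fields come along. */
final class DatabaseAccessionVoSketch {
    private final String accession;
    private final ExternalDatabaseVoSketch externalDatabase;

    DatabaseAccessionVoSketch(String accession, ExternalDatabaseVoSketch externalDatabase) {
        this.accession = accession;
        this.externalDatabase = externalDatabase;
    }

    String getAccession() { return accession; }
    ExternalDatabaseVoSketch getExternalDatabase() { return externalDatabase; }
}
```

Nesting the database VO inside the accession VO means the second bullet is satisfied without duplicating the fields.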

Solution 1

Add columns to EXTERNAL_DATABASE to record the platform release being used or specific genome version. We're interested in:

If we take this approach, we have to associate ARRAY_DESIGN with EXTERNAL_DATABASE. This would be used for the RNA-Seq platforms to keep track of the current release being used. However, since the platforms are not versioned (i.e. there is one Generic_human_ncbiIds platform), it will not be helpful for the EE.

Solution 2

Use the audit trail to record platform and gene updates. This allows us to know who did the update and when, but it's not ideal for storing a release number, for example. The downside is that we'll likely have to implement the Auditable interface.

The ArrayDesign already implements Auditable. We just need to create a new event type to indicate when a platform update was performed. It also resolves the issue
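To make the audit-trail idea concrete, here is a small self-contained sketch. All type names (`AuditEventTypeSketch`, `PlatformUpdateEventSketch`, `AuditEventSketch`, `AuditTrailSketch`) are hypothetical and do not mirror Gemma's real audit classes; the point is only to show a dedicated event type and how the last update could be looked up. Note that the release ends up in a free-text note, which is the weakness mentioned above.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical marker for audit event types. */
interface AuditEventTypeSketch { }

/** Hypothetical event type: a platform update was performed. */
final class PlatformUpdateEventSketch implements AuditEventTypeSketch { }

/** Hypothetical event type used here only as a contrast. */
final class CommentEventSketch implements AuditEventTypeSketch { }

/** One entry in the trail: who did what, and when. */
final class AuditEventSketch {
    final Instant date;
    final String performer;
    final AuditEventTypeSketch type;
    final String note; // free text; a release number could only live here

    AuditEventSketch(Instant date, String performer, AuditEventTypeSketch type, String note) {
        this.date = date;
        this.performer = performer;
        this.type = type;
        this.note = note;
    }
}

final class AuditTrailSketch {
    private final List<AuditEventSketch> events = new ArrayList<>();

    void add(AuditEventSketch event) {
        events.add(event);
    }

    /** Returns the most recent event of the given type, or null if there is none. */
    AuditEventSketch lastEventOfType(Class<? extends AuditEventTypeSketch> type) {
        AuditEventSketch latest = null;
        for (AuditEventSketch e : events) {
            if (type.isInstance(e.type) && (latest == null || e.date.isAfter(latest.date))) {
                latest = e;
            }
        }
        return latest;
    }
}
```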

Other considerations:

This relates to #20 in a way because we are trying to keep track of the above information at the EE-level.

Relevant spreadsheet that currently keeps track of these manually: https://docs.google.com/spreadsheets/d/1MIi_r9U6ufiROdwRFi5fESeHbF35UHs1mnzjNySmJFg/edit#gid=0

@arteymix
Member Author

Both solutions can also be combined to keep track of updates to our ExternalDatabase entities. We'll need custom audit event types for the various update operations performed through the CLI.

@arteymix arteymix added the high priority Issues that require immediate attention label Oct 14, 2022
@arteymix
Member Author

The most urgent step is to prepare the data model and expose that information in the RESTful API; audit events and CLI integration can wait.

@arteymix
Member Author

There are 44 external databases defined in the table, of which only the genome ones are worth keeping track of.

For probes, the platforms are represented with a database entry, which we will want to keep track of at this level. I'm introducing a new interface called Versioned (this is a preliminary naming) to have an idea of the look & feel. It will be deployed on the development server.
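For discussion, one possible shape for such an interface is sketched below. Only the name `Versioned` comes from this comment; the three accessors, the `GenomeDatabaseSketch` implementation and the `FreshnessSketch.isStale` helper are assumptions made for illustration.

```java
import java.net.URI;
import java.time.Duration;
import java.time.Instant;

/** Preliminary naming; the accessors below are assumed, not the final API. */
interface Versioned {
    String getReleaseVersion();
    URI getReleaseUrl();
    Instant getLastUpdated();
}

/** Hypothetical versioned entity, e.g. a genome database entry. */
final class GenomeDatabaseSketch implements Versioned {
    private final String releaseVersion;
    private final URI releaseUrl;
    private final Instant lastUpdated;

    GenomeDatabaseSketch(String releaseVersion, URI releaseUrl, Instant lastUpdated) {
        this.releaseVersion = releaseVersion;
        this.releaseUrl = releaseUrl;
        this.lastUpdated = lastUpdated;
    }

    @Override public String getReleaseVersion() { return releaseVersion; }
    @Override public URI getReleaseUrl() { return releaseUrl; }
    @Override public Instant getLastUpdated() { return lastUpdated; }
}

final class FreshnessSketch {
    private FreshnessSketch() { }

    /** True when the entity was never updated or its last update is older than maxAge. */
    static boolean isStale(Versioned v, Instant now, Duration maxAge) {
        Instant last = v.getLastUpdated();
        return last == null || Duration.between(last, now).compareTo(maxAge) > 0;
    }
}
```

A helper like `isStale` is the kind of thing a front-page freshness overview could build on, since it works for any entity implementing the interface.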

@arteymix
Member Author

We need to expose ArrayDesign.externalReferences in ArrayDesignValueObject. Fortunately, it's already exposed in ArrayDesignValueObjectExt, so I'll simply move it up.
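The move amounts to a pull-up-field refactoring. A minimal sketch, using hypothetical stand-in classes (`PlatformVoSketch`, `PlatformVoExtSketch`) and plain strings for the references rather than the real value objects:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for the base VO: it now owns the external references. */
class PlatformVoSketch {
    private final List<String> externalReferences = new ArrayList<>();

    void addExternalReference(String accession) {
        externalReferences.add(accession);
    }

    /** Read-only view so callers cannot mutate the VO's state. */
    List<String> getExternalReferences() {
        return Collections.unmodifiableList(externalReferences);
    }
}

/** Hypothetical stand-in for the extended VO: it inherits the field instead of declaring it. */
class PlatformVoExtSketch extends PlatformVoSketch {
    private final String description;

    PlatformVoExtSketch(String description) {
        this.description = description;
    }

    String getDescription() { return description; }
}
```

Existing callers of the extended VO keep working, while callers of the base VO gain the accessor.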

@arteymix
Member Author

To expose RNA-Seq annotations, we could add additional external references to the corresponding generic platform. There would be one ref for the gene source (i.e. Ensembl) and another for the version the pipeline is using.
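A sketch of how the two references might hang off a generic platform follows. The classes `ExternalReferenceSketch` and `GenericPlatformSketch`, the `findByDatabase` helper and the database names used in the usage note are placeholders invented for this example; only the idea of two references per generic platform comes from the comment above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical external reference: an accession within a named database. */
final class ExternalReferenceSketch {
    final String databaseName;
    final String accession;

    ExternalReferenceSketch(String databaseName, String accession) {
        this.databaseName = databaseName;
        this.accession = accession;
    }
}

/** Hypothetical generic RNA-Seq platform carrying its external references. */
final class GenericPlatformSketch {
    private final String shortName;
    private final List<ExternalReferenceSketch> externalReferences = new ArrayList<>();

    GenericPlatformSketch(String shortName) {
        this.shortName = shortName;
    }

    String getShortName() { return shortName; }

    void addExternalReference(ExternalReferenceSketch ref) {
        externalReferences.add(ref);
    }

    List<ExternalReferenceSketch> getExternalReferences() {
        return Collections.unmodifiableList(externalReferences);
    }

    /** Returns the first reference pointing at the named database, or null if none does. */
    ExternalReferenceSketch findByDatabase(String databaseName) {
        for (ExternalReferenceSketch ref : externalReferences) {
            if (ref.databaseName.equals(databaseName)) {
                return ref;
            }
        }
        return null;
    }
}
```

For example, a platform could get one reference under a placeholder "gene-source" database and another under a placeholder "pipeline-annotations" database, and the API layer would simply serialize `getExternalReferences()`.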

@arteymix
Member Author

Feedback from @ppavlidis:

Clarify what the RNA-Seq annotations are.

Reword the line that introduces EDs:

Gemma’s expression platform and gene annotations are powered by:

@arteymix
Member Author

Fixed in 9630181. The only remaining thing is to close #486 once 1.29 is in production.

@arteymix arteymix mentioned this issue Dec 3, 2022
2 tasks