Adding a 'database' for assembly accessions to map saved seqCol objects #13

Closed
waterflow80 opened this issue Jul 20, 2023 · 5 comments · Fixed by #83 · May be fixed by #78, #81 or #82
Labels
enhancement New feature or request

Comments

@waterflow80
Collaborator

When fetching and inserting seqCol objects, we check whether the seqCol's digest is already saved in the database; if it is, we skip the save. But by that point we have already downloaded both the assembly report and the sequences FASTA file and processed them to create the seqCol object, which can be a lot of work for the server, especially when the sequences are very large.

So it would help to have a database (or a file) that stores the assembly accessions that map to the saved seqCol objects. This would save a lot of time, because we could check whether seqCol objects corresponding to an accession already exist before downloading and processing anything.

Note (technical detail): before aborting the fetch, we should check that the db holds saved seqCol objects corresponding to all of the naming conventions that exist in the assembly report.
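A minimal sketch of the check described above, assuming a hypothetical in-memory index that maps an assembly accession to the naming conventions already stored for it. The names `ingested` and `should_fetch`, the placeholder accessions, and the convention labels are illustrative assumptions, not part of the eva-seqcol code base. It also assumes the expected conventions are known before the download, which may not hold in practice.

```python
# Hypothetical index: assembly accession -> naming conventions already stored.
# The accessions below are made-up placeholders, not real assemblies.
ingested: dict[str, set[str]] = {
    "GCA_000000001.1": {"GENBANK", "REFSEQ", "UCSC"},
    "GCA_000000002.1": {"GENBANK"},
}


def should_fetch(accession: str, expected_conventions: set[str]) -> bool:
    """Return True when at least one expected naming convention is missing."""
    stored = ingested.get(accession, set())
    # Skip the download only if every expected convention is already stored.
    return not expected_conventions <= stored
```

With this shape, an accession that has never been seen always triggers a fetch, and a partially ingested one triggers a fetch only when a convention is missing.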

@waterflow80 waterflow80 added the enhancement New feature or request label Jul 20, 2023
@waterflow80
Collaborator Author

However, this is useful only when creating new seqCol objects and populating our database.

@tcezard
Member

tcezard commented Sep 1, 2023

We should be using the Assembly accession to check that an assembly has been ingested already.

@waterflow80
Collaborator Author

@tcezard
I think this topic is more related to the spec itself; I'm not sure we should keep this issue on the eva-seqcol repo.

@tcezard
Member

tcezard commented Mar 13, 2024

I've said above that we should be storing the Assembly accession, but we should also think about other potential sources of sequence collections and how they would impact our design.
The main goals of this issue are:

  • storing an identifier of the ingested sequence collection so we do not need to retrieve the data again if it has been done before
  • storing enough metadata so we can provide the origin of the sequence collection

At the moment the main identifier is the assembly accession, since it is what we use as the ingestion parameter, so it makes sense to store it. However, it won't be enough to know exactly where the sequence collection comes from. For this, I think storing the source URL would be more accurate.

I also think that in the future we will want to be able to store sequence collections that are not linked to an INSDC accession. These will most likely have a URL, and we will also need to find some form of associated identifier. To enable this we should choose generic column names such as datasource identifier rather than insdc accession.

All these points make me think we are building a set of metadata associated with the sequence collection, which should be stored in a separate table from the one we already use for storing the digests and JSON objects.

There is the question of the same metadata being used for multiple collections because they have different naming conventions, and the possibility that the same sequence collection could exist in different sources. For this, I'm thinking we might need a many-to-many relationship between a sequence collection metadata set and a sequence collection.
We could also choose to extract the naming convention to the metadata table, which could potentially change how the relationship between the two tables is defined.
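One way the many-to-many idea could look, sketched with Python's built-in `sqlite3` module purely for illustration. The table names (`seqcol_metadata`, `seqcol_metadata_link`), column names, and helper functions are assumptions, not the schema used by eva-seqcol. The `digest` column stands in for the key of the existing table that holds digests and JSON objects, and the naming convention is deliberately left out since where it belongs is still an open question here.

```python
import sqlite3

SCHEMA = """
CREATE TABLE seqcol_metadata (
    id INTEGER PRIMARY KEY,
    source_identifier TEXT NOT NULL,  -- generic name, not "insdc accession"
    source_url TEXT NOT NULL,
    UNIQUE (source_identifier, source_url)
);
-- Junction table: one metadata set can point to many collections and
-- one collection can be reached from many metadata sets.
CREATE TABLE seqcol_metadata_link (
    metadata_id INTEGER NOT NULL REFERENCES seqcol_metadata(id),
    digest TEXT NOT NULL,
    PRIMARY KEY (metadata_id, digest)
);
"""


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def link(conn: sqlite3.Connection, source_identifier: str,
         source_url: str, digest: str) -> None:
    """Record that a source produced the collection with this digest."""
    conn.execute(
        "INSERT OR IGNORE INTO seqcol_metadata (source_identifier, source_url) "
        "VALUES (?, ?)",
        (source_identifier, source_url),
    )
    row = conn.execute(
        "SELECT id FROM seqcol_metadata "
        "WHERE source_identifier = ? AND source_url = ?",
        (source_identifier, source_url),
    ).fetchone()
    conn.execute(
        "INSERT OR IGNORE INTO seqcol_metadata_link (metadata_id, digest) "
        "VALUES (?, ?)",
        (row[0], digest),
    )


def digests_for(conn: sqlite3.Connection, source_identifier: str) -> list[str]:
    """List the digests already stored for a source identifier."""
    rows = conn.execute(
        "SELECT l.digest FROM seqcol_metadata_link l "
        "JOIN seqcol_metadata m ON m.id = l.metadata_id "
        "WHERE m.source_identifier = ? ORDER BY l.digest",
        (source_identifier,),
    ).fetchall()
    return [r[0] for r in rows]
```

An empty result from `digests_for` would mean the source has not been ingested yet, so the download has to go ahead.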

@tcezard
Member

tcezard commented Mar 13, 2024

Concretely, there are 3 pieces of metadata that I think are very relevant for a sequence collection:

  • Source identifier
  • Source URL
  • Naming convention

There could also be other information we might want to store, like ingestion time, if we ever want to list all sequence collections that were ingested before/after a specific time.
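As a rough illustration, the three fields plus an ingestion time could be modelled as a small record, with a helper for the before/after listing. The class name `SeqColMetadata`, its field names, and `ingested_after` are hypothetical and only meant to show the shape of the data.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class SeqColMetadata:
    """Hypothetical metadata record attached to a sequence collection."""
    source_identifier: str
    source_url: str
    naming_convention: str
    ingested_at: datetime


def ingested_after(records: list[SeqColMetadata],
                   cutoff: datetime) -> list[SeqColMetadata]:
    """Keep only the records ingested strictly after the cutoff."""
    return [r for r in records if r.ingested_at > cutoff]
```

Filtering for records ingested before a given time would be the same comparison reversed.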
