Adding a 'database' for assembly accessions to map saved seqCol objects #13

waterflow80 · 2023-07-20T16:41:19Z

When fetching and inserting seqCol objects, we check whether the seqCol's digest is already saved in the database; if it is, we skip the save. But by that point we have already downloaded both the assembly report and the sequences FASTA file and processed them to create the seqCol object, which can be a lot of work for the server, especially when the sequences are large.

So it would help to have a database (or a file) in which we save the assembly accessions that map to the saved seqCol objects. That would save a lot of time, because we could check whether seqCol objects corresponding to that accession already exist before downloading and processing anything.

Note (technical detail): before aborting the fetch, we should make sure that the db holds seqCol objects corresponding to all of the naming conventions that exist in the assembly report.
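
A minimal sketch of that pre-download check, in Python for illustration only. The names `AccessionIndex` and `can_skip_fetch` and the accession `GCA_EXAMPLE.1` are hypothetical, and the set of expected naming conventions is passed in as a parameter because this thread does not say how it would be obtained before the assembly report is downloaded.

```python
from typing import Dict, Iterable, Set

# Hypothetical index: assembly accession -> naming conventions for which
# a seqCol object is already saved in the db.
AccessionIndex = Dict[str, Set[str]]


def can_skip_fetch(index: AccessionIndex,
                   accession: str,
                   expected_conventions: Iterable[str]) -> bool:
    """Return True only if every expected naming convention is already stored."""
    stored = index.get(accession, set())
    expected = set(expected_conventions)
    # An empty expectation is treated as "unknown", so we do not skip.
    return bool(expected) and expected <= stored


# Assumed example data, not real accessions.
index: AccessionIndex = {"GCA_EXAMPLE.1": {"GENBANK", "REFSEQ"}}

skip_partial = can_skip_fetch(index, "GCA_EXAMPLE.1", ["GENBANK", "REFSEQ", "UCSC"])
skip_full = can_skip_fetch(index, "GCA_EXAMPLE.1", ["GENBANK", "REFSEQ"])
```

Here `skip_partial` is false because no UCSC seqCol is stored yet, so the fetch would still run; `skip_full` is true and the download could be avoided.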

waterflow80 · 2023-07-20T16:49:49Z

However, this is useful only when creating new seqCol objects and populating our database.

tcezard · 2023-09-01T15:43:39Z

We should be using the Assembly accession to check that an assembly has been ingested already.

waterflow80 · 2023-12-01T13:30:20Z

@tcezard
I think this topic is more related to the spec itself. I'm not sure we should keep this issue on the eva-seqcol repo.

tcezard · 2024-03-13T15:57:10Z

I've said above that we should be storing the Assembly accession, but we should also think about other potential sources of sequence collections and how they would impact our design.
The main goals of this issue are:

store an identifier of the ingested sequence collection, so we do not need to retrieve the data again if it has been done before
store enough metadata so we can provide the origin of the sequence collection

At the moment the main identifier is the assembly accession, since it is what we use in the ingestion parameter, so it makes sense to store it. However, it won't be enough to know exactly where the sequence collection comes from. For this, I think storing the source URL would be more accurate.

I also think that in the future we will want to be able to store sequence collections that are not linked to an INSDC accession. These will most likely have a URL, and we will also need to find some form of associated identifier. To enable this, we should choose generic column names such as datasource identifier rather than insdc accession.

All these points make me think we are building a set of metadata associated with the sequence collection, which should be stored in a table separate from the one we already use for storing the digests and JSON objects.

There is the question of the same metadata being used for multiple collections because they have different naming conventions, and the possibility that the same sequence collection could exist in different sources. For this, I'm thinking we might need a many-to-many relationship between a sequence collection metadata set and a sequence collection.
We could also choose to extract the naming convention to the metadata table, which could potentially change how the relationship between the two tables is defined.
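
To make the many-to-many idea concrete, here is a rough sketch using Python's built-in `sqlite3` with an in-memory database. All table names, column names, digests, identifiers and URLs are assumptions chosen for illustration and are not the project's actual schema; in this variant the naming convention stays on the sequence collection side.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sequence_collection (      -- hypothetical: digests and JSON objects
    digest            TEXT PRIMARY KEY,
    naming_convention TEXT NOT NULL,
    object_json       TEXT NOT NULL
);
CREATE TABLE seqcol_metadata (          -- hypothetical: generic source metadata
    id                INTEGER PRIMARY KEY,
    source_identifier TEXT NOT NULL,
    source_url        TEXT NOT NULL,
    UNIQUE (source_identifier, source_url)
);
CREATE TABLE seqcol_metadata_link (     -- join table: many-to-many
    digest      TEXT NOT NULL REFERENCES sequence_collection(digest),
    metadata_id INTEGER NOT NULL REFERENCES seqcol_metadata(id),
    PRIMARY KEY (digest, metadata_id)
);
""")

# One source yields two collections (different naming conventions), and
# one of those collections is also available from a second source.
conn.executemany("INSERT INTO sequence_collection VALUES (?, ?, ?)", [
    ("digest-a", "GENBANK", "{}"),
    ("digest-b", "UCSC", "{}"),
])
conn.executemany("INSERT INTO seqcol_metadata VALUES (?, ?, ?)", [
    (1, "ASM_EXAMPLE_1", "https://example.org/asm1"),
    (2, "OTHER_SOURCE_1", "https://example.net/copy-of-asm1"),
])
conn.executemany("INSERT INTO seqcol_metadata_link VALUES (?, ?)", [
    ("digest-a", 1), ("digest-b", 1), ("digest-a", 2),
])


def digests_for_source(conn, source_identifier):
    """List digests already ingested for a source identifier (empty if none)."""
    rows = conn.execute(
        """SELECT l.digest
           FROM seqcol_metadata m
           JOIN seqcol_metadata_link l ON l.metadata_id = m.id
           WHERE m.source_identifier = ?
           ORDER BY l.digest""",
        (source_identifier,),
    ).fetchall()
    return [row[0] for row in rows]
```

With this layout, checking whether a source has been ingested before is a lookup on `source_identifier` that needs no download. Moving `naming_convention` into the metadata table instead would make each metadata row specific to a single collection, which changes the cardinality of the relationship.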

tcezard · 2024-03-13T17:08:06Z

Concretely there are 3 pieces of metadata that I think are very relevant for a sequence collection:

Source identifier
Source URL
Naming convention

There could also be some other information we might want to store, such as ingestion time, if we ever want to list all sequence collections that were ingested before or after a specific time.
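
As a sketch of what such a metadata record and a time-based listing could look like, assuming Python for illustration: the class `SeqColMetadata`, the function `ingested_between`, and all example values below are hypothetical and not taken from the project.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class SeqColMetadata:
    source_identifier: str   # e.g. an assembly accession
    source_url: str          # where the data was retrieved from
    naming_convention: str   # e.g. GENBANK, REFSEQ, UCSC
    ingested_at: datetime    # optional extra: ingestion time


def ingested_between(records: Iterable[SeqColMetadata],
                     after: Optional[datetime] = None,
                     before: Optional[datetime] = None) -> List[SeqColMetadata]:
    """Keep records with after <= ingested_at < before; None means unbounded."""
    return [
        r for r in records
        if (after is None or r.ingested_at >= after)
        and (before is None or r.ingested_at < before)
    ]


# Assumed example records, not real data.
records = [
    SeqColMetadata("ASM_EXAMPLE_1", "https://example.org/asm1", "GENBANK",
                   datetime(2024, 1, 10, tzinfo=timezone.utc)),
    SeqColMetadata("ASM_EXAMPLE_2", "https://example.org/asm2", "UCSC",
                   datetime(2024, 3, 5, tzinfo=timezone.utc)),
]

cutoff = datetime(2024, 2, 1, tzinfo=timezone.utc)
recent = ingested_between(records, after=cutoff)
older = ingested_between(records, before=cutoff)
```

Storing the timestamp in a timezone-aware form (UTC here) is a design choice of this sketch that keeps before/after comparisons unambiguous.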

waterflow80 added the enhancement New feature or request label Jul 20, 2023

waterflow80 mentioned this issue Feb 5, 2024

added a column 'asm_accession' to the sequence_collection_l1 table #78

Open

waterflow80 mentioned this issue Mar 22, 2024

Added a meta data table - one-to-one relationship approach #81

Open

4 tasks

waterflow80 mentioned this issue Apr 12, 2024

Added a metadata table - one-to-many relationship apporach #82

Open

8 tasks

waterflow80 mentioned this issue Apr 20, 2024

Added a metadata table - one-to-many relationship enhanced #83

Merged

waterflow80 closed this as completed in #83 May 22, 2024
