You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The rnc_taxonomy table is based off the NCBI taxonomy and is actually kept up-to-date and accurate. The columns in rnc_taxonomy are not. We should move to using that for everything. The webfront end uses it, but the pipeline still does work to parse taxonomy information. This is uneeded and should be removed.
Validate that all taxids in xref are present in rnc_taxonomy
Create any needed fake entries
Add fk constraint from xref.taxid to rnc_taxonomy.id
Modify pipeline to write empty strings for rnc_accessions.{species,common_name,lineage}
Update export steps to ingore the species, common_name, lineage columns
Update the rnc_accession update script to reflect the missing species, common_name, lineage columns
Validate that ENA can still be parsed and imported
Remove species, common_name, lineage properties from Entry object
Remove all attempts to set the taxonomy information in Entries
Validate the pipeline can still run
Remove the rnc_accessions.{species,common_name,lineage} columns
The text was updated successfully, but these errors were encountered:
To start looking at this:
(I'm testing everything out in the dev database, once I have the steps I'll apply it to pro)
An anti-join between xref and rnc_taxonomy:
select count(distinct taxid) from xref
left join rnc_taxonomy
on xref.taxid = rnc_taxonomy.id
where
TRUE
and rnc_taxonomy.id is null
shows 2968 entries in xref that do not have a taxid in rnc_taxonomy
Extracted the relevant information from rnc_accessions and put it into rnc_taxonomy:
insert into rnc_taxonomy (
id,
name,
lineage,
common_name,
is_deleted
) (
select
taxid,
rac.species,
rac.classification,
rac.species,
FALSE
from xref
join rnc_accessions rac
on xref.ac = rac.accession
left join rnc_taxonomy
on xref.taxid = rnc_taxonomy.id
where
TRUE
and rnc_taxonomy.id is null
) on conflict(id) do nothing;
(not sure why the on conflict is needed - it shouldn't be by definition)
That query inserts the expected number of entries into rnc_taxonomy, and re-running the anti-join gives me 0 entries now, so I think everything needed is in rnc_taxonomy.
After this I was also able to add the foreign key constraint with no errors
In Entry there is a method that writes out the data for accession.csv. In there we should write None (or an empty string) for those fields. We can't just not write information as the SQL in the database assumes it will be there. But we can have no data in there. Once the sql is updated to not require those fields we can delete them.
The rnc_taxonomy table is based off the NCBI taxonomy and is actually kept up-to-date and accurate. The columns in rnc_taxonomy are not. We should move to using that for everything. The webfront end uses it, but the pipeline still does work to parse taxonomy information. This is uneeded and should be removed.
The text was updated successfully, but these errors were encountered: