Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrate to using rnc_taxonomy instead of rnc_accession columns #161

Open
6 of 11 tasks
blakesweeney opened this issue Oct 28, 2022 · 3 comments
Open
6 of 11 tasks

Migrate to using rnc_taxonomy instead of rnc_accession columns #161

blakesweeney opened this issue Oct 28, 2022 · 3 comments
Assignees

Comments

@blakesweeney
Copy link
Member

blakesweeney commented Oct 28, 2022

The rnc_taxonomy table is based off the NCBI taxonomy and is actually kept up-to-date and accurate. The columns in rnc_taxonomy are not. We should move to using that for everything. The webfront end uses it, but the pipeline still does work to parse taxonomy information. This is uneeded and should be removed.

  • Validate that all taxids in xref are present in rnc_taxonomy
  • Create any needed fake entries
  • Add fk constraint from xref.taxid to rnc_taxonomy.id
  • Modify pipeline to write empty strings for rnc_accessions.{species,common_name,lineage}
  • Update export steps to ingore the species, common_name, lineage columns
  • Update the rnc_accession update script to reflect the missing species, common_name, lineage columns
  • Validate that ENA can still be parsed and imported
  • Remove species, common_name, lineage properties from Entry object
  • Remove all attempts to set the taxonomy information in Entries
  • Validate the pipeline can still run
  • Remove the rnc_accessions.{species,common_name,lineage} columns
@afg1
Copy link
Contributor

afg1 commented Nov 11, 2022

To start looking at this:
(I'm testing everything out in the dev database, once I have the steps I'll apply it to pro)

An anti-join between xref and rnc_taxonomy:

select count(distinct taxid) from xref 

left join rnc_taxonomy 
	on xref.taxid = rnc_taxonomy.id
where 
	TRUE
	and rnc_taxonomy.id is null

shows 2968 entries in xref that do not have a taxid in rnc_taxonomy

Extracted the relevant information from rnc_accessions and put it into rnc_taxonomy:

insert into rnc_taxonomy (
  id,
  name,
  lineage,
  common_name,
  is_deleted
) (
  select 
  taxid,
  rac.species,
  rac.classification,
  rac.species,
  FALSE
  
  from xref 
	join rnc_accessions rac
		on xref.ac = rac.accession
	left join rnc_taxonomy 
		on xref.taxid = rnc_taxonomy.id
  where 
    TRUE
    and rnc_taxonomy.id is null
) on conflict(id) do nothing;

(not sure why the on conflict is needed - it shouldn't be by definition)

That query inserts the expected number of entries into rnc_taxonomy, and re-running the anti-join gives me 0 entries now, so I think everything needed is in rnc_taxonomy.

After this I was also able to add the foreign key constraint with no errors

@afg1
Copy link
Contributor

afg1 commented Nov 11, 2022

For
Update export steps to ingore the species, common_name, lineage columns

Does this mean removing them from the export, or getting the data from the rnc_taxonomy table instead? For now, I've gone with the latter suggestion

Tracking this issue in branch issue-161

@blakesweeney
Copy link
Member Author

In Entry there is a method that writes out the data for accession.csv. In there we should write None (or an empty string) for those fields. We can't just not write information as the SQL in the database assumes it will be there. But we can have no data in there. Once the sql is updated to not require those fields we can delete them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants