Script the installation (and updates) of NCBI taxonomy data #14

Open
jimallman opened this issue Aug 5, 2014 · 18 comments
Comments

@jimallman
Collaborator

This is currently documented only in the script db/database-migration-002.sql.

Ultimately, it should be available from the site's Admin Dashboard page as an easily repeated task. For now, we should offer a script to download and install the latest NCBI taxonomy, then update the db's internal timestamp for this admin task.
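As a starting point for such a script, here is a minimal sketch of reading the parent links out of an NCBI dump file. It assumes the `nodes.dmp` layout from the NCBI taxonomy dump (fields separated by a tab, a pipe, and a tab, with each row ending in a tab and a pipe). The function name `parse_nodes_dmp` and the sample rows are illustrative, not part of the FCDB code; downloading the archive and loading the rows into MySQL are left out.

```python
from typing import Dict, Iterable, Tuple


def parse_nodes_dmp(lines: Iterable[str]) -> Dict[int, Tuple[int, str]]:
    """Map each tax_id to (parent_tax_id, rank) from nodes.dmp-style rows."""
    nodes: Dict[int, Tuple[int, str]] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # Each row ends with "\t|"; drop it before splitting the fields.
        if line.endswith("\t|"):
            line = line[:-2]
        fields = line.split("\t|\t")
        # Only the first three fields are needed here: id, parent id, rank.
        nodes[int(fields[0])] = (int(fields[1]), fields[2])
    return nodes


# Illustrative rows in the dump layout (hypothetical sample, not a real download).
sample = [
    "1\t|\t1\t|\tno rank\t|\n",
    "131567\t|\t1\t|\tno rank\t|\n",
    "2759\t|\t131567\t|\tsuperkingdom\t|\n",
]
nodes = parse_nodes_dmp(sample)
```

In a real run the `lines` argument would be the open `nodes.dmp` file, and the resulting pairs would feed the table that the migration script sets up.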

@jimallman
Collaborator Author

From @jparham email on Dec 9, 2014, at 12:57 AM:

Dan brought up a very good and important point:

"I have one other question - I just noticed the last NCBI update is Feb 2013 while staring at the stats. Have there been previous NCBI updates? I assume so since the site existed prior to 2013. I was wondering how automated that process was or if we should even think about it. I assume it does not do anything to our existing calibrations because the NCBI taxon numbers stay the same regardless of how the table gets shuffled, but was curious as to whether we could test that now or should just leave it be. The current NCBI should be fine to get us to launch but I thought it better to ask ahead of time how we might deal with an update or whether we should avoid that for fear of breaking anything."

The process of importing newer NCBI data is not automatic, but it’s pretty straightforward for someone with moderate “sysadmin” skills.

The NCBI “dump” files are large, so the process takes a little time to complete. It would also be prudent to back up the FCDB data beforehand as a precaution, at least until this has become routine.

NCBI updates could be done as part of annual maintenance, or we can try to script it to be entirely hands-off. (There's already a placeholder for this in the FCDB’s admin dashboard.) I thought that this data was slow to change, but @hlapp explained that NCBI updates are actually quite frequent (sometimes daily). To me, this means the frequency of updates to FCDB is more of an editorial decision.

I cannot intuit what would happen to our pinned calibrations when the NCBI taxonomy changes. And I wonder, if there is a problem, what we could even do about it at this late stage.

This is of course the other interesting question: Does an NCBI update create more work, or screw up the data in FCDB? This should not be an issue. By design, our calibrated nodes “float” along with changes to NCBI, so updating the NCBI taxonomy then running these tasks in the admin dashboard should bring everything nicely up to date:

  • Update searchable multitree
  • Update calibrations-by-clade table
  • Update auto-complete lists

This assumes that NCBI identifiers are never discarded, which I understand to be the case. After the updates above, the system will show subtle differences:

  • Calibrated nodes in the affected areas will appear differently in the Browse Calibrations UI
  • Some search tools (filter by tip taxa and filter by clade) will reflect these changes.

@pdpolly
Collaborator

pdpolly commented Dec 9, 2014

Interesting questions, and important ones too. Most NCBI taxonomy updates will be irrelevant to FCD because most reflect relatively low level relationships, groups without good fossil records, minor changes that affect only one or two taxa, or simple changes in rank. It is only major changes that will really need to be updated in FCD, ones where relationships of groups with a good fossil record are radically reorganized because of improvements in phylogenetic understanding (whales moving into Artiodactyla, for example). It is probably also these big changes that would most likely break FCD scripts.

Would it be worth experimenting with the robustness of the code? We could manually alter an NCBI data file and import it into a test version of FCD.

@jimallman
Collaborator Author

Would it be worth experimenting with the robustness of the code? We could manually alter an NCBI data file and import it into a test version of FCD.

Yes, the old dev site at http://fossils.ibang.com/ has older calibrations and test data, but we can use it to test the process and resulting changes.

@Ksepka
Collaborator

Ksepka commented Dec 9, 2014

I am all for a test run with the old dev site. Shall we proceed? It would be good to see how everything reacts before we come to the point of needing to do it for real.

@jparham

jparham commented Dec 9, 2014

Of course, this won't necessarily help us understand how future changes to the underlying hierarchy may affect node pinning. I.e., we could get a false "ok" signal.

@pdpolly
Collaborator

pdpolly commented Dec 9, 2014

We could get a false OK, of course, but it's about the only way I know of to test it.


@jparham

jparham commented Dec 9, 2014

I think we should do the test. But before we do, we should check whether any of the parts of the NCBI taxonomy that changed between the versions being compared involve calibrations that we have in the database.

I guess one other option, which I mentioned before, is freezing it.

But a third option would be to just optimistically go ahead and let NCBI hierarchy change and then if there is a problem roll it back. I kind of favor this option, if rolling it back would not be too difficult.

@jimallman
Collaborator Author

I guess one other option, which I mentioned before, is freezing it.

Sorry, I didn't mean to gloss over this. It's certainly an option, and guarantees a minimum of surprises. Naturally, some searches may suffer if someone is expecting the site to have the latest NCBI taxonomy.

But a third option would be to just optimistically go ahead and let NCBI hierarchy change and then if there is a problem roll it back. I kind of favor this option, if rolling it back would not be too difficult.

This is fairly easy to do, provided

  • you always back up the FCDB database just prior to applying an NCBI update;
  • you're careful to suspend any changes or additions to other data in the meantime; and
  • someone can revert the database to this backup in the event that there's a problem.

In this case, I'd recommend you keep a sysadmin in the loop, possibly treating this as a planned, annual maintenance operation as suggested in #55. In principle, it could be scripted from the admin dashboard, but in practice I wouldn't be comfortable with an easy, "full auto" version of this.
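The backup step in the list above could be sketched roughly as follows. This only builds the `mysqldump` command line and does not run it. The database name `fcdb_example` and the file-naming scheme are hypothetical, and `--single-transaction` gives a consistent snapshot only for transactional (InnoDB) tables, so treat this as a sketch under those assumptions rather than the procedure actually used for FCDB.

```python
from datetime import datetime
from typing import List


def backup_command(database: str, when: datetime, out_dir: str = ".") -> List[str]:
    """Build a mysqldump argument list for a timestamped pre-update backup."""
    stamp = when.strftime("%Y%m%d-%H%M%S")
    out_file = f"{out_dir}/{database}-pre-ncbi-{stamp}.sql"
    return [
        "mysqldump",
        "--single-transaction",        # consistent snapshot for InnoDB tables
        f"--result-file={out_file}",   # write the dump to this file
        database,
    ]


# Hypothetical database name and a fixed timestamp, for illustration only.
cmd = backup_command("fcdb_example", datetime(2014, 12, 9, 18, 0, 0), "/tmp")

# To actually run it (credentials would come from a MySQL option file):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Reverting would then mean loading that file back with the `mysql` client, which is why other edits to the data need to be suspended between the backup and the decision to keep or roll back the update.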

@pdpolly
Collaborator

pdpolly commented Dec 10, 2014

What I was thinking for the test is to purposefully change parts of the taxonomy that are relevant to calibrations that have been entered.

That said, I don't mind Jim's option of simply freezing it. It will still meet most FCD purposes even if the agreement isn't perfect.


@jparham

jparham commented Dec 10, 2014

David, my apologies, I didn't realize that you meant to purposefully change parts of the taxonomy; that makes sense. Please see what JimA says above about backing it up. Is this something that is reasonable moving forward? If so, then we can accept the updates and just revert if there is an issue. That would be better than a freeze. But if it is not easy to revert, then a freeze would be best.

@jimallman
Collaborator Author

What I was thinking for the test is to purposefully change parts of the taxonomy that are relevant to calibrations that have been entered.

Yes, based on the SQL script I used to set up this table, it looks like we can simply modify the parenttaxonid column for selected nodes to manipulate the NCBI taxonomy for a suitable test. (For that matter, we can test such changes by making edits to the current database, without introducing new NCBI updates.)
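A minimal sketch of that kind of test edit, using an in-memory SQLite table as a stand-in for the MySQL database. Only the `parenttaxonid` column name comes from the setup script; the table name `ncbi_nodes`, the `taxonid` column, and the tiny tree are assumptions made for illustration.

```python
import sqlite3

# In-memory stand-in for the real MySQL table (names are hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ncbi_nodes (taxonid INTEGER PRIMARY KEY, parenttaxonid INTEGER)"
)
conn.executemany(
    "INSERT INTO ncbi_nodes VALUES (?, ?)",
    [(1, 1), (10, 1), (20, 1), (30, 10)],  # node 30 starts under node 10
)


def reparent(db: sqlite3.Connection, taxonid: int, new_parent: int) -> int:
    """Move one node under a new parent; returns the number of rows changed."""
    cur = db.execute(
        "UPDATE ncbi_nodes SET parenttaxonid = ? WHERE taxonid = ?",
        (new_parent, taxonid),
    )
    return cur.rowcount


# Simulate a taxonomic reorganization: move node 30 from clade 10 to clade 20.
changed = reparent(conn, 30, 20)
new_parent = conn.execute(
    "SELECT parenttaxonid FROM ncbi_nodes WHERE taxonid = ?", (30,)
).fetchone()[0]
```

After an edit like this, the derived tables would still need to be rebuilt from the Admin Dashboard before the change shows up in Browse or Search.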

@jimallman
Collaborator Author

FYI, I'm working on a test today. This will include "before" and "after" scans of all calibrated nodes, tracing the lineage of each pinned node. This should generate a watch list of calibrations that need review in the new taxonomy.
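Roughly, the before-and-after scan could look like the sketch below. It assumes the taxonomy is available as a child-to-parent mapping for each version and that each calibration has a list of pinned taxon ids. The calibration ids, taxon ids, and function names are all made up for illustration.

```python
from typing import Dict, List


def lineage(taxon: int, parents: Dict[int, int]) -> List[int]:
    """Return taxon ids from `taxon` up to the root (a node that is its own parent)."""
    path = [taxon]
    seen = {taxon}
    while True:
        parent = parents.get(path[-1])
        # Stop at the root, at a missing node, or if a cycle is detected.
        if parent is None or parent in seen:
            break
        path.append(parent)
        seen.add(parent)
    return path


def watch_list(
    pinned: Dict[int, List[int]],
    before: Dict[int, int],
    after: Dict[int, int],
) -> List[int]:
    """Calibration ids with at least one pinned taxon whose lineage changed."""
    flagged = []
    for calibration_id, taxa in sorted(pinned.items()):
        if any(lineage(t, before) != lineage(t, after) for t in taxa):
            flagged.append(calibration_id)
    return flagged


# Hypothetical data: taxon 30 moves from clade 10 to clade 20 in the new version.
before = {1: 1, 10: 1, 20: 1, 30: 10, 40: 20}
after = {1: 1, 10: 1, 20: 1, 30: 20, 40: 20}
pinned = {101: [30], 102: [40], 103: [30, 40]}
flagged = watch_list(pinned, before, after)
```

Comparing whole lineages is deliberately conservative: any change anywhere above a pinned node puts the calibration on the list, even if the calibrated clade itself is unaffected.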

@jimallman
Collaborator Author

UPDATE: I've been testing an NCBI update on the old dev site (fossils.ibang.com) with mixed results. As described above, I can generate a report of all calibrations that need review, but there are others that need immediate repairs before the Browse feature can be made to work with the new taxonomy.

This assumes that NCBI identifiers are never discarded, which I understand to be the case.

I was mistaken. While these identifiers are not re-used, it's actually very common for their nodes to be removed from the database. From the original NCBI Taxonomy database paper:

Taxids are stable and persistent—they may be deleted (when taxa are 
removed from the database) and they may be merged (when taxa are 
synonymized), but they will never be reused to identify a different taxon.

And sure enough, in my test update we have a few calibrations that were pinned to taxids that have since been deleted. I'm working now on a report that will flag these calibrations as they're discovered. The fix is actually straightforward -- just edit each offending calibration and refresh its entries in section 4, "Locate this calibration within the NCBI tree". Re-entering the same taxon names (or sometimes picking a new substitute) sets things right.

More notes to come as I learn more, but suffice to say that updating the NCBI taxonomy will not be a fully automatic process, and will almost always require some curation time.
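For illustration, the check for calibrations pinned to deleted taxids can be as simple as the sketch below. It assumes we have the set of taxon ids present after the update and a mapping from calibration id to pinned taxon ids; the names and sample values are hypothetical, not the actual report code.

```python
from typing import Dict, List, Set


def orphaned_pins(
    pinned: Dict[int, List[int]], current_taxids: Set[int]
) -> Dict[int, List[int]]:
    """Map each affected calibration id to its pinned taxids that no longer exist."""
    report: Dict[int, List[int]] = {}
    for calibration_id, taxa in pinned.items():
        missing = sorted(t for t in taxa if t not in current_taxids)
        if missing:
            report[calibration_id] = missing
    return report


# Hypothetical data: taxid 30 was removed by the update.
current_taxids = {1, 10, 20, 40}
pinned = {101: [30], 102: [40], 103: [30, 40]}
report = orphaned_pins(pinned, current_taxids)
```

The NCBI dump also ships a list of deleted ids (`delnodes.dmp`), which could be used to cross-check this report, though the set difference above does not depend on it.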

@jimallman
Collaborator Author

I've updated the three calibrations (215, 218, 220) on fossils.ibang.com that needed revisions to the calibrated node location, and now it seems all's well. This required a bit of jumping around (rebuilding the different tables in the Admin Dashboard, sometimes more than once, then re-testing in the MySQL interactive console).

The final report flags almost all the calibrations in the system as needing review. In each case, the "pinned" nodes that tie each calibrated node to the NCBI taxonomy have seen changes in their NCBI lineage, which means they could in principle show up in a new place in the Browse view, or in different clades in Search. This was an unusually long lag between NCBI taxonomy versions, but it suggests once again that there's a lot to review when we update the taxonomy.

@pdpolly
Collaborator

pdpolly commented Dec 12, 2014

Thanks, Jim (A). This may mean that we want to follow Jim P's suggestion to simply freeze the taxonomy until there are major upgrades to the system.


@jparham

jparham commented Dec 12, 2014

Agreed, I see no other way.

@jimallman
Collaborator Author

freeze the taxonomy until there are major upgrades to the system

By "upgrades to the system", do you mean changes in the NCBI taxonomy, or new features in FCDB? Because there's not much more we can do technically to overcome the need for review.

OK, there's one possible improvement: NCBI provides information about merged nodes as well as deleted nodes; in the case of a merge, we could probably trace these and re-pin to the new (merged) node. But that was not an issue in this NCBI update.
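A rough sketch of that re-pinning idea, assuming the merge information comes from the dump's `merged.dmp` file as rows of old id and new id separated by pipes. The function names and sample ids are hypothetical, and nothing like this exists in FCDB yet.

```python
from typing import Dict, Iterable, List, Set, Tuple


def parse_merged_dmp(lines: Iterable[str]) -> Dict[int, int]:
    """Map old (merged-away) taxids to the taxid that replaced them."""
    merged: Dict[int, int] = {}
    for line in lines:
        fields = [f.strip() for f in line.split("|")]
        if len(fields) >= 2 and fields[0]:
            merged[int(fields[0])] = int(fields[1])
    return merged


def repin(
    taxa: List[int], merged: Dict[int, int], current: Set[int]
) -> Tuple[List[int], List[int]]:
    """Follow merges for missing taxids; return (resolved ids, unresolved originals)."""
    resolved: List[int] = []
    unresolved: List[int] = []
    for taxon in taxa:
        target = taxon
        visited = set()
        # Follow the merge chain until we reach a live id or run out of links.
        while target not in current and target in merged and target not in visited:
            visited.add(target)
            target = merged[target]
        if target in current:
            resolved.append(target)
        else:
            unresolved.append(taxon)  # deleted outright: needs manual curation
    return resolved, unresolved


# Hypothetical data: taxid 30 was merged into 20, taxid 50 was deleted.
merged = parse_merged_dmp(["30\t|\t20\t|\n"])
resolved, unresolved = repin([30, 40, 50], merged, {1, 10, 20, 40})
```

Anything left in `unresolved` would still go through the manual fix described earlier, so this narrows the curation work rather than removing it.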

Also, keep in mind that most (or even all) of the calibrations needing review are probably just fine in the new taxonomy. It's more of a sanity check. You might even decide to go ahead with NCBI updates, re-pin the nodes whose NCBI targets were deleted, and postpone further work pending user complaints.

I suppose the bottom line is, Does all this work result in a noticeably better site? The quickest way to judge this is to compare the Browse results on fossils.ibang.com (latest NCBI) versus those on fossilcalibrations.org (early-2013 NCBI).

jimallman added a commit that referenced this issue Dec 12, 2014
Addresses #14, with a manual solution for now.
@jimallman
Collaborator Author

While it's all fresh in my mind, I've gone ahead and documented the tools and methods used in this test. This process requires a moderately skilled sysadmin and at least one subject-matter expert, and should be accompanied by a window of planned downtime ⌚ and a fresh cup of coffee. ☕
