Create a csv dump of IDs #10

Open
dimus opened this Issue Oct 7, 2015 · 15 comments

3 participants
@dimus
Member

dimus commented Oct 7, 2015

A CSV dump can be used by others to experiment with a distributed approach to matching IDs.

hyanwong Dec 21, 2015

Any ETA on this? I'm particularly interested in EoL page ID mapping to e.g. NCBI ids.

deepreef Dec 21, 2015

Contributor

I've been busy these past few months, but starting later this month/early next month, I plan to implement a number of improvements to BioGUID, including the CSV dump. Do you have identifier links already? Or would you like me to prioritize that particular set (EOL<->NCBI)?

hyanwong Dec 21, 2015

Well, I have a specific aim to get a map of OpenTree of Life OTT IDs to EoL page IDs. But I can already get a mapping of OpenTree -> NCBI and OpenTree -> WoRMS and OpenTree -> IRMNG and OpenTree -> GBIF and OpenTree -> Index_fungorum IDs (from http://files.opentreeoflife.org/ott/), so it is the link from these sources (NCBI, IF, GBIF, etc.) to EoL pages that I am missing. But that may be too specific a request to be useful to other users.
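
The join described here (OpenTree -> source ID, then source ID -> EoL page) can be sketched roughly as follows. This is only an illustration: the function names, the CSV column names, and the shape of the two inputs are assumptions, not anything BioGUID or OpenTree provides.

```python
import csv
import io


def compose_links(ott_to_source, source_to_eol):
    """Yield (ott_id, eol_id, via_source) rows.

    ott_to_source: iterable of (ott_id, source, source_id) triples.
    source_to_eol: dict mapping (source, source_id) to an EoL page ID.
    Both shapes are hypothetical, chosen for this sketch.
    """
    seen = set()
    for ott_id, source, source_id in ott_to_source:
        eol_id = source_to_eol.get((source, source_id))
        # Report each OTT/EoL pair once, via the first source that links them.
        if eol_id is not None and (ott_id, eol_id) not in seen:
            seen.add((ott_id, eol_id))
            yield ott_id, eol_id, source


def write_csv(rows, handle):
    """Write the composed rows as CSV with a header line."""
    writer = csv.writer(handle)
    writer.writerow(["ott_id", "eol_page_id", "via_source"])
    writer.writerows(rows)


# Made-up IDs, purely to show the call pattern.
example_rows = compose_links(
    [("12345", "ncbi", "10239"), ("12345", "gbif", "8")],
    {("ncbi", "10239"): "999", ("gbif", "8"): "999"},
)
buffer = io.StringIO()
write_csv(example_rows, buffer)
```

A dictionary lookup is enough when each source ID maps to at most one EoL page; if that does not hold, the dictionary values would need to become lists.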

deepreef Dec 21, 2015

Contributor

Ah! OK. So, I guess I would suggest that I incorporate all the OpenTree -> NCBI and OpenTree -> WoRMS and OpenTree -> IRMNG and OpenTree -> GBIF and OpenTree -> Index_fungorum IDs into BioGUID first; then if any of those other IDs are already linked to EoL, they will all likewise be linked to OpenTree. Moreover, going forward, linking EOL IDs to ANY of the other ones that OpenTree is already linked to will automatically ensure that OpenTree is linked to EOL as well. This is exactly the sort of use case that BioGUID was intended to support. In fact, I think I will make this my first priority task (i.e., harvesting the OpenTree IDs to all the ones already linked), then follow that up with an effort to link EOL to any of the others. So, where will I find the OpenTree -> XXX links within http://files.opentreeoflife.org/ott/?
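
The "linked to one means linked to all" behaviour described here can be illustrated with a small union-find structure. This is a sketch of the general idea only, not a description of how BioGUID is implemented; the class and method names are made up for the example.

```python
class IdentifierLinks:
    """Groups (source, id) pairs that refer to the same taxon (union-find)."""

    def __init__(self):
        self._parent = {}

    def _find(self, node):
        # Register unseen identifiers as their own group.
        self._parent.setdefault(node, node)
        root = node
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every node on the path at the root.
        while self._parent[node] != root:
            next_node = self._parent[node]
            self._parent[node] = root
            node = next_node
        return root

    def link(self, a, b):
        """Record that identifiers a and b refer to the same taxon."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self._parent[root_a] = root_b

    def linked(self, a, b):
        """True if a and b are connected through any chain of links."""
        return self._find(a) == self._find(b)


# Made-up IDs: once EoL is linked to NCBI, it is linked to OTT as well.
links = IdentifierLinks()
links.link(("ott", "12345"), ("ncbi", "10239"))
links.link(("eol", "999"), ("ncbi", "10239"))
```

One caveat of this approach is that a single wrong link merges two groups, so erroneous cross-references propagate to every identifier in both.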

deepreef Dec 21, 2015

Contributor

OK, I just downloaded/imported the OTT dataset (v2.9), and I see now where the identifiers are stored. I'll parse these out and incorporate them into BioGUID within the next couple of weeks, then look into ways of getting EoL cross-links incorporated as well.

hyanwong Dec 21, 2015

Great! In the taxonomy.tsv file in the download link I sent, there is a column named 'sourceinfo' with entries like:

ncbi:10239,gbif:8,irmng:19

The first column, labelled uid, is the OTT (Open Tree Taxonomy) ID.
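
A minimal sketch of pulling the cross-references out of that column might look like this. It assumes that fields in taxonomy.tsv are separated by tab, pipe, tab (my understanding of the OTT release format, not something stated in this thread); pass a different `sep` if your copy differs. The function names are made up.

```python
def parse_sourceinfo(sourceinfo):
    """Split 'ncbi:10239,gbif:8,irmng:19' into (source, id) pairs."""
    pairs = []
    for entry in sourceinfo.split(","):
        source, colon, identifier = entry.strip().partition(":")
        if colon:  # skip entries that are empty or have no 'source:' prefix
            pairs.append((source, identifier))
    return pairs


def iter_ott_links(lines, sep="\t|\t"):
    """Yield (ott_id, source, source_id) for every row of taxonomy.tsv.

    lines: any iterable of text lines, e.g. an open file.
    sep: field separator; the default is an assumption about the file format.
    """
    lines = iter(lines)
    header = [field.strip() for field in next(lines).split(sep)]
    uid_col = header.index("uid")
    src_col = header.index("sourceinfo")
    for line in lines:
        fields = [field.strip() for field in line.split(sep)]
        if len(fields) <= max(uid_col, src_col):
            continue  # malformed or truncated row
        for source, identifier in parse_sourceinfo(fields[src_col]):
            yield fields[uid_col], source, identifier
```

With a real file this would be called as `iter_ott_links(open("taxonomy.tsv", encoding="utf-8"))`.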

hyanwong Dec 22, 2015

Are you also aware of the Wikidata Q IDs (e.g. https://www.wikidata.org/wiki/Q2267046)? If you can map those somehow too, then they immediately provide links to Wikipedia pages for taxa in different languages, categories of media files for taxa, etc.

deepreef Dec 22, 2015

Contributor

I don't have links to Wikidata yet; do you know if they have a way to bulk download their identifiers mapped to genus/species/etc. and their own identifier cross-links (ITIS, EoL, WoRMS, GBIF, Dyntaxa)?

hyanwong Dec 22, 2015

Yes: you download their JSON dump and parse it. I'll post some code to do this.

deepreef Dec 22, 2015

Contributor

One name at a time or in bulk?

hyanwong Dec 22, 2015

It runs on the bulk 8 GB JSON download file, which has one record per line.
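
For readers without the script, a rough sketch of streaming a one-record-per-line dump is given below. It is not the script mentioned in this thread. It assumes the dump is a JSON array with one entity per line and a trailing comma on each line, and that the Wikidata properties are P225 (taxon name), P685 (NCBI taxonomy ID), P830 (EoL ID) and P846 (GBIF ID); those property IDs are worth checking against Wikidata before relying on them.

```python
import json

TAXON_NAME = "P225"  # assumed property ID for the taxon name
XREF_PROPS = {"P685": "ncbi", "P830": "eol", "P846": "gbif"}  # assumed IDs


def first_value(entity, prop):
    """Return the first string value of a property, or None if absent."""
    for claim in entity.get("claims", {}).get(prop, []):
        value = claim.get("mainsnak", {}).get("datavalue", {}).get("value")
        if isinstance(value, str):
            return value
    return None


def iter_taxon_xrefs(lines):
    """Yield (qid, taxon_name, {source: id}) for each taxon in the dump.

    lines: iterable of text lines, e.g. a file opened in text mode.
    Only one line is held in memory at a time.
    """
    for line in lines:
        line = line.strip().rstrip(",")
        if not line or line in ("[", "]"):
            continue  # array brackets and blank lines carry no record
        entity = json.loads(line)
        name = first_value(entity, TAXON_NAME)
        if name is None:
            continue  # not a taxon
        xrefs = {}
        for prop, label in XREF_PROPS.items():
            value = first_value(entity, prop)
            if value is not None:
                xrefs[label] = value
        yield entity["id"], name, xrefs
```

If the dump is compressed, it can be read without unpacking it first, for example with `gzip.open(path, "rt", encoding="utf-8")`.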

deepreef Dec 22, 2015

Contributor

Excellent! Link?

hyanwong Dec 22, 2015

Just modifying my script for you.

By the way, note that there are a few entries in the OTT taxonomy.tsv file which have 2 (different) NCBI IDs. These are (always?) ones that have an NCBI ID immediately followed by a SILVA ID, and then later in the list, another NCBI ID. The Open Tree people have pointed out that these are cases where the first NCBI ID has been derived indirectly, via SILVA (https://groups.google.com/d/msg/opentreeoflife/L2x3Ond16c4/CVp6msiiCgAJ), and in these cases, I have noticed that the first NCBI ID is often wrong: I reckon that the first NCBI number can probably be ignored if there is another alternative.
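
One way to apply that rule of thumb is sketched below. It assumes the SILVA entries in the sourceinfo column use the prefix `silva`, by analogy with `ncbi` and `gbif`; the function name and the example IDs are made up.

```python
def preferred_ncbi_id(sourceinfo):
    """Pick the NCBI ID to trust from a sourceinfo string.

    If the first ncbi entry is immediately followed by a silva entry and a
    different ncbi ID appears later, return the later one; otherwise return
    the first ncbi ID, or None if there is no ncbi entry at all.
    """
    pairs = []
    for entry in sourceinfo.split(","):
        source, colon, identifier = entry.strip().partition(":")
        if colon:
            pairs.append((source, identifier))

    ncbi_positions = [i for i, (source, _) in enumerate(pairs) if source == "ncbi"]
    if not ncbi_positions:
        return None

    first = ncbi_positions[0]
    first_id = pairs[first][1]
    derived_via_silva = first + 1 < len(pairs) and pairs[first + 1][0] == "silva"
    later_ids = [pairs[i][1] for i in ncbi_positions[1:] if pairs[i][1] != first_id]

    if derived_via_silva and later_ids:
        return later_ids[0]
    return first_id
```

For example, `preferred_ncbi_id("ncbi:111,silva:AB000001,gbif:5,ncbi:222")` returns `"222"`, while an entry with a single NCBI ID is returned unchanged.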

hyanwong Dec 23, 2015

Attached a rough python script. I haven't tested it yet, and it will probably need debugging, but it should give you the general idea.

Yan

get_wikidata_taxonQid.py.zip

hyanwong Dec 28, 2015

Hope the script is helpful. I just realised that I didn't get it to actually print out the Wikidata QID for each line (doh), but I guess you can fix that easily. Is there any ETA for a csv matching file? Especially one with Encyclopedia of Life <=> OpenTree IDs? I don't know if @dimus has provided an EoL ID dump to BioGUID yet? So don't know how possible this is?
