Create a file that maps all hierarchy entries to EOL taxon concept (page) ids #162

Open
KatjaSchulz opened this issue Feb 10, 2016 · 49 comments

@KatjaSchulz

This was requested by @dimus, @jhpoelen, @hyanwong. Ideally, this would be a Darwin Core Archive, but anything that gets the data out would do for now. @JRice suggested we could start with a straight dump of certain columns from the hierarchies table, i.e., identifier used by provider, hierarchy_entry_id, hierarchy_id, taxon_concept_id, what else?

Interested parties, please elaborate if necessary.

@JRice commented Feb 10, 2016

It's pretty big, even gzipped, at 120 MB or so.

@jhammock

When we're sure we have all the requirements, we should schedule updates and post this somewhere :)

@JRice commented Feb 10, 2016

I would use a query like this:

select id, identifier, hierarchy_id, taxon_concept_id
from hierarchy_entries
where published = 1 and visibility_id = 1
into outfile 'identifiers.csv'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\n';

It took about 25 seconds to run.

...The problem is, of course, that it will end up on the server where you run the command... for me, it was in /eol/data/db_slave3/var/lib/mysql/eol_production/identifiers.csv. From there, it will need to be moved somewhere useful. ...opendata.eol.org would be nice, but seems to be down more often than up, so we'll have to figure out where to put it. :(

@KatjaSchulz

@JRice Just had a peek at the file. It looks like a lot of nodes don't have an identifier, i.e., column 2 is empty, e.g., for hierarchy 114. How can that be? Isn't the taxon identifier a required field?

@JRice commented Feb 10, 2016

An excellent question! Off the top of my head, I don't know. 114 is "EOL Group on Flickr", so perhaps there have been some exceptions to that rule for really weird content partners...

@KatjaSchulz

Ah, that makes sense. However, it looks like there are a lot of hierarchies that have all or some identifiers missing: 114,115,118,119,120,113,121,124,125,130,132,133,138,140,141,143,144,393,394,396,397,398,127,400,401,402,403,405,410,430,431,433,436,439,440,444,448,449,453,428,454,155,459,463,464,465,467,476,507,508,518,522,537,542,548,553,556,558,563,562,564,565,566,567,570,571,582,587,588,549,589,590,595,601,617,622,625,628,635,636,639,643,648,653,661,690,695,696,697,707,711,716,717,720,721,726,727,731,734,735,714,740,753,777,778,787,798,799,804,808,814,821,826,833,837,838,840,841,842,843,844,852,854,857,861,864,868,869,875,879,880,883,890,893,894,896,901,902,912,913,925,929,930,931,932,933,934,935,936,937,938,939,940,941,942,943,944,945,946,131,956,957,962,970,974,977,975,982,984,986,987,988,994,998,1002,1003,1004,1005,1006,1007,1008,1009,1010,1011,1012,1017,1018,1019,1021,1022,1023,1024,1025,1026,1027,1028,1029,1030,1031,1032,1033,1034,1035,1066,1077,1076,884,1082,1089,1094,1099,1093,1100,1112,1115,604,1145,1154,1157,1158,1161,1164,1165,1166,1167,1208,1255,1256,1090,1261,1264,1269,1270,1272,1273,1277,1279,1282,1168,1285,1288,1289,1291,1169,1296,1297,1298,1301,1302,1303,1304,1305,1308,1309,1310,1311,1312,1313,1314,1315,1316,1317,1320,1322,1323,1325,1328,1332,1335,1344,1346,1354,1362,1367,1371,1377,1378,1381,1383,1388,1393,1394,1395,1400,1401,967,1402,1404,1406,1321,1412,1414,1423,1424,1425,1426,1427,1428,1431,1432,1433,1413,1460,1461,1462,1463,1466,1469,1472,1473,1474,1484,1485,1486,1487,1491,1429,1498,860,1499,1501

Unfortunately, I cannot look up the providers of these hierarchies, because the Hierarchies API seems to be defunct.

@JRice commented Feb 10, 2016

That is a lot. :(
Most of them are non-browsable (maybe we should change the query to only include browsable hierarchies?), but some are browsable:

  • AntWeb (Ant Species) (121)
  • FishBase (Fish Species) (143)
  • IUCN Red List (Species Assessed for Global Conservation) (144)
  • The Reptile Database (787)
  • Avibase - IOC World Bird Names (2011) (860)
  • Paleobiology Database (967)

...Does that shed any light?
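
If we do want to restrict the dump to browsable hierarchies, something like this sketch should work, assuming the browsable flag on the hierarchies table is the column the site uses for that (worth double-checking the schema):

mysql eol_production -e "
  SELECT he.id, he.identifier, he.hierarchy_id, he.taxon_concept_id
  FROM hierarchy_entries he
  JOIN hierarchies h ON (he.hierarchy_id = h.id)  -- join to reach the browsable flag
  WHERE he.published = 1 AND he.visibility_id = 1
    AND h.browsable = 1;"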

@KatjaSchulz

Hm, looks like a bunch of connectors, except AntWeb, which is an old-style XML file. AntWeb does have taxon identifiers, e.g.:

<taxon>
  <dc:identifier>acanthognathus brevicornis</dc:identifier>
  <dc:source>https://www.antweb.org/description.do?genus=acanthognathus&amp;name=brevicornis&amp;rank=species</dc:source>
  <dwc:Phylum>Arthropoda</dwc:Phylum>
  <dwc:Class>Insecta</dwc:Class>
  <dwc:Order>Hymenoptera</dwc:Order>
  <dwc:Family>Formicidae</dwc:Family>
  <dwc:ScientificName>Acanthognathus brevicornis</dwc:ScientificName>
</taxon>

Also, paleodb provides identifiers (taxon_no) in the API we use to create the EOL import file, and I would be surprised if Eli didn't use them, but it may be worth checking with him. Example API response:
"taxon_no","orig_no","record_type","associated_records","rank","taxon_name","common_name","status","parent_no","senior_no","reference_no","is_extant"
"38613","38613","taxon","","genus","Tyrannosaurus","tyrant lizard","belongs to","38606","","9259","0"

@JRice commented Feb 10, 2016

Yup, this is worth turning into a bug—these might have all been "recent" harvests (since April), and there might be a bug. I'll do that -> #164 (we should reserve this ticket for talking about the file itself—what goes in it, where to put it, etc).

@JRice commented Feb 10, 2016

Hmmn... I could have sworn I'd put this in an earlier comment, but I was thinking that, since my quick experiment with JOINing a query to get the top images here didn't work, it might be worth a separate output file for those... or two, actually. One would have "page_id" and "data_object_id" in it; the other would have "data_object_id" and a URL (or several, if it's convenient to include thumbnails) to the image.

Not much more work, but a bit more confusing, to put a "view_order" in the second file and have multiple data_object_ids per page, so you could get a few top images. ...Not sure if that helps.
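
A rough sketch of what those two dumps might look like (the file names and the view_order cutoff here are hypothetical; top_images and data_objects are the tables used in the queries further down):

# File 1: page_id -> data_object_id, with view_order so you can take the top few images.
mysql eol_production -e "
  SELECT he.taxon_concept_id AS page_id, i.data_object_id, i.view_order
  FROM hierarchy_entries he
  JOIN top_images i ON (he.id = i.hierarchy_entry_id)
  WHERE he.published = 1 AND he.visibility_id = 1 AND i.view_order <= 5
  INTO OUTFILE 'page_images.csv'
  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' LINES TERMINATED BY '\n';"

# File 2: data_object_id -> object_cache_url, from which image URLs can be built.
# (Unfiltered here; in practice you'd restrict to published image objects.)
mysql eol_production -e "
  SELECT id AS data_object_id, object_cache_url
  FROM data_objects
  INTO OUTFILE 'image_urls.csv'
  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' LINES TERMINATED BY '\n';"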

@JRice commented Feb 10, 2016

Other information we could include in this file is rank ids and name ids... though you would need separate lookup files (one each) to figure out what those meant. :S

@jhpoelen

@KatjaSchulz thanks for opening this. In my mind, this is a duplicate of, or at least related to, #10. That issue also contains some concrete examples of an approach that is currently operational.

@JRice I am not too concerned about file sizes. I've stored archives on Amazon S3 to make them web-accessible to anyone as part of an automated process. I'd be happy to show you if you are interested.

@jhammock jhammock added bug and removed bug labels Feb 10, 2016
@hyanwong

@JRice "if it was that easy, I wonder why we didn't get this a few years ago..." But seriously, thanks. This was also requested by @iimog I think.

Also moving my gitter comment here: "A DB dump with just hierarchy ids is v. useful. "Best image" ids are not so so useful, since 'best image' changes regularly, and can probably be better queried only as needed. Also that bit (getting the best image from an EoL page ID) can be done for multiple images in a single API call now, thanks to Pages API batch mode (#40)'
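
For reference, a batch Pages API call would look roughly like this; the batch, id, and images parameter names are assumptions based on the discussion in #40, so verify them against the current API docs:

# Hypothetical sketch: fetch the top image for several page ids in one request.
curl 'http://eol.org/api/pages/1.0.json?batch=true&id=1045608,2866150&images=1'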

@hyanwong

P.s. @deepreef is also interested in this.

@hyanwong

@JRice I think, although I may be wrong, that EoL used to have GBIF identifiers in here, and lots of them, under provider_id=800. I can't find anything like that in the file:

grep ',800,' identifiers.csv

@iimog commented Feb 11, 2016

@JRice Thanks for the file. I downloaded it and for my use case it is perfect.
Thanks @hyanwong for drawing my attention to this issue.

@dimus commented Feb 11, 2016

Some relevant information from JR from a few months ago:

What I want to do first is to gather all the data directly from the database.

Okay. I'll go through what is where, then.

  • scientific names with corresponding taxon-concept IDs (1 name per 1 ID)

You probably want to look at hierarchy entries, which have a name_id right on them for their preferred scientific name.

  • names which are considered synonyms for that EOL ID

hierarchy_entries have synonyms:

SELECT synonyms.* FROM synonyms WHERE synonyms.hierarchy_entry_id = ?

  • vernacular names with language

common names are more complicated; you have to go through the taxon_concept. Something like

SELECT * FROM taxon_concept_names INNER JOIN names ON names.id = taxon_concept_names.name_id INNER JOIN languages ON languages.id = taxon_concept_names.language_id WHERE taxon_concept_names.taxon_concept_id = 340732 AND (vern = 1 AND languages.iso_639_1 IS NOT NULL AND languages.iso_639_1 != '')

  • crossmapping between taxon concept IDs from EOL and IDs of 5 data partners:
    NCBI, IRMNG, Silva (actually not sure it is in EOL), WoRMS, CoL

Here's where you can select which hierarchy entries to grab. You'll need to know the resource_id of each of those (easiest to get from the website, as an admin—search from the CP index page, then find the CP and go to its resource pages to get the id)... you'll see there that it's not always clear, as NCBI has two resources... I don't know which you want, or if you want both... it's a mystery.

Each of those resources then has a hierarchy_id, so you'll need that. Once you have those, you can select the hierarchy_entries with those hierarchy_ids (you should also only select published HEs!), and on each of those is an "identifier" (that's the name of the field) which is the ID used on the resource's website (if applicable). If you want the full URL, you will have to build it off of the "outlink_uri" field on the hierarchy.
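
A sketch of that last step; the %%ID%% placeholder token inside outlink_uri is an assumption here, so verify it against the harvester code before trusting the output:

# Build full outlink URLs for one hierarchy (143 = FishBase, per the list above).
mysql eol_production -e "
  SELECT he.identifier,
         REPLACE(h.outlink_uri, '%%ID%%', he.identifier) AS outlink_url
  FROM hierarchy_entries he
  JOIN hierarchies h ON (he.hierarchy_id = h.id)
  WHERE he.published = 1 AND he.hierarchy_id = 143
  LIMIT 10;"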

@JRice commented Feb 14, 2016

I'm going to try JOINs again in the hope of identifying the bad indexes... here's a simple sample query, based on the above:

select id,identifier,hierarchy_id,taxon_concept_id from hierarchy_entries where published = 1 and visibility_id = 1 limit 10 offset 130000;

@JRice commented Feb 15, 2016

Adding a JOIN immediately pushes query times up to 6-plus seconds for a result set of 10. :( Doing a tiny bit of digging into the joins themselves hasn't helped, either—I tried switching to a subselect to avoid the "weak" filters of published and visible, but that only made matters worse...

# HORRIBLE PERFORMANCE (I didn't even let it finish):
SELECT * FROM (
  SELECT he.id, he.identifier, he.hierarchy_id, he.taxon_concept_id,
    he.published, he.visibility_id,  -- the subselect must expose these for the outer WHERE
    n.string
  FROM hierarchy_entries he
  LEFT JOIN names n ON (he.name_id = n.id) ) q
WHERE q.published = 1 AND q.visibility_id = 1
LIMIT 10 OFFSET 130000;

I did get a much better (usable, but not stellar: 1.45 sec with limit 10 offset 130000) result when I removed the LEFT part of the join (and went with straight joins). ...That might be adequate. (Aside: the code is FULL of left joins. ALL OVER. This might be a big factor in the overall speed of the site... though removing them would of course potentially result in missing data on the site, soooo... it would need thinking.)

SELECT he.id, he.identifier, he.hierarchy_id, he.taxon_concept_id, n.string
FROM hierarchy_entries he
JOIN names n ON (he.name_id = n.id)
WHERE he.published = 1 AND he.visibility_id = 1
LIMIT 10 OFFSET 130000;

I'll try making another join and see if it gets much worse...

SELECT he.id, he.identifier, he.hierarchy_id, he.taxon_concept_id,
  n.string, d.object_cache_url, r.label
FROM hierarchy_entries he
JOIN names n ON (he.name_id = n.id)
JOIN top_images i ON (he.id = i.hierarchy_entry_id AND view_order = 1)
JOIN data_objects d ON (i.data_object_id = d.id)
JOIN translated_ranks r ON (he.rank_id = r.rank_id AND r.language_id = 152)
WHERE he.published = 1 AND he.visibility_id = 1
LIMIT 10 OFFSET 130000;

This one is horrible. ...It wasn't so bad when it was top_images only (3.97 sec), but adding the data_objects join truly killed it (2 min 8.42 sec), and throwing in the rank didn't hurt much more (it was probably cached, so: 1 min 12.19 sec). Adding INTO OUTFILE 'identifiers.csv' FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\n' created a file in 6 min 20.73 sec, which is usable.

Update: second run (months later) took 3 min 21.32 sec. The file ends up in /eol/data/db_slave2/var/lib/mysql/eol_production/identifiers.csv.

@JRice commented Feb 15, 2016

For those of you who would like to see a file with a little more information in it, you can download the new version of the file here. It's a smaller file because there are fewer rows (if there's data missing, the row is missing—this is problematic when it comes to taxa with no image), so you might want both files, if you're looking for completeness... but this file has more information: the scientific name, the rank, and the "object cache URL", from which you can build a link to any (valid) size image, e.g.:

201201171519404 -> http://media.eol.org/content/2012/01/17/15/19404_260_190.jpg

(I'll let you figure out the pattern; it's pretty basic. Hint: the last two numbers are (max) width and height.)
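
A minimal shell sketch of that pattern, derived from the example above (the date parts come straight off the front of the cache ID; the trailing numbers are the max width and height):

c=201201171519404
echo "http://media.eol.org/content/${c:0:4}/${c:4:2}/${c:6:2}/${c:8:2}/${c:10}_260_190.jpg"
# -> http://media.eol.org/content/2012/01/17/15/19404_260_190.jpg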

@JRice JRice assigned JRice and unassigned JRice Feb 16, 2016
@jhpoelen

@JRice was trying to access the identifiers archive at http://beta.eol.org/uploads/data_search_files/big_identifiers.csv.gz and got a 404. Is there a different location I should use?

@KatjaSchulz

We just had another request from @diatomsRcool for a file with all EOL taxon concept ids and associated taxon names. Is this file still around somewhere or can it be recovered?

@hyanwong

I have a copy. Shall I send it to @diatomsRcool or is a new one soon in the pipeline?

@KatjaSchulz

That would be great, thanks!

@diatomsRcool

What are the column headers here? Looks like 4 columns...

@jhammock jhammock modified the milestone: 2016.06.14 Jul 12, 2016
@JRice commented Aug 24, 2016

What I did on the last run:

mysql_2   # shell alias: presumably opens a mysql session on db slave 2 (see the OUTFILE path below)

# Note this took 17 min 22.41 sec last time.
SELECT he.id, he.identifier, he.hierarchy_id, he.taxon_concept_id,
  n.string, d.object_cache_url, r.label
FROM hierarchy_entries he
LEFT JOIN names n ON (he.name_id = n.id)
LEFT JOIN top_images i ON (he.id = i.hierarchy_entry_id AND view_order = 1)
LEFT JOIN data_objects d ON (i.data_object_id = d.id)
LEFT JOIN translated_ranks r ON (he.rank_id = r.rank_id AND r.language_id = 152)
WHERE he.published = 1 AND he.visibility_id = 1
INTO OUTFILE 'identifiers_with_images.csv'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\n';
SELECT he.id, he.identifier, he.hierarchy_id, he.taxon_concept_id,
  n.string
FROM hierarchy_entries he
LEFT JOIN names n ON (he.name_id = n.id)
WHERE he.published = 1 AND he.visibility_id = 1
INTO OUTFILE 'identifiers.csv'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\n';

quit

sudo su -   # become root so we can move the dumps out of the mysql data dir

mv /eol/data/db_slave2/var/lib/mysql/eol_production/identifiers.csv ~jrice/.
chown jrice:jrice !$   # !$ expands to the last argument of the previous command
mv /eol/data/db_slave2/var/lib/mysql/eol_production/identifiers_with_images.csv ~jrice/.
chown jrice:jrice !$
exit

scp identifiers*.csv eol-app1:~/.
app1   # shell alias: presumably ssh to eol-app1

sudo mv ~jrice/identifiers*.csv /eol/data/app/public/.
sudo cp /eol/data/app/public/identifiers*.csv /eol/data/app/log/beta/.

beta   # shell alias: presumably ssh to the beta host

mv log/identifiers*.csv public/.
chmod 744 public/identifiers*.csv
exit

exit

rm identifiers*.csv

@jhammock

Most recent version: http://beta.eol.org/identifiers.csv

@iimog commented Nov 15, 2016

Hey, I use the identifiers.csv file a lot to cross-map identifiers. However, my version of the file is a little outdated, and the link to the most recent version posted by @jhammock returns a 404 now. Is it possible to get the latest version of the file somewhere?

@hyanwong

@jhammock mentioned that there was some new stuff in the pipeline, so it might just be on its way to a new home.

@jhammock

Sorry, not sure what happened to that file. The replacement files are still in prep.

@iimog commented Nov 16, 2016

All right. Thank you. I'll be patient.

@eliagbayani commented Feb 22, 2017

Hi @hyanwong, @iimog,
Please check the latest identifiers.csv here:
identifiers.csv on opendata.eol.org

If you have an old copy of this file, can you please tell me whether this latest one has more or fewer records once you've downloaded it?
Thanks.

@hyanwong

I can't seem to get that file using wget or cURL, annoyingly. Is there something weird about the URL?

@jhpoelen

@hyanwong this is consistent with globalbioticinteractions/globalbioticinteractions#274. I suspect the portal is designed for humans to "click to download" rather than for machines to curl/wget.

@jhammock

Hmm... I think we were expecting machine-readable services from CKAN as well. Eli, is there a solution in our CKAN arsenal?

@hyanwong

@jhpoelen I guess you didn't figure out a way to get either wget or cURL to do it, via some fancy flag setting?

@hyanwong commented Feb 22, 2017

Looks like slightly more records (lines):

yans-air:EOL yan$ wc identifiers.csv
 12564087 45169123 854534282 identifiers.csv
yans-air:EOL yan$ wc identifiers_prev.csv 
 12513903 45117019 851901860 identifiers_prev.csv

@jhpoelen commented Feb 22, 2017

@hyanwong I was actually able to download the archive using curl by copy-pasting the request the opendata site makes to start the download (see globalbioticinteractions/globalbioticinteractions#274 (comment)). However, both this download and the manual download timed out after 10 mins, leaving a partial archive. So, with my crappy consumer-grade internet connection, I was unable to access the archive; I had to use an external server to download the file, then transfer it to my development system from there. Hope this helps!
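
If the connection keeps dying mid-transfer, resuming with curl may get you the rest of the file. A sketch, assuming the server honors HTTP Range requests (the URL is a placeholder; substitute the real download link):

# -C - resumes from wherever the previous attempt stopped; -f makes HTTP
# errors fail the command so the loop retries.
until curl -fL -C - -o identifiers.csv 'https://opendata.eol.org/path/to/identifiers.csv'; do
  sleep 5
done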

@eliagbayani commented Feb 23, 2017

Thanks @hyanwong and @jhpoelen for your input. Very helpful.

Hi @hyanwong, slightly more records is about right, since we haven't had a real harvest for some time now.

Regarding the problem of serving files for download, I will consult our sysadmin about improving service for clients with lower bandwidth. One non-technical solution is to divide big files into smaller chunks for easier download, where wget, curl, or plain HTTP should all work.

I just increased the Apache Timeout, but that didn't seem to help; I keep getting a partial 100-140 MB download.

I'm able to wget smaller files, e.g., reptiles.

@hyanwong

Is it a bandwidth problem? I can download fine with a normal browser, just not with wget or curl.

@iimog commented Feb 24, 2017

I have the same problem as @hyanwong: clicking download works (takes ~90 s from my university), while both curl and wget return Error 503 Service Unavailable. I don't think it is a timeout problem.
But in general: nice to have this file in the opendata.eol.org portal.
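
If clicking works but curl/wget get a 503, one guess is that the portal is filtering on the User-Agent header. A workaround to test (placeholder URL again; substitute the real download link):

# Pretend to be a browser; if this succeeds where plain wget fails,
# the 503 is user-agent filtering rather than load or bandwidth.
wget --user-agent='Mozilla/5.0' -O identifiers.csv 'https://opendata.eol.org/path/to/identifiers.csv'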

@eliagbayani

Thanks for the input @iimog.
