Create a file that maps all hierarchy entries to EOL taxon concept (page) ids #162

Open
KatjaSchulz opened this issue Feb 10, 2016 · 49 comments

@KatjaSchulz

This was requested by @dimus, @jhpoelen, @hyanwong. Ideally, this would be a Darwin Core Archive, but anything that gets the data out would do for now. @JRice suggested we could start with a straight dump of certain columns from the hierarchies table, i.e., identifier used by provider, hierarchy_entry_id, hierarchy_id, taxon_concept_id, what else?

Interested parties, please elaborate if necessary.

@JRice commented Feb 10, 2016

It's pretty big, even gzipped, at 120 MB or so.

@jhammock

When we're sure we have all the requirements, we should schedule updates and post this somewhere :)

@JRice commented Feb 10, 2016

I would use a query like this:

select id, identifier, hierarchy_id, taxon_concept_id
from hierarchy_entries
where published = 1 and visibility_id = 1
into outfile 'identifiers.csv'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\n';

It took about 25 seconds to run.

...The problem is, of course, that it will end up on the server where you run the command... for me, it was in /eol/data/db_slave3/var/lib/mysql/eol_production/identifiers.csv. From there, it will need to be moved somewhere useful. ...opendata.eol.org would be nice, but seems to be down more often than up, so we'll have to figure out where to put it. :(

@KatjaSchulz

@JRice Just had a peek at the file. It looks like a lot of nodes don't have an identifier, i.e., column 2 is empty, e.g., for hierarchy 114. How can that be? Isn't the taxon identifier a required field?

@JRice commented Feb 10, 2016

An excellent question! Off the top of my head, I don't know. 114 is "EOL Group on Flickr", so perhaps there have been some exceptions to that rule for really weird content partners...

@KatjaSchulz

Ah, that makes sense. However, it looks like there are a lot of hierarchies that have all or some identifiers missing: 114,115,118,119,120,113,121,124,125,130,132,133,138,140,141,143,144,393,394,396,397,398,127,400,401,402,403,405,410,430,431,433,436,439,440,444,448,449,453,428,454,155,459,463,464,465,467,476,507,508,518,522,537,542,548,553,556,558,563,562,564,565,566,567,570,571,582,587,588,549,589,590,595,601,617,622,625,628,635,636,639,643,648,653,661,690,695,696,697,707,711,716,717,720,721,726,727,731,734,735,714,740,753,777,778,787,798,799,804,808,814,821,826,833,837,838,840,841,842,843,844,852,854,857,861,864,868,869,875,879,880,883,890,893,894,896,901,902,912,913,925,929,930,931,932,933,934,935,936,937,938,939,940,941,942,943,944,945,946,131,956,957,962,970,974,977,975,982,984,986,987,988,994,998,1002,1003,1004,1005,1006,1007,1008,1009,1010,1011,1012,1017,1018,1019,1021,1022,1023,1024,1025,1026,1027,1028,1029,1030,1031,1032,1033,1034,1035,1066,1077,1076,884,1082,1089,1094,1099,1093,1100,1112,1115,604,1145,1154,1157,1158,1161,1164,1165,1166,1167,1208,1255,1256,1090,1261,1264,1269,1270,1272,1273,1277,1279,1282,1168,1285,1288,1289,1291,1169,1296,1297,1298,1301,1302,1303,1304,1305,1308,1309,1310,1311,1312,1313,1314,1315,1316,1317,1320,1322,1323,1325,1328,1332,1335,1344,1346,1354,1362,1367,1371,1377,1378,1381,1383,1388,1393,1394,1395,1400,1401,967,1402,1404,1406,1321,1412,1414,1423,1424,1425,1426,1427,1428,1431,1432,1433,1413,1460,1461,1462,1463,1466,1469,1472,1473,1474,1484,1485,1486,1487,1491,1429,1498,860,1499,1501

Unfortunately, I cannot look up the providers of these hierarchies, because the Hierarchies API seems to be defunct.

@JRice commented Feb 10, 2016

That is a lot. :(
Most of them are non-browsable (maybe we should change the query to only include browsable hierarchies?), but some are browsable:

  • AntWeb (Ant Species) (121)
  • FishBase (Fish Species) (143)
  • IUCN Red List (Species Assessed for Global Conservation) (144)
  • The Reptile Database (787)
  • Avibase - IOC World Bird Names (2011) (860)
  • Paleobiology Database (967)

...Does that shed any light?
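
If we do want to restrict the dump to browsable hierarchies, something like this sketch should work, assuming the browsable flag on the hierarchies table is the column the site uses for that (worth double-checking the schema):

mysql eol_production -e "
  SELECT he.id, he.identifier, he.hierarchy_id, he.taxon_concept_id
  FROM hierarchy_entries he
  JOIN hierarchies h ON (he.hierarchy_id = h.id)  -- join to reach the browsable flag
  WHERE he.published = 1 AND he.visibility_id = 1
    AND h.browsable = 1;"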

@KatjaSchulz

Hm, looks like a bunch of connectors, except AntWeb, which is an old-style XML file. AntWeb does have taxon identifiers, e.g.:

<taxon>
  <dc:identifier>acanthognathus brevicornis</dc:identifier>
  <dc:source>https://www.antweb.org/description.do?genus=acanthognathus&amp;name=brevicornis&amp;rank=species</dc:source>
  <dwc:Phylum>Arthropoda</dwc:Phylum>
  <dwc:Class>Insecta</dwc:Class>
  <dwc:Order>Hymenoptera</dwc:Order>
  <dwc:Family>Formicidae</dwc:Family>
  <dwc:ScientificName>Acanthognathus brevicornis</dwc:ScientificName>
</taxon>

Also, paleodb provides identifiers (taxon_no) in the API we use to create the EOL import file, and I would be surprised if Eli didn't use them, but it may be worth checking with him. Example API response:
"taxon_no","orig_no","record_type","associated_records","rank","taxon_name","common_name","status","parent_no","senior_no","reference_no","is_extant"
"38613","38613","taxon","","genus","Tyrannosaurus","tyrant lizard","belongs to","38606","","9259","0"

@JRice commented Feb 10, 2016

Yup, this is worth turning into a bug—these might have all been "recent" harvests (since April), and there might be a bug. I'll do that -> #164 (we should reserve this ticket for talking about the file itself—what goes in it, where to put it, etc).

@JRice commented Feb 10, 2016

Hmmn... I could have sworn I'd put this in an earlier comment, but I was thinking that, since my quick experiment with JOINing a query to get the top images here didn't work, it might be worth a separate output file for those... or two, actually. One would have "page_id" and "data_object_id" in it; the other would have "data_object_id" and a URL (or several, if it's convenient to include thumbnails) to the image.

Not much more work, but a bit more confusing, to put a "view_order" in the second file and have multiple data_object_ids per page, so you could get a few top images. ...Not sure if that helps.
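
A rough sketch of what those two dumps might look like (the file names and the view_order cutoff here are hypothetical; top_images and data_objects are the tables used in the queries further down):

# File 1: page_id -> data_object_id, with view_order so you can take the top few images.
mysql eol_production -e "
  SELECT he.taxon_concept_id AS page_id, i.data_object_id, i.view_order
  FROM hierarchy_entries he
  JOIN top_images i ON (he.id = i.hierarchy_entry_id)
  WHERE he.published = 1 AND he.visibility_id = 1 AND i.view_order <= 5
  INTO OUTFILE 'page_images.csv'
  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' LINES TERMINATED BY '\n';"

# File 2: data_object_id -> object_cache_url, from which image URLs can be built.
# (Unfiltered here; in practice you'd restrict to published image objects.)
mysql eol_production -e "
  SELECT id AS data_object_id, object_cache_url
  FROM data_objects
  INTO OUTFILE 'image_urls.csv'
  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' LINES TERMINATED BY '\n';"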

@JRice commented Feb 10, 2016

Other information we could include in this file is rank ids and name ids... though you would need separate lookup files (one each) to figure out what those meant. :S

@jhpoelen

@KatjaSchulz thanks for opening this. In my mind, this is a duplicate of, or at least related to, #10. That issue also contains some concrete examples of an approach that is currently operational.

@JRice I am not too concerned about file sizes. I've stored archives on Amazon S3 to make them web-accessible to anyone as part of an automated process. I'd be happy to show you if you are interested.

@jhammock jhammock added bug and removed bug labels Feb 10, 2016
@hyanwong

@JRice "if it was that easy, I wonder why we didn't get this a few years ago..." But seriously, thanks. This was also requested by @iimog I think.

Also moving my gitter comment here: "A DB dump with just hierarchy ids is v. useful. "Best image" ids are not so so useful, since 'best image' changes regularly, and can probably be better queried only as needed. Also that bit (getting the best image from an EoL page ID) can be done for multiple images in a single API call now, thanks to Pages API batch mode (#40)'
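
For reference, a batch Pages API call would look roughly like this; the batch, id, and images parameter names are assumptions based on the discussion in #40, so verify them against the current API docs:

# Hypothetical sketch: fetch the top image for several page ids in one request.
curl 'http://eol.org/api/pages/1.0.json?batch=true&id=1045608,2866150&images=1'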

@hyanwong

P.s. @deepreef is also interested in this.

@hyanwong

@JRice I think, although I may be wrong, that EoL used to have GBIF identifiers in here, and lots of them, under provider_id=800. I can't find anything like that in the file:

grep ',800,' identifiers.csv

@iimog commented Feb 11, 2016

@JRice Thanks for the file. I downloaded it and for my use case it is perfect.
Thanks @hyanwong for drawing my attention to this issue.

@dimus commented Feb 11, 2016

Some relevant information from JR from a few months ago:

What I want to do first is to gather all the data directly from the database.

Okay. I'll go through what is where, then.

  • scientific names with corresponding taxon-concept IDs (1 name per 1 ID)

You probably want to look at hierarchy entries, which have a name_id right on them for their preferred scientific name.

  • names which are considered synonyms for that EOL ID

hierarchy_entries have synonyms:

SELECT synonyms.* FROM synonyms WHERE synonyms.hierarchy_entry_id = ?

  • vernacular names with language

common names are more complicated; you have to go through the taxon_concept. Something like

SELECT * FROM taxon_concept_names INNER JOIN names ON names.id = taxon_concept_names.name_id INNER JOIN languages ON languages.id = taxon_concept_names.language_id WHERE taxon_concept_names.taxon_concept_id = 340732 AND (vern = 1 AND languages.iso_639_1 IS NOT NULL AND languages.iso_639_1 != '')

  • crossmapping between taxon concept IDs from EOL and IDs of 5 data partners:
    NCBI, IRMNG, Silva (actually not sure it is in EOL), WoRMS, CoL

Here's where you can select which hierarchy entries to grab. You'll need to know the resource_id of each of those (easiest to get from the website, as an admin—search from the CP index page, then find the CP and go to its resource pages to get the id)... you'll see there that it's not always clear, as NCBI has two resources... I don't know which you want, or if you want both... it's a mystery.

Each of those resources then has a hierarchy_id, so you'll need that. Once you have those, you can select the hierarchy_entries with those hierarchy_ids (you should also only select published HEs!), and on each of those is an "identifier" (that's the name of the field) which is the ID used on the resource's website (if applicable). If you want the full URL, you will have to build it off of the "outlink_uri" field on the hierarchy.
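
A sketch of that last step; the %%ID%% placeholder token inside outlink_uri is an assumption here, so verify it against the harvester code before trusting the output:

# Build full outlink URLs for one hierarchy (143 = FishBase, per the list above).
mysql eol_production -e "
  SELECT he.identifier,
         REPLACE(h.outlink_uri, '%%ID%%', he.identifier) AS outlink_url
  FROM hierarchy_entries he
  JOIN hierarchies h ON (he.hierarchy_id = h.id)
  WHERE he.published = 1 AND he.hierarchy_id = 143
  LIMIT 10;"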

@JRice commented Feb 14, 2016

I'm going to try JOINs again in the hope of identifying the bad indexes... here's a simple sample query, based on the above:

select id,identifier,hierarchy_id,taxon_concept_id from hierarchy_entries where published = 1 and visibility_id = 1 limit 10 offset 130000;

@JRice commented Feb 15, 2016

Adding a JOIN immediately pushes query times up to 6-plus seconds for a result set of 10. :( Doing a tiny bit of digging into the joins themselves hasn't helped, either—I tried switching to a subselect to avoid the "weak" filters of published and visible, but that only made matters worse...

# HORRIBLE PERFORMANCE (I didn't even let it finish):
SELECT * FROM (
  SELECT he.id, he.identifier, he.hierarchy_id, he.taxon_concept_id,
    he.published, he.visibility_id,  -- the subselect must expose these for the outer WHERE
    n.string
  FROM hierarchy_entries he
  LEFT JOIN names n ON (he.name_id = n.id) ) q
WHERE q.published = 1 AND q.visibility_id = 1
LIMIT 10 OFFSET 130000;

I did get a much better (usable, but not stellar: 1.45 sec with limit 10 offset 130000) result when I removed the LEFT part of the join (and went with straight joins). ...That might be adequate. (Aside: the code is FULL of left joins. ALL OVER. This might be a big factor in the overall speed of the site... though removing them would of course potentially result in missing data on the site, soooo... it would need thinking.)

SELECT he.id, he.identifier, he.hierarchy_id, he.taxon_concept_id, n.string
FROM hierarchy_entries he
JOIN names n ON (he.name_id = n.id)
WHERE he.published = 1 AND he.visibility_id = 1
LIMIT 10 OFFSET 130000;

I'll try making another join and see if it gets much worse...

SELECT he.id, he.identifier, he.hierarchy_id, he.taxon_concept_id,
  n.string, d.object_cache_url, r.label
FROM hierarchy_entries he
JOIN names n ON (he.name_id = n.id)
JOIN top_images i ON (he.id = i.hierarchy_entry_id AND view_order = 1)
JOIN data_objects d ON (i.data_object_id = d.id)
JOIN translated_ranks r ON (he.rank_id = r.rank_id AND r.language_id = 152)
WHERE he.published = 1 AND he.visibility_id = 1
LIMIT 10 OFFSET 130000;

This one is horrible. ...It wasn't so bad when it was top_images only (3.97 sec), but adding the data_objects join truly killed it (2 min 8.42 sec), and throwing in the rank didn't hurt much more (it was probably cached, so: 1 min 12.19 sec). Adding INTO OUTFILE 'identifiers.csv' FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\n' created a file in 6 min 20.73 sec, which is usable.

Update: second run (months later) took 3 min 21.32 sec. The file ends up in /eol/data/db_slave2/var/lib/mysql/eol_production/identifiers.csv.

@JRice commented Feb 15, 2016

For those of you who would like to see a file with a little more information in it, you can download the new version of the file here. It's a smaller file because there are fewer rows (if there's data missing, the row is missing—this is problematic when it comes to taxa with no image), so you might want both files, if you're looking for completeness... but this file has more information: the scientific name, the rank, and the "object cache URL", from which you can build a link to any (valid) size image, e.g.:

201201171519404 -> http://media.eol.org/content/2012/01/17/15/19404_260_190.jpg

(I'll let you figure out the pattern; it's pretty basic. Hint: the last two numbers are (max) width and height.)
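
A minimal shell sketch of that pattern, derived from the example above (the date parts come straight off the front of the cache ID; the trailing numbers are the max width and height):

c=201201171519404
echo "http://media.eol.org/content/${c:0:4}/${c:4:2}/${c:6:2}/${c:8:2}/${c:10}_260_190.jpg"
# -> http://media.eol.org/content/2012/01/17/15/19404_260_190.jpg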

@JRice JRice assigned JRice and unassigned JRice Feb 16, 2016
@jhpoelen

@JRice was trying to access the identifiers archive at http://beta.eol.org/uploads/data_search_files/big_identifiers.csv.gz and got a 404. Is there a different location I should use?

@KatjaSchulz

We just had another request from @diatomsRcool for a file with all EOL taxon concept ids and associated taxon names. Is this file still around somewhere or can it be recovered?

@hyanwong

I have a copy. Shall I send it to @diatomsRcool or is a new one soon in the pipeline?

@KatjaSchulz

That would be great, thanks!

@diatomsRcool

What are the column headers here? Looks like 4 columns...

@jhammock jhammock modified the milestone: 2016.06.14 Jul 12, 2016
@JRice commented Aug 24, 2016

What I did on the last run:

mysql_2   # shell alias: presumably opens a mysql session on db slave 2 (see the OUTFILE path below)

# Note this took 17 min 22.41 sec last time.
SELECT he.id, he.identifier, he.hierarchy_id, he.taxon_concept_id,
  n.string, d.object_cache_url, r.label
FROM hierarchy_entries he
LEFT JOIN names n ON (he.name_id = n.id)
LEFT JOIN top_images i ON (he.id = i.hierarchy_entry_id AND view_order = 1)
LEFT JOIN data_objects d ON (i.data_object_id = d.id)
LEFT JOIN translated_ranks r ON (he.rank_id = r.rank_id AND r.language_id = 152)
WHERE he.published = 1 AND he.visibility_id = 1
INTO OUTFILE 'identifiers_with_images.csv'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\n';
SELECT he.id, he.identifier, he.hierarchy_id, he.taxon_concept_id,
  n.string
FROM hierarchy_entries he
LEFT JOIN names n ON (he.name_id = n.id)
WHERE he.published = 1 AND he.visibility_id = 1
INTO OUTFILE 'identifiers.csv'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\n';

quit

sudo su -   # become root so we can move the dumps out of the mysql data dir

mv /eol/data/db_slave2/var/lib/mysql/eol_production/identifiers.csv ~jrice/.
chown jrice:jrice !$   # !$ expands to the last argument of the previous command
mv /eol/data/db_slave2/var/lib/mysql/eol_production/identifiers_with_images.csv ~jrice/.
chown jrice:jrice !$
exit

scp identifiers*.csv eol-app1:~/.
app1   # shell alias: presumably ssh to eol-app1

sudo mv ~jrice/identifiers*.csv /eol/data/app/public/.
sudo cp /eol/data/app/public/identifiers*.csv /eol/data/app/log/beta/.

beta   # shell alias: presumably ssh to the beta host

mv log/identifiers*.csv public/.
chmod 744 public/identifiers*.csv
exit

exit

rm identifiers*.csv

@jhammock

Most recent version: http://beta.eol.org/identifiers.csv

@iimog commented Nov 15, 2016

Hey, I use the identifiers.csv file a lot to cross-map identifiers. However, my version of the file is a little outdated, and the link to the most recent version posted by @jhammock returns a 404 now. Is it possible to get the latest version of the file somewhere?

@hyanwong

@jhammock mentioned that there was some new stuff in the pipeline, so it might just be on its way to a new home.

@jhammock

Sorry, not sure what happened to that file. The replacement files are still in prep.

@iimog commented Nov 16, 2016

All right. Thank you. I'll be patient.

@eliagbayani commented Feb 22, 2017

Hi @hyanwong, @iimog,
Please check the latest identifiers.csv here:
identifiers.csv on opendata.eol.org

If you have an old copy of this file, can you please tell me whether this latest one has more or fewer records once you've downloaded it?
Thanks.

@hyanwong

I can't seem to get that file using wget or cURL, annoyingly. Is there something weird about the URL?

@jhpoelen

@hyanwong this is consistent with globalbioticinteractions/globalbioticinteractions#274. I suspect the portal is designed for humans to "click to download" rather than for machines to curl/wget.

@jhammock

Hmm... I think we were expecting machine-readable services from CKAN as well. Eli, is there a solution in our CKAN arsenal?

@hyanwong

@jhpoelen I guess you didn't figure out a way to get either wget or cURL to do it, via some fancy flag setting?

@hyanwong commented Feb 22, 2017

Looks like slightly more records (lines):

yans-air:EOL yan$ wc identifiers.csv
 12564087 45169123 854534282 identifiers.csv
yans-air:EOL yan$ wc identifiers_prev.csv 
 12513903 45117019 851901860 identifiers_prev.csv

@jhpoelen commented Feb 22, 2017

@hyanwong I was actually able to download the archive using curl by copy-pasting the request the opendata site makes to start the download (see globalbioticinteractions/globalbioticinteractions#274 (comment)). However, both this download and the manual download timed out after 10 mins, leaving a partial archive. So, with my crappy consumer-grade internet connection, I was unable to access the archive; I had to use an external server to download the file, then transfer it to my development system from there. Hope this helps!
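
If the connection keeps dying mid-transfer, resuming with curl may get you the rest of the file. A sketch, assuming the server honors HTTP Range requests (the URL is a placeholder; substitute the real download link):

# -C - resumes from wherever the previous attempt stopped; -f makes HTTP
# errors fail the command so the loop retries.
until curl -fL -C - -o identifiers.csv 'https://opendata.eol.org/path/to/identifiers.csv'; do
  sleep 5
done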

@eliagbayani commented Feb 23, 2017

Thanks @hyanwong and @jhpoelen for your input. Very helpful.

Hi @hyanwong, slightly more records is about right, since we haven't had a real harvest for some time now.

Regarding the problem of serving files for download, I will consult our sysadmin about improving service for clients with lower bandwidth. One non-technical solution is to divide big files into smaller chunks for easier download, where wget, curl, or plain HTTP should all work.

I just increased the Apache Timeout, but that didn't seem to help; I keep getting a partial 100-140 MB download.

I'm able to wget smaller files, e.g., reptiles.

@hyanwong

Is it a bandwidth problem? I can download fine with a normal browser, just not with wget or curl.

@iimog commented Feb 24, 2017

I have the same problem as @hyanwong: clicking download works (takes ~90 s from my university), while both curl and wget return Error 503 Service Unavailable. I don't think it is a timeout problem.
But in general: nice to have this file in the opendata.eol.org portal.
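
If clicking works but curl/wget get a 503, one guess is that the portal is filtering on the User-Agent header. A workaround to test (placeholder URL again; substitute the real download link):

# Pretend to be a browser; if this succeeds where plain wget fails,
# the 503 is user-agent filtering rather than load or bandwidth.
wget --user-agent='Mozilla/5.0' -O identifiers.csv 'https://opendata.eol.org/path/to/identifiers.csv'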

@eliagbayani

Thanks for the input @iimog.
