Create a file that maps all hierarchy entries to EOL taxon concept (page) ids #162
It's pretty big, even gzipped, at 120 MB or so.
When we're sure we have all the requirements, we should schedule updates and post this somewhere :)
I would use a query like this:
It took about 25 seconds to run. ...The problem is, of course, that it will end up on the server where you run the command... for me, it was in
@JRice Just had a peek at the file. It looks like a lot of nodes don't have an identifier, i.e., column 2 is empty, e.g., for hierarchy 114. How can that be? Isn't the taxon identifier a required field?
An excellent question! Off the top of my head, I don't know. 114 is "EOL Group on Flickr", so perhaps there have been some exceptions to that rule for really weird content partners...
Ah, that makes sense. However, it looks like there are a lot of hierarchies that have all or some identifiers missing: 114,115,118,119,120,113,121,124,125,130,132,133,138,140,141,143,144,393,394,396,397,398,127,400,401,402,403,405,410,430,431,433,436,439,440,444,448,449,453,428,454,155,459,463,464,465,467,476,507,508,518,522,537,542,548,553,556,558,563,562,564,565,566,567,570,571,582,587,588,549,589,590,595,601,617,622,625,628,635,636,639,643,648,653,661,690,695,696,697,707,711,716,717,720,721,726,727,731,734,735,714,740,753,777,778,787,798,799,804,808,814,821,826,833,837,838,840,841,842,843,844,852,854,857,861,864,868,869,875,879,880,883,890,893,894,896,901,902,912,913,925,929,930,931,932,933,934,935,936,937,938,939,940,941,942,943,944,945,946,131,956,957,962,970,974,977,975,982,984,986,987,988,994,998,1002,1003,1004,1005,1006,1007,1008,1009,1010,1011,1012,1017,1018,1019,1021,1022,1023,1024,1025,1026,1027,1028,1029,1030,1031,1032,1033,1034,1035,1066,1077,1076,884,1082,1089,1094,1099,1093,1100,1112,1115,604,1145,1154,1157,1158,1161,1164,1165,1166,1167,1208,1255,1256,1090,1261,1264,1269,1270,1272,1273,1277,1279,1282,1168,1285,1288,1289,1291,1169,1296,1297,1298,1301,1302,1303,1304,1305,1308,1309,1310,1311,1312,1313,1314,1315,1316,1317,1320,1322,1323,1325,1328,1332,1335,1344,1346,1354,1362,1367,1371,1377,1378,1381,1383,1388,1393,1394,1395,1400,1401,967,1402,1404,1406,1321,1412,1414,1423,1424,1425,1426,1427,1428,1431,1432,1433,1413,1460,1461,1462,1463,1466,1469,1472,1473,1474,1484,1485,1486,1487,1491,1429,1498,860,1499,1501 Unfortunately, I cannot look up the providers of these hierarchies, because the Hierarchies API seems to be defunct.
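For anyone digging into this later, a query along these lines could list, per hierarchy, how many published entries are missing an identifier. This is a hypothetical sketch, untested against the live database; the column names (identifier, published, visibility_id) are taken from the queries quoted later in this thread:

```sql
-- Count published, visible hierarchy_entries whose identifier is empty,
-- grouped by hierarchy. Sketch only; not run against production.
SELECT he.hierarchy_id,
       COUNT(*) AS entries,
       SUM(he.identifier IS NULL OR he.identifier = '') AS missing_identifiers
FROM hierarchy_entries he
WHERE he.published = 1 AND he.visibility_id = 1
GROUP BY he.hierarchy_id
HAVING missing_identifiers > 0;
```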
That is a lot:
...Does that shed any light?
Hm, looks like a bunch of connectors, except Antweb, which is an old-style XML file. Antweb does have taxon identifiers, e.g.:
Also, PaleoDB provides identifiers (taxon_no) in the API we use to create the EOL import file, and I would be surprised if Eli didn't use them, but it may be worth checking with him. Example API response:
Yup, this is worth turning into a bug—these might have all been "recent" harvests (since April), and there might be a bug. I'll do that -> #164 (we should reserve this ticket for talking about the file itself—what goes in it, where to put it, etc).
Hmmn... I could have sworn I'd put this in an earlier comment, but I was thinking that, since my quick experiment with JOINing a query to get the top images here didn't work, it might be worth a separate output file for those... or two, actually. One would have to have "page_id" and "data_object_id" in it, the other would have to have "data_object_id" and a URL (or several, if it's convenient to include thumbnails) to the image. Not much more work, but a bit more confusing, to put a "view_order" in the second file and have multiple data_object_ids per page, so you could get a few top images. ...Not sure if that helps.
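A rough sketch of the two proposed files, for concreteness. This is hypothetical and untested; it reuses the top_images and data_objects columns from the queries later in this thread, and the view_order cutoff of 3 is just an example value:

```sql
-- File 1: page_id -> data_object_id, keeping view_order so consumers can
-- pick the top few images per page. Sketch only.
SELECT he.taxon_concept_id AS page_id, i.data_object_id, i.view_order
FROM hierarchy_entries he
JOIN top_images i ON (he.id = i.hierarchy_entry_id AND i.view_order <= 3)
WHERE he.published = 1 AND he.visibility_id = 1
INTO OUTFILE 'page_images.csv'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\n';

-- File 2: data_object_id -> image location.
SELECT d.id AS data_object_id, d.object_cache_url
FROM data_objects d
INTO OUTFILE 'image_urls.csv'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\n';
```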
Other information we could include in this file is rank ids and name ids... though you would need separate files (one each) to figure out what those meant. :S
@KatjaSchulz thanks for opening this. In my mind, this is a duplicate of, or at least related to, #10. That issue also contains some concrete examples of an approach that is currently operational. @JRice I am not too concerned about file sizes. I've stored archives on s3.amazonaws.com to make the archives web-accessible to anyone as part of an automated process. I'd be happy to show you if you are interested.
@JRice "if it was that easy, I wonder why we didn't get this a few years ago..." But seriously, thanks. This was also requested by @iimog, I think. Also moving my gitter comment here: "A DB dump with just hierarchy ids is v. useful. 'Best image' ids are not so useful, since 'best image' changes regularly, and can probably be better queried only as needed. Also that bit (getting the best image from an EOL page ID) can be done for multiple images in a single API call now, thanks to Pages API batch mode (#40)."
P.s. @deepreef is also interested in this. |
@JRice I think, although I may be wrong, that EoL used to have gbif identifiers in here, and lots of them, under provider_id=800. I can't find anything like that in the file.
Some relevant information from JR from a few months ago. In response to "What I want to do first -- to gather all the data directly from the database": Okay. I'll go through what is where, then.
You probably want to look at hierarchy entries, which have a name_id right on them for their preferred scientific name.
hierarchy_entries have synonyms: SELECT
common names are more complicated; you have to go through the taxon_concept. Something like SELECT * FROM
Here's where you can select which hierarchy entries to grab. You'll need to know the resource_id of each of those (easiest to get from the website, as an admin—search from the CP index page, then find the CP and go to its resource pages to get the id... you'll see here that it's not always clear, as NCBI has two resources... I don't know which you want, or if you want both... it's a mystery.) Each of those resources then has a hierarchy_id, so you'll need that. Once you have those, you can select the hierarchy_entries with those hierarchy_ids (you should also only select published HEs!), and on each of those is an "identifier" (that's the name of the field) which is the ID used on the resource's website (if applicable). If you want the full URL, you will have to build it off of the "outlink_uri" field on the hierarchy.
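Putting those steps into a sketch (hypothetical; 903 is a made-up placeholder for whatever hierarchy_id your resource turns out to have):

```sql
-- Select published hierarchy_entries for one hierarchy, with the
-- provider-side identifier. 903 is a placeholder hierarchy_id, not a
-- real value from this thread.
SELECT he.id, he.identifier, he.taxon_concept_id
FROM hierarchy_entries he
WHERE he.hierarchy_id = 903
  AND he.published = 1;
-- For full outlink URLs, combine he.identifier with the hierarchy's
-- "outlink_uri" field, as described above.
```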
I'm going to try JOINs again in the hopes of identifying the bad indexes... here's a simple sample query based on the above:
Adding a JOIN immediately pushes query times up to 6-plus seconds for a result set of 10. :( Doing a tiny bit of digging into the joins themselves hasn't helped, either—I tried switching to a subselect to avoid the "weak" filters of published and visible, but that only made matters worse...
I did get a much better (usable, but not stellar: 1.45 sec with limit 10 offset 130000) result when I remove the LEFT part of the join (and go with straight joins). ...that might be adequate. (Aside: the code is FULL of left joins. ALL OVER. This might be a big factor in the overall speed of the site... though removing them would of course potentially result in missing data on the site, soooo... would need thinking.)
I'll try making another join and see if it gets much worse...
This one is horrible. ...It wasn't so bad when it was top_images only (3.97 sec), but adding dato truly killed it (2 min 8.42 sec), and throwing in the rank didn't hurt much more (it was probably cached, so: 1 min 12.19 sec). Update: a second run (months later) took 3 min 21.32 sec. The file ends up in
For those of you who would like to see a file with a little more information in it, you can download the new version of the file here. It's a smaller file because there are fewer rows (if there's data missing, the row is missing—this is problematic when it comes to taxa with no image), so you might want both files, if you're looking for completeness... but this file has more information: the scientific name, the rank, and the "object cache URL", from which you can build a link to any (valid) size image, e.g.: 201201171519404 -> http://media.eol.org/content/2012/01/17/15/19404_260_190.jpg (I'll let you figure out the pattern; it's pretty basic. Hint: the last two numbers are (max) width and height.)
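For what it's worth, the pattern can be spelled out in SQL. This slicing is inferred from the single example above, so treat it as an assumption rather than a spec: the first ten digits of object_cache_url look like year/month/day/hour path segments, the remainder is the file id, and the trailing numbers are max width and height.

```sql
-- Build the 260x190 image URL from object_cache_url, per the example
-- 201201171519404 -> http://media.eol.org/content/2012/01/17/15/19404_260_190.jpg
-- (digit slicing inferred from that one example; an assumption, not a spec).
SELECT CONCAT('http://media.eol.org/content/',
              SUBSTRING(d.object_cache_url, 1, 4), '/',  -- year
              SUBSTRING(d.object_cache_url, 5, 2), '/',  -- month
              SUBSTRING(d.object_cache_url, 7, 2), '/',  -- day
              SUBSTRING(d.object_cache_url, 9, 2), '/',  -- hour
              SUBSTRING(d.object_cache_url, 11),         -- file id
              '_260_190.jpg') AS image_url
FROM data_objects d
WHERE d.object_cache_url IS NOT NULL;
```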
@JRice was trying to access the identifiers archive at http://beta.eol.org/uploads/data_search_files/big_identifiers.csv.gz and got a 404. Is there a different location I should use? |
We just had another request from @diatomsRcool for a file with all EOL taxon concept ids and associated taxon names. Is this file still around somewhere or can it be recovered? |
I have a copy. Shall I send it to @diatomsRcool or is a new one soon in the pipeline?
That would be great, thanks!
What are the column headers here? Looks like 4 columns...
What I did on the last run: mysql_2
# Note this took 17 min 22.41 sec last time.
SELECT he.id, he.identifier, he.hierarchy_id, he.taxon_concept_id,
n.string, d.object_cache_url, r.label
FROM hierarchy_entries he
LEFT JOIN names n ON (he.name_id = n.id)
LEFT JOIN top_images i ON (he.id = i.hierarchy_entry_id AND view_order = 1)
LEFT JOIN data_objects d ON (i.data_object_id = d.id)
LEFT JOIN translated_ranks r ON (he.rank_id = r.rank_id AND r.language_id = 152)
WHERE he.published = 1 AND he.visibility_id = 1
INTO OUTFILE 'identifiers_with_images.csv'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\n';
SELECT he.id, he.identifier, he.hierarchy_id, he.taxon_concept_id,
n.string
FROM hierarchy_entries he
LEFT JOIN names n ON (he.name_id = n.id)
WHERE he.published = 1 AND he.visibility_id = 1
INTO OUTFILE 'identifiers.csv'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\n';
quit
sudo su -
mv /eol/data/db_slave2/var/lib/mysql/eol_production/identifiers.csv ~jrice/.
chown jrice:jrice !$
mv /eol/data/db_slave2/var/lib/mysql/eol_production/identifiers_with_images.csv ~jrice/.
chown jrice:jrice !$
exit
scp identifiers*.csv eol-app1:~/.
app1
sudo mv ~jrice/identifiers*.csv /eol/data/app/public/.
sudo cp /eol/data/app/public/identifiers*.csv /eol/data/app/log/beta/.
beta
mv log/identifiers*.csv public/.
chmod 744 public/identifiers*.csv
exit
exit
rm identifiers*.csv
Most recent version: http://beta.eol.org/identifiers.csv |
Hey, I use the identifiers.csv file a lot to cross-map identifiers. However, my version of the file is a little outdated. The link to the most recent version by @jhammock returns a 404 now. Is it possible to get the latest version of the file somewhere?
@jhammock mentioned that there was some new stuff in the pipeline, so it might just be on its way to a new home. |
Sorry, not sure what happened to that file. The replacement files are still in prep. |
All right. Thank you. I'll be patient. |
Hi @hyanwong, @iimog, if you have an old copy of this file, can you please tell me whether this latest one has more or fewer records?
I can't seem to get that file using wget or cURL, annoyingly. Is there something weird about the URL? |
@hyanwong this is consistent with globalbioticinteractions/globalbioticinteractions#274. I suspect that the portal is designed for humans to do "click to download" rather than having machines do curl/wget.
Hmm... I think we were expecting machine readable services also from CKAN. Eli, is there a solution in our CKAN arsenal? |
@jhpoelen I guess you didn't figure out a way to get either wget or cURL to do it, via some fancy flag setting? |
Looks like slightly more records (lines).
@hyanwong I was actually able to download the archive using curl by copy-pasting the request made by the opendata site to start the download (see globalbioticinteractions/globalbioticinteractions#274 (comment)). However, it seemed that this download (and the manual download) timed out after 10 mins, leaving a partial archive. So, with my crappy consumer-grade internet connection, I was unable to access the archive. I had to use an external server to download the file, then transfer the file to my development system from that server. Hope this helps!
Thanks @hyanwong and @jhpoelen for your input. Very helpful. Hi @hyanwong, slightly more records is about right, since we haven't had a real harvest for some time now. Regarding the problem serving files for download, I will consult our sysadmin about improving the service for clients with lower bandwidth. One non-technical solution is to divide big files into smaller chunks for easier download, where wget, curl, or plain HTTP should all work. I just increased the Apache Timeout, but that didn't seem to work either: I keep getting a partial 100-140 MB file downloaded. I'm able to wget smaller files, e.g. reptiles.
Is it a bandwidth problem? I can download fine with a normal browser, just not with wget/curl.
I have the same problem as @hyanwong: click-to-download works (takes ~90s from my university), but both curl and wget return an error.
Thanks for the input @iimog. |
This was requested by @dimus, @jhpoelen, @hyanwong. Ideally, this would be a Darwin Core Archive, but anything that gets the data out would do for now. @JRice suggested we could start with a straight dump of certain columns from the hierarchies table, i.e., identifier used by provider, hierarchy_entry_id, hierarchy_id, taxon_concept_id, what else?
Interested parties, please elaborate if necessary.