[5] Allow batch mode in the Pages API #40

Closed
hyanwong opened this issue Nov 4, 2015 · 19 comments

@hyanwong

hyanwong commented Nov 4, 2015

For querying a large number of pages or data objects, it is slow and costly to make multiple API calls. It would be good if the pages and data_objects APIs could take an array of page IDs or data_object IDs and return a list of results.

To help with this, it would be sensible to be able to minimise the amount of data returned from a pages or data_object API query. At the moment, both types of query always return an (often large) "taxonConcepts" array. It would be sensible to have a parameter to set to true/false which can be used to turn this off, to save bandwidth / EoL effort.

@iimog

iimog commented Nov 10, 2015

This feature would help me a lot as I need a mapping of a huge list of NCBI taxids to EOL page ids.
Using the API with one ID at a time is not feasible for over 700,000 IDs.

@hyanwong
Author

Yes, sorry, I forgot to add that the API for 'Search By Provider' should be batch-ified too, if possible. That would help both my use case and that of https://github.com/iimog

@JRice
Member

JRice commented Nov 10, 2015

Let's make separate tickets for each, so we can manage the tasks. I'm going to modify this one to be the pages API only, as an initial test case.

@JRice JRice changed the title Allow batch mode in the API Allow batch mode in the Pages API Nov 10, 2015
@JRice JRice changed the title Allow batch mode in the Pages API [5] Allow batch mode in the Pages API Nov 10, 2015
@hyanwong
Author

Thanks. Easiest batch-ification might be for 'Search by provider', as it only returns a very simple integer value (well, actually 2 values, but one is redundant, and can probably be scrapped)

@AmrMMorad AmrMMorad self-assigned this Dec 1, 2015
@AmrMMorad

For the pages API, I think we need to discuss two points:
1- If one of the page IDs is wrong, do you want an error (i.e. an error message and no results returned), or should we return the correct ones and "ignore" the wrong one?
2- Which values can be omitted from the returned values (i.e. the less important ones) to reduce the overhead?
Thanks!

@hyanwong
Author

hyanwong commented Dec 2, 2015

For point 1) I imagine we would return a JSON array, each element of which corresponds to the IDs passed in, in which case an 'incorrect' ID would need to return a blank/null value in the appropriate slot in the array. Alternatively we could return an associative array with {ID1=>{data1}, ID2=>{data2}}, in which case we could simply ignore any wrong IDs.
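
To make the two response shapes concrete, here is a minimal Python sketch. It is only an illustration of the trade-off, not EoL's implementation: `FAKE_PAGES`, `as_positional_array` and `as_keyed_object` are hypothetical names, the record contents are made up, and `999` is just a deliberately invalid ID.

```python
import json

# Made-up records standing in for real page data (hypothetical content);
# only the two valid IDs are taken from examples elsewhere in this thread.
FAKE_PAGES = {
    1045608: {"identifier": 1045608},
    328023: {"identifier": 328023},
}


def as_positional_array(requested_ids, pages):
    """Option 1: one slot per requested ID, None (JSON null) for unknown IDs."""
    return [pages.get(page_id) for page_id in requested_ids]


def as_keyed_object(requested_ids, pages):
    """Option 2: map ID -> data, silently dropping unknown IDs."""
    return {
        str(page_id): pages[page_id]
        for page_id in requested_ids
        if page_id in pages
    }


requested = [1045608, 999, 328023]  # 999 is a deliberately invalid ID
print(json.dumps(as_positional_array(requested, FAKE_PAGES)))
print(json.dumps(as_keyed_object(requested, FAKE_PAGES)))
```

With the positional array the caller matches results to requests by index, so a wrong ID must leave a null placeholder. With the keyed object the caller looks results up by ID, so a missing key is enough to signal a wrong ID.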

@hyanwong
Author

hyanwong commented Dec 2, 2015

For point 2), that's what I meant in the original opening post when I said that it should be possible to turn off the "taxonConcepts" part of the result when calling the pages or data objects API. In general, I think we would want to use exactly the same format as the normal (non-batch) API, which saves having to create extra documentation, etc. So I think the first step is to add some extra params to the normal API which allow the user to slim down the returned response. It's not too bad at the moment, since we can set "details: False", "iucn: False", etc. But the following changes to the normal API would help:

  1. taxonomy: true,false (default true) - removes taxonConcepts from the returned result (NB: the default on batch calls could be false)
  2. subjects: all,... should also add the possibility of 'none', which perhaps should be the default for batch. (sorry, my mistake, see [5] Implement true/false 'taxonomy' API parameter to reduce size of returned data in the API #60)

While we are at it, could we also add an option to the vettedStatus parameter to return only unreviewed content (perhaps value = 3)? This is useful for checking content to review (also, come to think of it, something to return untrusted content might be useful for checking on EoL accuracy). Should I open new git issues for each of these 3 proposed changes to the normal API, or can they all be followed here?

For the searchbyprovider API it is much simpler, since only a single value needs to be returned.

@AmrMMorad

Thanks for your reply.
I think we need to create separate issues for these changes, since they will be prerequisites for enabling batch mode in the pages API. After we finish these, we can continue working on that.
Thanks!

@hyanwong
Author

hyanwong commented Dec 2, 2015

@JRice OK to open 2 more issues to improve the pages & data_objects API, as a precursor to batch mode?

Edit: now done: #60 and #61. Also see #62

@hyanwong
Author

hyanwong commented Dec 3, 2015

Also we need a way to pass a number of IDs into the API. At the moment, the ID is hard coded into the url, as in

http://eol.org/api/pages/1.0/1045608.json?images=2

I guess we don't want to encode this in the URL name. E.g. separating with a vertical bar (pipe character, %7C) in the file name looks bad to me:

http://eol.org/api/pages/1.0/1045608%7C328023%7C591753.json?images=2

perhaps we need something like

http://eol.org/api/pages/1.0/batch.json?images=2&pageIDs=1045608%7C328023%7C591753

I use the pipe character (%7C) to separate numbers, as that is what is done in the licenses field. But some other separator could be used, or we could even repeat the parameter:

http://eol.org/api/pages/1.0/batch.json?images=2&ID=1045608&ID=328023&ID=591753
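
As a rough illustration of how a client might build the two query-string variants above, here is a small Python sketch using the standard library. The `batch.json` endpoint and the `pageIDs` / `ID` parameter names are only the proposals from this comment, not a live API.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Proposed endpoint from this comment; an assumption, not a deployed URL.
BASE = "http://eol.org/api/pages/1.0/batch.json"
ids = [1045608, 328023, 591753]

# Variant 1: a single parameter with pipe-separated IDs.
# urlencode percent-escapes "|" as %7C automatically.
piped = BASE + "?" + urlencode({"images": 2, "pageIDs": "|".join(map(str, ids))})

# Variant 2: repeat the parameter once per ID.
repeated = BASE + "?" + urlencode([("images", 2)] + [("ID", i) for i in ids])

print(piped)
print(repeated)

# A server (or test) can recover the repeated values as a list of strings.
print(parse_qs(urlsplit(repeated).query)["ID"])
```

The repeated-parameter form needs no agreed separator and is parsed into a list by most query-string libraries, while the single-parameter form keeps the URL shorter for long ID lists.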

@JRice JRice modified the milestone: 2015.12.08 Dec 8, 2015
@JRice JRice closed this as completed Dec 22, 2015
@hyanwong
Author

What's the status of this now? Has it been coded up, but not gone live, for instance? I can't see where the new code/documentation might be, and I think I'm not quite understanding the process of how issues move through the EoL machine.

@AmrMMorad

Actually, it is waiting for the next deploy. It is already committed to the master code branch.
Thanks

@hyanwong
Author

Cool, thanks. What calling format did you choose eventually? Oh, and when is the next deploy likely to be? Oh, and finally (sorry) does this also apply to the Search_by_provider API?

@AmrMMorad

For the format, we now have two extra flags: one for batch mode and one for whether taxonomy is included in the result. When you choose batch mode, you can enter multiple page IDs separated by ",".
In fact, this only applies to the pages API.
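
For readers with very large ID lists, a client-side sketch of this calling format might look like the following. The flag names `batch` and `taxonomy`, the `id` parameter name, and the chunk size of 100 are all assumptions made for illustration; the thread does not state the actual names or any per-request limit.

```python
from urllib.parse import urlencode


def chunked(ids, size):
    """Yield consecutive slices of `ids`, each with at most `size` items."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def batch_query(ids, taxonomy=False):
    """Build a query string for one batch request.

    The parameter names below are hypothetical; only the comma separator
    and the existence of a batch flag and a taxonomy flag come from the thread.
    """
    return urlencode({
        "batch": "true",
        "id": ",".join(str(page_id) for page_id in ids),  # "," is escaped as %2C
        "taxonomy": "true" if taxonomy else "false",
    })


all_ids = list(range(1, 251))  # placeholder IDs, not real page IDs
queries = [batch_query(chunk) for chunk in chunked(all_ids, 100)]
print(len(queries))
```

Splitting the list into fixed-size chunks keeps each URL to a manageable length while still cutting the number of requests by roughly the chunk size.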

@hyanwong
Author

Thanks. Perfect. Should I open another issue for getting an identical thing coded for the data_objects and search_by_provider APIs?

@AmrMMorad

Yes, please
Thank you

@hyanwong
Author

Just done so (see above). What is the ETA for the next deploy, by the way?

@AmrMMorad

Thanks.
Sorry I don't know the ETA for it. I think @JRice can answer this..

@JRice
Member

JRice commented Jan 27, 2016

The newest code will be released tomorrow (2016-01-28) after the downtime (09:00 ET). So, if it was on the master branch as of 12:00 ET today, it will be included in the deploy.

