/classes/collections/<lidvid>/members (and deprecated equivalent) hangs #231
Comments
This may be related to the latest/all issue - need to retest once provenance script is in place. |
@jimmie @alexdunnjpl I think this may be the same issue that @nutjob4life identified here but let me know if that is not the case and we can create a new ticket. thanks to @nutjob4life for the thorough detailing of the issue he is seeing |
Yeah, great summary @nutjob4life - thank you. Confirming that after re-indexing and provenance...er...ing, this issue (i.e. the hanging) remains. Cracking open the CloudWatch logs, I'm seeing what looks to be an infinite loop, will need to examine further but it does seem to be a different issue than the issue Sean identified. |
@jimmie okay, filing my issue separately from this one, thanks! |
Good news: the request eventually completes. Regarding the last point, I was able to get results in an acceptable (IMHO) amount of time for this collection: |
@alexdunnjpl @jimmie could you provide an estimate on this ticket ? Thanks |
First steps: point local API at prod OS instance (@jimmie to provide config/creds shortly) and attempt to replicate and enumerate the set of requests associated with the call to find a pattern. |
Crude log filter yields ~230 outgoing requests with {"bool":{"must":[{"term":{"collection_lidvid":{"value":"urn:nasa:pds:orex.ovirs:data_calibrated::10.0","boost":1.0}}}],"adjust_pure_negative":true,"boost":1.0}}
Issue appears to be that queries to OS for products with membership in the given collection are paged at 10 hits per page, resulting in ~230 requests and a runtime of ~244 sec. Increasing the OS page size to 10000 reduces the runtime to ~13 sec and requires a single OS query request (down from 230). No data was or is returned, which seems incorrect: shouldn't it return the first 100 results given no pagination query params from the user? This data is present in productLidvids but doesn't seem to be written into the response further up the call stack. |
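To make the request-count arithmetic above concrete, here is a minimal sketch (not the registry-api implementation, which is Java). It builds a search body around the term query seen in the logs and counts the round trips needed for a given page size. The function names are hypothetical, and the figure of 2290 hits is taken from a later comment in this thread; `from` and `size` are the standard OpenSearch search-body paging fields.

```python
import math


def build_members_query(collection_lidvid: str, page_size: int, offset: int = 0) -> dict:
    """Hypothetical helper: search body for products that are members of a collection."""
    return {
        "from": offset,      # index of the first hit to return
        "size": page_size,   # hits per round trip; the slow case used 10
        "query": {
            "bool": {
                "must": [
                    {"term": {"collection_lidvid": {"value": collection_lidvid}}}
                ]
            }
        },
    }


def requests_needed(total_hits: int, page_size: int) -> int:
    """Number of round trips needed to read every hit at a given page size."""
    return max(1, math.ceil(total_hits / page_size))


body = build_members_query("urn:nasa:pds:orex.ovirs:data_calibrated::10.0", page_size=10000)
slow = requests_needed(2290, 10)      # page size used before the change
fast = requests_needed(2290, 10000)   # page size after the change
```

With 2290 hits, a page size of 10 gives 229 round trips, in line with the ~230 requests seen in the logs, and a page size of 10000 gives one. Note that 10000 is also the default upper bound on `from + size` in OpenSearch (`index.max_result_window`), so going beyond it would need a different paging mechanism.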
Status: continuing to dig into the performance issues in order to get to the root of the bug |
Key outstanding questions:
|
This request directly on the OpenSearch node responds in acceptable time: But the one going through CCS takes longer and reaches the timeout limit: |
Next steps:
|
Turns out the 2290 entities are actually pages of lidvids with size 500. 2290*500=1145000, corresponding with the 1144754 products having membership in the collection. The page size 500 is not defined anywhere in the API codebase and does not vary with @al-niessner I know we rejected "is each hit a list of concatenated lidvids that gets expanded" but that appears to be exactly what's happening:
{"product_lidvid" : ["urn:nasa:pds:orex.ovirs:data_calibrated:20190523t225311s377_ovr_scil2_calv2.fits::1.0" ...]} I'm guessing that 500 is some OpenSearch-internal paging configuration value? @tloubrieu-jpl If my understanding of all this is correct, I think we're good to
|
The 500 is something we pick. Now when it comes to pagination, the page is based on the product_lidvids, not the hit iterator returned. The registry-api should select its page out of 1.1 million, not the 2290. Does that make sense? Unfortunately, you still have to load all 2290 because we have to return the total number possible (2290*500); some of those are not a full 500 and you do not know which. I agree that setting it to 5000 seems just fine, or the user's page when doing the second lookup. Does it still time out or return nothing? |
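The paging rule described above can be sketched as follows. This is an illustration under stated assumptions, not the registry-api code: each hit is assumed to be a dict with a `product_lidvid` list (as in the sample hit quoted earlier), and `page_of_lidvids` is a hypothetical name. The page is selected out of the flattened lidvids, and every hit is still visited because the total is only known after summing the list lengths.

```python
from typing import Iterable, List, Tuple


def page_of_lidvids(hits: Iterable[dict], start: int, limit: int) -> Tuple[int, List[str]]:
    """Return (total lidvid count, requested page) across all hits."""
    total = 0
    page: List[str] = []
    for hit in hits:
        lidvids = hit.get("product_lidvid", [])
        # Offset of the page start within this hit (past the end if not reached yet).
        lo = max(start - total, 0)
        # Take only as many entries as the page still needs.
        hi = lo + (limit - len(page))
        page.extend(lidvids[lo:hi])
        # Keep counting after the page is full: the last hits may be partial.
        total += len(lidvids)
    return total, page


# Toy data: three hits of uneven size standing in for 2290 hits of up to 500.
hits = [
    {"product_lidvid": ["a::1.0", "b::1.0", "c::1.0"]},
    {"product_lidvid": ["d::1.0", "e::1.0"]},
    {"product_lidvid": ["f::1.0", "g::1.0", "h::1.0"]},
]
total, page = page_of_lidvids(hits, start=3, limit=4)
```

Here the page spans the second and third hits, and `total` counts all eight lidvids rather than the three hits.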
@al-niessner yeah that all scans, thanks! No timeout (nor previously - just took ages when Jimmie was testing). It returns nothing when no query params are provided (e.g. /classes/collections/urn:nasa:pds:orex.ovirs:data_calibrated/members) but I assume will perform as expected when begin/limit query params are provided (that's next on my list to confirm for safety's sake). The outstanding question is whether a sample first page of results should be returned in the no-args case - in the meeting yesterday @tloubrieu-jpl said yes, but I'm wondering if it's intentional (because 100% of non-dev use will need to provide those params to do anything meaningful). |
Defaults for pagination are always start=0 limit=100 (or something like that) so it should be returning the first 100 lidvids if you did not override limit. If you provide limit=0 or summary-only then it should not return any results but limit=0 is not or should not be the default. |
A while back, September when I was not here, @jimmie had to reinstate summary-only (I have strong dislike of it) because it appeared limit=0 was not working. I think I left it in place after that but may have butchered it up in the process. Ideally, it would be nice to kill summary-only and use limit=0 instead. |
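The intended defaults can be sketched as below. This is only an illustration of the behaviour described in the two comments above, not the registry-api code: the parameter names `start`, `limit` and `summary-only` follow the wording in this thread (another comment calls the first one `begin`), and the exact defaults and the `"true"` flag value are assumptions.

```python
from typing import Mapping, Tuple

DEFAULT_START = 0    # assumed default, per "start=0 limit=100 (or something like that)"
DEFAULT_LIMIT = 100


def resolve_pagination(params: Mapping[str, str]) -> Tuple[int, int]:
    """Hypothetical helper: turn raw query params into (start, limit)."""
    start = int(params.get("start", DEFAULT_START))
    limit = int(params.get("limit", DEFAULT_LIMIT))
    if start < 0 or limit < 0:
        raise ValueError("start and limit must be non-negative")
    # summary-only is treated as an alias for limit=0: totals only, no results.
    if params.get("summary-only", "false").lower() == "true":
        limit = 0
    return start, limit


no_args = resolve_pagination({})
summary = resolve_pagination({"summary-only": "true"})
```

Under these assumptions a request with no query params resolves to the first 100 lidvids, and an empty result is only produced when the caller explicitly asks for `limit=0` or `summary-only`.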
Gotcha, thanks Al. @tloubrieu-jpl I'll open a PR for this issue and open a new ticket for the no-data thing. |
@tloubrieu-jpl How can I prove this is fixed? Is the same data on a test server, so that I can prove the query doesn't time out? |
@gxtchen I believe this ticket has been fixed by an update in the OpenSearch configuration in production, so you should be able to validate the fix on the production server. If not, you can validate it on gamma, where I will deploy the latest version of the registry-api today. @alexdunnjpl can you confirm? |
@tloubrieu-jpl @gxtchen from memory and a quick skim, testing will require performing the query against this specific collection, with an updated build of registry-api containing #246. Probably the simplest way, given that I don't believe this bundle/collection is loaded in test, is to deploy on gamma and temporarily change the registry-api configuration (application.properties) to target the prod cluster and allow testing against the prod OS db (read-only). |
Thanks @alexdunnjpl , @gxtchen the latest version of the API is deployed on gamma and uses the Opensearch production server as a backend, so you can use that for tests. For example from URL: |
🐛 Describe the bug
The endpoints /classes/collections/<lidvid>/members and /collections/<lidvid>/products hang.
📜 To Reproduce
Steps to reproduce the behavior:
🕵️ Expected behavior
Expect a response containing the products contained by the collection to be returned, in a timely manner.
📚 Version of Software Used
registry-api 1.1.12