Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API falsely reports 10000 hits for hits>10000 #343

Closed
jordanpadams opened this issue Jun 9, 2023 · 17 comments · Fixed by #367
Closed

API falsely reports 10000 hits for hits>10000 #343

jordanpadams opened this issue Jun 9, 2023 · 17 comments · Fixed by #367
Assignees
Labels
B14.0 bug Something isn't working s.medium Medium level severity

Comments

@jordanpadams
Copy link
Member

jordanpadams commented Jun 9, 2023

Checked for duplicates

Yes

🐛 Describe the bug

When I did https://pds.nasa.gov/api/search/1/products?limit=100&q=product_class%20eq%20%22Product_Observational%22&start=0 I noticed hits = 10000

🕵️ Expected behavior

I expected 1000000000 per https://github.com/NASA-PDS/registry-mgr/blob/main/src/main/resources/elastic/registry.json#LL5C36-L5C46

📜 To Reproduce

curl -X 'GET' \
  'https://pds.nasa.gov/api/search/1/products?limit=100&q=product_class%20eq%20%22Product_Observational%22&start=0' \ 
   json_pp | grep 'hits'

🖥 Environment Info

API v1.2

📚 Version of Software Used

No response

🩺 Test Data / Additional context

No response

🦄 Related requirements

No response

⚙️ Engineering Details

No response

@jordanpadams
Copy link
Member Author

@tloubrieu-jpl @jimmie @sjoshi-jpl @alexdunnjpl I will setup a working meeting to debug this. this bug has reoccurred a few times now so I want to make sure we are covering all the bases.

@jordanpadams jordanpadams changed the title Registry API still returning hit count = 10000 Reoccurrence of Registry API hit count = 10000 bug Jun 9, 2023
@alexdunnjpl
Copy link
Contributor

alexdunnjpl commented Jun 9, 2023

@jordanpadams N.B. max_result_window refers to the maximal page size for a single request, not the max quantity of results returned by a query.

I have a meeting now, but a cursory search shows that it may just be the hits count (not the actual results set) which is limited, in which case it may just be necessary to add track_total_hits to the responses sent out by the API. Relevant discussion

This is supported by testing with start=10000, which returns a full page of results (which it shouldn't, if the query is limited to 10000)

@alexdunnjpl alexdunnjpl changed the title Reoccurrence of Registry API hit count = 10000 bug API falsely reports 10000 hits for hits>1000 Jun 9, 2023
@alexdunnjpl
Copy link
Contributor

Confirmed that appending track_total_hits=true to OpenSearch query results in correct count. registry-api is dropping the relation field (which is "gte" in this case) returned from OpenSearch.

Quick fix is to append this. If performance becomes an issue (it shouldn't), we can address that later by returning the relation object rather than the fully-resolved count.

@jordanpadams could you please re-triage?

@jordanpadams jordanpadams changed the title API falsely reports 10000 hits for hits>1000 API falsely reports 10000 hits for hits>10000 Jun 9, 2023
@jordanpadams jordanpadams added s.medium Medium level severity sprint-backlog and removed needs:triage labels Jun 9, 2023
@jordanpadams
Copy link
Member Author

@alexdunnjpl triage complete.

@jordanpadams
Copy link
Member Author

@alexdunnjpl is there a code update associated with this ticket coming?

@alexdunnjpl
Copy link
Contributor

@jordanpadams yes, eventually (or sooner, if you want to bump the priority back up)

@jordanpadams
Copy link
Member Author

copy thanks @alexdunnjpl we will keep it in the backlog until you complete your current task and we can re-evaluate

@al-niessner
Copy link
Contributor

@alexdunnjpl @jordanpadams

Nothing to be done here unless we are going to be happy with 10+ instead of 11 for total hits. The relation 'gte' is nifty if we could have state as we could dump the current results and move to a scroll window which we would be using already and avoid the problem outright. The Java object for total hits is Integer so 10+ really is not going to work there either. I guess, just nothing to be done really.

@jordanpadams
Copy link
Member Author

@al-niessner per @alexdunnjpl, can we update the API to us track_total_hits=true. Also, why does https://pds.nasa.gov/api/search/1/products/urn:nasa:pds:orex.ovirs:data_calibrated::11.0/members?start=0&limit=500 return the correct value for hits but the query above does not?

@al-niessner
Copy link
Contributor

@jordanpadams @alexdunnjpl

Sure you can add another field in the summary to indicate that there are more hits that you can never fetch. The reason you get the 10000 is that it is some max_something_or_other opensearch server configuration value. Some of our PDS units are set to 2**32-1 or some other ridiculously large value. Others are the default 10000. Should be able to search the old tickets in this repo to find the exact parameter. Something about max and window. Anyway, it is all a server configuration problem.

@jordanpadams
Copy link
Member Author

@al-niessner that configuration was changed, this track_total_hits=true thing is different per some tests that @alexdunnjpl performed.

@alexdunnjpl
Copy link
Contributor

alexdunnjpl commented Jul 31, 2023

@al-niessner here's the qparam in question

https://opensearch.org/docs/1.2/opensearch/rest-api/search/#url-parameters (see: track_total_hits)

By default, opensearch will not return an exact value for hits, presumably as this takes some nontrivial amount of extra time than not returning the exact value, when paging hits.

More relevant discussion/context from earlier in this ticket's chatter

@al-niessner
Copy link
Contributor

@alexdunnjpl @jordanpadams

We need to discuss this one because my experience with opensearch is exactly the opposite. For instance, if I do this search https://pds.nasa.gov/api/search/1/products/urn:nasa:pds:orex.ovirs:data_calibrated::11.0/members?limit=2 it gives me a total hits of 1169346. It seems that it is accurately counting them out despite not specifying the URI parameter - unless you have reason to believe that there are more of them. I mean if you are doing efficiency game, why not give up on 2 and save yourself the other 1169344 hits?

Also, the 10000 is just too coincidental to being the default window size for me to not to be Missourian - show me state. Not disparaging you, but reverting back to my Hume readings that eyewitness testimony is the weakest form of evidence. Apparently Hume was from Missouri long before its founding.

Now, maybe when a cluster or something has a node go missing and opensearch says I found what I found but there could be more on the missing node(s) then that would be nice. However, is that not a an error rather than obscure flag int eh summaary?

Lastly, would it not be simpler to just the URI parameter to true always rather than pass something back to the user? The parameter you named is an input to opensearch and not an output.

Yeah, breakout for this ticket.

@jordanpadams
Copy link
Member Author

@al-niessner I believe @alexdunnjpl was thinking we just include track_total_hits in our queries to OpenSearch under the hood, by default, all the time. The users have no idea we are using that flag, it is just something we know we need to make these hits accurate.

@al-niessner
Copy link
Contributor

Double checking the URL in the first comment. Changed start to 10001 and left limit alone:

curl 'https://pds.nasa.gov/api/search/1/products?limit=1000&q=product_class%20eq%20%22Product_Observational%22&start=10001'

It returned a non-error which indicates that the window is big validating @alexdunnjpl statements:

{
  "summary": {
    "q": "product_class eq \"Product_Observational\"",
    "hits": 10000,
    "took": 21511,
    "start": 10001,
    "limit": 1000,
    "sort": [],

@jordanpadams
Sorry, I read adding it to search as meaning adding it to the summary we return to the user. Should be simple to add to all queries.

@alexdunnjpl
Copy link
Contributor

If further validation is required, just take the API out of the loop and curl the OpenSearch directly, with and without the track_total_hits flag - that's the final-confirmation basis for my claim.

@jordanpadams
Copy link
Member Author

@al-niessner

Sorry, I read adding it to search as meaning adding it to the summary we return to the user. Should be simple to add to all queries.

🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
B14.0 bug Something isn't working s.medium Medium level severity
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants