API crashes with JVM memory error on data sets with very large labels (>1MB) #296

Closed
jordanpadams opened this issue Mar 23, 2023 · 2 comments · Fixed by #303
Labels
B14.0 bug Something isn't working i&t.done s.medium Medium level severity sprint-backlog

Comments

@jordanpadams
Member

Checked for duplicates

Yes - I've already checked

🐛 Describe the bug

When I queried for products of the SHERLOC bundle, I noticed that after performing the query a few times, the API stopped working. After further investigation, it appears the API crashed; the initial suspicion is a JVM memory error.

🕵️ Expected behavior

I expected the query to return the matching products each time it was run.

📜 To Reproduce

  1. Performed the following query a few times:
curl --GET "https://pds.nasa.gov/api/search/1/products?q=lidvid%20like%20%22urn:nasa:pds:mars2020_sherloc*%22"
  2. Eventually started seeing 500 errors from all endpoints.
  3. From the server logs, the following error was noted:
2023-03-22 19:01:32.751 DEBUG 1 --- [-nio-80-exec-10] org.opensearch.client.RestClient         : request [POST https://search-en-prod-di7dor7quy7qwv3husi2wt5tde.us-west-2.es.amazonaws.com:443/registry,psa-prod:registry,naif-prod-ccs:registry,rms-prod:registry,sbnumd-prod-ccs:registry,geo-prod-ccs:registry,atm-prod-ccs:registry,sbnpsi-prod-ccs:registry,ppi-prod-ccs:registry,img-prod-ccs:registry/_search?typed_keys=true&max_concurrent_shard_requests=5&ignore_unavailable=false&expand_wildcards=open&allow_no_indices=true&ignore_throttled=true&search_type=query_then_fetch&batched_reduce_size=512&ccs_minimize_roundtrips=true] failed
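
For anyone trying to reproduce this, the repeated query from step 1 could be scripted roughly as follows. This is only a sketch: the function names `fetch_status` and `repeat_query`, the 60-second timeout, and the default of 20 attempts are illustrative assumptions, and the number of runs needed before the API fails was not measured.

```shell
#!/bin/sh
# Sketch: repeat the query from step 1 and stop at the first 5xx response.
# Function names, timeout, and attempt count are assumptions for illustration.

QUERY_URL='https://pds.nasa.gov/api/search/1/products?q=lidvid%20like%20%22urn:nasa:pds:mars2020_sherloc*%22'

# Print only the HTTP status code for a URL (curl prints 000 if the request itself fails).
fetch_status() {
  curl --silent --output /dev/null --max-time 60 --write-out '%{http_code}' "$1"
}

# Call fetch_status up to $2 times (default 20); return 1 as soon as a 5xx status is seen.
repeat_query() {
  rq_url=$1
  rq_attempts=${2:-20}
  rq_i=1
  while [ "$rq_i" -le "$rq_attempts" ]; do
    rq_status=$(fetch_status "$rq_url")
    echo "attempt $rq_i: HTTP $rq_status"
    case $rq_status in
      5??) return 1 ;;
    esac
    rq_i=$((rq_i + 1))
  done
  return 0
}

# Usage: repeat_query "$QUERY_URL" 20 || echo "API started returning 5xx"
```

Logging the status code on every attempt makes it easier to see how many queries it takes before the 500 errors begin.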

🖥 Environment Info

No response

📚 Version of Software Used

No response

🩺 Test Data / Additional context

🦄 Related requirements

No response

⚙️ Engineering Details

No response

@jimmie
Member

jimmie commented Mar 23, 2023

In the ecs.tf Terraform script, I bumped CPU to 1024 units (1 full vCPU in AWS terms) and memory to 8096 (roughly 8 GB). Applied to EN only.

A concern is that while the service was returning 500s in response to API requests, the ECS health check continued to succeed, because it only verifies a 200 redirect from a request for the Swagger docs. We should consider a more meaningful request; I would think /classes would be a good one.
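
A stricter check along those lines might look like the sketch below. It is an assumption-laden illustration, not the deployed check: the default `HEALTH_URL` (port 80 is only inferred from the thread name in the log above, and the real path prefix may differ), the 10-second timeout, and the function names `probe_status` and `is_healthy` are all hypothetical.

```shell
#!/bin/sh
# Sketch: treat the service as healthy only if a real API endpoint answers 200.
# HEALTH_URL, the timeout, and the function names are assumptions.

HEALTH_URL=${HEALTH_URL:-http://localhost:80/classes}

# Print only the HTTP status code (curl prints 000 on connection failure or timeout).
probe_status() {
  curl --silent --output /dev/null --max-time 10 --write-out '%{http_code}' "$1"
}

# Succeed only on exactly 200; redirects, 5xx responses, and 000 all count as unhealthy.
is_healthy() {
  [ "$(probe_status "$1")" = "200" ]
}

# Usage as a container health check command:
#   is_healthy "$HEALTH_URL" || exit 1
```

Because curl is not told to follow redirects here, a redirect to the docs page would no longer pass, and a 500 from the endpoint would mark the task unhealthy.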

@jimmie
Member

jimmie commented Mar 23, 2023

I would suggest we deploy the updated health check at the same time we deploy the updated Docker image with explicit JVM memory controls from #300.
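
To illustrate what explicit JVM memory controls can mean in a container entrypoint, here is a minimal sketch. It does not reflect the actual contents of #300: the helper name `jvm_memory_opts`, the 75% default, and `app.jar` are hypothetical. `-Xmx` and `-XX:+ExitOnOutOfMemoryError` are standard HotSpot options.

```shell
#!/bin/sh
# Sketch: derive an explicit heap cap from the container memory limit (in MB).
# The helper name and the 75% default are assumptions, not taken from #300.

jvm_memory_opts() {
  jm_container_mb=$1
  jm_percent=${2:-75}
  # Integer arithmetic: heap = container memory * percent / 100.
  jm_heap_mb=$((jm_container_mb * jm_percent / 100))
  echo "-Xmx${jm_heap_mb}m -XX:+ExitOnOutOfMemoryError"
}

# Usage (app.jar is a placeholder):
#   JAVA_TOOL_OPTIONS=$(jvm_memory_opts 8096) java -jar app.jar
```

Leaving headroom below the container limit gives the JVM room for non-heap memory, and exiting on an out-of-memory error means the process stops instead of lingering in a broken state that a shallow health check might miss.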

Status: 🏁 Done