
BigQueryTemplate.writeJsonStream API throwing java.lang.OutOfMemoryError: unable to create new native thread #1599

Closed
pradeepnr opened this issue Feb 20, 2023 · 8 comments · Fixed by #1855 or #1972
Labels: bigquery, priority: p3, type: question

Comments

@pradeepnr

pradeepnr commented Feb 20, 2023

Describe the bug

Problem: We are processing a huge number of requests, in the range of hundreds of millions, and pushing the records to BigQuery using the BigQueryTemplate::writeJsonStream API. After running for more than 2 hours and calling the API about 1000 times, it throws the exception below:

java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:719)
at com.google.cloud.spring.bigquery.core.BigQueryTemplate.writeJsonStream(BigQueryTemplate.java:310)
...
...

Library
spring-cloud-gcp-starter-bigquery
Bom

<dependency>
    <groupId>com.google.cloud</groupId>
    <artifactId>spring-cloud-gcp-dependencies</artifactId>
    <version>3.4.0</version>
    <type>pom</type>
    <scope>import</scope>
</dependency>

Details about the run
We receive messages from another system, process them, and write them to BigQuery in batches of 5000.
After 700-800 API calls we receive the out-of-memory error.

@mpeddada1
Contributor

Hi @pradeepnr, thank you for filing this issue! A few questions to help us understand this better:

  1. Does this issue also occur with the latest versions of spring-cloud-gcp, including 3.4.4 and 4.1.0?
  2. Is this a recent occurrence? More specifically, is there a version of spring-cloud-gcp where the OutOfMemoryError doesn't occur?
  3. Would it be possible to provide a reproducer that can help us investigate this further?

@pradeepnr
Author

Hi @mpeddada1,
Please find my response below

  1. I tried only version 3.4.0.
  2. The issue was found during the development stage.
  3. I don't think I will be able to do that. You can try running a for loop 1000 times, calling the API to insert a dummy record into BQ on each iteration.
  4. For the time being I have resorted to using Bigquery::insertAll(). I also found the performance of this API to be better than writeJsonStream.

@mpeddada1
Contributor

Thank you for the background!

  1. Could you try this out with the latest version of spring-cloud-gcp? It would be helpful to see if there is a change in behavior.
  2. Additionally, if you are unable to provide a reproducer at the moment, we have a BigQuery sample. Are you able to modify it to demonstrate the breakage?

@meltsufin
Member

cc: @prash-mi

@prash-mi
Contributor

prash-mi commented Mar 1, 2023

@pradeepnr writeJsonStream(tableName, jsonInputStream) takes a stream of records and uses a background thread to write them to storage. Calling this API concurrently, say, 100 times means writing 100 concurrent streams, which implies 100 concurrent threads, and that can produce exactly the error you mentioned (java.lang.OutOfMemoryError: unable to create new native thread).

What I would recommend is:

  1. Try to process much bigger batches; by a batch I mean a stream of records passed as an InputStream. The API was designed for that.
  2. Throttle the middle layer so that there are not too many instances of writeJsonStream(tableName, jsonInputStream) running concurrently (each call comes with its own memory footprint, which can cause this kind of issue).

This API may not be the best choice for executing many small, concurrent batches.
Please let us know if the above suggestions work, thanks.
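As a sketch of recommendation 1 (the class and method names here are made up for illustration, not part of spring-cloud-gcp): join the records of a batch into one newline-delimited JSON payload and hand the resulting stream to a single writeJsonStream call, instead of one call per small batch.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class JsonBatcher {
    // Join many JSON records into one newline-delimited payload so that a
    // single writeJsonStream call covers the whole batch, instead of one
    // call (and one background thread) per small batch.
    public static String toNdjson(List<String> jsonRecords) {
        return String.join("\n", jsonRecords);
    }

    public static InputStream toNdjsonStream(List<String> jsonRecords) {
        return new ByteArrayInputStream(
            toNdjson(jsonRecords).getBytes(StandardCharsets.UTF_8));
    }
}
```

The returned InputStream could then be passed as the jsonInputStream argument of a single writeJsonStream call.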

@meltsufin
Member

@prash-mi Have you considered the option of fixing the library such that it uses a thread pool? It's a best practice that would prevent this kind of issue and we do it everywhere else in the library.
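Pending a library-side fix, the same idea can be sketched on the caller's side: a fixed-size pool caps the number of threads no matter how many writes are submitted. The class below and its sizes are hypothetical, and the Runnable stands in for whatever performs the write.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class BoundedWriter {
    private final ExecutorService pool;

    // A fixed pool caps thread creation: excess write tasks wait in the
    // queue instead of each spawning a new native thread.
    public BoundedWriter(int maxThreads) {
        this.pool = Executors.newFixedThreadPool(maxThreads);
    }

    public void submitWrite(Runnable writeTask) {
        pool.execute(writeTask); // queued if all threads are busy
    }

    public void shutdownAndWait() {
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```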

@pradeepnr
Author

@prash-mi, Thanks for the update.
I'm using Bigquery::insertAll() API and it seems to be working fine without any issues.
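For reference, a minimal sketch of that insertAll path using the google-cloud-bigquery client; the dataset, table, and row contents are placeholders, and running it requires application default credentials.

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.InsertAllResponse;
import com.google.cloud.bigquery.TableId;
import java.util.Map;

public class InsertAllExample {
    public static void main(String[] args) {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        TableId tableId = TableId.of("my_dataset", "my_table"); // placeholders

        // insertAll sends rows over the streaming-insert API on the calling
        // thread; no background thread per request is created.
        InsertAllRequest request = InsertAllRequest.newBuilder(tableId)
            .addRow(Map.of("name", "example", "count", 1)) // one row per record
            .build();
        InsertAllResponse response = bigquery.insertAll(request);
        if (response.hasErrors()) {
            response.getInsertErrors().forEach((row, errors) ->
                System.err.println("Row " + row + ": " + errors));
        }
    }
}
```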

@prash-mi
Contributor

Right @meltsufin, we actually had plans to implement a thread pool for this API but haven't been able to prioritize it yet. I'll update this thread when that is implemented.

@pradeepnr In the meantime you may want to use insertAll or the recommendations from my previous comment as a workaround. I will keep this thread posted on the implementation of the thread pool, which will avoid such issues.

@prash-mi prash-mi self-assigned this May 9, 2023
meltsufin pushed a commit that referenced this issue Jun 20, 2023
…#1855)

Users can use bigQueryThreadPoolTaskScheduler to avoid the java.lang.OutOfMemoryError that arose from too many concurrent write threads.

Fixes: #1599
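Based on the commit message, a hypothetical configuration sketch that supplies the bigQueryThreadPoolTaskScheduler bean with a bounded pool; the bean name comes from the commit message, while the pool size and thread-name prefix are illustrative.

```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.concurrent.ThreadPoolTaskScheduler;

@Configuration
public class BigQuerySchedulerConfig {
    // Hypothetical override: a bounded scheduler for the background write
    // threads, instead of a new native thread per writeJsonStream call.
    @Bean
    public ThreadPoolTaskScheduler bigQueryThreadPoolTaskScheduler() {
        ThreadPoolTaskScheduler scheduler = new ThreadPoolTaskScheduler();
        scheduler.setPoolSize(4); // cap concurrent write threads (illustrative)
        scheduler.setThreadNamePrefix("bq-write-");
        return scheduler;
    }
}
```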