Importing a lot of versions of a schema from KafkaSQL causes crash loop #2627
Comments
Sorry for the delay on this. We'll need to reproduce this and see what we can do about a fix. I don't have any theories about why this might happen - I'll need to look at the code. Based on your description this is clearly a bug. It really shouldn't be a problem having 10k or more versions of a single schema. It's obviously rarely what you WANT to do, but it shouldn't crash the server. I'm marking this as a high priority bug.
Hi @petolexa , thanks a lot for raising this issue. I cannot yet reproduce this specific issue, but running Registry as a Java process on the machine shows that, at the end of importing 10000+ Kafka messages, there is a consistent heavy "spike" in memory usage. I will investigate deeper in the next days; these are the current takeaways:
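One way to watch for that spike while the import runs (an assumed approach, not necessarily what was used here) is to poll the JVM heap directly:

```bash
# Sketch: poll the registry JVM's heap while it replays the Kafka topic.
# Assumes a locally running Java process; the pgrep pattern is illustrative.
PID=$(pgrep -f apicurio-registry)
while kill -0 "$PID" 2>/dev/null; do
  jcmd "$PID" GC.heap_info   # prints current heap usage (JDK 9+)
  sleep 5
done
```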
Hi @andreaTP,
One of the things I tried in order to fix this was raising the memory limit to 4 GiB for each pod replica.
Also, what is noticeable:
Hi @petolexa , thanks for the additional information! That said, with reasonable memory numbers the behavior seems to be pretty stable and consistent; for reference, I'm successfully testing using a command like:

```
docker run -p 8080:8080 \
  -e JAVA_OPTIONS="-XX:MaxRAMPercentage=80 -Dregistry.kafka.common.bootstrap.servers=host.docker.internal:9092 -Dquarkus.http.host=0.0.0.0 -Djava.util.logging.manager=org.jboss.logmanager.LogManager" \
  --memory="300m" --memory-swap="300m" \
  --rm docker.io/apicurio/apicurio-registry-kafkasql:latest
```

The tweaks in my command are:
I recommend adding the last Java Option (or something similar, like an appropriate
At this point, I'm convinced that this issue is the effect of excessive memory consumption, and the "DB session closing" errors are only a symptom of it. If you are still able to reproduce the issue, you can confirm (or deny 🙂 ) my theory by collecting the
Please try those suggestions and let me know how it goes!
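For a Kubernetes deployment, where the docker flags above don't apply directly, the equivalent might look something like this (a sketch of the pod template; names and values are illustrative):

```yaml
# Sketch: analogous memory settings in a Deployment's pod template
containers:
  - name: apicurio-registry
    image: docker.io/apicurio/apicurio-registry-kafkasql:latest
    env:
      - name: JAVA_OPTIONS
        value: "-XX:MaxRAMPercentage=80"   # let the JVM size itself from the cgroup limit
    resources:
      requests:
        memory: "512Mi"
      limits:
        memory: "1Gi"    # hard cap, analogous to docker --memory
```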
Hi @andreaTP, Test preparation took a while, as I had to create thousands of versions :) As my environment is on a Kubernetes cluster, I am not able to use
I was able to set
This is what the start of the container log looks like:
What only changed with the
After the initial crash, the pods continued crashing in a loop with the errors described at the beginning of this issue - updating the artifact with new versions loaded from the Kafka topic until it started disconnecting the sessions:
which led to this:
finishing with this:
With that said, the initial crash is hard to reach in real life, if your customers behave :) But I see the subsequent crash loops as more dangerous, since the application will never reach readiness this way.
Please find the testing schema attached in case you need it. I just had to add a .txt suffix, as I cannot upload .json files.
Hi @petolexa , thanks for taking the time to test out the suggestions and for this detailed report! I'm noticing something strange here:
You can see that you still have
This is the script I was using to reproduce the issue:
And it seems like I'm using a slightly different pattern/endpoint to load the instance. Next week I will try to set up a reproducible environment on minikube so that we can bisect the differences; for reference, here you can find a few notes on how I was trying to reproduce the issue.
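A minimal sketch of such a load script (assuming the v1 endpoint from this issue and a local registry; the actual script may have differed):

```bash
#!/usr/bin/env bash
# Sketch: create thousands of versions of a single artifact.
# The host and artifact ID are illustrative.
REGISTRY="http://localhost:8080"
ARTIFACT="com.example.MySchema1"
for i in $(seq 1 10000); do
  # Every PUT, even with identical content, ends up as a Kafka message to replay.
  curl -s -X PUT -H "Content-Type: application/json" \
    --data @schema.json \
    "${REGISTRY}/api/artifacts/${ARTIFACT}" > /dev/null
done
```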
Nice, thank you for the links and for pointing out the other parameters. From the JVM parameters, I only add the ones that are in the JAVA_OPTIONS environment variable in the example above. So these are somewhat default, maybe? I'll try to override the
Hi @petolexa , as promised, I have replicated the environment on minikube. Unfortunately, with this setup, I'm unable to replicate the issue with 10000+ versions of the sample artifact you provided. Thanks in advance!
Hi @andreaTP, thank you for the details and for another test.

1. For the initial crash:
The main difference, in my opinion, is in the number of messages:

2. For the follow-up loop crash:
Not sure, but, according to your tests, this should not be a major difference.
This is not accurate, unfortunately,
I will try another run increasing the number of messages and let you know the results.
Yup, I kill the pod and restart it to see what happens and it always comes back nicely.
@petolexa I have tested with 20000+ versions (which means 20000+ messages in Kafka), and Registry is behaving fairly well. In order:
All in all, I do not see the mentioned database disconnection under any circumstances, and without a reproducer that's impossible to debug. I encourage you to set memory
@petolexa at this point I'm going to close this issue as "Can't reproduce", but feel free to re-open this or another one if you manage to get a reproducer together!
Thank you @andreaTP for your effort and for the tests.
Hi @petolexa !
I think this analysis is not correct; from the events you shared it looks pretty clear that the pod is getting killed because of probe failures (not the other way around). I encourage you to tweak the probes to have a much more relaxed frequency and timeouts as a first step.
Hi @andreaTP , you were right! I just understood what you were pointing at in your last comment. We had the default liveness and readiness probes set up:
and the container just didn't have enough time to process all messages (with x versions of one artifact) after the reboot. I added a startup probe:
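A minimal sketch of such a probe (assuming the default Quarkus liveness endpoint on port 8080; the exact values are illustrative):

```yaml
# Sketch: give the registry time to replay the Kafka topic before liveness kicks in
startupProbe:
  httpGet:
    path: /health/live   # default Quarkus health endpoint (assumption)
    port: 8080
  periodSeconds: 10      # with 5-6 failures this matches the 55-65s observed below
  failureThreshold: 30   # allows up to ~300s of startup time
```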
and the container was able to run properly after 5-6 failures of the startup probe, which means 55-65 seconds. Thank you for pointing me in the right direction :)
Thanks a lot for getting back @petolexa ! Appreciated! And happy that we solved the mystery 🙂
Hello,
we've run into an issue that is partially caused by a user's mistake, but it might affect someone else, so I would like to describe it and ask for your help/advice.
We use Apicurio v2.2.4 in a Docker image on a k8s cluster, with KafkaSQL storage underneath.
The cause
One of our Apicurio instance users uses the schema registry in such a way that they send a PUT request for their schema with the same content on every request:
PUT /api/artifacts/com.example.MySchema1
We are working on improving their process; this is not the standard use case, of course. But what it has caused so far is that we now have 10000+ versions of this schema.
Also, it means that every version is a Kafka message to be processed.
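In curl terms, each such request looks roughly like this (host and payload file are illustrative):

```bash
# Sketch: the request pattern that keeps creating new versions
curl -X PUT -H "Content-Type: application/json" \
  --data @MySchema1.json \
  http://registry.example.internal/api/artifacts/com.example.MySchema1
```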
The issue
When our pods with Apicurio are restarted, Apicurio loads and processes messages from the Kafka topic. When it gets to the messages with (many) new versions of the problematic schema, these messages are processed in a way that causes all database sessions to disconnect before finishing, and the Apicurio pod crashes.
This is the trace from h2:mem database processing one of the versions:
This is how it disconnects during processing:
And this is how the container crashes:
What I've noticed is that it only crashes when processing Kafka messages with newly added versions. There is no problem with processing a big amount of versions when I import schemas into a clean topic via the export/import APIs. Imported versions are processed slightly differently, and it causes no issues with processing:
My questions
Would it be possible, please, to fix the processing of new versions? It is an unusual use case, but apparently a possible one. I think it is slightly similar to the crash loops caused by globalIds that you fixed a year ago in #1500.
Is there a way to prevent such behavior by users in the Apicurio setup?
Is it possible, in case of such issues, to keep the session connected longer?
It starts here:
ends 54 seconds later in my case:
and none of the H2 DB parameters or quarkus.datasource parameters I tried helped me affect the lifetime of the session.
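For illustration, tuning of this kind (real Quarkus/Agroal pool options passed as system properties; the exact values tried may have differed) had no visible effect:

```bash
# Sketch: connection-pool tuning via system properties (illustrative values)
export JAVA_OPTIONS="$JAVA_OPTIONS \
  -Dquarkus.datasource.jdbc.max-lifetime=PT10M \
  -Dquarkus.datasource.jdbc.idle-removal-interval=PT10M \
  -Dquarkus.datasource.jdbc.acquisition-timeout=PT60S"
```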