
Importing lot of versions of a schema from KafkaSQL causes crash loop #2627

Closed
petolexa opened this issue Jul 12, 2022 · 20 comments

@petolexa

Hello,

we've run into an issue that is partially caused by a user's mistake, but it might affect others, so I would like to describe it and ask for your help/advice.
We use Apicurio v2.2.4 in a Docker image on a k8s cluster, with KafkaSQL storage underneath.

The cause

One of our Apicurio instance users uses the schema registry in such a way that they send a 'PUT' request with the same content on every request to their schema:
PUT /api/artifacts/com.example.MySchema1
We are working on improving their process; this is of course not a standard use case. But what it has caused so far is that we now have 10000+ versions of this schema.

Also, this means that every version is a Kafka message that has to be processed.

The issue

When our pods with Apicurio are restarted, Apicurio loads and processes the messages from the Kafka topic. When it gets to the messages with the (many) new versions of the problematic schema, they are processed in such a way that all database sessions disconnect before processing finishes, and the Apicurio pod crashes.

This is the trace from the h2:mem database while processing one of the versions:

2022-07-11 15:01:12 DEBUG <> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Processing Kafka message with UUID: e0fb998a-e294-4cc2-9997-0f4e73150296
2022-07-11 15:01:12 jdbc[3]: 
/*SQL l:59 #:1*/SELECT value FROM sequences WHERE name = ? AND tenantId = ? {1: 'globalId', 2: '_'};
2022-07-11 15:01:12 jdbc[3]: 
/*SQL l:81 #:1*/MERGE INTO sequences (tenantId, name, value) KEY (tenantId, name) VALUES(?, ?, ?) {1: '_', 2: 'globalId', 3: 52609075};
2022-07-11 15:01:12 jdbc[3]: 
/*SQL */COMMIT;
2022-07-11 15:01:12 jdbc[3]: 
/*SQL */COMMIT;
2022-07-11 15:01:12 DEBUG <_> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Clearing tenant id after message processed
2022-07-11 15:01:12 DEBUG <> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Kafka message successfully processed. Notifying listeners of response.
2022-07-11 15:01:12 DEBUG <> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Processing Kafka message with UUID: 36f5c6bf-7df1-4e28-a1df-46cf36d89aef
2022-07-11 15:01:12 jdbc[3]: 
/*SQL l:76 #:1*/SELECT c.contentId FROM content c WHERE c.contentHash = ? AND c.tenantId = ? {1: 'fa5d4b21ae39e9e978a7f1d06b7ed4d6dfdc1840c2e27eefd33181579ee8260f', 2: '_'};
2022-07-11 15:01:12 DEBUG <_> [io.apicurio.registry.storage.impl.sql.AbstractSqlRegistryStorage] (KSQL Kafka Consumer Thread) Updating artifact null com.example.MySchema1 with a new version (content).
2022-07-11 15:01:12 jdbc[3]: 
/*SQL l:315 #:1*/SELECT a.*, v.contentId, v.globalId, v.version, v.versionId, v.state, v.name, v.description, v.labels, v.properties, v.createdBy AS modifiedBy, v.createdOn AS modifiedOn FROM artifacts a JOIN versions v ON a.tenantId = v.tenantId AND a.latest = v.globalId WHERE a.tenantId = ? AND a.groupId = ? AND a.artifactId = ? {1: '_', 2: '__$GROUPID$__', 3: 'com.example.MySchema1'};
2022-07-11 15:01:12 jdbc[3]: 
/*SQL l:305 #:1 t:13*/INSERT INTO versions (globalId, tenantId, groupId, artifactId, version, versionId, state, name, description, createdBy, createdOn, labels, properties, contentId) VALUES (?, ?, ?, ?, ?, (SELECT MAX(versionId) + 1 FROM versions WHERE tenantId = ? AND groupId = ? AND artifactId = ?), ?, ?, ?, ?, ?, ?, ?, ?) {1: 52609075, 2: '_', 3: '__$GROUPID$__', 4: 'com.example.MySchema1', 5: NULL, 6: '_', 7: '__$GROUPID$__', 8: 'com.example.MySchema1', 9: 'ENABLED', 10: 'MySchema1', 11: NULL, 12: '', 13: TIMESTAMP '2022-07-11 11:50:25.212', 14: NULL, 15: NULL, 16: 605};
2022-07-11 15:01:12 jdbc[3]: 
/*SQL l:134 #:1*/UPDATE versions SET version = (SELECT versionId FROM versions WHERE tenantId = ? AND globalId = ?) WHERE tenantId = ? AND globalId = ? {1: '_', 2: 52609075, 3: '_', 4: 52609075};
2022-07-11 15:01:12 jdbc[3]: 
/*SQL l:85 #:1*/UPDATE artifacts SET latest = ? WHERE tenantId = ? AND groupId = ? AND artifactId = ? {1: 52609075, 2: '_', 3: '__$GROUPID$__', 4: 'com.example.MySchema1'};
2022-07-11 15:01:12 jdbc[3]: 
/*SQL l:176 #:1*/SELECT v.*, a.type FROM versions v JOIN artifacts a ON v.tenantId = a.tenantId AND v.groupId = a.groupId AND v.artifactId = a.artifactId WHERE v.tenantId = ? AND v.globalId = ? {1: '_', 2: 52609075};
2022-07-11 15:01:12 jdbc[3]: 
/*SQL */COMMIT;
2022-07-11 15:01:12 jdbc[3]: 
/*SQL */COMMIT;
2022-07-11 15:01:12 DEBUG <_> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Clearing tenant id after message processed
2022-07-11 15:01:12 DEBUG <> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Kafka message successfully processed. Notifying listeners of response.
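For illustration, the sequence above can be sketched against a hypothetical, heavily simplified schema (sqlite3 standing in for H2; tables and columns are reductions of the real ones). The point is that every replayed message recomputes MAX(versionId) + 1 over all existing versions of the artifact before inserting:

```python
import sqlite3

# Hypothetical, simplified stand-ins for the real 'versions' and 'artifacts' tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE artifacts (artifactId TEXT PRIMARY KEY, latest INTEGER);
CREATE TABLE versions  (globalId INTEGER PRIMARY KEY, artifactId TEXT, versionId INTEGER);
""")
conn.execute("INSERT INTO artifacts (artifactId, latest) VALUES (?, NULL)",
             ("com.example.MySchema1",))

def replay_new_version(global_id, artifact_id):
    """Mimics the live 'create version' path seen in the trace:
    INSERT with a MAX(versionId)+1 subselect, then update the artifact's 'latest' pointer.
    (COALESCE covers the very first version; the real code handles that case separately.)"""
    conn.execute(
        "INSERT INTO versions (globalId, artifactId, versionId) "
        "VALUES (?, ?, COALESCE((SELECT MAX(versionId) + 1 FROM versions WHERE artifactId = ?), 1))",
        (global_id, artifact_id, artifact_id))
    conn.execute("UPDATE artifacts SET latest = ? WHERE artifactId = ?",
                 (global_id, artifact_id))
    conn.commit()

# Replaying many Kafka messages repeats the subselect for every single version.
for g in range(1, 101):
    replay_new_version(g, "com.example.MySchema1")

print(conn.execute("SELECT MAX(versionId) FROM versions").fetchone()[0])  # 100
```

With 10000+ versions of one artifact, that subselect runs once per replayed message, so the replay cost keeps growing with the number of versions already inserted.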

This is how it disconnects during processing:

2022-07-11 15:01:14 DEBUG <> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Processing Kafka message with UUID: fd634a51-ba81-4064-81a6-3ffe3772eb43
2022-07-11 15:01:14 database: disconnecting session #3
2022-07-11 15:01:14 database: disconnected session #3
2022-07-11 15:01:14 database: disconnecting session #4
2022-07-11 15:01:14 database: disconnected session #4
2022-07-11 15:01:14 database: disconnecting session #5
2022-07-11 15:01:14 database: disconnected session #5
2022-07-11 15:01:14 database: disconnecting session #6
2022-07-11 15:01:14 database: disconnected session #6
2022-07-11 15:01:14 database: disconnecting session #7
2022-07-11 15:01:14 database: disconnected session #7
2022-07-11 15:01:14 database: disconnecting session #8
2022-07-11 15:01:14 database: disconnected session #8
2022-07-11 15:01:14 database: disconnecting session #9
2022-07-11 15:01:14 database: disconnected session #9
2022-07-11 15:01:14 database: disconnecting session #10
2022-07-11 15:01:14 database: disconnected session #10
2022-07-11 15:01:14 database: disconnecting session #11
2022-07-11 15:01:14 INFO <_> [io.apicurio.registry.storage.impl.sql.AbstractSqlRegistryStorage] (KSQL Kafka Consumer Thread) SqlRegistryStorage constructed successfully.  JDBC URL: jdbc:h2:mem:registry_db;DB_CLOSE_ON_EXIT=FALSE;TRACE_LEVEL_SYSTEM_OUT=2
2022-07-11 15:01:14 database: disconnected session #11
2022-07-11 15:01:14 database: disconnecting session #12
2022-07-11 15:01:14 database: disconnected session #12
2022-07-11 15:01:14 database: disconnecting session #13
2022-07-11 15:01:14 DEBUG <_> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Clearing tenant id after message processed
2022-07-11 15:01:14 database: disconnected session #13
2022-07-11 15:01:14 DEBUG <> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Registry exception detected: java.lang.RuntimeException: java.sql.SQLException: This pool is closed and does not handle any more connections!
2022-07-11 15:01:14 DEBUG <> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Processing Kafka message with UUID: 5b551d28-dc44-4f44-84b0-b88f37c98840
2022-07-11 15:01:14 database: disconnecting session #14
2022-07-11 15:01:14 database: disconnected session #14
2022-07-11 15:01:14 database: disconnecting session #15
2022-07-11 15:01:14 database: disconnected session #15
2022-07-11 15:01:14 database: disconnecting session #16
2022-07-11 15:01:14 database: disconnected session #16
2022-07-11 15:01:14 INFO <_> [io.apicurio.registry.storage.impl.sql.AbstractSqlRegistryStorage] (KSQL Kafka Consumer Thread) SqlRegistryStorage constructed successfully.  JDBC URL: jdbc:h2:mem:registry_db;DB_CLOSE_ON_EXIT=FALSE;TRACE_LEVEL_SYSTEM_OUT=2
2022-07-11 15:01:14 DEBUG <_> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Clearing tenant id after message processed
2022-07-11 15:01:14 database: disconnecting session #17
2022-07-11 15:01:14 DEBUG <> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Registry exception detected: java.lang.RuntimeException: java.sql.SQLException: This pool is closed and does not handle any more connections!

And this is how the container crashes:

2022-07-11 15:01:14 DEBUG <> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Processing Kafka message with UUID: 477a0bde-97df-4b6f-acf5-c06287bd3944
2022-07-11 15:01:14 DEBUG <_> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Clearing tenant id after message processed
2022-07-11 15:01:14 DEBUG <> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Unexpected exception detected: Error injecting javax.transaction.TransactionManager io.quarkus.narayana.jta.runtime.interceptor.TransactionalInterceptorBase.transactionManager
2022-07-11 15:01:14 DEBUG <> [io.quarkus.runtime.ExecutorRecorder$2] (Shutdown thread) loop: 1, remaining: 60000000000, intervalRemaining: 5000000000, interruptRemaining: 10000000000
2022-07-11 15:01:14 DEBUG <> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Processing Kafka message with UUID: 349dd267-6841-4327-99f7-8189d3b33c2b
2022-07-11 15:01:14 DEBUG <_> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Clearing tenant id after message processed
2022-07-11 15:01:14 DEBUG <> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Unexpected exception detected: Error injecting javax.transaction.TransactionManager io.quarkus.narayana.jta.runtime.interceptor.TransactionalInterceptorBase.transactionManager
2022-07-11 15:01:14 DEBUG <> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Processing Kafka message with UUID: b56f0d56-c151-4db0-a396-4abae8222033
2022-07-11 15:01:14 DEBUG <_> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Clearing tenant id after message processed
2022-07-11 15:01:14 DEBUG <> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Unexpected exception detected: Error injecting javax.transaction.TransactionManager io.quarkus.narayana.jta.runtime.interceptor.TransactionalInterceptorBase.transactionManager
2022-07-11 15:01:14 DEBUG <> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Processing Kafka message with UUID: 5b24233f-8159-4218-a399-420ba19b26cd
2022-07-11 15:01:14 DEBUG <_> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Clearing tenant id after message processed
2022-07-11 15:01:14 DEBUG <> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Unexpected exception detected: Error injecting javax.transaction.TransactionManager io.quarkus.narayana.jta.runtime.interceptor.TransactionalInterceptorBase.transactionManager
2022-07-11 15:01:14 DEBUG <> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Processing Kafka message with UUID: 7fba2dbb-b75f-4aac-a69e-ef58956ba62e
2022-07-11 15:01:14 DEBUG <_> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Clearing tenant id after message processed
2022-07-11 15:01:14 DEBUG <> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Unexpected exception detected: Error injecting javax.transaction.TransactionManager io.quarkus.narayana.jta.runtime.interceptor.TransactionalInterceptorBase.transactionManager
2022-07-11 15:01:14 DEBUG <> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Processing Kafka message with UUID: cac5e98f-0a9c-4041-8bca-707b9aaa18f3
2022-07-11 15:01:14 DEBUG <_> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Clearing tenant id after message processed
2022-07-11 15:01:14 DEBUG <> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Unexpected exception detected: Error injecting javax.transaction.TransactionManager io.quarkus.narayana.jta.runtime.interceptor.TransactionalInterceptorBase.transactionManager
2022-07-11 15:01:14 DEBUG <> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Processing Kafka message with UUID: cfe0f411-1ebb-44de-b1cb-9a1316cb6ec7
2022-07-11 15:01:14 DEBUG <_> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Clearing tenant id after message processed
2022-07-11 15:01:14 DEBUG <> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Unexpected exception detected: Error injecting javax.transaction.TransactionManager io.quarkus.narayana.jta.runtime.interceptor.TransactionalInterceptorBase.transactionManager
2022-07-11 15:01:14 DEBUG <> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Processing Kafka message with UUID: 2f0d0e2e-374c-4727-9eee-87c4fe62a477
2022-07-11 15:01:14 DEBUG <_> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Clearing tenant id after message processed
2022-07-11 15:01:14 DEBUG <> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Unexpected exception detected: Error injecting javax.transaction.TransactionManager io.quarkus.narayana.jta.runtime.interceptor.TransactionalInterceptorBase.transactionManager
2022-07-11 15:01:14 DEBUG <> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Processing Kafka message with UUID: a403b190-68d7-48ef-a394-46a583fdb1f2
2022-07-11 15:01:14 DEBUG <_> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Clearing tenant id after message processed
2022-07-11 15:01:14 DEBUG <> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Unexpected exception detected: Error injecting javax.transaction.TransactionManager io.quarkus.narayana.jta.runtime.interceptor.TransactionalInterceptorBase.transactionManager
2022-07-11 15:01:14 DEBUG <> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Processing Kafka message with UUID: 5bb6bb57-01c3-4dfd-b4ce-2b0005470bdb
2022-07-11 15:01:14 INFO <> [io.quarkus.bootstrap.runner.Timing] (Shutdown thread) apicurio-registry-storage-kafkasql stopped in 0.115s
2022-07-11 15:01:14 DEBUG <_> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Clearing tenant id after message processed
2022-07-11 15:01:14 DEBUG <> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Unexpected exception detected: Error injecting javax.transaction.TransactionManager io.quarkus.narayana.jta.runtime.interceptor.TransactionalInterceptorBase.transactionManager

What I've noticed is that it only crashes when processing Kafka messages with newly added versions. There is no problem processing a large number of versions when I import schemas into a clean topic via the export/import APIs. Imported versions are processed slightly differently, and that causes no issues:

2022-07-11 15:00:29 DEBUG <> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Processing Kafka message with UUID: 6d71e375-24cd-4a2f-844b-d97d520f9b48
2022-07-11 15:00:29 jdbc[3]: 
/*SQL l:103 #:1*/SELECT COUNT(a.artifactId) FROM artifacts a WHERE a.tenantId = ? AND a.groupId = ? AND a.artifactId = ? {1: '_', 2: '__$GROUPID$__', 3: 'com.example.MySchema1'};
2022-07-11 15:00:29 jdbc[3]: 
/*SQL l:72 #:1*/SELECT COUNT(globalId) FROM versions WHERE globalId = ? AND tenantId = ? {1: 52604860, 2: '_'};
2022-07-11 15:00:29 jdbc[3]: 
/*SQL l:211 #:1*/INSERT INTO versions (globalId, tenantId, groupId, artifactId, version, versionId, state, name, description, createdBy, createdOn, labels, properties, contentId) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?) {1: 52604860, 2: '_', 3: '__$GROUPID$__', 4: 'com.example.MySchema1', 5: '6585', 6: 6585, 7: 'ENABLED', 8: 'MySchema1', 9: NULL, 10: '', 11: TIMESTAMP '2022-06-28 12:11:50.741', 12: NULL, 13: NULL, 14: 605};
2022-07-11 15:00:29 INFO <_> [io.apicurio.registry.storage.impl.sql.AbstractSqlRegistryStorage] (KSQL Kafka Consumer Thread) Artifact version entity imported successfully.
2022-07-11 15:00:29 jdbc[3]: 
/*SQL */COMMIT;
2022-07-11 15:00:29 jdbc[3]: 
/*SQL */COMMIT;
2022-07-11 15:00:29 DEBUG <_> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Clearing tenant id after message processed
2022-07-11 15:00:29 DEBUG <> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Kafka message successfully processed. Notifying listeners of response.
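For contrast, the import path above can be sketched the same way (again a hypothetical simplified schema with sqlite3 standing in for H2): the versionId arrives precomputed in the Kafka message, so the insert is a plain bound-value write with no subselect over the existing versions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE versions (globalId INTEGER PRIMARY KEY, artifactId TEXT, versionId INTEGER)")

def import_version(global_id, artifact_id, version_id):
    """Mimics the import path from the trace: the versionId is already known,
    so the INSERT carries it as a plain bound value (no scan of existing rows)."""
    conn.execute(
        "INSERT INTO versions (globalId, artifactId, versionId) VALUES (?, ?, ?)",
        (global_id, artifact_id, version_id))
    conn.commit()

# As in the log: globalId 52604860 is imported with its original versionId 6585.
import_version(52604860, "com.example.MySchema1", 6585)
print(conn.execute("SELECT versionId FROM versions WHERE globalId = 52604860").fetchone()[0])  # 6585
```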

My questions

Would it be possible, please, to fix the processing of new versions? It is an unusual use case, but apparently a possible one. I think it is somewhat similar to the crash loops caused by globalIds that you fixed a year ago in #1500.
Is there a way to prevent such user behavior in the Apicurio setup?

In the case of such issues, is it possible to keep the session connected longer?
It starts here:

2022-07-11 15:00:20 database: connecting session #4 to mem:registry_db

and ends 54 seconds later in my case:

2022-07-11 15:01:14 database: disconnecting session #4
2022-07-11 15:01:14 database: disconnected session #4

and none of the H2 DB parameters or quarkus.datasource parameters I tried helped me affect the lifetime of the session.
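For reference, these are the kinds of quarkus.datasource (Agroal) pool properties that control connection lifetime - a sketch with purely illustrative values; as said above, tuning them did not change the observed session lifetime in my case:

```properties
# Illustrative values only - these did not affect the H2 session lifetime in my tests
quarkus.datasource.jdbc.min-size=5
quarkus.datasource.jdbc.max-size=20
# maximum lifetime of a pooled connection (ISO-8601 duration)
quarkus.datasource.jdbc.max-lifetime=PT10M
# how often idle connections are checked for removal
quarkus.datasource.jdbc.idle-removal-interval=PT5M
```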

@EricWittmann (Member)

Sorry for the delay on this. We'll need to reproduce this and see what we can do about a fix. I don't have any theories about why this might happen - I'll need to look at the code. Based on your description this is clearly a bug. It really shouldn't be a problem having 10k or more versions of a single schema. It's obviously rarely what you WANT to do, but it shouldn't crash the server.

I'm marking this as a high priority bug.

@andreaTP or @jsenko any volunteers to look into this? :)

@andreaTP (Member) commented Sep 6, 2022

Hi @petolexa, thanks a lot for raising this issue. I cannot reproduce this specific issue yet, but running Registry as a Java process on my machine shows that, at the end of importing 10000+ Kafka messages, there is a consistently heavy "spike" in memory usage:
[screenshot: JVM memory usage spike, 2022-09-06]

I will investigate deeper in the coming days; these are the current takeaways:

  • check that, when running in Docker, the JVM respects the container memory limit (which might be the cause of the container crash)
  • check H2 indexing/caching/optimizations
  • verify whether we can benefit from tuning H2

@petolexa (Author) commented Sep 7, 2022

Hi @andreaTP,
thank you for your answer. You are right, the memory limit was partially the problem. The timeline was as follows:

  1. We had 3 replicas of a pod running Apicurio (1GiB of memory each).
  2. One of the customers started producing thousands of versions (over a few days).
  3. The 1st container crashed on reaching the memory limit and started restarting in a loop. After the initial restart, the subsequent restarts were caused by the H2 in-memory DB sessions dropping before all the messages (mostly versions of that one artifact) were processed:
    2022-07-11 15:01:14 database: disconnecting session #4
  4. In a cascade, the 2nd and then the 3rd replica of the pod reached the memory limit and entered an endless loop of disconnecting DB sessions.

One of the things I tried to fix this was raising the memory limit to 4GiB for each pod replica.
This might solve the initial crash, but not the follow-up crashes, as I did not find a working way to keep sessions open longer.

@petolexa (Author) commented Sep 7, 2022

Also, what is noticeable:

  • there is no such issue when I import 10000+ versions from a backup file via the /admin/export and /admin/import APIs
  • even when these 10000+ versions previously imported via the /admin/import API are freshly loaded from the Kafka topic, there is no issue processing them
  • the issue only happens when the messages loaded from the topic are the 10000+ new versions the customer created directly via the groups/{groupId}/artifacts/{artifactId}/versions API

@andreaTP (Member) commented Sep 7, 2022

Hi @petolexa, thanks for the additional information!
I tested a bit further, and I can reproduce a container crash for small enough RAM values (e.g. going below 200MB).

That said, with reasonable memory settings the behavior seems to be pretty stable and consistent; for reference, I'm successfully testing with a command like:

docker run -p 8080:8080 -e JAVA_OPTIONS="-XX:MaxRAMPercentage=80 -Dregistry.kafka.common.bootstrap.servers=host.docker.internal:9092 -Dquarkus.http.host=0.0.0.0 -Djava.util.logging.manager=org.jboss.logmanager.LogManager" --memory="300m" --memory-swap="300m" --rm docker.io/apicurio/apicurio-registry-kafkasql:latest

The tweaks in my command are:

  • memory 300MB
  • swap memory 300MB
  • -XX:MaxRAMPercentage=80

I recommend adding the last Java option (or something similar, like an appropriate -Xmx) in your environment and setting a .resources.requests on your pods > 300MB (and limits accordingly).

At this point, I'm convinced that this issue is the effect of excessive memory consumption and that the "DB session closing" is only a symptom of it. If you are still able to reproduce the issue, you can confirm (or deny 🙂) my theory by collecting the events on the relevant pods in your installation.

Please, try those suggestions and let me know how it goes!

@petolexa (Author) commented Sep 9, 2022

Hi @andreaTP,
thank you for your tests and suggestions.

Test preparation took a while, as I had to create thousands of versions :)
I basically ran this in 3 Linux sessions for a few hours to reach the crash:
watch -n 1 'curl -k -X POST https://dev3.example.com/apis/registry/v2/groups/loop/artifacts/loop-test/versions --data-binary @loop-test.json' with a testing schema,
to create enough versions of one schema and to simulate what our customer/colleague initially did to create this issue.

As my environment is a Kubernetes cluster, I am not able to use --memory-swap. It is a Docker-only parameter and, if I understand the documentation right, in Kubernetes swap support is in alpha and only for nodes, not for application deployments.

I was able to set -XX:MaxRAMPercentage=80, though. The related part of my setup looks like this in values.yaml for the k8s deployment:

#pod resources limits
resources:
  limits:
    memory: "1000Mi"

...

#environment variables
env:
- name: JAVA_OPTIONS
  value: >-
   -XX:MaxRAMPercentage=80
   -D%prod.quarkus.http.ssl-port=8443 
   -D%prod.quarkus.http.ssl.certificate.key-store-file=/home/security/keystore.jks
   -D%prod.quarkus.http.ssl.certificate.key-store-password=${SSL_KEY_PASSWORD}
   -Djavax.net.ssl.trustStore=/home/security/truststore.jks
   -Djavax.net.ssl.trustStorePassword=password
   -D%prod.registry.streams.topology.num.standby.replicas=0
   -Dio.netty.tryReflectionSetAccessible=true
   -D%prod.registry.kafkasql.topic=dev3_kafkasql_journal_topic
   -D%prod.registry.kafka.common.ssl.truststore.location=/home/security/truststore.jks
   -D%prod.registry.kafka.common.ssl.truststore.password=${TRUST_PASSWORD}
   -D%prod.registry.kafka.common.ssl.truststore.type=JKS
   -D%prod.registry.kafka.common.ssl.endpoint.identification.algorithm=
   -D%prod.registry.kafka.common.ssl.keystore.location=/home/security/kafka.jks
   -D%prod.registry.kafka.common.ssl.keystore.password=${KAFKA_PASSWORD}
   -D%prod.registry.kafka.common.ssl.keystore.type=JKS
   -D%prod.registry.kafka.common.security.protocol=SSL
   -D%prod.registry.rules.global.compatibility=FULL
   -D%prod.registry.rules.global.validity=FULL
   -Dquarkus.http.access-log.exclude-pattern=/health/.*
   -D%prod.quarkus.datasource.jdbc.url=jdbc:h2:mem:registry_db;DB_CLOSE_ON_EXIT=FALSE;TRACE_LEVEL_SYSTEM_OUT=2

This is what the start of the container log looks like:

INFO exec  java -XX:MaxRAMPercentage=80 -D%prod.quarkus.http.ssl-port=8443 -D%prod.quarkus.http.ssl.certificate.key-store-file=/home/security/keystore.jks -D%prod.quarkus.http.ssl.certificate.key-store-password=${SSL_KEY_PASSWORD} -Djavax.net.ssl.trustStore=/home/security/truststore.jks -Djavax.net.ssl.trustStorePassword=password -D%prod.registry.streams.topology.num.standby.replicas=0 -Dio.netty.tryReflectionSetAccessible=true -D%prod.registry.kafkasql.topic=dev3_kafkasql_journal_topic -D%prod.registry.kafka.common.ssl.truststore.location=/home/security/truststore.jks -D%prod.registry.kafka.common.ssl.truststore.password=${TRUST_PASSWORD} -D%prod.registry.kafka.common.ssl.truststore.type=JKS -D%prod.registry.kafka.common.ssl.endpoint.identification.algorithm= -D%prod.registry.kafka.common.ssl.keystore.location=/home/security/kafka.jks -D%prod.registry.kafka.common.ssl.keystore.password=${KAFKA_PASSWORD} -D%prod.registry.kafka.common.ssl.keystore.type=JKS -D%prod.registry.kafka.common.security.protocol=SSL -D%prod.registry.rules.global.compatibility=FULL -D%prod.registry.rules.global.validity=FULL -Dquarkus.http.access-log.exclude-pattern=/health/.* -D%prod.quarkus.datasource.jdbc.url=jdbc:h2:mem:registry_db;DB_CLOSE_ON_EXIT=FALSE;TRACE_LEVEL_SYSTEM_OUT=2 -javaagent:/usr/share/java/jolokia-jvm-agent/jolokia-jvm.jar=config=/opt/jboss/container/jolokia/etc/jolokia.properties -Xms125m -Xmx500m -XX:+UseParallelGC -XX:MinHeapFreeRatio=10 -XX:MaxHeapFreeRatio=20 -XX:GCTimeRatio=4 -XX:AdaptiveSizePolicyWeight=90 -XX:+ExitOnOutOfMemoryError -cp "." -jar /deployments/apicurio-registry-storage-kafkasql-2.2.5.Final-runner.jar

The only thing that changed with -XX:MaxRAMPercentage=80 is that the pods started crashing even sooner - at around 800MB, which in my case matches 80% of the 1000MB RAM limit for each pod. I ran the test on 2 pod instances, and you can nicely see how the RAM usage continually increases until the max RAM percentage is reached, the 1st pod instance crashes (14:06), and then the 2nd follows, with both ending in crash loops (14:30):
[graph: pod memory usage over time for both instances]

After the initial crash, the pods continued crashing in a loop with the errors described at the beginning of this issue - updating the artifact with a new version loaded from the Kafka topic, until the sessions started disconnecting:

/*SQL */COMMIT;
2022-09-09 12:51:18 DEBUG <_> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Clearing tenant id after message processed
2022-09-09 12:51:18 DEBUG <> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Kafka message successfully processed. Notifying listeners of response.
2022-09-09 12:51:18 DEBUG <> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Processing Kafka message with UUID: b51783f0-4b59-4ad9-a94f-8b3319b8397b
2022-09-09 12:51:18 jdbc[3]: 
/*SQL l:76 #:1*/SELECT c.contentId FROM content c WHERE c.contentHash = ? AND c.tenantId = ? {1: '809d42dc9a75f7ed8d99d776b1aa09f5c6dc35b24d0e8a586c6cd73da3747795', 2: '_'};
2022-09-09 12:51:18 DEBUG <_> [io.apicurio.registry.storage.impl.sql.AbstractSqlRegistryStorage] (KSQL Kafka Consumer Thread) Updating artifact loop loop-test with a new version (content).
2022-09-09 12:51:18 jdbc[3]: 
/*SQL l:315 #:1*/SELECT a.*, v.contentId, v.globalId, v.version, v.versionId, v.state, v.name, v.description, v.labels, v.properties, v.createdBy AS modifiedBy, v.createdOn AS modifiedOn FROM artifacts a JOIN versions v ON a.tenantId = v.tenantId AND a.latest = v.globalId WHERE a.tenantId = ? AND a.groupId = ? AND a.artifactId = ? {1: '_', 2: 'loop', 3: 'loop-test'};
2022-09-09 12:51:18 DEBUG <_> [io.apicurio.registry.storage.impl.sql.SqlStatementsProducer] (KSQL Kafka Consumer Thread) Creating an instance of ISqlStatements for DB: h2
2022-09-09 12:51:18 jdbc[3]: 
/*SQL l:305 #:1 t:6*/INSERT INTO versions (globalId, tenantId, groupId, artifactId, version, versionId, state, name, description, createdBy, createdOn, labels, properties, contentId) VALUES (?, ?, ?, ?, ?, (SELECT MAX(versionId) + 1 FROM versions WHERE tenantId = ? AND groupId = ? AND artifactId = ?), ?, ?, ?, ?, ?, ?, ?, ?) {1: 3201, 2: '_', 3: 'loop', 4: 'loop-test', 5: NULL, 6: '_', 7: 'loop', 8: 'loop-test', 9: 'ENABLED', 10: 'TestV1', 11: NULL, 12: '', 13: TIMESTAMP '2022-09-09 11:03:55.514', 14: NULL, 15: NULL, 16: 1};
2022-09-09 12:51:18 jdbc[3]: 
/*SQL l:134 #:1*/UPDATE versions SET version = (SELECT versionId FROM versions WHERE tenantId = ? AND globalId = ?) WHERE tenantId = ? AND globalId = ? {1: '_', 2: 3201, 3: '_', 4: 3201};
2022-09-09 12:51:18 jdbc[3]: 
/*SQL l:85 #:1*/UPDATE artifacts SET latest = ? WHERE tenantId = ? AND groupId = ? AND artifactId = ? {1: 3201, 2: '_', 3: 'loop', 4: 'loop-test'};
2022-09-09 12:51:18 jdbc[3]: 
/*SQL l:176 #:1*/SELECT v.*, a.type FROM versions v JOIN artifacts a ON v.tenantId = a.tenantId AND v.groupId = a.groupId AND v.artifactId = a.artifactId WHERE v.tenantId = ? AND v.globalId = ? {1: '_', 2: 3201};
2022-09-09 12:51:18 jdbc[3]: 
/*SQL */COMMIT;
2022-09-09 12:51:18 jdbc[3]: 
/*SQL */COMMIT;
2022-09-09 12:51:18 DEBUG <_> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Clearing tenant id after message processed
2022-09-09 12:51:18 DEBUG <> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Kafka message successfully processed. Notifying listeners of response.
2022-09-09 12:51:18 DEBUG <> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Processing Kafka message with UUID: fff0a130-0d27-41bd-92df-15386adc6322
2022-09-09 12:51:18 database: disconnecting session #3
2022-09-09 12:51:18 database: disconnected session #3
2022-09-09 12:51:18 database: disconnecting session #4
2022-09-09 12:51:18 database: disconnected session #4
2022-09-09 12:51:18 database: disconnecting session #5

which led to this:

2022-09-09 12:51:18 database: disconnected session #20
2022-09-09 12:51:18 database: disconnecting session #21
2022-09-09 12:51:18 database: disconnected session #21
2022-09-09 12:51:18 database: disconnecting session #22
2022-09-09 12:51:18 database: closing mem:registry_db
2022-09-09 12:51:18 database: closed
2022-09-09 12:51:18 database: disconnected session #22
2022-09-09 12:51:18 DEBUG <> [io.quarkus.arc.impl.ArcContainerImpl] (Shutdown thread) ArC DI container shut down
2022-09-09 12:51:18 INFO <_> [io.apicurio.registry.storage.impl.sql.AbstractSqlRegistryStorage] (KSQL Kafka Consumer Thread) SqlRegistryStorage constructed successfully.  JDBC URL: jdbc:h2:mem:registry_db;DB_CLOSE_ON_EXIT=FALSE;TRACE_LEVEL_SYSTEM_OUT=2
2022-09-09 12:51:18 DEBUG <_> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Clearing tenant id after message processed
2022-09-09 12:51:18 DEBUG <> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Registry exception detected: java.lang.RuntimeException: java.sql.SQLException: This pool is closed and does not handle any more connections!
2022-09-09 12:51:18 DEBUG <> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Processing Kafka message with UUID: abc6bcd0-9099-4f04-ae05-ffa6c2ed7764
2022-09-09 12:51:18 DEBUG <_> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Clearing tenant id after message processed
2022-09-09 12:51:18 DEBUG <> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Unexpected exception detected: Error injecting javax.transaction.TransactionManager io.quarkus.narayana.jta.runtime.interceptor.TransactionalInterceptorBase.transactionManager

finishing with this:

2022-09-09 12:51:18 DEBUG <> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Processing Kafka message with UUID: e5f3d5ba-608e-4771-bc1f-1210cf45788c
2022-09-09 12:51:18 DEBUG <_> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Clearing tenant id after message processed
2022-09-09 12:51:18 DEBUG <> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Unexpected exception detected: Error injecting javax.transaction.TransactionManager io.quarkus.narayana.jta.runtime.interceptor.TransactionalInterceptorBase.transactionManager
2022-09-09 12:51:18 DEBUG <> [io.netty.buffer.PoolThreadCache] (vert.x-eventloop-thread-0) Freed 4 thread-local buffer(s) from thread: vert.x-eventloop-thread-0
2022-09-09 12:51:18 DEBUG <> [io.netty.buffer.PoolThreadCache] (vert.x-eventloop-thread-1) Freed 6 thread-local buffer(s) from thread: vert.x-eventloop-thread-1
2022-09-09 12:51:18 DEBUG <> [io.quarkus.runtime.ExecutorRecorder$2] (Shutdown thread) loop: 1, remaining: 60000000000, intervalRemaining: 5000000000, interruptRemaining: 10000000000
2022-09-09 12:51:18 INFO <> [io.quarkus.bootstrap.runner.Timing] (Shutdown thread) apicurio-registry-storage-kafkasql stopped in 0.161s

With that said, the initial crash is hard to reach in real life if your customers behave :) But I see the subsequent crash loop as more dangerous, since the application will never reach readiness this way.

@petolexa

petolexa commented Sep 9, 2022

Please find the testing schema attached in case you need it. I just had to add a .txt suffix, as I cannot upload .json files.

loop-test.json.txt

@andreaTP

andreaTP commented Sep 9, 2022

Hi @petolexa , thanks for taking the time to test out the suggestions and doing this detailed report!

I'm noticing something strange here:

INFO exec  java -XX:MaxRAMPercentage=80 -D%prod.quarkus.http.ssl-port=8443 -D%prod.quarkus.http.ssl.certificate.key-store-file=/home/security/keystore.jks -D%prod.quarkus.http.ssl.certificate.key-store-password=${SSL_KEY_PASSWORD} -Djavax.net.ssl.trustStore=/home/security/truststore.jks -Djavax.net.ssl.trustStorePassword=password -D%prod.registry.streams.topology.num.standby.replicas=0 -Dio.netty.tryReflectionSetAccessible=true -D%prod.registry.kafkasql.topic=dev3_kafkasql_journal_topic -D%prod.registry.kafka.common.ssl.truststore.location=/home/security/truststore.jks -D%prod.registry.kafka.common.ssl.truststore.password=${TRUST_PASSWORD} -D%prod.registry.kafka.common.ssl.truststore.type=JKS -D%prod.registry.kafka.common.ssl.endpoint.identification.algorithm= -D%prod.registry.kafka.common.ssl.keystore.location=/home/security/kafka.jks -D%prod.registry.kafka.common.ssl.keystore.password=${KAFKA_PASSWORD} -D%prod.registry.kafka.common.ssl.keystore.type=JKS -D%prod.registry.kafka.common.security.protocol=SSL -D%prod.registry.rules.global.compatibility=FULL -D%prod.registry.rules.global.validity=FULL -Dquarkus.http.access-log.exclude-pattern=/health/.* -D%prod.quarkus.datasource.jdbc.url=jdbc:h2:mem:registry_db;DB_CLOSE_ON_EXIT=FALSE;TRACE_LEVEL_SYSTEM_OUT=2 -javaagent:/usr/share/java/jolokia-jvm-agent/jolokia-jvm.jar=config=/opt/jboss/container/jolokia/etc/jolokia.properties -Xms125m -Xmx500m -XX:+UseParallelGC -XX:MinHeapFreeRatio=10 -XX:MaxHeapFreeRatio=20 -XX:GCTimeRatio=4 -XX:AdaptiveSizePolicyWeight=90 -XX:+ExitOnOutOfMemoryError -cp "." -jar /deployments/apicurio-registry-storage-kafkasql-2.2.5.Final-runner.jar

You can see that you still have -Xms125m -Xmx500m, which is likely going to interfere with the JVM parameter I suggested.
Also, I see that you are adding a bunch of extra JVM flags that might influence the behavior as well; I encourage you to review all of them, starting from an empty configuration and adding them back one by one.

This is the script I was using to reproduce the issue:
https://github.com/andreaTP/apicurio-registry/blob/debug-10000-kafkasql/load.sh

It also seems I'm using a slightly different pattern/endpoint to load the instance. Next week I will try to set up a reproducible environment on minikube so that we can bisect the differences; for reference, here you can find a few notes on how I was trying to reproduce the issue.

@petolexa

petolexa commented Sep 9, 2022

Nice, thank you for the links and for pointing out the other parameters.

From the JVM parameters, I only add the ones that are in the JAVA_OPTIONS environment variable in the example above. So these are somewhat default, maybe?
-javaagent:/usr/share/java/jolokia-jvm-agent/jolokia-jvm.jar=config=/opt/jboss/container/jolokia/etc/jolokia.properties -Xms125m -Xmx500m -XX:+UseParallelGC -XX:MinHeapFreeRatio=10 -XX:MaxHeapFreeRatio=20 -XX:GCTimeRatio=4 -XX:AdaptiveSizePolicyWeight=90 -XX:+ExitOnOutOfMemoryError -cp "." -jar /deployments/apicurio-registry-storage-kafkasql-2.2.5.Final-runner.jar

I'll try to override -Xmx to 800m to match -XX:MaxRAMPercentage=80. Either I'll keep -Xms the same, or I can change it if you suggest a better value. And I will re-test next week.

@andreaTP self-assigned this on Sep 13, 2022
@andreaTP

Hi @petolexa , as promised I have replicated the environment on minikube to try to reproduce the issue.
Here you can find the instructions on how to replicate locally my setup:
https://github.com/andreaTP/apicurio-registry/blob/kafkasql-load-test/load-tests/notes.md
and here you can find all the relevant files:
https://github.com/andreaTP/apicurio-registry/tree/kafkasql-load-test/load-tests

Unfortunately, with this setup, I'm unable to replicate the issue with 10000+ versions of the sample artifact you provided.
Can you please check my configuration and see if you can replicate the issue, or notice any difference from your environment?

Thanks in advance!

@petolexa

Hi @andreaTP, thank you for the details and for another test.

1. for the initial crash
I only see a few minor differences between my setup and yours, considering minikube should react the same way as a k8s cluster:

  • I didn't use JAVA_MAX_MEM_RATIO, but I guess it is the same as -XX:MaxRAMPercentage, am I right?
  • I did not specify a memory request, only a memory limit. But the result should be the same, I guess.

The main difference, in my opinion, is in the amount of messages:

  • the issue happens after 10000 versions. In my test, the loop is infinite until the pod crashes. After the crash, the Kafka topic shows 12145 records for my last test, which I'll describe later.
  • I ran my (more primitive) bash command with watch in three different terminals at the same time, to simulate more requests/consumers and to speed up the crash. But this is unnecessary, I guess, if you let your command run long enough.
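For reference, the load I generate in each terminal is equivalent to this sketch (an explicit iteration count instead of watch; the URL and payload file are the ones from my tests, so adjust them for your instance):

```shell
# Sketch of the load loop: every POST to .../versions creates a new
# version of the same artifact, piling up messages in the KafkaSQL topic.
publish_versions() {
  count="$1"; url="$2"; payload="$3"
  i=1
  while [ "$i" -le "$count" ]; do
    curl -k -s -X POST "$url" --data-binary "@$payload"
    i=$((i + 1))
  done
}

# Example (uncomment to run against a real instance):
# publish_versions 10000 \
#   'https://dev3.example.com/apis/registry/v2/groups/loop/artifacts/loop-test/versions' \
#   loop-test.json
```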

2. for the followup loop crash
After you did your test, did you try to restart the deployment again on the same topic, where these 10000 versions are stored? Didn't it crash for you?
These crashes are the ones caused by the DB connection timeout. It happened to me while importing version no. 4814 (of 12145) in my last test.

@petolexa

In my last test I tried to override the default -Xmx with a higher value.
I've added -Xmx800m to my JAVA_OPTIONS:

#environment variables
env:
- name: JAVA_OPTIONS
  value: >-
   -Xmx800m
   -XX:MaxRAMPercentage=80
   -D%prod.quarkus.http.ssl-port=8443

and the parameter was rewritten successfully:

INFO exec  java -Xmx800m -XX:MaxRAMPercentage=80 -D%prod.quarkus.http.ssl-port=8443 -D%prod.quarkus.http.ssl.certificate.key-store-file=/home/security/keystore.jks -D%prod.quarkus.http.ssl.certificate.key-store-password=${SSL_KEY_PASSWORD} -Djavax.net.ssl.trustStore=/home/security/truststore.jks -Djavax.net.ssl.trustStorePassword=password -D%prod.registry.streams.topology.num.standby.replicas=0 -Dio.netty.tryReflectionSetAccessible=true -D%prod.registry.kafkasql.topic=dev3_kafkasql_journal_topic -D%prod.registry.kafka.common.ssl.truststore.location=/home/security/truststore.jks -D%prod.registry.kafka.common.ssl.truststore.password=${TRUST_PASSWORD} -D%prod.registry.kafka.common.ssl.truststore.type=JKS -D%prod.registry.kafka.common.ssl.endpoint.identification.algorithm= -D%prod.registry.kafka.common.ssl.keystore.location=/home/security/kafka.jks -D%prod.registry.kafka.common.ssl.keystore.password=${KAFKA_PASSWORD} -D%prod.registry.kafka.common.ssl.keystore.type=JKS -D%prod.registry.kafka.common.security.protocol=SSL -D%prod.registry.rules.global.compatibility=FULL -D%prod.registry.rules.global.validity=FULL -Dquarkus.http.access-log.exclude-pattern=/health/.* -D%prod.quarkus.datasource.jdbc.url=jdbc:h2:mem:registry_db;DB_CLOSE_ON_EXIT=FALSE;TRACE_LEVEL_SYSTEM_OUT=2 -javaagent:/usr/share/java/jolokia-jvm-agent/jolokia-jvm.jar=config=/opt/jboss/container/jolokia/etc/jolokia.properties -Xms125m -XX:+UseParallelGC -XX:MinHeapFreeRatio=10 -XX:MaxHeapFreeRatio=20 -XX:GCTimeRatio=4 -XX:AdaptiveSizePolicyWeight=90 -XX:+ExitOnOutOfMemoryError -cp "." -jar /deployments/apicurio-registry-storage-kafkasql-2.2.5.Final-runner.jar

Unfortunately, the result is similar to my previous tests. With:

  • empty topic
  • three terminals running watch -n 1 'curl -k -X POST https://dev3.example.com/apis/registry/v2/groups/loop/artifacts/loop-test/versions --data-binary @loop-test.json'
  • 1000Mi ram per container

The first container crashed in less than 2 hours, followed by the second one:
[screenshot]

There are 12145 messages in the Kafka topic, which was empty before, so I assume that 12145 versions were registered before the crash.

Regarding the afterward crash loop, Apicurio successfully imports 4813 version messages and crashes on message 4814 with DB sessions being disconnected:

/*SQL l:305 #:1 t:5*/INSERT INTO versions (globalId, tenantId, groupId, artifactId, version, versionId, state, name, description, createdBy, createdOn, labels, properties, contentId) VALUES (?, ?, ?, ?, ?, (SELECT MAX(versionId) + 1 FROM versions WHERE tenantId = ? AND groupId = ? AND artifactId = ?), ?, ?, ?, ?, ?, ?, ?, ?) {1: 4813, 2: '_', 3: 'loop', 4: 'loop-test', 5: NULL, 6: '_', 7: 'loop', 8: 'loop-test', 9: 'ENABLED', 10: 'TestV1', 11: NULL, 12: '', 13: TIMESTAMP '2022-09-14 20:34:35.866', 14: NULL, 15: NULL, 16: 1};2022-09-15T14:39:56.481001009+02:00 
2022-09-15 12:39:56 jdbc[3]: 
/*SQL l:134 #:1*/UPDATE versions SET version = (SELECT versionId FROM versions WHERE tenantId = ? AND globalId = ?) WHERE tenantId = ? AND globalId = ? {1: '_', 2: 4813, 3: '_', 4: 4813};2022-09-15T14:39:56.481146127+02:00 
2022-09-15 12:39:56 jdbc[3]: 
/*SQL l:85 #:1*/UPDATE artifacts SET latest = ? WHERE tenantId = ? AND groupId = ? AND artifactId = ? {1: 4813, 2: '_', 3: 'loop', 4: 'loop-test'};2022-09-15T14:39:56.481197933+02:00 
2022-09-15 12:39:56 jdbc[3]: 
/*SQL l:176 #:1*/SELECT v.*, a.type FROM versions v JOIN artifacts a ON v.tenantId = a.tenantId AND v.groupId = a.groupId AND v.artifactId = a.artifactId WHERE v.tenantId = ? AND v.globalId = ? {1: '_', 2: 4813};2022-09-15T14:39:56.481242912+02:00 
2022-09-15 12:39:56 jdbc[3]: 
/*SQL */COMMIT;2022-09-15T14:39:56.481396945+02:00 
2022-09-15 12:39:56 jdbc[3]: 
/*SQL */COMMIT;2022-09-15T14:39:56.481449487+02:00 
2022-09-15 12:39:56 DEBUG <_> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Clearing tenant id after message processed
2022-09-15 12:39:56 DEBUG <> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Kafka message successfully processed. Notifying listeners of response.
2022-09-15 12:39:56 DEBUG <> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Processing Kafka message with UUID: a21278a0-37a6-4e4e-a7c6-d37c107a88b4
2022-09-15 12:39:56 jdbc[3]: 
/*SQL l:59 #:1*/SELECT value FROM sequences WHERE name = ? AND tenantId = ? {1: 'globalId', 2: '_'};2022-09-15T14:39:56.481917475+02:00 
2022-09-15 12:39:56 jdbc[3]: 
/*SQL l:81 #:1*/MERGE INTO sequences (tenantId, name, value) KEY (tenantId, name) VALUES(?, ?, ?) {1: '_', 2: 'globalId', 3: 4814};2022-09-15T14:39:56.481980150+02:00 
2022-09-15 12:39:56 jdbc[3]: 
/*SQL */COMMIT;2022-09-15T14:39:56.482036138+02:00 
2022-09-15 12:39:56 jdbc[3]: 
/*SQL */COMMIT;2022-09-15T14:39:56.482071276+02:00 
2022-09-15 12:39:56 DEBUG <_> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Clearing tenant id after message processed
2022-09-15 12:39:56 DEBUG <> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Kafka message successfully processed. Notifying listeners of response.
2022-09-15 12:39:56 DEBUG <> [io.apicurio.registry.storage.impl.kafkasql.sql.KafkaSqlSink] (KSQL Kafka Consumer Thread) Processing Kafka message with UUID: 98c640f4-42bb-4934-a2a2-36d4491af11c
2022-09-15 12:39:56 jdbc[3]: 
/*SQL l:76 #:1*/SELECT c.contentId FROM content c WHERE c.contentHash = ? AND c.tenantId = ? {1: '809d42dc9a75f7ed8d99d776b1aa09f5c6dc35b24d0e8a586c6cd73da3747795', 2: '_'};2022-09-15T14:39:56.482294448+02:00 
2022-09-15 12:39:56 DEBUG <_> [io.apicurio.registry.storage.impl.sql.AbstractSqlRegistryStorage] (KSQL Kafka Consumer Thread) Updating artifact loop loop-test with a new version (content).
2022-09-15 12:39:56 jdbc[3]: 
/*SQL l:315 #:1*/SELECT a.*, v.contentId, v.globalId, v.version, v.versionId, v.state, v.name, v.description, v.labels, v.properties, v.createdBy AS modifiedBy, v.createdOn AS modifiedOn FROM artifacts a JOIN versions v ON a.tenantId = v.tenantId AND a.latest = v.globalId WHERE a.tenantId = ? AND a.groupId = ? AND a.artifactId = ? {1: '_', 2: 'loop', 3: 'loop-test'};2022-09-15T14:39:56.482444514+02:00 
2022-09-15 12:39:56 jdbc[3]: 
/*SQL l:305 #:1 t:7*/INSERT INTO versions (globalId, tenantId, groupId, artifactId, version, versionId, state, name, description, createdBy, createdOn, labels, properties, contentId) VALUES (?, ?, ?, ?, ?, (SELECT MAX(versionId) + 1 FROM versions WHERE tenantId = ? AND groupId = ? AND artifactId = ?), ?, ?, ?, ?, ?, ?, ?, ?) {1: 4814, 2: '_', 3: 'loop', 4: 'loop-test', 5: NULL, 6: '_', 7: 'loop', 8: 'loop-test', 9: 'ENABLED', 10: 'TestV1', 11: NULL, 12: '', 13: TIMESTAMP '2022-09-14 20:34:38.635', 14: NULL, 15: NULL, 16: 1};2022-09-15T14:39:56.489767315+02:00 
2022-09-15 12:39:56 DEBUG <_> [io.apicurio.registry.storage.impl.sql.SqlStatementsProducer] (KSQL Kafka Consumer Thread) Creating an instance of ISqlStatements for DB: h2
2022-09-15 12:39:56 database: disconnecting session #32022-09-15T14:39:56.494199689+02:00 
2022-09-15 12:39:56 database: disconnected session #3
2022-09-15 12:39:56 database: disconnecting session #4
2022-09-15 12:39:56 database: disconnected session #4
2022-09-15 12:39:56 database: disconnecting session #5
2022-09-15 12:39:56 database: disconnected session #5
2022-09-15 12:39:56 database: disconnecting session #6
2022-09-15 12:39:56 database: disconnected session #6
2022-09-15 12:39:56 database: disconnecting session #7
2022-09-15 12:39:56 database: disconnected session #7
2022-09-15 12:39:56 database: disconnecting session #8
2022-09-15 12:39:56 database: disconnected session #8
2022-09-15 12:39:56 database: disconnecting session #9
2022-09-15 12:39:56 database: disconnected session #9
2022-09-15 12:39:56 database: disconnecting session #10
2022-09-15 12:39:56 database: disconnected session #10
2022-09-15 12:39:56 database: disconnecting session #11
2022-09-15 12:39:56 database: disconnected session #11
2022-09-15 12:39:56 database: disconnecting session #12
2022-09-15 12:39:56 database: disconnected session #12
2022-09-15 12:39:56 database: disconnecting session #13
2022-09-15 12:39:56 database: disconnected session #13
2022-09-15 12:39:56 database: disconnecting session #14
2022-09-15 12:39:56 database: disconnected session #14
2022-09-15 12:39:56 database: disconnecting session #15
2022-09-15 12:39:56 database: disconnected session #15
2022-09-15 12:39:56 database: disconnecting session #16
2022-09-15 12:39:56 database: disconnected session #16
2022-09-15 12:39:56 database: disconnecting session #17
2022-09-15 12:39:56 database: disconnected session #17
2022-09-15 12:39:56 database: disconnecting session #18
2022-09-15 12:39:56 database: disconnected session #18
2022-09-15 12:39:56 database: disconnecting session #19
2022-09-15 12:39:56 database: disconnected session #19
2022-09-15 12:39:56 database: disconnecting session #20
2022-09-15 12:39:56 database: disconnected session #20
2022-09-15 12:39:56 database: disconnecting session #21
2022-09-15 12:39:56 database: disconnected session #21
2022-09-15 12:39:56 database: disconnecting session #22
2022-09-15 12:39:56 database: closing mem:registry_db
2022-09-15 12:39:56 jdbc[3]: exception2022-09-15T14:39:56.501918164+02:00 
2022-09-15 12:39:56 database: closed
2022-09-15 12:39:56 database: disconnected session #22
2022-09-15 12:39:56 jdbc[3]: exception2022-09-15T14:39:56.503058684+02:00 
2022-09-15 12:39:56 DEBUG <> [io.quarkus.arc.impl.ArcContainerImpl] (Shutdown thread) ArC DI container shut down
2022-09-15 12:39:56 WARN <_> [com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord] (KSQL Kafka Consumer Thread) ARJUNA016045: attempted rollback of < formatId=131077, gtrid_length=35, bqual_length=36, tx_uid=0:ffffc0a84e92:9db3:63231d04:bc16, node_name=quarkus, branch_uid=0:ffffc0a84e92:9db3:63231d04:bc18, subordinatenodename=null, eis_name=0 > (io.agroal.narayana.LocalXAResource@7847fb17) failed with exception code XAException.XAER_RMERR: javax.transaction.xa.XAException: Error trying to transactionRollback local transaction: The object is already closed [90007-197]
	at io.agroal.narayana.LocalXAResource.xaException(LocalXAResource.java:140)
	at io.agroal.narayana.LocalXAResource.xaException(LocalXAResource.java:134)

@andreaTP

I didn't use JAVA_MAX_MEM_RATIO, but I guess it is the same as -XX:MaxRAMPercentage, am I right?

Not sure, but, according to your tests, this should not be a major difference.

did not specify memory request, only memory limit. But the result should be the same I guess.

This is not accurate, unfortunately: requests set the minimum resources on the node to be reserved for the Pod, and depending on the environment (e.g. with/without an autoscaler, the "size" of the node machines, neighbors, etc.) this might affect the result.

The main difference, in my opinion, is in the amount of messages:

I will try another run increasing the number of messages and let you know the results.

  1. for the followup loop crash

Yup, I kill the pod and restart it to see what happens and it always comes back nicely.

@andreaTP

@petolexa I have tested with 20000+ versions (which means more messages in Kafka), and Registry is behaving fairly well.

In order:

  • the pod starts and re-starts correctly
  • there are some transient(?) issues in getting the Global ID, but it seems to slowly recover

All in all, I do not see the mentioned database disconnection under any circumstances, and without a reproducer that's impossible to debug.

I encourage you to set memory requests accordingly, but, after that, you would need to bisect your setup until the error is reproducible.
Can you also try to collect the k8s events at the moment the Pod crashes? I suspect that it is being killed externally for some reason.
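For example, something along these lines should capture them (the pod name is a placeholder; take the real one from kubectl get pods):

```shell
# Dump the k8s events for a given pod, newest last, so we can see
# whether the kubelet killed it (OOMKilled, probe failures, etc.).
pod_events() {
  kubectl get events \
    --field-selector "involvedObject.name=$1" \
    --sort-by='.lastTimestamp'
}

# Example (uncomment with your real pod name):
# pod_events apicurio-registry-0
```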

@andreaTP

@petolexa at this point I'm going to close this issue as "Can't reproduce", but feel free to re-open this or another one if you manage to get a reproducer together!

@andreaTP closed this as not planned (won't fix, can't repro, duplicate, or stale) on Sep 15, 2022
@andreaTP added the Wontfix and priority/low labels and removed the priority/normal and Bug labels on Sep 15, 2022
@petolexa

Thank you @andreaTP for your effort and for the tests.
I agree that in these circumstances it makes no sense to leave the issue open.
My test with requests configured and only 1 replica (I am curious whether it makes a difference) is running; I will post the results here for the record.

@petolexa

Results with the same resources and one replica as you had:

#pod replicas
replicas: 2
#pod resources limits
resources:
  requests:
    memory: "500Mi"
    cpu: "250m"
  limits:
    memory: "1000Mi"
    cpu: "500m"

were very similar; the container crashed the same way:
[screenshot]

K8s events only show repeated liveness and healthiness issues as a result of the crash (sorry for the ugly anonymization):
[screenshot]

I also tested with the new 2.3.0 version just to give it a shot, but with no more success.

Thank you for your time spent on the tests. If I find some way to fix it, or make it globally reproducible, I'll refresh it.

@andreaTP

Hi @petolexa !
Thanks for sharing those results!

only show repeated liveness and healthiness issue as the result of the crash

I think this analysis is not correct; from the events you shared it looks pretty clear that the pod is getting killed because of probe failures (not the other way around).
This also explains the "disconnections" you are observing; now I understand that they are caused by the Pod receiving a SIGTERM.

I encourage you to tweak the probes to have a much more relaxed frequency and timeouts as a first step.
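For example, something in this direction (the values are purely illustrative; tune them to how long your instance actually needs to replay the KafkaSQL journal):

```yaml
# Illustrative relaxed liveness probe; give the pod more time and more
# failed checks before the kubelet kills it during journal replay.
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
    scheme: HTTP
  initialDelaySeconds: 30
  periodSeconds: 30
  timeoutSeconds: 10
  failureThreshold: 10
```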

@petolexa

Hi @andreaTP, you were right! I finally understood what you were pointing at in your last comment.

We had the default liveness and readiness probes set up:

          livenessProbe:
            failureThreshold: 3
            httpGet:
              path: /health/live
              port: 8080
              scheme: HTTP
            initialDelaySeconds: 5
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 5
          readinessProbe:
            failureThreshold: 3
            httpGet:
              path: /health/ready
              port: 8080
              scheme: HTTP
            initialDelaySeconds: 5
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 5

and the container just didn't have enough time to process all the messages (with x versions of one artifact) after the reboot.

I added a startupProbe, which was created to handle exactly such cases:

          startupProbe:
            failureThreshold: 30
            httpGet:
              path: /health/live
              port: 8080
              scheme: HTTP
            initialDelaySeconds: 5
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 5

and the container was able to run properly after 5-6 failures of the startup probe, which means 55-65 seconds.

Thank you for pointing me to the right direction :)

@andreaTP

Thanks a lot for getting back, @petolexa! Appreciated! And happy that we solved the mystery 🙂
