Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Concurrent system tests trigger intermittent deadlock in Derby #2837

Closed
punktilious opened this issue Oct 6, 2021 · 2 comments
Closed

Concurrent system tests trigger intermittent deadlock in Derby #2837

punktilious opened this issue Oct 6, 2021 · 2 comments
Assignees
Labels
bug Something isn't working P2 Priority 2 - Should Have

Comments

@punktilious
Copy link
Collaborator

punktilious commented Oct 6, 2021

Describe the bug

The system tests occasionally fail due to locking in Derby:

[10/5/21, 22:15:54:977 UTC] 0000002e com.ibm.fhir.persistence.jdbc.dao.impl.ResourceReferenceDAO E SELECT code_system_name, code_system_id FROM code_systems WHERE code_system_name IN (?,?,?,?,?)^M
java.sql.SQLTransactionRollbackException: A lock could not be obtained due to a deadlock, cycle of locks and waiters is:
Lock : ROW, CODE_SYSTEMS, (1,30)
Waiting XID : {28669, S} , APP, SELECT code_system_name, code_system_id FROM code_systems WHERE code_system_name IN (?,?,?,?,?)
Granted XID : {28664, X}
Lock : ROW, CODE_SYSTEMS, (1,25)
Waiting XID : {28664, S} , APP, SELECT code_system_name, code_system_id FROM code_systems WHERE code_system_name IN (?,?,?,?,?,?,?,?)
Granted XID : {28669, X}
. The selected victim is XID : 28669.^M

Environment
4.10.0

To Reproduce
Run the server integration test suite using Derby multiple times until the failure occurs.

Expected behavior
No deadlock.

Additional context
The deadlock happens when two threads are attempting to populate the common values table - in this case CODE_SYSTEMS. This typically happens early on when the table is empty, or a batch of data being ingested in parallel contains new values which are shared among resources.

The conflict occurs when the transaction data is being flushed to the database:

        at com.ibm.fhir.persistence.jdbc.dao.impl.ResourceReferenceDAO.upsertCodeSystems(ResourceReferenceDAO.java:388)^M
        at com.ibm.fhir.persistence.jdbc.dao.impl.ResourceReferenceDAO.persist(ResourceReferenceDAO.java:825)^M
        at com.ibm.fhir.persistence.jdbc.impl.FHIRPersistenceJDBCImpl.persistResourceTokenValueRecords(FHIRPersistenceJDBCImpl.java:2801)^M
        at com.ibm.fhir.persistence.jdbc.impl.ParameterTransactionDataImpl.persist(ParameterTransactionDataImpl.java:66)^M
        at com.ibm.fhir.persistence.jdbc.dao.impl.TransactionDataImpl.lambda$persist$0(TransactionDataImpl.java:46)^M
        at java.base/java.util.HashMap$Values.forEach(HashMap.java:976)^M
        at com.ibm.fhir.persistence.jdbc.dao.impl.TransactionDataImpl.persist(TransactionDataImpl.java:46)^M
        at com.ibm.fhir.persistence.jdbc.impl.CacheTransactionSync.beforeCompletion(CacheTransactionSync.java:57)^

To avoid a deadlock, we should endeavour to always request/obtain locks in the same order. This means:

  1. Sort the list of values before we insert.
  2. Sort the list of values in the above select statement so that the IN list is collated in the same way as the values sorted in item 1.

This will ensure that inserts don't conflict with other inserts, and the read (S) locks will be requested in the same order as the write (X) locks are requested and granted.

In ResourceReferenceDAO#persist, we need to sort systemMisses before the upsert:

        // Grab the ids for all the code-systems, and upsert any misses
        List<ResourceTokenValueRec> systemMisses = new ArrayList<>();
        cache.resolveCodeSystems(records, systemMisses);
        cache.resolveCodeSystems(tagRecs, systemMisses);
        cache.resolveCodeSystems(securityRecs, systemMisses);
        upsertCodeSystems(systemMisses);

Do the same for the other collections before their upserts.

The sort needs to happen lower in the code, within:

    public void upsertCodeSystems(List<ResourceTokenValueRec> systems) {

This is because the list of system names is converted to a Set for uniqueness. The number of objects in this set will generally be relatively small (<<100). Either we insert the values into a SortedSet, or convert the Set to a List and sort that separately. There's probably not a lot of difference in performance when dealing with small object counts.

@punktilious punktilious added the bug Something isn't working label Oct 6, 2021
@lmsurpre lmsurpre added the P2 Priority 2 - Should Have label Oct 11, 2021
@lmsurpre lmsurpre added this to the Sprint 2021-14 milestone Oct 11, 2021
@punktilious
Copy link
Collaborator Author

Due to enhancements in the FHIR search implementation, new parameter names are created at runtime which can trigger more deadlocks with Derby. This has been addressed in #2857 by preloading the parameter_names table with all the parameter names discovered after a successful run-through of the system tests.

@lmsurpre
Copy link
Member

The deadlock issue seems to be addressed (per recent CI results).
Unfortunately, there is still an intermittent failure in the CI from actions running with derby...specifically contention on the sequence allocation during ingestion of many resources.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working P2 Priority 2 - Should Have
Projects
None yet
Development

No branches or pull requests

2 participants