Concurrent system tests trigger intermittent deadlock in Derby #2837

punktilious · 2021-10-06T08:29:00Z

Describe the bug

The system tests occasionally fail due to locking in Derby:

[10/5/21, 22:15:54:977 UTC] 0000002e com.ibm.fhir.persistence.jdbc.dao.impl.ResourceReferenceDAO E SELECT code_system_name, code_system_id FROM code_systems WHERE code_system_name IN (?,?,?,?,?)^M
java.sql.SQLTransactionRollbackException: A lock could not be obtained due to a deadlock, cycle of locks and waiters is:
Lock : ROW, CODE_SYSTEMS, (1,30)
Waiting XID : {28669, S} , APP, SELECT code_system_name, code_system_id FROM code_systems WHERE code_system_name IN (?,?,?,?,?)
Granted XID : {28664, X}
Lock : ROW, CODE_SYSTEMS, (1,25)
Waiting XID : {28664, S} , APP, SELECT code_system_name, code_system_id FROM code_systems WHERE code_system_name IN (?,?,?,?,?,?,?,?)
Granted XID : {28669, X}
. The selected victim is XID : 28669.^M

Environment
4.10.0

To Reproduce
Run the server integration test suite using Derby multiple times until the failure occurs.

Expected behavior
No deadlock.

Additional context
The deadlock happens when two threads are attempting to populate the common values table - in this case CODE_SYSTEMS. This typically happens early on when the table is empty, or a batch of data being ingested in parallel contains new values which are shared among resources.

The conflict occurs when the transaction data is being flushed to the database:

        at com.ibm.fhir.persistence.jdbc.dao.impl.ResourceReferenceDAO.upsertCodeSystems(ResourceReferenceDAO.java:388)^M
        at com.ibm.fhir.persistence.jdbc.dao.impl.ResourceReferenceDAO.persist(ResourceReferenceDAO.java:825)^M
        at com.ibm.fhir.persistence.jdbc.impl.FHIRPersistenceJDBCImpl.persistResourceTokenValueRecords(FHIRPersistenceJDBCImpl.java:2801)^M
        at com.ibm.fhir.persistence.jdbc.impl.ParameterTransactionDataImpl.persist(ParameterTransactionDataImpl.java:66)^M
        at com.ibm.fhir.persistence.jdbc.dao.impl.TransactionDataImpl.lambda$persist$0(TransactionDataImpl.java:46)^M
        at java.base/java.util.HashMap$Values.forEach(HashMap.java:976)^M
        at com.ibm.fhir.persistence.jdbc.dao.impl.TransactionDataImpl.persist(TransactionDataImpl.java:46)^M
        at com.ibm.fhir.persistence.jdbc.impl.CacheTransactionSync.beforeCompletion(CacheTransactionSync.java:57)^

To avoid a deadlock, we should endeavour to always request/obtain locks in the same order. This means:

Sort the list of values before we insert.
Sort the list of values in the above select statement so that the IN list is collated in the same way as the values sorted in item 1.

This will ensure that inserts don't conflict with other inserts, and the read (S) locks will be requested in the same order as the write (X) locks are requested and granted.

In ResourceReferenceDAO#persist, we need to sort systemMisses before the upsert:

        // Grab the ids for all the code-systems, and upsert any misses
        List<ResourceTokenValueRec> systemMisses = new ArrayList<>();
        cache.resolveCodeSystems(records, systemMisses);
        cache.resolveCodeSystems(tagRecs, systemMisses);
        cache.resolveCodeSystems(securityRecs, systemMisses);
        upsertCodeSystems(systemMisses);

Do the same for the other collections before their upserts.

The sort needs to happen lower in the code, within:

    public void upsertCodeSystems(List<ResourceTokenValueRec> systems) {

This is because the list of system names is converted to a Set for uniqueness. The number of objects in this set will generally be relatively small (<<100). Either we insert the values into a SortedSet, or convert the Set to a List and sort that separately. There's probably not a lot of difference in performance when dealing with small object counts.

The text was updated successfully, but these errors were encountered:

punktilious · 2021-10-15T11:46:05Z

Due to enhancements in the FHIR search implementation, new parameter names are created at runtime which can trigger more deadlocks with Derby. This has been addressed in #2857 by preloading the parameter_names table with all the parameter names discovered after a successful run-through of the system tests.

lmsurpre · 2021-10-25T14:57:34Z

The deadlock issue seems to be addressed (per recent CI results).
Unfortunately, there is still an intermittent failure in the CI from actions running with derby...specifically contention on the sequence allocation during ingestion of many resources.

punktilious added the bug Something isn't working label Oct 6, 2021

lmsurpre assigned punktilious Oct 11, 2021

lmsurpre added the P2 Priority 2 - Should Have label Oct 11, 2021

lmsurpre added this to the Sprint 2021-14 milestone Oct 11, 2021

punktilious mentioned this issue Oct 15, 2021

Deadlock creating code_systems in Derby #1417

Closed

lmsurpre closed this as completed Oct 25, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Concurrent system tests trigger intermittent deadlock in Derby #2837

Concurrent system tests trigger intermittent deadlock in Derby #2837

punktilious commented Oct 6, 2021 •

edited

Loading

punktilious commented Oct 15, 2021

lmsurpre commented Oct 25, 2021

Concurrent system tests trigger intermittent deadlock in Derby #2837

Concurrent system tests trigger intermittent deadlock in Derby #2837

Comments

punktilious commented Oct 6, 2021 • edited Loading

punktilious commented Oct 15, 2021

lmsurpre commented Oct 25, 2021

punktilious commented Oct 6, 2021 •

edited

Loading