Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deadlock during concurrent updates on Derby #2420

Closed
lmsurpre opened this issue May 24, 2021 · 1 comment
Closed

Deadlock during concurrent updates on Derby #2420

lmsurpre opened this issue May 24, 2021 · 1 comment
Assignees
Labels
bug Something isn't working configuration persistence

Comments

@lmsurpre
Copy link
Member

The check to catch the version conflict was a relatively simple fix. Cleaned things up a bit because for this implementation there's no need to support an asserted version id (which might not be the latest version id - useful for logical replication). However the fix did expose a deadlock scenario for Derby caused by the default isolation level being TRANSACTION_REPEATABLE_READ. This resulted in shared locks lingering on rows which were selected when reading the current version of the resource earlier in the transaction. In Derby, these locks are not compatible with the exclusive lock requested by the UPDATE xx_logical_resources statement. The resulting deadlock graph looks something like this:

A lock could not be obtained due to a deadlock, cycle of locks and waiters is:
Lock : ROW, LOGICAL_RESOURCES, (115,11)
  Waiting XID : {74421, U} , APP, SELECT logical_resource_id FROM logical_resources WHERE resource_type_id = ? AND logical_id = ? FOR UPDATE WITH RS
  Granted XID : {74409, U}
Lock : ROW, PATIENT_LOGICAL_RESOURCES, (8,29)
  Waiting XID : {74409, X} , APP, UPDATE Patient_logical_resources SET current_resource_id = ?, is_deleted = ?, last_updated = ?, version_id = ? WHERE logical_resource_id = ?
  Granted XID : {74401, S} , {74416, S} , {74409, S} , {74421, S}
. The selected victim is XID : 74421.

The solution is to set the isolation level for Derby datasources in openliberty:

     <dataSource id="fhirDatasourceBootstrapDefaultDefault" jndiName="jdbc/fhir_default_default" type="javax.sql.XADataSource" statementCacheSize="200" syncQueryTimeoutWithTransactionTimeout="true" isolationLevel="TRANSACTION_READ_COMMITTED">
        <jdbcDriver javax.sql.XADataSource="org.apache.derby.jdbc.EmbeddedXADataSource" libraryRef="fhirSharedLib"/>
            <properties.derby.embedded createDatabase="create" databaseName="derby/fhirDB"/>
        <connectionManager maxPoolSize="50" minPoolSize="10"/>
    </dataSource>

As part of the effort to uncover the deadlock, the stored procedure implementations were simplified by removing an update on xx_logical_resources when the resource was new. The new solution allocates the resource_id early and uses it to set the xx_logical_resources.current_resource_id column before the actual resource row is created. This should be particularly helpful for PostgreSQL, because it avoids creating a tombstone which then needs to be vacuumed.

Originally posted by @punktilious in #2405 (comment)

@prb112 prb112 added persistence bug Something isn't working labels May 24, 2021
lmsurpre added a commit that referenced this issue May 24, 2021
Previously this was flagged as a db2-only test, which helped contribute
to us missing this issue.

Additionally, this test was written to highlight a potential issue but
didn't yet take into account our solution which is to return a 409 for
one of the updates (in the case of a conflict).

The updated test will retry on a tight loop, ensuring that we can
continue to make progress in the face of contentious PUT requests to the
same resource.

Signed-off-by: Lee Surprenant <lmsurpre@us.ibm.com>
lmsurpre added a commit that referenced this issue May 24, 2021
Previously this was flagged as a db2-only test, which helped contribute
to us missing this issue.

Additionally, this test was written to highlight a potential issue but
didn't yet take into account our solution which is to return a 409 for
one of the updates (in the case of a conflict).

The updated test will retry on a tight loop, ensuring that we can
continue to make progress in the face of contentious PUT requests to the
same resource.

Signed-off-by: Lee Surprenant <lmsurpre@us.ibm.com>
lmsurpre added a commit that referenced this issue May 24, 2021
Previously this was flagged as a db2-only test, which helped contribute
to us missing this issue.

Additionally, this test was written to highlight a potential issue but
didn't yet take into account our solution which is to return a 409 for
one of the updates (in the case of a conflict).

The updated test will retry on a tight loop, ensuring that we can
continue to make progress in the face of contentious PUT requests to the
same resource.

Signed-off-by: Lee Surprenant <lmsurpre@us.ibm.com>
lmsurpre added a commit that referenced this issue May 24, 2021
Signed-off-by: Lee Surprenant <lmsurpre@us.ibm.com>
lmsurpre added a commit that referenced this issue May 24, 2021
Signed-off-by: Lee Surprenant <lmsurpre@us.ibm.com>
@lmsurpre
Copy link
Member Author

lmsurpre commented Jun 2, 2021

I updated our existing ConcurrentUpdateTest to retry in the case of a 409 Conflict response.
Additionally I removed the flags that caused us to skip this test in our CI and so now we should be running this against all db types on each PR.

Specifically, the test forks 12 threads (easy to change) and asks each one to update the same resource in parallel. Because each thread will retry (in a tight loop) upon failure, this shows that one thread should always win (i.e. it continues to make progress despite the contention), that we avoid a deadlock in such cases, and that no updates are lost.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working configuration persistence
Projects
None yet
Development

No branches or pull requests

3 participants