Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parallel PATCH calls result in optimistic locking exception #107

Open
fgrtl opened this issue Apr 4, 2024 · 13 comments
Open

Parallel PATCH calls result in optimistic locking exception #107

fgrtl opened this issue Apr 4, 2024 · 13 comments

Comments

@fgrtl
Copy link

fgrtl commented Apr 4, 2024

Hey, as I have encountered the same problem as mentioned in #90, my company has acquired the SCIM for Keycloak enterprise edition to see if we could fix this issue with the use of KC24.

Therefore, I created a test setup with the kc-24-2.0.0-enterprise plugin installed on KC24 and a small Java application that mocks Entra.

The Java application tries to parallel create multiple users and add them to the same group.

The sequence looks like this:

  1. Query for the User based on the display name
  2. Retrieve the User if it was found / Create the user if no results were returned
  3. Query for the Group (always the same) based on the display name
  4. Retrieve the Group
  5. PATCH the Group and add the User

The result is the following:
All threads die due to an optimistic locking exception in Keycloak except a single one.
The weird thing is that the thread which gets the 500 internal server error still patched the user into the group in the same request.

Internal server error:

2024-04-04 10:25:27,775 WARN [io.agroal.pool] (executor-thread-22) Datasource '': JDBC resources leaked: 2 ResultSet(s) and 2 Statement(s)
2024-04-04 10:25:27,776 ERROR [de.captaingoldfish.scim.sdk.keycloak.scim.AbstractEndpoint] (executor-thread-22) Failed to commit or rollback the transaction: org.hibernate.StaleObjectStateException: Row was updated or deleted by another transaction (or unsaved-value mapping was incorrect) : [org.keycloak.models.jpa.entities.GroupAttributeEntity#1a3a3e46-69a2-4b3a-b01a-a906484c8f0a]
2024-04-04 10:25:27,776 ERROR [de.captaingoldfish.scim.sdk.keycloak.scim.AbstractEndpoint] (executor-thread-22) org.hibernate.StaleObjectStateException: Row was updated or deleted by another transaction (or unsaved-value mapping was incorrect) : [org.keycloak.models.jpa.entities.GroupAttributeEntity#1a3a3e46-69a2-4b3a-b01a-a906484c8f0a]: org.keycloak.models.ModelException: org.hibernate.StaleObjectStateException: Row was updated or deleted by another transaction (or unsaved-value mapping was incorrect) : [org.keycloak.models.jpa.entities.GroupAttributeEntity#1a3a3e46-69a2-4b3a-b01a-a906484c8f0a]
at org.keycloak.connections.jpa.PersistenceExceptionConverter.convert(PersistenceExceptionConverter.java:99)
at org.keycloak.connections.jpa.JpaExceptionConverter.convert(JpaExceptionConverter.java:31)
at org.keycloak.transaction.JtaTransactionWrapper.lambda$handleException$0(JtaTransactionWrapper.java:65)
at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
at java.base/java.util.HashMap$ValueSpliterator.forEachRemaining(HashMap.java:1779)
at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:596)
at org.keycloak.transaction.JtaTransactionWrapper.handleException(JtaTransactionWrapper.java:67)
at org.keycloak.transaction.JtaTransactionWrapper.commit(JtaTransactionWrapper.java:92)
at org.keycloak.services.DefaultKeycloakTransactionManager.commit(DefaultKeycloakTransactionManager.java:136)
at de.captaingoldfish.scim.sdk.keycloak.scim.AbstractEndpoint.commitOrRollbackTransaction(AbstractEndpoint.java:159)
at de.captaingoldfish.scim.sdk.keycloak.scim.AbstractEndpoint.lambda$doAfterExecution$6(AbstractEndpoint.java:132)
at de.captaingoldfish.scim.sdk.server.endpoints.ResourceEndpoint.handleRequest(ResourceEndpoint.java:201)
at de.captaingoldfish.scim.sdk.keycloak.scim.AbstractEndpoint.lambda$handleScimRequest$2(AbstractEndpoint.java:75)
at de.captaingoldfish.scim.sdk.keycloak.scim.RetryStrategy.doWithRetries(RetryStrategy.java:71)
at de.captaingoldfish.scim.sdk.keycloak.scim.AbstractEndpoint.handleScimRequest(AbstractEndpoint.java:72)
at de.captaingoldfish.scim.sdk.keycloak.scim.ScimResourceServerEndpoint.handleScimRequest(ScimResourceServerEndpoint.java:124)
at de.captaingoldfish.scim.sdk.keycloak.scim.ScimResourceServerEndpoint.handleScimPatchRequest(ScimResourceServerEndpoint.java:89)
at de.captaingoldfish.scim.sdk.keycloak.scim.ScimResourceServerEndpoint$quarkusrestinvoker$handleScimPatchRequest_e58d4f8a5c56e81acada2ec0ef3ac16143e93153.invoke(Unknown Source)
at org.jboss.resteasy.reactive.server.handlers.InvocationHandler.handle(InvocationHandler.java:29)
at io.quarkus.resteasy.reactive.server.runtime.QuarkusResteasyReactiveRequestContext.invokeHandler(QuarkusResteasyReactiveRequestContext.java:141)
at org.jboss.resteasy.reactive.common.core.AbstractResteasyReactiveContext.run(AbstractResteasyReactiveContext.java:147)
at io.quarkus.vertx.core.runtime.VertxCoreRecorder$14.runWith(VertxCoreRecorder.java:582)
at org.jboss.threads.EnhancedQueueExecutor$Task.run(EnhancedQueueExecutor.java:2513)
at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.run(EnhancedQueueExecutor.java:1538)
at org.jboss.threads.DelegatingRunnable.run(DelegatingRunnable.java:29)
at org.jboss.threads.ThreadLocalResettingRunnable.run(ThreadLocalResettingRunnable.java:29)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: org.hibernate.StaleObjectStateException: Row was updated or deleted by another transaction (or unsaved-value mapping was incorrect) : [org.keycloak.models.jpa.entities.GroupAttributeEntity#1a3a3e46-69a2-4b3a-b01a-a906484c8f0a]
at org.hibernate.engine.jdbc.mutation.internal.ModelMutationHelper.identifiedResultsCheck(ModelMutationHelper.java:75)
at org.hibernate.persister.entity.mutation.AbstractDeleteCoordinator.lambda$doStaticDelete$2(AbstractDeleteCoordinator.java:283)
at org.hibernate.engine.jdbc.mutation.internal.ModelMutationHelper.checkResults(ModelMutationHelper.java:50)
at org.hibernate.engine.jdbc.mutation.internal.AbstractMutationExecutor.performNonBatchedMutation(AbstractMutationExecutor.java:115)
at org.hibernate.engine.jdbc.mutation.internal.MutationExecutorSingleNonBatched.performNonBatchedOperations(MutationExecutorSingleNonBatched.java:40)
at org.hibernate.engine.jdbc.mutation.internal.AbstractMutationExecutor.execute(AbstractMutationExecutor.java:52)
at org.hibernate.persister.entity.mutation.AbstractDeleteCoordinator.doStaticDelete(AbstractDeleteCoordinator.java:279)
at org.hibernate.persister.entity.mutation.AbstractDeleteCoordinator.coordinateDelete(AbstractDeleteCoordinator.java:87)
at org.hibernate.persister.entity.AbstractEntityPersister.delete(AbstractEntityPersister.java:2979)
at org.hibernate.action.internal.EntityDeleteAction.execute(EntityDeleteAction.java:131)
at org.hibernate.engine.spi.ActionQueue.executeActions(ActionQueue.java:632)
at org.hibernate.engine.spi.ActionQueue.executeActions(ActionQueue.java:499)
at org.hibernate.event.internal.AbstractFlushingEventListener.performExecutions(AbstractFlushingEventListener.java:363)
at org.hibernate.event.internal.DefaultFlushEventListener.onFlush(DefaultFlushEventListener.java:41)
at org.hibernate.event.service.internal.EventListenerGroupImpl.fireEventOnEachListener(EventListenerGroupImpl.java:127)
at org.hibernate.internal.SessionImpl.doFlush(SessionImpl.java:1403)
at org.hibernate.internal.SessionImpl.managedFlush(SessionImpl.java:484)
at org.hibernate.internal.SessionImpl.flushBeforeTransactionCompletion(SessionImpl.java:2319)
at org.hibernate.internal.SessionImpl.beforeTransactionCompletion(SessionImpl.java:1976)
at org.hibernate.engine.jdbc.internal.JdbcCoordinatorImpl.beforeTransactionCompletion(JdbcCoordinatorImpl.java:439)
at org.hibernate.resource.transaction.backend.jta.internal.JtaTransactionCoordinatorImpl.beforeCompletion(JtaTransactionCoordinatorImpl.java:336)
at org.hibernate.resource.transaction.backend.jta.internal.synchronization.SynchronizationCallbackCoordinatorNonTrackingImpl.beforeCompletion(SynchronizationCallbackCoordinatorNonTrackingImpl.java:47)
at org.hibernate.resource.transaction.backend.jta.internal.synchronization.RegisteredSynchronization.beforeCompletion(RegisteredSynchronization.java:37)
at com.arjuna.ats.internal.jta.resources.arjunacore.SynchronizationImple.beforeCompletion(SynchronizationImple.java:52)
at com.arjuna.ats.arjuna.coordinator.TwoPhaseCoordinator.beforeCompletion(TwoPhaseCoordinator.java:351)
at com.arjuna.ats.arjuna.coordinator.TwoPhaseCoordinator.end(TwoPhaseCoordinator.java:69)
at com.arjuna.ats.arjuna.AtomicAction.commit(AtomicAction.java:138)
at com.arjuna.ats.internal.jta.transaction.arjunacore.TransactionImple.commitAndDisassociate(TransactionImple.java:1271)
at com.arjuna.ats.internal.jta.transaction.arjunacore.BaseTransaction.commit(BaseTransaction.java:104)
at io.quarkus.narayana.jta.runtime.NotifyingTransactionManager.commit(NotifyingTransactionManager.java:70)
at org.keycloak.transaction.JtaTransactionWrapper.commit(JtaTransactionWrapper.java:90)
... 20 more

2024-04-04 10:25:27,880 ERROR [de.captaingoldfish.scim.sdk.common.response.ErrorResponse] (executor-thread-22) Cannot access delegate without a transaction: de.captaingoldfish.scim.sdk.common.exceptions.InternalServerException: Cannot access delegate without a transaction
at de.captaingoldfish.scim.sdk.server.endpoints.ResourceEndpointHandler.patchResource(ResourceEndpointHandler.java:1070)
at de.captaingoldfish.scim.sdk.server.endpoints.ResourceEndpoint.resolveRequest(ResourceEndpoint.java:290)
at de.captaingoldfish.scim.sdk.server.endpoints.ResourceEndpoint.handleRequest(ResourceEndpoint.java:189)
at de.captaingoldfish.scim.sdk.keycloak.scim.AbstractEndpoint.lambda$handleScimRequest$1(AbstractEndpoint.java:95)
at org.keycloak.models.utils.KeycloakModelUtils.lambda$runJobInTransaction$1(KeycloakModelUtils.java:257)
at org.keycloak.models.utils.KeycloakModelUtils.runJobInTransactionWithResult(KeycloakModelUtils.java:379)
at org.keycloak.models.utils.KeycloakModelUtils.runJobInTransaction(KeycloakModelUtils.java:256)
at de.captaingoldfish.scim.sdk.keycloak.scim.AbstractEndpoint.lambda$handleScimRequest$2(AbstractEndpoint.java:86)
at de.captaingoldfish.scim.sdk.keycloak.scim.RetryStrategy.doWithRetries(RetryStrategy.java:71)
at de.captaingoldfish.scim.sdk.keycloak.scim.AbstractEndpoint.handleScimRequest(AbstractEndpoint.java:72)
at de.captaingoldfish.scim.sdk.keycloak.scim.ScimResourceServerEndpoint.handleScimRequest(ScimResourceServerEndpoint.java:124)
at de.captaingoldfish.scim.sdk.keycloak.scim.ScimResourceServerEndpoint.handleScimPatchRequest(ScimResourceServerEndpoint.java:89)
at de.captaingoldfish.scim.sdk.keycloak.scim.ScimResourceServerEndpoint$quarkusrestinvoker$handleScimPatchRequest_e58d4f8a5c56e81acada2ec0ef3ac16143e93153.invoke(Unknown Source)
at org.jboss.resteasy.reactive.server.handlers.InvocationHandler.handle(InvocationHandler.java:29)
at io.quarkus.resteasy.reactive.server.runtime.QuarkusResteasyReactiveRequestContext.invokeHandler(QuarkusResteasyReactiveRequestContext.java:141)
at org.jboss.resteasy.reactive.common.core.AbstractResteasyReactiveContext.run(AbstractResteasyReactiveContext.java:147)
at io.quarkus.vertx.core.runtime.VertxCoreRecorder$14.runWith(VertxCoreRecorder.java:582)
at org.jboss.threads.EnhancedQueueExecutor$Task.run(EnhancedQueueExecutor.java:2513)
at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.run(EnhancedQueueExecutor.java:1538)
at org.jboss.threads.DelegatingRunnable.run(DelegatingRunnable.java:29)
at org.jboss.threads.ThreadLocalResettingRunnable.run(ThreadLocalResettingRunnable.java:29)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.lang.IllegalStateException: Cannot access delegate without a transaction
at org.keycloak.models.cache.infinispan.RealmCacheSession.getGroupDelegate(RealmCacheSession.java:176)
at org.keycloak.models.cache.infinispan.RealmCacheSession.getGroupById(RealmCacheSession.java:907)
at org.keycloak.models.cache.infinispan.RealmAdapter.getGroupById(RealmAdapter.java:1450)
at de.captaingoldfish.scim.sdk.keycloak.scim.groups.GroupHandler.getResource(GroupHandler.java:216)
at de.captaingoldfish.scim.sdk.keycloak.scim.groups.GroupHandler.getResource(GroupHandler.java:61)
at de.captaingoldfish.scim.sdk.keycloak.scim.patch.AbstractPatchOperationHandler.getUpdatedResource(AbstractPatchOperationHandler.java:414)
at de.captaingoldfish.scim.sdk.server.patch.PatchRequestHandler.getUpdatedResource(PatchRequestHandler.java:197)
at de.captaingoldfish.scim.sdk.server.endpoints.ResourceEndpointHandler.lambda$patchResource$24(ResourceEndpointHandler.java:1032)
at de.captaingoldfish.scim.sdk.server.interceptor.NoopInterceptor.doAround(NoopInterceptor.java:21)
at de.captaingoldfish.scim.sdk.server.endpoints.ResourceEndpointHandler.patchResource(ResourceEndpointHandler.java:1011)
... 22 more

I have also testet this with KC18 and got the exact same result.

Should the fix already be in the kc-24-2.0.0-enterprise plugin?

Thanks in advance!

@Captain-P-Goldfish
Copy link
Owner

the reason that the result can still be found in the database is a built in retry-strategy.
If either an OptimisticLockException or StaleObjectStateException is thrown during processing the RetryStrategy will open a new transaction and try to repeat the operation from start. This is done up to 3 times. If an error occurred in the previous calls it is still logged.
So in your case the second try of the operation will probably have succeeded.

Does this answer the problem?

@fgrtl
Copy link
Author

fgrtl commented Apr 5, 2024

Thanks for the answer but how is entra handling this with a 500 response? And how does this effect the entra state (watermark)?

@Captain-P-Goldfish
Copy link
Owner

Hmm... I am not able to reproduce this behaviour with a 500 response.
I have some integrationtests that provoke this behaviour. A concurrent modification of the same resource in 100 requests and with an assertion that each request must return response-code 200. In cases in which I get a 500 response beause the retry-counter has reached its limit and the operation still fails, the operation is not saved in my case...

Could you give me some more details or maybe provide your mock-implementation? This might help to pinpoint your problem.

@fgrtl
Copy link
Author

fgrtl commented Apr 8, 2024

Sure! I have also tested it with KC18 and received a 500, but as you predicted, the user is not added to the group in KC18.

Code:
public void runKeycloakScimTest() {
    log.info("Run Keycloak-Entra SCIM Test");

    int threadAmount = 10;
    int userAmount = 1000;
    int usersPerThread = userAmount / threadAmount;

    Group group = new Group();
    group.setDisplayName("Test Group");
    try (ExecutorService executorService = Executors.newFixedThreadPool(threadAmount)) {
        for (int i = 0; i < threadAmount; i++) {
            executorService.execute(() -> {
                for (int j = 0; j < usersPerThread; j++) {
                    generateUsersAndAddToGroup(usersPerThread, Thread.currentThread()
                                                                     .getId(), group);
                }
            });
        }
    }
}

private void generateUsersAndAddToGroup(int amount, long threadId, Group group) {
    for (int i = 0; i < amount; i++) {
        User user = generateUser(UUID.randomUUID()
                                     .toString()
                                     .substring(0, 4) + "-" + threadId);
        user = createUser(user);
        createAndPatchGroup(group, user);
    }
}

private User createUser(User user) {
    log.info("Query Users for: {}", user.getDisplayName()
                                        .get());
    Optional<User> optScimObjectNode = get(String.format("?filter=userName+eq+%%22%s%%22", user.getUserName()
                                                                                               .get()), User.class);
    if (getTotalResultsFromGetCall(optScimObjectNode.get()) == 0) {
        user = post(user, User.class).get();
        log.info("Created User: {}", user.getDisplayName()
                                         .get());
    } else {
        // !PATCH not included if not matching!
        String userId = getIdFromGetCall(optScimObjectNode.get());
        user = get(String.format("/%s", userId), User.class).get();
        log.info("Retrieved User: {}", user.getDisplayName()
                                           .get());
    }
    return user;
}

private void createAndPatchGroup(Group group, User user) {
    log.info("Query Groups for: {}", group.getDisplayName()
                                          .get());
    Optional<Group> optScimObjectNode = get(
            String.format("?excludedAttributes=members&filter=displayName+eq+%%22%s%%22", group.getDisplayName()
                                                                                               .get()),
            Group.class);
    if (getTotalResultsFromGetCall(optScimObjectNode.get()) == 0) {
        group = post(group, Group.class).get();
        log.info("Created Group: {}", group.getDisplayName()
                                           .get());
    } else {
        // !PATCH not included if group name not matching!
        String groupId = getIdFromGetCall(optScimObjectNode.get());
        group = get(String.format("/%s", groupId), Group.class).get();
        log.info("Retrieved Group: {}", group.getDisplayName()
                                             .get());
    }
    addUserToGroup(group, user);
}

private void addUserToGroup(Group group, User user) {
    log.info("Trying to add User {} to Group {}", user.getDisplayName()
                                                      .get(), group.getDisplayName()
                                                                   .get());
    PatchRequestOperation patchRequestOperation = new PatchRequestOperation("members", PatchOp.ADD, null,
                                                                            createUserToGroupArrayNode(user));
    PatchOpRequest patchOpRequest = new PatchOpRequest(Collections.singletonList(patchRequestOperation));
    try {
        patch(group.getId()
                   .get(), patchOpRequest, Group.class);
    } catch (HttpServerErrorException e) {
        if (e.getStatusCode().is5xxServerError()) {
            log.info("Received 500 for User {}, continuing", user.getDisplayName().get());
        } else {
            throw e;
        }
    }
    log.info("Added User {} to Group {}", user.getDisplayName()
                                              .get(), group.getDisplayName()
                                                           .get());
}

@Captain-P-Goldfish
Copy link
Owner

I checked the first comment again and I am a little bit confused. How are you testing with kc-18? The plugin from https://scim-for-keycloak.de does not have support for Keycloak version before 20.

The kc-18 plugin is still based on the OpenSource version from github that you can download here and this version does not have handling for such errors.
So, I am not quite sure where this is heading now. From my point of view the error is handled to a certain degree in the scim-for-keycloak plugin at https://scim-for-keycloak.de. But the error cannot be handled absolutely reliably because I have limited the numbers of retries. It is just less error-prone because it retries it. The ideal way is telling the client not to send a lot of update-requests on a single resource. SCIM is defining the Bulk-Endpoint and allows several operations on patch to handle this.

But how does the kc-24.2.0.0 version into this? Do you have the exactly same problem with this version as with the OpenSource version? I cannot reproduce this with version kc-24-2.0.0

@fgrtl
Copy link
Author

fgrtl commented Apr 9, 2024

Sorry for the confusion. I have two separate KC instances running for testing, KC18 with the open source plugin and KC24 with the kc-24-2.0.0-enterprise plugin. I get the same 500 responses from both KC instances when testing the code above, but in KC18, the user is not added to the group. In KC24, on the other hand, the user is added to the group, but it still results in a 500 for me which might confuse leading IDPs.

The ideal way is telling the client not to send a lot of update-requests on a single resource. SCIM is defining the Bulk-Endpoint and allows several operations on patch to handle this.

I totally agree, but unfortunately, we might not be able to pressurize Microsoft into working with the SCIM standard.

@Captain-P-Goldfish
Copy link
Owner

I see.
Thanks for the clarification. The enterprise version should indeed not return a 500 status code. I will look into this!

@Captain-P-Goldfish
Copy link
Owner

I extended my integration-test but I am not able to reproduce it. This is my code:

(This test has 1000 users created in the database that are all getting added to a single group in parallel patch-requests) version kc-23-2.0.0

dynamicNodes.add(DynamicTest.dynamicTest("Provoke OptimisticLockException on Groups-endpoint", () -> {
      try
      {
        new ForkJoinPool(20).submit(() -> {
          IntStream.range(0, users.size()).parallel().mapToObj(index -> {

            final int userBulkId = 2 + index;
            final User randomUser = users.get(String.valueOf(userBulkId));
            Assertions.assertNotNull(randomUser, String.valueOf(userBulkId));


            final String patchRequest;
            patchRequest = """
               {
                "schemas": [
                  "urn:ietf:params:scim:api:messages:2.0:PatchOp"
                ],
                "Operations": [
                  {
                    "op": "add",
                    "path": "members",
                    "value": {
                      "value": "%s",
                      "type": "%s"
                    }
                  }
                ]
              }
               """.formatted(randomUser.getId().get(), ResourceTypeNames.USER);
            groupMembers.add(Member.builder()
                                   .value(randomUser.getId().get())
                                   .display(randomUser.getUserName().get())
                                   .type(ResourceTypeNames.USER)
                                   .build());
            return Triple.of(index, patchRequest, randomUser.getId().get());
          }).forEach(triple -> {
            final String patchRequest = triple.getMiddle();
            final String groupId = groups.get(String.valueOf(groupBulkId)).getId().get();

            ServerResponse<Group> response = scimRequestBuilder.patch(Group.class, EndpointPaths.GROUPS, groupId)
                                                               .setPatchResource(patchRequest)
                                                               .sendRequest();
            if (HttpStatus.NO_CONTENT == response.getHttpStatus() || HttpStatus.OK == response.getHttpStatus())
            {
              indexes.add(triple.getLeft());
            }
            else
            {
              ServerResponse<Group> groupResponse = scimRequestBuilder.get(Group.class, EndpointPaths.GROUPS, groupId)
                                                                      .sendRequest();
              Assertions.assertEquals(HttpStatus.OK, groupResponse.getHttpStatus());
              Group group = groupResponse.getResource();
              final String userMemberId = triple.getRight();
              boolean wasNotAdded = group.getMembers()
                                         .stream()
                                         .noneMatch(member -> member.getValue().get().equals(userMemberId));
              Assertions.assertTrue(wasNotAdded);
              membersNotAddedDueToError.add(userMemberId);
            }
          });
        }).get();
      }
      catch (Exception ex)
      {
        throw ex;
      }
    }));

    /*
     * this test will verify that the members that were added to the group in the previous were only added if no error was returned in the previous test
     */
    dynamicNodes.add(DynamicTest.dynamicTest("Verify members successfully added to group", () -> {
      {
        ServerResponse<ListResponse<User>> response = scimRequestBuilder.list(User.class, EndpointPaths.USERS)
                                                                        .filter("not(groups pr)")
                                                                        .get()
                                                                        .sendRequest();
        Assertions.assertEquals(HttpStatus.OK, response.getHttpStatus());
        log.warn("\n{}",
                 response.getResource()
                         .getListedResources()
                         .stream()
                         .map(user -> user.getUserName().get())
                         .collect(Collectors.joining("\n")));
      }
      {
        Collections.sort(indexes);
        ServerResponse<Group> response = scimRequestBuilder.get(Group.class,
                                                                EndpointPaths.GROUPS,
                                                                groups.get(String.valueOf(groupBulkId)).getId().get())
                                                           .sendRequest();
        Assertions.assertEquals(HttpStatus.OK, response.getHttpStatus());
        Group retrievedGroup = response.getResource();
        Assertions.assertEquals(users.size() - membersNotAddedDueToError.size(), retrievedGroup.getMembers().size());
      }
    }));

I think this code is a pretty good example and it provokes a lot of OptimisticLockExceptions in the process. Approximately 1% of the requests end up in status code 500 and the test does then verify that the member was not added to the group which is what we expect.
I am willing to fix the issue but I can't fix it if I cannot reproduce it...

@fgrtl
Copy link
Author

fgrtl commented Apr 22, 2024

Hey, I think I have found out the cause.
If E-Tag support is enabled in the SCIM settings, the behavior is exactly as you predicted.
If it is disabled, wrongfully 500 responses appear.

I created 1000 users parallel in KC24 and added them to the same group.

But I also checked the occurrences:

  • with E-Tag enabled:

I received 131–500 responses on the PATCH call without any of the users being added to the group.

  • without E-Tag enabled:

I received 177 - 500 responses to the PATCH call, with 169 users still added and 8 not added to the group.

@Captain-P-Goldfish
Copy link
Owner

My tests actually never had ETags activated to begin with but this would make the whole thing even more weird. ETags are calculated on the JSON response-body just after the response-body was generated. The ETag is a SHA-1 digest on the minified JSON-string. I have no idea how this should affect the whole thing.

But just to make sure, I tried both variations, with ETag and without and I get the same result in both cases... I still cannot reproduce this...

Which database are you using? Maybe we can find the problem there. I am testing with postgres

@fgrtl
Copy link
Author

fgrtl commented Apr 29, 2024

I totally get your concerns, and I am confused too.
This problem showed up in my Docker setup locally (using KC23 with an h2 DB) and in my KC24 test setup with a Postgres DB.
Could there be a problem with calculating the SHA-1 for the concurrent group resource when ETag is enabled?

@fgrtl
Copy link
Author

fgrtl commented Apr 30, 2024

Hey, I have some interesting news about the behavior of Entra for you.
We have collaborated with a customer to run a few tests for Entra Keycloak24 SCIM provisioning, including 5232 users and 8 groups (the number of replaced users and groups does not match these numbers because of old linked replaced ids).

We started two initial and one incremental cycle due to a misconfiguration of a custom attribute mapping.
No users or groups have been deleted in between any cycles.

In the first initial cycle, we tracked a few 500 responses when PATCHING group memberships.
The good news is that Entra retried those requests and received a 200 response.
This can be observed with Group: replacedGroupId11, first initial cycle: overall 993 calls, 35 x 500 responses, 958 x 200 responses, there should be 959 users in the group.
The missing PATCH call depends on Entra. We will look into why that particular call is missing (it is likely because they messed up the state).

In the second initial cycle, each PATCH call received a 200-word response. This is because no user or group has been deleted in Keycloak, but the state is still reset.
Entra does not check if a user is already assigned to a group and just makes a PATCH request.
On the other hand, before provisioning groups and users, they make a GET call and check if the object exists and already matches in the source and target systems. 

Here you can find our access_log: censored_access.log

@Captain-P-Goldfish
Copy link
Owner

Okay, I have a little bit of trouble following here, because I am not sure anymore if the error still seems to be on my side or if the error was found on Entra side:

  • I understand that from 1000 PATCH requests 35 did fail with a status code of 500. What I did not understand if the users were assigned despite the 500 status code to the group as reported before?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants