Skip to content
This repository has been archived by the owner on Dec 13, 2023. It is now read-only.

Data inconsistency caused by updateEventHandler in RedisEventHandlerDAO.java #3842

Closed
wy-hua opened this issue Nov 10, 2023 · 0 comments · Fixed by #3843
Closed

Data inconsistency caused by updateEventHandler in RedisEventHandlerDAO.java #3842

wy-hua opened this issue Nov 10, 2023 · 0 comments · Fixed by #3843
Labels
type: bug bugs/ bug fixes

Comments

@wy-hua
Copy link
Contributor

wy-hua commented Nov 10, 2023

Describe the bug

One of my critical business flow in Conductor <SQS_QUEUE_NAME> events --> event handlers --> workflow executions is in paralysis because of NullPointerException ocurring in the Conductor server. This kind of exception is thrown when handling an SQS message on the queue:<SQS_QUEUE_NAME> queue. After some invetigation and experiments, I believe this issue is related to the absence of an event handler in the Redis, even though its primary key appears in the index of event handlers by event. This data inconsistency leads to NullPointerException mentioned above.

Full logs in Conductor server can be found here

3442759355 [RxComputationScheduler-2] ERROR com.netflix.conductor.core.events.DefaultEventProcessor [] - Error handling message: 4f0491de-eb2a-43b0-b27b-ef1225acc5ba on queue:<SQS_QUEUE_NAME>
java.lang.NullPointerException: null
	at com.netflix.conductor.redis.dao.RedisEventHandlerDAO.getEventHandlersForEvent(RedisEventHandlerDAO.java:124) ~[conductor-redis-persistence-3.14.0.jar!/:?]
	at com.netflix.conductor.service.MetadataServiceImpl.getEventHandlersForEvent(MetadataServiceImpl.java:218) ~[conductor-core-3.14.0.jar!/:?]
	at com.netflix.conductor.service.MetadataServiceImpl$$FastClassBySpringCGLIB$$726d989d.invoke(<generated>) ~

RedisEventHandlerDAO.java:124

@Override
    public List<EventHandler> getEventHandlersForEvent(String event, boolean activeOnly) {
        String key = nsKey(EVENT_HANDLERS_BY_EVENT, event);
        Set<String> names = jedisProxy.smembers(key);
        List<EventHandler> handlers = new LinkedList<>();
        for (String name : names) {
            try {
                EventHandler eventHandler = getEventHandler(name);
                recordRedisDaoEventRequests("getEventHandler", event);
                // —————————Line 124————————
                if (eventHandler.getEvent().equals(event)
                // —————————Line 124————————
                        && (!activeOnly || eventHandler.isActive())) {
                    handlers.add(eventHandler);
                }
            } catch (NotFoundException nfe) {
                LOGGER.info("No matching event handler found for event: {}", event);
                throw nfe;
            }
        }
        return handlers;
    }

Details

  • Conductor version: 3.14.0 (This issue still exists in the latest version of main branch 8981878 because RedisEventHandlerDAO.java remains unchanged since 3.14.0)
  • Persistence implementation: Redis
  • Queue implementation: SQS
  • Lock: Redis
  • Workflow definition: Not specified
  • Task definition: Not specified
  • Event handler definition:
{
  "name": "xxx",
  "event": "sqs:<SQS_QUEUE_NAME>",
  "condition": "true",
  "actions": [
    {
      "action": "start_workflow",
      "start_workflow": {
        "name": "xxx",
        "version": 1
      },
      "expandInlineJSON": false
    }
  ],
  "active": true,
  "evaluatorType": "javascript"
}

To Reproduce
Steps to reproduce the behavior:

  1. Create an event handler listening on "sqs:<SQS_QUEUE_NAME>".
  2. Update this event handler to listen on "sqs:<ANOTHER_QUEUE_NAME>".
  3. Delete this event handler.
  4. Send a message to the <SQS_QUEUE_NAME> SQS queue.
  5. Conductor server throws a NullPointerException with logs as in code block above.

Code that leads to this bug

    @Override
    public void updateEventHandler(EventHandler eventHandler) {
        Preconditions.checkNotNull(eventHandler.getName(), "Missing Name");
        EventHandler existing = getEventHandler(eventHandler.getName());
        if (existing == null) {
            throw new NotFoundException(
                    "EventHandler with name %s not found!", eventHandler.getName());
        }
        // if  eventHandler would be updated with an event different from its existing settings
        // in this case, old index of event handlers by event should be delete
        // but it won't get deleted based on current logic
        index(eventHandler);
        jedisProxy.hset(nsKey(EVENT_HANDLERS), eventHandler.getName(), toJson(eventHandler));
        recordRedisDaoRequests("updateEventHandler");
    }

Expected behavior
Conductor should handle the SQS message without any NullPointerException and maintain consistency in the event handlers.

Screenshots

None

Additional context

  • The strongest evidence includes inspecting Redis data, revealing that the event handler <EVENT_HANDLER_NAME> is missing but its index still exists, causing the inconsistency.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
type: bug bugs/ bug fixes
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant