Dapr Workflows cannot be terminated if they are running lots of activities #7706

salaboy · 2024-04-25T07:59:28Z

In what area(s)?

/area runtime

/area operator

/area placement

/area docs

/area test-and-release

What version of Dapr?

1.13.2

Expected Behavior

Workflows should be terminable by calling the terminateWorkflow API, no matter how many activities they are running.

One approach that could be implemented here is to pause the workflow if it executes activities in a loop, to avoid unwanted recursion.

Actual Behavior

If a workflow is started that creates a large number of activities, it keeps running forever and cannot be terminated, leaving the workflow stuck in the Running state.

Steps to Reproduce the Problem

Install Dapr in a Kubernetes cluster (I am using version 1.13.2 and the Helm charts).
I used Dapr Shared, but the behavior is the same with a regular Dapr sidecar:

helm install my-workflows-app oci://registry-1.docker.io/daprio/dapr-shared-chart --set shared.appId=my-workflow-app --set shared.daprd.image.tag=1.13.2 --set shared.strategy=deployment

kubectl port-forward svc/my-workflow-app-dapr 50001:50001

Clone https://github.com/salaboy/workflows-bugbash-java
Run tests using Maven -> Specifically this test: https://github.com/salaboy/workflows-bugbash-java/blob/main/src/test/java/com/example/demo/DemoApplicationTests.java#L71
Try to terminate the instance created by using the Terminate API -> https://github.com/salaboy/workflows-bugbash-java/blob/main/src/test/java/com/example/demo/DemoApplicationTests.java#L104
Check the status of the workflow after running terminate. It will still show as Running.
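
For anyone reproducing the last two steps without the Java test, the same calls can be made against the sidecar's HTTP workflow API. This is a minimal sketch under several assumptions: the sidecar's HTTP port is reachable at localhost:3500 (the port-forward above only exposes gRPC on 50001, so a second port-forward would be needed), the workflow component is the built-in one named dapr, and the v1.0-beta1 workflow endpoints from the Dapr 1.13 API reference are in use. The class and method names are made up for illustration.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class WorkflowHttpCheck {
    // Assumption: sidecar HTTP endpoint; not set up by the repro steps above.
    static final String BASE = "http://localhost:3500";

    // GET on this URI returns the instance state, including runtimeStatus.
    static URI instanceUri(String base, String instanceId) {
        return URI.create(base + "/v1.0-beta1/workflows/dapr/" + instanceId);
    }

    // POST on this URI requests termination of the instance.
    static URI terminateUri(String base, String instanceId) {
        return URI.create(instanceUri(base, instanceId) + "/terminate");
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: WorkflowHttpCheck <instanceId>");
            return;
        }
        String instanceId = args[0];
        HttpClient http = HttpClient.newHttpClient();

        // Step 1: ask the sidecar to terminate the workflow instance.
        HttpRequest terminate = HttpRequest.newBuilder(terminateUri(BASE, instanceId))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
        HttpResponse<String> t = http.send(terminate, HttpResponse.BodyHandlers.ofString());
        System.out.println("terminate returned HTTP " + t.statusCode());

        // Step 2: read the state back; with this bug it still reports a running status.
        HttpRequest status = HttpRequest.newBuilder(instanceUri(BASE, instanceId)).GET().build();
        HttpResponse<String> s = http.send(status, HttpResponse.BodyHandlers.ofString());
        System.out.println(s.body());
    }
}
```

The terminate request being accepted only means the request was enqueued; the status printed in step 2 is what shows whether the workflow actually stopped.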

Release Note

RELEASE NOTE:

famarting · 2024-04-25T08:52:28Z

If you look at the Dapr sidecar logs, you will see entries like:

WARN[0513] Workflow actor '108adc75-08df-494b-99ec-65735f690802': execution timed-out and will be retried later: 'context deadline exceeded'  app_id=wfapp instance=MacBook-Pro-de-Fabian.local scope=dapr.wfengine.backend.actors type=log ver=1.13.2
WARN[0573] Workflow actor '108adc75-08df-494b-99ec-65735f690802': execution timed-out and will be retried later: 'context deadline exceeded'  app_id=wfapp instance=MacBook-Pro-de-Fabian.local scope=dapr.wfengine.backend.actors type=log ver=1.13.2
WARN[0613] Activity actor '108adc75-08df-494b-99ec-65735f690802::1::1': 'run-activity' is still running - will keep waiting until '2024-04-25 11:33:18.632479 +0200 CEST m=+3613.908154293'  app_id=wfapp instance=MacBook-Pro-de-Fabian.local scope=dapr.wfengine.backend.actors type=log ver=1.13.2

What is happening with this test is that it starts the worker, which connects to the Dapr sidecar via gRPC, and then schedules the workflow so that it starts running. As soon as the workflow starts running, the test exits, which also shuts down the worker and closes the gRPC connection to the sidecar.

To my understanding, the workflow engine cannot move forward with the event log for this workflow because it cannot send commands to the application. If it cannot advance the event log, it cannot process the workflow terminate command, and the workflow gets stuck retrying the previous command (most likely an activity execution in this case).

I don't have sufficient knowledge of the workflow engine to propose a solution, but to me it looks like there is a bit of a disconnect between the workflow actor and the engine: the actor tries to send work to the engine, so the engine sends it to the app, but if the engine is not connected to the app, nothing works. Maybe there should be some optimization or logic that breaks this kind of retry loop if a terminate command is detected. I don't know whether the client side MUST receive the terminate workflow command to safely terminate the workflow, or whether it's safe to terminate the workflow from the backend's point of view if the connection to the application is absent.
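
On the test side, one way to avoid the early exit described above is to keep the worker alive until the workflow reports a terminal status, and only then shut it down. Here is a rough sketch of such a wait loop in plain Java. The status supplier is a hypothetical stand-in for whatever call the test uses to read the workflow's runtime status, and the set of terminal status names is an assumption rather than something taken from the SDK.

```java
import java.time.Duration;
import java.util.Iterator;
import java.util.List;
import java.util.Set;
import java.util.function.Supplier;

class AwaitTerminal {
    // Assumption: status names treated as terminal for the purpose of this sketch.
    static final Set<String> TERMINAL = Set.of("COMPLETED", "FAILED", "TERMINATED");

    /**
     * Polls the status source until it reports a terminal status or the timeout
     * elapses, and returns the last status seen. The worker should stay
     * connected for the whole duration of this call.
     */
    static String awaitTerminal(Supplier<String> status, Duration timeout, Duration interval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        String last = status.get();
        while (!TERMINAL.contains(last) && System.nanoTime() < deadline) {
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and stop waiting
                return last;
            }
            last = status.get();
        }
        return last;
    }

    public static void main(String[] args) {
        // Fake status source: RUNNING twice, then TERMINATED.
        Iterator<String> fake = List.of("RUNNING", "RUNNING", "TERMINATED").iterator();
        String result = awaitTerminal(fake::next, Duration.ofSeconds(5), Duration.ofMillis(10));
        System.out.println(result);
    }
}
```

In a real test, the supplier would query the sidecar after the terminate call, and the worker would be closed only after this method returns.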

olitomlinson · 2024-04-26T21:31:23Z

cc @cgillum

cgillum · 2024-04-26T23:32:50Z

Yes, I believe @famarting is correct here. If the worker has disconnected from the sidecar, then it will be unable to receive and process the terminate message, leaving the workflow stuck in the RUNNING state.

Termination works by sending a message to a workflow. When the workflow receives the terminate message, it transitions itself into a completed state with the TERMINATED runtime status. The terminate logic is not implemented at the sidecar/engine/actor layer. If you reconnect your worker app to the Dapr sidecar, then the terminate message should get handled and the workflow will terminate.
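
To make the message-based behavior concrete, here is a toy model in plain Java. It is not the Dapr engine's code; every name in it is invented for illustration. It only captures the two properties described above: terminate is a message placed in the workflow's inbox, and the status changes only when a connected worker processes that message.

```java
import java.util.ArrayDeque;
import java.util.Queue;

class TerminationModel {
    enum Status { RUNNING, TERMINATED }

    static class WorkflowInstance {
        Status status = Status.RUNNING;
        boolean workerConnected = false;
        final Queue<String> inbox = new ArrayDeque<>();

        // Terminating only enqueues a message; it does not change the status.
        void terminate() {
            inbox.add("terminate");
        }

        // One delivery attempt: pending messages reach the workflow only if a
        // worker is connected. Otherwise they stay queued for a later retry.
        void pump() {
            if (!workerConnected) {
                return;
            }
            String msg;
            while ((msg = inbox.poll()) != null) {
                if (msg.equals("terminate")) {
                    status = Status.TERMINATED; // the workflow transitions itself
                }
            }
        }
    }

    public static void main(String[] args) {
        WorkflowInstance wf = new WorkflowInstance();

        wf.terminate();
        wf.pump();
        System.out.println(wf.status); // prints RUNNING: no worker to handle the message

        wf.workerConnected = true;     // the worker app reconnects
        wf.pump();
        System.out.println(wf.status); // prints TERMINATED
    }
}
```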

> Maybe there should be some optimization or logic that breaks this kind of retry loop if a terminate command is detected. IDK if the client side MUST receive the terminate workflow command to safely terminate the workflow or if it's safe to terminate the workflow from the backend POV if the connection to the application is absent.

I think this can be considered as an optimization, but it would need to be implemented carefully to ensure that the workflow state is correctly updated in the same way as when a workflow transitions itself into a terminated state, and that the OTel spans are properly emitted.

salaboy added the kind/bug Something isn't working label Apr 25, 2024
