Dapr Workflows cannot be terminated if they are running lots of activities #7706

Open
salaboy opened this issue Apr 25, 2024 · 3 comments
Labels
kind/bug Something isn't working

Comments


salaboy commented Apr 25, 2024

In what area(s)?

/area runtime

/area operator

/area placement

/area docs

/area test-and-release

What version of Dapr?

1.13.2

1.0.x
edge: output of git describe --dirty

Expected Behavior

Workflows should be terminable by calling the terminateWorkflow API, no matter how many activities they are running.

One approach that could be implemented here is to pause the workflow if it executes activities in a loop, to avoid unwanted recursion.

Actual Behavior

If a workflow is started that creates a large number of activities, it keeps running forever and cannot be terminated, leaving the workflow stuck in the Running state.

Steps to Reproduce the Problem

  1. Install Dapr in a Kubernetes cluster (I am using version 1.13.2 and the Helm charts).
  2. I used Dapr Shared, but the behavior is the same with a regular Dapr sidecar:
helm install my-workflows-app oci://registry-1.docker.io/daprio/dapr-shared-chart --set shared.appId=my-workflow-app --set shared.daprd.image.tag=1.13.2 --set shared.strategy=deployment

kubectl port-forward svc/my-workflow-app-dapr 50001:50001
  3. Clone https://github.com/salaboy/workflows-bugbash-java
  4. Run the tests using Maven, specifically this test: https://github.com/salaboy/workflows-bugbash-java/blob/main/src/test/java/com/example/demo/DemoApplicationTests.java#L71
  5. Try to terminate the created instance using the Terminate API: https://github.com/salaboy/workflows-bugbash-java/blob/main/src/test/java/com/example/demo/DemoApplicationTests.java#L104
  6. Check the status of the workflow after running terminate. It still shows as Running.

Release Note

RELEASE NOTE:

@salaboy salaboy added the kind/bug Something isn't working label Apr 25, 2024
Contributor

famarting commented Apr 25, 2024

If you look at the Dapr sidecar logs, you will see entries like:

WARN[0513] Workflow actor '108adc75-08df-494b-99ec-65735f690802': execution timed-out and will be retried later: 'context deadline exceeded'  app_id=wfapp instance=MacBook-Pro-de-Fabian.local scope=dapr.wfengine.backend.actors type=log ver=1.13.2
WARN[0573] Workflow actor '108adc75-08df-494b-99ec-65735f690802': execution timed-out and will be retried later: 'context deadline exceeded'  app_id=wfapp instance=MacBook-Pro-de-Fabian.local scope=dapr.wfengine.backend.actors type=log ver=1.13.2
WARN[0613] Activity actor '108adc75-08df-494b-99ec-65735f690802::1::1': 'run-activity' is still running - will keep waiting until '2024-04-25 11:33:18.632479 +0200 CEST m=+3613.908154293'  app_id=wfapp instance=MacBook-Pro-de-Fabian.local scope=dapr.wfengine.backend.actors type=log ver=1.13.2

What is happening with this test is that it starts the worker, which connects to the Dapr sidecar via gRPC, and then schedules the workflow so it starts running. As soon as the workflow starts running, your test exits, which also shuts down the worker and closes the gRPC connection to the sidecar.

To my understanding, the workflow engine cannot move forward with the event log for this workflow because it cannot send commands to the application. If it cannot advance the event log, it cannot process the workflow terminate command, and the workflow gets stuck retrying the previous command (most likely an activity execution in this case).
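If the test exiting early is the trigger, one way to rule it out is to keep the test (and therefore the worker) alive until the workflow reaches a terminal state. Below is a minimal sketch of such a wait loop. It assumes the status names RUNNING, COMPLETED, FAILED and TERMINATED; `WorkflowPoller`, `awaitTerminal` and the `statusLookup` supplier are hypothetical names standing in for whatever call the test uses to query the workflow status, not part of the Dapr SDK.

```java
import java.time.Duration;
import java.util.Iterator;
import java.util.List;
import java.util.Set;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

final class WorkflowPoller {
    // Statuses after which the workflow no longer needs the worker (assumed names).
    private static final Set<String> TERMINAL = Set.of("COMPLETED", "FAILED", "TERMINATED");

    /**
     * Polls statusLookup until it reports a terminal status or the timeout elapses.
     * statusLookup is a hypothetical stand-in for a real workflow status query.
     */
    static String awaitTerminal(Supplier<String> statusLookup, Duration timeout, Duration interval)
            throws InterruptedException, TimeoutException {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            String status = statusLookup.get();
            if (TERMINAL.contains(status)) {
                return status;
            }
            if (System.nanoTime() >= deadline) {
                throw new TimeoutException("workflow still " + status + " after " + timeout);
            }
            Thread.sleep(interval.toMillis());
        }
    }

    public static void main(String[] args) throws Exception {
        // Fake status source for the demo: RUNNING twice, then TERMINATED.
        Iterator<String> fake = List.of("RUNNING", "RUNNING", "TERMINATED").iterator();
        String result = awaitTerminal(fake::next, Duration.ofSeconds(5), Duration.ofMillis(10));
        System.out.println(result); // prints TERMINATED
    }
}
```

In a test, the call to `awaitTerminal` would sit after the terminate request and before the worker is shut down, so the worker stays connected long enough to process the message.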

I don't have sufficient knowledge of the workflow engine to propose a solution, but to me it looks like there is a bit of a disconnect between the workflow actor and the engine: the actor tries to send work to the engine so it sends it to the app, but if the engine is not connected to the app, nothing works. Maybe there should be some optimization or logic that breaks this kind of retry loop if a terminate command is detected. IDK if the client side MUST receive the terminate workflow command to safely terminate the workflow or if it's safe to terminate the workflow from the backend POV if the connection to the application is absent.

@olitomlinson

cc @cgillum

Contributor

cgillum commented Apr 26, 2024

Yes, I believe @famarting is correct here. If the worker has disconnected from the sidecar, then it will be unable to receive and process the terminate message, leaving the workflow stuck in the RUNNING state.

Termination works by sending a message to a workflow. When the workflow receives the terminate message, it transitions itself into a completed state with the TERMINATED runtime status. The terminate logic is not implemented at the sidecar/engine/actor layer. If you reconnect your worker app to the Dapr sidecar, then the terminate message should get handled and the workflow will terminate.
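To make the mechanism above concrete, here is a toy in-memory model of it. It is only an illustration of "terminate is a message that the workflow itself must process", not the engine's actual code: `ToyWorkflow`, its inbox, and the `workerConnected` flag are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Queue;

final class ToyWorkflow {
    private final Queue<String> inbox = new ArrayDeque<>();
    private boolean workerConnected;
    private String status = "RUNNING";

    // Simulates the worker app connecting to or disconnecting from the sidecar.
    void setWorkerConnected(boolean connected) {
        workerConnected = connected;
        drainInbox();
    }

    // Terminate only enqueues a message; it does not change the status directly.
    void terminate() {
        inbox.add("terminate");
        drainInbox();
    }

    String status() {
        return status;
    }

    // Messages are handled only while a worker is connected to process them.
    private void drainInbox() {
        while (workerConnected && !inbox.isEmpty()) {
            if ("terminate".equals(inbox.poll())) {
                status = "TERMINATED";
            }
        }
    }

    public static void main(String[] args) {
        ToyWorkflow wf = new ToyWorkflow();
        wf.terminate();                  // worker is disconnected
        System.out.println(wf.status()); // prints RUNNING
        wf.setWorkerConnected(true);     // worker reconnects
        System.out.println(wf.status()); // prints TERMINATED
    }
}
```

The model reproduces the reported symptom: the terminate request is accepted, but the status stays RUNNING until a worker reconnects and drains the inbox.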

> Maybe there should be some optimization or logic that breaks this kind of retry loop if a terminate command is detected. IDK if the client side MUST receive the terminate workflow command to safely terminate the workflow or if it's safe to terminate the workflow from the backend POV if the connection to the application is absent.

I think this can be considered as an optimization, but it would need to be implemented carefully to ensure that the workflow state is correctly updated in the same way as when a workflow transitions itself into a terminated state, and that the OTel spans are properly emitted.
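A rough sketch of what "implemented carefully" could mean: whichever side performs the termination, it goes through one shared transition, so the recorded state and emitted spans are identical. Everything here is hypothetical (`TerminationSketch`, the `ExecutionTerminated` history entry as used here, and the span strings, which merely stand in for OTel spans); it is not how the Dapr engine is structured.

```java
import java.util.ArrayList;
import java.util.List;

final class TerminationSketch {
    final List<String> history = new ArrayList<>(); // stand-in for the workflow event log
    final List<String> spans = new ArrayList<>();   // stand-in for emitted OTel spans
    String status = "RUNNING";
    boolean workerConnected;
    private boolean terminatePending;

    void requestTerminate() {
        terminatePending = true;
        step();
    }

    // One scheduling step of a hypothetical engine loop.
    void step() {
        if (!terminatePending || !"RUNNING".equals(status)) {
            return; // nothing to do, or already in a terminal state
        }
        // The engine takes over only when no worker is available to handle the message.
        applyTerminated(workerConnected ? "worker" : "engine");
    }

    // Single transition shared by both paths, so state and spans stay consistent.
    private void applyTerminated(String by) {
        status = "TERMINATED";
        history.add("ExecutionTerminated");
        spans.add("terminate(by=" + by + ")");
        terminatePending = false;
    }

    public static void main(String[] args) {
        TerminationSketch wf = new TerminationSketch();
        wf.requestTerminate();          // no worker connected
        System.out.println(wf.status);  // prints TERMINATED
        System.out.println(wf.history); // prints [ExecutionTerminated]
        System.out.println(wf.spans);   // prints [terminate(by=engine)]
    }
}
```

The design point is that the fallback path does not get its own state-update logic; it reuses the same transition, which is the part that would need the most care in a real implementation.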
