Actor state manager does not invalidate tracker after `SetWithTTL` #406

shubham1172 · 2023-05-20T17:02:34Z

Describe the bug

In the actor state store, Get/Contains return invalid data after saving a state with SetWithTTL and waiting for TTL time.

t.GetStateManager().SetWithTTL introduced by #383 is able to save the actual data into the state store with the correct TTL.

Here is the screenshot from my Redis DB showing a TTL on the right side.

However, the internal state manager state is likely corrupted. On calling Contains or Get after the TTL has expired, it still returns the old key. This might be possibly due to the fact that in actor/state/state_manager.go, we update the stateChangeTracker when SetWithTTL is called, and in the Get or Contains method, we only check for Removed state, and never look at the TTL. This cache is not invalidated when the TTL expires.

To Reproduce

Here is the test plan

func (t *TestActor) TestTTL(ctx context.Context, stateKey string) error {
	fmt.Println("test ttl with key = ", stateKey)

	stateValue := "foobar"

	// 1. Set state with TTL 15 seconds
	if err := t.GetStateManager().SetWithTTL(ctx, stateKey, stateValue, time.Second*15); err != nil {
		fmt.Println("[ERROR] STEP 1: state manager set get with key " + stateKey + " and state value " + stateValue + "err = " + err.Error())
		return err
	}

	if err := t.GetStateManager().Save(ctx); err != nil {
		fmt.Println("[ERROR] STEP 1: state manager save err = " + err.Error())
		return err
	}

	// 2. Get state with key, it should exist and value should be stateValue
	if exist, err := t.GetStateManager().Contains(ctx, stateKey); err != nil {
		fmt.Println("[ERROR] STEP 2: state manager call contains with key " + stateKey + " err = " + err.Error())
		return err
	} else if !exist {
		fmt.Println("[ERROR] STEP 2: state manager call contains with key " + stateKey + " does not exist")
		return fmt.Errorf("state with key %s should exist", stateKey)
	} else {
		var stateData string
		if err := t.GetStateManager().Get(ctx, stateKey, &stateData); err != nil {
			fmt.Println("[ERROR] STEP 2: state manager call get with key " + stateKey + " err = " + err.Error())
			return err
		}
		if stateData != stateValue {
			fmt.Println("[ERROR] STEP 2: state manager call get with key " + stateKey + " value = " + stateData + " does not match expected value = " + stateValue)
			return fmt.Errorf("state with key %s should have value %s, but got %s", stateKey, stateValue, stateData)
		} else {
			fmt.Println("[INFO] STEP 2: state manager call get with key " + stateKey + " value = " + stateData + " matches expected value = " + stateValue)
		}
	}

	// 3. Wait for 20 seconds, state should be expired
	fmt.Println("[INFO] STEP 3: wait for 20 seconds")
	time.Sleep(time.Second * 20)

	// 4. Get state with key, it should not exist
	if exist, err := t.GetStateManager().Contains(ctx, stateKey); err != nil {
		fmt.Println("[ERROR] STEP 4: state manager call contains with key " + stateKey + " err = " + err.Error())
		return err
	} else if exist {
		fmt.Println("[ERROR] STEP 4: state manager call contains with key " + stateKey + " exist")
		return fmt.Errorf("state with key %s should not exist", stateKey)
	} else {
		fmt.Println("[INFO] STEP 4: state manager call contains with key " + stateKey + " does not exist")
	}

	return nil
}

Output:

== APP == test ttl with key =  testStateKey
== APP == [INFO] STEP 2: state manager call get with key testStateKey value = foobar matches expected value = foobar
== APP == [INFO] STEP 3: wait for 20 seconds
== APP == [ERROR] STEP 4: state manager call contains with key testStateKey exist

Expected behavior

Get/Contains should not return the data if its TTL has expired.

The text was updated successfully, but these errors were encountered:

JoshVanL · 2023-05-22T12:09:08Z

The problem here is that SDKs optimistically cache all actor state since an instance will assume an actor's identity for the lifetime of the instance or actor. With the introduction of actor state TTL, SDK caches do not currently reflect the real state as described by the DB (really this is Dapr runtime), i.e. if a key has passed its TTL and been deleted in the DB, it will not be deleted in the cache.

There are a few options we have to alleviate this problem in SDKs:

Do nothing: this means that regardless of the true state of the world, SDK caches will store expired keys indefinitely, with the intention that actor state TTL is a mechanism for ensuring DB state doesn't balloon, rather than another strategy for manipulating actor state at runtime. The problem with this is that a. a long running actor instance could itself balloon in memory, and b. potentially be confusing to developers as Dapr runtime and actor instances have different views of the world.
Don't cache any actor state, instead always talk to runtime. This will obviously introduce latency to actor state. Is this latency acceptable? Was it the case before that actors did not cache state, but the project found the latency to be too high?
Add a cache in-validator whereby the actor SDKs could estimate the expiration time, and get requests to a key past this time would invalidate the cache and call the dapr runtime. The DB is the source of truth of clock time for state TTL, and so the actor instance would technically be an approximation of the real TTL expiration time.
Add a background process to actor state SDKs which, much like dapr runtime itself for state stores which don't have native support for TTL, is responsible for periodically getting all keys currently stored in the cache and ensuring the state is still available and updating them accordingly.

I am probably in favor of 3 the most, though we might want to do 4 as well anyway to prevent actor SDKs themselves ballooning in memory over time as stale state is never cleaned up.

/cc @artursouza @yaron2

artursouza · 2023-05-22T13:37:27Z

This is a critical thing to be fixed. The SDK should implement option 3. In addition, for applications that cannot afford the inconsistency, it should allow (via config of the actor runtime in SDK) to not cache state anymore.

We also need to update documentation explaining the side effects of using TTL for actor state.

yaron2 · 2023-05-22T16:10:54Z

Option 3 is the best path forward. We shouldn't remove caching from the actor clients, it was put in there in the first place due to the low latency and stateful nature of actors (interacting with state in almost every call).

shubham1172 added the bug Something isn't working label May 20, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Actor state manager does not invalidate tracker after `SetWithTTL` #406

Actor state manager does not invalidate tracker after `SetWithTTL` #406

shubham1172 commented May 20, 2023

JoshVanL commented May 22, 2023

artursouza commented May 22, 2023

yaron2 commented May 22, 2023

Actor state manager does not invalidate tracker after SetWithTTL #406

Actor state manager does not invalidate tracker after SetWithTTL #406

Comments

shubham1172 commented May 20, 2023

JoshVanL commented May 22, 2023

artursouza commented May 22, 2023

yaron2 commented May 22, 2023

Actor state manager does not invalidate tracker after `SetWithTTL` #406

Actor state manager does not invalidate tracker after `SetWithTTL` #406