Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature Flags aren't consistent when distinctIDs can change #9547

Closed
neilkakkar opened this issue Apr 27, 2022 · 10 comments
Closed

Feature Flags aren't consistent when distinctIDs can change #9547

neilkakkar opened this issue Apr 27, 2022 · 10 comments
Labels
bug Something isn't working right feature/experimentation Feature Tag: Experimentation feature/feature-flags Feature Tag: Feature flags

Comments

@neilkakkar
Copy link
Collaborator

Bug description

When someone logs out / switches to a new browser and then tries logging in, they get a new distinct ID, which can mean that sometimes the feature flags can flip, because feature flags are based on distinctID, not personID.

Then once they're logged in and move to their existing distinctID, until the /decide call returns, we're using the 'wrong' feature flag value. Further, if they keep using the $anon_distinct_id, our feature flag is permanently flipped, which is not great for experiments.

Additional context

  1. While there's not much we can do when a new unidentified user pops up, is there something we can do to minimise lag between logging in & new feature flags being loaded?
  2. What are problems / advantages with moving to personID as feature flag identifier, for cases when you'd want multiple distinctIDs belonging to the same person to go to the same bucket.

Thank you for your bug report – we love squashing them!

@neilkakkar neilkakkar added bug Something isn't working right feature/feature-flags Feature Tag: Feature flags feature/experimentation Feature Tag: Experimentation labels Apr 27, 2022
@pauldambra
Copy link
Member

pauldambra commented Apr 27, 2022

While there's not much we can do when a new unidentified user pops up, is there something we can do to minimise lag between logging in & new feature flags being loaded?

The server could write the feature flags as JSON into a script tag in the HTML of the page for the JS to load

Doh, which would only work for us 🤦

@AtifSS
Copy link

AtifSS commented Apr 29, 2022

I think right from when
a) a user opens a web browser
b) after login
c) to logout
d) closing and then reopening the browser
e) logging-in again.

during all these steps and further the feature flags with variants should not change their value. right?

@neilkakkar
Copy link
Collaborator Author

between (d) and (e), if you go incognito / clear caches / disabled cookies/cache , then there's no way to know if you are the same person who logged in earlier, so its very much possible for feature flags to flip.

@neilkakkar
Copy link
Collaborator Author

to minimise lag between logging in & new feature flags being loaded?

There's an existing issue addressing this problem: #7115

hey get a new distinct ID, which can mean that sometimes the feature flags can flip

Using personID as the hash identifier solves this problem

@macobo
Copy link
Contributor

macobo commented May 24, 2022

Note that feature flags can (and I have in the past) be used as authorization-type gates - can user do X.

This means that if you don't change the flag as user logs in you also create problems.

@neilkakkar
Copy link
Collaborator Author

neilkakkar commented Jun 6, 2022

I'm looking into fixing this issue this week. To have all context and make informed decisions, here's all the use cases we know of: ( @marcushyett-ph - could use some help filling out anything I've missed)

Feature Flags use cases

  1. As a rollout gate: I want to release a new feature, but only to a set % of people.
  2. As a precise rollout gate: I want to release a new feature, but only to a set % of people, where being precise matters (ex: running an experiment). We've chosen not to support this usecase (use multivariates here instead)
  3. As a rollout gate on person properties: I want to release a new feature, but only to a set % of people, with browser = Chrome
  4. As an authorization gate: I want to enable this feature for people with property X. No rollout nonsense.
  5. As a A/B/X Test: I want to split variants between different people.
  6. As a 'persistent' A/B/X Test: I want to split variants between different people, where the first variant they see (anonymous or not) remains throughout their journey.
  7. As a 'persistent' A/B/X Test with person properties: I want to split variants between different people with property X, where the first variant they see (anonymous or not) remains throughout their journey.
  8. As a 'persistent' A/B/X Test with person properties and a rollout to reduce scope of experiment: I want to split variants between different people with property X, where the first variant they see (anonymous or not) remains throughout their journey.

And in all of these cases, person & people can be replaced by group.


Problems with above use cases

  1. Wherever rollout %s or variants are involved, we have a nasty issue: distinctIDs belonging to the same person can get a different value, since the hash is based on the distinctID.
    Groups evade this problem by hashing on the group_key, instead of distinctID.
    We don't yet do the same with persons because this borks client side evaluation, and makes turbo-mode impossible.
    Also, note that this only occurs in a flow involving identify: If you're already identified / anonymous throughout, your distinctID remains the same. However, this is a common enough use-case that it's worth solving for.
    So, the open question here is which cases require consistency, and whether it makes sense to expose this as an option to users.
  2. The same problem arises in a different form where 'persistence' is required. Here, a variant is determined based on a anonymous distinct id, which flips when a new distinct ID is identified.

Before I jump into solutions, does this make sense? Am I missing any use cases? (cc: @macobo @mariusandra @Twixes @EDsCODE @rcmarron . No requirement to respond, just pinging to confirm I've covered all bases)

@neilkakkar
Copy link
Collaborator Author

neilkakkar commented Jun 6, 2022

Constraints to keep in mind:

  1. We want to solve this issue not just using our libraries, but for people directly accessing the APIs as well (since we'll probably never support all possible libraries)
  2. We don't want to stop / close the road to moarr client-side evaluation of FFs.
  3. Existing features should still work: If I come back to reduce rollout %s, things should still work as expected. Or, in an experiment case, if I switch rollout % for a variant to 100%, the old 'persistent' variant should change as well.
  4. ???

@neilkakkar
Copy link
Collaborator Author

neilkakkar commented Jun 7, 2022

I think there's 3 ways to about solving this, but one seems superior to all options.

Firstly, since consistency is not a requirement for all sorts of feature flags, and only plays a role when the flag has to go through an identify / login / distinct_id change event, we should make this optional, and default to false.

For the special case when consistency is switched on (we need a better name though), we need some extra work to make things work. All these options can't be evaluated completely on the client-side.

Not great option 1: Use person_id as the key for hashing

This seems to work, except for the fact that personIDs may merge when identify is called for the first time, which leads to deleting old person IDs, which can just as easily cause the same problems, although in a smaller number of cases.

Implementing this, however, is pretty straightforward.

Not great option 2: Use a feature flag value override

Whenever we have a distinctID which is assigned a variant / is considered part of a rollout for the first time, we set an override for the person to stick to this variant. When persons are merged, the override carries forward to all merged persons.

Thus, no matter the change in distinctID, we can be sure that the person always sees the same variant. The problem here, though, is how this handles adjustments to rollout %s. If, say, at the end of the experiment, I come and rollout the winning variant to 100%, it simply won't work because the persons are all overridden to this old value.

Option 3: Use a hash key override

Whenever we have a distinctID which is assigned a variant / is considered part of a rollout for the first time, we set an override for hash key to be used for calculating variants / rollout. When persons are merged, the override carries forward to all merged persons.

This ensures that future changes correctly change the behaviour for the feature flag, while also ensuring it remains consistent when distinct IDs change.


Open to hear about other options, but option 3 seems best to me so far. I considered going the opposite route, where the client keeps the anon distinctID as the 'main' one, but that has other problems, and doesn't fit in well with existing architecture, not worth reversing this at all.


More detailed spec of how option 3 will work:

  1. Client side has an anon distinct ID

  2. Client side triggers identify with a new distinct ID

  3. The API request to decide has a new field for anon_distinct_id, which shall be used for the feature flag overrides. This has set once semantics: it only works the first time & if no override already exists for the feature flag - personID combo.

    1. The update-and-get happens synchronously, so we get the right feature flag values as well as update the overrides.
    2. TBD: How do person merges on identify affect this?^
    3. We don't want to turn this into a separate API call, keeping in mind clients & async partitioning issues.
  4. This does mean more cruft on feature flag evaluation, and I'm thinking of ways to optimise this / achieve option 3 using a different route.

Effectively, option 3 is a more general form of option 1. Choosing between option 1 and 3 also depends on how the ingestion side works when identifying persons, which I'm looking into next. Maybe the cases where the person_id changes are actually much smaller (or that's my suspicion), which can help us get rid of the cruft.

@neilkakkar
Copy link
Collaborator Author

Option 1 was a no-go because of how person merges affect identify.

Consider this very standard case:

  1. We create a new feature flag with experience consistency enabled.
  2. neil@posthog.com is an existing user.
  3. An anonymous user, John Doe, comes to the website, does random stuff, gets their own person created, has feature flag values set based on their personID.
  4. John Doe now identifies as neil@posthog.com, which deletes John Doe's old person, and the new personID is neil's personID, which means the feature flag can flip!

In contrast to option 3, where the override is added at this point, ensuring that neil@posthog.com uses John Doe's distinctID. Option 3 could've overridden personIDs instead, but that's just needlessly complicating things, and keeps the way open for single-client client-side consistent feature flags, by using distinct IDs.

@neilkakkar
Copy link
Collaborator Author

This is now fixed and rolling out slowly to everyone

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working right feature/experimentation Feature Tag: Experimentation feature/feature-flags Feature Tag: Feature flags
Projects
None yet
Development

No branches or pull requests

4 participants