Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Resilient feature flags #74

Merged
merged 4 commits into from Jan 24, 2023
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
81 changes: 81 additions & 0 deletions requests-for-comments/resilient-feature-flags
@@ -0,0 +1,81 @@
# Request for comments: Resilient feature flags

## Problem statement
*Who are we building for, what are their needs (include example cases), why is this important?*

We've recently realised a few limitations with feature flags when PostHog goes down. This isn't great for permanent feature flags as it causes things to break
in unexpected ways & forces clients to think really hard about defaults & write code defensively.

Ideally, our client libraries and APIs should 'just work', even if there's no error handling on the user's side.

So, how do we make this happen?

## Context

The (simplified) way things currently work is:

Whenever a user shows up on a page, we send a request to `/decide` with their randomly generated anonymous ID to parse feature flags. Usually these people don't have
properties attached to them, so only flags that don't depend on user properties resolve to `true` (assuming they're within the rollout bound).

Things change when we call `identify()` on a user, as this updates their ID, and fires another request to `/decide` to get new flags based on not only current
user properties, but any past properties that were set for this user.

We need the database to:
(a) Validate project tokens and find relevant team
(b) Get feature flag definitions for the given team
(c) Evaluate above flags using the person's properties stored in the DB

Note that `decide` can run on any arbitrary anonymous ID.

Caching decide responses doesn't make too much sense, since it's called usually only once per distinct ID per session. Every anonymous user has a new ID, and except
for users coming back to the app again & again on the same day, we don't achieve resiliency as we can't cache for users never seen before, nor can we assume that properties
haven't changed; while introducing new problems like cache invalidation.

### Possible failure modes

1. The database is down/unreachable: `decide` evaluation fails and returns 500
2. The servers are down/unreachable: requests from client libraries time out / error out.

(2) seems hard to defend against without a distributed app distribution (and then a resolver to go to the 'correct' app), but (1) is possible.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is a distributed app distribution and what is a 'correct' app?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Talk about terrible wording 😅 . I basically mean multiple server deployments where one going down doesn't affect the other, like an edge server; and a load balancer than can appropriately link to healthy closest server.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don’t think we should build an SDK making an assumption on the underlying infrastructure.

To build resilient software, use defensive programming: always hope for the best (uptime 100%) but always prepare for the worst (everything is down). AKA: a multi-geo deployment might help you, but you should not rely on it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not assuming that at all, I'm just listing out what are ways things can go wrong here.

Even with defensive programming, the issue still remains that when servers go down, you stop getting flag information.

(a point below addresses this)


## Solution

Introducing, decide v3.

This is a good opportunity to fix other customer frustrations around this endpoint, and pave the way for breaking changes we've wanted to introduce for a while.

In short:

1. Decide v3 returns all flags, with a `true` or `false` based on whether they're enabled or not, vs. implicitly just not returning the false flags.
2. Decide v3 does a best effort calculation of flags when PostHog DBs are down, and tells the client that it faced issues computing all flags.
3. Decide v3 can return a JSON payload with the flag variant (sneaking this into the major version API update)

On the SDK side:

1. We stop obliterating all flags if decide returns a 500, and based on the parameter in (2), update only flags that were computed again, while keeping old values of other flags.
2. We add timeouts to flushes & shutdowns

To make this possible, we cache flag definitions (and not the decide responses) on our side, so these are available even if the DB is down. We need to do the
same for determining the team given a project API key.

It makes sense to include (3) with (1) for decide because the shape of the response is changing anyways, and it's much easier to handle both at the same time in all client libraries, than doing these one by one.

We don't have to build the rest of it out right now, but having this functionality here means when do build it, it will be extraordinarily faster to do.

## Is it worth doing this?

If PostHog never goes down, none of this gives us an advantage (except maybe speeding up `/decide` responses if we decide to put the cache in front).

neilkakkar marked this conversation as resolved.
Show resolved Hide resolved
However, for the times we do go down, doing this ensures that our customers don't have a terrible experience & don't have their apps break. It's a trust-building
activity that ensures customers stick around.

It's an extremely big churn risk. If PostHog going down makes their app go down, we become an extra dependency for them being up.

## Context
*What are our competitors doing, what are the technical constraints, what are customers asking for, what does the data tell us, are there external motivations (e.g. launch week, enterprise contract)?*

## Design
*What are the key user experience and technical design decisions / trade-offs?*

## Sprints
*How do we break this into discrete and valuable chunks for the folks shipping it? How do we ensure it's high quality and fast?*