Experimentation MVP Things to Do #7462

Closed
4 of 5 tasks
neilkakkar opened this issue Dec 1, 2021 · 38 comments
Labels
feature/experimentation Feature Tag: Experimentation

Comments

@neilkakkar
Collaborator

neilkakkar commented Dec 1, 2021

Following up from #7418, here's what we need to do to get the MVP out in these two weeks:

Main Tasks:

  1. Figure out & build the UI Creation Flow - @liyiy
    • @clarkus & @paolodamico : we could use some input here about the basic flow (specifically on how we should structure this; disregard full-fledged wireframes for now). There are 2.5 steps to creating an experiment: (1) Create an insight & (2) Choose a FF. (2.5): Name and description of the experiment.
    • As you can tell, there's lots of ambiguity above^ that we need to clean up.
    • (We considered a flow where funnels have a "create experiment from insight" button, but @macobo mentioned how this increases complexity on the standard funnels view, and reduces discoverability quite a bit in simple mode)
    • (future context): There's a back & forth when choosing an insight & FF: They choose an insight, we tell them how long the experiment will run for. They tweak the FF rollout % / definition, we adjust how long we expect the experiment to run. They tweak the insight definition, we adjust how long to run the experiment. What's a good way of doing this?
  2. Build Experiment Page with the Insight and experiment results (also see here)
  3. Create Plugin to add $active_feature_flags & $feature/{key} to every event (we only do it for web captures today) - @neilkakkar - deferred. Just focusing on posthog-js for now. (See the sketch after this list.)
    • This implicitly allows breakdowns by Feature Flags on Funnels
  4. Given a funnel with FF breakdown, calculate experiment results
  5. Given a funnel and FF, calculate how long an experiment should run
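
For task 3, a minimal sketch of what the (deferred) plugin's processEvent hook could look like. The fetchFlagsForDistinctId helper is hypothetical - how the plugin would actually resolve flags (e.g. via the /decide endpoint) is an open question - and the property names mirror what posthog-js already sends for web captures:

```typescript
import { PluginEvent, Meta } from '@posthog/plugin-scaffold'

// Hypothetical helper: resolve the active flags for a distinct_id (e.g. by calling /decide).
async function fetchFlagsForDistinctId(distinctId: string): Promise<Record<string, string | boolean>> {
  return {}
}

export async function processEvent(event: PluginEvent, _meta: Meta): Promise<PluginEvent> {
  // Only backfill events that don't already carry flag info (web captures already do)
  if (!event.properties?.['$active_feature_flags']) {
    const flags = await fetchFlagsForDistinctId(event.distinct_id)
    event.properties = {
      ...event.properties,
      $active_feature_flags: Object.keys(flags),
      // One `$feature/<key>` property per flag, which is what funnel breakdowns would use
      ...Object.fromEntries(Object.entries(flags).map(([key, value]) => [`$feature/${key}`, value])),
    }
  }
  return event
}
```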

New constraints:

  1. To give accurate results on expected running time, we need to answer the question: given a FF, how many people in the past X days on this insight would've belonged to this FF? We'll use this to estimate. For now, we're restricting scope to basic FFs only (i.e. a flag whose only property is the rollout % to all users). A rough sketch of the estimate follows.
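
A back-of-the-envelope sketch of that estimate under the basic-FF restriction. `requiredSamplePerVariant` is a placeholder here - how that number itself is derived is a separate statistical question:

```typescript
// Back-of-the-envelope sketch only; not the actual implementation.
function estimateRunningDays(
  usersOnInsightPastXDays: number, // unique users matching the insight over the last X days
  lookbackDays: number,            // X
  rolloutPercent: number,          // basic FF: rollout % to all users
  requiredSamplePerVariant: number // placeholder - derived elsewhere
): number {
  const eligiblePerDay = (usersOnInsightPastXDays * (rolloutPercent / 100)) / lookbackDays
  const perVariantPerDay = eligiblePerDay / 2 // 50/50 control vs test split
  return Math.ceil(requiredSamplePerVariant / perVariantPerDay)
}

// e.g. 10,000 users over the last 14 days, 100% rollout, 1,000 users needed per variant
console.log(estimateRunningDays(10_000, 14, 100, 1_000)) // ≈ 3 days
```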
@paolodamico
Contributor

This is great @neilkakkar! Sounds like a plan. I'll work on some wireframes to discuss the overall flow.

@liyiy
Contributor

liyiy commented Dec 1, 2021

Are we limiting the user to what type of insight they're allowed to create?

@neilkakkar
Collaborator Author

Experimentation is only possible on Funnels, so ideally yes (your call depending on how hard it is to implement. Disabling the rest / removing them works)

@clarkus
Contributor

clarkus commented Dec 1, 2021

Is there a list of known constraints somewhere?

  • There are 2.5 steps to creation right now
    • Create a funnel insight (has to be a funnel)
    • Choose a feature flag
    • Name things
    • Some back and forth to adjust how long-lived the experiment is, etc.
  • Are there any other constraints to be aware of?

@liyiy
Contributor

liyiy commented Dec 1, 2021

@clarkus Potentially things like participant selection? 🤔

@neilkakkar
Collaborator Author

Hmm, the FF determines the participants, so this should be ok (thanks to the rollout %).

One more constraint: Each experiment can have only one insight, and only one FF set.

Don't think there's anything else

@paolodamico
Contributor

Created a (very, very ugly) proposal for what the flow can look like (see below). Some notes based on the comments I see above:

  • I think a better experience, instead of creating a normal insight, is allowing the user to select the metric they care about (see wireframe proposal). In the end we want to make sure the user can track a single number, as opposed to just seeing the same graph they'd see in insights.
  • In terms of % rollout on FFs, I think we should skip this. For the MVP I would suggest just doing 50/50 binary rollout based on the selection criteria.

PH-experimentation-MVP.pdf

@clarkus
Contributor

clarkus commented Dec 1, 2021

Here are the things I think we'd need to track / collect for creating an experiment. I have some open questions below on areas I see as needing elaboration. I think the way we determine the target metric needs more discussion, or at least consensus, for the MVP.

Experiment parameters

  • Name
  • Feature flag key - could be generated based on the name - the critical thing here is to communicate the key so it can be incorporated into code / features. There is also an automatic cleanup aspect to keys - do we need to ensure keys aren't reused in the future? Do they always have to be unique even after the experiment is concluded?
  • Description (optional)
  • Owner (not sure we need to provide this explicitly as it could be inferred by the creator of the experiment)
  • Target metric / funnel insight - needs elaboration and consensus
  • Time to live - how long the experiment runs - is this always in days? Are there upper or lower bounds?
  • Person / group targeting - can we infer this from the insight filters? Are we using rollout percentages here, or do we need to be able to provide a target count in absolute units (for example 25 persons)?
  • How does confidence level actually work? Is this derived based on percentage of audience and how long-running the experiment is? If a user is able to directly input this, how does it impact the performance of the experiment?
  • An experiment cannot be changed while it's active. Are there exceptions to this? What about situations where the experiment is malformed or has other detrimental results, can the user revert and stop? Do they need to edit and relaunch, or is it a scenario where they need to start fresh?
  • Does it make sense to colocate experiments and feature flags? They're very related in their function, but they are pretty separate items in terms of usage and user mental models. Just saying they could be a stand alone feature distinct from feature flags.

@weyert
Contributor

weyert commented Dec 1, 2021

I hope you don't mind me raising this issue from last year regarding time-limited feature flags; I might be wrong, but it sounds like it might be related: #1903

@neilkakkar
Collaborator Author

neilkakkar commented Dec 1, 2021

@paolodamico : How does experimentation over a single event work? In this case, we don't have successes & failures for the A or B side, so how do we calculate the rest of the things?

imo we should restrict experimentation to funnels only. In that case, choosing the target metric = creating the funnel.

The single number would be the conversion rate. I don't get how it can be, say, the number of pageviews (the follow-up calculations get borked in this case?)

@neilkakkar
Collaborator Author

Person Selection is interesting! Sounds like instead of choosing an existing FF, it's creating a new FF.

I was assuming that the FF would usually already exist, because there would be code changes based on this FF?

Imagine you want to test a blue vs green logo of PostHog.

I would make the change, release it only to PostHog users / myself, test that it's okay, update the selection & rollout criteria on the FF itself, and then move to the Experimentation part.

This is very hard to do if I have to create the FF inside the experiment. My selection criteria are borking my testing flexibility.

Thoughts? I feel your way is superior in terms of UX: I don't have to flip-flop around to set things up. But hopefully(?) by the time you come to run an experiment, you have your FFs & code changes already finished and ready to go.

@liyiy
Contributor

liyiy commented Dec 1, 2021

I am kind of confused about how feature flags work here: are we duplicating them as an "experiment feature flag" and editing that duplicate when we change its rollout percentages, etc.? Also, I think the FF selection should be a dropdown of existing FFs instead of entering one manually.

@mariusandra
Collaborator

A reminder that we have a small data integrity issue with feature flags: posthog-js sends the first pageview before /decide returns a set of flags. To fix this, when the event doesn't have the $active_feature_flags property, we backfill the flags on the server side. However, there's an edge case when we send a cached set of feature flags on the client side.

In practice: the first pageview of a user returning after 3 days may contain a very different set of flags than all the following events in that session.

More context here: #6043 (comment)

@paolodamico
Contributor

paolodamico commented Dec 2, 2021

How does experimentation over a single event work? In this case, we don't have successes & failures for the A or B side, so how do we calculate the rest of the things?

I think it's a matter of comparing the baseline (control group) vs. experiment group. So for instance you target an increase in the number of Discoveries (and we could measure it over the time frame of the experiment). Or we could potentially target an increase in weekly active users (and then average weekly active users over the time frame). Though I could see the point of the MVP only targeting funnel conversion.

Person Selection is interesting! Sounds like instead of choosing an existing FF, it's creating a new FF.

I think we could support "converting" an existing flag (which would basically mean reusing the key), but I think doing person selection differently from what we do for FFs makes more sense.

We can have a sync conversation tomorrow to iron out these details!

@neilkakkar
Collaborator Author

Let's chat about this today!

What's worth ironing out here is that in this case, both control & experiment need another baseline with which they're being compared. (But I also found out that there's a possible alternative here, by setting exposure - i.e. varying how long A & B are active for.)

Or, it can be something like: the baseline is number of pageviews over WAU, and then the A/B test focuses on, say, insights analyzed count. This is possible.

(And yes, would like to scope down MVP. Getting the calculations for either of these right can get hard).

@paolodamico
Contributor

Very short summary from our sync meeting today (feel free to edit if I missed something):

  • Aligned on following the FF process outlined in the wireframes but we'll allow users to select an existing feature flag too (for the use case @neilkakkar described above). If they do this we'll "archive" the old flag and reuse the ID.
  • We'll explore using multivariate feature flags to assign users to either the experiment or the control group. This way we won't have to worry about mutable person properties over time or other filtering complexity.
  • @liyiy & @paolodamico will continue to figure out the general UX flow (particularly around how to create your target goal funnel while adjusting other FF params).
    • @clarkus to avoid overwhelming you we'll do ugly things for now, and we can talk about prioritizing solid designs next sprint.
  • We will use @neilkakkar's proposal on the Bayesian approach. Aside from all the great benefits that the original post outlines, it'll provide for a better UX - one less thing to worry about. (A rough sketch of what this comparison could look like follows this list.)
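
To make the Bayesian approach concrete, here's a minimal sketch (not the actual implementation; the counts in the example are made up) of estimating the probability that the test variant's funnel conversion beats control, using Beta posteriors and Monte Carlo sampling:

```typescript
// Box-Muller standard normal sample
function gaussian(): number {
  let u = 0
  let v = 0
  while (u === 0) u = Math.random()
  while (v === 0) v = Math.random()
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v)
}

// Marsaglia-Tsang sampler for Gamma(shape, 1)
function sampleGamma(shape: number): number {
  if (shape < 1) {
    return sampleGamma(shape + 1) * Math.pow(Math.random(), 1 / shape)
  }
  const d = shape - 1 / 3
  const c = 1 / Math.sqrt(9 * d)
  for (;;) {
    let x: number
    let v: number
    do {
      x = gaussian()
      v = 1 + c * x
    } while (v <= 0)
    v = v * v * v
    const u = Math.random()
    if (u < 1 - 0.0331 * x ** 4) return d * v
    if (Math.log(u) < 0.5 * x * x + d * (1 - v + Math.log(v))) return d * v
  }
}

// Beta(a, b) sample via two Gamma samples
function sampleBeta(a: number, b: number): number {
  const x = sampleGamma(a)
  const y = sampleGamma(b)
  return x / (x + y)
}

interface VariantCounts {
  converted: number // users who completed the funnel
  total: number     // users who entered the funnel
}

// P(test conversion rate > control conversion rate), with a uniform Beta(1, 1) prior
function probabilityTestBeatsControl(control: VariantCounts, test: VariantCounts, simulations = 100_000): number {
  let wins = 0
  for (let i = 0; i < simulations; i++) {
    const pControl = sampleBeta(control.converted + 1, control.total - control.converted + 1)
    const pTest = sampleBeta(test.converted + 1, test.total - test.converted + 1)
    if (pTest > pControl) wins++
  }
  return wins / simulations
}

// Made-up example: 120/1000 control conversions vs 150/1000 test conversions
console.log(probabilityTestBeatsControl({ converted: 120, total: 1000 }, { converted: 150, total: 1000 }))
```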

@paolodamico
Contributor

paolodamico commented Dec 6, 2021

Alright, based on our last conversation, here's a proposal (Figma link) for what the UI could look like to define the experiment parameters in the planning stage. Any feedback? @clarkus, @liyiy, @neilkakkar - please think mostly about big UX stuff; we can figure out all the details (particularly around UI) in the next sprint.

@macobo
Contributor

macobo commented Dec 6, 2021

Minor fly-by comments

  • Experiments should have a name probably
  • No groups support?
  • How will integrating with feature flags work? Will we define one automatically based on a name?
  • I can see having yet another different filters UX will cause issues - worth unifying with sessions at least?

@paolodamico
Contributor

  • They do have a name (see breadcrumbs for instance; I have updated the mockups to reflect it in the title). Also, FYI, see the wireframes above - we have a step before where you define the name.
  • Groups support is not part of this MVP (but yes, we'll have support)
  • See wireframes above. Users will be able to create a new FF or convert an existing one (which basically means "archiving" the old one and reusing the key).
  • The filters UX follows what we have in the latest FF designs and insights, but I'll defer to @clarkus here.

@neilkakkar
Collaborator Author

neilkakkar commented Dec 6, 2021

No groups support?

Not yet, but the backend has implicit support, so once we're relatively confident things are going okay, it'll be easy to switch this on.

How will integrating with feature flags work? Will we define one automatically based on a name

If the FF exists already, we'll update it to work with experimentation. If not, we'll create a new one. What's new here is that all of these FFs will become multivariate FFs, to ensure we're accurately selecting people (each event will have control or test set) & not getting borked up by changing person properties over time, FF inaccuracies, etc.

@macobo
Contributor

macobo commented Dec 6, 2021

See wireframes above. Users will be able to create a new FF or convert an existing one (which basically means "archiving" the old one and reusing the key).

This doesn't feel like it's actually a requirement/helps towards an MVP, but adds complexity we'll need to build and maintain.

Suggestion: Don't allow key collisions and always create a new flag to go with the experiment.

@neilkakkar
Collaborator Author

I think most times the flow would be to create your own FF, test things work on a small set, and then create an experiment out of it. That's the flow we intend to support with reusing an existing FF, as it allows you to run an experiment without changes to existing code.

The complexity is limited to the experiment creation endpoint: to the rest of the world, the FFs look like what they're supposed to. -> Would you mind explaining more about what you're thinking of re: the complexity we'll need to build and maintain?

@neilkakkar
Collaborator Author

neilkakkar commented Dec 7, 2021

Some new technical constraints came up while implementing that I didn't think of earlier. This makes some things more annoying. cc: @liyiy @paolodamico

Fleshing this out with full context so people who aren't involved can contribute as well, if they want to (cc: @EDsCODE @macobo @hazzadous @mariusandra @marcushyett-ph )

The Problem

Experiments are very sensitive to measurement: choose the wrong way of filtering people at the end of the experiment, and it borks all your results. This is a big no-no. Even in the MVP, we want to do precise measurements - we want to set up infra such that the numbers going into the calculation are accurate.

So, just having global filters on the breakdown funnel (which is used for measurement) doesn't work: person properties can change over time, and we might be incorrectly selecting the control & test group. (Also what Marius said is a problem)

The Solution

Set control and test variants explicitly on events, a.k.a. use multivariate FFs to accurately select your control and test group. In this way, we count precisely those events which were tagged control or test at the time. This gets rid of both of the above problems. (Users for whom FFs haven't loaded wouldn't see the test/control variant, and thus should be discarded from the results analysis.)
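
To make that concrete, here's roughly what a captured event ends up carrying once the multivariate flag is active. The flag key and values are illustrative; the property names are the ones posthog-js attaches today:

```typescript
// Illustrative only: the flag key is made up, the property names are what
// posthog-js already attaches to captured events.
const exampleEvent = {
  event: 'user signed up',
  properties: {
    $active_feature_flags: ['experiment-signup-flow'],
    '$feature/experiment-signup-flow': 'test', // or 'control'
  },
}
```

The results funnel then breaks down on `$feature/experiment-signup-flow`, so only events that carried a variant at capture time are counted; events captured before flags loaded carry neither variant and drop out of the analysis.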

The annoyance

Earlier, we decided to create this multivariate FF implicitly behind the scenes. This leads to complications: if you want the event coming in from posthog-js to have the variant set, your /decide endpoint needs to tell posthog-js it was a multivariate FF. Which means code using isFeatureEnabled() returns true for both the control and test variants (borked 100%).

This used to fit in well with the flow where:

  1. User creates simple FF to test out new change
  2. User creates experiment out of this FF
  3. Experiment can run with no code change

Another option here was to override the /decide response and make it behave like a simple FF - but then we lose the information about which variant (control or test) was set on the event, so no bueno.

Solving the annoyance

Now, the problem is: turning the simple FF into a multivariate one behind the scenes doesn't work. The user necessarily needs to make code changes to get things working. Once an experiment has started, this is hard to do, since we set the rollout to 50-50 to start collecting experiment data. And bugs here bork the experiment again.

Put more succinctly, we want to allow users to run an experiment without making any new code changes after they've tested things work (by, say, rolling out to only their team).


I think right now, the best way around this is to be more transparent: Tell users that running experiments means using multivariate FFs. Discard support for "simple" FFs. Support 2 variants by default: We change our UI flow a bit to allow selecting 2 variants, called control and test.

And, if we're going that far, we should change the flow as well:

  1. User comes to Experimentation.
  2. User uses our wizard to create a new Feature Flag (we don't allow selecting pre-existing ones). This is multivariate, with exactly 2 variants: control & test.
  3. User selects persons, selects funnel (like right now)
  4. User saves the experiment as a draft. (This creates the FF on the backend.) We tell them the details about the FF and ask them to test that everything works as they'd expect - i.e. the variants look good. Probably put in some sample code snippet here to make things easier - actually, the exact code snippet generated using the values they've input earlier (see the sketch after this list).
  5. User goes back to their code, tests things work with this FF (We tell them to use https://posthog.com/docs/user-guides/feature-flags#develop-locally to override)
  6. User launches Experiment. At this point, we confirm the FF filters & rollout %s are what we want them to be.
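
A rough sketch of what the generated snippet (step 4) and the local testing step (step 5) might look like. The flag key and the show* functions are placeholders, and the override call is the helper the linked docs describe for local development (the exact shape may differ):

```typescript
import posthog from 'posthog-js'

// Assumes posthog.init(...) has already been called elsewhere.
function showBlueLogo() { /* hypothetical new behaviour */ }
function showGreenLogo() { /* hypothetical existing behaviour */ }

// Step 5 (local testing only): force yourself into a variant, per the linked docs.
posthog.featureFlags.override({ 'experiment-blue-logo': 'test' })

// Step 4: the snippet we'd generate from the values the user entered.
posthog.onFeatureFlags(() => {
  const variant = posthog.getFeatureFlag('experiment-blue-logo') // 'control' | 'test' | undefined
  if (variant === 'test') {
    showBlueLogo()
  } else {
    showGreenLogo() // control, and anyone whose flags haven't loaded
  }
})
```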

This is a bit more complicated in some ways than our existing flow - but our existing flow doesn't work, so I guess it's a non-comparison.

Keen to hear if there are easier ways to sort this? And does this UX make sense? Are you very opposed to this?

@posthog-contributions-bot
Contributor

This issue has 2549 words at 23 comments. Issues this long are hard to read or contribute to, and tend to take very long to reach a conclusion. Instead, why not:

  1. Write some code and submit a pull request! Code wins arguments
  2. Have a sync meeting to reach a conclusion
  3. Create a Request for Comments and submit a PR with it to the meta repo or product internal repo

Is this issue intended to be sprawling? Consider adding label epic or sprint to indicate this.

@paolodamico
Contributor

Thanks for sharing this @neilkakkar. My thoughts,

  1. In general, I think we're overestimating a bit the flow of FF release -> turns into experiment (from previous experience). Still worth solving for this case, but let's keep this in mind.
  2. A fix (maybe ugly) for the control vs. experiment tracking would be to send an additional FF which does contain this info (even though it's not actually used for functionality, because the actual flag used to display the new functionality is the overridden one that behaves like a simple flag).
  3. I think your approach makes sense, particularly for the MVP.
  4. Post-MVP we can think about how to better support this use case of transitioning a FF to an experiment. In particular, I think once we introduce multivariate support for experiments too, it'll make the mental model clearer for the user.

Finally, do let me know if you need any more help with mockups and/or wireframes. We should be doing ugly things for now, but mockups can still help. @liyiy

@weyert
Contributor

weyert commented Dec 7, 2021

Out of interest, may I ask what the use case is for wanting to convert (or use) an existing feature flag into an experiment?

@paolodamico
Contributor

Absolutely @weyert! So the typical case is, for instance, you're launching your new landing page with an A/B test to make sure it converts better; but before launching the experiment you want to make sure the new landing page looks right in production, so maybe you want to release it to your internal team for testing purposes (as a FF). Then, once you've made sure the feature behind the FF works as expected, you convert it to an experiment and launch. Does that make sense?

@neilkakkar
Collaborator Author

neilkakkar commented Dec 14, 2021

Given how it might be hard to gather feedback during this sprint, as people are away on holidays, I'm changing things up a bit from the roadmap.

cc: @paolodamico @liyiy

The tactic I want to propose is doing the things we're most confident this feature will need, and pushing the rest out until we get some feedback.

So, things to do for this sprint (in order of priority):

Edit: Forgot about the cleanup on MVP

  1. Clean up MVP
    1. End dates for experiments
    2. Clean up summary page: Show all details, tell users about control and test variants, show code snippets for how to toggle feature flag, and how to test their code works.
    3. On results, users can't tell how long their experiment ran for / is running for. Need some date range here.
  2. Support trend metrics
    1. Frontend changes to allow selecting a kind of metric: figure out unobtrusive UX flow
    2. Backend: Do calculations based on the metric
  3. Support multivariates
    1. Figure out the UI flow here (& how to make things not confusing vs. the existing flow)
    2. Backend: Figure out how to do calculations for more than 1 variant
  4. Histogram of possible improvements: We've already answered the question of whether test is better than control. Now, answer the question of "how much better is test than control?" Figure out how best to represent this
    1. I think this histogram, built using a Bayesian approach, is useful whether we go the Bayesian or frequentist route, so it makes sense to have it (sketched after this list).
    2. Edit: Prelim tests seem to suggest it's intuitive just to me 🤷🏼‍♂️. So, removing the histogram and appending "whatever representation makes sense".
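
For whenever we pick that representation back up: the "how much better?" samples fall out of the same Beta posteriors. A minimal sketch, reusing `sampleBeta` from the Bayesian sketch earlier in this thread:

```typescript
// Relative uplift samples for "how much better is test than control?".
// sampleBeta() is the helper from the earlier Bayesian sketch in this thread.
declare function sampleBeta(a: number, b: number): number

function upliftSamples(
  control: { converted: number; total: number },
  test: { converted: number; total: number },
  simulations = 10_000
): number[] {
  const samples: number[] = []
  for (let i = 0; i < simulations; i++) {
    const pControl = sampleBeta(control.converted + 1, control.total - control.converted + 1)
    const pTest = sampleBeta(test.converted + 1, test.total - test.converted + 1)
    samples.push((pTest - pControl) / pControl) // e.g. 0.25 means "test converts ~25% better"
  }
  return samples // bin into a histogram, box plot, percentile summary - whatever representation makes sense
}
```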

All of these tie into the goal of the sprint: (1) Users can run more kinds of experiments. (2) Users get richer results

@marcushyett-ph
Contributor

One piece of unsolicited feedback on the histogram: I feel this is going to be quite hard to interpret if you're not familiar with probability distributions. Is there something simpler we can do to make the probability figure easy to understand?
e.g.
[image]

@neilkakkar
Collaborator Author

Excellent piece of feedback, thanks! I'll keep this in mind, and think of alternative representations :)

@paolodamico
Contributor

paolodamico commented Dec 14, 2021

This is great @neilkakkar! Can you update the roadmap PR to have that as our up-to-date source of truth? Re Marcus's point, I would argue the histogram is something we should gather feedback on before building - I'm not convinced it will be useful for the majority of users.

Also let's chat with @clarkus about how we want to handle the UX for "how long will experiments run" (planned) vs actual running time.

@neilkakkar
Collaborator Author

Will do!

Just to go a little deeper into what you're saying:

  1. Telling users how much better the variant is than control is an important question to answer. (Yes? Or do you disagree with this?)
  2. Assuming the above is true, what we're disagreeing about is how to best represent this information.

A few ideas from other platforms: (https://vwo.com/tools/ab-test-significance-calculator/)

Show the beta distribution of conversion rates:
[image: beta distributions of conversion rates]

Show the box plot:

[image: box plot]

.. or the histogram

.. or ????

@marcushyett-ph
Contributor

These all look like good examples to use for customer feedback. +1 to @paolodamico's point - my hunch is that it'd be too complex, but I could be totally wrong - so worth double-checking.

The simplest option I can think of is something like this where we use color / shade to reflect confidence and a "meter" to represent magnitude (excuse my terrible UX skills):

[image: rough mockup of the confidence meter]

@clarkus
Contributor

clarkus commented Dec 15, 2021

I have a consolidated experiment creation design at https://www.figma.com/file/gQBj9YnNgD8YW4nBwCVLZf?node-id=6012:30357#133670116. This reduces the flow to a single step for creation. I am moving on to summarizing an actively running experiment, a completed one, etc. Let me know if you have any ideas for how we might summarize progress. I am particularly thinking about the case where an experiment is running and a user needs to decide whether to let it run or end it earlier than planned.

@paolodamico
Contributor

@neilkakkar aligned on the premise, let's just figure out the right UX here. My proposal would be to work with @clarkus on creating a few mockups for how we would display each of the 2-3 options, and then show them to users.

@clarkus
Contributor

clarkus commented Jan 4, 2022

I have posted updated screens for the experiment summary and its various states at https://www.figma.com/file/gQBj9YnNgD8YW4nBwCVLZf/PostHog-App?node-id=6240%3A34236. Please take a look and leave any feedback on anything that looks off target.

@paolodamico
Contributor

Adding comments on Figma

@neilkakkar
Collaborator Author

A broad observation: it's turning out, from user interviews, that "telling users how much better the variant is than control" is perhaps not that big a deal. They can figure out a rough, good-enough estimate using just the conversion rates / absolute values they can see.

We deprioritised histograms based on early feedback, and current feedback around "what do you think is missing from results?" doesn't seem to prompt the question "how much better is the variant?".

Will keep the idea in background, but not building anything out yet.


Since most things here have been achieved, I'm closing this; I'll open a new issue for whatever new things come up / whatever else we decide to implement.
