Experimentation MVP Things to Do #7462

Closed
4 of 5 tasks
neilkakkar opened this issue Dec 1, 2021 · 38 comments
Labels
feature/experimentation Feature Tag: Experimentation

Comments

@neilkakkar
Collaborator

neilkakkar commented Dec 1, 2021

Following up from #7418, here's what we need to do to get the MVP out in these two weeks:

Main Tasks:

  1. Figure out & build the UI Creation Flow - @liyiy
    • @clarkus & @paolodamico : we could use some input here about the basic flow (specifically on how we should structure this; disregard full-fledged wireframes for now). There are 2.5 steps to creating an experiment: (1) Create an insight & (2) Choose a FF. (2.5): Name and description of the experiment.
    • As you can tell, there's lots of ambiguity above^ that we need to clean up.
    • (We considered a flow where funnels have a "create experiment from insight" button, but @macobo mentioned how this increases complexity on the standard funnels view, and reduces discoverability quite a bit in simple mode)
    • (future context): There's a back & forth when choosing an insight & FF: They choose an insight, we tell them how long the experiment will run for. They tweak the FF rollout % / definition, we adjust how long we expect the experiment to run. They tweak the insight definition, we adjust how long to run the experiment. What's a good way of doing this?
  2. Build Experiment Page with the Insight and experiment results (also see here)
  3. Create Plugin to add $active_feature_flags & $feature/{key} to every event (we only do it for web captures today) - @neilkakkar - deferred. Just focusing on posthog-js for now. (See the sketch after this list.)
    • This implicitly allows breakdowns by Feature Flags on Funnels
  4. Given a funnel with FF breakdown, calculate experiment results
  5. Given a funnel and FF, calculate how long an experiment should run
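
For task 3, a minimal sketch of what the (deferred) plugin's processEvent hook could look like. The fetchFlagsForDistinctId helper is hypothetical - how the plugin would actually resolve flags (e.g. via the /decide endpoint) is an open question - and the property names mirror what posthog-js already sends for web captures:

```typescript
import { PluginEvent, Meta } from '@posthog/plugin-scaffold'

// Hypothetical helper: resolve the active flags for a distinct_id (e.g. by calling /decide).
async function fetchFlagsForDistinctId(distinctId: string): Promise<Record<string, string | boolean>> {
  return {}
}

export async function processEvent(event: PluginEvent, _meta: Meta): Promise<PluginEvent> {
  // Only backfill events that don't already carry flag info (web captures already do)
  if (!event.properties?.['$active_feature_flags']) {
    const flags = await fetchFlagsForDistinctId(event.distinct_id)
    event.properties = {
      ...event.properties,
      $active_feature_flags: Object.keys(flags),
      // One `$feature/<key>` property per flag, which is what funnel breakdowns would use
      ...Object.fromEntries(Object.entries(flags).map(([key, value]) => [`$feature/${key}`, value])),
    }
  }
  return event
}
```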

New constraints:

  1. To give accurate results on expected running time, we need to answer the question: given a FF, how many people in the past X days on this insight would've belonged to this FF? We'll use this to estimate. For now, we're restricting scope to basic FFs only (i.e. a flag whose only property is the rollout % to all users). A rough sketch of the estimate follows.
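
A back-of-the-envelope sketch of that estimate under the basic-FF restriction. `requiredSamplePerVariant` is a placeholder here - how that number itself is derived is a separate statistical question:

```typescript
// Back-of-the-envelope sketch only; not the actual implementation.
function estimateRunningDays(
  usersOnInsightPastXDays: number, // unique users matching the insight over the last X days
  lookbackDays: number,            // X
  rolloutPercent: number,          // basic FF: rollout % to all users
  requiredSamplePerVariant: number // placeholder - derived elsewhere
): number {
  const eligiblePerDay = (usersOnInsightPastXDays * (rolloutPercent / 100)) / lookbackDays
  const perVariantPerDay = eligiblePerDay / 2 // 50/50 control vs test split
  return Math.ceil(requiredSamplePerVariant / perVariantPerDay)
}

// e.g. 10,000 users over the last 14 days, 100% rollout, 1,000 users needed per variant
console.log(estimateRunningDays(10_000, 14, 100, 1_000)) // ≈ 3 days
```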
@paolodamico
Contributor

This is great @neilkakkar! Sounds like a plan. I'll work on some wireframes to discuss the overall flow.

@liyiy
Contributor

liyiy commented Dec 1, 2021

Are we limiting the user to what type of insight they're allowed to create?

@neilkakkar
Collaborator Author

Experimentation is only possible on Funnels, so ideally yes (your call depending on how hard it is to implement. Disabling the rest / removing them works)

@clarkus
Contributor

clarkus commented Dec 1, 2021

Is there a list of known constraints somewhere?

  • There are 2.5 steps to creation right now
    • Create a funnel insight (has to be a funnel)
    • Choose a feature flag
    • Name things
    • Some back and forth to adjust how long-lived the experiment is, etc.
  • Are there any other constraints to be aware of?

@liyiy
Contributor

liyiy commented Dec 1, 2021

@clarkus Potentially things like participant selection? 🤔

@neilkakkar
Collaborator Author

Hmm, the FF determines the participants, so this should be ok (thanks to the rollout %).

One more constraint: Each experiment can have only one insight, and only one FF set.

Don't think there's anything else

@paolodamico
Contributor

Created a (very, very ugly) proposal for what the flow can look like (see below). Some notes based on the comments I see above:

  • I think a better experience, instead of creating a normal insight, is allowing the user to select the metric they care about (see wireframe proposal). In the end we want to make sure the user can track a single number, as opposed to just seeing the same graph they'd see in insights.
  • In terms of % rollout on FFs, I think we should skip this. For the MVP I would suggest just doing 50/50 binary rollout based on the selection criteria.

PH-experimentation-MVP.pdf

@clarkus
Contributor

clarkus commented Dec 1, 2021

Here are the things I think we'd need to track / collect for creating an experiment. I have some open questions below on areas I see as needing elaboration. I think the way we determine the target metric needs more discussion, or at least consensus, for the MVP.

Experiment parameters

  • Name
  • Feature flag key - could be generated based on the name - the critical thing here is to communicate the key so it can be incorporated into code / features. There is also an automatic cleanup aspect to keys - do we need to ensure keys aren't reused in the future? Do they always have to be unique even after the experiment is concluded?
  • Description (optional)
  • Owner (not sure we need to provide this explicitly as it could be inferred by the creator of the experiment)
  • Target metric / funnel insight - needs elaboration and consensus
  • Time to live - how long the experiment runs - is this always in days? Are there upper or lower bounds?
  • Person / group targeting - can we infer this from the insight filters? Are we using rollout percentages here, or do we need to be able to provide a target count in absolute units (for example 25 persons)?
  • How does confidence level actually work? Is this derived based on percentage of audience and how long-running the experiment is? If a user is able to directly input this, how does it impact the performance of the experiment?
  • An experiment cannot be changed while it's active. Are there exceptions to this? What about situations where the experiment is malformed or has other detrimental results, can the user revert and stop? Do they need to edit and relaunch, or is it a scenario where they need to start fresh?
  • Does it make sense to colocate experiments and feature flags? They're very related in their function, but they are pretty separate items in terms of usage and user mental models. Just saying they could be a stand alone feature distinct from feature flags.

@weyert
Contributor

weyert commented Dec 1, 2021

I hope you don't mind me raising this issue from last year regarding time-limited feature flags; I might be wrong, but it sounds like it might be related: #1903

@neilkakkar
Collaborator Author

neilkakkar commented Dec 1, 2021

@paolodamico : How does experimentation over a single event work? In this case, we don't have successes & failures for the A or B side, so how do we calculate the rest of the things?

imo we should restrict experimentation to funnels only. In that case, choosing the target metric = creating the funnel.

The single number would be the conversion rate. I don't get how it can be, say, the number of pageviews (the follow-up calculations get borked in this case?)

@neilkakkar
Collaborator Author

Person Selection is interesting! Sounds like instead of choosing an existing FF, it's creating a new FF.

I was assuming that the FF would usually already exist, because there would be code changes based on this FF?

Imagine you want to test a blue vs green logo of PostHog.

I would make the change, release it only to PostHog users / myself, test that it's okay, update the selection & rollout criteria on the FF itself, and then move to the Experimentation part.

This is very hard to do if I have to create the FF inside the experiment. My selection criteria are borking my testing flexibility.

Thoughts? I feel your way is superior in terms of UX: I don't have to flip-flop around to set things up. But hopefully(?) by the time you come to run an experiment, you have your FFs & code changes already finished and ready to go.

@liyiy
Contributor

liyiy commented Dec 1, 2021

I am kind of confused about how feature flags work here: are we duplicating them as an "experiment feature flag" and editing that duplicate when we change its rollout percentages, etc.? Also, I think the FF selection should be a dropdown of existing FFs instead of entering one manually.

@mariusandra
Collaborator

A reminder that we have a small data integrity issue with feature flags: posthog-js sends the first pageview before /decide returns a set of flags. To fix this, when the event doesn't have the $active_feature_flags property, we backfill the flags on the server side. However, there's an edge case when we send a cached set of feature flags on the client side.

In practice: the first pageview of a user returning after 3 days may contain a very different set of flags than all the following events in that session.

More context here: #6043 (comment)

@paolodamico
Contributor

paolodamico commented Dec 2, 2021

How does experimentation over a single event work? In this case, we don't have successes & failures for the A or B side, so how do we calculate the rest of the things?

I think it's a matter of comparing the baseline (control group) vs. experiment group. So for instance you target an increase in the number of Discoveries (and we could measure it over the time frame of the experiment). Or we could potentially target an increase in weekly active users (and then average weekly active users over the time frame). Though I could see the point of the MVP only targeting funnel conversion.

Person Selection is interesting! Sounds like instead of choosing an existing FF, it's creating a new FF.

I think we could support "converting" an existing flag (which would basically mean reusing the key), but I think doing person selection differently from what we do for FFs makes more sense.

We can have a sync conversation tomorrow to iron out these details!

@neilkakkar
Collaborator Author

Let's chat about this today!

What's worth ironing out here is that in this case, both control & experiment need another baseline with which they're being compared. (But I also found out that there's a possible alternative here, by setting exposure - i.e. varying how long A & B are active for.)

Or, it can be something like: the baseline is number of pageviews over WAU, and then the A/B test focuses on, say, insights analyzed count. This is possible.

(And yes, would like to scope down MVP. Getting the calculations for either of these right can get hard).

@paolodamico
Contributor

Very short summary from our sync meeting today (feel free to edit if I missed something):

  • Aligned on following the FF process outlined in the wireframes but we'll allow users to select an existing feature flag too (for the use case @neilkakkar described above). If they do this we'll "archive" the old flag and reuse the ID.
  • We'll explore using multivariate feature flags to assign users to either the experiment or the control group. This way we won't have to worry about mutable person properties over time or other filtering complexity.
  • @liyiy & @paolodamico will continue to figure out the general UX flow (particularly around how to create your target goal funnel while adjusting other FF params).
    • @clarkus to avoid overwhelming you we'll do ugly things for now, and we can talk about prioritizing solid designs next sprint.
  • We will use @neilkakkar's proposal on the Bayesian approach. Aside from all the great benefits that the original post outlines, it'll provide for a better UX - one less thing to worry about. (A rough sketch of what this comparison could look like follows this list.)
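
To make the Bayesian approach concrete, here's a minimal sketch (not the actual implementation; the counts in the example are made up) of estimating the probability that the test variant's funnel conversion beats control, using Beta posteriors and Monte Carlo sampling:

```typescript
// Box-Muller standard normal sample
function gaussian(): number {
  let u = 0
  let v = 0
  while (u === 0) u = Math.random()
  while (v === 0) v = Math.random()
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v)
}

// Marsaglia-Tsang sampler for Gamma(shape, 1)
function sampleGamma(shape: number): number {
  if (shape < 1) {
    return sampleGamma(shape + 1) * Math.pow(Math.random(), 1 / shape)
  }
  const d = shape - 1 / 3
  const c = 1 / Math.sqrt(9 * d)
  for (;;) {
    let x: number
    let v: number
    do {
      x = gaussian()
      v = 1 + c * x
    } while (v <= 0)
    v = v * v * v
    const u = Math.random()
    if (u < 1 - 0.0331 * x ** 4) return d * v
    if (Math.log(u) < 0.5 * x * x + d * (1 - v + Math.log(v))) return d * v
  }
}

// Beta(a, b) sample via two Gamma samples
function sampleBeta(a: number, b: number): number {
  const x = sampleGamma(a)
  const y = sampleGamma(b)
  return x / (x + y)
}

interface VariantCounts {
  converted: number // users who completed the funnel
  total: number     // users who entered the funnel
}

// P(test conversion rate > control conversion rate), with a uniform Beta(1, 1) prior
function probabilityTestBeatsControl(control: VariantCounts, test: VariantCounts, simulations = 100_000): number {
  let wins = 0
  for (let i = 0; i < simulations; i++) {
    const pControl = sampleBeta(control.converted + 1, control.total - control.converted + 1)
    const pTest = sampleBeta(test.converted + 1, test.total - test.converted + 1)
    if (pTest > pControl) wins++
  }
  return wins / simulations
}

// Made-up example: 120/1000 control conversions vs 150/1000 test conversions
console.log(probabilityTestBeatsControl({ converted: 120, total: 1000 }, { converted: 150, total: 1000 }))
```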

@paolodamico
Contributor

paolodamico commented Dec 6, 2021

Alright, based on our last conversation, here's a proposal (Figma link) for what the UI could look like to define the experiment parameters in the planning stage. Any feedback? @clarkus, @liyiy, @neilkakkar - please think mostly about big UX stuff; we can figure out all the details (particularly around UI) in the next sprint.

@macobo
Contributor

macobo commented Dec 6, 2021

Minor fly-by comments

  • Experiments should have a name probably
  • No groups support?
  • How will integrating with feature flags work? Will we define one automatically based on a name?
  • I can see having yet another different filters UX will cause issues - worth unifying with sessions at least?

@paolodamico
Contributor

  • They do have a name (see breadcrumbs for instance; I have updated the mockups to reflect it in the title). Also, FYI, see the wireframes above - we have a step before where you define the name.
  • Groups support is not part of this MVP (but yes, we'll have support)
  • See wireframes above. Users will be able to create a new FF or convert an existing one (which basically means "archiving" the old one and reusing the key).
  • The filters UX follows what we have in the latest FF designs and insights, but I'll defer to @clarkus here.

@neilkakkar
Collaborator Author

neilkakkar commented Dec 6, 2021

No groups support?

Not yet, but the backend has implicit support, so once we're relatively confident things are going okay, it'll be easy to switch this on.

How will integrating with feature flags work? Will we define one automatically based on a name

If the FF exists already, we'll update it to work with experimentation. If not, we'll create a new one. What's new here is that all of these FFs will become multivariate FFs, to ensure we're accurately selecting people (each event will have control or test set) & not getting borked up by changing person properties over time, FF inaccuracies, etc.

@macobo
Contributor

macobo commented Dec 6, 2021

See wireframes above. Users will be able to create a new FF or convert an existing one (which basically means "archiving" the old one and reusing the key).

This doesn't feel like it's actually a requirement/helps towards an MVP, but adds complexity we'll need to build and maintain.

Suggestion: Don't allow key collisions and always create a new flag to go with the experiment.

@neilkakkar
Collaborator Author

I think most times the flow would be to create your own FF, test things work on a small set, and then create an experiment out of it. That's the flow we intend to support with reusing an existing FF, as it allows you to run an experiment without changes to existing code.

The complexity is limited to the experiment creation endpoint: to the rest of the world, the FFs look like what they're supposed to. -> Would you mind explaining more about what you're thinking of re: the complexity we'll need to build and maintain?

@neilkakkar
Collaborator Author

neilkakkar commented Dec 7, 2021

Some new technical constraints came up while implementing that I didn't think of earlier. This makes some things more annoying. cc: @liyiy @paolodamico

Fleshing this out with full context so people who aren't involved can contribute as well, if they want to (cc: @EDsCODE @macobo @hazzadous @mariusandra @marcushyett-ph )

The Problem

Experiments are very sensitive to measurement: choose the wrong way of filtering people at the end of the experiment, and it borks all your results. This is a big no-no. Even in the MVP, we want to do precise measurements - we want to set up infra such that the numbers going into the calculation are accurate.

So, just having global filters on the breakdown funnel (which is used for measurement) doesn't work: person properties can change over time, and we might be incorrectly selecting the control & test group. (Also what Marius said is a problem)

The Solution

Set control and test variants explicitly on events, a.k.a. use multivariate FFs to accurately select your control and test group. In this way, we count precisely those events which were tagged control or test at the time. This gets rid of both of the above problems. (Users for whom FFs haven't loaded wouldn't see the test/control variant, and thus should be discarded from the results analysis.)
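
To make that concrete, here's roughly what a captured event ends up carrying once the multivariate flag is active. The flag key and values are illustrative; the property names are the ones posthog-js attaches today:

```typescript
// Illustrative only: the flag key is made up, the property names are what
// posthog-js already attaches to captured events.
const exampleEvent = {
  event: 'user signed up',
  properties: {
    $active_feature_flags: ['experiment-signup-flow'],
    '$feature/experiment-signup-flow': 'test', // or 'control'
  },
}
```

The results funnel then breaks down on `$feature/experiment-signup-flow`, so only events that carried a variant at capture time are counted; events captured before flags loaded carry neither variant and drop out of the analysis.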

The annoyance

Earlier, we decided to create this multivariate FF implicitly behind the scenes. This leads to complications: if you want the event coming in from posthog-js to have the variant set, your /decide endpoint needs to tell posthog-js it was a multivariate FF. Which means code using isFeatureEnabled() returns true for both the control and test variants (borked 100%).

This used to fit in well with the flow where:

  1. User creates simple FF to test out new change
  2. User creates experiment out of this FF
  3. Experiment can run with no code change

Another option here was to override the /decide response and make it behave like a simple FF - but then we lose the information about which variant (control or test) was set on the event, so no bueno.

Solving the annoyance

Now, the problem is: turning the simple FF into a multivariate one behind the scenes doesn't work. The user necessarily needs to make code changes to get things working. Once an experiment has started, this is hard to do, since we set the rollout to 50-50 to start collecting experiment data. And bugs here bork the experiment again.

Put more succinctly, we want to allow users to run an experiment without making any new code changes after they've tested things work (by, say, rolling out to only their team).


I think right now, the best way around this is to be more transparent: Tell users that running experiments means using multivariate FFs. Discard support for "simple" FFs. Support 2 variants by default: We change our UI flow a bit to allow selecting 2 variants, called control and test.

And, if we're going that far, we should change the flow as well:

  1. User comes to Experimentation.
  2. User uses our wizard to create a new Feature Flag (we don't allow selecting pre-existing ones). This is multivariate, with exactly 2 variants: control & test.
  3. User selects persons, selects funnel (like right now)
  4. User saves the experiment as a draft. (This creates the FF on the backend.) We tell them the details about the FF and ask them to test that everything works as they'd expect - i.e. the variants look good. Probably put in some sample code snippet here to make things easier - actually, the exact code snippet generated using the values they've input earlier (see the sketch after this list).
  5. User goes back to their code, tests things work with this FF (We tell them to use https://posthog.com/docs/user-guides/feature-flags#develop-locally to override)
  6. User launches Experiment. At this point, we confirm the FF filters & rollout %s are what we want them to be.
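
A rough sketch of what the generated snippet (step 4) and the local testing step (step 5) might look like. The flag key and the show* functions are placeholders, and the override call is the helper the linked docs describe for local development (the exact shape may differ):

```typescript
import posthog from 'posthog-js'

// Assumes posthog.init(...) has already been called elsewhere.
function showBlueLogo() { /* hypothetical new behaviour */ }
function showGreenLogo() { /* hypothetical existing behaviour */ }

// Step 5 (local testing only): force yourself into a variant, per the linked docs.
posthog.featureFlags.override({ 'experiment-blue-logo': 'test' })

// Step 4: the snippet we'd generate from the values the user entered.
posthog.onFeatureFlags(() => {
  const variant = posthog.getFeatureFlag('experiment-blue-logo') // 'control' | 'test' | undefined
  if (variant === 'test') {
    showBlueLogo()
  } else {
    showGreenLogo() // control, and anyone whose flags haven't loaded
  }
})
```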

This is a bit more complicated in some ways than our existing flow - but our existing flow doesn't work, so I guess it's a non-comparison.

Keen to hear if there are easier ways to sort this? And does this UX make sense? Are you very opposed to this?

@posthog-contributions-bot
Contributor

This issue has 2549 words at 23 comments. Issues this long are hard to read or contribute to, and tend to take very long to reach a conclusion. Instead, why not:

  1. Write some code and submit a pull request! Code wins arguments
  2. Have a sync meeting to reach a conclusion
  3. Create a Request for Comments and submit a PR with it to the meta repo or product internal repo

Is this issue intended to be sprawling? Consider adding label epic or sprint to indicate this.

@paolodamico
Contributor

Thanks for sharing this @neilkakkar. My thoughts,

  1. In general, I think we're overestimating a bit the flow of FF release -> turns into experiment (from previous experience). Still worth solving for this case, but let's keep this in mind.
  2. A fix (maybe ugly) for the control vs. experiment tracking would be to send an additional FF which does contain this info (even though it's not actually used for functionality, because the actual flag used to display the new functionality is the overridden one that behaves like a simple flag).
  3. I think your approach makes sense, particularly for the MVP.
  4. Post-MVP we can think about how to better support this use case of transitioning a FF to an experiment. In particular, I think once we introduce multivariate support for experiments too, it'll make the mental model clearer for the user.

Finally, do let me know if you need any more help with mockups and/or wireframes. We should be doing ugly things for now, but mockups can still help. @liyiy

@weyert
Contributor

weyert commented Dec 7, 2021

Out of interest, may I ask what the use case is for wanting to convert (or use) an existing feature flag into an experiment?

@paolodamico
Contributor

Absolutely @weyert! So the typical case is, for instance, you're launching your new landing page with an A/B test to make sure it converts better; but before launching the experiment you want to make sure the new landing page looks right in production, so maybe you want to release it to your internal team for testing purposes (as a FF). Then, once you've made sure the feature behind the FF works as expected, you convert it to an experiment and launch. Does that make sense?

@neilkakkar
Collaborator Author

neilkakkar commented Dec 14, 2021

Given how it might be hard to gather feedback during this sprint, as people are away on holidays, I'm changing things up a bit from the roadmap.

cc: @paolodamico @liyiy

The tactic I want to propose is doing the things we're most confident this feature will need, and pushing the rest out until we get some feedback.

So, things to do for this sprint (in order of priority):

Edit: Forgot about the cleanup on MVP

  1. Clean up MVP
    1. End dates for experiments
    2. Clean up summary page: Show all details, tell users about control and test variants, show code snippets for how to toggle feature flag, and how to test their code works.
    3. On results, users can't tell how long their experiment ran for / is running for. Need some date range here.
  2. Support trend metrics
    1. Frontend changes to allow selecting a kind of metric: figure out unobtrusive UX flow
    2. Backend: Do calculations based on the metric
  3. Support multivariates
    1. Figure out the UI flow here (& how to make things not confusing vs. the existing flow)
    2. Backend: Figure out how to do calculations for more than 1 variant
  4. Histogram of possible improvements: We've already answered the question of whether test is better than control. Now, answer the question of "how much better is test than control?" Figure out how best to represent this
    1. I think this histogram, built using a Bayesian approach, is useful whether we go the Bayesian or frequentist route, so it makes sense to have it (sketched after this list).
    2. Edit: Prelim tests seem to suggest it's intuitive just to me 🤷🏼‍♂️. So, removing the histogram and appending "whatever representation makes sense".
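
For whenever we pick that representation back up: the "how much better?" samples fall out of the same Beta posteriors. A minimal sketch, reusing `sampleBeta` from the Bayesian sketch earlier in this thread:

```typescript
// Relative uplift samples for "how much better is test than control?".
// sampleBeta() is the helper from the earlier Bayesian sketch in this thread.
declare function sampleBeta(a: number, b: number): number

function upliftSamples(
  control: { converted: number; total: number },
  test: { converted: number; total: number },
  simulations = 10_000
): number[] {
  const samples: number[] = []
  for (let i = 0; i < simulations; i++) {
    const pControl = sampleBeta(control.converted + 1, control.total - control.converted + 1)
    const pTest = sampleBeta(test.converted + 1, test.total - test.converted + 1)
    samples.push((pTest - pControl) / pControl) // e.g. 0.25 means "test converts ~25% better"
  }
  return samples // bin into a histogram, box plot, percentile summary - whatever representation makes sense
}
```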

All of these tie into the goal of the sprint: (1) Users can run more kinds of experiments. (2) Users get richer results

@marcushyett-ph
Contributor

One piece of unsolicited feedback on the histogram: I feel this is going to be quite hard to interpret if you're not familiar with probability distributions. Is there something simpler we can do to make the probability figure easy to understand?
e.g.
[image]

@neilkakkar
Collaborator Author

Excellent piece of feedback, thanks! I'll keep this in mind, and think of alternative representations :)

@paolodamico
Contributor

paolodamico commented Dec 14, 2021

This is great @neilkakkar! Can you update the roadmap PR to have that as our up-to-date source of truth? Re Marcus's point, I would argue the histogram is something we should gather feedback on before building - I'm not convinced it will be useful for the majority of users.

Also let's chat with @clarkus about how we want to handle the UX for "how long will experiments run" (planned) vs actual running time.

@neilkakkar
Collaborator Author

Will do!

Just to go a little deeper into what you're saying:

  1. Telling users how much better the variant is than control is an important question to answer. (Yes? Or do you disagree with this?)
  2. Assuming the above is true, what we're disagreeing about is how to best represent this information.

A few ideas from other platforms: (https://vwo.com/tools/ab-test-significance-calculator/)

Show the beta distribution of conversion rates:
[image: beta distributions of conversion rates]

Show the box plot:

[image: box plot]

.. or the histogram

.. or ????

@marcushyett-ph
Contributor

These all look like good examples to use for customer feedback. +1 to @paolodamico's point - my hunch is that it'd be too complex, but I could be totally wrong - so worth double-checking.

The simplest option I can think of is something like this where we use color / shade to reflect confidence and a "meter" to represent magnitude (excuse my terrible UX skills):

[image: rough mockup of the confidence meter]

@clarkus
Contributor

clarkus commented Dec 15, 2021

I have a consolidated experiment creation design at https://www.figma.com/file/gQBj9YnNgD8YW4nBwCVLZf?node-id=6012:30357#133670116. This reduces the flow to a single step for creation. I am moving on to summarizing an actively running experiment, a completed one, etc. Let me know if you have any ideas for how we might summarize progress. I am particularly thinking about the case where an experiment is running and a user needs to decide whether to let it run or end it earlier than planned.

@paolodamico
Contributor

@neilkakkar aligned on the premise, let's just figure out the right UX here. My proposal would be to work with @clarkus on creating a few mockups for how we would display each of the 2-3 options, and then show them to users.

@clarkus
Contributor

clarkus commented Jan 4, 2022

I have posted updated screens for the experiment summary and its various states at https://www.figma.com/file/gQBj9YnNgD8YW4nBwCVLZf/PostHog-App?node-id=6240%3A34236. Please take a look and leave any feedback on anything that looks off target.

@paolodamico
Contributor

Adding comments on Figma

@neilkakkar
Collaborator Author

A broad observation: it's turning out, from user interviews, that "telling users how much better the variant is than control" is perhaps not that big a deal. They can figure out a rough, good-enough estimate using just the conversion rates / absolute values they can see.

We deprioritised histograms based on early feedback, and current feedback around "what do you think is missing from results?" doesn't seem to prompt the question "how much better is the variant?".

Will keep the idea in background, but not building anything out yet.


Since most things here have been achieved, I'm closing this; I'll open a new issue for whatever new things come up / whatever else we decide to implement.
