title: AB test - redesigning our TCs in the signup funnel.    
author: Fabio Schmidt-Fischbach   
date: 2020-04-01   
region: EU   
tags: acquire, signup, sign up, funnel, ab test, tc, fabio, paola   
summary: Here we present the results of the test for the design revamp of the T&Cs in the sign up flow. The conversion from create account to sign up complete rose by 1.05%. This corresponds to a 0.3 pp improvement of SUI to SU. The % with facebook consent went down 5pp (from 55 to 50%), and the % with moneybeam consent went down 8.4pp (from 59.7% to 51.3%).

In [None]:
import pandas as pd
import altair as alt
import seaborn as sns
import numpy as np

# New T&C test

Setup: The signup team revamped the design of the T&Cs in the sign up flow. 

Test: split total traffic 50:50 and compare which variant performs better. 

- Core KPI: % conversion from create-account screen to sign up complete. 
- Additional KPIs: % users who give FB audience consent and moneybeam consent

Results: 
- Conversion from create account to sign up complete rose by 1.05% in relative terms (from 95.6% to 96.6%). This corresponds to a 0.3 pp improvement of SUI to SU.
- % with Facebook consent went down 5 pp (from 55% to 50%).
- % with moneybeam consent went down 8.4 pp (from 59.7% to 51.3%).
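
To make the difference between percentage-point and relative changes concrete, here is a minimal sketch that recomputes both from the group means printed in the tables of this notebook (0.966226 vs 0.956207 for sign up, 0.513017 vs 0.597156 for moneybeam consent). The helper name `lift` is hypothetical and not part of the original analysis.

```python
def lift(control, variant):
    """Return (absolute change in percentage points, relative change in percent)."""
    abs_pp = 100 * (variant - control)
    rel_pct = 100 * (variant - control) / control
    return abs_pp, rel_pct


# Group means copied from the tables in this notebook.
su_pp, su_rel = lift(control=0.956207, variant=0.966226)
mb_pp, mb_rel = lift(control=0.597156, variant=0.513017)

print(f"sign up:   {su_pp:+.1f} pp, {su_rel:+.2f} % relative")
print(f"moneybeam: {mb_pp:+.1f} pp, {mb_rel:+.2f} % relative")
```

The relative figures are always measured against the control group's rate, which is why a change of about 1 pp on a base of roughly 96% is also about 1% in relative terms, while 8.4 pp on a base of roughly 60% is about 14%.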

Nerd-extra: We show results with both

- frequentist testing (wait until the end of the experiment, then look at the results)
- and Bayesian testing (stop when there is enough evidence to call it).

The same conclusion could have been reached with Bayesian testing with 60% of the sample. 

In [None]:
query = """

select 	
		csu.id as shadow_user_id, 
		se_property, 
		platform, 
        case when geo.country in ('FR','DE','ES','IT','AT') then country else 'RoE' end as country_group,
        case when visible_as_contact = 1 then 1 else 0 end as moneybeam, 
        case when facebook_audience_consent = 1 then 1 else 0 end as facebook, 
		case when csu.user_created is not null then 1 else 0 end as su
from ksp_event_crab as ec 
inner join ksp_event_userid as eui using (event_id) 
inner join cmd_shadow_user as csu on csu.id = eui.user_id 
left join cmd_user_preferences as cup on cup.user_created = csu.user_created 
inner join ksp_ip_geo geo on geo.event_id = ec.event_id 
where se_property in ('terms_v2', 'terms_v2_control') 
	and event_type = 630 
	and (JSON_EXTRACT_PATH_TEXT(ec.se_label, 'step') like 'create-account')
group by 1,2,3,4,5,6,7
	

"""

In [30]:
# clean up data
data = pd.read_csv("terms_v2.csv")

# drop users that saw both variants.

d = data.groupby("shadow_user_id")["se_property"].agg("nunique").reset_index()

# keep only problem user ids
d = d.loc[d["se_property"] > 1, "shadow_user_id"]

# now drop this list of user ids from data.

# print(data.shape)

data = data.loc[data["shadow_user_id"].isin(d) == False, :]

# print(data.shape)

### Sample size 

In [32]:
data = pd.read_csv("terms_v2.csv")

d = data.groupby("shadow_user_id")["se_property"].agg("nunique").reset_index()
d = d.loc[d["se_property"] > 1, "shadow_user_id"]
data = data.loc[data["shadow_user_id"].isin(d) == False, :]

data = data.groupby("se_property")["shadow_user_id"].agg("nunique").reset_index()

alt.Chart(data).mark_bar().encode(
    x=alt.X("se_property:N", axis=alt.Axis(title="Group")),
    y=alt.Y("shadow_user_id:Q", axis=alt.Axis(title="Number of users")),
).properties(width=400, height=400, title="Sample size")

## Core KPI: conversion rate 

In [87]:
data = pd.read_csv("terms_v2.csv")

d = data.groupby("shadow_user_id")["se_property"].agg("nunique").reset_index()
d = d.loc[d["se_property"] > 1, "shadow_user_id"]
data = data.loc[data["shadow_user_id"].isin(d) == False, :]

# go down to one row per user.
data = data.groupby(["shadow_user_id", "se_property"])["su"].agg("max").reset_index()

data = data.groupby("se_property")["su"].agg("mean").reset_index()

alt.Chart(data).mark_bar().encode(
    x=alt.X("se_property:N", axis=alt.Axis(title="Group")),
    y=alt.Y(
        "su:Q",
        axis=alt.Axis(title="% that complete sign up"),
        scale=alt.Scale(domain=(0.9, 0.98)),
    ),
).properties(width=400, height=400, title="% that complete sign up")

In [120]:
data = pd.read_csv("terms_v2.csv")

d = data.groupby("shadow_user_id")["se_property"].agg("nunique").reset_index()
d = d.loc[d["se_property"] > 1, "shadow_user_id"]
data = data.loc[data["shadow_user_id"].isin(d) == False, :]

# go down to one row per user.
data = data.groupby(["shadow_user_id", "se_property"])["su"].agg("max").reset_index()

data = data.groupby("se_property")["su"].agg("mean").reset_index()
data.head()

Unnamed: 0,se_property,su
0,terms_v2,0.966226
1,terms_v2_control,0.956207


In [65]:
from statsmodels.stats.proportion import proportions_ztest

data = pd.read_csv("terms_v2.csv")

d = data.groupby("shadow_user_id")["se_property"].agg("nunique").reset_index()
d = d.loc[d["se_property"] > 1, "shadow_user_id"]
data = data.loc[data["shadow_user_id"].isin(d) == False, :]

# go down to one row per user.
data = data.groupby(["shadow_user_id", "se_property"])["su"].agg("max").reset_index()

data = data.groupby("se_property")["su"].agg(["count", "sum"]).reset_index()

# run z test. (two sided)
stat, pval = proportions_ztest(data["sum"], data["count"])

print(
    "The z-score for this test is %s which corresponds to a p-value of %s"
    % (round(stat, 2), round(pval, 4))
)

if pval < 0.05:
    print("The difference is significant.")
else:
    print("The difference is not significant.")


# compute the effect size.
data["cr"] = data["sum"] / data["count"]

# .iloc[0] picks the single matching row regardless of its index label.
mu_v = data.loc[data["se_property"] == "terms_v2", "cr"].iloc[0]
mu_c = data.loc[data["se_property"] != "terms_v2", "cr"].iloc[0]
effect_size = 100 * ((mu_v - mu_c) / mu_c)

print("The effect size is %s Percent" % (round(effect_size, 2)))

The z-score for this test is 3.96 which corresponds to a p-value of 0.0001
The difference is significant.
The effect size is 1.05 Percent


In [72]:
data = pd.read_csv("terms_v2.csv")

d = data.groupby("shadow_user_id")["se_property"].agg("nunique").reset_index()
d = d.loc[d["se_property"] > 1, "shadow_user_id"]
data = data.loc[data["shadow_user_id"].isin(d) == False, :]

data = data.groupby(["se_property", "country_group"])["su"].agg("mean").reset_index()

alt.Chart(data).mark_bar().encode(
    x=alt.X("se_property:N", axis=alt.Axis(title="Group")),
    y=alt.Y(
        "su:Q",
        axis=alt.Axis(title="% that complete sign up"),
        scale=alt.Scale(domain=(0.9, 0.98)),
    ),
    column="country_group:N",
).properties(width=400, height=400, title="% that complete sign up")

The effect is visible across all markets.

## How did consent response change? 

In [81]:
data = pd.read_csv("terms_v2.csv")

d = data.groupby("shadow_user_id")["se_property"].agg("nunique").reset_index()
d = d.loc[d["se_property"] > 1, "shadow_user_id"]
data = data.loc[data["shadow_user_id"].isin(d) == False, :]

data = data.groupby("se_property")["facebook"].agg("mean").reset_index()

alt.Chart(data).mark_bar().encode(
    x=alt.X("se_property:N", axis=alt.Axis(title="Group")),
    y=alt.Y("facebook:Q", axis=alt.Axis(title="% that give FB consent", format="%")),
).properties(width=400, height=400, title="% that give FB consent")

In [78]:
from statsmodels.stats.proportion import proportions_ztest

data = pd.read_csv("terms_v2.csv")

d = data.groupby("shadow_user_id")["se_property"].agg("nunique").reset_index()
d = d.loc[d["se_property"] > 1, "shadow_user_id"]
data = data.loc[data["shadow_user_id"].isin(d) == False, :]

# go down to one row per user.
data = (
    data.groupby(["shadow_user_id", "se_property"])["facebook"].agg("max").reset_index()
)

data = data.groupby("se_property")["facebook"].agg(["count", "sum"]).reset_index()

# run z test. (two sided)
stat, pval = proportions_ztest(data["sum"], data["count"])

print(
    "The z-score for this test is %s which corresponds to a p-value of %s"
    % (round(stat, 2), round(pval, 4))
)

if pval < 0.05:
    print("The difference is significant.")
else:
    print("The difference is not significant.")


# compute the effect size.
data["cr"] = data["sum"] / data["count"]

# .iloc[0] picks the single matching row regardless of its index label.
mu_v = data.loc[data["se_property"] == "terms_v2", "cr"].iloc[0]
mu_c = data.loc[data["se_property"] != "terms_v2", "cr"].iloc[0]
effect_size = 100 * ((mu_v - mu_c) / mu_c)

print("The effect size is %s Percent" % (round(effect_size, 2)))

The z-score for this test is -8.63 which corresponds to a p-value of 0.0
The difference is significant.
The effect size is -10.23 Percent


In [82]:
data = pd.read_csv("terms_v2.csv")

d = data.groupby("shadow_user_id")["se_property"].agg("nunique").reset_index()
d = d.loc[d["se_property"] > 1, "shadow_user_id"]
data = data.loc[data["shadow_user_id"].isin(d) == False, :]

data = data.groupby("se_property")["moneybeam"].agg("mean").reset_index()

alt.Chart(data).mark_bar().encode(
    x=alt.X("se_property:N", axis=alt.Axis(title="Group")),
    y=alt.Y(
        "moneybeam:Q", axis=alt.Axis(title="% that give moneybeam consent", format="%")
    ),
).properties(width=400, height=400, title="% that give moneybeam consent")

In [121]:
data = pd.read_csv("terms_v2.csv")

d = data.groupby("shadow_user_id")["se_property"].agg("nunique").reset_index()
d = d.loc[d["se_property"] > 1, "shadow_user_id"]
data = data.loc[data["shadow_user_id"].isin(d) == False, :]

data = data.groupby("se_property")["moneybeam"].agg("mean").reset_index()

data.head()

Unnamed: 0,se_property,moneybeam
0,terms_v2,0.513017
1,terms_v2_control,0.597156


In [122]:
data = pd.read_csv("terms_v2.csv")

d = data.groupby("shadow_user_id")["se_property"].agg("nunique").reset_index()
d = d.loc[d["se_property"] > 1, "shadow_user_id"]
data = data.loc[data["shadow_user_id"].isin(d) == False, :]

data = (
    data.groupby(["se_property", "country_group"])["moneybeam"]
    .agg("mean")
    .reset_index()
)

alt.Chart(data).mark_bar().encode(
    x=alt.X("se_property:N", axis=alt.Axis(title="Group")),
    y=alt.Y(
        "moneybeam:Q", axis=alt.Axis(title="% that give moneybeam consent", format="%")
    ),
    column="country_group:N",
).properties(width=400, height=400, title="% that give moneybeam consent")

In [123]:
data = pd.read_csv("terms_v2.csv")

d = data.groupby("shadow_user_id")["se_property"].agg("nunique").reset_index()
d = d.loc[d["se_property"] > 1, "shadow_user_id"]
data = data.loc[data["shadow_user_id"].isin(d) == False, :]

data = (
    data.groupby(["se_property", "country_group"])["facebook"].agg("mean").reset_index()
)

alt.Chart(data).mark_bar().encode(
    x=alt.X("se_property:N", axis=alt.Axis(title="Group")),
    y=alt.Y(
        "facebook:Q", axis=alt.Axis(title="% that give facebook consent", format="%")
    ),
    column="country_group:N",
).properties(width=400, height=400, title="% that give facebook consent")

In [80]:
from statsmodels.stats.proportion import proportions_ztest

data = pd.read_csv("terms_v2.csv")

d = data.groupby("shadow_user_id")["se_property"].agg("nunique").reset_index()
d = d.loc[d["se_property"] > 1, "shadow_user_id"]
data = data.loc[data["shadow_user_id"].isin(d) == False, :]

# go down to one row per user.
data = (
    data.groupby(["shadow_user_id", "se_property"])["moneybeam"]
    .agg("max")
    .reset_index()
)

data = data.groupby("se_property")["moneybeam"].agg(["count", "sum"]).reset_index()

# run z test. (two sided)
stat, pval = proportions_ztest(data["sum"], data["count"])

print(
    "The z-score for this test is %s which corresponds to a p-value of %s"
    % (round(stat, 2), round(pval, 4))
)

if pval < 0.05:
    print("The difference is significant.")
else:
    print("The difference is not significant.")


# compute the effect size.
data["cr"] = data["sum"] / data["count"]

# .iloc[0] picks the single matching row regardless of its index label.
mu_v = data.loc[data["se_property"] == "terms_v2", "cr"].iloc[0]
mu_c = data.loc[data["se_property"] != "terms_v2", "cr"].iloc[0]
effect_size = 100 * ((mu_v - mu_c) / mu_c)

print("The effect size is %s Percent" % (round(effect_size, 2)))

The z-score for this test is -12.9 which corresponds to a p-value of 0.0
The difference is significant.
The effect size is -14.08 Percent


In [7]:
from statsmodels.stats.proportion import proportions_ztest

stat, pval = proportions_ztest([935, 1138], [26977, 25883])

print(stat, pval)


effect_size = 100 * ((1138 / 25883) - (935 / 26977)) / (935 / 26977)
effect_size

-5.5111828636525235 3.564301762191076e-08


26.85561373362361
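
For readers who want to see what `proportions_ztest` does with these counts, the following sketch recomputes the two-proportion z-test by hand using only the standard library. It assumes the pooled-variance form of the test (the default behaviour as far as we know); the function name `pooled_ztest` is illustrative.

```python
import math


def pooled_ztest(count, nobs):
    """Two-sided z-test for equality of two proportions with a pooled variance."""
    p1, p2 = count[0] / nobs[0], count[1] / nobs[1]
    p_pool = (count[0] + count[1]) / (nobs[0] + nobs[1])
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / nobs[0] + 1 / nobs[1]))
    z = (p1 - p2) / se
    # Two-sided p-value of a standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    pval = math.erfc(abs(z) / math.sqrt(2))
    return z, pval


z, pval = pooled_ztest([935, 1138], [26977, 25883])
rel_lift = 100 * ((1138 / 25883) - (935 / 26977)) / (935 / 26977)
print(z, pval, rel_lift)
```

The sign of the z-score only reflects the order in which the two groups are passed. The relative lift is the second group's rate measured against the first group's rate.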

In [None]:
| Group | Rate (%) | Count | Total |
|---|---|---|---|
| Control | 3.47 | 935 | 26,977 |
| Treatment | 4.4 | 1,138 | 25,883 |

## What does this effect size mean for SUI to SU conversion? 

We can decompose the SUI to SU conversion into 

Prob(SUI to SU) = Prob(SUI to create account) x Prob(create account to SU) 

Using Prob(SUI to create account) = 0.3358 (last week's value), Prob(create account to SU | variant) = 0.966 and Prob(create account to SU | control) = 0.956, we compute the overall CR under control and variant.

Prob ( SUI to SU | variant ) = 0.324
Prob ( SUI to SU | control ) = 0.321

The overall effect is 0.3 pp (about 1% in relative terms).
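
The decomposition is simple enough to check in a few lines. This sketch just multiplies the rounded probabilities quoted in the text; the variable names are illustrative.

```python
# Probabilities quoted in the text.
p_sui_to_ca = 0.3358        # Prob(SUI to create account), last week
p_ca_to_su_variant = 0.966  # Prob(create account to SU | variant)
p_ca_to_su_control = 0.956  # Prob(create account to SU | control)

# Prob(SUI to SU) = Prob(SUI to create account) x Prob(create account to SU)
p_variant = p_sui_to_ca * p_ca_to_su_variant
p_control = p_sui_to_ca * p_ca_to_su_control

diff_pp = 100 * (p_variant - p_control)
rel_pct = 100 * (p_variant / p_control - 1)

print(round(p_variant, 3), round(p_control, 3))
print(round(diff_pp, 1), "pp;", round(rel_pct, 2), "% relative")
```

Because the first factor is the same in both arms, the relative effect on SUI to SU equals the relative effect on the create-account step; only the percentage-point effect shrinks.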

## Bayesian comparison

The traditional analysis above relies on z-tests. This is not the only way to run AB tests. We are currently working on adopting Bayesian AB testing at N26.

What are the upsides? 

- You stop the test once the results are conclusive, not when the pre-planned sample size has been reached.
- Smaller sample sizes.
- Better interpretability.

Among other things, Bayesian methods let you compute two probabilities at any point in time: Prob(variant is better) and Prob(control is better).

We plot these two probabilities on the y-axis as the number of observations in our experiment increased (x-axis). 

You see beautifully that the Bayesian method becomes more and more sure over time that the variant is actually better. 
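
As a sanity check on what Prob(variant is better) means, the sketch below compares the closed-form sum (the same formula used in this notebook's `v_wins_against_c`) with a brute-force Monte Carlo estimate drawn from the two Beta posteriors. The success and failure counts are made-up toy numbers, not data from this test; the helper names and the use of the standard library (`math.lgamma`, `random.betavariate`) instead of scipy are choices made for this illustration only.

```python
import math
import random


def betaln(a, b):
    """Log of the Beta function, computed via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def prob_v_beats_c(a_v, b_v, a_c, b_c):
    """Closed-form Prob(p_v > p_c) for Beta(a_v, b_v) vs Beta(a_c, b_c); a_v must be an integer."""
    total = 0.0
    for i in range(a_v):
        total += math.exp(
            betaln(a_c + i, b_v + b_c)
            - math.log(b_v + i)
            - betaln(1 + i, b_v)
            - betaln(a_c, b_c)
        )
    return total


def prob_v_beats_c_mc(a_v, b_v, a_c, b_c, draws=50_000, seed=0):
    """Monte Carlo estimate: share of paired posterior draws where the variant is higher."""
    rng = random.Random(seed)
    wins = sum(
        rng.betavariate(a_v, b_v) > rng.betavariate(a_c, b_c) for _ in range(draws)
    )
    return wins / draws


# Toy counts (hypothetical): successes and failures per arm, plus a uniform Beta(1, 1) prior.
a_v, b_v = 480 + 1, 20 + 1
a_c, b_c = 470 + 1, 30 + 1

exact = prob_v_beats_c(a_v, b_v, a_c, b_c)
approx = prob_v_beats_c_mc(a_v, b_v, a_c, b_c)
exact_control = prob_v_beats_c(a_c, b_c, a_v, b_v)
print(exact, approx, exact + exact_control)
```

The two closed-form probabilities should sum to one, and the Monte Carlo estimate should agree with the closed form up to sampling noise.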

In [96]:
from scipy.stats import beta, binom
from scipy.special import betaln
import math

In [119]:
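def v_wins_against_c(a_v, b_v, a_c, b_c, a_prior, b_prior):
    """Compute the probability that the variant wins against the control in the long run.

    a_* are observed successes and b_* observed failures. The Beta prior parameters
    (a_prior, b_prior) are added to both arms before applying Miller's closed-form
    formula, which sums over range(a_v) and therefore needs an integer a_v.
    """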
    a_v = a_v + a_prior
    b_v = b_v + b_prior
    a_c = a_c + a_prior
    b_c = b_c + b_prior

    total = 0.0
    for i in range(0, a_v):
        total += np.exp(
            betaln(a_c + i, b_v + b_c)
            - math.log(b_v + i)
            - betaln(1 + i, b_v)
            - betaln(a_c, b_c)
        )
    return total


data = pd.read_csv("terms_v2.csv")

d = data.groupby("shadow_user_id")["se_property"].agg("nunique").reset_index()
d = d.loc[d["se_property"] > 1, "shadow_user_id"]
data = data.loc[data["shadow_user_id"].isin(d) == False, :]

# go down to one row per user.
data = data.groupby(["shadow_user_id", "se_property"])["su"].agg("max").reset_index()

# reshuffle data randomly
# data = data.sample(frac=1).reset_index(drop=True)

# let's imagine the data flows in chunks of 100 observations at a time.
data["running"] = data.groupby("se_property").cumcount() + 1

data["running"] = round(data["running"] / 100, 0)

# aggregate success and failures
data["failure"] = 0
data.loc[data["su"] == 0, "failure"] = 1

data = data.groupby(["se_property", "running"]).agg("sum").reset_index()

# now for each row in this dataset, take a look at the bayesian evaluation of the experiment.
prob_variant = []
prob_control = []
for chunk in data["running"].unique():
    variant_wins = sum(
        data.loc[(data["se_property"] == "terms_v2") & (data["running"] <= chunk), "su"]
    )
    control_wins = sum(
        data.loc[
            (data["se_property"] == "terms_v2_control") & (data["running"] <= chunk),
            "su",
        ]
    )

    variant_fail = sum(
        data.loc[
            (data["se_property"] == "terms_v2") & (data["running"] <= chunk), "failure"
        ]
    )
    control_fail = sum(
        data.loc[
            (data["se_property"] == "terms_v2_control") & (data["running"] <= chunk),
            "failure",
        ]
    )

    prob_variant.append(
        v_wins_against_c(variant_wins, variant_fail, control_wins, control_fail, 1, 1)
    )
    prob_control.append(
        v_wins_against_c(control_wins, control_fail, variant_wins, variant_fail, 1, 1)
    )

results = pd.DataFrame(data["running"].unique(), columns=["chunks"])

results["prob_variant"] = prob_variant
results["prob_control"] = prob_control

results = pd.melt(results, id_vars=["chunks"])
results["chunks"] = results["chunks"] * 100

alt.Chart(results).mark_line().encode(
    x=alt.X("chunks:Q", axis=alt.Axis(title="Sample size")),
    y=alt.Y("value:Q", axis=alt.Axis(title="Probability", format="%")),
    color="variable:N",
).properties(width=700, height=500, title="Real time testing with Bayesian statistics")

What does this tell us? 
- The Bayesian analysis reached the same decision as the frequentist test.
- Bayesian statistics concluded this much quicker: after 5k observations we definitely knew what was going on and could have already stopped the test. The speed gain was 37.5%.
