title: AB Test: Remove internet speed test from flow   
author: Fabio Schmidt-Fischbach  
date: 2020-06-10  
region: EU  
summary: Before initiating the KYC call, we test the user's internet connection on a separate page. In the treatment we remove this step entirely. The main conversion rate, from starting the KYC flow to initiating a KYC call, increased from 93% to 94% (0.7pp to be precise) in the treatment. We also find an incremental lift in % SU to KYCc (7 day) of 0.4pp (i.e. part of the effect disappears during the KYC process). As the effect on the main metric % SU to KYCc is quite small (0.4pp, or 0.8% relative), the statistical test is underpowered, so we cannot claim statistical significance. A Bayesian evaluation is also inconclusive: it puts the probability of the variant being better than the control at 55%. As a robustness check, we also test whether users in the treatment group now need more or fewer KYC attempts. We find no shift in the distribution of the number of attempts, which suggests little impact on the cost efficiency of the KYC process.
tags: kyc, acquire, ab, test, internet connection

In [1]:
import pandas as pd
import os
import seaborn as sns
from statsmodels.stats.proportion import proportions_ztest
import altair as alt

### 1. Setup. 

- Target group: users that do KYC via video (IDnow) since 9th June 2020 and have the current app versions installed.
- Feature change: In the control, we test the user's internet connection on a separate page before initiating the KYC call. In the treatment, we remove this step entirely.
- Rationale: The test is not very useful, and IDnow has its own functionality to test this.
- Main KPIs: conversion to KYCi and KYCi to KYCc.


### 2. Summary 

The main conversion rate, from starting the KYC flow to initiating a KYC call, increased from 93% to 94% (0.7pp to be precise) in the treatment.

We also find an incremental lift in % SU to KYCc (7 day) of 0.4pp (i.e. part of the effect disappears during the KYC process).

As the effect on the main metric % SU to KYCc is quite small (0.4pp, or 0.8% relative), the statistical test is underpowered, so we cannot claim statistical significance. A Bayesian evaluation is also inconclusive: it puts the probability of the variant being better than the control at 55%.

As a robustness check, we also test whether users in the treatment group now need more or fewer KYC attempts. We find no shift in the distribution of the number of attempts, which suggests little impact on the cost efficiency of the KYC process.

In [3]:
##### Query to pull the data from the dwh.
query = """

---get set of users that were part of the experiment for the first time. 
with users as ( 

select user_created, min(platform) as platform, min(os_major) as os_major    
from dbt.zrh_kyc_funnel 
where se_action like 'kyc.video%' and app_version like '%3.46%'
group by 1 

)

select  case when is_user_in_tg(user_id, 'idnow.show-connection-check', 50) = True then 'Control' else 'Treatment' end as treatment, 
		zu.user_id,
		zu.user_created::date as tc_date, 
		tnc_country_group,
		product_id, 
		kyc_first_initiated, 
		kyc_first_completed, 
		age_group,
		gender, 
		platform, 
		os_major 
from users inner join dbt.zrh_users AS zu using (user_created) 
group by 1,2,3,4,5,6,7,8,9,10,11
"""

### Sample sizes

We have roughly 11k users in each group.

You can check the validity of the sample selection criteria by using this metabase question. It shows that no user assigned to treatment visited the internet-connection field. https://metabase-product.tech26.de/question/2320?platform=Android&platform=iOS&app_version=n26-ios_3.46&app_version=n26-ios_3.46.1&app_version=n26-android_3.46.1&app_version=n26-android_3.46&start_date=2020-06-09


In [2]:
df = pd.read_csv("kyc_internet.csv")

df = df.groupby(["treatment"])["user_id"].agg("nunique").reset_index()

alt.Chart(df).mark_bar().encode(
    x=alt.X("treatment:N", axis=alt.Axis(title="Group")),
    y=alt.Y("user_id:Q", axis=alt.Axis(title="Number of users")),
).properties(width=500, height=500, title="Sample size")

### Step 4. Main KPI 

Our intervention starts at the beginning of the KYC flow - before the KYC process in cmd is officially initiated. 

The goal of this test is to increase the % of users that start the KYC flow and then also successfully manage to initiate a KYC process. 

This % of users that initiate a KYC process is shown below. 

In [129]:
from datetime import datetime

df = pd.read_csv("kyc_internet.csv")

# drop users who signed up less than 7 days ago.
df["age"] = (datetime.now() - pd.to_datetime(df["tc_date"])).dt.days
df = df.loc[df["age"] >= 7, :]

# make sure we are unique on the user level.
df = (
    df.groupby(["treatment", "user_id"])["kyc_first_initiated"].agg("min").reset_index()
)

# now, flag users that managed to initiate kyc (non-null initiation timestamp).
df["kyci"] = df["kyc_first_initiated"].notna().astype(int)

df = df.groupby(["treatment"])["kyci"].agg("mean").reset_index()

alt.Chart(df).mark_bar().encode(
    x=alt.X("treatment:N", axis=alt.Axis(title="Group")),
    y=alt.Y(
        "kyci:Q",
        axis=alt.Axis(title="% of users that initiate KYC", format="%"),
        scale=alt.Scale(domain=[0.90, 0.96]),
    ),
).properties(width=500, height=500, title="% of users that initiate KYC")

In [130]:
df.head()

|   | treatment | kyci     |
|---|-----------|----------|
| 0 | Control   | 0.936709 |
| 1 | Treatment | 0.941692 |


In [28]:
from statsmodels.stats.proportion import proportions_ztest
from datetime import datetime

df = pd.read_csv("kyc_internet.csv")

# drop users who signed up less than 7 days ago.
df["age"] = (datetime.now() - pd.to_datetime(df["tc_date"])).dt.days
df = df.loc[df["age"] >= 7, :]

# make sure we are unique on the user level.
df = (
    df.groupby(["treatment", "user_id"])["kyc_first_initiated"].agg("min").reset_index()
)

# now, flag users that managed to initiate kyc (non-null initiation timestamp).
df["kyci"] = df["kyc_first_initiated"].notna().astype(int)

data = df.groupby("treatment")["kyci"].agg(["count", "sum"]).reset_index()

# run a one-sided z-test. groupby sorts the rows as (Control, Treatment),
# so alternative="smaller" tests H1: control rate < treatment rate.
stat, pval = proportions_ztest(data["sum"], data["count"], alternative="smaller")

print(
    "The z-score for this test is %s which corresponds to a p-value of %s"
    % (round(stat, 2), round(pval, 4))
)

if pval < 0.05:
    print("The difference is significant.")
else:
    print("The difference is not significant.")

The z-score for this test is -1.5 which corresponds to a p-value of 0.0667
The difference is not significant.
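
To make the one-sided test explicit, here is a minimal sketch of a pooled two-proportion z-test using only the standard library. The helper name `two_proportion_ztest` and the counts passed to it are hypothetical and chosen for illustration; they are not the experiment data. The one-sided p-value corresponds to `alternative="smaller"` above, and the two-sided p-value is shown for comparison.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(success_a, n_a, success_b, n_b):
    """Pooled two-proportion z-test of group a against group b.

    Returns the z-score, the one-sided p-value for H1: p_a < p_b,
    and the two-sided p-value.
    """
    p_a = success_a / n_a
    p_b = success_b / n_b
    # pooled proportion under H0: p_a == p_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_smaller = NormalDist().cdf(z)
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_smaller, p_two_sided


# hypothetical counts (NOT the experiment data): a = Control, b = Treatment
z, p_smaller, p_two_sided = two_proportion_ztest(9360, 10000, 9410, 10000)
print(round(z, 2), round(p_smaller, 4), round(p_two_sided, 4))
```

When the z-score is negative, the one-sided p-value is half the two-sided one, so a result that narrowly misses 5% one-sided is further from significance under a two-sided test.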


The next question is whether the increased conversion to KYC initiation translates to higher KYC completion conversion. 

In [20]:
from datetime import datetime

df = pd.read_csv("kyc_internet.csv")

# drop users who signed up less than 7 days ago.
df["age"] = (datetime.now() - pd.to_datetime(df["tc_date"])).dt.days
df = df.loc[df["age"] >= 7, :]

# make sure we are unique on the user level.
df = (
    df.groupby(["treatment", "user_id"])[["tc_date", "kyc_first_completed"]]
    .agg("min")
    .reset_index()
)

# now, count how many users managed to finish KYC within 7 days.
df["kycc"] = 0
df.loc[
    (pd.to_datetime(df["kyc_first_completed"]) - pd.to_datetime(df["tc_date"])).dt.days
    <= 7,
    "kycc",
] = 1

df = df.groupby(["treatment"])["kycc"].agg("mean").reset_index()

alt.Chart(df).mark_bar().encode(
    x=alt.X("treatment:N", axis=alt.Axis(title="Group")),
    y=alt.Y(
        "kycc:Q",
        axis=alt.Axis(title="% of users that finish KYC", format="%"),
        scale=alt.Scale(domain=[0.45, 0.55]),
    ),
).properties(
    width=500,
    height=500,
    title="% of users that finish KYC within 7 days of sign up complete",
)


In [19]:
df.head()

|   | treatment | kycc     |
|---|-----------|----------|
| 0 | Control   | 0.518214 |
| 1 | Treatment | 0.522456 |


In [29]:
from statsmodels.stats.proportion import proportions_ztest
from datetime import datetime

df = pd.read_csv("kyc_internet.csv")

# drop users who signed up less than 7 days ago.
df["age"] = (datetime.now() - pd.to_datetime(df["tc_date"])).dt.days
df = df.loc[df["age"] >= 7, :]

# make sure we are unique on the user level.
df = (
    df.groupby(["treatment", "user_id"])[["tc_date", "kyc_first_completed"]]
    .agg("min")
    .reset_index()
)

# now, count how many users managed to finish KYC within 7 days.
df["kycc"] = 0
df.loc[
    (pd.to_datetime(df["kyc_first_completed"]) - pd.to_datetime(df["tc_date"])).dt.days
    <= 7,
    "kycc",
] = 1

data = df.groupby("treatment")["kycc"].agg(["count", "sum"]).reset_index()

# run a one-sided z-test. groupby sorts the rows as (Control, Treatment),
# so alternative="smaller" tests H1: control rate < treatment rate.
stat, pval = proportions_ztest(data["sum"], data["count"], alternative="smaller")

print(
    "The z-score for this test is %s which corresponds to a p-value of %s"
    % (round(stat, 2), round(pval, 4))
)

if pval < 0.05:
    print("The difference is significant.")
else:
    print("The difference is not significant.")

The z-score for this test is -0.61 which corresponds to a p-value of 0.2706
The difference is not significant.


### Do we gain efficiency, i.e. do users complete KYC with fewer attempts?

In [None]:
query = """

with users as ( 

select user_created, min(platform) as platform, min(os_major) as os_major    
from dbt.zrh_kyc_funnel 
where se_action like 'kyc.video%' and app_version like '%3.46%'
group by 1 

)

select  case when is_user_in_tg(user_id, 'idnow.show-connection-check', 50) = True then 'Control' else 'Treatment' end as treatment, 
		zu.user_id, 
		provider, 
		count(distinct cmd.id) as process,
		min(completed) as completed, 
		min(initiated) as initiated 
from users 
inner join dbt.zrh_users AS zu on zu.user_created = users.user_created 
left join cmd_kyc_process as cmd on cmd.user_created = users.user_created 
group by 1,2,3   


"""

In [140]:
df = pd.read_csv("kyc_attempts.csv")
# show distribution of kyc attempts in both groups.

df = df.groupby(["treatment", "user_id"])["process"].agg("sum").reset_index()
df = df.groupby(["treatment", "process"])["user_id"].agg("nunique").reset_index()

df["perc"] = df["user_id"] / df.groupby(["treatment"])["user_id"].transform("sum")
df["cum"] = df.groupby(["treatment"])["perc"].cumsum()

alt.Chart(df.loc[df["process"] < 10, :]).mark_line().encode(
    x=alt.X("process:N", axis=alt.Axis(title="Number of KYC processes")),
    y=alt.Y("cum:Q", axis=alt.Axis(title="Cumulative share of users")),
    color="treatment:N",
).properties(width=500, height=500, title="Number of KYC attempts")

In [141]:
df = pd.read_csv("kyc_attempts.csv")

df = df.groupby(["treatment", "user_id"])["process"].agg("sum").reset_index()
df = df.groupby(["treatment"])["process"].agg("mean").reset_index()

alt.Chart(df).mark_bar().encode(
    x=alt.X("treatment:N", axis=alt.Axis(title="Group")),
    y=alt.Y("process:Q", axis=alt.Axis(title="Avg. number of attempts")),
    color="treatment:N",
).properties(width=500, height=500, title="Number of KYC attempts")

In [142]:
df.head()

|   | treatment | process  |
|---|-----------|----------|
| 0 | Control   | 2.698979 |
| 1 | Treatment | 2.669585 |


### Robustness 

What is happening? Slice the experimental results by a couple of dimensions. 

In [35]:
from datetime import datetime

df = pd.read_csv("kyc_internet.csv")

# drop users who signed up less than 7 days ago.
df["age"] = (datetime.now() - pd.to_datetime(df["tc_date"])).dt.days
df = df.loc[df["age"] >= 7, :]

# make sure we are unique on the user level.
df = (
    df.groupby(["treatment", "user_id", "tnc_country_group"])[
        ["tc_date", "kyc_first_completed"]
    ]
    .agg("min")
    .reset_index()
)

# now, count how many users managed to finish KYC within 7 days.
df["kycc"] = 0
df.loc[
    (pd.to_datetime(df["kyc_first_completed"]) - pd.to_datetime(df["tc_date"])).dt.days
    <= 7,
    "kycc",
] = 1

df = df.groupby(["treatment", "tnc_country_group"])["kycc"].agg("mean").reset_index()

alt.Chart(df).mark_bar().encode(
    x=alt.X("treatment:N", axis=alt.Axis(title="Country")),
    y=alt.Y("kycc:Q", axis=alt.Axis(title="% of users that finish KYC", format="%")),
    column="tnc_country_group:N",
).properties(
    width=200,
    height=200,
    title="% of users that finish KYC within 7 days of sign up complete",
)


In [38]:
from datetime import datetime

df = pd.read_csv("kyc_internet.csv")

# drop users who signed up less than 7 days ago.
df["age"] = (datetime.now() - pd.to_datetime(df["tc_date"])).dt.days
df = df.loc[df["age"] >= 7, :]

# make sure we are unique on the user level.
df = (
    df.groupby(["treatment", "user_id", "platform"])[["tc_date", "kyc_first_completed"]]
    .agg("min")
    .reset_index()
)

# now, count how many users managed to finish KYC within 7 days.
df["kycc"] = 0
df.loc[
    (pd.to_datetime(df["kyc_first_completed"]) - pd.to_datetime(df["tc_date"])).dt.days
    <= 7,
    "kycc",
] = 1

df = df.groupby(["treatment", "platform"])["kycc"].agg("mean").reset_index()

alt.Chart(df).mark_bar().encode(
    x=alt.X("treatment:N", axis=alt.Axis(title="Platform")),
    y=alt.Y("kycc:Q", axis=alt.Axis(title="% of users that finish KYC", format="%")),
    column="platform:N",
).properties(
    width=400,
    height=400,
    title="% of users that finish KYC within 7 days of sign up complete",
)


In [42]:
from datetime import datetime

df = pd.read_csv("kyc_internet.csv")

# drop users who signed up less than 7 days ago.
df["age"] = (datetime.now() - pd.to_datetime(df["tc_date"])).dt.days
df = df.loc[df["age"] >= 7, :]

# make sure we are unique on the user level.
df = (
    df.groupby(["treatment", "user_id", "age_group"])[["tc_date", "kyc_first_completed"]]
    .agg("min")
    .reset_index()
)

# now, count how many users managed to finish KYC within 7 days.
df["kycc"] = 0
df.loc[
    (pd.to_datetime(df["kyc_first_completed"]) - pd.to_datetime(df["tc_date"])).dt.days
    <= 7,
    "kycc",
] = 1

df = df.groupby(["treatment", "age_group"])["kycc"].agg("mean").reset_index()

alt.Chart(df).mark_bar().encode(
    x=alt.X("treatment:N", axis=alt.Axis(title="Age group")),
    y=alt.Y("kycc:Q", axis=alt.Axis(title="% of users that finish KYC", format="%")),
    column="age_group:N",
).properties(
    width=200,
    height=200,
    title="% of users that finish KYC within 7 days of sign up complete",
)


In [41]:
from datetime import datetime

df = pd.read_csv("kyc_internet.csv")

# drop users who signed up less than 7 days ago.
df["age"] = (datetime.now() - pd.to_datetime(df["tc_date"])).dt.days
df = df.loc[df["age"] >= 7, :]

# make sure we are unique on the user level.
df = (
    df.groupby(["treatment", "user_id", "age_group"])[["tc_date", "kyc_first_completed"]]
    .agg("min")
    .reset_index()
)

# now, count how many users managed to finish KYC within 7 days.
df["kycc"] = 0
df.loc[
    (pd.to_datetime(df["kyc_first_completed"]) - pd.to_datetime(df["tc_date"])).dt.days
    <= 7,
    "kycc",
] = 1

df = df.groupby(["treatment", "age_group"])["kycc"].agg(["mean", "count"]).reset_index()
df.head(20)


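|   | treatment | age_group | mean     | count |
|---|-----------|-----------|----------|-------|
| 0 | Control   | 18-19     | 0.502985 | 670   |
| 1 | Control   | 20-24     | 0.517309 | 2051  |
| 2 | Control   | 25-29     | 0.520234 | 2224  |
| 3 | Control   | 30-34     | 0.538595 | 1723  |
| 4 | Control   | 35-39     | 0.525709 | 1128  |
| 5 | Control   | 40-44     | 0.501892 | 793   |
| 6 | Control   | 45-49     | 0.398981 | 589   |
| 7 | Control   | 50-54     | 0.514354 | 418   |
| 8 | Control   | 55-59     | 0.602362 | 254   |
| 9 | Control   | 60-64     | 0.591549 | 142   |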


#### Power: do we have power to detect an effect? 

We see a 0.4pp (0.8%) lift in SU to KYCc. With the current sample size, we are massively underpowered to detect an effect of that magnitude with frequentist methods. 
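
The two ways of quoting the lift (percentage points versus relative percent) can be reproduced from the 7-day rates in the `df.head()` output of the KYC completion chart cell. A small sketch of the arithmetic:

```python
# 7-day SU-to-KYCc rates taken from the df.head() output earlier in this post
control_rate = 0.518214
treatment_rate = 0.522456

# absolute lift in percentage points
abs_lift_pp = (treatment_rate - control_rate) * 100
# relative lift in percent of the control rate
rel_lift_pct = (treatment_rate / control_rate - 1) * 100

print(f"absolute lift: {abs_lift_pp:.1f}pp, relative lift: {rel_lift_pct:.1f}%")
# prints "absolute lift: 0.4pp, relative lift: 0.8%"
```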

For the z-test to reliably detect an effect smaller than 1%, we would need more than 150k users in each variant. https://metabase-product.tech26.de/dashboard/32?baseline_conversion=0.50&minimum_detectable_effect=0.01 

We are far from this. As an alternative, we can use a parametric Bayesian decision rule.
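
As a rough cross-check of the required sample size, the standard normal-approximation formula for comparing two proportions can be evaluated directly. This is a sketch under assumptions that the dashboard link does not state: a two-sided test at alpha = 0.05, 80% power, and the minimum detectable effect of 0.01 read as a relative lift on the 50% baseline. The function name `users_per_variant` is made up for this illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist


def users_per_variant(baseline, relative_mde, alpha=0.05, power=0.8):
    """Approximate users needed per variant for a two-sided two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


# assumed settings: 50% baseline conversion, 1% relative minimum detectable effect
n_required = users_per_variant(baseline=0.50, relative_mde=0.01)
print(n_required)
```

Under these assumptions the result is roughly 157k users per variant, in line with the "more than 150k" figure and far above the roughly 11k users per group in this test.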

In [83]:
def variant_wins_bayes(
    variant_success,
    variant_obs,
    control_success,
    control_obs,
    a_prior=1,
    b_prior=1,
    N=100000,
):
    """
    This function computes the probability that the variant is better than the control. It does so by sampling N times
    from the posteriors of the variant and control. The probability of the variant winning is computed
    by counting how often the variant draws are larger than the control draws and dividing the number of "variant wins" by N.

    Inputs:
    - variant_success: number of success events in variant.
    - variant_obs : number of observations in variant. This needs to be at least as big as variant_success.
    - control_success: number of success events in control.
    - control_obs : number of observations in control. This needs to be at least as big as control_success.
    - a_prior: alpha parameter of the Beta prior
    - b_prior: beta parameter of the Beta prior
    - N: number of draws to make from the two posteriors to make the comparison.
    returns: float between 0 and 1.
    """
    from scipy.stats import beta
    import numpy as np

    # fix the seed so that repeated calls give reproducible draws.
    np.random.seed(seed=1)

    # posterior parameters: prior plus observed successes / failures.
    control_a = control_success + a_prior
    variant_a = variant_success + a_prior
    control_b = control_obs - control_success + b_prior
    variant_b = variant_obs - variant_success + b_prior

    # draw random numbers.
    variant_draws = beta.rvs(a=variant_a, b=variant_b, size=N)
    control_draws = beta.rvs(a=control_a, b=control_b, size=N)

    # check for how many draws the variant was better.
    variant_wins = np.sum(variant_draws > control_draws)
    # the number of variant wins divided by the total number of draws is the end result.
    variant_prob = variant_wins / N

    return variant_prob
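
The posterior parameters in `variant_wins_bayes` follow from Beta-Binomial conjugacy. As a sketch, write $s$ for the number of successes, $n$ for the number of observations, and $(a, b)$ for the prior parameters (shorthand introduced here, not names from the notebook):

$$
p \sim \mathrm{Beta}(a, b), \quad s \mid p \sim \mathrm{Binomial}(n, p)
\;\Longrightarrow\;
p \mid s \sim \mathrm{Beta}(a + s,\; b + n - s)
$$

The probability that the treatment beats the control is then approximated by Monte Carlo over $N$ paired posterior draws:

$$
P(p_T > p_C) \approx \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\left[p_T^{(i)} > p_C^{(i)}\right]
$$

Here `control_a` and `control_b` correspond to $a + s$ and $b + n - s$ for the control group, and likewise for the variant.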

In [138]:
def variant_is_worse_bayes(
    variant_success,
    variant_obs,
    control_success,
    control_obs,
    a_prior=1,
    b_prior=1,
    N=100000,
):
    """
    This function computes the probability that the variant is worse than the control. It does so by sampling N times
    from the posteriors of the variant and control. The probability of the variant losing is computed
    by counting how often the variant draws are smaller than the control draws and dividing the number of "variant losses" by N.

    Inputs:
    - variant_success: number of success events in variant.
    - variant_obs : number of observations in variant. This needs to be at least as big as variant_success.
    - control_success: number of success events in control.
    - control_obs : number of observations in control. This needs to be at least as big as control_success.
    - a_prior: alpha parameter of the Beta prior
    - b_prior: beta parameter of the Beta prior
    - N: number of draws to make from the two posteriors to make the comparison.
    returns: float between 0 and 1.
    """
    from scipy.stats import beta
    import numpy as np

    # fix the seed so that repeated calls give reproducible draws.
    np.random.seed(seed=1)

    # posterior parameters: prior plus observed successes / failures.
    control_a = control_success + a_prior
    variant_a = variant_success + a_prior
    control_b = control_obs - control_success + b_prior
    variant_b = variant_obs - variant_success + b_prior

    # draw random numbers.
    variant_draws = beta.rvs(a=variant_a, b=variant_b, size=N)
    control_draws = beta.rvs(a=control_a, b=control_b, size=N)

    # check for how many draws the variant was worse.
    variant_loss = np.sum(variant_draws < control_draws)
    # the number of variant losses divided by the total number of draws is the end result.
    variant_prob = variant_loss / N

    return variant_prob

In [137]:
from datetime import datetime

df = pd.read_csv("kyc_internet.csv")

# drop users who signed up less than 7 days ago.
df["age"] = (datetime.now() - pd.to_datetime(df["tc_date"])).dt.days
df = df.loc[df["age"] >= 7, :]
# keep only users who signed up on or after 2020-06-10.
df = df.loc[df["tc_date"] >= "2020-06-10", :]

# make sure we are unique on the user level.
df = (
    df.groupby(["treatment", "user_id"])[["tc_date", "kyc_first_completed"]]
    .agg("min")
    .reset_index()
)

# now, count how many users managed to finish KYC within 7 days.
df["kycc"] = 0
df.loc[
    (pd.to_datetime(df["kyc_first_completed"]) - pd.to_datetime(df["tc_date"])).dt.days
    <= 7,
    "kycc",
] = 1

# take cumulative sum.
df = df.groupby(["treatment", "tc_date"])["kycc"].agg(["count", "sum"]).reset_index()
df["sample"] = df.groupby("treatment")["count"].cumsum()
df["success"] = df.groupby("treatment")["sum"].cumsum()

success_chart = (
    alt.Chart(df)
    .mark_line()
    .encode(x=alt.X("tc_date"), y=alt.Y("success"), color="treatment")
    .properties(title="Number of conversions")
)

sample_chart = (
    alt.Chart(df)
    .mark_line()
    .encode(x=alt.X("tc_date"), y=alt.Y("sample:Q"), color="treatment")
    .properties(title="Number of participants")
)


# pivot table to wide
df = pd.pivot_table(
    df, index=["tc_date"], columns="treatment", values=["sample", "success"]
).reset_index()

df.columns = ["".join(col).strip() for col in df.columns.values]

df.head()


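|   | tc_date    | sampleControl | sampleTreatment | successControl | successTreatment |
|---|------------|---------------|-----------------|----------------|------------------|
| 0 | 2020-06-10 | 306           | 301             | 215            | 195              |
| 1 | 2020-06-11 | 595           | 597             | 379            | 375              |
| 2 | 2020-06-12 | 878           | 862             | 548            | 525              |
| 3 | 2020-06-13 | 1099          | 1084            | 689            | 659              |
| 4 | 2020-06-14 | 1414          | 1417            | 889            | 856              |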


First, we bring the data into a "cumulative format". Bayesian testing does not require us to wait until a specific point in time to run a test but allows us to continuously monitor the results. 

We hence sort the data points by order of arrival and re-create the flows of conversions and participants as they happened during the experiment. 
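
To illustrate the cumulative format, the sketch below rebuilds the first five Control rows of the table above from daily counts. The daily counts are not in the notebook; they are derived here by differencing the cumulative `sampleControl` and `successControl` columns.

```python
from itertools import accumulate

# daily Control counts, derived by differencing the cumulative columns of the table
daily_participants = [306, 289, 283, 221, 315]
daily_conversions = [215, 164, 169, 141, 200]

# running totals, in order of arrival
cum_participants = list(accumulate(daily_participants))
cum_conversions = list(accumulate(daily_conversions))

print(cum_participants)  # [306, 595, 878, 1099, 1414]
print(cum_conversions)   # [215, 379, 548, 689, 889]
```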

In [122]:
success_chart | sample_chart

For every day (we could just as well use batches, e.g. every thousandth participant), we can now evaluate our results and compute the probability that the variant performs better than the control. 

A common decision rule is to stop the test and declare a winner once the probability of a variant winning crosses 95%.

Unfortunately, this does not happen in our case. 
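
The decision rule can be sketched for a single day with the standard library alone. This is an illustration, not the notebook's implementation: `prob_variant_better` is a hypothetical helper that uses `random.betavariate` instead of `scipy.stats.beta`, and the inputs are the cumulative counts for 2020-06-14 from the table above together with the Beta(10, 10) prior used in the evaluation cell.

```python
import random


def prob_variant_better(variant_success, variant_obs, control_success, control_obs,
                        a_prior=1, b_prior=1, n_draws=20000, seed=1):
    """Monte Carlo estimate of P(variant rate > control rate) under Beta posteriors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_draws):
        variant_draw = rng.betavariate(a_prior + variant_success,
                                       b_prior + variant_obs - variant_success)
        control_draw = rng.betavariate(a_prior + control_success,
                                       b_prior + control_obs - control_success)
        wins += variant_draw > control_draw
    return wins / n_draws


# cumulative counts for 2020-06-14 (successTreatment, sampleTreatment,
# successControl, sampleControl) with a Beta(10, 10) prior
prob = prob_variant_better(856, 1417, 889, 1414, a_prior=10, b_prior=10)

# stop only once the probability of the variant winning crosses 95%
decision = "stop: treatment wins" if prob >= 0.95 else "keep collecting data"
print(round(prob, 3), decision)
```

On this day the treatment's cumulative rate is below the control's, so the estimated probability is well under 50% and the rule says to keep collecting data.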

In [128]:
prob_variant_wins = []
for i in range(df.shape[0]):
    prob_variant_wins.append(
        variant_wins_bayes(
            df["successTreatment"][i],
            df["sampleTreatment"][i],
            df["successControl"][i],
            df["sampleControl"][i],
            a_prior=10,
            b_prior=10,
            N=100000,
        )
    )

# assign to df
df["prob"] = prob_variant_wins

alt.Chart(df).mark_line().encode(
    x=alt.X("tc_date:N", axis=alt.Axis(title="TC date of user")),
    y=alt.Y(
        "prob:Q",
        axis=alt.Axis(title="Probability that treatment is better", format="%"),
    ),
).properties(
    title="Probability that the variant wins (Bayesian decision rule) using KYC completion as main metric",
    width=500,
    height=500,
)