title: AB Test: User-reengagement   
author: Fabio Schmidt-Fischbach  
date: 2020-06-23   
region: EU  
summary: The user re-engagement flow differs from the standard flow on the second step. Instead of asking for the user's name, birth date and mail on the personal information screen, the re-engagement flow only asks for the mail on the second step. % SUI to KYCc (7 day) is 0.4pp lower in variant. % SUI to KYCc (21 day) is 0.4pp lower in variant. The treatment group performs worse, but we fail to reject the null hypothesis that the two groups perform the same. Roughly 70% of shadow users in the treatment group give us marketing consent (this does not yet take into account whether they confirm their mail), which gives us plenty of scope to re-target users. The fact that the 21-day and 7-day % SUI to KYCc conversions are the same suggests, though, that re-targeting is not yet effective enough to move the needle.
tags: kyc, acquire, ab test, email, marketing, crm 

In [2]:
import pandas as pd
import numpy as np
import altair as alt
import seaborn as sns
from datetime import datetime

In [26]:
Query = """


with opt_in as ( 
select domain_userid, case when event_type = 273983836 then 'Opt in' else 'Skipped' end as opt_in from ksp_web_crab where event_type in (-59294220, 273983836)
), kyc_process as ( 

select user_id	, count(distinct id) as kyc_count 
from km_processes 
group by 1 

)

select domain_userid, 
		shadow_user_id, 
		tbl.user_created, 
		dbt.zrh_users.user_id, 
		country, 
		case when country in ('DEU','AUT','FRA','ESP','ITA') then 'core' else 'other' end as country_group, 
		step, 
		tbl.created, 
		se_property, 
		attribution, 
		webview, 
		os_family, 
		dvce_type, 
		rank_column,
		kyc_first_initiated, 
		kyc_first_completed, 
		is_mau,
		opt_in,
		email_confirmation_completed,
		kyc_count,
		newsletter_opt_in, 
		email_bigint 
from dbt.stg_upper_funnel as tbl 
left join dbt.zrh_users using (user_created) 
left join opt_in using (domain_userid)
left join cmd_user_signup_status as cmd on cmd.user_created = tbl.user_created
left join kyc_process as kyc on kyc.user_id = dbt.zrh_users.user_id 
left join cmd_shadow_user as shadow on shadow.id = tbl.shadow_user_id 
where se_property in ('user_reengagement', 'user_reengagement_control')


"""

# User re-engagement v1. 


## Summary 

- % SUI to KYCc (7 day) is 0.4pp lower in variant. 
- % SUI to KYCc (21 day) is 0.4pp lower in variant. 

The treatment group performs worse but we fail to reject the null hypothesis that the two groups actually perform the same. 


- roughly 70% of shadow users in the treatment group give us marketing consent (this is not yet taking into account whether they confirm their mail), hence providing us with lots of scope to re-target users. 
- the fact that the 21-day and 7-day % SUI to KYCc conversions are the same suggests, though, that re-targeting is not yet effective enough to move the needle. 


## What are we testing?  

##### What does the control experience look like? 

The following assumes that the reader is familiar with our standard flow. Not sure about it? Click through the signup form in staging: https://get.staging-n26.com/ 

##### What does re-engagement look like? 

The user re-engagement flow differs from the standard flow on the second step. Instead of asking for the user's name, birth date and mail on the personal information screen, the re-engagement flow only asks for the mail on the second step. 

![](second_screen.png)

- If the user clicks on "continue without consent", they are re-directed to the variant's personal-information screen (without having to confirm the mail yet - in this case they'd confirm their mail at the end of the funnel). Of course, we continue to consider them users in the treatment group. 
- If the user clicks on "Accept and continue" they are re-directed to the email-confirmation screen. This screen used to be hosted at the end of the sign up funnel.

![](email_confirmation.png)

- The user is taken out of the flow and now needs to confirm their mail before they can resume the flow. 

![](actual_mail.png)

- Upon clicking "Confirm your email", the user is directed back to the signup flow and lands on the variant's "personal information" screen. 

![](personal_information.png)

From this point on, the experience is the same for both groups. 

## Sample size 

In [19]:
df = pd.read_csv("user_reengagement.csv")

# keep only first screen.
df = df.loc[df["step"] == "signup-start", :]

# go down to unique level for each event and keep only the first event for each user.
df["rn"] = (
    df.sort_values(["created"], ascending=[True]).groupby("rank_column").cumcount()
)
df = df.loc[df["rn"] == 0, :]

# drop samples before June 10th.
df["day"] = pd.to_datetime(df["created"]).dt.date
# compare date with date: date vs. Timestamp comparisons raise on pandas >= 2.0.
df = df.loc[df["day"] >= pd.to_datetime("2020-06-10").date(), :]

# drop users that were already customers.
df = df.loc[
    (pd.to_datetime(df["user_created"]) >= pd.to_datetime("2020-06-10"))
    | (df["user_created"].isna() == True),
    :,
]

# show sample size.
df = df.groupby(["se_property"])["rank_column"].agg("nunique").reset_index()


alt.Chart(df).mark_bar().encode(
    x=alt.X("se_property", axis=alt.Axis(title="Treatment group")),
    y=alt.Y("rank_column:Q", axis=alt.Axis(title="Sample size")),
    color="se_property:N",
).properties(width=400, height=400, title="Sample size")

We started the rollout at only 5% of traffic. Since 16th of June 2020 we have rolled it out to 60% of traffic. 

Note that the experiment started before June 10th, but we decided to drop the traffic prior to that date because of a bug that affected the variant. 
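Every cell in this notebook repeats the same preparation steps (first `signup-start` event per user, cut-off at June 10th, existing customers dropped). A minimal sketch of how these steps could be wrapped in one helper is shown below. The function name `first_signup_start`, the constant `EXPERIMENT_START` and the toy rows are my own and not part of the original analysis.

```python
import pandas as pd

EXPERIMENT_START = pd.Timestamp("2020-06-10")  # cut-off used throughout this notebook


def first_signup_start(
    events: pd.DataFrame, start: pd.Timestamp = EXPERIMENT_START
) -> pd.DataFrame:
    """Return one 'signup-start' row per rank_column inside the clean test window."""
    df = events.loc[events["step"] == "signup-start"].copy()
    df["created"] = pd.to_datetime(df["created"])
    df["user_created"] = pd.to_datetime(df["user_created"])

    # keep only the earliest signup-start event per user.
    df = df.sort_values("created").drop_duplicates("rank_column", keep="first")

    # drop traffic from before the cut-off date.
    df = df.loc[df["created"] >= start]

    # drop users that were already customers before the cut-off.
    df = df.loc[df["user_created"].isna() | (df["user_created"] >= start)]
    return df.reset_index(drop=True)


# toy rows, made up for illustration only.
toy = pd.DataFrame(
    {
        "rank_column": ["a", "a", "b", "c", "d"],
        "step": ["signup-start", "signup-start", "signup-start", "email-address", "signup-start"],
        "created": [
            "2020-06-11 10:00",
            "2020-06-12 09:00",
            "2020-06-09 08:00",
            "2020-06-11 08:00",
            "2020-06-15 12:00",
        ],
        "user_created": [None, None, None, None, "2020-01-01 00:00"],
        "se_property": ["user_reengagement"] * 5,
    }
)
clean = first_signup_start(toy)
print(clean[["rank_column", "created"]])
```

In the toy data only user `a` survives: `b` started before the cut-off, `c` never fired `signup-start`, and `d` was already a customer.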

In [20]:
df = pd.read_csv("user_reengagement.csv")

# keep only first screen.
df = df.loc[df["step"] == "signup-start", :]

# go down to unique level for each event and keep only the first event for each user.
df["rn"] = (
    df.sort_values(["created"], ascending=[True]).groupby("rank_column").cumcount()
)
df = df.loc[df["rn"] == 0, :]

# drop samples before June 10th.
df["day"] = pd.to_datetime(df["created"]).dt.date
df = df.loc[df["day"] >= pd.to_datetime("2020-06-10").date(), :]
df = df.loc[df["day"] <= datetime.now().date(), :]

# drop users that were already customers.
df = df.loc[
    (pd.to_datetime(df["user_created"]) >= pd.to_datetime("2020-06-10"))
    | (df["user_created"].isna() == True),
    :,
]

# show sample size.
df = df.groupby(["se_property", "day"])["rank_column"].agg("nunique").reset_index()

# make it cumulative
df["cumulative_sample"] = df.groupby(["se_property"])["rank_column"].cumsum()
df["day"] = df["day"].astype(str)

alt.Chart(df).mark_line().encode(
    x=alt.X("day", axis=alt.Axis(title="Day")),
    y=alt.Y("cumulative_sample:Q", axis=alt.Axis(title="Sample size")),
    color="se_property:N",
).properties(width=400, height=400, title="Sample size over time")

## Step 1. % of sign-up starts that finish signup and confirm their mail 

In [21]:
df = pd.read_csv("user_reengagement.csv")

# keep only first screen.
df = df.loc[df["step"] == "signup-start", :]

# go down to unique level for each event and keep only the first event for each user.
df["rn"] = (
    df.sort_values(["created"], ascending=[True]).groupby("rank_column").cumcount()
)
df = df.loc[df["rn"] == 0, :]

# drop samples before June 10th.
df["day"] = pd.to_datetime(df["created"]).dt.date
df = df.loc[df["day"] >= pd.to_datetime("2020-06-10").date(), :]
df = df.loc[df["day"] <= datetime.now().date(), :]

# drop users that were already customers.
df = df.loc[
    (pd.to_datetime(df["user_created"]) >= pd.to_datetime("2020-06-10"))
    | (df["user_created"].isna() == True),
    :,
]

df["su_mail"] = 0
df.loc[
    (df["user_id"].isna() == False)
    & (df["email_confirmation_completed"].isna() == False),
    "su_mail",
] = 1

# check conversion.
df = df.groupby(["se_property", "rank_column"])["su_mail"].agg("max").reset_index()
df = df.groupby(["se_property"])["su_mail"].agg("mean").reset_index()

alt.Chart(df).mark_bar().encode(
    x=alt.X("se_property", axis=alt.Axis(title="Treatment group")),
    y=alt.Y(
        "su_mail:Q",
        axis=alt.Axis(title="% of SUI that finish signup+confirm mail", format="%"),
        scale=alt.Scale(domain=(0.30, 0.37)),
    ),
    color="se_property:N",
).properties(width=400, height=400, title="% of SUI that finish SU and confirm mail")

In [22]:
df.head()

| | se_property | su_mail |
|---|---|---|
| 0 | user_reengagement | 0.337134 |
| 1 | user_reengagement_control | 0.338299 |


In [23]:
from statsmodels.stats.proportion import proportions_ztest

df = pd.read_csv("user_reengagement.csv")

# keep only first screen.
df = df.loc[df["step"] == "signup-start", :]

# go down to unique level for each event and keep only the first event for each user.
df["rn"] = (
    df.sort_values(["created"], ascending=[True]).groupby("rank_column").cumcount()
)
df = df.loc[df["rn"] == 0, :]

# drop samples before June 10th.
df["day"] = pd.to_datetime(df["created"]).dt.date
df = df.loc[df["day"] >= pd.to_datetime("2020-06-10").date(), :]
df = df.loc[df["day"] <= datetime.now().date(), :]

# drop users that were already customers.
df = df.loc[
    (pd.to_datetime(df["user_created"]) >= pd.to_datetime("2020-06-10"))
    | (df["user_created"].isna() == True),
    :,
]


df["su_mail"] = 0
df.loc[
    (df["user_id"].isna() == False)
    & (df["email_confirmation_completed"].isna() == False),
    "su_mail",
] = 1

# check conversion.
df = df.groupby(["se_property", "rank_column"])["su_mail"].agg("max").reset_index()
data = df.groupby("se_property")["su_mail"].agg(["count", "sum"]).reset_index()

# run z test. (two sided)
stat, pval = proportions_ztest(data["sum"], data["count"])

print(
    "The z-score for this test is %s which corresponds to a p-value of %s"
    % (round(stat, 2), round(pval, 4))
)

if pval < 0.05:
    print("The difference is significant.")
else:
    print("The difference is not significant.")


The z-score for this test is -0.5 which corresponds to a p-value of 0.6202
The difference is not significant.


## 1.1. Robustness

% SUI to SU+mail confirmed over time. 

Both groups fluctuate quite a bit over time. We had two days in the middle of the sample where the treatment group actually performed better than the control. Of course, this could just be stochastic fluctuation.
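To get a feeling for how much of the daily movement could be pure noise, one can compute the binomial standard error of a daily rate. The sketch below uses only the standard library. The figure of 1,000 users per day and group is invented for illustration, and `daily_rate_band` is my own name.

```python
from math import sqrt
from typing import Tuple


def daily_rate_band(rate: float, n_users: int, z: float = 1.96) -> Tuple[float, float]:
    """Approximate 95% band around a daily conversion rate under binomial noise."""
    se = sqrt(rate * (1 - rate) / n_users)
    return rate - z * se, rate + z * se


# 34% is roughly the overall SU+mail rate; 1,000 users per day is a made-up figure.
low, high = daily_rate_band(0.34, 1000)
print(f"{low:.3f} to {high:.3f}")
```

Under these assumptions a single day can move by roughly three percentage points in either direction without any real difference between the groups.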

In [35]:
df = pd.read_csv("user_reengagement.csv")

# keep only first screen.
df = df.loc[df["step"] == "signup-start", :]

# go down to unique level for each event and keep only the first event for each user.
df["rn"] = (
    df.sort_values(["created"], ascending=[True]).groupby("rank_column").cumcount()
)
df = df.loc[df["rn"] == 0, :]

# drop samples before June 10th.
df["day"] = pd.to_datetime(df["created"]).dt.date
df = df.loc[df["day"] >= pd.to_datetime("2020-06-10").date(), :]

# last day on which a user was created; .max() skips missing values.
m = pd.to_datetime(df["user_created"]).max().date()
df = df.loc[df["day"] <= m, :]

# drop users that were already customers.
df = df.loc[
    (pd.to_datetime(df["user_created"]) >= pd.to_datetime("2020-06-10"))
    | (df["user_created"].isna() == True),
    :,
]


df["su_mail"] = 0
df.loc[
    (df["user_id"].isna() == False)
    & (df["email_confirmation_completed"].isna() == False),
    "su_mail",
] = 1

# check conversion.
df = (
    df.groupby(["se_property", "day", "rank_column"])["su_mail"]
    .agg("max")
    .reset_index()
)
df = df.groupby(["se_property", "day"])["su_mail"].agg("mean").reset_index()

df["day"] = df["day"].astype(str)

alt.Chart(df).mark_line().encode(
    x=alt.X("day:N", axis=alt.Axis(title="Day")),
    y=alt.Y(
        "su_mail:Q",
        axis=alt.Axis(title="% of SUI that finish signup+confirm mail", format="%"),
    ),
    color="se_property:N",
).properties(width=400, height=400, title="% of SUI that finish SU and confirm mail")

Now, let's look at different markets. 

In [12]:
df = pd.read_csv("user_reengagement.csv")

# keep only first screen.
df = df.loc[df["step"] == "signup-start", :]

# go down to unique level for each event and keep only the first event for each user.
df["rn"] = (
    df.sort_values(["created"], ascending=[True]).groupby("rank_column").cumcount()
)
df = df.loc[df["rn"] == 0, :]

# drop samples before June 10th.
df["day"] = pd.to_datetime(df["created"]).dt.date
df = df.loc[df["day"] >= pd.to_datetime("2020-06-10").date(), :]
df = df.loc[df["day"] <= datetime.now().date(), :]

# drop users that were already customers.
df = df.loc[
    (pd.to_datetime(df["user_created"]) >= pd.to_datetime("2020-06-10"))
    | (df["user_created"].isna() == True),
    :,
]

df["su_mail"] = 0
df.loc[
    (df["user_id"].isna() == False)
    & (df["email_confirmation_completed"].isna() == False),
    "su_mail",
] = 1

df.loc[df["country_group"] != "core", "country"] = "Other"

# check conversion.
df = (
    df.groupby(["se_property", "country", "rank_column"])["su_mail"]
    .agg("max")
    .reset_index()
)
df = df.groupby(["se_property", "country"])["su_mail"].agg("mean").reset_index()

alt.Chart(df).mark_bar().encode(
    x=alt.X("se_property:N", axis=alt.Axis(title="Group")),
    y=alt.Y(
        "su_mail:Q",
        axis=alt.Axis(title="% of SUI that finish signup+confirm mail", format="%"),
    ),
    column="country:N",
    color="se_property",
).properties(
    width=200, height=400, title="% of SUI that finish SU and confirm mail by market"
)

Next, let's look at the source of the traffic. Mobile vs desktop. App vs browser etc. 

In [8]:
df = pd.read_csv("user_reengagement.csv")

# go down to unique level for each event and keep only the first event for each user.
df["rn"] = (
    df.sort_values(["created"], ascending=[True]).groupby("rank_column").cumcount()
)
df = df.loc[df["rn"] == 0, :]

# drop samples before June 10th.
df["day"] = pd.to_datetime(df["created"]).dt.date
df = df.loc[df["day"] >= pd.to_datetime("2020-06-10").date(), :]
df = df.loc[df["day"] <= datetime.now().date(), :]

# drop users that were already customers.
df = df.loc[
    (pd.to_datetime(df["user_created"]) >= pd.to_datetime("2020-06-10"))
    | (df["user_created"].isna() == True),
    :,
]


# keep only first screen.
df = df.loc[df["step"] == "signup-start", :]

df["su_mail"] = 0
df.loc[
    (df["user_id"].isna() == False)
    & (df["email_confirmation_completed"].isna() == False),
    "su_mail",
] = 1

df.loc[df["country_group"] != "core", "country"] = "Other"

# check conversion.
df = (
    df.groupby(["se_property", "dvce_type", "rank_column"])["su_mail"]
    .agg("max")
    .reset_index()
)
df = df.groupby(["se_property", "dvce_type"])["su_mail"].agg("mean").reset_index()

alt.Chart(
    df.loc[df["dvce_type"].isin(["Computer", "Mobile", "Tablet"]), :]
).mark_bar().encode(
    x=alt.X("se_property:N", axis=alt.Axis(title="Group")),
    y=alt.Y(
        "su_mail:Q",
        axis=alt.Axis(title="% of SUI that finish signup+confirm mail", format="%"),
    ),
    column="dvce_type:N",
    color="se_property",
).properties(
    width=200, height=400, title="% of SUI that finish SU and confirm mail by device"
)



Re-engagement might take time to take effect. Let's consider % SUI to SU+mail within 7 days by effectively dropping all obs from the last week.
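One detail of the window logic is worth spelling out: `.dt.days` truncates a time difference to whole days, so a condition like `.dt.days <= 7` still counts a conversion that happened 7 days and 12 hours after SUI. A stricter variant compares the full timedelta. The sketch below illustrates the difference on made-up rows; `converted_within` is a hypothetical helper, not part of the original analysis.

```python
import pandas as pd


def converted_within(start: pd.Series, event: pd.Series, days: int) -> pd.Series:
    """True where `event` happened at most `days` * 24 hours after `start`."""
    elapsed = pd.to_datetime(event) - pd.to_datetime(start)
    # missing events (NaT) compare as False, i.e. "not converted".
    return elapsed <= pd.Timedelta(days=days)


# toy rows, made up for illustration only.
toy = pd.DataFrame(
    {
        "created": ["2020-06-10 00:00", "2020-06-10 00:00", "2020-06-10 00:00"],
        "user_created": ["2020-06-12 09:00", "2020-06-17 12:00", None],
    }
)

toy["strict_7d"] = converted_within(toy["created"], toy["user_created"], 7)
toy["days_attr_7d"] = (
    pd.to_datetime(toy["user_created"]) - pd.to_datetime(toy["created"])
).dt.days <= 7
print(toy)
```

The second row converts after 7.5 days: it counts under the `.dt.days` rule but not under the strict rule. Either definition is fine as long as it is applied to both groups.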


In [6]:
from datetime import datetime, timedelta

df = pd.read_csv("user_reengagement.csv")

# keep only first screen.
df = df.loc[df["step"] == "signup-start", :]

# go down to unique level for each event and keep only the first event for each user.
df["rn"] = (
    df.sort_values(["created"], ascending=[True]).groupby("rank_column").cumcount()
)
df = df.loc[df["rn"] == 0, :]

# drop samples before June 10th.
df["day"] = pd.to_datetime(df["created"]).dt.date
df = df.loc[df["day"] >= pd.to_datetime("2020-06-10").date(), :]
# drop customers that are too young
df = df.loc[pd.to_datetime(df["created"]) < datetime.today() - timedelta(days=7), :]

# drop users that were already customers.
df = df.loc[
    (pd.to_datetime(df["user_created"]) >= pd.to_datetime("2020-06-10"))
    | (df["user_created"].isna() == True),
    :,
]

# code up conversion within 7 days.
df["su_diff"] = (
    pd.to_datetime(df["user_created"]) - pd.to_datetime(df["created"])
).dt.days
df["mail_diff"] = (
    pd.to_datetime(df["email_confirmation_completed"]) - pd.to_datetime(df["created"])
).dt.days

df["su_mail"] = 0
df.loc[(df["su_diff"] <= 7) & (df["mail_diff"] <= 7), "su_mail"] = 1

# check conversion.
df = df.groupby(["se_property", "rank_column"])["su_mail"].agg("max").reset_index()
df = df.groupby(["se_property"])["su_mail"].agg("mean").reset_index()

alt.Chart(df).mark_bar().encode(
    x=alt.X("se_property", axis=alt.Axis(title="Treatment group")),
    y=alt.Y(
        "su_mail:Q",
        axis=alt.Axis(title="% of SUI that finish signup+confirm mail", format="%"),
        scale=alt.Scale(domain=(0.30, 0.35)),
    ),
    color="se_property:N",
).properties(
    width=400,
    height=400,
    title="% of SUI that finish SU and confirm mail within 7 days",
)

In [7]:
df.head()

| | se_property | su_mail |
|---|---|---|
| 0 | user_reengagement | 0.33318 |
| 1 | user_reengagement_control | 0.336239 |


And now the same but with a 21-day window...

In [24]:
from datetime import datetime, timedelta

df = pd.read_csv("user_reengagement.csv")

# keep only first screen.
df = df.loc[df["step"] == "signup-start", :]

# go down to unique level for each event and keep only the first event for each user.
df["rn"] = (
    df.sort_values(["created"], ascending=[True]).groupby("rank_column").cumcount()
)
df = df.loc[df["rn"] == 0, :]

# drop samples before June 10th.
df["day"] = pd.to_datetime(df["created"]).dt.date
df = df.loc[df["day"] >= pd.to_datetime("2020-06-10").date(), :]
# drop customers that are too young
df = df.loc[pd.to_datetime(df["created"]) < datetime.today() - timedelta(days=21), :]

# drop users that were already customers.
df = df.loc[
    (pd.to_datetime(df["user_created"]) >= pd.to_datetime("2020-06-10"))
    | (df["user_created"].isna() == True),
    :,
]

# code up conversion within 21 days.
df["su_diff"] = (
    pd.to_datetime(df["user_created"]) - pd.to_datetime(df["created"])
).dt.days
df["mail_diff"] = (
    pd.to_datetime(df["email_confirmation_completed"]) - pd.to_datetime(df["created"])
).dt.days

df["su_mail"] = 0
df.loc[(df["su_diff"] <= 21) & (df["mail_diff"] <= 21), "su_mail"] = 1

# check conversion.
df = df.groupby(["se_property", "rank_column"])["su_mail"].agg("max").reset_index()
df = df.groupby(["se_property"])["su_mail"].agg("mean").reset_index()

alt.Chart(df).mark_bar().encode(
    x=alt.X("se_property", axis=alt.Axis(title="Treatment group")),
    y=alt.Y(
        "su_mail:Q",
        axis=alt.Axis(title="% of SUI that finish signup+confirm mail", format="%"),
        scale=alt.Scale(domain=(0.30, 0.37)),
    ),
    color="se_property:N",
).properties(
    width=400,
    height=400,
    title="% of SUI that finish SU and confirm mail within 21 days",
)

In [25]:
df.head()

| | se_property | su_mail |
|---|---|---|
| 0 | user_reengagement | 0.342547 |
| 1 | user_reengagement_control | 0.347201 |


Similarly, we show the share of successful signups by the number of days between SUI and SU. There is no notable difference between the groups. 

In [26]:
df = pd.read_csv("user_reengagement.csv")

# keep only first screen.
df = df.loc[df["step"] == "signup-start", :]

# go down to unique level for each event and keep only the first event for each user.
df["rn"] = (
    df.sort_values(["created"], ascending=[True]).groupby("rank_column").cumcount()
)
df = df.loc[df["rn"] == 0, :]

# drop samples before June 10th.
df["day"] = pd.to_datetime(df["created"]).dt.date
df = df.loc[df["day"] >= pd.to_datetime("2020-06-10").date(), :]
df = df.loc[df["day"] <= datetime.now().date(), :]

# drop users that were already customers.
df = df.loc[
    (pd.to_datetime(df["user_created"]) >= pd.to_datetime("2020-06-10"))
    | (df["user_created"].isna() == True),
    :,
]

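# show date difference between SUI and SU.
# normalize() keeps the datetime64 dtype, so the subtraction yields timedeltas and .dt.days works.
df["diff"] = (
    pd.to_datetime(df["user_created"]).dt.normalize()
    - pd.to_datetime(df["created"]).dt.normalize()
).dt.days
df.loc[df["diff"] < 0, "diff"] = 0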

df = df.groupby(["diff", "se_property"])["user_id"].agg("nunique").reset_index()

df["perc"] = (
    100 * df["user_id"] / df.groupby(["se_property"])["user_id"].transform("sum")
)

alt.Chart(df.loc[df["diff"] >= 0, :]).mark_bar().encode(
    x=alt.X("diff:N", axis=alt.Axis(title="Days since SUI to SU")),
    y=alt.Y("perc:Q", axis=alt.Axis(title="% of completed signups")),
    color="se_property",
    column="se_property",
).properties(
    width=400, height=400, title="% of SUI that finish SU by time between SUI and SU"
)



One hypothesis is that re-engagement pushes away "low intent" customers earlier on and that the difference fades over time, because these users would eventually have dropped off anyway. 

Below you see the analysis using KYC completion as our success marker. 

In [27]:
from datetime import datetime, timedelta

df = pd.read_csv("user_reengagement.csv")

# keep only first screen.
df = df.loc[df["step"] == "signup-start", :]

# go down to unique level for each event and keep only the first event for each user.
df["rn"] = (
    df.sort_values(["created"], ascending=[True]).groupby("rank_column").cumcount()
)
df = df.loc[df["rn"] == 0, :]

# drop samples before June 10th.
df["day"] = pd.to_datetime(df["created"]).dt.date
df = df.loc[df["day"] >= pd.to_datetime("2020-06-10").date(), :]
# drop customers that are too young
df = df.loc[pd.to_datetime(df["created"]) < datetime.today() - timedelta(days=7), :]

# drop users that were already customers.
df = df.loc[
    (pd.to_datetime(df["user_created"]) >= pd.to_datetime("2020-06-10"))
    | (df["user_created"].isna() == True),
    :,
]


# code up conversion within 7 days.
df["kycc7day"] = 0
df.loc[
    (pd.to_datetime(df["kyc_first_completed"]) - pd.to_datetime(df["created"])).dt.days
    <= 7,
    "kycc7day",
] = 1

# check conversion.
df = df.groupby(["se_property", "rank_column"])["kycc7day"].agg("max").reset_index()
df = df.groupby(["se_property"])["kycc7day"].agg("mean").reset_index()

alt.Chart(df).mark_bar().encode(
    x=alt.X("se_property", axis=alt.Axis(title="Treatment group")),
    y=alt.Y(
        "kycc7day:Q",
        axis=alt.Axis(title="% of SUI that complete KYC within 7 days", format="%"),
    ),
    color="se_property:N",
).properties(width=400, height=400, title="% of SUI that complete KYC within 7 days")

In [80]:
df.head()

| | se_property | kycc7day |
|---|---|---|
| 0 | user_reengagement | 0.186356 |
| 1 | user_reengagement_control | 0.187275 |


In [28]:
from statsmodels.stats.proportion import proportions_ztest

df = pd.read_csv("user_reengagement.csv")

# keep only first screen.
df = df.loc[df["step"] == "signup-start", :]

# go down to unique level for each event and keep only the first event for each user.
df["rn"] = (
    df.sort_values(["created"], ascending=[True]).groupby("rank_column").cumcount()
)
df = df.loc[df["rn"] == 0, :]

# drop samples before June 10th.
df["day"] = pd.to_datetime(df["created"]).dt.date
df = df.loc[df["day"] >= pd.to_datetime("2020-06-10").date(), :]
# drop customers that are too young
df = df.loc[pd.to_datetime(df["created"]) < datetime.today() - timedelta(days=7), :]

# drop users that were already customers.
df = df.loc[
    (pd.to_datetime(df["user_created"]) >= pd.to_datetime("2020-06-10"))
    | (df["user_created"].isna() == True),
    :,
]

# code up conversion within 7 days.
df["kycc7day"] = 0
df.loc[
    (pd.to_datetime(df["kyc_first_completed"]) - pd.to_datetime(df["created"])).dt.days
    <= 7,
    "kycc7day",
] = 1

# check conversion.
df = df.groupby(["se_property", "rank_column"])["kycc7day"].agg("max").reset_index()
data = df.groupby("se_property")["kycc7day"].agg(["count", "sum"]).reset_index()

# run z test. (two sided)
stat, pval = proportions_ztest(data["sum"], data["count"])

print(
    "The z-score for this test is %s which corresponds to a p-value of %s"
    % (round(stat, 2), round(pval, 4))
)

if pval < 0.05:
    print("The difference is significant.")
else:
    print("The difference is not significant.")

The z-score for this test is -0.97 which corresponds to a p-value of 0.3316
The difference is not significant.
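As a cross-check on the cell above, the pooled two-proportion z statistic can be computed by hand. The sketch below uses only the standard library; the counts are made up and the function name is my own. To my understanding this matches the default (pooled variance) behaviour of `proportions_ztest`, but treat that as an assumption rather than a guarantee.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(success_a, n_a, success_b, n_b):
    """Two-sided z-test for equal proportions, using the pooled standard error."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# hypothetical counts: variant first, control second.
z_stat, p_val = two_proportion_ztest(1860, 10000, 1900, 10000)
print(round(z_stat, 2), round(p_val, 4))
```

A negative z-score means the first group (here the variant) converts at a lower rate than the second.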


In [29]:
from datetime import datetime, timedelta

df = pd.read_csv("user_reengagement.csv")

# keep only first screen.
df = df.loc[df["step"] == "signup-start", :]

# go down to unique level for each event and keep only the first event for each user.
df["rn"] = (
    df.sort_values(["created"], ascending=[True]).groupby("rank_column").cumcount()
)
df = df.loc[df["rn"] == 0, :]

# drop samples before June 10th.
df["day"] = pd.to_datetime(df["created"]).dt.date
df = df.loc[df["day"] >= pd.to_datetime("2020-06-10").date(), :]
# drop customers that are too young
df = df.loc[pd.to_datetime(df["created"]) < datetime.today() - timedelta(days=21), :]

# drop users that were already customers.
df = df.loc[
    (pd.to_datetime(df["user_created"]) >= pd.to_datetime("2020-06-10"))
    | (df["user_created"].isna() == True),
    :,
]

# code up conversion within 21 days.
df["kycc21day"] = 0
df.loc[
    (pd.to_datetime(df["kyc_first_completed"]) - pd.to_datetime(df["created"])).dt.days
    <= 21,
    "kycc21day",
] = 1

# check conversion.
df = df.groupby(["se_property", "rank_column"])["kycc21day"].agg("max").reset_index()
df = df.groupby(["se_property"])["kycc21day"].agg("mean").reset_index()

alt.Chart(df).mark_bar().encode(
    x=alt.X("se_property", axis=alt.Axis(title="Treatment group")),
    y=alt.Y(
        "kycc21day:Q",
        axis=alt.Axis(title="% of SUI that complete KYC within 21 days", format="%"),
    ),
    color="se_property:N",
).properties(
    width=400, height=400, title="% of SUI that complete KYC within 21 days of SUI"
)

In [30]:
df.head()

| | se_property | kycc21day |
|---|---|---|
| 0 | user_reengagement | 0.189814 |
| 1 | user_reengagement_control | 0.194235 |
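A gap of this size is hard to detect. The sketch below gives a rough idea of the sample size per group that a two-sided test would need to pick up the difference between the two 21-day rates shown above. It uses the textbook normal-approximation formula for two proportions. The significance level of 5% and power of 80% are conventional defaults that I assume here, not values taken from this analysis, and `users_per_group` is my own name.

```python
from math import ceil, sqrt
from statistics import NormalDist


def users_per_group(p_control, p_variant, alpha=0.05, power=0.8):
    """Approximate users needed per group to detect p_control vs. p_variant."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    p_bar = (p_control + p_variant) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_power * sqrt(p_control * (1 - p_control) + p_variant * (1 - p_variant))
    ) ** 2
    return ceil(numerator / (p_control - p_variant) ** 2)


# 21-day KYCc rates from the table above.
n_needed = users_per_group(0.194235, 0.189814)
print(n_needed)
```

Under these assumptions the required sample is on the order of 100,000 users per group, which helps explain why a gap of this size does not reach significance.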


Similarly, we look at the number of attempts a user needs to pass KYC. We show the distribution of KYC attempts per SUI in both groups.    

Crucially, since many users never reach KYC, there is a lot of mass at 0. 

In [83]:
from datetime import datetime, timedelta

df = pd.read_csv("user_reengagement.csv")

# keep only first screen.
df = df.loc[df["step"] == "signup-start", :]

# go down to unique level for each event and keep only the first event for each user.
df["rn"] = (
    df.sort_values(["created"], ascending=[True]).groupby("rank_column").cumcount()
)
df = df.loc[df["rn"] == 0, :]

# drop samples before June 10th.
df["day"] = pd.to_datetime(df["created"]).dt.date
df = df.loc[df["day"] >= pd.to_datetime("2020-06-10").date(), :]
# drop customers that are too young
df = df.loc[pd.to_datetime(df["created"]) < datetime.today() - timedelta(days=7), :]

# drop users that were already customers.
df = df.loc[
    (pd.to_datetime(df["user_created"]) >= pd.to_datetime("2020-06-10"))
    | (df["user_created"].isna() == True),
    :,
]

# keep only first screen.
df = df.loc[df["step"] == "signup-start", :]
df.loc[df["kyc_count"].isna() == True, "kyc_count"] = 0

# check conversion.
df = df.groupby(["se_property", "rank_column"])["kyc_count"].agg("max").reset_index()
df = (
    df.groupby(["se_property", "kyc_count"])["rank_column"].agg("nunique").reset_index()
)

df["perc"] = (
    100
    * df["rank_column"]
    / df.groupby(["se_property"])["rank_column"].transform("sum")
)
df["cum"] = df.groupby(["se_property"])["perc"].cumsum()


alt.Chart(df.loc[df["kyc_count"] < 10, :]).mark_line().encode(
    x=alt.X("kyc_count", axis=alt.Axis(title="Number of KYC attempts required")),
    y=alt.Y("cum:Q", axis=alt.Axis(title="Percentile")),
    color="se_property:N",
).properties(width=400, height=400, title="Number of KYC attempts required")



In [84]:
from datetime import datetime, timedelta

df = pd.read_csv("user_reengagement.csv")

# drop samples before June 10th.
df["day"] = pd.to_datetime(df["created"]).dt.date
df = df.loc[df["day"] >= pd.to_datetime("2020-06-10").date(), :]
# drop customers that are too young
df = df.loc[pd.to_datetime(df["created"]) < datetime.today() - timedelta(days=7), :]

# drop users that were already customers.
df = df.loc[
    (pd.to_datetime(df["user_created"]) >= pd.to_datetime("2020-06-10"))
    | (df["user_created"].isna() == True),
    :,
]

# keep only first screen.
df = df.loc[df["step"] == "signup-start", :]
df.loc[df["kyc_count"].isna() == True, "kyc_count"] = 0

# check conversion.
df = df.groupby(["se_property", "rank_column"])["kyc_count"].agg("max").reset_index()
df = df.groupby(["se_property"])["kyc_count"].agg("mean").reset_index()

alt.Chart(df).mark_bar().encode(
    x=alt.X("se_property", axis=alt.Axis(title="Treatment group")),
    y=alt.Y("kyc_count:Q", axis=alt.Axis(title="Avg. number of KYC attempts per SUI")),
    color="se_property:N",
).properties(width=400, height=400, title="Number of KYC attempts required")

In [85]:
df.head()

| | se_property | kyc_count |
|---|---|---|
| 0 | user_reengagement | 0.431792 |
| 1 | user_reengagement_control | 0.428946 |


## What % of users opt in for newsletter? 

This does not yet take into account whether they confirm their mail (i.e. whether we can actually make use of their consent).

In [89]:
df = pd.read_csv("user_reengagement.csv")

# drop samples before June 10th.
df["day"] = pd.to_datetime(df["created"]).dt.date
df = df.loc[df["day"] >= pd.to_datetime("2020-06-10").date(), :]
df = df.loc[df["day"] <= datetime.now().date(), :]

# drop users that were already customers.
df = df.loc[
    (pd.to_datetime(df["user_created"]) >= pd.to_datetime("2020-06-10"))
    | (df["user_created"].isna() == True),
    :,
]

df.loc[df["newsletter_opt_in"].isna() == True, "newsletter_opt_in"] = "Missing"

df["rn"] = df.groupby(["shadow_user_id"])["created"].cumcount()
df = df.loc[df["rn"] == 0, :]

df = (
    df.groupby(["se_property", "newsletter_opt_in"])["shadow_user_id"]
    .agg("nunique")
    .reset_index()
)

df["perc"] = (
    100
    * df["shadow_user_id"]
    / df.groupby(["se_property"])["shadow_user_id"].transform("sum")
)

alt.Chart(df).mark_bar().encode(
    x=alt.X("newsletter_opt_in", axis=alt.Axis(title="Response")),
    y=alt.Y("perc:Q", axis=alt.Axis(title="% of SUI that we can re-target")),
    column="se_property:N",
).properties(width=400, height=400, title="% of shadow user ids with newsletter opt-in")



The picture looks similar if, instead of using the newsletter_opt_in field in cmd_shadow_user, we use the Snowplow event fired when a user in the treatment group consents on the email-address step. 

We don't have this response for all users because 
- not all users finish the email-address screen. 
- snowplow is not 100% accurate. 
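One way to check how well the two consent sources agree is a row-normalised cross-tabulation. The sketch below assumes one row per user and uses invented values; in particular, the labels in `newsletter_opt_in` are placeholders, since the real encoding of that field is not shown in this notebook.

```python
import pandas as pd

# toy rows, made up for illustration only (one row per user).
toy = pd.DataFrame(
    {
        "newsletter_opt_in": ["yes", "yes", "no", "yes", "Missing", "no"],
        "opt_in": ["Opt in", "Opt in", "Skipped", "Missing", "Missing", "Skipped"],
    }
)

# share of each Snowplow response within each backend response; rows sum to 1.
agreement = pd.crosstab(toy["newsletter_opt_in"], toy["opt_in"], normalize="index")
print(agreement)
```

Large off-diagonal shares, for example backend opt-ins with a missing Snowplow event, would show how much tracking loss affects the event-based view.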

In [19]:
df = pd.read_csv("user_reengagement.csv")

# drop samples before June 10th.
df["day"] = pd.to_datetime(df["created"]).dt.date
df = df.loc[df["day"] >= pd.to_datetime("2020-06-10").date(), :]
df = df.loc[df["day"] <= datetime.now().date(), :]

# drop users that were already customers.
df = df.loc[
    (pd.to_datetime(df["user_created"]) >= pd.to_datetime("2020-06-10"))
    | (df["user_created"].isna() == True),
    :,
]

df.loc[df["opt_in"].isna() == True, "opt_in"] = "Missing"

# keep only the email-address screen.
df = df.loc[df["step"] == "email-address", :]

df = df.groupby(["se_property", "opt_in"])["domain_userid"].agg("nunique").reset_index()

df["perc"] = df["domain_userid"] / df.groupby(["se_property"])[
    "domain_userid"
].transform("sum")

alt.Chart(df.loc[df["se_property"] == "user_reengagement", :]).mark_bar().encode(
    x=alt.X("opt_in", axis=alt.Axis(title="Response")),
    y=alt.Y(
        "perc:Q",
        axis=alt.Axis(
            title="% of users that optin/skip opt in on email field", format="%"
        ),
    ),
).properties(
    width=400, height=400, title="% of users that optin/skip opt in on email field"
)

## Conversion by whether the user consents on the email screen 


In [20]:
df = pd.read_csv("user_reengagement.csv")

# drop samples before June 10th.
df["day"] = pd.to_datetime(df["created"]).dt.date
df = df.loc[df["day"] >= pd.to_datetime("2020-06-10").date(), :]
df = df.loc[df["day"] <= datetime.now().date(), :]

# drop users that were already customers.
df = df.loc[
    (pd.to_datetime(df["user_created"]) >= pd.to_datetime("2020-06-10"))
    | (df["user_created"].isna() == True),
    :,
]

df.loc[df["opt_in"].isna() == True, "opt_in"] = "Missing"

# keep only first screen.
df = df.loc[df["step"] == "signup-start", :]

df["su_mail"] = 0
df.loc[
    (df["user_id"].isna() == False)
    & (df["email_confirmation_completed"].isna() == False),
    "su_mail",
] = 1

df = (
    df.groupby(["se_property", "opt_in", "rank_column"])["su_mail"]
    .agg("max")
    .reset_index()
)
df = df.groupby(["se_property", "opt_in"])["su_mail"].agg("mean").reset_index()


alt.Chart(df.loc[df["se_property"] == "user_reengagement", :]).mark_bar().encode(
    x=alt.X("opt_in", axis=alt.Axis(title="Response")),
    y=alt.Y(
        "su_mail:Q",
        axis=alt.Axis(
            title="% of users that convert to SU+mail depending on opt in", format="%"
        ),
    ),
).properties(
    width=400,
    height=400,
    title="% of users that convert to SU+mail depending on opt in",
)