title: AB Test - provide support in the UX flow if users mistype their email domain
author: Fabio Schmidt-Fischbach  
date: 2020-06-23  
region: EU  
summary: Users who mistype their email on the personal-information step of the signup funnel cannot confirm their email after account creation. To address this, we check whether the entered domain resembles a popular email domain and, if so, suggest a corrected address to the customer. Crucially, declining the suggestion does not prevent the user from moving on with the flow. The treatment moved neither conversion nor CS contacts to a significant extent. From a data perspective, there is no clear recommendation as to which variant is better.
tags: su, acquire, ab, test, email, onboarding

In [2]:
import pandas as pd
import os
import seaborn as sns
from statsmodels.stats.proportion import proportions_ztest
import altair as alt
from datetime import datetime

In [77]:
Query = """

with cs_contacts as ( 

select user_id,
	   count(distinct case when cs_tag like '01%' then id end) as signup_cs, 
	   count(distinct case when cs_tag like '04%' then id end) as kyc_cs,
	   signup_cs + kyc_cs as total_cs
from dbt.sf_all_contacts 
group by 1 
)

select domain_userid, 
		shadow_user_id, 
		tbl.user_created, 
		user_id, 
		country, 
		case when country in ('DEU','AUT','FRA','ESP','ITA') then 'core' else 'other' end as country_group, 
		step, 
		tbl.created, 
		se_property, 
		attribution, 
		webview, 
		os_family, 
		dvce_type, 
		rank_column,
		kyc_first_initiated, 
		kyc_first_completed, 
		is_mau,
		email_confirmation_completed,
		signup_cs, 
		kyc_cs, 
		total_cs
from dbt.stg_upper_funnel_tbl as tbl 
left join dbt.zrh_users using (user_created) 
left join cmd_user_signup_status as cmd on cmd.user_created = tbl.user_created
left join cs_contacts using (user_id) 
where se_property in ('email_suggestions_control', 'email_suggestions')

"""

### Setup 

Users who mistype their email on the personal-information step of the signup funnel cannot confirm their email after account creation.

To address this, we check whether the entered domain resembles a popular email domain and, if so, suggest a corrected address to the customer.

Crucially, declining the suggestion does not prevent the user from moving on with the flow.



![](email_feedback.png)
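
The post does not show how the suggestion itself is computed. The following is a minimal sketch of one possible approach, assuming a fuzzy match of the typed domain against a fixed list of popular domains with the standard library's `difflib`. The domain list, the function name `suggest_email`, and the 0.8 cutoff are hypothetical choices for illustration, not the production implementation.

```python
from difflib import get_close_matches
from typing import Optional

# Hypothetical list of popular domains; the list used in the test is not given in this post.
POPULAR_DOMAINS = [
    "gmail.com",
    "googlemail.com",
    "yahoo.com",
    "hotmail.com",
    "outlook.com",
    "icloud.com",
    "gmx.de",
    "web.de",
]


def suggest_email(email: str, cutoff: float = 0.8) -> Optional[str]:
    """Return a corrected address if the domain looks like a typo, else None."""
    local, sep, domain = email.strip().rpartition("@")
    if not sep or not local or not domain:
        return None  # not a parseable address, nothing to suggest
    domain = domain.lower()
    if domain in POPULAR_DOMAINS:
        return None  # already a known domain
    # Best fuzzy match above the similarity cutoff, if any.
    matches = get_close_matches(domain, POPULAR_DOMAINS, n=1, cutoff=cutoff)
    return f"{local}@{matches[0]}" if matches else None


suggestion = suggest_email("anna@gmial.com")
```

The function only returns a proposal. Whether the user accepts it is left to the UI, which matches the non-blocking design described above.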

### Summary

The treatment moved neither conversion nor CS contacts to a significant extent. From a data perspective, there is no clear recommendation as to which variant is better.

### Sample 

In [36]:
df = pd.read_csv("email_test.csv")


# get number of participants
df = df.loc[df["step"] == "personal-information", :]
df = df.loc[
    (pd.to_datetime(df["user_created"]) >= pd.to_datetime("2020-06-09"))
    | (df["user_created"].isna() == True),
    :,
]

# count unique users per variant.
df = df.groupby(["se_property"])["rank_column"].agg("nunique").reset_index()

alt.Chart(df).mark_bar().encode(
    x=alt.X("se_property:N", axis=alt.Axis(title="Group")),
    y=alt.Y("rank_column:Q", axis=alt.Axis(title="Number of users")),
    color="se_property",
).properties(title="Sample size", width=400, height=400).display(renderer="svg")

### Analysis 

The main KPI is the share of users reaching the personal-information step who complete signup (SU) and confirm their email.

The conversion rates are nearly identical (31.74% in the treatment vs. 31.69% in control). A two-sided z-test for proportions confirms the visual first impression (p = 0.84): the feature did not improve conversion.

In [89]:
df = pd.read_csv("email_test.csv")

df = df.loc[df["step"] == "personal-information", :]
df = df.loc[
    (pd.to_datetime(df["user_created"]) >= pd.to_datetime("2020-06-09"))
    | (df["user_id"].isna() == True),
    :,
]

df["success"] = 0
df.loc[
    (df["user_id"].isna() == False)
    & (df["email_confirmation_completed"].isna() == False),
    "success",
] = 1

df = df.groupby(["se_property", "rank_column"])["success"].agg("max").reset_index()
df = df.groupby(["se_property"])["success"].agg("mean").reset_index()

alt.Chart(df).mark_bar().encode(
    x=alt.X("se_property:N", axis=alt.Axis(title="Group")),
    y=alt.Y(
        "success:Q",
        axis=alt.Axis(title="% personal information to SU+confirm mail", format="%"),
    ),
    color="se_property",
).properties(
    title="% personal information to complete SU and confirm mail",
    width=400,
    height=400,
).display(
    renderer="svg"
)

In [90]:
df.head()

Unnamed: 0,se_property,success
0,email_suggestions,0.317384
1,email_suggestions_control,0.316854


In [91]:
df = pd.read_csv("email_test.csv")

df = df.loc[df["step"] == "personal-information", :]
df = df.loc[
    (pd.to_datetime(df["user_created"]) >= pd.to_datetime("2020-06-09"))
    | (df["user_id"].isna() == True),
    :,
]

df["success"] = 0
df.loc[
    (df["user_id"].isna() == False)
    & (df["email_confirmation_completed"].isna() == False),
    "success",
] = 1

df = df.groupby(["se_property", "rank_column"])["success"].agg("max").reset_index()

df = df.groupby(["se_property"])["success"].agg(["count", "sum"]).reset_index()

# run z test. (two sided)
stat, pval = proportions_ztest(df["sum"], df["count"])

print(
    "The z-score for this test is %s which corresponds to a p-value of %s"
    % (round(stat, 2), round(pval, 4))
)

if pval < 0.05:
    print("The difference is significant.")
else:
    print("The difference is not significant.")

The z-score for this test is 0.2 which corresponds to a p-value of 0.8447
The difference is not significant.
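
For readers who want to see what the test computes, here is a minimal sketch of the pooled two-proportion z-test, which is what `proportions_ztest` does with its default arguments. The function name and the counts below are hypothetical and only serve as an illustration; they are not the experiment's data.

```python
import math


def two_proportion_ztest(successes_a, n_a, successes_b, n_b):
    """Two-sided pooled z-test for the difference between two proportions."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Under the null hypothesis both groups share one conversion rate.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts, for illustration only.
z, p = two_proportion_ztest(3200, 10000, 3150, 10000)
print("z = %.2f, p = %.4f" % (z, p))
```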


We split the funnel into two parts:

- Personal information to SU: 33.99% in the treatment vs. 33.97% in control.
- SU to email confirmed: 94.34% in the treatment vs. 94.21% in control.

--> The differences are small in both parts; we don't find an impact in either.
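
The split relies on the overall conversion being the product of the two step conversions when both are measured on the same users. A minimal sketch on hypothetical toy records (not the experiment's data) illustrates the identity. Note that in this notebook the SU-to-confirmation cell (In [94]) does not apply the step and date filters used for the overall KPI, so the reported rates need not multiply exactly.

```python
# Hypothetical user-level records: 10 users reached personal-information,
# 4 of them created an account (SU), and 3 of those confirmed their email.
users = (
    [{"signed_up": True, "confirmed": True}] * 3
    + [{"signed_up": True, "confirmed": False}] * 1
    + [{"signed_up": False, "confirmed": False}] * 6
)

n_pi = len(users)
n_su = sum(u["signed_up"] for u in users)
n_confirmed = sum(u["signed_up"] and u["confirmed"] for u in users)

pi_to_su = n_su / n_pi                 # first part of the funnel
su_to_confirmed = n_confirmed / n_su   # second part of the funnel
overall = n_confirmed / n_pi           # main KPI

# The overall rate equals the product of the two step rates.
assert abs(overall - pi_to_su * su_to_confirmed) < 1e-12
```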

In [92]:
df = pd.read_csv("email_test.csv")

# keep personal-information events; drop users whose account was created before 2020-06-09
df = df.loc[df["step"] == "personal-information", :]
df = df.loc[
    (pd.to_datetime(df["user_created"]) >= pd.to_datetime("2020-06-09"))
    | (df["user_id"].isna() == True),
    :,
]

df["success"] = 0
df.loc[(df["user_id"].isna() == False), "success"] = 1

df = df.groupby(["se_property", "rank_column"])["success"].agg("max").reset_index()
df = df.groupby(["se_property"])["success"].agg("mean").reset_index()

alt.Chart(df).mark_bar().encode(
    x=alt.X("se_property:N", axis=alt.Axis(title="Group")),
    y=alt.Y(
        "success:Q",
        axis=alt.Axis(title="% personal information to SU", format="%"),
        scale=alt.Scale(domain=[0.25, 0.35]),
    ),
    color="se_property",
).properties(title="% personal information to SU", width=400, height=400).display(
    renderer="svg"
)

In [93]:
df.head()

Unnamed: 0,se_property,success
0,email_suggestions,0.339873
1,email_suggestions_control,0.339681


In [94]:
df = pd.read_csv("email_test.csv")

# keep only users who completed signup (i.e. have a user_id)
df = df.loc[df["user_id"].isna() == False, :]

df["success"] = 0
df.loc[(df["email_confirmation_completed"].isna() == False), "success"] = 1

df = df.groupby(["se_property", "user_id"])["success"].agg("max").reset_index()
df = df.groupby(["se_property"])["success"].agg("mean").reset_index()

alt.Chart(df).mark_bar().encode(
    x=alt.X("se_property:N", axis=alt.Axis(title="Group")),
    y=alt.Y("success:Q", axis=alt.Axis(title="% SU to confirm mail", format="%")),
    color="se_property",
).properties(title="% SU to mail confirmed", width=400, height=400).display(
    renderer="svg"
)



In [95]:
df.head()

Unnamed: 0,se_property,success
0,email_suggestions,0.943371
1,email_suggestions_control,0.942133


The step-by-step conversions are also very similar. 

In [96]:
df = pd.read_csv("email_test.csv")

df = df.groupby(["se_property", "step"])["rank_column"].agg("nunique").reset_index()

df = df.loc[
    df["step"].isin(
        [
            "signup-start",
            "personal-information",
            "phone-number",
            "address",
            "address-confirmation",
            "create-password",
            "create-account",
            "email-confirmation",
        ]
    ),
    :,
]

df["perc"] = (
    100
    * df["rank_column"]
    / df.groupby(["se_property"])["rank_column"].transform("max")
)

alt.Chart(df).mark_bar().encode(
    y=alt.Y("step:N", axis=alt.Axis(title="Step"), sort="-x"),
    x=alt.X("perc:Q", axis=alt.Axis(title="% of customers")),
    column="se_property",
)



In [85]:
df.head(100)

Unnamed: 0,se_property,step,rank_column,perc
2,email_suggestions,address,31392,36.932199
3,email_suggestions,address-confirmation,31450,37.000435
4,email_suggestions,create-account,26954,31.710961
5,email_suggestions,create-password,26555,31.241544
7,email_suggestions,email-confirmation,25787,30.338004
9,email_suggestions,personal-information,61566,72.43144
10,email_suggestions,phone-number,37471,44.084048
12,email_suggestions,signup-start,84999,100.0
20,email_suggestions_control,address,31225,36.961849
21,email_suggestions_control,address-confirmation,31260,37.003279


## CS contacts 

The funnel conversion was our primary metric of interest. The secondary metric is CS contacts.

We investigate both signup-related (tag 01) and KYC-related (tag 04) contacts. The goal is to decrease the share of users who need to reach out to CS.

CS contacts are slightly higher in the treatment (0.125 vs. 0.121 contacts per user), but the z-test on the share of users with at least one contact is not significant (p = 0.50).
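
The charts in this section plot the mean number of contacts per user, while the significance test in In [97] first reduces each user to "at least one contact" and compares shares. A minimal sketch on hypothetical toy counts (not the experiment's data) shows how the two metrics can diverge; the `summarise` helper is an illustrative name.

```python
# Hypothetical per-user contact counts for each variant.
contacts_per_user = {
    "email_suggestions": [0, 0, 3, 0, 1],
    "email_suggestions_control": [0, 1, 0, 1, 0],
}


def summarise(counts):
    """Return (mean contacts per user, share of users with >= 1 contact)."""
    n = len(counts)
    mean_contacts = sum(counts) / n
    share_with_contact = sum(1 for c in counts if c >= 1) / n
    return mean_contacts, share_with_contact


summary = {variant: summarise(c) for variant, c in contacts_per_user.items()}

for variant, (mean_contacts, share) in summary.items():
    print(variant, mean_contacts, share)
```

In this toy example a single heavy contacter doubles the mean in one group while the share of users contacting CS is the same in both, which is why the two views are worth reading together.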

In [74]:
df = pd.read_csv("email_test.csv")

# go down to user level.
df = df.loc[df["user_id"].isna() == False, :]
df.loc[df["total_cs"].isna(), "total_cs"] = 0

df = df.groupby(["se_property", "user_id"])["total_cs"].agg("max").reset_index()
df = df.groupby(["se_property"])["total_cs"].agg("mean").reset_index()

alt.Chart(df).mark_bar().encode(
    x=alt.X("se_property:N", axis=alt.Axis(title="Group")),
    y=alt.Y("total_cs:Q", axis=alt.Axis(title="CS contacts per customer", format="%")),
    color="se_property",
).properties(
    title="CS contacts per customer (signup and KYC)", width=400, height=400
).display(
    renderer="svg"
)

In [69]:
df.head()

Unnamed: 0,se_property,total_cs
0,email_suggestions,0.125483
1,email_suggestions_control,0.120628


In [97]:
df = pd.read_csv("email_test.csv")

# go down to user level.
df = df.loc[df["user_id"].isna() == False, :]
df.loc[df["total_cs"].isna(), "total_cs"] = 0

df = df.groupby(["se_property", "user_id"])["total_cs"].agg("max").reset_index()
df.loc[df["total_cs"] >= 1, "total_cs"] = 1

df = df.groupby(["se_property"])["total_cs"].agg(["count", "sum"]).reset_index()

# run z test. (two sided)
stat, pval = proportions_ztest(df["sum"], df["count"])

print(
    "The z-score for this test is %s which corresponds to a p-value of %s"
    % (round(stat, 2), round(pval, 4))
)

if pval < 0.05:
    print("The difference is significant.")
else:
    print("The difference is not significant.")

The z-score for this test is 0.67 which corresponds to a p-value of 0.5041
The difference is not significant.
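
A non-significant result alone does not say how large an effect could still be hidden in the noise. One way to make that visible is a confidence interval for the difference in shares. The sketch below uses an unpooled Wald interval, which is an assumption on our side rather than something computed in this notebook; the function name and the counts are hypothetical.

```python
import math


def diff_confidence_interval(successes_a, n_a, successes_b, n_b, z_crit=1.959964):
    """Approximate 95% Wald interval for p_a - p_b (unpooled standard error)."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_a - p_b
    return diff - z_crit * se, diff + z_crit * se


# Hypothetical counts, for illustration only.
low, high = diff_confidence_interval(1250, 10000, 1200, 10000)
print("95%% CI for the difference: [%.4f, %.4f]" % (low, high))
```

If the interval contains zero, as in this toy example, the data are compatible with both a small increase and a small decrease, which is consistent with giving no clear recommendation.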


In [98]:
df = pd.read_csv("email_test.csv")

# go down to user level.
df = df.loc[df["user_id"].isna() == False, :]

df.loc[df["signup_cs"].isna(), "signup_cs"] = 0

df = df.groupby(["se_property", "user_id"])["signup_cs"].agg("max").reset_index()
df = df.groupby(["se_property"])["signup_cs"].agg("mean").reset_index()

alt.Chart(df).mark_bar().encode(
    x=alt.X("se_property:N", axis=alt.Axis(title="Group")),
    y=alt.Y("signup_cs:Q", axis=alt.Axis(title="CS contacts per customer", format="%")),
    color="se_property",
).properties(
    title="CS contacts per customer (signup CS contacts)", width=400, height=400
).display(
    renderer="svg"
)

In [71]:
df.head()

Unnamed: 0,se_property,signup_cs
0,email_suggestions,0.030544
1,email_suggestions_control,0.028915


In [99]:
df = pd.read_csv("email_test.csv")

# go down to user level.
df = df.loc[df["user_id"].isna() == False, :]

df.loc[df["kyc_cs"].isna(), "kyc_cs"] = 0

df = df.groupby(["se_property", "user_id"])["kyc_cs"].agg("max").reset_index()
df = df.groupby(["se_property"])["kyc_cs"].agg("mean").reset_index()

alt.Chart(df).mark_bar().encode(
    x=alt.X("se_property:N", axis=alt.Axis(title="Group")),
    y=alt.Y("kyc_cs:Q", axis=alt.Axis(title="CS contacts per customer", format="%")),
    color="se_property",
).properties(
    title="CS contacts per customer (KYC CS contacts)", width=400, height=400
).display(
    renderer="svg"
)