title: Matching Studies - Apple Pay incentives at onboarding

author: Brieuc Van Thienen

date: 2023-03-06

region: EU

tags: onboarding, digital card, apple pay transactions, user matching, incentives, retention, engagement, causal inference, net contribution, revenues

summary: N26 is contractually obliged towards Apple to incentivise its users to use the mobile wallet (Apple Pay), with an incentive of at least 5 euros. The Growth and Product teams decided to incentivise digital card users because, without the mobile wallet, these users cannot spend at POS. We therefore believe that tokenising the card and making a first Apple Pay transaction could be meaningful actions during onboarding. This analysis looks at the difference in profitability (net contribution over the first 6 months) attributable to Apple Pay usage during the onboarding period. The study could not demonstrate any meaningful difference between users who did and did not use Apple Pay. The overlap in propensity scores between the two groups is not optimal, even after excluding outliers (e.g. users with fewer than 3 transactions in the first 35 days, users with extremely high deposits, etc.). We therefore suggested that the teams stick to the 5 euros specified in the contract and not go above.

<br>
<br>
<br>
<br>
<br>
<br>

### Dataset

In [None]:
query = """
with users as (
    select
        u.user_created,
        u.user_id,
        u.kyc_first_completed::date,
        dateadd(day, 35, kyc_first_completed::date)::date as kycc_35d_date,
        u.tnc_country_group,
        u.age_group,
        u.gender,
        u.is_expat,
        p.product_id as product_start,
        p2.product_id as product_end,
        coalesce(cs.cs_contacts_35d, 0) as cs_contacts_35d,
        floor(balance_eur_max::float / 10) * 10 as balance_eur_max,
        n_act_total_35d,
        n_aa_35d,
        floor(amount_cents_aa_35d::float / 100 / 10) * 10 as eur_aa_35d,
        floor(amount_cents_ct_35d::float / 100 / 10) * 10 as eur_ct_35d,
        floor(amount_cents_act_total_35d::float / 100 / 10) * 10 as eur_act_total_35d,
        (n_card_apple_35d > 0) as used_apple
    from
        dbt.zrh_users u
    inner join
        dwh_earlycluster_labels_1month ec using (user_created)
    inner join
        (select user_created, max(balance_eur) as balance_eur_max from dbt.crm_segmentation_onboarding group by 1) ob using (user_created)
    inner join
        (select
            user_created,
            sum(amount_cents_ct) as amount_cents_ct_35d,
            sum(n_act_total) as n_act_total_35d,
            sum(n_aa) as n_aa_35d,
            sum(amount_cents_act_total) as amount_cents_act_total_35d, -- average
            sum(amount_cents_aa) as amount_cents_aa_35d,
            sum(n_card_apple) as n_card_apple_35d
        from
            dbt.zrh_users u
        inner join
            dbt.zrh_txn_day t using (user_created)
        where
            t.txn_date between u.kyc_first_completed::date and dateadd(day, 35, kyc_first_completed::date)
        group by 1
        ) act using (user_created)
        left join
        (select
            u.user_created,
            count(1) as cs_contacts_35d
        from
            dbt.stg_users u
        inner join
            dbt.sf_all_contacts c using (user_id)
        where
            1=1
            and kyc_first_completed is not null
            and initiated_date between kyc_first_completed and dateadd(day, 36, kyc_first_completed::date)
            and c_level_report is true
        group by 1) cs using (user_created)
    inner join
        (select distinct user_created from dbt.stg_logins where platform = 2) app using (user_created)
    inner join
        dbt.zrh_user_product p on u.user_created = p.user_created
            and p.enter_reason = 'SIGNUP'
            and p.product_id in ('STANDARD', 'BUSINESS_CARD')
    inner join
        dbt.zrh_user_product p2 on u.user_created = p2.user_created
            and dateadd(day, 35, kyc_first_completed::date) between p2.subscription_valid_from and p2.subscription_valid_until
            and p2.product_id = p.product_id
    left join -- all distinct users with a physical card
        (select distinct user_created from dbt.zrh_users u inner join dbt.zrh_cards c using (user_created) where order_date between u.kyc_first_completed::date and dateadd(day, 35, kyc_first_completed::date) and is_digital is false) pc on pc.user_created = u.user_created
    where
        pc.user_created is null
        and kyc_first_completed between dateadd(month, -12, current_date) and dateadd(month, -7, current_date)
        and kyc_first_completed < dateadd(day, 7, u.user_created)

), net_contribution as (

    select
        u.user_created,
        sum(case when product_group = 'Treasury' then value / 100 else 0 end) as rev_6m_treasury,
        sum(case when type = 'Revenue' then value / 100 else 0 end) as rev_6m,
        sum(coalesce(value,0) / 100) as nc1_6m
    from
        users u
    inner join
        dbt.ucm_pnl pnl on u.user_created = pnl.user_created
            and to_date(pnl.month, 'YYYY-MM') between date_trunc('month', kyc_first_completed::date) and date_trunc('month', dateadd(month, 5, kyc_first_completed)::date)
    inner join
        dbt.ucm_mapping m using (label)
    where
        type in ('Revenue', 'Direct', 'Variable')
        and product_group not in ('Banking') -- would by definition be outliers
    group by 1

)

select
    u.*,
    nc.nc1_6m,
    nc.rev_6m
from
    users u
inner join
    net_contribution nc using (user_created)
where
    n_act_total_35d > 0
order by 1
"""

In [None]:
# !pip3 install seaborn

In [None]:
# !pip3 install stats

In [None]:
from scipy.stats import zscore

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns

In [None]:
df = pd.read_csv("DASD2052_dataset.csv")

In [None]:
df.head()

In [None]:
df.describe()

In [None]:
df.isnull().sum()

In [None]:
df = df.fillna(0)

In [None]:
df.head()

In [None]:
# from scipy.stats import zscore

# removing outliers, especially driven by treasury
df = df.loc[df["balance_eur_max"] <= 10000, :]
df = df.loc[df["eur_act_total_35d"] <= 10000, :]
df = df.loc[(df["n_act_total_35d"] >= 3) & (df["n_act_total_35d"] <= 250), :]
# df = df[(np.abs(zscore(df['nc1_6m'])) < 3)]

In [None]:
sns.ecdfplot(data=df, x="n_act_total_35d", hue="used_apple")

In [None]:
sns.ecdfplot(data=df, x="nc1_6m", hue="used_apple")

In [None]:
df.describe()

<br>
<br>
<br>
<br>
<br>
<br>

### Naive comparison

In [None]:
df.groupby(["used_apple"])["nc1_6m"].describe()

In [None]:
# separate control and treatment for t-test
df_control = df.loc[~df["used_apple"]]
df_treatment = df.loc[df["used_apple"]]

In [None]:
from scipy.stats import ttest_ind

print(df_control.nc1_6m.mean(), df_treatment.nc1_6m.mean())

# compare samples
_, p = ttest_ind(df_control["nc1_6m"], df_treatment["nc1_6m"])
print(f"p={p:.3f}")

# interpret: the t-test compares group means, not full distributions
alpha = 0.05  # significance level
if p > alpha:
    print("fail to reject H0: not enough evidence that the group means of nc1_6m differ")
else:
    print("reject H0: the group means of nc1_6m differ")
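
The test above uses SciPy's default pooled-variance t-test. Because the two groups can differ in size and spread, a Welch-style statistic is one possible cross-check. The sketch below is illustrative only: `welch_t` is a hypothetical helper and the arrays are toy numbers, not study data.

```python
import numpy as np


def welch_t(control, treatment):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    c = np.asarray(control, dtype=float)
    t = np.asarray(treatment, dtype=float)
    # variance of each sample mean (unbiased variance divided by group size)
    vc = c.var(ddof=1) / c.size
    vt = t.var(ddof=1) / t.size
    t_stat = (t.mean() - c.mean()) / np.sqrt(vc + vt)
    dof = (vc + vt) ** 2 / (vc**2 / (c.size - 1) + vt**2 / (t.size - 1))
    return t_stat, dof


# toy numbers, NOT taken from the study
t_stat, dof = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(f"t={t_stat:.3f}, dof={dof:.2f}")
```

With SciPy, passing `equal_var=False` to `ttest_ind` runs the Welch variant and also returns a p-value.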

<br>
<br>
<br>
<br>
<br>
<br>


### Model Preparation

---
<br>
<br>



In [None]:
user_features = ["user_created", "kyc_first_completed", "kycc_35d_date"]
categorical_features = [
    "product_start",
    # "product_end",
    "tnc_country_group",
    "is_expat",
    "age_group",
    "gender",
    "used_apple",
]

transaction_count_features = [
    "n_act_total_35d",
    "n_aa_35d",
    "cs_contacts_35d",
]

transaction_amount_features = [
    # 'eur_aa_35d',
    # 'eur_ct_35d',
    "balance_eur_max",
]

target_variable = ["nc1_6m"]

In [None]:
df = df[
    user_features
    + categorical_features
    + transaction_count_features
    + transaction_amount_features
    + target_variable
]

In [None]:
df = pd.get_dummies(df, columns=categorical_features, drop_first=True)

print(df.shape)

In [None]:
# onehotencoded_features
product_start_features = [f for f in df.columns if "product_start" in f]
# product_end_features = [f for f in df.columns if "product_end" in f]
country_features = [f for f in df.columns if "country" in f]
expat_features = [f for f in df.columns if "expat" in f]
age_features = [f for f in df.columns if "age" in f]
gender_features = [f for f in df.columns if "gender" in f]
used_apple_features = [f for f in df.columns if "used_apple" in f]

ohe_features = (
    product_start_features
    # + product_end_features
    + country_features
    + expat_features
    # + age_features
    # + gender_features
    # + names_features
    + used_apple_features
)

In [None]:
all_features_target = (
    ohe_features
    + transaction_count_features
    + transaction_amount_features
    + target_variable
)

In [None]:
# !pip3 install scikit-learn

In [None]:
# Preprocessing pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler, RobustScaler
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer

In [None]:
df = df[all_features_target].reset_index(drop=True)

In [None]:
df.head()

In [None]:
preprocessor = make_column_transformer(
    (StandardScaler(), transaction_count_features),
    (RobustScaler(), transaction_amount_features),
    remainder="passthrough",
)
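
The preprocessor above scales the count features with `StandardScaler` and the amount feature with `RobustScaler`. As a rough illustration of why the amount feature gets the robust treatment, the sketch below re-implements what the two scalers compute with their default settings. `standard_scale` and `robust_scale` are hypothetical helper names, and the input is a toy array rather than study data.

```python
import numpy as np


def standard_scale(x):
    """(x - mean) / std, using the population standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()


def robust_scale(x):
    """(x - median) / IQR, where IQR is the 75th minus the 25th percentile."""
    x = np.asarray(x, dtype=float)
    q25, q50, q75 = np.percentile(x, [25, 50, 75])
    return (x - q50) / (q75 - q25)


toy = [1, 2, 3, 4, 100]  # one extreme value, NOT study data
print(standard_scale(toy))
print(robust_scale(toy))
```

The single extreme value shifts the mean and inflates the standard deviation, so the standard-scaled typical values end up squeezed together, while the median and IQR used by the robust version are barely affected by it.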

In [None]:
df_fittransform = pd.DataFrame(
    preprocessor.fit_transform(
        df[transaction_count_features + transaction_amount_features + target_variable]
        # df[transaction_amount_features + target_variable]
    ),
    columns=df[
        transaction_count_features + transaction_amount_features + target_variable
    ].columns,
    # columns=df[transaction_amount_features + target_variable].columns,
)

In [None]:
df_fittransform.head()

In [None]:
df = df.loc[
    :,
    ~df.columns.isin(
        transaction_count_features + transaction_amount_features + target_variable
    ),
].join(df_fittransform)

In [None]:
df.head()

<br>
<br>
<br>
<br>
<br>
<br>

### Propensity score matching - matching users based on their likelihood to use Apple Pay

---

<br>
<br>

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
t = "used_apple_True"
y = "nc1_6m"
x = [f for f in df.columns if f not in (t, y)]  # all columns except treatment and outcome

In [None]:
x

In [None]:
ps_scores_model = LogisticRegression(max_iter=200).fit(df[x], df["used_apple_True"])

In [None]:
ps_scores = df.assign(propensity_score=ps_scores_model.predict_proba(df[x])[:, 1])
df_final = ps_scores[["propensity_score", "used_apple_True", "nc1_6m"]]

In [None]:
df_final.head()

In [None]:
# Plot the propensity score distribution to check that the two groups overlap sufficiently
sns.histplot(data=df_final, x="propensity_score", hue="used_apple_True", bins=10)
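
The histogram gives a visual impression of overlap. A simple numeric complement, sketched below, is the common-support interval (the range of propensity scores covered by both groups) and the share of users falling inside it. `common_support` is a hypothetical helper and the scores are toy values, not study data.

```python
import numpy as np


def common_support(ps, treated):
    """Return (lower bound, upper bound, share of units inside the bounds)."""
    ps = np.asarray(ps, dtype=float)
    treated = np.asarray(treated).astype(bool)
    # the interval covered by BOTH groups
    lo = max(ps[treated].min(), ps[~treated].min())
    hi = min(ps[treated].max(), ps[~treated].max())
    inside = (ps >= lo) & (ps <= hi)
    return lo, hi, inside.mean()


# toy scores, NOT study data: first four are controls, last four are treated
toy_ps = [0.1, 0.2, 0.3, 0.5, 0.3, 0.4, 0.6, 0.8]
toy_treated = [0, 0, 0, 0, 1, 1, 1, 1]
lo, hi, share = common_support(toy_ps, toy_treated)
print(lo, hi, share)
```

On the notebook's data this could be called with `df_final["propensity_score"]` and `df_final["used_apple_True"]`; a low share would be one way to quantify the weak overlap noted in the summary.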

<br>
<br>
<br>
<br>
<br>
<br>

### OLS estimation - Regressing NC1 on the propensity score and the Apple Pay usage boolean

---

<br>
<br>

In [None]:
# !pip3 install statsmodels

In [None]:
# OLS estimation - regressing NC1 on the propensity score and the Apple Pay usage boolean, focusing on the coefficient of used_apple_True
import statsmodels.formula.api as smf

In [None]:
smf.ols(
    "nc1_6m ~ used_apple_True + propensity_score", data=df_final
).fit().summary().tables[1]
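
To make the role of the treatment coefficient concrete, the sketch below fits the same kind of model (outcome on an intercept, the treatment indicator and the propensity score) with plain least squares. The data are toy values built with a known treatment effect of 3, not study data, and all variable names are illustrative.

```python
import numpy as np

# toy data, NOT study data: outcome constructed with a known treatment effect of 3
d = np.array([0, 0, 0, 1, 1, 1], dtype=float)  # treatment indicator
ps = np.array([0.2, 0.3, 0.5, 0.4, 0.6, 0.7])  # propensity score
y = 2.0 + 3.0 * d + 5.0 * ps

# design matrix with columns: intercept, treatment indicator, propensity score
X = np.column_stack([np.ones_like(d), d, ps])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, effect, ps_slope = coef
print(effect)
```

The coefficient on the treatment indicator is read as the difference in outcome between the groups once the propensity score is held fixed, which is how the `used_apple_True` row of the summary table above is interpreted.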

<br>
<br>
<br>
<br>
<br>
<br>

### Causal inference

---

<br>
<br>

In [None]:
# !pip3 install causalinference

In [None]:
# Using the Python package based on the propensity score method to directly get the ATE

from causalinference import CausalModel

cm = CausalModel(
    Y=df_final["nc1_6m"].values,
    D=df_final["used_apple_True"].values,
    X=df_final[["propensity_score"]].values,
)

cm.est_via_matching(matches=1, bias_adj=True)

print(cm.estimates)
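
As an intuition for what a matching estimator does, the sketch below pairs every user with the nearest user of the opposite group on the propensity score and averages the outcome differences. It is a deliberately simplified version (1:1 nearest neighbour with replacement, a single covariate, no bias adjustment), not the package's implementation. `matching_ate` is a hypothetical helper and the inputs are toy values, not study data.

```python
import numpy as np


def matching_ate(y, d, ps):
    """Average treatment effect from 1:1 nearest-neighbour matching on ps."""
    y = np.asarray(y, dtype=float)
    d = np.asarray(d).astype(bool)
    ps = np.asarray(ps, dtype=float)
    t_idx = np.flatnonzero(d)
    c_idx = np.flatnonzero(~d)
    # absolute score distance between every treated (rows) and control (columns) unit
    dist = np.abs(ps[t_idx][:, None] - ps[c_idx][None, :])
    nearest_c = c_idx[dist.argmin(axis=1)]  # closest control for each treated unit
    nearest_t = t_idx[dist.argmin(axis=0)]  # closest treated for each control unit
    effects_t = y[t_idx] - y[nearest_c]
    effects_c = y[nearest_t] - y[c_idx]
    return np.concatenate([effects_t, effects_c]).mean()


# toy data, NOT study data
ate = matching_ate(
    y=[1.0, 2.0, 3.0, 5.0],
    d=[0, 0, 1, 1],
    ps=[0.2, 0.5, 0.25, 0.55],
)
print(ate)
```

When overlap is poor, many units are matched to distant neighbours, which is one reason the estimates in this study should be read with caution.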