title: Matching Studies - Cuenta nomina initiative December 2021

author: Brieuc Van Thienen

date: 2023-03-29

region: EU

tags: salary, user matching, cuenta nomina, incentives, retention, engagement, causal inference, net contribution, revenues

summary: In December 2021, customers were offered 2 years of free Smart membership and 5 euros of cashback for 12 consecutive months, in exchange for topping up at least 500 euros (in one or several transactions) each month. Roughly 2% of the targeted users redeemed the coupon and converted. For those users, greater deposits and potentially greater engagement would be expected to increase net contribution over that period. The goal of this analysis is to estimate the difference in net contribution over the subsequent 12 months that is attributable to the initiative, by matching users who redeemed the coupon code with users who did not, and to calculate an ROI for the initiative. The matching analysis notably controls for the overall transactional activity of users and the average daily amount held in deposits over the 30 days prior to the initiative. Two models are used: a linear regression and a causal model. While both models show that the users who converted brought 15 to 20 euros more in revenues, an impact on net contribution could not be demonstrated, primarily due to greater customer service costs.



<br>
<br>
<br>
<br>
<br>
<br>

### Dataset

In [3]:
query = """
    with user_activity as ( -- engagement state data as of the date of the campaign
        select
            ua.user_created,
            ua.activity_start,
            ua.activity_end,
            ua.names,
            row_number() over (partition by ua.user_created order by activity_start desc) = 1 as last_session,
            datediff('month', least('2021-12-14'::date, ua.activity_end::date), '2021-12-14'::date) as months_since_last_activity,
            -- frequency score calcs
            datediff('day', ua.activity_start::date, least('2021-12-14', ua.activity_end::date)) as days_in_session,
            sum(days_in_session) over (partition by ua.user_created order by ua.activity_start rows unbounded preceding) as days_in_session_cumulative,
            datediff('day', kyc_first_completed::date, least('2021-12-14', ua.activity_end::date)) as days_since_kyc,
            round(least(days_in_session_cumulative::float / nullif(days_since_kyc,0), 1), 1) as frequency_score,
            -- physical card
            no_physical_cards_flg
        from
            dbt.mktg_crm_lapses ua
        where
            activity_start < '2021-12-14'::date -- take all sessions until email date.

    ), average_30d_balance as (
        select
            user_created,
            avg(balance_eur) as balance_eur_30d_avg
        from
            (select user_created, date, sum(balance_eur) as balance_eur from dbt.mmb_daily_balance_aud where date between dateadd(day, -30, '2021-12-14'::date) and '2021-12-14'::date group by 1,2) b
        group by 1

    ), use_cases as (
        select
            user_created,
            coalesce(use_case_signup, use_case_journey, 'OTHER') as use_case
        from
            dev_dbt.mktg_crm_usecase_signup
        left join
            (select *, row_number() over (partition by user_created order by period_id desc) = 1 as last_row from dev_dbt.mktg_crm_usecase_journey where activity_start < '2021-12-14'::date) using (user_created)
        where
            last_row is true

    ), coupon_users as (
        select
            *
        from
            u_user_coupon uc
        inner join
            u_coupon c ON c.id = uc.coupon_id
        inner join
            u_campaign ca ON ca.id = c.campaign_id
        where
            1=1
            and ca.external = 0
            and ca.name = 'ES SAU - Cuenta Nomina'
            and ca.valid_from between '2021-08-01' and '2022-01-01'

    ), email_users as (
        select
            u.user_id,
            datediff('month', u.kyc_first_completed::date, '2021-12-14'::date) as months_since_kycc,
            date_trunc('month', u.kyc_first_completed)::date as kycc_month,
            use_case,
            ua.*,
            u.legal_entity,
            uc.user_created is not null as redeemed_coupon,
            balance_eur_30d_avg
        from
            dbt.zrh_users u
        inner join
            (select * from user_activity where last_session is true) ua using (user_created)
        inner join
            use_cases c using (user_created)
        inner join
            (select * from dbt.mktg_crm_emails where 1=1 and campaign_id = '[D;20211214][C;acc][SC;other][N;]') e using (user_id)
        inner join
            average_30d_balance b using (user_created)
        left join
            (select user_created from dwh_analysis_user_blacklist) bl using (user_created)
        left join
            coupon_users uc using (user_created)
        where
           1=1
           -- user data
           and u.country_tnc_legal in ('ESP')
           and u.legal_entity in ('ES','EU')
           -- blacklist
           and bl.user_created is null

    )

    select
        user_id,
        user_created,
        legal_entity,
        no_physical_cards_flg,
        use_case,
        u.names,
        months_since_kycc,
        months_since_last_activity,
        frequency_score,
        balance_eur_30d_avg,
        redeemed_coupon,
        pnl.*
    from
        email_users u
    inner join
        (select
            user_created,
            sum(case when type in ('Revenue') then value::float / 100 else 0 end) as rev_12m,
            sum(value::float / 100) as nc1_12m,
            sum(case when product_group = 'Payments' then value::float / 100 else 0 end) as nc1_12m_payments,
            sum(case when product_group = 'Treasury' then value::float / 100 else 0 end) as nc1_12m_treasury,
            sum(case when product_group = 'Customer Service' then value::float / 100 else 0 end) as nc1_12m_cs,
            sum(case when product_group = 'ATM' then value::float / 100 else 0 end) as nc1_12m_atm
        from
            dbt.ucm_pnl
        inner join
            dbt.ucm_mapping using (label)
        where
            type in ('Revenue', 'Direct', 'Variable')
            and to_date(month,'YYYY-MM') between '2021-12-01' and dateadd(month, 11, '2021-12-01'::date)
        group by 1
        ) pnl using (user_created)
    order by u.user_created

"""

In [None]:
# !pip3 install seaborn

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns

In [None]:
df = pd.read_csv("DASD1881_dataset.csv")

In [None]:
df.groupby(["redeemed_coupon"])["rev_12m"].describe()

In [None]:
df.groupby(["redeemed_coupon"])["nc1_12m"].describe()

In [None]:
df.groupby(["redeemed_coupon"])["nc1_12m_payments"].describe()

In [None]:
df.groupby(["redeemed_coupon"])["nc1_12m_treasury"].describe()

In [None]:
df.groupby(["redeemed_coupon"])["nc1_12m_cs"].describe()

In [None]:
df.groupby(["redeemed_coupon"])["nc1_12m_atm"].describe()

In [None]:
df.describe()

In [None]:
df.isnull().sum()

In [None]:
df[["redeemed_coupon", "nc1_12m_treasury"]].sort_values(
    by="nc1_12m_treasury", ascending=False
).head(20).reset_index(drop=True).pivot(
    columns="redeemed_coupon", values="nc1_12m_treasury"
)

In [None]:
# removing outliers: drop all users with more than 1000 euros in treasury
df = df.loc[df["nc1_12m_treasury"] <= 1000, :].reset_index(drop=True)

<br>
<br>
<br>
<br>
<br>
<br>

### Naive comparison

In [None]:
# separate control and treatment for t-test
df_control = df.loc[~df["redeemed_coupon"]]
df_treatment = df.loc[df["redeemed_coupon"]]

net contribution 1

In [None]:
from scipy.stats import ttest_ind

print(df_control.nc1_12m.mean(), df_treatment.nc1_12m.mean())

# compare samples
_, p = ttest_ind(df_control["nc1_12m"], df_treatment["nc1_12m"])
print(f"p={p:.3f}")

# interpret
alpha = 0.05  # significance level
if p > alpha:
    print(
        "same distributions/same group mean (fail to reject H0 - we do not have enough evidence to reject H0)"
    )
else:
    print("different distributions/different group mean (reject H0)")

revenues

In [None]:
from scipy.stats import ttest_ind

print(df_control.rev_12m.mean(), df_treatment.rev_12m.mean())

# compare samples
_, p = ttest_ind(df_control["rev_12m"], df_treatment["rev_12m"])
print(f"p={p:.3f}")

# interpret
alpha = 0.05  # significance level
if p > alpha:
    print(
        "same distributions/same group mean (fail to reject H0 - we do not have enough evidence to reject H0)"
    )
else:
    print("different distributions/different group mean (reject H0)")
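
The two t-tests above use the default pooled-variance (Student) form of `ttest_ind`. Because only about 2% of users redeemed the coupon, the groups are very unbalanced and may have different variances, so Welch's variant (`equal_var=False`) is a reasonable cross-check. The sketch below runs both on synthetic stand-in arrays; the sample sizes, means and spreads are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Synthetic stand-ins for the control / treatment outcome columns (NOT real data):
# a large control group and a small, noisier treatment group.
control = rng.normal(loc=10.0, scale=30.0, size=5000)
treatment = rng.normal(loc=12.0, scale=60.0, size=100)

# Student's t-test (pooled variance), the scipy default
_, p_student = ttest_ind(control, treatment)

# Welch's t-test: does not assume equal variances in the two groups
_, p_welch = ttest_ind(control, treatment, equal_var=False)

print(f"Student p={p_student:.3f}, Welch p={p_welch:.3f}")
```

If the two p-values lead to different conclusions, the Welch result is usually the safer one to report for groups this unbalanced.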

<br>
<br>
<br>
<br>
<br>
<br>


### Model Preparation

---
<br>
<br>



In [None]:
df.columns

In [None]:
df = df.drop("user_created.1", axis=1)

In [None]:
user_features = ["user_id", "user_created"]
categorical_features = [
    "use_case",
    # "names",
    # "kycc_month",
    # "last_activity_end_month",
    "redeemed_coupon",
    "no_physical_cards_flg",
]
month_features = ["months_since_kycc", "months_since_last_activity"]
frequency_features = ["frequency_score"]
balance_features = ["balance_eur_30d_avg"]

target_variable = ["nc1_12m"]

In [None]:
df = df[
    user_features
    + categorical_features
    + month_features
    + frequency_features
    + balance_features
    + target_variable
]

In [None]:
df = pd.get_dummies(df, columns=categorical_features, drop_first=True)

print(df.shape)

In [None]:
# onehotencoded_features
# last_activity_end_month_features = [f for f in df.columns if "last_activity_end_month" in f]
# kycc_month_features = [f for f in df.columns if "kycc_month" in f]
use_case_features = [f for f in df.columns if "use_case" in f]
redeemed_coupon_features = [f for f in df.columns if "redeemed_coupon" in f]
card_features = [f for f in df.columns if "no_physical_cards_flg" in f]

ohe_features = (
    # last_activity_end_month_features
    # + kycc_month_features
    use_case_features
    + redeemed_coupon_features
    + card_features
)
ohe_features

In [None]:
all_features_target = (
    ohe_features
    + balance_features
    + month_features
    + frequency_features
    + target_variable
)

In [None]:
# !pip3 install scikit-learn

In [None]:
# Preprocessing pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler, RobustScaler
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer

# df_numerical_features = StandardScaler().fit_transform(df[numerical_features].values)

In [None]:
df = df[all_features_target].reset_index(drop=True)

In [None]:
all_features_target

In [None]:
preprocessor = make_column_transformer(
    # robust scaler
    (RobustScaler(), balance_features),
    (StandardScaler(), month_features),
    remainder="passthrough",
)

In [None]:
df_fittransform = pd.DataFrame(
    preprocessor.fit_transform(
        df[balance_features + month_features + frequency_features + target_variable]
    ),
    columns=df[
        balance_features + month_features + frequency_features + target_variable
    ].columns,
)

In [None]:
df_fittransform.head()

In [None]:
df = df.loc[
    :,
    ~df.columns.isin(
        balance_features + month_features + frequency_features + target_variable
    ),
].join(df_fittransform)

In [None]:
df.head()

<br>
<br>
<br>
<br>
<br>
<br>

### Propensity score matching - matching users based on their likelihood to redeem the coupon

---

<br>
<br>

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
t = "redeemed_coupon_True"
y = "nc1_12m"
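# all covariates except the treatment flag and the outcome
# (`f not in t + y` was a substring test against the concatenated string, which only worked by accident)
x = [f for f in df.columns if f not in (t, y)]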

In [None]:
ps_scores_model = LogisticRegression().fit(df[x], df["redeemed_coupon_True"])

In [None]:
ps_scores = df.assign(propensity_score=ps_scores_model.predict_proba(df[x])[:, 1])
df_final = ps_scores[["propensity_score", "redeemed_coupon_True", "nc1_12m"]]

In [None]:
df_final.head()

In [None]:
# Plot the propensity score distribution to check that the two groups overlap sufficiently
sns.displot(data=df_final, x="propensity_score", hue="redeemed_coupon_True")
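
The visual overlap check can be complemented with a numeric one: the range of propensity scores covered by both groups (the common support) and the share of each group that falls inside it. This is a minimal sketch on synthetic scores; the helper `common_support` and the simulated distributions are hypothetical and only stand in for `df_final`.

```python
import numpy as np
import pandas as pd


def common_support(ps, treated):
    """Return the (low, high) range of propensity scores present in both groups."""
    treated = treated.astype(bool)
    low = max(ps[treated].min(), ps[~treated].min())
    high = min(ps[treated].max(), ps[~treated].max())
    return low, high


rng = np.random.default_rng(1)
n = 2000

# Synthetic treatment flag and propensity scores (NOT real data)
treated = pd.Series(rng.random(n) < 0.1)
ps = pd.Series(np.where(treated, rng.beta(3, 5, n), rng.beta(2, 8, n)))

low, high = common_support(ps, treated)
inside = ps.between(low, high)

# Share of each group whose score lies within the common support
share = inside.groupby(treated).mean()
print(f"common support: [{low:.3f}, {high:.3f}]")
print(share)
```

A low share in either group would suggest that many users have no comparable counterpart, in which case the matched estimate rests on few observations.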

<br>
<br>
<br>
<br>
<br>
<br>

### OLS estimation - Regressing NC1 on the propensity score and coupon boolean

---

<br>
<br>

In [None]:
# !pip3 install statsmodels

In [None]:
# OLS estimation: regress NC1 on the propensity score and the coupon boolean, focusing on the coefficient of redeemed_coupon_True
import statsmodels.formula.api as smf

In [None]:
smf.ols(
    "nc1_12m ~ redeemed_coupon_True + propensity_score", data=df_final
).fit().summary().tables[1]
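
To see why the propensity score is included as a regressor, the sketch below simulates a confounded treatment with a known effect and compares the naive difference in means with the coefficient from regressing the outcome on the treatment flag and the propensity score. Everything here is an assumption made for illustration: the data are synthetic, the propensity score is linear in the confounder by construction (so a linear adjustment is sufficient), and plain `numpy` least squares is used instead of `statsmodels` to keep the example dependency-light.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 20_000

x = rng.uniform(0.0, 1.0, n)   # confounder (e.g. prior activity), synthetic
ps = 0.05 + 0.4 * x            # true propensity score, linear in x by construction
d = rng.random(n) < ps         # treatment flag: more active users convert more often
true_effect = 5.0
y = true_effect * d + 20.0 * x + rng.normal(0.0, 5.0, n)  # outcome

# Naive comparison: biased upwards because treated users have higher x
naive = y[d].mean() - y[~d].mean()

# OLS of y on [intercept, treatment, propensity score]
design = np.column_stack([np.ones(n), d.astype(float), ps])
coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
adjusted = coefs[1]  # coefficient on the treatment flag

print(f"naive difference: {naive:.2f}")
print(f"adjusted estimate: {adjusted:.2f} (true effect: {true_effect})")
```

In real data the relationship between the outcome and the propensity score need not be linear, so the adjusted coefficient is only as good as that modelling assumption.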

<br>
<br>
<br>
<br>
<br>
<br>

### Causal inference

---

<br>
<br>

In [None]:
# !pip3 install causalinference

In [None]:
# Using the Python package based on the propensity score method to directly get the ATE

from causalinference import CausalModel

cm = CausalModel(
    Y=df_final["nc1_12m"].values,
    D=df_final["redeemed_coupon_True"].values,
    X=df_final[["propensity_score"]].values,
)

cm.est_via_matching(matches=1, bias_adj=True)

print(cm.estimates)
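
The idea behind matching on the propensity score can be illustrated without the package. The sketch below pairs each treated unit with the control whose propensity score is closest (with replacement) and averages the outcome differences, which gives an estimate of the effect on the treated. It is a simplified illustration on synthetic data with no bias adjustment, not a reproduction of what `causalinference` computes; the function `att_nearest_neighbor` is a hypothetical helper.

```python
import numpy as np


def att_nearest_neighbor(y, d, ps):
    """Effect on the treated via 1-nearest-neighbour matching (with replacement) on ps."""
    y = np.asarray(y, dtype=float)
    d = np.asarray(d, dtype=bool)
    ps = np.asarray(ps, dtype=float)

    # Sort controls by propensity score so neighbours can be found by binary search
    order = np.argsort(ps[~d])
    ps_c, y_c = ps[~d][order], y[~d][order]

    # Candidate neighbours: the control just above and just below each treated score
    right = np.clip(np.searchsorted(ps_c, ps[d]), 0, len(ps_c) - 1)
    left = np.clip(right - 1, 0, len(ps_c) - 1)
    use_left = np.abs(ps[d] - ps_c[left]) <= np.abs(ps[d] - ps_c[right])
    nearest = np.where(use_left, left, right)

    # Average difference between each treated outcome and its matched control outcome
    return float(np.mean(y[d] - y_c[nearest]))


rng = np.random.default_rng(7)
n = 20_000
x = rng.uniform(0.0, 1.0, n)   # synthetic confounder
ps = 0.05 + 0.4 * x            # synthetic propensity score
d = rng.random(n) < ps         # treatment flag
true_effect = 5.0
y = true_effect * d + 20.0 * x + rng.normal(0.0, 5.0, n)

att = att_nearest_neighbor(y, d, ps)
print(f"matched estimate: {att:.2f} (true effect: {true_effect})")
```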