title: LTV Match study - Looking back on the Veepee Campaign
author: Brieuc Van Thienen     
date: 2021-03-19      
region: EU         
summary: To assess the success of the Veepee campaign, we addressed the following questions: How does the distribution of profits compare for Veepee users vs. non-Veepee users who signed up over the same period in France (Oct 4th - Oct 21st 2020)? Can we establish that users from the Veepee campaign are more / less profitable on average than our control group? What is the estimated treatment effect, in euros, that could be attributed to the Veepee campaign? How does the predicted LTV / CAC of the Veepee campaign compare to that of France and the affiliate channel in the month of October? To answer the second question, we perform a "match study" on our observational data (that is, data that did not come from a proper A/B test setup), borrowing concepts from causal inference. A match study compares the predicted profit of each user in the treatment group (here, the Veepee campaign) to that of up to N users from the control group. The control group is a subset of non-Veepee users from the same period who share the same characteristics: gender, country, kyc_month, channel, first_product, signup_coupon, and marketing_in_app_update_opt_in. These covariates are known to affect predicted LTV in general, so we control for them to isolate the treatment effect attributable to the Veepee campaign.
tags: growth, marketing, ad-hoc campaign, marketing campaign, ltv, match study, ltv / cac, veepee, france
    

In [None]:
# `warnings` is part of the Python standard library, so no pip install is needed.
import warnings

warnings.filterwarnings("ignore")

In [None]:
import math
import random
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

!pip install seaborn
import seaborn as sns

sns.set_style("whitegrid")
from utils.datalib_database import df_from_sql



## Looking back on the first wave of the Veepee campaign

---

To assess the success of the Veepee campaign, we addressed the following questions:

1. How does the distribution of profits compare for Veepee users vs. non-Veepee users who signed up over the same period in France (Oct 4th - Oct 21st 2020)?


2. Can we establish that users from the Veepee campaign are more / less profitable on average than our control group? What is the estimated treatment effect, in Euros, that could be attributed to the Veepee campaign?


3. How does the predicted LTV / CAC of the Veepee campaign compare to that of France and the affiliate channel in the month of October?

To answer question 2, we perform a "match study" on our observational data (that is, data that did not come from a proper A/B test setup), borrowing concepts from causal inference. A match study compares the predicted profit of each user in the treatment group (here, the Veepee campaign) to that of up to N users from the control group. The control group is a subset of non-Veepee users from the same period who share the same characteristics: gender, country, kyc_month, channel, first_product, signup_coupon, and marketing_in_app_update_opt_in. These covariates are known to affect predicted LTV in general, so we control for them to isolate the treatment effect attributable to the Veepee campaign.
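
The matching idea described above can be sketched on toy data. This is a minimal illustration, not the notebook's implementation: the data values, the `user` and `profit` columns, and the reduced covariate list are made up, and the sketch pairs each treated user with all matching controls through a merge, whereas the notebook samples at most N controls per treated user without reuse.

```python
import pandas as pd

# Toy data: values and the reduced covariate list are hypothetical.
toy = pd.DataFrame(
    {
        "user": ["t1", "t2", "c1", "c2", "c3", "c4"],
        "veepee_flg": [True, True, False, False, False, False],
        "gender": ["F", "M", "F", "F", "M", "F"],
        "first_product": ["METAL", "STANDARD", "METAL", "METAL", "STANDARD", "STANDARD"],
        "profit": [80.0, 20.0, 60.0, 70.0, 25.0, 10.0],
    }
)
covariate_cols = ["gender", "first_product"]

treated = toy[toy["veepee_flg"]]
controls = toy[~toy["veepee_flg"]]

# Pair every treated user with all controls that share the same covariate values.
pairs = treated.merge(controls, on=covariate_cols, suffixes=("_t", "_c"))

# Per-user effect: own profit minus the mean profit of the matched controls.
pairs["diff"] = pairs["profit_t"] - pairs["profit_c"]
per_user_effect = pairs.groupby("user_t")["diff"].mean()
print(per_user_effect)
```

Control `c4` has no treated counterpart with the same covariates, so it simply drops out of the comparison, which is the point of controlling for covariates.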


In [None]:
query = """
    with veepee as (
        select * from cmd_shadow_user
        where id in (
            -- insert shadow user ids
            )
    )

    select  
        a.user_created,
        b.user_created is not null veepee_flg,
        a.gender,
        a.last_click_source,
        a.first_product,
        a.signup_coupon,
        regexp_replace(a.country_tnc_legal, '\\s', '') country_tnc_legal,
        a.kyc_first_completed,
        date_trunc('month', a.kyc_first_completed) kyc_month,
        a.newsletter_opt_in,
        a.marketing_in_app_update_opt_in,
        a.signup_kycc_diff,
        p."2y_profit_forecast" - onboarding_costs as "2y_profit_forecast"
    from 
        dwh_mktg_ltv_predictions p
    inner join 
        dbt.stg_ltv_kyc a using (user_created)
    left join 
        veepee b using (user_created)
    left join 
        (select user_created, -1 * sum(value)::float / 100 as onboarding_costs from dbt.ucm_pnl left join dbt.ucm_mapping using (label) where ltv_flg_onboarding = 1 group by 1) ob using (user_created)
    where 
        p.model_version = 'v0.1' 
        and p.user_created between '2020-10-04' and '2020-10-21'
        and country_tnc_legal = 'FRA'
        and kyc_first_completed < user_created + interval '7 days'
"""

In [None]:
df = df_from_sql("redshiftreader", query)

## 1. Distribution of profits - Veepee vs. non-Veepee users

---

In [None]:
df_product = df.groupby(["veepee_flg", "first_product"])["user_created"].count()

The Veepee campaign heavily advertised the Metal product. This is reflected in the results: 35% and 13% of Veepee KYC 7d users opted for the Metal and Business Metal products respectively, vs. 7% and 4% in the control group.

In [None]:
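# Share of each first product within each group (Veepee vs. non-Veepee), in percent.
# `transform` keeps the original index, which avoids the duplicated group level
# that `groupby(...).apply(...)` adds in recent pandas versions.
df_perc = 100 * df_product / df_product.groupby(level=0).transform("sum")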
df_perc.reset_index().rename({"user_created": "cross_sell"}, axis=1).pivot(
    index="first_product", columns="veepee_flg", values="cross_sell"
)
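
The same kind of product-share table can also be built with `pd.crosstab`. The sketch below uses made-up rows and hypothetical product labels; only the column names `veepee_flg` and `first_product` come from the notebook.

```python
import pandas as pd

# Toy rows: the product labels and counts are hypothetical.
toy = pd.DataFrame(
    {
        "veepee_flg": [True, True, True, True, False, False, False, False],
        "first_product": [
            "METAL", "METAL", "STANDARD", "BUSINESS_METAL",
            "STANDARD", "STANDARD", "STANDARD", "METAL",
        ],
    }
)

# Readable group labels instead of True / False column headers.
group = toy["veepee_flg"].map({True: "veepee", False: "non-veepee"})

# normalize="columns" divides each column by its total, so every group sums to 100%.
shares = 100 * pd.crosstab(toy["first_product"], group, normalize="columns")
print(shares)
```

Each column of `shares` sums to 100, so the rows can be read directly as the product mix within each group.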

The distribution of predicted profits for Veepee users is more skewed to the right, reflecting the greater percentage of Metal users in the treatment group.

In [None]:
fig, ax = plt.subplots(figsize=(12, 6))

# `sns.distplot` is deprecated; `sns.kdeplot` draws the same density curve.
sns.kdeplot(
    df[df.veepee_flg == True]["2y_profit_forecast"], label="veepee", ax=ax
)
sns.kdeplot(
    df[df.veepee_flg == False]["2y_profit_forecast"], label="non-veepee", ax=ax
)
plt.legend()

ax.set(ylim=(0, 0.03))

In [None]:
sns.displot(
    data=df,
    x="2y_profit_forecast",
    hue="veepee_flg",
    kind="ecdf",
    height=6,
    aspect=1.8 / 1,
)

## 2. Match study - Veepee users vs. control group (subset of non-Veepee users)

---

In [None]:
users_treatment = df[df.veepee_flg == True].user_created.tolist()
covariates = df[df.veepee_flg == True].set_index("user_created").to_dict()

In [None]:
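n = 20  # maximum number of control users matched to each treated user
r = 100  # number of random matching repetitions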

treatment_effects = []

# Controls are matched without reuse, so the order of the treated users matters.
# Each repetition reshuffles them so that no user is systematically left unmatched.
for _ in range(r):
    used = set()
    effects = []
    sizes = []

    for u in random.sample(users_treatment, len(users_treatment)):
        d = df[
            (df.gender == covariates["gender"][u])
            & (df.country_tnc_legal == covariates["country_tnc_legal"][u])
            & (df.kyc_month == covariates["kyc_month"][u])
            & (df.last_click_source == covariates["last_click_source"][u])
            & (df.first_product == covariates["first_product"][u])
            & (df.signup_coupon == covariates["signup_coupon"][u])
            & (
                df.marketing_in_app_update_opt_in
                == covariates["marketing_in_app_update_opt_in"][u]
            )
            & (df.veepee_flg == False)
            & (~df.user_created.isin(used))
        ]

        if d.shape[0] > n:
            control = d.sample(n)
        else:
            control = d

        used = used.union(control.user_created)

        if control.shape[0] == 0:
            continue

        sample_te = (
            covariates["2y_profit_forecast"][u] - control["2y_profit_forecast"]
        ).mean()
        sample_size = control.shape[0]

        effects.append(sample_te)
        sizes.append(sample_size)

    effects = np.array(effects)
    sizes = np.array(sizes)

    total = sizes.sum()

    # Size-weighted mean of the per-user effects for this repetition.
    treatment_effects.append((effects * (sizes / total)).sum())

**It can be established from the data that the treatment effect is positive**. The mean treatment effect is estimated at 15 euros, with the 5th and 95th percentile estimates at 10 and 21 euros across the 100 random matching repetitions. While we feel confident asserting that the treatment effect is positive, its size should be taken with a pinch of salt, as this is the first time such an analysis has been conducted.
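
To make the aggregation concrete, here is a small sketch with made-up numbers. Within one repetition, per-user effects are weighted by the number of matched controls, which is the same as `np.average` with `weights`. Across repetitions, the mean and percentiles summarise how the estimate varies with the random matching, so they should probably not be read as a sampling confidence interval.

```python
import numpy as np

# Hypothetical per-user effects (euros) and matched-control counts for one repetition.
effects = np.array([20.0, -5.0, 12.0])
sizes = np.array([20, 5, 10])

# Same formula as in the matching loop: weight each effect by its share of controls.
weighted = (effects * (sizes / sizes.sum())).sum()

# Equivalent NumPy one-liner.
weighted_np = np.average(effects, weights=sizes)

# Hypothetical estimates from several repetitions; NaN stands for a repetition
# in which a treated user had a missing profit forecast.
treatment_effects_toy = [14.0, 15.5, 16.0, np.nan, 15.0]
mean_te = np.nanmean(treatment_effects_toy)
p5, p95 = np.nanpercentile(treatment_effects_toy, [5, 95])
print(weighted, mean_te, p5, p95)
```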

In [None]:
# `sns.distplot` is deprecated; `sns.kdeplot` draws the same density curve.
sns.kdeplot(x=treatment_effects)

**Mean of the estimated treatment effect**

In [None]:
np.nanmean(treatment_effects)

**5th and 95th percentiles of the estimated treatment effect**

In [None]:
np.nanpercentile(treatment_effects, 5), np.nanpercentile(treatment_effects, 95)

## 3. LTV / CAC - Veepee campaign vs. France average

---

The predicted average LTV and predicted LTV / CAC two years after KYC for the Veepee campaign are estimated at 53 euros and 0.185 respectively. This compares to 26 euros and 0.78 for the October 2020 signup cohort in France (blended, visible [here](https://metabase-marketing.tech26.de/question/2982?period=month)), and 31 euros and 0.73 for the affiliate channel in particular.

While all findings indicate that Veepee users are more profitable than comparable groups, the campaign price of 60k euros brings the LTV / CAC below the affiliate average.
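
The LTV / CAC figure is total predicted profit divided by total campaign cost. The sketch below spells out that arithmetic on made-up numbers; the helper names `ltv_cac` and `max_cost_for_ratio` are hypothetical and not part of the notebook.

```python
def ltv_cac(predicted_profits, campaign_cost):
    """Total predicted profit divided by total acquisition cost."""
    return sum(predicted_profits) / campaign_cost


def max_cost_for_ratio(predicted_profits, target_ratio):
    """Highest campaign cost at which LTV / CAC still reaches target_ratio."""
    return sum(predicted_profits) / target_ratio


# Toy example: three users with hypothetical predicted profits, in euros.
profits = [50.0, 60.0, 40.0]

ratio = ltv_cac(profits, campaign_cost=600.0)
cost_cap = max_cost_for_ratio(profits, target_ratio=0.75)
print(ratio, cost_cap)
```

Reading the result: with 150 euros of total predicted profit, a 600 euro campaign yields a ratio of 0.25, and the cost would need to fall to 200 euros to reach a ratio of 0.75.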

**Average predicted profit per user**

In [None]:
round(df.loc[df["veepee_flg"] == True]["2y_profit_forecast"].mean(), 2)

**Average predicted LTV / CAC**

In [None]:
ltv_cac = df.loc[df["veepee_flg"] == True]["2y_profit_forecast"].sum() / 60000
round(ltv_cac, 3)