title: Premium & arrears v1    
author: Fabio Schmidt-Fischbach   
date: 2020-11-17   
region: EU   
summary: This analysis looks at the dynamics of premium users and arrears. It tracks the balances and premium charges of all ftMAU premium users who started a new subscription between September and November 2019, from subscription start until now. Arrears are coming. 30% of all premium fee charges between 2020-01 and 2020-06 went through arrears. Even users on the happy path fall into arrears. 65% of users who cancelled after 12 months had at least one payment that went through arrears. 57% of users who renewed after 12 months had at least one payment that went through arrears. People recover. 64% of all arrears payments (between 2020-01 and 2020-06) were recovered.
link: https://docs.google.com/presentation/d/1rtKXgBeXwk6m8zEMqDUyzBccmKgbBIH5vl9-_a1JUV0/edit?usp=sharing   
tags: memberships, dunning, arrears, premium, retention   

In [4]:
import pandas as pd
import numpy as np
import altair as alt
from datetime import timedelta
from rdd import rdd  # used by the regression discontinuity cells below

# Basics

We see that a lot of users enter dunning. Why? 

In [89]:
sample = """

select user_id,
        zzp.product_id,
        zzp.subscription_valid_from, 
        zzp.subscription_valid_until, 
        zzp.enter_reason, 
        zzp.market, 
        zzp.status, 
        zzp.transaction_date, 
        date_trunc('week', transaction_date) as tx_week,
        zzp.paid, 
        zzp.days_delay, 
        arrears_status, 
        payment_no
from dbt.zrh_subscription_payments as zzp 
inner join dbt.zrh_users as zu using (user_created)
inner join dbt.stg_cohort_first_active as fa using (user_created)
where 1=1
        and subscription_valid_from between '2019-09-01' and '2019-11-01'
"""

df = pd.read_csv("sample.csv")

In [4]:
mmb = """ 
with sample as ( 
select user_id,
		zzp.user_created,
		kyc_first_completed, 
        zzp.transaction_date,
        arrears_status, 
        zzp.product_id, 
        enter_reason
from dbt.zrh_subscription_payments as zzp 
inner join dbt.zrh_users as zu using (user_created)
inner join dbt.stg_cohort_first_active as fa using (user_created)
where 1=1 and subscription_valid_from between '2019-09-01' and '2019-11-01'
),
daily as ( 
select user_id, 
		mmb.date, 
		sum(balance_eur) as balance 
from dbt.mmb_daily_balance_aud as mmb 
inner join sample as s on s.user_created = mmb.user_created 
    and mmb.date between '2019-09-01' and current_date 
where mmb.product_key_group != 'SAVINGS' 
group by 1,2
)

select user_id, date_trunc('week', date) as week, max(balance) as max_bal, min(balance) as min_bal, avg(balance) as avg_balance
from daily 
group by 1,2 

"""
mmb = pd.read_csv("mmb.csv")

mmb.shape

(623573, 5)

In [27]:
df = pd.read_csv("sample.csv")

# build 0/1 payment flags before joining payments to balances.
df["charged"] = 0
df.loc[df["arrears_status"] != "discount", "charged"] = 1
df["directly_paid"] = 0
df.loc[(df["paid"] == True) & (df["days_delay"].isnull() == True), "directly_paid"] = 1
df["arrears"] = 0
df.loc[df["arrears_status"].isin(["open", "recovered", "written off"]), "arrears"] = 1
df["recovered"] = 0
df.loc[df["arrears_status"] == "recovered", "recovered"] = 1
# drop discounts
df = df.loc[df["charged"] == 1, :]
# aggregate to the user-week level.
df = (
    df.loc[
        :, ["user_id", "tx_week", "charged", "directly_paid", "arrears", "recovered"]
    ]
    .groupby(["user_id", "tx_week"])
    .agg("max")
    .reset_index()
)

# join weekly balances (mmb) to the payment sample.
mmb = pd.read_csv("mmb.csv")
mmb = mmb.merge(
    df, left_on=["week", "user_id"], right_on=["tx_week", "user_id"], how="left"
).fillna(0)

mmb["x"] = 1
mmb["roll"] = mmb.groupby(["user_id"])["x"].cumsum()

mmb.to_csv("mmb_sample.csv")
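
The flag logic in the cell above can be checked on a few hand-made rows. This is only a sketch: the toy rows, the status label `"paid"` for an on-time payment, and the helper name `add_payment_flags` are assumptions for illustration and do not come from the real tables. The other status labels and the column names are taken from the cell above.

```python
import pandas as pd


def add_payment_flags(payments: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: derive the 0/1 flags used in the notebook."""
    out = payments.copy()
    # every non-discount payment counts as a charge
    out["charged"] = (out["arrears_status"] != "discount").astype(int)
    # paid without any recorded delay
    out["directly_paid"] = (out["paid"].eq(True) & out["days_delay"].isna()).astype(int)
    # any payment that entered arrears, whatever happened afterwards
    out["arrears"] = (
        out["arrears_status"].isin(["open", "recovered", "written off"]).astype(int)
    )
    out["recovered"] = (out["arrears_status"] == "recovered").astype(int)
    return out


# toy data; "paid" as an on-time status label is an assumption
toy = pd.DataFrame(
    {
        "user_id": [1, 1, 2, 3],
        "tx_week": ["2019-09-02", "2019-09-30", "2019-09-02", "2019-09-02"],
        "arrears_status": ["paid", "recovered", "open", "discount"],
        "paid": [True, True, False, True],
        "days_delay": [None, 12.0, 30.0, None],
    }
)

flags = add_payment_flags(toy)
flags = flags.loc[flags["charged"] == 1, :]  # drop discounts
flag_cols = ["charged", "directly_paid", "arrears", "recovered"]
user_week = flags.groupby(["user_id", "tx_week"])[flag_cols].max().reset_index()
print(user_week)
```

Taking the maximum per user-week means a week counts as "in arrears" as soon as one charge in that week entered arrears.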

In [40]:
df = pd.read_csv("mmb_sample.csv")

# user ids
q1_recover = df.loc[(df["roll"] <= 12) & (df["recovered"] == 1), "user_id"].unique()
q1_arrears = df.loc[(df["roll"] <= 12) & (df["arrears"] == 1), "user_id"].unique()

df["status"] = np.nan
df.loc[df["user_id"].isin(q1_recover) == True, "status"] = "q1 recovered"
df.loc[
    (df["user_id"].isin(q1_recover) == False)
    & (df["user_id"].isin(q1_arrears) == True),
    "status",
] = "q1 arrears"
df.loc[(df["user_id"].isin(q1_arrears) == False), "status"] = "q1 clean"

df = df.groupby(["week", "status"])["max_bal"].agg("median").reset_index()

alt.Chart(df.loc[df["week"] <= "2020-02-01", :]).mark_line().encode(
    x=alt.X("week:T", axis=alt.Axis(title="Week")),
    y=alt.Y("max_bal:Q", axis=alt.Axis(title="Median balance")),
    color="status",
).properties(width=500, height=500, title="Median balance across q1 status groups")

In [44]:
df = pd.read_csv("mmb_sample.csv")

# user ids
q2_recover = df.loc[
    (df["roll"] > 12) & (df["roll"] <= 24) & (df["recovered"] == 1), "user_id"
].unique()
q2_arrears = df.loc[
    (df["roll"] > 12) & (df["roll"] <= 24) & (df["arrears"] == 1), "user_id"
].unique()

df["status"] = np.nan
df.loc[df["user_id"].isin(q2_recover) == True, "status"] = "q2 recovered"
df.loc[
    (df["user_id"].isin(q2_recover) == False)
    & (df["user_id"].isin(q2_arrears) == True),
    "status",
] = "q2 no-recover"
df.loc[(df["user_id"].isin(q2_arrears) == False), "status"] = "q2 clean"

df = df.groupby(["week", "status"])["max_bal"].agg("median").reset_index()

alt.Chart(df.loc[df["week"] <= "2020-05-01", :]).mark_line().encode(
    x=alt.X("week:T", axis=alt.Axis(title="Week")),
    y=alt.Y("max_bal:Q", axis=alt.Axis(title="Median balance")),
    color="status",
).properties(width=500, height=500, title="Median balance across q2 status groups")

##  Event study 

Balance activity in the eight weeks before and after each premium charge, split by arrears status.


In [93]:
df = pd.read_csv("sample.csv")

# data set is on user-payment level.
df = df.loc[df["arrears_status"] != "discount", :]
df = df.loc[df["market"] != "GBR", :]

# load mmb
mmb = pd.read_csv("mmb.csv")


mmb = mmb.merge(df, left_on=["user_id"], right_on=["user_id"], how="left")

# compute time diff col
mmb["diff"] = round(
    ((pd.to_datetime(mmb["week"]) - pd.to_datetime(mmb["tx_week"])).dt.days) / 7
)
mmb = mmb.loc[abs(mmb["diff"]) <= 8, :]

# aggregate
mmb = (
    mmb.groupby(["payment_no", "diff", "arrears_status"])["max_bal"]
    .agg("median")
    .reset_index()
)

alt.Chart(mmb.loc[mmb["payment_no"] <= 12, :]).mark_line().encode(
    x=alt.X("diff:N", axis=alt.Axis(title="Weeks to/since charge")),
    y=alt.Y("max_bal:Q", axis=alt.Axis(title="Median balance")),
    color="payment_no",
).properties(width=300, height=300, title="Event study").facet(
    columns=2, facet="arrears_status"
).resolve_scale(
    y="independent"
)
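
The event-time column used in the cell above is the signed number of weeks between a balance week and the charge week. A minimal sketch on made-up rows (the values are illustrative, only the column names follow the notebook):

```python
import pandas as pd

# toy weekly balances and one charge for a single user (made-up values)
balances = pd.DataFrame(
    {
        "user_id": [1, 1, 1],
        "week": ["2019-10-07", "2019-10-14", "2019-10-21"],
        "max_bal": [50.0, 5.0, 80.0],
    }
)
charges = pd.DataFrame(
    {"user_id": [1], "tx_week": ["2019-10-14"], "payment_no": [2]}
)

# every balance week is paired with every charge of the same user
panel = balances.merge(charges, on="user_id", how="left")

# negative = weeks before the charge, positive = weeks after
panel["diff"] = (
    (pd.to_datetime(panel["week"]) - pd.to_datetime(panel["tx_week"])).dt.days / 7
).round()

# keep a symmetric window of eight weeks around the charge
panel = panel.loc[panel["diff"].abs() <= 8, :]
print(panel[["week", "tx_week", "diff", "max_bal"]])
```

Because the merge is on `user_id` only, a user with several charges contributes each balance week once per charge, which is what lets the chart compare payment numbers.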

In [89]:
mmb = pd.read_csv("mmb.csv")
mmb.shape

(2430226, 5)

In [130]:
df = pd.read_csv("sample.csv")

# drop UK
df["market"] = df["market"].str.strip()
df = df.loc[df["market"] != "GBR", :]
df["country"] = df["market"]
df.loc[
    df["country"].isin(["DEU", "FRA", "ESP", "ITA", "AUT"]) == False, "country"
] = "Other"

df["first_r"] = np.nan
df.loc[df["arrears_status"] == "recovered", "first_r"] = df["payment_no"]
df["first_r"] = df.groupby(["user_id"])["first_r"].transform("min")

# ever recover.
df["ever_recover"] = 0
df.loc[df["first_r"].isnull() == False, "ever_recover"] = 1
df = df.loc[df["ever_recover"] == 1, :]


df = df.groupby(["first_r", "enter_reason"])["paid"].agg("mean").reset_index()

alt.Chart(df.loc[df["enter_reason"] != "DOWNGRADED", :]).mark_line().encode(
    x=alt.X("first_r:Q", axis=alt.Axis(title="Payment no that was first recovered")),
    y=alt.Y("paid:Q", axis=alt.Axis(format="%", title="% of total charges paid")),
    color="enter_reason:N",
).properties(title="Timing of first recovery & total retention", width=400, height=400)

In [135]:
df = pd.read_csv("sample.csv")

# drop UK
df["market"] = df["market"].str.strip()
df = df.loc[df["market"] != "GBR", :]
df["country"] = df["market"]
df.loc[
    df["country"].isin(["DEU", "FRA", "ESP", "ITA", "AUT"]) == False, "country"
] = "Other"

df["first_r"] = np.nan
df.loc[df["arrears_status"] == "recovered", "first_r"] = df["payment_no"]
df["first_r"] = df.groupby(["user_id"])["first_r"].transform("min")

# ever recover.
df["ever_recover"] = 0
df.loc[df["first_r"].isnull() == False, "ever_recover"] = 1
df = df.loc[df["ever_recover"] == 1, :]

df = df.groupby(["first_r", "status"])["user_id"].agg("nunique").reset_index()
df["perc"] = 100 * df["user_id"] / df.groupby(["first_r"])["user_id"].transform("sum")

alt.Chart(df).mark_line().encode(
    x=alt.X("first_r:Q", axis=alt.Axis(title="Timing of first recovery")),
    y=alt.Y("perc:Q", axis=alt.Axis(title="% of users")),
    color="status",
).properties(title="What happens to users that recover?", width=300)

In [166]:
df = pd.read_csv("sample.csv")

# drop UK
df["market"] = df["market"].str.strip()
df = df.loc[df["market"] != "GBR", :]
df["country"] = df["market"]
df.loc[
    df["country"].isin(["DEU", "FRA", "ESP", "ITA", "AUT"]) == False, "country"
] = "Other"

df["first_r"] = np.nan
df.loc[df["arrears_status"] == "recovered", "first_r"] = df["payment_no"]
df["first_r"] = df.groupby(["user_id"])["first_r"].transform("min")

# ever recover.
df["ever_recover"] = 0
df.loc[df["first_r"].isnull() == False, "ever_recover"] = 1
df = df.loc[df["ever_recover"] == 1, :]

df = df.groupby(["first_r", "status"])["user_id"].agg("nunique").reset_index()
df["perc"] = 100 * df["user_id"] / sum(df["user_id"])
df["cum"] = df["perc"].cumsum()

alt.Chart(df).mark_line().encode(
    x=alt.X("first_r:Q", axis=alt.Axis(title="Timing of first recovery")),
    y=alt.Y("cum:Q", axis=alt.Axis(title="% of all recoveries")),
).properties(title="What payment do most users first recover from?", width=300)

In [162]:
# timing of charge and actual revenue.

df = pd.read_csv("sample.csv")

# data set is on user-payment level.
df = df.loc[df["arrears_status"] != "discount", :]
df = df.loc[df["market"] != "GBR", :]
df["tx_week"] = pd.to_datetime(df["tx_week"])
# load mmb
mmb = pd.read_csv("mmb.csv")
# we actually want to join on the lag week --> know whether the funds were sufficient prior booking.
mmb["lag_week"] = pd.to_datetime(mmb["week"]) + timedelta(weeks=1)
mmb = mmb.merge(
    df, left_on=["user_id", "lag_week"], right_on=["user_id", "tx_week"], how="inner"
)

mmb["max_suff"] = 0
mmb.loc[(mmb["max_bal"].astype(float) > 20) == True, "max_suff"] = 1

mmb = (
    mmb.groupby(["arrears_status", "payment_no"])["max_suff"].agg("mean").reset_index()
)

alt.Chart(mmb).mark_line().encode(
    x=alt.X("payment_no:Q", axis=alt.Axis(title="Payment no")),
    y=alt.Y(
        "max_suff:Q",
        axis=alt.Axis(
            format="%", title="% of users with sufficient funds 1wk before charge"
        ),
    ),
    color="arrears_status",
).properties(
    title="% of users with sufficient funds 1 wk before charge", width=400, height=400
)
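
The lag-week join above pairs each charge with the balance of the week before it. A small sketch with made-up rows shows the mechanics. The EUR 20 threshold is the one used in the cell above, everything else is illustrative.

```python
from datetime import timedelta

import pandas as pd

# toy data: the balance is low the week before the charge and high in the charge week
balances = pd.DataFrame(
    {
        "user_id": [1, 1],
        "week": ["2019-10-07", "2019-10-14"],
        "max_bal": [5.0, 100.0],
    }
)
charges = pd.DataFrame({"user_id": [1], "tx_week": ["2019-10-14"]})
charges["tx_week"] = pd.to_datetime(charges["tx_week"])

# shifting the balance week forward by one week makes it match the following charge week
balances["lag_week"] = pd.to_datetime(balances["week"]) + timedelta(weeks=1)

joined = balances.merge(
    charges,
    left_on=["user_id", "lag_week"],
    right_on=["user_id", "tx_week"],
    how="inner",
)

# 1 if the weekly maximum balance before the charge exceeded EUR 20
joined["max_suff"] = (joined["max_bal"] > 20).astype(int)
print(joined[["week", "tx_week", "max_bal", "max_suff"]])
```

Only the balance from the week of 2019-10-07 survives the join, so the high balance in the charge week itself does not leak into the "funds before charge" flag.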

In [171]:
df = pd.read_csv("sample.csv")

# data set is on user-payment level.
df = df.loc[df["arrears_status"] != "discount", :]
df = df.loc[df["market"] != "GBR", :]

df["arrears"] = 0
df.loc[df["arrears_status"].isin(["recovered", "open", "written off"]), "arrears"] = 1

df = df.groupby(["payment_no", "enter_reason"])["arrears"].agg("mean").reset_index()

alt.Chart(df.loc[df["enter_reason"] != "DOWNGRADED", :]).mark_line().encode(
    x=alt.X("payment_no:Q", axis=alt.Axis(title="Payment no")),
    y=alt.Y("arrears:Q", axis=alt.Axis(format="%", title="% of payments in arrears")),
    color="enter_reason",
).properties(title="% of payments that fall into arrears", width=500, height=400)

In [87]:
mmb = pd.read_csv("mmb.csv")

mmb["diff"] = (
    pd.to_datetime(mmb["transaction_date"]) - pd.to_datetime(mmb["date"])
).dt.days

mmb = (
    mmb.groupby(["diff", "user_id", "transaction_date", "arrears_status"])["balance"]
    .agg("sum")
    .reset_index()
)

mmb = mmb.groupby(["diff", "arrears_status"])["balance"].agg("median").reset_index()

alt.Chart(mmb).mark_line().encode(
    x="diff:N", y="balance:Q", color="arrears_status"
).properties(width=500, height=500)

In [83]:
mmb = pd.read_csv("mmb.csv")

# add cumulative id --> data is not on the user level - need a user-tx level id.
m = mmb
m["x"] = 1
m = (
    m.loc[:, ["user_id", "transaction_date", "x"]]
    .groupby(["user_id", "transaction_date"])["x"]
    .agg("min")
    .reset_index()
)
m["id"] = m.groupby(["user_id"])["x"].cumsum()

mmb = mmb.merge(m, on=["user_id", "transaction_date"])

mmb["diff"] = (
    pd.to_datetime(mmb["transaction_date"]) - pd.to_datetime(mmb["date"])
).dt.days

mmb = mmb.groupby(["diff", "user_id", "id"])["balance"].agg("sum").reset_index()

# get rankings
m = mmb.loc[mmb["diff"] == -30, :].copy()  # copy so the assignment below does not hit a view
m["percentile"] = round(100 * m.groupby(["id"])["balance"].rank(pct=True) / 20)
m = m.loc[:, ["user_id", "id", "percentile"]]

# merge back to main
mmb = mmb.merge(m, on=["user_id", "id"])

mmb = mmb.groupby(["percentile", "diff", "id"])["balance"].agg("median").reset_index()

alt.Chart(mmb.loc[mmb["id"] == 1, :]).mark_line().encode(
    x="diff:N", y="balance:Q", color="percentile:N", column="id"
).properties(width=600, height=600)


# Regression discontinuity
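
The regressions in this section estimate `still_premium ~ TREATED + diff`, where `diff` is balance minus charge and `TREATED` marks `diff >= 0`. The coefficient on `TREATED` is the jump at the cutoff. The sketch below illustrates that equation on simulated data with plain least squares in NumPy. It is not the `rdd` package's WLS fit, and the sample size, baseline rate, slope and jump of 0.15 are made-up numbers chosen only for the simulation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# running variable: balance minus charge amount, within +/- 10 EUR (simulated)
diff = rng.uniform(-10, 10, size=n)
# treatment indicator: enough funds to cover the charge
treated = (diff >= 0).astype(float)

# made-up data generating process with a discontinuity at diff = 0
true_jump = 0.15
p_still_premium = 0.40 + true_jump * treated + 0.005 * diff
still_premium = rng.binomial(1, p_still_premium).astype(float)

# linear probability model: still_premium ~ 1 + TREATED + diff
X = np.column_stack([np.ones(n), treated, diff])
coef, *_ = np.linalg.lstsq(X, still_premium, rcond=None)
intercept_hat, jump_hat, slope_hat = coef

print(f"estimated jump at the cutoff: {jump_hat:.3f} (simulated truth: {true_jump})")
```

With a single common slope, as in the estimation equation printed by the cells below, the estimate is only as good as the assumption that the relationship is linear and has the same slope on both sides of the cutoff inside the chosen window.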


In [2]:
query = """
with sample as ( 
select user_id,
		zzp.user_created,
        zzp.product_id,
        zzp.amount_cents,
        zzp.subscription_valid_from, 
        zzp.subscription_valid_until, 
        zzp.enter_reason, 
        zzp.market, 
        zzp.status, 
        zzp.transaction_date, 
        date_trunc('week', transaction_date) as tx_week,
        zzp.paid, 
        zzp.days_delay, 
        arrears_status, 
        payment_no
from dbt.zrh_subscription_payments as zzp 
inner join dbt.zrh_users as zu using (user_created)
inner join dbt.stg_cohort_first_active as fa using (user_created)
where 1=1
        and subscription_valid_from between '2019-09-01' and '2019-11-01'
        and arrears_status != 'discount'
        and charged = True
)

select sample.*,
        mmb.date, 
        mmb.balance_eur, 
        mmb.product_key_group
from sample 
left join dbt.mmb_daily_balance_aud as mmb 
    on sample.user_created = mmb.user_created 
    and mmb.date::date = dateadd('days',-1, transaction_date::date)  
"""


In [19]:
rd = pd.read_csv("rd.csv")

# bring down to user-tx-level.
rd = (
    rd.groupby(["user_id", "transaction_date"])
    .agg({"paid": "max", "balance_eur": "sum", "amount_cents": "max"})
    .reset_index()
)

rd["charge_eur"] = rd["amount_cents"] / 100
# subtract difference between balance and charge.
rd["diff"] = round(rd["balance_eur"] - rd["charge_eur"])

rd = rd.groupby(["diff"])["paid"].agg("count").reset_index()

rd["perc"] = 100 * rd["paid"] / sum(rd["paid"])
rd = rd.loc[abs(rd["diff"]) <= 500, :]


alt.Chart(rd).mark_bar().encode(
    x=alt.X(
        "diff:Q",
        axis=alt.Axis(title="Difference between current balance and charge amount"),
    ),
    y=alt.Y("perc:Q", axis=alt.Axis(title="% of cases")),
).properties(
    width=500, height=500, title="Distribution of balance minus charge amount"
)

In [13]:
rd = pd.read_csv("rd.csv")

# bring down to user-tx-level.
rd = (
    rd.groupby(["user_id", "transaction_date"])
    .agg({"paid": "max", "balance_eur": "sum", "amount_cents": "max"})
    .reset_index()
)

rd["charge_eur"] = rd["amount_cents"] / 100
# subtract difference between balance and charge.
rd["diff"] = round(rd["balance_eur"] - rd["charge_eur"])

# keep only moderately close diffs
rd = rd.loc[abs(rd["diff"]) <= 20, :]

rd = rd.groupby(["diff"])["paid"].agg("mean").reset_index()

rd["group"] = "Not enough funds"
rd.loc[rd["diff"] >= 0, "group"] = "Enough funds"

alt.Chart(rd).mark_bar().encode(
    x=alt.X(
        "diff:Q",
        axis=alt.Axis(title="Difference between current balance and charge amount"),
    ),
    y=alt.Y("paid:Q", axis=alt.Axis(format="%", title="% of charges paid")),
    color="group:N",
).properties(
    width=500, height=500, title="Balance pre booking and % paid the next charge"
)

In [85]:
rd = pd.read_csv("rd.csv")

rd["product"] = "Metal"
rd.loc[
    rd["product_id"].isin(["BLACK_CARD_MONTHLY", "BUSINESS_BLACK"]) == True, "product"
] = "You"

# bring down to user-tx-level.
rd = (
    rd.groupby(["user_id", "product", "transaction_date"])
    .agg({"paid": "max", "balance_eur": "sum", "amount_cents": "max"})
    .reset_index()
)

rd["charge_eur"] = rd["amount_cents"] / 100
# subtract difference between balance and charge.
rd["diff"] = round(rd["balance_eur"] - rd["charge_eur"])

# keep only moderately close diffs
rd = rd.loc[abs(rd["diff"]) <= 30, :]

rd = rd.groupby(["product", "diff"])["paid"].agg("mean").reset_index()

rd["group"] = "Not enough funds"
rd.loc[rd["diff"] >= 0, "group"] = "Enough funds"

alt.Chart(rd).mark_bar().encode(
    x=alt.X(
        "diff:Q",
        axis=alt.Axis(title="Difference between current balance and charge amount"),
    ),
    y=alt.Y("paid:Q", axis=alt.Axis(format="%", title="% of charges paid")),
    color="group:N",
    column="product",
).properties(
    width=400, height=500, title="Balance pre booking and % paid the next charge"
)

In [79]:
rd = pd.read_csv("rd.csv")

# bring down to user-tx-level.
rd = (
    rd.groupby(["user_id", "transaction_date"])
    .agg({"paid": "max", "balance_eur": "sum", "amount_cents": "max"})
    .reset_index()
)

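# sort_values returns a new frame, so the result has to be assigned
rd = rd.sort_values(by=["user_id", "transaction_date"])

# outcome of the user's following charge (t+1); the last charge per user has none and stays NaN
rd["next_paid"] = rd.groupby("user_id")["paid"].shift(periods=-1).astype(float)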

rd["charge_eur"] = rd["amount_cents"] / 100
# subtract difference between balance and charge.
rd["diff"] = round(rd["balance_eur"] - rd["charge_eur"])

# keep only moderately close diffs
rd = rd.loc[abs(rd["diff"]) <= 80, :]

rd = rd.groupby(["diff"])["next_paid"].agg("mean").reset_index()

rd["group"] = "Not enough funds"
rd.loc[rd["diff"] >= 0, "group"] = "Enough funds"

alt.Chart(rd).mark_bar().encode(
    x=alt.X(
        "diff:Q",
        axis=alt.Axis(title="Difference between current balance and charge amount"),
    ),
    y=alt.Y("next_paid:Q", axis=alt.Axis(format="%", title="% paid next charge")),
    color="group:N",
).properties(
    width=500, height=500, title="Balance pre booking (t) and % paid next charge (t+1)"
)
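
The t+1 outcome depends on the direction of the per-user shift and on how the missing value at the edge of each user's history is handled. A short sketch on made-up charges (all values are illustrative):

```python
import pandas as pd

# toy charge history for two users
charges = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 2, 2],
        "transaction_date": [
            "2019-09-01", "2019-10-01", "2019-11-01", "2019-09-15", "2019-10-15",
        ],
        "paid": [True, False, True, False, True],
    }
).sort_values(["user_id", "transaction_date"])

by_user = charges.groupby("user_id")["paid"]
charges["prev_paid"] = by_user.shift(1)   # outcome of the previous charge (t-1)
charges["next_paid"] = by_user.shift(-1)  # outcome of the following charge (t+1)

# casting to bool turns the missing edge values into True and inflates the mean,
# casting to float keeps them as NaN so that mean() skips them
naive_rate = charges["next_paid"].astype(bool).mean()
safe_rate = charges["next_paid"].astype(float).mean()

print(charges)
print(naive_rate, safe_rate)
```

A positive shift looks backwards and a negative shift looks forwards, so a "next charge" outcome needs `shift(-1)` on data sorted by user and transaction date.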

In [87]:
rd = pd.read_csv("rd.csv")

rd["product"] = "Metal"
rd.loc[
    rd["product_id"].isin(["BLACK_CARD_MONTHLY", "BUSINESS_BLACK"]) == True, "product"
] = "You"

# bring down to user-tx-level.
rd = (
    rd.groupby(["product", "user_id", "transaction_date"])
    .agg({"paid": "max", "balance_eur": "sum", "amount_cents": "max"})
    .reset_index()
)

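# sort_values returns a new frame, so the result has to be assigned
rd = rd.sort_values(by=["user_id", "transaction_date"])

# outcome of the user's following charge (t+1); the last charge per user has none and stays NaN
rd["next_paid"] = rd.groupby("user_id")["paid"].shift(periods=-1).astype(float)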

rd["charge_eur"] = rd["amount_cents"] / 100
# subtract difference between balance and charge.
rd["diff"] = round(rd["balance_eur"] - rd["charge_eur"])

# keep only moderately close diffs
rd = rd.loc[abs(rd["diff"]) <= 50, :]

rd = rd.groupby(["product", "diff"])["next_paid"].agg("mean").reset_index()

rd["group"] = "Not enough funds"
rd.loc[rd["diff"] >= 0, "group"] = "Enough funds"

alt.Chart(rd).mark_bar().encode(
    x=alt.X(
        "diff:Q",
        axis=alt.Axis(title="Difference between current balance and charge amount"),
    ),
    y=alt.Y("next_paid:Q", axis=alt.Axis(format="%", title="% paid next charge")),
    color="group:N",
    column="product",
).properties(
    width=500, height=500, title="Balance pre booking (t) and % paid next charge (t+1)"
)

In [42]:
rd = pd.read_csv("rd.csv")

rd["still_premium"] = 0
rd.loc[rd["status"].isin(["RENEWED", "ACTIVE", "UPGRADED"]), "still_premium"] = 1

# bring down to user-tx-level.
rd = (
    rd.groupby(["user_id", "transaction_date"])
    .agg({"still_premium": "max", "balance_eur": "sum", "amount_cents": "max"})
    .reset_index()
)

rd["charge_eur"] = rd["amount_cents"] / 100
# subtract difference between balance and charge.
rd["diff"] = round(rd["balance_eur"] - rd["charge_eur"], 2)

# keep only moderately close diffs
rd = rd.loc[abs(rd["diff"]) <= 10, :]

rd = rd.groupby(["diff"])["still_premium"].agg("mean").reset_index()

rd["group"] = "Not enough funds"
rd.loc[rd["diff"] >= 0, "group"] = "Enough funds"


alt.Chart(rd).mark_bar().encode(
    x=alt.X(
        "diff:Q",
        axis=alt.Axis(title="Difference between current balance and charge amount"),
    ),
    y=alt.Y(
        "still_premium:Q",
        axis=alt.Axis(format="%", title="% of users still in premium"),
    ),
    color="group:N",
).properties(width=500, height=500, title="Balance pre booking and % still premium")

In [58]:
# check whether the timing matters.
rd = pd.read_csv("rd.csv")

rd["still_premium"] = 0
rd.loc[rd["status"].isin(["RENEWED", "ACTIVE", "UPGRADED"]), "still_premium"] = 1

rd = (
    rd.groupby(["user_id", "transaction_date", "payment_no"])
    .agg({"still_premium": "max", "balance_eur": "sum", "amount_cents": "max"})
    .reset_index()
)

# subtract difference between balance and charge.
rd["diff"] = round((rd["balance_eur"] * 100 - rd["amount_cents"]) / 100, 1)

# keep only moderately close diffs
rd = rd.loc[abs(rd["diff"]) <= 10, :]

rd = rd.groupby(["payment_no", "diff"])["still_premium"].agg("mean").reset_index()

rd["group"] = "Not enough funds"
rd.loc[rd["diff"] >= 0, "group"] = "Enough funds"


alt.Chart(rd).mark_bar().encode(
    x=alt.X(
        "diff:Q",
        axis=alt.Axis(title="Difference between current balance and charge amount"),
    ),
    y=alt.Y(
        "still_premium:Q",
        axis=alt.Axis(format="%", title="% of users still in premium"),
    ),
    color="group:N",
).properties(width=300, height=300).facet(facet="payment_no:N", columns=3)

In [53]:
rd.head(10)

Unnamed: 0,payment_no,diff,still_premium,group
0,1,-10.0,0.217391,Not enough funds
1,1,-9.9,0.486753,Not enough funds
2,1,-9.8,0.393617,Not enough funds
3,1,-9.7,0.406977,Not enough funds
4,1,-9.6,0.406593,Not enough funds
5,1,-9.5,0.531915,Not enough funds
6,1,-9.4,0.403846,Not enough funds
7,1,-9.3,0.407407,Not enough funds
8,1,-9.2,0.489362,Not enough funds
9,1,-9.1,0.490196,Not enough funds


In [50]:
# check for each market
rd = pd.read_csv("rd.csv")

rd["still_premium"] = 0
rd.loc[rd["status"].isin(["RENEWED", "ACTIVE", "UPGRADED"]), "still_premium"] = 1

# bring down to user-tx-level.
rd["market"] = rd["market"].str.strip()
rd["country"] = rd["market"]
rd.loc[
    rd["market"].isin(["DEU", "FRA", "ITA", "AUT", "ESP"]) == False, "country"
] = "other"

rd = (
    rd.groupby(["user_id", "transaction_date", "country"])
    .agg({"still_premium": "max", "balance_eur": "sum", "amount_cents": "max"})
    .reset_index()
)

# subtract difference between balance and charge.
rd["diff"] = round((rd["balance_eur"] * 100 - rd["amount_cents"]) / 100, 1)

# keep only moderately close diffs
rd = rd.loc[abs(rd["diff"]) <= 10, :]

rd = rd.groupby(["country", "diff"])["still_premium"].agg("mean").reset_index()

rd["group"] = "Not enough funds"
rd.loc[rd["diff"] >= 0, "group"] = "Enough funds"


alt.Chart(rd).mark_bar().encode(
    x=alt.X(
        "diff:Q",
        axis=alt.Axis(title="Difference between current balance and charge amount"),
    ),
    y=alt.Y(
        "still_premium:Q",
        axis=alt.Axis(format="%", title="% of users still in premium"),
    ),
    color="group:N",
    column="country",
).properties(width=500, height=500, title="Balance pre booking and % still premium")

In [88]:
# check for each enter reason
rd = pd.read_csv("rd.csv")

rd["still_premium"] = 0
rd.loc[rd["status"].isin(["RENEWED", "ACTIVE", "UPGRADED"]), "still_premium"] = 1

# bring down to user-tx-level.
rd["market"] = rd["market"].str.strip()
rd["country"] = rd["market"]
rd.loc[
    rd["market"].isin(["DEU", "FRA", "ITA", "AUT", "ESP"]) == False, "country"
] = "other"

rd = (
    rd.groupby(["user_id", "transaction_date", "enter_reason"])
    .agg({"still_premium": "max", "balance_eur": "sum", "amount_cents": "max"})
    .reset_index()
)

# subtract difference between balance and charge.
rd["diff"] = round((rd["balance_eur"] * 100 - rd["amount_cents"]) / 100, 1)

# keep only moderately close diffs
rd = rd.loc[abs(rd["diff"]) <= 10, :]

rd = rd.groupby(["enter_reason", "diff"])["still_premium"].agg("mean").reset_index()

rd["group"] = "Not enough funds"
rd.loc[rd["diff"] >= 0, "group"] = "Enough funds"


alt.Chart(rd).mark_bar().encode(
    x=alt.X(
        "diff:Q",
        axis=alt.Axis(title="Difference between current balance and charge amount"),
    ),
    y=alt.Y(
        "still_premium:Q",
        axis=alt.Axis(format="%", title="% of users still in premium"),
    ),
    color="group:N",
    column="enter_reason",
).properties(width=500, height=500, title="Balance pre booking and % still premium")

In [43]:
## baseline test.

rd = pd.read_csv("rd.csv")

rd["still_premium"] = 0
rd.loc[rd["status"].isin(["RENEWED", "ACTIVE", "UPGRADED"]), "still_premium"] = 1

# bring down to user-tx-level.
rd = (
    rd.groupby(["user_id", "transaction_date"])
    .agg({"still_premium": "max", "balance_eur": "sum", "amount_cents": "max"})
    .reset_index()
)

rd["charge_eur"] = rd["amount_cents"] / 100
# subtract difference between balance and charge.
rd["diff"] = round(rd["balance_eur"] - rd["charge_eur"], 2)

# keep only moderately close diffs
rd = rd.loc[abs(rd["diff"]) <= 10, :]

model = rdd.rdd(rd, "diff", "still_premium", cut=0)
print(model.fit().summary())

Estimation Equation:	 still_premium ~ TREATED + diff
                            WLS Regression Results                            
Dep. Variable:          still_premium   R-squared:                       0.004
Model:                            WLS   Adj. R-squared:                  0.004
Method:                 Least Squares   F-statistic:                     159.4
Date:                Wed, 18 Nov 2020   Prob (F-statistic):           8.73e-70
Time:                        10:21:24   Log-Likelihood:                -50256.
No. Observations:               70616   AIC:                         1.005e+05
Df Residuals:                   70613   BIC:                         1.005e+05
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------

In [44]:
## check with only primary account
rd = pd.read_csv("rd.csv")

rd = rd.loc[rd["product_key_group"] == "PRIMARY", :]

rd["still_premium"] = 0
rd.loc[rd["status"].isin(["RENEWED", "ACTIVE", "UPGRADED"]), "still_premium"] = 1

# bring down to user-tx-level.
rd = (
    rd.groupby(["user_id", "transaction_date"])
    .agg({"still_premium": "max", "balance_eur": "sum", "amount_cents": "max"})
    .reset_index()
)

rd["charge_eur"] = rd["amount_cents"] / 100
# subtract difference between balance and charge.
rd["diff"] = round(rd["balance_eur"] - rd["charge_eur"], 2)

# keep only moderately close diffs
rd = rd.loc[abs(rd["diff"]) <= 10, :]

model = rdd.rdd(rd, "diff", "still_premium", cut=0)
print(model.fit().summary())

Estimation Equation:	 still_premium ~ TREATED + diff
                            WLS Regression Results                            
Dep. Variable:          still_premium   R-squared:                       0.005
Model:                            WLS   Adj. R-squared:                  0.004
Method:                 Least Squares   F-statistic:                     169.2
Date:                Wed, 18 Nov 2020   Prob (F-statistic):           4.95e-74
Time:                        10:22:29   Log-Likelihood:                -52911.
No. Observations:               74734   AIC:                         1.058e+05
Df Residuals:                   74731   BIC:                         1.059e+05
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------

In [51]:
## with wider definition

rd = pd.read_csv("rd.csv")

rd["still_premium"] = 0
rd.loc[rd["status"].isin(["RENEWED", "ACTIVE", "UPGRADED"]), "still_premium"] = 1

# bring down to user-tx-level.
rd = (
    rd.groupby(["user_id", "transaction_date"])
    .agg({"still_premium": "max", "balance_eur": "sum", "amount_cents": "max"})
    .reset_index()
)

rd["charge_eur"] = rd["amount_cents"] / 100
# subtract difference between balance and charge.
rd["diff"] = round(rd["balance_eur"] - rd["charge_eur"], 2)

# keep only moderately close diffs
rd = rd.loc[abs(rd["diff"]) <= 30, :]

model = rdd.rdd(rd, "diff", "still_premium", cut=0)
print(model.fit().summary())

Estimation Equation:	 still_premium ~ TREATED + diff
                            WLS Regression Results                            
Dep. Variable:          still_premium   R-squared:                       0.029
Model:                            WLS   Adj. R-squared:                  0.029
Method:                 Least Squares   F-statistic:                     2158.
Date:                Wed, 18 Nov 2020   Prob (F-statistic):               0.00
Time:                        10:34:46   Log-Likelihood:            -1.0150e+05
No. Observations:              143403   AIC:                         2.030e+05
Df Residuals:                  143400   BIC:                         2.030e+05
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------

In [59]:
## control for subscription age (payment number).

rd = pd.read_csv("rd.csv")

rd["still_premium"] = 0
rd.loc[rd["status"].isin(["RENEWED", "ACTIVE", "UPGRADED"]), "still_premium"] = 1

# bring down to user-tx-level.
rd = (
    rd.groupby(["payment_no", "user_id", "transaction_date"])
    .agg({"still_premium": "max", "balance_eur": "sum", "amount_cents": "max"})
    .reset_index()
)

rd["charge_eur"] = rd["amount_cents"] / 100
# subtract difference between balance and charge.
rd["diff"] = round(rd["balance_eur"] - rd["charge_eur"], 2)

# keep only moderately close diffs
rd = rd.loc[abs(rd["diff"]) <= 10, :]

model = rdd.rdd(rd, "diff", "still_premium", cut=0, controls=["payment_no"])
print(model.fit().summary())

Estimation Equation:	 still_premium ~ TREATED + diff + payment_no
                            WLS Regression Results                            
Dep. Variable:          still_premium   R-squared:                       0.045
Model:                            WLS   Adj. R-squared:                  0.045
Method:                 Least Squares   F-statistic:                     1109.
Date:                Wed, 18 Nov 2020   Prob (F-statistic):               0.00
Time:                        10:41:57   Log-Likelihood:                -48789.
No. Observations:               70616   AIC:                         9.759e+04
Df Residuals:                   70612   BIC:                         9.762e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------