title: Premium product retention analysis    
author: Fabio Schmidt-Fischbach   
date: 2020-03-01   
region: EU   
tags: memberships, you, metal, business, churn, retention
summary: This research is split into 3 parts. In part A we look into how many people survive until month 12, how many people renew and how this differs by market. In part B we look into why people churn early (they simply never top up: it's an activation problem). In part C we explore why people churn later on or do not renew. To answer this we'll look into learnings from account balances, travel behavior and financial transactions.

In [None]:
import pandas as pd
import altair as alt
import numpy as np
import seaborn as sns
import os

pd.options.mode.chained_assignment = None  # default='warn'

# Churn investigation v1

- Author: Fabio Schmidt-Fischbach
- Date : 1st of April 2020 (seriously)

# Structure of this report. 

Section A focuses on the "what" (is happening).

- A. How good are we at retaining customers? This section focuses on understanding how big retention problems are and outlines the problem we aim to address. 

Sections B and C then focus on the why. 

- B. Why do customers churn early (months 1-4)?
- C. Why do customers churn late  (months 5-11) and why don't customers renew? 


# Exec summary 

### Part A.
- How many people survive until month 12? 
    - Upgrade flow: 60% across most products.
    - Sign-up flow: 40% for Metal and B. You + 50% for You.

- How many people renew? 
    - Upgrade flow: 80-90% for Metal and B. You : interestingly, upgrade YOU has a very low renewal rate (80%)
    - Sign-up flow: 85-90% of customers.

- How does this differ by market? 
    - "Within year retention" differs dramatically by market.
    - Italy and Spain at roughly 30% after 11 months.
    - Germany (67%) and France (40%)
    
### Part B. 

- Why do people churn early? 
    - They simply never top up. It's an activation problem! 
    
### Part C. 

- Why do people churn later on? Do not renew? 

    - Part 1. Learnings from account balances
    
        - Do non-renewals and late churners never activate? No! They do use the account in the early months but at some point stop. 
        - But does this mean that renewals vs non-renewals start with the same intentions? No! Renewal users tend to have higher balances from the very beginning. 
        - We see a gradual decline of the account balance for those that drop out: it might be too late to reach out to them in month 12. 
        
    - Part 2. Learnings from travel behavior 
    
        - Is it travelling? Yes and no. Yes, because You has more travellers and also higher retention rates. No, because You users who churn travel almost as much as You users who renew. 
        - Do users get the subscription right in advance of a trip? Yes. This mainly happens for the You account, but not for Metal and B. You. You users are much more likely to travel in the initial months of their subscription.
        - Users that go on trips later in their subscription year tend to renew at much higher rates: do people forget about the value we provided?
    - Part 3. Learnings from financial transactions.
    
        - We look at financial activity within the first 4 months and compare three groups: late churners (>= month 6), cancellers and renewers. Even this early, their activity can be ranked:
             - 1. Most active: renewals (25 tx a month -- median)
             - 2. Medium active: cancellations (15 tx a month -- median)
             - 3. Least active: late churners (5 tx a month -- median) 

## Which premium users are we looking at?  
We look at premium subscriptions that started between October 2018 and February 2019 (premium cohorts whose subscription year has finished by now).


In [None]:
query = """

select  dwh.month, 
        zu.user_id,
        zup.product_id, 
        zup.enter_reason, 
        zup.subscription_valid_from, 
        zup.subscription_valid_until, 
        datediff(month, subscription_valid_from, dwh.end_time) as age, 
        datediff(month, subscription_valid_from, subscription_valid_until) as churn_age, 
        subscription_end_event,
        feature,
        sum(value) as value 
from dwh_cohort_months as dwh 
left join dbt.zrh_user_product as zup 
    on dwh.end_time between zup.subscription_valid_from and zup.subscription_valid_until 
    and product_id in ('BLACK_CARD_MONTHLY','METAL_CARD_MONTHLY','BUSINESS_BLACK')
left join dbt.zrh_txn_day_rows as txn 
    on txn.user_created = zup.user_created 
    and date_trunc('month',txn.txn_date) = dwh.start_time  
    and feature in ('amount_cents_card_ecomm',
                    'amount_cents_spaces', 
                    'amount_cents_pt', 
                    'amount_cents_card_apple',
                    'amount_cents_card_cardpresent',
                    'amount_cents_card_google',
                    'amount_cents_card_atm',
                    'amount_cents_ct',
                    'amount_cents_dd',
                    'n_spaces_ct',
                    'n_pt','n_ct','n_card_cardpresent', 'n_spaces_dt', 'n_total', 'n_card_ecomm', 
                    'n_card_atm', 'n_card_ext_total_in', 'n_card_apple', 'n_spaces', 'n_dt', 'n_dd')
inner join dbt.zrh_users as zu on zu.user_created = zup.user_created
where subscription_valid_from between '2018-12-01' and '2019-07-01'
    and dwh.start_time between '2018-12-01' and current_date  
    and enter_reason in ('UPGRADED','SIGNUP')
group by 1,2,3,4,5,6,7,8,9,10


"""

In [28]:
sample = pd.read_csv("churn_sample.csv")

sample.loc[sample["product_id"] == "BLACK_CARD_MONTHLY", "product_id"] = "You"
sample.loc[sample["product_id"] == "METAL_CARD_MONTHLY", "product_id"] = "Metal"
sample.loc[sample["product_id"] == "BUSINESS_BLACK", "product_id"] = "Bus. You"

# Part A : how good are we at retaining our customers?


## Summary of Part A. 

- How many people survive until month 12? 
    - Upgrade flow: 60% across most products.
    - Sign-up flow: 40% for Metal and B. You + 50% for You.

- How many people renew? 
    - Upgrade flow: 70-80% for Metal and B. You : interestingly, upgrade YOU has a very low renewal rate (60-70%)
    - Sign-up flow: 70-80% of customers.

- How does this differ by market? 
    - "Within year retention" differs dramatically by market.
    - Italy and Spain at roughly 30% after 11 months.
    - Germany (67%) and France (40%)
    
 - Want to go more detailed than averages across cohorts? 
     https://metabase-main.tech26.de/dashboard/20?membership=METAL_CARD_MONTHLY
    

# 1. When do our customers churn? Let's visualize the % of the initial sign up cohort that is still within the membership by months of age. 

In [29]:
signup = sample.loc[sample["enter_reason"] == "SIGNUP", :]

# create cohorts.
signup["cohort"] = pd.to_datetime(signup["subscription_valid_from"]).dt.to_period("M")

signup = (
    signup.groupby(["cohort", "age", "product_id"])["user_id"].nunique().reset_index()
)

signup["cohort_size"] = signup.groupby(["product_id", "cohort"])["user_id"].transform(
    "max"
)
signup["perc"] = 100 * signup["user_id"] / signup["cohort_size"]

# average over cohorts.
signup = signup.groupby(["product_id", "age"])["perc"].agg("mean").reset_index()


alt.Chart(signup.loc[signup["age"] < 12, :]).mark_line().encode(
    x=alt.X("age:Q", axis=alt.Axis(title="Age in months")),
    y=alt.Y("perc:Q", axis=alt.Axis(title="% retained")),
    color="product_id:N",
).properties(title="% premium retention for sign up flow")

After 4 months, how many customers are still with us? 

In [30]:
print(signup.loc[signup["age"] == 4, :])

   product_id  age       perc
4    Bus. You    4  65.264483
18      Metal    4  68.266553
30        You    4  74.464232


After 8 months, how many customers are still with us?

In [31]:
print(signup.loc[signup["age"] == 8, :])

   product_id  age       perc
8    Bus. You    8  50.023037
22      Metal    8  49.446747
34        You    8  59.939779


Similarly, what does this look like after 11 months, right before cancellations take effect?

In [32]:
print(signup.loc[signup["age"] == 11, :])

   product_id  age       perc
11   Bus. You   11  42.085685
25      Metal   11  38.993425
37        You   11  49.757849
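
The retention curves above all follow the same recipe: count distinct users per cohort and age, divide by the cohort size, then average the percentages over cohorts. A minimal sketch of that recipe on made-up data (the `toy` frame and its values are purely illustrative and not taken from the sample):

```python
import pandas as pd

# Hypothetical toy data: one row per user and month of subscription age.
toy = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 2, 2, 3, 4, 4, 4, 5],
        "cohort": ["2018-12"] * 6 + ["2019-01"] * 4,
        "age": [0, 1, 2, 0, 1, 0, 0, 1, 2, 0],
    }
)

# Distinct users still subscribed, per cohort and age.
counts = toy.groupby(["cohort", "age"])["user_id"].nunique().reset_index()

# Cohort size = largest monthly count within the cohort (as in the cells above).
counts["cohort_size"] = counts.groupby("cohort")["user_id"].transform("max")
counts["perc"] = 100 * counts["user_id"] / counts["cohort_size"]

# Unweighted average over cohorts: small cohorts count as much as large ones.
retention = counts.groupby("age")["perc"].mean().reset_index()
print(retention)
```

Because the final step is an unweighted mean, a very small cohort can move the curve as much as a large one. Weighting by cohort size would be an alternative design choice.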


# 2. What % of those that make it to month 12 renew? 

In [33]:
# look at % renewal.
signup = sample.loc[sample["enter_reason"] == "SIGNUP", :]

# drop users whose subscription has not ended yet (no end event):
# they could not even have made this decision
signup = signup.loc[signup["subscription_end_event"].notna(), :].copy()

# flag renewals
signup["renewed"] = 0
signup.loc[signup["subscription_end_event"] == "RENEWED", "renewed"] = 1

# keep only observations from the end of the subscription year
signup = signup.loc[signup["age"] >= 11, :]

# % renewal
signup = signup.groupby("product_id")["renewed"].agg("mean").reset_index()


alt.Chart(signup).mark_bar().encode(
    x=alt.X("product_id:N", axis=alt.Axis(title="Product")),
    y=alt.Y("renewed:Q", axis=alt.Axis(format="%", title="% renewed after 12 months")),
).properties(width=400, title="% of customers who renew after 12 months")

In [34]:
print(sample["subscription_valid_until"].max())

2100-01-01 00:00:00


In [35]:
print(signup)

  product_id   renewed
0   Bus. You  0.950282
1      Metal  0.854038
2        You  0.880254
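
The renewal rate above is the share of users with end event `RENEWED` among users who have an end event at all. A minimal sketch with one row per user, assuming made-up data (`"RENEWED"` appears in the notebook; `"CANCELLED"` is an assumed placeholder label for any other end event):

```python
import pandas as pd

# Hypothetical toy data: one row per user who reached the end of the first year.
# "CANCELLED" is an assumed placeholder for any non-renewal end event.
users = pd.DataFrame(
    {
        "user_id": [1, 2, 3, 4, 5, 6],
        "product_id": ["You", "You", "You", "Metal", "Metal", "Metal"],
        "subscription_end_event": [
            "RENEWED", "RENEWED", "CANCELLED", "RENEWED", "CANCELLED", None,
        ],
    }
)

# Users without an end event have not reached a decision yet: drop them.
decided = users.loc[users["subscription_end_event"].notna()].copy()
decided["renewed"] = (decided["subscription_end_event"] == "RENEWED").astype(int)

# Mean of a 0/1 flag = share of users who renewed.
renewal_rate = decided.groupby("product_id")["renewed"].mean()
print(renewal_rate)
```

Working with one row per user avoids counting a user several times when the underlying table holds one row per user and month.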


# 3. How does this differ by market? 

In [36]:
signup = sample.loc[sample["enter_reason"] == "SIGNUP", :]

# create cohorts.
signup["cohort"] = pd.to_datetime(signup["subscription_valid_from"]).dt.to_period("M")

signup = (
    signup.groupby(["cohort", "age", "product_id", "country_tnc_legal"])["user_id"]
    .nunique()
    .reset_index()
)

signup["cohort_size"] = signup.groupby(["product_id", "cohort", "country_tnc_legal"])[
    "user_id"
].transform("max")
signup["perc"] = 100 * signup["user_id"] / signup["cohort_size"]

# average over cohorts.
signup = (
    signup.groupby(["product_id", "age", "country_tnc_legal"])["perc"]
    .agg("mean")
    .reset_index()
)


alt.Chart(signup.loc[signup["age"] < 12, :]).mark_line().encode(
    x=alt.X("age:Q", axis=alt.Axis(title="Age in months")),
    y=alt.Y("perc:Q", axis=alt.Axis(title="% retained")),
    color="country_tnc_legal:N",
    tooltip=["age", "product_id", "country_tnc_legal", "perc"],
).properties(title="% premium retention for sign up flow", width=200, height=200).facet(
    facet="product_id:N", columns=3
).interactive()

In [37]:
print(signup.loc[signup["age"] == 11, :])

    product_id  age country_tnc_legal       perc
66    Bus. You   11        AUT         51.666946
67    Bus. You   11        DEU         63.164998
68    Bus. You   11        ESP         18.825207
69    Bus. You   11        FRA         43.931718
70    Bus. You   11        ITA         32.340924
71    Bus. You   11        RoE         29.593810
140      Metal   11        AUT         47.382614
141      Metal   11        DEU         67.251036
142      Metal   11        ESP         27.450109
143      Metal   11        FRA         40.847296
144      Metal   11        ITA         31.565738
145      Metal   11        RoE         41.329024
212        You   11        AUT         57.815223
213        You   11        DEU         71.335107
214        You   11        ESP         36.240372
215        You   11        FRA         52.472301
216        You   11        ITA         42.192211
217        You   11        RoE         36.322411


In [None]:
## 12 month renewal rate

In [38]:
# look at % renewal.
signup = sample.loc[sample["enter_reason"] == "SIGNUP", :]

# drop users that could have not even made this decision
signup = signup.loc[signup["subscription_end_event"].isna() == False, :]

signup["renewed"] = 0
signup.loc[signup["subscription_end_event"] == "RENEWED", "renewed"] = 1

signup = signup.loc[signup["age"] >= 11, :]

# % renewal
signup = (
    signup.groupby(["product_id", "country_tnc_legal"])["renewed"]
    .agg("mean")
    .reset_index()
)


alt.Chart(signup).mark_bar().encode(
    x=alt.X("country_tnc_legal:N", axis=alt.Axis(title="Market")),
    y=alt.Y("renewed:Q", axis=alt.Axis(format="%", title="% renewed after 12 months")),
    column="product_id:N",
).properties(width=200, title="% of customers who renew after 12 months")

# 4. All of the above was on the sign up flow: is the same true for new Premium users who upgraded? 

### Within year retention is much better for upgraders than for signups!

In [39]:
signup = sample.loc[sample["enter_reason"] == "UPGRADED", :]

# create cohorts.
signup["cohort"] = pd.to_datetime(signup["subscription_valid_from"]).dt.to_period("M")

signup = (
    signup.groupby(["cohort", "age", "product_id"])["user_id"].nunique().reset_index()
)

signup["cohort_size"] = signup.groupby(["product_id", "cohort"])["user_id"].transform(
    "max"
)
signup["perc"] = 100 * signup["user_id"] / signup["cohort_size"]

# average over cohorts.
signup = signup.groupby(["product_id", "age"])["perc"].agg("mean").reset_index()


alt.Chart(signup.loc[signup["age"] < 12, :]).mark_line().encode(
    x=alt.X("age:Q", axis=alt.Axis(title="Age in months")),
    y=alt.Y("perc:Q", axis=alt.Axis(title="% retained")),
    color="product_id:N",
    tooltip=["age", "product_id", "perc"],
).properties(
    title="% premium retention for upgraded flow", width=400, height=400
).interactive()

In [40]:
signup = sample.loc[sample["enter_reason"] == "UPGRADED", :]

# create cohorts.
signup["cohort"] = pd.to_datetime(signup["subscription_valid_from"]).dt.to_period("M")

signup = (
    signup.groupby(["cohort", "age", "product_id", "country_tnc_legal"])["user_id"]
    .nunique()
    .reset_index()
)

signup["cohort_size"] = signup.groupby(["product_id", "cohort", "country_tnc_legal"])[
    "user_id"
].transform("max")
signup["perc"] = 100 * signup["user_id"] / signup["cohort_size"]

# average over cohorts.
signup = (
    signup.groupby(["product_id", "age", "country_tnc_legal"])["perc"]
    .agg("mean")
    .reset_index()
)


alt.Chart(signup.loc[signup["age"] < 12, :]).mark_line().encode(
    x=alt.X("age:Q", axis=alt.Axis(title="Age in months")),
    y=alt.Y("perc:Q", axis=alt.Axis(title="% retained")),
    color="country_tnc_legal:N",
    tooltip=["age", "product_id", "country_tnc_legal", "perc"],
).properties(
    title="% premium retention for upgraded flow", width=200, height=200
).facet(
    facet="product_id:N", columns=3
).interactive()

In [41]:
print(signup.loc[signup["age"] == 4, :])

    product_id  age country_tnc_legal       perc
24    Bus. You    4        AUT         84.955934
25    Bus. You    4        DEU         92.342951
26    Bus. You    4        ESP         71.513424
27    Bus. You    4        FRA         82.067789
28    Bus. You    4        ITA         78.812072
29    Bus. You    4        RoE         70.403112
96       Metal    4        AUT         87.445597
97       Metal    4        DEU         94.814104
98       Metal    4        ESP         76.903816
99       Metal    4        FRA         81.601424
100      Metal    4        ITA         77.302963
101      Metal    4        RoE         74.982949
169        You    4        AUT         90.476153
170        You    4        DEU         94.319771
171        You    4        ESP         74.667517
172        You    4        FRA         81.023169
173        You    4        ITA         79.517090
174        You    4        RoE         74.269420


### Renewal rates for upgrade YOU are worse than for YOU sign up!

In [42]:
# look at % renewal.
signup = sample.loc[sample["enter_reason"] == "UPGRADED", :]

signup["renewed"] = 0
signup.loc[signup["subscription_end_event"] == "RENEWED", "renewed"] = 1

# drop users that could have not even made this decision
signup = signup.loc[signup["subscription_end_event"].isna() == False, :]

signup = signup.loc[signup["age"] >= 11, :]

# % renewal
signup = (
    signup.groupby(["product_id", "country_tnc_legal"])["renewed"]
    .agg("mean")
    .reset_index()
)


alt.Chart(signup).mark_bar().encode(
    x=alt.X("country_tnc_legal:N", axis=alt.Axis(title="Market")),
    y=alt.Y("renewed:Q", axis=alt.Axis(format="%", title="% renewed after 12 months")),
    column="product_id:N",
).properties(width=200, title="% of customers who renew after 12 months")

In [43]:
print(signup.head(20))

   product_id country_tnc_legal   renewed
0    Bus. You        AUT         0.968718
1    Bus. You        DEU         0.917637
2    Bus. You        ESP         0.942069
3    Bus. You        FRA         0.965834
4    Bus. You        ITA         0.944240
5    Bus. You        RoE         0.897810
6       Metal        AUT         0.799883
7       Metal        DEU         0.769283
8       Metal        ESP         0.878227
9       Metal        FRA         0.892659
10      Metal        ITA         0.886901
11      Metal        RoE         0.802356
12        You        AUT         0.842219
13        You        DEU         0.776614
14        You        ESP         0.792087
15        You        FRA         0.827423
16        You        ITA         0.777571
17        You        RoE         0.844422


# Part B : Predicting early/within year churn

This is very easy. Apart from revocations and fraud, the way to drop out of your premium membership (during that period) is dunning. 

I propose to use two indicators to detect which users are likely to churn: MAU status and dunning.

My findings are: 

- MAU lapse status is predictive of churning.
- The longer a user has been in lapse status, the higher the churn probability (from 5% for an MAU to 18-23% for a 3-month lapser).
- Lapsing is sticky: someone who has already lapsed for two months has a very low recovery probability of 20%. 
- Predicting who is going to be blocked is easy: look at users in dunning.
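
The "months in lapse" indicator behind these findings can be read as the number of consecutive months without MAU status. A minimal sketch of one way to compute such a streak so that it resets when a user recovers, on made-up data (the `spell` and `lapsed` helper columns are illustrative names, not part of the original tables):

```python
import pandas as pd

# Hypothetical toy data: monthly MAU flag (1 = active that month) for two users.
toy = pd.DataFrame(
    {
        "user_id": [1] * 6 + [2] * 4,
        "month": list(range(6)) + list(range(4)),
        "mau": [1, 0, 0, 1, 0, 0, 1, 1, 0, 0],
    }
)
toy = toy.sort_values(["user_id", "month"]).reset_index(drop=True)

# Every active month starts a new "spell"; the lapse months that follow it
# belong to the same spell.
toy["spell"] = toy.groupby("user_id")["mau"].cumsum()

# Count lapse months within each spell: the counter resets on recovery.
toy["lapsed"] = 1 - toy["mau"]
toy["time_lapse"] = toy.groupby(["user_id", "spell"])["lapsed"].cumsum()
print(toy[["user_id", "month", "mau", "time_lapse"]])
```

A plain cumulative sum of lapse months per user, by contrast, keeps growing across separate lapse episodes, so it measures total months in lapse so far rather than the current streak.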



In [167]:
query = """
select  dwh.month, 
        zu.user_id,
        case when zu.country_tnc_legal in ('DEU','ESP','FRA','AUT','ITA') then zu.country_tnc_legal else 'RoE' end as country_tnc_legal,
        zup.product_id, 
        zup.enter_reason, 
        zup.subscription_valid_from, 
        zup.subscription_valid_until, 
        datediff(month, subscription_valid_from, dwh.end_time) as age, 
        datediff(month, subscription_valid_from, case when subscription_valid_until > current_date then null else subscription_valid_until end) as churn_age, 
        subscription_end_event,
		activity_end, 
		case when act.user_created is not null then 1 else 0 end as mau, 
        sum(balance_cents_euro) as eom_balance
from dwh_cohort_months as dwh 
left join dbt.zrh_user_product as zup 
    on dwh.end_time between zup.subscription_valid_from and zup.subscription_valid_until 
    and product_id in ('BLACK_CARD_MONTHLY','METAL_CARD_MONTHLY','BUSINESS_BLACK')
left join dbt.zrh_user_activity_txn as act 
	on act.user_created = zup.user_created 
	and act.activity_type = '1_tx_35' 
	and dwh.end_time between activity_start and activity_end 
left join dbt.zrh_monthly_balance_aud as mba 
	on mba.user_created = zup.user_created 
	and mba.month = end_time 
inner join dbt.zrh_users as zu on zu.user_created = zup.user_created
where subscription_valid_from between '2018-10-01' and '2019-07-01'
    and dwh.start_time between '2010-01-01' and current_date  
    and enter_reason in ('UPGRADED','SIGNUP')
group by 1,2,3,4,5,6,7,8,9,10,11,12

"""

In [44]:
sample = pd.read_csv("churn_sample_mau.csv")

# drop users that could have not even made this decision
sample = sample.loc[sample["subscription_end_event"].isna() == False, :]

# recode product
sample.loc[sample["product_id"] == "BLACK_CARD_MONTHLY", "product_id"] = "You"
sample.loc[sample["product_id"] == "METAL_CARD_MONTHLY", "product_id"] = "Metal"
sample.loc[sample["product_id"] == "BUSINESS_BLACK", "product_id"] = "Bus. You"

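# sort chronologically within each user so that the cumulative sums below
# run in time order (sort_values returns a new frame, so assign the result)
sample = sample.sort_values(by=["user_id", "month"])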

# code up mau variables
sample["time_since_mau"] = 0
sample.loc[sample["mau"] == 0, "time_since_mau"] = 1

# create time in lapse variable.
sample["time_lapse"] = sample.groupby("user_id")["time_since_mau"].cumsum()
sample.loc[sample["mau"] == 1, "time_lapse"] = 0

# end_period = 1 if this was the churn period
sample["age"] = sample["age"].astype(float)
sample["churn_age"] = sample["churn_age"].astype(float) - 1

sample["end_period"] = 0
sample.loc[sample["age"] == sample["churn_age"], "end_period"] = 1

In [45]:
lapse = sample.groupby(["product_id", "time_lapse"])["user_id"].nunique().reset_index()

lapse["perc"] = (
    100 * lapse["user_id"] / lapse.groupby("product_id")["user_id"].transform("sum")
)

chart = (
    alt.Chart(lapse)
    .mark_line()
    .encode(
        x=alt.X("time_lapse:Q", axis=alt.Axis(title="Months in lapse")),
        y=alt.Y("perc:Q", axis=alt.Axis(title="% of users")),
        color="product_id:N",
    )
    .properties(title="Looking at lapsing: how long are users typically in lapse?")
)

# display the chart: a bare assignment renders nothing in a notebook
chart

## How likely is it that you will churn by months since lapse: 
- MAU: 10% 
- 1 month lapser: 20% 
- 2 month lapser: 25%
- 3 month lapser: 30%

--> If a premium user has not been an MAU for three months, more than one in four drops out of their membership.


In [46]:
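# average the churn flag only: a blanket mean over all columns fails on the
# non-numeric ones in recent pandas versions
lapse = (
    sample.groupby(["product_id", "time_lapse"])["end_period"].agg("mean").reset_index()
)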

alt.Chart(lapse.loc[lapse["time_lapse"] <= 10, :]).mark_line().encode(
    x=alt.X("time_lapse:Q", axis=alt.Axis(title="Months since lapse")),
    y=alt.Y(
        "end_period:Q", axis=alt.Axis(format="%", title="% of customers that churn")
    ),
    color="product_id:N",
)

The last graph showed that users who have been lapsed for three months or longer are at a very high risk of dropping out of their membership. A simple conclusion would be to start talking to customers who have been lapsed for that long to try and change their minds.

A downside of this approach is that it might already be too late at this stage. We might also not want to bombard users that have only recently lapsed (and would recover anyway) with communications. Somewhere in between these two extremes there is a sweet-spot. 

This next graph tries to figure out how likely it is for a customer who has only recently lapsed to ever recover (i.e. become an MAU again). 

### What % of those that have been in lapse for 1/2/3.. months turn MAU in the next period? 


- Someone who has already lapsed for one month recovers with 30-40% probability next month. 
- Someone who has already been in lapse for two months recovers with 20-30% probability. 

Conclusion: after two months the chance of recovery is already quite low. 
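
The recovery probabilities above compare each lapsed month with the same user's MAU status one month later. A minimal sketch of that look-ahead on made-up data (the values are illustrative only):

```python
import pandas as pd

# Hypothetical toy data: monthly MAU flag and current months in lapse per user.
toy = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 1, 2, 2, 2, 2],
        "month": [0, 1, 2, 3, 0, 1, 2, 3],
        "mau": [1, 0, 1, 1, 1, 0, 0, 0],
        "time_lapse": [0, 1, 0, 0, 0, 1, 2, 3],
    }
)

# Rows must be in chronological order per user before shifting.
toy = toy.sort_values(["user_id", "month"])

# Next month's MAU flag; NaN for each user's last observed month.
toy["recover_next_period"] = toy.groupby("user_id")["mau"].shift(-1)

# Share of lapsed user-months followed by an active month (NaN rows are ignored).
recovery = (
    toy.loc[toy["time_lapse"] > 0]
    .groupby("time_lapse")["recover_next_period"]
    .mean()
)
print(recovery)
```

The last observed month of every user has no "next month", so it drops out of the average. For long lapse durations this leaves few observations, which is one reason to read the right-hand end of such a curve with care.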

In [47]:
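# work on a chronologically sorted frame so that shift(-1) picks up the
# next month's status (sort_values returns a new frame, so assign the result)
lapse = sample.sort_values(by=["user_id", "month"])

lapse["recover_next_period"] = lapse.groupby(["user_id"])["mau"].shift(-1)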

# compute % that recover next period after having lapsed x months

lapse = (
    lapse.groupby(["product_id", "time_lapse"])["recover_next_period"]
    .agg("mean")
    .reset_index()
)

alt.Chart(
    lapse.loc[(lapse["time_lapse"] > 0) & (lapse["time_lapse"] <= 10), :]
).mark_line().encode(
    x=alt.X("time_lapse:Q", axis=alt.Axis(title="Months since lapse")),
    y=alt.Y(
        "recover_next_period:Q",
        axis=alt.Axis(format="%", title="% of customers that turn MAU next month"),
    ),
    color="product_id:N",
).properties(
    title="Months since lapse and % of users that turn MAU again next period"
)

This section argued that early churn is an activation problem: users do not use the account from the start. We already demonstrated above that users who lapse are less likely to stay premium customers. The next section shows that customers who churn early in fact never put any significant amount of money on their accounts in the first place. This stands in contrast to late-stage churn (i.e. months 5-12) and non-renewals. 


## What can we learn from account balances over time?

This next graph groups customers into buckets according to the month of their subscription in which they churned. We then show how the median balance in each group develops over time (the x-axis is the month of membership). 


1. Users who churn early never activate in the first place.

2. Late churners activate but also eventually fall into dunning.

3. Non-renewals are early on by far the most active but also slowly dis-save.

In [48]:
sample = pd.read_csv("churn_sample_mau.csv")
# recode product
sample.loc[sample["product_id"] == "BLACK_CARD_MONTHLY", "product_id"] = "You"
sample.loc[sample["product_id"] == "METAL_CARD_MONTHLY", "product_id"] = "Metal"
sample.loc[sample["product_id"] == "BUSINESS_BLACK", "product_id"] = "Bus. You"

# drop users that could have not even made this decision
sample = sample.loc[sample["subscription_end_event"].isna() == False, :]

lapse = sample

lapse["eom_balance"] = lapse["eom_balance"] / 100

lapse["end_period"] = 0
lapse.loc[lapse["age"] == lapse["churn_age"], "end_period"] = 1
lapse["eventually_churns"] = 0
lapse.loc[lapse["subscription_end_event"] != "RENEWED", "eventually_churns"] = 1

# keep those that churn
lapse["churn_age"] = lapse["churn_age"].astype(float) - 1

lapse = lapse.loc[lapse["eventually_churns"] == 1, :]
lapse = lapse.groupby(["churn_age", "age"])["eom_balance"].agg("median").reset_index()

# lapse = lapse.loc[lapse["churn_age"]<]

alt.Chart(lapse.loc[lapse["churn_age"] >= 2, :]).mark_line().encode(
    x=alt.X("age:Q", axis=alt.Axis(title="Months since subscription start")),
    y=alt.Y("eom_balance:Q", axis=alt.Axis(title="Median balance")),
    color="churn_age:N",
).properties(
    title="Median balance for customers that drop out of premium by timing of churn",
    width=600,
    height=400,
)

### Compute each user's maximum end of month balance and see how these distributions differ across churn groups.

- Early churn is an activation problem: 65-80% of users who drop out early (churn age 3&4) have a 
maximum account balance of basically zero.

- For users that cancel after that, this is not true. In general, you can say that the longer users
stay with us the larger the maximum end of month balance they had.

--> these late churners gave the product a chance but then decided to drop out / cancel.
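
The "maximum balance of basically zero" statistic can be sketched as follows: take each user's highest end-of-month balance and compute, per churn age, the share of users who never exceeded a small threshold. The data are made up, and the 10 euro cut-off is an illustrative choice for "basically zero":

```python
import pandas as pd

# Hypothetical toy data: end-of-month balances in euros for churned users.
toy = pd.DataFrame(
    {
        "user_id": [1, 1, 2, 2, 3, 3, 4, 4],
        "churn_age": [3, 3, 3, 3, 8, 8, 8, 8],
        "eom_balance": [0.0, 4.0, 0.0, 250.0, 120.0, 900.0, 50.0, 20.0],
    }
)

# Lifetime maximum end-of-month balance per user.
max_balance = (
    toy.groupby(["churn_age", "user_id"])["eom_balance"].max().reset_index()
)

# Share of users per churn age who never held more than 10 euros at month end.
max_balance["never_topped_up"] = max_balance["eom_balance"] <= 10
share = max_balance.groupby("churn_age")["never_topped_up"].mean()
print(share)
```

The cumulative distribution in the chart below carries the same information for every threshold at once; a single cut-off is simply easier to quote.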


In [49]:
sample = pd.read_csv("churn_sample_mau.csv")
# recode product
sample.loc[sample["product_id"] == "BLACK_CARD_MONTHLY", "product_id"] = "You"
sample.loc[sample["product_id"] == "METAL_CARD_MONTHLY", "product_id"] = "Metal"
sample.loc[sample["product_id"] == "BUSINESS_BLACK", "product_id"] = "Bus. You"

# drop users that could have not even made this decision
sample = sample.loc[sample["subscription_end_event"].isna() == False, :]

lapse = sample
lapse["eom_balance"] = lapse["eom_balance"] / 100

lapse["end_period"] = 0
lapse.loc[lapse["age"] == lapse["churn_age"], "end_period"] = 1
lapse["eventually_churns"] = 0
lapse.loc[lapse["subscription_end_event"] != "RENEWED", "eventually_churns"] = 1

# keep those that churn
lapse["churn_age"] = lapse["churn_age"].astype(float) - 1

lapse = lapse.loc[lapse["eventually_churns"] == 1, :]
lapse = lapse.groupby(["churn_age", "user_id"])["eom_balance"].agg("max").reset_index()
# round balance
lapse["eom_balance"] = lapse["eom_balance"].round(-1)

# count freqs
lapse = lapse.groupby(["churn_age", "eom_balance"])["user_id"].nunique().reset_index()
lapse["perc"] = (
    100 * lapse["user_id"] / lapse.groupby("churn_age")["user_id"].transform("sum")
)

lapse["cum"] = lapse.groupby("churn_age")["perc"].cumsum()

alt.Chart(
    lapse.loc[
        (lapse["eom_balance"] > -100)
        & (lapse["eom_balance"] < 1000)
        & (lapse["churn_age"] >= 2),
        :,
    ]
).mark_line().encode(
    y=alt.Y("cum:Q", axis=alt.Axis(title="% of customers")),
    x=alt.X("eom_balance:Q", axis=alt.Axis(title="Max EoM lifetime account balance")),
    color="churn_age:N",
).properties(
    title="Max end of month balance by churn age", width=600, height=400
)

# Part C : Predicting non-renewal + late churn.

In Part A we saw that early churn is very important: 25-35% of a cohort drops off in these first months. Part B showed us that the reason for this is customer activation. These users drop out because they never top up (65% of users who churn in month 4 never had more than 10 Euros on their account at EoM).

However, this is not the full picture yet. Part A also showed us that we lose another 30 pp (for Metal) between months 5 and 11 and that only roughly 70-80% of those who are still with us at renewal stick with their product. 
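
To see how the two effects compound, a back-of-the-envelope calculation helps. It uses only the rounded figures quoted in this report for the Metal sign-up flow (about 40% still subscribed at month 12, 70-80% of those renewing) and is an illustration, not a new measurement:

```python
# Rounded figures quoted in this report for the Metal sign-up flow.
survive_to_month_12 = 0.40  # share of the cohort still subscribed at month 12
renewal_low, renewal_high = 0.70, 0.80  # share of those survivors who renew

# Share of the original cohort that enters a second subscription year.
second_year_low = survive_to_month_12 * renewal_low
second_year_high = survive_to_month_12 * renewal_high
print(f"{second_year_low:.0%} to {second_year_high:.0%} of the cohort start year two")
```

Under these figures, roughly three in ten sign-up customers continue into a second year, with within-year churn accounting for the larger part of the loss.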

The reason why customers churn away in this later part is less straightforward: Part B already revealed that it's not true that these late churners/non-renewals never top up. They do use the account but then for some reason drop out. 

In Section C, we investigate why this could be! 

## We saw that users that do not renew initially looked quite active: their account balances peaked around month 2-3. 

Does this mean that they initially look the same as users that will end up renewing? No: by and large, non-renewals have lower balances than renewals from the very start. The next graph splits the customers into two segments, those that renew after 12 months and those that do not (we drop customers that drop out through dunning/AML/revocation), and computes the median balance over time. 

In [50]:
sample = pd.read_csv("churn_sample_mau.csv")
# recode product
sample.loc[sample["product_id"] == "BLACK_CARD_MONTHLY", "product_id"] = "You"
sample.loc[sample["product_id"] == "METAL_CARD_MONTHLY", "product_id"] = "Metal"
sample.loc[sample["product_id"] == "BUSINESS_BLACK", "product_id"] = "Bus. You"


# drop users that could have not even made this decision
sample = sample.loc[sample["subscription_end_event"].isna() == False, :]

lapse = sample

lapse["eom_balance"] = lapse["eom_balance"] / 100

lapse["end_period"] = 0
lapse.loc[lapse["age"] == lapse["churn_age"], "end_period"] = 1
lapse["renewal"] = False
lapse.loc[lapse["subscription_end_event"] == "RENEWED", "renewal"] = True

# keep users whose subscription ran for the full first year
lapse["churn_age"] = lapse["churn_age"].astype(float) - 1
lapse = lapse.loc[lapse["churn_age"] >= 11, :]
lapse = lapse.groupby(["renewal", "age"])["eom_balance"].agg("median").reset_index()

# lapse = lapse.loc[lapse["churn_age"]<]

alt.Chart(lapse).mark_line().encode(
    x=alt.X("age:Q", axis=alt.Axis(title="Months since subscription start")),
    y=alt.Y("eom_balance:Q", axis=alt.Axis(title="Median balance")),
    color="renewal:N",
).properties(
    title="Median balance of customers that renew vs that cancel after 12 months",
    width=600,
    height=400,
)

Interestingly, both groups peak early. Is this the same for all products? The next graph shows this visualization by product.

In [51]:
sample = pd.read_csv("churn_sample_mau.csv")
# recode product
sample.loc[sample["product_id"] == "BLACK_CARD_MONTHLY", "product_id"] = "You"
sample.loc[sample["product_id"] == "METAL_CARD_MONTHLY", "product_id"] = "Metal"
sample.loc[sample["product_id"] == "BUSINESS_BLACK", "product_id"] = "Bus. You"


# drop users that could have not even made this decision
sample = sample.loc[sample["subscription_end_event"].isna() == False, :]

lapse = sample

lapse["eom_balance"] = lapse["eom_balance"] / 100

lapse["end_period"] = 0
lapse.loc[lapse["age"] == lapse["churn_age"], "end_period"] = 1
lapse["renewal"] = False
lapse.loc[lapse["subscription_end_event"] == "RENEWED", "renewal"] = True

# keep users whose subscription ran for the full first year
lapse["churn_age"] = lapse["churn_age"].astype(float) - 1
lapse = lapse.loc[lapse["churn_age"] >= 11, :]

# keep sign up flow
lapse = lapse.loc[lapse["enter_reason"] == "SIGNUP", :]

lapse = (
    lapse.groupby(["renewal", "age", "product_id"])["eom_balance"]
    .agg("median")
    .reset_index()
)

# lapse = lapse.loc[lapse["churn_age"]<]

alt.Chart(lapse).mark_line().encode(
    x=alt.X("age:Q", axis=alt.Axis(title="Months since subscription start")),
    y=alt.Y("eom_balance:Q", axis=alt.Axis(title="Median balance")),
    color="renewal:N",
    column="product_id:N",
).properties(
    title="Median balance of customers that renew vs that cancel after 12 months (signup)",
    width=200,
    height=400,
)

And similarly, let's look at it for upgrade users. Since these customers typically have already been with us for a bit, they start with higher balances. 

In [52]:
sample = pd.read_csv("churn_sample_mau.csv")
# recode product
sample.loc[sample["product_id"] == "BLACK_CARD_MONTHLY", "product_id"] = "You"
sample.loc[sample["product_id"] == "METAL_CARD_MONTHLY", "product_id"] = "Metal"
sample.loc[sample["product_id"] == "BUSINESS_BLACK", "product_id"] = "Bus. You"


# drop users whose subscription has not ended yet (no renewal decision so far)
sample = sample.loc[sample["subscription_end_event"].notna(), :]

lapse = sample

lapse["eom_balance"] = lapse["eom_balance"] / 100

lapse["end_period"] = 0
lapse.loc[lapse["age"] == lapse["churn_age"], "end_period"] = 1
lapse["renewal"] = False
lapse.loc[lapse["subscription_end_event"] == "RENEWED", "renewal"] = True

# keep only users who stayed until the renewal decision
lapse["churn_age"] = lapse["churn_age"].astype(float) - 1
lapse = lapse.loc[lapse["churn_age"] >= 11, :]

# keep upgrade flow
lapse = lapse.loc[lapse["enter_reason"] == "UPGRADED", :]

lapse = (
    lapse.groupby(["renewal", "age", "product_id"])["eom_balance"]
    .agg("median")
    .reset_index()
)

# lapse = lapse.loc[lapse["churn_age"]<]

alt.Chart(lapse).mark_line().encode(
    x=alt.X("age:Q", axis=alt.Axis(title="Months since subscription start")),
    y=alt.Y("eom_balance:Q", axis=alt.Axis(title="Median balance")),
    color="renewal:N",
    column="product_id:N",
).properties(
    title="Median balance of customers that renew vs that cancel after 12 months (upgrade)",
    width=200,
    height=400,
)

## Insights:

1. Renewal users tend to have higher balances from the very start. 
2. Non-renewals seem to intentionally dis-save. Reaching out to them in month 12 might be too late! 
3. For both sign up and upgrade flows, this trend of "initially having a high balance" and then spending it is highest for YOU customers (travel?)
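
Insight 2 can be made concrete by looking at the month-over-month change in the median balance per group. The sketch below uses a small made-up frame with the same column names as `lapse` (`renewal`, `age`, `eom_balance`); the numbers are illustrative only and are not results from this analysis.

```python
import pandas as pd

# Toy data (made up): median end-of-month balance per renewal group and month.
toy = pd.DataFrame(
    {
        "renewal": [True] * 4 + [False] * 4,
        "age": [0, 1, 2, 3] * 2,
        "eom_balance": [500.0, 520.0, 510.0, 530.0, 400.0, 300.0, 200.0, 100.0],
    }
)

toy = toy.sort_values(["renewal", "age"])
# Month-over-month change within each group; negative values mean dis-saving.
toy["balance_change"] = toy.groupby("renewal")["eom_balance"].diff()

# Average monthly change per group (the first month has no change and is skipped).
avg_change = toy.groupby("renewal")["balance_change"].mean()
print(avg_change)
```

A persistently negative average for the non-renewal group would be the pattern described as dis-saving above.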

One of the main benefits of our premium accounts is travel related. This next section explores how users make use of it and how that relates to their churn decision.

### Part C.2 Travel spend 


In [53]:
query = """

with start_data as ( 
	select user_created, created, amount_cents_eur, region_group, lag(region_group,1) over(partition by user_created order by created) as region_lag 
	from dbt.zrh_card_transactions
	where card_tx_type in ('cardpresent')  and created >= current_date - interval '24 months'
        and type = 'PT'
),
cluster as ( 
select user_created, 
		created, 
		region_group, 
		region_lag,
        amount_cents_eur,
		sum(case when region_group != region_lag then 1 else 0 end )
			over(partition by user_created order by created rows unbounded preceding) as group_id   
from start_data 
),
counts as ( 
select  user_id, 
		group_id, 
		region_group, 
        country_tnc_legal, 
        zup.product_id, 
        sum(amount_cents_eur::float/100) as volume,
		count(1) as transactions, 
		max(cluster.created)  as last_purchase, 
		min(cluster.created)  as first_purchase,
		datediff(days, first_purchase, last_purchase) as days_spent 
from cluster 
inner join dbt.zrh_users on cluster.user_created = dbt.zrh_users.user_created
inner join dbt.zrh_user_product as zup on zup.user_created = cluster.user_created 
        and created between subscription_valid_from and subscription_valid_until 
        and zup.product_id in ('BLACK_CARD_MONTHLY','METAL_CARD_MONTHLY','BUSINESS_BLACK')
where region_group = 'inter' 
group by 1,2,3,4,5 
)

select  dwh.month, 
        zu.user_id,
        case when zu.country_tnc_legal in ('DEU','ESP','FRA','AUT','ITA') then zu.country_tnc_legal else 'RoE' end as country_tnc_legal,
        zup.product_id, 
        zup.enter_reason, 
        zup.subscription_valid_from, 
        zup.subscription_valid_until, 
        datediff(month, subscription_valid_from, dwh.end_time) as age, 
        datediff(month, subscription_valid_from, case when subscription_valid_until > current_date then null else subscription_valid_until end) as churn_age, 
        subscription_end_event,
        sum(volume) as volume_abroad_euro,
        sum(days_spent+1) as days_abroad,
        sum(transactions) as txs_abroad
from dwh_cohort_months as dwh 
left join dbt.zrh_user_product as zup 
    on dwh.end_time between zup.subscription_valid_from and zup.subscription_valid_until 
    and product_id in ('BLACK_CARD_MONTHLY','METAL_CARD_MONTHLY','BUSINESS_BLACK')
inner join dbt.zrh_users as zu on zu.user_created = zup.user_created
left join counts 
    on counts.user_id = zu.user_id 
    and first_purchase between start_time and end_time 
    and days_spent > 0 -- drop trips that only took one day
where subscription_valid_from between '2018-10-01' and '2019-07-01'
    and dwh.start_time between '2010-01-01' and current_date  
    and enter_reason in ('UPGRADED','SIGNUP')
group by 1,2,3,4,5,6,7,8,9,10


"""

In [54]:
sample = pd.read_csv("churn_sample_travel.csv")

# recode product
sample.loc[sample["product_id"] == "BLACK_CARD_MONTHLY", "product_id"] = "You"
sample.loc[sample["product_id"] == "METAL_CARD_MONTHLY", "product_id"] = "Metal"
sample.loc[sample["product_id"] == "BUSINESS_BLACK", "product_id"] = "Bus. You"


# drop users whose subscription has not ended yet (no renewal decision so far)
sample = sample.loc[sample["subscription_end_event"].notna(), :]

# drop early churn
sample = sample.loc[sample["churn_age"] >= 6, :]

sample["volume_abroad_euro"] = sample["volume_abroad_euro"].fillna(0)
sample["days_abroad"] = sample["days_abroad"].fillna(0)
sample["txs_abroad"] = sample["txs_abroad"].fillna(0)

# dummy whether user renewed or not
sample["renewal"] = False
sample.loc[sample["subscription_end_event"] == "RENEWED", "renewal"] = True
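
The `cluster` CTE in the query above uses a gaps-and-islands pattern: it flags each transaction whose `region_group` differs from the previous one for the same user and takes a running sum of those flags, so consecutive transactions in the same region share a `group_id` (one "trip"). A minimal pandas sketch of the same idea on made-up data follows. The `intra` label is an illustrative assumption; only `inter` appears in the query.

```python
import pandas as pd

# Toy card transactions (made up), one row per transaction.
tx = pd.DataFrame(
    {
        "user_created": ["u1"] * 6,
        "created": pd.to_datetime(
            ["2019-01-01", "2019-01-05", "2019-01-06", "2019-01-08",
             "2019-02-01", "2019-03-10"]
        ),
        "region_group": ["intra", "inter", "inter", "inter", "intra", "inter"],
    }
).sort_values(["user_created", "created"])

# Equivalent of lag(region_group, 1) over (partition by user_created order by created).
tx["region_lag"] = tx.groupby("user_created")["region_group"].shift(1)

# Flag a region change; the first row per user has no lag and counts as "no change",
# mirroring how the SQL CASE treats a comparison with NULL.
changed = tx["region_lag"].notna() & (tx["region_group"] != tx["region_lag"])

# Running sum of change flags per user = island (trip) identifier.
tx["group_id"] = changed.astype(int).groupby(tx["user_created"]).cumsum()

# One row per trip outside the home region, with its length in days.
trips = (
    tx.loc[tx["region_group"] == "inter"]
    .groupby(["user_created", "group_id"])["created"]
    .agg(first_purchase="min", last_purchase="max")
    .reset_index()
)
trips["days_spent"] = (trips["last_purchase"] - trips["first_purchase"]).dt.days
print(trips)
```

In the query, the `days_spent > 0` join condition then drops trips whose first and last purchase fall on the same day, such as the last trip in this toy example.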

### What % of customers went on a trip outside the EEA? 
(Again we drop customers that churn during the year.)

- You: almost 40% of users travel, whether or not they renew. At least this dimension does not seem to explain the renewal decision for You.

- Metal and B. You: slightly higher travel rate for renewals (25%) than for non-renewals (15%).

In [55]:
travel = sample
travel = travel.loc[travel["churn_age"] >= 11, :]

travel = (
    travel.groupby(["product_id", "renewal", "user_id"])["days_abroad"]
    .agg("sum")
    .reset_index()
)

travel["traveller"] = 0
travel.loc[travel["days_abroad"] > 0, "traveller"] = 1

travel = (
    travel.groupby(["product_id", "renewal"])["traveller"].agg("mean").reset_index()
)

alt.Chart(travel).mark_bar().encode(
    x=alt.X("product_id:N", axis=alt.Axis(title="Product")),
    y=alt.Y("traveller:Q", axis=alt.Axis(format="%", title="% went on one trip")),
    column="renewal:N",
).properties(
    width=400,
    height=400,
    title="% of users that went on a non-EEA trip during their subscription",
)

Note: this only looks at users who either cancelled or renewed. We explicitly drop users that churned prior to the renewal decision. 

What does this look like if we also include mature churned users that got blocked (blocked after month 6)? 
- % of You who travel goes down by quite a bit: these users are much less likely to travel!
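
The comparison above and the variant that includes late churners differ only in the `churn_age` cutoff applied before aggregating. A hypothetical helper such as `traveller_share` (the name and the `min_churn_age` parameter are not part of the original notebook) makes that difference explicit. This is a sketch on made-up data, assuming the same column names as `sample`.

```python
import pandas as pd


def traveller_share(df: pd.DataFrame, min_churn_age: int) -> pd.DataFrame:
    """Share of users with at least one day abroad, by product and renewal flag."""
    kept = df.loc[df["churn_age"] >= min_churn_age]
    per_user = (
        kept.groupby(["product_id", "renewal", "user_id"])["days_abroad"]
        .sum()
        .reset_index()
    )
    per_user["traveller"] = (per_user["days_abroad"] > 0).astype(int)
    return (
        per_user.groupby(["product_id", "renewal"])["traveller"].mean().reset_index()
    )


# Toy user-month rows (made up).
toy = pd.DataFrame(
    {
        "user_id": [1, 1, 2, 3, 4],
        "product_id": ["You"] * 5,
        "renewal": [True, True, False, False, False],
        "churn_age": [12, 12, 11, 7, 7],
        "days_abroad": [0, 3, 0, 0, 5],
    }
)

decision_only = traveller_share(toy, min_churn_age=11)  # reached the renewal decision
with_late_churn = traveller_share(toy, min_churn_age=6)  # also includes late churners
print(decision_only)
print(with_late_churn)
```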


In [56]:
travel = (
    sample.groupby(["product_id", "renewal", "user_id"])["days_abroad"]
    .agg("sum")
    .reset_index()
)

travel["traveller"] = 0
travel.loc[travel["days_abroad"] > 0, "traveller"] = 1

travel = (
    travel.groupby(["product_id", "renewal"])["traveller"].agg("mean").reset_index()
)

alt.Chart(travel).mark_bar().encode(
    x=alt.X("product_id:N", axis=alt.Axis(title="Product")),
    y=alt.Y("traveller:Q", axis=alt.Axis(format="%", title="% went on one trip")),
    column="renewal:N",
).properties(
    width=400,
    height=400,
    title="% of users that went on a non-EEA trip during their subscription",
)

### When in their subscription do users go on trips? 

- B. You and Metal users go on trips throughout the year at basically the same rate. 

- You customers have a super strong peak in the first month: a lot of You customers seem to get the account exactly for a trip abroad.

- These patterns show up in both the upgrade and sign up flow. 

In [57]:
sample = pd.read_csv("churn_sample_travel.csv")

# recode product
sample.loc[sample["product_id"] == "BLACK_CARD_MONTHLY", "product_id"] = "You"
sample.loc[sample["product_id"] == "METAL_CARD_MONTHLY", "product_id"] = "Metal"
sample.loc[sample["product_id"] == "BUSINESS_BLACK", "product_id"] = "Bus. You"

# drop users whose subscription has not ended yet (no renewal decision so far)
sample = sample.loc[sample["subscription_end_event"].notna(), :]

# drop early churn
sample = sample.loc[sample["churn_age"] >= 11, :]

sample["volume_abroad_euro"] = sample["volume_abroad_euro"].fillna(0)
sample["days_abroad"] = sample["days_abroad"].fillna(0)
sample["txs_abroad"] = sample["txs_abroad"].fillna(0)

# dummy whether user renewed or not
sample["renewal"] = False
sample.loc[sample["subscription_end_event"] == "RENEWED", "renewal"] = True

In [58]:
travel = sample
travel["traveller"] = 0
travel.loc[travel["days_abroad"] > 0, "traveller"] = 1

# aggregate only the numeric flag; a mean over string columns fails in recent pandas
travel = travel.groupby(["age", "renewal"])["traveller"].agg("mean").reset_index()

alt.Chart(travel).mark_line().encode(
    x=alt.X("age:Q", axis=alt.Axis(title="Months since subscription start")),
    y=alt.Y("traveller:Q", axis=alt.Axis(title="% who went on trip", format="%")),
    color="renewal:N",
).properties(title="% of users who went on abroad trip", width=400, height=400)

Some of our users sign up in preparation for a trip abroad. Let's see how this differs by signup / upgrade flow and product. 




In [59]:
travel = sample
travel = travel.loc[travel["enter_reason"] == "SIGNUP", :]
travel["traveller"] = 0
travel.loc[travel["days_abroad"] > 0, "traveller"] = 1

# aggregate only the numeric flag; a mean over string columns fails in recent pandas
travel = (
    travel.groupby(["age", "renewal", "product_id"])["traveller"]
    .agg("mean")
    .reset_index()
)

alt.Chart(travel).mark_line().encode(
    x=alt.X("age:Q", axis=alt.Axis(title="Months since subscription start")),
    y=alt.Y("traveller:Q", axis=alt.Axis(title="% who went on trip", format="%")),
    color="renewal:N",
    column="product_id",
).properties(
    title="% of users who went on abroad trip: sign up flow", width=200, height=400
)

In [60]:
travel = sample
travel = travel.loc[travel["enter_reason"] == "UPGRADED", :]
travel["traveller"] = 0
travel.loc[travel["days_abroad"] > 0, "traveller"] = 1

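# aggregate only the numeric flag; a mean over string columns fails in recent pandas
travel = (
    travel.groupby(["age", "renewal", "product_id"])["traveller"]
    .agg("mean")
    .reset_index()
)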

alt.Chart(travel).mark_line().encode(
    x=alt.X("age:Q", axis=alt.Axis(title="Months since subscription start")),
    y=alt.Y("traveller:Q", axis=alt.Axis(title="% who went on trip", format="%")),
    color="renewal:N",
    column="product_id",
).properties(
    title="% of users who went on abroad trip: upgrade flow", width=200, height=400
)

This "early" peak is mainly driven by the You product.

Remember that the difference in "whether or not a user travelled" was not striking between You users that renewed and You users that did not renew (although it did exist for the other products). 

Maybe it's not about the "whether" but about the "when".

Hypothesis: users that travelled closer to the renewal decision are more likely to renew.

In [61]:
import numpy as np

travel = sample

travel["trip_month"] = np.nan
travel.loc[travel["days_abroad"] > 0, "trip_month"] = travel.loc[
    travel["days_abroad"] > 0, "age"
]

travel = (
    travel.groupby(["user_id", "renewal", "product_id"])["trip_month"]
    .agg("max")
    .reset_index()
)

travel = (
    travel.groupby(["trip_month", "product_id"])["renewal"].agg("mean").reset_index()
)

alt.Chart(travel).mark_line().encode(
    x=alt.X(
        "trip_month:Q", axis=alt.Axis(title="Subscription month of last non-EEA trip")
    ),
    y=alt.Y("renewal:Q", axis=alt.Axis(format="%", title="% who renewed")),
    column="product_id",
).properties(
    title="Timing of last non-EEA trip and renewal decision", width=200, height=400
)

A You user who goes on a trip at the beginning of their subscription has a 60% probability of renewing. If the user goes on a trip towards the end of their subscription, they renew with a probability of > 90%. 

This could suggest that it's not only about value but also about value perception (whether users recall having benefitted).
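
The renewal rates per `trip_month` may rest on groups of quite different sizes, so it can help to attach an uncertainty band before reading too much into the gap. A minimal sketch using the Wilson score interval follows. The helper name `wilson_interval` and the counts are illustrative assumptions, not figures from this analysis.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# Made-up counts: 9 of 10 users renewed vs. 900 of 1000 users renewed.
small = wilson_interval(9, 10)
large = wilson_interval(900, 1000)
print(small, large)
```

Both toy groups have the same 90% renewal rate, but the interval for the small group is far wider, which is the caveat to keep in mind for sparsely populated trip months.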

### Part C.3. Financial activity

Let's recap what we have learnt so far! 
Part C.1. focussed on the balances of our customers. Here, we spotted that a) future churners tend to have lower balances from the very start, but also b) that they do use the account initially. 

A pattern that we spotted among the non-renewers was that they gradually spend down their money: this could suggest that they make the non-renewal decision earlier in their time with us. Engaging with them in month 10/11 might be too late? 

In Part C.2. we tried to understand why the balances of our premium customers spike in the first month of their subscription. We found that this is driven by You accounts, which led us to investigate travel behavior. 40-45% of our You customers travel with their account and, interestingly, *a lot of them* do so early on, in their first months. 

Does travel hence explain why users don't renew? Maybe. In fact, non-renewals also travel, which speaks against the hypothesis. However, we also found that the later in their subscription period a user goes on a trip, the more likely they are to renew: this could speak to the importance of value perception. 

### What's next? We are a bank account! Let's see how non-renewals vs renewals behave financially.

In [62]:
sample = pd.read_csv("churn_sample.csv")

sample.loc[sample["product_id"] == "BLACK_CARD_MONTHLY", "product_id"] = "You"
sample.loc[sample["product_id"] == "METAL_CARD_MONTHLY", "product_id"] = "Metal"
sample.loc[sample["product_id"] == "BUSINESS_BLACK", "product_id"] = "Bus. You"

# drop users that have not finished yet.
sample = sample.loc[sample["subscription_end_event"].isna() == False, :]

# drop early churn
sample["churn_age"] = sample["churn_age"].astype(float) - 1
sample = sample.loc[sample["churn_age"] >= 6, :]

# pivot table
index_cols = [col for col in sample.columns.values if col not in ["feature", "value"]]
sample = (
    pd.pivot_table(sample, index=index_cols, columns="feature", values="value")
    .reset_index()
    .fillna(0)
)

# make it long again --> purpose of this exercise is to ensure that we dont omit users that never do anything from this.
sample = pd.melt(sample, id_vars=index_cols)

# divide amounts by 100
sample.loc[sample["feature"].str.match("n_.*") == False, "value"] = (
    sample.loc[sample["feature"].str.match("n_.*") == False, "value"] / 100
)

# create comparison group
sample["group"] = "Late churn"
sample.loc[sample["subscription_end_event"] == "RENEWED", "group"] = "Renewed"
sample.loc[sample["subscription_end_event"] == "UPGRADED", "group"] = "Renewed"
sample.loc[sample["subscription_end_event"] == "REACTIVATED", "group"] = "Renewed"

sample.loc[sample["subscription_end_event"] == "BLOCKED", "group"] = "Late churn"
sample.loc[sample["subscription_end_event"] == "CANCELLED", "group"] = "Cancellation"
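
The pivot-then-melt round trip in the cell above exists so that users without a given transaction type in a month still appear with a value of 0 instead of dropping out of medians and shares. A minimal sketch on made-up data, assuming a long table with `user_id`, `age`, `feature` and `value` columns:

```python
import pandas as pd

# Toy long table (made up): user 2 has no "n_ct" row at all.
long_df = pd.DataFrame(
    {
        "user_id": [1, 1, 2],
        "age": [0, 0, 0],
        "feature": ["n_pt", "n_ct", "n_pt"],
        "value": [5, 2, 1],
    }
)

index_cols = ["user_id", "age"]

# Wide: one column per feature; missing combinations become NaN, then 0.
wide = (
    pd.pivot_table(long_df, index=index_cols, columns="feature", values="value")
    .reset_index()
    .fillna(0)
)

# Long again: every user now has one row per feature, including explicit zeros.
# var_name is passed explicitly here for clarity.
filled = pd.melt(wide, id_vars=index_cols, var_name="feature", value_name="value")
print(filled.sort_values(index_cols + ["feature"]))
```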

## How often in a month do users that renew, churn late or cancel make certain transactions? 

In [63]:
# look at counts+
from numpy import median
import matplotlib.pyplot as plt

# plt.figure(figsize=(20, 10))
# ax = sns.barplot(x='feature', y='value', hue='group', estimator = median, data=sample.loc[sample["feature"].str.match('n_.*')==True,:])
# ax = ax.set(xlabel='Transaction type', ylabel='Monthly median frequency')

The median cancelling customer makes roughly 5 PTs a month in their first 4 months, whereas the median renewing customer makes 7 PTs a month in that time frame. 
In stark contrast, the median late churner makes just 1 PT a month, even early on. 

Similarly, how many transactions in total does the median customer in each of these groups make?
- Churner : 5 
- Cancellation: 15 
- Renewal : 25
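
Since the plotting code for these medians is commented out above, here is a minimal sketch of how such figures could be computed from the long `sample` table. It uses made-up data and assumes the `group`, `feature`, `age`, `user_id` and `value` columns built earlier; the numbers it prints are illustrative and do not reproduce the medians quoted above.

```python
import pandas as pd

# Toy customer-month rows (made up) for the "n_pt" feature only.
toy = pd.DataFrame(
    {
        "user_id": [1, 1, 2, 2, 3, 3],
        "group": ["Renewed", "Renewed", "Cancellation", "Cancellation",
                  "Late churn", "Late churn"],
        "feature": ["n_pt"] * 6,
        "age": [0, 1, 0, 1, 0, 1],
        "value": [8, 6, 5, 5, 1, 0],
    }
)

early = toy.loc[(toy["feature"] == "n_pt") & (toy["age"] <= 4)]

# Median monthly count per group, taken over customer-months.
monthly_median = early.groupby("group")["value"].median()

# Median of each customer's total count over the same window.
total_median = (
    early.groupby(["group", "user_id"])["value"].sum().groupby(level="group").median()
)
print(monthly_median)
print(total_median)
```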



In [64]:
tx = sample.loc[
    sample["feature"].isin(
        ["n_total", "n_pt", "n_ct", "n_dd", "n_spaces", "n_card_atm"]
    )
]
tx = tx.loc[tx["age"] <= 4, :]

tx = tx.groupby(["group", "feature", "value"])["user_id"].agg("count").reset_index()
tx["perc"] = (
    100 * tx["user_id"] / tx.groupby(["group", "feature"])["user_id"].transform("sum")
)
tx["cum"] = tx.groupby(["group", "feature"])["perc"].cumsum()

alt.Chart(tx.loc[tx["value"] < 40, :]).mark_line().encode(
    x=alt.X("value:Q", axis=alt.Axis(title="Number of tx")),
    y=alt.Y("cum:Q", axis=alt.Axis(title="% of customer-months")),
    color="group:N",
).properties(title="Distribution of financial activity in the first 4 months").facet(
    facet="feature:N", columns=2
)
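
The `cum` column plotted above is an empirical cumulative distribution over customer-months: for each group and feature, the share of customer-months with at most `value` transactions. A small sketch of the same three steps (count, normalise, cumulate) on made-up data:

```python
import pandas as pd

# Toy customer-months (made up): number of transactions per user-month.
toy = pd.DataFrame(
    {
        "group": ["Renewed"] * 4 + ["Late churn"] * 4,
        "user_id": [1, 2, 3, 4, 5, 6, 7, 8],
        "value": [0, 5, 5, 10, 0, 0, 0, 5],
    }
)

# 1. Count customer-months at each value (groupby sorts values ascending).
ecdf = toy.groupby(["group", "value"])["user_id"].agg("count").reset_index()
# 2. Convert counts to a percentage of the group's customer-months.
ecdf["perc"] = 100 * ecdf["user_id"] / ecdf.groupby("group")["user_id"].transform("sum")
# 3. Cumulate within each group; the last row of every group reaches 100.
ecdf["cum"] = ecdf.groupby("group")["perc"].cumsum()
print(ecdf)
```

A curve that rises steeply near zero, as for the toy "Late churn" group, means most customer-months in that group show little or no activity.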

## The same as above but looking at volumes! 

In [65]:
tx = sample.loc[
    sample["feature"].isin(
        [
            "amount_cents_pt",
            "amount_cents_ct",
            "amount_cents_dd",
            "amount_cents_spaces",
            "amount_cents_card_atm",
        ]
    )
]
tx = tx.loc[tx["age"] <= 4, :]

tx["value"] = tx["value"].round(-2)

tx = tx.groupby(["group", "feature", "value"])["user_id"].agg("count").reset_index()
tx["perc"] = (
    100 * tx["user_id"] / tx.groupby(["group", "feature"])["user_id"].transform("sum")
)
tx["cum"] = tx.groupby(["group", "feature"])["perc"].cumsum()

alt.Chart(tx.loc[tx["value"] < 2000, :]).mark_line().encode(
    x=alt.X("value:Q", axis=alt.Axis(title="Volume of monthly tx")),
    y=alt.Y("cum:Q", axis=alt.Axis(title="% of customer-months")),
    color="group:N",
).properties(title="Distribution of financial activity in the first 4 months").facet(
    facet="feature:N", columns=3
)

### Part D. CS contacts

In [None]:
query = """ 
select  dwh.month, 
        zu.user_id,
        zup.product_id, 
        zup.enter_reason, 
        zup.subscription_valid_from, 
        zup.subscription_valid_until, 
        datediff(month, subscription_valid_from, dwh.end_time) as age, 
        datediff(month, subscription_valid_from, subscription_valid_until) as churn_age, 
        subscription_end_event,
        cs_tag,
        count(1) as cs_contacts,
        count(distinct case_id) as cs_cases,
        count(distinct case when case_closure = True then case_id end ) as cs_resolved_cases
from dwh_cohort_months as dwh 
left join dbt.zrh_user_product as zup 
    on dwh.end_time between zup.subscription_valid_from and zup.subscription_valid_until 
    and product_id in ('BLACK_CARD_MONTHLY','METAL_CARD_MONTHLY','BUSINESS_BLACK')
inner join dbt.zrh_users as zu on zu.user_created = zup.user_created
left join dbt.sf_all_contacts as cs  
    on cs.user_id = zu.user_id 
    and date_trunc('month',initiated_date) = dwh.start_time 
    and abandoned = False 
    and c_level_report = True 
where subscription_valid_from between '2018-12-01' and '2019-07-01'
    and dwh.start_time between '2018-12-01' and current_date  
    and enter_reason in ('UPGRADED','SIGNUP')
group by 1,2,3,4,5,6,7,8,9,10
"""

In [1]:
import numpy as np
import pandas as pd

sample = pd.read_csv("retention_cs.csv")

# drop users whose subscription has not ended yet (no renewal decision so far)
sample = sample.loc[sample["subscription_end_event"].notna(), :]

# recode product
sample.loc[sample["product_id"] == "BLACK_CARD_MONTHLY", "product_id"] = "You"
sample.loc[sample["product_id"] == "METAL_CARD_MONTHLY", "product_id"] = "Metal"
sample.loc[sample["product_id"] == "BUSINESS_BLACK", "product_id"] = "Bus. You"

# sort dataframe
sample = sample.sort_values(by=["user_id", "month"], ascending=False)


# replace no problem with string
sample.loc[sample["cs_tag"].isna(), "cs_tag"] = "No problem"

# pivot table to get one row per user-tag-month combination.
sample = pd.pivot_table(
    sample,
    values=["cs_cases"],
    index=[
        "user_id",
        "enter_reason",
        "age",
        "subscription_end_event",
        "product_id",
        "churn_age",
    ],
    columns=["cs_tag"],
    aggfunc="sum",
    fill_value=0,
).reset_index()

# make it long again
sample = pd.melt(
    sample,
    id_vars=[
        "user_id",
        "enter_reason",
        "age",
        "subscription_end_event",
        "product_id",
        "churn_age",
    ],
)

In [78]:
# share of customers per group that contacted CS at least once (all months)

df = sample.loc[(sample["enter_reason"] == "SIGNUP"), :]
df = df.loc[df["subscription_end_event"].isin(["BLOCKED", "CANCELLED", "RENEWED"]), :]

df["group"] = "Early churn: 1-4"
df.loc[
    (df["churn_age"] > 4) & (df["subscription_end_event"] == "BLOCKED"), "group"
] = "Late churn: 4-12"
df.loc[(df["subscription_end_event"] == "CANCELLED"), "group"] = "Cancelled"
df.loc[(df["subscription_end_event"] == "RENEWED"), "group"] = "Renewed"


df = df.groupby(["user_id", "group"])["value"].agg("sum").reset_index().fillna(0)

df["complainer"] = 0
df.loc[df["value"] > 0, "complainer"] = 1

df = df.groupby(["group"])["complainer"].agg("mean").reset_index()

alt.Chart(df).mark_bar().encode(
    x=alt.X("group:N", axis=alt.Axis(title="Type of customers")),
    y=alt.Y(
        "complainer:Q",
        axis=alt.Axis(title="% of group that contact CS at least once", format="%"),
    ),
    color="group:N",
).properties(
    width=500,
    height=500,
    title="% of customers who contact CS at least once (sign up flow)",
)
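
The four group labels assigned in the cell above are rebuilt in several of the following cells. A hypothetical helper (the name `assign_group` is not part of the original notebook) could centralise that logic. This sketch uses `numpy.select`, where the first matching condition wins and unmatched rows get the default label, on made-up rows.

```python
import numpy as np
import pandas as pd


def assign_group(df: pd.DataFrame) -> pd.Series:
    """Label each row as Renewed, Cancelled, late churn or early churn (default)."""
    conditions = [
        df["subscription_end_event"] == "RENEWED",
        df["subscription_end_event"] == "CANCELLED",
        (df["subscription_end_event"] == "BLOCKED") & (df["churn_age"] > 4),
    ]
    labels = ["Renewed", "Cancelled", "Late churn: 4-12"]
    return pd.Series(
        np.select(conditions, labels, default="Early churn: 1-4"), index=df.index
    )


# Toy rows (made up).
toy = pd.DataFrame(
    {
        "subscription_end_event": ["RENEWED", "CANCELLED", "BLOCKED", "BLOCKED"],
        "churn_age": [12, 12, 8, 2],
    }
)
toy["group"] = assign_group(toy)
print(toy)
```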

In [51]:
# show distribution of cs cases across population in the first four months

df = sample.loc[(sample["enter_reason"] == "SIGNUP") & (sample["age"] < 4), :]
df = df.loc[df["subscription_end_event"].isin(["BLOCKED", "CANCELLED", "RENEWED"]), :]

df["group"] = "Early churn: 1-4"
df.loc[
    (df["churn_age"] > 4) & (df["subscription_end_event"] == "BLOCKED"), "group"
] = "Late churn: 4-12"
df.loc[(df["subscription_end_event"] == "CANCELLED"), "group"] = "Cancelled"
df.loc[(df["subscription_end_event"] == "RENEWED"), "group"] = "Renewed"


df = df.groupby(["user_id", "group"])["value"].agg("sum").reset_index().fillna(0)
df = df.groupby(["group", "value"])["user_id"].agg("nunique").reset_index()

df["perc"] = df["user_id"] / df.groupby("group")["user_id"].transform("sum")
df["cum"] = df.groupby("group")["perc"].cumsum()


alt.Chart(df.loc[df["value"] < 5, :]).mark_line().encode(
    x=alt.X(
        "value:N", axis=alt.Axis(title="Number of CS cases in the first four months")
    ),
    y=alt.Y("cum:Q", axis=alt.Axis(title="% of group", format="%")),
    color="group:N",
).properties(
    width=500, height=500, title="CS contacts in the first four months (sign up flow)"
)

In [52]:
# share of customers that contact CS in each month of age, by group

df = sample.loc[(sample["enter_reason"] == "SIGNUP"), :]
df = df.loc[df["subscription_end_event"].isin(["BLOCKED", "CANCELLED", "RENEWED"]), :]

df["group"] = "Early churn: 1-4"
df.loc[
    (df["churn_age"] > 4) & (df["subscription_end_event"] == "BLOCKED"), "group"
] = "Late churn: 4-12"
df.loc[(df["subscription_end_event"] == "CANCELLED"), "group"] = "Cancelled"
df.loc[(df["subscription_end_event"] == "RENEWED"), "group"] = "Renewed"

df = df.groupby(["user_id", "group", "age"])["value"].agg("sum").reset_index().fillna(0)

df["complainer"] = 0
df.loc[df["value"] > 0, "complainer"] = 1

df = df.groupby(["group", "age"])["complainer"].agg("mean").reset_index()

alt.Chart(df).mark_line().encode(
    x=alt.X("age:N", axis=alt.Axis(title="Age in months")),
    y=alt.Y(
        "complainer:Q",
        axis=alt.Axis(title="% of customers that contact CS", format="%"),
    ),
    color="group:N",
).properties(
    width=500,
    height=500,
    title="% of users that contact CS at least once by age and group",
)

## Problems at the start, in the middle and at the end: what are they? 

In [82]:
# share of users contacting CS, by month of age and contact reason

df = sample.loc[(sample["enter_reason"] == "SIGNUP"), :]
df = df.loc[
    df["subscription_end_event"].isin(["BLOCKED", "CANCELLED", "RENEWED", "REVOKED"]), :
]

df["group"] = "Early churn: 1-4"
df.loc[
    (df["churn_age"] > 4) & (df["subscription_end_event"] == "BLOCKED"), "group"
] = "Late churn: 4-12"
df.loc[(df["subscription_end_event"] == "CANCELLED"), "group"] = "Cancelled"
df.loc[(df["subscription_end_event"] == "RENEWED"), "group"] = "Renewed"

df = (
    df.groupby(["user_id", "age", "cs_tag"])["value"].agg("sum").reset_index().fillna(0)
)

df["complainer"] = 0
df.loc[df["value"] > 0, "complainer"] = 1

df = df.groupby(["age", "cs_tag"])["complainer"].agg("mean").reset_index()

In [87]:
alt.Chart(df.loc[(df["age"] <= 2) & (df["complainer"] >= 0.01), :]).mark_bar().encode(
    x=alt.X(
        "cs_tag:N",
        axis=alt.Axis(title="Reason"),
        sort=alt.EncodingSortField(field="complainer", op="mean", order="descending"),
    ),
    y=alt.Y("complainer:Q", axis=alt.Axis(title="% of all users", format="%")),
    color="cs_tag:N",
).properties(width=400, height=400, title="CS contacts in the first two months")

In [86]:
alt.Chart(
    df.loc[(df["age"] > 2) & (df["age"] < 10) & (df["complainer"] >= 0.005), :]
).mark_bar().encode(
    x=alt.X(
        "cs_tag:N",
        axis=alt.Axis(title="Reason"),
        sort=alt.EncodingSortField(field="complainer", op="mean", order="descending"),
    ),
    y=alt.Y("complainer:Q", axis=alt.Axis(title="% of all users", format="%")),
    color="cs_tag:N",
).properties(
    width=400, height=400, title="CS contacts between months 3 and 10"
)

In [85]:
alt.Chart(df.loc[(df["age"] >= 10) & (df["complainer"] >= 0.005), :]).mark_bar().encode(
    x=alt.X(
        "cs_tag:N",
        axis=alt.Axis(title="Reason"),
        sort=alt.EncodingSortField(field="complainer", op="mean", order="descending"),
    ),
    y=alt.Y("complainer:Q", axis=alt.Axis(title="% of all users", format="%")),
    color="cs_tag:N",
).properties(width=400, height=400, title="CS contacts after 10 months")

## How does this differ by group? 


In [61]:
# share of users contacting CS, by group, month of age and contact reason

df = sample.loc[(sample["enter_reason"] == "SIGNUP"), :]
df = df.loc[
    df["subscription_end_event"].isin(["BLOCKED", "CANCELLED", "RENEWED", "REVOKED"]), :
]

df["group"] = "Early churn: 1-4"
df.loc[
    (df["churn_age"] > 4) & (df["subscription_end_event"] == "BLOCKED"), "group"
] = "Late churn: 4-12"
df.loc[(df["subscription_end_event"] == "CANCELLED"), "group"] = "Cancelled"
df.loc[(df["subscription_end_event"] == "RENEWED"), "group"] = "Renewed"

df = (
    df.groupby(["group", "user_id", "age", "cs_tag"])["value"]
    .agg("sum")
    .reset_index()
    .fillna(0)
)

df["complainer"] = 0
df.loc[df["value"] > 0, "complainer"] = 1

df = df.groupby(["group", "age", "cs_tag"])["complainer"].agg("mean").reset_index()

In [65]:
alt.Chart(df.loc[(df["age"] <= 2) & (df["complainer"] >= 0.01), :]).mark_bar().encode(
    x=alt.X(
        "cs_tag:N",
        axis=alt.Axis(title="Reason"),
        sort=alt.EncodingSortField(field="complainer", op="mean", order="descending"),
    ),
    y=alt.Y("complainer:Q", axis=alt.Axis(title="% of all users", format="%")),
    color="cs_tag:N",
).properties(width=400, height=400, title="CS contacts in the first two months").facet(
    facet="group:N", columns=2
)

In [71]:
alt.Chart(
    df.loc[(df["age"] > 2) & (df["age"] < 10) & (df["complainer"] >= 0.005), :]
).mark_bar().encode(
    x=alt.X(
        "cs_tag:N",
        axis=alt.Axis(title="Reason"),
        sort=alt.EncodingSortField(field="complainer", op="mean", order="descending"),
    ),
    y=alt.Y("complainer:Q", axis=alt.Axis(title="% of all users", format="%")),
    color="cs_tag:N",
).properties(
    width=400, height=400, title="CS contacts between months 3 and 10"
).facet(
    facet="group:N", columns=2
)

In [70]:
alt.Chart(
    df.loc[
        (df["age"] >= 10)
        & (df["cs_tag"] != "No problem")
        & (df["complainer"] >= 0.005),
        :,
    ]
).mark_bar().encode(
    x=alt.X(
        "cs_tag:N",
        axis=alt.Axis(title="Reason"),
        sort=alt.EncodingSortField(field="complainer", op="mean", order="descending"),
    ),
    y=alt.Y("complainer:Q", axis=alt.Axis(title="% of all users", format="%")),
    color="cs_tag:N",
).properties(
    width=400, height=400, title="CS contacts in the last two months"
).facet(
    facet="group:N", columns=2
)

## By product


In [88]:
# share of users contacting CS, by product, month of age and contact reason

df = sample.loc[(sample["enter_reason"] == "SIGNUP"), :]
df = df.loc[
    df["subscription_end_event"].isin(["BLOCKED", "CANCELLED", "RENEWED", "REVOKED"]), :
]

df["group"] = "Early churn: 1-4"
df.loc[
    (df["churn_age"] > 4) & (df["subscription_end_event"] == "BLOCKED"), "group"
] = "Late churn: 4-12"
df.loc[(df["subscription_end_event"] == "CANCELLED"), "group"] = "Cancelled"
df.loc[(df["subscription_end_event"] == "RENEWED"), "group"] = "Renewed"

df = (
    df.groupby(["product_id", "user_id", "age", "cs_tag"])["value"]
    .agg("sum")
    .reset_index()
    .fillna(0)
)

df["complainer"] = 0
df.loc[df["value"] > 0, "complainer"] = 1

df = df.groupby(["product_id", "age", "cs_tag"])["complainer"].agg("mean").reset_index()

In [90]:
alt.Chart(df.loc[(df["age"] <= 2) & (df["complainer"] >= 0.01), :]).mark_bar().encode(
    x=alt.X(
        "cs_tag:N",
        axis=alt.Axis(title="Reason"),
        sort=alt.EncodingSortField(field="complainer", op="mean", order="descending"),
    ),
    y=alt.Y("complainer:Q", axis=alt.Axis(title="% of all users", format="%")),
    color="cs_tag:N",
).properties(width=400, height=400, title="CS contacts in the first two months").facet(
    facet="product_id:N", columns=3
)

### What are the most common reasons? 

In [51]:
# distribution of CS cases by reason, per subscription end event

df = sample.loc[(sample["enter_reason"] == "SIGNUP"), :]
df = df.loc[
    df["subscription_end_event"].isin(["BLOCKED", "CANCELLED", "RENEWED", "REVOKED"]), :
]

df = (
    df.groupby(["subscription_end_event", "cs_tag"])["cs_cases"]
    .agg("sum")
    .reset_index()
    .fillna(0)
)
df["perc"] = (
    100
    * df["cs_cases"]
    / df.groupby("subscription_end_event")["cs_cases"].transform("sum")
)

alt.Chart(df.loc[df["perc"] > 2]).mark_bar().encode(
    x=alt.X(
        "cs_tag:N",
        axis=alt.Axis(title="Reason"),
        sort=alt.EncodingSortField(field="perc", op="mean", order="descending"),
    ),
    y=alt.Y("perc:Q", axis=alt.Axis(title="% of all cases")),
    color="cs_tag:N",
).properties(width=400, height=400, title="CS contacts in the first four months").facet(
    facet="subscription_end_event:N", columns=2
)

## What % of churning customers complain about what?

In [71]:
# share of churned users that contacted CS at least once, by reason

df = sample.loc[(sample["enter_reason"] == "SIGNUP"), :]
df = df.loc[df["subscription_end_event"].isin(["BLOCKED", "CANCELLED", "REVOKED"]), :]

df = (
    df.groupby(["subscription_end_event", "user_id", "cs_tag"])["cs_cases"]
    .agg("sum")
    .reset_index()
    .fillna(0)
)

df = (
    pd.pivot_table(
        df,
        index=["user_id", "subscription_end_event"],
        columns="cs_tag",
        values="cs_cases",
    )
    .fillna(0)
    .reset_index()
)
df = pd.melt(df, id_vars=["user_id", "subscription_end_event"])

df["at_all"] = 0
df.loc[df["value"] > 0, "at_all"] = 1

df = df.groupby(["cs_tag"])["at_all"].agg("mean").reset_index()

alt.Chart(df.loc[df["at_all"] > 0.05, :]).mark_bar().encode(
    x=alt.X(
        "cs_tag:N",
        axis=alt.Axis(title="Reason"),
        sort=alt.EncodingSortField(field="at_all", op="mean", order="descending"),
    ),
    y=alt.Y("at_all:Q", axis=alt.Axis(title="% of users", format="%")),
    color="cs_tag:N",
).properties(
    width=400, height=400, title="% of churned users that had this problem at least once"
)

In [72]:
# share of renewing users that contacted CS at least once, by reason

df = sample.loc[(sample["enter_reason"] == "SIGNUP"), :]
df = df.loc[df["subscription_end_event"].isin(["RENEWED"]), :]

df = (
    df.groupby(["subscription_end_event", "user_id", "cs_tag"])["cs_cases"]
    .agg("sum")
    .reset_index()
    .fillna(0)
)

df = (
    pd.pivot_table(
        df,
        index=["user_id", "subscription_end_event"],
        columns="cs_tag",
        values="cs_cases",
    )
    .fillna(0)
    .reset_index()
)
df = pd.melt(df, id_vars=["user_id", "subscription_end_event"])

df["at_all"] = 0
df.loc[df["value"] > 0, "at_all"] = 1

df = df.groupby(["cs_tag"])["at_all"].agg("mean").reset_index()

alt.Chart(df.loc[df["at_all"] > 0.05, :]).mark_bar().encode(
    x=alt.X(
        "cs_tag:N",
        axis=alt.Axis(title="CS contact reason"),
        sort=alt.EncodingSortField(field="at_all", op="mean", order="descending"),
    ),
    y=alt.Y("at_all:Q", axis=alt.Axis(title="% of users", format="%")),
    color="cs_tag:N",
).properties(
    width=400, height=400, title="% of renewing users who had this problem at least once"
)

In [74]:
# Share of late churners (BLOCKED after month 4) with at least one CS case in the first four months, by contact reason.

df = sample.loc[(sample["enter_reason"] == "SIGNUP"), :]
df = df.loc[(df["subscription_end_event"].isin(["BLOCKED"])) & (df["churn_age"] > 4), :]

df = (
    df.groupby(["subscription_end_event", "user_id", "cs_tag"])["cs_cases"]
    .agg("sum")
    .reset_index()
    .fillna(0)
)

df = (
    pd.pivot_table(
        df,
        index=["user_id", "subscription_end_event"],
        columns="cs_tag",
        values="cs_cases",
    )
    .fillna(0)
    .reset_index()
)
df = pd.melt(df, id_vars=["user_id", "subscription_end_event"])

df["at_all"] = 0
df.loc[df["value"] > 0, "at_all"] = 1

df = df.groupby(["cs_tag"])["at_all"].agg("mean").reset_index()

alt.Chart(df.loc[df["at_all"] > 0.05, :]).mark_bar().encode(
    x=alt.X(
        "cs_tag:N",
        axis=alt.Axis(title="CS contact reason"),
        sort=alt.EncodingSortField(field="at_all", op="mean", order="descending"),
    ),
    y=alt.Y("at_all:Q", axis=alt.Axis(title="% of users", format="%")),
    color="cs_tag:N",
).properties(
    width=400, height=400, title="% of late-churn users who had this problem at least once"
)

In [75]:
# Share of early churners (BLOCKED within the first four months) with at least one CS case, by contact reason.

df = sample.loc[(sample["enter_reason"] == "SIGNUP"), :]
df = df.loc[
    (df["subscription_end_event"].isin(["BLOCKED"])) & (df["churn_age"] <= 4), :
]

df = (
    df.groupby(["subscription_end_event", "user_id", "cs_tag"])["cs_cases"]
    .agg("sum")
    .reset_index()
    .fillna(0)
)

df = (
    pd.pivot_table(
        df,
        index=["user_id", "subscription_end_event"],
        columns="cs_tag",
        values="cs_cases",
    )
    .fillna(0)
    .reset_index()
)
df = pd.melt(df, id_vars=["user_id", "subscription_end_event"])

df["at_all"] = 0
df.loc[df["value"] > 0, "at_all"] = 1

df = df.groupby(["cs_tag"])["at_all"].agg("mean").reset_index()

alt.Chart(df.loc[df["at_all"] > 0.05, :]).mark_bar().encode(
    x=alt.X(
        "cs_tag:N",
        axis=alt.Axis(title="CS contact reason"),
        sort=alt.EncodingSortField(field="at_all", op="mean", order="descending"),
    ),
    y=alt.Y("at_all:Q", axis=alt.Axis(title="% of users", format="%")),
    color="cs_tag:N",
).properties(
    width=400, height=400, title="% of early-churn users who had this problem at least once"
)

In [77]:
# Same as above for early churners (BLOCKED within the first four months), split by product.

df = sample.loc[(sample["enter_reason"] == "SIGNUP"), :]
df = df.loc[
    (df["subscription_end_event"].isin(["BLOCKED"])) & (df["churn_age"] <= 4), :
]

df = (
    df.groupby(["subscription_end_event", "product_id", "user_id", "cs_tag"])[
        "cs_cases"
    ]
    .agg("sum")
    .reset_index()
    .fillna(0)
)

df = (
    pd.pivot_table(
        df,
        index=["user_id", "product_id", "subscription_end_event"],
        columns="cs_tag",
        values="cs_cases",
    )
    .fillna(0)
    .reset_index()
)
df = pd.melt(df, id_vars=["user_id", "product_id", "subscription_end_event"])

df["at_all"] = 0
df.loc[df["value"] > 0, "at_all"] = 1

df = df.groupby(["cs_tag", "product_id"])["at_all"].agg("mean").reset_index()

alt.Chart(df.loc[df["at_all"] > 0.05, :]).mark_bar().encode(
    x=alt.X(
        "cs_tag:N",
        axis=alt.Axis(title="CS contact reason"),
        sort=alt.EncodingSortField(field="at_all", op="mean", order="descending"),
    ),
    y=alt.Y("at_all:Q", axis=alt.Axis(title="% of users", format="%")),
    color="cs_tag:N",
    column="product_id:N",
).properties(
    width=400, height=400, title="% of early-churn users who had this problem at least once"
)
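The cells above repeat the same filter, pivot and melt steps for each user group. As a rough sketch only, the shared logic could be folded into one helper. The function name `share_of_users_with_tag` and its arguments are hypothetical, the toy data is made up, and the helper assumes each `user_id` counts once and that `sample` has the columns used above.

```python
import pandas as pd


def share_of_users_with_tag(sample, end_events, group_cols=()):
    """Share of users with at least one CS case, per cs_tag (hypothetical helper)."""
    df = sample.loc[
        (sample["enter_reason"] == "SIGNUP")
        & (sample["subscription_end_event"].isin(end_events))
    ]
    keys = ["user_id", *group_cols]

    # One row per user, one column per contact reason; absent combinations count as 0.
    wide = pd.pivot_table(
        df, index=keys, columns="cs_tag", values="cs_cases", aggfunc="sum", fill_value=0
    )

    # 1 if the user had at least one case with that tag, else 0.
    had_any = (wide > 0).astype(int).reset_index()
    long = had_any.melt(id_vars=keys, var_name="cs_tag", value_name="at_all")

    # Mean of the 0/1 flag = share of users with that contact reason.
    return long.groupby(["cs_tag", *group_cols])["at_all"].mean().reset_index()


# Toy data, made up for illustration only.
toy = pd.DataFrame(
    {
        "user_id": [1, 1, 2, 3],
        "enter_reason": ["SIGNUP"] * 4,
        "subscription_end_event": ["BLOCKED", "BLOCKED", "BLOCKED", "RENEWED"],
        "cs_tag": ["62 metal", "95 account closure", "62 metal", "62 metal"],
        "cs_cases": [2, 1, 1, 1],
    }
)
churned = share_of_users_with_tag(toy, ["BLOCKED", "CANCELLED", "REVOKED"])
print(churned)
```

Note that rows with a missing `cs_tag` drop out of the pivot with pandas defaults, so users who never contacted CS are not part of the denominator. The shares are therefore best read as "among users with any tagged contact".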


### Estimating the probability of cancellation by contact reason

In [106]:
# renewal rate by contact reason, split by whether the user had at least one case with that tag

df = sample.loc[(sample["enter_reason"] == "SIGNUP"), :]
df = df.loc[
    df["cs_tag"].isin(
        ["61 black", "64 business black", "62 metal", "95 account closure"]
    ),
    :,
]
# keep renewed users as well, otherwise the renewal flag below is always 0
df = df.loc[df["subscription_end_event"].isin(["BLOCKED", "CANCELLED", "RENEWED"]), :]

df["renewal"] = 0
df.loc[df["subscription_end_event"] == "RENEWED", "renewal"] = 1

# total per user and contact reason
df = (
    df.groupby(["user_id", "renewal", "cs_tag"])["value"]
    .agg("sum")
    .reset_index()
    .fillna(0)
)

df["complainer"] = 0
df.loc[df["value"] > 0, "complainer"] = 1

# mean of the 0/1 renewal flag = renewal rate
df = df.groupby(["cs_tag", "complainer"])["renewal"].agg("mean").reset_index()

alt.Chart(df).mark_bar().encode(
    x=alt.X("cs_tag:N", axis=alt.Axis(title="Contact reason")),
    y=alt.Y("renewal:Q", axis=alt.Axis(title="% renewal", format="%")),
    column="complainer:N",
    color="cs_tag:N",
).properties(
    width=400,
    height=400,
    title="Are users who contact CS often more likely to churn early?",
)
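To make the comparison behind this chart concrete, here is a minimal sketch of a renewal rate for users with and without a case under a given tag. The helper name `renewal_rate_by_contact` is hypothetical, the toy data is made up, and it assumes the case count lives in a `cs_cases` column as in the cells further up. The result is descriptive only and says nothing about causality.

```python
import pandas as pd


def renewal_rate_by_contact(sample, tag):
    """Renewal rate for users with vs. without a CS case tagged `tag` (hypothetical helper)."""
    flags = sample.assign(
        # 1 on rows that belong to a renewed subscription
        renewal=(sample["subscription_end_event"] == "RENEWED").astype(int),
        # 1 on rows that carry at least one case with the tag of interest
        complainer=((sample["cs_tag"] == tag) & (sample["cs_cases"] > 0)).astype(int),
    )
    # Collapse to one row per user: did they ever renew, did they ever complain?
    per_user = flags.groupby("user_id")[["renewal", "complainer"]].max()
    # Mean of the 0/1 renewal flag within each group = renewal rate.
    return per_user.groupby("complainer")["renewal"].mean()


# Toy data, made up for illustration only.
toy = pd.DataFrame(
    {
        "user_id": [1, 2, 3, 4],
        "subscription_end_event": ["RENEWED", "CANCELLED", "RENEWED", "RENEWED"],
        "cs_tag": ["62 metal", "62 metal", None, "95 account closure"],
        "cs_cases": [1, 2, 0, 1],
    }
)
rates = renewal_rate_by_contact(toy, "62 metal")
print(rates)
```

Unlike the cell above, this keeps users without any contact in the comparison group, which is probably what a "complainer vs. non-complainer" split needs.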

### Part E. Summary

We have the following problems:
- Activation: users don't top up. 55% of churners in the first 4 months never activated in the first place. In this early phase, we lose roughly 25% of our customers.
    - Now that we have local IBANs, we could double down on our efforts to convince users to make N26 their salary account.
- Keeping people engaged: we lose roughly another 25% of customers between months 5 and 12. These people activate and start using the account but then drop out.
    - We find evidence that going on an international trip is a big reason why users get You accounts. We also see a drop in activity after these trips: can we keep these users engaged?
- Localization: Metal retention after 12 months is 67% in Germany and 27% in Spain. We are not competitive in markets like Spain and Italy.





In [68]:
sample = pd.read_csv("churn_sample_mau.csv")

# drop users who could not yet have made this decision (no subscription end event)
sample = sample.loc[sample["subscription_end_event"].notna(), :]

# recode product
sample.loc[sample["product_id"] == "BLACK_CARD_MONTHLY", "product_id"] = "You"
sample.loc[sample["product_id"] == "METAL_CARD_MONTHLY", "product_id"] = "Metal"
sample.loc[sample["product_id"] == "BUSINESS_BLACK", "product_id"] = "Bus. You"

# sort chronologically within each user so the cumulative sum below runs forward in time
sample = sample.sort_values(by=["user_id", "month"])

# code up mau variables
sample["time_since_mau"] = 0
sample.loc[sample["mau"] == 0, "time_since_mau"] = 1

# time in lapse: cumulative number of non-MAU months per user so far, set to 0 in active months
sample["time_lapse"] = sample.groupby("user_id")["time_since_mau"].cumsum()
sample.loc[sample["mau"] == 1, "time_lapse"] = 0

# end_period = 1 if this was the churn period
sample["age"] = sample["age"].astype(float)
sample["churn_age"] = sample["churn_age"].astype(float) - 1

sample["end_period"] = 0
sample.loc[sample["age"] == sample["churn_age"], "end_period"] = 1


# ever mau
sample["ever_mau"] = sample.groupby("user_id")["mau"].transform("max")
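The `time_lapse` variable above accumulates all inactive months of a user and does not restart after an active month. If the intended quantity is instead the number of consecutive months since the last MAU month, a counter that resets could look like the sketch below. The function name `months_since_last_mau` and the column `lapse_streak` are hypothetical, and the toy data is made up.

```python
import pandas as pd


def months_since_last_mau(frame):
    """Add `lapse_streak`: consecutive inactive months since the last MAU month."""
    frame = frame.sort_values(["user_id", "month"]).copy()
    inactive = (frame["mau"] == 0).astype(int)

    # Every active month opens a new block within a user.
    block = (frame["mau"] == 1).astype(int).groupby(frame["user_id"]).cumsum()

    # Count inactive months inside each (user, block); active months stay at 0.
    frame["lapse_streak"] = inactive.groupby([frame["user_id"], block]).cumsum()
    return frame


# Toy data, made up for illustration only.
toy = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 1, 1, 2, 2],
        "month": [1, 2, 3, 4, 5, 1, 2],
        "mau": [1, 0, 0, 1, 0, 0, 0],
    }
)
out = months_since_last_mau(toy)
print(out)
```

For user 1 the cumulative version gives 3 in the last month, while the resetting counter gives 1, because the user was active again in month 4.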

In [70]:
# visualize % of ever mau by churn month

data = sample.loc[sample["subscription_end_event"] != "RENEWED", :]

data = (
    data.groupby(["user_id", "churn_age", "product_id"])["mau"].agg("max").reset_index()
)
data = data.groupby(["churn_age", "product_id"])["mau"].agg("mean").reset_index()


# chart=alt.Chart(data).mark_bar().encode(
#    x=alt.X('churn_age:N', axis=alt.Axis(title='Churn age in month')),
#    y=alt.Y('mau:Q', axis=alt.Axis(format='%', title='% ever turned MAU'))
# ).properties(height=500,width=500, title = '% of customers who ever turned MAU')
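The activation figure quoted in the summary (the share of early churners who never became MAU) can be derived from the same per-user maximum of `mau`. The following is a sketch under assumptions: the helper name `never_activated_share` and the `max_churn_age` cut-off are hypothetical, the toy data is made up, and `churn_age` is taken as stored in `sample`, so the cut-off has to account for the `- 1` shift applied when the data is loaded.

```python
import pandas as pd


def never_activated_share(frame, max_churn_age=4):
    """Share of early churners who were never MAU (hypothetical helper)."""
    churners = frame.loc[frame["subscription_end_event"] != "RENEWED"]

    # One row per user: when they churned and whether they were ever MAU.
    per_user = churners.groupby("user_id").agg(
        churn_age=("churn_age", "first"),
        ever_mau=("mau", "max"),
    )

    early = per_user.loc[per_user["churn_age"] <= max_churn_age]
    # ever_mau is 0/1, so 1 - mean is the share that never activated.
    return 1 - early["ever_mau"].mean()


# Toy data, made up for illustration only.
toy = pd.DataFrame(
    {
        "user_id": [1, 1, 2, 2, 3, 3, 4],
        "subscription_end_event": ["BLOCKED"] * 6 + ["RENEWED"],
        "churn_age": [2, 2, 3, 3, 8, 8, 12],
        "mau": [0, 0, 0, 1, 1, 1, 1],
    }
)
share = never_activated_share(toy)
print(share)
```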

### Part D. Login

In [None]:
query = """ 
select  dwh.month, 
        zu.user_id,
        zup.product_id, 
        zup.enter_reason, 
        zup.subscription_valid_from, 
        zup.subscription_valid_until, 
        datediff(month, subscription_valid_from, dwh.end_time) as age, 
        datediff(month, subscription_valid_from, subscription_valid_until) as churn_age, 
        subscription_end_event,
        cs_tag,
        count(1) as cs_contacts,
        count(distinct case_id) as cs_cases,
        count(distinct case when case_closure = True then case_id end ) as cs_resolved_cases
from dwh_cohort_months as dwh 
left join dbt.zrh_user_product as zup 
    on dwh.end_time between zup.subscription_valid_from and zup.subscription_valid_until 
    and product_id in ('BLACK_CARD_MONTHLY','METAL_CARD_MONTHLY','BUSINESS_BLACK')
inner join dbt.zrh_users as zu on zu.user_created = zup.user_created
left join dbt.sf_all_contacts as cs  
    on cs.user_id = zu.user_id 
    and date_trunc('month',initiated_date) = dwh.start_time 
    and abandoned = False 
    and c_level_report = True 
where subscription_valid_from between '2018-12-01' and '2019-07-01'
    and dwh.start_time between '2018-12-01' and current_date  
    and enter_reason in ('UPGRADED','SIGNUP')
group by 1,2,3,4,5,6,7,8,9,10
"""