title: Lapsing through the lens of early clusters
author: Brieuc Van Thienen     
date: 2021-03-19      
region: EU     
link: https://docs.google.com/presentation/d/1PnfD-mzLHJ9OFZj3LQ1_WY5uAq7nOK-ptHWvZZvOGqM/edit?usp=sharing    
summary: Understanding natural usage behaviours is the first step towards improving retention. Generic KPIs (10 txns in 35 days, >= 500 euro top-ups in 35 days, ...) help differentiate high-activity clusters from low-activity clusters, but cannot differentiate within those two groups. These KPIs become more relevant after 5 weeks of activity. They can be used to assess the overall success of our project, but our initiatives will require looking at more granular metrics. Cash26, Spaces, mobile payments, and inter travel in the first week of activity are highly differentiating from a product usage perspective. In other words, a user completing one of these actions is very likely to end up in a given cluster. However, the intensity of usage (transaction counts, volumes) does not really differentiate between clusters. When looking at the recency and stickiness of users (days since last action / number of distinct days active), we again see distinguishable patterns between high- and low-activity groups. Coupled with the absence of core product actions, these metrics could act as warning signs that a customer will soon become disengaged.
tags: product, marketing, lapse, lapse prevention, early behaviour, engagement, onboarding, clusters
    

<br>
<br>
<br>

# EDA - Lapsing through the lens of early clusters

---

Clustering helps us understand the spectrum of possible behaviours 35 days after first-time MAU.

Looking at the early activity and long-term engagement of these clusters helps us understand:

- How these users engage with the product in the very early days, which lets us infer whether new cohorts engage with the product in meaningful ways.

- How early usage patterns correlate with longer-term engagement, which lets us infer the expected usage patterns (e.g. frequency of usage) of new cohorts.

In all, clusters can help us design strategies to engage, retain or resurrect the users.

All graphs are shown in the attached slide deck; this notebook reproduces each step. Note that some dbt models live in the dev_dbt schema and are not refreshed on a regular schedule.


In [2]:
# warnings is part of the Python standard library, so no pip install is needed
import warnings

warnings.filterwarnings("ignore")

In [None]:
from utils.datalib_database import df_from_sql
import altair as alt
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

!pip install seaborn
import seaborn as sns

!pip install vega
!pip install --upgrade notebook  # need jupyter_client >= 4.2 for sys-prefix below
!jupyter nbextension install --sys-prefix --py vega  # not needed in notebook >= 5.3
alt.renderers.enable("notebook")

In [None]:
alt.data_transformers.disable_max_rows()

<br>
<br>
<br>

# Early engagement by clusters, key actions in week 1

---

In [None]:
query = """
    with main as (
        select 
            u.user_created,
            p.user_id,
            u.country_tnc_legal,
            fp.product_id,
            p.ft_mau_date,
            p.datediff,
            p.feature,
            early_cluster,
            coalesce(p.value,0) as cumulative_value,
            case 
                when datediff = 1 then coalesce(p.value,0)
                else coalesce(p.value,0) - lag(coalesce(p.value,0)) over (partition by p.user_created, feature order by datediff asc)
            end as value
        from 
            dev_dbt.stg_churn_act_unpivot p 
        inner join 
            (select user_created, country_tnc_legal from dbt.zrh_users) u using (user_created)
        inner join 
            (select user_id, early_cluster from dwh_earlycluster_labels order by random() limit 50000) ec using (user_id)
        inner join 
            (select user_created, product_id from dbt.zrh_user_product where enter_reason = 'SIGNUP') fp using (user_created)
        where 
            1=1
            and u.user_created < '2020-07-01'
            and feature in (
                'amount_cents_act_total',
                'amount_cents_ct',
                'n_act_total',
                'n_ct',
                'n_cash',
                'n_mobile',
                'n_spaces',
                'n_grocery_market',
                'n_intra',
                'n_inter',
                'dau_login',
                'dau_login_txn',
                'dau_txn'
            )
    )
    select 
        user_created,
        user_id,
        country_tnc_legal,
        product_id,
        ft_mau_date,
        datediff,
        feature,
        early_cluster,
        -- amount features are stored in cents; convert them to whole currency units
        case 
            when split_part(feature,'_',1) = 'amount' then (cumulative_value/100)::int
            else cumulative_value
        end as cumulative_value,
        case 
            when split_part(feature,'_',1) = 'amount' then (value/100)::int
            else value
        end as value
    from 
        main
"""

In [None]:
df = df_from_sql("redshiftreader", query)
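
The query above turns the cumulative `value` stored per week into a per-week `value` with a `lag()` window function. As a rough illustration only, the same de-cumulation can be sketched in pandas. The toy data and the `decumulate` helper are hypothetical, and the sketch treats the first row of each user/feature group as the first week, whereas the SQL checks `datediff = 1` explicitly.

```python
import pandas as pd

# Toy data (hypothetical): cumulative transaction counts per user, feature and week
toy = pd.DataFrame(
    {
        "user_created": ["u1", "u1", "u1", "u2", "u2", "u2"],
        "feature": ["n_act_total"] * 6,
        "datediff": [1, 2, 3, 1, 2, 3],
        "cumulative_value": [2, 5, 5, 0, 1, 4],
    }
)


def decumulate(frame: pd.DataFrame) -> pd.DataFrame:
    """Add a per-week `value` column derived from `cumulative_value`."""
    out = frame.sort_values(["user_created", "feature", "datediff"]).copy()
    # diff() is NaN on the first row of each group; in that week the
    # weekly value is simply the cumulative value itself
    weekly = out.groupby(["user_created", "feature"])["cumulative_value"].diff()
    out["value"] = weekly.fillna(out["cumulative_value"])
    return out


weekly_toy = decumulate(toy)
print(weekly_toy)
```

Doing this step in SQL, as the notebook does, keeps the amount of data pulled into pandas smaller; the pandas version is mainly useful for checking the logic on a sample.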

In [None]:
df = df.replace(
    {
        "E1": "E01",
        "E2": "E02",
        "E3": "E03",
        "E4": "E04",
        "E5": "E05",
        "E6": "E06",
        "E7": "E07",
        "E8": "E08",
        "E9": "E09",
    }
)

# drop=True avoids keeping the old index as an extra "index" column
df = df.sort_values(by=["early_cluster", "user_created", "datediff"]).reset_index(
    drop=True
)
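
The mapping above zero-pads single-digit cluster labels so that they sort correctly as strings (otherwise `E10` would sort before `E2`). A minimal alternative sketch, assuming labels always have the form `E` followed by digits, uses a regular expression instead of listing each label. The `pad_cluster_label` helper is hypothetical.

```python
import pandas as pd


def pad_cluster_label(labels: pd.Series) -> pd.Series:
    """Zero-pad single-digit labels: 'E1' -> 'E01'; 'E12' and 'E03' are unchanged."""
    return labels.str.replace(r"^E(\d)$", r"E0\1", regex=True)


labels = pd.Series(["E1", "E9", "E10", "E12", "E03"])
padded = pad_cluster_label(labels)
print(sorted(padded))
```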

**Sample size**

In [None]:
df_plot = df[["user_id", "early_cluster"]].loc[
    # cast so the filter works whether datediff is loaded as int or str
    (df["datediff"].astype(int) == 1) & (df["feature"] == "n_act_total")
]

chart = (
    alt.Chart(df_plot)
    .mark_bar()
    .encode(
        x=alt.X(
            "early_cluster:N", axis=alt.Axis(title="Early clusters", labelAngle=-45)
        ),
        y="count()",
        color="early_cluster:N",
    )
    .properties(height=300, width=600)
)

chart

**High-activity clusters**

To plot other features or weeks, change the `datediff` and `feature` filters in `df_plot`.

In [None]:
fig, ax = plt.subplots(figsize=(12, 6))

df_plot = df.loc[
    # cast so the filter works whether datediff is loaded as int or str
    (df["datediff"].astype(int) == 5)
    & (df["feature"] == "dau_login")
    & (df["early_cluster"].isin(["E01", "E02", "E03", "E04", "E05", "E06", "E12"]))
]

sns.ecdfplot(
    data=df_plot,
    x="cumulative_value",
    hue="early_cluster",
).set_title(
    "Cumulative distributions of distinct login day counts in the first 5 weeks of activity"
)

ax.set(xlim=(0, 40))

**Low-activity clusters**

To plot other features or weeks, change the `datediff` and `feature` filters in `df_plot`.

In [None]:
fig, ax = plt.subplots(figsize=(12, 6))

df_plot = df.loc[
    # cast so the filter works whether datediff is loaded as int or str
    (df["datediff"].astype(int) == 5)
    & (df["feature"] == "dau_login")
    & (~df["early_cluster"].isin(["E01", "E02", "E03", "E04", "E05", "E06", "E12"]))
]

sns.ecdfplot(
    data=df_plot,
    x="cumulative_value",
    hue="early_cluster",
).set_title(
    "Cumulative distributions of distinct login day counts in the first 5 weeks of activity"
)

ax.set(xlim=(0, 40))
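
For readers less familiar with the cumulative distribution plots above: each curve shows, for a given number of login days, the share of users in the cluster with at most that many login days. A minimal sketch of the computation with NumPy, using made-up login-day counts and a hypothetical `ecdf` helper:

```python
import numpy as np


def ecdf(values):
    """Return the sorted values and the share of observations <= each value."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y


# Hypothetical distinct-login-day counts for the users of one cluster
login_days = [1, 3, 3, 7, 20]
x, y = ecdf(login_days)

# Share of users with at most 3 distinct login days
share_at_most_3 = float(np.mean(np.asarray(login_days) <= 3))
print(share_at_most_3)
```

A curve that rises quickly at low values indicates a cluster where most users log in on only a few days; a curve that rises slowly indicates more frequent logins.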

<br>
<br>
<br>

# Early engagement by clusters, across the first 5 weeks of activity

---

In [None]:
query = """
with main as (
    select 
        u.country_tnc_legal,
        fp.product_id,
        p.user_id,
        p.ft_mau_date,
        p.datediff,
        p.feature,
        cluster as early_cluster,
        coalesce(p.value,0) as cumulative_value,
        case 
            when datediff = 1 then coalesce(p.value,0)
            else coalesce(p.value,0) - lag(coalesce(p.value,0)) over (partition by p.user_created, feature order by datediff asc)
        end as value
    from 
        dev_dbt.stg_churn_act_unpivot p 
    inner join 
        (select user_id, cluster from dev_dbt.earlyuserclusters_2020sept) ec using (user_id)
    inner join 
        (select user_created, product_id from dbt.zrh_user_product where enter_reason = 'SIGNUP') fp using (user_created)
    inner join 
        dbt.zrh_users u on u.user_created = p.user_created 
            and coalesce(closed_at, '2100-01-01') >= dateadd(month, 6, ft_mau_date)
    where 
        1=1
        -- p.ft_mau_date between '2019-04-01' and '2019-07-01'
        -- and p.user_created >= '2019-01-01'
        ---and country_tnc_legal in ('DEU','AUT','FRA','ITA','ESP')
        -- and country_tnc_legal = 'FRA'
        and fp.product_id in ('STANDARD')
        and feature not ilike '%recency%'
        and feature not ilike '%dau%'
)

select 
    country_tnc_legal,
    product_id,
    user_id,
    ft_mau_date,
    datediff,
    feature,
    early_cluster,
    case when 
        split_part(feature,'_',1) = 'amount' then (cumulative_value/100)::int
        else cumulative_value
    end as cumulative_value,
    case when 
        split_part(feature,'_',1) = 'amount' then (value/100)::int
        else value
    end as value
from 
    main

"""

In [None]:
dfquery = df_from_sql("redshiftreader", query)

In [None]:
df = dfquery.copy()

In [None]:
df["early_cluster"] = df["early_cluster"].replace(
    {
        "E1": "E01",
        "E2": "E02",
        "E3": "E03",
        "E4": "E04",
        "E5": "E05",
        "E6": "E06",
        "E7": "E07",
        "E8": "E08",
        "E9": "E09",
    }
)

In [None]:
# 1 if the user had any activity in the week / up to and including the week, else 0
df["active"] = (df["value"] > 0).astype(int)
df["cumulative_active"] = (df["cumulative_value"] > 0).astype(int)

**Transaction data, facet charts for each cluster in each of the first 5 weeks of activity**

To plot other features, change the `feature` filter used to build `df1`.

In [None]:
df1 = df.loc[
    df["feature"] == "n_act_total",
    ~df.columns.isin(["user_id", "user_created", "ft_mau_date"]),
]

alt.Chart(df1).transform_joinaggregate(
    groupby=["early_cluster", "datediff"],
    TotalUsers="count(cumulative_active)",
).transform_calculate(
    PercentageActive="datum.active / datum.TotalUsers"
).mark_bar().encode(
    x=alt.X("datediff:Q", axis=alt.Axis(title="Relative week after FT MAU")),
    y=alt.Y(
        "sum(PercentageActive):Q", axis=alt.Axis(title="Percentage of active users")
    ),
    color="early_cluster:N",
).properties(
    width=150,
    height=150,
).facet(
    facet="early_cluster:N",
    columns=5,
    title="Percentage of users with at least 1 transaction",
)
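
The chart above divides each user's `active` flag by the number of users in the same cluster and week, then sums the result, which amounts to the share of active users per cluster and week. As a sanity check, the same number can be sketched directly in pandas. The toy data below is hypothetical.

```python
import pandas as pd

# Toy data (hypothetical): weekly transaction counts for two users in each of two clusters
toy = pd.DataFrame(
    {
        "early_cluster": ["E01"] * 4 + ["E07"] * 4,
        "datediff": [1, 1, 2, 2, 1, 1, 2, 2],
        "value": [3, 1, 0, 2, 1, 0, 0, 0],
    }
)
toy["active"] = (toy["value"] > 0).astype(int)

# The mean of a 0/1 flag is the share of users with at least one transaction
share_active = (
    toy.groupby(["early_cluster", "datediff"])["active"]
    .mean()
    .rename("share_active")
    .reset_index()
)
print(share_active)
```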

In [None]:
# "value" was listed twice, which produced duplicate column names
df1 = df.loc[
    df["feature"] == "n_act_total",
    ["datediff", "value", "cumulative_value", "early_cluster", "feature"],
]

alt.Chart(df1.loc[df1["cumulative_value"] >= 1]).mark_boxplot(outliers=False).encode(
    x=alt.X(
        "datediff:Q",
        axis=alt.Axis(title="Relative week after first time MAU conversion"),
    ),
    y=alt.Y("cumulative_value:Q", axis=alt.Axis(title="Number of transactions")),
    color="early_cluster:N",
).properties(
    width=150,
    height=150,
).facet(
    facet="early_cluster:N",
    columns=5,
    title="Number of transactions in the first 5 weeks of activity",
).configure_axis(
    grid=False
)

**Recency data, facet charts for each cluster in each of the first 5 weeks of activity**

To plot other features, change the `feature` filter used to build `df1`.

In [None]:
query = """
    select 
        u.country_tnc_legal,
        fp.product_id,
        p.user_id,
        p.ft_mau_date,
        p.datediff,
        p.feature,
        cluster as early_cluster,
        coalesce(p.value,0) as value
    from 
        dev_dbt.stg_churn_act_unpivot p 
    inner join 
        (select user_id, cluster from dev_dbt.earlyuserclusters_2020sept) ec using (user_id)
    inner join 
        (select user_created, product_id from dbt.zrh_user_product where enter_reason = 'SIGNUP') fp using (user_created)
    inner join 
        dbt.zrh_users u on u.user_created = p.user_created 
            and coalesce(closed_at, '2100-01-01') >= dateadd(month, 6, ft_mau_date)
    where 
        1=1
        and fp.product_id in ('STANDARD')
        and (feature ilike '%recency%' or feature ilike '%dau%')

"""

In [None]:
dfquery = df_from_sql("redshiftreader", query)

In [None]:
df = dfquery.copy()

In [None]:
df["early_cluster"] = df["early_cluster"].replace(
    {
        "E1": "E01",
        "E2": "E02",
        "E3": "E03",
        "E4": "E04",
        "E5": "E05",
        "E6": "E06",
        "E7": "E07",
        "E8": "E08",
        "E9": "E09",
    }
)

In [None]:
df1 = df[["datediff", "value", "early_cluster", "feature"]].loc[
    df["feature"] == "recency_txn"
]

alt.Chart(df1.loc[df1["value"] >= 1]).mark_boxplot(outliers=False).encode(
    x=alt.X(
        "datediff:Q",
        axis=alt.Axis(title="Relative week after first time MAU conversion"),
    ),
    y=alt.Y(
        "value:Q",
        axis=alt.Axis(title="Number of days"),
        scale=alt.Scale(domain=[0, 40]),
    ),
    color="early_cluster:N",
).properties(
    width=150,
    height=150,
).facet(
    facet="early_cluster:N",
    columns=5,
    title="Number of days since last transaction, per activity week and cluster",
)
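
The recency and stickiness features plotted in this section are precomputed in `dev_dbt.stg_churn_act_unpivot`. As a rough illustration of what they measure (days since last action, and number of distinct active days), here is a minimal sketch on a made-up activity log. The column names, dates and reference date are all hypothetical and do not reflect how the dbt model is built.

```python
import pandas as pd

# Hypothetical activity log: one row per user action
log = pd.DataFrame(
    {
        "user_id": ["a", "a", "a", "b", "b", "b"],
        "action_date": pd.to_datetime(
            [
                "2020-01-01",
                "2020-01-01",
                "2020-01-05",
                "2020-01-02",
                "2020-01-10",
                "2020-01-30",
            ]
        ),
    }
)
as_of = pd.Timestamp("2020-02-04")  # hypothetical end of the observation window

summary = log.groupby("user_id")["action_date"].agg(
    last_action="max",  # most recent action date
    distinct_active_days="nunique",  # stickiness: number of distinct active days
)
# Recency: days elapsed between the last action and the reference date
summary["recency_days"] = (as_of - summary["last_action"]).dt.days
print(summary)
```

In this toy example, user `a` has a long recency and few active days, the kind of pattern described above as a possible warning sign of disengagement.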

# Query for unbounded retention calculations

---

All visualisations were done in a [spreadsheet](https://docs.google.com/spreadsheets/d/1FWVsq9XvGkmlyQ11svdaybqUG3yU8xvqgW9T2sL70PQ/edit?usp=sharing).
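
The query in this section returns two measures per relative month: `retention` counts users active in that month, while `unbounded_retention` counts users whose last activity falls in that month or later. A minimal sketch of the difference, using made-up month offsets relative to first-time MAU:

```python
import pandas as pd

# Hypothetical month offsets (relative to first-time MAU) in which each user was active
active_months = {"u1": {0, 1, 4}, "u2": {0}, "u3": {0, 2}}

rows = []
for month in range(5):
    # Classic retention: user was active in this specific month
    retained = sum(month in months for months in active_months.values())
    # Unbounded retention: user was active in this month or any later month
    unbounded = sum(max(months) >= month for months in active_months.values())
    rows.append(
        {
            "relative_month": month,
            "retention": retained,
            "unbounded_retention": unbounded,
        }
    )

curve = pd.DataFrame(rows)
print(curve)
```

Unbounded retention can never be lower than classic retention and never increases over time, which makes it a smoother view of how many users have not yet lapsed for good.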

In [None]:
query = """
    with act as (
        select 
            zu.user_created,
            zu.user_id,
            country,
            first_product,
            zu.ft_mau::date as ft_mau_date,
            early_cluster,
            m.max_l_date
        from 
            (select user_created, user_id, ft_mau, country_tnc_legal as country from dbt.zrh_users) zu 
        inner join 
            (select user_created, product_id as first_product from dbt.zrh_user_product where enter_reason = 'SIGNUP') p using (user_created)
        left join 
            (select user_created, max(txn_date) as max_l_date from dbt.zrh_txn_day_core where n_act_total > 0 group by 1) m using (user_created)
        inner join 
            dwh_earlycluster_labels ec using (user_id)	
        where 
            country in ('DEU','AUT','FRA','ITA','ESP')
            and ft_mau < current_date - interval '2 years'
    )
    select 
        early_cluster,
        country,
        first_product,
        datediff(month, ft_mau_date, d.start_time) as relative_day, -- month offset from first-time MAU, despite the column name
        count(act.user_created) as ft_mau,
        sum(case when max_l_date >= d.start_time then 1 else 0 end) as unbounded_retention,
        count(ua.user_created) as retention
    from 
        dwh_cohort_months d
    inner join 
        act on d.end_time between ft_mau_date and dateadd(month, 24, ft_mau_date)
    left join 
        (select user_created, date_trunc('month', txn_date)::date as act_date from dbt.zrh_txn_day_core where n_act_total > 0 group by 1,2) ua 
            on act.user_created = ua.user_created 
                and d.start_time::date = ua.act_date::date
    group by 1,2,3,4
    order by 1,2,3,4
"""