title: What do customers use overdraft for?
author: Helder Silva 
date: 2022-08-24
region: EU
tags: overdraft, bank products, credit risk, lisbon, transactions
summary: In this research we look into users who had the overdraft product enabled in H1 2022, and understand how they are using the product, by splitting them into usage groups, and understanding how these groups transact while using this product. These groups are split into 2 usage types: Number of days in a month using overdraft (days groups) and Percentage of the limit used (usage groups). One of our most interesting findings was that users who use the overdraft product have gambling, money cash financial and ATM transactions in the top 10 transaction categories, whereas users who don't use overdraft don't have these categories in their top 10. Also, users who are in arrears or use overdraft for >=28 days tend to have Wise Transfers (Foreign Currency Transfers) on top of their outgoing transactions by volume.

![header](overdraft_usage_header.png)

In this research we look into users who had the overdraft product enabled in H1 2022, and understand how they are using the product, by splitting them into usage groups, and understanding how these groups transact while using this product.

These groups are split into 2 usage types:

---
- **Number of days in a month using overdraft (days groups)**
 - not using overdraft
 - using overdraft <=10 days in a month
 - using overdraft >=11 and <= 27 in a month
 - using overdraft >=28 in a month
---
- **Percentage of the limit used (usage groups)**
 - 0% usage (not using overdraft)
 - <=20% usage
 - 20% to 40% usage
 - 40% to 60% usage
 - 60% to 80% usage
 - 80% to 90% usage
 - \>90% usage
 - in arrears (above 100% usage)
 ---
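
As a minimal sketch of how a single user-month could be assigned to these groups: the function names and the treatment of exact boundaries (for example, whether exactly 40% belongs to "20% to 40%") are assumptions made for illustration, not the logic of the underlying SQL.

```python
def days_bucket(n_days: int) -> str:
    """Map the number of days in overdraft within a month to a days group."""
    if n_days <= 0:
        return "not using od"
    if n_days <= 10:
        return "<=10"
    if n_days <= 27:
        return ">=11 and <= 27"
    return ">=28"


def usage_bucket(perc_of_limit: float) -> str:
    """Map the percentage of the overdraft limit used to a usage group.

    Assumption: upper bounds are inclusive (exactly 40% falls in '20% to 40%').
    """
    if perc_of_limit <= 0:
        return "0%"
    if perc_of_limit > 100:
        return "in arrears"
    # (upper bound, label) pairs, checked in ascending order
    bounds = [
        (20, "<=20%"),
        (40, "20% to 40%"),
        (60, "40% to 60%"),
        (80, "60% to 80%"),
        (90, "80% to 90%"),
    ]
    for upper, label in bounds:
        if perc_of_limit <= upper:
            return label
    return "> 90%"


print(days_bucket(12), "|", usage_bucket(95.0))
```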
 
Having these groups in mind, we will be answering the following questions:
- [How many users do we have in each group?](#section1)
 - [How stable are these groups over time?](#section1.1)
 - [How do their Lisbon scores evolve over time?](#section1.2)
- [What are the top 10 outgoing transaction categories for each group?](#section2)
 - [Days Groups](#section2.1)
 - [Usage Groups](#section2.2)
- [What are the top 10 incoming transaction categories for each group?](#section3)
 - [Days Groups](#section3.1)
 - [Usage Groups](#section3.2)


In [1]:
%%capture
cd /app/

In [2]:
%%capture
!pip install duckdb
!pip install altair

In [3]:
import pandas as pd
from utils.datalib_database import df_from_sql
import utils.altair_functions as af
import altair as alt
import duckdb

con = duckdb.connect(database=":memory:", read_only=False)

In [4]:
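def heatmap(df, color, x, y, color_condition, width, height, tooltip):
    """Build a labelled heatmap: coloured cells with the cell value printed on top."""
    # Base layer: one rectangle per (x, y) pair, coloured by `color`
    base = (
        alt.Chart(df)
        .mark_rect()
        .encode(alt.X(x), alt.Y(y), color=color, tooltip=tooltip)
        .properties(width=width, height=height)
    )

    # Text layer: black labels where `color_condition` holds, white otherwise
    text = base.mark_text(baseline="middle").encode(
        text=color,
        color=alt.condition(color_condition, alt.value("black"), alt.value("white")),
    )

    return base + text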

In [5]:
df = df_from_sql("redshiftreader", "select * from dev_dbt.temp_od_usage_20220815")

In [6]:
usage_df = df_from_sql(
    "redshiftreader",
    "research/product/bank_products/20220711_what_do_customers_use_overdraft_for/usage_evolution.sql",
)

<a id='section1'></a>
# How many users do we have in each group?

Here we can see that 39.4% of users with overdraft enabled in H1 2022 didn't use the product. For the ones that did use overdraft in this timeframe, our group split led to roughly even groups.

In [7]:
days_users_query = """
select 
days_bucket, count(distinct user_created) as n_users, round(count(*)::numeric/ sum(count(*)) over(), 3)*100 as perc_users 
from usage_df 
group by 1
order by 1
"""
days_users_df = con.execute(days_users_query).fetchdf()

In [8]:
days_users_df

| days_bucket | n_users | perc_users |
|---|---|---|
| not using od | 58332 | 39.4 |
| <=10 | 27989 | 18.1 |
| >=11 and <= 27 | 33344 | 21.1 |
| >=28 | 30594 | 21.4 |
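
In the query above, `n_users` counts distinct users while `perc_users` is derived from row counts (`count(*)`), so the two columns need not agree if a user contributes more than one row to `usage_df`. The sketch below illustrates the difference on made-up toy data; the one-row-per-user-and-month layout is an assumption about `usage_df`, not something confirmed here.

```python
import pandas as pd

# Toy data (made up): one row per user and month, as assumed for usage_df
toy = pd.DataFrame(
    {
        "user_created": ["a", "a", "a", "b", "c", "c"],
        "days_bucket": ["<=10", "<=10", "<=10", ">=28", "<=10", ">=28"],
    }
)

summary = toy.groupby("days_bucket").agg(
    n_users=("user_created", "nunique"),  # count(distinct user_created)
    n_rows=("user_created", "size"),      # count(*)
)
# Share of rows per bucket, mirroring count(*) / sum(count(*)) over ()
summary["perc_rows"] = (summary["n_rows"] / summary["n_rows"].sum() * 100).round(1)
# Share of distinct users per bucket; a user can appear in several buckets
summary["perc_distinct"] = (
    summary["n_users"] / summary["n_users"].sum() * 100
).round(1)
print(summary)
```

On this toy input the row-based share of the `<=10` bucket is higher than its distinct-user share, because user `a` contributes three rows to it.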


In [9]:
af.column_single_label(
    days_users_df, af.teal, "days_bucket", "perc_users", 800, 400, "x"
).properties(title="Days buckets groups distribution")

As for the usage groups, we can see that 23.4% of all users are using up to 20% of their overdraft limit. Since we are particularly interested in exploring high overdraft usage, we split the 80%+ group into intervals of 10% instead of 20%, so we can have a more detailed view of the >90% usage users, who correspond to 8.4% of all users.

In [10]:
usage_users_query = """
select 
usage_buckets, count(distinct user_created) as n_users, round(count(*)::numeric/ sum(count(*)) over(), 3)*100 as perc_users 
from usage_df 
group by 1
order by 1
"""
usage_users_df = con.execute(usage_users_query).fetchdf()
usage_users_df

| usage_buckets | n_users | perc_users |
|---|---|---|
| 0% | 58332 | 39.4 |
| <=20% | 34960 | 23.4 |
| 20% to 40% | 12753 | 8.4 |
| 40% to 60% | 10815 | 7.1 |
| 60% to 80% | 11265 | 7.3 |
| 80% to 90% | 7589 | 5.0 |
| > 90% | 12594 | 8.4 |
| in arrears | 1951 | 1.1 |


In [11]:
af.column_single_label(
    usage_users_df, af.petrol, "usage_buckets", "perc_users", 800, 400, "x"
).properties(title="Usage buckets groups distribution")

<a id='section1.1'></a>
## How stable are these groups over time?

Here we can see that the number of monthly days per group is relatively stable in the 6 months we have looked into, with a small decrease in February, which is of course due to this month being shorter than the remaining ones.
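
The February effect follows directly from the calendar. As a quick check using only the standard library, the number of days in each month of H1 2022 is:

```python
import calendar

# Number of calendar days in each month of H1 2022 (January to June)
days_in_month = {month: calendar.monthrange(2022, month)[1] for month in range(1, 7)}
print(days_in_month)
# {1: 31, 2: 28, 3: 31, 4: 30, 5: 31, 6: 30}
```

With only 28 days available, a user can reach the >=28 group in February only by being in overdraft every single day of the month.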

In [12]:
days_month_query = """
select 
month, days_bucket, avg(avg_n_days) as avg_n_days
from usage_df 
where days_bucket != ' not using od'
group by 1, 2
order by 1, 2
"""
days_month_df = con.execute(days_month_query).fetchdf()

In [13]:
af.line_multi(
    days_month_df, "days_bucket", "month", "avg_n_days", 800, 400, "x"
).properties(title="Days buckets values over time")

We can also see a similarly stable pattern for the usage buckets.

In [14]:
usage_month_query = """
select 
month, usage_buckets, avg(avg_perc_usage) as avg_perc_usage
from usage_df 
where usage_buckets != ' 0%'
group by 1, 2
order by 1, 2
"""
usage_month_df = con.execute(usage_month_query).fetchdf()

In [15]:
af.line_multi(
    usage_month_df, "usage_buckets", "month", "avg_perc_usage", 800, 400, "x"
).properties(title="Usage buckets values over time")

<a id='section1.2'></a>
## How do their Lisbon scores evolve over time?

As for rating classes, we can see an interesting shift after April 2022. This corresponds to the launch of the 2.0 version of the service that originates these scores (Lisbon). With this new version, there is a bigger difference between these groups and the >=28 days group has an average very close to 12, the cut-off score for eligibility.
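
The next query averages rating classes with the formula `sum(avg_rating_class * n_users) / sum(n_users)`. As a generic sketch of that weighted-mean formula, on made-up rows where the weights differ within a group (whether they differ in `usage_df` is not visible here), and using the cut-off of 12 quoted above:

```python
import numpy as np
import pandas as pd

# Toy rows (made up): average rating class and user count per sub-group
toy = pd.DataFrame(
    {
        "month": ["2022-05", "2022-05", "2022-05"],
        "days_bucket": [">=28", ">=28", "<=10"],
        "avg_rating_class": [12.0, 10.0, 6.0],
        "n_users": [300, 100, 200],
    }
)


def weighted_rating(group: pd.DataFrame) -> float:
    """User-weighted mean: sum(avg_rating_class * n_users) / sum(n_users)."""
    value = np.average(group["avg_rating_class"], weights=group["n_users"])
    return round(float(value), 1)


result = {
    key: weighted_rating(group)
    for key, group in toy.groupby(["month", "days_bucket"])
}

CUTOFF = 12  # eligibility cut-off quoted in the text
for key, value in result.items():
    status = "above cut-off" if value > CUTOFF else "at or below cut-off"
    print(key, value, status)
```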

In [16]:
days_score_query = """
with totals as (
select 
month, days_bucket,
count(distinct user_created) as n_users
from usage_df 
group by 1, 2
)
select 
month, days_bucket,
round(sum(avg_rating_class * n_users)::numeric/sum(n_users), 1) as avg_rating_class
from usage_df 
inner join totals using (month, days_bucket)
group by 1, 2
order by 1, 2
"""
days_score_df = con.execute(days_score_query).fetchdf()

In [17]:
af.line_multi(
    days_score_df, "days_bucket", "month:O", "avg_rating_class", 800, 400, "x"
).properties(title="Days buckets Lisbon scores over time")

We see a similar polarizing effect after the Lisbon 2.0 launch when it comes to usage buckets. This time around, we can see that users with 80%+ usage are on average above 12, meaning once again that these high-intensity users would, on average, no longer be eligible for the product after the Lisbon 2.0 launch.

In [18]:
usage_score_query = """
with totals as (
select 
month, usage_buckets,
count(distinct user_created) as n_users
from usage_df 
group by 1, 2
)
select 
month, usage_buckets, 
round(sum(avg_rating_class * n_users)::numeric/sum(n_users), 1) as avg_rating_class
from usage_df 
inner join totals using (month, usage_buckets)
group by 1, 2
order by 1, 2
"""
usage_score_df = con.execute(usage_score_query).fetchdf()

In [19]:
af.line_multi(
    usage_score_df, "usage_buckets", "month:O", "avg_rating_class", 800, 400, "x"
).properties(title="Usage buckets Lisbon scores over time")

<a id='section2'></a>
# What are the top 10 outgoing transaction categories for each group?

There are a few transformations we decided to do in order to make this metric meaningful. First, as for transaction categories, we started from the transaction types ([here](https://number26-jira.atlassian.net/wiki/spaces/CSKB/pages/1167065684/Transaction+Types+2.0) you can find the details for each of those types). Since a lot of the big outgoing transaction groups such as Card Presentments (PTs) and Direct Transfers didn't give us much detail, we added more granularity to these:
- Added MCC categories for card transactions (PTs);
- Added Amherst categories (categories from our internal transaction classifier service) to Direct Transfers and Direct Debits;
- Split loan repayment transactions into their own group;
- Added fees categorizations from our fees model;
- Split spaces transactions into their own group.
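
The refinement steps above can be sketched as a single mapping function. Everything in this sketch is hypothetical: the type codes (`"PT"`, `"DT"`, `"DD"`), the argument names, the output labels and the order of precedence are assumptions chosen for illustration, and the actual logic lives in the SQL behind this research.

```python
from typing import Optional


def transaction_category(
    txn_type: str,
    mcc_category: Optional[str] = None,
    amherst_category: Optional[str] = None,
    fee_category: Optional[str] = None,
    is_loan_repayment: bool = False,
    is_space_transfer: bool = False,
) -> str:
    """Return a more granular category for one outgoing transaction."""
    # Dedicated groups take precedence (assumed order)
    if is_loan_repayment:
        return "loan repayment"
    if is_space_transfer:
        return "spaces"
    if fee_category:
        return f"{fee_category} fees"
    # Card transactions are refined by their MCC category
    if txn_type == "PT" and mcc_category:
        return f"card: {mcc_category}"
    # Direct Transfers and Direct Debits are refined by the classifier category
    if txn_type in ("DT", "DD") and amherst_category:
        return f"{txn_type}: {amherst_category}"
    # Anything without extra detail keeps its transaction type
    return f"uncategorized {txn_type}"


print(transaction_category("PT", mcc_category="grocery_market"))
print(transaction_category("DT"))
```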

Also, in order to exclude outliers where users have only one or two transactions that account for very large percentages of their transaction proportions, we're excluding users with fewer than 6 transactions in the selected 6-month period. This leads to an 11.9% decrease in the user base described above (from 150.3k users to 132.4k users).
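
The size of that reduction can be checked from the two rounded user counts quoted above:

```python
users_before = 150.3  # thousand users in the base described above
users_after = 132.4   # thousand users left after the minimum-transactions filter

decrease = (users_before - users_after) / users_before * 100
print(f"{decrease:.1f}% decrease")  # 11.9% decrease
```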
 
Then, in order to understand the impact of each transaction category in user spending, we calculated the percentage of each transaction category for each user (100% being the total outgoing transactions), and then calculated the average for all users in each group. We split these results into 2 dimensions: number of transactions and transaction volume.
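
A minimal pandas sketch of this two-step calculation, on made-up toy transactions. The column names are illustrative, and treating a category a user never used as 0% (rather than leaving it out of the average) is an assumption; the minimum-transactions filter is not applied here.

```python
import pandas as pd

# Toy outgoing transactions (made up); column names are illustrative
txns = pd.DataFrame(
    {
        "user": ["u1", "u1", "u1", "u1", "u2", "u2"],
        "group": ["<=10"] * 6,
        "category": ["grocery", "grocery", "grocery", "rent", "grocery", "rent"],
        "amount": [10.0, 20.0, 10.0, 360.0, 50.0, 450.0],
    }
)

# Step 1: number of transactions and volume per user and category
per_user = txns.groupby(["group", "user", "category"])["amount"].agg(
    n_txns="count", volume="sum"
)
# One column per category; a category a user never used counts as 0
per_user = per_user.unstack("category", fill_value=0)

# Per-user share of each category (each user's row sums to 100%)
count_share = per_user["n_txns"].div(per_user["n_txns"].sum(axis=1), axis=0) * 100
volume_share = per_user["volume"].div(per_user["volume"].sum(axis=1), axis=0) * 100

# Step 2: average the per-user shares across all users of a group
avg_count_share = count_share.groupby(level="group").mean().round(1)
avg_volume_share = volume_share.groupby(level="group").mean().round(1)
print(avg_count_share)
print(avg_volume_share)
```

In this toy input, grocery dominates by number of transactions while rent dominates by volume, which is why the two dimensions are reported separately.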

**Note:** For each of the charts below, we only see the top 10 transactions per group. Whenever we see empty transaction categories there, it doesn't mean we don't have transactions there, it just means that they didn't make it to the top 10 for that group.
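
To illustrate why empty categories appear, here is a small pandas sketch of a per-group top-N selection, analogous to the `row_number()` filter in the query below. The data is made up, and `TOP_N` is set to 2 only to keep the output short.

```python
import pandas as pd

# Toy average shares per group and category (made up)
shares = pd.DataFrame(
    {
        "days_bucket": ["<=10", "<=10", "<=10", ">=28", ">=28", ">=28"],
        "type": ["grocery", "rent", "spaces", "grocery", "gambling", "atm"],
        "perc": [15.0, 9.0, 7.0, 14.0, 8.0, 6.0],
    }
)

TOP_N = 2  # the notebook uses 10

# Rank categories within each group, highest share first (like row_number())
shares["rn"] = (
    shares.groupby("days_bucket")["perc"]
    .rank(method="first", ascending=False)
    .astype(int)
)
top = shares[shares["rn"] <= TOP_N]

# Chart layout: categories outside a group's top N show up as NaN (empty)
pivot = top.pivot(index="type", columns="days_bucket", values="perc")
print(pivot)
```

Here `rent` is empty for the `>=28` group and `gambling` is empty for the `<=10` group, even though both groups could still have transactions in those categories.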

In [20]:
top_10_out_txns_query = """
with totals as (
select 
case when usage_buckets in ('0%', '<=20%') then ' ' || usage_buckets else usage_buckets end as usage_buckets,
days_bucket, 
case when type in ('bank_overdraft', 'memberships') then type || ' fees' else type end as type,
round(perc_outgoing_txns_per_user, 3)*100 as perc_outgoing_txns_per_user,
round(perc_outgoing_txn_volume_per_user, 3)*100 as perc_outgoing_txn_volume_per_user,
row_number() over(partition by usage_buckets, days_bucket order by perc_outgoing_txns_per_user desc) as tnxs_rn,
row_number() over(partition by usage_buckets, days_bucket order by perc_outgoing_txn_volume_per_user desc) as tnx_volume_rn
from df
where perc_outgoing_txn_volume_per_user is not null
)
select 
'days n outgoing txns' as label,
* from totals 
where tnxs_rn <= 10
and usage_buckets = 'All usage buckets'
union all 
select 
'days outgoing txns volume' as label,
* from totals where tnx_volume_rn <= 10
and usage_buckets = 'All usage buckets'
union all
select 
'usage n outgoing txns' as label,
* from totals 
where tnxs_rn <= 10
and trim(days_bucket, 0) = 'All days buckets'
union all 
select 
'usage outgoing txns volume' as label,
* from totals 
where tnx_volume_rn <= 10
and trim(days_bucket, 0) = 'All days buckets'
order by 1, 2
"""
top_10_out_txns_df = con.execute(top_10_out_txns_query).fetchdf()

<a id='section2.1'></a>
## Days Groups

When it comes to the number of transactions, we can see that:
- For all groups, the most frequent transaction category is grocery market card transactions, ranging between 14% and 15.8%.
- Users who use the overdraft product for 11 days or more have gambling card transactions, gambling fees, money cash financial and ATM in their top 10 categories, whereas users who don't use overdraft don't have these categories in their top 10 (you can find a list of the top 10 merchants for gambling and money cash financial in [our annex](#section4)).
- Users who either don't use overdraft or use it for 10 days or fewer have outgoing spaces transactions in their top 10, which is not the case for the remaining groups.



In [21]:
af.column_multi(
    top_10_out_txns_df[top_10_out_txns_df["label"] == "days n outgoing txns"],
    "days_bucket",
    "type",
    "perc_outgoing_txns_per_user",
    "days_bucket",
    180,
    400,
    "-y",
).properties(
    title="Days buckets top 10 outgoing transactions by number of transactions"
)

As for the transaction volume, we see the following patterns:
- One of the main differences we see from the charts above is that grocery market card transactions no longer come up first for most groups, meaning that these are generally high quantity, low volume transactions.
- Money cash financial transactions are the leading category for all groups that use overdraft for 11 days or more, whereas they don't make it to the top 10 for users who don't use this product or use it for 10 days or fewer.
- We also see that overdraft users have ATM and gambling transactions in their top 10, which is not the case for users who don't use overdraft.

In [22]:
af.column_multi(
    # Use the volume-ranked top 10 for the volume chart (not the count-ranked one)
    top_10_out_txns_df[top_10_out_txns_df["label"] == "days outgoing txns volume"],
    "days_bucket",
    "type",
    "perc_outgoing_txn_volume_per_user",
    "days_bucket",
    180,
    400,
    "-y",
).properties(title="Days buckets top 10 outgoing transactions by transaction volume")

<a id='section2.2'></a>
## Usage Groups

When it comes to the number of transactions, we can see some interesting patterns that separate users who use none or up to 90% of their overdraft limit from the ones who use more than 90% or are in arrears:
- For the groups up to 90% usage, grocery market card transactions lead the top 10.
- For >90% users, gambling transactions take the first place.
- For users in arrears, 5 of the top 10 categories are fees; the highest is membership fees, with an average of 27.5%.

In [23]:
af.column_multi(
    top_10_out_txns_df[top_10_out_txns_df["label"] == "usage n outgoing txns"],
    "usage_buckets",
    "type",
    "perc_outgoing_txns_per_user",
    "usage_buckets",
    180,
    400,
    "-y",
).properties(
    title="Usage buckets top 10 outgoing transactions by number of transactions"
)

When looking at the transaction volumes, we see some different patterns:
- For the groups that are not using overdraft or using it below 40%, the top 3 categories are rent, spaces and uncategorized Direct Transfers. 
- For groups that use overdraft above 40% but are not in arrears, the top 3 transaction categories become rent, money cash financial, and ATM transactions. 
- Overdraft fees are not part of the top 10 for any group up to 100% usage, but increase considerably in arrears, becoming the 2nd biggest category for the arrears group with 33.6% of all outgoing transaction volume in this group. The biggest category there is Wise Transfers (Foreign Currency Transfers) with 39.9%.

In [24]:
af.column_multi(
    top_10_out_txns_df[top_10_out_txns_df["label"] == "usage outgoing txns volume"],
    "usage_buckets",
    "type",
    "perc_outgoing_txn_volume_per_user",
    "usage_buckets",
    180,
    400,
    "-y",
).properties(title="Usage buckets top 10 outgoing transactions by transaction volume")

<a id='section3'></a>
# What are the top 10 incoming transaction categories for each group?

Here we apply the same logic as above, but only for incoming transactions (e.g. considering Credit Transfers instead of Direct Transfers and Direct Debits).

In [25]:
top_10_in_txns_query = """
with totals as (
select 
case when usage_buckets in ('0%', '<=20%') then ' ' || usage_buckets else usage_buckets end as usage_buckets,
days_bucket, 
type,
round(perc_incoming_txns_per_user, 3)*100 as perc_incoming_txns_per_user,
round(perc_incoming_txn_volume_per_user, 3)*100 as perc_incoming_txn_volume_per_user,
row_number() over(partition by usage_buckets, days_bucket order by perc_incoming_txns_per_user desc) as tnxs_rn,
row_number() over(partition by usage_buckets, days_bucket order by perc_incoming_txn_volume_per_user desc) as tnx_volume_rn
from df
where perc_incoming_txn_volume_per_user is not null
)
select 
'days n incoming txns' as label,
* from totals 
where tnxs_rn <= 10
and usage_buckets = 'All usage buckets'
union all 
select 
'days incoming txns volume' as label,
* from totals where tnx_volume_rn <= 10
and usage_buckets = 'All usage buckets'
union all
select 
'usage n incoming txns' as label,
* from totals 
where tnxs_rn <= 10
and trim(days_bucket, 0) = 'All days buckets'
union all 
select 
'usage incoming txns volume' as label,
* from totals 
where tnx_volume_rn <= 10
and trim(days_bucket, 0) = 'All days buckets'
"""
top_10_in_txns_df = con.execute(top_10_in_txns_query).fetchdf()

<a id='section3.1'></a>
## Days Groups

When it comes to the number of incoming transactions, we can see 2 main patterns:
- The vast majority of transactions come from uncategorized Credit Transfers
- Spaces is the 2nd biggest for all groups except for the >=28 group, which has uncategorized income, moneybeam and card transaction refunds ahead of spaces.

In [26]:
af.column_multi(
    top_10_in_txns_df[top_10_in_txns_df["label"] == "days n incoming txns"],
    "days_bucket",
    "type",
    "perc_incoming_txns_per_user",
    "days_bucket",
    180,
    400,
    "-y",
).properties(
    title="Days buckets top 10 incoming transactions by number of transactions"
)

As for the transaction volume we can see:
- An even bigger share of uncategorized credit transfers, at 85% or more for all groups.
- Spaces and salary income take the next 2 places in the top 10 in the groups up to 27 days of usage. 
- For the >=28 user group, moneybeam transactions take the 2nd place.

In [27]:
af.column_multi(
    # Use the volume-ranked top 10 for the volume chart (not the count-ranked one)
    top_10_in_txns_df[top_10_in_txns_df["label"] == "days incoming txns volume"],
    "days_bucket",
    "type",
    "perc_incoming_txn_volume_per_user",
    "days_bucket",
    180,
    400,
    "-y",
).properties(title="Days buckets top 10 incoming transactions by transaction volume")

<a id='section3.2'></a>
## Usage Groups

When looking at number of transactions for usage groups, the following patterns emerge:
- Uncategorized credit transfers are the biggest bucket.
- Spaces, uncategorized income, and moneybeam transactions are respectively 2nd, 3rd and 4th categories for groups with usage up to 80% - after that, between 80% and 100%, spaces transactions fall to 4th place.
- As for the arrears group, we have uncategorized income, salary income, and moneybeam taking respectively the 2nd, 3rd and 4th positions.


In [28]:
af.column_multi(
    top_10_in_txns_df[top_10_in_txns_df["label"] == "usage n incoming txns"],
    "usage_buckets",
    "type",
    "perc_incoming_txns_per_user",
    "usage_buckets",
    180,
    400,
    "-y",
).properties(
    title="Usage buckets top 10 incoming transactions by number of transactions"
)

And finally, we can see the following for the transaction volumes:
- Unsurprisingly by now, uncategorized credit transfers take the lead by far.
- Spaces take the 2nd place for usage up to 20%.
- Between 20% and 90% usage, salary income transactions take the 2nd place.
- And finally, we can see that the share of moneybeam transactions generally increases with higher usage buckets, taking 2nd place for >90% usage and for users in arrears.


In [29]:
af.column_multi(
    top_10_in_txns_df[top_10_in_txns_df["label"] == "usage incoming txns volume"],
    "usage_buckets",
    "type",
    "perc_incoming_txn_volume_per_user",
    "usage_buckets",
    180,
    400,
    "-y",
).properties(title="Usage buckets top 10 incoming transactions by transaction volume")

<a id='section4'></a>
# Annex

## Who are the top 10 merchants behind 'gambling_gaming' and 'money_cash_financial' transactions?

In [30]:
merchant_df = df_from_sql(
    "redshiftreader",
    "research/product/bank_products/20220711_what_do_customers_use_overdraft_for/gambling_finance_merchants.sql",
)

In [31]:
merchant_df

| mcc_category | merchant_name | sum_n_txns | n_users | rn |
|---|---|---|---|---|
| gambling_gaming | WIN2DAY | 15484 | 824 | 1 |
| gambling_gaming | TIPICO | 14058 | 866 | 2 |
| gambling_gaming | LOTTOLAND | 9189 | 181 | 3 |
| gambling_gaming | POKERSTARS | 8022 | 672 | 4 |
| gambling_gaming | Red Rhino Limited | 7954 | 442 | 5 |
| gambling_gaming | BET365 | 7693 | 455 | 6 |
| gambling_gaming | BWIN.DE | 7641 | 364 | 7 |
| gambling_gaming | Lotto24 | 7478 | 795 | 8 |
| gambling_gaming | ROL\*Wunderino | 4019 | 280 | 9 |
| gambling_gaming | PAYPAL \*HILLSIDESPO | 3516 | 167 | 10 |
