<a href="https://colab.research.google.com/github/Laurenyoshizuka/sample_churn_retention_analysis/blob/main/sample_churn_retention_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [223]:
import pandas as pd
import os
import sqlite3
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots

# Data loading and DB creation

In [None]:
!git clone https://github.com/Laurenyoshizuka/sample_churn_retention_analysis.git

In [98]:
os.chdir('/content/sample_churn_retention_analysis')

cust_rev_mon = pd.read_csv('/content/sample_churn_retention_analysis/customer_revenue_monthly.csv', parse_dates=['MONTH'])
dim_cust = pd.read_csv('/content/sample_churn_retention_analysis/dim_customer.csv')
dim_serv = pd.read_csv('/content/sample_churn_retention_analysis/dim_service.csv')

In [99]:
conn = sqlite3.connect(':memory:')

cust_rev_mon.to_sql('cust_rev_mon', conn, index=False, if_exists='replace')
dim_cust.to_sql('dim_cust', conn, index=False, if_exists='replace')
dim_serv.to_sql('dim_serv', conn, index=False, if_exists='replace')

5
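
The `5` above is the return value of the last `to_sql` call, which in recent pandas versions is the number of rows written. As a rough sketch of how the load could be sanity-checked from the SQL side, the snippet below lists every table in an in-memory database with its row count. The tables `dim_demo` and `fact_demo` are made up for illustration and are not part of the dataset.

```python
import sqlite3

# Hypothetical stand-in tables, only for illustration
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE dim_demo (id INTEGER, name TEXT)")
demo_conn.executemany("INSERT INTO dim_demo VALUES (?, ?)", [(1, "a"), (2, "b")])
demo_conn.execute("CREATE TABLE fact_demo (id INTEGER, value REAL)")
demo_conn.executemany(
    "INSERT INTO fact_demo VALUES (?, ?)", [(1, 1.5), (1, 2.5), (2, 3.0)]
)

# sqlite_master holds one row per object in the database
table_names = [
    row[0]
    for row in demo_conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]

row_counts = {}
for table in table_names:
    # Names come straight from sqlite_master, so quoting them as identifiers is safe here
    row_counts[table] = demo_conn.execute(
        f'SELECT COUNT(*) FROM "{table}"'
    ).fetchone()[0]

print(row_counts)
```

Run against `conn`, the same loop should report one entry each for `cust_rev_mon`, `dim_cust` and `dim_serv`.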

# Quick overview of the csv data before starting analysis

In [75]:
def overview_of_dataframes(df1, df2, df3, name1='Customer Revenue Monthly Table', name2='Customer Dimension Table', name3='Service Dimension Table'):
    """
    Provides an overview of three DataFrames

    Parameters:
    df1, df2, df3 (pd.DataFrame): DataFrames to be summarized
    name1, name2, name3 (str): Optional names for the DataFrames for better readability

    Returns:
    None
    """
    dataframes = [df1, df2, df3]
    names = [name1, name2, name3]

    for df, name in zip(dataframes, names):
        print(f"--- {name} ---")
        print(f"Shape: {df.shape}")
        print(f"Columns: {df.columns.tolist()}")
        print("Data Types:")
        print(df.dtypes)
        print("\nFirst 5 Rows:")
        print(df.head())
        print("\nSummary Statistics:")
        print(df.describe(include='all'))
        print("\nMissing Values:")
        print(df.isnull().sum())

        # Check for duplicates in month and customer_id since customers can have multiple services
        if name == name1:
            duplicate_count = df.duplicated(subset=['MONTH', 'CUSTOMER_ID']).sum()
            print(f"\nNumber of duplicate rows based on MONTH and CUSTOMER_ID: {duplicate_count}")

            if duplicate_count > 0:
                print("\nSample of duplicate rows:")
                duplicates = df[df.duplicated(subset=['MONTH', 'CUSTOMER_ID'], keep=False)]
                print(duplicates.head())

        print("\n" + "="*50 + "\n")


overview_of_dataframes(cust_rev_mon, dim_cust, dim_serv, 'Customer Revenue Monthly', 'Dimension Customer', 'Dimension Service')

--- Customer Revenue Monthly ---
Shape: (99393, 5)
Columns: ['MONTH', 'CUSTOMER_ID', 'SERVICE_ID', 'CONTRACTS', 'TOTAL_SAAS_REVENUE_USD']
Data Types:
MONTH                     datetime64[ns]
CUSTOMER_ID                        int64
SERVICE_ID                         int64
CONTRACTS                          int64
TOTAL_SAAS_REVENUE_USD           float64
dtype: object

First 5 Rows:
       MONTH  CUSTOMER_ID  SERVICE_ID  CONTRACTS  TOTAL_SAAS_REVENUE_USD
0 2023-10-01       199072          28         62                  6510.0
1 2023-12-01       473284          28          9                  1323.0
2 2023-11-01       180824          28          9                  1323.0
3 2023-12-01       174840          28         64                  9408.0
4 2023-09-01       158752          28         58                 10266.0

Summary Statistics:
                               MONTH    CUSTOMER_ID    SERVICE_ID  \
count                          99393   99393.000000  99393.000000   
mean   2023-09-17 0

# SQL (as opposed to pandas) is used to explore the data, as requested in the home exercise guidelines.

#### Task 1: SQL Query - Churn Rate Analysis
Write a SQL query to calculate the monthly churn rate for the past year.

$$
\text{Monthly Churn Rate} = \left( \frac{\text{Number of customers lost during the month}}{\text{Number of active customers in the previous month}} \right) \times 100
$$

* *Number of customers lost during the month* : customers with 0 contracts in the month

* *Number of active customers in the previous month* : distinct count of customers having >= 1 contract in the previous month
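
To make the formula concrete, here is a minimal sketch of the arithmetic on made-up numbers. The helper `monthly_churn_rate` is a hypothetical name used only for this illustration.

```python
def monthly_churn_rate(lost_customers, active_prev_month):
    """Return the churn rate in percent, or None when there is no previous-month base."""
    if active_prev_month == 0:
        return None  # avoids dividing by zero when nobody was active last month
    return lost_customers * 100.0 / active_prev_month


# Made-up example: 5 customers lost out of 200 active in the previous month
example_rate = monthly_churn_rate(5, 200)
print(example_rate)  # 2.5
```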

In [230]:
cursor = conn.cursor()

query = '''
WITH monthly_data AS (
    SELECT
        month
        ,customer_id
        ,total_contracts
        ,LAG(month, 1) OVER (PARTITION BY customer_id ORDER BY month) AS prev_month
        ,LAG(total_contracts, 1) OVER (PARTITION BY customer_id ORDER BY month) AS prev_total_contracts
    FROM (
          SELECT
              month
              ,customer_id
              ,SUM(contracts) AS total_contracts
          FROM
              cust_rev_mon
          GROUP BY
              month
              ,customer_id
    ) subquery
)

-- Main query to calculate lost and active customers
SELECT
    month
    ,COUNT(DISTINCT CASE WHEN total_contracts = 0 THEN customer_id END) AS lost_customers_count
    ,COUNT(DISTINCT CASE WHEN prev_total_contracts >= 1 THEN customer_id END) AS active_customers_count
    ,CASE
        WHEN COUNT(DISTINCT CASE WHEN prev_total_contracts >= 1 THEN customer_id END) > 0
        THEN (COUNT(DISTINCT CASE WHEN total_contracts = 0 THEN customer_id END) * 100.0 /
              COUNT(DISTINCT CASE WHEN prev_total_contracts >= 1 THEN customer_id END))
        ELSE NULL
    END AS churn_rate
FROM
    monthly_data
WHERE
    prev_month IS NOT NULL  -- drop the first observed row per customer (all of July): no earlier month exists to supply prev_total_contracts
GROUP BY
    month
ORDER BY
    month
'''

cursor.execute(query)
rows = cursor.fetchall()

churn = pd.DataFrame(rows, columns=[desc[0] for desc in cursor.description])
print(churn)

                 month  lost_customers_count  active_customers_count  \
0  2023-08-01 00:00:00                    18                   12259   
1  2023-09-01 00:00:00                    25                   12667   
2  2023-10-01 00:00:00                    29                   13157   
3  2023-11-01 00:00:00                    33                   13452   
4  2023-12-01 00:00:00                    44                   13719   

   churn_rate  
0    0.146831  
1    0.197363  
2    0.220415  
3    0.245317  
4    0.320723  
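
One caveat with the `LAG` approach: `LAG(month, 1)` returns the previous row that exists for a customer, which is not necessarily the previous calendar month. I have not checked whether such gaps occur in this dataset. The sketch below shows one way to look for them, using a made-up table `toy_rev` with months stored as `YYYY-MM-DD` text.

```python
import sqlite3

gap_conn = sqlite3.connect(":memory:")
gap_conn.execute(
    "CREATE TABLE toy_rev (month TEXT, customer_id INTEGER, contracts INTEGER)"
)
gap_conn.executemany(
    "INSERT INTO toy_rev VALUES (?, ?, ?)",
    [
        ("2023-07-01", 1, 3),
        ("2023-08-01", 1, 2),
        ("2023-09-01", 1, 0),
        ("2023-07-01", 2, 5),
        ("2023-09-01", 2, 4),  # customer 2 has no August row
    ],
)

gap_query = """
WITH lagged AS (
    SELECT
        month
        ,customer_id
        ,LAG(month, 1) OVER (PARTITION BY customer_id ORDER BY month) AS prev_month
    FROM toy_rev
)
SELECT
    month
    ,customer_id
    ,prev_month
FROM lagged
WHERE prev_month IS NOT NULL
    AND DATE(prev_month) <> DATE(month, '-1 month')  -- previous row is not the prior calendar month
ORDER BY month, customer_id
"""

gaps = gap_conn.execute(gap_query).fetchall()
print(gaps)  # [('2023-09-01', 2, '2023-07-01')]
```

If the same check on `cust_rev_mon` returns no rows, the lagged values really do come from the previous month and the churn query can be read as intended.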


Cross-checking the SQL result in pandas

In [213]:
monthly_data = cust_rev_mon.groupby(['MONTH', 'CUSTOMER_ID'])['CONTRACTS'].sum().reset_index()
monthly_data.rename(columns={'CONTRACTS': 'total_contracts'}, inplace=True)
monthly_data['prev_month'] = monthly_data.groupby('CUSTOMER_ID')['MONTH'].shift(1)
monthly_data['prev_total_contracts'] = monthly_data.groupby('CUSTOMER_ID')['total_contracts'].shift(1)
monthly_data = monthly_data.dropna(subset=['prev_month'])

lost_customers = monthly_data[monthly_data['total_contracts'] == 0]
lost_customers_count = lost_customers.groupby('MONTH')['CUSTOMER_ID'].nunique().reset_index()
lost_customers_count.rename(columns={'CUSTOMER_ID': 'lost_customers_count'}, inplace=True)

active_customers_prev_month = monthly_data[monthly_data['prev_total_contracts'] >= 1]
active_customers_count = active_customers_prev_month.groupby('MONTH')['CUSTOMER_ID'].nunique().reset_index()
active_customers_count.rename(columns={'CUSTOMER_ID': 'active_customers_count'}, inplace=True)

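# Start from the active-customer counts so months with no lost customers are kept (as 0)
churn_data = pd.merge(active_customers_count, lost_customers_count, on='MONTH', how='left')
churn_data['lost_customers_count'] = churn_data['lost_customers_count'].fillna(0).astype(int)
churn_data = churn_data[['MONTH', 'lost_customers_count', 'active_customers_count']]
churn_data['churn_rate'] = (churn_data['lost_customers_count'] / churn_data['active_customers_count']) * 100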
churn_data = churn_data.sort_values('MONTH')
print(churn_data)

       MONTH  lost_customers_count  active_customers_count  churn_rate
0 2023-08-01                    18                   12259    0.146831
1 2023-09-01                    25                   12667    0.197363
2 2023-10-01                    29                   13157    0.220415
3 2023-11-01                    33                   13452    0.245317
4 2023-12-01                    44                   13719    0.320723


*Although there's an upward trend in lost customers, we're not looking at which services are losing customers...*

In [249]:
cursor = conn.cursor()

query = '''
WITH monthly_data AS (
    SELECT
        month
        ,customer_id
        ,name
        ,total_contracts
        ,LAG(month, 1) OVER (PARTITION BY customer_id, name ORDER BY month) AS prev_month
        ,LAG(total_contracts, 1) OVER (PARTITION BY customer_id, name ORDER BY month) AS prev_total_contracts
    FROM (
        SELECT
            month
            ,customer_id
            ,name
            ,SUM(contracts) AS total_contracts
        FROM
            cust_rev_mon r
        LEFT JOIN
            dim_serv s
        ON
            r.service_id = s.id
        GROUP BY
            month
            ,customer_id
            ,name
    ) subquery
)

-- Main query to calculate lost and active customers by service name
SELECT
    month
    ,name
    ,COUNT(DISTINCT CASE WHEN total_contracts = 0 THEN customer_id END) AS lost_customers_count
    ,COUNT(DISTINCT CASE WHEN prev_total_contracts >= 1 THEN customer_id END) AS active_customers_count
    ,CASE
        WHEN COUNT(DISTINCT CASE WHEN prev_total_contracts >= 1 THEN customer_id END) > 0
        THEN (COUNT(DISTINCT CASE WHEN total_contracts = 0 THEN customer_id END) * 100.0 /
              COUNT(DISTINCT CASE WHEN prev_total_contracts >= 1 THEN customer_id END))
        ELSE NULL
    END AS churn_rate
FROM
    monthly_data
WHERE
    prev_month IS NOT NULL  -- omit data with no previous month for churn calculation
GROUP BY
    month
    ,name
ORDER BY
    month
    ,name
'''

cursor.execute(query)
rows = cursor.fetchall()

churn_by_service = pd.DataFrame(rows, columns=[desc[0] for desc in cursor.description])
print(churn_by_service)

                  month name  lost_customers_count  active_customers_count  \
0   2023-08-01 00:00:00  EOR                     0                    5913   
1   2023-08-01 00:00:00   IC                    16                    7814   
2   2023-08-01 00:00:00   PR                     0                     691   
3   2023-08-01 00:00:00  SHD                     0                     366   
4   2023-09-01 00:00:00  EOR                     0                    6082   
5   2023-09-01 00:00:00   IC                    23                    8066   
6   2023-09-01 00:00:00   PR                     0                     686   
7   2023-09-01 00:00:00  SHD                     0                     432   
8   2023-10-01 00:00:00  EOR                     0                    6353   
9   2023-10-01 00:00:00   GP                     0                       1   
10  2023-10-01 00:00:00   IC                    26                    8274   
11  2023-10-01 00:00:00   PR                     0              

*Or which regions are losing customers...*

In [256]:
cursor = conn.cursor()

query = '''
WITH monthly_data AS (
    SELECT
        month
        ,customer_id
        ,region
        ,total_contracts
        ,LAG(month, 1) OVER (PARTITION BY customer_id, region ORDER BY month) AS prev_month
        ,LAG(total_contracts, 1) OVER (PARTITION BY customer_id, region ORDER BY month) AS prev_total_contracts
    FROM (
        SELECT
            month
            ,r.customer_id
            ,region
            ,SUM(contracts) AS total_contracts
        FROM
            cust_rev_mon r
        LEFT JOIN
            dim_cust c
        ON
            r.customer_id = c.customer_id
        GROUP BY
            month
            ,r.customer_id
            ,region
    ) subquery
)

-- Main query to calculate lost and active customers by region
SELECT
    month
    ,region
    ,COUNT(DISTINCT CASE WHEN total_contracts = 0 THEN customer_id END) AS lost_customers_count
    ,COUNT(DISTINCT CASE WHEN prev_total_contracts >= 1 THEN customer_id END) AS active_customers_count
    ,CASE
        WHEN COUNT(DISTINCT CASE WHEN prev_total_contracts >= 1 THEN customer_id END) > 0
        THEN (COUNT(DISTINCT CASE WHEN total_contracts = 0 THEN customer_id END) * 100.0 /
              COUNT(DISTINCT CASE WHEN prev_total_contracts >= 1 THEN customer_id END))
        ELSE NULL
    END AS churn_rate
FROM
    monthly_data
WHERE
    prev_month IS NOT NULL  -- omit data with no previous month for churn calculation
GROUP BY
    month
    ,region
ORDER BY
    month
    ,region
'''

cursor.execute(query)
rows = cursor.fetchall()

churn_by_region = pd.DataFrame(rows, columns=[desc[0] for desc in cursor.description])
print(churn_by_region)

                  month  region  lost_customers_count  active_customers_count  \
0   2023-08-01 00:00:00    None                     1                     193   
1   2023-08-01 00:00:00     AMS                    10                    7400   
2   2023-08-01 00:00:00    APAC                     3                     847   
3   2023-08-01 00:00:00  BRAZIL                     0                     104   
4   2023-08-01 00:00:00    EMEA                     4                    3715   
5   2023-09-01 00:00:00    None                     2                     201   
6   2023-09-01 00:00:00     AMS                    12                    7628   
7   2023-09-01 00:00:00    APAC                     2                     883   
8   2023-09-01 00:00:00  BRAZIL                     0                     103   
9   2023-09-01 00:00:00    EMEA                     9                    3852   
10  2023-10-01 00:00:00    None                     1                     214   
11  2023-10-01 00:00:00     

In [259]:
cursor = conn.cursor()

query = '''
WITH monthly_data AS (
    SELECT
        month
        ,customer_id
        ,region
        ,name
        ,total_contracts
        ,LAG(month, 1) OVER (PARTITION BY customer_id, region, name ORDER BY month) AS prev_month
        ,LAG(total_contracts, 1) OVER (PARTITION BY customer_id, region, name ORDER BY month) AS prev_total_contracts
    FROM (
        SELECT
            month
            ,r.customer_id
            ,region
            ,name
            ,SUM(contracts) AS total_contracts
        FROM
            cust_rev_mon r
        LEFT JOIN
            dim_cust c
        ON
            r.customer_id = c.customer_id
        LEFT JOIN
            dim_serv s
        ON
            r.service_id = s.id
        GROUP BY
            month
            ,r.customer_id
            ,region
            ,name
    ) subquery
)

-- Main query to calculate lost and active customers by region and service name
SELECT
    month
    ,region
    ,name
    ,COUNT(DISTINCT CASE WHEN total_contracts = 0 THEN customer_id END) AS lost_customers_count
    ,COUNT(DISTINCT CASE WHEN prev_total_contracts >= 1 THEN customer_id END) AS active_customers_count
    ,CASE
        WHEN COUNT(DISTINCT CASE WHEN prev_total_contracts >= 1 THEN customer_id END) > 0
        THEN (COUNT(DISTINCT CASE WHEN total_contracts = 0 THEN customer_id END) * 100.0 /
              COUNT(DISTINCT CASE WHEN prev_total_contracts >= 1 THEN customer_id END))
        ELSE NULL
    END AS churn_rate
FROM
    monthly_data
WHERE
    prev_month IS NOT NULL  -- omit data with no previous month for churn calculation
GROUP BY
    month
    ,region
    ,name
ORDER BY
    month
    ,region
    ,name
'''

cursor.execute(query)
rows = cursor.fetchall()

churn_by_regional_service = pd.DataFrame(rows, columns=[desc[0] for desc in cursor.description])
print(churn_by_regional_service)

                   month region name  lost_customers_count  \
0    2023-08-01 00:00:00   None  EOR                     0   
1    2023-08-01 00:00:00   None   IC                     1   
2    2023-08-01 00:00:00   None   PR                     0   
3    2023-08-01 00:00:00   None  SHD                     0   
4    2023-08-01 00:00:00    AMS  EOR                     0   
..                   ...    ...  ...                   ...   
101  2023-12-01 00:00:00   EMEA  EOR                     0   
102  2023-12-01 00:00:00   EMEA   GP                     0   
103  2023-12-01 00:00:00   EMEA   IC                    12   
104  2023-12-01 00:00:00   EMEA   PR                     1   
105  2023-12-01 00:00:00   EMEA  SHD                     2   

     active_customers_count  churn_rate  
0                        56    0.000000  
1                       154    0.649351  
2                         8    0.000000  
3                         7    0.000000  
4                      2959    0.000000  
.. 

Generating some graphs for EDA and data storytelling.

In [224]:
fig = make_subplots(
    rows=3, cols=1, shared_xaxes=True,
    subplot_titles=('Lost Customers Over Time',
                    'Active Customers Over Time',
                    'Churn Rate Over Time')
)
fig.add_trace(
    go.Scatter(
        x=churn_data['MONTH'],
        y=churn_data['lost_customers_count'],
        mode='lines+markers',
        name='Lost Customers',
        line=dict(color='red')
    ),
    row=1, col=1
)
fig.add_trace(
    go.Scatter(
        x=churn_data['MONTH'],
        y=churn_data['active_customers_count'],
        mode='lines+markers',
        name='Active Customers',
        line=dict(color='blue')
    ),
    row=2, col=1
)
fig.add_trace(
    go.Scatter(
        x=churn_data['MONTH'],
        y=churn_data['churn_rate'],
        mode='lines+markers',
        name='Churn Rate',
        line=dict(color='red', dash='dash')
    ),
    row=3, col=1
)
fig.update_layout(
    height=900,
    width=800,
    title_text="Customer Metrics Over Time"
)

fig.show()

In [245]:
fig = px.imshow(
    churn_by_service.pivot(index='name', columns='month', values='churn_rate'),
    title='What Services are Experiencing the Most Monthly Churn?',
    labels={'color': 'Churn Rate (%)'},
    color_continuous_scale='Viridis'
)
fig.show()

I assume the GP Payroll service was introduced in Sept, since we don't have any active customers until Oct. This service has the highest churn rate, but that's without standardizing the data: the GP service has only 3 active customers as of Dec, so its rate rests on a very small base.
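
One way to keep a tiny customer base from dominating the heatmap would be to blank out the rate whenever the base falls under a minimum size. The sketch below does this in SQL on made-up rows. The table `toy_churn`, the services `A` and `B`, and the threshold of 30 are all assumptions for illustration, not values from the analysis.

```python
import sqlite3

MIN_ACTIVE_BASE = 30  # hypothetical cut-off, not derived from the data

mask_conn = sqlite3.connect(":memory:")
mask_conn.execute(
    "CREATE TABLE toy_churn ("
    "name TEXT, lost_customers_count INTEGER, active_customers_count INTEGER)"
)
mask_conn.executemany(
    "INSERT INTO toy_churn VALUES (?, ?, ?)",
    [("A", 20, 8000), ("B", 1, 3)],  # made-up counts: one large base, one tiny base
)

masked_rows = mask_conn.execute(
    """
    SELECT
        name
        ,active_customers_count
        ,lost_customers_count * 100.0 / active_customers_count AS churn_rate
        ,CASE
            WHEN active_customers_count >= ?
            THEN lost_customers_count * 100.0 / active_customers_count
            ELSE NULL  -- base too small for the rate to be meaningful
        END AS churn_rate_masked
    FROM toy_churn
    ORDER BY name
    """,
    (MIN_ACTIVE_BASE,),
).fetchall()

for row in masked_rows:
    print(row)
```

Service `B` loses a single customer yet shows a rate above 33%, while the masked column leaves it empty, so it would appear as a blank cell rather than the brightest one.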

In [258]:
fig = px.imshow(
    churn_by_region.pivot(index='region', columns='month', values='churn_rate'),
    title='What International Regions are Experiencing the Most Monthly Churn?',
    labels={'color': 'Churn Rate (%)'},
    color_continuous_scale='Viridis'
)
fig.show()

In [267]:
fig = px.line(
    churn_by_regional_service,
    x='month',
    y='churn_rate',
    color='name',
    line_dash='region',
    markers=True,
    title='Monthly Churn Rate by Service and Region',
    labels={
        'month': 'Month',
        'churn_rate': 'Churn Rate (%)',
        'name': 'Service Name',
        'region': 'Region'
    }
)
fig.update_xaxes(
    dtick="M1",
    tickformat="%b %Y"
)
fig.show()

#### Task 2: Analytical Insights - Retention Strategy
Develop SQL queries to extract insights on overall NRR, as well as NRR by Business Unit and Region.
Net Revenue Retention (NRR) shows how well a company is growing its revenue from its existing customer base.

$$
\text{Monthly NRR} = \left( \frac{\text{Current month SaaS revenue from existing active customers}}{\text{Previous month SaaS revenue from existing active customers}} \right) \times 100
$$


- *Current Month SaaS Revenue from existing active customers* : Total SaaS revenue in the current month from customers that were active in the previous month
- *Previous Month SaaS Revenue from existing active customers* : Total SaaS revenue in the previous month from those same active customers
  - **Note: the numerator and denominator must use the same set of customers**
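
As a minimal sketch of this definition on made-up data, the query below fixes the customer set as those with >= 1 contract in the previous month (the same "active" rule as in Task 1, which is my assumption here), then sums their revenue in both months. The table `toy_rev` and its `revenue` column are hypothetical stand-ins for `cust_rev_mon` and `TOTAL_SAAS_REVENUE_USD`.

```python
import sqlite3

nrr_conn = sqlite3.connect(":memory:")
nrr_conn.execute(
    "CREATE TABLE toy_rev (month TEXT, customer_id INTEGER, contracts INTEGER, revenue REAL)"
)
nrr_conn.executemany(
    "INSERT INTO toy_rev VALUES (?, ?, ?, ?)",
    [
        ("2023-07-01", 1, 2, 100.0),
        ("2023-07-01", 2, 1, 50.0),
        ("2023-08-01", 1, 3, 120.0),  # existing customer, revenue grew
        ("2023-08-01", 3, 1, 80.0),   # new customer, excluded from NRR
        # customer 2 has no August row, so it contributes 0 current revenue
    ],
)

nrr_query = """
WITH per_customer AS (
    SELECT
        month
        ,customer_id
        ,SUM(contracts) AS total_contracts
        ,SUM(revenue) AS total_revenue
    FROM toy_rev
    GROUP BY month, customer_id
)
SELECT
    DATE(p.month, '+1 month') AS nrr_month
    ,SUM(COALESCE(c.total_revenue, 0)) AS current_revenue
    ,SUM(p.total_revenue) AS previous_revenue
    ,SUM(COALESCE(c.total_revenue, 0)) * 100.0 / SUM(p.total_revenue) AS nrr
FROM per_customer p                      -- p: previous month, defines the customer set
LEFT JOIN per_customer c                 -- c: same customer one month later, if present
    ON c.customer_id = p.customer_id
    AND DATE(c.month) = DATE(p.month, '+1 month')
WHERE p.total_contracts >= 1
    AND DATE(p.month, '+1 month') IN (SELECT DATE(month) FROM toy_rev)
GROUP BY DATE(p.month, '+1 month')
ORDER BY nrr_month
"""

nrr_rows = nrr_conn.execute(nrr_query).fetchall()
print(nrr_rows)  # [('2023-08-01', 120.0, 150.0, 80.0)]
```

Because the previous-month rows drive the `LEFT JOIN`, both sums run over the same customers: customer 2 stays in the denominator after churning, and customer 3 never enters the numerator.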