# I. Introduction to Cohort Analysis

## 1.1. Overall Understanding to Cohort Analysis

- Definition: 
    + Cohort analysis is a method of customer segmentation that help businesses understand their customers. Cohort analysis focuses on how a customer's behavior changes over time, while RFM analysis focuses on a customer's current behavior. 
- Focus: 
    + How a customer's behavior changes over time 
- Goal: 
    + Identify trends and patterns in customer behavior 
    + Compare metrics across product lifecycle
    + Compare metrics across customer lifecycle
- How it works: 
    + Groups customers into mutually exclusive **cohorts** based on when they were acquired and tracks their behavior over time 
- When it's useful: 
    + For understanding how a group of customers evolves over time
- Benefits of cohort analysis:
    + Helps identify trends and patterns in customer behavior
    + Helps identify key metrics like retention rate and upsell rate
    + Can be used to improve customer experience and retention

## 1.2. Types of Cohorts

- **Time Cohorts:**
    + Time Cohorts are customers who signed up for a product or service during a particular time frame. 
    + Analyzing these cohorts shows the customers’ behavior depending on the time they started using the company’s products | services. 
    + The time may be daily or weekly or monthly or quarterly.

- **Behavior Cohorts:**
    + Behavior Cohorts are customers who purchased a product or subscribed to a service in the past. 
    + It groups customers by the type of product or service they signed up. 
    + Customers who signed up for basic level services might have different needs than those who signed up for advanced services. 
    + Understanding the needs of the various cohorts can help a company design custom-made services or products for particular segments.
    
- **Size Cohorts:**
    + Size Cohorts refer to the various sizes of customers who purchase company’s products or services. 
    + This categorization can be based on the amount of spending in some period of time after acquisition, or the product type that the customer spent most of their order amount in some period of time.

## 1.3. Time Cohort (Customer Cohort) Example

### Load packages

In [23]:
import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.figure_factory as ff

### Load data 

In [24]:
# load csv file as a DataFrame. The encoding is required to read the file
online = pd.read_csv('data/E-Commerce Data.csv', encoding='ISO-8859-1')

In [3]:
# # Load xlsx file as a DataFrame
# second_online = pd.read_excel(
#     'data/online_retail_II.xlsx'
#     # , sheet_name='Year 2009-2010'
#     , sheet_name='Year 2010-2011'
# )

### Data Pre-processing

In [4]:
# include only UK
online = online[online['Country'] == 'United Kingdom']

# include only date, price, customerid 
online = online[['InvoiceDate','UnitPrice','CustomerID']]

# remove duplicates
online.drop_duplicates(inplace=True)

# Drop rows with missing CustomerID
online = online[~online['CustomerID'].isnull()]
online.reset_index(drop=True, inplace=True)

# datetime conversion
online['InvoiceDate'] = pd.to_datetime(online['InvoiceDate'],format='%m/%d/%Y %H:%M')

# view first 5 rows
online.head()


Unnamed: 0,InvoiceDate,UnitPrice,CustomerID
0,2010-12-01 08:26:00,2.55,17850.0
1,2010-12-01 08:26:00,3.39,17850.0
2,2010-12-01 08:26:00,2.75,17850.0
3,2010-12-01 08:26:00,7.65,17850.0
4,2010-12-01 08:26:00,4.25,17850.0


In [5]:
# preview data information
online.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159858 entries, 0 to 159857
Data columns (total 3 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceDate  159858 non-null  datetime64[ns]
 1   UnitPrice    159858 non-null  float64       
 2   CustomerID   159858 non-null  float64       
dtypes: datetime64[ns](1), float64(2)
memory usage: 3.7 MB


In [6]:
# get first day of the month from a given date object
def get_month(datetime: dt.datetime) -> dt.datetime:
    '''get first day of the month from a given date object: datetime64[ns]'''
    return dt.datetime(
        year=datetime.year
        ,month=datetime.month
        ,day=1
    )

# assign acquisition month cohort
online['InvoiceMonth'] = online['InvoiceDate'].apply(get_month)

# preview first 5 rows
online.head()

Unnamed: 0,InvoiceDate,UnitPrice,CustomerID,InvoiceMonth
0,2010-12-01 08:26:00,2.55,17850.0,2010-12-01
1,2010-12-01 08:26:00,3.39,17850.0,2010-12-01
2,2010-12-01 08:26:00,2.75,17850.0,2010-12-01
3,2010-12-01 08:26:00,7.65,17850.0,2010-12-01
4,2010-12-01 08:26:00,4.25,17850.0,2010-12-01


In [7]:
# assign the smallest InvoiceMonth value of each customer to each record regarding the same customer ID
online['CohortMonth'] = online.groupby('CustomerID')['InvoiceMonth'].transform('min')

# preview last 5 rows
online.tail()

Unnamed: 0,InvoiceDate,UnitPrice,CustomerID,InvoiceMonth,CohortMonth
159853,2011-12-09 12:31:00,0.95,15804.0,2011-12-01,2011-05-01
159854,2011-12-09 12:49:00,2.95,13113.0,2011-12-01,2010-12-01
159855,2011-12-09 12:49:00,1.25,13113.0,2011-12-01,2010-12-01
159856,2011-12-09 12:49:00,8.95,13113.0,2011-12-01,2010-12-01
159857,2011-12-09 12:49:00,7.08,13113.0,2011-12-01,2010-12-01


In [8]:
# Assign Time Offset Value Process

# def helper function
def get_date_as_int(dataframe: pd.DataFrame, column: str) -> tuple:
    '''
    Function to extract year, month and day as integer values from a datetime object. 
    
    Required pandas as pd and datetime as dt packages

    Return tuple(year: Series[int], month: Series[int], day: Series[int])
    '''
    year = dataframe[column].dt.year
    month = dataframe[column].dt.month
    day = dataframe[column].dt.day
    return year, month, day

# extract year, month from InvoiceMonth, CohortMonth variables
invoice_year, invoice_month, _ = get_date_as_int(online,'InvoiceMonth')
cohort_year, cohort_month, _ = get_date_as_int(online,'CohortMonth')

# Calculate time offset including year and month
years_diff = invoice_year - cohort_year
months_diff = invoice_month - cohort_month

# create cohort index
online['CohortIndex'] = years_diff * 12 + months_diff

# excluding unnecessary columns
# online = online[['CustomerID','CohortMonth','CohortIndex']]

# preview last 5 rows
online.tail()

Unnamed: 0,InvoiceDate,UnitPrice,CustomerID,InvoiceMonth,CohortMonth,CohortIndex
159853,2011-12-09 12:31:00,0.95,15804.0,2011-12-01,2011-05-01,7
159854,2011-12-09 12:49:00,2.95,13113.0,2011-12-01,2010-12-01,12
159855,2011-12-09 12:49:00,1.25,13113.0,2011-12-01,2010-12-01,12
159856,2011-12-09 12:49:00,8.95,13113.0,2011-12-01,2010-12-01,12
159857,2011-12-09 12:49:00,7.08,13113.0,2011-12-01,2010-12-01,12


In [9]:
# understanding cohort table , 
online[online['CustomerID'] == 13110.0].groupby(['InvoiceMonth','CohortMonth','CohortIndex'])['CustomerID'].count().reset_index()

Unnamed: 0,InvoiceMonth,CohortMonth,CohortIndex,CustomerID
0,2011-02-01,2011-02-01,0,11
1,2011-03-01,2011-02-01,1,26
2,2011-07-01,2011-02-01,5,15
3,2011-10-01,2011-02-01,8,15
4,2011-11-01,2011-02-01,9,11


In [10]:
# calculate monthly active customers from each cohort == Count the number of unique values per customer ID
cohort_data = online.groupby(['CohortMonth','CohortIndex'])['CustomerID'].apply(pd.Series.nunique).reset_index()
cohort_data

Unnamed: 0,CohortMonth,CohortIndex,CustomerID
0,2010-12-01,0,871
1,2010-12-01,1,322
2,2010-12-01,2,291
3,2010-12-01,3,329
4,2010-12-01,4,308
...,...,...,...
86,2011-10-01,1,86
87,2011-10-01,2,40
88,2011-11-01,0,296
89,2011-11-01,1,41


### Result

In [11]:
# time cohorts table
cohort_counts = cohort_data.pivot(
    index='CohortMonth'
    ,columns='CohortIndex'
    ,values='CustomerID'
)
cohort_counts

CohortIndex,0,1,2,3,4,5,6,7,8,9,10,11,12
CohortMonth,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
2010-12-01,871.0,322.0,291.0,329.0,308.0,345.0,327.0,304.0,306.0,346.0,320.0,429.0,238.0
2011-01-01,362.0,84.0,101.0,89.0,124.0,106.0,95.0,94.0,114.0,127.0,131.0,54.0,
2011-02-01,339.0,85.0,65.0,95.0,96.0,86.0,88.0,96.0,94.0,106.0,33.0,,
2011-03-01,408.0,79.0,107.0,88.0,95.0,70.0,107.0,97.0,119.0,38.0,,,
2011-04-01,276.0,62.0,61.0,60.0,57.0,64.0,64.0,73.0,23.0,,,,
2011-05-01,252.0,58.0,43.0,43.0,54.0,60.0,67.0,25.0,,,,,
2011-06-01,207.0,44.0,34.0,51.0,53.0,67.0,20.0,,,,,,
2011-07-01,172.0,35.0,33.0,40.0,48.0,19.0,,,,,,,
2011-08-01,140.0,37.0,32.0,36.0,19.0,,,,,,,,
2011-09-01,275.0,80.0,90.0,33.0,,,,,,,,,


In [12]:
# DataCamp data, same logic but we use this dataset instead for data consistency
cohort_counts = pd.read_csv('data/cohort_counts.csv',index_col=0)
cohort_counts

Unnamed: 0_level_0,0,1,2,3,4,5,6,7,8,9,10,11,12
CohortMonth,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
2010-12-01,716,246.0,221.0,251.0,245.0,285.0,249.0,236.0,240.0,265.0,254.0,348.0,172.0
2011-01-01,332,69.0,82.0,81.0,110.0,90.0,82.0,86.0,104.0,102.0,124.0,45.0,
2011-02-01,316,58.0,57.0,83.0,85.0,74.0,80.0,83.0,86.0,95.0,28.0,,
2011-03-01,388,63.0,100.0,76.0,83.0,67.0,98.0,85.0,107.0,38.0,,,
2011-04-01,255,49.0,52.0,49.0,47.0,52.0,56.0,59.0,17.0,,,,
2011-05-01,249,40.0,43.0,36.0,52.0,58.0,61.0,22.0,,,,,
2011-06-01,207,33.0,26.0,41.0,49.0,62.0,19.0,,,,,,
2011-07-01,173,28.0,31.0,38.0,44.0,17.0,,,,,,,
2011-08-01,139,30.0,28.0,35.0,14.0,,,,,,,,
2011-09-01,279,56.0,78.0,34.0,,,,,,,,,


### Question: How many customers have made their first transaction in January 2011 ?

In [13]:
cohort_counts.loc[
    # transaction date
    '2011-01-01'
    # the first transaction is the first column
    ,'0'
]

332

## 1.4. Cohort Metrics

### 1.4.1. Customer Retention Rate

In [14]:
def plot_cohorts_heatmap(dataframe: pd.DataFrame) -> plt.Figure:
    cohort_size = dataframe.iloc[:,0]
    dataframe = dataframe.divide(
        other=cohort_size
        ,axis=0
    )[::-1].fillna(0)
    ylabel = [str(dt.date(int(i.split('-')[0]), int(i.split('-')[1]), int(i.split('-')[2])).strftime(format='%B, %Y')) for i in dataframe.index]

    fig = ff.create_annotated_heatmap(
        z = dataframe.values
        , annotation_text = dataframe.map(lambda x: '{:.1%}'.format(x) if x > 0 else '').values.tolist()
        , y = ylabel
        , x = ['Month '+ str(int(i)+1) for i in dataframe.columns]
        , showscale = True
    )

    fig.update_layout(
        width=1000
        , height=700
        , xaxis={"title": "# Periods Elapsed"}
        , font_color = 'rgb(255,255,255)'
        , title="User Retention Rate by Cohort: Heatmap"
        , paper_bgcolor='rgb(0,0,0)'
    )

    fig.show()

plot_cohorts_heatmap(cohort_counts)

### 1.4.2. Cumulative Lifetime Revenue

#### Definition

The Cumulative Lifetime Revenue (CLR) is the total revenue generated by a customer over their entire relationship with a business. It's useful for understanding the long-term value a customer brings to the company.

#### Formula


$CLR = \sum{(Revenue\ from\ each\ transaction\ over\ time)}$

- **Revenue from each transaction**: This is the amount of money earned from each transaction or purchase made by the customer.

- **Summation (Σ)**: This represents the total revenue accumulated from all transactions a customer has made since their first purchase.

#### Key Components

- **Revenue per Transaction**: This is typically the total amount spent by the customer during each transaction. If there are multiple items in a transaction, it can be calculated as the sum of the prices of all items.

- **Customer Relationship Period**: The CLR spans from the first transaction to the most recent transaction. It's the total revenue from all purchases made during the customer's lifetime as a customer.

#### Theory Example

Let’s assume a customer has made the following purchases:

- First transaction: $50
- Second transaction: $30
- Third transaction: $70
- Fourth transaction: $20

To calculate the Cumulative Lifetime Revenue (CLR) for this customer, you would sum all these transactions:

CLR=50+30+70+20=170

So, the Cumulative Lifetime Revenue for this customer is $170.

- Revenue from Subscriptions or Recurring Billing: If the customer subscribes to a service or product on a recurring basis (e.g., a monthly subscription), you can calculate CLR by multiplying the subscription amount by the number of months the customer has been subscribed.

### 1.4.3. Churn Rate

$\text{Churn Rate} = 1 - \text{Retention Rate}$

In [15]:
# Calculate Retention Rate
def calculate_retention_rate(dataframe: pd.DataFrame) -> pd.DataFrame:
    cohort_size = dataframe.iloc[:,0]
    retention = dataframe.divide(
        other=cohort_size
        ,axis=0
    )
    return retention

retention = calculate_retention_rate(cohort_counts)
print(f'retention rate:')
retention

retention rate:


Unnamed: 0_level_0,0,1,2,3,4,5,6,7,8,9,10,11,12
CohortMonth,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
2010-12-01,1.0,0.343575,0.308659,0.350559,0.342179,0.398045,0.347765,0.329609,0.335196,0.370112,0.354749,0.486034,0.240223
2011-01-01,1.0,0.207831,0.246988,0.243976,0.331325,0.271084,0.246988,0.259036,0.313253,0.307229,0.373494,0.135542,
2011-02-01,1.0,0.183544,0.18038,0.262658,0.268987,0.234177,0.253165,0.262658,0.272152,0.300633,0.088608,,
2011-03-01,1.0,0.162371,0.257732,0.195876,0.213918,0.17268,0.252577,0.219072,0.275773,0.097938,,,
2011-04-01,1.0,0.192157,0.203922,0.192157,0.184314,0.203922,0.219608,0.231373,0.066667,,,,
2011-05-01,1.0,0.160643,0.172691,0.144578,0.208835,0.232932,0.24498,0.088353,,,,,
2011-06-01,1.0,0.15942,0.125604,0.198068,0.236715,0.299517,0.091787,,,,,,
2011-07-01,1.0,0.16185,0.179191,0.219653,0.254335,0.098266,,,,,,,
2011-08-01,1.0,0.215827,0.201439,0.251799,0.100719,,,,,,,,
2011-09-01,1.0,0.200717,0.27957,0.121864,,,,,,,,,


In [16]:
churn = 1 - retention
print('churn rate: ')
churn

churn rate: 


Unnamed: 0_level_0,0,1,2,3,4,5,6,7,8,9,10,11,12
CohortMonth,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
2010-12-01,0.0,0.656425,0.691341,0.649441,0.657821,0.601955,0.652235,0.670391,0.664804,0.629888,0.645251,0.513966,0.759777
2011-01-01,0.0,0.792169,0.753012,0.756024,0.668675,0.728916,0.753012,0.740964,0.686747,0.692771,0.626506,0.864458,
2011-02-01,0.0,0.816456,0.81962,0.737342,0.731013,0.765823,0.746835,0.737342,0.727848,0.699367,0.911392,,
2011-03-01,0.0,0.837629,0.742268,0.804124,0.786082,0.82732,0.747423,0.780928,0.724227,0.902062,,,
2011-04-01,0.0,0.807843,0.796078,0.807843,0.815686,0.796078,0.780392,0.768627,0.933333,,,,
2011-05-01,0.0,0.839357,0.827309,0.855422,0.791165,0.767068,0.75502,0.911647,,,,,
2011-06-01,0.0,0.84058,0.874396,0.801932,0.763285,0.700483,0.908213,,,,,,
2011-07-01,0.0,0.83815,0.820809,0.780347,0.745665,0.901734,,,,,,,
2011-08-01,0.0,0.784173,0.798561,0.748201,0.899281,,,,,,,,
2011-09-01,0.0,0.799283,0.72043,0.878136,,,,,,,,,


In [17]:
# Calculate the mean overall retention rate
overall_retention_rate = retention.iloc[:,1:].mean().mean()

# Calculate the mean overall churn rate
overall_churn_rate = churn.iloc[:,1:].mean().mean()

# Print rounded retention and churn rates
print('Overall Retention rate: {:.2f} \nOverall Churn rate: {:.2f}'.format(overall_retention_rate, overall_churn_rate))

Overall Retention rate: 0.24 
Overall Churn rate: 0.76


### 1.4.4. Customer Lifetime Revenue

#### Definition

- Customer lifetime revenue (CLR) is the total amount of money a customer spends on a business's products or services over their entire relationship. It's also known as customer lifetime value (CLTV).
- The goal of CLV:
    + Measure customer value in revenue / profit
    + Benchmark customers
    + Identify maxinum investment into customer acquisition

#### Self-research Formula

- $CLR\ |\ CLV = \frac{ARPU\ \times\ Gross\ Margin}{Churn\ Rate}$

    + $ARPU = \frac{Total\ Revenue}{Total\ Number\ Of\ Customers}$

    + $Gross\ Margin = \frac{Revenue\ -\ COGS}{Revenue}$ 

        + For total $COGS = Beginning\ Inventory\ + Purchases\ - Ending\ Inventory$
        
        + For per customer $COGS = \sum{(Unit\ Cost\ Of\ Item \times\ Quantity\ Purchased\ per\ Item)}$

    + $Churn\ Rate = \frac{Total\ Number\ Of\ Churned\ Customers}{Total\ Number\ Of\ Customers}$

#### DeepSeek-r1:8b Formula

- $CLV = \frac{\text{Customer history revenue}}{\text{Churn rate}} \times \frac{\text{Warranty cost or
maintenance cost}}{\text{Cost to acquire and retain customers}}$

#### DataCamp Formula

##### Basis Formula

- $CLV = \text{Average Profit} \times \text{Average Lifespan}$

    + $\text{Average Profit} = \text{Average Revenue} \times \text{Profit Margin}$

    + $\text{Average Lifespan} = \text{Average Time It Takes For Customers To Churn}$

##### Granular Formula

- $CLV = \text{Average Profit} \times \text{Average Lifespan}$

##### Traditional Formula

- Churn can be derived from retention
- Accounds for customer loyalty, most popular approach

- $CLV = \text{Average Profit} \times \frac{\text{Retention Rate}}{\text{Churn Rate}}$

    + $\text{Churn Rate} = 1 - \text{Retention Rate}$

#### Theory Example

If a company has a gross margin of 20%, an average revenue per user of $100, and a churn rate of 5%, then the CLV is $400: 
- CLV = (100 * 0.2) / 0.05 = 400

#### Case Study: Revenue-based CLV Formula (skip profit margin for simplicity)

- $\text{CLV} = \text{Average Revenue} \times \frac{Retention Rate}{Churn Rate}$

##### Data Preprocessing

In [26]:
# load csv file as a DataFrame. The encoding is required to read the file
online = pd.read_csv('data/E-Commerce Data.csv', encoding='ISO-8859-1')

# include only UK
online = online[online['Country'] == 'United Kingdom']

# remove duplicates
online.drop_duplicates(inplace=True)

# Drop rows with missing CustomerID
online = online[~online['CustomerID'].isnull()]
online.reset_index(drop=True, inplace=True)

# data filtering
not_products = [
    'Next Day Carriage',
    'CRUK Commission',
    'Bank Charges',
    'Manual',
    'POSTAGE',
    'CARRIAGE',
    'DOTCOM POSTAGE'
]

# Drop rows with not products
online = online[~online['Description'].isin(not_products)]

# datetime conversion
online['InvoiceDate'] = pd.to_datetime(online['InvoiceDate'],format='%m/%d/%Y %H:%M')

# preview data information
online.info()

<class 'pandas.core.frame.DataFrame'>
Index: 356108 entries, 0 to 356727
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    356108 non-null  object        
 1   StockCode    356108 non-null  object        
 2   Description  356108 non-null  object        
 3   Quantity     356108 non-null  int64         
 4   InvoiceDate  356108 non-null  datetime64[ns]
 5   UnitPrice    356108 non-null  float64       
 6   CustomerID   356108 non-null  float64       
 7   Country      356108 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 24.5+ MB


In [27]:
# view first 5 rows
online.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


In [30]:
# Calculate the total revenue
online['TotalSum'] = online['Quantity'] * online['UnitPrice']

# Create a new column for the date only
online['InvoiceOnlyDate'] = online['InvoiceDate'].apply(lambda x: dt.datetime.date(x))
# pre-processed data
preprocessed_data = online.groupby(['CustomerID','InvoiceOnlyDate']).agg({
    'TotalSum': 'sum' # Monetary value: sum of all transactions
    ,'InvoiceNo': 'nunique'
}).reset_index()

cancelled_data = online[(online['InvoiceNo'].str.contains('C'))].groupby(['CustomerID','InvoiceOnlyDate']).agg({
    'InvoiceNo': 'nunique'
}).rename(columns={'InvoiceNo':'CancelledNo'}).reset_index()
preprocessed_data = preprocessed_data.merge(cancelled_data, on=['CustomerID','InvoiceOnlyDate'], how='left').fillna(0)
preprocessed_data['InvoiceNo'] = preprocessed_data['InvoiceNo'] - preprocessed_data['CancelledNo']
preprocessed_data.drop(columns=['CancelledNo'], inplace=True)
preprocessed_data.head()

Unnamed: 0,CustomerID,InvoiceOnlyDate,TotalSum,InvoiceNo
0,12346.0,2011-01-18,0.0,1.0
1,12747.0,2010-12-05,358.56,1.0
2,12747.0,2010-12-13,347.71,1.0
3,12747.0,2011-01-20,303.04,1.0
4,12747.0,2011-03-01,310.78,1.0


In [31]:
preprocessed_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17188 entries, 0 to 17187
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   CustomerID       17188 non-null  float64
 1   InvoiceOnlyDate  17188 non-null  object 
 2   TotalSum         17188 non-null  float64
 3   InvoiceNo        17188 non-null  float64
dtypes: float64(3), object(1)
memory usage: 537.3+ KB


In [32]:
# Remove negative and zero TotalSum values
first_purchase_indexes = preprocessed_data.groupby('CustomerID')['InvoiceOnlyDate'].idxmin()
first_payment = preprocessed_data.loc[first_purchase_indexes, 'TotalSum']
while any(x <= 0 for x in first_payment):
    preprocessed_data.drop(first_payment[first_payment <= 0].index, inplace=True)
    first_purchase_indexes = preprocessed_data.groupby('CustomerID')['InvoiceOnlyDate'].idxmin()
    first_payment = preprocessed_data.loc[first_purchase_indexes, 'TotalSum']

In [33]:
preprocessed_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 17056 entries, 1 to 17187
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   CustomerID       17056 non-null  float64
 1   InvoiceOnlyDate  17056 non-null  object 
 2   TotalSum         17056 non-null  float64
 3   InvoiceNo        17056 non-null  float64
dtypes: float64(3), object(1)
memory usage: 1.2+ MB


In [34]:
preprocessed_data.head(10)

Unnamed: 0,CustomerID,InvoiceOnlyDate,TotalSum,InvoiceNo
1,12747.0,2010-12-05,358.56,1.0
2,12747.0,2010-12-13,347.71,1.0
3,12747.0,2011-01-20,303.04,1.0
4,12747.0,2011-03-01,310.78,1.0
5,12747.0,2011-05-05,442.96,1.0
6,12747.0,2011-05-25,328.35,1.0
7,12747.0,2011-06-28,376.3,1.0
8,12747.0,2011-08-22,301.7,1.0
9,12747.0,2011-10-04,675.38,1.0
10,12747.0,2011-11-17,312.73,1.0


In [39]:
preprocessed_data[preprocessed_data['CustomerID'] == 12748.0]

Unnamed: 0,CustomerID,InvoiceOnlyDate,TotalSum,InvoiceNo
12,12748.0,2010-12-01,4.95,1.0
13,12748.0,2010-12-02,4.25,1.0
14,12748.0,2010-12-05,938.01,6.0
15,12748.0,2010-12-06,215.19,2.0
16,12748.0,2010-12-07,295.11,1.0
...,...,...,...,...
120,12748.0,2011-12-02,5.45,1.0
121,12748.0,2011-12-04,158.14,2.0
122,12748.0,2011-12-05,321.51,1.0
123,12748.0,2011-12-08,284.51,3.0


In [43]:
preprocessed_data['InvoiceOnlyDate']

1        2010-12-05
2        2010-12-13
3        2011-01-20
4        2011-03-01
5        2011-05-05
            ...    
17183    2011-11-30
17184    2011-12-06
17185    2011-05-22
17186    2011-10-12
17187    2011-10-28
Name: InvoiceOnlyDate, Length: 17056, dtype: object

In [45]:
# Create a new column for the month only
preprocessed_data['InvoiceOnlyMonth'] = preprocessed_data['InvoiceOnlyDate'].apply(lambda x: pd.to_datetime(x).strftime('%Y-%m'))

# preview data
preprocessed_data[preprocessed_data['CustomerID'] == 12748.0]

Unnamed: 0,CustomerID,InvoiceOnlyDate,TotalSum,InvoiceNo,InvoiceOnlyMonth
12,12748.0,2010-12-01,4.95,1.0,2010-12
13,12748.0,2010-12-02,4.25,1.0,2010-12
14,12748.0,2010-12-05,938.01,6.0,2010-12
15,12748.0,2010-12-06,215.19,2.0,2010-12
16,12748.0,2010-12-07,295.11,1.0,2010-12
...,...,...,...,...,...
120,12748.0,2011-12-02,5.45,1.0,2011-12
121,12748.0,2011-12-04,158.14,2.0,2011-12
122,12748.0,2011-12-05,321.51,1.0,2011-12
123,12748.0,2011-12-08,284.51,3.0,2011-12


##### Calculate CLV

In [21]:
# Calculate monthly spend per customer
monthly_revenue = preprocessed_data.groupby('CustomerID')['UnitPrice'].sum().mean()
# Calculate average monthly spend
average_monthly_revenue

# Define or calculate the customer lifespan
customer_lifespan = 1 / overall_churn_rate
print(f'Customer Lifespan: {customer_lifespan:.2f} months')

# Calculate basic CLV
clv_basic = average_monthly_revenue * customer_lifespan

# Print the basic CLV value
print(f'Basic CLV: {clv_basic:.2f} USD')

Customer Lifespan: 1.32 months


### 1.4.4. Net Revenue 

### 1.4.5. Net Revenue Retention Rate

## 1.5. Visualizing Cohort Analysis
Each row in the heatmap represents a cohort and visualizes the percentage of users retained over time.

In [18]:
def plot_cohorts_heatmap(dataframe: pd.DataFrame) -> plt.Figure:
    cohort_size = dataframe.iloc[:,0]
    dataframe = dataframe.divide(
        other=cohort_size
        ,axis=0
    )[::-1].fillna(0)
    ylabel = [str(dt.date(int(i.split('-')[0]), int(i.split('-')[1]), int(i.split('-')[2])).strftime(format='%B, %Y')) for i in dataframe.index]

    fig = ff.create_annotated_heatmap(
        z = dataframe.values
        , annotation_text = dataframe.map(lambda x: '{:.1%}'.format(x) if x > 0 else '').values.tolist()
        , y = ylabel
        , x = ['Month '+ str(int(i)+1) for i in dataframe.columns]
        , showscale = True
    )

    fig.update_layout(
        width=1000
        , height=700
        , xaxis={"title": "# Periods Elapsed"}
        , font_color = 'rgb(255,255,255)'
        , title="User Retention Rate by Cohort: Heatmap"
        , paper_bgcolor='rgb(0,0,0)'
    )

    fig.show()

plot_cohorts_heatmap(cohort_counts)

# II. Case Study 
Customer Segmentation by Cohort Analysis