# Cohort Analysis

**DataCamp**: https://app.datacamp.com/learn/courses/customer-segmentation-in-python

## Imports & Load data

In [18]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt

In [8]:
! ls

Cohort_Analysis.ipynb  data  EKI.ipynb	README.md


In [9]:
cohort_counts_df = pd.read_csv('data/cohort_counts.csv')
cohort_counts_df

Unnamed: 0,CohortMonth,1,2,3,4,5,6,7,8,9,10,11,12,13
0,2010-12-01,716.0,246.0,221.0,251.0,245.0,285.0,249.0,236.0,240.0,265.0,254.0,348.0,172.0
1,2011-01-01,332.0,69.0,82.0,81.0,110.0,90.0,82.0,86.0,104.0,102.0,124.0,45.0,
2,2011-02-01,316.0,58.0,57.0,83.0,85.0,74.0,80.0,83.0,86.0,95.0,28.0,,
3,2011-03-01,388.0,63.0,100.0,76.0,83.0,67.0,98.0,85.0,107.0,38.0,,,
4,2011-04-01,255.0,49.0,52.0,49.0,47.0,52.0,56.0,59.0,17.0,,,,
5,2011-05-01,249.0,40.0,43.0,36.0,52.0,58.0,61.0,22.0,,,,,
6,2011-06-01,207.0,33.0,26.0,41.0,49.0,62.0,19.0,,,,,,
7,2011-07-01,173.0,28.0,31.0,38.0,44.0,17.0,,,,,,,
8,2011-08-01,139.0,30.0,28.0,35.0,14.0,,,,,,,,
9,2011-09-01,279.0,56.0,78.0,34.0,,,,,,,,,


In [10]:
cohort_counts_df.set_index('CohortMonth', inplace=True)
cohort_counts_df

CohortMonth,1,2,3,4,5,6,7,8,9,10,11,12,13
2010-12-01,716.0,246.0,221.0,251.0,245.0,285.0,249.0,236.0,240.0,265.0,254.0,348.0,172.0
2011-01-01,332.0,69.0,82.0,81.0,110.0,90.0,82.0,86.0,104.0,102.0,124.0,45.0,
2011-02-01,316.0,58.0,57.0,83.0,85.0,74.0,80.0,83.0,86.0,95.0,28.0,,
2011-03-01,388.0,63.0,100.0,76.0,83.0,67.0,98.0,85.0,107.0,38.0,,,
2011-04-01,255.0,49.0,52.0,49.0,47.0,52.0,56.0,59.0,17.0,,,,
2011-05-01,249.0,40.0,43.0,36.0,52.0,58.0,61.0,22.0,,,,,
2011-06-01,207.0,33.0,26.0,41.0,49.0,62.0,19.0,,,,,,
2011-07-01,173.0,28.0,31.0,38.0,44.0,17.0,,,,,,,
2011-08-01,139.0,30.0,28.0,35.0,14.0,,,,,,,,
2011-09-01,279.0,56.0,78.0,34.0,,,,,,,,,


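How to read the table:

- Rows (`CohortMonth`): the month of each customer's first transaction.
- Columns (`1` to `13`): the cohort index, i.e. the number of months since the first transaction, where `1` is the acquisition month itself.
- Cells: the metric, here the number of customers from that cohort who were active in that month.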

## Cohort analysis

Cohort analysis is a descriptive analytics tool. It groups customers into mutually exclusive cohorts, which are then measured over time. Cohort analysis provides deeper insights than so-called vanity metrics: it helps to understand high-level trends by showing metrics across both the product and the customer lifecycle.

Types of cohorts:
- Time cohorts are customers who signed up for a product or service during a particular time frame. Analyzing these cohorts shows how customer behavior depends on when they started using the company's products or services. The time frame may be monthly, quarterly, or even daily.
- Behavior cohorts are customers who purchased a product or subscribed to a service in the past. They group customers by the type of product or service they signed up for. Customers who signed up for basic-level services might have different needs than those who signed up for advanced services. Understanding the needs of the various cohorts can help a company design custom-made services or products for particular segments.
- Size cohorts refer to the various sizes of customers who purchase the company's products or services. This categorization can be based on the amount spent in some period after acquisition, or on the product type that accounted for most of the customer's order amount in some period.

Now, let's look at the main elements of cohort analysis.

In [5]:
# How many customers made their first transaction in January 2011?
cohort_counts_df.loc['2011-01-01', '1']

332.0

In [15]:
# Same lookup by position: second row, first column
cohort_counts_df.iloc[1, 0]

332.0
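
The two lookups above return the same cell, once by label and once by position. A minimal sketch on a small excerpt of the table (the name `counts` is only for illustration) shows why the column label is the string `'1'` rather than the integer `1`: `read_csv` keeps header names as strings, and `CohortMonth` was not parsed as a date.

```python
import pandas as pd

# First three cohorts and first three periods, copied from the table above.
counts = pd.DataFrame(
    {
        "1": [716.0, 332.0, 316.0],
        "2": [246.0, 69.0, 58.0],
        "3": [221.0, 82.0, 57.0],
    },
    index=pd.Index(["2010-12-01", "2011-01-01", "2011-02-01"], name="CohortMonth"),
)

# Label-based lookup: both the row label and the column label are strings.
by_label = counts.loc["2011-01-01", "1"]

# Position-based lookup: second row, first column.
by_position = counts.iloc[1, 0]

print(by_label, by_position)
```

Passing both positions to a single `iloc[row, col]` call avoids the intermediate Series that `iloc[1][0]` creates.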

### Time cohort

Time cohorts are the most popular type of cohort analysis. We will segment customers into acquisition cohorts based on the month of their first purchase. We will then assign a cohort index to each of the customer's purchases, representing the number of months since that first transaction.
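
As a rough sketch of that idea, the block below assigns a monthly acquisition cohort and a cohort index on a tiny, made-up set of transactions. The `toy` DataFrame and the `InvoiceMonth`, `CohortMonth`, and `CohortIndex` column names are assumptions for illustration, and the index is assumed to start at 1 for the acquisition month, matching the column labels of the cohort counts table.

```python
import pandas as pd

# Hypothetical toy transactions; column names mirror online.csv.
toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2],
    "InvoiceDate": pd.to_datetime(
        ["2010-12-05", "2011-01-20", "2011-03-02", "2011-01-15", "2011-02-01"]
    ),
})

# Truncate each invoice date to the first day of its month.
toy["InvoiceMonth"] = toy["InvoiceDate"].dt.to_period("M").dt.to_timestamp()

# Acquisition cohort: the month of each customer's first purchase.
toy["CohortMonth"] = toy.groupby("CustomerID")["InvoiceMonth"].transform("min")

# Cohort index: months elapsed since the cohort month, starting at 1.
years_diff = toy["InvoiceMonth"].dt.year - toy["CohortMonth"].dt.year
months_diff = toy["InvoiceMonth"].dt.month - toy["CohortMonth"].dt.month
toy["CohortIndex"] = years_diff * 12 + months_diff + 1

print(toy[["CustomerID", "InvoiceMonth", "CohortMonth", "CohortIndex"]])
```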

In [45]:
online_df = pd.read_csv('data/online.csv', index_col='Unnamed: 0')
online_df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
416792,572558,22745,POPPY'S PLAYHOUSE BEDROOM,6,2011-10-25 08:26:00,2.1,14286,United Kingdom
482904,577485,23196,VINTAGE LEAF MAGNETIC NOTEPAD,1,2011-11-20 11:56:00,1.45,16360,United Kingdom
263743,560034,23299,FOOD COVER WITH BEADS SET 2,6,2011-07-14 13:35:00,3.75,13933,United Kingdom
495549,578307,72349B,SET/6 PURPLE BUTTERFLY T-LIGHTS,1,2011-11-23 15:53:00,2.1,17290,United Kingdom
204384,554656,21756,BATH BUILDING BLOCK WORD,3,2011-05-25 13:36:00,5.95,17663,United Kingdom


In [46]:
online_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 70864 entries, 416792 to 312243
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   InvoiceNo    70864 non-null  int64  
 1   StockCode    70864 non-null  object 
 2   Description  70864 non-null  object 
 3   Quantity     70864 non-null  int64  
 4   InvoiceDate  70864 non-null  object 
 5   UnitPrice    70864 non-null  float64
 6   CustomerID   70864 non-null  int64  
 7   Country      70864 non-null  object 
dtypes: float64(1), int64(3), object(4)
memory usage: 4.9+ MB


In [47]:
online_df['InvoiceDate'] = pd.to_datetime(online_df['InvoiceDate'])
online_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 70864 entries, 416792 to 312243
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   InvoiceNo    70864 non-null  int64         
 1   StockCode    70864 non-null  object        
 2   Description  70864 non-null  object        
 3   Quantity     70864 non-null  int64         
 4   InvoiceDate  70864 non-null  datetime64[ns]
 5   UnitPrice    70864 non-null  float64       
 6   CustomerID   70864 non-null  int64         
 7   Country      70864 non-null  object        
dtypes: datetime64[ns](1), float64(1), int64(3), object(3)
memory usage: 4.9+ MB


In [48]:
# Truncate a timestamp to midnight of the same day (drops the time component)
def get_day(x):
    return dt.datetime(x.year, x.month, x.day)

In [49]:
# Create an InvoiceDay column by applying get_day to the InvoiceDate column
online_df['InvoiceDay'] = online_df['InvoiceDate'].apply(get_day)
online_df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,InvoiceDay
416792,572558,22745,POPPY'S PLAYHOUSE BEDROOM,6,2011-10-25 08:26:00,2.1,14286,United Kingdom,2011-10-25
482904,577485,23196,VINTAGE LEAF MAGNETIC NOTEPAD,1,2011-11-20 11:56:00,1.45,16360,United Kingdom,2011-11-20
263743,560034,23299,FOOD COVER WITH BEADS SET 2,6,2011-07-14 13:35:00,3.75,13933,United Kingdom,2011-07-14
495549,578307,72349B,SET/6 PURPLE BUTTERFLY T-LIGHTS,1,2011-11-23 15:53:00,2.1,17290,United Kingdom,2011-11-23
204384,554656,21756,BATH BUILDING BLOCK WORD,3,2011-05-25 13:36:00,5.95,17663,United Kingdom,2011-05-25
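
Applying `get_day` row by row works, but pandas also offers a vectorized way to drop the time component. The sketch below, on two made-up timestamps, compares the two approaches; `stamps` is a hypothetical name and `Series.dt.normalize()` is offered as an alternative, not as what the notebook uses.

```python
import datetime as dt
import pandas as pd

# Hypothetical timestamps in the same format as InvoiceDate.
stamps = pd.Series(pd.to_datetime(["2011-10-25 08:26:00", "2011-05-25 13:36:00"]))

def get_day(x):
    # Same helper as in the cell above.
    return dt.datetime(x.year, x.month, x.day)

# Element-wise version, as in the notebook.
via_apply = stamps.apply(get_day)

# Vectorized version: normalize() resets every time to midnight.
via_normalize = stamps.dt.normalize()

# Both approaches should agree on every row.
same = bool((via_apply == via_normalize).all())
print(same)
```

On a dataset of this size (about 70,000 rows) either approach is fine; the vectorized form mainly matters for larger data.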


In [50]:
# Group by CustomerID and select the InvoiceDay column for further calculations.
grouping = online_df.groupby('CustomerID')['InvoiceDay']

In [51]:
# Create a CohortDay column holding each customer's earliest InvoiceDay (their first purchase).
online_df['CohortDay'] = grouping.transform('min')
online_df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,InvoiceDay,CohortDay
416792,572558,22745,POPPY'S PLAYHOUSE BEDROOM,6,2011-10-25 08:26:00,2.1,14286,United Kingdom,2011-10-25,2011-04-11
482904,577485,23196,VINTAGE LEAF MAGNETIC NOTEPAD,1,2011-11-20 11:56:00,1.45,16360,United Kingdom,2011-11-20,2011-09-12
263743,560034,23299,FOOD COVER WITH BEADS SET 2,6,2011-07-14 13:35:00,3.75,13933,United Kingdom,2011-07-14,2011-07-14
495549,578307,72349B,SET/6 PURPLE BUTTERFLY T-LIGHTS,1,2011-11-23 15:53:00,2.1,17290,United Kingdom,2011-11-23,2011-11-23
204384,554656,21756,BATH BUILDING BLOCK WORD,3,2011-05-25 13:36:00,5.95,17663,United Kingdom,2011-05-25,2011-02-25
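
The reason for `transform('min')` rather than a plain `min()` is the shape of the result. A small sketch on made-up data (the `toy` frame and its values are hypothetical) illustrates the difference:

```python
import pandas as pd

# Hypothetical purchases for two customers.
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 2],
    "InvoiceDay": pd.to_datetime(
        ["2011-03-10", "2011-01-05", "2011-06-01", "2011-04-11", "2011-10-25"]
    ),
})

grouping = toy.groupby("CustomerID")["InvoiceDay"]

# min() aggregates: one row per customer.
first_day_per_customer = grouping.min()

# transform("min") broadcasts each customer's minimum back to every
# original row, so the result can be assigned directly as a new column.
toy["CohortDay"] = grouping.transform("min")

print(first_day_per_customer)
print(toy)
```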


#### Calculate Time offset

Calculating time offset for each transaction allows you to report the metrics for each cohort in a comparable fashion.
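
One possible way to compute a day-level offset from the `InvoiceDay` and `CohortDay` columns is sketched below. It is an assumption, not necessarily the calculation this notebook goes on to use: it counts exact calendar days and adds 1 so that the day of the first purchase gets index 1. The `toy` frame reuses three date pairs from the output above, and `CohortIndex` is a hypothetical column name.

```python
import pandas as pd

# Three InvoiceDay/CohortDay pairs taken from the output above.
toy = pd.DataFrame({
    "InvoiceDay": pd.to_datetime(["2011-10-25", "2011-07-14", "2011-05-25"]),
    "CohortDay": pd.to_datetime(["2011-04-11", "2011-07-14", "2011-02-25"]),
})

# Subtracting two datetime columns gives a timedelta column;
# .dt.days extracts the whole number of days.
days_since_first = (toy["InvoiceDay"] - toy["CohortDay"]).dt.days

# Assumed convention: the first purchase day has index 1.
toy["CohortIndex"] = days_since_first + 1

print(toy)
```

A monthly offset would instead compare the year and month components of the two dates, which is what makes cohorts acquired in different months comparable column by column.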