Kernel features:

- [Data Wrangling / Data Preparation](#data-wrangling)
- [RFM Segmentation](#rfm-segmentation)
- [Cohort Analysis](#cohort-analysis)

&nbsp;

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 

In [None]:
df = pd.read_csv('../input/data.csv', encoding = "ISO-8859-1")

In [None]:
df.head(10)

<a name="data-wrangling"></a>

## Data Wrangling

In [None]:
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])  # convert InvoiceDate to datetime

In [None]:
print("Dataset information:")
print("Number of rows \t\t:", df.shape[0])  # total number of rows
print("Number of columns \t:", df.shape[1])  # total number of columns
print("Date range from \t:", df.InvoiceDate.min(), " to ", df.InvoiceDate.max())  # time span covered by the data
print("#Transactions \t\t:", df.InvoiceNo.nunique())  # number of distinct invoices
print("#Unique Customer \t:", df.CustomerID.nunique())  # number of distinct customers
print("Range Quantity \t\t:", df.Quantity.min(), " to ", df.Quantity.max())  # range of Quantity
print("Range UnitPrice \t:", df.UnitPrice.min(), " to ", df.UnitPrice.max())  # range of UnitPrice

In [None]:
print(df.isnull().sum().sort_values(ascending=False))

- Some rows have a negative Quantity or UnitPrice.
- Some rows have a null / blank CustomerID or Description.

We will drop those rows.

In [None]:
df_new = df.dropna()  # remove rows with null values
df_new = df_new[df_new.Quantity > 0]  # keep only rows with a positive Quantity
df_new = df_new[df_new.UnitPrice > 0].copy()  # keep only rows with a positive UnitPrice; copy so new columns can be added safely

In [None]:
print(df_new.isnull().sum().sort_values(ascending=False))

In [None]:
df_new['Revenue'] = df_new['Quantity'] * df_new['UnitPrice']  # add Revenue (Quantity * UnitPrice) column
df_new['CustomerID'] = df_new['CustomerID'].astype('int64')  # cast CustomerID to integer

<a name="rfm-segmentation"></a>

## RFM Segmentation

RFM segmentation groups customers by scoring three measures: Recency (number of days since the last transaction), Frequency (number of transactions), and Monetary (total revenue).

Because the last transaction in the data is dated December 9, 2011, we will use December 10, 2011 as the reference date for calculating recency.
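
As a sketch of the same idea, the reference date can also be derived from the data instead of being typed in by hand. The toy dates and the names `toy`, `snapshot`, and `toy_recency` below are illustrative assumptions, not part of the kernel.

```python
import pandas as pd

# Made-up invoice dates for two hypothetical customers.
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2],
    "InvoiceDate": pd.to_datetime(
        ["2011-12-09 12:50:00", "2011-11-01 09:00:00", "2011-01-18 10:01:00"]
    ),
})

# Reference date: midnight of the day after the latest transaction.
snapshot = toy["InvoiceDate"].max().normalize() + pd.Timedelta(days=1)

# Recency: whole days between each customer's last purchase and the reference date.
last_purchase = toy.groupby("CustomerID")["InvoiceDate"].max()
toy_recency = (snapshot - last_purchase).dt.days

print(snapshot)
print(toy_recency)
```

Deriving the date this way keeps the recency calculation valid if the input file is later extended with newer transactions.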

In [None]:
import datetime as dt
NOW = dt.datetime(2011,12,10)

In [None]:
rfmTable = df_new.groupby('CustomerID').agg({
    'InvoiceDate': lambda x: (NOW - x.max()).days,  # recency: days since the last purchase
    'InvoiceNo': lambda x: len(x),                  # frequency: counts invoice rows (line items), not distinct invoices
    'Revenue': lambda x: x.sum()})                  # monetary: total revenue
rfmTable['InvoiceDate'] = rfmTable['InvoiceDate'].astype(int)
rfmTable.rename(columns={'InvoiceDate': 'recency',
                         'InvoiceNo': 'frequency',
                         'Revenue': 'monetary'}, inplace=True)

In [None]:
rfmTable.head()

### Interpretation :

Customer with ID 12346 has **frequency** 1 (one invoice row), **recency** 325 (325 days since the last transaction), and **monetary** 77183.60 (total revenue across all transactions).

Customer with ID 12347 has **frequency** 182 (182 invoice rows, because the aggregation counts rows rather than distinct invoices), **recency** 2 (2 days since the last transaction), and **monetary** 4310.00 (total revenue across all transactions).
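
Because `len(x)` in the aggregation above counts rows, an invoice with many line items inflates the frequency. The sketch below, on made-up data with hypothetical names (`toy_invoices`, `line_items`, `invoices`), contrasts that count with the number of distinct invoices, which may be closer to "number of transactions".

```python
import pandas as pd

# Made-up invoice lines: customer 1 has two invoices, one of them with two line items.
toy_invoices = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2],
    "InvoiceNo": ["A1", "A1", "A2", "B1"],
})

by_customer = toy_invoices.groupby("CustomerID")["InvoiceNo"]

line_items = by_customer.size()    # rows per customer (what len(x) counts)
invoices = by_customer.nunique()   # distinct invoices per customer

print(pd.DataFrame({"line_items": line_items, "invoices": invoices}))
```

Which definition to use is a modelling choice; switching to distinct invoices would change the frequency values and therefore the quartile boundaries.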

The simplest way to create segments is the quartile method, which splits each measure into 4 segments that are easy to understand.

<img src="https://www.mathsisfun.com/data/images/quartiles-c.gif" />

In [None]:
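quantiles = rfmTable.quantile(q=[0.25,0.5,0.75])  # quartile cut points for each RFM column
quantiles = quantiles.to_dict()  # {column: {quantile: value}}
segmented_rfm = rfmTable.copy()  # work on a copy so rfmTable keeps only the raw values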

In [None]:
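# x: value to score, p: column name, d: quantile dictionary
def RScore(x,p,d):
    """Score recency: lower values (more recent) get the higher score."""
    if x <= d[p][0.25]:
        return 4
    elif x <= d[p][0.50]:
        return 3
    elif x <= d[p][0.75]:
        return 2
    else:
        return 1

def FMScore(x,p,d):
    """Score frequency or monetary: higher values get the higher score."""
    if x <= d[p][0.25]:
        return 1
    elif x <= d[p][0.50]:
        return 2
    elif x <= d[p][0.75]:
        return 3
    else:
        return 4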

In [None]:
segmented_rfm['r_quartile'] = segmented_rfm['recency'].apply(RScore, args=('recency',quantiles,))
segmented_rfm['f_quartile'] = segmented_rfm['frequency'].apply(FMScore, args=('frequency',quantiles,))
segmented_rfm['m_quartile'] = segmented_rfm['monetary'].apply(FMScore, args=('monetary',quantiles,))
segmented_rfm.head()

In [None]:
segmented_rfm['RFMScore'] = segmented_rfm.r_quartile.map(str)+segmented_rfm.f_quartile.map(str)+segmented_rfm.m_quartile.map(str)
segmented_rfm.head()

An RFMScore of 444 is the best score: the customer has low **recency** (still active), high **frequency** (transacts often), and high **monetary** value.
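
As an alternative to the hand-written scoring functions, the same quartile scoring can be sketched with `pd.qcut`. The table `toy_rfm` and the helper `quartile_score` are hypothetical, and the values are made up so that every quartile edge is distinct.

```python
import pandas as pd

# Toy RFM table (made-up values, not taken from the dataset).
toy_rfm = pd.DataFrame(
    {
        "recency": [1, 2, 3, 4, 5, 6, 7, 8],
        "frequency": [80, 70, 60, 50, 40, 30, 20, 10],
        "monetary": [800.0, 700.0, 600.0, 500.0, 400.0, 300.0, 200.0, 100.0],
    },
    index=pd.Index(range(101, 109), name="CustomerID"),
)


def quartile_score(values, labels):
    """Cut a Series into four quantile bins and return the bin labels as ints."""
    return pd.qcut(values, q=4, labels=labels).astype(int)


# Low recency is good, so its labels run from 4 down to 1.
toy_rfm["r_quartile"] = quartile_score(toy_rfm["recency"], [4, 3, 2, 1])
toy_rfm["f_quartile"] = quartile_score(toy_rfm["frequency"], [1, 2, 3, 4])
toy_rfm["m_quartile"] = quartile_score(toy_rfm["monetary"], [1, 2, 3, 4])

# Concatenate the three scores into one string such as "444".
toy_rfm["RFMScore"] = (
    toy_rfm["r_quartile"].astype(str)
    + toy_rfm["f_quartile"].astype(str)
    + toy_rfm["m_quartile"].astype(str)
)

print(toy_rfm)
```

On real data with many tied values (for example, many customers with the same frequency), `pd.qcut` can raise an error about duplicate bin edges, which the explicit `<=` comparisons above avoid.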

&nbsp;

### Top 5 Customers Based on RFMScore

In [None]:
segmented_rfm[segmented_rfm['RFMScore']=='444'].sort_values('monetary', ascending=False).head()


Let's look at the transactions of CustomerID 14646 in more detail.

In [None]:
top_customer = df_new[df_new['CustomerID'] == 14646]
top_customer.head(20)

<a name="cohort-analysis"></a>

## Cohort Analysis

This cohort analysis is adapted from the DataCamp course: https://campus.datacamp.com/courses/customer-segmentation-in-python/cohort-analysis?ex=3

In [None]:
df_new.head()

In [None]:
def get_month(x): return dt.datetime(x.year, x.month, 1)
df_new['InvoiceMonth'] = df_new['InvoiceDate'].apply(get_month)
grouping = df_new.groupby('CustomerID')['InvoiceMonth']
df_new['CohortMonth'] = grouping.transform('min')

In [None]:
df_new.head()

In [None]:
## function to extract integer year, month, and day values from a datetime column
def get_date_int(df, column):
    year = df[column].dt.year
    month = df[column].dt.month
    day = df[column].dt.day
    return year, month, day

In [None]:
invoice_year, invoice_month, _ = get_date_int(df_new, 'InvoiceMonth')
cohort_year, cohort_month, _ = get_date_int(df_new, 'CohortMonth')

In [None]:
years_diff = invoice_year - cohort_year
months_diff = invoice_month - cohort_month

In [None]:
df_new['CohortIndex'] = years_diff * 12 + months_diff + 1

In [None]:
df_new.head()

In [None]:
## group customers by cohort month and cohort index, then count unique customers in each group
grouping = df_new.groupby(['CohortMonth', 'CohortIndex'])
cohort_data = grouping['CustomerID'].nunique()
cohort_data = cohort_data.reset_index()
cohort_counts = cohort_data.pivot(index='CohortMonth', columns='CohortIndex', values='CustomerID')

In [None]:
cohort_counts

### Interpretation :

<img src="https://i.imgur.com/FQn5sDf.png" />

CohortMonth 2010-12 (the December 2010 cohort) has 885 unique customers who made their first transaction that month (CohortIndex 1), <br>
324 of them made a transaction again the following month (CohortIndex 2), <br>
286 of the original 885 made a transaction in the month after that (CohortIndex 3), and so on.
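
To make the reading of the table concrete, here is a minimal sketch that builds a cohort table from a handful of made-up transactions. The frame `toy_tx` and its values are assumptions for illustration; the steps mirror the cells above in condensed form.

```python
import pandas as pd

# Made-up transactions: customers 1 and 2 first buy in Dec 2010, customer 3 in Jan 2011.
toy_tx = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 3],
    "InvoiceDate": pd.to_datetime(
        ["2010-12-05", "2011-01-10", "2010-12-20", "2011-02-01", "2011-01-15"]
    ),
})

# Truncate each date to the first day of its month.
toy_tx["InvoiceMonth"] = toy_tx["InvoiceDate"].dt.to_period("M").dt.to_timestamp()

# A customer's cohort is the month of their first transaction.
toy_tx["CohortMonth"] = toy_tx.groupby("CustomerID")["InvoiceMonth"].transform("min")

# CohortIndex 1 is the cohort month itself, 2 the month after, and so on.
toy_tx["CohortIndex"] = (
    (toy_tx["InvoiceMonth"].dt.year - toy_tx["CohortMonth"].dt.year) * 12
    + (toy_tx["InvoiceMonth"].dt.month - toy_tx["CohortMonth"].dt.month)
    + 1
)

# Unique active customers per cohort and month offset.
toy_counts = (
    toy_tx.groupby(["CohortMonth", "CohortIndex"])["CustomerID"]
    .nunique()
    .unstack("CohortIndex")
)

print(toy_counts)
```

In this toy table the December 2010 row reads 2, 1, 1: two customers in the cohort, one of them active in each of the next two months. Cells with no activity are left as NaN.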

### Retention Rate

In [None]:
cohort_sizes = cohort_counts.iloc[:,0]
retention = cohort_counts.divide(cohort_sizes, axis=0)
retention.round(2) * 100

### Heatmap

In [None]:
plt.figure(figsize=(15, 8))
plt.title('Retention rates')
sns.heatmap(data = retention,
            annot = True,
            fmt = '.0%',
            vmin = 0.0,
            vmax = 0.5,
            cmap = 'BuGn')
plt.show()

Retention rates are often overlooked, but they are very important. Because customer acquisition is expensive, we need to do everything we can to convince customers to return after their first purchase. <p>

If your retention rate is low, you will keep spending budget on acquisition channels to bring in new customers. <p>

From the cohort analysis we can see the retention rate, that is, the percentage of customers who return in the months following their first purchase.
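
As a worked example of that percentage, the sketch below uses only the three December 2010 cohort counts quoted in the interpretation earlier (885, 324, 286). The Series names are hypothetical.

```python
import pandas as pd

# December 2010 cohort counts quoted in the text, keyed by CohortIndex.
dec_2010_counts = pd.Series({1: 885, 2: 324, 3: 286}, name="2010-12")

# Retention: each month's active customers divided by the cohort size (CohortIndex 1).
dec_2010_retention = dec_2010_counts / dec_2010_counts.iloc[0]

print((dec_2010_retention * 100).round(1))
```

CohortIndex 1 is always 100% by construction, so the informative values start at CohortIndex 2.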