# WELCOME!

Welcome to the "RFM Customer Segmentation & Cohort Analysis Project". This is the first project of the Capstone Project Series, which consists of 4 different projects covering different scenarios.

In this project you will learn what RFM is and how to apply RFM Analysis and Customer Segmentation using K-Means Clustering. You will also improve your Data Cleaning, Data Visualization and Exploratory Data Analysis skills, and you will create cohorts and conduct Cohort Analysis.

Before diving into the project, please take a look at the Determines and Project Structures sections below.

- **NOTE:** This tutorial assumes that you already know the basics of coding in Python and are familiar with the theory behind K-Means Clustering.



# #Determines

Using the [Online Retail dataset](https://archive.ics.uci.edu/ml/datasets/Online+Retail) from the UCI Machine Learning Repository for exploratory data analysis, ***Customer Segmentation***, ***RFM Analysis***, ***K-Means Clustering*** and ***Cohort Analysis***.

This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.

Feature Information:

**InvoiceNo**: Invoice number. *Nominal*, a 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'C', it indicates a cancellation.
<br>
**StockCode**: Product (item) code. *Nominal*, a 5-digit integral number uniquely assigned to each distinct product.
<br>
**Description**: Product (item) name. *Nominal*. 
<br>
**Quantity**: The quantities of each product (item) per transaction. *Numeric*.
<br>
**InvoiceDate**: Invoice Date and time. *Numeric*, the day and time when each transaction was generated.
<br>
**UnitPrice**: Unit price. *Numeric*, Product price per unit in sterling.
<br>
**CustomerID**: Customer number. *Nominal*, a 5-digit integral number uniquely assigned to each customer.
<br>
**Country**: Country name. *Nominal*, the name of the country where each customer resides.


---


First of all, to observe the structure of the data and missing values, you can use exploratory data analysis and data visualization techniques.

You must do descriptive analysis because you need to understand how the features relate to each other and to clean the noise and missing values from the data. After that, the data set will be ready for RFM analysis.

Before starting the RFM Analysis, you will be asked to do some analysis regarding the distribution of *Orders*, *Customers* and *Countries*. These analyses will help the company develop its sales policies and contribute to the correct use of resources.

You will notice that the UK not only has the most sales revenue, but also the most customers. So you will continue to analyze only UK transactions in the next RFM Analysis, Customer Segmentation and K-Means Clustering topics.

Next, you will begin RFM Analysis, a customer segmentation technique based on customers' past purchasing behavior. 

By using RFM Analysis, you can enable companies to develop different approaches to different customer segments so that they can get to know their customers better, observe trends better, and increase customer retention and sales revenues.

You will calculate the Recency, Frequency and Monetary values of the customers in the RFM Analysis you will make using the data consisting of UK transactions. Ultimately, you have to create an RFM table containing these values.
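As a compact illustration of what the RFM table contains, the three values can be computed in a single `groupby`. This is a minimal sketch on made-up transactions, not the notebook's own procedure; the toy frame `tx` and its values are assumptions, and only the column names follow the data set.

```python
import pandas as pd

# Toy transactions (hypothetical values, same column names as the data set)
tx = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2, 3],
    "InvoiceNo": ["A1", "A1", "A2", "B1", "B2", "C1"],
    "InvoiceDate": pd.to_datetime([
        "2011-12-01", "2011-12-01", "2011-12-08",
        "2011-10-01", "2011-11-09", "2011-06-12",
    ]),
    "TotalPrice": [10.0, 5.0, 20.0, 7.5, 2.5, 100.0],
})

# Reference date: the latest invoice date in the data
ref_date = tx["InvoiceDate"].max()

rfm = tx.groupby("CustomerID").agg(
    # Days between the reference date and the customer's last purchase
    Recency=("InvoiceDate", lambda s: (ref_date - s.max()).days),
    # Number of distinct invoices
    Frequency=("InvoiceNo", "nunique"),
    # Total amount spent
    Monetary=("TotalPrice", "sum"),
)
print(rfm)
```

Counting distinct invoices rather than rows keeps a multi-line invoice from inflating Frequency.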

In the Customer Segmentation section, you will create an RFM Segmentation Table where you segment your customers by using the RFM table. For example, you can label the best customer as "Big Spenders" and the lost customer as "Lost Customer".

We will segment the customers ourselves based on their recency, frequency, and monetary values. But can an **unsupervised learning** model do this better for us? You will use the K-Means algorithm to find the answer to this question. Then you will compare the classification made by the algorithm with the classification you have made yourself.

Before applying K-Means Clustering, you should do data pre-processing. In this context, it will be useful to examine feature correlations and distributions. In addition, the data you apply for K-Means should be normalized.
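One common way to prepare skewed RFM values for K-Means is a log transform followed by standardization. The sketch below assumes this approach and uses a made-up `rfm` frame; other scalers may suit the data just as well.

```python
import numpy as np
import pandas as pd

# Hypothetical RFM values with a long right tail
rfm = pd.DataFrame(
    {
        "Recency": [1, 10, 100, 300],
        "Frequency": [50, 10, 2, 1],
        "Monetary": [20000.0, 1500.0, 300.0, 40.0],
    },
    index=pd.Index([1, 2, 3, 4], name="CustomerID"),
)

# log1p compresses the long right tail (all values here are >= 0)
rfm_log = np.log1p(rfm)

# Standardize each column to zero mean and unit variance (population std)
rfm_scaled = (rfm_log - rfm_log.mean()) / rfm_log.std(ddof=0)
print(rfm_scaled.round(2))
```

K-Means relies on Euclidean distance, so without a step like this the column with the largest range (usually Monetary) would dominate the clustering.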

You also need to tell the K-Means algorithm how many clusters to form. You will try the ***Elbow method*** and ***Silhouette Analysis*** to find the optimum number of clusters.
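To make the Elbow method concrete, the sketch below computes the within-cluster sum of squares (inertia) for several values of k on synthetic data. The helper `kmeans_inertia` is a small from-scratch illustration written for this example, not a library function, and the data is made up; in practice a library implementation of K-Means would normally be used.

```python
import numpy as np

def kmeans_inertia(X, k, n_iter=50, seed=0):
    """Run Lloyd's algorithm and return the within-cluster sum of squares."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every point to every center: shape (n, k)
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Recompute centers; keep the old center if a cluster is empty
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dist.min(axis=1).sum()

# Three well-separated blobs (synthetic data)
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=c, scale=0.3, size=(30, 2))
    for c in ([0, 0], [5, 5], [0, 5])
])

inertias = {k: kmeans_inertia(X, k) for k in range(1, 7)}
for k, value in inertias.items():
    print(k, round(value, 2))
```

Plotting inertia against k and looking for the point where the curve flattens gives the "elbow"; for data like this it would be expected near k = 3.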

After the above operations, you will have made cluster estimation with K-Means. You should visualize the cluster distribution by using a scatter plot. You can observe the properties of the resulting clusters with the help of the boxplot. Thus you will be able to tag clusters and interpret results.

Finally, you will do Cohort Analysis with the data you used at the beginning, regardless of the analysis you have done before. Cohort analysis is a subset of behavioral analytics that takes the user data and breaks them into related groups for analysis. This analysis can further be used to do customer segmentation and track metrics like retention, churn, and lifetime value.
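The idea behind a retention cohort can be sketched in a few lines: each customer is assigned to the month of their first purchase, and activity is then counted by the number of months since that first purchase. The toy frame `tx` below is an assumption made for illustration only.

```python
import pandas as pd

# Toy purchases (hypothetical values)
tx = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2, 3],
    "InvoiceDate": pd.to_datetime([
        "2011-01-05", "2011-02-10", "2011-03-15",
        "2011-01-20", "2011-03-02", "2011-02-11",
    ]),
})

# Month of each purchase, stored as the first day of that month
tx["InvoiceMonth"] = tx["InvoiceDate"].dt.to_period("M").dt.to_timestamp()
# Cohort: month of the customer's first purchase
tx["CohortMonth"] = tx.groupby("CustomerID")["InvoiceMonth"].transform("min")

# Cohort index: months elapsed since the first purchase, starting at 1
tx["CohortIndex"] = (
    (tx["InvoiceMonth"].dt.year - tx["CohortMonth"].dt.year) * 12
    + (tx["InvoiceMonth"].dt.month - tx["CohortMonth"].dt.month)
    + 1
)

# Active customers per cohort and month offset
counts = (
    tx.groupby(["CohortMonth", "CohortIndex"])["CustomerID"]
    .nunique()
    .unstack()
)
# Retention: share of the cohort's initial size that is active in each month
retention = counts.divide(counts[1], axis=0)
print(retention.round(2))
```

Cells for months that a cohort has not reached yet are left as NaN, which is why retention tables usually have a triangular shape.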


# #Project Structures

- Data Cleaning & Exploratory Data Analysis
- RFM Analysis
- Customer Segmentation
- Applying K-Means Clustering
- Create Cohort and Conduct Cohort Analysis

# #Tasks

#### 1. Data Cleaning & Exploratory Data Analysis

- Import Modules, Load Data & Data Review
- Follow the Steps Below

    *i. Take a look at relationships between InvoiceNo, Quantity and UnitPrice columns.*
    
    *ii. What does the letter "C" in the InvoiceNo column mean?*
    
    *iii. Handling Missing Values*
    
    *iv. Clean the Data from the Noise and Missing Values*
    
    *v. Explore the Orders*
    
    *vi. Explore Customers by Country*
    
    *vii. Explore the UK Market*
    
#### 2. RFM Analysis

- Follow the steps below

   *i. Import Libraries*
   
   *ii. Review "df_uk" DataFrame (the df_uk that you create at the end of Task 1)*
   
   *iii. Calculate Recency*
   
   *iv. Calculate Frequency*
   
   *v. Calculate Monetary Values*
   
   *vi. Create RFM Table*

#### 3. Customer Segmentation with RFM Scores
- Calculate RFM Scoring

    *i. Creating the RFM Segmentation Table*
 
- Plot RFM Segments

#### 4. Applying K-Means Clustering
- Data Pre-Processing and Exploring

    *i. Define and Plot Feature Correlations*
 
    *ii. Visualize Feature Distributions*
 
    *iii. Data Normalization*

- K-Means Implementation

    *i. Define Optimal Cluster Number (K) by using "Elbow Method" and "Silhouette Analysis"*
 
    *ii. Visualize the Clusters*
 
    *iii. Assign the label*
 
    *iv. Conclusion*
 
#### 5. Create Cohort and Conduct Cohort Analysis
- Feature Engineering

    *i. Extract the Month of the Purchase*
 
    *ii. Calculating time offset in Months i.e. Cohort Index*
 
- Create 1st Cohort: User Number & Retention Rate 

    *i. Pivot Cohort and Cohort Retention*
 
    *ii. Visualize analysis of cohort 1 using seaborn and matplotlib*

- Create 2nd Cohort: Average Quantity Sold 

    *i. Pivot Cohort and Cohort Retention*
 
    *ii. Visualize analysis of cohort 2 using seaborn and matplotlib*

- Create 3rd Cohort: Average Sales

    *i. Pivot Cohort and Cohort Retention*
 
    *ii. Visualize analysis of cohort 3 using seaborn and matplotlib*
    
- **Note: There may be sub-tasks associated with each task, you will see them in order during the course of the work.**


# 1. Data Cleaning & Exploratory Data Analysis

## Import Modules, Load Data & Data Review

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# df = pd.read_excel("Online Retail.xlsx")
# df.to_csv('Online_Retail.csv')

In [None]:
df=pd.read_csv('../input/online-retail-customer-clustering/OnlineRetail.csv')
df.head()

In [None]:
def explain(attribute):
    features= {'InvoiceNo': "Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter 'C', it indicates a cancellation.",
    'StockCode': 'Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.',
    'Description': 'Product (item) name. Nominal.',
    'Quantity': 'The quantities of each product (item) per transaction. Numeric.',
    'InvoiceDate': 'Invoice Date and time. Numeric, the day and time when each transaction was generated.',
    'UnitPrice': 'Unit price. Numeric, Product price per unit in sterling.',
    'CustomerID': 'Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.',
    'Country': 'Country name. Nominal, the name of the country where each customer resides.'}
    return features[attribute]


### Descriptive Analysis

In [None]:
df.duplicated().value_counts()

**Explanation:**
* Duplicate rows should be dropped because they would distort the analysis.

In [None]:
df=df.drop_duplicates()

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.isnull().sum()

In [None]:
df.describe()
# df.describe(include=['O'])

In [None]:
# Alternative: one summary table with types, counts, uniques, nulls, min and max
def summary(df):
    types = df.dtypes
    counts = df.count()
    uniques = df.nunique(dropna=False)  # NaN counts as a value, as before
    nulls = df.isnull().sum()
    col_min = df.min()
    col_max = df.max()
    print('Data shape:', df.shape)

    # Avoid naming the result `str`, which would shadow the built-in
    summary_df = pd.concat([types, counts, uniques, nulls, col_min, col_max], axis=1, sort=True)
    summary_df.columns = ['Types', 'Counts', 'Uniques', 'Nulls', 'Min', 'Max']
    print('___________________________\nData Types:')
    print(summary_df.Types.value_counts())
    print('___________________________')
    return summary_df

display(summary(df).sort_values(by='Nulls', ascending=False))

**Evaluations:**
* The `CustomerID` and `Description` fields have null values.
* `Quantity` and `UnitPrice` should have a value >= 0, but from the summary above there are negative values for the two columns.

### i. Take a look at relationships between InvoiceNo, Quantity and UnitPrice columns.

We see that there are negative values in the Quantity and UnitPrice columns. These are possibly canceled and returned orders. Let's check it out.

In [None]:
# counts of the negative values in the Quantity
df[df['Quantity'] <0].shape[0]

In [None]:
# counts of the negative values in the UnitPrice
df[df['UnitPrice'] < 0].shape[0]

In [None]:
df[df['UnitPrice'] < 0]

In [None]:
df[df['Quantity'] <0].head()

**Explanation:**
* Negative Quantity and UnitPrice values are exceptional. We cannot use these rows for the analysis, so we will drop them.

### ii. What does the letter "C" in the InvoiceNo column mean?

In [None]:
explain('InvoiceNo')

If the invoice number starts with the letter "C", the order was cancelled; this includes customers who abandoned their order.

In [None]:
df['cancellation'] = df['InvoiceNo'].astype(str).str.startswith('C').astype(int)
df['cancellation'].value_counts()

Filtering cancelled orders by Quantity > 0 or non-cancelled orders by Quantity < 0 returns nothing, which confirms that a negative quantity means the order was cancelled. So let's find out how many orders were cancelled.

In [None]:
df[(df.cancellation==1) & (df.Quantity>0)]

**Evaluations:**
* We can say that if cancellation = 1, then Quantity < 0.
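The same check can be done in one step by cross-tabulating the cancellation flag against the sign of Quantity. This is a minimal sketch on made-up invoices; the `orders` frame and its invoice numbers are assumptions for illustration.

```python
import pandas as pd

# Hypothetical invoices: two regular orders and two cancellations
orders = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "536381", "C536383"],
    "Quantity": [6, -1, 3, -2],
})

# Flag invoices whose number starts with "C"
orders["cancellation"] = (
    orders["InvoiceNo"].astype(str).str.startswith("C").astype(int)
)

# Rows: cancellation flag, columns: whether Quantity is negative
check = pd.crosstab(orders["cancellation"], orders["Quantity"] < 0)
print(check)
```

If the off-diagonal cells are zero, every cancelled row has a negative quantity and every regular row a non-negative one.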

#### 9288 or about 36% of the orders were cancelled. Looking deeper into why these orders were cancelled may prevent future cancellations. Now let's find out what a negative UnitPrice means.


In [None]:
df['cancellation'].value_counts()

In [None]:
# Percentage of customers who cancelled at least one order
df[df.cancellation==1]['CustomerID'].nunique() / df['CustomerID'].nunique()*100

In [None]:
# Percentage of cancelled orders
df[df.cancellation==1]['InvoiceNo'].nunique() / df['InvoiceNo'].nunique()*100


**Explanation:**
* The quantities on cancelled invoices should be corrected in the data set by subtracting them from the most recent earlier matching invoice. This correction will be made in the next version. In this version, the analysis continues with positive quantities only.

### iii. Handling Missing Values

Since the customer IDs are missing, let's assume these orders were not made by customers already in the data set, because those customers already have IDs.

We also don't want to assign these orders to existing customers because this would alter the insights we draw from the data.


In [None]:
df[df.cancellation==1]['CustomerID'].value_counts(dropna=False).head(5)

### iv. Clean the Data from the Noise and Missing Values

In [None]:
# Drop CustomerID with null value
df = df[df.CustomerID.notnull()]

In [None]:
# Keep only rows with positive Quantity and UnitPrice
df = df[(df.Quantity > 0) & (df.UnitPrice > 0)]

In [None]:
details = summary(df)
display(details.sort_values(by='Uniques', ascending=False))

In [None]:
df[df.cancellation==1]

In [None]:
df=df.drop('cancellation',axis=1)

In [None]:
df

### v. Explore the Orders


1. Find the unique number of InvoiceNo per customer

In [None]:
# Number of orders per customer
df.groupby('CustomerID')['InvoiceNo'].nunique().sort_values(ascending=False)

2. What's the average number of unique items per order or per customer?

In [None]:
# Average number of unique products per order, for each customer
mean_of_unique_items= round(df.groupby(['CustomerID',
                      'InvoiceNo']).agg({'StockCode':lambda x:x.nunique()}).groupby('CustomerID')['StockCode'].mean(),
                      1).sort_values(ascending=False)
mean_of_unique_items

3. Let's see how this compares to the number of unique products per customer.

In [None]:
# Number of unique products purchased by each customer
num_of_unique_product= pd.DataFrame(df.groupby('CustomerID').StockCode.nunique()).rename(columns={'StockCode':'num_of_unique_product'})

# Number of orders per customer
num_of_order = df.groupby('CustomerID').InvoiceNo.nunique()

In [None]:
pd.concat([mean_of_unique_items,
           num_of_order,
           num_of_unique_product],
           axis=1).rename(columns={'StockCode': "mean_of_unique_items",
                                   'InvoiceNo': 'num_of_order'}).sort_values('num_of_order', ascending=False)

### vi. Explore Customers by Country

1. What's the total revenue per country?

In [None]:
df

In [None]:
df['TotalPrice'] = df['UnitPrice']*df['Quantity']

In [None]:
df2=pd.DataFrame(df.groupby('Country').TotalPrice.sum().apply(lambda x: round(x,2))).sort_values('TotalPrice',ascending=False)
# df2=df.groupby('Country').agg({'TotalPrice': lambda x: x.sum()}).sort_values('TotalPrice',ascending=False)

df2['perc_of_TotalPrice']=round(df2.TotalPrice/df2.TotalPrice.sum()*100,2)
df2

2. Visualize number of customer per country

In [None]:
df2['customer_num']=df.groupby('Country').CustomerID.nunique()
df2['customer_rate']=round(df2.customer_num/df2.customer_num.sum()*100,2)
df2.head(5)

In [None]:
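plt.figure(figsize=(15,5))
# Sort the whole frame so that country labels stay aligned with their values
by_customers = df2.sort_values('customer_num', ascending=False)
sns.barplot(y=by_customers.index, x=by_customers.customer_num);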

In [None]:
plt.figure(figsize=(15,10))
# Skip the first row of df2, then sort so that labels and values stay aligned
rest = df2.iloc[1:].sort_values('customer_num', ascending=False)
sns.barplot(y=rest.index, x=rest.customer_num);

3. Visualize total revenue per country

In [None]:
plt.figure(figsize=(15,10))
# Skip the first row of df2, then sort so that labels and values stay aligned
rest = df2.iloc[1:].sort_values('TotalPrice', ascending=False)
sns.barplot(y=rest.index, x=rest.TotalPrice);

#### The UK not only has the most sales revenue, but also the most customers. Since the majority of this data set contains orders from the UK, we can explore the UK market further by finding out what products the customers buy together and any other buying behaviors to improve our sales and targeting strategy.

### vii. Explore the UK Market


1. Create df_uk DataFrame

In [None]:
df_uk=df[df.Country=='United Kingdom']
df_uk.head()

2. What are the most popular products that are bought in the UK?

In [None]:
pd.DataFrame(df_uk.StockCode.value_counts().head(10))

**Explanation**
* The letters 'A' and 'B' at the end of some StockCodes have no special meaning.

### We will continue analyzing the UK transactions with customer segmentation.

# 2. RFM Analysis

In the age of the internet and e-commerce, companies that do not expand their businesses online or utilize digital tools to reach their customers will run into issues like scalability and a lack of digital presence. An important marketing strategy e-commerce businesses use for analyzing and predicting customer value is customer segmentation. Customer data is used to sort customers into groups based on their behaviors and preferences.

**[RFM](https://www.putler.com/rfm-analysis/) (Recency, Frequency, Monetary) Analysis** is a customer segmentation technique for analyzing customer value based on past buying behavior. RFM analysis was first used by the direct mail industry more than four decades ago, yet it is still an effective way to optimize your marketing.
<br>
<br>
Our goal in this Notebook is to cluster the customers in our data set to:
 - Recognize who are our most valuable customers
 - Increase revenue
 - Increase customer retention
 - Learn more about the trends and behaviors of our customers
 - Define customers that are 5-Lost Customers

We will start with **RFM Analysis** and then complement our findings with predictive analysis using **K-Means Clustering Algorithms.**

- RECENCY (R): Time since last purchase
- FREQUENCY (F): Total number of purchases
- MONETARY VALUE (M): Total monetary value




Benefits of RFM Analysis

- Increased customer retention
- Increased response rate
- Increased conversion rate
- Increased revenue

RFM Analysis answers the following questions:
 - Who are our best customers?
 - Who has the potential to be converted into more profitable customers?
 - Which customers do we need to retain?
 - Which group of customers is most likely to respond to our marketing campaign?
 

### i. Import Libraries

In [None]:
from datetime import datetime as dt
import warnings
warnings.filterwarnings('ignore')
warnings.warn("this will not show")

### ii. Review df_uk DataFrame

In [None]:
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])

df_uk=df[df.Country=='United Kingdom'].copy()  # copy so that new columns are added to an independent frame
df_uk.head()

### iii. Recency: Days since last purchase
To calculate the recency values, follow these steps in order:

1. To calculate recency, we need to choose a date as a point of reference to evaluate how many days ago was the customer's last purchase.
2. Create a new column called Date which contains the invoice date without the timestamp
3. Group by CustomerID and check the last date of purchase
4. Calculate the days since last purchase
5. Drop Last_Purchase_Date since we don't need it anymore
6. Plot RFM distributions

1. Choose a date as a point of reference to evaluate how many days ago was the customer's last purchase.

In [None]:
# a point of reference date 

ref_date = max(df['InvoiceDate'])
ref_date

2. Create a new column called Date which contains the invoice date without the timestamp

In [None]:
df_uk['Date']=df_uk['InvoiceDate'].apply(lambda x: x.date())
df_uk.head(3)

3. Group by CustomerID and check the last date of purchase

In [None]:
df_uk['Last_Purchase_Date']=df_uk.groupby(['CustomerID'])['Date'].transform(max)
# df_uk.groupby('CustomerID').agg({'Date': lambda x:x.max()})
df_uk.head(5)

4. Calculate the days since last purchase

In [None]:
# Days between the reference date and each customer's last purchase date
df_uk['Recency'] = (pd.Timestamp(ref_date.date()) - pd.to_datetime(df_uk['Last_Purchase_Date'])).dt.days

# df_uk.groupby('CustomerID').agg({'Date': lambda x: (ref_date.date() - x.max()).days})
df_uk.head()

In [None]:
df_uk.Recency.value_counts().sort_index()

5. Drop Last_Purchase_Date since we don't need it anymore

In [None]:
df_uk = df_uk.drop('Last_Purchase_Date',axis=1)

6. Plot RFM distributions

In [None]:
plt.subplots(figsize=(15, 5))
sns.histplot(df_uk.groupby('CustomerID')['Recency'].max(), bins=80)  # distplot is deprecated in recent seaborn
plt.title('Recency Value Distribution', fontsize = 15)
plt.xlabel('Recency')
plt.ylabel('Count');

### iv. Frequency: Number of purchases

To calculate how many times a customer purchased something, we need to count how many invoices each customer has. To calculate the frequency values, follow these steps in order:

1. Make a copy of df_uk and drop duplicates

In [None]:
df_uk=df_uk.drop_duplicates()

2. Calculate the frequency of purchases

In [None]:
df_uk['Frequency'] = df_uk.groupby('CustomerID').InvoiceNo.transform('nunique')

3. Plot RFM distributions

In [None]:
plt.figure(figsize=(15, 5))
sns.histplot(df_uk.groupby('CustomerID')['Frequency'].max(), bins=200)  # distplot is deprecated in recent seaborn
plt.title('Frequency Value Distribution', fontsize = 15)
plt.xlim(-10, 60)
plt.xlabel('Frequency')
plt.ylabel('Count');

### v. Monetary: Total amount of money spent

The monetary value is calculated by adding together the cost of the customers' purchases.


1. Calculate the total cost per customer and name it "Monetary"

In [None]:
df_uk['Monetary'] = df_uk.groupby('CustomerID').TotalPrice.transform('sum')

2. Plot RFM distributions

In [None]:
plt.subplots(figsize=(15, 5))
sns.histplot(df_uk.groupby('CustomerID')['Monetary'].max(), bins=400)  # distplot is deprecated in recent seaborn
plt.title('Monetary Value Distribution', fontsize = 15)
plt.xlim(-10000, 40000)
plt.xlabel('Monetary')
plt.ylabel('Count');

### vi. Create RFM Table
Merge the recency, frequency and monetary values into one table

In [None]:
# One row per customer; grouping avoids dropping customers who share identical R, F and M values
df_rfm = df_uk.groupby('CustomerID')[['Recency','Frequency','Monetary']].max()
df_rfm.sort_index()

In [None]:
# Alternative

# RECENCY (R): Time since last purchase
# FREQUENCY (F): Total number of purchases
# MONETARY VALUE (M): Total monetary value

#df_uk=df[df.Country=='United Kingdom']

# data_x = df_uk.groupby('CustomerID').agg({'InvoiceDate': lambda x: (ref_date - x.max()).days})

# data_y = df_uk.groupby('CustomerID').agg({'InvoiceNo': lambda x: x.nunique()})

# data_z = df_uk.groupby('CustomerID').agg({'TotalPrice': lambda x: x.sum()})

# df_rfm= pd.merge(data_x, data_y, on='CustomerID').merge(data_z, on='CustomerID').rename(columns= {'InvoiceDate': 'Recency',
#                                                                                                   'InvoiceNo': 'Frequency',
#                                                                                                   'TotalPrice': 'Monetary'})

# df_rfm.sort_index()

# 3. Customer Segmentation with RFM Scores

Businesses have an ever-lasting urge to understand their customers. The better you understand the customer, the better you serve them, and the higher the financial gain you receive from that customer. Since the dawn of trade, this process of understanding customers for strategic gain has been practiced, and this task is mostly known as [Customer Segmentation](https://clevertap.com/blog/rfm-analysis/).
As the name suggests, Customer Segmentation segments customers according to their precise needs. Some of the common ways of segmenting customers are based on their Recency-Frequency-Monetary values, their demographics like gender, region, country, etc., and business-crafted scores. You will use Recency-Frequency-Monetary values for this case.

In this section, you will create an RFM Segmentation Table where you segment your customers by using the RFM table. For example, you can label the best customer as "Big Spenders" and the lost customer as "Lost Customer".

## Calculate RFM Scoring

The simplest way to create customer segments from an RFM model is by using **Quartiles**. We will assign a score from 1 to 4 to each category (Recency, Frequency, and Monetary) with 4 being the highest/best value. The final RFM score is calculated by combining all RFM values. For Customer Segmentation, you will use the df_rfm data set resulting from the RFM analysis.
<br>
<br>

**Note**: For finer detail, the data can be assigned to more clusters; we will cluster the customers into 6 different levels.
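As an alternative to hand-written scoring functions, the quartile scores described above can be sketched with `pd.qcut`. The `rfm` frame below is made up for illustration, and ranking Frequency before binning is an assumption made here to cope with tied values; it is not part of the original procedure.

```python
import pandas as pd

# Hypothetical RFM table for eight customers
rfm = pd.DataFrame({
    "Recency": [2, 30, 90, 200, 15, 300, 60, 120],
    "Frequency": [12, 4, 2, 1, 8, 1, 3, 1],
    "Monetary": [5000.0, 800.0, 300.0, 50.0, 2500.0, 20.0, 450.0, 90.0],
})

# Low recency is good, so the labels run from 4 down to 1
rfm["R_Quartile"] = pd.qcut(rfm["Recency"], q=4, labels=[4, 3, 2, 1]).astype(int)

# Frequency has many ties; ranking first keeps the bin edges unique
rfm["F_Quartile"] = pd.qcut(
    rfm["Frequency"].rank(method="first"), q=4, labels=[1, 2, 3, 4]
).astype(int)
rfm["M_Quartile"] = pd.qcut(rfm["Monetary"], q=4, labels=[1, 2, 3, 4]).astype(int)

print(rfm)
```

Note that ranking with `method="first"` can place customers with the same Frequency in different quartiles, which differs from a threshold-based scoring function.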

1. Divide the df_rfm into quarters

In [None]:
quantiles = df_rfm.quantile(q=[0.25,0.50,0.75])
quantiles

> We will split the customers into 4 groups based on the quartiles (interquartile range, IQR).

### i. Creating the RFM Segmentation Table


1. Create two functions, one for Recency and one for Frequency and Monetary. For Recency, customers in the first quarter (the most recent buyers) should be scored as 4, which represents the best Recency score. Conversely, for Frequency and Monetary, customers in the last quarter should be scored as 4, representing the highest Frequency and Monetary values.

In [None]:
# Arguments (x = value, p = recency, monetary_value, frequency, d = quartiles dict)

def R_Point(x,p,d):
    if x <= d[p][0.25]:
        return 4
    elif x <= d[p][0.50]:
        return 3
    elif x <= d[p][0.75]: 
        return 2
    else:
        return 1

In [None]:
# Arguments (x = value, p = recency, monetary_value, frequency, d = quartiles dict)

def FM_Point(x,p,d):
    if x <= d[p][0.25]:
        return 1
    elif x <= d[p][0.50]:
        return 2
    elif x <= d[p][0.75]: 
        return 3
    else:
        return 4

2. Score customers from 1 to 4 by applying the functions you have created. Also create separate score column for each value. 

In [None]:
df_rfm['R_Quartile'] = df_rfm['Recency'].apply(R_Point, args=('Recency',quantiles))
df_rfm['F_Quartile'] = df_rfm['Frequency'].apply(FM_Point, args=('Frequency',quantiles))
df_rfm['M_Quartile'] = df_rfm['Monetary'].apply(FM_Point, args=('Monetary',quantiles))

In [None]:
df_rfm

3. Now that you have scored each customer, you'll combine the scores for segmentation.

In [None]:
df_rfm['RFM_Scores'] = df_rfm.R_Quartile.apply(str) + df_rfm.F_Quartile.apply(str) + df_rfm.M_Quartile.apply(str)
df_rfm

4. Define RFM_Points function that tags customers by using RFM_Scores and Create a new variable RFM_Scores_Segment

**Method-1**

* If RFM_Scores == '444', then "1-Best Customers"
* If RFM_Scores == 'X4X', then "2-Loyal Customers"
* If RFM_Scores == 'XX4', then "3-Big Spenders"
* If RFM_Scores == '244', then "4-Almost Lost"
* If RFM_Scores == '144', then "5-Lost Customers"
* If RFM_Scores == '111', then "6-Lost Cheap Customers"
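Because the wildcard patterns `X4X` and `XX4` also match `'244'` and `'144'`, the order in which the rules are checked matters. A vectorized sketch with `np.select`, which returns the first matching choice, makes that precedence explicit. The sample `scores` series and the `"Unassigned"` default are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical RFM score strings
scores = pd.Series(
    ["444", "244", "144", "341", "214", "111", "232"], name="RFM_Scores"
)

# Conditions are evaluated in order and the first match wins,
# so exact codes come before the wildcard patterns.
conditions = [
    scores == "444",
    scores == "244",
    scores == "144",
    scores.str[1] == "4",  # X4X
    scores.str[2] == "4",  # XX4
    scores == "111",
]
choices = [
    "1-Best Customers",
    "4-Almost Lost",
    "5-Lost Customers",
    "2-Loyal Customers",
    "3-Big Spenders",
    "6-Lost Cheap Customers",
]
segments = pd.Series(
    np.select(conditions, choices, default="Unassigned"), index=scores.index
)
print(pd.concat([scores, segments.rename("Segment")], axis=1))
```

Scores that match none of the rules stay `"Unassigned"` here and would need further rules to be labelled.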

In [None]:
label = list(np.zeros(len(df_rfm)))

for i in range(len(df_rfm)):
    # Exact codes are checked before the wildcard patterns so that '244' and '144' stay reachable
    if df_rfm['RFM_Scores'].iloc[i] =='444': label[i] = "1-Best Customers"
    elif df_rfm['RFM_Scores'].iloc[i]=='244' : label[i] = "4-Almost Lost"
    elif df_rfm['RFM_Scores'].iloc[i]=='144' : label[i] = "5-Lost Customers"
    elif df_rfm['RFM_Scores'].iloc[i][1]=='4': label[i] = "2-Loyal Customers"
    elif df_rfm['RFM_Scores'].iloc[i][2]=='4': label[i] = "3-Big Spenders"
    elif df_rfm['RFM_Scores'].iloc[i] =='111' : label[i] = "6-Lost Cheap Customers"
        
df_rfm['RFM_Scores_Segments'] = label
df_rfm

In [None]:
df_rfm.RFM_Scores_Segments.value_counts()

In [None]:
Loyal_Customers=['433']
Big_Spenders=['413','432','423']
Almost_Lost=['333','233','422','331','313','431','323']
Lost_Customers=['222','311','322','223','332','411','133','132','312','131','123','213','113','421','412','231','321','232']
Lost_Cheap_Customers=['122','211','112','121','212','221']

In [None]:
for i in range(len(df_rfm)):
    if df_rfm['RFM_Scores'].iloc[i] in Loyal_Customers : label[i] = "2-Loyal Customers"
    elif df_rfm['RFM_Scores'].iloc[i] in Big_Spenders : label[i] = "3-Big Spenders"
    elif df_rfm['RFM_Scores'].iloc[i] in Almost_Lost : label[i] = "4-Almost Lost"
    elif df_rfm['RFM_Scores'].iloc[i] in Lost_Customers : label[i] = "5-Lost Customers"
    elif df_rfm['RFM_Scores'].iloc[i] in Lost_Cheap_Customers  : label[i] = "6-Lost Cheap Customers"       
        
df_rfm['RFM_Scores_Segments'] = label
df_rfm

In [None]:
df_rfm.RFM_Scores_Segments.value_counts().sort_index()

In [None]:
df_rfm.groupby('RFM_Scores_Segments').agg({'Recency': ['mean','median','min','max'],
                                           'Frequency': ['mean','median','min','max'],
                                           'Monetary': ['mean','median','min','max','count']}).round(1)

**Explanation:**

**Method-2**

In [None]:
df_rfm['RFM_Points'] = df_rfm[['R_Quartile', 'F_Quartile', 'M_Quartile']].sum(axis=1).astype('float')
# df_rfm['RFM_Points'] = df_rfm['R_Quartile'] + df_rfm['F_Quartile'] + df_rfm['M_Quartile']
df_rfm.head()

* If RFM_Points == 12, then "1-Best Customers"
* If RFM_Points == 11, then "2-Loyal Customers"
* If  9 <= RFM_Points <=10, then "3-Big Spenders"
* If  7 <= RFM_Points <=8, then "4-Almost Lost"
* If  5 <= RFM_Points <=6, then "5-Lost Customers"
* If RFM_Points <= 4, then "6-Lost Cheap Customers"
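Point ranges like these can also be mapped to segments with `pd.cut` instead of a loop. In this sketch the cut points mirror the `if`/`elif` chain in the next cell, where scores of 3 and 4 both fall into the lowest segment; the sample `points` series is an assumption, and the bins should be adjusted if different boundaries are preferred.

```python
import pandas as pd

# Every possible RFM_Points value, from best to worst
points = pd.Series([12, 11, 10, 9, 8, 7, 6, 5, 4, 3], name="RFM_Points")

segments = pd.cut(
    points,
    bins=[2, 4, 6, 8, 10, 11, 12],  # right-inclusive: (2, 4], (4, 6], ...
    labels=[
        "6-Lost Cheap Customers",
        "5-Lost Customers",
        "4-Almost Lost",
        "3-Big Spenders",
        "2-Loyal Customers",
        "1-Best Customers",
    ],
)
print(pd.concat([points, segments.rename("Segment")], axis=1))
```

The result is an ordered categorical, so the segments sort from worst to best without relying on the numeric prefix in the labels.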

In [None]:
label = list(np.zeros(len(df_rfm)))

for i in range(len(df_rfm)):
    if df_rfm['RFM_Points'].iloc[i] ==12: label[i] = "1-Best Customers"  
    elif df_rfm['RFM_Points'].iloc[i] ==11: label[i] = "2-Loyal Customers"
    elif df_rfm['RFM_Points'].iloc[i] >= 9 : label[i] = "3-Big Spenders"
    elif df_rfm['RFM_Points'].iloc[i] >= 7 : label[i] = "4-Almost Lost"
    elif df_rfm['RFM_Points'].iloc[i] >= 5 : label[i] = "5-Lost Customers"
    else : label[i] = "6-Lost Cheap Customers"       
        
df_rfm['RFM_Points_Segments'] = label
df_rfm

In [None]:
df_rfm.groupby('RFM_Points').agg({'Recency': ['mean','median','min','max'],
                                 'Frequency': ['mean','median','min','max'],
                                 'Monetary': ['mean','median','min','max','count']}).round(1)

5. Calculate average values for each RFM_Points, and return a size of each segment 

In [None]:
# Name both series explicitly so the column names do not depend on the pandas version
avg_RFM_Points = df_rfm.groupby('RFM_Points_Segments').RFM_Points.mean().round(1).rename('avg_RFM_Points')
size_RFM_Points = df_rfm['RFM_Points_Segments'].value_counts().rename('size_RFM_Points')
summary = pd.concat([avg_RFM_Points, size_RFM_Points], axis=1).sort_values('avg_RFM_Points', ascending=False)
summary

In [None]:
df_rfm.groupby('RFM_Points_Segments').agg({'Recency': ['mean','median','min','max'],
                                           'Frequency': ['mean','median','min','max'],
                                           'Monetary': ['mean','median','min','max'],
                                           'RFM_Points': ['mean','median','min','max','count']}).round(1)

In [None]:
df_rfm.head()

**Comparison of the Two Methods**

In [None]:
pd.crosstab(df_rfm['RFM_Scores_Segments'],df_rfm['RFM_Points_Segments'])

In [None]:
labels = list(df_rfm.RFM_Scores_Segments.value_counts().sort_index().index)
RFM_Scores_Segments = list(df_rfm.RFM_Scores_Segments.value_counts().sort_index().values)
RFM_Points_Segments = list(df_rfm.RFM_Points_Segments.value_counts().sort_index().values)

x = np.arange(len(labels))  # the label locations
width = 0.35  # the width of the bars

fig, ax = plt.subplots(figsize=(10,6))
rects1 = ax.bar(x - width/2, RFM_Scores_Segments, width, label='RFM_Scores_Segments')
rects2 = ax.bar(x + width/2, RFM_Points_Segments, width, label='RFM_Points_Segments')

# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('Total Number of Segment')
ax.set_title('RFM Segmentation by Scores and Points')
ax.set_xticks(x)
ax.set_xticklabels(labels, rotation = 45)
ax.legend()


def autolabel(rects):
    """Attach a text label above each bar in *rects*, displaying its height."""
    for rect in rects:
        height = rect.get_height()
        ax.annotate('{}'.format(height),
                    xy=(rect.get_x() + rect.get_width() / 2, height),
                    xytext=(0, 3),  # 3 points vertical offset
                    textcoords="offset points",
                    ha='center', va='bottom')


autolabel(rects1)
autolabel(rects2)


fig.tight_layout()

plt.show()


**Explanation:**
* The segmentation based on RFM_Points looks more successful: its distribution is closer to a normal distribution, even though it is left-skewed.
* We will continue by visualizing the segmentation based on RFM_Points.

## Plot RFM Segments

1. Create your plot and resize it.

In [None]:
fig, ax = plt.subplots()
summary.size_RFM_Points.plot(ax=ax,color='b',label='Size',kind='bar',width=0.3)
plt.legend(bbox_to_anchor=(0.0, 0.90), loc=2, borderaxespad=0.)

ax2 = ax.twinx()
summary.avg_RFM_Points.plot(ax=ax2,color='r',label='Average', marker='o')
plt.legend(bbox_to_anchor=(0.0, 0.80), loc=2, borderaxespad=0.)
ax.grid()

In [None]:
# # Alternative
# fig, ax = plt.subplots()
# ax=sns.barplot(x=summary.index, y=summary.size_RFM_Points)

# ax2 = ax.twinx()
# ax2=sns.lineplot(x=summary.index, y=summary.avg_RFM_Points)

Using customer segmentation categories found [here](http://www.blastam.com/blog/rfm-analysis-boosts-sales) we can formulate different marketing strategies and approaches for customer engagement for each type of customer.

Note: The author in the article scores 1 as the highest and 4 as the lowest

In [None]:
plt.figure(figsize=(6,6))

# explode = [0.01,0.01,0.1]
plt.pie(df_rfm['RFM_Points_Segments'].value_counts().sort_index(),autopct='%1.1f%%',shadow=True,startangle=0)
plt.legend(df_rfm['RFM_Points_Segments'].value_counts().sort_index().index,bbox_to_anchor=(1.45,0.5),loc='center right')
plt.title('Customer Segmentation Distribution')
plt.axis('off')
plt.show()

In [None]:
plt.figure(figsize=(15,15))
df_rfm=df_rfm.sort_values('RFM_Points_Segments')
sns.pairplot(df_rfm[['Recency', 'Frequency', 'Monetary','RFM_Points_Segments']],hue='RFM_Points_Segments')

**Explanation:**
* There is no clear, meaningful pattern in this graph; the features need scaling.

In [None]:
plt.figure(figsize=(20,20))
# df_rfm=df_rfm.sort_values('RFM_Points_Segments')
sns.pairplot(df_rfm[['R_Quartile', 'F_Quartile', 'M_Quartile','RFM_Points_Segments']],
             hue='RFM_Points_Segments',plot_kws={"s": 200})

**Explanations:**
* "1-Best Customers" >> {444}
* "2-Loyal Customers"
* "3-Big Spenders"
* "4-Almost Lost"
* "5-Lost Customers"
* "6-Lost Cheap Customers" >> {111}

In [None]:
plt.figure(figsize=(15,5))
# df_rfm=df_rfm.sort_values('RFM_Points_Segments')

plt.subplot(1,3,1)
# Pass x and y as keywords; recent seaborn versions no longer accept them positionally
sns.boxplot(x=df_rfm['RFM_Points_Segments'], y=df_rfm['Recency'])
plt.xticks(rotation=90)

plt.subplot(1,3,2)
sns.boxplot(x=df_rfm['RFM_Points_Segments'], y=df_rfm['Frequency'])
plt.xticks(rotation=90)

plt.subplot(1,3,3)
sns.boxplot(x=df_rfm['RFM_Points_Segments'], y=df_rfm['Monetary'])
plt.xticks(rotation=90)
plt.show()

In [None]:
plt.figure(figsize=(15,5))

plt.subplot(1,3,1)
# Pass x and y as keywords; recent seaborn versions no longer accept them positionally
sns.stripplot(x=df_rfm['RFM_Points_Segments'], y=df_rfm['Recency'])
plt.xticks(rotation=90)

plt.subplot(1,3,2)
sns.stripplot(x=df_rfm['RFM_Points_Segments'], y=df_rfm['Frequency'])
plt.xticks(rotation=90)

plt.subplot(1,3,3)
sns.stripplot(x=df_rfm['RFM_Points_Segments'], y=df_rfm['Monetary'])
plt.xticks(rotation=90)
plt.show()

In [None]:
plt.figure(figsize=(15,5))
plt.subplot(1,3,1)
# distplot is deprecated in recent seaborn versions; histplot with kde=True replaces it
sns.histplot(x=df_rfm['Recency'],bins=100,kde=True)
plt.title('Recency Distribution')
plt.xlabel('Recency')
plt.ylabel('Count')

plt.subplot(1,3,2)
sns.histplot(x=df_rfm['Frequency'],color='red',bins=150,kde=True)
plt.title('Frequency Distribution')
plt.xlabel('Frequency')
plt.ylabel('')
plt.xlim(-10, 50)

plt.subplot(1,3,3)
sns.histplot(x=df_rfm['Monetary'],color='green',bins=300,kde=True)
plt.title('Monetary Distribution')
plt.xlabel('Monetary')
plt.ylabel('')
plt.xlim(-5000, 10000)
plt.show()

In [None]:
plt.figure(figsize=(15,5))
plt.subplot(1,3,1)
# fill=True replaces the deprecated shade=True; a KDE plots density, not counts
sns.kdeplot(x='Recency',data=df_rfm,hue="RFM_Points_Segments",fill=True)
plt.title('Recency Distribution')
plt.xlabel('Recency')
plt.ylabel('Density')
plt.ylim(0, 0.008)

plt.subplot(1,3,2)
sns.kdeplot(x='Frequency',data=df_rfm,hue="RFM_Points_Segments",fill=True)
plt.title('Frequency Distribution')
plt.xlabel('Frequency')
plt.ylabel('')
plt.xlim(-1, 10)
plt.ylim(0, 0.5)

plt.subplot(1,3,3)
sns.kdeplot(x='Monetary',data=df_rfm,hue="RFM_Points_Segments",fill=True)
plt.title('Monetary Distribution')
plt.xlabel('Monetary')
plt.ylabel('')
plt.xlim(-5000, 5000)
plt.ylim(0, 0.0005)
plt.show()

In [None]:
plt.figure(figsize=(15,12))

plt.subplot(3,1,1)
plt.title('R_Quartile Distribution')
sns.countplot(x='R_Quartile', hue='RFM_Points_Segments', data=df_rfm)
plt.xlabel('')

plt.subplot(3,1,2)
plt.title('F_Quartile Distribution')
sns.countplot(x='F_Quartile',  hue='RFM_Points_Segments', data=df_rfm)
plt.xlabel('')

plt.subplot(3,1,3)
plt.title('M_Quartile Distribution')
sns.countplot(x='M_Quartile',  hue='RFM_Points_Segments', data=df_rfm)
plt.xlabel('')
plt.show()

2. How many customers do we have in each segment?

In [None]:
df_rfm.RFM_Points_Segments.value_counts().sort_index()

# 3. Applying K-Means Clustering

Now that we have our customers segmented into 6 different categories, we can gain further insight into customer behavior by using predictive models in conjunction with our RFM model.
Possible algorithms include **Logistic Regression**, **K-means Clustering**, and **K-nearest Neighbor**. We will go with [K-Means](https://towardsdatascience.com/understanding-k-means-clustering-in-machine-learning-6a6e67336aa1) since we already have our distinct groups determined. K-means has also been widely used for market segmentation and has the advantage of being simple to implement.

## Data Pre-Processing and Exploring

In [None]:
df_uk.head()

In [None]:
df_uk['InvoicePeriod']=pd.to_datetime(df_uk['InvoiceDate']).apply(lambda x: x.to_period('M'))
df_line=pd.DataFrame(df_uk.groupby('InvoicePeriod').CustomerID.nunique())
df_line['TotalPrice']=df_uk.groupby('InvoicePeriod').TotalPrice.sum()

In [None]:
fig, ax = plt.subplots()
df_line.CustomerID.plot(ax=ax,color='b',label='Unique Customers', marker='o')
plt.legend(bbox_to_anchor=(0.0, 0.90), loc=2, borderaxespad=0.)

ax2 = ax.twinx()
df_line.TotalPrice.plot(ax=ax2,color='r',label='TotalPrice',linestyle='--', marker='o')
plt.legend(bbox_to_anchor=(0.0, 0.80), loc=2, borderaxespad=0.)
plt.gcf().axes[1].yaxis.get_major_formatter().set_scientific(False) # remove scientific notation
ax.grid()

**Explanation:**
* Customer density, and therefore total sales, differs from month to month. So a customer's 'life span' in the system is also a parameter we should take into account.

In [None]:
# Life span = number of days between a customer's first and last purchase dates
life_span=pd.DataFrame(df.groupby('CustomerID').InvoiceDate.apply(lambda x:(max(x).date() - min(x).date()).days))
life_span=life_span.rename(columns= {'InvoiceDate':'life_span'})

In [None]:
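# Work on a copy so that adding a column does not trigger SettingWithCopyWarning
df_kmeans=df_rfm[['Recency', 'Frequency', 'Monetary']].copy()
df_kmeans['life_span']=life_span['life_span']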
df_kmeans

Create Heatmap and evaluate the results 

In [None]:
plt.figure(figsize=(10,7))
sns.heatmap(df_kmeans.corr(),annot=True, cmap="coolwarm");

### ii. Visualize Feature Distributions

To get a better understanding of the dataset, you can construct a scatter matrix of each of the three features in the RFM data.

### iii. Data Normalization

1. You can use the logarithm method to normalize the values in a column.
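As a minimal sketch of why a log transform helps, the example below applies `np.log1p` (equivalent to `np.log(x + 1)`) to synthetic right-skewed values and compares the skewness before and after. The sample is generated from a log-normal distribution purely for illustration; it is an assumption and not the project data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical right-skewed "Monetary"-like values (synthetic, not the real data)
monetary = pd.Series(rng.lognormal(mean=6, sigma=1.2, size=5000))

# log1p(x) == log(x + 1); the +1 keeps zero values defined
log_monetary = np.log1p(monetary)

skew_before = monetary.skew()
skew_after = log_monetary.skew()
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

A skewness close to 0 after the transform suggests the long right tail has been compressed, which is what distance-based methods such as K-Means benefit from.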

In [None]:
def col_plot(df,col_name,iqr=1.5):
    plt.figure(figsize=(15,5))
    
    plt.subplot(141) # work in the 1st column of a 1-row x 4-column grid of axes
    plt.hist(df[col_name], bins = 20)
    f=lambda x:(np.sqrt(x) if x>=0 else -np.sqrt(-x)) # signed square root
    
    # the two red lines mark the three-sigma range (expected to contain about 99.7% of the data)
    plt.axvline(x=df[col_name].mean() + 3*df[col_name].std(),color='red')
    plt.axvline(x=df[col_name].mean() - 3*df[col_name].std(),color='red')
    plt.xlabel("Histogram ±3z")
    plt.ylabel(col_name)

    plt.subplot(142)
    plt.boxplot(df[col_name], whis = iqr)
    plt.xlabel(f"IQR={iqr}")

    plt.subplot(143)
    plt.boxplot(df[col_name].apply(f), whis = iqr)
    plt.xlabel(f"SQUARE ROOT - IQR={iqr}")

    plt.subplot(144)
    plt.boxplot(np.log(df[col_name]+1), whis = iqr)
    plt.xlabel(f"LOGARITHMIC - IQR={iqr}")
    plt.tight_layout()
    plt.show()

In [None]:
for i in df_kmeans:
    col_plot(df_kmeans,i,2)

In [None]:
features=[
#           'Recency', 
          'Frequency', 
          'Monetary', 
#           'life_span',
         ]

In [None]:
df_log=df_kmeans.copy()
for i in features:
    df_log[i]=np.log(df_log[i]+1)

In [None]:
def detect_outliers(df:pd.DataFrame, col_name:str, p=1.5) -> tuple:
    ''' 
    Detect outliers using fences placed p times the IQR (default 1.5) beyond the quartiles.
    Returns the lower limit, the upper limit and the number of outliers, respectively.
    '''
    first_quartile = np.percentile(np.array(df[col_name].tolist()), 25)
    third_quartile = np.percentile(np.array(df[col_name].tolist()), 75)
    IQR = third_quartile - first_quartile
                      
    upper_limit = third_quartile+(p*IQR)
    lower_limit = first_quartile-(p*IQR)
    outlier_count = 0
                      
    for value in df[col_name].tolist():
        if (value < lower_limit) | (value > upper_limit):
            outlier_count +=1
    return lower_limit, upper_limit, outlier_count
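To make the IQR rule used by `detect_outliers` concrete, here is a small worked example on a hand-made array. The values and the multiplier `p = 1.5` are illustrative assumptions; the arithmetic mirrors the quartile and fence calculation above.

```python
import numpy as np

# Hypothetical toy values, not taken from the dataset
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
p = 1.5  # fence multiplier

q1, q3 = np.percentile(values, [25, 75])   # 3.25 and 7.75
iqr_range = q3 - q1                        # 4.5

lower_limit = q1 - p * iqr_range           # 3.25 - 6.75 = -3.5
upper_limit = q3 + p * iqr_range           # 7.75 + 6.75 = 14.5

# Anything outside the fences is counted as an outlier
outliers = values[(values < lower_limit) | (values > upper_limit)]
print(lower_limit, upper_limit, outliers)
```

Only the value 100 falls outside the fences here; raising `p` widens the fences and flags fewer points.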

**Outlier Detection**

In [None]:
iqr=2
print(f"Number of Outliers for {iqr}*IQR after Log Transformation\n")

total=0
for col in features:
    # use the same IQR multiplier for both the check and the count
    outliers=detect_outliers(df_log, col, iqr)[2]
    if outliers > 0:
        total+=outliers
        print("{} outliers in '{}'".format(outliers,col))
print("\n{} OUTLIERS TOTALLY".format(total))

**Drop Outliers**

In [None]:
df_log.shape

In [None]:
iqr=2
for i in ['Frequency','Monetary']:
    lower,upper,_=detect_outliers(df_log,i,iqr)
    # keep values on the limits, since detect_outliers only counts values strictly outside them
    df_log=df_log[(df_log[i]>=lower)&(df_log[i]<=upper)]

In [None]:
df_log.shape

**Explanation**
* If you have binary values, discrete attributes or categorical attributes, stay away from k-means. K-means needs to compute means, and the mean value is not meaningful on this kind of data.

**Scaling:**
* If you have attributes with a well-defined, shared meaning (say, latitude and longitude), then you should not scale your data, because this will cause distortion.

* If you have mixed numerical data, where each attribute is something entirely different (say, shoe size and weight) and has different units attached (lb, tons, m, kg ...), then these values aren't really comparable anyway; scaling them is a best practice to give them equal weight.
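The effect of standardization can be sketched by hand on a toy frame: subtract each column's mean and divide by its standard deviation. The column values below are made up for illustration. The population standard deviation (`ddof=0`) is used because that is what scikit-learn's `StandardScaler` computes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with very different units per column
toy = pd.DataFrame({
    "Frequency": [1, 2, 3, 4, 50],
    "Monetary": [100.0, 2500.0, 300.0, 9000.0, 12000.0],
})

# z-score: (x - mean) / std, column by column
scaled = (toy - toy.mean()) / toy.std(ddof=0)

print(scaled.round(2))
print("means:", scaled.mean().round(6).tolist())
print("stds: ", scaled.std(ddof=0).round(6).tolist())
```

After scaling, both columns have mean 0 and standard deviation 1, so neither dominates the Euclidean distances that K-Means relies on.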

In [None]:
from sklearn.preprocessing import StandardScaler

df_scaled = pd.DataFrame(StandardScaler().fit_transform(df_log),
                         columns=df_kmeans.columns,index=df_log.index)
df_scaled.head()

2. Plot normalized data with scatter matrix or pairplot. Also evaluate results.

In [None]:
plt.figure(figsize=(15,15))
sns.pairplot(df_scaled[['Recency', 'Frequency', 'Monetary','life_span']]);

## K-Means Implementation

For k-means, you have to set k to the number of clusters you want, but figuring out how many clusters is not obvious from the beginning. We will try different cluster numbers and check their [silhouette coefficient](http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html). The silhouette coefficient for a data point measures how similar it is to its assigned cluster from -1 (dissimilar) to 1 (similar). 
<br>
<br>
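**Note**: K-means is sensitive to initialization, because the starting centroids are critical to the quality of the optimum found. scikit-learn's `KMeans` uses the k-means++ initialization by default, which mitigates this. To choose the number of clusters, we will use the "Elbow Method" together with the silhouette coefficient.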

### i. Define the Optimal Number of Clusters

In [None]:
!pip install pyclustertend

In [None]:
from sklearn.cluster import KMeans
from pyclustertend import hopkins
from sklearn.metrics import silhouette_score

[Hopkins Test](https://en.wikipedia.org/wiki/Hopkins_statistic)
* It is based on a Null Hypothesis (Ho) and an Alternative Hypothesis (Ha).
* Null Hypothesis (Ho): The data is uniformly distributed; there is no meaningful clustering.
* Alternative Hypothesis (Ha): The data is not uniformly distributed; that is, clusters exist.
* It returns a score in the range [0, 1]. As the score approaches 0, the data is less uniform, i.e. more inclined to clustering.
* As the score approaches 1, the structure is uniform. The data is considered clusterable as long as the score does not exceed 0.5; in practice 0.3 is taken as the threshold.
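A simplified sketch of the idea, using the convention where a low score indicates clustering tendency, is shown below. The function `hopkins_sketch` is a hypothetical helper written for illustration; it is not the `pyclustertend` implementation and its details (bounding-box sampling, plain distance sums) are assumptions.

```python
import numpy as np

def hopkins_sketch(X, m, rng):
    """Simplified Hopkins-style score: low values suggest clustered data."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape

    # w: distance from m sampled real points to their nearest *other* real point
    idx = rng.choice(n, size=m, replace=False)
    dists = np.linalg.norm(X[idx, None, :] - X[None, :, :], axis=2)
    dists[np.arange(m), idx] = np.inf  # ignore the distance of a point to itself
    w = dists.min(axis=1)

    # u: distance from m uniform random points (in the data's bounding box)
    # to their nearest real point
    U = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u = np.linalg.norm(U[:, None, :] - X[None, :, :], axis=2).min(axis=1)

    return w.sum() / (u.sum() + w.sum())

rng = np.random.default_rng(0)
# Synthetic data: two tight blobs versus uniformly random points
clustered = np.vstack([rng.normal(0, 0.05, (100, 2)), rng.normal(5, 0.05, (100, 2))])
uniform = rng.uniform(0, 1, (200, 2))

h_clustered = hopkins_sketch(clustered, 50, rng)
h_uniform = hopkins_sketch(uniform, 50, rng)
print(f"clustered: {h_clustered:.3f}, uniform: {h_uniform:.3f}")
```

With this formula, strongly clustered data scores close to 0, while uniformly random data lands near 0.5 because real and random points are then about equally far from their nearest neighbors.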

In [None]:
hopkins(df_scaled,df_scaled.shape[0])

**Evaluations:**
* According to the Hopkins score, our dataframe has a strong tendency to cluster.

[The Elbow Method](https://en.wikipedia.org/wiki/Elbow_method_(clustering))
* A method that derives a solution from the relationship between the explained variance (sum of squared distances) and the number of clusters (k).
* That is, the explained variance is plotted as a function of the number of clusters; the point where the elbow bends in the plot is the optimal value of k.
* We look for the optimum point that gives the minimum error with the minimum k, i.e. the point where the sharp drop finally ends.
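The quantity plotted in an elbow chart, the sum of squared distances of points to their cluster centroid (what `KMeans` exposes as `inertia_`), can be computed by hand on a toy example. The points and the hand-assigned labels below are illustrative assumptions; no model is fitted here.

```python
import numpy as np

def inertia(X, labels):
    """Sum of squared distances of each point to the mean of its cluster."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for c in np.unique(labels):
        pts = X[labels == c]
        total += ((pts - pts.mean(axis=0)) ** 2).sum()
    return total

# Two obvious groups: around x=0 and around x=10
X_toy = [[0, 0], [0, 2], [10, 0], [10, 2]]

ssd_k1 = inertia(X_toy, [0, 0, 0, 0])  # one cluster, centroid (5, 1)
ssd_k2 = inertia(X_toy, [0, 0, 1, 1])  # two clusters, centroids (0, 1) and (10, 1)
ssd_k4 = inertia(X_toy, [0, 1, 2, 3])  # every point is its own cluster
print(ssd_k1, ssd_k2, ssd_k4)
```

The drop from k=1 to k=2 is large and the drop from k=2 to k=4 is small, so the "elbow" sits at k=2 for this toy data.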

In [None]:
ssd = []
K = range(2,10)
for k in K:
    kmeans = KMeans(n_clusters = k).fit((df_scaled))
    ssd.append(kmeans.inertia_)

In [None]:
plt.figure(figsize=(8,4))
plt.plot(K, ssd, "bo-")
plt.xlabel("Different k values")
plt.ylabel("inertia-error")
plt.title("Elbow Method")
plt.grid()
plt.show()

In [None]:
from yellowbrick.cluster import KElbowVisualizer
kmeans = KMeans()
visu = KElbowVisualizer(kmeans, k = (2,10))
visu.fit(df_scaled)
visu.show();

[Silhouette Coefficient](http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html)
* It is based on two distances for each data point. The first is the mean distance to the other points in the same cluster.
* The second is the mean distance to all points of the nearest neighboring cluster.
* It is expressed by the value s: if s is close to 1, the point is well clustered; if it is close to -1, it is poorly clustered.
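The two distances can be turned into a score with s = (b - a) / max(a, b), where a is the mean intra-cluster distance and b is the mean distance to the nearest other cluster. Below is a minimal NumPy sketch on toy points. `silhouette_sketch` is a hypothetical helper for illustration and assumes every cluster has at least two points.

```python
import numpy as np

def silhouette_sketch(X, labels):
    """Mean silhouette coefficient; assumes each cluster has >= 2 points."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                       # exclude the point itself
        a = D[i, same].mean()                 # mean distance within own cluster
        b = min(D[i, labels == other].mean()  # mean distance to nearest other cluster
                for other in np.unique(labels) if other != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Hypothetical toy data: two well-separated pairs of points
X_toy = [[0, 0], [0, 1], [10, 0], [10, 1]]
score = silhouette_sketch(X_toy, [0, 0, 1, 1])
print(round(score, 4))
```

For the first point, a = 1 and b is about 10.02, so s is about 0.90; by symmetry every point gets the same value, which indicates well-separated clusters.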

In [None]:
ssd =[]

K = range(2,10)

for k in K:
    model = KMeans(n_clusters=k)
    model.fit(df_scaled)
    ssd.append(model.inertia_)
    print(f'Silhouette Score for {k} clusters: {silhouette_score(df_scaled, model.labels_)}')

### ii. Model Fitting

Fit the K-Means Algorithm with the optimal number of clusters you decided and save the model to disk.
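Saving the fitted model to disk could look like the sketch below, which uses the standard-library `pickle` module. The demo data, the file name `kmeans_model.pkl`, and the temporary directory are assumptions for illustration; in the project you would pickle the model fitted on `df_scaled` instead.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic stand-in for the scaled RFM features
X_demo = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(4, 0.3, (50, 2))])

model = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_demo)

# Hypothetical location; pick any path that suits your project
model_path = os.path.join(tempfile.mkdtemp(), "kmeans_model.pkl")
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Later (or in another session): load the model and reuse it
with open(model_path, "rb") as f:
    restored = pickle.load(f)

print(np.array_equal(restored.predict(X_demo), model.predict(X_demo)))
```

Only load pickle files from sources you trust, and keep the scikit-learn version consistent between saving and loading.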

In [None]:
kmeans = KMeans(n_clusters = 4).fit(df_scaled)
labels = kmeans.labels_
df_scaled['Kmeans_Label_ID']=labels

In [None]:
keys=df_scaled.groupby('Kmeans_Label_ID').Frequency.mean().sort_values().index
values=['Bronze','Silver','Gold','Diamond']
dictionary = dict(zip(keys, values))

df_scaled['Kmeans_Label']=df_scaled.Kmeans_Label_ID.apply(lambda x:dictionary[x] )
df_scaled

### iii. Visualize the Clusters

1. Create a scatter plot and select cluster centers

In [None]:
df_scaled.Kmeans_Label.value_counts().sort_index()

In [None]:
df_scaled.Kmeans_Label.value_counts().plot.bar(width=0.3)
plt.title('Distribution of Clustering Based on K-Means');

In [None]:
f, (ax1, ax2) = plt.subplots(1, 2, sharey=True,figsize=(14,6)) # with sharey=True the subplots share their y-axis labels
ax1.set_title('Recency-Frequency')
ax1.set_xlabel('Recency')
ax1.set_ylabel('Frequency')
ax1.scatter(df_scaled.iloc[:,0],df_scaled.iloc[:,1],c=kmeans.labels_,cmap="rainbow")
ax1.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300,alpha=0.9, label = 'Centroids')

ax2.set_title("Frequency-Monetary")
ax2.set_xlabel('Frequency')
ax2.set_ylabel('Monetary')
ax2.scatter(df_scaled.iloc[:,1],df_scaled.iloc[:,2],c=kmeans.labels_,cmap="rainbow")
ax2.scatter(kmeans.cluster_centers_[:, 1], kmeans.cluster_centers_[:, 2], s=300,alpha=0.9, label = 'Centroids');


In [None]:
plt.figure(figsize=(15,15))
sns.pairplot(df_scaled[['Recency', 'Frequency', 'Monetary','life_span','Kmeans_Label']],hue='Kmeans_Label');

In [None]:
plt.figure(figsize=(20,10))

plt.subplot(2,2,1)
# Pass x and y as keywords; recent seaborn versions no longer accept them positionally
sns.stripplot(x=df_scaled['Kmeans_Label'], y=df_scaled['Recency'])

plt.subplot(2,2,2)
sns.stripplot(x=df_scaled['Kmeans_Label'], y=df_scaled['Frequency'])

plt.subplot(2,2,3)
sns.stripplot(x=df_scaled['Kmeans_Label'], y=df_scaled['Monetary'])

plt.subplot(2,2,4)
sns.stripplot(x=df_scaled['Kmeans_Label'], y=df_scaled['life_span'])
plt.show()

2. Visualize Cluster Id vs Recency, Cluster Id vs Frequency and Cluster Id vs Monetary using Box plot. Also evaluate the results. 

In [None]:
plt.figure(figsize=(15,6))

plt.subplot(1,3,1)
# Pass x and y as keywords; recent seaborn versions no longer accept them positionally
sns.boxplot(x=df_scaled['Kmeans_Label'], y=df_scaled['Recency'])

plt.subplot(1,3,2)
sns.boxplot(x=df_scaled['Kmeans_Label'], y=df_scaled['Frequency'])

plt.subplot(1,3,3)
sns.boxplot(x=df_scaled['Kmeans_Label'], y=df_scaled['Monetary'])
plt.show()

### iv. Assign the Label

**Conclusion**


### v. Conclusion

Discuss your final results. Compare your own labels from the Customer Segmentation with the labels found by K-Means.

How we continue this analysis depends on how the business plans to use the results and on the level of granularity the business stakeholders want to see in the clusters. We can also ask which range of customer behavior, from high-value to low-value customers, the stakeholders are interested in exploring. Based on those answers, various clustering methods can be applied to the RFM variables or directly to the transaction data set.

In [None]:
RFM_Points_Segments=['1-Best Customers','2-Loyal Customers', '3-Big Spenders', '4-Almost Lost', '5-Lost Customers', '6-Lost Cheap Customers']
Kmeans_Label=['Diamond','Gold', 'Silver', 'Bronze']
pd.crosstab(df_scaled['Kmeans_Label'],df_rfm['RFM_Points_Segments'])[RFM_Points_Segments].loc[Kmeans_Label]

**Annotation:**

Limitations of K-means clustering:

1. There is no assurance that it will lead to the global best solution.
2. It cannot handle clusters with non-circular shapes, and it cannot express the probability that a point belongs to more than one cluster.

These disadvantages of K-means show that for many datasets (especially low-dimensional datasets), it may not perform as well as you might hope.

# 5. Create Cohort & Conduct Cohort Analysis
[Cohort Analysis](https://medium.com/swlh/cohort-analysis-using-python-and-pandas-d2a60f4d0a4d) is especially useful for analyzing user growth patterns for products. In terms of a product, a cohort can be a group of people with the same sign-up date, the same usage start month/date, or the same traffic source.
Cohort analysis is an analytics method by which these groups can be tracked over time for finding key insights. This analysis can further be used to do customer segmentation and track metrics like retention, churn, and lifetime value.

For e-commerce organizations, cohort analysis is a unique opportunity to find out which clients are the most valuable to their business. By performing cohort analysis, you can get answers to the following questions:

- How effective was a marketing campaign held in a particular time period?
- Did the strategy employed to improve the conversion rates of customers work?
- Should I focus more on retention rather than acquiring new customers?
- Are my customer nurturing strategies effective?
- Which marketing channels bring me the best results?
- Is there a seasonality pattern in customer behavior?
- What do the various performance measures/metrics for my organization look like?

Since we will be performing Cohort Analysis based on the transaction records of customers, the columns we will mainly be dealing with are:
- InvoiceDate
- CustomerID
- UnitPrice
- Quantity

The following steps will be performed to generate the Cohort Chart of Retention Rate:
- Month Extraction from the InvoiceDate column
- Assigning Cohort to Each Transaction
- Assigning Cohort Index to each transaction
- Calculating the number of unique customers in each group of (CohortDate, Index)
- Creating Cohort Table for Retention Rate
- Creating the Cohort Chart using the Cohort Table
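The steps listed above can be sketched end to end on a tiny hand-made transaction table. The toy data and the intermediate names (`toy`, `counts`, `retention_toy`) are assumptions for illustration; the column names follow the ones used in this project.

```python
import pandas as pd

# Hypothetical toy transactions: 3 customers over 3 months
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 3],
    "InvoiceDate": pd.to_datetime(
        ["2011-01-05", "2011-02-10", "2011-01-20", "2011-03-02", "2011-02-15"]),
})

# 1) Month extraction: snap every invoice date to the 1st of its month
toy["InvoiceMonth"] = toy["InvoiceDate"].dt.to_period("M").dt.to_timestamp()

# 2) Cohort = month of each customer's first purchase
toy["cohort_date"] = toy.groupby("CustomerID")["InvoiceMonth"].transform("min")

# 3) Cohort index = months since the cohort month, starting at 1
toy["CohortIndex"] = (
    (toy["InvoiceMonth"].dt.year - toy["cohort_date"].dt.year) * 12
    + (toy["InvoiceMonth"].dt.month - toy["cohort_date"].dt.month)
    + 1
)

# 4) Unique customers per (cohort, index), pivoted into a cohort table
counts = toy.groupby(["cohort_date", "CohortIndex"])["CustomerID"].nunique().unstack()

# 5) Retention = share of the cohort's initial customers active in each period
retention_toy = counts.divide(counts.iloc[:, 0], axis=0)
print(retention_toy)
```

In this toy table the January cohort starts with two customers and keeps one of them in each of the following two months, so its retention row reads 1.0, 0.5, 0.5.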

The Detailed information about each step is given below:

## Feature Engineering

### i. Extract the Month of the Purchase
First we will create a function, which takes any date and returns the formatted date with day value as 1st of the same month and Year.

In [None]:
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
df.head(2)

In [None]:
def extract_ym(df):
    # Extract year and month from the column. strftime converts datetime to string
    return df.apply(lambda x: x.strftime('%Y-%m')).astype('datetime64[ns]')

Now we will use the function created above to convert all the invoice dates into respective month date format.

In [None]:
extract_ym(df.InvoiceDate)

In [None]:
# Alternative
f=lambda x:pd.to_datetime(x).dt.to_period('M')
f(df['InvoiceDate'])

### ii. Calculating time offset in Months i.e. Cohort Index:
Calculating the time offset for each transaction allows us to report the metrics for each cohort in a comparable fashion.
First, you will create 4 variables that capture the integer values of the year and month for the Invoice and Cohort Dates, using the get_date_int() function that you'll create below.

In [None]:
def get_date_int(df, column):
    years = df[column].dt.year
    months = df[column].dt.month
    return years, months

You will use this function to extract the integer values for the Invoice and Cohort Dates as two separate series (years and months) for each of the two columns.

In [None]:
# Alternative
# df['InvoiceMonth']=extract_ym(df.InvoiceDate)

In [None]:
# Apply function to invoice date to invoice month column
df['InvoiceMonth'] = df['InvoiceDate'].apply(lambda x: dt(x.year, x.month, 1) )
df['cohort_date'] = df.groupby('CustomerID')['InvoiceMonth'].transform('min')

In [None]:
cohort_year, cohort_month = get_date_int(df, 'cohort_date')
invoice_year, invoice_month = get_date_int(df, 'InvoiceDate')

Use the variables created above to calculate the difference in years and months; these will be combined into the CohortIndex column.
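The arithmetic behind the cohort index can be checked on single dates. The helper `cohort_index` below is a hypothetical illustration of the formula `years_diff * 12 + months_diff + 1`, applied to made-up dates.

```python
from datetime import date

def cohort_index(invoice, cohort):
    """Months elapsed since the cohort month, counting the cohort month as 1."""
    years_diff = invoice.year - cohort.year
    months_diff = invoice.month - cohort.month
    return years_diff * 12 + months_diff + 1

# Same month as the first purchase: 0 * 12 + 0 + 1 = 1
first = cohort_index(date(2010, 12, 20), date(2010, 12, 1))

# March 2011 for a December 2010 cohort: 1 * 12 + (3 - 12) + 1 = 4
later = cohort_index(date(2011, 3, 15), date(2010, 12, 1))
print(first, later)
```

The negative `months_diff` across a year boundary is compensated by `years_diff * 12`, and the `+ 1` makes the first purchase month index 1 rather than 0.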

In [None]:
years_diff = invoice_year - cohort_year
months_diff = invoice_month - cohort_month

## Create 1st Cohort: User number & Retention Rate

### i. Pivot Cohort and Cohort Retention

In [None]:
df['CohortIndex'] = years_diff * 12 + months_diff + 1
df.head()

In [None]:
# Count monthly active customers from each cohort
grouping_count = df.groupby(['cohort_date', 'CohortIndex'])
cohort_data = grouping_count['CustomerID'].nunique()
cohort_data = cohort_data.reset_index()
cohort_counts = cohort_data.pivot(index='cohort_date',
                                  columns='CohortIndex',
                                  values='CustomerID')
cohort_counts

In [None]:
# --Calculate Retention Rate--
cohort_sizes = cohort_counts.iloc[:,0]
retention = cohort_counts.divide(cohort_sizes, axis=0).apply(lambda x: round(x,2))
retention.index = retention.index.strftime('%m-%Y')

In [None]:
retention.round(3) * 100 #to show the number as percentage 

### ii. Visualize analysis of cohort 1 using seaborn and matplotlib modules

In [None]:
retention.T.columns

In [None]:
plt.figure(figsize=(15, 6))
plt.title('Retention rates')
sns.heatmap(data = retention, annot = True, fmt = '.0%',vmin = 0.0,vmax = 0.5,cmap = 'BuGn')
plt.show()

**Insights:**
* We can see from the above chart that more users tend to purchase as time goes on.
* The 12-2010 cohort is the strongest. The 02-2011 and 04-2011 cohorts are the weakest.

In [None]:
plt.figure(figsize=(15,6))
# retention[list(range(1,14))].plot(figsize=(10,5))
retention.loc[['12-2010', '02-2011', '04-2011','01-2011'],:].T.plot(figsize=(10,5))
# retention.iloc[:,:].T.plot(figsize=(10,5))
plt.title('Cohorts: Retention')
plt.xlim(1,12)
plt.ylim(0,0.6)
plt.xlabel('Cohort Period')
plt.ylabel('% of Cohort Purchasing')

## Create the 2nd Cohort: Average Quantity Sold

### i. Pivot Cohort and Cohort Retention

In [None]:
# --Calculate Average Quantity--
grouping_qty = df.groupby(['cohort_date', 'CohortIndex'])
cohort_data_qty = grouping_qty['Quantity'].mean()
cohort_data_qty = cohort_data_qty.reset_index()
average_quantity = cohort_data_qty.pivot(index='cohort_date',
                                     columns='CohortIndex',
                                     values='Quantity')
average_quantity.index = average_quantity.index.strftime('%m-%Y')

### ii. Visualize analysis of cohort 2 using seaborn and matplotlib modules

In [None]:
# Plot average quantity
plt.figure(figsize=(15, 6))
plt.title('Average Quantity')
sns.heatmap(data = average_quantity, annot=True, cmap='Blues')
plt.show()

## Create the 3rd Cohort: Average Sales


### i. Pivot Cohort and Cohort Retention

In [None]:
# --Calculate Average Price--
grouping_price = df.groupby(['cohort_date', 'CohortIndex'])
cohort_data_price = grouping_price['TotalPrice'].mean()
cohort_data_price = cohort_data_price.reset_index()
average_price = cohort_data_price.pivot(index='cohort_date',
                                     columns='CohortIndex',
                                     values='TotalPrice')
average_price.index = average_price.index.strftime('%m-%Y')

### ii. Visualize analysis of cohort 3 using seaborn and matplotlib modules

In [None]:
# Plot average sales
plt.figure(figsize=(15, 6))
plt.title('Average Price')
sns.heatmap(data = average_price, annot=True, cmap='Blues')
plt.show()

For e-commerce organisations, cohort analysis is a unique opportunity to find out which clients are the most valuable to their business. By performing cohort analysis, you can get answers to the following questions:

- How effective was a marketing campaign held in a particular time period?
- Did the strategy employed to improve the conversion rates of customers work?
- Should I focus more on retention rather than acquiring new customers?
- Are my customer nurturing strategies effective?
- Which marketing channels bring me the best results?
- Is there a seasonality pattern in customer behaviour?