# EDA, Customer Segmentation using RFM and KMeans

Customer segmentation is the process of dividing customers into groups based on common characteristics so companies can market to each group effectively and appropriately.

This kernel performs EDA and customer segmentation on the Online Retail II data set, which contains all transactions of a UK-based and registered non-store online retailer between 01/12/2009 and 09/12/2011. The company mainly sells unique all-occasion giftware. Many of its customers are wholesalers.

**Types of Segmentation factors:**

* Demographic (Age, Gender, Income, Location, Education, Ethnicity)
* Psychographic (Interests, Lifestyles, Priorities, Motivation, Influence)
* Behavioural (Purchasing habits, Spending habits, User status, Brand interactions)
* Geographic (zip code, city, country, climate)

**The main purposes of customer segmentation are testing pricing options, focusing on profitable customers, and communicating targeted marketing messages.**

**Methodology**

In this dataset we only have features that demonstrate Purchasing habits and Spending habits (Behavioural) factors.
We perform RFM Modelling and KMeans Clustering on this dataset to segment customers.

### Importing required libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn import preprocessing
import warnings
warnings.filterwarnings('ignore')
import math

pd.set_option("display.max_columns", 50)

**Loading Dataset**

In [None]:
retail_df=pd.read_csv('../input/online-retail-ii-uci/online_retail_II.csv')
retail_df.head()

In [None]:
# shape of dataset
retail_df.shape

### Data Wrangling

In [None]:
# checking the datatypes and null values in dataset
retail_df.info()

In [None]:
# Checking for Null values
retail_df.isnull().sum()

### Observations
* The datatype of InvoiceDate is object; it needs to be converted to datetime.
* There are null values in Customer ID and Description.

**Customer ID is our Identification feature and Description has Product description.**

**We cannot do RFM analysis and KMeans Clustering without Customer ID values.**

**Hence, we drop the rows with a missing Customer ID**

In [None]:
retail_df.dropna(subset=['Customer ID'],inplace=True)

In [None]:
retail_df.isnull().sum()

**DataSet Summary**

In [None]:
retail_df.describe()

We observe that the Quantity and Price columns have negative values. Let's explore these entries.

In [None]:
retail_df[retail_df['Quantity']<0]

**The invoice numbers of these entries start with C. As per the description of the data, these are cancellations, hence we drop these entries**

In [None]:
# changing the datatype to str
retail_df['Invoice'] = retail_df['Invoice'].astype('str')

In [None]:
retail_df=retail_df[~retail_df['Invoice'].str.startswith('C')]
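
The filter above relies on the convention from the data description that cancelled invoices carry a `C` prefix. As a minimal sketch on made-up invoice numbers (the `toy` frame and its values are hypothetical, not taken from the data set), a prefix test flags only invoices that begin with `C`, so it cannot match a `C` that appears elsewhere in an invoice number:

```python
import pandas as pd

# Hypothetical invoice numbers; 'C489449' mimics a cancellation.
toy = pd.DataFrame({
    "Invoice": ["489434", "C489449", "489435"],
    "Quantity": [12, -12, 6],
})

# Prefix test: True only where the invoice number begins with 'C'.
is_cancelled = toy["Invoice"].astype(str).str.startswith("C")

# Keep everything that is not a cancellation.
kept = toy[~is_cancelled]
print(kept)
```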

In [None]:
retail_df[retail_df['Price']<=0].sort_values('Price')

**We observe that these are not product purchase transactions but store transactions related to some debt, hence we also drop these entries**

In [None]:
# taking price values greater than 0.
retail_df=retail_df[retail_df['Price']>0]
retail_df.head()

In [None]:
retail_df.shape

**Our data has been reduced; we now have 1041671 data points**

In [None]:
retail_df.describe()

### Feature Engineering

In [None]:
# Converting InvoiceDate to datetime. InvoiceDate is expected in the format YYYY-MM-DD HH:MM:SS, matching the format string below.
retail_df["InvoiceDate"] = pd.to_datetime(retail_df["InvoiceDate"], format="%Y-%m-%d %H:%M:%S")

In [None]:
# extracting date parts with the vectorised .dt accessor
retail_df["year"] = retail_df["InvoiceDate"].dt.year
retail_df["month_num"] = retail_df["InvoiceDate"].dt.month
retail_df["day_num"] = retail_df["InvoiceDate"].dt.day
retail_df["hour"] = retail_df["InvoiceDate"].dt.hour
retail_df["minute"] = retail_df["InvoiceDate"].dt.minute
retail_df["second"] = retail_df["InvoiceDate"].dt.second

In [None]:
# extracting month from the Invoice date
retail_df['Month']=retail_df['InvoiceDate'].dt.month_name()

In [None]:
# extracting day from the Invoice date
retail_df['Day']=retail_df['InvoiceDate'].dt.day_name()
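
To make the date handling concrete, here is a small sketch on two made-up timestamps (the values are assumptions chosen to match the format string used above). It parses the strings and reads the date parts through the `.dt` accessor:

```python
import pandas as pd

# Hypothetical timestamps laid out as YYYY-MM-DD HH:MM:SS.
stamps = pd.Series(["2009-12-01 07:45:00", "2011-12-09 12:50:00"])
parsed = pd.to_datetime(stamps, format="%Y-%m-%d %H:%M:%S")

# Numeric parts and human-readable names from the same parsed column.
parts = pd.DataFrame({
    "year": parsed.dt.year,
    "month_num": parsed.dt.month,
    "hour": parsed.dt.hour,
    "Month": parsed.dt.month_name(),
    "Day": parsed.dt.day_name(),
})
print(parts)
```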

**Making total amount column by multiplying quantity with price**

In [None]:
retail_df['TotalAmount']=retail_df['Quantity']*retail_df['Price']

In [None]:
retail_df.head()

## Exploratory Data Analysis

In [None]:
retail_df.columns

**TOP 10 HIGHEST SELLING PRODUCTS SOLD BY THE STORE**

In [None]:
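# summing only Quantity; summing every column fails on text and datetime columns in recent pandas
df1=retail_df.groupby('Description')[['Quantity']].sum()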
df1.sort_values(['Quantity'], ascending=False,inplace=True)
df1.reset_index(inplace=True)
df1.rename(columns={'Description':'Product_name'},inplace=True)
df2=df1[['Product_name','Quantity']][:10]
df2

In [None]:
# top 10 products by quantity
plt.figure(figsize=(12,6))
sns.barplot(x=df2['Quantity'],y=df2['Product_name'])
plt.title('Top 10 products by quantity')

### Observations
* WORLD WAR 2 GLIDERS ASSTD DESIGNS was the highest selling product
* WHITE HANGING HEART T-LIGHT HOLDER was the second highest selling product

**10 LEAST SELLING PRODUCTS OF THE STORE**

In [None]:
df3=df1[['Product_name','Quantity']].tail(10)
df3

**These are the least selling products of the store with only 1 unit sold of each product**

**TOP 10 STOCKCODES BY QUANTITY**

In [None]:
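# summing only Quantity; summing every column fails on text and datetime columns in recent pandas
df4=retail_df.groupby('StockCode')[['Quantity']].sum()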
df4.sort_values(['Quantity'], ascending=False,inplace=True)
df4.reset_index(inplace=True)
df5=df4[['StockCode','Quantity']][:10]
df5

In [None]:
# top 10 StockCodes by quantity
plt.figure(figsize=(12,6))
sns.barplot(x=df5['Quantity'],y=df5['StockCode'])
plt.title('Top 10 StockCodes by quantity')

**TOP 10 HIGHEST SPENDING CUSTOMERS**

In [None]:
Top10Spending=retail_df.groupby('Customer ID')['TotalAmount'].sum().reset_index().sort_values('TotalAmount',ascending=False).head(10)
Top10Spending

In [None]:
# Top 10 Spending Customers
plt.figure(figsize=(18,6))
sns.barplot(x=Top10Spending['Customer ID'],y=Top10Spending['TotalAmount'].head(10))
plt.title('Top 10 Spending Customers.')

**TOP 10 MOST FREQUENT CUSTOMERS**

In [None]:
# naming the index and the count column explicitly so the result does not depend on the pandas version
Top10Frequent=retail_df['Customer ID'].value_counts().rename_axis('Customer ID').reset_index(name='Frequency').head(10)
Top10Frequent

**We observe that both lists have 3 Customer IDs in common, suggesting that the most frequent customers tend to be among the highest-spending customers**
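
The overlap between the two lists can also be computed rather than read off by eye. The sketch below uses a made-up `toy` frame and a list size of 2 instead of 10 (both are assumptions for illustration) and intersects the top spenders with the most frequent customers:

```python
import pandas as pd

# Hypothetical line items: one row per purchased product line.
toy = pd.DataFrame({
    "Customer ID": [1, 1, 1, 2, 2, 3, 4, 4, 4, 4],
    "TotalAmount": [5.0, 5.0, 5.0, 400.0, 300.0, 50.0, 100.0, 100.0, 100.0, 100.0],
})

n = 2  # size of each "top" list (10 in the notebook)

# Customers with the largest total spend.
top_spenders = toy.groupby("Customer ID")["TotalAmount"].sum().nlargest(n).index

# Customers with the most rows (line items).
top_frequent = toy["Customer ID"].value_counts().nlargest(n).index

# Customers that appear in both lists.
common = sorted(set(top_spenders) & set(top_frequent))
print(common)
```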

**TOP 10 CUSTOMERS BY AVERAGE ORDER AMOUNT**

In [None]:
avg_amount=retail_df.groupby('Customer ID')['TotalAmount'].mean().reset_index().rename(columns={'TotalAmount':'Avg_amount_per_customer'}).sort_values('Avg_amount_per_customer',ascending=False).head(10)

In [None]:
avg_amount

In [None]:
plt.figure(figsize=(12,6))
# plotting all 10 rows of avg_amount so that x and y have the same length
sns.barplot(x='Customer ID',y='Avg_amount_per_customer',data=avg_amount)
plt.title('Top 10 customers by average order amount')

**TOP COUNTRIES CONTRIBUTING HIGHEST REVENUE TO THE STORE** 

In [None]:
TopCountries=retail_df.groupby('Country')['TotalAmount'].sum().reset_index().sort_values('TotalAmount',ascending=False)
TopCountries

In [None]:
# top 5 countries where maximum sale happens.
plt.figure(figsize=(15,6))
sns.barplot(x=TopCountries['Country'].head(5),y=TopCountries['TotalAmount'].head(5))
plt.title('Top 5 Countries based on highest store revenue contributions')

**UK contributes most revenue to the store**

**European countries like Germany, France, Netherlands, EIRE contribute significant revenue to the store**

In [None]:
# bottom 5 countries by sales.
plt.figure(figsize=(15,6))
sns.barplot(x=TopCountries['Country'].tail(5),y=TopCountries['TotalAmount'].tail(5))
plt.title('Bottom 5 Countries by store revenue contribution')

**Countries contributing least to the store revenue are non-European countries**

In [None]:
SalesbyMonth=retail_df.groupby('Month')['TotalAmount'].sum().reset_index().sort_values('TotalAmount',ascending=False)
SalesbyMonth

In [None]:
# Sales in different months.
plt.figure(figsize=(20,6))
sns.barplot(x=SalesbyMonth['Month'],y=SalesbyMonth['TotalAmount'])
plt.title('Sales in different Months ')

**Highest sales happened in the month of November (Eve of Holiday Season) while least sale happened in the month of February**

In [None]:
sales_on_day_basis=retail_df.groupby('Day')['TotalAmount'].sum().reset_index().sort_values('TotalAmount',ascending=False)
sales_on_day_basis

In [None]:
# Sales on different days.
plt.figure(figsize=(20,6))
sns.barplot(x=sales_on_day_basis['Day'],y=sales_on_day_basis['TotalAmount'])
plt.title('Sales on different Days ')

**Sale on Thursdays is very high**

**Sale on Saturdays is very low**

In [None]:
# naming the index and the count column explicitly so the result does not depend on the pandas version
salescount_on_day_basis=retail_df['Day'].value_counts().rename_axis('Day').reset_index(name='Sale_count')
salescount_on_day_basis

In [None]:
# Sales count on different days.
plt.figure(figsize=(20,6))
sns.barplot(x=salescount_on_day_basis['Day'],y=salescount_on_day_basis['Sale_count'])
plt.title('Sales count on different Days ')

**As the sales revenue and sales count are negligible on Saturdays, the store is probably closed on Saturdays, and the few orders may have been placed by phone**
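
A count table built with `value_counts` only lists days that occur in the data, so a day with no sales at all would simply be missing. As a hedged sketch on made-up day names (the `days` series is hypothetical), reindexing over the full week makes absent days show up with a count of 0:

```python
import pandas as pd

# Hypothetical day names for a handful of invoice lines (no Saturday).
days = pd.Series(["Monday", "Thursday", "Thursday", "Sunday", "Friday", "Thursday"])

week = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Count rows per day, then reindex so absent days appear with a count of 0.
counts = days.value_counts().reindex(week, fill_value=0)
print(counts)
```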

**SALES IN DIFFERENT DAY TIMINGS**

In [None]:
retail_df['hour'].unique()

In [None]:
def time(hour):
    # 6-11 is Morning, 12-17 is Afternoon, every other hour is Evening
    if 6 <= hour <= 11:
        return 'Morning'
    elif 12 <= hour <= 17:
        return 'Afternoon'
    else:
        return 'Evening'

In [None]:
retail_df['Day_time_type']=retail_df['hour'].apply(time)
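
The same hour-to-label mapping can be written without a row-wise `apply`. This is a sketch, assuming the boundaries of the function above (6 to 11 Morning, 12 to 17 Afternoon, otherwise Evening) and a made-up `hours` series:

```python
import numpy as np
import pandas as pd

hours = pd.Series([6, 11, 12, 17, 18, 20])  # hypothetical invoice hours

# Series.between is inclusive on both ends by default.
day_part = np.select(
    [hours.between(6, 11), hours.between(12, 17)],
    ["Morning", "Afternoon"],
    default="Evening",
)
print(list(day_part))
```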

In [None]:
sales_timing=retail_df.groupby('Day_time_type')['TotalAmount'].sum().reset_index().sort_values('TotalAmount',ascending=False)
sales_timing

In [None]:
#Sales on different day-time types
plt.figure(figsize=(12,6))
sns.barplot(x=sales_timing['Day_time_type'],y=sales_timing['TotalAmount'])
plt.title('Sales amount in different day timings')

# Model Building

## RFM Model Analysis

* **RFM is a method used to analyze customer value. RFM stands for Recency, Frequency, and Monetary.**

* **Recency: How recently did the customer visit our website or make a purchase?**

* **Frequency: How often do they visit or how often do they purchase?**

* **Monetary: How much revenue do we get from their visit, or how much do they spend when they purchase?**

**RFM analysis helps businesses segment their customer base into homogeneous groups so that they can engage each group with a different targeted marketing strategy.**
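
As a minimal sketch of the three metrics, consider a made-up `toy` table with two customers and an assumed reference date of 2011-12-10. Note that this sketch counts distinct invoices for Frequency, which is one possible definition and an assumption here:

```python
import pandas as pd

# Hypothetical line items for two customers.
toy = pd.DataFrame({
    "Customer ID": [1, 1, 2],
    "Invoice": ["A1", "A2", "B1"],
    "InvoiceDate": pd.to_datetime(["2011-12-01", "2011-12-08", "2011-11-10"]),
    "TotalAmount": [20.0, 30.0, 100.0],
})

snapshot = pd.Timestamp("2011-12-10")  # assumed reference date

rfm = toy.groupby("Customer ID").agg(
    # days between the reference date and the customer's latest purchase
    Recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),
    # number of distinct invoices
    Frequency=("Invoice", "nunique"),
    # total amount spent
    Monetary=("TotalAmount", "sum"),
)
print(rfm)
```

Customer 1 bought more recently and more often, while customer 2 spent more in total, which is exactly the kind of trade-off the three separate scores are meant to capture.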

In [None]:
rfm_dataframe=retail_df.copy()

In [None]:
rfm_dataframe.head()

In [None]:
#Recency = Latest Date - Last Invoice Date, Frequency = number of invoice line items (rows) per customer, Monetary = Sum of Total Amount for each customer
import datetime as dt

#Set Latest date 2011-12-10 as last invoice date was 2011-12-09.
Latest_Date = dt.datetime(2011,12,10)

#Creating RFM Modelling scores for each customer
rfm_dataframe = retail_df.groupby('Customer ID').agg({'InvoiceDate': lambda x: (Latest_Date - x.max()).days, 'Invoice': lambda x: len(x), 'TotalAmount': lambda x: x.sum()})

rfm_dataframe['InvoiceDate'] = rfm_dataframe['InvoiceDate'].astype(int)

#Renaming column names to Recency, Frequency and Monetary
rfm_dataframe.rename(columns={'InvoiceDate': 'Recency', 
                         'Invoice': 'Frequency', 
                         'TotalAmount': 'Monetary'}, inplace=True)

rfm_dataframe.reset_index().head()

**Descriptive Summary and distribution of Recency**

In [None]:
rfm_dataframe.Recency.describe()

In [None]:
plt.figure(figsize=(12,6))
# distplot is deprecated in seaborn; histplot with kde=True draws a comparable plot
sns.histplot(x=rfm_dataframe['Recency'],kde=True)
plt.title('Distribution of Recency')

**Recency distribution is right skewed**

**Descriptive summary and distribution of Frequency**

In [None]:
rfm_dataframe['Frequency'].describe()

In [None]:
plt.figure(figsize=(12,6))
# distplot is deprecated in seaborn; histplot with kde=True draws a comparable plot
sns.histplot(x=rfm_dataframe['Frequency'],kde=True)
plt.title('Distribution of Frequency')

**Frequency distribution is heavily right-skewed**

**Descriptive summary and distribution of Monetary**

In [None]:
rfm_dataframe['Monetary'].describe()

In [None]:
plt.figure(figsize=(12,6))
# distplot is deprecated in seaborn; histplot with kde=True draws a comparable plot
sns.histplot(x=rfm_dataframe['Monetary'],kde=True)
plt.title('Distribution of Monetary')

**Monetary distribution is heavily right-skewed**
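
Skewness can be quantified as well as read from a plot. The sketch below uses made-up values (many small, one very large; an assumption for illustration) and compares the sample skewness before and after a natural-log transform:

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values: many small, one very large.
values = pd.Series([1, 2, 2, 3, 3, 4, 5, 8, 20, 200], dtype=float)

raw_skew = values.skew()          # sample skewness of the raw values
log_skew = np.log(values).skew()  # sample skewness after the log transform

print(round(raw_skew, 2), round(log_skew, 2))
```

A positive value indicates a long right tail. On this toy series the log transform reduces the skewness without removing it entirely.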

**Splitting data into four sections using quantile**

In [None]:
quantile = rfm_dataframe.quantile(q = [0.25,0.50,0.75])

In [None]:
quantile = quantile.to_dict()

In [None]:
quantile

In [None]:
# arguments (x = value, p = recency, monetary_value, frequency, d = quartiles dict)
# Good customer= Low Recency, High Frequency, High Monetary

#Function for scoring recency
def RScoring(x,p,d):
    if x <= d[p][0.25]:
        return 1                               
    elif x <= d[p][0.50]:
        return 2
    elif x <= d[p][0.75]: 
        return 3
    else:
        return 4

#Function for scoring frequency and Monetary
def FnMScoring(x,p,d):
    if x <= d[p][0.25]:
        return 4
    elif x <= d[p][0.50]:
        return 3
    elif x <= d[p][0.75]: 
        return 2
    else:
        return 1
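
To make the direction of the two scoring functions concrete, here is a compact sketch with made-up cut points (the `cuts` dictionary and the helper `quartile_score` are hypothetical names, not part of the notebook). A score of 1 is best in both cases; only the meaning of "low" differs:

```python
# Hypothetical quartile cut points for one metric.
cuts = {0.25: 10, 0.50: 50, 0.75: 200}

def quartile_score(x, cuts, low_is_good):
    """Return a score from 1 (best) to 4 (worst) based on the quartile of x."""
    if x <= cuts[0.25]:
        rank = 1
    elif x <= cuts[0.50]:
        rank = 2
    elif x <= cuts[0.75]:
        rank = 3
    else:
        rank = 4
    # For recency a low value is good; for frequency and monetary the rank is reversed.
    return rank if low_is_good else 5 - rank

print(quartile_score(5, cuts, low_is_good=True))    # recency-style scoring
print(quartile_score(5, cuts, low_is_good=False))   # frequency/monetary-style scoring
```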

Calculating R,F and M values and adding to dataframe

In [None]:
rfm_dataframe["R"] = rfm_dataframe['Recency'].apply(RScoring,args=('Recency',quantile,))
rfm_dataframe["F"] = rfm_dataframe['Frequency'].apply(FnMScoring,args=('Frequency',quantile,))
rfm_dataframe["M"] = rfm_dataframe['Monetary'].apply(FnMScoring,args=('Monetary',quantile,))
rfm_dataframe.head()

Adding Combined RFM value to dataset

In [None]:
rfm_dataframe['RFM_Group'] = rfm_dataframe.R.map(str)+rfm_dataframe.F.map(str)+rfm_dataframe.M.map(str)

Adding RFM Score column summing R,F and M values

In [None]:
rfm_dataframe['RFM_Score'] = rfm_dataframe[['R', 'F', 'M']].sum(axis = 1)
rfm_dataframe.head()

In [None]:
rfm_dataframe.info()

In [None]:
rfm_dataframe['RFM_Score'].unique()

Assigning Loyalty Level to each customer

In [None]:
Loyalty_Level = ['Platinum','Gold','Silver','Bronze']
Score_cut = pd.qcut(rfm_dataframe['RFM_Score'],q = 4,labels=Loyalty_Level)
rfm_dataframe['RFM_Loyalty_Level'] = Score_cut.values
rfm_dataframe.reset_index().head()
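
Because a low RFM_Score is good, the first label passed to `pd.qcut` goes to the best customers. A small sketch, assuming one customer for each possible score from 3 to 12 (a made-up, evenly spread input):

```python
import pandas as pd

scores = pd.Series(range(3, 13))  # every possible RFM_Score, 3 (best) to 12 (worst)
levels = ["Platinum", "Gold", "Silver", "Bronze"]

# qcut splits the scores into four quantile-based bins, lowest scores first.
labelled = pd.qcut(scores, q=4, labels=levels)
print(pd.DataFrame({"RFM_Score": scores, "Level": labelled}))
```

With real data the bin edges depend on how the scores are distributed, so the score ranges per level will generally differ from this toy case.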

Checking data for RFM_Group=111

In [None]:
rfm_dataframe[rfm_dataframe['RFM_Group'] == '111'].sort_values("Monetary",ascending = False).reset_index().head(10)

**Bar Chart of Loyalty Level**

In [None]:
plt.figure(figsize=(12,6))
sns.countplot(x=rfm_dataframe['RFM_Loyalty_Level'])
plt.title('Loyalty Level of Customers')
plt.show()

In [None]:
segmentation_based_on_RFM=rfm_dataframe[['Recency','Frequency','Monetary','RFM_Loyalty_Level']]

In [None]:
segmentation_based_on_RFM.groupby('RFM_Loyalty_Level').agg({
    'Recency': ['mean', 'min', 'max'],
    'Frequency': ['mean', 'min', 'max'],
    'Monetary': ['mean', 'min', 'max','count']
})

In [None]:
def handle_neg_n_zero(num):
    if num <= 0:
        return 1
    else:
        return num
#Apply handle_neg_n_zero function to Recency and Monetary columns 
rfm_dataframe['Recency'] = [handle_neg_n_zero(x) for x in rfm_dataframe.Recency]
rfm_dataframe['Monetary'] = [handle_neg_n_zero(x) for x in rfm_dataframe.Monetary]

In [None]:
#Perform Log transformation to bring data into normal or near normal distribution
Log_rfm_df = rfm_dataframe[['Recency', 'Frequency', 'Monetary']].apply(np.log, axis = 1).round(3)

**Now let's Visualize the Distribution of Recency,Frequency and Monetary.**

In [None]:
plt.figure(figsize=(12,6))
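# distplot is deprecated in seaborn; histplot with kde=True draws a comparable plot
sns.histplot(x=Log_rfm_df['Recency'],kde=True)
plt.title('Distribution of log-transformed Recency')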

In [None]:
plt.figure(figsize=(12,6))
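# distplot is deprecated in seaborn; histplot with kde=True draws a comparable plot
sns.histplot(x=Log_rfm_df['Frequency'],kde=True)
plt.title('Distribution of log-transformed Frequency')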

In [None]:
plt.figure(figsize=(12,6))
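# distplot is deprecated in seaborn; histplot with kde=True draws a comparable plot
sns.histplot(x=Log_rfm_df['Monetary'],kde=True)
plt.title('Distribution of log-transformed Monetary')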

In [None]:
# natural log of each metric, computed with numpy on the whole column
rfm_dataframe['Recency_log'] = np.log(rfm_dataframe['Recency'])
rfm_dataframe['Frequency_log'] = np.log(rfm_dataframe['Frequency'])
rfm_dataframe['Monetary_log'] = np.log(rfm_dataframe['Monetary'])

In [None]:
rfm_dataframe

## KMeans Clustering

### Applying Elbow Method on Recency and Monetary.

In [None]:
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

In [None]:
# taking Recency_log and Monetary_log in list.
Recency_and_Monetary_feat=['Recency_log','Monetary_log']

# taking only values of recency and monetary in X.
X=rfm_dataframe[Recency_and_Monetary_feat].values

# standardising the data
scaler=StandardScaler()
X=scaler.fit_transform(X)

#applying Elbow Method
wcss = {}
for k in range(1,15):
    km = KMeans(n_clusters= k, init= 'k-means++', max_iter= 1000)
    km = km.fit(X)
    wcss[k] = km.inertia_


#Plot the graph for the sum of square distance values and Number of Clusters
plt.figure(figsize=(12,6))
sns.pointplot(x = list(wcss.keys()), y = list(wcss.values()))
plt.xlabel('Number of Clusters(k)')
plt.ylabel('Sum of Square Distances')
plt.title('Elbow Method For Optimal k')
plt.show()
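
To see what an elbow looks like when the answer is known, the sketch below runs the same inertia loop on synthetic data with three well-separated groups. The data, the names `X_demo` and `wcss_demo`, and the settings `n_init=10` and `random_state=0` are all assumptions for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups.
X_demo, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# Within-cluster sum of squares (inertia) for k = 1..6.
wcss_demo = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    wcss_demo[k] = model.inertia_

for k, value in wcss_demo.items():
    print(k, round(value, 1))
```

The inertia drops sharply up to k = 3 and only marginally afterwards, which is the bend the elbow method looks for. Real customer data rarely shows such a clean bend.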

**Silhouette Score**

In [None]:
# taking Recency_log and Monetary_log in list.
Recency_and_Monetary_feat=['Recency_log','Monetary_log']

# taking only values of recency and monetary in X.
X=rfm_dataframe[Recency_and_Monetary_feat].values

# standardising the data
scaler=StandardScaler()
X=scaler.fit_transform(X)

#Silhouette Score
range_n_clusters = [2,3,4,5,6,7,8,9,10,11,12,13,14,15]
for n_clusters in range_n_clusters:
    clusterer = KMeans(n_clusters=n_clusters,random_state=1)
    preds = clusterer.fit_predict(X)
    centers = clusterer.cluster_centers_

    score = silhouette_score(X, preds)
    print("For n_clusters = {}, silhouette score is {}".format(n_clusters, score))
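
Picking k from the printed scores amounts to taking the k with the highest silhouette score. A minimal sketch on synthetic data with three well-separated groups (the data and all `*_demo` names are assumptions for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups.
X_demo, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# Silhouette score for each candidate k (it is undefined for k = 1).
scores_demo = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores_demo[k] = silhouette_score(X_demo, labels)

# The k with the highest average silhouette score.
best_k = max(scores_demo, key=scores_demo.get)
print(best_k, round(scores_demo[best_k], 3))
```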

In [None]:
# applying Kmeans_clustering algorithm
kmeans_rec_mon = KMeans(n_clusters=2)
kmeans_rec_mon.fit(X)
y_kmeans= kmeans_rec_mon.predict(X)

In [None]:
#Find the clusters for the observation given in the dataset
rfm_dataframe['Cluster_based_rec_mon'] = kmeans_rec_mon.labels_
rfm_dataframe.head(10)

In [None]:
# Centers of the clusters
centers = kmeans_rec_mon.cluster_centers_
centers
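
The centers above are expressed in standardized log units, which are hard to interpret directly. As a sketch on made-up (Recency, Monetary) pairs, the scaling and the log can be undone to read the centers in original units. All names ending in `_demo` and the `raw` array are hypothetical:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical (Recency, Monetary) pairs forming two obvious groups.
raw = np.array([
    [1.0, 1000.0], [2.0, 900.0], [1.0, 1100.0],
    [100.0, 10.0], [120.0, 12.0], [110.0, 11.0],
])

scaler_demo = StandardScaler()
X_demo = scaler_demo.fit_transform(np.log(raw))  # log first, then standardise

km_demo = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_demo)

# Undo the standardisation, then undo the log.
centers_original = np.exp(scaler_demo.inverse_transform(km_demo.cluster_centers_))
print(np.round(centers_original, 1))
```

Because the averaging happens in log space, each recovered center is a geometric mean of its cluster rather than an arithmetic mean.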

In [None]:
# Visualizing the clusters
plt.figure(figsize=(15,10))
plt.title('customer segmentation based on Recency and Monetary')
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='winter')

centers = kmeans_rec_mon.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=300, alpha=0.8)

In [None]:
data_process_normalized=rfm_dataframe[['Recency','Frequency','Monetary','Recency_log','Frequency_log','Monetary_log','RFM_Loyalty_Level','Cluster_based_rec_mon']]

In [None]:
data_process_normalized.groupby('Cluster_based_rec_mon').agg({
    'Recency': ['mean', 'min', 'max'],
    'Frequency': ['mean', 'min', 'max'],
    'Monetary': ['mean', 'min', 'max','count']
})

### Applying Elbow Method on Frequency and Monetary

In [None]:
# taking Frequency_log and Monetary_log in list.
Frequency_and_Monetary_feat=['Frequency_log','Monetary_log']

# taking only values of frequency and monetary in X.
X=rfm_dataframe[Frequency_and_Monetary_feat].values

# standardising the data
scaler=StandardScaler()
X=scaler.fit_transform(X)

#applying Elbow Method
wcss = {}
for k in range(1,15):
    km = KMeans(n_clusters= k, init= 'k-means++', max_iter= 1000)
    km = km.fit(X)
    wcss[k] = km.inertia_


#Plot the graph for the sum of square distance values and Number of Clusters
plt.figure(figsize=(12,6))
sns.pointplot(x = list(wcss.keys()), y = list(wcss.values()))
plt.xlabel('Number of Clusters(k)')
plt.ylabel('Sum of Square Distances')
plt.title('Elbow Method For Optimal k')
plt.show()

In [None]:
# taking Frequency_log and Monetary_log in list.
Frequency_and_Monetary_feat=['Frequency_log','Monetary_log']

# taking only values of frequency and monetary in X.
X=rfm_dataframe[Frequency_and_Monetary_feat].values

# standardising the data
scaler=StandardScaler()
X=scaler.fit_transform(X)

#Silhouette Score
range_n_clusters = [2,3,4,5,6,7,8,9,10,11,12,13,14,15]
for n_clusters in range_n_clusters:
    clusterer = KMeans(n_clusters=n_clusters,random_state=1)
    preds = clusterer.fit_predict(X)
    centers = clusterer.cluster_centers_

    score = silhouette_score(X, preds)
    print("For n_clusters = {}, silhouette score is {}".format(n_clusters, score))

In [None]:
# applying Kmeans_clustering algorithm
kmeans_freq_mon = KMeans(n_clusters=2)
kmeans_freq_mon.fit(X)
y_kmeans= kmeans_freq_mon.predict(X)

In [None]:
#Find the clusters for the observation given in the dataset
rfm_dataframe['Cluster_based_on_freq_mon'] = kmeans_freq_mon.labels_
rfm_dataframe.head(10)

In [None]:
# Centers of the clusters
centers = kmeans_freq_mon.cluster_centers_
centers

In [None]:
# Visualizing the clusters
plt.figure(figsize=(15,10))
plt.title('customer segmentation based on Frequency and Monetary')
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='winter')

centers = kmeans_freq_mon.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=300, alpha=0.8)

In [None]:
data_process_normalized1=rfm_dataframe[['Recency','Frequency','Monetary','Recency_log','Frequency_log','Monetary_log','RFM_Loyalty_Level','Cluster_based_on_freq_mon']]

In [None]:
data_process_normalized1.groupby('Cluster_based_on_freq_mon').agg({
    'Recency': ['mean', 'min', 'max'],
    'Frequency': ['mean', 'min', 'max'],
    'Monetary': ['mean', 'min', 'max','count']
})

### Applying Elbow Method on Recency, Frequency and Monetary.

In [None]:
# taking Recency_log, Frequency_log and Monetary_log in list.
Recency_frequency_and_Monetary_feat=['Recency_log','Frequency_log','Monetary_log']

# taking only values of recency, frequency and monetary in X.
X=rfm_dataframe[Recency_frequency_and_Monetary_feat].values

# standardising the data
scaler=StandardScaler()
X=scaler.fit_transform(X)

#applying Elbow Method
wcss = {}
for k in range(1,15):
    km = KMeans(n_clusters= k, init= 'k-means++', max_iter= 1000)
    km = km.fit(X)
    wcss[k] = km.inertia_


#Plot the graph for the sum of square distance values and Number of Clusters
plt.figure(figsize=(12,6))
sns.pointplot(x = list(wcss.keys()), y = list(wcss.values()))
plt.xlabel('Number of Clusters(k)')
plt.ylabel('Sum of Square Distances')
plt.title('Elbow Method For Optimal k')
plt.show()

**Silhouette Score**

In [None]:
# taking Recency_log, Frequency_log and Monetary_log in list.
Recency_frequency_and_Monetary_feat=['Recency_log','Frequency_log','Monetary_log']

# taking only values of recency, frequency and monetary in X.
X=rfm_dataframe[Recency_frequency_and_Monetary_feat].values

# standardising the data
scaler=StandardScaler()
X=scaler.fit_transform(X)

#Silhouette Score
range_n_clusters = [2,3,4,5,6,7,8,9,10,11,12,13,14,15]
for n_clusters in range_n_clusters:
    clusterer = KMeans(n_clusters=n_clusters,random_state=1)
    preds = clusterer.fit_predict(X)
    centers = clusterer.cluster_centers_

    score = silhouette_score(X, preds)
    print("For n_clusters = {}, silhouette score is {}".format(n_clusters, score))

In [None]:
# applying Kmeans clustering algorithm
kmeans_freq_mon_rec = KMeans(n_clusters=2)
kmeans_freq_mon_rec.fit(X)
y_kmeans= kmeans_freq_mon_rec.predict(X)

In [None]:
#Finding the clusters for the observation in the dataset
rfm_dataframe['Cluster_based_on_freq_mon_rec'] = kmeans_freq_mon_rec.labels_
rfm_dataframe.head(10)

In [None]:
# Centers of the clusters
centers = kmeans_freq_mon_rec.cluster_centers_
centers

In [None]:
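# Visualizing the clusters on the first two standardized features (Recency_log and Frequency_log)
plt.figure(figsize=(15,10))
plt.title('customer segmentation based on Recency, Frequency and Monetary')
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='winter')
plt.xlabel('Recency_log (standardized)')
plt.ylabel('Frequency_log (standardized)')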

centers = kmeans_freq_mon_rec.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=300, alpha=0.8)

In [None]:
data_process_normalized2=rfm_dataframe[['Recency','Frequency','Monetary','Recency_log','Frequency_log','Monetary_log','RFM_Loyalty_Level','Cluster_based_on_freq_mon_rec']]

In [None]:
data_process_normalized2.groupby('Cluster_based_on_freq_mon_rec').agg({
    'Recency': ['mean', 'min', 'max'],
    'Frequency': ['mean', 'min', 'max'],
    'Monetary': ['mean', 'min', 'max','count']
})

**We can see that the means of Recency, Frequency and Monetary all differ markedly between the two clusters**