
# **Project Name  - Online Retail Customer Segmentation**   



##### **Project Type**    - Unsupervised
##### **Contribution**    - Individual
##### **Team Member**     - Gaurav Mehra



# **Project Summary -**

In this project, we aimed to identify customer segments for a UK-based online retail company that specializes in selling unique gifts. We analyzed the transactional data between December 2010 and December 2011 and found that many customers were wholesalers.

We started by cleaning and processing the data, and then performed exploratory data analysis to gain insights into the dataset. Next, we used the RFM model to quantify customer behavior, including recency, frequency, and monetary value. We then applied clustering algorithms such as KMeans and DBSCAN to identify two customer segments based on their behavior.

We concluded that customers with high recency and low frequency and monetary values belonged to one segment, while those with low recency and high frequency and monetary values belonged to another segment.

This analysis can be further modified and customized based on the company's objectives and preferences. For example, the clustering could be performed on additional features such as preferred product types or customer lifetime value. The labeled feature after clustering can also be fed into classification algorithms to predict the classes for new observations.

In summary, this project showcased how machine learning can be used to perform customer segmentation and gain insights into customer behavior. It's important to note that segmentation involves judgment calls, and there is rarely a single correct way to perform it. We always strive to improve our outcomes based on our final objectives.

# **GitHub Link -**

https://github.com/Gauravmehra1/Online-Retail-Customer-Segmentation

# **Problem Statement**


*   **The task is to perform customer segmentation on a transnational dataset of a UK-based and registered non-store online retail company.**

*   **The company sells unique all-occasion gifts and has many wholesale customers.**

*   **The results can be used to better understand customer behavior and tailor marketing strategies to each segment.**

#### **Define Your Business Objective?**

**Customer segmentation is the practice of dividing a company’s customers into groups that reflect similarity among customers in each group. The goal of segmenting customers is to decide how to relate to customers in each segment in order to maximize the value of each customer to the business.**

**Customer segmentation has the potential to allow marketers to address each customer in the most effective way. Using the large amount of data available on customers (and potential customers), a customer segmentation analysis allows marketers to identify discrete groups of customers with a high degree of accuracy based on demographic, behavioral and other indicators.**

**Since the marketer’s goal is usually to maximize the value (revenue and/or profit) from each customer, it is critical to know in advance how any particular marketing action will influence the customer. Ideally, such “action-centric” customer segmentation will not focus on the short-term value of a marketing action, but rather the long-term customer lifetime value (CLV) impact that such a marketing action will have. Thus, it is necessary to group, or segment, customers according to their CLV.**

**Of course, it is always easier to make assumptions and use “gut feelings” to define rules which will segment customers into logical groupings, e.g., customers who came from a particular source, who live in a particular location or who bought a particular product/service. However, these high-level categorizations will seldom lead to the desired results.**

**It is obvious that some customers will spend more than others during their relationship with a company. The best customers will spend a lot for many years. Good customers will spend modestly over a long period of time, or will spend a lot over a short period of time. Others won’t spend too much and/or won’t stick around too long.**

**The right approach to segmentation analysis is to segment customers into groups based on predictions regarding their total future value to the company, with the goal of addressing each group (or individual) in the way most likely to maximize that future, or lifetime, value.**

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required. 
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits. 
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from datetime import datetime
import datetime as dt
from sklearn import preprocessing
import math

import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=UserWarning)

from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn import metrics
import matplotlib.cm as cm


### Dataset Loading

In [None]:
# Mount Google Drive to import the dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load the Online Retail dataset from Drive
customer_df = pd.read_excel('/content/drive/MyDrive/EDA/Online Retail.xlsx')

### Dataset First View

In [None]:
# Dataset First Look
customer_df.head()

In [None]:
# View the bottom 5 rows to get a glimpse of the data
customer_df.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
customer_df.shape

### Dataset Information

In [None]:
# Dataset Info
customer_df.info()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
customer_df.isnull().sum()

In [None]:
customer_df.dropna(inplace=True)

In [None]:
customer_df.isnull().sum()

#### Duplicate Values



*Why is it important to remove duplicate records from my data?*



*   "Duplication" means that the same record appears more than once in the dataset, for example because of data entry errors or the way the data was collected. Removing duplicates keeps repeated records from distorting the analysis and saves time and money by not sending identical communications to the same person more than once.
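
As a minimal sketch of what duplicate detection and removal do, the toy frame below (hypothetical values, not taken from the retail dataset) contains one exact repeat. `toy_df`, `n_duplicates` and `deduped_df` are names made up for this illustration.

```python
import pandas as pd

# Hypothetical toy data: the third row is an exact copy of the second.
toy_df = pd.DataFrame({
    'InvoiceNo': ['536365', '536366', '536366'],
    'StockCode': ['85123A', '22633', '22633'],
    'Quantity': [6, 6, 6],
})

# duplicated() flags every row that repeats an earlier row.
n_duplicates = int(toy_df.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each repeated row.
deduped_df = toy_df.drop_duplicates()

print(n_duplicates, len(deduped_df))
```

Only rows that match on every column are treated as duplicates here, which is also the default behaviour used in the cells below.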

In [None]:
# Dataset Duplicate Value Count
customer_df.duplicated().sum()

In [None]:
customer_df[customer_df.duplicated()]

In [None]:
# dropping duplicates and confirming that none remain
customer_df=customer_df.drop_duplicates()
len(customer_df[customer_df.duplicated()])

**We also have to drop the rows whose InvoiceNo starts with 'C', because the 'C' prefix indicates a cancellation**

In [None]:
customer_df['InvoiceNo'] = customer_df['InvoiceNo'].astype('str') 

In [None]:
# checking cancelled invoices (InvoiceNo starting with 'C')
customer_df[customer_df['InvoiceNo'].str.startswith('C')]

In [None]:
# keep only the non-cancelled invoices
customer_df=customer_df[~customer_df['InvoiceNo'].str.startswith('C')]

### What did you know about your dataset?

Answer Here

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
customer_df.columns

In [None]:
# Dataset Describe
customer_df.describe()

### Variables Description 

Answer Here

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
customer_df.nunique()

In [None]:
customer_df.shape

## 3. ***Data Wrangling***

### Data Wrangling Code

**Convert the InvoiceDate column into 'year', 'month', 'day', 'hour', 'minute' and 'second' columns**

In [None]:
# Write your code to make your dataset analysis ready.
customer_df['InvoiceDate_year']=customer_df['InvoiceDate'].dt.year
customer_df['InvoiceDate_month']=customer_df['InvoiceDate'].dt.month
customer_df['InvoiceDate_day']=customer_df['InvoiceDate'].dt.day 
customer_df['InvoiceDate_hour']=customer_df['InvoiceDate'].dt.hour 
customer_df['InvoiceDate_minute']=customer_df['InvoiceDate'].dt.minute 
customer_df['InvoiceDate_second']=customer_df['InvoiceDate'].dt.second

In [None]:
customer_df.dtypes

In [None]:
customer_df.shape

### What all manipulations have you done and insights you found?

Answer Here.

# **Exploratory Data Analysis**

**Why do we perform EDA?**

 **An EDA is a thorough examination meant to uncover the underlying structure of a data set and is important for a company because it exposes trends, patterns, and relationships that are not readily apparent.** 

In [None]:
customer_df['CustomerID'].nunique()

In [None]:
# finding the most active customers
# rename_axis + reset_index(name=...) gives the same column names on old and new pandas versions
active_customer = customer_df['CustomerID'].value_counts().rename_axis('CustomerID').reset_index(name='count')
active_customer

In [None]:
active_customer.head()

# **Analysis of Categorical Features**

In [None]:
categorical_columns= list(customer_df.select_dtypes(['object']).columns)
categorical_features= pd.Index(categorical_columns)
categorical_features

# **Analysis of Description Variable**

In [None]:
# count how often each product description appears (column names are pandas-version independent)
description_df=customer_df['Description'].value_counts().rename_axis('Description_name').reset_index(name='Count')
description_df

In [None]:
customer_df['Description'].value_counts()


## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(18,11))
plt.title('Top 5 Product Name')
sns.barplot(x='Description_name',y='Count',data=description_df[:5])


In [None]:
plt.figure(figsize=(18,11))
plt.title('Bottom 5 Product Name')
sns.barplot(x='Description_name',y='Count',data=description_df[-5:])

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 2

In [None]:
# count how often each stock code appears (column names are pandas-version independent)
StockCode_df=customer_df['StockCode'].value_counts().rename_axis('StockCode_Name').reset_index(name='Count')

In [None]:
# Chart - 2 visualization code
plt.figure(figsize=(13,8))
plt.title('Top 5 Stock Name')
sns.barplot(x='StockCode_Name',y='Count',data=StockCode_df[:5])

In [None]:
plt.figure(figsize=(13,8))
plt.title('Bottom 5 Stock Name')
sns.barplot(x='StockCode_Name',y='Count',data=StockCode_df[-5:])

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 3

In [None]:
# count the number of records per country (column names are pandas-version independent)
country_df=customer_df['Country'].value_counts().rename_axis('Country_Name').reset_index(name='count')
country_df

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(13,8))
plt.title('Top 5 Countries by Number of Records')
sns.barplot(x='Country_Name',y='count',data=country_df[:5])

In [None]:
plt.figure(figsize=(13,8))
plt.title('Bottom 5 Countries by Number of Records')
sns.barplot(x='Country_Name',y='count',data=country_df[-5:])

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

# **Analysis Numeric Features**

#### Chart - 4

In [None]:
numerical_columns=list(customer_df.select_dtypes(['int64','float64']).columns)
numerical_features=pd.Index(numerical_columns)
numerical_features

In [None]:
# Chart - 4 visualization code
for col in numerical_features:
  fig=plt.figure(figsize=(9,6))
  ax=fig.gca()
  feature =(customer_df[col])
  feature.hist(bins=50,ax=ax)
  ax.axvline(feature.mean(),color='magenta', linestyle='dashed', linewidth=2)
  ax.axvline(feature.median(),color='cyan', linestyle='dashed', linewidth=2)
  ax.set_title(col)
  plt.show()
  print( "Skewness :",customer_df[col].skew())
  print( "Kurtosis :",customer_df[col].kurt())

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 5

In [None]:
# plot the distribution of each numerical feature with its mean and median
for col in numerical_features:
  fig=plt.figure(figsize=(9,6))
  ax=fig.gca()
  feature= (customer_df[col])
  sns.histplot(feature, kde=True, stat='density', ax=ax)  # distplot is deprecated in seaborn
  ax.axvline(feature.mean(),color='magenta', linestyle='dashed', linewidth=2)
  ax.axvline(feature.median(),color='cyan', linestyle='dashed', linewidth=2)
  ax.set_title(col)
  plt.show()
  print( "Skewness :",customer_df[col].skew())
  print( "Kurtosis :",customer_df[col].kurt())

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6

In [None]:
# plot a boxplot for each numerical feature

for col in numerical_features:
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    customer_df.boxplot(column=col, ax=ax)
    ax.set_title('Boxplot of ' + col)
    plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(15,8))
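# restrict to numeric columns so that corr() does not fail on text or datetime columns
correlation=customer_df.select_dtypes('number').corr()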
sns.heatmap(abs(correlation), annot=True, cmap='coolwarm')

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

# **Feature engineering**

**Creating a new feature, day, from InvoiceDate**

In [None]:
customer_df['day']=customer_df['InvoiceDate'].dt.day_name()

**Creating a new feature, TotalAmount, from Quantity and UnitPrice**

In [None]:
customer_df['TotalAmount']=customer_df['Quantity']*customer_df['UnitPrice']
customer_df.head()

In [None]:
plt.figure(figsize=(15,8))
plt.title('distribution of amount')
sns.histplot(customer_df['TotalAmount'], kde=True, stat='density', color='red')

In [None]:
customer_df['TotalAmount'].describe()

In [None]:
# count the number of records per weekday (column names are pandas-version independent)
day_df=customer_df['day'].value_counts().rename_axis('Day_Name').reset_index(name='Count')
day_df

In [None]:
plt.figure(figsize=(13,8))
plt.title('Day')
sns.barplot(x='Day_Name',y='Count',data=day_df)

**Most items are purchased on Thursday, Wednesday and Tuesday**

In [None]:
# count the number of records per month (column names are pandas-version independent)
month_df=customer_df['InvoiceDate_month'].value_counts().rename_axis('month_name').reset_index(name='Count')
month_df

In [None]:
plt.figure(figsize=(18,5))
plt.title('month')
sns.barplot(x='month_name',y='Count',data=month_df)

*  **Most gifts are purchased in September, October, November and December**
*  **Fewer gifts are purchased in January, February and April**


In [None]:
# count the number of records per hour of the day (column names are pandas-version independent)
hour_df=customer_df['InvoiceDate_hour'].value_counts().rename_axis('Hour_Name').reset_index(name='Count')
hour_df


In [None]:
plt.figure(figsize=(18,7))
plt.title('hour')
sns.barplot(x='Hour_Name', y='Count',data=hour_df)

* **Most items are purchased in the afternoon, as the graph shows**

In [None]:
def time_type(time):
  # classify the invoice hour (0-23) into a part of the day
  if 6 <= time <= 11:
    return 'morning'
  elif 12 <= time <= 17:
    return 'afternoon'
  else:
    return 'evening'

In [None]:
customer_df['Time_type']=customer_df['InvoiceDate_hour'].apply(time_type)

In [None]:
plt.figure(figsize=(13,5))
plt.title('Time_type')
sns.countplot(x='Time_type',data=customer_df)


**Most customers purchased items in the afternoon, a moderate number purchased in the morning, and the fewest purchased in the evening**

# **Creating RFM Model**

**Before applying any clustering algorithm, it is necessary to determine the quantitative factors on which the algorithm will perform the segmentation. Examples are features such as the amount spent, how active the customer is, the time since their last visit, etc.**

**The RFM model, which stands for Recency, Frequency and Monetary, is one such step: for each customer we determine recency (days since the last purchase), frequency (how actively the customer repurchases) and monetary value (the customer's total expenditure). There is usually a second step in which each of these features is binned and a score is calculated for each customer. However, that approach does not require machine learning algorithms, as the segmentation can be done manually. Therefore we will not rely on the second step for segmentation and will feed the RFM features directly to the clustering algorithms.**



*   **Recency = Latest Date - Last Invoice Date,**
*   **Frequency = count of invoice no. of transaction(s),**
*   **Monetary = Sum of Total Amount for each customer**
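
The three definitions above can be sketched on a tiny example. The transactions, the names `toy_tx`, `toy_rfm` and `reference_date`, and the use of named aggregation are assumptions made for this illustration; the notebook's own cell below uses a dictionary of lambdas instead.

```python
import datetime as dt
import pandas as pd

# Hypothetical toy transactions, one row per invoice line.
toy_tx = pd.DataFrame({
    'CustomerID': [1, 1, 2],
    'InvoiceNo': ['A1', 'A2', 'B1'],
    'InvoiceDate': pd.to_datetime(['2011-12-01', '2011-12-08', '2011-11-10']),
    'TotalAmount': [10.0, 15.0, 40.0],
})

# Reference ("latest") date: one day after the last invoice in the real data.
reference_date = dt.datetime(2011, 12, 10)

toy_rfm = toy_tx.groupby('CustomerID').agg(
    # Recency: days between the reference date and the customer's last purchase.
    Recency=('InvoiceDate', lambda s: (reference_date - s.max()).days),
    # Frequency: number of invoice rows for the customer.
    Frequency=('InvoiceNo', 'count'),
    # Monetary: total amount spent by the customer.
    Monetary=('TotalAmount', 'sum'),
)

print(toy_rfm)
```

Counting rows measures how many invoice lines a customer has; counting distinct invoice numbers instead would measure how many orders were placed, which is a design choice to make deliberately.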

In [None]:
#Set Latest date 2011-12-10 as last invoice date was 2011-12-09. This is to calculate the number of days from recent purchase
Latest_Date = dt.datetime(2011,12,10)

#Create RFM Modelling scores for each customer
rfm_df = customer_df.groupby('CustomerID').agg({'InvoiceDate': lambda x: (Latest_Date - x.max()).days, 'InvoiceNo': lambda x: len(x), 'TotalAmount': lambda x: x.sum()})

#Convert Invoice Date into type int
rfm_df['InvoiceDate'] = rfm_df['InvoiceDate'].astype(int)

#Rename column names to Recency, Frequency and Monetary
rfm_df.rename(columns={'InvoiceDate': 'Recency', 'InvoiceNo': 'Frequency', 'TotalAmount': 'Monetary'}, inplace=True)
rfm_df.reset_index().head()

In [None]:
# describe Recency
rfm_df.Recency.describe()

In [None]:
#plot of Recency
x =rfm_df['Recency']
plt.figure(figsize=(18,5))
sns.histplot(x, kde=True, stat='density')  # distplot is deprecated in seaborn

In [None]:
rfm_df.Frequency.describe()

In [None]:
#plot of Frequency
x = rfm_df['Frequency']
plt.figure(figsize=(13,8))
sns.histplot(x)  # axes-level plot, so it draws on the figure created above


In [None]:
# describe Monetary
rfm_df.Monetary.describe()

In [None]:
#Monetary distribution plot
x =rfm_df['Monetary']
plt.figure(figsize=(13,5))
sns.histplot(x)  # axes-level plot, so it draws on the figure created above

**Split into four segments using quantiles**
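
A quartile split of this kind can also be sketched with `pd.qcut`. This is an alternative illustration, not the method used in the cells below, and the `recency` and `monetary` values are hypothetical. It follows the same convention as the notebook: a score of 1 is the best quartile.

```python
import pandas as pd

# Hypothetical recency values (days since last purchase) for eight customers.
recency = pd.Series([1, 5, 10, 20, 40, 80, 160, 320])

# Lower recency is better, so the lowest quartile receives score 1.
r_score = pd.qcut(recency, q=4, labels=[1, 2, 3, 4]).astype(int)

# Hypothetical monetary values for the same eight customers.
monetary = pd.Series([10, 20, 30, 40, 50, 60, 70, 80])

# Higher spend is better, so the labels are reversed: top quartile gets 1.
m_score = pd.qcut(monetary, q=4, labels=[4, 3, 2, 1]).astype(int)

print(r_score.tolist())
print(m_score.tolist())
```

`pd.qcut` raises an error when quartile edges coincide (for example when many customers share the same value), so the explicit threshold functions used below are more forgiving on heavily tied data.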

In [None]:
# compute the quartiles used to split each feature into 4 segments
quantiles = rfm_df.quantile(q=[0.25,0.5,0.75])
quantiles = quantiles.to_dict()
quantiles


In [None]:
# scoring functions for the R, F and M segments (1 = best quartile, 4 = worst)
def RScoring(x,p,d):
  if x <= d[p][0.25]:
    return 1
  elif x <= d[p][0.50]:
    return 2
  elif x<= d[p][0.75]:
    return 3
  else :
    return 4

def FnMScoring(x,p,d):
  if x <= d[p][0.25]:
    return 4
  elif x <= d[p][0.50]:
    return 3
  elif x<= d[p][0.75]:
    return 2
  else :
    return 1



In [None]:
# Adding R, F and M segment value columns in the existing dataset to show R, F and M segment values
rfm_df['R'] = rfm_df['Recency'].apply(RScoring, args=('Recency',quantiles,)) 
rfm_df['F'] = rfm_df['Frequency'].apply(FnMScoring, args=('Frequency',quantiles,))
rfm_df['M'] = rfm_df['Monetary'].apply(FnMScoring, args=('Monetary',quantiles,))
rfm_df.head()

In [None]:
#Adding RFMGroup value column showing combined concatenated score of RFM
rfm_df['RFMGroup'] = rfm_df.R.map(str) + rfm_df.F.map(str) + rfm_df.M.map(str)

#Adding RFMScore value column showing total sum of RFMGroup values
rfm_df['RFMScore'] =rfm_df[['R','F','M']].sum(axis=1)
rfm_df.head()


In [None]:
#handle negative and zero values so as to handle infinite numbers during log transformation.
def handle_neg_n_zero(num):
  if num<= 0:
    return 1
  else:
    return num
  
#Apply handle_neg_n_zero function to Recency and Monetary columns.
rfm_df['Recency'] = rfm_df['Recency'].apply(handle_neg_n_zero)
rfm_df['Monetary'] = rfm_df['Monetary'].apply(handle_neg_n_zero)

#Perform Log transformation to bring data into normal or near normal distribution
log_Tfd_Data =  rfm_df[['Recency', 'Frequency', 'Monetary']].apply(np.log, axis = 1).round(3)
log_Tfd_Data.head()

In [None]:
#Data distribution after log transformation for Recency
Recency_plot = log_Tfd_Data['Recency']
plt.figure(figsize=(13,5))
sns.histplot(Recency_plot, kde=True, stat='density')

In [None]:
#Data distribution after log transformation for Frequency
Frequency_plot = log_Tfd_Data['Frequency']
plt.figure(figsize=(13,5))
sns.histplot(Frequency_plot, kde=True, stat='density')

In [None]:
#Data distribution after log transformation for Monetary
Monetary_plot = log_Tfd_Data['Monetary']
plt.figure(figsize=(13,5))
sns.histplot(Monetary_plot, kde=True, stat='density')


In [None]:
rfm_df['Recency_log'] = rfm_df['Recency'].apply(math.log)
rfm_df['Frequency_log'] = rfm_df['Frequency'].apply(math.log)
rfm_df['Monetary_log'] = rfm_df['Monetary'].apply(math.log)

# **K-Means Clustering**

**Clustering is one of the most common exploratory data analysis techniques used to get an intuition about the structure of the data. It can be defined as the task of identifying subgroups in the data such that data points in the same subgroup (cluster) are very similar while data points in different clusters are very different. In other words, we try to find homogeneous subgroups within the data such that data points in each cluster are as similar as possible according to a similarity measure such as Euclidean-based distance or correlation-based distance. The decision of which similarity measure to use is application-specific.**


**The kmeans algorithm is an iterative algorithm that tries to partition the dataset into K pre-defined, distinct, non-overlapping subgroups (clusters) where each data point belongs to only one group. It tries to make the intra-cluster data points as similar as possible while also keeping the clusters as different (far apart) as possible. It assigns data points to a cluster such that the sum of the squared distances between the data points and the cluster’s centroid (the arithmetic mean of all the data points that belong to that cluster) is at the minimum. The less variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster.**

**The way the kmeans algorithm works is as follows:**


*   **Specify the number of clusters K.**
*   **Initialize the centroids by first shuffling the dataset and then randomly selecting K data points for the centroids without replacement.**
*   **Keep iterating the following steps until there is no change to the centroids, i.e. the assignment of data points to clusters isn’t changing:**
    *   **Compute the squared distance between each data point and all centroids.**
    *   **Assign each data point to the closest cluster (centroid).**
    *   **Compute the centroids for the clusters by taking the average of all the data points that belong to each cluster.**

**The approach kmeans follows to solve the problem is called Expectation-Maximization. The E-step is assigning the data points to the closest cluster. The M-step is computing the centroid of each cluster.**
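
The steps above can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not scikit-learn's implementation: `kmeans_sketch`, `X_toy` and the stopping rule based on `np.allclose` are hypothetical choices made for this example, and the notebook itself uses `sklearn.cluster.KMeans`.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Basic k-means loop for illustration only."""
    rng = np.random.default_rng(seed)
    # Initialise the centroids with k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # E-step: distance from every point to every centroid, then pick the closest.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # M-step: move each centroid to the mean of the points assigned to it
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated hypothetical groups of three points each.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
toy_labels, toy_centroids = kmeans_sketch(X_toy, k=2)
print(toy_labels)
print(toy_centroids)
```

Which group receives label 0 depends on the random initialisation, so only the grouping itself is meaningful, not the label values.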

**Calculation of Silhouette score**

**The silhouette score is used to evaluate the quality of the clusters created by clustering algorithms such as K-Means, in terms of how well each sample is grouped with samples that are similar to it. It is calculated for every sample. To calculate the silhouette score of an observation, the following two distances need to be found:**



*   **The mean distance between the observation and all other data points in the same cluster. This distance is also called the mean intra-cluster distance and is denoted by a.**
*   **The mean distance between the observation and all data points of the nearest neighbouring cluster. This distance is also called the mean nearest-cluster distance and is denoted by b.**


**The Silhouette Coefficient for a sample is** $S = \dfrac{b - a}{\max(a, b)}$**. It ranges from -1 to 1, and values close to 1 indicate that the sample is well matched to its own cluster.**
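
As a small worked check of this formula, the sketch below computes the coefficient by hand for one point and compares it with `sklearn.metrics.silhouette_samples`. The four one-dimensional points and the names `X_sil`, `labels_sil`, `s_manual` and `s_sklearn` are hypothetical.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Hypothetical 1-D data with two obvious clusters: {0, 1} and {10, 11}.
X_sil = np.array([[0.0], [1.0], [10.0], [11.0]])
labels_sil = np.array([0, 0, 1, 1])

# Manual silhouette coefficient for the first point (value 0.0).
a = abs(0.0 - 1.0)                                # mean distance to its own cluster
b = np.mean([abs(0.0 - 10.0), abs(0.0 - 11.0)])   # mean distance to the nearest other cluster
s_manual = (b - a) / max(a, b)

# Same quantity computed by scikit-learn.
s_sklearn = silhouette_samples(X_sil, labels_sil)[0]

print(s_manual, s_sklearn)
```

Here a = 1 and b = 10.5, so the coefficient is 9.5 / 10.5, which is close to 1 because the point is much nearer to its own cluster than to the other one.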


In [None]:
features_rec_mon=['Recency_log','Monetary_log']
X_features_rec_mon=rfm_df[features_rec_mon].values
scaler_rec_mon=preprocessing.StandardScaler()
X_rec_mon=scaler_rec_mon.fit_transform(X_features_rec_mon)
X=X_rec_mon
range_n_clusters = [2,3,4,5,6,7,8,9,10,11,12,13,14,15]
for n_clusters in range_n_clusters:
    clusterer = KMeans(n_clusters=n_clusters, n_init=5)  # use the loop value, not a fixed 3
    preds = clusterer.fit_predict(X)
    centers = clusterer.cluster_centers_

    score = silhouette_score(X, preds)
    print("For n_clusters = {}, silhouette score is {}".format(n_clusters, score))

##  **Elbow method**
  
   **It is one of the most widely used methods for selecting the value of K. The elbow method is an empirical method: it fits the model for a range of K values and, for each value, calculates the within-cluster sum of squares, i.e. the sum of the squared distances between each point and the centroid of its cluster.**

   **When the value of K is 1, the within-cluster sum of squares is high. As the value of K increases, the within-cluster sum of squares decreases.**


**Finally, we plot the within-cluster sum of squares against the K values and examine the graph carefully. At some point the curve stops decreasing sharply and flattens out, forming an "elbow". That point is considered the value of K.**
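
To make the elbow idea concrete, here is a minimal sketch on synthetic data. The three blob centres, the names `X_blobs`, `inertia_by_k` and `drops`, and the choice of `n_init=10` are assumptions made for this illustration; the RFM data in this notebook will not show such a clean elbow.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated blobs (hypothetical centres).
X_blobs, _ = make_blobs(n_samples=300,
                        centers=[[0, 0], [10, 10], [-10, 10]],
                        cluster_std=1.0, random_state=42)

# Within-cluster sum of squares (inertia) for K = 1..6.
inertia_by_k = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_blobs)
    inertia_by_k[k] = model.inertia_

# Reduction in inertia gained by adding the k-th cluster.
drops = {k: inertia_by_k[k - 1] - inertia_by_k[k] for k in range(2, 7)}

for k, value in inertia_by_k.items():
    print(k, round(value, 1))
```

On data like this, the reduction gained by going from two to three clusters is far larger than the reduction from three to four, which is the flattening that marks the elbow at K = 3.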

In [None]:
#applying elbow method on recency and monetary
features_rec_mon=['Recency_log','Monetary_log']
X_features_rec_mon=rfm_df[features_rec_mon].values
scaler_rec_mon=preprocessing.StandardScaler()
X_rec_mon= scaler_rec_mon.fit_transform(X_features_rec_mon)
X=X_rec_mon

In [None]:
sum_of_sq_dist ={}
for k in range(1,15):
  km = KMeans(n_clusters=k, init='k-means++', max_iter=1000)
  km=km.fit(X)
  sum_of_sq_dist[k] = km.inertia_

In [None]:
#Plot the graph for the sum of square distance values and Number of Clusters
sns.pointplot(x = list(sum_of_sq_dist.keys()), y = list(sum_of_sq_dist.values()))
plt.xlabel('Number of Clusters(k)')
plt.ylabel('Sum of Square Distances')
plt.title('Elbow Method For Optimal k')
plt.show()

In [None]:
kmeans = KMeans(n_clusters=2)
kmeans.fit(X)
y_kmeans= kmeans.predict(X)


In [None]:
plt.figure(figsize=(15,8))
plt.title('customer segmentation based on Recency and Monetary')
plt.scatter(X[:,0],X[:,1], c=y_kmeans, s=50, cmap='spring')

centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.5)

**We see that customers are well separated when we cluster them by Recency and Monetary**

## **What is DBSCAN Clustering**

**Clustering is an unsupervised learning method that divides the data points into a number of groups, such that the data points in the same group have similar properties and data points in different groups have different properties in some sense. It comprises many different methods based on different distance measures, e.g. K-Means (distance between points), affinity propagation (graph distance), mean-shift (distance between points), DBSCAN (distance between nearest points), Gaussian mixtures (Mahalanobis distance to centers), spectral clustering (graph distance), etc.**

**Fundamentally, all clustering methods use the same approach: first we calculate similarities and then we use them to cluster the data points into groups. Here we focus on the density-based spatial clustering of applications with noise (DBSCAN) method, which groups points that lie in dense regions and labels points in sparse regions as noise.**
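
The density idea can be seen on a tiny example. The points and the names `X_db`, `db_labels`, `n_clusters_found` and `n_noise` are hypothetical, and `min_samples=3` is chosen only to suit nine points; the notebook's own cells use `min_samples=15`.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical data: two dense groups of four points plus one isolated point.
X_db = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
                 [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1],
                 [20.0, 20.0]])

# A point is a core point if at least min_samples points (itself included)
# lie within distance eps of it.
db_labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X_db)

# DBSCAN marks noise points with the label -1.
n_clusters_found = len(set(db_labels)) - (1 if -1 in db_labels else 0)
n_noise = int(np.sum(db_labels == -1))

print(db_labels, n_clusters_found, n_noise)
```

Unlike K-Means, the number of clusters is not specified in advance; it follows from `eps` and `min_samples`, so both parameters usually need tuning on scaled data.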

In [None]:
#Applying DBSCAN on Recency and Monetary
y_pred = DBSCAN(eps=0.5, min_samples=15).fit_predict(X)
plt.figure(figsize=(13,8))
plt.scatter(X[:,0], X[:,1], c=y_pred)

**We see that customers are well separated when we cluster them by Recency and Monetary**

In [None]:
features_fre_mon=['Frequency_log','Monetary_log']
X_features_fre_mon=rfm_df[features_fre_mon].values
scaler_fre_mon=preprocessing.StandardScaler()
X_fre_mon=scaler_fre_mon.fit_transform(X_features_fre_mon)
X=X_fre_mon
range_n_clusters = [2,3,4,5,6,7,8,9,10,11,12,13,14,15]
for n_clusters in range_n_clusters:
    clusterer = KMeans(n_clusters=n_clusters)
    preds = clusterer.fit_predict(X)
    centers = clusterer.cluster_centers_

    score = silhouette_score(X, preds)
    print("For n_clusters = {}, silhouette score is {}".format(n_clusters, score))


In [None]:
#Applying Elbow Method on Frequency and Monetary
sum_of_sq_dist ={}
for k in range(1,15):
  km = KMeans(n_clusters=k, init= 'k-means++', max_iter=1000)
  km = km.fit(X)
  sum_of_sq_dist[k]= km.inertia_

  

In [None]:
#Plot the graph for the sum of square distance values and Number of Clusters
sns.pointplot(x = list(sum_of_sq_dist.keys()), y = list(sum_of_sq_dist.values()))
plt.xlabel('Number of Clusters(k)')
plt.ylabel('Sum of Square Distances')
plt.title('Elbow Method For Optimal k')
plt.show()

In [None]:
kmeans = KMeans(n_clusters= 2 )
kmeans.fit(X)
y_kmeans = kmeans.predict(X)

In [None]:
plt.figure(figsize=(15,8))
plt.title('customer segmentation based on Frequency and Monetary')
plt.scatter(X[:,0], X[:,1] , c=y_kmeans, s=50, cmap='PiYG')

centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.5)

**We see that customers are well separated when we cluster them by Frequency and Monetary.**

In [None]:
#Applying DBSCAN on Frequency and Monetary
y_pred = DBSCAN(eps=0.5, min_samples=15).fit_predict(X)
plt.figure(figsize=(13,8))
plt.scatter(X[:,0], X[:,1], c=y_pred)


In [None]:
plt.figure(figsize=(13,8))
plt.title('R vs M and F vs M')
plt.scatter(rfm_df.Recency_log,rfm_df.Monetary_log,alpha=0.5)
plt.scatter(rfm_df.Frequency_log,rfm_df.Monetary_log,alpha=0.5)

# **Applying Silhouette Method on Recency, Frequency and Monetary**

In [None]:
feature_vector=['Recency_log','Frequency_log','Monetary_log']
X_features=rfm_df[feature_vector].values
scaler=preprocessing.StandardScaler()
X=scaler.fit_transform(X_features)

In [None]:
range_n_clusters = [2,3,4,5,6,7,8,9,10,11,12,13,14,15]

for n_clusters in range_n_clusters:
    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)

    # The 1st subplot is the silhouette plot
    # The silhouette coefficient can range from -1, 1 but in this example all
    # lie within [-0.1, 1]
    ax1.set_xlim([-0.1, 1])
    # The (n_clusters+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])

    # Initialize the clusterer with n_clusters value and a random generator
    # seed of 10 for reproducibility.
    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(X)

    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(X, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)

    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(X, cluster_labels)

    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]

        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")

    # The vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

    # 2nd Plot showing the actual clusters formed
    colors = cm.nipy_spectral(cluster_labels.astype(float) /n_clusters)
    ax2.scatter(X[:, 0], X[:, 1], marker='.', s=30, lw=0, alpha=0.7,
                c=colors, edgecolor='k')

    # Labeling the clusters
    centers = clusterer.cluster_centers_
    # Draw white circles at cluster centers
    ax2.scatter(centers[:, 0], centers[:, 1], marker='o',
                c="white", alpha=1, s=200, edgecolor='k')

    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1,
                    s=50, edgecolor='k')

    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")
    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')

plt.show()
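The silhouette plots above are inspected visually. A compact way to summarize the same analysis is to collect the average score for each candidate and pick the maximum. This sketch assumes synthetic four-blob data from `make_blobs`; the variable names `X_demo`, `scores` and `best_k` are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Four synthetic blobs at the corners of a square (illustrative data only)
centers = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]]
X_demo, _ = make_blobs(n_samples=400, centers=centers,
                       cluster_std=0.5, random_state=0)

# Average silhouette score for each candidate number of clusters
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

# The candidate with the highest average score is the silhouette-based choice
best_k = max(scores, key=scores.get)
print("best k by silhouette:", best_k)
```

A high average can still hide one poorly formed cluster, which is why the per-cluster silhouette plots remain useful.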

In [None]:
kmeans = KMeans(n_clusters=2)
kmeans.fit(X)
y_kmeans= kmeans.predict(X)

In [None]:
plt.figure(figsize=(14,10))
plt.title('Customer segmentation based on Recency, Frequency and Monetary')
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='RdYlBu')

centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='yellow', s=200, alpha=0.5)

**We see that customers are well separated when we cluster them by Recency, Frequency and Monetary.**

In [None]:
sum_of_sq_dist = {}
for k in range(1,15):
    km = KMeans(n_clusters= k, init= 'k-means++', max_iter= 1000)
    km = km.fit(X)
    sum_of_sq_dist[k] = km.inertia_

In [None]:
#Plot the graph for the sum of square distance values and Number of Clusters
sns.pointplot(x = list(sum_of_sq_dist.keys()), y = list(sum_of_sq_dist.values()))
plt.xlabel('Number of Clusters(k)')
plt.ylabel('Sum of Square Distances')
plt.title('Elbow Method For Optimal k')
plt.show()

In [None]:
#Perform K-Mean Clustering or build the K-Means clustering model
KMean_clust = KMeans(n_clusters= 2, init= 'k-means++', max_iter= 1000)
KMean_clust.fit(X)

#Find the clusters for the observation given in the dataset
rfm_df['Cluster'] = KMean_clust.labels_
rfm_df.head(10)
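Once each customer carries a cluster label, the segments can be interpreted by averaging the RFM values per cluster. The sketch below uses a small made-up DataFrame; the column names `Recency`, `Frequency` and `Monetary` and all values are assumptions for illustration, not figures from the dataset.

```python
import pandas as pd

# Hypothetical labelled RFM table (values invented for illustration)
demo = pd.DataFrame({
    "Recency":   [5, 10, 200, 250, 8, 300],
    "Frequency": [40, 35, 2, 1, 50, 3],
    "Monetary":  [5000.0, 4200.0, 150.0, 90.0, 6100.0, 200.0],
    "Cluster":   [0, 0, 1, 1, 0, 1],
})

# Mean RFM values per cluster, plus the number of customers in each
profile = demo.groupby("Cluster")[["Recency", "Frequency", "Monetary"]].mean()
profile["Customers"] = demo.groupby("Cluster").size()
print(profile)
```

In this toy table, cluster 0 holds recent, frequent, high-spending customers and cluster 1 the opposite, which mirrors the kind of contrast described in the project summary.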

# **Dendrogram to find the optimal number of clusters**

In [None]:
# Using the dendrogram to find the optimal number of clusters
import scipy.cluster.hierarchy as sch
plt.figure(figsize=(13,8))
dendrogram = sch.dendrogram(sch.linkage(X, method = 'ward'))
plt.title('Dendrogram')
plt.xlabel('Customers')
plt.ylabel('Euclidean Distances')
plt.show() # find largest vertical distance we can make without crossing any other horizontal line

*   **The number of clusters is the number of vertical lines intersected by a horizontal line drawn at the threshold of 90.**

*   **No. of Clusters = 2**
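Cutting the dendrogram at a height can also be done in code with SciPy's `fcluster`, which returns flat cluster labels for a given distance threshold. This is a sketch on synthetic points; the threshold here is placed between the last two merge heights as an assumption, rather than at the value 90 read from the plot above.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)

# Two synthetic groups of points (illustrative data only)
group_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(20, 2))
group_b = rng.normal(loc=[8.0, 8.0], scale=0.3, size=(20, 2))
points = np.vstack([group_a, group_b])

# Ward linkage; column 2 of Z holds the merge heights in increasing order
Z = linkage(points, method="ward")
merge_heights = Z[:, 2]

# Place the cut halfway between the final merge and the one before it
threshold = (merge_heights[-1] + merge_heights[-2]) / 2

# Flat clusters: points joined below the threshold share a label
flat_labels = fcluster(Z, t=threshold, criterion="distance")
n_clusters = len(np.unique(flat_labels))
print("clusters below threshold:", n_clusters)
```

The large gap between the last two merge heights is the numerical counterpart of the longest uninterrupted vertical line in the dendrogram.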

In [None]:
#Fitting hierarchical clustering to the standardized RFM features
#Ward linkage uses Euclidean distance by default, so no distance argument is needed
hc = AgglomerativeClustering(n_clusters = 2, linkage = 'ward')
y_hc = hc.fit_predict(X)

In [None]:
#Visualizing the clusters (first two dimensions only)
plt.figure(figsize=(13,5))
plt.scatter(X[y_hc == 0,0 ], X[y_hc == 0,1] , s=100 , c = 'green', label = 'Cluster 1')
plt.scatter(X[y_hc == 1, 0], X[y_hc == 1, 1], s = 100, c = 'blue', label = 'Cluster 2')

plt.title('Clusters of Customers')
plt.xlabel('Recency_log (standardized)')

plt.ylabel('Frequency_log (standardized)')
plt.legend()
plt.show()

**Applying different clustering algorithms to our dataset, we find that the optimal number of clusters is 2.**
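One way to check that two algorithms really agree is to compare their label assignments with the adjusted Rand index, which is 1 for identical partitions regardless of how the labels are numbered. The sketch below runs on synthetic two-blob data, which is an assumption; it does not reproduce the notebook's results.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Two well-separated synthetic blobs (illustrative data only)
X_demo, _ = make_blobs(n_samples=200, centers=[[0.0, 0.0], [8.0, 8.0]],
                       cluster_std=0.7, random_state=1)

# Cluster the same data with two different algorithms
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_demo)
hc_labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X_demo)

# Adjusted Rand index: invariant to label permutation, 1.0 means same partition
ari = adjusted_rand_score(km_labels, hc_labels)
print("agreement (ARI):", ari)
```

A value well below 1 would indicate that the two methods choose the same number of clusters but split the customers differently.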

In [None]:
#Applying DBSCAN to Recency ,Frequency and Monetary
y_pred = DBSCAN(eps=0.5, min_samples=15).fit_predict(X)
plt.figure(figsize=(13,5))
plt.scatter(X[:,0] , X[:,1], c=y_pred)


**With DBSCAN, we see that customers are well separated when we cluster them by Recency, Frequency and Monetary, and the optimal number of clusters is 3.**
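Because the DBSCAN result depends on `eps`, a common heuristic is to look at the sorted distance from each point to its `min_samples`-th nearest neighbour and choose `eps` near the bend of that curve. The sketch below uses random standard-normal points as a stand-in for the standardized feature matrix, which is an assumption; only `min_samples=15` and `eps=0.5` are taken from the cells above.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Stand-in for a standardized 3-feature matrix (illustrative data only)
points = rng.normal(size=(500, 3))

min_samples = 15

# Querying the fitted data returns each point as its own first neighbour,
# which matches DBSCAN counting the point itself towards min_samples
nn = NearestNeighbors(n_neighbors=min_samples).fit(points)
distances, _ = nn.kneighbors(points)

# Sorted distance to the min_samples-th neighbour: the "k-distance" curve
k_dist = np.sort(distances[:, -1])

# Share of points that would be core points at eps = 0.5
core_fraction = float(np.mean(k_dist <= 0.5))
print("fraction of core points at eps=0.5:", core_fraction)
```

Plotting `k_dist` and choosing `eps` where the curve turns sharply upward is a starting point only; the resulting clusters should still be checked against the business interpretation.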

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ? 
Explain Briefly.

Answer Here.

# **Conclusion**

Write the conclusion here.
