
# **Project Name  - Online Retail Customer Segmentation**   



##### **Project Type**    - Unsupervised
##### **Contribution**    - Individual
##### **Team Member**     - Gaurav Mehra



# **Project Summary -**

In this project, we aimed to identify customer segments for a UK-based online retail company that specializes in selling unique gifts. We analyzed the transactional data between December 2010 and December 2011 and found that many customers were wholesalers.

We started by cleaning and processing the data, and then performed exploratory data analysis to gain insights into the dataset. Next, we used the RFM model to quantify customer behavior, including recency, frequency, and monetary value. We then applied clustering algorithms such as KMeans and DBSCAN to identify two customer segments based on their behavior.

We concluded that customers with high recency and low frequency and monetary values belonged to one segment, while those with low recency and high frequency and monetary values belonged to another segment.

This analysis can be further modified and customized based on the company's objectives and preferences. For example, the clustering could be performed on additional features such as preferred product types or customer lifetime value. The labeled feature after clustering can also be fed into classification algorithms to predict the classes for new observations.

In summary, this project showcased how machine learning can be used to perform customer segmentation and gain insights into customer behavior. It's important to note that machine learning is an art, and there is no right or wrong way to perform it. We always strive to improve our outcomes based on our final objectives.

# **GitHub Link -**

https://github.com/Gauravmehra1/Online-Retail-Customer-Segmentation

# **Problem Statement**


*   **The task is to perform customer segmentation on a transnational dataset of a UK-based and registered non-store online retail company.**

*   **The company sells unique all-occasion gifts and has many wholesale customers.**

*   **The results can be used to better understand customer behavior and tailor marketing strategies to each segment.**

#### **Define Your Business Objective?**

**Customer segmentation is the practice of dividing a company’s customers into groups that reflect similarity among customers in each group. The goal of segmenting customers is to decide how to relate to customers in each segment in order to maximize the value of each customer to the business.**

**Customer segmentation has the potential to allow marketers to address each customer in the most effective way. Using the large amount of data available on customers (and potential customers), a customer segmentation analysis allows marketers to identify discrete groups of customers with a high degree of accuracy based on demographic, behavioral and other indicators.**

**Since the marketer’s goal is usually to maximize the value (revenue and/or profit) from each customer, it is critical to know in advance how any particular marketing action will influence the customer. Ideally, such “action-centric” customer segmentation will not focus on the short-term value of a marketing action, but rather the long-term customer lifetime value (CLV) impact that such a marketing action will have. Thus, it is necessary to group, or segment, customers according to their CLV.**

**Of course, it is always easier to make assumptions and use “gut feelings” to define rules which will segment customers into logical groupings, e.g., customers who came from a particular source, who live in a particular location or who bought a particular product/service. However, these high-level categorizations will seldom lead to the desired results.**

**It is obvious that some customers will spend more than others during their relationship with a company. The best customers will spend a lot for many years. Good customers will spend modestly over a long period of time, or will spend a lot over a short period of time. Others won’t spend too much and/or won’t stick around too long.**

**The right approach to segmentation analysis is to segment customers into groups based on predictions regarding their total future value to the company, with the goal of addressing each group (or individual) in the way most likely to maximize that future, or lifetime, value.**

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required. 
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits. 
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart follows the format below and answers the questions that come after it.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from datetime import datetime
import datetime as dt

import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=UserWarning)

### Dataset Loading

In [None]:
# Mount Google Drive so the dataset can be imported
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load the Online Retail dataset from Drive
customer_df = pd.read_excel('/content/drive/MyDrive/EDA/Online Retail.xlsx')

### Dataset First View

In [None]:
# Dataset First Look
customer_df.head()

In [None]:
# View the bottom 5 rows to get a glimpse of the data
customer_df.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
customer_df.shape

### Dataset Information

In [None]:
# Dataset Info
customer_df.info()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
customer_df.isnull().sum()

In [None]:
customer_df.dropna(inplace=True)

In [None]:
customer_df.isnull().sum()

#### Duplicate Values



*Why is it important to remove duplicate records from my data?*



*   "Duplication" means that the dataset contains repeated records, which can result from data entry errors or from the way the data was collected. Removing duplicates saves time and money, for example by not sending identical communications to the same person more than once.
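As a small illustration of how pandas counts duplicates, the sketch below uses a made-up three-column table (not the real dataset). It assumes that a duplicate means a row whose every column matches an earlier row, which is the default behaviour of `duplicated()` and `drop_duplicates()`.

```python
import pandas as pd

# Toy invoice lines (made-up values); rows 0 and 2 are identical.
toy_df = pd.DataFrame({
    "InvoiceNo": ["536365", "536366", "536365", "536367"],
    "StockCode": ["85123A", "71053", "85123A", "22752"],
    "Quantity": [6, 6, 6, 2],
})

# duplicated() marks only the second and later copies of a row,
# so its sum equals the number of rows that drop_duplicates() removes.
n_duplicates = int(toy_df.duplicated().sum())
deduped_df = toy_df.drop_duplicates()

print(n_duplicates)      # number of repeated rows
print(len(deduped_df))   # rows left after removing them
```

The first occurrence of each row is kept, so no information is lost by dropping the repeats.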

In [None]:
# Dataset Duplicate Value Count
customer_df.duplicated().sum()

In [None]:
customer_df[customer_df.duplicated()]

In [None]:
# Drop duplicate rows, then confirm that none remain
customer_df=customer_df.drop_duplicates()
len(customer_df[customer_df.duplicated()])

**Invoice numbers that start with 'C' indicate cancellations, so we drop those rows.**

In [None]:
customer_df['InvoiceNo'] = customer_df['InvoiceNo'].astype('str') 

In [None]:
# Inspect the cancelled invoices (InvoiceNo starts with 'C')
customer_df[customer_df['InvoiceNo'].str.startswith('C')]

In [None]:
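# Keep only non-cancelled invoices; .copy() avoids SettingWithCopyWarning when columns are added later
customer_df=customer_df[~customer_df['InvoiceNo'].str.startswith('C')].copy()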

### What did you know about your dataset?

Answer Here

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
customer_df.columns

In [None]:
# Dataset Describe
customer_df.describe()

### Variables Description 

Answer Here

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
customer_df.nunique()

In [None]:
customer_df.shape

## ***3. Data Wrangling***

### Data Wrangling Code

**Split the InvoiceDate column into year, month, day, hour, minute, and second.**

In [None]:
# Extract the date and time components from InvoiceDate
customer_df['InvoiceDate_year']=customer_df['InvoiceDate'].dt.year
customer_df['InvoiceDate_month']=customer_df['InvoiceDate'].dt.month
customer_df['InvoiceDate_day']=customer_df['InvoiceDate'].dt.day 
customer_df['InvoiceDate_hour']=customer_df['InvoiceDate'].dt.hour 
customer_df['InvoiceDate_minute']=customer_df['InvoiceDate'].dt.minute 
customer_df['InvoiceDate_second']=customer_df['InvoiceDate'].dt.second

In [None]:
customer_df.dtypes

In [None]:
customer_df.shape

### What all manipulations have you done and insights you found?

Answer Here.

# **Exploratory Data Analysis**

**Why do we perform EDA?**

 **An EDA is a thorough examination meant to uncover the underlying structure of a data set and is important for a company because it exposes trends, patterns, and relationships that are not readily apparent.** 

In [None]:
customer_df['CustomerID'].nunique()

In [None]:
# Find the most active customers (value_counts already sorts in descending order).
# rename_axis/reset_index(name=...) gives the same column names on old and new pandas versions.
active_customer = customer_df['CustomerID'].value_counts().rename_axis('CustomerID').reset_index(name='count')
active_customer

In [None]:
active_customer.head()

# **Analysis of Categorical Features**

In [None]:
categorical_columns= list(customer_df.select_dtypes(['object']).columns)
categorical_features= pd.Index(categorical_columns)
categorical_features

# **Analysis of Description Variable**

In [None]:
# Count how often each product description appears
description_df=customer_df['Description'].value_counts().rename_axis('Description_name').reset_index(name='Count')
description_df

In [None]:
# Same counts as a Series (pd.value_counts is deprecated in recent pandas)
customer_df['Description'].value_counts()


## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(18,11))
plt.title('Top 5 Product Name')
sns.barplot(x='Description_name',y='Count',data=description_df[:5])


In [None]:
# Bottom 5 products by number of records
plt.figure(figsize=(18,11))
plt.title('Bottom 5 Product Name')
sns.barplot(x='Description_name',y='Count',data=description_df[-5:])

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 2

In [None]:
# Count how often each stock code appears
StockCode_df=customer_df['StockCode'].value_counts().rename_axis('StockCode_Name').reset_index(name='Count')

In [None]:
# Chart - 2 visualization code
plt.figure(figsize=(13,8))
plt.title('Top 5 Stock Name')
sns.barplot(x='StockCode_Name',y='Count',data=StockCode_df[:5])

In [None]:
plt.figure(figsize=(13,8))
plt.title('Bottom 5 Stock Name')
sns.barplot(x='StockCode_Name',y='Count',data=StockCode_df[-5:])

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 3

In [None]:
# Count the number of records per country
country_df=customer_df['Country'].value_counts().rename_axis('Country_Name').reset_index(name='count')
country_df

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(13,8))
plt.title('Top 5 Countries by Number of Transactions')
sns.barplot(x='Country_Name',y='count',data=country_df[:5])

In [None]:
plt.figure(figsize=(13,8))
plt.title('Bottom 5 Countries by Number of Transactions')
sns.barplot(x='Country_Name',y='count',data=country_df[-5:])

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

# **Analysis Numeric Features**

#### Chart - 4

In [None]:
# include='number' also catches the int32 date parts that newer pandas versions produce
numerical_columns=list(customer_df.select_dtypes(include='number').columns)
numerical_features=pd.Index(numerical_columns)
numerical_features

In [None]:
# Chart - 4 visualization code
for col in numerical_features:
  fig=plt.figure(figsize=(9,6))
  ax=fig.gca()
  feature =(customer_df[col])
  feature.hist(bins=50,ax=ax)
  ax.axvline(feature.mean(),color='magenta', linestyle='dashed', linewidth=2)
  ax.axvline(feature.median(),color='cyan', linestyle='dashed', linewidth=2)
  ax.set_title(col)
  plt.show()
  print( "Skewness :",customer_df[col].skew())
  print( "Kurtosis :",customer_df[col].kurt())

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 5

In [None]:
# Plot the distribution (histogram with KDE) of each numerical feature
for col in numerical_features:
  fig=plt.figure(figsize=(9,6))
  ax=fig.gca()
  feature=customer_df[col]
  # sns.distplot is deprecated; histplot with kde=True draws a comparable figure
  sns.histplot(feature, kde=True, stat='density', bins=50, ax=ax)
  # Mean (magenta) and median (cyan) reference lines
  ax.axvline(feature.mean(),color='magenta', linestyle='dashed', linewidth=2)
  ax.axvline(feature.median(),color='cyan', linestyle='dashed', linewidth=2)
  ax.set_title(col)
  plt.show()
  print( "Skewness :",customer_df[col].skew())
  print( "Kurtosis :",customer_df[col].kurt())

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6

In [None]:
# Draw a boxplot for each numerical feature to spot outliers

for col in numerical_features:
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    customer_df.boxplot(column=col, ax=ax)
    ax.set_title('Boxplot of ' + col)
    # Show each figure inside the loop so every feature is rendered
    plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 7 - Correlation Heatmap

In [None]:
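# Correlation Heatmap visualization code
plt.figure(figsize=(15,8))
# Restrict to numeric columns; text and datetime columns cannot be correlated
correlation=customer_df.select_dtypes(include='number').corr()
sns.heatmap(abs(correlation), annot=True, cmap='coolwarm')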

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

# **Feature engineering**

**Create a new feature, day (the day of the week), from InvoiceDate.**

In [None]:
customer_df['day']=customer_df['InvoiceDate'].dt.day_name()

**Create a new feature, TotalAmount, as Quantity multiplied by UnitPrice.**

In [None]:
customer_df['TotalAmount']=customer_df['Quantity']*customer_df['UnitPrice']
customer_df.head()

In [None]:
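plt.figure(figsize=(15,8))
plt.title('Distribution of TotalAmount')
# sns.distplot is deprecated; histplot with kde=True draws a comparable figure
sns.histplot(customer_df['TotalAmount'], kde=True, stat='density', bins=50, color='red')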

In [None]:
customer_df['TotalAmount'].describe()

In [None]:
# Count the number of records per day of the week
day_df=customer_df['day'].value_counts().rename_axis('Day_Name').reset_index(name='Count')
day_df

In [None]:
plt.figure(figsize=(13,8))
plt.title('Day')
sns.barplot(x='Day_Name',y='Count',data=day_df)

**Most items are purchased on Thursday, Wednesday, and Tuesday.**

In [None]:
# Count the number of records per month
month_df=customer_df['InvoiceDate_month'].value_counts().rename_axis('month_name').reset_index(name='Count')
month_df

In [None]:
plt.figure(figsize=(18,5))
plt.title('month')
sns.barplot(x='month_name',y='Count',data=month_df)

*  **Most gifts are purchased in September, October, November, and December.**
*  **The fewest gifts are purchased in January, February, and April.**


In [None]:
# Count the number of records per hour of the day
hour_df=customer_df['InvoiceDate_hour'].value_counts().rename_axis('Hour_Name').reset_index(name='Count')
hour_df


In [None]:
plt.figure(figsize=(18,7))
plt.title('hour')
sns.barplot(x='Hour_Name', y='Count',data=hour_df)

* **Most items are purchased in the afternoon, as the graph shows.**

In [None]:
def time_type(time):
  # Map the hour of day (0-23) to a part of the day
  if 6 <= time <= 11:
    return 'morning'
  elif 12 <= time <= 17:
    return 'afternoon'
  else:
    return 'evening'
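For comparison, the same three buckets can be produced without a row-by-row function. The sketch below uses `pd.cut` on a made-up Series of hours; the bin edges are an assumption chosen to mirror the 6-11, 12-17, and 18-23 ranges in `time_type`.

```python
import pandas as pd

# Hypothetical hours of day (0-23), standing in for InvoiceDate_hour.
hours = pd.Series([6, 11, 12, 17, 18, 20])

# With right=False the bins are [6, 12), [12, 18) and [18, 24).
# Assumption: every order falls between 06:00 and 23:59. Hours before 06:00
# would come out as NaN here, whereas time_type() labels them 'evening'.
time_of_day = pd.cut(
    hours,
    bins=[6, 12, 18, 24],
    right=False,
    labels=["morning", "afternoon", "evening"],
)
print(time_of_day.tolist())
```

The result is an ordered categorical, so plots and group-bys keep the morning, afternoon, evening order instead of sorting alphabetically.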

In [None]:
customer_df['Time_type']=customer_df['InvoiceDate_hour'].apply(time_type)

In [None]:
plt.figure(figsize=(13,5))
plt.title('Time_type')
sns.countplot(x='Time_type',data=customer_df)


**Most customers purchase in the afternoon, a moderate number purchase in the morning, and the fewest purchase in the evening.**

# **Creating RFM Model**

**Before applying any clustering algorithm, we need to decide which quantitative features the algorithm will segment on. Examples include the amount spent, how active the customer is, and how recently they last purchased.**

**The RFM model (Recency, Frequency, Monetary) is one way to build these features. For each customer we determine recency (days since the last purchase), frequency (how actively the customer repurchases), and monetary value (the customer's total expenditure). A common second step divides each of these features into bins and calculates a score for each customer. That approach does not require machine learning, because the segmentation can be done manually. We therefore skip the second step and feed the RFM features directly to the clustering algorithms.**
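Feeding RFM features to a clustering algorithm could look roughly like the sketch below. It is an illustration only: the data is synthetic, and the log transform, the scaling step, `k=2`, and the names `toy_rfm` and `Cluster` are all assumptions rather than the settings used in this project. It also assumes scikit-learn is installed.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic RFM table (made-up numbers): 20 recent, frequent, high-spend
# customers followed by 20 lapsed, infrequent, low-spend customers.
rng = np.random.default_rng(0)
toy_rfm = pd.DataFrame({
    "Recency": np.concatenate([rng.integers(1, 15, 20), rng.integers(150, 370, 20)]),
    "Frequency": np.concatenate([rng.integers(80, 200, 20), rng.integers(1, 15, 20)]),
    "Monetary": np.concatenate([rng.uniform(2000, 9000, 20), rng.uniform(20, 300, 20)]),
})

# RFM values are often right-skewed, so compress them with log1p, then
# standardise so that no single feature dominates the distance calculation.
features = StandardScaler().fit_transform(np.log1p(toy_rfm))

# k=2 mirrors the two segments mentioned in the project summary.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
toy_rfm["Cluster"] = kmeans.fit_predict(features)

# Average profile of each cluster.
print(toy_rfm.groupby("Cluster")[["Recency", "Frequency", "Monetary"]].mean().round(1))
```

On real data the number of clusters would normally be checked with a diagnostic such as the elbow method or silhouette score rather than fixed in advance.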



*   **Recency = Latest Date - Last Invoice Date**
*   **Frequency = Count of InvoiceNo entries (transactions) for each customer**
*   **Monetary = Sum of Total Amount for each customer**
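The three definitions above can be checked by hand on a tiny example. The sketch below uses four made-up transactions for two hypothetical customers; the column names follow this notebook, while `toy_df`, `toy_rfm`, and the values are illustrative assumptions.

```python
import datetime as dt
import pandas as pd

# Toy transactions (made-up values) with the columns used in this notebook.
toy_df = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2],
    "InvoiceNo": ["1001", "1001", "1002", "2001"],
    "InvoiceDate": pd.to_datetime(
        ["2011-12-01 10:00", "2011-12-01 10:00", "2011-12-08 09:30", "2011-11-10 14:00"]
    ),
    "TotalAmount": [10.0, 5.0, 20.0, 7.5],
})

# Reference date: the day after the last invoice date in the full dataset.
latest_date = dt.datetime(2011, 12, 10)

toy_rfm = toy_df.groupby("CustomerID").agg(
    # Whole days between the reference date and the customer's last purchase
    Recency=("InvoiceDate", lambda s: (latest_date - s.max()).days),
    # Counts invoice lines; "nunique" would count distinct invoices instead
    Frequency=("InvoiceNo", "count"),
    # Total spend per customer
    Monetary=("TotalAmount", "sum"),
)
print(toy_rfm)
```

Customer 1 last bought on 8 December, so recency is 1 day, with 3 invoice lines totalling 35.0. Customer 2 last bought on 10 November, giving a recency of 29 days, 1 line, and 7.5.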

In [None]:
#Set Latest date 2011-12-10 as last invoice date was 2011-12-09. This is to calculate the number of days from recent purchase
Latest_Date = dt.datetime(2011,12,10)

#Create RFM Modelling scores for each customer
rfm_df = customer_df.groupby('CustomerID').agg({'InvoiceDate': lambda x: (Latest_Date - x.max()).days, 'InvoiceNo': lambda x: len(x), 'TotalAmount': lambda x: x.sum()})

#Convert Invoice Date into type int
rfm_df['InvoiceDate'] = rfm_df['InvoiceDate'].astype(int)

#Rename column names to Recency, Frequency and Monetary
rfm_df.rename(columns={'InvoiceDate': 'Recency', 'InvoiceNo': 'Frequency', 'TotalAmount': 'Monetary'}, inplace=True)
rfm_df.reset_index().head()

In [None]:
# Describe Recency
rfm_df.Recency.describe()

In [None]:
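# Plot the Recency distribution
x =rfm_df['Recency']
plt.figure(figsize=(18,5))
# sns.distplot is deprecated; histplot with kde=True draws a comparable figure
sns.histplot(x, kde=True, stat='density', bins=50)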

In [None]:
rfm_df.Frequency.describe()

In [None]:
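# Plot the Frequency distribution
x = rfm_df['Frequency']
# displot is figure-level and creates its own figure, so size it with height/aspect
sns.displot(x, height=8, aspect=13/8)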


In [None]:
# describe Monetary
rfm_df.Monetary.describe()

In [None]:
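# Monetary distribution plot, taking observations which have a monetary value less than 10000
x = rfm_df.query('Monetary < 10000')['Monetary']
# displot is figure-level and creates its own figure, so size it with height/aspect
sns.displot(x, height=5, aspect=13/5)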

**Split into four segments using quantiles**

In [None]:
# Compute the quartile cut-offs that split each feature into 4 segments
quantiles = rfm_df.quantile(q=[0.25,0.5,0.75])
quantiles = quantiles.to_dict()
quantiles


In [None]:
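# Scoring functions for the R, F and M segments (1 = best, 4 = worst)
# x = value, p = column name, d = dict of quantile cut-offs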
def RScoring(x,p,d):
  if x <= d[p][0.25]:
    return 1
  elif x <= d[p][0.50]:
    return 2
  elif x<= d[p][0.75]:
    return 3
  else :
    return 4

def FnMScoring(x,p,d):
  if x <= d[p][0.25]:
    return 4
  elif x <= d[p][0.50]:
    return 3
  elif x<= d[p][0.75]:
    return 2
  else :
    return 1
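The scoring functions are defined above but not yet applied. One way to apply quartile scores of this kind is sketched below on a made-up five-customer table. The helper `quartile_score` is hypothetical: it folds the logic of `RScoring` and `FnMScoring` into one function so the example runs on its own. The `RFMGroup` and `RFMScore` columns are also illustrative assumptions.

```python
import pandas as pd

# Toy RFM table (made-up values) standing in for rfm_df.
toy_rfm = pd.DataFrame(
    {
        "Recency": [2, 30, 90, 200, 365],
        "Frequency": [150, 60, 25, 8, 1],
        "Monetary": [5000.0, 1200.0, 400.0, 150.0, 20.0],
    },
    index=pd.Index([101, 102, 103, 104, 105], name="CustomerID"),
)
cuts = toy_rfm.quantile(q=[0.25, 0.5, 0.75]).to_dict()


def quartile_score(x, column, cuts, low_is_best):
    """Hypothetical helper: return 1 (best) to 4 (worst) from quartile cut-offs."""
    if x <= cuts[column][0.25]:
        rank = 1
    elif x <= cuts[column][0.50]:
        rank = 2
    elif x <= cuts[column][0.75]:
        rank = 3
    else:
        rank = 4
    # Low recency is good; low frequency or monetary value is bad.
    return rank if low_is_best else 5 - rank


toy_rfm["R"] = toy_rfm["Recency"].apply(quartile_score, args=("Recency", cuts, True))
toy_rfm["F"] = toy_rfm["Frequency"].apply(quartile_score, args=("Frequency", cuts, False))
toy_rfm["M"] = toy_rfm["Monetary"].apply(quartile_score, args=("Monetary", cuts, False))

# Concatenate the three digits into a group label and add them into one score.
toy_rfm["RFMGroup"] = toy_rfm["R"].astype(str) + toy_rfm["F"].astype(str) + toy_rfm["M"].astype(str)
toy_rfm["RFMScore"] = toy_rfm[["R", "F", "M"]].sum(axis=1)
print(toy_rfm)
```

With this convention a lower `RFMScore` indicates a more valuable customer: customer 101 scores 3 (group `111`), while customer 105 scores 12 (group `444`).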



## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ? 
Explain Briefly.

Answer Here.

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***