<a href="https://colab.research.google.com/github/Aramnani/Capstone-Project---4---Online-Retail-Customer-Segmentation/blob/main/Online_Retail_Customer_Segmentation_Capstone_Project_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -



##### **Project Type**    - Unsupervised
##### **Contribution**    - Individual
##### **Name -** Aakash Ramnani


# **Project Summary -**

In this project, the task is to identify major customer segments in a transnational data set that contains all transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based, registered non-store online retailer.

The project is carried out in the following steps:

1. **Basic EDA on Dataset -**  This step involves exploring the data set, checking the relationships between variables, and checking their distributions.

2. **RFM Analysis -** RFM (Recency, Frequency, Monetary) analysis is a customer segmentation technique that uses past purchase behavior to divide customers into groups.

  - **RECENCY (R):** Days since last purchase
  - **FREQUENCY (F):** Total number of purchases
  - **MONETARY VALUE (M):** Total money this customer spent.

3. **Visualization Using different Charts -** This step involves creating various charts and graphs to visualize the data and identify the patterns and relationships among the features. Some of the charts that can be used are bar charts, scatter plots, heat maps, etc.

4. **Hypothesis Testing -** This step involves testing some hypotheses or assumptions about the data using statistical methods.

5. **Feature Engineering for clustering -** This step involves creating new features or transforming existing features to make them suitable for clustering.

6. **Clustering analysis using k-means and agglomerative -** This step involves applying k-means and agglomerative clustering algorithms to group the customers based on their RFM score. This step can help to identify the optimal segments of customers.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**In this project, your task is to identify major customer segments on a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.**

# **General Guidelines** : -

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.

     The additional credits will have advantages over other students during Star Student selection.

             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following format is answered for each chart.


```
# Chart visualization code
```


*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np

import time, warnings
import datetime as dt

#modules for predictive models
import sklearn.cluster as cluster
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
import scipy.cluster.hierarchy as sch

from sklearn.metrics import silhouette_samples, silhouette_score
import scipy.stats as stats

#visualizations
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
%matplotlib inline
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
dataset = pd.read_csv('/content/drive/MyDrive/Almabetter/machine learning/project/unsupervised/Online Retail.xlsx - Online Retail.csv')

### Dataset First View

In [None]:
# Dataset First Look
dataset.head()

In [None]:
dataset.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print("No. of rows in dataset are : ", dataset.shape[0])
print("No. of columns in dataset are : ", dataset.shape[1])

### Dataset Information

In [None]:
# Dataset Info
dataset.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
dataset.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
dataset.isnull().sum()

In [None]:
# Visualizing the missing values
sns.heatmap(dataset.isnull(), cbar=False)

### What did you know about your dataset?

- No. of rows in dataset are :  541909
- No. of columns in dataset are :  8
- There are missing values in the Description and CustomerID columns.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
dataset.columns

In [None]:
# Dataset Describe
dataset.describe()

### Variables Description

- **InvoiceNo:** Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'C', it indicates a cancellation.
- **StockCode:** Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
- **Description:** Product (item) name. Nominal.
- **Quantity:** The quantities of each product (item) per transaction. Numeric.
- **InvoiceDate:** Invoice date and time. Numeric, the day and time when each transaction was generated.
- **UnitPrice:** Unit price. Numeric, Product price per unit in sterling.
- **CustomerID:** Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
- **Country:** Country name. Nominal, the name of the country where each customer resides.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for col in dataset.columns:
  print(f"Unique value for {col} : ", dataset[col].unique())

In [None]:
# number of unique value for each variable
uni_df = pd.DataFrame()
uni_df['variables'] = dataset.columns.to_list()
uni_df['unique values'] = uni_df['variables'].apply(lambda x : dataset[x].nunique())
uni_df

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
df1 = dataset.copy()

In [None]:
# Handling Missing Values
df1.isnull().sum()

- There are missing values in CustomerID and Description.
- A likely reason is that the customer buying from the online store was not a registered customer.
- CustomerID is unique to each customer and Description is specific to each product, so neither can be imputed. We therefore drop all rows with missing values.
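
As a minimal sketch of the reasoning above, the share of missing values per column can be inspected before dropping rows, and `dropna` can be restricted to the two affected columns. The `toy` frame and its values are made up for illustration and are not taken from the retail data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the retail table; the values are made up for illustration.
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536366", "536367", "536368"],
    "Description": ["LANTERN", None, "MUG", "CLOCK"],
    "CustomerID": [17850.0, 13047.0, np.nan, 12583.0],
})

# Share of missing values per column, in percent.
missing_pct = toy.isnull().mean().mul(100).round(2)
print(missing_pct)

# Drop only the rows that lack one of the two key columns.
cleaned = toy.dropna(subset=["CustomerID", "Description"])
print(cleaned.shape)
```

Restricting `subset` keeps the intent explicit: a row is removed only because its CustomerID or Description is missing.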

In [None]:
df1.dropna(inplace = True)

In [None]:
df1.isnull().sum()

In [None]:
# Shape of our dataframe after dropping missing values
df1.shape

In [None]:
df1.info()

In [None]:
# Changing the datatype of InvoiceNo to string
df1['InvoiceNo'] = df1['InvoiceNo'].astype('str')

In [None]:
# Invoices starting with 'C' are cancellation invoices.
# Drop every invoice whose number starts with 'C'.
df1 = df1[~df1['InvoiceNo'].str.startswith('C')]

In [None]:
# Shape of our dataframe after dropping cancellation invoice
df1.shape

In [None]:
# Changing the datatype of InvoiceDate to datetime datatype
# Checking for oldest and latest date
df1["InvoiceDate"] = pd.to_datetime(df1["InvoiceDate"])
print("Minimum Invoice Date", min(df1["InvoiceDate"]))
print("Maximum Invoice Date", max(df1["InvoiceDate"]))

In [None]:
# Checking for unit price
print("Minimum UnitPrice", min(df1["UnitPrice"]))
print("Maximum UnitPrice", max(df1["UnitPrice"]))
df1["UnitPrice"].describe()

- There are observations with unit price 0.
- We will be considering only those observation with unit price greater than 0.

In [None]:
# Dropping the observations with unit price 0
df1 = df1[df1.UnitPrice > 0]
df1["UnitPrice"].describe()

In [None]:
# Checking for Quantity
print("Minimum Quantity", min(df1["Quantity"]))
print("Maximum Quantity", max(df1["Quantity"]))
df1["Quantity"].describe()

In [None]:
# Adding the new column Total_sales (UnitPrice * Quantity) to the dataset
df1["Total_sales"] = df1["UnitPrice"]*df1["Quantity"]

In [None]:
df1.head(1)

### What all manipulations have you done and insights you found?

- CustomerID and Description cannot be imputed, since CustomerID is unique to each customer and Description is specific to each product, so all rows with missing values were dropped.
- Changed the datatype of InvoiceNo to string.
- Dropped all invoices starting with C (cancellation invoices).
- Changed the datatype of InvoiceDate to datetime.
- Found observations with a unit price of 0 and dropped them.
- Added a new column, Total_sales, computed from UnitPrice and Quantity.
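
The wrangling steps listed above could be collected into a single function so they run in a fixed order. This is only a sketch: `clean_transactions` is a hypothetical helper name, and the `toy` rows are invented to exercise each rule once.

```python
import pandas as pd

def clean_transactions(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper that applies the wrangling steps in order."""
    # 1. Drop rows without a customer or a product description.
    out = raw.dropna(subset=["CustomerID", "Description"]).copy()
    # 2. Treat invoice numbers as strings and remove cancellations ('C' prefix).
    out["InvoiceNo"] = out["InvoiceNo"].astype(str)
    out = out[~out["InvoiceNo"].str.startswith("C")].copy()
    # 3. Parse the invoice timestamp.
    out["InvoiceDate"] = pd.to_datetime(out["InvoiceDate"])
    # 4. Keep only rows with a positive unit price.
    out = out[out["UnitPrice"] > 0].copy()
    # 5. Line-level sales amount.
    out["Total_sales"] = out["UnitPrice"] * out["Quantity"]
    return out

# Invented rows: one valid sale, one cancellation, one zero price, one missing description.
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "536380", "536381"],
    "Description": ["LANTERN", "DISCOUNT", "MUG", None],
    "Quantity": [6, -1, 2, 3],
    "InvoiceDate": ["2010-12-01 08:26", "2010-12-01 09:41",
                    "2010-12-01 09:45", "2010-12-01 09:50"],
    "UnitPrice": [2.5, 27.5, 0.0, 1.25],
    "CustomerID": [17850.0, 14527.0, 13047.0, 12583.0],
})
cleaned = clean_transactions(toy)
print(cleaned[["InvoiceNo", "Total_sales"]])
```

Only the first toy row survives all four filters, which makes it easy to confirm that each rule does what the summary says.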

**EDA (Exploratory Data Analysis)**

In [None]:
df = df1.copy()

In [None]:
df.head()

In [None]:
# Customer and sales analysis with respect to the country.
# Unique Customer count and percentage with respect to the country.
country_df = df.groupby("Country")["CustomerID"].nunique().reset_index().rename(columns = {"CustomerID":"count_CustomerID"})
country_df["customer_%"] = round(country_df["count_CustomerID"]*100/country_df["count_CustomerID"].sum(),2)
country_df

In [None]:
# Total sales count and percentage with respect to the country.
country_sales_df = df.groupby("Country")["Total_sales"].sum().reset_index()
country_sales_df["Total_sales%"] = round(country_sales_df["Total_sales"]*100/country_sales_df["Total_sales"].sum(),2)

In [None]:
country_sales_df

- 90% of total customers are from United Kingdom and 82% of total sales is from United Kingdom.
- So we will be only considering the observations corresponding to United Kingdom.

In [None]:
df = df[df.Country == 'United Kingdom']
df['Country'].unique()

In [None]:
# Most Ordered Product
most_ordered = df.groupby(['StockCode','Description'], as_index= False)['Quantity'].sum().sort_values(by='Quantity', ascending=False)
most_ordered.head(5)

In [None]:
# Creating a new feature from InvoiceDate
df['Month']=df['InvoiceDate'].dt.month_name()
df['Day']=df['InvoiceDate'].dt.day_name()
df['Hour']=df['InvoiceDate'].dt.hour

In [None]:
# Monthly Transaction
# rename_axis + reset_index(name=...) yields the same column names across pandas versions
month_df = df['Month'].value_counts().rename_axis('Month_Name').reset_index(name='Count')
month_df

- The highest number of transactions is seen in November, followed by October and December.

In [None]:
# Day-wise Transaction
# rename_axis + reset_index(name=...) yields the same column names across pandas versions
day_df = df['Day'].value_counts().rename_axis('Days').reset_index(name='Count')
day_df

- The highest number of transactions is seen on Thursday.
- There are no transactions on Saturday.
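
Because `value_counts()` lists only the values that occur, a day with no transactions is simply absent from the table. A small sketch of making that gap explicit, using made-up weekday labels in place of `df['Day']`:

```python
import pandas as pd

# Made-up weekday labels standing in for df['Day'].
days = pd.Series(["Thursday", "Monday", "Thursday", "Sunday", "Friday"])

week_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
              "Friday", "Saturday", "Sunday"]

# value_counts() omits absent days; reindex restores them with a count of 0
# and puts the days in calendar order.
day_counts = days.value_counts().reindex(week_order, fill_value=0)
print(day_counts)
```

With the full week in the index, a bar chart built from `day_counts` shows Saturday as an empty bar rather than leaving it out.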

In [None]:
# Checking what time of the day customers transact the most.
# rename_axis + reset_index(name=...) yields the same column names across pandas versions
hour_df = df['Hour'].value_counts().rename_axis('Hours').reset_index(name='Count')
hour_df

- **RFM(Recency,Frequency,Monetary) Model**

- RFM (Recency, Frequency, Monetary) analysis is a customer segmentation technique that uses past purchase behavior to divide customers into groups.

- RFM helps divide customers into various categories or clusters to identify customers who are more likely to respond to promotions and also for future personalization services.

  - RECENCY (R): Days since last purchase
  - FREQUENCY (F): Total number of purchases
  - MONETARY VALUE (M): Total money this customer spent.

**Recency**

- To create the Recency feature, we need a reference date for the analysis. We use the date of the last transaction in the dataset as the reference date.

In [None]:
#last date available in our dataset
df['InvoiceDate'].max()

In [None]:
latest_date = dt.date(2011,12,9)
print(latest_date)

In [None]:
#create a new column called date which contains the date of invoice only
df['date'] = df['InvoiceDate'].dt.date

In [None]:
df.head()

In [None]:
# Date of the last purchase for each customer
recency_df = df.groupby(by='CustomerID', as_index=False)['date'].max()
recency_df.columns = ['CustomerID','LastPurshaceDate']
recency_df.head()

In [None]:
# Calculating Recency
recency_df['Recency'] = recency_df['LastPurshaceDate'].apply(lambda x: (latest_date - x).days)
recency_df.head()

In [None]:
# Dropping the last purchase date column now that Recency is computed
recency_df.drop('LastPurshaceDate', axis=1, inplace=True)

**Frequency**

- Frequency helps us to know how many times a customer purchased from us. To do that we need to check how many invoices are registered by the same customer.

In [None]:
# Keep one row per invoice so each invoice is counted once per customer
df2 = df.copy()
df2.drop_duplicates(subset=['InvoiceNo', 'CustomerID'], keep="first", inplace=True)

In [None]:
#calculate frequency of purchases
frequency_df = df2.groupby(by=['CustomerID'], as_index=False)['InvoiceNo'].count()
frequency_df.columns = ['CustomerID','Frequency']
frequency_df.head()

**Monetary Value**

The Monetary attribute answers the question: how much money did the customer spend over time?

In [None]:
monetary_df = df.groupby(by='CustomerID',as_index=False).agg({'Total_sales': 'sum'})
monetary_df.columns = ['CustomerID','Monetary_Value']
monetary_df.head()

In [None]:
# Creating RFM table
# Merging recency_df and frequency_df
df3 = recency_df.merge(frequency_df,on='CustomerID')
df3.head()

In [None]:
# Merging df3 and monetary_df
rfm_df = df3.merge(monetary_df, on='CustomerID')
# Use CustomerID as index
rfm_df.set_index('CustomerID', inplace=True)
rfm_df.head()
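
As an alternative to building three frames and merging them, the same RFM table can be sketched with one grouped aggregation. The `toy` transactions, the `rfm_toy` name and the `reference_date` value are assumptions for illustration; counting distinct invoice numbers plays the role of the drop-duplicates step used for Frequency.

```python
import pandas as pd

# Invented transactions: customer 1 has two invoices (A1 with two line items, A2).
toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2],
    "InvoiceNo": ["A1", "A1", "A2", "B1"],
    "InvoiceDate": pd.to_datetime(
        ["2011-12-01", "2011-12-01", "2011-12-08", "2011-11-09"]),
    "Total_sales": [10.0, 5.0, 20.0, 7.5],
})
reference_date = pd.Timestamp("2011-12-09")  # assumed reference date

rfm_toy = toy.groupby("CustomerID").agg(
    LastPurchase=("InvoiceDate", "max"),       # most recent purchase
    Frequency=("InvoiceNo", "nunique"),        # distinct invoices
    Monetary_Value=("Total_sales", "sum"),     # total spend
)
# Recency in whole days between the reference date and the last purchase date.
rfm_toy["Recency"] = (reference_date - rfm_toy["LastPurchase"].dt.normalize()).dt.days
rfm_toy = rfm_toy[["Recency", "Frequency", "Monetary_Value"]]
print(rfm_toy)
```

The result is indexed by CustomerID, matching the layout of `rfm_df`.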

- **Calculating RFM Score**

In [None]:
# Using Quartiles to divide customer segments from RFM model
quantiles = rfm_df.quantile(q=[0.25,0.5,0.75])
quantiles

In [None]:
# Converting into dictionary
quantiles.to_dict()

In [None]:
# Creating 2 functions: a high recency is bad, while a high frequency or monetary value is good.
def RScore(x,p,d):
    if x <= d[p][0.25]:
        return 4
    elif x <= d[p][0.50]:
        return 3
    elif x <= d[p][0.75]:
        return 2
    else:
        return 1

def FMScore(x,p,d):
    if x <= d[p][0.25]:
        return 1
    elif x <= d[p][0.50]:
        return 2
    elif x <= d[p][0.75]:
        return 3
    else:
        return 4

In [None]:
# Calculating R_score, F_score and M_score
rfm_segmentation = rfm_df.copy()
rfm_segmentation['R_Quartile'] = rfm_segmentation['Recency'].apply(RScore, args=('Recency',quantiles,))
rfm_segmentation['F_Quartile'] = rfm_segmentation['Frequency'].apply(FMScore, args=('Frequency',quantiles,))
rfm_segmentation['M_Quartile'] = rfm_segmentation['Monetary_Value'].apply(FMScore, args=('Monetary_Value',quantiles,))
rfm_segmentation.head()

- We assign a score from 1 to 4 to Recency, Frequency and Monetary. Four is the best/highest value, and one is the lowest/worst value.
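
The quartile scoring can also be sketched with `pd.qcut`, which cuts a column at its quartiles and attaches labels in one call. The `toy` values below are invented. With heavily tied columns (Frequency is a likely case) `pd.qcut` can raise an error because bin edges coincide, so explicit threshold functions may be the more robust choice there.

```python
import pandas as pd

# Invented customer-level values for illustration.
toy = pd.DataFrame({
    "Recency": [2, 10, 35, 80, 150, 200, 300, 365],
    "Monetary_Value": [5000, 1200, 900, 640, 300, 250, 120, 40],
})

# Recency: smaller is better, so the labels run from 4 down to 1.
toy["R_Quartile"] = pd.qcut(toy["Recency"], q=4, labels=[4, 3, 2, 1]).astype(int)
# Monetary value: larger is better, so the labels run from 1 up to 4.
toy["M_Quartile"] = pd.qcut(toy["Monetary_Value"], q=4, labels=[1, 2, 3, 4]).astype(int)
print(toy)
```

`pd.qcut` uses right-closed intervals, so a value equal to a quartile falls in the lower bin, the same convention as the `<=` comparisons in the scoring functions.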

In [None]:
# Combining R_Quartile, F_Quartile and M_Quartile
rfm_segmentation['RFM'] = rfm_segmentation.R_Quartile.map(str) \
                            + rfm_segmentation.F_Quartile.map(str) \
                            + rfm_segmentation.M_Quartile.map(str)
rfm_segmentation.head()

In [None]:
# Calculating RFM score
rfm_segmentation['RFMScore'] = rfm_segmentation[['R_Quartile', 'F_Quartile', 'M_Quartile']].sum(axis = 1)
rfm_segmentation.head()

In [None]:
# Checking the mean value for Recency, Frequency and Monetary corresponding to each score
rfm_segmentation.groupby("RFMScore")[['Recency','Frequency', 'Monetary_Value']].mean()

- Customers with a low recency value tend to have high frequency and monetary values, and vice versa.

In [None]:
# Checking our best customers (maximum RFM score of 12), ordered by monetary value
rfm_segmentation[rfm_segmentation['RFMScore'] == 12].sort_values('Monetary_Value', ascending=False).head(10)

**RFM Segmentation**

In [None]:
rfm_segment = rfm_segmentation.copy()
rfm_segment.reset_index(inplace=True)
import itertools

# Highest frequency as well as monetary value with least recency
platinum_customers = ['444', '443']
print ("Platinum Customers                     : {}".format(platinum_customers))

# Get all combinations of [1, 2, 3,4] and length 2
big_spenders_comb =  itertools.product([1, 2, 3,4],repeat = 2)

# Print the obtained combinations
big_spenders = []
for i in list(big_spenders_comb):
    item = (list(i))
    item.append(4)
    big_spenders.append( ("".join(map(str,item))))
print ("Big Spenders                           : {}".format(big_spenders))

# High-spending new customers: RFM codes 4-1-3, 4-1-4, 3-1-3 and 3-1-4.
# These customers bought recently, with few purchases so far, but spent a lot.

high_spend_new_customers = ['413', '314' ,'313','414']
print ("High Spend New Customers               : {}".format(high_spend_new_customers))


lowest_spending_active_loyal_customers_comb =  itertools.product([ 3,4], repeat = 2)
lowest_spending_active_loyal_customers = []
for i in list(lowest_spending_active_loyal_customers_comb):
    item = (list(i))
    item.append(1)
    lowest_spending_active_loyal_customers.append( ("".join(map(str,item))))
print ("Lowest Spending Active Loyal Customers : {}".format(lowest_spending_active_loyal_customers))

recent_customers_comb =  itertools.product([ 2,3,4], repeat = 2)
recent_customers = []
for i in list(recent_customers_comb):
    item = (list(i))
    item.insert(0,4)
    recent_customers.append( ("".join(map(str,item))))
print ("Recent Customers                       : {}".format(recent_customers))




almost_lost = ['244', '234', '243', '233']        # Low R: customers who used to shop a lot but have not bought recently
print ("Good Customers Almost Lost             : {}".format(almost_lost))

churned_best_customers = ['144', '134' ,'143','133']
print ("Churned Best Customers                 : {}".format(churned_best_customers))


lost_cheap_customers = ['122','111' ,'121','112','221','212' ,'211'] # Customers shopped long ago but with less frequency and monetary value
print ("Lost Cheap Customers                   : {}".format(lost_cheap_customers))

In [None]:
# Create a dictionary for each segment to map them against each customer
segment_dict = {
    'Platinum Customers':platinum_customers,
    'Big Spenders':      big_spenders,
    'High Spend New Customers':high_spend_new_customers,
    'Lowest-Spending Active Loyal Customers' : lowest_spending_active_loyal_customers ,
    'Recent Customers': recent_customers,
    'Good Customers Almost Lost':almost_lost,
    'Churned Best Customers':   churned_best_customers,
    'Lost Cheap Customers': lost_cheap_customers,
}

In [None]:
# Allocate segments to each customer as per the RFM score mapping
def find_key(value):
    for k, v in segment_dict.items():
        if value in v:
            return k
rfm_segment['Segment'] = rfm_segment.RFM.map(find_key)

# Allocate all remaining customers to others segment category
rfm_segment['Segment'] = rfm_segment['Segment'].fillna('others')
rfm_segment.sample(10)

- Now that we know our customer segments, we can choose how to target or deal with each segment.
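
Since `find_key` returns the first segment whose list contains a code, a code listed under several segments is always assigned to the earliest one in the dictionary. A quick overlap check, sketched here on an abbreviated, hypothetical copy of the mapping, shows which codes are affected:

```python
from collections import defaultdict

# Abbreviated, hypothetical copy of the segment mapping for illustration.
segments = {
    "Platinum Customers": ["444", "443"],
    "Big Spenders": ["114", "214", "314", "414", "444"],
    "High Spend New Customers": ["413", "314", "313", "414"],
}

# Collect every segment each RFM code belongs to, in dictionary order.
code_to_segments = defaultdict(list)
for name, codes in segments.items():
    for code in codes:
        code_to_segments[code].append(name)

# Codes listed under more than one segment; a first-match lookup keeps only the first.
overlaps = {c: names for c, names in code_to_segments.items() if len(names) > 1}
for code, names in sorted(overlaps.items()):
    print(code, "->", names[0], "(also listed in:", ", ".join(names[1:]) + ")")
```

Running the same check on the full `segment_dict` would show whether any segment ends up smaller than its definition suggests because an earlier segment claims some of its codes.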

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# Unique Customer count and percentage with respect to the country.
country_df1 = df1.groupby("Country")["CustomerID"].nunique().reset_index().rename(columns = {"CustomerID":"count_CustomerID"})
country_df1["customer_%"] = round(country_df1["count_CustomerID"]*100/country_df1["count_CustomerID"].sum(),2)

# Visualizing for country vs customer percentage
country_df1 = country_df1.sort_values(by = "customer_%", ascending = False)
fig, ax = plt.subplots(figsize=(10,4),dpi=100)
ax=sns.barplot(x=country_df1["Country"], y=country_df1['customer_%'])
ax.set_xticklabels(ax.get_xticklabels(), rotation=50, ha="right")
plt.show()

##### 1. Why did you pick the specific chart?

We used a bar plot to show the percentage of customers by country.

##### 2. What is/are the insight(s) found from the chart?

About 90% of all customers are from the United Kingdom.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The store is UK-based, so it is expected that most customers are from the United Kingdom.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Total sales count and percentage with respect to the country.
country_sales_df1 = df1.groupby("Country")["Total_sales"].sum().reset_index()
country_sales_df1["Total_sales%"] = round(country_sales_df1["Total_sales"]*100/country_sales_df1["Total_sales"].sum(),2)

# Visualizing for country vs total sales percentage
country_sales_df1 = country_sales_df1.sort_values(by = "Total_sales%", ascending = False)
fig, ax = plt.subplots(figsize=(10,4),dpi=100)
ax=sns.barplot(x=country_sales_df1["Country"], y=country_sales_df1['Total_sales%'])
ax.set_xticklabels(ax.get_xticklabels(), rotation=50, ha="right")
plt.show()

##### 1. Why did you pick the specific chart?

We used a bar plot to show the percentage of total sales by country.

##### 2. What is/are the insight(s) found from the chart?

82% of total sales revenue comes from the United Kingdom.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Since almost 90% of customers are from the United Kingdom, it is consistent that 82% of total sales revenue also comes from the United Kingdom.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
#Visualizing top 5 product name
plt.figure(figsize=(12,8))
plt.title('Top 5 Product Name')
sns.barplot(x = 'Quantity', y = 'Description', data = most_ordered[:5], palette = 'spring_r')

##### 1. Why did you pick the specific chart?

We used a bar plot to compare Description against Quantity and identify the top 5 most-ordered products.

##### 2. What is/are the insight(s) found from the chart?

"Paper Craft, Little Birdie" and "Medium Ceramic Top Storage Jar" are the two most-ordered products.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This insight helps the store understand the demand for a particular product, so that it can keep that product in stock and reduce its purchase price by buying in bulk, knowing that it is one of the most-ordered products.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Visualising monthly transaction
plt.figure(figsize=(13,8))
plt.title('Month wise transaction')
sns.barplot(x = 'Month_Name' ,y = 'Count', data = month_df, palette = 'spring_r')

##### 1. Why did you pick the specific chart?

We have used Bar plot to show the transaction count per month.

##### 2. What is/are the insight(s) found from the chart?

The most transactions occur in November, followed by October and December.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This behaviour can be expected because of the festive season in those months. The insight can help the store prepare for peak sales by keeping items in stock and running promotions accordingly to drive more sales.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
#Visualizing for day-wise transaction
plt.figure(figsize=(12,8))
plt.title('Day wise transaction')
sns.barplot(x = 'Days', y = 'Count', data = day_df, palette = 'spring_r')

##### 1. Why did you pick the specific chart?

We have used Bar plot to show the transaction count per day of the week.

##### 2. What is/are the insight(s) found from the chart?

- The highest number of transactions is seen on Thursday.
- There are no transactions on Saturday.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

There are no sales on Saturday. The store should look into this and find out the reason behind it.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Checking what time of the day customer transact the most.
#hour_df=df['Hour'].value_counts().reset_index()
#hour_df.rename(columns={'index': 'Hours'}, inplace=True)
#hour_df.rename(columns={'Hour': 'Count'}, inplace=True)

# Categorizing Hour into Morning, Afternoon and Evening
def time_type(time):
  if(time>=6 and time<=11):
    return 'Morning'
  elif(time>=12 and time<=17):
    return 'Afternoon'
  else:
    return 'Evening'

df['Time_type']=df['Hour'].apply(time_type)

# Visualizing the transactions in the morning, afternoon and evening
plt.figure(figsize=(12,8))
plt.title('Transaction in Hour of the day')
sns.countplot(x = 'Time_type', data = df, palette = 'spring_r')

##### 1. Why did you pick the specific chart?

We used a count plot to show the number of transactions in the morning, afternoon and evening.

##### 2. What is/are the insight(s) found from the chart?

- Most transactions take place in the afternoon, followed by the morning.
- The fewest transactions take place in the evening.
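
The hour bucketing behind this chart can also be written without a row-wise function. The sketch below uses `np.select` with the same boundaries as `time_type` (6 to 11 is Morning, 12 to 17 is Afternoon, everything else is Evening); the `hours` values are made up.

```python
import numpy as np
import pandas as pd

# Made-up hours standing in for df['Hour'].
hours = pd.Series([0, 6, 11, 12, 17, 18, 23], name="Hour")

# Conditions are checked in order; hours matching neither fall back to the default.
time_type_col = pd.Series(
    np.select(
        [hours.between(6, 11).to_numpy(), hours.between(12, 17).to_numpy()],
        ["Morning", "Afternoon"],
        default="Evening",
    ),
    index=hours.index,
    name="Time_type",
)
print(time_type_col.tolist())
```

Note that, as in the original function, hours before 6 are also labelled Evening.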

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This can help the store schedule its digital advertisements accordingly for better optimization.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# Recency Histogram
import math
x=recency_df.Recency
mu=np.mean(x)
sigma=math.sqrt(np.var(x))
# density=True normalises the histogram so the normal pdf below is on the same scale
n,bins,patches=plt.hist(x,1000,density=True,facecolor='blue',alpha=0.75)#alpha=transparency parameter
# Overlay a normal density with the sample mean and standard deviation
y=stats.norm.pdf(bins,mu,sigma)#norm.pdf-probability density function for norm
l=plt.plot(bins,y,'r--',lw=2)

plt.xlabel('Recency in days')
plt.ylabel('Density')
plt.title('Histogram of Sales Recency')
plt.grid(True)
plt.show()

##### 1. Why did you pick the specific chart?

We used a histogram to see how recency is distributed across customers.

##### 2. What is/are the insight(s) found from the chart?

Recency has a right-skewed distribution: most customers made their last purchase recently.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Recency shows the number of days since the last purchase. This information can help the store see which customers it is about to lose and which it has already lost, and come up with strategies to retain them.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
# Distribution Plot for frequency
# histplot with kde=True replaces the deprecated sns.distplot
x = frequency_df['Frequency']
plt.figure(figsize=(10,8))
sns.histplot(x, kde=True, stat='density', color='r')

##### 1. Why did you pick the specific chart?

We have used distribution plot to check the distribution of frequency.

##### 2. What is/are the insight(s) found from the chart?

Distribution is right skewed.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Most customers made between 1 and 15 purchases. Customers with a higher number of purchases are loyal customers and could be included in a loyalty program.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
# Distribution Plot for Monetary Value
# histplot with kde=True replaces the deprecated sns.distplot
x = monetary_df['Monetary_Value']
plt.figure(figsize=(10,8))
sns.histplot(x, kde=True, stat='density', color='r')

##### 1. Why did you pick the specific chart?

We have used distribution plot to check the distribution of Monetary Value.

##### 2. What is/are the insight(s) found from the chart?

Distribution is right skewed.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Most monetary values fall between 0 and 2000. Customers at the higher end of this range are premium customers.

#### Chart - 10 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
rfm_df.corr()

In [None]:
sns.heatmap(rfm_df.corr())

##### 1. Why did you pick the specific chart?

We have used heatmap to plot the correlation between Recency, Frequency and Monetary Value.

##### 2. What is/are the insight(s) found from the chart?

- Recency is negatively correlated to both Frequency and Monetary value.
- Frequency and Monetary Value are positively correlated.

#### Chart - 11 - Pair Plot

In [None]:
# Pair Plot visualization code

plt.figure(figsize=(15,15))
# setting the axis for graph
sns.pairplot(rfm_df)
# adding visualizations to chart
plt.minorticks_on()
plt.grid(which='both',alpha=0.3,linestyle='--')
plt.show()


##### 1. Why did you pick the specific chart?

We used a pair plot to view the pairwise relationships between the three RFM features along with the distribution of each feature.

##### 2. What is/are the insight(s) found from the chart?

The distributions of all three features are right-skewed.
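
Right skew can be quantified with the sample skewness, and a log transform is one common way to reduce it before distance-based clustering. The sketch below uses invented values and `np.log1p`; it illustrates the idea and is not necessarily the transformation applied later in this notebook.

```python
import numpy as np
import pandas as pd

# Invented, strongly right-skewed values for illustration.
toy = pd.DataFrame({
    "Frequency": [1, 1, 2, 2, 3, 4, 6, 40],
    "Monetary_Value": [50.0, 80.0, 120.0, 150.0, 300.0, 450.0, 900.0, 25000.0],
})

skew_before = toy.skew()      # sample skewness per column
log_toy = np.log1p(toy)       # log(1 + x), defined at x = 0
skew_after = log_toy.skew()

print(pd.DataFrame({"before": skew_before, "after": skew_after}).round(2))
```

A skewness closer to 0 after the transform indicates a more symmetric distribution, which usually keeps a few very large customers from dominating the distance calculations.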

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

1. Recency is negatively correlated to Frequency.

2. Monetary Value and Frequency are positively correlated.

3. Recency is negatively correlated to Monetary value.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

- **Null Hypothesis(H0) -** There is no correlation between Recency and Frequency.

- **Alternate Hypothesis(HA) -** Recency is negatively correlated to Frequency.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
group1 = rfm_df['Recency'].values
group2 = rfm_df['Frequency'].values


In [None]:
stats.ttest_ind(a = group1, b = group2, equal_var = False)

- p-value is 0.00 which is less than significance level 0.05.
- We have enough evidence to reject the null hypothesis.


##### Which statistical test have you done to obtain P-Value?

- We have performed two sample t-test to obtain p-value.

##### Why did you choose the specific statistical test?

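- A two-sample t-test (Welch's, since `equal_var = False`) compares the means of two samples, so it was used to compare Recency and Frequency. Note that a difference in means does not by itself establish correlation; a correlation test such as Pearson's or Spearman's addresses the stated hypothesis more directly.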
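
For reference, a correlation hypothesis of this kind can be tested directly with a rank correlation test. The sketch below uses `scipy.stats.spearmanr` on synthetic data generated with a built-in negative relationship; the arrays, seed and coefficients are assumptions for illustration and say nothing about the results on `rfm_df`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-ins: frequency falls as recency grows, plus noise.
recency = rng.integers(0, 365, size=500)
frequency = 20 - 0.04 * recency + rng.normal(0, 2, size=500)

# H0: no monotonic association between the two variables.
rho, p_value = stats.spearmanr(recency, frequency)
print(f"Spearman rho = {rho:.3f}, two-sided p-value = {p_value:.3g}")

# For the one-sided alternative "negatively correlated", halve the
# two-sided p-value when the observed rho is negative.
p_one_sided = p_value / 2 if rho < 0 else 1 - p_value / 2
```

Spearman's test is rank-based, which makes it a reasonable choice for right-skewed features such as these; `stats.pearsonr` would be the analogous call for a linear correlation.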

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.


- **Null Hypothesis(H0) -** There is no correlation between Monetary value and Frequency.

- **Alternate Hypothesis(HA) -** Monetary Value is positively correlated to Frequency.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
group3 = rfm_df['Monetary_Value'].values

In [None]:
stats.ttest_ind(a = group3, b = group2, equal_var = False)

- The p-value is far below the significance level of 0.05.
- We have enough evidence to reject the null hypothesis.

##### Which statistical test have you done to obtain P-Value?

We performed a two-sample t-test to obtain the p-value.

##### Why did you choose the specific statistical test?

We needed to compare the group means of two samples, so we used a two-sample t-test.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

- **Null Hypothesis(H0) -** There is no correlation between Monetary value and Recency.

- **Alternate Hypothesis(HA) -** Recency is Negatively correlated to Monetary Value.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
stats.ttest_ind(a = group1, b = group3, equal_var = False)

- p-value is less than significance level 0.05.
- We have enough evidence to reject the null hypothesis.

##### Which statistical test have you done to obtain P-Value?

We performed a two-sample t-test to obtain the p-value.

##### Why did you choose the specific statistical test?

We needed to compare the group means of two samples, so we used a two-sample t-test.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
df.isnull().sum()

#### What all missing value imputation techniques have you used and why did you use those techniques?


- CustomerID and Description contain missing values.
- A likely reason is that the customer buying from the online store was not a registered customer.
- Neither column can be imputed: CustomerID is unique to each customer and Description is specific to each product.
- We therefore dropped all rows with missing values.
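The dropping step described above can be sketched as follows. This is an illustration on a toy frame, not the notebook's actual cell: the `InvoiceNo` column and all values are assumptions, and only `CustomerID` and `Description` come from the text.

```python
import numpy as np
import pandas as pd

# Toy transactions (assumption: values and the InvoiceNo column are illustrative)
toy_df = pd.DataFrame({
    "InvoiceNo": ["536365", "536366", "536367", "536368"],
    "Description": ["WHITE LANTERN", None, "CREAM CUPID", "RED MUG"],
    "CustomerID": [17850.0, 13047.0, np.nan, 12583.0],
})

# Count missing values per column before cleaning
print(toy_df.isnull().sum())

# Drop every row where CustomerID or Description is missing
clean_df = toy_df.dropna(subset=["CustomerID", "Description"])
print(clean_df.shape)
```

Passing `subset` keeps the drop limited to the two columns that cannot be imputed, so a missing value in an unrelated column would not discard the row.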

### 2. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data
#Checking distribution for Recency, Frequency and Monetary value
scatter_matrix(rfm_df, alpha = 0.3, figsize = (11,5), diagonal = 'kde')

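- The distributions of all three features (Recency, Frequency and Monetary value) are right-skewed.
- Distance-based clustering algorithms such as k-means do not strictly require normally distributed features, but they work best when features are roughly symmetric; with heavy skew, a few extreme customers dominate the distances.
- We therefore apply a log transformation to bring these right-skewed distributions closer to normal.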
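The effect of the log transformation on skewness can be checked numerically as well as visually. The sketch below assumes a synthetic log-normal series in place of `Monetary_Value`; the numbers it prints describe that toy series only.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values standing in for Monetary_Value (assumption)
rng = np.random.default_rng(42)
monetary = pd.Series(rng.lognormal(mean=6.0, sigma=1.0, size=2000))

# Sample skewness: close to 0 for a symmetric distribution, positive for right skew
skew_before = monetary.skew()
# A small offset guards against log(0)
skew_after = np.log(monetary + 0.1).skew()

print(f"skewness before log: {skew_before:.2f}")
print(f"skewness after log:  {skew_after:.2f}")
```

Running `.skew()` on each column of the real RFM table before and after the transformation would give a comparable summary of how much each feature improved.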

In [None]:
# Applying log transformation
rfm_log_R = np.log(rfm_df['Recency']+0.1) #can't take log(0) and so add a small number
rfm_log_F = np.log(rfm_df['Frequency'])
rfm_log_M = np.log(rfm_df['Monetary_Value']+0.1)

In [None]:
log_df = pd.DataFrame({'Monetary_Value': rfm_log_M,'Recency': rfm_log_R,'Frequency': rfm_log_F})
log_df.head()

In [None]:
#Visualizing the distribution of Recency, Frequency and Monetary Value after log transformation
scatter_matrix(log_df, alpha = 0.2, figsize = (11,5), diagonal = 'kde')

- The distribution of Monetary value is now much closer to normal. The distributions of Recency and Frequency have also improved, but to a lesser extent.

In [None]:
log_df.corr()

In [None]:
sns.heatmap(log_df.corr())

- After log transformation it can be seen that Monetary value and frequency show strong positive correlation.
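K-means relies on Euclidean distance, so a feature with a larger spread carries more weight than the others. The notebook clusters the log-transformed values directly; standardizing them first is an optional extra step, sketched here on a toy table whose values are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy log-transformed RFM table (assumption: values are illustrative only)
toy_log_df = pd.DataFrame({
    "Monetary_Value": [5.1, 6.3, 7.8, 4.2, 9.0],
    "Recency": [0.7, 3.2, 5.1, 4.4, 1.9],
    "Frequency": [1.1, 2.0, 3.5, 0.7, 4.2],
})

# Rescale each column to zero mean and unit variance
scaler = StandardScaler()
scaled_values = scaler.fit_transform(toy_log_df)
scaled_df = pd.DataFrame(scaled_values, columns=toy_log_df.columns, index=toy_log_df.index)

print(scaled_df.round(2))
```

Whether scaling changes the resulting segments depends on how different the spreads of the three log-transformed features are, so it is worth comparing both variants.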

## ***7. ML Model Implementation***

### ML Model - 1 - KMeans Clustering

In [None]:
# Applying Elbow Method
sse = {} #Sum Of Squared Error
# Fit KMeans and calculate SSE for each k
for k in range(1, 11):

    # Initialize KMeans with k clusters
    kmeans = KMeans(n_clusters=k, random_state=1)

    # Fit KMeans on the log-transformed dataset
    kmeans.fit(log_df)

    # Assign sum of squared distances to k element of dictionary
    sse[k] = kmeans.inertia_

# Plotting the elbow plot
plt.figure(figsize=(10,4))
plt.title('The Elbow Method')
plt.xlabel('k')
plt.ylabel('Sum of squared errors')
sns.pointplot(x=list(sse.keys()), y=list(sse.values()))
plt.show()

In [None]:
from sklearn.metrics import silhouette_score
log_values = log_df.values
wcss_silhouette = []
for i in range(3,12):
    km = KMeans(n_clusters=i, random_state=0,init='k-means++').fit(log_values)
    preds = km.predict(log_values)
    silhouette = silhouette_score(log_values,preds)
    wcss_silhouette.append(silhouette)
    print("Silhouette score for number of cluster(s) {}: {}".format(i,silhouette))

plt.figure(figsize=(10,5))
plt.title("The silhouette coefficient method \nfor determining number of clusters\n",fontsize=16)
plt.scatter(x=[i for i in range(3,12)],y=wcss_silhouette,s=150,edgecolor='k')
plt.grid(True)
plt.xlabel("Number of clusters",fontsize=14)
plt.ylabel("Silhouette score",fontsize=15)
plt.xticks([i for i in range(3,12)],fontsize=14)
plt.yticks(fontsize=15)
plt.show()

- The elbow plot suggests that the optimal number of clusters is 5 rather than 3 or 4: it is the point after which the sum of squared errors levels off following a steep decline.

- We therefore use 5 clusters to segment the customers.
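The silhouette scores can also be compared programmatically instead of by eye. The following sketch assumes synthetic, well-separated blobs rather than the RFM data, so the value of `best_k` it finds says nothing about the real customers; it only shows the selection step.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (assumption: not the RFM data)
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=1.0, random_state=0)

# Average silhouette score for each candidate number of clusters
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest average silhouette is a candidate to weigh against the elbow plot
best_k = max(scores, key=scores.get)
print(scores)
print("best k by silhouette:", best_k)
```

When the elbow plot and the silhouette score disagree, the final choice is a judgment call that should also consider how interpretable the resulting segments are.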

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
n_clusters = 5
kmeans = KMeans(init='k-means++', n_clusters = n_clusters, n_init=30)
kmeans.fit(log_values)
clusters_customers = kmeans.predict(log_values)
silhouette_avg = silhouette_score(log_values, clusters_customers)
print('silhouette score: {:<.3f}'.format(silhouette_avg))

In [None]:
# Building K-means model using n_cluster = 5
kmeans = KMeans(n_clusters = 5)
kmeans.fit(log_values)
y_kmeans= kmeans.predict(log_values)

In [None]:
#Visualizing Cluster
plt.figure(figsize=(15,10))
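plt.title('Customer segmentation based on Recency, Frequency and Monetary')
# Columns 0 and 1 of log_values are log(Monetary_Value) and log(Recency), following the column order of log_df
plt.xlabel('log(Monetary_Value)')
plt.ylabel('log(Recency)')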
plt.scatter(log_values[:, 0], log_values[:, 1], c=y_kmeans, s=50, cmap='spring_r')

centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5)

In [None]:
# Building K-means model using n_cluster = 4
#kmeans3 = KMeans(n_clusters = 4)
#kmeans3.fit(log_values)
#y_kmeans3= kmeans3.predict(log_values)

In [None]:
#Visualizing Cluster
#plt.figure(figsize=(15,10))
#plt.title('customer segmentation based on Recency, Frequency and Monetary')
#plt.scatter(log_values[:, 0], log_values[:, 1], c=y_kmeans3, s=50, cmap='spring_r')

#centers = kmeans3.cluster_centers_
#plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5)

##### Which hyperparameter optimization technique have you used and why?

We used the elbow method and the silhouette score to choose the number of clusters.

### ML Model - 2 - DBSCAN Clustering

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
y_pred = DBSCAN(eps=0.5, min_samples=15).fit_predict(log_values)
plt.figure(figsize=(13,8))
plt.scatter(log_values[:,0], log_values[:,1], c=y_pred)

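- DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is the most well-known density-based clustering algorithm.
- Unlike k-means, DBSCAN does not require the number of clusters as a parameter. Instead it infers the number of clusters from the data, and it can discover clusters of arbitrary shape (k-means, by comparison, tends to find roughly spherical clusters).
- DBSCAN sorts the data points into three categories:
  - Core points: points with at least `min_samples` points (including themselves) within distance `eps`.
  - Border points: points that lie within `eps` of a core point but are not core points themselves.
  - Outliers (noise): points that belong to no cluster and receive the label -1.
- The scatter plot above is colored by the label returned by `fit_predict`, so the dark blue points (label -1) are the outliers, while the steel blue and green points are points assigned to clusters; the colors do not separate core points from border points.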
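To tell core, border and noise points apart explicitly, the fitted estimator's `core_sample_indices_` and `labels_` attributes can be combined. This is a sketch on toy data: the blobs, the isolated point and the `eps` and `min_samples` values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense synthetic blobs plus one isolated point (assumption: toy data only)
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(50, 2))
outlier = np.array([[20.0, 20.0]])
X = np.vstack([blob_a, blob_b, outlier])

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
labels = db.labels_

# Core points are listed in core_sample_indices_; noise carries the label -1
is_core = np.zeros(len(X), dtype=bool)
is_core[db.core_sample_indices_] = True
is_noise = labels == -1
is_border = ~is_core & ~is_noise  # assigned to a cluster but not dense enough to be core

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters:", n_clusters)
print("core:", is_core.sum(), "border:", is_border.sum(), "noise:", is_noise.sum())
```

The three boolean masks could then be passed to separate `plt.scatter` calls so that each point type gets its own color and legend entry.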

### ML Model - 3 - Hierarchical Clustering

In [None]:
# Using the dendrogram to find the optimal number of clusters
plt.figure(figsize=(13,8))
dendrogram = sch.dendrogram(sch.linkage(log_values, method = 'ward'))
plt.title('Dendrogram')
plt.xlabel('Customers')
plt.ylabel('Euclidean Distances')
plt.show()

- According to the dendrogram, the optimal number of clusters is 3.
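Once a number of clusters has been read off the dendrogram, the same linkage matrix can be cut into flat cluster labels with SciPy's `fcluster`. The sketch below assumes synthetic blobs in place of `log_values`.

```python
import numpy as np
import scipy.cluster.hierarchy as sch
from sklearn.datasets import make_blobs

# Synthetic data with three groups (assumption: stands in for log_values)
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.8, random_state=7)

# Build the Ward linkage matrix that a dendrogram would be drawn from
Z = sch.linkage(X, method="ward")

# Cut the tree so that at most three flat clusters remain
labels = sch.fcluster(Z, t=3, criterion="maxclust")

print(np.unique(labels, return_counts=True))
```

Note that `fcluster` numbers its clusters from 1, whereas scikit-learn estimators number them from 0.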

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# Agglomerative clustering for n_cluster = 2
#from sklearn.cluster import AgglomerativeClustering
#hc = AgglomerativeClustering(n_clusters = 2, affinity = 'euclidean', linkage = 'ward')
#y_hc = hc.fit_predict(log_values)

In [None]:
# Visualizing the clusters
#plt.figure(figsize=(13,8))
#plt.scatter(log_values[y_hc == 0, 0], log_values[y_hc == 0, 1], s = 100, c = 'red', label = 'Customer Type 1')
#plt.scatter(log_values[y_hc == 1, 0], log_values[y_hc == 1, 1], s = 100, c = 'blue', label = 'Customer Type 2')

#plt.title('Clusters')
#plt.xlabel('RFM')

#plt.ylabel('Spending Score')
#plt.legend()
#plt.show()

In [None]:
# Agglomerative clustering for n_cluster = 3
from sklearn.cluster import AgglomerativeClustering
# 'ward' linkage always uses Euclidean distance; the `affinity` argument was removed in recent scikit-learn releases
hc3 = AgglomerativeClustering(n_clusters = 3, linkage = 'ward')
y_hc3 = hc3.fit_predict(log_values)

In [None]:
plt.figure(figsize=(13,8))
plt.scatter(log_values[y_hc3 == 0, 0], log_values[y_hc3 == 0, 1], s = 100, c = 'red', label = 'Customer Type 1')
plt.scatter(log_values[y_hc3 == 1, 0], log_values[y_hc3 == 1, 1], s = 100, c = 'blue', label = 'Customer Type 2')
plt.scatter(log_values[y_hc3 == 2, 0], log_values[y_hc3 == 2, 1], s = 100, c = 'green', label = 'Customer Type 3')

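plt.title('Clusters')
# Columns 0 and 1 of log_values are log(Monetary_Value) and log(Recency), following the column order of log_df
plt.xlabel('log(Monetary_Value)')
plt.ylabel('log(Recency)')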
plt.legend()
plt.show()

**Customer Segmentation**

In [None]:
log_df1 = log_df.rename(columns = {'Recency' : 'Log_R', 'Frequency' : 'Log_F', 'Monetary_Value' : 'Log_M'})
log_df1.head()

In [None]:
rfm_data = rfm_segment.merge(right=log_df1, on="CustomerID", how="left")
rfm_data.head()

In [None]:
rfm_data['Cluster'] = kmeans.labels_

In [None]:
rfm_data.head(10)

In [None]:
columns = list(rfm_data.columns)
columns.remove('Segment')
columns.append('Segment')
rfm_data = rfm_data[columns]

In [None]:
rfm_data.head()

In [None]:
rfm_data['Cluster'].value_counts()

In [None]:
# Check the mean Recency, Frequency and Monetary value of each cluster
rfm_data.groupby('Cluster').agg({'Recency':'mean',
                               'Frequency':'mean',
                               'Monetary_Value':'mean'})

In [None]:
rfm_data.sample(10)
print ("Platinum customers belong to cluster                      : {} ".format(rfm_data[rfm_data['Segment']=='Platinum Customers']['Cluster'].unique()))
print ("Big Spenders belong to cluster                            : {} ".format(rfm_data[rfm_data['Segment']=='Big Spenders']['Cluster'].unique()))
print ("High Spend new Customers belong to cluster                : {} ".format(rfm_data[rfm_data['Segment']=='High Spend New Customers']['Cluster'].unique()))
print ("Lowest-Spending Active Loyal Customers belong to cluster  : {} ".format(rfm_data[rfm_data['Segment']=='Lowest-Spending Active Loyal Customers']['Cluster'].unique()))
print ("Recent Customers belong to cluster                        : {} ".format(rfm_data[rfm_data['Segment']=='Recent Customers']['Cluster'].unique()))
print ("Good Customers Almost Lost belong to cluster              : {} ".format(rfm_data[rfm_data['Segment']=='Good Customers Almost Lost']['Cluster'].unique()))
print ("Churned Best Customers belong to cluster                  : {} ".format(rfm_data[rfm_data['Segment']=='Churned Best Customers']['Cluster'].unique()))
print ("Lost Cheap customers belong to cluster                    : {} ".format(rfm_data[rfm_data['Segment']=='Lost Cheap Customers ']['Cluster'].unique()))


**Analysis of customers in each cluster**

In [None]:
# Checking for cluster - 0
rfm_data[rfm_data.Cluster == 0].sample(5)

- Customers in cluster 0 have the lowest RFM values and RFM scores, and they belong to the Lost Cheap Customers segment.

In [None]:
# Checking for cluster - 1
rfm_data[rfm_data.Cluster == 1].sample(5)

- Customers in cluster 1 have high Recency values and low Monetary and Frequency values; most of them fall into the Lowest-Spending Active Loyal Customers and Recent Customers segments.

In [None]:
# Checking for cluster - 2
rfm_data[rfm_data.Cluster == 2].sample(5)

- Customers in cluster 2 show the same characteristics as those in cluster 0, with low RFM values and RFM scores; they belong to the Lost Cheap Customers segment and a few other segments.

In [None]:
# Checking for cluster - 3
rfm_data[rfm_data.Cluster == 3].sample(5)

- Customers in cluster 3 have very high RFM values and RFM scores; most belong to the Platinum Customers segment and some to the Big Spenders segment.

In [None]:
# Checking for cluster - 4
rfm_data[rfm_data.Cluster == 4].sample(5)

- Customers in cluster 4 have good RFM values and RFM scores; most belong to the Platinum Customers and Big Spenders segments, and some are Recent Customers with high Monetary Value.
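Random samples of five rows give only a rough impression of each cluster. A cross-tabulation of cluster against segment summarizes every customer at once; the sketch below uses a made-up table, so the assignments and proportions are assumptions and not the notebook's results.

```python
import pandas as pd

# Toy cluster/segment assignments (assumption: values are made up for illustration)
toy_rfm = pd.DataFrame({
    "Cluster": [0, 0, 0, 1, 1, 3, 3, 3, 4, 4, 4],
    "Segment": [
        "Lost Cheap Customers", "Lost Cheap Customers", "Recent Customers",
        "Recent Customers", "Recent Customers",
        "Platinum Customers", "Platinum Customers", "Big Spenders",
        "Big Spenders", "Platinum Customers", "Platinum Customers",
    ],
})

# Share of each segment within every cluster (each row sums to 1)
segment_share = pd.crosstab(toy_rfm["Cluster"], toy_rfm["Segment"], normalize="index")
print(segment_share.round(2))

# Most common segment per cluster
dominant = segment_share.idxmax(axis=1)
print(dominant)
```

Applied to `rfm_data`, the same two calls would show how cleanly each k-means cluster maps onto the rule-based RFM segments.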

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
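The save and reload steps named in the two headings above could be sketched as follows. This is a hypothetical round trip, not code from the notebook: the toy matrix stands in for `log_values`, the file name `kmeans_rfm_model.pkl` is an assumption, and `pickle` is used although `joblib` would work similarly.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.cluster import KMeans

# Toy feature matrix standing in for log_values (assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

model = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

# Save the fitted model; the file name is hypothetical
model_path = os.path.join(tempfile.gettempdir(), "kmeans_rfm_model.pkl")
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Load it back and assign clusters to unseen rows as a sanity check
with open(model_path, "rb") as f:
    loaded_model = pickle.load(f)

X_new = rng.normal(size=(5, 3))
print(loaded_model.predict(X_new))
```

New customers would need the same log transformation as the training data before `predict` is called, and pickled models should only be loaded from trusted sources.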


# **Conclusion**

- Customers are segmented into 5 clusters.

- Customers in cluster 0 have the lowest RFM values and RFM scores, and they belong to the Lost Cheap Customers segment.

- Customers in cluster 1 have high Recency values and low Monetary and Frequency values; most of them fall into the Lowest-Spending Active Loyal Customers and Recent Customers segments.

- Customers in cluster 2 show the same characteristics as those in cluster 0, with low RFM values and RFM scores; they belong to the Lost Cheap Customers segment and a few other segments.

- Customers in cluster 3 have very high RFM values and RFM scores; most belong to the Platinum Customers segment and some to the Big Spenders segment.

- Customers in cluster 4 have good RFM values and RFM scores; most belong to the Platinum Customers and Big Spenders segments, and some are Recent Customers with high Monetary Value.
