In the retail sector, it is possible to unlock insights to win and retain customers, drive business efficiencies, and ultimately improve purchases and customer interest. Retail organizations are using advanced analysis to understand their customers, improve forecasting, and achieve better, faster results. As a company's resources are limited, it is crucial to identify and target customers to secure their loyalty, enhance business efficiency, and ultimately improve performance.

You have been given access to a dataset containing customer transactions for an online retailer and tasked with using your machine learning tools to gain and report on business insights. The audience for this report is non-specialists. In particular, your tasks are:

Clustering
Apply and evaluate various clustering techniques with the aim of generating actionable insights from the data. 

●	Select and justify the features you will be using.

●	Apply appropriate clustering algorithms to the dataset.

●	Evaluate the performance of the algorithms and make a recommendation as to which gives the “best” results.

●	Include in your report your own interpretation of the results.

Market Basket Analysis
Perform a market basket analysis of the transaction data. 

●	Include in your report a comparison and evaluation of at least two algorithms.

●	Include in your report your own interpretation of the results.


# Introduction

The analysis focuses on clustering regularly purchased products to identify patterns and gain insights into consumer buying behaviour. This approach facilitates a better understanding of the dynamics of everyday purchases, enabling more targeted marketing strategies.

In [1]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import scale
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors

In [None]:
df = pd.read_excel("data.xlsx")
df.head()

In [None]:
df['Year'] = df['InvoiceDate'].dt.year
df['Month'] = df['InvoiceDate'].dt.month

# Display the DataFrame with new columns
print(df[['InvoiceDate', 'Year', 'Month']].head())

Note

'InvoiceDate' is split into 'Year' and 'Month' columns at the outset to make later analysis easier.

In [None]:
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
df = df.set_index('InvoiceDate')

Note

To make time-based analysis easier, we have set the 'InvoiceDate' column as the index of the dataset, replacing the default numerical index.
Note that this index is not unique: every line item on an invoice shares the same timestamp.

In [None]:
df.describe()

Note

An analysis using the .describe() method reveals several noteworthy observations: 

- The majority of transactions involve quantities ranging from 3 to 10 items, with most items priced at £5 or less. 

- Negative quantities and prices are present, and some records lack CustomerID data.

- There are several significant outliers that require further attention.
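
The observations above can also be checked directly rather than read off the summary table. The following is a minimal sketch on a toy frame; the values are invented, and the column names 'Quantity', 'Price' and 'Customer ID' are assumptions about the raw file rather than confirmed headers.

```python
import numpy as np
import pandas as pd

# Toy rows (invented values); column names are assumed, not confirmed from the raw file
toy = pd.DataFrame({
    "Quantity": [6, -2, 12, 3],
    "Price": [2.55, 1.25, -0.5, 4.95],
    "Customer ID": [17850.0, np.nan, 13047.0, np.nan],
})

# Counting the three data-quality issues mentioned in the note
summary = {
    "negative_quantity": int((toy["Quantity"] < 0).sum()),
    "negative_price": int((toy["Price"] < 0).sum()),
    "missing_customer": int(toy["Customer ID"].isna().sum()),
}
print(summary)
```

On the real data the same three expressions would be applied to `df` before the columns are renamed.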

In [None]:
df.shape

# Exploratory Data Analysis

In [None]:
df.info()

In [None]:
def to_camel_case(s):
    # Split the string into words
    words = s.replace('_', ' ').split()
    # Convert the first word to lowercase and capitalize the initials of the remaining words
    camel_case_str = words[0].lower() + ''.join(word.capitalize() for word in words[1:])
    return camel_case_str

df.columns = [to_camel_case(col) for col in df.columns]

print(df.columns)

In [None]:
df.dtypes

In [None]:
# # To prevent errors, converting 'Description','Invoice' and 'StockCode' to a string.
# df['Description'] = df['Description'].astype(str)
# df['Invoice'] = df['Invoice'].astype(str)
# df['StockCode'] = df['StockCode'].astype(str)

In [None]:
df.describe()

In [None]:
df.country.nunique()

In [None]:
df.country.unique()

In [None]:
customer_country=df[['country','customerId']]
customer_country.groupby(['country'])['customerId'].aggregate('count').reset_index().sort_values('customerId', ascending=False).head()

In [None]:
pd.DataFrame(df.nunique())

In [None]:
for column in df.columns:
    # Checking if any value in the column is duplicated
    has_duplicates = df[column].duplicated().any()
    print(f'{column} has duplicates: {has_duplicates}')

## Null Values

In [None]:
df.isnull().sum(axis=0)

In [None]:
df['description'].tail(20)

In [None]:
df[df['description'].isnull()].tail()

Note

- The Price in these rows is 0, indicating that these orders did not generate any revenue.

- At present, we can impute it with 'UNKNOWN ITEM' and address those later during the analysis.

## Features Analysis

#### Analyzing "Description" feature

In [None]:
df['description'].value_counts().tail(20)

In [None]:
df['description'].value_counts().head()

Note

The code above shows that valid items are typically in uppercase, while non-valid or cancelled items are in lowercase.
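
If this casing pattern holds, it could be used to flag rows that are not real products. The sketch below assumes a toy `description` column; the sample values and the `is_valid_item` column name are illustrative only and do not come from the dataset.

```python
import pandas as pd

# Toy descriptions (illustrative values, not taken from the real dataset)
toy = pd.DataFrame({
    "description": ["WHITE HANGING HEART", "damaged", "REGENCY CAKESTAND", "lost in space", None]
})

# A description is treated as a valid item when it is present and fully uppercase;
# missing values are replaced by an empty string, for which str.isupper() is False
desc = toy["description"].fillna("")
toy["is_valid_item"] = desc.str.isupper()

print(toy)
```

Descriptions with mixed casing would be flagged as invalid by this rule, so the result should be inspected before any rows are dropped.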

#### Analyzing Invoice feature

In [None]:
df['invoice'].value_counts().tail(20)

In [None]:
df['invoice'].value_counts().head()

# Data cleaning

## Removing Duplicates

In [None]:
customer_country=df[['country','customerId','invoice']].drop_duplicates()
customer_country.groupby(['country'])['customerId'].aggregate('count').reset_index().sort_values('customerId', ascending=False).head()

Note

Duplicate combinations of country, customer and invoice are dropped before counting, so a customer is counted once per invoice rather than once per invoice line.

## Labelling Unknown Items

In [None]:
df['description'] = df['description'].fillna('UNKNOWN ITEM')
df.isnull().sum()

## Removing Unidentified Customers

In [None]:
df = df[pd.notnull(df['customerId'])]
df.isnull().sum(axis=0)

## Outliers

In [None]:
plt.figure(figsize=(18,6))
plt.scatter(x=df.index, y=df['price'])

In [None]:
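# Quartiles and interquartile range, computed on the numeric columns only
# (recent pandas versions raise an error if text columns are included)
num_cols = df.select_dtypes(include='number').columns
Q1 = df[num_cols].quantile(.25)
Q3 = df[num_cols].quantile(.75)
IQR = Q3 - Q1
print(IQR)
print("--------")
print(Q1)
print("--------")
print(Q3)

# Keeping only the rows with no numeric value outside the 1.5 * IQR fences
is_outlier = ((df[num_cols] < (Q1 - 1.5 * IQR)) | (df[num_cols] > (Q3 + 1.5 * IQR))).any(axis=1)
df = df[~is_outlier]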

In [None]:
Q1_price = Q1['price'] 

# Filtering the dataset for products over 1.25 euros
high_value_purchases = df[df['price'] > Q1_price]

# Getting unique customers who have made high-value purchases
unique_customers_high_value = high_value_purchases['customerId'].unique()

# Getting the total number of unique customers in the entire dataset
total_unique_customers = df['customerId'].unique()

# Calculating the percentage of customers who have purchased items priced above the first quartile (1.25 euros)
percentage_high_value_customers = (len(unique_customers_high_value) / len(total_unique_customers)) * 100

# Displaying the percentage
print(f'Percentage of customers who have purchased items over 1.25 euros: {percentage_high_value_customers:.2f}%')

In [None]:
Q3_price = Q3['price'] 

# Filtering the dataset for products over 2.55 euros
high_value_purchases = df[df['price'] > Q3_price]

# Getting unique customers who have made high-value purchases
unique_customers_high_value = high_value_purchases['customerId'].unique()

# Getting the total number of unique customers in the entire dataset
total_unique_customers = df['customerId'].unique()

# Calculating the percentage of customers who have purchased items priced above the third quartile (2.55 euros)
percentage_high_value_customers = (len(unique_customers_high_value) / len(total_unique_customers)) * 100

# Displaying the percentage
print(f'Percentage of customers who have purchased items over 2.55 euros: {percentage_high_value_customers:.2f}%')

In [None]:
plt.figure(figsize=(18,6))
plt.scatter(x=df.index, y=df['price'])

In [None]:
df.describe()

Note

- To achieve a more precise clustering of products, we excluded products that cost over 7.5 € from our datasets. Our analysis aims to cluster regularly purchased products, and those over this price point are not considered in this case.
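
The price cap appears to correspond to the upper fence of the 1.5 × IQR rule used in the outlier cell. As a rough illustration of how such a fence is derived, here is a sketch on an invented price series; the numbers are not from the dataset.

```python
import pandas as pd

# Invented prices with one obvious outlier
prices = pd.Series([1.0, 1.25, 1.65, 2.1, 2.55, 2.95, 3.75, 25.0], name="price")

# Quartiles and interquartile range
q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Values outside [q1 - 1.5 * iqr, q3 + 1.5 * iqr] are treated as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
kept = prices[(prices >= lower) & (prices <= upper)]

print(f"fences: [{lower:.2f}, {upper:.2f}], kept {len(kept)} of {len(prices)} prices")
```

Only the 25.0 entry falls outside the fences here, which mirrors how expensive, rarely bought items drop out of the analysis.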

# Exploratory Data Analysis II

## Do we Have Returns?

In [None]:
df[df['quantity'] < 0].head(20)

In [None]:
print(df['invoice'].isna().sum())
#The code is not working unless we instruct .str.startswith() to consider NA/NaN values as False
print(df[df['invoice'].str.startswith('C', na=False)].describe())

In [None]:
total_invoices = df['invoice'].notna().sum()  # Count non-NA invoice entries
invoices_starting_with_c = df['invoice'].str.startswith('C', na=False).sum()  # Count invoices starting with 'C'

# Data for plotting
sizes = [invoices_starting_with_c, total_invoices - invoices_starting_with_c]
labels = ['Invoices Starting with C', 'Other Invoices']

# Plotting the pie chart
fig1, ax1 = plt.subplots()
ax1.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90)
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.

plt.title('Percentage of Invoices Returned')
plt.show()

In [None]:
invoices_starting_with_c = df[df['invoice'].str.startswith('C', na=False)].shape[0]
total_invoices = df['invoice'].notna().sum()
percentage_starting_with_c = (invoices_starting_with_c / total_invoices) * 100
print("Percentage of invoices starting with 'C': {:.2f}%".format(percentage_starting_with_c))

Note

Invoices beginning with the letter 'C' are cancellation or return invoices.

While a more in-depth analysis of these returns would be beneficial, for the sake of simplicity we will disregard them for now.

In [None]:
# Calculating total number of entries
total_entries = len(df)

# Counting number of entries where the invoice starts with 'C'
invoices_starting_with_c = df['invoice'].str.startswith('C', na=False).sum()

# Calculating the percentage
percentage_starting_with_c = (invoices_starting_with_c / total_entries) * 100

# Printing the result
print(f"Percentage of invoices starting with 'C': {percentage_starting_with_c:.2f}%")

## Removing Invoices Starting with 'C'

As the number of invoices starting with the letter 'C' represents only 1.90% of the total dataset and is not part of the purpose of the analysis, it was decided to remove them.

In [None]:
df = df[~df['invoice'].str.startswith('C', na=False)]

# Now df does not contain rows where the invoice starts with 'C'
print("Rows with invoices starting with 'C' have been removed.")
print(f"Updated DataFrame shape: {df.shape}")

## How Many Customers are Not Recurrent?

In [None]:
def unique_counts(df):
   for i in df.columns:
       count = df[i].nunique()
       print(i, ": ", count)
unique_counts(df)

## What Items Were Purchased More Frequently?

In [None]:
item_counts = df['description'].value_counts().sort_values(ascending=False).iloc[0:15]
plt.figure(figsize=(18,6))
sns.barplot(x=item_counts.index, y=item_counts.values, palette=sns.cubehelix_palette(15))
plt.ylabel("Counts")
plt.title("Which items were bought more often?");
plt.xticks(rotation=90);

##  Which Invoices Had the Highest Number of Items?

In [None]:
inv_counts = df['invoice'].value_counts().sort_values(ascending=False).iloc[0:15]
plt.figure(figsize=(18,6))
sns.barplot(x=inv_counts.index, y=inv_counts.values, palette=sns.color_palette("BuGn_d"))
plt.ylabel("Counts")
plt.title("Which invoices had the most items?");
plt.xticks(rotation=90);

## Which Country Has the Highest Number of Purchases?

In [None]:
info = pd.DataFrame(data = df.groupby(['country'])['invoice'].nunique(), index=df.groupby(['country']).groups.keys()).T
info

In [None]:
plt.figure(figsize=(14,6))
plt.bar(list(df.groupby(['country']).groups.keys()), df.groupby(['country'])['customerId'].count())
plt.xticks(rotation = 90, fontsize = 14)
plt.title("Number of transactions done for each country")
plt.ylabel("No. of trans.")
plt.xlabel("Country")
plt.show()

In [None]:
# Calculating the number of unique invoices per country
sales_per_country = df.groupby('country')['invoice'].nunique()

# Calculating the total number of sales transactions
total_sales = sales_per_country.sum()

# Calculating the percentage of total sales for each country
percent_sales = (sales_per_country / total_sales) * 100

# Sorting the percentages to find the countries with the smallest percent of sales
smallest_percent_sales = percent_sales.sort_values()

# Displaying the sorted percentages
print(smallest_percent_sales)

In [None]:
# Filtering for countries with 1% or less in purchases
countries_with_one_percent_or_less = percent_sales[percent_sales <= 1]

# Counting the number of countries meeting the criterion
number_of_countries = countries_with_one_percent_or_less.count()

# Display the count
print(f'Number of countries with 1% or less in purchases: {number_of_countries}')

# Displaying the names of these countries
print("Countries with 1% or less in purchases:")
print(countries_with_one_percent_or_less.index.tolist())

Note

- The UK conducted the majority of the transactions, with a total of 19857.

- 'Australia', 'Austria', 'Bahrain', 'Belgium', 'Brazil', 'Canada', 'Channel Islands', 'Cyprus', 'Denmark', 'Finland', 'Greece', 'Iceland', 'Israel', 'Italy', 'Japan', 'Korea', 'Lithuania', 'Malta', 'Netherlands', 'Nigeria', 'Norway', 'Poland', 'Portugal', 'RSA', 'Singapore', 'Spain', 'Sweden', 'Switzerland', 'Thailand', 'USA', 'United Arab Emirates', 'Unspecified', 'West Indies' each account for 1% or less of purchases.

In [None]:
df2=df.groupby('invoice')[['quantity']].sum()

In [None]:
df2 = df2.reset_index()
df2.head()

In [None]:
df['invoicedate'] = df.index

# Merging df with df2 based on the invoice column using the 'left' join to retain all records from 'df' in the merged 
# DataFrame, while only the matching entries from 'df2' are included.

df = df.merge(df2, how='left', on='invoice')

# Renaming the merged columns to clarify their meaning: 'quantity_x' becomes 'quantity' (units of each product line) and 'quantity_y' becomes 'quantityInv' (total units on the invoice).

df = df.rename(columns={'quantity_x' : 'quantity', 'quantity_y' : 'quantityInv'})
df.tail(10)

In [None]:
df.describe()

In [None]:
df['invoicedate'] = df['invoicedate'].dt.strftime('%Y-%m-%d %H:%M:%S')

In [None]:
df.head()

In [None]:
df.dtypes

## What is the Revenue/Growth of the Company Through the Year?

In [None]:
df['invoicedate'] = pd.to_datetime(df['invoicedate'])

#creating YearMonth field for the ease of reporting and visualization
df['invoiceyearmonth'] = df['invoicedate'].dt.strftime('%Y-%m')

# Calculate revenue for each row
df['revenue'] = df['price'] * df['quantity']

# Group by the new 'invoiceyearmonth' and sum the revenue
df_revenue = df.groupby('invoiceyearmonth')['revenue'].sum().reset_index()

# Calculating monthly percentage change
df_revenue['monthlygrowth'] = df_revenue['revenue'].pct_change()

In [None]:
# Setting the figure size for better readability
plt.figure(figsize=(12, 6))  
plt.plot(df_revenue['invoiceyearmonth'], df_revenue['revenue'], marker='o')  # Plotting revenue over time
# Adding a title
plt.title('Monthly Revenue Over Time')  
# Labeling the x-axis
plt.xlabel('Year-Month') 
# Labeling the y-axis
plt.ylabel('Revenue')  
# Rotating date labels for better visibility
plt.xticks(rotation=45)  
# Adding a grid for easier reading
plt.grid(True)  
# Adjusting the layout to make room for the rotated date labels
plt.tight_layout()  
# Displaying the plot
plt.show()  

Note

Over 2010, the company had a steady turnover until August. From then until November turnover rose sharply, and in December it dropped drastically.

In [None]:
plt.figure(figsize=(10, 5))

# Plotting the monthly growth, excluding December 2011 on the assumption that it is the final month in the data.
# 'invoiceyearmonth' is stored as 'YYYY-MM', so the cut-off string must use the same format.
mask = df_revenue['invoiceyearmonth'] < '2011-12'
plt.plot(df_revenue[mask]['invoiceyearmonth'], df_revenue[mask]['monthlygrowth'], marker='o', linestyle='-')

# Adding title and labels
plt.title('Monthly Growth Rate')
plt.xlabel('Year-Month')
plt.ylabel('Percentage Growth')

# Rotate date labels for better readability
plt.xticks(rotation=45)
plt.grid(True)

# Show plot
plt.tight_layout()
plt.show()

Note

As can be seen over the year as a whole, the company did not experience significant growth, although there was some growth in the first and last quarters of the year.
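
For readers unfamiliar with `pct_change`, the growth rate for a month is (revenue this month − revenue last month) / revenue last month. A small worked sketch with invented revenue figures:

```python
import pandas as pd

# Invented monthly revenue; column names mirror those used in the notebook
monthly = pd.DataFrame({
    "invoiceyearmonth": ["2010-01", "2010-02", "2010-03", "2010-04"],
    "revenue": [1000.0, 1100.0, 990.0, 1485.0],
})

# Growth relative to the previous month; the first month has no predecessor, so it is NaN
monthly["monthlygrowth"] = monthly["revenue"].pct_change()

print(monthly)
```

The values are fractions, so 0.10 means growth of 10% and -0.10 a decline of 10%.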

# Clustering Analysis

- Recency - Customers are clustered based on their most recent purchase.

- Frequency - Customers are clustered based on their purchase frequency within the company.

- Revenue - Customers are clustered based on the revenue their purchases generate for the company.
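
The three features above can be derived in a single pass over the transactions. The sketch below uses invented transactions with column names that mirror the notebook's. Note one assumption: frequency is counted here as the number of distinct invoices, whereas the cells that follow count invoice lines.

```python
import pandas as pd

# Toy transactions (hypothetical values)
tx = pd.DataFrame({
    "customerId": [1, 1, 2, 3, 3, 3],
    "invoice": ["A1", "A2", "B1", "C1", "C2", "C2"],
    "invoicedate": pd.to_datetime([
        "2010-01-05", "2010-03-01", "2010-02-10",
        "2010-03-20", "2010-03-28", "2010-03-28",
    ]),
    "price": [2.0, 1.5, 4.0, 1.0, 2.5, 3.0],
    "quantity": [3, 2, 1, 10, 4, 2],
})
tx["revenue"] = tx["price"] * tx["quantity"]

# The observation point is the latest invoice date in the data
snapshot = tx["invoicedate"].max()

rfm = tx.groupby("customerId").agg(
    recency=("invoicedate", lambda d: (snapshot - d.max()).days),  # days since last purchase
    frequency=("invoice", "nunique"),                              # distinct invoices (assumption)
    revenue=("revenue", "sum"),                                    # total spend
)
print(rfm)
```

Each row of `rfm` then describes one customer, which is the shape the clustering algorithms expect.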

## Customer Segmentation by Purchase Frequency and Revenue Impact

In [None]:
df_uk = df.query("country=='United Kingdom'").reset_index(drop=True)

Note

Because customer behaviour may vary between countries, the clustering analysis focuses solely on the United Kingdom, which accounts for the majority of the company's purchases. This reduces the bias that mixing different markets could introduce.

## Time Since Each Client's Most Recent Purchase

In [None]:
#getting the max purchase date for each customer and creating a dataframe with it
df_uk_purchase = df_uk.groupby('customerId').invoicedate.max().reset_index()
df_uk_purchase.columns = ['customerId','maxpurchasedate']

#Taking observation points as the max invoice date in the dataset
df_uk_purchase['recency'] = (df_uk_purchase['maxpurchasedate'].max() - df_uk_purchase['maxpurchasedate']).dt.days

df_uk_purchase

This feature, called 'recency', is the number of days between each customer's most recent purchase and the latest invoice date in the dataset.

- The data is grouped by 'customerId', and the maximum 'invoicedate' in each group is taken as that customer's last purchase date.

## K-Means

In [None]:
# Creating a generic user dataframe to keep CustomerID and new segmentation scores
df_user = pd.DataFrame(df_uk['customerId'].unique())
df_user.columns = ['customerId']

# Merging this dataframe to our new user dataframe
df_user = pd.merge(df_user, df_uk_purchase[['customerId','recency']], on='customerId')

from sklearn.cluster import KMeans

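sse = {}
df_recency = df_user[['recency']].copy()
for k in range(1, 10):
    # Fitting on the recency column only, so labels from a previous iteration never become a feature
    kmeans = KMeans(n_clusters=k, max_iter=1000).fit(df_recency[['recency']])
    df_recency["clusters"] = kmeans.labels_
    sse[k] = kmeans.inertia_ 
plt.figure()
plt.plot(list(sse.keys()), list(sse.values()))
plt.xlabel("Number of clusters")
plt.ylabel("SSE (inertia)")
plt.show()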

In [None]:
#building 4 clusters for recency and add it to dataframe
kmeans = KMeans(n_clusters=4)
kmeans.fit(df_user[['recency']])
df_user['recencyCluster_Kmeans'] = kmeans.predict(df_user[['recency']])

#function for ordering cluster numbers
def order_cluster(cluster_field_name, target_field_name,df,ascending):
    new_cluster_field_name = 'new_' + cluster_field_name
    df_new = df.groupby(cluster_field_name)[target_field_name].mean().reset_index()
    df_new = df_new.sort_values(by=target_field_name,ascending=ascending).reset_index(drop=True)
    df_new['index'] = df_new.index
    df_final = pd.merge(df,df_new[[cluster_field_name,'index']], on=cluster_field_name)
    df_final = df_final.drop([cluster_field_name],axis=1)
    df_final = df_final.rename(columns={"index":cluster_field_name})
    return df_final

df_user = order_cluster('recencyCluster_Kmeans', 'recency',df_user,False)

df_user
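
The elbow plot is read by eye. One common complement, which is not part of the original analysis, is the silhouette score: it measures how well each point sits inside its own cluster compared with the nearest other cluster, on a scale from -1 to 1. The sketch below uses synthetic recency values rather than the real data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Synthetic recency values (days) drawn around three centres; purely illustrative
recency = np.concatenate([
    rng.normal(10, 3, 100),
    rng.normal(120, 10, 100),
    rng.normal(300, 15, 100),
]).reshape(-1, 1)

# Silhouette score for each candidate number of clusters (needs at least 2 clusters)
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(recency)
    scores[k] = silhouette_score(recency, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("Best k by silhouette:", best_k)
```

On real data the best score will rarely be this clear-cut, so it is best read alongside the elbow plot rather than instead of it.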

## DBSCAN

In [None]:
# Initialising an object neigh by calling a method NearestNeighbors()
neigh = NearestNeighbors(n_neighbors = 20)

# Training the model by calling a method fit()
nbrs = neigh.fit(df_user[['customerId','recency']])

# Storing the distance and indices into distances and indices arrays
distances, indices = nbrs.kneighbors(df_user[['customerId','recency']])

print(distances, indices)

In [None]:
# Plotting K-distance Graph
distances = np.sort(distances, axis=0)
distances = distances[:,1]
plt.figure(figsize=(20,10))
plt.plot(distances)
plt.title('K-distance Graph',fontsize=20)
plt.xlabel('Data Points sorted by distance',fontsize=14)
plt.ylabel('Epsilon',fontsize=14)
plt.show()
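
The k-distance graph above plots the distance to each point's nearest neighbour. A common heuristic, shown here as an alternative sketch and not as the method used in this notebook, is to plot the distance to the neighbour that matches `min_samples`, so that the "knee" of the curve relates directly to the DBSCAN setting. The data is synthetic.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
X = rng.normal(0, 1, size=(200, 1))  # synthetic one-dimensional feature

min_samples = 6

# When the training data is queried, each point is returned as its own nearest
# neighbour (distance 0), so the last column is the distance to the
# (min_samples - 1)-th other point
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)

# Sorted k-distances: the value at the knee of this curve is a candidate eps
k_dist = np.sort(distances[:, -1])
print(k_dist[:5], k_dist[-5:])
```

Plotting `k_dist` with `plt.plot(k_dist)` gives the curve from which a candidate eps can be read.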

In [None]:
# Initialise an object by calling a method DBSCAN along with parameters as eps and min_samples
dbscan_opt = DBSCAN(eps = 50, min_samples =6)

# Train the model by calling a method fit()
dbscan_opt.fit(df_user[['customerId','recency']])

In [None]:
# Add another column into the dataframe (df)
df_user['recencyCluster_DBSCAN'] = dbscan_opt.labels_

# Display the counts by labels
df_user['recencyCluster_DBSCAN'].value_counts()
df_user.head()

In [None]:
max(dbscan_opt.labels_)

In [None]:
df_user['recencyCluster_DBSCAN']=dbscan_opt.labels_
df_user['recencyCluster_DBSCAN'].value_counts()

## Frequency of Each Client's Purchases

In [None]:
#getting order counts for each user and create a dataframe with it
df_frequency = df_uk.groupby('customerId').invoicedate.count().reset_index()
df_frequency.columns = ['customerId','frequency']

#adding this data to our main dataframe
df_user = pd.merge(df_user, df_frequency, on='customerId')

df_user

## K-Means

In [None]:
# k-means
kmeans = KMeans(n_clusters=4)
kmeans.fit(df_user[['frequency']])
df_user['frequencyCluster_Kmeans'] = kmeans.predict(df_user[['frequency']])

# ordering the frequency cluster
df_user = order_cluster('frequencyCluster_Kmeans', 'frequency',df_user,True)

# details of each cluster
df_user.groupby('frequencyCluster_Kmeans')['frequency'].describe()

## DBSCAN

In [None]:
# Initialising an object neigh by calling a method NearestNeighbors()
neigh = NearestNeighbors(n_neighbors = 20)

# Training the model by calling a method fit()
nbrs = neigh.fit(df_user[['customerId','frequency']])

# Storing the distance and indices into distances and indices arrays
distances, indices = nbrs.kneighbors(df_user[['customerId','frequency']])

print(distances, indices)

In [None]:
# Plotting K-distance Graph
distances = np.sort(distances, axis=0)
distances = distances[:,1]
plt.figure(figsize=(20,10))
plt.plot(distances)
plt.title('K-distance Graph',fontsize=20)
plt.xlabel('Data Points sorted by distance',fontsize=14)
plt.ylabel('Epsilon',fontsize=14)
plt.show()

In [None]:
# Initialise an object by calling a method DBSCAN along with parameters as eps and min_samples
dbscan_opt = DBSCAN(eps = 100, min_samples =6)

# Train the model by calling a method fit()
dbscan_opt.fit(df_user[['customerId','frequency']])

In [None]:
# Add another column into the dataframe (df)
df_user['frequencyCluster_DBSCAN'] = dbscan_opt.labels_

# Display the counts by labels
df_user['frequencyCluster_DBSCAN'].value_counts()
df_user.head()

In [None]:
max(dbscan_opt.labels_)

In [None]:
df_user['frequencyCluster_DBSCAN']=dbscan_opt.labels_
df_user['frequencyCluster_DBSCAN'].value_counts()

## Revenue the Company Makes per Customer

In [None]:
#calculating revenue for each customer from the per-line revenue (price * quantity) computed earlier
df_revenue = df_uk.groupby('customerId').revenue.sum().reset_index()
#keeping the 'revenue_x' column name that the following cells expect
df_revenue.columns = ['customerId', 'revenue_x']

#merging it with our main dataframe
df_user = pd.merge(df_user, df_revenue, on='customerId')

In [None]:
df_user.info()

## K-Means

In [None]:
#applying clustering
kmeans = KMeans(n_clusters=4)
kmeans.fit(df_user[['revenue_x']])
df_user['revenueCluster_Kmeans'] = kmeans.predict(df_user[['revenue_x']])


#ordering the cluster numbers
df_user = order_cluster('revenueCluster_Kmeans', 'revenue_x',df_user,True)

#showing details of the dataframe
df_user.groupby('revenueCluster_Kmeans')['revenue_x'].describe()

## DBSCAN

In [None]:
# Initialising an object neigh by calling a method NearestNeighbors()
neigh = NearestNeighbors(n_neighbors = 20)

# Training the model by calling a method fit()
nbrs = neigh.fit(df_user[['customerId','revenue_x']])

# Storing the distance and indices into distances and indices arrays
distances, indices = nbrs.kneighbors(df_user[['customerId','revenue_x']])

print(distances, indices)

In [None]:
# Plotting K-distance Graph
distances = np.sort(distances, axis=0)
distances = distances[:,1]
plt.figure(figsize=(20,10))
plt.plot(distances)
plt.title('K-distance Graph',fontsize=20)
plt.xlabel('Data Points sorted by distance',fontsize=14)
plt.ylabel('Epsilon',fontsize=14)
plt.show()

In [None]:
# Initialise an object by calling a method DBSCAN along with parameters as eps and min_samples
dbscan_opt = DBSCAN(eps = 100, min_samples =6)

# Train the model by calling a method fit()
dbscan_opt.fit(df_user[['customerId','revenue_x']])

In [None]:
# Add another column into the dataframe (df)
df_user['revenueCluster_DBSCAN'] = dbscan_opt.labels_

# Display the counts by labels
df_user['revenueCluster_DBSCAN'].value_counts()
df_user.head()

In [None]:
#ordering the cluster numbers
df_user = order_cluster('revenueCluster_DBSCAN', 'revenue_x',df_user,True)

#showing details of the dataframe
df_user.groupby('revenueCluster_DBSCAN')['revenue_x'].describe()

In [None]:
max(dbscan_opt.labels_)

In [None]:
df_user['revenueCluster_DBSCAN']=dbscan_opt.labels_
df_user['revenueCluster_DBSCAN'].value_counts()

## Overall Score

In [None]:
df_user.info()

In [None]:
# df_user.to_csv('df_user.csv', index=False)

# # If you need to provide a download link in Jupyter Notebook:
# from IPython.display import FileLink
# FileLink(r'df_user.csv')

In [None]:
# Calculating overallScore by summing the cluster scores
df_user['overallScore'] = (df_user['recencyCluster_Kmeans'] +
                           df_user['frequencyCluster_Kmeans'] +
                           df_user['revenueCluster_Kmeans'] +
                           df_user['recencyCluster_DBSCAN'] +
                           df_user['frequencyCluster_DBSCAN'] +
                           df_user['revenueCluster_DBSCAN'])

# Grouping by overallScore and calculate mean of 'recency', 'frequency', and 'revenue'
df_user.groupby('overallScore')[['recency', 'frequency', 'revenue_x']].mean()  # Assuming revenue_x is the correct revenue column to use

In [None]:
print(df_user.describe())

df_user.loc[df_user['overallScore']<=5,'Segment'] = 'Low-Value'
df_user.loc[df_user['overallScore']>5,'Segment'] = 'Mid-Value' 
df_user.loc[df_user['overallScore']>7,'Segment'] = 'High-Value' 
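
The same three-way split can be written as a single binning step. This is a sketch on invented scores, using `pd.cut` with the thresholds from the cell above (up to 5, above 5 up to 7, above 7).

```python
import pandas as pd

# Invented overall scores
scores = pd.DataFrame({"overallScore": [0, 3, 5, 6, 7, 8, 9]})

# Right-closed bins: (-inf, 5] Low-Value, (5, 7] Mid-Value, (7, inf) High-Value
scores["Segment"] = pd.cut(
    scores["overallScore"],
    bins=[float("-inf"), 5, 7, float("inf")],
    labels=["Low-Value", "Mid-Value", "High-Value"],
)

print(scores)
```

Keeping the thresholds in one list makes them easier to adjust and to explain to a non-specialist audience.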



## Clustering Revenue vs Frequency - K-Means

In [None]:
# Working on a copy so that the new columns do not trigger a SettingWithCopyWarning
data_to_cluster = df_user[['revenue_x', 'frequency']].copy()

# Applying K-means clustering on the full dataset with an optimal cluster count
km = KMeans(n_clusters=4, random_state=42)
data_to_cluster['cluster'] = km.fit_predict(data_to_cluster)

# Defining cluster names based on the discussed characteristics
cluster_labels = {0: "Casual", 1: "VIPs", 2: "Regulars", 3: "Big Spenders"}
data_to_cluster['cluster_label'] = data_to_cluster['cluster'].map(cluster_labels)

# Plotting Revenue vs Frequency with cluster coloring and labels
plt.figure(figsize=(10, 6))
scatter = plt.scatter(data_to_cluster['revenue_x'], data_to_cluster['frequency'], 
                      c=data_to_cluster['cluster'], cmap='viridis', alpha=0.5)
plt.title('Revenue vs Frequency - K-means Clustering')
plt.xlabel('Revenue')
plt.ylabel('Frequency')
plt.grid(True)

# Creating a legend with the named clusters
legend_labels = [cluster_labels[i] for i in range(4)]  # List of labels for the legend
legend_handles = scatter.legend_elements()[0]  # Get legend handles
legend1 = plt.legend(legend_handles, legend_labels, title="Clusters")
plt.gca().add_artist(legend1)

plt.show()

## Clustering Revenue vs Frequency - DBSCAN

In [None]:
# Working on a copy so that the new columns do not trigger a SettingWithCopyWarning
data_to_cluster = df_user[['revenue_x', 'frequency']].copy()
data_to_cluster['cluster'] = df_user['revenueCluster_DBSCAN']

# Defining custom labels for DBSCAN clusters based on the hypothetical understanding of the cluster
# Adjusting these keys based on the actual cluster numbers observed
cluster_labels = {
    -1: "Outliers",  # DBSCAN labels noise as -1
    0: "Sporadic Shoppers",
    1: "Occasional Shoppers",
    2: "Frequent Customers",
    3: "Regular Customers",
    4: "Consistent Customers",
    5: "Loyal Customers",
    6: "Very Loyal Customers",
    7: "Premium Customers",
    8: "Very Premium Customers"
}
data_to_cluster['cluster_label'] = data_to_cluster['cluster'].map(cluster_labels)

# Plotting Revenue vs Frequency with DBSCAN cluster coloring
plt.figure(figsize=(10, 6))
scatter = plt.scatter(data_to_cluster['revenue_x'], data_to_cluster['frequency'], 
                      c=data_to_cluster['cluster'], cmap='Set1', alpha=0.5)
plt.title('Revenue vs Frequency - DBSCAN Clustering')
plt.xlabel('Revenue')
plt.ylabel('Frequency')
plt.grid(True)

# Creating a legend with the named clusters
legend_labels = [cluster_labels.get(i, "Unknown") for i in sorted(data_to_cluster['cluster'].unique())]
legend_handles = scatter.legend_elements()[0]  # Get legend handles
legend1 = plt.legend(legend_handles, legend_labels, title="Clusters")
plt.gca().add_artist(legend1)

plt.show()

## Clustering Revenue vs Recency - K-Means

In [None]:
cluster_descriptive_mapping = {
    1: "Frequent & High Spenders",
    2: "Recent & Moderate Spenders",
    3: "Inactive & High Spenders",
    4: "Inactive & Low Spenders"
}

# Apply the mapping to the DataFrame
df_user['descriptive_cluster'] = df_user['recencyCluster_Kmeans'].map(cluster_descriptive_mapping)

# Setting up the plot with the new descriptive labels
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df_user, x='recency', y='revenue_x', hue='descriptive_cluster', palette='viridis', s=100, alpha=0.7)

# Enhancing the plot
plt.title('Revenue vs Recency', fontsize=16)
plt.xlabel('Recency (days since last purchase)', fontsize=14)
plt.ylabel('Revenue', fontsize=14)
plt.legend(title='Customer Segment', title_fontsize='13', fontsize='12', loc='upper right')
plt.grid(True)

# Show plot
plt.show()

## Clustering Revenue vs Recency - DBSCAN

In [None]:
plt.figure(figsize=(10, 6))

# Assign more descriptive labels based on cluster IDs and characteristics
descriptive_labels = {
    -1: 'Noise - Outliers',
    0: 'Inactive & Spender',
    1: 'Inactive & Good Spender',
    2: 'Recent & Moderate Spender',
    3: 'Recent & Good Spender',
    4: 'Recent & High Spender',
    5: 'Frequent & Good Spender',
    6: 'Frequent & High Spender'
}

# Create a scatter plot with uniform circle markers
for cluster_id in sorted(df_user['recencyCluster_DBSCAN'].unique()):
    cluster_data = df_user[df_user['recencyCluster_DBSCAN'] == cluster_id]
    label = descriptive_labels.get(cluster_id, f'Cluster {cluster_id}')
    plt.scatter(cluster_data['recency'], cluster_data['revenue_x'], label=label, marker='o', s=100)

# Enhancing the plot
plt.title('Refined Revenue vs Recency (DBSCAN Clustering)', fontsize=16)
plt.xlabel('Recency (days since last purchase)', fontsize=14)
plt.ylabel('Revenue', fontsize=14)
plt.legend(title='DBSCAN Cluster', title_fontsize='13', fontsize='12', loc='upper right')
plt.grid(True)

# Show the refined plot
plt.show()

## Clustering Revenue vs Frequency - K-Means

In [None]:
frequency_cluster_descriptive_mapping = {
    0: "Low Activity",
    1: "Moderate Activity",
    2: "High Activity",
    3: "Very High Activity"
}

# Apply the mapping to the DataFrame
df_user['descriptive_frequency_label'] = df_user['frequencyCluster_Kmeans'].map(frequency_cluster_descriptive_mapping)


# Setting up the plot with uniform circle markers and an easier to distinguish color palette
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df_user, x='frequency', y='revenue_x', hue='descriptive_frequency_label', 
                palette='Set2', s=100, style='descriptive_frequency_label', markers=['o']*4, alpha=0.7)

# Enhancing the plot
plt.title('Revenue vs Frequency with Accessible Colors', fontsize=16)
plt.xlabel('Frequency (number of transactions)', fontsize=14)
plt.ylabel('Revenue', fontsize=14)
plt.legend(title='Customer Activity Level', title_fontsize='13', fontsize='12', loc='upper right')
plt.grid(True)

# Show the plot
plt.show()

## Clustering Revenue vs Frequency - DBSCAN

In [None]:
# Assign more descriptive labels based on DBSCAN clustering characteristics
dbscan_frequency_descriptive_labels = {
    -1: 'Outliers - Noise',
    0: 'Low Activity',
    1: 'Moderate Activity',
    2: 'High Activity',
    3: 'Very High Activity'
}

# Map the new descriptive labels to the data
df_user['descriptive_frequency_dbscan_label'] = df_user['frequencyCluster_DBSCAN'].map(dbscan_frequency_descriptive_labels)

# Setting up the plot with descriptive labels
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df_user, x='frequency', y='revenue_x', hue='descriptive_frequency_dbscan_label', 
                palette='Set2', s=100, marker='o', alpha=0.7)

# Enhancing the plot
plt.title('Revenue vs Frequency with Descriptive DBSCAN Labels', fontsize=16)
plt.xlabel('Frequency (number of transactions)', fontsize=14)
plt.ylabel('Revenue', fontsize=14)
plt.legend(title='DBSCAN Frequency Cluster', title_fontsize='13', fontsize='12', loc='upper right')
plt.grid(True)

# Show the updated plot
plt.show()
