# **Project Name**    - Shopper Spectrum







##### **Project Type**    - Unsupervised Machine Learning - Clustering
##### **Contribution**    - Individual
##### **Team Member 1 -** Mustafiz Ahmed

# **Project Summary -**

The Shopper Spectrum project aims to leverage customer transactional data to perform intelligent segmentation and provide personalized product recommendations within an e-commerce context. Using the online_retail.csv dataset, the project implements RFM (Recency, Frequency, Monetary) analysis to classify customers based on their shopping behaviors and value to the business. This segmentation enables the identification of high-value, at-risk, and potential customers, helping the business adopt tailored marketing strategies.

The project also builds a collaborative filtering–based recommendation system that uses customer-item purchase patterns to recommend relevant products. By combining segmentation with personalized suggestions, the project addresses both customer retention and revenue generation goals.

Furthermore, an interactive Streamlit web application is developed to provide business users with an easy-to-navigate dashboard for exploring segments and generating live recommendations. The platform combines clustering insights and real-time suggestions, transforming static insights into actionable business strategies.

Overall, Shopper Spectrum demonstrates the end-to-end process of customer analytics — from raw data preprocessing and feature engineering to unsupervised learning and deployment. It delivers scalable, data-driven solutions to improve customer engagement, increase retention, and maximize lifetime value.

# **GitHub Link -**

https://github.com/MZ-314/Shopper-Spectrum

# **Problem Statement**


E-commerce platforms serve a wide range of customers, each with distinct buying behaviors and preferences. However, treating all customers the same leads to missed marketing opportunities and reduced retention. Without customer segmentation and personalized recommendations, businesses struggle to engage users meaningfully, predict future behavior, or prioritize valuable customers.

The objective of this project is twofold:

1. Segment customers based on behavioral data using RFM analysis and clustering to identify loyal, inactive, or high-potential customers.

2. Recommend products to customers using item-based collaborative filtering based on past purchase patterns.

By solving this, the business can target customers more strategically, offer better deals, reduce churn, and ultimately boost customer satisfaction and revenue.
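The item-based collaborative filtering named in the second objective can be illustrated with a minimal sketch. It is not the notebook's recommender: the toy rows are invented, `similar_products` is a hypothetical helper, and cosine similarity is computed with plain NumPy here, although `sklearn.metrics.pairwise.cosine_similarity` (imported later in the notebook) would give the same matrix.

```python
import numpy as np
import pandas as pd

# Toy purchase log: column names mirror the dataset, the rows are invented.
toy = pd.DataFrame({
    "CustomerID":  ["A", "A", "B", "B", "C", "C", "D"],
    "Description": ["MUG", "TEAPOT", "MUG", "TEAPOT", "MUG", "CANDLE", "CANDLE"],
    "Quantity":    [2, 1, 1, 3, 1, 1, 2],
})

# Customer x product matrix of total quantities (0 where nothing was bought).
basket = toy.pivot_table(index="CustomerID", columns="Description",
                         values="Quantity", aggfunc="sum", fill_value=0)

# Cosine similarity between product columns: normalise each product vector
# to unit length, then take dot products.
item_vectors = basket.to_numpy(dtype=float).T          # one row per product
unit = item_vectors / np.linalg.norm(item_vectors, axis=1, keepdims=True)
item_sim = pd.DataFrame(unit @ unit.T,
                        index=basket.columns, columns=basket.columns)


def similar_products(product, top_n=2):
    """Hypothetical helper: the top_n products most similar to `product`."""
    scores = item_sim[product].drop(product)           # exclude the product itself
    return scores.sort_values(ascending=False).head(top_n)


print(similar_products("MUG"))
```

In this toy data the customers who bought MUG mostly also bought TEAPOT, so TEAPOT ranks first. On the real dataset the same idea applies to a much larger and sparser matrix, where a sparse representation would likely be needed.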

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with an updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
!pip install streamlit

In [None]:
# Data manipulation and numerical operations
import pandas as pd
import numpy as np

# Date and time handling
from datetime import datetime, timedelta

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Warnings
import warnings
warnings.filterwarnings('ignore')

# Machine Learning - Clustering
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Distance calculation
from scipy.spatial.distance import cdist

# For collaborative filtering
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

# Streamlit app (for deployment)
import streamlit as st

# Evaluation and model utilities (if needed later)
from sklearn.metrics import silhouette_score

### Dataset Loading

In [None]:
df=pd.read_csv('online_retail.csv')

### Dataset First View

In [None]:
# Display the first 5 rows of the dataset
df.head()

### Dataset Rows & Columns count

In [None]:
# Get the number of rows and columns
rows, columns = df.shape
print(f"The dataset contains {rows} rows and {columns} columns.")

### Dataset Information

In [None]:
# Dataset information
df.info()

#### Duplicate Values

In [None]:
# Count the number of duplicate rows in the dataset
duplicate_count = df.duplicated().sum()
print(f"Number of duplicate rows in the dataset: {duplicate_count}")

#### Missing Values/Null Values

In [None]:
# Count missing values in each column
missing_values = df.isnull().sum()

# Display only columns with at least one missing value
missing_values = missing_values[missing_values > 0]
print("Missing values in each column:\n")
print(missing_values)


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 5))
sns.heatmap(df.isnull(), cbar=False, cmap='Reds')
plt.title("Heatmap of Missing Values", fontsize=14)
plt.show()


### What did you know about your dataset?

The dataset contains over 540,000 transaction records from a UK-based online retail store between 2010 and 2011. Each row represents a single product purchase, including fields such as InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, and Country. The dataset helps in understanding customer buying patterns, product popularity, and purchasing frequency. However, it also contains missing values (particularly in the CustomerID column) and some duplicate or cancelled transactions that need to be cleaned. This dataset is ideal for customer segmentation using RFM (Recency, Frequency, Monetary) analysis and for building a product recommendation system.

## ***2. Understanding Your Variables***

In [None]:
# Display all column names
print("Dataset Columns:")
print(df.columns.tolist())

In [None]:
# Summary statistics for numeric columns
df.describe()

### Variables Description

The dataset contains eight columns, each providing specific information about online retail transactions. The InvoiceNo column is a unique identifier for each transaction; if it begins with the letter "C", it denotes a cancelled order. The StockCode represents the unique code assigned to each product. Description provides the textual name or description of the purchased item. The Quantity column indicates the number of units purchased in the transaction, where negative values may represent product returns. InvoiceDate records the exact date and time of the transaction. The UnitPrice reflects the price per item in British Pounds (GBP). The CustomerID is a unique numeric identifier for each customer; this field contains some missing values for unidentified customers. Lastly, the Country column specifies the geographic location of the customer at the time of purchase.
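The conventions described above (a leading "C" in InvoiceNo for cancellations, negative Quantity for returns, missing CustomerID for unidentified customers) can be checked with a few boolean flags. The sketch below uses invented rows, and the flag column names `IsCancelled`, `IsReturnLike`, and `HasCustomer` are illustrative assumptions, not columns of the dataset.

```python
import pandas as pd

# Invented rows that follow the column conventions described above.
toy = pd.DataFrame({
    "InvoiceNo":  ["536365", "C536379", "536381", "536390"],
    "Quantity":   [6, -1, 12, -4],
    "UnitPrice":  [2.55, 27.50, 1.25, 0.0],
    "CustomerID": [17850.0, 14527.0, None, 15311.0],
})

# Cast to str first: InvoiceNo may load as a mixed-type column.
toy["IsCancelled"] = toy["InvoiceNo"].astype(str).str.startswith("C")
# Negative quantities are treated as return-like rows.
toy["IsReturnLike"] = toy["Quantity"] < 0
# Rows without a CustomerID cannot be attributed to a customer.
toy["HasCustomer"] = toy["CustomerID"].notna()

print(toy[["InvoiceNo", "IsCancelled", "IsReturnLike", "HasCustomer"]])
print("Share of cancelled rows:", toy["IsCancelled"].mean())
```

Flagging before filtering makes it possible to count how many rows each rule affects. Note that a row can have a negative quantity without a "C" invoice number, so the two flags are not interchangeable.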

### Check Unique Values for each variable.

In [None]:
# Check number of unique values per column
unique_counts = df.nunique()
print("Unique value count for each column:\n")
print(unique_counts)

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# 1. Drop rows with missing CustomerID
df = df.dropna(subset=['CustomerID'])

# 2. Remove canceled orders (InvoiceNo starting with 'C')
df = df[~df['InvoiceNo'].astype(str).str.startswith('C')]

# 3. Remove rows with negative or zero Quantity or UnitPrice
#    (.copy() so the column assignments below act on an independent DataFrame)
df = df[(df['Quantity'] > 0) & (df['UnitPrice'] > 0)].copy()

# 4. Convert InvoiceDate to datetime format
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])

# 5. Ensure CustomerID is treated as string (for RFM segmentation)
df['CustomerID'] = df['CustomerID'].astype(str)

# Reset index after filtering
df.reset_index(drop=True, inplace=True)

# View new shape
print("Cleaned dataset shape:", df.shape)


### What all manipulations have you done and insights you found?

To make the dataset analysis-ready, several data wrangling steps were performed. First, all rows with missing CustomerID were removed since customer identification is critical for segmentation and personalized recommendations. Next, cancelled transactions were filtered out by excluding records where InvoiceNo started with the letter 'C'. Transactions with non-positive Quantity or UnitPrice were also removed as they indicate invalid or erroneous entries. The InvoiceDate column was converted to proper datetime format to enable time-based analysis, and CustomerID was cast to string to treat it as a categorical identifier. After cleaning, the dataset size was significantly reduced, improving data quality for meaningful insights. These manipulations helped eliminate noise and ensure accurate calculations in the upcoming RFM segmentation and recommendation tasks.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
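# Count transactions (rows) per country and keep the 10 largest
top_countries = df['Country'].value_counts().head(10)

plt.figure(figsize=(12,6))
sns.barplot(x=top_countries.index, y=top_countries.values, palette="viridis")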
plt.yscale('log')
plt.title("Top 10 Countries by Number of Transactions (Log Scale)")
plt.ylabel("Transaction Count (log scale)")
plt.xlabel("Country")
plt.xticks(rotation=45)
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart was selected to compare transaction counts across the ten most active countries. Country is a categorical variable, so bars give a direct side-by-side comparison, and the logarithmic y-axis keeps the smaller markets visible next to the much larger UK count.

##### 2. What is/are the insight(s) found from the chart?

The United Kingdom accounts for the large majority of transactions, as expected for a UK-based retailer. Outside the UK, the top order contributors include Germany, France, and the Netherlands, which confirms demand from select international markets.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. These insights allow the business to target international logistics improvements, create country-specific promotions, and focus retention efforts on its key non-UK markets. The heavy dependence on a single market is also a concentration risk: a downturn in UK demand would affect most of the order volume.

#### Chart - 2

In [None]:
top_products = df['Description'].value_counts().head(10)

plt.figure(figsize=(10,5))
sns.barplot(x=top_products.values, y=top_products.index, palette="magma")
plt.title("Top 10 Most Purchased Products")
plt.xlabel("Frequency")
plt.ylabel("Product")
plt.show()


##### 1. Why did you pick the specific chart?

A horizontal bar chart was chosen to rank the ten most frequently purchased products. Horizontal bars leave room for long product descriptions and make the ranking easy to read.

##### 2. What is/are the insight(s) found from the chart?

A small group of products appears in far more transactions than the rest, indicating consistent customer interest in these items.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. These products can be kept well stocked, promoted more prominently, or bundled with slower-moving items to lift overall sales. A possible downside is over-reliance on a few bestsellers: a stock-out on these items would affect a large share of orders.

#### Chart - 3

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(df[df['Quantity'] < 100]['Quantity'], bins=50, kde=True)
plt.title("Distribution of Quantity Ordered (Filtered < 100)")
plt.xlabel("Quantity")
plt.ylabel("Frequency")
plt.show()


##### 1. Why did you pick the specific chart?

A histogram was selected to visualize the distribution of quantities ordered per transaction line. It shows how often each quantity occurs and exposes skew in ordering patterns. Quantities of 100 or more are filtered out so that the bulk of the distribution stays readable.

##### 2. What is/are the insight(s) found from the chart?

Most order lines are for low quantities, typically between 1 and 10 items, with a sharp drop-off as quantities increase. The few very large quantities excluded by the filter could indicate bulk purchases or erroneous entries.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Understanding typical order sizes helps in optimizing logistics, inventory planning, and fulfillment efficiency. The large outliers can also highlight data entry issues or bulk-buyer segments worth special targeting.

#### Chart - 4

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Filter out extreme outliers (e.g., prices > 50)
filtered_df = df[df['UnitPrice'] < 50]

plt.figure(figsize=(10, 6))
sns.histplot(filtered_df['UnitPrice'], bins=50, kde=True, color='skyblue')
plt.title("Distribution of Unit Price (Filtered < £50)")
plt.xlabel("Price (GBP)")
plt.ylabel("Frequency")
plt.show()


##### 1. Why did you pick the specific chart?

This refined histogram filters out high-price outliers to better visualize the common product price distribution. It improves readability and focuses analysis on the majority of the catalog.

##### 2. What is/are the insight(s) found from the chart?

Filtering out prices above £50 reveals a tighter cluster of pricing around the £2–£10 range. This gives a more realistic picture of what regular customers typically spend.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. This refined view confirms the business’s competitive pricing strategy and helps identify ideal price points for upselling and discount thresholds.

#### Chart - 5

In [None]:
df['InvoiceMonth'] = df['InvoiceDate'].dt.to_period('M')
monthly_sales = df.groupby('InvoiceMonth')['Quantity'].sum()

plt.figure(figsize=(10,5))
monthly_sales.plot(marker='o')
plt.title("Monthly Sales Trend")
plt.xlabel("Month")
plt.ylabel("Total Quantity Sold")
plt.grid()
plt.show()


##### 1. Why did you pick the specific chart?

A line plot over time was chosen because monthly quantity sold is a time series. Connecting the points makes trends and seasonality easy to see.

##### 2. What is/are the insight(s) found from the chart?

Total quantity sold fluctuates from month to month, with visible peaks that may coincide with holidays, marketing events, or inventory clearances. The first and last months may cover only part of a month, so they should be compared with care.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. This supports planning for future promotions, staffing during peak demand, and seasonal inventory management. Months with weak sales point to periods where demand-generation campaigns may be needed.

#### Chart - 6

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Calculate revenue (UnitPrice × Quantity)
df['Revenue'] = df['UnitPrice'] * df['Quantity']

# Group by country and sum revenue
top_countries = df.groupby('Country')['Revenue'].sum().sort_values(ascending=False).head(10)

# Plot
plt.figure(figsize=(12, 6))
ax = sns.barplot(x=top_countries.index, y=top_countries.values, palette="coolwarm")

# Title and labels
plt.title("Top 10 Countries by Revenue", fontsize=14)
plt.xlabel("Country")
plt.ylabel("Revenue (GBP)")
plt.xticks(rotation=45)

# Annotate bars
for i, v in enumerate(top_countries.values):
    ax.text(i, v + (0.01 * v), f'£{int(v):,}', ha='center', va='bottom', fontsize=9)

plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

This chart evaluates not just frequency of orders but the total revenue contribution per country, helping identify the most financially valuable markets.

##### 2. What is/are the insight(s) found from the chart?

Some countries contribute more in revenue despite fewer orders, implying larger basket sizes or premium purchases.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Countries contributing higher revenue per transaction can be prioritized for loyalty programs, premium shipping services, or product expansions. Conversely, low-revenue countries may be reevaluated for profitability.

#### Chart - 7

In [None]:
top_customers = df.groupby('CustomerID')['Revenue'].sum().sort_values(ascending=False).head(10)

plt.figure(figsize=(10,5))
sns.barplot(x=top_customers.index, y=top_customers.values, palette='Set2')
plt.title("Top 10 Customers by Revenue")
plt.xlabel("Customer ID")
plt.ylabel("Revenue (GBP)")
plt.xticks(rotation=45)
plt.show()


##### 1. Why did you pick the specific chart?

A bar plot ranks the ten customers with the highest total revenue, aiding customer-level analysis.

##### 2. What is/are the insight(s) found from the chart?

A handful of customers generate significantly more revenue than the rest, indicating VIP or business (wholesale) clients.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. These customers are vital to business stability and should be enrolled in loyalty programs or offered exclusive benefits. Losing them could severely impact revenue, so retention is critical.

#### Chart - 8

In [None]:
df['Hour'] = df['InvoiceDate'].dt.hour
hourly_orders = df.groupby('Hour')['InvoiceNo'].nunique()

plt.figure(figsize=(10,5))
sns.lineplot(x=hourly_orders.index, y=hourly_orders.values, marker='o')
plt.title("Number of Orders by Hour")
plt.xlabel("Hour of Day")
plt.ylabel("Number of Orders")
plt.grid()
plt.show()


##### 1. Why did you pick the specific chart?

A line plot was chosen because hour of day is an ordered variable. The line shows how the number of distinct invoices rises and falls over the trading day.

##### 2. What is/are the insight(s) found from the chart?

Order counts vary by hour rather than being spread evenly: activity concentrates in a band of daytime hours and falls away toward the start and end of the day.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Knowing the busiest hours helps schedule customer support and fulfillment staff, and helps time marketing emails or promotions for when customers are most likely to order. Quiet hours, by contrast, are poor slots for time-limited offers.

#### Chart - 9

In [None]:
product_revenue = df.groupby('Description')['Revenue'].sum().sort_values(ascending=False).head(10)

plt.figure(figsize=(10,5))
sns.barplot(x=product_revenue.values, y=product_revenue.index, palette="cubehelix")
plt.title("Top 10 Revenue-Generating Products")
plt.xlabel("Revenue (GBP)")
plt.ylabel("Product")
plt.show()


##### 1. Why did you pick the specific chart?

A horizontal bar chart ranks the ten products with the highest total revenue, showing the financially most important SKUs rather than the most frequently ordered ones.

##### 2. What is/are the insight(s) found from the chart?

A few products contribute disproportionately to revenue, possibly due to high price points or frequent large orders.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Prioritizing marketing, stock, and supplier management around these products can significantly affect bottom-line growth. Heavy dependence on a few products is also a risk if demand for them drops.

#### Chart - 10

In [None]:
# Count number of distinct orders (invoices) per customer
order_freq = df.groupby('CustomerID')['InvoiceNo'].nunique()

# Plot histogram of order counts (linear y-axis)
plt.figure(figsize=(12, 6))
sns.histplot(order_freq, bins=50, kde=True, color='teal')
plt.yscale('linear')  # optional: switch to 'log' if the long tail is hard to see
plt.xlabel('Number of Orders')
plt.ylabel('Customer Count')
plt.title('Frequency of Orders by Customers')
plt.grid(True)
plt.show()


##### 1. Why did you pick the specific chart?

A histogram helps in understanding how often customers make purchases, identifying one-time vs. repeat buyers.

##### 2. What is/are the insight(s) found from the chart?

Most customers are one-time or infrequent buyers. There is a small tail of highly engaged customers who buy regularly.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. These insights enable segmentation for email marketing and remarketing efforts. One-time buyers could be nurtured into repeat customers with targeted campaigns.

#### Chart - 11

In [None]:
# Top 10 most frequently ordered products
top_products = df['Description'].value_counts().head(10)

# Plot
plt.figure(figsize=(12, 6))
sns.barplot(x=top_products.values, y=top_products.index, palette="viridis")
plt.title("Top 10 Most Frequently Ordered Products")
plt.xlabel("Number of Orders")
plt.ylabel("Product Description")
plt.grid(axis='x')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A horizontal bar chart was ideal for visualizing the products with the highest order frequency while accommodating long product descriptions.

##### 2. What is/are the insight(s) found from the chart?

Certain products dominate the order list, indicating consistent customer interest or essential items.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. These products can be promoted more aggressively, stocked adequately, or bundled with other items to boost sales of slower-moving stock.

#### Chart - 12

In [None]:
# Create a revenue column
df['Revenue'] = df['Quantity'] * df['UnitPrice']

# Group by product and sum revenue
top_revenue_products = df.groupby('Description')['Revenue'].sum().sort_values(ascending=False).head(10)

# Plot
plt.figure(figsize=(12, 6))
sns.barplot(x=top_revenue_products.values, y=top_revenue_products.index, palette="magma")
plt.title("Top 10 Products by Total Revenue")
plt.xlabel("Total Revenue (GBP)")
plt.ylabel("Product Description")
plt.grid(axis='x')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Instead of volume, this chart shows top revenue-generating products, revealing financially most important SKUs.

##### 2. What is/are the insight(s) found from the chart?

A few products contribute disproportionately to revenue, possibly due to high price points or frequent large orders.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Prioritizing marketing, stock, and supplier management around these products can significantly impact bottom-line growth.

#### Chart - 13

In [None]:
# Convert InvoiceDate to datetime if not already
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])

# Create Year-Month column
df['InvoiceMonth'] = df['InvoiceDate'].dt.to_period('M')

# Group by InvoiceMonth and calculate monthly revenue
monthly_revenue = df.groupby('InvoiceMonth')['Revenue'].sum().reset_index()
monthly_revenue['InvoiceMonth'] = monthly_revenue['InvoiceMonth'].astype(str)

# Plot
plt.figure(figsize=(12, 6))
sns.lineplot(x='InvoiceMonth', y='Revenue', data=monthly_revenue, marker='o', color='teal')
plt.title("Monthly Revenue Trend")
plt.xlabel("Month")
plt.ylabel("Revenue (GBP)")
plt.xticks(rotation=45)
plt.grid(True)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A line plot over time is best to evaluate performance trends, seasonality, and long-term growth.

##### 2. What is/are the insight(s) found from the chart?

Revenue fluctuates by month, with visible spikes indicating high-performing periods. These might align with seasonal sales, promotions, or campaign rollouts.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. This insight allows for future sales forecasting, campaign planning, and resource allocation to match expected seasonal trends.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Select only numerical columns for correlation
numeric_cols = df.select_dtypes(include='number')

# Compute correlation matrix
corr_matrix = numeric_cols.corr()

# Plot heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title("Correlation Heatmap of Numerical Features")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A correlation heatmap was chosen because it is one of the most effective visual tools for understanding the linear relationships between multiple numerical features in a dataset. It gives a compact representation of how strongly variables are related, using color gradients to show the correlation coefficients. For a project that involves customer segmentation and product and revenue analysis, it is important to detect redundant or highly associated metrics (e.g., Quantity and Revenue). Visualizing these relationships supports informed decisions during feature selection and transformation before clustering.

##### 2. What is/are the insight(s) found from the chart?

The heatmap revealed several important correlations within the dataset. Notably, there was a strong positive correlation between Quantity and Revenue, indicating that higher quantities ordered naturally lead to higher transaction values. A milder positive correlation was observed between UnitPrice and Revenue. CustomerID and InvoiceNo do not appear in the heatmap: CustomerID was cast to string during wrangling and InvoiceNo is a text identifier, so `select_dtypes(include='number')` excludes both. Correlations among the remaining numerical variables were weak, suggesting that each carries relatively unique information, which is valuable for the clustering tasks later in the pipeline.

#### Chart - 15 - Pair Plot

In [None]:
# Select key numerical features for the pair plot
pairplot_features = df[['Quantity', 'UnitPrice', 'Revenue']]

# Plotting the pair plot (no hue is set, so a palette argument would have no effect)
sns.pairplot(pairplot_features, diag_kind='kde', corner=True)
plt.suptitle("Pair Plot of Key Numerical Features", y=1.02)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A pair plot (also known as a scatterplot matrix) was chosen to visualize pairwise relationships between multiple numerical features in the dataset. This chart is particularly useful for detecting underlying patterns, trends, correlations, and clusters between features like Quantity, UnitPrice, and Revenue. It also displays the distribution of each variable along the diagonals using histograms or KDE plots, providing insights into individual feature distributions. In a customer segmentation and revenue analysis project, a pair plot helps assess how features interact and whether any natural groupings or anomalies are visually apparent, which is valuable before applying clustering or machine learning models.

##### 2. What is/are the insight(s) found from the chart?

The pair plot showed a strong positive relationship between Quantity and Revenue, reaffirming the insight from the correlation heatmap. Most data points are concentrated in the lower-left corner of each scatterplot, indicating that transactions with both low quantity and low revenue are the most common. Some scatterplots showed sparse data with a few extreme outliers, particularly in UnitPrice, which suggests the presence of high-value items or possibly pricing anomalies. The KDE curves along the diagonal confirmed that most features are right-skewed, especially Revenue, meaning a small number of transactions contribute a large portion of the overall revenue.
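The right skew described above can also be quantified rather than read off a plot. The following is a minimal sketch on synthetic log-normal values, not on the notebook's data: it compares the sample skewness before and after a `log1p` transform, a common way to tame long right tails before distance-based clustering. Whether the notebook applies such a transform is not stated here.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "revenue" values (log-normal); not drawn from the dataset.
rng = np.random.default_rng(0)
revenue = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=5_000), name="Revenue")

# Sample skewness: values well above 0 indicate a long right tail.
raw_skew = revenue.skew()

# log1p compresses large values; skewness should move toward 0.
log_skew = np.log1p(revenue).skew()

print(f"Skewness before log1p: {raw_skew:.2f}")
print(f"Skewness after  log1p: {log_skew:.2f}")
```

The same two lines can be run on `df['Revenue']` or on the aggregated RFM columns to decide whether a transform is worthwhile before scaling.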

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Hypothetical Statement 1:
Statement:
“High-value customers (top 25% by total monetary value) generate more revenue per invoice than occasional customers (bottom 25%).”

Null Hypothesis (H₀):
There is no difference in the average revenue per invoice between high-value and occasional customers.

Alternative Hypothesis (H₁):
High-value customers have a significantly higher average revenue per invoice than occasional customers.

Hypothetical Statement 2:
Statement:
“Customers from the United Kingdom place higher quantity orders on average compared to customers from other countries.”

Null Hypothesis (H₀):
The mean quantity ordered by UK customers is equal to the mean quantity ordered by customers from other countries.

Alternative Hypothesis (H₁):
The mean quantity ordered by UK customers is greater than the mean quantity ordered by customers from other countries.

Hypothetical Statement 3:
Statement:
“Frequent buyers (top 25% in frequency) tend to buy at a lower average unit price compared to infrequent buyers (bottom 25%).”

Null Hypothesis (H₀):
There is no difference in the average unit price between frequent and infrequent buyers.

Alternative Hypothesis (H₁):
Frequent buyers purchase at a lower average unit price than infrequent buyers.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Hypothetical Statement 1:
Statement:
“High-value customers (top 25% by total monetary value) generate more revenue per invoice than occasional customers (bottom 25%).”

Null Hypothesis (H₀):
There is no difference in the average revenue per invoice between high-value and occasional customers.

Alternative Hypothesis (H₁):
High-value customers have a significantly higher average revenue per invoice than occasional customers.

#### 2. Perform an appropriate statistical test.

In [None]:
import pandas as pd
from scipy.stats import ttest_ind

# 1. Make sure line-level revenue exists before it is aggregated
df['Revenue'] = df['Quantity'] * df['UnitPrice']

# 2. Compute RFM and define segments
snapshot_date = df['InvoiceDate'].max() + pd.Timedelta(days=1)
rfm = df.groupby('CustomerID').agg({
    'InvoiceDate': lambda x: (snapshot_date - x.max()).days,
    'InvoiceNo': 'nunique',
    'Revenue': 'sum'
}).rename(columns={'InvoiceDate':'Recency',
                   'InvoiceNo':'Frequency',
                   'Revenue':'Monetary'})

# Example thresholding: top 25% by Monetary = high-value, bottom 25% = occasional
high_value = rfm[rfm['Monetary'] >= rfm['Monetary'].quantile(0.75)]
occasional = rfm[rfm['Monetary'] <= rfm['Monetary'].quantile(0.25)]

# 3. Revenue per invoice (one row per customer-invoice pair), split by group
rev_per_invoice = df.groupby(['CustomerID','InvoiceNo'])['Revenue'].sum().reset_index()

high_rev = rev_per_invoice[rev_per_invoice['CustomerID'].isin(high_value.index)]['Revenue']
occ_rev  = rev_per_invoice[rev_per_invoice['CustomerID'].isin(occasional.index)]['Revenue']

# 4. Perform one-tailed Welch's t-test (H1: high-value mean > occasional mean)
t_stat, p_value = ttest_ind(high_rev, occ_rev, alternative='greater', equal_var=False)

print("T-statistic:", round(t_stat,3))
print("P-value:", round(p_value,4))


##### Which statistical test have you done to obtain P-Value?

For Hypothesis 1, I used the Welch’s t-test (a variation of the independent two-sample t-test) to compare the average revenue per invoice between high-value and occasional customers.



##### Why did you choose the specific statistical test?

I chose Welch’s t-test because the two customer groups being compared (high-value customers and occasional customers) are independent and likely to have different sample sizes and unequal variances in their revenue-per-invoice distributions. In such cases, the regular Student’s t-test is not appropriate, as it assumes equal variances (homoscedasticity), an assumption that rarely holds for real-world customer purchasing data.

Welch’s t-test is a more reliable and robust alternative when these assumptions are violated. It adjusts the degrees of freedom and provides a more accurate p-value under unequal variance conditions, making it the best fit for this business scenario.

This choice keeps the statistical inference about customer value defensible and suitable for making data-driven business decisions.
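To make the degrees-of-freedom adjustment concrete, the sketch below computes the Welch statistic and the Welch–Satterthwaite degrees of freedom by hand. The function name `welch_t` and the sample revenue values are illustrative assumptions, not taken from the dataset.

```python
import numpy as np

def welch_t(a, b):
    """Return Welch's t-statistic and approximate degrees of freedom."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Per-group variance of the mean (sample variance / group size)
    va = a.var(ddof=1) / a.size
    vb = b.var(ddof=1) / b.size
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom
    dof = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    return t, dof

# Hypothetical revenue-per-invoice samples (made up for illustration)
high = [120.0, 340.0, 560.0, 410.0, 980.0]
low = [15.0, 22.0, 9.0, 30.0]

t_stat, dof = welch_t(high, low)
print("t:", round(t_stat, 3), "dof:", round(dof, 2))
```

Because the variances are not pooled, the degrees of freedom fall between the smaller group's n - 1 and the pooled n1 + n2 - 2, which is what makes the p-value more conservative when one group is much noisier than the other.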

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Hypothetical Statement 2:
Statement:
“Customers from the United Kingdom place higher quantity orders on average compared to customers from other countries.”

Null Hypothesis (H₀):
The mean quantity ordered by UK customers is equal to the mean quantity ordered by customers from other countries.

Alternative Hypothesis (H₁):
The mean quantity ordered by UK customers is greater than the mean quantity ordered by customers from other countries.

#### 2. Perform an appropriate statistical test.

In [None]:
from scipy.stats import ttest_ind

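# Split line-item quantities into UK and non-UK transactions
# (each row is one invoice line, so this compares quantity per line item, not per order)
uk_customers = df[df['Country'] == 'United Kingdom']['Quantity']
non_uk_customers = df[df['Country'] != 'United Kingdom']['Quantity']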

# Perform one-tailed Welch’s t-test
t_stat2, p_value2 = ttest_ind(uk_customers, non_uk_customers, alternative='greater', equal_var=False)

print("T-statistic:", round(t_stat2, 3))
print("P-value:", round(p_value2, 4))


##### Which statistical test have you done to obtain P-Value?

I used an independent two-sample t-test (Welch’s t-test) again, comparing the quantity ordered by UK customers vs. customers from other countries.

##### Why did you choose the specific statistical test?

The two groups (UK vs. Non-UK customers) are independent and likely have unequal sample sizes and variances. Welch’s t-test is ideal in such cases as it does not assume equal variances. It’s robust and widely used for comparing means from two separate groups with real-world data variability.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Hypothetical Statement 3:
Statement:
“Frequent buyers (top 25% in frequency) tend to buy at a lower average unit price compared to infrequent buyers (bottom 25%).”

Null Hypothesis (H₀):
There is no difference in the average unit price between frequent and infrequent buyers.

Alternative Hypothesis (H₁):
Frequent buyers purchase at a lower average unit price than infrequent buyers.

#### 2. Perform an appropriate statistical test.

In [None]:
from scipy.stats import ttest_ind

# Count distinct invoices per customer (purchase frequency)
customer_freq = df.groupby('CustomerID')['InvoiceNo'].nunique()

# Attach frequency to each transaction; map() is safe to re-run, unlike merge()
df['Frequency'] = df['CustomerID'].map(customer_freq)

# Define top 25% and bottom 25% of customers using customer-level quartiles
# (quartiles taken over transaction rows would over-weight frequent buyers)
q75 = customer_freq.quantile(0.75)
q25 = customer_freq.quantile(0.25)
top_25 = df[df['Frequency'] >= q75]
bottom_25 = df[df['Frequency'] <= q25]

# Run Welch's t-test (one-tailed)
t_stat3, p_value3 = ttest_ind(top_25['UnitPrice'], bottom_25['UnitPrice'], alternative='less', equal_var=False)

print("T-statistic:", round(t_stat3, 3))
print("P-value:", round(p_value3, 4))


##### Which statistical test have you done to obtain P-Value?

I used Welch’s t-test (one-tailed, unequal variance t-test) to compare the average unit prices of the top and bottom 25% customer segments by frequency.

##### Why did you choose the specific statistical test?

Again, since the two groups (frequent vs. infrequent buyers) are independent, and their sample sizes and variances are unequal, Welch’s t-test is most appropriate. This test accounts for unequal variances and provides reliable results for determining if frequent buyers are more price-sensitive (paying less on average).

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Checking missing values
missing_values = df.isnull().sum()
print(missing_values[missing_values > 0])


In [None]:
# Drop rows where CustomerID is missing (essential for segmentation)
df = df.dropna(subset=['CustomerID'])

# Fill missing 'Description' with 'Unknown'
df['Description'] = df['Description'].fillna('Unknown')


#### What all missing value imputation techniques have you used and why did you use those techniques?

To handle missing values, we dropped rows with missing CustomerID since customer-level analysis requires proper identification. Missing values in the Description column were filled with 'Unknown' to retain the transaction record while marking the ambiguity in the product details.

### 2. Handling Outliers

In [None]:
# Removing records with negative or zero quantities and unit prices
# (.copy() avoids SettingWithCopyWarning when new columns are added later)
df = df[(df['Quantity'] > 0) & (df['UnitPrice'] > 0)].copy()


##### What all outlier treatment techniques have you used and why did you use those techniques?

We removed outliers where Quantity or UnitPrice was zero or negative, as such transactions are either cancellations or data entry errors and could mislead customer behavior or revenue analysis.

### 3. Categorical Encoding

In [None]:
# Encode Country using Label Encoding
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['Country_Code'] = le.fit_transform(df['Country'])


#### What all categorical encoding techniques have you used & why did you use those techniques?

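We used Label Encoding on the Country variable to convert categorical country names into numerical values for further modeling or clustering. Label Encoding is compact, but it assigns each country an arbitrary integer, and there is no ordinal relationship among countries. The resulting Country_Code should therefore be treated as an identifier rather than a magnitude; for distance-based models such as K-Means, one-hot encoding or a binary flag such as IsUK avoids implying a false order.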
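The difference between integer codes and one-hot columns can be seen on a tiny example. This is a sketch under assumptions: `pd.factorize(..., sort=True)` stands in for `LabelEncoder` here (both assign codes in sorted order of the category names), and the four sample rows are made up.

```python
import pandas as pd

countries = pd.Series(
    ['United Kingdom', 'France', 'Germany', 'France'], name='Country'
)

# Integer codes: compact, but the numeric order is purely alphabetical
codes, uniques = pd.factorize(countries, sort=True)
print(dict(zip(uniques, range(len(uniques)))))
print(list(codes))

# One-hot columns: one indicator per country, no implied order or distance
one_hot = pd.get_dummies(countries, prefix='Country', dtype=int)
print(one_hot)
```

With integer codes, a distance-based algorithm would treat France and Germany as "closer" than France and United Kingdom, purely because of alphabetical order; the one-hot form keeps every pair of countries equally far apart.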

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
import re

contractions_dict = {"can't": "cannot", "won't": "will not", "n't": " not", "'re": " are", "'s": " is",
                     "'d": " would", "'ll": " will", "'t": " not", "'ve": " have", "'m": " am"}

def expand_contractions(text):
    for contraction, expanded in contractions_dict.items():
        text = re.sub(contraction, expanded, text)
    return text

df['Description'] = df['Description'].apply(lambda x: expand_contractions(str(x)))


#### 2. Lower Casing

In [None]:
df['Description'] = df['Description'].str.lower()


#### 3. Removing Punctuations

In [None]:
# Raw string so that \w and \s are passed to the regex engine unchanged
df['Description'] = df['Description'].str.replace(r'[^\w\s]', '', regex=True)


#### 4. Removing URLs & Removing Words Containing Digits

In [None]:
df['Description'] = df['Description'].str.replace(r'http\S+|www\S+|https\S+', '', regex=True)
df['Description'] = df['Description'].apply(lambda x: ' '.join([word for word in x.split() if not any(char.isdigit() for char in word)]))


#### 5. Removing Stopwords & Removing White spaces

In [None]:
import nltk
nltk.download('stopwords')

In [None]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

df['Description'] = df['Description'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))
df['Description'] = df['Description'].apply(lambda x: re.sub(r'\s+', ' ', x).strip())


#### 6. Rephrase Text

In [None]:
# Skipped as product names are short, structured strings; no need for rephrasing.

#### 7. Tokenization

In [None]:
import nltk
nltk.download('punkt_tab')

In [None]:
from nltk.tokenize import word_tokenize
df['Tokens'] = df['Description'].apply(lambda x: word_tokenize(x))


#### 8. Text Normalization

In [None]:
import nltk
nltk.download('wordnet')

In [None]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

df['Tokens'] = df['Tokens'].apply(lambda tokens: [lemmatizer.lemmatize(word) for word in tokens])


##### Which text normalization technique have you used and why?

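We used Lemmatization as our primary text normalization technique. Lemmatization reduces each word to its base or dictionary form (lemma) and can take the part of speech of the word into account. For example, with a verb tag, “running”, “ran”, and “runs” are all reduced to “run”. The code above calls lemmatize() without a part-of-speech argument, so NLTK treats every token as a noun; in practice this mainly normalizes plurals (for example, “bottles” becomes “bottle”).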

This technique is more linguistically accurate than stemming (which simply chops off word endings) and helps in preserving the actual meaning of the words — making it ideal for use cases like sentiment analysis, product categorization, and text clustering.

By applying lemmatization, we reduced word-level redundancy, improved vocabulary consistency, and prepared the textual data for reliable vectorization and modeling.

#### 9. Part of speech tagging

In [None]:
import nltk
nltk.download('averaged_perceptron_tagger_eng')

In [None]:
import nltk
df['POSTags'] = df['Tokens'].apply(lambda tokens: nltk.pos_tag(tokens))


#### 10. Text Vectorization

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

df['CleanText'] = df['Tokens'].apply(lambda tokens: ' '.join(tokens))

tfidf = TfidfVectorizer(max_features=1000)
tfidf_matrix = tfidf.fit_transform(df['CleanText'])


##### Which text vectorization technique have you used and why?

We used the TF-IDF (Term Frequency–Inverse Document Frequency) vectorization technique for converting textual data into numerical format. TF-IDF assigns weights to each word based on how frequently it appears in a specific document relative to how frequently it appears in the entire corpus.

This technique was chosen because:

It gives higher importance to rare but relevant words and lowers the weight of commonly occurring terms (like “product”, “item”, etc.), which might not contribute much to the uniqueness of a document.

Unlike Count Vectorization, TF-IDF helps in reducing the dominance of high-frequency but less informative words.

It creates a sparse matrix that is well-suited for downstream machine learning algorithms like clustering and classification.

Overall, TF-IDF provided a balance between simplicity and effectiveness, making it ideal for our use case involving short product descriptions and preparing the data for further analysis or modeling.
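The weighting idea can be sketched in plain Python. This is an illustrative re-implementation, not the library's code: it assumes the smoothed IDF form `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is the default behaviour documented for `TfidfVectorizer`, and the three short descriptions are made up.

```python
import math
from collections import Counter

# Hypothetical cleaned product descriptions
docs = [
    "white metal lantern",
    "white hanging heart holder",
    "red woolly hottie white heart",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many descriptions each term appears
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    # Smoothed inverse document frequency
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0

def tfidf(tokens):
    tf = Counter(tokens)
    weights = {t: count * idf(t) for t, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}  # L2-normalized

weights = tfidf(tokenized[0])
for term, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {w:.3f}")
```

"white" appears in every description, so it receives the lowest weight in the first document, while "metal" and "lantern" (each unique to it) dominate the vector.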

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Creating a new column: TotalPrice = Quantity * UnitPrice
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']

# Extracting InvoiceMonth from InvoiceDate
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
df['InvoiceMonth'] = df['InvoiceDate'].dt.to_period('M')

# Creating a binary column: IsUK = 1 if Country is United Kingdom, else 0
df['IsUK'] = df['Country'].apply(lambda x: 1 if x == 'United Kingdom' else 0)

# Dropping unnecessary columns (keeping InvoiceNo for RFM)
# df.drop(['StockCode', 'Description'], axis=1, inplace=True) # Removed redundant drop

#### 2. Feature Selection

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Compute correlation matrix
corr_matrix = df.corr(numeric_only=True)

# Plot heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap")
plt.show()


##### What all feature selection methods have you used  and why?

Of the techniques listed below, only the first is implemented in the code above:

Correlation Matrix (Heatmap): to detect and remove highly correlated numerical variables, ensuring model simplicity and reducing redundancy among the inputs to clustering.

Univariate Selection (SelectKBest), Recursive Feature Elimination (RFE), and Model-Based Feature Importance (Random Forest, XGBoost): these rank features against a target variable. This project is unsupervised and has no target, so they would apply only if a supervised task were added later.

The aim is to retain relevant, weakly correlated features in the final dataset.

##### Which all features you found important and why?

Based on business understanding and statistical analysis, the following features were identified as most important:

CustomerID: for customer segmentation and behavioral tracking.

TotalPrice: direct indicator of purchase value and business revenue.

InvoiceMonth: helpful in seasonality or trend analysis.

Country_Code: to cluster customers geographically.

Quantity and UnitPrice: fundamental transaction metrics.

InvoiceNo: for aggregation and grouping (used in feature engineering but not as-is in modeling).

These features were selected because they not only contribute meaningfully to clustering and predictive models but also offer actionable insights for business decision-making like targeted marketing and inventory planning.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

Yes, data transformation was necessary in this project to ensure better model performance, improved feature scaling, and normalization of skewed distributions. Some variables like Quantity, UnitPrice, and TotalPrice exhibited high variance and right-skewed distributions, which could negatively impact algorithms sensitive to scale or distribution (e.g., K-Means, Logistic Regression, or SVM).

To address this, we applied the following transformation:

Log Transformation on UnitPrice and TotalPrice: This helped in compressing large numeric values and reducing skewness, thereby making the data more normally distributed and suitable for statistical analysis and machine learning.

For example, np.log1p() (log(x + 1)) was used instead of log(x) to avoid issues with zero values.

This transformation improved model stability, especially for clustering algorithms and distance-based models, and made visualizations like histograms and pair plots more interpretable.
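A small numeric check shows the effect of `np.log1p` on a right-skewed column. The price values and the `skewness` helper are illustrative assumptions, not values from the dataset.

```python
import numpy as np

# Hypothetical unit prices with one extreme value and one zero
prices = np.array([0.0, 0.85, 1.25, 2.95, 4.95, 12.75, 295.0])
log_prices = np.log1p(prices)  # log(1 + x), defined at x = 0

def skewness(x):
    """Population (biased) skewness: third central moment / variance^1.5."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

print("skew before:", round(skewness(prices), 2))
print("skew after: ", round(skewness(log_prices), 2))

# The transform is invertible, so original values can be recovered
restored = np.expm1(log_prices)
```

The zero price maps to zero instead of negative infinity, and the single large value no longer dominates the spread, which is why distance-based algorithms behave better on the transformed column.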










In [None]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Log transformation for skewed columns
df['Log_UnitPrice'] = np.log1p(df['UnitPrice'])
df['Log_TotalPrice'] = np.log1p(df['TotalPrice'])

# Drop original columns if desired
# df.drop(['UnitPrice', 'TotalPrice'], axis=1, inplace=True)

# Visualize distributions before and after transformation
fig, axes = plt.subplots(2, 2, figsize=(12, 8))

sns.histplot(df['UnitPrice'], bins=50, kde=True, ax=axes[0, 0])
axes[0, 0].set_title('Original UnitPrice Distribution')

sns.histplot(df['Log_UnitPrice'], bins=50, kde=True, ax=axes[0, 1])
axes[0, 1].set_title('Log-Transformed UnitPrice')

sns.histplot(df['TotalPrice'], bins=50, kde=True, ax=axes[1, 0])
axes[1, 0].set_title('Original TotalPrice Distribution')

sns.histplot(df['Log_TotalPrice'], bins=50, kde=True, ax=axes[1, 1])
axes[1, 1].set_title('Log-Transformed TotalPrice')

plt.tight_layout()
plt.show()


### 6. Data Scaling

In [None]:
from sklearn.preprocessing import StandardScaler

# Select numeric features to scale (excluding identifiers and categorical data)
numeric_features = ['Quantity', 'Log_UnitPrice', 'Log_TotalPrice']

# Initialize scaler
scaler = StandardScaler()

# Fit and transform the numeric features
df_scaled = df.copy()
df_scaled[numeric_features] = scaler.fit_transform(df_scaled[numeric_features])

# View the scaled dataset
df_scaled[numeric_features].head()


##### Which method have you used to scale your data and why?

For this project, I used the StandardScaler from sklearn.preprocessing to scale the numeric features. StandardScaler standardizes features by removing the mean and scaling to unit variance (Z-score normalization). This method is effective when the data is normally distributed or approximately so (especially after log transformation, which we already applied).

Data scaling is important because machine learning algorithms like K-Means Clustering, Logistic Regression, SVM, and PCA are sensitive to the scale of data. Features on vastly different scales can bias the model or reduce the effectiveness of distance-based algorithms.
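As a minimal sketch of what standardization does, the z-score can be computed directly with NumPy. The toy matrix is made up; the population standard deviation (`ddof=0`) is used because that is what `StandardScaler` uses.

```python
import numpy as np

# Two hypothetical features on very different scales
X = np.array([
    [1.0, 100.0],
    [2.0, 300.0],
    [3.0, 500.0],
])

mean = X.mean(axis=0)
std = X.std(axis=0)   # ddof=0, the same convention as StandardScaler
Z = (X - mean) / std  # z = (x - mean) / std, applied per column

print(Z)

# Euclidean distance between the first two rows, before and after scaling
dist_raw = np.linalg.norm(X[0] - X[1])
dist_scaled = np.linalg.norm(Z[0] - Z[1])
print(dist_raw, dist_scaled)
```

Before scaling, the distance is driven almost entirely by the second column; after scaling, both columns contribute equally, which is the property K-Means and PCA rely on.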

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

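Yes, dimensionality reduction is useful here. It removes redundant or less informative dimensions, which speeds up computation and makes clusters easier to visualize in two dimensions. With only three scaled features in this step, the main benefit is visualization rather than a large reduction in size.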

In [None]:
from sklearn.decomposition import PCA

# Applying PCA to reduce to 2 components for visualization or clustering
pca = PCA(n_components=2)
pca_components = pca.fit_transform(df_scaled[['Quantity', 'Log_UnitPrice', 'Log_TotalPrice']])

# Add components back to dataframe
df_scaled['PCA1'] = pca_components[:, 0]
df_scaled['PCA2'] = pca_components[:, 1]


##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

We used Principal Component Analysis (PCA) because it:

Reduces correlated features into independent principal components.

Helps in visualizing high-dimensional data.

Retains most of the variance in fewer dimensions.
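The three properties above can be checked on synthetic data. The sketch below performs PCA through a singular value decomposition in NumPy rather than through scikit-learn; the data (two strongly correlated columns plus one independent column) is randomly generated for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([
    base,                                         # feature 1
    2 * base + 0.1 * rng.normal(size=(200, 1)),   # nearly a copy of feature 1
    rng.normal(size=(200, 1)),                    # independent feature
])

# PCA via SVD of the centered data
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Share of total variance carried by each principal component
explained = S ** 2 / np.sum(S ** 2)
print("explained variance ratio:", np.round(explained, 4))

# Project onto the first two components (the 2-D view used for plotting)
scores = Xc @ Vt[:2].T
print("correlation between components:",
      round(float(np.corrcoef(scores.T)[0, 1]), 6))
```

Because two of the three columns carry almost the same information, two components retain nearly all of the variance, and the resulting components are uncorrelated with each other.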

### 8. Data Splitting

In [None]:
# Data splitting is typically for supervised learning.
# For unsupervised learning tasks like clustering, we usually don't split into train/test sets based on a target.
# The entire dataset (or the features for clustering) will be used for model training.

# If you intend to use these features for a supervised task later (e.g., predicting something based on segments),
# you would perform the split after creating the target variable.

# Removing the train_test_split code as it's not needed for the clustering task.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

In [None]:
# this project focuses on unsupervised learning (clustering), where there isn't a target variable to split or balance. Handling imbalanced datasets with techniques like SMOTE is typically done in supervised learning tasks when dealing with a skewed target variable. Since this is an unsupervised clustering project, this step is not applicable.


## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import pandas as pd
from sklearn.preprocessing import StandardScaler
from datetime import timedelta # Import timedelta


# --- Reload and Clean Data ---
# Load the dataset
df = pd.read_csv('online_retail.csv')

# 1. Drop rows with missing CustomerID
df = df.dropna(subset=['CustomerID'])

# 2. Remove canceled orders (InvoiceNo starting with 'C')
df = df[~df['InvoiceNo'].astype(str).str.startswith('C')]

# 3. Remove rows with negative or zero Quantity or UnitPrice
df = df[(df['Quantity'] > 0) & (df['UnitPrice'] > 0)]

# 4. Convert InvoiceDate to datetime format
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])

# 5. Ensure CustomerID is treated as string (for RFM segmentation)
df['CustomerID'] = df['CustomerID'].astype(str)

# Create TotalPrice column (needed for Monetary calculation)
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']
# --- End Reload and Clean Data ---


# 1. Calculate RFM metrics
snapshot_date = df['InvoiceDate'].max() + pd.Timedelta(days=1)
rfm_df = df.groupby('CustomerID').agg({
    'InvoiceDate': lambda x: (snapshot_date - x.max()).days,
    'InvoiceNo': 'nunique',
    'TotalPrice': 'sum'
}).rename(columns={'InvoiceDate':'Recency',
                   'InvoiceNo':'Frequency',
                   'TotalPrice':'Monetary'})

# 2. Scale RFM features
scaler = StandardScaler()
rfm_df_scaled = scaler.fit_transform(rfm_df)

# Convert scaled array back to DataFrame for easier handling (optional but good practice)
rfm_df_scaled = pd.DataFrame(rfm_df_scaled, columns=rfm_df.columns, index=rfm_df.index)


# Assuming 'rfm_df_scaled' is your scaled RFM dataframe
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10) # Added n_init=10 to avoid warning
kmeans.fit(rfm_df_scaled)

# Add cluster labels to the original RFM dataframe (or the scaled one)
rfm_df['Cluster'] = kmeans.labels_

# Evaluate clustering using Silhouette Score
score = silhouette_score(rfm_df_scaled, kmeans.labels_)
print(f'Silhouette Score: {score:.2f}')

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
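from sklearn.metrics import silhouette_score

# Candidate cluster counts (2 to 10), the same range used for the elbow curve
k_range = range(2, 11)
silhouette_scores = []

# Features only: drop any cluster-label columns added by other cells
X_sil = rfm_df_scaled.drop(columns=['Cluster', 'Agglo_Cluster', 'DBSCAN_Cluster'], errors='ignore')

for k in k_range:
    # Use a separate name so the fitted k=4 model in `kmeans` is not overwritten
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    cluster_labels = km.fit_predict(X_sil)
    silhouette_scores.append(silhouette_score(X_sil, cluster_labels))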

# Plotting Silhouette Score
plt.figure(figsize=(8, 5))
plt.plot(k_range, silhouette_scores, marker='s', color='green')
plt.title('Silhouette Score vs. Number of Clusters')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Silhouette Score')
plt.xticks(k_range)
plt.grid(True)
plt.show()


#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
import matplotlib.pyplot as plt

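# Features only: drop any cluster-label columns so they do not affect inertia
X_elbow = rfm_df_scaled.drop(columns=['Cluster', 'Agglo_Cluster', 'DBSCAN_Cluster'], errors='ignore')

sse = []
for k in range(2, 11):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X_elbow)
    sse.append(km.inertia_)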

# Plot the elbow curve
plt.figure(figsize=(8, 4))
plt.plot(range(2, 11), sse, marker='o')
plt.xlabel('Number of Clusters')
plt.ylabel('SSE (Inertia)')
plt.title('Elbow Method for Optimal k')
plt.show()


##### Which hyperparameter optimization technique have you used and why?

For ML Model 1, which is K-Means Clustering, traditional hyperparameter tuning techniques like GridSearchCV or RandomizedSearchCV do not apply because it's an unsupervised learning algorithm with no labeled output (y).

Instead, we performed hyperparameter optimization by tuning the number of clusters (k), which is the primary parameter in K-Means. To identify the optimal value of k, we used:

🎯 1. Elbow Method
Why?
It helps identify the point where increasing the number of clusters no longer significantly reduces the Sum of Squared Errors (SSE).

Business Value:
It ensures we're not oversegmenting customers unnecessarily, which can make marketing actions inefficient.

🎯 2. Silhouette Score
Why?
This measures how similar an object is to its own cluster compared to other clusters.

Business Value:
Higher silhouette scores ensure that customers within a segment are truly similar, which enhances the effectiveness of targeted strategies.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

So, we tuned the number of clusters k by comparing SSE (Elbow Method) and Silhouette Scores across a range of values (typically 2 to 10). This approach allowed us to optimize clustering performance without relying on supervised learning techniques.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
import scipy.cluster.hierarchy as sch
import matplotlib.pyplot as plt

# Generate dendrogram
plt.figure(figsize=(10, 6))
dendrogram = sch.dendrogram(sch.linkage(rfm_df_scaled.drop(columns=['Cluster', 'Agglo_Cluster'], errors='ignore'), method='ward'))
plt.title("Dendrogram for Hierarchical Clustering")
plt.xlabel("Data Points")
plt.ylabel("Euclidean Distance")
plt.show()


#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Run Agglomerative Clustering with optimal number of clusters (say k = 4)
agglo = AgglomerativeClustering(n_clusters=4, metric='euclidean', linkage='ward')
agglo_labels = agglo.fit_predict(rfm_df_scaled.drop(columns='Cluster', errors='ignore'))

# Append to dataframe
rfm_df_scaled['Agglo_Cluster'] = agglo_labels

# Evaluate
score = silhouette_score(rfm_df_scaled.drop(columns=['Cluster', 'Agglo_Cluster'], errors='ignore'), agglo_labels)
print(f'Silhouette Score (Agglomerative Clustering): {score:.2f}')

##### Which hyperparameter optimization technique have you used and why?

In hierarchical clustering, the key hyperparameters include:

Number of clusters (n_clusters)

Linkage method (ward, average, complete, single)

We tuned these based on:

Visual inspection of the dendrogram

Silhouette score evaluation

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Compared to K-Means, Agglomerative Clustering:

Can capture non-spherical clusters, depending on the linkage (Ward linkage, used here, favors compact clusters much like K-Means)

Offers more flexibility in structure

Sometimes results in slightly better silhouette scores, especially when natural hierarchies exist in data

We observed that the silhouette score was comparable to or slightly better than K-Means, indicating robust segmentation.

### ML Model - 3

In [None]:
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# DBSCAN Implementation
dbscan = DBSCAN(eps=1.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(rfm_df_scaled.drop(columns=['Cluster', 'Agglo_Cluster'], errors='ignore'))

# Add cluster labels to dataframe
rfm_df_scaled['DBSCAN_Cluster'] = dbscan_labels

# Evaluate model (note: noise points, labelled -1, are scored as if they formed one cluster)
silhouette = silhouette_score(rfm_df_scaled.drop(columns=['Cluster', 'Agglo_Cluster', 'DBSCAN_Cluster'], errors='ignore'), dbscan_labels)
print(f'Silhouette Score (DBSCAN): {silhouette:.2f}')


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

DBSCAN clusters together data points that are closely packed together (high density) and marks outliers (noise) that lie alone in low-density regions. It requires two main parameters:

eps: Maximum distance between two samples to be considered as in the same neighborhood

min_samples: Minimum number of data points to form a dense region

Performance Evaluation:

Silhouette Score is used to evaluate how well clusters are defined.

Since DBSCAN can create noise points (labelled as -1), this model adds another level of insight by identifying outliers in behavior.
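A quick way to read a DBSCAN result is to count the clusters and the share of points marked as noise. The helper name `summarize_dbscan` and the example labels are hypothetical; the only convention assumed from DBSCAN is that noise is labelled -1.

```python
import numpy as np

def summarize_dbscan(labels):
    """Count clusters and noise points in a DBSCAN label array."""
    labels = np.asarray(labels)
    n_noise = int(np.sum(labels == -1))
    # Real clusters are all distinct labels except the noise label -1
    n_clusters = len(set(labels.tolist()) - {-1})
    return {
        "n_clusters": n_clusters,
        "n_noise": n_noise,
        "noise_ratio": n_noise / labels.size,
    }

# Hypothetical labels for eight customers
example_labels = [0, 0, 1, -1, 1, 0, -1, 2]
summary = summarize_dbscan(example_labels)
print(summary)
```

A very high noise ratio usually suggests that eps is too small or min_samples too large for the data, while a ratio of zero with a single cluster suggests the opposite.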



In [None]:
import matplotlib.pyplot as plt

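# Silhouette scores computed from the labels produced by each model above
X_eval = rfm_df_scaled.drop(columns=['Cluster', 'Agglo_Cluster', 'DBSCAN_Cluster'], errors='ignore')
kmeans_score = silhouette_score(X_eval, rfm_df['Cluster'])
agglo_score = silhouette_score(X_eval, agglo_labels)
dbscan_score = silhouette_score(X_eval, dbscan_labels)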

# Model Names and Scores
models = ['K-Means', 'Agglomerative', 'DBSCAN']
scores = [kmeans_score, agglo_score, dbscan_score]

# Plot
plt.figure(figsize=(8,5))
plt.bar(models, scores, color=['skyblue', 'lightgreen', 'salmon'])
plt.ylim(0, 1)
plt.title('Silhouette Score Comparison of Clustering Models')
plt.ylabel('Silhouette Score')
plt.xlabel('Clustering Models')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()


#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
from sklearn.metrics import silhouette_score
import numpy as np

X = rfm_df_scaled.drop(columns=['Cluster', 'Agglo_Cluster', 'DBSCAN_Cluster'], errors='ignore')

# Try different eps and min_samples values
eps_values = np.arange(0.5, 3.0, 0.3)
min_samples_values = [3, 5, 7, 10]

best_score = -1
best_params = (0, 0)

for eps in eps_values:
    for min_samples in min_samples_values:
        db = DBSCAN(eps=eps, min_samples=min_samples)
        labels = db.fit_predict(X)

        # Skip if only one cluster is formed or all points are marked noise
        if len(set(labels)) <= 1 or len(set(labels)) == len(X):
            continue

        score = silhouette_score(X, labels)
        if score > best_score:
            best_score = score
            best_params = (eps, min_samples)

# Format eps to one decimal so floating-point noise from np.arange is not printed
print(f'Best Silhouette Score: {best_score:.2f} with eps={best_params[0]:.1f} and min_samples={best_params[1]}')


##### Which hyperparameter optimization technique have you used and why?

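We tuned eps and min_samples with a manual grid search, guided by:

Domain knowledge

Silhouette Score

Outlier detection ratio

This is appropriate for DBSCAN because there are no labels to cross-validate against, so supervised tools such as GridSearchCV do not apply directly; each parameter combination is scored on the full dataset instead.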

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Yes. Compared to K-Means and Agglomerative Clustering:

DBSCAN identified a few outliers that other models grouped into clusters.

It helped flag irregular customer behaviors or noise, which could be ignored for focused marketing.

The silhouette score was slightly lower, but the business interpretability was higher, especially for anomaly detection.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

For this clustering-based project, the primary evaluation metric used was the Silhouette Score.

The Silhouette Score measures how similar a data point is to its own cluster compared to other clusters. It ranges from -1 to +1, where:

+1 indicates that the sample is far away from the neighboring clusters (ideal).

0 means the sample is on or very close to the decision boundary between two neighboring clusters.

-1 implies that the sample might have been assigned to the wrong cluster.
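The range described above follows from the per-point definition s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance to points in the same cluster and b(i) is the mean distance to the nearest other cluster. The sketch below is a small hand-written version for illustration (the function name and toy points are assumptions); it needs at least two clusters.

```python
import numpy as np

def silhouette_values(X, labels):
    """Per-point silhouette values using Euclidean distance."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distance matrix
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    values = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        if not same.any():
            continue  # singleton cluster: s(i) is 0 by convention
        a = dist[i, same].mean()
        b = min(
            dist[i, labels == other].mean()
            for other in np.unique(labels)
            if other != labels[i]
        )
        values[i] = (b - a) / max(a, b)
    return values

# Two tight, well-separated groups of hypothetical points
X = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
good = silhouette_values(X, [0, 0, 1, 1])
bad = silhouette_values(X, [0, 1, 0, 1])
print("good labelling:", round(float(good.mean()), 3))
print("bad labelling: ", round(float(bad.mean()), 3))
```

The sensible labelling scores close to +1, while pairing each point with a far-away point pushes the average below zero.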

This metric was chosen because:

It provides a clear quantitative measure of the cluster quality.

It works well for unsupervised learning, especially when ground truth labels are not available (like in customer segmentation).

It helped in comparing the effectiveness of different clustering models (K-Means, Agglomerative, DBSCAN) using the same scale.

A high silhouette score ensures that customers within a segment behave similarly, which is essential for personalized marketing, targeted promotions, and customer retention strategies — all of which lead to positive business impact.

Other than silhouette score, we also considered:

Number of clusters formed (should not be too many or too few)

Outlier detection capability in DBSCAN (useful for identifying unusual customer behaviors)

Cluster interpretability — business actions depend on how meaningful the clusters are, not just how separated they are.

These metrics ensured that the clustering output was not only technically valid, but also valuable from a business perspective.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

After evaluating all three clustering models — K-Means, Agglomerative Clustering, and DBSCAN — the final chosen model was Agglomerative Clustering.

Why Agglomerative Clustering?
Best Silhouette Score:
Among the three models, Agglomerative Clustering produced the highest silhouette score, indicating better cluster cohesion and separation. This means customers within a group are more similar to each other and distinctly different from other groups — which is ideal for customer segmentation.

Business Interpretability:
The hierarchical nature of Agglomerative Clustering allowed for better visualization through dendrograms, which helped in understanding how customers cluster together. This is valuable when explaining patterns to non-technical stakeholders like marketing or sales teams.

No Need to Predefine Number of Clusters:
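Unlike K-Means, which needs the number of clusters before fitting, Agglomerative Clustering builds the full merge hierarchy first. The number of clusters is then decided afterwards by choosing where to cut the dendrogram (for example, at a target cluster count or a distance threshold). This flexibility helped identify natural segments in the customer data.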
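As a rough illustration of this flexibility, the sketch below builds one hierarchy and then cuts it at different levels without refitting. The data is synthetic, and the cut values (3 clusters, 5 clusters, a distance threshold of 15.0) are arbitrary assumptions chosen for the example, not settings taken from this project.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Synthetic stand-in for the scaled RFM features (assumption: not the project data).
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0, 0], [6, 6, 6], [-6, 6, 0]],
    cluster_std=0.8,
    random_state=0,
)

# Build the full bottom-up merge hierarchy once, using Ward linkage.
Z = linkage(X, method="ward")

# Cut the same hierarchy into at most 3 and at most 5 clusters.
labels_k3 = fcluster(Z, t=3, criterion="maxclust")
labels_k5 = fcluster(Z, t=5, criterion="maxclust")

# scikit-learn equivalent: leave n_clusters unset and cut by merge distance.
agg = AgglomerativeClustering(
    n_clusters=None, distance_threshold=15.0, linkage="ward"
)
labels_thr = agg.fit_predict(X)

print(len(set(labels_k3)), len(set(labels_k5)), agg.n_clusters_)
```

The same linkage matrix `Z` can be passed to `scipy.cluster.hierarchy.dendrogram` to draw the tree. Note that scikit-learn's `AgglomerativeClustering` still needs either `n_clusters` or `distance_threshold`, so a cut level always has to be chosen at some point.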

Handling Noise:
While DBSCAN is good at identifying outliers, it failed to produce consistently meaningful clusters due to the data distribution and scale — leading to lower silhouette scores. Agglomerative performed more reliably across the dataset.

Practical Segments for Marketing:
The customer clusters formed using Agglomerative Clustering were well-defined, stable, and actionable, making it easier for business teams to assign appropriate strategies to each customer group (e.g., high-value loyal customers, dormant users, low-value one-timers, etc.).



### 3. Explain the model which you have used and the feature importance using any model explainability tool?

In this project, we chose Agglomerative Clustering as our final model for customer segmentation. This model works by initially treating each data point as its own cluster and then progressively merging the most similar clusters, forming a hierarchy. It is a bottom-up approach in which the hierarchy is built without fixing the number of clusters; that number is chosen afterwards by cutting the hierarchy at a suitable level, which gave us more flexibility to discover natural groupings in the data.

Since this is an unsupervised learning problem, there is no target variable to explain, so standard explainability tools such as SHAP or tree-based feature importance do not apply directly. Instead, to understand the importance of features, we analyzed the average RFM (Recency, Frequency, Monetary) values per cluster. By grouping the dataset by cluster label and calculating the mean of each RFM feature, we could interpret which features played a dominant role in defining customer segments.

For instance, clusters with high average monetary values indicated our top-spending customers. Those with high frequency were our loyal or regular shoppers. With recency measured as the time gap since the last purchase, clusters with low recency values (more recent purchases) showed active engagement, while clusters with high recency values indicated dormant customers. This cluster-wise feature comparison helped us understand the role of each feature even without using a traditional supervised model.
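The cluster-wise comparison described above can be sketched with a pandas `groupby`. The small table below is made-up illustrative data, and the column names (`Recency`, `Frequency`, `Monetary`, `Cluster`) are assumptions about how the RFM table is laid out; `Recency` is taken to be days since the last purchase.

```python
import pandas as pd

# Hypothetical RFM table with cluster labels (made-up values, for illustration only).
rfm = pd.DataFrame({
    "Recency":   [5, 10, 200, 250, 30, 300],   # days since last purchase (assumed)
    "Frequency": [20, 15, 1, 2, 8, 1],
    "Monetary":  [5000.0, 4200.0, 50.0, 80.0, 900.0, 30.0],
    "Cluster":   [0, 0, 1, 1, 2, 1],
})

features = ["Recency", "Frequency", "Monetary"]

# Mean of each RFM feature per cluster, plus the number of customers in each cluster.
profile = rfm.groupby("Cluster")[features].mean()
profile["Size"] = rfm.groupby("Cluster").size()

# Ratio of each cluster mean to the overall mean: values far from 1 show
# which features distinguish a cluster from the average customer.
overall = rfm[features].mean()
relative = profile[features] / overall

print(profile.round(1))
print(relative.round(2))
```

In this toy table, cluster 0 has low recency with high frequency and monetary values (active, high-value customers), while cluster 1 has high recency and low spend (dormant customers).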

In simple terms, we used the feature means per cluster as a way to "explain" the model, which made it easy for business stakeholders to interpret and act upon. For example, they could easily identify high-value segments for exclusive offers or find inactive customers to target with reactivation campaigns. This interpretability made the Agglomerative Clustering result not only well separated but also actionable from a business perspective.

# **Conclusion**

The Shopper Spectrum project successfully achieved its goal of segmenting customers based on purchasing behavior using unsupervised machine learning techniques. By performing RFM (Recency, Frequency, Monetary) analysis on the e-commerce transaction data, we were able to build meaningful customer profiles and group them into distinct clusters using the Agglomerative Clustering algorithm.

Throughout the project, we handled missing values, outliers, and categorical features carefully, and performed thorough data preprocessing and transformation to ensure the model could operate on clean and reliable data. We explored a wide range of visualizations (including 15 well-structured charts) to deeply understand patterns and relationships in the dataset, which helped us form business-relevant hypotheses. These were validated through statistical testing to ensure data-driven conclusions.

Of the three clustering models used (K-Means, DBSCAN, and Agglomerative Clustering), Agglomerative Clustering delivered the best results in terms of interpretability, silhouette score, and meaningful segmentation. The resulting clusters helped identify key customer groups such as high-value loyal customers, one-time low spenders, and inactive customers. These insights offer direct business value by enabling more personalized marketing strategies, loyalty programs, and customer retention plans.

In conclusion, this project not only demonstrated the practical application of clustering techniques in customer analytics but also delivered actionable outcomes that can help any e-commerce business grow by understanding and serving its customers more effectively.