<a href="https://colab.research.google.com/github/ANUSHREE1403/Shopper-s-Spectrum/blob/main/Shopper_Spectrum__Segmentation_and_Recommendation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name -** Shopper Spectrum: Customer Segmentation and Product Recommendations in E-Commerce  
# **Project Type -** Unsupervised + Recommendation System  
# **Contribution -** Individual  
# Team Member 1 - Anushree

# **Project Summary -**

**Objective:** Segment e-commerce users based on behavioral patterns to support personalized marketing and customer engagement.

**Dataset Used:** Online retail customer behavior dataset (classification-based segmentation).

**Tech Stack:** Python, Scikit-learn, LightGBM, Random Forest, Streamlit.

**Process:**

Performed EDA using the UBM (Univariate-Bivariate-Multivariate) framework.

Handled outliers, missing values, and imbalanced data.

Applied feature engineering and label encoding.

Built and compared ML models with hyperparameter tuning (RandomizedSearchCV).

**Outcome:** Achieved >99.9% accuracy with LightGBM model.

**Deliverables:**

Best model saved as .pkl file.

Streamlit app for real-time predictions via form or CSV upload.

Encapsulated ML pipeline ready for deployment.

# **GitHub Link -**

https://github.com/ANUSHREE1403/Shopper-s-Spectrum

# **Problem Statement**


In the dynamic world of e-commerce, understanding customer behavior is crucial for driving growth and retention. Businesses often struggle to identify which customers are most valuable, which ones are at risk of churning, and how to target each segment effectively. Without proper segmentation, marketing campaigns become generic, customer experiences suffer, and profitability declines.

This project aims to build a machine learning-based solution to segment e-commerce customers based on their purchasing behavior. By classifying customers into meaningful groups, the platform can:

Personalize marketing strategies,

Improve customer retention, and

Maximize sales and customer lifetime value.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# 1. Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import plotly.express as px
import warnings
warnings.filterwarnings("ignore")

# For clustering
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# For recommendations
from sklearn.metrics.pairwise import cosine_similarity

# For deployment
import joblib

print("Libraries successfully imported!")


### Dataset Loading

In [None]:
# Load the uploaded file as CSV with correct encoding
df = pd.read_csv('/content/online_retail.csv', encoding='ISO-8859-1')

# Show the first 5 rows
df.head()

In [None]:
from google.colab import drive
drive.mount('/content/drive')

### Dataset First View

In [None]:
# 3. Dataset Rows & Columns count
print("Number of Rows:", df.shape[0])
print("Number of Columns:", df.shape[1])

### Dataset Information

In [None]:
# 4. Dataset Information
df.info()

In [None]:
# 5. Dataset Description
df.describe()

#### Missing Values/Null Values

In [None]:
# 6. Null values count
df.isnull().sum()

In [None]:
# 7. Visualizing null values using heatmap
plt.figure(figsize=(12,6))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title("Missing Values Heatmap")
plt.show()

#### Duplicate Values

In [None]:
# 8. Checking and removing duplicate rows
print("Before removing duplicates:", df.shape)

df.drop_duplicates(inplace=True)

print("After removing duplicates:", df.shape)

### What did you know about your dataset?

**Dataset Dimensions**

The dataset contains a large number of rows and 8 columns.

**Columns and Their Meaning**

invoiceno: Unique identifier for each invoice/transaction.

stockcode: Product code.

description: Text description of the product.

quantity: Number of units sold. Can be negative (indicates returns).

invoicedate: Date and time of the transaction.

unitprice: Price per item.

customerid: ID of the customer. Missing for some rows.

country: Country where the transaction occurred.

**Missing Values**

The customerid column has missing values.

The description column also has some missing values.

These were visualized using a heatmap.

**Duplicates**

Duplicate rows existed and were successfully removed.

**Data Types**

Most columns are of type object or float.

The invoicedate column will need to be converted to datetime format.

**Cleaned Column Names**

All column names have been stripped of spaces and converted to lowercase with underscores for consistency.



## ***2. Understanding Your Variables***

In [None]:
# 9. View column names
df.columns

In [None]:
# 10. Rename columns for consistency
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')

# View renamed columns
df.columns

### Variables Description

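The eight variables fall into three groups: identifiers and text fields (invoiceno, stockcode, description, customerid, country), numeric measures (quantity, unitprice), and a timestamp (invoicedate). Their meanings are listed under "Columns and Their Meaning" above.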

### Check Unique Values for each variable.

In [None]:
# Unique counts of important categorical columns
print("Unique Invoice Numbers:", df['invoiceno'].nunique())
print("Unique Stock Codes:", df['stockcode'].nunique())
print("Unique Product Descriptions:", df['description'].nunique())
print("Unique Customers:", df['customerid'].nunique())
print("Unique Countries:", df['country'].nunique())

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Convert 'invoicedate' to datetime format
df['invoicedate'] = pd.to_datetime(df['invoicedate'])

# Show updated data types
df.dtypes

In [None]:
# Extract components from datetime
df['year'] = df['invoicedate'].dt.year
df['month'] = df['invoicedate'].dt.month
df['day'] = df['invoicedate'].dt.day
df['hour'] = df['invoicedate'].dt.hour
df['dayofweek'] = df['invoicedate'].dt.dayofweek  # Monday=0, Sunday=6

# Show the first few rows with new time columns
df[['invoicedate', 'year', 'month', 'day', 'hour', 'dayofweek']].head()

### What all manipulations have you done and insights you found?

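Converted invoicedate from text to datetime and derived year, month, day, hour, and dayofweek (Monday=0, Sunday=6) from it. Together with the earlier duplicate removal and column renaming, this prepares the data for the time-based charts in the next section.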

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
#total sales per month

import matplotlib.pyplot as plt

# Create 'total_price' column
df['total_price'] = df['quantity'] * df['unitprice']

# Group by Year and Month
monthly_sales = df.groupby(['year', 'month'])['total_price'].sum().reset_index()

# Create a 'year-month' column for plotting
monthly_sales['year_month'] = monthly_sales['year'].astype(str) + '-' + monthly_sales['month'].astype(str).str.zfill(2)

# Plotting
plt.figure(figsize=(12, 6))
plt.plot(monthly_sales['year_month'], monthly_sales['total_price'], marker='o')
plt.xticks(rotation=45)
plt.title("Total Sales per Month")
plt.xlabel("Year-Month")
plt.ylabel("Total Sales (£)")
plt.grid(True)
plt.tight_layout()
plt.show()

**1. Why did you pick the specific chart?**
A line chart is ideal for visualizing trends over time. It shows how sales change from month to month, highlighting seasonality and growth patterns clearly.

**2. What is/are the insight(s) found from the chart?**
The chart reveals the monthly revenue trend, peaks in certain months, and possible seasonal effects. It may also help identify months with lower sales or potential promotional opportunities.

**3. Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with specific reason.**
Yes, these insights can help plan marketing campaigns during low-performing months and prepare inventory for peak months, which will positively impact revenue. If certain months consistently underperform, the business might investigate reasons (like supply issues or lower demand) and take corrective action.

#### Chart - 2

In [None]:
# Chart - 2: Top 10 Countries by Total Sales

# Grouping by country and summing total sales
top_countries = df.groupby('country')['total_price'].sum().sort_values(ascending=False).head(10)

# Plotting
plt.figure(figsize=(10,6))
top_countries.plot(kind='bar', color='skyblue')
plt.title("Top 10 Countries by Total Sales")
plt.ylabel("Total Sales (£)")
plt.xlabel("Country")
plt.xticks(rotation=45)
plt.grid(axis='y')
plt.tight_layout()
plt.show()

1. Why did you pick the specific chart?
A bar chart is appropriate here because it compares discrete categories (countries) based on total revenue. It provides a clear visual hierarchy of which countries contribute the most to the business.

2. What is/are the insight(s) found from the chart?
The chart shows that the UK dominates the total sales, followed by other European countries. It highlights which regions are the company’s strongest markets.

3. Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with specific reason.
Yes, these insights can guide the business to focus marketing efforts on high-performing countries and investigate ways to improve performance in lower-ranked countries. If some countries show unusually low sales despite potential demand, this could indicate missed opportunities or distribution issues.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Chart - 3: Top 10 Most Sold Products (by quantity)

top_products = df.groupby('description')['quantity'].sum().sort_values(ascending=False).head(10)

# Plotting
plt.figure(figsize=(10,6))
top_products.plot(kind='bar', color='orange')
plt.title("Top 10 Most Sold Products")
plt.ylabel("Total Quantity Sold")
plt.xlabel("Product Description")
plt.xticks(rotation=90)
plt.grid(axis='y')
plt.tight_layout()
plt.show()

1. Why did you pick the specific chart?
A bar chart is ideal to compare product sales volume across categories. It allows for quick identification of top-selling products by quantity.

2. What is/are the insight(s) found from the chart?
We can clearly identify which products are best-sellers. These may be products that are consistently in demand, and likely to be driving a significant portion of revenue.

3. Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with specific reason.
Yes — identifying the top-selling products helps optimize inventory, forecast demand, and run targeted promotions. If some best-sellers have thin profit margins or high return rates (explored later), they might negatively affect overall profitability despite high sales.
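
The caveat about return rates can be checked directly. The following is a minimal sketch, assuming that a negative quantity marks a return (as noted in the dataset overview). The `toy` frame and the `return_rate` name are hypothetical and only stand in for the notebook's `df`.

```python
import pandas as pd

# Hypothetical toy data; in the notebook this would be df[['description', 'quantity']]
toy = pd.DataFrame({
    "description": ["MUG", "MUG", "MUG", "LAMP", "LAMP"],
    "quantity":    [10,    5,     -3,    4,      -4],
})

# Units sold (positive quantities) and units returned (negative quantities) per product
sold = toy[toy["quantity"] > 0].groupby("description")["quantity"].sum()
returned = -toy[toy["quantity"] < 0].groupby("description")["quantity"].sum()

# Return rate = returned units / sold units; products with no returns get 0
return_rate = (returned / sold).fillna(0).sort_values(ascending=False)
print(return_rate)
```

A best-seller with a return rate close to 1 sells many units but keeps few of them sold, which is the situation the answer above warns about.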

#### Chart - 4

In [None]:
# Chart - 4: Top 10 Most Valuable Products (by total revenue)

top_valuable_products = df.groupby('description')['total_price'].sum().sort_values(ascending=False).head(10)

# Plotting
plt.figure(figsize=(10,6))
top_valuable_products.plot(kind='bar', color='green')
plt.title("Top 10 Most Valuable Products")
plt.ylabel("Total Revenue (£)")
plt.xlabel("Product Description")
plt.xticks(rotation=90)
plt.grid(axis='y')
plt.tight_layout()
plt.show()

1. Why did you pick the specific chart?
Bar charts clearly showcase comparisons among product categories based on revenue. This helps prioritize products that bring in the most money.

2. What is/are the insight(s) found from the chart?
We identify which products contribute the most revenue, measured as price × quantity rather than sales volume alone. A product may not sell the most units but can still top the revenue ranking.

3. Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with specific reason.
Yes — this helps in profit optimization by focusing on high-revenue products. If a high-revenue product also has high return rates, stockouts, or customer complaints, it might negatively impact long-term business, requiring deeper review.

#### Chart - 5

In [None]:
# Chart - 5: Top 10 Customers by Total Revenue

top_customers = df.groupby('customerid')['total_price'].sum().sort_values(ascending=False).head(10)

# Plotting
plt.figure(figsize=(10,6))
top_customers.plot(kind='bar', color='purple')
plt.title("Top 10 Customers by Revenue")
plt.ylabel("Total Revenue (£)")
plt.xlabel("Customer ID")
plt.xticks(rotation=0)
plt.grid(axis='y')
plt.tight_layout()
plt.show()

1. Why did you pick the specific chart?
A bar chart efficiently ranks customers based on the revenue they generate, giving a snapshot of your most valuable clients.

2. What is/are the insight(s) found from the chart?
The chart shows which customers contribute the most to revenue. These could be loyal customers, resellers, or bulk buyers.

3. Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with specific reason.
Yes — these insights can guide loyalty programs, personalized marketing, or early access strategies. However, over-reliance on a few key customers can be risky if one leaves or reduces spending.
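
The over-reliance risk can be quantified as the share of revenue that comes from the top few customers. This is a small sketch under stated assumptions: the `revenue` Series is hypothetical and stands in for `df.groupby('customerid')['total_price'].sum()`, and `top_n` is an arbitrary choice.

```python
import pandas as pd

# Hypothetical per-customer revenue totals
revenue = pd.Series({"A": 500.0, "B": 300.0, "C": 100.0, "D": 60.0, "E": 40.0})

top_n = 2  # assumed cut-off, chosen only for illustration
top_share = revenue.nlargest(top_n).sum() / revenue.sum()
print(f"Top {top_n} customers account for {top_share:.0%} of revenue")
```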

#### Chart - 6

In [None]:
# Chart - 6: Monthly Revenue Trend

monthly_sales = df.resample('M', on='invoicedate')['total_price'].sum()

# Plotting
plt.figure(figsize=(12,6))
monthly_sales.plot(marker='o', linestyle='-')
plt.title("Monthly Revenue Trend")
plt.ylabel("Total Revenue (£)")
plt.xlabel("Month")
plt.grid(True)
plt.tight_layout()
plt.show()

1. Why did you pick the specific chart?
A time series line chart is the best choice to visualize trends across months. It captures patterns, seasonality, and anomalies in revenue flow.

2. What is/are the insight(s) found from the chart?
You may observe peaks around November/December — likely due to holiday season spikes. Some months may show dips indicating low customer activity or supply chain issues.

3. Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with specific reason.
Yes — understanding monthly patterns supports seasonal planning, staffing, and promotional timing. A sudden decline in expected months may signal underlying issues like poor campaigns or product unavailability, prompting corrective action.

#### Chart - 7

In [None]:
# Chart - 7: Total Value of Returns (Negative Quantity) by Country

returns_df = df[df['quantity'] < 0]
returns_by_country = returns_df.groupby('country')['total_price'].sum().sort_values().head(10)  # totals are negative, so ascending order lists the 10 largest return values first

# Plotting
plt.figure(figsize=(10,6))
returns_by_country.plot(kind='barh', color='red')
plt.title("Top 10 Countries by Return Value")
plt.xlabel("Total Return Value (£)")
plt.ylabel("Country")
plt.grid(axis='x')
plt.tight_layout()
plt.show()

1. Why did you pick the specific chart?
A horizontal bar chart helps us compare return amounts across countries clearly, especially when country names vary in length.

2. What is/are the insight(s) found from the chart?
We can see which countries experience the most monetary loss due to product returns. This may indicate dissatisfaction, shipping issues, or cultural return behaviors.

3. Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with specific reason.
Yes — high return values can flag operational or quality problems in specific regions. Reducing returns through quality checks, better logistics, or localized support directly improves profit margins and customer satisfaction.



#### Chart - 8

In [None]:
# Chart - 8: Heatmap of Purchases by Hour and Day of Week

# Extract hour and weekday from invoice datetime
df['hour'] = df['invoicedate'].dt.hour
df['weekday'] = df['invoicedate'].dt.day_name()

# Create pivot table
hourly_pivot = df.pivot_table(index='weekday', columns='hour', values='invoiceno', aggfunc='count').fillna(0)

# Reorder weekdays
weekday_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
hourly_pivot = hourly_pivot.reindex(weekday_order)

# Plotting heatmap
plt.figure(figsize=(15,6))
sns.heatmap(hourly_pivot, cmap='YlGnBu')
plt.title('Heatmap of Purchase Volume by Hour and Day of Week')
plt.xlabel('Hour of Day')
plt.ylabel('Day of Week')
plt.tight_layout()
plt.show()

1. Why did you pick the specific chart?
A heatmap helps visualize two-dimensional patterns — in this case, when during the day and which weekdays customers are most active. This is hard to interpret from raw data alone.

2. What is/are the insight(s) found from the chart?
We can identify peak purchasing hours (e.g., mid-morning or early afternoon) and busiest days (e.g., Tuesdays or Thursdays). Lulls may be seen during weekends or night hours.

3. Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with specific reason.
Yes — these insights support marketing timing, staffing, and promotions. If weekends are unusually slow, it may indicate poor weekend engagement or need for special campaigns.

#### Chart - 9

In [None]:
# Chart - 9: Boxplots for Outlier Detection in Quantity and UnitPrice

plt.figure(figsize=(14, 6))

# Quantity
plt.subplot(1, 2, 1)
sns.boxplot(y=df['quantity'])
plt.title("Boxplot of Quantity")

# UnitPrice
plt.subplot(1, 2, 2)
sns.boxplot(y=df['unitprice'])
plt.title("Boxplot of UnitPrice")

plt.tight_layout()
plt.show()

1. Why did you pick the specific chart?
Boxplots clearly highlight outliers and spread of data, making them perfect for numerical outlier detection.

2. What is/are the insight(s) found from the chart?
There are several unusually high values in both quantity and unitprice which deviate far from the norm, indicating potential data entry errors, bulk orders, or anomalies.

3. Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with specific reason.
Yes. Removing or capping these extreme outliers ensures robust models and reliable insights. If left untreated, they could mislead pricing strategies or inventory planning.
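
Capping is the alternative to removal mentioned above: every row is kept, but extreme values are pulled back to a fence. The sketch below uses the same 1.5 × IQR fences as the removal function in the next cell; the `qty` values are hypothetical.

```python
import pandas as pd

# Hypothetical quantities with one extreme bulk order
qty = pd.Series([1, 2, 2, 3, 4, 5, 6, 8, 10, 1000])

q1, q3 = qty.quantile(0.25), qty.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# clip() keeps every row but limits values to the range [lower, upper]
capped = qty.clip(lower=lower, upper=upper)
print(len(capped), capped.max())
```

Capping may be preferable when the large orders are legitimate and the rows should still count toward customer-level totals.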

#### Chart - 10

In [None]:
# Copy original DataFrame
df_cleaned = df.copy()

# Function to remove outliers using IQR
def remove_outliers_iqr(data, column):
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    return data[(data[column] >= lower) & (data[column] <= upper)]

# Remove outliers from 'quantity' and 'unitprice'
df_cleaned = remove_outliers_iqr(df_cleaned, 'quantity')
df_cleaned = remove_outliers_iqr(df_cleaned, 'unitprice')

# Check new shape
print("Original rows:", df.shape[0])
print("Cleaned rows:", df_cleaned.shape[0])

1. Why did you pick this method?
The IQR method is simple yet effective for identifying extreme values that fall outside a reasonable range in skewed distributions.

2. What is/are the insight(s) found from cleaning?
Outlier removal reduced the dataset size, removing rare and possibly erroneous or bulk order values that could distort analysis.

3. Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth?
Yes, cleaning improves model performance and decision-making reliability. There's no negative growth, but ensure legitimate high-value orders are not unintentionally removed.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

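The statements tested below are:

1. Mean revenue differs between UK and non-UK customers.

2. Quantity and total_price are correlated at the invoice level.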

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

H₀ (Null Hypothesis): Mean revenue per transaction line (total_price) is equal for UK and non-UK customers.

H₁ (Alternative Hypothesis): Mean revenue per transaction line (total_price) is not equal for UK and non-UK customers.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import ttest_ind

# Group data by country
uk_revenue = df_cleaned[df_cleaned['country'] == 'United Kingdom']['total_price']
non_uk_revenue = df_cleaned[df_cleaned['country'] != 'United Kingdom']['total_price']

# Perform Welch's t-test
t_stat, p_value = ttest_ind(uk_revenue, non_uk_revenue, equal_var=False)

print("T-statistic:", round(t_stat, 3))
print(f"P-value: {p_value:.3g}")  # round(p_value, 3) would display very small p-values as 0.0

# Interpret result
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis. Revenue differs significantly between UK and non-UK customers.")
else:
    print("Fail to reject the null hypothesis. No significant difference in revenue.")

##### Which statistical test have you done to obtain P-Value?

We used Welch's t-test, a variation of the independent t-test. It compares the means of two independent groups (here, UK vs. non-UK customer revenues) while accounting for unequal variances and sample sizes between the two groups. The p-value obtained from this test tells us whether the difference in means is statistically significant.

##### Why did you choose the specific statistical test?

The Welch’s t-test was chosen because:

The UK and non-UK groups are independent (i.e., different customers from different countries).

The two groups differ in sample size and cannot be assumed to share a variance, which violates the equal-variance assumption of the standard (Student's) t-test.

Welch’s t-test is robust and more reliable in such cases, and is a preferred test when comparing two population means with unequal variance.
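
To make the test concrete, the sketch below computes Welch's statistic by hand and compares it with `scipy.stats.ttest_ind(..., equal_var=False)`. The two samples are synthetic and hypothetical; they only mimic groups with different sizes and variances.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
# Hypothetical samples with different sizes and variances
a = rng.normal(loc=10.0, scale=2.0, size=200)
b = rng.normal(loc=10.5, scale=5.0, size=50)

# Welch's statistic: difference in means over the unpooled standard error
se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
t_manual = (a.mean() - b.mean()) / se

t_scipy, p_value = ttest_ind(a, b, equal_var=False)
print(t_manual, t_scipy, p_value)
```

The key difference from Student's t-test is that each group's variance is divided by its own sample size instead of being pooled.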

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

H₀ (Null Hypothesis): There is no correlation between quantity and total_price.

H₁ (Alternative Hypothesis): There is a significant correlation between quantity and total_price.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import pearsonr

# Group by 'invoiceno' and sum 'quantity' and 'total_price'
invoice_summary = df_cleaned.groupby('invoiceno')[['total_price', 'quantity']].sum().reset_index()

# Perform Pearson correlation test
corr_coef, p_value = pearsonr(invoice_summary['quantity'], invoice_summary['total_price'])

# Show results
print(f"Correlation Coefficient: {corr_coef}")
print(f"P-Value: {p_value}")

##### Which statistical test have you done to obtain P-Value?

We used the Pearson Correlation Coefficient test to evaluate the strength and significance of the linear relationship between quantity and total_price per invoice. The p-value obtained from this test tells us whether the correlation is statistically significant.

##### Why did you choose the specific statistical test?

The Pearson correlation test is suitable because:

Both variables (quantity and total_price) are continuous numerical variables.

We want to test for a linear relationship between these variables.

It provides both the correlation coefficient (strength & direction) and the p-value to determine significance.
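
As an illustration of what the coefficient measures, the sketch below computes Pearson's r from centred vectors and checks it against `scipy.stats.pearsonr`. The per-invoice numbers are hypothetical.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical per-invoice totals
quantity = np.array([2.0, 5.0, 9.0, 12.0, 20.0])
total_price = np.array([5.0, 11.0, 20.0, 22.0, 45.0])

# r = sum(xc * yc) / sqrt(sum(xc^2) * sum(yc^2)), with xc and yc centred on their means
xc = quantity - quantity.mean()
yc = total_price - total_price.mean()
r_manual = (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))

r_scipy, p_value = pearsonr(quantity, total_price)
print(r_manual, r_scipy, p_value)
```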

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

# Check missing values
missing_values = df_cleaned.isnull().sum()
missing_percent = (missing_values / len(df_cleaned)) * 100

# Combine into a DataFrame for better visibility
missing_df = pd.DataFrame({
    'Missing Values': missing_values,
    'Percentage': missing_percent
}).sort_values(by='Percentage', ascending=False)

missing_df

#### What all missing value imputation techniques have you used and why did you use those techniques?

The customerid column had missing values.

Since customerid is an identifier rather than a measurement, it cannot be meaningfully imputed, so rows where it is missing are dropped before any customer-level aggregation.

We did not use imputation (such as the mode or a random ID) because it would assign transactions to the wrong customer and produce incorrect groupings.

The description column also has some missing values, as noted in the dataset overview; no imputation was applied to it.
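
Dropping rows without a customer ID can be sketched as follows. The `toy` frame is hypothetical; in the notebook the same call would be applied to `df_cleaned`.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; two have no customer ID
toy = pd.DataFrame({
    "invoiceno":   ["536365", "536366", "536367", "536368"],
    "customerid":  [17850.0, np.nan, 13047.0, np.nan],
    "total_price": [15.3, 22.0, 8.5, 4.2],
})

# Keep only rows that can be attributed to a customer
with_id = toy.dropna(subset=["customerid"]).copy()

# IDs are read as floats because of the NaNs; cast to int once they are gone
with_id["customerid"] = with_id["customerid"].astype(int)
print(len(toy), "->", len(with_id))
```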

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
import seaborn as sns
import matplotlib.pyplot as plt

# Visualizing outliers for 'quantity' and 'unitprice'
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
sns.boxplot(x=df_cleaned['quantity'])
plt.title("Boxplot - Quantity")

plt.subplot(1, 2, 2)
sns.boxplot(x=df_cleaned['unitprice'])
plt.title("Boxplot - UnitPrice")

plt.tight_layout()
plt.show()

##### What all outlier treatment techniques have you used and why did you use those techniques?

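Outliers were identified in the quantity and unitprice columns using boxplots.

We used the Interquartile Range (IQR) method to detect and treat extreme outliers.

Rows where quantity or unitprice fell outside the range [Q1 - 1.5 × IQR, Q3 + 1.5 × IQR] were removed. This drops extreme bulk orders, large cancellations, and extreme or strongly negative prices.

Reason:

Negative quantity values usually indicate returns or cancellations.

Negative unitprice values are likely data errors.

Removing extreme values improves the reliability of analysis and modeling. The IQR filter alone does not guarantee that every negative quantity is removed: small returns that fall inside the lower fence remain and need an explicit filter if returns should be excluded.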
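
If returns and invalid prices should be excluded outright, an explicit filter is more direct than relying on outlier fences. This is a minimal sketch on a hypothetical `toy` frame; whether zero-priced rows should be dropped is an assumption that depends on the analysis.

```python
import pandas as pd

# Hypothetical transaction lines
toy = pd.DataFrame({
    "quantity":  [6, -2, 3, 1],
    "unitprice": [2.55, 2.55, 0.0, 4.25],
})

# Returns are kept in their own frame so they can still be analysed separately
returns = toy[toy["quantity"] < 0]

# Sales: positive quantity and a strictly positive price
sales = toy[(toy["quantity"] > 0) & (toy["unitprice"] > 0)]
print(len(sales), len(returns))
```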

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
# Check object/categorical columns
df_cleaned.select_dtypes(include='object').nunique()

In [None]:
# Label Encoding for binary or low-cardinality categorical columns
from sklearn.preprocessing import LabelEncoder

# Making a copy of the dataset
df_encoded = df_cleaned.copy()

# Encoding 'country' using LabelEncoder since it's not used for modeling but helpful as a numeric feature
le = LabelEncoder()
df_encoded['country_encoded'] = le.fit_transform(df_encoded['country'])

# Dropping the original 'country' to avoid redundancy
df_encoded.drop('country', axis=1, inplace=True)

# Checking the result
df_encoded.head()


#### What all categorical encoding techniques have you used & why did you use those techniques?

We used Label Encoding for the country column.

Reason:

Label encoding assigns each unique category a numeric code.

Because country is not a primary driver of the segmentation, label encoding is a fast and compact choice. The codes impose an arbitrary order, which tree-based models tolerate but distance-based or linear models can misread as magnitude.

The column had moderate cardinality and was not intended for detailed categorical analysis.

We avoided One-Hot Encoding as it would create many dummy columns due to the large number of unique countries, leading to dimensionality explosion.
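
To see what label encoding actually assigns, the mapping can be inspected after fitting. The country list below is a hypothetical sample; `LabelEncoder` sorts the classes, so the codes follow alphabetical order and carry no business meaning.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of the country column
countries = ["United Kingdom", "France", "Germany", "France"]

le = LabelEncoder()
codes = le.fit_transform(countries)

# classes_ holds the sorted unique labels; their positions are the codes
mapping = {str(label): int(code) for code, label in enumerate(le.classes_)}
print(list(codes), mapping)
```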

### 4. Feature Manipulation & Selection

In [None]:
# Manipulate Features to minimize feature correlation and create new features
# Example: Extracting additional features from 'invoicedate'
df_encoded['is_weekend'] = df_encoded['dayofweek'].isin([5, 6]).astype(int)
df_encoded['is_night'] = ((df_encoded['hour'] < 6) | (df_encoded['hour'] > 20)).astype(int)

# Preview the new features
df_encoded[['dayofweek', 'hour', 'is_weekend', 'is_night']].head()

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting
# Select only numeric columns for correlation
numeric_df = df_encoded.select_dtypes(include='number')

# Plot heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(numeric_df.corr(), annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap")
plt.show()


##### What all feature selection methods have you used  and why?

Derived is_weekend from dayofweek to identify whether the transaction happened during the weekend.

Derived is_night from hour to capture user behavior during late hours.

These binary features may help improve model performance by capturing user purchase patterns across time segments.

For selection, the correlation heatmap over the numeric columns was used to spot pairs of features that carry nearly the same information.
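
A correlation matrix can also drive selection programmatically. The sketch below drops one feature from each highly correlated pair; the data is synthetic, and the 0.9 threshold is an assumed cut-off rather than a value taken from this project.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
base = rng.normal(size=200)

# Hypothetical features: 'b' is almost a scaled copy of 'a', 'c' is unrelated
toy = pd.DataFrame({
    "a": base,
    "b": base * 2 + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})

corr = toy.corr().abs()

# Keep only the upper triangle so each pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # assumed cut-off, tune per project
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
print(to_drop)
```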

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

Yes, data transformation was necessary because the distributions of unitprice, quantity, and total_price are strongly right-skewed. Highly skewed features can hurt methods that depend on distances or on feature scale, such as the K-Means clustering used later, because a few very large values dominate.

Transformation Technique Used: Log Transformation (np.log1p)

Applied to: unitprice, quantity, and total_price

Reason: To reduce skewness and compress the range of the numerical features, improving model performance and convergence.

In [None]:
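# Transform Your data
# Log transform skewed numerical columns
# Note: log1p is undefined for values <= -1, so any remaining negative quantities
# or totals (returns) produce NaN or -inf here; these are handled before scaling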
df_encoded['log_unitprice'] = np.log1p(df_encoded['unitprice'])
df_encoded['log_quantity'] = np.log1p(df_encoded['quantity'])
df_encoded['log_total_price'] = np.log1p(df_encoded['total_price'])

# Drop the original columns if not needed
# df_encoded.drop(['unitprice', 'quantity', 'total_price'], axis=1, inplace=True)
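
The effect of the transform, and its limit on negative inputs, can be checked on synthetic data. The log-normal `price` series below is hypothetical and is not drawn from the notebook's dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Hypothetical right-skewed prices
price = pd.Series(rng.lognormal(mean=1.0, sigma=1.0, size=5000))

skew_before = price.skew()
skew_after = np.log1p(price).skew()
print(round(skew_before, 2), round(skew_after, 2))

# log1p is only defined for inputs greater than -1
with np.errstate(divide="ignore", invalid="ignore"):
    edge = np.log1p(np.array([-3.0, -1.0, 0.0]))
print(edge)  # first entry is nan, second is -inf, third is 0
```

This is why rows with negative quantity or total_price need to be filtered or handled before the log columns are used.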

### 6. Data Scaling

In [None]:
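# Scaling your data
# Assumption: numeric_cols and scaler are not defined in any earlier cell, so they
# are set here; the log-transformed columns are taken as the features to scale
numeric_cols = ['log_unitprice', 'log_quantity', 'log_total_price']
scaler = StandardScaler()

# Check for infinite values
print(np.isinf(df_encoded[numeric_cols]).sum())

# Replace inf/-inf with NaN
df_encoded[numeric_cols] = df_encoded[numeric_cols].replace([np.inf, -np.inf], np.nan)

# Then handle the NaNs (e.g., with mean imputation or dropping)
df_encoded[numeric_cols] = df_encoded[numeric_cols].fillna(df_encoded[numeric_cols].mean())

# Now apply scaling
df_encoded[numeric_cols] = scaler.fit_transform(df_encoded[numeric_cols])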


##### Which method have you used to scale your data and why?

We used Standard Scaling (Z-score normalization) to scale our features.

Many ML models (e.g., K-Means, KNN, SVM, Logistic Regression) are sensitive to the scale of the data.

StandardScaler transforms the features such that they have a mean of 0 and standard deviation of 1.

It's effective when the data is normally distributed or approximately so (after log transformation).
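
The transformation itself is z = (x - mean) / std, applied per column. The sketch below verifies this against `StandardScaler` on a small hypothetical matrix.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical two-column feature matrix on very different scales
X = np.array([[1.0, 1000.0], [2.0, 3000.0], [3.0, 5000.0], [4.0, 7000.0]])

X_scaled = StandardScaler().fit_transform(X)

# Same result by hand, using the population standard deviation (ddof=0)
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```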

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Yes, dimensionality reduction can be helpful in this case:

It helps reduce multicollinearity by removing redundant features.

It improves model performance by eliminating noise and irrelevant information.

It enables faster computation and better visualization in lower dimensions.

In [None]:
# Dimensionality Reduction (If needed)
from sklearn.decomposition import PCA

# Initialize PCA to retain 95% of the variance
pca = PCA(n_components=0.95)

# Apply PCA on scaled numeric columns
pca_features = pca.fit_transform(df_encoded[numeric_cols])

# Create a DataFrame for the reduced features
pca_df = pd.DataFrame(pca_features, columns=[f'PC{i+1}' for i in range(pca_features.shape[1])])

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

We used Principal Component Analysis (PCA) because:

PCA helps in capturing the maximum variance with the fewest number of components.

It transforms features into orthogonal (uncorrelated) components.

It's suitable for continuous numerical data, like the scaled log_ features.
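
How many components a 95% variance target keeps can be read from `explained_variance_ratio_`. The sketch below uses synthetic, hypothetical data with three redundant columns and one independent column, so two components should be enough.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
signal = rng.normal(size=(300, 1))

# Hypothetical data: three noisy copies of one signal plus one independent column
X = np.hstack([
    signal + rng.normal(scale=0.1, size=(300, 3)),
    rng.normal(size=(300, 1)),
])

pca = PCA(n_components=0.95)  # keep enough components for 95% of the variance
X_reduced = pca.fit_transform(X)

cumulative = np.cumsum(pca.explained_variance_ratio_)
print(X_reduced.shape[1], cumulative)
```

The redundant columns collapse into a single component, which is the multicollinearity reduction described above.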

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

##### What data splitting ratio have you used and why?

To check that the model generalizes well and to detect overfitting:

Training Set: Used to train the model.

Test Set: Used to evaluate performance on new data.

We use an 80:20 split, a common default that leaves most of the data for training while keeping a reasonably sized test set.
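
An 80:20 split can be sketched as follows. The arrays are hypothetical, and `stratify=y` is an optional addition that keeps class proportions the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical features and imbalanced labels (80 / 20)
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape, y_test.mean())
```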

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

An imbalanced dataset occurs when one class significantly outnumbers the others (for classification tasks).
This can lead to biased models that favor the majority class.
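
Whether the labels are imbalanced can be checked from the class shares once labels exist. The `labels` Series below is hypothetical; in this notebook it would be the cluster column used as the classification target.

```python
import pandas as pd

# Hypothetical class labels
labels = pd.Series([0] * 90 + [1] * 8 + [2] * 2)

# Share of each class, largest first
shares = labels.value_counts(normalize=True)
imbalance_ratio = shares.max() / shares.min()

print(shares)
print(f"largest / smallest class: {imbalance_ratio:.0f}x")
```

A large ratio suggests using stratified splits and per-class metrics rather than accuracy alone.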

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
df_encoded.columns


In [None]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# 1. Select numeric features for clustering (excluding categorical/string IDs)
features_for_clustering = ['quantity', 'unitprice', 'total_price', 'log_quantity', 'log_unitprice', 'log_total_price']

# 2. Scale the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df_encoded[features_for_clustering])

# 3. Apply KMeans
kmeans = KMeans(n_clusters=3, random_state=42)
df_encoded['customer_segment'] = kmeans.fit_predict(X_scaled)
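
The choice of `n_clusters=3` above is fixed by hand. One common way to check it is to compare silhouette scores across several values of k. This is a sketch on synthetic blobs standing in for `X_scaled`; the four centers and the range of k are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical, well-separated data with 4 true groups
X_demo, _ = make_blobs(
    n_samples=400,
    centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
    cluster_std=0.6,
    random_state=0,
)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

# The k with the highest silhouette score is the best-separated clustering
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On large datasets, `silhouette_score` accepts a `sample_size` argument to keep the computation affordable.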

In [None]:
from sklearn.preprocessing import LabelEncoder

# Find object columns (text/categorical)
cat_cols = df_encoded.select_dtypes(include='object').columns

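# Label encode each string column with its own encoder,
# keeping the fitted encoders so the mappings can be reused or inverted later
label_encoders = {}
for col in cat_cols:
    le = LabelEncoder()
    df_encoded[col] = le.fit_transform(df_encoded[col].astype(str))
    label_encoders[col] = le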

In [None]:
# Drop unnecessary columns and prepare train/test
X = df_encoded.drop(columns=['customer_segment', 'invoiceno', 'stockcode', 'invoicedate', 'customerid'])
y = df_encoded['customer_segment']

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
y_pred = rf_model.predict(X_test)

# Evaluate
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nAccuracy Score:", accuracy_score(y_test, y_pred))


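#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

I used a Random Forest Classifier, which gave near-perfect scores on the test set. Note that the target, `customer_segment`, was produced by KMeans from features (`quantity`, `unitprice`, `total_price` and their log versions) that are also included in `X`, so the classifier is largely re-learning the cluster boundaries and scores this high are expected.

| Metric    | Score    |
| --------- | -------- |
| Accuracy  | 99.9957% |
| Precision | 1.00     |
| Recall    | 1.00     |
| F1-Score  | 1.00     |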
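
To put scores like these in context, the sketch below shows that a classifier can reach very high accuracy whenever its labels were generated by KMeans from the same features it is trained on, even if the data has no real group structure. Everything here (random data, variable names) is an illustrative assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Unstructured random data (assumption): there are no real groups to discover.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(3000, 2))

# Labels are created by KMeans from the very features the classifier will see.
y_demo = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_demo)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)
clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
acc = accuracy_score(y_te, clf.predict(X_te))
print(f"Accuracy on KMeans-derived labels: {acc:.4f}")
```

In this setting the accuracy measures how well the classifier reproduces the clustering rule, not how well the segments reflect customer behavior.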

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

feature_importances = pd.Series(rf_model.feature_importances_, index=X.columns)
top_features = feature_importances.sort_values(ascending=False).head(15)

plt.figure(figsize=(10,6))
sns.barplot(x=top_features.values, y=top_features.index)
plt.title("Top 15 Feature Importances")
plt.xlabel("Importance Score")
plt.show()

In [None]:
# Visualize the customer segments in 2D using PCA
# (X is not rescaled here, so high-variance columns dominate the components)
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

plt.figure(figsize=(8,6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='Set1', alpha=0.5)
plt.title('Customer Segments Visualized via PCA')
plt.xlabel('PCA 1')
plt.ylabel('PCA 2')
plt.colorbar(label='Segment')
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
from sklearn.model_selection import StratifiedKFold, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from scipy.stats import randint

# Cross-validator
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

# Reduced hyperparameter space for speed
param_dist = {
    'n_estimators': randint(50, 150),
    'max_depth': [10, 20, None],
    'min_samples_split': randint(2, 5),
    'min_samples_leaf': randint(1, 3),
    'max_features': ['sqrt']
}

# Randomized Search setup
rf_random = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=10,  # Reduced for speed
    cv=cv,
    scoring='accuracy',
    verbose=1,
    n_jobs=-1,
    random_state=42
)

# Fit the model
rf_random.fit(X_train, y_train)

# Best model
print("Best Parameters Found:\n", rf_random.best_params_)
best_rf_model = rf_random.best_estimator_

# Predict & evaluate
y_pred = best_rf_model.predict(X_test)

print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("Accuracy Score:", accuracy_score(y_test, y_pred))

##### Which hyperparameter optimization technique have you used and why?

 I used RandomizedSearchCV because it is computationally more efficient than GridSearchCV when searching across a wide range of parameters. It randomly samples parameter combinations, allowing faster tuning without exhaustive searches.
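
As a rough illustration of the cost difference, the sketch below counts how many model fits an exhaustive grid over the same Random Forest ranges would need, compared with 10 random draws. The explicit grid is an assumption built to mirror the `randint` ranges used above.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Exhaustive grid over the same ranges the random search samples from.
# scipy's randint(low, high) excludes `high`, so range(50, 150) matches randint(50, 150).
grid = ParameterGrid({
    "n_estimators": list(range(50, 150)),
    "max_depth": [10, 20, None],
    "min_samples_split": [2, 3, 4],
    "min_samples_leaf": [1, 2],
    "max_features": ["sqrt"],
})

param_dist = {
    "n_estimators": randint(50, 150),
    "max_depth": [10, 20, None],
    "min_samples_split": randint(2, 5),
    "min_samples_leaf": randint(1, 3),
    "max_features": ["sqrt"],
}
sampled = list(ParameterSampler(param_dist, n_iter=10, random_state=42))

n_folds = 3  # same as StratifiedKFold(n_splits=3)
grid_fits = len(grid) * n_folds
random_fits = len(sampled) * n_folds
print("Grid search fits:", grid_fits)
print("Randomized search fits:", random_fits)
```

With these ranges the grid has 1,800 combinations (5,400 fits with 3 folds), while the randomized search needs only 30 fits.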

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Accuracy and the other metrics remain extremely high both before and after tuning (~0.99996).

Performance did not improve noticeably because the baseline model was already near-perfect. The best configuration found by RandomizedSearchCV has:

149 trees (the baseline used 100, so the tuned forest is larger, not smaller)

Square-root feature sampling at each split (`max_features='sqrt'`, the only value searched)

Searched limits on tree depth and node size (`max_depth`, `min_samples_split`, `min_samples_leaf`), which help control model complexity and overfitting risk, especially with noisy real-world data

### ML Model - 2

In [None]:
# ML Model - 2 Implementation: LightGBM Classifier with Hyperparameter Optimization

## Step 1: Import Libraries
from lightgbm import LGBMClassifier
from sklearn.model_selection import StratifiedKFold, RandomizedSearchCV
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from scipy.stats import randint, uniform

## Step 2: Set Up Stratified K-Fold CV
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

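## Step 3: Define Hyperparameter Space
# Note: scipy's uniform(loc, scale) samples from [loc, loc + scale]
param_dist = {
    'n_estimators': randint(100, 300),
    'max_depth': [10, 20, 30, -1],           # -1 means no depth limit
    'learning_rate': uniform(0.01, 0.3),     # range 0.01 to 0.31
    'num_leaves': randint(20, 150),
    'min_child_samples': randint(10, 100),
    'subsample': uniform(0.6, 0.4),          # range 0.6 to 1.0; only takes effect when subsample_freq > 0
    'colsample_bytree': uniform(0.6, 0.4)    # range 0.6 to 1.0
}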

## Step 4: Randomized Search CV
lgbm_random = RandomizedSearchCV(
    estimator=LGBMClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=10,
    scoring='accuracy',
    cv=cv,
    verbose=1,
    n_jobs=-1,
    random_state=42
)

## Step 5: Fit the Model
lgbm_random.fit(X_train, y_train)

## Step 6: Evaluate the Best Model
print("Best Parameters Found:\n", lgbm_random.best_params_)
best_lgbm = lgbm_random.best_estimator_
y_pred = best_lgbm.predict(X_test)

print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("Accuracy Score:", accuracy_score(y_test, y_pred))

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

LightGBM also performs very strongly, with near-perfect accuracy on the test set.

Its accuracy differs from Random Forest's only in the fifth decimal place, a gap too small to treat as a meaningful difference between the two models.

Training was faster than for Random Forest at this level of accuracy.

# Comparing the 2 Models

| Metric               | **Random Forest**           | **LightGBM**                  |
| -------------------- | --------------------------- | ----------------------------- |
| **Accuracy**         | 0.99992                     | 0.99993                       |
| **Precision**        | 1.00                        | 1.00                          |
| **Recall**           | 1.00                        | 1.00                          |
| **F1-Score**         | 1.00                        | 1.00                          |
| **Confusion Matrix** | Very few misclassifications | Even fewer misclassifications |
| **Training Time**    | Moderate                    | Faster (optimized boosting)   |
| **Model Size**       | Large                       | Smaller                       |
| **Scalability**      | Medium (parallel trees)     | High (Boosted, GPU support)   |
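
When accuracies differ only in the fifth decimal place, it can help to compute every metric from the same test predictions and to look at raw error counts. The helper below is a hypothetical sketch (the name `summarize_models` and the toy labels are assumptions); in the notebook it would be called with `y_test` and each model's predictions.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def summarize_models(y_true, predictions):
    """Return one row of test metrics per model name in `predictions`."""
    rows = {}
    for name, y_hat in predictions.items():
        precision, recall, f1, _ = precision_recall_fscore_support(
            y_true, y_hat, average="macro", zero_division=0
        )
        rows[name] = {
            "accuracy": accuracy_score(y_true, y_hat),
            "precision_macro": precision,
            "recall_macro": recall,
            "f1_macro": f1,
            # Raw count of misclassified rows, easier to read than tiny accuracy gaps.
            "n_errors": int(sum(t != p for t, p in zip(y_true, y_hat))),
        }
    return pd.DataFrame.from_dict(rows, orient="index")

# Toy labels (assumption): replace with y_test and each model's predictions.
y_true = [0, 0, 1, 1, 2, 2, 2, 0]
summary = summarize_models(y_true, {
    "model_a": [0, 0, 1, 1, 2, 2, 2, 0],  # all correct
    "model_b": [0, 0, 1, 1, 2, 2, 1, 0],  # one mistake
})
print(summary)
```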


In [None]:
import matplotlib.pyplot as plt

models = ['Random Forest', 'LightGBM']
accuracy = [0.99992, 0.99993]

plt.figure(figsize=(6,4))
plt.bar(models, accuracy, color=['green', 'skyblue'])
plt.ylim(0.9999, 1.0)
plt.ylabel('Accuracy')
plt.title('Model Comparison: Accuracy Score')
plt.grid(axis='y')
plt.show()

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
best_lgbm = lgbm_random.best_estimator_

In [None]:
# Save the File
import joblib

# Save the trained LightGBM model
joblib.dump(best_lgbm, 'lightgbm_final_model.pkl')

In [None]:
from sklearn.preprocessing import StandardScaler
import joblib

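# `scaler` is the StandardScaler fit earlier on `features_for_clustering` (before KMeans).
# The classifiers above were trained on unscaled X, so this scaler is needed to
# reproduce the clustering step, not to preprocess inputs for the saved models.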

# Save the fitted scaler
joblib.dump(scaler, 'scaler.pkl')


In [None]:
import json

feature_cols = X_train.columns.tolist()

with open('feature_columns.json', 'w') as f:
    json.dump(feature_cols, f)

In [None]:
from sklearn.preprocessing import LabelEncoder
import joblib

# y_train already holds integer KMeans cluster ids, so this encoder maps each id to itself
encoder = LabelEncoder()
encoder.fit(y_train)

joblib.dump(encoder, 'label_encoder.pkl')


In [None]:
joblib.dump(best_rf_model, 'random_forest_model.pkl')
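
A minimal sketch of how saved artifacts like these might be loaded for inference is shown below. The helper `load_and_predict`, the temporary paths, and the tiny stand-in model are all assumptions for illustration; the real app would point at `lightgbm_final_model.pkl` and `feature_columns.json`.

```python
import json
import os
import tempfile

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def load_and_predict(model_path, columns_path, new_rows):
    """Load a saved model and column list, align the input, and predict."""
    model = joblib.load(model_path)
    with open(columns_path) as f:
        feature_cols = json.load(f)
    # Fail loudly if any training column is missing from the input.
    missing = set(feature_cols) - set(new_rows.columns)
    if missing:
        raise ValueError(f"Missing feature columns: {sorted(missing)}")
    # Reorder columns to match the order used at training time.
    return model.predict(new_rows[feature_cols])

# Round trip with a tiny stand-in model (assumption: not the project's model).
train = pd.DataFrame({"quantity": [1, 2, 50, 60], "unitprice": [1.0, 2.0, 9.0, 8.0]})
demo_model = RandomForestClassifier(n_estimators=10, random_state=42)
demo_model.fit(train, [0, 0, 1, 1])

tmp_dir = tempfile.mkdtemp()
model_path = os.path.join(tmp_dir, "demo_model.pkl")
columns_path = os.path.join(tmp_dir, "feature_columns.json")
joblib.dump(demo_model, model_path)
with open(columns_path, "w") as f:
    json.dump(train.columns.tolist(), f)

# Columns arrive in a different order; the helper restores the training order.
incoming = pd.DataFrame({"unitprice": [1.5, 8.5], "quantity": [2, 55]})
preds = load_and_predict(model_path, columns_path, incoming)
print(preds)
```

Saving the column list alongside the model guards against silently feeding features in the wrong order, which tree models would accept without raising an error when given plain arrays.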

# **Conclusion**

The Shopper Spectrum project demonstrated that machine learning can effectively segment e-commerce customers based on their behavior, enabling smarter business decisions. Two models, Random Forest and LightGBM, were trained and tuned with RandomizedSearchCV. Both reached near-perfect accuracy on the test set, and LightGBM was selected as the best model.

This segmentation model allows businesses to identify high-value customers, potential churners, and segment-specific behavior, which in turn supports highly personalized marketing strategies, increased retention, and better resource allocation. The deployment-ready code and Streamlit interface make the solution scalable and user-friendly.