Project Name: 🛒 Shopper Spectrum: Customer Segmentation and Product Recommendations in E-Commerce 

Project Type: Unsupervised Machine Learning – Clustering & Collaborative Filtering – Recommendation System 

Member Name: Aryan Dubey

This project analyzes transaction data from an online retail business to uncover patterns in customer purchasing behavior. The goal is to segment customers using Recency, Frequency, and Monetary (RFM) analysis and to build a product recommendation system using item-based collaborative filtering. The workflow covers data cleaning, feature engineering, and exploratory data analysis (EDA). K-Means clustering drives the customer segmentation, and cosine similarity between products drives the recommendations. The final deliverables are a Python notebook and a Streamlit web application that returns a segment prediction or product recommendations on demand.

Key Project Elements:

Domain: E-Commerce and Retail Analytics 

Skills & Methods: Public Dataset Exploration and Preprocessing, Data Cleaning and Feature Engineering, Exploratory Data Analysis (EDA), Clustering Techniques, Collaborative Filtering-based Product Recommendation, and Model Evaluation.

Technologies & Libraries:

Data Manipulation: Pandas, NumPy

Machine Learning: Scikit-Learn (K-Means Clustering, StandardScaler, Cosine Similarity) 

Visualization: Matplotlib, Seaborn

Web Application: Streamlit 

Problem Type: Unsupervised Machine Learning (Clustering) and Collaborative Filtering (Recommendation System).


In [3]:
import pandas as pd

# Load the dataset from the CSV file
df_retail = pd.read_csv('online_retail.csv')

# Display the first 5 rows to understand the data structure
print("Initial Data Preview:")
print(df_retail.head())

# Display data types and non-null counts
print("\nData Information:")
df_retail.info()

Initial Data Preview:
  InvoiceNo StockCode                          Description  Quantity  \
0    536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6   
1    536365     71053                  WHITE METAL LANTERN         6   
2    536365    84406B       CREAM CUPID HEARTS COAT HANGER         8   
3    536365    84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6   
4    536365    84029E       RED WOOLLY HOTTIE WHITE HEART.         6   

           InvoiceDate  UnitPrice  CustomerID         Country  
0  2022-12-01 08:26:00       2.55     17850.0  United Kingdom  
1  2022-12-01 08:26:00       3.39     17850.0  United Kingdom  
2  2022-12-01 08:26:00       2.75     17850.0  United Kingdom  
3  2022-12-01 08:26:00       3.39     17850.0  United Kingdom  
4  2022-12-01 08:26:00       3.39     17850.0  United Kingdom  

Data Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count

In [4]:
# Remove rows with missing CustomerID
df_retail.dropna(subset=['CustomerID'], inplace=True)
df_retail['CustomerID'] = df_retail['CustomerID'].astype(str)

# Keep only rows with positive Quantity and UnitPrice (drops zeros and negatives such as returns)
df_retail = df_retail[df_retail['Quantity'] > 0]
df_retail = df_retail[df_retail['UnitPrice'] > 0]

# Convert InvoiceDate to datetime object
df_retail['InvoiceDate'] = pd.to_datetime(df_retail['InvoiceDate'])

# Recalculate total sales for each transaction line
df_retail['TotalPrice'] = df_retail['Quantity'] * df_retail['UnitPrice']

# Display the cleaned data information
print("\nCleaned Data Information:")
df_retail.info()


Cleaned Data Information:
<class 'pandas.core.frame.DataFrame'>
Index: 397884 entries, 0 to 541908
Data columns (total 9 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    397884 non-null  object        
 1   StockCode    397884 non-null  object        
 2   Description  397884 non-null  object        
 3   Quantity     397884 non-null  int64         
 4   InvoiceDate  397884 non-null  datetime64[ns]
 5   UnitPrice    397884 non-null  float64       
 6   CustomerID   397884 non-null  object        
 7   Country      397884 non-null  object        
 8   TotalPrice   397884 non-null  float64       
dtypes: datetime64[ns](1), float64(2), int64(1), object(5)
memory usage: 30.4+ MB
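
Because `CustomerID` is read as a float, `astype(str)` produces labels such as `17850.0`, which is why the RFM table index carries a trailing `.0`. The following is a minimal sketch of one way to get integer-style IDs, assuming every non-missing ID is a whole number. The toy series is illustrative and is not project data.

```python
import pandas as pd

# Toy stand-in for the raw CustomerID column (floats with a missing value)
raw_ids = pd.Series([17850.0, 12346.0, None], name="CustomerID")

# Drop missing IDs, cast to integer first, then to string
clean_ids = raw_ids.dropna().astype("int64").astype(str)
print(clean_ids.tolist())  # ['17850', '12346']
```

If this were adopted in the notebook, the printed RFM index and any saved lookups keyed on `CustomerID` would change accordingly.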


In [5]:
import datetime as dt

# Set a reference date for calculating Recency (the day after the last transaction)
snapshot_date = df_retail['InvoiceDate'].max() + dt.timedelta(days=1)

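# Calculate RFM metrics
# Note: 'count' on InvoiceNo counts invoice lines, so Frequency here is the number of
# line items per customer; use 'nunique' to count distinct invoices instead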
rfm_df = df_retail.groupby('CustomerID').agg({
    'InvoiceDate': lambda x: (snapshot_date - x.max()).days,
    'InvoiceNo': 'count',
    'TotalPrice': 'sum'
})

# Rename the columns for clarity
rfm_df.rename(columns={
    'InvoiceDate': 'Recency',
    'InvoiceNo': 'Frequency',
    'TotalPrice': 'Monetary'
}, inplace=True)

# Display the RFM table
print("\nRFM Table:")
print(rfm_df.head())


RFM Table:
            Recency  Frequency  Monetary
CustomerID                              
12346.0         326          1  77183.60
12347.0           2        182   4310.00
12348.0          75         31   1797.24
12349.0          19         73   1757.55
12350.0         310         17    334.40
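
To make the RFM arithmetic concrete, here is a small worked example on a hypothetical four-row purchase log (the customers, invoices, and prices are invented). It also shows how counting invoice lines differs from counting distinct invoices, which matters when interpreting the Frequency column above.

```python
import pandas as pd

# Hypothetical purchase log: customer A has two invoices (three lines), B has one
toy = pd.DataFrame({
    "CustomerID": ["A", "A", "A", "B"],
    "InvoiceNo": ["1001", "1001", "1002", "1003"],
    "InvoiceDate": pd.to_datetime(
        ["2022-12-01", "2022-12-01", "2022-12-05", "2022-12-03"]
    ),
    "TotalPrice": [10.0, 5.0, 20.0, 7.5],
})

# Reference date: the day after the last transaction (2022-12-06)
snapshot = toy["InvoiceDate"].max() + pd.Timedelta(days=1)

rfm_toy = toy.groupby("CustomerID").agg(
    Recency=("InvoiceDate", lambda x: (snapshot - x.max()).days),
    FrequencyLines=("InvoiceNo", "count"),       # counts every invoice line
    FrequencyInvoices=("InvoiceNo", "nunique"),  # counts distinct invoices
    Monetary=("TotalPrice", "sum"),
)
print(rfm_toy)
```

Customer A ends up with Recency 1, three lines across two invoices, and a Monetary value of 35.0. Customer B has Recency 3, one line, one invoice, and 7.5.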


In [10]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import datetime as dt
import os

In [8]:
# Rebuild df_retail and rfm_df so this cell can run on its own
# (repeats the cleaning and RFM steps from the earlier cells)

# Reload and clean the transaction data
df_retail = pd.read_csv('online_retail.csv')
df_retail.dropna(subset=['CustomerID'], inplace=True)
df_retail['CustomerID'] = df_retail['CustomerID'].astype(str)
df_retail = df_retail[df_retail['Quantity'] > 0]
df_retail = df_retail[df_retail['UnitPrice'] > 0]
df_retail['InvoiceDate'] = pd.to_datetime(df_retail['InvoiceDate'])
df_retail['TotalPrice'] = df_retail['Quantity'] * df_retail['UnitPrice']

# Recompute the RFM table
snapshot_date = df_retail['InvoiceDate'].max() + dt.timedelta(days=1)
rfm_df = df_retail.groupby('CustomerID').agg({
    'InvoiceDate': lambda x: (snapshot_date - x.max()).days,
    'InvoiceNo': 'count',
    'TotalPrice': 'sum'
})
rfm_df.rename(columns={
    'InvoiceDate': 'Recency',
    'InvoiceNo': 'Frequency',
    'TotalPrice': 'Monetary'
}, inplace=True)

print("Starting Exploratory Data Analysis...")

Starting Exploratory Data Analysis...


In [15]:
# 1. Analyze transaction volume by country
print("\nTop 10 Countries by Transaction Volume:")
country_transactions = df_retail.groupby('Country')['InvoiceNo'].nunique().sort_values(ascending=False).head(10)
print(country_transactions)

plt.figure(figsize=(12, 6))
sns.barplot(x=country_transactions.index, y=country_transactions.values)
plt.title('Top 10 Countries by Transaction Volume')
plt.xlabel('Country')
plt.ylabel('Number of Transactions')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.savefig('country_transaction_volume.png')
plt.clf()


# 2. Identify top-selling products
print("\nTop 10 Selling Products by Quantity:")
top_products = df_retail.groupby('Description')['Quantity'].sum().sort_values(ascending=False).head(10)
print(top_products)

plt.figure(figsize=(12, 6))
sns.barplot(x=top_products.values, y=top_products.index)
plt.title('Top 10 Selling Products by Quantity')
plt.xlabel('Total Quantity Sold')
plt.ylabel('Product Description')
plt.tight_layout()
plt.savefig('top_selling_products.png')
plt.clf()


# 3. Visualize purchase trends over time
print("\nPurchase Trends Over Time:")
# 'ME' (month end) replaces the deprecated 'M' alias in pandas >= 2.2; use 'M' on older versions
monthly_sales = df_retail.set_index('InvoiceDate').resample('ME')['TotalPrice'].sum()

plt.figure(figsize=(12, 6))
sns.lineplot(x=monthly_sales.index, y=monthly_sales.values)
plt.title('Monthly Sales Trend Over Time')
plt.xlabel('Date')
plt.ylabel('Total Sales ($)')
plt.grid(True)
plt.tight_layout()
plt.savefig('monthly_sales_trend.png')
plt.clf()


# 4. Inspect monetary distribution per transaction and customer
print("\nMonetary Distribution Analysis:")
plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
sns.histplot(df_retail['TotalPrice'], bins=50, kde=True)
plt.title('Distribution of Transaction Monetary Value')
plt.xlabel('Total Price ($)')
plt.ylabel('Frequency')

plt.subplot(1, 2, 2)
sns.histplot(rfm_df['Monetary'], bins=50, kde=True)
plt.title('Distribution of Customer Monetary Value')
plt.xlabel('Monetary Value ($)')
plt.ylabel('Frequency')

plt.tight_layout()
plt.savefig('monetary_distributions.png')
plt.clf()


# 5. RFM distributions
print("\nRFM Distribution Analysis:")
plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
sns.histplot(rfm_df['Recency'], bins=50, kde=True)
plt.title('Recency Distribution')
plt.xlabel('Recency (Days)')
plt.ylabel('Number of Customers')

plt.subplot(1, 3, 2)
sns.histplot(rfm_df['Frequency'], bins=50, kde=True)
plt.title('Frequency Distribution')
plt.xlabel('Frequency (Total Purchases)')
plt.ylabel('Number of Customers')

plt.subplot(1, 3, 3)
sns.histplot(rfm_df['Monetary'], bins=50, kde=True)
plt.title('Monetary Distribution')
plt.xlabel('Monetary Value ($)')
plt.ylabel('Number of Customers')

plt.tight_layout()
plt.savefig('rfm_distributions.png')
plt.clf()


# 6. Elbow curve for cluster selection
print("\nGenerating Elbow Curve for K-Means Clustering...")
rfm_scaled = rfm_df[['Recency', 'Frequency', 'Monetary']]
scaler = StandardScaler()
rfm_scaled = scaler.fit_transform(rfm_scaled)

wcss = []
k_values = range(1, 11)
for i in k_values:
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42, n_init=10)
    kmeans.fit(rfm_scaled)
    wcss.append(kmeans.inertia_)

plt.figure(figsize=(10, 6))
plt.plot(k_values, wcss, marker='o')
plt.title('Elbow Method for Optimal K')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Within-Cluster Sum of Squares (WCSS)')
plt.grid(True)
plt.tight_layout()
plt.savefig('elbow_curve.png')
plt.clf()


# 7. Customer cluster profiles
print("\nPreparing for Customer Cluster Profiling...")
# This step requires choosing 'k' from the elbow curve.
# For demonstration, let's assume k=4.
kmeans = KMeans(n_clusters=4, init='k-means++', random_state=42, n_init=10)
rfm_df['Cluster'] = kmeans.fit_predict(rfm_scaled)

cluster_profiles = rfm_df.groupby('Cluster')[['Recency', 'Frequency', 'Monetary']].mean()
print("Customer Cluster Profiles:")
print(cluster_profiles)

# Visualization of cluster profiles
plt.figure(figsize=(15, 6))
plt.subplot(1, 3, 1)
sns.barplot(x=cluster_profiles.index, y=cluster_profiles['Recency'])
plt.title('Recency by Cluster')
plt.xlabel('Cluster')
plt.ylabel('Mean Recency')

plt.subplot(1, 3, 2)
sns.barplot(x=cluster_profiles.index, y=cluster_profiles['Frequency'])
plt.title('Frequency by Cluster')
plt.xlabel('Cluster')
plt.ylabel('Mean Frequency')

plt.subplot(1, 3, 3)
sns.barplot(x=cluster_profiles.index, y=cluster_profiles['Monetary'])
plt.title('Monetary by Cluster')
plt.xlabel('Cluster')
plt.ylabel('Mean Monetary')

plt.tight_layout()
plt.savefig('customer_cluster_profiles.png')
plt.clf()


# 8. Product recommendation heatmap / similarity matrix
print("\nBuilding Product Similarity Matrix for Recommendation...")

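# Create a user-item matrix of total quantity bought per customer and product
# (pivot_table defaults to the mean, so aggfunc='sum' is set explicitly)
user_item_matrix = df_retail.pivot_table(index='CustomerID', columns='Description', values='Quantity', aggfunc='sum', fill_value=0)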

from sklearn.metrics.pairwise import cosine_similarity
product_similarity_matrix = cosine_similarity(user_item_matrix.T)
product_similarity_df = pd.DataFrame(product_similarity_matrix, index=user_item_matrix.columns, columns=user_item_matrix.columns)

# Save the similarity matrix to a CSV file
product_similarity_df.to_csv('product_similarity_matrix.csv')
print("Product similarity matrix saved to 'product_similarity_matrix.csv'")




Top 10 Countries by Transaction Volume:
Country
United Kingdom    16646
Germany             457
France              389
EIRE                260
Belgium              98
Netherlands          94
Spain                90
Portugal             57
Australia            57
Switzerland          51
Name: InvoiceNo, dtype: int64

Top 10 Selling Products by Quantity:
Description
PAPER CRAFT , LITTLE BIRDIE           80995
MEDIUM CERAMIC TOP STORAGE JAR        77916
WORLD WAR 2 GLIDERS ASSTD DESIGNS     54415
JUMBO BAG RED RETROSPOT               46181
WHITE HANGING HEART T-LIGHT HOLDER    36725
ASSORTED COLOUR BIRD ORNAMENT         35362
PACK OF 72 RETROSPOT CAKE CASES       33693
POPCORN HOLDER                        30931
RABBIT NIGHT LIGHT                    27202
MINI PAINT SET VINTAGE                26076
Name: Quantity, dtype: int64

Purchase Trends Over Time:



Monetary Distribution Analysis:

RFM Distribution Analysis:

Generating Elbow Curve for K-Means Clustering...

Preparing for Customer Cluster Profiling...
Customer Cluster Profiles:
            Recency    Frequency       Monetary
Cluster                                        
0         41.341133   104.582512    2091.817116
1          2.000000  5807.000000   70925.287500
2        247.308333    27.787963     637.318510
3          7.666667   826.833333  190863.461667

Building Product Similarity Matrix for Recommendation...
Product similarity matrix saved to 'product_similarity_matrix.csv'
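
The item-to-item similarity step can be illustrated on a toy purchase log. This is a sketch with invented customers and products, using `aggfunc='sum'` so that each cell holds the total quantity a customer bought of a product.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Toy purchase log (hypothetical customers and products)
toy = pd.DataFrame({
    "CustomerID": ["A", "A", "B", "B", "C"],
    "Description": ["MUG", "PLATE", "MUG", "PLATE", "LAMP"],
    "Quantity": [2, 1, 4, 2, 3],
})

# Rows = customers, columns = products, values = total quantity bought
user_item = toy.pivot_table(index="CustomerID", columns="Description",
                            values="Quantity", aggfunc="sum", fill_value=0)

# Transpose so each row is a product vector over customers
sim = pd.DataFrame(cosine_similarity(user_item.T),
                   index=user_item.columns, columns=user_item.columns)
print(sim.round(2))
```

MUG and PLATE are bought by the same customers in the same proportions, so their similarity is 1.0. LAMP shares no buyers with either, so its similarity to both is 0.0.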



In [16]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import datetime as dt

# Reload and clean df_retail so this cell can run on its own (repeats the earlier steps)
try:
    df_retail = pd.read_csv('online_retail.csv')
    df_retail.dropna(subset=['CustomerID'], inplace=True)
    df_retail['CustomerID'] = df_retail['CustomerID'].astype(str)
    df_retail = df_retail[df_retail['Quantity'] > 0]
    df_retail = df_retail[df_retail['UnitPrice'] > 0]
    df_retail['InvoiceDate'] = pd.to_datetime(df_retail['InvoiceDate'])
    df_retail['TotalPrice'] = df_retail['Quantity'] * df_retail['UnitPrice']

    # Recompute the RFM table (repeats the earlier step)
    snapshot_date = df_retail['InvoiceDate'].max() + dt.timedelta(days=1)
    rfm_df = df_retail.groupby('CustomerID').agg({
        'InvoiceDate': lambda x: (snapshot_date - x.max()).days,
        'InvoiceNo': 'count',
        'TotalPrice': 'sum'
    })
    rfm_df.rename(columns={
        'InvoiceDate': 'Recency',
        'InvoiceNo': 'Frequency',
        'TotalPrice': 'Monetary'
    }, inplace=True)

except FileNotFoundError:
    print("Error: 'online_retail.csv' not found. Please make sure the file is in the same directory.")
    raise  # Stop here: the steps below need df_retail and rfm_df
except Exception as e:
    print(f"An error occurred: {e}")
    raise  # Re-raise so the cell does not continue with missing data

print("Starting Step 4: Clustering Methodology...")

# 1. Feature Engineering (RFM values already calculated in the previous step)
print("\nRFM values have been calculated in the previous step. Displaying head:")
print(rfm_df.head())

# 2. Standardize/Normalize the RFM values
print("\nStandardizing RFM values...")
scaler = StandardScaler()
rfm_scaled = scaler.fit_transform(rfm_df[['Recency', 'Frequency', 'Monetary']])
rfm_scaled_df = pd.DataFrame(rfm_scaled, columns=['Recency_Scaled', 'Frequency_Scaled', 'Monetary_Scaled'], index=rfm_df.index)

print("Standardized RFM values head:")
print(rfm_scaled_df.head())

# 3. & 4. Choose K-Means, and use Elbow Method & Silhouette Score
# We will evaluate K from 2 to 10 for both methods
print("\nUsing Elbow Method and Silhouette Score to find optimal number of clusters (K)...")
wcss = []
silhouette_scores = []
k_values = range(2, 11)

for i in k_values:
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42, n_init=10)
    kmeans.fit(rfm_scaled_df)
    wcss.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(rfm_scaled_df, kmeans.labels_))

# Plotting the Elbow Curve
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.plot(k_values, wcss, marker='o')
plt.title('Elbow Method for Optimal K')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Within-Cluster Sum of Squares (WCSS)')
plt.grid(True)

# Plotting the Silhouette Score
plt.subplot(1, 2, 2)
plt.plot(k_values, silhouette_scores, marker='o', color='orange')
plt.title('Silhouette Score for Optimal K')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Silhouette Score')
plt.grid(True)

plt.tight_layout()
plt.savefig('clustering_evaluation.png')
plt.clf()

# 5. Run Clustering
# Choose K from the elbow curve and silhouette score plots saved above.
# K=4 is used here as a reasonable trade-off; confirm it against 'clustering_evaluation.png'.
optimal_k = 4
print(f"\nChoosing optimal number of clusters as {optimal_k} based on the plots.")

kmeans = KMeans(n_clusters=optimal_k, init='k-means++', random_state=42, n_init=10)
rfm_df['Cluster'] = kmeans.fit_predict(rfm_scaled_df)

print(f"\nK-Means clustering with K={optimal_k} completed.")

# Display the size of each cluster
print("Size of each cluster:")
print(rfm_df['Cluster'].value_counts().sort_index())

# Display the mean RFM values for each cluster to create profiles
cluster_profiles = rfm_df.groupby('Cluster')[['Recency', 'Frequency', 'Monetary']].mean()
print("\nCluster Profiles (Mean RFM values):")
print(cluster_profiles)

# Visualize cluster profiles
plt.figure(figsize=(15, 6))
plt.subplot(1, 3, 1)
sns.barplot(x=cluster_profiles.index, y=cluster_profiles['Recency'])
plt.title('Mean Recency by Cluster')
plt.xlabel('Cluster')
plt.ylabel('Mean Recency (Days)')

plt.subplot(1, 3, 2)
sns.barplot(x=cluster_profiles.index, y=cluster_profiles['Frequency'])
plt.title('Mean Frequency by Cluster')
plt.xlabel('Cluster')
plt.ylabel('Mean Frequency (Purchases)')

plt.subplot(1, 3, 3)
sns.barplot(x=cluster_profiles.index, y=cluster_profiles['Monetary'])
plt.title('Mean Monetary by Cluster')
plt.xlabel('Cluster')
plt.ylabel('Mean Monetary ($)')

plt.tight_layout()
plt.savefig('final_cluster_profiles.png')
plt.clf()

Starting Step 4: Clustering Methodology...

RFM values have been calculated in the previous step. Displaying head:
            Recency  Frequency  Monetary
CustomerID                              
12346.0         326          1  77183.60
12347.0           2        182   4310.00
12348.0          75         31   1797.24
12349.0          19         73   1757.55
12350.0         310         17    334.40

Standardizing RFM values...
Standardized RFM values head:
            Recency_Scaled  Frequency_Scaled  Monetary_Scaled
CustomerID                                                   
12346.0           2.334574         -0.396578         8.358668
12347.0          -0.905340          0.394649         0.250966
12348.0          -0.175360         -0.265435        -0.028596
12349.0          -0.735345         -0.081836        -0.033012
12350.0           2.174578         -0.326635        -0.191347

Using Elbow Method and Silhouette Score to find optimal number of clusters (K)...

Choosing optimal numb
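
The standardized values above hint at heavy right skew: customer 12346.0 has a `Monetary_Scaled` of about 8.36. K-Means is sensitive to such outliers, so a common option is to log-transform the RFM features before scaling. The sketch below is not part of the original pipeline. It uses synthetic data (an assumption, not project data) to show the effect of `np.log1p` on skewness.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic, right-skewed stand-ins for Frequency and Monetary
toy_rfm = pd.DataFrame({
    "Recency": rng.integers(1, 366, size=500),
    "Frequency": rng.lognormal(mean=3.0, sigma=1.0, size=500),
    "Monetary": rng.lognormal(mean=6.0, sigma=1.5, size=500),
})

# log1p compresses the long right tail; it requires non-negative inputs
toy_log = np.log1p(toy_rfm)

print("Skew before:", toy_rfm.skew().round(2).to_dict())
print("Skew after: ", toy_log.skew().round(2).to_dict())

# Scale the transformed features as in the notebook
toy_scaled = StandardScaler().fit_transform(toy_log)
```

If this transform were adopted, the Streamlit app would need to apply the same `np.log1p` step to user input before calling `scaler.transform`.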


In [17]:
import joblib

# ... (rest of your previous code) ...

# 6. Visualize the clusters using a scatter plot or 3D plot of RFM scores.
# (Code to be added later)

# 7. Save the best performing model for streamlit usage
joblib.dump(scaler, 'scaler.pkl')
joblib.dump(kmeans, 'kmeans_model.pkl')
print("\nKMeans model and scaler saved successfully!")


KMeans model and scaler saved successfully!
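
Step 6 in the cell above is left as a placeholder. One possible sketch of the cluster scatter plot follows. It runs on a synthetic stand-in for `rfm_df` (hypothetical values) so that it is self-contained; the output file name `cluster_scatter_demo.png` is also just an illustration. In the notebook, the real `rfm_df` with its `Cluster` column could be plotted the same way.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for rfm_df (hypothetical values, not project data)
rng = np.random.default_rng(0)
demo_rfm = pd.DataFrame({
    "Recency": rng.integers(1, 366, size=300),
    "Frequency": rng.integers(1, 200, size=300),
    "Monetary": rng.lognormal(mean=6.0, sigma=1.0, size=300),
})

# Scale and cluster the same way as the notebook (K=4, fixed seed)
features = ["Recency", "Frequency", "Monetary"]
scaled = StandardScaler().fit_transform(demo_rfm[features])
demo_rfm["Cluster"] = KMeans(n_clusters=4, random_state=42, n_init=10).fit_predict(scaled)

# Recency vs. Monetary, colored by cluster; log scale tames the skewed spend axis
fig, ax = plt.subplots(figsize=(8, 6))
points = ax.scatter(demo_rfm["Recency"], demo_rfm["Monetary"],
                    c=demo_rfm["Cluster"], cmap="viridis", s=15)
ax.set_yscale("log")
ax.set_xlabel("Recency (Days)")
ax.set_ylabel("Monetary (log scale)")
ax.set_title("Customer Clusters: Recency vs. Monetary")
ax.legend(*points.legend_elements(), title="Cluster")
fig.tight_layout()
fig.savefig("cluster_scatter_demo.png")
plt.close(fig)
```

A two-dimensional view hides the Frequency axis, so pairing this plot with a Frequency vs. Monetary version gives a fuller picture.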


In [18]:
import streamlit as st
import pandas as pd
import joblib
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# =================================================================================
# 1. Load Pre-trained Models and Data
# =================================================================================

try:
    # Load the trained K-Means model and the scaler
    kmeans = joblib.load('kmeans_model.pkl')
    scaler = joblib.load('scaler.pkl')

    # Load the product similarity matrix from the previous step
    similarity_df = pd.read_csv('product_similarity_matrix.csv', index_col=0)

except FileNotFoundError:
    st.error("Error: Model or data files not found. Please ensure 'kmeans_model.pkl', 'scaler.pkl', and 'product_similarity_matrix.csv' are in the same directory.")
    st.stop()


# =================================================================================
# 2. Helper Functions
# =================================================================================

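# Create a mapping from cluster number to a descriptive label.
# Labels follow the printed cluster profiles (mean Recency / Frequency / Monetary);
# re-check the mapping whenever the model is retrained.
cluster_labels = {
    0: "Regular/Promising",   # ~41 days, ~105 purchases, ~2,092 spend
    1: "High-Value Shopper",  # ~2 days, ~5,807 purchases, ~70,925 spend
    2: "At-Risk/Occasional",  # ~247 days, ~28 purchases, ~637 spend
    3: "Big Spender"          # ~8 days, ~827 purchases, ~190,863 spend
}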

# Function to get product recommendations
def recommend_products(product_name, n_recommendations=5):
    """
    Finds and returns the top N similar products.
    """
    # Check if the product exists in our similarity matrix
    if product_name not in similarity_df.index:
        return "Product not found. Please try a different product name."

    # Get the similarity scores for the given product
    product_scores = similarity_df[product_name]

    # Drop the product itself, then sort and keep the top N most similar products
    similar_products = product_scores.drop(product_name).sort_values(ascending=False)
    recommended = similar_products.index[:n_recommendations].tolist()
    return recommended


# =================================================================================
# 3. Streamlit UI and App Logic
# =================================================================================

st.set_page_config(
    page_title="Shopper Spectrum Analytics",
    layout="wide",
    initial_sidebar_state="expanded"
)

# Sidebar navigation
st.sidebar.title("Navigation")
page = st.sidebar.radio("Go to", ["Product Recommender", "Customer Segmentation"])

if page == "Product Recommender":
    st.title("🛒 Product Recommender")
    st.markdown("Enter a product name to get 5 similar product recommendations.")
    st.image("https://raw.githubusercontent.com/streamlit/streamlit/develop/docs/images/streamlit-logo-primary-colormark-light.png", width=200) # Placeholder image
    
    # Get user input for product name
    product_name_input = st.text_input("Enter Product Name (e.g., JUMBO BAG RED RETROSPOT)", "")
    
    if st.button("Get Recommendations"):
        if product_name_input:
            recommendations = recommend_products(product_name_input)
            
            if isinstance(recommendations, str):
                st.warning(recommendations)
            else:
                st.subheader("Recommended Products:")
                for i, product in enumerate(recommendations, 1):
                    st.write(f"{i}. {product}")
        else:
            st.warning("Please enter a product name.")

elif page == "Customer Segmentation":
    st.title("👤 Customer Segmentation")
    st.markdown("Enter RFM values to predict a customer's segment label.")
    
    st.image("https://raw.githubusercontent.com/streamlit/streamlit/develop/docs/images/streamlit-logo-primary-colormark-light.png", width=200) # Placeholder image
    
    # Get user input for RFM values
    recency = st.number_input("Recency (days since last purchase)", min_value=0, max_value=365, value=100)
    frequency = st.number_input("Frequency (number of purchases)", min_value=1, max_value=5000, value=50)
    monetary = st.number_input("Monetary (total spend)", min_value=0.0, max_value=200000.0, value=500.0, format="%.2f")

    if st.button("Predict Segment"):
        # Create a DataFrame for the user's input
        input_data = pd.DataFrame([[recency, frequency, monetary]], columns=['Recency', 'Frequency', 'Monetary'])
        
        # Standardize the input data using the same scaler from training
        scaled_input = scaler.transform(input_data)
        
        # Predict the cluster
        predicted_cluster = kmeans.predict(scaled_input)[0]
        
        # Get the descriptive label from our mapping
        predicted_label = cluster_labels.get(predicted_cluster, "Unknown")
        
        st.subheader("Prediction Result:")
        st.success(f"This customer belongs to the **{predicted_label}** segment.")
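
The recommendation lookup can be exercised without launching Streamlit by passing the similarity matrix in as an argument. The helper below, `recommend_from_matrix`, is a hypothetical variant of `recommend_products`, not the app's own function. It assumes product descriptions are stored in upper case, as in the data preview, and it returns an empty list for unknown products instead of a message string.

```python
import pandas as pd

def recommend_from_matrix(similarity, product_name, n=5):
    """Return up to n products most similar to product_name, or [] if unknown."""
    key = product_name.strip().upper()  # assumes descriptions are stored upper-case
    if key not in similarity.index:
        return []
    # Drop the product itself so it is never recommended back
    scores = similarity.loc[key].drop(key)
    return scores.sort_values(ascending=False).head(n).index.tolist()

# Toy symmetric similarity matrix (hypothetical products and scores)
names = ["MUG", "PLATE", "LAMP"]
toy_sim = pd.DataFrame(
    [[1.0, 0.9, 0.1],
     [0.9, 1.0, 0.3],
     [0.1, 0.3, 1.0]],
    index=names, columns=names,
)

print(recommend_from_matrix(toy_sim, " mug ", n=2))  # ['PLATE', 'LAMP']
print(recommend_from_matrix(toy_sim, "TEAPOT"))      # []
```

Returning a list in every case keeps the caller free of `isinstance` checks: an empty list simply means there is nothing to show.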

