# Semantic Product Clustering Demo

This notebook demonstrates semantic clustering of product names to identify categories regardless of barcode differences.

## Problem Statement
Different people record the same TYPE of product with different names:
- **Tables**: "Table", "Desk", "Mesa", "Masa", "Writing Desk"
- **Chairs**: "Chair", "Sandalye", "Silla", "Gaming Chair"
- **Computers**: "Computer", "PC", "Bilgisayar", "Laptop"

**Goal**: Group semantically similar names to count product categories:
- "How many table-type items do we have?"
- "How many chair-type items?"
- "What's our inventory breakdown by category?"

## Pipeline Overview
1. **Ingest** CSV data with product names and barcodes
2. **Normalize** multilingual text
3. **Generate** embeddings (sentence transformers, with a TF-IDF fallback)
4. **Cluster** semantically similar names
5. **Analyze** clusters to identify product categories
6. **Summarize** inventory by semantic categories


In [1]:
# Import necessary libraries
import sys
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Add src directory to path
sys.path.append('../src')

# Import our modules
from ingest import CSVIngester
from normalize import TextNormalizer, MultilingualNormalizer
from embedding import NameEmbedder, SimilarityAnalyzer
from cluster import NameClusterer, ClusterOptimizer
from semantic import SemanticAnalyzer

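# SSL workaround for corporate networks with self-signed certificates:
# silence urllib3's insecure-request warnings and clear the CA bundle overrides.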
import ssl
import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
os.environ['CURL_CA_BUNDLE'] = ''
os.environ['REQUESTS_CA_BUNDLE'] = ''

# Configure plotting
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)

print("✅ All modules imported successfully!")
print(f"📁 Working directory: {os.getcwd()}")
print(f"📁 Data directory: {Path('../data').resolve()}")




✅ All modules imported successfully!
📁 Working directory: c:\Users\TCEERBIL\Desktop\ege-workspace\notebooks
📁 Data directory: C:\Users\TCEERBIL\Desktop\ege-workspace\data


## Step 1: Load Semantic Test Data

Our test data contains semantically similar products with different names and different barcodes.


In [2]:
# Load semantic test data
data_path = "../data/semantic_products.csv"

# Initialize ingester
ingester = CSVIngester()
raw_data = ingester.load_csv(data_path)

print("📊 Semantic Test Dataset Overview:")
print(f"Shape: {raw_data.shape}")
print(f"Columns: {list(raw_data.columns)}")

# Show sample data
print("\n📋 Sample data:")
print(raw_data.head(10))

# Auto-detect columns
name_col, barcode_col = ingester.detect_columns()
print(f"\n🔍 Detected columns:")
print(f"Name column: '{name_col}'")
print(f"Barcode column: '{barcode_col}'")

# Get clean data
clean_data = ingester.get_clean_data()
print(f"\n🧹 Clean dataset: {len(clean_data)} rows")
print(f"📦 Unique barcodes: {clean_data['barcode'].nunique()}")
print(f"🏷️ Unique names: {clean_data['name'].nunique()}")


📊 Semantic Test Dataset Overview:
Shape: (56, 4)
Columns: ['name', 'barcode', 'supplier', 'price']

📋 Sample data:
               name barcode    supplier  price
0      Office Table  TBL001  Supplier A  150.0
1         Work Desk  TBL002  Supplier B  180.0
2    Çalışma Masası  TBL003  Supplier C  165.0
3   Mesa de Oficina  TBL004  Supplier D  170.0
4      Writing Desk  TBL005  Supplier A  190.0
5       Study Table  TBL006  Supplier E  145.0
6      Gaming Chair  CHR001  Supplier A  250.0
7      Office Chair  CHR002  Supplier B  220.0
8          Sandalye  CHR003  Supplier C  200.0
9  Silla de Oficina  CHR004  Supplier D  240.0

🔍 Detected columns:
Name column: 'name'
Barcode column: 'barcode'

🧹 Clean dataset: 56 rows
📦 Unique barcodes: 56
🏷️ Unique names: 56


## Step 2: Text Normalization

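Normalize the multilingual product names (lowercasing and folding diacritics) so that spelling variants map to the same tokens before embedding. Note that in the output below the Turkish dotless `ı` is left as-is and `PC` is expanded to `piece`, so the normalizer's rules are worth reviewing.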
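
As a rough illustration of what this step involves, the sketch below folds case and diacritics using only the standard library. It is not the `MultilingualNormalizer` implementation; `fold_name` and the `_EXTRA_FOLDS` table are hypothetical names introduced here. The explicit table is there because the Turkish dotless `ı` has no Unicode decomposition, so NFKD normalization alone leaves it intact.

```python
import unicodedata

# Characters with no Unicode decomposition that still need an ASCII fold.
# Hypothetical table; extend it for the languages in your data.
_EXTRA_FOLDS = str.maketrans({"ı": "i"})


def fold_name(text: str) -> str:
    """Lowercase, strip diacritics, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    # Drop the combining marks (cedilla, tilde, ...) that NFKD split off.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.translate(_EXTRA_FOLDS).split())


print(fold_name("Çalışma Masası"))       # calisma masasi
print(fold_name("  Mesa  de Oficina "))  # mesa de oficina
```

Whether folding `ı` to `i` is desirable depends on the data: it helps match ASCII-typed variants, but it also erases a distinction that is meaningful in Turkish.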


In [3]:
# Initialize multilingual normalizer
ml_normalizer = MultilingualNormalizer()

# Show normalization examples
sample_names = [
    "Çalışma Masası", "Mesa de Oficina", "Writing Desk",
    "Gaming Chair", "Sandalye", "Silla de Oficina",
    "Bilgisayar", "Ordenador", "Desktop PC"
]

print("🔤 Text Normalization Examples:")
print("Original → Normalized")
print("-" * 50)

for name in sample_names:
    normalized = ml_normalizer.normalize_multilingual(name)
    print(f"{name:<20} → {normalized}")

# Apply normalization to entire dataset
print(f"\n🔄 Normalizing {len(clean_data)} product names...")
clean_data['normalized_name'] = [ml_normalizer.normalize_multilingual(name) for name in clean_data['name']]

print("✅ Normalization complete!")
print("\n📋 Normalized data preview:")
print(clean_data[['name', 'normalized_name', 'barcode']].head(10))


🔤 Text Normalization Examples:
Original → Normalized
--------------------------------------------------
Çalışma Masası       → calısma masası
Mesa de Oficina      → mesa de oficina
Writing Desk         → writing desk
Gaming Chair         → gaming chair
Sandalye             → sandalye
Silla de Oficina     → silla de oficina
Bilgisayar           → bilgisayar
Ordenador            → ordenador
Desktop PC           → desktop piece

🔄 Normalizing 56 product names...
✅ Normalization complete!

📋 Normalized data preview:
               name   normalized_name barcode
0      Office Table      office table  TBL001
1         Work Desk         work desk  TBL002
2    Çalışma Masası    calısma masası  TBL003
3   Mesa de Oficina   mesa de oficina  TBL004
4      Writing Desk      writing desk  TBL005
5       Study Table       study table  TBL006
6      Gaming Chair      gaming chair  CHR001
7      Office Chair      office chair  CHR002
8          Sandalye      

## Step 3: Generate Embeddings

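Create vector embeddings of the normalized names for similarity calculation. In this run the sentence-transformer models could not be downloaded (SSL certificate verification failed), so the embedder fell back to TF-IDF over word unigrams and bigrams. The resulting mean pairwise similarity of 0.009 indicates that most name pairs share no tokens at all.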
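
To see why word-level vectors struggle with multilingual names, the sketch below compares names by cosine similarity over raw word unigram and bigram counts. It is a simplification of TF-IDF (there is no inverse-document-frequency weighting) and it is not the `NameEmbedder` code; `word_ngrams` and `cosine` are hypothetical helpers.

```python
import math
from collections import Counter


def word_ngrams(text, n_max=2):
    """Return word n-grams (n = 1..n_max) of a whitespace-tokenized name."""
    tokens = text.split()
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def cosine(a, b):
    """Cosine similarity between two names over raw n-gram counts (no IDF weighting)."""
    ca, cb = Counter(word_ngrams(a)), Counter(word_ngrams(b))
    dot = sum(ca[g] * cb[g] for g in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


print(round(cosine("office table", "study table"), 3))  # 0.333
print(cosine("office table", "mesa de oficina"))         # 0.0
```

Two English names that share a word get a non-zero score, while an English name and its Spanish counterpart score exactly zero because they have no n-gram in common. Token-based vectors can only group names that share surface tokens.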


In [4]:
# Initialize embedder (will use TF-IDF fallback if transformers fail)
embedder = NameEmbedder('paraphrase-multilingual-MiniLM-L12-v2')

print("🤖 Generating embeddings (using TF-IDF fallback if needed)...")

# Generate embeddings
embeddings = embedder.generate_embeddings(clean_data['normalized_name'].tolist())

print(f"✅ Generated embeddings with shape: {embeddings.shape}")
print(f"📏 Embedding dimension: {embeddings.shape[1]}")
print(f"🎯 Method used: {'TF-IDF Fallback' if embedder._use_tfidf_fallback else 'Sentence Transformer'}")

if embedder._use_tfidf_fallback:
    print("ℹ️  Note: Using TF-IDF - this works great for semantic clustering!")

    # Show TF-IDF vocabulary sample
    if hasattr(embedder, '_tfidf_vectorizer') and embedder._tfidf_vectorizer:
        vocab_sample = list(embedder._tfidf_vectorizer.vocabulary_.keys())[:15]
        print(f"📚 Sample vocabulary: {vocab_sample}")

# Analyze similarity distribution
analyzer = SimilarityAnalyzer(embedder)
similarity_stats = analyzer.analyze_similarity_distribution()

print("\n📊 Similarity Analysis:")
print(f"Mean similarity: {similarity_stats['mean_similarity']:.3f}")
print(f"Std deviation: {similarity_stats['std_similarity']:.3f}")
print(f"25th percentile: {similarity_stats['q25']:.3f}")
print(f"75th percentile: {similarity_stats['q75']:.3f}")

# Get suggested clustering threshold
suggested_threshold = analyzer.suggest_clustering_threshold()
print(f"\n💡 Suggested clustering threshold: {suggested_threshold:.3f}")


🤖 Generating embeddings (using TF-IDF fallback if needed)...


No sentence-transformers model found with name sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2. Creating a new one with mean pooling.
Failed to load model paraphrase-multilingual-MiniLM-L12-v2: (MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2/resolve/main/adapter_config.json (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1007)')))"), '(Request ID: 2a688add-1201-4806-9844-f0e21c211d29)')
No sentence-transformers model found with name sentence-transformers/all-MiniLM-L6-v2. Creating a new one with mean pooling.
All models failed to load: (MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /sentence-transformers/all-MiniLM-L6-v2/resolve/main/adapter_config.json (Caused by SSLError(SSLCertVerificationError

✅ Generated embeddings with shape: (56, 97)
📏 Embedding dimension: 97
🎯 Method used: TF-IDF Fallback
ℹ️  Note: Using TF-IDF - this works great for semantic clustering!
📚 Sample vocabulary: ['office', 'table', 'office table', 'work', 'desk', 'work desk', 'mesa', 'de', 'oficina', 'mesa de', 'de oficina', 'writing', 'writing desk', 'study', 'study table']

📊 Similarity Analysis:
Mean similarity: 0.009
Std deviation: 0.055
25th percentile: 0.000
75th percentile: 0.000

💡 Suggested clustering threshold: 0.500


## Step 4: Semantic Clustering

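Cluster the product names using the suggested similarity threshold. With the TF-IDF fallback and a threshold of 0.5, the 56 names fall into 55 clusters, 54 of them singletons, so this run merges almost nothing.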
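
The idea of threshold-based grouping can be sketched with a small union-find over pairwise similarities. This is single-linkage clustering chosen for brevity, not necessarily what `NameClusterer('agglomerative')` does internally; `threshold_clusters` and `token_overlap` are hypothetical helpers, and the Jaccard token overlap stands in for embedding cosine similarity.

```python
from itertools import combinations


def threshold_clusters(names, similarity, threshold):
    """Group names whose pairwise similarity reaches the threshold (single linkage)."""
    parent = list(range(len(names)))

    def find(i):
        # Follow parent links to the root, halving the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(names)), 2):
        if similarity(names[i], names[j]) >= threshold:
            parent[find(j)] = find(i)

    # Relabel roots as consecutive cluster ids in order of first appearance.
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(len(names))]


def token_overlap(a, b):
    """Jaccard similarity of word sets; a stand-in for embedding similarity."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


names = ["kitap", "kitap rafı", "office table", "mesa de oficina"]
print(threshold_clusters(names, token_overlap, threshold=0.5))  # [0, 0, 1, 2]
```

Only the two names that share a token end up together; every other name becomes a singleton. The pattern is the same when token-based similarities are clustered at a fixed threshold.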


In [5]:
# Perform semantic clustering
print("🎯 Performing semantic clustering...")

# Use appropriate threshold for clustering
clustering_threshold = suggested_threshold
print(f"🔧 Using clustering threshold: {clustering_threshold:.3f}")

clusterer = NameClusterer('agglomerative')
cluster_labels = clusterer.fit(
    embeddings, 
    clean_data['normalized_name'].tolist(),
    similarity_threshold=clustering_threshold
)

# Add cluster labels to dataframe
clean_data['cluster_id'] = cluster_labels

# Get clustering results
clusters_df = clusterer.get_clusters_dataframe()
eval_results = clusterer.evaluate_clustering()

print(f"✅ Clustering complete!")
print(f"📊 Found {eval_results['n_clusters']} semantic clusters")
print(f"📊 Silhouette Score: {eval_results.get('silhouette_score', 'N/A')}")
print(f"🔢 Largest cluster: {eval_results['largest_cluster_size']} items")
print(f"🔢 Average cluster size: {eval_results['average_cluster_size']:.1f}")
print(f"🔢 Singleton clusters: {eval_results['singleton_clusters']}")

# Show clustering preview
print(f"\n📋 Clustering preview (by cluster size):")
cluster_preview = clusters_df.head(20)
print(cluster_preview[['text', 'cluster_id', 'cluster_size']].to_string(index=False))


🎯 Performing semantic clustering...
🔧 Using clustering threshold: 0.500
✅ Clustering complete!
📊 Found 55 semantic clusters
📊 Silhouette Score: 0.03571428571428571
🔢 Largest cluster: 2 items
🔢 Average cluster size: 1.0
🔢 Singleton clusters: 54

📋 Clustering preview (by cluster size):
          text  cluster_id  cluster_size
         kitap           0             2
    kitap rafı           0             2
 optical mouse           1             1
          fare           2             1
         raton           3             1
computer mouse           4             1
office cabinet           5             1
     book rack           6             1
     bookshelf           7             1
wireless mouse           8             1
    estanteria           9             1
      cuaderno          10             1
       journal          11             1
  file cabinet          12             1
  drinking mug          13             1
   writing pad          14      

## Step 5: Semantic Analysis

Analyze clusters to identify product categories and generate insights.
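
One simple way to attach a category to a name or cluster is a keyword lexicon. The sketch below assumes that approach purely for illustration; how `SemanticAnalyzer` actually assigns categories is not shown in this notebook. `CATEGORY_KEYWORDS` and `categorize` are hypothetical, with keywords taken from the name variants listed in the problem statement plus the inflected Turkish form `masası` from the sample data.

```python
# Hypothetical keyword lexicon; not the SemanticAnalyzer's internal data.
CATEGORY_KEYWORDS = {
    "Tables": {"table", "desk", "mesa", "masa", "masası"},
    "Chairs": {"chair", "sandalye", "silla"},
    "Computers": {"computer", "pc", "bilgisayar", "laptop"},
}


def categorize(name: str, default: str = "Other") -> str:
    """Return the first category whose keywords share a whole token with the name."""
    tokens = set(name.lower().split())
    for category, keywords in CATEGORY_KEYWORDS.items():
        if tokens & keywords:
            return category
    return default


for name in ["Çalışma Masası", "Gaming Chair", "Desktop PC", "Kitap"]:
    print(f"{name} -> {categorize(name)}")
```

Matching whole tokens avoids false hits such as `desk` inside `desktop`, at the cost of listing inflected forms explicitly. A name like "Computer Desk" matches two categories, and this sketch returns whichever comes first in the dictionary.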


In [6]:
# Initialize semantic analyzer
semantic_analyzer = SemanticAnalyzer()

print("🔍 Analyzing semantic clusters...")

# Analyze clusters to identify categories
cluster_analysis = semantic_analyzer.analyze_clusters(
    clean_data,
    name_column='name',
    barcode_column='barcode',
    cluster_column='cluster_id'
)

print(f"✅ Semantic analysis complete!")
print(f"📊 Identified {cluster_analysis['category'].nunique()} product categories")

# Show cluster analysis
print(f"\n📋 Cluster Analysis Results:")
display_cols = ['cluster_id', 'category', 'representative_name', 'unique_names', 'unique_barcodes', 'total_items']
print(cluster_analysis[display_cols].to_string(index=False))

# Show detailed cluster information
print(f"\n🔍 Detailed Cluster Information:")
for _, row in cluster_analysis.head(10).iterrows():
    print(f"\n🏷️ Cluster {row['cluster_id']} - {row['category']}:")
    print(f"   Representative: '{row['representative_name']}'")
    print(f"   {row['unique_barcodes']} unique products, {row['unique_names']} name variations")
    print(f"   All names: {row['all_names'][:100]}{'...' if len(row['all_names']) > 100 else ''}")


🔍 Analyzing semantic clusters...
✅ Semantic analysis complete!
📊 Identified 10 product categories

📋 Cluster Analysis Results:
 cluster_id  category representative_name  unique_names  unique_barcodes  total_items
          0     Books               Kitap             2                2            2
         33    Tables           Work Desk             1                1            1
         52    Tables        Office Table             1                1            1
         54    Tables     Mesa de Oficina             1                1            1
         47    Tables        Writing Desk             1                1            1
         48    Tables         Study Table             1                1            1
         50    Tables      Çalışma Masası             1                1            1
         51    Chairs        Office Chair             1                1            1
         25    Chairs            Sandalye             1                1            

## Step 6: Category Summary

Generate inventory summary by product category.
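
The aggregation behind such a summary can be sketched as counting unique barcodes per category and expressing each count as a share of the total. This is an illustrative stand-in, not the `get_category_summary` implementation; `summarize` and its `(category, barcode)` input format are assumptions, and the rows are made-up sample values.

```python
from collections import defaultdict


def summarize(rows):
    """Map (category, barcode) pairs to (category, unique_barcodes, share_pct) tuples."""
    barcodes = defaultdict(set)
    for category, barcode in rows:
        barcodes[category].add(barcode)
    # Assumes each barcode belongs to exactly one category.
    total = sum(len(codes) for codes in barcodes.values())
    summary = [(cat, len(codes), 100 * len(codes) / total) for cat, codes in barcodes.items()]
    return sorted(summary, key=lambda item: item[1], reverse=True)


rows = [
    ("Tables", "TBL001"),
    ("Tables", "TBL002"),
    ("Chairs", "CHR001"),
    ("Tables", "TBL001"),  # repeated barcode: counted once
]
for category, n_unique, share in summarize(rows):
    print(f"{category}: {n_unique} unique products ({share:.1f}% of inventory)")
```

Using sets makes repeated rows for the same barcode count once, which is the difference between "unique products" and "total items" in the report.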


In [7]:
# Generate category summary
category_summary = semantic_analyzer.get_category_summary(cluster_analysis)

print("📊 INVENTORY SUMMARY BY CATEGORY:")
print("=" * 70)

total_barcodes = category_summary['total_barcodes'].sum()
total_items = category_summary['total_items'].sum()

for _, row in category_summary.iterrows():
    category = row['category']
    barcodes = row['total_barcodes']
    items = row['total_items']
    clusters = row['num_clusters']
    examples = row['example_names']
    
    percentage = (barcodes / total_barcodes) * 100
    print(f"\n📂 {category.upper()}:")
    print(f"   • {barcodes} unique products ({percentage:.1f}% of inventory)")
    print(f"   • {items} total items across {clusters} name cluster(s)")
    print(f"   • Examples: {examples}")

print(f"\n📈 OVERALL STATISTICS:")
print(f"   • Total unique products: {total_barcodes}")
print(f"   • Total items: {total_items}")
print(f"   • Product categories: {len(category_summary)}")
print(f"   • Average items per product: {total_items/total_barcodes:.2f}")

print("=" * 70)


📊 INVENTORY SUMMARY BY CATEGORY:

📂 TABLES:
   • 9 unique products (16.1% of inventory)
   • 9 total items across 9 name cluster(s)
   • Examples: Work Desk, Office Table, Mesa de Oficina

📂 CHAIRS:
   • 7 unique products (12.5% of inventory)
   • 7 total items across 7 name cluster(s)
   • Examples: Office Chair, Sandalye, Silla de Oficina

📂 CUPS:
   • 6 unique products (10.7% of inventory)
   • 6 total items across 6 name cluster(s)
   • Examples: Coffee Cup, Mug, Taza

📂 BOOKS:
   • 6 unique products (10.7% of inventory)
   • 6 total items across 5 name cluster(s)
   • Examples: Kitap, Notebook, Journal

📂 COMPUTERS:
   • 6 unique products (10.7% of inventory)
   • 6 total items across 6 name cluster(s)
   • Examples: Laptop Computer, Desktop PC, Bilgisayar

📂 CABINETS:
   • 5 unique products (8.9% of inventory)
   • 5 total items across 5 name cluster(s)
   • Examples: Storage Cabinet, File Cabinet, Dolap

📂 MICE:
   • 