# Bangkok Redundancy & Duplication Detection System
## Graph-based Clustering for Complaint Analysis

**Objective:** Detect duplicate and related complaints using:
- Text similarity (Sentence-BERT embeddings)
- Geospatial proximity (lat/lon distance)
- Temporal proximity (timestamp)
- Graph clustering (Louvain algorithm)

**Impact:**
- Reduce duplicate work for government staff
- Identify chronic/recurring problems
- Enable strategic policy planning

## Step 0: Setup and Install Dependencies

In [58]:
# Install required packages (run once)
#%pip install sentence-transformers faiss-cpu networkx python-louvain scikit-learn folium

In [59]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

# ML/AI libraries
from sentence_transformers import SentenceTransformer
import faiss
import networkx as nx
import community.community_louvain as community_louvain
from sklearn.metrics.pairwise import cosine_similarity

# Geospatial
import folium
from folium import plugins

print("All libraries imported successfully!")

All libraries imported successfully!


In [60]:
# Load cleaned data
df = pd.read_csv("../data/interim/bangkok_traffy_cleaned.csv")

# Convert datetime columns
df['timestamp'] = pd.to_datetime(df['timestamp'], format='mixed')
df['last_activity'] = pd.to_datetime(df['last_activity'], format='mixed')

# Filter out rows with missing critical data
df_clean = df.dropna(subset=['comment', 'latitude', 'longitude', 'timestamp']).copy()
df_clean = df_clean[df_clean['comment'].str.len() > 10]  # Remove very short comments

print(f"Original dataset: {len(df):,} rows")
print(f"After filtering: {len(df_clean):,} rows")
print(f"Columns: {df_clean.columns.tolist()}")
df_clean.head()

Original dataset: 778,254 rows
After filtering: 750,674 rows
Columns: ['ticket_id', 'type', 'organization', 'comment', 'photo', 'photo_after', 'coords', 'address', 'subdistrict', 'district', 'province', 'timestamp', 'state', 'count_reopen', 'last_activity', 'longitude', 'latitude', 'year', 'month', 'day', 'hour', 'day_of_week', 'day_name', 'resolution_time_hours', 'has_photo_after', 'is_reopened']


Unnamed: 0,ticket_id,type,organization,comment,photo,photo_after,coords,address,subdistrict,district,...,latitude,year,month,day,hour,day_of_week,day_name,resolution_time_hours,has_photo_after,is_reopened
1,2021-CGPMUN,"{‡∏ô‡πâ‡∏≥‡∏ó‡πà‡∏ß‡∏°,‡∏£‡πâ‡∏≠‡∏á‡πÄ‡∏£‡∏µ‡∏¢‡∏ô}","‡πÄ‡∏Ç‡∏ï‡∏õ‡∏£‡∏∞‡πÄ‡∏ß‡∏®,‡∏ù‡πà‡∏≤‡∏¢‡πÇ‡∏¢‡∏ò‡∏≤ ‡πÄ‡∏Ç‡∏ï‡∏õ‡∏£‡∏∞‡πÄ‡∏ß‡∏®",‡∏ô‡πâ‡∏≥‡∏ó‡πà‡∏ß‡∏°‡πÄ‡∏ß‡∏•‡∏≤‡∏ù‡∏ô‡∏ï‡∏Å‡πÅ‡∏•‡∏∞‡∏ó‡∏∞‡∏•‡∏∏‡πÄ‡∏Ç‡πâ‡∏≤‡∏ö‡πâ‡∏≤‡∏ô‡πÄ‡∏î‡∏∑‡∏≠‡∏î‡∏£‡πâ‡∏≠‡∏ô‡∏°‡∏≤‡∏Å‡∏ó‡∏∏‡∏Å‡πÜ...,https://storage.googleapis.com/traffy_public_b...,https://storage.googleapis.com/traffy_public_b...,"100.66709,13.67891",189 ‡πÄ‡∏â‡∏•‡∏¥‡∏°‡∏û‡∏£‡∏∞‡πÄ‡∏Å‡∏µ‡∏¢‡∏£‡∏ï‡∏¥ ‡∏£.9 ‡πÅ‡∏Ç‡∏ß‡∏á ‡∏´‡∏ô‡∏≠‡∏á‡∏ö‡∏≠‡∏ô ‡πÄ‡∏Ç‡∏ï ‡∏õ‡∏£‡∏∞‡πÄ‡∏ß...,‡∏´‡∏ô‡∏≠‡∏á‡∏ö‡∏≠‡∏ô,‡∏õ‡∏£‡∏∞‡πÄ‡∏ß‡∏®,...,13.67891,2021,9,19,14,6,Sunday,6593.416835,1,0
2,2021-7XATFA,{‡∏™‡∏∞‡∏û‡∏≤‡∏ô},‡πÄ‡∏Ç‡∏ï‡∏™‡∏≤‡∏ó‡∏£,‡∏™‡∏∞‡∏û‡∏≤‡∏ô‡∏•‡∏≠‡∏¢‡∏õ‡∏£‡∏±‡∏ö‡∏õ‡∏£‡∏∏‡∏á‡πÑ‡∏°‡πà‡πÄ‡∏™‡∏£‡πá‡∏à‡∏ï‡∏≤‡∏°‡∏Å‡∏≥‡∏´‡∏ô‡∏î\n‡∏õ‡∏≤‡∏Å‡∏ã‡∏≠‡∏¢ ‡∏™‡∏≤‡∏ó‡∏£12,https://storage.googleapis.com/traffy_public_b...,,"100.52649,13.72060",191/1 ‡∏ñ‡∏ô‡∏ô ‡∏™‡∏≤‡∏ó‡∏£‡πÄ‡∏´‡∏ô‡∏∑‡∏≠ ‡πÅ‡∏Ç‡∏ß‡∏á ‡∏™‡∏µ‡∏•‡∏° ‡πÄ‡∏Ç‡∏ï‡∏ö‡∏≤‡∏á‡∏£‡∏±‡∏Å ‡∏Å‡∏£‡∏∏‡∏á‡πÄ‡∏ó...,‡∏¢‡∏≤‡∏ô‡∏ô‡∏≤‡∏ß‡∏≤,‡∏™‡∏≤‡∏ó‡∏£,...,13.7206,2021,9,26,5,6,Sunday,6068.222133,0,0
4,2021-DVEWYM,"{‡∏ô‡πâ‡∏≥‡∏ó‡πà‡∏ß‡∏°,‡∏ñ‡∏ô‡∏ô}","‡πÄ‡∏Ç‡∏ï‡∏•‡∏≤‡∏î‡∏û‡∏£‡πâ‡∏≤‡∏ß,‡∏ù‡πà‡∏≤‡∏¢‡πÇ‡∏¢‡∏ò‡∏≤ ‡πÄ‡∏Ç‡∏ï‡∏•‡∏≤‡∏î‡∏û‡∏£‡πâ‡∏≤‡∏ß",‡∏ã‡∏≠‡∏¢‡∏•‡∏≤‡∏î‡∏û‡∏£‡πâ‡∏≤‡∏ß‡∏ß‡∏±‡∏á‡∏´‡∏¥‡∏ô 75 ‡∏ñ‡∏ô‡∏ô‡∏•‡∏≤‡∏î‡∏û‡∏£‡πâ‡∏≤‡∏ß‡∏ß‡∏±‡∏á‡∏´‡∏¥‡∏ô ‡πÅ‡∏Ç‡∏ß‡∏á‡∏•‡∏≤‡∏î...,https://storage.googleapis.com/traffy_public_b...,,"100.59165,13.82280",702 ‡∏ñ. ‡∏•‡∏≤‡∏î‡∏û‡∏£‡πâ‡∏≤‡∏ß‡∏ß‡∏±‡∏á‡∏´‡∏¥‡∏ô ‡πÅ‡∏Ç‡∏ß‡∏á‡∏•‡∏≤‡∏î‡∏û‡∏£‡πâ‡∏≤‡∏ß ‡πÄ‡∏Ç‡∏ï‡∏•‡∏≤‡∏î‡∏û‡∏£‡πâ‡∏≤‡∏ß...,‡∏•‡∏≤‡∏î‡∏û‡∏£‡πâ‡∏≤‡∏ß,‡∏•‡∏≤‡∏î‡∏û‡∏£‡πâ‡∏≤‡∏ß,...,13.8228,2021,12,9,12,3,Thursday,5898.826799,0,0
5,2021-4D9Y98,{},"‡πÄ‡∏Ç‡∏ï‡∏•‡∏≤‡∏î‡∏û‡∏£‡πâ‡∏≤‡∏ß,‡∏Å‡∏≤‡∏£‡πÑ‡∏ü‡∏ü‡πâ‡∏≤‡∏ô‡∏Ñ‡∏£‡∏´‡∏•‡∏ß‡∏á ‡πÄ‡∏Ç‡∏ï‡∏ô‡∏ß‡∏•‡∏à‡∏±‡∏ô‡∏ó‡∏£‡πå",‡∏´‡∏ô‡πâ‡∏≤‡∏õ‡∏≤‡∏Å‡∏ã‡∏≠‡∏¢ ‡∏•‡∏≤‡∏î‡∏û‡∏£‡πâ‡∏≤‡∏ß‡∏ß‡∏±‡∏á‡∏´‡∏¥‡∏ô26,https://storage.googleapis.com/traffy_public_b...,https://storage.googleapis.com/traffy_public_b...,"100.59131,13.80910",17/73 17/73 ‡∏ñ. ‡∏•‡∏≤‡∏î‡∏û‡∏£‡πâ‡∏≤‡∏ß‡∏ß‡∏±‡∏á‡∏´‡∏¥‡∏ô ‡πÅ‡∏Ç‡∏ß‡∏á‡∏•‡∏≤‡∏î‡∏û‡∏£‡πâ‡∏≤‡∏ß ‡πÄ‡∏Ç‡∏ï...,‡∏•‡∏≤‡∏î‡∏û‡∏£‡πâ‡∏≤‡∏ß,‡∏•‡∏≤‡∏î‡∏û‡∏£‡πâ‡∏≤‡∏ß,...,13.8091,2021,12,13,5,0,Monday,10950.26058,1,0
6,2021-7U9RED,{},‡πÄ‡∏Ç‡∏ï‡∏î‡∏∏‡∏™‡∏¥‡∏ï,‡∏¢‡∏±‡∏á‡πÑ‡∏°‡πà‡∏°‡∏µ‡∏´‡∏ô‡πà‡∏ß‡∏¢‡∏á‡∏≤‡∏ô‡πÑ‡∏´‡∏ô‡∏°‡∏≤‡∏î‡∏π‡πÅ‡∏•‡∏Ñ‡∏£‡∏±‡∏ö ‡∏£‡∏ñ‡∏à‡∏∞‡πÄ‡∏ä‡∏µ‡πà‡∏¢‡∏ß‡∏´‡∏•‡∏≤‡∏¢‡∏Ñ‡∏ô...,https://storage.googleapis.com/traffy_public_b...,https://storage.googleapis.com/traffy_public_b...,"100.50848,13.77832",627 ‡∏ñ‡∏ô‡∏ô‡∏™‡∏≤‡∏°‡πÄ‡∏™‡∏ô ‡πÅ‡∏Ç‡∏ß‡∏á ‡∏î‡∏∏‡∏™‡∏¥‡∏ï ‡πÄ‡∏Ç‡∏ï‡∏î‡∏∏‡∏™‡∏¥‡∏ï ‡∏Å‡∏£‡∏∏‡∏á‡πÄ‡∏ó‡∏û‡∏°‡∏´‡∏≤‡∏ô‡∏Ñ...,‡∏î‡∏∏‡∏™‡∏¥‡∏ï,‡∏î‡∏∏‡∏™‡∏¥‡∏ï,...,13.77832,2021,12,17,8,4,Friday,12381.424959,1,0


## Step 1: Text Embedding Generation
Convert complaint text to dense vectors using Sentence-BERT

In [61]:
# Use 100,000 samples for comprehensive analysis (balanced approach)
SAMPLE_SIZE = 100000
df_sample = df_clean.sample(min(SAMPLE_SIZE, len(df_clean)), random_state=42).copy()
df_sample = df_sample.reset_index(drop=True)

print(f"Working with {len(df_sample):,} complaints for analysis")
print(f"Date range: {df_sample['timestamp'].min()} to {df_sample['timestamp'].max()}")
print(f"Unique districts: {df_sample['district'].nunique()}")
print(f"\n‚è±Ô∏è Estimated processing time:")
print(f"   - Embeddings: ~10-15 minutes")
print(f"   - Graph construction: ~5-8 minutes")
print(f"   - Clustering: ~1-2 minutes")
print(f"   Total: ~20-30 minutes")

Working with 100,000 complaints for analysis
Date range: 2021-12-22 10:15:33.294829+00:00 to 2025-01-16 02:52:33.878797+00:00
Unique districts: 73

‚è±Ô∏è Estimated processing time:
   - Embeddings: ~10-15 minutes
   - Graph construction: ~5-8 minutes
   - Clustering: ~1-2 minutes
   Total: ~20-30 minutes


In [62]:
# Load Sentence-BERT model (multilingual, works well with Thai text)
print("Loading Sentence-BERT model...")
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
print(f"Model loaded: {model.get_sentence_embedding_dimension()}-D embeddings")

# Generate embeddings for all complaints
print(f"\nGenerating embeddings for {len(df_sample):,} complaints...")
embeddings = model.encode(df_sample['comment'].tolist(), 
                          show_progress_bar=True,
                          batch_size=32)

print(f"Embeddings shape: {embeddings.shape}")
print(f"Sample embedding (first 10 dims): {embeddings[0][:10]}")

Loading Sentence-BERT model...
Model loaded: 384-D embeddings

Generating embeddings for 100,000 complaints...
Model loaded: 384-D embeddings

Generating embeddings for 100,000 complaints...


Batches:   0%|          | 0/3125 [00:00<?, ?it/s]

Embeddings shape: (100000, 384)
Sample embedding (first 10 dims): [ 0.5404708  -0.06882917  0.12064148  0.2447273   0.03686376 -0.15918002
  0.07919765 -0.06838866 -0.14063287  0.14891599]


## Step 2: Build Similarity Matrix with FAISS
Use FAISS for efficient similarity search

In [63]:
# Normalize embeddings for cosine similarity
embeddings_normalized = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

# Build FAISS index
dimension = embeddings.shape[1]
index = faiss.IndexFlatIP(dimension)  # Inner Product = Cosine similarity for normalized vectors
index.add(embeddings_normalized.astype('float32'))

print(f"FAISS index built with {index.ntotal:,} vectors")
print(f"Dimension: {dimension}")

FAISS index built with 100,000 vectors
Dimension: 384


In [64]:
# Helper function: Calculate geographic distance (Haversine formula)
def haversine_distance(lat1, lon1, lat2, lon2):
    """Calculate distance in meters between two points"""
    R = 6371000  # Earth radius in meters
    phi1, phi2 = np.radians(lat1), np.radians(lat2)
    delta_phi = np.radians(lat2 - lat1)
    delta_lambda = np.radians(lon2 - lon1)
    
    a = np.sin(delta_phi/2)**2 + np.cos(phi1) * np.cos(phi2) * np.sin(delta_lambda/2)**2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1-a))
    
    return R * c

# Test the function
lat1, lon1 = 13.7563, 100.5018  # Bangkok center
lat2, lon2 = 13.7563, 100.5028
dist = haversine_distance(lat1, lon1, lat2, lon2)
print(f"Test distance: {dist:.2f} meters")

Test distance: 108.01 meters


## Step 3: Graph Construction
Build graph with edges based on similarity, proximity, and temporal closeness

In [65]:
# Configuration for similarity thresholds
TEXT_SIMILARITY_THRESHOLD = 0.75  # Cosine similarity threshold
GEO_DISTANCE_THRESHOLD = 200  # meters
TIME_WINDOW_DAYS = 30  # days

print("Graph Construction Parameters:")
print(f"- Text similarity threshold: {TEXT_SIMILARITY_THRESHOLD}")
print(f"- Geographic distance threshold: {GEO_DISTANCE_THRESHOLD}m")
print(f"- Temporal window: {TIME_WINDOW_DAYS} days")

Graph Construction Parameters:
- Text similarity threshold: 0.75
- Geographic distance threshold: 200m
- Temporal window: 30 days


In [66]:
# Build graph
G = nx.Graph()

# Add all complaints as nodes
for idx, row in df_sample.iterrows():
    G.add_node(idx, 
               comment=row['comment'],  # Store full comment
               lat=row['latitude'],
               lon=row['longitude'],
               timestamp=row['timestamp'],
               district=row['district'],
               state=row['state'])

print(f"Added {G.number_of_nodes():,} nodes")

# Find similar pairs using FAISS
k = 20  # Number of nearest neighbors to consider
print(f"\nSearching for {k} nearest neighbors for each complaint...")
distances, indices = index.search(embeddings_normalized.astype('float32'), k)

# Add edges based on similarity + geo + temporal proximity
edge_count = 0
for i in range(len(df_sample)):
    for j_idx in range(1, k):  # Skip first (self)
        j = indices[i][j_idx]
        similarity = distances[i][j_idx]
        
        if similarity < TEXT_SIMILARITY_THRESHOLD:
            continue
            
        # Check geographic distance
        geo_dist = haversine_distance(
            df_sample.iloc[i]['latitude'], df_sample.iloc[i]['longitude'],
            df_sample.iloc[j]['latitude'], df_sample.iloc[j]['longitude']
        )
        
        if geo_dist > GEO_DISTANCE_THRESHOLD:
            continue
        
        # Check temporal proximity
        time_diff = abs((df_sample.iloc[i]['timestamp'] - df_sample.iloc[j]['timestamp']).days)
        
        if time_diff > TIME_WINDOW_DAYS:
            continue
        
        # Add edge with combined weight
        weight = similarity * (1 - min(geo_dist / GEO_DISTANCE_THRESHOLD, 1)) * (1 - min(time_diff / TIME_WINDOW_DAYS, 1))
        G.add_edge(i, j, weight=weight, similarity=similarity, geo_dist=geo_dist, time_diff=time_diff)
        edge_count += 1

print(f"Added {edge_count:,} edges")
print(f"\nGraph statistics:")
print(f"- Nodes: {G.number_of_nodes():,}")
print(f"- Edges: {G.number_of_edges():,}")
print(f"- Density: {nx.density(G):.6f}")

Added 100,000 nodes

Searching for 20 nearest neighbors for each complaint...
Added 20,539 edges

Graph statistics:
- Nodes: 100,000
- Edges: 14,415
- Density: 0.000003
Added 20,539 edges

Graph statistics:
- Nodes: 100,000
- Edges: 14,415
- Density: 0.000003


## Step 4: Cluster Detection with Louvain Algorithm
Detect communities (clusters) of related complaints

In [67]:
# Apply Louvain algorithm for community detection
print("Applying Louvain algorithm...")
partition = community_louvain.best_partition(G, weight='weight', random_state=42)

# Add cluster labels to dataframe
df_sample['cluster'] = df_sample.index.map(partition)

# Statistics
n_clusters = len(set(partition.values()))
print(f"\nClustering Results:")
print(f"- Number of clusters detected: {n_clusters}")
print(f"- Modularity score: {community_louvain.modularity(partition, G, weight='weight'):.4f}")

# Cluster size distribution
cluster_sizes = df_sample['cluster'].value_counts().sort_values(ascending=False)
print(f"\nTop 10 largest clusters:")
print(cluster_sizes.head(10))
print(f"\nCluster size statistics:")
print(cluster_sizes.describe())

Applying Louvain algorithm...

Clustering Results:
- Number of clusters detected: 94531

Clustering Results:
- Number of clusters detected: 94531
- Modularity score: 0.9891

Top 10 largest clusters:
cluster
501      80
27577    70
8745     63
10348    62
663      47
5214     40
15405    37
703      37
6383     35
947      33
Name: count, dtype: int64

Cluster size statistics:
count    94531.000000
mean         1.057854
std          0.722129
min          1.000000
25%          1.000000
50%          1.000000
75%          1.000000
max         80.000000
Name: count, dtype: float64
- Modularity score: 0.9891

Top 10 largest clusters:
cluster
501      80
27577    70
8745     63
10348    62
663      47
5214     40
15405    37
703      37
6383     35
947      33
Name: count, dtype: int64

Cluster size statistics:
count    94531.000000
mean         1.057854
std          0.722129
min          1.000000
25%          1.000000
50%          1.000000
75%          1.000000
max         80.000000
Name: co

## Step 5: Analyze Clusters
Extract insights from detected clusters

In [68]:
# Analyze top clusters (size >= 5)
large_clusters = cluster_sizes[cluster_sizes >= 5].index.tolist()

print(f"Analyzing {len(large_clusters)} clusters with 5+ complaints\n")
print("=" * 80)

for cluster_id in large_clusters[:10]:  # Show top 10
    cluster_data = df_sample[df_sample['cluster'] == cluster_id]
    
    print(f"\nüî¥ Cluster #{cluster_id} ({len(cluster_data)} complaints)")
    print(f"   Location: {cluster_data['district'].mode()[0] if len(cluster_data) > 0 else 'N/A'}")
    print(f"   Time span: {cluster_data['timestamp'].min().date()} to {cluster_data['timestamp'].max().date()}")
    print(f"   Avg distance from center: {cluster_data[['latitude', 'longitude']].std().mean()*111000:.0f}m")
    print(f"   Sample comments:")
    
    for idx, comment in enumerate(cluster_data['comment'].head(3), 1):
        print(f"      {idx}. {comment[:80]}...")
    
    print("-" * 80)

Analyzing 204 clusters with 5+ complaints


üî¥ Cluster #501 (80 complaints)
   Location: ‡∏£‡∏≤‡∏ä‡πÄ‡∏ó‡∏ß‡∏µ
   Time span: 2023-08-09 to 2024-02-21
   Avg distance from center: 10m
   Sample comments:
      1. ‡πÅ‡∏ó‡πá‡∏Å‡∏ã‡∏µ‡πà ‡∏™‡∏≤‡∏°‡∏•‡πâ‡∏≠ ‡πÜ‡∏•‡πÜ ‡∏à‡∏≠‡∏î ‡∏ï‡∏£‡∏á ‡∏ñ‡∏ô‡∏ô‡πÄ‡∏û‡∏ä‡∏£‡∏ö‡∏∏‡∏£‡∏µ ‡∏≠‡∏¢‡∏π‡πà‡πÄ‡∏•‡∏¢ ‡∏ã‡∏≠‡∏¢‡πÄ‡∏û‡∏ä‡∏£‡∏ö‡∏∏‡∏£‡∏µ 15 ‡πÄ‡∏¢‡∏∑‡πâ‡∏≠‡∏á ‡∏´‡πâ‡∏≤‡∏á ‡∏û‡∏±‡∏ô‡∏ò‡∏∏‡πå...
      2. ‡πÅ‡∏ó‡πá‡∏Å‡∏ã‡∏µ‡πà ‡∏™‡∏≤‡∏°‡∏•‡πâ‡∏≠ ‡πÜ‡∏•‡πÜ ‡∏à‡∏≠‡∏î ‡∏ï‡∏£‡∏á ‡∏ñ‡∏ô‡∏ô‡πÄ‡∏û‡∏ä‡∏£‡∏ö‡∏∏‡∏£‡∏µ ‡∏≠‡∏¢‡∏π‡πà‡πÄ‡∏•‡∏¢ ‡∏ã‡∏≠‡∏¢‡πÄ‡∏û‡∏ä‡∏£‡∏ö‡∏∏‡∏£‡∏µ 15 ‡πÄ‡∏¢‡∏∑‡πâ‡∏≠‡∏á ‡∏´‡πâ‡∏≤‡∏á ‡∏û‡∏±‡∏ô‡∏ò‡∏∏‡πå...
      3. ‡πÅ‡∏ó‡πá‡∏Å‡∏ã‡∏µ‡πà ‡∏™‡∏≤‡∏°‡∏•‡πâ‡∏≠ ‡πÜ‡∏•‡πÜ ‡∏à‡∏≠‡∏î ‡∏ï‡∏£‡∏á ‡∏ñ‡∏ô‡∏ô‡πÄ‡∏û‡∏ä‡∏£‡∏ö‡∏∏‡∏£‡∏µ ‡∏≠‡∏¢‡∏π‡πà‡πÄ‡∏•‡∏¢ ‡∏ã‡∏≠‡∏¢‡πÄ‡∏û‡∏ä‡∏£‡∏ö‡∏∏‡∏£‡∏µ 15 ‡πÄ‡∏¢‡∏∑‡πâ‡∏≠‡∏á ‡∏´‡πâ‡∏≤‡∏á ‡∏û‡∏±‡∏ô‡∏ò‡∏∏‡πå...
--------------------------------------------------------------------------------

üî¥ Cluster #27577 (70 complain

## Step 6: Visualizations
Create interactive visualizations of clusters

In [69]:
# 1. Cluster Size Distribution
fig = px.histogram(cluster_sizes, x=cluster_sizes.values, nbins=50,
                   title='Distribution of Cluster Sizes',
                   labels={'x': 'Cluster Size (number of complaints)', 'y': 'Frequency'},
                   color_discrete_sequence=['#EF553B'])
fig.add_vline(x=cluster_sizes.median(), line_dash="dash", 
              annotation_text=f"Median: {cluster_sizes.median():.0f}")
fig.show()

In [70]:
# 2. Geospatial Map of Clusters
# Filter to show only large clusters for clarity
df_large_clusters = df_sample[df_sample['cluster'].isin(large_clusters[:20])]

fig = px.scatter_mapbox(df_large_clusters, 
                        lat='latitude', lon='longitude',
                        color='cluster', 
                        hover_data=['district', 'state', 'comment'],
                        title=f'Geographic Distribution of Top 20 Largest Clusters ({len(df_large_clusters)} complaints)',
                        mapbox_style='open-street-map',
                        zoom=10,
                        height=700,
                        color_continuous_scale='Rainbow')
fig.update_traces(marker=dict(size=8, opacity=0.7))
fig.show()

In [71]:
# 3. Network Graph Visualization (sample of largest cluster)
largest_cluster_id = cluster_sizes.index[0]
nodes_in_largest = [n for n, attr in G.nodes(data=True) if partition[n] == largest_cluster_id]
subgraph = G.subgraph(nodes_in_largest)

print(f"Visualizing largest cluster (Cluster #{largest_cluster_id})")
print(f"Nodes: {subgraph.number_of_nodes()}, Edges: {subgraph.number_of_edges()}")

# Create network visualization using plotly
pos = nx.spring_layout(subgraph, k=0.5, iterations=50)

edge_x = []
edge_y = []
for edge in subgraph.edges():
    x0, y0 = pos[edge[0]]
    x1, y1 = pos[edge[1]]
    edge_x.extend([x0, x1, None])
    edge_y.extend([y0, y1, None])

node_x = [pos[node][0] for node in subgraph.nodes()]
node_y = [pos[node][1] for node in subgraph.nodes()]
node_text = [f"ID: {node}<br>{G.nodes[node]['comment']}" for node in subgraph.nodes()]

fig = go.Figure()
fig.add_trace(go.Scatter(x=edge_x, y=edge_y, mode='lines', 
                         line=dict(width=0.5, color='#888'), hoverinfo='none'))
fig.add_trace(go.Scatter(x=node_x, y=node_y, mode='markers',
                         marker=dict(size=10, color='#EF553B'),
                         text=node_text, hoverinfo='text'))
fig.update_layout(title=f'Network Graph of Cluster #{largest_cluster_id}',
                  showlegend=False, hovermode='closest',
                  xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
                  yaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
                  height=600)
fig.show()

Visualizing largest cluster (Cluster #501)
Nodes: 80, Edges: 373


In [72]:
# 4. Top Districts by Number of Large Clusters
district_cluster_counts = df_sample[df_sample['cluster'].isin(large_clusters)].groupby('district')['cluster'].nunique().sort_values(ascending=False).head(15)

fig = px.bar(district_cluster_counts, x=district_cluster_counts.values, y=district_cluster_counts.index,
             orientation='h',
             title='Top 15 Districts by Number of Problem Clusters',
             labels={'x': 'Number of Clusters', 'y': 'District'},
             color=district_cluster_counts.values,
             color_continuous_scale='Reds')
fig.update_layout(yaxis={'categoryorder': 'total ascending'})
fig.show()

## Step 7: Export Results and Summary

In [73]:
# Save clustered data
output_path = "../data/processed/bangkok_traffy_clustered.csv"
df_sample.to_csv(output_path, index=False)
print(f"‚úì Clustered data saved to: {output_path}")

# Create cluster summary report
cluster_summary = []
for cluster_id in large_clusters:
    cluster_data = df_sample[df_sample['cluster'] == cluster_id]
    
    summary = {
        'cluster_id': cluster_id,
        'size': len(cluster_data),
        'district': cluster_data['district'].mode()[0] if len(cluster_data) > 0 else 'N/A',
        'date_start': cluster_data['timestamp'].min(),
        'date_end': cluster_data['timestamp'].max(),
        'duration_days': (cluster_data['timestamp'].max() - cluster_data['timestamp'].min()).days,
        'avg_lat': cluster_data['latitude'].mean(),
        'avg_lon': cluster_data['longitude'].mean(),
        'geo_spread_meters': cluster_data[['latitude', 'longitude']].std().mean() * 111000,
        'sample_comment': cluster_data['comment'].iloc[0][:200]
    }
    cluster_summary.append(summary)

cluster_summary_df = pd.DataFrame(cluster_summary).sort_values('size', ascending=False)
cluster_summary_df.to_csv("../data/processed/cluster_summary.csv", index=False)
print(f"‚úì Cluster summary saved to: ../data/processed/cluster_summary.csv")
print(f"‚úì Total large clusters analyzed: {len(cluster_summary)}")

‚úì Clustered data saved to: ../data/processed/bangkok_traffy_clustered.csv
‚úì Cluster summary saved to: ../data/processed/cluster_summary.csv
‚úì Total large clusters analyzed: 204
‚úì Cluster summary saved to: ../data/processed/cluster_summary.csv
‚úì Total large clusters analyzed: 204


In [74]:
# Final Summary Report
print("=" * 80)
print("BANGKOK REDUNDANCY & DUPLICATION DETECTION SYSTEM")
print("FINAL SUMMARY REPORT")
print("=" * 80)
print(f"\nüìä DATASET STATISTICS")
print(f"   Total complaints analyzed: {len(df_sample):,}")
print(f"   Date range: {df_sample['timestamp'].min().date()} to {df_sample['timestamp'].max().date()}")
print(f"   Districts covered: {df_sample['district'].nunique()}")

print(f"\nüß† AI/ML PROCESSING")
print(f"   Embedding model: Sentence-BERT (multilingual)")
print(f"   Embedding dimension: {embeddings.shape[1]}")
print(f"   Similarity search: FAISS (Approximate Nearest Neighbor)")

print(f"\nüîó GRAPH CONSTRUCTION")
print(f"   Nodes (complaints): {G.number_of_nodes():,}")
print(f"   Edges (relationships): {G.number_of_edges():,}")
print(f"   Text similarity threshold: {TEXT_SIMILARITY_THRESHOLD}")
print(f"   Geographic distance threshold: {GEO_DISTANCE_THRESHOLD}m")
print(f"   Temporal window: {TIME_WINDOW_DAYS} days")

print(f"\nüéØ CLUSTERING RESULTS")
print(f"   Algorithm: Louvain Community Detection")
print(f"   Total clusters detected: {n_clusters}")
print(f"   Large clusters (‚â•5 complaints): {len(large_clusters)}")
print(f"   Modularity score: {community_louvain.modularity(partition, G, weight='weight'):.4f}")
print(f"   Largest cluster size: {cluster_sizes.max()} complaints")
print(f"   Median cluster size: {cluster_sizes.median():.0f} complaints")

print(f"\nüî• KEY INSIGHTS")
redundant_complaints = df_sample[df_sample['cluster'].isin(large_clusters)]
print(f"   Redundant/duplicate complaints: {len(redundant_complaints):,} ({len(redundant_complaints)/len(df_sample)*100:.1f}%)")
print(f"   Potential work reduction: ~{len(redundant_complaints) - len(large_clusters):,} duplicate tasks")
print(f"   Top problematic district: {district_cluster_counts.index[0]} ({district_cluster_counts.values[0]} clusters)")

print(f"\nüíæ OUTPUT FILES")
print(f"   ‚úì {output_path}")
print(f"   ‚úì ../data/processed/cluster_summary.csv")

print("\n" + "=" * 80)
print("üéâ Analysis Complete! Ready for presentation to stakeholders.")
print("=" * 80)

BANGKOK REDUNDANCY & DUPLICATION DETECTION SYSTEM
FINAL SUMMARY REPORT

üìä DATASET STATISTICS
   Total complaints analyzed: 100,000
   Date range: 2021-12-22 to 2025-01-16
   Districts covered: 73

üß† AI/ML PROCESSING
   Embedding model: Sentence-BERT (multilingual)
   Embedding dimension: 384
   Similarity search: FAISS (Approximate Nearest Neighbor)

üîó GRAPH CONSTRUCTION
   Nodes (complaints): 100,000
   Edges (relationships): 14,415
   Text similarity threshold: 0.75
   Geographic distance threshold: 200m
   Temporal window: 30 days

üéØ CLUSTERING RESULTS
   Algorithm: Louvain Community Detection
   Total clusters detected: 94531
   Large clusters (‚â•5 complaints): 204
   Modularity score: 0.9891
   Largest cluster size: 80 complaints
   Median cluster size: 1 complaints

üî• KEY INSIGHTS
   Redundant/duplicate complaints: 2,220 (2.2%)
   Potential work reduction: ~2,016 duplicate tasks
   Top problematic district: ‡∏ß‡∏±‡∏í‡∏ô‡∏≤ (18 clusters)

üíæ OUTPUT FILES
   ‚úì .