Skip to content

CodeOfMe/ATCND

Repository files navigation

ATCND

Adaptive Topic and Cluster Number Determination via Structured Search over Sliding Ranges

PyPI Python License: GPL v3 Tests

ATCND is a model-agnostic framework that determines the optimal number of topics (for LDA/NMF) or clusters (for K-Means) by treating K selection as a structured search problem over a user-specified integer range. Prior to ATCND, no model-agnostic method with correct K* achieved sub-linear evaluation count: exhaustive grid search (the SOTA baseline in the model-agnostic + exact-K class) requires O(K_max − K_min) model evaluations. ATCND applies eight search strategies to achieve O(log(K_max − K_min)) evaluations — 59–79% fewer than grid search while matching its K* accuracy.

8-Strategy Comparison

ATCND: All 8 search strategies on K-Means (K_true=8)

Benchmark on K-Means (SyntheticBlobs, K_true=8, range [2,30]):

Strategy Complexity Evals vs Grid
Grid O(N) 29
Binary O(log N) 9 69%
Golden Section O(log_φ N) 12 59%
Ternary O(log_{1.5} N) 14 52%
Fibonacci O(log_φ N) 10 66%
Interpolation O(log log N)* 11 62%
Exponential O(log K*) 10 66%
Predictive O(1)+O(log Δ) 7 76%

All strategies recover K*=8 with identical silhouette score. Predictive search with PCA hot-start achieves the fewest evaluations.

Consistency Across Datasets

ATCND never catastrophically fails. Compare absolute errors |K* − K_true| across five benchmarks:

Method Iris Wine Breast Digits Blobs Avg Max
ATCND-sil 1 1 0 1 0 0.6 1
Kneedle 2 2 2 1 0 1.4 2
X-Means 0 0 0 8 6 2.8 8
Gap Stat. 12 12 8 10 0 8.4 12
G-Means 12 4 8 10 22 11.2 22

Consistency analysis: ATCND never catastrophically fails

Real-World Demos

K-Means on Iris (3D)

K-Means on Iris (K*=2) Iris K-Means search curve

GMM on Wine (3D)

GMM on Wine (K*=2) Wine GMM search curve

PCA on Digits (3D)

PCA on Digits (K*=31 for 95% variance) Digits PCA cumulative variance

DBSCAN on Two-Moons

DBSCAN on Two-Moons DBSCAN eps search curve

Optimal Histogram Bins (AIC)

ATCND optimal bins vs Sturges rule

Smoothing Spline (SciPy)

Smoothing spline with ATCND-selected parameter

Rolling Window (Pandas)

Optimal rolling window on time series

NMF Topic Count (Gensim)

NMF topic search with c_v coherence

PyTorch Hidden Layer Size

PyTorch hidden layer size search

Run all demos with figures (SVG + PDF + PNG):

python examples/demo_all.py
# Output: examples/figures/*.{svg,pdf,png}

Key Features

  • 8 search strategies: grid, binary, golden section, ternary, Fibonacci, interpolation, exponential, predictive
  • 15 adapters spanning NumPy, SciPy, scikit-learn, PyTorch, Gensim, and Pandas
  • 3 model families: LDA, NMF, K-Means (plus GMM, PCA, DBSCAN, etc.)
  • 5+5 quality metrics: general (silhouette, coherence c_v, perplexity, reconstruction, combined) + K-Means-specific (silhouette_knee, bic, combined, silhouette_drop)
  • Adaptive strategy selection: automatically recommends best strategy + metric based on data profile (dimensionality, sparsity, separation)
  • Multi-objective optimization: simultaneously optimize multiple metrics (e.g., silhouette + BIC) with Pareto frontier
  • Ranked candidate set: returns multiple optimal K values (handles plateaus, ties, multi-optima)
  • First O(log N) method in the model-agnostic + exact-K class: 59–79% fewer evaluations than grid search
  • Predictive search with PCA hot-start: 76% reduction (7 evaluations vs 29 for grid)
  • Automatic K range estimation: suggest_k_range() recommends search bounds from PCA, √N, and data sparsity heuristics
  • CLI and Python API

Installation

pip install atcnd

For PyTorch adapters:

pip install atcnd[torch]

From source:

git clone https://github.com/CodeOfMe/ATCND.git
cd ATCND
pip install -e .

Quick Start

Python API

from atcnd import ATCNDConfig, atcnd_search

# K-Means on numeric data
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=1000, n_features=50, centers=8, random_state=42)

config = ATCNDConfig(
    k_min=2, k_max=30,
    model_type="kmeans",
    search_strategy="binary",
    metric="silhouette",
    n_candidates=3,
)
result = atcnd_search(X=X, config=config)

print(f"Optimal K: {result.optimal_k}")
print(f"Best score: {result.optimal_score:.4f}")
print(f"Top candidates: {result.candidate_ks}")
print(f"Evaluations: {len(result.search_history)}")

Low-level API (any callable)

from atcnd import search
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def f(k):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(X)
    return silhouette_score(X, labels)

result = search(f, k_min=2, k_max=30, strategy="binary")
print(f"K* = {result.optimal_k}, evals = {len(result.search_history)}")

Adapter API (scikit-learn, PyTorch, Gensim, ...)

from atcnd import search_model, search_gmm_components, search_nmf_topics
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# K-Means
r = search_model(KMeans, X, param_name="n_clusters", k_min=2, k_max=30, strategy="binary")

# GMM with BIC
r = search_gmm_components(X, k_min=2, k_max=15, strategy="binary")

# NMF topics with coherence
r = search_nmf_topics(texts, k_min=2, k_max=20, strategy="binary", metric="coherence")

# Predictive search with PCA hot-start
from atcnd import estimate_k_n_clusters
hot_k = estimate_k_n_clusters(X, k_min=2, k_max=30)
r = search(f, k_min=2, k_max=30, strategy="predictive", hot_start=hot_k)

Text data (LDA/NMF)

from atcnd import ATCNDConfig, atcnd_search

texts = ["your document texts here"] * 100

# NMF with coherence metric
config = ATCNDConfig(
    k_min=2, k_max=20,
    model_type="nmf",
    search_strategy="binary",
    metric="coherence",
)
result = atcnd_search(texts=texts, config=config)

Adaptive Strategy Selection

ATCND automatically recommends the best strategy + metric based on data characteristics (dimensionality, sparsity, cluster separation, intrinsic dimension):

from atcnd import adaptive_select, adaptive_search

# Get recommendation without running search
rec = adaptive_select(X, k_min=2, k_max=30)
print(f"Recommended: {rec.strategy} + {rec.metric} (confidence: {rec.confidence:.2f})")
print(f"Top 5: {[(r['strategy'], r['metric'], r['score']) for r in rec.all_recommendations[:5]}")

# Or run search directly with auto-selected strategy
result = adaptive_search(X, k_min=2, k_max=30)
print(f"K* = {result.optimal_k}, strategy = {result.strategy}")

CLI:

atcnd adaptive --k-min 2 --k-max 30
atcnd adaptive --k-min 2 --k-max 30 --run --json

Automatic K Range Estimation

Instead of manually choosing k_min and k_max, ATCND can suggest a search range from data characteristics:

from atcnd import suggest_k_range

rec = suggest_k_range(X, model_type="kmeans")
print(f"Suggested range: [{rec.k_min}, {rec.k_max}]")
print(f"Heuristics: {rec.rationale}")
# e.g. {pca_intrinsic_dim: 14, sqrt_n: 22, pca_elbow_doubled: 14}

The returned k_max is the harmonic mean of three upper-bound heuristics, scaled by a 1.2× safety factor:

  • pca_intrinsic_dim: 2 × (PCA dimensions for 95% cumulative variance)
  • sqrt_n: √N rule (Mardia 1979)
  • pca_elbow_doubled (clustering) or topic_estimate_doubled (topic models)

CLI:

atcnd range --model kmeans --json
atcnd range --model lda

Multi-Objective Optimization

Optimize multiple metrics simultaneously and find Pareto-optimal K values:

from atcnd import multi_objective_search
from sklearn.cluster import KMeans

# Silhouette + BIC with equal weights
mo = multi_objective_search(KMeans, X, metrics=["silhouette", "bic"], k_min=2, k_max=30)
print(f"Combined best: K* = {mo.optimal_k}")
print(f"Pareto frontier: {mo.pareto_ks}")  # Non-dominated K values

# Custom weights (prioritize silhouette)
mo = multi_objective_search(KMeans, X, metrics=["silhouette", "bic"],
                              weights={"silhouette": 0.7, "bic": 0.3}, k_min=2, k_max=30)

# Three metrics
mo = multi_objective_search(KMeans, X, metrics=["silhouette", "bic", "silhouette_drop"],
                              k_min=2, k_max=30)

CLI:

atcnd multi --metrics silhouette bic --k-min 2 --k-max 30
atcnd multi --metrics silhouette bic --weights 0.7 0.3 --json

CLI

# Search with K-Means on synthetic data
atcnd search --model kmeans --strategy binary --k-min 2 --k-max 30

# Search with NMF
atcnd search --model nmf --strategy golden_section --metric silhouette

# JSON output
atcnd search --model kmeans --json

# Suggest K range from data
atcnd range --model kmeans --json

# Run benchmarks
atcnd benchmark --dataset blobs --k-min 2 --k-max 30

# All 8 strategies comparison
atcnd benchmark --dataset blobs --k-min 2 --k-max 30 --all-strategies

Search Strategies Detail

Predictive Search

Predictive search uses PCA-based hot-start to estimate K* before any model evaluation, then applies parabolic peak fitting:

  1. PCA hot-start: Eigenvalue elbow method estimates K* from data structure
  2. Probing: Evaluate f(K̂−1), f(K̂), f(K̂+1)
  3. Parabolic peak fitting: Fit a parabola through the three best points and jump to the predicted peak
  4. Binary refinement: Narrow the search around the predicted peak
from atcnd import estimate_k_n_clusters, search

# PCA hot-start estimates K from eigenvalue elbow
hot_k = estimate_k_n_clusters(X, k_min=2, k_max=30)

# Predictive search uses hot_start to reduce evaluations
result = search(f, k_min=2, k_max=30, strategy="predictive", hot_start=hot_k)

Fibonacci Search

Classical optimal algorithm for discrete unimodal search (Kiefer 1953). Achieves minimum worst-case evaluations among all derivative-free methods on unimodal functions.

Exponential Search

Doubles the probe point (1, 2, 4, 8, ...) until f(K) decreases, then refines via binary search. Best when K* is near K_min.

Adapters

15 adapters wrap common model/parameter pairs with appropriate quality metrics:

Library Adapter Parameter Metric
sklearn search_model n_clusters silhouette
sklearn search_neighbors n_neighbors CV accuracy
NumPy search_bins bins AIC
sklearn search_components n_components variance
SciPy search_knots internal knots MSE
signal search_window window BIC
any search_param any user-defined
PyTorch search_hidden hidden_dim −CE
PyTorch search_layers n_layers −CE
sklearn search_trees n_estimators CV accuracy
sklearn search_dbscan_eps eps silhouette
sklearn search_gmm_components n_components BIC
Pandas search_dataframe_bins bins AIC
Pandas search_rolling_window window BIC
Gensim search_nmf_topics n_topics c_v coherence

Quality Metrics

Metric LDA NMF K-Means Description
Silhouette Yes Yes Yes Inter-cluster separation vs intra-cluster cohesion
Coherence (c_v) Yes Yes No Semantic coherence of top topic words
Perplexity Yes No No Negative log-likelihood per word
Reconstruction No Yes Yes Frobenius norm / inertia
Combined Yes Yes No 0.5 × silhouette + 0.5 × coherence

Multiple Optima

K is a discrete integer parameter. Multiple values of K may achieve the same or nearly the same quality score. ATCND returns candidate_ks (ranked list) alongside optimal_k, handling plateaus, ties, and multiple local maxima.

Comparison with Baselines

ATCND is the first method in the model-agnostic + exact-K class to achieve O(log N) evaluations:

Method Model-agnostic? Exact K*? Evals
ATCND (all) Yes Yes O(log N)
Grid Search Yes Yes O(N)
HDP No (LDA only) No 1 (costly)
Top2Vec No No 1
Black-box Opt. No (LDA only) Yes Variable

Real-World Results

Dataset Adapter K* Strategy Evals Reduction
Iris search_model 2 binary 5 64%
Wine search_gmm 2 binary 6 57%
Digits search_components 31 binary 10 84%
Moons search_dbscan 3 binary 10 64%
Wine search_trees 300 binary 37 87%
Iris search_neighbors 12 binary 9 70%

Development

# Install in development mode
pip install -e ".[dev]"

# Run tests (75 tests)
python -m pytest tests/ -v

# Format code
black src/atcnd/ tests/

# Lint code
ruff check src/atcnd/ tests/

License

GNU General Public License v3.0 (GPLv3)

About

Adaptive Topic and Cluster Number Determination

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors