ATCND

Adaptive Topic and Cluster Number Determination via Structured Search over Sliding Ranges

ATCND is a model-agnostic framework that determines the optimal number of topics (for LDA/NMF) or clusters (for K-Means) by treating K selection as a structured search problem over a user-specified integer range. Prior to ATCND, no model-agnostic method with correct K* achieved sub-linear evaluation count: exhaustive grid search (the SOTA baseline in the model-agnostic + exact-K class) requires O(K_max − K_min) model evaluations. ATCND applies eight search strategies to achieve O(log(K_max − K_min)) evaluations — 59–79% fewer than grid search while matching its K* accuracy.

8-Strategy Comparison

Benchmark on K-Means (SyntheticBlobs, K_true=8, range [2,30]):

Strategy	Complexity	Evals	vs Grid
Grid	O(N)	29	—
Binary	O(log N)	9	69%
Golden Section	O(log_φ N)	12	59%
Ternary	O(log_{1.5} N)	14	52%
Fibonacci	O(log_φ N)	10	66%
Interpolation	O(log log N)*	11	62%
Exponential	O(log K*)	10	66%
Predictive	O(1)+O(log Δ)	7	76%

All strategies recover K*=8 with identical silhouette score. Predictive search with PCA hot-start achieves the fewest evaluations.

Consistency Across Datasets

ATCND never catastrophically fails. Compare absolute errors |K* − K_true| across five benchmarks:

Method	Iris	Wine	Breast	Digits	Blobs	Avg	Max
ATCND-sil	1	1	0	1	0	0.6	1
Kneedle	2	2	2	1	0	1.4	2
X-Means	0	0	0	8	6	2.8	8
Gap Stat.	12	12	8	10	0	8.4	12
G-Means	12	4	8	10	22	11.2	22

Real-World Demos

K-Means on Iris (3D)

GMM on Wine (3D)

PCA on Digits (3D)

DBSCAN on Two-Moons

Optimal Histogram Bins (AIC)

Smoothing Spline (SciPy)

Rolling Window (Pandas)

NMF Topic Count (Gensim)

PyTorch Hidden Layer Size

Run all demos with figures (SVG + PDF + PNG):

python examples/demo_all.py
# Output: examples/figures/*.{svg,pdf,png}

Key Features

8 search strategies: grid, binary, golden section, ternary, Fibonacci, interpolation, exponential, predictive
15 adapters spanning NumPy, SciPy, scikit-learn, PyTorch, Gensim, and Pandas
3 model families: LDA, NMF, K-Means (plus GMM, PCA, DBSCAN, etc.)
5+5 quality metrics: general (silhouette, coherence c_v, perplexity, reconstruction, combined) + K-Means-specific (silhouette_knee, bic, combined, silhouette_drop)
Adaptive strategy selection: automatically recommends best strategy + metric based on data profile (dimensionality, sparsity, separation)
Multi-objective optimization: simultaneously optimize multiple metrics (e.g., silhouette + BIC) with Pareto frontier
Ranked candidate set: returns multiple optimal K values (handles plateaus, ties, multi-optima)
First O(log N) method in the model-agnostic + exact-K class: 59–79% fewer evaluations than grid search
Predictive search with PCA hot-start: 76% reduction (7 evaluations vs 29 for grid)
Automatic K range estimation: suggest_k_range() recommends search bounds from PCA, √N, and data sparsity heuristics
CLI and Python API

Installation

pip install atcnd

For PyTorch adapters:

pip install atcnd[torch]

From source:

git clone https://github.com/CodeOfMe/ATCND.git
cd ATCND
pip install -e .

Quick Start

Python API

from atcnd import ATCNDConfig, atcnd_search

# K-Means on numeric data
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=1000, n_features=50, centers=8, random_state=42)

config = ATCNDConfig(
    k_min=2, k_max=30,
    model_type="kmeans",
    search_strategy="binary",
    metric="silhouette",
    n_candidates=3,
)
result = atcnd_search(X=X, config=config)

print(f"Optimal K: {result.optimal_k}")
print(f"Best score: {result.optimal_score:.4f}")
print(f"Top candidates: {result.candidate_ks}")
print(f"Evaluations: {len(result.search_history)}")

Low-level API (any callable)

from atcnd import search
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def f(k):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(X)
    return silhouette_score(X, labels)

result = search(f, k_min=2, k_max=30, strategy="binary")
print(f"K* = {result.optimal_k}, evals = {len(result.search_history)}")

Adapter API (scikit-learn, PyTorch, Gensim, ...)

from atcnd import search_model, search_gmm_components, search_nmf_topics
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# K-Means
r = search_model(KMeans, X, param_name="n_clusters", k_min=2, k_max=30, strategy="binary")

# GMM with BIC
r = search_gmm_components(X, k_min=2, k_max=15, strategy="binary")

# NMF topics with coherence
r = search_nmf_topics(texts, k_min=2, k_max=20, strategy="binary", metric="coherence")

# Predictive search with PCA hot-start
from atcnd import estimate_k_n_clusters
hot_k = estimate_k_n_clusters(X, k_min=2, k_max=30)
r = search(f, k_min=2, k_max=30, strategy="predictive", hot_start=hot_k)

Text data (LDA/NMF)

from atcnd import ATCNDConfig, atcnd_search

texts = ["your document texts here"] * 100

# NMF with coherence metric
config = ATCNDConfig(
    k_min=2, k_max=20,
    model_type="nmf",
    search_strategy="binary",
    metric="coherence",
)
result = atcnd_search(texts=texts, config=config)

Adaptive Strategy Selection

ATCND automatically recommends the best strategy + metric based on data characteristics (dimensionality, sparsity, cluster separation, intrinsic dimension):

from atcnd import adaptive_select, adaptive_search

# Get recommendation without running search
rec = adaptive_select(X, k_min=2, k_max=30)
print(f"Recommended: {rec.strategy} + {rec.metric} (confidence: {rec.confidence:.2f})")
print(f"Top 5: {[(r['strategy'], r['metric'], r['score']) for r in rec.all_recommendations[:5]}")

# Or run search directly with auto-selected strategy
result = adaptive_search(X, k_min=2, k_max=30)
print(f"K* = {result.optimal_k}, strategy = {result.strategy}")

CLI:

atcnd adaptive --k-min 2 --k-max 30
atcnd adaptive --k-min 2 --k-max 30 --run --json

Automatic K Range Estimation

Instead of manually choosing k_min and k_max, ATCND can suggest a search range from data characteristics:

from atcnd import suggest_k_range

rec = suggest_k_range(X, model_type="kmeans")
print(f"Suggested range: [{rec.k_min}, {rec.k_max}]")
print(f"Heuristics: {rec.rationale}")
# e.g. {pca_intrinsic_dim: 14, sqrt_n: 22, pca_elbow_doubled: 14}

The returned k_max is the harmonic mean of three upper-bound heuristics, scaled by a 1.2× safety factor:

pca_intrinsic_dim: 2 × (PCA dimensions for 95% cumulative variance)
sqrt_n: √N rule (Mardia 1979)
pca_elbow_doubled (clustering) or topic_estimate_doubled (topic models)

CLI:

atcnd range --model kmeans --json
atcnd range --model lda

Multi-Objective Optimization

Optimize multiple metrics simultaneously and find Pareto-optimal K values:

from atcnd import multi_objective_search
from sklearn.cluster import KMeans

# Silhouette + BIC with equal weights
mo = multi_objective_search(KMeans, X, metrics=["silhouette", "bic"], k_min=2, k_max=30)
print(f"Combined best: K* = {mo.optimal_k}")
print(f"Pareto frontier: {mo.pareto_ks}")  # Non-dominated K values

# Custom weights (prioritize silhouette)
mo = multi_objective_search(KMeans, X, metrics=["silhouette", "bic"],
                              weights={"silhouette": 0.7, "bic": 0.3}, k_min=2, k_max=30)

# Three metrics
mo = multi_objective_search(KMeans, X, metrics=["silhouette", "bic", "silhouette_drop"],
                              k_min=2, k_max=30)

CLI:

atcnd multi --metrics silhouette bic --k-min 2 --k-max 30
atcnd multi --metrics silhouette bic --weights 0.7 0.3 --json

CLI

# Search with K-Means on synthetic data
atcnd search --model kmeans --strategy binary --k-min 2 --k-max 30

# Search with NMF
atcnd search --model nmf --strategy golden_section --metric silhouette

# JSON output
atcnd search --model kmeans --json

# Suggest K range from data
atcnd range --model kmeans --json

# Run benchmarks
atcnd benchmark --dataset blobs --k-min 2 --k-max 30

# All 8 strategies comparison
atcnd benchmark --dataset blobs --k-min 2 --k-max 30 --all-strategies

Search Strategies Detail

Predictive Search

Predictive search uses PCA-based hot-start to estimate K* before any model evaluation, then applies parabolic peak fitting:

PCA hot-start: Eigenvalue elbow method estimates K* from data structure
Probing: Evaluate f(K̂−1), f(K̂), f(K̂+1)
Parabolic peak fitting: Fit a parabola through the three best points and jump to the predicted peak
Binary refinement: Narrow the search around the predicted peak

from atcnd import estimate_k_n_clusters, search

# PCA hot-start estimates K from eigenvalue elbow
hot_k = estimate_k_n_clusters(X, k_min=2, k_max=30)

# Predictive search uses hot_start to reduce evaluations
result = search(f, k_min=2, k_max=30, strategy="predictive", hot_start=hot_k)

Fibonacci Search

Classical optimal algorithm for discrete unimodal search (Kiefer 1953). Achieves minimum worst-case evaluations among all derivative-free methods on unimodal functions.

Exponential Search

Doubles the probe point (1, 2, 4, 8, ...) until f(K) decreases, then refines via binary search. Best when K* is near K_min.

Adapters

15 adapters wrap common model/parameter pairs with appropriate quality metrics:

Library	Adapter	Parameter	Metric
sklearn	`search_model`	n_clusters	silhouette
sklearn	`search_neighbors`	n_neighbors	CV accuracy
NumPy	`search_bins`	bins	AIC
sklearn	`search_components`	n_components	variance
SciPy	`search_knots`	internal knots	MSE
signal	`search_window`	window	BIC
any	`search_param`	any	user-defined
PyTorch	`search_hidden`	hidden_dim	−CE
PyTorch	`search_layers`	n_layers	−CE
sklearn	`search_trees`	n_estimators	CV accuracy
sklearn	`search_dbscan_eps`	eps	silhouette
sklearn	`search_gmm_components`	n_components	BIC
Pandas	`search_dataframe_bins`	bins	AIC
Pandas	`search_rolling_window`	window	BIC
Gensim	`search_nmf_topics`	n_topics	c_v coherence

Quality Metrics

Metric	LDA	NMF	K-Means	Description
Silhouette	Yes	Yes	Yes	Inter-cluster separation vs intra-cluster cohesion
Coherence (c_v)	Yes	Yes	No	Semantic coherence of top topic words
Perplexity	Yes	No	No	Negative log-likelihood per word
Reconstruction	No	Yes	Yes	Frobenius norm / inertia
Combined	Yes	Yes	No	0.5 × silhouette + 0.5 × coherence

Multiple Optima

K is a discrete integer parameter. Multiple values of K may achieve the same or nearly the same quality score. ATCND returns candidate_ks (ranked list) alongside optimal_k, handling plateaus, ties, and multiple local maxima.

Comparison with Baselines

ATCND is the first method in the model-agnostic + exact-K class to achieve O(log N) evaluations:

Method	Model-agnostic?	Exact K*?	Evals
ATCND (all)	Yes	Yes	O(log N)
Grid Search	Yes	Yes	O(N)
HDP	No (LDA only)	No	1 (costly)
Top2Vec	No	No	1
Black-box Opt.	No (LDA only)	Yes	Variable

Real-World Results

Dataset	Adapter	K*	Strategy	Evals	Reduction
Iris	search_model	2	binary	5	64%
Wine	search_gmm	2	binary	6	57%
Digits	search_components	31	binary	10	84%
Moons	search_dbscan	3	binary	10	64%
Wine	search_trees	300	binary	37	87%
Iris	search_neighbors	12	binary	9	70%

Development

# Install in development mode
pip install -e ".[dev]"

# Run tests (75 tests)
python -m pytest tests/ -v

# Format code
black src/atcnd/ tests/

# Lint code
ruff check src/atcnd/ tests/

License

GNU General Public License v3.0 (GPLv3)

Name		Name	Last commit message	Last commit date
Latest commit History 24 Commits
.codegraph		.codegraph
examples		examples
experiments		experiments
src/atcnd		src/atcnd
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
README_CN.md		README_CN.md
pyproject.toml		pyproject.toml
upload_pypi.sh		upload_pypi.sh

Folders and files

Latest commit

History

Repository files navigation

ATCND

8-Strategy Comparison

Consistency Across Datasets

Real-World Demos

K-Means on Iris (3D)

GMM on Wine (3D)

PCA on Digits (3D)

DBSCAN on Two-Moons

Optimal Histogram Bins (AIC)

Smoothing Spline (SciPy)

Rolling Window (Pandas)

NMF Topic Count (Gensim)

PyTorch Hidden Layer Size

Key Features

Installation

Quick Start

Python API

Low-level API (any callable)

Adapter API (scikit-learn, PyTorch, Gensim, ...)

Text data (LDA/NMF)

Adaptive Strategy Selection

Automatic K Range Estimation

Multi-Objective Optimization

CLI

Search Strategies Detail

Predictive Search

Fibonacci Search

Exponential Search

Adapters

Quality Metrics

Multiple Optima

Comparison with Baselines

Real-World Results

Development

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages