perf: fix low-hanging performance issues in MetaCAT and linking #400

Merged

mart-r merged 4 commits into CogStack:main from bgriffen:perf/low-hanging-optimizations on Apr 7, 2026

perf: fix low-hanging performance issues in MetaCAT and linking#400
mart-r merged 4 commits intoCogStack:mainfrom
bgriffen:perf/low-hanging-optimizations

Conversation

@bgriffen (Contributor) commented Apr 7, 2026

Summary

Four small, safe performance fixes across the MetaCAT and linking components:

  • Scope batch padding to current batch (ml_utils.py) — create_batch_piped_data was computing max_seq_len over the entire dataset on every batch call, then slicing data[start_ind:end_ind] three times. Now scoped to a single batch slice. Reduces unnecessary padding and iteration.

  • In-place similarity update (vector_context_model.py) — _preprocess_disamb_similarities was copying the similarities list, clearing it, and rebuilding via list comprehension. Replaced with a simple in-place loop, eliminating three intermediate allocations in the disambiguation hot path.

  • O(1) dict lookup instead of O(n) values scan (data_utils.py) — undersample_data and encode_category_values checked membership against category_value2id.values() (linear scan) every iteration. Since label_data dicts use the same keys, check against the dict directly.

  • Append instead of list concatenation (ml_utils.py) — _eval_predictions used dict.get(k, []) + [item] which allocates a new list per iteration. Switched to setdefault + append for O(1) amortized insertions.

All changes are behaviour-preserving — no new dependencies, no config changes, no API changes.

Test plan

  • Existing medcat-v2 unit tests pass across Python 3.10–3.13
  • MetaCAT training + inference produces identical results
  • Linking disambiguation produces identical concept rankings

bgriffen added 4 commits April 7, 2026 21:31
  • Scope batch padding to the current batch (ml_utils.py): create_batch_piped_data was computing max_seq_len over the entire dataset on every batch call, and slicing data[start_ind:end_ind] three times. Scope both to a single batch slice — reduces padding overhead and eliminates redundant iteration.
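A minimal sketch of this fix, with hypothetical names (the real create_batch_piped_data signature differs): slice the batch once, and compute the pad length over that slice rather than over the whole dataset.

```python
def create_batch(data, start_ind, end_ind, pad_id=0):
    # Slice once and reuse, instead of re-slicing data[start_ind:end_ind]
    # three times per call.
    batch = data[start_ind:end_ind]
    # Pad to the longest sequence in THIS batch, not in the whole dataset.
    max_seq_len = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (max_seq_len - len(seq)) for seq in batch]
```

With data = [[1, 2, 3], [4], [5, 6, 7, 8]] and a batch of the first two sequences, both pad to length 3 even though the dataset's longest sequence has length 4.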
  • Update similarities in place (vector_context_model.py): replace list copy + clear + rebuild with a simple in-place loop. Eliminates three intermediate list allocations in the disambiguation hot path.
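The same pattern in miniature, with hypothetical names (not the actual _preprocess_disamb_similarities code): instead of copying the list, clearing it, and rebuilding it via a comprehension, mutate each element in place.

```python
def rescale_in_place(similarities, factor):
    # Before: old = list(similarities); similarities.clear();
    #         similarities.extend(s * factor for s in old)
    #         -> a full copy plus a rebuilt list on every call.
    # After: a single in-place loop, no intermediate allocations.
    for i, sim in enumerate(similarities):
        similarities[i] = sim * factor
```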
  • Use O(1) dict lookups (data_utils.py): undersample_data and encode_category_values both checked membership against category_value2id.values() (a linear scan) on every iteration. Since label_data dicts are keyed by the same IDs, check membership against the dict itself (an O(1) hash lookup).
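An illustrative sketch with hypothetical names: `x in d.values()` walks the values view on every check, while `x in d` is a single hash lookup on the keys.

```python
def known_labels(labels, label_data):
    # Slow: [l for l in labels if l in category_value2id.values()]
    #       -> linear scan of the values view per membership check.
    # Fast: test against a dict keyed by the same IDs (O(1) per check).
    return [label for label in labels if label in label_data]
```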
  • Append instead of concatenating (ml_utils.py): dict.get(k, []) + [item] allocates a new list on every iteration, making example collection O(n*k). Use setdefault + append for O(1) amortized cost per insertion.
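A sketch of the append fix, using a hypothetical grouping helper rather than the actual _eval_predictions code:

```python
def group_by_label(pairs):
    grouped = {}
    for label, example in pairs:
        # Before: grouped[label] = grouped.get(label, []) + [example]
        #         -> builds a brand-new list on every insertion (O(n*k)).
        # After: mutate the existing list; O(1) amortized per append.
        grouped.setdefault(label, []).append(example)
    return grouped
```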
@mart-r (Collaborator) left a comment

Thanks for the contribution!

Looks good to go. Minor changes, but clear improvement.

@mart-r mart-r merged commit 8a630ba into CogStack:main Apr 7, 2026
19 of 20 checks passed