perf: fix low-hanging performance issues in MetaCAT and linking #400

Merged

mart-r merged 4 commits into CogStack:main from bgriffen:perf/low-hanging-optimizations on Apr 7, 2026

perf: fix low-hanging performance issues in MetaCAT and linking#400
mart-r merged 4 commits intoCogStack:mainfrom
bgriffen:perf/low-hanging-optimizations

Conversation

@bgriffen (Contributor) commented Apr 7, 2026

Summary

Four small, safe performance fixes across the MetaCAT and linking components:

  • Scope batch padding to current batch (ml_utils.py) — create_batch_piped_data was computing max_seq_len over the entire dataset on every batch call, then slicing data[start_ind:end_ind] three times. Now scoped to a single batch slice. Reduces unnecessary padding and iteration.

  • In-place similarity update (vector_context_model.py) — _preprocess_disamb_similarities was copying the similarities list, clearing it, and rebuilding via list comprehension. Replaced with a simple in-place loop, eliminating three intermediate allocations in the disambiguation hot path.

  • O(1) dict lookup instead of O(n) values scan (data_utils.py) — undersample_data and encode_category_values checked membership against category_value2id.values() (linear scan) every iteration. Since label_data dicts use the same keys, check against the dict directly.

  • Append instead of list concatenation (ml_utils.py) — _eval_predictions used dict.get(k, []) + [item] which allocates a new list per iteration. Switched to setdefault + append for O(1) amortized insertions.

All changes are behaviour-preserving — no new dependencies, no config changes, no API changes.

Test plan

  • Existing medcat-v2 unit tests pass across Python 3.10–3.13
  • MetaCAT training + inference produces identical results
  • Linking disambiguation produces identical concept rankings

bgriffen added 4 commits April 7, 2026 21:31
  • Scope batch padding to the current batch (ml_utils.py): create_batch_piped_data was computing max_seq_len over the entire dataset on every batch call, and slicing data[start_ind:end_ind] three times. Scope both to a single batch slice — reduces padding overhead and eliminates redundant iteration.
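A minimal sketch of this fix, with hypothetical names (the real create_batch_piped_data signature differs): slice the batch once, and compute the pad length over that slice rather than over the whole dataset.

```python
def create_batch(data, start_ind, end_ind, pad_id=0):
    # Slice once and reuse, instead of re-slicing data[start_ind:end_ind]
    # three times per call.
    batch = data[start_ind:end_ind]
    # Pad to the longest sequence in THIS batch, not in the whole dataset.
    max_seq_len = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (max_seq_len - len(seq)) for seq in batch]
```

With data = [[1, 2, 3], [4], [5, 6, 7, 8]] and a batch of the first two sequences, both pad to length 3 even though the dataset's longest sequence has length 4.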
  • Update similarities in place (vector_context_model.py): replace list copy + clear + rebuild with a simple in-place loop. Eliminates three intermediate list allocations in the disambiguation hot path.
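The same pattern in miniature, with hypothetical names (not the actual _preprocess_disamb_similarities code): instead of copying the list, clearing it, and rebuilding it via a comprehension, mutate each element in place.

```python
def rescale_in_place(similarities, factor):
    # Before: old = list(similarities); similarities.clear();
    #         similarities.extend(s * factor for s in old)
    #         -> a full copy plus a rebuilt list on every call.
    # After: a single in-place loop, no intermediate allocations.
    for i, sim in enumerate(similarities):
        similarities[i] = sim * factor
```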
  • Use O(1) dict lookups (data_utils.py): undersample_data and encode_category_values both checked membership against category_value2id.values() (a linear scan) on every iteration. Since label_data dicts are keyed by the same IDs, check membership against the dict itself (an O(1) hash lookup).
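An illustrative sketch with hypothetical names: `x in d.values()` walks the values view on every check, while `x in d` is a single hash lookup on the keys.

```python
def known_labels(labels, label_data):
    # Slow: [l for l in labels if l in category_value2id.values()]
    #       -> linear scan of the values view per membership check.
    # Fast: test against a dict keyed by the same IDs (O(1) per check).
    return [label for label in labels if label in label_data]
```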
  • Append instead of concatenating (ml_utils.py): dict.get(k, []) + [item] allocates a new list on every iteration, making example collection O(n*k). Use setdefault + append for O(1) amortized cost per insertion.
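A sketch of the append fix, using a hypothetical grouping helper rather than the actual _eval_predictions code:

```python
def group_by_label(pairs):
    grouped = {}
    for label, example in pairs:
        # Before: grouped[label] = grouped.get(label, []) + [example]
        #         -> builds a brand-new list on every insertion (O(n*k)).
        # After: mutate the existing list; O(1) amortized per append.
        grouped.setdefault(label, []).append(example)
    return grouped
```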
@mart-r (Collaborator) left a comment

Thanks for the contribution!

Looks good to go. Minor changes, but clear improvement.

@mart-r mart-r merged commit 8a630ba into CogStack:main Apr 7, 2026
19 of 20 checks passed