perf: fix low-hanging performance issues in MetaCAT and linking #400
Merged
mart-r merged 4 commits into CogStack:main (Apr 7, 2026)
Conversation
create_batch_piped_data was computing max_seq_len over the entire dataset on every batch call, and slicing data[start_ind:end_ind] three times. Scope both to a single batch slice — reduces padding overhead and eliminates redundant iteration.
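A minimal sketch of the before/after shape of this fix. The function name `create_batch_piped_data` is from the PR, but the signature, tuple layout, and padding scheme here are assumptions for illustration only:

```python
# Hypothetical data layout: each row is (token_ids, center_pos, label).
def create_batch_before(data, start_ind, end_ind, pad_id):
    # Before: max computed over the ENTIRE dataset on every call...
    max_seq_len = max(len(row[0]) for row in data)
    # ...and the same slice taken three separate times.
    x = [row[0] for row in data[start_ind:end_ind]]
    cpos = [row[1] for row in data[start_ind:end_ind]]
    y = [row[2] for row in data[start_ind:end_ind]]
    x = [seq + [pad_id] * (max_seq_len - len(seq)) for seq in x]
    return x, cpos, y

def create_batch_after(data, start_ind, end_ind, pad_id):
    # After: slice once, and pad only to the longest sequence in THIS batch.
    batch = data[start_ind:end_ind]
    max_seq_len = max(len(row[0]) for row in batch)
    x = [row[0] + [pad_id] * (max_seq_len - len(row[0])) for row in batch]
    cpos = [row[1] for row in batch]
    y = [row[2] for row in batch]
    return x, cpos, y
```

The labels and positions are unchanged; only the padding width shrinks from the dataset-wide maximum to the batch-local maximum.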
Replace list copy + clear + rebuild with a simple in-place loop. Eliminates three intermediate list allocations in the disambiguation hot path.
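A sketch of the pattern being replaced. The actual weighting and threshold logic in `_preprocess_disamb_similarities` is not shown in the PR, so the condition and factor below are placeholders:

```python
def preprocess_before(sims, weight=0.5, threshold=0.1):
    # Before: full copy, clear, then rebuild via a comprehension,
    # creating intermediate lists on every call.
    old = list(sims)
    sims.clear()
    sims.extend([s * weight if s > threshold else s for s in old])

def preprocess_after(sims, weight=0.5, threshold=0.1):
    # After: mutate elements in place; no intermediate lists.
    for i, s in enumerate(sims):
        if s > threshold:
            sims[i] = s * weight
```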
undersample_data and encode_category_values both checked membership against category_value2id.values() (linear scan) on every iteration. Since label_data dicts are keyed by the same IDs, check membership against the dict itself (O(1) hash lookup).
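The core of this change in miniature. The dict names and shapes below are illustrative, not taken from `data_utils.py`:

```python
# Mapping from category value to id, and a label dict keyed by those same ids.
category_value2id = {"negated": 0, "affirmed": 1, "hypothetical": 2}
label_counts = {0: 10, 1: 25, 2: 5}   # assumed shape: keyed by the same ids

labels = [0, 1, 3, 2, 1]

# Before: dict.values() membership is a linear scan per check -> O(n * k).
valid_slow = [lab for lab in labels if lab in category_value2id.values()]

# After: membership against a dict keyed by the same ids is an O(1) hash lookup.
valid_fast = [lab for lab in labels if lab in label_counts]
```

Both forms filter out the unknown id `3`; only the lookup cost changes.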
dict.get(k, []) + [item] allocates a new list on every iteration, making example collection O(n*k). Use setdefault + append for O(1) amortized per insertion.
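The two idioms side by side, on made-up data (the real accumulation in `_eval_predictions` groups prediction examples, but the keys and items here are placeholders):

```python
pairs = [("tp", "a"), ("fp", "b"), ("tp", "c")]

# Before: get + concat copies the accumulated list on every insertion.
slow = {}
for label, item in pairs:
    slow[label] = slow.get(label, []) + [item]

# After: setdefault returns the existing list; append is O(1) amortized.
fast = {}
for label, item in pairs:
    fast.setdefault(label, []).append(item)
```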
mart-r (Collaborator) approved these changes on Apr 7, 2026, and left a comment:
Thanks for the contribution!
Looks good to go. Minor changes, but clear improvement.
Summary
Four small, safe performance fixes across the MetaCAT and linking components:
- Scope batch padding to current batch (ml_utils.py) — create_batch_piped_data was computing max_seq_len over the entire dataset on every batch call, then slicing data[start_ind:end_ind] three times. Now scoped to a single batch slice. Reduces unnecessary padding and iteration.
- In-place similarity update (vector_context_model.py) — _preprocess_disamb_similarities was copying the similarities list, clearing it, and rebuilding via list comprehension. Replaced with a simple in-place loop, eliminating three intermediate allocations in the disambiguation hot path.
- O(1) dict lookup instead of O(n) values scan (data_utils.py) — undersample_data and encode_category_values checked membership against category_value2id.values() (linear scan) on every iteration. Since label_data dicts use the same keys, check against the dict directly.
- Append instead of list concatenation (ml_utils.py) — _eval_predictions used dict.get(k, []) + [item], which allocates a new list per iteration. Switched to setdefault + append for O(1) amortized insertions.

All changes are behaviour-preserving — no new dependencies, no config changes, no API changes.
Test plan
medcat-v2 unit tests pass across Python 3.10–3.13