Fix TopicMapper.add_new_topics inserting None placeholders that break model.save #2480
Closed
sebastianbreguel wants to merge 1 commit into MaartenGr:master from
Closes #2432 TopicMapper.add_new_topics filled the intermediate history columns of new rows with None placeholders. The class docstring documents mappings_ as a matrix of integers, and _save_utils.save_topics relies on that contract by casting it to np.array(..., dtype=int). After partial_fit produced new clusters, model.save(serialization='safetensors') crashed with TypeError because None is not castable to int. Backfill the intermediate columns with the topic's own key instead, so mappings_ stays a homogeneous integer matrix. get_mappings only reads the first/last/second-to-last columns and never relied on the None sentinel, so callers are unaffected. Adds a unit regression test on TopicMapper that simulates two prior reduce_topics calls (so the buggy length - 2 path is exercised), then calls add_new_topics and asserts the matrix round-trips through np.array(..., dtype=int) and that pre-existing rows are untouched.
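A minimal sketch of the failure mode described above, using toy data rather than a real BERTopic model. The row values are made up for illustration; only the np.array(..., dtype=int) cast comes from the description.

```python
import numpy as np

# Toy mapping history with four columns. The last row mimics a new topic
# whose intermediate columns were filled with None placeholders.
with_placeholders = [
    [0, 0, 0, 0],
    [1, 1, 0, 0],
    [3, None, None, 2],
]

try:
    np.array(with_placeholders, dtype=int)
    cast_failed = False
except TypeError:
    # None cannot be converted to an integer, so the cast is rejected.
    cast_failed = True

# The same row with the intermediate columns backfilled with the topic's own key.
backfilled = [
    [0, 0, 0, 0],
    [1, 1, 0, 0],
    [3, 3, 3, 2],
]
as_int = np.array(backfilled, dtype=int)  # casts cleanly to an integer matrix
```

The first and last columns of the new row are identical in both variants, which is consistent with the claim that get_mappings callers are unaffected.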
Author
@MaartenGr could you review this PR?
What does this PR do?
Fixes #2432

BERTopic.save(serialization="safetensors") crashes with TypeError after partial_fit discovers new clusters.

Root cause
TopicMapper.add_new_topics fills the intermediate history columns of new rows with None placeholders. The TopicMapper class docstring documents mappings_ as "A matrix indicating the mappings from one topic to another" (i.e., integers), and _save_utils.save_topics relies on that contract by casting the matrix with np.array(..., dtype=int). None is not castable to int, so the cast raises TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType' whenever a model is saved after partial_fit produced new clusters.

Fix
Backfill the intermediate columns with the topic's own key instead of None, keeping mappings_ a homogeneous integer matrix.

The fix is at the source rather than in _save_utils.py: the docstring promises an integer matrix, and the None placeholders were a contract violation. get_mappings only ever reads columns [0, -1] (or [-3, -1]) and never relied on the None sentinel for new rows, so callers are unaffected. Pre-existing rows added by __init__ and add_mappings already contained only integers, so the new rows are now consistent with them.

Tests
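The backfill and the kind of regression check it calls for can be sketched roughly as follows. TinyMapper is a hypothetical stand-in written for illustration, not BERTopic's TopicMapper; its __init__ and add_mappings logic are assumptions, and only the backfill in add_new_topics follows what this PR describes.

```python
import numpy as np


class TinyMapper:
    """Hypothetical stand-in for a topic-mapping history (not BERTopic's class)."""

    def __init__(self, topics):
        # Each row starts as [original topic, current topic].
        self.mappings_ = [[t, t] for t in sorted(set(topics))]

    def add_mappings(self, mappings):
        # Append one column: where each row's current topic maps to
        # (unchanged if the topic is not mentioned in `mappings`).
        for row in self.mappings_:
            row.append(mappings.get(row[-1], row[-1]))

    def add_new_topics(self, mappings):
        # Backfill every column except the last with the topic's own key,
        # so no None placeholder ever enters the matrix.
        length = len(self.mappings_[0])
        for key, value in mappings.items():
            self.mappings_.append([key] * (length - 1) + [value])


def check_integer_matrix():
    mapper = TinyMapper([0, 1, 2])
    mapper.add_mappings({2: 1})  # first simulated reduction
    mapper.add_mappings({1: 0})  # second simulated reduction
    before = [row[:] for row in mapper.mappings_]

    mapper.add_new_topics({3: 2, 4: 3})

    # The cast must not raise, and the older rows must be untouched.
    matrix = np.array(mapper.mappings_, dtype=int)
    assert mapper.mappings_[: len(before)] == before
    return matrix
```

Calling check_integer_matrix() returns a 5 x 4 integer array in which the row for new topic 3 is [3, 3, 3, 2]: original key first, current value last, and the key repeated in between.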
Adds test_topic_mapper_add_new_topics_keeps_integer_matrix in tests/test_bertopic.py. The test:
- Builds a TopicMapper and simulates two prior reduce_topics calls so the matrix has more than 2 columns (otherwise the buggy length - 2 path is hidden).
- Calls add_new_topics({3: 2, 4: 3}) to mimic new clusters being discovered during partial_fit.
- Asserts that the matrix round-trips through np.array(..., dtype=int), exactly what _save_utils.save_topics does.
- Asserts that the original (key) and current (value) state of new rows are preserved, and that the intermediate columns are backfilled with the topic's own key.

uv run pytest tests/test_bertopic.py → 10 passed, 1 skipped (cuML), confirming nothing downstream (including online_topic_model's partial_fit flow) depended on the None sentinel. ruff check and ruff format --check are clean.

Before submitting