I periodically seem to encounter the following error:
Traceback (most recent call last):
  File "<string>", line 4, in <module>
  File "/opt/homebrew/lib/python3.11/site-packages/bertopic/_bertopic.py", line 550, in transform
    probabilities = self._map_probabilities(probabilities, original_topics=True)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/bertopic/_bertopic.py", line 4124, in _map_probabilities
    mapped_probabilities[:, to_topic] += probabilities[:, from_topic]
                                         ~~~~~~~~~~~~~^^^^^^^^^^^^^^^
IndexError: index 14 is out of bounds for axis 1 with size 14
I am unsure how to debug it because it appears in some runs and not in others. In each case there is a BERTopic model of the form BERTopic(embedding_model=embedding_model, umap_model=umap_model, hdbscan_model=hdbscan_model, representation_model=representation_model, calculate_probabilities=True). I fit the model successfully using fit_transform and then call transform to compute topics and probabilities for a new sample. In each case I provide both the documents and the embeddings. The code operates over a collection of document sets, so it is run as follows:
for key in topic_models:
    topics[key], _ = topic_models[key].fit_transform(datasets[key], embeddings[key])
I know the models fit successfully because I can obtain topics from them without any error. It is only when calling transform that the error intermittently appears. Its stochastic appearance suggests it has something to do with the fitted topics, but I am unclear as to how to debug it.
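One way to narrow this down might be to check, before the mapping loop runs, which index pairs cannot fit in the arrays involved. The sketch below is only an illustration: find_bad_indices is a hypothetical helper, not part of BERTopic, and the mapping and probability matrix are made-up data shaped to match the error message (15 topics against a 14-column matrix).

```python
import numpy as np

def find_bad_indices(mappings, probabilities, n_outliers):
    """Return (from_topic, to_topic) pairs that would raise IndexError.

    Hypothetical helper: mirrors the sizing used in the quoted snippet.
    """
    n_from = probabilities.shape[1]                   # columns available to read
    n_to = len(set(mappings.values())) - n_outliers   # columns allocated to write
    bad = []
    for from_topic, to_topic in mappings.items():
        if from_topic == -1 or to_topic == -1:
            continue  # outlier entries are skipped by the original loop too
        if from_topic >= n_from or to_topic >= n_to:
            bad.append((from_topic, to_topic))
    return bad

# Made-up data: topics 0..14 plus the outlier topic, but only 14 probability columns.
mappings = {-1: -1, **{i: i for i in range(15)}}
probabilities = np.zeros((3, 14))
bad = find_bad_indices(mappings, probabilities, n_outliers=1)
print(bad)
```

If a check like this reports a pair in a failing run, it would show whether the read side (from_topic) or the write side (to_topic) is the one out of range.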
In this code:
# Map array of probabilities (probability for assigned topic per document)
if probabilities is not None:
    if len(probabilities.shape) == 2:
        mapped_probabilities = np.zeros((probabilities.shape[0],
                                         len(set(mappings.values())) - self._outliers))
        for from_topic, to_topic in mappings.items():
            if to_topic != -1 and from_topic != -1:
                mapped_probabilities[:, to_topic] += probabilities[:, from_topic]
        return mapped_probabilities
return probabilities
Is to_topic guaranteed to be sequential? Could there be a gap in the indices? I don't know the code base well enough, but len(set(mappings.values())) may be the issue. Maybe something like:
if probabilities is not None:
    if len(probabilities.shape) == 2:
        # Find the maximum 'to_topic' index, ensuring the array is large enough
        max_to_topic = max(mappings.values())
        # Initialize 'mapped_probabilities' with a size based on the maximum index found
        mapped_probabilities = np.zeros((probabilities.shape[0], max_to_topic + 1 - self._outliers))
        for from_topic, to_topic in mappings.items():
            if to_topic != -1 and from_topic != -1:
                # Safely add probabilities, knowing 'mapped_probabilities' has enough columns
                mapped_probabilities[:, to_topic] += probabilities[:, from_topic]
        # If necessary, additional steps to handle outliers or resize the array can be added here
        return mapped_probabilities
In this code, non-sequential indices are handled naturally. I do not know, however, whether non-sequential indices are symptomatic of a deeper issue. HTH.
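To make the gap concern concrete, here is a small self-contained sketch using a made-up mapping in which to_topic skips index 2. Whether BERTopic can actually produce such a mapping is exactly the open question, so the data here is purely hypothetical. It compares sizing the array by the number of distinct targets with sizing it by the largest target index.

```python
import numpy as np

# Hypothetical mapping with a gap: targets are 0, 1, 3 (plus the outlier -1).
mappings = {-1: -1, 0: 0, 1: 1, 2: 3}
n_outliers = 1  # assumed: one outlier topic is present
probabilities = np.full((2, 3), 0.25)

count_based = len(set(mappings.values())) - n_outliers  # distinct targets minus outlier
max_based = max(mappings.values()) + 1                  # largest target index plus one

def map_probs(n_cols):
    """Apply the same accumulation loop as the quoted snippet."""
    mapped = np.zeros((probabilities.shape[0], n_cols))
    for from_topic, to_topic in mappings.items():
        if to_topic != -1 and from_topic != -1:
            mapped[:, to_topic] += probabilities[:, from_topic]
    return mapped

try:
    map_probs(count_based)
    count_based_ok = True
except IndexError:
    count_based_ok = False  # target index 3 does not fit in 3 columns

mapped = map_probs(max_based)  # fits, leaving the unused column 2 as zeros
```

With a gap, the count-based size is one column short, while the max-based size works at the cost of an all-zero column for the missing index.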
I should note that I am unclear on exactly what self._outliers does, so I left it in. Maybe the size should just be max_to_topic + 1? That is what I would have written without self._outliers, but I kept it because I have not had time to look carefully at what it is.
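A rough sketch of why the subtraction may matter, under a stated assumption: I am guessing that self._outliers is 1 when an outlier topic (-1) exists and 0 otherwise, based only on how it is used in the quoted snippet. In the original sizing, the set of mapping values would include -1, so subtracting one makes sense there. The maximum index never counts -1, so subtracting again could drop a needed column. The function n_columns is a hypothetical helper for illustration only.

```python
def n_columns(mappings, n_outliers, subtract_outliers):
    """Column count derived from the largest target index.

    n_outliers stands in for self._outliers (assumed to be 0 or 1).
    """
    max_to_topic = max(mappings.values())
    return max_to_topic + 1 - (n_outliers if subtract_outliers else 0)

# Made-up mapping: sequential targets 0..2 plus the outlier topic.
mappings = {-1: -1, 0: 0, 1: 1, 2: 2}

with_subtraction = n_columns(mappings, n_outliers=1, subtract_outliers=True)
without_subtraction = n_columns(mappings, n_outliers=1, subtract_outliers=False)
largest_index = max(mappings.values())
```

Under that assumption, the version with the subtraction allocates too few columns for the largest index even when the targets are sequential, so max_to_topic + 1 looks like the safer size.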