Error in transform probabilities

I intermittently encounter the following error:

```
Traceback (most recent call last):
  File "<string>", line 4, in <module>
  File "/opt/homebrew/lib/python3.11/site-packages/bertopic/_bertopic.py", line 550, in transform
    probabilities = self._map_probabilities(probabilities, original_topics=True)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/bertopic/_bertopic.py", line 4124, in _map_probabilities
    mapped_probabilities[:, to_topic] += probabilities[:, from_topic]
                                         ~~~~~~~~~~~~~^^^^^^^^^^^^^^^
IndexError: index 14 is out of bounds for axis 1 with size 14
```

I am unsure how to help debug it because it appears in some runs and not others. In each case there is a BERTopic model of the form `BERTopic(embedding_model=embedding_model, umap_model=umap_model, hdbscan_model=hdbscan_model, representation_model=representation_model, calculate_probabilities=True)`. I fit the model successfully using `fit_transform` and then call `transform` to compute topics and probabilities for a new sample. In each case I provide both the documents and the embeddings. The code operates over a collection of document sets, so it is run as follows:

```
for key in topic_models:
    topics[key], _ = topic_models[key].fit_transform(datasets[key], embeddings[key])
```

I know the models fit successfully because I can obtain topics from them and no error is raised. It is only when calling `transform` that the error sometimes appears. Its stochastic appearance suggests it has something to do with the fitted topics, but I am unclear on how to debug it.

In this code:

```
# Map array of probabilities (probability for assigned topic per document)
if probabilities is not None:
    if len(probabilities.shape) == 2:
        mapped_probabilities = np.zeros((probabilities.shape[0],
                                         len(set(mappings.values())) - self._outliers))
        for from_topic, to_topic in mappings.items():
            if to_topic != -1 and from_topic != -1:
                mapped_probabilities[:, to_topic] += probabilities[:, from_topic]

        return mapped_probabilities

return probabilities
```
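To experiment with this logic outside BERTopic, the loop can be copied into a standalone function. This is only a sketch: the name `map_probabilities_sketch`, the `n_outliers` argument (standing in for `self._outliers`), and both example `mappings` dicts are hypothetical and not taken from the library or from my fitted models.

```python
import numpy as np


def map_probabilities_sketch(probabilities, mappings, n_outliers):
    """Standalone copy of the loop above; n_outliers stands in for self._outliers."""
    # Same sizing rule as the excerpt: number of distinct targets minus outliers
    n_columns = len(set(mappings.values())) - n_outliers
    mapped = np.zeros((probabilities.shape[0], n_columns))
    for from_topic, to_topic in mappings.items():
        if to_topic != -1 and from_topic != -1:
            mapped[:, to_topic] += probabilities[:, from_topic]
    return mapped


probs = np.full((2, 3), 0.25)

# Hypothetical mappings whose targets (0, 1, 3) skip index 2
gapped = {-1: -1, 0: 0, 1: 1, 2: 3}
try:
    map_probabilities_sketch(probs, gapped, n_outliers=1)
    gap_raises = False
except IndexError:
    gap_raises = True

# Hypothetical mappings with contiguous targets: topics 1 and 2 merge into 1
contiguous = {-1: -1, 0: 0, 1: 1, 2: 1}
merged = map_probabilities_sketch(probs, contiguous, n_outliers=1)
```

With these made-up inputs the gapped mapping raises an `IndexError`, while the contiguous one sums the merged columns without error. Whether real fitted models can produce a gapped mapping is exactly what I do not know.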

Is `to_topic` guaranteed to be sequential? Could there be a gap in the indices? I don't know the code base well enough, but `len(set(mappings.values()))` may be the issue, since it counts distinct targets rather than covering the largest index. Maybe something like:

```
if probabilities is not None:
    if len(probabilities.shape) == 2:
        # Find the maximum 'to_topic' index, ensuring the array is large enough
        max_to_topic = max(mappings.values())
        
        # Initialize 'mapped_probabilities' with a size based on the maximum index found
        mapped_probabilities = np.zeros((probabilities.shape[0], max_to_topic + 1 - self._outliers))
        
        for from_topic, to_topic in mappings.items():
            if to_topic != -1 and from_topic != -1:
                # Safely add probabilities, knowing 'mapped_probabilities' has enough columns
                mapped_probabilities[:, to_topic] += probabilities[:, from_topic]
                
        # If necessary, additional steps to handle outliers or resize the array can be added here

        return mapped_probabilities
```

In this code, non-sequential indices are handled naturally. I do not know, however, whether non-sequential indices are symptomatic of a deeper issue. HTH.

I should note that I am unclear on exactly what `self._outliers` does, so I left it in. Maybe the size should simply be `max_to_topic + 1`? That is what I would have written otherwise, but I kept `self._outliers` because I have not had time to look carefully at what it is.
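One more observation: the caret markers in the traceback appear to sit under `probabilities[:, from_topic]`, so the out-of-range index may be on the array being read rather than on `mapped_probabilities`. A small check that covers both sides might help narrow this down. It is a sketch only: the helper name `find_out_of_range` and the example numbers are hypothetical, chosen to mirror the reported message (index 14, size 14), and they are not a diagnosis.

```python
def find_out_of_range(mappings, n_source_columns, n_target_columns):
    """Return mapping pairs whose indices would not fit the given column counts."""
    bad_source, bad_target = [], []
    for from_topic, to_topic in mappings.items():
        if from_topic == -1 or to_topic == -1:
            continue  # the loop in the excerpt skips outlier entries
        if from_topic >= n_source_columns:
            bad_source.append((from_topic, to_topic))
        if to_topic >= n_target_columns:
            bad_target.append((from_topic, to_topic))
    return {"source": bad_source, "target": bad_target}


# Hypothetical case: 15 topics in the mapping, but only 14 probability columns
mappings = {-1: -1, **{i: i for i in range(15)}}
# Sizing rule from the excerpt, assuming one outlier topic
n_target = len(set(mappings.values())) - 1
report = find_out_of_range(mappings, n_source_columns=14, n_target_columns=n_target)
```

In this made-up case only the `source` list is non-empty, which is a situation that resizing `mapped_probabilities` alone would not fix. Running the same check on the real `mappings` and the shape of `probabilities` just before the failing line should show which side is actually out of range.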

Error in transform probabilities #1807

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Error in transform probabilities #1807

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions