[Guided Topic Modeling] ValueError: setting an array element with a sequence. #2036

Open
RTChou opened this issue Jun 5, 2024 · 6 comments


@RTChou

RTChou commented Jun 5, 2024

Hi, I am trying to run the example code given in https://maartengr.github.io/BERTopic/getting_started/guided/guided.html#example and got an error.

Example code:

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))["data"]

seed_topic_list = [["drug", "cancer", "drugs", "doctor"],
                   ["windows", "drive", "dos", "file"],
                   ["space", "launch", "orbit", "lunar"]]

topic_model = BERTopic(seed_topic_list=seed_topic_list)
topics, probs = topic_model.fit_transform(docs)

Error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/reneechou/git/BERTopic/bertopic/_bertopic.py", line 400, in fit_transform
    y, embeddings = self._guided_topic_modeling(embeddings)
  File "/Users/reneechou/git/BERTopic/bertopic/_bertopic.py", line 3770, in _guided_topic_modeling
    embeddings[indices] = np.average([embeddings[indices], seed_topic_embeddings[seed_topic]], weights=[3, 1])
  File "/Users/reneechou/miniconda3/envs/bertopic/lib/python3.10/site-packages/numpy/lib/function_base.py", line 511, in average
    a = np.asanyarray(a)
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.

The error occurs when computing the weighted average of a set of document embeddings (embeddings[indices], a 2D array) and their seed topic embedding (seed_topic_embeddings[seed_topic], a 1D array). As the traceback shows, np.average first passes the list holding both arrays to np.asanyarray, which fails because a 2D array and a 1D array cannot be combined into a single regular array.
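
A minimal sketch that reproduces the failure outside BERTopic, assuming NumPy 1.24 or later, where building an array from a ragged list raises ValueError. As far as I can tell, older releases fell back to an object array with a deprecation warning, which may be why this line used to work. The array sizes and variable names are made up for illustration and are not taken from BERTopic.

```python
import numpy as np

# Hypothetical stand-ins for the BERTopic variables: 3 documents, 4 dimensions.
doc_embeddings = np.arange(12, dtype=float).reshape(3, 4)  # 2D, shape (3, 4)
seed_embedding = np.ones(4)                                # 1D, shape (4,)


def try_average(docs, seed):
    """Return the ValueError raised by the original expression, or None."""
    try:
        # Same call shape as the failing line: a list of a 2D and a 1D array.
        np.average([docs, seed], weights=[3, 1])
    except ValueError as exc:
        return exc
    return None


error = try_average(doc_embeddings, seed_embedding)
print(type(error).__name__, error)
```

The exception is raised before any weights are applied, so the weights themselves are not the cause.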

This issue can be solved by broadcasting the 1D array to match the shape of the 2D array, and calculating the averages along axis 0.

Original code (https://github.com/MaartenGr/BERTopic/blob/master/bertopic/_bertopic.py#L3766):

# Average the document embeddings related to the seeded topics with the
# embedding of the seeded topic to force the documents in a cluster
for seed_topic in range(len(seed_topic_list)):
    indices = [index for index, topic in enumerate(y) if topic == seed_topic]
    embeddings[indices] = np.average([embeddings[indices], seed_topic_embeddings[seed_topic]], weights=[3, 1])

Modified code:

# Average the document embeddings related to the seeded topics with the
# embedding of the seeded topic to force the documents in a cluster
for seed_topic in range(len(seed_topic_list)):
    indices = [index for index, topic in enumerate(y) if topic == seed_topic]
    embeddings_ = embeddings[indices]
    seed_topic_embeddings_ = seed_topic_embeddings[seed_topic]
    seed_topic_embeddings_ = np.tile(seed_topic_embeddings_, (embeddings_.shape[0], 1))
    embeddings[indices] = np.average([embeddings_, seed_topic_embeddings_], axis=0, weights=[3, 1])
@MaartenGr
Owner

Hmmm, although I think I understand the issue, it is not clear to me why this issue suddenly appears whereas it has been working fine for a while now (aside from the underlying issues with np.average and weights). Perhaps a new version of numpy?

Either way, I have seen the solution of tiling the embeddings before but was hesitant to implement it since that would increase the size of the seeded topic embeddings quite a bit. If I'm not mistaken, your embeddings would now be twice as big.
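
To make the memory concern concrete, here is a rough sketch comparing np.tile, which allocates a full copy of the seed embedding for every selected document, with np.broadcast_to, which returns a read-only view over the original buffer. The sizes (1000 documents, 384 dimensions) are assumptions chosen for illustration only.

```python
import numpy as np

n_docs, dim = 1000, 384  # hypothetical sizes, not taken from the issue
seed_embedding = np.random.default_rng(0).normal(size=dim)

# np.tile materialises a full (n_docs, dim) copy of the seed embedding.
tiled = np.tile(seed_embedding, (n_docs, 1))

# np.broadcast_to returns a read-only view that reuses the original buffer.
view = np.broadcast_to(seed_embedding, (n_docs, dim))

print(tiled.shape, tiled.nbytes)               # size of the temporary copy
print(view.shape, view.strides)                # stride 0 along the document axis
print(np.shares_memory(view, seed_embedding))  # the view owns no new data
```

The tiled array is a temporary the size of the selected document embeddings, whereas the broadcast view adds no storage of its own.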

@RTChou
Author

RTChou commented Jun 6, 2024

Thanks for your prompt reply. I am using numpy version 1.26.4. Your concern totally makes sense to me. It seems that this issue with differing array shapes only occurs in np.average, so another possible fix would be:

# Average the document embeddings related to the seeded topics with the
# embedding of the seeded topic to force the documents in a cluster
for seed_topic in range(len(seed_topic_list)):
    indices = [index for index, topic in enumerate(y) if topic == seed_topic]
    embeddings[indices] = embeddings[indices] * 0.75 + seed_topic_embeddings[seed_topic] * 0.25

@MaartenGr
Owner

@RTChou Would this also be possible even though the shapes of the embedding matrices differ?

@RTChou
Author

RTChou commented Jun 10, 2024

@MaartenGr Yes, and it is essentially doing broadcasting under the hood in C, so I believe it will be more efficient compared to the previous solution, which uses np.tile to broadcast an array explicitly.

Here is a toy example showing the calculation of the weighted average of arrays/matrices with different shapes:

  1. Implicit broadcasting
import numpy as np
array1 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
array2 = np.array([2, 4, 6])
avg = array1 * 0.75 + array2 * 0.25
avg
array([[1.25, 2.5 , 3.75],
       [3.5 , 4.75, 6.  ],
       [5.75, 7.  , 8.25]])
  2. Explicit broadcasting
array2_broadcasted = np.tile(array2, (array1.shape[0], 1))
avg_broadcasted = array1 * 0.75 + array2_broadcasted * 0.25
avg_broadcasted
array([[1.25, 2.5 , 3.75],
       [3.5 , 4.75, 6.  ],
       [5.75, 7.  , 8.25]])
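
The same check can be run against the indexed, in-place update used in the proposed fix. The sketch below is only an illustration: blend_with_seed is a hypothetical helper name (not part of BERTopic), the data is random, and the guard for an empty index list is an added assumption. It compares the broadcasting expression with the np.tile / np.average version using weights [3, 1].

```python
import numpy as np


def blend_with_seed(embeddings, indices, seed_embedding, doc_weight=0.75):
    """Hypothetical helper: pull the selected rows toward a seed embedding in place."""
    if len(indices) == 0:  # no documents assigned to this seed topic
        return embeddings
    # (k, d) * scalar + (d,) * scalar: NumPy broadcasts the seed across the k rows.
    embeddings[indices] = (
        embeddings[indices] * doc_weight + seed_embedding * (1 - doc_weight)
    )
    return embeddings


rng = np.random.default_rng(42)
embeddings = rng.normal(size=(6, 5))
seed_embedding = rng.normal(size=5)
indices = [0, 2, 5]

# Reference result from the np.tile / np.average version, for comparison.
selected = embeddings[indices]
tiled_seed = np.tile(seed_embedding, (selected.shape[0], 1))
expected = np.average([selected, tiled_seed], axis=0, weights=[3, 1])

untouched_before = embeddings[[1, 3, 4]].copy()
blend_with_seed(embeddings, indices, seed_embedding)

print(np.allclose(embeddings[indices], expected))
```

Weights of 3 and 1 normalise to 0.75 and 0.25, which is why the two formulations agree up to floating-point rounding.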

@MaartenGr
Owner

@RTChou Awesome, great proposed solution! I can implement it myself but if you could make a PR, then that would be highly appreciated. Do note that some linting (#2033) will be merged first so that could result in minor conflicts if you create the fix before that PR is merged.

@RTChou
Author

RTChou commented Jun 14, 2024

@MaartenGr Sounds good. I will make a PR once the linting PR is merged, then. Thanks for letting me know!

Development

No branches or pull requests

2 participants