[Guided Topic Modeling] ValueError: setting an array element with a sequence. #2036

RTChou · 2024-06-05T13:26:27Z

Hi, I tried to run the example code given in https://maartengr.github.io/BERTopic/getting_started/guided/guided.html#example and got an error.

Example code:

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))["data"]

seed_topic_list = [["drug", "cancer", "drugs", "doctor"],
                   ["windows", "drive", "dos", "file"],
                   ["space", "launch", "orbit", "lunar"]]

topic_model = BERTopic(seed_topic_list=seed_topic_list)
topics, probs = topic_model.fit_transform(docs)

Error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/reneechou/git/BERTopic/bertopic/_bertopic.py", line 400, in fit_transform
    y, embeddings = self._guided_topic_modeling(embeddings)
  File "/Users/reneechou/git/BERTopic/bertopic/_bertopic.py", line 3770, in _guided_topic_modeling
    embeddings[indices] = np.average([embeddings[indices], seed_topic_embeddings[seed_topic]], weights=[3, 1])
  File "/Users/reneechou/miniconda3/envs/bertopic/lib/python3.10/site-packages/numpy/lib/function_base.py", line 511, in average
    a = np.asanyarray(a)
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.

The error occurs when computing the weighted average between a set of document embeddings (embeddings[indices], a 2D array) and their seed topic embedding (seed_topic_embeddings[seed_topic], a 1D array). np.average first converts the list holding both arrays into a single array with np.asanyarray, and that conversion fails because a 2D array and a 1D array cannot be stacked into one regular shape.

This can be solved by expanding the 1D array to match the shape of the 2D array, and calculating the average along axis 0.

Original code (https://github.com/MaartenGr/BERTopic/blob/master/bertopic/_bertopic.py#L3766):

# Average the document embeddings related to the seeded topics with the
# embedding of the seeded topic to force the documents in a cluster
for seed_topic in range(len(seed_topic_list)):
    indices = [index for index, topic in enumerate(y) if topic == seed_topic]
    embeddings[indices] = np.average([embeddings[indices], seed_topic_embeddings[seed_topic]], weights=[3, 1])

Modified code:

# Average the document embeddings related to the seeded topics with the
# embedding of the seeded topic to force the documents in a cluster
for seed_topic in range(len(seed_topic_list)):
    indices = [index for index, topic in enumerate(y) if topic == seed_topic]
    embeddings_ = embeddings[indices]
    seed_topic_embeddings_ = seed_topic_embeddings[seed_topic]
    seed_topic_embeddings_ = np.tile(seed_topic_embeddings_, (embeddings_.shape[0], 1))
    embeddings[indices] = np.average([embeddings_, seed_topic_embeddings_], axis=0, weights=[3, 1])
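For anyone who wants to check this without running BERTopic, here is a minimal sketch on toy arrays. The names doc_embeddings and seed_embedding are hypothetical stand-ins for embeddings[indices] and seed_topic_embeddings[seed_topic]. The try/except is there because whether the original call raises depends on the NumPy release; recent releases raise, and 1.26.4 is the version reported later in this thread.

```python
import numpy as np

# Toy stand-ins (hypothetical names) for embeddings[indices] and
# seed_topic_embeddings[seed_topic]: two documents, one seed vector.
doc_embeddings = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
seed_embedding = np.array([2.0, 4.0, 6.0])

# Original call: the list holds a (2, 3) array and a (3,) array, which
# cannot be stacked into one regular array on recent NumPy releases.
try:
    np.average([doc_embeddings, seed_embedding], weights=[3, 1])
    original_call_failed = False
except ValueError as err:
    original_call_failed = True
    print("ValueError:", err)

# Tiled fix: repeat the seed vector once per document so both inputs
# have shape (2, 3), then average the pair along axis 0.
tiled_seed = np.tile(seed_embedding, (doc_embeddings.shape[0], 1))
fixed = np.average([doc_embeddings, tiled_seed], axis=0, weights=[3, 1])
print(fixed)
```

With weights [3, 1], each output row is (3 * document + seed) / 4, so every document is pulled a quarter of the way toward the seed vector.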


MaartenGr · 2024-06-06T14:27:38Z

Hmmm, although I think I understand the issue, it is not clear to me why it suddenly appears when this has been working fine for a while now (aside from the underlying issues with np.average and weights). Perhaps a new version of numpy?

Either way, I have seen the solution of tiling the embeddings before but was hesitant to implement it since that would increase the size of the seeded topic embeddings quite a bit. If I'm not mistaken, your embeddings would now be twice as big.

RTChou · 2024-06-06T15:21:02Z

Thanks for your prompt reply. I am using numpy version 1.26.4. Your concern totally makes sense to me. It seems the mismatch in array shapes is only a problem for np.average, so another possible fix could be:

# Average the document embeddings related to the seeded topics with the
# embedding of the seeded topic to force the documents in a cluster
for seed_topic in range(len(seed_topic_list)):
    indices = [index for index, topic in enumerate(y) if topic == seed_topic]
    embeddings[indices] = embeddings[indices] * 0.75 + seed_topic_embeddings[seed_topic] * 0.25
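To make the behaviour of this loop easier to see in isolation, here is a rough sketch on toy data. The function blend_with_seed_topics and its doc_weight parameter are hypothetical and not part of BERTopic, and the sketch assumes that documents without a seed topic carry the label -1 in y. It uses a boolean mask instead of a list of indices, which selects the same rows.

```python
import numpy as np

def blend_with_seed_topics(embeddings, y, seed_topic_embeddings, doc_weight=0.75):
    """Hypothetical helper: pull each seeded document toward its seed topic embedding."""
    blended = embeddings.copy()
    y = np.asarray(y)
    for seed_topic in range(len(seed_topic_embeddings)):
        mask = y == seed_topic  # rows assigned to this seed topic
        # The (n_docs, dim) rows and the (dim,) seed vector broadcast row-wise.
        blended[mask] = (
            blended[mask] * doc_weight
            + seed_topic_embeddings[seed_topic] * (1 - doc_weight)
        )
    return blended

# Three toy documents: the first belongs to seed topic 0, the second has
# no seed topic (assumed label -1), the third belongs to seed topic 1.
embeddings = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = [0, -1, 1]
seed_topic_embeddings = np.array([[0.0, 0.0], [1.0, 1.0]])

result = blend_with_seed_topics(embeddings, y, seed_topic_embeddings)
print(result)
```

Only the rows with a seed topic change; the unlabeled row stays as it was.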

MaartenGr · 2024-06-08T07:39:23Z

@RTChou Would this also be possible even though the shapes of the embedding matrices differ?

RTChou · 2024-06-10T02:44:12Z

@MaartenGr Yes, and it is essentially doing broadcasting under the hood in C, so I believe it will be more efficient than the previous solution, which uses np.tile to broadcast an array explicitly.

Here is a toy example showing the calculation of the weighted average of arrays/matrices with different shapes:

Implicit broadcasting

import numpy as np
array1 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
array2 = np.array([2, 4, 6])
avg = array1 * 0.75 + array2 * 0.25
avg

array([[1.25, 2.5 , 3.75],
       [3.5 , 4.75, 6.  ],
       [5.75, 7.  , 8.25]])

Explicit broadcasting

array2_broadcasted = np.tile(array2, (array1.shape[0], 1))
avg_broadcasted = array1 * 0.75 + array2_broadcasted * 0.25
avg_broadcasted

array([[1.25, 2.5 , 3.75],
       [3.5 , 4.75, 6.  ],
       [5.75, 7.  , 8.25]])
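As a sanity check, the sketch below compares the weighted sum against np.average with weights [3, 1] on a tiled copy, using the same toy arrays. It also uses np.broadcast_to only to illustrate the memory point: a broadcast view repeats rows through a stride of 0 rather than copying them, whereas np.tile allocates a full copy.

```python
import numpy as np

array1 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
array2 = np.array([2, 4, 6])

# Weighted sum with implicit broadcasting.
weighted_sum = array1 * 0.75 + array2 * 0.25

# Reference: np.average with weights [3, 1] over an explicitly tiled copy.
tiled = np.tile(array2, (array1.shape[0], 1))
reference = np.average([array1, tiled], axis=0, weights=[3, 1])

same_result = np.allclose(weighted_sum, reference)
print(same_result)

# A broadcast view has stride 0 along axis 0, so every row reuses the
# same memory; the tiled array stores each row separately.
view = np.broadcast_to(array2, array1.shape)
print(view.strides[0], tiled.strides[0])
```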

MaartenGr · 2024-06-14T08:30:00Z

@RTChou Awesome, great proposed solution! I can implement it myself but if you could make a PR, then that would be highly appreciated. Do note that some linting (#2033) will be merged first so that could result in minor conflicts if you create the fix before that PR is merged.

RTChou · 2024-06-14T15:33:08Z

@MaartenGr Sounds good. I will make a PR after that one is merged, then. Thanks for letting me know!
