You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Currently the transform method of the BERTopic has to recieve a list of strings as the documents and if embeddings are present then it needs to be a 2darray of shape (len(documents, embedding_dimension). This requires extra reshape calls when trying to call the transform method on a single document.
For example it would be nicer to do this:
document='This is a really interesting document'document_embedding=np.random.rand(768)
topic_model.transform(document, document_embedding)
Example of what you currently have to do
document='This is a really interesting document'document_embedding=np.random.rand(768)
topic_model.transform([document], document_embedding.reshape(1,-1))
Currently if you run it you get this error.
BFile ~/code/BERTopic/bertopic/_utils.py:55, in check_embeddings_shape(embeddings, docs)
else:
if embeddings.shape[0] != len(docs):
---> raise ValueError("Make sure that the embeddings are a numpy array with shape: "
"(len(docs), vector_dim) where vector_dim is the dimensionality "
"of the vector embeddings. ")
ValueError: Make sure that the embeddings are a numpy array with shape: (len(docs), vector_dim) where vector_dim is the dimensionality of the vector embeddings.
Why I think it is happening
This is because the embeddings shape check inside transform happens before:
the document argument is converted into a list if it is str
The embedding is reshaped from a 1d array to 2d array (Note that it currently never does this).
Thoughts on making it better
Am I missing something here about np arrays or the transform function? It seems to me that wanting to transform a single document is common enough that the function could have that little bit of extra functionality.
I have already made the change on a fork for my uses did you agree with the idea and would you like a PR?
The text was updated successfully, but these errors were encountered:
It seems to me that wanting to transform a single document is common enough that the function could have that little bit of extra functionality.
It's interesting but in my experience, it rarely happens that users try it on a single document and also pass an embedding. Transforming a single document to an embedding is incredibly cheap, so there is no need to then also pass the embedding. I think that's why this issue hasn't been caught until now if I remember most issues of the last couple of years well enough.
Either way, a fix should indeed be necessary here to transform an embedding into the right shape when there is only one document.
Issue
Currently the
transform
method of theBERTopic
has to recieve a list of strings as the documents and if embeddings are present then it needs to be a 2darray of shape (len(documents, embedding_dimension). This requires extra reshape calls when trying to call thetransform
method on a single document.For example it would be nicer to do this:
Example of what you currently have to do
Currently if you run it you get this error.
Why I think it is happening
This is because the embeddings shape check inside
transform
happens before:document
argument is converted into a list if it is strThoughts on making it better
Am I missing something here about np arrays or the transform function? It seems to me that wanting to transform a single document is common enough that the function could have that little bit of extra functionality.
I have already made the change on a fork for my uses did you agree with the idea and would you like a PR?
The text was updated successfully, but these errors were encountered: