Warn when automatically choosing SklearnEmbedder backend #1980

freddyheppell · 2024-05-08T11:55:48Z

I recently installed an older version of BERTopic which caused an incompatible version of sentence-transformers and huggingface_hub to be installed. With these versions, this error is generated, but if you rely on the default model being used, it will be caught by BERTopic, and the default sklearn embedder will be selected completely silently.

I was wondering why the results were so bad until I tried manually importing SentenceTransformers to create the model myself, which then gave me the ModuleNotFoundError. Had I not tried this, I would've had no idea I was using the SklearnEmbedder.

This doesn't happen if you install the current version of BERTopic in a fresh environment, but it's entirely possible that a similar incompatibility could occur in the future, other package constraints could cause an older version of sentence-transformers to be installed, or a broken environment will cause a ModuleNotFoundError even if all the packages are supposed to be installed.

I think it would be wise to:

find an alternative way of more specifically detecting minimal installs (e.g. the entire SentenceTransformers package being missing), to prevent this code path activating unintentionally
provide a warning if sklearn is selected automatically. The fact that the default model is all-MiniLM-L6-v2 is mentioned a few times in the docs, so most users will be expecting it and have no idea they could get a Tfidf/SVD model instead. This may negatively affect users of the minimal install who rely on the default TfidfVectorizer/TruncatedSVD model, so maybe a special model name could be added to select this model without a warning?

Of course this wouldn't prevent issues with older versions of BERTopic, but it would reduce the risk of this happening in future.

The text was updated successfully, but these errors were encountered:

MaartenGr · 2024-05-09T20:51:16Z

Thank you for sharing this extensive description! Indeed, in those rare expectations the SklearnEmbedder will be chosen when it cannot find or use sentence-transformers.

find an alternative way of more specifically detecting minimal installs (e.g. the entire SentenceTransformers package being missing), to prevent this code path activating unintentionally

Not sure if I understand correctly. The SklearnEmbedder should be selected at some point in the code as it helps with the minimal variant of BERTopic. Regardless of the code path, it all ends up with the SklearnEmbedder right?

provide a warning if sklearn is selected automatically. The fact that the default model is all-MiniLM-L6-v2 is mentioned a few times in the docs, so most users will be expecting it and have no idea they could get a Tfidf/SVD model instead. This may negatively affect users of the minimal install who rely on the default TfidfVectorizer/TruncatedSVD model, so maybe a special model name could be added to select this model without a warning?

Instead of a warning perhaps a lower priority message could be used instead like debug/info to make sure that users can disable that message if needed. In practice, I don't think many users use the minimal version since it require the --no-deps parameter which isn't the most user-friendly method. I would rather have something like pip install bertopic["minimal"] but that's unfortunately not supported last time I checked. Do you think a non-warning message would suffice?

freddyheppell · 2024-05-09T22:10:57Z

I think basically the issue is just catching every ModuleNotFoundError isn't specific enough. It means that if the dependencies of sentence-transformers are broken then anyone who installs BERTopic will unexpectedly and unknowingly be using the sklearn embedder.

Sentence-transformers apparently has 41 direct or indirect dependencies, so there's a large surface area for this to happen, and it's already happened I believe twice. Between the package being broken and a fix being released for sentence-transformers, anyone who installed BERTopic and used it would be unknowingly using the sklearn embedder rather than the transformer, unless they manually specified a model.

An easy solution might just be to use the .name property of the ModuleNotFoundError to check which module is causing it, e.g.

try:
  from ._sentencetransformers import SentenceTransformerBackend
  # ...
except ModuleNotFoundError as e:
  if e.name != "sentence_transformers":
    # some other import issue occured, re-throw so the user can see it
    raise e

  # sentence_transformers is completely missing, should be a minimal install
  # ...

That way ModuleNotFoundErrors which the user actually needs to see aren't hidden, but the existing behaviour is still preserved.

MaartenGr · 2024-05-10T06:40:25Z

Thanks for the extensive description! This sounds good to me. Additional warnings/messages would definitely help users understand what is being selected. If you have the time, a PR would be greatly appreciated. If not, I'll make sure to put it on the todo list.

freddyheppell linked a pull request May 10, 2024 that will close this issue

Make sklearn embedding backend auto-select more cautious #1984

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Warn when automatically choosing SklearnEmbedder backend #1980

Warn when automatically choosing SklearnEmbedder backend #1980

freddyheppell commented May 8, 2024 •

edited

MaartenGr commented May 9, 2024

freddyheppell commented May 9, 2024

MaartenGr commented May 10, 2024

Warn when automatically choosing SklearnEmbedder backend #1980

Warn when automatically choosing SklearnEmbedder backend #1980

Comments

freddyheppell commented May 8, 2024 • edited

MaartenGr commented May 9, 2024

freddyheppell commented May 9, 2024

MaartenGr commented May 10, 2024

freddyheppell commented May 8, 2024 •

edited