Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Warn when automatically choosing SklearnEmbedder backend #1980

Open
freddyheppell opened this issue May 8, 2024 · 3 comments · May be fixed by #1984
Open

Warn when automatically choosing SklearnEmbedder backend #1980

freddyheppell opened this issue May 8, 2024 · 3 comments · May be fixed by #1984

Comments

@freddyheppell
Copy link

freddyheppell commented May 8, 2024

I recently installed an older version of BERTopic which caused an incompatible version of sentence-transformers and huggingface_hub to be installed. With these versions, this error is generated, but if you rely on the default model being used, it will be caught by BERTopic, and the default sklearn embedder will be selected completely silently.

I was wondering why the results were so bad until I tried manually importing SentenceTransformers to create the model myself, which then gave me the ModuleNotFoundError. Had I not tried this, I would've had no idea I was using the SklearnEmbedder.

This doesn't happen if you install the current version of BERTopic in a fresh environment, but it's entirely possible that a similar incompatibility could occur in the future, other package constraints could cause an older version of sentence-transformers to be installed, or a broken environment will cause a ModuleNotFoundError even if all the packages are supposed to be installed.

I think it would be wise to:

  • find an alternative way of more specifically detecting minimal installs (e.g. the entire SentenceTransformers package being missing), to prevent this code path activating unintentionally
  • provide a warning if sklearn is selected automatically. The fact that the default model is all-MiniLM-L6-v2 is mentioned a few times in the docs, so most users will be expecting it and have no idea they could get a Tfidf/SVD model instead. This may negatively affect users of the minimal install who rely on the default TfidfVectorizer/TruncatedSVD model, so maybe a special model name could be added to select this model without a warning?

Of course this wouldn't prevent issues with older versions of BERTopic, but it would reduce the risk of this happening in future.

@MaartenGr
Copy link
Owner

Thank you for sharing this extensive description! Indeed, in those rare expectations the SklearnEmbedder will be chosen when it cannot find or use sentence-transformers.

find an alternative way of more specifically detecting minimal installs (e.g. the entire SentenceTransformers package being missing), to prevent this code path activating unintentionally

Not sure if I understand correctly. The SklearnEmbedder should be selected at some point in the code as it helps with the minimal variant of BERTopic. Regardless of the code path, it all ends up with the SklearnEmbedder right?

provide a warning if sklearn is selected automatically. The fact that the default model is all-MiniLM-L6-v2 is mentioned a few times in the docs, so most users will be expecting it and have no idea they could get a Tfidf/SVD model instead. This may negatively affect users of the minimal install who rely on the default TfidfVectorizer/TruncatedSVD model, so maybe a special model name could be added to select this model without a warning?

Instead of a warning perhaps a lower priority message could be used instead like debug/info to make sure that users can disable that message if needed. In practice, I don't think many users use the minimal version since it require the --no-deps parameter which isn't the most user-friendly method. I would rather have something like pip install bertopic["minimal"] but that's unfortunately not supported last time I checked. Do you think a non-warning message would suffice?

@freddyheppell
Copy link
Author

I think basically the issue is just catching every ModuleNotFoundError isn't specific enough. It means that if the dependencies of sentence-transformers are broken then anyone who installs BERTopic will unexpectedly and unknowingly be using the sklearn embedder.

Sentence-transformers apparently has 41 direct or indirect dependencies, so there's a large surface area for this to happen, and it's already happened I believe twice. Between the package being broken and a fix being released for sentence-transformers, anyone who installed BERTopic and used it would be unknowingly using the sklearn embedder rather than the transformer, unless they manually specified a model.

An easy solution might just be to use the .name property of the ModuleNotFoundError to check which module is causing it, e.g.

try:
  from ._sentencetransformers import SentenceTransformerBackend
  # ...
except ModuleNotFoundError as e:
  if e.name != "sentence_transformers":
    # some other import issue occured, re-throw so the user can see it
    raise e

  # sentence_transformers is completely missing, should be a minimal install
  # ...
  

That way ModuleNotFoundErrors which the user actually needs to see aren't hidden, but the existing behaviour is still preserved.

@MaartenGr
Copy link
Owner

Thanks for the extensive description! This sounds good to me. Additional warnings/messages would definitely help users understand what is being selected. If you have the time, a PR would be greatly appreciated. If not, I'll make sure to put it on the todo list.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants