
Use model's own embedding to compute similarity smart tags. #603

Closed
Dref360 opened this issue Jul 23, 2023 · 2 comments
Labels
enhancement New feature or request

Comments


Dref360 commented Jul 23, 2023

Hello!

It would be useful to me if I could use the model's own embedding to compute similarity smart tags. For my particular use case, semantic embeddings are not useful.

Proposal

config.json

{
	"similarity" : {"faiss_encoder": "model" } // could be 'self-similar'? 
}
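To make the proposal concrete, here is a hedged sketch of what a `"faiss_encoder": "model"` option would compute, using hypothetical data and plain numpy in place of FAISS: the model's own per-example embeddings are L2-normalized and compared by cosine similarity to find each example's nearest neighbour.

```python
import numpy as np

# Hypothetical data standing in for the model's own embeddings.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5, 768))  # 5 examples, 768-dim (e.g. BERT base)

# Normalize so the inner product is cosine similarity
# (this is what a FAISS inner-product index would see).
normalized = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
similarities = normalized @ normalized.T  # (5, 5) pairwise cosine similarities

# Most similar *other* example for each row (mask out self-similarity).
np.fill_diagonal(similarities, -np.inf)
nearest = similarities.argmax(axis=1)
```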

One can get the embedding of their HuggingFace model with:

inputs = tokenizer(...)
model: BertForSequenceClassification = ...
embedding = model.base_model(**inputs).last_hidden_state[:, 0, :]  # take the first ([CLS]) token embedding

Another approach is to load the same model with the feature-extraction task, though that might be more involved:

pipe = pipeline('feature-extraction', model='your_model', truncation=True)

EDIT: One can also do

pipe2 = pipeline('feature-extraction', model=pipe.model.base_model, tokenizer=pipe.tokenizer)

where pipe is the initial sequence-classification pipeline.
@Dref360 Dref360 added the enhancement New feature or request label Jul 23, 2023

gabegma commented Aug 10, 2023

Hey @Dref360 - that's a great suggestion, we've had this request a few times. We'll aim to prioritize that in the coming weeks.


Dref360 commented Dec 27, 2023

I didn't realize that sentence-transformers can now load models directly from the Hub and convert them. The only caveat is that it uses mean pooling instead of the first token. Not a terrible issue AFAIK.

So if your model name is cardiffnlp/twitter-roberta-base-sentiment-latest, Sentence Transformers will extract the RoBERTa base when you do

model = SentenceTransformer('cardiffnlp/twitter-roberta-base-sentiment-latest')

It also supports authentication, so it works with private repos as well.
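The mean-pooling caveat can be made concrete with a toy example (plain numpy, not Sentence Transformers internals): mean pooling averages every token embedding, whereas the first-token strategy keeps only position 0.

```python
import numpy as np

# Toy token embeddings: 4 tokens, 3 dimensions.
token_embeddings = np.arange(12, dtype=float).reshape(4, 3)

first_token = token_embeddings[0]            # [CLS]-style pooling -> [0., 1., 2.]
mean_pooled = token_embeddings.mean(axis=0)  # mean pooling -> [4.5, 5.5, 6.5]
```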

Closing!

@Dref360 Dref360 closed this as completed Dec 27, 2023