
textEmbed, model = 'distilroberta-base' fails based on length of text #36

Open
sebsilas opened this issue Dec 20, 2022 · 0 comments

sebsilas commented Dec 20, 2022

The textEmbed function can fail when the model is set to 'distilroberta-base', seemingly depending on the length of the given text.

The following does not fail using the default model argument ("bert-base-uncased"):

t1 <- text::textEmbed(texts = "Voice sounds a little too English and a bit boring. The tone could be a bit more upbeat and happy. The pace is a little slow. I think the speed could be a little quicker. I wouldn't want to meet this person as they speak a bit too slowly and I may get bored of them")
# Success

But the same text with model = 'distilroberta-base' fails (note: I am using layers = 5, as per #35)...

t2 <- text::textEmbed(texts = "Voice sounds a little too English and a bit boring. The tone could be a bit more upbeat and happy. The pace is a little slow. I think the speed could be a little quicker. I wouldn't want to meet this person as they speak a bit too slowly and I may get bored of them",
                      model = 'distilroberta-base',
                      layers = 5)
# Fail

t3 <- text::textEmbed(texts = "Voice sounds a little too English and a bit boring. The tone could be a bit more upbeat and happy. The pace is a little slow. I think the speed could be a little quicker. ",
                      model = 'distilroberta-base',
                      layers = 5)

# Fail

t4 <- text::textEmbed(texts = "Voice sounds a little too English and a bit boring. The tone could be a bit more upbeat and happy. The pace is a little slow.",
                      model = 'distilroberta-base',
                      layers = 5)

# Fail

t5 <- text::textEmbed(texts = "Voice sounds a little too English and a bit boring. The tone could be a bit more upbeat and happy.",
                      model = 'distilroberta-base',
                      layers = 5)

# Success

... until I cut the text down to this last length.
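To make the length dependence easier to see, the calls above can be wrapped in a small harness that records, for each text, its character count and whether the call errored. This is only a sketch: `try_embed` and `fake_embed` are hypothetical names made up for illustration, not part of the text package, and the stand-in function just mimics "fails above some length" so the harness can be run without the package.

```r
# Hypothetical helper: call embed_fn on each text and record the outcome.
try_embed <- function(texts, embed_fn) {
  rows <- lapply(texts, function(txt) {
    # NA means the call returned normally; otherwise keep the error message.
    res <- tryCatch(
      {
        embed_fn(txt)
        NA_character_
      },
      error = function(e) conditionMessage(e)
    )
    data.frame(n_char = nchar(txt), ok = is.na(res), error = res,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Stand-in for the real call (assumption: fails for inputs over 100 characters).
fake_embed <- function(x) {
  if (nchar(x) > 100) stop("too long")
  nchar(x)
}

demo <- try_embed(c("short", strrep("a", 150)), fake_embed)
print(demo)
```

With the package installed, the stand-in could be replaced by `function(x) text::textEmbed(texts = x, model = 'distilroberta-base', layers = 5)` and `texts` by the progressively shortened strings from this report.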

The error on failure is:

Error in dplyr::bind_cols(tokens_layer_number, layers_4_token) :
  Can't recycle ..1 (size 120) to match ..2 (size 71).
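For context, this is the generic error dplyr::bind_cols raises when its inputs have different numbers of rows. The sketch below reproduces only that size check with dummy data (120 rows against 71); the `tokens` and `hidden` objects are hypothetical and are not the package's internals, so this does not show why the two sizes differ inside textEmbed.

```r
# Dummy inputs with the two sizes from the error message (assumption: the real
# objects are a token table and a table of layer outputs; these are stand-ins).
tokens <- data.frame(token = rep("tok", 120))
hidden <- data.frame(dim1 = numeric(71))

# bind_cols requires equal row counts (or size 1), so this call errors.
msg <- tryCatch(
  {
    dplyr::bind_cols(tokens, hidden)
    NA_character_
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```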
