Length of text passed to nlp pipeline #14
Comments
Hi Emilie! Your code sample actually runs without errors on the Colab demo with Python 3.7 and transformers 4.12.3. Can you provide more information on your environment (Python & package versions)? Fundamentally, the issue seems to come from the number of tokens generated from the input text (the BERT model can only handle up to 512 tokens). However, the …
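One way around the 512-token cap, beyond the truncation discussed in this thread, is to split long reviews into smaller pieces before classification. The `chunk_text` helper below is a hypothetical sketch, not part of this project; it counts whitespace-separated words with a generous margin, since subword tokenizers emit more tokens than words:

```python
def chunk_text(text, max_words=200):
    """Split a long text into chunks of at most max_words words.

    200 words is a deliberately conservative bound: subword tokenizers
    produce more tokens than words, so this keeps each chunk well under
    a 512-token model limit (illustrative heuristic, not exact).
    """
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

# A 451-word "review" becomes 3 chunks of 200, 200, and 51 words.
review = "mot " * 450 + "fin"
chunks = chunk_text(review)
print(len(chunks))  # 3
```

Each chunk could then be passed to the pipeline separately and the per-chunk labels aggregated (e.g. by majority vote), at the cost of losing cross-chunk context.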
Hi Theophile! Python version: 3.8.1 (default, Jan 8 2020, 22:29:32). You say that the algorithm truncates the input automatically. How? From the beginning of the string? From the end?
Hi again! I managed to reproduce your issue and to find a possible fix:

```python
tokenizer = AutoTokenizer.from_pretrained("tblard/tf-allocine", use_fast=True)
model = TFAutoModelForSequenceClassification.from_pretrained("tblard/tf-allocine")
nlp = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer, truncation=True, max_length=512)
```

The culprit might be the tokenizer, which by default has no maximum length set. Setting `model_max_length` explicitly also works:

```python
tokenizer = AutoTokenizer.from_pretrained("tblard/tf-allocine", use_fast=True, model_max_length=512)
model = TFAutoModelForSequenceClassification.from_pretrained("tblard/tf-allocine")
nlp = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer, truncation=True)
```

Now, regarding truncation, there is in-depth documentation here. For text classification it's simple: the tokenizer keeps the first 512 tokens and discards the rest:

```python
text = "J'aime le camembert"
tokens = tokenizer.encode_plus(text, max_length=5, truncation=True)
truncated_text = tokenizer.decode(tokens['input_ids'], skip_special_tokens=True)
print(len(tokens['input_ids']))  # 5
print(truncated_text)  # J'aime
```

Because we use subword tokenizers, the number of generated tokens is greater than the number of words, as the tokenizer may break a word down into multiple tokens. It also automatically adds special tokens. Truncation is often the best way to deal with long sentences; in your example, the polarity can typically be inferred from the first words. Also keep in mind that the model was trained on the Allociné dataset, which does not contain many long reviews.
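The earlier question about truncation direction (beginning or end?) can be mirrored in plain Python. This illustrative sketch assumes the default behaviour described above, where tokens past the limit are dropped from the end, so the start of the text survives:

```python
def truncate_tokens(token_ids, max_length=512):
    # Mirror the default truncation described above: keep the first
    # max_length token ids and drop everything after them.
    return token_ids[:max_length]

ids = list(range(600))      # pretend these are 600 token ids
kept = truncate_tokens(ids)
print(len(kept))            # 512
print(kept[0], kept[-1])    # 0 511 -> the start of the sequence is kept
```

Note that transformers tokenizers also expose a `truncation_side` attribute to drop tokens from the left instead, if the end of a review matters more.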
Thank you so much for your in-depth response. I will try the proposed code and get back to you when I can.
Hi @emiliepicardcantin, did you manage to make it work? |
I have quite long texts that I want to label using your module. I run into the same problem over and over again when using the nlp pipeline. Here is my code:
I get the following error:
See the attachment for more details on the error. Can someone help me? Thank you!
error_nlp_pipeline.txt