
TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]] #1429

Closed
DrRaja opened this issue Aug 22, 2022 · 12 comments
Labels
bug / fix Something isn't working help wanted Extra attention is needed

Comments

@DrRaja

DrRaja commented Aug 22, 2022

Hi,
While trying to finetune a bert-base model for multi-label text classification, I keep encountering this error: `TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]`. I looked around and saw people suggesting to check whether there are missing or None values in the dataset. I've checked my dataset and it's been properly preprocessed to remove any NaN and missing values. I even compared my dataset with the toy example's dataset of toxic comments, and the only difference I could see was the number of categories (in my case these are > 30).

Can anyone please help me on this one?

Thank you

@DrRaja added the bug / fix and help wanted labels on Aug 22, 2022
@uakarsh
Contributor

uakarsh commented Aug 26, 2022

Hi @DrRaja, can you help in reproducing the error?

@DrRaja
Author

DrRaja commented Aug 26, 2022

Here's my code:
```python
import torch

import flash
from flash.core.data.utils import download_data
from flash.text import TextClassificationData, TextClassifier

# Create the DataModule
datamodule = TextClassificationData.from_csv(
    "answer_text",
    label_cols,
    train_file="/content/new-train.csv",
    batch_size=128,
    val_split=0.1,
)

# Build the task
model = TextClassifier(
    num_classes=datamodule.num_classes,
    backbone="bert-base-uncased",
    multi_label=True,
    # metrics=F1(num_classes=datamodule.num_classes)
)

# Create the trainer and finetune the model
trainer = flash.Trainer(max_epochs=1)
trainer.fit(model, datamodule=datamodule)
```
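
(`label_cols` above is the list of label column names from my CSV, defined elsewhere in the notebook; roughly along the lines of this sketch, where the column names are placeholders:)

```python
import pandas as pd

# Rough sketch (placeholder names): treat every column other than the text
# column as a binary label column for the multi-label task.
train_df = pd.read_csv("/content/new-train.csv")
label_cols = [col for col in train_df.columns if col != "answer_text"]
```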

The first epoch runs to 73% and then I get an error. This is the stack trace:
```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
in
     28 # model = TextClassifier(num_classes=datamodule.num_classes, backbone="prajjwal1/bert-tiny")
     29 trainer = flash.Trainer(max_epochs=1)
---> 30 trainer.fit(model, datamodule=datamodule)

25 frames
/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_fast.py in _batch_encode_plus(self, batch_text_or_text_pairs, add_special_tokens, padding_strategy, truncation_strategy, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose)
    427             batch_text_or_text_pairs,
    428             add_special_tokens=add_special_tokens,
--> 429             is_pretokenized=is_split_into_words,
    430         )
    431

TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]
```

I've checked my dataset; it doesn't have any missing values or NULLs.

@uakarsh
Contributor

uakarsh commented Aug 27, 2022

Thanks for mentioning the error. I found this online; did you see it? I think the error might be occurring because of the `is_split_into_words` argument or a non-string type in the column.
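
One quick sanity check (just a sketch, reusing the train file and the `answer_text` column from your snippet) would be to look for rows whose text is missing or not a plain string, since the fast tokenizer only accepts strings or pairs of strings:

```python
import pandas as pd

# Sketch: flag rows whose text is missing, not a string, or empty/whitespace-only,
# all of which are worth checking before tokenization.
df = pd.read_csv("/content/new-train.csv")
is_bad = df["answer_text"].apply(lambda x: not isinstance(x, str) or not x.strip())
print(f"{is_bad.sum()} problematic rows out of {len(df)}")
print(df[is_bad].head())
```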

@DrRaja
Author

DrRaja commented Aug 27, 2022

@uakarsh, thank you for the response. Yes, I came across that link while searching for a solution, but I just didn't understand how/where I can set that, as I'm not directly importing the tokenizer. Can you please help me with that?

@uakarsh
Contributor

uakarsh commented Aug 27, 2022

Sure, if possible can you provide me the dataset, so that I can run it on Colab and try to debug it?

Regards,
Akarsh

@DrRaja
Author

DrRaja commented Sep 1, 2022

@uakarsh, thank you, I was able to get the code running. I was getting that error because of very short 2-3 character strings in the text.
Now, after running the training (finetuning a base model), the results are very confusing/unpredictable: it either gives me an empty list of lists or the same random labels for all the test inputs.
I think it's important to mention that my dataset has 318 labels, which are quite imbalanced.
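
For reference, the fix was essentially to drop those rows before building the DataModule, roughly along these lines (a sketch; the exact length cutoff and file names are placeholders):

```python
import pandas as pd

# Sketch: drop rows with very short (2-3 character) text before training.
df = pd.read_csv("/content/new-train.csv")
df = df[df["answer_text"].astype(str).str.strip().str.len() > 3]
df.to_csv("/content/new-train-clean.csv", index=False)
```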

Thank you very much for your help.

@krshrimali
Contributor

Great to hear, @DrRaja - thank you so much @uakarsh for the discussion and the help! Lovely to see the community stepping forward. ⚡ 🔥

@uakarsh
Contributor

uakarsh commented Sep 1, 2022

@DrRaja Good to hear that you have debugged the issue. Maybe you can check that the labels are appropriate. It could also be that the empty list is returned when the confidence is not above a certain threshold, or that a high class imbalance is present.

Maybe we can use something like oversampling or undersampling for this, since I have also faced issues related to imbalanced classes. How about adding a feature like that in Flash, @krshrimali?
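
A minimal sketch of the oversampling idea, not a Flash feature, just pandas preprocessing of the training CSV (assuming one binary 0/1 column per label, i.e. the `label_cols` list from the snippet above; the count and file names are placeholders):

```python
import pandas as pd

# Sketch: duplicate rows that contain any label seen fewer than `min_count`
# times, so rare labels appear more often during training.
min_count = 50
df = pd.read_csv("/content/new-train.csv")
label_counts = df[label_cols].sum()
rare_labels = label_counts[label_counts < min_count].index.tolist()
rare_rows = df[df[rare_labels].any(axis=1)]
df_balanced = pd.concat([df, rare_rows, rare_rows], ignore_index=True)
df_balanced.to_csv("/content/new-train-balanced.csv", index=False)
```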

@DrRaja
Author

DrRaja commented Sep 1, 2022

Thanks for your responses guys, really appreciate it.

Is it possible to change the confidence thresholds in the prediction functions?

@uakarsh
Contributor

uakarsh commented Sep 1, 2022

When you run this notebook, you would get an output like:

```
[[tensor([ 3.0591, -6.4909, 1.8139, -16.4584, 2.6716, -8.2024]), tensor([ 4.9536, -3.4972, 6.0760, -16.4508, 3.2731, -7.8231]), tensor([ -7.2569, -13.6809, -8.6636, -13.1463, -10.7626, -12.5976])]]
```

So I think there is some issue with the threshold, if I am not wrong. Maybe @krshrimali can guide us on where to look for the threshold. I scrolled through the code, i.e. of Flash and the Lightning Trainer, but I tend to get a bit lost about how to go about it.
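
For reference, those numbers look like raw per-label logits; a quick sketch of thresholding them by hand, using the first tensor from the output above:

```python
import torch

# Sketch: convert raw multi-label logits to probabilities with a sigmoid,
# then keep the label indices whose probability clears a chosen threshold.
logits = torch.tensor([3.0591, -6.4909, 1.8139, -16.4584, 2.6716, -8.2024])
probs = torch.sigmoid(logits)
threshold = 0.5
predicted = (probs > threshold).nonzero(as_tuple=True)[0].tolist()
print(probs)      # roughly [0.955, 0.002, 0.860, 0.000, 0.935, 0.000]
print(predicted)  # indices of labels above the threshold: [0, 2, 4]
```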

@krshrimali
Contributor

Hey, @uakarsh and @DrRaja - I'm sorry for the delay. Please allow me a couple of days to get back to you on this. We are working on v0.8 release, hence the delay in response. I'll have to check the context to be able to get back, hopefully this weekend 🤞🏻! Thank you!

@ethanwharris
Collaborator

ethanwharris commented Sep 5, 2022

Hey @uakarsh @DrRaja, I think you are just seeing the raw logits output from the model. To see e.g. probabilities you would need to use:

```python
trainer.predict(model, datamodule=datamodule, output="probabilities")
```

To get classes or labels output with a custom threshold, you can pass the Output object yourself like this:

```python
from flash.core.classification import LabelsOutput  # or ClassesOutput

trainer.predict(model, datamodule=datamodule, output=LabelsOutput(threshold=0.7))
```

Hope that helps 😃 Closing this but please feel free to re-open if you have any other questions
