
TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]] #1429

Closed
DrRaja opened this issue Aug 22, 2022 · 12 comments
Labels
bug / fix Something isn't working help wanted Extra attention is needed

Comments

@DrRaja

DrRaja commented Aug 22, 2022

Hi,
While trying to finetune a bert-base model for multi-label text classification, I keep encountering this error: `TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]`. I looked around and saw people suggesting to check whether there are missing or None values in the dataset. I've checked my dataset and it's been properly preprocessed to remove any NaN and missing values. I even compared my dataset with the toy example's dataset of toxic comments, and the only difference I could see was the number of categories (in my case these are > 30).

Can anyone please help me on this one?

Thank you

@DrRaja added the bug / fix and help wanted labels on Aug 22, 2022
@uakarsh
Contributor

uakarsh commented Aug 26, 2022

Hi @DrRaja, can you help in reproducing the error?

@DrRaja
Author

DrRaja commented Aug 26, 2022

Here's my code:
```python
import torch

import flash
from flash.core.data.utils import download_data
from flash.text import TextClassificationData, TextClassifier

# Create the DataModule
datamodule = TextClassificationData.from_csv(
    "answer_text",
    label_cols,
    train_file="/content/new-train.csv",
    batch_size=128,
    val_split=0.1,
)

# Build the task
model = TextClassifier(
    num_classes=datamodule.num_classes,
    backbone="bert-base-uncased",
    multi_label=True,
    # metrics=F1(num_classes=datamodule.num_classes)
)

# Create the trainer and finetune the model
trainer = flash.Trainer(max_epochs=1)
trainer.fit(model, datamodule=datamodule)
```
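
(`label_cols` above is the list of label column names from my CSV, defined elsewhere in the notebook; roughly along the lines of this sketch, where the column names are placeholders:)

```python
import pandas as pd

# Rough sketch (placeholder names): treat every column other than the text
# column as a binary label column for the multi-label task.
train_df = pd.read_csv("/content/new-train.csv")
label_cols = [col for col in train_df.columns if col != "answer_text"]
```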

The first epoch runs to 73% and then I get an error. This is the stack trace:
```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
in
     28 # model = TextClassifier(num_classes=datamodule.num_classes, backbone="prajjwal1/bert-tiny")
     29 trainer = flash.Trainer(max_epochs=1)
---> 30 trainer.fit(model, datamodule=datamodule)

25 frames
/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_fast.py in _batch_encode_plus(self, batch_text_or_text_pairs, add_special_tokens, padding_strategy, truncation_strategy, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose)
    427             batch_text_or_text_pairs,
    428             add_special_tokens=add_special_tokens,
--> 429             is_pretokenized=is_split_into_words,
    430         )
    431

TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]
```

I've checked my dataset; it doesn't have any missing values or NULLs.

@uakarsh
Contributor

uakarsh commented Aug 27, 2022

Thanks for mentioning the error. I found this online; did you see it? I think the error might be occurring because of the `is_split_into_words` argument or a non-string type in the column.
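
One quick sanity check (just a sketch, reusing the train file and the `answer_text` column from your snippet) would be to look for rows whose text is missing or not a plain string, since the fast tokenizer only accepts strings or pairs of strings:

```python
import pandas as pd

# Sketch: flag rows whose text is missing, not a string, or empty/whitespace-only,
# all of which are worth checking before tokenization.
df = pd.read_csv("/content/new-train.csv")
is_bad = df["answer_text"].apply(lambda x: not isinstance(x, str) or not x.strip())
print(f"{is_bad.sum()} problematic rows out of {len(df)}")
print(df[is_bad].head())
```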

@DrRaja
Author

DrRaja commented Aug 27, 2022

@uakarsh, thank you for the response. Yes, I came across that link while searching for a solution, but I just didn't understand how/where I can set that, as I'm not directly importing the tokenizer. Can you please help me with that?

@uakarsh
Contributor

uakarsh commented Aug 27, 2022

Sure, if possible can you provide me the dataset, so that I can run it on Colab and try to debug it?

Regards,
Akarsh

@DrRaja
Author

DrRaja commented Sep 1, 2022

@uakarsh, thank you, I was able to get the code running. I was getting that error because of very short 2-3 character strings in the text.
Now, after running the training (finetuning a base model), the results are very confusing/unpredictable: it either gives me an empty list of lists or the same random labels for all the test inputs.
I think it's important to mention that my dataset has 318 labels, which are quite imbalanced.
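
For reference, the fix was essentially to drop those rows before building the DataModule, roughly along these lines (a sketch; the exact length cutoff and file names are placeholders):

```python
import pandas as pd

# Sketch: drop rows with very short (2-3 character) text before training.
df = pd.read_csv("/content/new-train.csv")
df = df[df["answer_text"].astype(str).str.strip().str.len() > 3]
df.to_csv("/content/new-train-clean.csv", index=False)
```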

Thank you very much for your help.

@krshrimali
Contributor

Great to hear, @DrRaja - thank you so much @uakarsh for the discussion and the help! Lovely to see the community stepping forward. ⚡ 🔥

@uakarsh
Contributor

uakarsh commented Sep 1, 2022

@DrRaja Good to hear that you have debugged the issue. Maybe you can check that the labels are appropriate. It could also be that the empty list is returned when the confidence is not above a certain threshold, or that a high class imbalance is present.

Maybe we can use something like oversampling or undersampling for this, since I have also faced issues related to imbalanced classes. How about adding a feature like that in Flash, @krshrimali?
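
A minimal sketch of the oversampling idea, not a Flash feature, just pandas preprocessing of the training CSV (assuming one binary 0/1 column per label, i.e. the `label_cols` list from the snippet above; the count and file names are placeholders):

```python
import pandas as pd

# Sketch: duplicate rows that contain any label seen fewer than `min_count`
# times, so rare labels appear more often during training.
min_count = 50
df = pd.read_csv("/content/new-train.csv")
label_counts = df[label_cols].sum()
rare_labels = label_counts[label_counts < min_count].index.tolist()
rare_rows = df[df[rare_labels].any(axis=1)]
df_balanced = pd.concat([df, rare_rows, rare_rows], ignore_index=True)
df_balanced.to_csv("/content/new-train-balanced.csv", index=False)
```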

@DrRaja
Author

DrRaja commented Sep 1, 2022

Thanks for your responses guys, really appreciate it.

Is it possible to change the confidence thresholds in the prediction functions?

@uakarsh
Contributor

uakarsh commented Sep 1, 2022

When you run this notebook, you would get an output like:

```
[[tensor([ 3.0591, -6.4909, 1.8139, -16.4584, 2.6716, -8.2024]), tensor([ 4.9536, -3.4972, 6.0760, -16.4508, 3.2731, -7.8231]), tensor([ -7.2569, -13.6809, -8.6636, -13.1463, -10.7626, -12.5976])]]
```

So I think there is some issue with the threshold, if I am not wrong. Maybe @krshrimali can guide us on where to look for the threshold. I scrolled through the code, i.e. of Flash and the Lightning Trainer, but I tend to get a bit lost about how to go about it.
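
For reference, those numbers look like raw per-label logits; a quick sketch of thresholding them by hand, using the first tensor from the output above:

```python
import torch

# Sketch: convert raw multi-label logits to probabilities with a sigmoid,
# then keep the label indices whose probability clears a chosen threshold.
logits = torch.tensor([3.0591, -6.4909, 1.8139, -16.4584, 2.6716, -8.2024])
probs = torch.sigmoid(logits)
threshold = 0.5
predicted = (probs > threshold).nonzero(as_tuple=True)[0].tolist()
print(probs)      # roughly [0.955, 0.002, 0.860, 0.000, 0.935, 0.000]
print(predicted)  # indices of labels above the threshold: [0, 2, 4]
```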

@krshrimali
Contributor

Hey, @uakarsh and @DrRaja - I'm sorry for the delay. Please allow me a couple of days to get back to you on this. We are working on v0.8 release, hence the delay in response. I'll have to check the context to be able to get back, hopefully this weekend 🤞🏻! Thank you!

@ethanwharris
Collaborator

ethanwharris commented Sep 5, 2022

Hey @uakarsh @DrRaja, I think you are just seeing the raw logits output from the model. To see e.g. probabilities you would need to use:

```python
trainer.predict(model, datamodule=datamodule, output="probabilities")
```

To get classes or labels output with a custom threshold, you can pass the Output object yourself like this:

```python
from flash.core.classification import LabelsOutput  # or ClassesOutput

trainer.predict(model, datamodule=datamodule, output=LabelsOutput(threshold=0.7))
```

Hope that helps 😃 Closing this but please feel free to re-open if you have any other questions
