
Bug Report: Unexpected Keyword Argument 'padding_side' in PreTrainedTokenizerFast #37989


Open
1 of 4 tasks
yunqianluo opened this issue May 7, 2025 · 1 comment

@yunqianluo

System Info

absl-py==2.1.0
accelerate==1.6.0
aiohappyeyeballs==2.3.5
aiohttp==3.10.2
aiosignal==1.3.1
aniso8601==9.0.1
annotated-types==0.7.0
anyio==4.4.0
async-timeout==4.0.3
attrs==24.2.0
blinker==1.8.2
certifi==2024.7.4
charset-normalizer==3.3.2
chinesebert==0.2.1
click==8.1.7
confluent-kafka==2.5.0
datasets==1.18.3
dill==0.3.8
elastic-transport==8.15.0
elasticsearch==8.14.0
exceptiongroup==1.2.2
fastapi==0.112.0
fastcore==1.3.29
filelock==3.15.4
Flask==3.0.3
Flask-Cors==4.0.1
Flask-RESTful==0.3.10
frozenlist==1.4.1
fsspec==2024.6.1
gevent==24.10.3
greenlet==3.1.1
grpcio==1.65.4
gunicorn==23.0.0
h11==0.14.0
huggingface-hub==0.24.5
idna==3.7
importlib_metadata==8.2.0
iocextract==1.16.1
itsdangerous==2.2.0
Jinja2==3.1.4
joblib==1.4.2
kazoo==2.5.0
Markdown==3.6
markdown-it-py==3.0.0
MarkupSafe==2.1.5
mdurl==0.1.2
mpmath==1.3.0
multidict==6.0.5
multiprocess==0.70.16
networkx==3.2.1
numpy==1.26.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.6.20
nvidia-nvtx-cu12==12.1.105
packaging==24.1
pandas==2.2.2
protobuf==4.25.4
psutil==7.0.0
pyarrow==17.0.0
pydantic==2.8.2
pydantic_core==2.20.1
Pygments==2.18.0
pykafka==2.8.0
PyMySQL==1.1.1
pypinyin==0.38.1
python-dateutil==2.9.0.post0
pytz==2024.1
PyYAML==6.0.2
regex==2024.7.24
requests==2.32.3
requests-file==2.1.0
rich==13.7.1
sacremoses==0.1.1
safetensors==0.5.3
shellingham==1.5.4
six==1.16.0
sniffio==1.3.1
starlette==0.37.2
sympy==1.13.1
tabulate==0.9.0
tensorboard==2.17.0
tensorboard-data-server==0.7.2
tldextract==5.1.2
tokenizers==0.19.1
torch==2.4.0
tqdm==4.66.5
transformers==4.42.0
triton==3.0.0
typer==0.12.3
typing_extensions==4.12.2
tzdata==2024.1
urllib3==2.2.2
uvicorn==0.30.5
Werkzeug==3.0.3
xxhash==3.4.1
yarl==1.9.4
zipp==3.19.2
zope.event==5.0
zope.interface==7.1.1

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction



Description:

I encountered a TypeError when calling a tokenizer loaded with AutoTokenizer.from_pretrained(tokenizer_name) after upgrading to transformers 4.42.0 and tokenizers 0.19.1. The error message is:

TypeError: PreTrainedTokenizerFast._batch_encode_plus() got an unexpected keyword argument 'padding_side'
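As a minimal illustration of how this class of TypeError arises (this is a standalone sketch, not transformers internals): a subclass overrides an internal method with a fixed signature written before a new keyword existed, and the base class later starts forwarding that keyword.

```python
# Sketch only: mimics a caller forwarding a new kwarg into an override
# whose signature predates it. Class and method names are hypothetical.
class Base:
    def encode(self, text, **kwargs):
        # Forwards everything it receives, including newly added kwargs.
        return self._batch_encode(text, **kwargs)

class CustomFast(Base):
    # Signature written before 'padding_side' existed -- no **kwargs catch-all.
    def _batch_encode(self, text, padding=False, truncation=True):
        return list(text)

tok = CustomFast()
print(tok.encode("abc"))  # ['a', 'b', 'c']
try:
    tok.encode("abc", padding_side="left")
except TypeError as err:
    print(err)  # ... got an unexpected keyword argument 'padding_side'
```

A `**kwargs` catch-all in the override (or the caller not forwarding unknown keywords) avoids the crash.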

Details:

  • Code Snippet:

    from transformers import AutoTokenizer

    # tokenizer_name, text_column_name, and args.cutoff_len come from the
    # surrounding training script (placeholders here).
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

    def tokenize_and_align_train_labels(examples):
        tokenized_inputs = tokenizer(
            examples[text_column_name],
            max_length=args.cutoff_len,
            padding=False,
            truncation=True,
            return_token_type_ids=False,
        )
        return tokenized_inputs
  • Observations:

    • Before upgrading to Transformers version 4.42.0, this error did not occur.
    • The issue arises with tokenizers version 0.19.1 and later.
    • Setting use_fast=False when initializing the tokenizer resolves the error, indicating the issue is specific to the fast tokenizer implementation.
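Until the mismatch is fixed upstream, one defensive pattern for this kind of version skew (a generic sketch, not transformers API) is to drop keyword arguments the callee does not accept before forwarding:

```python
import inspect

def call_with_supported_kwargs(fn, *args, **kwargs):
    """Drop kwargs the callable does not accept; pass all through if it
    takes **kwargs. Hypothetical shim for caller/callee version skew."""
    params = inspect.signature(fn).parameters
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return fn(*args, **kwargs)
    supported = {k: v for k, v in kwargs.items() if k in params}
    return fn(*args, **supported)

# Stand-in for a tokenizer method whose signature lacks 'padding_side'.
def encode(text, padding=False, truncation=True):
    return {"text": text, "padding": padding}

result = call_with_supported_kwargs(encode, "hi", padding=True, padding_side="left")
print(result)  # {'text': 'hi', 'padding': True}
```

Silently dropping a kwarg changes behavior, so this is a stopgap, not a substitute for accepting padding_side properly.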

Request:

Please investigate this issue, as it seems to be a regression related to the handling of the padding_side argument in the fast tokenizer. Any guidance or fix would be appreciated.


Expected behavior

Calling a fast tokenizer with parameters such as padding, truncation, and return_token_type_ids should process the input text without raising a TypeError about padding_side, regardless of the installed tokenizers version. Behavior should be consistent across versions so that upgrades remain backward compatible.
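Since this error depends on the installed transformers/tokenizers combination, a small helper like the following (a hypothetical sketch using the standard library) can print the relevant versions when filing or triaging such reports:

```python
from importlib.metadata import PackageNotFoundError, version

def report_versions(*packages):
    """Return 'pkg==version' lines for the given installed distributions."""
    lines = []
    for pkg in packages:
        try:
            lines.append(f"{pkg}=={version(pkg)}")
        except PackageNotFoundError:
            lines.append(f"{pkg} (not installed)")
    return lines

for line in report_versions("transformers", "tokenizers"):
    print(line)
```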

@yunqianluo yunqianluo added the bug label May 7, 2025
@Rocketknight1
Member

Hi @yunqianluo, this might be specific to the tokenizer class you're using. Can you give us some code we can run that will show the issue?

cc @itazap
