
Taskflow information extraction fails with: substring not found #2854

Closed
lightCraft2020 opened this issue Jul 22, 2022 · 4 comments · Fixed by #2897

Comments

@lightCraft2020

Code that triggers the error:

from paddlenlp import Taskflow

schema = ['仪器']
ie = Taskflow('information_extraction', schema=schema)
ie.set_schema(schema)
ie('MetertechΣ960 酶标仪(中国台湾Metertech公司)')

Error traceback:

   1137             return batch_outputs

/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddlenlp/transformers/tokenizer_utils.py in _batch_prepare_for_model(self, batch_ids_pairs, add_special_tokens, padding_strategy, truncation_strategy, max_length, stride, pad_to_multiple_of, return_position_ids, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_dict, return_offsets_mapping, return_length, verbose, **kwargs)
   1281                 prepend_batch_axis=False,
   1282                 verbose=verbose,
-> 1283                 **kwargs)
   1284             for key, value in encoded_inputs.items():
   1285                 if key not in batch_outputs:

/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddlenlp/transformers/tokenizer_utils_base.py in prepare_for_model(self, ids, pair_ids, padding, truncation, max_length, stride, pad_to_multiple_of, return_tensors, return_position_ids, return_token_type_ids, return_attention_mask, return_length, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, add_special_tokens, verbose, prepend_batch_axis, **kwargs)
   2798
   2799         token_offset_mapping = self.get_offset_mapping(text)
-> 2800         token_pair_offset_mapping = self.get_offset_mapping(text_pair)
   2801         if max_length and total_len > max_length:
   2802             token_offset_mapping, token_pair_offset_mapping, _ = self.truncate_sequences(

/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddlenlp/transformers/tokenizer_utils.py in get_offset_mapping(self, text)
   1349             token = token[2:]
   1350
-> 1351         start = text[offset:].index(token) + offset
   1352
   1353         end = start + len(token)

ValueError: substring not found

Preliminary testing suggests the special characters in the text are the trigger, but the exact cause is unclear.
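The failure can be reproduced outside the tokenizer. Below is a minimal sketch of the logic at tokenizer_utils.py line 1351; the token value is an assumption, namely that the tokenizer's normalization maps "Σ" (U+03A3) to lowercase "σ" (U+03C3), so a substring search against the original text fails:

```python
# Minimal sketch of the offset-mapping logic that raises.
# Assumption: the tokenizer lowercases "Σ" to "σ", so the normalized
# token no longer appears verbatim in the original text.
text = "MetertechΣ960"
token = "σ960"   # normalized token (assumed tokenizer output)
offset = 9       # "Σ" sits at index 9 of the original text

try:
    start = text[offset:].index(token) + offset
except ValueError as err:
    print(err)  # prints: substring not found
```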

@linjieccc
Contributor

linjieccc commented Jul 22, 2022

@yingyibiao could you please take a look at this issue?

from paddlenlp.transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-base-zh")
tokenizer("MetertechΣ960 酶标仪(中国台湾Metertech公司)", return_offsets_mapping=True)

The error is:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/workspace/PaddleNLP/paddlenlp/transformers/tokenizer_utils_base.py", line 2267, in __call__
    **kwargs)
  File "/workspace/PaddleNLP/paddlenlp/transformers/tokenizer_utils_base.py", line 2341, in encode
    **kwargs,
  File "/workspace/PaddleNLP/paddlenlp/transformers/tokenizer_utils.py", line 1031, in _encode_plus
    **kwargs)
  File "/workspace/PaddleNLP/paddlenlp/transformers/tokenizer_utils_base.py", line 2800, in prepare_for_model
    token_offset_mapping = self.get_offset_mapping(text)
  File "/workspace/PaddleNLP/paddlenlp/transformers/tokenizer_utils.py", line 1366, in get_offset_mapping
    start = text[offset:].index(token) + offset
ValueError: substring not found
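For reference, one possible direction for a fix (the actual change landed in #2897; the helper below is a hypothetical sketch, not PaddleNLP's API) is to retry the offset search on case-folded text when the tokenizer has normalized case:

```python
def find_token_offset(text: str, token: str, offset: int) -> int:
    """Hypothetical helper: locate `token` in `text` starting at `offset`,
    retrying with casefolding when the tokenizer lowercased the token
    (e.g. "Σ" -> "σ"). Caveat: casefolding can change string length
    (e.g. "ß" -> "ss"), so a production fix needs more care."""
    tail = text[offset:]
    try:
        return tail.index(token) + offset
    except ValueError:
        # Retry on case-folded text so "σ960" matches "Σ960".
        return tail.casefold().index(token.casefold()) + offset

find_token_offset("MetertechΣ960", "σ960", 9)  # -> 9
```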

@leon-cas

leon-cas commented Jul 28, 2022

I hit the same problem when using Taskflow for information extraction: ValueError: substring not found

@wawltor
Collaborator

wawltor commented Jul 29, 2022

> I hit the same problem when using Taskflow for information extraction: ValueError: substring not found

You can install the latest develop build of paddlenlp; we will publish an official release next Monday.
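Until a release with the fix is installed, one hypothetical client-side workaround is to casefold non-ASCII characters before calling Taskflow, so the tokenizer's case normalization can no longer diverge from the input. This is a sketch under that assumption only; note that casefolding can change string length for some characters, which would shift extracted spans:

```python
def pre_normalize(text: str) -> str:
    # Casefold only non-ASCII characters (e.g. "Σ" -> "σ") so ASCII casing
    # and offsets for Latin text are preserved. CJK characters casefold to
    # themselves, so Chinese text is unaffected.
    return "".join(ch if ch.isascii() else ch.casefold() for ch in text)

pre_normalize("MetertechΣ960 酶标仪")  # -> "Metertechσ960 酶标仪"
```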

@Viserion-nlper

@leon-cas is there a fix for this issue in the current 2.4.4 release?
