[Question]: Calling ErnieFastTokenizer.EncodePairStrings throws std::range_error #4673

Closed
hhxdestiny opened this issue Feb 7, 2023 · 3 comments
Labels
fast_tokenizer question Further information is requested triage

Comments

@hhxdestiny

Please ask your question

Environment

  • [FastDeploy version]:
  • [Build command]: official build
  • [Platform]: Windows x64 (Windows 11)
  • [Hardware]: Nvidia GPU 3060, CUDA 11.7, CUDNN 8.4
  • [Language]: C++
  • [IDE]: VS2022

Test code

#include <iostream>
#include <vector>

#include <fast_tokenizer/tokenizers/ernie_fast_tokenizer.h>

using namespace paddlenlp::fast_tokenizer;

int main() {
  std::cout << "start ..." << std::endl;
  // Build the tokenizer from the vocabulary file.
  tokenizers_impl::ErnieFastTokenizer ernie_fast_tokenizer("model/vocab1.txt");
  std::vector<core::Encoding> encodings(2);
  // Encode a Chinese sentence into the first slot and print it.
  ernie_fast_tokenizer.EncodePairStrings("今天天气真好", &encodings[0]);
  std::cout << encodings[0].DebugString() << std::endl;
  // Encode an English sentence into the second slot and print it.
  ernie_fast_tokenizer.EncodePairStrings(
      "don't know how this missed award nominations.", &encodings[1]);
  std::cout << encodings[1].DebugString() << std::endl;
  std::cout << "end ..." << std::endl;
  return 0;
}

Error screenshot
Dingtalk_20230207101145
Output

Exception thrown at 0x00007FFFF86506BC in TokenizerTest.exe: Microsoft C++ exception: std::range_error at memory location 0x000000EE868FEE30.
Exception thrown at 0x00007FFFF86506BC in TokenizerTest.exe: Microsoft C++ exception: [rethrow] at memory location 0x0000000000000000.
Exception thrown at 0x00007FFFF86506BC in TokenizerTest.exe: Microsoft C++ exception: std::range_error at memory location 0x000000EE868FEE30.
Unhandled exception at 0x00007FFFF86506BC in TokenizerTest.exe: Microsoft C++ exception: std::range_error at memory location 0x000000EE868FEE30.
@hhxdestiny hhxdestiny added the question Further information is requested label Feb 7, 2023
@github-actions github-actions bot added the triage label Feb 7, 2023
@hhxdestiny
Author

To make testing easier, I have attached my project:
TokenizerTest.zip

@joey12300
Contributor

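Hello, your code runs fine for me on Linux. It looks like your source code is not encoded in UTF-8, which causes the error when tokenizing the Chinese text. You can fix this by following the official Visual Studio documentation and setting the source character set with /utf-8: /utf-8 (Set source and execution character sets to UTF-8)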
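One way to check this diagnosis at runtime is to validate the bytes of a string before handing it to the tokenizer. The following is a minimal sketch, not part of fast_tokenizer: `IsValidUtf8` is a hypothetical helper name, and the thread does not confirm exactly where inside the library the `std::range_error` is raised. The check itself only follows the standard UTF-8 byte layout.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not a fast_tokenizer API): returns true if `s` is
// well-formed UTF-8. Rejects stray continuation bytes, truncated sequences,
// overlong encodings, surrogates, and code points above U+10FFFF.
bool IsValidUtf8(const std::string& s) {
  const std::size_t n = s.size();
  std::size_t i = 0;
  while (i < n) {
    const unsigned char lead = static_cast<unsigned char>(s[i]);
    std::size_t len = 0;
    std::uint32_t cp = 0;
    if (lead < 0x80) {
      len = 1;
      cp = lead;
    } else if ((lead & 0xE0) == 0xC0) {
      len = 2;
      cp = lead & 0x1F;
    } else if ((lead & 0xF0) == 0xE0) {
      len = 3;
      cp = lead & 0x0F;
    } else if ((lead & 0xF8) == 0xF0) {
      len = 4;
      cp = lead & 0x07;
    } else {
      return false;  // continuation byte or invalid byte in lead position
    }
    if (i + len > n) return false;  // sequence is cut off
    for (std::size_t k = 1; k < len; ++k) {
      const unsigned char cont = static_cast<unsigned char>(s[i + k]);
      if ((cont & 0xC0) != 0x80) return false;  // not a continuation byte
      cp = (cp << 6) | (cont & 0x3F);
    }
    // Reject overlong encodings for each sequence length.
    if ((len == 2 && cp < 0x80) || (len == 3 && cp < 0x800) ||
        (len == 4 && cp < 0x10000)) {
      return false;
    }
    // Reject surrogates and out-of-range code points.
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return false;
    i += len;
  }
  return true;
}
```

Calling `IsValidUtf8` on the literal just before `EncodePairStrings` would show whether the compiled program actually holds UTF-8 bytes. A literal that ended up GBK-encoded in the binary fails the check, because its first byte is not a valid UTF-8 lead byte.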

@hhxdestiny
Author

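I had already seen that documentation, but the error persisted after I changed the encoding setting. After reading your reply I tried again and it still failed, so I suspected that, besides the encoding option in the project properties, the main.cpp file itself also had to be re-encoded. After converting main.cpp to UTF-8, it works.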
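When re-saving the file is inconvenient, the test string can also be written so that it does not depend on the file's encoding at all. The sketch below spells the Chinese sentence from the test code as explicit UTF-8 byte escapes; `kWeatherUtf8` is a name chosen here for illustration and is not from the original project.

```cpp
#include <string>

// The six-character test sentence from the issue, written as explicit UTF-8
// bytes. The source stays pure ASCII, so the bytes in the compiled program
// are the same whatever encoding the .cpp file is saved in.
const std::string kWeatherUtf8 =
    "\xE4\xBB\x8A"   // U+4ECA
    "\xE5\xA4\xA9"   // U+5929
    "\xE5\xA4\xA9"   // U+5929
    "\xE6\xB0\x94"   // U+6C14
    "\xE7\x9C\x9F"   // U+771F
    "\xE5\xA5\xBD";  // U+597D
```

Passing `kWeatherUtf8` in place of the raw literal would hand the tokenizer the same 18 bytes that a correctly encoded UTF-8 source file produces. This is a workaround for small test inputs. For real projects, saving sources as UTF-8 and compiling with /utf-8, as described above, is easier to maintain.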
