UnicodeDecodeError #160

@woshigenm

Description

weclone-cli make-dataset
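[WeClone] W | 06:15:05 | Warning: your settings.jsonc file version (0.2.23) does not match the config version recommended by the project (0.2.22).
[WeClone] W | 06:15:05 | This may cause unexpected behavior or errors. Please copy from settings.template.json or update your settings.jsonc file.
[WeClone] W | 06:15:05 | Config file changelog:
[0.2.2] - 2025-05-01 - Added LLM data-cleaning configuration; blocked_words moved into the unified settings.jsonc config file.
[0.2.21] - 2025-05-01 - Added online LLM data-cleaning configuration, compatible with OpenAI-style APIs.
[0.2.22] - 2025-06-05 - Added support for fine-tuning on image-modality chat records.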

[WeClone] I | 06:15:05 | Loading configuration from: ./settings.jsonc
[WeClone] I | 06:15:06 | Loading configuration from: ./settings.jsonc
[WeClone] I | 06:15:06 | Found 2 CSV files in total; processing started, please be patient...
Traceback (most recent call last):
File "pandas/_libs/parsers.pyx", line 1120, in pandas._libs.parsers.TextReader._convert_tokens
File "pandas/_libs/parsers.pyx", line 1272, in pandas._libs.parsers.TextReader._convert_with_dtype
File "pandas/_libs/parsers.pyx", line 1285, in pandas._libs.parsers.TextReader._string_convert
File "pandas/_libs/parsers.pyx", line 1535, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd5 in position 2: invalid continuation byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/root/autodl-tmp/WeClone/.venv/bin/weclone-cli", line 10, in <module>
sys.exit(cli())
File "/root/autodl-tmp/WeClone/.venv/lib/python3.10/site-packages/click/core.py", line 1442, in __call__
return self.main(*args, **kwargs)
File "/root/autodl-tmp/WeClone/.venv/lib/python3.10/site-packages/click/core.py", line 1363, in main
rv = self.invoke(ctx)
File "/root/autodl-tmp/WeClone/.venv/lib/python3.10/site-packages/click/core.py", line 1830, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/root/autodl-tmp/WeClone/.venv/lib/python3.10/site-packages/click/core.py", line 1226, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/root/autodl-tmp/WeClone/.venv/lib/python3.10/site-packages/click/core.py", line 794, in invoke
return callback(*args, **kwargs)
File "/root/autodl-tmp/WeClone/weclone/cli.py", line 33, in wrapper
return func(*args, **kwargs)
File "/root/autodl-tmp/WeClone/weclone/cli.py", line 51, in new_runtime_wrapper
return original_cmd_func(*args, **kwargs)
File "/root/autodl-tmp/WeClone/weclone/cli.py", line 76, in qa_generator
processor.main()
File "/root/autodl-tmp/WeClone/weclone/data/qa_generator.py", line 177, in main
chat_messages = self.load_csv(csv_file)
File "/root/autodl-tmp/WeClone/weclone/data/qa_generator.py", line 569, in load_csv
df = pd.read_csv(
File "/root/autodl-tmp/WeClone/.venv/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1026, in read_csv
return _read(filepath_or_buffer, kwds)
File "/root/autodl-tmp/WeClone/.venv/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 626, in _read
return parser.read(nrows)
File "/root/autodl-tmp/WeClone/.venv/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1923, in read
) = self._engine.read( # type: ignore[attr-defined]
File "/root/autodl-tmp/WeClone/.venv/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 234, in read
chunks = self._reader.read_low_memory(nrows)
File "pandas/_libs/parsers.pyx", line 838, in pandas._libs.parsers.TextReader.read_low_memory
File "pandas/_libs/parsers.pyx", line 921, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 1066, in pandas._libs.parsers.TextReader._convert_column_data
File "pandas/_libs/parsers.pyx", line 1127, in pandas._libs.parsers.TextReader._convert_tokens
File "pandas/_libs/parsers.pyx", line 1272, in pandas._libs.parsers.TextReader._convert_with_dtype
File "pandas/_libs/parsers.pyx", line 1285, in pandas._libs.parsers.TextReader._string_convert
File "pandas/_libs/parsers.pyx", line 1535, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd5 in position 2: invalid continuation byte
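
The traceback shows `pd.read_csv` in `load_csv` failing while decoding the file as UTF-8. Byte 0xd5 followed by a non-continuation byte is consistent with a CSV saved in a legacy Chinese encoding such as GBK/GB18030, but that is only a guess without the file itself. A minimal sketch for checking which encoding a file decodes under, using only the standard library (`detect_encoding` and its candidate list are hypothetical, not part of WeClone or pandas):

```python
from pathlib import Path


def detect_encoding(path, candidates=("utf-8", "gb18030")):
    """Return the first candidate encoding that decodes the whole file.

    Hypothetical helper for illustration; the candidate list is an assumption.
    """
    data = Path(path).read_bytes()
    for encoding in candidates:
        try:
            data.decode(encoding)
        except UnicodeDecodeError:
            # This encoding cannot represent the file; try the next one.
            continue
        return encoding
    raise ValueError(f"none of {candidates!r} can decode {path}")
```

If it reports something other than `utf-8`, one option is to re-save the CSV as UTF-8. Another is to pass the result to pandas via its `encoding` parameter, e.g. `pd.read_csv(path, encoding=detect_encoding(path))`. Whether `load_csv` exposes a way to do that is not shown in this issue.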
