project translated to English #166

Closed
wants to merge 14 commits into from
Conversation

xming521
Owner

No description provided.

xming521 and others added 14 commits June 20, 2025 12:10
xming521 requested a review from Copilot June 25, 2025 03:43
Contributor

Copilot AI left a comment

Pull Request Overview

This PR aims to update and partially translate the project into English while introducing several improvements and new features. Key changes include:

  • Addition of a new function (calculate_token_length) with logging in length_cdf.py.
  • Updates to configuration models and arguments (including Telegram support) with updated version strings.
  • Documentation and README adjustments to provide both Chinese and English details.

Reviewed Changes

Copilot reviewed 19 out of 20 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| weclone/utils/length_cdf.py | Added calculate_token_length function; log messages and docstring remain in Chinese. |
| weclone/utils/config_models.py | Refactored BaseModel usage and introduced TelegramArgs. |
| weclone/data/clean/strategies.py | Increased max_new_tokens from 100 to 200 in vllm_infer call. |
| tests/tests_data/test_person/test_0_730.csv | Updated test data rows to numeric/empty messages. |
| README.md, settings.template.jsonc, etc. | Version and documentation updates for an English translation. |

Comments suppressed due to low confidence (4)

weclone/utils/length_cdf.py:41

  • The log message in calculate_token_length is still in Chinese. Consider translating it (as well as the docstring) into English for consistency with the project's translated content.
    logger.info(f"正在计算文本token长度: {text[:50]}...")
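
    A minimal sketch of what the translated function might look like. The PR's actual signature, tokenizer, and logger are not shown in this review, so the `tokenizer` parameter, the `WhitespaceTokenizer` stand-in, and the use of the standard `logging` module are all assumptions made for illustration only.

```python
import logging

logger = logging.getLogger(__name__)


def calculate_token_length(text, tokenizer):
    """Return the number of tokens that `tokenizer` produces for `text`.

    `tokenizer` is any object exposing `encode(str) -> list`. This signature
    is an assumption for the sketch, not the project's confirmed API.
    """
    # English replacement for the Chinese log message quoted in the review.
    logger.info(f"Calculating token length for text: {text[:50]}...")
    return len(tokenizer.encode(text))


class WhitespaceTokenizer:
    """Hypothetical stand-in tokenizer so the sketch runs without a model."""

    def encode(self, text):
        # Split on whitespace; a real tokenizer would return token ids.
        return text.split()
```

    Passing the tokenizer in as an argument keeps the sketch independent of whichever tokenizer the project really loads.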

weclone/data/clean/strategies.py:124

  • The max_new_tokens default has been increased from 100 to 200. Please confirm that this change is intentional and will not negatively affect performance or output length expectations.
            max_new_tokens=200,

tests/tests_data/test_person/test_0_730.csv:7

  • The updated message content is a numeric value ('2.0156416') instead of text. Verify if this change is deliberate and does not break downstream assumptions about the message format.
10,5704142615879617852,文本,0,wxid_6789z5qlxzfj22,wxid_6789z5qlxzfj22,2.0156416,,2024/10/4 11:43

tests/tests_data/test_person/test_0_730.csv:8

  • This test data row now has an empty message field. Please confirm that having an empty string as message content is intentional for the test scenario.
11,1337798072543283708,文本,0,wxid_6789z5qlxzfj22,wxid_6789z5qlxzfj22,,,2024/10/4 11:43
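
One way to check the two flagged rows is to parse them and classify the message field. The sketch below uses the rows exactly as quoted above; the column position of the message (`MSG_COLUMN = 6`) is inferred from those rows and is an assumption, as are the helper names `classify_message` and `classify_rows`, which do not come from the project.

```python
import csv
import io

# The two rows quoted in the review comments above, verbatim.
ROWS = (
    "10,5704142615879617852,文本,0,wxid_6789z5qlxzfj22,wxid_6789z5qlxzfj22,2.0156416,,2024/10/4 11:43\n"
    "11,1337798072543283708,文本,0,wxid_6789z5qlxzfj22,wxid_6789z5qlxzfj22,,,2024/10/4 11:43\n"
)

# Assumed zero-based index of the message column, inferred from the rows.
MSG_COLUMN = 6


def classify_message(msg):
    """Label a message as 'empty', 'numeric', or 'text'."""
    if msg.strip() == "":
        return "empty"
    try:
        float(msg)
    except ValueError:
        return "text"
    return "numeric"


def classify_rows(raw):
    """Return (row id, message label) pairs for CSV text without a header."""
    reader = csv.reader(io.StringIO(raw))
    return [(row[0], classify_message(row[MSG_COLUMN])) for row in reader]


print(classify_rows(ROWS))
```

A check like this could sit in a test to make the numeric and empty cases explicit, so that downstream code handling message text does not depend on them silently.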

xming521 closed this Jun 25, 2025