
support offline process llava data #448

Merged
merged 3 commits into InternLM:main on Mar 15, 2024
Conversation

@HIT-cwh (Collaborator) commented Mar 6, 2024

Offline processing of the LLaVA data can be tried on top of the XTuner main branch.

Modified:
  1. xtuner/dataset/llava.py
  2. xtuner/dataset/map_fns/dataset_map_fns/llava_map_fn.py
Added:
  3. xtuner/configs/internlm/internlm2_chat_7b/internlm2_chat_7b_llava.py
  4. xtuner/tools/process_untokenized_llava_data.py
  • First, process the text part of the LLaVA training data offline with xtuner/tools/process_untokenized_llava_data.py:
python xtuner/tools/process_untokenized_llava_data.py llava_cfg.py --save-folder llava_data
  • After processing, load the dataset to check that it looks as expected:
from datasets import load_from_disk
ds = load_from_disk('llava_data')
print(ds)
  • Then edit the llava_dataset entry in the llava_cfg.py config and add an offline_processed_text_folder = ${save-folder} field; the offline-processed data will then be read directly (a config sketch follows this list).
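For reference, here is a minimal sketch of what the edited llava_dataset entry in llava_cfg.py could look like. Only offline_processed_text_folder comes from this PR; the surrounding fields, paths, and import lines are assumptions modelled on the existing XTuner LLaVA configs and are placeholders:

# Sketch only: llava_dataset in llava_cfg.py after switching to offline text.
# image_processor, prompt_template and max_length are assumed to be defined
# earlier in the config file, as in the stock XTuner LLaVA configs.
from xtuner.dataset import LLaVADataset
from xtuner.dataset.map_fns import llava_map_fn, template_map_fn_factory

llava_dataset = dict(
    type=LLaVADataset,
    # Folder produced by process_untokenized_llava_data.py (--save-folder).
    offline_processed_text_folder='llava_data',
    # With the offline folder set, data_path and tokenizer become optional,
    # matching the assert quoted in the review thread below.
    image_folder='/path/to/llava/images',
    image_processor=image_processor,
    dataset_map_fn=llava_map_fn,
    template_map_fn=dict(
        type=template_map_fn_factory, template=prompt_template),
    max_length=max_length,
    pad_image_to_square=True)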

@HIT-cwh HIT-cwh marked this pull request as ready for review March 11, 2024 08:53
remove_unused_columns=False,
pack_to_max_length=False,
with_image_token=True)
assert offline_processed_text_folder or (data_path and tokenizer)
Collaborator

If both data_path and offline_processed_text_folder are set, please emit a warning that tells the user which one is actually used.
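A possible sketch of that warning, assuming it sits next to the assert in LLaVADataset and that the offline folder takes precedence (function name and placement are illustrative, not the merged code):

import warnings

def check_text_source(offline_processed_text_folder, data_path, tokenizer):
    # At least one text source must be provided (same condition as the assert above).
    assert offline_processed_text_folder or (data_path and tokenizer)
    if offline_processed_text_folder and data_path:
        # Tell the user which source actually gets used when both are configured.
        warnings.warn(
            'Both `offline_processed_text_folder` and `data_path` are set; '
            '`offline_processed_text_folder` takes precedence and `data_path` '
            'will be ignored.', RuntimeWarning)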

@hhaAndroid hhaAndroid changed the title from "[Draft] support offline process llava data" to "support offline process llava data" Mar 12, 2024
@pppppM pppppM merged commit 9bce7b9 into InternLM:main Mar 15, 2024
1 check passed