[Feature] Support Gemma #429

Merged · 12 commits · Feb 23, 2024
Conversation

PommesPeter (Contributor) commented Feb 22, 2024

Description

Add Gemma-2B and Gemma-7B to xtuner.

Some issues I ran into:

When launching multi-GPU fine-tuning with the prepared config files, the program hangs: GPU utilization stays at 100%, but nothing happens. Have you seen a similar situation before? In my experience, the Mistral-series models hit the same problem during fine-tuning, which I suspect is related to DeepSpeed. I also ran into a similar issue while training LLaVA; after debugging, it turned out to be a deadlock in the attention computation, with two multi-GPU jobs competing for resources. Related issue:

Reproduction commands:

NPROC_PER_NODE=2 xtuner train xtuner/config/gemma/gemma_2b_base/gemma_2b_base_full_alpaca_e3.py # 2 GPUs

NPROC_PER_NODE=8 xtuner train xtuner/config/gemma/gemma_2b_base/gemma_2b_base_full_alpaca_e3.py # 8 GPUs

NPROC_PER_NODE=2 xtuner train xtuner/config/gemma/gemma_2b_base/gemma_2b_base_full_alpaca_e3.py --deepspeed deepspeed_zero2 # ZeRO-2, 2 GPUs

Environment

  • os: Ubuntu 20.04
  • pytorch: 2.2
  • xtuner: master branch
  • deepspeed: 0.13.2
  • transformers: 4.38.1
  • cuda: 11.8
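
If the hang shows up again, a possible way to gather more information (a generic debugging sketch, not something verified in this setup; it assumes py-spy is installed via pip install py-spy) is to enable NCCL debug logging and dump the stack of a stuck rank:

NCCL_DEBUG=INFO NPROC_PER_NODE=2 xtuner train xtuner/config/gemma/gemma_2b_base/gemma_2b_base_full_alpaca_e3.py # print NCCL setup/collective logs, which helps spot a stalled collective
py-spy dump --pid <PID of a hung rank> # show where each training process is blocked in Python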

@PommesPeter changed the title from "[WIP] [Feature] added Gemma model" to "[WIP] [Feature] added Gemma config" on Feb 22, 2024
LZHgrla (Collaborator) commented Feb 23, 2024

Hi @PommesPeter,
Thanks for your contribution!
I will try to reproduce the issue you ran into on my side.

KooSung (Contributor) commented Feb 23, 2024

On my side, LLaVA with Gemma trains normally.

LZHgrla (Collaborator) commented Feb 23, 2024

@PommesPeter
I ran the following three commands, and all of them train normally.

xtuner train xtuner/config/gemma/gemma_2b_base/gemma_2b_base_full_alpaca_e3.py # 1 GPU
NPROC_PER_NODE=2 xtuner train xtuner/config/gemma/gemma_2b_base/gemma_2b_base_full_alpaca_e3.py # 2 GPUs
NPROC_PER_NODE=2 xtuner train xtuner/config/gemma/gemma_2b_base/gemma_2b_base_full_alpaca_e3.py --deepspeed deepspeed_zero2 # ZeRO-2, 2 GPUs

How about polishing the configs first? Once my tests pass, I can merge the PR, and the problem you hit can then be opened as a separate issue for discussion.

PommesPeter (Contributor, Author)

> On my side, LLaVA with Gemma trains normally.

What is your machine setup, roughly? Could the problem be with our machines here?

PommesPeter (Contributor, Author) commented Feb 23, 2024

OK. Could it be related to the machine or the GPUs, or is something wrong with the environment? What machine setup did you use to reproduce it?

KooSung (Contributor) commented Feb 23, 2024

@PommesPeter
deepspeed==0.12.6

PommesPeter (Contributor, Author)

> deepspeed==0.12.6

OK, I switched to that version of DeepSpeed and the problem is still the same.

LZHgrla (Collaborator) commented Feb 23, 2024

I'm using 8x A100 here. I'd suggest checking whether fine-tuning other models runs into the same problem.

PommesPeter (Contributor, Author)

I ran fine-tuning with the internlm_7b_full_alpaca_e3.py config and still hit the same problem.

LZHgrla (Collaborator) commented Feb 23, 2024

Then it looks like the environment or the machine should be the first suspect, rather than Gemma.

PommesPeter (Contributor, Author)

OK, I'll double-check the config first. By the way, do your machines have NVLink? Ours are non-SXM; could that be related?

LZHgrla (Collaborator) commented Feb 23, 2024

Ours are the SXM version. It probably doesn't matter much; the two only differ in the low-level interconnect used for multi-machine communication.
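
As a quick check (assuming nvidia-smi is available on the node; this is only a general pointer, not part of the xtuner tooling), the actual GPU interconnect topology can be printed with:

nvidia-smi topo -m # "NV#" entries mean the GPU pair is connected via NVLink, while "PIX"/"PHB"/"SYS" indicate PCIe or system paths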

PommesPeter and others added 5 commits February 23, 2024 15:21
…e3.py

Co-authored-by: Zhihao Lin <36994684+LZHgrla@users.noreply.github.com>
…3.py

Co-authored-by: Zhihao Lin <36994684+LZHgrla@users.noreply.github.com>
…3.py

Co-authored-by: Zhihao Lin <36994684+LZHgrla@users.noreply.github.com>
…e3.py

Co-authored-by: Zhihao Lin <36994684+LZHgrla@users.noreply.github.com>
Co-authored-by: Zhihao Lin <36994684+LZHgrla@users.noreply.github.com>
PommesPeter (Contributor, Author) left a comment

Once this one is confirmed, it should be good to go; no other issues.

@LZHgrla changed the title from "[WIP] [Feature] added Gemma config" to "[Feature] added Gemma config" on Feb 23, 2024
@LZHgrla changed the title from "[Feature] added Gemma config" to "[Feature] add Gemma config" on Feb 23, 2024
@LZHgrla changed the title from "[Feature] add Gemma config" to "[Feature] Support Gemma" on Feb 23, 2024
@LZHgrla merged commit 648a003 into InternLM:main on Feb 23, 2024 (1 check passed)
LZHgrla added a commit that referenced this pull request Mar 11, 2024
* [Improve] Redesign the `prompt_template` (#294)

* update

* update cfgs

* update

* fix bugs

* upload docs

* rename

* update

* Revert "update cfgs"

This reverts commit 93966aa.

* update cfgs

* update

* rename

* rename

* fix bc

* fix stop_word

* fix

* fix

* Update prompt_template.md

* [Fix] Fix errors about `stop_words`  (#313)

* fix bugs

* Update mmbench.py

* [Fix] Fix Mixtral LoRA setting (#312)

set target_modules

* [Feature] Support DeepSeek-MoE (#311)

* support deepseek moe

* update docs

* update

* update

* [Fix] Set `torch.optim.AdamW` as the default optimizer (#318)

fix

* [FIx] Fix `pth_to_hf` for LLaVA model (#316)

Update pth_to_hf.py

* [Improve] Add `demo_data` examples (#278)

* update examples

* add examples

* add json template config

* rename

* update

* update

* update

* [Feature] Support InternLM2 (#321)

* add cfgs

* add internlm2 template

* add dispatch

* add docs

* update readme

* update

* [Fix] Fix the resume of seed (#309)

* fix

* Update utils.py

* [Feature] Accelerate `xtuner xxx`  (#307)

* accelerate cli

* Update entry_point.py

* Update entry_point.py

---------

Co-authored-by: Zhihao Lin <36994684+LZHgrla@users.noreply.github.com>

* [Fix] Fix InternLM2 url (#325)

* fix

* update

* Update README.md

* Update README_zh-CN.md

* [Fix] Limit the version of python, `>=3.8, <3.11` (#327)

update

* [Fix] Add `trust_remote_code=True` for AutoModel  (#328)

update

* [Docs] Improve README  (#326)

* update

* Update README.md

* Update README.md

* Update README.md

* Update README_zh-CN.md

* update

* update

* fix pre-commit

* update

* bump verion to v0.1.12 (#323)

bump v0.1.12

* set dev version (#329)

Update version.py

* [Docs] Add LLaVA-InternLM2 results (#332)

* update results

* update

* Update internlm2_chat template (#339)

Update internlm2 template

* [Fix] Fix examples demo_data configs (#334)

fix

* bump version to v0.1.13 (#340)

update

* set dev version (#341)

update

* [Feature] More flexible `TrainLoop` (#348)

* add new loop

* rename

* fix pre-commit

* add max_keep_ckpts

* fix

* update cfgs

* update examples

* fix

* update

* update llava

* update

* update

* update

* update

* [Feature]Support CEPH (#266)

* support petrelfs

* fix deepspeed save/load/resume

* add ENV to toggle petrelfs

* support hf save_pretrained

* patch deepspeed engine

* [Improve] Add `--repetition-penalty` for `xtuner chat` (#351)

fix

* [Feature] Support MMBench DDP Evaluate (#300)

* support ddp mmbench evaluate

* Update xtuner/tools/mmbench.py

Co-authored-by: Zhihao Lin <36994684+LZHgrla@users.noreply.github.com>

* Update xtuner/tools/mmbench.py

Co-authored-by: Zhihao Lin <36994684+LZHgrla@users.noreply.github.com>

* update minimum version of mmengine

* Update runtime.txt

---------

Co-authored-by: Zhihao Lin <36994684+LZHgrla@users.noreply.github.com>

* [Fix] `KeyError` of `encode_fn` (#361)

fix

* [Fix] Fix `batch_size` of full fine-tuing LLaVA-InternLM2 (#360)

fix

* [Fix] Remove `system` for `alpaca_map_fn` (#363)

update

* [Fix] Use `DEFAULT_IMAGE_TOKEN` instead of `'<image>'` (#353)

Update utils.py

* [Feature] Efficient SFT (#302)

* add local_attn_args_to_messagehub_hook

* add internlm repo sampler

* add internlm repo dataset and collate_fn

* dispatch internlm1 and internlm2 local attn

* add internlm2 config

* add internlm1 and intenrlm2 config

* add internlm2 template

* fix replace_internlm1_rote bugs

* add internlm1 and internlm2 config templates

* change priority of EvaluateChatHook

* fix docs

* fix config

* fix bug

* set rotary_base according the latest internlm2 config

* add llama local attn

* add llama local attn

* update intern_repo_dataset docs when using aliyun

* support using both hf load_dataset and intern_repo packed_dataset

* add configs

* add opencompass doc

* update opencompass doc

* use T data order

* use T data order

* add config

* add a tool to get data order

* support offline processing untokenized dataset

* add docs

* add doc about only saving model weights

* add doc about only saving model weights

* dispatch mistral

* add mistral template

* add mistral template

* fix torch_dtype

* reset pre-commit-config

* fix config

* fix internlm_7b_full_intern_repo_dataset_template

* update local_attn to varlen_attn

* rename local_attn

* fix InternlmRepoSampler and train.py to support resume

* modify Packer to support varlen attn

* support varlen attn in default pipeline

* update mmengine version requirement to 0.10.3

* Update ceph.md

* delete intern_repo_collate_fn

* delete intern_repo_collate_fn

* delete useless files

* assert pack_to_max_length=True if use_varlen_attn=True

* add varlen attn doc

* add varlen attn to configs

* delete useless codes

* update

* update

* update configs

* fix priority of ThroughputHook and flake8 ignore W504

* using map_fn to set length attr to dataset

* support split=None in process_hf_dataset

* add dataset_format_mapping

* support preprocess ftdp and normal dataset

* refactor process_hf_dataset

* support pack dataset in process_untokenized_datasets

* add xtuner_dataset_timeout

* using gloo backend for monitored barrier

* set gloo timeout

* fix bugs

* fix configs

* refactor intern repo dataset docs

* fix doc

* fix lint

---------

Co-authored-by: pppppM <67539920+pppppM@users.noreply.github.com>
Co-authored-by: pppppM <gjf_mail@126.com>

* [Fix] Add `attention_mask` for `default_collate_fn` (#371)

fix

* [Fix] Update requirements (#369)

Update runtime.txt

* [Fix] Fix rotary_base, add `colors_map_fn` to `DATASET_FORMAT_MAPPING` and rename 'internlm_repo' to 'intern_repo' (#372)

* fix

* rename internlm_repo to intern_repo

* add InternlmRepoSampler for preventing bc break

* add how to install flash_attn to doc

* update (#377)

* Delete useless codes and refactor process_untokenized_datasets (#379)

* delete useless codes

* refactor process_untokenized_datasets: add ftdp to dataset-format

* fix lint

* [Feature] support flash attn 2 in internlm1, internlm2 and llama (#381)

support flash attn 2 in internlm1, internlm2 and llama

* [Fix] Fix installation docs of mmengine in `intern_repo_dataset.md` (#384)

update

* [Fix] Update InternLM2 `apply_rotary_pos_emb` (#383)

update

* [Feature] support saving eval output before save checkpoint (#385)

* support saving eval output before save checkpoint

* refactor

* [Fix] lr scheduler setting (#394)

* fix lr scheduler setting

* fix more

---------

Co-authored-by: zilong.guo <zilong.guo@zeron.ai>
Co-authored-by: LZHgrla <linzhihao@pjlab.org.cn>

* [Fix] Remove pre-defined `system` of `alpaca_zh_map_fn` (#395)

fix

* [Feature] Support `Qwen1.5` (#407)

* rename

* update docs

* update template

* update

* add cfgs

* update

* update

* [Fix] Fix no space in chat output using InternLM2. (#357) (#404)

* [Fix] Fix no space in chat output using InternLM2. (#357)

* Update chat.py

* Update utils.py

* Update utils.py

* fix pre-commit

---------

Co-authored-by: Zhihao Lin <36994684+LZHgrla@users.noreply.github.com>
Co-authored-by: LZHgrla <linzhihao@pjlab.org.cn>

* [Fix] typo: `--system-prompt` to `--system-template` (#406)

fix

* [Improve] Add `output_with_loss` for dataset process (#408)

update

* [Fix] Fix dispatch to support transformers>=4.36 & Add USE_TRITON_KERNEL environment variable (#411)

* dispatch support transformers>=4.36

* add USE_TRITON_KERNEL environment variable

* raise RuntimeError use triton kernels on cpu

* fix lint

* [Feature]Add InternLM2-1_8b configs (#396)

* [Feature]Add InternLM2-Chat-1_8b full config

* [Feature]Add InternLM2-Chat-1_8b full config

* update

---------

Co-authored-by: LZHgrla <linzhihao@pjlab.org.cn>
Co-authored-by: Zhihao Lin <36994684+LZHgrla@users.noreply.github.com>

* [Fix] Fix `extract_json_objects` (#419)

* [Fix] Fix pth_to_hf error (#426)

fix

* [Feature] Support `Gemma` (#429)

* added gemma config and template

* check config and make sure the consistancy

* Update xtuner/configs/gemma/gemma_2b_base/gemma_2b_base_qlora_alpaca_e3.py

Co-authored-by: Zhihao Lin <36994684+LZHgrla@users.noreply.github.com>

* Update xtuner/configs/gemma/gemma_2b_base/gemma_2b_base_full_alpaca_e3.py

Co-authored-by: Zhihao Lin <36994684+LZHgrla@users.noreply.github.com>

* Update xtuner/configs/gemma/gemma_7b_base/gemma_7b_base_full_alpaca_e3.py

Co-authored-by: Zhihao Lin <36994684+LZHgrla@users.noreply.github.com>

* Update xtuner/configs/gemma/gemma_7b_base/gemma_7b_base_qlora_alpaca_e3.py

Co-authored-by: Zhihao Lin <36994684+LZHgrla@users.noreply.github.com>

* Update xtuner/utils/templates.py

Co-authored-by: Zhihao Lin <36994684+LZHgrla@users.noreply.github.com>

* update

* added  required version

* update

* update

---------

Co-authored-by: Zhihao Lin <36994684+LZHgrla@users.noreply.github.com>
Co-authored-by: LZHgrla <linzhihao@pjlab.org.cn>

* add refcoco to llava (#425)

* add base dataset

* update dataset generation

* update refcoco

* add convert refcooc

* add eval_refcoco

* add config

* update dataset

* fix bug

* fix bug

* update data prepare

* fix error

* refactor eval_refcoco

* fix bug

* fix error

* update readme

* add entry_point

* update config

* update config

* update entry point

* update

* update doc

* update

---------

Co-authored-by: jacky <jacky@xx.com>

* [Fix] Inconsistent BatchSize of `LengthGroupedSampler` (#436)

update

* bump version to v0.1.14 (#431)

update

* set dev version (#437)

* Update version.py

* Update version.py

* [Bugs] Fix bugs when using EpochBasedRunner (#439)

fix bugs when using epochbasedrunner

* [Feature] Support processing ftdp dataset and custom dataset offline (#410)

* support smart_tokenizer_and_embedding_resize

* replace ast with json.loads

* support list_dataset_format cli

* add doc about ftdp and custom dataset

* add custom dataset template

* add args name to process_hf_dataset

* use new process_untokenized_datasets

* support tokenize_ftdp_datasets

* add mistral_7b_w_tokenized_dataset config

* update doc

* update doc

* add comments

* fix data save path

* smart_tokenizer_and_embedding_resize support zero3

* fix lint

* add data format to internlm2_7b_full_finetune_custom_dataset_e1.py

* add a data format example to configs associated with finetuning custom dataset

* add a data format example to configs associated with finetuning custom dataset

* fix lint

* Update prompt_template.md (#441)

Fixed a typo

* [Doc] Split finetune_custom_dataset.md to 6 parts (#445)

* split finetune_custom_dataset.md to 6 parts

* refactor custom_dataset and ftdp_dataset related docs

* fix comments

* fix pre-commit

---------

Co-authored-by: pppppM <67539920+pppppM@users.noreply.github.com>
Co-authored-by: RangiLyu <lyuchqi@gmail.com>
Co-authored-by: whcao <41630003+HIT-cwh@users.noreply.github.com>
Co-authored-by: pppppM <gjf_mail@126.com>
Co-authored-by: gzlong96 <30570937+gzlong96@users.noreply.github.com>
Co-authored-by: zilong.guo <zilong.guo@zeron.ai>
Co-authored-by: Ko Sung <34935911+KooSung@users.noreply.github.com>
Co-authored-by: 不要葱姜蒜 <77671993+KMnO4-zx@users.noreply.github.com>
Co-authored-by: fanqiNO1 <75657629+fanqiNO1@users.noreply.github.com>
Co-authored-by: PommesPeter <54879512+PommesPeter@users.noreply.github.com>
Co-authored-by: LKJacky <108643365+LKJacky@users.noreply.github.com>
Co-authored-by: jacky <jacky@xx.com>
Co-authored-by: xzw <62385492+aJupyter@users.noreply.github.com>
BlueBlueFF commented

Did you eventually solve the `RuntimeError: still have inflight params` problem?

PommesPeter (Contributor, Author) commented Apr 7, 2024

Hi, thanks for asking. This problem comes from DeepSpeed: the error seems to be caused by inconsistent data states across nodes. After investigating, I found that my machines do have NVLink installed, and the issue was only resolved after forcing communication over NVLink by setting NCCL_P2P_LEVEL=NVL.
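
For example (only an illustration, reusing the reproduction command from the PR description above):

NCCL_P2P_LEVEL=NVL NPROC_PER_NODE=8 xtuner train xtuner/config/gemma/gemma_2b_base/gemma_2b_base_full_alpaca_e3.py --deepspeed deepspeed_zero2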

I'm not sure what your development environment looks like; you may need to provide more details.

BlueBlueFF commented

Thanks for the reply. My issue looks similar to intel/intel-extension-for-transformers#1201: text-only data is fine and image-only data is fine, but mixing the two is not. I'm also trying the approach you suggested and will get back to you once I have a conclusion.

pppppM added a commit that referenced this pull request Jun 12, 2024
* [Docs] Readthedocs (#304)

* init readthedocs

* add en docs

* add zh docs

* fix lint

* [Fix] Support ZH Readthedocs (#305)

* add zh yaml

* test zh cn

* test yaml path

* pass

* update conf.py

* [Docs] Document optimization (#362)

Document optimization

* [Docs] Update Docs docs/en/get_started/installation.md  (#364)

* Update the Chinese installation.md

Completed the Chinese sections "Installation - Installation procedure - Best practice" and "Installation - Verify the installation"

* Update installation.md en

* Update installation.md zh

typo

* [Docs] Refine Quick Start (#378)

* [Docs] Add zh_cn quickstart

* [Fix] Fix color rendering logic for github

* [Fix] Fix comments

* [Fix] Add hyperlinks

* [Docs] Add en quickstart

* [Fix] Fix comments

* Update overview.md (#412)

* Update overview.md

* Update overview.md

Revised as requested; please review.

* Update overview.md

Further corrections

* Update overview.md

Refined as requested

* Merge branch 'main' into 'docs' (#463)


* [Docs] Add `docs/zh_cn/preparation/pretrained_model.md` (#462)

* fix pre-commit

* update

* Update pretrained_model.md

* Update pretrained_model.md

* fix pre-commit

* Update pretrained_model.md

* update

* update

* update

* update

* Update pretrained_model.md

* [Docs] Add `docs/zh_cn/training/multi_modal_dataset.md` (#503)

* update

* update

* [Docs] Improve readthedocs style (#545)

* update style

* update style

* fix requirements

* fix

* fix

* add logo

* update

* update

* update

* [Docs] `.md` to `.rst` (#544)

* update rst

* update rst

* update rst

* [Docs] Add `docs/zh_cn/training/custom_pretrain_dataset.rst` (#535)

* update

* update

* update rst

* [Docs] Add docs about training on large scale dataset (#517)

* add train_on_large_scale_dataset doc

* refine doc

* add llava offline doc

* refine doc

* replace md with rst

* refine rst

* refine rst

* [Docs] Add internevo migration related documents (#506)

* add internevo related

* fix comments

* refine doc

* rename internlm2_7b_w_tokenized_dataset.py to internlm2_7b_w_internevo_dataset.py

* refine doc

* replace md with rst

* refine rst

* refine rst

* [Docs] Add `docs/zh_cn/training/modify_settings.rst` (#490)

* update

* update

* update

* update

* update

* update

* Update modify_settings.md

* Update modify_settings.md

* update

* Update docs/zh_cn/training/modify_settings.md

Co-authored-by: Haian Huang(深度眸) <1286304229@qq.com>

* update deepspeed

* update rst

* update rst

---------

Co-authored-by: Haian Huang(深度眸) <1286304229@qq.com>

* [Docs] Add `length_grouped_sampler.rst` (#511)

* update

* update

* update

* Update length_grouped_sampler.md

* update rst

* Update length_grouped_sampler.rst

Co-authored-by: whcao <41630003+HIT-cwh@users.noreply.github.com>

---------

Co-authored-by: whcao <41630003+HIT-cwh@users.noreply.github.com>

* [Docs] Add accelerate related (#504)

* add accelerate related

* split accelerate docs

* fix comments

* add speed benchmark

* explain why qlora can not be used with zero3

* refine doc

* fix configs

* refine doc

* refine doc

* refine configs

* add benchmark to index.rst

* refine doc

* add hyper-param docs

* refine doc

* add explanation about memory cost optimization when using zero

* add figure to show the speed comparison

* refine figures

* refine doc

* fix figures

* refine figures

* update figures and benchmark configs

* add pack rst

* delete pack md

* replace md with rst

* replace md with rst

* replace md with rst

* replace md with rst

* refine rst

* refine rst

* refine rst

* refine rst

* refine rst

* refine rst

* refine rst

* refine rst

* refine rst

* refine rst

* refine rst

* refine rst

* refine rst

* refine rst

---------

Co-authored-by: pppppM <67539920+pppppM@users.noreply.github.com>

* [Docs] Add visualization docs (#516)

* add visualization docs

* delete other visualization tools and add explanation about how to use tensorboard

* replace md with rst

---------

Co-authored-by: pppppM <67539920+pppppM@users.noreply.github.com>

* [Docs] Add docs about SFT with custom dataset (#514)

* add custom sft dataset docs

* add custom dataset template configs

* add openai data format

* refine doc

* update (#2)

* replace md with rst

---------

Co-authored-by: Zhihao Lin <36994684+LZHgrla@users.noreply.github.com>
Co-authored-by: pppppM <67539920+pppppM@users.noreply.github.com>

* [Docs] Add `docs/zh_cn/training/open_source_dataset.rst` (#502)

* update

* update

* update

* update

* format table

* fix typo

* update rst

---------

Co-authored-by: pppppM <67539920+pppppM@users.noreply.github.com>

* [Docs] Add `docs/zh_cn/preparation/prompt_template.rst` (#475)

* update

* update

* Update prompt_template.md

* Update prompt_template.md

* update

* add tips

* update

* update rst

---------

Co-authored-by: pppppM <67539920+pppppM@users.noreply.github.com>

* [Docs] Add Sequence Parallel documents (#505)

* add sp related

* add sequence parallel supported models

* refine doc

* Update docs/zh_cn/training/training_extreme_long_sequence.md

Co-authored-by: Haian Huang(深度眸) <1286304229@qq.com>

* refine doc

* refine doc

* test the capability boundary of zero3

* refine doc

* test rst

* test rst

* add training speed figure

* delete debug rst

* sp need flash_attn

* WIP

* replace md with rst

* refine rst

* refine rst

* add explanation about why pt 2.1 is not accepted

* refine rst

* refine rst

* add loss curve

---------

Co-authored-by: Haian Huang(深度眸) <1286304229@qq.com>
Co-authored-by: pppppM <67539920+pppppM@users.noreply.github.com>

* [Docs] Update `docs/zh_cn` outline (#556)

update

* [Docs] Update `docs/en` theme (#557)

* update

* update

* update

* update

* update

* update

* update

* update

* [Docs] Add tokenizer to sft in Case 2 (#584)

add tokenizer to sft in Case 2

* [Docs] Improve the Rendering Effect of Readthedocs (#664)

* refine get_start and training

* fix acceleration

* update maxdepth

* refine internevo migration

* refine internevo

* fix typos

* fix lint

---------

Co-authored-by: zhengjie.xu <jerryxuzhengjie@gmail.com>
Co-authored-by: Ma Zhiming <101508488+JimmyMa99@users.noreply.github.com>
Co-authored-by: fanqiNO1 <75657629+fanqiNO1@users.noreply.github.com>
Co-authored-by: Jianfeng777 <108343727+Jianfeng777@users.noreply.github.com>
Co-authored-by: Zhihao Lin <36994684+LZHgrla@users.noreply.github.com>
Co-authored-by: RangiLyu <lyuchqi@gmail.com>
Co-authored-by: whcao <41630003+HIT-cwh@users.noreply.github.com>
Co-authored-by: gzlong96 <30570937+gzlong96@users.noreply.github.com>
Co-authored-by: zilong.guo <zilong.guo@zeron.ai>
Co-authored-by: Ko Sung <34935911+KooSung@users.noreply.github.com>
Co-authored-by: 不要葱姜蒜 <77671993+KMnO4-zx@users.noreply.github.com>
Co-authored-by: PommesPeter <54879512+PommesPeter@users.noreply.github.com>
Co-authored-by: LKJacky <108643365+LKJacky@users.noreply.github.com>
Co-authored-by: jacky <jacky@xx.com>
Co-authored-by: xzw <62385492+aJupyter@users.noreply.github.com>
Co-authored-by: Haian Huang(深度眸) <1286304229@qq.com>
llkn-2 pushed a commit to llkn-2/xtuner that referenced this pull request Jul 31, 2024
llkn-2 pushed a commit to llkn-2/xtuner that referenced this pull request Jul 31, 2024