[fix] fix save eval result failed with multi-node pretrain #678
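When running LLaVA-InternLM2-20B pretraining on 4 nodes with 32 GPUs, every node except the master raises a FileNotExist error each time the eval stage is reached. After reading the xtuner and mmengine code, I located the problem:
In multi-node training, mmengine by default saves log/vis_data and similar information only on the master node, so the worker nodes have no vis_data folder. xtuner, however, saves a copy of the eval results on every node, and it does not check whether the parent folder exists before opening the file. As a result, every node except the master crashes because the folder does not exist.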
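The fix is simple: make sure the results are stored only on the master node (using the `master_only` decorator provided by mmengine), and before each save use mmengine's `mkdir_or_exist` to check that the folder exists.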
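The pattern can be sketched roughly as follows. This is a minimal illustration, not the code changed in this PR: it uses standard-library stand-ins so it runs without mmengine, the `RANK` environment variable check is an assumption about the launcher, and `save_eval_output` and the output file name are hypothetical. In the actual fix, the decorator and the directory check come from mmengine (`master_only` and `mkdir_or_exist`).

```python
import functools
import os


def is_master() -> bool:
    # Assumption: the launcher (e.g. torchrun) exports the global rank as RANK.
    # mmengine determines the rank through its own dist utilities instead.
    return int(os.environ.get("RANK", "0")) == 0


def master_only(func):
    """Stand-in for mmengine's master_only: call func on the master only."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if is_master():
            return func(*args, **kwargs)
        return None  # worker ranks skip the call entirely

    return wrapper


@master_only
def save_eval_output(save_dir: str, step: int, text: str) -> str:
    """Hypothetical helper: write one eval result file and return its path."""
    # Stand-in for mmengine's mkdir_or_exist: create the folder if missing.
    os.makedirs(save_dir, exist_ok=True)
    path = os.path.join(save_dir, f"eval_outputs_iter_{step}.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return path
```

With this shape, worker ranks never touch the file system for eval outputs, and the master creates the target folder on demand instead of assuming it already exists.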