
Dual-GPU training fails: ModuleNotFoundError: No module named 'mmengine' #324

Closed
KMnO4-zx opened this issue Jan 16, 2024 · 20 comments

Comments

@KMnO4-zx
Contributor

While running a full fine-tune of InternLM-7b-chat with xtuner 0.1.9, deepspeed, and two GPUs (A100×2), I hit ModuleNotFoundError: No module named 'mmengine'. On a single A100 the model loads fine without this error, but training then runs out of memory (OOM).

The command used:

NPROC_PER_NODE=2 xtuner train internlm_chat_7b_qlora_oasst1_e3_copy.py --deepspeed deepspeed_zero2

The error output:

(xtuner0.1.9) root@intern-studio:~/code/full/internlm# NPROC_PER_NODE=2 xtuner train internlm_chat_7b_qlora_oasst1_e3_copy.py --deepspeed deepspeed_zero2
[2024-01-16 23:31:36,222] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
01/16 23:31:36 - mmengine - WARNING - Use random port: 21558
[2024-01-16 23:31:38,919] torch.distributed.run: [WARNING] 
[2024-01-16 23:31:38,919] torch.distributed.run: [WARNING] *****************************************
[2024-01-16 23:31:38,919] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
[2024-01-16 23:31:38,919] torch.distributed.run: [WARNING] *****************************************
Traceback (most recent call last):
  File "/root/code/math-internlm/xtuner/xtuner/tools/train.py", line 10, in <module>
Traceback (most recent call last):
  File "/root/code/math-internlm/xtuner/xtuner/tools/train.py", line 10, in <module>
    from mmengine.config import Config, DictAction
ModuleNotFoundError: No module named 'mmengine'
    from mmengine.config import Config, DictAction
ModuleNotFoundError: No module named 'mmengine'
[2024-01-16 23:31:43,927] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 37397) of binary: /share/conda_envs/internlm-base/bin/python
Traceback (most recent call last):
  File "/root/.local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/root/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/root/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/root/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/root/code/math-internlm/xtuner/xtuner/tools/train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-01-16_23:31:43
  host      : intern-studio
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 37398)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-01-16_23:31:43
  host      : intern-studio
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 37397)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
(xtuner0.1.9) root@intern-studio:~/code/full/internlm# pip show mmengine
Name: mmengine
Version: 0.10.2
Summary: Engine of OpenMMLab projects
Home-page: https://github.com/open-mmlab/mmengine
Author: MMEngine Authors
Author-email: openmmlab@gmail.com
License: UNKNOWN
Location: /root/.conda/envs/xtuner0.1.9/lib/python3.10/site-packages
Requires: addict, matplotlib, numpy, opencv-python, pyyaml, rich, termcolor, yapf
Required-by: xtuner
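
Note the mismatch already visible above: the traceback's failing binary is /share/conda_envs/internlm-base/bin/python, while pip show locates mmengine under /root/.conda/envs/xtuner0.1.9. A quick way to confirm which interpreter can and cannot see a module is to run the import under each candidate explicitly; a minimal sketch (`module_visible` is a hypothetical helper, not part of xtuner):

```python
import subprocess
import sys

def module_visible(python_path: str, module: str) -> bool:
    """Check whether `module` is importable by the interpreter at
    `python_path` by running `python -c "import <module>"` in a
    subprocess. Returns True when the import succeeds."""
    result = subprocess.run(
        [python_path, "-c", f"import {module}"],
        capture_output=True,
    )
    return result.returncode == 0

# The interpreter running this script can always import stdlib modules.
print(module_visible(sys.executable, "json"))  # → True
```

Running it once with each env's python (e.g. `/root/.conda/envs/xtuner0.1.9/bin/python` and `/share/conda_envs/internlm-base/bin/python`) and `"mmengine"` as the module would show exactly which environment is missing the package.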

The config script:

# Copyright (c) OpenMMLab. All rights reserved.
import torch
from torch.optim import AdamW
from bitsandbytes.optim import PagedAdamW32bit, Adam
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
                            LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR
from peft import LoraConfig
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig)

from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine import DatasetInfoHook, EvaluateChatHook
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE

#######################################################################
#                          PART 1  Settings                           #
#######################################################################
# Model
pretrained_model_name_or_path = '/root/model/internlm-chat-7b'

# Data
data_path = '/root/data/huanhuan_xtuner.json'
prompt_template = PROMPT_TEMPLATE.internlm_chat
max_length = 238
pack_to_max_length = True

# Scheduler & Optimizer
batch_size = 1  # per_device
accumulative_counts = 1
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1  # grad clip

# Evaluate the generation performance during the training
evaluation_freq = 90
SYSTEM = '现在你要扮演皇帝身边的女人--甄嬛.'
evaluation_inputs = ['你是谁',  '小姐,别的秀女都在求中选,唯有咱们小姐想被撂牌子,菩萨一定记得真真儿的——']

#######################################################################
#                      PART 2  Model & Tokenizer                      #
#######################################################################
tokenizer = dict(
    type=AutoTokenizer.from_pretrained,
    pretrained_model_name_or_path=pretrained_model_name_or_path,
    trust_remote_code=True,
    padding_side='right')

model = dict(
    type=SupervisedFinetune,
    llm=dict(
        type=AutoModelForCausalLM.from_pretrained,
        pretrained_model_name_or_path=pretrained_model_name_or_path,
        trust_remote_code=True,
        torch_dtype=torch.float16))

#######################################################################
#                      PART 3  Dataset & Dataloader                   #
#######################################################################
train_dataset = dict(
    type=process_hf_dataset,
    dataset=dict(type=load_dataset, path='json', data_files=dict(train=data_path)),
    tokenizer=tokenizer,
    max_length=max_length,
    dataset_map_fn=None,
    template_map_fn=dict(
        type=template_map_fn_factory, template=prompt_template),
    remove_unused_columns=True,
    shuffle_before_pack=True,
    pack_to_max_length=pack_to_max_length)

train_dataloader = dict(
    batch_size=batch_size,
    num_workers=dataloader_num_workers,
    dataset=train_dataset,
    sampler=dict(type=DefaultSampler, shuffle=True),
    collate_fn=dict(type=default_collate_fn))

#######################################################################
#                    PART 4  Scheduler & Optimizer                    #
#######################################################################
# optimizer
optim_wrapper = dict(
    type=AmpOptimWrapper,
    optimizer=dict(
        type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
    clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
    accumulative_counts=accumulative_counts,
    loss_scale='dynamic',
    dtype='float16')

# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md  # noqa: E501
param_scheduler = dict(
    type=CosineAnnealingLR,
    eta_min=0.0,
    by_epoch=True,
    T_max=max_epochs,
    convert_to_iter_based=True)

# train, val, test setting
train_cfg = dict(by_epoch=True, max_epochs=max_epochs, val_interval=1)

#######################################################################
#                           PART 5  Runtime                           #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
    dict(type=DatasetInfoHook, tokenizer=tokenizer),
    dict(
        type=EvaluateChatHook,
        tokenizer=tokenizer,
        every_n_iters=evaluation_freq,
        evaluation_inputs=evaluation_inputs,
        system=SYSTEM,
        prompt_template=prompt_template)
]

# configure default hooks
default_hooks = dict(
    # record the time of every iteration.
    timer=dict(type=IterTimerHook),
    # print log every 100 iterations.
    logger=dict(type=LoggerHook, interval=10),
    # enable the parameter scheduler.
    param_scheduler=dict(type=ParamSchedulerHook),
    # save checkpoint per epoch.
    checkpoint=dict(type=CheckpointHook, interval=1),
    # set sampler seed in distributed environment.
    sampler_seed=dict(type=DistSamplerSeedHook),
)

# configure environment
env_cfg = dict(
    # whether to enable cudnn benchmark
    cudnn_benchmark=False,
    # set multi process parameters
    mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
    # set distributed parameters
    dist_cfg=dict(backend='nccl'),
)

# set visualizer
visualizer = None

# set log level
log_level = 'INFO'

# load from which checkpoint
load_from = None

# whether to resume training from the loaded checkpoint
resume = False

# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
@LZHgrla
Collaborator

LZHgrla commented Jan 16, 2024

Please check the torchrun path, and whether it lives inside the virtual environment:

where torchrun

@KMnO4-zx
Contributor Author

Please check the torchrun path, and whether it lives inside the virtual environment:

where torchrun

`where` is not a bash builtin (the bash equivalent is `which` or `type -a`), so this check was inconclusive:

(xtuner0.1.9) root@intern-studio:~/code/full/internlm# where torchrun
bash: where: command not found

@LZHgrla
Collaborator

LZHgrla commented Jan 16, 2024

Multi-GPU runs are launched via torchrun by default, while single-GPU runs are launched via python. I think that is where the problem lies: the launcher is not finding the correct torchrun. Please check your PyTorch version and whether torchrun itself runs correctly.

subprocess.run(['python', fn()] + args[n_arg + 1:])

subprocess.run(['torchrun'] + torchrun_args + [fn()] +
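
Because those two lines invoke the bare names python and torchrun, both are resolved through PATH at run time, so a stale PATH entry can silently hand the job to a torchrun from another conda env. A simplified sketch of that dispatch (`build_cmd` is a hypothetical helper, not the actual xtuner source):

```python
def build_cmd(script: str, args: list[str], nproc: int) -> list[str]:
    """Hypothetical sketch of the launcher dispatch: multi-GPU runs go
    through `torchrun`, single-GPU runs through plain `python`. Both
    bare command names are resolved via PATH when executed."""
    if nproc > 1:
        return ["torchrun", f"--nproc_per_node={nproc}", script, *args]
    return ["python", script, *args]

print(build_cmd("train.py", ["--deepspeed", "deepspeed_zero2"], 2))
# → ['torchrun', '--nproc_per_node=2', 'train.py', '--deepspeed', 'deepspeed_zero2']
```

This explains the symptom in this issue: the single-GPU path uses the active env's python (mmengine imports fine), while the multi-GPU path picks up whichever torchrun PATH finds first, here one pinned to a different env.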

@KMnO4-zx
Contributor Author

Multi-GPU runs are launched via torchrun by default, while single-GPU runs are launched via python. I think that is where the problem lies: the launcher is not finding the correct torchrun. Please check your PyTorch version and whether torchrun itself runs correctly.

subprocess.run(['python', fn()] + args[n_arg + 1:])

subprocess.run(['torchrun'] + torchrun_args + [fn()] +

torchrun --help does work here, but pip list shows no torchrun (torchrun is a console script installed by the torch package, not a standalone package, so it never appears in pip list).

[screenshot]

@LZHgrla
Collaborator

LZHgrla commented Jan 16, 2024

Check whether torchrun exists in the virtual environment's bin directory:

xxx/anaconda3/envs/xtuner0.1.9/bin/torchrun

@KMnO4-zx
Contributor Author

Check whether torchrun exists in the virtual environment's bin directory:

xxx/anaconda3/envs/xtuner0.1.9/bin/torchrun

Calling it a night; I'll take another look tomorrow. Thanks for the help!

@KMnO4-zx
Contributor Author

Check whether torchrun exists in the virtual environment's bin directory:

xxx/anaconda3/envs/xtuner0.1.9/bin/torchrun

torchrun does exist in that directory:

[screenshot]

@LZHgrla
Collaborator

LZHgrla commented Jan 17, 2024

test.py

import sys
print(sys.executable)

Command:

torchrun --nnodes=1 --nproc_per_node=2 --master_port=29666 test.py

Which python path gets printed?

@KMnO4-zx
Contributor Author

test.py

import sys
print(sys.executable)

Command:

torchrun --nnodes=1 --nproc_per_node=2 --master_port=29666 test.py

Which python path gets printed?

(xtuner0.1.9) root@intern-studio:~/code# torchrun --nnodes=1 --nproc_per_node=2 --master_port=29666 test.py
[2024-01-17 11:06:46,956] torch.distributed.run: [WARNING] 
[2024-01-17 11:06:46,956] torch.distributed.run: [WARNING] *****************************************
[2024-01-17 11:06:46,956] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
[2024-01-17 11:06:46,956] torch.distributed.run: [WARNING] *****************************************
/share/conda_envs/internlm-base/bin/python
/share/conda_envs/internlm-base/bin/python

@LZHgrla
Collaborator

LZHgrla commented Jan 17, 2024

@KMnO4-zx
It looks like the torchrun being executed is not the one from the xtuner0.1.9 environment. Check your environment variables (PATH in particular) and make sure torchrun resolves to the right environment.

@KMnO4-zx
Contributor Author

It looks like the torchrun being executed is not the one from the xtuner0.1.9 environment. Check your environment variables (PATH in particular) and make sure torchrun resolves to the right environment.

Got it, thanks!

@LZHgrla LZHgrla closed this as completed Feb 4, 2024
@kikyzzz

kikyzzz commented Mar 13, 2024

Hi, I'm running into the same problem. Did you manage to solve it?

@kikyzzz

kikyzzz commented Mar 13, 2024

[screenshot]

@LZHgrla
Collaborator

LZHgrla commented Mar 13, 2024

test.py

import sys
print(sys.executable)

Command:

torchrun --nnodes=1 --nproc_per_node=2 --master_port=29666 test.py

Which python path gets printed?

@kikyzzz Try this step to verify which python your torchrun actually executes.

@kikyzzz

kikyzzz commented Mar 13, 2024 via email

@LZHgrla
Collaborator

LZHgrla commented Mar 13, 2024

@kikyzzz The root cause of this problem is that the torchrun in the current environment calls python from a different environment. You can inspect what torchrun actually runs with:

cat /home/wangbenzhi/miniconda3/envs/project/bin/torchrun

@kikyzzz

kikyzzz commented Mar 13, 2024 via email

@LZHgrla
Collaborator

LZHgrla commented Mar 13, 2024

@kikyzzz
Here is what mine looks like; the shebang on the first line pins which interpreter runs the script:

#!/xxxx/bin/python
# -*- coding: utf-8 -*-
import re
import sys
from torch.distributed.run import main
if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])
    sys.exit(main())

@kikyzzz

kikyzzz commented Mar 13, 2024 via email

@LZHgrla
Collaborator

LZHgrla commented Mar 13, 2024

@kikyzzz
Then check the path of the torchrun being executed (in bash, use `which` rather than `where`):

which torchrun
