
Dual-GPU training fails: ModuleNotFoundError: No module named 'mmengine' #324

Closed
KMnO4-zx opened this issue Jan 16, 2024 · 20 comments

Comments

@KMnO4-zx
Contributor

While running a full fine-tune of InternLM-7b-chat with xtuner 0.1.9, deepspeed, and two GPUs (A100×2), I hit ModuleNotFoundError: No module named 'mmengine'. On a single A100 the model loads fine without this error, but training then runs out of memory (OOM).

The command used:

NPROC_PER_NODE=2 xtuner train internlm_chat_7b_qlora_oasst1_e3_copy.py --deepspeed deepspeed_zero2

The error output:

(xtuner0.1.9) root@intern-studio:~/code/full/internlm# NPROC_PER_NODE=2 xtuner train internlm_chat_7b_qlora_oasst1_e3_copy.py --deepspeed deepspeed_zero2
[2024-01-16 23:31:36,222] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
01/16 23:31:36 - mmengine - WARNING - Use random port: 21558
[2024-01-16 23:31:38,919] torch.distributed.run: [WARNING] 
[2024-01-16 23:31:38,919] torch.distributed.run: [WARNING] *****************************************
[2024-01-16 23:31:38,919] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
[2024-01-16 23:31:38,919] torch.distributed.run: [WARNING] *****************************************
Traceback (most recent call last):
  File "/root/code/math-internlm/xtuner/xtuner/tools/train.py", line 10, in <module>
Traceback (most recent call last):
  File "/root/code/math-internlm/xtuner/xtuner/tools/train.py", line 10, in <module>
    from mmengine.config import Config, DictAction
ModuleNotFoundError: No module named 'mmengine'
    from mmengine.config import Config, DictAction
ModuleNotFoundError: No module named 'mmengine'
[2024-01-16 23:31:43,927] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 37397) of binary: /share/conda_envs/internlm-base/bin/python
Traceback (most recent call last):
  File "/root/.local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/root/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/root/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/root/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/root/code/math-internlm/xtuner/xtuner/tools/train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-01-16_23:31:43
  host      : intern-studio
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 37398)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-01-16_23:31:43
  host      : intern-studio
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 37397)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
(xtuner0.1.9) root@intern-studio:~/code/full/internlm# pip show mmengine
Name: mmengine
Version: 0.10.2
Summary: Engine of OpenMMLab projects
Home-page: https://github.com/open-mmlab/mmengine
Author: MMEngine Authors
Author-email: openmmlab@gmail.com
License: UNKNOWN
Location: /root/.conda/envs/xtuner0.1.9/lib/python3.10/site-packages
Requires: addict, matplotlib, numpy, opencv-python, pyyaml, rich, termcolor, yapf
Required-by: xtuner
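
Note the mismatch already visible above: the traceback's failing binary is /share/conda_envs/internlm-base/bin/python, while pip show locates mmengine under /root/.conda/envs/xtuner0.1.9. A quick way to confirm which interpreter can and cannot see a module is to run the import under each candidate explicitly; a minimal sketch (`module_visible` is a hypothetical helper, not part of xtuner):

```python
import subprocess
import sys

def module_visible(python_path: str, module: str) -> bool:
    """Check whether `module` is importable by the interpreter at
    `python_path` by running `python -c "import <module>"` in a
    subprocess. Returns True when the import succeeds."""
    result = subprocess.run(
        [python_path, "-c", f"import {module}"],
        capture_output=True,
    )
    return result.returncode == 0

# The interpreter running this script can always import stdlib modules.
print(module_visible(sys.executable, "json"))  # → True
```

Running it once with each env's python (e.g. `/root/.conda/envs/xtuner0.1.9/bin/python` and `/share/conda_envs/internlm-base/bin/python`) and `"mmengine"` as the module would show exactly which environment is missing the package.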

The config script:

# Copyright (c) OpenMMLab. All rights reserved.
import torch
from torch.optim import AdamW
from bitsandbytes.optim import PagedAdamW32bit, Adam
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
                            LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR
from peft import LoraConfig
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig)

from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine import DatasetInfoHook, EvaluateChatHook
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE

#######################################################################
#                          PART 1  Settings                           #
#######################################################################
# Model
pretrained_model_name_or_path = '/root/model/internlm-chat-7b'

# Data
data_path = '/root/data/huanhuan_xtuner.json'
prompt_template = PROMPT_TEMPLATE.internlm_chat
max_length = 238
pack_to_max_length = True

# Scheduler & Optimizer
batch_size = 1  # per_device
accumulative_counts = 1
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1  # grad clip

# Evaluate the generation performance during the training
evaluation_freq = 90
SYSTEM = '现在你要扮演皇帝身边的女人--甄嬛.'
evaluation_inputs = ['你是谁',  '小姐,别的秀女都在求中选,唯有咱们小姐想被撂牌子,菩萨一定记得真真儿的——']

#######################################################################
#                      PART 2  Model & Tokenizer                      #
#######################################################################
tokenizer = dict(
    type=AutoTokenizer.from_pretrained,
    pretrained_model_name_or_path=pretrained_model_name_or_path,
    trust_remote_code=True,
    padding_side='right')

model = dict(
    type=SupervisedFinetune,
    llm=dict(
        type=AutoModelForCausalLM.from_pretrained,
        pretrained_model_name_or_path=pretrained_model_name_or_path,
        trust_remote_code=True,
        torch_dtype=torch.float16))

#######################################################################
#                      PART 3  Dataset & Dataloader                   #
#######################################################################
train_dataset = dict(
    type=process_hf_dataset,
    dataset=dict(type=load_dataset, path='json', data_files=dict(train=data_path)),
    tokenizer=tokenizer,
    max_length=max_length,
    dataset_map_fn=None,
    template_map_fn=dict(
        type=template_map_fn_factory, template=prompt_template),
    remove_unused_columns=True,
    shuffle_before_pack=True,
    pack_to_max_length=pack_to_max_length)

train_dataloader = dict(
    batch_size=batch_size,
    num_workers=dataloader_num_workers,
    dataset=train_dataset,
    sampler=dict(type=DefaultSampler, shuffle=True),
    collate_fn=dict(type=default_collate_fn))

#######################################################################
#                    PART 4  Scheduler & Optimizer                    #
#######################################################################
# optimizer
optim_wrapper = dict(
    type=AmpOptimWrapper,
    optimizer=dict(
        type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
    clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
    accumulative_counts=accumulative_counts,
    loss_scale='dynamic',
    dtype='float16')

# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md  # noqa: E501
param_scheduler = dict(
    type=CosineAnnealingLR,
    eta_min=0.0,
    by_epoch=True,
    T_max=max_epochs,
    convert_to_iter_based=True)

# train, val, test setting
train_cfg = dict(by_epoch=True, max_epochs=max_epochs, val_interval=1)

#######################################################################
#                           PART 5  Runtime                           #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
    dict(type=DatasetInfoHook, tokenizer=tokenizer),
    dict(
        type=EvaluateChatHook,
        tokenizer=tokenizer,
        every_n_iters=evaluation_freq,
        evaluation_inputs=evaluation_inputs,
        system=SYSTEM,
        prompt_template=prompt_template)
]

# configure default hooks
default_hooks = dict(
    # record the time of every iteration.
    timer=dict(type=IterTimerHook),
    # print log every 100 iterations.
    logger=dict(type=LoggerHook, interval=10),
    # enable the parameter scheduler.
    param_scheduler=dict(type=ParamSchedulerHook),
    # save checkpoint per epoch.
    checkpoint=dict(type=CheckpointHook, interval=1),
    # set sampler seed in distributed environment.
    sampler_seed=dict(type=DistSamplerSeedHook),
)

# configure environment
env_cfg = dict(
    # whether to enable cudnn benchmark
    cudnn_benchmark=False,
    # set multi process parameters
    mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
    # set distributed parameters
    dist_cfg=dict(backend='nccl'),
)

# set visualizer
visualizer = None

# set log level
log_level = 'INFO'

# load from which checkpoint
load_from = None

# whether to resume training from the loaded checkpoint
resume = False

# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
@LZHgrla
Collaborator

LZHgrla commented Jan 16, 2024

Please check the torchrun path, and whether it lives inside the virtual environment:

where torchrun

@KMnO4-zx
Contributor Author

Please check the torchrun path, and whether it lives inside the virtual environment:

where torchrun

`where` is not a bash builtin (the bash equivalent is `which` or `type -a`), so this check was inconclusive:

(xtuner0.1.9) root@intern-studio:~/code/full/internlm# where torchrun
bash: where: command not found

@LZHgrla
Collaborator

LZHgrla commented Jan 16, 2024

Multi-GPU runs are launched via torchrun by default, while single-GPU runs are launched via python. I think that is where the problem lies: the launcher is not finding the correct torchrun. Please check your PyTorch version and whether torchrun itself runs correctly.

subprocess.run(['python', fn()] + args[n_arg + 1:])

subprocess.run(['torchrun'] + torchrun_args + [fn()] +
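
Because those two lines invoke the bare names python and torchrun, both are resolved through PATH at run time, so a stale PATH entry can silently hand the job to a torchrun from another conda env. A simplified sketch of that dispatch (`build_cmd` is a hypothetical helper, not the actual xtuner source):

```python
def build_cmd(script: str, args: list[str], nproc: int) -> list[str]:
    """Hypothetical sketch of the launcher dispatch: multi-GPU runs go
    through `torchrun`, single-GPU runs through plain `python`. Both
    bare command names are resolved via PATH when executed."""
    if nproc > 1:
        return ["torchrun", f"--nproc_per_node={nproc}", script, *args]
    return ["python", script, *args]

print(build_cmd("train.py", ["--deepspeed", "deepspeed_zero2"], 2))
# → ['torchrun', '--nproc_per_node=2', 'train.py', '--deepspeed', 'deepspeed_zero2']
```

This explains the symptom in this issue: the single-GPU path uses the active env's python (mmengine imports fine), while the multi-GPU path picks up whichever torchrun PATH finds first, here one pinned to a different env.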

@KMnO4-zx
Contributor Author

Multi-GPU runs are launched via torchrun by default, while single-GPU runs are launched via python. I think that is where the problem lies: the launcher is not finding the correct torchrun. Please check your PyTorch version and whether torchrun itself runs correctly.

subprocess.run(['python', fn()] + args[n_arg + 1:])

subprocess.run(['torchrun'] + torchrun_args + [fn()] +

torchrun --help does work here, but pip list shows no torchrun (torchrun is a console script installed by the torch package, not a standalone package, so it never appears in pip list).

[screenshot]

@LZHgrla
Collaborator

LZHgrla commented Jan 16, 2024

Check whether torchrun exists in the virtual environment's bin directory:

xxx/anaconda3/envs/xtuner0.1.9/bin/torchrun

@KMnO4-zx
Contributor Author

Check whether torchrun exists in the virtual environment's bin directory:

xxx/anaconda3/envs/xtuner0.1.9/bin/torchrun

Calling it a night; I'll take another look tomorrow. Thanks for the help!

@KMnO4-zx
Contributor Author

Check whether torchrun exists in the virtual environment's bin directory:

xxx/anaconda3/envs/xtuner0.1.9/bin/torchrun

torchrun does exist in that directory:

[screenshot]

@LZHgrla
Collaborator

LZHgrla commented Jan 17, 2024

test.py

import sys
print(sys.executable)

Command:

torchrun --nnodes=1 --nproc_per_node=2 --master_port=29666 test.py

Which python path gets printed?

@KMnO4-zx
Contributor Author

test.py

import sys
print(sys.executable)

Command:

torchrun --nnodes=1 --nproc_per_node=2 --master_port=29666 test.py

Which python path gets printed?

(xtuner0.1.9) root@intern-studio:~/code# torchrun --nnodes=1 --nproc_per_node=2 --master_port=29666 test.py
[2024-01-17 11:06:46,956] torch.distributed.run: [WARNING] 
[2024-01-17 11:06:46,956] torch.distributed.run: [WARNING] *****************************************
[2024-01-17 11:06:46,956] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
[2024-01-17 11:06:46,956] torch.distributed.run: [WARNING] *****************************************
/share/conda_envs/internlm-base/bin/python
/share/conda_envs/internlm-base/bin/python

@LZHgrla
Collaborator

LZHgrla commented Jan 17, 2024

@KMnO4-zx
It looks like the torchrun being executed is not the one from the xtuner0.1.9 environment. Check your environment variables (PATH in particular) and make sure torchrun resolves to the right environment.

@KMnO4-zx
Contributor Author

It looks like the torchrun being executed is not the one from the xtuner0.1.9 environment. Check your environment variables (PATH in particular) and make sure torchrun resolves to the right environment.

Got it, thanks!

@LZHgrla LZHgrla closed this as completed Feb 4, 2024
@kikyzzz

kikyzzz commented Mar 13, 2024

Hi, I'm running into the same problem. Did you manage to solve it?

@kikyzzz

kikyzzz commented Mar 13, 2024

[screenshot]

@LZHgrla
Collaborator

LZHgrla commented Mar 13, 2024

test.py

import sys
print(sys.executable)

Command:

torchrun --nnodes=1 --nproc_per_node=2 --master_port=29666 test.py

Which python path gets printed?

@kikyzzz Try this step to verify which python your torchrun actually executes.

@kikyzzz

kikyzzz commented Mar 13, 2024 via email

@LZHgrla
Collaborator

LZHgrla commented Mar 13, 2024

@kikyzzz The root cause of this problem is that the torchrun in the current environment calls python from a different environment. You can inspect what torchrun actually runs with:

cat /home/wangbenzhi/miniconda3/envs/project/bin/torchrun

@kikyzzz

kikyzzz commented Mar 13, 2024 via email

@LZHgrla
Collaborator

LZHgrla commented Mar 13, 2024

@kikyzzz
Here is what mine looks like; the shebang on the first line pins which interpreter runs the script:

#!/xxxx/bin/python
# -*- coding: utf-8 -*-
import re
import sys
from torch.distributed.run import main
if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])
    sys.exit(main())

@kikyzzz

kikyzzz commented Mar 13, 2024 via email

@LZHgrla
Collaborator

LZHgrla commented Mar 13, 2024

@kikyzzz
Then check the path of the torchrun being executed (in bash, use `which` rather than `where`):

which torchrun
