
[Bug] Training becomes pending if the training dataset contains text data. #45

Open
lg123666 opened this issue Jun 18, 2024 · 1 comment

Comments


lg123666 commented Jun 18, 2024

When I add text data to the training dataset, the training process always hangs (shows as pending) at the first step. Conversely, if I remove the text data, training proceeds normally. What could be causing this issue?

Different situations as follows:

  1. Single GPU with text data: normal (unsure if training will complete)
  2. Multi-GPU with text data: abnormal
  3. Video-LLaVA codebase with ChatUniVi model: pending at 70% instead of at the first step
  4. Text data only: normal (unsure if training will complete)

[screenshot: pending]

pip list

accelerate                0.21.0
aiofiles                  23.2.1
aiohttp                   3.8.5
aiosignal                 1.3.1
altair                    5.1.1
anyio                     3.7.1
appdirs                   1.4.4
async-timeout             4.0.3
attrs                     23.1.0
av                        12.1.0
bitsandbytes              0.41.0
certifi                   2023.7.22
charset-normalizer        3.2.0
click                     8.1.7
cmake                     3.27.2
contourpy                 1.1.0
cycler                    0.11.0
decord                    0.6.0
deepspeed                 0.9.5
docker-pycreds            0.4.0
einops                    0.6.1
einops-exts               0.0.4
exceptiongroup            1.1.3
fastapi                   0.103.1
ffmpy                     0.3.1
filelock                  3.12.3
flash-attn                2.1.0
fonttools                 4.42.1
frozenlist                1.4.0
fsspec                    2023.9.0
fvcore                    0.1.5.post20221221
gitdb                     4.0.10
GitPython                 3.1.34
gradio                    3.35.2
gradio_client             0.2.9
h11                       0.14.0
hjson                     3.1.0
httpcore                  0.17.3
httpx                     0.24.0
huggingface-hub           0.23.4
idna                      3.4
iopath                    0.1.10
Jinja2                    3.1.2
joblib                    1.3.2
jsonschema                4.19.0
jsonschema-specifications 2023.7.1
kiwisolver                1.4.5
linkify-it-py             2.0.2
lit                       16.0.6
llava                     1.0.1              /home/users/LLaVA
markdown-it-py            2.2.0
markdown2                 2.4.10
MarkupSafe                2.1.3
matplotlib                3.7.2
mdit-py-plugins           0.3.3
mdurl                     0.1.2
mpmath                    1.3.0
multidict                 6.0.4
networkx                  3.1
ninja                     1.11.1
numpy                     1.25.2
opencv-python             4.10.0.84
orjson                    3.9.5
packaging                 23.1
pandas                    2.1.0
parameterized             0.9.0
pathtools                 0.1.2
peft                      0.4.0
Pillow                    10.0.0
pip                       23.2.1
portalocker               2.8.2
protobuf                  4.24.2
psutil                    5.9.5
py-cpuinfo                9.0.0
pydantic                  1.10.12
pydub                     0.25.1
Pygments                  2.16.1
pyparsing                 3.0.9
python-dateutil           2.8.2
python-multipart          0.0.6
pytorch-triton-rocm       2.0.2
pytorchvideo              0.1.5
pytz                      2023.3.post1
PyYAML                    6.0.1
referencing               0.30.2
regex                     2023.8.8
requests                  2.31.0
rpds-py                   0.10.2
safetensors               0.3.3
scikit-learn              1.2.2
scipy                     1.11.2
semantic-version          2.10.0
sentencepiece             0.1.99
sentry-sdk                1.30.0
setproctitle              1.3.2
setuptools                68.0.0
shortuuid                 1.0.11
six                       1.16.0
smmap                     5.0.0
sniffio                   1.3.0
some-package              0.1
starlette                 0.27.0
svgwrite                  1.4.3
sympy                     1.12
tabulate                  0.9.0
tensorboardX              2.6.2.2
termcolor                 2.4.0
threadpoolctl             3.2.0
timm                      0.6.13
tokenizers                0.15.2
toolz                     0.12.0
torch                     2.0.1+cu118
torchvision               0.15.2+cu118
tqdm                      4.66.1
transformers              4.37.0
triton                    2.0.0
typing_extensions         4.7.1
tzdata                    2023.3
uc-micro-py               1.0.2
urllib3                   2.0.4
uvicorn                   0.23.2
wandb                     0.15.9
wavedrom                  2.0.3.post3
websockets                11.0.3
wheel                     0.38.4
xformers                  0.0.21
yacs                      0.1.8
yarl                      1.9.2
Member

jpthu17 commented Jun 18, 2024

This error comes from a DeepSpeed bug (microsoft/DeepSpeed#2223). Our code hangs easily because the lengths of the text samples vary greatly.

If you are using the text data from LLaVA v1.5, training proceeds normally after deleting some very long or very short text samples. The data I filtered is available here:
https://huggingface.co/datasets/Chat-UniVi/Chat-UniVi-Instruct/blob/main/v1.5_train_json/llavaimage_tune.json
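As a rough sketch of that workaround, one could filter the tuning JSON by total text length before training. This is not the repo's actual filtering code; the record layout (a list of dicts with a LLaVA-style "conversations" field) and the thresholds are assumptions:

```python
# Hedged sketch: drop samples whose total text length is an outlier, to
# reduce DeepSpeed hangs caused by highly variable sequence lengths across
# ranks. MIN_CHARS/MAX_CHARS are hypothetical values, not from the repo.

MIN_CHARS = 10      # hypothetical lower bound
MAX_CHARS = 2048    # hypothetical upper bound

def total_text_length(sample):
    """Sum the character lengths of all conversation turns in one sample."""
    return sum(len(turn.get("value", "")) for turn in sample.get("conversations", []))

def filter_by_length(samples, min_chars=MIN_CHARS, max_chars=MAX_CHARS):
    """Keep only samples whose total text length lies in [min_chars, max_chars]."""
    return [s for s in samples if min_chars <= total_text_length(s) <= max_chars]
```

In practice one would load the tuning JSON with `json.load`, apply `filter_by_length`, and dump the surviving samples to a new file to use as the training dataset.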
