
[Bug] Training becomes pending if the training dataset contains text data. #45

Open
lg123666 opened this issue Jun 18, 2024 · 1 comment

Comments


lg123666 commented Jun 18, 2024

When I add text data to the training dataset, the training process always hangs (shows as pending) at the first step. Conversely, if I remove the text data, training proceeds normally. What could be causing this issue?

Different situations as follows:

  1. Single GPU with text data: normal (unsure if training will complete)
  2. Multi-GPU with text data: abnormal
  3. Video-LLaVA codebase with ChatUniVi model: pending at 70% instead of at the first step
  4. Text data only: normal (unsure if training will complete)

[screenshot: pending]

pip list

accelerate                0.21.0
aiofiles                  23.2.1
aiohttp                   3.8.5
aiosignal                 1.3.1
altair                    5.1.1
anyio                     3.7.1
appdirs                   1.4.4
async-timeout             4.0.3
attrs                     23.1.0
av                        12.1.0
bitsandbytes              0.41.0
certifi                   2023.7.22
charset-normalizer        3.2.0
click                     8.1.7
cmake                     3.27.2
contourpy                 1.1.0
cycler                    0.11.0
decord                    0.6.0
deepspeed                 0.9.5
docker-pycreds            0.4.0
einops                    0.6.1
einops-exts               0.0.4
exceptiongroup            1.1.3
fastapi                   0.103.1
ffmpy                     0.3.1
filelock                  3.12.3
flash-attn                2.1.0
fonttools                 4.42.1
frozenlist                1.4.0
fsspec                    2023.9.0
fvcore                    0.1.5.post20221221
gitdb                     4.0.10
GitPython                 3.1.34
gradio                    3.35.2
gradio_client             0.2.9
h11                       0.14.0
hjson                     3.1.0
httpcore                  0.17.3
httpx                     0.24.0
huggingface-hub           0.23.4
idna                      3.4
iopath                    0.1.10
Jinja2                    3.1.2
joblib                    1.3.2
jsonschema                4.19.0
jsonschema-specifications 2023.7.1
kiwisolver                1.4.5
linkify-it-py             2.0.2
lit                       16.0.6
llava                     1.0.1              /home/users/LLaVA
markdown-it-py            2.2.0
markdown2                 2.4.10
MarkupSafe                2.1.3
matplotlib                3.7.2
mdit-py-plugins           0.3.3
mdurl                     0.1.2
mpmath                    1.3.0
multidict                 6.0.4
networkx                  3.1
ninja                     1.11.1
numpy                     1.25.2
opencv-python             4.10.0.84
orjson                    3.9.5
packaging                 23.1
pandas                    2.1.0
parameterized             0.9.0
pathtools                 0.1.2
peft                      0.4.0
Pillow                    10.0.0
pip                       23.2.1
portalocker               2.8.2
protobuf                  4.24.2
psutil                    5.9.5
py-cpuinfo                9.0.0
pydantic                  1.10.12
pydub                     0.25.1
Pygments                  2.16.1
pyparsing                 3.0.9
python-dateutil           2.8.2
python-multipart          0.0.6
pytorch-triton-rocm       2.0.2
pytorchvideo              0.1.5
pytz                      2023.3.post1
PyYAML                    6.0.1
referencing               0.30.2
regex                     2023.8.8
requests                  2.31.0
rpds-py                   0.10.2
safetensors               0.3.3
scikit-learn              1.2.2
scipy                     1.11.2
semantic-version          2.10.0
sentencepiece             0.1.99
sentry-sdk                1.30.0
setproctitle              1.3.2
setuptools                68.0.0
shortuuid                 1.0.11
six                       1.16.0
smmap                     5.0.0
sniffio                   1.3.0
some-package              0.1
starlette                 0.27.0
svgwrite                  1.4.3
sympy                     1.12
tabulate                  0.9.0
tensorboardX              2.6.2.2
termcolor                 2.4.0
threadpoolctl             3.2.0
timm                      0.6.13
tokenizers                0.15.2
toolz                     0.12.0
torch                     2.0.1+cu118
torchvision               0.15.2+cu118
tqdm                      4.66.1
transformers              4.37.0
triton                    2.0.0
typing_extensions         4.7.1
tzdata                    2023.3
uc-micro-py               1.0.2
urllib3                   2.0.4
uvicorn                   0.23.2
wandb                     0.15.9
wavedrom                  2.0.3.post3
websockets                11.0.3
wheel                     0.38.4
xformers                  0.0.21
yacs                      0.1.8
yarl                      1.9.2
Member

jpthu17 commented Jun 18, 2024

This error comes from a DeepSpeed bug (microsoft/DeepSpeed#2223). Our code hangs easily because the lengths of the text samples vary greatly.

If you are using the text data from LLaVA v1.5, training proceeds normally after deleting some very long or very short text samples. The data I filtered is available here:
https://huggingface.co/datasets/Chat-UniVi/Chat-UniVi-Instruct/blob/main/v1.5_train_json/llavaimage_tune.json
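As a rough sketch of that workaround, one could filter the tuning JSON by total text length before training. This is not the repo's actual filtering code; the record layout (a list of dicts with a LLaVA-style "conversations" field) and the thresholds are assumptions:

```python
# Hedged sketch: drop samples whose total text length is an outlier, to
# reduce DeepSpeed hangs caused by highly variable sequence lengths across
# ranks. MIN_CHARS/MAX_CHARS are hypothetical values, not from the repo.

MIN_CHARS = 10      # hypothetical lower bound
MAX_CHARS = 2048    # hypothetical upper bound

def total_text_length(sample):
    """Sum the character lengths of all conversation turns in one sample."""
    return sum(len(turn.get("value", "")) for turn in sample.get("conversations", []))

def filter_by_length(samples, min_chars=MIN_CHARS, max_chars=MAX_CHARS):
    """Keep only samples whose total text length lies in [min_chars, max_chars]."""
    return [s for s in samples if min_chars <= total_text_length(s) <= max_chars]
```

In practice one would load the tuning JSON with `json.load`, apply `filter_by_length`, and dump the surviving samples to a new file to use as the training dataset.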
