
Model stops as the training starts. #638

Closed
Nazzish opened this issue Aug 7, 2020 · 19 comments
Labels
stale This issue has become stale

Comments

@Nazzish

Nazzish commented Aug 7, 2020

Hello,

I am using the code in Google Colab and facing an error. As soon as I reach the following step:

[screenshot]

the system stops and doesn't run any further, without showing any error.
Please let me know what the possible issue could be and how to resolve it.

@soccerdroid

I am having the same issue. Using bert-base-multilingual-cased as a MultiLabelClassificationModel
[screenshot]

@soccerdroid

Any thoughts/hints on this?

@ThilinaRajapakse
Owner

> I am having the same issue. Using bert-base-multilingual-cased as a MultiLabelClassificationModel
> [screenshot]

Does this remain stuck indefinitely? It's possible that the training is paused for evaluation. Can you check the GPU usage and see if it is working or idle?

@ThilinaRajapakse
Owner

> Hello,
>
> I am using the code in Google Colab and facing an error. As soon as I reach the following step:
>
> [screenshot]
>
> the system stops and doesn't run any further, without showing any error.
> Please let me know what the possible issue could be and how to resolve it.

Is this only happening for this dataset or does it happen with any data you use?

@soccerdroid

soccerdroid commented Aug 10, 2020

> I am having the same issue. Using bert-base-multilingual-cased as a MultiLabelClassificationModel
> [screenshot]
>
> Does this remain stuck indefinitely? It's possible that the training is paused for evaluation. Can you check the GPU usage and see if it is working or idle?

Yes, I left it for almost an hour, I think, and no progress was made. I have CUDA disabled. I haven't tried a different dataset, but I have tried different models (BERT and ALBERT), and the issue persists.

@ThilinaRajapakse
Owner

If CUDA is disabled, the evaluation might take longer than an hour, even on a moderately sized dataset.

Can you try running it with CUDA enabled? You can use Colab if needed.

@Nazzish
Author

Nazzish commented Aug 10, 2020

I have used "cuda_available = torch.cuda.is_available()" as mentioned on your tutorial website on simple transformers. The GPU remains idle at this point shown in the screenshot. I want to use binary classification and have also used models other than bert but the problem persists. I haven't tried other datasets though.

@soccerdroid

> If CUDA is disabled, the evaluation might take longer than an hour, even on a moderately sized dataset.
>
> Can you try running it with CUDA enabled? You can use Colab if needed.

I tried running it without hyperparameter optimization, and an error in the evaluation metric function popped up. I guess that error was causing the training to hang. I fixed it, and the training went well (still without sweeping, CUDA disabled). I haven't re-run the script with sweeping yet. I will let you know how it goes!
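
For anyone hitting the same thing, a rough sketch of how a custom evaluation metric is typically passed to train_model; "model", "train_df" and "eval_df" are assumed to come from earlier in the script, and f1_macro is a hypothetical metric:

from sklearn.metrics import f1_score

# Simple Transformers calls each extra metric as metric(true_labels, predictions),
# so the function must accept exactly those two arguments.
def f1_macro(labels, preds):
    return f1_score(labels, preds, average="macro")

# Extra metrics are passed as keyword arguments to train_model/eval_model;
# "model", "train_df" and "eval_df" are assumed to be defined earlier in the script.
model.train_model(train_df, eval_df=eval_df, f1=f1_macro)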

@Nazzish
Author

Nazzish commented Aug 12, 2020

Any idea how to get it started? The error persists on different datasets as well.

@cadae

cadae commented Aug 14, 2020

Same issue here. This issue started appearing in multi-label classification training after I changed the model's args from a dictionary to ClassificationArgs, and it seems to affect Linux only: the training was able to finish on Windows 10 but not on RedHat 8 with the same dataset and configuration. The workaround is to just use dictionary args for the model's args (the training args have to be a dict); a sketch of that workaround follows the args below. These are the args I used when the training paused.

model_args = ClassificationArgs(
    regression=False,
    no_cache=True,
    fp16=False,
    silent=False,
    reprocess_input_data=True,
    use_early_stopping=True,
    sliding_window=False,
    save_model_every_epoch=False,
    save_eval_checkpoints=False,
    use_multiprocessing=False,
    cache_dir=MODELS_DIR+"temp",
    tensorboard_dir=MODELS_DIR+"temp",
    train_batch_size=8,
    eval_batch_size=8,
    early_stopping_metric="eval_loss",
    early_stopping_delta=0.01,
    early_stopping_metric_minimize=True,
    early_stopping_patience=3,
    early_stopping_consider_epochs=True,
    labels_map={}
)
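
For reference, a rough sketch of the dictionary-args workaround described above, assuming a multi-label setup; the model type and name, label count, and MODELS_DIR value are placeholders, and a few of the args above are omitted for brevity:

import torch
from simpletransformers.classification import MultiLabelClassificationModel

MODELS_DIR = "outputs/"  # placeholder path

# Roughly the settings above, expressed as a plain dict instead of ClassificationArgs.
model_args = {
    "no_cache": True,
    "fp16": False,
    "reprocess_input_data": True,
    "use_early_stopping": True,
    "save_model_every_epoch": False,
    "save_eval_checkpoints": False,
    "use_multiprocessing": False,
    "cache_dir": MODELS_DIR + "temp",
    "tensorboard_dir": MODELS_DIR + "temp",
    "train_batch_size": 8,
    "eval_batch_size": 8,
    "early_stopping_metric": "eval_loss",
    "early_stopping_delta": 0.01,
    "early_stopping_metric_minimize": True,
    "early_stopping_patience": 3,
    "early_stopping_consider_epochs": True,
}

model = MultiLabelClassificationModel(
    "bert",
    "bert-base-multilingual-cased",
    num_labels=5,  # placeholder; set to your actual number of labels
    args=model_args,
    use_cuda=torch.cuda.is_available(),
)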

@ThilinaRajapakse
Owner

Can I get a minimal script to reproduce this issue? I have no clue why this is happening.

@stale

stale bot commented Oct 25, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale This issue has become stale label Oct 25, 2020
@stale stale bot closed this as completed Nov 1, 2020
@yanqiangmiffy

Yes, I also have this issue. I can see that another CPU process is hung.

@creatorrr

For now, this has helped me:

model_args = ClassificationArgs(
    use_multiprocessing=False,
    use_multiprocessing_for_evaluation=False,
)

os.environ["TOKENIZERS_PARALLELISM"] = "false"
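
For context, a rough sketch of where this fix usually goes in a full training script; the model type and name and the toy DataFrame are placeholders:

import os

# Set this before any tokenizer or model is created, so the HuggingFace tokenizers
# library never spawns its parallel workers.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

import pandas as pd
from simpletransformers.classification import ClassificationModel, ClassificationArgs

model_args = ClassificationArgs(
    use_multiprocessing=False,
    use_multiprocessing_for_evaluation=False,
)

# Placeholder data and model; substitute your own dataset and model name.
train_df = pd.DataFrame(
    [["example text one", 0], ["example text two", 1]],
    columns=["text", "labels"],
)

model = ClassificationModel("bert", "bert-base-cased", args=model_args)
model.train_model(train_df)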

@gilzeevi25

> For now, this has helped me:
>
> model_args = ClassificationArgs(
>     use_multiprocessing=False,
>     use_multiprocessing_for_evaluation=False,
> )
>
> os.environ["TOKENIZERS_PARALLELISM"] = "false"

Oddly enough, this also helped me, so cheers, mate.

@nemetht

nemetht commented Dec 21, 2022

> For now, this has helped me:
>
> model_args = ClassificationArgs(
>     use_multiprocessing=False,
>     use_multiprocessing_for_evaluation=False,
> )
>
> os.environ["TOKENIZERS_PARALLELISM"] = "false"

This also worked for me. Thank you!

@phileas-condemine

phileas-condemine commented Jan 26, 2023

> For now, this has helped me:
>
> model_args = ClassificationArgs(
>     use_multiprocessing=False,
>     use_multiprocessing_for_evaluation=False,
> )
>
> os.environ["TOKENIZERS_PARALLELISM"] = "false"

Same here. I've been struggling with this issue for more than a year, with training staying idle in the middle of a run.
Thank you for the tip, @creatorrr!

@shtosti

shtosti commented Dec 4, 2023

> For now, this has helped me:
>
> model_args = ClassificationArgs(
>     use_multiprocessing=False,
>     use_multiprocessing_for_evaluation=False,
> )
>
> os.environ["TOKENIZERS_PARALLELISM"] = "false"

This was the only fix that helped me with this problem. Thank you so much for sharing it @creatorrr!

@StrikerRUS

@creatorrr

Thanks a lot for the fix!
