
Model stops as the training starts. #638

Closed
Nazzish opened this issue Aug 7, 2020 · 19 comments
Labels
stale This issue has become stale

Comments

@Nazzish

Nazzish commented Aug 7, 2020

Hello,

I am using the code in Google Colab and facing an error. As soon as I reach the following step:

[screenshot]

the system stops and doesn't run any further, without showing any error.
Please let me know what the possible issue could be and how to resolve it.

@soccerdroid

I am having the same issue. Using bert-base-multilingual-cased as a MultiLabelClassificationModel
[screenshot]

@soccerdroid

Any thoughts/hints on this?

@ThilinaRajapakse
Owner

> I am having the same issue. Using bert-base-multilingual-cased as a MultiLabelClassificationModel
> [screenshot]

Does this remain stuck indefinitely? It's possible that the training is paused for evaluation. Can you check the GPU usage and see if it is working or idle?

@ThilinaRajapakse
Owner

> Hello,
>
> I am using the code in Google Colab and facing an error. As soon as I reach the following step:
>
> [screenshot]
>
> the system stops and doesn't run any further, without showing any error.
> Please let me know what the possible issue could be and how to resolve it.

Is this only happening for this dataset or does it happen with any data you use?

@soccerdroid

soccerdroid commented Aug 10, 2020

> I am having the same issue. Using bert-base-multilingual-cased as a MultiLabelClassificationModel
> [screenshot]
>
> Does this remain stuck indefinitely? It's possible that the training is paused for evaluation. Can you check the GPU usage and see if it is working or idle?

Yes, I left it for almost an hour, I think, and no progress was made. I have CUDA disabled. I haven't tried a different dataset, but I have tried different models (BERT and ALBERT), and the issue persists.

@ThilinaRajapakse
Owner

If CUDA is disabled, the evaluation might take longer than an hour, even on a moderately sized dataset.

Can you try running it with CUDA enabled? You can use Colab if needed.

@Nazzish
Author

Nazzish commented Aug 10, 2020

I have used "cuda_available = torch.cuda.is_available()" as mentioned on your tutorial website on simple transformers. The GPU remains idle at this point shown in the screenshot. I want to use binary classification and have also used models other than bert but the problem persists. I haven't tried other datasets though.

@soccerdroid

> If CUDA is disabled, the evaluation might take longer than an hour, even on a moderately sized dataset.
>
> Can you try running it with CUDA enabled? You can use Colab if needed.

I tried running it without hyperparameter optimization, and an error in the evaluation metric function popped up. I guess that error was causing the training to hang. I fixed it, and the training went well (still without sweeping, CUDA disabled). I haven't re-run the script with sweeping yet. I will let you know how it goes!
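
For anyone hitting the same thing, a rough sketch of how a custom evaluation metric is typically passed to train_model; "model", "train_df" and "eval_df" are assumed to come from earlier in the script, and f1_macro is a hypothetical metric:

from sklearn.metrics import f1_score

# Simple Transformers calls each extra metric as metric(true_labels, predictions),
# so the function must accept exactly those two arguments.
def f1_macro(labels, preds):
    return f1_score(labels, preds, average="macro")

# Extra metrics are passed as keyword arguments to train_model/eval_model;
# "model", "train_df" and "eval_df" are assumed to be defined earlier in the script.
model.train_model(train_df, eval_df=eval_df, f1=f1_macro)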

@Nazzish
Author

Nazzish commented Aug 12, 2020

Any idea how to get it started? The error persists on different datasets as well.

@cadae

cadae commented Aug 14, 2020

Same issue here. This issue started appearing in multi-label classification training after I changed the model's args from a dictionary to ClassificationArgs, and it seems to affect Linux only: the training was able to finish on Windows 10 but not on RedHat 8 with the same dataset and configuration. The workaround is to just use dictionary args for the model's args (the training args have to be a dict); a sketch of that workaround follows the args below. These are the args I used when the training paused.

model_args = ClassificationArgs(
    regression=False,
    no_cache=True,
    fp16=False,
    silent=False,
    reprocess_input_data=True,
    use_early_stopping=True,
    sliding_window=False,
    save_model_every_epoch=False,
    save_eval_checkpoints=False,
    use_multiprocessing=False,
    cache_dir=MODELS_DIR+"temp",
    tensorboard_dir=MODELS_DIR+"temp",
    train_batch_size=8,
    eval_batch_size=8,
    early_stopping_metric="eval_loss",
    early_stopping_delta=0.01,
    early_stopping_metric_minimize=True,
    early_stopping_patience=3,
    early_stopping_consider_epochs=True,
    labels_map={}
)
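
For reference, a rough sketch of the dictionary-args workaround described above, assuming a multi-label setup; the model type and name, label count, and MODELS_DIR value are placeholders, and a few of the args above are omitted for brevity:

import torch
from simpletransformers.classification import MultiLabelClassificationModel

MODELS_DIR = "outputs/"  # placeholder path

# Roughly the settings above, expressed as a plain dict instead of ClassificationArgs.
model_args = {
    "no_cache": True,
    "fp16": False,
    "reprocess_input_data": True,
    "use_early_stopping": True,
    "save_model_every_epoch": False,
    "save_eval_checkpoints": False,
    "use_multiprocessing": False,
    "cache_dir": MODELS_DIR + "temp",
    "tensorboard_dir": MODELS_DIR + "temp",
    "train_batch_size": 8,
    "eval_batch_size": 8,
    "early_stopping_metric": "eval_loss",
    "early_stopping_delta": 0.01,
    "early_stopping_metric_minimize": True,
    "early_stopping_patience": 3,
    "early_stopping_consider_epochs": True,
}

model = MultiLabelClassificationModel(
    "bert",
    "bert-base-multilingual-cased",
    num_labels=5,  # placeholder; set to your actual number of labels
    args=model_args,
    use_cuda=torch.cuda.is_available(),
)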

@ThilinaRajapakse
Owner

Can I get a minimal script to reproduce this issue? I have no clue why this is happening.

@stale

stale bot commented Oct 25, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale This issue has become stale label Oct 25, 2020
@stale stale bot closed this as completed Nov 1, 2020
@yanqiangmiffy

Yes, I also have this issue. I can see that another CPU process is hung.

@creatorrr

For now, this has helped me:

model_args = ClassificationArgs(
    use_multiprocessing=False,
    use_multiprocessing_for_evaluation=False,
)

os.environ["TOKENIZERS_PARALLELISM"] = "false"
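
For context, a rough sketch of where this fix usually goes in a full training script; the model type and name and the toy DataFrame are placeholders:

import os

# Set this before any tokenizer or model is created, so the HuggingFace tokenizers
# library never spawns its parallel workers.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

import pandas as pd
from simpletransformers.classification import ClassificationModel, ClassificationArgs

model_args = ClassificationArgs(
    use_multiprocessing=False,
    use_multiprocessing_for_evaluation=False,
)

# Placeholder data and model; substitute your own dataset and model name.
train_df = pd.DataFrame(
    [["example text one", 0], ["example text two", 1]],
    columns=["text", "labels"],
)

model = ClassificationModel("bert", "bert-base-cased", args=model_args)
model.train_model(train_df)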

@gilzeevi25

> For now, this has helped me:
>
> model_args = ClassificationArgs(
>     use_multiprocessing=False,
>     use_multiprocessing_for_evaluation=False,
> )
>
> os.environ["TOKENIZERS_PARALLELISM"] = "false"

Oddly enough, this also helped me, so cheers, mate.

@nemetht

nemetht commented Dec 21, 2022

> For now, this has helped me:
>
> model_args = ClassificationArgs(
>     use_multiprocessing=False,
>     use_multiprocessing_for_evaluation=False,
> )
>
> os.environ["TOKENIZERS_PARALLELISM"] = "false"

This also worked for me. Thank you!

@phileas-condemine

phileas-condemine commented Jan 26, 2023

> For now, this has helped me:
>
> model_args = ClassificationArgs(
>     use_multiprocessing=False,
>     use_multiprocessing_for_evaluation=False,
> )
>
> os.environ["TOKENIZERS_PARALLELISM"] = "false"

Same here. I've been struggling with this issue for more than a year, with training staying idle in the middle of a run.
Thank you for the tip, @creatorrr!

@shtosti

shtosti commented Dec 4, 2023

> For now, this has helped me:
>
> model_args = ClassificationArgs(
>     use_multiprocessing=False,
>     use_multiprocessing_for_evaluation=False,
> )
>
> os.environ["TOKENIZERS_PARALLELISM"] = "false"

This was the only fix that helped me with this problem. Thank you so much for sharing it @creatorrr!

@StrikerRUS

@creatorrr

Thanks a lot for the fix!
