Model stops as the training starts. #638
Comments
Any thoughts or hints on this? |
Is this only happening for this dataset or does it happen with any data you use? |
If CUDA is disabled, evaluation might take longer than an hour even on a moderately sized dataset. Can you try running it with CUDA enabled? You can use Colab if needed. |
I have used "cuda_available = torch.cuda.is_available()" as shown in the Simple Transformers tutorial on your website. The GPU remains idle at the point shown in the screenshot. I want to do binary classification and have also tried models other than bert, but the problem persists. I haven't tried other datasets though. |
I tried running it without hyperparameter optimization, and an error in the evaluation metric function popped up. I guess that error was causing the training to hang. I fixed it and the training went well (still without sweeping, CUDA disabled). I haven't yet tried re-running the script with sweeping. Will let you know how it goes! |
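One way to rule out the metric-function failure described in the comment above is to call the custom metric on a tiny hand-made batch before starting a long run or a sweep. The sketch below is only an illustration: `simple_accuracy` and `check_metric` are hypothetical helper names, and it assumes the metric is called as `metric(labels, preds)`, which may not match every metric or task.

```python
def simple_accuracy(labels, preds):
    """Hypothetical custom metric: fraction of predictions equal to the labels."""
    labels, preds = list(labels), list(preds)
    if len(labels) != len(preds):
        raise ValueError("labels and preds must have the same length")
    if not labels:
        raise ValueError("cannot score an empty batch")
    matches = sum(int(l == p) for l, p in zip(labels, preds))
    return matches / len(labels)


def check_metric(metric, n=4):
    """Call the metric on a small dummy batch so errors surface immediately.

    Assumption: the metric takes (labels, preds) in that order.
    """
    labels = [0, 1] * (n // 2)
    preds = list(labels)  # perfect predictions for the dummy batch
    return float(metric(labels, preds))


# Run the check once, outside of any training or sweep loop.
sanity_score = check_metric(simple_accuracy)
```

If the call raises here, the traceback is visible right away instead of being hidden behind a training process that appears to be stuck.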
Any idea how to get it started? The error persists on different datasets as well. |
Same issue here. It started appearing in multilabel classification training after I changed the model's args from a dictionary to ClassificationArgs, and it seems to affect Linux only: training finished on Windows 10 but not on RedHat 8 with the same dataset and configuration. The workaround is to use dictionary args for the model (the training args have to be a dict). The following args are the ones I used when the training paused.
|
Can I get a minimal script to reproduce this issue? I have no clue why this is happening. |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
Yes, I also have this issue. I can see that another CPU process is hung. |
For now, this has helped me:
model_args = ClassificationArgs(
    use_multiprocessing=False,
    use_multiprocessing_for_evaluation=False,
)
os.environ["TOKENIZERS_PARALLELISM"] = "false" |
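For anyone applying this from scratch, here is a minimal sketch of the same workaround written with dictionary args, which an earlier comment in this thread reported as working on Linux. The model type, the model name, and the `build_model` helper are placeholders chosen for illustration and do not come from the thread. The environment variable is set at the top so that it is in place before any tokenizer is used.

```python
import os

# Set this before any tokenizer is created in the process.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Same two switches as in the comment above, expressed as a plain dict.
train_args = {
    "use_multiprocessing": False,
    "use_multiprocessing_for_evaluation": False,
}


def build_model(model_type="bert", model_name="bert-base-cased", use_cuda=False):
    """Hypothetical helper; model_type and model_name are placeholder values."""
    # Imported lazily so the settings above apply before the library loads.
    from simpletransformers.classification import ClassificationModel

    return ClassificationModel(
        model_type, model_name, args=train_args, use_cuda=use_cuda
    )
```

Calling `build_model()` and then `train_model(train_df)` on the result would keep feature conversion in the main process. This is likely slower on large datasets, but it avoids the extra worker processes that some commenters saw hanging.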
oddly enough, this also helped me so cheers mate |
Worked also for me - thank you! |
Same here, I've been struggling with this issue for more than a year, with the run going idle in the middle of training. |
This was the only fix that helped me with this problem. Thank you so much for sharing it @creatorrr! |
Thanks a lot for the fix! |
Hello,
I am using the code in Google Colab and have run into a problem. As soon as I reach the following step:
the system stops by itself and doesn't run any further, without showing any error.
Please let me know what the issue might be and how to resolve it.