Infinite loss when fine-tuning #50

Open
Spongeorge opened this issue Apr 29, 2023 · 1 comment

Comments

@Spongeorge

I'm trying to fine-tune the AMR3.0 Structured-BART large checkpoint on another dataset, but during training I get the following warnings and an infinite training loss:

2023-04-29 00:02:05 | WARNING | tensorboardX.x2num | NaN or Inf found in input tensor.
2023-04-29 00:02:05 | WARNING | tensorboardX.x2num | NaN or Inf found in input tensor.
2023-04-29 00:02:05 | WARNING | tensorboardX.x2num | NaN or Inf found in input tensor.
2023-04-29 00:02:05 | WARNING | tensorboardX.x2num | NaN or Inf found in input tensor.
2023-04-29 00:02:05 | INFO | train | {"epoch": 1, "train_loss": "inf", "train_nll_loss": "inf", "train_loss_seq": "inf", "train_nll_loss_seq": "inf", "train_loss_pos": "0.710562", "train_nll_loss_pos": "0.710562", "train_wps": "687.9", "train_ups": "0.51", "train_wpb": "1354.7", "train_bsz": "55.2", "train_num_updates": "71", "train_lr": "1.87323e-06", "train_gnorm": "17.868", "train_loss_scale": "8", "train_train_wall": "45", "train_wall": "158"}
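
(The tensorboardX warnings just mean a non-finite scalar was passed to the logger; there are four of them, which would line up with the four inf entries in the train line above. To keep the report short I only copied the relevant lines; something like the command below pulls them out of a full training log, where the log path is just a placeholder for wherever the run writes it:)

# Show only the non-finite warnings and the loss / loss-scale line from the run
# (the log path is a placeholder, not a path from the repo).
grep -E "NaN or Inf|loss_scale" /path/to/train.log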

In my config I set the fairseq-preprocess arguments as:

FAIRSEQ_PREPROCESS_FINETUNE_ARGS="--srcdict /content/DATA/AMR3.0/models/amr3.0-structured-bart-large-neur-al/seed42/dict.en.txt --tgtdict /content/DATA/AMR3.0/models/amr3.0-structured-bart-large-neur-al/seed42/dict.actions_nopos.txt"

and train args as:

FAIRSEQ_TRAIN_FINETUNE_ARGS="--finetune-from-model /content/DATA/AMR3.0/models/amr3.0-structured-bart-large-neur-al/seed42/checkpoint_wiki.smatch_top5-avg.pt --memory-efficient-fp16 --batch-size 16 --max-tokens 512 --patience 10"
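
One thing I'm starting to suspect is the fp16 setting: if the default initial loss scale is in use (128, I believe), the train_loss_scale of 8 in the log means fairseq's dynamic loss scaler has already been halved several times, which usually points at overflows under fp16. Assuming these variables are passed straight through to fairseq-train, I'm planning to compare against an fp32 run, i.e. the same arguments with the fp16 flag dropped:

# Same fine-tuning arguments, but without --memory-efficient-fp16, to check
# whether the loss stays finite in fp32.
FAIRSEQ_TRAIN_FINETUNE_ARGS="--finetune-from-model /content/DATA/AMR3.0/models/amr3.0-structured-bart-large-neur-al/seed42/checkpoint_wiki.smatch_top5-avg.pt --batch-size 16 --max-tokens 512 --patience 10"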

Any ideas as to what I'm doing wrong?
Thanks in advance.

Spongeorge commented May 2, 2023

Output from tests/correctly_installed.sh:

pytorch 1.10.1+cu102
cuda 10.2
Apex not installed
smatch installed
pytorch-scatter installed
fairseq works
[OK] correctly installed

I also tried with the wiki25 dataset downloaded by tests/minimal_test.sh and got the same issue, infinite loss in both training and validation, so I don't think it's an issue with my input data. When tests/minimal_test.sh itself runs, though, the loss isn't infinite.
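
To narrow down where it blows up, I'm also going to rerun the wiki25 fine-tuning with per-update logging, so I can see whether the loss is already non-finite at the very first update or only diverges after a few steps (assuming --log-interval is passed through to fairseq-train like the other flags):

# Same arguments, but log every update so the first non-finite loss is visible.
FAIRSEQ_TRAIN_FINETUNE_ARGS="--finetune-from-model /content/DATA/AMR3.0/models/amr3.0-structured-bart-large-neur-al/seed42/checkpoint_wiki.smatch_top5-avg.pt --memory-efficient-fp16 --batch-size 16 --max-tokens 512 --patience 10 --log-interval 1"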
