Lower performance with retrained model #2

Open
tomhosking opened this issue Oct 19, 2021 · 6 comments
Comments

@tomhosking

When I use a checkpoint that I've trained from scratch instead of the checkpoint downloaded from here, performance is ~2 iBLEU lower. The command used to train the model was:

python train.py --cuda \
                --train_source ./data/qqp_train.src \
                --train_target ./data/qqp_train.tgt \
                --test_source  ./data/qqp_dev.src \
                --test_target  ./data/qqp_dev.tgt \
                --vocab_path ./checkpoints/qqp.vocab \
                --batch_size 8 \
                --epoch 100 \
                --num_rounds 2 \
                --max_length 50 \
                --clip_length 50 \
                --model_save_path ./checkpoints/qqp.model \
                --generation_save_path ./outputs/qqp/

Are there additional hyperparameters that I need to set?

@L-Zhe
Owner

L-Zhe commented Oct 20, 2021

We do not employ iBLEU to evaluate our model, so I think you may have chosen the wrong evaluation metric.

@tomhosking
Author

Thanks for your response - iBLEU is just a weighted difference between BLEU and self-BLEU, both of which you do report in the paper. I get the following scores on MSCOCO when I train your model from scratch (using the command above), after 10 rounds:

BLEU: 18.13, self-BLEU: 11.22

For comparison, these are the results when I do the same with the checkpoint you've provided:

BLEU: 21.30, self-BLEU: 13.84

The latter is much closer to the result from your paper (there will be a small difference since I'm not using exactly the same split).
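
For reference, this is roughly how I compute iBLEU - just a sketch, assuming a weight of alpha = 0.8 and using sacrebleu for the corpus BLEU scores; the exact weighting doesn't change the comparison above:

# Sketch of my iBLEU computation (alpha = 0.8 is my choice; sacrebleu is just one way to get corpus BLEU).
import sacrebleu

def ibleu(hypotheses, references, sources, alpha=0.8):
    # BLEU against the references rewards matching the gold paraphrases.
    bleu = sacrebleu.corpus_bleu(hypotheses, [references]).score
    # Self-BLEU against the inputs penalises copying the source sentence.
    self_bleu = sacrebleu.corpus_bleu(hypotheses, [sources]).score
    # iBLEU = alpha * BLEU - (1 - alpha) * self-BLEU
    return alpha * bleu - (1 - alpha) * self_bleu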

I'm trying to train your model on another dataset (one that you don't use in your paper), and its performance is currently much worse than the other comparison systems. So I wanted to check that I'm training it in the correct way, to make a fair comparison - please let me know if I should be doing anything differently.

@L-Zhe
Owner

L-Zhe commented Oct 20, 2021

I cannot diagnose your problem without your dataset. But I notice that your batch size is too small, so I suggest you increase it.

@L-Zhe
Owner

L-Zhe commented Oct 20, 2021

Or you can try disabling the diversity coefficient at line 85 of utils/run.py. It is used to address the lack of diversity in the first word.
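
Roughly speaking, the idea is to down-weight first-word candidates that earlier decoded sentences have already used. The snippet below is only an illustration of that idea, not the exact code at that line:

import torch

# Illustration of a first-word diversity penalty (not the exact code in utils/run.py).
def apply_first_word_diversity(logits, used_first_tokens, diversity_coef=1.0):
    # logits: 1-D tensor of vocabulary scores at the first decoding step of one candidate.
    # used_first_tokens: token ids already chosen as the first word by earlier candidates.
    logits = logits.clone()
    if used_first_tokens:
        # Subtracting the coefficient discourages repeating the same first word;
        # setting diversity_coef = 0 disables the penalty entirely.
        logits[list(used_first_tokens)] -= diversity_coef
    return logits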

@tomhosking
Author

Thanks - I will try reducing the length limit and increasing the batch size.

@tomhosking
Author

I've been able to train to completion using a batch size of 32 - but I now get BLEU and Self-BLEU scores of 0. It looks like training is stable at the start, but validation scores go to 0 about halfway through. Does the training script not use early stopping? How should I pick the number of training epochs?
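
To clarify what I mean by early stopping, this is the kind of wrapper I had in mind - a generic sketch, not your training script, with the hypothetical train/eval/save steps passed in as callables:

def train_with_early_stopping(train_one_epoch, evaluate_dev_bleu, save_checkpoint,
                              max_epochs=100, patience=5):
    # Generic early stopping: keep the best dev-BLEU checkpoint and stop once
    # the score has not improved for `patience` consecutive epochs.
    best_bleu, bad_epochs = 0.0, 0
    for epoch in range(max_epochs):
        train_one_epoch(epoch)
        dev_bleu = evaluate_dev_bleu(epoch)
        if dev_bleu > best_bleu:
            best_bleu, bad_epochs = dev_bleu, 0
            save_checkpoint(epoch)  # only the best-scoring model is kept
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_bleu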
