Lower performance with retrained model #2

Open
tomhosking opened this issue Oct 19, 2021 · 6 comments
Comments

@tomhosking

When I use a checkpoint that I've trained from scratch instead of the checkpoint downloaded from here, performance is ~2 iBLEU lower. The command used to train the model was:

python train.py --cuda \
                --train_source ./data/qqp_train.src \
                --train_target ./data/qqp_train.tgt \
                --test_source  ./data/qqp_dev.src \
                --test_target  ./data/qqp_dev.tgt \
                --vocab_path ./checkpoints/qqp.vocab \
                --batch_size 8 \
                --epoch 100 \
                --num_rounds 2 \
                --max_length 50 \
                --clip_length 50 \
                --model_save_path ./checkpoints/qqp.model \
                --generation_save_path ./outputs/qqp/

Are there additional hyperparameters that I need to set?

@L-Zhe
Owner

L-Zhe commented Oct 20, 2021

We do not employ iBLEU to evaluate our model, so I think you may have chosen the wrong evaluation metric.

@tomhosking
Author

Thanks for your response - iBLEU is just a weighted difference between BLEU and self-BLEU, both of which you do report in the paper. I get the following scores on MSCOCO when I train your model from scratch (using the command above), after 10 rounds:

BLEU: 18.13, self-BLEU: 11.22

For comparison, these are the results when I do the same with the checkpoint you've provided:

BLEU: 21.30, self-BLEU: 13.84

The latter is much closer to the result from your paper (there will be a small difference since I'm not using exactly the same split).
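
For reference, this is roughly how I compute iBLEU - just a sketch, assuming a weight of alpha = 0.8 and using sacrebleu for the corpus BLEU scores; the exact weighting doesn't change the comparison above:

# Sketch of my iBLEU computation (alpha = 0.8 is my choice; sacrebleu is just one way to get corpus BLEU).
import sacrebleu

def ibleu(hypotheses, references, sources, alpha=0.8):
    # BLEU against the references rewards matching the gold paraphrases.
    bleu = sacrebleu.corpus_bleu(hypotheses, [references]).score
    # Self-BLEU against the inputs penalises copying the source sentence.
    self_bleu = sacrebleu.corpus_bleu(hypotheses, [sources]).score
    # iBLEU = alpha * BLEU - (1 - alpha) * self-BLEU
    return alpha * bleu - (1 - alpha) * self_bleu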

I'm trying to train your model on another dataset (one that you don't use in your paper), and its performance is currently much worse than the other comparison systems. So I wanted to check that I'm training it in the correct way, to make a fair comparison - please let me know if I should be doing anything differently.

@L-Zhe
Owner

L-Zhe commented Oct 20, 2021

I cannot diagnose your problem without your dataset. But I notice that your batch size is too small, so I suggest you increase it.

@L-Zhe
Owner

L-Zhe commented Oct 20, 2021

Or you can try disabling the diversity coefficient at line 85 of utils/run.py. It is used to address the lack of diversity in the first word.
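
Roughly speaking, the idea is to down-weight first-word candidates that earlier decoded sentences have already used. The snippet below is only an illustration of that idea, not the exact code at that line:

import torch

# Illustration of a first-word diversity penalty (not the exact code in utils/run.py).
def apply_first_word_diversity(logits, used_first_tokens, diversity_coef=1.0):
    # logits: 1-D tensor of vocabulary scores at the first decoding step of one candidate.
    # used_first_tokens: token ids already chosen as the first word by earlier candidates.
    logits = logits.clone()
    if used_first_tokens:
        # Subtracting the coefficient discourages repeating the same first word;
        # setting diversity_coef = 0 disables the penalty entirely.
        logits[list(used_first_tokens)] -= diversity_coef
    return logits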

@tomhosking
Author

Thanks - I will try reducing the length limit and increasing the batch size.

@tomhosking
Author

I've been able to train to completion using a batch size of 32 - but I now get BLEU and Self-BLEU scores of 0. It looks like training is stable at the start, but validation scores go to 0 about halfway through. Does the training script not use early stopping? How should I pick the number of training epochs?
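
To clarify what I mean by early stopping, this is the kind of wrapper I had in mind - a generic sketch, not your training script, with the hypothetical train/eval/save steps passed in as callables:

def train_with_early_stopping(train_one_epoch, evaluate_dev_bleu, save_checkpoint,
                              max_epochs=100, patience=5):
    # Generic early stopping: keep the best dev-BLEU checkpoint and stop once
    # the score has not improved for `patience` consecutive epochs.
    best_bleu, bad_epochs = 0.0, 0
    for epoch in range(max_epochs):
        train_one_epoch(epoch)
        dev_bleu = evaluate_dev_bleu(epoch)
        if dev_bleu > best_bleu:
            best_bleu, bad_epochs = dev_bleu, 0
            save_checkpoint(epoch)  # only the best-scoring model is kept
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_bleu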
