Discrepancy in Model Performance Reproduction and Pretrained Model Parameters #10

Closed
shouyezhe opened this issue Apr 28, 2024 · 6 comments

@shouyezhe

Hello BarqueroGerman,

I'm working on replicating your model's performance but noticed a gap between my results and the pretrained model's performance. I've confirmed that my hyperparameters match the ones in your README. Could you share the pretrained model's hyperparameters to help me troubleshoot? The performance of my trained model is shown below.
[screenshots: evaluation metrics of the reproduced model]

Thanks

@BarqueroGerman
Owner

BarqueroGerman commented Apr 28, 2024 via email

Hi! Thank you for reporting the issue. We also noticed a few differences during evaluation after refactoring and cleaning the code for its release. We will work on this and fix the issue soon. Thanks for your patience!

@shouyezhe
Author

Hi! Thank you for the tremendous effort you've put into this work. I'm interested in the current progress on the evaluation differences you're working on. Could you share some details? Best wishes!

@BarqueroGerman
Owner

Hi!

Sorry for the delay and thank you for your patience! I just fixed an error in the evaluation command (--bpe_denoising_step). Please check whether this resolves your problem.

In any case, I will double-check the training loop, as your deviation looks higher than the one you should be observing.

@shouyezhe
Author

Hi!

Thank you for your efforts in investigating the issue with the evaluation parameters. However, even after the correction, the evaluation results are still significantly lower than those reported for the Pretrained models (using the same evaluation code). Notably, the evaluation metrics for the Pretrained models are very close to those reported in the paper. This leads me to believe that the issue might be related to the training parameters or the code provided in the repository.

I was wondering if you have attempted to train the models using the current version of the code in the repository. I will continue to investigate further to gather more information to assist in troubleshooting.

@BarqueroGerman
Owner

Hi shouyezhe!

I am actively looking into this. I'll let you know my findings asap. Thanks for your patience!

@BarqueroGerman
Owner

Hi again,

I re-trained both models several times with different seeds and I noticed two things:

  1. Two parameters were missing from the Babel training command (--min_seq_len 45 --max_seq_len 250). I will update the README now.
  2. The randomness inherent in the BPE training strategy, and likely the random initialization, make the evaluations differ slightly from those reported in the paper. In all the trainings I replicated, performance similar to the paper's (slightly higher or lower depending on the metric) was reached at around 1.3M steps for Babel and 500k steps for HumanML3D, +/- 100k steps; see also the seeding note after the figure below. For example:

[screenshot: evaluation metrics from the re-trained models]
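
For anyone reproducing these runs, the part of that randomness that comes from initialization can be pinned down by fixing the usual seeds. The snippet below is a generic PyTorch/NumPy sketch, not code from this repository; the stochastic BPE training schedule can still make runs differ.

```python
# Generic seed-fixing sketch (not from this repository). It removes the
# variance coming from random initialization; randomness baked into the
# training schedule itself can still make runs differ.
import random

import numpy as np
import torch


def set_seed(seed: int = 42) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Optional: trade some speed for deterministic cuDNN kernels.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


set_seed(42)
```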

Also, I feel that the current metric-learning-based metrics used to evaluate generative motion synthesis are not as robust as we would like, and I am curious to see what alternatives researchers in this field come up with. We did our best by proposing the PJ/AUJ metrics, which check that transitions are smooth enough while relying only on the original motion space.
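
To make the PJ/AUJ idea concrete, below is a minimal, hypothetical sketch of a jerk-based smoothness check in the original motion space: take the third finite difference of joint positions and look at its peak and its integral around a transition frame. The function names, window size, frame rate, and the fake data are illustrative assumptions, not the repository's actual PJ/AUJ implementation.

```python
# Illustrative sketch of a jerk-based smoothness check (not the repository's
# PJ/AUJ code). Assumes joint positions of shape (num_frames, num_joints, 3)
# sampled at a known frame rate.
import numpy as np


def jerk_profile(joints: np.ndarray, fps: float) -> np.ndarray:
    """Per-frame maximum jerk magnitude across joints (third finite difference)."""
    jerk = np.diff(joints, n=3, axis=0) * fps ** 3   # (T-3, J, 3)
    magnitude = np.linalg.norm(jerk, axis=-1)        # (T-3, J)
    return magnitude.max(axis=-1)                    # (T-3,)


def transition_smoothness(joints, fps, transition_frame, window=8):
    """Peak-jerk-style and area-under-jerk-style scores around a transition.

    The small index offset introduced by differencing is ignored for simplicity.
    """
    profile = jerk_profile(joints, fps)
    lo = max(transition_frame - window, 0)
    hi = min(transition_frame + window, len(profile))
    segment = profile[lo:hi]
    return segment.max(), np.trapz(segment, dx=1.0 / fps)


# Random data standing in for a generated 120-frame, 22-joint sequence at 20 fps.
rng = np.random.default_rng(0)
motion = rng.normal(size=(120, 22, 3)).cumsum(axis=0)
peak_jerk, area_under_jerk = transition_smoothness(motion, fps=20.0, transition_frame=60)
print(peak_jerk, area_under_jerk)
```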
