
AN4 fail to reproduce the reported result #193

Closed
Minju-Jung opened this issue Dec 7, 2017 · 11 comments
@Minju-Jung commented Dec 7, 2017

Based on issue #85, I tried to reproduce the reported result on AN4 with the following command:
python train.py --cuda --visdom --learning_anneal 1.01 --train_manifest data/an4_train_manifest.csv --val_manifest data/an4_val_manifest.csv

I got the following result (screenshot of the training curves omitted).

The best result is

Validation Summary Epoch: [60] Average WER 14.720 Average CER 5.894

But it is still worse than the reported result:

| Dataset  | WER   | CER  |
|----------|-------|------|
| AN4 test | 10.52 | 4.78 |

Could you kindly let me know how I can improve the result?

@ryanleary (Collaborator)

It was trained with 5 layers of 800 node GRUs, probably with the augment flag set, and ran for 97 epochs.
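Something along these lines should correspond to that setup (flag names assume the train.py options in this repo; double-check against your checkout):

python train.py --cuda --train_manifest data/an4_train_manifest.csv \
                --val_manifest data/an4_val_manifest.csv \
                --rnn_type gru --hidden_layers 5 --hidden_size 800 \
                --augment --epochs 97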

@Minju-Jung (Author)

Thank you for the reply!
Could you also let me know about learning rate and noise probability?
Currently, I use the default setting for learning rate (3e-4), noise probability (0.4), and epochs (70).

@alugupta commented Dec 8, 2017

Hi,

This repository is fantastic!! Worked really well out of the box :)

Currently, I'm working on reproducing the pre-trained models' results as well.

For the an4 dataset we were able to get down to a WER of 12.35 using an LR of 0.0005 and an anneal rate of 1.0001. All other hyperparameters (noise, epochs = 70, 5-layer GRU, 800 hidden units, etc.) were kept the same.
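Roughly, that corresponds to a command along these lines (flag names assume the standard train.py options; exact paths will differ):

python train.py --cuda --train_manifest data/an4_train_manifest.csv \
                --val_manifest data/an4_val_manifest.csv \
                --lr 0.0005 --learning_anneal 1.0001 \
                --rnn_type gru --hidden_layers 5 --hidden_size 800 --epochs 70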

For the TED dataset we got a WER somewhere in the 50s using the default hyperparameters (trained for 70 epochs), which is quite far off from the pre-trained model. (I can provide more details soon.)

I'm just getting set up with the librispeech dataset as well, so hopefully I can run some training experiments for that too.

Would it be possible to put the hyperparameters for the pre-trained models (or something that gets close to them) somewhere for everyone to see? It would be a huge help to everyone!

Thank you!

@SeanNaren (Owner)

Sorry for being out of the loop here. I'll sync up with @ryanleary and try to solve this issue, but it may take some time. Worst case, I'll retrain the models so that we have the exact hyper-parameters used!

@alugupta

Hi,

Thanks for helping out with this!

Not sure if this helps but here are some results I was able to get:

1. Running the provided pre-trained model on the LibriSpeech dataset, we get the following results:
   - LibriSpeech Other: WER of 21.7 and CER of 8.1 on the test set; WER of 20.5 and CER of 7.7 on the validation set
   - LibriSpeech Clean: WER of 10.97 and CER of 3.3 on the test set; WER of 11.06 and CER of 3.45 on the validation set

The numbers for the LibriSpeech other dataset are different from the ones reported for the released models, but I'm unclear on why (perhaps we downloaded only a subset of the other dataset?). We basically just ran python librispeech.py, which by default I believe processes the entire dataset. This might be related to this issue as well.

2. When we train a model from scratch with an LR of 3e-4, an anneal rate of 1.1, a batch size of 5 (to fit on the NVIDIA GTX 1080 Ti), and a max_norm of 200, we've been able to get down to a WER of 23.2 and a CER of 8.9 on the validation set within 8 epochs. The model is still training, so it might get closer to the 21.7. All other hyperparameters (5-layer bidirectional GRU) were the defaults provided.
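Roughly, that from-scratch run corresponds to something like this (flag names assume the standard train.py options; the manifest names are just what librispeech.py produced for us):

python train.py --cuda --train_manifest data/libri_train_manifest.csv \
                --val_manifest data/libri_val_manifest.csv \
                --lr 3e-4 --learning_anneal 1.1 --batch_size 5 --max_norm 200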

Also, would it be helpful in terms of reproducibility to fix a default random seed in PyTorch? I just thought that if we are re-training some models it might be worth fixing that as well (it might affect the smaller datasets more than the larger ones).
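A minimal sketch of what I mean, near the top of train.py (the seed value is arbitrary; full determinism on GPU may also need cuDNN's deterministic mode):

import random
import numpy as np
import torch

seed = 123456  # arbitrary fixed value
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)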

Let me know if there's anything I can do to help!

Udit

@SeanNaren (Owner)

@alugupta thanks for your help!

The way WER/CER is calculated has changed to match up more with academic standards (but it should be correct for the release branch and the commit it points to, I think). There does seem to be a slight discrepancy between the WER/CER at training time and at testing time, but I'm investigating further.
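For anyone comparing numbers: by "academic standards" I mean word-level edit distance divided by the number of reference words. A minimal sketch of that definition (not the repo's exact implementation):

def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[-1][-1] / max(len(ref), 1)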

I'll definitely need to update the librispeech script to create separate test scripts for the different test sets LibriSpeech offers, so any contribution there would be awesome :)

I agree about the default random seed; that would definitely help! I will create a ticket to track this. Thanks for your help!

@Minju-Jung (Author)

@SeanNaren As in my issue #200, does the current LibriSpeech dataset contain both clean and other?

Now I'm training the network on LibriSpeech under the same conditions as @alugupta.
I will report my simulation results soon.

@ryanleary (Collaborator)

I just set up a new machine and pulled down this repository and the an4 dataset from scratch. The GPU was 1x Titan V. Trained with the following command:

python train.py --train_manifest an4/an4_train_manifest.csv \
                --val_manifest an4/an4_val_manifest.csv  \
                --num_workers 4 \
                --cuda \
                --learning_anneal 1.01 \
                --augment \
                --epochs 100

Result:

| Dataset  | WER   | CER   | Loss  | Epochs |
|----------|-------|-------|-------|--------|
| AN4 test | 9.732 | 3.919 | 0.054 | 48     |

which beats the previously released model. Be sure to include the augment flag, particularly when running for a large number of epochs. If not using --augment on a dataset this small, there's little reason to run a large number of epochs.

I'll leave this open for another day or so, but will probably close since ^^ reproduces the result.

@ryanleary (Collaborator)

Also @alugupta I think what you're calling "Librispeech Other" is actually the combined other+clean test set. The pretrained model scores around 31.3% WER on other. 21.7% is the weighted average WER between the other and clean test sets.

@alugupta

@ryanleary Oh right! The numbers I reported earlier for LibriSpeech other are actually LibriSpeech clean + other. That makes sense then. So the model from earlier should be more or less similar to the pretrained model. Thanks for also rerunning the an4 dataset!

@Minju-Jung I guess that also answers your question: by default the dataset is the combined clean and other. If you specified one or the other when pre-processing, it could be different.

@SeanNaren For the separate test scripts for other and clean, are you imagining that the pre-processing step saves the other and clean subsets separately by default, giving three sets of manifests: clean, other, and combined? I could perhaps contribute this if that helps (it might be a while as I'll be away for the coming week). A rough sketch of what I mean is below.
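Rough sketch, assuming manifest rows are wav_path,txt_path and the LibriSpeech directory names still contain "clean"/"other" (split_manifest is just a hypothetical helper, not something in the repo):

# hypothetical helper: split a combined test manifest into clean/other manifests
# based on the wav path of each row
import csv

def split_manifest(combined_path, clean_path, other_path):
    with open(combined_path) as f:
        rows = list(csv.reader(f))
    with open(clean_path, 'w', newline='') as f:
        csv.writer(f).writerows(r for r in rows if 'clean' in r[0])
    with open(other_path, 'w', newline='') as f:
        csv.writer(f).writerows(r for r in rows if 'other' in r[0])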

@SeanNaren (Owner)

@alugupta thanks for your input!

I've just merged #205, which splits the testing scripts into clean/other. This should help get the correct test scores!
