
AN4 fail to reproduce the reported result #193

Closed
Minju-Jung opened this issue Dec 7, 2017 · 11 comments
@Minju-Jung commented Dec 7, 2017

Based on issue #85, I tried to reproduce the reported result on AN4 with the following command:
python train.py --cuda --visdom --learning_anneal 1.01 --train_manifest data/an4_train_manifest.csv --val_manifest data/an4_val_manifest.csv

I got the following result (screenshot of the training curves omitted).

The best result is

Validation Summary Epoch: [60] Average WER 14.720 Average CER 5.894

But it is still worse than the reported result:

| Dataset  | WER   | CER  |
|----------|-------|------|
| AN4 test | 10.52 | 4.78 |

Could you kindly let me know how I can improve the result?

@ryanleary (Collaborator)

It was trained with 5 layers of 800 node GRUs, probably with the augment flag set, and ran for 97 epochs.
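Something along these lines should correspond to that setup (flag names assume the train.py options in this repo; double-check against your checkout):

python train.py --cuda --train_manifest data/an4_train_manifest.csv \
                --val_manifest data/an4_val_manifest.csv \
                --rnn_type gru --hidden_layers 5 --hidden_size 800 \
                --augment --epochs 97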

@Minju-Jung (Author)

Thank you for the reply!
Could you also let me know about learning rate and noise probability?
Currently, I use the default setting for learning rate (3e-4), noise probability (0.4), and epochs (70).

@alugupta commented Dec 8, 2017

Hi,

This repository is fantastic!! Worked really well out of the box :)

Currently, I'm working on reproducing the pre-trained models' results as well.

For the an4 dataset we were able to get down to a WER of 12.35 using an LR of 0.0005 and an anneal rate of 1.0001. All other hyperparameters (noise, epochs = 70, 5-layer GRU, 800 hidden units, etc.) were kept the same.
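Roughly, that corresponds to a command along these lines (flag names assume the standard train.py options; exact paths will differ):

python train.py --cuda --train_manifest data/an4_train_manifest.csv \
                --val_manifest data/an4_val_manifest.csv \
                --lr 0.0005 --learning_anneal 1.0001 \
                --rnn_type gru --hidden_layers 5 --hidden_size 800 --epochs 70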

For the TED dataset we got a WER somewhere in the 50s using the default hyperparameters (trained for 70 epochs), which is quite far off from the pre-trained model. (I can provide more details soon.)

I'm just getting set up with the librispeech dataset as well, so hopefully I can run some training experiments for that too.

Would it be possible to put the hyperparameters for the pre-trained models (or something that gets close to them) somewhere for everyone to see? It would be a huge help to everyone!

Thank you!

@SeanNaren (Owner)

Sorry for being out of the loop here. I'll sync up with @ryanleary and try to solve this issue, but it may take some time. Worst case, I'll retrain the models so that we have the exact hyper-parameters used!

@alugupta

Hi,

Thanks for helping out with this!

Not sure if this helps but here are some results I was able to get:

1. Running the provided pre-trained model on the LibriSpeech dataset, we get the following results:
   - LibriSpeech Other: WER of 21.7 and CER of 8.1 on the test set; WER of 20.5 and CER of 7.7 on the validation set
   - LibriSpeech Clean: WER of 10.97 and CER of 3.3 on the test set; WER of 11.06 and CER of 3.45 on the validation set

The numbers for the LibriSpeech other dataset are different from the ones reported for the released models, but I'm unclear on why (perhaps we downloaded only a subset of the other dataset?). We basically just ran python librispeech.py, which by default I believe processes the entire dataset. This might be related to this issue as well.

2. When we train a model from scratch with an LR of 3e-4, an anneal rate of 1.1, a batch size of 5 (to fit on the NVIDIA GTX 1080 Ti), and a max_norm of 200, we've been able to get down to a WER of 23.2 and a CER of 8.9 on the validation set within 8 epochs. The model is still training, so it might get closer to the 21.7. All other hyperparameters (5-layer bidirectional GRU) were the defaults provided.
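Roughly, that from-scratch run corresponds to something like this (flag names assume the standard train.py options; the manifest names are just what librispeech.py produced for us):

python train.py --cuda --train_manifest data/libri_train_manifest.csv \
                --val_manifest data/libri_val_manifest.csv \
                --lr 3e-4 --learning_anneal 1.1 --batch_size 5 --max_norm 200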

Also, would it be helpful in terms of reproducibility to fix a default random seed in PyTorch? I just thought that if we are re-training some models it might be worth fixing that as well (it might affect the smaller datasets more than the larger ones).
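A minimal sketch of what I mean, near the top of train.py (the seed value is arbitrary; full determinism on GPU may also need cuDNN's deterministic mode):

import random
import numpy as np
import torch

seed = 123456  # arbitrary fixed value
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)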

Let me know if there's anything I can do to help!

Udit

@SeanNaren (Owner)

@alugupta thanks for your help!

The way WER/CER is calculated has changed to match up more with academic standards (but it should be correct for the release branch and the commit it points to, I think). There does seem to be a slight discrepancy between the WER/CER at training time and at testing time, but I'm investigating further.
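For anyone comparing numbers: by "academic standards" I mean word-level edit distance divided by the number of reference words. A minimal sketch of that definition (not the repo's exact implementation):

def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[-1][-1] / max(len(ref), 1)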

I'll definitely need to update the librispeech script to create separate test scripts for the different test sets LibriSpeech offers, so any contribution there would be awesome :)

I agree about the default random seed; that would definitely help! I will create a ticket to track this. Thanks for your help!

@Minju-Jung (Author)

@SeanNaren As in my issue #200, does the current LibriSpeech dataset contain both clean and other?

Now I'm training the network on LibriSpeech under the same conditions as @alugupta.
I will report my simulation results soon.

@ryanleary (Collaborator)

I just set up a new machine and pulled down this repository and the an4 dataset from scratch. The GPU was 1x Titan V. Trained with the following command:

python train.py --train_manifest an4/an4_train_manifest.csv \
                --val_manifest an4/an4_val_manifest.csv  \
                --num_workers 4 \
                --cuda \
                --learning_anneal 1.01 \
                --augment \
                --epochs 100

Result:

| Dataset  | WER   | CER   | Loss  | Epochs |
|----------|-------|-------|-------|--------|
| AN4 test | 9.732 | 3.919 | 0.054 | 48     |

which beats the previously released model. Be sure to include the augment flag, particularly when running for a large number of epochs. If not using --augment on a dataset this small, there's little reason to run a large number of epochs.

I'll leave this open for another day or so, but will probably close since ^^ reproduces the result.

@ryanleary (Collaborator)

Also @alugupta I think what you're calling "Librispeech Other" is actually the combined other+clean test set. The pretrained model scores around 31.3% WER on other. 21.7% is the weighted average WER between the other and clean test sets.

@alugupta

@ryanleary Oh right! The numbers I reported earlier for LibriSpeech other are actually LibriSpeech clean + other. That makes sense then. So the model from earlier should be more or less similar to the pretrained model. Thanks for also rerunning the an4 dataset!

@Minju-Jung I guess that also answers your question: by default the dataset is the combined clean and other. If you specified one or the other when pre-processing, it could be different.

@SeanNaren For the separate test scripts for other and clean, are you imagining that the pre-processing step saves the other and clean subsets separately by default, giving three sets of manifests: clean, other, and combined? I could perhaps contribute this if that helps (it might be a while as I'll be away for the coming week). A rough sketch of what I mean is below.
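Rough sketch, assuming manifest rows are wav_path,txt_path and the LibriSpeech directory names still contain "clean"/"other" (split_manifest is just a hypothetical helper, not something in the repo):

# hypothetical helper: split a combined test manifest into clean/other manifests
# based on the wav path of each row
import csv

def split_manifest(combined_path, clean_path, other_path):
    with open(combined_path) as f:
        rows = list(csv.reader(f))
    with open(clean_path, 'w', newline='') as f:
        csv.writer(f).writerows(r for r in rows if 'clean' in r[0])
    with open(other_path, 'w', newline='') as f:
        csv.writer(f).writerows(r for r in rows if 'other' in r[0])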

@SeanNaren (Owner)

@alugupta thanks for your input!

I've just merged #205, which splits the testing scripts into clean/other. This should help get the correct test scores!
