Unable to train using fl_asr_train with fork option using the AM file (am_500ms_future_context_dev_other.bin) #456

Open
vchagari opened this issue Feb 5, 2021 · 3 comments
Labels: bug (Something isn't working)

@vchagari

vchagari commented Feb 5, 2021

Bug Description

fl_asr_train aborts with a core dump when run with the fork option on the Librispeech AM file. Details are below.

Error:
E0204 16:25:50.015505 34464 Serializer.h:80 Error while loading "/data/set3/am_500ms_future_context_dev_other.bin": Trying to load an unregistered polymorphic type (w2l::SpecAugment).
Make sure your type is registered with CEREAL_REGISTER_TYPE and that the archive you are using was included (and registered with CEREAL_REGISTER_ARCHIVE) prior to calling CEREAL_REGISTER_TYPE.
If your type is already registered and you still see this error, you may need to use CEREAL_REGISTER_DYNAMIC_INIT.

#######
Details:
./fl_asr_train fork /data/set3/am_500ms_future_context_dev_other.bin --flagsfile=/data/set3/for_training_fork_am_500ms_future_context.cfg --minloglevel=0 --rundir=/data/set3/02_04_2021 --rndv_filepath=""
I0204 16:25:15.639739 32766 Train.cpp:54] Parsing command line flags
I0204 16:25:15.639756 32766 Train.cpp:57] Reading flags from file /data/set3/for_training_fork_am_500ms_future_context.cfg
W0204 16:25:15.639839 32766 Helpers.cpp:91] Did not find scalefactor, using the flag's value.
I0204 16:25:15.639843 32766 Helpers.cpp:97] Using initial scale factor 1
Initialized NCCL 2.8.3 successfully!
I0204 16:25:15.898775 32766 Train.cpp:197] Gflags after parsing
--flagfile=; --fromenv=; --tryfromenv=; --undefok=; --tab_completion_columns=80; --tab_completion_word=; --help=false; --helpfull=false; --helpmatch=; --helpon=; --helppackage=false; --helpshort=false; --helpxml=false; --version=false; --ipl_maxisz=1.7976931348623157e+308; --ipl_maxtsz=9223372036854775807; --ipl_minisz=0; --ipl_mintsz=0; --ipl_relabel_epoch=10000000; --ipl_relabel_ratio=1; --ipl_seed_model_wer=-1; --ipl_use_existing_pl=false; --unsup_datadir=; --unsup_train=; --adambeta1=0.90000000000000002; --adambeta2=0.999; --am=; --am_decoder_tr_dropout=0; --am_decoder_tr_layerdrop=0; --am_decoder_tr_layers=1; --arch=/data/set3/am_500ms_future_context.arch; --attention=content; --attentionthreshold=0; --attnWindow=no; --attnconvchannel=0; --attnconvkernel=0; --attndim=0; --batching_max_duration=0; --batching_strategy=none; --batchsize=8; --beamsize=2500; --beamsizetoken=250000; --beamthreshold=25; --channels=1; --criterion=ctc; --critoptim=sgd; --datadir=/data/set3; --decoderattnround=1; --decoderdropout=0; --decoderrnnlayer=1; --decodertype=wrd; --devwin=0; --emission_dir=; --emission_queue_size=3000; --enable_distributed=true; --encoderdim=0; --eosscore=0; --everstoredb=false; --features_type=mfsc; --fftcachesize=1; --filterbanks=80; --fl_amp_max_scale_factor=32000; --fl_amp_scale_factor=4096; --fl_amp_scale_factor_update_interval=2000; --fl_amp_use_mixed_precision=false; --fl_benchmark_mode=true; --fl_log_level=; --fl_log_mem_ops_interval=0; --fl_optim_mode=; --fl_vlog_level=0; --flagsfile=/data/set3/for_training_fork_am_500ms_future_context.cfg; --framesizems=25; --framestridems=10; --gamma=1; --gumbeltemperature=1; --highfreqfilterbank=-1; --inputfeeding=false; --isbeamdump=false; --iter=100000000; --itersave=true; --labelsmooth=0; --leftWindowSize=50; --lexicon=/data/set3/decoder-unigram-10000-nbest10-02-04-2021.lexicon; --linlr=-1; --linlrcrit=-1; --linseg=0; --lm=; --lm_memory=5000; --lm_vocab=; --lmtype=kenlm; --lmweight=0; --lmweight_high=4; 
--lmweight_low=0; --lmweight_step=0.20000000000000001; --localnrmlleftctx=300; --localnrmlrightctx=0; --logadd=false; --lowfreqfilterbank=0; --lr=0.01; --lr_decay=10000; --lr_decay_step=9223372036854775807; --lrcosine=false; --lrcrit=0; --max_devices_per_node=8; --maxdecoderoutputlen=200; --maxgradnorm=0.5; --maxload=-1; --maxrate=10; --maxsil=50; --maxword=-1; --melfloor=1; --mfcccoeffs=13; --minrate=3; --minsil=0; --momentum=0.80000000000000004; --netoptim=sgd; --nthread=6; --nthread_decoder=1; --nthread_decoder_am_forward=1; --numattnhead=8; --onorm=target; --optimepsilon=1e-08; --optimrho=0.90000000000000002; --pctteacherforcing=100; --pcttraineval=1; --pretrainWindow=0; --replabel=0; --reportiters=1000; --rightWindowSize=50; --rndv_filepath=; --rundir=/data/set3/02_04_2021; --samplerate=16000; --sampletarget=0; --samplingstrategy=rand; --saug_fmaskf=27; --saug_fmaskn=2; --saug_start_update=-1; --saug_tmaskn=2; --saug_tmaskp=1; --saug_tmaskt=100; --sclite=; --seed=0; --sfx_config=; --sfx_start_update=2147483647; --show=false; --showletters=false; --silscore=0; --smearing=none; --smoothingtemperature=1; --softwoffset=10; --softwrate=5; --softwstd=5; --sqnorm=true; --stepsize=1000000; --surround=; --test=; --tokens=/data/set3/librispeech-train-all-unigram-10000.tokens; --train=lists/train.lst; --trainWithWindow=false; --transdiag=0; --unkscore=-inf; --use_memcache=false; --uselexicon=true; --usewordpiece=true; --valid=lists/dev.lst; --validbatchsize=-1; --warmup=1; --weightdecay=0; --wordscore=0; --wordseparator=_; --world_rank=0; --world_size=32; --alsologtoemail=; --alsologtostderr=false; --colorlogtostderr=false; --drop_log_memory=true; --log_backtrace_at=; --log_dir=; --log_link=; --log_prefix=true; --logbuflevel=0; --logbufsecs=30; --logemaillevel=999; --logfile_mode=436; --logmailer=/bin/mail; --logtostderr=true; --max_log_size=1800; --minloglevel=0; --stderrthreshold=2; --stop_logging_if_full_disk=false; --symbolize_stacktrace=true; --v=0; --vmodule=;
I0204 16:25:15.899092 32766 Train.cpp:198] Experiment path: /data/set3/02_04_2021
I0204 16:25:15.899096 32766 Train.cpp:199] Experiment runidx: 1
I0204 16:25:15.901998 32766 Train.cpp:272] Number of classes (network): 9998
I0204 16:25:16.854326 32766 Train.cpp:279] Number of words: 204170
E0204 16:25:18.193625 34464 Serializer.h:80 Error while loading "/data/set3/am_500ms_future_context_dev_other.bin": Trying to load an unregistered polymorphic type (w2l::SpecAugment).
Make sure your type is registered with CEREAL_REGISTER_TYPE and that the archive you are using was included (and registered with CEREAL_REGISTER_ARCHIVE) prior to calling CEREAL_REGISTER_TYPE.
If your type is already registered and you still see this error, you may need to use CEREAL_REGISTER_DYNAMIC_INIT.

E0204 16:25:19.343036 34464 Serializer.h:80 Error while loading "/data/set3/am_500ms_future_context_dev_other.bin": Trying to load an unregistered polymorphic type (w2l::SpecAugment).
Make sure your type is registered with CEREAL_REGISTER_TYPE and that the archive you are using was included (and registered with CEREAL_REGISTER_ARCHIVE) prior to calling CEREAL_REGISTER_TYPE.
If your type is already registered and you still see this error, you may need to use CEREAL_REGISTER_DYNAMIC_INIT.

E0204 16:25:21.500684 34464 Serializer.h:80 Error while loading "/data/set3/am_500ms_future_context_dev_other.bin": Trying to load an unregistered polymorphic type (w2l::SpecAugment).
Make sure your type is registered with CEREAL_REGISTER_TYPE and that the archive you are using was included (and registered with CEREAL_REGISTER_ARCHIVE) prior to calling CEREAL_REGISTER_TYPE.
If your type is already registered and you still see this error, you may need to use CEREAL_REGISTER_DYNAMIC_INIT.

E0204 16:25:25.676436 34464 Serializer.h:80 Error while loading "/data/set3/am_500ms_future_context_dev_other.bin": Trying to load an unregistered polymorphic type (w2l::SpecAugment).
Make sure your type is registered with CEREAL_REGISTER_TYPE and that the archive you are using was included (and registered with CEREAL_REGISTER_ARCHIVE) prior to calling CEREAL_REGISTER_TYPE.
If your type is already registered and you still see this error, you may need to use CEREAL_REGISTER_DYNAMIC_INIT.

E0204 16:25:33.849670 34464 Serializer.h:80 Error while loading "/data/set3/am_500ms_future_context_dev_other.bin": Trying to load an unregistered polymorphic type (w2l::SpecAugment).
Make sure your type is registered with CEREAL_REGISTER_TYPE and that the archive you are using was included (and registered with CEREAL_REGISTER_ARCHIVE) prior to calling CEREAL_REGISTER_TYPE.
If your type is already registered and you still see this error, you may need to use CEREAL_REGISTER_DYNAMIC_INIT.

E0204 16:25:50.015505 34464 Serializer.h:80 Error while loading "/data/set3/am_500ms_future_context_dev_other.bin": Trying to load an unregistered polymorphic type (w2l::SpecAugment).
Make sure your type is registered with CEREAL_REGISTER_TYPE and that the archive you are using was included (and registered with CEREAL_REGISTER_ARCHIVE) prior to calling CEREAL_REGISTER_TYPE.
If your type is already registered and you still see this error, you may need to use CEREAL_REGISTER_DYNAMIC_INIT.

terminate called after throwing an instance of 'cereal::Exception'
what(): Trying to load an unregistered polymorphic type (w2l::SpecAugment).
Make sure your type is registered with CEREAL_REGISTER_TYPE and that the archive you are using was included (and registered with CEREAL_REGISTER_ARCHIVE) prior to calling CEREAL_REGISTER_TYPE.
If your type is already registered and you still see this error, you may need to use CEREAL_REGISTER_DYNAMIC_INIT.
*** Aborted at 1612484750 (unix time) try "date -d @1612484750" if you are using GNU date ***
PC: @ 0x7f6805cecfb7 gsignal
*** SIGABRT (@0x3ed00007ffe) received by PID 32766 (TID 0x7f684fb12000) from PID 32766; stack trace: ***
@ 0x7f6848fda980 (unknown)
@ 0x7f6805cecfb7 gsignal
@ 0x7f6805cee921 abort
@ 0x7f6806910957 (unknown)
@ 0x7f6806916ae6 (unknown)
@ 0x7f6806916b21 std::terminate()
@ 0x7f6806916da9 __cxa_rethrow
@ 0x55949666d179 main
@ 0x7f6805ccfbf7 __libc_start_main
@ 0x5594966fa79a _start
Aborted (core dumped)
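
A note on why the same error appears six times before the abort: the gaps between the `Serializer.h:80` lines roughly double (about 1 s, 2 s, 4 s, 8 s, 16 s), which looks like a retry-with-backoff wrapper around the load call. That is an inference from the timestamps, not something the log states. The sketch below is a generic, hypothetical `retryWithBackoff` helper (the name, signature, and millisecond delays are assumptions for illustration, not flashlight's code). It shows that a deterministic failure such as an unregistered type fails identically on every attempt, so the retries only postpone the final exception.

```cpp
#include <cassert>
#include <chrono>
#include <stdexcept>
#include <thread>

// Hypothetical helper (not flashlight's API): call `fn` up to `maxAttempts`
// times, doubling the wait after each failure. If every attempt fails, the
// last exception is rethrown to the caller.
template <typename Fn>
auto retryWithBackoff(
    std::chrono::milliseconds initialDelay,
    int maxAttempts,
    Fn fn) -> decltype(fn()) {
  assert(maxAttempts >= 1);
  auto delay = initialDelay;
  for (int attempt = 1;; ++attempt) {
    try {
      return fn();
    } catch (const std::exception&) {
      if (attempt == maxAttempts) {
        throw; // out of attempts: propagate the original error
      }
      std::this_thread::sleep_for(delay);
      delay *= 2; // 1x, 2x, 4x, ... of the initial delay
    }
  }
}

// Simulates a load that always fails the same way, as a type mismatch would.
// Returns how many times the load was attempted before giving up.
int countAttemptsForDeterministicFailure() {
  int attempts = 0;
  try {
    retryWithBackoff(std::chrono::milliseconds(1), 6, [&]() -> int {
      ++attempts;
      throw std::runtime_error("unregistered polymorphic type");
    });
  } catch (const std::runtime_error&) {
    // Every attempt failed identically; retrying could not help.
  }
  return attempts;
}
```

Backoff is useful for transient problems such as a file that is still being written. Here the cause is a format mismatch, so the repeated log lines are one failure reported several times, not several distinct failures.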

Platform and Hardware

Have I written custom code (as opposed to running examples on an unmodified clone of the repository): No
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu version - 18.04 LTS
Python version: Python 3.6.9
Bazel version (if compiling from source): N/A
GCC/Compiler version (if compiling from source): N/A
CUDA/cuDNN version: 10.1/7.6.4.38
GPU model and memory: NVIDIA-SMI 460.27.04 Driver Version: 460.27.04

vchagari added the bug (Something isn't working) label Feb 5, 2021
@tlikhomanenko
Contributor

This model was trained with the old codebase, which is why it cannot currently be loaded by the new codebase.

Solutions:

  • use the particular branch/commit with which the model was trained (see the README of the corresponding recipe for the dependencies used for training)
  • retrain the model on your own
  • @vineelpratap / @avidov could probably convert the model to the new format, or you can do it on your own.

cc @vineelpratap @avidov
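
For readers unfamiliar with polymorphic serialization, here is a minimal sketch of why the load fails and what a conversion has to bridge. The archive stores the type name as a string (`w2l::SpecAugment` in the error above), and the loader looks that string up in a registry populated by `CEREAL_REGISTER_TYPE`. The code below is a simplified stand-in written with the standard library only. It is not cereal's implementation, and `TypeRegistry`, `Module`, and `newcode::SpecAugment` are hypothetical names chosen for illustration. The alias at the end is just one conceivable way to map a legacy name to a current class, not necessarily how the actual conversion is done.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical base class for serialized network layers.
struct Module {
  virtual ~Module() = default;
  virtual std::string name() const = 0;
};

// Hypothetical class in the *new* codebase: same layer, different namespace.
namespace newcode {
struct SpecAugment : Module {
  std::string name() const override { return "newcode::SpecAugment"; }
};
} // namespace newcode

// Simplified stand-in for a polymorphic type registry.
class TypeRegistry {
 public:
  using Factory = std::function<std::unique_ptr<Module>()>;

  // Plays the role of CEREAL_REGISTER_TYPE: bind a type-name string to a factory.
  void registerType(const std::string& typeName, Factory factory) {
    assert(factory && "factory must be callable");
    factories_[typeName] = std::move(factory);
  }

  // Plays the role of loading a polymorphic pointer: the name read from the
  // archive must be present in the registry, otherwise loading fails.
  std::unique_ptr<Module> create(const std::string& typeName) const {
    auto it = factories_.find(typeName);
    if (it == factories_.end()) {
      throw std::runtime_error(
          "Trying to load an unregistered polymorphic type (" + typeName + ")");
    }
    return it->second();
  }

 private:
  std::map<std::string, Factory> factories_;
};

// An archive written by the old code names "w2l::SpecAugment", but this
// registry only knows the new name. Returns the resulting error text.
std::string loadLegacyArchiveError() {
  TypeRegistry registry;
  registry.registerType(
      "newcode::SpecAugment",
      [] { return std::make_unique<newcode::SpecAugment>(); });
  try {
    registry.create("w2l::SpecAugment");
  } catch (const std::runtime_error& e) {
    return e.what();
  }
  return "";
}

// Illustrative fix: also register the legacy name as an alias of the current
// class. Returns the name of the object that was actually constructed.
std::string loadLegacyArchiveWithAlias() {
  TypeRegistry registry;
  auto makeSpecAugment = [] { return std::make_unique<newcode::SpecAugment>(); };
  registry.registerType("newcode::SpecAugment", makeSpecAugment);
  registry.registerType("w2l::SpecAugment", makeSpecAugment); // legacy alias
  return registry.create("w2l::SpecAugment")->name();
}
```

Matching the name is only part of the problem. If the serialized fields of a class also changed between codebases, a real conversion would additionally need to read the old layout and write the new one, which is why checking out the original training commit is the lowest-risk option.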

@vchagari
Author

vchagari commented Feb 8, 2021

Thank you @tlikhomanenko.
@vineelpratap, @avidov: Could you please help me convert the model to the new format?

@tlikhomanenko
Contributor

tlikhomanenko commented Apr 7, 2021

Model conversion will be covered in #524.
