[WIP] Lattice-faster-decoder-combine #3061

LvHang · 2019-02-28T01:39:43Z

Combine ProcessEmitting() and ProcessNonemitting() into ProcessForFrame(). According to the modification, update GetRawLattice().
Put it here so that @chenzhehuai can review it easily.
Thanks.

small fix

chenzhehuai

I talk with Hang and understand the main idea of his design.

Let's take Lattice-faster-decoder.cc as the baseline. Before explicit reviews, I have several general comments:

next_cutoff in the code is theoretically worse than the baseline:

Say we are decoding frame T (the current decoding step). The baseline obtains next_cutoff from the best token of all emitting and non-emitting tokens in the last decoding step, while current code obtains it from the best token of all emitting tokens in the last decoding step (non-emitting tokens are processed in the current decoding step). Hence, the "best token" is not really the best in the current code. Moreover, thinking about line 811-817, it assumes the next emitting arc following previous emitting arc, which is unusual in speech and won't bring about a reasonable cutoff. Although we keep updating next_cutoff once we obtain better tokens, the next_cutoff will start from a bad one and become a much better one, which not only wastes many tokens in the beginning, but also makes beam tuning harder since "better one" changes greatly.

the current code will have higher delay:

For the last frame, say T, baseline needs to decode emitting arcs of frame T and non-emitting arcs of frame T, while the current code needs to decode all above arcs and extra non-emitting arcs of frame T-1

since the current code changes the pruning behaviors, if we want to replace the baseline, we need to test both 1-best & lattice performance versus decoding speed:

for 1-best, we need to test real time factor (RTF) versus WER
for lattice, we need to test lattice density versus lattice oracle error rate and lattice rescored error rate

To sum up, I am a little bit worry about the decoding efficiency because of comment 1 and 2 above. I have done some experiments on your codes, share the result and show some improving ideas here LvHang#2
please improve your implementation according to above PR and if you get similar speed after incorporating these ideas, please start to test wer & etc. as what I discussed in point 3 above to see whether my point 1 and 2 are really problems.

Improvements: 1. with the same config, the previous code will result in 10% more tokens in each frame compared with baseline. The reason is because GetCutoff function is over emitting tokens, while for baseline it is over emitting&non emiting tokens. (please check it use --verbose 6 and the number of all tokens is shown in line 895, but not in line 693, which only contains emitting) Hence we should use 10% less max_active config. 2. the baseline use kaldi version HashList, while the previous code use std::unordered_map. please still use HashList or its variants. The reason is because HashList is allocating memory in Pool/Block fashion, hence it is 10% faster than std::unordered_map. To show this problem, I temporally do a hack in line 247, plese check it After these two improvements, the current code is with similar speed as the baseline. HOWEVER, please check the quality of 1-best & lattice, I have discuss about my worries here kaldi-asr#3061 (review)

danpovey · 2019-02-28T18:52:59Z

src/bin/latgen-faster-mapped-combine.cc

+// bin/latgen-faster-mapped.cc
+
+// Copyright 2009-2012  Microsoft Corporation, Karel Vesely
+//                2013  Johns Hopkins University (author: Daniel Povey)


This is OK for testing but eventually this should just be a change to lattice-faster-decoder.cc. It's just a small optimization.

src/decoder/lattice-faster-decoder-combine.h

danpovey · 2019-02-28T19:07:03Z

Zhehuai: You are right that the cutoff may not be optimal, but this is an issue for the emitting arcs as well, and I don't think it will affect nonemitting arcs any worse than it already does for emitting ones. I don't think the need to reprocess the last frame is a big problem, it's only one frame. The speed advantage of this method is that we don't need to iterate over all the states twice, to process emitting and nonemitting arcs separately. However, I have a lot of concerns about (a) the design of this, with recover_, and recover_map_, and (b) the quality of the comments. It doesn't look even close to something that I would merge, at this point, in terms of clarity and documentation; in fact, it might be faster for me to just take it, figure out what is going on, and rewrite it myself. I am hoping that Zhehuai can work with Hang on this. I might in principle be OK with the core design, but first I would need to be able to understand it, and the comments do not help with that. Maybe Zhehuai can understand it and write the comments himself?

…

On Wed, Feb 27, 2019 at 10:46 PM Zhehuai Chen ***@***.***> wrote: ***@***.**** commented on this pull request. I talk with Hang and understand the main idea of his design. Let's take Lattice-faster-decoder.cc as the baseline. Before explicit reviews, I have several general comments: 1. next_cutoff in the code is theoretically worse than the baseline: Say we are decoding frame T (the current decoding step). The baseline obtains next_cutoff from the best token of all emitting and non-emitting tokens in the last decoding step, while current code obtains it from the best token of all emitting tokens in the last decoding step (non-emitting tokens are processed in the current decoding step). Hence, the "best token" is not really the best in the current code. Moreover, thinking about line 811-817, it assumes the next emitting arc following previous emitting arc, which is unusual in speech and won't bring about a reasonable cutoff. Although we keep updating next_cutoff once we obtain better tokens, the next_cutoff will start from a bad one and become a much better one, which not only wastes many tokens in the beginning, but also makes beam tuning harder since "better one" changes greatly. 1. the current code will have higher delay: For the last frame, say T, baseline needs to decode emitting arcs of frame T and non-emitting arcs of frame T, while the current code needs to decode all above arcs and extra non-emitting arcs of frame T-1 1. since the current code changes the pruning behaviors, if we want to replace the baseline, we need to test both 1-best & lattice performance versus decoding speed: for 1-best, we need to test real time factor (RTF) versus WER for lattice, we need to test lattice density versus lattice oracle error rate and lattice rescored error rate To sum up, I am a little bit worry about the decoding efficiency because of comment 1 and 2 above. I will do some experiments on the current code and share the result with Hang. — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#3061 (review)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ADJVuw27xi0WSGOLIhV0yOoRhiQlt4ePks5vR1EWgaJpZM4bV0gF> .

LvHang · 2019-02-28T20:06:29Z

Dan:

In my design, the recover_ and recover_map_ will be used when we call GetRawLattice() in the middle of an utterance decoding.
After processing the frame t, we can achive the complete token list of frame t and the emittion part of frame t+1. So we will call ProcessNonemitting() once in GetRawLattice() to generate the complete token list for frame t+1. As GetRawLattice() will use "vector active_toks_" to build the raw lattice, so I update "active_toks_" directly in ProcessNonemitting().
But in PrcessForFrame(), the start point should be an emittion part token list. I have to get rid of the non-emittion tokens from "active_toks_" and recover their tot_costs(alpha) so that the behavior is consistent with normal case.

LvHang · 2019-03-01T04:12:53Z

Dan, Zhehuai:

I update the functions and comments.

Now ProcessNonemitting() will take an argument "std::unordered_map<Token*, BaseFloat> recover_map". In normal case, it will be NULL. But when we call GetRawLattice() during decoding, we will pass a local variable "recover_map" of GetRawLattice() and then recover the "active_toks_" in the end of GetRawLattice(). So it will not introduce the linkages.

chenzhehuai

I review the coding style and code comments inline. Please test speeds and precision once you finish them.

src/decoder/lattice-faster-decoder-combine.h

src/decoder/lattice-faster-decoder-combine.cc

danpovey · 2019-03-01T18:46:32Z

Regarding max-active: see how much difference it makes to the WER/RTF curve if you just leave a single option max-active and redefine it as the total number of active states. If it's not a huge incompatibility I'd rather keep it simple; there will be other slight differences in the pruning behavior as well, that may somewhat cancel out this effect.

…

On Fri, Mar 1, 2019 at 8:16 AM Zhehuai Chen ***@***.***> wrote: ***@***.**** requested changes on this pull request. I review the coding style and code comments inline. Please share some testing speeds once you finish them. ------------------------------ In src/decoder/lattice-faster-decoder-combine.h <#3061 (comment)>: > + Fst::Fst<fst::StdArc>, a.k.a. StdFst, or GrammarFst. If you invoke it with + FST == StdFst and it notices that the actual FST type is + fst::VectorFst<fst::StdArc> or fst::ConstFst<fst::StdArc>, the decoder object + will internally cast itself to one that is templated on those more specific + types; this is an optimization for speed. + */ +template <typename FST, typename Token = decodercombine::StdToken> +class LatticeFasterDecoderCombineTpl { + public: + using Arc = typename FST::Arc; + using Label = typename Arc::Label; + using StateId = typename Arc::StateId; + using Weight = typename Arc::Weight; + using ForwardLinkT = decodercombine::ForwardLink<Token>; + + using StateIdToTokenMap = typename std::unordered_map<StateId, Token*>; using StateIdToTokenMap = typename std::unordered_map<StateId, Token*, std::hash, std::equal_to, fkaldi::PoolAllocator<std::pair<const StateId, Token*>>>; ------------------------------ In src/decoder/lattice-faster-decoder-combine.h <#3061 (comment)>: > + // It's called by PruneActiveTokens if any forward links have been pruned + void PruneTokensForFrame(int32 frame_plus_one); + + + // Go backwards through still-alive tokens, pruning them if the + // forward+backward cost is more than lat_beam away from the best path. It's + // possible to prove that this is "correct" in the sense that we won't lose + // anything outside of lat_beam, regardless of what happens in the future. + // delta controls when it considers a cost to have changed enough to continue + // going backward and propagating the change. larger delta -> will recurse + // less far. + void PruneActiveTokens(BaseFloat delta); + + /// Processes non-emitting (epsilon) arcs and emitting arcs for one frame + /// together. It takes the emittion tokens in "cur_toks_" from last frame. + /// Generates non-emitting tokens for current frame and emitting tokens for Generates non-emitting tokens for the previous frame and emitting tokens for the current frame. It would be better to keep the definition consistent as previous decoders. The emitting tokens for the current frame means the token take acoustic scores of the current frame if you agree with this change, maybe you would like to also change all "cur_toks_" to "prev_toks_", and change all "next_toks_" to "cur_toks_" ------------------------------ In src/decoder/lattice-faster-decoder-combine.h <#3061 (comment)>: > + void PruneActiveTokens(BaseFloat delta); + + /// Processes non-emitting (epsilon) arcs and emitting arcs for one frame + /// together. It takes the emittion tokens in "cur_toks_" from last frame. + /// Generates non-emitting tokens for current frame and emitting tokens for + /// next frame. + void ProcessForFrame(DecodableInterface *decodable); + + /// Processes nonemitting (epsilon) arcs for one frame. + /// Calls this function once when all frames were processed. + /// Or calls it in GetRawLattice() to generate the complete token list for + /// the last frame. [Deal With the tokens in map "next_toks_" which would + /// only contains emittion tokens from previous frame.] + /// If "recover_map" isn't NULL, we build the recover_map which will be used + /// to recover "active_toks_[last_frame]" token list for the last frame. + void ProcessNonemitting(std::unordered_map<Token*, BaseFloat> *recover_map); please merge this function into ProcessForFrame as we discussed ------------------------------ In src/decoder/lattice-faster-decoder-combine.h <#3061 (comment)>: > + /// Returns true if result is nonempty (using the return status is deprecated, + /// it will become void). If "use_final_probs" is true AND we reached the + /// final-state of the graph then it will include those as final-probs, else + /// it will treat all final-probs as one. Note: this just calls GetRawLattice() + /// and figures out the shortest path. + bool GetBestPath(Lattice *ofst, + bool use_final_probs = true); + + /// Outputs an FST corresponding to the raw, state-level + /// tracebacks. Returns true if result is nonempty. + /// If "use_final_probs" is true AND we reached the final-state + /// of the graph then it will include those as final-probs, else + /// it will treat all final-probs as one. + /// The raw lattice will be topologically sorted. + /// The function can be called during decoding, it will take "next_toks_" map + /// and generate the complete token list for the last frame. Then recover it it will take "next_toks_" map and generate the complete token list -> it will process non-emission tokens from "next_toks_" map to get both emission and non-emission tokens to get raw lattices ------------------------------ In src/decoder/lattice-faster-decoder-combine.cc <#3061 (comment)>: > + +// instantiate this class once for each thing you have to decode. +template <typename FST, typename Token> +LatticeFasterDecoderCombineTpl<FST, Token>::LatticeFasterDecoderCombineTpl( + const FST &fst, + const LatticeFasterDecoderCombineConfig &config): + fst_(&fst), delete_fst_(false), config_(config), num_toks_(0) { + config.Check(); +} + + +template <typename FST, typename Token> +LatticeFasterDecoderCombineTpl<FST, Token>::LatticeFasterDecoderCombineTpl( + const LatticeFasterDecoderCombineConfig &config, FST *fst): + fst_(fst), delete_fst_(true), config_(config), num_toks_(0) { + config.Check(); add the following to be consistent with previous version: next_toks_.reserve(1000); cur_toks_.reserve(1000); ------------------------------ In src/decoder/lattice-faster-decoder-combine.cc <#3061 (comment)>: > +// an unusual search error. +template <typename FST, typename Token> +bool LatticeFasterDecoderCombineTpl<FST, Token>::Decode(DecodableInterface *decodable) { + InitDecoding(); + + // We use 1-based indexing for frames in this decoder (if you view it in + // terms of features), but note that the decodable object uses zero-based + // numbering, which we have to correct for when we call it. + + while (!decodable->IsLastFrame(NumFramesDecoded() - 1)) { + if (NumFramesDecoded() % config_.prune_interval == 0) + PruneActiveTokens(config_.lattice_beam * config_.prune_scale); + ProcessForFrame(decodable); + } + // Procss non-emitting arcs for the last frame. + ProcessNonemitting(NULL); remove this separate call as we discussed ------------------------------ In src/decoder/lattice-faster-decoder-combine.cc <#3061 (comment)>: > +template <typename FST, typename Token> +bool LatticeFasterDecoderCombineTpl<FST, Token>::GetRawLattice( + Lattice *ofst, + bool use_final_probs) { + typedef LatticeArc Arc; + typedef Arc::StateId StateId; + typedef Arc::Weight Weight; + typedef Arc::Label Label; + // Note: you can't use the old interface (Decode()) if you want to + // get the lattice with use_final_probs = false. You'd have to do + // InitDecoding() and then AdvanceDecoding(). + if (decoding_finalized_ && !use_final_probs) + KALDI_ERR << "You cannot call FinalizeDecoding() and then call " + << "GetRawLattice() with use_final_probs == false"; + + std::unordered_map<Token*, BaseFloat> *recover_map = NULL; do not need to use pointer and new one to it. Just use std::unordered_map<Token*, BaseFloat> recover_map ------------------------------ In src/decoder/lattice-faster-decoder-combine.h <#3061 (comment)>: > + void RecoverLastTokenList(std::unordered_map<Token*, BaseFloat> *recover_map); + + + /// The "cur_toks_" and "next_toks_" actually allow us to maintain current + /// and next frames. They are indexed by StateId. It is indexed by frame-index + /// plus one, where the frame-index is zero-based, as used in decodable object. + /// That is, the emitting probs of frame t are accounted for in tokens at + /// toks_[t+1]. The zeroth frame is for nonemitting transition at the start of + /// the graph. + StateIdToTokenMap cur_toks_; + StateIdToTokenMap next_toks_; + + /// Gets the weight cutoff. + /// Notice: In traiditional version, the histogram prunning method is applied + /// on a complete token list on one frame. But, in this version, it is used + /// on a token list which only contains the emittion part. So the max_active because of this, how about change "max_active" to "max_emit". and to make it forward compatibility to our previous scripts, we can have max_emit=0.9*max_active ------------------------------ In src/decoder/lattice-faster-decoder-combine.h <#3061 (comment)>: > + /// and next frames. They are indexed by StateId. It is indexed by frame-index + /// plus one, where the frame-index is zero-based, as used in decodable object. + /// That is, the emitting probs of frame t are accounted for in tokens at + /// toks_[t+1]. The zeroth frame is for nonemitting transition at the start of + /// the graph. + StateIdToTokenMap cur_toks_; + StateIdToTokenMap next_toks_; + + /// Gets the weight cutoff. + /// Notice: In traiditional version, the histogram prunning method is applied + /// on a complete token list on one frame. But, in this version, it is used + /// on a token list which only contains the emittion part. So the max_active + /// and min_active values might be narrowed. + BaseFloat GetCutoff(const StateIdToTokenMap& toks, + BaseFloat *adaptive_beam, + StateId *best_elem_id, Token **best_elem); best_elem_id -> best_state_id best_elem -> best_token ------------------------------ In src/decoder/lattice-faster-decoder-combine.cc <#3061 (comment)>: > + // pruning "online" before having seen all tokens + + // "next_cutoff" is used to limit a new token in next frame should be handle + // or not. It will be updated along with the further processing. + BaseFloat next_cutoff = std::numeric_limits<BaseFloat>::infinity(); + // "cost_offset" contains the acoustic log-likelihoods on current frame in + // order to keep everything in a nice dynamic range. Reduce roundoff errors. + BaseFloat cost_offset = 0.0; + + // First process the best token to get a hopefully + // reasonably tight bound on the next cutoff. The only + // products of the next block are "next_cutoff" and "cost_offset". + // Notice: As the difference between the combine version and the traditional + // version, this "best_tok" is choosen from emittion tokens. Normally, the + // best token of one frame comes from an epsilon non-emittion. So the best + // token is a looser boundary. Use it to estimate a bound on the next cutoff. So the best token for now is a looser boundary. We use it to estimate a bound on the next cutoff and we will update the cutoff once we have better tokens ------------------------------ In src/decoder/lattice-faster-decoder-combine.cc <#3061 (comment)>: > + *adaptive_beam = config_.beam; + return beam_cutoff; + } + } +} + +template <typename FST, typename Token> +void LatticeFasterDecoderCombineTpl<FST, Token>::ProcessForFrame( + DecodableInterface *decodable) { + KALDI_ASSERT(active_toks_.size() > 0); + int32 frame = active_toks_.size() - 1; // frame is the frame-index + // (zero-based) used to get likelihoods + // from the decodable object. + active_toks_.resize(active_toks_.size() + 1); + + cur_toks_.clear(); it may be better to understand: cur_toks_.swap(next_toks_); next_toks_.clear(); ------------------------------ In src/decoder/lattice-faster-decoder-combine.cc <#3061 (comment)>: > + cur_cost = tok->tot_cost, + tot_cost = cur_cost + ac_cost + graph_cost; + if (tot_cost > next_cutoff) continue; + else if (tot_cost + adaptive_beam < next_cutoff) + next_cutoff = tot_cost + adaptive_beam; // a tighter boundary for emitting + + // no change flag is needed + Token *next_tok = FindOrAddToken(arc.nextstate, frame + 1, tot_cost, + tok, &next_toks_, NULL); + // Add ForwardLink from tok to next_tok. Put it on the head of tok->link + // list + tok->links = new ForwardLinkT(next_tok, arc.ilabel, arc.olabel, + graph_cost, ac_cost, tok->links); + } + } // for all arcs + tok->in_current_queue = false; // out of queue let's do this in line 838 in case there is a state with self-loop ------------------------------ In src/decoder/lattice-faster-decoder-combine.cc <#3061 (comment)>: > + if (tot_cost > next_cutoff) continue; + else if (tot_cost + adaptive_beam < next_cutoff) + next_cutoff = tot_cost + adaptive_beam; // a tighter boundary for emitting + + // no change flag is needed + Token *next_tok = FindOrAddToken(arc.nextstate, frame + 1, tot_cost, + tok, &next_toks_, NULL); + // Add ForwardLink from tok to next_tok. Put it on the head of tok->link + // list + tok->links = new ForwardLinkT(next_tok, arc.ilabel, arc.olabel, + graph_cost, ac_cost, tok->links); + } + } // for all arcs + tok->in_current_queue = false; // out of queue + } // end of while loop + KALDI_VLOG(6) << "toks after: " << cur_toks_.size(); KALDI_VLOG(6) << "Number of tokens active on frame " << NumFramesDecoded()-1 << " is " << cur_toks_.size(); ------------------------------ In src/decoder/lattice-faster-decoder-combine.cc <#3061 (comment)>: > + BaseFloat w = static_cast<BaseFloat>(iter->second->tot_cost); + tmp_array_.push_back(w); + if (w < best_weight) { + best_weight = w; + if (best_elem) { + *best_elem_id = iter->first; + *best_elem = iter->second; + } + } + } + + BaseFloat beam_cutoff = best_weight + config_.beam, + min_active_cutoff = std::numeric_limits<BaseFloat>::infinity(), + max_active_cutoff = std::numeric_limits<BaseFloat>::infinity(); + + KALDI_VLOG(6) << "Number of tokens active on frame " << NumFramesDecoded() KALDI_VLOG(6) << "Number of tokens emitting on frame " << NumFramesDecoded()-1 — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#3061 (review)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ADJVu9Hs1Z60mrg6zhCKH0K58Taismb5ks5vSSgTgaJpZM4bV0gF> .

LvHang · 2019-03-02T22:43:49Z

According to the suggestions of Dan and Zhehuai, update the comments and code.

danpovey · 2019-03-02T22:47:36Z

thanks. Zhehuai, please check carefully. I also will wait to hear on the efficiency. Ideally it would be nice to be able to separate the effects of stopping using HashList, versus combining nonemitting/emitting propagation, but perhaps that is too much to hope for.

…

On Sat, Mar 2, 2019 at 5:43 PM LvHang ***@***.***> wrote: According to the suggestions of Dan and Zhehuai, update the comments and code. — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#3061 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ADJVu1VmmwPSPI7SaqItNdScEhlylEMbks5vSv6pgaJpZM4bV0gF> .

LvHang · 2019-03-03T00:29:19Z

I uploaded a simple testing script. It calls "nnet3-compute" to generate posterior, and then use "latgen-faster-mapped" or "latgen-faster-mapped-combine" to decode.
Anyway, I'm testing it.

src/decoder/lattice-faster-decoder-combine.cc

src/decoder/lattice-faster-decoder-combine-bucketqueue.cc

danpovey

Various requested changes.

src/decoder/lattice-faster-decoder-combine-bucketqueue.h

src/decoder/lattice-faster-decoder-combine-bucketqueue.cc

src/decoder/lattice-faster-decoder-combine-bucketqueue.h

danpovey · 2019-03-26T20:00:20Z

Just treat it as zero, since we do in fact normalize costs to be not far from zero on each frame. It's not necessary; let the Push() code handle it.

…

On Tue, Mar 26, 2019 at 3:51 PM LvHang ***@***.***> wrote: ***@***.**** commented on this pull request. ------------------------------ In src/decoder/lattice-faster-decoder-combine-bucketqueue.h <#3061 (comment)>: > + in_queue(false) { } +}; + +} // namespace decoder + + +template<typename Token> +class BucketQueue { + public: + /** Constructor. 'cost_scale' is a scale that we multiply the token costs by + * before intergerizing; a larger value means more buckets. + * 'best_cost_estimate' is an estimate of the best (lowest) cost that + * we are likely to encounter (e.g. the best cost that we have seen so far). + * It is used to initialize 'bucket_storage_begin_'. + */ + BucketQueue(BaseFloat best_cost_estimate, BaseFloat cost_scale = 1.0); Dan, how are you thinking about remove "best_cost_estimate" from the constructor of BucketQueue? — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#3061 (review)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ADJVu2-JmSaFokjgMbEdyatfOmULbWWXks5vanpdgaJpZM4bV0gF> .

danpovey · 2019-03-27T02:22:54Z

In src/decoder/lattice-faster-decoder-combine-bucketqueue.cc <#3061 (comment)>: > + if (buckets_[vec_index].empty()) { // This bucket is empty. Update vec_index + int32 next_vec_index = vec_index + 1; + for (; next_vec_index < buckets_.size(); next_vec_index++) { + if (!buckets_[next_vec_index].empty()) break; + } + vec_index = next_vec_index; + first_occupied_vec_index_ = vec_index; + } + + if (tok->in_queue) { // This is a effective token + tok->in_queue = false; + return tok; + } + } + return NULL; +} Sorry, dan, I didn't get the difference between "first_nonempty_bucket_index_" and "first_occupied_vec_index_"

I was referring the terminology in a comment in the code, which you got from an email of mine. They are the same, but my point is you don't have a temporary in the code any more named 'bucket_index'.

Dan, with the new design, I think the Pop() function still looks like current code. We cannot return the token directly as you wrote if (!buckets_[first_nonempty_bucket_index_].empty()) { Token *ans = buckets_[first_nonempty_bucket_index_].back(); buckets_[first_nonempty_bucket_index_].pop_back(); return ans; } Because we cannot guarantee the 'ans' is an effective token.

OK, then check ans ans->in_queue == true before returning it. But don't start with a while loop, because you are checking a condition that is not necessary to check at that point (because as I said, you would always ensure that first_nonempty_bucket_index_ was less than buckets_size(), precisely to avoid that check).

while (first_nonempty_bucket_index_ < buckets_.size()) { Token *tok = buckets_[first_nonempty_bucket_index_].back(); if (tok->in_queue) break; buckets_[first_nonempty_bucket_index_].pop_back(); //remove the invalid token if (buckets_[first_nonempty_bucket_index_].empty()) { first_nonempty_bucket_index_++; for (; first_nonempty_bucket_index_ < buckets_.size(); first_nonempty_bucket_index_++) { if(!bucket_[first_nonempty_bucket_index_].empty()) break; } } } At the same time, we still need "first_nonempty_bucket_index == buckets_.size()" to determinize the BucketQueue is empty at the beginning of the function. if (first_nonempty_bucket_index == buckets_.size()) { // no token left return NULL: ] Read more carefully what I said. I am proposing to always ensure that

first_nonempty_bucket_index< buckets_.size()), and change what that variable means.

…

— You are receiving this because you commented. Reply to this email directly, view it on GitHub <#3061 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ADJVu_yRWN3GDFtM1wvABW44v4FxCJabks5vasbJgaJpZM4bV0gF> .

LvHang · 2019-03-27T04:56:17Z

Hi Dan,

I understand that you want to use the new "bucket_index" formula instead of the old "bucket_index" and "vec_index".

Let me explain what makes me confused. Firstly, we have the definition of "first_nonempty_bucket_index_".
/// first_nonempty_bucket_index_ is an integer in the range [0, buckets_.size() -1]
/// which is not larger than the index of the ifrst nonempty element of buckets_.
int32 first_nonempty_bucket_index_;

So the key problem is when and how to update "first_nonempty_bucket_index_".
Assume the Push() function adds the token into the BucketQueue one by one and sets "first_nonempty_bucket_index_" to the first non-empty bucket.
From your previous comments, I guess, after once successful Pop operation (ans->in_queue), you just leave the "first_nonempty_bucket_index_" there. And then, until the bucket is empty (buckets_[first_nonempty_bucket_index_].empty()), you will update it.

(1) But what should we do if the Pop operation is not successed? We need a while loop to repeat the try until the BucketQueue is empty.
(2) How can we know the BucketQueue is empty? I think the most convenient way is "first_nonempty_bucket_index_ == buckets_.size()". But it is conflict with your definition. So I suggest use
the following definition.

/// first_nonempty_bucket_index_ is an integer whose valid range is [0, buckets_.size() -1]
/// which is not larger than the index of the ifrst nonempty element of buckets_.
/// If "first_nonempty_bucket_index_" corresponds to a value past the end of buckets_
/// (i.e. first_nonempty_bucket_index_ == buckets_.size(), we interpret it as "there are no 
/// buckets with entries."
int32 first_nonempty_bucket_index_;

And the code will be (please see the comments to know what I'm concerned.)

  93 template<typename Token>                                                                              
  94 Token* BucketQueue<Token>::Pop() {                                                                    
  95   while (true) {                                                                                      
  96     if (!bucket_[first_nonempty_bucket_index_].empty()) {                                             
  97       Token *ans = buckets_[first_nonempty_bucket_index_].back();                                     
  98       buckets_[first_nonempty_bucket_index_].pop_back();                                              
  99       if (ans->in_queue) {                                                                            
 100         ans->in_queue = false;                                                                        
 101         return ans;                                                                                   
 102       }                                                                                               
 103     }                                                                                                 
 104     // There are two cases can trigger the following code to update the                               
 105     // "first_nonempty_bucket_index_". One is last token is invalid (i.e.                             
 106     // ans->in_queue == false). The other is bucket is empty.                                         
 107     if (buckets_[first_nonempty_bucket_index_].empty()) {                                             
 108       first_nonempty_bucket_index_++;                                                                 
 109       for (; first_nonempty_bucket_index_ < buckets_.size();                                          
 110            first_nonempty_bucket_index_++) {                                                          
 111         if (!buckets_[first_nonempty_bucket_index_].empty()) break;                                   
 112       }                                                                                               
 113     }                                                                                                 
 114     // There are two cases will trigger the following code. One is                                    
 115     // first_nonempty_bucket_index_ == buckets_.size() [ But it is conflict with                      
 116     // the definition]. The other is we find an appropriate new                                       
 117     // "first_nonempty_bucket_index_".                                          
 118     if (first_nonempty_bucket_index_ == buckets_.size()) return NULL;                                 
 119   }                                                                                                   
 120 }

danpovey · 2019-03-27T05:00:10Z

The whole point is to make the normal code path very fast. Just set first_nonempty_bucket_index_ to buckets_.size() - 1 if it would naturally be more than buckets_.size(). That way you never have to check (in the normally taken code path) whether that index is valid or not.

…

On Wed, Mar 27, 2019 at 12:56 AM LvHang ***@***.***> wrote: Hi Dan, I understand that you want to use the new "bucket_index" formula instead of the old "bucket_index" and "vec_index". Let me explain what makes me confused. Firstly, we have the definition of "first_nonempty_bucket_index_". /// first_nonempty_bucket_index_ is an integer in the range [0, buckets_.size() -1] /// which is not larger than the index of the ifrst nonempty element of buckets_. int32 first_nonempty_bucket_index_; So the key problem is when and how to update "first_nonempty_bucket_index_". Assume the Push() function adds the token into the BucketQueue one by one and sets "first_nonempty_bucket_index_" to the first non-empty bucket. From your previous comments, I guess, after once successful Pop operation (ans->in_queue), you just leave the "first_nonempty_bucket_index_" there. And then, until the bucket is empty (buckets_[first_nonempty_bucket_index_].empty()), you will update it. (1) But what should we do if the Pop operation is not successed? We need a while loop to repeat the try until the BucketQueue is empty. (2) How can we know the BucketQueue is empty? I think the most convenient way is "first_nonempty_bucket_index_ == buckets_.size()". But it is conflict with your definition. So I suggest use the following definition. /// first_nonempty_bucket_index_ is an integer whose valid range is [0, buckets_.size() -1] /// which is not larger than the index of the ifrst nonempty element of buckets_. /// If "first_nonempty_bucket_index_" corresponds to a value past the end of buckets_ /// (i.e. first_nonempty_bucket_index_ == buckets_.size(), we interpret it as "there are no /// buckets with entries." int32 first_nonempty_bucket_index_; And the code will be (please see the comments to know what I'm concerned.) 93 template<typename Token> 94 Token* BucketQueue<Token>::Pop() { 95 while (true) { 96 if (!bucket_[first_nonempty_bucket_index_].empty()) { 97 Token *ans = buckets_[first_nonempty_bucket_index_].back(); 98 buckets_[first_nonempty_bucket_index_].pop_back(); 99 if (ans->in_queue) { 100 ans->in_queue = false; 101 return ans; 102 } 103 } 104 // There are two cases can trigger the following code to update the 105 // "first_nonempty_bucket_index_". One is last token is invalid (i.e. 106 // ans->in_queue == false). The other is bucket is empty. 107 if (buckets_[first_nonempty_bucket_index_].empty()) { 108 first_nonempty_bucket_index_++; 109 for (; first_nonempty_bucket_index_ < buckets_.size(); 110 first_nonempty_bucket_index_++) { 111 if (!buckets_[first_nonempty_bucket_index_].empty()) break; 112 } 113 } 114 // There are two cases will trigger the following code. One is 115 // first_nonempty_bucket_index_ == buckets_.size() [ But it is conflict with 116 // the definition]. The other is we find an appropriate new 117 // "first_nonempty_bucket_index_". 118 if (first_nonempty_bucket_index_ == buckets_.size()) return NULL; 119 } 120 } — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#3061 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ADJVu99sY5yu4M7gHDcdTMcp8AhjQX7bks5vavn7gaJpZM4bV0gF> .

src/decoder/lattice-faster-decoder-combine-bucketqueue.cc

danpovey

Some small comments.

This is getting closer to being ready to commit. One more big thing that needs to be done is make a clean version of this where this code replaces the code in lattice-faster-decoder.{h,cc}. There is no reason to keep both versions around.

src/decoder/lattice-faster-decoder-combine-bucketqueue.cc

stale · 2020-06-19T06:36:23Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale · 2020-07-19T06:23:52Z

This issue has been automatically closed by a bot strictly because of inactivity. This does not mean that we think that this issue is not important! If you believe it has been closed hastily, add a comment to the issue and mention @kkm000, and I'll gladly reopen it.

stale · 2020-09-17T08:30:35Z

This issue has been automatically marked as stale by a bot solely because it has not had recent activity. Please add any comment (simply 'ping' is enough) to prevent the issue from being closed for 60 more days if you believe it should be kept open.

Combine ProcessEmitting() and ProcessNonemitting()

5503ba1

small fix

chenzhehuai reviewed Feb 28, 2019

View reviewed changes

LvHang added 2 commits February 28, 2019 01:54

add test binary

4b30697

add test2

2538a32

chenzhehuai mentioned this pull request Feb 28, 2019

The previous code is 30% slower than the baseline. LvHang/kaldi#2

Draft

danpovey reviewed Feb 28, 2019

View reviewed changes

src/decoder/lattice-faster-decoder-combine.h Outdated Show resolved Hide resolved

danpovey reviewed Feb 28, 2019

View reviewed changes

src/decoder/lattice-faster-decoder-combine.h Outdated Show resolved Hide resolved

danpovey reviewed Feb 28, 2019

View reviewed changes

src/decoder/lattice-faster-decoder-combine.h Outdated Show resolved Hide resolved

danpovey reviewed Feb 28, 2019

View reviewed changes

src/decoder/lattice-faster-decoder-combine.h Outdated Show resolved Hide resolved

Update design and comments

d011342

chenzhehuai suggested changes Mar 1, 2019

View reviewed changes

LvHang added 2 commits March 2, 2019 16:55

update comments and the functions about PNE()

a758ba4

change recvoer—â_map to token_orig_cost and document

85da998

add a simple test script

18e8758

small fix

b223bb7

LvHang force-pushed the async-decoder branch from 3e55c22 to b223bb7 Compare March 3, 2019 02:03

fix

dca20d4

danpovey reviewed Mar 4, 2019

View reviewed changes

src/decoder/lattice-faster-decoder-combine.cc Outdated Show resolved Hide resolved

danpovey reviewed Mar 4, 2019

View reviewed changes

src/decoder/lattice-faster-decoder-combine.cc Outdated Show resolved Hide resolved

LvHang added 2 commits March 5, 2019 15:21

change queue for speeding up

06d38cb

add hashlist version for test

603b705

danpovey reviewed Mar 25, 2019

View reviewed changes

src/decoder/lattice-faster-decoder-combine-bucketqueue.cc Outdated Show resolved Hide resolved

Hang Lyu and others added 3 commits March 25, 2019 20:51

some fix and remove SetBegin

d7a3a9d

minor fix

a70f34b

small fix

b636119

danpovey reviewed Mar 26, 2019

View reviewed changes

LvHang commented Mar 26, 2019

View reviewed changes

src/decoder/lattice-faster-decoder-combine-bucketqueue.h Outdated Show resolved Hide resolved

Hang Lyu added 4 commits March 27, 2019 23:10

first_nonempty_bucket_index_ and first_nonempty_bucket_

b6abf43

remove RecoverLastFrame()

25907d8

do ProcessNonemitting if final-probs are requested

26b378a

fix

c66f1bb

LvHang force-pushed the async-decoder branch from 111ebc0 to c66f1bb Compare March 29, 2019 20:40

small fix

79b0071

danpovey reviewed Apr 2, 2019

View reviewed changes

src/decoder/lattice-faster-decoder-combine-bucketqueue.cc Outdated Show resolved Hide resolved

danpovey reviewed Apr 2, 2019

View reviewed changes

Hang Lyu added 4 commits April 3, 2019 20:25

fix according to the comments

c359fe2

resize the BucketQueue when a weird long one was caused

896c5c8

small fix

6e2d27a

1.2 tolerance

7dd2ca2

stale bot added the stale Stale bot on the loose label Jun 19, 2020

stale bot closed this Jul 19, 2020

kkm000 reopened this Jul 19, 2020

stale bot removed the stale Stale bot on the loose label Jul 19, 2020

stale bot added the stale Stale bot on the loose label Sep 17, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Lattice-faster-decoder-combine #3061

[WIP] Lattice-faster-decoder-combine #3061

LvHang commented Feb 28, 2019

chenzhehuai left a comment •

edited

danpovey Feb 28, 2019

danpovey commented Feb 28, 2019 via email

LvHang commented Feb 28, 2019

LvHang commented Mar 1, 2019

chenzhehuai left a comment •

edited

danpovey commented Mar 1, 2019 via email

LvHang commented Mar 2, 2019

danpovey commented Mar 2, 2019 via email

LvHang commented Mar 3, 2019

danpovey left a comment

danpovey commented Mar 26, 2019 via email

danpovey commented Mar 27, 2019 via email

LvHang commented Mar 27, 2019

danpovey commented Mar 27, 2019 via email

danpovey left a comment

stale bot commented Jun 19, 2020

stale bot commented Jul 19, 2020

stale bot commented Sep 17, 2020

[WIP] Lattice-faster-decoder-combine #3061

Are you sure you want to change the base?

[WIP] Lattice-faster-decoder-combine #3061

Conversation

LvHang commented Feb 28, 2019

chenzhehuai left a comment • edited

Choose a reason for hiding this comment

danpovey Feb 28, 2019

Choose a reason for hiding this comment

danpovey commented Feb 28, 2019 via email

LvHang commented Feb 28, 2019

LvHang commented Mar 1, 2019

chenzhehuai left a comment • edited

Choose a reason for hiding this comment

danpovey commented Mar 1, 2019 via email

LvHang commented Mar 2, 2019

danpovey commented Mar 2, 2019 via email

LvHang commented Mar 3, 2019

danpovey left a comment

Choose a reason for hiding this comment

danpovey commented Mar 26, 2019 via email

danpovey commented Mar 27, 2019 via email

LvHang commented Mar 27, 2019

danpovey commented Mar 27, 2019 via email

danpovey left a comment

Choose a reason for hiding this comment

stale bot commented Jun 19, 2020

stale bot commented Jul 19, 2020

stale bot commented Sep 17, 2020

chenzhehuai left a comment •

edited

chenzhehuai left a comment •

edited