Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Lattice-faster-decoder-combine #3061

Open
wants to merge 29 commits into
base: master
Choose a base branch
from

Conversation

LvHang
Copy link
Contributor

@LvHang LvHang commented Feb 28, 2019

Combine ProcessEmitting() and ProcessNonemitting() into ProcessForFrame(). According to the modification, update GetRawLattice().
Put it here so that @chenzhehuai can review it easily.
Thanks.

Copy link
Contributor

@chenzhehuai chenzhehuai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I talk with Hang and understand the main idea of his design.

Let's take Lattice-faster-decoder.cc as the baseline. Before explicit reviews, I have several general comments:

  1. next_cutoff in the code is theoretically worse than the baseline:

Say we are decoding frame T (the current decoding step). The baseline obtains next_cutoff from the best token of all emitting and non-emitting tokens in the last decoding step, while current code obtains it from the best token of all emitting tokens in the last decoding step (non-emitting tokens are processed in the current decoding step). Hence, the "best token" is not really the best in the current code. Moreover, thinking about line 811-817, it assumes the next emitting arc following previous emitting arc, which is unusual in speech and won't bring about a reasonable cutoff. Although we keep updating next_cutoff once we obtain better tokens, the next_cutoff will start from a bad one and become a much better one, which not only wastes many tokens in the beginning, but also makes beam tuning harder since "better one" changes greatly.

  1. the current code will have higher delay:

For the last frame, say T, baseline needs to decode emitting arcs of frame T and non-emitting arcs of frame T, while the current code needs to decode all above arcs and extra non-emitting arcs of frame T-1

  1. since the current code changes the pruning behaviors, if we want to replace the baseline, we need to test both 1-best & lattice performance versus decoding speed:

for 1-best, we need to test real time factor (RTF) versus WER
for lattice, we need to test lattice density versus lattice oracle error rate and lattice rescored error rate

To sum up, I am a little bit worry about the decoding efficiency because of comment 1 and 2 above. I have done some experiments on your codes, share the result and show some improving ideas here LvHang#2
please improve your implementation according to above PR and if you get similar speed after incorporating these ideas, please start to test wer & etc. as what I discussed in point 3 above to see whether my point 1 and 2 are really problems.

chenzhehuai pushed a commit to chenzhehuai/kaldi that referenced this pull request Feb 28, 2019
Improvements:
1. with the same config, the previous code will result in 10% more tokens in each frame compared with baseline. The reason is because GetCutoff function is over emitting tokens, while for baseline it is over emitting&non emiting tokens. (please check it use --verbose 6 and the number of all tokens is shown in line 895, but not in line 693, which only contains emitting) Hence we should use 10% less max_active config.
2. the baseline use kaldi version HashList, while the previous code use std::unordered_map. please still use HashList or its variants. The reason is because HashList is allocating memory in Pool/Block fashion, hence it is 10% faster than std::unordered_map. To show this problem, I temporally do a hack in line 247, plese check it

After these two improvements, the current code is with similar speed as the baseline. HOWEVER, please check the quality of 1-best & lattice, I have discuss about my worries here kaldi-asr#3061 (review)
// bin/latgen-faster-mapped.cc

// Copyright 2009-2012 Microsoft Corporation, Karel Vesely
// 2013 Johns Hopkins University (author: Daniel Povey)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is OK for testing but eventually this should just be a change to lattice-faster-decoder.cc. It's just a small optimization.

@danpovey
Copy link
Contributor

danpovey commented Feb 28, 2019 via email

@LvHang
Copy link
Contributor Author

LvHang commented Feb 28, 2019

Dan:

In my design, the recover_ and recover_map_ will be used when we call GetRawLattice() in the middle of an utterance decoding.
After processing the frame t, we can achive the complete token list of frame t and the emittion part of frame t+1. So we will call ProcessNonemitting() once in GetRawLattice() to generate the complete token list for frame t+1. As GetRawLattice() will use "vector active_toks_" to build the raw lattice, so I update "active_toks_" directly in ProcessNonemitting().
But in PrcessForFrame(), the start point should be an emittion part token list. I have to get rid of the non-emittion tokens from "active_toks_" and recover their tot_costs(alpha) so that the behavior is consistent with normal case.

@LvHang
Copy link
Contributor Author

LvHang commented Mar 1, 2019

Dan, Zhehuai:

I update the functions and comments.

Now ProcessNonemitting() will take an argument "std::unordered_map<Token*, BaseFloat> recover_map". In normal case, it will be NULL. But when we call GetRawLattice() during decoding, we will pass a local variable "recover_map" of GetRawLattice() and then recover the "active_toks_" in the end of GetRawLattice(). So it will not introduce the linkages.

Copy link
Contributor

@chenzhehuai chenzhehuai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I review the coding style and code comments inline. Please test speeds and precision once you finish them.

src/decoder/lattice-faster-decoder-combine.h Show resolved Hide resolved
src/decoder/lattice-faster-decoder-combine.h Outdated Show resolved Hide resolved
src/decoder/lattice-faster-decoder-combine.h Outdated Show resolved Hide resolved
src/decoder/lattice-faster-decoder-combine.h Outdated Show resolved Hide resolved
src/decoder/lattice-faster-decoder-combine.cc Show resolved Hide resolved
src/decoder/lattice-faster-decoder-combine.cc Outdated Show resolved Hide resolved
src/decoder/lattice-faster-decoder-combine.cc Outdated Show resolved Hide resolved
src/decoder/lattice-faster-decoder-combine.cc Outdated Show resolved Hide resolved
src/decoder/lattice-faster-decoder-combine.cc Outdated Show resolved Hide resolved
src/decoder/lattice-faster-decoder-combine.cc Outdated Show resolved Hide resolved
@danpovey
Copy link
Contributor

danpovey commented Mar 1, 2019 via email

@LvHang
Copy link
Contributor Author

LvHang commented Mar 2, 2019

According to the suggestions of Dan and Zhehuai, update the comments and code.

@danpovey
Copy link
Contributor

danpovey commented Mar 2, 2019 via email

@LvHang
Copy link
Contributor Author

LvHang commented Mar 3, 2019

I uploaded a simple testing script. It calls "nnet3-compute" to generate posterior, and then use "latgen-faster-mapped" or "latgen-faster-mapped-combine" to decode.
Anyway, I'm testing it.

Copy link
Contributor

@danpovey danpovey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Various requested changes.

src/decoder/lattice-faster-decoder-combine-bucketqueue.h Outdated Show resolved Hide resolved
src/decoder/lattice-faster-decoder-combine-bucketqueue.h Outdated Show resolved Hide resolved
src/decoder/lattice-faster-decoder-combine-bucketqueue.h Outdated Show resolved Hide resolved
src/decoder/lattice-faster-decoder-combine-bucketqueue.cc Outdated Show resolved Hide resolved
src/decoder/lattice-faster-decoder-combine-bucketqueue.cc Outdated Show resolved Hide resolved
src/decoder/lattice-faster-decoder-combine-bucketqueue.cc Outdated Show resolved Hide resolved
src/decoder/lattice-faster-decoder-combine-bucketqueue.cc Outdated Show resolved Hide resolved
src/decoder/lattice-faster-decoder-combine-bucketqueue.cc Outdated Show resolved Hide resolved
src/decoder/lattice-faster-decoder-combine-bucketqueue.cc Outdated Show resolved Hide resolved
@danpovey
Copy link
Contributor

danpovey commented Mar 26, 2019 via email

@danpovey
Copy link
Contributor

danpovey commented Mar 27, 2019 via email

@LvHang
Copy link
Contributor Author

LvHang commented Mar 27, 2019

Hi Dan,

I understand that you want to use the new "bucket_index" formula instead of the old "bucket_index" and "vec_index".

Let me explain what makes me confused. Firstly, we have the definition of "first_nonempty_bucket_index_".
/// first_nonempty_bucket_index_ is an integer in the range [0, buckets_.size() -1]
/// which is not larger than the index of the ifrst nonempty element of buckets_.
int32 first_nonempty_bucket_index_;

So the key problem is when and how to update "first_nonempty_bucket_index_".
Assume the Push() function adds the token into the BucketQueue one by one and sets "first_nonempty_bucket_index_" to the first non-empty bucket.
From your previous comments, I guess, after once successful Pop operation (ans->in_queue), you just leave the "first_nonempty_bucket_index_" there. And then, until the bucket is empty (buckets_[first_nonempty_bucket_index_].empty()), you will update it.

(1) But what should we do if the Pop operation is not successed? We need a while loop to repeat the try until the BucketQueue is empty.
(2) How can we know the BucketQueue is empty? I think the most convenient way is "first_nonempty_bucket_index_ == buckets_.size()". But it is conflict with your definition. So I suggest use
the following definition.

/// first_nonempty_bucket_index_ is an integer whose valid range is [0, buckets_.size() -1]
/// which is not larger than the index of the ifrst nonempty element of buckets_.
/// If "first_nonempty_bucket_index_" corresponds to a value past the end of buckets_
/// (i.e. first_nonempty_bucket_index_ == buckets_.size(), we interpret it as "there are no 
/// buckets with entries."
int32 first_nonempty_bucket_index_;

And the code will be (please see the comments to know what I'm concerned.)

  93 template<typename Token>                                                                              
  94 Token* BucketQueue<Token>::Pop() {                                                                    
  95   while (true) {                                                                                      
  96     if (!bucket_[first_nonempty_bucket_index_].empty()) {                                             
  97       Token *ans = buckets_[first_nonempty_bucket_index_].back();                                     
  98       buckets_[first_nonempty_bucket_index_].pop_back();                                              
  99       if (ans->in_queue) {                                                                            
 100         ans->in_queue = false;                                                                        
 101         return ans;                                                                                   
 102       }                                                                                               
 103     }                                                                                                 
 104     // There are two cases can trigger the following code to update the                               
 105     // "first_nonempty_bucket_index_". One is last token is invalid (i.e.                             
 106     // ans->in_queue == false). The other is bucket is empty.                                         
 107     if (buckets_[first_nonempty_bucket_index_].empty()) {                                             
 108       first_nonempty_bucket_index_++;                                                                 
 109       for (; first_nonempty_bucket_index_ < buckets_.size();                                          
 110            first_nonempty_bucket_index_++) {                                                          
 111         if (!buckets_[first_nonempty_bucket_index_].empty()) break;                                   
 112       }                                                                                               
 113     }                                                                                                 
 114     // There are two cases will trigger the following code. One is                                    
 115     // first_nonempty_bucket_index_ == buckets_.size() [ But it is conflict with                      
 116     // the definition]. The other is we find an appropriate new                                       
 117     // "first_nonempty_bucket_index_".                                          
 118     if (first_nonempty_bucket_index_ == buckets_.size()) return NULL;                                 
 119   }                                                                                                   
 120 }                                                         

@danpovey
Copy link
Contributor

danpovey commented Mar 27, 2019 via email

Copy link
Contributor

@danpovey danpovey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some small comments.

This is getting closer to being ready to commit. One more big thing that needs to be done is make a clean version of this where this code replaces the code in lattice-faster-decoder.{h,cc}. There is no reason to keep both versions around.

src/decoder/lattice-faster-decoder-combine-bucketqueue.cc Outdated Show resolved Hide resolved
src/decoder/lattice-faster-decoder-combine-bucketqueue.cc Outdated Show resolved Hide resolved
src/decoder/lattice-faster-decoder-combine-bucketqueue.cc Outdated Show resolved Hide resolved
src/decoder/lattice-faster-decoder-combine-bucketqueue.cc Outdated Show resolved Hide resolved
src/decoder/lattice-faster-decoder-combine-bucketqueue.cc Outdated Show resolved Hide resolved
src/decoder/lattice-faster-decoder-combine-bucketqueue.cc Outdated Show resolved Hide resolved
src/decoder/lattice-faster-decoder-combine-bucketqueue.cc Outdated Show resolved Hide resolved
src/decoder/lattice-faster-decoder-combine-bucketqueue.cc Outdated Show resolved Hide resolved
@stale
Copy link

stale bot commented Jun 19, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale Stale bot on the loose label Jun 19, 2020
@stale
Copy link

stale bot commented Jul 19, 2020

This issue has been automatically closed by a bot strictly because of inactivity. This does not mean that we think that this issue is not important! If you believe it has been closed hastily, add a comment to the issue and mention @kkm000, and I'll gladly reopen it.

@stale stale bot closed this Jul 19, 2020
@kkm000 kkm000 reopened this Jul 19, 2020
@stale stale bot removed the stale Stale bot on the loose label Jul 19, 2020
@stale
Copy link

stale bot commented Sep 17, 2020

This issue has been automatically marked as stale by a bot solely because it has not had recent activity. Please add any comment (simply 'ping' is enough) to prevent the issue from being closed for 60 more days if you believe it should be kept open.

@stale stale bot added the stale Stale bot on the loose label Sep 17, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stale Stale bot on the loose
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants