
Multi-lookahead cache-aware streaming models #6711

Merged · 50 commits · Jun 8, 2023
Commits
3c8d009
added methods.
VahidooX Dec 1, 2022
dad6ac5
added methods.
VahidooX Dec 2, 2022
7db70ae
added initial code.
VahidooX Dec 6, 2022
dbd4086
added initial code.
VahidooX Dec 6, 2022
859ae60
added initial code.
VahidooX Dec 6, 2022
b10c6d9
added config files.
VahidooX Dec 6, 2022
01dbc72
fixed bugs.
VahidooX Dec 6, 2022
d72d61c
updated confs.
VahidooX Dec 6, 2022
7170814
updated confs.
VahidooX Dec 6, 2022
d861130
updated confs.
VahidooX Dec 6, 2022
b0811bc
updated confs.
VahidooX Dec 7, 2022
71a9b70
pulled from main.
VahidooX Dec 7, 2022
3205d8d
improved f.conv1d
VahidooX Dec 7, 2022
33f4c6a
pulled from main.
VahidooX Jan 24, 2023
12f4330
pulled from main.
VahidooX Jan 25, 2023
68067e6
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 25, 2023
19b2396
Merge branch 'main' of https://github.com/NVIDIA/NeMo into adaptive_s…
VahidooX Jan 25, 2023
d75302a
pulled from main.
VahidooX Jan 25, 2023
916ebe4
Merge remote-tracking branch 'origin/adaptive_streaming2' into adapti…
VahidooX Jan 25, 2023
9ed9213
pulled from main.
VahidooX Mar 8, 2023
8d4410d
added postpostnorm.
VahidooX Mar 18, 2023
1a29d31
fixed the target continuous bug.
VahidooX Mar 18, 2023
f5ebe67
added dw_striding causal.
VahidooX Mar 20, 2023
c843487
added print for debugging.
VahidooX Mar 23, 2023
36b3d65
added print for debugging.
VahidooX Mar 23, 2023
bd949e7
fixed causal convolutions.
VahidooX Mar 28, 2023
bdfba5f
added _midnorm.
VahidooX Mar 30, 2023
e7fdbe1
fixed transcribe.
VahidooX Apr 1, 2023
a973821
pulled from main.
VahidooX May 23, 2023
22413b5
cleaned code.
VahidooX May 24, 2023
39e5cf2
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 24, 2023
3c766a2
moved back configs.
VahidooX May 24, 2023
f1001c4
Merge remote-tracking branch 'origin/adaptive_streaming_main' into ad…
VahidooX May 24, 2023
1ea787b
moved back configs.
VahidooX May 24, 2023
1f46567
updated fast emit for FC models.
VahidooX May 24, 2023
3f17198
updated fast emit for FC models.
VahidooX May 24, 2023
88cad23
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 24, 2023
132b8a4
Merge branch 'main' of https://github.com/NVIDIA/NeMo into adaptive_s…
VahidooX Jun 5, 2023
cbf3dd8
Merge remote-tracking branch 'origin/adaptive_streaming_main' into ad…
VahidooX Jun 5, 2023
29bda6a
fixed bug.
VahidooX Jun 5, 2023
5ac4d67
fixed bug and addressed comments.
VahidooX Jun 7, 2023
f2fd690
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jun 7, 2023
3a9ce49
fixed configs.
VahidooX Jun 7, 2023
50a4b59
Merge branch 'main' of https://github.com/NVIDIA/NeMo into adaptive_s…
VahidooX Jun 7, 2023
f95bd74
Merge remote-tracking branch 'origin/adaptive_streaming_main' into ad…
VahidooX Jun 7, 2023
c9fc699
fixed configs.
VahidooX Jun 7, 2023
828fbb9
Merge branch 'main' of https://github.com/NVIDIA/NeMo into adaptive_s…
VahidooX Jun 7, 2023
fbb19dc
Merge branch 'main' of https://github.com/NVIDIA/NeMo into adaptive_s…
VahidooX Jun 7, 2023
c997c84
dropped the test.
VahidooX Jun 8, 2023
31e919c
Merge branch 'main' of https://github.com/NVIDIA/NeMo into adaptive_s…
VahidooX Jun 8, 2023
@@ -103,10 +103,16 @@ model:
# [left, right] specifies the number of steps to be seen from left and right of each step in self-attention
# for att_context_style=regular, the right context is recommended to be a small number around 0 to 3, as stacking multiple layers can make the effective right context too large
# for att_context_style=chunked_limited, the left context needs to be divisible by the right context plus one
# for chunked_limited, you may calculate the look-ahead (in seconds) from the right context with the following formula:
# look-ahead(secs) = att_context_size[1]*subsampling_factor*window_stride, example: 27*4*0.01=1.08s

# For multi-lookahead models, you may specify a list of context sizes. During training, one of the context sizes is selected randomly per step, with the distribution specified by att_context_probs.
# The first item in the list is the default during test/validation/inference.
# An example of multi-lookahead settings:
# att_context_size: [[140,27],[140,13],[140,2],[140,0]]
# att_context_probs: [0.25, 0.25, 0.25, 0.25]
att_context_size: [140, 27] # -1 means unlimited context
att_context_style: chunked_limited # regular or chunked_limited
att_context_probs: null

xscaling: true # scales up the input embeddings by sqrt(d_model)
untie_biases: true # unties the biases of the TransformerXL layers
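To make the look-ahead formula and the chunked_limited divisibility rule in the comments above concrete, here is a small illustrative Python sketch (not part of the PR; the helper names are made up for illustration):

def lookahead_seconds(right_context: int, subsampling_factor: int, window_stride: float) -> float:
    # look-ahead(secs) = att_context_size[1] * subsampling_factor * window_stride
    return right_context * subsampling_factor * window_stride

def check_chunked_limited(left_context: int, right_context: int) -> None:
    # for att_context_style=chunked_limited, the left context must be divisible by (right context + 1)
    if left_context % (right_context + 1) != 0:
        raise ValueError(f"left context {left_context} is not divisible by {right_context + 1}")

# Conformer-style config: 4x subsampling, 10 ms window stride, att_context_size [140, 27]
check_chunked_limited(140, 27)
print(lookahead_seconds(27, 4, 0.01))   # 1.08 s

# FastConformer-style config: 8x subsampling, 10 ms window stride, att_context_size [70, 13]
check_chunked_limited(70, 13)
print(lookahead_seconds(13, 8, 0.01))   # ~1.04 s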
@@ -113,10 +113,16 @@ model:
# [left, right] specifies the number of steps to be seen from left and right of each step in self-attention
# for att_context_style=regular, the right context is recommended to be a small number around 0 to 3, as stacking multiple layers can make the effective right context too large
# for att_context_style=chunked_limited, the left context needs to be divisible by the right context plus one
# for chunked_limited, you may calculate the look-ahead (in seconds) from the right context with the following formula:
# look-ahead(secs) = att_context_size[1]*subsampling_factor*window_stride, example: 27*4*0.01=1.08s
# look-ahead(secs) = att_context_size[1]*subsampling_factor*window_stride, example: 13*8*0.01=1.04s

# For multi-lookahead models, you may specify a list of context sizes. During training, one of the context sizes is selected randomly per step, with the distribution specified by att_context_probs.
# The first item in the list is the default during test/validation/inference.
# An example of multi-lookahead settings:
# att_context_size: [[140,27],[140,13],[140,2],[140,0]]
# att_context_probs: [0.25, 0.25, 0.25, 0.25]
att_context_size: [140, 27] # -1 means unlimited context
att_context_style: chunked_limited # regular or chunked_limited
att_context_probs: null

xscaling: true # scales up the input embeddings by sqrt(d_model)
untie_biases: true # unties the biases of the TransformerXL layers
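The multi-lookahead comments above describe selecting one of the listed context sizes at random for each training step. A minimal Python sketch of that idea, assuming att_context_probs: null falls back to a uniform distribution (illustrative only, not the PR's actual implementation):

import random

att_context_size = [[140, 27], [140, 13], [140, 2], [140, 0]]
att_context_probs = [0.25, 0.25, 0.25, 0.25]  # one probability per context size

def sample_training_context(sizes, probs=None):
    # With no probabilities given, fall back to a uniform distribution (assumption).
    if probs is None:
        probs = [1.0 / len(sizes)] * len(sizes)
    return random.choices(sizes, weights=probs, k=1)[0]

def default_inference_context(sizes):
    # The first item in the list is the default at test/validation/inference time.
    return sizes[0]

print(sample_training_context(att_context_size, att_context_probs))  # e.g. [140, 2]
print(default_inference_context(att_context_size))                   # [140, 27]

Training with several lookaheads this way lets a single model serve multiple latency operating points at inference time.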
@@ -97,10 +97,17 @@ model:
# [left, right] specifies the number of steps to be seen from left and right of each step in self-attention
# for att_context_style=regular, the right context is recommended to be a small number around 0 to 3, as stacking multiple layers can make the effective right context too large
# for att_context_style=chunked_limited, the left context needs to be divisible by the right context plus one
# for chunked_limited, you may calculate the look-ahead (in seconds) from the right context with the following formula:
# look-ahead(secs) = att_context_size[1]*subsampling_factor*window_stride, example: 13*8*0.01=1.04s

# For multi-lookahead models, you may specify a list of context sizes. During training, one of the context sizes is selected randomly per step, with the distribution specified by att_context_probs.
# The first item in the list is the default during test/validation/inference.
# An example of multi-lookahead settings:
# att_context_size: [[70,13],[70,6],[70,1],[70,0]]
# att_context_probs: [0.25, 0.25, 0.25, 0.25]
att_context_size: [70, 13] # -1 means unlimited context
att_context_style: chunked_limited # regular or chunked_limited
att_context_probs: null


xscaling: true # scales up the input embeddings by sqrt(d_model)
pos_emb_max_len: 5000
@@ -100,11 +100,19 @@ model:
n_heads: 8 # may need to be lower for smaller d_models

# [left, right] specifies the number of steps to be seen from left and right of each step in self-attention
# for att_context_style=regular, the right context is recommended to be a small number around 0 to 2 as multiple-layers may increase the effective right context too large
# for att_context_style=regular, the right context is recommended to be a small number around 0 to 3, as stacking multiple layers can make the effective right context too large
# for att_context_style=chunked_limited, the left context needs to be divisible by the right context plus one
# look-ahead(secs) = att_context_size[1]*subsampling_factor*window_stride, example: 13*8*0.01=1.04s

# For multi-lookahead models, you may specify a list of context sizes. During training, one of the context sizes is selected randomly per step, with the distribution specified by att_context_probs.
# The first item in the list is the default during test/validation/inference.
# An example of multi-lookahead settings:
# att_context_size: [[70,13],[70,6],[70,1],[70,0]]
# att_context_probs: [0.25, 0.25, 0.25, 0.25]
att_context_size: [70, 13] # -1 means unlimited context
att_context_style: chunked_limited # regular or chunked_limited
att_context_probs: null


xscaling: true # scales up the input embeddings by sqrt(d_model)
pos_emb_max_len: 5000
@@ -102,10 +102,17 @@ model:
# [left, right] specifies the number of steps to be seen from left and right of each step in self-attention
# for att_context_style=regular, the right context is recommended to be a small number around 0 to 3, as stacking multiple layers can make the effective right context too large
# for att_context_style=chunked_limited, the left context needs to be divisible by the right context plus one
# for chunked_limited, you may calculate the look-ahead (in seconds) from the right context with the following formula:
# look-ahead(secs) = att_context_size[1]*subsampling_factor*window_stride, example: 13*8*0.01=1.04s

# For multi-lookahead models, you may specify a list of context sizes. During training, one of the context sizes is selected randomly per step, with the distribution specified by att_context_probs.
# The first item in the list is the default during test/validation/inference.
# An example of multi-lookahead settings:
# att_context_size: [[70,13],[70,6],[70,1],[70,0]]
# att_context_probs: [0.25, 0.25, 0.25, 0.25]
att_context_size: [70, 13] # -1 means unlimited context
att_context_style: chunked_limited # regular or chunked_limited
att_context_probs: null


xscaling: true # scales up the input embeddings by sqrt(d_model)
pos_emb_max_len: 5000
@@ -191,9 +198,9 @@ model:
loss_name: "default"
warprnnt_numba_kwargs:
# FastEmit regularization: https://arxiv.org/abs/2010.11148
# You may enable FastEmit to reduce the latency of the model for streaming
# It also helps to improve the accuracy of the model in streaming mode
fastemit_lambda: 1e-3 # Recommended values to be in range [1e-4, 1e-2], 0.001 is a good start.
# You may enable FastEmit to increase the accuracy and reduce the latency of the model for streaming
# You may set it to lower values like 1e-3 for models with larger right context
fastemit_lambda: 5e-3 # Recommended values are in the range [1e-4, 1e-2]; 0.001 is a good start.
clamp: -1.0 # if > 0, applies gradient clamping in range [-clamp, clamp] for the joint tensor only.

optim:
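The FastEmit settings above can also be tuned without editing the YAML, for example with OmegaConf. This is a hedged sketch: the file name is hypothetical and the key path assumes the loss block shown above sits under model.loss:

from omegaconf import OmegaConf

# Hypothetical config file name; substitute the streaming config you are training with.
cfg = OmegaConf.load("fastconformer_transducer_char_streaming.yaml")

# FastEmit regularization (https://arxiv.org/abs/2010.11148): recommended range [1e-4, 1e-2].
cfg.model.loss.warprnnt_numba_kwargs.fastemit_lambda = 5e-3
# Models with a larger right context may use a smaller value such as 1e-3.
# cfg.model.loss.warprnnt_numba_kwargs.fastemit_lambda = 1e-3

print(OmegaConf.to_yaml(cfg.model.loss))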
@@ -106,11 +106,19 @@ model:
n_heads: 8 # may need to be lower for smaller d_models

# [left, right] specifies the number of steps to be seen from left and right of each step in self-attention
# for att_context_style=regular, the right context is recommended to be a small number around 0 to 2 as multiple-layers may increase the effective right context too large
# for att_context_style=regular, the right context is recommended to be a small number around 0 to 3, as stacking multiple layers can make the effective right context too large
# for att_context_style=chunked_limited, the left context needs to be divisible by the right context plus one
# look-ahead(secs) = att_context_size[1]*subsampling_factor*window_stride, example: 13*8*0.01=1.04s

# For multi-lookahead models, you may specify a list of context sizes. During training, one of the context sizes is selected randomly per step, with the distribution specified by att_context_probs.
# The first item in the list is the default during test/validation/inference.
# An example of multi-lookahead settings:
# att_context_size: [[70,13],[70,6],[70,1],[70,0]]
# att_context_probs: [0.25, 0.25, 0.25, 0.25]
att_context_size: [70, 13] # -1 means unlimited context
att_context_style: chunked_limited # regular or chunked_limited
att_context_probs: null


xscaling: true # scales up the input embeddings by sqrt(d_model)
pos_emb_max_len: 5000
@@ -196,9 +204,9 @@ model:
loss_name: "default"
warprnnt_numba_kwargs:
# FastEmit regularization: https://arxiv.org/abs/2010.11148
# You may enable FastEmit to reduce the latency of the model for streaming
# It also helps to improve the accuracy of the model in streaming mode
fastemit_lambda: 1e-3 # Recommended values to be in range [1e-4, 1e-2], 0.001 is a good start.
# You may enable FastEmit to increase the accuracy and reduce the latency of the model for streaming
# You may set it to lower values like 1e-3 for models with larger right context
fastemit_lambda: 5e-3 # Recommended values are in the range [1e-4, 1e-2]; 0.001 is a good start.
clamp: -1.0 # if > 0, applies gradient clamping in range [-clamp, clamp] for the joint tensor only.

optim:
@@ -8,6 +8,8 @@
# FastConformer-CTC's architecture config: NeMo/examples/asr/conf/fastconformer/fast-conformer_ctc_bpe.yaml
# FastConformer-Transducer's architecture config, along with the optimal batch size and precision: NeMo/examples/asr/conf/fastconformer/fast-conformer_transducer_bpe.yaml

# Note: if the training loss does not converge, you may increase the warm-up to 20K steps.

name: "FastConformer-Hybrid-Transducer-CTC-BPE-Streaming"

model:
@@ -106,8 +108,15 @@ model:
# for att_context_style=regular, the right context is recommended to be a small number around 0 to 3, as stacking multiple layers can make the effective right context too large
# for att_context_style=chunked_limited, the left context needs to be divisible by the right context plus one
# look-ahead(secs) = att_context_size[1]*subsampling_factor*window_stride, example: 13*8*0.01=1.04s

# For multi-lookahead models, you may specify a list of context sizes. During training, one of the context sizes is selected randomly per step, with the distribution specified by att_context_probs.
# The first item in the list is the default during test/validation/inference.
# An example of multi-lookahead settings:
# att_context_size: [[70,13],[70,6],[70,1],[70,0]]
# att_context_probs: [0.25, 0.25, 0.25, 0.25]
att_context_size: [70, 13] # -1 means unlimited context
att_context_style: chunked_limited # regular or chunked_limited
att_context_probs: null

xscaling: true # scales up the input embeddings by sqrt(d_model)
pos_emb_max_len: 5000
@@ -206,9 +215,9 @@ model:
loss_name: "default"
warprnnt_numba_kwargs:
# FastEmit regularization: https://arxiv.org/abs/2010.11148
# You may enable FastEmit to reduce the latency of the model for streaming
# It also helps to improve the accuracy of the model in streaming mode
fastemit_lambda: 1e-3 # Recommended values to be in range [1e-4, 1e-2], 0.001 is a good start.
# You may enable FastEmit to increase the accuracy and reduce the latency of the model for streaming
# You may set it to lower values like 1e-3 for models with larger right context
fastemit_lambda: 5e-3 # Recommended values are in the range [1e-4, 1e-2]; 0.001 is a good start.
clamp: -1.0 # if > 0, applies gradient clamping in range [-clamp, clamp] for the joint tensor only.

optim:
@@ -8,6 +8,8 @@
# FastConformer-CTC's architecture config: NeMo/examples/asr/conf/fastconformer/fast-conformer_ctc_bpe.yaml
# FastConformer-Transducer's architecture config, along with the optimal batch size and precision: NeMo/examples/asr/conf/fastconformer/fast-conformer_transducer_bpe.yaml

# Note: if the training loss does not converge, you may increase the warm-up to 20K steps.

name: "FastConformer-Hybrid-Transducer-CTC-Char-Streaming"

model:
@@ -111,8 +113,15 @@ model:
# for att_context_style=regular, the right context is recommended to be a small number around 0 to 3, as stacking multiple layers can make the effective right context too large
# for att_context_style=chunked_limited, the left context needs to be divisible by the right context plus one
# look-ahead(secs) = att_context_size[1]*subsampling_factor*window_stride, example: 13*8*0.01=1.04s

# For multi-lookahead models, you may specify a list of context sizes. During training, one of the context sizes is selected randomly per step, with the distribution specified by att_context_probs.
# The first item in the list is the default during test/validation/inference.
# An example of multi-lookahead settings:
# att_context_size: [[70,13],[70,6],[70,1],[70,0]]
# att_context_probs: [0.25, 0.25, 0.25, 0.25]
att_context_size: [70, 13] # -1 means unlimited context
att_context_style: chunked_limited # regular or chunked_limited
att_context_probs: null

xscaling: true # scales up the input embeddings by sqrt(d_model)
pos_emb_max_len: 5000
@@ -211,9 +220,9 @@ model:
loss_name: "default"
warprnnt_numba_kwargs:
# FastEmit regularization: https://arxiv.org/abs/2010.11148
# You may enable FastEmit to reduce the latency of the model for streaming
# It also helps to improve the accuracy of the model in streaming mode
fastemit_lambda: 1e-3 # Recommended values to be in range [1e-4, 1e-2], 0.001 is a good start.
# You may enable FastEmit to increase the accuracy and reduce the latency of the model for streaming
# You may set it to lower values like 1e-3 for models with larger right context
fastemit_lambda: 5e-3 # Recommended values are in the range [1e-4, 1e-2]; 0.001 is a good start.
clamp: -1.0 # if > 0, applies gradient clamping in range [-clamp, clamp] for the joint tensor only.

optim:
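As a follow-up to the warm-up note at the top of the hybrid configs, a hedged OmegaConf sketch of raising the warm-up to 20K steps; the file name is hypothetical and the model.optim.sched.warmup_steps key path is an assumption, since the optim section is not shown in this diff:

from omegaconf import OmegaConf

# Hypothetical config file name; the optim/sched key path is an assumption.
cfg = OmegaConf.load("fastconformer_hybrid_transducer_ctc_bpe_streaming.yaml")
cfg.model.optim.sched.warmup_steps = 20000  # increase warm-up if the training loss does not converge
print(cfg.model.optim.sched.warmup_steps)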