🤘 Support Multi-GPU gradient accumulation for trainer #377

Merged: 4 commits, Nov 19, 2020
20 changes: 11 additions & 9 deletions README.md
@@ -19,16 +19,17 @@
:zany_face: TensorFlowTTS provides real-time state-of-the-art speech synthesis architectures such as Tacotron-2, Melgan, Multiband-Melgan, FastSpeech, and FastSpeech2, based on TensorFlow 2. With TensorFlow 2, we can speed up training and inference, optimize further using [fake-quantize aware training](https://www.tensorflow.org/model_optimization/guide/quantization/training_comprehensive_guide) and [pruning](https://www.tensorflow.org/model_optimization/guide/pruning/pruning_with_keras), and make TTS models run faster than real time so they can be deployed on mobile devices or embedded systems.

## What's new
- 2020/08/23 **(NEW!)** Add Parallel WaveGAN tensorflow implementation. See [here](https://github.com/TensorSpeech/TensorFlowTTS/tree/master/examples/parallel_wavegan)
- 2020/08/23 **(NEW!)** Add MBMelGAN G + ParallelWaveGAN G example. See [here](https://github.com/TensorSpeech/TensorFlowTTS/tree/master/examples/multiband_pwgan)
- 2020/08/20 **(NEW!)** Add C++ inference code. Thank [@ZDisket](https://github.com/ZDisket). See [here](https://github.com/TensorSpeech/TensorFlowTTS/tree/master/examples/cppwin)
- 2020/08/18 **(NEW!)** Update [new base processor](https://github.com/TensorSpeech/TensorFlowTTS/blob/master/tensorflow_tts/processor/base_processor.py). Add [AutoProcessor](https://github.com/TensorSpeech/TensorFlowTTS/blob/master/tensorflow_tts/inference/auto_processor.py) and [pretrained processor](https://github.com/TensorSpeech/TensorFlowTTS/blob/master/tensorflow_tts/processor/pretrained/) json file.
- 2020/08/14 **(NEW!)** Support Chinese TTS. Pls see the [colab](https://colab.research.google.com/drive/1YpSHRBRPBI7cnTkQn1UcVTWEQVbsUm1S?usp=sharing). Thank [@azraelkuan](https://github.com/azraelkuan).
- 2020/08/05 **(NEW!)** Support Korean TTS. Pls see the [colab](https://colab.research.google.com/drive/1ybWwOS5tipgPFttNulp77P6DAB5MtiuN?usp=sharing). Thank [@crux153](https://github.com/crux153).
- 2020/07/17 Support MultiGPU for all Trainer.
- 2020/07/05 Support Convert Tacotron-2, FastSpeech to Tflite. Pls see the [colab](https://colab.research.google.com/drive/1HudLLpT9CQdh2k04c06bHUwLubhGTWxA?usp=sharing). Thank @jaeyoo from the TFlite team for his support.
- 2020/11/19 **(NEW!)** Add Multi-GPU gradient accumulator. See [here](https://github.com/TensorSpeech/TensorFlowTTS/pull/377)
- 2020/08/23 Add Parallel WaveGAN tensorflow implementation. See [here](https://github.com/TensorSpeech/TensorFlowTTS/tree/master/examples/parallel_wavegan)
- 2020/08/23 Add MBMelGAN G + ParallelWaveGAN G example. See [here](https://github.com/TensorSpeech/TensorFlowTTS/tree/master/examples/multiband_pwgan)
- 2020/08/20 Add C++ inference code. Thank [@ZDisket](https://github.com/ZDisket). See [here](https://github.com/TensorSpeech/TensorFlowTTS/tree/master/examples/cppwin)
- 2020/08/18 Update [new base processor](https://github.com/TensorSpeech/TensorFlowTTS/blob/master/tensorflow_tts/processor/base_processor.py). Add [AutoProcessor](https://github.com/TensorSpeech/TensorFlowTTS/blob/master/tensorflow_tts/inference/auto_processor.py) and [pretrained processor](https://github.com/TensorSpeech/TensorFlowTTS/blob/master/tensorflow_tts/processor/pretrained/) json file
- 2020/08/14 Support Chinese TTS. Pls see the [colab](https://colab.research.google.com/drive/1YpSHRBRPBI7cnTkQn1UcVTWEQVbsUm1S?usp=sharing). Thank [@azraelkuan](https://github.com/azraelkuan)
- 2020/08/05 Support Korean TTS. Pls see the [colab](https://colab.research.google.com/drive/1ybWwOS5tipgPFttNulp77P6DAB5MtiuN?usp=sharing). Thank [@crux153](https://github.com/crux153)
- 2020/07/17 Support MultiGPU for all Trainer
- 2020/07/05 Support Convert Tacotron-2, FastSpeech to Tflite. Pls see the [colab](https://colab.research.google.com/drive/1HudLLpT9CQdh2k04c06bHUwLubhGTWxA?usp=sharing). Thank @jaeyoo from the TFlite team for his support
- 2020/06/20 [FastSpeech2](https://arxiv.org/abs/2006.04558) implementation with Tensorflow is supported.
- 2020/06/07 [Multi-band MelGAN (MB MelGAN)](https://github.com/tensorspeech/TensorFlowTTS/blob/master/examples/multiband_melgan/) implementation with Tensorflow is supported.
- 2020/06/07 [Multi-band MelGAN (MB MelGAN)](https://github.com/tensorspeech/TensorFlowTTS/blob/master/examples/multiband_melgan/) implementation with Tensorflow is supported


## Features
@@ -38,6 +39,7 @@
- Suitable for deployment.
- Easy to implement a new model, based on an abstract class.
- Mixed precision to speed up training if possible.
- Support Single/Multi-GPU gradient accumulation.
- Support both Single/Multi GPU in base trainer class.
- TFlite conversion for all supported models.
- Android example.
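The new feature listed above, single/multi-GPU gradient accumulation, sums gradients over several micro-batches and applies a single optimizer update, which simulates a larger batch than fits on the GPUs at once. The sketch below shows the bare idea in TensorFlow 2; it is not the repository's Seq2SeqBasedTrainer, and the toy model, loss, and data are placeholders.

```python
import tensorflow as tf

# Toy model and data; everything named here is illustrative, not the repo's trainer.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(8,))])
optimizer = tf.keras.optimizers.Adam(1e-3)
accumulation_steps = 4  # plays the role of the new gradient_accumulation_steps key

accumulated = [tf.zeros_like(v) for v in model.trainable_variables]

def accumulate_and_apply(micro_batches):
    """Sum gradients over several micro-batches, then apply one optimizer update."""
    global accumulated
    for x, y in micro_batches:
        with tf.GradientTape() as tape:
            # Scale the loss so the summed gradient matches one large batch.
            loss = tf.reduce_mean(tf.square(model(x, training=True) - y)) / accumulation_steps
        grads = tape.gradient(loss, model.trainable_variables)
        accumulated = [a + g for a, g in zip(accumulated, grads)]
    optimizer.apply_gradients(zip(accumulated, model.trainable_variables))
    accumulated = [tf.zeros_like(v) for v in model.trainable_variables]

# One "global" step made of 4 micro-batches of 16 examples each.
micro = [(tf.random.normal([16, 8]), tf.random.normal([16, 1]))
         for _ in range(accumulation_steps)]
accumulate_and_apply(micro)
```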
5 changes: 3 additions & 2 deletions examples/fastspeech/conf/fastspeech.v1.yaml
@@ -46,7 +46,7 @@ fastspeech_params:
###########################################################
# DATA LOADER SETTING #
###########################################################
batch_size: 16 # Batch size.
batch_size: 16 # Batch size for each GPU, assuming gradient_accumulation_steps == 1.
remove_short_samples: true # Whether to remove samples whose length is less than batch_max_steps.
allow_cache: true # Whether to allow cache in dataset. If true, it requires cpu memory.
mel_length_threshold: 32 # remove all targets with mel_length <= 32
@@ -60,7 +60,8 @@ optimizer_params:
decay_steps: 150000 # < train_max_steps is recommended.
warmup_proportion: 0.02
weight_decay: 0.001


gradient_accumulation_steps: 1
var_train_expr: null # trainable variable expr (e.g. 'embeddings|encoder|decoder'),
# separated by |. If var_train_expr is null, all variables are trained.
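In practice the new gradient_accumulation_steps key changes how many examples feed one optimizer update: the training scripts later in this diff create the dataset with batch_size * num_replicas * gradient_accumulation_steps. A tiny sketch of that arithmetic, with the GPU count as an assumed example value:

```python
# Values are illustrative; only batch_size and gradient_accumulation_steps come from the yaml.
batch_size = 16                   # per-GPU batch size (the key above)
num_gpus = 2                      # assumed GPU count; the scripts read it from the tf.distribute strategy
gradient_accumulation_steps = 1   # the new key; 1 reproduces the old behaviour

examples_per_update = batch_size * num_gpus * gradient_accumulation_steps
print(examples_per_update)        # 16 * 2 * 1 = 32
```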
5 changes: 3 additions & 2 deletions examples/fastspeech/conf/fastspeech.v3.yaml
@@ -46,7 +46,7 @@ fastspeech_params:
###########################################################
# DATA LOADER SETTING #
###########################################################
batch_size: 16 # Batch size.
batch_size: 16 # Batch size for each GPU, assuming gradient_accumulation_steps == 1.
remove_short_samples: true # Whether to remove samples whose length is less than batch_max_steps.
allow_cache: true # Whether to allow cache in dataset. If true, it requires cpu memory.
mel_length_threshold: 32 # remove all targets with mel_length <= 32
@@ -60,7 +60,8 @@ optimizer_params:
decay_steps: 150000 # < train_max_steps is recommended.
warmup_proportion: 0.02
weight_decay: 0.001


gradient_accumulation_steps: 1
var_train_expr: null # trainable variable expr (e.g. 'embeddings|encoder|decoder'),
# separated by |. If var_train_expr is null, all variables are trained.
16 changes: 9 additions & 7 deletions examples/fastspeech/train_fastspeech.py
@@ -36,8 +36,7 @@
from tensorflow_tts.models import TFFastSpeech
from tensorflow_tts.optimizers import AdamWeightDecay, WarmUp
from tensorflow_tts.trainers import Seq2SeqBasedTrainer
from tensorflow_tts.utils import (calculate_2d_loss, calculate_3d_loss,
return_strategy)
from tensorflow_tts.utils import calculate_2d_loss, calculate_3d_loss, return_strategy


class FastSpeechTrainer(Seq2SeqBasedTrainer):
@@ -218,7 +217,7 @@ def main():
default="",
type=str,
nargs="?",
help='pretrained checkpoint file to load weights from. Auto-skips non-matching layers',
help="pretrained checkpoint file to load weights from. Auto-skips non-matching layers",
)
args = parser.parse_args()

@@ -302,7 +301,9 @@ def main():
).create(
is_shuffle=config["is_shuffle"],
allow_cache=config["allow_cache"],
batch_size=config["batch_size"] * STRATEGY.num_replicas_in_sync,
batch_size=config["batch_size"]
* STRATEGY.num_replicas_in_sync
* config["gradient_accumulation_steps"],
)

valid_dataset = CharactorDurationMelDataset(
@@ -335,11 +336,12 @@ def main():
)
fastspeech._build()
fastspeech.summary()

if len(args.pretrained) > 1:
fastspeech.load_weights(args.pretrained, by_name=True, skip_mismatch=True)
logging.info(f"Successfully loaded pretrained weight from {args.pretrained}.")

logging.info(
f"Successfully loaded pretrained weight from {args.pretrained}."
)

# AdamW for fastspeech
learning_rate_fn = tf.keras.optimizers.schedules.PolynomialDecay(
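The pretrained-checkpoint path in this script relies on Keras name-based loading, so layers whose names or shapes do not match are skipped instead of raising. A minimal standalone sketch of that call, with a placeholder model and a hypothetical checkpoint file name:

```python
import logging
import os

import tensorflow as tf

# Placeholder model and checkpoint path; the real script builds TFFastSpeech from its
# config and takes the path from the --pretrained argument.
model = tf.keras.Sequential([tf.keras.layers.Dense(4, name="dense_a", input_shape=(8,))])
pretrained_path = "fastspeech_pretrained.h5"  # hypothetical file name

# The script guards with `if len(args.pretrained) > 1:`; the extra exists() check just
# lets this sketch run even when no checkpoint is on disk.
if len(pretrained_path) > 1 and os.path.exists(pretrained_path):
    model.load_weights(pretrained_path, by_name=True, skip_mismatch=True)
    logging.info(f"Successfully loaded pretrained weight from {pretrained_path}.")
```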
5 changes: 3 additions & 2 deletions examples/fastspeech2/conf/fastspeech2.baker.v2.yaml
@@ -48,7 +48,7 @@ fastspeech2_params:
###########################################################
# DATA LOADER SETTING #
###########################################################
batch_size: 16 # Batch size.
batch_size: 16 # Batch size for each GPU, assuming gradient_accumulation_steps == 1.
remove_short_samples: true # Whether to remove samples whose length is less than batch_max_steps.
allow_cache: true # Whether to allow cache in dataset. If true, it requires cpu memory.
mel_length_threshold: 32 # remove all targets with mel_length <= 32
@@ -62,7 +62,8 @@ optimizer_params:
decay_steps: 150000 # < train_max_steps is recommended.
warmup_proportion: 0.02
weight_decay: 0.001


gradient_accumulation_steps: 1
var_train_expr: null # trainable variable expr (e.g. 'embeddings|encoder|decoder'),
# separated by |. If var_train_expr is null, all variables are trained.
5 changes: 3 additions & 2 deletions examples/fastspeech2/conf/fastspeech2.kss.v1.yaml
@@ -47,7 +47,7 @@ fastspeech2_params:
###########################################################
# DATA LOADER SETTING #
###########################################################
batch_size: 16 # Batch size.
batch_size: 16 # Batch size for each GPU, assuming gradient_accumulation_steps == 1.
remove_short_samples: true # Whether to remove samples whose length is less than batch_max_steps.
allow_cache: true # Whether to allow cache in dataset. If true, it requires cpu memory.
mel_length_threshold: 32 # remove all targets with mel_length <= 32
@@ -61,7 +61,8 @@ optimizer_params:
decay_steps: 150000 # < train_max_steps is recommended.
warmup_proportion: 0.02
weight_decay: 0.001


gradient_accumulation_steps: 1
var_train_expr: null # trainable variable expr (e.g. 'embeddings|encoder|decoder'),
# separated by |. If var_train_expr is null, all variables are trained.
5 changes: 3 additions & 2 deletions examples/fastspeech2/conf/fastspeech2.kss.v2.yaml
@@ -48,7 +48,7 @@ fastspeech2_params:
###########################################################
# DATA LOADER SETTING #
###########################################################
batch_size: 16 # Batch size.
batch_size: 16 # Batch size for each GPU, assuming gradient_accumulation_steps == 1.
remove_short_samples: true # Whether to remove samples whose length is less than batch_max_steps.
allow_cache: true # Whether to allow cache in dataset. If true, it requires cpu memory.
mel_length_threshold: 32 # remove all targets with mel_length <= 32
@@ -62,7 +62,8 @@ optimizer_params:
decay_steps: 150000 # < train_max_steps is recommended.
warmup_proportion: 0.02
weight_decay: 0.001


gradient_accumulation_steps: 1
var_train_expr: null # trainable variable expr (e.g. 'embeddings|encoder|decoder'),
# separated by |. If var_train_expr is null, all variables are trained.
5 changes: 3 additions & 2 deletions examples/fastspeech2/conf/fastspeech2.v1.yaml
@@ -46,7 +46,7 @@ fastspeech2_params:
###########################################################
# DATA LOADER SETTING #
###########################################################
batch_size: 16 # Batch size.
batch_size: 16 # Batch size for each GPU, assuming gradient_accumulation_steps == 1.
remove_short_samples: true # Whether to remove samples whose length is less than batch_max_steps.
allow_cache: true # Whether to allow cache in dataset. If true, it requires cpu memory.
mel_length_threshold: 32 # remove all targets with mel_length <= 32
@@ -60,7 +60,8 @@ optimizer_params:
decay_steps: 150000 # < train_max_steps is recommended.
warmup_proportion: 0.02
weight_decay: 0.001


gradient_accumulation_steps: 1
var_train_expr: null # trainable variable expr (e.g. 'embeddings|encoder|decoder'),
# separated by |. If var_train_expr is null, all variables are trained.
5 changes: 3 additions & 2 deletions examples/fastspeech2/conf/fastspeech2.v2.yaml
@@ -47,7 +47,7 @@ fastspeech2_params:
###########################################################
# DATA LOADER SETTING #
###########################################################
batch_size: 16 # Batch size.
batch_size: 16 # Batch size for each GPU, assuming gradient_accumulation_steps == 1.
remove_short_samples: true # Whether to remove samples whose length is less than batch_max_steps.
allow_cache: true # Whether to allow cache in dataset. If true, it requires cpu memory.
mel_length_threshold: 32 # remove all targets with mel_length <= 32
@@ -61,7 +61,8 @@ optimizer_params:
decay_steps: 150000 # < train_max_steps is recommended.
warmup_proportion: 0.02
weight_decay: 0.001


gradient_accumulation_steps: 1
var_train_expr: null # trainable variable expr (e.g. 'embeddings|encoder|decoder'),
# separated by |. If var_train_expr is null, all variables are trained.
17 changes: 9 additions & 8 deletions examples/fastspeech2/train_fastspeech2.py
@@ -33,15 +33,13 @@
from tqdm import tqdm

import tensorflow_tts
from examples.fastspeech2.fastspeech2_dataset import \
CharactorDurationF0EnergyMelDataset
from examples.fastspeech2.fastspeech2_dataset import CharactorDurationF0EnergyMelDataset
from examples.fastspeech.train_fastspeech import FastSpeechTrainer
from tensorflow_tts.configs import FastSpeech2Config
from tensorflow_tts.models import TFFastSpeech2
from tensorflow_tts.optimizers import AdamWeightDecay, WarmUp
from tensorflow_tts.trainers import Seq2SeqBasedTrainer
from tensorflow_tts.utils import (calculate_2d_loss, calculate_3d_loss,
return_strategy)
from tensorflow_tts.utils import calculate_2d_loss, calculate_3d_loss, return_strategy


class FastSpeech2Trainer(Seq2SeqBasedTrainer):
@@ -244,9 +242,8 @@ def main():
default="",
type=str,
nargs="?",
help='pretrained weights .h5 file to load weights from. Auto-skips non-matching layers',
help="pretrained weights .h5 file to load weights from. Auto-skips non-matching layers",
)


args = parser.parse_args()

@@ -330,7 +327,9 @@ def main():
).create(
is_shuffle=config["is_shuffle"],
allow_cache=config["allow_cache"],
batch_size=config["batch_size"] * STRATEGY.num_replicas_in_sync,
batch_size=config["batch_size"]
* STRATEGY.num_replicas_in_sync
* config["gradient_accumulation_steps"],
)

valid_dataset = CharactorDurationF0EnergyMelDataset(
@@ -367,7 +366,9 @@ def main():
fastspeech.summary()
if len(args.pretrained) > 1:
fastspeech.load_weights(args.pretrained, by_name=True, skip_mismatch=True)
logging.info(f"Successfully loaded pretrained weight from {args.pretrained}.")
logging.info(
f"Successfully loaded pretrained weight from {args.pretrained}."
)

# AdamW for fastspeech
learning_rate_fn = tf.keras.optimizers.schedules.PolynomialDecay(
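Both training scripts build their learning-rate schedule right where this hunk ends. The sketch below covers only the stock Keras part, using the values from the configs in this PR; the warmup-step arithmetic is an assumption based on the warmup_proportion key, and the repository's own WarmUp and AdamWeightDecay wrappers are only referenced in comments, not reproduced.

```python
import tensorflow as tf

# Values taken from the configs in this PR; the learning rates are illustrative.
decay_steps = 150000
warmup_proportion = 0.02
weight_decay = 0.001  # consumed by the repo's AdamWeightDecay, not shown here

# Assumed reading of warmup_proportion: warm up for the first 2% of decay_steps.
warmup_steps = int(decay_steps * warmup_proportion)  # 3000

learning_rate_fn = tf.keras.optimizers.schedules.PolynomialDecay(
    initial_learning_rate=0.001,
    decay_steps=decay_steps,
    end_learning_rate=0.00005,
)

# Stand-in optimizer: the actual scripts wrap `learning_rate_fn` with
# tensorflow_tts.optimizers.WarmUp and pass it to AdamWeightDecay together
# with `weight_decay`.
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate_fn)
print("warmup steps:", warmup_steps)
```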
5 changes: 3 additions & 2 deletions examples/fastspeech2_libritts/conf/fastspeech2libritts.yaml
@@ -46,7 +46,7 @@ fastspeech2_params:
###########################################################
# DATA LOADER SETTING #
###########################################################
batch_size: 32 # Batch size.
batch_size: 32 # Batch size for each GPU, assuming gradient_accumulation_steps == 1.
remove_short_samples: true # Whether to remove samples whose length is less than batch_max_steps.
allow_cache: true # Whether to allow cache in dataset. If true, it requires cpu memory.
mel_length_threshold: 48 # remove all targets with mel_length <= 48
@@ -60,7 +60,8 @@ optimizer_params:
decay_steps: 120000 # < train_max_steps is recommended.
warmup_proportion: 0.02
weight_decay: 0.001


gradient_accumulation_steps: 1
var_train_expr: null # trainable variable expr (e.g. 'embeddings|encoder|decoder'),
# separated by |. If var_train_expr is null, all variables are trained.
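Every config touched in this PR also carries var_train_expr, a '|'-separated expression that restricts which variables are trained. A minimal sketch of how such a filter could be applied by regex-matching variable names; the toy model and the exact matching logic are assumptions for illustration, not the trainer's actual code:

```python
import re

import tensorflow as tf

# Toy model whose variable names contain "encoder"/"decoder" so the filter has
# something to match; "postnet_dense" is there to show a non-matching layer.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, name="encoder_dense", input_shape=(4,)),
    tf.keras.layers.Dense(4, name="decoder_dense"),
    tf.keras.layers.Dense(2, name="postnet_dense"),
])

var_train_expr = "embeddings|encoder|decoder"  # null in the yaml means "train everything"

if var_train_expr is None:
    train_vars = model.trainable_variables
else:
    pattern = re.compile(var_train_expr)
    train_vars = [v for v in model.trainable_variables if pattern.search(v.name)]

print([v.name for v in train_vars])  # only encoder_*/decoder_* variables remain
```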