Add support for German TTS with Thorsten dataset #405

Merged 2 commits on Dec 3, 2020
9 changes: 5 additions & 4 deletions README.md
@@ -19,6 +19,7 @@
:zany_face: TensorFlowTTS provides real-time state-of-the-art speech synthesis architectures such as Tacotron-2, MelGAN, Multi-band MelGAN, FastSpeech, and FastSpeech2, based on TensorFlow 2. With TensorFlow 2, we can speed up training/inference, optimize further with [fake-quantize aware training](https://www.tensorflow.org/model_optimization/guide/quantization/training_comprehensive_guide) and [pruning](https://www.tensorflow.org/model_optimization/guide/pruning/pruning_with_keras), and make TTS models run faster than real-time and deployable on mobile devices or embedded systems.

## What's new
+- 2020/12/02 **(NEW!)** Support German TTS with [Thorsten dataset](https://github.com/thorstenMueller/deep-learning-german-tts). See the [Colab](https://colab.research.google.com/drive/1W0nSFpsz32M0OcIkY9uMOiGrLTPKVhTy?usp=sharing). Thanks [thorstenMueller](https://github.com/thorstenMueller) and [monatis](https://github.com/monatis).
- 2020/11/24 **(NEW!)** Add HiFi-GAN vocoder. See [here](https://github.com/TensorSpeech/TensorFlowTTS/tree/master/examples/hifigan)
- 2020/11/19 **(NEW!)** Add Multi-GPU gradient accumulator. See [here](https://github.com/TensorSpeech/TensorFlowTTS/pull/377)
- 2020/08/23 Add Parallel WaveGAN tensorflow implementation. See [here](https://github.com/TensorSpeech/TensorFlowTTS/tree/master/examples/parallel_wavegan)
@@ -128,11 +129,11 @@ The preprocessing has two steps:

To reproduce the steps above:
```
-tensorflow-tts-preprocess --rootdir ./[ljspeech/kss/baker/libritts] --outdir ./dump_[ljspeech/kss/baker/libritts] --config preprocess/[ljspeech/kss/baker]_preprocess.yaml --dataset [ljspeech/kss/baker/libritts]
-tensorflow-tts-normalize --rootdir ./dump_[ljspeech/kss/baker/libritts] --outdir ./dump_[ljspeech/kss/baker/libritts] --config preprocess/[ljspeech/kss/baker/libritts]_preprocess.yaml --dataset [ljspeech/kss/baker/libritts]
+tensorflow-tts-preprocess --rootdir ./[ljspeech/kss/baker/libritts/thorsten] --outdir ./dump_[ljspeech/kss/baker/libritts/thorsten] --config preprocess/[ljspeech/kss/baker/thorsten]_preprocess.yaml --dataset [ljspeech/kss/baker/libritts/thorsten]
+tensorflow-tts-normalize --rootdir ./dump_[ljspeech/kss/baker/libritts/thorsten] --outdir ./dump_[ljspeech/kss/baker/libritts/thorsten] --config preprocess/[ljspeech/kss/baker/libritts/thorsten]_preprocess.yaml --dataset [ljspeech/kss/baker/libritts/thorsten]
```
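
For the newly added `thorsten` dataset, for example, the bracketed templates instantiate to the following (assuming the corpus was downloaded to `./thorsten`; the path is illustrative):
```
tensorflow-tts-preprocess --rootdir ./thorsten --outdir ./dump_thorsten --config preprocess/thorsten_preprocess.yaml --dataset thorsten
tensorflow-tts-normalize --rootdir ./dump_thorsten --outdir ./dump_thorsten --config preprocess/thorsten_preprocess.yaml --dataset thorsten
```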

-Right now we only support [`ljspeech`](https://keithito.com/LJ-Speech-Dataset/), [`kss`](https://www.kaggle.com/bryanpark/korean-single-speaker-speech-dataset), [`baker`](https://weixinxcxdb.oss-cn-beijing.aliyuncs.com/gwYinPinKu/BZNSYP.rar) and [`libritts`](http://www.openslr.org/60/) for the dataset argument. In the future, we intend to support more datasets.
+Right now we only support [`ljspeech`](https://keithito.com/LJ-Speech-Dataset/), [`kss`](https://www.kaggle.com/bryanpark/korean-single-speaker-speech-dataset), [`baker`](https://weixinxcxdb.oss-cn-beijing.aliyuncs.com/gwYinPinKu/BZNSYP.rar), [`libritts`](http://www.openslr.org/60/) and [`thorsten`](https://github.com/thorstenMueller/deep-learning-german-tts) for the dataset argument. In the future, we intend to support more datasets.

**Note**: To run `libritts` preprocessing, please first read the instructions in [examples/fastspeech2_libritts](https://github.com/TensorSpeech/TensorFlowTTS/tree/master/examples/fastspeech2_libritts). The dataset needs to be reformatted before running preprocessing.

@@ -143,7 +144,7 @@ After preprocessing, the structure of the project folder should be:
| |- wav/
| |- file1.wav
| |- ...
-|- dump_[ljspeech/kss/baker/libritts]/
+|- dump_[ljspeech/kss/baker/libritts/thorsten]/
| |- train/
| |- ids/
| |- LJ001-0001-ids.npy
19 changes: 19 additions & 0 deletions preprocess/thorsten_preprocess.yaml
@@ -0,0 +1,19 @@
###########################################################
# FEATURE EXTRACTION SETTING #
###########################################################
sampling_rate: 22050 # Sampling rate.
fft_size: 1024 # FFT size.
hop_size: 256 # Hop size. (fixed value, don't change)
win_length: null # Window length.
# If set to null, it will be the same as fft_size.
window: "hann" # Window function.
num_mels: 80 # Number of mel basis.
fmin: 80 # Minimum freq in mel basis calculation.
fmax: 7600 # Maximum frequency in mel basis calculation.
global_gain_scale: 1.0 # Gain multiplier applied to the whole waveform.
trim_silence: true # Whether to trim the start and end of silence.
trim_threshold_in_db: 60 # Need to tune carefully if the recording is not good.
trim_frame_size: 2048 # Frame size in trimming.
trim_hop_size: 512 # Hop size in trimming.
format: "npy" # Feature file format. Only "npy" is supported.
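
As a quick sanity check of these settings, a minimal librosa-based sketch (not part of this PR; padding and normalization details may differ from the repo's own extraction code) computes a mel spectrogram with the same parameters:
```python
import librosa

# "sample.wav" is a placeholder path; any 22.05 kHz Thorsten clip would do.
y, sr = librosa.load("sample.wav", sr=22050)
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, hop_length=256, win_length=None,  # null -> fft_size
    window="hann", n_mels=80, fmin=80, fmax=7600,
)
print(mel.shape)  # (80, num_frames), one frame every 256 samples (~11.6 ms)
```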

3 changes: 2 additions & 1 deletion setup.py
@@ -42,7 +42,8 @@
"textgrid",
"click",
"g2p_en",
"dataclasses"
"dataclasses",
"german_transliterate @ git+https://github.com/repodiac/german_transliterate.git#egg=german_transliterate"
],
"setup": ["numpy", "pytest-runner",],
"test": [
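Since `german_transliterate` is not on PyPI, the extra is pinned straight to GitHub; the same dependency can presumably also be installed manually with the pin shown in the diff:
```
pip install "german_transliterate @ git+https://github.com/repodiac/german_transliterate.git#egg=german_transliterate"
```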
7 changes: 6 additions & 1 deletion tensorflow_tts/bin/preprocess.py
@@ -34,11 +34,13 @@
from tensorflow_tts.processor import BakerProcessor
from tensorflow_tts.processor import KSSProcessor
from tensorflow_tts.processor import LibriTTSProcessor
+from tensorflow_tts.processor import ThorstenProcessor

from tensorflow_tts.processor.ljspeech import LJSPEECH_SYMBOLS
from tensorflow_tts.processor.baker import BAKER_SYMBOLS
from tensorflow_tts.processor.kss import KSS_SYMBOLS
from tensorflow_tts.processor.libritts import LIBRITTS_SYMBOLS
+from tensorflow_tts.processor.thorsten import THORSTEN_SYMBOLS

from tensorflow_tts.utils import remove_outlier

@@ -69,7 +71,7 @@ def parse_and_config():
"--dataset",
type=str,
default="ljspeech",
-choices=["ljspeech", "kss", "libritts", "baker"],
+choices=["ljspeech", "kss", "libritts", "baker", "thorsten"],
help="Dataset to preprocess.",
)
parser.add_argument(
@@ -349,20 +351,23 @@ def preprocess():
"kss": KSSProcessor,
"libritts": LibriTTSProcessor,
"baker": BakerProcessor,
"thorsten": ThorstenProcessor,
}

dataset_symbol = {
"ljspeech": LJSPEECH_SYMBOLS,
"kss": KSS_SYMBOLS,
"libritts": LIBRITTS_SYMBOLS,
"baker": BAKER_SYMBOLS,
"thorsten": THORSTEN_SYMBOLS,
}

dataset_cleaner = {
"ljspeech": "english_cleaners",
"kss": "korean_cleaners",
"libritts": None,
"baker": None,
"thorsten": "german_cleaners",
}

logging.info(f"Selected '{config['dataset']}' processor.")
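Passing `--dataset thorsten` now resolves the processor, symbol table, and cleaner through the three lookup tables above. A self-contained sketch of that dispatch (the dicts are trimmed to the one new entry so the snippet runs on its own):
```python
from tensorflow_tts.processor import ThorstenProcessor
from tensorflow_tts.processor.thorsten import THORSTEN_SYMBOLS

dataset_processor = {"thorsten": ThorstenProcessor}
dataset_symbol = {"thorsten": THORSTEN_SYMBOLS}
dataset_cleaner = {"thorsten": "german_cleaners"}

dataset = "thorsten"
print(dataset_processor[dataset].__name__)  # ThorstenProcessor
print(len(dataset_symbol[dataset]))         # 63: pad + "-" + punctuation + a-zA-Z + eos
print(dataset_cleaner[dataset])             # german_cleaners
```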
2 changes: 2 additions & 0 deletions tensorflow_tts/inference/auto_processor.py
@@ -23,6 +23,7 @@
KSSProcessor,
BakerProcessor,
LibriTTSProcessor,
+ThorstenProcessor,
)

CONFIG_MAPPING = OrderedDict(
@@ -31,6 +32,7 @@
("KSSProcessor", KSSProcessor),
("BakerProcessor", BakerProcessor),
("LibriTTSProcessor", LibriTTSProcessor),
("ThorstenProcessor", ThorstenProcessor)
]
)

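With this mapping registered, the new processor can be restored from its mapper file. A usage sketch, assuming `AutoProcessor.from_pretrained` accepts a local mapper path as it does elsewhere in the repo (treat the exact call as an assumption):
```python
from tensorflow_tts.inference import AutoProcessor

# Load the Thorsten mapper shipped in this PR and convert German text to IDs.
processor = AutoProcessor.from_pretrained(
    "tensorflow_tts/processor/pretrained/thorsten_mapper.json"
)
ids = processor.text_to_sequence("Guten Morgen!")
print(ids)  # integer IDs per thorsten_mapper.json, terminated by eos (62)
```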
2 changes: 2 additions & 0 deletions tensorflow_tts/processor/__init__.py
@@ -4,3 +4,5 @@
from tensorflow_tts.processor.baker import BakerProcessor
from tensorflow_tts.processor.kss import KSSProcessor
from tensorflow_tts.processor.libritts import LibriTTSProcessor
+
+from tensorflow_tts.processor.thorsten import ThorstenProcessor
1 change: 1 addition & 0 deletions tensorflow_tts/processor/pretrained/thorsten_mapper.json
@@ -0,0 +1 @@
{"symbol_to_id": {"pad": 0, "-": 1, "!": 2, "'": 3, "(": 4, ")": 5, ",": 6, ".": 7, "?": 8, " ": 9, "A": 10, "B": 11, "C": 12, "D": 13, "E": 14, "F": 15, "G": 16, "H": 17, "I": 18, "J": 19, "K": 20, "L": 21, "M": 22, "N": 23, "O": 24, "P": 25, "Q": 26, "R": 27, "S": 28, "T": 29, "U": 30, "V": 31, "W": 32, "X": 33, "Y": 34, "Z": 35, "a": 36, "b": 37, "c": 38, "d": 39, "e": 40, "f": 41, "g": 42, "h": 43, "i": 44, "j": 45, "k": 46, "l": 47, "m": 48, "n": 49, "o": 50, "p": 51, "q": 52, "r": 53, "s": 54, "t": 55, "u": 56, "v": 57, "w": 58, "x": 59, "y": 60, "z": 61, "eos": 62}, "id_to_symbol": {"0": "pad", "1": "-", "2": "!", "3": "'", "4": "(", "5": ")", "6": ",", "7": ".", "8": "?", "9": " ", "10": "A", "11": "B", "12": "C", "13": "D", "14": "E", "15": "F", "16": "G", "17": "H", "18": "I", "19": "J", "20": "K", "21": "L", "22": "M", "23": "N", "24": "O", "25": "P", "26": "Q", "27": "R", "28": "S", "29": "T", "30": "U", "31": "V", "32": "W", "33": "X", "34": "Y", "35": "Z", "36": "a", "37": "b", "38": "c", "39": "d", "40": "e", "41": "f", "42": "g", "43": "h", "44": "i", "45": "j", "46": "k", "47": "l", "48": "m", "49": "n", "50": "o", "51": "p", "52": "q", "53": "r", "54": "s", "55": "t", "56": "u", "57": "v", "58": "w", "59": "x", "60": "y", "61": "z", "62": "eos"}, "speakers_map": {"thorsten": 0}, "processor_name": "ThorstenProcessor"}
126 changes: 126 additions & 0 deletions tensorflow_tts/processor/thorsten.py
@@ -0,0 +1,126 @@
# -*- coding: utf-8 -*-
# Copyright 2020 TensorFlowTTS Team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Perform preprocessing and raw feature extraction for LJSpeech dataset."""

import os
import re

import numpy as np
import soundfile as sf
from dataclasses import dataclass
from tensorflow_tts.processor import BaseProcessor
from tensorflow_tts.utils import cleaners

_pad = "pad"
_eos = "eos"
_punctuation = "!'(),.? "
_special = "-"
_letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

# Export all symbols:
THORSTEN_SYMBOLS = (
    [_pad] + list(_special) + list(_punctuation) + list(_letters) + [_eos]
)

# Regular expression matching text enclosed in curly braces:
_curly_re = re.compile(r"(.*?)\{(.+?)\}(.*)")


@dataclass
class ThorstenProcessor(BaseProcessor):
    """Thorsten processor."""

    cleaner_names: str = "german_cleaners"
    positions = {
        "wave_file": 0,
        "text_norm": 1,
    }
    train_f_name: str = "metadata.csv"

    def create_items(self):
        if self.data_dir:
            with open(
                os.path.join(self.data_dir, self.train_f_name), encoding="utf-8"
            ) as f:
                self.items = [self.split_line(self.data_dir, line, "|") for line in f]

    def split_line(self, data_dir, line, split):
        parts = line.strip().split(split)
        wave_file = parts[self.positions["wave_file"]]
        text_norm = parts[self.positions["text_norm"]]
        wav_path = os.path.join(data_dir, "wavs", f"{wave_file}.wav")
        speaker_name = "thorsten"
        return text_norm, wav_path, speaker_name

    def setup_eos_token(self):
        return _eos

    def get_one_sample(self, item):
        text, wav_path, speaker_name = item

        # audio is already normalized to [-1, 1]; soundfile returns float data.
        audio, rate = sf.read(wav_path)
        audio = audio.astype(np.float32)

        # convert text to ids
        text_ids = np.asarray(self.text_to_sequence(text), np.int32)

        sample = {
            "raw_text": text,
            "text_ids": text_ids,
            "audio": audio,
            "utt_id": os.path.split(wav_path)[-1].split(".")[0],
            "speaker_name": speaker_name,
            "rate": rate,
        }

        return sample

    def text_to_sequence(self, text):
        sequence = []
        # Check for curly braces and treat their contents as ARPAbet:
        while len(text):
            m = _curly_re.match(text)
            if not m:
                sequence += self._symbols_to_sequence(
                    self._clean_text(text, [self.cleaner_names])
                )
                break
            sequence += self._symbols_to_sequence(
                self._clean_text(m.group(1), [self.cleaner_names])
            )
            sequence += self._arpabet_to_sequence(m.group(2))
            text = m.group(3)

        # add eos token
        sequence += [self.eos_id]
        return sequence

    def _clean_text(self, text, cleaner_names):
        for name in cleaner_names:
            cleaner = getattr(cleaners, name, None)
            if not cleaner:
                raise Exception("Unknown cleaner: %s" % name)
            text = cleaner(text)
        return text

    def _symbols_to_sequence(self, symbols):
        return [self.symbol_to_id[s] for s in symbols if self._should_keep_symbol(s)]

    def _arpabet_to_sequence(self, text):
        return self._symbols_to_sequence(["@" + s for s in text.split()])

    def _should_keep_symbol(self, s):
        return s in self.symbol_to_id and s != "_" and s != "~"
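
A short usage sketch of the new processor (the `./thorsten` path is illustrative, and the constructor arguments are assumptions based on how the sibling processors expose `data_dir`/`symbols` through `BaseProcessor`):
```python
from tensorflow_tts.processor import ThorstenProcessor
from tensorflow_tts.processor.thorsten import THORSTEN_SYMBOLS

# data_dir is assumed to hold metadata.csv plus a wavs/ folder, as read by create_items().
processor = ThorstenProcessor(data_dir="./thorsten", symbols=THORSTEN_SYMBOLS)
sample = processor.get_one_sample(processor.items[0])
print(sample["utt_id"], sample["rate"], sample["text_ids"].shape)
```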
6 changes: 6 additions & 0 deletions tensorflow_tts/utils/cleaners.py
@@ -24,6 +24,7 @@
from tensorflow_tts.utils.korean import tokenize as ko_tokenize
from tensorflow_tts.utils.number_norm import normalize_numbers
from unidecode import unidecode
+from german_transliterate.core import GermanTransliterate

# Regular expression matching whitespace:
_whitespace_re = re.compile(r"\s+")
@@ -107,3 +108,8 @@ def korean_cleaners(text):
        text
    )  # '존경하는' --> ['ᄌ', 'ᅩ', 'ᆫ', 'ᄀ', 'ᅧ', 'ᆼ', 'ᄒ', 'ᅡ', 'ᄂ', 'ᅳ', 'ᆫ']
    return text

+def german_cleaners(text):
+    """Pipeline for German text, including number and abbreviation expansion."""
+    text = GermanTransliterate(replace={';': ',', ':': ' '}, sep_abbreviation=' -- ').transliterate(text)
+    return text
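
For illustration, a minimal sketch of the cleaner's effect (the printed output is indicative only and depends on the installed `german_transliterate` version, so treat it as an assumption):
```python
from german_transliterate.core import GermanTransliterate

cleaner = GermanTransliterate(replace={';': ',', ':': ' '}, sep_abbreviation=' -- ')
# Numbers are expanded to words and umlauts transliterated to plain a-z,
# which keeps the output inside THORSTEN_SYMBOLS.
print(cleaner.transliterate("Ich habe 3 Äpfel."))
# indicative output: "ich habe drei aepfel."
```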