Add support for German TTS with Thorsten dataset #405

Merged 2 commits on Dec 3, 2020
9 changes: 5 additions & 4 deletions README.md
@@ -19,6 +19,7 @@
:zany_face: TensorFlowTTS provides real-time state-of-the-art speech synthesis architectures such as Tacotron-2, MelGAN, Multi-band MelGAN, FastSpeech, and FastSpeech2, based on TensorFlow 2. With TensorFlow 2, we can speed up training/inference, optimize further with [fake-quantize aware training](https://www.tensorflow.org/model_optimization/guide/quantization/training_comprehensive_guide) and [pruning](https://www.tensorflow.org/model_optimization/guide/pruning/pruning_with_keras), and make TTS models run faster than real-time and deployable on mobile devices or embedded systems.

## What's new
+- 2020/12/02 **(NEW!)** Support German TTS with [Thorsten dataset](https://github.com/thorstenMueller/deep-learning-german-tts). See the [Colab](https://colab.research.google.com/drive/1W0nSFpsz32M0OcIkY9uMOiGrLTPKVhTy?usp=sharing). Thanks [thorstenMueller](https://github.com/thorstenMueller) and [monatis](https://github.com/monatis).
- 2020/11/24 **(NEW!)** Add HiFi-GAN vocoder. See [here](https://github.com/TensorSpeech/TensorFlowTTS/tree/master/examples/hifigan)
- 2020/11/19 **(NEW!)** Add Multi-GPU gradient accumulator. See [here](https://github.com/TensorSpeech/TensorFlowTTS/pull/377)
- 2020/08/23 Add Parallel WaveGAN tensorflow implementation. See [here](https://github.com/TensorSpeech/TensorFlowTTS/tree/master/examples/parallel_wavegan)
@@ -128,11 +129,11 @@ The preprocessing has two steps:

To reproduce the steps above:
```
-tensorflow-tts-preprocess --rootdir ./[ljspeech/kss/baker/libritts] --outdir ./dump_[ljspeech/kss/baker/libritts] --config preprocess/[ljspeech/kss/baker]_preprocess.yaml --dataset [ljspeech/kss/baker/libritts]
-tensorflow-tts-normalize --rootdir ./dump_[ljspeech/kss/baker/libritts] --outdir ./dump_[ljspeech/kss/baker/libritts] --config preprocess/[ljspeech/kss/baker/libritts]_preprocess.yaml --dataset [ljspeech/kss/baker/libritts]
+tensorflow-tts-preprocess --rootdir ./[ljspeech/kss/baker/libritts/thorsten] --outdir ./dump_[ljspeech/kss/baker/libritts/thorsten] --config preprocess/[ljspeech/kss/baker/thorsten]_preprocess.yaml --dataset [ljspeech/kss/baker/libritts/thorsten]
+tensorflow-tts-normalize --rootdir ./dump_[ljspeech/kss/baker/libritts/thorsten] --outdir ./dump_[ljspeech/kss/baker/libritts/thorsten] --config preprocess/[ljspeech/kss/baker/libritts/thorsten]_preprocess.yaml --dataset [ljspeech/kss/baker/libritts/thorsten]
```
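
For the newly added `thorsten` dataset, for example, the bracketed templates instantiate to the following (assuming the corpus was downloaded to `./thorsten`; the path is illustrative):
```
tensorflow-tts-preprocess --rootdir ./thorsten --outdir ./dump_thorsten --config preprocess/thorsten_preprocess.yaml --dataset thorsten
tensorflow-tts-normalize --rootdir ./dump_thorsten --outdir ./dump_thorsten --config preprocess/thorsten_preprocess.yaml --dataset thorsten
```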

-Right now we only support [`ljspeech`](https://keithito.com/LJ-Speech-Dataset/), [`kss`](https://www.kaggle.com/bryanpark/korean-single-speaker-speech-dataset), [`baker`](https://weixinxcxdb.oss-cn-beijing.aliyuncs.com/gwYinPinKu/BZNSYP.rar) and [`libritts`](http://www.openslr.org/60/) for the dataset argument. In the future, we intend to support more datasets.
+Right now we only support [`ljspeech`](https://keithito.com/LJ-Speech-Dataset/), [`kss`](https://www.kaggle.com/bryanpark/korean-single-speaker-speech-dataset), [`baker`](https://weixinxcxdb.oss-cn-beijing.aliyuncs.com/gwYinPinKu/BZNSYP.rar), [`libritts`](http://www.openslr.org/60/) and [`thorsten`](https://github.com/thorstenMueller/deep-learning-german-tts) for the dataset argument. In the future, we intend to support more datasets.

**Note**: To run `libritts` preprocessing, please first read the instructions in [examples/fastspeech2_libritts](https://github.com/TensorSpeech/TensorFlowTTS/tree/master/examples/fastspeech2_libritts). The dataset needs to be reformatted before running preprocessing.

@@ -143,7 +144,7 @@ After preprocessing, the structure of the project folder should be:
| |- wav/
| |- file1.wav
| |- ...
-|- dump_[ljspeech/kss/baker/libritts]/
+|- dump_[ljspeech/kss/baker/libritts/thorsten]/
| |- train/
| |- ids/
| |- LJ001-0001-ids.npy
19 changes: 19 additions & 0 deletions preprocess/thorsten_preprocess.yaml
@@ -0,0 +1,19 @@
###########################################################
# FEATURE EXTRACTION SETTING #
###########################################################
sampling_rate: 22050 # Sampling rate.
fft_size: 1024 # FFT size.
hop_size: 256 # Hop size. (fixed value, don't change)
win_length: null # Window length.
# If set to null, it will be the same as fft_size.
window: "hann" # Window function.
num_mels: 80 # Number of mel basis.
fmin: 80 # Minimum freq in mel basis calculation.
fmax: 7600 # Maximum frequency in mel basis calculation.
global_gain_scale: 1.0 # Gain multiplier applied to the whole waveform.
trim_silence: true # Whether to trim the start and end of silence.
trim_threshold_in_db: 60 # Need to tune carefully if the recording is not good.
trim_frame_size: 2048 # Frame size in trimming.
trim_hop_size: 512 # Hop size in trimming.
format: "npy" # Feature file format. Only "npy" is supported.
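
As a quick sanity check of these settings, a minimal librosa-based sketch (not part of this PR; padding and normalization details may differ from the repo's own extraction code) computes a mel spectrogram with the same parameters:
```python
import librosa

# "sample.wav" is a placeholder path; any 22.05 kHz Thorsten clip would do.
y, sr = librosa.load("sample.wav", sr=22050)
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, hop_length=256, win_length=None,  # null -> fft_size
    window="hann", n_mels=80, fmin=80, fmax=7600,
)
print(mel.shape)  # (80, num_frames), one frame every 256 samples (~11.6 ms)
```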

3 changes: 2 additions & 1 deletion setup.py
@@ -42,7 +42,8 @@
"textgrid",
"click",
"g2p_en",
"dataclasses"
"dataclasses",
"german_transliterate @ git+https://github.com/repodiac/german_transliterate.git#egg=german_transliterate"
],
"setup": ["numpy", "pytest-runner",],
"test": [
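Since `german_transliterate` is not on PyPI, the extra is pinned straight to GitHub; the same dependency can presumably also be installed manually with the pin shown in the diff:
```
pip install "german_transliterate @ git+https://github.com/repodiac/german_transliterate.git#egg=german_transliterate"
```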
7 changes: 6 additions & 1 deletion tensorflow_tts/bin/preprocess.py
@@ -34,11 +34,13 @@
from tensorflow_tts.processor import BakerProcessor
from tensorflow_tts.processor import KSSProcessor
from tensorflow_tts.processor import LibriTTSProcessor
+from tensorflow_tts.processor import ThorstenProcessor

from tensorflow_tts.processor.ljspeech import LJSPEECH_SYMBOLS
from tensorflow_tts.processor.baker import BAKER_SYMBOLS
from tensorflow_tts.processor.kss import KSS_SYMBOLS
from tensorflow_tts.processor.libritts import LIBRITTS_SYMBOLS
+from tensorflow_tts.processor.thorsten import THORSTEN_SYMBOLS

from tensorflow_tts.utils import remove_outlier

@@ -69,7 +71,7 @@ def parse_and_config():
"--dataset",
type=str,
default="ljspeech",
-choices=["ljspeech", "kss", "libritts", "baker"],
+choices=["ljspeech", "kss", "libritts", "baker", "thorsten"],
help="Dataset to preprocess.",
)
parser.add_argument(
@@ -349,20 +351,23 @@ def preprocess():
"kss": KSSProcessor,
"libritts": LibriTTSProcessor,
"baker": BakerProcessor,
"thorsten": ThorstenProcessor,
}

dataset_symbol = {
"ljspeech": LJSPEECH_SYMBOLS,
"kss": KSS_SYMBOLS,
"libritts": LIBRITTS_SYMBOLS,
"baker": BAKER_SYMBOLS,
"thorsten": THORSTEN_SYMBOLS,
}

dataset_cleaner = {
"ljspeech": "english_cleaners",
"kss": "korean_cleaners",
"libritts": None,
"baker": None,
"thorsten": "german_cleaners",
}

logging.info(f"Selected '{config['dataset']}' processor.")
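Passing `--dataset thorsten` now resolves the processor, symbol table, and cleaner through the three lookup tables above. A self-contained sketch of that dispatch (the dicts are trimmed to the one new entry so the snippet runs on its own):
```python
from tensorflow_tts.processor import ThorstenProcessor
from tensorflow_tts.processor.thorsten import THORSTEN_SYMBOLS

dataset_processor = {"thorsten": ThorstenProcessor}
dataset_symbol = {"thorsten": THORSTEN_SYMBOLS}
dataset_cleaner = {"thorsten": "german_cleaners"}

dataset = "thorsten"
print(dataset_processor[dataset].__name__)  # ThorstenProcessor
print(len(dataset_symbol[dataset]))         # 63: pad + "-" + punctuation + a-zA-Z + eos
print(dataset_cleaner[dataset])             # german_cleaners
```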
2 changes: 2 additions & 0 deletions tensorflow_tts/inference/auto_processor.py
@@ -23,6 +23,7 @@
KSSProcessor,
BakerProcessor,
LibriTTSProcessor,
+ThorstenProcessor,
)

CONFIG_MAPPING = OrderedDict(
@@ -31,6 +32,7 @@
("KSSProcessor", KSSProcessor),
("BakerProcessor", BakerProcessor),
("LibriTTSProcessor", LibriTTSProcessor),
("ThorstenProcessor", ThorstenProcessor)
]
)

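With this mapping registered, the new processor can be restored from its mapper file. A usage sketch, assuming `AutoProcessor.from_pretrained` accepts a local mapper path as it does elsewhere in the repo (treat the exact call as an assumption):
```python
from tensorflow_tts.inference import AutoProcessor

# Load the Thorsten mapper shipped in this PR and convert German text to IDs.
processor = AutoProcessor.from_pretrained(
    "tensorflow_tts/processor/pretrained/thorsten_mapper.json"
)
ids = processor.text_to_sequence("Guten Morgen!")
print(ids)  # integer IDs per thorsten_mapper.json, terminated by eos (62)
```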
2 changes: 2 additions & 0 deletions tensorflow_tts/processor/__init__.py
@@ -4,3 +4,5 @@
from tensorflow_tts.processor.baker import BakerProcessor
from tensorflow_tts.processor.kss import KSSProcessor
from tensorflow_tts.processor.libritts import LibriTTSProcessor
+
+from tensorflow_tts.processor.thorsten import ThorstenProcessor
1 change: 1 addition & 0 deletions tensorflow_tts/processor/pretrained/thorsten_mapper.json
@@ -0,0 +1 @@
{"symbol_to_id": {"pad": 0, "-": 1, "!": 2, "'": 3, "(": 4, ")": 5, ",": 6, ".": 7, "?": 8, " ": 9, "A": 10, "B": 11, "C": 12, "D": 13, "E": 14, "F": 15, "G": 16, "H": 17, "I": 18, "J": 19, "K": 20, "L": 21, "M": 22, "N": 23, "O": 24, "P": 25, "Q": 26, "R": 27, "S": 28, "T": 29, "U": 30, "V": 31, "W": 32, "X": 33, "Y": 34, "Z": 35, "a": 36, "b": 37, "c": 38, "d": 39, "e": 40, "f": 41, "g": 42, "h": 43, "i": 44, "j": 45, "k": 46, "l": 47, "m": 48, "n": 49, "o": 50, "p": 51, "q": 52, "r": 53, "s": 54, "t": 55, "u": 56, "v": 57, "w": 58, "x": 59, "y": 60, "z": 61, "eos": 62}, "id_to_symbol": {"0": "pad", "1": "-", "2": "!", "3": "'", "4": "(", "5": ")", "6": ",", "7": ".", "8": "?", "9": " ", "10": "A", "11": "B", "12": "C", "13": "D", "14": "E", "15": "F", "16": "G", "17": "H", "18": "I", "19": "J", "20": "K", "21": "L", "22": "M", "23": "N", "24": "O", "25": "P", "26": "Q", "27": "R", "28": "S", "29": "T", "30": "U", "31": "V", "32": "W", "33": "X", "34": "Y", "35": "Z", "36": "a", "37": "b", "38": "c", "39": "d", "40": "e", "41": "f", "42": "g", "43": "h", "44": "i", "45": "j", "46": "k", "47": "l", "48": "m", "49": "n", "50": "o", "51": "p", "52": "q", "53": "r", "54": "s", "55": "t", "56": "u", "57": "v", "58": "w", "59": "x", "60": "y", "61": "z", "62": "eos"}, "speakers_map": {"thorsten": 0}, "processor_name": "ThorstenProcessor"}
126 changes: 126 additions & 0 deletions tensorflow_tts/processor/thorsten.py
@@ -0,0 +1,126 @@
# -*- coding: utf-8 -*-
# Copyright 2020 TensorFlowTTS Team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Perform preprocessing and raw feature extraction for LJSpeech dataset."""

import os
import re

import numpy as np
import soundfile as sf
from dataclasses import dataclass
from tensorflow_tts.processor import BaseProcessor
from tensorflow_tts.utils import cleaners

_pad = "pad"
_eos = "eos"
_punctuation = "!'(),.? "
_special = "-"
_letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

# Export all symbols:
THORSTEN_SYMBOLS = (
    [_pad] + list(_special) + list(_punctuation) + list(_letters) + [_eos]
)

# Regular expression matching text enclosed in curly braces:
_curly_re = re.compile(r"(.*?)\{(.+?)\}(.*)")


@dataclass
class ThorstenProcessor(BaseProcessor):
    """Thorsten processor."""

    cleaner_names: str = "german_cleaners"
    positions = {
        "wave_file": 0,
        "text_norm": 1,
    }
    train_f_name: str = "metadata.csv"

    def create_items(self):
        if self.data_dir:
            with open(
                os.path.join(self.data_dir, self.train_f_name), encoding="utf-8"
            ) as f:
                self.items = [self.split_line(self.data_dir, line, "|") for line in f]

    def split_line(self, data_dir, line, split):
        parts = line.strip().split(split)
        wave_file = parts[self.positions["wave_file"]]
        text_norm = parts[self.positions["text_norm"]]
        wav_path = os.path.join(data_dir, "wavs", f"{wave_file}.wav")
        speaker_name = "thorsten"
        return text_norm, wav_path, speaker_name

    def setup_eos_token(self):
        return _eos

    def get_one_sample(self, item):
        text, wav_path, speaker_name = item

        # audio is already normalized to [-1, 1]; soundfile returns float data.
        audio, rate = sf.read(wav_path)
        audio = audio.astype(np.float32)

        # convert text to ids
        text_ids = np.asarray(self.text_to_sequence(text), np.int32)

        sample = {
            "raw_text": text,
            "text_ids": text_ids,
            "audio": audio,
            "utt_id": os.path.split(wav_path)[-1].split(".")[0],
            "speaker_name": speaker_name,
            "rate": rate,
        }

        return sample

    def text_to_sequence(self, text):
        sequence = []
        # Check for curly braces and treat their contents as ARPAbet:
        while len(text):
            m = _curly_re.match(text)
            if not m:
                sequence += self._symbols_to_sequence(
                    self._clean_text(text, [self.cleaner_names])
                )
                break
            sequence += self._symbols_to_sequence(
                self._clean_text(m.group(1), [self.cleaner_names])
            )
            sequence += self._arpabet_to_sequence(m.group(2))
            text = m.group(3)

        # add eos token
        sequence += [self.eos_id]
        return sequence

    def _clean_text(self, text, cleaner_names):
        for name in cleaner_names:
            cleaner = getattr(cleaners, name, None)
            if not cleaner:
                raise Exception("Unknown cleaner: %s" % name)
            text = cleaner(text)
        return text

    def _symbols_to_sequence(self, symbols):
        return [self.symbol_to_id[s] for s in symbols if self._should_keep_symbol(s)]

    def _arpabet_to_sequence(self, text):
        return self._symbols_to_sequence(["@" + s for s in text.split()])

    def _should_keep_symbol(self, s):
        return s in self.symbol_to_id and s != "_" and s != "~"
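
A short usage sketch of the new processor (the `./thorsten` path is illustrative, and the constructor arguments are assumptions based on how the sibling processors expose `data_dir`/`symbols` through `BaseProcessor`):
```python
from tensorflow_tts.processor import ThorstenProcessor
from tensorflow_tts.processor.thorsten import THORSTEN_SYMBOLS

# data_dir is assumed to hold metadata.csv plus a wavs/ folder, as read by create_items().
processor = ThorstenProcessor(data_dir="./thorsten", symbols=THORSTEN_SYMBOLS)
sample = processor.get_one_sample(processor.items[0])
print(sample["utt_id"], sample["rate"], sample["text_ids"].shape)
```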
6 changes: 6 additions & 0 deletions tensorflow_tts/utils/cleaners.py
@@ -24,6 +24,7 @@
from tensorflow_tts.utils.korean import tokenize as ko_tokenize
from tensorflow_tts.utils.number_norm import normalize_numbers
from unidecode import unidecode
+from german_transliterate.core import GermanTransliterate

# Regular expression matching whitespace:
_whitespace_re = re.compile(r"\s+")
@@ -107,3 +108,8 @@ def korean_cleaners(text):
        text
    )  # '존경하는' --> ['ᄌ', 'ᅩ', 'ᆫ', 'ᄀ', 'ᅧ', 'ᆼ', 'ᄒ', 'ᅡ', 'ᄂ', 'ᅳ', 'ᆫ']
    return text

+def german_cleaners(text):
+    """Pipeline for German text, including number and abbreviation expansion."""
+    text = GermanTransliterate(replace={';': ',', ':': ' '}, sep_abbreviation=' -- ').transliterate(text)
+    return text
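
For illustration, a minimal sketch of the cleaner's effect (the printed output is indicative only and depends on the installed `german_transliterate` version, so treat it as an assumption):
```python
from german_transliterate.core import GermanTransliterate

cleaner = GermanTransliterate(replace={';': ',', ':': ' '}, sep_abbreviation=' -- ')
# Numbers are expanded to words and umlauts transliterated to plain a-z,
# which keeps the output inside THORSTEN_SYMBOLS.
print(cleaner.transliterate("Ich habe 3 Äpfel."))
# indicative output: "ich habe drei aepfel."
```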