
Trim silences during synthesizer preprocess #501

Closed
ghost opened this issue Aug 21, 2020 · 5 comments


ghost commented Aug 21, 2020

Synthesizer preprocess generally does not trim silences from the wav files in the dataset. (An exception is if the dataset has alignments, such as LibriSpeech. Those alignment files contain timing data that is used to trim leading and trailing silence from an utterance.)

We should apply voice activity detection (webrtcvad) to help trim excess silence from other datasets like VCTK. I notice that my synthesizer models trained on VCTK produce a lot of leading and trailing silence, and I think this is the reason. All that is needed is to add this line: wav = encoder.preprocess_wav(wav) after librosa loads the wav.

if no_alignments:
    # Gather the utterance audios and texts
    # LibriTTS uses .wav but we will include extensions for compatibility with other datasets
    extensions = ["*.wav", "*.flac", "*.mp3"]
    for extension in extensions:
        wav_fpaths = book_dir.glob(extension)
        for wav_fpath in wav_fpaths:
            # Load the audio waveform
            wav, _ = librosa.load(str(wav_fpath), hparams.sample_rate)
            if hparams.rescale:
                wav = wav / np.abs(wav).max() * hparams.rescaling_max

            # Get the corresponding text
            # Check for .txt (for compatibility with other datasets)
            text_fpath = wav_fpath.with_suffix(".txt")
            if not text_fpath.exists():
                # Check for .normalized.txt (LibriTTS)
                text_fpath = wav_fpath.with_suffix(".normalized.txt")
                assert text_fpath.exists()
            with text_fpath.open("r") as text_file:
                text = "".join([line for line in text_file])
                text = text.replace("\"", "")
                text = text.strip()

            # Process the utterance
            metadata.append(process_utterance(wav, text, out_dir, str(wav_fpath.with_suffix("").name),
                                              skip_existing, hparams))
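
For illustration, a minimal sketch of the proposed change; the import of preprocess_wav via the encoder package (as the demo scripts do) is an assumption, since this thread only shows the call itself:

    from encoder import inference as encoder  # assumption: preprocess_wav is exposed here, as used by demo_cli.py

    # Load the audio waveform
    wav, _ = librosa.load(str(wav_fpath), hparams.sample_rate)
    # Proposed addition: webrtcvad-based trimming of leading/trailing/excess silence
    wav = encoder.preprocess_wav(wav)
    if hparams.rescale:
        wav = wav / np.abs(wav).max() * hparams.rescaling_max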

@javaintheuk

Hi Bluefish, (big fan of your work here!)

Thanks for the tip! Please correct me if I'm wrong: do you mean HERE?

# Load the audio waveform
wav, _ = librosa.load(str(wav_fpath), hparams.sample_rate)
if hparams.rescale:
    wav = wav / np.abs(wav).max() * hparams.rescaling_max

# HERE? --->
wav = encoder.preprocess_wav(wav)

# Get the corresponding text

Thanks in advance!


ghost commented Aug 24, 2020

@javaintheuk Right before checking hparams.rescale. I have noticed that for a few utterances in VCTK the preprocess result will be None or an empty wav, which causes an error. If you experience this, you could follow the preprocess_wav call with a check like if wav is not None and wav.size > 0 to make sure the wav is valid before continuing (see the sketch after the snippet below).

# Load the audio waveform
wav, _ = librosa.load(str(wav_fpath), hparams.sample_rate)
wav = encoder.preprocess_wav(wav)
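
A minimal sketch of that guard; skipping the utterance with continue is an assumption about how an invalid wav should be handled here:

    # Load the audio waveform
    wav, _ = librosa.load(str(wav_fpath), hparams.sample_rate)
    wav = encoder.preprocess_wav(wav)
    # A few VCTK utterances come back as None or empty after VAD trimming
    if wav is None or wav.size == 0:
        continue  # assumption: skip this utterance instead of letting it error out
    if hparams.rescale:
        wav = wav / np.abs(wav).max() * hparams.rescaling_max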

Thanks for expressing your interest in this idea. I will consider submitting a pull request once I've figured out a good implementation. Anyone in the community is also welcome to contribute ideas for better preprocessing or to submit a PR.


ghost commented Sep 3, 2020

@javaintheuk In #472, I settled on this implementation: https://github.com/blue-fish/Real-Time-Voice-Cloning/compare/1d0d650...blue-fish:d692584

Trimming silences from training data is very important when working with fatchord's tacotron1 model, so I am going to bundle it with the PyTorch synthesizer.


ghost commented Sep 4, 2020

I preprocessed LibriSpeech using silence_min_duration_split of 0.2 seconds instead of the default 0.4. This breaks up all utterances with long pauses (0.4 seconds is still quite long), which might make the VAD hack unnecessary. However, this setting is only effective when alignments are available for the dataset.

Edit: I did not notice a difference in the model when splitting at 0.2 and 0.4 seconds if VAD is applied. Now trying splitting on silences of 0.05 seconds without VAD.
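
For reference, a sketch of what that change looks like; the exact file and structure of the synthesizer hparams are not shown in this thread, so treat the location as an assumption:

    # In the synthesizer hparams (location assumed, e.g. synthesizer/hparams.py):
    silence_min_duration_split=0.2,  # default 0.4; minimum pause length (seconds) at which an utterance is split
    # Note: only takes effect when alignment files are available for the dataset (e.g. LibriSpeech)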

@javaintheuk

> @javaintheuk In #472, I settled on this implementation: blue-fish/Real-Time-Voice-Cloning@1d0d650...blue-fish:d692584
>
> Trimming silences from training data is very important when working with fatchord's tacotron1 model, so I am going to bundle it with the PyTorch synthesizer.

Thanks for the update! :)

This issue was closed.