
Trim silences during synthesizer preprocess #501

Closed
ghost opened this issue Aug 21, 2020 · 5 comments


ghost commented Aug 21, 2020

Synthesizer preprocess generally does not trim silences from the wav files in the dataset. (An exception is if the dataset has alignments, such as LibriSpeech. Those alignment files contain timing data that is used to trim leading and trailing silence from an utterance.)

We should apply voice activity detection (webrtcvad) to help trim excess silence from other datasets like VCTK. I notice that my synthesizer models trained on VCTK produce a lot of leading and trailing silence, and I think this is the reason. All that is needed is to add this line: wav = encoder.preprocess_wav(wav) after librosa loads the wav.

if no_alignments:
    # Gather the utterance audios and texts
    # LibriTTS uses .wav but we will include extensions for compatibility with other datasets
    extensions = ["*.wav", "*.flac", "*.mp3"]
    for extension in extensions:
        wav_fpaths = book_dir.glob(extension)
        for wav_fpath in wav_fpaths:
            # Load the audio waveform
            wav, _ = librosa.load(str(wav_fpath), hparams.sample_rate)
            if hparams.rescale:
                wav = wav / np.abs(wav).max() * hparams.rescaling_max

            # Get the corresponding text
            # Check for .txt (for compatibility with other datasets)
            text_fpath = wav_fpath.with_suffix(".txt")
            if not text_fpath.exists():
                # Check for .normalized.txt (LibriTTS)
                text_fpath = wav_fpath.with_suffix(".normalized.txt")
                assert text_fpath.exists()
            with text_fpath.open("r") as text_file:
                text = "".join([line for line in text_file])
                text = text.replace("\"", "")
                text = text.strip()

            # Process the utterance
            metadata.append(process_utterance(wav, text, out_dir, str(wav_fpath.with_suffix("").name),
                                              skip_existing, hparams))
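
For illustration, a minimal sketch of the proposed change; the import of preprocess_wav via the encoder package (as the demo scripts do) is an assumption, since this thread only shows the call itself:

    from encoder import inference as encoder  # assumption: preprocess_wav is exposed here, as used by demo_cli.py

    # Load the audio waveform
    wav, _ = librosa.load(str(wav_fpath), hparams.sample_rate)
    # Proposed addition: webrtcvad-based trimming of leading/trailing/excess silence
    wav = encoder.preprocess_wav(wav)
    if hparams.rescale:
        wav = wav / np.abs(wav).max() * hparams.rescaling_max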

@javaintheuk

Hi Bluefish, (big fan of your work here!)

Thanks for the tip! Please correct me if I'm wrong: do you mean HERE?

# Load the audio waveform
wav, _ = librosa.load(str(wav_fpath), hparams.sample_rate)
if hparams.rescale:
    wav = wav / np.abs(wav).max() * hparams.rescaling_max

# HERE? --->
wav = encoder.preprocess_wav(wav)

# Get the corresponding text

Thanks in advance!


ghost commented Aug 24, 2020

@javaintheuk Right before checking hparams.rescale. I have noticed that for a few utterances in VCTK the preprocess result will be None or an empty wav, which causes an error. If you experience this, you could follow the preprocess_wav call with a check like if wav is not None and wav.size > 0 to make sure the wav is valid before continuing (see the sketch after the snippet below).

# Load the audio waveform
wav, _ = librosa.load(str(wav_fpath), hparams.sample_rate)
wav = encoder.preprocess_wav(wav)
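
A minimal sketch of that guard; skipping the utterance with continue is an assumption about how an invalid wav should be handled here:

    # Load the audio waveform
    wav, _ = librosa.load(str(wav_fpath), hparams.sample_rate)
    wav = encoder.preprocess_wav(wav)
    # A few VCTK utterances come back as None or empty after VAD trimming
    if wav is None or wav.size == 0:
        continue  # assumption: skip this utterance instead of letting it error out
    if hparams.rescale:
        wav = wav / np.abs(wav).max() * hparams.rescaling_max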

Thanks for expressing your interest in this idea. I will consider submitting a pull request once I've figured out a good implementation. Anyone in the community is also welcome to contribute ideas for better preprocessing or to submit a PR.


ghost commented Sep 3, 2020

@javaintheuk In #472, I settled on this implementation: https://github.com/blue-fish/Real-Time-Voice-Cloning/compare/1d0d650...blue-fish:d692584

Trimming silences from training data is very important when working with fatchord's tacotron1 model, so I am going to bundle it with the PyTorch synthesizer.


ghost commented Sep 4, 2020

I preprocessed LibriSpeech using silence_min_duration_split of 0.2 seconds instead of the default 0.4. This breaks up all utterances with long pauses (0.4 seconds is still quite long), which might make the VAD hack unnecessary. However, this setting is only effective when alignments are available for the dataset.

Edit: I did not notice a difference in the model when splitting at 0.2 and 0.4 seconds if VAD is applied. Now trying splitting on silences of 0.05 seconds without VAD.
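
For reference, a sketch of what that change looks like; the exact file and structure of the synthesizer hparams are not shown in this thread, so treat the location as an assumption:

    # In the synthesizer hparams (location assumed, e.g. synthesizer/hparams.py):
    silence_min_duration_split=0.2,  # default 0.4; minimum pause length (seconds) at which an utterance is split
    # Note: only takes effect when alignment files are available for the dataset (e.g. LibriSpeech)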

@javaintheuk

> @javaintheuk In #472, I settled on this implementation: blue-fish/Real-Time-Voice-Cloning@1d0d650...blue-fish:d692584
>
> Trimming silences from training data is very important when working with fatchord's tacotron1 model, so I am going to bundle it with the PyTorch synthesizer.

Thanks for the update! :)

This issue was closed.