Description
I believe I am running into an issue when training both from scratch and from the pre-trained Tacotron 2 model.
I have collected roughly 14 to 17 hours of pre-processed wav files of Obama speaking. Each file was first normalized with ffmpeg-normalize and then resampled to the recommended 22050 Hz.
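For reference, here is a minimal sketch of the resampling step (done here with librosa/soundfile rather than ffmpeg; the directory names `wavs_normalized` and `wavs_22050` are placeholders):

```python
# Sketch of the resample step, assuming the loudness-normalized wavs live in
# "wavs_normalized/" and the 22050 Hz copies are written to "wavs_22050/".
import librosa
import soundfile as sf
from pathlib import Path

TARGET_SR = 22050  # sample rate recommended for Tacotron 2

def resample_wavs(in_dir: str, out_dir: str) -> None:
    """Resample every .wav in in_dir to 22050 Hz mono and write it to out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for wav_path in sorted(Path(in_dir).glob("*.wav")):
        audio, _ = librosa.load(wav_path, sr=TARGET_SR, mono=True)  # load and resample
        sf.write(out / wav_path.name, audio, TARGET_SR)

if __name__ == "__main__":
    resample_wavs("wavs_normalized", "wavs_22050")
```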
I have ensured that (a quick verification script is sketched below):
- the sampling rate of each wav file is 22050 Hz
- there is only a single speaker: Obama
- the speech contains a variety of phonemes
- each audio file is split into segments of 10 seconds
- no audio segment has silence at the beginning or end of the file
- no audio segment contains long internal silences
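A script along these lines can double-check the items above; the directory name, 0.3 s silence tolerance, and 40 dB trim threshold are placeholders I chose, not values from this repo:

```python
# Quick sanity checks for the segments: sample rate, duration, and
# leading/trailing silence.
import librosa
from pathlib import Path

def check_clip(path, sr_expected=22050, max_len_s=10.0, top_db=40):
    """Return a list of problems found in one wav segment (empty list = OK)."""
    audio, sr = librosa.load(path, sr=None)  # keep the file's native sample rate
    problems = []
    if sr != sr_expected:
        problems.append(f"sample rate {sr} != {sr_expected}")
    duration = len(audio) / sr
    if duration > max_len_s:
        problems.append(f"length {duration:.1f}s > {max_len_s}s")
    # librosa.effects.trim returns the indices of the non-silent core; a large
    # gap on either side means the clip still has leading/trailing silence.
    _, (start, end) = librosa.effects.trim(audio, top_db=top_db)
    if start / sr > 0.3 or (len(audio) - end) / sr > 0.3:
        problems.append("more than 0.3s of silence at the start or end")
    return problems

if __name__ == "__main__":
    for wav in sorted(Path("wavs_22050").glob("*.wav")):
        issues = check_clip(wav)
        if issues:
            print(wav.name, "->", "; ".join(issues))
```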
Here is a link to a drive containing the wav files for inspection:
https://drive.google.com/drive/folders/17RoPoNhcU6ovW0BBkONt3WEXf6ZvuUwF?usp=download
Here are links to the two formatted .txt files (train and val):
Train .txt file: https://drive.google.com/file/d/1dxTkagpAT43jP06QAeODWS92GmuqdPqz/view?usp=sharing
Validation .txt file: https://drive.google.com/file/d/1dtaHPWTFdXLM1QdOVb2V9H2a_VMKVWRg/view?usp=sharing
I formatted the .txt files in the same way as the LJSpeech dataset and used wav2vec 2.0 for the transcriptions. I made sure that any spaces at the start and end of each transcription are removed, that a period is added at the end of each transcript, and that each transcript is on its own line.
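A small helper along these lines produces the filelists; the example path and transcript are hypothetical, and the exact column layout should match whatever the loader used by train.py expects:

```python
# Sketch of writing an LJSpeech-style filelist ("wav_path|transcript",
# one line per clip), with whitespace stripped and a terminal period added.
from pathlib import Path

def write_filelist(pairs, out_path):
    """pairs: iterable of (wav_path, raw_transcript) tuples."""
    lines = []
    for wav_path, text in pairs:
        text = text.strip()          # drop leading/trailing spaces
        if not text.endswith("."):
            text += "."              # make sure every transcript ends with a period
        lines.append(f"{wav_path}|{text}")
    Path(out_path).parent.mkdir(parents=True, exist_ok=True)
    Path(out_path).write_text("\n".join(lines) + "\n", encoding="utf-8")

if __name__ == "__main__":
    write_filelist(
        [("wavs_22050/obama_0001.wav", " good evening everybody ")],
        "filelists/obama_train.txt",
    )
```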
The train.py script runs without errors, and the directory paths and naming conventions are correct.
Here is what the training inference plots look like at epochs 0, 50, 100, and 250:
[Epoch 0 plot]
[Epoch 50 plot]
[Epoch 100 plot]
[Epoch 250 plot]
Is this how the charts should be looking? Any help would be appreciated!



