Implementation of MelNet: A Generative Model for Audio in the Frequency Domain (Work in progress)
- Tested with Python 3.6.8, PyTorch 1.2.0.
pip install -r requirements.txt
How to train
- Download train data: You may use either Blizzard(22,050Hz) or VoxCeleb2(16,000Hz) data.
Files with the wav extension can be used, but to use them you need to change the hardcoded
file extension at datasets/wavloader.py#L38. This will be fixed soon.
python trainer.py -c config/voxceleb2.yaml -n [name of run] -t [tier number] -b [batch size]
- You may need to adjust the batch size for each tier. On a Tesla V100(32GB), b=4 for t=1 and b=8 for t=2 were tested.
- We found that only the SGD optimizer with lr=0.0001, momentum=0 works properly. Other optimizers like RMSProp or Adam have led to severe instability of the loss.
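
For reference, the optimizer setting above corresponds roughly to the following PyTorch call. This is a minimal sketch, not the code in trainer.py: the helper name make_optimizer and the small nn.Linear model are hypothetical stand-ins used only to make the example runnable.

```python
import torch
import torch.nn as nn


def make_optimizer(model: nn.Module) -> torch.optim.Optimizer:
    # Plain SGD with the settings this README reports as stable.
    # (Hypothetical helper; trainer.py may build its optimizer differently.)
    return torch.optim.SGD(model.parameters(), lr=0.0001, momentum=0)


# Hypothetical stand-in model; the real MelNet tiers are not constructed here.
model = nn.Linear(4, 2)
optimizer = make_optimizer(model)

# One illustrative update step on random data.
x = torch.randn(8, 4)
loss = model(x).pow(2).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

With momentum=0 each step is simply the parameter minus lr times its gradient, so the small learning rate is the only thing controlling the update size.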
- Implement upsampling procedure
- GMM sampling + loss function
- Unconditional audio generation
- TTS synthesis (PR #3 is in review)
- Tensorboard logging
- Multi-GPU training