Voice Alignment and Conversion with Neural Networks and the WORLD codec.

JavierAntoran/tiger-costume-voice-conversion

tiger-costume Voice Conversion project

A lightweight voice conversion tool written in Python with PyTorch and NumPy. It has two main parts: a voice alignment algorithm and a regression model.

Voice Conversion
Alignment
Model Training and MLPG
Running the voice conversion
Other resources

Voice Conversion

Voice conversion is the task of modifying the speech signal of one speaker (source speaker) so that it sounds as if it had been pronounced by a different speaker (target speaker).

We use a parametric method for voice generation. Using the WORLD codec, raw waveforms are converted to fundamental frequency (f0), spectral envelope coefficients (SP) and band aperiodicity coefficients (AP). These are then converted to log_f0 and Mel Generalized Cepstral Coefficients using SPTK. To regenerate synthetic waveforms, these conversions are inverted. Note that this is a lossy process. A 25 ms window is used for utterance alignment and a 5 ms window for waveform generation.
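As a rough illustration of the framing arithmetic and the log_f0 step, here is a minimal sketch in plain Python. It is not the project's code. The 16 kHz sample rate is taken from the microphone example further down. The helper names and the sentinel value used for unvoiced frames are assumptions made for this example (WORLD reports f0 = 0 for unvoiced frames, so the logarithm needs some special case).

```python
import math

# Assumption: sentinel stored in place of log(0) for unvoiced frames.
UNVOICED_LOG_F0 = -1e10

def samples_per_frame(sample_rate_hz, window_ms):
    """Number of samples covered by a window of the given length."""
    return int(round(sample_rate_hz * window_ms / 1000.0))

def to_log_f0(f0_hz):
    """Map f0 in Hz to log f0; unvoiced frames (f0 == 0) get the sentinel."""
    return [math.log(f) if f > 0.0 else UNVOICED_LOG_F0 for f in f0_hz]

def from_log_f0(log_f0):
    """Invert to_log_f0, mapping the sentinel back to 0 Hz (unvoiced)."""
    return [math.exp(v) if v > UNVOICED_LOG_F0 else 0.0 for v in log_f0]
```

At 16 kHz, `samples_per_frame(16000, 5.0)` gives 80 samples for the generation window and `samples_per_frame(16000, 25.0)` gives 400 samples for the alignment window.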

To build a training set, we recorded ten phrases of about seven words each. Each phrase was repeated ten times by two speakers using the same microphone. We then split the phrases into words using a silence detector. We align these utterances, frame by frame, using a modified version of the Dynamic Time Warping (DTW) algorithm. We use the aligned frames to train a regression model which takes the source speaker's audio data as input and outputs the converted frame parameters, such that the reconstructed waveform matches that of the target speaker. In this case we do regression frame by frame with a feed-forward network. However, we use contextual information by including delta features (parameter time derivatives) and using Maximum Likelihood Parameter Generation (MLPG).
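For readers unfamiliar with DTW, the sketch below shows the textbook algorithm in plain Python. It is not the modified version used in this project, whose details are in the notebook. The function name, the Euclidean frame distance and the three-way step pattern are assumptions chosen for illustration.

```python
import math

def dtw_align(src, tgt):
    """Textbook DTW between two sequences of feature frames.

    src, tgt: lists of frames, each frame a list of floats.
    Returns (total_cost, path), where path is a list of (src_idx, tgt_idx).
    """
    n, m = len(src), len(tgt)
    inf = float("inf")

    def dist(a, b):
        # Assumption: Euclidean distance between frames.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # acc[i][j] = cheapest cost of aligning src[:i] with tgt[:j].
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
            acc[i][j] = dist(src[i - 1], tgt[j - 1]) + best_prev

    # Backtrack from the end of both sequences to their first frames.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        _, i, j = min(
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        )
        path.append((i - 1, j - 1))
    path.reverse()
    return acc[n][m], path
```

Given a path, aligned training pairs can be formed as `[(src[i], tgt[j]) for i, j in path]`, so that each source frame is paired with the target frame it was warped onto.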

For a more in-depth overview, see the project slides.pdf.

Alignment

Feature extraction and training corpus generation chain:

A plot of the DTW cost matrices of a word, along with the optimal path for alignment.

Two utterances of the same word from different speakers, before and after alignment.

We do this in the Dataset_Analysis.ipynb Notebook.

Note that you will have to run the notebook yourself as Plotly plots are not displayed automatically.

Model Training and MLPG

This is shown in the Run_Models.ipynb Notebook.

Note that you will have to run the notebook yourself as Plotly plots are not displayed automatically.

We use a three-layer fully connected neural network that parametrises a diagonal-covariance Normal distribution (mean and std for each output) over the regression targets. The inputs and targets are Generalized Cepstral Coefficients and log f0, together with their first and second order time gradients. We use the predicted diagonal covariance as the uncertainty parameter for MLPG. Both inputs and targets are normalized to have zero mean and unit variance.
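Two ingredients of this setup, the time gradients and the Gaussian likelihood, can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the notebook's code: the delta windows ([-0.5, 0, 0.5] and [1, -2, 1]), the edge handling (first and last frame repeated) and the use of the negative log-likelihood as the training loss are all assumed here, and the function names are hypothetical.

```python
import math

def deltas(track):
    """First-order time gradient with window [-0.5, 0, 0.5] (assumed)."""
    padded = [track[0]] + list(track) + [track[-1]]  # repeat edge frames
    return [0.5 * (padded[t + 2] - padded[t]) for t in range(len(track))]

def delta_deltas(track):
    """Second-order time gradient with window [1, -2, 1] (assumed)."""
    padded = [track[0]] + list(track) + [track[-1]]  # repeat edge frames
    return [padded[t] - 2.0 * padded[t + 1] + padded[t + 2]
            for t in range(len(track))]

def gaussian_nll(target, mean, std):
    """Negative log-likelihood of one frame under a diagonal Normal.

    target, mean, std: equal-length lists, one entry per output dimension.
    """
    return sum(
        0.5 * math.log(2.0 * math.pi * s * s) + 0.5 * ((y - m) / s) ** 2
        for y, m, s in zip(target, mean, std)
    )
```

Minimising this loss lets the network report a large std where a target is hard to predict, which is the per-output uncertainty that MLPG later consumes.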

In the following image we show the conversion of a word's fundamental frequency. Although the predicted values are temporally aligned with the source values, they take the shape of the target waveform. Additionally, we can see how MLPG uses slope information in order to produce a smooth output.
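To make the role of slope information concrete, here is a minimal MLPG sketch for a single one-dimensional parameter track, written in plain Python so it runs without dependencies. It is an illustration rather than the project's implementation. It solves the usual weighted least-squares problem (WᵀPW)c = WᵀPμ, where W stacks the static, delta and delta-delta windows and P holds the inverse predicted variances. The window coefficients, the edge handling (indices clamped to the track) and all names are assumptions.

```python
# Assumed windows over frames (t-1, t, t+1).
WINDOWS = [
    [0.0, 1.0, 0.0],    # static
    [-0.5, 0.0, 0.5],   # delta
    [1.0, -2.0, 1.0],   # delta-delta
]

def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def mlpg(means, variances):
    """Most likely static track given per-frame Gaussian predictions.

    means[t][k], variances[t][k]: predicted mean and variance at frame t of
    stream k (0 = static, 1 = delta, 2 = delta-delta).
    """
    T = len(means)
    a = [[0.0] * T for _ in range(T)]   # accumulates W^T P W
    b = [0.0] * T                       # accumulates W^T P mu
    for t in range(T):
        for k, window in enumerate(WINDOWS):
            # One sparse row of W; out-of-range frames are clamped to the edge.
            row = {}
            for offset in (-1, 0, 1):
                col = min(max(t + offset, 0), T - 1)
                row[col] = row.get(col, 0.0) + window[offset + 1]
            precision = 1.0 / variances[t][k]
            for i, wi in row.items():
                b[i] += precision * wi * means[t][k]
                for j, wj in row.items():
                    a[i][j] += precision * wi * wj
    return solve_linear(a, b)
```

Frames whose delta predictions have low variance pull neighbouring static values toward a consistent slope, which is what smooths the generated track. A real implementation would exploit the banded structure of WᵀPW instead of the dense solve used here.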

Running the voice conversion

This script was developed to run on a low-resource device such as a Raspberry Pi 3. All dependencies must be installed before running it.

python hobbes.py [-dn  DATASET_NAME -ts_num SAMPLE_NUM] [--harvest]

# Use an audio sample with stonemask algorithm
# Output conversion will be otorrino_10_dio.wav
python hobbes.py -dn otorrino -ts_num 10 

# Capture audio from microphone at 16 kHz for 5 seconds and use harvest algorithm
# Output conversion will be acq_harvest.wav
python hobbes.py --harvest

Other resources

This project uses the WORLD vocoder (http://www.kki.yamanashi.ac.jp/~mmorise/world/english/), accessed through the pyWORLD wrapper: https://github.com/JeremyCCHsu/Python-Wrapper-for-World-Vocoder

It also uses the Speech Signal Processing Toolkit, SPTK (http://sp-tk.sourceforge.net), accessed through the pySPTK wrapper: https://github.com/r9y9/pysptk

An assortment of voice generation / conversion publications can be found in the papers folder.

Other VC repos: