
HiFi Hybrid

A vocoder broadly based on HiFi-GAN, with the addition of some more recent proposals from BigVGAN.

Specifically:

  • The discriminator design is identical to HiFi-GAN's, as this repository does not follow BigVGAN in replacing the multi-scale discriminator (MSD) with a multi-resolution discriminator (MRD).
    • This could be done in the future, but a preliminary investigation found little difference between the two.
  • The generator design is very similar to BigVGAN's in that it employs anti-aliased multi-periodicity composition (AMP) modules, but here the low-pass filters are made trainable.

Thus, this code is essentially a hybrid between HiFi and BigVGAN.
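The trainable low-pass filter idea can be sketched roughly as follows. This is a minimal illustration, not the repository's actual module: the class name, kernel size, and cutoff are assumptions. The taps start from a standard windowed-sinc low-pass design and are registered as an `nn.Parameter`, so the optimizer can refine the frequency response rather than keeping it fixed as BigVGAN does.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TrainableLowPass(nn.Module):
    """Depthwise FIR low-pass filter with learnable taps (illustrative sketch)."""

    def __init__(self, channels: int, kernel_size: int = 11, cutoff: float = 0.25):
        super().__init__()
        half = kernel_size // 2
        t = torch.arange(kernel_size, dtype=torch.float32) - half
        # Ideal low-pass impulse response (sinc), tapered by a Hamming window.
        sinc = torch.where(
            t == 0,
            torch.tensor(2.0 * cutoff),
            torch.sin(2.0 * torch.pi * cutoff * t) / (torch.pi * t),
        )
        kernel = sinc * torch.hamming_window(kernel_size, periodic=False)
        kernel = kernel / kernel.sum()  # unity gain at DC
        # One trainable copy of the kernel per channel (depthwise filtering).
        self.taps = nn.Parameter(kernel.view(1, 1, -1).repeat(channels, 1, 1))
        self.channels = channels
        self.half = half

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time); "same" padding keeps the length fixed.
        return F.conv1d(x, self.taps, padding=self.half, groups=self.channels)
```

Because the taps are normalized to sum to one at initialization, a constant signal passes through unchanged at the start of training; gradient updates are then free to reshape the response.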

Training

python train.py /path/to/data/goes/here

Installation

pip install git+https://github.com/TariqAHassan/HifiHybrid

Requires Python 3.9+

Help

Information on training the model can be found by running the following command:

$ python train.py --help

NAME
    train.py - Train Model.

SYNOPSIS
    train.py DATA_PATH <flags>

DESCRIPTION
    Train Model.

POSITIONAL ARGUMENTS
    DATA_PATH
        Type: str
        system path where audio samples exist.

FLAGS
    --file_ext=FILE_EXT
        Type: str
        Default: 'wav'
        file extension to filter for in ``data_path``.
    --val_prop=VAL_PROP
        Type: float
        Default: 0.1
        proportion of files in ``data_path`` to use for validation
    --max_epochs=MAX_EPOCHS
        Type: int
        Default: 3200
        the maximum number of epochs to train the model for

...

Results

Initial results from this model are quite promising.

The BigVGAN paper uses several evaluation metrics (M-STFT, PESQ, MCD, etc.) which, regrettably, I have not yet had time to implement. However, a simple plot of the L1 reconstruction error over time on the Expanded Groove drum dataset is easy to obtain and still quite instructive.

This plot shows $|| mel(x) - mel(G(mel(x))) ||_{1}$ over time during validation, where $mel$ computes the mel spectrogram of a waveform, $x$ is the original audio waveform, and $G$ is the vocoding generator, which maps a mel spectrogram back to a waveform.
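The metric above can be sketched as follows. This is a minimal illustration, not the repository's code: the STFT/mel parameters (`n_fft=1024`, `hop=256`, `n_mels=80`, `sr=22050`) are assumptions, and the L1 norm is averaged over bins rather than summed.

```python
import torch


def mel_filterbank(n_mels: int, n_fft: int, sr: int) -> torch.Tensor:
    """Triangular mel filterbank (HTK mel formula); rows are mel bands."""
    hz_to_mel = lambda f: 2595.0 * torch.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    m_pts = torch.linspace(0.0, hz_to_mel(torch.tensor(sr / 2.0)).item(), n_mels + 2)
    bins = torch.floor((n_fft + 1) * mel_to_hz(m_pts) / sr).long().tolist()
    fb = torch.zeros(n_mels, n_fft // 2 + 1)
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):
            fb[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i, k] = (right - k) / max(right - center, 1)
    return fb


def mel(x: torch.Tensor, n_fft: int = 1024, hop: int = 256,
        n_mels: int = 80, sr: int = 22050) -> torch.Tensor:
    """Log-mel spectrogram of a 1D waveform x."""
    spec = torch.stft(x, n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft),
                      return_complex=True).abs()
    return torch.log(mel_filterbank(n_mels, n_fft, sr) @ spec + 1e-5)


def mel_l1(x: torch.Tensor, x_hat: torch.Tensor) -> torch.Tensor:
    """L1 distance between the mel spectrograms of two waveforms."""
    return (mel(x) - mel(x_hat)).abs().mean()
```

In practice, `x_hat` would be `G(mel(x))`, the generator's reconstruction of the waveform from its own mel spectrogram.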

The figure below shows the mel spectrograms at the end of training.

Each row contains pairs of spectrograms. The spectrograms on top are the originals and the spectrograms immediately below them are the reconstructions produced by the model.
