
HiFi Hybrid

A vocoder broadly based on HiFi-GAN, with the addition of some more recent proposals from BigVGAN.

Specifically:

  • The discriminator design is identical to HiFi-GAN's, as this repository does not follow BigVGAN in replacing the multi-scale discriminator (MSD) with a multi-resolution discriminator (MRD).
    • This could be done in the future, but a preliminary investigation found little difference between the two.
  • The generator design is very similar to BigVGAN's in that it employs anti-aliased multi-periodicity composition (AMP) modules, but here the low-pass filters are made trainable.

Thus, this code is essentially a hybrid between HiFi and BigVGAN.
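The trainable low-pass filter idea can be sketched roughly as follows. This is a minimal illustration, not the repository's actual module: the class name, kernel size, and cutoff are assumptions. The taps start from a standard windowed-sinc low-pass design and are registered as an `nn.Parameter`, so the optimizer can refine the frequency response rather than keeping it fixed as BigVGAN does.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TrainableLowPass(nn.Module):
    """Depthwise FIR low-pass filter with learnable taps (illustrative sketch)."""

    def __init__(self, channels: int, kernel_size: int = 11, cutoff: float = 0.25):
        super().__init__()
        half = kernel_size // 2
        t = torch.arange(kernel_size, dtype=torch.float32) - half
        # Ideal low-pass impulse response (sinc), tapered by a Hamming window.
        sinc = torch.where(
            t == 0,
            torch.tensor(2.0 * cutoff),
            torch.sin(2.0 * torch.pi * cutoff * t) / (torch.pi * t),
        )
        kernel = sinc * torch.hamming_window(kernel_size, periodic=False)
        kernel = kernel / kernel.sum()  # unity gain at DC
        # One trainable copy of the kernel per channel (depthwise filtering).
        self.taps = nn.Parameter(kernel.view(1, 1, -1).repeat(channels, 1, 1))
        self.channels = channels
        self.half = half

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time); "same" padding keeps the length fixed.
        return F.conv1d(x, self.taps, padding=self.half, groups=self.channels)
```

Because the taps are normalized to sum to one at initialization, a constant signal passes through unchanged at the start of training; gradient updates are then free to reshape the response.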

Training

python train.py /path/to/data/goes/here

Installation

pip install git+https://github.com/TariqAHassan/HifiHybrid

Requires Python 3.9+

Help

Information on training the model can be found by running the following command:

$ python train.py --help

NAME
    train.py - Train Model.

SYNOPSIS
    train.py DATA_PATH <flags>

DESCRIPTION
    Train Model.

POSITIONAL ARGUMENTS
    DATA_PATH
        Type: str
        system path where audio samples exist.

FLAGS
    --file_ext=FILE_EXT
        Type: str
        Default: 'wav'
        file extension to filter for in ``data_path``.
    --val_prop=VAL_PROP
        Type: float
        Default: 0.1
        proportion of files in ``data_path`` to use for validation
    --max_epochs=MAX_EPOCHS
        Type: int
        Default: 3200
        the maximum number of epochs to train the model for

...

Results

Initial results from this model are quite promising.

The BigVGAN paper uses several evaluation metrics (M-STFT, PESQ, MCD, etc.) which, regrettably, I have not yet had time to implement. However, a simple plot of the L1 reconstruction error over time on the Expanded Groove drum dataset is easy to obtain and still quite instructive.

This plot shows $|| mel(x) - mel(G(mel(x))) ||_{1}$ over time during validation, where $mel$ computes the mel spectrogram of a waveform, $x$ is the original audio waveform, and $G$ is the vocoding generator, which maps a mel spectrogram back to a waveform.
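The metric above can be sketched as follows. This is a minimal illustration, not the repository's code: the STFT/mel parameters (`n_fft=1024`, `hop=256`, `n_mels=80`, `sr=22050`) are assumptions, and the L1 norm is averaged over bins rather than summed.

```python
import torch


def mel_filterbank(n_mels: int, n_fft: int, sr: int) -> torch.Tensor:
    """Triangular mel filterbank (HTK mel formula); rows are mel bands."""
    hz_to_mel = lambda f: 2595.0 * torch.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    m_pts = torch.linspace(0.0, hz_to_mel(torch.tensor(sr / 2.0)).item(), n_mels + 2)
    bins = torch.floor((n_fft + 1) * mel_to_hz(m_pts) / sr).long().tolist()
    fb = torch.zeros(n_mels, n_fft // 2 + 1)
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):
            fb[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i, k] = (right - k) / max(right - center, 1)
    return fb


def mel(x: torch.Tensor, n_fft: int = 1024, hop: int = 256,
        n_mels: int = 80, sr: int = 22050) -> torch.Tensor:
    """Log-mel spectrogram of a 1D waveform x."""
    spec = torch.stft(x, n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft),
                      return_complex=True).abs()
    return torch.log(mel_filterbank(n_mels, n_fft, sr) @ spec + 1e-5)


def mel_l1(x: torch.Tensor, x_hat: torch.Tensor) -> torch.Tensor:
    """L1 distance between the mel spectrograms of two waveforms."""
    return (mel(x) - mel(x_hat)).abs().mean()
```

In practice, `x_hat` would be `G(mel(x))`, the generator's reconstruction of the waveform from its own mel spectrogram.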

The figure below shows the mel spectrograms at the end of training.

Each row contains pairs of spectrograms. The spectrograms on top are the originals and the spectrograms immediately below them are the reconstructions produced by the model.
