GitHub - AchyutBurlakoti/Neural-Audio-Compression: Neural audio codecs that use end-to-end approaches have gained popularity due to their ability to learn efficient audio representations through data-driven methods, without relying on handcrafted signal processing components.

Neural Audio Codec

Neural audio codecs that use end-to-end approaches have gained popularity due to their ability to learn efficient audio representations through data-driven methods, without relying on handcrafted signal processing components. These codecs use autoencoder networks with quantized hidden features. They were applied in early work on speech coding and in more recent studies that used a deep convolutional network for speech compression. While most of these works target speech coding at low bitrates, several studies have demonstrated efficient compression of audio using neural networks.

Download the model from: https://drive.google.com/file/d/1xc7-heD1JIf2BOA02Ta5YUpJguDSmGZY/view?usp=sharing
For more information, please read report.pdf in /reports/.

Architecture

Quantization

Quantization is a fundamental process in data compression. Its main job is to discretize a continuous latent space by building a codebook. In audio and image compression, quantization is commonly used to represent high-dimensional data with lower-dimensional embeddings. The quantizer builds the codebook for these embeddings, so that each embedding can be stored as the index of its nearest neighbor in the codebook. This process is called vector quantization, and it involves grouping similar embeddings together into clusters. The codebook consists of the centroids of these clusters, each of which is represented by a discrete symbol.
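
The nearest-neighbor lookup described above can be sketched as follows. This is a minimal, dependency-free illustration and not the code in /src/; the toy codebook, the embeddings, and the function names are assumptions made for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(embedding, codebook):
    """Return the index of the codebook entry closest to `embedding`."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(embedding, codebook[i]))


# Toy codebook with three 2-D centroids (illustrative values only).
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
embeddings = [[0.1, -0.2], [3.5, 4.2], [0.9, 1.3]]

# Encoding stores only the indices; decoding looks the centroids back up.
indices = [quantize(e, codebook) for e in embeddings]
reconstructed = [codebook[i] for i in indices]

print(indices)  # [0, 2, 1]
```

Only the indices need to be stored or transmitted, which is where the compression comes from: a codebook with K entries costs log2(K) bits per embedding.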

Plain vector quantization has limitations, and to address them we decided to use Residual Vector Quantization (RVQ). RVQ also provides adaptive bitrate: instead of using every quantizer codebook, you can use any number of them, trading quality for a lower bitrate.
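
A rough sketch of the idea, assuming the usual RVQ formulation in which each stage quantizes the residual left by the previous stage. The codebook values, the function names, and the `num_quantizers` parameter are hypothetical and are not taken from the repository.

```python
def nearest(vector, codebook):
    """Index of the codebook entry closest to `vector` (squared Euclidean)."""
    return min(range(len(codebook)),
               key=lambda i: sum((v - c) ** 2 for v, c in zip(vector, codebook[i])))


def rvq_encode(vector, codebooks, num_quantizers=None):
    """Quantize in stages; each stage codes what the previous stages missed."""
    residual = list(vector)
    indices = []
    # Slicing with None keeps every codebook; a smaller number lowers the bitrate.
    for codebook in codebooks[:num_quantizers]:
        idx = nearest(residual, codebook)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(indices, codebooks):
    """Sum the selected entry of each stage to rebuild the vector."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Two toy stages: a coarse codebook followed by a fine one.
codebooks = [
    [[0.0, 0.0], [2.0, 2.0], [4.0, 4.0]],
    [[-0.5, -0.5], [0.0, 0.0], [0.5, 0.5]],
]
vector = [2.4, 2.6]

coarse = rvq_decode(rvq_encode(vector, codebooks, num_quantizers=1), codebooks)
full = rvq_decode(rvq_encode(vector, codebooks), codebooks)
print(coarse, full)  # [2.0, 2.0] [2.5, 2.5]
```

Decoding with fewer indices still gives a valid but coarser reconstruction, which is the quality-for-bitrate trade described above.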

quantization.mp4

Training the model with your own data from scratch

Put all your .wav files in /data/input/
Run the following commands in the root dir:

pip install .
python train.py

All other functionality related to using the model can be found in the root dir and the /src/ folder.

Sources used for the completion of the project:

Future Uses

The developed model is an autoencoder (a self-associative network) that learns a representation of the data by compressing high-dimensional audio into a discrete latent space (i.e. through vector quantization). Because the model learns the audio representation well, it can be reused in downstream tasks such as text-to-speech, audio generation, and other forms of audio modeling.

Further improvements that can be done

We were unable to try the model with the HiFi-GAN discriminator due to a lack of memory capacity. Anyone can try it out, as the code for HiFi-GAN training is also provided in the /src/ folder.
The calculated MUSHRA scores still do not represent the model's efficiency well, because there was no proper experimental setup for calculating them.

Custom File Format (.nac)

Like other audio codecs such as MP3 and FLAC, which have their own file formats, our neural audio codec also has its own file format, called .nac (neural audio codec).

Byte order : network big endian
Header format (9 bytes) :
- 3 bytes: magic string
- 1 byte : version number
- 4 bytes: metadata length
- 1 byte : bit rate
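
The 9-byte layout above maps directly onto Python's standard `struct` module. The following is only a sketch of how such a header could be packed and parsed. The magic value `b"NAC"`, the field order on disk, and the choice to store the bit rate in kbps are assumptions; the README does not state them, and the repository's own code may differ.

```python
import struct

# ">" selects big-endian (network) byte order with no padding.
# 3s = 3-byte magic, B = 1-byte version, I = 4-byte metadata length, B = 1-byte bit rate.
HEADER = struct.Struct(">3sBIB")

MAGIC = b"NAC"  # hypothetical magic string, not confirmed by the README


def pack_header(version, metadata_length, bit_rate):
    """Build the 9-byte header (bit_rate assumed to be in kbps)."""
    return HEADER.pack(MAGIC, version, metadata_length, bit_rate)


def unpack_header(data):
    """Parse the first 9 bytes and check the magic string."""
    magic, version, metadata_length, bit_rate = HEADER.unpack(data[:HEADER.size])
    if magic != MAGIC:
        raise ValueError("not a .nac file")
    return version, metadata_length, bit_rate


header = pack_header(version=1, metadata_length=42, bit_rate=24)
print(HEADER.size, unpack_header(header))  # 9 (1, 42, 24)
```

Using an explicit `>` prefix matters here: without it, `struct` would use native alignment and the header could come out longer than 9 bytes.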

Result

The following results are reconstructions of audio compressed at a 24 kbps bitrate, i.e. only 24000 bits are needed to represent a 1 s audio clip, which is about 2.9 KB in total for a 16000 Hz audio waveform.
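
The size figure above can be checked with a few lines of arithmetic. The comparison against uncompressed audio assumes a 16-bit mono PCM source, which the README does not state.

```python
BITRATE_BPS = 24_000      # 24 kbps, from the text above
SAMPLE_RATE_HZ = 16_000   # from the text above
PCM_BITS_PER_SAMPLE = 16  # assumption: 16-bit mono PCM source

# Compressed size of one second of audio.
compressed_bytes_per_second = BITRATE_BPS / 8                    # 3000.0 bytes
compressed_kib_per_second = compressed_bytes_per_second / 1024   # about 2.93

# Uncompressed PCM bitrate and the resulting compression ratio.
pcm_bps = SAMPLE_RATE_HZ * PCM_BITS_PER_SAMPLE                   # 256000 bits/s
compression_ratio = pcm_bps / BITRATE_BPS                        # about 10.67

print(round(compressed_kib_per_second, 2), round(compression_ratio, 2))  # 2.93 10.67
```

Under that assumption the ratio is roughly 10.7x, which is in the same range as the 1.63 MB to 153 KB reduction reported for the speech example.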

speech audio

original audio (1.63 MB in .wav format)

original_speech.mp4

reconstructed 24 kbps audio (153 KB in .nac format)

reconstructed_speechwav.mp4

piano audio

original audio

original_pianon.mp4

reconstructed audio

reconstructed_piano.mp4
