from mel_spectrogram to wav again #10

Open
kimchi88 opened this issue Jun 11, 2019 · 26 comments

Comments

@kimchi88

Hi,
Do you have any suggestions on how to rebuild the audio file after augmentation?

@KnowBetterHelps

I want to ask the same question, too. In my case, using librosa.feature.melspectrogram and then computing librosa.feature.mfcc does not give the same result as Kaldi's process.

BTW, did you find a way to rebuild the audio?

@kimchi88
Author

kimchi88 commented Jul 3, 2019

Hi,
Nope, still nothing. I've read some other posts and it doesn't seem trivial. There is a thread in the Kaldi GitHub repository where developers discuss their findings after applying SpecAugment to existing Kaldi recipes. Hope it helps!

@KnowBetterHelps

Thank you for your kind reply.

I will look for it.

@dkakaie

dkakaie commented Jul 3, 2019

I spent a few hours on this yesterday. This is what I finally settled on, at least for now. Sorry for the delay in sharing it.
The new version of librosa seems to include the functionality we need here, see #844. However, it has not been released yet, so you have to install it from source. Version 0.7.0rc1 is what I used.
You could do

recov = librosa.feature.inverse.mel_to_audio(M=warped_masked_spectrogram,
    hop_length=128, sr=sampling_rate)

and use this function to save it

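import numpy as np
import scipy.io.wavfile

def save_wav(wav, path, sr=16000):  # sr must match the rate used for reconstruction
    wav = wav * (32767 / max(0.01, np.max(np.abs(wav))))  # peak-normalize to the int16 range
    scipy.io.wavfile.write(path, sr, wav.astype(np.int16))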
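
One thing to watch with the mel_to_audio call above: it rebuilds the mel filter bank from its own arguments, so sr, n_fft, hop_length, and any filter-bank options such as fmax need to match whatever produced the spectrogram. A minimal sketch, assuming the spectrogram came from librosa.feature.melspectrogram with the values shown here (MEL_PARAMS, to_mel, and to_audio are hypothetical names, not part of librosa or this repository):

```python
# Shared settings so the forward and inverse transforms cannot drift apart.
# The values are assumptions for illustration; use whatever produced your spectrogram.
MEL_PARAMS = {"n_fft": 2048, "hop_length": 128, "fmax": 8000}


def to_mel(audio, sr, n_mels=256):
    import librosa  # imported lazily so this module loads without librosa installed

    return librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=n_mels, **MEL_PARAMS)


def to_audio(mel, sr):
    import librosa

    # n_mels is deliberately not passed: librosa infers it from the shape of mel.
    return librosa.feature.inverse.mel_to_audio(mel, sr=sr, **MEL_PARAMS)
```

Keeping the parameters in one place is only a convenience; the point is that the inverse call sees the same filter bank as the forward call.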

@kimchi88
Author

kimchi88 commented Jul 5, 2019

Hi Roxima,
Thanks for sharing! I'll give it a try :)

@dkakaie

dkakaie commented Jul 5, 2019

@kimchi88 Great. Looking forward to your results.

@kimchi88
Author

kimchi88 commented Jul 5, 2019

Confirmed! It works perfectly. The next step will be to use the augmented audio to improve ASR. Thanks for the help!

@darisettysuneel

Hi @roxima / @kimchi88,

Can you please confirm how long it takes to convert from mel-spectrogram to wav, and what your hardware configuration is? For me it takes 2 to 3 minutes on a CPU with 6 cores and 8 GB of RAM.

@dkakaie

dkakaie commented Jul 23, 2019

@darisettysuneel As far as I can remember, it finishes very quickly. What took time was the augmentation, not saving the resulting audio. I'll try to report back to you with a simple benchmark.

@darisettysuneel

Hi @roxima

Can I get any statistics?

@Lomax314

@roxima Hi, converting the mel_spectrogram to wav takes more time for me than augmenting the wav. Do you have a better solution? Thanks.

@dkakaie

dkakaie commented Aug 20, 2019

@darisettysuneel @Lomax314 So sorry for being late, I was as busy as a bee.
I'm on Windows 10 x64, i3-6100U, 8 GB DDR4 RAM, 128 GB SSD storage.
This is the result for the default sample audio in the repository:

Loaded audio in  0:00:00.509608
Tensorflow finished in  0:00:02.145270
librosa reconstructed audio in  0:00:25.873811
Audio saved in  0:00:00.005016
PyTorch finished in  0:00:00.050832
librosa reconstructed audio in  0:00:29.923980
Audio saved in  0:00:00.004015

As can be seen, reconstructing the audio takes much more time than the augmentations. However, I noticed that running this script uses more than 8 GB of free space on my OS drive, so maybe there is an I/O bottleneck? While it runs, I have only 141 MB of free space left.
No, I have not found a better solution. Maybe librosa isn't fully optimized for this stage yet.
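
The durations above have the form Python prints for datetime.timedelta values. A minimal sketch of how per-stage timings like these could be collected, assuming nothing about the script that was actually used (the timed helper is hypothetical):

```python
import time
from datetime import timedelta


def timed(label, func, *args, **kwargs):
    """Run func, print '<label> in  <elapsed>', and return (result, elapsed)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = timedelta(seconds=time.perf_counter() - start)
    print(f"{label} in  {elapsed}")
    return result, elapsed


# Stand-in workload; in practice this would be the load, augment,
# reconstruct, or save step.
total, elapsed = timed("Summed numbers", sum, range(1_000_000))
```

Timing each stage separately like this is what makes it possible to tell reconstruction cost apart from augmentation cost.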

@dkakaie

dkakaie commented Aug 20, 2019

The previous results used librosa 0.7.0rc1; these are for the latest 0.7.0 release:

Loaded audio in  0:00:00.512629
Tensorflow finished in  0:00:02.180432
librosa reconstructed audio in  0:00:20.358577
Audio saved in  0:00:00.006011
PyTorch finished in  0:00:00.045847
librosa reconstructed audio in  0:00:43.839765
Audio saved in  0:00:00.004988

One more run:

Loaded audio in  0:00:00.505621
Tensorflow finished in  0:00:02.230296
librosa reconstructed audio in  0:00:32.860149
Audio saved in  0:00:00.006980
PyTorch finished in  0:00:00.052857
librosa reconstructed audio in  0:00:46.224405
Audio saved in  0:00:00.005985

@darisettysuneel

@roxima Thanks for sharing the statistics! May I know the length of the audio file used for these results?

@dkakaie

dkakaie commented Aug 20, 2019

@darisettysuneel You're welcome. Exactly 2 s 970 ms.
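
With the clip length known, the earlier timings can be put in relative terms. A quick calculation using the first reported reconstruction time (0:00:25.873811) and the 2.970 s clip length; "real-time factor" is just a label chosen here for the ratio:

```python
from datetime import timedelta

clip_length = timedelta(seconds=2, milliseconds=970)
reconstruction = timedelta(seconds=25, microseconds=873811)

# Seconds of processing per second of audio.
real_time_factor = reconstruction / clip_length
print(f"{real_time_factor:.1f}x slower than real time")
```

The same ratio can be computed for the other runs to compare them on equal footing.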

@darisettysuneel

darisettysuneel commented Aug 20, 2019

@roxima For me it takes ~1.5 minutes for 8-10 s of audio. I need to take a look at the input data to the reconstruction function. Once again, thanks.

@Lomax314

@roxima Thanks a lot for your reply! The librosa function takes a long time for me, so I hope I can find another solution. Once again, thanks.

@AASHISHAG

AASHISHAG commented Nov 27, 2019

@darisettysuneel @Lomax314 : Did you find a better method to achieve this?

@Lomax314

@AASHISHAG I'm sorry, but the answer is no. However, this method seems to be implemented in the Kaldi repository.

@AASHISHAG

AASHISHAG commented Nov 28, 2019

@Lomax314 : Thank you for the reply. I will have a look.

If you still have the setup running, could you please tell me your tensorflow, tensorflow_addons, and gcc versions? I am trying to run the test script given in the README, but I get errors on the import from specAugment import spec_augment_tensorflow in this script:

import glob
import librosa
import numpy as np
import scipy.io.wavfile  # "import scipy" alone does not guarantee scipy.io.wavfile is loaded
from specAugment import spec_augment_tensorflow

mozilla_augmented = '/mozilla_augmented/clips/*.wav'

for audio_path in glob.iglob(mozilla_augmented):
    print(audio_path)
    # librosa.load resamples to 22050 Hz unless sr is given
    audio, sampling_rate = librosa.load(audio_path)
    mel_spectrogram = librosa.feature.melspectrogram(y=audio,
                                                     sr=sampling_rate,
                                                     n_mels=256,
                                                     hop_length=128,
                                                     fmax=8000)
    warped_masked_spectrogram = spec_augment_tensorflow.spec_augment(mel_spectrogram=mel_spectrogram)
    # fmax must match the forward transform so the same mel filter bank is inverted
    wav = librosa.feature.inverse.mel_to_audio(M=warped_masked_spectrogram, hop_length=128,
                                               sr=sampling_rate, fmax=8000)
    wav *= 32767 / max(0.01, np.max(np.abs(wav)))
    # Write with the rate the samples actually have; note that this overwrites the input file
    scipy.io.wavfile.write(audio_path, sampling_rate, wav.astype(np.int16))
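
Note that librosa.load resamples to 22050 Hz unless sr is given, so the rate written to the WAV header has to be that same rate (or the audio has to be resampled first). A small arithmetic sketch of what a mismatched header would do to playback, with an assumed 10-second clip:

```python
samples_rate = 22050   # librosa.load default when sr is not specified
header_rate = 16000    # rate passed to the WAV writer
clip_seconds = 10      # assumed clip length, for illustration only

n_samples = samples_rate * clip_seconds
played_seconds = n_samples / header_rate   # players trust the header
slowdown = samples_rate / header_rate

print(f"{clip_seconds} s of audio plays back in {played_seconds:.2f} s "
      f"({slowdown:.3f}x slower and lower in pitch)")
```

For speech recognition data this matters because a stretched, pitch-shifted clip no longer matches its transcript timing or the rate the model expects.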

@junaedifahmi

> @roxima For me it is taking ~1.5 minutes for 8-10sec audio. I need to take a look at input data to reconstruction function. Once again thanks.

It takes 10 minutes for 10 seconds of audio for me. The machine has 88 cores and 500 GB of memory, and I use the latest code to convert to audio. Do you have any better solution? Maybe with torchaudio? Thanks.

@AASHISHAG

AASHISHAG commented Dec 12, 2019

@juunnn : Could you please confirm your tensorflow and gcc versions? I am facing a dependency issue, and I think it has to do with tensorflow and gcc.
It would be best if you could share the output of the following command: pip3 list

This will list all the versions.

@junaedifahmi

I still have problems with the TensorFlow dependencies, which is why I use PyTorch instead. It works and doesn't take long to execute, but for some audio it says "output have no finite value everywhere" while converting back to audio. I don't know what to do.
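
A message like this suggests that the array handed to the reconstruction step contains NaN or infinite values. Whether that is the cause here is an assumption; nothing in this thread confirms it. A minimal sketch that checks an augmented spectrogram and patches bad entries before inversion (sanitize_spectrogram is a hypothetical helper):

```python
import numpy as np


def sanitize_spectrogram(mel):
    """Return a copy with NaN/inf replaced by 0 and negative values clipped to 0."""
    mel = np.asarray(mel, dtype=np.float64)
    n_bad = int(np.count_nonzero(~np.isfinite(mel)))
    if n_bad:
        print(f"replacing {n_bad} non-finite values")
    cleaned = np.nan_to_num(mel, nan=0.0, posinf=0.0, neginf=0.0)
    return np.clip(cleaned, 0.0, None)


example = np.array([[1.0, np.nan], [np.inf, -0.5]])
cleaned = sanitize_spectrogram(example)
```

Zeroing bad entries only hides the symptom; if the count is large, it is worth finding out which augmentation step produces them.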

@AASHISHAG

@juunnn : Could you please share the code that you wrote with the PyTorch dependencies? I have no experience with either PyTorch or TensorFlow, so it would be really helpful.

I am using the code below and facing dependency issues.

import glob
import librosa
import numpy as np
import scipy.io.wavfile  # "import scipy" alone does not guarantee scipy.io.wavfile is loaded
from specAugment import spec_augment_tensorflow

mozilla_augmented = '/mozilla_augmented/clips/*.wav'

for audio_path in glob.iglob(mozilla_augmented):
    print(audio_path)
    # librosa.load resamples to 22050 Hz unless sr is given
    audio, sampling_rate = librosa.load(audio_path)
    mel_spectrogram = librosa.feature.melspectrogram(y=audio,
                                                     sr=sampling_rate,
                                                     n_mels=256,
                                                     hop_length=128,
                                                     fmax=8000)
    warped_masked_spectrogram = spec_augment_tensorflow.spec_augment(mel_spectrogram=mel_spectrogram)
    # fmax must match the forward transform so the same mel filter bank is inverted
    wav = librosa.feature.inverse.mel_to_audio(M=warped_masked_spectrogram, hop_length=128,
                                               sr=sampling_rate, fmax=8000)
    wav *= 32767 / max(0.01, np.max(np.abs(wav)))
    # Write with the rate the samples actually have; note that this overwrites the input file
    scipy.io.wavfile.write(audio_path, sampling_rate, wav.astype(np.int16))

@ma7555

ma7555 commented Aug 15, 2020

It does indeed take a lot of time to convert from a mel spectrogram to audio. If someone comes across a faster way than the librosa built-in, please share.

For a 1-minute audio clip with 128 mels:

CPU times: user 8min 32s, sys: 5min 11s, total: 13min 43s
Wall time: 7min 14s
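
In librosa, mel_to_audio first estimates a linear-frequency magnitude spectrogram from the mel bands by solving a non-negative least-squares problem and then runs Griffin-Lim; the first step can be the slow part for long clips. One commonly used approximation, not something anyone in this thread has tested, replaces it with a pseudo-inverse of the mel filter bank and lowers the Griffin-Lim iteration count. A rough sketch, assuming a power mel spectrogram (librosa's default) and accepting some loss in audio quality; both function names are hypothetical:

```python
import numpy as np


def mel_to_linear_pinv(mel_power, mel_basis):
    """Approximate a linear-frequency magnitude spectrogram from a power mel spectrogram.

    mel_power: (n_mels, n_frames) array.
    mel_basis: (n_mels, 1 + n_fft // 2) mel filter bank.
    """
    inverse_basis = np.linalg.pinv(mel_basis)            # (1 + n_fft // 2, n_mels)
    # The pseudo-inverse can produce small negative values, so clip them to zero.
    linear_power = np.maximum(0.0, inverse_basis @ mel_power)
    return np.sqrt(linear_power)                         # power -> magnitude


def fast_mel_to_audio(mel_power, sr, n_fft=2048, hop_length=128, n_iter=16, **mel_kwargs):
    import librosa  # lazy import: only needed for the filter bank and Griffin-Lim

    # mel_kwargs (for example fmax) must match the forward transform.
    mel_basis = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=mel_power.shape[0], **mel_kwargs)
    magnitude = mel_to_linear_pinv(mel_power, mel_basis)
    return librosa.griffinlim(magnitude, n_iter=n_iter, hop_length=hop_length)
```

Whether this is fast enough, and whether the quality is acceptable for ASR augmentation, would need to be measured on your own data.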

@neel04

neel04 commented Apr 20, 2021

Any new updates for possibly faster implementations?
