
How to map to sound time series? #34

Closed · williamFalcon opened this issue Oct 26, 2018 · 7 comments

@williamFalcon
Hi, I don't have a lot of experience with audio processing. What's unclear to me is how to map these features to the corresponding sound sequence.

I'm feeding the features to an RNN along with the corresponding sound chunk. Do you know how I would set that up?

Say I have a sound file with 32,000 samples (16 kHz for 2 seconds). I'm feeding the RNN a sequence of 1,024 samples at a time, but grouping them into frames where each frame has 16 sound steps.

So

wav = load(...)
wav.shape  
# [1 x 32000]

sub_seq = wav[0:1024]
sub_seq = sub_seq.reshape(1, 64, 16)

f0_contour, spectral_envelope, aperiodicity = pw.wav2world(wav, fs=16000, frame_period=16)

f0_contour.shape
# [157]
# ??? unclear how to match to the (1, 64, 16) piece of sound

Thanks for an awesome package!!

@williamFalcon (Author)

I'd also appreciate pointers on where I can learn about this (i.e., audio processing, binning, padding, etc.).

@JeremyCCHsu (Owner)

Hi,

The frames in the WORLD vocoder actually have variable lengths, so they cannot be easily mapped into the format you specified. See the official repo for more detail: mmorise/World#51. You might also consult mmorise's advice on that issue.

It's still possible to align the features with the sample sequence by exploiting the time bounding boxes in pw.dio's output. However, if you really need fixed-length frames, you might consider the short-time Fourier transform (STFT), which is available in many signal processing toolkits (e.g., SciPy, TensorFlow).
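
For instance, here's a rough sketch with SciPy's stft (the 64-sample window and 16-sample hop are arbitrary illustration choices on my part, not WORLD parameters):

    import numpy as np
    from scipy.signal import stft

    fs = 16000
    x = np.random.randn(32000)  # stand-in for the 2-second waveform

    # hop = nperseg - noverlap = 16 samples (1 ms at 16 kHz), so each frame
    # corresponds to a fixed, known chunk of the waveform
    f, t, Zxx = stft(x, fs=fs, nperseg=64, noverlap=48)
    print(Zxx.shape)  # (33, 2001): 33 frequency bins, one frame per 16-sample hop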

As I'm not sure what kind of application you're building or why you need to align the features to the sample sequence, I'm afraid I cannot provide further help. In some application scenarios, treating [f0, ap, sp] as a time sequence and applying an RNN to it is a common approach.

Lastly, as for learning material: I studied this in a speech signal processing course several years ago.

@williamFalcon (Author) commented Oct 27, 2018

Ok, that makes sense. So, if I'm using this to synthesize speech from these features (x = vocoder features, y = audio, model = RNN), I understand that I can use [f0, ap, sp] as features, but I need to know how much audio time each frame corresponds to, since the sequences are extremely long and I need to do TBPTT.

I noticed that if I use frame_period = 5 with an audio sample rate of 16 kHz, I get 13 frames for every 952 audio samples.
Q1: Does that mean I can reliably step 13 frames and 952 samples at a time to do what I need to do?

The formula I came up with, which (I think) estimates the number of frames, is:

    def number_of_frames(self, audio, sample_rate=16000, frame_period=5):
        # duration in milliseconds divided by the frame period (ms), rounded up
        frames = np.ceil(len(audio) / sample_rate * 1000 / frame_period)
        return frames
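
For reference, WORLD's frame count appears to be floor-based plus one (one frame per hop, plus the frame at t = 0) rather than a ceiling. A quick check against pyworld (a sketch, assuming a mono float64 input) might look like:

    import numpy as np
    import pyworld as pw

    fs, frame_period = 16000, 5.0
    audio = np.random.randn(952).astype(np.float64)  # the 952-sample chunk above

    f0, t = pw.dio(audio, fs, frame_period=frame_period)
    estimate = int(len(audio) / fs * 1000 / frame_period) + 1  # floor, then + 1
    print(len(f0), estimate)  # these should match if the count is floor-based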

The paper (char2wav) I'm going off of says:

First, we pretrained the reader and the neural vocoder separately. We used normalized WORLD vocoder features (Morise et al., 2016; Wu et al., 2016) as targets for the reader and as inputs for the neural vocoder. Finally, we fine-tuned the whole model end-to-end. Our code is available online.

Q2: Does pyworld have a normalize function? Or is it just (x - mean(x)) / var(x) for each feature?

@JeremyCCHsu (Owner)

I'm so sorry, I got your question wrong and answered the wrong thing. The frames are actually aligned with the waveform.* In your case, the problem is the frame_period argument: its unit is milliseconds. So instead of setting it to 16, you should have set it to 1 (the sampling rate is 16 kHz, so 1 ms = 16 samples).**

As for Q2, how Char2Wav normalizes the features: you might want to contact the authors. There isn't a normalizing function in WORLD as far as I know. Z-normalization makes sense (I assume you meant std in the denominator).
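
If it helps, a minimal sketch of that z-normalization, done per feature dimension (this is my own helper, not part of pyworld):

    import numpy as np

    def znorm(feat, eps=1e-8):
        # feat: (n_frames, n_dims); normalize each dimension over time
        mu = feat.mean(axis=0)
        sigma = feat.std(axis=0)
        return (feat - mu) / (sigma + eps)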

Hope this helps.

footnotes:
*You might get output features (f0, ap, sp) of length 2001; in this case, the last frame should be ignored.

**Strictly speaking, the frames overlap, so a frame of (f0, sp, ap) actually contributes to more than 16 samples (here I'm assuming frame_period=1). Nonetheless, the starting point of every frame is aligned with the signal.
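
To make the alignment concrete: pw.dio also returns the frames' temporal positions, so each frame's starting sample can be read off directly. A minimal sketch, assuming a mono float64 waveform:

    import numpy as np
    import pyworld as pw

    fs = 16000
    x = np.random.randn(32000).astype(np.float64)  # stand-in for a real 2 s waveform

    # frame_period is in milliseconds: 1 ms at 16 kHz is a 16-sample hop
    f0, t = pw.dio(x, fs, frame_period=1.0)
    sp = pw.cheaptrick(x, f0, t, fs)
    ap = pw.d4c(x, f0, t, fs)

    # t holds each frame's start time in seconds, so the aligned sample index is
    start_samples = np.round(t * fs).astype(int)
    print(start_samples[:4])  # expected: [ 0 16 32 48]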

@williamFalcon (Author)

Thanks! Yeah, I meant std.

With frame_period = 1, will f0, sp, and ap still capture enough information? My (shallow) understanding was that frame_period needed to be > 1 for it to capture anything meaningful.

That’s all! Thank you

@JeremyCCHsu (Owner)

Take this example: let Q = frame_period and K = frame_length (frame_length cannot be specified and is variable-length in WORLD). Since the analysis window K is independent of the hop Q, each frame still sees a full window of signal; a smaller frame_period just samples the features more densely. So for frame_period, the smaller, the better (typically 5 or 1 ms).
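
A quick way to see the effect of the hop alone (a sketch with random audio as a stand-in):

    import numpy as np
    import pyworld as pw

    fs = 16000
    x = np.random.randn(32000).astype(np.float64)  # 2 seconds of stand-in audio

    for fp in (5.0, 1.0):
        f0, t = pw.dio(x, fs, frame_period=fp)
        # a smaller hop yields denser frames over the same signal
        print(fp, len(f0))  # roughly 401 vs. 2001 frames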

@williamFalcon (Author)

Thanks!
