How to map to sound time series? #34
Comments
Would also appreciate pointers to where I can learn about this (i.e., audio processing, binning, padding, etc.).
Hi, the frames in the WORLD vocoder actually have variable lengths, so they cannot be easily mapped into the format you specified. See the official repo for more discussion: mmorise/World#51

It's still possible to align the features with the sample sequence by exploiting each frame's time bounding box (the temporal positions returned by the analysis functions).

As I'm not sure what kind of application you're building or why you need to align the features to the sample sequence, I'm afraid I cannot provide further help. In some application scenarios, treating [f0, ap, sp] as a time sequence and applying an RNN to it is a common approach.

Lastly, as for the material: I studied this in a speech signal processing course several years ago.
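(A minimal sketch of that alignment, assuming the standard pyworld pipeline: `dio` returns, alongside f0, the temporal position in seconds of each frame, and `cheaptrick`/`d4c` reuse those positions. The random `x` below is a stand-in for real audio.)

```python
import numpy as np
import pyworld as pw

fs = 16000
x = np.random.randn(fs * 2).astype(np.float64)  # stand-in for 2 s of real audio

# Analysis: dio returns f0 plus the temporal position (in seconds) of each frame.
f0, t = pw.dio(x, fs, frame_period=5.0)   # t[i] == i * 0.005
sp = pw.cheaptrick(x, f0, t, fs)          # spectral envelope, one row per frame
ap = pw.d4c(x, f0, t, fs)                 # aperiodicity, one row per frame

# Each frame can be mapped back to the waveform via its center sample.
centers = np.round(t * fs).astype(int)
print(len(f0), "frames; frame 10 is centered at sample", centers[10])
```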
Ok, makes sense. So if I'm using this to synthesize speech from these features (x = vocoder features, y = audio, model = RNN), I understand that now I can use `pw.synthesize`.

I noticed that if I use frame_period = 5 with an audio sample rate of 16 kHz, I get 13 frames for every 952 audio samples. The formula I came up with, which (I think) estimates the number of frames, is:

```python
import numpy as np

def number_of_frames(audio, sample_rate=16000, frame_period=5):
    # estimate: ceil(duration_in_ms / frame_period)
    frames = np.ceil(((len(audio) / sample_rate) / frame_period) * 1000)
    return int(frames)
```

The paper (char2wav) I'm going off of says:

> […]
Q2: Does pyworld have a normalize function? Or is it just subtracting the mean and dividing by the variance?
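(A quick way to sanity-check the frame-count formula above against pyworld itself. If I read the WORLD source correctly, the library allocates floor(duration_ms / frame_period) + 1 frames rather than a ceil; `world_frame_count` below is a hypothetical helper, not part of pyworld.)

```python
import numpy as np
import pyworld as pw

def world_frame_count(n_samples, sample_rate=16000, frame_period=5.0):
    # WORLD allocates floor(duration_ms / frame_period) + 1 frames.
    return int(1000.0 * n_samples / sample_rate / frame_period) + 1

fs = 16000
x = np.random.randn(952).astype(np.float64)   # the 952-sample case from above
f0, t = pw.dio(x, fs, frame_period=5.0)
print(world_frame_count(len(x)), len(f0))     # both should print 12
```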
I'm so sorry that I got your question wrong and answered the wrong thing. The frames are actually aligned with the waveform.* In your case, you messed up the `frame_period`: to get one frame per 16 samples at 16 kHz, it should be 1 (ms), not 5.

As for Q2, how Char2wav normalizes the features: you might want to contact the authors. There isn't a normalizing function in WORLD as far as I know. Z-normalization makes sense (I assume you meant std in the denominator).

Hope this helps.

*Strictly speaking, the frames overlap, so a frame of (f0, sp, ap) actually contributes to more than 16 samples (here I'm assuming that frame_period = 1 and fs = 16000).
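(Since WORLD/pyworld has no built-in normalizer, per-dimension z-normalization is often done by hand. A minimal sketch under that assumption; `znorm` is a made-up helper name, and the random matrix stands in for a real spectral envelope.)

```python
import numpy as np

def znorm(feat, eps=1e-8):
    # Z-normalize a (frames, dims) feature matrix along the time axis.
    mu = feat.mean(axis=0, keepdims=True)
    sigma = feat.std(axis=0, keepdims=True)
    return (feat - mu) / (sigma + eps), mu, sigma

sp = np.abs(np.random.randn(400, 513))  # stand-in for a spectral envelope
sp_n, mu, sigma = znorm(sp)
sp_back = sp_n * (sigma + 1e-8) + mu    # undo before handing back to synthesis
```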
Thanks! Yeah, I meant std. With frame_period = 1, will the f0, sp, ap still capture enough information? My (shallow) understanding was that frame_period needed to be > 1 for it to capture anything meaningful. That's all, thank you!
Take this example.
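(The example itself was lost in extraction; the following is a sketch of what an analysis/synthesis round trip at frame_period = 1 could look like, not the author's original code. It assumes pyworld's `harvest`/`cheaptrick`/`d4c`/`synthesize` functions; `speech.wav`, `resynth.wav`, and the `soundfile` package are hypothetical choices. Because the analysis windows overlap, frame_period only sets how densely the features are sampled in time, so a 1 ms hop does not starve the features of information.)

```python
import numpy as np
import soundfile as sf   # hypothetical I/O choice
import pyworld as pw

x, fs = sf.read("speech.wav")   # hypothetical input file; assumes mono audio
x = x.astype(np.float64)

f0, t = pw.harvest(x, fs, frame_period=1.0)  # 1 ms hop: 16 samples at 16 kHz
sp = pw.cheaptrick(x, f0, t, fs)
ap = pw.d4c(x, f0, t, fs)

y = pw.synthesize(f0, sp, ap, fs, frame_period=1.0)
sf.write("resynth.wav", y, fs)   # listen to judge what, if anything, was lost
```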
Thanks!
Hi, I don't have a lot of experience with audio processing, and what's unclear to me is how to map these features to the corresponding sound sequence.

I'm feeding these to an RNN along with the corresponding sound chunk. Do you know how I would set that up?

Say I have a sound file with 32,000 values (16 kHz for 2 seconds). I'm feeding the RNN a sequence of 1,024 items at a time, but I'm grouping them by frames where each frame has 16 sound steps. So each 1,024-sample sequence becomes 64 frames of 16 samples each.

Thanks for an awesome package!!
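(A sketch of the pairing described in this question, under its own assumptions: 16 kHz audio and one WORLD frame per 16 samples, i.e. frame_period = 1 ms. The random `x` stands in for the 32,000-sample file.)

```python
import numpy as np
import pyworld as pw

fs, frame_period = 16000, 1.0                   # 1 ms hop -> 16 samples per frame
x = np.random.randn(fs * 2).astype(np.float64)  # stand-in for the 32,000-sample file

f0, t = pw.dio(x, fs, frame_period=frame_period)
sp = pw.cheaptrick(x, f0, t, fs)
ap = pw.d4c(x, f0, t, fs)

hop = int(fs * frame_period / 1000)             # 16 samples per frame
n = len(x) // hop                               # 2000 whole frames
chunks = x[: n * hop].reshape(n, hop)           # (2000, 16) audio targets
feats = np.hstack([f0[:n, None], sp[:n], ap[:n]])  # (2000, 1 + sp_dims + ap_dims)
# feats[i] is the conditioning input for the 16-sample target chunks[i];
# a 1,024-sample training sequence is then 64 consecutive rows of each.
```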