Style transfer #9
Try collecting multiple z values (prior evidence), padding them to max length by replication, and finally computing the mean. Intuitively, this procedure averages out sentence-dependent characteristics (text, pitch contour) and keeps what's common across all sentences (anger, in your example).
You can also sample from a distribution by collecting one z and treating each dimension as a mean.
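A minimal sketch of both suggestions, assuming z tensors of shape (1, 80, T) as produced by Flowtron's forward pass (function names and the sigma value are illustrative, not from the repo):

```python
import torch

def average_prior_evidence(z_list):
    """Average several posterior z tensors of shape (1, 80, T_i).

    Pads each z to the longest length by replicating its last frame,
    then averages across utterances, keeping what they share (e.g. anger)
    and washing out sentence-specific structure.
    """
    max_len = max(z.size(2) for z in z_list)
    padded = []
    for z in z_list:
        pad = max_len - z.size(2)
        if pad > 0:
            z = torch.cat([z, z[:, :, -1:].repeat(1, 1, pad)], dim=2)
        padded.append(z)
    return torch.stack(padded).mean(dim=0)  # (1, 80, max_len)

def sample_around_z(z, sigma=0.5):
    """Treat each dimension of one collected z as a mean and sample around it."""
    return z + sigma * torch.randn_like(z)
```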
@rafaelvalle Can we expect command line options to be added to …
Thanks @rafaelvalle for your advice. Averaging helps a lot, and the latent sampling improves the naturalness a bit. I added some more samples for those who are interested here, using the acted samples from RAVDESS. @dakshvar22 I'm happy to make a PR for the style transfer code, though I'm not entirely certain what would be the ideal command-line interface for it, so I'll wait for @rafaelvalle's comments/thoughts on this.
@karkirowle do you have samples in which you average over both batch and time and then sample from your 80-d Gaussian n-frames times?
@rafaelvalle thanks for your help. I tried it and uploaded some time-averaging samples here. I find that these are less distinctive than the batch-averaged ones, but I might have done something wrong, so here is my code again for reference:
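(The snippet itself did not survive in this thread; below is a minimal reconstruction of the batch- and time-averaging it described, with illustrative names such as z_list and n_frames.)

```python
import torch

# z_list: posterior z values collected from several utterances of the
# target style, each of shape (1, 80, T_i), from Flowtron's forward pass.
# Average over both batch and time to get a single 80-d style mean.
z_mean = torch.stack([z.mean(dim=(0, 2)) for z in z_list]).mean(dim=0)  # (80,)

# Sample from the resulting 80-d Gaussian once per output frame.
sigma, n_frames = 0.5, 400  # assumed values
residual = z_mean[None, :, None] + sigma * torch.randn(1, 80, n_frames)
```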
One bit I forgot to mention is that I used tiling instead of replication, because replication introduced some artefacts at the end. I think you are doing the same thing when you replicate, but it's good to clarify; see the sketch below.
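For concreteness, the two possible readings of "padding by replication" might look like this (a sketch; z stands in for one collected style tensor of shape (1, 80, T)):

```python
import torch

z = torch.randn(1, 80, 100)  # stand-in for a collected style z
n_frames = 250               # desired output length

# Edge replication: repeat the last frame out to n_frames
# (this is what seemed to introduce artefacts at the end).
pad = n_frames - z.size(2)
z_replicated = torch.cat([z, z[:, :, -1:].repeat(1, 1, pad)], dim=2)

# Tiling: repeat the whole sequence and truncate.
n_tiles = -(-n_frames // z.size(2))  # ceiling division
z_tiled = z.repeat(1, 1, n_tiles)[:, :, :n_frames]
```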
Thanks for sharing your code! I tried to replicate it, but for some reason I am only getting noise, and it seems the reason is that the residual tensor contains only NaNs. Could you maybe also share the command line and arguments that you are using here?
I found the cause of my problem: it looks like I am using a newer version of PyTorch, in which masked_fill_ no longer works with uint8 tensors. The fix is to convert the mask to bool() in Flowtron.forward():
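The snippet showing the fix was lost here; it amounts to casting the mask before the masked_fill_ call, roughly as below (the exact variable names in Flowtron.forward() may differ):

```python
# Newer PyTorch rejects uint8 masks in masked_fill_, so cast to bool first
# (variable names here are illustrative):
mask = mask.bool()
score.data.masked_fill_(mask, -float("inf"))
```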
(sorry for closing/reopening, misclicked)
For the people interested in style transfer: give us a few days to put a notebook up replicating some of our experiments.
Please take a look at https://github.com/NVIDIA/flowtron/blob/master/inference_style_transfer.ipynb
The bits related to the style transfer experiments are unclear to me. Compared to Mellotron/Tacotron 2, prosodic control is now represented by the latent space instead of the GST, but I'm missing how you can project an utterance into the latent space, i.e. Section 4.4, especially Section 4.4.4 in the paper.
My guess is the following, and please confirm if that's right:
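(The guess itself was not preserved in this thread; judging from the follow-up, it amounts to running the style utterance through the forward, mel-to-z, direction of the flow and reusing that z at inference, roughly as below. The argument order and return value are assumptions, not verified against the repo.)

```python
import torch

# Project the style utterance into the latent space: push its
# mel-spectrogram (with its text and speaker id) through the forward
# pass of the flow; the first return value is taken to be z here.
with torch.no_grad():
    z_style = model(style_mel, speaker_ids, style_text, in_lens, out_lens)[0]
```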
I tried using the style speaker's id and I found that the style is very nicely represented but the spoken text is gibberish.
I put the style example (angry.wav) and the synthesised example here. The utterance to synthesise: "How are you today?"
Here is what I changed in inference.py (sorry, my padding solution is criminal):
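(The diff did not survive here either; the change boils down to swapping the random Gaussian residual for the tiled style z, sketched below with assumed names. z_style is the latent projected from the style utterance above.)

```python
# inference.py originally samples the residual from a Gaussian, roughly:
#   residual = torch.cuda.FloatTensor(1, 80, n_frames).normal_() * sigma
# Replace it with the style z, tiled out to n_frames
# (the "criminal" padding mentioned above):
n_tiles = -(-n_frames // z_style.size(2))  # ceiling division
residual = z_style.repeat(1, 1, n_tiles)[:, :, :n_frames]
```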