
Style transfer #9

Open
karkirowle opened this issue May 15, 2020 · 11 comments

Comments

@karkirowle

The parts related to the style transfer experiments are unclear to me. Compared to Mellotron/Tacotron 2, prosodic control is now represented by the latent space instead of the GST, but I don't see how you can project an utterance into the latent space, i.e. Section 4.4, especially Section 4.4.4 of the paper.

My guess is the following; please confirm whether it's right:

  • To get z from the style utterance ("prior evidence"), you call flowtron.forward(mel, speaker_vecs, text, in_lens, out_lens) with the style utterance's transcription. What is the correct way to assign speaker_vecs then, if it is an unseen speaker?
  • You then run flowtron.inference() using that z (the residual in the code).

I tried using the style speaker's ID, and I found that the style is represented very nicely, but the spoken text is gibberish.

I put the style example (angry.wav) and the synthesised example here. The utterance to synthesise: "How are you today?"

Here is what I changed in inference.py (sorry, my padding solution is criminal):

with torch.no_grad():
    if utterance is None:
        residual = torch.cuda.FloatTensor(1, 80, n_frames).normal_() * sigma
    else:
        utt_text = "Dogs are sitting by the door!"
        utt_text = trainset.get_text(utt_text).cuda()
        utt_text = utt_text[None]

        # load the style utterance's mel spectrogram
        audio, _ = load_wav_to_torch(utterance)
        mel = trainset.get_mel(audio).to(device="cuda")

        # add a batch dimension; the model expects (batch, 80, n_frames)
        mel = mel[None]
        out_lens = torch.LongTensor(1).to(device="cuda")
        out_lens[0] = mel.size(2)
        in_lens = torch.LongTensor([utt_text.shape[1]]).to(device="cuda")

        residual, _, _, _, _, _, _ = model.forward(mel, speaker_vecs, utt_text, in_lens, out_lens)
        # permute z from (time, batch, 80) to (batch, 80, time) for infer()
        residual = residual.permute(1, 2, 0)

    # TODO: this is a horrible solution to pad once if needed
    if n_frames > residual.shape[2]:
        pad_len = n_frames - residual.shape[2]
        residual = torch.cat((residual, residual[:, :, :pad_len]), dim=2)
    else:
        residual = residual[:, :, :n_frames]

    mels, attentions = model.infer(residual, speaker_vecs, text)


@rafaelvalle
Contributor

rafaelvalle commented May 15, 2020

Try collecting multiple z values (prior evidence), padding them to the max length by replicating them, and finally computing the mean.

Intuitively, this procedure averages out sentence-dependent characteristics (text, pitch contour) and keeps what is common between all sentences (anger, in your example).
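For anyone following along, a minimal, untested sketch of this procedure (it assumes each z has already been obtained via model.forward() and permuted to shape (1, 80, T_i); average_prior_evidence is just an illustrative name):

import torch

# Average several z tensors ("prior evidence") after padding them to a
# common length by replication.
def average_prior_evidence(zs, n_frames):
    padded = []
    for z in zs:
        t = z.shape[2]
        if t < n_frames:
            reps = -(-n_frames // t)  # ceil(n_frames / t)
            z = z.repeat(1, 1, reps)  # replicate the sequence end-to-end
        padded.append(z[:, :, :n_frames])
    # the mean keeps what the utterances share (style) and averages out
    # sentence-dependent content
    return torch.stack(padded, dim=0).mean(dim=0)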

@rafaelvalle
Contributor

rafaelvalle commented May 15, 2020

You can also sample from a distribution by collecting one z and treating each dimension as a mean.
In this case you can either average over time or pad to the desired length.

from torch.distributions import Normal
dist = Normal(z, sigma)
z_style = dist.sample().reshape(1, 80, n_frames)
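If you average z over time first, so that it becomes an 80-d mean, a sketch of the sampling would look like this (z_mean is the hypothetical time-averaged z with shape (1, 80); sigma and n_frames as above):

from torch.distributions import Normal

# z_mean: shape (1, 80), z averaged over the time axis
dist = Normal(z_mean, sigma)
z_style = dist.sample((n_frames,)).permute(1, 2, 0)  # -> (1, 80, n_frames)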

@dakshvar22

@rafaelvalle Can we expect command-line options to be added to inference.py on your side to accomplish this kind of style transfer?

@karkirowle
Author

Thanks @rafaelvalle for your advice. Averaging helps a lot, and the latent sampling improves the naturalness a bit. I added some more samples for those who are interested here, using the acted samples of RAVDESS.

@dakshvar22 I'm happy to make a PR for the style transfer code, though I'm not entirely certain what the ideal command-line interface for it would be, so I'll wait for @rafaelvalle's comments/thoughts on this.

@rafaelvalle
Contributor

@karkirowle do you have samples in which you average over both batch and time and then sample from your 80-d Gaussian n_frames times?

@karkirowle
Author

@rafaelvalle thanks for your help. I tried it and uploaded some time-averaged samples here.

I find that these are less distinctive than the batch-averaged ones. But I might have done something wrong, so here is my code again for reference:

    category = "happy"
    with torch.no_grad():
        files = glob("data/" + category + "/*.wav")
        residual_accumulator = torch.zeros((1, 80, n_frames)).to("cuda")
        for utterance in files:
            if utterance is None:
                residual = torch.cuda.FloatTensor(1, 80, n_frames).normal_() * sigma
            else:
                utt_text = "Dogs are sitting by the door!"
                utt_text = trainset.get_text(utt_text).cuda()
                utt_text = utt_text[None]

                # load the style utterance's mel spectrogram
                audio, _ = load_wav_to_torch(utterance)
                mel = trainset.get_mel(audio).to(device="cuda")

                # add a batch dimension; the model expects (batch, 80, n_frames)
                mel = mel[None]
                out_lens = torch.LongTensor(1).to(device="cuda")
                out_lens[0] = mel.size(2)
                in_lens = torch.LongTensor([utt_text.shape[1]]).to(device="cuda")

                residual, _, _, _, _, _, _ = model.forward(mel, speaker_vecs, utt_text, in_lens, out_lens)
                # permute z from (time, batch, 80) to (batch, 80, time)
                residual = residual.permute(1, 2, 0)

                residual = residual[:, :, :n_frames]

                if residual.shape[2] < n_frames:
                    num_tile = int(np.ceil(n_frames / residual.shape[2]))
                    # I used tiling instead of replication; tile repeats the
                    # tensor along dim 2 num_tile times
                    residual = tile(residual.cpu(), 2, num_tile).to("cuda")

                residual_accumulator = residual_accumulator + residual[:, :, :n_frames]

        residual_accumulator = residual_accumulator / len(files)

        average_over_time = True
        if not average_over_time:
            # sample around the batch-averaged z directly
            dist = Normal(residual_accumulator, sigma)
            z_style = dist.sample()
        else:
            # average over time as well, then sample n_frames times from
            # the 80-d Gaussian
            residual_accumulator = residual_accumulator.mean(dim=2)
            dist = Normal(residual_accumulator, sigma)
            z_style = dist.sample((n_frames,)).permute(1, 2, 0)

        mels, attentions = model.infer(z_style, speaker_vecs, text)

One thing I forgot to mention is that I used tiling instead of replication, because replication introduced some artefacts at the end. I think you are doing the same thing when you replicate, but it's good to clarify.
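For clarity, a minimal sketch of what I mean by tiling, written in plain PyTorch instead of my tile helper (assuming residual has shape (1, 80, T)):

# repeat the whole z sequence end-to-end until it covers n_frames, then truncate
num_tile = int(np.ceil(n_frames / residual.shape[2]))
residual = residual.repeat(1, 1, num_tile)[:, :, :n_frames]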

@Quasimondo

Thanks for sharing your code! I tried to replicate it, but for some reason I am only getting noise, and it seems the reason is that the residual tensor contains nothing but NaNs. Could you maybe also share the command line and arguments that you are using here?

@Quasimondo

I found the cause of my problem: it looks like I am using a newer version of PyTorch in which masked_fill_ no longer works with uint8 tensors. The fix is to convert the mask to bool() in Flowtron.forward():

[...]
log_s_list = []
attns_list = []
mask = ~get_mask_from_lengths(in_lens)[..., None].bool()  # added .bool() here
for i, flow in enumerate(self.flows):
[...]

@karkirowle
Author

karkirowle commented May 16, 2020


(sorry for closing/reopening, misclicked)
Oh yes, indeed, that's one thing I also changed, sorry for not reporting it! If you step through it in a debugger, the ~ takes the uint8's bitwise complement, I guess, which produces the wrong masks. I also downsampled the recordings using sox, which I think might matter:

$ sox $file -b 16 ${file%.*}_.wav rate -I 22050 dither -s
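If you'd rather stay in Python, roughly the same conversion (minus the dithering) should be possible with librosa and soundfile, assuming both are installed; a sketch:

import librosa
import soundfile as sf

# resample to 22050 Hz and write 16-bit PCM (no dithering, unlike the sox command)
y, sr = librosa.load("angry.wav", sr=22050)
sf.write("angry_22050.wav", y, sr, subtype="PCM_16")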
And there is a gist with my code here; it still needs improvement so that it uses a data loader for batch processing instead of one example at a time, but it might help people get started in the meantime.

@rafaelvalle
Contributor

For the people interested in style transfer: give us a few days to put up a notebook replicating some of our experiments.
