
Style transfer #9

Open
karkirowle opened this issue May 15, 2020 · 11 comments

Comments

@karkirowle

The parts related to the style transfer experiments are unclear to me. Compared to Mellotron/Tacotron 2, prosodic control is now represented by the latent space instead of the GST, but I don't see how you can project an utterance into the latent space, i.e. Section 4.4, especially Section 4.4.4 of the paper.

My guess is the following; please confirm whether it's right:

  • To get z from the style utterance ("prior evidence"), you call flowtron.forward(mel, speaker_vecs, text, in_lens, out_lens) with the style utterance's transcription. What is the correct way to assign speaker_vecs then, if it is an unseen speaker?
  • You then run flowtron.inference() using that z (the residual in the code).

I tried using the style speaker's ID, and I found that the style is represented very nicely, but the spoken text is gibberish.

I put the style example (angry.wav) and the synthesised example here. The utterance to synthesise: "How are you today?"

Here is what I changed in inference.py (sorry, my padding solution is criminal):

with torch.no_grad():
    if utterance is None:
        residual = torch.cuda.FloatTensor(1, 80, n_frames).normal_() * sigma
    else:
        utt_text = "Dogs are sitting by the door!"
        utt_text = trainset.get_text(utt_text).cuda()
        utt_text = utt_text[None]

        # load the style utterance's mel spectrogram
        audio, _ = load_wav_to_torch(utterance)
        mel = trainset.get_mel(audio).to(device="cuda")

        # add a batch dimension; the model expects (batch, 80, n_frames)
        mel = mel[None]
        out_lens = torch.LongTensor(1).to(device="cuda")
        out_lens[0] = mel.size(2)
        in_lens = torch.LongTensor([utt_text.shape[1]]).to(device="cuda")

        residual, _, _, _, _, _, _ = model.forward(mel, speaker_vecs, utt_text, in_lens, out_lens)
        # permute z from (time, batch, 80) to (batch, 80, time) for infer()
        residual = residual.permute(1, 2, 0)

    # TODO: this is a horrible solution to pad once if needed
    if n_frames > residual.shape[2]:
        pad_len = n_frames - residual.shape[2]
        residual = torch.cat((residual, residual[:, :, :pad_len]), dim=2)
    else:
        residual = residual[:, :, :n_frames]

    mels, attentions = model.infer(residual, speaker_vecs, text)


@rafaelvalle
Contributor

rafaelvalle commented May 15, 2020

Try collecting multiple z values (prior evidence), padding them to the max length by replicating them, and finally computing the mean.

Intuitively, this procedure averages out sentence-dependent characteristics (text, pitch contour) and keeps what is common between all sentences (anger, in your example).
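For anyone following along, a minimal, untested sketch of this procedure (it assumes each z has already been obtained via model.forward() and permuted to shape (1, 80, T_i); average_prior_evidence is just an illustrative name):

import torch

# Average several z tensors ("prior evidence") after padding them to a
# common length by replication.
def average_prior_evidence(zs, n_frames):
    padded = []
    for z in zs:
        t = z.shape[2]
        if t < n_frames:
            reps = -(-n_frames // t)  # ceil(n_frames / t)
            z = z.repeat(1, 1, reps)  # replicate the sequence end-to-end
        padded.append(z[:, :, :n_frames])
    # the mean keeps what the utterances share (style) and averages out
    # sentence-dependent content
    return torch.stack(padded, dim=0).mean(dim=0)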

@rafaelvalle
Contributor

rafaelvalle commented May 15, 2020

You can also sample from a distribution by collecting one z and treating each dimension as a mean.
In this case you can either average over time or pad to the desired length.

from torch.distributions import Normal
dist = Normal(z, sigma)
z_style = dist.sample().reshape(1, 80, n_frames)
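If you average z over time first, so that it becomes an 80-d mean, a sketch of the sampling would look like this (z_mean is the hypothetical time-averaged z with shape (1, 80); sigma and n_frames as above):

from torch.distributions import Normal

# z_mean: shape (1, 80), z averaged over the time axis
dist = Normal(z_mean, sigma)
z_style = dist.sample((n_frames,)).permute(1, 2, 0)  # -> (1, 80, n_frames)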

@dakshvar22

@rafaelvalle Can we expect command-line options to be added to inference.py on your side to accomplish this kind of style transfer?

@karkirowle
Author

Thanks @rafaelvalle for your advice. Averaging helps a lot, and the latent sampling improves the naturalness a bit. I added some more samples for those who are interested here, using the acted samples of RAVDESS.

@dakshvar22 I'm happy to make a PR for the style transfer code, though I'm not entirely certain what the ideal command-line interface for it would be, so I'll wait for @rafaelvalle's comments/thoughts on this.

@rafaelvalle
Contributor

@karkirowle do you have samples in which you average over both batch and time and then sample from your 80-d Gaussian n_frames times?

@karkirowle
Author

@rafaelvalle thanks for your help. I tried it and uploaded some time-averaged samples here.

I find that these are less distinctive than the batch-averaged ones. But I might have done something wrong, so here is my code again for reference:

    category = "happy"
    with torch.no_grad():
        files = glob("data/" + category + "/*.wav")
        residual_accumulator = torch.zeros((1, 80, n_frames)).to("cuda")
        for utterance in files:
            if utterance is None:
                residual = torch.cuda.FloatTensor(1, 80, n_frames).normal_() * sigma
            else:
                utt_text = "Dogs are sitting by the door!"
                utt_text = trainset.get_text(utt_text).cuda()
                utt_text = utt_text[None]

                # load the style utterance's mel spectrogram
                audio, _ = load_wav_to_torch(utterance)
                mel = trainset.get_mel(audio).to(device="cuda")

                # add a batch dimension; the model expects (batch, 80, n_frames)
                mel = mel[None]
                out_lens = torch.LongTensor(1).to(device="cuda")
                out_lens[0] = mel.size(2)
                in_lens = torch.LongTensor([utt_text.shape[1]]).to(device="cuda")

                residual, _, _, _, _, _, _ = model.forward(mel, speaker_vecs, utt_text, in_lens, out_lens)
                # permute z from (time, batch, 80) to (batch, 80, time)
                residual = residual.permute(1, 2, 0)

                residual = residual[:, :, :n_frames]

                if residual.shape[2] < n_frames:
                    num_tile = int(np.ceil(n_frames / residual.shape[2]))
                    # I used tiling instead of replication; tile repeats the
                    # tensor along dim 2 num_tile times
                    residual = tile(residual.cpu(), 2, num_tile).to("cuda")

                residual_accumulator = residual_accumulator + residual[:, :, :n_frames]

        residual_accumulator = residual_accumulator / len(files)

        average_over_time = True
        if not average_over_time:
            # sample around the batch-averaged z directly
            dist = Normal(residual_accumulator, sigma)
            z_style = dist.sample()
        else:
            # average over time as well, then sample n_frames times from
            # the 80-d Gaussian
            residual_accumulator = residual_accumulator.mean(dim=2)
            dist = Normal(residual_accumulator, sigma)
            z_style = dist.sample((n_frames,)).permute(1, 2, 0)

        mels, attentions = model.infer(z_style, speaker_vecs, text)

One thing I forgot to mention is that I used tiling instead of replication, because replication introduced some artefacts at the end. I think you are doing the same thing when you replicate, but it's good to clarify.
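For clarity, a minimal sketch of what I mean by tiling, written in plain PyTorch instead of my tile helper (assuming residual has shape (1, 80, T)):

# repeat the whole z sequence end-to-end until it covers n_frames, then truncate
num_tile = int(np.ceil(n_frames / residual.shape[2]))
residual = residual.repeat(1, 1, num_tile)[:, :, :n_frames]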

@Quasimondo

Thanks for sharing your code! I tried to replicate it, but for some reason I am only getting noise, and it seems the reason is that the residual tensor contains nothing but NaNs. Could you maybe also share the command line and arguments that you are using here?

@Quasimondo

I found the cause of my problem: it looks like I am using a newer version of PyTorch in which masked_fill_ no longer works with uint8 tensors. The fix is to convert the mask to bool() in Flowtron.forward():

[...]
log_s_list = []
attns_list = []
mask = ~get_mask_from_lengths(in_lens)[..., None].bool()  # added .bool() here
for i, flow in enumerate(self.flows):
[...]

@karkirowle
Author

karkirowle commented May 16, 2020


(sorry for closing/reopening, misclicked)
Oh yes, indeed, that's one thing I also changed, sorry for not reporting it! If you step through it in a debugger, the ~ takes the uint8's bitwise complement, I guess, which produces the wrong masks. I also downsampled the recordings using sox, which I think might matter:

$ sox $file -b 16 ${file%.*}_.wav rate -I 22050 dither -s
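If you'd rather stay in Python, roughly the same conversion (minus the dithering) should be possible with librosa and soundfile, assuming both are installed; a sketch:

import librosa
import soundfile as sf

# resample to 22050 Hz and write 16-bit PCM (no dithering, unlike the sox command)
y, sr = librosa.load("angry.wav", sr=22050)
sf.write("angry_22050.wav", y, sr, subtype="PCM_16")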
And there is a gist with my code here; it still needs improvement so that it uses a data loader for batch processing instead of one example at a time, but it might help people get started in the meantime.

@rafaelvalle
Contributor

For the people interested in style transfer: give us a few days to put up a notebook replicating some of our experiments.
