Doubt on use of discretized MoL in sampling and loss calculation #155
Hello @StevenZYj, thanks for reaching out! Also sorry for being late in making this answer, I didn't find much free time to write a proper legendary comment.. :) I know the feeling, MoL is a pain to understand haha.. no problem though, let's go through this step by step (with the assistance of the Gaussian distribution, things tend to become easier). Before we get into this, I want to apologize in advance for any typos or careless mistakes; most of what follows comes from my personal thinking and observations. While I tried covering the most important stuff, I am always open to any improvements to my modest comment! Please sit back and enjoy :) Let's start with the easy stuff and move to more complicated levels as we proceed:
The main idea of sampling is that we want to pick a value x that is probable under the distribution the model predicted. In practice, we select a random uniform y in (1e-5, 1 - 1e-5) to avoid the saturation regions of the sigmoid, then determine x from it by inverting the logistic CDF. If things get a bit too complicated at this point, use the Logistic distribution link above for plot assistance. Naturally, if our randomly picked y = 0.5, x will be exactly the mean of the logistic distribution, which makes sense. Lucky for us, by picking a y between 1e-5 and 1 - 1e-5 we actually get an x that is probable for those mean and scale parameters according to the PDF. Perfect, we now know how sampling is done at synthesis. But how do I train my model to make accurate predictions for these means and scales? And in a mixture case, how does the model learn to correctly select which distribution to use? This brings us to the second part of this long comment :)
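For concreteness, here is a minimal numpy sketch of that inverse-CDF sampling step (the function name and defaults here are illustrative, not taken from the repo):

```python
import numpy as np

def sample_from_logistic(mean, log_scale, rng=np.random.default_rng()):
    """Draw one sample from a logistic distribution by inverting its CDF."""
    # Uniform y in (1e-5, 1 - 1e-5) keeps us out of the sigmoid's
    # saturated tails, exactly as described above.
    y = rng.uniform(1e-5, 1.0 - 1e-5)
    # Inverse of the logistic CDF sigmoid((x - mean) / scale):
    # x = mean + scale * (log(y) - log(1 - y))
    return mean + np.exp(log_scale) * (np.log(y) - np.log(1.0 - y))

# With y = 0.5 the log-ratio term vanishes and x is exactly the mean.
x = sample_from_logistic(mean=0.0, log_scale=np.log(0.5))
```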
In the figure above, I have made two plots (PDF on the top, CDF on the bottom) for 3 different normal distributions, with their respective means and variances in the legend. I also made 4 dashed lines, two black and two blue. The two blue dashed lines are on top of each other so they appear as one. They refer to x_cdf+ and x_cdf- (explanation on the way). The blue lines follow the real plot scale, that's reality. For the sake of explanation, I break the plot scale rules by a factor of 1000x to create the black dashed lines, which are the "not real" x_cdf+ and x_cdf-. The intersection of x_cdf+ with the distribution's CDF gives cdf+; the same applies on the cdf- side. cdf+ and cdf- are what I call the "envelope" borders.

Alright, let's stick with the CDF for a while. In the figure, I suppose that y = 0, which is represented by the blue dashed line. Now, if we consider x_cdf+ and x_cdf- to be the two dashed black lines, how can we maximize the difference between cdf+ and cdf- (i.e. maximize cdf+ - cdf-) while x_cdf+ and x_cdf- are kept constant (they only depend on the true value of the target y, and the model has no control over them)? Actually, by having a closer look at the CDF of the Gaussian distribution, we notice that the maximal value for cdf+ - cdf- is obtained with the red Gaussian, for two reasons:

- The mean of the Gaussian is exactly on top of the real y value, in contrast to the blue Gaussian, which is way off; cdf+ and cdf- for the blue distribution are thus about the same (saturation region of the CDF).
- The slope of the CDF gets bigger as the variance gets smaller, so a sharper (more confident) distribution packs more of its probability mass between the two envelope borders.

Now, if we come back to the real plot scale, maximizing the difference between cdf+ and cdf- is really maximizing the slope of the CDF at the point of x-coordinate y. It is well known that the maximal slope of a CDF is hit exactly at its inflection point (inflection point: where the curve changes concavity, which for a Gaussian CDF is exactly the mean). So, just like that, this function (the difference between cdf+ and cdf-; does this function not have a name? am I not aware of it? it is simply the probability mass the distribution assigns to the envelope) not only hits its maximum over the mean when the mean lands exactly on the target y, it also grows as the predicted scale shrinks.

If we want to make a reference to the PDF, we indeed notice that the bigger the difference between cdf+ and cdf-, the higher the probability that the distribution picks the real target y (the biggest value is hit for the red Gaussian). Thus this reformulation of the loss function is still, in principle, consistent with the MLE we discussed earlier (maximize the probability that the real sample y is drawn from the output distribution). EDIT: If you have some mathematical background, you can understand this as follows (thanks to @m-toman for pointing that out): as the envelope width shrinks to zero, (cdf+ - cdf-) divided by the envelope width converges to the PDF evaluated at y, so maximizing the envelope mass is a discretized version of maximum likelihood.

That is the base of the MoL loss as well. Once this has been assimilated without problems, the MoL loss is really a variation with small details. In the next section, I will solely be focusing on those details. :)
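To make the "envelope" idea concrete, here is a small numpy/scipy sketch (a hypothetical helper, not the repo's code; half_bin is just an illustrative envelope half-width) showing that cdf+ - cdf- is largest when the mean sits on the target and the scale is small:

```python
import numpy as np
from scipy.stats import norm

def envelope_mass(y, mean, scale, half_bin=1.0 / 255.0):
    """Probability mass cdf+ - cdf- that a Gaussian assigns to the
    envelope around the target y. x_cdf+/- = y +/- half_bin depend
    only on y; the model only controls mean and scale."""
    cdf_plus = norm.cdf(y + half_bin, loc=mean, scale=scale)
    cdf_minus = norm.cdf(y - half_bin, loc=mean, scale=scale)
    return cdf_plus - cdf_minus

y = 0.0
# Centered and sharp ("red") >> centered but wide >> off-target ("blue").
for mean, scale in [(0.0, 0.05), (0.0, 0.5), (2.0, 0.5)]:
    print(f"mean={mean}, scale={scale}: {envelope_mass(y, mean, scale):.6f}")
```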
Let's start with the distribution training, as it's a continuation of what has been discussed in the previous section:
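The following numpy sketch of a discretized mixture-of-logistics negative log-likelihood may help follow the discussion. It follows the PixelCNN++ recipe referenced in this thread, but it is a simplified illustration: the edge-case handling and numerical guards of the real implementation are omitted, and all names here are mine:

```python
import numpy as np

def discretized_mol_nll(y, logit_probs, means, log_scales, half_bin=1.0 / 255.0):
    """Negative log-likelihood of one scalar target y under a mixture
    of M logistics, discretized into bins of half-width half_bin.
    logit_probs, means, log_scales: arrays of shape (M,)."""
    inv_scale = np.exp(-log_scales)
    # Per-component envelope mass: sigmoid(plus_in) - sigmoid(minus_in)
    plus_in = inv_scale * (y + half_bin - means)
    minus_in = inv_scale * (y - half_bin - means)
    cdf_plus = 1.0 / (1.0 + np.exp(-plus_in))
    cdf_minus = 1.0 / (1.0 + np.exp(-minus_in))
    log_bin_probs = np.log(np.maximum(cdf_plus - cdf_minus, 1e-12))
    # Mixture: log sum_k softmax(logit_probs)_k * P_k(bin), computed stably.
    log_weights = logit_probs - np.logaddexp.reduce(logit_probs)
    return -np.logaddexp.reduce(log_weights + log_bin_probs)

# Toy usage: 3 components, the first (highest weight) centered on y.
nll = discretized_mol_nll(0.0,
                          logit_probs=np.array([2.0, 0.0, -1.0]),
                          means=np.array([0.0, 0.3, -0.5]),
                          log_scales=np.array([-3.0, -2.0, -2.0]))
```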
There you have it, I assume that's all there is to know about MoL (and Gaussian distribution sampling for WaveNet). I hope this modest comment helped you get the main intuition, so you can actually go through the code and feel like you know what's going on. Everyone is invited to add anything I may have missed or to discuss anything I explained wrong. ;) In any case, here are some useful references I found when trying to understand MoL/Gaussian myself, in case anyone wants to do some more reading: |
@Rayhane-mamah That is a legendary comment for sure! I cannot say how much I appreciate it :D I've literally spent most of my day reading through, searching and understanding these ideas. It is indeed a pain haha, but at the same time I feel like I've learned a lot. |
@StevenZYj thanks a lot! :) I know that sometimes I talk about stuff as if it's evident.. (I saw most of the involved mathematics in college so I tend to skip some details, supposing they're well known..) Please don't hate me for that x) If you find anything not explained well enough, please feel free to ask any questions, I'll do my best :) |
Wow, I'll also check out this amazing comment later on. Thanks for that. EDIT1:
|
Hey @m-toman
@m-toman you seem to have great ways of simplifying things, I look forward to your future notes! |
@Rayhane-mamah Thanks, I've read it through now and the rest sounded pretty straightforward. I think your explanation of the CDF thingy is better for someone with less background knowledge, as mine makes a few more assumptions. Back to the topic, a couple of hints that hopefully help others:
|
@m-toman yeah, it sure feels like trial and error guided by intuition (or strong mathematical research? e.g. Parallel WaveNet doesn't seem that intuition-based..)
|
Hi @Rayhane-mamah, thank you for fixing the bugs in the wavenet vocoder. To train both models sequentially I use: python train.py --model='Tacotron-2'. The wavenet loss output looks like this: [2018-08-16 11:45:10.851] Step 1 [14.957 sec/step, loss=1.17336, avg_loss=1.17336] Are the negative loss values correct? |
Hey @atreyas313, yes that is normal, assuming you are using "raw" with 2 output channels (which uses a single Gaussian distribution). As explained in my first comment, we minimize the negative log probability of y. With good predictions this probability density gets bigger, and once it exceeds 1 its negative log drops below zero, so better predictions lead to an even smaller loss (bigger absolute value under 0). So yeah, that's normal :) If you prefer to use MoL instead, change the output_channels parameter to M * 3, where M is your chosen number of logistic distributions (usually 10). |
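To see why the loss can legitimately go negative: for a continuous distribution the loss is the negative log of a probability density, and a density can exceed 1. A quick sketch with a single Gaussian (toy numbers, not from actual training):

```python
import numpy as np

def gaussian_nll(y, mean, log_scale):
    """Negative log-likelihood of y under N(mean, scale^2)."""
    scale = np.exp(log_scale)
    return 0.5 * np.log(2.0 * np.pi) + log_scale + 0.5 * ((y - mean) / scale) ** 2

# Wide Gaussian: density at the mean < 1, loss is positive.
print(gaussian_nll(0.0, 0.0, np.log(1.0)))   # ~0.92
# Sharp, confident Gaussian: density at the mean >> 1, loss is negative.
print(gaussian_nll(0.0, 0.0, np.log(0.01)))  # ~-3.69
```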
Hey @Rayhane-mamah
Sorry for my silly question, but does this scaling take place at
@begeekmyfriend Hi, what do you mean by "rescaling"? Could you point out where it happens in preprocess.py? |
On this line |
If the audio is scaled to [-2, 2] rather than [-1, 1], should I simply clip the sampled prediction to [-2, 2]? Does the discretized MoL loss file need any modification? |
Hi @Rayhane-mamah https://lips.cs.princeton.edu/the-gumbel-max-trick-for-discrete-distributions/ |
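For anyone following that link: the Gumbel-max trick is a way to draw a discrete index (e.g. which mixture component to sample from at synthesis) directly from unnormalized logits. A minimal numpy sketch of the trick itself (illustrative names, not the repo's code):

```python
import numpy as np

def gumbel_max_sample(logits, rng=np.random.default_rng()):
    """Sample an index with probability softmax(logits): add i.i.d.
    Gumbel(0, 1) noise to the logits and take the argmax."""
    u = rng.uniform(1e-5, 1.0 - 1e-5, size=logits.shape)
    gumbel = -np.log(-np.log(u))
    return int(np.argmax(logits + gumbel))

# e.g. pick one of M logistic components from predicted mixture logits.
component = gumbel_max_sample(np.array([2.0, 0.5, -1.0]))
```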
This thread has been amazing. Could we borrow some of the ideas behind MoL to model the output distribution with a mixture of Gaussians?... I ask because I haven't had success with MoL for my problem. |
What is the meaning of |
I think the |
Hi @Rayhane-mamah, thanks for your legendary answer! While I have more or less grasped your ideas, I have another question that has bothered me for days: why use an approximation of the PDF in the first place during training? My guess for the MoL case is that it leads to a more straightforward formulation, since the CDF of the logistic distribution is easier to calculate than the PDF. But what about the Gaussian case? Why not directly use the PDF to calculate the MLE loss? |
Hello! Thank you for your amazing post, but I still have some questions to think about. As far as I know, you sourced … I want to know why you decided to take them off? |
Hi @Rayhane-mamah, really nice work, and thanks for the explanations so far. I have a further question about the training of the distributions and would be glad if you could help me. Thank you for your time. |
Hi all, first, thanks @Rayhane-mamah for fixing the bugs in the wavenet vocoder and making it fully work now :) I've spent several days looking into its implementation, and there's a part that really has me struggling: the implementation of discretized MoL in sampling and loss calculation. I know it is sourced from the official implementation of PixelCNN++, but that doesn't help much. Seems like I'm lacking certain mathematical knowledge. Wonder if anyone can help me out? Thanks.