
Adding normalizing flows #236
Open · guyko81 opened this issue Feb 19, 2021 · 13 comments
@guyko81 commented Feb 19, 2021

It might be a stupid idea - I haven't thought it through from a code perspective - but couldn't NGBoost predict the normalizing flow's parameters as well, so that the final outcome distribution could be transformed into an arbitrary distribution? It would give the package very high flexibility.

@alejandroschuler (Collaborator) commented Feb 19, 2021

@guyko81 this is a very good idea. I've been kicking around variations of this concept for about a year. And, in fact Jerome Friedman himself emailed us (me, @tonyduan, and @avati ) a few months ago with some suggestions along these lines, as well as some related things he'd been working on (I've attached the papers he sent me here). I particularly like the approach suggested in the "omnireg" paper since it is fully nonparametric (only restriction is that the transformation is monotonic). I haven't 100% thought through how to implement this in ngboost but if you're interested in working on it there is definitely a publication waiting to be had.

contrastall (1).pdf
omnireg3 (1).pdf

@guyko81 (Author) commented Feb 19, 2021

@alejandroschuler I would love to work on it. I'm going to read the papers and spend all the time needed to make it work!

@guyko81 (Author) commented Feb 20, 2021

@alejandroschuler I was thinking about focusing only on univariate distributions in regression for now. I'm not quite sure yet, but my idea is to implement a transformation function that is common to all observations (we already assume that all observations follow the same distribution, just with different parameters, so this step is not a restriction compared to the current solution), and the common transformation could be found in a parallel optimization.
If I read correctly, the omnireg paper proposes something similar, but with a non-gradient-based optimization and a monotonic transformation function. My approach would differ in that it would use the change-of-variables formula (evaluate the base density and correct the "level" with the derivative of the transformation).
At each optimization step the transformation could be corrected after the parameters were found (the same trick as in the paper, just with more general functions and a gradient-based method).
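As a minimal sketch of that density correction (every name here is a placeholder I'm making up, not anything in ngboost):

```python
import numpy as np

def transformed_log_likelihood(y, theta, beta, g, dg_dy, base_logpdf):
    """Log-likelihood of y when the base density is evaluated at z = g(y, beta)
    and corrected by the Jacobian of the transformation (change of variables)."""
    z = g(y, beta)
    # log p_Y(y) = log p_Z(g(y, beta); theta) + log |dg/dy|
    return base_logpdf(z, theta) + np.log(np.abs(dg_dy(y, beta)))
```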
So the following process would be used:

  1. calculate one step of optimization with the current method
  2. optimize one step of the transformation (changing the distribution function toward a more complicated shape, rather than pushing the target variable toward the normal distribution - but these are equivalent, since we are measuring the log-likelihood either way)
  3. go to step 1, but apply the transformation during the optimization

Is this an oversimplification of the problem?

@alejandroschuler (Collaborator):

Your proposal makes sense to me - the only technical challenge is to include the current transformation in the calculation of the gradients for the distributional parameters, but I think it should still allow for modularity of transforms/distributions because of the chain rule.

What kind of transforms are you thinking of trying? What method do you propose to use the transform parameters in step 2?

Something to note here is that if you use a fixed transform (no parameters, e.g. Y' = log(Y)) then this procedure is equivalent to just changing the distribution (e.g. from Normal to LogNormal). So you still have Y'|X ~ D(θ(X)) where D is some (post-transform) distribution with parameters θ. When you have parameters in the transform, let's say Y' = f(Y, β), then β become parameters that are shared globally across values of X. So you still have a fixed distributional form, but it is now Y'|X ~ D(θ(X), β). So really what this adds, mathematically speaking, is the ability to have some global parameters. But a global parameter is nothing but a function that is constant in X. That is, we could write Y'|X ~ D(θ(X), β(X)) where β(X) = β. So we can include θ'(X) = [θ(X), β(X)] and let β vary with X to get an even more flexible model, and this gives Y'|X ~ D(θ'(X)) which is mathematically the same kind of creature as Y|X ~ D(θ(X)).

In short - adding any global transformation is a special case of adding a new distribution that has a particular (larger) parametrization. And adding a new distribution can already be handled with the algorithm we have, without the need for intermediate steps. We are never escaping the need to specify a parametric form for the outcome distribution; we are just expanding the kinds of parametric forms we are willing to consider. Your proposal does allow for a global parameter that is constrained to be a constant, but that's a special case of having another parameter that varies with X.

None of that is to say it's a bad idea - obviously this is exactly the same concept as "normalizing flows", which people seem to like a lot. But if you're going to try implementing this, given the above arguments, maybe it would be equivalent (slightly more general, really) to leave the fitting algorithm alone and instead develop a system for constructing modular, transformable distributions so that a user might specify Dist = lambda β, Normal: β[0]*Normal ** 2 + β[1]*log(Normal) or something to that effect? Worth thinking about.
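To make that concrete, here is one possible shape for such a transformable distribution - purely a sketch with invented names and an arbitrary example transform, not ngboost's actual Dist interface:

```python
import numpy as np
from scipy.stats import norm

class TransformedNormal:
    """A Normal base distribution pushed through a global parametric transform.

    loc/scale play the role of theta(X) (per-observation), while beta is the
    globally shared transform parameter vector, as in Y'|X ~ D(theta(X), beta).
    """

    def __init__(self, loc, scale, beta):
        self.base = norm(loc=loc, scale=scale)
        self.beta = beta

    def _g(self, y):
        # example monotonic transform; any invertible, differentiable
        # g(y, beta) could be plugged in here
        return self.beta[0] * np.log(y) + self.beta[1]

    def _dg_dy(self, y):
        return self.beta[0] / y

    def logpdf(self, y):
        # change of variables against the Normal base density
        return self.base.logpdf(self._g(y)) + np.log(np.abs(self._dg_dy(y)))
```

Treating β as just more entries of an enlarged parameter vector θ'(X) then recovers the Y'|X ~ D(θ'(X)) view above.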

The method in the omnireg paper is fundamentally different because the transformation is fully nonparametric (only limited to monotonicity).

@alejandroschuler (Collaborator):

(@soumyasahu may have thoughts to add here as well)

@guyko81 (Author) commented Feb 23, 2021

@alejandroschuler I might still be misunderstanding the paper, but from section 3 I still read that the transformation happens globally in the omnireg paper. In my understanding, the algorithm operates with a non-parametric transformation of the target variable so that it satisfies the original assumption.

"One way to mitigate this problem is by transforming the outcome variable y using a monotonic function. The goal here is to find a monotonic transformation function g(y) such that (2) at least approximately holds for the transformed variable. That is
g(y) ~ f(x) + s(x) · ε"

And the g(y) transformation is applied only to the target variable.
"If all of the training data y-values are uncensored, or the censoring intervals are restricted to non-overlapping bins, then the corresponding F̂(y) is trivially obtained by ranking the training data y-values."

"Given the resulting solution location and scale functions one evaluates a new ĝ(y) using (16). This transformation function replaces the previous one, resulting in new estimates for f̂(x) and ŝ(x) from (5), (6). These in turn produce a new transformation through (16), and so on."

Fitting a parametric transformation during training seems a bit expensive to me. But maybe I'm wrong, and making the transformation global has the same computational cost. My other concern with the parametric change on tree leaves is that I'm not sure how it would affect the natural gradient boosting part (the Fisher information section). With a simple gradient method I have the feeling that I could write down the equations correctly.
If the transformation could be done globally, then it could be separated from the parameter search, so the two optimizations would not interact with each other and the current solution would not require any change. Only a new layer of y-transformation would be applied at each training step.

@alejandroschuler (Collaborator):

@guyko81 right, the transformation in the omnireg paper is indeed global, but my point isn't that it's not global, it's that it can't be written down in a closed parametric form and is in that sense much more flexible in the kinds of relationships it can accommodate. As I argued, the global parametric transformation approach that you argue for (with a given distribution Y|X ~ D(θ(X)) and a transformation Y' = g(Y, β)) is equivalent to a subset of the model Y'|X ~ D'(θ'(X)) for some parametric distribution D' with a finite-dimensional set of parameters θ'. So ultimately we're still in some parametric family, just a bigger, more flexible one. And that's fine, I just wanted to point it out. The omnireg approach puts us into a different class of problems altogether, i.e. Y|X ~ G{D(θ(X))}, where G maps D to some other distribution D'. The difference here is that now I can't always write D' in some known parametric form, and as a result my model is not constrained to adhere to that form.

However, I think that I may have misunderstood what you were proposing, based on your comment here:

If the transformation could be done globally, then it could be separated from the parameter search, so the two optimizations would not interact with each other and the current solution would not require any change. Only a new layer of y-transformation would be applied at each training step.

One thing to note is that in the omnireg paper, I think Friedman is proposing to fit the entire boosting model, then find the optimal global transformation, and then fit the entire model again, and repeat 3-5 times. This is not done between trees, but between total refits of the model. But that said, let me outline what I think you are suggesting:

  1. Assume g(Y, β)|X ~ D(θ(X)), where β is a finite-dimensional vector of parameters for the global transformation g such that g(⋅, β) always has an inverse, and θ(X) is a finite-dimensional vector of functions of X.
  2. Decide on a score function S(g(Y, β), D(θ(X))) differentiable in β and θ. Write it as S(β, θ) for short, with the data X and Y presumed fixed.
  3. Initialize β = β₀ and θ(X) = θ₀(X) and iterate the following two steps:
    a) Update θ(X) = θ(X) + ε ∂S(β, θ(X))/∂θ(X) approximately via a gradient boosting step (i.e. ∂S(β, θ(X))/∂θ(X) is estimated nonparametrically for each θⱼ(X) using a regression tree)
    b) Update β = β + ε ∂S(β, θ(X))/∂β (standard gradient descent)

To predict a point estimate or quantile, calculate the corresponding point estimate or quantile from D(θ(X)), then pass it through g⁻¹(⋅, β).
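In rough code, just so we're talking about the same loop (everything here is a placeholder sketch - the score gradients and the transform g are passed in, and a plain sklearn tree stands in for the ngboost base learner):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_tree_to_gradient(X, grad, max_depth=3):
    # stand-in for the usual boosting base learner fit
    return DecisionTreeRegressor(max_depth=max_depth).fit(X, grad)

def fit_alternating(X, Y, g, score_grad_theta, score_grad_beta,
                    theta0, beta0, eps=0.01, n_iter=500):
    """Alternate (a) a boosting step on theta(X) with (b) a gradient step on the
    global transform parameters beta, as in step 3 above."""
    theta = np.tile(theta0, (len(Y), 1))      # per-observation parameters
    beta = np.asarray(beta0, dtype=float)     # global transform parameters
    theta_trees = []

    for _ in range(n_iter):
        # (a) fit one tree per distributional parameter to the score gradient
        grad_theta = score_grad_theta(g(Y, beta), theta)       # shape (n, p)
        trees = [fit_tree_to_gradient(X, grad_theta[:, j])
                 for j in range(grad_theta.shape[1])]
        theta = theta + eps * np.column_stack([t.predict(X) for t in trees])
        theta_trees.append(trees)

        # (b) ordinary gradient step on beta, holding theta(X) fixed
        beta = beta + eps * score_grad_beta(g(Y, beta), theta, beta)

    return theta_trees, beta
```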

Does that accurately summarize your proposal?

What I would worry about here is that you are updating g in each iteration, but never actually learning based on that final g. So the predicted output of the first tree is not optimized for any transformation, the predicted output of the second tree is only optimized for a small transformation, and so on. I'm not sure it would work at prediction time - by the end of the fitting, you have a bunch of trees that are all optimized to predict slightly different things, most of which are by now pretty far from the ultimate target g(Y, β). But who knows, maybe it would.

@guyko81 (Author) commented Feb 24, 2021

@alejandroschuler I think the summary is accurate. This is my understanding of the process, with the reason why it should work.
NGBoost with normalizing flows.pptx

And you asked earlier what transformation I would choose. I was actually wrong - my proposed method would be monotonic as well, but with no closed form. I suggest using Neural Spline Flows (Conor Durkan, Artur Bekasov, Iain Murray, George Papamakarios, NeurIPS 2019).

It's already implemented in Pyro.ai: https://pyro.ai/examples/normalizing_flows_i.html

I don't know if it covers all the possible transformations, but for a first iteration I would start with that.
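For reference, the minimal spline-flow setup from that tutorial looks roughly like this (toy data; exact arguments may differ slightly between Pyro versions):

```python
import torch
import pyro.distributions as dist
import pyro.distributions.transforms as T

# 1-D standard normal base distribution
base_dist = dist.Normal(torch.zeros(1), torch.ones(1))

# monotonic rational-quadratic spline transform (Durkan et al., 2019)
spline_transform = T.Spline(1, count_bins=16)
flow_dist = dist.TransformedDistribution(base_dist, [spline_transform])

# fit the spline parameters by maximum likelihood on some toy data
y = 0.5 * torch.randn(1000, 1) + 1.0
optimizer = torch.optim.Adam(spline_transform.parameters(), lr=1e-2)
for _ in range(500):
    optimizer.zero_grad()
    loss = -flow_dist.log_prob(y).mean()
    loss.backward()
    optimizer.step()
    flow_dist.clear_cache()
```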

@alejandroschuler (Collaborator):

@guyko81 excellent, thank you for the slides, paper, and documentation. Using splines is a good idea and seems like a natural way to get a lot of flexibility (though technically still parametric).

I think you should go ahead and give it a shot! Let's see what happens.

@guyko81 (Author) commented Feb 24, 2021

Thanks, I'll get started then.

@soumyasahu (Collaborator) commented Mar 4, 2021

Sorry, @alejandroschuler, I was very busy with my research and studies, so I'm catching up late. I really enjoyed the above discussion. @guyko81, it is indeed a nice idea to use a coupling transformation with monotonic rational splines (this is the first time I've come across it), and it is also non-parametric in the sense that the number of parameters increases with the number of data points. From the learning point of view, I have a few questions regarding neural spline flows:
(1) How many free parameters are involved in the transformation with (K+1) knots? In the paper I see (3K - 1) for i = d, ..., D; also, how is this 'd' obtained?
(2) Is the parameter space of each of these parameters the real line? If not, we may need some transformation to use them in NGBoost.

Although @alejandroschuler has nicely described the algorithm, I have a few questions/suggestions:
(1) What distribution are we using for the transformed variable? I think it is the logistic distribution, as described in the omnireg paper.
(2) One of the loss functions in NGBoost is the K-L divergence, for which the negative log-likelihood is an equivalent loss. The parameters are then estimated or updated using the natural gradient of that loss, which is the first part of step 3. We can view this as follows: given a transformation G, the parameters θ are updated to minimize the K-L divergence between the true distribution of G(y) and the logistic(θ) distribution. In the next step, the transformation parameters β are estimated with the ordinary gradient to minimize the same K-L distance, keeping the parameter values fixed. Although this looks fine, and omnireg used a similar idea, my suggestion is that before adding the covariate structure we check whether this works in a simple case: generate y from logistic(5, 1), suppose we observe ln(y), and, using the idea in step 3, see whether we can recover the true parameters and the distribution (a small setup sketch follows below).
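A minimal sketch of the data-generation side of that sanity check (only the setup; the fitting would use whichever alternating scheme we settle on):

```python
import numpy as np
from scipy.stats import logistic

# latent target: y ~ Logistic(loc=5, scale=1)
y = logistic.rvs(loc=5, scale=1, size=10_000, random_state=0)
y = y[y > 0]      # negative draws are rare at loc=5; drop them so ln(y) is defined

# what the model actually observes: ln(y)
y_obs = np.log(y)

# goal of the check: from y_obs alone, the procedure in step 3 should roughly
# recover an exp-like transformation along with loc ≈ 5 and scale ≈ 1 for the
# base logistic distribution
```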

Another advantage is that the algorithm is very flexible, as it will find different transformations for different data points, which is good if the data come from some mixture of differently transformed variables.

This is not an easy problem to deal with. I want to thank @guyko81 for coming up with this idea. Please let me know if you need any help from my side. I shall be happy to help.

@kevindarby:

Hi, anything ever come of this? @guyko81

@kevindarby:

I am going to try creating a set of FlowDist classes based on either

https://github.com/bayesiains/nflows/tree/master
or
https://github.com/ChrisWaites/jax-flows/tree/master

There are others, but I need to be able to tease out a single step() / loss() function to call from fit().

Unless someone knows of a non-JAX / non-torch flows library?
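One possible shape for such a wrapper, sketched against a generic torch-based flow object with .parameters() and .log_prob() (e.g. an nflows Flow) - this is just an illustration and doesn't reflect ngboost's actual Dist interface:

```python
import torch

class FlowDistWrapper:
    """Thin wrapper exposing the single step()/loss() pair mentioned above."""

    def __init__(self, flow, lr=1e-3):
        self.flow = flow                                   # e.g. an nflows Flow
        self.optimizer = torch.optim.Adam(self.flow.parameters(), lr=lr)

    def loss(self, y):
        # negative log-likelihood of the observed targets under the flow
        return -self.flow.log_prob(y).mean()

    def step(self, y):
        # one gradient update of the flow parameters; fit() would call this
        self.optimizer.zero_grad()
        loss = self.loss(y)
        loss.backward()
        self.optimizer.step()
        return float(loss.detach())
```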
