Adding normalizing flows #236
@guyko81 this is a very good idea. I've been kicking around variations of this concept for about a year. In fact, Jerome Friedman himself emailed us (me, @tonyduan, and @avati) a few months ago with some suggestions along these lines, as well as some related things he'd been working on (I've attached the papers he sent me here). I particularly like the approach suggested in the "omnireg" paper since it is fully nonparametric (the only restriction is that the transformation is monotonic). I haven't 100% thought through how to implement this in ngboost, but if you're interested in working on it there is definitely a publication waiting to be had.
@alejandroschuler I would love to work on it. I'm going to read the papers and spend all the time needed to make it work!
@alejandroschuler I was thinking about focusing only on univariate distributions in regression for now. I'm not quite sure yet, but my idea is to implement a transformation function that is common to all observations (we already assume that all observations follow the same distribution, just with different parameters, so this step is no more restrictive than the current solution), and the common transformation could be found in a parallel optimization.
Is that an oversimplification of the problem?
Your proposal makes sense to me. The only technical challenge is to include the current transformation in the calculation of the gradients for the distributional parameters, but I think it should still allow for modularity of transforms/distributions because of the chain rule. What kind of transforms are you thinking of trying? What method do you propose to use for the transform parameters in step 2? Something to note here is that if you use a fixed transform (no parameters), that is just equivalent to choosing a different fixed distribution.

In short: adding any global transformation is a special case of adding a new distribution that has a particular (larger) parametrization. And adding a new distribution can already be handled with the algorithm we have, without the need for intermediate steps. We are never escaping the need to specify a parametric form for the outcome distribution; we are just expanding the kinds of parametric forms we are willing to consider. Your proposal does allow for a global parameter that is constrained to be a constant, but that's a special case of having another parameter that varies with the features.

None of that is to say it's a bad idea; obviously this is exactly the same concept as "normalizing flows", which people seem to like a lot. But if you're going to try implementing this, given the above arguments, maybe it would be equivalent (slightly more general, really) to leave the fitting algorithm alone and instead develop a system for constructing modular, transformable distributions that a user could specify.

The method in the omnireg paper is fundamentally different because the transformation is fully nonparametric (only limited to monotonicity).
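The point that a global transformation is just a reparametrized distribution can be made concrete with the change-of-variables formula. A minimal sketch (the helper below is hypothetical, not NGBoost's actual API), assuming a fixed monotone transform g and a Normal base:

```python
import numpy as np
from scipy.stats import lognorm, norm

# Hypothetical sketch (not NGBoost's actual API): define a distribution
# for Y via Z = g(Y) ~ Normal(mu, sigma) with g fixed and monotone.
# Change of variables: log p_Y(y) = log p_Z(g(y)) + log |g'(y)|,
# so the transform just reshapes the parametric family, and gradients
# w.r.t. (mu, sigma) pass through g by the chain rule.
def transformed_logpdf(y, mu, sigma, g, g_prime):
    return norm.logpdf(g(y), loc=mu, scale=sigma) + np.log(np.abs(g_prime(y)))

# Example: g = log turns the Normal base into a log-normal for Y.
y = np.array([0.5, 1.0, 2.5])
mu, sigma = 0.3, 0.8
ours = transformed_logpdf(y, mu, sigma, np.log, lambda t: 1.0 / t)
ref = lognorm.logpdf(y, s=sigma, scale=np.exp(mu))
assert np.allclose(ours, ref)
```

Because the log-Jacobian term does not depend on (mu, sigma), the gradients with respect to the distributional parameters are simply the base-distribution gradients evaluated at g(y).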
(@soumyasahu may have thoughts to add here as well)
@alejandroschuler I might still be misunderstanding the paper, but from section 3 I still read that the transformation happens globally in the omnireg paper. In my understanding the algorithm operates with a non-parametric transformation of the target variable so that it satisfies the original assumption:

"One way to mitigate this problem is by transforming the outcome variable y using a monotonic function. The goal here is to find a monotonic transformation function g(y) such that (2) at least approximately holds for the transformed variable."

And the g(y) transformation is applied only to the target variable:

"Given the resulting solution location and scale functions one evaluates a new ĝ(y) using (16). This transformation function replaces the previous one resulting in new estimates for f̂(x) and ŝ(x) from (5) (6). These in turn produce a new transformation through (16) and so on."

Getting a parametric transformation during training seems a bit expensive to me. But maybe I'm wrong, and making a transformation globally has the same computational cost. My other concern with a parametric change on the tree leaves is that I'm not sure how it would affect the natural gradient boosting part (the Fisher information section). With a simple gradient method I have the feeling that I could write down the equations correctly.
@guyko81 right, the transformation in the omnireg paper is indeed global, but my point isn't that it's not global; it's that it can't be written down in a closed parametric form, and it is in that sense much more flexible in the kinds of relationships it can accommodate. As I argued, the global parametric transformation approach that you argue for (with a given distribution) is still a special case of specifying a larger parametric family.

However, I think that I may have misunderstood what you were proposing, based on your comment here:
One thing to note is that in the omnireg paper, I think Friedman is proposing to fit the entire boosting model, then find the optimal global transformation, then fit the entire model again, and repeat 3-5 times. This is done not between trees, but between total refits of the model. That said, let me outline what I think you are suggesting:
To predict a point estimate or quantile, calculate the corresponding point estimate or quantile from D(θ(X)), then pass it through g⁻¹(⋅, β). Does that accurately summarize your proposal?

What I would worry about here is that you are updating g in each iteration, but never actually learning based on that final g. So the predicted output of the first tree is not optimized for any transformation, the predicted output of the second tree is only optimized for a small transformation, and so on. I'm not sure it would work at prediction time: by the end of the fitting, you have a bunch of trees that are all optimized to predict slightly different things, most of which are by now pretty far from the ultimate target g(Y, β). But who knows, maybe it would.
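The alternation being described, and the train/predict mismatch concern, can be sketched schematically. Everything below is a toy stand-in (the "tree fit" and the transform update are placeholders, not real NGBoost or omnireg code):

```python
import numpy as np

# Toy schematic of the proposed loop (all names hypothetical): each
# boosting stage fits against the *current* transform g_t, and g is
# re-estimated between stages -- so early stages were optimized for a
# different target than the final g, which is the worry raised above.
rng = np.random.default_rng(0)
y = rng.lognormal(mean=0.0, sigma=0.5, size=200)

def g(y, lam):
    # Box-Cox-style monotone transform; tends to log(y) as lam -> 0
    return np.log(y) if abs(lam) < 1e-8 else (y**lam - 1.0) / lam

def fit_stage(residual):
    # stand-in for fitting a tree: just predict the mean of the residual
    return residual.mean()

def update_transform(lam):
    # stand-in for re-estimating g between stages; here we simply move
    # lam toward the log transform to mimic a changing g_t
    return 0.5 * lam

lam, pred = 1.0, 0.0
for t in range(5):
    z = g(y, lam)                       # target under the current g_t
    pred += 0.1 * fit_stage(z - pred)   # one gradient-boosting step on z
    lam = update_transform(lam)         # g changes before the next stage
```

The final `pred` is a sum of stage outputs, each of which targeted a different g_t; inverting only the last transform at prediction time is what raises the consistency question above.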
@alejandroschuler I think the summary is accurate. This is my understanding of the process, along with the reason why it should work. You asked earlier what transformation I would choose: I was wrong actually, my proposed method would be monotonic as well, but with no closed form. I suggest using neural spline flows:

Conor Durkan, Artur Bekasov, Iain Murray, George Papamakarios. Neural Spline Flows. NeurIPS 2019.

It's already implemented in Pyro.ai: https://pyro.ai/examples/normalizing_flows_i.html

I don't know if it covers all the possible transformations, but for a first iteration I would start with that.
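For reference, the core building block of the Neural Spline Flows paper, a single monotonic rational-quadratic bin, can be sketched in plain NumPy. This shows only one bin's forward evaluation; the full flow also learns the knot locations and derivatives and handles the tails:

```python
import numpy as np

# Minimal sketch of one monotonic rational-quadratic spline bin (after
# Durkan et al., Neural Spline Flows): strictly increasing on [x0, x1]
# whenever the knot derivatives d0, d1 are positive.
def rq_spline_bin(x, x0, x1, y0, y1, d0, d1):
    w, h = x1 - x0, y1 - y0
    s = h / w                      # average slope of the bin
    xi = (x - x0) / w              # position within the bin, in [0, 1]
    num = h * (s * xi**2 + d0 * xi * (1 - xi))
    den = s + (d1 + d0 - 2 * s) * xi * (1 - xi)
    return y0 + num / den

xs = np.linspace(0.0, 1.0, 101)
ys = rq_spline_bin(xs, 0.0, 1.0, -2.0, 3.0, d0=0.5, d1=2.0)
assert np.isclose(ys[0], -2.0) and np.isclose(ys[-1], 3.0)  # knots map to knots
assert np.all(np.diff(ys) > 0)                              # monotone
```

Monotonicity by construction is the property that matters here, since the transformation g(y) must be invertible for the change-of-variables density to make sense.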
@guyko81 excellent, thank you for the slides, paper, and documentation. Using splines is a good idea and seems like a natural way to get a lot of flexibility (though technically still parametric). I think you should go ahead and give it a shot! Let's see what happens.
Thanks, I'll get started then.
Sorry @alejandroschuler, I was very busy with my research and studies, so I'm catching up late. I really enjoyed the above discussion. @guyko81, it is indeed a nice idea to use a coupling transformation with monotonic rational splines (I learned about it for the first time here), and it is also non-parametric in the sense that the number of parameters increases with the number of data points. Although @alejandroschuler has nicely described the algorithm, I have a few questions/suggestions regarding neural spline flows.

Another advantage is that the algorithm is very flexible, as it will find different transformations for different data points, which is good if the data come from some mixture of differently transformed variables. This is not an easy problem to deal with. I want to thank @guyko81 for coming up with this idea. Please let me know if you need any help from my side; I shall be happy to help.
Hi, anything ever come of this? @guyko81
I am going to try creating a set of FlowDist classes based on https://github.com/bayesiains/nflows/tree/master. There are others, but I need to be able to tease out a single step()/loss() function to call from fit(). Unless someone knows of a non-JAX / non-torch flows library?
It might be a stupid idea, and I haven't thought it through from a code perspective, but couldn't NGBoost predict the normalizing flow's parameters as well, so that the final outcome distribution could be transformed into any arbitrary distribution? It would give the package very high flexibility.
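One way this idea could look: treat the transform's parameter(s) as extra distributional parameters scored by the negative log-likelihood, so each could in principle be predicted by its own ensemble of base learners. A hypothetical sketch using a per-observation Box-Cox exponent (the function and parametrization are illustrative, not NGBoost code):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical sketch: score = NLL of a Normal base after a per-observation
# Box-Cox transform, so the exponent lam becomes one more parameter that
# NGBoost could, in principle, predict alongside (mu, sigma).
def boxcox_nll(y, mu, sigma, lam):
    small = np.abs(lam) < 1e-8
    lam_safe = np.where(small, 1.0, lam)            # avoid 0/0 below
    z = np.where(small, np.log(y), (y**lam_safe - 1.0) / lam_safe)
    log_jac = (lam - 1.0) * np.log(y)               # log |dz/dy|, for y > 0
    return -(norm.logpdf(z, loc=mu, scale=sigma) + log_jac)

y = np.array([0.7, 1.3, 2.1])
# each parameter vector could come from its own ensemble of base learners
scores = boxcox_nll(y, mu=np.zeros(3), sigma=np.ones(3),
                    lam=np.array([0.0, 0.5, 1.0]))
assert np.all(np.isfinite(scores))
```

With lam = 1 the transform is just a shift, so the score reduces to the plain Normal NLL of y - 1; with lam = 0 it reduces to the log-normal case. The open question raised earlier in the thread, how such extra parameters interact with the natural gradient (Fisher information), would still need to be worked out.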