Some thoughts from the perspective of a Haskeller while reading https://github.com/Kindelia/HVM/blob/master/HOW.md #74
Edit: since the comment became big, I'll split it into sections to make it more organized.

### Improving HOW.md for FP veterans

Your comment is very interesting. It seems like you approached the document trying to understand HVM by tracing connections to how GHC operates, but doing so ended up being more confusing, perhaps for the same reasons that approaching FP from an OOP mindset is. It seems like you got stuck in some local maxima of insight ("ah, so dup nodes are like reference-counted thunks"), which are kinda true but irrelevant, and ended up missing the bigger picture ("ah, the core is fully linear, un-scoped, and reference-free, making it inherently lightweight; lazy dups are just the least aggressive way to introduce copying without affecting these aspects"). Regardless, both approaches should ideally work, so thanks for reporting your experience reading HOW.md. I can already think of some changes that would make the document less confusing to people coming from a similar mindset. I think we could trace even more parallels with GHC, and perhaps devote a section to it. I don't have knowledge of its internals, though.

### Comparing DUPs and SUPs to Haskell thunks
You could kind of see a dup node as "a unitary reference-counted thunk, where the reference count is always 2, and such that it gets immediately undone as soon as it is read once"... perhaps? That's a contrived way to describe it, but I guess it is accurate.
Yes, there is more structure, because copying bound variables requires superposition nodes, so superpositions are part of the "copying machinery". But superpositions aren't always related to dup nodes. There are SUPs without DUPs, and there are DUPs without SUPs (like cloning numbers or constructors). At this point, comparisons to thunks will end up being more complex than the entire description of sups and dups, so these should just be learned as new concepts.

### A better example of the LAM-DUP rule
You mean this example:

```
dup f g = ((λx λy (Pair (+ x x) y)) 2)
(Pair (f 10) (g 20))
```

Right? Initially, it was going to be something like:

```
dup f g = λy (Pair (+ 2 2) y)
(Pair (f 10) (g 20))
```

Because STG isn't able to share the redex under the λy binder of:

```
dup f g = ((λx λy (Pair (+ x x) y)) 2)
(Pair (f 10) (g 20))
```

But you're right that GHC would also inline this program, so this isn't a great example either. The problem here is that this is just a static expression; technically GHC could inline everything and just run it at compile time. So, yes, perhaps using free variables would help. The problem is, if I wrote something like this:

```
λn
dup f g = λy (Pair (+ n n) y)
(Pair (f 10) (g 20))
```

People could just complain that I'm reducing inside λ-binders, and it would look like HVM is actually doing more work somehow. So perhaps I should instead present an actual program with external inputs, like this?

```
(Main n a b) =
  let f = λy (Pair (+ n n) y)
  (Pair (f a) (f b))
```

That presentation would make it clear that HVM shares the `(+ n n)` work across both calls, where the corresponding Haskell program would be:

```haskell
main = do
  [n, a, b] <- readArgs :: IO [Word32]
  let f = \y -> (n + n, y)
  print $ (f a, f b)
```

What do you think?

### Exploring beta-optimality techniques
Not quite. I never thought of it that way, at least. Perhaps if you think of using λ-encodings instead of constructors as an eta-expansion, then you could say so? But not even λ-encodings are the point. It is all about sharing common lambdas. No matter if they're from λ-encodings, eta-expansions or just normal lambdas.
No, because the point wouldn't be to eta-expand lambdas, but to share these lambdas. If you just eta-expand without sharing, then yes, it would be wasteful.
Yes, that's exactly it. HVM can be seen as a runtime-GHC. Of course, running GHC at runtime would be prohibitively expensive. But HVM does that, while still being efficient.
No, I really think that's not the case, at all. I don't think HVM loses any sharing due to eta-expansion either. That happens in GHC precisely because it isn't able to share inside binders. I think if we ever compile Haskell to HVM, we could flag GHC to eta-expand as much as it wants without fear. I could be wrong, though. Regardless, it is a different problem I was talking about here. What this paragraph is talking about is how we can build a function that can be called 2^N times in a way that uses only…

### Comparing HVM, GHC and Supercombinators
That's correct, but somewhat misleading. Yes, it can be seen as supercombinators, to an extent, but also not. Let me try to trace a parallel. The ideal input for HVM is a big list of small equational clauses. So, for example, a compiler targeting HVM should certainly replace something like:

```haskell
foo a x = case a of
  True  -> Just (42, 10 * x)
  False -> Nothing
```

By:

```haskell
foo True  x = Just (42, 10 * x)
foo False x = Nothing
```

Once functions are in that shape, we can convert them into very fast C procedures. For example:

```c
Term foo(Term a, Term x) {
  if (a == Ctr(TRUE, [])) {
    return Ctr(JUST, [Ctr(PAIR, [42, 10 * x])]);
  } else {
    return Ctr(NOTHING, []);
  }
}
```

And that's all HVM does. This is fast for the same reason supercombinators are fast: it is just a C function that does some branching and allocs some memory. That is great. The difference, though, is how supercombinators and HVM deal with lambdas. Think of this function:

```haskell
foo True  = \x -> (42, x * 10)
foo False = \x -> (42, x + 20)

main = foo True 7
```

This function cannot be compiled to a fast C procedure, because C doesn't have closures. So, every runtime must deal with this situation in some way. Supercombinators deal with it by transforming the whole program in a way that functions only receive/return concrete values (numbers, constructors). That way, lambdas don't exist at runtime, and everything is compiled to fast C-like procedures. Like this:

```c
// Notice "x" was uncurried, allowing `foo` to
// return a concrete value, not a lambda
Term foo(Term a, Term x) {
  if (a == Ctr(TRUE, [])) {
    return Ctr(PAIR, [42, x * 10]);
  } else {
    return Ctr(PAIR, [42, x + 20]);
  }
}

u64 main = foo(1, 7);
```

That looks blazingly fast. And it is. Until there are too many lambdas. In that case, the involved transformations will make your functions considerably heavier, and you'll lose the benefits. GHC deals with it by wrapping runtime lambdas as objects that track their arguments and, once fully saturated, call the machine code that returns the concrete value. Like this:

```c
Term foo_0(Term[] args) {
  return Ctr(PAIR, [42, args[0] * 10]);
}

Term foo_1(Term[] args) {
  return Ctr(PAIR, [42, args[0] + 20]);
}

// Foo wasn't uncurried, but it returns a black-box closure
Term foo(Term a) {
  if (a == Ctr(TRUE, [])) {
    return Closure(foo_0);
  } else {
    return Closure(foo_1);
  }
}

u64 main = call(foo(1), 7);
```

Of course, GHC will do tons of optimizations to avoid returning closures, and this ignores thunks/memoization, but the point is: there are no expressions inside lambdas, just machine code waiting to be called. That is why you can't inspect the body of a lambda, and why GHC can't share expressions inside them. HVM deals with it by simply treating lambdas as concrete values. That's all. When a lambda is returned, nothing special happens. It literally just returns a lambda, just like you would return a list constructor. Like this:

```c
// Foo wasn't uncurried; it just returns a lightweight lambda
Term foo(Term a) {
  if (a == Ctr(TRUE, [])) {
    Var x = fresh();
    return Lam(x, Ctr(PAIR, [42, Mul(x, 10)]));
  } else {
    Var y = fresh();
    return Lam(y, Ctr(PAIR, [42, Add(y, 20)]));
  }
}

u64 main = App(App(foo, 1), 7);
```

Since lambdas are not special, you can compile every program to a fast C-like procedure, which is the whole point of supercombinators, except you don't need to do any additional transformation to avoid lambdas. The only consequence is that Lam/App/Add nodes require some extra memory, but almost all of it can be avoided, and HVM is designed with that in mind. For example, by organizing your program as small top-level equations, almost all LAM/APP nodes become unnecessary. And HVM compiles numeric nodes like ADD in a way that allows the C compiler to inline them whenever possible. So, in the program above, …

Now, there are cases where avoiding LAM/APP/ADD nodes isn't possible in HVM, but these are exactly the cases where supercombinators would need expensive program transformations, and GHC would use closures, which also use extra memory. The nice thing, though, is that these are also the cases where HVM will outperform GHC asymptotically, precisely because the lambdas are still present at runtime. This information isn't lost, making execution-time inlining possible, which isn't possible in GHC, since lambda bodies were already converted to machine code. Uff. Hope this makes sense...
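To pin down the behavior both runtimes must agree on, here is a runnable Haskell sketch of the lambda-returning example above (the type signature is my addition; GHC compiles each branch's lambda to a heap-allocated closure, while HVM would return a plain Lam node, but both compute the same result):

```haskell
-- A function whose result is itself a function value. In GHC each
-- branch's lambda becomes a closure; in HVM it is just a Lam node.
foo :: Bool -> (Int -> (Int, Int))
foo True  = \x -> (42, x * 10)
foo False = \x -> (42, x + 20)

main :: IO ()
main = print (foo True 7)   -- prints (42,70)
```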
Edit: Urgh, accidental early submission while I nearly dropped my laptop. Will post the rest later on.
Yes, hence I would have appreciated an early debunk of my initial thinking with a forward reference.
Well, this isn't a thought that you get while you haven't even read about how lambdas are duplicated... Hence the need for a forward reference.
Hmm, here it is still statically possible to float out the shared work:

```haskell
import Control.Monad.Trans.Reader

replicateM' :: Monad m => m a -> Int -> m [a]
replicateM' _ 0 = pure []
replicateM' m n = (:) <$> m <*> replicateM' m (n-1)
{-# NOINLINE replicateM' #-}

main = do
  let xs = runReader (replicateM' ((+ 1) <$> ask) 100000) (41 :: Int)
  print $ sum xs
```

This will add…
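A shrunk, self-contained variant of that program (10 elements instead of 100000, and a pure `result` instead of printing; the names are mine) makes explicit what it computes — every element is 41 + 1:

```haskell
import Control.Monad.Trans.Reader (runReader, ask)

replicateM' :: Monad m => m a -> Int -> m [a]
replicateM' _ 0 = pure []
replicateM' m n = (:) <$> m <*> replicateM' m (n - 1)

-- Ten copies of (41 + 1); whether the addition happens once (floated
-- out of the recursion by GHC) or ten times is exactly the sharing
-- question under discussion.
result :: Int
result = sum (runReader (replicateM' ((+ 1) <$> ask) 10) 41)
-- result == 420
```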
I mean eta-expansion in the sense that each datatype gives rise to a "recursor", which is what I think you mean by the type's "λ-encoding". And then you say…
Which is exactly eta-expansion. And as I said, eta-expansion is always sound, but may lose sharing, at least in traditional GHC-compiled code.
Note that I'm talking about sharing the work of the lambda body, not "sharing its lambdas". Just so that we are talking about the same kind of sharing, here's an example:

```haskell
let f x = if sum [0..x] < 13 then id else (+1)
    f13 = f 13
in f13 15 + f13 54
```

Eta-expansion of…
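The example, as a runnable program (the `result` name and signatures are mine): `sum [0..13]` is 91, so `f 13` selects the `(+ 1)` branch, and both uses of `f13` reuse that single function value instead of redoing the `sum`:

```haskell
f :: Int -> (Int -> Int)
f x = if sum [0 .. x] < 13 then id else (+ 1)

-- `f 13` performs the `sum` work once; eta-expanding f13 to
-- \x -> f 13 x would redo it at every call site in GHC-compiled code.
f13 :: Int -> Int
f13 = f 13

result :: Int
result = f13 15 + f13 54
-- result == (15 + 1) + (54 + 1) == 71
```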
Ah, so does that mean that eta-expanding…

Interesting. I would expect those two occs of…
Ah, so you rather want to have:

```haskell
foo x = case x of
  True -> \y1 -> Just (42, 10 * y1)
  False -> \y2 -> Nothing
```

Correct? But then the latter is not fully eta-expanded. So there is some drawback to that, perhaps the 2^N duplication problem you talked about earlier?
But this is eta-expanded again. So you expect eta-contracted input because it is semantically richer, but ultimately generate eta-expanded code? What if we instead started with:

```haskell
foo x = case odd $ sum [0..x] of
  True -> \y1 -> Just (42, 10 * y1)
  False -> \y2 -> Nothing
```

Would you eta-expand over the…?

My point is that eta-expansion makes some programs faster, but others slower. It's the job of the compiler/HVM(?) to figure out when it's beneficial to do so. Alternatively, not do it at all.
Well yes, do closure conversion or lambda lifting or something in between (Appel's work on closure conversion). I think the original supercombinators paper used lambda lifting to pass free variables, but I don't think it would need to; closure conversion would work just as well. The point I was trying to make is that supercombinators take whole lambda expressions and find an efficient compilation to machine code, rather than just translating the expressions to well-known combinators like SKI first (or interpreting the code represented as a data structure, as I think…
Aha! This is the interesting part. You're returning a lambda from each branch:

```haskell
foo a = case odd $ sum [0..a] of
  True -> \x -> Just (42, 10 * x)
  False -> \y -> case y > 2 of
    True -> Just (42, 20 * y)
    False -> Nothing
```

What code would you generate here?
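For reference, the question's program is directly runnable Haskell (only the type signature is my addition):

```haskell
-- Both branches of the outer case return a function value; the
-- False branch's function does further case analysis on its argument.
foo :: Int -> (Int -> Maybe (Int, Int))
foo a = case odd $ sum [0 .. a] of
  True  -> \x -> Just (42, 10 * x)
  False -> \y -> case y > 2 of
    True  -> Just (42, 20 * y)
    False -> Nothing
```

For example, `foo 1 5` is `Just (42, 50)` (`sum [0..1]` is odd), while `foo 3 5` is `Just (42, 100)` (`sum [0..3]` is 6, even, and 5 > 2).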
That example is way too contrived; it must be simple and pedagogical. I don't think people will go that far. The point of this example is to show a situation where GHC can't share a redex. It is there to teach someone trying to learn, not to convince the most absolutist skeptics on whether GHC can always "fix" its lack of beta-optimality by applying smart compile-time manipulations. For these, there is the book, which has proofs that this isn't the case.
No, a recursor is a type of λ-encoding called the Church encoding. Usually I mean the Scott encoding, since pattern-matching on it has the same complexity as pattern-matching on built-in datatypes.
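The distinction can be sketched in Haskell (toy types and names of my own; `RankNTypes` required). A Church-encoded number is its own recursor (a fold), while a Scott-encoded number only exposes its outermost constructor, so matching on it is constant-time, like matching on built-in data:

```haskell
{-# LANGUAGE RankNTypes #-}

-- Church encoding: a number *is* its recursor (a fold).
type Church = forall a. (a -> a) -> a -> a

czero :: Church
czero _ z = z

csucc :: Church -> Church
csucc n s z = s (n s z)

-- Scott encoding: a number *is* its pattern-match; it exposes only
-- the top constructor, so a match is O(1).
newtype Scott = Scott (forall a. (Scott -> a) -> a -> a)

szero :: Scott
szero = Scott (\_ z -> z)

ssucc :: Scott -> Scott
ssucc n = Scott (\s _ -> s n)

-- Constant-time predecessor: trivial with Scott, awkward with Church.
spred :: Scott -> Scott
spred (Scott m) = m id szero

toInt :: Scott -> Int
toInt (Scott m) = m (\n -> 1 + toInt n) 0
```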
I don't think that is what eta-expansion means, though. Eta-expansion is taking a function like `f` and rewriting it as `λx (f x)`.
I know. The problem is it seems like we're mixing two different subjects. One thing is optimizing function composition from…
There is no built-in…
Yes. The program you wrote loses no sharing on HVM after eta-expansion.
No, we often want to have small right-hand sides, because the larger they are, the more memory must be allocated. For example, this:

```haskell
foo x = case odd $ sum [0..x] of
  True -> \y1 -> Just (42, 10 * y1)
  False -> \y2 -> Nothing
```

Must alloc space for a representation of the entire "case-of" expression whenever `foo` is applied.
There is no "case-of" in HVM, so this expression doesn't even exist. You have 3 alternatives.

(1) Keep everything in a single equation:

```
// Could also be called "If"
(BoolCaseOf True t f) = t
(BoolCaseOf False t f) = f

(Foo a) =
  let true_case = λx (Just (Pair 42 (* 10 x)))
  let false_case = λy ( ... )
  (BoolCaseOf (Odd (Sum (Range 0 a))) true_case false_case)
```

That is the easy, but stupid way, because the right-hand side will be huge, and you'll get a constant-time slowdown, since you must alloc all of it whenever you call `Foo`.
(2) Split the dispatch into a separate top-level function:

```
(Foo a) = (Foo_0 (Odd (Sum (Range 0 a))))
(Foo_0 True) = λx (Just (Pair 42 (* 10 x)))
(Foo_0 False) = (...)
```

This is the proper way. It will generate fast programs.
(3) Use λ-encodings. This way, you don't even use constructors, just lambdas. This is the worst way in general, since it uses the most memory, but if the compiler is really, really clever, it may abuse λ-encodings to gain asymptotic speedups. Normal compilers should just go with (2) and all will be fine.
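Alternative (2) has a direct Haskell reading (names `foo`/`foo0` are hypothetical): the scrutinee is computed by a wrapper, while the tiny right-hand sides live in a separate top-level function, so calling the wrapper allocates almost nothing:

```haskell
-- Wrapper: computes the scrutinee only.
foo :: Int -> (Int -> Maybe (Int, Int))
foo a = foo0 (odd (sum [0 .. a]))

-- Worker: one small equation per constructor, as HVM prefers.
foo0 :: Bool -> (Int -> Maybe (Int, Int))
foo0 True  = \x -> Just (42, 10 * x)
foo0 False = \_ -> Nothing
```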
Yes, exactly. I wanted to give an example where the loss of sharing would really matter to GHC. I agree it's not pedagogical at all. I hope that you might be able to come up with better examples than the one I gave.
Interesting, I'd really like to see and understand these proofs. Were they written with the kind of optimisations in mind that GHC can do? E.g., that certain programs would never be relevant because you can just float the redex out of a lambda into a shared let. At the end of the day, I'm a pragmatist and I take the solution that makes my programs go the fastest. If that entails a bit of restructuring my program, a smart compiler like GHC, or an exciting new virtual machine for STG, I'll take it. But so far, I know of no concrete program where GHC isn't able to optimise it so that the promised sharing of HVM would make a positive difference. It's always "This is such a program" and then I say "No, it isn't, because GHC would float out the shared work". So what is a program where the STG code produced by GHC lacks sharing of work that HVM could exploit?
No, eta expansion takes an expression…
Same thing as your "sharing of lambdas".
Yes, that makes sense.
Yes, I mostly care about the notion of sharing I talked about, trying to figure out "how HVM sharing works". I'm getting lost in our discussion threads, so allow me to start a new one with the open questions I'm interested in. Function equations are "better" than…
Just letting you know that I'll be busy in the short term, so I will not be able to answer your last comment in the next days, but I'll be back to it when possible. Thanks for bringing great questions and insights.
Very interesting discussion to propose an alternative lambda calculus: A lambda calculus for the HVM #97
Some notes I took while reading your introduction to your exciting project from my perspective as someone familiar with performant GHC-Haskell code. Of interest to me is how exactly everything is mapped to hardware, at least to the point of heap object layout and memory reads/writes. Not everybody might need that much detail, but perf seems like the main selling point of overhauling the machine model of LC languages, so that's what I expect to take away from reading. Later... Looking forward to more details on that in the dedicated section. :)
### Clarify the difference between `dup` and reference-counted thunks

Once I read (at the end of this section) …, I immediately thought "Ah, so `dup` nodes are like reference-counted thunks". Is that true? Why isn't it true? I presume it has something to do with reduction of open expressions (under lambdas).

Later... Indeed, superposition means I need to remember which "reference count" referenced which entry in the superposition. So there is more structure. I wish that was mentioned earlier, so that I knew where to look for the difference.
### Better example for Lambda deduplication

I find the example at the end of https://github.com/Kindelia/HVM/blob/master/HOW.md#lambda-duplication a bit unconvincing, because no optimising compiler would leave behind an obvious redex like `(λx. ...) 2`. GHC would start with … to begin with. I know that there are probably better examples with open expressions (otherwise, what's the point?), perhaps one with an expression under a non-redex lambda that can't be floated out because it mentions the lambda-bound variable or some such (no idea, really). Giving one of those would be much more informative!

In other words, I think this section should mention the "killer example".
### Bonus: Abusing Beta-Optimality

What follows is more like a few loose connections I can make as someone familiar with the way GHC-compiled programs work.

The takeaway of this section for me seems to be that we want to eta-expand all definitions as much as possible. In GHC that is generally true, too, but problematic if doing so destroys sharing (for example, when expanding over a shared let binding, effectively floating it inside a lambda). Are there similar cases in which sharing is destroyed, due to eagerly `dup`ped lambdas?

In fact, I've argued that GHC's rewrite RULEs from `[a]` to fusion helpers should be made more explicit by introducing a data type … rather than having the helpers scattered about like this. Indeed, if you only used `FB` everywhere we use lists today, we wouldn't need any rewrite rules at all.

Indeed, `mapFB f . mapFB g` is optimised to `mapFB (f . g)`, too, by GHC. I presume that you mean that Interaction Nets support these optimisations dynamically, for example if `mapFB g` is passed as some function parameter `x` and we have `mapFB f . x` in the body. That indeed seems like it has an edge over what GHC does!

Yes, this seems like the loss of sharing due to excessive eta-expansion I talk about above. Note also that transforming the arity-2 function into the arity-1 function is unsound, as it makes `Add` stricter: ``Add undefined `seq` ()`` diverges after the transformation. So the programmer has to write the arity-1 definition from the beginning.

### Implementation
Judging from what `reduce` does, it seems to me that the implementation is more like a bytecode interpreter, because of the lack of a compile-time mapping from lambdas to actual machine code. Do you have ideas for how such a mapping might work? Similar to how "graph reduction" transitioned to today's "supercombinator" encodings (which is just the familiar lambda = <static link, code ptr> encoding again).