Some thoughts from the perspective of a Haskeller while reading https://github.com/Kindelia/HVM/blob/master/HOW.md #74

Closed
sgraf812 opened this issue Feb 9, 2022 · 7 comments

Comments

sgraf812 commented Feb 9, 2022

Some notes I took while reading your introduction to your exciting project from my perspective as someone familiar with performant GHC-Haskell code. Of interest to me is how exactly everything is mapped to hardware, at least to the point of heap object layout and memory reads/writes. Not everybody might need that much detail, but perf seems like the main selling point of overhauling the machine model of LC languages, so that's what I expect to take away from reading. Later... Looking forward to more details on that in the dedicated section. :)

Clarify the difference between dup and reference counted thunks

Once I read (at the end of this section)

Collection is orchestrated by variables that go out of scope.

I immediately thought "Ah, so dup nodes are like reference counted thunks". Is that true? Why isn't it true?
I presume it has something to do with reduction of open expressions (under lambdas). Later... Indeed, superposition means I need to remember which "reference count" referenced which entry in the superposition. So there is more structure. I wish that was mentioned earlier, so that I knew where to look for the difference.

Better example for Lambda deduplication

I find the example at the end of https://github.com/Kindelia/HVM/blob/master/HOW.md#lambda-duplication a bit unconvincing, because no optimising compiler would leave behind an obvious redex like (λx. ...) 2. GHC would start with

dup f g = λy (Pair 4 y)
(Pair (f 10) (g 20))

to begin with. I know that there are probably better examples with open expressions (otherwise, what's the point?), perhaps one with an expression under a non-redex lambda that can't be floated out because it mentions the lambda-bound variable or some such (no idea, really). Giving one of those would be much more informative!

In other words, I think this section should mention the "killer example".

Bonus: Abusing Beta-Optimality

What follows is more like a few loose connections I can make as someone familiar with the way GHC-compiled programs work.

The takeaway of this section for me seems to be that we want to eta-expand all definitions as much as possible. In GHC that is generally true, too, but problematic if doing so destroys sharing (for example, when expanding over a shared let binding, effectively floating it inside a lambda). Are there similar cases in which sharing is destroyed, due to eagerly dupped lambdas?

For example, a clever way to implement a Data.List would be to let all algorithms operate on λ-encoded Church Lists under the hood, converting as needed. This has the same "deforestation" effect as Haskell's rewrite pragmas, without any hard-coded compile-time rewriting, and in a more flexible way.

In fact, I've argued that GHC's rewrite RULEs from [a] to fusion helpers should be made more explicit by introducing a data type

newtype FB a = FB { unFB :: forall b. (a -> b -> b) -> b -> b }

mapFB :: (a -> b) -> FB a -> FB b
...

rather than having the helpers scattered about like this. Indeed, if you only used FB everywhere we use lists today, we wouldn't need any rewrite rules at all.
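
For concreteness, here is a minimal sketch of what such an FB module could look like (the mapFB body and the toList/fromList helpers are my own illustration, not GHC's actual fusion internals):

{-# LANGUAGE RankNTypes #-}

-- A foldr/build-style encoding: a list is a value that knows how to fold itself.
newtype FB a = FB { unFB :: forall b. (a -> b -> b) -> b -> b }

-- Mapping post-composes f onto the "cons" continuation.
mapFB :: (a -> b) -> FB a -> FB b
mapFB f (FB g) = FB (\cons nil -> g (cons . f) nil)

-- Conversions to and from ordinary lists at the boundaries.
toList :: FB a -> [a]
toList (FB g) = g (:) []

fromList :: [a] -> FB a
fromList xs = FB (\cons nil -> foldr cons nil xs)

With these definitions, mapFB f . mapFB g inlines to mapFB (f . g) (up to associativity of (.)), with no RULES involved.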

For example, using map in a loop is "deforested" in HVM. GHC can't do that, because the number of applications is not known statically.

Indeed, mapFB f . mapFB g is optimised to mapFB (f . g) by GHC, too. I presume that you mean that Interaction Nets support these optimisations dynamically, for example if mapFB g is passed as some function parameter x and we have mapFB f . x in the body. That indeed seems like it has an edge over what GHC does!

Note that too much cloning will often make your normal forms large, so avoid these by keeping your programs linear.

Yes, this seems like the loss of sharing due to excessive eta-expansion I talk about above. Note also that transforming the arity 2 function into the arity 1 function is unsound, as it makes Add stricter: Add undefined `seq` () diverges after the transformation. So the programmer has to write the arity 1 definition from the beginning.
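
To spell out the arity point in GHC terms, a minimal sketch (the names addTwo/addOne are mine; the behaviour described in the comments is what you get without optimisations, since at -O GHC may eta-expand through the seq unless -fpedantic-bottoms is on):

-- Arity 2: a partial application is already a value (a PAP), so seq'ing it succeeds.
addTwo :: Int -> Int -> Int
addTwo x y = x + y
-- addTwo undefined `seq` ()  ==>  ()

-- Arity 1: x must be forced before the inner lambda can be returned, so seq'ing diverges.
addOne :: Int -> Int -> Int
addOne x = x `seq` \y -> x + y
-- addOne undefined `seq` ()  ==>  diverges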

Implementation

Judging from what reduce does, it seems to me that the implementation is more like a bytecode interpreter because of the lack of a compile-time mapping from Lambdas to actual machine code. Do you have ideas for how such a mapping might work? Similar to how "graph reduction" transitioned to today's "supercombinator" encodings (which is just the familiar lambda=<static link, code ptr> encoding again).

VictorTaelin commented Feb 9, 2022

Edit: since the comment became big, I'll separate it in sections to make it more organized.

Improving HOW.md for FP veterans

Your comment is very interesting. It seems like you approached the document trying to understand HVM by tracing connections to how GHC operates, but doing so ended up being more confusing, perhaps for the same reasons that approaching FP coming from an OOP mindset is. It seems like you got stuck in some local maximums of insight ("ah, so dup nodes are like reference counted thunks"), which are kinda true but irrelevant, and ended up missing the bigger picture ("ah, the core is fully linear, un-scoped, and reference-free, making it inherently lightweight; lazy dups are just the least aggressive way to introduce copying without affecting these aspects").

Regardless, ideally both approaches should work, so thanks for reporting your experience reading HOW.md. I can already think of some changes that would make the document less confusing to people coming from a similar mindset. I think we could trace even more parallels with GHC, and perhaps devote a section to it. I don't have knowledge of its internals, though.

Comparing DUPs and SUPs to Haskell thunks

I immediately thought "Ah, so dup nodes are like reference counted thunks". Is that true? Why isn't it true?

You could kind of see a dup node as "a unitary reference-counted thunk, where the reference count is always 2, and such that it gets immediately undone as soon as it is read once"... perhaps? That's a contrived way to describe it, but I guess it is accurate.

So there is more structure.

Yes, there is more structure, because copying bound variables requires superposition nodes, so superpositions are part of the "copying machinery". But superpositions aren't always related to dup nodes. There are SUPs without DUPs, and there are DUPs without SUPs (like cloning numbers or constructors). At this point, comparisons to thunks will end up being more complex than the entire description of sups and dups, so these should just be learned as new concepts.

A better example of the LAM-DUP rule

I find the example at the end of https://github.com/Kindelia/HVM/blob/master/HOW.md#lambda-duplication a bit unconvincing, because no optimizing compiler would leave behind an obvious redex like (λx. ...) 2.

You mean this example:

dup f g = ((λx λy (Pair (+ x x) y)) 2)
(Pair (f 10) (g 20))

Right? Initially, it was going to be something like:

dup f g = λy (Pair (+ 2 2) y)
(Pair (f 10) (g 20))

Because STG isn't able to share the 2 + 2. The problem is, GHC obviously will inline 2 + 2 at compile time. So, I needed something that, after some reductions, ended up producing the term above, which is how I came up with:

dup f g = ((λx λy (Pair (+ x x) y)) 2)
(Pair (f 10) (g 20))

But you're right that GHC would also inline this program, so this isn't a great example either. The problem here is that this is just a static expression; technically, GHC could inline everything and just run it at compile time. So, yes, perhaps using free variables would help. The problem is, if I wrote something like this:

λn
  dup f g = λy (Pair (+ n n) y)
  (Pair (f 10) (g 20))

People could just complain that I'm reducing inside λ-binders, and it would look like HVM is actually doing more work somehow? So perhaps I should instead present an actual program with external inputs, like this?

(Main n a b) =
  let f = λy (Pair (+ n n) y)
  (Pair (f a) (f b))

That presentation would make it clear that n is a dynamic input, and it wouldn't look like I'm reducing inside lambdas. The equivalent Haskell program can't share the n + n redex:

main = do
  [n, a, b] <- readArgs :: IO [Word32]
  let f = \y -> (n + n, y)
  print $ (f a, f b)

What do you think?

Exploring beta-optimality techniques

The takeaway of this section for me seems to be that we want to eta-expand all definitions as much as possible.

Not quite. I never thought of it that way, at least. Perhaps if you think of using λ-encodings instead of constructors as an eta-expansion, then you could say so? But not even λ-encodings are the point. It is all about sharing common lambdas. No matter if they're from λ-encodings, eta-expansions or just normal lambdas.

Are there similar cases in which sharing is destroyed, due to eagerly dupped lambdas?

No, because the point wouldn't be to eta-expand lambdas, but to share these lambdas. If you just eta-expand without sharing, then yes, it would be wasteful.

I presume that you mean that Interaction Nets support these optimisations dynamically, for example if mapFB g is passed as some function parameter x and we have mapFB f . x in the body.

Yes, that's exactly it. HVM can be seen as a runtime-GHC. Of course, running GHC at runtime would be prohibitively expensive. But HVM does that, while still being efficient.

Yes, this seems like the loss of sharing due to excessive eta-expansion I talk about above.

No, I really think that's not the case, at all. I don't think HVM loses any sharing due to eta-expansion either. That happens in GHC precisely because it isn't able to share inside binders. I think if we ever compile Haskell to HVM, we could flag GHC to eta-expand as much as it wants without fear. I could be wrong, though.

Regardless, it is a different problem I was talking about here. What this paragraph is talking about is how we can build a function that can be called 2^N times in a way that uses only N steps. That is a delicate process. The function must be "crystal clear", because, if it contains anything that can't be computed in N steps, then the whole function will take 2^N steps, ruining the plan. Duplications often cause that to happen. This can sometimes be avoided by replacing a duplication with a lambda. Like, instead of (if x then f(k) else g(k)) (bad), write (if x then (\x -> f x) else (\x -> g x)) $ k (good). That is all I meant. And this could be eta-reduced to (if x then f else g) $ k, and it would still be good. It is more about avoiding non-linear variables, less about eta-expansion.
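
In Haskell-ish notation (the names are mine), the two shapes being contrasted are roughly the following; the point is that in the second version k occurs syntactically only once:

f, g :: Int -> Int
f = (+ 1)
g = (* 2)

-- "Bad": k occurs in both branches, so HVM inserts a dup node for it.
pickBad :: Bool -> Int -> Int
pickBad x k = if x then f k else g k

-- "Good": the branches are just functions and k is applied once, outside the if,
-- so k stays linear.
pickGood :: Bool -> Int -> Int
pickGood x k = (if x then f else g) k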

Comparing HVM, GHC and Supercombinators

Judging from what reduce does, it seems to me that the implementation is more like a bytecode interpreter because of the lack of a compile-time mapping from Lambdas to actual machine code.

That's correct, but somewhat misleading. Yes, it can be seen as supercombinators, to an extent, but also not. Let me try to trace a parallel. The ideal input for HVM is a big list of small equational clauses. So, for example, a compiler targeting HVM should certainly replace something like:

foo a x = case a of
  True -> Just (42, 10 * x)
  False -> Nothing

By:

foo True x = Just (42, 10 * x)
foo False x = Nothing

Once functions are in that shape, then we can convert them into very fast C procedures. For example:

Term foo(Term a, Term x) {
  if (a == Ctr(TRUE, [])) {
    return Ctr(JUST, [Ctr(PAIR, [42, 10 * x])]);
  } else {
    return Ctr(NOTHING, []);
  }
}

And that's all HVM does. This is fast, for the same reason supercombinators are fast: it is just a C function that does some branching and allocs some memory. That is great. The difference, though, is how supercombinators and HVM deal with lambdas. Think of this function:

foo True = \x -> (42, x + 10)
foo False = \x -> (42, x + 20)
main = foo True 7

This function can not be compiled to a fast C procedure, because C doesn't have closures. So, every runtime must deal with this situation in some way.

Supercombinators deal with it by transforming the whole program in a way that functions only receive/return concrete values (numbers, constructors). That way, lambdas don't exist at runtime, and everything is compiled to fast C-like procedures. Like this:

// Notice "x" was uncurried, allowing `foo` to
// return a concrete value, not a lambda
Term foo(Term a, Term x) {
  if (a == Ctr(TRUE, [])) {
    return Ctr(PAIR, [42, x + 10]);
  } else {
    return Ctr(PAIR, [42, x + 20]);
  }
}

u64 main = foo(1, 7);

That looks blazingly fast. And it is. Until there are too many lambdas. In that case, the involved transformations will make your functions considerably heavier, and you'll lose the benefits.

GHC deals with it by wrapping runtime lambdas as objects that track their arguments, and, once fully saturated, call the machine code that returns the concrete value. Like this:

Term foo_0(Term[] args) {
  return Ctr(PAIR, [42, args[0] + 10]);
}

Term foo_1(Term[] args) {
  return Ctr(PAIR, [42, args[0] + 20]);
}

// Foo wasn't uncurried, but it returns a black-box closure
Term foo(Term a) {
  if (a == Ctr(TRUE, [])) {
    return Closure(foo_0);
  } else {
    return Closure(foo_1);
  }
}

u64 main = call(foo(1), 7);

Of course, GHC will do tons of optimizations to avoid returning closures, and this ignores thunks/memoization, but the point is: there are no expressions inside lambdas, just machine code waiting to be called. That is why you can't inspect inside the body of a lambda, and why GHC can't share expressions inside them.

HVM deals with it by simply treating lambdas as concrete values. That's all. When a lambda is returned, nothing special happens. It literally just returns a lambda, just like you would return a list constructor. Like this:

// Foo wasn't uncurried; it just returns a lightweight lambda
Term foo(Term a) {
  if (a == Ctr(TRUE, [])) {
    Var x = fresh();
    return Lam(x, Ctr(PAIR, [42, Add(x, 10)]));
  } else {
    Var y = fresh();
    return Lam(y, Ctr(PAIR, [42, Add(y, 20)]));
  }
}

u64 main = App(App(foo, 1), 7);

Since lambdas are not special, you can compile every program to a fast C-like procedure, which is the whole point of supercombinators, except you don't need to do any additional transformation to avoid lambdas. The only consequence is that Lam/App/Add nodes require some extra memory, but almost all of it can be avoided, and HVM is designed with that in mind.

For example, if you organize your program as small top-level equations, almost no LAM/APP nodes are necessary. And HVM compiles numeric nodes like ADD in a way that allows the C compiler to inline them whenever possible. So, in the program above, Add(x, 10) will be compiled as just x + 10. That is why numeric functions in HVM can, in theory, perform as well as in GHC: there aren't any extra allocations, and they run directly as machine code. Of course, GHC has considerably more man-hours thrown at low-level optimizations, but in theory both can be equivalent in numeric computations.

Now, there are cases where avoiding LAM/APP/ADD nodes isn't possible in HVM, but these are exactly the cases where supercombinators would need expensive program transformations, and GHC would use closures, which also use extra memory. The nice thing, though, is that these are also the cases where HVM will outperform GHC asymptotically, precisely because the lambdas are still present at runtime. This information isn't lost, making execution-time inlining possible, which isn't possible in GHC, since lambda bodies were already converted to machine code.

Uff. Hope this makes sense...

sgraf812 commented Feb 9, 2022

Edit: Urgh, accidental early submission while I nearly dropped my laptop. Will post the rest later on

sgraf812 commented

It seems like you got stuck in some local maximums of insight ("ah, so dup nodes are like reference counted thunks"), which are kinda true but irrelevant

Yes, hence I would have appreciated an early debunk of my initial thinking with a forward reference.

the bigger picture ("ah, the core is fully linear, un-scoped, and reference-free, making it inherently lightweight; lazy dups are just the least aggressive way to introduce copying without affecting these aspects").

Well, this isn't a thought you can have before you've even read about how lambdas are duplicated... Hence the need for a forward reference.

main = do
  [n, a, b] <- readArgs :: IO [Word32]
  let f = \y -> (n + n, y)
  print $ (f a, f b)

What do you think?

Hmm, here it is still statically possible to float n + n out from under the lambda into a local let binding inside main, thus sharing the work among all invocations of f. I think we really need to go more higher-order. Perhaps:

import Control.Monad.Trans.Reader

replicateM' :: Monad m => m a -> Int -> m [a]
replicateM' _ 0 = pure []
replicateM' m n = (:) <$> m <*> replicateM' m (n-1)
{-# NOINLINE replicateM' #-}

main = do
  let xs = runReader (replicateM' ((+ 1) <$> ask) 100000) (41 :: Int)
  print $ sum xs

This will add 41+1 over and over again. Not a particularly compelling use case, I must confess... Maybe you can come up with better examples, knowing this example. The point here is that (+1) <$> ask will always supply the same value 41 to the addition, but main doesn't see that. Similarly, replicateM' can't cache the result of the m action because it doesn't know which argument it is called with.

Not quite. I never thought of it that way, at least. Perhaps if you think of using λ-encodings instead of constructors as an eta-expansion, then you could say so? But not even λ-encodings are the point. It is all about sharing common lambdas. No matter if they're from λ-encodings, eta-expansions or just normal lambdas.

I mean eta-expansion in the sense that each datatype gives rise to a "recursor", which is what I think you mean by the type's "λ-encoding". And then you say

Then we lift the shared lambdas up:

Which is exactly eta-expansion. And as I said, eta-expansion is always sound, but may lose sharing, at least in traditional GHC-compiled code.

No, because the point wouldn't be to eta-expand lambdas, but to share these lambdas.

Note that I'm talking about sharing the work of the lambda body, not "sharing its lambdas". Just so that we are talking about the same kind of sharing, here's an example:

let f x = if sum [0..x] < 13 then id else (+1)
    f13 = f 13
in f13 15 + f13 54

Eta-expansion of f to arity 2 (so f x y = if sum [0..x] < 13 then y else (y+1)) will compute sum [0..13] < 13 twice instead of once. So eta-expansion lost the shared work (among all calls to f13) of evaluating sum [0..13] < 13 to WHNF.
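
Spelled out, with fEta as my name for the eta-expanded version:

f :: Int -> (Int -> Int)
f x = if sum [0 .. x] < 13 then id else (+ 1)

f13 :: Int -> Int
f13 = f 13
-- f13 15 + f13 54: the condition sum [0..13] < 13 is evaluated once, when the
-- thunk f13 is first forced, and the resulting function is reused.

fEta :: Int -> Int -> Int
fEta x y = if sum [0 .. x] < 13 then y else y + 1
-- fEta 13 15 + fEta 13 54: the condition is re-evaluated on every call.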

If you just eta-expand without sharing, then yes, it would be wasteful.

I think if we ever compile Haskell to HVM, we could flag GHC to eta-expand as much as it wants without fear. I could be wrong, though.

Ah, so does that mean that eta-expanding f above would not lose work on HVM? That is an interesting and attractive property! I'm still wary about the costs, though.

Like, instead of (if x then f(k) else g(k)) (bad), write (if x then (\x -> f x) else (\x -> g x)) $ k (good).

Interesting. I would expect those two occs of k to not need a dup anyway, as they occur on different code paths. I'd say that k is used linearly in the "bad" program.

for example, a compiler targeting HVM should certainly replace something like:

foo a x = case a of
  True -> Just (42, 10 * x)
  False -> Nothing

By:

foo True x = Just (42, 10 * x)
foo False x = Nothing

Ah, so you rather want to have

foo x = case x of
  True -> \y1 -> Just (42, 10 * y1)
  False -> \y2 -> Nothing

Correct? But then the latter is not fully eta-expanded. So there is some drawback to that, perhaps the 2^N duplication problem you talked about earlier?

Once functions are in that shape, then we can convert them into very fast C procedures. For example:

Term foo(Term a, Term x) {
  if (a == Ctr(TRUE, [])) {
    return Ctr(JUST, [Ctr(PAIR, [42, 10 * x])]);
  } else {
    return Ctr(NOTHING, []);
  }
}

But this is eta-expanded again. So you expect eta-contracted input because it is semantically richer, but ultimately generate eta-expanded code? What if we instead started with

foo x = case odd $ sum [0..x] of
  True -> \y1 -> Just (42, 10 * y1)
  False -> \y2 -> Nothing

would you eta-expand over the case? Then that means you lose sharing of work at arity 1 call sites.

My point is that eta-expansion makes some programs faster, but others slower. It's the job of the compiler/HVM(?) to figure out when it's beneficial to do so. Alternatively, not do it at all.

This function can not be compiled to a fast C procedure, because C doesn't have closures. So, every runtime must deal with this situation in some way.

Well yes, do closure conversion or lambda lifting or something in-between (Appel's work on closure conversion). I think the original supercombinators paper used lambda lifting to pass free variables, but I don't think it would need to; closure conversion would work just as well. The point I was trying to make is that supercombinators take whole lambda expressions and find an efficient compilation to machine code, rather than just translating the expressions to well-known combinators like SKI first (or interpret the code represented as a data structure, as I think reduce does at the moment). My emphasis is on "find an efficient code generation scheme to machine code".
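
As a tiny reminder of what lambda lifting does (my own example, not one from the supercombinators paper): the free variable of the inner lambda becomes an extra parameter of a new top-level function, which the call site partially applies.

-- Before lifting: the lambda passed to map captures the free variable n.
scaleAll :: Int -> [Int] -> [Int]
scaleAll n xs = map (\x -> n * x) xs

-- After lifting: the lambda is now a top-level supercombinator taking n explicitly.
scaleOne :: Int -> Int -> Int
scaleOne n x = n * x

scaleAll' :: Int -> [Int] -> [Int]
scaleAll' n xs = map (scaleOne n) xs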

Like this:

// Foo wasn't uncurried; it just returns a lightweight lambda
Term foo(Term a) {
  if (a == Ctr(TRUE, [])) {
    Var x = fresh();
    return Lam(x, Ctr(PAIR, [42, Add(x, 10)]));
  } else {
    Var y = fresh();
    return Lam(y, Ctr(PAIR, [42, Add(y, 20)]));
  }
}

u64 main = App(App(foo, 1), 7);

Aha! This is the interesting part. You're returning Lam(y', C[y->y']), where C is the code for the lambda body, correct? What if C was a case expression, e.g.

foo a = case odd $ sum [0..a] of
  True -> \x -> Just (42, 10 * x)
  False -> \y -> case y > 2 of
    True -> Just (42, 20 * y)
    False -> Nothing

What code would you generate here?

VictorTaelin commented Feb 10, 2022

This will add 41+1 over and over again.

That example is way too contrived; it must be simple and pedagogical. I don't think people will go that far. The point of this example is to show a situation where GHC can't share a redex. It is there to teach someone trying to learn, not to convince the most absolutist skeptics on whether GHC can always "fix" its lack of beta-optimality by applying smart compile-time manipulations. For these, there is the book, which has proofs that this isn't the case.

I mean eta-expansion in the sense that each datatype gives rise to a "recursor", which is what I think you mean by the type's "λ-encoding".

No, a recursor is a type of λ-encoding called Church-Encoding. Usually I mean the Scott-Encoding, since pattern-matching on it has the same complexity as pattern-matching on built-in datatypes.
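
In Haskell terms the distinction looks roughly like this (my sketch; HVM itself is untyped, so none of the rank-2 machinery is needed there):

{-# LANGUAGE RankNTypes #-}

-- Church encoding of naturals: the recursor, i.e. a fold over the whole number.
newtype ChurchNat = ChurchNat (forall r. (r -> r) -> r -> r)

-- Scott encoding: one pattern match deep; the predecessor is handed back as-is,
-- so matching costs O(1), like matching on a built-in constructor.
newtype ScottNat = ScottNat (forall r. (ScottNat -> r) -> r -> r)

scottZero :: ScottNat
scottZero = ScottNat (\_succ zero -> zero)

scottSucc :: ScottNat -> ScottNat
scottSucc n = ScottNat (\succ _zero -> succ n)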

Which is exactly eta-expansion. And as I said, eta-expansion is always sound, but may lose sharing, at least in traditional GHC-compiled code.

I don't think that is what eta-expansion means, though. Eta-expansion is taking a function like f :: Int -> Int and expanding it to \x -> f x. Sharing a lambda is taking a branch like if k then (\x -> f x) else (\x -> g x) and turning it into \x -> if k then f x else g x. These are different concepts.

Note that I'm talking about sharing the work of the lambda body, not "sharing its lambdas". Just so that we are talking about the same kind of sharing, here's an example

I know. The problem is it seems like we're mixing two different subjects. One thing is optimizing function composition from O(N^2) to O(N). The other thing is how HVM sharing works. The article you read talks about optimizing composition, and teaches techniques to do so, including sharing lambdas. It has nothing to do with how HVM shares inside binders.

I'd say that k is used linearly in the "bad" program.

There is no built-in if on HVM, so it can't tell it is branching, and will just insert a dup node. When compiling an if expression, compilers should either add λs on its branches, or move them to different equations, if they want to avoid dup nodes (which isn't always necessary, since dups are cheap).

Ah, so does that mean that eta-expanding f above would not lose work on HVM?

Yes. The program you wrote loses no sharing on HVM after eta-expansion. sum [0..x] will always only occur once.

Ah, so you rather want to have

No, we often want to have small right-hand sides because the larger they are, the more memory must be allocated. For example, this:

foo x = case odd $ sum [0..x] of
  True -> \y1 -> Just (42, 10 * y1)
  False -> \y2 -> Nothing

Must alloc space for a representation of the entire "case-of" expression whenever foo x is called. So a sane compiler would never do that; it would just break it down as foo_aux x True = ... and foo_aux x False = .... That has nothing to do with sharing or optimality; it is a constant-factor space/time optimization.

What code would you generate here?

There is no "case-of" in HVM, so this expression doesn't even exist. You have 3 alternatives:

  1. Generate an auxiliary "BoolCaseOf" function:
// Could also be called "If"
(BoolCaseOf True t f) = t
(BoolCaseOf False t f) = f

(Foo a) = 
  let true_case = λx (Just (Pair 42 (* 10 x)))
  let false_case = λy ( ... )
  (BoolCaseOf (Odd (Sum (Range 0 a))) true_case false_case)

That is the easy, but stupid way, because the right-hand side will be huge, and you'll get a constant-time slowdown, since you must alloc all of it whenever you call Foo.

  2. Break it down into smaller equations:
(Foo a) = (Foo_0 (Odd (Sum (Range 0 a))))
  (Foo_0 True) = λx (Just (Pair 42 (* 10 x)))
  (Foo_0 False) = (...)

This is the proper way. It will generate fast programs.

  3. Convert it to λ-encodings:
(Foo a) = 
  let true_case = λx λNothing λJust (Just λPair (Pair 42 (* 10 x)))
  let false_case = λy ( ... )
  ((Odd (Sum (Range 0 a))) true_case false_case)

This way, you don't even use constructors, just lambdas. This is the worst way in general since it uses the most memory, but if the compiler is really, really clever, it may abuse λ-encodings to gain asymptotic speedups.

Normal compilers should just go with (2) and all will be fine.

sgraf812 commented

That example is way too contrived; it must be simple and pedagogical. I don't think people will go that far. The point of this example is to show a situation where GHC can't share a redex.

Yes, exactly. I wanted to give an example where the loss of sharing would really matter to GHC. I agree it's not pedagogical at all. I hope that you might be able to come up with better examples from the one I gave.

It is there to teach someone trying to learn, not to convince the most absolutist skeptics on whether GHC can always "fix" its lack of beta-optimality by applying smart compile-time manipulations. For these, there is the book, which has proofs that this isn't the case.

Interesting, I'd really like to see and understand these proofs. Were they written with the kind of optimisations in mind that GHC can do? E.g., that certain programs would never be relevant because you can just float the redex out of a lambda into a shared let.

At the end of the day, I'm a pragmatist and I take the solution that makes my programs go the fastest. If that entails a bit of restructuring my program, a smart compiler like GHC, or an exciting new virtual machine for STG, I'll take it. But so far, I know of no concrete program where GHC isn't able to optimise it so that the promised sharing of HVM would make a positive difference. It's always "This is such a program" and then I say "No, it isn't, because GHC would float out the shared work".

So what is a program where STG-code that was produced by GHC lacks sharing of work and that HVM could exploit?

I don't think that is what eta-expansion means, though. Eta-expansion is taking a function like f :: Int -> Int and expanding it to \x -> f x. Sharing a lambda is taking a branch like if k then (\x -> f x) else (\x -> g x) and turning it into \x -> if k then f x else g x. These are different concepts.

No, eta expansion takes an expression e (not necessarily a function variable occurrence f) and expands it to \x -> e x. Now depending on the shape of e, we can immediately push the application of x inside e and perform some beta-reductions (at compile-time). With your example

  1. if k then (\x -> f x) else (\x -> g x). Eta-expand it, get
  2. \z -> (if k then (\x -> f x) else (\x -> g x)) z. push the application inwards, into the if branches
  3. \z -> if k then (\x -> f x) z else (\x -> g x) z. beta reduce,
  4. \z -> if k then f z else g z. Alpha-rename
  5. \x -> if k then f x else g x. Et voilà.

Same thing as your "sharing of lambdas".

There is no built-in if on HVM, so it can't tell it is branching, and will just insert a dup node. When compiling an if expression, compilers should either add λs on its branches, or move them to different equations, if they want to avoid dup nodes (which isn't always necessary, since dups are cheap).

Yes, that makes sense.

I know. The problem is it seems like we're mixing two different subjects. One thing is optimizing function composition from O(N^2) to O(N). The other thing is how HVM sharing works.

Yes, I mostly care about the notion of sharing I talked about, trying to figure out "how HVM sharing works".

I'm getting lost in our discussion threads, so allow me to start anew with the open questions I'm interested in.


Function equations are "better" than λx. case x of ...

For a bit of context, you said

for example, a compiler targeting HVM should certainly replace something like:

foo a x = case a of
  True -> Just (42, 10 * x)
  False -> Nothing

By:

foo True x = Just (42, 10 * x)
foo False x = Nothing

And I assumed Haskell semantics, where the latter desugars to

foo x = case x of
  True -> \y1 -> Just (42, 10 * y1)
  False -> \y2 -> Nothing

So I thought that was the difference, because that is a different function than the first one, which has 2 manifest lambdas (e.g., foo = λa. λx. case a of ...). But really the point you were trying to make is that every case is very expensive to duplicate (e.g., an additional constant-time factor) because of potentially huge case alternatives. And somehow the version

foo True x = Just (42, 10 * x)
foo False x = Nothing

is better in that regard, because you might treat these equations more efficiently. Fair enough, case closed.

Eta expansion and loss of sharing (of work)

Back to the eta expansion example. In Haskell, we have (a very slightly simplified definition compared to before)

foo x = case odd $ sum [0..x] of
  True -> \y1 -> 10 * y1
  False -> \y2 -> 20 * y2

You said that a not-super-smart compiler targeting HVM should generate this code (which is "(2) break it down into smaller eqns"):

(Foo a) = (Foo_0 (Odd (Sum (Range 0 a))))
  (Foo_0 True) = λx (* 10 x)
  (Foo_0 False) = λy (* 20 y)

And that this would generate very efficient code. I'd like to see the C code of that, similar to the C code you gave earlier! Perhaps it'd be two functions?

And then I'd also be interested in the C code for the eta-expanded definition (which lifts x as a "shared lambda" at the top)

(Foo2 a x) = (Foo2_0 (Odd (Sum (Range 0 a))) x)
  (Foo2_0 True x) = (* 10 x)
  (Foo2_0 False x) = (* 20 x)

And I'd also like to know if this function will share the evaluation of (Odd (Sum (Range 0 a))) in the same way as the unexpanded definition will.

More concretely, will the following program

(Bar f) = (+ (f 1) (f 2))

Main = Bar (Foo2 42)

evaluate (Odd (Sum (Range 0 42))) twice or just once? I'm certain that if we used Foo instead of Foo2, we'd only evaluate the expr once. But what about Foo2?

VictorTaelin commented

Just letting you know that I'll be busy in the short term, so I will not be able to answer your last comment in the next few days, but I'll be back to it when possible. Thanks for bringing great questions and insights.

ken-okabe commented

Very interesting discussion. A proposal for an alternative lambda calculus:
Parsimonious Types and Non-uniform Computation
https://lipn.univ-paris13.fr/~mazza/papers/Lpoly.pdf

A lambda calculus for the HVM #97
