Quadratic time & space on heavily shared lazy lists #60

Closed
oxalica opened this issue Feb 3, 2022 · 27 comments
Comments

@oxalica

oxalica commented Feb 3, 2022

Consider a program computing sum (map head (tails list)). Although the normal form of tails list has O(n^2) elements, I expect the whole expression to take linear time thanks to laziness: we never need to fully evaluate each lazy sub-list, and the number of elements we actually traverse is just n.
But unfortunately, HVM takes quadratic time & space in this scenario, suggesting it is not lazy on sub-lists and that all sub-list elements are actually generated.

HVM:

(Gen 0) = Nil
(Gen n) = (Cons n (Gen (- n 1)))

(Foldr f z Nil) = z
(Foldr f z (Cons x xs)) = (f x (Foldr f z xs))

(Tails Nil) = (Cons Nil Nil)
(Tails (Cons x xs)) = (Cons xs (Tails xs))

(Map f Nil) = Nil
(Map f (Cons x xs)) = (Cons (f x) (Map f xs))

(Head Nil) = 0
(Head (Cons x xs)) = x

(Main n) = (Foldr @a@b(+ a b) 0 (Map @l(Head l) (Tails (Gen n))))

Haskell:

import Prelude hiding (head, foldr, tails, map)
import Data.Word (Word32)
import System.Environment (getArgs)

data List a = Nil | Cons a (List a)

gen 0 = Nil
gen n = Cons n (gen (n - 1))

foldr f z Nil = z
foldr f z (Cons x xs) = f x (foldr f z xs)

tails Nil = Cons Nil Nil
tails (Cons x xs) = Cons xs (tails xs)

map f Nil = Nil
map f (Cons x xs) = Cons (f x) (map f xs)

head Nil = 0
head (Cons x xs) = x

main = do
    n <- read . (!! 0) <$> getArgs :: IO Word32
    print $ foldr (\a b -> a + b) 0 (map (\l -> head l) (tails (gen n)))

Run time:

| n | HVM (clang, O2, single thread) | GHC, O2 |
| ----: | ----: | ----: |
| 1000 | 131ms | 10ms |
| 2000 | 316ms | 10ms |
| 4000 | 1082ms | 12ms |
| 8000 | 5.039s (1.6G mem) | 15ms (5.0M mem) |
| 16000 | 31.018s (5.9G mem) | 20ms (5.6M mem) |

I think this is a serious issue, since generating lazy data structures and sharing their parts everywhere (e.g. B-trees, finger trees) is quite a common pattern.

@VictorTaelin
Member

VictorTaelin commented Feb 3, 2022

That's definitely a serious issue. I have a busy day today, but I'll investigate why this is happening as soon as possible. Here is the complete reduction of your program for n = 5:

https://gist.githubusercontent.com/VictorTaelin/99d19ae48493852d3f526c529260d940/raw/e233a04e9be839b98e066165888bfb4d94fc89d9/reduction.txt

Edit: updated to use Sum instead of Foldr (same issue happens, easier to visualize)

Edit: okay, I see what is going on. This is not caused by HVM being strict on the tails (which would obviously make it quadratic). It is indeed fully lazy, as expected. Instead, due to the way the tails are cloned, there is an accumulation of copy-then-discard nodes:

(+ 4 (+ 3 (+ 2 (+ $a0 (Sum (Map λx5 (Head x5) (Tails (Cons b0 b6))))))))
dup a0 b0 = b1;
dup a4 b4 = 1;
dup * b6 = b7;
dup * b1 = b2;
dup * b3 = a4;
dup * b2 = b3;
dup * b7 = b8;
dup * b8 = b9;
dup * b9 = (Gen (- b4 1));

Here, * stands for an unused variable. Notice how a0 must be copied several times, then discarded several times, redundantly. Because of that, taking the head of element N requires copying and discarding it N times, which makes the whole thing quadratic. There is an easy fix, though: an additional reduction rule:

dup x * = a
----------- dup-opt-0
x <- a

dup * x = a
----------- dup-opt-1
x <- a
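
For instance (my sketch of the intended effect, in the same notation), applying dup-opt-1 bottom-up collapses the chain of discarded copies from the dump above:

dup * b8 = b9
dup * b9 = (Gen (- b4 1))
------------------------- dup-opt-1, applied twice
b8 <- (Gen (- b4 1))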

I've added a fix to this issue on the quadratic_fix branch, for the interpreted version only. It makes your program compute linearly, as expected:

v@v ~/vic/dev/Kindelia/HVM/tmp$ time hvm r tmp 16000
Reducing.
Rewrites: 336018 (30.55 MR/s)
Mem.Size: 176035

127992000
hvm r tmp 16000  0.01s user 0.00s system 86% cpu 0.020 total

@oxalica can you test the quadratic_fix branch, and let me know if that solved your issue?

Coincidentally, that same rule would fix the "rare cases" where a garbage collector would be needed (I talk about that in a comment in runtime.c). Sadly, this rule escapes the interaction-net realm, and is NOT sound in some cases, which is why I was afraid to add it. But this issue definitely makes it seem unavoidable. I have an idea for how to add it in a sound manner, though; I just need some time to make sure it won't break any law.

This is a great, very important issue. Thanks for filing it.

@oxalica
Author

oxalica commented Feb 3, 2022

I've added a fix to this issue on the quadratic_fix branch, for the interpreted version only.

Special-cased patches don't seem to be enough: simply repeat the calculation twice, and it's quadratic again.

(Main n) =
    let lists = (Tails (Gen n))
    (+ (Sum (Map @l(Head l) lists)) (Sum (Map @l(Head l) lists)))

@oxalica
Author

oxalica commented Feb 3, 2022

I'd like to see precise reasoning about the time complexity of this evaluation model, instead of find-and-patch, since copying everywhere instead of remembering evaluated state (via thunks) sounds quite problematic for time complexity. Even though copying is lazy, data is still copied whenever it is used, so the copying cost is merely deferred to other operations, not eliminated.

Currently HOW.md only explains "how this is implemented", but not "how we can achieve the same time (or space) complexity as a thunk-based implementation". For general usage, we should at least convince users that we are no worse than the alternatives.
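
(For concreteness, a thunk can be modeled as a mutable cell that caches its result after the first force. A minimal Haskell sketch of my own, not GHC's actual machinery:)

import Data.IORef

-- A thunk: either a pending computation or its memoized result.
data ThunkState a = Unevaluated (IO a) | Evaluated a
newtype Thunk a = Thunk (IORef (ThunkState a))

mkThunk :: IO a -> IO (Thunk a)
mkThunk act = Thunk <$> newIORef (Unevaluated act)

-- Forcing runs the computation at most once; later forces are free.
force :: Thunk a -> IO a
force (Thunk ref) = do
  st <- readIORef ref
  case st of
    Evaluated v     -> pure v
    Unevaluated act -> do
      v <- act
      writeIORef ref (Evaluated v)
      pure v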

@VictorTaelin VictorTaelin pinned this issue Feb 3, 2022
@VictorTaelin
Member

VictorTaelin commented Feb 3, 2022

It is reasonable to demand proofs about the time complexity of this model, in relation to other models. Note, though, that this isn't "try-and-patch". These reduction rules have been well-established for decades, and the "patch" above is a well-known optimization, which is safe to apply in some cases. The complexity reasoning comes from the vast literature on the optimal evaluation of functional programming languages. This book, by Andrea Asperti and Stefano Guerrini, has a great overview. HVM is merely a practical implementation of the bookkeeping-free reduction machine depicted in the book (pages 14-39). There is nothing fancy or invented here. It has a 1-to-1 relationship to the theoretical model, so the same complexity bounds, and the respective proofs (chapter 10), apply. In a recent paper, Andrea Asperti wrote:

Rephrasing [Asperti and Mairson 2001] in this context, [Asperti et al. 2004] showed that the non elementary cost of optimal reduction is not due to bookkeeping (which one may suspect to add superfluous work), but to the (apparently unavoidable) duplication work. If you accept the fact that optimal reduction performs the minimal amount of duplication, you will have at least the same operations, and hence the same computational cost in any reduction technique. (...) More recently, still working in a “bookkeeping free” framework, and making a direct syntactical comparison with a standard graph rewriting machine, [Guerrini et al. 2012] showed that sharing graphs can only improve performances. In conclusion, while there are several examples of classes of lambda terms where optimal reduction outperforms standard techniques, there is so far no known counterexample to its computational efficiency.

If my interpretation is correct, that means the bookkeeping-free optimal evaluator is (as the name says!) at least as fast as (i.e., in the same complexity class as) any other evaluator. As such, it is surprising to me that GHC behaves with superior asymptotic complexity on these inputs. This should not be the case, as far as I understand. So why does this happen? It could be either a bug in HVM, or a misunderstanding on my part about its complexity. If it is a bug, we will fix it. If it is a fundamental limitation of the model, then we must ask: why? When does it occur? Is this known? Is it solvable? This is a negative result, and that is good. Only by investigating countless "no"s can we find the most correct "yes"s. Of course, if it turns out there is no reasonable solution, we must update the docs and make it clear that HVM isn't always asymptotically superior to GHC.

Obviously, I can't give you a definite answer right now, but I'll pin this issue to make it clear it is important, and dedicate time to it. This must be thoroughly understood before moving on. Thanks for challenging my assumptions!

@VictorTaelin
Member

I wonder what the complexity of the program below is in Haskell, using datatypes, and then using Scott encodings.

(Main n) =
    let lists = (Tails (Gen n))
    (+ (Sum (Map @l(Head l) lists)) (Sum (Map @l(Head l) lists)))

I'm a little busy today. Can anyone write that program for us?

@oxalica
Author

oxalica commented Feb 4, 2022

EDIT: I pasted the wrong code. Now they are updated.
EDIT2: The result below is tested on af7b5b3.

It is reasonable to demand proofs about the time complexity of this model, in relation to other models. Note, though, that this isn't "try-and-patch". These reduction rules have been well-established for decades, and the "patch" above is a well-known optimization, which is safe to apply in some cases.

Sorry for the inappropriate wording; I have not yet fully understood the design details.

Rephrasing [Asperti and Mairson 2001] in this context, [Asperti et al. 2004] showed that the non elementary cost of optimal reduction is not due to bookkeeping (which one may suspect to add superfluous work), but to the (apparently unavoidable) duplication work. If you accept the fact that optimal reduction performs the minimal amount of duplication, you will have at least the same operations, and hence the same computational cost in any reduction technique.

Thanks for the reference; I'll check it later. But for this paragraph, I'd interpret it as "it does do duplicated work, though a minimal amount (without techniques outside the lambda calculus)". GHC, however, uses thunks (which cannot be implemented in pure LC) to memoize the state of each binding, ensuring there is actually NO duplicated work at all.

... then we must ask: why? When does it occur? Is this known? Is it solvable? This is a negative result, and that is good. Only by investigating countless "no"s can we find the most correct "yes"s. Of course, if it turns out there is no reasonable solution, we must update the docs and make it clear that HVM isn't always asymptotically superior to GHC.

That's also my opinion. It's okay not to beat GHC at every corner. But we should know the exact cases where we are better, and the cases where we are not doing well. We should also mention this in the README, instead of advertising HVM as a "generic VM (sometimes) beating GHC".

I wonder what the complexity of the program below is in Haskell, using datatypes, and then using Scott encodings.

Writing the same expression twice simply costs twice as much time, so the total time complexity is still O(n).
I'm not sure why you want a Scott encoding, but I assume you're referring to the Mogensen–Scott encoding. Changing the encoding doesn't really affect the time complexity, though.

{-# LANGUAGE RankNTypes #-}
import Prelude hiding (head, foldr, tails, map, sum)
import Data.Word (Word32)
import System.Environment (getArgs)

newtype SList a = SList { runList :: forall b. b -> (a -> SList a -> b) -> b }

mkNil = SList (\z cons -> z)
mkCons x xs = SList (\z cons -> cons x xs)

gen 0 = mkNil
gen n = mkCons n (gen (n - 1))

sum xs = runList xs 0 (\x xs' -> x + sum xs')

tails xs = runList xs mkNil (\x xs' -> mkCons xs' (tails xs'))

map f xs = runList xs mkNil (\x xs' -> mkCons (f x) (map f xs'))

head xs = runList xs 0 (\x xs' -> x)

main = do
    n <- read . (!! 0) <$> getArgs :: IO Word32
    let lists = tails (gen n)
    print $
        sum (map (\l -> head l) lists) +
        sum (map (\l -> head l) lists)

HVM code for comparison:

(Gen 0) = Nil
(Gen n) = (Cons n (Gen (- n 1)))

(Sum Nil) = 0
(Sum (Cons x xs)) = (+ x (Sum xs))

(Tails Nil) = Nil
(Tails (Cons x xs)) = (Cons xs (Tails xs))

(Map f Nil) = Nil
(Map f (Cons x xs)) = (Cons (f x) (Map f xs))

(Head Nil) = 0
(Head (Cons x xs)) = x

(Main n) =
    let lists = (Tails (Gen n))
    (+ (Sum (Map @l(Head l) lists)) (Sum (Map @l(Head l) lists)))

Evaluation result:

> \time cargo r --release -- r tails.hvm 10000
...Omitting cargo outputs...
Reducing.
Rewrites: 100230007 (11.24 MR/s)
Mem.Size: 250135030

99990000
8.17user 0.84system 0:09.12elapsed 98%CPU (0avgtext+0avgdata 1960572maxresident)k
0inputs+0outputs (0major+490684minor)pagefaults 0swaps

> ghc tails.hs -O -o tailsghc && \time ./tailsghc 10000
99990000
0.00user 0.00system 0:00.01elapsed 93%CPU (0avgtext+0avgdata 4784maxresident)k
0inputs+0outputs (0major+476minor)pagefaults 0swaps

Note: the definition of tails in both the Haskell and HVM versions differs from the one in Haskell's Data.List, to make it easy to write in HVM: it returns [[2,1],[1],[]] for [3,2,1]. The time complexity is the same, though.

@VictorTaelin
Member

VictorTaelin commented Feb 4, 2022

That's also my opinion. It's okay not to beat GHC at every corner.

Yes, but we need to be clear about that. I do think this is achievable, though; in the worst cases, we could just add references / thunks for the cases where HVM underperforms. Before going that far, I'd rather gain a deeper understanding of the issue, and look for solutions that involve pure rewrite rules.

Thanks for the reference; I'll check it later. But for this paragraph, I'd interpret it as "it does do duplicated work, though a minimal amount (without techniques outside the lambda calculus)". GHC, however, uses thunks (which cannot be implemented in pure LC) to memoize the state of each binding, ensuring there is actually NO duplicated work at all.

That isn't my interpretation; I believe he just meant that any evaluator must have a duplication system of some sort (including thunks, which are basically read-only memoized duplications), and that optimal evaluators are superior to any possible duplication strategy, because they perform the minimal amount of duplication. So, if I'm right, this would be a counterexample to his argument.

That said, perhaps references do bring something to the table in some cases (like this one), and we could consider some form of thunks on HVM. We must make sure it doesn't play badly with the automatic parallelism, though. I have some ideas in mind, but, again, a pure rewrite-based solution would be nicer.

Writing the same expression twice simply costs twice as much time, so the total time complexity is still O(n).
I'm not sure why you want a Scott encoding, but I assume you're referring to the Mogensen–Scott encoding. Changing the encoding doesn't really affect the time complexity, though.

Thanks for the code! I asked for the Scott encoding just for the sake of having the exact same programs on both the abstract algorithm and GHC (the algorithm in the book doesn't support native datatypes). This gives us an example where the exact same input performs better on STG. Now we must find out why: maybe it is an issue with my implementation of the algorithm, maybe it is a well-known limitation that I somehow missed or misunderstood, maybe it is something that is actually not known.

I'll attempt to contact some of the involved researchers, and do some investigation myself when I find the time. If anyone has an insight, feel free to bring it here!

@Mathnerd314

Mathnerd314 commented Mar 17, 2022

I can think of two reasons you're getting quadratic overhead:

  • In Is the Optimal Implementation Inefficient? Elementarily Not they find a quadratic bound on complexity relative to normal order reduction. The quadratic overhead comes from the swap rule (called Superposed duplication in HVM). But I can't tell from scanning the gist if that rule is in play here.

  • It could just be a space leak. Similarly to how sum [1..10^9] in unoptimized Haskell builds up a lot of thunks and ends up being slower than a strict language, the program in this issue builds up a lot of duplications. The solution, following Haskell, is to introduce a "strict evaluate" or seq primitive that forces away indirections (see the sketch after this list). Maybe you will also need an unshare operation to remove all duplication nodes.
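
For reference, a minimal Haskell illustration of the thunk-buildup analogy above (standard library only):

import Data.List (foldl')

-- foldl builds a chain of (+) thunks before any addition runs,
-- so live memory grows linearly with the list:
leaky :: Integer
leaky = foldl (+) 0 [1 .. 10 ^ 7]

-- foldl' forces (seq's) each partial sum, keeping space constant:
strict :: Integer
strict = foldl' (+) 0 [1 .. 10 ^ 7]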

@VictorTaelin
Member

I believe it is pretty likely to be the lack of the safe rules described in the book, which causes an accumulation of redundant dup nodes in some cases. Implementing the safe rules on HVM would be an amazing feat.

@joonazan

joonazan commented Jul 18, 2022

I skimmed over the literature and it seems to me that optimal reduction is not faster than other strategies in general.

About the efficient reduction of lambda terms claims that the bookkeeping-free version is better in general. However, as evidence it cites Deep into Optimality – Complexity and correctness of sharing implementation of bounded logics, which is bookkeeping-free but only because it operates on bounded logics instead of the general lambda calculus. Is the Optimal Implementation Inefficient? Elementarily Not is restricted to certain linear logics for the exact same reason.

So apart from some confusing papers by Asperti, I'd assume that the status quo is still described by Optimality and inefficiency: what isn't a cost model of the lambda calculus?, which says that an exponential amount of bookkeeping is required regardless of proposed optimizations.

Of course, it could be that there is a way to eliminate bookkeeping, but I would probably have seen it on this walk through the literature. The "safe nodes" technique described in Asperti's book that HVM is based on doesn't always work; the book even says at the beginning of chapter 9 that this is still an open problem.

My takeaway is that optimal reduction is not a good approach to performance, and rather than try to find the bookkeeping-free version that might not exist, it would make sense to design a good evaluation strategy from scratch. Or use optimal reduction on one of the linear logics that it works on.

@VictorTaelin
Member

You have good points, but I must disagree with your conclusion:

My takeaway is that optimal reduction is not a good approach to performance, and rather than try to find the bookkeeping-free version that might not exist, it would make sense to design a good evaluation strategy from scratch

Which may also not exist. Even if it turns out that this particular example is one where Haskell's STG asymptotically outperforms HVM, there are countless examples where the opposite is true (and by exponential, not quadratic, margins). So, what do we take from that? Perhaps, just as there is no reduction strategy that is best for all cases, there is no algorithm that is best at all. We may be fighting a mathematical impossibility here. Or not. Only time will tell.

Now, as for optimal reduction not being a good approach to performance, I find that claim absurd. Rust is widely regarded as performant, and its lambdas are drastically slower than Haskell's. C doesn't even have closures. The fact that a particular runtime has worse asymptotics than another for specific lambdas doesn't imply that it isn't performant. HVM is lightweight, massively parallel, non-garbage-collected, and has many other theoretical requirements for a highly efficient language, as the numbers clearly show.

Ultimately, if there turns out to be no "theoretically elegant and perfect" solution to this problem, perhaps a mix of many strategies will end up being adopted. As I said, adding Haskell-like memoized references to HVM wouldn't be hard, and would solve any case where HVM is quadratically slower than GHC. But that would come at the cost of impacting parallelism, confluence, garbage-collection freeness, measurability, and other important characteristics, so I wonder if it would be worth it. For example, is the gain above relevant if it prevents us from making massively parallel functional-programming processors (which we can, for HVM)?

Finally, note that the algorithm above, and any other like it, can easily be refactored into a version that is really fast on HVM, by using clones more efficiently. I suspect any instance of a "quadratically slower on HVM" example is a term that uses cloning poorly, and just happens to be handled well by GHC, at the costs above. All in all, mathematics is cruel, and sometimes there just isn't a one-size-fits-all answer. I'm not the inventor of the algorithm; all I do is try to implement it efficiently, which I believe is a worthy endeavor, even if it turns out not to work well (which, again, I highly doubt).

@VictorTaelin
Member

Just to add to the discussion: I'm somewhat curious about bottom-up beta-reduction. It takes an entirely opposite approach. On HVM, everything is low-order, exists in only one place, operations are constant-time, and there is a lot of locality. That approach is entirely higher-order: shared references are king, things can exist in many places, and beta-reduction isn't a constant-time operation. But it seems to be as fast as it gets at the task of cloning lambdas, while still sharing as much as possible, without losing full λ-calculus compatibility.

@joonazan

joonazan commented Jul 18, 2022

Optimal reduction is very good on lambda constructions, which is pretty cool! However, it is still faster to implement integers with machine integers than with lambda constructions, as the sketch below illustrates. HVM outperforms GHC on certain lambda-heavy programs, but that is because Haskell isn't meant to be used that way.
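
(A small Haskell sketch of that point — Church-encoded naturals, where addition walks a structure, versus Int addition, which is a single machine instruction:)

{-# LANGUAGE RankNTypes #-}

-- A Church natural n means "apply s n times".
newtype Church = Church (forall a. (a -> a) -> a -> a)

two, three :: Church
two   = Church (\s z -> s (s z))
three = Church (\s z -> s (s (s z)))

-- Addition composes the iterations: O(m + n) applications,
-- versus one instruction for machine-integer addition.
add :: Church -> Church -> Church
add (Church m) (Church n) = Church (\s z -> m s (n s z))

toInt :: Church -> Int
toInt (Church n) = n (+ 1) 0   -- toInt (add two three) == 5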

Finally, note that the algorithm above, and any other like it, can easily be refactored into a version that is really fast on HVM, by using clones more efficiently. I suspect any instance of a "quadratically slower on HVM" example is a term that uses cloning poorly

As stated by Asperti in one of the papers, programs don't really use higher-order functions a lot. Haskell performs very well on programs with little higher-order action, and it is easy to write them.

Eager programming languages are even easier to write performant code in than Haskell. But in return, Haskell gets convenience and even some asymptotic-complexity advantages, as shown in Okasaki's book on purely functional data structures. To sell the more-difficult-to-optimize model, HVM would have to provide something equally desirable.

There are use cases where Haskell could benefit from faster lambdas, for example state monads and other monadic constructions. On the other hand, these things would like to be zero-cost, not just fast.

Now, as for optimal reduction not being a good approach to performance, I find that claim absurd.

I have already addressed many facets of this, but one more is that optimal reduction is not a way to attain the utmost performance, because it needs a runtime and is interpreted. When aiming for the highest performance, one often has a rough idea of the shape of the desired machine code. An optimal-reduction language cannot produce that machine code.

Having no runtime, like Rust, gives the programmer the freedom to insert a runtime if it is beneficial for some task. There are still some assumptions baked into Rust - most notably, the calling convention, and monomorphization via copying or vtables. A perfect language for performance would make even those more malleable.

@joonazan

Ultimately, if there turns out to be no "theoretically elegant and perfect" solution to this problem, perhaps a mix of many strategies will end up being adopted.

I see some value in a language like this. It reminds me of Python's list and sorting algorithm, which attempt to perform reasonably no matter how they are used. A language that isn't fast, but isn't unreasonably slow even when used by a monkey on a typewriter, would be very good, because it would allow one to write code without thinking about performance at all, only about correctness.

That language isn't actually possible to make, but, for instance, not having to think about foldr vs foldl would be great.

@VictorTaelin
Member

VictorTaelin commented Jul 19, 2022

As stated by Asperti in one of the papers, programs don't really use higher-order functions a lot. Haskell performs very well on programs with little higher-order action, and it is easy to write them.

As I see it, HVM should perform better than GHC on programs that use integers and a lot of low-order computation. So I disagree with Asperti's intuition here. The thing is, if you have a solid memory model for interaction nets, then the algorithm turns out to actually be efficient. And why shouldn't it be? In the worst case, it is just a non-garbage-collected version of Haskell's STG.

I have already addressed many facets of this, but one more is that optimal reduction is not a way to attain the utmost performance, because it needs a runtime and is interpreted. (...) When aiming for the highest performance, one often has a rough idea of the shape of the desired machine code. An optimal-reduction language cannot produce that machine code.

I don't agree, though. Have you checked HVM's output? I'm not sure in what sense HVM would be "more interpreted" than C and Haskell. C has a runtime too, which includes malloc and similar. Haskell has a huge runtime (way bigger than HVM's), which includes all the STG machinery. You could see these as interpretation layers. Since HVM is very resource-aware, you have tight control over memory and computation, and good control over the generated assembly code. None of this is true of GHC. HVM compiles right-hand sides to machine code that isn't very different from what you'd get from C or GHC. For example:

(Foo a b) = (* (+ a b) 2)

Is translated to the following C code:

case _FOO_: {
  if (1) {
    inc_cost(mem);
    u64 ret_2;
    if (get_tag(ask_arg(mem, term, 0)) == NUM && get_tag(ask_arg(mem, term, 1)) == NUM) {
      ret_2 = Num(get_num(ask_arg(mem, term, 0)) + get_num(ask_arg(mem, term, 1)));
      inc_cost(mem);
    } else {
      u64 op2_3 = alloc(mem, 2);
      link(mem, op2_3 + 0, ask_arg(mem, term, 0));
      link(mem, op2_3 + 1, ask_arg(mem, term, 1));
      ret_2 = Op2(ADD, op2_3);
    }
    u64 ret_0;
    if (get_tag(ret_2) == NUM && get_tag(Num(2ull)) == NUM) {
      ret_0 = Num(get_num(ret_2) * get_num(Num(2ull)));
      inc_cost(mem);
    } else {
      u64 op2_1 = alloc(mem, 2);
      link(mem, op2_1 + 0, ret_2);
      link(mem, op2_1 + 1, Num(2ull));
      ret_0 = Op2(MUL, op2_1);
    }
    u64 done = ret_0;
    link(mem, host, done);
    clear(mem, get_loc(term, 0), 2);
    init = 1;
    continue;
  }
  break;
};

You may notice some allocs there, which create memory for the + and * nodes. That might lead you to conclude this is a thin interpretation layer for maintaining these graphs. But notice that all occurrences of alloc are dead code here. The second if-then-else is:

if (get_tag(ret_2) == NUM && get_tag(Num(2ull)) == NUM) {

Since the tag of ret_2 is statically known from the previous lines, the C compiler can remove the second branch, avoiding the allocation of nodes for the + and * operators. So, when a and b are NUMs, the code above reduces to:

case _FOO_: {
  if (1) {
    u64 ret_2 = Num(get_num(ask_arg(mem, term, 0)) + get_num(ask_arg(mem, term, 1)));
    u64 done = Num(get_num(ret_2) * get_num(Num(2ull)));
    link(mem, host, done);
    clear(mem, get_loc(term, 0), 2);
    init = 1;
    continue;
  }
  break;
};

Which does the numeric computations directly. Once fully inlined, it results in:

case _FOO_: {
  // Gets the memory index where `term` is
  u64 tloc = term & 0xFFFFFFFF;

  // Reads `a` and `b` from memory
  u64 val_0 = mem->node[tloc]   & 0xFFFFFFFFFFFFFFF;
  u64 val_1 = mem->node[tloc+1] & 0xFFFFFFFFFFFFFFF;

  // Computes (a + b) * 2
  // Additional numeric operations would accumulate here, resulting 
  // in efficient machine code, with no interpretation overhead!
  u64 done = (val_0 + val_1) * 2;

  // Writes the result to memory
  mem->node[host] = 0xB000000000000000 | done;

  // Clears the memory used by the `(FOO a b)` node
  mem->free[2].data[++mem->free[2].size] = tloc;
  mem->free[2].data[++mem->free[2].size] = tloc + 1;

  // Continues
  init = 1;
  continue;
};

Which is then compiled to efficient machine code. The only overhead in a numeric function call is:

  1. Finding the right branch to jump to, but that is also how C functions work.

  2. Reading the values from memory, but C functions also do that (registers, stack, etc.).

  3. Wrapping and unwrapping NUM tags. This is actually an extra cost, but is it not worth the parallelism/laziness gains?

And that isn't even using all the opportunities; once HVM inlines function calls (which it doesn't do yet), it won't allocate memory for intermediate nodes across different function calls either. As a result, highly numeric functions like a dot product would compile to compact, alloc-free machine code on inner loops, just like in C and GHC.

@joonazan

I'm seeing a bit more overhead than you are.

  1. A C function call knows which address to jump to. It doesn't need to go through a jump table, which avoids a branch. This is the part that makes me say this is an interpreter. Also, reduce looks like an interpreter to me. That is not how Haskell works.
  2. Freeing the memory used by the node is extra work, and probably cannot be optimized out by any C compiler. C's and Haskell's stacks are cheaper to manage.

And why shouldn't it be? In the worst case, it is just a non-garbage-collected version of Haskell's STG.

But optimal reduction is sometimes worse than other reduction strategies. Optimal reduction is a lot slower than Haskell in some cases. And I suspect that slowness isn't as simple to avoid as in Haskell, where it suffices to use data structures rather than huge lambda constructions.

Here's an example of how Haskell code is compiled. I think it is clear from the example that it becomes machine code with jumps; no loop or jump table is involved.
https://gitlab.haskell.org/ghc/ghc/-/wikis/commentary/compiler/generated-code#example-1-function-application-with-sufficient-stack-space

@VictorTaelin
Member

Optimal reduction is a lot slower than Haskell in some cases.

That's a point I don't get. Haskell is a lot slower than optimal reduction in way more cases, and way more profoundly.

@Mathnerd314

Mathnerd314 commented Jul 19, 2022

So apart from some confusing papers by Asperti, I'd assume that the status quo is still described by Optimality and inefficiency: what isn't a cost model of the lambda calculus?, which says that an exponential amount of bookkeeping is required regardless of proposed optimizations.

They have a term which requires exponential bookkeeping, but it also requires an exponential number of normal-order steps to reduce, so it doesn't show anything regarding the speed of optimal vs normal order. Also, they only consider the bracket/croissant implementations; I think Guerrini's muxes will not show the exponential behavior on their example, because the muxes combine.

As far as I am aware, there are no examples anywhere on the internet of a family of terms that requires more than n^2 graph-reduction steps using optimal reduction vs n for normal order. ELL (elementary linear logic) is quite a large subset of terms - even if some terms outside it are slower using optimal reduction, this would only prompt an improved version of optimal reduction. Eager evaluation performs poorly even on some very simple examples like if(true, 1, infinite_loop), yet is the established order. Optimal reduction solves this. It is also better than lazy evaluation: for i w = expensive_number + w; i 2 + i 3, lazy evaluation evaluates expensive_number twice, while optimal reduction evaluates it only once (see the sketch below).
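
A minimal Haskell rendering of that last example (the names and the expensive function are illustrative stand-ins for any costly shared redex under a binder):

-- expensive 30 lives under the binder of i, so call-by-need
-- re-evaluates it on every call of i, unless GHC's full-laziness
-- pass floats it out. An optimal evaluator shares it across calls.
expensive :: Int -> Int
expensive n = if n < 2 then n else expensive (n - 1) + expensive (n - 2)

i :: Int -> Int
i w = expensive 30 + w

main :: IO ()
main = print (i 2 + i 3)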

As for HVM, it is restricted to EAL (elementary affine logic), at least if you want reasonable behavior. If you go outside that subset you get nonsensical behavior, like how (Pair x (λx(8) 7)) reduces to (Pair 7 8), but worse. This is because the implementation has no bookkeeping at all. The bounds in the Asperti papers are extremely relevant to HVM, because EAL and ELL are only a little apart.

Now, it is true that, despite using fewer reduction steps, optimal reduction can use more memory. Memory is cheap nowadays, though, and programmers don't pay much attention to memory usage anyway. The more pressing issue is that memory-bandwidth costs can overwhelm time efficiency. So the graph representation needs to be optimized, and there is a set of memory optimizations that trades off time for memory when computing an expression is cheap or fetching an expression is expensive. I think about it this way: when multiplying matrices fast, one generally uses Strassen's algorithm for large matrix blocks and the naive algorithm for smaller matrices. Similarly, the best reduction algorithm is a mixture of optimal reduction for expensive expressions and eager evaluation for cheap ones.

optimal reduction is not a way to attain the utmost performance, because it needs a runtime and is interpreted.

This is discussed in the STG paper. Even STG needs a runtime and an interpreter. They go from while (TRUE) { switch (cont) { ... } }, like HVM uses, to while (TRUE) { cont = (*cont)(); }. After that they replace the loop indirection with a direct jump, a common JIT optimization. HVM could presumably follow the same path. It just is not very optimized yet.

@joonazan

First of all, I was not aware that HVM is meant to consume EAL - partly because EAL wasn't mentioned anywhere, and partly because I misread an abstract and thought it said that optimal reduction always outperforms conventional reduction in certain calculi. Now the performance makes a lot more sense.

As far as I am aware, there are no examples anywhere on the internet of a family of terms that requires more than n^2 graph-reduction steps using optimal reduction vs n for normal order.

Is the Optimal Implementation Inefficient? Elementarily Not proves that a quadratic slowdown is the worst that can happen on ELL and LLL, so that observation matches the theory perfectly.

That also explains the example shown in this issue. The README should make it clear that HVM is at best exponentially faster and at worst quadratically slower.

This is discussed in the STG paper. Even STG needs a runtime and an interpreter. They go from while (TRUE) { switch (cont) { ... } }, like HVM uses, to while (TRUE) { cont = (*cont)(); }. After that they replace the loop indirection with a direct jump, a common JIT optimization. HVM could presumably follow the same path. It just is not very optimized yet.

I can see that the while loop in STG can be trivially turned into direct jumps. It seems to me that HVM cannot know at compile time what the next node is.

Optimal reduction is a lot slower than Haskell in some cases.

That's a point I don't get. Haskell is a lot slower than optimal reduction in way more cases, and way more profoundly.

I'd rather program in a language that has exponential worst cases that are easy to avoid than in a language that has random quadratic pitfalls. Both are bad enough to be unacceptable, so it doesn't matter that one is much worse. Basically I would use HVM if the quadratic performance is easy to avoid or if manipulating big lambda constructions has some benefit other than looking cool.

Did I misread something, or do you agree that optimal reduction is sometimes quadratically worse according to the literature? Does anyone have a collection of these poorly performing programs, or do we only have the two GitHub issues?

@VictorTaelin
Member

VictorTaelin commented Jul 19, 2022

HVM is not meant to consume EAL, though. What he probably means is that EAL is the only subset for which there are proofs of the soundness and efficiency of optimal evaluation, but it is not exhaustive, and indeed is a fairly restrictive subset. For example, EAL doesn't allow Scott encodings, recursion or the Y-combinator, all of which work fine on HVM. The set of terms HVM can compute soundly and efficiently is massively larger than EAL, and I'd like it to be characterized by a more general extension of EAL.

The README should make it clear that HVM is at best exponentially faster and at worst quadratically slower.

Yes, I agree, but I'd like to have it reference an actual proof.

Basically I would use HVM if the quadratic performance is easy to avoid or if manipulating big lambda constructions has some benefit other than looking cool.

In a pragmatic sense, at least when it comes to high-quality code, I think it is more likely to end up with an accidental exponential slowdown in GHC than a quadratic slowdown on HVM. OP's example, for instance, (Map @l(Head l) (Tails (Gen n))), clones a list several times, only to return it unchanged. It clearly uses clones wastefully, and shouldn't pass a reasonable code review. On the other hand, GHC's inability to share redexes inside binders can easily go wrong. The fairly common expression below:

foo n =
  let a = \x -> foo (n - 1)
  in a r + a s

Is exponential on GHC, vs linear on HVM.
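
(A runnable rendering of this example - the base case and dummy arguments are my additions so it terminates. Compiled without optimization, each call of a re-enters foo (n - 1):)

foo :: Int -> Int
foo 0 = 1
foo n = let a = \_ -> foo (n - 1)
        in a () + a ()   -- work doubles at each level: O(2^n) calls

main :: IO ()
main = print (foo 30)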

Did I misread something, or do you agree that optimal reduction is sometimes quadratically worse according to the literature? Does anyone have a collection of these poorly performing programs, or do we only have the two GitHub issues?

I don't have the theoretical knowledge to agree or disagree; I just implemented the algorithm as defined in the literature. As soon as OP raised the issue, I acknowledged and pinned the thread. From what you've discussed, that seems to be the case.

@VictorTaelin
Member

The README should make it clear that HVM is at best exponentially faster and at worst quadratically slower.

It seems there is no claim anywhere in README.md that HVM is always asymptotically faster than GHC (let me know if you find such a claim). I've added a final note to HOW.md linking to this thread.

@VictorTaelin
Member

To add to HVM's performance case: Kind2's dependent type-checker is in the late stages of development, and preliminary benchmarks indicate that it is orders of magnitude faster than Agda's - which, if I'm not mistaken, runs on Haskell via NbE. Once it is more polished, we'll have a complex real-world application to stress-test HVM's capabilities.

@Mathnerd314

The fairly common expression below [foo] is exponential on GHC, vs linear on HVM.

Actually with -O GHC optimizes this to foo n = let a = foo (n-1) in a + a and it is linear. Try again. :-)

@VictorTaelin
Member

VictorTaelin commented Jul 22, 2022

The fairly common expression below [foo] is exponential on GHC, vs linear on HVM.

Actually with -O GHC optimizes this to foo n = let a = foo (n-1) in a + a and it is linear. Try again. :-)

I'm not sure it is fair to compare a compiler plus a runtime to just a runtime. In theory we could also put a compiler in front of HVM that fixed these cases where it is quadratic. My point is that STG can't share expressions under λ-binders, which is a pretty substantial drawback for a functional runtime. But again, I'm just explaining the differences that, I believe, are relevant to keep in mind. Both projects are great pieces of technology; I don't think there is a clear one-size-fits-all winner. If someone thinks GHC works better for a certain problem, by all means use it!

@VictorTaelin
Member

Here is a Hacker News comment with my updated thoughts on the issue:

https://news.ycombinator.com/item?id=35342297

@Abastro

Abastro commented May 7, 2023

I do not see how it would be easy to avoid these performance pitfalls. What exactly is the problem here?

@kings177
Member

Closing this, as HVM is now HVM2.
