[StdLib] Strictify 'fix' #5939

Merged · 5 commits · May 6, 2024
Conversation

@effectfully (Contributor) commented Apr 30, 2024:

This makes fix stricter, so that the body of the function it takes always gets forced, regardless of whether the necessary argument is provided.

Resolves #5961.
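To illustrate the change, here is a hedged Haskell sketch (not the Plutus implementation; `fixLazy`, `fixStrict`, and `stepBody` are made-up names): a lazy fixpoint hides the body behind a lambda, while a strict one forces the body as soon as the fixpoint itself is forced, before any argument arrives.

```haskell
import Debug.Trace (trace)

-- Lazy fixpoint: 'fixLazy f' is already a lambda in WHNF, so the body
-- of 'f' is not touched until an argument is supplied.
fixLazy :: ((a -> b) -> a -> b) -> a -> b
fixLazy f = \x -> f (fixLazy f) x

-- Strict fixpoint: forcing 'fixStrict f' already evaluates the body of
-- 'f', regardless of whether an argument has been provided -- analogous
-- to what the strictified 'fix' does at the UPLC level.
fixStrict :: ((a -> b) -> a -> b) -> a -> b
fixStrict f = let r = f (\x -> r x) in r `seq` r

-- A body that traces when it is forced to WHNF.
stepBody :: (Int -> Int) -> Int -> Int
stepBody rec = trace "body forced" go
  where
    go n = if n <= 0 then 0 else 1 + rec (n - 1)
```

Forcing `fixStrict stepBody` with `seq` already prints "body forced", while `fixLazy stepBody` stays silent until an argument is supplied; both compute the same results.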

@effectfully (Contributor, PR author) left a comment:

Looks good to me. It's oddly specific, but it's for testing a specific pass anyway, so it's perfectly fine. It would be great to try to abstract some of that stuff out, so that we can reuse the code in other tests, but I don't really have any suggestions on that.

30 lines of imports is ridiculous; we should clean up our APIs.

@@ -0,0 +1,329 @@
(let
A Member commented:

Can we print out UPLC, instead of TPLC? The TPLC is 100x harder to read, and for no benefit.

The Contributor (PR author) replied:

It's not just harder to read, it's also noisy. It has this huge fixBy definition that is generated "just in case", which is normally removed via dead code elimination, but not in this case because we turn optimizations off.

(Note that some work in this PR is done by me and some by Yura, so it's not immediately obvious who reviewers are talking to)

A Contributor replied:

> Can we print out UPLC, instead of TPLC? The TPLC is 100x harder to read, and for no benefit.

The reason I wrote PIR in this case is that I was following this advice from @michaelpj:

[screenshot of the referenced advice]

This test isolates just one PIR -> PIR pass, so that we can reason about how PIR compilation treats Let(rec)s without bringing into the equation possible effects added by compilation to UPLC.

The Member replied:

Right, this is PIR, not TPLC... but still, it's too much to look at. At the very least you should print readable PIR, but it will still be too big. I think the UPLC for this program should be quite small, probably 10 lines or so.

Comment on lines -5 to +13
(\s x ->
(\and ->
force
(case
x
[ (delay (constr 0 []))
, (\x xs ->
delay
(force
(case
x
[(delay (and xs)), (delay (constr 1 []))]))) ]))
(s s))
(\s ds ->
force
(case
ds
[ (delay (constr 0 []))
, (\x xs ->
delay
(force
(case x [(delay (s s xs)), (delay (constr 1 []))]))) ]))
The Contributor (PR author) commented:

Interesting! So this is what gets more efficient: we literally just inline a single application, and = s s. Why does the optimizer not do that? Perhaps because it's not semantics-preserving: for all the optimizer knows, evaluating s s (I just realized I should've named it differently...) may be impure.

I now wonder if we can make it even more efficient by somehow caching s s once and for all.

The Member replied:

> Why does the optimizer not do that? Perhaps 'cause it's not semantics preserving, for all the optimizer knows evaluating s s (I just realized I should've named it differently...) may be impure.

That's exactly why.

What I'm wondering is how the cost improvement can be so significant. I'd expect a very tiny improvement, and if and is used more than once, I'd expect a regression 😕

The Contributor (PR author) replied:

> I'd expect a very tiny improvement

It may simply enable us to do more optimizations; I'll take a closer look. But in any case, not pointlessly saving something into a name and then needing to look it up should certainly help.

> if and is used more than once, I'd expect a regression 😕

No, and is computed at every recursive step: the \s -> s s thing, i.e. the driver of recursion, is outside of it.

The Member replied:

> It may simply enable us to do more optimizations

But this is the final UPLC, after all optimizations. I struggle to see how the cost difference can be so big - I must be missing something.

> No, and is computed at every recursive step: the \s -> s s thing, i.e. the driver of recursion, is outside of it.

What I'm saying is that if and occurs twice, then inlining it causes s s to be evaluated twice at each recursive step, which should be worse. I'm surprised that there isn't a single test case that became worse.
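The regression worry above is the standard work-duplication hazard of inlining: in a strict, unshared setting, inlining a binding that is used twice makes its right-hand side run twice. A small counter-based sketch (`demoInlining` and its contents are hypothetical, using IO merely to count evaluations):

```haskell
import Data.IORef (newIORef, modifyIORef', readIORef)

-- Count how many times a (hypothetically expensive) computation runs
-- with the binding kept vs. after inlining it at both use sites.
demoInlining :: IO (Int, Int, Int, Int)
demoInlining = do
  ref <- newIORef (0 :: Int)
  let expensive = modifyIORef' ref (+ 1) >> pure (21 :: Int)
  -- 'let y = expensive in y + y': evaluated once, used twice
  y <- expensive
  let kept = y + y
  evalsKept <- readIORef ref
  -- inlined form 'expensive + expensive': evaluated at each use site
  inlined <- (+) <$> expensive <*> expensive
  evalsTotal <- readIORef ref
  pure (evalsKept, evalsTotal - evalsKept, kept, inlined)
```

Here `demoInlining` returns `(1, 2, 42, 42)`: one evaluation with the binding kept, two after inlining, with the same value either way.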

The Contributor (PR author) replied:

> I now wonder if we can make it even more efficient by somehow caching s s once and for all.

I've looked at it as hard as I could and failed to find any opportunity there; it probably doesn't exist.

@kwxm (Contributor) commented May 3, 2024:

/benchmark nofib

The github-actions bot commented May 3, 2024 with a link to the benchmark status.

@Unisay force-pushed the effectfully/stdlib/strictify-fix branch from f4e4826 to 0121dfa on May 3, 2024 13:04.
@kwxm (Contributor) commented May 3, 2024:

I ran some of the benchmarks in this PR. This seems to speed things up a bit, apart from some suspicious results for the list benchmarks. It definitely doesn't make things worse.

Comment on lines 16 to 17
- (\f -> (\s -> s s) (\s x -> f (s s) x)))
+ (\f -> (\s -> s s) (\s -> f (\x -> s s x))))
The Contributor (PR author) commented:

For anybody wondering, this is the whole change.
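Transliterated into Haskell (a sketch under assumptions: `Self`, `fixOld`, and `fixNew` are invented names, with a newtype to make the self-application s s typecheck), the old term only reaches f's body once x is supplied, while the new term evaluates f's body as soon as the knot is tied and eta-expands the recursive call instead:

```haskell
-- Wraps the untyped self-application 's s' so it typechecks.
newtype Self a = Self { unSelf :: Self a -> a }

-- Old: '\s x -> f (s s) x' delays f's body behind the lambda over x.
fixOld :: ((a -> b) -> a -> b) -> a -> b
fixOld f = (\s -> unSelf s s) (Self (\s x -> f (unSelf s s) x))

-- New: '\s -> f (\x -> s s x)' runs f's body immediately and
-- eta-expands the recursive call instead.
fixNew :: ((a -> b) -> a -> b) -> a -> b
fixNew f = (\s -> unSelf s s) (Self (\s -> f (\x -> unSelf s s x)))
```

Both define the same function; e.g. `fixNew (\rec n -> if n <= 1 then 1 else n * rec (n - 1)) 5` evaluates to 120 with either variant.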

(force
(force ifThenElse
(lessThanEqualsInteger 10 x)
(delay ((\x -> s s x) (addInteger 1 i) xs))
The Contributor (PR author) commented:

Optimization idea: we should eta-contract applied functions. That would change effect ordering, though, I think.

({cpu: 13825400
| mem: 38840})
The Contributor (PR author) commented:

Barely any difference for findIndex and a big difference for elem:

- ({cpu: 10254990
- | mem: 32120})
+ ({cpu: 9426990
+ | mem: 28520})

and all:

- ({cpu: 10219630
- | mem: 32120})
+ ({cpu: 9391630
+ | mem: 28520})

The more I look at it, the more confused I become.

Maybe it's indeed just an extra lambda and an application? We have different extra ones for findIndex as well, which could explain all the difference.

({cpu: 9426990
| mem: 28520})
The Contributor (PR author) commented:

No extra lam-apply => big difference.

@effectfully (Contributor, PR author) commented:

@zliu41 I think by far the most likely explanation is simply the removal of a single lam-apply pair, because when we get a different one back, the numbers go roughly back to where they were before. The simplest way to investigate would be to eta-contract \x -> s s x to s s in findIndex and observe that it does indeed affect costs that much.

@zliu41 (Member) left a comment:

> @zliu41 I think the by far most likely explanation is simply removal of a single lam-apply pair.

I think you are likely correct. For any or elem, the recursion is simply called on xs (i.e., go xs). The UPLC inliner can inline xs, yielding s s xs, which is efficient.

For findIndex, on the other hand, we have go (Builtins.addInteger i 1) xs, and the inliner is unable to inline Builtins.addInteger i 1 (because it's impure), leading to (\x -> s s x) (addInteger 1 i).

There is a big cost difference for findIndexEmptyList, and I think that's because s s is evaluated once before, but not evaluated at all after. In general s s is evaluated one more time before vs. after.

EvaluationFailure ->
fail $ "Evaluation failed, available traces: " <> show traces
EvaluationSuccess term -> do
term @?= constant () (someValue (1 :: Integer))
The Contributor (PR author) commented:

@Unisay just FYI, you can write fromValue (1 :: Integer) instead of constant () (someValue (1 :: Integer)). It's a bit confusing, because we have a HasConstant class that contains

    -- | Wrap a Haskell value as a @term@.
    fromConstant :: Some (ValueOf (UniOf term)) -> term

which is a lot like TermLike containing

    constant :: ann -> Some (ValueOf uni) -> term ann

but doesn't take an annotation. fromValue is defined in terms of the former and we use it during evaluation unlike anything from TermLike. And in general, evaluation stuff is separate (because it needs to be ultra-optimized), so it doesn't have great discoverability.

A Contributor replied:

Thanks, will use fromValue next time!

@effectfully (Contributor, PR author) commented:

@Unisay has addressed the comments, and I've written a Note explaining what we compile (mutually) recursive functions to; let's merge this bad boi. There's still work to do here, I'll create some issues or something.

@effectfully effectfully merged commit ef40c38 into master May 6, 2024
5 checks passed
@effectfully effectfully deleted the effectfully/stdlib/strictify-fix branch May 6, 2024 20:40
@zliu41 (Member) commented May 6, 2024:

I still think strictLetRec.golden should be UPLC. I don't think anyone is going to read it otherwise.

@effectfully (Contributor, PR author) commented May 6, 2024:

@zliu41 I think nobody is going to read it regardless. Without optimizations we produce this unreadable CPSed code, and it's not much better in UPLC than in TPLC/PIR; the types even help reading it a bit. But if you have a strong opinion about it, I'll make it UPLC. Note that this PR reduces the size of the file (old version vs. new version) by 2x.

@zliu41 (Member) commented May 6, 2024:

> I think nobody is going to read it regardless.

We did read andExpensive.uplc.golden etc., and drew conclusions from them. I'm suggesting doing the same for strictLetRec; I don't see a reason not to. It will need to be read if we change something and the strictLetRec test fails.

Merging this pull request may close the issue: A recursive binding is compiled unexpectedly lazily.