
Spaces.No.Core: speed up the two slowest lines #1779

Merged
merged 1 commit into from
Oct 10, 2023

Conversation

jdchristensen
Collaborator

Some simple changes make the two slowest lines instantaneous. I have no idea why one spot requires univalence to be passed in. Never seen that before. I tried various phrasings of that line, but all were slow unless the ua argument was given.

@@ -1065,7 +1067,7 @@ Section RaiseSort.
        strip_truncations.
        destruct sh as [[l sh]|[r sh]].
        + apply lt_l with l.
-         apply IHL0, (@No_decode_le _).
+         apply IHL0, (@No_decode_le ua). (* Need to pass in [Univalence] to make this fast. *)
          rewrite p; exact sh.
        + apply lt_r with r.
          apply IHR, No_decode_le.
Collaborator Author

At this spot with the same lemma (and many other spots), Coq finds Univalence quickly.

Collaborator

@Alizter Alizter Oct 9, 2023

That is absolutely bizarre. Could it be that Coq is searching the typeclass database for any instances that produce univalence? I wonder if we have some false promises lying around, like being able to produce a term of any type, which could cause typeclass search to spin out.

If you do Set Typeclasses Debug you can see what it is looking for.
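A minimal sketch of such a debugging session (the tactic shown is a stand-in for whichever slow line is being investigated):

        (* Log typeclass resolution only around the slow step. *)
        Set Typeclasses Debug.
        Time apply IHL0, No_decode_le. (* the log lists each instance Coq considers *)
        Unset Typeclasses Debug.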

Contributor

There might also be a universe issue? If Set Typeclasses Debug doesn't give enough information (i.e., if the tc log is relatively short but spends a bunch of time producing no output), you can also do Set Typeclasses Debug Verbosity 2 and get all the things it runs unification/conversion on before deciding to discard things that don't match.

Collaborator

Also, Set Typeclasses Depth 3 or another small number can let you understand more concisely what Coq is spending its time searching for.

Contributor

> i.e., if the tc log is relatively short but spends a bunch of time producing no output

Also btw, if you want an automated way to search for this sort of thing, you can pipe the output of coqc through https://github.com/JasonGross/coq-scripts/blob/master/timing/insert-timings.sh (I think we have this as a submodule here in etc/coq-scripts/timing/insert-timings.sh?) and then sort by timing.

Contributor

Hmm, suspicious. I don't suppose Set Printing All. Set Printing Universes. gives anything interesting?

Summoning @ppedrot , do you know of any hotspots that would cause a .16 second lag between

Debug: considering goal 1 of status initial

and

Debug: 1: looking for Univalence without backtracking

in typeclass resolution?

Contributor

Wait, how big is the context here @jdchristensen ? Does Time evar (ua2 : Univalence). also take .16s?

Collaborator Author

> Time evar (ua2 : Univalence).

That takes 0s.

Collaborator Author

The context has 23 things in it, and (at least as displayed by default), none of them is particularly large.

Collaborator Author

> I don't suppose Set Printing All. Set Printing Universes. gives anything interesting?

I didn't see anything that caught my eye in either the context or the goal with those set. Both are still relatively compact. And there's really no visible difference with the next goal a few lines down. Also note that Univalence is in Set, in case that matters.

Comment on lines +925 to +926

        (* Coq can find this proof by typeclass search, but helping it out makes this faster. *)
        { intro x; nrapply istrunc_prod; nrapply istrunc_arrow; exact _. }
Contributor

I think we might be able to globally speed up truncation tc search with judicious use of #[export] Hint Mode ... : typeclass_instances. and #[export] Hint Opaque ... : typeclass_instances. We can use the first to say, e.g., "if you're looking to prove IsTrunc _ ?e and ?e is headed by an evar, quit early", and the second to say "don't spend a bunch of time trying to unfold definitions to see if IsTrunc n <some definition with a complicated body> matches IsTrunc n <some other definition with a different complicated body>".

Collaborator

@JasonGross Those kinds of typeclass speedups are really valuable for us, but obviously it's very difficult to know all the tricks. Would you be able to open an issue explaining some of them? The kinds of classes that we use quite often (and often with issues) are truncatedness and the wildcat ones. Our mathclasses code also has a lot of these issues, but that's less of a priority right now.

Contributor

I don't think I have a large bag of tricks here. Roughly the universal trick is "set up typeclass search such that for every typeclass problem, there is at most one 'right' answer; the 'right' answer should be obvious to the coder just looking at the goal; try to get Coq to consider only the 'right' answer" (Hint Mode and Hint Opaque are useful for this last part). Relaxations of this rule are permissible, but they introduce slowness exponential in the nesting depth, so make exceptions sparingly, ideally with bounded depth (Hint Cut ... : typeclass_instances can help with this), and make sure to order the instance priorities in a way that cuts the search short as early as possible.

Basically, if you're designing something to be solved by typeclasses, you should be able to run the full resolution in your head, including paths that fail, for all the goals you care about; if you can't, you're probably expecting magic that you will pay for in performance.

I guess one trick for helping with performance is doing Hint Constants Opaque : typeclass_instances at the outset, and then only doing Hint Transparent foo : typeclass_instances when you're sure. (And probably you want to pair #[export] Hint Transparent with #[export] Hint Unfold; if you're allowing tc to see through the constant, you very likely have no business depending on it being folded.) (And even better to use an idiom like #[local] Hint Transparent foo : typeclass_instances. #[export] Instance : TheClass foo := _. so that you don't pollute conversion problems with unfolding foo when really you only need to unfold it for one particular instance.)
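The local-transparent idiom described above might look like the following sketch, where foo and TheClass are hypothetical names standing in for a real definition and class:

        Definition foo := (* ...some definition... *) Unit.
        (* Transparent only within this file, so later conversion problems
           elsewhere still see [foo] as opaque: *)
        #[local] Hint Transparent foo : typeclass_instances.
        (* tc search may unfold [foo] just to find this one instance: *)
        #[export] Instance theclass_foo : TheClass foo := _.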
Set Typeclasses Debug Verbosity 2. Set Debug "tactic-unification,unification". Set Printing All. is absolutely essential to debugging failures then, though, since often it won't be clear why tc isn't considering some instance when it seems to you like it should.

Feel free to make an issue or a doc out of this.

Collaborator Author

Yes, it would be great if some things were just faster by default, thanks to some clever settings. I don't understand these options well enough to play with this myself.

Contributor

I would do something like

        #[export] Hint Constants Opaque : typeclass_instances. (* all unfoldable constants have to be declared as such explicitly *)
        #[export] Hint Mode IsTrunc ! ! : typeclass_instances. (* if the level and type both have some known structure, we can proceed *)
        #[export] Hint Mode IsTrunc - + : typeclass_instances. (* if the type is fully known we can use tc to infer the level *)
        #[export] Hint Mode IsHProp ! : typeclass_instances.
        #[export] Hint Mode IsHSet ! : typeclass_instances.
        #[export] Hint Mode Contr ! : typeclass_instances.
        #[export] Hint Mode IsEquiv - - + : typeclass_instances. (* only infer equivalence if the function is known; - - ! would also be fine if we want to infer IsEquiv id without needing the types to be known *)
        #[export] Hint Mode Equiv + + : typeclass_instances. (* only infer Equiv if both types are fully known *)

and then see what breaks and why, and adjust from there.


If you describe what typeclasses are used in wildcats / mathclasses and how the information flow works, I might be able to guess a good starting point.

Collaborator Author

IsHProp, IsHSet and Contr are notations, so I had to exclude them. And we already set IsTrunc to be Opaque in typeclass search, so those lines are probably not needed in general.

I added the rest right before the slow spot here (again, not very slow, 0.15s) and it changed to 0.007s, as fast as my version. In other words, with the above, I can just write refine (No_ind_hprop _ _). and it's as fast as my version. Nice! I checked, and it's just the Hint Constants Opaque that affects the speed here.

However, I'm not sure how practical it will be to add Hint Constants Opaque to the library in general. We'd then need to add a lot of hints about unfolding throughout. If someone wants to add Hint Constants Opaque to Overture.v and then go through and add Hint Mode as needed to make the library build, that would be an interesting experiment.

Contributor

> then go through and add Hint Mode as needed to make the library build

It's not Hint Mode but Hint Transparent that needs to be added. Roughly the rule of thumb is that you need to mark transparent any definition whose truncatedness you want tc to infer, but for which no trunc instance is declared. (For example, if you want to use instances of IsTrunc _ { a : A & ... } interchangeably with instances of IsTrunc _ (hfiber ...).) There are a couple of exceptions, though I expect them to be quite rare for IsTrunc. One example is that if you prove forall (A : Type) (ls : list A), IsHProp (nil = ls) and then want to automatically infer forall (A B : Type) (f : A -> B) (ls : list B), IsHProp (map f nil = ls), you'll need to declare map transparent. I wouldn't be all that surprised if none of the truncation instances broke.
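A sketch of the map example, with a hypothetical instance name:

        (* Suppose this instance is declared: *)
        #[export] Instance ishprop_eq_nil {A : Type} (ls : list A)
          : IsHProp (nil = ls). Admitted.
        (* For tc search to solve [IsHProp (map f nil = ls)] from it, it
           must reduce [map f nil] to [nil], so [map] must be transparent: *)
        #[export] Hint Transparent map : typeclass_instances.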

Collaborator Author

Thanks, @JasonGross! I'll copy this to the issue as well.

@@ -1065,7 +1067,7 @@ Section RaiseSort.
        strip_truncations.
        destruct sh as [[l sh]|[r sh]].
        + apply lt_l with l.
-         apply IHL0, (@No_decode_le _).
+         apply IHL0, (@No_decode_le ua). (* Need to pass in [Univalence] to make this fast. *)
Collaborator

Maybe the unification of apply does something faster when the univalence argument is not an evar?
Is refine faster than apply here?

Collaborator Author

@SkySkimmer No, refine takes the same amount of time. But here are some strange observations. The last two are really mysterious to me!

        (* nrefine is "notypeclasses refine" *)
        Time apply (No_decode_le). (* slow *)
        Time apply (@No_decode_le). (* fails, unable to find Univalence *)
        Time apply (@No_decode_le ua). (* fast *)
        Time refine (@No_decode_le ua _ _ _ _). (* fast *)
        Time refine (@No_decode_le _ _ _ _ _). (* slow *)
        Time nrefine (@No_decode_le ua _ _ _ _). (* slow! *)
        Time nrefine (@No_decode_le _ _ _ _ _). (* slow!  (and leaves Univalence as a goal, of course) *)

cc @JasonGross

Collaborator Author

Oh, and with the last one, it's still slow to solve the isolated Univalence goal:

        Unshelve.
        Time 2: exact _.

Collaborator

It's also slow if you use typeclasses eauto instead of exact _?

Collaborator Author

@SkySkimmer Yes, it's the same. BTW, everything labelled "slow" takes essentially the same amount of time, around 0.28s on my machine, and everything labelled "fast" is 0s or very close to 0s. It's weird that the last one breaks the problem into two steps, and both steps take 0.28s.

Collaborator Author

In the situation where all I'm trying to do is solve the goal Univalence using exact _, if I clear just this one thing from the context:

sh : le' {{ fun l : L => No_raise (xL l) | (fun r : R => No_raise (xR r)) // rxcut }} (No_raise (xL0 l))

then it becomes fast. Strangely, an identical thing is in the context for the next goal, which is fast without manually supplying ua.
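Schematically (not a literal script), the experiment was:

        (* With the full context, solving the [Univalence] goal is slow: *)
        Time exact _. (* ~0.28s *)
        (* After clearing the one [le'] hypothesis, it is fast: *)
        clear sh.
        Time exact _. (* ~0s *)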

Collaborator

The problem seems to be in the generation of the "local" hint database corresponding to the local context. It reduces in an inefficient way just to find out that it's not a class.

Collaborator Author

@SkySkimmer Great! Glad that this led to something useful. If it works, I'll add a comment here saying that the ua can be removed when we move to Coq 8.19.

@jdchristensen
Collaborator Author

Thanks for all the ideas here. The slowdown with univalence seems to be a quirk that hasn't bothered us elsewhere, so I'm fine with just hardcoding the solution that works here. The slowdown with typeclass search involving truncation is a broader problem that could be tackled another time. I'll open an issue. For now, I'll just merge this.

@jdchristensen jdchristensen merged commit 31d7fee into HoTT:master Oct 10, 2023
23 checks passed
@jdchristensen jdchristensen deleted the faster_surreals branch October 10, 2023 14:30