Spaces.No.Core: speed up the two slowest lines #1779
Conversation
```diff
@@ -1065,7 +1067,7 @@ Section RaiseSort.
     strip_truncations.
     destruct sh as [[l sh]|[r sh]].
     + apply lt_l with l.
-      apply IHL0, (@No_decode_le _).
+      apply IHL0, (@No_decode_le ua). (* Need to pass in [Univalence] to make this fast. *)
       rewrite p; exact sh.
     + apply lt_r with r.
       apply IHR, No_decode_le.
```
At this spot with the same lemma (and many other spots), Coq finds Univalence quickly.
That is absolutely bizarre. Could it be that Coq is searching the typeclasses database for any instances that produce univalence? I wonder if we have some false promises lying around like being able to produce the term of any type which could cause typeclasses to spin out.
If you do `Set Typeclasses Debug`, you can see what it is looking for.
There might also be a universe issue? If `Set Typeclasses Debug` doesn't give enough information (i.e., if the tc log is relatively short but spends a bunch of time producing no output), you can also do `Set Typeclasses Debug Verbosity 2` and get all the things it runs unification/conversion on before deciding to discard things that don't match.
Also, `Set Typeclasses Depth 3` or another small number can let you see more concisely what Coq is spending its time searching for.
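Concretely, a debugging session combining these suggestions might look like the following sketch (the last line is the slow tactic from this PR; the `Set` commands can also be issued inside a proof):

```coq
Set Typeclasses Debug.              (* log each goal that tc search considers *)
Set Typeclasses Debug Verbosity 2.  (* also log unification/conversion attempts *)
Set Typeclasses Depth 3.            (* bound the search depth to keep the log short *)
Time apply IHL0, (@No_decode_le _). (* the slow line under investigation *)
```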
> i.e., if the tc log is relatively short but spends a bunch of time producing no output
Also, btw, if you want an automated way to search for this sort of thing, you can pipe the output of coqc through https://github.com/JasonGross/coq-scripts/blob/master/timing/insert-timings.sh (I think we have this as a submodule here, in `etc/coq-scripts/timing/insert-timings.sh`?) and then sort by timing.
Hmm, suspicious. I don't suppose `Set Printing All. Set Printing Universes.` gives anything interesting?

Summoning @ppedrot, do you know of any hotspots that would cause a 0.16 second lag between `Debug: considering goal 1 of status initial` and `Debug: 1: looking for Univalence without backtracking` in typeclass resolution?
Wait, how big is the context here, @jdchristensen? Does `Time evar (ua2 : Univalence).` also take 0.16s?
> Time evar (ua2 : Univalence).

That takes 0s.
The context has 23 things in it, and (at least as displayed by default), none of them is particularly large.
> I don't suppose `Set Printing All. Set Printing Universes.` gives anything interesting?

I didn't see anything that caught my eye in either the context or the goal with those set. Both are still relatively compact. And there's really no visible difference with the next goal a few lines down. Also note that `Univalence` is in `Set`, in case that matters.
```coq
(* Coq can find this proof by typeclass search, but helping it out makes this faster. *)
{ intro x; nrapply istrunc_prod; nrapply istrunc_arrow; exact _. }
```
I think we might be able to globally speed up truncation tc search with judicious use of `#[export] Hint Mode ... : typeclass_instances.` and `#[export] Hint Opaque ... : typeclass_instances.` We can use the first to say, e.g., "if you're looking to prove `IsTrunc _ ?e` and `?e` is headed by an evar, quit early", and we can use the second to say "don't spend a bunch of time trying to unfold definitions to see if `IsTrunc n <some definition with a complicated body>` matches `IsTrunc n <some other definition with a different complicated body>`".
@JasonGross Those kinds of typeclasses speedups are really valuable for us, but obviously it's very difficult to know all the tricks. Would you be able to open an issue explaining some tricks? The kinds of classes that we use quite often (and often with issues) are truncatedness and the wildcat ones. Our mathclasses code also has a lot of these issues, but that's less of a priority right now.
I don't think I have a large bag of tricks here. Roughly the universal trick is "set up typeclass search such that for every typeclass problem, there is at most one 'right' answer; the 'right' answer should be obvious to the coder just looking at the goal; try to get Coq to consider only the 'right' answer" (`Hint Mode` and `Hint Opaque` are useful for this last part). Relaxations of this rule are permissible, but introduce slowness exponential in the nesting depth, so make exceptions sparingly, ideally with bounded depth (`Hint Cut ... : typeclass_instances` can help with this), and make sure to order the instance priorities in a way that cuts the search short as early as possible.
Basically, if you're designing something to be solved by typeclasses, you should be able to run the full resolution in your head, including paths that fail, for all the goals you care about; if you can't, you're probably expecting magic that you will pay for in performance.
I guess one trick for helping with performance is doing `Hint Constants Opaque : typeclass_instances` at the outset, and then only doing `Hint Transparent foo : typeclass_instances` when you're sure. (And probably you want to pair `#[export] Hint Transparent` with `#[export] Hint Unfold`; if you're allowing tc to see through the constant, you very likely have no business depending on it being folded.) (And even better to use an idiom like `#[local] Hint Transparent foo : typeclass_instances. #[export] Instance : TheClass foo := _.` so that you don't pollute conversion problems with unfolding `foo` when really you only need to unfold it for one particular instance.)
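A sketch of that last idiom, with hypothetical names (`TheClass` and `foo` are placeholders, not anything in the library):

```coq
(* Mimic the suggested setup: all constants opaque for tc search. *)
#[export] Hint Constants Opaque : typeclass_instances.

Class TheClass (A : Type) := the_witness : A.
Definition foo := nat.

#[export] Instance theclass_nat : TheClass nat := 0.

(* Unfold [foo] in typeclass search only within this file, just long
   enough to derive the one instance we want; the derived instance is
   exported, so client files never need [foo] to be transparent. *)
#[local] Hint Transparent foo : typeclass_instances.
#[export] Instance theclass_foo : TheClass foo := _.
```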
`Set Typeclasses Debug Verbosity 2. Set Debug "tactic-unification,unification". Set Printing All.` is absolutely essential to debugging failures then, though, since often it won't be clear why tc isn't considering some instance when it seems to you like it should.

Feel free to make an issue or a doc out of this.
Yes, it would be great if some things were just faster by default, thanks to some clever settings. I don't understand these options well enough to play with this myself.
I would do something like

```coq
#[export] Hint Constants Opaque : typeclass_instances. (* all unfoldable constants have to be declared as such explicitly *)
#[export] Hint Mode IsTrunc ! ! : typeclass_instances.  (* if the level and type both have some known structure, we can proceed *)
#[export] Hint Mode IsTrunc - + : typeclass_instances.  (* if the type is fully known we can use tc to infer the level *)
#[export] Hint Mode IsHProp ! : typeclass_instances.
#[export] Hint Mode IsHSet ! : typeclass_instances.
#[export] Hint Mode Contr ! : typeclass_instances.
#[export] Hint Mode IsEquiv - - + : typeclass_instances. (* only infer equivalence if the function is known; - - ! would also be fine if we want to infer [IsEquiv id] without needing the types to be known *)
#[export] Hint Mode Equiv + + : typeclass_instances.     (* only infer Equiv if both types are fully known *)
```

and then see what breaks and why, and adjust from there.
If you describe what typeclasses are used in wildcats / mathclasses and how the information flow works, I might be able to guess a good starting point.
`IsHProp`, `IsHSet` and `Contr` are notations, so I had to exclude them. And we already set `IsTrunc` to be `Opaque` in typeclass search, so those lines are probably not needed in general.
I added the rest right before the slow spot here (again, not very slow, 0.15s) and it changed to 0.007s, as fast as my version. In other words, with the above, I can just write `refine (No_ind_hprop _ _).` and it's as fast as my version. Nice! I checked, and it's just the `Hint Constants Opaque` that affects the speed here.

However, I'm not sure how practical it will be to add `Hint Constants Opaque` to the library in general. We'd then need to add a lot of hints about unfolding throughout. If someone wants to add `Hint Constants Opaque` to Overture.v and then go through and add `Hint Mode` as needed to make the library build, that would be an interesting experiment.
> then go through and add `Hint Mode` as needed to make the library build

It's not `Hint Mode` but `Hint Transparent` that needs to be added. Roughly, the rule of thumb is that you need to mark transparent any definition whose truncatedness you want tc to infer, but for which no trunc instance is declared. (For example, if you want to use instances of `IsTrunc _ { a : A & ... }` interchangeably with instances of `IsTrunc _ (hfiber ...)`.) There are a couple of exceptions, though I expect them to be quite rare for `IsTrunc`. One example is that if you prove `forall (A : Type) (ls : list A), IsHProp (nil = ls)` and then want to automatically infer `forall (A B : Type) (f : A -> B) (ls : list B), IsHProp (map f nil = ls)`, you'll need to declare `map` transparent. I wouldn't be all that surprised if none of the truncation instances broke.
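That `nil`/`map` scenario can be sketched in plain Coq with a stand-in class (`MyHProp` here is a hypothetical placeholder; the library's `IsHProp` plays this role in the discussion). Whether the final goal is solved depends on the transparency setup, which is the point:

```coq
Require Import List.
Import ListNotations.

(* A stand-in class, keyed on the syntactic form of its argument. *)
Class MyHProp (P : Type) := myhprop_holds : True.

(* Mimic the proposed setup: all constants opaque for tc search. *)
#[export] Hint Constants Opaque : typeclass_instances.

(* An instance keyed on the syntactic head [nil = ls]. *)
#[export] Instance myhprop_nil_eq {A : Type} (ls : list A)
  : MyHProp ([] = ls) := I.

(* Without this, inferring [MyHProp (map f [] = ls)] fails: tc search
   will not unfold [map] to see that [map f []] reduces to [[]]. *)
#[export] Hint Transparent map : typeclass_instances.

Goal forall (A B : Type) (f : A -> B) (ls : list B),
    MyHProp (map f [] = ls).
Proof. intros; exact _. Qed.
```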
Thanks, @JasonGross! I'll copy this to the issue as well.
```diff
@@ -1065,7 +1067,7 @@ Section RaiseSort.
     strip_truncations.
     destruct sh as [[l sh]|[r sh]].
     + apply lt_l with l.
-      apply IHL0, (@No_decode_le _).
+      apply IHL0, (@No_decode_le ua). (* Need to pass in [Univalence] to make this fast. *)
```
Maybe the unification of `apply` does something faster when the univalence argument is not an evar? Is `refine` faster than `apply` here?
@SkySkimmer No, `refine` takes the same amount of time. But here are some strange observations. The last two are really mysterious to me!

```coq
(* nrefine is "notypeclasses refine" *)
Time apply (No_decode_le).                 (* slow *)
Time apply (@No_decode_le).                (* fails, unable to find Univalence *)
Time apply (@No_decode_le ua).             (* fast *)
Time refine (@No_decode_le ua _ _ _ _).    (* fast *)
Time refine (@No_decode_le _ _ _ _ _).     (* slow *)
Time nrefine (@No_decode_le ua _ _ _ _).   (* slow! *)
Time nrefine (@No_decode_le _ _ _ _ _).    (* slow! (and leaves Univalence as a goal, of course) *)
```

cc @JasonGross
cc @JasonGross
Oh, and with the last one, it's still slow to solve the isolated `Univalence` goal:

```coq
Unshelve.
Time 2: exact _.
```
Is it also slow if you use `typeclasses eauto` instead of `exact _`?
@SkySkimmer Yes, it's the same. BTW, everything labelled "slow" takes essentially the same amount of time, around 0.28s on my machine, and everything labelled "fast" is 0s or very close to 0s. It's weird that the last one breaks the problem into two steps, and both steps take 0.28s.
In the situation where all I'm trying to do is solve the goal `Univalence` using `exact _`, if I `clear` just this one thing from the context:

```coq
sh : le' {{ fun l : L => No_raise (xL l) | (fun r : R => No_raise (xR r)) // rxcut }} (No_raise (xL0 l))
```

then it becomes fast. Strangely, an identical thing is in the context for the next goal, which is fast without manually supplying `ua`.
The problem seems to be in the generation of the "local" hint database corresponding to the local context. It reduces in an inefficient way just to find out that it's not a class.
@SkySkimmer Great! Glad that this led to something useful. If it works, I'll add a comment here saying that the `ua` can be removed when we move to Coq 8.19.
Thanks for all the ideas here. The slowdown with univalence seems to be a quirk that hasn't bothered us elsewhere, so I'm fine with just hardcoding the solution that works here. The slowdown with typeclass search involving truncation is a broader problem that could be tackled another time. I'll open an issue. For now, I'll just merge this.
Some simple changes make the two slowest lines instantaneous. I have no idea why one spot requires univalence to be passed in; never seen that before. I tried various phrasings of that line, but all were slow unless the `ua` argument was given.