Effects on master are influenced by generation of coverage data #49978
This is annoying but intentional. The problem is that asking for code coverage means that code is not effect-free, because with code coverage on, deleting a function whose output is not used has the visible effect of changing the coverage numbers.
:/ Does that mean more or less everyone relying on some effects being there needs to run their code twice, once with coverage and once without? Or should I not test effects to this extent (it would be a shame - I need those guarantees for AVR)? It's not a long-running testsuite, but here's the smallest reproducer I have for now:

```julia
using FieldFlags
using Test

@bitflags struct JETStruct
    a
    _
    b
end

foldableAccess(j::JETStruct) = j.a

effects = Base.infer_effects(foldableAccess, (JETStruct,))
display(effects)
@test Core.Compiler.is_foldable(effects)
```

with the reported effects differing depending on whether coverage is enabled.
The PR introducing this change is #48598, if I'm not mistaken, which only taints the effect because otherwise the optimizer would remove the dead IR. Is it possible to implement code coverage differently, so that effect testing is not influenced by coverage? Or is there some other way I can test that these functions are foldable?
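One possible workaround for the testing question (a sketch, not an officially recommended pattern; the `double` helper is a hypothetical stand-in) is to gate effect assertions on whether coverage is active, which can be queried via `Base.JLOptions().code_coverage`:

```julia
using Test

# Hypothetical stand-in for a function whose foldability we want to assert.
double(x::Int) = 2x

# Base.JLOptions().code_coverage is nonzero when julia was started with
# --code-coverage, in which case :effect_free is intentionally tainted
# and the assertion below would fail spuriously.
if Base.JLOptions().code_coverage == 0
    effects = Base.infer_effects(double, (Int,))
    @test Core.Compiler.is_foldable(effects)
else
    @info "Coverage enabled; skipping effect-foldability tests"
end
```

This still means running the testsuite twice (once with coverage for the coverage report, once without for the effect guarantees), but the branch at least keeps the two runs from interfering with each other.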
If anything, I'd argue that tainting …
At the heart of this is a question of what coverage should measure. If coverage is meant to measure code as written, the compiler must not be allowed to change the code that runs in certain ways. We could probably add a coverage option that allows these optimizations, but it is somewhat unclear what would then be measured.
Here's a smaller MWE, which doesn't need FieldFlags.jl:

```julia
[sukera@tower ~]$ julia --code-coverage -q

julia> h(x) = iseven(x) ? "string" : 1
h (generic function with 1 method)

julia> g() = h(2)
g (generic function with 1 method)

julia> Base.infer_effects(g, ())
(+c,!e,+n,+t,+s,+m,+i)
```

```julia
[sukera@tower ~]$ julia -q

julia> h(x) = iseven(x) ? "string" : 1
h (generic function with 1 method)

julia> g() = h(2)
g (generic function with 1 method)

julia> Base.infer_effects(g, ())
(+c,+e,+n,+t,+s,+m,+i)
```

This doesn't make sense and is inconsistent, since the tainting doesn't actually prevent constant folding:

```julia
[sukera@tower ~]$ julia --code-coverage -q

julia> h(x) = iseven(x) ? "string" : 1
h (generic function with 1 method)

julia> g() = h(2)
g (generic function with 1 method)

julia> @code_warntype g()
MethodInstance for g()
  from g() @ Main REPL[2]:1
Arguments
  #self#::Core.Const(g)
Body::String
1 ─ %1 = Main.h(2)::Core.Const("string")
└──      return %1
```

If the tainting were real/proper, this ought to infer `Union{Int64, String}` instead.
If one of the "effects" of the function is to emit coverage data, then evaluating it at compile time would change that behavior, no?
This is with coverage:

```julia
julia> @code_llvm g()
;  @ REPL[2]:1 within `g`
define nonnull {}* @julia_g_925() #0 {
top:
  %0 = alloca [8 x i8], align 8
  %lcnt = load volatile i64, i64* inttoptr (i64 105553177974792 to i64*), align 8
  %1 = add i64 %lcnt, 1
  store volatile i64 %1, i64* inttoptr (i64 105553177974792 to i64*), align 8
  %2 = call { {}*, i8 } @j_h_927([8 x i8]* noalias nocapture noundef nonnull %0, i64 signext 2) #0
  %3 = extractvalue { {}*, i8 } %2, 0
  ret {}* %3
}
```

and without:

```julia
julia> @code_llvm g()
;  @ REPL[2]:1 within `g`
define nonnull {}* @julia_g_477() #0 {
top:
  %0 = alloca [8 x i8], align 8
  %1 = call { {}*, i8 } @j_h_479([8 x i8]* noalias nocapture noundef nonnull %0, i64 signext 2) #0
  %2 = extractvalue { {}*, i8 } %1, 0
  ret {}* %2
}
```
"emitting coverage data" is a bad effect in terms of side-effect freeness though, in particular when it puns on "is constant foldable". The reason being that, for the code that we're inspecting with coverage data, the coverage data is not a sideeffect - it's only a behavior changing sideeffect for the testsuite. |
In particular, the example I gave in the OP is a longer version of this MWE:

```julia
[sukera@tower ~]$ julia --code-coverage -q

julia> struct Foo end

julia> Base.getproperty(f::Foo, s::Symbol) = s === :foo ? "foo!" : getfield(f, s)

julia> f(x::Foo) = x.foo
f (generic function with 1 method)

julia> Base.infer_effects(f, (Foo,))
(+c,!e,+n,+t,+s,+m,+i)

julia> code_warntype(f, (Foo,))
MethodInstance for f(::Foo)
  from f(x::Foo) @ Main REPL[3]:1
Arguments
  #self#::Core.Const(f)
  x::Core.Const(Foo())
Body::String
1 ─ %1 = Base.getproperty(x, :foo)::Core.Const("foo!")
└──      return %1
```

While the code in the OP doesn't return a `String`, …
In general, it's not a good idea to have the results of your code rely on certain optimisations that may be disabled when testing. Think of forced boundschecks, and debug mode(s) in some languages. Being able to add additional safety features during testing is a good thing.
With apologies for repeating myself: if the code is supposed to track every time it is called, how could it be valid to evaluate its result at compile time? Wouldn't that change the output in the coverage files?
It is actually an interesting question though: once we expose `infer_effects` to users in an API, do we try to make it so that the code path taken is not impacted, so that you can still test code coverage without it bypassing the code being tested?
I guess we could make it an invisible effect, like a GC allocation? Though that doesn't solve the fact that it shouldn't concrete-eval/fold, because it means you might get wrong coverage results.
I think a lot of the problems here could be resolved by adding a separate effect for coverage, rather than tainting the existing effects. The only reason the change of effects is causing problems is that people want to test that their code has good effects (which, for example, will be needed for compiling to AVR). If we separated out a coverage effect, it would be possible for people to test that their code will work as expected when deployed (without coverage).
Here's the rub - the fact that the compiler is able to constant-fold some call means it had enough information to find some path through the call; otherwise it could not have folded the call into a constant. So the compiler must already have perfect knowledge about coverage, since it knows which path it took to arrive at the constant. Were that not the case, it would not have been able to fold the call (and even without being able to fold a call into a constant, it must be able to give some path/branch coverage after constant propagation, due to eliminating some error paths, like in the `getproperty` example above).
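That observation can be made concrete with ordinary reflection (a sketch reusing the `h`/`g` toy example from earlier in the thread): when inference folds the call, the inferred return type itself records which branch must have been taken.

```julia
# Same toy example as above.
h(x) = iseven(x) ? "string" : 1
g() = h(2)

# Without coverage, inference narrows g's return type all the way down
# to String - which is only possible because the compiler proved that
# the iseven(2) branch was the one taken, i.e. it already had the path
# information that coverage instrumentation collects at runtime.
only(Base.return_types(g, ()))  # String
```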
Code coverage is a very interesting field, with lots of different variations. The following is loosely inspired by some literature I stumbled across years ago - I'll have to see if I can find it again. The ideas are definitely not my own, though I believe the Julia-specific application is.

Generally, each of the following has either full, partial, or no coverage. Full coverage means "every variation of this is covered by some test code", partial coverage means "some variation of this is covered by some test code", and no coverage means "no variation of this is covered by some test code":

- Line coverage
- Expression coverage
- Branch coverage
- Path coverage
- Input/Output space coverage

The list is ordered from "least trustworthy" to "most trustworthy", in terms of implications about the correctness of the tested piece of code. If you have full coverage (I think sometimes also called "total" in the literature, but don't quote me on that) on one level of the list, you must also have full coverage on the levels above it. Partial coverage only implies partial coverage up the list (though full coverage is allowed - for example, it's extremely common to only have partial Input/Output space coverage, yet reach full line coverage trivially). No coverage on a level means that you can't have coverage on the levels below it either.

Julia currently only reports (fuzzy) line coverage, which is extremely easy to reach with a bit of effort (for example, the package I mentioned above has 100% coverage, yet I only consider it moderately well tested, because Julia doesn't report functions that aren't called at all as uncovered). So from my POV, "making coverage better" means being able to emit more granular information about what isn't covered, thereby reducing reported coverage percentages (because we're moving down the list of coverage guarantees). To get some of these levels, our source code needs to at least track columns as well as lines, otherwise we only get a subset of the expressions, branches, or paths that could be covered (think ternary expressions, for example).

So what does this have to do with coverage being an observable side effect or not? Well, coverage is a diagnostic global property of a program, not a property inherent to a function. Let me take the example from above again:

```julia
h(x) = iseven(x) ? "string" : 1
g() = h(2)
```

A priori, asking whether any part of this code is "covered" doesn't really make sense - there is no call to either function.

One solution is of course to disable constant folding when emitting coverage, but that then raises the question of whether constant folding is part of the observable API at some point, and how to test for it if it's actually required for some other reason (like compiling away an abstraction to build a nice API, which Julia is extremely good at - it would be a shame not to be able to test for that!). This is likely sufficient to fix the immediate issue reported here, though it doesn't help with getting better/more granular coverage reports overall.

Another solution is to keep track of coverage data (in however granular a form we want [1]), but only "merge" it with the parent call when actually requested/a top-level call is made that inquires about the coverage of its subexpressions. An additional benefit of this approach is that we can ask the opposite question - where do I need to write more tests to increase coverage, by making top-level calls that would cover more subexpressions not yet covered in some capacity by another call? I think this can actually already work with abstract interpretation, by keeping track of covered-line data through calls & expressions, bubbling coverage data upwards. A function is fully covered if all its lines/subexpressions are covered; if some are covered, it's partially covered; and if none are, it's not covered (you get the idea). Admittedly, I don't have an implementation of that idea. The difficulty of course is that there's quite a lot of additional metadata one would like access to when asking about coverage - it's not impossible to think that Julia could directly produce a "coverage percentage" itself (however bad a metric that is), using that method.

A similar idea could be applied for tracking a lower bound on how much memory a given call needs, though this is obviously offtopic and out of scope for this discussion. Ah well, this should be its own separate discussion, shouldn't it? I have some more points on coverage (in particular about a nasty …).

[1]: e.g. for constant-foldable expressions, Julia must already have perfect knowledge of line, expression, branch, path, and input/output space coverage, because it was able to execute the code at compile time! This doesn't necessarily imply that the result is full coverage, but it does imply partial coverage for all of these.
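To make the difference between these levels tangible, here's a minimal, hypothetical illustration: a single call fully covers the one-line function below, so line coverage reports it as covered, while branch coverage would not.

```julia
# One source line, two branches.
h(x) = iseven(x) ? "string" : 1

h(2)  # Line coverage: the line of h is now "covered".
      # Branch coverage: only the iseven-true arm ran; the `: 1` arm
      # is unexercised, so branch coverage is merely partial.
      # Input/Output space coverage: a single point of the domain.
```

The same mismatch compounds for path and input/output space coverage: a test suite can trivially reach 100% line coverage while exercising only a sliver of the behaviors the code can exhibit.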
Of note, this is not technically a requirement for compiling to AVR in general, but it does make things a whole lot easier. I really can't have dynamic dispatch for now, so being able to know that from a run of JET.jl, instead of through an error in GPUCompiler (which is missing the crucial information of where that dispatch occurred), is a godsend. Being able to test for that separately is even better, because FieldFlags.jl isn't targeting AVR in particular at all! It's just a package that happens to be good in that environment as well :)
Here's another cool example/argument for why having coverage influence optimization is a bad thing: https://shapr.github.io/posts/2023-07-30-goldilocks-property-tests.html I'd love to port this approach to PropCheck.jl, but without being able to reflect on coverage data in a safe way (i.e. without influencing the code I'm reflecting on), I can't really do this.
Found while debugging the failures of FieldFlags.jl on nightly - the tests run fine on 1.9; I haven't yet tried to bisect. Since it's not (yet) through its registration period, you'll have to `]add` the GitHub repo directly to reproduce. Either way, the failures are reproducible with code coverage enabled, but not with coverage disabled, which makes the tests run through. I'll see if I can minimize the failure a bit more.

EDIT: For posterity, a smaller MWE without dependencies: …