Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inference failure when inlining #1602

Closed
charleskawczynski opened this issue Jan 31, 2024 · 7 comments · Fixed by #1636
Closed

Inference failure when inlining #1602

charleskawczynski opened this issue Jan 31, 2024 · 7 comments · Fixed by #1636
Assignees
Labels
bug Something isn't working Inference performance

Comments

@charleskawczynski
Copy link
Member

charleskawczynski commented Jan 31, 2024

I need to dig a bit more into why this happens, however, what I've seen is that if PhasePartition (in Thermodynamics.jl) in this reproducer is inlined, then the broadcasted anonymous function in ClimaCore.jl/Fields/broadcast.jl:

# types aren't isbits
Base.Broadcast.broadcasted(
    fs::AbstractFieldStyle,
    ::Type{T},
    args...,
) where {T} = Base.Broadcast.broadcasted(fs, (x...) -> T(x...), args...)

is not inferred. Here's the reproducer:

#=
julia --project=perf
include("../perf/climacore_type_bc_inf.jl")
=#
using Revise
import ClimaCore;
import Thermodynamics as TD
import Thermodynamics.Parameters as TDP
import CloudMicrophysics as CM
import ClimaCore.Fields as Fields
import ClimaCore.Operators as Operators
import ClimaCore.Spaces as Spaces
import ClimaCore.Geometry as Geometry
import ClimaComms
@isdefined(TU) || include(joinpath(pkgdir(ClimaCore), "test", "TestUtilities", "TestUtilities.jl"));
import .TestUtilities as TU;
FT = Float64;
zelem=25
cspace = TU.CenterExtrudedFiniteDifferenceSpace(FT;zelem, helem=10);
fspace = Spaces.FaceExtrudedFiniteDifferenceSpace(cspace);
space = cspace;
@show ClimaComms.device(cspace)

# From ClimaAtmos
using ClimaParams
import ClimaParams as CP
aliases = string.(fieldnames(TD.Parameters.ThermodynamicsParameters));
toml_dict = CP.create_toml_dict(FT);
const thermo_params = TD.Parameters.ThermodynamicsParameters(toml_dict)
const cm_params = CM.Parameters.Parameters0M(FT)

x = (;
    ᶜts       = Fields.Field(TD.PhaseEquil{FT}, space),
    ᶜS_ρq_tot = Fields.Field(FT, space),
    ρ         = Fields.Field(FT, space),
)

total_specific_humidity(tp, ts) = TD.PhasePartition(tp, ts).tot
function compute_precipitation_cache!(x)
    (; ᶜts, ρ, ᶜS_ρq_tot) = x
    @. ᶜS_ρq_tot =
        ρ * CM.Microphysics0M.remove_precipitation(
            cm_params,
            TD.PhasePartition(thermo_params, ᶜts),
        )
    # @. ᶜS_ρq_tot = ρ * TD.PhasePartition(thermo_params, ᶜts).tot # allocates!
    # @. ᶜS_ρq_tot = ρ * total_specific_humidity(thermo_params, ᶜts) # allocation-free
    return nothing
end

compute_precipitation_cache!(x)

import JET
JET.@test_opt compute_precipitation_cache!(x)

function do_work!(x)
    for i in 1:200
        compute_precipitation_cache!(x)
    end
end

import Profile, ProfileCanvas
@info "Compiling first..."
do_work!(x) # compile first
@info "Collecting profile..."
Profile.clear()
prof = Profile.@profile do_work!(x)
results = Profile.fetch()
Profile.clear()
@info "Generating html..."
ProfileCanvas.html_file("flame.html", results)

@info "Benchmarking..."
import BenchmarkTools
trial = BenchmarkTools.@benchmark compute_precipitation_cache!($x)
show(stdout, MIME("text/plain"), trial)

This function is ~100x slower as a result. cc @glwagner and @dennisYatunin (just FYI). It's possible that we're hitting a heuristic (PhasePartition calls PhasePartition_equil with a handful of arguments), or maybe it's the use of phase_type, which is somewhat of an edge case in Thermodynamics.

Whatever the issue is, this is a useful reproducer because one other thing I noticed is that @. ᶜS_ρq_tot = ρ * TD.PhasePartition(thermo_params, ᶜts).tot allocates, making it slower. It's important that we track and document these issues, and spread best-practice patterns to avoid shortcomings with our broadcast machinery, or improve its robustness.

In addition, JET seems to miss this inference failure, however, it is caught in the flame graph (and the timing difference is observed through BenchmarkTools).

@charleskawczynski
Copy link
Member Author

Maybe even more interesting, removing

Base.Broadcast.broadcasted(
    fs::AbstractFieldStyle,
    ::Type{T},
    args...,
) where {T} = Base.Broadcast.broadcasted(fs, (x...) -> T(x...), args...)

entirely not only fixes the issue, but the function is 20% faster and seems to have less noise.

I'll see if this can be removed. It was added in a very large PR (#767), so it's possible that it was a kluge to get things working.

@charleskawczynski
Copy link
Member Author

Let's add an allocation unit test to reproduce this.

@glwagner
Copy link
Member

glwagner commented Feb 2, 2024

Can you clarify? Are you saying that there is a problem when an @inline annotation is added to a function? To reproduce this, we need to know which function to add/remove @inline from?

@glwagner
Copy link
Member

glwagner commented Feb 2, 2024

Maybe even more interesting, removing

Base.Broadcast.broadcasted(
    fs::AbstractFieldStyle,
    ::Type{T},
    args...,
) where {T} = Base.Broadcast.broadcasted(fs, (x...) -> T(x...), args...)

entirely not only fixes the issue, but the function is 20% faster and seems to have less noise.

I'll see if this can be removed. It was added in a very large PR (#767), so it's possible that it was a kluge to get things working.

It looks like this is supposed to handle constructions like FT.(x) where FT is something like Float32. Is that right?

Something like this should not be called during time-stepping. It might be even better to avoid this kind of pattern altogether, if possible. Not sure what the use case is.

@simonbyrne
Copy link
Member

Maybe even more interesting, removing

Base.Broadcast.broadcasted(
    fs::AbstractFieldStyle,
    ::Type{T},
    args...,
) where {T} = Base.Broadcast.broadcasted(fs, (x...) -> T(x...), args...)

It should no longer be required, now that JuliaGPU/CUDA.jl#2000 is in CUDA.jl

@charleskawczynski
Copy link
Member Author

charleskawczynski commented Feb 2, 2024

Can you clarify? Are you saying that there is a problem when an @inline annotation is added to a function? To reproduce this, we need to know which function to add/remove @inline from?

Yes, when we inline Thermodynamic's PhaseEquil, the reproducer I gave (a full reproducer with git clone etc. is in aviatesk/JET.jl#608) results inference failure.

It looks like this is supposed to handle constructions like FT.(x) where FT is something like Float32. Is that right?

Yes but, interestingly, this broadcasted method also seems to catch when FT is a type and a method. Here is a minimal reproducer of what I mean:

function Base.Broadcast.broadcasted(
    fs::Base.BroadcastStyle,
    ::Type{T},
    args...,
) where {T}
    println("Boom! $T, $args")
    type_cast(x...) = T(x...)
    Base.Broadcast.broadcasted(fs, type_cast, args...)
end

# Base's definition (FYI):
# function Base.Broadcast.broadcasted(
#     fs::Base.BroadcastStyle,
#     f::T,
#     args...,
# ) where {T}
#     Base.Broadcast.Broadcasted(fs, f, args...)
# end

struct Foo{T}
    x::T
end
struct Bar{T}
    x::T
end
struct Baz{T}
    x::T
end
function Bar(x::Int)
    print("wow ")
    Bar{Float64}(x);
end;
function get_Baz(x::Int)
    print("nice ")
    Baz{Float64}(x);
end;
function main_foo!(x)
    x .= x .+ getproperty.(Foo.(x), :x)
    return nothing
end;
function main_bar!(x)
    x .= x .+ getproperty.(Bar.(x), :x) # this Bar is a method, not the type!
    return nothing
end;
function main_baz!(x)
    x .= x .+ getproperty.(get_Baz.(x), :x) # broadcasted does not catch it if there is no method
    return nothing
end;

a = [1,2,3];
main_foo!(a);
main_bar!(a);
main_baz!(a);

Something like this should not be called during time-stepping. It might be even better to avoid this kind of pattern altogether, if possible. Not sure what the use case is.

It should no longer be required, now that JuliaGPU/CUDA.jl#2000 is in CUDA.jl

Agreed, I think we can remove it, we just need to fix the resulting issues. Which, at the moment is (from this build):

ERROR: LoadError: MethodError: adapt_structure(::CUDA.KernelAdaptor, ::Base.Broadcast.Broadcasted{ClimaCore.Fields.FieldStyle{ClimaCore.DataLayouts.VIJFHStyle{4, CUDA.CuArray{Float32, N, CUDA.Mem.DeviceBuffer} where N}}, ClimaCore.Operators.CenterPlaceholderSpace, Type{ClimaCore.Geometry.Contravariant3Vector}, Tuple{ClimaCore.Fields.Field{ClimaCore.DataLayouts.VIJFH{ClimaCore.Geometry.Covariant12Vector{Float32}, 4, SubArray{Float32, 5, CUDA.CuArray{Float32, 5, CUDA.Mem.DeviceBuffer}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}, Base.Slice{Base.OneTo{Int64}}}, false}}, ClimaCore.Operators.PlaceholderSpace}, ClimaCore.Fields.Field{ClimaCore.DataLayouts.VIJFH{ClimaCore.Geometry.LocalGeometry{(1, 2, 3), ClimaCore.Geometry.LatLongZPoint{Float32}, Float32, StaticArraysCore.SMatrix{3, 3, Float32, 9}}, 4, CUDA.CuArray{Float32, 5, CUDA.Mem.DeviceBuffer}}, ClimaCore.Operators.PlaceholderSpace}}}) is ambiguous.
  |  
  | Candidates:
  | adapt_structure(to::CUDA.KernelAdaptor, bc::Base.Broadcast.Broadcasted{Style, <:Any, Type{T}}) where {Style, T}
  | @ CUDA /central/scratch/esm/slurm-buildkite/climaatmos-ci/depot/default/packages/CUDA/rXson/src/compiler/execution.jl:174
  | adapt_structure(to, bc::Base.Broadcast.Broadcasted{Style}) where Style<:ClimaCore.Fields.AbstractFieldStyle
  | @ ClimaCore.Fields /central/scratch/esm/slurm-buildkite/climaatmos-ci/depot/default/packages/ClimaCore/SnmZh/src/Fields/broadcast.jl:38
  |  
  | Possible fix, define
  | adapt_structure(::CUDA.KernelAdaptor, ::Base.Broadcast.Broadcasted{…} where Axes) where {}
  |  
  | Stacktrace:
  | [1] adapt(to::CUDA.KernelAdaptor, x::Base.Broadcast.Broadcasted{…})
  | @ Adapt /central/scratch/esm/slurm-buildkite/climaatmos-ci/depot/default/packages/Adapt/cdaEv/src/Adapt.jl:40
  | [2] (::Base.Fix1{…})(y::Base.Broadcast.Broadcasted{…})
  | @ Base ./operators.jl:1118
  | [3] map(f::Base.Fix1{…}, t::Tuple{…})
  | @ Base ./tuple.jl:292
  | [4] adapt_structure(to::CUDA.KernelAdaptor, xs::Tuple{Base.Broadcast.Broadcasted{…}, Base.Broadcast.Broadcasted{…}})
  | @ Adapt /central/scratch/esm/slurm-buildkite/climaatmos-ci/depot/default/packages/Adapt/cdaEv/src/base.jl:3
  | [5] adapt(to::CUDA.KernelAdaptor, x::Tuple{Base.Broadcast.Broadcasted{…}, Base.Broadcast.Broadcasted{…}})
  | @ Adapt /central/scratch/esm/slurm-buildkite/climaatmos-ci/depot/default/packages/Adapt/cdaEv/src/Adapt.jl:40
  | [6] adapt_structure(to::CUDA.KernelAdaptor, sbc::ClimaCore.Operators.StencilBroadcasted{…})
  | @ ClimaCore.Operators /central/scratch/esm/slurm-buildkite/climaatmos-ci/depot/default/packages/ClimaCore/SnmZh/src/Operators/finitedifference.jl:238
  | [7] adapt(to::CUDA.KernelAdaptor, x::ClimaCore.Operators.StencilBroadcasted{…})
  | @ Adapt /central/scratch/esm/slurm-buildkite/climaatmos-ci/depot/default/packages/Adapt/cdaEv/src/Adapt.jl:40
  | [8] cudaconvert(arg::ClimaCore.Operators.StencilBroadcasted{…})
  | @ CUDA /central/scratch/esm/slurm-buildkite/climaatmos-ci/depot/default/packages/CUDA/rXson/src/compiler/execution.jl:188
  | [9] map (repeats 2 times)
  | @ ./tuple.jl:294 [inlined]
  | [10] macro expansion
  | @ /central/scratch/esm/slurm-buildkite/climaatmos-ci/depot/default/packages/CUDA/rXson/src/compiler/execution.jl:102 [inlined]
  | [11] copyto!(out::ClimaCore.Fields.Field{…}, bc::ClimaCore.Operators.StencilBroadcasted{…})
  | @ ClimaCore.Operators /central/scratch/esm/slurm-buildkite/climaatmos-ci/depot/default/packages/ClimaCore/SnmZh/src/Operators/finitedifference.jl:3366
  | [12] materialize!(dest::ClimaCore.Fields.Field{…}, opbc::ClimaCore.Operators.StencilBroadcasted{…})
  | @ ClimaCore.Operators /central/scratch/esm/slurm-buildkite/climaatmos-ci/depot/default/packages/ClimaCore/SnmZh/src/Operators/common.jl:61
  | [13] set_ᶠuₕ³!(ᶠuₕ³::ClimaCore.Fields.Field{…}, Y::ClimaCore.Fields.FieldVector{…})
  | @ ClimaAtmos /central/scratch/esm/slurm-buildkite/climaatmos-ci/16362/climaatmos-ci/src/cache/precomputed_quantities.jl:138
  | [14] macro expansion
  | @ /central/scratch/esm/slurm-buildkite/climaatmos-ci/16362/climaatmos-ci/src/cache/precomputed_quantities.jl:424 [inlined]
  | [15] set_precomputed_quantities!(Y::ClimaCore.Fields.FieldVector{…}, p::@NamedTuple{}, t::Float32)

@charleskawczynski
Copy link
Member Author

Interestingly, I've found another issue, from analyzing and simplifying this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Inference performance
Projects
None yet
3 participants