Bad performance of "by" function for random queries #1988

ufechner7 · 2019-10-19T19:27:24Z

The following code has a bad performance due to recompilation for each call of the "by" function:

# Purpose of this test:
# Reproduce the bad performance of aggregate, when the number and the order of the named_tuple argument changes.

using DataFrames

# TODO: find an implementation that doesn't recompile when the number of columns or their order change
function aggregate(table, agg_clause, (named_tuple), prn=false)
    # println(named_tuple)
    result, t, bytes, gctime, memallocs = @timed by(table, agg_clause, named_tuple, sort=true)
    if prn
        println("time: $(round(t, digits=3))")
    end
    return result
end

# Testdata
ALL = Symbol[:CAT_UTRECHT_CNT, :CAT_MOUSE, :WW_OP, :QWE_GRP, :HWR_KT, :CAT_WEIGHT, :HIGH_GRP, :CAT_WEIGHT_C, :CAT_CMI_C, :CAT_MOUSE_C, :CAT_FREIG_WGT_CNT, :CAT_FREIG_WGT_CNT_C, :AMS_CMI_TOT, :CAT_UTRECHT_DIS_CNT, :AMS_CMI_HRV, 
:GVZ_WXK, :AMS_CMI_DIFF, :CAT_UTRECHT_CNT_C]

agg_clause = Symbol[:HIGH_GRP, :QWE_GRP, :WW_OP]
NAMED_TUPLE = (CAT_WEIGHT_C_T = :CAT_WEIGHT_C => maximum, CAT_UTRECHT_CNT_T = :CAT_UTRECHT_CNT => maximum, CAT_CMI_C_T = :CAT_CMI_C => maximum, 
                 CAT_MOUSE_C_T = :CAT_MOUSE_C => maximum, CAT_FREIG_WGT_CNT_T = :CAT_FREIG_WGT_CNT => maximum, 
                 CAT_FREIG_WGT_CNT_C_T = :CAT_FREIG_WGT_CNT_C => maximum, CAT_MOUSE_T = :CAT_MOUSE => maximum, CAT_WEIGHT_C = :CAT_WEIGHT_C => sum, 
                 CAT_UTRECHT_CNT_C = :CAT_UTRECHT_CNT_C => sum, WW_OP = :WW_OP => maximum, CAT_FREIG_WGT_CNT = :CAT_FREIG_WGT_CNT => sum, 
                 HWR_KT = :HWR_KT => maximum, CAT_WEIGHT = :CAT_WEIGHT => sum, CAT_MOUSE = :CAT_MOUSE => sum, AMS_CMI_TOT = :AMS_CMI_TOT => sum, 
                 CAT_FREIG_WGT_CNT_C = :CAT_FREIG_WGT_CNT_C => sum, CAT_UTRECHT_DIS_CNT = :CAT_UTRECHT_DIS_CNT => sum, CAT_CMI_C = :CAT_CMI_C => sum, 
                 CAT_MOUSE_C = :CAT_MOUSE_C => sum, AMS_CMI_HRV = :AMS_CMI_HRV => sum, GVZ_WXK = :GVZ_WXK => sum, 
                 CAT_UTRECHT_CNT = :CAT_UTRECHT_CNT => sum, AMS_CMI_DIFF = :AMS_CMI_DIFF => sum)

# add a column with random Float64 values and the name col to the table
function add_column(table, col::Symbol)
    COL = rand(Float64, nrow(table))
    table[!, col] = COL 
end

# add a column with random String values and the name col to the table
function add_string_column(table, col::Symbol)
    COL = String[]
    for i in 1:nrow(table)
        if i % 3 == 0 
            push!(COL, "HUND")
        elseif i % 3 == 1
            push!(COL, "HASE")
        elseif i % 3 == 2
            push!(COL, "IGEL")
        end
    end
    table[!, col] = COL 
end

# build a named tuple with the given length, using NAMED_TUPLE as input
function make_named_tuple(len)
    nt = NAMED_TUPLE
    @assert len <= length(nt)
    t_names = propertynames(nt)
    pairs = Vector{Pair}(undef, len)
    indices = []
    new_names = []
    for i in 1:len
        index = rand(setdiff(1:length(nt), indices))
        push!(indices, index)
        pairs[i] = values(nt)[index]
        push!(new_names, t_names[index])
    end
    return NamedTuple{Tuple(new_names)}(Tuple(pairs))
end

# create data frame with 30 rows
jtab = DataFrame([1:30])

# add columns, using random values
for col in ALL
    global jtab
    if col in agg_clause
        add_string_column(jtab, col)
    else
        add_column(jtab, col)
    end
end

# create random aggregates of this dataframe and determine the time
for i = 1:length(NAMED_TUPLE)
    nt = make_named_tuple(i)
    res = aggregate(jtab, agg_clause, nt, true)
end
nothing

Example output, second run:

julia> include("local_tests/test_aggregate2.jl")
time: 0.046
time: 0.054
time: 0.062
time: 0.057
time: 0.057
time: 0.061
time: 0.06
time: 0.063
time: 0.061
time: 0.066
time: 0.065
time: 0.074
time: 0.072
time: 0.075
time: 0.072
time: 0.069
time: 0.069
time: 0.071
time: 0.074
time: 0.075
time: 0.076
time: 0.078
time: 0.076

Original post: https://discourse.julialang.org/t/bad-performance-of-group-by-of-dataframes-updated/30061/

The text was updated successfully, but these errors were encountered:

ufechner7 · 2019-10-19T19:34:15Z

The performance improves by a factor of 40 for queries that work on Float64 colums only, and by a factor of four for queries with mixed column types, when the following code is used:

# This function works fine if all colums are Float64, but not so good if it is a mix
# of String and Float64
function aggregate1(table, agg_clause, @nospecialize(named_tuple), prn=false)
    # println(named_tuple)
    x = []
    for (a,(b, c)) in pairs(named_tuple)
        push!(x, [a, b, c])
    end
    result, t, bytes, gctime, memallocs = @timed by(table, agg_clause, sort=true) do sdf
        r = DataFrame()
        for (a,b,c) in x
            r[!, a] = [c(sdf[!, b])]
        end
        r
    end
    if prn
        println("time: $(round(t, digits=3))")
    end
    return result
end

But first of all it is complicated, and secondly it doesn't work very well if the column types are mixed.

globalchitnis · 2019-10-19T20:18:53Z

When one has random queries, like I have come across (typical for generic applications) fast function would make a lot of difference!

nalimilan · 2019-10-21T14:01:26Z

Couldn't you just pass a vector of pairs instead of a named tuple, and rename columns afterwards? I.e. do return pairs in make_named_tuple.

It's too bad that names are part of the type of named tuples in this case. I precisely ensured that named tuples are not used internally so that recompilation isn't needed when the names change. But the public API for specifying column names indeed relies on named tuples. Maybe we could mitigate the problem by collecting them into a vector as soon as possible so that only a very small function needs to be recompiled. I'm not sure @nospecialize currently works with annotated arguments.

nalimilan · 2019-10-22T08:42:32Z

@vtjnash told me that @nospecialize works with typed arguments now, and indeed by adding a small unspecialized method for NamedTuple times are much improved. See #1992.

nalimilan · 2019-10-23T08:22:42Z

@ufechner7 Can you try with DataFrames master and see whether your problem is fixed?

nalimilan mentioned this issue Oct 22, 2019

Avoid recompiling combine for each new NamedTuple or Tuple type #1992

Merged

nalimilan closed this as completed in #1992 Oct 23, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bad performance of "by" function for random queries #1988

Bad performance of "by" function for random queries #1988

ufechner7 commented Oct 19, 2019 •

edited by nalimilan

Loading

ufechner7 commented Oct 19, 2019 •

edited

Loading

globalchitnis commented Oct 19, 2019

nalimilan commented Oct 21, 2019 •

edited

Loading

nalimilan commented Oct 22, 2019

nalimilan commented Oct 23, 2019

Bad performance of "by" function for random queries #1988

Bad performance of "by" function for random queries #1988

Comments

ufechner7 commented Oct 19, 2019 • edited by nalimilan Loading

ufechner7 commented Oct 19, 2019 • edited Loading

globalchitnis commented Oct 19, 2019

nalimilan commented Oct 21, 2019 • edited Loading

nalimilan commented Oct 22, 2019

nalimilan commented Oct 23, 2019

ufechner7 commented Oct 19, 2019 •

edited by nalimilan

Loading

ufechner7 commented Oct 19, 2019 •

edited

Loading

nalimilan commented Oct 21, 2019 •

edited

Loading