Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bad performance of "by" function for random queries #1988

Closed
ufechner7 opened this issue Oct 19, 2019 · 5 comments · Fixed by #1992
Closed

Bad performance of "by" function for random queries #1988

ufechner7 opened this issue Oct 19, 2019 · 5 comments · Fixed by #1992

Comments

@ufechner7
Copy link

ufechner7 commented Oct 19, 2019

The following code has a bad performance due to recompilation for each call of the "by" function:

# Purpose of this test:
# Reproduce the bad performance of aggregate, when the number and the order of the named_tuple argument changes.

using DataFrames

# TODO: find an implementation that doesn't recompile when the number of columns or their order change
function aggregate(table, agg_clause, (named_tuple), prn=false)
    # println(named_tuple)
    result, t, bytes, gctime, memallocs = @timed by(table, agg_clause, named_tuple, sort=true)
    if prn
        println("time: $(round(t, digits=3))")
    end
    return result
end

# Testdata
ALL = Symbol[:CAT_UTRECHT_CNT, :CAT_MOUSE, :WW_OP, :QWE_GRP, :HWR_KT, :CAT_WEIGHT, :HIGH_GRP, :CAT_WEIGHT_C, :CAT_CMI_C, :CAT_MOUSE_C, :CAT_FREIG_WGT_CNT, :CAT_FREIG_WGT_CNT_C, :AMS_CMI_TOT, :CAT_UTRECHT_DIS_CNT, :AMS_CMI_HRV, 
:GVZ_WXK, :AMS_CMI_DIFF, :CAT_UTRECHT_CNT_C]

agg_clause = Symbol[:HIGH_GRP, :QWE_GRP, :WW_OP]
NAMED_TUPLE = (CAT_WEIGHT_C_T = :CAT_WEIGHT_C => maximum, CAT_UTRECHT_CNT_T = :CAT_UTRECHT_CNT => maximum, CAT_CMI_C_T = :CAT_CMI_C => maximum, 
                 CAT_MOUSE_C_T = :CAT_MOUSE_C => maximum, CAT_FREIG_WGT_CNT_T = :CAT_FREIG_WGT_CNT => maximum, 
                 CAT_FREIG_WGT_CNT_C_T = :CAT_FREIG_WGT_CNT_C => maximum, CAT_MOUSE_T = :CAT_MOUSE => maximum, CAT_WEIGHT_C = :CAT_WEIGHT_C => sum, 
                 CAT_UTRECHT_CNT_C = :CAT_UTRECHT_CNT_C => sum, WW_OP = :WW_OP => maximum, CAT_FREIG_WGT_CNT = :CAT_FREIG_WGT_CNT => sum, 
                 HWR_KT = :HWR_KT => maximum, CAT_WEIGHT = :CAT_WEIGHT => sum, CAT_MOUSE = :CAT_MOUSE => sum, AMS_CMI_TOT = :AMS_CMI_TOT => sum, 
                 CAT_FREIG_WGT_CNT_C = :CAT_FREIG_WGT_CNT_C => sum, CAT_UTRECHT_DIS_CNT = :CAT_UTRECHT_DIS_CNT => sum, CAT_CMI_C = :CAT_CMI_C => sum, 
                 CAT_MOUSE_C = :CAT_MOUSE_C => sum, AMS_CMI_HRV = :AMS_CMI_HRV => sum, GVZ_WXK = :GVZ_WXK => sum, 
                 CAT_UTRECHT_CNT = :CAT_UTRECHT_CNT => sum, AMS_CMI_DIFF = :AMS_CMI_DIFF => sum)

# add a column with random Float64 values and the name col to the table
function add_column(table, col::Symbol)
    COL = rand(Float64, nrow(table))
    table[!, col] = COL 
end

# add a column with random String values and the name col to the table
function add_string_column(table, col::Symbol)
    COL = String[]
    for i in 1:nrow(table)
        if i % 3 == 0 
            push!(COL, "HUND")
        elseif i % 3 == 1
            push!(COL, "HASE")
        elseif i % 3 == 2
            push!(COL, "IGEL")
        end
    end
    table[!, col] = COL 
end

# build a named tuple with the given length, using NAMED_TUPLE as input
function make_named_tuple(len)
    nt = NAMED_TUPLE
    @assert len <= length(nt)
    t_names = propertynames(nt)
    pairs = Vector{Pair}(undef, len)
    indices = []
    new_names = []
    for i in 1:len
        index = rand(setdiff(1:length(nt), indices))
        push!(indices, index)
        pairs[i] = values(nt)[index]
        push!(new_names, t_names[index])
    end
    return NamedTuple{Tuple(new_names)}(Tuple(pairs))
end

# create data frame with 30 rows
jtab = DataFrame([1:30])

# add columns, using random values
for col in ALL
    global jtab
    if col in agg_clause
        add_string_column(jtab, col)
    else
        add_column(jtab, col)
    end
end

# create random aggregates of this dataframe and determine the time
for i = 1:length(NAMED_TUPLE)
    nt = make_named_tuple(i)
    res = aggregate(jtab, agg_clause, nt, true)
end
nothing

Example output, second run:

julia> include("local_tests/test_aggregate2.jl")
time: 0.046
time: 0.054
time: 0.062
time: 0.057
time: 0.057
time: 0.061
time: 0.06
time: 0.063
time: 0.061
time: 0.066
time: 0.065
time: 0.074
time: 0.072
time: 0.075
time: 0.072
time: 0.069
time: 0.069
time: 0.071
time: 0.074
time: 0.075
time: 0.076
time: 0.078
time: 0.076

Original post: https://discourse.julialang.org/t/bad-performance-of-group-by-of-dataframes-updated/30061/

@ufechner7
Copy link
Author

ufechner7 commented Oct 19, 2019

The performance improves by a factor of 40 for queries that work on Float64 colums only, and by a factor of four for queries with mixed column types, when the following code is used:

# This function works fine if all colums are Float64, but not so good if it is a mix
# of String and Float64
function aggregate1(table, agg_clause, @nospecialize(named_tuple), prn=false)
    # println(named_tuple)
    x = []
    for (a,(b, c)) in pairs(named_tuple)
        push!(x, [a, b, c])
    end
    result, t, bytes, gctime, memallocs = @timed by(table, agg_clause, sort=true) do sdf
        r = DataFrame()
        for (a,b,c) in x
            r[!, a] = [c(sdf[!, b])]
        end
        r
    end
    if prn
        println("time: $(round(t, digits=3))")
    end
    return result
end

But first of all it is complicated, and secondly it doesn't work very well if the column types are mixed.

@globalchitnis
Copy link

When one has random queries, like I have come across (typical for generic applications) fast function would make a lot of difference!

@nalimilan
Copy link
Member

nalimilan commented Oct 21, 2019

Couldn't you just pass a vector of pairs instead of a named tuple, and rename columns afterwards? I.e. do return pairs in make_named_tuple.

It's too bad that names are part of the type of named tuples in this case. I precisely ensured that named tuples are not used internally so that recompilation isn't needed when the names change. But the public API for specifying column names indeed relies on named tuples. Maybe we could mitigate the problem by collecting them into a vector as soon as possible so that only a very small function needs to be recompiled. I'm not sure @nospecialize currently works with annotated arguments.

@nalimilan
Copy link
Member

@vtjnash told me that @nospecialize works with typed arguments now, and indeed by adding a small unspecialized method for NamedTuple times are much improved. See #1992.

@nalimilan
Copy link
Member

@ufechner7 Can you try with DataFrames master and see whether your problem is fixed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants