Mutable DataFrameRow for view method that takes a function arg? #1352

JockLawrie · 2018-02-09T10:06:54Z

Hi there,

More of a question than an issue.

Often I need to filter a DataFrame based on several conditions.
To make this a little easier I've hacked together a view method that takes a function from a DataFrame row to Bool:

view(tbl::DataFrame, f::Function)

Then to use it (for example):

dt1 = Date(2017,1,1)
dt2 = Date(2017,12,31)
v = view(tbl, (r) -> !ismissing(r[:x]) && r[:date] >= dt1 && r[:date] <= dt2)

To get this happening I made a type similar to DataFrameRow, but mutable:

mutable struct dfrow{T <: AbstractDataFrame}  # Mutable version of DataFrameRow
    df::T
    rowidx::Int
end

function Base.getindex(r::dfrow, colname::Symbol)
    r.df[r.rowidx, colname]
end

function view(tbl::AbstractDataFrame, f::Function)  # view(tbl, rows where f(row) == true)
    n    = size(tbl, 1)
    inds = fill(false, n)
    r    = dfrow(tbl, 0)
    for i = 1:n
        r.rowidx += 1
        inds[i] = f(r)
    end
    view(tbl, inds)
end

(I should make dfrow follow the iterator protocol)

I have found this to be faster than using the immutable DataFrameRow.

My questions are:

Is there a reason that DataFrameRows are immutable?
Can they be made mutable so that the hack above is unnecessary?

Thanks,
Jock

The text was updated successfully, but these errors were encountered:

nalimilan · 2018-02-09T12:31:51Z

DataFrameRow is immutable, but you can perfectly mutate its fields using r[:col] = .... Have you tried that? Making it mutable would only allow changing its parent data frame or its row index, which doesn't sound useful.

JockLawrie · 2018-02-09T22:45:23Z

Indeed you are right.
On further testing, I see no consistent speed difference between the mutable and immutable versions.
Testing below.
Also note the slow down when using the eachrow function in view6, compared to a simple loop in view4.

#=
  Contents: Comparison of implementations of view(tbl::DataFrame, f::Function), where f: row -> Bool.

  Result: Methods 4 and 5 seem equally fastest. Choose method 4 because it uses the existing DataFrameRow type, whereas method 5 requires a new type.
=#


using DataFrames

mutable struct dfrow{T <: AbstractDataFrame}  # Mutable version of DataFrameRow
    df::T
    rowidx::Int
end

function Base.getindex(r::dfrow, colname::Symbol)
    r.df[r.rowidx, colname]
end

function view1(tbl::AbstractDataFrame, f::Function)
    n    = size(tbl, 1)
    inds = fill(false, n)
    r    = dfrow(tbl, 0)
    for i = 1:n
        r.rowidx += 1
        inds[i] = f(r)
    end
    view(tbl, inds)
end


function view2(tbl::AbstractDataFrame, f::Function)
    inds = [f(r) for r in eachrow(tbl)]
    view(tbl, inds)
end


function view3(tbl::AbstractDataFrame, f::Function)
    n    = size(tbl, 1)
    inds = fill(false, n)
    for i = 1:n
        inds[i] = f(DataFrameRow(tbl, i))
    end
    view(tbl, inds)
end

function view4(tbl::AbstractDataFrame, f::Function)
    n    = size(tbl, 1)
    inds = Int[]
    for i = 1:n
        !f(DataFrameRow(tbl, i)) && continue
        push!(inds, i)
    end
    view(tbl, inds)
end

function view5(tbl::AbstractDataFrame, f::Function)
    n    = size(tbl, 1)
    inds = Int[]
    r    = dfrow(tbl, 0)
    for i = 1:n
        r.rowidx += 1
        !f(r) && continue
        push!(inds, i)
    end
    view(tbl, inds)
end

function view6(tbl::AbstractDataFrame, f::Function)
    inds = Int[]
    for r in eachrow(tbl)
        !f(r) && continue
        push!(inds, r.row)
    end
    view(tbl, inds)
end

tbl = DataFrame(a = [i for i = 1:10_000], b = rand(10_000));

# Compile
v1 = view1(tbl, (r) -> r[:a] <= 5_000);
v2 = view2(tbl, (r) -> r[:a] <= 5_000);
v3 = view3(tbl, (r) -> r[:a] <= 5_000);
v4 = view4(tbl, (r) -> r[:a] <= 5_000);
v5 = view5(tbl, (r) -> r[:a] <= 5_000);
v6 = view6(tbl, (r) -> r[:a] <= 5_000);

# Wrap timings in function
function test1(tbl)
    @time view1(tbl, (r) -> r[:a] <= 5_000);
    @time view2(tbl, (r) -> r[:a] <= 5_000);
    @time view3(tbl, (r) -> r[:a] <= 5_000);
    @time view4(tbl, (r) -> r[:a] <= 5_000);
    @time view5(tbl, (r) -> r[:a] <= 5_000);
    @time view6(tbl, (r) -> r[:a] <= 5_000);
    ""
end

for i = 1:10
    println("Trial $i #################################################")
    test1(tbl)
end

nalimilan · 2018-02-10T12:09:48Z

The performance difference between these approaches is negligible compared to the impact of type-instability of the columns: view(tbl, tbl[:a] .<= 5000) is about 20× faster. If you care about performance, you should make your custom view function pass a NamedTuple of columns to a helper function which will contain the loop. We should adopt this approach for the whole code base, but it's tricky (#1335).

Closing since there's nothing to do in DataFrames for this particular issue.

JockLawrie · 2018-02-11T22:46:04Z

OK great.
Thanks, much appreciated.

nalimilan closed this as completed Feb 10, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Mutable DataFrameRow for view method that takes a function arg? #1352

Mutable DataFrameRow for view method that takes a function arg? #1352

JockLawrie commented Feb 9, 2018

nalimilan commented Feb 9, 2018

JockLawrie commented Feb 9, 2018

nalimilan commented Feb 10, 2018

JockLawrie commented Feb 11, 2018

Mutable DataFrameRow for view method that takes a function arg? #1352

Mutable DataFrameRow for view method that takes a function arg? #1352

Comments

JockLawrie commented Feb 9, 2018

nalimilan commented Feb 9, 2018

JockLawrie commented Feb 9, 2018

nalimilan commented Feb 10, 2018

JockLawrie commented Feb 11, 2018