Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mutable DataFrameRow for view method that takes a function arg? #1352

Closed
JockLawrie opened this issue Feb 9, 2018 · 4 comments
Closed

Mutable DataFrameRow for view method that takes a function arg? #1352

JockLawrie opened this issue Feb 9, 2018 · 4 comments

Comments

@JockLawrie
Copy link

Hi there,

More of a question than an issue.

Often I need to filter a DataFrame based on several conditions.
To make this a little easier I've hacked together a view method that takes a function from a DataFrame row to Bool:

view(tbl::DataFrame, f::Function)

Then to use it (for example):

dt1 = Date(2017,1,1)
dt2 = Date(2017,12,31)
v = view(tbl, (r) -> !ismissing(r[:x]) && r[:date] >= dt1 && r[:date] <= dt2)

To get this happening I made a type similar to DataFrameRow, but mutable:

mutable struct dfrow{T <: AbstractDataFrame}  # Mutable version of DataFrameRow
    df::T
    rowidx::Int
end

function Base.getindex(r::dfrow, colname::Symbol)
    r.df[r.rowidx, colname]
end

function view(tbl::AbstractDataFrame, f::Function)  # view(tbl, rows where f(row) == true)
    n    = size(tbl, 1)
    inds = fill(false, n)
    r    = dfrow(tbl, 0)
    for i = 1:n
        r.rowidx += 1
        inds[i] = f(r)
    end
    view(tbl, inds)
end

(I should make dfrow follow the iterator protocol)

I have found this to be faster than using the immutable DataFrameRow.

My questions are:

  1. Is there a reason that DataFrameRows are immutable?
  2. Can they be made mutable so that the hack above is unnecessary?

Thanks,
Jock

@nalimilan
Copy link
Member

DataFrameRow is immutable, but you can perfectly mutate its fields using r[:col] = .... Have you tried that? Making it mutable would only allow changing its parent data frame or its row index, which doesn't sound useful.

@JockLawrie
Copy link
Author

Indeed you are right.
On further testing, I see no consistent speed difference between the mutable and immutable versions.
Testing below.
Also note the slow down when using the eachrow function in view6, compared to a simple loop in view4.

#=
  Contents: Comparison of implementations of view(tbl::DataFrame, f::Function), where f: row -> Bool.

  Result: Methods 4 and 5 seem equally fastest. Choose method 4 because it uses the existing DataFrameRow type, whereas method 5 requires a new type.
=#


using DataFrames

mutable struct dfrow{T <: AbstractDataFrame}  # Mutable version of DataFrameRow
    df::T
    rowidx::Int
end

function Base.getindex(r::dfrow, colname::Symbol)
    r.df[r.rowidx, colname]
end

function view1(tbl::AbstractDataFrame, f::Function)
    n    = size(tbl, 1)
    inds = fill(false, n)
    r    = dfrow(tbl, 0)
    for i = 1:n
        r.rowidx += 1
        inds[i] = f(r)
    end
    view(tbl, inds)
end


function view2(tbl::AbstractDataFrame, f::Function)
    inds = [f(r) for r in eachrow(tbl)]
    view(tbl, inds)
end


function view3(tbl::AbstractDataFrame, f::Function)
    n    = size(tbl, 1)
    inds = fill(false, n)
    for i = 1:n
        inds[i] = f(DataFrameRow(tbl, i))
    end
    view(tbl, inds)
end

function view4(tbl::AbstractDataFrame, f::Function)
    n    = size(tbl, 1)
    inds = Int[]
    for i = 1:n
        !f(DataFrameRow(tbl, i)) && continue
        push!(inds, i)
    end
    view(tbl, inds)
end

function view5(tbl::AbstractDataFrame, f::Function)
    n    = size(tbl, 1)
    inds = Int[]
    r    = dfrow(tbl, 0)
    for i = 1:n
        r.rowidx += 1
        !f(r) && continue
        push!(inds, i)
    end
    view(tbl, inds)
end

function view6(tbl::AbstractDataFrame, f::Function)
    inds = Int[]
    for r in eachrow(tbl)
        !f(r) && continue
        push!(inds, r.row)
    end
    view(tbl, inds)
end

tbl = DataFrame(a = [i for i = 1:10_000], b = rand(10_000));

# Compile
v1 = view1(tbl, (r) -> r[:a] <= 5_000);
v2 = view2(tbl, (r) -> r[:a] <= 5_000);
v3 = view3(tbl, (r) -> r[:a] <= 5_000);
v4 = view4(tbl, (r) -> r[:a] <= 5_000);
v5 = view5(tbl, (r) -> r[:a] <= 5_000);
v6 = view6(tbl, (r) -> r[:a] <= 5_000);

# Wrap timings in function
function test1(tbl)
    @time view1(tbl, (r) -> r[:a] <= 5_000);
    @time view2(tbl, (r) -> r[:a] <= 5_000);
    @time view3(tbl, (r) -> r[:a] <= 5_000);
    @time view4(tbl, (r) -> r[:a] <= 5_000);
    @time view5(tbl, (r) -> r[:a] <= 5_000);
    @time view6(tbl, (r) -> r[:a] <= 5_000);
    ""
end

for i = 1:10
    println("Trial $i #################################################")
    test1(tbl)
end

@nalimilan
Copy link
Member

The performance difference between these approaches is negligible compared to the impact of type-instability of the columns: view(tbl, tbl[:a] .<= 5000) is about 20× faster. If you care about performance, you should make your custom view function pass a NamedTuple of columns to a helper function which will contain the loop. We should adopt this approach for the whole code base, but it's tricky (#1335).

Closing since there's nothing to do in DataFrames for this particular issue.

@JockLawrie
Copy link
Author

OK great.
Thanks, much appreciated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants