In [1]:
using BenchmarkTools

using DataFrames

In [2]:
n = 10^2

ro_x_co_x_an = DataFrame(
    "In"=>rand(1:9, n),
    "Fl"=>rand(1.0:9, n),
    "Ch"=>rand('a':'z', n),
    "St4"=>[join(rand('a':'z', 4)) for _ in 1:n],
    "St8"=>[join(rand('a':'z', 8)) for _ in 1:n],
)

;

What is the best way to access one DataFrame column?

In [3]:
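# Everything in `$()` is evaluated before the trial, so the indexing itself is never timed, making this a bad benchmark.
@btime $(ro_x_co_x_an[!, "St4"])

# Without interpolation, the trial also pays for looking up the non-constant global variable.

@btime ro_x_co_x_an[!, "St4"]

# Interpolate the non-constant global variable so that the compiler knows its type.

@btime $ro_x_co_x_an[!, "St4"]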

# Surprisingly, `view` allocates and is slower.

@btime view($ro_x_co_x_an, !, "St4")

;

  2.084 ns (0 allocations: 0 bytes)
  103.435 ns (0 allocations: 0 bytes)
  75.222 ns (0 allocations: 0 bytes)
  187.467 ns (1 allocation: 48 bytes)


What is the best way to access two DataFrame columns and do something simple (`string`) with them?

In [5]:
# `zip` DataFrame.
function zi(da)
    
    for (a, b) in zip(da[!, "St4"], da[!, "St8"])
        
        string(a, b)
        
    end
    
end

# `eachrow` DataFrame.
function ea(da)
    
    for (a, b) in eachrow(da[!, ["St4", "St8"]])
        
        string(a, b)
        
    end
    
end

# The `zip` version allocates more but is faster.

@btime zi($ro_x_co_x_an)

@btime ea($ro_x_co_x_an)

;

  17.541 μs (604 allocations: 20.44 KiB)
  26.250 μs (320 allocations: 11.11 KiB)


In [6]:
# Make `Matrix` and `zip`.
function mzi(da)
    
    ma = Matrix(da)
    
    id_ = indexin(("St4", "St8"), names(da))
    
    for (a, b) in zip(ma[:, id_[1]], ma[:, id_[2]])
        
        string(a, b)
        
    end
    
end

# Make `Matrix` and `eachrow`.
function mea(da)
    
    ma = Matrix(da)
    
    id_ = indexin(("St4", "St8"), names(da))
    
    for (a, b) in eachrow(ma[:, id_])
        
        string(a, b)
        
    end
    
end

# Surprisingly, converting to a `Matrix` first is slower.

@btime mzi($ro_x_co_x_an)

@btime mea($ro_x_co_x_an)

;

  28.958 μs (924 allocations: 35.13 KiB)
  35.166 μs (1020 allocations: 35.05 KiB)
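
One likely contributor, offered as an assumption rather than a measured cause: columns of different types share no concrete element type, so a matrix built from them stores its elements as `Any`. A minimal Base-only sketch, where `hcat` stands in for `Matrix(da)` and the column values are hypothetical:

```julia
# Four short columns with the same element types as "In", "Fl", "Ch", and "St4".
# The values are made up for illustration.
in_ = [1, 2]
fl_ = [1.0, 2.0]
ch_ = ['a', 'b']
st_ = ["abcd", "efgh"]

# `hcat` stands in for `Matrix(da)`: it must pick one element type for all columns.
ma = hcat(in_, fl_, ch_, st_)

# No concrete type covers Int, Float64, Char, and String.
println(eltype(ma))

# Slicing a column back out keeps the abstract element type.
println(eltype(ma[:, 4]))
```

Both `eltype` calls report `Any`, so a loop over such slices cannot be specialized on the element type the way a loop over a `Vector{String}` column can.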


Does breaking the functions into smaller ones improve the performance?

In [7]:
# `Matrix`ing.
function maid(da)
    
    ma = Matrix(da)
    
    id_ = indexin(("St4", "St8"), names(da))
    
    ma, id_
    
end

@btime maid($ro_x_co_x_an)

;

  3.521 μs (117 allocations: 6.70 KiB)


In [8]:
# `zip` `Matrix`.
function mazi(ma, id_)
    
    for (a, b) in zip(ma[:, id_[1]], ma[:, id_[2]])
        
        string(a, b)
        
    end
    
end

# `eachrow` `Matrix`.
function maea(ma, id_)
    
    for (a, b) in eachrow(ma[:, id_])
        
        string(a, b)
        
    end
    
end

ma = Matrix(ro_x_co_x_an)

id_ = indexin(("St4", "St8"), names(ro_x_co_x_an))

@btime mazi($ma, $id_)

@btime maea($ma, $id_)

;

  2.028 μs (102 allocations: 4.88 KiB)
  2.009 μs (101 allocations: 4.89 KiB)


The sum of the parts was less than expected.
Does composing the functions from these parts improve the performance?

In [10]:
# Use `maid`.

function maid_mzi(da)
    
    ma, id_ = maid(da)
    
    for (a, b) in zip(ma[:, id_[1]], ma[:, id_[2]])
        
        string(a, b)
        
    end
    
end

function maid_mea(da)
    
    ma, id_ = maid(da)
    
    for (a, b) in eachrow(ma[:, id_])
        
        string(a, b)
        
    end
    
end

@btime maid_mzi($ro_x_co_x_an)

@btime maid_mea($ro_x_co_x_an)

;

  28.875 μs (924 allocations: 35.13 KiB)
  34.791 μs (1020 allocations: 35.05 KiB)


In [12]:
# Use `Matrix` `zip` and `eachrow`.

function _mazi(da)
    
    ma = Matrix(da)
    
    id_ = indexin(("St4", "St8"), names(da))
    
    mazi(ma, id_)
    
end

function _maea(da)
    
    ma = Matrix(da)
    
    id_ = indexin(("St4", "St8"), names(da))
    
    maea(ma, id_)
    
end

# Moving the loop into its own function and calling that function improved the performance!

@btime _mazi($ro_x_co_x_an)

@btime _maea($ro_x_co_x_an)

;

  5.639 μs (218 allocations: 11.54 KiB)
  5.681 μs (217 allocations: 11.55 KiB)


In [13]:
# Use `maid` with `Matrix` `zip` and `eachrow`.

function maid_maea(ro_x_co_x_an)
    
    ma, id_ = maid(ro_x_co_x_an)
    
    maea(ma, id_)
    
end

function maid_mazi(ro_x_co_x_an)
    
    ma, id_ = maid(ro_x_co_x_an)
    
    mazi(ma, id_)
    
end

@btime maid_maea($ro_x_co_x_an)

@btime maid_mazi($ro_x_co_x_an)

;

  5.667 μs (217 allocations: 11.55 KiB)
  5.646 μs (218 allocations: 11.54 KiB)


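The compiler specializes each function on the concrete types of its arguments, so code behind a function boundary can be optimized even when the caller cannot infer those types.
Use multiple smaller functions!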
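
This pattern is often called a function barrier. A minimal sketch using only Base, assuming a `Dict{String, Any}` as a stand-in for a container whose column types cannot be inferred from the container's type (the names `co`, `ke`, and `ba` are hypothetical and do not appear in the benchmarks above):

```julia
# A container whose values are typed `Any`, standing in for a DataFrame:
# the column types are not visible from the container type alone.
co = Dict{String, Any}(
    "St4" => [join(rand('a':'z', 4)) for _ in 1:100],
    "St8" => [join(rand('a':'z', 8)) for _ in 1:100],
)

# Kernel: compiled for the concrete argument types it receives (here two `Vector{String}`).
function ke(a_, b_)

    n = 0

    for (a, b) in zip(a_, b_)

        # Count the code units of each joined string so that the work is observable.
        n += ncodeunits(string(a, b))

    end

    n

end

# Barrier: the untyped lookup happens here, and the typed loop runs inside `ke`.
ba(co) = ke(co["St4"], co["St8"])

println(ba(co))
```

Each row joins a 4-letter and an 8-letter ASCII string, so the total is 12 code units times 100 rows. Timing `ba` against a version that loops directly over `co["St4"]` and `co["St8"]` in one function would be the analogous experiment to the cells above.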