@quinnj:
Start Julia with multiple threads.
Run:
using DataFrames
using CSV
df = DataFrame(x = repeat(["a", "b"], 10^6), y = rand(2*10^6));
f1 = tempname();
CSV.write(f1, df);
df2 = CSV.read(f1, DataFrame, header = false, skipto = 1001, limit = 10000);
sort(df2, 1)
The problem is that both columns are SentinelArrays.ChainedVector, and in each of them the cumulative lengths of arrays do not match inds. Here is an example for Column2:
julia> df2.Column2.inds
4-element Vector{Int64}:
2853
5706
8559
10000
julia> cumsum(length.(df2.Column2.arrays))
4-element Vector{Int64}:
2853
5706
8559
9970
and you see that there are 30 elements missing.
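The check above can be wrapped in a small helper. This is only a sketch: chunk_mismatch is a made-up name, not part of SentinelArrays, and the chunks below are stand-ins that only reproduce the lengths reported above rather than the real column data.

```julia
# Hypothetical helper (not part of SentinelArrays): for each chunk, return the
# difference between the recorded cumulative end index and the cumulative
# length actually stored in the chunks.
function chunk_mismatch(inds::AbstractVector{<:Integer}, arrays::AbstractVector)
    actual = cumsum(length.(arrays))
    return inds .- actual
end

# Stand-in chunks with the same lengths as reported for df2.Column2
arrays = [zeros(2853), zeros(2853), zeros(2853), zeros(1411)]
inds = [2853, 5706, 8559, 10000]

mismatch = chunk_mismatch(inds, arrays)
println(mismatch)
```

On the real data this would be called as chunk_mismatch(df2.Column2.inds, df2.Column2.arrays); any non-zero entry means inds promises more (or fewer) elements than the chunks hold.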
If you reimplement getindex like this:
function Base.getindex(A::ChainedVector, i::Integer)
    chunk, ix = index(A, i)
    x = A.arrays[chunk][ix]
    return x
end
to enable bounds checking, you get a BoundsError when trying to work with df2.