reduce gc load in readdlm #10465

Merged
merged 1 commit into JuliaLang:master on Mar 11, 2015

@tanmaykm (Member) commented Mar 10, 2015

Using direct `ccall` wherever possible instead of creating a `SubString` for every column during parsing.
This brings `readdlm` performance closer to what it is with gc disabled.
ref: #10428

julia> @time a = readdlm("try7nonuls", '\t', dims=(10^7,46));
elapsed time: 206.317901076 seconds (17021 MB allocated, 45.34% gc time in 12 pauses with 9 full sweep)

julia> a=nothing; @time gc()
elapsed time: 75.903634913 seconds (96 bytes allocated, 100.00% gc time in 2 pauses with 1 full sweep)

julia> # with gc disabled
       gc_disable()
true

julia> @time a = readdlm("try7nonuls", '\t', dims=(10^7,46));
elapsed time: 116.272807009 seconds (17021 MB allocated)

julia> gc_enable(); a=nothing; @time gc();
elapsed time: 76.882703601 seconds (80 bytes allocated, 100.00% gc time in 2 pauses with 1 full sweep)
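
The core change can be sketched roughly as follows. This is a hedged illustration, not the merged diff: `parse_field` is a hypothetical helper, and the `jl_substrtod` signature shown is an assumption about how the runtime exposes it.

```julia
# Hypothetical sketch: parse a float field directly out of the shared byte
# buffer via ccall, instead of materializing a SubString for every column.
function parse_field(buf::Vector{UInt8}, offset::Int, len::Int, out::Vector{Float64})
    # `out` is a one-element buffer the caller allocates once and reuses,
    # so a successful parse allocates nothing per cell.
    err = ccall(:jl_substrtod, Int32,
                (Ptr{UInt8}, Csize_t, Csize_t, Ptr{Float64}),
                buf, offset, len, out)
    err == 0 ? out[1] : nothing   # caller falls back to string handling
end
```

Previously each numeric cell produced a `SubString` that immediately became garbage; with this pattern only the bytes already held by the input are touched.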
@jiahao (Member) commented Mar 10, 2015

99% reduction in execution time woot

@ViralBShah (Member) commented Mar 10, 2015

How far are we from pandas now?

Is this the kind of thing our GC should be able to do better, or that our compiler should handle with automatic insertion of free statements?
Cc: @carnaval

@jiahao (Member) commented Mar 10, 2015

This change would bring us to 2.2x slower than pandas and 4.3x slower than R's data.table.

I'm guessing that there's more garbage reduction to be had.

@jiahao (Member) commented Mar 10, 2015

I have to say, though, that the timings can't be directly compared with the numbers I posted in the earlier issue. The current numbers are with the dimensions prespecified, which in my testing cut the execution time by 40%. Possibly the margin is smaller here.

Maybe we should have a "big data" mode that determines the data dimensions by shelling out to wc and head | cut, since that takes less than 2 seconds for files of this size ;)

Nonetheless, big speedup. Thanks @tanmaykm!
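
The wc / head probe above can be sketched as a quick shell one-off. The file name is illustrative, a two-line file stands in for the 10^7-row one, and `tr` is used to split the header into one line per column:

```shell
# Probe data dimensions cheaply before parsing: row count via wc,
# column count from the first line (tr turns tabs into newlines).
printf '1\t2\t3\n4\t5\t6\n' > /tmp/dims_demo.tsv
rows=$(wc -l < /tmp/dims_demo.tsv)
cols=$(head -n 1 /tmp/dims_demo.tsv | tr '\t' '\n' | wc -l)
echo "dims: $rows x $cols"
```

Both commands stream the file without building any per-field objects, which is why the probe stays cheap even at this scale.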

@nalimilan (Contributor) commented Mar 10, 2015

> Maybe we should have a "big data" mode that determines the data dimensions by shelling out to wc and head | cut, since that takes less than 2 seconds for files of this size ;)

I understand this was mostly a joke, but there could be an argument to trigger counting the number of rows by going over the whole file once (in pure Julia code, of course), and assuming that the number of columns is that of the first line. If that's more efficient in practice, it could even be the default.

@jiahao (Member) commented Mar 10, 2015

shell> cat > try.tsv #Has Evil Unicode and NUL character at the end of the first line
3   .14159  is  π  でしょう    
2   .71828  is  e   maybe   junk
^D
julia> readdlm("try.tsv")
2x6 Array{Any,2}:
 3.0  0.14159  "is"  "π"  "でしょう"   ""    
 2.0  0.71828  "is"  "e"  "maybe"  "junk"

👍

@jiahao (Member) commented Mar 10, 2015

shell> cat > try.tsv
1Doe    a   deer    ,   a       female deer
2Ray    a   drop    of  golden  sun
3Mi a   name    I   call    myself
^D
julia> readdlm("try.tsv")
3x7 Array{Any,2}:
 "1Doe"  "a"  "deer"  ","   "a"       "female"  "deer"
 "2Ray"  "a"  "drop"  "of"  "golden"  "sun"     ""    
 "3Mi"   "a"  "name"  "I"   "call"    "myself"  ""    

shell> cat try.tsv #Has NUL character in the last field
5.uper  cali    fragi   listic  expe    a   lidocious
^D
julia> readdlm("try.tsv")
1x7 Array{Any,2}:
 "5.uper"  "cali"  "fragi"  "listic"  "expe"  "a"  "lidoc\0ious"

👍

@jiahao (Member) commented Mar 10, 2015

So a NUL character by itself in its own field is not parsed into "\0", but rather "". A NUL character is correctly read into a string if it occurs partway through the content (see the π input). I think that's the only edge case I've found so far.

shell> cat > try.tsv #Has NUL character and invisible space U+200B
3.141592α​β
^D
julia> readdlm("try.tsv", '\0')
1x2 Array{Any,2}:
 3.141  "592α\u200bβ"

👍

@JeffBezanson (Member) commented Mar 10, 2015

Thanks @tanmaykm, this is much needed!

It would be good to add some simple wrappers for substrtod and memcmp, to make the code less repetitive and more readable.

We should absolutely try very hard to guess the result size. Counting lines is very cheap. When we guess right, there will be a big improvement. If we guess wrong, it won't really be worse than it is now.

Hopefully some GC knobs can be tuned to better handle the case of a rapidly growing live heap.
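
The wrappers Jeff suggests might look something like this. This is a hedged sketch: the names and the `jl_substrtod` argument types are assumptions, not the code that was ultimately merged.

```julia
# Thin wrappers so parser call sites stay readable instead of repeating
# multi-line raw ccalls.  jl_substrtod's signature is assumed here.
substrtod!(buf::Vector{UInt8}, offset::Int, len::Int, out::Vector{Float64}) =
    0 == ccall(:jl_substrtod, Int32,
               (Ptr{UInt8}, Csize_t, Csize_t, Ptr{Float64}),
               buf, offset, len, out)

memcmp(a::Ptr{UInt8}, b::Ptr{UInt8}, n::Int) =
    ccall(:memcmp, Int32, (Ptr{UInt8}, Ptr{UInt8}, Csize_t), a, b, n)
```

A hot loop can then read `substrtod!(buf, off, len, tmp) || fallback(...)` rather than an inline four-argument `ccall` at every site.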

@hayd (Member) commented Mar 10, 2015

Wes's blog post on memory-efficient/fast read_csv in pandas suggests some csv files to benchmark (a lot of work has gone into pandas' read_csv and R data.table's fread since then).

@tanmaykm (Member) commented Mar 10, 2015

Yes, thanks for the tips. I think I'll try something along these lines:

  • guess the number of columns and some of their attributes using the first few rows' worth of data
  • guess the number of rows by making one pass over the file if the file size is within acceptable limits, else through extrapolation
  • if a guess turns out wrong, reallocate, copy over the already-processed data, and continue
  • process the file in manageable chunks
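
The reallocate-and-copy step in that plan could be sketched like this (a hypothetical helper, assuming a dense matrix result such as `readdlm` produces):

```julia
# If the row-count guess was too low, allocate a bigger matrix, copy the
# already-parsed cells over, and let the caller keep parsing into it.
function grow_rows(cells::Matrix, nrows_done::Int, needed::Int)
    nrows, ncols = size(cells)
    needed <= nrows && return cells
    bigger = similar(cells, max(needed, 2 * nrows), ncols)
    for j in 1:ncols, i in 1:nrows_done
        bigger[i, j] = cells[i, j]
    end
    bigger
end
```

Doubling on each miss keeps the total copying cost amortized-linear even when the initial guess is far off.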
@jiahao (Member) commented Mar 11, 2015

Let's merge this for now so that we can start eliminating the second pass through the data.

jiahao added a commit that referenced this pull request Mar 11, 2015:
Merge pull request #10465 from tanmaykm/readcsvopt: reduce gc load in readdlm

jiahao merged commit adb9095 into JuliaLang:master Mar 11, 2015

2 checks passed: continuous-integration/appveyor (AppVeyor build succeeded), continuous-integration/travis-ci/pr (The Travis CI build passed)
@quinnj (Member) commented Mar 24, 2015

Can this be backported? It'd be great to have these speedups in 0.3.

@tkelman (Contributor) commented Mar 25, 2015

There's some 0.4-specific syntax here. I can't tell whether anything else would be 0.4-specific (the surrounding code may be quite different due to other intervening PRs), but we'd definitely want to redo the performance comparison for any potential backport, considering the GC is completely different.

@Ken-B (Contributor) commented Apr 4, 2015

@tanmaykm You can read more about some details of R data.table's fread, with a lot of references, here:
http://www.inside-r.org/packages/cran/data.table/docs/fread

E.g. "The first 5 rows, middle 5 rows and last 5 rows are then read to determine column types" and "There are no buffers used in fread's C code at all."

@pao (Member) commented Apr 4, 2015

@Ken-B your comment looks clipped.

@Ken-B (Contributor) commented Apr 5, 2015

@pao I was actually finished after the quote, but thanks anyway for the poke. I was just trying to add some inspiration to get closer to R data.table's excellent fread function.
