Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve performance by 2x #64

Merged
merged 3 commits into from
Mar 11, 2019
Merged

Conversation

bjarthur
Copy link
Contributor

@bjarthur bjarthur commented Jan 23, 2019

a speed test on my test data set with this PR is 2.979 sec. on master it is 5.934 sec. with PyYAML it is 4.071 sec for the wrapper around the libyaml C code, or 24.03 sec for the pure python code.

# python test script
from yaml import load, dump
from yaml import CLoader as Loader, CDumper as Dumper
import time
fid=open('foo.yml','r')
t0=time.time(); load(fid, Loader=Loader); print(time.time()-t0)

worth noting that C code which implements a custom parser that is specific to this test set's document structure can read this file in 0.25 sec. so there is a lot of overhead for generality.

@codecov
Copy link

codecov bot commented Jan 23, 2019

Codecov Report

Merging #64 into master will increase coverage by 0.39%.
The diff coverage is 87.12%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #64      +/-   ##
==========================================
+ Coverage      75%   75.39%   +0.39%     
==========================================
  Files          11       11              
  Lines        1264     1268       +4     
==========================================
+ Hits          948      956       +8     
+ Misses        316      312       -4
Impacted Files Coverage Δ
src/resolver.jl 100% <ø> (ø) ⬆️
src/tokens.jl 95% <ø> (ø) ⬆️
src/events.jl 100% <ø> (ø) ⬆️
src/scanner.jl 74.73% <0%> (+0.3%) ⬆️
src/buffered_input.jl 95.65% <100%> (+4.74%) ⬆️
src/nodes.jl 100% <100%> (ø) ⬆️
src/constructor.jl 73.42% <42.85%> (ø) ⬆️
src/parser.jl 72.18% <86.53%> (-0.19%) ⬇️
src/composer.jl 92.1% <97.14%> (+1.82%) ⬆️
... and 2 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e323673...3d86736. Read the comment docs.

@kescobo
Copy link
Collaborator

kescobo commented Jan 23, 2019

Neat! Thanks for taking the time to do this. I noticed a bunch of the AbstractString type signatures were changed to String. Is that necessary? This can be kind of a pain if someone passes something like a SubString (not sure that would ever happen here) and I thought that it would be equally performant if only Strings are passed.

@bjarthur
Copy link
Contributor Author

the AbstractStrings in the type definitions slowed things down by about 10%. a huge chunk of time now is simply spent on memory allocation. but you're right, no need to restrict the user from passing in e.g. SubStrings to load. i've reverted src/YAML.jl but kept all the others.

@GunnarFarneback
Copy link
Contributor

I can confirm the speedup on my small document from 25 ms to 11 ms. However, this is a bit breaking for me since I'm doing things like

# Set up YAML loading to use OrderedDict instead of regular Dict.
function construct_ordered_mapping(constructor::YAML.Constructor,
                                   node::YAML.MappingNode; deep=false)
    YAML.flatten_mapping(node)
    mapping = OrderedDict{String, Any}()
    for (key_node, value_node) in node.value
        key = YAML.construct_object(constructor, key_node, deep=deep)
        value = YAML.construct_object(constructor, value_node, deep=deep)
        mapping[key] = value
    end
    return mapping
end

which is passed as a constructor to load_file. The breaking part is that this PR changes the named argument deep of construct_object to a positional argument. Not a big deal to fix but is it a performance improvement to avoid named arguments anymore? E.g.

julia> using BenchmarkTools

julia> f(x, y = "") = split(x, y)
f (generic function with 2 methods)

julia> g(x; y = "") = split(x, y)
g (generic function with 1 method)

julia> @btime f("abcabcabc", "b");
  232.720 ns (6 allocations: 256 bytes)

julia> @btime g("abcabcabc", y = "b");
  229.867 ns (6 allocations: 256 bytes)

(This should not be seen as an indication that named arguments are faster. The differences are within measurement noise.)

@bjarthur
Copy link
Contributor Author

the main reason for changing to positional arguments was that code_warntype does not work with keyword arguments, and so it was difficult to look for type instabilities. compare code_warntype(YAML.construct_object, (YAML.Constructor, YAML.Node, Bool)) on master and with this PR.

i could revert this change if you like, but i'd suggest keeping it as not all type instabilities have been eradicated.

in my hands the switch is also roughly 10% faster, but that could be within the noise as you say.

@GunnarFarneback
Copy link
Contributor

It's no big deal for me to adapt my code, so if the change does add some value I'm fine with it.

@GunnarFarneback
Copy link
Contributor

It would be nice to get this moving, and a new release of the package.

@bjarthur
Copy link
Contributor Author

bjarthur commented Mar 8, 2019

is there anything i need to do on my end? i've been using this PR in production for the last month and it has been fine.

@kescobo
Copy link
Collaborator

kescobo commented Mar 8, 2019

This looks good to me, but I'm not an official maintainer here. Actually, I don't see a maintainer - @benjward @bicycle1885 do you know who's on it?

@TransGirlCodes TransGirlCodes merged commit 3a808eb into JuliaData:master Mar 11, 2019
@ScottPJones
Copy link

Instead of converting those structs to use a concrete type, String, why not parameterize them instead, so that this package could be used with other (faster 😁 ) string types?
Also, many fields are Union{Nothing, String}. For any of those fields, is an empty string valid?
If not, instead of storing a nothing, store a constant "", which can be checked for directly with ===.

@bjarthur bjarthur deleted the bja/perf branch March 11, 2019 15:24
@GunnarFarneback
Copy link
Contributor

Thanks for merging this. Any chance to get a new release as well?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants