
Large xml file seems to not be "streaming", eating GBs of RAM #29

Open
jakeonrails opened this issue Feb 8, 2012 · 22 comments

Comments

@jakeonrails

I have a 1.6 GB XML file, and when I parse it with SAX Machine it does not seem to be streaming or eating the file in chunks. Rather, it appears to be loading the whole file into memory (or maybe there is a memory leak somewhere?), because my Ruby process climbs upwards of 2.5 GB of RAM. I don't know where it stops growing, because I ran out of memory.

On a smaller file (50 MB) it also appears to load the whole file. My task iterates over the records in the XML file and saves each record to a database. It takes about 30 seconds of "idling", and then all of a sudden the database queries start executing.

I thought SAX was supposed to allow you to work with large files like this without loading the whole thing in memory.

Is there something I am overlooking?

Many thanks,

@jakeonrails
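For context, the behavior being asked for here, true SAX-style streaming, keeps memory flat because the parser fires callbacks as elements stream past instead of building a document tree. A minimal sketch using only Ruby's bundled REXML stream parser (not sax-machine itself; the `record` element name is illustrative):

```ruby
require 'rexml/parsers/streamparser'
require 'rexml/streamlistener'
require 'stringio'

# A listener that counts <record> elements as they stream past,
# without ever building an in-memory document tree.
class RecordCounter
  include REXML::StreamListener
  attr_reader :count

  def initialize
    @count = 0
  end

  def tag_start(name, _attrs)
    @count += 1 if name == 'record'
  end
end

# Any IO works here; a File opened on a multi-gigabyte document
# would be read incrementally in exactly the same way.
io = StringIO.new('<records><record/><record/></records>')
listener = RecordCounter.new
REXML::Parsers::StreamParser.new(io, listener).parse
puts listener.count  # => 2
```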

@ezkl
Collaborator

ezkl commented May 29, 2012

@jakeonrails If this issue is still on your radar, can you see if the same problem occurs with current master? The behavior you describe does sound like a bug, but before digging deeper, I'd like to verify that the issue wasn't resolved by upgrading Nokogiri to v1.5.

@jakeonrails
Author

@ezkl I can't take time to test this right now, but I will try to test it in the next couple of days. I hope it does work now, since parsing with sax-machine was a lot cleaner than what I resorted to a few months ago: a monkey-patched Nokogiri Reader that parses out the chunk of XML for the node I want and passes it to sax-machine.

@ezkl
Collaborator

ezkl commented May 30, 2012

@jakeonrails What method are you using to load the XML file?

@jakeonrails
Author

@ezkl I ended up using this technique here: http://stackoverflow.com/a/9223767/586983

You can see my original code which spurred me to write this github issue at the top of that SO question.

@ezkl
Collaborator

ezkl commented May 30, 2012

@jakeonrails Thanks for the link and background info. I had a bit of a brain fart yesterday. Don't bother testing against HEAD at the moment. A streaming interface was implemented by @gregwebs, but his work was never merged into master (see: #18 and #24). I've been using a fork that includes Greg's work in production without issue for nearly a year, but never with XML files quite as large as yours. Once I've finished merging Greg's work, I'd love to get your feedback on performance with large files.

@gregwebs

I have been using my fork on files of about that size in production.

@speedmax

speedmax commented Jun 1, 2012

We need this large-file support; Heroku 512 MB workers really struggle with large XML parsing.

+1 on the merge of #18 / #24

@gregwebs

gregwebs commented Jun 1, 2012

So now we have two issues open; we should probably close one.

@chtrinh

chtrinh commented Jun 6, 2013

Has this been merged?

@mrjcleaver

+1
Though - is the merge both straightforward and non-controversial?

@mrjcleaver

I suppose another question might be: would pauldix prefer that gregwebs be made the new maintainer of the Ruby gem? It's a bit confusing having multiple versions.

@gregwebs

gregwebs commented Aug 3, 2013

I will not be a maintainer, but @ezkl might.

@mrjcleaver

@ezkl? Would you be willing & able? @pauldix - what's your preference? Thx, M.

@jmehnle

jmehnle commented Sep 23, 2013

What's the status of this? What can I do to help with getting this merged from @gregwebs's branch? @ezkl?

@pauldix
Owner

pauldix commented Sep 25, 2013

If someone submits a PR I'll merge it in.

@eljojo

eljojo commented Nov 28, 2013

Hi, I opened PR #47; I think it should solve the problem.
Feedback is very much appreciated.

@gregwebs

That solves a different use case than the one I had. My branch allows for giant streaming collections.

@jrochkind

From my point of view, the main point of using the SAX interface is streaming, rather than reading everything into memory at once. Does the current sax-machine release support any kind of streaming? No, I think not. I'm curious what uses people have for SAX without streaming, but that's another topic, I guess.

@doomspork

Has this issue been abandoned?

@krasnoukhov
Collaborator

It looks like all the effort made by @ezkl and @gregwebs has fallen way behind current master, so it's not possible to review/merge those changes.

I don't feel like streaming features will be added to sax-machine in the near future, unless someone is willing to reimplement/port that work. So it basically stays usable for small XML files, especially RSS/Atom feeds via feedjira.

For streaming, I'd suggest considering Nokogiri SAX or Ox SAX.

@torbiak

torbiak commented Jan 9, 2015

I'm using sax-machine-1.2.0 and nokogiri-1.6.3, parsing a 1 GB XML file by passing an IO object to a SAXMachine parser, and it appears to be streaming, seeing as the virtual memory usage of the process doesn't go above 100 MB.

```ruby
open('/huge_soap.xml') { |f| MySAXMachine.parse(f) }
```

The mixin passes xml_text to the handler, and the Nokogiri handler passes it directly to the backend parser, so I think that as long as Nokogiri's SAX parser streams the given IO object (and it appears to), it's all good.

I didn't test with Ox or Oga, but it looks like the Ox handler expects xml_text to be a string and can't currently stream from an IO. I don't see an obvious reason the StringIO wrapping of xml_text couldn't be made conditional, so that an IO could be passed to SAXMachine#parse when using Ox as the backend parser.
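The conditional wrapping being suggested amounts to a few lines. This is a sketch of the idea, not the actual handler code, and `to_io` is an illustrative name:

```ruby
require 'stringio'

# Pass IO-like objects straight through so the backend parser can stream
# them; wrap plain strings in a StringIO only when necessary.
def to_io(xml_input)
  xml_input.respond_to?(:read) ? xml_input : StringIO.new(xml_input)
end

to_io('<a/>')            # String -> wrapped in a new StringIO
f = StringIO.new('<a/>')
to_io(f).equal?(f)       # IO-like objects pass through untouched
```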

krasnoukhov added a commit that referenced this issue Jan 11, 2015
@krasnoukhov
Collaborator

@torbiak Thanks for looking into this. Regarding IO parsing, you're totally correct; I've changed the Ox handler to support both String and IO. Please see the attached commit.

I'm wondering why you're getting such good memory-footprint results. Can I see the full example you're running? I'm thinking about putting some benchmarks together, and your example could help.
