This repository was archived by the owner on Mar 1, 2024. It is now read-only.
scanf equivalent #21
Closed
Description
It would be nice to have a [sf]scanf equivalent for quickly parsing large tabulated text files with simple but ad-hoc formatting. I wasn't sure where to post an issue about this, let me know if there's a better place.
A current example on my mind is ftp://ftp.ga.gov.au/geodesy-outgoing/gravity/ausgeoid/AUSGeoid09_V1.01/AUSGeoid09_GDA94_V1.01_DOV.zip
AUSGeoid09_GDA94_V1.01 www.ga.gov.au
GEO 14.595 S 8 0 0.000 E108 0 0.000 -29.88 -26.83
GEO 14.833 S 8 0 0.000 E108 1 0.000 -36.93 -20.99
GEO 14.968 S 8 0 0.000 E108 2 0.000 -36.59 -15.21
... actual data isn't important, but there's 7 million rows of it.
I made a little prototype, including a string macro, with the aim of making the following work:
values = read(io, scanf"GEO %f %c %d %d %f %c%d %d %f %f %f")
Gist here: https://gist.github.com/c42f/9999dc6f9b63a9bd4ea4237a95876475
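For reference, here is a minimal sketch of the tuple that format string is meant to produce for one record, written by hand in plain Julia. The name `parse_geo_line` is hypothetical and is not taken from the gist; it assumes every record has exactly the eleven whitespace-separated tokens seen in the sample rows above.

```julia
# Hypothetical helper (not from the gist): parse one "GEO" record by hand,
# mirroring the format "GEO %f %c %d %d %f %c%d %d %f %f %f".
function parse_geo_line(line::AbstractString)
    tok = split(line)  # split on runs of whitespace
    (length(tok) == 11 && tok[1] == "GEO") || error("unexpected record: $line")
    return (parse(Float64, tok[2]),       # %f
            tok[3][1],                    # %c  hemisphere letter, e.g. 'S'
            parse(Int, tok[4]),           # %d
            parse(Int, tok[5]),           # %d
            parse(Float64, tok[6]),       # %f
            tok[7][1],                    # %c  hemisphere letter, e.g. 'E'
            parse(Int, tok[7][2:end]),    # %d  fused to the letter, e.g. "E108"
            parse(Int, tok[8]),           # %d
            parse(Float64, tok[9]),       # %f
            parse(Float64, tok[10]),      # %f
            parse(Float64, tok[11]))      # %f
end
```

This kind of per-file boilerplate, including the special case for the fused `E108` token, is what a scanf-style format string would replace.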
The prototype just ccalls sscanf directly, but I hit a few nasty things and thought I'd record the problems here in case anyone has feedback.
- It would be nice if this worked with IO objects for streaming. However, in julia streams are either unbuffered or internally buffered, with no way to get at the buffer object (if it even exists) or put back a character. This makes efficient text parsing from a stream without overreading impossible. Would need something like https://github.com/BioJulia/BufferedStreams.jl instead of a plain IO (CC @bicycle1885). For the moment, I worked around the problem for this particular input file using readline() but that's far from ideal.
- Related to the above, IO objects cannot be used with fscanf since they have no internal FILE data structure at the C level (unsurprising).
- sscanf("%s", ...) seems to be completely memory unsafe, with no way to work around it without disallowing it, implementing this conversion specifier on the julia side or relying on the posix "%ms" extension.
- The format width specifiers ("%lf" for Float64? ugh.) are C-isms which are a little nasty to expose at the julia level, but make sense for compatibility.
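On the last point, one possible way to keep the C conversion specifiers out of user code is to derive them from Julia types. The sketch below is only an illustration: `scanf_spec` and `scanf_format` are hypothetical names, not part of the prototype, and the mapping simply follows the C convention that `%f` stores a `float`, `%lf` a `double`, `%d` an `int` and `%lld` a `long long`.

```julia
# Hypothetical mapping from Julia types to C scanf conversion specifiers.
scanf_spec(::Type{Float32})   = "%f"    # C float
scanf_spec(::Type{Float64})   = "%lf"   # C double
scanf_spec(::Type{Cint})      = "%d"    # C int
scanf_spec(::Type{Clonglong}) = "%lld"  # C long long
scanf_spec(::Type{Cchar})     = "%c"    # C char

# Build a space-separated format string from a literal prefix and Julia types.
scanf_format(prefix, types...) = join((prefix, map(scanf_spec, types)...), " ")

fmt = scanf_format("GEO", Float64, Cchar, Cint, Cint, Float64)
```

A format built this way stays compatible with the C function while the caller only ever names Julia types. The trade-off is losing the familiar scanf syntax that the string macro was meant to preserve.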