This repository was archived by the owner on Mar 1, 2024. It is now read-only.

scanf equivalent #21

@c42f

It would be nice to have a [sf]scanf equivalent for quickly parsing large tabulated text files with simple but ad-hoc formatting. I wasn't sure where to post an issue about this, so let me know if there's a better place.

A current example on my mind is ftp://ftp.ga.gov.au/geodesy-outgoing/gravity/ausgeoid/AUSGeoid09_V1.01/AUSGeoid09_GDA94_V1.01_DOV.zip

AUSGeoid09_GDA94_V1.01                          www.ga.gov.au
GEO   14.595 S 8  0  0.000 E108  0  0.000    -29.88    -26.83
GEO   14.833 S 8  0  0.000 E108  1  0.000    -36.93    -20.99
GEO   14.968 S 8  0  0.000 E108  2  0.000    -36.59    -15.21
... actual data isn't important, but there's 7 million rows of it.

I made a little prototype, including a string macro, with the aim of making the following work:

values = read(io, scanf"GEO %f %c %d %d %f %c%d %d %f %f %f")

Gist here: https://gist.github.com/c42f/9999dc6f9b63a9bd4ea4237a95876475
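
For readers who want to see which values the format string above is meant to extract, here is a rough sketch in plain Julia using a regex instead of sscanf. It is not the code from the gist: `GEO_ROW` and `parse_geo_line` are hypothetical names, and the pattern assumes every data row has the same eleven fields as the sample rows shown earlier.

```julia
# Hypothetical pattern mirroring "GEO %f %c %d %d %f %c%d %d %f %f %f".
# Assumption: fields are separated by whitespace, except that the single-character
# field may sit directly against the integer that follows it (as in "E108").
const GEO_ROW = r"^GEO\s+(\S+)\s+(\S)\s*(\d+)\s+(\d+)\s+(\S+)\s+(\S)\s*(\d+)\s+(\d+)\s+(\S+)\s+(\S+)\s+(\S+)\s*$"

# Hypothetical helper: parse one row into a tuple of typed values.
# Returns `nothing` for lines that do not match (for example the header line).
function parse_geo_line(line::AbstractString)
    m = match(GEO_ROW, line)
    m === nothing && return nothing
    c = m.captures
    return (parse(Float64, c[1]), c[2][1], parse(Int, c[3]), parse(Int, c[4]),
            parse(Float64, c[5]), c[6][1], parse(Int, c[7]), parse(Int, c[8]),
            parse(Float64, c[9]), parse(Float64, c[10]), parse(Float64, c[11]))
end

row = parse_geo_line("GEO   14.595 S 8  0  0.000 E108  0  0.000    -29.88    -26.83")
```

A regex is likely slower than a generated parser for 7 million rows, so this only illustrates the intended result, not the performance goal.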

The prototype just ccalls sscanf directly, but I hit a few nasty things and thought I'd record the problems here in case anyone has feedback.

  • It would be nice if this worked with IO objects for streaming. However, in Julia, streams are either unbuffered or internally buffered, with no way to get at the buffer object (if it even exists) or to put back a character. This makes it impossible to parse text from a stream efficiently without overreading. It would need something like https://github.com/BioJulia/BufferedStreams.jl instead of a plain IO (CC @bicycle1885). For the moment, I worked around the problem for this particular input file using readline(), but that's far from ideal.
  • Related to the above, IO objects cannot be used with fscanf since they have no internal FILE data structure at the C level (unsurprising).
  • sscanf("%s", ...) seems to be completely memory unsafe. The only ways around this seem to be disallowing it, implementing this conversion specifier on the Julia side, or relying on the POSIX "%ms" extension.
  • The format length modifiers ("%lf" for Float64? ugh.) are C-isms which are a little nasty to expose at the Julia level, but make sense for compatibility.
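
The readline()-style workaround from the first bullet can be sketched roughly as follows. This is an illustration in plain Julia rather than the prototype itself: `read_last_two_columns` is a hypothetical helper, it only extracts the last two numeric columns, and it assumes that every data row starts with "GEO" as in the sample above.

```julia
# Hypothetical helper: collect the last two numeric columns of every "GEO" row.
function read_last_two_columns(io::IO)
    out = Tuple{Float64,Float64}[]
    for line in eachline(io)                     # process the stream one line at a time
        startswith(line, "GEO") || continue      # skip the header and any other non-data line
        fields = split(line)                     # split on runs of whitespace
        length(fields) >= 2 || continue
        a = tryparse(Float64, fields[end-1])     # tryparse returns `nothing` on bad input
        b = tryparse(Float64, fields[end])
        (a === nothing || b === nothing) && continue
        push!(out, (a, b))
    end
    return out
end

# An in-memory buffer stands in for the real file here.
io = IOBuffer("""
AUSGeoid09_GDA94_V1.01                          www.ga.gov.au
GEO   14.595 S 8  0  0.000 E108  0  0.000    -29.88    -26.83
GEO   14.833 S 8  0  0.000 E108  1  0.000    -36.93    -20.99
""")
rows = read_last_two_columns(io)
```

Working a whole line at a time sidesteps the put-back problem, but only because this file is strictly line oriented; it does not help with formats where a record can span lines.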
