Add CSVY support for fread() #1701

Closed
arunsrinivasan opened this issue May 12, 2016 · 7 comments · Fixed by #2656

@arunsrinivasan arunsrinivasan commented May 12, 2016

The rio package already reads CSVY files by relying on the helper package csvy (same author). It'd be great to leverage its features to read/write.


@MichaelChirico MichaelChirico commented Mar 3, 2018

Just documenting how the .csvy reader of rio works:

  1. rio encounters a .csvy file and dispatches to .import.rio_csvy
  2. .import.rio_csvy calls read_csvy from the csvy package.
  3. read_csvy uses readLines to digest the file, then uses grep to identify the YAML header.
  4. yaml.load from the yaml package is called to convert the YAML content string to a list of component parts. This is already implemented in C so is presumably efficient.
  5. read_csvy then applies paste(., collapse = '\n') to the non-YAML portion of the file so that it can be read with fread.
  6. Content of YAML header is applied to the output.
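
The read-everything-then-split flow described above can be sketched roughly as follows. This is not the csvy package's code: `read_csvy_naive` is a hypothetical name, the fence pattern (`---`, optionally prefixed with `#`) is an assumption about the file layout, and base R's `read.csv` stands in for `fread` so the sketch runs without extra packages. The header is kept as raw text rather than parsed.

```r
# Hypothetical sketch of the "read all lines, then split" approach (base R only).
read_csvy_naive <- function(path) {
  lines <- readLines(path)                       # reads the whole file into memory
  fence <- grep("^#?---\\s*$", lines)            # assumed YAML delimiters
  if (length(fence) < 2L) stop("no YAML header found")
  # Header lines sit strictly between the first two fences (may be empty).
  header <- lines[seq_len(fence[2L] - fence[1L] - 1L) + fence[1L]]
  header <- sub("^#\\s?", "", header)            # drop an optional leading '#'
  # Glue the remaining lines back together so a CSV reader can parse them.
  body <- paste(lines[-seq_len(fence[2L])], collapse = "\n")
  data <- read.csv(text = body, stringsAsFactors = FALSE)
  attr(data, "yaml_header") <- header            # unparsed header, for illustration
  data
}

tmp <- tempfile(fileext = ".csvy")
writeLines(c("---", "name: example", "fields:", "  - name: id",
             "    type: integer", "---", "id,score", "1,2.5", "2,3.5"), tmp)
dt <- read_csvy_naive(tmp)
```

Note that every line of the data section is held in memory as text and then re-joined before it is parsed.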

Major inefficiencies are:

  • Using readLines to digest the whole file (slow)
  • Pasting the lines back together into a format fread can tackle, only after stripping out the YAML part
  • Some parts of the YAML header are intended to assist fread, but the YAML data is not fed into fread.

My proposed solution (if we decide to tackle this):

  1. Stream lines of the file until the end of the YAML header is reached (or until an out-of-format line is reached -> stop). I'm not sure the most efficient way to pass files line-by-line in R, or if we'll have to implement that ourselves in C. Keep track of the # of lines fed. I see this from StackOverflow: https://stackoverflow.com/q/9871307/3576984
  2. Add yaml package to Suggests and rely on that to parse the YAML info
  3. Extract any info relevant to fread itself (especially/most importantly colClasses; I'll have to read the csvy standard to see how open-ended the rest is)
  4. fread the remainder of the file, using skip to jump past the YAML section, and including relevant info from the YAML.
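
A minimal sketch of steps 1 and 4, assuming the header is fenced by `---` lines (optionally prefixed with `#`). `read_yaml_header` and `max_lines` are hypothetical names, and `read.csv` is used in place of `fread` only to keep the example free of package dependencies; `fread` takes a `skip` argument in the same way. Parsing with the yaml package (step 2) is guarded so the sketch still runs when that package is not installed.

```r
# Hypothetical helper: read only the YAML header through an open connection
# and report how many lines were consumed.
read_yaml_header <- function(path, max_lines = 1000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  first <- readLines(con, n = 1L)
  if (length(first) == 0L || !grepl("^#?---\\s*$", first)) {
    stop("file does not start with a YAML fence")
  }
  header <- character()
  n_read <- 1L
  repeat {
    line <- readLines(con, n = 1L)               # one line at a time
    if (length(line) == 0L || n_read >= max_lines) {
      stop("unterminated YAML header")
    }
    n_read <- n_read + 1L
    if (grepl("^#?---\\s*$", line)) break         # closing fence reached
    header <- c(header, sub("^#\\s?", "", line))
  }
  list(header = header, skip = n_read)
}

tmp <- tempfile(fileext = ".csvy")
writeLines(c("---", "name: example", "---", "id,score", "1,2.5", "2,3.5"), tmp)
meta <- read_yaml_header(tmp)

# Step 2: parse the header only if the yaml package is available.
if (requireNamespace("yaml", quietly = TRUE)) {
  parsed <- yaml::yaml.load(paste(meta$header, collapse = "\n"))
}

# Step 4: skip past the header; with data.table this would be fread(tmp, skip = meta$skip).
data <- read.csv(tmp, skip = meta$skip)
```

Only the header lines pass through R as text here; the data section is handed to the CSV reader directly from the file.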

Remaining API questions for me are: (1) do we try to detect YAML automatically (harder, and beyond me to implement currently), or simply add a yaml (or similarly named) argument to fread and rely on user input? (2) What about fwrite?

For (1) I lean towards the latter, primarily out of laziness: the more sophisticated approach doesn't seem to pass the cost/benefit test given the limited user requests for this feature. We can revisit in a future issue if this becomes more popular, I suppose. No opinions on (2); I only include it since we seem to be aiming to keep fread and fwrite as each other's inverse functions.


@HughParsonage HughParsonage commented Mar 3, 2018

> I'm not sure the most efficient way to pass files line-by-line in R

fread(file, sep = NULL) (in dev) ?

Also readLines is likely to be much faster in R 3.5.0, possibly as fast as fread or readr::read_lines.


@MichaelChirico MichaelChirico commented Mar 3, 2018

@HughParsonage that still leaves the issue of reading the file twice. The idea of streaming lines is to examine the file line-by-line and read in only, say, the 10-20 lines of YAML metadata using readLines (or fread, or whatever), parse that, and then deploy fread on what is (presumably) the bulk of the file following the header.

(anyway good to know they're finally getting around to improving readLines, it's silly how slow it is considering how minimal its responsibilities are)

@MichaelChirico MichaelChirico self-assigned this Mar 3, 2018

@MichaelChirico MichaelChirico commented Mar 3, 2018


@jangorecki jangorecki commented Mar 5, 2018

I would avoid an extra dependency, even in Suggests, and use a helper function to extract fields from the YAML header, similar to how we would process a DESCRIPTION file. fwrite should be able to write the CSVY header in a form that fread can read.


@MichaelChirico MichaelChirico commented Mar 5, 2018

as yaml can be arbitrarily nested I didn't see a need to reinvent the wheel.

especially as it seems at the moment to be a rather limited use case -- happy to revisit if this format takes off.
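
To make the trade-off in the last two comments concrete, here is a rough sketch of what a dependency-free helper might look like. `parse_flat_header` is a hypothetical name, not something in data.table, and it deliberately handles only top-level `key: value` lines; anything nested is ignored.

```r
# Hypothetical helper: parse top-level "key: value" lines without the yaml package.
parse_flat_header <- function(lines) {
  # Keep only unindented lines with a non-empty value after the colon.
  is_pair <- grepl("^[A-Za-z_][^:]*:[[:space:]]+[^[:space:]]", lines)
  keys <- sub(":.*$", "", lines[is_pair])
  vals <- trimws(sub("^[^:]*:", "", lines[is_pair]))
  structure(as.list(vals), names = keys)
}

header <- c("name: example", "fields:", "  - name: id", "    type: integer")
flat <- parse_flat_header(header)
```

In this example only `name` survives; the nested `fields` entries, which is where per-column type information would live, are silently dropped. Handling them would mean writing a fair share of a YAML parser by hand.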

@jangorecki jangorecki added this to the 1.12.0 milestone Jun 26, 2018
@jangorecki jangorecki removed this from the 1.12.0 milestone Jan 5, 2019
@jangorecki jangorecki added this to the 1.12.2 milestone Jan 5, 2019
@mattdowle mattdowle removed this from the 1.12.2 milestone Jan 14, 2019
@mattdowle mattdowle added this to the 1.12.4 milestone May 2, 2019

@MichaelChirico MichaelChirico commented May 3, 2019

fwrite support not done yet... will file separately

@MichaelChirico MichaelChirico changed the title from "Add CSVY support for fread() and fwrite()" to "Add CSVY support for fread()" May 3, 2019
@MichaelChirico MichaelChirico mentioned this issue May 4, 2019