Should use read.csv instead of read.table for "csv" format? #50

Closed
jamiefolson opened this issue Jun 4, 2013 · 8 comments

Comments

@jamiefolson
Contributor

Since the rmr2 format is referred to as "csv", shouldn't it actually call read.csv so that it has the expected default parameters? Of particular importance is comment.char = "", which I spent a surprising amount of time debugging before I finally noticed that rmr actually calls read.table. I think it specifies somewhere in the documentation that read.table is being called, but at least I still found it surprising that it's not calling read.csv.
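For readers who have not hit this before, here is a minimal base R sketch of the difference described above. It does not use rmr2 at all; the sample data and variable names are made up for illustration, and `stringsAsFactors = FALSE` is set explicitly so the result does not depend on the R version.

```r
# Made-up sample: the second line has a '#' inside a field.
txt <- c("id,tag", "1,a#b", "2,c")

# read.table defaults to comment.char = "#", so everything from '#' to the
# end of the line is silently dropped.
via_table <- read.table(text = txt, sep = ",", header = TRUE,
                        stringsAsFactors = FALSE)

# read.csv defaults to comment.char = "", so the field is kept whole.
via_csv <- read.csv(text = txt, stringsAsFactors = FALSE)

print(via_table$tag)  # "a" "c"
print(via_csv$tag)    # "a#b" "c"
```

No error or warning is raised in the read.table case, which is what makes the truncation hard to notice.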

@piccolbo
Collaborator

piccolbo commented Jun 4, 2013

If I remember correctly, I went for read.table because it is more flexible, but I can see your point about it creating the wrong assumption. If you read the Wikipedia entry for CSV, I think the naming is sensible, and I am not sure I always want to follow all the quirks of some arbitrary definitions in R. The formats read by read.table, read.csv, and read.csv2 are all CSVs. Isn't that just bad naming in the standard functions? And is this worth a backward-incompatible change?

@jamiefolson
Contributor Author

Yeah, csv isn't really a standard and there are wide variations on how people parse "csv" files.

One option would be to simply default to comment.char = "" but that could be even more confusing since then you're not consistent with any of the read.* functions. Maybe a new input format consistent with the hive/pig defaults (e.g. sep="\001",comment.char = "",quote="")?

@piccolbo
Collaborator

How is the rmr csv format not consistent with read.table? A new input format to import from Hive/Pig sounds like a great idea, independent of the original subject here.


@jamiefolson
Contributor Author

I meant that you're currently completely consistent with read.table, but that the read.table default comment.char="#" leads to surprises. If you only changed that default, the format would perhaps be less surprising to people wanting to parse "csv" files, but it would be more confusing to experts, since it would no longer be consistent with read.table.

I'm currently using sep="\001",comment.char = "",colClasses="character",fill=TRUE,flush=TRUE,quote="",... for importing hive/pig data:

  make.input.format("csv","text",sep=sep,comment.char = comment.char,
                    colClasses=colClasses,
                    fill=fill,flush=flush,quote=quote,...)
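As a rough illustration of what those settings do, the sketch below passes the same values straight to base R's read.table rather than going through rmr2. The two sample records are invented, and `stringsAsFactors = FALSE` is an addition here so the result is the same across R versions.

```r
# Two made-up Ctrl-A-delimited records containing characters that
# read.table's defaults would treat specially: ', " and #.
records <- c("1\001it's item #1\001x",
             "2\001say \"hi\"\001y")

hive_like <- read.table(text = records,
                        sep = "\001",              # Ctrl-A field separator
                        comment.char = "",         # '#' is ordinary data
                        colClasses = "character",  # no type guessing
                        fill = TRUE, flush = TRUE,
                        quote = "",                # quotes are ordinary data
                        stringsAsFactors = FALSE)

print(hive_like$V2)
```

With `quote = ""` and `comment.char = ""`, the apostrophe, the double quotes, and the `#` all come through as literal field content.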

@piccolbo
Collaborator

And what do you need to do, if anything, in Hive and Pig?


@jamiefolson
Contributor Author

Those parameters should be consistent with the default format for both Hive and Pig (ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' LINES TERMINATED BY '\n').


@piccolbo
Collaborator

I am implementing this for 2.3.0 and I was wondering why you added the ... to the make.input.format call. Of course that's not correct R, but I was wondering whether you meant that I should accept additional arguments. More generally, should I make the Pig/Hive format fixed, or are some variations useful?

@jamiefolson
Contributor Author

I just accepted additional arguments assuming that I'd find additional things I'd want to configure. I think a couple options that might depend on circumstances are stringsAsFactors and strip.white.
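One way to get "fixed defaults, but still configurable" is sketched below in base R. `read_hive_text` is a hypothetical helper, not part of rmr2 and not what 2.3.0 implements; it only shows the pattern of merging caller-supplied `...` over Hive/Pig-style defaults before calling read.table. Merging with `modifyList` avoids the "matched by multiple actual arguments" error that would occur if a caller overrode a default such as `sep`.

```r
# Hypothetical helper (not an rmr2 function): Hive/Pig-style defaults that
# callers can override or extend through `...`.
read_hive_text <- function(text, ...) {
  defaults <- list(sep = "\001", comment.char = "",
                   colClasses = "character",
                   fill = TRUE, flush = TRUE, quote = "",
                   stringsAsFactors = FALSE)
  # Caller-supplied arguments win over the defaults.
  args <- modifyList(defaults, list(...))
  do.call(read.table, c(list(text = text), args))
}

# Made-up sample rows; the first field of row 1 has trailing white space.
rows <- c("a \001b#c", "d\001e")

plain   <- read_hive_text(rows)
trimmed <- read_hive_text(rows, strip.white = TRUE)
```

Here `strip.white = TRUE` trims the trailing space from the first field, while the default call keeps it, so options like `strip.white` and `stringsAsFactors` stay in the caller's hands.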
