Should use `read.csv` instead of `read.table` for "csv" format? #50

jamiefolson · 2013-06-04T15:44:04Z

Since the rmr2 format is referred to as "csv", shouldn't it actually call read.csv so that it has the expected default parameters? Of particular importance is comment.char = "", which I spent a surprising amount of time debugging before I finally noticed that rmr actually calls read.table. I think it specifies somewhere in the documentation that read.table is being called, but at least I still found it surprising that it's not calling read.csv.
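
To make the surprise concrete, here is a minimal sketch using only base R, with made-up sample data. It assumes nothing about rmr2 itself; it only contrasts the defaults of `read.table` (`comment.char = "#"`) and `read.csv` (`comment.char = ""`).

```r
# Hypothetical sample data: a field that contains '#'.
csv.text <- c("id,tag", "1,a#b", "2,c#d")

# read.table keeps its default comment.char = "#",
# so everything from '#' to the end of the line is dropped.
via.table <- read.table(text = csv.text, sep = ",", header = TRUE,
                        stringsAsFactors = FALSE)

# read.csv defaults to comment.char = "", so the field survives intact.
via.csv <- read.csv(text = csv.text, stringsAsFactors = FALSE)

print(via.table$tag)
print(via.csv$tag)
```

The first call silently truncates the `tag` values at the `#`, which is the kind of behavior that is hard to spot when debugging.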

piccolbo · 2013-06-04T17:20:28Z

If I remember correctly, I went for read.table because it is more flexible, but I can see your point about it creating the wrong assumption. If you read the Wikipedia entry for CSV, I think the naming is sensible, and I am not sure I always want to follow all the quirks of some arbitrary definitions in R. The formats read by read.table, read.csv and read.csv2 are all CSVs. Isn't that just bad naming in the standard functions? And is this worth a backward-incompatible change?

jamiefolson · 2013-06-11T15:14:51Z

Yeah, csv isn't really a standard and there are wide variations on how people parse "csv" files.

One option would be to simply default to comment.char = "" but that could be even more confusing since then you're not consistent with any of the read.* functions. Maybe a new input format consistent with the hive/pig defaults (e.g. sep="\001",comment.char = "",quote="")?

piccolbo · 2013-06-15T00:07:50Z

How is the rmr csv format not consistent with read.table? A new input format to import from Hive and Pig sounds like a great idea, independent of the original subject here.

jamiefolson · 2013-06-17T14:25:41Z

I meant that you're currently completely consistent with read.table, but that the read.table default comment.char="#" leads to surprises. If you only changed that default, then you would perhaps be less surprising to the people wanting to parse "csv" files, but you would be more confusing to experts, since you would no longer be consistent with read.table.

I'm currently using sep="\001",comment.char = "",colClasses="character",fill=TRUE,flush=TRUE,quote="",... for importing hive/pig data:

  make.input.format("csv","text",sep=sep,comment.char = comment.char,
                    colClasses=colClasses,
                    fill=fill,flush=flush,quote=quote,...)
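
As a rough illustration of how those arguments fit together, the sketch below wraps base R's `read.table` rather than rmr2's `make.input.format`, so it runs without Hadoop. The wrapper name `read.hive.text` and the sample rows are hypothetical, and this is not the rmr2 implementation; it only shows the same defaults plus `...` forwarding.

```r
# Hypothetical helper (not part of rmr2): read.table with Hive-style defaults.
# Extra arguments in ... are passed straight through to read.table.
read.hive.text <- function(..., sep = "\001", comment.char = "",
                           colClasses = "character", fill = TRUE,
                           flush = TRUE, quote = "") {
  read.table(..., sep = sep, comment.char = comment.char,
             colClasses = colClasses, fill = fill, flush = flush,
             quote = quote)
}

# Invented sample: fields separated by \001, containing '#' and an apostrophe.
hive.lines <- c("1\001a#b\001x", "2\001it's\001y")
rows <- read.hive.text(text = hive.lines)
str(rows)
```

With `comment.char = ""` and `quote = ""`, neither the `#` nor the apostrophe is treated specially, so every field comes back as a plain character string.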

piccolbo · 2013-06-17T15:34:34Z

And what do you need to do, if anything, in Hive and Pig?

jamiefolson · 2013-06-17T18:48:08Z

Those parameters should be consistent with the default format for both Hive and Pig (ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' LINES TERMINATED BY '\n').

piccolbo · 2013-06-21T22:32:14Z

I am implementing this for 2.3.0 and I was wondering why you added the ... to the make.input.format call. Of course that's not correct R, but I was wondering if you meant that I should accept additional arguments. Or, more generally, should I make the Pig/Hive format fixed, or are some variations useful?

jamiefolson · 2013-06-28T15:15:50Z

I just accepted additional arguments assuming that I'd find additional things I'd want to configure. I think a couple options that might depend on circumstances are stringsAsFactors and strip.white.
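
For reference, a small sketch of what those two options change, again using base R's `read.table` on invented `\001`-separated rows rather than anything from rmr2:

```r
# Invented sample: \001-separated fields with stray spaces around the values.
raw <- c("a \001 x", "b\001y ")

# Baseline: whitespace is kept and columns stay character.
kept <- read.table(text = raw, sep = "\001", quote = "", comment.char = "",
                   colClasses = "character")

# strip.white = TRUE trims leading/trailing blanks from unquoted character fields;
# stringsAsFactors = TRUE converts the resulting character columns to factors.
trimmed <- read.table(text = raw, sep = "\001", quote = "", comment.char = "",
                      strip.white = TRUE, stringsAsFactors = TRUE)

print(kept$V2)
print(levels(trimmed$V2))
```

Whether trimming or factor conversion is wanted depends on the data, which is one argument for letting callers pass these through instead of fixing them in the format.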

piccolbo mentioned this issue Jun 17, 2013

csv format to import export from hive #53

Closed

jamiefolson closed this as completed Sep 16, 2014
