fwrite/fread single column input with NA -vs- empty lines #2106

Closed
skanskan opened this issue Apr 7, 2017 · 9 comments

@skanskan skanskan commented Apr 7, 2017

I created a toy example:

temp <- data.table(a=c(1,NA,2,3,999,NA))

I save it:
fwrite(temp, "temp.csv", quote=FALSE, sep=",", append=F)

and read it again:
my <- fread("temp.csv", stringsAsFactors=F)

As you can see, only the first row is read back.

I don't know whether the problem is in fread or in the file that fwrite produces.

Member

@MichaelChirico MichaelChirico commented Apr 7, 2017

library(data.table)
temp = data.table(a=c(1,NA,2,3,999,NA))
tmp = tempfile()
fwrite(temp, tmp, quote=FALSE)
system(paste('cat', tmp))
# a
# 1
# 
# 2
# 3
# 999
# 

There are no separators in any line, and the last line is blank.

Without separators, fread has no way of knowing whether an empty line is just a blank line or a row whose only value is missing.
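To make the ambiguity concrete, here is a minimal sketch that writes the same single-column table twice and inspects the raw lines. The variable names (f_blank, f_na, lines_blank, lines_na) are illustrative only, and na is passed explicitly so the result does not depend on which default a given data.table version uses.

```r
library(data.table)

temp <- data.table(a = c(1, NA, 2, 3, 999, NA))

# Write the table twice: NA as an empty field, then NA as the literal string "NA".
f_blank <- tempfile(fileext = ".csv")
f_na    <- tempfile(fileext = ".csv")
fwrite(temp, f_blank, na = "")
fwrite(temp, f_na,    na = "NA")

lines_blank <- readLines(f_blank)
lines_na    <- readLines(f_na)

# With na = "", each NA row becomes an empty line that looks like a blank line.
print(lines_blank)
# With na = "NA", every row carries visible content.
print(lines_na)
```

In the first file nothing on disk distinguishes a missing value from a stray blank line; in the second the reader has something to parse on every row.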

fread warns you about this:

Warning message:

In fread("temp.csv", stringsAsFactors = F) :
Stopped reading at empty line 3 but text exists afterwards (discarded): 2

One potential fix: add a dummy column:

temp[ , b := NA]

Now tmp will have separators so fread can tell which lines have data.
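A rough sketch of that workaround follows, assuming it is acceptable to drop the dummy column again after reading. The names padded, padded_lines and back are made up for illustration.

```r
library(data.table)

temp <- data.table(a = c(1, NA, 2, 3, 999, NA))

# Work on a copy so the original single-column table is not modified by reference.
padded <- copy(temp)[, b := NA]

f <- tempfile(fileext = ".csv")
fwrite(padded, f, na = "")

# Every data line now contains a comma, so no line is completely empty.
padded_lines <- readLines(f)
print(padded_lines)

# Read the file back and remove the dummy column.
back <- fread(f)[, b := NULL]
print(back)
```

The cost is an extra column in the file that every reader has to know to discard.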

Contributor

@franknarf1 franknarf1 commented Apr 7, 2017

@MichaelChirico Side note: to cat it to the console, just use fwrite, which puts it there by default:

fwrite(temp)
# a
# 1
# 
# 2
# 3
# 999

So it can be reproduced like...

fread(paste(capture.output(fwrite(temp)), collapse="\n"))

#    a
# 1: 1
# Warning message:
# In fread(paste(capture.output(fwrite(temp)), collapse = "\n")) :
#   Found the last consistent line but text exists afterwards (discarded): <<2>>

Yeah, I'm inclined to say the file should be written with an explicit NA marker if the missing values are meant to be read back as NA (rather than reconfiguring fread to treat this one-column case specially). I mean:

fread(paste(capture.output(fwrite(temp, na="NA")), collapse="\n"))

Member

@MichaelChirico MichaelChirico commented Apr 7, 2017

@franknarf1 nice, the default to write to stdout is an update I missed, wasn't like that initially. Matches write.table behavior 👍

And I like your fix better, but not sure if it's the user's responsibility to handle a case like that, or if na = if (ncol(x) > 1L) '' else 'NA' as the default is a better fix
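For anyone who wants that behavior without a change to the default, the idea could be sketched as a user-side wrapper. fwrite_single_col_safe is a hypothetical name, not part of data.table, and this is only one way the proposal might look.

```r
library(data.table)

# Hypothetical helper: write NA as "NA" for one-column tables, as "" otherwise.
# Do not pass na= through ..., since it is already set here.
fwrite_single_col_safe <- function(x, file = "", ...) {
  fwrite(x, file, na = if (ncol(x) > 1L) "" else "NA", ...)
}

one_col <- data.table(a = c(1, NA, 2))
two_col <- data.table(a = c(1, NA, 2), b = c(1, NA, 2))

f1 <- tempfile(fileext = ".csv")
f2 <- tempfile(fileext = ".csv")
fwrite_single_col_safe(one_col, f1)
fwrite_single_col_safe(two_col, f2)

one_lines <- readLines(f1)
two_lines <- readLines(f2)
print(one_lines)  # single column: NA rows are written as "NA"
print(two_lines)  # two columns: NA fields stay empty, the separator remains
```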

Member

@mattdowle mattdowle commented Apr 8, 2017

The warning message is there. And there are arguments to control it.
Using v1.10.4 on CRAN :

> fread("temp.csv")
   a
1: 1
Warning message:
In fread("temp.csv") :
  Stopped reading at empty line 3 but text exists afterwards (discarded): 2
> fread("temp.csv", fill=TRUE)
     a
1:   1
2:  NA
3:   2
4:   3
5: 999
6:  NA
> fread("temp.csv", blank.lines.skip=TRUE)
     a
1:   1
2:   2
3:   3
4: 999
> 

Perhaps "Consider fill=TRUE and blank.lines.skip=TRUE" should be added to the warning message? (TODO1)

@mattdowle mattdowle changed the title Serious fread problem. fread single column input with NA or empty lines Apr 8, 2017
Member

@mattdowle mattdowle commented Apr 8, 2017

Also the empty lines can be controlled in fwrite with the na= argument.

> temp
     a
1:   1
2:  NA
3:   2
4:   3
5: 999
6:  NA
> fwrite(temp, "temp.csv")
> system("more temp.csv")
a
1

2
3
999

> fwrite(temp, "temp.csv", na="NA")
> system("more temp.csv")
a
1
NA
2
3
999
NA
> fread("temp.csv")
     a
1:   1
2:  NA
3:   2
4:   3
5: 999
6:  NA
> 

Member

@mattdowle mattdowle commented Apr 8, 2017

I just reread this and now understand @MichaelChirico's comment:

or if na = if (ncol(x) > 1L) '' else 'NA' as the default is a better fix

Yes - nice idea! Happy to make that change. (TODO2)

@mattdowle mattdowle added this to the v1.10.6 milestone Apr 8, 2017
@mattdowle mattdowle changed the title fread single column input with NA or empty lines fwrite/fread single column input with NA -vs- empty lines Apr 8, 2017
Member

@MichaelChirico MichaelChirico commented Oct 31, 2017

Closed by #2451

Author

@skanskan skanskan commented Oct 31, 2017

Great.
I've checked that it's also working well when a whole row is full of NA.

temp = data.table(a=c(1,NA,2,3,999,NA), b=c(1,NA,2,3,999,NA))

Member

@mattdowle mattdowle commented Mar 2, 2018

Now that fread handles the blank lines in single-column files, this change in dev can be reverted to how it was on CRAN, which is simpler and cleaner.
CRAN version has fwrite(..., na="", ...)
dev changed to fwrite(..., na = if (length(x) > 1L) "" else "NA", ...)
