Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fwrite gains scipen #3716

Merged
merged 13 commits into from
Sep 17, 2019
Merged

fwrite gains scipen #3716

merged 13 commits into from
Sep 17, 2019

Conversation

MichaelChirico
Copy link
Member

@MichaelChirico MichaelChirico commented Jul 21, 2019

Closes #2020
via #3189

base R handles scipen in format.c here:

*w = neg + (*d > 0) + *d + 4 + *e; /* width for E format */
if (wF <= *w + R_print.scipen)

The basic logic is in place (and actually quite simple as we already had a branch in place that did everything but use the scientific penalty), but some details of implementation should be solved in this PR before writing the NEWS/manual entry. I haven't quite convinced myself that writeFloat64 is working exactly the same for scientific-vs-decimal format but I haven't found any counterexamples yet.

As it is now, I guess it's not backwards-compatible (although tests passed) -- essentially, we moved from a default of scientificPenalty=0 to scientificPenalty=getOption('scipen', 0L). I think the default value is 0, which is why tests passed on my machine. Shall we phase this in through deprecation cycle?

Next, it seems fwrite handles overflow better (haven't dug into why, and not sure it's relevant, but it's worth noting now that we'll support writing very wide decimals):

fwrite(data.table(10^100), col.names = FALSE)
# 10000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
write.table(data.frame(10^100), col.names = FALSE, row.names = FALSE)
# 10000000000000000159028911097599180468360808563945281389781327557747838772170381060813469985856815104

Once we agree on the details of how to proceed, we can add some more tests & tidy up before merging.

Assigning @st-pasha for review since we added to fwriteMainArgs, which I guess would affect pydatatable as well.

Feedback welcome.

…way from scientific notation

minor adjustment

more minor stuff

remove unrelated test
@codecov
Copy link

codecov bot commented Jul 21, 2019

Codecov Report

Merging #3716 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3716      +/-   ##
==========================================
+ Coverage   99.42%   99.42%   +<.01%     
==========================================
  Files          72       72              
  Lines       13498    13502       +4     
==========================================
+ Hits        13421    13425       +4     
  Misses         77       77
Impacted Files Coverage Δ
src/fwrite.c 97.83% <100%> (ø) ⬆️
R/fwrite.R 100% <100%> (ø) ⬆️
R/foverlaps.R 100% <100%> (ø) ⬆️
src/fwriteR.c 100% <100%> (ø) ⬆️
R/test.data.table.R 100% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1da7624...b3c5428. Read the comment docs.

@st-pasha
Copy link
Contributor

@MichaelChirico There is writerMaxLen array defined in "fwrite.h", which gives max size (in bytes) of various fields. For Float64 the max size is declared as 29.
However, when the "scipen" option comes into play, I suspect the output can become arbitrarily large, as your example with 10^100 demonstrates. The logic to compute maxLineLen will then have to be updated, if we don't want to see crashes caused by out-of-buffer writes.

@MichaelChirico
Copy link
Member Author

Thanks @st-pasha... very good point to raise. I don't think it can become arbitrarily large -- should be at most 29+scipen IINM. The immediate implication is that means defining the writerMaxLen array within fwriteR or fwrite.c, though, not the header.

But where writerMaxLen is used is here:

for (int j=0; j<args.ncol; j++) {
  int width = writerMaxLen[args.whichFun[j]];
  if (width==0) {
    switch(args.whichFun[j]) {
    case WF_String:
      width = getMaxStringLen(args.columns[j], args.nrow);
      break;
    case WF_CategString:
      width = getMaxCategLen(args.columns[j]);
      break;
    case WF_List:
      width = getMaxListItemLen(args.columns[j], args.nrow);
      break;
    default:
      STOP("Internal error: type %d has no max length method implemented", args.whichFun[j]);  // # nocov
    }
  }
  if (width<naLen) width = naLen;
  maxLineLen += width*2;  // *2 in case the longest string is all quotes and they all need to be escaped
}

I guess maxLineLen += scipen would essentially do the trick, no?

@mattdowle
Copy link
Member

mattdowle commented Jul 23, 2019

Great. Could fwrite's new argument be called scipen, for consistency with R's option name, rather than scientificPenalty. I'm not sure knowing that scipen stands for scientificPenalty is going to help a new user: scientificPenalty is not exactly self describing either. So they'll be looking up scientificPenalty where they should be looking up the consistent name: scipen.

@mattdowle mattdowle changed the title Closes #2020 -- fwrite gains scientificPenalty argument for biasing a… fwrite gains scipen Jul 23, 2019
@mattdowle mattdowle added this to the 1.12.4 milestone Jul 23, 2019
@MichaelChirico
Copy link
Member Author

I was 50/50 on the naming, wanted to come up with something even more straightforward but couldn't.

I honestly never knew what scipen stood for until working on this & it finally clicked. was hoping spelling it out could have that same effect on users.

but not married to it. OTOH linking to ?options & the explanation in Details should also do the trick.

@mattdowle
Copy link
Member

mattdowle commented Jul 23, 2019

It's nice that you've added it as an argument at all. So then new users know there's something there. In base read.csv and read.table it's not an argument, so you have to know a global option exists. That's what I thought we'd do actually: just make fwrite respect the global option without adding the argument. But adding the argument is better.
I'd lean toward scipen then for consistency if you don't mind please. It'll be clear in ?fwrite what it is. The downside of scientificPrecision is it's long to type and it's camel case which gets us into a discussion we haven't had yet about whether it should be scientific_precision. And then what about other long argument names in data.table everywhere else? At least if we pick scipen it's for a good reason of consistency and we can move on.

@MichaelChirico
Copy link
Member Author

Sure, done.

I think adding it as an argument is in keeping with our aversion to options that can't be set explicitly (I'm thinking #3578), though this case is a bit different since it's a default option from base.

Latest attempts to address @st-pasha's concern, I'm not sure how to test it though, can have a look?

inst/tests/tests.Rraw Outdated Show resolved Hide resolved
test(2112.04, fwrite(DT, scipen=1), output=c("a,b,c","0.0001,1e+06,-20"))
test(2112.05, fwrite(DT, scipen=2), output=c("a,b,c","0.0001,1000000,-20"))
test(2112.06, fwrite(DT, scipen=-3), output=c("a,b,c","1e-04,1e+06,-20"))
# pending ... test(2112.07, fwrite(DT, scipen=-4), output=c("a,b,c","1e-04,1e+06,-2e+01"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any comment why this is disabled would be nice
also scipen=0 test would be god

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just found 2112.07 didn't work yesterday and will continue today. scipen=0 is test 2112.01.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now I see what I was going for with the -1^(i+j) tests... seems negatives not working:

for (i in -10:10) {cat(i, ':', sep = ''); fwrite(DT, scipen = i, col.names = FALSE)}
-10:1e-04,1e+06,-20
-9:1e-04,1e+06,-20
-8:1e-04,1e+06,-20
-7:1e-04,1e+06,-20
-6:1e-04,1e+06,-20
-5:1e-04,1e+06,-20
-4:1e-04,1e+06,-20
-3:1e-04,1e+06,-20
-2:1e-04,1e+06,-20
-1:1e-04,1e+06,-20
0:1e-04,1e+06,-20
1:0.0001,1e+06,-20
2:0.0001,1000000,-20
3:0.0001,1000000,-20
4:0.0001,1000000,-20
5:0.0001,1000000,-20
6:0.0001,1000000,-20
7:0.0001,1000000,-20
8:0.0001,1000000,-20
9:0.0001,1000000,-20
10:0.0001,1000000,-20

Copy link
Member Author

@MichaelChirico MichaelChirico Sep 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, but DT$c is integer. Implementation only applied scipen to numeric... write.csv also ignores scipen for integer (which is what I was getting at by comparing to format, which write.csv will match):

options(scipen = -4)
write.csv(DT)
"","a","b","c"
"1",1e-04,1e+06,-20

@mattdowle mattdowle added the WIP label Sep 17, 2019
@mattdowle mattdowle removed the WIP label Sep 17, 2019
@mattdowle mattdowle merged commit 6c5fb5b into master Sep 17, 2019
@mattdowle mattdowle deleted the fwrite_scipen branch September 17, 2019 21:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

fwrite ignores options(scipen = 999)
4 participants