Various enhancements to print.data.table #1523

Open
MichaelChirico opened this issue Feb 6, 2016 · 55 comments

@MichaelChirico MichaelChirico commented Feb 6, 2016

Current task list:

  • 1. Add .Rd file for print.data.table
  • 2. Ability to turn off row numbers [1) from #645/R-F#1957 - Yike Lu; handled in this commit, Nov. 12, 2013]
  • 3. Ability to turn off smart table wrapping [2) from #645/R-F#1957 - Yike Lu]
  • 4. Ability to force-print all entries [3) from #645/R-F#1957 - Yike Lu; handled in this commit, Sep. 14, 2012]
  • 5. Ability to demarcate by-groupings [4) from #645/R-F#1957 - Yike Lu]
  • 6. Demarcation of table border [part of 5) from #645/R-F#1957 - Yike Lu]
  • 7. Demarcation of key columns [part of 5) from #645/R-F#1957 - Yike Lu]
  • 8. Fungible option for whether row numbers are printed [#1097 - @smcinerney]
  • 9. Options for whether/which registers of column names to print [#1482 - Oleg Bondar on SO]
  • 10. Option for dplyr-like printing [see below - @MichaelChirico]
  • 11. Facilities for compact glance at data a la dplyr tbl_df [#1497 - @nverno; #2608 - @vlulla]
  • 12. Option for specifying a truncation character [#1374 - @jangorecki]
  • 13. Handling of empty-named data.table [#545/R-F#5253 - @arunsrinivasan]
  • 14. Improve printing of list/non-atomic columns [see below - @franknarf1 via SO; also #605; handled in #2562]
  • 15. POSIXct columns with timezones should include that information in printed output [#2842 - @MichaelChirico]
  • 16. Limit number of columns printed for very wide tables (i.e. where print.data.table would exceed max.print)

Some Notes

3 (tabled pending clarification)

As I understand it, this issue is a request to prevent the console output from wrapping around (i.e., to force all columns to appear parallel, regardless of how wide the table is).

If that's the case, this is (AFAICT) impossible, since that's something done by RStudio/R itself. I for one certainly don't know of any way to alter this behavior.

If someone does know of a way to affect this, or if they think I'm mis-interpreting, please pipe up and we can have this taken care of.

7

As I see it there are two options here. One is to treat all key columns the same; the other is to treat secondary, tertiary, etc. keys separately.

Example output:

set.seed(01394)
DT <- data.table(key1 = rep(c("A","B"), each = 4),
                 key2 = rep(c("a","b"), 4),
                 V1 = rnorm(8), key = c("key1","key2"))

# Only demarcate key columns
DT
#    | key1 | | key2 |         V1
#1: |    A | |    a |  0.5994579
#2: |    A | |    a | -1.0898775
#3: |    A | |    b | -0.2285326
#4: |    A | |    b | -1.7858472
#5: |    B | |    a | -0.6269875
#6: |    B | |    a | -0.6633084
#7: |    B | |    b |  1.0367084
#8: |    B | |    b |  0.7364276

# Separately "emboss" keys based on key order
DT
#    | key1 | || key2 ||         V1
#1: |    A | ||    a ||  0.5994579
#2: |    A | ||    a || -1.0898775
#3: |    A | ||    b || -0.2285326
#4: |    A | ||    b || -1.7858472
#5: |    B | ||    a || -0.6269875
#6: |    B | ||    a || -0.6633084
#7: |    B | ||    b ||  1.0367084
#8: |    B | ||    b ||  0.7364276

And of course, add an option for deciding whether to demarcate with | or some other user's-choice character (*, +, etc.)

9 [DONE]

Some feedback from a closed PR that was a first stab at solving this:

From Arun regarding preferred options:

col.names = c("auto", "top", "none")

"auto": current behaviour

"top": only on top, data.frame-like

"none": no column names -- exclude rows in which column names would have been printed.

10 [DONE]

It would be nice to have an option to print a row under the row of column names which gives each column's stored type, as is currently (I understand) the default for the output of dplyr operations.

Example from dplyr:

library(dplyr)
DF <- data.frame(n = numeric(1), c1 = complex(1), i = integer(1),
                 f = factor(1), D = as.Date("2016-02-06"), c2 = character(1),
                 stringsAsFactors = FALSE)
tbl_df(DF)
# Source: local data frame [1 x 6]
#
#       n     c1     i      f          D    c2
#   (dbl) (cmpl) (int) (fctr)     (date) (chr) # <- this row
#1     0   0+0i     0      1 2016-02-06      

Current best alternative is to do sapply(DF, class), but it's nice to have a preview of the data with this extra information.
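
A minimal base-R sketch of what such a class row could look like. The helper name `print_with_classes()` is hypothetical (it is not part of data.table or dplyr), and it shows full class names rather than dplyr's abbreviations:

```r
# Hypothetical helper: print a data.frame with a row of column classes
# directly under the column names. Illustrative only.
print_with_classes <- function(df) {
  # First class of each column, wrapped in angle brackets
  classes <- vapply(df, function(col) paste0("<", class(col)[1L], ">"), character(1L))
  # Format every column as character so the class row can be stacked on top
  body <- do.call(cbind, lapply(df, function(col) as.character(format(col))))
  out <- rbind(classes, body)
  # Blank label for the class row, "1:", "2:", ... for the data rows
  rownames(out) <- c("", paste0(seq_len(nrow(df)), ":"))
  print(out, quote = FALSE, right = TRUE)
  invisible(df)
}

DF <- data.frame(n = numeric(1), i = integer(1), D = as.Date("2016-02-06"),
                 c2 = "x", stringsAsFactors = FALSE)
print_with_classes(DF)
```

Because everything is converted to a character matrix before printing, this only approximates how a real print method would align numeric columns.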

11

This seems closely related to 3. Current plan is to implement this as an alternative to 3 since it seems more tangible/doable.

Via @nverno:

Would it be useful for head.data.table to have an option to print only the head of columns that fit the screen width, and summarise the rest? I was imagining something like the printed output from the head of a tbl_df in dplyr. I think it is nice for tables with many columns.

and the guiding example from Arun:

require(data.table)
dt = setDT(lapply(1:100, function(x) 1:3))
dt
dplyr::tbl_dt(dt)
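
One way the "show only what fits, summarise the rest" behaviour could work, sketched in base R on a plain data.frame. The helper `print_fitting_cols()` and its arguments are assumptions made for illustration, not an existing data.table or dplyr function:

```r
# Hypothetical helper: print only the leading columns whose printed width fits
# in `width` characters, then name (some of) the columns that were left out.
print_fitting_cols <- function(df, width = getOption("width"), n = 6L) {
  h <- utils::head(df, n)
  # Printed width of a column: the wider of its name and its formatted values,
  # plus one character for the separating space
  col_width <- vapply(names(h), function(nm) {
    max(nchar(nm), nchar(as.character(format(h[[nm]])))) + 1L
  }, numeric(1L))
  # Width taken by the row labels (row numbers)
  row_label <- nchar(as.character(nrow(h)))
  fits <- (row_label + cumsum(col_width)) <= width
  if (length(fits)) fits[1L] <- TRUE  # always show at least one column
  print(h[, fits, drop = FALSE])
  hidden <- names(h)[!fits]
  if (length(hidden)) {
    cat(sprintf("... and %d more column(s): %s%s\n", length(hidden),
                toString(utils::head(hidden, 5L)),
                if (length(hidden) > 5L) ", ..." else ""))
  }
  invisible(names(h)[fits])
}

# A wide table in the spirit of the example above: 100 columns, 3 rows
df <- as.data.frame(setNames(lapply(1:100, function(x) 1:3), paste0("V", 1:100)))
print_fitting_cols(df, width = 40)
```

The width estimate assumes the default data.frame layout (right-justified columns separated by single spaces), so it is a heuristic rather than an exact measurement.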

12

Currently covered by @jangorecki's PR #1448; Jan, assuming #1529 is merged first, could you edit the print.data.table man page for your PR?

@MichaelChirico MichaelChirico changed the title FR: option for dplyr-like data.table printing Various enhancements to print.data.table Feb 8, 2016

@arunsrinivasan arunsrinivasan commented Feb 8, 2016

Just brilliant!

@arunsrinivasan arunsrinivasan commented Feb 8, 2016

No idea about 3 and 5 (as to what they mean).
I think a PR for 6 would be nice (seems straightforward from what Jan wrote there). Perhaps ?print.data.table is the time consuming part? Do you think you'd be up for this, @MichaelChirico ?
No idea as to what 7 means either..
8 is another great idea. PR would be great!

@arunsrinivasan arunsrinivasan commented Feb 8, 2016

It'd be really nice if GitHub would allow assigning tasks to people who aren't necessarily project members :-(.

@arunsrinivasan arunsrinivasan commented Feb 8, 2016

There's also #1497

@MichaelChirico MichaelChirico commented Feb 9, 2016

@arunsrinivasan should I try and PR this one issue at a time? Or in one fell swoop? I've got 8 basically taken care of, just need to add tests.

@arunsrinivasan arunsrinivasan commented Feb 9, 2016

Michael, separate PRs.

@nverno nverno commented Feb 10, 2016

Very nice! Sorry to get back to you late on this, but Arun provided a nice example. It is just a nice convenience when interactively looking at tables with lots of columns, so your console isn't engulfed by a huge data dump when you take a look at the head. I'll close that other one.

arunsrinivasan added a commit that referenced this issue Mar 4, 2016
#1523 progress: adds option for dplyr-inspired column class summary with printing
arunsrinivasan added a commit that referenced this issue Mar 6, 2016
Closes #1097 (progress towards #1523), creates option for printing row names

@arunsrinivasan arunsrinivasan commented Mar 9, 2016

It'd be also nice to print:

primary key:
secondary indices: , etc..
<data.table>

by default. It's definitely informative to know what the keys and secondary indices are..

@arunsrinivasan arunsrinivasan commented Mar 9, 2016

Also, I think this is better output for:

print(DT, class=TRUE)
   <char> <int> <num>
     site  date     x
1:      A     1    10
2:      A     2    20
3:      A     3    30
4:      B     1    10
5:      B     2    20
6:      B     3    30

It's easier to copy/paste the data.table without the classes in the way. If we can do that, we can turn on printing classes by default.

Thoughts?

@MichaelChirico MichaelChirico commented Mar 9, 2016

@arunsrinivasan about printing keys:

  • Isn't that the point of tables()? (though TBH I almost never use this function) BTW tables, to the extent that it's useful, could go for an update to add a secondary_indices column...
  • You don't consider this subsumed by point # 7 here? See this chat (interrupted in the middle) b/w Frank and I about possibilities for filling # 7. Or perhaps you'd like to replace point # 7 with your idea. What do you think?

About class:

This can be done, but will require a step of wrangling -- basically toprint <- rbind(rownames(toprint), toprint); rownames(toprint) <- abbs. Which is fine, I'm just curious why you're thinking of easier copy-pasting as a clear advantage? Not sure the cost of including class info in copy-pasted output. Happy to hear feedback.

@arunsrinivasan arunsrinivasan commented Mar 9, 2016

About class: -- copy pasting from SO, for example to provide input to fread(). I also find it easier without the separation between column name and value (just used to seeing it).

On printing keys:

  • Yes, but it gives it for all tables, which is useful in itself. But if I'd like to see just the keys retained after a join operation, I don't necessarily want to have a look at all the tables' key.
  • I don't think point 7 (drawing lines) would work well.. since it can not (AFAICT) tell the order of key columns.. But stating:

primary key: <a, b>

clearly tells the first key column is "a", then "b"..

Does this clarify things a bit?

@arunsrinivasan arunsrinivasan commented Mar 9, 2016

I agree tables() could use an update.

@MichaelChirico MichaelChirico commented Mar 9, 2016

@arunsrinivasan OK, I think I can get on board with that. Can ditch point # 7 then. I agree distinguishing key order at a glance was going to be tough. So how about:

  • If a table has a key, say c("key1", "key2"), print the following above the output of print.data.table:

    keys: key1, key2
    
  • If there is no key, print:

    keys: <unkeyed>
    
  • Secondary index printing is optional, but if activated will come below keys a la:

    Secondary indices: key2.1, key2.2, ...
                       key3.1, key3.2, ...
    

Lastly, I propose sending this output through message to help distinguish it from the data.table itself visually.

@arunsrinivasan arunsrinivasan commented Mar 9, 2016

My suggestion would be this:

  1. If either of these attributes are not present, don't print them. I think people will quickly learn that no keys are set (if it isn't displayed).
  2. Since there can be more than 1 secondary index, I suggest the format be:

Keys: <col1, col2> (only one)
Secondary Indices: , , <col1, col2>, ...
If there are more than 'x' (=5 to begin with?) indices, use a "...". They can always access it using key2().

I don't mind "<>" being replaced with "" if that'd be more aesthetically pleasing.. e.g., "col1,col2", "col1" etc..

Last proposal: seems nice, but I wonder if it might create issues with knitr when people suppress 'messages' in a chunk and print the output?
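
The header format suggested in this comment could be sketched roughly as follows. The function `format_key_header()` and its arguments are hypothetical, and the key and index columns are passed in explicitly rather than read from a data.table's attributes:

```r
# Hypothetical helper: build the header lines described above.
# `key` is a character vector of key columns (or NULL if unkeyed);
# `indices` is a list of character vectors, one per secondary index.
format_key_header <- function(key = NULL, indices = list(), max_indices = 5L) {
  bracket <- function(cols) paste0("<", paste(cols, collapse = ", "), ">")
  lines <- character(0L)
  # Print nothing at all for an attribute that is not set
  if (length(key)) {
    lines <- c(lines, paste("Keys:", bracket(key)))
  }
  if (length(indices)) {
    shown <- vapply(utils::head(indices, max_indices), bracket, character(1L))
    # Abbreviate with "..." once there are more than `max_indices` indices
    if (length(indices) > max_indices) shown <- c(shown, "...")
    lines <- c(lines, paste("Secondary Indices:", paste(shown, collapse = ", ")))
  }
  lines
}

hdr <- format_key_header(c("key1", "key2"), list("V1", c("key2", "V1")))
cat(hdr, sep = "\n")
# Keys: <key1, key2>
# Secondary Indices: <V1>, <key2, V1>
```

Writing the header with cat() rather than message() keeps it on standard output, so it would still appear in a knitr chunk that sets message = FALSE.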

@arunsrinivasan arunsrinivasan commented Mar 9, 2016

It'd be great to have this and class=TRUE default for v1.9.8 already.. we'll see.

@arunsrinivasan arunsrinivasan commented Mar 9, 2016

One other thought:

Many people use "numeric" type when an integer type would suffice, and when "integer64" would fit the bill better. How about marking those columns somehow while printing?

instead of <num>, perhaps >num< ?? That'll allow people to be aware of such optimisations as well..

@arunsrinivasan arunsrinivasan commented Mar 9, 2016

OR "!num!"? There's a function isReallyReal that checks this. But this'll perhaps be too time consuming to run on all rows every time..

@MichaelChirico MichaelChirico commented Mar 9, 2016

@arunsrinivasan Hmm I think it's definitely not something to be used as a part of print.data.table default.

Some initial musings:

  • Could add an option to do so, and a companion function (check_num_cols or the like) which runs this on an input table and spits out the candidate columns.
  • Could do this the first time only -- have some sort of global variable associated with each data.table in memory which we use to trigger the evaluation
  • Could have this as part of the standard (or verbose) output of fread (since I imagine that's where most data.tables are created in general). I guess setDT is the other big source.

Are you thinking of pushing 1.9.8 soon?

Oh, one more thing, what do you think about porting print.data.table to its own .R file?

@arunsrinivasan arunsrinivasan commented Mar 9, 2016

Hm, yes, let's forget the marking of columns for now.

On pushing 1.9.8: trying as much as possible to wrap up the other marked issues as quickly as possible. I'd like to work on non-equi joins for this release.

On print.data.table to separate file, sure, sounds good.

@MichaelChirico MichaelChirico commented Mar 10, 2016

@arunsrinivasan just a heads up that setting class = TRUE as the default is causing 100s of errors in the tests

@arunsrinivasan arunsrinivasan commented Mar 10, 2016

Okay thanks, will take a look.

@franknarf1 franknarf1 commented Aug 19, 2017

Another idea: an option dput = TRUE, that will write reproducible code (since dput(DT) doesn't work). Something like

dtput = function(DT){
  d0 = capture.output(dput(setattr(data.table:::shallow(DT), ".internal.selfref", NULL)))
  cat("data.table::alloc.col(", d0, ")\n", sep="\n")
}

# example
library(data.table)
DT = as.data.table(as.list(1:10))
dtput(DT)
# which writes...
data.table::alloc.col(
structure(list(V1 = 1L, V2 = 2L, V3 = 3L, V4 = 4L, V5 = 5L, V6 = 6L, 
    V7 = 7L, V8 = 8L, V9 = 9L, V10 = 10L), .Names = c("V1", "V2", 
"V3", "V4", "V5", "V6", "V7", "V8", "V9", "V10"), row.names = c(NA, 
-1L), class = c("data.table", "data.frame"))
)

... except less hacky and embedded in print.data.table. I guess if dput = TRUE, all the others can be ignored. Getting fancy, maybe allow dput = "file.txt" like dput() does. I figure it makes enough sense to put it in print, and it's not worth it to add a new function.

@franknarf1 franknarf1 commented Dec 6, 2017

Another idea similar to those in #645: turn off smart truncation of list column display (example from SO).

I see this truncation pretty frequently, and in some cases it'd be nice to see printing as if list column v was sapply(v, toString) instead.

@MichaelChirico MichaelChirico commented Jan 10, 2018

@franknarf1 I think a very easy fix would be here:

paste(c(format(head(x,6), justify=justify, ...), if(length(x)>6)""),collapse=",")

change "" to "...". What do you think? I like toString, but should also come with a default width parameter, I'm not sure how to do that robustly.


actually, re-reading toString.default:

function (x, width = NULL, ...) 
{
    string <- paste(x, collapse = ", ")
    if (missing(width) || is.null(width) || width == 0) 
        return(string)
    if (width < 0) 
        stop("'width' must be positive")
    if (nchar(string, type = "w") > width) {
        width <- max(6, width)
        string <- paste0(strtrim(string, width - 4), "....")
    }
    string
}

It seems the default way of handling width is similar to what's currently implemented. I think limiting output based on on-screen width rather than truncating to the first few elements is better, no?

This approach also allows better user interaction since toString is S3-registered -- we (or end users) could write/customize toString.* methods for any use cases that arise. Perhaps add a colWidth parameter to print.data.table which would be dropped into width of toString.default...
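
A small sketch of the width-based approach, using only base R's toString() and its documented `width` argument. The helper name `format_list_col()` and the `col_width` argument are assumptions that stand in for the colWidth idea above:

```r
# Hypothetical helper: render each element of a list column as one string,
# capped at `col_width` characters via toString()'s `width` argument.
format_list_col <- function(v, col_width = 20L) {
  vapply(v, function(el) toString(el, width = col_width), character(1L))
}

v <- list(1:2, 1:50, letters)
res <- format_list_col(v, col_width = 12L)
print(res)
```

Short elements come through unchanged ("1, 2" here), while longer ones are cut to `col_width` characters and end in "....", as in the toString.default source quoted above. Because toString() is an S3 generic, a class-specific method would be picked up automatically by the same call.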

@franknarf1 franknarf1 commented Jan 10, 2018

@MichaelChirico One point in favor of the trailing "," over a ",..." is that it saves horizontal space. Nonetheless, that seems like a good change, since most users won't guess what the "," means.

Rather than that change, I was more interested in printing a higher number of entries in place of 6 in head(x, 6), like your colWidth idea.

Re methods, I'd find an argument like formatters = list(character = function(x) toString(x), lm = function(x) x$qr$tol) easy to use (to be used for list columns provided every element matches the named class or is NULL). Not sure if that's what you meant.

@jsams jsams commented May 23, 2018

Thought I would drop a mention of #2893 here as the two seem closely related.

@franknarf1 franknarf1 commented Aug 17, 2018

(Similar to my last comment...) Having a data.table like...

library(data.table)
(DT <- data.table(id = 1:2, v = numeric_version("0.0.0")))
#   id                 v
# 1:  1 <numeric_version>
# 2:  2 <numeric_version>

I cannot really read the contents of my list column, even though there is a print method for it.

It would be nice to have a way to tell data.table how I want a list column of a certain class printed, like ...

library(magrittr)

formatters = list(numeric_version = as.character)

printDT = data.table:::shallow(DT)
left_cols = which(sapply(DT, is.list))
for (i in seq_along(formatters)){
    if (length(left_cols) == 0L) break 
    alt_cols = left_cols[ sapply(DT[, ..left_cols], inherits, names(formatters)[i]) ]    
    if (length(alt_cols)){
      printDT[, (alt_cols) := lapply(.SD, formatters[[i]]), .SDcols = alt_cols][]
      left_cols = setdiff(left_cols, alt_cols)
    }
}
print(printDT)

   id     v
1:  1 0.0.0
2:  2 0.0.0

Could have that list passed by the user in options(datatable.print.formatters = formatters). To reduce the computational burden, I guess this would be done after filtering with nrows= and topn=.

@HughParsonage HughParsonage commented Feb 4, 2019

(If I want to suggest an addition to this list, do I add it here or add it as a discrete issue?)

@MichaelChirico MichaelChirico commented Feb 4, 2019

@jangorecki jangorecki added this to the 1.12.4 milestone Feb 4, 2019

@jangorecki jangorecki commented Feb 4, 2019

The fewer points defined in scope, the easier it is to merge a PR for it. It definitely makes sense to separate points that may result in a breaking change (if any) from those for which default behaviour will not change.

@MichaelChirico MichaelChirico commented Feb 4, 2019

@fparages fparages mentioned this issue Apr 10, 2019

@franknarf1 franknarf1 commented Apr 11, 2019

As an extension to @fparages' #3500 (addressing the timezone display item in the OP of this issue/thread), it might be nice to also support the tz being printed in the class header, <POSc:-07:00> or <POSc:PDT>, and not in the column (to save horizontal space), eg when class=tz=TRUE.
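
To illustrate the header idea, here is a base-R sketch of how such a label might be built. The helper `class_label()` is hypothetical, and the "<POSc:...>" form simply follows the notation in this comment:

```r
# Hypothetical helper: class-header label for a column, adding the time zone
# for POSIXct columns so it need not be repeated in every row.
class_label <- function(x) {
  if (inherits(x, "POSIXct")) {
    tz <- attr(x, "tzone")
    # A missing or empty tzone attribute means "use the local time zone"
    if (is.null(tz) || !nzchar(tz[1L])) tz <- "local"
    return(sprintf("<POSc:%s>", tz[1L]))
  }
  sprintf("<%s>", class(x)[1L])
}

x <- as.POSIXct("2019-04-11 12:00:00", tz = "UTC")
class_label(x)
# "<POSc:UTC>"
class_label(1L)
# "<integer>"
```

Printing a UTC offset such as -07:00 instead of the zone name would need extra work, since the offset can differ between rows of the same column (for example across a daylight-saving change).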

@MichaelChirico MichaelChirico commented Apr 11, 2019

^ related: #2842

@randomgambit randomgambit commented Jun 30, 2019

That would be awesome!
