Various enhancements to print.data.table #1523

Open
MichaelChirico opened this issue Feb 6, 2016 · 56 comments
Comments

@MichaelChirico
Member

@MichaelChirico MichaelChirico commented Feb 6, 2016

Current task list:

  • 1. Add .Rd file for print.data.table
  • 2. Ability to turn off row numbers [1) from #645/R-F#1957 - Yike Lu; handled in this commit, Nov. 12, 2013]
  • 3. Ability to turn off smart table wrapping [2) from #645/R-F#1957 - Yike Lu]
  • 4. Ability to force-print all entries [3) from #645/R-F#1957 - Yike Lu; handled in this commit, Sep. 14, 2012]
  • 5. Ability to demarcate by-groupings [4) from #645/R-F#1957 - Yike Lu]
  • 6. Demarcation of table border [part of 5) from #645/R-F#1957 - Yike Lu]
  • 7. Demarcation of key columns [part of 5) from #645/R-F#1957 - Yike Lu]
  • 8. Fungible option for whether row numbers are printed [#1097 - @smcinerney]
  • 9. Options for whether/which registers of column names to print [#1482 - Oleg Bondar on SO]
  • 10. Option for dplyr-like printing [see below - @MichaelChirico]
  • 11. Facilities for compact glance at data a la dplyr tbl_df [#1497 - @nverno; #2608 - @vlulla]
  • 12. Option for specifying a truncation character [#1374 - @jangorecki]
  • 13. Handling of empty-named data.table [#545/R-F#5253 - @arunsrinivasan]
  • 14. Improve printing of list/non-atomic columns [see below - @franknarf1 via SO; also #605; handled in #2562]
  • 15. POSIXct columns with timezones should include that information in printed output [#2842 - @MichaelChirico]
  • 16. Limit number of columns printed for very wide tables (i.e. where print.data.table would exceed max.print)

Some Notes

3 (tabled pending clarification)

As I understand it, this issue is a request to prevent the console output from wrapping around (i.e., to force all columns to appear parallel, regardless of how wide the table is).

If that's the case, this is (AFAICT) impossible, since that's something done by RStudio/R itself. I for one certainly don't know of any way to alter this behavior.

If someone does know of a way to affect this, or if they think I'm mis-interpreting, please pipe up and we can have this taken care of.

7

As I see it there are two options here. One is to treat all key columns the same; the other is to treat secondary, tertiary, etc. keys separately.

Example output:

library(data.table)
set.seed(01394)
DT <- data.table(key1 = rep(c("A","B"), each = 4),
                 key2 = rep(c("a","b"), 4),
                 V1 = rnorm(8), key = c("key1","key2"))

# Only demarcate key columns
DT
#    | key1 | | key2 |         V1
#1: |    A | |    a |  0.5994579
#2: |    A | |    a | -1.0898775
#3: |    A | |    b | -0.2285326
#4: |    A | |    b | -1.7858472
#5: |    B | |    a | -0.6269875
#6: |    B | |    a | -0.6633084
#7: |    B | |    b |  1.0367084
#8: |    B | |    b |  0.7364276

# Separately "emboss" keys based on key order
DT
#    | key1 | || key2 ||         V1
#1: |    A | ||    a ||  0.5994579
#2: |    A | ||    a || -1.0898775
#3: |    A | ||    b || -0.2285326
#4: |    A | ||    b || -1.7858472
#5: |    B | ||    a || -0.6269875
#6: |    B | ||    a || -0.6633084
#7: |    B | ||    b ||  1.0367084
#8: |    B | ||    b ||  0.7364276

And of course, add an option for deciding whether to demarcate with | or some other user's-choice character (*, +, etc.)
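
A minimal base-R sketch of the first variant (all key columns treated alike), assuming nothing about data.table internals. The helper name `demarcate_keys` and its `mark` argument are hypothetical; the function only builds the text lines, with the header as the first line.

```r
# Hypothetical helper: wrap the header and cells of key columns in a marker.
demarcate_keys <- function(df, keys, mark = "|") {
  cols <- lapply(names(df), function(nm) {
    vals  <- format(df[[nm]], justify = "right")
    # Pad the header and the values to one common width (right-justified)
    cells <- formatC(c(nm, vals), width = max(nchar(c(nm, vals))))
    if (nm %in% keys) paste(mark, cells, mark) else cells
  })
  # Glue the columns together row by row, separated by single spaces
  do.call(paste, cols)
}

DF <- data.frame(key1 = rep(c("A", "B"), each = 2),
                 key2 = rep(c("a", "b"), 2),
                 V1   = c(0.5, -1, 2, 3.25),
                 stringsAsFactors = FALSE)

writeLines(demarcate_keys(DF, keys = c("key1", "key2")))
# Same idea with a user-chosen marker
writeLines(demarcate_keys(DF, keys = "key1", mark = "*"))
```

Row numbers are left out to keep the sketch short; the second variant would only need a different marker per key position.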

9 [DONE]

Some feedback from a closed PR that was a first stab at solving this:

From Arun regarding preferred options:

col.names = c("auto", "top", "none")

"auto": current behaviour

"top": only on top, data.frame-like

"none": no column names -- exclude rows in which column names would have been printed.

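A short sketch of how the three values could be compared, assuming an installed data.table version in which `print()` accepts `col.names` and `class` (both appear elsewhere in this thread). The table and variable names are made up for illustration.

```r
library(data.table)

# 30 rows: more than 20, so "auto" also repeats the column names at the bottom
DT <- data.table(id = 1:30, value = (1:30) * 10)

# class = FALSE is passed explicitly so the line counts do not depend on the
# installed version's default for the class header row
auto_lines <- capture.output(print(DT, class = FALSE, col.names = "auto"))
top_lines  <- capture.output(print(DT, class = FALSE, col.names = "top"))
none_lines <- capture.output(print(DT, class = FALSE, col.names = "none"))

# Number of printed lines for each setting
c(auto = length(auto_lines), top = length(top_lines), none = length(none_lines))
```

Comparing the captured line counts is just a convenient way to see which rows of column names each setting drops.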
10 [DONE]

It would be nice to have an option to print a row under the row of column names which gives each column's stored type, as is currently (I understand) the default for the output of dplyr operations.

Example from dplyr:

library(dplyr)
DF <- data.frame(n = numeric(1), c1 = complex(1), i = integer(1),
                 f = factor(1), D = as.Date("2016-02-06"), c2 = character(1),
                 stringsAsFactors = FALSE)
tbl_df(DF)
# Source: local data frame [1 x 6]
#
#       n     c1     i      f          D    c2
#   (dbl) (cmpl) (int) (fctr)     (date) (chr) # <- this row
#1     0   0+0i     0      1 2016-02-06      

Current best alternative is to do sapply(DF, class), but it's nice to have a preview of the data with this extra information.

11

This seems closely related to 3. Current plan is to implement this as an alternative to 3 since it seems more tangible/doable.

Via @nverno:

Would it be useful for head.data.table to have an option to print only the head of columns that fit the screen width, and summarise the rest? I was imagining something like the printed output from the head of a tbl_df in dplyr. I think it is nice for tables with many columns.

and the guiding example from Arun:

require(data.table)
dt = setDT(lapply(1:100, function(x) 1:3))
dt
dplyr::tbl_dt(dt)
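
One way the "show only what fits, summarise the rest" idea could work, sketched in base R. The helper `cols_that_fit` is hypothetical, and the width estimate is a simplification: it ignores the row-number gutter and counts one separating space per column.

```r
# Hypothetical helper: split column names into those that fit in `width`
# characters and those that would have to be summarised instead.
cols_that_fit <- function(df, width = getOption("width")) {
  # Printed width of a column = widest of its name and formatted values, plus 1
  w <- vapply(names(df), function(nm) {
    max(nchar(c(nm, format(df[[nm]]))))
  }, integer(1L)) + 1L
  keep <- cumsum(w) <= width
  list(shown = names(df)[keep], hidden = names(df)[!keep])
}

# Same shape as the guiding example: 100 columns, 3 rows
wide <- as.data.frame(setNames(lapply(1:100, function(x) 1:3),
                               paste0("V", 1:100)))

res <- cols_that_fit(wide, width = 20L)
print(wide[, res$shown])
cat(sprintf("... and %d more columns: %s, ...\n",
            length(res$hidden), paste(head(res$hidden, 3L), collapse = ", ")))
```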

12

Currently covered by @jangorecki's PR #1448; Jan, assuming #1529 is merged first, could you edit the print.data.table man page for your PR?

@MichaelChirico MichaelChirico changed the title from "FR: option for dplyr-like data.table printing" to "Various enhancements to print.data.table" Feb 8, 2016
@arunsrinivasan
Member

@arunsrinivasan arunsrinivasan commented Feb 8, 2016

Just brilliant!

@arunsrinivasan
Member

@arunsrinivasan arunsrinivasan commented Feb 8, 2016

No idea about 3 and 5 (as to what they mean).
I think a PR for 6 would be nice (seems straightforward from what Jan wrote there). Perhaps ?print.data.table is the time consuming part? Do you think you'd be up for this, @MichaelChirico ?
No idea as to what 7 means either..
8 is another great idea. PR would be great!

@arunsrinivasan
Member

@arunsrinivasan arunsrinivasan commented Feb 8, 2016

It'd be really nice if GitHub would allow assigning tasks to people who aren't necessarily members of the project :-(.

@arunsrinivasan
Member

@arunsrinivasan arunsrinivasan commented Feb 8, 2016

There's also #1497

@MichaelChirico
Member Author

@MichaelChirico MichaelChirico commented Feb 9, 2016

@arunsrinivasan should I try and PR this one issue at a time? Or in one fell swoop? I've got 8 basically taken care of, just need to add tests.

@arunsrinivasan
Member

@arunsrinivasan arunsrinivasan commented Feb 9, 2016

Michael, separate PRs.

@nverno

@nverno nverno commented Feb 10, 2016

Very nice! Sorry to get back to you late on this, but Arun provided a nice example. It is just a nice convenience when interactively looking at tables with lots of columns, so your console isn't engulfed by a huge data dump when you take a look at the head. I'll close that other one.

arunsrinivasan added a commit that referenced this issue Mar 4, 2016
#1523 progress: adds option for dplyr-inspired column class summary with printing
arunsrinivasan added a commit that referenced this issue Mar 6, 2016
Closes #1097 (progress towards #1523), creates option for printing row names
@arunsrinivasan
Member

@arunsrinivasan arunsrinivasan commented Mar 9, 2016

It'd also be nice to print:

primary key:
secondary indices: , etc..
<data.table>

by default. It's definitely informative to know what the keys and secondary indices are..

@arunsrinivasan
Member

@arunsrinivasan arunsrinivasan commented Mar 9, 2016

Also, I think this is better output for:

print(DT, class=TRUE)
   <char> <int> <num>
     site  date     x
1:      A     1    10
2:      A     2    20
3:      A     3    30
4:      B     1    10
5:      B     2    20
6:      B     3    30

It's easier to copy/paste the data.table without the classes in the way. If we can do that, we can turn on printing classes by default.

Thoughts?

@MichaelChirico
Member Author

@MichaelChirico MichaelChirico commented Mar 9, 2016

@arunsrinivasan about printing keys:

  • Isn't that the point of tables()? (though TBH I almost never use this function) BTW tables, to the extent that it's useful, could go for an update to add a secondary_indices column...
  • You don't consider this subsumed by point # 7 here? See this chat (interrupted in the middle) b/w Frank and I about possibilities for filling # 7. Or perhaps you'd like to replace point # 7 with your idea. What do you think?

About class:

This can be done, but will require a step of wrangling -- basically toprint <- rbind(rownames(toprint), toprint); rownames(toprint) <- abbs. Which is fine, I'm just curious why you're thinking of easier copy-pasting as a clear advantage? Not sure the cost of including class info in copy-pasted output. Happy to hear feedback.
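
The wrangling step can be sketched on a plain character matrix. This is a base-R illustration, not the actual print.data.table code: `toprint` and `abbs` mirror the names in the comment above, full class names stand in for the abbreviations, and the use of `colnames` is an assumption about what the pseudocode intends.

```r
DF <- data.frame(site = c("A", "B"), date = 1:2, x = c(10, 20),
                 stringsAsFactors = FALSE)

# Character matrix of formatted cells; its colnames are the column names
toprint <- as.matrix(format(DF))

# One "<class>" label per column (assumption: first class only, unabbreviated)
abbs <- paste0("<", vapply(DF, function(col) class(col)[1L], character(1L)), ">")

# Push the column names down into the first row, then use the class labels
# as the header, so names and values sit next to each other
toprint <- rbind(colnames(toprint), toprint)
colnames(toprint) <- abbs
rownames(toprint) <- c("", paste0(seq_len(nrow(DF)), ":"))

print(toprint, quote = FALSE, right = TRUE)
```

With this layout everything below the first printed line is the plain table, which is the copy-paste benefit discussed here.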

@arunsrinivasan
Member

@arunsrinivasan arunsrinivasan commented Mar 9, 2016

About class: -- copy pasting from SO, for example to provide input to fread(). I also find it easier without the separation between column name and value (just used to seeing it).

On printing keys:

  • Yes, but it gives it for all tables, which is useful in itself. But if I'd like to see just the keys retained after a join operation, I don't necessarily want to have a look at all the tables' key.
  • I don't think point 7 (drawing lines) would work well.. since it can not (AFAICT) tell the order of key columns.. But stating:

primary key: <a, b>

clearly tells the first key column is "a", then "b"..

Does this clarify things a bit?

@arunsrinivasan
Member

@arunsrinivasan arunsrinivasan commented Mar 9, 2016

I agree tables() could use an update.

@MichaelChirico
Member Author

@MichaelChirico MichaelChirico commented Mar 9, 2016

@arunsrinivasan OK, I think I can get on board with that. Can ditch point # 7 then. I agree distinguishing key order at a glance was going to be tough. So how about:

  • If a table has a key, say c("key1", "key2"), print the following above the output of print.data.table:

    keys: key1, key2
    
  • If there is no key, print:

    keys: <unkeyed>
    
  • Secondary index printing is optional, but if activated will come below keys a la:

    Secondary indices: key2.1, key2.2, ...
                       key3.1, key3.2, ...
    

Lastly, I propose sending this output through message to help distinguish it from the data.table itself visually.

@arunsrinivasan
Member

@arunsrinivasan arunsrinivasan commented Mar 9, 2016

My suggestion would be this:

  1. If either of these attributes is not present, don't print it. I think people will quickly learn that no keys are set (if it isn't displayed).
  2. Since there can be more than 1 secondary index, I suggest the format be:

Keys: <col1, col2> (only one)
Secondary Indices: , , <col1, col2>, ...
If there are more than 'x' (=5 to begin with?) indices, use a "...". They can always access it using key2().

I don't mind "<>" being replaced with "" if that'd be more aesthetically pleasing.. e.g., "col1,col2", "col1" etc..
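
A rough base-R sketch of that format, taking the key and the indices as plain character vectors rather than reading them from a data.table. The function name `format_key_header` and the `max_indices` argument are hypothetical, and wrapping every index in "<>" is an assumption about the intended layout.

```r
# Hypothetical formatter: returns zero, one or two header lines.
format_key_header <- function(key = NULL, indices = list(), max_indices = 5L) {
  wrap <- function(cols) paste0("<", paste(cols, collapse = ", "), ">")
  out <- character(0L)
  # Absent attributes are skipped rather than reported as missing
  if (length(key)) out <- c(out, paste("Keys:", wrap(key)))
  if (length(indices)) {
    shown <- vapply(head(indices, max_indices), wrap, character(1L))
    # Too many indices: show the first few and mark the rest with "..."
    if (length(indices) > max_indices) shown <- c(shown, "...")
    out <- c(out, paste("Secondary Indices:", paste(shown, collapse = ", ")))
  }
  out
}

writeLines(format_key_header(key = c("col1", "col2"),
                             indices = list("col1", "col2", c("col1", "col2"))))
```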

Last proposal: seems nice, but I wonder if it might create issues with knitr when people suppress 'messages' in chunk.. and print the output?
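
The concern can be illustrated without knitr. In this sketch `suppressMessages()` stands in for a chunk that hides messages, and `print_with_header` is a hypothetical stand-in for a print method that emits a key line before the table body.

```r
# Hypothetical printer: the header goes through message() or through cat()
print_with_header <- function(header, body, via = c("cat", "message")) {
  via <- match.arg(via)
  if (via == "message") message(header) else cat(header, "\n", sep = "")
  cat(body, sep = "\n")
}

# Header sent via message(): it disappears once messages are suppressed
via_message <- capture.output(
  suppressMessages(print_with_header("keys: key1, key2", "1: A a", via = "message"))
)

# Header sent via cat(): it stays part of the regular output
via_cat <- capture.output(
  suppressMessages(print_with_header("keys: key1, key2", "1: A a", via = "cat"))
)

via_message
via_cat
```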

@arunsrinivasan
Member

@arunsrinivasan arunsrinivasan commented Mar 9, 2016

It'd be great to have this and class=TRUE default for v1.9.8 already.. we'll see.

@arunsrinivasan
Member

@arunsrinivasan arunsrinivasan commented Mar 9, 2016

One other thought:

Many people use "numeric" type when an integer type would suffice, and when "integer64" would fit the bill better. How about marking those columns somehow while printing?

instead of <num>, perhaps >num< ?? that'll allow people to be aware of such optimisations as well..

@arunsrinivasan
Member

@arunsrinivasan arunsrinivasan commented Mar 9, 2016

OR "!num!"? There's a function isReallyReal that checks this. But this'll perhaps be too time consuming to run on all rows every time..

@MichaelChirico
Member Author

@MichaelChirico MichaelChirico commented Mar 9, 2016

@arunsrinivasan Hmm I think it's definitely not something to be used as a part of print.data.table default.

Some initial musings:

  • Could add an option to do so, and a companion function (check_num_cols or the like) which runs this on an input table and spits out the candidate columns.
  • Could do this the first time only -- have some sort of global variable associated with each data.table in memory which we use to trigger the evaluation
  • Could have this as part of the standard (or verbose) output of fread, since I imagine that's where most data.tables are created in general. I guess setDT is the other big source.
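
A minimal base-R sketch of what such a companion function might look like. The name `check_num_cols` comes from the first bullet; the body is only an assumption about the check (whole-valued doubles within integer range) and does not call data.table's internal isReallyReal.

```r
# Hypothetical companion function: names of double columns that hold only
# whole numbers and could therefore be stored as integer.
check_num_cols <- function(df) {
  is_candidate <- function(x) {
    # is.numeric() is FALSE for Date, POSIXct, factor, etc.
    if (!is.numeric(x) || !is.double(x)) return(FALSE)
    vals <- x[!is.na(x)]
    if (!length(vals)) return(FALSE)
    # Full scan of the column: the cost that rules this out of default printing
    all(vals == trunc(vals)) && all(abs(vals) <= .Machine$integer.max)
  }
  names(df)[vapply(df, is_candidate, logical(1L))]
}

DF <- data.frame(whole = c(1, 2, NA), frac = c(1.5, 2, 3),
                 int = 1:3, chr = c("a", "b", "c"),
                 stringsAsFactors = FALSE)
check_num_cols(DF)
```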

Are you thinking of pushing 1.9.8 soon?

Oh, one more thing, what do you think about porting print.data.table to its own .R file?

@arunsrinivasan
Member

@arunsrinivasan arunsrinivasan commented Mar 9, 2016

Hm, yes, let's forget the marking of columns for now.

On pushing 1.9.8: I'm trying to wrap up the other marked issues as quickly as possible. I'd like to work on non-equi joins for this release.

On print.data.table to separate file, sure, sounds good.

@MichaelChirico
Member Author

@MichaelChirico MichaelChirico commented Mar 10, 2016

@arunsrinivasan just a heads up that setting class = TRUE as the default is causing 100s of errors in the tests

@arunsrinivasan
Member

@arunsrinivasan arunsrinivasan commented Mar 10, 2016

Okay thanks, will take a look.

@jsams
Contributor

@jsams jsams commented May 23, 2018

Thought I would drop a mention of #2893 here as the two seem closely related.

@franknarf1
Contributor

@franknarf1 franknarf1 commented Aug 17, 2018

(Similar to my last comment...) Having a data.table like...

library(data.table)
(DT <- data.table(id = 1:2, v = numeric_version("0.0.0")))
#   id                 v
# 1:  1 <numeric_version>
# 2:  2 <numeric_version>

I cannot really read the contents of my list column, even though there is a print method for it.

It would be nice to have a way to tell data.table how I want a list column of a certain class printed, like ...

library(magrittr)

formatters = list(numeric_version = as.character)

printDT = data.table:::shallow(DT)
left_cols = which(sapply(DT, is.list))
for (i in seq_along(formatters)){
    if (length(left_cols) == 0L) break 
    alt_cols = left_cols[ sapply(DT[, ..left_cols], inherits, names(formatters)[i]) ]    
    if (length(alt_cols)){
      printDT[, (alt_cols) := lapply(.SD, formatters[[i]]), .SDcols = alt_cols][]
      left_cols = setdiff(left_cols, alt_cols)
    }
}
print(printDT)

   id     v
1:  1 0.0.0
2:  2 0.0.0

Could have that list passed by the user in options(datatable.print.formatters = formatters). To reduce the computational burden, I guess this would be done after filtering with nrows= and topn=.

@HughParsonage
Member

@HughParsonage HughParsonage commented Feb 4, 2019

(If I want to suggest an addition to this list, do I add it here or add it as a discrete issue?)

@MichaelChirico
Member Author

@MichaelChirico MichaelChirico commented Feb 4, 2019

@jangorecki jangorecki added this to the 1.12.4 milestone Feb 4, 2019
@jangorecki
Member

@jangorecki jangorecki commented Feb 4, 2019

The fewer points defined in scope, the easier it is to merge a PR for it. It definitely makes sense to separate points that may result in a breaking change (if any) from those for which default behaviour will not change.

@MichaelChirico
Member Author

@MichaelChirico MichaelChirico commented Feb 4, 2019

@fparages fparages mentioned this issue Apr 10, 2019
@franknarf1
Contributor

@franknarf1 franknarf1 commented Apr 11, 2019

As an extension to @fparages' #3500 (addressing the timezone display item in the OP of this issue/thread), it might be nice to also support the tz being printed in the class header, <POSc:-07:00> or <POSc:PDT>, and not in the column (to save horizontal space), eg when class=tz=TRUE.
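
A small base-R sketch of how such a header label could be derived from the column itself. The helper `class_header` is hypothetical; it reads the time zone from the "tzone" attribute and falls back to a plain label when none is set.

```r
# Hypothetical helper: class label for the header row, with the time zone
# appended for POSIXct columns that carry one.
class_header <- function(x) {
  if (inherits(x, "POSIXct")) {
    tz <- attr(x, "tzone")
    if (is.null(tz) || !nzchar(tz[1L])) "<POSc>" else sprintf("<POSc:%s>", tz[1L])
  } else {
    sprintf("<%s>", class(x)[1L])
  }
}

stamp <- as.POSIXct("2019-04-11 12:00:00", tz = "UTC")
class_header(stamp)
class_header(1:3)
```

Putting the zone in the header keeps each cell as narrow as the bare timestamp.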

@MichaelChirico
Member Author

@MichaelChirico MichaelChirico commented Apr 11, 2019

^ related: #2842

@randomgambit

@randomgambit randomgambit commented Jun 30, 2019

That would be awesome!

@tdhock
Contributor

@tdhock tdhock commented Feb 27, 2020

Hi all, I don't know if you care, but I noticed a bug in print.data.table(col.names="none") when there are lots of columns. Minimal code is:

library(data.table)
x <- 1:30
DT <- data.table(t(x))
print(DT, col.names="none")

output on my system is:

th798@cmp2986 MINGW64 ~/R
$ R --vanilla < datatable-print-bug.R

R version 3.6.1 (2019-07-05) -- "Action of the Toes"
Copyright (C) 2019 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(data.table)
> x <- 1:30
> DT <- data.table(t(x))
> print(DT, col.names="none")
1:  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21
   V22 V23 V24 V25 V26 V27 V28 V29 V30
1:  22  23  24  25  26  27  28  29  30
> 
th798@cmp2986 MINGW64 ~/R
$ 

You can see in the output above that the column names V22 through V30 are printed, but I expected they should not be. What I expected:

> print(DT, col.names="none")
1:  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21
1:  22  23  24  25  26  27  28  29  30
> 