Improved show for DataFrames #995

Merged
nalimilan merged 6 commits into JuliaData:master from bkamins:newshow on Sep 19, 2017

Member

@bkamins bkamins commented Jun 12, 2016

A proposal to solve #760.

Summary of changes:

  • show and showall get a new argument onechunk which limits the number of printed chunks to 1 if splitchunks is true (with appropriate message in the summary)
  • showcols gets two parameters allcols (if all columns of an AbstractDataFrame should be printed or only those fitting on the screen) and values (if sample of values of an AbstractDataFrame should be printed)
  • showcompact is defined for AbstractDataFrame and GroupedDataFrame

@nalimilan
Member

Thanks. These positional arguments are really getting out of hand. We should probably get rid of splitchunks (whose name is really confusing) and tell people to use showcols instead. Then onechunk could be renamed to allcols. What do you think?

I don't think showcompact is intended for this kind of use: as the docs say, it's mainly for scalar values to provide a short representation without type information to be used inside arrays.

#'
#' @returns o::Void A `nothing` value.
#'
#' @examples
#'
#' df = DataFrame(A = 1:3, B = ["x", "y", "z"])
#' showcols(df, true)
function showcols(io::IO, df::AbstractDataFrame) # -> Void
#' showcols(STDOUT, df)
Member

The example was correct, STDOUT is implicit.

Member Author

I thought it should stay, as the example with implicit STDOUT is given below for the definition of the function showcols(df::AbstractDataFrame, allcols::Bool=false, values::Bool=true).

Member

OK. Anyway until we turn these into real docstrings this is pretty abstract.

@bkamins
Member Author

bkamins commented Jun 13, 2016

@nalimilan Thx for the comments. I will correct the PR.

For the record let me add that this kind of output also should be fixed:

df = DataFrame(x=["a", "\t", "\\", "\n", "\$", "z"], y=1:6)
6×2 DataFrames.DataFrame
│ Row │ x   │ y │
├─────┼─────┼───┤
│ 1   │ "a" │ 1 │
│ 2   │ "\t"  │ 2 │
│ 3   │ "\\" │ 3 │
│ 4   │ "\n"  │ 4 │
│ 5   │ "\$" │ 5 │
│ 6   │ "z" │ 6 │

@bkamins
Member Author

bkamins commented Jun 14, 2016

@nalimilan I hope I have covered all your comments correctly.

I have left splitchunks internally to differentiate the behavior of show and showall (similarly to arrays).

Additionally I have changed the formula calculating the width of the string so that DataFrames render correctly with escaped strings.
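
To illustrate why the width formula matters for escaped strings, here is a minimal sketch using only Base functions. The helper name `shownwidth` is hypothetical and is not the function changed in this PR; it simply measures how many terminal columns a value occupies when printed the way arrays show strings (quoted and escaped).

```julia
# Hypothetical helper (not from the PR): display width of a value's
# quoted, escaped representation as produced by `repr`.
shownwidth(x) = textwidth(repr(x))

for s in ["a", "\t", "\\", "\n", "\$", "z"]
    # `length(s)` counts characters of the raw value, while `shownwidth(s)`
    # counts the columns needed once quotes and escapes are printed.
    println(rpad(repr(s), 6), " raw length = ", length(s),
            ", shown width = ", shownwidth(s))
end
```

A column padded according to the raw length would be too narrow for values such as a tab or a newline, which is one plausible cause of the misaligned borders shown above.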

@quinnj
Member

quinnj commented Sep 7, 2017

I think this would be a great fix; care to rebase?

@bkamins
Member Author

bkamins commented Sep 7, 2017

Sure - I thought it was rejected.

@quinnj
Member

quinnj commented Sep 7, 2017

Sorry, the package has struggled w/ maintenance over the last year or so as development moved over to DataTables.jl, but all that work has now been backported here and all development will now resume here.

@nalimilan
Member

Sorry, I think I didn't finish the review because I wanted to get a complete understanding of the design space, and I didn't find the time for that at that point.

@bkamins
Member Author

bkamins commented Sep 10, 2017

I have started to update the PR and have one issue to clarify.
There is a change in how DataFrames performs show between the latest release and master.

Consider running the following line in the REPL:

df = DataFrame(A = 1:4,
               B = ["x\"", "y\n", "z\$", "ABC"],
               C = Float32[1.0, 2.0, 3.0, 4.0],
               D = Symbol[:ABC,Symbol("x\""),Symbol("y\n"),Symbol("z\$")])

on master it shows:

4×4 DataFrames.DataFrame
│ Row │ A │ B   │ C   │ D   │
├─────┼───┼─────┼─────┼─────┤
│ 1   │ 1 │ x"  │ 1.0 │ ABC │
│ 2   │ 2 │ y
   │ 2.0 │ x"  │
│ 3   │ 3 │ z$  │ 3.0 │ y
  │
│ 4   │ 4 │ ABC │ 4.0 │ z$  │

and on the latest release it shows:

4×4 DataFrames.DataFrame
│ Row │ A │ B     │ C   │ D   │
├─────┼───┼───────┼─────┼─────┤
│ 1   │ 1 │ "x\""  │ 1.0 │ ABC │
│ 2   │ 2 │ "y\n"   │ 2.0 │ x"  │
│ 3   │ 3 │ "z\$"  │ 3.0 │ y
  │
│ 4   │ 4 │ "ABC" │ 4.0 │ z$  │

and if we cast df to an array we get yet another output (it could be reproduced by show of DataFrame with column names and pipes added - here I want to concentrate on how the field values are printed):

julia> convert(Matrix, df)
4×4 Array{Any,2}:
 1  "x\""  1.0  :ABC
 2  "y\n"  2.0  Symbol("x\"")
 3  "z\$"  3.0  Symbol("y\n")
 4  "ABC"  4.0  Symbol("z$")

Which is the preferred target printing style? Or maybe yet some other option?
I personally would feel comfortable with the third one (show values like show for arrays) as it would be consistent.

@quinnj
Member

quinnj commented Sep 11, 2017

I definitely think we should be consistent w/ array printing (last example). Show strings as quoted strings, and symbols the same way. I think that's the only real solution if we want to be able to show unicode + control characters/whitespace and maintain the correct column widths.

@nalimilan
Member

Agreed, the Array output looks like a good reference. Though printing quotes around strings is a bit verbose, and we could get rid of it if we printed the column eltype in a header (like tibbles in R).

@bkamins
Member Author

bkamins commented Sep 11, 2017

Thank you for the comments. Regarding refactoring of show I have some additional thoughts:

  • if we omit " in strings how do we visually distinguish "NA" string from true NA (in R this is a problem with "<NA>")
  • how do we handle columns of custom types (normal and Nullable - whatever they will be eventually called) - in particular when their representation is very long (should some truncation be applied); in particular current show has problem for calculation of width of such structures, e.g.:
julia>     df = DataFrame(A=[[1:25;],"sdf"])
2×1 DataFrames.DataFrame
│ Row │ A                                                                                           │
├─────┼─────────────────────────────────────────────────────────────────────────────────────────────┤
│ 1   │ [1, 2, 3, 4, 5, 6, 7, 8, 9, 10  …  16, 17, 18, 19, 20, 21, 22, 23, 24, 25] │
│ 2   │ "sdf"                                                                                       │

In short - the problem is complex so I will try to give it some thought and will open a separate issue and write down my recommendation. I will strip this PR from field width calculation changes and leave only improved display features.

@nalimilan
Member

if we omit " in strings how do we visually distinguish "NA" string from true NA (in R this is a problem with "<NA>")

Yes, that would be a problem unless we add a header with the eltype of each column. But we can keep the quotes for now and discuss that possibility later.

how do we handle columns of custom types (normal and Nullable - whatever they will be eventually called) - in particular when their representation is very long (should some truncation be applied); in particular current show has problem for calculation of width of such structures, e.g.:

Some truncation should probably be applied (I think you can do that by setting a property on IOContext now). But that can also be improved later, no need to fix everything in a single PR. It's not that common to have fields like that in DataFrames anyway.
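
As a rough sketch of the truncation idea, the snippet below shows a long vector with and without the `:limit` property set on an `IOContext`. That `:limit` is the property meant in the comment above is an assumption; the snippet only demonstrates what Base does for plain vectors and does not touch DataFrames.

```julia
v = collect(1:25)

# Full representation: every element is printed.
full = sprint(show, v)

# Same value shown through an IOContext with :limit => true (assumed to be
# the property referred to above); Base then elides the middle of long vectors.
buf = IOBuffer()
io = IOContext(buf, :limit => true, :displaysize => (24, 80))
show(io, v)
limited = String(take!(buf))

println(full)
println(limited)
```

The limited form is the one that appears in the `DataFrame(A=[[1:25;],"sdf"])` example earlier in the thread, where the cell is elided but the column width was apparently computed from the full representation.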

@nalimilan
Member

@bkamins I think you need to rebase on the latest master. If there are still unrelated commits, use git rebase -i master and remove them.

@bkamins
Member Author

bkamins commented Sep 11, 2017

Agreed - that is why in this PR I have left only changes to show global behavior and left other changes for the future.

One question regarding git: I have made a merge not a rebase and now I can see the bad consequences that all the intermediate commits got included. Is there any simple and safe way to fix it?

@coveralls

Coverage Status

Changes Unknown when pulling a58e9d6 on bkamins:newshow into JuliaData:master.

@nalimilan
Member

Hmm... I guess the easiest solution would be to start a new branch from master, cherry-pick your commits into it, and then force push to this branch using git push --force bkamins :newshow.

@coveralls

coveralls commented Sep 12, 2017

Coverage Status

Coverage increased (+1.03%) to 88.15% when pulling c59bbf9 on bkamins:newshow into 885078a on JuliaData:master.

@bkamins
Member Author

bkamins commented Sep 13, 2017

Just as a comment: I believe that the build failure on the latest Julia is unrelated to this PR.

@@ -315,21 +317,30 @@ function showrows(io::IO,
 rowindices2::AbstractVector{Int},
 maxwidths::Vector{Int},
 splitchunks::Bool = false,
-rowlabel::Symbol = :Row,
+allcols::Bool = true,
+rowlabel::Symbol = Symbol("Row"),
Member

Shouldn't change this line. Same below (twice).

#' @param allcols::Bool If `false` (default), only a subset of columns
#' fitting on the screen is printed.
#' @param values::Bool If `true` (default), first and last value of
#' each column is printed.
Member

"are". Maybe also add "the" (not a native speaker here).

#' showcols(df, true)
function showcols(io::IO, df::AbstractDataFrame) # -> Void
#' showcols(STDOUT, df)
function showcols(io::IO, df::AbstractDataFrame, allcols::Bool = false, values::Bool = true) # -> Void
Member

Keep rows below 92 chars (same elsewhere).

nrows, ncols = size(df)
if values && nrows > 0
if nrows == 1
metadata[:Values] = [Symbol(sprint(showcompact, df[1, i])) for i in 1:ncols]
Member

Should this use ourshowcompact? Why do you need Symbol?

#' count.
#'
#' @param df::AbstractDataFrame An AbstractDataFrame.
#' @param allcols::Bool If `false` (default), only a subset of columns
Member

Maybe just call this all, since "col" is already clear from the function's name?

Member

This still applies, right?

test/show.jl Outdated

io = IOBuffer()
show(io, df)
show(io, df, true)
showall(io, df)
showall(io, df, true)
showall(io, df, false)
Member

Could you test the actual output? You can use triple-quoted strings for that.
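
A minimal sketch of the suggested testing pattern, assuming a current Julia where the `Test` standard library is available (at the time of this PR it lived in `Base.Test`). The printer `printpair` is a hypothetical stand-in for `show(io, df)`, so that the snippet runs without DataFrames.

```julia
using Test

# Hypothetical two-line printer standing in for show(io, df).
printpair(io::IO, a, b) = print(io, "A: ", a, "\nB: ", b)

# Capture the printed output in a buffer instead of the terminal.
io = IOBuffer()
printpair(io, 1, "x")
str = String(take!(io))

# A triple-quoted literal keeps the expected output readable line by line;
# the newline right after the opening quotes is not part of the string.
refstr = """
A: 1
B: x"""

@test str == refstr
```

Comparing against a literal catches regressions in the rendered layout, which a call that merely runs without error would not.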

@bkamins
Member Author

bkamins commented Sep 14, 2017

@nalimilan Inline comments got removed so I reply here:

  • changed Symbol("Row") to :Row
  • @param values comment string corrected
  • all lines are below 92 chars
  • I have removed Symbol in :Values formatting, but if we change the way DataFrame columns containing strings are printed it might have to be revised
  • allcols changed to all (but one has to remember it clashes with the all function)
  • added testing of actual output in tests

@coveralls

coveralls commented Sep 15, 2017

Coverage Status

Coverage increased (+1.0%) to 88.08% when pulling 14c321c on bkamins:newshow into 885078a on JuliaData:master.

Member

@nalimilan nalimilan left a comment

Thanks! Sorry for bothering you with tests, but that's the only way to ensure somebody doesn't break your improvements in the future.

@@ -297,6 +297,8 @@ end
#' required to render each column.
#' @param splitchunks::Bool Should the printing of the AbstractDataFrame
#' be done in chunks? Defaults to `false`.
#' @param allcols::Bool Should only one chunk be printed if printing in
#' chunks? Defaults to `false`.
Member

Defaults to false.

Member Author

changed to true

if isempty(rowindices1)
if displaysummary
println(io, summary(df))
end
return
end

rowmaxwidth = maxwidths[ncols + 1]
chunkbounds = getchunkbounds(maxwidths, splitchunks, displaysize(io)[2])
nchunks = length(chunkbounds) - 1
Member

Would be clearer to do nchunks = allcols ? length(chunkbounds) - 1 : min(nchunks, 1).

Member Author

changed (with a bit different code as nchunks is undefined before this line)
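
One way the adjusted expression could look is sketched below. The function name `chunkstoprint` and the sample `chunkbounds` vector are hypothetical; the actual code merged in the PR may differ.

```julia
# Hypothetical stand-in for the result of getchunkbounds: boundaries between
# column chunks, so four entries describe three chunks.
chunkbounds = [0, 3, 6, 8]

# Print every chunk when allcols is true, otherwise at most one.
# The total is computed first because nchunks does not exist yet at this point.
function chunkstoprint(chunkbounds::Vector{Int}, allcols::Bool)
    total = length(chunkbounds) - 1
    return allcols ? total : min(total, 1)
end

println(chunkstoprint(chunkbounds, true))
println(chunkstoprint(chunkbounds, false))
```

Using `min(total, 1)` rather than a literal `1` keeps the result at zero when there are no chunks at all.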

showall(io, metadata, true, Symbol("Col #"), false)
nrows, ncols = size(df)
if values && nrows > 0
# type of Values column is now String; it might need to be changed
Member

I don't think this comment is needed: tests will (or should) catch this and people will figure out what needs to be changed anyway.

Member Author

removed

# type of Values column is now String; it might need to be changed
# if the way strings are printed in data frames changes
if nrows == 1
metadata[:Values] = [sprint(showcompact, df[1, i]) for i in 1:ncols]
Member

Also, what about using ourshowcompact?

Member Author

Changed - but it creates problems in corner cases (described in TODO for getmaxwidths)

4×3 DataFrames.DataFrame
│ Row │ A │ B │ C │
├─────┼───┼───────────────┼─────┤
│ 1 │ 1 │ x\" │ 1.0 │
Member

I suppose the fact that vertical lines are not aligned is a bug elsewhere? Then better leave a TODO somewhere to make it clear.

Member Author

They are aligned when the string is printed, but " needs to be escaped in string literal which breaks alignment in the code.

Member

Ah, of course!

Member Author

Actually there is a problem - in the next line │ 2 │ 2 │ ∀ε⫺0: x+ε⫺x │ 2.0 │ which is not aligned properly, and it is a TODO to be added for the getmaxwidths function. Sorry for the confusion.

test/show.jl Outdated
show(io, df, true)
showall(io, df)
showall(io, df, true)
show(io, df_big)
Member

Could you also test the output of these functions (even if that's verbose, it's fine)? Else the case when there are too many rows won't be covered. You should probably pass a custom IOContext to control the size of the display. Also better define df_big here rather than above, where it isn't used.

test/show.jl Outdated
df = DataFrame(A = 1:3, B = ["x", "y", "z"])
# In the future newline character \n should be added to this test case
df = DataFrame(A = 1:4, B = ["x\"", "∀ε⫺0: x+ε⫺x", "z\$", "ABC"],
C = Float32[1.0, 2.0, 3.0, 4.0])
Member

Could you add a null value somewhere so that this is covered (unless it's done elsewhere already)?

Member Author

It is already covered I believe in line:

df = DataFrame(Fish = ["Suzy", "Amir"], Mass = [1.5, null])

at the end of the file

Member

Yes, but showcols isn't tested there. Would be worth adding a test.

Member Author

added showcols test

@bkamins
Member Author

bkamins commented Sep 15, 2017

@nalimilan I hope I have managed to clean up everything.

test/show.jl Outdated
@@ -41,27 +39,181 @@ module TestShow
refstr = """
4×3 DataFrames.DataFrame

Member

Just a detail, but we probably don't need an empty line? That would be more consistent with the other format.

Member Author

removed

│ 24 │ 0.762276 │ 0.755415 │
│ 25 │ 0.339081 │ 0.649056 │"""

io = IOContext(IOBuffer(), :displaysize=>(10,40))
Member

Maybe set the number of rows to a lower value in order to have smaller test and check what happens when not all rows can be shown in a single page? Can also be done in a later PR if you prefer.

Member Author

I believe this is what I check here. I assume 10 rows and 40 columns, and you can see the difference between show and showall.
show limits the output to fit the page height and showall does not.
They also differ in how they handle wide data (not fitting the screen vertically) when we set allcols to true: show does paging, while showall prints the full table ignoring :displaysize (which could be useful, e.g. when we want to dump the DataFrame show result to a file).
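
The role of `:displaysize` in such a test can be sketched without DataFrames. The printer `showrows_limited` below is hypothetical and only mimics the "fit the page height" behavior described above; it is not how show is implemented in the package.

```julia
# Wrap a buffer in an IOContext that reports a 10-row by 40-column display.
buf = IOBuffer()
io = IOContext(buf, :displaysize => (10, 40))

# Hypothetical printer (not DataFrames code): prints as many rows as fit in
# the reported height, reserving one line for a continuation marker.
function showrows_limited(io::IO, rows::Vector{String})
    height = displaysize(io)[1]
    nshown = min(length(rows), height - 1)
    foreach(r -> println(io, r), rows[1:nshown])
    # Mark that some rows were left out.
    nshown < length(rows) && print(io, "⋮")
    return nothing
end

showrows_limited(io, ["row $i" for i in 1:25])
output = String(take!(buf))
print(output)
```

Because the display size is fixed by the context rather than by the terminal running the tests, the captured output is the same on every machine and can be compared against a reference string.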

Member

@nalimilan nalimilan left a comment

Thanks, looks good to me! Maybe others have comments?

@coveralls

coveralls commented Sep 15, 2017

Coverage Status

Coverage increased (+0.9%) to 88.064% when pulling 73c3c8e on bkamins:newshow into 885078a on JuliaData:master.

@coveralls

coveralls commented Sep 15, 2017

Coverage Status

Coverage increased (+0.9%) to 88.064% when pulling b376700 on bkamins:newshow into 885078a on JuliaData:master.

Contributor

@cjprybol cjprybol left a comment

@bkamins this looks great! Any idea what this error is from? https://ci.appveyor.com/project/nalimilan/dataframes-jl/build/1.0.392/job/0dmu2dx08fjf51ss#L133

edit: looks like an Int32/64 comparison issue

@bkamins
Member Author

bkamins commented Sep 19, 2017

@cjprybol fixed the issue with tests on 32 bit machine.

@coveralls

coveralls commented Sep 19, 2017

Coverage Status

Coverage increased (+0.9%) to 88.064% when pulling 3d18d7d on bkamins:newshow into 885078a on JuliaData:master.

@nalimilan
Member

Thanks! Merging since Travis doesn't seem to be willing to run on Mac...

@nalimilan nalimilan merged commit e06ac96 into JuliaData:master Sep 19, 2017
@bkamins bkamins deleted the newshow branch September 19, 2017 20:57