Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make CoefTable implement the Tables.jl interface #629

Merged
merged 2 commits into from
Feb 1, 2021
Merged

Conversation

nalimilan
Copy link
Member

This allows retrieving the contents of a CoefTable object in a convenient form, notably a DataFrame. To avoid introducing a dependency on Tables.jl, CoefTable has to iterate NamedTuples, so that it implements the row-table interface implicitly. This is inefficient since CoefTable uses a column-based storage, but given the typical size of such tables it should not matter.

Comment on lines 35 to 48
──────────────────────────────────────────
Estimate Stderror df p
──────────────────────────────────────────
[1] 0.112582 0.0566454 0.381813 0.8198
[2] 0.368314 0.120781 0.815104 0.6699
[3] 0.344454 0.179574 0.242208 0.4531
──────────────────────────────────────────"""
@test length(ct) === 3
@test eltype(ct) ==
NamedTuple{(:Name, :Estimate, :Stderror, :df, :p),
Tuple{String,Float64,Float64,Float64,Float64}}
@test collect(ct) == [
(Name = "[1]", Estimate = 0.11258244478647295, Stderror = 0.05664544616214151,
df = 0.38181274408522614, p = 0.8197779704008801)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm hesitant as to whether it's better to always create a "Name" column even when no names have been specified. That makes the result more predictable, but that also creates a column which is nearly useless...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is better to keep a fixed schema for this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well there isn't really a fixed schema since each package determines the column names and types. Actually, "Names" would be the only column which is guaranteed to be present. :-)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah - in this case I would leave this out (i.e. rely on the schema provided by the package) and allow packages to specify the schema in full. Then actually we should check if packages provide a proper schema.

This allows retrieving the contents of a `CoefTable` object in a convenient form,
notably a `DataFrame`. To avoid introducing a dependency on Tables.jl, `CoefTable`
has to iterate `NamedTuple`s, so that it implements the row-table interface
implicitly. This is inefficient since `CoefTable` uses a column-based storage,
but given the typical size of such tables it should not matter.
@bkamins
Copy link
Contributor

bkamins commented Jan 5, 2021

Can we add DataFrames.jl as [extras] dependency and just check if a DataFrame is correctly produced (it is not strictly necessary, but I would find it nice to have in the tests 😄).

@nalimilan
Copy link
Member Author

Can we add DataFrames.jl as [extras] dependency and just check if a DataFrame is correctly produced (it is not strictly necessary, but I would find it nice to have in the tests smile).

Is that really useful? As long as we check that the interface is implemented everything should be OK.

Comment on lines 44 to 45
NamedTuple{(:Name, :Estimate, :Stderror, :df, :p),
Tuple{String,Float64,Float64,Float64,Float64}}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
NamedTuple{(:Name, :Estimate, :Stderror, :df, :p),
Tuple{String,Float64,Float64,Float64,Float64}}
NamedTuple{(:Name, :Estimate, :Stderror, :df, :p),
Tuple{String,Float64,Float64,Float64,Float64}}

@bkamins
Copy link
Contributor

bkamins commented Jan 5, 2021

Is that really useful?

This was a soft suggestion - mainly to show users in the tests how this can be used. Alternatively maybe we could put some docstring about it. The point is to have somewhere a clear visual signal about what we allow (and DataFrame is very easy to recognize visually). This is a soft suggestion.

@jerlich
Copy link

jerlich commented Jan 5, 2021

Thanks for being super responsive about this!

@kleinschmidt
Copy link
Member

I do think it could be a good example in the docs ("civilians" might not understand that Tables.jl integration means you can just call DataFrame on it...)

Base.length(ct::CoefTable) = length(ct.cols[1])
function Base.eltype(ct::CoefTable)
nmtype = isempty(ct.rownms) ? String : eltype(ct.rownms)
NamedTuple{(Symbol("Name"), Symbol.(ct.colnms)...),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other option is we could return a custom "view" struct here instead of a NamedTuple. That could make it a bit more efficient and is a common pattern for other "table" types. For example, it's common to have something like:

struct CoefRow
    table::CoefTable
    rownumber::Int
end

Then on top of that you just need to implement propertynames and getproperty that correspond to the "row values" to be returned. The property names could be computed a single time in the initial Base.iterate(ct::CoefTable) method and passed as an additional piece of iterator state. It's obviously a bit more code/work, so maybe not worth it if these tables are always going to be small and simple anyway.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, thanks. Though these tables are always super small so as you say I'm not sure it's worth the additional complexity. If performance really mattered we should implement a column-wise approach anyway.

@nalimilan
Copy link
Member Author

OK, I've pushed changes to avoid adding a "Name" column when there are no names and to mention Tables.jl and DataFrames in the docstring.

@nalimilan nalimilan merged commit ed3b86e into master Feb 1, 2021
@nalimilan nalimilan deleted the nl/coeftable branch February 1, 2021 08:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants