Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a hint to tell that the row iterator state is the row number in tables with row access #342

Open
ronisbr opened this issue Sep 6, 2023 · 17 comments

Comments

@ronisbr
Copy link
Member

ronisbr commented Sep 6, 2023

Hi!

We have a problem in PrettyTables.jl related to huge matrices with row access. The current method, AFAIK, to access an element in those tables is to iterate the rows until we reach the desired one and then retrieve the column:

    # Get the i-th row by iterating the row table.
    it, state = iterate(rtable.table)

    for _ in 2:i
        it, state = iterate(rtable.table, state)
        it === nothing && error("The row `i` does not exist.")
    end

This method is OK for any usage that requires to read the entire data. However, when you are printing huge tables, you usually want a part of the table. In the case of PrettyTables.jl, we have a mode in which we print the beginning and the end of the table. Hence, for each printed column, we need to iterate through all the rows to the very bottom, even though we only need a few of them, leading to a very slow process.

The problem appeared here: ronisbr/PrettyTables.jl#217 (comment)

If we add to the Tables.jl API some hint that the state in the iteration is the row id, I can access, let's say, the element in the row 1,000,000 by just:

it, ~ = iterate(stable.table, 1_000_000)

instead of looping through the entire iterator.

In the case of GeoTables.jl, which uses the row ID as the state, the printing process was reduced from 8s to 0.001s when we only show 10 rows and 10 columns with "middle" cropping mode.

@bkamins
Copy link
Member

bkamins commented Sep 6, 2023

I think there would be a need for "trait" that verifies if a given table has a known number of rows and these rows are indexable (as not all tables support this). E.g. nothing would be returned if a table does not support row indexing and otherwise row could be returned.

@ablaom
Copy link
Contributor

ablaom commented Sep 6, 2023

Yes!

@bkamins
Copy link
Member

bkamins commented Sep 7, 2023

OK. I have looked at it a bit more. So:

  1. the DataAPI.nrow function is in the API.
  2. the Tables.subset function is also in the API.

The only question is how well this API is supported in the ecosystem, but I think that the best way forward is to just start using it and then the packages that miss it can add it.

Probably more work needs to be done with nrow/ncol support which is worse than Tables.subset support currently.

(and in PrettyTables.jl maybe what can be done is e.g. try running nrow and if it throws MethodError do fallback to the slow variant)

@bkamins
Copy link
Member

bkamins commented Sep 7, 2023

Ah - and there is an internal Tables.rowcount function, so it could be used instead of nrow. We just need to wait for @quinnj to comment on the use of Tables.rowcount. Maybe we just need to define in Tables.jl the default fallback:

`DataAPI.nrow(x) = Tables.rowcount(x)`

?

@ronisbr
Copy link
Member Author

ronisbr commented Sep 7, 2023

Hi @bkamins !

But even if the interface allow us to obtain the number of rows, there is the problem about how to obtain the n-th row. There is nothing in the interface describing what the state should be in the iterator Tables.AbstractRow. Sometime ago, I was using the row number as the state. However, SparseVariables.jl uses 0 as the initial state, which was leading to problems:

ronisbr/PrettyTables.jl#174

In fact, I think that state can even be some kind of object, leading to the same problem.

AFAIK, the only way today to safely obtain the n-th row, even if the object can return the number of rows, is to iterate through the rows.

@bkamins
Copy link
Member

bkamins commented Sep 7, 2023

No, the API is Tables.subset to which you pass an index and this index should be 1-based.

In other words: Tables.jl tables should support Tables.subset with 1-based indexing:

julia> df = DataFrame(a=1:5)
5×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     3
   4 │     4
   5 │     5

julia> rt = Tables.rowtable(df)
5-element Vector{NamedTuple{(:a,), Tuple{Int64}}}:
 (a = 1,)
 (a = 2,)
 (a = 3,)
 (a = 4,)
 (a = 5,)

julia> ct = Tables.columntable(df)
(a = [1, 2, 3, 4, 5],)

julia> Tables.subset(df, 4)
DataFrameRow
 Row │ a
     │ Int64
─────┼───────
   4 │     4

julia> Tables.subset(rt, 4)
(a = 4,)

julia> Tables.subset(ct, 4)
(a = 4,)

@ronisbr
Copy link
Member Author

ronisbr commented Sep 7, 2023

Perfect! Thank you very much @bkamins ! I will update PrettyTables.jl to improve this use case.

@ronisbr
Copy link
Member Author

ronisbr commented Sep 12, 2023

Unfortunately, it seems that Tables.subet is only optional. For example, GeoTables does not implement it:

julia> using GeoTables

julia> a = georef((a=rand(1000,1000),));

julia> Tables.subset(a, 1)
ERROR: ArgumentError: no default `Tables.subset` implementation for type: GeoTable{Meshes.CartesianGrid{2, Float64}, NamedTuple{(:a,), Tuple{Vector{Float64}}}}
Stacktrace:
 [1] subset(x::GeoTable{Meshes.CartesianGrid{2, Float64}, NamedTuple{(:a,), Tuple{Vector{Float64}}}}, inds::Int64; viewhint::Nothing, view::Nothing)
   @ Tables ~/.julia/packages/Tables/AcRIE/src/Tables.jl:632
 [2] subset(x::GeoTable{Meshes.CartesianGrid{2, Float64}, NamedTuple{(:a,), Tuple{Vector{Float64}}}}, inds::Int64)
   @ Tables ~/.julia/packages/Tables/AcRIE/src/Tables.jl:608
 [3] top-level scope
   @ REPL[27]:1

@juliohm Is it possible to implement Tables.subset for GeoTables? I am planning to check if there is a valid implementation of Tables.subset to obtain the i-th row. If not, I will fallback to the default iteration. Hence, if you implement this function, that problem related with vcrop_mode = :middle will vanish.

@juliohm
Copy link

juliohm commented Sep 12, 2023 via email

@bkamins
Copy link
Member

bkamins commented Sep 12, 2023

And also I guess preferably nrow from DataAPI.jl (if you do not support it already).

@juliohm
Copy link

juliohm commented Sep 12, 2023

We do support nrow and ncol from DataAPI.jl, thanks for checking @bkamins.

As we discussed in another issue, it would be really nice if we could port these names to Tables.nrows and Tables.ncols so that anything related to the tables interface lives in a single package.

@bkamins
Copy link
Member

bkamins commented Sep 12, 2023

so that anything related to the tables interface lives in a single package.

Yes. Let us wait for other maintainers to comment there.

@juliohm
Copy link

juliohm commented Sep 13, 2023

@ronisbr a new patch release is available with the Tables.subset implementation.

Please let us know if something else is missing.

Yes. Let us wait for other maintainers to comment there.

It would be nice if Tables.jl had other maintainers with decision power besides @quinnj, to advance things more quickly.

@tpapp
Copy link
Contributor

tpapp commented Sep 13, 2023

In general I like the fact that we have a single person who maintains a coherent design. I don't think think that an interface package like Tables.jl should prioritize quick changes; a lot of packages in the ecosystem rely on the API so it should move carefully.

@quinnj
Copy link
Member

quinnj commented Sep 13, 2023

Yeah, Tables.subset is the right answer here, as pointed out by @bkamins.

On the Tables.rowcount question, that's really only an internal function meant to operate on column-oriented table sources. So it's safe to call on the result of Tables.columns, but not otherwise, so that makes me hesitate to set it as the default for DataAPI.nrows.

@juliohm, any JuliaData admin/maintainer is free to make changes to Tables.jl and don't require approval from me. We've built a, IMO, healthy process to act by committee though, hence the invitation from other maintainers to weigh in.

IMO, there's always going to be a bit of awkwardness between DataAPI.jl and Tables.jl interfaces. Some interfaces more obviously belong to one package or the other, but it also doesn't make sense for them to all live in one or the other. The functions in DataAPI have, so far, been those that aren't exactly tables-specific, so a more generic home seems appropriate for packages where it wouldn't make sense to take Tables.jl as a dependency but they want to overload a data-related interface. nrows/ncols are admittedly a bit awkward here because they seem tables related, but in reality, they impose a more strict interface that what Tables.jl requires (i.e. Tables.jl doesn't require the # of rows to be known), so it seemed prudent to have them live in DataAPI. So the awkwardness is understandable, but hopefully the clarification here helps explain the current arrangement?

@juliohm
Copy link

juliohm commented Sep 13, 2023 via email

@juliohm
Copy link

juliohm commented Sep 13, 2023 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants