Tables.jl and DataAPI.jl interoperation #67

Closed
bkamins opened this issue Mar 14, 2022 · 6 comments

Comments

@bkamins

bkamins commented Mar 14, 2022

@ablaom I am not sure if this is the best place to start this discussion, but it is a follow-up to https://discourse.julialang.org/t/random-access-to-rows-of-a-table/77386 and JuliaData/Tables.jl#278.

The key point is to avoid creating functions with essentially the same functionality across DataAPI.jl, Tables.jl, and MLUtils.jl (and possibly other ML packages I am not aware of).

Assume for a moment that a Tables.jl table is the source of data for some ML model and you want operations on it to be efficient.

My understanding is that your high-level workflow is the following:

  1. the user starts with a Tables.jl table.
  2. then the user performs observation subsetting, feature selection, and feature transformation operations on this table (either eagerly or lazily).
  3. finally, the user converts the result of step 2 (again, either lazily or eagerly) into an object of another type that the ML algorithm accepts as input.

The question is:

What functionality do you need in DataAPI.jl and Tables.jl so that this workflow is efficient and you do not need to provide duplicate definitions of concepts in MLUtils.jl (or some other packages)?
Another consideration (raised in the linked discussions) is that I would expect whatever we develop to be consistent with the interfaces that Base Julia already defines (e.g. the iterator, abstract vector, indexing, and view interfaces).
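For concreteness, the three-step workflow described here could look roughly like the sketch below. It uses only Base Julia, and a NamedTuple of equal-length vectors stands in for a generic Tables.jl table; the variable names (`table`, `rows`, `subset`, `selected`, `X`) are illustrative assumptions, not part of any package API.

```julia
# Step 1: a column table, here a NamedTuple of equal-length vectors
# (an assumption made for illustration; any Tables.jl source could play this role).
table = (a = [1.0, 2.0, 3.0], b = [4.0, 5.0, 6.0], c = [7.0, 8.0, 9.0])

# Step 2a: observation subsetting, done lazily with views (no column is copied).
rows = [1, 3]
subset = map(col -> view(col, rows), table)

# Step 2b: feature selection, keeping only columns :a and :b.
selected = NamedTuple{(:a, :b)}(subset)

# Step 3: eager conversion to a matrix (observations in rows, features in columns).
X = hcat(values(selected)...)
```

Steps 2a and 2b rely only on Base interfaces (`view`, `map`, NamedTuple construction), which is the kind of consistency with Base that the question asks about.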

@bkamins
Author

bkamins commented Mar 14, 2022

CC @nalimilan @quinnj

@AriMKatz

@darsnack @ToucheSir

@AriMKatz

Also cc @manikyabard

@CarloLucibello
Member

I think the only methods we need in Tables.jl are:

  • numrows(table) returning the number of rows (already available as length(rows(table)))
  • getrow(table, i::Int) returning a materialized row (similar to df[i, :] for DataFrame)
  • optionally also getrow(table, i::AbstractVector{<:Integer}) returning a materialized subtable (again similar to df[i, :] for DataFrame)
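A minimal sketch of what these proposed methods might do, assuming a column table stored as a NamedTuple of equal-length vectors. The names `numrows` and `getrow` follow the proposal in this comment; the definitions below are hypothetical stand-ins written against Base only and are not claimed to be Tables.jl API.

```julia
# Hypothetical stand-ins for the proposed methods, defined only for a
# NamedTuple of equal-length vectors (an assumption for illustration).

# Number of rows: the length of the first column, or 0 if there are no columns.
numrows(table::NamedTuple) = isempty(table) ? 0 : length(first(table))

# Materialized row: a NamedTuple holding the i-th element of every column.
getrow(table::NamedTuple, i::Integer) = map(col -> col[i], table)

# Materialized subtable: every column indexed by the same vector of row indices.
getrow(table::NamedTuple, idx::AbstractVector{<:Integer}) = map(col -> col[idx], table)

table = (a = [1, 2, 3], b = ["x", "y", "z"])
row = getrow(table, 2)
sub = getrow(table, [1, 3])
```

The two `getrow` methods mirror the scalar and vector forms of `df[i, :]`: a single index yields one row, a vector of indices yields a smaller table of the same shape.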

Now, since there is no AbstractTable type, it is not clear how to achieve interoperability. One option is to change the generic fallbacks in https://github.com/JuliaML/MLUtils.jl/blob/main/src/observation.jl as follows:

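# istable is Tables.istable; numrows and getrow are the methods proposed above.
function numobs(data)
  if istable(data)
    return numrows(data)   # table: count rows
  else
    return length(data)    # anything else: fall back to length
  end
end

function getobs(data, i)
  if istable(data)
    return getrow(data, i) # table: materialize row(s) i
  else
    return data[i]         # anything else: fall back to indexing
  end
end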

Having those branches in such low-level functions is not great, but I don't know how else we can support generic Tables.jl tables here.
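One possible alternative to the explicit branches, offered only as a sketch, is to compute a trait value once and dispatch on it. Everything below is hypothetical: `IsTable`, `NotTable`, and `tablekind` are invented names, a NamedTuple stands in for "a Tables.jl table", and `numrows`/`getrow` are local stand-ins for the proposed methods. Whether this is preferable to the `if istable(data)` form in practice is not something this thread settles.

```julia
# Hypothetical trait types; a real integration might derive the trait from
# Tables.istable instead of from the NamedTuple type used here.
struct IsTable end
struct NotTable end

tablekind(::Any) = NotTable()
tablekind(::NamedTuple) = IsTable()   # stand-in for "is a Tables.jl table"

# Local stand-ins for the proposed table methods (assumption: NamedTuple of vectors).
numrows(table::NamedTuple) = isempty(table) ? 0 : length(first(table))
getrow(table::NamedTuple, i) = map(col -> col[i], table)

# Entry points compute the trait once, then dispatch on it.
numobs(data) = numobs(tablekind(data), data)
numobs(::IsTable, data) = numrows(data)
numobs(::NotTable, data) = length(data)

getobs(data, i) = getobs(tablekind(data), data, i)
getobs(::IsTable, data, i) = getrow(data, i)
getobs(::NotTable, data, i) = data[i]

table = (a = [1, 2, 3], b = [4, 5, 6])
vector = [10, 20]
```

With this layout, the table and non-table code paths live in separate methods, and the generic entry points contain no conditional.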

@bkamins
Author

bkamins commented Apr 8, 2022

x-ref to discussion in Tables.jl JuliaData/Tables.jl#278

@CarloLucibello
Member

Closing this and leaving only #61 open
