Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Breaking behaviour in Tables 1.8.0 #306

Closed
ablaom opened this issue Oct 4, 2022 · 6 comments
Closed

Breaking behaviour in Tables 1.8.0 #306

ablaom opened this issue Oct 4, 2022 · 6 comments

Comments

@ablaom
Copy link
Contributor

ablaom commented Oct 4, 2022

n_grams = [
    Dict("Hi" => 1, "name" => 1),
    Dict("name" => 1, "Sam" => 1),
]

using Tables

Tables 1.7.0:

julia> Tables.istable(n_grams)
false

Tables 1.8.0:

julia> Tables.istable(n_grams)
true

In particular, this breaks MLJText.jl (and any future packages that use the scientific type definitions for text analysis types that appear there).

@quinnj
Copy link
Member

quinnj commented Oct 4, 2022

Can you describe more specifically how this breaks your use-cases? In general, it's usually not considered (too) breaking to take previously unallowed or exceptional behavior and then allow it/make it work, which is what we were going for here. In particular, the "dicts as tables" was reported as not being very flexible since it only allowed Symbols as keys instead of Strings as well. DataFrames is the model we're borrowing from here by supporting column names as symbols or strings as much as possible.

@ablaom
Copy link
Contributor Author

ablaom commented Oct 5, 2022

The use-case tripping MLJText is type checks. MLJ uses ScientificTypes.jl to check user-supplied types, and MLJText adds scitypes for text analysis. In text analysis, data can be a vector of documents, where "document" is a dictionary of word counts. So each element of the vector is a dict with different keys. You could regard this conceptually as a table, but with very sparse rows, which rules out generic table types for text analysis.

Presently, if istable(X)==true, scitype(X) = Table{K} for some K depending on the column scitypes; scitype(d) = Multiset{K} if d is an abstract dictionary (and not a table) with Integer values (K is the key type); if v is an abstract vector and not a table, then scitype(v) = AbstractArray{K,N} where K is the element scitype. This leads to new behaviour in the above example after the Tables update: see here.

Does this address your query?

@nalimilan
Copy link
Member

You could hardcode in ScientificTypes that AbstractVector{<:AbstractDict{<:AbstractString,<:Integer}} is never considered as a table. That would be slightly surprising for people that define this type of tables and use them as such in other packages, but it seems strictly superior to disallowing this in all packages that use Tables.jl.

(In general it's not super clear what Tables.istable can be used for in practice, as it's only an approximation anyway for containers with an unknown or Any eltype.)

@quinnj
Copy link
Member

quinnj commented Oct 7, 2022

Yeah, I kind of regret ever adding istable TBH. It's just not able to do what people really want, since you can always have tables that don't know their tables until you call Tables.rows/Tables.columns. @ablaom, I agree with Milan/Bogumil that the best path forward is defining your own _istable overload that fallsback to Tables.istable and you can overload for Dict String tables to be false. The alternative is we try to rollback and break a bunch of code that is already using this. I apologize for the hassle and we'll try to be more thoughtful about these kinds of changes in the future; like I said, in general, these changes are considered safe since we're enabling something that wasn't working before, but in this case, it was more nuanced than we realized.

@quinnj quinnj closed this as completed Oct 7, 2022
@rofinn
Copy link
Member

rofinn commented Oct 7, 2022

FWIW, I actually like istable, but I just wish it was always false by default. Apart from explicit tables like DataFrames, I feel like needing to do something like Tables.@enable Tables.RowTable would make sense to me.

@ablaom
Copy link
Contributor Author

ablaom commented Oct 10, 2022

Thanks @quinnj and @nalimilan for the prompt attention. I am following your advise (see above cross ref).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants