-
-
Notifications
You must be signed in to change notification settings - Fork 52
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
added TableDataset #26
added TableDataset #26
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a good start to me. I left a couple comments. My main concern is that it seems like we have to iterate a bunch of rows to get the i-th row?
src/datasets/containers.jl
Outdated
TableDataset{T}(table::T) where T = Tables.istable(table) ? new{T}(table) : error("Object doesn't implement Tables.jl interface") | ||
end | ||
|
||
TableDataset(path::String) = TableDataset(DataFrame(CSV.File(path))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reflecting Ari's comment from Zulip, maybe we can just use CSV.File
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay I can change it so that the default behaviour on getting a path is to use CSV.File
object. It's just that I thought having a familiar API like DataFrames might be easier to work with, specially for people coming from Python fastai.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's true, but I think in the this case it is more important to be performant reading from disk. Most people won't be directly interacting with the underlying table.
src/datasets/containers.jl
Outdated
for (index, row) in enumerate(Tables.rows(dataset.table)) | ||
if index==idx | ||
return row | ||
end | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Iterating all the rows seems like a really unfortunate consequence of the Tables.jl interface. CSV.jl support "jumping ahead" to a certain row in the file. Does Tables.jl not have a similar interface?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think Tables.jl requires to support "jumping ahead" to a certain row, unless it is possible to do so for all objects implementing iteration interface.
If the table is row access, the only required method is Tables.rows
which just needs to return an iterable of objects satisfying the AbstractRow
interface. AbstractRow
only requires methods for getting column values and names.
If the table is column access, the only required method is Tables.columns
which needs to return an object implementing the AbstractColumns
interface and having methods for getting a column object and column names.
src/datasets/containers.jl
Outdated
elseif Tables.columnaccess(dataset.table) | ||
rowvals = [] | ||
for i in 1:(length ∘ Tables.columnnames)(dataset.table) | ||
append!(rowvals, Tables.getcolumn(dataset.table, i)[idx]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be push!
or append!
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah it should be push!
only, considering single item is being added.
src/datasets/containers.jl
Outdated
|
||
function LearnBase.nobs(dataset::FastAI.Datasets.TableDataset{T}) where {T} | ||
if Tables.rowaccess(dataset.table) | ||
return length(Tables.rows(dataset.table)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw a couple of implementations (e.g. JDBC) where length
is not defined on the result of rows
. The question is whether or not we care about that. There's also the question of whether this might end up materializing more data than we want as a side effect of calling rows
or getcolumn
. Unfortunately AFAICT there's no API in Tables.jl for querying the size of a source (if it has one).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then is the only option to iterate through the rows for the size, incase the table doesn't support column access? If it is, then should I make the required changes?
Based on the discussion, I am wondering if we should have a |
For data containers, it makes more sense to target these based on file type? For tables already loaded for disk, I'm now thinking it makes more sense to directly define If we want a unified interface for the table file types, we could have |
I like this idea, this way we can specialize on the format while having a uniform API. Just throwing out another possibility: Having one The advantage here is that we can then provide a default (perhaps slow) implementation for all tables. Also endusers will only deal with one type, i.e. calling Unrelated: I am in favor of consistently using |
Why not just do |
Also, let's use |
Ah yes, that is functionally equivalent and more elegant 👍 |
Co-authored-by: Kyle Daruwalla <daruwalla.k.public@icloud.com>
…and changed nobs order
290c8c4
to
b9c65eb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs tests
Co-authored-by: Kyle Daruwalla <daruwalla.k.public@icloud.com>
…and changed nobs order
…kyabard/FastAI.jl into manikyabard/table_container
Co-authored-by: Kyle Daruwalla <daruwalla.k.public@icloud.com>
Co-authored-by: lorenzoh <lorenz.ohly@gmail.com>
I have added a few tests for TableDataset in |
src/datasets/containers.jl
Outdated
@@ -78,8 +79,8 @@ function LearnBase.nobs(dataset::TableDataset{T}) where {T} | |||
end | |||
end | |||
|
|||
LearnBase.getobs(dataset::TableDataset{<:DataFrame}, idx) = dataset.table[idx, :] | |||
LearnBase.getobs(dataset::TableDataset{<:DataFrame}, idx) = [data for data in dataset.table[idx, :]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't getobs
return an object that is indexable by the columns? So in this case just dataset.table[idx, :]
would be fine because you can then index it by column name like row.col1
/row[:col1]
. And you would then pass the column names to TabularTransforms
instead of integer indices which could quickly lead to hard to spot errors, if some columns are reordered or change. Also this would avoid needing to copy the data in the row as this currently does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah for DataFrame
this would make more sense. But is it fine to keep this return behaviour for getobs(dataset::TableDataset{<:CSV.File}, idx)
and the general getobs(dataset::TableDataset{T}, idx)
methods considering that the types could be immutable, and they may not not have setindex!
defined (as in the case of CSV.Row
) which might make transformations difficult if we use a copy of the input directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think any transforms shouldn't be calling setindex!
anyhow without a defensive copy first, so we should stick with dataset.table[idx, :]
everywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreeing with Brian that the transforms shouldn't mutate the arguments by default; we have encode!
/run!
for the inplace versions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for the existing suggestions
src/datasets/containers.jl
Outdated
TableDataset(path::AbstractPath) = TableDataset(DataFrame(CSV.File(path))) | ||
|
||
function LearnBase.getobs(dataset::FastAI.Datasets.TableDataset{T}, idx) where {T} | ||
if Tables.rowaccess(dataset.table) | ||
for (index, row) in enumerate(Tables.rows(dataset.table)) | ||
if index==idx | ||
return row | ||
return [data for data in row] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar comment here about keeping the row object instead of converting it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the main concern I have is that, if we use an object returned by getobs
for transformation, and let's say normalization is defined as
struct NormalizeRow <: DataAugmentation.Transform
normstats
normidxs
end
function DataAugmentation.apply(tfm::NormalizeRow, item::TabularItem; randstate=nothing)
x = copy(item.data)
for idx in tfm.normidxs
colmean, colstd = tfm.normstats[idx]
x[idx] = (x[idx] - colmean)/colstd
end
TabularItem(x)
end
There's a possibility that functions like copy
or setindex
won't be defined for the object returned (example being CSV.Row
as I mentioned before). Maybe we can extract the data from the returned object and put it in a vector in the transformation step itself, but then the indexes won't really work with it. Is there a way to retain the original type from getobs
and still make these kinds of transformation work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's solve this within the transforms portion of the pipeline instead of forcing a vector here.
src/datasets/containers.jl
Outdated
@@ -50,13 +50,14 @@ struct TableDataset{T} | |||
TableDataset{T}(table::T) where T = Tables.istable(table) ? new{T}(table) : error("Object doesn't implement Tables.jl interface") | |||
end | |||
|
|||
TableDataset(table::T) where {T} = TableDataset{T}(table) | |||
TableDataset(path::AbstractPath) = TableDataset(DataFrame(CSV.File(path))) | |||
|
|||
function LearnBase.getobs(dataset::FastAI.Datasets.TableDataset{T}, idx) where {T} | |||
if Tables.rowaccess(dataset.table) | |||
for (index, row) in enumerate(Tables.rows(dataset.table)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not faster, but instead of a loop + index check you could write something like first(Iterators.drop(Tables.rows(...), idx - 1))
(skip up until the index and return the next element, which will be the element at the index).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I can do that.
…kyabard/FastAI.jl into manikyabard/table_container
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some comments. @ToucheSir and @lorenzoh covered the main concerns already.
Also, it looks like something has gone very wrong with a git rebase here. If it isn't immediately obvious how to solve it, I would make a local copy of src/datasets/*
and the tests, then start with a fresh branch/PR and re-apply the changes.
src/datasets/containers.jl
Outdated
TableDataset(path::AbstractPath) = TableDataset(DataFrame(CSV.File(path))) | ||
|
||
function LearnBase.getobs(dataset::FastAI.Datasets.TableDataset{T}, idx) where {T} | ||
if Tables.rowaccess(dataset.table) | ||
for (index, row) in enumerate(Tables.rows(dataset.table)) | ||
if index==idx | ||
return row | ||
return [data for data in row] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's solve this within the transforms portion of the pipeline instead of forcing a vector here.
src/datasets/containers.jl
Outdated
@@ -78,8 +79,8 @@ function LearnBase.nobs(dataset::TableDataset{T}) where {T} | |||
end | |||
end | |||
|
|||
LearnBase.getobs(dataset::TableDataset{<:DataFrame}, idx) = dataset.table[idx, :] | |||
LearnBase.getobs(dataset::TableDataset{<:DataFrame}, idx) = [data for data in dataset.table[idx, :]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for the existing suggestions
3e97d01
to
2462546
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems you figured out the git issue. The PR looks good to me. I made some minor suggestions on the test cases.
Makie = "ee78f7c6-11fb-53f2-987a-cfe4a2b5a57a" | ||
ShowCases = "605ecd9f-84a6-4c9e-81e2-4798472b76a3" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These seem orthogonal, maybe @lorenzoh can double-check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
JLD2, Makie and ShowCases are already in the master branch, not sure why they're shown here. Might just be that the list was sorted on a ]pkg command.
Co-authored-by: Kyle Daruwalla <daruwalla.k.public@icloud.com>
Co-authored-by: Kyle Daruwalla <daruwalla.k.public@icloud.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pending a check from @lorenzoh on the packages, I think this is GTM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since FluxML/FluxML-Community-Call-Minutes#34 states that the points, include documentation, we shouldn't forget about that. Though I suppose this as a basis for other work can be merged as is. We should definitely add more comprehensive documentation later.
Makie = "ee78f7c6-11fb-53f2-987a-cfe4a2b5a57a" | ||
ShowCases = "605ecd9f-84a6-4c9e-81e2-4798472b76a3" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
JLD2, Makie and ShowCases are already in the master branch, not sure why they're shown here. Might just be that the list was sorted on a ]pkg command.
@manikyabard The changes look good to me. Can you remove the |
Okay, I deleted |
Tests are failing. Did you run them locally? I think the problem is |
So the path error should be fixed now, but it looks like there is still an error while downloading data in the testcase |
Could you remove the commented-out lines in the test case? |
Sure I'll do that. |
Looks good, merging! |
Added TableDataset, getobs and nobs which works with Tables.jl interface implementations. Closes #23