Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added TableDataset #26

Merged
merged 24 commits into from Jun 15, 2021
Merged

Conversation

manikyabard
Copy link
Contributor

@manikyabard manikyabard commented Apr 5, 2021

Added TableDataset, getobs and nobs which works with Tables.jl interface implementations. Closes #23

Copy link
Member

@darsnack darsnack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a good start to me. I left a couple comments. My main concern is that it seems like we have to iterate a bunch of rows to get the i-th row?

TableDataset{T}(table::T) where T = Tables.istable(table) ? new{T}(table) : error("Object doesn't implement Tables.jl interface")
end

TableDataset(path::String) = TableDataset(DataFrame(CSV.File(path)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reflecting Ari's comment from Zulip, maybe we can just use CSV.File here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay I can change it so that the default behaviour on getting a path is to use CSV.File object. It's just that I thought having a familiar API like DataFrames might be easier to work with, specially for people coming from Python fastai.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true, but I think in the this case it is more important to be performant reading from disk. Most people won't be directly interacting with the underlying table.

Comment on lines 61 to 62
for (index, row) in enumerate(Tables.rows(dataset.table))
if index==idx
return row
end
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Iterating all the rows seems like a really unfortunate consequence of the Tables.jl interface. CSV.jl support "jumping ahead" to a certain row in the file. Does Tables.jl not have a similar interface?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think Tables.jl requires to support "jumping ahead" to a certain row, unless it is possible to do so for all objects implementing iteration interface.

If the table is row access, the only required method is Tables.rows which just needs to return an iterable of objects satisfying the AbstractRow interface. AbstractRow only requires methods for getting column values and names.

If the table is column access, the only required method is Tables.columns which needs to return an object implementing the AbstractColumns interface and having methods for getting a column object and column names.

src/datasets/containers.jl Outdated Show resolved Hide resolved
elseif Tables.columnaccess(dataset.table)
rowvals = []
for i in 1:(length ∘ Tables.columnnames)(dataset.table)
append!(rowvals, Tables.getcolumn(dataset.table, i)[idx])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be push! or append!?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it should be push! only, considering single item is being added.


function LearnBase.nobs(dataset::FastAI.Datasets.TableDataset{T}) where {T}
if Tables.rowaccess(dataset.table)
return length(Tables.rows(dataset.table))
Copy link
Member

@ToucheSir ToucheSir Apr 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw a couple of implementations (e.g. JDBC) where length is not defined on the result of rows. The question is whether or not we care about that. There's also the question of whether this might end up materializing more data than we want as a side effect of calling rows or getcolumn. Unfortunately AFAICT there's no API in Tables.jl for querying the size of a source (if it has one).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then is the only option to iterate through the rows for the size, incase the table doesn't support column access? If it is, then should I make the required changes?

@darsnack
Copy link
Member

darsnack commented Apr 7, 2021

Based on the discussion, I am wondering if we should have a TableDataset and a CSVDataset. Seems like just doing the former is leaving a lot of performance on the table. CSV.jl will allow us to avoid iterating rows for CSV files.

@darsnack
Copy link
Member

darsnack commented Apr 7, 2021

For data containers, it makes more sense to target these based on file type? For tables already loaded for disk, I'm now thinking it makes more sense to directly define getobs on the Tables.jl implementation instead of wrapping all the implementations into one.

If we want a unified interface for the table file types, we could have TableDataset(path::String) = ... which checks the file extension then invokes CSVDataset etc.

@lorenzoh
Copy link
Member

lorenzoh commented Apr 9, 2021

If we want a unified interface for the table file types, we could have TableDataset(path::String) = ... which checks the file extension then invokes CSVDataset etc.

I like this idea, this way we can specialize on the format while having a uniform API.

Just throwing out another possibility: Having one TableDataset{T} type that wraps a table of type T and then having getobs(td::TableDataset) = getobstable(td.table) and nobs(td::TableDataset) = nobstable(td.table) where we own getobstable and nobstable and can specialize them to the table type.

The advantage here is that we can then provide a default (perhaps slow) implementation for all tables. Also endusers will only deal with one type, i.e. calling TableDataset always returns a TableDataset.


Unrelated: I am in favor of consistently using AbstractPath types from FilePathsBase.jl, though. Maybe we can reexport the p"" macro, then you can call TableDataset(p"data.csv"). Having a canonical format for paths avoids a lot of fiddly errors.

@darsnack
Copy link
Member

darsnack commented Apr 9, 2021

Why not just do getobs(td::TableDataset{<:DataFrame}, i)? @manikyabard can you implement this? In addition to the code that you already have, you would write getobs(td::TableDataset{<:DataFrame}, i) and getobs(td::TableData{<:CSV.File}, i) (which are faster implementations that assume that td.table is a DataFrame or CSV.File).

@darsnack
Copy link
Member

darsnack commented Apr 9, 2021

Also, let's use AbstractPath too

@lorenzoh
Copy link
Member

lorenzoh commented Apr 9, 2021

Ah yes, that is functionally equivalent and more elegant 👍

Copy link
Member

@darsnack darsnack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs tests

src/datasets/containers.jl Outdated Show resolved Hide resolved
src/datasets/containers.jl Show resolved Hide resolved
manikyabard and others added 2 commits May 31, 2021 21:54
Co-authored-by: lorenzoh <lorenz.ohly@gmail.com>
@manikyabard
Copy link
Contributor Author

I have added a few tests for TableDataset in test/datasets/containers.jl.

@manikyabard manikyabard requested a review from darsnack May 31, 2021 17:41
@@ -78,8 +79,8 @@ function LearnBase.nobs(dataset::TableDataset{T}) where {T}
end
end

LearnBase.getobs(dataset::TableDataset{<:DataFrame}, idx) = dataset.table[idx, :]
LearnBase.getobs(dataset::TableDataset{<:DataFrame}, idx) = [data for data in dataset.table[idx, :]]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't getobs return an object that is indexable by the columns? So in this case just dataset.table[idx, :] would be fine because you can then index it by column name like row.col1/row[:col1]. And you would then pass the column names to TabularTransforms instead of integer indices which could quickly lead to hard to spot errors, if some columns are reordered or change. Also this would avoid needing to copy the data in the row as this currently does.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah for DataFrame this would make more sense. But is it fine to keep this return behaviour for getobs(dataset::TableDataset{<:CSV.File}, idx) and the general getobs(dataset::TableDataset{T}, idx) methods considering that the types could be immutable, and they may not not have setindex! defined (as in the case of CSV.Row) which might make transformations difficult if we use a copy of the input directly?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think any transforms shouldn't be calling setindex! anyhow without a defensive copy first, so we should stick with dataset.table[idx, :] everywhere.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreeing with Brian that the transforms shouldn't mutate the arguments by default; we have encode!/run! for the inplace versions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for the existing suggestions

TableDataset(path::AbstractPath) = TableDataset(DataFrame(CSV.File(path)))

function LearnBase.getobs(dataset::FastAI.Datasets.TableDataset{T}, idx) where {T}
if Tables.rowaccess(dataset.table)
for (index, row) in enumerate(Tables.rows(dataset.table))
if index==idx
return row
return [data for data in row]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar comment here about keeping the row object instead of converting it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the main concern I have is that, if we use an object returned by getobs for transformation, and let's say normalization is defined as

struct NormalizeRow <: DataAugmentation.Transform
	normstats
	normidxs
end

function DataAugmentation.apply(tfm::NormalizeRow, item::TabularItem; randstate=nothing)
	x = copy(item.data)
	for idx in tfm.normidxs
		colmean, colstd = tfm.normstats[idx]
		x[idx] = (x[idx] - colmean)/colstd
	end
	TabularItem(x)
end

There's a possibility that functions like copy or setindex won't be defined for the object returned (example being CSV.Row as I mentioned before). Maybe we can extract the data from the returned object and put it in a vector in the transformation step itself, but then the indexes won't really work with it. Is there a way to retain the original type from getobs and still make these kinds of transformation work?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's solve this within the transforms portion of the pipeline instead of forcing a vector here.

@@ -50,13 +50,14 @@ struct TableDataset{T}
TableDataset{T}(table::T) where T = Tables.istable(table) ? new{T}(table) : error("Object doesn't implement Tables.jl interface")
end

TableDataset(table::T) where {T} = TableDataset{T}(table)
TableDataset(path::AbstractPath) = TableDataset(DataFrame(CSV.File(path)))

function LearnBase.getobs(dataset::FastAI.Datasets.TableDataset{T}, idx) where {T}
if Tables.rowaccess(dataset.table)
for (index, row) in enumerate(Tables.rows(dataset.table))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not faster, but instead of a loop + index check you could write something like first(Iterators.drop(Tables.rows(...), idx - 1)) (skip up until the index and return the next element, which will be the element at the index).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I can do that.

Copy link
Member

@darsnack darsnack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments. @ToucheSir and @lorenzoh covered the main concerns already.

Also, it looks like something has gone very wrong with a git rebase here. If it isn't immediately obvious how to solve it, I would make a local copy of src/datasets/* and the tests, then start with a fresh branch/PR and re-apply the changes.

test/datasets/containers.jl Outdated Show resolved Hide resolved
test/datasets/containers.jl Outdated Show resolved Hide resolved
src/datasets/containers.jl Show resolved Hide resolved
TableDataset(path::AbstractPath) = TableDataset(DataFrame(CSV.File(path)))

function LearnBase.getobs(dataset::FastAI.Datasets.TableDataset{T}, idx) where {T}
if Tables.rowaccess(dataset.table)
for (index, row) in enumerate(Tables.rows(dataset.table))
if index==idx
return row
return [data for data in row]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's solve this within the transforms portion of the pipeline instead of forcing a vector here.

src/datasets/containers.jl Outdated Show resolved Hide resolved
src/datasets/containers.jl Outdated Show resolved Hide resolved
src/datasets/containers.jl Outdated Show resolved Hide resolved
@@ -78,8 +79,8 @@ function LearnBase.nobs(dataset::TableDataset{T}) where {T}
end
end

LearnBase.getobs(dataset::TableDataset{<:DataFrame}, idx) = dataset.table[idx, :]
LearnBase.getobs(dataset::TableDataset{<:DataFrame}, idx) = [data for data in dataset.table[idx, :]]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for the existing suggestions

src/datasets/containers.jl Outdated Show resolved Hide resolved
Copy link
Member

@darsnack darsnack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems you figured out the git issue. The PR looks good to me. I made some minor suggestions on the test cases.

Comment on lines +10 to +11
Makie = "ee78f7c6-11fb-53f2-987a-cfe4a2b5a57a"
ShowCases = "605ecd9f-84a6-4c9e-81e2-4798472b76a3"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These seem orthogonal, maybe @lorenzoh can double-check?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JLD2, Makie and ShowCases are already in the master branch, not sure why they're shown here. Might just be that the list was sorted on a ]pkg command.

test/datasets/containers.jl Show resolved Hide resolved
test/datasets/containers.jl Outdated Show resolved Hide resolved
test/datasets/containers.jl Outdated Show resolved Hide resolved
test/datasets/containers.jl Outdated Show resolved Hide resolved
test/datasets/containers.jl Outdated Show resolved Hide resolved
manikyabard and others added 2 commits June 9, 2021 14:18
Co-authored-by: Kyle Daruwalla <daruwalla.k.public@icloud.com>
Co-authored-by: Kyle Daruwalla <daruwalla.k.public@icloud.com>
Copy link
Member

@darsnack darsnack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pending a check from @lorenzoh on the packages, I think this is GTM.

Copy link
Member

@lorenzoh lorenzoh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since FluxML/FluxML-Community-Call-Minutes#34 states that the points, include documentation, we shouldn't forget about that. Though I suppose this as a basis for other work can be merged as is. We should definitely add more comprehensive documentation later.

Comment on lines +10 to +11
Makie = "ee78f7c6-11fb-53f2-987a-cfe4a2b5a57a"
ShowCases = "605ecd9f-84a6-4c9e-81e2-4798472b76a3"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JLD2, Makie and ShowCases are already in the master branch, not sure why they're shown here. Might just be that the list was sorted on a ]pkg command.

@lorenzoh
Copy link
Member

@manikyabard The changes look good to me. Can you remove the Manifest.toml from your forked repo? It seems to be causing the CI errors. If they run fine, we can merge this.

@manikyabard
Copy link
Contributor Author

@manikyabard The changes look good to me. Can you remove the Manifest.toml from your forked repo? It seems to be causing the CI errors. If they run fine, we can merge this.

Okay, I deleted Manifest.toml from the branch.

@lorenzoh
Copy link
Member

Tests are failing. Did you run them locally? I think the problem is include("imports.jl") in test/datasets/cotainers.jl which should be include("../imports.jl") since it is in a subdirectory.

@manikyabard
Copy link
Contributor Author

Tests are failing. Did you run them locally? I think the problem is include("imports.jl") in test/datasets/cotainers.jl which should be include("../imports.jl") since it is in a subdirectory.

So the path error should be fixed now, but it looks like there is still an error while downloading data in the testcase TableDataset from CSV, although the tests pass on my local system.

@lorenzoh
Copy link
Member

Could you remove the commented-out lines in the test case?

@manikyabard
Copy link
Contributor Author

Sure I'll do that.

@lorenzoh lorenzoh merged commit 4cd87ad into FluxML:master Jun 15, 2021
@lorenzoh
Copy link
Member

Looks good, merging!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add TableDataset
4 participants