added TableDataset #26

manikyabard · 2021-04-05T15:43:50Z

Added TableDataset, along with getobs and nobs methods that work with any implementation of the Tables.jl interface. Closes #23

darsnack

This looks like a good start to me. I left a couple comments. My main concern is that it seems like we have to iterate a bunch of rows to get the i-th row?

darsnack · 2021-04-06T15:12:47Z

src/datasets/containers.jl

+    TableDataset{T}(table::T) where T = Tables.istable(table) ? new{T}(table) : error("Object doesn't implement Tables.jl interface")
+end
+
+TableDataset(path::String) = TableDataset(DataFrame(CSV.File(path)))


Reflecting Ari's comment from Zulip, maybe we can just use CSV.File here.

Okay, I can change it so that the default behaviour on getting a path is to use a CSV.File object. It's just that I thought having a familiar API like DataFrames might be easier to work with, especially for people coming from Python fastai.

That's true, but I think in this case it is more important to be performant when reading from disk. Most people won't be directly interacting with the underlying table.

darsnack · 2021-04-06T15:14:32Z

src/datasets/containers.jl

+        for (index, row) in enumerate(Tables.rows(dataset.table))
+            if index==idx
+                return row
+            end
+        end


Iterating all the rows seems like a really unfortunate consequence of the Tables.jl interface. CSV.jl supports "jumping ahead" to a certain row in the file. Does Tables.jl not have a similar interface?

I don't think Tables.jl requires sources to support "jumping ahead" to a certain row; it could only require that if it were possible for all objects implementing the iteration interface.

If the table is row access, the only required method is Tables.rows which just needs to return an iterable of objects satisfying the AbstractRow interface. AbstractRow only requires methods for getting column values and names.

If the table is column access, the only required method is Tables.columns which needs to return an object implementing the AbstractColumns interface and having methods for getting a column object and column names.


darsnack · 2021-04-06T15:15:25Z

src/datasets/containers.jl

+    elseif Tables.columnaccess(dataset.table)
+        rowvals = []
+        for i in 1:(length ∘ Tables.columnnames)(dataset.table)
+            append!(rowvals, Tables.getcolumn(dataset.table, i)[idx])


Should this be push! or append!?

Yeah, it should be push!, since only a single item is being added.
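A minimal Base-only sketch of why the distinction matters here. The variable names are illustrative, not taken from the PR: push! adds its argument as one element, while append! iterates over its argument, so a string cell would be split into characters.

```julia
# Base-only illustration; no Tables.jl needed.
pushed = Any[]
push!(pushed, 1)        # adds the value as one element
push!(pushed, "abc")    # adds the whole string as one element

appended = Any[]
append!(appended, 1)      # a number iterates as a single element, so this looks fine
append!(appended, "abc")  # a string iterates by character, so the cell is split up
```

With numeric columns the two calls happen to agree, which is why the bug would only surface once a string column is involved.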

ToucheSir · 2021-04-06T16:00:48Z

src/datasets/containers.jl

+
+function LearnBase.nobs(dataset::FastAI.Datasets.TableDataset{T}) where {T}
+    if Tables.rowaccess(dataset.table)
+        return length(Tables.rows(dataset.table))


I saw a couple of implementations (e.g. JDBC) where length is not defined on the result of rows. The question is whether or not we care about that. There's also the question of whether this might end up materializing more data than we want as a side effect of calling rows or getcolumn. Unfortunately AFAICT there's no API in Tables.jl for querying the size of a source (if it has one).

Then is the only option to iterate through the rows to get the size, in case the table doesn't support column access? If it is, should I make the required changes?
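One possible shape for such a fallback, sketched with Base iterators only. The helper name count_rows is hypothetical and this is not a Tables.jl API; it assumes the row iterator reports its size through Base.IteratorSize, and otherwise counts by iterating.

```julia
# Hypothetical helper: use length when the iterator advertises one, else count.
function count_rows(rows)
    if Base.IteratorSize(rows) isa Union{Base.HasLength, Base.HasShape}
        return length(rows)
    end
    return count(_ -> true, rows)  # walks the iterator once
end

sized = (i for i in 1:4)                 # generator over a range: length is known
unsized = Iterators.filter(isodd, 1:10)  # filtered iterator: size unknown up front

n_sized = count_rows(sized)
n_unsized = count_rows(unsized)
```

The counting branch still has the cost raised above: for a lazy source it may read or materialize every row just to learn the size.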

darsnack · 2021-04-07T12:39:01Z

Based on the discussion, I am wondering if we should have a TableDataset and a CSVDataset. Seems like just doing the former is leaving a lot of performance on the table. CSV.jl will allow us to avoid iterating rows for CSV files.

darsnack · 2021-04-07T12:43:38Z

For data containers, doesn't it make more sense to target these based on file type? For tables already loaded from disk, I'm now thinking it makes more sense to define getobs directly on each Tables.jl implementation instead of wrapping all the implementations into one.

If we want a unified interface for the table file types, we could have TableDataset(path::String) = ... which checks the file extension then invokes CSVDataset etc.

lorenzoh · 2021-04-09T10:13:46Z

If we want a unified interface for the table file types, we could have TableDataset(path::String) = ... which checks the file extension then invokes CSVDataset etc.

I like this idea, this way we can specialize on the format while having a uniform API.

Just throwing out another possibility: Having one TableDataset{T} type that wraps a table of type T and then having getobs(td::TableDataset) = getobstable(td.table) and nobs(td::TableDataset) = nobstable(td.table) where we own getobstable and nobstable and can specialize them to the table type.

The advantage here is that we can then provide a default (perhaps slow) implementation for all tables. Also, end users will only deal with one type, i.e. calling TableDataset always returns a TableDataset.

Unrelated: I am in favor of consistently using AbstractPath types from FilePathsBase.jl, though. Maybe we can reexport the p"" macro, then you can call TableDataset(p"data.csv"). Having a canonical format for paths avoids a lot of fiddly errors.

darsnack · 2021-04-09T13:08:12Z

Why not just do getobs(td::TableDataset{<:DataFrame}, i)? @manikyabard can you implement this? In addition to the code that you already have, you would write getobs(td::TableDataset{<:DataFrame}, i) and getobs(td::TableDataset{<:CSV.File}, i) (which are faster implementations that assume that td.table is a DataFrame or CSV.File).

darsnack · 2021-04-09T13:09:06Z

Also, let's use AbstractPath too

lorenzoh · 2021-04-09T13:09:43Z

Ah yes, that is functionally equivalent and more elegant 👍
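The dispatch pattern agreed on here can be sketched without any table packages. Everything below is a stand-in: DatasetSketch and getobs_sketch are hypothetical names, and an indexable Vector plays the role that DataFrame or CSV.File would play in the real methods.

```julia
# Stand-in wrapper; the real TableDataset also checks Tables.istable.
struct DatasetSketch{T}
    table::T
end

# Generic fallback: walk the rows until the requested index (slow but general).
getobs_sketch(ds::DatasetSketch, idx) = first(Iterators.drop(ds.table, idx - 1))

# Specialization on the type parameter for tables that support direct indexing.
getobs_sketch(ds::DatasetSketch{<:AbstractVector}, idx) = ds.table[idx]

indexed = DatasetSketch([(a = 1,), (a = 2,), (a = 3,)])
lazy = DatasetSketch(((a = i,) for i in 1:3))

from_indexed = getobs_sketch(indexed, 2)
from_lazy = getobs_sketch(lazy, 2)
```

Callers only ever see one wrapper type, and Julia picks the faster method whenever the wrapped table type matches a specialization.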


darsnack

Needs tests


manikyabard · 2021-05-31T16:30:57Z

I have added a few tests for TableDataset in test/datasets/containers.jl.

lorenzoh · 2021-06-01T08:58:36Z

src/datasets/containers.jl

@@ -78,8 +79,8 @@ function LearnBase.nobs(dataset::TableDataset{T}) where {T}
    end
 end

-LearnBase.getobs(dataset::TableDataset{<:DataFrame}, idx) = dataset.table[idx, :]
+LearnBase.getobs(dataset::TableDataset{<:DataFrame}, idx) = [data for data in dataset.table[idx, :]]


Shouldn't getobs return an object that is indexable by the columns? So in this case just dataset.table[idx, :] would be fine because you can then index it by column name like row.col1/row[:col1]. And you would then pass the column names to TabularTransforms instead of integer indices which could quickly lead to hard to spot errors, if some columns are reordered or change. Also this would avoid needing to copy the data in the row as this currently does.

Yeah, for DataFrame this would make more sense. But is it fine to keep this return behaviour for getobs(dataset::TableDataset{<:CSV.File}, idx) and the general getobs(dataset::TableDataset{T}, idx) methods, considering that the types could be immutable and may not have setindex! defined (as in the case of CSV.Row), which might make transformations difficult if we use a copy of the input directly?

I think any transforms shouldn't be calling setindex! anyhow without a defensive copy first, so we should stick with dataset.table[idx, :] everywhere.

Agreeing with Brian that the transforms shouldn't mutate the arguments by default; we have encode!/run! for the inplace versions.

+1 for the existing suggestions

ToucheSir · 2021-06-01T15:06:33Z

src/datasets/containers.jl

 TableDataset(path::AbstractPath) = TableDataset(DataFrame(CSV.File(path)))

 function LearnBase.getobs(dataset::FastAI.Datasets.TableDataset{T}, idx) where {T}
    if Tables.rowaccess(dataset.table)
        for (index, row) in enumerate(Tables.rows(dataset.table))
            if index==idx
-                return row
+                return [data for data in row]


Similar comment here about keeping the row object instead of converting it.

So the main concern I have is that, if we use an object returned by getobs for transformation, and let's say normalization is defined as

struct NormalizeRow <: DataAugmentation.Transform
    normstats
    normidxs
end

function DataAugmentation.apply(tfm::NormalizeRow, item::TabularItem; randstate=nothing)
    x = copy(item.data)
    for idx in tfm.normidxs
        colmean, colstd = tfm.normstats[idx]
        x[idx] = (x[idx] - colmean)/colstd
    end
    TabularItem(x)
end

There's a possibility that functions like copy or setindex! won't be defined for the object returned (an example being CSV.Row, as I mentioned before). Maybe we can extract the data from the returned object and put it in a vector in the transformation step itself, but then the indices won't really work with it. Is there a way to retain the original type from getobs and still make these kinds of transformations work?

Let's solve this within the transforms portion of the pipeline instead of forcing a vector here.
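One way a transform could do this without copy or setindex! is sketched below. It assumes the row has already been converted to a NamedTuple and that statistics are keyed by column name; normalize_row and the column names are hypothetical, not part of FastAI.jl or DataAugmentation.jl.

```julia
# Hypothetical non-mutating transform: returns a new row, leaves the input alone.
# `stats` maps column names to (mean, std) pairs.
function normalize_row(row::NamedTuple, stats::NamedTuple)
    cols = keys(stats)
    vals = map(cols) do col
        colmean, colstd = stats[col]
        (row[col] - colmean) / colstd
    end
    # merge keeps the column order of `row` and replaces only the normalized columns
    return merge(row, NamedTuple{cols}(vals))
end

row = (age = 40.0, name = "a")
stats = (age = (30.0, 10.0),)
normalized = normalize_row(row, stats)
```

Keying by column name rather than integer position also sidesteps the reordering problem raised earlier in the review.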

ToucheSir · 2021-06-01T15:13:48Z

src/datasets/containers.jl

@@ -50,13 +50,14 @@ struct TableDataset{T}
    TableDataset{T}(table::T) where T = Tables.istable(table) ? new{T}(table) : error("Object doesn't implement Tables.jl interface")
 end

+TableDataset(table::T) where {T} = TableDataset{T}(table)
 TableDataset(path::AbstractPath) = TableDataset(DataFrame(CSV.File(path)))

 function LearnBase.getobs(dataset::FastAI.Datasets.TableDataset{T}, idx) where {T}
    if Tables.rowaccess(dataset.table)
        for (index, row) in enumerate(Tables.rows(dataset.table))


It's not faster, but instead of a loop + index check you could write something like first(Iterators.drop(Tables.rows(...), idx - 1)) (skip up until the index and return the next element, which will be the element at the index).

Yeah, I can do that.
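A small Base-only sketch of that suggestion. The helper name nth_row is hypothetical, and a generator stands in for the result of Tables.rows(...), since it can be iterated but not indexed.

```julia
# Hypothetical helper: skip idx - 1 rows, then take the next one.
nth_row(rows, idx) = first(Iterators.drop(rows, idx - 1))

# Stand-in for a row iterator that does not support indexing.
rows = ((id = i, value = 10i) for i in 1:5)
row = nth_row(rows, 3)
```

As noted, this is still linear in idx; it only replaces the explicit loop and index check. If idx is past the end, first throws because the remaining iterator is empty.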


darsnack

I left some comments. @ToucheSir and @lorenzoh covered the main concerns already.

Also, it looks like something has gone very wrong with a git rebase here. If it isn't immediately obvious how to solve it, I would make a local copy of src/datasets/* and the tests, then start with a fresh branch/PR and re-apply the changes.


darsnack · 2021-06-03T18:25:20Z

src/datasets/containers.jl

 TableDataset(path::AbstractPath) = TableDataset(DataFrame(CSV.File(path)))

 function LearnBase.getobs(dataset::FastAI.Datasets.TableDataset{T}, idx) where {T}
    if Tables.rowaccess(dataset.table)
        for (index, row) in enumerate(Tables.rows(dataset.table))
            if index==idx
-                return row
+                return [data for data in row]


Let's solve this within the transforms portion of the pipeline instead of forcing a vector here.


darsnack · 2021-06-03T18:27:18Z

src/datasets/containers.jl

@@ -78,8 +79,8 @@ function LearnBase.nobs(dataset::TableDataset{T}) where {T}
    end
 end

-LearnBase.getobs(dataset::TableDataset{<:DataFrame}, idx) = dataset.table[idx, :]
+LearnBase.getobs(dataset::TableDataset{<:DataFrame}, idx) = [data for data in dataset.table[idx, :]]


+1 for the existing suggestions


darsnack

Seems you figured out the git issue. The PR looks good to me. I made some minor suggestions on the test cases.

darsnack · 2021-06-07T11:41:30Z

test/Project.toml

+Makie = "ee78f7c6-11fb-53f2-987a-cfe4a2b5a57a"
+ShowCases = "605ecd9f-84a6-4c9e-81e2-4798472b76a3"


These seem orthogonal, maybe @lorenzoh can double-check?

JLD2, Makie and ShowCases are already in the master branch, not sure why they're shown here. Might just be that the list was sorted on a ]pkg command.


darsnack

Pending a check from @lorenzoh on the packages, I think this is LGTM.

lorenzoh

Since FluxML/FluxML-Community-Call-Minutes#34 states that the points include documentation, we shouldn't forget about that. Though I suppose this can be merged as is, as a basis for other work. We should definitely add more comprehensive documentation later.

lorenzoh · 2021-06-15T07:20:09Z

test/Project.toml

+Makie = "ee78f7c6-11fb-53f2-987a-cfe4a2b5a57a"
+ShowCases = "605ecd9f-84a6-4c9e-81e2-4798472b76a3"


JLD2, Makie and ShowCases are already in the master branch, not sure why they're shown here. Might just be that the list was sorted on a ]pkg command.

lorenzoh · 2021-06-15T07:24:35Z

@manikyabard The changes look good to me. Can you remove the Manifest.toml from your forked repo? It seems to be causing the CI errors. If they run fine, we can merge this.

manikyabard · 2021-06-15T07:43:51Z

@manikyabard The changes look good to me. Can you remove the Manifest.toml from your forked repo? It seems to be causing the CI errors. If they run fine, we can merge this.

Okay, I deleted Manifest.toml from the branch.

lorenzoh · 2021-06-15T08:05:50Z

Tests are failing. Did you run them locally? I think the problem is include("imports.jl") in test/datasets/containers.jl, which should be include("../imports.jl") since it is in a subdirectory.

manikyabard · 2021-06-15T09:53:47Z

Tests are failing. Did you run them locally? I think the problem is include("imports.jl") in test/datasets/containers.jl, which should be include("../imports.jl") since it is in a subdirectory.

So the path error should be fixed now, but it looks like there is still an error while downloading data in the test case "TableDataset from CSV", although the tests pass on my local system.

lorenzoh · 2021-06-15T10:36:01Z

Could you remove the commented-out lines in the test case?

manikyabard · 2021-06-15T10:37:42Z

Sure I'll do that.

lorenzoh · 2021-06-15T11:56:35Z

Looks good, merging!

darsnack reviewed Apr 6, 2021


ToucheSir reviewed Apr 6, 2021


manikyabard and others added 4 commits April 21, 2021 19:25

added TableDataset

19a77ac

Update src/datasets/containers.jl

81369e8

Co-authored-by: Kyle Daruwalla <daruwalla.k.public@icloud.com>

changed append to push in getobs for TableDataset

8eaa656

updated path type, added direct methods for DataFrames and CSV.File, …

b9c65eb

…and changed nobs order

manikyabard force-pushed the manikyabard/table_container branch from 290c8c4 to b9c65eb Compare April 21, 2021 19:27

darsnack requested changes May 21, 2021


manikyabard and others added 8 commits May 31, 2021 01:36

added TableDataset

8585fc7

Update src/datasets/containers.jl

dc4b007

Co-authored-by: Kyle Daruwalla <daruwalla.k.public@icloud.com>

changed append to push in getobs for TableDataset

d483b6d

updated path type, added direct methods for DataFrames and CSV.File, …

52a456d

…and changed nobs order

Merge branch 'manikyabard/table_container' of https://github.com/mani…

1d52456

…kyabard/FastAI.jl into manikyabard/table_container

Update src/datasets/containers.jl

134f1bc

Co-authored-by: Kyle Daruwalla <daruwalla.k.public@icloud.com>

fixed nobs typo

a2f6261

removed redundant line

d30c7f8

lorenzoh reviewed May 31, 2021


manikyabard and others added 2 commits May 31, 2021 21:54

added tests and made getobs consistent

81b36dd

fixed typo

9b36af1

Co-authored-by: lorenzoh <lorenz.ohly@gmail.com>

manikyabard requested a review from darsnack May 31, 2021 17:41

lorenzoh reviewed Jun 1, 2021


changed getobs back for DataFrame

77d9d2b

ToucheSir reviewed Jun 1, 2021


Merge branch 'manikyabard/table_container' of https://github.com/mani…

519a3a7

…kyabard/FastAI.jl into manikyabard/table_container

darsnack requested changes Jun 3, 2021


manikyabard added 2 commits June 6, 2021 00:37

Updated getobs

6f00d25

Changed getobs to return NamedTuple

2462546

manikyabard force-pushed the manikyabard/table_container branch from 3e97d01 to 2462546 Compare June 7, 2021 04:07

darsnack requested changes Jun 7, 2021


manikyabard and others added 2 commits June 9, 2021 14:18

Update test/datasets/containers.jl

d40c1f8

Co-authored-by: Kyle Daruwalla <daruwalla.k.public@icloud.com>

Updated tests for TableDataset container.

77459fa

Co-authored-by: Kyle Daruwalla <daruwalla.k.public@icloud.com>

darsnack approved these changes Jun 11, 2021


manikyabard mentioned this pull request Jun 14, 2021

FastAI.jl tabular development GSoC tracking FluxML/FluxML-Community-Call-Minutes#34

Closed

5 tasks

lorenzoh approved these changes Jun 15, 2021


remove Manifest.toml

c0f3fcc

fixed path in test

d92e32b

updated csv TableDataset test

6161b0a

removed old csv testcase

aa9f9af

lorenzoh merged commit 4cd87ad into FluxML:master Jun 15, 2021

manikyabard deleted the manikyabard/table_container branch June 17, 2021 15:14

manikyabard mentioned this pull request Jul 9, 2021

add blog about working with tabular data using FastAI.jl FluxML/fluxml.github.io#94

Open
