add blog about working with tabular data using FastAI.jl #94

Open
wants to merge 2 commits into
base: main

@manikyabard commented Jul 9, 2021

This post explores some of the work done on FastAI.jl as part of GSoC'21 (container PR, transformation PR) under the mentorship of @darsnack, @ToucheSir and @lorenzoh. It shows how to get started with tabular data by creating a container and performing various transformations on it.

To start working, we first load our tabular data in a form that supports the interface defined by [Tables.jl](https://tables.juliadata.org/stable/#Implementing-the-Interface-(i.e.-becoming-a-Tables.jl-source)-1). Most popular packages for loading data from different formats already do so, so you probably won't have to worry about this.

Here, we have a `path` to a CSV file, which we'll load using the [CSV.jl](https://github.com/JuliaData/CSV.jl) package and convert to a DataFrame using [DataFrames.jl](https://github.com/JuliaData/DataFrames.jl).
If your data is present in a different format, you could use a package which supports loading that format, provided that the final object created supports the required interface.
Member

What is this required interface? Can you link it? Or is this a general comment?

Member

Ah, it seems you are referring to the Tables.jl interface; maybe note that explicitly?

Author

Yes, this was referring to the Tables.jl interface. Sure, I'll do that.


[FastAI.jl](https://github.com/FluxML/FastAI.jl) is a package inspired by [fastai](https://github.com/fastai/fastai), and its goal is to make it easy to create state-of-the-art models.

This blog post shows how to get started with tabular data using FastAI.jl and related packages. The work presented here was done as part of [GSoC'21](https://summerofcode.withgoogle.com/projects/#5088642453733376) under the mentorship of Kyle Daruwalla, Brian Chen and Lorenz Ohly.
Member

I think you should not undersell your work here. I am truthfully unfamiliar with the deep technical detail, but you could say something like "Before my GSoC project, we could only do x and y. Now we can do XY & Z together with this unified interface". This will make it very clear why someone should read this post.

Member

Agreed, this project was no small feat. Just look at how long it's taken other frameworks to (not) add support for new modalities!

Author

Thanks for the comments! I'll add this in as well.

Co-authored-by: Logan Kilpatrick  <23kilpatrick23@gmail.com>

julia> path = joinpath(datasetpath("adult_sample"), "adult.csv");

julia> df = CSV.File(path)|> DataFrames.DataFrame; first(df, 5)
Member

Suggested change
- julia> df = CSV.File(path)|> DataFrames.DataFrame; first(df, 5)
+ julia> df = DataFrames.DataFrame(CSV.File(path))
+ julia> first(df, 5)


```

This `TableDataset` object lets us get the observation at a particular index with `getindex(td, index)` and the total number of observations with `nobs(td)`.
Member

Maybe add in a line about why this is cool, and how it generalises the usual getindex based approach for arrays to data frames?


julia> item = DataAugmentation.TabularItem(row, Tables.columnnames(df));

julia> DataAugmentation.apply(normalize, item).data
Member

Show the TabularItem here to clarify what is written in the next sentence? We never see the TabularItem after normalisation.

@logankilpatrick
Member

Hey @manikyabard, it would be great to get this wrapped up. Let me know if I can help in any way!

@manikyabard
Author

Sure @logankilpatrick, I'll get this done (although the next 2-3 weeks look a little busy for me, so this might take a bit). Also just wanted to get a confirmation from @darsnack, @ToucheSir, or @lorenzoh if it's fine to put this post here since we were talking about putting this on the FastAI.jl website as well. I think we did discuss this a few ML Community calls ago but can't remember what our opinion was on that.

Another thing is that this blog post previously focused mainly on loading the data and performing some transformations on it (because that was all the code written at the time), but we have come a long way since then, and it can probably cover more functionality, such as creating and training tabular models with the data.

@darsnack
Member

I think it's fine to post this on FluxML, but I agree the content should be expanded to include the full GSoC.
