add blog about working with tabular data using FastAI.jl #94

**blog/_posts/2021-07-09-FastAI-tabular-data.md** (new file, 207 additions)

---
title: Working with Tabular Data in FastAI.jl
author: Manikya Bardhan
layout: blog
---

[FastAI.jl](https://github.com/FluxML/FastAI.jl) is a package inspired by [fastai](https://github.com/fastai/fastai), and its goal is to make it easy to create state-of-the-art deep learning models.

This blog post shows how to get started working with tabular data using FastAI.jl and related packages. The work presented here was done as a part of [GSoC'21](https://summerofcode.withgoogle.com/projects/#5088642453733376) under the mentorship of Kyle Daruwalla, Brian Chen and Lorenz Ohly.
> **Member:** I think you should not undersell your work here. I am truthfully unfamiliar with the deep technical detail, but saying something like "Before my GSoC project, we could only do x and y. Now we can do XY & Z together with this unified interface" will make it very clear why someone should read this post.

> **Member:** Agreed, this project was no small feat. Just look at how long it's taken other frameworks to (not) add support for new modalities!

> **Author:** Thanks for the comments! I'll add this in as well.


## Loading the data in a container

To start, we'll have to load our tabular data into an object that supports the interface defined by [Tables.jl](https://tables.juliadata.org/stable/#Implementing-the-Interface-(i.e.-becoming-a-Tables.jl-source)-1). Most of the popular packages for reading data in different formats already do this, so you probably won't have to worry about it.

Below, we have the `path` to a CSV file, which we'll read using the [CSV.jl](https://github.com/JuliaData/CSV.jl) package and then convert into a `DataFrame` using [DataFrames.jl](https://github.com/JuliaData/DataFrames.jl).
If your data is in a different format, you can use a package that supports loading that format, provided the final object supports the Tables.jl interface.
> **Member:** What is this required interface? Can you link it? Or is this a general comment?

> **Member:** Ah, seems like you are referring to the Tables.jl interface, maybe explicitly note that?

> **Author:** Yes, this was referring to the Tables.jl interface. Sure, I'll do that.
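
To illustrate what "supports the Tables.jl interface" means, here is a minimal sketch of ours (not part of the original post): even a plain named tuple of column vectors qualifies, and `Tables.istable` lets you check any object.

```julia
julia> using Tables

julia> nt = (age = [49, 44, 38], education = ["Assoc-acdm", "Masters", "HS-grad"]);

julia> Tables.istable(nt)  # a NamedTuple of equal-length vectors is a valid column table
true
```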


```julia
julia> using CSV, DataFrames

julia> using FastAI, FastAI.Datasets

julia> path = joinpath(datasetpath("adult_sample"), "adult.csv");

julia> df = DataFrames.DataFrame(CSV.File(path));

julia> first(df, 5)
5×15 DataFrame
Row │ age workclass fnlwgt education education-num marital-status occupation rel ⋯
│ Int64 String Int64 String Float64? String String? Str ⋯
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 49 Private 101320 Assoc-acdm 12.0 Married-civ-spouse missing Wi ⋯
2 │ 44 Private 236746 Masters 14.0 Divorced Exec-managerial No
3 │ 38 Private 96185 HS-grad missing Divorced missing Un
4 │ 38 Self-emp-inc 112847 Prof-school 15.0 Married-civ-spouse Prof-specialty Hu
5 │ 42 Self-emp-not-inc 82297 7th-8th missing Married-civ-spouse Other-service Wi ⋯
8 columns omitted
```

After getting an object satisfying the Tables.jl interface, we can pass it to `FastAI.Datasets.TableDataset` to get a data container satisfying the `LearnBase` interface.

```julia
julia> td = TableDataset(df)
TableDataset{DataFrame}(32561×15 DataFrame
Row │ age workclass fnlwgt education education-num marital-status occupation ⋯
│ Int64 String Int64 String Float64? String String? ⋯
───────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 49 Private 101320 Assoc-acdm 12.0 Married-civ-spouse missing ⋯
2 │ 44 Private 236746 Masters 14.0 Divorced Exec-managerial
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
32560 │ 32 Local-gov 217296 HS-grad 9.0 Married-civ-spouse Transport-moving
32561 │ 26 Private 182308 Some-college 10.0 Married-civ-spouse Prof-specialty
8 columns and 32557 rows omitted)

```

The `TableDataset` container lets us fetch the observation at any index with `getobs(td, index)` and query the total number of observations with `nobs(td)`. This generalises the usual `getindex`-based indexing for arrays to data frames, so downstream code can treat a table like any other data container.
> **Member:** Maybe add in a line about why this is cool, and how it generalises the usual `getindex`-based approach for arrays to data frames?


```julia
julia> getobs(td, 3)
DataFrameRow
Row │ age workclass fnlwgt education education-num marital-status occupation relationship race sex ⋯
│ Int64 String Int64 String Float64? String String? String String Str ⋯
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────
3 │ 38 Private 96185 HS-grad missing Divorced missing Unmarried Black Fe ⋯
6 columns omitted

julia> nobs(td)
32561
```
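
Because `getobs` and `nobs` form a generic observation interface, the same calls work on any other data container as well. Here is a small toy sketch of ours (assuming the standard array fallbacks that ship with the FastAI.jl data ecosystem are loaded):

```julia
julia> v = collect(1:10);

julia> getobs(v, 3) == v[3]  # same element that plain `getindex` would return
true

julia> nobs(v) == length(v)
true
```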

## Data Preprocessing

Although we have loaded the data into a container that can later be used to create a `DataLoader` for training, we often want to apply transformations to it first.
The tabular transformations are defined as part of the [DataAugmentation.jl](https://github.com/lorenzoh/DataAugmentation.jl) package, and the ones currently available are:
- Normalization
- FillMissing
- Categorify

### Normalization

The normalization transformation (`NormalizeRow`) normalizes a row of data using the mean and standard deviation of each column. To start, we'll create a `Dict` holding these statistics for every column to be normalized.

```julia
julia> using DataAugmentation, Statistics

julia> continuous_cols = (:age, :fnlwgt, Symbol("education-num"), Symbol("capital-loss"), Symbol("hours-per-week"));

julia> normstats = Dict();

julia> for col in continuous_cols
normstats[col] = (Statistics.mean(skipmissing(df[:, col])), Statistics.std(skipmissing(df[:, col])))
end

julia> normstats
Dict{Any, Any} with 5 entries:
:fnlwgt => (1.89778e5, 105550.0)
:age => (38.5816, 13.6404)
Symbol("education-num") => (10.0798, 2.573)
Symbol("capital-loss") => (87.3038, 402.96)
Symbol("hours-per-week") => (40.4375, 12.3474)
```

After building the `normstats` dictionary, we can create the `NormalizeRow` object which will perform this transformation.


```julia
julia> normalize = DataAugmentation.NormalizeRow(normstats, continuous_cols);
```

Now let's quickly grab a row of data and see `NormalizeRow` in action.

All the transformations work on `TabularItem` objects and are applied with the `apply` function.

```julia
julia> row = getobs(td, 1)
DataFrameRow
Row │ age workclass fnlwgt education education-num marital-status occupation relationship race ⋯
│ Int64 String Int64 String Float64? String String? String Stri ⋯
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 49 Private 101320 Assoc-acdm 12.0 Married-civ-spouse missing Wife Whi ⋯
7 columns omitted

julia> item = DataAugmentation.TabularItem(row, Tables.columnnames(df));

julia> DataAugmentation.apply(normalize, item).data
(age = 0.7637846676602542, workclass = " Private", fnlwgt = -0.8380709161872286, education = " Assoc-acdm", education-num = 0.7462826288318035, marital-status = " Married-civ-spouse", occupation = missing, relationship = " Wife", race = " White", sex = " Female", capital-gain = 0, capital-loss = 4.5034127099423245, hours-per-week = -0.035428902921319616, native-country = " United-States", salary = ">=50k")
```

> **Member:** Show the `TabularItem` here to clarify what is written in the next sentence? We never see the `TabularItem` post normalisation.

We can see that the row has been normalized: `apply` returns a new `TabularItem`, and its `data` field holds a `NamedTuple` with the transformed values.
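
As a quick sanity check (a small sketch of ours, not part of the original post), the normalized `age` is just `(x - mean) / std` computed from the `normstats` we built earlier:

```julia
julia> (49 - normstats[:age][1]) / normstats[:age][2] ≈ 0.7637846676602542
true
```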

### Filling Missing values

Similarly, for the `FillMissing` transformation, we'll first create a dictionary containing the required fill values, construct a `FillMissing` object, and then see the transformation in action by using the `apply` method.

```julia
julia> fmvals = Dict();

julia> for col in continuous_cols
fmvals[col] = Statistics.median(skipmissing(df[:, col]))
end

julia> fmvals[:occupation] = " Exec-managerial";

julia> fm = DataAugmentation.FillMissing(fmvals, [continuous_cols..., :occupation]);

julia> DataAugmentation.apply(fm, item).data
(age = 49, workclass = " Private", fnlwgt = 101320, education = " Assoc-acdm", education-num = 12.0, marital-status = " Married-civ-spouse", occupation = " Exec-managerial", relationship = " Wife", race = " White", sex = " Female", capital-gain = 0, capital-loss = 1902, hours-per-week = 40, native-country = " United-States", salary = ">=50k")
```
As we can see, the "occupation" column, which originally had a missing value in this row, has been filled with the value specified in the dictionary.
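
Rather than hard-coding the fill value for a categorical column, you could also compute it from the data. Here is a small sketch of ours that uses the most frequent non-missing value (this assumes StatsBase is available; the original post does not rely on it):

```julia
julia> using StatsBase

julia> fmvals[:occupation] = mode(collect(skipmissing(df[:, :occupation])));  # most frequent non-missing value
```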

### Label Encoding Categorical Variables

For categorical columns, the `Categorify` transform can be used; it label encodes a column so that each unique class is represented by an integer. These integers can later be used to index into `Embedding` layers in the tabular model. Any `missing` values in the columns being transformed are assigned the integer 1.

Again, we'll create a dictionary whose keys are the column names and whose values are the classes present in each column.

```julia
julia> categorical_cols = (Symbol("workclass"), Symbol("education"), Symbol("marital-status"), Symbol("occupation"), Symbol("relationship"), Symbol("race"), Symbol("sex"), Symbol("native-country"), :salary);

julia> catdict = Dict();

julia> for col in categorical_cols
catdict[col] = unique(df[:, col])
end

julia> categorify = DataAugmentation.Categorify(catdict, categorical_cols);
┌ Warning: There is a missing value present for category 'occupation' which will be removed from Categorify dict
└ @ DataAugmentation ~/.julia/dev/DataAugmentation/src/rowtransforms.jl:108

julia> categorify.dict
Dict{Any, Any} with 9 entries:
:education => [" Assoc-acdm", " Masters", " HS-grad", " Prof-school", " 7th-8th", " Some-college"…
:race => [" White", " Black", " Asian-Pac-Islander", " Amer-Indian-Eskimo", " Other"]
:sex => [" Female", " Male"]
:workclass => [" Private", " Self-emp-inc", " Self-emp-not-inc", " State-gov", " Federal-gov", " …
:occupation => Union{Missing, String}[" Exec-managerial", " Prof-specialty", " Other-service", " H…
:relationship => [" Wife", " Not-in-family", " Unmarried", " Husband", " Own-child", " Other-relativ…
Symbol("native-country") => [" United-States", " ?", " Puerto-Rico", " Mexico", " Canada", " Taiwan", " Vietnam…
Symbol("marital-status") => [" Married-civ-spouse", " Divorced", " Never-married", " Widowed", " Married-spouse…
:salary => [">=50k", "<50k"]
```
```julia
julia> DataAugmentation.apply(categorify, item).data
(age = 49, workclass = 2, fnlwgt = 101320, education = 2, education-num = 12.0, marital-status = 2, occupation = 1, relationship = 2, race = 2, sex = 2, capital-gain = 0, capital-loss = 1902, hours-per-week = 40, native-country = 2, salary = 2)
```
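
Since `catdict` stores the class list for every column, it also tells us the number of classes per column, which is what you would need later on, for example, to size `Embedding` layers. A small sketch of ours:

```julia
julia> cardinalities = Dict(col => length(catdict[col]) for col in categorical_cols);
```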

### Compositions of Transforms

We saw how these transformations can be applied individually, but in most cases we want to apply a combination of them to the data. This is easy to do because all the transformations follow the transformation interface defined by the DataAugmentation.jl package.

If we want to apply `NormalizeRow`, `FillMissing`, and `Categorify` together on our data, we can just use `|>` to create a sequence of transforms.

```julia
julia> tfms = normalize |> fm |> categorify;

julia> typeof(tfms)
Sequence{Tuple{DataAugmentation.NormalizeRow{Dict{Any, Any}, NTuple{5, Symbol}}, DataAugmentation.FillMissing{Dict{Any, Any}, Vector{Symbol}}, DataAugmentation.Categorify{Dict{Any, Any}, NTuple{9, Symbol}}}}
```
Now we can call the `apply` function on this `Sequence`, just as we did when applying the transformations individually.

```julia
julia> DataAugmentation.apply(tfms, item).data
(age = 0.7637846676602542, workclass = 2, fnlwgt = -0.8380709161872286, education = 2, education-num = 0.7462826288318035, marital-status = 2, occupation = 17, relationship = 2, race = 2, sex = 2, capital-gain = 0, capital-loss = 4.5034127099423245, hours-per-week = -0.035428902921319616, native-country = 2, salary = 2)
```
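
To preprocess more than a single row, the same pipeline can simply be mapped over the container. A minimal sketch of ours, using only the calls introduced above:

```julia
julia> cols = Tables.columnnames(df);

julia> preprocessed = [DataAugmentation.apply(tfms, DataAugmentation.TabularItem(getobs(td, i), cols)).data for i in 1:3];

julia> length(preprocessed)
3
```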

## Conclusion

We saw how to take a tabular dataset stored on disk, wrap it in a data container, and apply various transformations to it. In a future post, we'll see how to use this `TableDataset` to create a data loader using the DataLoaders.jl package, construct models for the tabular data, and finally train the model.