In [1]:
using Pkg
Pkg.activate("/Users/manikyabardhan/.julia/dev/FastAI")

# Tabular Classification

In tabular classification, the target is a categorical column. Here, we'll use the adult sample dataset from fastai and try to predict whether the salary is above 50K or not, making this a binary classification task.

In [2]:
using Flux
using FastAI
using FastAI.Datasets
using Tables
using Statistics
using FluxTraining
using DataAugmentation

┌ Info: Precompiling FastAI [5d0beca9-ade8-49ae-ad0b-a3cf890e669f]
└ @ Base loading.jl:1317


We can quickly download a fastai dataset and get its path by using `datasetpath`. Once we have the path, we'll load the data into a `TableDataset`. By default, if we pass just the path to `TableDataset`, the data is loaded into a `DataFrame`, but we can use any package to access our data and pass any object satisfying the Tables.jl interface instead.

In [3]:
data = TableDataset(joinpath(datasetpath("adult_sample"), "adult.csv"))

TableDataset{DataFrames.DataFrame}(32561×15 DataFrame
   Row │ age    workclass          fnlwgt  education      education-num  marit ⋯
       │ Int64  String             Int64   String         Float64?       Strin ⋯
───────┼────────────────────────────────────────────────────────────────────────
     1 │    49   Private           101320   Assoc-acdm             12.0   Marr ⋯
     2 │    44   Private           236746   Masters                14.0   Divo
     3 │    38   Private            96185   HS-grad           missing     Divo
     4 │    38   Self-emp-inc      112847   Prof-school            15.0   Marr
     5 │    42   Self-emp-not-inc   82297   7th-8th           missing     Marr ⋯
     6 │    20   Private            63210   HS-grad                 9.0   Neve
     7 │    49   Private            44434   Some-college           10.0   Divo
  

If our data were in a different format, e.g. Parquet, it could be loaded into a `TableDataset` as shown below.

In [None]:
using Parquet
TableDataset(read_parquet(parquet_path));

`mapobs` is used here to lazily pair each row with the value of its target column, giving `(row, target)` observations.

In [5]:
splitdata = mapobs(row -> (row, row[:salary]), data);
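
Lazy mapping means the function runs only when an observation is requested, not when the mapped dataset is created. The sketch below illustrates that idea in plain Julia. `LazyMap`, `rows`, and `calls` are hypothetical names made up for this illustration, the row values are toy data, and this is not how `mapobs` itself is implemented.

```julia
# A hypothetical lazy wrapper: it stores the function and the data,
# and applies `f` only when an element is accessed.
struct LazyMap{F,D}
    f::F
    data::D
end

Base.getindex(m::LazyMap, i) = m.f(m.data[i])
Base.length(m::LazyMap) = length(m.data)

# Toy rows standing in for table rows (values are made up).
rows = [(age = 49, salary = ">=50k"), (age = 20, salary = "<50k")]

calls = Ref(0)  # counts how often the mapping function actually runs
lazy = LazyMap(rows) do row
    calls[] += 1
    (row, row[:salary])  # same shape as the cell above: (row, target)
end

calls_before = calls[]   # nothing has been computed yet
row, target = lazy[2]    # the function runs only now
calls_after = calls[]
```

Because nothing is computed up front, wrapping a large table this way costs almost nothing until observations are actually loaded.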

To create a learning method for a tabular classification task, we need an input block, an output block, and the encodings to be performed on the data.

The input block here is a `TableRow`, which contains information about the nature of the columns (i.e. categorical or continuous) along with an indexable collection mapping each categorical column name to a collection of the distinct classes in that column. We can get this mapping by using the `gettransformdict` function with `DataAugmentation.Categorify`.

The output block used is `Label` for single-column classification, and the unique classes have to be passed to it.

This is followed by the encodings which need to be applied to our input and output blocks. For the input block, we have passed the data to `TabularPreprocessing` here to get a standard set of transformations to apply, but this can be easily customized by passing any tabular transformation from DataAugmentation.jl, or a composition of those, to `TabularPreprocessing`. In addition to this, we have one-hot encoded the output block.

In [6]:
cat, cont = FastAI.getcoltypes(data)
target = :salary
cat = filter(!isequal(target), cat)
catdict = FastAI.gettransformdict(data, DataAugmentation.Categorify, cat);

In [7]:
method = BlockMethod(
    (
        TableRow(cat, cont, catdict), 
        Label(unique(data.table[:, target]))
    ),
    (FastAI.TabularPreprocessing(data), FastAI.OneHot())
)



BlockMethod(TableRow{8, 6, Dict{Any, Any}} -> Label{String})

If our initial problem weren't a classification task and we had a continuous target column, we would need to perform tabular regression. To create a learning method suitable for regression, we use a `Continuous` block to represent our target column. This works even with multiple continuous target columns by passing the number of columns to `Continuous`. For example, the method here could be used for 3 targets.

In [8]:
method2 = BlockMethod(
    (
        TableRow(cat, cont, catdict), 
        Continuous(3)
    ),
    (FastAI.TabularPreprocessing(data),),
    outputblock = Continuous(3)
)



BlockMethod(TableRow{8, 6, Dict{Any, Any}} -> Continuous)

To get an overview of the learning method created, and as a sanity test, we can use the `describemethod` function. This shows us what encodings will be applied to which blocks, and how the predicted ŷ values are decoded.

In [9]:
describemethod(method)

#### `LearningMethod` summary

  * Task: `TableRow{8, 6, Dict{Any, Any}} -> Label{String}`
  * Model blocks: `FastAI.EncodedTableRow{8, 6, Dict{Any, Any}} -> FastAI.OneHotTensor{0, String}`

Encoding a sample (`encode(method, context, sample)`)

|               Encoding |              Name |                                 `method.blocks[1]` |                   `method.blocks[2]` |
| ----------------------:| -----------------:| --------------------------------------------------:| ------------------------------------:|
|                        | `(input, target)` |                   `TableRow{8, 6, Dict{Any, Any}}` |                      `Label{String}` |
| `TabularPreprocessing` |                   | **`FastAI.EncodedTableRow{8, 6, Dict{Any, Any}}`** |                      `Label{String}` |
|               `OneHot` |          `(x, y)` |     `FastAI.EncodedTableRow{8, 6, Dict{Any, Any}}` | **`FastAI.OneHotTensor{0, String}`** |

Decoding a model output (`decode(method, context, ŷ)`)

|               Decoding |          Name |             `method.outputblock` |
| ----------------------:| -------------:| --------------------------------:|
|                        |          `ŷ` | `FastAI.OneHotTensor{0, String}` |
|               `OneHot` |               |              **`Label{String}`** |
| `TabularPreprocessing` | `target_pred` |                  `Label{String}` |
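
The `OneHot` encoding and decoding steps in the tables above can be sketched in plain Julia. This is a minimal illustration, not the FastAI.jl implementation: `onehot` and `decode_onehot` are hypothetical helper names, and the class labels are assumed for the example (the real ones come from `unique(data.table[:, target])`).

```julia
# Assumed class labels, for illustration only.
classes = ["<50k", ">=50k"]

# Encode: 1 at the position of the label, 0 everywhere else.
onehot(label, classes) = Float32.(classes .== label)

# Decode: pick the class whose score is highest.
decode_onehot(ŷ, classes) = classes[argmax(ŷ)]

y = onehot(">=50k", classes)
pred = decode_onehot(Float32[0.2, 0.8], classes)
```

Decoding with `argmax` means a model output does not have to be an exact one-hot vector; any vector of scores maps back to a single label.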


`getobs` gets us a row of data from the `TableDataset`, which we encode here. This gives us a tuple with the input and target. The input is again a tuple, containing the categorical values (which have been label encoded or "categorified") and the continuous values (which have been normalized, with any missing values filled).

In [10]:
x = encode(method, Training(), getobs(splitdata, 1000))

(([5, 16, 2, 10, 5, 2, 3, 2], [1.6435221651965317, -0.2567538819371021, -2.751580937680526, -0.14591824281680102, -0.21665620002803673, -0.035428902921319616]), Float32[0.0, 1.0])
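
To make the encoded output above more concrete, here is a rough sketch of the three operations on toy columns taken from the first seven rows of the table. It is only an approximation of what the preprocessing does: the variable names are hypothetical, the index assigned to each class will differ from the real encoding, the statistics are computed on seven values rather than the full dataset, and filling missing values with the median is an assumption.

```julia
using Statistics

# Categorical column: map each distinct class to an integer index.
education = ["Assoc-acdm", "Masters", "HS-grad", "Prof-school",
             "7th-8th", "HS-grad", "Some-college"]
class_index = Dict(c => i for (i, c) in enumerate(unique(education)))
education_encoded = [class_index[c] for c in education]

# Continuous column: fill missing values, then normalize.
education_num = [12.0, 14.0, missing, 15.0, missing, 9.0, 10.0]
fill_value = median(skipmissing(education_num))  # assumption: fill with the median
filled = coalesce.(education_num, fill_value)
normalized = (filled .- mean(filled)) ./ std(filled)
```

After these steps every categorical value is an integer that can index an embedding table, and every continuous column has zero mean and unit standard deviation.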

To quickly get a model suitable for our learning method, we can use the `methodmodel` function. The second argument is a collection of custom backbones to use, if any; here we pass an empty `NamedTuple` to get the defaults.

In [11]:
model = methodmodel(method, NamedTuple())

Chain(
  Parallel(
    vcat,
    Chain(
      FastAI.Models.var"#43#45"(),
      Parallel(
        vcat,
        Embedding(10, 6),               # 60 parameters
        Embedding(17, 8),               # 136 parameters
        Embedding(8, 5),                # 40 parameters
        Embedding(17, 8),               # 136 parameters
        Embedding(7, 5),                # 35 parameters
        Embedding(6, 4),                # 24 parameters
        Embedding(3, 3),                # 9 parameters
        Embedding(43, 13),              # 559 parameters
      ),
      identity,
    ),
    BatchNorm(6),                       # 12 parameters, plus 12
  ),
  Chain(
    Dense(58, 200, relu; bias=false),   # 11_600 parameters
    BatchNorm(200),                     # 400 parameters, plus 400
    identity,
  ),
  Chain(
    Dense(200, 100, relu; bias=false),  # 20

It is really simple to create a custom backbone using the functions present in `FastAI.Models`.

In [16]:
cardinalities = collect(map(col -> length(catdict[col]), cat))
embedszs = FastAI.Models.get_emb_sz(cardinalities, cat)
catback = FastAI.Models.tabular_embedding_backbone(embedszs, 0.2);
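
The embedding widths in the model printed earlier are consistent with the rule of thumb used by fastai, `min(600, round(1.6 * n^0.56))` for a column with `n` categories. Whether `get_emb_sz` implements exactly this rule is an assumption; the sketch below only checks it against the printed layers, and `emb_size_rule` is a hypothetical name.

```julia
# Assumed rule of thumb: embedding width grows sublinearly with cardinality.
emb_size_rule(n) = min(600, round(Int, 1.6 * n^0.56))

# Cardinalities as they appear in the printed `Embedding(n, width)` layers.
printed_cardinalities = [10, 17, 8, 17, 7, 6, 3, 43]
widths = emb_size_rule.(printed_cardinalities)

# The concatenated embeddings plus the 6 continuous columns
# feed the first Dense layer.
first_dense_input = sum(widths) + 6
```

The widths come out as 6, 8, 5, 8, 5, 4, 3 and 13, and their sum plus the 6 continuous columns gives 58, which matches `Dense(58, 200, ...)` in the printed model.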

The backbone `Dict` can take three kinds of backbones:
- `:categorical`
- `:continuous`
- `:finalclassifier`

We can choose to pass in any combination of these in the `methodmodel` function.
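
One way to picture "any combination" is a lookup that falls back to a default for every key that was not supplied. The sketch below shows that pattern with plain Julia's `get`; it is an illustration of the idea, not the actual logic inside `methodmodel`. `default_backbones` and `pick` are hypothetical names, and strings stand in for real Flux layers.

```julia
# Placeholder defaults; strings stand in for real Flux layers.
default_backbones = (categorical = "default embeddings",
                     continuous = "default BatchNorm",
                     finalclassifier = "default classifier")

# Use the supplied backbone for `key` if present, else the default.
pick(backbones, key) = get(backbones, key, default_backbones[key])

custom = Dict(:categorical => "custom embeddings")
chosen = [pick(custom, k) for k in (:categorical, :continuous, :finalclassifier)]

# `get` works on an empty NamedTuple too, so "no custom backbones" needs no special case.
none_chosen = pick(NamedTuple(), :continuous)
```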

In [20]:
backbone = Dict(:categorical => catback)

Dict{Symbol, Chain{Tuple{FastAI.Models.var"#65#67", Parallel{typeof(vcat), Vector{Flux.Embedding{Matrix{Float32}}}}, Dropout{Float64, Colon}}}} with 1 entry:
  :categorical => Chain(#65, Parallel(vcat, Embedding(10, 6), Embedding(17, 8),…

In [21]:
model = methodmodel(method, backbone)

Chain(
  Parallel(
    vcat,
    Chain(
      FastAI.Models.var"#65#67"(),
      Parallel(
        vcat,
        Embedding(10, 6),               # 60 parameters
        Embedding(17, 8),               # 136 parameters
        Embedding(8, 5),                # 40 parameters
        Embedding(17, 8),               # 136 parameters
        Embedding(7, 5),                # 35 parameters
        Embedding(6, 4),                # 24 parameters
        Embedding(3, 3),                # 9 parameters
        Embedding(43, 13),              # 559 parameters
      ),
      Dropout(0.2),
    ),
    BatchNorm(6),                       # 12 parameters, plus 12
  ),
  Chain(
    Dense(58, 200, relu; bias=false),   # 11_600 parameters
    BatchNorm(200),                     # 400 parameters, plus 400
    identity,
  ),
  Chain(
    Dense(200, 100, relu; bias=false),

To directly get a `Learner` suitable for our method and data, we can use the `methodlearner` function. 

In [22]:
learner = methodlearner(method, splitdata, backbone, Metrics(accuracy), batchsize=128, dlkwargs=(buffered=false,))

Learner()

Once we have our learner, we can just call `FluxTraining.fit!` on it to train it for the desired number of epochs.

In [23]:
FluxTraining.fit!(learner, 1)

Epoch 1 TrainingPhase(): 100%|██████████████████████████| Time: 0:01:02


┌───────────────┬───────┬────────┬──────────┐
│         Phase │ Epoch │   Loss │ Accuracy │
├───────────────┼───────┼────────┼──────────┤
│ TrainingPhase │   1.0 │ 0.4337 │   0.8039 │
└───────────────┴───────┴────────┴──────────┘


Epoch 1 ValidationPhase(): 100%|████████████████████████| Time: 0:00:02


┌─────────────────┬───────┬─────────┬──────────┐
│           Phase │ Epoch │    Loss │ Accuracy │
├─────────────────┼───────┼─────────┼──────────┤
│ ValidationPhase │   1.0 │ 0.34747 │  0.83589 │
└─────────────────┴───────┴─────────┴──────────┘
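
For one-hot targets, accuracy is commonly computed as the fraction of observations whose highest-scoring class matches the true class. The sketch below shows that computation on a toy batch; `batch_accuracy` is a hypothetical helper, the numbers are made up, and it is assumed (not confirmed by this notebook) that the `Accuracy` column is computed in an equivalent way.

```julia
using Statistics

# Columns are observations, rows are classes (the usual Flux batch layout).
batch_accuracy(ŷ, y) = mean(map(argmax, eachcol(ŷ)) .== map(argmax, eachcol(y)))

ŷ = Float32[0.9 0.2 0.4; 0.1 0.8 0.6]  # model scores for 3 observations
y = Float32[1 0 1; 0 1 0]              # one-hot targets
acc = batch_accuracy(ŷ, y)             # 2 of 3 predictions match
```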
