In [2]:
using Pkg
Pkg.activate("/Users/manikyabardhan/.julia/dev/FastAI")

# Tabular Classification

Tabular classification is the task of predicting a categorical target column from the other columns of a table. Here, we'll use the adult sample dataset from fastai and predict whether a person's salary is above 50K, which makes this a binary classification task.

In [3]:
using Flux
using FastAI
using FastAI.Datasets
using Tables
using Statistics
using FluxTraining
using DataAugmentation


We can quickly download and get the path of any dataset from fastai by using `datasetpath`. Once we have the path, we'll load the data into a `TableDataset`. By default, if we pass just the path to `TableDataset`, the data is loaded into a `DataFrame`, but we can use any package to access our data and pass in any object that satisfies the Tables.jl interface.

In [4]:
data = Datasets.TableDataset(joinpath(datasetpath("adult_sample"), "adult.csv"))
cont = [:age, :fnlwgt, Symbol("education-num"), Symbol("capital-loss"), Symbol("hours-per-week")]
cat = [Symbol("workclass"), Symbol("education"), Symbol("marital-status"), Symbol("occupation"), Symbol("relationship"), Symbol("race"), Symbol("sex"), Symbol("native-country")];

`mapobs` is used below to lazily split each row into an `(input, target)` pair, where the target is the `:salary` column.

Now, to perform the required preprocessing, we'll use the tabular transforms from DataAugmentation.jl. For this we'll create dictionaries containing the required information using the `gettransformationdict` helper function. More information about this can be found in the DataAugmentation.jl docs. 

In [5]:
splitdata = mapobs(row -> (row, row[:salary]), data);
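
To make the "lazy" part concrete, here is a minimal sketch in plain Julia. `LazyMap` is a hypothetical stand-in written for this illustration, not the type that `mapobs` actually returns, and the two rows are made-up values. The point is only that the mapping function runs when an observation is requested, not when the container is created.

```julia
# Hypothetical stand-in for a lazily mapped data container (not FastAI's type).
struct LazyMap{F, D}
    f::F
    data::D
end

Base.length(m::LazyMap) = length(m.data)
# The function is applied only when an observation is requested.
Base.getindex(m::LazyMap, i::Int) = m.f(m.data[i])

# Made-up rows standing in for the table.
rows = [
    (age = 39, salary = ">=50k"),
    (age = 23, salary = "<50k"),
]

calls = Ref(0)  # counts how often the mapping function runs
lazy = LazyMap(rows) do row
    calls[] += 1
    (row, row.salary)
end

# Nothing has been computed yet; indexing triggers exactly one call.
input, target = lazy[2]
```

After the last line, `calls[]` is 1 even though the container holds two rows, which is what lets this pattern scale to datasets that are expensive to transform up front.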

In [6]:
normstats = FastAI.gettransformationdict(data, DataAugmentation.NormalizeRow, cont)
fmvals = FastAI.gettransformationdict(data, DataAugmentation.FillMissing, cont)
catdict = FastAI.gettransformationdict(data, DataAugmentation.Categorify, cat)

normalize = DataAugmentation.NormalizeRow(normstats, cont)
categorify = DataAugmentation.Categorify(catdict, cat)
fm = DataAugmentation.FillMissing(fmvals, cont)
columns = Tables.columnnames(data.table);
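
As a rough illustration of what the three transforms do to a column, the sketch below reimplements them in plain Julia on made-up data. It is not DataAugmentation.jl's code: filling with the median and reserving index 1 for missing categories are assumptions made for this example.

```julia
using Statistics

# Toy columns; the values are made up for illustration.
ages = [25.0, missing, 40.0, 31.0]
workclass = ["Private", "State-gov", "Private", missing]

# FillMissing (sketch): replace missing values with a per-column fill value.
# Using the median here is an assumption.
fillvalue = median(skipmissing(ages))
filled = coalesce.(ages, fillvalue)

# NormalizeRow (sketch): standardize with the column's mean and standard deviation.
μ, σ = mean(filled), std(filled)
normalized = (filled .- μ) ./ σ

# Categorify (sketch): map each class to an integer index.
# Reserving index 1 for missing values is an assumption.
classes = unique(skipmissing(workclass))
lookup = Dict(c => i + 1 for (i, c) in enumerate(classes))
codes = [ismissing(v) ? 1 : lookup[v] for v in workclass]
```

The dictionaries returned by `gettransformationdict` play the role of `fillvalue`, `(μ, σ)` and `classes` here: they are computed once from the data and then reused for every row.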


Now, we can create a learning method for the tabular classification task. 

The input block here is a `TableRow`, which contains information about the nature of the columns (i.e., categorical or continuous) along with an indexable collection mapping each categorical column name to the collection of distinct classes in that column.

The output block used is `Label`, for single-column classification, and the unique classes have been passed to it.

This is followed by the encodings that need to be applied to our input and output blocks. For the input block, we use the transformations created in the last cell, and for the output block we just one-hot encode the label.

In [7]:
method = BlockMethod(
    (
        TableRow(cat, cont, catdict), 
        Label(unique(data.table[:, :salary]))
    ),
    ((FastAI.TabularTransform(fm|>normalize|>categorify)), FastAI.OneHot())
)

BlockMethod(TableRow{8, 5} -> Label{String})

If our problem were not a classification task and we had a continuous target column, we would need to perform tabular regression. To create a learning method suitable for regression, we use a `Continuous` block to represent the target. This also works with multiple continuous target columns: just pass the number of columns to `Continuous`. For example, the method below could be used for 3 targets.

In [8]:
method2 = BlockMethod(
    (
        TableRow(cat, cont, catdict),
        Continuous(3)
    ),
    (FastAI.TabularTransform(fm |> normalize |> categorify),),
    outputblock = Continuous(3)
)

BlockMethod(TableRow{8, 5} -> Continuous)

In [7]:
describemethod(method)

#### `LearningMethod` summary

  * Task: `TableRow{8, 5} -> Label{String}`
  * Model blocks: `FastAI.EncodedTableRow{8, 5} -> FastAI.OneHotTensor{0, String}`

Encoding a sample (`encode(method, context, sample)`)

|           Encoding |              Name |                 `method.blocks[1]` |                   `method.blocks[2]` |
| ------------------:| -----------------:| ----------------------------------:| ------------------------------------:|
|                    | `(input, target)` |                   `TableRow{8, 5}` |                      `Label{String}` |
| `TabularTransform` |                   | **`FastAI.EncodedTableRow{8, 5}`** |                      `Label{String}` |
|           `OneHot` |          `(x, y)` |     `FastAI.EncodedTableRow{8, 5}` | **`FastAI.OneHotTensor{0, String}`** |

Decoding a model output (`decode(method, context, ŷ)`)

|           Decoding |          Name |             `method.outputblock` |
| ------------------:| -------------:| --------------------------------:|
|                    |          `ŷ` | `FastAI.OneHotTensor{0, String}` |
|           `OneHot` |               |              **`Label{String}`** |
| `TabularTransform` | `target_pred` |                  `Label{String}` |


`getobs` gets us a row of data from the `TableDataset`, and `encode` turns it into a tuple of the encoded input and target. The input is itself a tuple of the categorical values (which have been label encoded, or "categorified") and the continuous values (which have been normalized, with any missing values filled in).

In [8]:
encode(method, Training(), getobs(splitdata, 1000))

(([5, 16, 2, 10, 5, 2, 3, 2], [1.6435221651965317, -0.2567538819371021, -2.751580937680526, -0.21665620002803673, -0.035428902921319616]), Float32[0.0, 1.0])
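
The target part of this output, `Float32[0.0, 1.0]`, is a one-hot vector. A minimal sketch of that encoding and its inverse in plain Julia follows. The helper names `encode_label` and `decode_label` are hypothetical, and the class names and their order are assumed for illustration rather than read from the dataset.

```julia
# Assumed class names and order; in the notebook they come from
# `unique(data.table[:, :salary])`.
classes = ["<50k", ">=50k"]

# One-hot encode: 1 at the position of the matching class, 0 elsewhere.
encode_label(label, classes) = Float32.(classes .== label)

# Decode a model output by picking the class with the highest score.
decode_label(ŷ, classes) = classes[argmax(ŷ)]

y = encode_label(">=50k", classes)
label = decode_label(Float32[0.2, 0.8], classes)
```

With this class order, `y` has the same shape and values as the encoded target shown above.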

To quickly get a model suitable for our learning method, we can use the `methodmodel` function. The second argument is a `Dict` that can be used to pass in custom backbones if needed.

In [9]:
model = methodmodel(method, Dict())

Chain(
  Parallel(
    vcat,
    Chain(
      FastAI.Models.var"#42#44"(),
      Parallel(
        vcat,
        Embedding(10, 5),               # 50 parameters
        Embedding(17, 8),               # 136 parameters
        Embedding(8, 5),                # 40 parameters
        Embedding(16, 7),               # 112 parameters
        Embedding(7, 4),                # 28 parameters
        Embedding(6, 4),                # 24 parameters
        Embedding(3, 2),                # 6 parameters
        Embedding(43, 13),              # 559 parameters
      ),
      identity,
    ),
    BatchNorm(5),                       # 10 parameters, plus 10
  ),
  Chain(
    Dense(53, 200, relu; bias=false),   # 10_600 parameters
    BatchNorm(200),                     # 400 parameters, plus 400
    identity,
  ),
  Chain(
    Dense(200, 100, relu; bias=false),  # 20…

Next, we'll pass in a custom backbone built with functions from `FastAI.Models`.

In [16]:
cardict = Dict(col => length(classes) for (col, classes) in collect(catdict))
embedszs = FastAI.Models.get_emb_sz(cardict, cat)
catback = FastAI.Models.tabular_embedding_backbone(embedszs, 0.2);
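
The embedding sizes printed in the model summary are consistent with fastai's rule of thumb, `min(600, round(1.6 * n^0.56))`. The sketch below is a plain-Julia check of that arithmetic, not the source of `get_emb_sz`. It assumes the rule is applied to the number of distinct classes `n` and that each embedding gets one extra input slot; the cardinalities are read off the printed `Embedding` layers rather than computed from the data.

```julia
# fastai's rule of thumb for embedding width (assumed to be what get_emb_sz applies).
emb_sz_rule(n) = min(600, round(Int, 1.6 * n^0.56))

# Assumed number of distinct classes per categorical column, i.e. the
# printed embedding input sizes minus one extra slot each.
cardinalities = [9, 16, 7, 15, 6, 5, 2, 42]

# (input size, output width) for each embedding layer.
embedding_sizes = [(n + 1, emb_sz_rule(n)) for n in cardinalities]

# The embeddings are concatenated with the 5 continuous columns,
# which gives the input width of the first Dense layer.
ncont = 5
dense_input = sum(last, embedding_sizes) + ncont
```

Under these assumptions the sketch reproduces the layer sizes shown earlier, from `Embedding(10, 5)` through `Embedding(43, 13)`, and the input width of `Dense(53, 200, relu; bias=false)`.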

The backbone `Dict` can take three kinds of backbones, each under its own key:
- `:categoricalbackbone`
- `:continuousbackbone`
- `:finalclassifier`

We can choose to pass in any combination of these in the `methodmodel` function.

In [17]:
backbone = Dict(:categoricalbackbone => catback)

Dict{Symbol, Chain{Tuple{FastAI.Models.var"#42#44", Parallel{typeof(vcat), Vector{Flux.Embedding{Matrix{Float32}}}}, Dropout{Float64, Colon}}}} with 1 entry:
  :categoricalbackbone => Chain(#42, Parallel(vcat, Embedding(10, 5), Embedding…

In [18]:
model = methodmodel(method, backbone)

Chain(
  Parallel(
    vcat,
    Chain(
      FastAI.Models.var"#42#44"(),
      Parallel(
        vcat,
        Embedding(10, 5),               # 50 parameters
        Embedding(17, 8),               # 136 parameters
        Embedding(8, 5),                # 40 parameters
        Embedding(16, 7),               # 112 parameters
        Embedding(7, 4),                # 28 parameters
        Embedding(6, 4),                # 24 parameters
        Embedding(3, 2),                # 6 parameters
        Embedding(43, 13),              # 559 parameters
      ),
      Dropout(0.2),
    ),
    BatchNorm(5),                       # 10 parameters, plus 10
  ),
  Chain(
    Dense(53, 200, relu; bias=false),   # 10_600 parameters
    BatchNorm(200),                     # 400 parameters, plus 400
    identity,
  ),
  Chain(
    Dense(200, 100, relu; bias=false),  …

To directly get a `Learner` suitable for our method and data, we can use the `methodlearner` function. 

In [19]:
learner = methodlearner(method, splitdata, backbone, Metrics(accuracy), batchsize=128, dlkwargs=(buffered=false,))

Learner()

Now that we have our learner, to train it we can just call `FluxTraining.fit!` on it for the required number of epochs.

In [20]:
FluxTraining.fit!(learner, 1)

Epoch 1 TrainingPhase(): 100%|██████████████████████████| Time: 0:00:08


┌───────────────┬───────┬─────────┬──────────┐
│         Phase │ Epoch │    Loss │ Accuracy │
├───────────────┼───────┼─────────┼──────────┤
│ TrainingPhase │   1.0 │ 0.45015 │   0.7913 │
└───────────────┴───────┴─────────┴──────────┘


Epoch 1 ValidationPhase(): 100%|████████████████████████| Time: 0:00:00


┌─────────────────┬───────┬─────────┬──────────┐
│           Phase │ Epoch │    Loss │ Accuracy │
├─────────────────┼───────┼─────────┼──────────┤
│ ValidationPhase │   1.0 │ 0.35919 │  0.83366 │
└─────────────────┴───────┴─────────┴──────────┘
