# Tabular Classification

In tabular classification, the target is a categorical column. Here, we'll use the adult sample dataset from fastai and predict whether a person's salary is above 50K, which makes this a binary classification task.

In [12]:
using Flux
using FastAI
using Tables
using Statistics
using FluxTraining
import DataAugmentation

We can quickly download any fastai dataset and get its path by using [`datasetpath`](#). Once we have the path, we'll load the data into a [`TableDataset`](#). By default, if we pass just the path to [`TableDataset`](#), the data is loaded into a `DataFrame`, but we can use any package for accessing our data and pass it an object satisfying the [Tables.jl](https://github.com/JuliaData/Tables.jl) interface.

In [5]:
data = TableDataset(joinpath(datasetpath("adult_sample"), "adult.csv"))

TableDataset{DataFrames.DataFrame}(32561×15 DataFrame
   Row │ age    workclass          fnlwgt  education      education-num  marit ⋯
       │ Int64  String             Int64   String         Float64?       Strin ⋯
───────┼────────────────────────────────────────────────────────────────────────
     1 │    49   Private           101320   Assoc-acdm             12.0   Marr ⋯
     2 │    44   Private           236746   Masters                14.0   Divo
     3 │    38   Private            96185   HS-grad           missing     Divo
     4 │    38   Self-emp-inc      112847   Prof-school            15.0   Marr
     5 │    42   Self-emp-not-inc   82297   7th-8th           missing     Marr ⋯
     6 │    20   Private            63210   HS-grad                 9.0   Neve
     7 │    49   Private            44434   Some-college           10.0   Divo
  

If our data were stored in a different format, e.g. Parquet, it could be loaded into a data container as follows:

```julia
using Parquet
TableDataset(read_parquet(parquet_path));
```

[`mapobs`](#) is used here to split our target column from the rest of the row in a lazy manner, so that each observation consists of a row of inputs and a target variable.

In [6]:
splitdata = mapobs(row -> (row, row[:salary]), data);
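To make "lazy" concrete, here is a minimal sketch in plain Julia of a container that applies a function only when an observation is requested. The names `LazyMapped`, `observation` and `nobservations` are hypothetical and are not part of FastAI.jl, and the toy rows are made up for illustration.

```julia
# A hypothetical stand-in for a lazily mapped data container.
struct LazyMapped{F,D}
    f::F
    data::D
end

# The mapping function runs only when an observation is requested.
observation(m::LazyMapped, i::Integer) = m.f(m.data[i])
nobservations(m::LazyMapped) = length(m.data)

# Toy rows standing in for the table; the values are illustrative only.
rows = [
    (age = 49, workclass = "Private", salary = ">=50k"),
    (age = 20, workclass = "Private", salary = "<50k"),
]

calls = Ref(0)  # counts how often the mapping function actually runs
split_rows = LazyMapped(rows) do row
    calls[] += 1
    (row, row[:salary])
end

# Nothing has been computed so far; this call triggers exactly one evaluation.
row, target = observation(split_rows, 2)
```

Because the function runs per observation, wrapping a large table this way costs nothing up front.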

To create a learning task for tabular classification, we need an input block, an output block, and the encodings to be performed on the data.

The input block here is a [`TableRow`](#), which records which columns are categorical and which are continuous, along with an indexable collection mapping each categorical column name to the distinct classes in that column. We can get this mapping by calling `gettransformdict` with [`DataAugmentation.Categorify`](#).

The target block used is [`Label`](#), for single-column classification; the unique classes have to be passed to it.

Next come the encodings that need to be applied to our input and target blocks. For the input block, we call `setup` on [`TabularPreprocessing`](#) here to get a standard set of transformations, but this can easily be customized by passing any tabular transformation from DataAugmentation.jl, or a composition of those, to [`TabularPreprocessing`](#). In addition, we one-hot encode the target block.

In [7]:
cat, cont = FastAI.getcoltypes(data)
target = :salary
cat = filter(!isequal(target), cat)
catdict = FastAI.gettransformdict(data, DataAugmentation.Categorify, cat);
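As a rough illustration of what such a category dictionary enables, the sketch below builds a mapping from column name to distinct classes and uses it to turn values into integer indices. The helper names are hypothetical, and reserving index 1 for missing or unseen values is an assumption of this sketch rather than a statement about DataAugmentation.jl.

```julia
# Hypothetical helpers; the names are illustrative, not DataAugmentation.jl API.
# Collect the sorted distinct non-missing values of each categorical column.
function build_category_dict(columns)
    Dict(name => sort(unique(skipmissing(col))) for (name, col) in columns)
end

# Map a value to an integer index. Index 1 is reserved for missing or unseen
# values (an assumption made for this sketch).
function categorify(classes, value)
    ismissing(value) && return 1
    idx = findfirst(isequal(value), classes)
    return idx === nothing ? 1 : idx + 1
end

# Toy columns; the values are illustrative only.
columns = Dict(
    :workclass => ["Private", "Self-emp-inc", missing, "Private"],
    :education => ["Masters", "HS-grad", "HS-grad", "7th-8th"],
)

catdict_sketch = build_category_dict(columns)
encoded = [categorify(catdict_sketch[:workclass], v) for v in columns[:workclass]]
```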

In [14]:
inputblock = TableRow(cat, cont, catdict)
targetblock = Label(unique(data.table[:, target]))

task = BlockTask(
    (inputblock, targetblock),
    (
        setup(TabularPreprocessing, inputblock, data),
        FastAI.OneHot()
    )
)


BlockTask(TableRow{8, 6, Dict{Any, Any}} -> Label{String})
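The one-hot encoding of the target, and the matching decoding of a model output, can be sketched in plain Julia as follows. The helper names are hypothetical and the class order is assumed; in the task above the order comes from `unique` on the target column.

```julia
# Hypothetical helpers for illustration; not the FastAI.jl implementation.
# The class order below is an assumption of this sketch.
classes = [">=50k", "<50k"]

# Encode a label as a Float32 vector with a single 1 at the label's position.
function onehot_encode(label, classes)
    idx = findfirst(isequal(label), classes)
    idx === nothing && throw(ArgumentError("unknown label: $label"))
    encoded = zeros(Float32, length(classes))
    encoded[idx] = 1
    return encoded
end

# Decode model scores back to a label by picking the highest-scoring class.
onehot_decode(scores, classes) = classes[argmax(scores)]

y = onehot_encode("<50k", classes)
pred = onehot_decode(Float32[0.2, 0.8], classes)
```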

If our problem weren't a classification task and we had a continuous target column instead, we would need to perform tabular regression. To create a learning task suitable for regression, we use a [`Continuous`](#) block to represent the target column. This also works with multiple continuous target columns: we just pass the number of columns to `Continuous`. For example, the task below could be used for 3 targets.

```julia
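# Reuses `inputblock` and `data` from above; encodings are set up the same way
# as in the classification task.
task2 = BlockTask(
    (inputblock, Continuous(3)),
    (setup(TabularPreprocessing, inputblock, data),),
    outputblock = Continuous(3)
)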
```

To get an overview of the learning task created, and as a sanity test, we can use [`describetask`](#). This shows us what encodings will be applied to which blocks, and how the predicted ŷ values are decoded.

In [15]:
describetask(task)

#### `LearningTask` summary

  * Task: `TableRow{8, 6, Dict{Any, Any}} -> Label{String}`
  * Model blocks: `FastAI.EncodedTableRow{8, 6, Dict{Any, Any}} -> FastAI.OneHotTensor{0, String}`

Encoding a sample (`encodesample(task, context, sample)`)

|               Encoding |              Name |                                 `task.blocks[1]` |                   `task.blocks[2]` |
| ----------------------:| -----------------:| --------------------------------------------------:| ------------------------------------:|
|                        | `(input, target)` |                   `TableRow{8, 6, Dict{Any, Any}}` |                      `Label{String}` |
| `TabularPreprocessing` |                   | **`FastAI.EncodedTableRow{8, 6, Dict{Any, Any}}`** |                      `Label{String}` |
|               `OneHot` |          `(x, y)` |     `FastAI.EncodedTableRow{8, 6, Dict{Any, Any}}` | **`FastAI.OneHotTensor{0, String}`** |

Decoding a model output (`decode(task, context, ŷ)`)

|               Decoding |          Name |             `task.outputblock` |
| ----------------------:| -------------:| --------------------------------:|
|                        |          `ŷ` | `FastAI.OneHotTensor{0, String}` |
|               `OneHot` |               |              **`Label{String}`** |
| `TabularPreprocessing` | `target_pred` |                  `Label{String}` |


`getobs` gets us a row of data from the `TableDataset`, which we encode here. This gives us a tuple with the input and target. The input here is again a tuple, containing the categorical values (which have been label encoded or "categorified") and the continuous values (which have been normalized and any missing values have been filled). 

In [16]:
getobs(splitdata, 1000)

(DataFrameRow
  Row │ age    workclass   fnlwgt  education  education-num  marital-status    ⋯
      │ Int64  String      Int64   String     Float64?       String            ⋯
──────┼─────────────────────────────────────────────────────────────────────────
 1000 │    61   State-gov  162678   5th-6th             3.0   Married-civ-spou ⋯
                                                              10 columns omitted, "<50k")

In [18]:
x = encodesample(task, Training(), getobs(splitdata, 1000))

(([5, 16, 2, 10, 5, 2, 3, 2], [1.6435221651965317, -0.2567538819371021, -2.751580937680526, -0.14591824281680102, -0.21665620002803673, -0.035428902921319616]), Float32[0.0, 1.0])
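The treatment of continuous columns described above, filling missing values and then normalizing, might look like the following in plain Julia. The helper name is hypothetical, and filling with the column median is an assumption of this sketch; the input values are the first seven `education-num` entries shown earlier.

```julia
using Statistics

# Hypothetical helper: fill missing values with the column median, then
# standardize to zero mean and unit standard deviation. Using the median as
# the fill value is an assumption of this sketch.
function fill_and_normalize(col::AbstractVector)
    present = collect(skipmissing(col))
    fill_value = median(present)
    filled = [ismissing(v) ? fill_value : v for v in col]
    μ, σ = mean(filled), std(filled)
    return (filled .- μ) ./ σ
end

education_num = [12.0, 14.0, missing, 15.0, missing, 9.0, 10.0]
z = fill_and_normalize(education_num)
```

In practice the statistics would be computed once over the dataset and then reused for each row, rather than recomputed per call as in this toy example.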

To get a model suitable for our learning task, we can use [`taskmodel`](#) which constructs a suitable model based on the target block. 

In [17]:
model = taskmodel(task)

Chain(
  Parallel(
    vcat,
    Chain(
      FastAI.Models.var"#41#43"(),
      Parallel(
        vcat,
        Embedding(10, 6),               # 60 parameters
        Embedding(17, 8),               # 136 parameters
        Embedding(8, 5),                # 40 parameters
        Embedding(17, 8),               # 136 parameters
        Embedding(7, 5),                # 35 parameters
        Embedding(6, 4),                # 24 parameters
        Embedding(3, 3),                # 9 parameters
        Embedding(43, 13),              # 559 parameters
      ),
      identity,
    ),
    BatchNorm(6),                       # 12 parameters, plus 12
  ),
  Chain(
    Dense(58, 200, relu; bias=false),   # 11_600 parameters
    BatchNorm(200),                     # 400 parameters, plus 400
    identity,
  ),
  Chain(
    Dense(200, 100, relu; bias=false),  # 20
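To see where the input width of 58 in the first `Dense` layer comes from, the sketch below assembles the features for one encoded sample in plain Julia. The embedding shapes are taken from the printed model, the weights are random placeholders, and the batch normalization of the continuous inputs is left out, so this only illustrates the shapes involved.

```julia
# (number of classes, embedding width) pairs read off the printed model.
emb_shapes = [(10, 6), (17, 8), (8, 5), (17, 8), (7, 5), (6, 4), (3, 3), (43, 13)]

# Random placeholder weights: one matrix per categorical column, one column per class.
embeddings = [randn(Float32, dim, n) for (n, dim) in emb_shapes]

# An encoded sample: one index per categorical column, six continuous values
# (rounded from the encoded sample shown earlier).
cat_indices = [5, 16, 2, 10, 5, 2, 3, 2]
cont_values = Float32[1.64, -0.26, -2.75, -0.15, -0.22, -0.04]

# Look up one embedding column per categorical feature and concatenate them.
cat_features = reduce(vcat, [E[:, i] for (E, i) in zip(embeddings, cat_indices)])

# Append the continuous features to form the input of the first Dense layer.
features = vcat(cat_features, cont_values)
```

The embedding widths sum to 52, and the six continuous columns bring the total to 58.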

Of course, you can also create a custom backbone using the functions in `FastAI.Models`.

In [19]:
cardinalities = collect(map(col -> length(catdict[col]), cat))

ovdict = Dict(:workclass => 10, :education => 12, Symbol("native-country") => 16)
overrides = collect(map(col -> col in keys(ovdict) ? ovdict[col] : nothing, cat))

embedszs = FastAI.Models.get_emb_sz(cardinalities, overrides)
catback = FastAI.Models.tabular_embedding_backbone(embedszs, 0.2);
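The default embedding widths printed earlier are consistent with the heuristic used by fastai in Python, `min(600, round(1.6 * n^0.56))`. That FastAI.jl applies exactly this rule is an assumption here, and `embedding_sizes` is a hypothetical helper rather than `get_emb_sz` itself. The sketch takes the class counts from the printed default model and applies the three overrides from the cell above by position.

```julia
# Heuristic from fastai (Python); assumed, not confirmed, to match FastAI.jl.
emb_size_rule(n) = min(600, round(Int, 1.6 * n^0.56))

# Use the override where one is given, otherwise fall back to the heuristic.
function embedding_sizes(ncategories, overrides)
    [(n, o === nothing ? emb_size_rule(n) : o) for (n, o) in zip(ncategories, overrides)]
end

# Class counts of the embedding layers in the printed default model.
ncategories = [10, 17, 8, 17, 7, 6, 3, 43]

default_sizes = embedding_sizes(ncategories, fill(nothing, 8))
custom_sizes = embedding_sizes(
    ncategories,
    [10, 12, nothing, nothing, nothing, nothing, nothing, 16],
)

# Width fed to the first Dense layer: embedding widths plus 6 continuous columns.
dense_input(sizes) = sum(last, sizes) + 6
```

With the overrides applied, the total input width grows from 58 to 69.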

We can then pass a named tuple `(categorical = ..., continuous = ...)` to `taskmodel` to replace the default backbone.

In [20]:
backbone = (categorical = catback, continuous =  BatchNorm(length(cont)))
model = taskmodel(task, backbone)

Chain(
  Parallel(
    vcat,
    Chain(
      FastAI.Models.var"#41#43"(),
      Parallel(
        vcat,
        Embedding(10, 10),              # 100 parameters
        Embedding(17, 12),              # 204 parameters
        Embedding(8, 5),                # 40 parameters
        Embedding(17, 8),               # 136 parameters
        Embedding(7, 5),                # 35 parameters
        Embedding(6, 4),                # 24 parameters
        Embedding(3, 3),                # 9 parameters
        Embedding(43, 16),              # 688 parameters
      ),
      Dropout(0.2),
    ),
    BatchNorm(6),                       # 12 parameters, plus 12
  ),
  Chain(
    Dense(69, 200, relu; bias=false),   # 13_800 parameters
    BatchNorm(200),                     # 400 parameters, plus 400
    identity,
  ),
  Chain(
    Dense(200, 100, relu; bias=false),

To directly get a [`Learner`](#) suitable for our task and data, we can use the [`tasklearner`](#) function. This creates both batched data loaders and a model for us.

In [21]:
learner = tasklearner(task, splitdata;
    backbone=backbone, callbacks=[Metrics(accuracy)],
    batchsize=128, buffered=false)

Learner()

Once we have a `Learner`, we can call [`fitonecycle!`](#) on it to train it for the desired number of epochs:

In [23]:
fitonecycle!(learner, 3, 0.2)

Epoch 1 TrainingPhase(): 100%|██████████████████████████| Time: 0:01:00


┌───────────────┬───────┬─────────┬──────────┐
│         Phase │ Epoch │    Loss │ Accuracy │
├───────────────┼───────┼─────────┼──────────┤
│ TrainingPhase │   1.0 │ 0.37405 │  0.82753 │
└───────────────┴───────┴─────────┴──────────┘


Epoch 1 ValidationPhase(): 100%|████████████████████████| Time: 0:00:02


┌─────────────────┬───────┬─────────┬──────────┐
│           Phase │ Epoch │    Loss │ Accuracy │
├─────────────────┼───────┼─────────┼──────────┤
│ ValidationPhase │   1.0 │ 0.39243 │  0.81782 │
└─────────────────┴───────┴─────────┴──────────┘


Epoch 2 TrainingPhase(): 100%|██████████████████████████| Time: 0:00:04


┌───────────────┬───────┬─────────┬──────────┐
│         Phase │ Epoch │    Loss │ Accuracy │
├───────────────┼───────┼─────────┼──────────┤
│ TrainingPhase │   2.0 │ 0.35332 │  0.83909 │
└───────────────┴───────┴─────────┴──────────┘


Epoch 2 ValidationPhase(): 100%|████████████████████████| Time: 0:00:00


┌─────────────────┬───────┬─────────┬──────────┐
│           Phase │ Epoch │    Loss │ Accuracy │
├─────────────────┼───────┼─────────┼──────────┤
│ ValidationPhase │   2.0 │ 0.33674 │  0.84259 │
└─────────────────┴───────┴─────────┴──────────┘


Epoch 3 TrainingPhase(): 100%|██████████████████████████| Time: 0:00:04


┌───────────────┬───────┬─────────┬──────────┐
│         Phase │ Epoch │    Loss │ Accuracy │
├───────────────┼───────┼─────────┼──────────┤
│ TrainingPhase │   3.0 │ 0.32081 │  0.85238 │
└───────────────┴───────┴─────────┴──────────┘


Epoch 3 ValidationPhase(): 100%|████████████████████████| Time: 0:00:00


┌─────────────────┬───────┬─────────┬──────────┐
│           Phase │ Epoch │    Loss │ Accuracy │
├─────────────────┼───────┼─────────┼──────────┤
│ ValidationPhase │   3.0 │ 0.31522 │  0.85259 │
└─────────────────┴───────┴─────────┴──────────┘
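A one-cycle policy typically warms the learning rate up to a peak and then anneals it towards a very small value. The sketch below illustrates the idea with cosine interpolation in plain Julia. The function names, the warm-up fraction `pct_start` and the divisors `div` and `div_final` are assumptions of this sketch; only the peak learning rate of 0.2 comes from the `fitonecycle!` call above.

```julia
# Cosine interpolation from `from` to `to` as `t` goes from 0 to 1.
cosine_anneal(from, to, t) = to + (from - to) * (1 + cos(pi * t)) / 2

# Learning rate at a given step of a one-cycle schedule. The default values
# for `pct_start`, `div` and `div_final` are assumptions of this sketch.
function onecycle_lr(step, nsteps, lr_max; pct_start = 0.25, div = 25, div_final = 1e5)
    warmup = pct_start * nsteps
    if step <= warmup
        # Warm-up: rise from lr_max / div to lr_max.
        return cosine_anneal(lr_max / div, lr_max, step / warmup)
    else
        # Annealing: fall from lr_max to lr_max / div_final.
        return cosine_anneal(lr_max, lr_max / div_final, (step - warmup) / (nsteps - warmup))
    end
end

nsteps = 600  # hypothetical total number of optimizer steps
lrs = [onecycle_lr(s, nsteps, 0.2) for s in 0:nsteps]
```

Plotting `lrs` against the step index would show a short rise to the peak followed by a long decay.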
