# Tabular Classification

Tabular classification is the task of predicting a categorical target column from the other columns of a table. Here, we'll use the adult sample dataset from fastai and predict whether a person's salary is above 50K, which makes this a binary classification task.

In [2]:
using Flux
using FastAI
using FastAI.Datasets
using Tables
using Statistics
using FluxTraining
using DataAugmentation

We can quickly download any fastai dataset and get its path with `datasetpath`. Once we have the path, we'll load the data into a `TableDataset`. By default, if we pass just the path to `TableDataset`, the data is loaded into a `DataFrame`, but we can use any package for accessing our data and pass in any object that satisfies the Tables.jl interface.

In [3]:
path = datasetpath("adult_sample") 
data = Datasets.TableDataset(joinpath(path, "adult.csv"))
df = data.table

| Row | age (Int64) | workclass (String) | fnlwgt (Int64) | education (String) | education-num (Float64?) | marital-status (String) |
|-----|-------------|--------------------|----------------|--------------------|--------------------------|-------------------------|
| 1 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse |
| 2 | 44 | Private | 236746 | Masters | 14.0 | Divorced |
| 3 | 38 | Private | 96185 | HS-grad | missing | Divorced |
| 4 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse |
| 5 | 42 | Self-emp-not-inc | 82297 | 7th-8th | missing | Married-civ-spouse |
| 6 | 20 | Private | 63210 | HS-grad | 9.0 | Never-married |
| 7 | 49 | Private | 44434 | Some-college | 10.0 | Divorced |
| 8 | 37 | Private | 138940 | 11th | 7.0 | Married-civ-spouse |
| 9 | 46 | Private | 328216 | HS-grad | 9.0 | Married-civ-spouse |
| 10 | 36 | Self-emp-inc | 216711 | HS-grad | missing | Married-civ-spouse |


We then create two tuples, one with the continuous column names and one with the categorical column names, which will be used later on.

In [4]:
cont = (:age, :fnlwgt, Symbol("education-num"), Symbol("capital-loss"), Symbol("hours-per-week"));
cat = (Symbol("workclass"), Symbol("education"), Symbol("marital-status"), Symbol("occupation"), Symbol("relationship"), Symbol("race"), Symbol("sex"), Symbol("native-country"));

Now, to perform the required preprocessing, we'll use the tabular transforms from DataAugmentation.jl. Each transform needs some information about the data, which we collect into dictionaries using the `gettransformationdict` helper function. More information about this can be found in the DataAugmentation.jl docs.

In [5]:
normstats = FastAI.gettransformationdict(data, DataAugmentation.NormalizeRow, cont)
fmvals = FastAI.gettransformationdict(data, DataAugmentation.FillMissing, cont)
catdict = FastAI.gettransformationdict(data, DataAugmentation.Categorify, cat)

Dict{Any, Any} with 8 entries:
  :education               => [" Assoc-acdm", " Masters", " HS-grad", " Prof-sc…
  :race                    => [" White", " Black", " Asian-Pac-Islander", " Ame…
  :sex                     => [" Female", " Male"]
  :workclass               => [" Private", " Self-emp-inc", " Self-emp-not-inc"…
  :occupation              => Union{Missing, String}[missing, " Exec-managerial…
  :relationship            => [" Wife", " Not-in-family", " Unmarried", " Husba…
  Symbol("native-country") => [" United-States", " ?", " Puerto-Rico", " Mexico…
  Symbol("marital-status") => [" Married-civ-spouse", " Divorced", " Never-marr…
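As a rough picture of what such dictionaries hold, the sketch below computes comparable statistics by hand for a toy table. It assumes that normalization uses a per-column mean and standard deviation, that missing values are filled with the column median, and that the category dictionary stores each column's unique values. The helper names (`colstats`, `fillvalue`, `categories`) and the `_toy` variables are hypothetical and not part of FastAI.jl or DataAugmentation.jl.

```julia
using Statistics

# Toy columns standing in for the real table (a handful of made-up rows).
age = [49, 44, 38, 38, 42]
educationnum = [12.0, 14.0, missing, 15.0, missing]
workclass = [" Private", " Private", " Private", " Self-emp-inc", " Self-emp-not-inc"]

# Hypothetical helpers: per-column statistics that ignore missing entries.
colstats(col) = (mean(skipmissing(col)), std(skipmissing(col)))
fillvalue(col) = median(skipmissing(col))   # assumed fill strategy
categories(col) = unique(col)

normstats_toy = Dict(:age => colstats(age), Symbol("education-num") => colstats(educationnum))
fmvals_toy = Dict(Symbol("education-num") => fillvalue(educationnum))
catdict_toy = Dict(:workclass => categories(workclass))
```

The real helper does this for every column in `cont` or `cat` at once; the exact statistics it stores are documented in DataAugmentation.jl.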

In [6]:
normalize = DataAugmentation.NormalizeRow(normstats, cont)
categorify = DataAugmentation.Categorify(catdict, cat)
fm = DataAugmentation.FillMissing(fmvals, cont)
columns = Tables.columnnames(data.table);


Now that we have our transforms, we'll create a `LearningMethod` for tabular classification, which contains all the information needed for encoding the data.

In [7]:
method = FastAI.TabularClassification(
    FastAI.TabularTransforms(fm|>normalize|>categorify, columns),
    contcols=cont,
    catcols=cat,
    targetcol=:salary,
    columns=columns,
    categorydict = catdict,
    targetclasses = unique(df[:, :salary])
);

`getobs` gets us a row of data from the `TableDataset`, and `encode` turns that row into a tuple of the input and the target. The input here is itself a tuple of the categorical and continuous values.

In [8]:
encode(method, Training(), getobs(data, 1))

((Int32[2, 2, 2, 1, 2, 2, 2, 2], [0.7637846676602542, -0.8380709161872286, 0.7462826288318035, 4.5034127099423245, -0.035428902921319616]), Bool[1, 0])
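To make the encoded values easier to read, here is a minimal sketch of the two encodings involved. It assumes that index 1 is reserved for missing values, so known categories start at 2, which would match both the `1` in the fourth position above and the extra embedding slot added later. The helpers `catindex` and `onehot` and the target labels are hypothetical, not FastAI.jl functions or values taken from the dataset.

```julia
# Hypothetical helper: map a category to an integer index.
# Assumption: index 1 is reserved for `missing`, so known categories start at 2.
catindex(value, classes) = ismissing(value) ? 1 : findfirst(isequal(value), classes) + 1

# Hypothetical helper: one-hot encode a label against the list of target classes.
onehot(label, classes) = classes .== label

workclasses = [" Private", " Self-emp-inc", " Self-emp-not-inc"]
targetclasses = ["above", "below"]   # placeholder labels, not the dataset's own

idx_known = catindex(" Private", workclasses)
idx_missing = catindex(missing, workclasses)
target = onehot("above", targetclasses)
```

Under these assumptions, a target of `Bool[1, 0]` means the row belongs to the first of the two target classes.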

We can use `methoddataloaders` to quickly get training and validation data loaders by passing in the `TableDataset`, the method, and the batch size. `pctgval` sets the fraction of the data used for validation.

In [9]:
traindl, valdl = methoddataloaders(data, method, 128; pctgval = 0.2, shuffle = true, buffered=false)

(DataLoaders.GetObsParallel{DataLoaders.BatchViewCollated{DLPipelines.MethodDataset{TabularClassification}}}(batchviewcollated() with 204 batches of size 128, false), DataLoaders.GetObsParallel{DataLoaders.BatchViewCollated{DLPipelines.MethodDataset{TabularClassification}}}(batchviewcollated() with 26 batches of size 256, false))
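The batch counts in this output can be checked by hand. The sketch below assumes the table has 32,561 rows (a figure not stated above), that the validation loader uses twice the training batch size as the printed `256` suggests, and that a final partial batch is kept.

```julia
# Assumption: total number of rows in the table.
nrows = 32_561
pctgval = 0.2

# Split sizes for a 20% validation share.
nval = round(Int, pctgval * nrows)
ntrain = nrows - nval

# Number of batches, counting a final partial batch (ceiling division).
ntrainbatches = cld(ntrain, 128)
nvalbatches = cld(nval, 256)
```

With these assumptions the counts come out to 204 training batches and 26 validation batches, matching the printed data loaders.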

`methodlossfn` can help us quickly get a loss function compatible with the method.

In [10]:
optim = Flux.ADAM()
lossfn = methodlossfn(method)

logitcrossentropy (generic function with 1 method)
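`logitcrossentropy` takes raw model outputs (logits) rather than probabilities. For a single observation it amounts to the negative log-softmax of the logit belonging to the true class. The following is a plain-Julia sketch of that computation for one observation; `logsoftmax_vec` and `logitce` are illustrative names, and Flux's own function additionally handles batches and averages over them.

```julia
# Numerically stable log-softmax over a vector of logits.
function logsoftmax_vec(logits)
    shifted = logits .- maximum(logits)
    shifted .- log(sum(exp.(shifted)))
end

# Cross-entropy between a one-hot target and raw logits (single observation).
logitce(logits, target) = -sum(target .* logsoftmax_vec(logits))

logits = [2.0, 0.0]       # model output for the two classes
target = [true, false]    # one-hot target, as produced by `encode`
loss = logitce(logits, target)
```

Because the softmax is folded into the loss, the model itself can end in `identity` instead of a softmax layer.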

In [11]:
function emb_sz_rule(n_cat)
    min(600, round(1.6 * n_cat^0.56))
end

function _one_emb_sz(catdict, catcol::Symbol, sz_dict=nothing)
    sz_dict = isnothing(sz_dict) ? Dict() : sz_dict
    n_cat = length(catdict[catcol])
    sz = catcol in keys(sz_dict) ? sz_dict[catcol] : emb_sz_rule(n_cat)
    Int64(n_cat) + 1, Int64(sz)
end

function get_emb_sz(catdict, cols; sz_dict=nothing)
    [_one_emb_sz(catdict, catcol, sz_dict) for catcol in cols]
end

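# Dense + BatchNorm + Dropout block. `lin_first` controls whether the dense
# layer comes before or after normalization and dropout.
# Note: as written, `use_bn` only toggles the bias of the dense layer;
# the BatchNorm layer is always part of the returned chain.
function linbndrop(h_in, h_out; use_bn=true, p=0., act=identity, lin_first=false)
    bn = BatchNorm(lin_first ? h_out : h_in)
    dropout = p == 0 ? identity : Dropout(p)
    dense = Dense(h_in, h_out, act; bias=!use_bn)
    if lin_first
        return Chain(dense, bn, dropout)
    else
        return Chain(bn, dropout, dense)
    end
end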

function sigmoidrange(x, low, high)
    @. Flux.sigmoid(x) * (high - low) + low
end

function embeddingbackbone(embedding_sizes, dropoutprob=0.)
    embedslist = [Embedding(ni => nf) for (ni, nf) in embedding_sizes]
    emb_drop = dropoutprob==0. ? identity : Dropout(dropoutprob)
    Chain(
        x -> tuple(eachrow(x)...), 
        Parallel(vcat, embedslist), 
        emb_drop
    )
end

function continuousbackbone(n_cont)
    n_cont > 0 ? BatchNorm(n_cont) : identity
end

function TabularModel(
        catbackbone,
        contbackbone,    
        layers; 
        n_cat,
        n_cont,
        out_sz,
        ps=0,
        use_bn=true,
        bn_final=false,
        act_cls=Flux.relu,
        lin_first=true,
        final_activation=identity
    )

    tabularbackbone = Parallel(vcat, catbackbone, contbackbone)
    
    catoutsize = first(Flux.outputsize(catbackbone, (n_cat, 1)))
    ps = Iterators.cycle(ps)
    classifiers = []

    first_ps, ps = Iterators.peel(ps)
    push!(classifiers, linbndrop(catoutsize+n_cont, first(layers); use_bn=use_bn, p=first_ps, lin_first=lin_first, act=act_cls))
    
    for (isize, osize, p) in zip(layers[1:(end-1)], layers[2:(end)], ps)
        layer = linbndrop(isize, osize; use_bn=use_bn, p=p, act=act_cls, lin_first=lin_first)
        push!(classifiers, layer)
    end
    
    push!(classifiers, linbndrop(last(layers), out_sz; use_bn=bn_final, lin_first=lin_first))
    
    layers = Chain(
        tabularbackbone,
        classifiers...,
        final_activation
    )
end

TabularModel (generic function with 1 method)

To create the model, we'll need to create a categorical backbone (which handles the categorical values) and a continuous backbone (which handles the continuous values), and finally pass them to `TabularModel`, which adds a stack of linear layers after the backbones.

In [12]:
embszs = get_emb_sz(catdict, cat)

8-element Vector{Tuple{Int64, Int64}}:
 (10, 5)
 (17, 8)
 (8, 5)
 (16, 7)
 (7, 4)
 (6, 4)
 (3, 2)
 (43, 13)
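These pairs can be reproduced by hand from the number of categories in each column. The sketch below repeats the rule from `emb_sz_rule` so that it runs on its own; the category counts are read off the output above by subtracting the one extra slot that `_one_emb_sz` adds.

```julia
# Same rule as `emb_sz_rule`, repeated so the snippet is self-contained.
rule(n_cat) = min(600, round(Int, 1.6 * n_cat^0.56))

# Category counts per column (first entry of each pair above, minus one).
ncats = [9, 16, 7, 15, 6, 5, 2, 42]

# (vocabulary size including the extra slot, embedding width)
sizes = [(n + 1, rule(n)) for n in ncats]
```

For example, a column with 9 categories gives 1.6 * 9^0.56 ≈ 5.48, which rounds to an embedding width of 5.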

In [13]:
embedbackbone = embeddingbackbone(embszs)

Chain(
  var"#6#8"(),
  Parallel(
    vcat,
    Embedding(10 => 5),                 # 50 parameters
    Embedding(17 => 8),                 # 136 parameters
    Embedding(8 => 5),                  # 40 parameters
    Embedding(16 => 7),                 # 112 parameters
    Embedding(7 => 4),                  # 28 parameters
    Embedding(6 => 4),                  # 24 parameters
    Embedding(3 => 2),                  # 6 parameters
    Embedding(43 => 13),                # 559 parameters
  ),
  identity,
)                   # Total: 8 arrays, 955 parameters, 128 bytes.
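The backbone splits the categorical input into one row per column, looks each row up in its own embedding table, and concatenates the results vertically. The sketch below mimics that with plain arrays instead of Flux layers; the random tables and the `embed` helper are stand-ins for illustration only.

```julia
# (vocabulary size, embedding width) pairs, copied from the sizes computed earlier.
embszs = [(10, 5), (17, 8), (8, 5), (16, 7), (7, 4), (6, 4), (3, 2), (43, 13)]

# Stand-in embedding tables: one (width × vocabulary) matrix per column.
tables = [randn(Float32, nf, ni) for (ni, nf) in embszs]

# A batch of 3 observations: an 8 × 3 matrix of category indices.
batch = reduce(hcat, [[rand(1:ni) for (ni, nf) in embszs] for obs in 1:3])

# Look up each row in its own table and stack the results vertically.
embed(tables, batch) = reduce(vcat, [W[:, idx] for (W, idx) in zip(tables, eachrow(batch))])

out = embed(tables, batch)
outsize = size(out)
```

The output has 48 rows, the sum of the embedding widths. Together with the 5 continuous features, this gives the 53 inputs that the first dense layer of the model receives.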

In [14]:
contbackbone = continuousbackbone(5)

BatchNorm(5)        # 10 parameters, plus 10 non-trainable

Now we'll create the model and a `Learner`, which holds the model along with the data and everything else required for training.

In [15]:
model = TabularModel(embedbackbone, contbackbone, [200, 100], n_cat=8, n_cont=5, out_sz=2)
learner = Learner(model, (traindl, valdl), optim, lossfn, Metrics(accuracy))

Learner()

We can use `fitonecycle!` to train with the one-cycle strategy.

In [16]:
fitonecycle!(learner, 1)

Epoch 1 TrainingPhase(): 100%|██████████████████████████| Time: 0:01:03


┌───────────────┬───────┬─────────┬──────────┐
│         Phase │ Epoch │    Loss │ Accuracy │
├───────────────┼───────┼─────────┼──────────┤
│ TrainingPhase │   1.0 │ 0.37584 │  0.82569 │
└───────────────┴───────┴─────────┴──────────┘


Epoch 1 ValidationPhase(): 100%|████████████████████████| Time: 0:00:02


┌─────────────────┬───────┬─────────┬──────────┐
│           Phase │ Epoch │    Loss │ Accuracy │
├─────────────────┼───────┼─────────┼──────────┤
│ ValidationPhase │   1.0 │ 0.34106 │  0.84285 │
└─────────────────┴───────┴─────────┴──────────┘
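For intuition about the one-cycle strategy, the sketch below shows a generic one-cycle learning-rate schedule: the rate warms up from a small value to a peak and then anneals to a much smaller value, here with cosine interpolation. This is an illustration of the general idea, not FastAI.jl's implementation; the function name `onecycle_lr` and all default values are assumptions chosen for the example.

```julia
# Generic one-cycle schedule (illustrative; parameter values are assumptions).
# `t` is the fraction of training completed, between 0 and 1.
function onecycle_lr(t; lrmax=0.01, div=25, divfinal=1e5, pctstart=0.25)
    # Cosine interpolation: returns `a` at p = 0 and `b` at p = 1.
    cosine(a, b, p) = b + (a - b) * (1 + cos(pi * p)) / 2
    if t < pctstart
        # Warm-up: from lrmax / div up to lrmax.
        cosine(lrmax / div, lrmax, t / pctstart)
    else
        # Annealing: from lrmax down to lrmax / divfinal.
        cosine(lrmax, lrmax / divfinal, (t - pctstart) / (1 - pctstart))
    end
end

# Learning rate at the start, the peak, and the end of training.
lrs = [onecycle_lr(t) for t in (0.0, 0.25, 1.0)]
```

The short warm-up lets training start gently, while the long annealing phase lets the model settle into a minimum at a low learning rate.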
