# Tabular data

In [None]:
from fastai.gen_doc.nbdoc import *
from fastai.tabular import * 
from fastai import *
from fastai.docs import *

The `tabular` module contains all the necessary classes to deal with tabular data. In `tabular.transform`, we define the `TabularTransform` class to help with preprocessing. In `tabular.data`, we define the `TabularDataset` that handles that data, as well as the methods to quickly get a `DataBunch`.

## Preprocessing tabular data

Tabular data usually comes in the form of a csv file containing variables of different kinds: text, numbers, and some missing values. The example we'll work with in this section is a sample of the [adult dataset](https://archive.ics.uci.edu/ml/datasets/adult), which gives some information about individuals and is used to train a model that predicts whether their salary is greater than \$50k or not.

In [None]:
untar_adult()
ADULT_PATH = DATA_PATH / 'adult_sample'
df = pd.read_csv(ADULT_PATH/'adult.csv')
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,>=50k
0,49,Private,101320,Assoc-acdm,12.0,Married-civ-spouse,,Wife,White,Female,0,1902,40,United-States,1
1,44,Private,236746,Masters,14.0,Divorced,Exec-managerial,Not-in-family,White,Male,10520,0,45,United-States,1
2,38,Private,96185,HS-grad,,Divorced,,Unmarried,Black,Female,0,0,32,United-States,0
3,38,Self-emp-inc,112847,Prof-school,15.0,Married-civ-spouse,Prof-specialty,Husband,Asian-Pac-Islander,Male,0,0,40,United-States,1
4,42,Self-emp-not-inc,82297,7th-8th,,Married-civ-spouse,Other-service,Wife,Black,Female,0,0,50,United-States,0


Here all the information that will form our input is in the first 14 columns, and the dependent variable is the last column. We will split our input between two types of variables: categorical and continuous.
- Categorical variables will be replaced by a category, a unique id that identifies them, before passing through an embedding layer.
- Continuous variables will be normalized then directly fed to the model.

We also need to handle missing values: our model isn't going to like receiving NaNs, so we should replace them in a smart way. All of this preprocessing is done by `TabularTransform` objects and `TabularDataset`.
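To make the two treatments above concrete, here is a minimal sketch in plain pandas on a made-up toy frame. It is not how `TabularDataset` is implemented; reserving id 0 for missing values is an assumption chosen for illustration.

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    "workclass": ["Private", "Self-emp-inc", None, "Private"],
    "age": [49.0, 44.0, 38.0, 42.0],
})

# Categorical: map each distinct value to an integer id.
# pandas assigns the code -1 to missing values, so adding 1 reserves id 0
# for "missing" (an assumption of this sketch, not a documented library rule).
toy["workclass"] = toy["workclass"].astype("category")
workclass_ids = toy["workclass"].cat.codes + 1

# Continuous: subtract the mean and divide by the standard deviation.
age_mean, age_std = toy["age"].mean(), toy["age"].std()
age_normalized = (toy["age"] - age_mean) / age_std

print(workclass_ids.tolist())
print(age_normalized.round(3).tolist())
```

The ids are what an embedding layer would index into, while the normalized column can be fed to the model as is.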

First let's split our variables between categorical and continuous (we can ignore the dependent variable at this stage).

In [None]:
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']
cont_names = ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']

Then we can define a list of transforms that will be applied to those variables. Here we transform all the variables in the `cat_names` list into categories, and we replace missing values in the `cont_names` columns with the median of the corresponding column.

In [None]:
tfms = [FillMissing, Categorify]
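As a rough illustration of median filling, the sketch below does it by hand in pandas on made-up values. It assumes the median is computed on the training rows and reused for the validation rows, and the `education-num_na` indicator column is a hypothetical name introduced only for this example; neither is taken from the `FillMissing` source.

```python
import pandas as pd

# Made-up values; None marks a missing entry.
train = pd.DataFrame({"education-num": [12.0, 14.0, None, 15.0, 9.0]})
valid = pd.DataFrame({"education-num": [None, 10.0]})

# Compute the statistic on the training rows only (missing values are skipped).
median = train["education-num"].median()

# Optionally record which rows were missing before overwriting them.
# The column name is hypothetical, chosen for this sketch.
train["education-num_na"] = train["education-num"].isna()
valid["education-num_na"] = valid["education-num"].isna()

# Fill both sets with the training median so they are treated consistently.
train["education-num"] = train["education-num"].fillna(median)
valid["education-num"] = valid["education-num"].fillna(median)

print(median)
print(train)
```

Reusing the training median for the validation set keeps information from the validation rows out of the preprocessing.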

In [None]:
train_df, valid_df = df[:800].copy(),df[800:].copy()
data = tabular_data_from_df(ADULT_PATH, train_df, valid_df, '>=50k', tfms=tfms, cat_names=cat_names, cont_names=cont_names)

In [None]:
x,y = next(iter(data.train_dl))
x[0][:10], x[1][:10]

(tensor([[ 4, 13,  2,  4,  1,  5,  2, 28,  2],
         [ 2, 15,  2,  0,  1,  3,  2, 28,  1],
         [ 1, 15,  4,  1,  4,  5,  2, 28,  1],
         [ 4, 10,  4,  0,  2,  5,  2, 28,  1],
         [ 4, 12,  1,  3,  2,  5,  2, 28,  2],
         [ 7,  9,  2, 11,  1,  5,  2, 28,  2],
         [ 4, 15,  4,  6,  4,  5,  2, 28,  2],
         [ 4, 12,  4,  0,  2,  5,  2, 28,  1],
         [ 4,  1,  2, 14,  1,  5,  2, 28,  2],
         [ 3, 15,  6,  0,  2,  5,  1, 28,  2]], device='cuda:0'),
 tensor([[ 1.6002, -0.7512, -0.1152, -0.1364, -0.2289,  0.3617],
         [ 0.5984,  1.1684, -0.1152, -0.1364, -0.2289, -0.0503],
         [-1.4823, -0.2425, -0.1152, -0.1364, -0.2289, -0.0503],
         [-0.7117, -0.6892,  1.5310, -0.1364, -0.2289, -0.0503],
         [ 0.1360, -0.8950, -0.1152, -0.1364, -0.2289,  0.3617],
         [-0.5575,  0.5824, -0.1152, -0.1364, -0.2289, -0.2151],
         [-1.2511, -0.8314, -0.1152, -0.1364, -0.2289, -1.6982],
         [-0.6346, -0.7895, -0.6640, -0.1364,  4.0259,  

After being processed in `TabularDataset`, the categorical variables are replaced by ids and the continuous variables are normalized. The codes corresponding to categorical variables are all put together, as are all the continuous variables.

## Defining a model

Once we have our data ready in a `DataBunch`, we just need to create a model to then define a `Learner` and start training. The fastai library has a default `TabularModel` in `models.tabular`.

In [None]:
from fastai.models.tabular import TabularModel

To use that model, we just need to specify the embedding sizes for each of our categorical variables.

In [None]:
cat_szs = [len(train_df[n].cat.categories)+1 for n in cat_names]
emb_szs = [(c, min(50, (c+1)//2)) for c in cat_szs]
emb_szs

[(9, 5), (16, 8), (7, 4), (15, 8), (7, 4), (6, 3), (3, 2), (30, 15), (3, 2)]
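The sizing rule used in the cell above (half the number of categories, rounded up, capped at 50) can be checked in isolation. The helper name `embedding_size` is hypothetical, and the cardinalities are copied from the output above.

```python
def embedding_size(cardinality, max_size=50):
    """Half the cardinality, rounded up, capped at max_size."""
    return min(max_size, (cardinality + 1) // 2)

# Cardinalities taken from the first element of each pair printed above.
cardinalities = [9, 16, 7, 15, 7, 6, 3, 30, 3]
sizes = [(c, embedding_size(c)) for c in cardinalities]
print(sizes)

# A variable with many categories hits the cap.
print(embedding_size(200))
```

The cap keeps high-cardinality variables from producing very wide embeddings, while small variables still get at least a couple of dimensions.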

In [None]:
model = TabularModel(emb_szs, len(cont_names), out_sz=1, layers=[1000,500], drops=[0.001,0.01], emb_drop=0.04)
learner = Learner(data, model, loss_fn = F.binary_cross_entropy)
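To show what the embedding sizes are used for, here is a small NumPy sketch of how an embedding-based tabular model might build its input features: each categorical id selects a row from a lookup table, and the resulting vectors are concatenated with the continuous values. This illustrates the general idea only, with random tables and made-up inputs; it is not the code of `TabularModel`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two categorical variables: (number of ids, embedding width) for each.
emb_szs = [(9, 5), (3, 2)]

# One lookup table per categorical variable: row i holds the vector for id i.
# In a real model these tables are learned; here they are random.
tables = [rng.normal(size=(n_ids, width)) for n_ids, width in emb_szs]

# A batch of two rows: categorical ids and already-normalized continuous values.
x_cat = np.array([[4, 2],
                  [0, 1]])
x_cont = np.array([[1.60, -0.75, 0.36],
                   [0.60, 1.17, -0.05]])

# Look up each column of ids in its own table, then join everything side by side.
embedded = [table[x_cat[:, i]] for i, table in enumerate(tables)]
features = np.concatenate(embedded + [x_cont], axis=1)

print(features.shape)
```

Each row ends up with 5 + 2 + 3 = 10 features, which is the width the first fully connected layer of such a model would receive.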

In [None]:
learner.fit_one_cycle(1,1e-3)

Total time: 00:01
epoch  train loss  valid loss
0      0.655111    0.685396    (00:01)

