# Tabular Dataset

This module defines the main class to handle tabular data in the fastai library: `TabularDataset`. As always, there is also a helper function, `tabular_data_from_df`, to quickly get your data into a `DataBunch`.

In [None]:
from fastai.gen_doc.nbdoc import *
from fastai.tabular import * 
from fastai.docs import *

## Quickly get the data in a `DataBunch`

The best way to quickly get your data into a `DataBunch` is to organize it in two (or three) dataframes: one for training, one for validation and, if you have it, one for testing. Here we use a subsample of the [adult dataset](https://archive.ics.uci.edu/ml/datasets/adult).

In [None]:
#untar_adult()
ADULT_PATH = DATA_PATH / 'adult_sample'
df = pd.read_csv(ADULT_PATH/'adult.csv')
train_df, valid_df = df[:800].copy(),df[800:].copy()

In [None]:
train_df.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | >=50k |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse | | Wife | White | Female | 0 | 1902 | 40 | United-States | 1 |
| 1 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | 1 |
| 2 | 38 | Private | 96185 | HS-grad | | Divorced | | Unmarried | Black | Female | 0 | 0 | 32 | United-States | 0 |
| 3 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | 1 |
| 4 | 42 | Self-emp-not-inc | 82297 | 7th-8th | | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | 0 |


In [None]:
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']
cont_names = ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']

In [None]:
show_doc(tabular_data_from_df, doc_string=False)

#### <a id=tabular_data_from_df></a>`tabular_data_from_df`
> `tabular_data_from_df`(`path`, `train_df`:`DataFrame`, `valid_df`:`DataFrame`, `dep_var`:`str`, `test_df`:`OptDataFrame`=`None`, `tfms`:`Optional`\[`Collection`\[[`TabularTransform`](/tabular.transform.html#TabularTransform)\]\]=`None`, `cat_names`:`OptStrList`=`None`, `cont_names`:`OptStrList`=`None`, `stats`:`OptStats`=`None`, `log_output`:`bool`=`False`, `kwargs`) -> [`DataBunch`](/data.html#DataBunch)
<a href="https://github.com/fastai/fastai_pytorch/blob/master/fastai/tabular/data.py#L58">[source]</a>

Creates a `DataBunch` in `path` from `train_df`, `valid_df` and, optionally, `test_df`. The dependent variable is the `dep_var` column, while the categorical and continuous variables are in the `cat_names` and `cont_names` columns respectively. The `TabularTransform`s in `tfms` are applied to the dataframes as preprocessing, then the categories are replaced by their codes + 1 (leaving 0 to mark missing values, i.e. NaN) and the continuous variables are normalized. You can pass the `stats` to use for the normalization step. If the flag `log_output` is `True`, the dependent variable is replaced by its log.
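The effect of `log_output` can be illustrated with a minimal sketch in plain Python. It does not call fastai; the target values are made up for illustration, and the use of the natural logarithm is an assumption. Note that a log transform only makes sense for strictly positive targets.

```python
import math

# Hypothetical, strictly positive target values (not taken from the adult dataset).
targets = [100.0, 2500.0, 40000.0]

# What log_output=True is described as doing: replace the target by its log.
# Assumption: the natural logarithm is used.
log_targets = [math.log(t) for t in targets]

# Values predicted in log space are mapped back to the original scale
# with the exponential.
recovered = [math.exp(v) for v in log_targets]
```

A model trained on the log of the target predicts in log space, so its outputs need the inverse transform before they are compared with the original values.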

Note that the transforms should be passed as `Callable`s, that is, as the classes themselves rather than as instances: the actual initialization with `cat_names` and `cont_names` is done inside.
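The following sketch shows, in plain Python, what passing transforms as classes could look like. `ToyTransform`, `ToyFillMissing`, `apply_transforms` and the `"#na#"` placeholder are hypothetical names invented for this illustration; they are not the fastai implementation.

```python
from typing import Dict, List

Row = Dict[str, object]


class ToyTransform:
    """Hypothetical stand-in for a tabular transform (not the fastai class)."""

    def __init__(self, cat_names: List[str], cont_names: List[str]):
        self.cat_names, self.cont_names = cat_names, cont_names

    def __call__(self, rows: List[Row]) -> None:
        raise NotImplementedError


class ToyFillMissing(ToyTransform):
    """Replace missing categorical values with a placeholder, in place."""

    def __call__(self, rows: List[Row]) -> None:
        for row in rows:
            for name in self.cat_names:
                if row[name] is None:
                    row[name] = "#na#"  # hypothetical placeholder value


def apply_transforms(rows, tfms, cat_names, cont_names):
    """Instantiate each transform class with the column names, then apply it."""
    for tfm_cls in tfms:
        tfm = tfm_cls(cat_names, cont_names)  # initialization happens here
        tfm(rows)
    return rows


rows = [{"workclass": "Private", "age": 49}, {"workclass": None, "age": 38}]
# Pass the class itself, not ToyFillMissing(...).
apply_transforms(rows, [ToyFillMissing], cat_names=["workclass"], cont_names=["age"])
```

Because the caller only names the transform classes, the column lists are given once and every transform is guaranteed to be built with the same `cat_names` and `cont_names`.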

In [None]:
tfms = [FillMissing, Categorify]
data = tabular_data_from_df(ADULT_PATH, train_df, valid_df, '>=50k', tfms=tfms, cat_names=cat_names, cont_names=cont_names)

You can then easily create a model for this data with `get_tabular_model`.

## The TabularDataset class

In [None]:
show_doc(TabularDataset, doc_string=False)

## <a id=TabularDataset></a>`class` `TabularDataset`
> `TabularDataset`(`df`:`DataFrame`, `dep_var`:`str`, `cat_names`:`OptStrList`=`None`, `cont_names`:`OptStrList`=`None`, `stats`:`OptStats`=`None`, `log_output`:`bool`=`False`) :: [`DatasetBase`](/data.html#DatasetBase)
<a href="https://github.com/fastai/fastai_pytorch/blob/master/fastai/tabular/data.py#L11">[source]</a>

Create a dataset from `df` with the target being the `dep_var` column, while the categorical and continuous variables are in the `cat_names` and `cont_names` columns respectively. Categories are replaced by their codes + 1 (leaving 0 to mark missing values, i.e. NaN) and the continuous variables are normalized. You can pass the `stats` to use for the normalization step. If the flag `log_output` is `True`, the dependent variable is replaced by its log.
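The encoding and normalization described above can be sketched in plain Python as follows. This is an illustration, not the library code: the helper names are hypothetical, categories are sorted alphabetically to assign codes, and the population standard deviation is used for simplicity; the library's exact choices may differ.

```python
from statistics import mean, pstdev
from typing import Dict, List, Optional, Sequence, Tuple


def build_codes(values: Sequence[Optional[str]]) -> Dict[str, int]:
    """Map each distinct non-missing category to a code starting at 1."""
    categories = sorted({v for v in values if v is not None})
    return {cat: i + 1 for i, cat in enumerate(categories)}


def encode(values: Sequence[Optional[str]], codes: Dict[str, int]) -> List[int]:
    """Replace categories by their code; missing values get 0."""
    return [codes.get(v, 0) for v in values]


def compute_stats(values: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, std). Assumption: population standard deviation."""
    return mean(values), pstdev(values)


def normalize(values: Sequence[float], stats: Tuple[float, float]) -> List[float]:
    """Subtract the mean and divide by the standard deviation."""
    mu, sigma = stats
    return [(v - mu) / sigma for v in values]


# Categorical column with one missing value.
train_occupation = ["Exec-managerial", None, "Prof-specialty", "Other-service"]
codes = build_codes(train_occupation)
train_codes = encode(train_occupation, codes)  # [1, 0, 3, 2]

# Continuous column: statistics come from the training data only
# and are reused for the validation data.
train_age = [44.0, 38.0, 38.0, 42.0]
valid_age = [49.0, 40.0]
stats = compute_stats(train_age)
train_age_norm = normalize(train_age, stats)
valid_age_norm = normalize(valid_age, stats)
```

Reusing the training statistics for the validation data is the reason a `stats` argument is useful: both sets end up on the same scale, and nothing about the validation data leaks into the preprocessing.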

In [None]:
show_doc(TabularDataset.from_dataframe, doc_string=False)

#### <a id=from_dataframe></a>`from_dataframe`
> `from_dataframe`(`df`:`DataFrame`, `dep_var`:`str`, `tfms`:`Optional`\[`Collection`\[[`TabularTransform`](/tabular.transform.html#TabularTransform)\]\]=`None`, `cat_names`:`OptStrList`=`None`, `cont_names`:`OptStrList`=`None`, `stats`:`OptStats`=`None`, `log_output`:`bool`=`False`) -> `TabularDataset`
<a href="https://github.com/fastai/fastai_pytorch/blob/master/fastai/tabular/data.py#L40">[source]</a>

Factory method to create a `TabularDataset` from `df`. The only difference from the constructor above is that it takes a list `tfms` of `TabularTransform`s, which are applied to the dataframe before it is passed to the class initialization.