# Tabular training
How to use the tabular application in fastai

To illustrate the tabular application, we will use the Adult dataset, where the task is to predict whether a person earns more or less than $50k per year from general demographic data.

In [38]:
from fastai.tabular.all import *

# Download the data and uncompress it
path = untar_data(URLs.ADULT_SAMPLE)
path.ls()


(#4) [Path('/Users/ondrej.drapalik/.fastai/data/adult_sample/adult.csv'),Path('/Users/ondrej.drapalik/.fastai/data/adult_sample/.DS_Store'),Path('/Users/ondrej.drapalik/.fastai/data/adult_sample/export.pkl'),Path('/Users/ondrej.drapalik/.fastai/data/adult_sample/models')]

In [39]:
# Check how the data is structured
df = pd.read_csv(path/'adult.csv')
df.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse | | Wife | White | Female | 0 | 1902 | 40 | United-States | >=50k |
| 1 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | >=50k |
| 2 | 38 | Private | 96185 | HS-grad | | Divorced | | Unmarried | Black | Female | 0 | 0 | 32 | United-States | <50k |
| 3 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | >=50k |
| 4 | 42 | Self-emp-not-inc | 82297 | 7th-8th | | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | <50k |


> FNLWGT stands for "final weight." In survey sampling, it is the weight assigned to each sample element (a person, household, or other unit of observation) to adjust for the fact that the sample may not be representative of the population as a whole. An element that stands for a larger number of people than another receives a higher weight. Datasets in demographic and social science research often include such a variable.
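As a rough illustration of how such a weight is used, the sketch below compares an unweighted and a weighted share of high earners. The four records and their weights are made up for this example and are not taken from the Adult dataset.

```python
# Hypothetical records: (final weight, earns at least 50k?)
# Each weight says how many people in the population a sampled person stands for.
records = [
    (100_000, True),
    (250_000, False),
    (50_000, True),
    (100_000, False),
]

def weighted_share(rows):
    """Return the weighted fraction of rows whose flag is True."""
    total_weight = sum(weight for weight, _ in rows)
    positive_weight = sum(weight for weight, flag in rows if flag)
    return positive_weight / total_weight

unweighted = sum(flag for _, flag in records) / len(records)
weighted = weighted_share(records)
print(f"unweighted share: {unweighted:.2f}")  # 0.50
print(f"weighted share:   {weighted:.2f}")    # 0.30
```

The two numbers differ because the sampled low earners stand for more people than the sampled high earners.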

## DataLoader

Some of the columns are continuous (like `age`), and we will treat them as floating-point numbers that we can feed to our model directly. Others are categorical (like `workclass` or `education`), and we will convert each category to a unique index that we feed to embedding layers. We can specify the categorical and continuous column names, as well as the name of the dependent variable, in the `TabularDataLoaders` factory methods:
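To make the idea of "a unique index fed to an embedding layer" concrete, here is a minimal pure-Python sketch. The vocabulary, the vector size, and the numbers in the table are made up for illustration; in a real model the table entries are learned parameters, and reserving index 0 for missing values is an assumption of this sketch.

```python
# Hypothetical vocabulary for one categorical column; index 0 is kept for missing values.
vocab = ["#na#", "Private", "Self-emp-inc", "Self-emp-not-inc"]
category_to_index = {name: i for i, name in enumerate(vocab)}

# A toy embedding table: one 2-dimensional vector per index (values are arbitrary).
embedding_table = [
    [0.00, 0.00],   # "#na#"
    [0.12, -0.40],  # "Private"
    [0.85, 0.31],   # "Self-emp-inc"
    [-0.27, 0.66],  # "Self-emp-not-inc"
]

def embed(value):
    """Map a category string to its index, then look up the matching vector."""
    index = category_to_index.get(value, 0)  # unknown values fall back to index 0
    return index, embedding_table[index]

index, vector = embed("Self-emp-inc")
print(index, vector)  # 2 [0.85, 0.31]
```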



In [40]:
dls = TabularDataLoaders.from_csv(path/'adult.csv', path=path, y_names="salary",
    cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
    cont_names = ['age', 'fnlwgt', 'education-num'],
    procs = [Categorify, FillMissing, Normalize])
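
The three `procs` can be approximated in plain pandas. This is only a sketch of the idea on a hypothetical four-row frame called `toy`, not fastai's implementation: it assumes the median as fill value, the population standard deviation for scaling, and index 0 for missing categories.

```python
import pandas as pd

# Hypothetical toy data with one missing value in each column.
toy = pd.DataFrame({
    "workclass": ["Private", None, "Self-emp-inc", "Private"],
    "education-num": [12.0, None, 15.0, 9.0],
})

# Categorify (sketch): replace each category with an integer code.
# pandas codes start at 0 and use -1 for missing, so adding 1 reserves 0 for missing.
toy["workclass"] = toy["workclass"].astype("category").cat.codes + 1

# FillMissing (sketch): remember which rows were missing, then fill with the median.
toy["education-num_na"] = toy["education-num"].isna()
toy["education-num"] = toy["education-num"].fillna(toy["education-num"].median())

# Normalize (sketch): subtract the mean and divide by the standard deviation.
col = toy["education-num"]
toy["education-num"] = (col - col.mean()) / col.std(ddof=0)

print(toy)
```

In practice the statistics (category list, median, mean, standard deviation) would be computed on the training rows only and then reused for validation and test data.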

### Exploring what's going on under the hood of `TabularDataLoaders`

We can rebuild the same pipeline step by step with `TabularPandas`.

In [41]:
# Create a random 80/20 split
splits = RandomSplitter(valid_pct=0.2)(range_of(df))
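
`splits` is a pair of index collections, one for training and one for validation. A minimal pure-Python sketch of the same idea follows; the function name `random_split` and the fixed seed are illustrative choices, not fastai's code.

```python
import random

def random_split(n_items, valid_pct=0.2, seed=42):
    """Shuffle the indices 0..n_items-1 and cut off valid_pct of them for validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_valid = int(valid_pct * n_items)
    # Return (training indices, validation indices).
    return indices[n_valid:], indices[:n_valid]

train_idx, valid_idx = random_split(10)
print(len(train_idx), len(valid_idx))  # 8 2
```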

In [42]:
to = TabularPandas(df, procs=[Categorify, FillMissing, Normalize],
                   cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
                   cont_names = ['age', 'fnlwgt', 'education-num'],
                   y_names='salary',
                   splits=splits)

In [43]:
to.xs.iloc[:2]


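| | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num |
|---|---|---|---|---|---|---|---|---|---|---|
| 9619 | 5 | 9 | 3 | 9 | 1 | 5 | 1 | 0.616361 | 0.825333 | 0.361948 |
| 8886 | 2 | 10 | 3 | 11 | 1 | 5 | 1 | 1.349586 | -0.467698 | 1.14254 |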


### Back to TabularDataLoaders again...

In [44]:
dls = to.dataloaders(bs=64)

In [45]:
# The show_batch method works the same way as in every other fastai application:
dls.show_batch()


| | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Self-emp-not-inc | Some-college | Separated | Farming-fishing | Not-in-family | White | False | 35.0 | 31094.995073 | 10.0 | <50k |
| 1 | Federal-gov | HS-grad | Married-civ-spouse | Adm-clerical | Husband | White | False | 45.0 | 60267.005577 | 9.0 | >=50k |
| 2 | Private | Some-college | Married-civ-spouse | Craft-repair | Husband | White | False | 49.0 | 176813.999612 | 10.0 | <50k |
| 3 | Private | HS-grad | Married-civ-spouse | Craft-repair | Husband | White | False | 35.0 | 245090.002532 | 9.0 | <50k |
| 4 | Private | Some-college | Never-married | Exec-managerial | Own-child | White | False | 19.000001 | 172892.999247 | 10.0 | <50k |
| 5 | Private | 10th | Never-married | Priv-house-serv | Own-child | White | False | 25.0 | 143280.001137 | 6.0 | <50k |
| 6 | Local-gov | HS-grad | Married-civ-spouse | Adm-clerical | Husband | White | False | 42.0 | 263871.001726 | 9.0 | <50k |
| 7 | Private | Some-college | Never-married | Adm-clerical | Unmarried | Black | False | 35.0 | 287657.997945 | 10.0 | <50k |
| 8 | Private | 12th | Married-civ-spouse | Machine-op-inspct | Husband | White | False | 23.999999 | 118656.999652 | 8.0 | <50k |
| 9 | Federal-gov | Bachelors | Never-married | Prof-specialty | Own-child | Black | False | 28.0 | 56651.002685 | 13.0 | <50k |


## Define a model using the `tabular_learner` method

When we define our model, fastai will try to infer the loss function from the `y_names` we specified earlier.

> **Note**: Sometimes with tabular data, your y’s may already be encoded (such as 0 and 1). In such a case you should explicitly pass `y_block=CategoryBlock()` in your constructor so fastai won’t presume you are doing regression.

In [46]:
learn = tabular_learner(dls, metrics=accuracy)

And we can train that model with the `fit_one_cycle` method (the `fine_tune` method won’t be useful here since we don’t have a pretrained model).


In [47]:
learn.fit_one_cycle(3)

| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 0.374963 | 0.368344 | 0.833538 | 00:04 |
| 1 | 0.354952 | 0.368138 | 0.828931 | 00:03 |
| 2 | 0.357534 | 0.357874 | 0.836763 | 00:03 |
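
The `accuracy` column is the share of validation rows whose highest-probability class matches the label. A small sketch of that calculation, using made-up probabilities and labels rather than the model's actual outputs (the helper name `accuracy_score` is hypothetical):

```python
# Hypothetical predicted probabilities for classes [0, 1] and the true labels.
probabilities = [
    [0.90, 0.10],
    [0.35, 0.65],
    [0.60, 0.40],
    [0.20, 0.80],
]
targets = [0, 1, 1, 1]

def accuracy_score(probs, labels):
    """Fraction of rows where the most probable class equals the label."""
    predicted = [row.index(max(row)) for row in probs]
    correct = sum(p == t for p, t in zip(predicted, labels))
    return correct / len(labels)

print(accuracy_score(probabilities, targets))  # 0.75
```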


In [76]:
# We can now have a look at the predictions of our model:
learn.show_results()

| | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary | salary_pred |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.0 | 16.0 | 5.0 | 4.0 | 2.0 | 5.0 | 1.0 | 0.103103 | 0.429389 | -0.028348 | 0.0 | 0.0 |
| 1 | 3.0 | 10.0 | 5.0 | 11.0 | 2.0 | 5.0 | 1.0 | -0.336832 | 0.61695 | 1.14254 | 0.0 | 0.0 |
| 2 | 5.0 | 10.0 | 3.0 | 9.0 | 6.0 | 5.0 | 1.0 | -0.776767 | -0.892659 | 1.14254 | 0.0 | 0.0 |
| 3 | 5.0 | 16.0 | 3.0 | 14.0 | 6.0 | 5.0 | 1.0 | 1.642876 | 0.288534 | -0.028348 | 1.0 | 1.0 |
| 4 | 5.0 | 12.0 | 3.0 | 8.0 | 6.0 | 3.0 | 1.0 | 0.323071 | -0.637102 | -0.418644 | 0.0 | 0.0 |
| 5 | 5.0 | 10.0 | 5.0 | 2.0 | 4.0 | 5.0 | 1.0 | -1.14338 | -0.781261 | 1.14254 | 0.0 | 0.0 |
| 6 | 6.0 | 12.0 | 3.0 | 6.0 | 1.0 | 5.0 | 1.0 | 0.323071 | -0.80982 | -0.418644 | 0.0 | 0.0 |
| 7 | 5.0 | 16.0 | 5.0 | 9.0 | 2.0 | 5.0 | 1.0 | -0.190187 | 0.332901 | -0.028348 | 0.0 | 0.0 |
| 8 | 5.0 | 9.0 | 5.0 | 9.0 | 4.0 | 5.0 | 1.0 | -1.290025 | 1.657308 | 0.361948 | 0.0 | 0.0 |


### Explanation of the tabular model

The `predict` method returns a tuple containing three values:

- `row`: the processed version of the row that was passed to `predict`, with categorical columns encoded and continuous columns normalized. In the cell below, this is the row at index 3 of `df`.
- `clas`: the predicted class for the row.
- `probs`: a tensor of probabilities, one per class, representing the likelihood of the row belonging to each class. The class with the highest probability is the one that is predicted.

In [89]:
row, clas, probs = learn.predict(df.iloc[3])


In [90]:
row.show()

| | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Self-emp-inc | Prof-school | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | False | 38.0 | 112847.00128 | 15.0 | >=50k |


In [91]:
row,clas,probs

(   workclass  education  marital-status  occupation  relationship  race  \
 0        6.0       15.0             3.0        11.0           1.0   2.0   
 
    education-num_na       age    fnlwgt  education-num  salary  
 0               1.0 -0.043542 -0.733839       1.923132     1.0  ,
 tensor(1),
 tensor([0.1001, 0.8999]))

For comparison, a separate run on what appears to be the first row of `df` returned a much closer call:

```
(   workclass  education  marital-status  occupation  relationship  race  \
 0        5.0        8.0             3.0         0.0           6.0   5.0   
 
    education-num_na       age    fnlwgt  education-num  salary  
 0               1.0  0.765509 -0.837614       0.750943     1.0  ,
 tensor(1),
 tensor([0.4905, 0.5095]))
```

> The probability of the row belonging to class 0 is 0.4905 and the probability of it belonging to class 1 is 0.5095. Since class 1 has the higher probability, the model predicts that the row belongs to class 1.

To get predictions on a new dataframe, you can use the `test_dl` method of the `DataLoaders`. That dataframe does not need to contain the dependent variable.

In [64]:
test_df = df.copy()
test_df.drop(['salary'], axis=1, inplace=True)
dl = learn.dls.test_dl(test_df)

In [68]:
learn.get_preds(dl=dl)


(tensor([[0.3827, 0.6173],
         [0.4272, 0.5728],
         [0.9590, 0.0410],
         ...,
         [0.6074, 0.3926],
         [0.7040, 0.2960],
         [0.7064, 0.2936]]),
 None)
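
`get_preds` returns a pair: the predicted probabilities and the targets, which are `None` here because the test dataframe has no `salary` column. To turn probabilities into labels, take the index of the largest value in each row. The sketch below does this in plain Python for the first three rows printed above; the label order `['<50k', '>=50k']` is an assumption based on the encoded outputs shown earlier.

```python
# First three probability rows from the output above.
probs = [
    [0.3827, 0.6173],
    [0.4272, 0.5728],
    [0.9590, 0.0410],
]
# Assumed label order: class 0 is '<50k', class 1 is '>=50k'.
labels = ["<50k", ">=50k"]

# For each row, pick the label whose probability is largest.
predicted = [labels[row.index(max(row))] for row in probs]
print(predicted)  # ['>=50k', '>=50k', '<50k']
```

These three labels agree with the `salary` values of the first three rows of `df`.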