## Tabular training

In [7]:
! pip install -q fastai

To illustrate the tabular application, we will use the Adult dataset, where the task is to predict whether a person earns more or less than $50k per year from some general data about them.

In [8]:
from fastai.tabular.all import *
from pathlib import Path

Here the data is read from the current working directory instead of being downloaded with the usual untar_data command:

In [9]:
dataPath = Path()
dataPath.ls()

(#4) [Path('data.csv'),Path('screener.ipynb'),Path('tabular.ipynb'),Path('Tutorial')]

Then we can have a look at how the data is structured:

In [10]:
df = pd.read_csv(dataPath/'data.csv')
df.head()

|  | Date | Ticker | Open | High | Low | Close | Volume | Dividends | Stock Splits | YearChange (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2023-01-17 00:00:00-05:00 | AAPL | 133.426895 | 135.861287 | 132.734183 | 134.525345 | 63646600 | 0.0 | 0.0 | -20.594134 |
| 1 | 2023-01-18 00:00:00-05:00 | AAPL | 135.396194 | 137.16756 | 133.624813 | 133.802948 | 69672800 | 0.0 | 0.0 | -21.422109 |
| 2 | 2023-01-19 00:00:00-05:00 | AAPL | 132.684716 | 134.832133 | 132.377945 | 133.862335 | 58280400 | 0.0 | 0.0 | -19.873325 |
| 3 | 2023-01-20 00:00:00-05:00 | AAPL | 133.872198 | 136.58369 | 132.823232 | 136.435242 | 80223600 | 0.0 | 0.0 | -16.579356 |
| 4 | 2023-01-23 00:00:00-05:00 | AAPL | 136.682621 | 141.828518 | 136.464909 | 139.64151 | 81760300 | 0.0 | 0.0 | -13.726248 |


Some of the columns are continuous (like age), and we will treat them as floating-point numbers that we can feed to our model directly. Others are categorical (like workclass or education), and we will convert each of them to a unique index that we feed to embedding layers. We can specify the categorical and continuous column names, as well as the name of the dependent variable, in the TabularDataLoaders factory methods:

In [16]:
dls = TabularDataLoaders.from_csv(dataPath/'data.csv', path=dataPath,
    y_names="YearChange (%)",
    cat_names=['Date'],
    cont_names=['Open', 'High', 'Low', 'Close', 'Volume', 'Dividends', 'Stock Splits'],
    procs=[Categorify, FillMissing, Normalize])

KeyError: "None of [Index(['Open', 'High', 'Low', 'Close', 'Volume'], dtype='object')] are in the [columns]"
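A KeyError of this form is raised by pandas when none of the requested labels can be found among the DataFrame's columns. One way to narrow it down is to compare the requested names against the header before building the DataLoaders. The following is a minimal sketch under stated assumptions: `missing_columns` is a hypothetical helper (not part of fastai or pandas), and `header_line` is assumed to match the header printed by df.head() above. It uses only the standard library so it runs without the data file.

```python
import csv
import io

# Header row as displayed by df.head() above (assumed to match data.csv).
header_line = "Date,Ticker,Open,High,Low,Close,Volume,Dividends,Stock Splits,YearChange (%)"

def missing_columns(header, wanted):
    """Return the names in `wanted` that are not in `header` (exact string match)."""
    available = set(header)
    return [name for name in wanted if name not in available]

header = next(csv.reader(io.StringIO(header_line)))
wanted = ['Date', 'Open', 'High', 'Low', 'Close', 'Volume',
          'Dividends', 'Stock Splits', 'YearChange (%)']

# Every requested name is present in the assumed header.
print(missing_columns(header, wanted))              # []

# Case and surrounding whitespace must match exactly.
print(missing_columns(header, ['open', 'Open ']))   # ['open', 'Open ']
```

With a real DataFrame, the same check can be written as `[c for c in wanted if c not in df.columns]`.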

The procs argument is the list of pre-processors we apply to our data:

* Categorify takes every categorical variable, builds a mapping between its unique categories and integer indices, then replaces each value with the corresponding index.
* FillMissing fills the missing values in the continuous variables with the median of the existing values (you can choose a specific value if you prefer).
* Normalize normalizes the continuous variables (subtracts the mean and divides by the standard deviation).
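The three pre-processors can be sketched in plain Python on toy columns. This is an illustration of the ideas only, not fastai's implementation: the function names and the toy values are made up, index 0 is assumed to be reserved for a `#na#` placeholder, and the population standard deviation is an arbitrary choice for the example. The `was_missing` flags play the role of the `education-num_na` column that appears in the outputs later in this document.

```python
import statistics

def categorify(values):
    """Map each distinct category to an integer index; 0 stands for missing/unknown."""
    vocab = ['#na#'] + sorted({v for v in values if v is not None})
    index = {cat: i for i, cat in enumerate(vocab)}
    return [index.get(v, 0) for v in values], vocab

def fill_missing(values):
    """Replace None with the median of the present values and flag the filled rows."""
    median = statistics.median(v for v in values if v is not None)
    filled = [median if v is None else v for v in values]
    was_missing = [v is None for v in values]
    return filled, was_missing

def normalize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# Toy columns, made up for illustration.
workclass = ['Private', 'Local-gov', None, 'Private']
age = [24.0, None, 40.0, 52.0]

codes, vocab = categorify(workclass)
print(vocab)   # ['#na#', 'Local-gov', 'Private']
print(codes)   # [2, 1, 0, 2]

filled, was_missing = fill_missing(age)
print(filled)       # [24.0, 40.0, 40.0, 52.0]
print(was_missing)  # [False, True, False, False]

# After normalization the column has mean 0 and standard deviation 1.
print(normalize(filled))
```

The order matters in the same way as in the procs list: missing values are filled before the mean and standard deviation are computed.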

To further expose what’s going on below the surface, let’s rewrite this using fastai’s TabularPandas class. We need to make one adjustment, which is to define how we want to split our data. By default, the factory method above used a random 80/20 split, so we will do the same:

In [6]:
splits = RandomSplitter(valid_pct=0.2)(range_of(df))
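What a random 80/20 split amounts to can be sketched in plain Python: shuffle the row indices, then cut off the first 20% as the validation set. `random_split` is a hypothetical stand-in written for this example; it does not reproduce fastai's random number generation, so it will not yield the same indices as RandomSplitter.

```python
import random

def random_split(n_items, valid_pct=0.2, seed=None):
    """Return (train_idxs, valid_idxs) as two disjoint lists of row indices."""
    rng = random.Random(seed)
    idxs = list(range(n_items))
    rng.shuffle(idxs)
    cut = int(valid_pct * n_items)   # number of validation rows
    return idxs[cut:], idxs[:cut]

train_idxs, valid_idxs = random_split(10, valid_pct=0.2, seed=42)
print(len(train_idxs), len(valid_idxs))   # 8 2
```

The two lists together cover every row exactly once, which is the property TabularPandas relies on when it receives `splits`.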

In [8]:
to = TabularPandas(df, procs=[Categorify, FillMissing, Normalize],
    cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
    cont_names = ['age', 'fnlwgt', 'education-num'],
    y_names='salary',
    splits=splits)

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  to[n].fillna(self.na_dict[n], inplace=True)


Once we build our TabularPandas object, our data is completely preprocessed as seen below:

In [11]:
to.xs.iloc[:2]

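|  | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10846 | 6 | 12 | 3 | 5 | 1 | 5 | 1 | 2.597238 | 0.392668 | -0.421616 |
| 31725 | 7 | 6 | 5 | 6 | 4 | 5 | 1 | -1.068254 | 0.022979 | -2.382175 |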


Now we can build our DataLoaders again:

In [13]:
dls = to.dataloaders(bs=64)

The show_batch method works as it does for every other application:

In [14]:
dls.show_batch()

|  | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Private | HS-grad | Never-married | Sales | Own-child | White | False | 24.000001 | 314822.995726 | 9.0 | <50k |
| 1 | Private | HS-grad | Never-married | Machine-op-inspct | Own-child | White | False | 19.000001 | 164584.999758 | 9.0 | <50k |
| 2 | Self-emp-not-inc | HS-grad | Divorced | Exec-managerial | Not-in-family | Black | False | 40.0 | 98985.000079 | 9.0 | <50k |
| 3 | Self-emp-not-inc | 7th-8th | Married-civ-spouse | Transport-moving | Husband | White | False | 26.0 | 224360.999188 | 4.0 | <50k |
| 4 | Private | Some-college | Widowed | Other-service | Unmarried | Black | False | 52.0 | 135607.002345 | 10.0 | <50k |
| 5 | Private | Some-college | Married-civ-spouse | Handlers-cleaners | Husband | Black | False | 37.0 | 360742.996603 | 10.0 | >=50k |
| 6 | Private | HS-grad | Married-civ-spouse | Transport-moving | Husband | White | False | 60.999999 | 142921.999567 | 9.0 | <50k |
| 7 | Private | Some-college | Married-civ-spouse | Sales | Husband | White | False | 46.0 | 224201.998586 | 10.0 | <50k |
| 8 | Private | Some-college | Never-married | Craft-repair | Not-in-family | Asian-Pac-Islander | False | 42.0 | 68728.996491 | 10.0 | <50k |
| 9 | Local-gov | Bachelors | Never-married | Prof-specialty | Not-in-family | White | False | 58.000001 | 215244.999305 | 13.0 | <50k |


We can define a model using the tabular_learner method. When we define our model, fastai will try to infer the loss function from the y_names we specified earlier.

Note: Sometimes with tabular data, your y’s may be encoded (such as 0 and 1). In such a case you should explicitly pass y_block = CategoryBlock in your constructor so fastai won’t presume you are doing regression.

In [None]:
learn = tabular_learner(dls, metrics=accuracy)
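The note above about encoded targets can be illustrated with a toy rule of thumb: if every target value is numeric, the column looks like a regression target even when the numbers are really class codes. `guess_task` is a hypothetical heuristic written for this example; it is not fastai's inference code and only mirrors the behavior the note describes.

```python
from numbers import Number

def guess_task(targets):
    """Hypothetical rule of thumb: all-numeric targets look like regression."""
    numeric = all(isinstance(t, Number) and not isinstance(t, bool) for t in targets)
    return 'regression' if numeric else 'classification'

print(guess_task(['<50k', '>=50k', '<50k']))   # classification
print(guess_task([0, 1, 0]))                   # regression, although these are class codes
```

The second call is the ambiguous case: nothing in the values themselves says that 0 and 1 are categories, which is why the target block has to be stated explicitly.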

And we can train that model with the fit_one_cycle method (the fine_tune method won’t be useful here since we don’t have a pretrained model).

In [26]:
learn.fit_one_cycle(3)

| epoch | train_loss | valid_loss | accuracy | time |
| --- | --- | --- | --- | --- |
| 0 | 0.371694 | 0.362503 | 0.834613 | 00:06 |
| 1 | 0.347213 | 0.353325 | 0.833692 | 00:06 |
| 2 | 0.358546 | 0.34559 | 0.841523 | 00:06 |
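The accuracy column reports the fraction of validation rows whose predicted class matches the target. A minimal plain-Python sketch of that computation follows; the class indices are made up for illustration and are not taken from the run above.

```python
def accuracy(predicted, targets):
    """Fraction of positions where the predicted class equals the target class."""
    if len(predicted) != len(targets):
        raise ValueError("predicted and targets must have the same length")
    correct = sum(p == t for p, t in zip(predicted, targets))
    return correct / len(targets)

# Made-up class indices, for illustration only.
predicted = [1, 0, 0, 0, 1]
targets = [1, 0, 1, 0, 1]
print(accuracy(predicted, targets))   # 0.8
```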


We can then have a look at some predictions:

In [27]:
learn.show_results()

|  | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary | salary_pred |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 6.0 | 8.0 | 3.0 | 13.0 | 1.0 | 5.0 | 1.0 | -0.115226 | -0.054219 | 0.75472 | 1.0 | 1.0 |
| 1 | 5.0 | 12.0 | 5.0 | 10.0 | 3.0 | 5.0 | 1.0 | -0.921634 | 1.083524 | -0.421616 | 0.0 | 0.0 |
| 2 | 5.0 | 12.0 | 5.0 | 13.0 | 2.0 | 5.0 | 1.0 | 0.031394 | 0.013966 | -0.421616 | 0.0 | 0.0 |
| 3 | 5.0 | 16.0 | 5.0 | 9.0 | 3.0 | 5.0 | 1.0 | -1.361494 | -0.202036 | -0.029504 | 0.0 | 0.0 |
| 4 | 1.0 | 10.0 | 3.0 | 1.0 | 1.0 | 5.0 | 1.0 | 1.5709 | -0.278079 | 1.146832 | 0.0 | 0.0 |
| 5 | 5.0 | 16.0 | 1.0 | 9.0 | 2.0 | 5.0 | 1.0 | -0.555085 | 0.646976 | -0.029504 | 0.0 | 0.0 |
| 6 | 5.0 | 11.0 | 3.0 | 11.0 | 1.0 | 2.0 | 1.0 | 0.104703 | 1.23953 | 2.323167 | 1.0 | 1.0 |
| 7 | 5.0 | 7.0 | 3.0 | 4.0 | 1.0 | 5.0 | 1.0 | 1.79083 | -1.163493 | -1.990063 | 0.0 | 0.0 |
| 8 | 5.0 | 13.0 | 1.0 | 0.0 | 5.0 | 5.0 | 2.0 | 0.617872 | 0.436985 | -0.029504 | 0.0 | 0.0 |


Or use the predict method on a row:

In [33]:
row, clas, probs = learn.predict(df.iloc[0])


In [34]:
row.show()

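|  | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Private | Assoc-acdm | Married-civ-spouse | #na# | Wife | White | False | 49.0 | 101319.999356 | 12.0 | >=50k |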


In [35]:
clas, probs

(tensor(1), tensor([0.4058, 0.5942]))
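The returned class index is the position of the largest probability. The sketch below reproduces that relationship with the two values printed above; the `vocab` order is an assumption, chosen so that index 1 corresponds to the `>=50k` label shown by row.show().

```python
probs = [0.4058, 0.5942]     # values copied from the output above
vocab = ['<50k', '>=50k']    # assumed class order

def argmax(values):
    """Index of the largest element in a non-empty list."""
    return max(range(len(values)), key=values.__getitem__)

clas = argmax(probs)
print(clas, vocab[clas])   # 1 >=50k
```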


To get predictions on a new dataframe, you can use the test_dl method of the DataLoaders. That dataframe does not need to have the dependent variable among its columns.

In [36]:
test_df = df.copy()
test_df.drop(['salary'], axis=1, inplace=True)
dl = learn.dls.test_dl(test_df)


Then Learner.get_preds will give you the predictions:

In [37]:
learn.get_preds(dl=dl)

(tensor([[0.4058, 0.5942],
         [0.4478, 0.5522],
         [0.9658, 0.0342],
         ...,
         [0.6596, 0.3404],
         [0.7203, 0.2797],
         [0.6557, 0.3443]]),
 None)
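The result is a pair: the first element holds one row of class probabilities per input row, and the second is None here because the test DataLoader has no targets. A minimal sketch of turning such rows into predicted classes, using the first three rows printed above as plain Python lists (`decode` is a hypothetical helper, not a fastai function):

```python
# First three rows of the prediction tensor shown above, as plain lists.
preds = [
    [0.4058, 0.5942],
    [0.4478, 0.5522],
    [0.9658, 0.0342],
]
targets = None   # no labels were available for the test DataLoader

def decode(rows):
    """Return (class_index, probability) of the most likely class for each row."""
    return [max(enumerate(row), key=lambda pair: pair[1]) for row in rows]

for class_idx, prob in decode(preds):
    print(class_idx, prob)
```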

Note:
Since machine learning models can’t magically understand categories they were never trained on, the data should reflect this. If the missing values in your test data differ from those seen in training, you should address this before training.
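One way to act on this note is to compare the categories in the test data against those seen in training before predicting. The sketch below is plain Python with made-up values; `unseen_categories` and `encode` are hypothetical helpers, and mapping unknown values to index 0 is an assumption that mirrors the `#na#` placeholder shown earlier.

```python
def unseen_categories(train_values, test_values):
    """Categories that occur in the test column but never in the training column."""
    return sorted(set(test_values) - set(train_values))

def encode(values, vocab):
    """Encode categories as 1-based indices into vocab; unknown values become 0."""
    index = {cat: i for i, cat in enumerate(vocab, start=1)}
    return [index.get(v, 0) for v in values]

# Made-up columns; 'Federal-gov' is deliberately absent from the training data.
train_workclass = ['Private', 'Local-gov', 'Self-emp-not-inc', 'Private']
test_workclass = ['Private', 'Federal-gov', 'Local-gov']

vocab = sorted(set(train_workclass))
print(unseen_categories(train_workclass, test_workclass))   # ['Federal-gov']
print(encode(test_workclass, vocab))                        # [2, 0, 1]
```

A non-empty result from `unseen_categories` signals that the model will treat those rows as unknown, so it is worth deciding how to handle them before training.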