<a href="https://colab.research.google.com/github/JayThibs/Machine-Learning-With-Tabular-Data/blob/master/Fast_AI_Tabular_Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tabular Data

There's a common myth that you shouldn't use deep learning for machine learning on tabular data. The usual argument is that tabular datasets are too small to train a good deep learning (DL) model, so you should use a technique like XGBoost instead.

The myth is just that: a myth. Jeremy Howard from fast.ai now uses deep learning to tackle ~90% of problems with tabular data. It is still worth building a neural net and comparing it against another algorithm such as a random forest, which is easy to build. You can play around with both and then choose whichever performs better. Just don't rule out neural nets because someone on Reddit told you they don't work.

One of the reasons people haven't been using deep learning with tabular data is that no framework made it easy to build DL models for it. Until now. The fast.ai framework has a tabular module that lets us quickly build models with tabular data.

For this notebook, we'll assume that the data has already been cleaned and fits in a pandas `DataFrame`.

In [0]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [0]:
from fastai import *            # For quick access to most common functionality
from fastai.tabular import *    # For quick access to tabular functionality

In [3]:
# fastai.tabular expects the data to be in a pandas DataFrame.

path = untar_data(URLs.ADULT_SAMPLE) # simple dataset for testing
df = pd.read_csv(path/'adult.csv')
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,49,Private,101320,Assoc-acdm,12.0,Married-civ-spouse,,Wife,White,Female,0,1902,40,United-States,>=50k
1,44,Private,236746,Masters,14.0,Divorced,Exec-managerial,Not-in-family,White,Male,10520,0,45,United-States,>=50k
2,38,Private,96185,HS-grad,,Divorced,,Unmarried,Black,Female,0,0,32,United-States,<50k
3,38,Self-emp-inc,112847,Prof-school,15.0,Married-civ-spouse,Prof-specialty,Husband,Asian-Pac-Islander,Male,0,0,40,United-States,>=50k
4,42,Self-emp-not-inc,82297,7th-8th,,Married-civ-spouse,Other-service,Wife,Black,Female,0,0,50,United-States,<50k


We need to separate the features into two groups: categorical variables and continuous variables.

Categorical variables are passed through embeddings so that the neural net can use them.

Continuous variables can be fed into the neural net directly, just like pixel values. We don't create embeddings for them.
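
To make the distinction concrete, here is a minimal, dependency-free sketch of how one row might be turned into a network input. The vocabulary, the embedding width of 3, and the helper name `encode_row` are illustrative assumptions, not fastai internals. In a real model the embedding vectors are learned during training rather than left at random values.

```python
import random

random.seed(0)

# Hypothetical vocabulary for one categorical column (index 0 kept for unknown/missing).
workclass_vocab = {"#na#": 0, "Private": 1, "Self-emp-inc": 2, "Self-emp-not-inc": 3}

EMB_WIDTH = 3  # assumed embedding width, chosen only for illustration

# One vector per category; random here, learned in a real model.
workclass_emb = [[random.gauss(0, 1) for _ in range(EMB_WIDTH)]
                 for _ in workclass_vocab]


def encode_row(workclass, cont_values):
    """Look up the category's embedding and append the continuous values unchanged."""
    idx = workclass_vocab.get(workclass, 0)         # category -> integer code
    return workclass_emb[idx] + list(cont_values)   # embedding vector + raw numbers


# 'age' and 'education-num' from the first row shown above (normalization omitted here).
x = encode_row("Private", [49, 12.0])
print(len(x))  # 3 embedding values + 2 continuous values
```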

In [0]:
dep_var = 'salary' # dependent variable: whether an adult earns at least $50k
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 
             'relationship', 'race'] # categorical variables
cont_names = ['age', 'fnlwgt', 'education-num'] # continuous variables
procs = [FillMissing, Categorify, Normalize] # processes for pre-processing

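Whatever pre-processing we apply to the training set must be applied identically to the validation and test sets.

`FillMissing` replaces a missing value in a continuous column with the median and adds a new binary column that says whether that row had a missing value. `Categorify` turns each categorical column into category codes, and `Normalize` rescales each continuous column to zero mean and unit variance.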
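
As a rough illustration of the two ideas above (fill with the median plus a missing-value flag, and reuse the training statistics on other sets), here is a small sketch in plain Python. The function names `fit_stats` and `transform` and the validation values are made up for this example, and the use of the sample standard deviation is an assumption. fastai's own implementation may differ in its details.

```python
from statistics import mean, median, stdev

# 'education-num' for the five rows shown above; None marks a missing value.
train_col = [12.0, 14.0, None, 15.0, None]
# Hypothetical validation values, made up for illustration.
valid_col = [None, 9.0, 13.0]


def fit_stats(col):
    """Compute fill and scaling statistics from the training column only."""
    present = [v for v in col if v is not None]
    fill = median(present)
    filled = [fill if v is None else v for v in col]
    return {"fill": fill, "mean": mean(filled), "std": stdev(filled)}


def transform(col, stats):
    """Apply the training statistics: flag missing values, fill them, then standardize."""
    is_na = [v is None for v in col]
    filled = [stats["fill"] if v is None else v for v in col]
    scaled = [(v - stats["mean"]) / stats["std"] for v in filled]
    return is_na, scaled


stats = fit_stats(train_col)                           # learned once, on the training set
train_na, train_scaled = transform(train_col, stats)
valid_na, valid_scaled = transform(valid_col, stats)   # reused, not recomputed

print(stats["fill"])  # 14.0
print(train_na)       # [False, False, True, False, True]
```

The important design choice is that `valid_col` never contributes to the median, mean, or standard deviation. Otherwise information from the validation set would leak into training.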



In [0]:
test = TabularList.from_df(df.iloc[800:1000].copy(), path=path, cat_names=cat_names, cont_names=cont_names)

In [0]:
data = (TabularList.from_df(df, path=path, cat_names=cat_names, cont_names=cont_names,
                            procs=procs)
                            .split_by_idx(list(range(800,1000))) # rows 800-999 form the validation set; the rest is used for training
                            .label_from_df(cols=dep_var)
                            .add_test(test, label=0) # adding our test set
                            .databunch())

In [7]:
data.show_batch(rows=10)

workclass,education,marital-status,occupation,relationship,race,education-num_na,age,fnlwgt,education-num,target
Private,HS-grad,Separated,Handlers-cleaners,Unmarried,Asian-Pac-Islander,False,-0.4828,-0.7108,-0.4224,<50k
Private,HS-grad,Married-civ-spouse,Sales,Husband,White,False,0.3235,0.3525,-0.4224,>=50k
State-gov,Some-college,Never-married,Prof-specialty,Not-in-family,White,False,-1.2158,-0.3051,-0.0312,<50k
Local-gov,Masters,Married-civ-spouse,Adm-clerical,Husband,White,False,0.4701,0.1587,1.5334,<50k
?,HS-grad,Married-civ-spouse,?,Husband,White,False,-1.0692,0.1266,-0.4224,<50k
?,Some-college,Never-married,?,Other-relative,White,False,-1.2891,1.7005,-0.0312,<50k
Private,Masters,Divorced,Tech-support,Not-in-family,White,False,0.6166,3.9538,1.5334,<50k
Private,Bachelors,Divorced,Priv-house-serv,Not-in-family,White,False,0.6899,0.095,1.1422,<50k
Private,9th,Separated,Machine-op-inspct,Not-in-family,White,False,0.5434,-0.6865,-1.9869,<50k
?,Some-college,Never-married,?,Own-child,White,False,-1.4357,2.3768,-0.0312,<50k


In [0]:
# layers defines the architecture: two hidden layers with 200 and 100 units
learn = tabular_learner(data, layers=[200,100], metrics=accuracy)
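
To see what `layers=[200,100]` implies, we can sketch the shapes of the fully connected layers. This is only an approximation under stated assumptions: the embedding widths below are hypothetical, the helper `linear_shapes` is not part of fastai, and the actual model also contains other layers (for example nonlinearities) that this sketch ignores. The output size of 2 corresponds to the two salary classes.

```python
def linear_shapes(emb_widths, n_cont, layers, n_out):
    """Return (in_features, out_features) for each fully connected layer."""
    n_in = sum(emb_widths) + n_cont   # embeddings are concatenated with continuous inputs
    sizes = [n_in] + list(layers) + [n_out]
    return list(zip(sizes[:-1], sizes[1:]))


# Hypothetical embedding widths for the 7 categorical inputs
# (6 named columns plus the 'education-num_na' flag added by FillMissing).
emb_widths = [6, 8, 5, 8, 5, 4, 3]

shapes = linear_shapes(emb_widths, n_cont=3, layers=[200, 100], n_out=2)
print(shapes)  # [(42, 200), (200, 100), (100, 2)]
```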

In [9]:
learn.fit(1, 1e-2) # fit for one epoch at learning rate 1e-2 to see our starting accuracy

epoch,train_loss,valid_loss,accuracy,time
0,0.370422,0.39472,0.825,00:05
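
As a quick sanity check on that number, the accuracy is the fraction of validation rows classified correctly. Since we held out rows 800-999, the arithmetic is:

```python
n_valid = len(range(800, 1000))          # rows held out by split_by_idx
accuracy = 0.825                         # from the training output above
n_correct = round(accuracy * n_valid)    # correctly classified validation rows
print(n_valid, n_correct)  # 200 165
```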


There we are! We have a deep learning model for tabular data. We trained it for one epoch to see our initial accuracy, which gives us an idea of how to move forward.

## Inference

To get a prediction for a single row, we pass it to `learn.predict`.



In [0]:
row = df.iloc[0]

In [11]:
learn.predict(row)

(Category >=50k, tensor(1), tensor([0.3606, 0.6394]))
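
The output is a tuple of the predicted category, its class index, and the probability for each class. A small sketch of how the three pieces relate, using the numbers above and assuming the class order `['<50k', '>=50k']` (which is consistent with index 1 being `>=50k`):

```python
# Values copied from the prediction output above.
probs = [0.3606, 0.6394]

# Assumed class order: index 1 maps to '>=50k', matching tensor(1) in the output.
classes = ["<50k", ">=50k"]

pred_idx = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
print(classes[pred_idx], probs[pred_idx])  # >=50k 0.6394
```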

We've covered some of the steps for building a model with tabular data in fast.ai.