
In [1]:
from fastai.tabular.all import *

# Get Data

Using the Adult dataset to predict whether a person makes more or less than 50k/year.

In [2]:
path = untar_data(URLs.ADULT_SAMPLE)
path.ls()

(#3) [Path('/root/.fastai/data/adult_sample/adult.csv'),Path('/root/.fastai/data/adult_sample/models'),Path('/root/.fastai/data/adult_sample/export.pkl')]

In [3]:
df = pd.read_csv(path/'adult.csv')
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,49,Private,101320,Assoc-acdm,12.0,Married-civ-spouse,,Wife,White,Female,0,1902,40,United-States,>=50k
1,44,Private,236746,Masters,14.0,Divorced,Exec-managerial,Not-in-family,White,Male,10520,0,45,United-States,>=50k
2,38,Private,96185,HS-grad,,Divorced,,Unmarried,Black,Female,0,0,32,United-States,<50k
3,38,Self-emp-inc,112847,Prof-school,15.0,Married-civ-spouse,Prof-specialty,Husband,Asian-Pac-Islander,Male,0,0,40,United-States,>=50k
4,42,Self-emp-not-inc,82297,7th-8th,,Married-civ-spouse,Other-service,Wife,Black,Female,0,0,50,United-States,<50k


# Dataloaders

## Factory Method

Some columns are continuous: they can be treated as floating-point numbers and fed directly to the model. (cont_names)

Others are categorical: each value must be converted to a unique integer index, which is then fed into the embedding layers. (cat_names)
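
To make the difference between the two column types concrete, here is a minimal plain-Python sketch of an embedding lookup. It is an illustration only, not fastai's model code: the sizes, the random table, and the names `embedding`, `cat_idx`, and `cont_values` are all assumptions made up for this example.

```python
import random

rng = random.Random(0)
n_categories, emb_dim = 4, 3  # hypothetical sizes chosen for the example

# One vector per category index (random here; learned during training in practice).
embedding = [[rng.uniform(-1, 1) for _ in range(emb_dim)]
             for _ in range(n_categories)]

cat_idx = 2                  # integer index standing in for one categorical value
cont_values = [0.91, -0.42]  # continuous features are already numeric

# The categorical value becomes a vector by indexing the table; the
# continuous values are appended unchanged to form the model input.
model_input = embedding[cat_idx] + cont_values
print(len(model_input))
```

The point of the sketch is that a categorical value never reaches the model as a raw string or a bare integer: the integer only selects a row of the table.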

Pre-processors used:
*   Categorify: builds a map from each unique category of a categorical variable to a unique integer, then replaces the categorical values with those integers
*   FillMissing: fills missing values in continuous variables with the median of the existing values (a specific fill value can be chosen instead) and adds a column such as education-num_na that flags the rows that were filled
*   Normalize: normalizes continuous variables (subtracts the mean and divides by the standard deviation)
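
The three steps above can be sketched in plain Python on two toy columns. This is a conceptual illustration under stated assumptions, not fastai's implementation: the values are invented, index 0 is assumed to be reserved for missing categories (shown as `#na#`), and the population standard deviation is used.

```python
from statistics import mean, median, pstdev

# Toy columns (hypothetical values, not taken from the Adult dataset).
workclass = ["Private", "State-gov", None, "Private"]
education_num = [12.0, None, 9.0, 15.0]

# Categorify: map each category to an integer, reserving 0 for missing values.
vocab = ["#na#"] + sorted({v for v in workclass if v is not None})
cat_to_idx = {cat: i for i, cat in enumerate(vocab)}
workclass_idx = [cat_to_idx.get(v, 0) for v in workclass]

# FillMissing: record which rows were missing, then fill them with the median.
education_num_na = [v is None for v in education_num]
fill_value = median(v for v in education_num if v is not None)
education_num_filled = [fill_value if v is None else v for v in education_num]

# Normalize: subtract the mean and divide by the standard deviation.
mu = mean(education_num_filled)
sigma = pstdev(education_num_filled)
education_num_norm = [(v - mu) / sigma for v in education_num_filled]

print(workclass_idx, education_num_na, education_num_norm)
```

After these steps every column is numeric: the categorical column holds indices, and the continuous column has zero mean and unit standard deviation.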





In [4]:
dls = TabularDataLoaders.from_csv(path/'adult.csv', path=path, y_names="salary",
    cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
    cont_names = ['age', 'fnlwgt', 'education-num'],
    procs = [Categorify, FillMissing, Normalize])

## Using TabularPandas to Preprocess

Rewriting the same pipeline with fastai's TabularPandas class exposes what the factory method does under the hood. By default, the TabularDataLoaders factory methods hold out a random 20% of the rows for validation, so the same split is made explicit here.

In [5]:
splits = RandomSplitter(valid_pct=0.2)(range_of(df))
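
As a rough picture of what a random split yields, the sketch below shuffles the row indices and cuts off the first 20% for validation. `random_split` is a hypothetical helper written for this note, not a fastai function, and the `seed` argument is an assumption added to make the example repeatable.

```python
import random

def random_split(n_items, valid_pct=0.2, seed=None):
    """Return (train_idxs, valid_idxs) from a shuffled range of n_items."""
    rng = random.Random(seed)
    idxs = list(range(n_items))
    rng.shuffle(idxs)
    cut = int(valid_pct * n_items)  # number of validation rows
    return idxs[cut:], idxs[:cut]

train_idxs, valid_idxs = random_split(10, valid_pct=0.2, seed=42)
print(len(train_idxs), len(valid_idxs))
```

The result is a pair of index lists that never overlap, which is the shape of value the splits argument expects.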

In [6]:
to = TabularPandas(df, procs=[Categorify, FillMissing, Normalize],
                   cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
                   cont_names = ['age', 'fnlwgt', 'education-num'],
                   y_names='salary',
                   splits=splits)

The data has now been completely preprocessed:

In [7]:
to.xs.iloc[2]

workclass            7.000000
education           10.000000
marital-status       3.000000
occupation          13.000000
relationship         1.000000
race                 5.000000
education-num_na     1.000000
age                  0.913280
fnlwgt              -0.419118
education-num        1.140358
Name: 28275, dtype: float64

Rebuilding the DataLoaders from the TabularPandas object:

In [9]:
dls = to.dataloaders(bs=64)

In [10]:
dls.show_batch()

Unnamed: 0,workclass,education,marital-status,occupation,relationship,race,education-num_na,age,fnlwgt,education-num,salary
0,Local-gov,HS-grad,Married-civ-spouse,Other-service,Husband,Other,False,40.0,269168.002003,9.0,<50k
1,Private,HS-grad,Never-married,Other-service,Own-child,Black,False,19.0,186096.000156,9.0,<50k
2,Self-emp-inc,Prof-school,Married-civ-spouse,Prof-specialty,Husband,White,False,36.0,242080.001182,15.0,>=50k
3,Private,Masters,Married-civ-spouse,Exec-managerial,Wife,White,False,54.0,37289.005414,14.0,>=50k
4,Private,HS-grad,Never-married,Sales,Own-child,White,False,24.000001,103064.000134,9.0,<50k
5,Private,HS-grad,Divorced,Transport-moving,Not-in-family,White,False,56.0,232138.99946,9.0,<50k
6,Private,Masters,Married-civ-spouse,Prof-specialty,Husband,White,False,52.0,110747.997321,14.0,>=50k
7,State-gov,Masters,Divorced,Prof-specialty,Not-in-family,White,False,56.0,67662.001374,14.0,<50k
8,Private,Some-college,Never-married,Craft-repair,Not-in-family,White,False,34.0,191291.000044,10.0,<50k
9,Private,HS-grad,Widowed,Craft-repair,Unmarried,Asian-Pac-Islander,False,49.0,135642.997158,9.0,<50k


# Learner

Using tabular_learner to define our model. fastai will try to infer the loss function from the y_names column. If the targets are already encoded as numbers (such as 0 and 1), pass y_block=CategoryBlock() explicitly when building the data so that fastai doesn't presume a regression problem.



In [11]:
learn = tabular_learner(dls, metrics=accuracy)

Using fit_one_cycle, since the model is trained from scratch rather than fine-tuned from a pretrained one.

In [12]:
learn.fit_one_cycle(1)

epoch,train_loss,valid_loss,accuracy,time
0,0.378229,0.362253,0.83277,00:06


In [13]:
learn.show_results()

Unnamed: 0,workclass,education,marital-status,occupation,relationship,race,education-num_na,age,fnlwgt,education-num,salary,salary_pred
0,5.0,13.0,3.0,11.0,6.0,5.0,1.0,-0.185672,1.860671,1.531159,1.0,1.0
1,5.0,13.0,5.0,11.0,2.0,5.0,1.0,-0.478725,-0.247225,1.531159,0.0,0.0
2,5.0,9.0,3.0,4.0,1.0,5.0,1.0,0.546963,-0.232732,0.358756,0.0,1.0
3,5.0,12.0,5.0,15.0,2.0,5.0,1.0,-0.625252,-1.073833,-0.422847,1.0,0.0
4,5.0,12.0,5.0,0.0,2.0,5.0,1.0,-0.258935,-0.334083,-0.422847,0.0,0.0
5,5.0,12.0,3.0,4.0,1.0,5.0,1.0,-0.039145,-0.280942,-0.422847,0.0,0.0
6,1.0,7.0,7.0,1.0,2.0,5.0,1.0,1.572652,-0.788761,-1.986051,0.0,0.0
7,7.0,2.0,3.0,8.0,1.0,5.0,1.0,-0.332199,0.944332,-1.204449,0.0,0.0
8,5.0,12.0,5.0,14.0,2.0,5.0,1.0,-0.405462,0.566299,-0.422847,0.0,0.0
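
The accuracy metric is the fraction of rows where salary_pred equals salary. As a small sanity check, the sketch below recomputes it by hand for the nine rows displayed above. It is plain Python, not fastai's metric code, and the variable names are chosen only for this example.

```python
# (salary, salary_pred) values copied from the nine rows displayed above.
targets = [1, 0, 0, 1, 0, 0, 0, 0, 0]
preds   = [1, 0, 1, 0, 0, 0, 0, 0, 0]

# Count the rows where the prediction matches the target.
correct = sum(t == p for t, p in zip(targets, preds))
acc = correct / len(targets)
print(f"{correct}/{len(targets)} correct, accuracy = {acc:.3f}")
```

Seven of the nine rows match. A sample this small is noisy, so it is no surprise that the figure differs from the validation accuracy reported during training.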


Showing the prediction for a specific row:

In [14]:
row, clas, probs = learn.predict(df.iloc[0])

In [15]:
row.show()

Unnamed: 0,workclass,education,marital-status,occupation,relationship,race,education-num_na,age,fnlwgt,education-num,salary
0,Private,Assoc-acdm,Married-civ-spouse,#na#,Wife,White,False,49.0,101320.001935,12.0,>=50k


In [16]:
clas, probs

(tensor(1), tensor([0.3653, 0.6347]))
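
The two returned values are related: the class index is the position of the largest probability. The sketch below decodes it back to a label in plain Python. The class order in `vocab` is an assumption inferred from the outputs above, where index 1 corresponds to the row displayed as >=50k.

```python
# Assumed class order: index 0 is '<50k', index 1 is '>=50k'.
vocab = ["<50k", ">=50k"]
probs = [0.3653, 0.6347]  # probabilities printed by the cell above

# The predicted class is the index with the highest probability.
pred_idx = max(range(len(probs)), key=probs.__getitem__)
pred_label = vocab[pred_idx]
print(pred_idx, pred_label)
```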

To get predictions on a new dataframe, use the test_dl method of the DataLoaders. The new dataframe doesn't need to contain the dependent variable.

In [17]:
test_df = df.copy()
test_df.drop(['salary'], axis=1, inplace=True)
dl = learn.dls.test_dl(test_df)

In [18]:
learn.get_preds(dl=dl)

(tensor([[0.3653, 0.6347],
         [0.5677, 0.4323],
         [0.9243, 0.0757],
         ...,
         [0.5169, 0.4831],
         [0.7034, 0.2966],
         [0.6972, 0.3028]]),
 None)
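
get_preds returns a pair of predictions and targets. The second element is None here because test_df has no salary column. To turn the probabilities into labels, one option is to threshold the second column, as in the sketch below. The 0.5 cut-off and the column order (P(<50k) first, P(>=50k) second) are assumptions, and only the six rows visible in the output above are used.

```python
# Visible rows of the probability tensor printed above: [P(<50k), P(>=50k)].
probs = [
    [0.3653, 0.6347],
    [0.5677, 0.4323],
    [0.9243, 0.0757],
    [0.5169, 0.4831],
    [0.7034, 0.2966],
    [0.6972, 0.3028],
]

threshold = 0.5  # hypothetical cut-off on P(>=50k)
labels = [">=50k" if p_high >= threshold else "<50k" for _, p_high in probs]
print(labels)
```

With two classes and a 0.5 threshold this gives the same labels as taking the larger probability in each row. Raising or lowering the threshold trades one kind of error for the other.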