## Tabular training

In [17]:
# Install libraries on first run (pathlib ships with Python, so it is not listed)
#! pip install -q ipynb fastai pandas

To illustrate the tabular application, we will use a dataset of daily stock prices and fundamentals, where we have to predict the Future Year Change column of a stock from its price, volume, and valuation data.

In [18]:
from fastai.tabular.all import *
from pathlib import Path
import pandas as pd

## Variables

In [19]:
modelName = 'stockScreenerV4.pkl'
trainingDataName = 'stockData.csv'
trainingFolder = Path.cwd().parent / 'TrainingData'
modelFolder = Path.cwd().parent.parent / 'TrainedModels'

# Training parameters
yNames = ['Future Year Change']
catNames = ['Industry']
contNames = [
    'Open',
    'High', 
    'Low', 
    'Close', 
    'Volume', 
    'Dividends', 
    'Stock Splits', 
    'EV/EBIT', 
    'Market Cap', 
    'ROIC'
]

The training data is already on disk, so there is nothing to download. We point dataPath at the current directory and list its contents:

In [20]:
dataPath = Path()
dataPath.ls()

(#3) [Path('app.ipynb'),Path('stockFetcher.ipynb'),Path('tabular.ipynb')]

Then we can have a look at how the data is structured:

In [21]:
df = pd.read_csv(trainingFolder/trainingDataName)
df.head()

Unnamed: 0,Date,Ticker,Industry,Open,High,Low,Close,Volume,Dividends,Stock Splits,Future Year Change,EV/EBIT,Market Cap,ROIC
0,2019-01-17 00:00:00-05:00,AAPL,Consumer Electronics,36.820321,37.646511,36.595865,37.216702,119284800,0.0,0.0,1.075702,10.460258,559661000000.0,0.075524
1,2019-01-18 00:00:00-05:00,AAPL,Consumer Electronics,37.608297,37.699036,37.245346,37.445927,135004000,0.0,0.0,1.049015,10.519026,563108100000.0,0.075102
2,2019-01-22 00:00:00-05:00,AAPL,Consumer Electronics,37.348034,37.424443,36.443045,36.605419,121576000,0.0,0.0,1.103545,10.303539,550468600000.0,0.076673
3,2019-01-23 00:00:00-05:00,AAPL,Consumer Electronics,36.808383,37.044779,36.223365,36.753464,92522400,0.0,0.0,1.105162,10.341494,552694900000.0,0.076391
4,2019-01-24 00:00:00-05:00,AAPL,Consumer Electronics,36.798839,36.887188,36.232925,36.462154,101766000,0.0,0.0,1.115865,10.266809,548314200000.0,0.076947


Some of the columns are continuous (like Open or Volume) and we will treat them as float numbers we can feed our model directly. Others are categorical (like Industry) and we will convert them to a unique index that we will feed to embedding layers. We can specify our categorical and continuous column names, as well as the name of the dependent variable, in the TabularDataLoaders factory methods:

In [22]:
dls = TabularDataLoaders.from_csv(trainingFolder/trainingDataName, path=dataPath,
    y_names=yNames,
    cat_names=catNames,
    cont_names=contNames,
    procs=[Categorify, FillMissing, Normalize])

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  to[n].fillna(self.na_dict[n], inplace=True)
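The warning above quotes the line `to[n].fillna(self.na_dict[n], inplace=True)`, which appears to come from library code rather than from this notebook. As a minimal sketch of the two patterns the message recommends, here is a toy frame whose values are made up for illustration:

```python
import pandas as pd

# Toy frame; the values are illustrative only.
df = pd.DataFrame({"ROIC": [0.07, None, 0.03]})
fill_value = df["ROIC"].median()  # median ignores the missing entry

# Pattern 1: assign the filled column back instead of mutating a temporary.
df["ROIC"] = df["ROIC"].fillna(fill_value)

# Pattern 2: call fillna on the frame with a {column: value} mapping.
df2 = pd.DataFrame({"ROIC": [0.07, None, 0.03]})
df2 = df2.fillna({"ROIC": fill_value})
```

Both forms write the filled values to the original frame, which is what the chained in-place call will stop doing in pandas 3.0.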

The last argument, procs, is the list of pre-processors we apply to our data:

* Categorify is going to take every categorical variable and make a map from integer to unique categories, then replace the values by the corresponding index.
* FillMissing will fill the missing values in the continuous variables by the median of existing values (you can choose a specific value if you prefer).
* Normalize will normalize the continuous variables (subtract the mean and divide by the standard deviation).
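To make the three steps concrete, here is a rough pandas sketch of what each one does on a toy frame. It is not fastai's implementation: the values are made up, reserving index 0 for missing or unknown categories is an assumption of this sketch, and fastai's exact standard deviation convention may differ.

```python
import pandas as pd

# Toy data; values are illustrative, column names follow the notebook.
df = pd.DataFrame({
    "Industry": ["Semiconductors", "Banks - Regional", "Semiconductors"],
    "ROIC": [0.10, None, 0.30],
})

# Categorify-like step: map each category to an integer index.
# Index 0 is kept free for missing/unknown values (assumption of this sketch).
categories = sorted(df["Industry"].dropna().unique())
cat_to_idx = {c: i + 1 for i, c in enumerate(categories)}
industry_idx = df["Industry"].map(cat_to_idx).fillna(0).astype(int)

# FillMissing-like step: remember which rows were missing, then fill with the median.
roic_na = df["ROIC"].isna()
roic_filled = df["ROIC"].fillna(df["ROIC"].median())

# Normalize-like step: subtract the mean and divide by the standard deviation.
roic_norm = (roic_filled - roic_filled.mean()) / roic_filled.std()
```

The `roic_na` flag corresponds to the extra `_na` columns (such as `ROIC_na`) that show up in the processed data later in the notebook.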

To further expose what’s going on below the surface, let’s rewrite this utilizing fastai’s TabularPandas class. We will need to make one adjustment, which is defining how we want to split our data. By default the factory method above used a random 80/20 split, so we will do the same:

In [23]:
splits = RandomSplitter(valid_pct=0.2)(range_of(df))
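As a plain-Python picture of what a random 80/20 split produces, the sketch below shuffles the row indices and cuts off the first 20% for validation. The helper name `random_split` and its `seed` parameter are hypothetical and are not part of fastai.

```python
import random

def random_split(n_rows, valid_pct=0.2, seed=None):
    """Return (train_idxs, valid_idxs) for a shuffled split of range(n_rows)."""
    idxs = list(range(n_rows))
    random.Random(seed).shuffle(idxs)
    n_valid = int(n_rows * valid_pct)
    # The first n_valid shuffled indices go to validation, the rest to training.
    return idxs[n_valid:], idxs[:n_valid]

train_idxs, valid_idxs = random_split(10, valid_pct=0.2, seed=42)
```

Because the rows here are daily observations per ticker, a random split can place adjacent days of the same stock in both sets, so validation scores may look more optimistic than they would on genuinely unseen periods.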

In [24]:
to = TabularPandas(df, procs=[Categorify, FillMissing, Normalize],
    y_names=yNames,
    cat_names=catNames,
    cont_names=contNames,
    splits=splits)

Once we build our TabularPandas object, our data is completely preprocessed as seen below:

In [25]:
to.xs.iloc[:2]

Unnamed: 0,Industry,EV/EBIT_na,Market Cap_na,ROIC_na,Open,High,Low,Close,Volume,Dividends,Stock Splits,EV/EBIT,Market Cap,ROIC
98903,18,1,2,2,-0.375255,-0.376508,-0.373401,-0.374423,-0.291893,-0.060631,-0.010101,-1.476886,-0.126733,-0.109192
77562,17,1,2,2,-0.482219,-0.482998,-0.483453,-0.484464,-0.013828,-0.060631,-0.010101,-0.051678,-0.126733,-0.109192


Now we can build our DataLoaders again:

In [26]:
dls = to.dataloaders(bs=64)

The show_batch method works the same way as in every other fastai application:

In [27]:
dls.show_batch()

Unnamed: 0,Industry,EV/EBIT_na,Market Cap_na,ROIC_na,Open,High,Low,Close,Volume,Dividends,Stock Splits,EV/EBIT,Market Cap,ROIC,Future Year Change
0,Drug Manufacturers - General,False,True,True,76.23684,77.637563,75.816625,77.478821,7135600.0,1.008622e-11,-3.435425e-11,26.646767,382021100000.0,0.031807,-0.111958
1,Banks - Diversified,False,True,True,61.488627,61.585079,60.869723,61.319835,1139700.0,1.008622e-11,-3.435425e-11,-23.679138,382021100000.0,0.031807,0.099087
2,Semiconductors,False,True,True,49.978752,50.406667,49.770961,50.197929,18157000.0,1.008622e-11,-3.435425e-11,37.687031,382021100000.0,0.031807,0.659838
3,Software - Infrastructure,False,True,True,185.690002,193.270005,184.520004,193.270004,1008000.0,1.008622e-11,-3.435425e-11,29.361912,382021100000.0,0.031807,0.389041
4,Insurance - Diversified,False,True,True,43.819554,44.32165,43.664359,44.139071,467399.2,1.008622e-11,-3.435425e-11,-11.945892,382021100000.0,0.031807,0.043325
5,Insurance - Diversified,False,True,True,42.256965,42.481417,41.894386,42.136111,401499.5,1.008622e-11,-3.435425e-11,-12.181256,382021100000.0,0.031807,0.160534
6,Other Industrial Metals & Mining,False,True,True,38.939939,38.946491,38.690785,38.920269,784500.8,1.008622e-11,-3.435425e-11,6.55124,382021100000.0,0.031807,0.376146
7,Banks - Regional,False,True,True,17.468668,17.595753,17.20473,17.21451,12102200.0,1.008622e-11,-3.435425e-11,2.500351,382021100000.0,0.031807,0.166478
8,Software - Infrastructure,False,True,True,73.999997,74.162998,71.370004,71.370006,1300300.0,1.008622e-11,-3.435425e-11,28.652769,382021100000.0,0.031807,0.370183
9,Information Technology Services,False,True,True,9.34101,9.409879,9.306572,9.38405,6651300.0,1.008622e-11,-3.435425e-11,12.894661,382021100000.0,0.031807,0.030339


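We can define a model using the tabular_learner function. When we define our model, fastai will try to infer the loss function from the y_names we specified earlier. Here the dependent variable is a float column, so fastai sets the problem up as a regression.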

Note: Sometimes with tabular data, your y’s may be encoded as numbers (such as 0 and 1). In such a case you should explicitly pass y_block=CategoryBlock() to the constructor so fastai won’t presume you are doing regression.

In [28]:
learn = tabular_learner(dls, metrics=[rmse, mae])

And we can train that model with the fit_one_cycle method (the fine_tune method won’t be useful here since we don’t have a pretrained model).

In [29]:
learn.fit_one_cycle(15)

epoch,train_loss,valid_loss,_rmse,mae,time
0,0.807332,0.720741,0.848965,0.372917,00:13
1,0.630772,0.627686,0.792266,0.362155,00:13
2,0.71269,0.619117,0.78684,0.35877,00:12
3,0.508386,481.502228,21.943157,0.489854,00:13
4,0.530241,0.612499,0.782623,0.339939,00:12
5,0.577156,486.136017,22.048491,0.498871,00:13
6,0.590178,2.150076,1.466314,0.351829,00:12
7,0.503997,0.623505,0.789623,0.325684,00:12
8,0.462922,4.090142,2.02241,0.347461,00:12
9,0.444063,1.330452,1.153452,0.331914,00:12
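In the table above, `_rmse` equals the square root of `valid_loss` (for example, the square root of 0.720741 is about 0.848965), which suggests the loss is mean squared error. The sketch below uses made-up predictions to show why one very large error can blow up RMSE while MAE moves far less, a pattern that resembles epochs 3 and 5. The helper functions are illustrative and are not fastai's metrics.

```python
import math

def mse(preds, targets):
    """Mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

def mae(preds, targets):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(preds)

# First row of the table: sqrt(valid_loss) matches the reported _rmse.
rmse_epoch0 = math.sqrt(0.720741)

# Made-up data: 1000 rows, every prediction off by 0.1 ...
targets = [0.0] * 1000
good_preds = [0.1] * 1000
# ... except that in the "bad" case one prediction is off by 100.
bad_preds = [0.1] * 999 + [100.0]

rmse_good = math.sqrt(mse(good_preds, targets))
rmse_bad = math.sqrt(mse(bad_preds, targets))
mae_good = mae(good_preds, targets)
mae_bad = mae(bad_preds, targets)
```

With these numbers the single outlier multiplies RMSE by roughly 30 but only about doubles MAE, so a spike in `valid_loss` with a nearly flat `mae` points to a handful of extreme predictions rather than a uniformly worse model.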


We can then have a look at some predictions:

In [30]:
learn.show_results()

Unnamed: 0,Industry,EV/EBIT_na,Market Cap_na,ROIC_na,Open,High,Low,Close,Volume,Dividends,Stock Splits,EV/EBIT,Market Cap,ROIC,Future Year Change,Future Year Change_pred
0,14.0,1.0,2.0,2.0,-0.34908,-0.350185,-0.348109,-0.349475,-0.049325,-0.060631,-0.010101,-0.180709,-0.126733,-0.109192,0.395497,0.076626
1,21.0,1.0,1.0,1.0,0.081602,0.075797,0.071651,0.065316,0.725234,-0.060631,-0.010101,-0.172185,4.864847,0.290187,-0.479721,-0.155556
2,24.0,1.0,2.0,2.0,-0.489256,-0.489742,-0.488637,-0.489432,-0.23787,-0.060631,-0.010101,-0.891687,-0.126733,-0.109192,0.733075,0.061665
3,24.0,1.0,2.0,2.0,-0.420926,-0.422522,-0.422244,-0.424098,-0.225832,-0.060631,-0.010101,-0.705198,-0.126733,-0.109192,0.170792,-0.107931
4,35.0,1.0,2.0,2.0,5.599787,5.525294,5.583323,5.557341,-0.287765,-0.060631,-0.010101,-0.311015,-0.126733,-0.109192,-0.088251,0.528342
5,25.0,1.0,1.0,1.0,-0.424453,-0.424592,-0.424617,-0.422923,-0.258591,-0.060631,-0.010101,0.695148,-1.371529,-0.730218,0.216938,-0.006281
6,1.0,1.0,2.0,2.0,-0.198395,-0.201805,-0.201568,-0.204214,-0.24931,-0.060631,-0.010101,-0.572749,-0.126733,-0.109192,0.200831,0.117869
7,31.0,1.0,1.0,1.0,-0.364962,-0.363227,-0.362625,-0.361092,6.669866,-0.060631,-0.010101,2.473471,2.924956,-1.226261,1.895002,2.013887
8,25.0,1.0,2.0,2.0,-0.419803,-0.419675,-0.418932,-0.420913,-0.247519,-0.060631,-0.010101,-0.043183,-0.126733,-0.109192,-0.065262,-0.137245


## Export the model

In [31]:
learn.export(modelFolder / modelName)


To get predictions on a new dataframe, you can use the test_dl method of the DataLoaders. That dataframe does not need to have the dependent variable among its columns.
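The new dataframe does still need every feature column the model was trained on. A small check like the following can catch a missing column before the frame reaches test_dl. The helper `missing_feature_columns` and the one-row frame are hypothetical; the column lists repeat the ones from the Variables section.

```python
import pandas as pd

# Column lists as defined in the Variables section.
catNames = ['Industry']
contNames = ['Open', 'High', 'Low', 'Close', 'Volume', 'Dividends',
             'Stock Splits', 'EV/EBIT', 'Market Cap', 'ROIC']

def missing_feature_columns(df, cat_names, cont_names):
    """Return the feature columns the model expects but df lacks."""
    return [c for c in cat_names + cont_names if c not in df.columns]

# Hypothetical one-row frame without 'ROIC' and without the dependent variable.
new_df = pd.DataFrame([{
    'Industry': 'Consumer Electronics', 'Open': 1.0, 'High': 1.0, 'Low': 1.0,
    'Close': 1.0, 'Volume': 100, 'Dividends': 0.0, 'Stock Splits': 0.0,
    'EV/EBIT': 10.0, 'Market Cap': 1e9,
}])
missing = missing_feature_columns(new_df, catNames, contNames)
```

An empty list means the frame has all the expected features. The absence of 'Future Year Change' is not reported, since the dependent variable is not required for prediction.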

In [32]:
from ipynb.fs.full.stockFetcher import getTickerData

Static Data for AAPL:
  Total Debt: 119058997248
  Cash: 65171001344
  Shares Outstanding: 15037899776
Approximated EBIT for AAPL: 58655249203.2
Static Data for MSFT:
  Total Debt: 96838000640
  Cash: 78429003776
  Shares Outstanding: 7434880000
Approximated EBIT for MSFT: 38128499097.6
Static Data for AMZN:
  Total Debt: 158534991872
  Cash: 88050999296
  Shares Outstanding: 10515000320
Approximated EBIT for AMZN: 93019196620.8
Static Data for GOOGL:
  Total Debt: 29289000960
  Cash: 93229998080
  Shares Outstanding: 5842999808
Approximated EBIT for GOOGL: 50978852044.799995
Static Data for GOOG:
  Total Debt: 29289000960
  Cash: 93229998080
  Shares Outstanding: 5534000128
Approximated EBIT for GOOG: 50978852044.799995
Static Data for META:
  Total Debt: 49046999040
  Cash: 70899998720
  Shares Outstanding: 2180000000
Approximated EBIT for META: 23434049126.399998
Static Data for TSLA:
  Total Debt: 12782999552
  Cash: 33648001024
  Shares Outstanding: 3210060032
Approximated EBIT fo

Static Data for TSM:
  Total Debt: 968509030400
  Cash: 2167600054272
  Shares Outstanding: 5186549760
Approximated EBIT for TSM: 397706369433.6
Static Data for NIO:
  Total Debt: 33123946496
  Cash: 36268662784
  Shares Outstanding: 1941929984
Approximated EBIT for NIO: 9528649728.0
Static Data for JD:
  Total Debt: 86189998080
  Cash: 201901998080
  Shares Outstanding: 1449500032
Approximated EBIT for JD: 165783148953.6
Static Data for BIDU:
  Total Debt: 82800001024
  Cash: 150355001344
  Shares Outstanding: 285055008
Approximated EBIT for BIDU: 20226299904.0
Static Data for PDD:
  Total Debt: 9765086208
  Cash: 284927164416
  Shares Outstanding: 1388770048
Approximated EBIT for PDD: 51238949683.2
Static Data for MELI:
  Total Debt: 6339999744
  Cash: 6672999936
  Shares Outstanding: 50697400
Approximated EBIT for MELI: 2773949952.0
Static Data for SE:
  Total Debt: 4403627008
  Cash: 7913288192
  Shares Outstanding: 528812000
Approximated EBIT for SE: 2322908774.4
Static Data for N

In [44]:
predictionTarget = 'AAPL'

test_df = getTickerData(predictionTarget)

# Ensure test_df is a DataFrame (wrap a single dict of values into a one-row frame)
if isinstance(test_df, dict):
    test_df = pd.DataFrame([test_df])

dl = learn.dls.test_dl(test_df)
test_df.head()

Static Data for AAPL:
  Total Debt: 119058997248
  Cash: 65171001344
  Shares Outstanding: 15037899776
Approximated EBIT for AAPL: 58655249203.2


Unnamed: 0,Open,High,Low,Close,Volume,Dividends,Stock Splits,EV/EBIT,Market Cap,ROIC,Industry
0,232.115005,232.289993,229.490005,230.020004,40694968,0.0,0.0,59.890731,3459018000000.0,0.013191,Consumer Electronics


In [42]:
prediction = learn.get_preds(dl=dl)
print(f"Prediction for {predictionTarget}:")
print(f"{prediction[0][0][0].item() * 100:.2f}%")

Prediction for AAPL:
1330.85%


Note:
Machine learning models can’t magically understand categories they were never trained on, so the training data should cover every category you expect to see at prediction time. If your test data has missing values in columns that had none in the training data, you should address this before training.
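One way to act on this note is to compare the categories in the new data against those seen in training before predicting. The sketch below does that with plain pandas. The helper `unseen_categories` is hypothetical and the two small frames are made up for illustration.

```python
import pandas as pd

def unseen_categories(train_df, test_df, cat_names):
    """Map each categorical column to the test values absent from training."""
    result = {}
    for col in cat_names:
        known = set(train_df[col].dropna().unique())
        unseen = set(test_df[col].dropna().unique()) - known
        if unseen:
            result[col] = sorted(unseen)
    return result

# Made-up frames: 'Airlines' never appears in the training data.
train = pd.DataFrame({'Industry': ['Semiconductors', 'Banks - Regional']})
test = pd.DataFrame({'Industry': ['Semiconductors', 'Airlines']})
report = unseen_categories(train, test, ['Industry'])
```

An empty dictionary means every category in the new data was present during training. Any entry in the report marks values for which the model has no learned representation.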