# Inference on an upcoming dataset

In this part, we simulate a real deployment of the package and run inference on an upcoming dataset. We use the Adult dataset from the UCI Machine Learning Repository, which ships with a separate testing set.

## Training models

Similar to the first example, we initialize a `Trainer` and model bases, then train all models.

In [1]:
import torch
from tabensemb.trainer import Trainer
from tabensemb.model import *
import tabensemb
from tabensemb.config import UserConfig
import os
from tempfile import TemporaryDirectory

temp_path = TemporaryDirectory()
tabensemb.setting["default_output_path"] = os.path.join(temp_path.name, "output")
tabensemb.setting["default_config_path"] = os.path.join(temp_path.name, "configs")
tabensemb.setting["default_data_path"] = os.path.join(temp_path.name, "data")

device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using {} device".format(device))

trainer = Trainer(device=device)
adult_columns = [
    "age",
    "workclass",
    "fnlwgt",
    "education",
    "education-num",
    "marital-status",
    "occupation",
    "relationship",
    "race",
    "sex",
    "capital-gain",
    "capital-loss",
    "hours-per-week",
    "native-country",
    "income",
]
cfg = UserConfig.from_uci("Adult", column_names=adult_columns, sep=", ")
trainer.load_config(cfg)
trainer.load_data()
models = [
    PytorchTabular(trainer, model_subset=["Category Embedding"]),
    WideDeep(trainer, model_subset=["TabMlp"]),
    AutoGluon(trainer, model_subset=["Random Forest"]),
]
trainer.add_modelbases(models)
trainer.train(stderr_to_stdout=True)
trainer.get_leaderboard()

Using cuda device
Downloading https://archive.ics.uci.edu/static/public/2/adult.zip to /tmp/tmpgo78fihw/data/Adult.zip




age is Integer and will be treated as a continuous feature.
fnlwgt is Integer and will be treated as a continuous feature.
education-num is Integer and will be treated as a continuous feature.
capital-gain is Integer and will be treated as a continuous feature.
capital-loss is Integer and will be treated as a continuous feature.
hours-per-week is Integer and will be treated as a continuous feature.
The project will be saved to /tmp/tmpgo78fihw/output/adult/2023-09-12-11-07-38-0_UserInputConfig
Dataset size: 19536 6512 6513
Data saved to /tmp/tmpgo78fihw/output/adult/2023-09-12-11-07-38-0_UserInputConfig (data.csv and tabular_data.csv).

-------------Run PytorchTabular-------------

Training Category Embedding
Global seed set to 42
2023-09-12 11:07:39,698 - {pytorch_tabular.tabular_model:473} - INFO - Preparing the DataLoaders
2023-09-12 11:07:39,699 - {pytorch_tabular.tabular_datamodule:290} - INFO - Setting up the datamodule for classification task
2023-09-12 11:07:39,748 - {pytorch_t

| | Program | Model | Training F1_SCORE | Training PRECISION_SCORE | Training RECALL_SCORE | Training JACCARD_SCORE | Training ACCURACY_SCORE | Training BALANCED_ACCURACY_SCORE | Training COHEN_KAPPA_SCORE | Training HAMMING_LOSS | ... | Validation ACCURACY_SCORE | Validation BALANCED_ACCURACY_SCORE | Validation COHEN_KAPPA_SCORE | Validation HAMMING_LOSS | Validation MATTHEWS_CORRCOEF | Validation ZERO_ONE_LOSS | Validation ROC_AUC_SCORE | Validation LOG_LOSS | Validation BRIER_SCORE_LOSS | Validation AVERAGE_PRECISION_SCORE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | WideDeep | TabMlp | 0.6942 | 0.728505 | 0.662981 | 0.531628 | 0.859388 | 0.792321 | 0.603167 | 0.140612 | ... | 0.852426 | 0.784474 | 0.584884 | 0.147574 | 0.585738 | 0.147574 | 0.908951 | 0.317288 | 0.101612 | 0.86842 |
| 1 | AutoGluon | Random Forest | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | ... | 0.853808 | 0.776665 | 0.580404 | 0.146192 | 0.583003 | 0.146192 | 0.90701 | 0.318016 | 0.100486 | 0.875084 |
| 2 | PytorchTabular | Category Embedding | 0.709806 | 0.738341 | 0.683394 | 0.550154 | 0.865479 | 0.803303 | 0.622423 | 0.134521 | ... | 0.85043 | 0.784467 | 0.581612 | 0.14957 | 0.58215 | 0.14957 | 0.909318 | 0.316194 | 0.101722 | 0.86841 |


## Selecting and storing a model

From the leaderboard, we can compare the models and select one for deployment. Suppose we choose `Random Forest` from `AutoGluon`. We detach this model from the heavy `trainer`; the detached trainer is saved locally in a separate directory.

In [2]:
trainer_of_one_model = trainer.detach_model(program="AutoGluon", model_name="Random Forest")

Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpgo78fihw/output/adult/2023-09-12-11-07-38-0_UserInputConfig-I1/trainer.pkl')


The detached trainer now has only one model base.

In [3]:
# Model bases of the detached trainer
trainer_of_one_model.modelbases

[<tabensemb.model.autogluon.AutoGluon at 0x7feb58588c10>]

In [4]:
# The model in the model base
trainer_of_one_model.get_modelbase("AutoGluon_Random Forest").model["Random Forest"]

('Random Forest',
 <autogluon.tabular.predictor.predictor.TabularPredictor at 0x7feb58400be0>)

## Loading the model

Now the `Trainer` containing a single model is stored in a separate directory. Assume that we want to load this trainer in a separate script for inference. In the following cell, the `path` argument of `load_trainer` is the path to `trainer.pkl`, which is printed when a model is detached or when the model bases are trained. Here we simply build it from the project directory of the detached trainer `trainer_of_one_model`.

**Remark**: You can move the directory anywhere else (including to another device, provided the package version and the environment are consistent) and rename the folder. `tabensemb` automatically reconfigures the paths.

In [5]:
from tabensemb.trainer import load_trainer

trainer = load_trainer(path=os.path.join(trainer_of_one_model.project_root, "trainer.pkl"))

In [6]:
trainer.get_modelbase("AutoGluon_Random Forest").model["Random Forest"]

('Random Forest',
 <autogluon.tabular.predictor.predictor.TabularPredictor at 0x7feb58401330>)

## Inference

Assume that we have a new `DataFrame` representing an upcoming dataset. For demonstration, we use the testing set here. In `trainer.df`, the classification target is already ordinal-encoded by `trainer.datamodule.label_ordinal_encoder`.

In [7]:
df = trainer.df.loc[trainer.test_indices, :]
truth = trainer.df.loc[trainer.test_indices, trainer.label_name].values.flatten()
truth

array([0, 0, 1, ..., 1, 0, 0])
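
To make the encoding concrete, here is a minimal pure-Python sketch of what an ordinal encoder does for a two-class target. It assumes that categories are sorted alphabetically, so that `<=50K` maps to 0 and `>50K` maps to 1; the helper names `fit_categories`, `encode`, and `decode` are hypothetical and are not part of `tabensemb`.

```python
def fit_categories(labels):
    """Return the sorted unique labels; a label's position is its integer code."""
    return sorted(set(labels))


def encode(labels, categories):
    """Map each label to its integer code."""
    index = {cat: code for code, cat in enumerate(categories)}
    return [index[label] for label in labels]


def decode(codes, categories):
    """Map each integer code back to its label (the inverse transform)."""
    return [categories[code] for code in codes]


# Made-up labels for illustration only.
raw = ["<=50K", "<=50K", ">50K", ">50K", "<=50K"]
categories = fit_categories(raw)
codes = encode(raw, categories)

print(categories)  # ['<=50K', '>50K']
print(codes)       # [0, 0, 1, 1, 0]
assert decode(codes, categories) == raw
```

The round trip in the last line mirrors the pairing of a transform with its inverse transform.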

Use the model base's `predict` method to run inference. The returned result is an array of predicted class labels in their original (un-encoded) form.

In [8]:
import pandas as pd

result = trainer.get_modelbase("AutoGluon_Random Forest").predict(df, model_name="Random Forest")
result

array([['<=50K'],
       ['<=50K'],
       ['>50K'],
       ...,
       ['>50K'],
       ['<=50K'],
       ['<=50K']], dtype=object)

The predictions must be ordinal-encoded before metrics can be calculated. We provide `DataModule.label_categories_transform` for this (and `DataModule.label_categories_inverse_transform` for the inverse transform). After encoding, the F1 score on the "new" (testing) dataset is the same as the `Testing F1_SCORE` recorded in the leaderboard.

`auto_metric_sklearn` calculates a metric from `sklearn.metrics` selected by name, which is particularly convenient for classification tasks.

In [9]:
from tabensemb.utils import auto_metric_sklearn

encoded_result = trainer.datamodule.label_categories_transform(pd.DataFrame(result, columns=trainer.label_name)).values
# You can also use trainer.datamodule.label_ordinal_encoder.transform(result) to get the same result.
auto_metric_sklearn(truth, encoded_result, "f1_score", "binary"), trainer.leaderboard.loc[trainer.leaderboard["Model"]=="Random Forest", "Testing F1_SCORE"].values

(0.6936416184971098, array([0.69364162]))
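
For reference, the binary F1 score is the harmonic mean of precision and recall on the positive class (assumed here to be the class encoded as 1). The following pure-Python sketch illustrates that definition on made-up labels; `binary_f1` is a hypothetical helper and is not the implementation used by `auto_metric_sklearn` or scikit-learn.

```python
def binary_f1(y_true, y_pred, positive=1):
    """F1 = 2 * precision * recall / (precision + recall) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up encoded labels: 2 true positives, 1 false positive, 1 false negative.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
f1 = binary_f1(y_true, y_pred)
print(round(f1, 4))  # 0.6667
```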

## Inference on the individual testing set

When loading from UCI datasets, `UserConfig.from_uci` detects that a separate testing dataset exists and therefore keeps the downloaded .zip file. We can load the archive using `zipfile`.

In [10]:
import zipfile

zipf = zipfile.ZipFile(os.path.join(tabensemb.setting["default_data_path"], "Adult.zip"))
zipf.namelist()

['Index', 'adult.data', 'adult.names', 'adult.test', 'old.adult.names']

Now check the content of `adult.test`. It is a CSV-like file with the same layout as `adult.data`, except for an extra line at the top and an extra "." at the end of each record.

In [11]:
file = zipf.read("adult.test").decode()
print(file[:500])

|1x3 Cross validator
25, Private, 226802, 11th, 7, Never-married, Machine-op-inspct, Own-child, Black, Male, 0, 0, 40, United-States, <=50K.
38, Private, 89814, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 50, United-States, <=50K.
28, Local-gov, 336951, Assoc-acdm, 12, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, >50K.
44, Private, 160323, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, Black, Male, 7688, 0


Before parsing, we strip the extra first line and the trailing periods. We then use the `str_to_dataframe` function, which converts the string into a `DataFrame` and replaces illegal values with `np.nan`.

In [12]:
from tabensemb.utils import str_to_dataframe

file = file.replace("|1x3 Cross validator\n","").replace(".\n", "\n")
testing_df = str_to_dataframe(file, sep=", ", names=trainer.df.columns, check_nan_on=trainer.cont_feature_names)
testing_df



| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | Private | 226802 | 11th | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 1 | 38 | Private | 89814 | HS-grad | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |
| 2 | 28 | Local-gov | 336951 | Assoc-acdm | 12 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
| 3 | 44 | Private | 160323 | Some-college | 10 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States | >50K |
| 4 | 18 | ? | 103497 | Some-college | 10 | Never-married | ? | Own-child | White | Female | 0 | 0 | 30 | United-States | <=50K |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 16276 | 39 | Private | 215419 | Bachelors | 13 | Divorced | Prof-specialty | Not-in-family | White | Female | 0 | 0 | 36 | United-States | <=50K |
| 16277 | 64 | ? | 321403 | HS-grad | 9 | Widowed | ? | Other-relative | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 16278 | 38 | Private | 374983 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |
| 16279 | 44 | Private | 83891 | Bachelors | 13 | Divorced | Adm-clerical | Own-child | Asian-Pac-Islander | Male | 5455 | 0 | 40 | United-States | <=50K |
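
The chained `replace` calls in cell 12 rely on the exact text of the first line and on every record ending with ".\n". If that cannot be guaranteed for an upcoming file, a line-by-line cleanup may be more forgiving. The sketch below uses only the standard library; `clean_adult_test` is a hypothetical helper (not part of `tabensemb`), and the sample text is copied from the `adult.test` preview printed earlier.

```python
import csv
import io


def clean_adult_test(text):
    """Drop comment and blank lines, and strip the trailing '.' of each record."""
    cleaned = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("|"):
            continue  # e.g. the "|1x3 Cross validator" line, or an empty line
        cleaned.append(line[:-1] if line.endswith(".") else line)
    return "\n".join(cleaned)


sample = (
    "|1x3 Cross validator\n"
    "25, Private, 226802, 11th, 7, Never-married, Machine-op-inspct, "
    "Own-child, Black, Male, 0, 0, 40, United-States, <=50K.\n"
    "28, Local-gov, 336951, Assoc-acdm, 12, Married-civ-spouse, Protective-serv, "
    "Husband, White, Male, 0, 0, 40, United-States, >50K.\n"
    "\n"
)

# skipinitialspace=True handles the ", " separator used by the Adult files.
rows = list(csv.reader(io.StringIO(clean_adult_test(sample)), skipinitialspace=True))
print(len(rows), len(rows[0]), rows[0][-1], rows[1][-1])  # 2 15 <=50K >50K
```

The cleaned string returned by `clean_adult_test` could then be passed to `str_to_dataframe` in place of the manually edited `file`.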


Inference works the same way as before.

In [13]:
result = trainer.get_modelbase("AutoGluon_Random Forest").predict(testing_df, model_name="Random Forest")
result

array([['<=50K'],
       ['<=50K'],
       ['<=50K'],
       ...,
       ['>50K'],
       ['<=50K'],
       ['>50K']], dtype=object)

Both the truth and the predictions need to be ordinal-encoded before metrics can be calculated.

In [14]:
encoded_truth = trainer.datamodule.label_categories_transform(testing_df[trainer.label_name]).values
encoded_result = trainer.datamodule.label_categories_transform(pd.DataFrame(result, columns=trainer.label_name)).values
auto_metric_sklearn(encoded_truth, encoded_result, "f1_score", "binary"), auto_metric_sklearn(encoded_truth, encoded_result, "roc_auc_score", "binary")

(0.6690190543401553, 0.7731706276694975)
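
Note that `encoded_result` holds hard class labels rather than predicted probabilities. Assuming these labels reach the ROC AUC computation unchanged (an assumption about `auto_metric_sklearn`, whose internals are not shown here), the ROC curve has a single interior operating point and the area under it reduces to the balanced accuracy, (TPR + TNR) / 2. This would explain why the value above is close to the validation balanced accuracy of `Random Forest` in the leaderboard (0.776665) and well below its validation ROC AUC (0.90701). The pure-Python sketch below checks the identity on made-up labels using the rank-based (Mann-Whitney) form of AUC; both helper functions are hypothetical.

```python
def rank_auc(y_true, scores):
    """AUC as P(score_pos > score_neg) + 0.5 * P(score_pos == score_neg)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def balanced_accuracy(y_true, y_pred):
    """Mean of the true positive rate and the true negative rate."""
    pairs = list(zip(y_true, y_pred))
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = sum(1 for t in y_true if t == 0)
    tpr = sum(1 for t, p in pairs if t == 1 and p == 1) / n_pos
    tnr = sum(1 for t, p in pairs if t == 0 and p == 0) / n_neg
    return (tpr + tnr) / 2


# Made-up encoded labels: TPR = 2/3 and TNR = 4/5.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

auc = rank_auc(y_true, y_pred)
bal_acc = balanced_accuracy(y_true, y_pred)
print(round(auc, 4), round(bal_acc, 4))  # 0.7333 0.7333
```

To obtain a threshold-independent ROC AUC, the metric would need predicted probabilities instead of hard labels.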