NOTE: ant-xgboost on SQLFlow has moved to backup_antxgboost_work branch
This is a design doc about why and how to support running ant-xgboost via SQLFlow as a machine learning estimator. We propose to build a lightweight Python template for xgboost on the basis of xgblauncher, an incubating xgboost wrapper in ant-xgboost.
Gradient boosting machine (GBM) is a widely used supervised machine learning method that trains a collection of weak learners, typically decision trees, in a gradual, additive, and sequential manner. Many winning solutions of data mining and machine learning challenges, such as Kaggle competitions and the KDD Cup, are based on GBM or related techniques.
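For intuition, gradient boosting builds its model additively: at stage m, a new weak learner h_m (a tree) is fit to the errors of the current ensemble F_{m-1}, typically its negative loss gradient, and added with a shrinkage (learning rate) ν:

    F_m(x) = F_{m-1}(x) + ν · h_m(x)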
There exist many GBM frameworks (implementations). Among them, xgboost is an optimized, distributed gradient boosting library designed to be highly efficient, flexible, and portable, and it is often regarded as one of the best GBM frameworks. We propose to use ant-xgboost as the backend of SQLFlow, which is consistent with xgboost at the kernel level. The reason is that ant-xgboost contains an incubating module named xgblauncher, an extendable, cloud-native, xgboost-based machine learning pipeline. Compared to the Python API provided by xgboost, it is easier to build a Python code template for launching xgboost tasks on the basis of xgblauncher.
In terms of SQLFlow users, xgboost is an alternative estimator, just like the TensorFlow Estimators. Working with xgboost is quite similar to working with TensorFlow Estimators: just change `TO TRAIN DNNClassifier` into `TO TRAIN XGBoostEstimator`. In addition, xgboost-specific parameters can be configured in the same way as TensorFlow parameters. Below is a demo of training and predicting via xgboost:
```sql
-- sample train clause
SELECT
    c1, c2, c3, c4, c5 AS class
FROM kaggle_credit_fraud_training_data
TO TRAIN XGBoostEstimator
WITH
    booster = "gbtree",
    objective = "binary:logistic",
    eval_metric = "auc",
    train_eval_ratio = 0.8
COLUMN
    c1,
    DENSE(c2, 10),
    BUCKET(c3, [0, 10, 100]),
    c4
LABEL class
INTO sqlflow_models.xgboost_model_table;
```

```sql
-- sample predict clause
SELECT
    c1, c2, c3, c4
FROM kaggle_credit_fraud_development_data
TO PREDICT kaggle_credit_fraud_development_data.class
USING sqlflow_models.xgboost_model_table;
```
Just as `codegen.go` generates TensorFlow code from the SQLFlow AST, we will add `codegen_xgboost.go`, which translates the SQLFlow AST into a Python launcher program for xgboost. xgblauncher provides `DataSource` and `ModelSource`, abstractions of custom I/O pipelines, by which we can reuse the data/model pipelines of `runtime`.
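To make the idea concrete, here is a rough sketch of what a generated launcher program could look like. Everything below is illustrative: `xgblauncher.launch` is a hypothetical entry point, and `TFDataSource`/`LocalModelSource` are the template components described later in this doc.

```python
# Illustrative codegen output for the train clause above; all names are
# placeholders, since the xgblauncher API is still incubating.
import xgblauncher  # hypothetical module path

# The objects below would be filled in by codegen_xgboost.go from the AST.
data_source = TFDataSource(rank=0, num_worker=1,
                           column_conf=column_conf,  # from the COLUMN clause
                           source_conf=db_conf)      # from the SELECT statement
model_source = LocalModelSource(source_conf=model_conf)  # from the INTO clause

params = {  # from the WITH clause
    "booster": "gbtree",
    "objective": "binary:logistic",
    "eval_metric": "auc",
}
xgblauncher.launch(params, data_source, model_source)  # hypothetical entry point
```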
The full documentation of xgblauncher will be available soon. Below is a demonstration of the `DataSource`/`ModelSource` API.
```python
from abc import abstractmethod
from typing import Iterator, List

# `configs`, `XGBoostRecord`, and `XGBoostResult` are defined elsewhere
# in xgblauncher; they are referenced here for demonstration only.


class DataSource:
    """
    DataSource API
    A handler of data reading/writing, which is compatible with both
    single-machine and distributed runtimes.
    """

    def __init__(self,
                 rank: int,
                 num_worker: int,
                 column_conf: configs.ColumnFields,
                 source_conf):
        pass

    @abstractmethod
    def read(self) -> Iterator[XGBoostRecord]:
        pass

    @abstractmethod
    def write(self, result_iter: Iterator[XGBoostResult]):
        pass


class ModelSource:
    """
    ModelSource API
    A handler by which XGBLauncher saves/loads the model (booster)
    and related information.
    """

    def __init__(self, source_conf):
        pass

    @abstractmethod
    def read_buffer(self, model_path: str) -> bytes:
        pass

    @abstractmethod
    def write_buffer(self, buf: bytes, model_path: str):
        pass

    @abstractmethod
    def read_lines(self, model_path: str) -> List[str]:
        pass

    @abstractmethod
    def write_lines(self, lines: List[str], model_path: str):
        pass
```
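As a usage illustration, a launcher built on these abstractions could train roughly as follows; the `features`/`label` fields on `XGBoostRecord` are assumptions here, not the actual record layout.

```python
# A minimal sketch (not xgblauncher's actual code) of driving the two
# abstractions; record.features / record.label are assumed field names.
import numpy as np
import xgboost as xgb


def run_train(data_source, model_source, params, num_boost_round, model_path):
    records = list(data_source.read())
    dtrain = xgb.DMatrix(
        np.array([r.features for r in records]),     # assumed field
        label=np.array([r.label for r in records]))  # assumed field
    booster = xgb.train(params, dtrain, num_boost_round=num_boost_round)
    # Persist the serialized booster through the ModelSource abstraction.
    model_source.write_buffer(bytes(booster.save_raw()), model_path)
```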
With the help of xgblauncher, we can launch xgboost from the SQLFlow AST via a lightweight Python code template and a corresponding filler. The code template roughly includes the following components:
- `TFDataSource`, which is responsible for fetching and pre-processing data via the tf.feature_columns API. Data is fetched in mini-batch style by executing a TF compute graph with mini-batch data fed by `runtime.db.db_generator` (see the sketch after this list).
- `DBDataSource`, which is responsible for writing prediction results into a specific database. The writing action can be implemented via `runtime.db.insert_values`.
- `LocalModelSource`, which is responsible for reading/writing xgboost models on the local file system.
- Configuration of template building and of the entry point of xgblauncher.
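Here is a sketch of the `TFDataSource` idea from the first bullet: it applies the feature-column transformations to mini-batches produced by a generator such as `runtime.db.db_generator`. The extra constructor arguments, the `XGBoostRecord` constructor, and TF 2.x eager execution are all assumptions.

```python
# A rough TFDataSource sketch, assuming TF 2.x eager mode and a generator
# of (feature_dict, labels) mini-batches like runtime.db.db_generator's;
# the XGBoostRecord constructor below is hypothetical.
import tensorflow as tf


class TFDataSource(DataSource):
    def __init__(self, rank, num_worker, column_conf, source_conf,
                 batch_generator, feature_columns):
        super().__init__(rank, num_worker, column_conf, source_conf)
        self._gen = batch_generator  # e.g. built on runtime.db.db_generator
        self._dense = tf.keras.layers.DenseFeatures(feature_columns)

    def read(self):
        for features, labels in self._gen():
            # Apply the tf.feature_column transformations to the batch.
            matrix = self._dense(features).numpy()
            for row, label in zip(matrix, labels):
                yield XGBoostRecord(values=row, label=label)  # hypothetical ctor
```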
Distributed training is supported in xgboost via rabit, a reliable allreduce and broadcast interface for distributed machine learning. To run a distributed xgboost job with rabit, all we need to do is set up a distributed environment.
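For reference, each worker in a rabit-based job looks roughly like the sketch below; it assumes an xgboost version that ships the `xgboost.rabit` Python module and a tracker that has already injected the rendezvous environment, and the file names are placeholders.

```python
# A minimal rabit worker sketch; gradient statistics are synchronized
# across workers via allreduce inside xgb.train.
import xgboost as xgb
from xgboost import rabit

rabit.init()
dtrain = xgb.DMatrix("train.part-%d.libsvm" % rabit.get_rank())  # per-worker shard
booster = xgb.train({"objective": "binary:logistic"}, dtrain,
                    num_boost_round=10)
if rabit.get_rank() == 0:
    booster.save_model("model.bin")  # persist from a single worker only
rabit.finalize()
```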
For now, xgboost has bindings to some popular distributed computing frameworks, such as Apache Spark, Apache Flink, and Dask. However, specific computing frameworks are not always available in production environments. So, we propose a cloud-native approach: running xgboost directly on a Kubernetes cluster.
As xgblauncher is scalable and docker-friendly, xgblauncher-based containers can be easily orchestrated by xgboost operator, a dedicated Kubernetes controller for (distributed) xgboost jobs. With the help of xgboost operator, it is easy to handle `XGBoostJob`, the Kubernetes custom resource defined by xgboost operator, via the Kubernetes API.
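For orientation, submitting an `XGBoostJob` through the Kubernetes API could look like the sketch below; the group/version string, the `xgbReplicaSpecs` layout, and the container image are best-effort assumptions based on the operator's published examples and may differ across operator versions.

```python
# An illustrative XGBoostJob submission via the official Kubernetes
# Python client; resource layout and image name are assumptions.
from kubernetes import client, config

config.load_kube_config()

pod = {"spec": {"restartPolicy": "Never",
                "containers": [{"name": "xgboostjob",
                                "image": "xgblauncher:latest"}]}}  # placeholder
job = {
    "apiVersion": "xgboostjob.kubeflow.org/v1alpha1",
    "kind": "XGBoostJob",
    "metadata": {"name": "xgblauncher-demo"},
    "spec": {"xgbReplicaSpecs": {
        "Master": {"replicas": 1, "template": pod},
        "Worker": {"replicas": 2, "template": pod},
    }},
}
client.CustomObjectsApi().create_namespaced_custom_object(
    group="xgboostjob.kubeflow.org", version="v1alpha1",
    namespace="default", plural="xgboostjobs", body=job)
```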
`XGBoostJob` building and tracking will be integrated into xgblauncher in the near future. After that, we can generate Python code with an option that decides whether to run the xgboost job locally or submit it to a remote Kubernetes cluster, as sketched below.
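A sketch of that codegen-filled dispatch; both entry points are hypothetical placeholders for the planned xgblauncher integration.

```python
# run_on_k8s is a boolean filled in by codegen_xgboost.go.
if run_on_k8s:
    # Build an XGBoostJob resource and track it on the remote cluster.
    xgblauncher.submit_xgboost_job(params, data_source, model_source)  # hypothetical
else:
    # Run the whole pipeline in the local process.
    xgblauncher.launch(params, data_source, model_source)  # hypothetical
```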