Cardinality estimation by non-autoregressive (masked) model.
All experiments can be run in a Docker container.

- Docker
- GPU/CUDA environment (for training)

Dependencies are installed automatically when the Docker image is built.
```sh
# on host
git clone https://github.com/OnizukaLab/nar-cardest.git
cd nar-cardest
docker build -t cardest .
docker run --rm --gpus all -v `pwd`:/workspaces/nar-cardest -it cardest bash
```
```sh
# in container
poetry shell
```

```sh
# in poetry env in container
./scripts/download_dmv.sh
./scripts/download_imdb.sh
```
(Additional info) Alternative to the join-sampled IMDb CSV files

IMDb-related experiments require join-sampled CSV files. Use the files fetched by `./scripts/download_imdb.sh`, or produce them with the following extra process.
- Run the `job-light` experiment (re-training) of neurocard, obtain a 10M-join-sampled CSV file, and place it at `datasets/imdb-job-light.csv`.
  - To get 10M samples, `bs` and `max_steps` should be `2048` and `5120`, respectively.
  - To get a join-sampled file, the `_save_samples` option may be required.
  - Actual re-training is not necessary; only the join-sampled file is needed.
- In the same way as above, obtain a 10M-join-sampled CSV file of the full IMDb dataset and place it at `datasets/imdb.csv`.
  - In addition, the target tables and columns in `neurocard/experiments.py` need to be adjusted to match `nar-cardest/cardest/datasets.py#Loader.load_imdb`.
- Finally, rename some columns in `datasets/{imdb.csv,imdb-job-light.csv}` (see the sketch after this list):
  - Replace `(.+):(.+)` with `\1.\2`
  - Replace `__in_(.+)` with `__in__:\1`
  - Replace `__fanout_(.+)` with `__fanout__:\1`
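The renaming can be scripted; below is a minimal sketch, assuming the three rules above are meant to be applied to the column names in the CSV header line. The per-field character classes (`[^,:]+` / `[^,]+` instead of `.+`) and the in-place `sed` invocation are my adaptation, so verify the resulting header against `cardest/datasets.py#Loader.load_imdb` before training.

```sh
# Sketch: apply the three renaming rules to the header (first) line only.
# The regex adaptations and file paths are assumptions; check the result manually.
for f in datasets/imdb.csv datasets/imdb-job-light.csv; do
  sed -i -E \
    -e '1s/([^,:]+):([^,:]+)/\1.\2/g' \
    -e '1s/__in_([^,]+)/__in__:\1/g' \
    -e '1s/__fanout_([^,]+)/__fanout__:\1/g' \
    "$f"
done
```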
Choose between hyperparameter search with Optuna and manually specified parameters.
```sh
# train w/ hyperparameter search
python cardest/run.py --train -d=imdb -t=mlp --n-trials=10 -e=20

# train w/o hyperparameter search
python cardest/run.py --train -d=imdb -t=mlp -e=20 --d-word=64 --d-ff=256 --n-ff=4 --lr=5e-4

# evaluation
python cardest/run.py --eval -d=imdb -b=job-light -t=mlp -m=models/mlp-ur-imdb/nar-mlp-imdb-imdb-universal.pt
```
You can find the results in `results/<benchmark_name>` after each run.
- `-d/--dataset`: Dataset name
- `-t/--model-type`: Internal model type (`mlp` for MLP or `trm` for Transformer)
- `--seed`: Random seed (default: `1234`)
- `--n-blocks`: The number of blocks (for Transformer)
- `--n-heads`: The number of heads (for Transformer)
- `--d-word`: Embedding dimension
- `--d-ff`: Width of the feedforward networks
- `--n-ff`: The number of feedforward networks (for MLP)
- `--fact-threshold`: Column factorization threshold (default: `2000`)
- `--fact-bits`: Column factorization bits (default: `10`)
Training options:

- `-e/--epochs`: The number of training epochs
- `--batch-size`: Batch size (default: `1024`)

(w/ hyperparameter search)

- `--n-trials`: The number of trials for hyperparameter search

(w/ specified parameters)

- `--lr`: Learning rate
- `--warmups`: Warm-up epochs (for Transformer; `--lr` and `--warmups` are exclusive)

Evaluation options:

- `-m/--model`: Path to the model file
- `-b/--benchmark`: Benchmark name
- `--eval-sample-size`: Sample size for evaluation
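For example, a Transformer model can be trained with manually specified parameters by combining the options above. The sketch below only uses flags listed in this section; the concrete values are illustrative placeholders, not tuned settings.

```sh
# Sketch: train a Transformer (trm) on DMV with manually specified parameters.
# The hyperparameter values are placeholders, not recommended settings.
python cardest/run.py --train -d=dmv -t=trm -e=20 \
  --n-blocks=4 --n-heads=8 --d-word=64 --d-ff=256 \
  --warmups=5 --fact-threshold=3000
```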
- Datasets
  - DMV
    - `dmv`: All data of DMV
  - IMDb
    - `imdb`: (Almost) all data of IMDb
    - `imdb-job-light`: Subset of IMDb for the JOB-light benchmark
- Benchmarks
  - DMV
    - `synthetic2000-1234`: 2000 synthetic queries (random seed = 1234)
  - IMDb
    - `job-light`: 70 real-world queries
- Models
  - Model types
    - `mlp`: Masked MLP-based non-autoregressive model
    - `trm`: Masked Transformer-based non-autoregressive model
  - Model names
    - `mlp-ur-dmv`: MLP-based model for the DMV dataset (set `--fact-threshold=3000`)
    - `trm-ur-dmv`: Transformer-based model for the DMV dataset (set `--fact-threshold=3000`)
    - `mlp-ur-imdb`: MLP-based model for the IMDb dataset
    - `trm-ur-imdb`: Transformer-based model for the IMDb dataset
    - `mlp-ur-imdb-jl`: MLP-based model for the IMDb dataset (subset for the JOB-light benchmark only)
    - `trm-ur-imdb-jl`: Transformer-based model for the IMDb dataset (subset for the JOB-light benchmark only)
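Putting the names above together, a DMV workflow might look like the sketch below. The checkpoint filename is an assumption based on the naming pattern of the IMDb example earlier, so check the actual file written under `models/` after training.

```sh
# Sketch: train and evaluate an MLP-based model on DMV.
# The model list above notes --fact-threshold=3000 for the DMV models.
python cardest/run.py --train -d=dmv -t=mlp -e=20 --fact-threshold=3000 \
  --d-word=64 --d-ff=256 --n-ff=4 --lr=5e-4

# Evaluate on the synthetic DMV benchmark; the checkpoint path below is assumed
# from the IMDb example and may differ on your machine.
python cardest/run.py --eval -d=dmv -b=synthetic2000-1234 -t=mlp \
  -m=models/mlp-ur-dmv/nar-mlp-dmv-dmv-universal.pt
```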
```bibtex
@InProceedings{nar_cardest,
  author    = {Ito, Ryuichi and Xiao, Chuan and Onizuka, Makoto},
  title     = {{Robust Cardinality Estimator by Non-autoregressive Model}},
  booktitle = {Software Foundations for Data Interoperability},
  year      = {2022},
  pages     = {55--61},
  isbn      = {978-3-030-93849-9}
}
```