Scardina: Scalable Join Cardinality Estimator

Prerequisites

All experiments can be run in a Docker container.

Docker
GPU/CUDA environment (for training)

Getting Started

Setup

Dependencies are installed automatically when the Docker image is built.

# on host
git clone https://github.com/OnizukaLab/Scardina.git
cd Scardina
docker build -t scardina .
docker run --rm --gpus all -v `pwd`:/workspaces/scardina -it scardina bash

# in container
poetry shell

# in poetry env in container
./scripts/dowload_imdb.sh

Examples

Training

Train either with hyperparameter search by Optuna or with manually specified parameters.

# train w/ hyperparameter search
python scardina/run.py --train -d=imdb -t=mlp --n-trials=10 -e=20

# train w/o hyperparameter search
python scardina/run.py --train -d=imdb -t=mlp -e=20 --d-word=64 --d-ff=256 --n-ff=4 --lr=5e-4

Evaluation

# evaluation
# Note: With the default schema strategy (-s=cin), the model path should look like:
#       "models/imdb/mlp-cin/yyyyMMddHHmmss/nar-mlp-imdb-{}-yyyyMMddHHmmss.pt".
#       The "{}" is typed literally; it is a placeholder that lets one path refer to multiple models.
python scardina/run.py --eval -d=imdb -b=job-light -t=mlp -m={path/to/model.pt}

You can find the results in results/<benchmark_name> after the run.
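
The literal "{}" in the model path can be read as a template slot that is filled once per model. The sketch below only illustrates that idea with Python's str.format; it is not taken from the Scardina source. The helper name expand_model_paths and the names substituted for "{}" are hypothetical, since this README does not state what the project actually inserts there.

```python
from typing import List

# Path template as given in the evaluation note; "{}" is kept literal.
TEMPLATE = "models/imdb/mlp-cin/yyyyMMddHHmmss/nar-mlp-imdb-{}-yyyyMMddHHmmss.pt"


def expand_model_paths(template: str, names: List[str]) -> List[str]:
    """Fill the single "{}" slot once per name (hypothetical helper)."""
    return [template.format(name) for name in names]


# Assumption: one model file per name; these names are made up for illustration.
for path in expand_model_paths(TEMPLATE, ["part_a", "part_b"]):
    print(path)
```

When passing such a path on the command line, keep the "{}" as it is rather than replacing it by hand.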

Options

Common Options

-d/--dataset: Dataset name
-t/--model-type: Internal model type (mlp for MLP or trm for Transformer)
-s/--schema-strategy: Internal subschema type (cin for Closed In-neighborhood Partitioning (Scardina) or ur for Universal Relation)
--seed: Random seed (Default: 1234)
--n-blocks: The number of blocks (for Transformer)
--n-heads: The number of heads (for Transformer)
--d-word: Embedding dimension
--d-ff: Width of feedforward networks
--n-ff: The number of feedforward networks (for MLP)
--fact-threshold: Column factorization threshold (Default: 2000)

Options for Training

-e/--epochs: Number of training epochs
--batch-size: Batch size (Default: 1024)

(w/ hyperparameter search)

--n-trials: The number of trials for hyperparameter search

(w/ specified parameters)

--lr: Learning rate
--warmups: Number of warm-up epochs (for Transformer) (--lr and --warmups are mutually exclusive)
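
Since the learning rate and the warm-up setting cannot be given together, the constraint can be pictured with a small argparse sketch. This is only an illustration under assumptions, not the parser used in scardina/run.py: the function name build_parser and the argument types (float for --lr, int for --warmups) are guesses.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser; not the project's actual implementation."""
    parser = argparse.ArgumentParser(description="training options (sketch)")
    parser.add_argument("-e", "--epochs", type=int)

    # argparse rejects any command line that sets both options in this group.
    schedule = parser.add_mutually_exclusive_group()
    schedule.add_argument("--lr", type=float)      # assumed type
    schedule.add_argument("--warmups", type=int)   # assumed type
    return parser


args = build_parser().parse_args(["-e", "20", "--lr", "5e-4"])
print(args.epochs, args.lr, args.warmups)
```

With this layout, passing both --lr and --warmups makes argparse exit with a usage error instead of silently preferring one of them.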

Options for Evaluation

-m/--model: Path to model
-b/--benchmark: Benchmark name
--eval-sample-size: Sample size for evaluation

Choices

Datasets
- IMDb
  - imdb: (Almost) all IMDb data
  - imdb-job-light: Subset of IMDb for JOB-light benchmark
Benchmarks
- IMDb
  - job-light: 70 real-world queries
  - job-m: 113 real-world queries
  - job-light_subqueries: 70 real-world queries for evaluating P-Error (requires a DB)
  - job-m_subqueries: 113 real-world queries for evaluating P-Error (requires a DB)
Models
- mlp: MLP-based denoising autoencoder
- trm: Transformer-based denoising autoencoder

Reference

@article{scardina,
    author = {Ito, Ryuichi and Sasaki, Yuya and Xiao, Chuan and Onizuka, Makoto},
    title = {{Scardina: Scalable Join Cardinality Estimation by Multiple Density Estimators}},
    journal = {{arXiv preprint arXiv:2303.18042}},
    year = {2023}
}

Acknowledgement

Some of the source code is based on naru/neurocard.
