
TORN Layout for Distributed Joins

This is the source code of the paper TORN. TORN is mainly oriented to distributed join scenarios: it optimizes the data layout for multi-table and single-table queries and, on that basis, the co-partitioning of blocks. Its main execution process is:

a) train an MLP predictor on historical queries to predict the upcoming workload <workload_predictor.py>;

b) build adaptive partition trees based on predicted (multi-table and single-table) queries <partition_algorithm.py>;

c) form partition files (parquet) by routing data based on partition trees <data_routing.ipynb>;

d) conduct experiments using evaluation metrics <experiment.py / experiment.ipynb>;

e) execute the experiments on Spark <query_routing.ipynb>.

1. Project Structure

Directories

  • NORA_experiments includes three directories: a) images, b) queryset, c) datasets. images stores some temporary visualization files for data layouts and queries. queryset stores the workload files, which are classified as multi_join / tpcds / tpch. datasets stores the data files generated by the TPC-H/TPC-DS kits, which are not uploaded here due to their size.

  • SIM_experiments is the test environment for similar-workload prediction; it includes some test workloads for the MLP and IFDT predictors.

  • PartitionLayout stores the partition tree files generated for the tpch, tpcds, and distributed join scenarios.

Python Files

  • data_helper supports the sampling, reading, statistics, and generation of table data and workloads.

  • partition_tree.py is used to construct (hierarchical) partition trees. It can also serialize a tree into a byte file and load it back flexibly.

  • partition_node.py defines the structure of a tree node and its functionality.

  • partition_algorithm.py constructs the trees for all provided baselines.

  • join_until.py supports the computation of the hyper-join cost and provides the QDG algorithm.

  • join_depths.py supports the process of determining the best depths for different trees.

  • workload_predictor.py includes the implementations of the IFDT and MLP predictors and their evaluation process.

Jupyter Files

  • data_routing.ipynb / data_routing_for_join.ipynb use the partition tree constructed on the sampled data to write all the data into the specific Parquet partition files according to the filter criteria of each node in the tree (see the sketch after this list). The two notebooks are tested in the single-table environment and the multi-table environment respectively.

  • query_routing.ipynb / query_routing_for_join.ipynb: after building the partitioned environment on the Spark nodes, we use Spark SQL to execute the workloads in the single-table and multi-table scenarios respectively and record the execution metrics.

  • experiment.ipynb includes the experiments related to scalability.

  • experiment_join.ipynb includes some experiments related to distributed joins. It evaluates the performance of TORN and AdaptDB over multi-table queries.
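
A minimal sketch of the routing step performed in data_routing.ipynb is shown below. The PartitionTree class name, its load_tree()/get_leaves() methods, the per-node contains() check, the nid attribute, and the file paths are assumptions for illustration; the actual notebook may differ.

import pandas as pd
from partition_tree import PartitionTree   # module provided by this repository

tree = PartitionTree()
tree.load_tree("PartitionLayout/tpch/torn_tree")          # hypothetical tree file
data = pd.read_csv("NORA_experiments/datasets/lineitem.tbl", sep="|", header=None)

# Route every row to the leaf whose filter criteria it satisfies,
# then write one Parquet partition file per leaf.
for leaf in tree.get_leaves():                            # hypothetical accessor
    mask = data.apply(lambda row: leaf.contains(row), axis=1)
    data[mask].to_parquet(f"par_data/tpch/TORN/partition_{leaf.nid}.parquet")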

2. HDFS and Spark Environment

Users need to create their own Hadoop and Spark environments. The following is a template example of the paths that need to be created in the HDFS file system; an example command for creating them follows the template.

PATHS Template

  • par_data

    • tpch

      • QdTree
        • scale1
        • scale10
        • ...
      • PAW
      • TORN
      • AdaptDB
    • tpcds

    • multi_join

      • tpch
        • ...
      • tpcds
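
Assuming the layout is rooted at /par_data on HDFS (adjust the root and scales to your cluster), these directories can be created with hdfs dfs, for example:

hdfs dfs -mkdir -p /par_data/tpch/QdTree/scale1 /par_data/tpch/QdTree/scale10
hdfs dfs -mkdir -p /par_data/tpch/PAW /par_data/tpch/TORN /par_data/tpch/AdaptDB
hdfs dfs -mkdir -p /par_data/tpcds /par_data/multi_join/tpch /par_data/multi_join/tpcds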

3. Dependencies

The project's dependency packages are listed in requirements.txt and can be installed with the following command:

python3 -m pip install -r requirements.txt

4. Details on Determining the Optimal Depths of Multiple Trees

experiment.py:compare_hyper_join_with_multitable

Assigning reasonable depths to the two levels of the tree has an important impact on the quality of the join tree. We define the join depth of the top layer as $dp_j$. Once $dp_j$ is determined, the depth of the bottom layer can be calculated from the maximum depth of the leaf nodes.

TORN uses a heuristic algorithm to assign $dp_j$ to each table. The core idea is: (i) use the queries accessing each table to determine the priority weight $w_p$ of each table $T$, with $w_p(T)=\sum_{q\in Q(T)}{est\ rows}/{total\ rows}$, where $est\ rows$ is the number of rows scanned by query $q$ as estimated by the optimizer; (ii) determine the order in which tables are assigned depths according to this priority, and then compute the optimal candidate depths for each pair of adjacent joining tables in turn; (iii) run multiple iterations, determining the depth of only one table per iteration. In each iteration, if the two optimal candidate depths of a table conflict, multiple candidate schemes are generated based on the conflicting depths and their impact on the adjacent tables, and the scheme with the lowest cost is selected as the final $dp_j$ of the table.

We define $Q_{m:n}$ to denote the queries spanning tables $T_m,\dots,T_n$. $\hat{H}(T_{i-1,i})$ represents the optimal candidate depths for the adjacent tables $T_{i-1}$ and $T_i$ when no other tables are considered. $H_o(h,T_{m:n})$ represents the optimal candidate depth set for tables $T_m,\dots,T_n$ when $dp_j(T_m)=h$. The detailed process is described in the following algorithm pseudocode.
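
As a rough illustration of the heuristic (not a substitute for the pseudocode or for join_depths.py), the Python sketch below assumes hypothetical helpers candidate_depth(), hyper_join_cost(), and default_depth(), query objects exposing est_rows per table, and table objects carrying total_rows and join_neighbours:

def priority_weight(table, queries):
    # w_p(T) = sum over queries accessing T of est_rows / total_rows
    return sum(q.est_rows[table] / table.total_rows
               for q in queries if table in q.tables)

def assign_join_depths(tables, queries):
    # (ii) assign depths in descending priority order
    order = sorted(tables, key=lambda t: priority_weight(t, queries), reverse=True)
    depths = {}
    for table in order:                                   # (iii) one table per iteration
        # optimal candidate depth suggested by each adjacent joining table
        candidates = {candidate_depth(table, nbr, depths.get(nbr))
                      for nbr in table.join_neighbours}
        if len(candidates) <= 1:
            depths[table] = candidates.pop() if candidates else default_depth(table)
        else:
            # conflict: enumerate candidate schemes and keep the cheapest one
            depths[table] = min(candidates,
                                key=lambda h: hyper_join_cost(table, h, depths, queries))
    return depths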

If $n$ is too large, the number of candidate combinations of trees with different depths increases, which leads to a large time overhead. TORN uses some parallelization designs to alleviate this problem: (i) the optimal candidate depths of different adjacent table pairs, such as $\hat{H}_{AB}$ and $\hat{H}_{CD}$, or $\hat{H}_{BC}$ and $\hat{H}_{DE}$, can be calculated in parallel; (ii) by reusing pre-saved tree structures with various depths and the cost results of previously evaluated depth combinations, the time overhead of computing the hyper-join cost of new combinations is effectively reduced.
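
A minimal sketch of idea (i) follows, assuming a hypothetical compute_hyper_join_cost() routine and an upper bound MAX_DEPTH on $dp_j$; non-overlapping adjacent pairs are evaluated concurrently, and a small per-worker cache stands in for the pre-saved cost results of idea (ii):

from concurrent.futures import ProcessPoolExecutor
from itertools import product

MAX_DEPTH = 8                                # hypothetical upper bound on dp_j

def pair_cost(left, right, depths, cache={}):
    # Reuse a pre-saved cost when this depth combination was already evaluated;
    # otherwise compute the hyper-join cost of the two trees (hypothetical helper).
    key = (left, right, depths)
    if key not in cache:
        cache[key] = compute_hyper_join_cost(left, right, depths)
    return cache[key]

def best_candidate_depths(pair):
    left, right = pair
    return min(product(range(1, MAX_DEPTH + 1), repeat=2),
               key=lambda d: pair_cost(left, right, d))

if __name__ == "__main__":
    # (A,B) and (C,D) share no table, so their candidate depths can be computed in parallel;
    # the same holds for (B,C) and (D,E) in the next round.
    independent_pairs = [("A", "B"), ("C", "D")]
    with ProcessPoolExecutor() as pool:
        candidates = dict(zip(independent_pairs,
                              pool.map(best_candidate_depths, independent_pairs)))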