executor: implement disk-based hash join #12067

Merged
SunRunAway merged 14 commits into pingcap:master from disk-join-issue11607 on Sep 24, 2019

@SunRunAway SunRunAway commented Sep 6, 2019

What problem does this PR solve?

part of #11607

What is changed and how it works?

When a query's memory usage exceeds mem-quota-query, the hash join spills its data to disk instead of holding it all in memory.

I've prepared slides that show how spilling to disk is triggered in this PR:
https://docs.google.com/presentation/d/1Sa9xNbDTPnLwnQHLKfpwksdYXWodisXPEqp-WR5Up0U/edit?usp=sharing
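
For readers who cannot open the slides, here is a rough sketch of the general idea: a memory tracker counts consumed bytes and fires an action once when the quota is crossed, and that action moves the buffered rows out of memory. This is not the code in this PR. `tracker`, `rowBuffer`, and their methods are hypothetical names, the "disk" is just a second slice, and the sketch is single-goroutine only.

```go
package main

import "fmt"

// tracker is a hypothetical, simplified memory tracker: it counts consumed
// bytes and calls onExceed the first time the total goes over the quota.
type tracker struct {
	quota    int64
	consumed int64
	fired    bool
	onExceed func()
}

// consume records n more bytes and triggers the action at most once.
func (t *tracker) consume(n int64) {
	t.consumed += n
	if t.consumed > t.quota && !t.fired {
		t.fired = true
		t.onExceed()
	}
}

// rowBuffer stands in for the container holding the join's buffered rows.
type rowBuffer struct {
	mem      *tracker
	inMemory [][]byte
	onDisk   [][]byte // placeholder for a temporary file
	spilled  bool
}

func newRowBuffer(quota int64) *rowBuffer {
	b := &rowBuffer{}
	b.mem = &tracker{quota: quota, onExceed: b.spill}
	return b
}

// spill moves everything buffered so far to "disk" and releases the memory.
func (b *rowBuffer) spill() {
	b.onDisk = append(b.onDisk, b.inMemory...)
	b.inMemory = nil
	b.spilled = true
	b.mem.consumed = 0
}

// add buffers a row in memory until the spill has happened; afterwards
// every new row goes straight to "disk".
func (b *rowBuffer) add(row []byte) {
	if b.spilled {
		b.onDisk = append(b.onDisk, row)
		return
	}
	b.inMemory = append(b.inMemory, row)
	b.mem.consume(int64(len(row)))
}

func main() {
	b := newRowBuffer(8) // quota of 8 bytes, rows are 4 bytes each
	for _, r := range []string{"aaaa", "bbbb", "cccc", "dddd"} {
		b.add([]byte(r))
		fmt.Printf("added %s: spilled=%v inMemory=%d onDisk=%d\n",
			r, b.spilled, len(b.inMemory), len(b.onDisk))
	}
}
```

With a quota of 8 bytes the third 4-byte row crosses the limit, so the first three rows move to the placeholder disk slice and the fourth is appended there directly. A quota of 1, as in the manual test below, would spill on the first row.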

Here are the benchmark results:

go test -run=^$ -bench="BenchmarkHashJoinExec|BenchmarkBuildHashTableForList" -test.benchmem -count 5 > /tmp/bench.txt && benchstat /tmp/bench.txt
name                                                                                 time/op
HashJoinExec/(rows:100000,_concurency:4,_joinKeyIdx:_[0_1],_disk:false)-12            743ms ± 1%
HashJoinExec/(rows:100000,_concurency:4,_joinKeyIdx:_[0],_disk:false)-12              175ms ± 1%
HashJoinExec/(rows:100000,_concurency:4,_joinKeyIdx:_[0],_disk:true)-12               753ms ± 4%
HashJoinExec/(rows:1000,_concurency:4,_joinKeyIdx:_[0],_disk:true)-12                20.3ms ± 9%
BuildHashTableForList/(rows:10,_concurency:4,_joinKeyIdx:_[0_1],_disk:false)-12      56.0µs ± 1%
BuildHashTableForList/(rows:10,_concurency:4,_joinKeyIdx:_[0_1],_disk:true)-12        346µs ±13%
BuildHashTableForList/(rows:10,_concurency:4,_joinKeyIdx:_[0],_disk:false)-12        4.49µs ± 4%
BuildHashTableForList/(rows:10,_concurency:4,_joinKeyIdx:_[0],_disk:true)-12          290µs ± 7%
BuildHashTableForList/(rows:100000,_concurency:4,_joinKeyIdx:_[0_1],_disk:false)-12   516ms ± 0%
BuildHashTableForList/(rows:100000,_concurency:4,_joinKeyIdx:_[0_1],_disk:true)-12    1.00s ±10%
BuildHashTableForList/(rows:100000,_concurency:4,_joinKeyIdx:_[0],_disk:false)-12    9.42ms ± 2%
BuildHashTableForList/(rows:100000,_concurency:4,_joinKeyIdx:_[0],_disk:true)-12      481ms ± 6%

name                                                                                 alloc/op
HashJoinExec/(rows:100000,_concurency:4,_joinKeyIdx:_[0_1],_disk:false)-12            304MB ± 0%
HashJoinExec/(rows:100000,_concurency:4,_joinKeyIdx:_[0],_disk:false)-12              304MB ± 0%
HashJoinExec/(rows:100000,_concurency:4,_joinKeyIdx:_[0],_disk:true)-12               889MB ± 0%
HashJoinExec/(rows:1000,_concurency:4,_joinKeyIdx:_[0],_disk:true)-12                63.9MB ± 0%
BuildHashTableForList/(rows:10,_concurency:4,_joinKeyIdx:_[0_1],_disk:false)-12      2.37kB ± 0%
BuildHashTableForList/(rows:10,_concurency:4,_joinKeyIdx:_[0_1],_disk:true)-12       28.7kB ± 3%
BuildHashTableForList/(rows:10,_concurency:4,_joinKeyIdx:_[0],_disk:false)-12        2.37kB ± 0%
BuildHashTableForList/(rows:10,_concurency:4,_joinKeyIdx:_[0],_disk:true)-12         29.5kB ± 2%
BuildHashTableForList/(rows:100000,_concurency:4,_joinKeyIdx:_[0_1],_disk:false)-12  7.13MB ± 0%
BuildHashTableForList/(rows:100000,_concurency:4,_joinKeyIdx:_[0_1],_disk:true)-12   8.09MB ± 0%
BuildHashTableForList/(rows:100000,_concurency:4,_joinKeyIdx:_[0],_disk:false)-12    7.14MB ± 0%
BuildHashTableForList/(rows:100000,_concurency:4,_joinKeyIdx:_[0],_disk:true)-12     8.07MB ± 0%

name                                                                                 allocs/op
HashJoinExec/(rows:100000,_concurency:4,_joinKeyIdx:_[0_1],_disk:false)-12             307k ± 0%
HashJoinExec/(rows:100000,_concurency:4,_joinKeyIdx:_[0],_disk:false)-12               307k ± 0%
HashJoinExec/(rows:100000,_concurency:4,_joinKeyIdx:_[0],_disk:true)-12               1.51M ± 0%
HashJoinExec/(rows:1000,_concurency:4,_joinKeyIdx:_[0],_disk:true)-12                 16.5k ± 0%
BuildHashTableForList/(rows:10,_concurency:4,_joinKeyIdx:_[0_1],_disk:false)-12        28.0 ± 0%
BuildHashTableForList/(rows:10,_concurency:4,_joinKeyIdx:_[0_1],_disk:true)-12         68.0 ± 0%
BuildHashTableForList/(rows:10,_concurency:4,_joinKeyIdx:_[0],_disk:false)-12          28.0 ± 0%
BuildHashTableForList/(rows:10,_concurency:4,_joinKeyIdx:_[0],_disk:true)-12           68.0 ± 0%
BuildHashTableForList/(rows:100000,_concurency:4,_joinKeyIdx:_[0_1],_disk:false)-12   4.80k ± 2%
BuildHashTableForList/(rows:100000,_concurency:4,_joinKeyIdx:_[0_1],_disk:true)-12    5.25k ± 1%
BuildHashTableForList/(rows:100000,_concurency:4,_joinKeyIdx:_[0],_disk:false)-12     4.82k ± 0%
BuildHashTableForList/(rows:100000,_concurency:4,_joinKeyIdx:_[0],_disk:true)-12      5.25k ± 1%

The TPC-H query 13 result is in #12067 (comment).

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
mysql> set tidb_mem_quota_query=1;
Query OK, 0 rows affected (0.00 sec)

mysql> explain analyze select /*+ TIDB_HJ(t1, t2) */  * from testtable t1, testtable t2 where t1.id=t2.id;
+------------------------+-------+------+----------------------------------------------------------------------+-----------------------------------+-----------------+
| id                     | count | task | operator info                                                        | execution info                    | memory          |
+------------------------+-------+------+----------------------------------------------------------------------+-----------------------------------+-----------------+
| HashLeftJoin_22        | 41.25 | root | inner join, inner:TableReader_27, equal:[eq(test.t1.id, test.t2.id)] | time:1.748537ms, loops:2, rows:33 | 65.6328125 KB   |
| ├─TableReader_25       | 33.00 | root | data:TableScan_24                                                    | time:850.28µs, loops:2, rows:33   | 1.0537109375 KB |
| │ └─TableScan_24       | 33.00 | cop  | table:t1, range:[-inf,+inf], keep order:false, stats:pseudo          | time:1ms, loops:2, rows:33        | N/A             |
| └─TableReader_27       | 33.00 | root | data:TableScan_26                                                    | time:751.669µs, loops:2, rows:33  | 1.0537109375 KB |
|   └─TableScan_26       | 33.00 | cop  | table:t2, range:[-inf,+inf], keep order:false, stats:pseudo          | time:1ms, loops:2, rows:33        | N/A             |
+------------------------+-------+------+----------------------------------------------------------------------+-----------------------------------+-----------------+
5 rows in set (0.01 sec)

The tidb-server log now reports the spill:

[2019/09/18 20:35:31.576 +08:00] [INFO] [hash_table.go:294] ["memory exceeds quota, spill to disk now."] [memory="\n\"explain analyze select /*+ TIDB_HJ(t1, t2) */  * from testtable t1, testtable t2 where t1.id=t2.id\"{\n  \"quota\": 1 Bytes\n  \"consumed\": 1.056640625 KB\n  \"TableReader_25\"{\n    \"quota\": 32 GB\n    \"consumed\": 0 Bytes\n  }\n  \"TableReader_27\"{\n    \"quota\": 32 GB\n    \"consumed\": 1.056640625 KB\n  }\n  \"HashLeftJoin_22\"{\n    \"quota\": 32 GB\n    \"consumed\": 0 Bytes\n    \"hashJoin.innerResult\"{\n      \"consumed\": 0 Bytes\n    }\n  }\n}\n"]

Code changes

  • Has exported function/method change
  • Has exported variable/fields change

Side effects

  • Possible performance regression
  • Breaking backward compatibility

Related changes

  • Need to update the documentation

Release note

  • Write release note for bug-fix or new feature.
    1. A new config variable in tidb.toml, oom-use-tmp-storage (true by default), controls whether a hashJoin spills to disk. Other executors will support spilling in the future. When a query with a hashJoin exceeds mem-quota-query, the hashJoin spills to the temporary directory, which could affect performance.
    2. To control which temporary directory the hash join uses, set the environment variable TMPDIR when starting tidb-server.

@SunRunAway
Contributor Author

/run-all-tests

@codecov

codecov bot commented Sep 6, 2019

Codecov Report

Merging #12067 into master will not change coverage.
The diff coverage is n/a.

@@             Coverage Diff             @@
##             master     #12067   +/-   ##
===========================================
  Coverage   80.8753%   80.8753%           
===========================================
  Files           454        454           
  Lines         99547      99547           
===========================================
  Hits          80509      80509           
  Misses        13251      13251           
  Partials       5787       5787

@SunRunAway
Contributor Author

SunRunAway commented Sep 7, 2019

TPC-H query 13

I ran query 13 on the TPC-H 10G dataset on my laptop (MacBook Pro 15-inch, 2019, 2.6GHz).

          time     memory (top)   inuse_space (go profile)
default   15.85s   3180MB         1480MB
disk      34.64s   1830MB         534MB
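
To make the trade-off in these numbers explicit, a tiny sketch that computes the disk-to-default ratios from the figures above. `ratio` is just a helper for this illustration.

```go
package main

import "fmt"

// ratio returns disk/base, i.e. how the spilled run compares to the default run.
func ratio(base, disk float64) float64 {
	return disk / base
}

func main() {
	// Figures copied from the TPC-H query 13 comparison above.
	fmt.Printf("time:         %.2fx\n", ratio(15.85, 34.64))
	fmt.Printf("memory (top): %.2fx\n", ratio(3180, 1830))
	fmt.Printf("inuse_space:  %.2fx\n", ratio(1480, 534))
}
```

In this run, spilling makes the query roughly 2.2x slower while top-reported memory drops to about 58% and Go's inuse_space to about 36% of the default.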

default:

mysql> explain analyze select c_count, count(*) as custdist from ( select c_custkey, count(o_orderkey) as c_count from customer left outer join orders on c_custkey = o_custkey and o_comment not like '%pending%deposits%' group by c_custkey ) c_orders group by c_count order by custdist desc,c_count desc;
+--------------------------------+-------------+------+---------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------+-----------------------+
| id                             | count       | task | operator info                                                                                     | execution info                                                                            | memory                |
+--------------------------------+-------------+------+---------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------+-----------------------+
| Sort_9                         | 150000.00   | root | custdist:desc, c_orders.c_count:desc                                                              | time:15.838720948s, loops:2, rows:65                                                      | 2.7578125 KB          |
| └─Projection_11                | 150000.00   | root | c_count, 6_col_0                                                                                  | time:15.838681959s, loops:2, rows:65                                                      | N/A                   |
|   └─HashAgg_14                 | 150000.00   | root | group by:c_count, funcs:count(1), firstrow(c_count)                                               | time:15.838639251s, loops:2, rows:65                                                      | N/A                   |
|     └─HashAgg_17               | 150000.00   | root | group by:tpch.customer.c_custkey, funcs:count(tpch.orders.o_orderkey)                             | time:15.837279594s, loops:1466, rows:1500000                                              | N/A                   |
|       └─HashLeftJoin_20        | 2252162.08  | root | left outer join, inner:TableReader_25, equal:[eq(tpch.customer.c_custkey, tpch.orders.o_custkey)] | time:14.629918936s, loops:14982, rows:15338185                                            | 1.0599235333502293 GB |
|         ├─TableReader_22       | 150000.00   | root | data:TableScan_21                                                                                 | time:490.729044ms, loops:1466, rows:1500000                                               | 12.883949279785156 MB |
|         │ └─TableScan_21       | 150000.00   | cop  | table:customer, range:[-inf,+inf], keep order:false                                               | proc max:369ms, min:259ms, p80:369ms, p95:369ms, rows:1500000, iters:1482, tasks:4        | N/A                   |
|         └─TableReader_25       | 12000000.00 | root | data:Selection_24                                                                                 | time:8.412349583s, loops:14492, rows:14838130                                             | 236.17509269714355 MB |
|           └─Selection_24       | 12000000.00 | cop  | not(like(tpch.orders.o_comment, "%pending%deposits%", 92))                                        | proc max:2.98s, min:1.359s, p80:2.63s, p95:2.966s, rows:14838130, iters:14788, tasks:31   | N/A                   |
|             └─TableScan_23     | 15000000.00 | cop  | table:orders, range:[-inf,+inf], keep order:false                                                 | proc max:2.878s, min:1.302s, p80:2.536s, p95:2.863s, rows:15000000, iters:14788, tasks:31 | N/A                   |
+--------------------------------+-------------+------+---------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------+-----------------------+
10 rows in set (15.85 sec)

image-20190907002419761

(Lots of memory is used by NewChunkWithCapacity in fetchInnerRows)

disk:

mysql> explain analyze select /*+ MEMORY_QUOTA(10 MB) */ c_count, count(*) as custdist from ( select c_custkey, count(o_orderkey) as c_count from customer left outer join orders on c_custkey = o_custkey and o_comment not like '%pending%deposits%' group by c_custkey ) c_orders group by c_count order by custdist desc,c_count desc;
+--------------------------------+-------------+------+---------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------+-----------------------+
| id                             | count       | task | operator info                                                                                     | execution info                                                                           | memory                |
+--------------------------------+-------------+------+---------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------+-----------------------+
| Sort_9                         | 150000.00   | root | custdist:desc, c_orders.c_count:desc                                                              | time:34.525661988s, loops:2, rows:65                                                     | 2.7578125 KB          |
| └─Projection_11                | 150000.00   | root | c_count, 6_col_0                                                                                  | time:34.524866162s, loops:2, rows:65                                                     | N/A                   |
|   └─HashAgg_14                 | 150000.00   | root | group by:c_count, funcs:count(1), firstrow(c_count)                                               | time:34.524826727s, loops:2, rows:65                                                     | N/A                   |
|     └─HashAgg_17               | 150000.00   | root | group by:tpch.customer.c_custkey, funcs:count(tpch.orders.o_orderkey)                             | time:34.52306466s, loops:1466, rows:1500000                                              | N/A                   |
|       └─HashLeftJoin_20        | 2252162.08  | root | left outer join, inner:TableReader_25, equal:[eq(tpch.customer.c_custkey, tpch.orders.o_custkey)] | time:33.306195067s, loops:14982, rows:15338185                                           | 76.69921875 KB        |
|         ├─TableReader_22       | 150000.00   | root | data:TableScan_21                                                                                 | time:263.787279ms, loops:1466, rows:1500000                                              | 12.883944511413574 MB |
|         │ └─TableScan_21       | 150000.00   | cop  | table:customer, range:[-inf,+inf], keep order:false                                               | proc max:252ms, min:199ms, p80:252ms, p95:252ms, rows:1500000, iters:1482, tasks:4       | N/A                   |
|         └─TableReader_25       | 12000000.00 | root | data:Selection_24                                                                                 | time:5.559257104s, loops:14492, rows:14838130                                            | 638.274112701416 MB   |
|           └─Selection_24       | 12000000.00 | cop  | not(like(tpch.orders.o_comment, "%pending%deposits%", 92))                                        | proc max:2.072s, min:905ms, p80:1.911s, p95:1.988s, rows:14838130, iters:14788, tasks:31 | N/A                   |
|             └─TableScan_23     | 15000000.00 | cop  | table:orders, range:[-inf,+inf], keep order:false                                                 | proc max:1.969s, min:836ms, p80:1.804s, p95:1.87s, rows:15000000, iters:14788, tasks:31  | N/A                   |
+--------------------------------+-------------+------+---------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------+-----------------------+
10 rows in set (34.64 sec)

image-20190907004531303

(The memory usage in NewChunkWithCapacity disappears.)

@sre-bot
Contributor

sre-bot commented Sep 9, 2019

@zz-jason, @fzhedu, @qw4990, @XuHuaiyu, PTAL.

@SunRunAway
Contributor Author

I've moved the utility changes in this PR out to #12116. Please review that one first.

tools/check/go.mod (outdated review thread, resolved)
@zz-jason zz-jason removed their request for review September 10, 2019 12:36
@XuHuaiyu XuHuaiyu removed their request for review September 11, 2019 02:32
@SunRunAway
Contributor Author

/run-all-tests

@sre-bot
Contributor

sre-bot commented Sep 20, 2019

@qw4990, @zz-jason, @fzhedu, @XuHuaiyu, PTAL.

@sre-bot
Contributor

sre-bot commented Sep 22, 2019

@qw4990, @zz-jason, @fzhedu, @XuHuaiyu, PTAL.

@qw4990
Contributor

qw4990 commented Sep 23, 2019

/run-unit-test

executor/join.go (outdated review thread, resolved)
executor/hash_table.go (outdated review thread, resolved)
zz-jason
zz-jason previously approved these changes Sep 24, 2019
Member

@zz-jason zz-jason left a comment

LGTM

@zz-jason zz-jason dismissed their stale review September 24, 2019 07:08

only one LGTM 😂

@zz-jason zz-jason added the status/LGT1 label Sep 24, 2019
@SunRunAway SunRunAway added this to the v4.0.0 milestone Sep 24, 2019
Contributor

@qw4990 qw4990 left a comment

LGTM

@qw4990 qw4990 added the status/can-merge and status/LGT2 labels and removed the status/LGT1 label Sep 24, 2019
@sre-bot
Contributor

sre-bot commented Sep 24, 2019

/run-all-tests

@sre-bot
Contributor

sre-bot commented Sep 24, 2019

@SunRunAway merge failed.

@SunRunAway
Contributor Author

/run-unit-test

@SunRunAway
Contributor Author

/run-integration-ddl-test

@SunRunAway SunRunAway merged commit 1f92255 into pingcap:master Sep 24, 2019
@SunRunAway SunRunAway deleted the disk-join-issue11607 branch September 24, 2019 15:23
Labels: sig/execution, status/can-merge, status/LGT2, type/new-feature
4 participants