Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cost-based choice between nested loop join and hash join #7331

Closed
dyemanov opened this issue Oct 6, 2022 · 0 comments
Closed

Cost-based choice between nested loop join and hash join #7331

dyemanov opened this issue Oct 6, 2022 · 0 comments

Comments

@dyemanov
Copy link
Member

dyemanov commented Oct 6, 2022

Currently, only a nested loop join is used when streams have indexed relationships. But often this is sub-optimal and causes the dependent streams to be fetched more times than actually necessary. Hash join could be a better choice in these cases. Of course, cast-based approach should be used when choosing between the possible join algorithms.

Just an example from the TPC-R test suite:

select first 10
  l_orderkey, o_orderdate, o_shippriority,
  sum(l_extendedprice * (1 - l_discount)) as revenue,
from
  customer, orders, lineitem
where
  c_mktsegment = 'BUILDING'
  and c_custkey = o_custkey
  and l_orderkey = o_orderkey
  and o_orderdate < date '1995-03-15'
  and l_shipdate > date '1995-03-15'
group by
  l_orderkey, o_orderdate, o_shippriority
order by
  2 desc, o_orderdate;

PLAN SORT (
  SORT (
    JOIN (
      CUSTOMER NATURAL,
      ORDERS INDEX (ORDERS_CUSTKEY),
      LINEITEM INDEX (LINEITEM_PK))))

Elapsed time = 1.600 sec

vs

PLAN SORT (
  SORT (
    JOIN (
      HASH (
        ORDERS INDEX (ORDERS_ORDERDATE),
        CUSTOMER NATURAL),
      LINEITEM INDEX (LINEITEM_PK))))

Elapsed time = 1.031 sec
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants