Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance issues in standard query patterns. #2391

Open
frankmcsherry opened this issue Mar 23, 2020 · 0 comments
Open

Performance issues in standard query patterns. #2391

frankmcsherry opened this issue Mar 23, 2020 · 0 comments
Labels

Comments

@frankmcsherry
Copy link
Member

@frankmcsherry frankmcsherry commented Mar 23, 2020

@rjnn went through the IMBD dataset, subjecting it to some pretty standard sorts of queries, and there were a few performance and ergonomic hits. I'm going to try and leave all of this in one place for reproduction, but we might want to spin out some issues as we notice specific actionable bits.

The IMDB data are at available at https://datasets.imdbws.com

Starting up materialized, we'll want to create some sources and views. Each of the DELIMITED BY '' fragments need a ctrl-v then a tab to get an actual tab character in there (no support for \t yet).

CREATE SOURCE principals_raw 
FROM FILE '/Users/mcsherry/Desktop/IMDB/title.principals.tsv' 
FORMAT CSV WITH 6 COLUMNS DELIMITED BY ''

CREATE SOURCE titles_raw 
FROM FILE '/Users/mcsherry/Desktop/IMDB/title.basics.tsv' 
FORMAT CSV WITH 9 COLUMNS DELIMITED BY ''

CREATE MATERIALIZED VIEW principals AS 
SELECT 
    column1 AS title_id,
    column2 AS ordering,
    column3 AS name_id,
    column4 AS category,
    column5 AS job,
    column6 AS characters,
    mz_line_no 
FROM principals_raw
WHERE mz_line_no > 1

CREATE MATERIALIZED VIEW titles AS 
SELECT 
    column1 AS title_id, 
    column3 AS title, 
    column6 AS year 
FROM titles_raw

These will load, and depending on your data should have something like

materialize=> select count(*) from titles;
  count  
---------
 6651000
(1 row)

materialize=> select count(*) from principals;
  count   
----------
 38405007
(1 row)

records in them.

The first query of interest is to determine who else has acted with Kevin Bacon, which @rjnn has framed as

CREATE MATERIALIZED VIEW degree1 AS 
SELECT principals.name_id
FROM principals 
WHERE title_id IN (
    SELECT principals.title_id 
    FROM 
        principals,
        titles
    WHERE principals.title_id = titles.title_id 
      AND principals.name_id = 'nm0000102'
)

This presently plans awkwardly, owing to 1. subquery structure and 2. us being suboptimal at using query literals to use indexes (though the appropriate index does not exist).

materialize=> explain plan for SELECT principals.name_id
materialize-> FROM principals 
materialize-> WHERE title_id IN (
materialize(>     SELECT principals.title_id 
materialize(>     FROM 
materialize(>         principals,
materialize(>         titles
materialize(>     WHERE principals.title_id = titles.title_id AND principals.name_id = 'nm0000102'
materialize(> )
materialize-> ;
                         Plan                         
------------------------------------------------------
 0 =                                                 +
 | Get materialize.public.principals (u26)           +
                                                     +
 1 =                                                 +
 | Get materialize.public.principals (u26)           +
 | Distinct group=(#0)                               +
 | ArrangeBy (#0)                                    +
                                                     +
 2 =                                                 +
 | Get materialize.public.principals (u26)           +
 | Filter (#2 = "nm0000102")                         +
 | ArrangeBy (#0)                                    +
                                                     +
 3 =                                                 +
 | Get materialize.public.titles (u20)               +
                                                     +
 4 =                                                 +
 | Join %1 %2 %3 (= %1.#0 %2.#0 %3.#0)               +
 | | implementation = Differential %3 %1.(#0) %2.(#0)+
 | | demand for %1 = (#0)                            +
 | | demand for %2 = ()                              +
 | | demand for %3 = ()                              +
 | Reduce group=(#0) any(true)                       +
 | ArrangeBy (#0)                                    +
                                                     +
 5 =                                                 +
 | Join %0 %4 (= %0.#0 %4.#0)                        +
 | | implementation = Differential %0 %4.(#0)        +
 | | demand for %0 = (#2)                            +
 | | demand for %4 = ()                              +
 | Project (#2)                                      +
 
(1 row)

materialize=> 

Notice that there are three uses of principals, which is the larger of the two input relations. Note, the use of titles is technically spurious here (@rjnn had other plans for it), though even without it the same issues show up (though, less painful). They are:

  1. For a fair bit of time, attempts to read out the count result in
    materialize=> select count(*) from degree1;
    ERROR:  At least one input has no complete timestamps yet.
    
    This isn't obviously wrong; there are some reductions in there, and we don't really stream data out of arranged inputs in timestamp order, but it is a bit unclear and potentially annoying. The "inputs" do have complete timestamps, if we mean the sources, and something else is going on in here (basically, degree1 doesn't have a complete timestamp).
  2. The time taken is longer than we might like (though again, not obviously terrible). I think this boils down to a planning issue that we are tracking (link missing) for using literals in queries to drive indexed look-ups: we would want to index by name_id, avoiding various scans and doing substantially less work.

The next query attempts to go one step out:

CREATE MATERIALIZED VIEW degree2 AS 
SELECT name_id 
FROM principals
WHERE title_id IN (
    SELECT principals.title_id
    FROM principals, titles 
    WHERE principals.title_id = titles.title_id AND principals.name_id IN (
        SELECT name_id FROM degree1
    )
)

Again the titles is spurious, we think, but in this case the query plan is unfortunate:

materialize=> explain plan for SELECT name_id 
materialize-> FROM principals
materialize-> WHERE title_id IN (
materialize(>     SELECT principals.title_id
materialize(>     FROM principals, titles 
materialize(>     WHERE principals.title_id = titles.title_id AND principals.name_id IN (
materialize(>         SELECT name_id FROM materialize.public.degree1
materialize(>     )
materialize(> );
                          Plan                           
---------------------------------------------------------
 0 =                                                    +
 | Get materialize.public.principals (u26)              +
 | Distinct group=(#0)                                  +
 | ArrangeBy ()                                         +
                                                        +
 1 =                                                    +
 | Get materialize.public.principals (u26)              +
                                                        +
 2 =                                                    +
 | Get materialize.public.titles (u20)                  +
 | ArrangeBy (#0)                                       +
                                                        +
 3 =                                                    +
 | Join %0 %1 %2 (= %1.#0 %2.#0)                        +
 | | implementation = Differential %1 %2.(#0) %0.()     +
 | | demand for %0 = (#0)                               +
 | | demand for %1 = (#0, #2)                           +
 | | demand for %2 = ()                                 +
                                                        +
 4 =                                                    +
 | Get materialize.public.principals (u26)              +
                                                        +
 5 =                                                    +
 | Get %3                                               +
 | Filter (#0 = #1)                                     +
                                                        +
 6 =                                                    +
 | Get %3                                               +
 | Distinct group=(#3)                                  +
 | ArrangeBy (#0)                                       +
                                                        +
 7 =                                                    +
 | Get materialize.public.degree1 (u60)                 +
 | ArrangeBy (#0)                                       +
                                                        +
 8 =                                                    +
 | Join %6 %7 (= %6.#0 %7.#0)                           +
 | | implementation = DeltaQuery %6 %7.(#0) | %7 %6.(#0)+
 | | demand for %6 = (#0)                               +
 | | demand for %7 = ()                                 +
 | Reduce group=(#0) any(true)                          +
 | ArrangeBy (#0)                                       +
                                                        +
 9 =                                                    +
 | Join %5 %8 (= %5.#3 %8.#0)                           +
 | | implementation = Differential %5 %8.(#0)           +
 | | demand for %5 = (#0)                               +
 | | demand for %8 = ()                                 +
 | Reduce group=(#0) any(true)                          +
 | ArrangeBy (#0)                                       +
                                                        +
 10 =                                                   +
 | Join %4 %9 (= %4.#0 %9.#0)                           +
 | | implementation = Differential %4 %9.(#0)           +
 | | demand for %4 = (#2)                               +
 | | demand for %9 = ()                                 +
 | Project (#2)                                         +
 
(1 row)

There is a cross-join in the definition of %3, probably because of the double IN (or at least exacerbated by it). @rjnn reports immediately loading up both views, and the second is probably what hosed his system.

A simpler version of the query removes the titles, as

SELECT name_id 
FROM principals
WHERE title_id IN (
    SELECT principals.title_id
    FROM principals 
    WHERE principals.name_id IN (
        SELECT name_id FROM materialize.public.degree1
    )
)

and still there is a (simpler) cross-join:

materialize=> explain plan for SELECT name_id 
materialize-> FROM principals
materialize-> WHERE title_id IN (
materialize(>     SELECT principals.title_id
materialize(>     FROM principals 
materialize(>     WHERE principals.name_id IN (
materialize(>         SELECT name_id FROM materialize.public.degree1
materialize(>     )
materialize(> );
                          Plan                           
---------------------------------------------------------
 0 =                                                    +
 | Get materialize.public.principals (u26)              +
 | Distinct group=(#0)                                  +
 | ArrangeBy ()                                         +
                                                        +
 1 =                                                    +
 | Get materialize.public.principals (u26)              +
                                                        +
 2 =                                                    +
 | Join %0 %1                                           +
 | | implementation = Differential %1 %0.()             +
 | | demand for %0 = (#0)                               +
 | | demand for %1 = (#0, #2)                           +
                                                        +
...

cc @jamii @wangandi

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
1 participant
You can’t perform that action at this time.