Performance issues in standard query patterns. #2391
@rjnn went through the IMDB dataset, subjecting it to some pretty standard sorts of queries, and there were a few performance and ergonomic hits. I'm going to try to leave all of this in one place for reproduction, but we might want to spin out some issues as we notice specific actionable bits.
The IMDB data are available at https://datasets.imdbws.com
These will load, and depending on your data should have something like
records in them.
The first query of interest is to determine who else has acted with Kevin Bacon, which @rjnn has framed as
CREATE MATERIALIZED VIEW degree1 AS
SELECT principals.name_id
FROM principals
WHERE title_id IN (
    SELECT principals.title_id
    FROM principals, titles
    WHERE principals.title_id = titles.title_id
      AND principals.name_id = 'nm0000102'
)
This presently plans awkwardly, owing to (1) the subquery structure and (2) our being suboptimal at using query literals to drive index lookups (though the appropriate index does not exist here).
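For anyone who wants to poke at the subquery structure without loading the full dataset, here is a minimal sketch using Python's built-in sqlite3 module. It is not Materialize and says nothing about how Materialize would plan either form. The two-column toy schema, the sample rows, and the helper names (build_toy_db, run, IN_SUBQUERY, EXPLICIT_JOIN) are all assumptions made for illustration. It only checks that the IN-subquery from degree1 returns the same rows as an explicit join against the distinct set of Kevin Bacon's titles.

```python
import sqlite3

# The SELECT from degree1, as written in this issue (minus CREATE MATERIALIZED VIEW,
# which SQLite does not support).
IN_SUBQUERY = """
SELECT principals.name_id
FROM principals
WHERE title_id IN (
    SELECT principals.title_id
    FROM principals, titles
    WHERE principals.title_id = titles.title_id
      AND principals.name_id = 'nm0000102'
)
"""

# Hypothetical rewrite: the same semijoin expressed as a join against the
# DISTINCT title ids, so each principals row still appears at most once.
EXPLICIT_JOIN = """
SELECT p.name_id
FROM principals AS p
JOIN (
    SELECT DISTINCT b.title_id
    FROM principals AS b
    JOIN titles AS t ON b.title_id = t.title_id
    WHERE b.name_id = 'nm0000102'
) AS bacon_titles ON p.title_id = bacon_titles.title_id
"""


def build_toy_db():
    """Create an in-memory database with an assumed, heavily simplified schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE titles (title_id TEXT PRIMARY KEY);
        CREATE TABLE principals (title_id TEXT, name_id TEXT);
        INSERT INTO titles VALUES ('tt1'), ('tt2'), ('tt3');
        INSERT INTO principals VALUES
            ('tt1', 'nm0000102'), ('tt1', 'nm1'),
            ('tt2', 'nm0000102'), ('tt2', 'nm2'),
            ('tt3', 'nm2'), ('tt3', 'nm3');
        """
    )
    return conn


def run(conn, sql):
    """Run a query and return its name_id column as a sorted list."""
    return sorted(row[0] for row in conn.execute(sql))


if __name__ == "__main__":
    conn = build_toy_db()
    print(run(conn, IN_SUBQUERY))
    print(run(conn, EXPLICIT_JOIN))
```

On the toy rows both forms return the same multiset of name_id values, including Kevin Bacon's own rows, since neither query filters him out. Whether the join form plans any better in Materialize is exactly the kind of thing this issue would need to check against the real data.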
Notice that there are three uses of
The next query attempts to go one step out:
CREATE MATERIALIZED VIEW degree2 AS
SELECT name_id
FROM principals
WHERE title_id IN (
    SELECT principals.title_id
    FROM principals, titles
    WHERE principals.title_id = titles.title_id
      AND principals.name_id IN (
        SELECT name_id FROM degree1
      )
)
There is a cross-join in the definition of
A simpler version of the query removes the
SELECT name_id
FROM principals
WHERE title_id IN (
    SELECT principals.title_id
    FROM principals
    WHERE principals.name_id IN (
        SELECT name_id FROM materialize.public.degree1
    )
)
and still there is a (simpler) cross-join: