# Query Explain Plans


<span style='font-size:1.2em'>This lab will pull some queries from previous activities and review the *Explain Plans*, or *Query Plans*.</span>

You are strongly encouraged to use `EXPLAIN` on all queries you write before you try to execute them.
We will look at a couple of bad queries to understand why.



In [1]:
%load_ext sql
%sql postgres://dsa_ro_user:readonly@pgsql.dsa.lan/dsa_ro

'Connected: dsa_ro_user@dsa_ro'

In [2]:
%%sql
EXPLAIN ANALYZE
SELECT COUNT(*) FROM cities;

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dsa_ro
4 rows affected.


QUERY PLAN
Aggregate (cost=7.40..7.41 rows=1 width=8) (actual time=0.130..0.130 rows=1 loops=1)
-> Seq Scan on cities (cost=0.00..6.52 rows=352 width=0) (actual time=0.018..0.090 rows=352 loops=1)
Planning time: 0.083 ms
Execution time: 0.232 ms


In [3]:
%%sql 
EXPLAIN ANALYZE
SELECT COUNT(*) FROM cities WHERE country = 'India'

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dsa_ro
6 rows affected.


QUERY PLAN
Aggregate (cost=7.50..7.50 rows=1 width=8) (actual time=0.114..0.114 rows=1 loops=1)
-> Seq Scan on cities (cost=0.00..7.40 rows=38 width=0) (actual time=0.021..0.103 rows=38 loops=1)
Filter: ((country)::text = 'India'::text)
Rows Removed by Filter: 314
Planning time: 0.172 ms
Execution time: 0.160 ms


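In both queries above, the planner chooses a sequential scan of the table, with or without the `WHERE` clause.
This is driven by the size of the table: with only 352 rows, reading the whole table is so cheap that the planner would have little to gain from an index.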
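
To see where a number such as `cost=0.00..6.52` comes from, here is a minimal sketch that reproduces the sequential-scan estimates from PostgreSQL's default planner constants (`seq_page_cost = 1.0`, `cpu_tuple_cost = 0.01`, `cpu_operator_cost = 0.0025`). The page count of 3 for `cities` is not shown in the plans. It is an assumption chosen because it reproduces the reported costs, and this server may be configured with different constants. The function name is hypothetical.

```python
# Default PostgreSQL planner cost constants (a server can override these).
SEQ_PAGE_COST = 1.0
CPU_TUPLE_COST = 0.01
CPU_OPERATOR_COST = 0.0025


def seq_scan_cost(pages, rows, filter_operators=0):
    """Estimate the total cost of a sequential scan.

    pages: disk pages in the table (assumed here, not shown in the plan)
    rows: rows in the table
    filter_operators: operator evaluations per row from the WHERE clause
    """
    io_cost = pages * SEQ_PAGE_COST
    cpu_cost = rows * (CPU_TUPLE_COST + filter_operators * CPU_OPERATOR_COST)
    return round(io_cost + cpu_cost, 2)


# Assumption: the cities table occupies 3 pages.
plain_scan = seq_scan_cost(pages=3, rows=352)
filtered_scan = seq_scan_cost(pages=3, rows=352, filter_operators=1)

print(plain_scan)     # compare with "Seq Scan on cities (cost=0.00..6.52 ...)"
print(filtered_scan)  # compare with "Seq Scan on cities (cost=0.00..7.40 ...)"
```

The filter adds only a small per-row CPU charge, which is why the two plans above differ so little in estimated cost.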


---

By contrast, let us look at a larger table with 3295 rows.
A plain `COUNT(*)` still gets a table scan: `Seq Scan on us_second_order_divisions`.


However, adding the `WHERE` clause allows an index to come into play.
The index element of the plan in this case is `Bitmap Index Scan on us_second_order_divisions_pkey`.  
We will discuss indexing within databases at the end of this module.


In [4]:
%%sql
EXPLAIN ANALYZE
SELECT COUNT(*) FROM us_second_order_divisions;

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dsa_ro
4 rows affected.


QUERY PLAN
Aggregate (cost=60.19..60.20 rows=1 width=8) (actual time=1.502..1.502 rows=1 loops=1)
-> Seq Scan on us_second_order_divisions (cost=0.00..51.95 rows=3295 width=0) (actual time=0.029..0.947 rows=3295 loops=1)
Planning time: 0.374 ms
Execution time: 1.543 ms


In [5]:
%%sql
EXPLAIN ANALYZE
SELECT COUNT(*) FROM us_second_order_divisions
WHERE state_number_code = 25;

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dsa_ro
8 rows affected.


QUERY PLAN
Aggregate (cost=22.79..22.80 rows=1 width=8) (actual time=0.267..0.267 rows=1 loops=1)
-> Bitmap Heap Scan on us_second_order_divisions (cost=4.37..22.76 rows=12 width=0) (actual time=0.257..0.261 rows=14 loops=1)
Recheck Cond: (state_number_code = 25)
Heap Blocks: exact=1
-> Bitmap Index Scan on us_second_order_divisions_pkey (cost=0.00..4.37 rows=12 width=0) (actual time=0.245..0.246 rows=14 loops=1)
Index Cond: (state_number_code = 25)
Planning time: 0.150 ms
Execution time: 0.351 ms
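
Each plan line pairs the planner's estimate with what actually happened, and comparing the two is a quick check on how well the planner knows the data. The sketch below pulls both row counts out of one line of the plan above using a regular expression. The pattern and function name are hypothetical helpers written for this text format, not part of PostgreSQL or the `sql` extension.

```python
import re

# Captures the estimated and the measured row counts from one
# EXPLAIN ANALYZE plan line in PostgreSQL's text format.
PLAN_NODE = re.compile(
    r"cost=(?P<startup_cost>\d+\.\d+)\.\.(?P<total_cost>\d+\.\d+) rows=(?P<estimated_rows>\d+)"
    r".*actual time=\d+\.\d+\.\.\d+\.\d+ rows=(?P<actual_rows>\d+)"
)


def estimate_vs_actual(plan_line):
    """Return the planner's row estimate and the measured row count."""
    match = PLAN_NODE.search(plan_line)
    if match is None:
        raise ValueError("not an EXPLAIN ANALYZE plan node line")
    return {
        "estimated_rows": int(match.group("estimated_rows")),
        "actual_rows": int(match.group("actual_rows")),
    }


line = (
    "-> Bitmap Heap Scan on us_second_order_divisions "
    "(cost=4.37..22.76 rows=12 width=0) "
    "(actual time=0.257..0.261 rows=14 loops=1)"
)
result = estimate_vs_actual(line)
print(result)
```

Here the planner expected 12 rows and found 14, which is a close estimate. Large gaps between the two numbers are usually the first thing to look for in a slow plan.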


## Explain versus Explain Analyze

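You may notice above that we are using `EXPLAIN ANALYZE` rather than just `EXPLAIN`.
`EXPLAIN` only shows the planner's estimates, while `EXPLAIN ANALYZE` actually executes the query and reports the measured times and row counts.
I used it here because I know these queries work and I know that running them will not drag down the database.

It is generally a good idea to `EXPLAIN` first and then, once you trust your SQL, `EXPLAIN ANALYZE`.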
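
One way to make the "plain `EXPLAIN` first" habit stick is a small helper that builds the statement for you. The function below is a hypothetical sketch, not part of the `sql` extension. It assumes you only want `EXPLAIN ANALYZE` on `SELECT` statements, because `ANALYZE` really runs the statement it is given.

```python
def explain_statement(query, analyze=False):
    """Prefix a SQL statement with EXPLAIN, or EXPLAIN ANALYZE on request.

    EXPLAIN ANALYZE executes the statement, so this sketch only allows it
    for statements that start with SELECT, which usually do not modify data.
    """
    sql = query.strip().rstrip(";")
    if not analyze:
        return f"EXPLAIN {sql};"
    if not sql.upper().startswith("SELECT"):
        raise ValueError("EXPLAIN ANALYZE would execute this non-SELECT statement")
    return f"EXPLAIN ANALYZE {sql};"


plan_only = explain_statement("SELECT COUNT(*) FROM cities;")
plan_and_run = explain_statement("SELECT COUNT(*) FROM cities;", analyze=True)
print(plan_only)
print(plan_and_run)
```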


**Take note of the output differences for the same SQL without and with the `ANALYZE` option.**

In [6]:
%%sql 
EXPLAIN
SELECT country, MIN(population) 
FROM cities 
GROUP BY country;

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dsa_ro
3 rows affected.


QUERY PLAN
HashAggregate (cost=8.28..9.25 rows=97 width=12)
Group Key: country
-> Seq Scan on cities (cost=0.00..6.52 rows=352 width=12)


In [7]:
%%sql 
EXPLAIN ANALYZE
SELECT country, MIN(population) 
FROM cities 
GROUP BY country;

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dsa_ro
5 rows affected.


QUERY PLAN
HashAggregate (cost=8.28..9.25 rows=97 width=12) (actual time=0.343..0.408 rows=97 loops=1)
Group Key: country
-> Seq Scan on cities (cost=0.00..6.52 rows=352 width=12) (actual time=0.033..0.105 rows=352 loops=1)
Planning time: 0.073 ms
Execution time: 0.462 ms


## Aggregates 

In the plans above, `HashAggregate` forms the groups and applies the aggregate function to each group.
In the next plan, the `HAVING` clause appears as a `Filter` on that same node.

In [8]:
%%sql 
EXPLAIN ANALYZE
SELECT country, count(*) 
FROM cities 
GROUP BY country 
HAVING count(*) > 10;

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dsa_ro
7 rows affected.


QUERY PLAN
HashAggregate (cost=9.16..10.13 rows=97 width=16) (actual time=0.334..0.355 rows=8 loops=1)
Group Key: country
Filter: (count(*) > 10)
Rows Removed by Filter: 89
-> Seq Scan on cities (cost=0.00..6.52 rows=352 width=8) (actual time=0.012..0.078 rows=352 loops=1)
Planning time: 0.071 ms
Execution time: 0.427 ms
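
As a rough mental model of the `HashAggregate` node with its `Filter`, the sketch below groups rows in a hash table (a Python `Counter`), counts each group, and then drops the groups that fail the `HAVING` condition. The rows are made-up toy data, not the contents of `cities`, and the function is a simplification of what the database does internally.

```python
from collections import Counter


def hash_aggregate_count(rows, key, having):
    """Count rows per group, then apply a HAVING-style filter to the groups.

    Returns the surviving groups and the number of groups removed, which
    mirrors the "Rows Removed by Filter" line in the plan.
    """
    counts = Counter(row[key] for row in rows)
    kept = {group: n for group, n in counts.items() if having(n)}
    removed = len(counts) - len(kept)
    return kept, removed


# Toy data for illustration only.
toy_cities = (
    [{"country": "India"}] * 3
    + [{"country": "Chile"}] * 2
    + [{"country": "Ghana"}]
)

kept, removed = hash_aggregate_count(toy_cities, "country", lambda n: n > 2)
print(kept)     # groups that satisfy HAVING count(*) > 2
print(removed)  # groups rejected by the filter
```

Note that the filter runs after every row has been grouped, so `HAVING` does not reduce the work of the scan underneath it.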


## Sorting is expensive!

We previously used the SQL below to build up our understanding of aggregations.

Examine each of the `EXPLAIN` plans and try to correlate the plan nodes with parts of the SQL.
Tuning a database is as much an **art** as a science.
The first step, however, is learning how to read explain plans and understand how query structure and the data within the table affect the cost-based optimizer of a DBMS.

In [9]:
%%sql 
EXPLAIN ANALYZE
SELECT S.state_name, count(*)
FROM us_second_order_divisions as C
JOIN util_us_states as S
  ON (C.state_number_code=S.state_number_code)
GROUP BY S.state_name;

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dsa_ro
10 rows affected.


QUERY PLAN
HashAggregate (cost=116.08..116.67 rows=59 width=18) (actual time=4.047..4.065 rows=60 loops=1)
Group Key: s.state_name
-> Hash Join (cost=2.35..99.61 rows=3295 width=10) (actual time=0.137..2.482 rows=3295 loops=1)
Hash Cond: (c.state_number_code = s.state_number_code)
-> Seq Scan on us_second_order_divisions c (cost=0.00..51.95 rows=3295 width=2) (actual time=0.015..0.725 rows=3295 loops=1)
-> Hash (cost=1.60..1.60 rows=60 width=12) (actual time=0.072..0.072 rows=60 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 11kB
-> Seq Scan on util_us_states s (cost=0.00..1.60 rows=60 width=12) (actual time=0.011..0.029 rows=60 loops=1)
Planning time: 0.881 ms
Execution time: 4.177 ms
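
The `Hash` and `Hash Join` nodes in the plan above can be pictured as two phases: build a hash table on the smaller input (`util_us_states`), then probe it once per row of the larger input (`us_second_order_divisions`). The sketch below is a simplified Python version of that idea using made-up toy rows. The state names and codes are placeholders, not real table contents.

```python
from collections import Counter, defaultdict


def hash_join(build_rows, probe_rows, key):
    """Join two lists of dict rows on a shared key with a hash table."""
    # Build phase: hash the smaller input by the join key.
    table = defaultdict(list)
    for row in build_rows:
        table[row[key]].append(row)
    # Probe phase: one lookup per row of the larger input.
    for probe in probe_rows:
        for match in table.get(probe[key], []):
            yield {**match, **probe}


# Toy data for illustration only.
states = [
    {"state_number_code": 1, "state_name": "State A"},
    {"state_number_code": 2, "state_name": "State B"},
]
divisions = [
    {"state_number_code": 1},
    {"state_number_code": 1},
    {"state_number_code": 2},
    {"state_number_code": 3},  # no matching state, so dropped by the inner join
]

joined = list(hash_join(states, divisions, "state_number_code"))
counts = Counter(row["state_name"] for row in joined)
print(counts)
```

Each input is read once, which matches the two `Seq Scan` nodes feeding the join in the plan.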


In [10]:
%%sql 
EXPLAIN ANALYZE
SELECT S.state_name, count(*)
FROM us_second_order_divisions as C
JOIN util_us_states as S
  ON (C.state_number_code=S.state_number_code)
GROUP BY S.state_name
ORDER BY S.state_name;

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dsa_ro
13 rows affected.


QUERY PLAN
Sort (cost=118.41..118.55 rows=59 width=18) (actual time=4.436..4.444 rows=60 loops=1)
Sort Key: s.state_name
Sort Method: quicksort Memory: 29kB
-> HashAggregate (cost=116.08..116.67 rows=59 width=18) (actual time=4.084..4.107 rows=60 loops=1)
Group Key: s.state_name
-> Hash Join (cost=2.35..99.61 rows=3295 width=10) (actual time=0.116..2.476 rows=3295 loops=1)
Hash Cond: (c.state_number_code = s.state_number_code)
-> Seq Scan on us_second_order_divisions c (cost=0.00..51.95 rows=3295 width=2) (actual time=0.014..0.720 rows=3295 loops=1)
-> Hash (cost=1.60..1.60 rows=60 width=12) (actual time=0.075..0.076 rows=60 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 11kB


In [11]:
%%sql 
EXPLAIN ANALYZE
SELECT S.state_name, count(*)
FROM us_second_order_divisions as C
JOIN util_us_states as S
  ON (C.state_number_code=S.state_number_code)
GROUP BY S.state_name
HAVING COUNT(*) BETWEEN 10 AND 30
ORDER BY COUNT(*) DESC;

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dsa_ro
15 rows affected.


QUERY PLAN
Sort (cost=134.88..135.03 rows=59 width=18) (actual time=3.296..3.297 rows=12 loops=1)
Sort Key: (count(*)) DESC
Sort Method: quicksort Memory: 25kB
-> HashAggregate (cost=132.56..133.15 rows=59 width=18) (actual time=3.161..3.179 rows=12 loops=1)
Group Key: s.state_name
Filter: ((count(*) >= 10) AND (count(*) <= 30))
Rows Removed by Filter: 48
-> Hash Join (cost=2.35..99.61 rows=3295 width=10) (actual time=0.081..1.949 rows=3295 loops=1)
Hash Cond: (c.state_number_code = s.state_number_code)
-> Seq Scan on us_second_order_divisions c (cost=0.00..51.95 rows=3295 width=2) (actual time=0.010..0.549 rows=3295 loops=1)


## <span style="background:yellow">Your Turn!</span>

Examine the **cross-product** query using `EXPLAIN` first, and then answer the question below.



In [12]:
%%sql

EXPLAIN ANALYZE
SELECT S.state_name, count(*)
FROM us_second_order_divisions as C
, util_us_states as S
GROUP BY S.state_name
HAVING COUNT(*) BETWEEN 10 AND 30
ORDER BY COUNT(*) DESC;


 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dsa_ro
13 rows affected.


QUERY PLAN
Sort (cost=4504.28..4504.42 rows=59 width=18) (actual time=67.984..67.984 rows=0 loops=1)
Sort Key: (count(*)) DESC
Sort Method: quicksort Memory: 25kB
-> HashAggregate (cost=4501.95..4502.54 rows=59 width=18) (actual time=67.979..67.979 rows=0 loops=1)
Group Key: s.state_name
Filter: ((count(*) >= 10) AND (count(*) <= 30))
Rows Removed by Filter: 60
-> Nested Loop (cost=0.00..2524.95 rows=197700 width=10) (actual time=0.068..29.186 rows=197700 loops=1)
-> Seq Scan on us_second_order_divisions c (cost=0.00..51.95 rows=3295 width=0) (actual time=0.015..0.314 rows=3295 loops=1)
-> Materialize (cost=0.00..1.90 rows=60 width=10) (actual time=0.000..0.003 rows=60 loops=3295)
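
The row counts in the plan above can be checked with simple arithmetic. The sketch below assumes each of the 60 rows in `util_us_states` has a distinct `state_name`, which is consistent with the 60 groups the filter removes. With no join condition, every state row pairs with every division row.

```python
DIVISION_ROWS = 3295  # rows reported for us_second_order_divisions
STATE_ROWS = 60       # rows reported for util_us_states


def cross_product_group_counts(left_rows, right_rows):
    """Without a join condition, each right-hand row pairs with every left-hand row."""
    return {right_id: left_rows for right_id in range(right_rows)}


counts = cross_product_group_counts(DIVISION_ROWS, STATE_ROWS)
total_rows = sum(counts.values())
kept = [n for n in counts.values() if 10 <= n <= 30]

print(total_rows)  # compare with "Nested Loop (... rows=197700 ...)"
print(len(kept))   # groups that pass HAVING COUNT(*) BETWEEN 10 AND 30
```

Every group has the same count, so the `HAVING` range rejects all of them after the database has already produced and aggregated the full cross product.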


# Save your Notebook, then `File > Close and Halt`

---