# Query Execution Plans

This lab shows the importance of planning your queries before execution.  

You are strongly encouraged to use `EXPLAIN` on all queries you write before you try to execute them.
We will look at a couple of bad queries to understand why.



In [1]:
%load_ext sql
%sql postgres://dsa_ro_user:readonly@dbase.dsa.missouri.edu/dsa_ro

'Connected: dsa_ro_user@dsa_ro'

In [2]:
%%sql
EXPLAIN ANALYZE
SELECT COUNT(*) FROM cities;

4 rows affected.


QUERY PLAN
Aggregate (cost=7.40..7.41 rows=1 width=0) (actual time=0.260..0.260 rows=1 loops=1)
-> Seq Scan on cities (cost=0.00..6.52 rows=352 width=0) (actual time=0.067..0.185 rows=352 loops=1)
Planning time: 0.335 ms
Execution time: 0.484 ms


In [3]:
%%sql 
EXPLAIN ANALYZE
SELECT COUNT(*) FROM cities WHERE country = 'India';

6 rows affected.


QUERY PLAN
Aggregate (cost=7.50..7.50 rows=1 width=0) (actual time=0.116..0.116 rows=1 loops=1)
-> Seq Scan on cities (cost=0.00..7.40 rows=38 width=0) (actual time=0.015..0.104 rows=38 loops=1)
Filter: ((country)::text = 'India'::text)
Rows Removed by Filter: 314
Planning time: 0.227 ms
Execution time: 0.157 ms


Both queries above use a sequential scan on the table, with or without the `WHERE` clause.
This is driven by the size of the table: recall that `cities` has only 352 rows.
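
The `cost` figures on the two `Seq Scan` lines above can be reproduced by hand. The sketch below assumes PostgreSQL's default planner constants (`seq_page_cost = 1.0`, `cpu_tuple_cost = 0.01`, `cpu_operator_cost = 0.0025`); the lab server may be configured differently. The page count of 3 is not stated anywhere in the output; it is inferred so that the totals match. The function name `seq_scan_cost` is made up for this illustration.

```python
# Default PostgreSQL planner constants (assumed; the lab server may differ).
SEQ_PAGE_COST = 1.0
CPU_TUPLE_COST = 0.01
CPU_OPERATOR_COST = 0.0025


def seq_scan_cost(pages, rows, filter_ops=0):
    """Estimated total cost of a sequential scan.

    pages      -- disk pages read (each costs SEQ_PAGE_COST)
    rows       -- rows examined (each costs CPU_TUPLE_COST)
    filter_ops -- operators evaluated per row by a WHERE filter
    """
    per_row = CPU_TUPLE_COST + filter_ops * CPU_OPERATOR_COST
    return pages * SEQ_PAGE_COST + rows * per_row


# pages=3 is an inferred value, not something the plan output reports.
plain = seq_scan_cost(pages=3, rows=352)
filtered = seq_scan_cost(pages=3, rows=352, filter_ops=1)

print(round(plain, 2))     # 6.52, the Seq Scan total for COUNT(*)
print(round(filtered, 2))  # 7.4, the Seq Scan total with the country filter
```

With only a handful of pages to read, scanning the whole table is already cheap, which is consistent with the planner choosing `Seq Scan` both times.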


---

By contrast, let's look at a larger table with 3295 rows.
A plain `COUNT(*)` still gets a full table scan: `Seq Scan on us_second_order_divisions`.


However, adding the `WHERE` clause allows an index to come into play.
The index element of the plan in this case is `Bitmap Index Scan on us_second_order_divisions_pkey`.


In [4]:
%%sql
EXPLAIN ANALYZE
SELECT COUNT(*) FROM us_second_order_divisions;

4 rows affected.


QUERY PLAN
Aggregate (cost=60.19..60.20 rows=1 width=0) (actual time=1.785..1.785 rows=1 loops=1)
-> Seq Scan on us_second_order_divisions (cost=0.00..51.95 rows=3295 width=0) (actual time=0.045..1.240 rows=3295 loops=1)
Planning time: 0.300 ms
Execution time: 1.828 ms


In [5]:
%%sql
EXPLAIN ANALYZE
SELECT COUNT(*) FROM us_second_order_divisions
WHERE state_number_code = 25;

8 rows affected.


QUERY PLAN
Aggregate (cost=22.79..22.80 rows=1 width=0) (actual time=0.117..0.117 rows=1 loops=1)
-> Bitmap Heap Scan on us_second_order_divisions (cost=4.37..22.76 rows=12 width=0) (actual time=0.108..0.113 rows=14 loops=1)
Recheck Cond: (state_number_code = 25)
Heap Blocks: exact=1
-> Bitmap Index Scan on us_second_order_divisions_pkey (cost=0.00..4.37 rows=12 width=0) (actual time=0.100..0.100 rows=14 loops=1)
Index Cond: (state_number_code = 25)
Planning time: 0.113 ms
Execution time: 0.161 ms
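
The same scan-versus-index pattern can be tried offline. The sketch below is a stand-in that uses Python's built-in `sqlite3` module, not the lab's PostgreSQL server. SQLite has a different planner and uses `EXPLAIN QUERY PLAN`, so the wording of its output will not match the plans above. The `divisions` table, its `division_code` and `name` columns, and the generated rows are all hypothetical; only the column name `state_number_code` comes from the lab.

```python
import sqlite3

# In-memory SQLite database: a stand-in, NOT the lab's PostgreSQL server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE divisions ("
    " state_number_code INTEGER,"
    " division_code INTEGER,"
    " name TEXT,"
    " PRIMARY KEY (state_number_code, division_code))"
)
# Hypothetical data: 60 states x 55 divisions = 3300 rows.
conn.executemany(
    "INSERT INTO divisions VALUES (?, ?, ?)",
    [(s, d, f"division {s}-{d}") for s in range(1, 61) for d in range(1, 56)],
)


def plan(sql):
    """Return the human-readable detail column of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[3] for row in rows]


full_count = plan("SELECT COUNT(*) FROM divisions")
filtered_count = plan(
    "SELECT COUNT(*) FROM divisions WHERE state_number_code = 25"
)
print(full_count)
print(filtered_count)
```

SQLite reports a `SCAN` (every row visited) for the plain count and a `SEARCH` (a lookup through the primary-key index) for the filtered one. The exact wording varies by SQLite version.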


## Explain versus Explain Analyze

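You may have noticed that the cells above use `EXPLAIN ANALYZE` rather than plain `EXPLAIN`.
`EXPLAIN` only shows the estimated plan, while `EXPLAIN ANALYZE` actually executes the query and reports the measured times.
I used it here because I know these queries work and that running them will not drag down the database.

It is generally a good idea to run `EXPLAIN` first and then, once you trust your SQL, `EXPLAIN ANALYZE`.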


**Take note of the output differences for the same SQL without and with the `ANALYZE` option.**

In [6]:
%%sql 
EXPLAIN
SELECT country, MIN(population) 
FROM cities 
GROUP BY country;

3 rows affected.


QUERY PLAN
HashAggregate (cost=8.28..9.25 rows=97 width=12)
Group Key: country
-> Seq Scan on cities (cost=0.00..6.52 rows=352 width=12)


In [7]:
%%sql 
EXPLAIN ANALYZE
SELECT country, MIN(population) 
FROM cities 
GROUP BY country;

5 rows affected.


QUERY PLAN
HashAggregate (cost=8.28..9.25 rows=97 width=12) (actual time=0.364..0.394 rows=97 loops=1)
Group Key: country
-> Seq Scan on cities (cost=0.00..6.52 rows=352 width=12) (actual time=0.013..0.099 rows=352 loops=1)
Planning time: 0.059 ms
Execution time: 0.455 ms
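
The difference between the two outputs above is easier to see once a plan line is split into its fields. The sketch below is a rough parser for the text format shown in this notebook, not an official tool. The regular expression, the function name `parse_node`, and the dictionary keys are all made up for this illustration.

```python
import re

# Matches the "(cost=...)" group and, if present, the "(actual time=...)" group.
NODE_RE = re.compile(
    r"\(cost=(?P<startup_cost>\d+\.\d+)\.\.(?P<total_cost>\d+\.\d+) "
    r"rows=(?P<est_rows>\d+) width=(?P<width>\d+)\)"
    r"(?: \(actual time=(?P<startup_ms>\d+\.\d+)\.\.(?P<total_ms>\d+\.\d+) "
    r"rows=(?P<actual_rows>\d+) loops=(?P<loops>\d+)\))?"
)


def parse_node(line):
    """Return the numeric fields of one plan line, or None if it has none."""
    match = NODE_RE.search(line)
    if match is None:
        return None
    # Fields from the optional "actual" group are dropped when absent.
    return {k: float(v) for k, v in match.groupdict().items() if v is not None}


# Lines copied from the two plans above.
estimate_only = parse_node("HashAggregate (cost=8.28..9.25 rows=97 width=12)")
analyzed = parse_node(
    "HashAggregate (cost=8.28..9.25 rows=97 width=12) "
    "(actual time=0.364..0.394 rows=97 loops=1)"
)

print(sorted(estimate_only))  # estimates only
print(sorted(analyzed))       # estimates plus measured values
print(analyzed["est_rows"] == analyzed["actual_rows"])
```

Plain `EXPLAIN` yields only the estimate fields. `EXPLAIN ANALYZE` adds the measured time, row count, and loop count, which lets you compare the planner's row estimate against what actually happened.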


## Aggregates 

We see that a `HashAggregate` node performs the grouping and applies the aggregate function to each group of rows.

In [8]:
%%sql 
EXPLAIN ANALYZE
SELECT country, count(*) 
FROM cities 
GROUP BY country 
HAVING count(*) > 10;

7 rows affected.


QUERY PLAN
HashAggregate (cost=9.16..10.37 rows=97 width=8) (actual time=0.352..0.370 rows=8 loops=1)
Group Key: country
Filter: (count(*) > 10)
Rows Removed by Filter: 89
-> Seq Scan on cities (cost=0.00..6.52 rows=352 width=8) (actual time=0.012..0.086 rows=352 loops=1)
Planning time: 0.057 ms
Execution time: 0.418 ms


## Sorting is expensive!

We previously used the SQL below to build up our understanding of aggregations.

Examine each of the `EXPLAIN` plans and try to correlate its nodes with parts of the SQL.
Tuning a database is as much an **art** as a science.
The first step, however, is learning how to read explain plans and understand how query structure and the data within the table affect the cost-based optimizer of a DBMS.

In [9]:
%%sql 
EXPLAIN ANALYZE
SELECT S.state_name, count(*)
FROM us_second_order_divisions as C
JOIN util_us_states as S
  ON (C.state_number_code=S.state_number_code)
GROUP BY S.state_name;

10 rows affected.


QUERY PLAN
HashAggregate (cost=116.08..116.67 rows=59 width=10) (actual time=4.470..4.487 rows=60 loops=1)
Group Key: s.state_name
-> Hash Join (cost=2.35..99.61 rows=3295 width=10) (actual time=0.148..2.821 rows=3295 loops=1)
Hash Cond: (c.state_number_code = s.state_number_code)
-> Seq Scan on us_second_order_divisions c (cost=0.00..51.95 rows=3295 width=2) (actual time=0.011..0.736 rows=3295 loops=1)
-> Hash (cost=1.60..1.60 rows=60 width=12) (actual time=0.073..0.073 rows=60 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 11kB
-> Seq Scan on util_us_states s (cost=0.00..1.60 rows=60 width=12) (actual time=0.007..0.025 rows=60 loops=1)
Planning time: 0.854 ms
Execution time: 4.578 ms


In [10]:
%%sql 
EXPLAIN ANALYZE
SELECT S.state_name, count(*)
FROM us_second_order_divisions as C
JOIN util_us_states as S
  ON (C.state_number_code=S.state_number_code)
GROUP BY S.state_name
ORDER BY S.state_name;

13 rows affected.


QUERY PLAN
Sort (cost=118.41..118.55 rows=59 width=10) (actual time=4.712..4.716 rows=60 loops=1)
Sort Key: s.state_name
Sort Method: quicksort Memory: 29kB
-> HashAggregate (cost=116.08..116.67 rows=59 width=10) (actual time=4.411..4.435 rows=60 loops=1)
Group Key: s.state_name
-> Hash Join (cost=2.35..99.61 rows=3295 width=10) (actual time=0.107..2.742 rows=3295 loops=1)
Hash Cond: (c.state_number_code = s.state_number_code)
-> Seq Scan on us_second_order_divisions c (cost=0.00..51.95 rows=3295 width=2) (actual time=0.009..0.730 rows=3295 loops=1)
-> Hash (cost=1.60..1.60 rows=60 width=12) (actual time=0.076..0.076 rows=60 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 11kB
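
In plan output, a node's total cost and actual time include those of the node beneath it. What a node adds by itself is therefore the difference between its figures and its child's. The sketch below applies that subtraction to the `Sort` and `HashAggregate` lines of the `ORDER BY` plan above. The helper name `added_by_node` is made up for this illustration.

```python
def added_by_node(parent_total, child_total):
    """Portion of a parent node's total that is not inherited from its child."""
    return parent_total - child_total


# Totals copied from the ORDER BY plan above:
#   Sort          cost=118.41..118.55  actual time=4.712..4.716
#   HashAggregate cost=116.08..116.67  actual time=4.411..4.435
sort_cost = added_by_node(118.55, 116.67)
sort_ms = added_by_node(4.716, 4.435)

print(round(sort_cost, 2))  # 1.88 estimated cost units added by the Sort
print(round(sort_ms, 3))    # 0.281 ms of measured time added by the Sort
```

Only 60 grouped rows reach the `Sort` node here. Applying the same subtraction to a plan that sorts many more rows is a quick way to see how much of the total the sort accounts for.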


In [11]:
%%sql 
EXPLAIN ANALYZE
SELECT S.state_name, count(*)
FROM us_second_order_divisions as C
JOIN util_us_states as S
  ON (C.state_number_code=S.state_number_code)
GROUP BY S.state_name
HAVING COUNT(*) BETWEEN 10 AND 30
ORDER BY COUNT(*) DESC;

15 rows affected.


QUERY PLAN
Sort (cost=135.18..135.32 rows=59 width=10) (actual time=4.557..4.560 rows=12 loops=1)
Sort Key: (count(*)) DESC
Sort Method: quicksort Memory: 25kB
-> HashAggregate (cost=132.56..133.44 rows=59 width=10) (actual time=4.476..4.488 rows=12 loops=1)
Group Key: s.state_name
Filter: ((count(*) >= 10) AND (count(*) <= 30))
Rows Removed by Filter: 48
-> Hash Join (cost=2.35..99.61 rows=3295 width=10) (actual time=0.109..2.737 rows=3295 loops=1)
Hash Cond: (c.state_number_code = s.state_number_code)
-> Seq Scan on us_second_order_divisions c (cost=0.00..51.95 rows=3295 width=2) (actual time=0.011..0.723 rows=3295 loops=1)


In [14]:
%%sql 

SELECT S.state_name, count(*)
FROM us_second_order_divisions as C
JOIN util_us_states as S
  ON (C.state_number_code=S.state_number_code)
GROUP BY S.state_name
HAVING COUNT(*) BETWEEN 10 AND 30
ORDER BY COUNT(*) DESC;

12 rows affected.


state_name,count
UTAH,29
ALASKA,28
MARYLAND,24
WYOMING,23
NEW JERSEY,21
NEVADA,17
MAINE,16
PALAU,16
ARIZONA,15
VERMONT,14


## <span style="background:yellow">Your Turn</span>

Examine the **cross-product** query using `EXPLAIN` first, and then answer the question below.



In [12]:
%%sql

EXPLAIN ANALYZE
SELECT S.state_name, count(*)
FROM us_second_order_divisions as C
, util_us_states as S
GROUP BY S.state_name
HAVING COUNT(*) BETWEEN 10 AND 30
ORDER BY COUNT(*) DESC;


13 rows affected.


QUERY PLAN
Sort (cost=4504.57..4504.72 rows=59 width=10) (actual time=74.831..74.831 rows=0 loops=1)
Sort Key: (count(*)) DESC
Sort Method: quicksort Memory: 25kB
-> HashAggregate (cost=4501.95..4502.84 rows=59 width=10) (actual time=74.826..74.826 rows=0 loops=1)
Group Key: s.state_name
Filter: ((count(*) >= 10) AND (count(*) <= 30))
Rows Removed by Filter: 60
-> Nested Loop (cost=0.00..2524.95 rows=197700 width=10) (actual time=0.020..33.221 rows=197700 loops=1)
-> Seq Scan on us_second_order_divisions c (cost=0.00..51.95 rows=3295 width=0) (actual time=0.011..0.285 rows=3295 loops=1)
-> Materialize (cost=0.00..1.90 rows=60 width=10) (actual time=0.000..0.003 rows=60 loops=3295)


In [15]:
%%sql

SELECT S.state_name, count(*)
FROM us_second_order_divisions as C
, util_us_states as S
GROUP BY S.state_name
HAVING COUNT(*) BETWEEN 10 AND 30
ORDER BY COUNT(*) DESC;


0 rows affected.


state_name,count


# Save your Notebook, then `File > Close and Halt`

---