In [2]:
%load_ext sql

In [3]:
%sql postgresql://appdev@data:5432/appdev

'Connected: appdev@appdev'

## On the table circuits report:

* What types of indices exist on the table, and why they are of that type (and not some other type)
* The amount of space each index takes up

There are three indices on the circuits table: one on the position, one on the url and one on the primary key. The url and primary key indices are btrees, while the position index is a gist structure. A btree supports ordering comparisons (smaller than, equal to, larger than), which is all the url and the primary key need. The position is a combination of the latitude and longitude, so with a gist structure Postgres is able to make more complicated comparisons than just larger or smaller than. With geographic data we also need to ask where one location is relative to another.

* The gist index on the position uses 8 kB (8192 bytes) of space
* The btree index on the primary key uses 16 kB of space
* The btree index on the url uses 16 kB of space
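
The reported sizes are easier to compare when expressed in pages. The sketch below assumes PostgreSQL's default block size of 8192 bytes (a custom build could differ); the dictionary keys are descriptive labels chosen for this example, not the real index names.

```python
# Assumption: the server uses PostgreSQL's default block size of 8192 bytes.
BLOCK_SIZE = 8192


def pages(size_bytes: int, block_size: int = BLOCK_SIZE) -> int:
    """Return how many whole pages an index of size_bytes occupies."""
    return size_bytes // block_size


# Sizes as reported above; the keys are labels, not the actual index names.
index_sizes = {
    "position (gist)": 8192,
    "primary key (btree)": 16 * 1024,
    "url (btree)": 16 * 1024,
}

for label, size in index_sizes.items():
    print(f"{label}: {size} bytes = {pages(size)} page(s)")
```

Under that assumption the gist index fits in a single page, while each btree index occupies two.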

## We are talent scouts looking to win over some of the best new drivers there are. But we don't want them too old! Write a query that finds the winner of all the races, but only if they are younger than 38 years. The query should return the date, driver surname, driver age, track time in milliseconds, race name and circuit name for all races.

In [11]:
%sql SELECT drivers.surname, drivers.dob, milliseconds, races.name AS race_name, circuits.name AS circuits_name, races.date AS race_date FROM results JOIN drivers ON drivers.dob > '1980-01-01' AND results.position = 1 AND drivers.driverid = results.driverid JOIN races USING (raceid) JOIN circuits ON circuits.circuitid = races.circuitid ORDER BY drivers.dob DESC;

199 rows affected.


surname,dob,milliseconds,race_name,circuits_name,race_date
Verstappen,1997-09-30,5401290.0,Malaysian Grand Prix,Sepang International Circuit,2017-10-01
Verstappen,1997-09-30,6100017.0,Spanish Grand Prix,Circuit de Barcelona-Catalunya,2016-05-15
Bottas,1989-08-28,4908523.0,Austrian Grand Prix,Red Bull Ring,2017-07-09
Bottas,1989-08-28,5288743.0,Russian Grand Prix,Sochi Autodrom,2017-04-30
Ricciardo,1989-07-01,7435573.0,Azerbaijan Grand Prix,Baku City Circuit,2017-06-25
Ricciardo,1989-07-01,5076556.0,Belgian Grand Prix,Circuit de Spa-Francorchamps,2014-08-24
Ricciardo,1989-07-01,5952830.0,Canadian Grand Prix,Circuit Gilles Villeneuve,2014-06-08
Ricciardo,1989-07-01,6785058.0,Hungarian Grand Prix,Hungaroring,2014-07-27
Ricciardo,1989-07-01,5832776.0,Malaysian Grand Prix,Sepang International Circuit,2016-10-02
Vettel,1987-07-03,5788651.0,Korean Grand Prix,Korean International Circuit,2012-10-14
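
The query returns the date of birth rather than the age, and the cutoff `dob > '1980-01-01'` only approximates "younger than 38". As a minimal sketch of how the age could be derived from the two date columns, the helper below (`age_on` is a name made up for this example) counts whole years between `dob` and a reference date. The first two calls use rows copied from the result above; the third uses a hypothetical driver to show where the fixed cutoff and the real age can disagree.

```python
from datetime import date


def age_on(dob: date, on: date) -> int:
    """Return the age in whole years on the date `on`."""
    years = on.year - dob.year
    # Subtract one year if the birthday has not yet occurred in that year.
    if (on.month, on.day) < (dob.month, dob.day):
        years -= 1
    return years


# Rows taken from the query result above.
print(age_on(date(1997, 9, 30), date(2017, 10, 1)))   # Verstappen, Malaysian Grand Prix
print(age_on(date(1997, 9, 30), date(2016, 5, 15)))   # Verstappen, Spanish Grand Prix

# Hypothetical driver: passes dob > '1980-01-01' but is already 38 on this date.
print(age_on(date(1980, 6, 1), date(2019, 1, 1)))
```

In Postgres itself the same value could presumably be computed in the SELECT list with the built-in `age()` function applied to the race date and `dob`, which would also let the filter compare the age directly instead of a hard-coded date.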


## Describe the query using EXPLAIN ANALYZE with at least 5 lines of text. Answer at least the following:
* How many calls are you making?
* How long does it take to perform the query?

In [50]:
%sql EXPLAIN ANALYZE SELECT drivers.surname, drivers.dob, milliseconds, races.name AS race_name, circuits.name AS circuits_name, races.date AS race_date FROM results JOIN drivers ON drivers.dob > '1980-01-01' AND results.position = 1 AND drivers.driverid = results.driverid JOIN races USING (raceid) JOIN circuits ON circuits.circuitid = races.circuitid;

18 rows affected.


QUERY PLAN
Nested Loop (cost=25.68..785.35 rows=69 width=62) (actual time=0.110..2.407 rows=199 loops=1)
-> Nested Loop (cost=25.54..773.21 rows=69 width=50) (actual time=0.106..2.236 rows=199 loops=1)
-> Hash Join (cost=25.26..746.51 rows=69 width=27) (actual time=0.098..1.994 rows=199 loops=1)
Hash Cond: (results.driverid = drivers.driverid)
-> Seq Scan on results (cost=0.00..708.96 rows=974 width=24) (actual time=0.005..1.803 rows=974 loops=1)
Filter: ("position" = 1)
Rows Removed by Filter: 22703
-> Hash (cost=24.51..24.51 rows=60 width=19) (actual time=0.086..0.086 rows=61 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 12kB
-> Seq Scan on drivers (cost=0.00..24.51 rows=60 width=19) (actual time=0.004..0.074 rows=61 loops=1)


By looking at the query plan, we can see that Postgres is doing multiple things. The plan is read from the innermost nodes outwards, so although the nested loops are listed first, they run on top of the steps below them. Postgres does a sequential scan on the drivers table to check the date of birth and puts the matching drivers into a hash table. It does a sequential scan on results to keep only the rows with position = 1, which keeps 974 rows and removes 22703. It then does a hash join on the driverid from the tables 'results' and 'drivers'. After that it uses an index scan to match the raceid from the tables 'results' and 'races', and another index scan to match the circuitid from the tables 'circuits' and 'races'. In total we are making 7 calls, which take 2.458 ms.
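
To make the reading of the plan less manual, the node lines can be pulled out with a small parser. This is only a sketch: `PlanNode`, `NODE_RE` and `parse_plan` are names invented here, the regular expression covers just the line shape seen in this output, and `PLAN` holds only the ten plan lines shown above (the full plan had 18 rows).

```python
import re
from typing import List, NamedTuple


class PlanNode(NamedTuple):
    name: str
    total_ms: float  # end of "actual time", which includes time spent in child nodes
    rows: int
    loops: int


# Matches lines such as:
# -> Seq Scan on results (cost=...) (actual time=0.005..1.803 rows=974 loops=1)
NODE_RE = re.compile(
    r"^\s*(?:->\s*)?(?P<name>.+?)\s+\(cost=[^)]*\)\s+"
    r"\(actual time=\d+\.\d+\.\.(?P<end>\d+\.\d+) "
    r"rows=(?P<rows>\d+) loops=(?P<loops>\d+)\)"
)

# The plan lines shown above (an excerpt of the full 18-row plan).
PLAN = """\
Nested Loop (cost=25.68..785.35 rows=69 width=62) (actual time=0.110..2.407 rows=199 loops=1)
-> Nested Loop (cost=25.54..773.21 rows=69 width=50) (actual time=0.106..2.236 rows=199 loops=1)
-> Hash Join (cost=25.26..746.51 rows=69 width=27) (actual time=0.098..1.994 rows=199 loops=1)
Hash Cond: (results.driverid = drivers.driverid)
-> Seq Scan on results (cost=0.00..708.96 rows=974 width=24) (actual time=0.005..1.803 rows=974 loops=1)
Filter: ("position" = 1)
Rows Removed by Filter: 22703
-> Hash (cost=24.51..24.51 rows=60 width=19) (actual time=0.086..0.086 rows=61 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 12kB
-> Seq Scan on drivers (cost=0.00..24.51 rows=60 width=19) (actual time=0.004..0.074 rows=61 loops=1)
"""


def parse_plan(text: str) -> List[PlanNode]:
    """Collect every plan node line; detail lines (Filter, Hash Cond, ...) are skipped."""
    nodes = []
    for line in text.splitlines():
        match = NODE_RE.match(line)
        if match:
            nodes.append(
                PlanNode(
                    match["name"],
                    float(match["end"]),
                    int(match["rows"]),
                    int(match["loops"]),
                )
            )
    return nodes


for node in parse_plan(PLAN):
    print(f"{node.name}: {node.total_ms} ms, {node.rows} rows, {node.loops} loop(s)")
```

On this excerpt the sequential scan on results accounts for most of the time, which fits the 22703 rows it has to read and discard.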

## Create a materialized view of your query. Using EXPLAIN ANALYZE try to query the view. Write at least 5 lines of text explaining what's going on and why the query execution time changed.

In [51]:
%sql CREATE VIEW race_winners AS SELECT drivers.surname, drivers.dob, milliseconds, races.name AS race_name, circuits.name AS circuits_name, races.date AS race_date FROM results JOIN drivers ON drivers.dob > '1980-01-01' AND results.position = 1 AND drivers.driverid = results.driverid JOIN races USING (raceid) JOIN circuits ON circuits.circuitid = races.circuitid;

Done.


[]

In [52]:
%sql CREATE MATERIALIZED VIEW race_winners_cache AS SELECT * FROM race_winners;

199 rows affected.


[]

In [53]:
%sql EXPLAIN ANALYZE SELECT * FROM race_winners_cache;

3 rows affected.


QUERY PLAN
Seq Scan on race_winners_cache (cost=0.00..10.50 rows=50 width=1564) (actual time=0.009..0.024 rows=199 loops=1)
Planning time: 0.115 ms
Execution time: 0.041 ms


Compared to the query before we materialized it, we are only making one call, which takes only 0.041 ms instead of 2.458 ms. We are only making one sequential scan because the materialized view already stores the 199 result rows, so the joins and filters do not have to be run again. Postgres only needs to go through the stored data and show (select) it.
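
The speed-up comes from reading stored rows, which also means the stored rows can go stale. The sketch below illustrates this with Python's built-in `sqlite3` module. SQLite has no materialized views, so a table created with `CREATE TABLE ... AS SELECT` stands in for one; the table, view and function names are made up for the example and the data is a toy sample.

```python
import sqlite3


def snapshot_demo():
    """Compare a live view with a stored snapshot after the base table changes."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE results (driver TEXT, position INTEGER)")
    conn.executemany(
        "INSERT INTO results VALUES (?, ?)",
        [("Vettel", 1), ("Bottas", 2)],
    )

    # A plain view re-runs its query every time it is selected from.
    conn.execute("CREATE VIEW winners AS SELECT driver FROM results WHERE position = 1")
    # Stand-in for a materialized view: the query result is stored once.
    conn.execute("CREATE TABLE winners_cache AS SELECT * FROM winners")

    # New data arrives after the snapshot was taken.
    conn.execute("INSERT INTO results VALUES ('Ricciardo', 1)")

    def count(name):
        return conn.execute(f"SELECT COUNT(*) FROM {name}").fetchone()[0]

    live, cached = count("winners"), count("winners_cache")

    # Manual "refresh": rebuild the snapshot from the view.
    conn.execute("DELETE FROM winners_cache")
    conn.execute("INSERT INTO winners_cache SELECT * FROM winners")
    refreshed = count("winners_cache")

    conn.close()
    return live, cached, refreshed


print(snapshot_demo())
```

In Postgres the refresh step corresponds to `REFRESH MATERIALIZED VIEW race_winners_cache;`, which re-runs the underlying query and replaces the stored rows.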