# Advanced Aggregates

Remember to run `EXPLAIN` on a query before you execute it. Checking the plan first helps avoid unnecessary load on the DBMS and long, open-ended waits for results.

For that reason, each question has a cell for the `EXPLAIN` as well as a cell for the final SQL.
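
When reading `EXPLAIN` output, the numbers in parentheses are planner estimates: the startup cost, the total cost (both in arbitrary planner units), the estimated number of rows, and the average row width in bytes. As a minimal sketch, the helper below pulls those four values out of one plan line such as the ones shown in this notebook. The name `parse_plan_line` is hypothetical and exists only for this illustration.

```python
import re

# Matches the parenthesised estimate block that PostgreSQL prints for a plan node.
PLAN_ESTIMATE = re.compile(
    r"\(cost=(?P<startup>\d+\.\d+)\.\.(?P<total>\d+\.\d+) "
    r"rows=(?P<rows>\d+) width=(?P<width>\d+)\)"
)


def parse_plan_line(line):
    """Return the planner estimates from one EXPLAIN line, or None if it has none."""
    match = PLAN_ESTIMATE.search(line)
    if match is None:
        return None
    return {
        "startup_cost": float(match["startup"]),
        "total_cost": float(match["total"]),
        "rows": int(match["rows"]),
        "width": int(match["width"]),
    }


# Plan line copied from the first question in this notebook.
estimate = parse_plan_line("Aggregate (cost=71.51..71.52 rows=1 width=96)")
print(estimate)
```

A large total cost or row estimate on the top node is a sign that the query deserves a second look before it is run.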


## Our practice schema:

We will use the DVD Rental database.

A PDF of the _Entity-Relationship Diagrams_ (ERD) is available [here](https://web.dsa.missouri.edu/static/PDF/DVD_Rental_ERD2.pdf).   
Printing it out is recommended.


**NOTE**: These queries are more complex than the others.
If you get stuck on one, skip and come back to it later.

**NOTE**: For this notebook, construct your solutions using advanced aggregates and derived tables.

In [1]:
%load_ext sql
%sql postgres://dsa_ro_user:readonly@pgsql.dsa.lan/dvdrental

'Connected: dsa_ro_user@dvdrental'

### 1
### What is the average, variance, and standard deviation of the film length?


In [2]:
%%sql
EXPLAIN
SELECT  avg(length)
        ,variance(length)
        ,stddev(length)
FROM    film;

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dvdrental
2 rows affected.


QUERY PLAN
Aggregate (cost=71.51..71.52 rows=1 width=96)
-> Seq Scan on film (cost=0.00..64.00 rows=1000 width=2)


In [3]:
%%sql
SELECT  avg(length)
        ,variance(length)
        ,stddev(length)
FROM    film;

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dvdrental
1 rows affected.


| avg | variance | stddev |
|---|---|---|
| 115.272 | 1634.2883043043043 | 40.426331818559845 |
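
In PostgreSQL, `variance()` and `stddev()` are aliases for the sample versions (`var_samp` and `stddev_samp`), which divide by n - 1 rather than n. The sketch below shows the difference on a small, made-up list of film lengths and then checks that the standard deviation reported above is the square root of the reported variance.

```python
import math
import statistics

# Hypothetical film lengths in minutes, for illustration only.
lengths = [90, 100, 110, 120, 130]

sample_var = statistics.variance(lengths)       # divides by n - 1
population_var = statistics.pvariance(lengths)  # divides by n
sample_std = statistics.stdev(lengths)          # square root of the sample variance

print(sample_var, population_var, sample_std)

# Figures reported by the query on the film table.
reported_variance = 1634.2883043043043
reported_stddev = 40.426331818559845
consistent = math.isclose(math.sqrt(reported_variance), reported_stddev, rel_tol=1e-9)
print(consistent)
```

If you need the population versions in SQL, PostgreSQL also provides `var_pop` and `stddev_pop`.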


### 2
### What is the average, variance, and standard deviation of the film length, broken down by film category?

In [4]:
%%sql
EXPLAIN
SELECT  ca.name
        ,avg(f.length)
        ,variance(f.length)
        ,stddev(f.length)
FROM    film f JOIN film_category fc
        USING(film_id)
        JOIN category ca
        USING(category_id)
GROUP BY ca.name;

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dvdrental
11 rows affected.


QUERY PLAN
HashAggregate (cost=109.81..110.09 rows=16 width=164)
Group Key: ca.name
-> Hash Join (cost=77.86..99.81 rows=1000 width=70)
Hash Cond: (fc.category_id = ca.category_id)
-> Hash Join (cost=76.50..95.14 rows=1000 width=4)
Hash Cond: (fc.film_id = f.film_id)
-> Seq Scan on film_category fc (cost=0.00..16.00 rows=1000 width=4)
-> Hash (cost=64.00..64.00 rows=1000 width=6)
-> Seq Scan on film f (cost=0.00..64.00 rows=1000 width=6)
-> Hash (cost=1.16..1.16 rows=16 width=72)


In [5]:
%%sql
SELECT  ca.name
        ,avg(f.length)
        ,variance(f.length)
        ,stddev(f.length)
FROM    film f JOIN film_category fc
        USING(film_id)
        JOIN category ca
        USING(category_id)
GROUP BY ca.name;

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dvdrental
16 rows affected.


| name | avg | variance | stddev |
|---|---|---|---|
| Family | 114.78260869565216 | 1523.2314578005116 | 39.02859794817784 |
| Games | 127.83606557377048 | 1262.4726775956285 | 35.531291527266895 |
| Animation | 111.01515151515152 | 1723.0920745920744 | 41.51014423718706 |
| Classics | 111.66666666666666 | 1475.2261904761904 | 38.40867337563471 |
| Documentary | 108.75 | 1814.6679104477607 | 42.598919122998424 |
| New | 111.12698412698413 | 1514.79006656426 | 38.920304039977125 |
| Sports | 128.2027027027027 | 1796.4651980747872 | 42.38472835910108 |
| Children | 109.8 | 1500.9084745762711 | 38.74156004314064 |
| Music | 113.6470588235294 | 1787.3129411764703 | 42.27662405131789 |
| Travel | 113.3157894736842 | 1540.1842105263156 | 39.24518072994843 |


[Helpful Hints Video](https://youtu.be/jy9H2KLI4Iw) 

### 3
### A movie's "cumulative rented duration" is the sum of the durations of all of its rentals in the `rental` table. What is the average _cumulative rented duration_ per store (`inventory.store_id`)?

In [15]:
%%sql
EXPLAIN
SELECT  s.store_id, avg(cumulative_rentals.cumulative_duration)
FROM    store s JOIN staff st
        USING(store_id)
        JOIN rental r
        USING(staff_id)
        JOIN (SELECT  inventory_id
                    ,sum(r.return_date - r.rental_date) AS cumulative_duration
            FROM    rental as r
            WHERE   r.return_date IS NOT NULL
            GROUP BY r.inventory_id
        ) as cumulative_rentals
        ON r.inventory_id = cumulative_rentals.inventory_id
GROUP BY s.store_id;

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dvdrental
19 rows affected.


QUERY PLAN
HashAggregate (cost=1233.79..1233.82 rows=2 width=20)
Group Key: s.store_id
-> Hash Join (cost=580.37..1153.57 rows=16044 width=20)
Hash Cond: (r.inventory_id = cumulative_rentals.inventory_id)
-> Hash Join (cost=2.12..533.17 rows=16044 width=8)
Hash Cond: (r.staff_id = st.staff_id)
-> Seq Scan on rental r (cost=0.00..310.44 rows=16044 width=6)
-> Hash (cost=2.09..2.09 rows=2 width=8)
-> Nested Loop (cost=0.00..2.09 rows=2 width=8)
Join Filter: (s.store_id = st.store_id)


In [16]:
%%sql
SELECT  s.store_id, avg(cumulative_rentals.cumulative_duration)
FROM    store s JOIN staff st
        USING(store_id)
        JOIN rental r
        USING(staff_id)
        JOIN (SELECT  inventory_id
                    ,sum(r.return_date - r.rental_date) AS cumulative_duration
            FROM    rental as r
            WHERE   r.return_date IS NOT NULL
            GROUP BY r.inventory_id
        ) as cumulative_rentals
        ON r.inventory_id = cumulative_rentals.inventory_id
GROUP BY s.store_id;

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dvdrental
2 rows affected.


| store_id | avg |
|---|---|
| 1 | 19 days, 4:42:53.873135 |
| 2 | 19 days, 3:26:03.770615 |
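
The two-step shape of this question (sum per item in a derived table, then average per store) can be sketched in plain Python on made-up data. The rows below are hypothetical, and the sketch counts each inventory item once per store, which is one reading of the question. The query above instead joins the totals back to individual rental rows, so an item rented more often carries more weight in its average.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical rentals: (store_id, inventory_id, rental_date, return_date).
rentals = [
    (1, 101, datetime(2005, 5, 1, 10), datetime(2005, 5, 4, 10)),  # 3 days
    (1, 101, datetime(2005, 6, 1, 10), datetime(2005, 6, 2, 10)),  # 1 day
    (1, 102, datetime(2005, 5, 3, 9), datetime(2005, 5, 5, 9)),    # 2 days
    (2, 201, datetime(2005, 5, 2, 8), datetime(2005, 5, 7, 8)),    # 5 days
    (2, 202, datetime(2005, 5, 9, 8), None),                       # not yet returned
]

# Step 1 (the derived table): total rented duration per inventory item.
cumulative = defaultdict(timedelta)
store_of = {}
for store_id, inventory_id, rented, returned in rentals:
    if returned is None:
        continue  # mirrors WHERE return_date IS NOT NULL
    cumulative[inventory_id] += returned - rented
    store_of[inventory_id] = store_id

# Step 2 (the outer query): average of those totals per store.
per_store = defaultdict(list)
for inventory_id, total in cumulative.items():
    per_store[store_of[inventory_id]].append(total)

averages = {
    store_id: sum(totals, timedelta()) / len(totals)
    for store_id, totals in per_store.items()
}

for store_id in sorted(averages):
    print(store_id, averages[store_id])
```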


[Helpful Hints Video](https://youtu.be/Scyn7exzUcY)  

### 4
### Which three categories of film have the highest average number of actors per film?

In [32]:
%%sql
EXPLAIN
SELECT  c.name, avg(ac.actors) as avg_actors
FROM    category c JOIN film_category fc
        USING(category_id)
        JOIN film f
        USING(film_id)
        JOIN (SELECT  film_id, COUNT(actor_id) as actors
              FROM    film_actor
              GROUP BY film_id
        ) as ac
        USING(film_id)
GROUP BY c.name
ORDER BY avg_actors DESC
LIMIT   3;

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dvdrental
21 rows affected.


QUERY PLAN
Limit (cost=252.16..252.16 rows=3 width=100)
-> Sort (cost=252.16..252.20 rows=16 width=100)
Sort Key: (avg(ac.actors)) DESC
-> HashAggregate (cost=251.75..251.95 rows=16 width=100)
Group Key: c.name
-> Hash Join (cost=222.19..246.76 rows=997 width=76)
Hash Cond: (fc.film_id = f.film_id)
-> Hash Join (cost=145.69..167.64 rows=997 width=80)
Hash Cond: (fc.category_id = c.category_id)
-> Hash Join (cost=144.33..162.97 rows=997 width=14)


In [33]:
%%sql
SELECT  c.name, avg(ac.actors) as avg_actors
FROM    category c JOIN film_category fc
        USING(category_id)
        JOIN film f
        USING(film_id)
        JOIN (SELECT  film_id, COUNT(actor_id) as actors
              FROM    film_actor
              GROUP BY film_id
        ) as ac
        USING(film_id)
GROUP BY c.name
ORDER BY avg_actors DESC
LIMIT   3;

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dvdrental
3 rows affected.


| name | avg_actors |
|---|---|
| Sports | 6.0410958904109595 |
| Drama | 5.737704918032787 |
| Children | 5.7333333333333325 |
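
To experiment with the count-then-average pattern without touching the course server, here is a minimal sketch that runs the same query shape against an in-memory SQLite database. SQLite stands in for PostgreSQL, the data is made up, and the schema is simplified (the category name is stored directly in `film_category`, which is an assumption for this example only).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE film_category (film_id INTEGER, category TEXT);
    CREATE TABLE film_actor (actor_id INTEGER, film_id INTEGER);
    INSERT INTO film_category VALUES (1, 'Sports'), (2, 'Sports'), (3, 'Drama');
    INSERT INTO film_actor VALUES
        (10, 1), (11, 1), (12, 1), (13, 1),
        (10, 2), (11, 2),
        (12, 3), (13, 3), (14, 3), (15, 3), (16, 3);
""")

# Inner query: one row per film with its actor count.
# Outer query: average those counts within each category.
rows = conn.execute("""
    SELECT  fc.category, avg(ac.actors) AS avg_actors
    FROM    film_category fc
            JOIN (SELECT film_id, count(actor_id) AS actors
                  FROM   film_actor
                  GROUP BY film_id) AS ac
            USING (film_id)
    GROUP BY fc.category
    ORDER BY avg_actors DESC
    LIMIT   2
""").fetchall()
conn.close()

print(rows)
```

Because the derived table is attached with an inner join, a film with no rows in `film_actor` drops out of the average instead of counting as zero actors.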


### 5
### For each staff member, list the average daily payment amount they processed.

In [42]:
%%sql
EXPLAIN
SELECT  staff_id, avg(payments.daily_payments) as avg_daily_payments
FROM    staff st JOIN (
            SELECT  staff_id
                    ,payment_date::TIMESTAMP::DATE as date
                    ,sum(amount) as daily_payments
            FROM    payment
            GROUP BY staff_id, date
        ) as payments
        USING(staff_id)
GROUP BY staff_id;





 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dvdrental
11 rows affected.


QUERY PLAN
GroupAggregate (cost=803.76..804.87 rows=2 width=36)
Group Key: st.staff_id
-> Sort (cost=803.76..804.12 rows=144 width=36)
Sort Key: st.staff_id
-> Hash Join (cost=400.96..798.60 rows=144 width=36)
Hash Cond: (payment.staff_id = st.staff_id)
-> HashAggregate (cost=399.92..615.40 rows=14365 width=38)
Group Key: payment.staff_id, (payment.payment_date)::date
-> Seq Scan on payment (cost=0.00..290.45 rows=14596 width=12)
-> Hash (cost=1.02..1.02 rows=2 width=4)


In [48]:
%%sql
SELECT  staff_id, payments.date, avg(payments.daily_payments) as avg_daily_payments
FROM    staff st JOIN (
            SELECT  staff_id
                    ,payment_date::TIMESTAMP::DATE as date
                    ,sum(amount) as daily_payments
            FROM    payment
            GROUP BY staff_id, date
        ) as payments
        USING(staff_id)
GROUP BY staff_id, date
ORDER BY payments.date, staff_id;






 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dvdrental
64 rows affected.


| staff_id | date | avg_daily_payments |
|---|---|---|
| 1 | 2007-02-14 | 45.9 |
| 2 | 2007-02-14 | 70.83 |
| 1 | 2007-02-15 | 599.39 |
| 2 | 2007-02-15 | 589.53 |
| 1 | 2007-02-16 | 611.52 |
| 2 | 2007-02-16 | 542.66 |
| 1 | 2007-02-17 | 527.74 |
| 2 | 2007-02-17 | 660.43 |
| 1 | 2007-02-18 | 662.36 |
| 2 | 2007-02-18 | 613.62 |
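
An average of daily totals needs two levels of grouping: the derived table groups by staff member and day to get each daily total, and the outer query groups by staff member only. If the outer query also groups by day, every group holds a single daily total and the "average" is just that total. The sketch below shows the staff-only version on made-up rows, using an in-memory SQLite database as a stand-in for PostgreSQL (so `date(payment_date)` replaces the `::DATE` cast).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE payment (staff_id INTEGER, payment_date TEXT, amount REAL);
    INSERT INTO payment VALUES
        (1, '2007-02-14 09:00:00', 2.0),
        (1, '2007-02-14 17:30:00', 4.0),
        (1, '2007-02-15 10:15:00', 10.0),
        (2, '2007-02-14 11:00:00', 5.0);
""")

# Inner query: total per staff member per day.
# Outer query: average of those daily totals per staff member.
rows = conn.execute("""
    SELECT  staff_id, avg(daily_payments) AS avg_daily_payments
    FROM    (SELECT staff_id,
                    date(payment_date) AS pay_day,
                    sum(amount)        AS daily_payments
             FROM   payment
             GROUP BY staff_id, date(payment_date)) AS payments
    GROUP BY staff_id
    ORDER BY staff_id
""").fetchall()
conn.close()

print(rows)
```

Staff member 1 has daily totals of 6.0 and 10.0 in this toy data, so the result holds one row per staff member rather than one row per day.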


### 6
### What is the statistical correlation between film length and rental rate?

In [49]:
%%sql
EXPLAIN
SELECT  corr(length, rental_rate)
FROM    film;

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dvdrental
2 rows affected.


QUERY PLAN
Aggregate (cost=71.50..71.51 rows=1 width=8)
-> Seq Scan on film (cost=0.00..64.00 rows=1000 width=8)


In [50]:
%%sql
SELECT  corr(length, rental_rate)
FROM    film;

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dvdrental
1 rows affected.


| corr |
|---|
| 0.0297892586459086 |
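
`corr()` returns the Pearson correlation coefficient, which ranges from -1 to 1, so a value this close to 0 suggests almost no linear relationship between film length and rental rate. As a minimal sketch of what the aggregate computes, the function below implements the coefficient directly. The name `pearson_corr` and the sample data are made up for illustration.

```python
import math


def pearson_corr(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Raises ZeroDivisionError if either sequence is constant.
    return cov / math.sqrt(var_x * var_y)


# Hypothetical film lengths with rates that rise, then fall, in lockstep.
lengths = [60, 90, 120, 150]
rates_up = [0.99, 1.99, 2.99, 3.99]
rates_down = [3.99, 2.99, 1.99, 0.99]

corr_up = pearson_corr(lengths, rates_up)
corr_down = pearson_corr(lengths, rates_down)
print(corr_up, corr_down)
```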


[Helpful Hints Video](https://youtu.be/3d2vgLn9KVs)  

# Save your Notebook, then `File > Close and Halt`