# Window Functions

Remember to run `EXPLAIN` on a query before you execute it. Reviewing the plan first helps you avoid putting unnecessary load on the DBMS and waiting indefinitely for results.

For that reason, each question has one cell for the `EXPLAIN` and one for the final SQL.


## Our practice schema:

We will use the DVD rental database.

A PDF of the _Entity-Relationship Diagrams_ (ERD) is available [here](https://web.dsa.missouri.edu/static/PDF/DVD_Rental_ERD2.pdf).   
Printing it out is recommended.


**NOTE**: These queries are more complex than the previous day's.
If you get stuck on one, skip and come back to it later.


**NOTE**: For this notebook, construct your solutions using window functions.

In [1]:
%load_ext sql
%sql postgres://dsa_ro_user:readonly@pgsql.dsa.lan/dvdrental

'Connected: dsa_ro_user@dvdrental'

# 1

### For the following customers, list each movie they have rented, its `film.rental_duration`, and the difference between that `film.rental_duration` and the customer's average `rental` duration (`return_date - rental_date`), as a column named `cmp`.

Customer IDs: 
  * 318
  * 110
  * 281
  * 61
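
As a warm-up for the per-customer average this question calls for, here is a minimal sketch on toy data. It uses Python's built-in `sqlite3` module as a stand-in for the course's PostgreSQL server (window functions need SQLite 3.25 or newer). The table `toy_rental` and its columns `allowed_days` and `kept_days` are hypothetical and not part of the DVD rental schema.

```python
import sqlite3

# Hypothetical toy table: one row per rental, with the allowed and actual days kept.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE toy_rental ("
    "customer_id INTEGER, title TEXT, allowed_days INTEGER, kept_days INTEGER)"
)
conn.executemany(
    "INSERT INTO toy_rental VALUES (?, ?, ?, ?)",
    [
        (1, "A", 3, 4),
        (1, "B", 7, 6),
        (2, "C", 5, 2),
    ],
)

# avg(...) OVER (PARTITION BY ...) attaches each customer's average to every one
# of that customer's rows; unlike GROUP BY, it does not collapse the rows.
rows = conn.execute(
    """
    SELECT customer_id,
           title,
           allowed_days,
           allowed_days - avg(kept_days) OVER (PARTITION BY customer_id) AS cmp
    FROM   toy_rental
    ORDER BY customer_id, title
    """
).fetchall()

for row in rows:
    print(row)
```

Every input row survives in the result; only the average is shared within a partition, which is what lets each rental be compared against its own customer's average.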

In [12]:
%%sql
EXPLAIN
SELECT  r.customer_id
        ,f.title
        ,f.rental_duration
        ,f.rental_duration - avg(r.return_date::TIMESTAMP::DATE  - r.rental_date::TIMESTAMP::DATE) OVER (PARTITION BY customer_id) as cmp 
FROM    film f JOIN inventory i
        USING(film_id)
        JOIN rental r
        USING(inventory_id)
WHERE customer_id IN (61, 110, 281, 318);

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dvdrental
12 rows affected.


QUERY PLAN
WindowAgg (cost=542.75..545.84 rows=103 width=51)
-> Sort (cost=542.75..543.01 rows=103 width=35)
Sort Key: f.title
-> Nested Loop (cost=392.22..539.31 rows=103 width=35)
-> Hash Join (cost=391.95..503.87 rows=103 width=20)
Hash Cond: (i.inventory_id = r.inventory_id)
-> Seq Scan on inventory i (cost=0.00..70.81 rows=4581 width=6)
-> Hash (cost=390.66..390.66 rows=103 width=22)
-> Seq Scan on rental r (cost=0.00..390.66 rows=103 width=22)
"Filter: (customer_id = ANY ('{61,110,281,318}'::integer[]))"


In [16]:
%%sql
SELECT  r.customer_id
        ,f.title
        ,f.rental_duration
        ,f.rental_duration - avg(r.return_date::TIMESTAMP::DATE  - r.rental_date::TIMESTAMP::DATE) OVER (PARTITION BY customer_id) as cmp 
FROM    film f JOIN inventory i
        USING(film_id)
        JOIN rental r
        USING(inventory_id)
WHERE customer_id IN (61, 110, 281, 318);

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dvdrental
54 rows affected.


customer_id,title,rental_duration,cmp
61,Necklace Outbreak,3,-3.0
61,Iron Moon,7,1.0
61,Fireball Philadelphia,4,-2.0
61,Impact Aladdin,6,0.0
61,Autumn Crow,3,-3.0
61,Smoochy Control,7,1.0
61,Barefoot Manchurian,6,0.0
61,Ridgemont Submarine,3,-3.0
61,Voyage Legally,6,0.0
61,Reign Gentlemen,3,-3.0


[Helpful Hints Video](https://youtu.be/cm1_d1qWLhg)  
 

--- 

# 2

### For each store (`inventory.store_id`), list the three films with the longest accumulated rental duration.

Hint: Use the `rank()` function and a derived table
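
The derived table is needed because a window function's result cannot be referenced in the `WHERE` clause of the query level that computes it. The sketch below shows that pattern on toy data, keeping the top two rows per group for brevity. It assumes Python's built-in `sqlite3` module (SQLite 3.25 or newer) as a stand-in for PostgreSQL, and the table `toy_rental` and column `days` are hypothetical.

```python
import sqlite3

# Hypothetical toy table: one row per rental, with the number of days it lasted.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_rental (store_id INTEGER, film_id INTEGER, days INTEGER)")
conn.executemany(
    "INSERT INTO toy_rental VALUES (?, ?, ?)",
    [
        (1, 10, 5), (1, 10, 4), (1, 11, 7), (1, 12, 2), (1, 13, 1),
        (2, 10, 3), (2, 14, 8),
    ],
)

# Innermost query: accumulate days per (store, film).
# Middle query: rank the films within each store by that total.
# Outer query: filter on the rank, which is only visible from outside.
top_films = conn.execute(
    """
    SELECT store_id, film_id, total_days
    FROM (
        SELECT store_id,
               film_id,
               total_days,
               rank() OVER (PARTITION BY store_id ORDER BY total_days DESC) AS rnk
        FROM (
            SELECT store_id, film_id, sum(days) AS total_days
            FROM   toy_rental
            GROUP BY store_id, film_id
        ) AS totals
    ) AS ranked
    WHERE rnk <= 2
    ORDER BY store_id, rnk
    """
).fetchall()

for row in top_films:
    print(row)
```

Giving the rank an explicit alias (`rnk` here) avoids relying on the default column name that the database assigns to an unnamed window expression.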

In [19]:
%%sql
EXPLAIN
SELECT  durations.store_id
        ,durations.film_id
        FROM (SELECT  sum(r.return_date  - r.rental_date) as duration
                        ,rank() OVER (PARTITION BY i.store_id ORDER BY sum(r.return_date  - r.rental_date) DESC)
                        ,store_id
                        ,i.film_id
                FROM    inventory i JOIN rental r
                        USING(inventory_id)
                WHERE   r.return_date IS NOT NULL
                GROUP BY i.store_id, i.film_id
        ) AS durations
WHERE RANK <= 3;

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dvdrental
13 rows affected.


QUERY PLAN
Subquery Scan on durations (cost=695.81..726.95 rows=319 width=4)
Filter: (durations.rank <= 3)
-> WindowAgg (cost=695.81..714.97 rows=958 width=30)
-> Sort (cost=695.81..698.21 rows=958 width=20)
"Sort Key: i.store_id, (sum((r.return_date - r.rental_date))) DESC"
-> HashAggregate (cost=638.79..648.37 rows=958 width=20)
"Group Key: i.store_id, i.film_id"
-> Hash Join (cost=128.07..480.18 rows=15861 width=20)
Hash Cond: (r.inventory_id = i.inventory_id)
-> Seq Scan on rental r (cost=0.00..310.44 rows=15861 width=20)


In [20]:
%%sql
SELECT  durations.store_id
        ,durations.film_id
        FROM (SELECT  sum(r.return_date  - r.rental_date) as duration
                        ,rank() OVER (PARTITION BY i.store_id ORDER BY sum(r.return_date  - r.rental_date) DESC)
                        ,store_id
                        ,i.film_id
                FROM    inventory i JOIN rental r
                        USING(inventory_id)
                WHERE   r.return_date IS NOT NULL
                GROUP BY i.store_id, i.film_id
        ) AS durations
WHERE RANK <= 3;

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dvdrental
6 rows affected.


store_id,film_id
1,971
1,109
1,852
2,552
2,891
2,491


[Helpful Hints Video](https://youtu.be/COQem8x3kR4)  
 

--- 

# 3

### For each category, list the three longest movies
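
When several films share the same length, the choice of ranking function changes how many rows a "top three" filter returns. The sketch below compares `rank()`, `dense_rank()`, and `row_number()` on toy data. It assumes Python's built-in `sqlite3` module (SQLite 3.25 or newer) as a stand-in for PostgreSQL, and the table `toy_film` is hypothetical.

```python
import sqlite3

# Hypothetical toy table with ties in length inside a single category.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_film (category_id INTEGER, film_id INTEGER, length INTEGER)")
conn.executemany(
    "INSERT INTO toy_film VALUES (?, ?, ?)",
    [(1, 1, 185), (1, 2, 185), (1, 3, 179), (1, 4, 179), (1, 5, 170)],
)

# rank() leaves gaps after ties, dense_rank() does not, and row_number()
# numbers every row (film_id is added as a tiebreaker to keep it deterministic).
rows = conn.execute(
    """
    SELECT film_id,
           length,
           rank()       OVER (PARTITION BY category_id ORDER BY length DESC)          AS rnk,
           dense_rank() OVER (PARTITION BY category_id ORDER BY length DESC)          AS dense_rnk,
           row_number() OVER (PARTITION BY category_id ORDER BY length DESC, film_id) AS row_num
    FROM   toy_film
    ORDER BY category_id, row_num
    """
).fetchall()

for row in rows:
    print(row)

# Count how many rows each function would keep under a "<= 3" filter.
kept_by_rank = sum(1 for r in rows if r[2] <= 3)
kept_by_dense_rank = sum(1 for r in rows if r[3] <= 3)
kept_by_row_number = sum(1 for r in rows if r[4] <= 3)
print(kept_by_rank, kept_by_dense_rank, kept_by_row_number)
```

With ties, a `rank() <= 3` filter can return more than three rows per category, while `row_number() <= 3` always returns at most three but breaks ties by whatever the `ORDER BY` specifies.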

In [44]:
%%sql
EXPLAIN
SELECT  lengths.category_id
        ,lengths.film_id
        ,lengths.length
        FROM (SELECT  rank() OVER (PARTITION BY c.category_id ORDER BY f.length DESC)
                        ,c.category_id
                        ,f.film_id
                        ,f.length
                FROM    film f JOIN film_category c
                        USING(film_id)
                GROUP BY c.category_id, f.film_id
        ) AS lengths
WHERE RANK <= 3
ORDER BY category_id, length;

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dvdrental
14 rows affected.


QUERY PLAN
Sort (cost=206.42..207.25 rows=333 width=8)
"Sort Key: lengths.category_id, lengths.length"
-> Subquery Scan on lengths (cost=159.97..192.47 rows=333 width=8)
Filter: (lengths.rank <= 3)
-> WindowAgg (cost=159.97..179.97 rows=1000 width=16)
-> Sort (cost=159.97..162.47 rows=1000 width=8)
"Sort Key: c.category_id, f.length DESC"
-> HashAggregate (cost=100.14..110.14 rows=1000 width=8)
"Group Key: c.category_id, f.film_id"
-> Hash Join (cost=76.50..95.14 rows=1000 width=8)


In [43]:
%%sql
SELECT  lengths.category_id
        ,lengths.film_id
        ,lengths.length
        FROM (SELECT  rank() OVER (PARTITION BY c.category_id ORDER BY f.length DESC)
                        ,c.category_id
                        ,f.film_id
                        ,f.length
                FROM    film f JOIN film_category c
                        USING(film_id)
                GROUP BY c.category_id, f.film_id
        ) AS lengths
WHERE RANK <= 3
ORDER BY category_id, length;

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dvdrental
3 rows affected.


category_id,film_id,length
1,126,179
1,991,185
1,212,185


# 4

### For each customer, list their two shortest movie rentals.
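
A rental that has not been returned has a NULL `return_date`, so its duration is NULL as well. In PostgreSQL, NULLs sort as if larger than any other value, so they come first under `DESC` unless `NULLS LAST` is specified, and they can end up ranked ahead of real durations. One way to sidestep this is to filter NULL durations out before ranking, as in the sketch below. It assumes Python's built-in `sqlite3` module (SQLite 3.25 or newer) as a stand-in for PostgreSQL, and the table `toy_rental` and column `hours` are hypothetical.

```python
import sqlite3

# Hypothetical toy table: hours is NULL for rentals that were never returned.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_rental (customer_id INTEGER, film_id INTEGER, hours INTEGER)")
conn.executemany(
    "INSERT INTO toy_rental VALUES (?, ?, ?)",
    [
        (1, 100, 30), (1, 101, None), (1, 102, 20), (1, 103, 45),
        (2, 200, 10), (2, 201, None),
    ],
)

# Rows with a NULL duration are removed before the window function runs,
# so the ascending rank only ever compares real durations.
shortest = conn.execute(
    """
    SELECT customer_id, film_id, hours
    FROM (
        SELECT customer_id,
               film_id,
               hours,
               rank() OVER (PARTITION BY customer_id ORDER BY hours) AS rnk
        FROM   toy_rental
        WHERE  hours IS NOT NULL
    ) AS shorts
    WHERE rnk <= 2
    ORDER BY customer_id, rnk
    """
).fetchall()

for row in shortest:
    print(row)
```

Customer 2 has only one returned rental in this toy data, so only one row comes back for that customer.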

In [61]:
%%sql
EXPLAIN
SELECT  shorts.customer_id
        ,shorts.film_id
        ,shorts.rental_duration
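        -- Caution: ORDER BY -sum(...) DESC ranks the shortest durations first, but DESC also sorts NULLs first,
        -- so a rental with no return_date (NULL duration) is ranked as if it were the shortest.
        FROM (SELECT  rank() OVER (PARTITION BY c.customer_id ORDER BY -sum(r.return_date  - r.rental_date) DESC)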
                        ,sum(r.return_date  - r.rental_date) as rental_duration
                        ,c.customer_id
                        ,i.film_id
                FROM    rental r JOIN customer c
                        USING(customer_id)
                        JOIN inventory i
                        USING(inventory_id)
                GROUP BY i.film_id, c.customer_id
                ORDER BY c.customer_id, rental_duration
        ) AS shorts
WHERE RANK <= 2
ORDER BY customer_id, shorts.rental_duration;

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dvdrental
18 rows affected.


QUERY PLAN
Subquery Scan on shorts (cost=3589.07..3829.73 rows=5348 width=22)
Filter: (shorts.rank <= 2)
-> Sort (cost=3589.07..3629.18 rows=16044 width=46)
"Sort Key: c.customer_id, (sum((r.return_date - r.rental_date)))"
-> WindowAgg (cost=2107.43..2468.42 rows=16044 width=46)
-> Sort (cost=2107.43..2147.54 rows=16044 width=38)
"Sort Key: c.customer_id, ((- sum((r.return_date - r.rental_date)))) DESC"
-> HashAggregate (cost=786.23..986.78 rows=16044 width=38)
"Group Key: i.film_id, c.customer_id"
-> Hash Join (cost=150.55..545.57 rows=16044 width=22)


In [62]:
%%sql
SELECT  shorts.customer_id
        ,shorts.film_id
        ,shorts.rental_duration
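        -- Caution: ORDER BY -sum(...) DESC ranks the shortest durations first, but DESC also sorts NULLs first,
        -- so a rental with no return_date (NULL duration) is ranked as if it were the shortest
        -- (see the empty rental_duration in the output below).
        FROM (SELECT  rank() OVER (PARTITION BY c.customer_id ORDER BY -sum(r.return_date  - r.rental_date) DESC)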
                        ,sum(r.return_date  - r.rental_date) as rental_duration
                        ,c.customer_id
                        ,i.film_id
                FROM    rental r JOIN customer c
                        USING(customer_id)
                        JOIN inventory i
                        USING(inventory_id)
                GROUP BY i.film_id, c.customer_id
                ORDER BY c.customer_id, rental_duration
        ) AS shorts
WHERE RANK <= 2
ORDER BY customer_id, shorts.rental_duration;

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dvdrental
1199 rows affected.


customer_id,film_id,rental_duration
1,924,"1 day, 1:57:00"
1,341,"1 day, 2:44:00"
2,748,19:13:00
2,243,"1 day, 4:21:00"
3,86,19:33:00
3,367,"1 day, 5:02:00"
4,431,23:38:00
4,63,"1 day, 3:04:00"
5,408,20:10:00
5,345,


# 5

### List the quartile statistics of the movie lengths, grouped by release year.
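
PostgreSQL's `percentile_cont` returns a continuous percentile: when the requested fraction falls between two values of the ordered set, it interpolates between them. The plain-Python sketch below illustrates that idea on toy data. The helper name `percentile_cont` is reused here only for illustration, and the sample lengths are made up.

```python
import math


def percentile_cont(values, fraction):
    """Continuous percentile by linear interpolation between neighbouring sorted values."""
    ordered = sorted(values)
    if not ordered:
        return None
    # Fractional position of the requested percentile within the sorted list.
    position = fraction * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    weight = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * weight


# Made-up film lengths for illustration.
lengths = [10, 20, 30, 40]
quartiles = [percentile_cont(lengths, f) for f in (0.25, 0.5, 0.75)]
print(quartiles)
```

This is why a quartile can be a value such as 17.5 that does not appear in the data itself.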

In [75]:
%%sql
EXPLAIN
SELECT  release_year
        ,percentile_cont(0.25) WITHIN GROUP (ORDER BY length) as first_quartile
        ,percentile_cont(0.5) WITHIN GROUP (ORDER BY length) as second_quartile
        ,percentile_cont(0.75) WITHIN GROUP (ORDER BY length) as third_quartile
FROM    film
GROUP BY release_year;

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dvdrental
5 rows affected.


QUERY PLAN
GroupAggregate (cost=113.83..133.85 rows=1 width=28)
Group Key: release_year
-> Sort (cost=113.83..116.33 rows=1000 width=6)
Sort Key: release_year
-> Seq Scan on film (cost=0.00..64.00 rows=1000 width=6)


In [80]:
%%sql
SELECT  release_year
        ,percentile_cont(0.25) WITHIN GROUP (ORDER BY length) as first_quartile
        ,percentile_cont(0.5) WITHIN GROUP (ORDER BY length) as second_quartile
        ,percentile_cont(0.75) WITHIN GROUP (ORDER BY length) as third_quartile
FROM    film
GROUP BY release_year;

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dvdrental
1 rows affected.


release_year,first_quartile,second_quartile,third_quartile
2006,80.0,114.0,149.25


# Save your notebook, then `File > Close and Halt`