# ADVANCED: SQL Window Functions

This is one of the most powerful concepts in SQL data analysis. A window function lets us compare one row to another without doing any joins. This allows us to do simple things like create a running total, as well as trickier things like determine whether one row is greater than the previous row and classify it accordingly.

We connect to MySQL Server (with MySQL Workbench) and analyze the parch-and-posey database. This notebook covers the practical exercises from the Udacity course SQL for Data Analysis.

In [1]:
# we import some required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pprint import pprint
import time
print('Done!')

Done!


In [2]:
import mysql.connector  # import the connector submodule explicitly
from mysql.connector import Error
from getpass import getpass

db_name = 'parch_and_posey'
try:
    connection = mysql.connector.connect(host='localhost',
                                         database=db_name,
                                         user=input('Enter UserName:'),
                                         password=getpass('Enter Password:'))
    if connection.is_connected():
        db_Info = connection.get_server_info()
        print("Connected to MySQL Server version ", db_Info)
        cursor = connection.cursor()
        cursor.execute("select database();")
        record = cursor.fetchone()
        print("You're connected to database: ", record)

except Error as e:
    print("Error while connecting to MySQL", e)

Enter UserName:danam
Enter Password:········
Connected to MySQL Server version  8.0.24
You're connected to database:  ('parch_and_posey',)


In [3]:
def query_to_df(query):
    st = time.time()
    # Every query must end with a semi-colon
    if not query.endswith(';'):
        return 'ERROR: Query Must End with ;'

    # so we never have more than 20 rows displayed
    pd.set_option('display.max_rows', 20)

    # Process the query
    cursor.execute(query)
    rows = cursor.fetchall()
    columns = [col[0] for col in cursor.description]

    # Create a DataFrame from all results in one step (None if no rows came back)
    df = pd.DataFrame(rows, columns=columns) if rows else None
    print(f'Query ran for {time.time()-st} secs!')
    return df

In [4]:
# 1. For the accounts table

query = 'SELECT * FROM accounts LIMIT 3;'
query_to_df(query)

Query ran for 0.009805917739868164 secs!


Unnamed: 0,id,name,website,lat,longs,primary_poc,sales_rep_id
0,1001,Walmart,www.walmart.com,40.23849561,-75.10329704,Tamara Tuma,321500
1,1011,Exxon Mobil,www.exxonmobil.com,41.1691563,-73.84937379,Sung Shields,321510
2,1021,Apple,www.apple.com,42.29049481,-76.08400942,Jodee Lupo,321520


In [5]:
# 2. For the orders table

query = 'SELECT * FROM orders LIMIT 3;'
query_to_df(query)

Query ran for 0.015012025833129883 secs!


Unnamed: 0,id,account_id,occurred_at,standard_qty,gloss_qty,poster_qty,total,standard_amt_usd,gloss_amt_usd,poster_amt_usd,total_amt_usd
0,1,1001,2015-10-06 17:31:14,123,22,24,169,613.77,164.78,194.88,973.43
1,2,1001,2015-11-05 03:34:33,190,41,57,288,948.1,307.09,462.84,1718.03
2,3,1001,2015-12-04 04:21:55,85,47,0,132,424.15,352.03,0.0,776.18


In [6]:
# 3. For the sales_reps table

query = 'SELECT * FROM sales_reps LIMIT 3;'
query_to_df(query)

Query ran for 0.009804964065551758 secs!


Unnamed: 0,id,name,region_id
0,321500,Samuel Racine,1
1,321510,Eugena Esser,1
2,321520,Michel Averette,1


In [7]:
# 4. For the web_events table

query = 'SELECT * FROM web_events LIMIT 3;'
query_to_df(query)

Query ran for 0.007811784744262695 secs!


Unnamed: 0,id,account_id,occurred_at,channel
0,1,1001,2015-10-06 17:13:58,direct
1,2,1001,2015-11-05 03:08:26,direct
2,3,1001,2015-12-04 03:57:24,direct


In [8]:
# 5. For the region table

query = 'SELECT * FROM region LIMIT 3;'
query_to_df(query)

Query ran for 0.008397817611694336 secs!


Unnamed: 0,id,name
0,1,Northeast
1,2,Midwest
2,3,Southeast


**Let's find the running total of standard_qty paper purchased over time using a window function**

In [9]:
query_to_df(
"SELECT standard_qty, SUM(standard_qty) OVER (ORDER BY occurred_at) as running_total FROM orders;"
)

Query ran for 13.897042989730835 secs!


Unnamed: 0,standard_qty,running_total
0,0,0
1,490,490
2,528,1018
3,0,1018
4,492,1510
...,...,...
6907,0,1937478
6908,497,1937975
6909,38,1938013
6910,291,1938304


A **window function** performs an aggregate-like operation on a set of query rows. However, whereas an aggregate operation groups query rows into a single result row, a window function produces a result for each query row:

* The row for which function evaluation occurs is called the current row.
* The query rows related to the current row over which function evaluation occurs comprise the window for the current row.
* In the last query above, the first part, `SUM(standard_qty)`, is an ordinary `SUM` aggregation.
* Adding `OVER` designates it as a window function.
* Read as a whole, the expression says: take the sum of standard_qty across all rows up to and including the current row, in order of occurred_at.
* A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row; the rows retain their separate identities. Behind the scenes, the window function is able to access more than just the current row of the query result.
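
The running-total idea above can be tried without the MySQL connection. The following is a minimal sketch, assuming Python's built-in `sqlite3` module (SQLite 3.25 or newer, which supports window functions) and a made-up three-row `orders` table that stands in for the real one; the rows are illustrative only.

```python
import sqlite3

# Assumption: an in-memory SQLite table with made-up rows replaces the MySQL orders table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (occurred_at TEXT, standard_qty INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("2015-10-06", 123), ("2015-11-05", 190), ("2015-12-04", 85)],
)

# SUM(...) alone would collapse the rows into one; OVER keeps one output row per input row
running = conn.execute(
    "SELECT standard_qty, "
    "SUM(standard_qty) OVER (ORDER BY occurred_at) AS running_total "
    "FROM orders ORDER BY occurred_at;"
).fetchall()

for standard_qty, running_total in running:
    print(standard_qty, running_total)
```

Each output row keeps its own `standard_qty`, while `running_total` accumulates in `occurred_at` order.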

## OVER and PARTITION BY

`OVER` and `PARTITION BY` are key to window functions. Not every window function uses `PARTITION BY`; we can also use `ORDER BY`, or leave the `OVER` clause empty, depending on the query we want to run.

* The `OVER` clause has three components: `partitioning`, `ordering`, and `framing`. Partitioning is always supported, but support for ordering and framing depends on which type of window function you are using.
* `ORDER BY` in the `OVER` clause is optional for aggregates such as `SUM`. Without it, the function returns the total of the whole partition on every row (a subtotal needs no sorting); with it, the function returns a running total. Note that the `ORDER BY` within the `OVER` clause has nothing to do with an `ORDER BY` clause found in the query itself.
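
To make the three components of `OVER` concrete, here is a small sketch that uses partitioning alone in one column and partitioning, ordering, and an explicit frame in another. It assumes Python's built-in `sqlite3` module (SQLite 3.25 or newer) and a made-up `orders` table; the frame `ROWS BETWEEN 1 PRECEDING AND CURRENT ROW` is chosen only for illustration.

```python
import sqlite3

# Assumption: made-up rows in an in-memory SQLite table, not the real parch-and-posey data
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (account_id INTEGER, occurred_at TEXT, total INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1001, "2015-10-06", 169),
        (1001, "2015-11-05", 288),
        (1001, "2015-12-04", 132),
        (1011, "2016-12-21", 541),
    ],
)

rows = conn.execute(
    "SELECT account_id, total, "
    # partitioning only: the account total repeated on every row
    "SUM(total) OVER (PARTITION BY account_id) AS acct_total, "
    # partitioning + ordering + framing: sum of the current and the previous order
    "SUM(total) OVER (PARTITION BY account_id ORDER BY occurred_at "
    "ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS last_two "
    "FROM orders ORDER BY account_id, occurred_at;"
).fetchall()

for row in rows:
    print(row)  # (account_id, total, acct_total, last_two)
```

The frame restricts `last_two` to at most two rows of the partition, whereas `acct_total` always sees the whole partition.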

#### NOTE:

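A window function cannot appear inside a `GROUP BY` clause (or in `WHERE` or `HAVING`), because windowing is evaluated after those clauses. To combine grouping with a window function, aggregate first and then apply the window function to the grouped rows.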
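
One way to work with both kinds of calculation is to do the grouping in a separate step and apply the window function to the grouped rows afterwards. The sketch below does this with a CTE; it assumes Python's built-in `sqlite3` module (SQLite 3.25 or newer), a made-up `orders` table, and a hypothetical CTE name `per_account`.

```python
import sqlite3

# Assumption: made-up rows in an in-memory SQLite table, not the real data
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (account_id INTEGER, total INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1001, 169), (1001, 288), (1011, 541), (1021, 100), (1021, 50)],
)

ranked = conn.execute(
    # Step 1: ordinary aggregation, one row per account
    "WITH per_account AS ("
    "  SELECT account_id, SUM(total) AS total_qty FROM orders GROUP BY account_id) "
    # Step 2: window function over the already grouped rows
    "SELECT account_id, total_qty, "
    "RANK() OVER (ORDER BY total_qty DESC) AS qty_rank "
    "FROM per_account ORDER BY qty_rank;"
).fetchall()

for row in ranked:
    print(row)  # (account_id, total_qty, qty_rank)
```

The window function never appears inside `GROUP BY`; it only ranks the totals that the grouping step has already produced.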

### Understanding Window Functions

In [10]:
query_to_df(
"SELECT account_id, total FROM orders ORDER BY 1, 2;"
)

Query ran for 8.610218286514282 secs!


Unnamed: 0,account_id,total
0,1001,129
1,1001,132
2,1001,137
3,1001,148
4,1001,158
...,...,...
6907,4501,211
6908,4501,215
6909,4501,224
6910,4501,263


In [11]:
query_to_df(
"SELECT SUM(total) AS total_qty FROM orders;"
)

Query ran for 0.010746240615844727 secs!


Unnamed: 0,total_qty
0,3675765


In [12]:
query_to_df(
"SELECT account_id, SUM(total) AS total_qty FROM orders GROUP BY 1 ORDER BY 1;"
)

Query ran for 0.6699790954589844 secs!


Unnamed: 0,account_id,total_qty
0,1001,18924
1,1011,541
2,1021,3810
3,1031,1363
4,1041,2252
...,...,...
345,4461,31887
346,4471,515
347,4481,1380
348,4491,16806


By contrast, window operations do not collapse groups of query rows to a single output row. Instead, they produce a result for each row. Like the preceding queries, the following query uses SUM(), but this time as a window function:

In [13]:
query_to_df(
"SELECT account_id, total, SUM(total) OVER() AS total_std_qty, SUM(total) \
OVER(PARTITION BY account_id) AS accts_total_split FROM orders ORDER BY 1, 2;"
)

Query ran for 15.824997425079346 secs!


Unnamed: 0,account_id,total,total_std_qty,accts_total_split
0,1001,129,3675765,18924
1,1001,132,3675765,18924
2,1001,137,3675765,18924
3,1001,148,3675765,18924
4,1001,158,3675765,18924
...,...,...,...,...
6907,4501,211,3675765,2244
6908,4501,215,3675765,2244
6909,4501,224,3675765,2244
6910,4501,263,3675765,2244


Each window operation in the query is signified by inclusion of an `OVER` clause that specifies how to partition query rows into groups for processing by the window function:

* The first `OVER` clause is empty, which treats the entire set of query rows as a single partition. The window function thus produces a global sum, but does so for each row.

* The second `OVER` clause partitions rows by account_id, producing a sum per partition (per account_id). The function produces this sum for each partition row.

Window functions are permitted only in the select list and ORDER BY clause. Query result rows are determined from the FROM clause, after WHERE, GROUP BY, and HAVING processing, and windowing execution occurs before ORDER BY, LIMIT, and SELECT DISTINCT.
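
The processing order described above has a visible consequence: because `WHERE` runs before windowing, a window function only sees the rows that survive the filter. A minimal sketch, assuming Python's built-in `sqlite3` module (SQLite 3.25 or newer) and a made-up `orders` table:

```python
import sqlite3

# Assumption: made-up rows in an in-memory SQLite table, not the real data
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (account_id INTEGER, total INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1001, 169), (1001, 288), (1011, 541)],
)

# No filter: the empty OVER() sums every row in the table
all_rows = conn.execute(
    "SELECT account_id, total, SUM(total) OVER () AS grand_total "
    "FROM orders ORDER BY total;"
).fetchall()

# WHERE is applied first, so the window sum covers only account 1001
filtered = conn.execute(
    "SELECT account_id, total, SUM(total) OVER () AS grand_total "
    "FROM orders WHERE account_id = 1001 ORDER BY total;"
).fetchall()

print(all_rows)
print(filtered)
```

The same `SUM(total) OVER ()` expression yields a different value in the second query because the filtered rows are the only input to the window.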

**NOTE:**

The OVER clause is permitted for many aggregate functions, which therefore can be used as window or nonwindow functions, depending on whether the OVER clause is present or absent:

```
AVG()
BIT_AND()
BIT_OR()
BIT_XOR()
COUNT()
JSON_ARRAYAGG()
JSON_OBJECTAGG()
MAX()
MIN()
STDDEV_POP(), STDDEV(), STD()
STDDEV_SAMP()
SUM()
VAR_POP(), VARIANCE()
VAR_SAMP()
```

MySQL also supports nonaggregate functions that are used only as window functions. For these, the OVER clause is mandatory:

```
CUME_DIST()
DENSE_RANK()
FIRST_VALUE()
LAG()
LAST_VALUE()
LEAD()
NTH_VALUE()
NTILE()
PERCENT_RANK()
RANK()
ROW_NUMBER()
```

As an example of one of those nonaggregate window functions, this query below uses `ROW_NUMBER()`, which produces the row number of each row within its partition. In this case, rows are numbered per `account_id`. By default, partition rows are unordered and row numbering is nondeterministic. To sort partition rows, include an `ORDER BY` clause within the window definition. The query uses unordered and ordered partitions (the row_num1 and row_num2 columns) to illustrate the difference between omitting and including `ORDER BY`:

In [14]:
query_to_df(
"SELECT account_id, total, ROW_NUMBER() OVER(PARTITION BY account_id) AS row_num1, \
ROW_NUMBER() OVER(PARTITION BY account_id ORDER BY total) AS row_num2 FROM orders;"
)

Query ran for 10.429263830184937 secs!


Unnamed: 0,account_id,total,row_num1,row_num2
0,1001,129,9,1
1,1001,132,3,2
2,1001,137,11,3
3,1001,148,10,4
4,1001,158,13,5
...,...,...,...,...
6907,4501,211,12,9
6908,4501,215,4,10
6909,4501,224,5,11
6910,4501,263,13,12


**Creating a Running Total Using Window Functions**

Create a running total. This time, create a running total of standard_amt_usd (in the orders table) over order time with no date truncation. Your final table should have two columns: one with the amount being added for each new row, and a second with the running total.

In [None]:
query_to_df(
"SELECT standard_amt_usd, SUM(standard_amt_usd) OVER(ORDER BY occurred_at) AS running_total FROM orders;"
)

**Creating a Partitioned Running Total Using Window Functions**

Now, modify your query from the previous quiz to include partitions. Still create a running total of standard_amt_usd (in the orders table) over order time, but this time, date truncate occurred_at by year and partition by that same year-truncated occurred_at variable. Your final table should have three columns: One with the amount being added for each row, one for the truncated date, and a final column with the running total within each year.

In [None]:
query_to_df(
# YEAR() stands in for DATE_TRUNC('year', ...), which MySQL does not provide
"SELECT standard_amt_usd, YEAR(occurred_at) AS year, SUM(standard_amt_usd) \
OVER(PARTITION BY YEAR(occurred_at) ORDER BY occurred_at) AS running_total FROM orders;"
)

### ROW_NUMBER() and RANK() and DENSE_RANK():

* `ROW_NUMBER()` does just what it sounds like. It displays the number of a given row within the window we define. It starts at 1 and numbers the rows according to the `ORDER BY` part of the window definition.
* `ROW_NUMBER()` does not require an argument within its parentheses.
* Using the `PARTITION BY` clause within the window definition, we can restart the row-number count at 1 in each partition.
* `RANK()` behaves much like `ROW_NUMBER()`, with one subtle difference: if two or more rows have the same value for the `ORDER BY` column in the window definition, they are given the same rank, whereas `ROW_NUMBER()` would have given them different numbers. `RANK()` then skips the following values to account for the tied rows.
* `RANK()` and `DENSE_RANK()` do not take arguments within their parentheses either.
* `DENSE_RANK()` doesn't skip values after assigning several rows the same rank.
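
The difference between the three functions shows up only when there are ties. The sketch below puts them side by side on a made-up table in which two orders share the same `total`; it assumes Python's built-in `sqlite3` module (SQLite 3.25 or newer). The extra `id` tiebreaker for `ROW_NUMBER()` is an addition made here so that the numbering is deterministic.

```python
import sqlite3

# Assumption: made-up rows in an in-memory SQLite table; ids 2 and 3 tie on total
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, total INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 300), (2, 200), (3, 200), (4, 100)],
)

ranks = conn.execute(
    "SELECT id, total, "
    # id breaks the tie so that the row numbers are reproducible
    "ROW_NUMBER() OVER (ORDER BY total DESC, id) AS row_num, "
    "RANK() OVER (ORDER BY total DESC) AS rnk, "
    "DENSE_RANK() OVER (ORDER BY total DESC) AS dense_rnk "
    "FROM orders ORDER BY total DESC, id;"
).fetchall()

for row in ranks:
    print(row)  # (id, total, row_num, rnk, dense_rnk)
```

For the tied rows, `ROW_NUMBER()` still hands out distinct numbers, `RANK()` repeats the rank and then jumps over the next value, and `DENSE_RANK()` repeats the rank and continues without a gap.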


#### Exercise

**Ranking Total Paper Ordered by Account**

Select the id, account_id, and total variable from the orders table, then create a column called total_rank that ranks this total amount of paper ordered (from highest to lowest) for each account using a partition. Your final table should have these four columns.

In [None]:
query_to_df(
"SELECT id, account_id, total, RANK() OVER(PARTITION BY account_id ORDER BY total DESC) AS total_rank FROM orders;"
)

### QUIZ:

**Aggregates in Window Functions**
```
SELECT id,
       account_id,
       standard_qty,
       DATE_TRUNC('month', occurred_at) AS month,
       DENSE_RANK() OVER (PARTITION BY account_id ORDER BY DATE_TRUNC('month',occurred_at)) AS dense_rank,
       SUM(standard_qty) OVER (PARTITION BY account_id ORDER BY DATE_TRUNC('month',occurred_at)) AS sum_std_qty,
       COUNT(standard_qty) OVER (PARTITION BY account_id ORDER BY DATE_TRUNC('month',occurred_at)) AS count_std_qty,
       AVG(standard_qty) OVER (PARTITION BY account_id ORDER BY DATE_TRUNC('month',occurred_at)) AS avg_std_qty,
       MIN(standard_qty) OVER (PARTITION BY account_id ORDER BY DATE_TRUNC('month',occurred_at)) AS min_std_qty,
       MAX(standard_qty) OVER (PARTITION BY account_id ORDER BY DATE_TRUNC('month',occurred_at)) AS max_std_qty
FROM orders
```

The quiz query above is written for PostgreSQL. MySQL has no `DATE_TRUNC`, so the cells below use `MONTH(occurred_at)` instead. Note that `MONTH()` returns only the month number (1 to 12), so the same calendar month in different years falls into one group, unlike `DATE_TRUNC('month', occurred_at)`.

In [None]:
query_to_df(
"SELECT id, account_id, standard_qty, MONTH(occurred_at) AS month, \
DENSE_RANK() OVER(PARTITION BY account_id ORDER BY MONTH(occurred_at)) AS dens_rank, \
SUM(standard_qty) OVER(PARTITION BY account_id ORDER BY MONTH(occurred_at)) AS sum_std_qty, \
COUNT(standard_qty) OVER (PARTITION BY account_id ORDER BY MONTH(occurred_at)) AS count_std_qty, \
AVG(standard_qty) OVER (PARTITION BY account_id ORDER BY MONTH(occurred_at)) AS avg_std_qty, \
MIN(standard_qty) OVER (PARTITION BY account_id ORDER BY MONTH(occurred_at)) AS min_std_qty, \
MAX(standard_qty) OVER (PARTITION BY account_id ORDER BY MONTH(occurred_at)) AS max_std_qty \
FROM orders;"
)

Now remove the ORDER BY clause from each window function in the query above. Evaluate your new query and compare it to the previous results.

In [None]:
query_to_df(
"SELECT id, account_id, standard_qty, MONTH(occurred_at) AS month, \
DENSE_RANK() OVER(PARTITION BY account_id) AS dens_rank, \
SUM(standard_qty) OVER(PARTITION BY account_id) AS sum_std_qty, \
COUNT(standard_qty) OVER (PARTITION BY account_id) AS count_std_qty, \
AVG(standard_qty) OVER (PARTITION BY account_id) AS avg_std_qty, \
MIN(standard_qty) OVER (PARTITION BY account_id) AS min_std_qty, \
MAX(standard_qty) OVER (PARTITION BY account_id) AS max_std_qty \
FROM orders;"
)

**Answer the following questions**

* What is the value of dense_rank in every row for the following account_id values <1001, 1011, 1021>?
* What is the value of sum_std_qty in the first row for the following account_id values <1001, 1011, 1021>?
* **Reflect...**

What happens when you omit the ORDER BY clause when doing aggregates with window functions? Use the results from the queries above to guide your thoughts, then jot them down in a few sentences.

In [None]:
query_to_df(
"WITH \
t1 AS (SELECT id, account_id, standard_qty, MONTH(occurred_at) AS month, \
DENSE_RANK() OVER(PARTITION BY account_id) AS dens_rank, \
SUM(standard_qty) OVER(PARTITION BY account_id) AS sum_std_qty, \
COUNT(standard_qty) OVER (PARTITION BY account_id) AS count_std_qty, \
AVG(standard_qty) OVER (PARTITION BY account_id) AS avg_std_qty, \
MIN(standard_qty) OVER (PARTITION BY account_id) AS min_std_qty, \
MAX(standard_qty) OVER (PARTITION BY account_id) AS max_std_qty FROM orders), \
\
t2 AS (SELECT DISTINCT account_id, dens_rank FROM t1 WHERE account_id IN(1001, 1011, 1021)) \
SELECT * FROM t2;"
)

Nice! That's correct. Since we removed ORDER BY, the dense rank (`dens_rank`) is constant at 1 for every row of every account_id.

In [None]:
query_to_df(
"WITH \
t1 AS (SELECT id, account_id, standard_qty, MONTH(occurred_at) AS month, \
DENSE_RANK() OVER(PARTITION BY account_id) AS dens_rank, \
SUM(standard_qty) OVER(PARTITION BY account_id) AS sum_std_qty, \
COUNT(standard_qty) OVER (PARTITION BY account_id) AS count_std_qty, \
AVG(standard_qty) OVER (PARTITION BY account_id) AS avg_std_qty, \
MIN(standard_qty) OVER (PARTITION BY account_id) AS min_std_qty, \
MAX(standard_qty) OVER (PARTITION BY account_id) AS max_std_qty FROM orders), \
\
t2 AS (SELECT DISTINCT account_id, sum_std_qty FROM t1 WHERE account_id IN(1001, 1011, 1021)) \
SELECT * FROM t2;"
)

Nice! That's correct. If you look closely, `sum_std_qty` is also constant across all rows within each account_id.

Note that when we omit the ORDER BY clause from aggregates used as window functions, the following happens:

1. Ranking cannot distinguish between rows, so every row is given a rank of 1.
2. All rows in a partition are lumped together and treated as one group, yet the result is reported on each individual row.

As Stack Overflow user mathguy explains:

The easiest way to think about this - leaving the `ORDER BY` out is equivalent to "ordering" in a way that all rows in the partition are "equal" to each other. Indeed, you can get the same effect by explicitly adding the `ORDER BY` clause like this: `ORDER BY 0` (or "order by" any constant expression), or even, more emphatically, `ORDER BY NULL`.
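
The effect of leaving out `ORDER BY` can be checked on a few rows. The following is a minimal sketch, assuming Python's built-in `sqlite3` module (SQLite 3.25 or newer) and a made-up table with a hypothetical `order_month` column; two of the rows share the same month so that the behaviour of tied rows is visible too.

```python
import sqlite3

# Assumption: made-up rows in an in-memory SQLite table; two orders fall in month 10
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (account_id INTEGER, order_month INTEGER, standard_qty INTEGER)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1001, 10, 100), (1001, 10, 50), (1001, 11, 30)],
)

rows = conn.execute(
    "SELECT order_month, standard_qty, "
    # no ORDER BY: every row of the partition counts, so this is the partition total
    "SUM(standard_qty) OVER (PARTITION BY account_id) AS no_order, "
    # with ORDER BY: a running total; rows tied on order_month share one value
    "SUM(standard_qty) OVER (PARTITION BY account_id ORDER BY order_month) AS with_order, "
    # no ORDER BY: all rows are peers, so the rank is 1 everywhere
    "DENSE_RANK() OVER (PARTITION BY account_id) AS rank_no_order "
    "FROM orders ORDER BY order_month, standard_qty DESC;"
).fetchall()

for row in rows:
    print(row)  # (order_month, standard_qty, no_order, with_order, rank_no_order)
```

Without `ORDER BY`, every row reports the partition total and a rank of 1. With `ORDER BY order_month`, the two month-10 rows are treated as equal and therefore receive the same running sum.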

### Aliases For Multiple Window Functions:

If we plan to write several window functions in the same query using the same window, we can create an alias for the window.

* We define the alias using the `WINDOW` clause, which in MySQL goes after the `HAVING` clause and before `ORDER BY`.
* If our query has no `WHERE`, `GROUP BY`, or `HAVING` clause, the `WINDOW` clause comes directly after `FROM`.
* This makes the query a lot easier to read while giving the same results.
* Let's see an example with the earlier quiz query.

In [None]:
query_to_df(
"SELECT id, account_id, standard_qty, MONTH(occurred_at) AS month, \
DENSE_RANK() OVER main_window AS dens_rank, \
SUM(standard_qty) OVER main_window AS sum_std_qty, \
COUNT(standard_qty) OVER main_window AS count_std_qty, \
AVG(standard_qty) OVER main_window AS avg_std_qty, \
MIN(standard_qty) OVER main_window AS min_std_qty, \
MAX(standard_qty) OVER main_window AS max_std_qty \
FROM orders \
WINDOW main_window AS (PARTITION BY account_id ORDER BY MONTH(occurred_at));"
)

### Comparing a Row to a previous Row:

* `LAG` function: Its purpose is to return the value from a previous row to the current row in the table.
* `LEAD` function: Its purpose is to return the value from the row following the current row in the table.

**Scenarios for using LAG and LEAD functions**

You can use LAG and LEAD functions whenever you are trying to compare the values in adjacent rows or rows that are offset by a certain number.

Example 1: You have a sales dataset with the following data and need to compare how the market segments fare against each other on profits earned.
```
Market Segment	Profits earned by each market segment
            A	$550
            B	$500
            C	$670
            D	$730
            E	$982
```
Example 2: You have an inventory dataset and need to compare the number of days elapsed between each subsequent order placed for each item.
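
Example 1 can be sketched directly with the five profit figures listed above. The block assumes Python's built-in `sqlite3` module (SQLite 3.25 or newer); the table name `segments` and the choice to order the rows by segment name are assumptions made for illustration.

```python
import sqlite3

# Assumption: the Example 1 figures loaded into a hypothetical in-memory table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE segments (segment TEXT, profit INTEGER)")
conn.executemany(
    "INSERT INTO segments VALUES (?, ?)",
    [("A", 550), ("B", 500), ("C", 670), ("D", 730), ("E", 982)],
)

compared = conn.execute(
    "SELECT segment, profit, "
    "LAG(profit) OVER (ORDER BY segment) AS prev_profit, "   # value from the row before
    "LEAD(profit) OVER (ORDER BY segment) AS next_profit, "  # value from the row after
    "profit - LAG(profit) OVER (ORDER BY segment) AS diff_from_prev "
    "FROM segments ORDER BY segment;"
).fetchall()

for row in compared:
    print(row)  # (segment, profit, prev_profit, next_profit, diff_from_prev)
```

The first row has no previous row and the last row has no following row, so `LAG` and `LEAD` return NULL there (shown as `None` in Python), and the difference for the first row is NULL as well.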

**Example of LAG function...**

In [15]:
# First the Inner Query

query_to_df(
"SELECT * FROM orders LIMIT 2;"
)

Query ran for 0.008150577545166016 secs!


Unnamed: 0,id,account_id,occurred_at,standard_qty,gloss_qty,poster_qty,total,standard_amt_usd,gloss_amt_usd,poster_amt_usd,total_amt_usd
0,1,1001,2015-10-06 17:31:14,123,22,24,169,613.77,164.78,194.88,973.43
1,2,1001,2015-11-05 03:34:33,190,41,57,288,948.1,307.09,462.84,1718.03
