# Analytic Functions 

**Analytic functions**, also known as **windowing functions**, are a powerful SQL tool that lets each record pull in context from other records. This will make more sense after a few examples. Some of the tasks we previously handled with subqueries, derived tables, and common table expressions have simpler windowing-function equivalents, but those other approaches remain highly flexible and necessary to know. As we will see, common analytic operations can often be done with windowing functions rather than subquerying tools. 

Let's set up first with the `company_operations.db` database. 

In [None]:
import sqlite3
import pandas as pd 

pd.options.display.max_rows = 999

conn = sqlite3.connect('company_operations.db')
pd.read_sql("SELECT * FROM WEATHER_MONITOR LIMIT 10", conn)

## PARTITION BY

Let's say that alongside every `WEATHER_MONITOR` record, we also want to show the average `TEMPERATURE` for that record's `YEAR` and `MONTH`. Previously we would use a subquery, derived table, or common table expression to achieve this. 

In [None]:
sql = """
WITH temp_avgs AS (
    SELECT strftime('%Y', REPORT_DATE) AS YEAR, 
    strftime('%m', REPORT_DATE) AS MONTH,
    AVG(TEMPERATURE) AS AVG_TEMP 
    FROM WEATHER_MONITOR
    GROUP BY 1, 2
) 

SELECT ID, 
REPORT_CODE, 
REPORT_DATE, 
LOCATION_ID, 
TEMPERATURE, 
AVG_TEMP

FROM WEATHER_MONITOR INNER JOIN temp_avgs

ON strftime('%Y', REPORT_DATE) = temp_avgs.YEAR
AND strftime('%m', REPORT_DATE) = temp_avgs.MONTH
"""
            
pd.read_sql(sql, conn)


While common table expressions and subqueries are highly useful and customizable, this specific task is so common that there are special functions and operators for it. Instead of doing all this common table expression and join work, we can take the average temperature `AVG(TEMPERATURE)` and use `PARTITION BY` to compute it over all records sharing the same year and month. 

In [None]:
sql = """
SELECT ID, 
REPORT_CODE, 
REPORT_DATE, 
LOCATION_ID, 
TEMPERATURE, 
AVG(TEMPERATURE) OVER (PARTITION BY strftime('%Y', REPORT_DATE), strftime('%m', REPORT_DATE)) AS AVG_TEMP_Y_M

FROM WEATHER_MONITOR 

ORDER BY ID
"""
            
pd.read_sql(sql, conn, index_col='ID')


What is particularly powerful about windowing clauses like `PARTITION BY` is that we can mix and match different scopes and contexts with familiar aggregate functions like `MIN`, `MAX`, `AVG`, `SUM`, and `COUNT`. Below we add a few more analytic fields that get the average, minimum, and maximum temperatures for each record's `LOCATION_ID`. 

In [None]:
sql = """
SELECT ID, 
REPORT_CODE, 
REPORT_DATE, 
LOCATION_ID, 
TEMPERATURE, 
AVG(TEMPERATURE) OVER (PARTITION BY strftime('%Y', REPORT_DATE), strftime('%m', REPORT_DATE)) AS AVG_TEMP_Y_M,
AVG(TEMPERATURE) OVER (PARTITION BY LOCATION_ID) AVG_TEMP_LOCATION, 
MIN(TEMPERATURE) OVER (PARTITION BY LOCATION_ID) MIN_TEMP_LOCATION,
MAX(TEMPERATURE) OVER (PARTITION BY LOCATION_ID) MAX_TEMP_LOCATION

FROM WEATHER_MONITOR 

ORDER BY ID
"""
            
pd.read_sql(sql, conn)


We can also reuse windowing clauses and alias them using the `WINDOW` keyword. 

In [None]:
sql = """
SELECT ID, 
REPORT_CODE, 
REPORT_DATE, 
LOCATION_ID, 
TEMPERATURE, 
AVG(TEMPERATURE) OVER ym AS AVG_TEMP_Y_M,
AVG(TEMPERATURE) OVER loc AVG_TEMP_LOCATION, 
MIN(TEMPERATURE) OVER loc MIN_TEMP_LOCATION,
MAX(TEMPERATURE) OVER loc MAX_TEMP_LOCATION

FROM WEATHER_MONITOR 

WINDOW ym AS (PARTITION BY strftime('%Y', REPORT_DATE), strftime('%m', REPORT_DATE)),
loc AS (PARTITION BY LOCATION_ID)

ORDER BY ID
"""
            
pd.read_sql(sql, conn)


Keep in mind that windowing functions only see records that pass the `WHERE` condition. If you need to reach records that the `WHERE` condition excludes, you will need to go back to using subqueries and common table expressions. Notice how adding a `WHERE` condition for a single `REPORT_CODE` to the query above cuts all the other data off from the windowing functions, making all the statistical values `50` across the board since there is now only one data point.  

In [None]:
sql = """
SELECT ID, 
REPORT_CODE, 
REPORT_DATE, 
LOCATION_ID, 
TEMPERATURE, 
AVG(TEMPERATURE) OVER ym AS AVG_TEMP_Y_M,
AVG(TEMPERATURE) OVER loc AVG_TEMP_LOCATION, 
MIN(TEMPERATURE) OVER loc MIN_TEMP_LOCATION,
MAX(TEMPERATURE) OVER loc MAX_TEMP_LOCATION

FROM WEATHER_MONITOR 
WHERE REPORT_CODE = 'UVYMMWW' 

WINDOW ym AS (PARTITION BY strftime('%Y', REPORT_DATE), strftime('%m', REPORT_DATE)),
loc AS (PARTITION BY LOCATION_ID)
"""
            
pd.read_sql(sql, conn)
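
One way to keep the full-table statistics while still filtering is to compute the window inside a common table expression and apply the filter outside it. The sketch below is only an illustration: it uses a hypothetical in-memory table with made-up values rather than the real `company_operations.db`, and it assumes the bundled SQLite is version 3.25 or newer (the first version with window functions).

```python
import sqlite3

# Hypothetical in-memory stand-in for WEATHER_MONITOR; the values are made up.
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE WEATHER_MONITOR "
    "(ID INTEGER, REPORT_CODE TEXT, LOCATION_ID INTEGER, TEMPERATURE REAL)"
)
demo.executemany(
    "INSERT INTO WEATHER_MONITOR VALUES (?, ?, ?, ?)",
    [(1, "A", 10, 50.0), (2, "B", 10, 70.0), (3, "C", 20, 30.0)],
)

# WHERE in the same SELECT: the window only sees the one surviving record.
filtered_first = demo.execute("""
SELECT ID,
AVG(TEMPERATURE) OVER (PARTITION BY LOCATION_ID) AS AVG_TEMP_LOCATION
FROM WEATHER_MONITOR
WHERE REPORT_CODE = 'A'
""").fetchall()

# Window computed in a CTE over all records, filter applied afterwards.
windowed_first = demo.execute("""
WITH w AS (
    SELECT ID, REPORT_CODE,
    AVG(TEMPERATURE) OVER (PARTITION BY LOCATION_ID) AS AVG_TEMP_LOCATION
    FROM WEATHER_MONITOR
)
SELECT ID, AVG_TEMP_LOCATION FROM w
WHERE REPORT_CODE = 'A'
""").fetchall()

print(filtered_first)   # [(1, 50.0)] -> average of the filtered record only
print(windowed_first)   # [(1, 60.0)] -> average of both records at location 10
```

The second query costs a little more typing, but the location average still reflects every record at that location.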


## ORDER BY 

Here is another useful application of windowing functions. Recall we can use self joins with inequality join conditions to, for example, get a rolling total of orders. Assuming `CUSTOMER_ORDER_ID` reflects the chronological order in which orders came in, I can join each record to itself and to every record before it, then sum those quantities as a `ROLLING_QTY`.  

In [None]:
sql = """
SELECT c1.CUSTOMER_ORDER_ID, 
c1.ORDER_DATE,
c1.PRODUCT_ID,
c1.CUSTOMER_ID,
c1.QUANTITY,
SUM(c2.QUANTITY) as ROLLING_QTY

FROM CUSTOMER_ORDER c1 INNER JOIN CUSTOMER_ORDER c2
ON c1.CUSTOMER_ORDER_ID >= c2.CUSTOMER_ORDER_ID

GROUP BY 1, 2, 3, 4, 5
"""

pd.read_sql(sql, conn)

I can simplify this greatly using an `ORDER BY` clause in an analytic function. 

In [None]:
sql = """
SELECT CUSTOMER_ORDER_ID, 
ORDER_DATE,
PRODUCT_ID,
CUSTOMER_ID,
QUANTITY,
SUM(QUANTITY) OVER (ORDER BY CUSTOMER_ORDER_ID) as ROLLING_QTY

FROM CUSTOMER_ORDER
"""

pd.read_sql(sql, conn)

No more complicated self joins with weird `GROUP BY` logic! Now notice that if we did `ORDER BY ORDER_DATE` rather than `ORDER BY CUSTOMER_ORDER_ID` something weird happens. 

In [None]:
sql = """
SELECT CUSTOMER_ORDER_ID, 
ORDER_DATE,
PRODUCT_ID,
CUSTOMER_ID,
QUANTITY,
SUM(QUANTITY) OVER (ORDER BY ORDER_DATE) as ROLLING_QTY

FROM CUSTOMER_ORDER
"""

pd.read_sql(sql, conn)

Every record with the same `ORDER_DATE` has the same `ROLLING_QTY`. The reason is that `ORDER_DATE` values are not unique, and the default window frame (`RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW`) treats all records sharing an `ORDER_DATE` as peers, so each record receives the running total through the end of its day. If we want to total on a row-by-row basis, it's better to order on a unique field like `CUSTOMER_ORDER_ID`. But if you still want to order by `ORDER_DATE` and accumulate row by row, use a `ROWS BETWEEN` frame and specify the bounds. 

In [None]:
sql = """
SELECT CUSTOMER_ORDER_ID, 
ORDER_DATE,
PRODUCT_ID,
CUSTOMER_ID,
QUANTITY,
SUM(QUANTITY) OVER (ORDER BY ORDER_DATE ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as ROLLING_QTY
FROM CUSTOMER_ORDER
"""

pd.read_sql(sql, conn)

Be careful when using `ROWS BETWEEN`: records that tie on the `ORDER BY` field are fed into the function in an arbitrary order, so the row-by-row totals within a tie can shift if the records happen to arrive in a different order. The default `RANGE BETWEEN` behavior, which works on logical values rather than individual rows, is usually preferred. 
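
To make the difference between the two frame types concrete, here is a minimal sketch on a hypothetical three-row table (made-up data, not `CUSTOMER_ORDER`). It compares the default `RANGE` frame, a `ROWS` frame ordered only by date, and a `ROWS` frame with a unique tiebreaker added to the `ORDER BY`. It assumes SQLite 3.25 or newer.

```python
import sqlite3

# Hypothetical table: two orders share a date, one falls on the next day.
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE DEMO_ORDER (ORDER_ID INTEGER, ORDER_DATE TEXT, QUANTITY INTEGER)"
)
demo.executemany(
    "INSERT INTO DEMO_ORDER VALUES (?, ?, ?)",
    [(1, "2024-01-01", 10), (2, "2024-01-01", 20), (3, "2024-01-02", 5)],
)

totals = demo.execute("""
SELECT ORDER_ID,
-- default frame: RANGE, so tied dates are totaled together
SUM(QUANTITY) OVER (ORDER BY ORDER_DATE) AS RANGE_TOTAL,
-- ROWS frame: row by row, but the order within a tied date is arbitrary
SUM(QUANTITY) OVER (ORDER BY ORDER_DATE
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS ROWS_TOTAL,
-- ROWS frame with a unique tiebreaker: row by row and repeatable
SUM(QUANTITY) OVER (ORDER BY ORDER_DATE, ORDER_ID
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS ROWS_TOTAL_TIEBREAK
FROM DEMO_ORDER
ORDER BY ORDER_ID
""").fetchall()

for row in totals:
    print(row)
```

`RANGE_TOTAL` is 30, 30, 35 and `ROWS_TOTAL_TIEBREAK` is 10, 30, 35. The values of `ROWS_TOTAL` for the two tied rows are not guaranteed, which is why adding a unique field to the `ORDER BY` is a reasonable safeguard whenever a `ROWS` frame is used.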
 
We can also create rolling averages by changing the bounds. Below we create a rolling average between the 3 preceding and 3 following records. 

In [None]:
sql = """
SELECT CUSTOMER_ORDER_ID, 
ORDER_DATE,
PRODUCT_ID,
CUSTOMER_ID,
QUANTITY,
AVG(QUANTITY) OVER (ORDER BY ORDER_DATE ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING) as ROLLING_AVG
FROM CUSTOMER_ORDER
"""

pd.read_sql(sql, conn)
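
One detail worth knowing about bounded frames is that they shrink at the edges: the first and last records simply have fewer neighbors to average. The sketch below shows this on a hypothetical five-row table (made-up data) with a smaller window of 1 preceding and 1 following, and adds a `COUNT(*)` over the same frame to show how many records each average covers. It assumes SQLite 3.25 or newer.

```python
import sqlite3

# Hypothetical table with one order per ID so the ordering is unambiguous.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE DEMO_ORDER (ORDER_ID INTEGER, QUANTITY INTEGER)")
demo.executemany(
    "INSERT INTO DEMO_ORDER VALUES (?, ?)",
    [(1, 10), (2, 20), (3, 30), (4, 40), (5, 50)],
)

rolling = demo.execute("""
SELECT ORDER_ID,
QUANTITY,
AVG(QUANTITY) OVER (ORDER BY ORDER_ID
    ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS ROLLING_AVG,
COUNT(*) OVER (ORDER BY ORDER_ID
    ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS ROWS_IN_FRAME
FROM DEMO_ORDER
ORDER BY ORDER_ID
""").fetchall()

for row in rolling:
    print(row)
# (1, 10, 15.0, 2)  <- no preceding record, so only 2 rows in the frame
# (2, 20, 20.0, 3)
# (3, 30, 30.0, 3)
# (4, 40, 40.0, 3)
# (5, 50, 45.0, 2)  <- no following record
```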

Let's go back to using the default `RANGE BETWEEN` logic. If you want the rolling total computed only within records sharing the same `PRODUCT_ID` and `CUSTOMER_ID`, add a `PARTITION BY` again. As you scan the records, notice how the rolling totals only account for records sharing the same `CUSTOMER_ID` and `PRODUCT_ID`. 

In [None]:
pd.set_option('display.max_rows', None)

sql = """
SELECT CUSTOMER_ORDER_ID, 
ORDER_DATE,
PRODUCT_ID,
CUSTOMER_ID,
QUANTITY,
SUM(QUANTITY) OVER (PARTITION BY PRODUCT_ID, CUSTOMER_ID ORDER BY ORDER_DATE) as ROLLING_QTY

FROM CUSTOMER_ORDER

ORDER BY CUSTOMER_ORDER_ID
"""

pd.read_sql(sql, conn)

## LEAD and LAG 

Two other highly useful windowing functions are `LEAD()` and `LAG()`. These allow you to retrieve another record's value based on an ordered field. Below, we use `LAG()` to look up the previous record's value. Compare the `QUANTITY` and `PREV_QTY` columns below and you will see a pattern! 

In [None]:
sql = """
SELECT CUSTOMER_ORDER_ID, 
CUSTOMER_ID,
ORDER_DATE, 
PRODUCT_ID,
QUANTITY,
LAG(QUANTITY, 1, 0) OVER (ORDER BY ORDER_DATE) AS PREV_QTY
FROM CUSTOMER_ORDER 
"""

pd.read_sql(sql, conn)

`LEAD()` looks at the next record ahead instead. 

In [None]:
sql = """
SELECT CUSTOMER_ORDER_ID, 
CUSTOMER_ID,
ORDER_DATE, 
PRODUCT_ID,
QUANTITY,
LEAD(QUANTITY, 1, 0) OVER (ORDER BY ORDER_DATE) AS NEXT_QTY
FROM CUSTOMER_ORDER 
"""

pd.read_sql(sql, conn)

You will see that the second and third arguments, 1 and 0 in these cases, control the number of records to look ahead or behind and the default value returned when no such record exists. Below, we change the `LAG()` to retrieve the third record behind it and default the value to `-1` if there is none to retrieve. 

In [None]:
sql = """
SELECT CUSTOMER_ORDER_ID, 
CUSTOMER_ID,
ORDER_DATE, 
PRODUCT_ID,
QUANTITY,
LAG(QUANTITY, 3, -1) OVER (ORDER BY ORDER_DATE) AS PREV_QTY
FROM CUSTOMER_ORDER 
"""

pd.read_sql(sql, conn)
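
`LAG()` and `LEAD()` also accept a `PARTITION BY`, so the look-behind can be restricted to records from the same group. The sketch below uses a hypothetical four-row table (made-up data) to look up each customer's own previous quantity, once with a default of `0` and once with the second and third arguments omitted, in which case the offset is 1 and a missing record yields `NULL` (`None` in Python). It assumes SQLite 3.25 or newer.

```python
import sqlite3

# Hypothetical table: two customers with two orders each.
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE DEMO_ORDER (CUSTOMER_ID INTEGER, ORDER_DATE TEXT, QUANTITY INTEGER)"
)
demo.executemany(
    "INSERT INTO DEMO_ORDER VALUES (?, ?, ?)",
    [
        (1, "2024-01-01", 10),
        (1, "2024-01-02", 15),
        (2, "2024-01-01", 7),
        (2, "2024-01-02", 4),
    ],
)

lagged = demo.execute("""
SELECT CUSTOMER_ID,
ORDER_DATE,
QUANTITY,
LAG(QUANTITY, 1, 0) OVER (PARTITION BY CUSTOMER_ID ORDER BY ORDER_DATE) AS PREV_QTY,
LAG(QUANTITY) OVER (PARTITION BY CUSTOMER_ID ORDER BY ORDER_DATE) AS PREV_QTY_OR_NULL
FROM DEMO_ORDER
ORDER BY CUSTOMER_ID, ORDER_DATE
""").fetchall()

for row in lagged:
    print(row)
# (1, '2024-01-01', 10, 0, None)   <- first order for customer 1
# (1, '2024-01-02', 15, 10, 10)
# (2, '2024-01-01', 7, 0, None)    <- the lag restarts for customer 2
# (2, '2024-01-02', 4, 7, 7)
```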

## Ranking

The `ROW_NUMBER()` windowing function is highly helpful for ranking items. For example, say I wanted to get the top 3 selling products by customer. I can total the quantity for each `CUSTOMER_ID` and `PRODUCT_ID`, use `ROW_NUMBER()` to assign a ranking number to each total within its `CUSTOMER_ID`, and then filter for only the first three items.





In [None]:
sql = """
WITH TOTAL_QTYS AS (
  SELECT CUSTOMER_ID, PRODUCT_ID, SUM(QUANTITY) AS TOTAL_QTY 
  FROM CUSTOMER_ORDER 
  GROUP BY 1,2
),

PRODUCT_SALES_BY_CUSTOMER AS (
   SELECT CUSTOMER_ID, PRODUCT_ID, TOTAL_QTY,
   ROW_NUMBER() OVER (PARTITION BY CUSTOMER_ID ORDER BY TOTAL_QTY DESC) AS RANKING
   FROM TOTAL_QTYS
) 
SELECT * FROM PRODUCT_SALES_BY_CUSTOMER 
WHERE RANKING <= 3
"""

pd.read_sql(sql, conn)

`RANK()` and `DENSE_RANK()` behave like `ROW_NUMBER()` except in how identical values are handled. If you want identical values to receive the same ranking, use the `RANK()` function instead of `ROW_NUMBER()`; the ranks following a tie are then skipped. Use `DENSE_RANK()` if you want the rankings to stay consecutive rather than having duplicates cause ranks to be skipped.
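
A side-by-side comparison makes the three ranking functions easier to tell apart. This is a minimal sketch on a hypothetical table of product totals (made-up data) where two products tie for first place. `PRODUCT_ID` is added to the `ROW_NUMBER()` ordering only so its output is repeatable. It assumes SQLite 3.25 or newer.

```python
import sqlite3

# Hypothetical totals: products 1 and 2 tie on quantity.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE DEMO_TOTAL (PRODUCT_ID INTEGER, TOTAL_QTY INTEGER)")
demo.executemany(
    "INSERT INTO DEMO_TOTAL VALUES (?, ?)",
    [(1, 100), (2, 100), (3, 90), (4, 80)],
)

rankings = demo.execute("""
SELECT PRODUCT_ID,
TOTAL_QTY,
ROW_NUMBER() OVER (ORDER BY TOTAL_QTY DESC, PRODUCT_ID) AS ROW_NUM,
RANK() OVER (ORDER BY TOTAL_QTY DESC) AS RANKING,
DENSE_RANK() OVER (ORDER BY TOTAL_QTY DESC) AS DENSE_RANKING
FROM DEMO_TOTAL
ORDER BY PRODUCT_ID
""").fetchall()

for row in rankings:
    print(row)
# (1, 100, 1, 1, 1)
# (2, 100, 2, 1, 1)   <- tie: same RANK and DENSE_RANK, different ROW_NUMBER
# (3, 90, 3, 3, 2)    <- RANK skips 2, DENSE_RANK does not
# (4, 80, 4, 4, 3)
```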



## Exercise

For the date range of `2024-02-01` to `2024-02-28`, bring in the rolling maximum quantity ordered (up to each `ORDER_DATE`) by `CUSTOMER_ID` and `PRODUCT_ID`. The boilerplate is provided; just replace the two question marks `?` below. 

In [None]:
sql = """
SELECT CUSTOMER_ORDER_ID,
ORDER_DATE,
CUSTOMER_ID,
PRODUCT_ID,
QUANTITY,
? as rolling_max_qty_for_customer_and_product

FROM ?
WHERE ORDER_DATE BETWEEN '2024-02-01' AND '2024-02-28'

ORDER BY CUSTOMER_ORDER_ID
"""

pd.read_sql(sql, conn)


### SCROLL DOWN FOR ANSWER
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
v 

In [None]:
sql = """
SELECT CUSTOMER_ORDER_ID,
ORDER_DATE,
CUSTOMER_ID,
PRODUCT_ID,
QUANTITY,
MAX(QUANTITY) OVER(PARTITION BY CUSTOMER_ID, PRODUCT_ID ORDER BY ORDER_DATE) as rolling_max_qty_for_customer_and_product

FROM CUSTOMER_ORDER
WHERE ORDER_DATE BETWEEN '2024-02-01' AND '2024-02-28'

ORDER BY CUSTOMER_ORDER_ID
"""

pd.read_sql(sql, conn)