**UNBOUNDED and Unfiltered!** 

**A brief(ish) SQL tutorial introducing window functions: Part 3**

Each of my tutorials is designed for people who have an intermediate understanding of SQL queries. This is Part 3 of a series of brief(ish) SQL tutorials examining **window functions**, aimed at building an understanding of what they are, how they work, and some of their limitations. If you haven't already, check out Part 1, "Get OVER it!," which explains how to create window functions and use OVER, PARTITION BY, and ORDER BY in window functions. Part 2, "This water is RANK!," covers how to use ROW\_NUMBER, RANK, and DENSE\_RANK to demonstrate different ways of ranking records in your query. This is the third and final part of the series, "UNBOUNDED and Unfiltered!" In this part of the series, we will explore using bounded and UNBOUNDED frames.

Once again, I will be using the [common\_toxins database](https://github.com/FreshOats/Water_Data_Tutorials/tree/main/Datasets), which can be downloaded from my GitHub (code provided below). Arsenic always provides a fun analysis, so for this first example, we'll be looking at a single station in Sacramento County that has had nonzero levels of arsenic for more than 20 years. The state of California recognizes anything below 0.01 mg/L of arsenic to be safe, but the goal for the state is to have 0 mg/L of measured arsenic in the water at any given supply. The reporting limit indicates that the test must report anything above 0.001 mg/L of arsenic (or 1 microgram per liter). Before diving into more window functions, familiarize yourself with the data.

In [21]:
SELECT 
    * 
FROM 
    common_toxins
WHERE 
    parameter LIKE '%Arsenic%'
    AND 
    county_name = 'Sacramento'
ORDER BY 
    sample_date ASC;

station_id,station_name,full_station_name,station_number,station_type,latitude,longitude,status_,county_name,sample_code,sample_date,sample_depth,sample_depth_units,parameter,result,reporting_limit,units,method_name
909,MAGPIE C A HALEY BLV,MAGPIE C A HALEY BLVD (16TH ST),A0001210,Surface Water,38.6605,121.4266,Review Status Unknown,Sacramento,WDIS_0005461,1952-08-19 00:00:00.000,,Feet,Dissolved Arsenic,0.0,0.001,mg/L,UnkH Arsenic
910,MAGPIE C A FT-BR IN,MAGPIE C A FT-BR IN MCCLELLAN AFB,A0001250,Surface Water,38.6549,121.3872,Review Status Unknown,Sacramento,WDIS_0005471,1952-08-22 00:00:00.000,,Feet,Dissolved Arsenic,0.0,0.001,mg/L,UnkH Arsenic
3209,SACRAMENTO R A SNODG,SACRAMENTO R A SNODGRASS SLU,B9D82101319,Surface Water,38.3505,121.5333,Review Status Unknown,Sacramento,WDIS_0008006,1952-10-07 14:15:00.000,,Feet,Dissolved Arsenic,0.0,0.001,mg/L,UnkH Arsenic
910,MAGPIE C A FT-BR IN,MAGPIE C A FT-BR IN MCCLELLAN AFB,A0001250,Surface Water,38.6549,121.3872,Review Status Unknown,Sacramento,WDIS_0005472,1952-10-08 00:00:00.000,,Feet,Dissolved Arsenic,0.0,0.001,mg/L,UnkH Arsenic
909,MAGPIE C A HALEY BLV,MAGPIE C A HALEY BLVD (16TH ST),A0001210,Surface Water,38.6605,121.4266,Review Status Unknown,Sacramento,WDIS_0005462,1952-10-08 00:00:00.000,,Feet,Dissolved Arsenic,0.0,0.001,mg/L,UnkH Arsenic
1358,NATOMAS EMDC A JIBBO,NATOMAS EMDC A JIBBON ST BR,A0V83621306,Surface Water,38.6035,121.5105,Review Status Unknown,Sacramento,WDIS_0005448,1952-10-08 15:30:00.000,,Feet,Dissolved Arsenic,0.0,0.001,mg/L,UnkH Arsenic
3209,SACRAMENTO R A SNODG,SACRAMENTO R A SNODGRASS SLU,B9D82101319,Surface Water,38.3505,121.5333,Review Status Unknown,Sacramento,WDIS_0913156,1952-10-20 14:30:00.000,,Feet,Dissolved Arsenic,0.01,0.001,mg/L,UnkH Arsenic
103,DELTACRCHAN,Delta Cross Channel Gate nr Walnut Grove,B9D81481305,Surface Water,38.2466,121.5097,Review Status Unknown,Sacramento,WDIS_0007916,1952-10-28 13:00:00.000,,Feet,Dissolved Arsenic,0.0,0.001,mg/L,UnkH Arsenic
2621,COSUMNES R A MICHIGA,COSUMNES R A MICHIGAN BAR,B1115000,Surface Water,38.5002,121.0455,Review Status Unknown,Sacramento,WDIS_0007558,1952-12-03 09:45:00.000,,Feet,Dissolved Arsenic,0.0,0.001,mg/L,UnkH Arsenic
2621,COSUMNES R A MICHIGA,COSUMNES R A MICHIGAN BAR,B1115000,Surface Water,38.5002,121.0455,Review Status Unknown,Sacramento,WDIS_0894868,1953-01-27 10:00:00.000,,Feet,Dissolved Arsenic,0.0,0.001,mg/L,UnkH Arsenic


The station that we'll be looking at is on a drainage canal under the El Camino Bridge north of downtown Sacramento, which has had consistent measurements, typically monthly and sometimes more frequently, from 1997 through 2018, totaling 291 rows of data to consider. What we would like to know: 

- What was the last measured arsenic level?
- What is the forecasted arsenic level 3 months from now?
- What is the running average arsenic level for the past 4 months?
- What is the cumulative running average arsenic level? 
- What is the yearly average arsenic level? 
- What is the maximum arsenic level per year?

To address these questions, we will need to use the following window functions: 

- LEAD / LAG - these functions let us read a value from a row a specified number of rows ahead of or behind the current row, respectively, without having to use a self-join on the data.
- ROWS PRECEDING / ROWS FOLLOWING - these set up the bounds for our window that slides through the data, looking at only the rows between the number of rows preceding (before) the CURRENT ROW and the number of rows following the CURRENT ROW.
- UNBOUNDED PRECEDING / UNBOUNDED FOLLOWING - unbounded anchors the window at the first or last record in the data, and the window continues to grow or shrink as you progress through the records. This is great for assessing cumulative sets.
- RANGE - a RANGE frame treats rows that share the same ORDER BY value as one group rather than working row by row, allowing us to easily check monthly or yearly stats despite having different numbers of measurements in different periods.

Before spending too much time differentiating between the above descriptions, let's step back and look at the data again.

In [23]:
SELECT 
    parameter
    , station_id
    , county_name
    , result
    , units
    , sample_date
FROM 
    common_toxins
WHERE
    parameter LIKE '%Arsenic%' 
    AND 
    station_id = 1362
ORDER BY 
    sample_date ASC;


parameter,station_id,county_name,result,units,sample_date
Dissolved Arsenic,1362,Sacramento,0.002,mg/L,1997-11-13 11:50:00.000
Dissolved Arsenic,1362,Sacramento,0.003,mg/L,1997-12-01 12:43:00.000
Dissolved Arsenic,1362,Sacramento,0.003,mg/L,1998-01-07 10:50:00.000
Dissolved Arsenic,1362,Sacramento,0.0,mg/L,1998-02-03 11:25:00.000
Dissolved Arsenic,1362,Sacramento,0.003,mg/L,1998-03-02 12:45:00.000
Dissolved Arsenic,1362,Sacramento,0.004,mg/L,1998-04-06 09:30:00.000
Dissolved Arsenic,1362,Sacramento,0.003,mg/L,1998-05-04 10:30:00.000
Dissolved Arsenic,1362,Sacramento,0.003,mg/L,1998-06-10 11:30:00.000
Dissolved Arsenic,1362,Sacramento,0.003,mg/L,1998-07-07 11:21:00.000
Dissolved Arsenic,1362,Sacramento,0.004,mg/L,1998-08-04 10:45:00.000


**Question 1: What was the last measured arsenic level?**

This question is very easily addressed with the **LAG** function. Without window functions, we would have to create row numbers and perform a self-join on the previous row to access the level; however, the LAG function does this with one simple command. 

  

_Note: To make the window functions easier to see, I am putting the first query into a common table expression. If you're not sure about how to use CTEs, check out_ [A brief(ish) SQL tutorial WITH Common\_Table\_Expressions](https://medium.com/@justin.papreck/a-brief-sql-tutorial-with-common-table-expressions-ed60f886f12)

In [38]:
WITH el_camino AS ( 
    SELECT 
        parameter
        , station_id
        , county_name
        , sample_date
        , units
        , result

    FROM 
        common_toxins
    WHERE
        parameter LIKE '%Arsenic%' 
        AND 
        station_id = 1362
)

SELECT 
    * 
    , LAG(result) OVER(ORDER BY sample_date ASC) AS prev_result
FROM 
    el_camino


parameter,station_id,county_name,sample_date,units,result,prev_result
Dissolved Arsenic,1362,Sacramento,1997-11-13 11:50:00.000,mg/L,0.002,
Dissolved Arsenic,1362,Sacramento,1997-12-01 12:43:00.000,mg/L,0.003,0.002
Dissolved Arsenic,1362,Sacramento,1998-01-07 10:50:00.000,mg/L,0.003,0.003
Dissolved Arsenic,1362,Sacramento,1998-02-03 11:25:00.000,mg/L,0.0,0.003
Dissolved Arsenic,1362,Sacramento,1998-03-02 12:45:00.000,mg/L,0.003,0.0
Dissolved Arsenic,1362,Sacramento,1998-04-06 09:30:00.000,mg/L,0.004,0.003
Dissolved Arsenic,1362,Sacramento,1998-05-04 10:30:00.000,mg/L,0.003,0.004
Dissolved Arsenic,1362,Sacramento,1998-06-10 11:30:00.000,mg/L,0.003,0.003
Dissolved Arsenic,1362,Sacramento,1998-07-07 11:21:00.000,mg/L,0.003,0.003
Dissolved Arsenic,1362,Sacramento,1998-08-04 10:45:00.000,mg/L,0.004,0.003
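For comparison, here is a rough sketch of the self-join approach that LAG replaces. To keep it self-contained, the `el_camino` CTE is filled from an inline VALUES list holding the first five rows of the output above instead of querying common\_toxins; that inline table and the `numbered` CTE are assumptions made for illustration, written in syntax that should be accepted by SQL Server and PostgreSQL.

```sql
-- Inline sample (assumption): first five measurements from the output above
WITH el_camino AS (
    SELECT 
        CAST(v.sample_date AS date) AS sample_date
        , v.result
    FROM (VALUES 
        ('1997-11-13', 0.002)
        , ('1997-12-01', 0.003)
        , ('1998-01-07', 0.003)
        , ('1998-02-03', 0.000)
        , ('1998-03-02', 0.003)
    ) AS v(sample_date, result)
)
-- Number the rows so that each row can be matched to the one before it
, numbered AS (
    SELECT 
        sample_date
        , result
        , ROW_NUMBER() OVER (ORDER BY sample_date ASC) AS rn
    FROM 
        el_camino
)

SELECT 
    cur.sample_date
    , cur.result
    , prev.result AS prev_result_self_join                                  -- value found through the join
    , LAG(cur.result) OVER (ORDER BY cur.sample_date ASC) AS prev_result_lag -- same value through LAG
FROM 
    numbered AS cur
LEFT JOIN 
    numbered AS prev 
    ON prev.rn = cur.rn - 1
ORDER BY 
    cur.sample_date ASC;
```

Both columns should agree on every row, with NULL on the first row, since there is no earlier measurement to return.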


**Question 2: What is the forecasted arsenic level 3 months from now?**

The LEAD and LAG functions also take an optional offset, the number of rows to look back or ahead, in case I would like to see the value from 3 months ago. If, for some strange reason, I'd like to see into the future, I could use LEAD. Keep in mind that LEAD doesn't predict anything: it simply returns the value from a later row that already exists in the data. 

One thing I'd like you to notice is that once we get to 2002, multiple measurements are made in September onward, and not consistently. A more appropriate result to consider than the last measurement would be what happened last month. For now, I'm going to aggregate the results into monthly averages within the first query, and then look into the future at the monthly average 3 months from now.

In [34]:
WITH el_camino AS ( 
    SELECT 
        parameter
        , station_id
        , county_name
        , MONTH(sample_date) AS month
        , YEAR(sample_date) AS year
        , units
        , AVG(result) as monthly_avg
    FROM 
        common_toxins
    WHERE
        parameter LIKE '%Arsenic%' 
        AND 
        station_id = 1362
    GROUP BY 
        YEAR(sample_date), MONTH(sample_date), parameter, station_id, county_name, units
)

SELECT 
    * 
    , LEAD(monthly_avg, 3) OVER (ORDER BY year, month) AS three_months_ahead
FROM 
    el_camino

parameter,station_id,county_name,month,year,units,monthly_avg,three_months_ahead
Dissolved Arsenic,1362,Sacramento,11,1997,mg/L,0.002,0.0
Dissolved Arsenic,1362,Sacramento,12,1997,mg/L,0.003,0.003
Dissolved Arsenic,1362,Sacramento,1,1998,mg/L,0.003,0.004
Dissolved Arsenic,1362,Sacramento,2,1998,mg/L,0.0,0.003
Dissolved Arsenic,1362,Sacramento,3,1998,mg/L,0.003,0.003
Dissolved Arsenic,1362,Sacramento,4,1998,mg/L,0.004,0.003
Dissolved Arsenic,1362,Sacramento,5,1998,mg/L,0.003,0.004
Dissolved Arsenic,1362,Sacramento,6,1998,mg/L,0.003,0.004
Dissolved Arsenic,1362,Sacramento,7,1998,mg/L,0.003,0.004
Dissolved Arsenic,1362,Sacramento,8,1998,mg/L,0.004,0.002


The new column is now shifted by 3 rows, and instead of seeing NULL values in the first 3 rows, we see them in the last 3 rows of the table, as there are no future values.

So our **LEAD** and **LAG** functions take a single row and slide down the entire table, returning the value from the nth row ahead of or behind the current row.
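To see both directions side by side, here is a small sketch on six monthly averages taken from the output above. The inline `monthly` table is an assumption for illustration; it stands in for the aggregated `el_camino` CTE.

```sql
-- Inline sample (assumption): six monthly averages copied from the output above
WITH monthly AS (
    SELECT 
        v.year
        , v.month
        , v.monthly_avg
    FROM (VALUES 
        (1997, 11, 0.002)
        , (1997, 12, 0.003)
        , (1998, 1, 0.003)
        , (1998, 2, 0.000)
        , (1998, 3, 0.003)
        , (1998, 4, 0.004)
    ) AS v(year, month, monthly_avg)
)

SELECT 
    * 
    , LAG(monthly_avg, 3) OVER (ORDER BY year, month) AS three_months_ago    -- looks 3 rows back
    , LEAD(monthly_avg, 3) OVER (ORDER BY year, month) AS three_months_ahead -- looks 3 rows forward
FROM 
    monthly
ORDER BY 
    year, month;
```

With only six rows, three\_months\_ago is NULL on the first three rows and three\_months\_ahead is NULL on the last three. Both functions also accept an optional third argument that supplies a default value in place of those NULLs.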
**Question 3: What is the running average arsenic level for the past 4 months?**

Getting a 4-month running average is a little more complicated. In this case, we take the first four rows, calculate the average, then slide our window down one row so that it covers the next 4 rows and no longer includes the first row, and this process continues from the beginning of our table to the end. In these types of operations, we want to focus on the relationship between the rows preceding or following the current row. The formatting of this is somewhat intuitive: 

After the ORDER BY clause, we enter ROWS BETWEEN _n_ PRECEDING (the number of rows back to start at) AND CURRENT ROW. Similarly, we can use _n_ FOLLOWING, or any combination of rows preceding and following the current row.

What is the difference? The only difference is the row on which the value is returned, which answers a slightly modified question.

In [1]:
WITH el_camino AS ( 
    SELECT 
        parameter
        , station_id
        , county_name
        , MONTH(sample_date) AS month
        , YEAR(sample_date) AS year
        , units
        , AVG(result) as monthly_avg
    FROM 
        common_toxins
    WHERE
        parameter LIKE '%Arsenic%' 
        AND 
        station_id = 1362
    GROUP BY 
        YEAR(sample_date), MONTH(sample_date), parameter, station_id, county_name, units
)

SELECT 
    * 
    , AVG(monthly_avg) OVER (ORDER BY year, month ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS four_month_average
FROM 
    el_camino

parameter,station_id,county_name,month,year,units,monthly_avg,four_month_average
Dissolved Arsenic,1362,Sacramento,11,1997,mg/L,0.002,0.002
Dissolved Arsenic,1362,Sacramento,12,1997,mg/L,0.003,0.0025
Dissolved Arsenic,1362,Sacramento,1,1998,mg/L,0.003,0.0026666666666666
Dissolved Arsenic,1362,Sacramento,2,1998,mg/L,0.0,0.002
Dissolved Arsenic,1362,Sacramento,3,1998,mg/L,0.003,0.00225
Dissolved Arsenic,1362,Sacramento,4,1998,mg/L,0.004,0.0025
Dissolved Arsenic,1362,Sacramento,5,1998,mg/L,0.003,0.0025
Dissolved Arsenic,1362,Sacramento,6,1998,mg/L,0.003,0.00325
Dissolved Arsenic,1362,Sacramento,7,1998,mg/L,0.003,0.00325
Dissolved Arsenic,1362,Sacramento,8,1998,mg/L,0.004,0.00325


Where we have to be careful here is in the interpretation of the first three rows, since those are not actual 4-month averages: their windows hold only one, two, and three rows. From the fourth row onward, each value is the average of 4 monthly averages.
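One way to make the incomplete windows visible is to count the rows in the same frame and only report the average when the frame is full. The sketch below also includes a centered window (1 PRECEDING to 1 FOLLOWING) to show a different frame placement. The inline `monthly` table and the column names are assumptions for illustration, using monthly averages from the output above.

```sql
-- Inline sample (assumption): six monthly averages copied from the output above
WITH monthly AS (
    SELECT 
        v.year
        , v.month
        , v.monthly_avg
    FROM (VALUES 
        (1997, 11, 0.002)
        , (1997, 12, 0.003)
        , (1998, 1, 0.003)
        , (1998, 2, 0.000)
        , (1998, 3, 0.003)
        , (1998, 4, 0.004)
    ) AS v(year, month, monthly_avg)
)

SELECT 
    * 
    -- trailing window: current row plus the 3 rows before it
    , AVG(monthly_avg) OVER (ORDER BY year, month ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS trailing_avg
    -- centered window: one row before, the current row, one row after
    , AVG(monthly_avg) OVER (ORDER BY year, month ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS centered_avg
    -- how many rows the trailing window actually holds
    , COUNT(*) OVER (ORDER BY year, month ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS rows_in_window
    -- only report the trailing average once the window holds all 4 rows
    , CASE 
        WHEN COUNT(*) OVER (ORDER BY year, month ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) = 4 
        THEN AVG(monthly_avg) OVER (ORDER BY year, month ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) 
      END AS full_four_month_avg
FROM 
    monthly
ORDER BY 
    year, month;
```

Here rows\_in\_window reads 1, 2, 3 on the first three rows and 4 afterward, so full\_four\_month\_avg stays NULL until the fourth row.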

**Question 4: What is the cumulative running average arsenic level?**

The cumulative average is a running average from the first measurement onward. We can take the same approach as we did last time; however, instead of adding the bounds of the 3 PRECEDING, we will leave this UNBOUNDED. When using an unbounded window function, we start our window at the first row and continue to open the window as we work our way through the records. Since we want to know the cumulative average, I'm going to remove the aggregate average in the first query. I'm also going to focus on the years from 2002-2003, since we know that there are multiple measurements in several months of these years.

In [14]:
WITH el_camino AS ( 
    SELECT 
        parameter
        , station_id
        , county_name
        , MONTH(sample_date) AS month
        , YEAR(sample_date) AS year
        , units
        , result
    FROM 
        common_toxins
    WHERE
        parameter LIKE '%Arsenic%' 
        AND 
        station_id = 1362
        AND 
        sample_date BETWEEN '2002-01-01' AND '2003-12-31' 
)

SELECT 
    * 
    , AVG(result) OVER (ORDER BY year, month ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_average
FROM 
    el_camino

parameter,station_id,county_name,month,year,units,result,cumulative_average
Dissolved Arsenic,1362,Sacramento,1,2002,mg/L,0.002,0.002
Dissolved Arsenic,1362,Sacramento,1,2002,mg/L,0.003,0.0025
Dissolved Arsenic,1362,Sacramento,1,2002,mg/L,0.003,0.0026666666666666
Dissolved Arsenic,1362,Sacramento,2,2002,mg/L,0.003,0.00275
Dissolved Arsenic,1362,Sacramento,3,2002,mg/L,0.003,0.0027999999999999
Dissolved Arsenic,1362,Sacramento,3,2002,mg/L,0.004,0.0029999999999999
Dissolved Arsenic,1362,Sacramento,4,2002,mg/L,0.005,0.0032857142857142
Dissolved Arsenic,1362,Sacramento,5,2002,mg/L,0.002,0.003125
Dissolved Arsenic,1362,Sacramento,5,2002,mg/L,0.002,0.003
Dissolved Arsenic,1362,Sacramento,6,2002,mg/L,0.003,0.003


One of the disadvantages of using **ROWS** goes back to the issue we had earlier, in having to use the aggregate AVG(result) to create a new variable from our results, since some months have multiple measurements, whereas others only have a single measurement. We won't actually see the monthly average using **ROWS** here, but we have another option: **RANGE**! Previously, we did the statistically dubious alteration of calculating an average using a previously averaged value, which isn't always accurate and is always bad practice. What **RANGE** does for us is treat rows that share the same values in the ORDER BY clause as a group of peers, and it returns a single value for all rows with duplicate terms. Let's just add this as another column to compare the ROWS and RANGE outputs.

In [33]:
WITH el_camino AS ( 
    SELECT 
        parameter
        , station_id
        , county_name
        , MONTH(sample_date) AS month
        , YEAR(sample_date) AS year
        , units
        , result
    FROM 
        common_toxins
    WHERE
        parameter LIKE '%Arsenic%' 
        AND 
        station_id = 1362
        AND 
        sample_date BETWEEN '2002-01-01' AND '2003-12-31' 
)

SELECT 
    * 
    , AVG(result) OVER (ORDER BY year, month ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_avg_by_measurement
    , AVG(result) OVER (ORDER BY year, month RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_avg_by_month
FROM 
    el_camino

parameter,station_id,county_name,month,year,units,result,running_avg_by_measurement,running_avg_by_month
Dissolved Arsenic,1362,Sacramento,1,2002,mg/L,0.002,0.002,0.0026666666666666
Dissolved Arsenic,1362,Sacramento,1,2002,mg/L,0.003,0.0025,0.0026666666666666
Dissolved Arsenic,1362,Sacramento,1,2002,mg/L,0.003,0.0026666666666666,0.0026666666666666
Dissolved Arsenic,1362,Sacramento,2,2002,mg/L,0.003,0.00275,0.00275
Dissolved Arsenic,1362,Sacramento,3,2002,mg/L,0.003,0.0027999999999999,0.0029999999999999
Dissolved Arsenic,1362,Sacramento,3,2002,mg/L,0.004,0.0029999999999999,0.0029999999999999
Dissolved Arsenic,1362,Sacramento,4,2002,mg/L,0.005,0.0032857142857142,0.0032857142857142
Dissolved Arsenic,1362,Sacramento,5,2002,mg/L,0.002,0.003125,0.003
Dissolved Arsenic,1362,Sacramento,5,2002,mg/L,0.002,0.003,0.003
Dissolved Arsenic,1362,Sacramento,6,2002,mg/L,0.003,0.003,0.003


For each month, the RANGE output repeats one value across all of that month's measurements, and that value matches the ROWS output on the month's final measurement. This allows us to very easily change the parameter to calculate aggregate functions such as year-to-date sums, averages, etc. without a lot of code. It should be noted that some engines, SQL Server among them, only accept UNBOUNDED PRECEDING, UNBOUNDED FOLLOWING, and CURRENT ROW as RANGE bounds, while others, such as PostgreSQL, also allow numeric offsets. We're able to make more fine-tuned adjustments to ROWS with the number preceding and following than we can with RANGE.
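The "average of averages" concern is easy to check with a little arithmetic. The sketch below uses the four January and February 2002 measurements from the output above; the inline `measurements` table and the `monthly` CTE are assumptions for illustration.

```sql
-- Inline sample (assumption): Jan 2002 has three measurements, Feb 2002 has one
WITH measurements AS (
    SELECT 
        v.year
        , v.month
        , v.result
    FROM (VALUES 
        (2002, 1, 0.002)
        , (2002, 1, 0.003)
        , (2002, 1, 0.003)
        , (2002, 2, 0.003)
    ) AS v(year, month, result)
)
-- First collapse each month to a single average
, monthly AS (
    SELECT 
        year
        , month
        , AVG(result) AS monthly_avg
    FROM 
        measurements
    GROUP BY 
        year, month
)

SELECT 
    (SELECT AVG(monthly_avg) FROM monthly) AS avg_of_monthly_avgs       -- each month weighted equally
    , (SELECT AVG(result) FROM measurements) AS avg_of_all_measurements -- each measurement weighted equally
;
```

The first value works out to roughly 0.00283, from (0.002667 + 0.003) / 2, while the second is 0.00275, from 0.011 / 4. The gap appears because January's three measurements count as a single value once they have been averaged.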

So what if I don't want the running average, but I do want to know...

**Question 5: What is the yearly average arsenic level?**

In [32]:
WITH el_camino AS ( 
    SELECT 
        parameter
        , station_id
        , county_name
        , MONTH(sample_date) AS month
        , YEAR(sample_date) AS year
        , units
        , result
    FROM 
        common_toxins
    WHERE
        parameter LIKE '%Arsenic%' 
        AND 
        station_id = 1362
)

SELECT 
    * 
    , AVG(result) OVER (ORDER BY year, month RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_yearly_avg
    , AVG(result) OVER (PARTITION BY year ORDER BY year, month RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS yearly_avg
FROM 
    el_camino

parameter,station_id,county_name,month,year,units,result,running_yearly_avg,yearly_avg
Dissolved Arsenic,1362,Sacramento,11,1997,mg/L,0.002,0.002,0.002
Dissolved Arsenic,1362,Sacramento,12,1997,mg/L,0.003,0.0025,0.0025
Dissolved Arsenic,1362,Sacramento,1,1998,mg/L,0.003,0.0026666666666666,0.003
Dissolved Arsenic,1362,Sacramento,2,1998,mg/L,0.0,0.002,0.0015
Dissolved Arsenic,1362,Sacramento,3,1998,mg/L,0.003,0.0021999999999999,0.002
Dissolved Arsenic,1362,Sacramento,4,1998,mg/L,0.004,0.0025,0.0025
Dissolved Arsenic,1362,Sacramento,5,1998,mg/L,0.003,0.0025714285714285,0.0026
Dissolved Arsenic,1362,Sacramento,6,1998,mg/L,0.003,0.0026249999999999,0.0026666666666666
Dissolved Arsenic,1362,Sacramento,7,1998,mg/L,0.003,0.0026666666666666,0.0027142857142857
Dissolved Arsenic,1362,Sacramento,8,1998,mg/L,0.004,0.0027999999999999,0.002875


Simple fix! Just PARTITION BY year.

  

In our previous examples, we've excluded the partitioning, since it wasn't necessary for the questions we were answering. By partitioning by year, the window resets at the beginning of every new partition. To demonstrate this, look at the values for 1997 and 1998. The running yearly average of 1998 is 0.0023857, whereas the yearly average is 0.002917. If we start our query at Jan 01, 1998, then both of them should be 0.002917, since the 1997 values are not averaged in with the 1998 values.

In [31]:
WITH el_camino AS ( 
    SELECT 
        parameter
        , station_id
        , county_name
        , MONTH(sample_date) AS month
        , YEAR(sample_date) AS year
        , units
        , result
    FROM 
        common_toxins
    WHERE
        parameter LIKE '%Arsenic%' 
        AND 
        station_id = 1362
        AND 
        sample_date >= '1998-01-01' 
)

SELECT 
    * 
    , AVG(result) OVER (ORDER BY year, month RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_yearly_avg
    , AVG(result) OVER (PARTITION BY year ORDER BY year, month RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS yearly_avg
FROM 
    el_camino

parameter,station_id,county_name,month,year,units,result,running_yearly_avg,yearly_avg
Dissolved Arsenic,1362,Sacramento,1,1998,mg/L,0.003,0.003,0.003
Dissolved Arsenic,1362,Sacramento,2,1998,mg/L,0.0,0.0015,0.0015
Dissolved Arsenic,1362,Sacramento,3,1998,mg/L,0.003,0.002,0.002
Dissolved Arsenic,1362,Sacramento,4,1998,mg/L,0.004,0.0025,0.0025
Dissolved Arsenic,1362,Sacramento,5,1998,mg/L,0.003,0.0026,0.0026
Dissolved Arsenic,1362,Sacramento,6,1998,mg/L,0.003,0.0026666666666666,0.0026666666666666
Dissolved Arsenic,1362,Sacramento,7,1998,mg/L,0.003,0.0027142857142857,0.0027142857142857
Dissolved Arsenic,1362,Sacramento,8,1998,mg/L,0.004,0.002875,0.002875
Dissolved Arsenic,1362,Sacramento,9,1998,mg/L,0.004,0.003,0.003
Dissolved Arsenic,1362,Sacramento,10,1998,mg/L,0.004,0.0031,0.0031


This proves true, and then the values diverge again once we get to 1999, as it incorporates the running average from 1998 on the left, but the column on the right resets. This brings us to our final question. 
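Note that the yearly\_avg column above is still a running average that restarts each year, so only the last row of a year holds the full yearly figure. If a single yearly value on every row is wanted, one option is to drop the ORDER BY and the frame so that the window covers the whole partition. A minimal sketch, assuming an inline sample of five measurements from the output above:

```sql
-- Inline sample (assumption): two 1997 rows and three 1998 rows from the output above
WITH el_camino AS (
    SELECT 
        v.year
        , v.month
        , v.result
    FROM (VALUES 
        (1997, 11, 0.002)
        , (1997, 12, 0.003)
        , (1998, 1, 0.003)
        , (1998, 2, 0.000)
        , (1998, 3, 0.003)
    ) AS v(year, month, result)
)

SELECT 
    * 
    -- restarts every year and grows month by month
    , AVG(result) OVER (PARTITION BY year ORDER BY month RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_avg_within_year
    -- no ORDER BY and no frame: the window is the entire partition
    , AVG(result) OVER (PARTITION BY year) AS whole_year_avg
FROM 
    el_camino
ORDER BY 
    year, month;
```

On this small sample, whole\_year\_avg is 0.0025 on both 1997 rows and 0.002 on all three 1998 rows, while running\_avg\_within\_year only reaches those values on the last row of each year. These figures describe the five sample rows, not the station's actual yearly averages.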

**Question 6: What is the maximum arsenic level per year?**

It's your turn. How would we write this query using a window function?

In [28]:
WITH el_camino AS ( 
    SELECT 
        parameter
        , station_id
        , county_name
        , MONTH(sample_date) AS month
        , YEAR(sample_date) AS year
        , units
        , result
    FROM 
        common_toxins
    WHERE
        parameter LIKE '%Arsenic%' 
        AND 
        station_id = 1362
)

SELECT 
    * 
    , MAX(result) OVER (PARTITION BY year ORDER BY year, month RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS yearly_max
FROM 
    el_camino

parameter,station_id,county_name,month,year,units,result,yearly_max
Dissolved Arsenic,1362,Sacramento,11,1997,mg/L,0.002,0.002
Dissolved Arsenic,1362,Sacramento,12,1997,mg/L,0.003,0.003
Dissolved Arsenic,1362,Sacramento,1,1998,mg/L,0.003,0.003
Dissolved Arsenic,1362,Sacramento,2,1998,mg/L,0.0,0.003
Dissolved Arsenic,1362,Sacramento,3,1998,mg/L,0.003,0.003
Dissolved Arsenic,1362,Sacramento,4,1998,mg/L,0.004,0.004
Dissolved Arsenic,1362,Sacramento,5,1998,mg/L,0.003,0.004
Dissolved Arsenic,1362,Sacramento,6,1998,mg/L,0.003,0.004
Dissolved Arsenic,1362,Sacramento,7,1998,mg/L,0.003,0.004
Dissolved Arsenic,1362,Sacramento,8,1998,mg/L,0.004,0.004


This should have been a quick change from AVG to MAX from the last query. We need to maintain the partition by statement as yearly, but instead of calculating the average per year, it will find the max.  What would happen if we used ROWS instead of RANGE?

In [27]:
WITH el_camino AS ( 
    SELECT 
        parameter
        , station_id
        , county_name
        , MONTH(sample_date) AS month
        , YEAR(sample_date) AS year
        , units
        , result
    FROM 
        common_toxins
    WHERE
        parameter LIKE '%Arsenic%' 
        AND 
        station_id = 1362
)

SELECT 
    * 
    , MAX(result) OVER (PARTITION BY year ORDER BY year, month ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS yearly_max
FROM 
    el_camino

parameter,station_id,county_name,month,year,units,result,yearly_max
Dissolved Arsenic,1362,Sacramento,11,1997,mg/L,0.002,0.002
Dissolved Arsenic,1362,Sacramento,12,1997,mg/L,0.003,0.003
Dissolved Arsenic,1362,Sacramento,1,1998,mg/L,0.003,0.003
Dissolved Arsenic,1362,Sacramento,2,1998,mg/L,0.0,0.003
Dissolved Arsenic,1362,Sacramento,3,1998,mg/L,0.003,0.003
Dissolved Arsenic,1362,Sacramento,4,1998,mg/L,0.004,0.004
Dissolved Arsenic,1362,Sacramento,5,1998,mg/L,0.003,0.004
Dissolved Arsenic,1362,Sacramento,6,1998,mg/L,0.003,0.004
Dissolved Arsenic,1362,Sacramento,7,1998,mg/L,0.003,0.004
Dissolved Arsenic,1362,Sacramento,8,1998,mg/L,0.004,0.004


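If you thought that the value would increase each time a row exceeded the current maximum, and then hold until the next year, you're correct. In the rows shown here, the output is identical to the RANGE version because each of these months has only one measurement; the two differ only in months with several measurements, where RANGE returns one value for the whole month. If you think that was a mouthful of explanation, I can't disagree with you. 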
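To see ROWS and RANGE actually disagree, we need months with more than one measurement. The sketch below uses the first seven 2002 measurements from the earlier output; the inline table is an assumption for illustration. It also adds a column without ORDER BY or a frame, which is one way to get a single maximum per year rather than a running one.

```sql
-- Inline sample (assumption): Jan-Apr 2002, where January and March have several measurements
WITH el_camino AS (
    SELECT 
        v.year
        , v.month
        , v.result
    FROM (VALUES 
        (2002, 1, 0.002)
        , (2002, 1, 0.003)
        , (2002, 1, 0.003)
        , (2002, 2, 0.003)
        , (2002, 3, 0.003)
        , (2002, 3, 0.004)
        , (2002, 4, 0.005)
    ) AS v(year, month, result)
)

SELECT 
    * 
    -- row by row: rows tied on month are taken one at a time, in no guaranteed order
    , MAX(result) OVER (PARTITION BY year ORDER BY month ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_max_rows
    -- peers together: every row of a month sees all of that month's measurements
    , MAX(result) OVER (PARTITION BY year ORDER BY month RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_max_range
    -- whole partition: one maximum per year on every row
    , MAX(result) OVER (PARTITION BY year) AS whole_year_max
FROM 
    el_camino
ORDER BY 
    year, month;
```

With RANGE, all three January rows show 0.003 and both March rows show 0.004. With ROWS, a tied row can show a lower value depending on which measurement the engine happens to process first. The whole\_year\_max column is 0.005 on every row of this sample.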

**As this series comes to a close...**

Let's look back at the concepts you hopefully learned or reinforced:

- SQL Order of Operations: **FROM -\> WHERE -\> GROUP BY -\> HAVING -\> SELECT -\> ORDER BY -\> LIMIT**

**Get OVER it!**

- **OVER** creates our window function, taking an aggregate function prior to the OVER clause.
- **PARTITION BY** creates partitions within the OVER clause, acting similar to the GROUP BY clause, which our window function will use as we slide through the rows.
- **ORDER BY** is another clause we can use within the OVER clause, doing exactly what we expect it to do.

**This water is RANK!**

- **ROW\_NUMBER( )** A function that creates a number and increases by one for every row in the dataset - ultimately, this creates a unique identifier for each row.
- **RANK( )** A function that provides a ranking but it maintains the number of places in the list (i.e. if there are duplicate values assigned rank 3, the next rank will be 5). Not all numbers will be present in rank.
- **DENSE\_RANK( )** A function that provides a ranking but it maintains the numerical order of the list (i.e. if there are duplicate values assigned rank 3, the next rank will be 4). All numerical values will be included in DENSE\_RANK.

**UNBOUNDED and Unfiltered!**

- **LAG(_value, n_)** will iterate through each row, presenting the _value_ that is _n_ rows behind the current row; the default is 1 row.
- **LEAD(_value, n_)** will iterate through each row, presenting the _value_ that is _n_ rows ahead of the current row; the default is 1 row.
- **ROWS** uses the bounds defined by **PRECEDING** or **FOLLOWING** and a number **_n_** with respect to the **CURRENT ROW**, and it forms a sliding window with those defined positions from the initial row to the final row of the table.
- **RANGE**, used here with **UNBOUNDED PRECEDING** and **CURRENT ROW**, runs from the beginning of each partition to the current row, but unlike ROWS, it treats rows with duplicate ORDER BY values as a group and returns a single value for all of them.

Hopefully, you can use these functions to help make more complex queries more efficient when trying to assess and deliver your data insights.