The purpose of this file is to run the queries that feed the Tableau dashboards.

- I want to query all contaminants for which individual stations exceeded the state maximum within the past 5 years.
- I also want to query the county average for each contaminant that exceeded its state maximum within the past 5 years.

In [54]:
SELECT  contaminant, 
        state_max, 
        Fixed_Result, 
        station_id,
        county_name, 
        DATEPART(year, sample_date) AS year, 
        CAST(AVG(Fixed_Result) OVER(PARTITION BY county_name) AS DECIMAL(5,2)) AS County_Average
FROM    regulated_contaminants
WHERE   Fixed_Result > state_max AND sample_date > '01-01-2018'

contaminant,state_max,Fixed_Result,station_id,county_name,year,County_Average
Dissolved Arsenic,10,10.7,46554,Butte,2019,12.2
Dissolved Arsenic,10,10.8,47853,Butte,2019,12.2
Dissolved Arsenic,10,15.1,47853,Butte,2020,12.2
Dissolved Antimony,6,33.0,331,Kings,2020,33.0
Dissolved Arsenic,10,16.2,47905,Glenn,2022,18.37
Dissolved Antimony,6,14.0,47905,Glenn,2021,18.37
Dissolved Antimony,6,30.0,47906,Glenn,2021,18.37
Dissolved Antimony,6,17.5,47907,Glenn,2021,18.37
Dissolved Arsenic,10,15.0,47907,Glenn,2021,18.37
Dissolved Antimony,6,18.4,47908,Glenn,2021,18.37


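The query above uses a window function; the query below uses a common table expression (CTE).

I'm using the CTE here because the first question is about all of the stations, while the second is only about the county averages. I also want to use the average as a condition in the WHERE clause, which a window function does not allow directly. Note that the window above partitions by county_name only, so its County_Average mixes contaminants, whereas the CTE groups by contaminant and county_name. That is why Glenn shows 18.37 above but 16.23 for Dissolved Arsenic below.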
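
A window function cannot appear in a WHERE clause, but its result can still be filtered if it is computed in a CTE and the filter is applied in the outer query. The following is a minimal sketch of that pattern, not the approach used in this notebook. `sample_rows`, its values, and the 1.25 threshold are hypothetical stand-ins; the window is partitioned by both contaminant and county so that it lines up with the grouping used in the CTE queries.

```sql
-- Hypothetical rows standing in for regulated_contaminants (all values are made up).
WITH sample_rows AS (
    SELECT  contaminant, state_max, Fixed_Result, station_id, county_name,
            CAST(sample_date AS DATE) AS sample_date
    FROM (VALUES
        ('Dissolved Arsenic',  10, 12.0, 1, 'County A', '2019-03-01'),
        ('Dissolved Arsenic',  10, 18.0, 2, 'County A', '2020-06-15'),
        ('Dissolved Antimony',  6,  7.0, 2, 'County A', '2021-02-10'),
        ('Dissolved Arsenic',  10,  4.0, 3, 'County B', '2021-09-30')
    ) AS v(contaminant, state_max, Fixed_Result, station_id, county_name, sample_date)
),
Windowed AS (
    -- The window average is computed in the SELECT list, where window functions are allowed.
    SELECT  contaminant, state_max, Fixed_Result, station_id, county_name,
            DATEPART(year, sample_date) AS year,
            CAST(AVG(Fixed_Result) OVER (PARTITION BY contaminant, county_name) AS DECIMAL(5,2)) AS County_Average
    FROM    sample_rows
    WHERE   Fixed_Result > state_max AND sample_date >= '2018-01-01'
)
-- The outer query can treat County_Average as an ordinary column.
-- The 1.25 multiplier is an arbitrary, hypothetical threshold for illustration.
SELECT  *
FROM    Windowed
WHERE   County_Average > 1.25 * state_max
ORDER BY contaminant, station_id;
```

This keeps the per-station rows of the window version while still allowing a condition on the county average.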

In [55]:
WITH County_Average AS (  
SELECT  contaminant,  
        CAST(AVG(Fixed_Result) AS DECIMAL(5,2)) AS Average_Result, 
        county_name
FROM    regulated_contaminants
WHERE   Fixed_Result > state_max AND sample_date > '01-01-2018'
GROUP BY contaminant, county_name
)
SELECT  r.contaminant, 
        r.state_max, 
        c.Average_Result, 
        r.station_id,
        r.county_name, 
        DATEPART(year, r.sample_date) AS year
FROM    regulated_contaminants r
    LEFT JOIN 
        County_Average c 
    ON c.contaminant = r.contaminant AND c.county_name = r.county_name
WHERE   Fixed_Result > state_max AND sample_date > '01-01-2018' 

contaminant,state_max,Average_Result,station_id,county_name,year
Dissolved Antimony,6,16.5,47903,Tehama,2021
Dissolved Antimony,6,16.5,47904,Tehama,2021
Dissolved Arsenic,10,12.2,46554,Butte,2019
Dissolved Arsenic,10,12.2,47853,Butte,2019
Dissolved Arsenic,10,12.2,47853,Butte,2020
Dissolved Nitrate + Nitrite,10,10.1,45770,Los Angeles,2022
Dissolved Antimony,6,33.0,331,Kings,2020
Dissolved Arsenic,10,16.23,47905,Glenn,2022
Dissolved Arsenic,10,16.23,47907,Glenn,2021
Dissolved Arsenic,10,16.23,47908,Glenn,2021


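This query answers the question of which counties have averages over the past 5 years (2018-01-01 through 2022) that are higher than the state maximum. Because the CTE filters on Fixed_Result > state_max, each average is taken over the results that exceeded the state maximum.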

In [58]:
WITH County_Average AS (  
SELECT  contaminant,  
        CAST(AVG(Fixed_Result) AS DECIMAL(5,2)) AS Average_Result, 
        county_name
FROM    regulated_contaminants
WHERE   Fixed_Result > state_max AND sample_date > '01-01-2018'
GROUP BY contaminant, county_name
)
SELECT  r.contaminant, 
        r.state_max, 
        c.Average_Result, 
        r.county_name
FROM    regulated_contaminants r
    LEFT JOIN 
        County_Average c 
    ON c.contaminant = r.contaminant AND c.county_name = r.county_name
WHERE   c.Average_Result > state_max AND sample_date > '01-01-2018'
GROUP BY r.contaminant, r.county_name, r.state_max, c.Average_Result
ORDER BY r.contaminant 

contaminant,state_max,Average_Result,county_name
Dissolved Antimony,6,16.5,Tehama
Dissolved Antimony,6,33.0,Kings
Dissolved Antimony,6,19.98,Glenn
Dissolved Arsenic,10,12.2,Butte
Dissolved Arsenic,10,16.23,Glenn
Dissolved Arsenic,10,60.25,Tehama
Dissolved Mercury,2,3.0,Los Angeles
Dissolved Nitrate + Nitrite,10,11.25,Yolo
Dissolved Nitrate + Nitrite,10,10.1,Los Angeles
Dissolved Strontium,12,80.22,Sacramento
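
A condition on an aggregate can also be written with HAVING, which avoids joining the CTE back to the base table. The sketch below is one possible variation under two assumptions that are not confirmed by the source: that state_max is constant for a given contaminant, and that the average should cover all samples in the period rather than only the exceeding ones (unlike the queries in this notebook, it has no `Fixed_Result > state_max` filter). `sample_rows` and its values are hypothetical.

```sql
-- Hypothetical rows standing in for regulated_contaminants (all values are made up).
WITH sample_rows AS (
    SELECT  contaminant, state_max, Fixed_Result, station_id, county_name,
            CAST(sample_date AS DATE) AS sample_date
    FROM (VALUES
        ('Dissolved Arsenic',  10, 12.0, 1, 'County A', '2019-03-01'),
        ('Dissolved Arsenic',  10, 18.0, 2, 'County A', '2020-06-15'),
        ('Dissolved Antimony',  6,  7.0, 2, 'County A', '2021-02-10'),
        ('Dissolved Arsenic',  10,  4.0, 3, 'County B', '2021-09-30')
    ) AS v(contaminant, state_max, Fixed_Result, station_id, county_name, sample_date)
)
SELECT  contaminant,
        county_name,
        state_max,
        CAST(AVG(Fixed_Result) AS DECIMAL(5,2)) AS Average_Result
FROM    sample_rows
WHERE   sample_date >= '2018-01-01'
GROUP BY contaminant, county_name, state_max
-- HAVING is evaluated after grouping, so it can compare the aggregate to state_max.
HAVING  AVG(Fixed_Result) > state_max
ORDER BY contaminant;
```

With these made-up rows, County B is dropped because its only arsenic result is below the limit.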


Now, to express each county average as a percentage of the state maximum, I will use the above query as a second CTE.

In [59]:
WITH County_Average AS (  
SELECT  contaminant,  
        CAST(AVG(Fixed_Result) AS DECIMAL(5,2)) AS Average_Result, 
        county_name
FROM    regulated_contaminants
WHERE   Fixed_Result > state_max AND sample_date > '01-01-2018'
GROUP BY contaminant, county_name
),
High_Averages AS (
SELECT  r.contaminant, 
        r.state_max, 
        c.Average_Result, 
        r.county_name
FROM    regulated_contaminants r
    LEFT JOIN 
        County_Average c 
    ON c.contaminant = r.contaminant AND c.county_name = r.county_name
WHERE   c.Average_Result > state_max AND sample_date > '01-01-2018'
GROUP BY r.contaminant, r.county_name, r.state_max, c.Average_Result
)
SELECT  *, 
        -- High_Averages already has one row per contaminant and county, so no window is needed
        CAST(100 * Average_Result / state_max AS DECIMAL(18,2)) AS Percent_of_State_Max
FROM High_Averages
ORDER BY Percent_of_State_Max DESC

contaminant,state_max,Average_Result,county_name,Percent_of_State_Max
Dissolved Strontium,12,208.88,San Joaquin,1740.67
Dissolved Strontium,12,80.22,Sacramento,668.5
Dissolved Arsenic,10,60.25,Tehama,602.5
Dissolved Strontium,12,68.8,Tehama,573.33
Dissolved Antimony,6,33.0,Kings,550.0
Dissolved Antimony,6,19.98,Glenn,333.0
Dissolved Antimony,6,16.5,Tehama,275.0
Dissolved Arsenic,10,16.23,Glenn,162.3
Dissolved Mercury,2,3.0,Los Angeles,150.0
Dissolved Arsenic,10,12.2,Butte,122.0
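
Percent_of_State_Max is 100 times the average divided by the limit, so a value of 122 means the average is 22% above the limit, not 122% above it. The short sketch below works through two rows from the results above. `Percent_Above_State_Max` is a hypothetical column name added only for comparison.

```sql
-- Average_Result and state_max are copied from two of the result rows above.
SELECT  contaminant,
        county_name,
        CAST(100 * Average_Result / state_max AS DECIMAL(18,2)) AS Percent_of_State_Max,
        -- Hypothetical extra column: how far the average sits above the limit.
        CAST(100 * (Average_Result - state_max) / state_max AS DECIMAL(18,2)) AS Percent_Above_State_Max
FROM (VALUES
    ('Dissolved Arsenic', 'Tehama', 10, 60.25),   -- 602.50 and 502.50
    ('Dissolved Arsenic', 'Butte',  10, 12.20)    -- 122.00 and 22.00
) AS v(contaminant, county_name, state_max, Average_Result);
```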


Next, I want to determine which counties have averages that are higher than the maximum permissible state levels

In [60]:
WITH County_Average AS (  
SELECT  contaminant,  
        CAST(AVG(Fixed_Result) AS DECIMAL(5,2)) AS Average_Result, 
        county_name
FROM    regulated_contaminants
WHERE   Fixed_Result > state_max AND sample_date > '01-01-2018'
GROUP BY contaminant, county_name
)
-- DISTINCT already removes duplicate counties, so the GROUP BY is not needed
SELECT  DISTINCT r.county_name
FROM    regulated_contaminants r
    LEFT JOIN 
        County_Average c 
    ON c.contaminant = r.contaminant AND c.county_name = r.county_name
WHERE   c.Average_Result > state_max AND sample_date > '01-01-2018'
ORDER BY r.county_name


county_name
Butte
Glenn
Kings
Los Angeles
Sacramento
San Joaquin
Tehama
Yolo


There are 8 different counties with averages that exceed the state maximum. These counties are:

- Butte
- Glenn
- Kings
- Los Angeles
- Sacramento
- San Joaquin
- Tehama
- Yolo

In [61]:
WITH County_Average AS (  
SELECT  contaminant,  
        CAST(AVG(Fixed_Result) AS DECIMAL(5,2)) AS Average_Result, 
        county_name
FROM    regulated_contaminants
WHERE   Fixed_Result > state_max AND sample_date > '01-01-2018'
GROUP BY contaminant, county_name
)
-- DISTINCT already removes duplicate contaminants, so the GROUP BY is not needed
SELECT  DISTINCT r.contaminant
FROM    regulated_contaminants r
    LEFT JOIN 
        County_Average c 
    ON c.contaminant = r.contaminant AND c.county_name = r.county_name
WHERE   c.Average_Result > state_max AND sample_date > '01-01-2018'
ORDER BY r.contaminant

contaminant
Dissolved Antimony
Dissolved Arsenic
Dissolved Mercury
Dissolved Nitrate + Nitrite
Dissolved Strontium


There were 5 different contaminants with county averages higher than the state maximum:

- Antimony
- Arsenic
- Mercury
- Nitrate + Nitrite
- Strontium

Some of the counties in California are rather large, and many measurements with very low levels may pull a county average down even where particular stations have extremely high values. Since 'high' is a relative term, I want to start with stations that exceed the state maximum by 2 times, 5 times, and 10 times.

In [66]:
WITH Multiples AS (
    SELECT  contaminant, 
            state_max, 
            Fixed_Result, 
            station_id, 
            CAST(Fixed_Result/state_max AS DECIMAL (18,2)) AS Factor
    FROM    regulated_contaminants
    WHERE   Fixed_Result > state_max AND sample_date >= '2018-01-01'
)

SELECT  m.contaminant, 
        m.station_id, 
        m.Factor,  
        r.county_name, 
        DATEPART(year, r.sample_date) AS year
FROM    Multiples m
    INNER JOIN 
        regulated_contaminants r
    ON m.contaminant = r.contaminant AND m.station_id = r.station_id
WHERE m.Factor >= 10 AND r.sample_date >= '2018-01-01'
GROUP BY m.station_id, m.contaminant, m.Factor, r.county_name, r.sample_date

contaminant,station_id,Factor,county_name,year
Dissolved Arsenic,1806,12.6,Tehama,2018
Dissolved Arsenic,1806,12.6,Tehama,2020
Dissolved Arsenic,1806,13.3,Tehama,2022
Dissolved Arsenic,1806,21.0,Tehama,2018
Dissolved Arsenic,1806,21.0,Tehama,2020
Dissolved Arsenic,1806,21.0,Tehama,2021
Dissolved Strontium,45914,14.75,San Joaquin,2021
Dissolved Strontium,45914,14.75,San Joaquin,2021
Dissolved Strontium,45914,14.75,San Joaquin,2022
Dissolved Strontium,45914,15.33,San Joaquin,2021
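
The join above matches on contaminant and station_id only, so each exceeding result is paired with every sample date recorded for that contaminant at that station since 2018. That may be why the same Factor appears with several different years in the output. A minimal sketch of one alternative, assuming the goal is the year of the exceeding sample itself, carries county_name and sample_date through the CTE and drops the join. `sample_rows` and its values are hypothetical.

```sql
-- Hypothetical rows standing in for regulated_contaminants (all values are made up).
WITH sample_rows AS (
    SELECT  contaminant, state_max, Fixed_Result, station_id, county_name,
            CAST(sample_date AS DATE) AS sample_date
    FROM (VALUES
        ('Dissolved Arsenic', 10, 126.0, 1, 'County A', '2018-04-01'),
        ('Dissolved Arsenic', 10,  11.0, 1, 'County A', '2020-07-01'),
        ('Dissolved Arsenic', 10,   3.0, 1, 'County A', '2022-01-15')
    ) AS v(contaminant, state_max, Fixed_Result, station_id, county_name, sample_date)
),
Multiples AS (
    -- Keep county_name and sample_date next to the Factor they belong to.
    SELECT  contaminant,
            station_id,
            county_name,
            sample_date,
            CAST(Fixed_Result / state_max AS DECIMAL(18,2)) AS Factor
    FROM    sample_rows
    WHERE   Fixed_Result > state_max AND sample_date >= '2018-01-01'
)
SELECT  contaminant,
        station_id,
        Factor,
        county_name,
        DATEPART(year, sample_date) AS year
FROM    Multiples
WHERE   Factor >= 10;
```

With these rows the sketch returns a single row (Factor 12.60, year 2018). Joining back on contaminant and station_id would instead return that factor once for each of the three sample dates.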


This query shows the largest multiple of the state maximum recorded at each station over the past 5 years (since the beginning of 2018), limited to results that are at least 2 times the state maximum.

In [83]:
WITH Multiples AS (
    SELECT  contaminant, 
            state_max, 
            Fixed_Result, 
            station_id, 
            CAST(Fixed_Result/state_max AS DECIMAL (18,2)) AS Factor
    FROM    regulated_contaminants
    WHERE   Fixed_Result > state_max AND sample_date >= '2018-01-01'
)
SELECT  m.contaminant, 
        m.station_id, 
        MAX(m.Factor) AS Multiple,  
        r.county_name
FROM    Multiples m
    INNER JOIN 
        regulated_contaminants r
    ON m.contaminant = r.contaminant AND m.station_id = r.station_id
WHERE m.Factor >= 2 AND r.sample_date >= '2018-01-01'
GROUP BY m.station_id, m.contaminant, r.county_name

contaminant,station_id,Multiple,county_name
Dissolved Antimony,47905,2.33,Glenn
Dissolved Arsenic,1806,21.0,Tehama
Dissolved Antimony,47904,4.42,Tehama
Dissolved Antimony,47906,5.0,Glenn
Dissolved Antimony,47907,2.92,Glenn
Dissolved Strontium,682,7.3,Tehama
Dissolved Antimony,47908,3.07,Glenn
Dissolved Antimony,331,5.5,Kings
Dissolved Arsenic,666,5.39,Tehama
Dissolved Arsenic,1251,3.3,Tehama


This query shows the largest multiple of the state maximum recorded at each station over the past 5 years (since the beginning of 2018), limited to results that are at least 5 times the state maximum.

In [84]:
WITH Multiples AS (
    SELECT  contaminant, 
            state_max, 
            Fixed_Result, 
            station_id, 
            CAST(Fixed_Result/state_max AS DECIMAL (18,2)) AS Factor
    FROM    regulated_contaminants
    WHERE   Fixed_Result > state_max AND sample_date >= '2018-01-01'
)
SELECT  m.contaminant, 
        m.station_id, 
        MAX(m.Factor) AS Multiple,  
        r.county_name
FROM    Multiples m
    INNER JOIN 
        regulated_contaminants r
    ON m.contaminant = r.contaminant AND m.station_id = r.station_id
WHERE m.Factor >= 5 AND r.sample_date >= '2018-01-01'
GROUP BY m.station_id, m.contaminant, r.county_name

contaminant,station_id,Multiple,county_name
Dissolved Antimony,47906,5.0,Glenn
Dissolved Strontium,682,7.3,Tehama
Dissolved Arsenic,1806,21.0,Tehama
Dissolved Antimony,331,5.5,Kings
Dissolved Arsenic,666,5.39,Tehama
Dissolved Strontium,45916,9.33,Sacramento
Dissolved Strontium,666,9.58,Tehama
Dissolved Strontium,45914,35.42,San Joaquin


This query shows the largest multiple of the state maximum recorded at each station over the past 5 years (since the beginning of 2018), limited to results that are at least 10 times the state maximum.

In [85]:
WITH Multiples AS (
    SELECT  contaminant, 
            state_max, 
            Fixed_Result, 
            station_id, 
            CAST(Fixed_Result/state_max AS DECIMAL (18,2)) AS Factor
    FROM    regulated_contaminants
    WHERE   Fixed_Result > state_max AND sample_date >= '2018-01-01'
)
SELECT  m.contaminant, 
        m.station_id, 
        MAX(m.Factor) AS Multiple,  
        r.county_name
FROM    Multiples m
    INNER JOIN 
        regulated_contaminants r
    ON m.contaminant = r.contaminant AND m.station_id = r.station_id
WHERE m.Factor >= 10 AND r.sample_date >= '2018-01-01'
GROUP BY m.station_id, m.contaminant, r.county_name

contaminant,station_id,Multiple,county_name
Dissolved Arsenic,1806,21.0,Tehama
Dissolved Strontium,45914,35.42,San Joaquin
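
The three queries above differ only in the threshold, so a single query could label each station with the highest tier it reaches, which may be convenient as a filter in Tableau. This is a sketch under assumptions rather than a replacement for the queries above: `sample_rows` and its values, the `Station_Max` CTE, and the `Tier` column and its labels are all hypothetical.

```sql
-- Hypothetical rows standing in for regulated_contaminants (all values are made up).
WITH sample_rows AS (
    SELECT  contaminant, state_max, Fixed_Result, station_id, county_name,
            CAST(sample_date AS DATE) AS sample_date
    FROM (VALUES
        ('Dissolved Arsenic',  10, 126.0, 1, 'County A', '2018-04-01'),
        ('Dissolved Antimony',  6,  18.0, 2, 'County A', '2021-02-10'),
        ('Dissolved Arsenic',  10,  55.0, 3, 'County B', '2020-07-01'),
        ('Dissolved Arsenic',  10,  15.0, 3, 'County B', '2022-01-15')
    ) AS v(contaminant, state_max, Fixed_Result, station_id, county_name, sample_date)
),
Station_Max AS (
    -- Largest multiple of the state maximum per contaminant and station.
    SELECT  contaminant,
            station_id,
            county_name,
            MAX(CAST(Fixed_Result / state_max AS DECIMAL(18,2))) AS Multiple
    FROM    sample_rows
    WHERE   Fixed_Result > state_max AND sample_date >= '2018-01-01'
    GROUP BY contaminant, station_id, county_name
)
SELECT  contaminant,
        station_id,
        county_name,
        Multiple,
        -- CASE checks the branches in order, so the highest tier wins.
        CASE WHEN Multiple >= 10 THEN '10x or more'
             WHEN Multiple >= 5  THEN '5x to under 10x'
             ELSE '2x to under 5x'
        END AS Tier
FROM    Station_Max
WHERE   Multiple >= 2
ORDER BY Multiple DESC;
```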
