In [None]:
CREATE DATABASE WaterQuality;

CREATE TABLE state_regulations (
	contaminant VARCHAR(250) NOT NULL, 
	state_max_level FLOAT NOT NULL, 
	state_detection_limit FLOAT, 
	state_health_goal FLOAT, 
	state_health_date INT, 
	federal_max_level FLOAT,
	federal_max_level_goal FLOAT, 
	units VARCHAR(50)
);

BULK INSERT dbo.state_regulations
FROM 'Water_Quality\state_regulations.csv'
WITH 
(
	FORMAT = 'CSV',
	FIRSTROW = 2
);
GO

SELECT *
FROM dbo.state_regulations
;

**Water Quality Queries**

Query to see how many Federal Maximum Levels are lower than the California Maximum Levels. 

This should return an empty set, because state restrictions are superseded by federal ones, and California tends to be more conservative in its standards than the federal government.

In [55]:
SELECT * 
FROM dbo.state_regulations
WHERE federal_max_level < state_max_level;

contaminant,state_max_level,state_detection_limit,state_health_goal,state_health_date,federal_max_level,federal_max_level_goal,units


As expected, this returned an empty set. Next, I'd like to see how many state standards are lower than the federal standards.

This is a more complicated question than it seems for someone new to SQL. A plain SELECT \* with this filter would show WHICH state standards are lower than the federal standards, and on a list this short the number could simply be read off the rows returned. The query below counts the matching rows instead and returns 27. 

The original question only asked HOW MANY of the contaminants have lower state standards than federal, and if this were a much larger dataset, it would be cumbersome to report the answer from the number of rows affected rather than querying for the intended output.

In [57]:
SELECT COUNT(*)
FROM dbo.state_regulations
WHERE federal_max_level > state_max_level;

(No column name)
27


**Method 1: Using a Subquery**

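What we really want is the COUNT of this result. Simply counting the rows it returns will deliver the number we're looking for.

The easiest place to introduce the above query as a subquery is in the FROM clause. I'm constructing a nested query from the innermost query outward. Note that the subquery below also wraps federal\_max\_level in COALESCE, so contaminants with no federal level (NULL) are treated as less strict than the state level. That is why the count rises from 27 to 42; this handling of NULLs is explained in the final case.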

In [58]:
SELECT COUNT(subquery.contaminant) AS Number_of_Stricter_State_Maximums
FROM (
    SELECT  contaminant, 
            state_max_level,
            COALESCE(federal_max_level, state_max_level + 1) AS federal_max_coalesced
    FROM    dbo.state_regulations
    WHERE   COALESCE(federal_max_level, state_max_level + 1) > state_max_level
) AS subquery


Number_of_Stricter_State_Maximums
42


It is possible in the first line to just write SELECT COUNT(\*) AS ...; however, that form hides the fact that the subquery MUST be given a name, because SQL Server requires an alias on a derived table in the FROM clause. For clarity, I kept subquery.contaminant as the value I'm counting. When I write a query like this myself, I give the subquery a more descriptive name, since readability gets harder as complexity increases. The code below is the same, only with the modified naming for clarity.

In [59]:
SELECT COUNT(Stricter_State_Maximums.contaminant) AS Number_of_Stricter_State_Maximums
FROM (
    SELECT  contaminant, 
            state_max_level,
            COALESCE(federal_max_level, state_max_level + 1) AS federal_max_coalesced
    FROM    dbo.state_regulations
    WHERE   COALESCE(federal_max_level, state_max_level + 1) > state_max_level
) AS Stricter_State_Maximums

Number_of_Stricter_State_Maximums
42


Hopefully it's now easier for the reader to see that the purpose of my subquery was to isolate the contaminants and their levels where the state level was lower than the federal. The parent query is then a COUNT function on the subset of Stricter State Maximums, revealing the Number of Stricter State Maximums.
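
The COUNT(\*) form mentioned earlier can be sketched as follows. This is only a minimal variant of the same query, assuming the dbo.state\_regulations table created at the top of the notebook. The point is that the derived table still carries an alias even though the outer SELECT never refers to it by name.

```sql
-- Sketch: the same count written with COUNT(*).
-- The alias after the closing parenthesis is still required by SQL Server,
-- even though the outer query does not reference it.
SELECT COUNT(*) AS Number_of_Stricter_State_Maximums
FROM (
    SELECT  contaminant
    FROM    dbo.state_regulations
    WHERE   COALESCE(federal_max_level, state_max_level + 1) > state_max_level
) AS Stricter_State_Maximums;
```

Because contaminant is declared NOT NULL, COUNT(\*) and COUNT(contaminant) return the same number here.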

**Method 2: Using a WITH clause, aka CTE or Common Table Expression**

The WITH Clause introduces the subset at the beginning of the statement, and it provides more clarity, as you are following the structure in a linear manner from top to bottom, rather than digging deep into nested functions and crawling back out to the SELECT clause. 

This uses the same original query with the WITH clause to establish the 'Common Table Expression'. For linearity, I will do the same thing as I did with the subquery, labeling the expression as cte first, but then giving it the more appropriate name.

In [60]:
WITH cte AS (
    SELECT  contaminant, 
            COALESCE(federal_max_level, state_max_level + 1) AS federal_max_coalesced
    FROM    dbo.state_regulations
    WHERE   COALESCE(federal_max_level, state_max_level + 1) > state_max_level
)
SELECT COUNT(contaminant) AS Number_of_Stricter_State_Maximums
FROM cte

Number_of_Stricter_State_Maximums
42


In both of these methods, the original query is literally copied into a clause. In the first, it is part of the FROM clause, whereas in the second it is introduced initially in the WITH clause.  

As I add more complicated queries, the WITH clauses can draw from one another or from additional tables, limiting the need for multiple nestings in subqueries.
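
As a small sketch of one WITH expression drawing from another (the names Regulated\_Contaminants and Stricter\_State are hypothetical, chosen only for this illustration):

```sql
WITH 
-- First expression: keep only contaminants that have a federal level
Regulated_Contaminants AS (
    SELECT  contaminant, 
            state_max_level, 
            federal_max_level
    FROM    dbo.state_regulations
    WHERE   federal_max_level IS NOT NULL
), 
-- Second expression: reads from the first one instead of the base table
Stricter_State AS (
    SELECT  contaminant
    FROM    Regulated_Contaminants
    WHERE   federal_max_level > state_max_level
)
SELECT COUNT(contaminant) AS Number_of_Stricter_State_Maximums
FROM Stricter_State;
```

Each expression is defined once and read from top to bottom, where the subquery form would need one FROM clause nested inside another.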

The more intuitive labeling, as promised above, is as follows.

In [61]:
WITH Stricter_State_Maximums AS (
    SELECT  contaminant, 
            state_max_level, 
            COALESCE(federal_max_level, state_max_level + 1) AS federal_max_coalesced
    FROM    dbo.state_regulations
    WHERE   COALESCE(federal_max_level, state_max_level + 1) > state_max_level
)
SELECT COUNT(contaminant) AS Number_of_Stricter_State_Maximums
FROM Stricter_State_Maximums

Number_of_Stricter_State_Maximums
42


**Adding to the existing Query...**

Show the number of Contaminants that have Stricter State Maximums, Stricter Federal Maximums, and Identical Maximums for State and Federal contamination levels.

I'm already cringing at the logic with the Subqueries, so I'm going to start with the WITH clause approach!

**Method 2: WITH**  

The only change I will make from one expression to the next is the comparison operator: \<, \>, and =

In [76]:
WITH 
Stricter_State_Maximums AS (
    SELECT  contaminant, 
            state_max_level,  
            COALESCE(federal_max_level, state_max_level + 1) AS federal_max_coalesced
    FROM    dbo.state_regulations
    WHERE   COALESCE(federal_max_level, state_max_level + 1) > state_max_level
), 
Stricter_Federal_Maximums AS (
    SELECT  contaminant, 
            state_max_level,  
            COALESCE(federal_max_level, state_max_level + 1) AS federal_max_coalesced
    FROM    dbo.state_regulations
    WHERE   COALESCE(federal_max_level, state_max_level + 1) < state_max_level
), 
Same_State_Federal_Maximums AS (
    SELECT  contaminant, 
            state_max_level,  
            COALESCE(federal_max_level, state_max_level + 1) AS federal_max_coalesced
    FROM    dbo.state_regulations
    WHERE   COALESCE(federal_max_level, state_max_level + 1) = state_max_level
)
SELECT COUNT(contaminant) AS Count, 'State' AS Stricter_Standards
FROM Stricter_State_Maximums
UNION ALL 
SELECT COUNT(contaminant), 'Federal' 
FROM Stricter_Federal_Maximums
UNION ALL 
SELECT COUNT(contaminant), 'Equal'
FROM Same_State_Federal_Maximums;

Count,Stricter_Standards
42,State
0,Federal
49,Equal


This is an alternative output, keeping everything in its own column. I've only done this to simplify the following subquery outcome.

In [68]:
WITH 
Stricter_State_Maximums AS (
    SELECT  contaminant, 
            state_max_level,  
            COALESCE(federal_max_level, state_max_level + 1) AS federal_max_coalesced
    FROM    dbo.state_regulations
    WHERE   COALESCE(federal_max_level, state_max_level + 1) > state_max_level
), 
Stricter_Federal_Maximums AS (
    SELECT  contaminant, 
            state_max_level,  
            COALESCE(federal_max_level, state_max_level + 1) AS federal_max_coalesced
    FROM    dbo.state_regulations
    WHERE   COALESCE(federal_max_level, state_max_level + 1) < state_max_level
), 
Same_State_Federal_Maximums AS (
    SELECT  contaminant, 
            state_max_level, 
            COALESCE(federal_max_level, state_max_level + 1) AS federal_max_coalesced
    FROM    dbo.state_regulations
    WHERE   COALESCE(federal_max_level, state_max_level + 1) = state_max_level
)
SELECT  COUNT(SSM.contaminant) AS Number_of_Stricter_State_Standards, 
        COUNT(SFM.contaminant) AS Number_of_Stricter_Federal_Standards, 
        COUNT(SM.contaminant) AS Number_of_Equal_Standards
FROM    dbo.state_regulations AS SR
LEFT JOIN Stricter_State_Maximums AS SSM
    ON SR.contaminant = SSM.contaminant
LEFT JOIN Stricter_Federal_Maximums AS SFM
    ON SR.contaminant = SFM.contaminant
LEFT JOIN Same_State_Federal_Maximums AS SM
    ON SR.contaminant = SM.contaminant


Number_of_Stricter_State_Standards,Number_of_Stricter_Federal_Standards,Number_of_Equal_Standards
42,0,49


Since I'm drawing the same type of data, it's easy to piece together an easy-to-understand output giving the desired number of stricter state maximums, stricter federal maximums, and equal values, using 3 similar Common Table Expressions under the same WITH clause, appended together with UNION ALL.

**Method 1: Subqueries**

In [84]:
SELECT COUNT(contaminant) AS Count, 'State' AS Stricter_Restriction_Levels
FROM (
    SELECT  contaminant, 
            state_max_level,  
            COALESCE(federal_max_level, state_max_level + 1) AS federal_max_coalesced
    FROM    dbo.state_regulations
    WHERE   COALESCE(federal_max_level, state_max_level + 1) > state_max_level
) AS Stricter_State_Maximums
UNION ALL 
SELECT COUNT(contaminant), 'Federal'
FROM (
    SELECT  contaminant, 
            state_max_level, 
            COALESCE(federal_max_level, state_max_level + 1) AS federal_max_coalesced
    FROM    dbo.state_regulations
    WHERE   COALESCE(federal_max_level, state_max_level + 1) < state_max_level
) AS Stricter_Federal_Maximums
UNION ALL 
SELECT COUNT(contaminant), 'Equal'
FROM (
    SELECT  contaminant, 
            state_max_level,  
            COALESCE(federal_max_level, state_max_level + 1) AS federal_max_coalesced
    FROM    dbo.state_regulations
    WHERE   COALESCE(federal_max_level, state_max_level + 1) = state_max_level
) AS Same_State_Federal_Maximums;

Count,Stricter_Restriction_Levels
42,State
0,Federal
49,Equal


In [82]:
SELECT  COUNT(Stricter_State_Maximums.contaminant) AS Number_of_Stricter_State_Maximums,
        COUNT(Stricter_Federal_Maximums.contaminant) AS Number_of_Stricter_Federal_Maximums,
        COUNT(Same_State_Federal_Maximums.contaminant) AS Number_of_Same_State_Federal_Maximums
FROM dbo.state_regulations AS SR
    LEFT JOIN
    (
        SELECT  contaminant, 
                state_max_level, 
                COALESCE(federal_max_level, state_max_level + 1) AS federal_max_coalesced
        FROM    dbo.state_regulations
        WHERE   COALESCE(federal_max_level, state_max_level + 1) > state_max_level
    ) AS Stricter_State_Maximums
    ON SR.contaminant = Stricter_State_Maximums.contaminant
    LEFT JOIN
    (
        SELECT  contaminant, 
                state_max_level, 
                COALESCE(federal_max_level, state_max_level + 1) AS federal_max_coalesced
        FROM    dbo.state_regulations
        WHERE   COALESCE(federal_max_level, state_max_level + 1) = state_max_level
    ) AS Same_State_Federal_Maximums
        ON SR.contaminant = Same_State_Federal_Maximums.contaminant
    LEFT JOIN
    (
        SELECT  contaminant, 
                state_max_level, 
                COALESCE(federal_max_level, state_max_level + 1) AS federal_max_coalesced
        FROM    dbo.state_regulations
        WHERE   COALESCE(federal_max_level, state_max_level + 1) < state_max_level
    ) AS Stricter_Federal_Maximums
        ON SR.contaminant = Stricter_Federal_Maximums.contaminant

Number_of_Stricter_State_Maximums,Number_of_Stricter_Federal_Maximums,Number_of_Same_State_Federal_Maximums
42,0,49


In the above query, each subquery needed to be joined back to the original table, since none of the subqueries share any rows with one another that could be used to join them directly. To achieve these results, it is imperative that a LEFT JOIN be used; otherwise the output will not return the correct values.
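
Why the LEFT JOIN matters can be seen on a tiny example. The sketch below uses a hypothetical three-row table variable (@regs and its sample values are made up for illustration, not taken from the dataset). A LEFT JOIN keeps every row of the base table and fills the unmatched side with NULL, and COUNT(column) skips NULLs, so each COUNT only sees the rows that matched its own subquery.

```sql
-- Hypothetical sample data, for illustration only
DECLARE @regs TABLE (
    contaminant       VARCHAR(50) NOT NULL,
    state_max_level   FLOAT       NOT NULL,
    federal_max_level FLOAT       NULL
);

INSERT INTO @regs (contaminant, state_max_level, federal_max_level)
VALUES  ('Sample A', 1.0, 2.0),   -- state is stricter
        ('Sample B', 1.0, 1.0),   -- equal
        ('Sample C', 1.0, NULL);  -- no federal level

SELECT  COUNT(*)             AS All_Rows,        -- 3: every base row survives the LEFT JOINs
        COUNT(S.contaminant) AS Stricter_State,  -- 1: only 'Sample A' matched
        COUNT(E.contaminant) AS Equal_Levels     -- 1: only 'Sample B' matched
FROM    @regs AS R
    LEFT JOIN (
        SELECT contaminant FROM @regs WHERE federal_max_level > state_max_level
    ) AS S
        ON R.contaminant = S.contaminant
    LEFT JOIN (
        SELECT contaminant FROM @regs WHERE federal_max_level = state_max_level
    ) AS E
        ON R.contaminant = E.contaminant;
```

With INNER JOINs instead, a row would have to match both subqueries at once; since they share no contaminants, every count would come back as 0.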

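The output is slightly different, in that here each count is a column with its own header instead of a row with an identifying label. This can also be achieved using the WITH clause and avoiding subqueries, but it does again require the joins. Note that the version below compares the raw federal\_max\_level without COALESCE, so the contaminants with no federal level are not counted and the state count is 27 rather than 42.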

In [2]:
WITH Stricter_State_Maximums AS (
    SELECT  contaminant, 
            state_max_level, 
            federal_max_level
    FROM    dbo.state_regulations
    WHERE   federal_max_level > state_max_level
),
Same_State_Federal_Maximums AS (
    SELECT  contaminant, 
            state_max_level, 
            federal_max_level
    FROM    dbo.state_regulations
    WHERE   federal_max_level = state_max_level
),
Stricter_Federal_Maximums AS (
    SELECT  contaminant, 
            state_max_level, 
            federal_max_level
    FROM    dbo.state_regulations
    WHERE   federal_max_level < state_max_level
)
SELECT  COUNT(Stricter_State_Maximums.contaminant) AS Number_of_Stricter_State_Maximums,
        COUNT(Stricter_Federal_Maximums.contaminant) AS Number_of_Stricter_Federal_Maximums,
        COUNT(Same_State_Federal_Maximums.contaminant) AS Number_of_Same_State_Federal_Maximums
FROM    dbo.state_regulations AS SR
        LEFT JOIN Stricter_State_Maximums
            ON SR.contaminant = Stricter_State_Maximums.contaminant
        LEFT JOIN Same_State_Federal_Maximums
            ON SR.contaminant = Same_State_Federal_Maximums.contaminant
        LEFT JOIN Stricter_Federal_Maximums
            ON SR.contaminant = Stricter_Federal_Maximums.contaminant;

Number_of_Stricter_State_Maximums,Number_of_Stricter_Federal_Maximums,Number_of_Same_State_Federal_Maximums
27,0,49


**One Final Case:** 

There is one final case I would like to get into that would normally require a nested subquery. 

  

One thing that we ignored is the fact that not all of the Federal Regulations are even present:

In [7]:
SELECT COUNT(*) 
FROM dbo.state_regulations
WHERE federal_max_level IS NULL

(No column name)
15


The design of the table required that both the contaminant name and the state maximum level not be NULL upon inclusion in the database; however, not all of the contaminants had federal levels. The above query showed that 15 of the contaminants did not have a federally regulated maximum level. 

In the queries above that compare federal\_max\_level directly, the NULL rows are ignored, as each query only asked for the number of contaminants with higher, lower, or equal values for state and federal restrictions. NULL isn't a value, so it can't be evaluated using comparison operators. This isn't a problem when we're just counting, but it does pose a problem if the query asks what PERCENTAGE of the restrictions meet a condition. Consider the following questions: 

1\. What percentage of federal restrictions are less strict (higher) than state restrictions? 

2\. What percentage of state restrictions are stricter (lower) than federal restrictions? 

3\. What percentage of the total number of restrictions have a stricter state maximum than federal, assuming a NULL value means it is unregulated at a federal level?   

All three of these will yield different percentages. While the numerator for the first two will be the same, 27, the third must also include the extra 15 NULLs in addition to the 27 where the state is stricter. The following query will provide the denominators for each of the 3 questions:

In [12]:
SELECT COUNT(*) AS Count, 'Cumulative' AS Total
FROM dbo.state_regulations
UNION ALL
SELECT COUNT(*), 'Federal Total' 
FROM dbo.state_regulations
WHERE federal_max_level IS NOT NULL
UNION ALL 
SELECT COUNT(*), 'State Total'
FROM dbo.state_regulations
WHERE state_max_level IS NOT NULL


Count,Total
91,Cumulative
76,Federal Total
91,State Total


In this example, the three seemingly similar questions will result in three different values based on the handling of the NULL values, so it is extremely important to be clear in how the query is approached, and additionally how it is written for interpretability by someone else verifying the logic. 

I'm using this example because ultimately, I was able to ignore the NULLS in the first sets of queries, but if I need to calculate a percentage, which is dependent on the counting of or ignoring of NULL values, I will need to clearly show how these are being handled. 
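
To make the NULL handling concrete, here is a minimal sketch on a hypothetical three-row table variable (@sample and its values are invented for illustration). It shows that a row with a NULL federal level satisfies none of the three comparisons and has to be found with IS NULL instead.

```sql
-- Hypothetical sample data, for illustration only
DECLARE @sample TABLE (
    contaminant       VARCHAR(50) NOT NULL,
    state_max_level   FLOAT       NOT NULL,
    federal_max_level FLOAT       NULL
);

INSERT INTO @sample (contaminant, state_max_level, federal_max_level)
VALUES  ('Sample A', 1.0, 2.0),
        ('Sample B', 1.0, 1.0),
        ('Sample C', 1.0, NULL);

SELECT COUNT(*) AS Count, 'federal > state' AS Comparison
FROM @sample
WHERE federal_max_level > state_max_level    -- matches Sample A only
UNION ALL
SELECT COUNT(*), 'federal = state'
FROM @sample
WHERE federal_max_level = state_max_level    -- matches Sample B only
UNION ALL
SELECT COUNT(*), 'federal < state'
FROM @sample
WHERE federal_max_level < state_max_level    -- matches nothing
UNION ALL
SELECT COUNT(*), 'federal IS NULL'
FROM @sample
WHERE federal_max_level IS NULL;             -- matches Sample C only
```

The three comparison counts add up to 2, not 3. That missing row is exactly what a percentage calculation has to account for when choosing its denominator.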

This final query will address those three questions:

1\. What percentage of federal restrictions are less strict (higher) than state restrictions?  
2\. What percentage of state restrictions are stricter (lower) than federal restrictions?  
3\. What percentage of all contaminants have a stricter state maximum than federal, assuming a NULL value means it is a known contaminant that is unregulated at a federal level?

In [23]:
SELECT  MAX(federal_max_level) as max_fed, MAX(state_max_level) AS state_max
FROM    dbo.state_regulations;

max_fed,state_max
30,20000


I want to treat the NULL values in the federal maximum levels as less strict, so higher, than the values in the state maximum levels. To guarantee that the substituted value is always higher than the state value, I will use COALESCE to replace each NULL federal maximum with the state maximum plus 1. The following query will show the outcome of this. If you're unfamiliar with COALESCE, it returns its first non-NULL argument, so it only changes federal\_max\_level where there is a NULL value; the existing federal\_max\_level values remain unchanged.  

It is important to note here that, of the above 3 questions, this subset can only be used in the calculation of question 3. As the substituted values are not actual federal restrictions, they cannot be included in the percentage of federal restrictions that are less strict than the state's. Likewise, in question 2, the percentage calculated is based on the actual restrictions imposed by both state and federal governments. The third question isn't asking about the percentage of restrictions, but rather about the percentage of all known contaminants, regulated or unregulated by the federal government.

In [85]:
SELECT  contaminant, 
        state_max_level,
        federal_max_level,
        COALESCE(federal_max_level, state_max_level + 1) AS federal_not_null
FROM    dbo.state_regulations

contaminant,state_max_level,federal_max_level,federal_not_null
Aluminum,1.0,,2.0
Antimony,0.006,0.006,0.006
Arsenic,0.01,0.01,0.01
Asbestos,7.0,7.0,7.0
Barium,1.0,2.0,2.0
Beryllium,0.004,0.004,0.004
Cadmium,0.005,0.005,0.005
"Chromium, Total",0.05,0.1,0.1
Cyanide,0.15,0.2,0.2
Fluoride,2.0,4.0,4.0


In [51]:
WITH 
Unregulated_Federal_Coalesced AS (
    SELECT  contaminant, 
            state_max_level,
            COALESCE(federal_max_level, state_max_level + 1) AS federal_not_null
    FROM    dbo.state_regulations
), 
Stricter_State_Coalesced AS (
    SELECT  contaminant, 
            state_max_level,
            federal_not_null
    FROM    Unregulated_Federal_Coalesced
    WHERE   federal_not_null > state_max_level
),
Stricter_State_Maximums AS (
    SELECT  contaminant, 
            state_max_level, 
            federal_max_level
    FROM dbo.state_regulations
    WHERE federal_max_level > state_max_level
),
Federal_Maximums AS (
    SELECT  contaminant,
            state_max_level, 
            federal_max_level    
    FROM    dbo.state_regulations
    WHERE   federal_max_level IS NOT NULL
),
Laxer_Federal_Maximums AS (
    SELECT  contaminant, 
            state_max_level, 
            federal_max_level
    FROM    Federal_Maximums
    WHERE   federal_max_level > state_max_level
), 
All_Counts AS (
    SELECT  COUNT(LFM.contaminant) AS Laxer_Federal_Maximum_Count, 
            COUNT(FM.contaminant) AS Federal_Maximum_Count, 
            COUNT(SSM.contaminant) AS Stricter_State_Maximum_Count, 
            COUNT(SSC.contaminant) AS Stricter_State_Coalesced_Count,
            COUNT(UFC.contaminant) AS Unregulated_Federal_Coalesced_Count
    FROM    Unregulated_Federal_Coalesced AS UFC
        LEFT JOIN
            Stricter_State_Coalesced AS SSC
            ON UFC.contaminant = SSC.contaminant
        LEFT JOIN
            Stricter_State_Maximums AS SSM
            ON UFC.contaminant = SSM.contaminant
        LEFT JOIN 
            Federal_Maximums AS FM
            ON UFC.contaminant = FM.contaminant
        LEFT JOIN 
            Laxer_Federal_Maximums AS LFM
            ON UFC.contaminant = LFM.contaminant
)
SELECT  CAST(100*Laxer_Federal_Maximum_Count/ (SELECT Federal_Maximum_Count FROM All_Counts) AS DECIMAL(5,1)) AS Question_1, 
        CAST(100*Stricter_State_Maximum_Count/ (SELECT Unregulated_Federal_Coalesced_Count FROM All_Counts) AS DECIMAL(5,1)) AS Question_2,
        CAST(100*Stricter_State_Coalesced_Count/ (SELECT Unregulated_Federal_Coalesced_Count FROM All_Counts) AS DECIMAL(5,1)) AS Question_3
FROM    All_Counts

Question_1,Question_2,Question_3
35.0,29.0,46.0


And now for the Subquery version of the above code:

In [53]:
SELECT  CAST(100*Laxer_Federal_Maximum_Count/ (SELECT COUNT(*) FROM dbo.state_regulations WHERE federal_max_level IS NOT NULL) AS DECIMAL(5,1)) AS Question_1,
        CAST(100*Stricter_State_Maximum_Count/ (SELECT COUNT(*) FROM (
            SELECT  contaminant, 
                    state_max_level, 
                    federal_max_level 
            FROM dbo.state_regulations 
            ) AS SSM) AS DECIMAL(5,1)) AS Question_2,
        CAST(100*Stricter_State_Coalesced_Count/ (SELECT COUNT(*) FROM (
            SELECT  contaminant, 
                    state_max_level, 
                    COALESCE(federal_max_level, state_max_level + 1) AS federal_not_null 
            FROM dbo.state_regulations) AS UFC 
            ) AS DECIMAL(5,1)) AS Question_3  -- denominator is every contaminant
FROM (
    SELECT  COUNT(LFM.contaminant) AS Laxer_Federal_Maximum_Count, 
            COUNT(FM.contaminant) AS Federal_Maximum_Count, 
            COUNT(SSM.contaminant) AS Stricter_State_Maximum_Count, 
            COUNT(SSC.contaminant) AS Stricter_State_Coalesced_Count,
            COUNT(UFC.contaminant) AS Unregulated_Federal_Coalesced_Count
    FROM (
        -- Base set: every contaminant, so rows with a NULL federal level are kept
        SELECT  contaminant, 
                state_max_level, 
                COALESCE(federal_max_level, state_max_level + 1) AS federal_not_null 
        FROM dbo.state_regulations) AS UFC
        LEFT JOIN (
            SELECT  contaminant, 
                    state_max_level, 
                    federal_max_level 
            FROM dbo.state_regulations 
            WHERE federal_max_level IS NOT NULL) AS FM 
        ON UFC.contaminant = FM.contaminant
        LEFT JOIN (
            SELECT  contaminant, 
                    state_max_level, 
                    federal_max_level 
            FROM dbo.state_regulations 
            WHERE federal_max_level > state_max_level) AS SSM 
        ON UFC.contaminant = SSM.contaminant
        LEFT JOIN (
            -- Coalesced comparison, matching Stricter_State_Coalesced in the WITH version
            SELECT  contaminant, 
                    state_max_level, 
                    COALESCE(federal_max_level, state_max_level + 1) AS federal_not_null 
            FROM dbo.state_regulations 
            WHERE COALESCE(federal_max_level, state_max_level + 1) > state_max_level) AS SSC 
        ON UFC.contaminant = SSC.contaminant
        LEFT JOIN (
            SELECT  contaminant, 
                    state_max_level, 
                    federal_max_level 
            FROM (
                SELECT  contaminant, 
                        state_max_level, 
                        federal_max_level 
                FROM dbo.state_regulations 
                WHERE federal_max_level IS NOT NULL) AS FM 
            WHERE federal_max_level > state_max_level) AS LFM 
        ON UFC.contaminant = LFM.contaminant
) AS All_Counts;


Question_1,Question_2,Question_3
35.0,29.0,46.0


I hope it's clear that from a readability standpoint, the code using Common Table Expressions (WITH) is much easier to follow and comprehend. 

  

Even using CTEs for organization, I still had to resort to subqueries when performing the calculations within the SELECT clause, which hopefully makes you wonder if there is an easier way to do this. Indeed there is, but that's a different advanced topic that isn't ideal to introduce at this time.
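
One smaller adjustment needs no new syntax. In SQL Server, dividing one integer by another discards the fractional part before CAST is applied, which is why the percentages above all end in .0. The sketch below plugs the counts reported earlier in this notebook (27, 76, 42, and 91) into a one-row CTE as literals and multiplies by 100.0 to force decimal division. It also reads the columns straight from the CTE named in the FROM clause. Treat it as an illustration of the arithmetic under those assumptions, not as a replacement for the query above.

```sql
WITH All_Counts AS (
    -- Literal counts copied from the outputs earlier in this notebook
    SELECT  27 AS Laxer_Federal_Maximum_Count,
            76 AS Federal_Maximum_Count,
            27 AS Stricter_State_Maximum_Count,
            42 AS Stricter_State_Coalesced_Count,
            91 AS Unregulated_Federal_Coalesced_Count
)
SELECT  -- Integer arithmetic: 2700 / 76 is truncated to 35 before the CAST
        CAST(100   * Laxer_Federal_Maximum_Count / Federal_Maximum_Count AS DECIMAL(5,1)) AS Question_1_Integer_Division,  -- 35.0
        -- Decimal arithmetic: 100.0 makes the whole expression a decimal
        CAST(100.0 * Laxer_Federal_Maximum_Count / Federal_Maximum_Count AS DECIMAL(5,1)) AS Question_1,  -- 35.5
        CAST(100.0 * Stricter_State_Maximum_Count / Unregulated_Federal_Coalesced_Count AS DECIMAL(5,1)) AS Question_2,  -- 29.7
        CAST(100.0 * Stricter_State_Coalesced_Count / Unregulated_Federal_Coalesced_Count AS DECIMAL(5,1)) AS Question_3  -- 46.2
FROM    All_Counts;
```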

  

  

I hope that this was informative and easy to digest! My goal is to help others who have completed SQL coursework and have a good handle on using joins, subqueries, and the other standard clauses within the SQL language.