# Subqueries, Derived Tables, and Common Table Expressions

In this section, we will learn some more advanced patterns for embedding queries inside other queries. These patterns are helpful tools for data analysis, including replacing missing values.

Let's start with the basics of subqueries and then work up to derived tables and common table expressions.

Connect to the `company_operations.db` database and let's focus on the `WEATHER_MONITOR` table first. 

In [None]:
import sqlite3
import pandas as pd 

conn = sqlite3.connect('company_operations.db')

pd.read_sql("SELECT * FROM WEATHER_MONITOR LIMIT 10", conn)

## Subqueries

Notice how the `TEMPERATURE` column has some null values. There are not a lot, just 3 records.

In [None]:
sql = """
SELECT * FROM WEATHER_MONITOR
WHERE TEMPERATURE IS NULL 
"""
            
pd.read_sql(sql, conn)


Let's say we wanted to replace these null values with the average `TEMPERATURE` across all records. Perhaps we would do this so we don't throw away the three records and can still use them for modeling. We can do this using a **scalar subquery** like this, which embeds a query returning a single value inside another query. Note that `AVG()` ignores null values, so the average is computed only from the records that have a temperature.

In [None]:
sql = """
SELECT ID, 
REPORT_CODE, 
REPORT_DATE, 
LOCATION_ID, 

CASE WHEN TEMPERATURE IS NULL THEN (SELECT AVG(TEMPERATURE) FROM WEATHER_MONITOR) 
     ELSE TEMPERATURE 
END AS TEMPERATURE_IMPUTED

FROM WEATHER_MONITOR
WHERE TEMPERATURE IS NULL 
"""
            
pd.read_sql(sql, conn)
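
To experiment with the scalar subquery pattern without touching the course database, here is a minimal sketch against an in-memory SQLite table. The rows are made up for illustration, the names `conn_demo` and `sql_demo` are hypothetical, and `COALESCE` is swapped in as a shorter equivalent of the `CASE WHEN ... IS NULL` expression above.

```python
import sqlite3

# In-memory stand-in for WEATHER_MONITOR; the rows are hypothetical.
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute(
    "CREATE TABLE WEATHER_MONITOR (ID INTEGER PRIMARY KEY, TEMPERATURE REAL)"
)
conn_demo.executemany(
    "INSERT INTO WEATHER_MONITOR (ID, TEMPERATURE) VALUES (?, ?)",
    [(1, 10.0), (2, 20.0), (3, None), (4, 30.0)],
)

# AVG() skips the NULL, so the scalar subquery averages the three known readings.
sql_demo = """
SELECT ID,
COALESCE(TEMPERATURE, (SELECT AVG(TEMPERATURE) FROM WEATHER_MONITOR)) AS TEMPERATURE_IMPUTED
FROM WEATHER_MONITOR
ORDER BY ID
"""

imputed_rows = conn_demo.execute(sql_demo).fetchall()
print(imputed_rows)  # [(1, 10.0), (2, 20.0), (3, 20.0), (4, 30.0)]
```

Only record 3 changes, picking up the overall average of 20.0.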


Perhaps it would be more accurate to account for the month and year, and only average the records matching those two attributes. After all, notice how the average temperature varies by year and month. There are no `YEAR` and `MONTH` fields, but we can derive them from the `REPORT_DATE` field. In SQLite, we do this with `strftime()` and a [special pattern syntax](https://www.sqlite.org/lang_datefunc.html) that extracts date and time elements. We use `%Y` for the year and `%m` for the month.

In [None]:
sql = """ 
SELECT strftime('%Y', REPORT_DATE) AS YEAR, 
strftime('%m', REPORT_DATE) AS MONTH,
AVG(TEMPERATURE) AS AVG_TEMP 
FROM WEATHER_MONITOR
GROUP BY 1, 2
"""

pd.read_sql(sql, conn)

Using these year/month averages for imputation is a little tricky. We can use a **correlated subquery**, which is a subquery that refers to the current record of the outer query. We work with two instances of the `WEATHER_MONITOR` table, aliasing the subquery instance as `wm2` and the outer instance as `wm1`. The subquery then runs for every outer record, averaging only the records that match its year and month.

In [None]:
sql = """
SELECT ID, 
REPORT_CODE, 
REPORT_DATE, 
LOCATION_ID, 


CASE WHEN TEMPERATURE IS NULL THEN 
    (
        SELECT AVG(TEMPERATURE) FROM WEATHER_MONITOR wm2 
        WHERE strftime('%Y', wm1.REPORT_DATE) = strftime('%Y', wm2.REPORT_DATE) -- year must match outer record
        AND strftime('%m', wm1.REPORT_DATE) = strftime('%m', wm2.REPORT_DATE) -- month must match outer record
     ) 
     ELSE TEMPERATURE 
END AS TEMPERATURE_IMPUTED

FROM WEATHER_MONITOR wm1 
WHERE TEMPERATURE IS NULL 
"""
            
pd.read_sql(sql, conn)
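
Here is a small sketch of the same correlated subquery on made-up data, so we can verify the year/month matching by hand. The table contents and the names `conn_demo` and `sql_demo` are hypothetical. As a variation, it compares year and month in a single `strftime('%Y-%m', ...)` call rather than two separate ones.

```python
import sqlite3

# In-memory stand-in for WEATHER_MONITOR; the rows are hypothetical.
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute(
    "CREATE TABLE WEATHER_MONITOR (ID INTEGER PRIMARY KEY, REPORT_DATE TEXT, TEMPERATURE REAL)"
)
conn_demo.executemany(
    "INSERT INTO WEATHER_MONITOR VALUES (?, ?, ?)",
    [
        (1, "2021-01-05", 10.0),
        (2, "2021-01-20", 20.0),
        (3, "2021-01-25", None),  # should pick up the January average
        (4, "2021-07-04", 80.0),
        (5, "2021-07-15", None),  # should pick up the July average
    ],
)

# The subquery references wm1, so it is re-evaluated for each outer record.
sql_demo = """
SELECT ID,
(
    SELECT AVG(TEMPERATURE) FROM WEATHER_MONITOR wm2
    WHERE strftime('%Y-%m', wm2.REPORT_DATE) = strftime('%Y-%m', wm1.REPORT_DATE)
) AS TEMPERATURE_IMPUTED
FROM WEATHER_MONITOR wm1
WHERE TEMPERATURE IS NULL
ORDER BY ID
"""

correlated_rows = conn_demo.execute(sql_demo).fetchall()
print(correlated_rows)  # [(3, 15.0), (5, 80.0)]
```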


There are more efficient ways of doing this, including derived tables and common table expressions. We will visit those in a moment.

Let's turn our attention to two other tables, `CUSTOMER` and `CUSTOMER_ORDER`.

In [None]:
pd.read_sql("SELECT * FROM CUSTOMER", conn)

In [None]:
pd.read_sql("SELECT * FROM CUSTOMER_ORDER", conn)

These two tables are linked by the `CUSTOMER_ID`, again meaning that each `CUSTOMER_ORDER` record has a `CUSTOMER_ID` associated with it. We can then use that `CUSTOMER_ID` value to look up the `CUSTOMER` details. 

What we are interested in achieving here is finding `CUSTOMER` records that have no `CUSTOMER_ORDER` records associated with them. The most rudimentary way of doing this is with a `LEFT JOIN`, which includes all records in the "left" table even if there are no records to join to in the "right" table. "Left" and "right" are determined by which side of the `LEFT JOIN` keywords a table is specified on. If no matching records exist in the right table, the right table's fields will be `NULL` in a placeholder record. We can therefore check whether those fields are null after the `LEFT JOIN`. Here's that technique to find customers that have no orders.

In [None]:
sql = """
SELECT CUSTOMER.CUSTOMER_ID, 
CUSTOMER_NAME

FROM CUSTOMER LEFT JOIN CUSTOMER_ORDER
ON CUSTOMER.CUSTOMER_ID = CUSTOMER_ORDER.CUSTOMER_ID

WHERE CUSTOMER_ORDER.CUSTOMER_ID IS NULL
"""
            
pd.read_sql(sql, conn)
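
To make the placeholder record visible, this sketch runs the `LEFT JOIN` on a tiny in-memory example, first without and then with the `IS NULL` filter. The rows, the `CUSTOMER_ORDER_ID` column, and the names `conn_demo`, `joined_rows`, and `no_order_rows` are assumptions made for illustration; the real tables have more columns.

```python
import sqlite3

# In-memory stand-ins for CUSTOMER and CUSTOMER_ORDER; the rows are hypothetical.
conn_demo = sqlite3.connect(":memory:")
conn_demo.executescript("""
CREATE TABLE CUSTOMER (CUSTOMER_ID INTEGER PRIMARY KEY, CUSTOMER_NAME TEXT);
CREATE TABLE CUSTOMER_ORDER (CUSTOMER_ORDER_ID INTEGER PRIMARY KEY, CUSTOMER_ID INTEGER);
INSERT INTO CUSTOMER VALUES (1, 'Alpha'), (2, 'Beta');
INSERT INTO CUSTOMER_ORDER VALUES (100, 1), (101, 1);
""")

# Unfiltered LEFT JOIN: 'Beta' has no orders, so its right-side field is NULL (None).
joined_rows = conn_demo.execute("""
SELECT c.CUSTOMER_ID, c.CUSTOMER_NAME, o.CUSTOMER_ORDER_ID
FROM CUSTOMER c LEFT JOIN CUSTOMER_ORDER o
ON c.CUSTOMER_ID = o.CUSTOMER_ID
ORDER BY c.CUSTOMER_ID, o.CUSTOMER_ORDER_ID
""").fetchall()
print(joined_rows)  # [(1, 'Alpha', 100), (1, 'Alpha', 101), (2, 'Beta', None)]

# Keeping only the placeholder records leaves the customers with no orders.
no_order_rows = conn_demo.execute("""
SELECT c.CUSTOMER_ID, c.CUSTOMER_NAME
FROM CUSTOMER c LEFT JOIN CUSTOMER_ORDER o
ON c.CUSTOMER_ID = o.CUSTOMER_ID
WHERE o.CUSTOMER_ID IS NULL
""").fetchall()
print(no_order_rows)  # [(2, 'Beta')]
```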


An arguably more elegant way to achieve this is to use a subquery. We can get the set of `CUSTOMER_ID` values from `CUSTOMER_ORDER` in a subquery and keep only the customers whose `CUSTOMER_ID` is not in that set.

In [None]:
sql = """
SELECT * FROM CUSTOMER 
WHERE CUSTOMER_ID NOT IN (SELECT DISTINCT CUSTOMER_ID FROM CUSTOMER_ORDER)
"""

pd.read_sql(sql, conn)
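
One caveat worth knowing with `NOT IN`: if the subquery returns even one `NULL`, the comparison evaluates to unknown for every non-matching record and the query returns nothing. The sketch below demonstrates this on made-up data. It assumes, purely for illustration, an order with a null `CUSTOMER_ID`; the real `CUSTOMER_ORDER` table may not contain any. The `CUSTOMER_ORDER_ID` column and the `conn_demo` name are also hypothetical.

```python
import sqlite3

# In-memory stand-ins; order 101 deliberately has a NULL CUSTOMER_ID (hypothetical data).
conn_demo = sqlite3.connect(":memory:")
conn_demo.executescript("""
CREATE TABLE CUSTOMER (CUSTOMER_ID INTEGER PRIMARY KEY, CUSTOMER_NAME TEXT);
CREATE TABLE CUSTOMER_ORDER (CUSTOMER_ORDER_ID INTEGER PRIMARY KEY, CUSTOMER_ID INTEGER);
INSERT INTO CUSTOMER VALUES (1, 'Alpha'), (2, 'Beta'), (3, 'Gamma');
INSERT INTO CUSTOMER_ORDER VALUES (100, 1), (101, NULL);
""")

# The NULL in the subquery result makes NOT IN unknown for customers 2 and 3.
not_in_rows = conn_demo.execute("""
SELECT CUSTOMER_ID FROM CUSTOMER
WHERE CUSTOMER_ID NOT IN (SELECT CUSTOMER_ID FROM CUSTOMER_ORDER)
ORDER BY CUSTOMER_ID
""").fetchall()
print(not_in_rows)  # []

# Filtering NULLs out of the subquery restores the expected answer.
guarded_rows = conn_demo.execute("""
SELECT CUSTOMER_ID FROM CUSTOMER
WHERE CUSTOMER_ID NOT IN (
    SELECT CUSTOMER_ID FROM CUSTOMER_ORDER WHERE CUSTOMER_ID IS NOT NULL
)
ORDER BY CUSTOMER_ID
""").fetchall()
print(guarded_rows)  # [(2,), (3,)]

# NOT EXISTS is not affected by the NULL, since it only tests whether a match exists.
not_exists_rows = conn_demo.execute("""
SELECT CUSTOMER_ID FROM CUSTOMER c
WHERE NOT EXISTS (SELECT * FROM CUSTOMER_ORDER WHERE CUSTOMER_ID = c.CUSTOMER_ID)
ORDER BY CUSTOMER_ID
""").fetchall()
print(not_exists_rows)  # [(2,), (3,)]
```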

Another way to achieve this is to use the `EXISTS` (or `NOT EXISTS`) operator with a correlated subquery that looks for `CUSTOMER_ORDER` records with the outer `CUSTOMER_ID`. `NOT EXISTS` keeps only the `CUSTOMER` records for which that subquery returns nothing. An advantage is that the subquery does not need to do a full scan of the table; it can stop the moment it finds a single record.

In [None]:
sql = """
SELECT * FROM CUSTOMER c 
WHERE NOT EXISTS (SELECT * FROM CUSTOMER_ORDER WHERE CUSTOMER_ID = c.CUSTOMER_ID)
"""

pd.read_sql(sql, conn)

Allow me to slip in just one more example. We can also use subqueries to return only orders for the latest `ORDER_DATE`. 

In [None]:
sql = """
SELECT * FROM CUSTOMER_ORDER
WHERE ORDER_DATE = (SELECT MAX(ORDER_DATE) FROM CUSTOMER_ORDER)
"""

pd.read_sql(sql, conn)

## Derived Tables 

Recall we demonstrated this query showing the average temperature by year and month. 

In [None]:
sql = """ 
SELECT strftime('%Y', REPORT_DATE) AS YEAR, 
strftime('%m', REPORT_DATE) AS MONTH,
AVG(TEMPERATURE) AS AVG_TEMP 
FROM WEATHER_MONITOR
GROUP BY 1, 2
"""

pd.read_sql(sql, conn)

What if we were to join this "table" (backed by a `SELECT` query) to `WEATHER_MONITOR` and impute those three missing `TEMPERATURE` values with the averages for that year and month? Querying off another `SELECT` query in this fashion, and treating it like a table, is known as a **derived table**. Note below we embed that `SELECT` query into the `INNER JOIN` and treat it like a table, joining on the year and month. 

In [None]:
sql = """
SELECT ID, 
REPORT_CODE, 
REPORT_DATE, 
LOCATION_ID, 
CASE WHEN TEMPERATURE IS NULL THEN AVG_TEMP ELSE TEMPERATURE END AS TEMPERATURE_IMPUTED

FROM WEATHER_MONITOR INNER JOIN ( 
    SELECT strftime('%Y', REPORT_DATE) AS YEAR, 
    strftime('%m', REPORT_DATE) AS MONTH,
    AVG(TEMPERATURE) AS AVG_TEMP 
    FROM WEATHER_MONITOR
    GROUP BY 1, 2
) AS temp_avgs

ON strftime('%Y', REPORT_DATE) = temp_avgs.YEAR
AND strftime('%m', REPORT_DATE) = temp_avgs.MONTH

WHERE TEMPERATURE IS NULL 
"""
            
pd.read_sql(sql, conn)
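
The sketch below splits the derived table pattern into its two stages on made-up data: first the averages query on its own, then the same query embedded in the join. The rows and the names `conn_demo`, `avg_sql`, and `join_sql` are hypothetical, and `COALESCE` stands in for the `CASE` expression.

```python
import sqlite3

# In-memory stand-in for WEATHER_MONITOR; the rows are hypothetical.
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute(
    "CREATE TABLE WEATHER_MONITOR (ID INTEGER PRIMARY KEY, REPORT_DATE TEXT, TEMPERATURE REAL)"
)
conn_demo.executemany(
    "INSERT INTO WEATHER_MONITOR VALUES (?, ?, ?)",
    [
        (1, "2021-01-05", 10.0),
        (2, "2021-01-20", 20.0),
        (3, "2021-01-25", None),
        (4, "2021-07-04", 80.0),
        (5, "2021-07-15", None),
    ],
)

# Stage 1: the query that will back the derived table, run by itself.
avg_sql = """
SELECT strftime('%Y', REPORT_DATE) AS YEAR,
strftime('%m', REPORT_DATE) AS MONTH,
AVG(TEMPERATURE) AS AVG_TEMP
FROM WEATHER_MONITOR
GROUP BY 1, 2
"""
avg_rows = sorted(conn_demo.execute(avg_sql).fetchall())
print(avg_rows)  # [('2021', '01', 15.0), ('2021', '07', 80.0)]

# Stage 2: embed that query in parentheses and join to it like a table.
join_sql = f"""
SELECT wm.ID,
COALESCE(wm.TEMPERATURE, temp_avgs.AVG_TEMP) AS TEMPERATURE_IMPUTED
FROM WEATHER_MONITOR wm INNER JOIN ({avg_sql}) AS temp_avgs
ON strftime('%Y', wm.REPORT_DATE) = temp_avgs.YEAR
AND strftime('%m', wm.REPORT_DATE) = temp_avgs.MONTH
ORDER BY wm.ID
"""
derived_rows = conn_demo.execute(join_sql).fetchall()
print(derived_rows)  # [(1, 10.0), (2, 20.0), (3, 15.0), (4, 80.0), (5, 80.0)]
```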


There is an advantage here in that we calculate these averages in advance and then look them up in the final `SELECT` query. However, derived tables can be nested in several tiers, creating an anti-pattern called the **pyramid of doom**. We can nest several derived tables in a query, but it gets messy and difficult to sift through and manage. For this reason, the modern SQL developer should opt to use common table expressions.

## Common Table Expressions (CTEs)

**Common Table Expressions (CTEs)** are your best friend as a SQL developer and analyst. They break down complex queries into easily digestible steps. Here is our previous example imputing averages for missing `TEMPERATURE` values in our `WEATHER_MONITOR` table, but using a common table expression.

In [None]:
sql = """
WITH temp_avgs AS (
    SELECT strftime('%Y', REPORT_DATE) AS YEAR, 
    strftime('%m', REPORT_DATE) AS MONTH,
    AVG(TEMPERATURE) AS AVG_TEMP 
    FROM WEATHER_MONITOR
    GROUP BY 1, 2
)

SELECT ID, 
REPORT_CODE, 
REPORT_DATE, 
LOCATION_ID, 
CASE WHEN TEMPERATURE IS NULL THEN AVG_TEMP ELSE TEMPERATURE END AS TEMPERATURE_IMPUTED

FROM WEATHER_MONITOR INNER JOIN temp_avgs

ON strftime('%Y', REPORT_DATE) = temp_avgs.YEAR
AND strftime('%m', REPORT_DATE) = temp_avgs.MONTH

WHERE TEMPERATURE IS NULL 
"""
            
pd.read_sql(sql, conn)


Notice how `temp_avgs` is declared in advance and then treated as a "table" backed by a `SELECT` query. This is much cleaner than embedding it into the body of the main `SELECT` query.

What's even better about common table expressions is that one CTE can refer to a previous CTE, creating a "chain" of steps that breaks up complex logic without creating pyramids of doom. Below we add a second CTE that imputes the missing temperatures, and then check in the final query that the averages replaced those null values.

In [None]:
sql = """
WITH temp_avgs AS (
    SELECT strftime('%Y', REPORT_DATE) AS YEAR, 
    strftime('%m', REPORT_DATE) AS MONTH,
    AVG(TEMPERATURE) AS AVG_TEMP 
    FROM WEATHER_MONITOR
    GROUP BY 1, 2
) , 

missing_temps_imputed AS ( 
    SELECT ID, 
    REPORT_CODE, 
    REPORT_DATE, 
    LOCATION_ID, 
    CASE WHEN TEMPERATURE IS NULL THEN temp_avgs.AVG_TEMP ELSE TEMPERATURE END AS TEMPERATURE_IMPUTED
    
    FROM WEATHER_MONITOR INNER JOIN temp_avgs
    
    ON strftime('%Y', REPORT_DATE) = temp_avgs.YEAR 
    AND strftime('%m', REPORT_DATE)  = temp_avgs.MONTH
)

SELECT * FROM missing_temps_imputed
WHERE ID IN (SELECT ID FROM WEATHER_MONITOR WHERE TEMPERATURE IS NULL)
"""
            
pd.read_sql(sql, conn)
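
To see what the chain buys us, this sketch writes the same two-step logic twice on made-up data: once as nested derived tables and once as chained CTEs, then confirms both return the same rows. The data and the names `conn_demo`, `nested_sql`, and `chained_sql` are hypothetical, and year and month are combined into a single `YEAR_MONTH` key to keep the example short.

```python
import sqlite3

# In-memory stand-in for WEATHER_MONITOR; the rows are hypothetical.
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute(
    "CREATE TABLE WEATHER_MONITOR (ID INTEGER PRIMARY KEY, REPORT_DATE TEXT, TEMPERATURE REAL)"
)
conn_demo.executemany(
    "INSERT INTO WEATHER_MONITOR VALUES (?, ?, ?)",
    [
        (1, "2021-01-05", 10.0),
        (2, "2021-01-20", 20.0),
        (3, "2021-01-25", None),
        (4, "2021-07-04", 80.0),
        (5, "2021-07-15", None),
    ],
)

# Nested derived tables: each step is buried one level deeper.
nested_sql = """
SELECT ID, TEMPERATURE_IMPUTED FROM (
    SELECT wm.ID, COALESCE(wm.TEMPERATURE, temp_avgs.AVG_TEMP) AS TEMPERATURE_IMPUTED
    FROM WEATHER_MONITOR wm INNER JOIN (
        SELECT strftime('%Y-%m', REPORT_DATE) AS YEAR_MONTH,
        AVG(TEMPERATURE) AS AVG_TEMP
        FROM WEATHER_MONITOR
        GROUP BY 1
    ) AS temp_avgs
    ON strftime('%Y-%m', wm.REPORT_DATE) = temp_avgs.YEAR_MONTH
) AS missing_temps_imputed
WHERE ID IN (SELECT ID FROM WEATHER_MONITOR WHERE TEMPERATURE IS NULL)
ORDER BY ID
"""

# Chained CTEs: the same steps, declared top to bottom in reading order.
chained_sql = """
WITH temp_avgs AS (
    SELECT strftime('%Y-%m', REPORT_DATE) AS YEAR_MONTH,
    AVG(TEMPERATURE) AS AVG_TEMP
    FROM WEATHER_MONITOR
    GROUP BY 1
),

missing_temps_imputed AS (
    SELECT wm.ID, COALESCE(wm.TEMPERATURE, temp_avgs.AVG_TEMP) AS TEMPERATURE_IMPUTED
    FROM WEATHER_MONITOR wm INNER JOIN temp_avgs
    ON strftime('%Y-%m', wm.REPORT_DATE) = temp_avgs.YEAR_MONTH
)

SELECT ID, TEMPERATURE_IMPUTED FROM missing_temps_imputed
WHERE ID IN (SELECT ID FROM WEATHER_MONITOR WHERE TEMPERATURE IS NULL)
ORDER BY ID
"""

nested_rows = conn_demo.execute(nested_sql).fetchall()
chained_rows = conn_demo.execute(chained_sql).fetchall()
print(chained_rows)  # [(3, 15.0), (5, 80.0)]
print(nested_rows == chained_rows)  # True
```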


## Exercise

Rewrite the query below, which shows the total rain for each year/month alongside each `WEATHER_MONITOR` record, using a common table expression rather than a correlated subquery. Take your time here, and label things however you want.

In [None]:
sql = """
SELECT ID, 
REPORT_CODE, 
REPORT_DATE, 
LOCATION_ID, 
RAIN,
(
    SELECT SUM(RAIN) FROM WEATHER_MONITOR wm2 
    WHERE strftime('%Y', wm1.REPORT_DATE) = strftime('%Y', wm2.REPORT_DATE) -- year must match outer record
    AND strftime('%m', wm1.REPORT_DATE) = strftime('%m', wm2.REPORT_DATE) -- month must match outer record
) AS RAIN_TOTAL_MONTH

FROM WEATHER_MONITOR wm1 
"""
            
pd.read_sql(sql, conn)



### SCROLL DOWN FOR ANSWER
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
v 

In [None]:
sql = """ 
WITH rain_totals AS ( 
    SELECT strftime('%Y', REPORT_DATE) AS REPORT_YEAR, 
    strftime('%m', REPORT_DATE) AS REPORT_MONTH, 
    SUM(RAIN) AS TOTAL_RAIN 
    FROM WEATHER_MONITOR  
    GROUP BY 1, 2 
)

SELECT ID, 
REPORT_CODE, 
REPORT_DATE, 
LOCATION_ID, 
RAIN,
TOTAL_RAIN 

FROM WEATHER_MONITOR INNER JOIN rain_totals
ON strftime('%Y', REPORT_DATE) = rain_totals.REPORT_YEAR
AND strftime('%m', REPORT_DATE) = rain_totals.REPORT_MONTH
"""
            
pd.read_sql(sql, conn)
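
A handy way to check a rewrite like this is to run both versions against a small dataset and compare the results. Here is a sketch doing that with made-up rain readings; the rows and the names `conn_demo`, `correlated_sql`, and `cte_sql` are hypothetical, and only the `ID` and total columns are selected to keep the output short.

```python
import sqlite3

# In-memory stand-in for WEATHER_MONITOR; the rows are hypothetical.
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute(
    "CREATE TABLE WEATHER_MONITOR (ID INTEGER PRIMARY KEY, REPORT_DATE TEXT, RAIN REAL)"
)
conn_demo.executemany(
    "INSERT INTO WEATHER_MONITOR VALUES (?, ?, ?)",
    [
        (1, "2021-01-05", 1.0),
        (2, "2021-01-20", 2.5),
        (3, "2021-07-04", 0.0),
        (4, "2021-07-15", 4.0),
    ],
)

# Original approach: the subquery runs once per outer record.
correlated_sql = """
SELECT ID,
(
    SELECT SUM(RAIN) FROM WEATHER_MONITOR wm2
    WHERE strftime('%Y', wm1.REPORT_DATE) = strftime('%Y', wm2.REPORT_DATE)
    AND strftime('%m', wm1.REPORT_DATE) = strftime('%m', wm2.REPORT_DATE)
) AS RAIN_TOTAL_MONTH
FROM WEATHER_MONITOR wm1
ORDER BY ID
"""

# CTE approach: totals are computed once per year/month, then joined back.
cte_sql = """
WITH rain_totals AS (
    SELECT strftime('%Y', REPORT_DATE) AS REPORT_YEAR,
    strftime('%m', REPORT_DATE) AS REPORT_MONTH,
    SUM(RAIN) AS TOTAL_RAIN
    FROM WEATHER_MONITOR
    GROUP BY 1, 2
)

SELECT ID, TOTAL_RAIN
FROM WEATHER_MONITOR INNER JOIN rain_totals
ON strftime('%Y', REPORT_DATE) = rain_totals.REPORT_YEAR
AND strftime('%m', REPORT_DATE) = rain_totals.REPORT_MONTH
ORDER BY ID
"""

correlated_totals = conn_demo.execute(correlated_sql).fetchall()
cte_totals = conn_demo.execute(cte_sql).fetchall()
print(cte_totals)  # [(1, 3.5), (2, 3.5), (3, 4.0), (4, 4.0)]
print(correlated_totals == cte_totals)  # True
```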
