# Advanced Join Patterns

In this section, we will cover more advanced join patterns that are helpful in data analysis. We will start with a very brief review of the `INNER JOIN`, `LEFT JOIN`, `RIGHT JOIN`, and `FULL OUTER JOIN`. 

Let's connect to the `company_operations.db` database. 

In [None]:
import sqlite3
import pandas as pd 

conn = sqlite3.connect('company_operations.db')


Let's also tweak pandas to display more rows and not truncate them. 

In [None]:
pd.options.display.max_rows = 999

## INNER, LEFT, RIGHT, and FULL OUTER JOIN 

We have used `INNER JOIN` and `LEFT JOIN` a few times in this course. But let's review the fundamental difference between them, and then how they extend to  `RIGHT JOIN` and `FULL OUTER JOIN`. 

Below we join the `CUSTOMER` and `CUSTOMER_ORDER` tables together using an `INNER JOIN`. This allows us to bring in `CUSTOMER` information to show alongside each `CUSTOMER_ORDER`, such as the `CUSTOMER_NAME` and the `SHIP_ADDRESS` (which is concatenated from four fields).

In [None]:
sql = """
SELECT CUSTOMER_ORDER_ID,
CUSTOMER_NAME,
ADDRESS || ' ' || CITY || ', ' || STATE || ' ' || ZIP AS SHIP_ADDRESS,
ORDER_DATE,
PRODUCT_ID,
QUANTITY

FROM CUSTOMER INNER JOIN CUSTOMER_ORDER
ON CUSTOMER.CUSTOMER_ID = CUSTOMER_ORDER.CUSTOMER_ID
"""

pd.read_sql(sql, conn)

> Note that `INNER JOIN` can also be aliased as `JOIN`. I personally do not like using this alias. I prefer to be explicit about my intention to use an `INNER JOIN`. There are many types of joins, and when I see a SQL query using `JOIN` rather than an `INNER JOIN`, I assume the composer is not aware there are multiple types of joins. 

However, recall that there is one `CUSTOMER` without any `CUSTOMER_ORDER` records. 

In [None]:
sql = """
SELECT * FROM CUSTOMER 
WHERE CUSTOMER_ID NOT IN (SELECT DISTINCT CUSTOMER_ID FROM CUSTOMER_ORDER)
"""

pd.read_sql(sql, conn)

If we want a placeholder record for this `CUSTOMER`, even though there is no `CUSTOMER_ORDER` record to join to, we can achieve that using a `LEFT JOIN`. This will include all records in the "left" table (to the left of the `LEFT JOIN` keywords), even if there are no records to join to in the "right" table (to the right of the `LEFT JOIN` keywords). When there are no records to join to in the "right" table, the fields from the "right" table will be null.

Let's sort on one of those `CUSTOMER_ORDER` fields so the `NULL` records that resulted from the `LEFT JOIN` appear at the top.

In [None]:
sql = """
SELECT CUSTOMER_ORDER_ID,
CUSTOMER_NAME,
ADDRESS || ' ' || CITY || ', ' || STATE || ' ' || ZIP AS SHIP_ADDRESS,
ORDER_DATE,
PRODUCT_ID,
QUANTITY

FROM CUSTOMER LEFT JOIN CUSTOMER_ORDER
ON CUSTOMER.CUSTOMER_ID = CUSTOMER_ORDER.CUSTOMER_ID

ORDER BY ORDER_DATE
"""

pd.read_sql(sql, conn)
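
To see the difference in isolation, here is a minimal sketch that does not touch `company_operations.db`. It assumes a made-up, in-memory pair of tables (three customers, one of which has no orders); the names `demo`, `join_sql`, and the sample rows are hypothetical and only loosely mirror the course schema.

```python
import sqlite3

# Hypothetical miniature tables, NOT the course database
demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE CUSTOMER (CUSTOMER_ID INTEGER PRIMARY KEY, CUSTOMER_NAME TEXT);
CREATE TABLE CUSTOMER_ORDER (
    CUSTOMER_ORDER_ID INTEGER PRIMARY KEY,
    CUSTOMER_ID INTEGER,
    QUANTITY INTEGER
);
INSERT INTO CUSTOMER VALUES (1, 'Acme'), (2, 'Beta'), (3, 'Gamma');
INSERT INTO CUSTOMER_ORDER VALUES (10, 1, 5), (11, 1, 2), (12, 2, 7);
""")

# Same query text, only the join type changes
join_sql = """
SELECT CUSTOMER_NAME, CUSTOMER_ORDER_ID
FROM CUSTOMER {join_type} JOIN CUSTOMER_ORDER
ON CUSTOMER.CUSTOMER_ID = CUSTOMER_ORDER.CUSTOMER_ID
ORDER BY CUSTOMER.CUSTOMER_ID, CUSTOMER_ORDER_ID
"""

inner_rows = demo.execute(join_sql.format(join_type="INNER")).fetchall()
left_rows = demo.execute(join_sql.format(join_type="LEFT")).fetchall()

print(inner_rows)  # 'Gamma' is absent: it has no matching order
print(left_rows)   # 'Gamma' is kept, with None for CUSTOMER_ORDER_ID
```

The only row that differs between the two results is the customer without orders, which is exactly the placeholder record the `LEFT JOIN` is meant to preserve.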

`LEFT JOIN` is a shorthand alias for `LEFT OUTER JOIN`, so you can use either spelling interchangeably.

The `RIGHT JOIN` (or `RIGHT OUTER JOIN`) is exactly the same as `LEFT JOIN`, except it flips the direction. This will include all records in the "right" table (to the right of the `RIGHT JOIN` keywords), even if there are no records to join to in the "left" table (to the left of the `RIGHT JOIN` keywords). When there are no records to join to in the "left" table, the fields from the "left" table will be null.

We can achieve the exact same result as our previous `LEFT JOIN` query by specifying `CUSTOMER_ORDER RIGHT JOIN CUSTOMER` instead of `CUSTOMER LEFT JOIN CUSTOMER_ORDER`.

In [None]:
sql = """
SELECT CUSTOMER_ORDER_ID,
CUSTOMER_NAME,
ADDRESS || ' ' || CITY || ', ' || STATE || ' ' || ZIP AS SHIP_ADDRESS,
ORDER_DATE,
PRODUCT_ID,
QUANTITY

FROM CUSTOMER_ORDER RIGHT JOIN CUSTOMER
ON CUSTOMER.CUSTOMER_ID = CUSTOMER_ORDER.CUSTOMER_ID

ORDER BY ORDER_DATE
"""

pd.read_sql(sql, conn)

Because any `RIGHT JOIN` query can be rewritten as a `LEFT JOIN`, `RIGHT JOIN` is rarely used, and it is best practice to stick with `LEFT JOIN`.

The `FULL OUTER JOIN` does a `LEFT JOIN` and `RIGHT JOIN` simultaneously, including all records from both the "left" and "right" tables. It is also rarely used, because orphaned records are often illegal in relational databases (e.g. a child without a parent, such as a `CUSTOMER_ORDER` without an existing `CUSTOMER`). It is perfectly fine to have a parent without a child though, as seen with a `CUSTOMER` record without `CUSTOMER_ORDER` records.
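
One practical caveat: SQLite only accepts `RIGHT JOIN` and `FULL OUTER JOIN` from version 3.39.0 onward. The sketch below, built on hypothetical in-memory tables that deliberately contain an orphaned order, shows one common way to emulate a `FULL OUTER JOIN` with two `LEFT JOIN` queries and a `UNION ALL`, and compares it against the native syntax when the installed SQLite supports it. The table contents and variable names are assumptions for illustration only.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE CUSTOMER (CUSTOMER_ID INTEGER, CUSTOMER_NAME TEXT);
CREATE TABLE CUSTOMER_ORDER (CUSTOMER_ORDER_ID INTEGER, CUSTOMER_ID INTEGER);
INSERT INTO CUSTOMER VALUES (1, 'Acme'), (2, 'Beta');
-- order 21 points at customer 9, which does not exist (an orphan)
INSERT INTO CUSTOMER_ORDER VALUES (20, 1), (21, 9);
""")

# Emulation: all customers (with or without orders),
# plus the orders that have no matching customer
emulated_sql = """
SELECT CUSTOMER_NAME, CUSTOMER_ORDER_ID
FROM CUSTOMER LEFT JOIN CUSTOMER_ORDER
ON CUSTOMER.CUSTOMER_ID = CUSTOMER_ORDER.CUSTOMER_ID

UNION ALL

SELECT CUSTOMER_NAME, CUSTOMER_ORDER_ID
FROM CUSTOMER_ORDER LEFT JOIN CUSTOMER
ON CUSTOMER.CUSTOMER_ID = CUSTOMER_ORDER.CUSTOMER_ID
WHERE CUSTOMER.CUSTOMER_ID IS NULL
"""
emulated = set(demo.execute(emulated_sql).fetchall())

native_sql = """
SELECT CUSTOMER_NAME, CUSTOMER_ORDER_ID
FROM CUSTOMER FULL OUTER JOIN CUSTOMER_ORDER
ON CUSTOMER.CUSTOMER_ID = CUSTOMER_ORDER.CUSTOMER_ID
"""
# Only run the native form when the linked SQLite library is new enough
if sqlite3.sqlite_version_info >= (3, 39, 0):
    native = set(demo.execute(native_sql).fetchall())
    print("native matches emulation:", native == emulated)

print(emulated)
```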

We can also join more than two tables. Below I join to a third table called `PRODUCT` to bring in product information, like the `PRODUCT_NAME` and `PRICE`. I have to use another `LEFT JOIN` here rather than an `INNER JOIN` because the `PRODUCT_ID` will be null for "Alpha Medical" since it has no orders. `LEFT JOIN` will tolerate null values in its join condition, but `INNER JOIN` will omit those records.

In [None]:
sql = """
SELECT CUSTOMER_ORDER_ID,
CUSTOMER_NAME,
ADDRESS || ' ' || CITY || ', ' || STATE || ' ' || ZIP AS SHIP_ADDRESS,
ORDER_DATE,
PRODUCT.PRODUCT_ID,
PRODUCT_NAME,
PRICE,
QUANTITY

FROM CUSTOMER LEFT JOIN CUSTOMER_ORDER
ON CUSTOMER.CUSTOMER_ID = CUSTOMER_ORDER.CUSTOMER_ID

LEFT JOIN PRODUCT
ON PRODUCT.PRODUCT_ID = CUSTOMER_ORDER.PRODUCT_ID

ORDER BY ORDER_DATE
"""

pd.read_sql(sql, conn)

With all three tables joined, we can introduce a `GROUP BY` and some expressions to find the total revenue by `CUSTOMER`. 

In [None]:
sql = """
SELECT 
CUSTOMER.CUSTOMER_ID,
CUSTOMER_NAME,
COALESCE(SUM(PRICE * QUANTITY), 0) AS TOTAL_REVENUE

FROM CUSTOMER LEFT JOIN CUSTOMER_ORDER
ON CUSTOMER.CUSTOMER_ID = CUSTOMER_ORDER.CUSTOMER_ID

LEFT JOIN PRODUCT
ON PRODUCT.PRODUCT_ID = CUSTOMER_ORDER.PRODUCT_ID

GROUP BY 1, 2
"""

pd.read_sql(sql, conn)
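
The `COALESCE()` in that query matters because `SUM()` over a customer with no orders yields null rather than 0. Here is a small sketch on hypothetical in-memory tables (two customers, one without orders) that shows the raw and coalesced sums side by side; the data and the `RAW_REVENUE` column are made up for illustration.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE CUSTOMER (CUSTOMER_ID INTEGER, CUSTOMER_NAME TEXT);
CREATE TABLE PRODUCT (PRODUCT_ID INTEGER, PRICE REAL);
CREATE TABLE CUSTOMER_ORDER (CUSTOMER_ID INTEGER, PRODUCT_ID INTEGER, QUANTITY INTEGER);
INSERT INTO CUSTOMER VALUES (1, 'Acme'), (2, 'Beta');
INSERT INTO PRODUCT VALUES (1, 10.0), (2, 2.5);
INSERT INTO CUSTOMER_ORDER VALUES (1, 1, 3), (1, 2, 4);
""")

revenue_sql = """
SELECT CUSTOMER_NAME,
SUM(PRICE * QUANTITY) AS RAW_REVENUE,
COALESCE(SUM(PRICE * QUANTITY), 0) AS TOTAL_REVENUE

FROM CUSTOMER LEFT JOIN CUSTOMER_ORDER
ON CUSTOMER.CUSTOMER_ID = CUSTOMER_ORDER.CUSTOMER_ID

LEFT JOIN PRODUCT
ON PRODUCT.PRODUCT_ID = CUSTOMER_ORDER.PRODUCT_ID

GROUP BY CUSTOMER.CUSTOMER_ID, CUSTOMER_NAME
ORDER BY CUSTOMER.CUSTOMER_ID
"""

revenue = demo.execute(revenue_sql).fetchall()
for row in revenue:
    print(row)  # 'Beta' has None for RAW_REVENUE but 0 for TOTAL_REVENUE
```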

## Volatile/Temporary Tables

Especially when you are doing data analysis, there may be times you need to inject a temporary table into the database with some data, often to join to. Let's say you have a spreadsheet of discounts you want to give to certain customers for certain products. Rather than create a monstrous amount of `CASE` expressions to handle this, you can instead create a **temporary table** that will hold data in the database until you disconnect.

It follows the exact same `CREATE TABLE` convention, but instead you specify it as `CREATE TEMP TABLE`. Other platforms, like Teradata, may call it a `VOLATILE TABLE`.

Let's first create a pandas `DataFrame` of our discounts. 

In [None]:
discounts = pd.DataFrame({
    'customer_id' : [2,2,2,4,4,4,7,7,7,7,7],
    'product_id' : [4,5,9,3,8,6,5,11,12,13,15],
    'discounts' : [.1, .12, .2, .1, .3, .15, .05, .12, .15, .35, .05]
})

discounts

Let's then create our temporary table `DISCOUNT` and upload the pandas `DataFrame` to it using `executemany()` and an `INSERT` template. 

In [None]:
create_sql = """
CREATE TEMP TABLE DISCOUNT ( 
    CUSTOMER_ID INTEGER NOT NULL, 
    PRODUCT_ID INTEGER NOT NULL, 
    DISCOUNT_RATE DOUBLE NOT NULL
);
"""
conn.execute(create_sql)

insert_sql = 'INSERT INTO DISCOUNT (CUSTOMER_ID, PRODUCT_ID, DISCOUNT_RATE) VALUES (?, ?, ?)'
conn.executemany(insert_sql, discounts.values)

pd.read_sql("SELECT * FROM DISCOUNT", conn)
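
To illustrate the "until you disconnect" behavior, here is a hedged sketch using a throwaway database file (the path and the two-column `DISCOUNT` table below are hypothetical, not the course database). A temporary table is private to the connection that created it, so a second connection to the same file cannot see it.

```python
import os
import sqlite3
import tempfile

# Hypothetical scratch database file, created only for this demonstration
db_path = os.path.join(tempfile.mkdtemp(), "scratch.db")

first = sqlite3.connect(db_path)
first.execute("CREATE TEMP TABLE DISCOUNT (CUSTOMER_ID INTEGER, DISCOUNT_RATE REAL)")
first.execute("INSERT INTO DISCOUNT VALUES (2, 0.1)")
first.commit()

# The creating connection can query the temp table as usual
visible_to_first = first.execute("SELECT COUNT(*) FROM DISCOUNT").fetchone()[0]

# A second connection to the same file does not see it
second = sqlite3.connect(db_path)
try:
    second.execute("SELECT COUNT(*) FROM DISCOUNT")
    visible_to_second = True
except sqlite3.OperationalError:  # raised because the table does not exist here
    visible_to_second = False

print(visible_to_first, visible_to_second)

first.close()   # the temp table is discarded when its connection closes
second.close()
```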

We can now join to the `DISCOUNT` table and apply the discounts!

In [None]:
sql = """
SELECT CUSTOMER_ORDER_ID, 
CUSTOMER_ORDER.CUSTOMER_ID,
CUSTOMER_ORDER.PRODUCT_ID, 
PRICE,
DISCOUNT_RATE, 
PRICE * (1.0 - DISCOUNT_RATE) AS DISCOUNT_PRICE

FROM CUSTOMER_ORDER INNER JOIN PRODUCT
ON CUSTOMER_ORDER.PRODUCT_ID = PRODUCT.PRODUCT_ID

LEFT JOIN DISCOUNT
ON CUSTOMER_ORDER.CUSTOMER_ID = DISCOUNT.CUSTOMER_ID
AND CUSTOMER_ORDER.PRODUCT_ID = DISCOUNT.PRODUCT_ID

WHERE DISCOUNT_RATE IS NOT NULL
"""

pd.read_sql(sql, conn)

## Self Joins and Non-Equal Joins

After all the crazy things we did with subqueries and derived tables, it's probably no surprise we can join a table to itself. This can be helpful to, for example, get the previous day's order `QUANTITY` relative to each record. We alias `CUSTOMER_ORDER` twice as `o1` and `o2`. Think of this as creating two separate copies of that table and joining them.

In [None]:
sql = """
SELECT o1.CUSTOMER_ORDER_ID,
o1.CUSTOMER_ID,
o1.PRODUCT_ID,
o1.ORDER_DATE,
o1.QUANTITY,
o2.QUANTITY AS PREV_DAY_QUANTITY

FROM CUSTOMER_ORDER o1 LEFT JOIN CUSTOMER_ORDER o2

ON o1.CUSTOMER_ID = o2.CUSTOMER_ID
AND o1.PRODUCT_ID = o2.PRODUCT_ID
AND o2.ORDER_DATE = date(o1.ORDER_DATE, '-1 day')

WHERE o1.ORDER_DATE BETWEEN '2024-03-05' AND '2024-03-11'
"""

pd.read_sql(sql, conn)

If you want to get the most recent previous quantity for each record, even if it falls earlier than the previous day, you can achieve this using a correlated subquery.

In [None]:
sql = """
SELECT ORDER_DATE,
PRODUCT_ID,
CUSTOMER_ID,
QUANTITY,
(
    SELECT QUANTITY
    FROM CUSTOMER_ORDER c2
    WHERE c1.ORDER_DATE > c2.ORDER_DATE
    AND c1.PRODUCT_ID = c2.PRODUCT_ID
    AND c1.CUSTOMER_ID = c2.CUSTOMER_ID
    ORDER BY ORDER_DATE DESC
    LIMIT 1
) as PREV_QTY
FROM CUSTOMER_ORDER c1
"""

pd.read_sql(sql, conn)
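
The two approaches differ only when there is a gap in the dates. The sketch below runs both on a hypothetical three-row `CUSTOMER_ORDER` table (made-up data, with no order on `2024-03-07`) so the difference is visible.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE CUSTOMER_ORDER (
    CUSTOMER_ID INTEGER, PRODUCT_ID INTEGER, ORDER_DATE TEXT, QUANTITY INTEGER
);
INSERT INTO CUSTOMER_ORDER VALUES
    (1, 1, '2024-03-05', 10),
    (1, 1, '2024-03-06', 20),
    (1, 1, '2024-03-08', 30);
""")

# Self join: matches only an order placed exactly one day earlier
prev_day_sql = """
SELECT o1.ORDER_DATE, o1.QUANTITY, o2.QUANTITY AS PREV_DAY_QUANTITY
FROM CUSTOMER_ORDER o1 LEFT JOIN CUSTOMER_ORDER o2
ON o1.CUSTOMER_ID = o2.CUSTOMER_ID
AND o1.PRODUCT_ID = o2.PRODUCT_ID
AND o2.ORDER_DATE = date(o1.ORDER_DATE, '-1 day')
ORDER BY o1.ORDER_DATE
"""

# Correlated subquery: picks the latest earlier order, however old it is
prev_any_sql = """
SELECT c1.ORDER_DATE, c1.QUANTITY,
(
    SELECT c2.QUANTITY
    FROM CUSTOMER_ORDER c2
    WHERE c2.ORDER_DATE < c1.ORDER_DATE
    AND c2.PRODUCT_ID = c1.PRODUCT_ID
    AND c2.CUSTOMER_ID = c1.CUSTOMER_ID
    ORDER BY c2.ORDER_DATE DESC
    LIMIT 1
) AS PREV_QTY
FROM CUSTOMER_ORDER c1
ORDER BY c1.ORDER_DATE
"""

prev_day = demo.execute(prev_day_sql).fetchall()
prev_any = demo.execute(prev_any_sql).fetchall()
print(prev_day)  # the 2024-03-08 row has None: nothing was ordered on 03-07
print(prev_any)  # the 2024-03-08 row falls back to the 03-06 quantity
```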

You also do not have to join strictly on equality. We can join a table to itself on an inequality, and use this to compute a rolling total: for each record, sum the quantities from that date and all earlier dates sharing the same `PRODUCT_ID` and `CUSTOMER_ID`.

In [None]:
sql = """
SELECT c1.ORDER_DATE,
c1.PRODUCT_ID,
c1.CUSTOMER_ID,
c1.QUANTITY,
SUM(c2.QUANTITY) as ROLLING_QTY

FROM CUSTOMER_ORDER c1 INNER JOIN CUSTOMER_ORDER c2
ON c1.PRODUCT_ID = c2.PRODUCT_ID
AND c1.CUSTOMER_ID = c2.CUSTOMER_ID
AND c1.ORDER_DATE >= c2.ORDER_DATE

GROUP BY 1, 2, 3, 4
"""

pd.read_sql(sql, conn)
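
As a quick check of what the `>=` condition does, here is a sketch on a hypothetical three-row table (made-up data for a single customer and product). Each row of `c1` is paired with every `c2` row on the same or an earlier date, and the `SUM` collapses those pairs into a running total.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE CUSTOMER_ORDER (
    CUSTOMER_ID INTEGER, PRODUCT_ID INTEGER, ORDER_DATE TEXT, QUANTITY INTEGER
);
INSERT INTO CUSTOMER_ORDER VALUES
    (1, 1, '2024-03-05', 10),
    (1, 1, '2024-03-06', 20),
    (1, 1, '2024-03-08', 30);
""")

rolling_sql = """
SELECT c1.ORDER_DATE, c1.QUANTITY, SUM(c2.QUANTITY) AS ROLLING_QTY

FROM CUSTOMER_ORDER c1 INNER JOIN CUSTOMER_ORDER c2
ON c1.PRODUCT_ID = c2.PRODUCT_ID
AND c1.CUSTOMER_ID = c2.CUSTOMER_ID
AND c1.ORDER_DATE >= c2.ORDER_DATE  -- non-equal join condition

GROUP BY c1.ORDER_DATE, c1.QUANTITY
ORDER BY c1.ORDER_DATE
"""

rolling = demo.execute(rolling_sql).fetchall()
for row in rolling:
    print(row)  # running totals: 10, then 10 + 20, then 10 + 20 + 30
```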

We will learn later how to leverage window functions that make these tasks easier, but we can still leverage these tools when we need to flexibly create more complex logic.

## Cross Joins

A **cross join** is a special type of join that creates all possible combinations between two or more tables. This is also known as a **cartesian join**. For example, we can pair every possible `PRODUCT_ID` with every possible `CUSTOMER_ID`. 

In [None]:
sql = """
SELECT PRODUCT_ID, CUSTOMER_ID 
FROM PRODUCT CROSS JOIN CUSTOMER
"""

pd.read_sql(sql, conn)

Why could this be useful? Let's say we have this query below that finds the total revenue by `CUSTOMER_ID` and `PRODUCT_ID` for a specific date. 

In [None]:
sql = """
SELECT CUSTOMER.CUSTOMER_ID, 
PRODUCT.PRODUCT_ID, 
SUM(PRICE * QUANTITY) AS TOTAL_REVENUE

FROM CUSTOMER LEFT JOIN CUSTOMER_ORDER
ON CUSTOMER.CUSTOMER_ID = CUSTOMER_ORDER.CUSTOMER_ID

LEFT JOIN PRODUCT 
ON PRODUCT.PRODUCT_ID = CUSTOMER_ORDER.PRODUCT_ID

WHERE ORDER_DATE = '2024-03-01'
GROUP BY 1, 2 
"""

pd.read_sql(sql, conn)

However, we want to see all customers and products even if there were no orders on that given day for that `PRODUCT_ID` and `CUSTOMER_ID`. We can package that query into a common table expression, and `LEFT JOIN` it to that `CROSS JOIN` query combining every `PRODUCT_ID` and `CUSTOMER_ID`. We will also `COALESCE()` null values to 0 when there were no sales. 

In [None]:
sql = """
WITH totals AS ( 
    SELECT CUSTOMER.CUSTOMER_ID, 
    PRODUCT.PRODUCT_ID, 
    SUM(PRICE * QUANTITY) AS TOTAL_REVENUE
    
    FROM CUSTOMER LEFT JOIN CUSTOMER_ORDER
    ON CUSTOMER.CUSTOMER_ID = CUSTOMER_ORDER.CUSTOMER_ID
    
    LEFT JOIN PRODUCT 
    ON PRODUCT.PRODUCT_ID = CUSTOMER_ORDER.PRODUCT_ID
    
    WHERE ORDER_DATE = '2024-03-01'
    GROUP BY 1, 2 
), 

all_combos AS ( 
    SELECT PRODUCT_ID, CUSTOMER_ID 
    FROM PRODUCT CROSS JOIN CUSTOMER
)

SELECT all_combos.PRODUCT_ID, 
all_combos.CUSTOMER_ID,
COALESCE(totals.TOTAL_REVENUE, 0) AS TOTAL_REVENUE

FROM all_combos LEFT JOIN totals
ON all_combos.CUSTOMER_ID = totals.CUSTOMER_ID
AND all_combos.PRODUCT_ID = totals.PRODUCT_ID
"""

pd.read_sql(sql, conn)
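
The gap-filling pattern can be verified on a tiny scale. This sketch assumes hypothetical in-memory tables with two customers, two products, and a single order, so exactly one of the four combinations should show revenue and the other three should show 0.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE CUSTOMER (CUSTOMER_ID INTEGER);
CREATE TABLE PRODUCT (PRODUCT_ID INTEGER, PRICE REAL);
CREATE TABLE CUSTOMER_ORDER (CUSTOMER_ID INTEGER, PRODUCT_ID INTEGER, QUANTITY INTEGER);
INSERT INTO CUSTOMER VALUES (1), (2);
INSERT INTO PRODUCT VALUES (1, 10.0), (2, 5.0);
INSERT INTO CUSTOMER_ORDER VALUES (1, 2, 3);
""")

fill_sql = """
WITH totals AS (
    SELECT CUSTOMER_ID,
    CUSTOMER_ORDER.PRODUCT_ID,
    SUM(PRICE * QUANTITY) AS TOTAL_REVENUE

    FROM CUSTOMER_ORDER INNER JOIN PRODUCT
    ON PRODUCT.PRODUCT_ID = CUSTOMER_ORDER.PRODUCT_ID

    GROUP BY 1, 2
),

all_combos AS (
    SELECT PRODUCT_ID, CUSTOMER_ID
    FROM PRODUCT CROSS JOIN CUSTOMER
)

SELECT all_combos.CUSTOMER_ID,
all_combos.PRODUCT_ID,
COALESCE(totals.TOTAL_REVENUE, 0) AS TOTAL_REVENUE

FROM all_combos LEFT JOIN totals
ON all_combos.CUSTOMER_ID = totals.CUSTOMER_ID
AND all_combos.PRODUCT_ID = totals.PRODUCT_ID

ORDER BY 1, 2
"""

filled = demo.execute(fill_sql).fetchall()
for row in filled:
    print(row)  # 2 customers x 2 products = 4 rows, one with revenue
```

The row count of a cross join is the product of the input row counts, which is why the warning about combinatorial explosions applies as soon as the tables grow.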

Be mindful to not create combinatorial explosions with cross joins, as creating too many combinations can slow down your queries greatly.

## Recursive Self Joins 

Let's take a look at a different table: the `EMPLOYEE` table. Notice the `MANAGER_ID` column has values that point to other `EMPLOYEE` records. 

In [None]:
pd.read_sql("SELECT * FROM EMPLOYEE", conn, index_col='ID')

You can self join this table and go one level up, but how do we go up the whole hierarchy? Let's focus on one employee with `FIRST_NAME` "Mag" and an `ID` of 29. We can use a special type of common table expression that is `RECURSIVE`. We seed it with a starting value of `29` and recursively append the IDs up the whole hierarchy until there are no more (when it hits the CEO).

In [None]:
sql = """
-- generates a list of employee ID's hierarchical to Mag
WITH RECURSIVE hierarchy_of_mag(x) AS (
 SELECT 29 -- start with Mag's ID
 UNION ALL -- append each manager ID recursively
 SELECT MANAGER_ID
 FROM hierarchy_of_mag INNER JOIN EMPLOYEE
 ON EMPLOYEE.ID = hierarchy_of_mag.x -- employee ID must equal previous recursion
)

SELECT * FROM hierarchy_of_mag;
"""

pd.read_sql(sql, conn)

We can use this set of IDs to filter the `EMPLOYEE` table down to the employees in that hierarchy.

In [None]:
sql = """
-- generates a list of employee ID's hierarchical to Mag
WITH RECURSIVE hierarchy_of_mag(x) AS (
 SELECT 29 -- start with Mag's ID
 UNION ALL -- append each manager ID recursively
 SELECT MANAGER_ID 
 FROM hierarchy_of_mag INNER JOIN EMPLOYEE
 ON EMPLOYEE.ID = hierarchy_of_mag.x -- employee ID must equal previous recursion
)

SELECT * FROM EMPLOYEE WHERE ID IN hierarchy_of_mag;
"""

pd.read_sql(sql, conn)
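
To trace how the recursion walks upward, here is a sketch on a hypothetical four-person `EMPLOYEE` table (made-up names and IDs, not the course data). Starting from ID 3, each step looks up the manager of the previously produced ID. Note that the top of the chain contributes a final null, because the CEO's `MANAGER_ID` is null; that null then matches nothing, and the recursion stops.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE EMPLOYEE (ID INTEGER PRIMARY KEY, FIRST_NAME TEXT, MANAGER_ID INTEGER);
INSERT INTO EMPLOYEE VALUES
    (1, 'Ceo', NULL),
    (2, 'Vp', 1),
    (3, 'Lead', 2),
    (4, 'Peer', 2);
""")

chain_cte = """
WITH RECURSIVE chain(x) AS (
    SELECT 3                 -- seed: hypothetical starting employee
    UNION ALL
    SELECT MANAGER_ID        -- step: manager of the previous ID
    FROM chain INNER JOIN EMPLOYEE
    ON EMPLOYEE.ID = chain.x
)
"""

# IDs produced by the recursion, including the trailing NULL
chain_ids = [row[0] for row in demo.execute(chain_cte + "SELECT x FROM chain")]

# Employees in the chain; 'Peer' reports to 2 but is not above 3
chain_names = [
    row[0]
    for row in demo.execute(
        chain_cte
        + "SELECT FIRST_NAME FROM EMPLOYEE WHERE ID IN (SELECT x FROM chain) ORDER BY ID"
    )
]

print(chain_ids)
print(chain_names)
```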

Recursive queries are also helpful for generating a range of consecutive values, like a range of integers or dates/times. Here is a range of integers from 1 to 1000.  

In [None]:
sql = """ 
WITH RECURSIVE integers(i) AS (
    SELECT 1
        UNION ALL
    SELECT i + 1 
    FROM integers
    WHERE i < 1000
)

SELECT * FROM integers
"""

pd.read_sql(sql, conn)

And here is an enumeration of dates from today until December 31, 2030.

In [None]:
sql = """ 
WITH RECURSIVE calendar_dates(dt) AS (
    SELECT date('now')
        UNION ALL
    SELECT date(dt, '+1 day')
    FROM calendar_dates
    WHERE dt < '2030-12-31'
)
SELECT * FROM calendar_dates
"""

pd.read_sql(sql, conn)
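
Because `date('now')` changes every day, a fixed range is easier to reason about. The following sketch parameterizes the start and end dates (the `:start` and `:end` parameter names and the chosen dates are illustrative assumptions) and relies on ISO-formatted date strings comparing correctly as text.

```python
import sqlite3

demo = sqlite3.connect(":memory:")

calendar_sql = """
WITH RECURSIVE calendar_dates(dt) AS (
    SELECT date(:start)
        UNION ALL
    SELECT date(dt, '+1 day')
    FROM calendar_dates
    WHERE dt < :end          -- stop once the end date has been produced
)
SELECT dt FROM calendar_dates
"""

# Range chosen to cross a leap day
params = {"start": "2024-02-27", "end": "2024-03-02"}
dates = [row[0] for row in demo.execute(calendar_sql, params)]
print(dates)
```

The range is inclusive on both ends: the `WHERE dt < :end` test is applied to the previous row, so the row equal to the end date is still generated before recursion stops.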

Returning to our earlier `CROSS JOIN` example, we can leverage this date enumeration to fill in gaps not only for every `CUSTOMER_ID` and `PRODUCT_ID`, but also for every `ORDER_DATE`. In other words, we can see every `CUSTOMER_ID` and `PRODUCT_ID` represented in our query for every `ORDER_DATE`, even if there were not any orders. Note that the `RECURSIVE` keyword is written once, immediately after `WITH`, so it is simplest to list the recursive query first in your common table expressions.

In [None]:
sql = """
WITH RECURSIVE calendar_dates(dt) AS (
    SELECT date('2020-01-01')
        UNION ALL
    SELECT date(dt, '+1 day')
    FROM calendar_dates
    WHERE dt <'2099-12-31'
), 

totals AS ( 
    SELECT CUSTOMER.CUSTOMER_ID, 
    PRODUCT.PRODUCT_ID, 
    ORDER_DATE,
    SUM(PRICE * QUANTITY) AS TOTAL_REVENUE
    
    FROM CUSTOMER INNER JOIN CUSTOMER_ORDER
    ON CUSTOMER.CUSTOMER_ID = CUSTOMER_ORDER.CUSTOMER_ID
    
    INNER JOIN PRODUCT 
    ON PRODUCT.PRODUCT_ID = CUSTOMER_ORDER.PRODUCT_ID
    
    GROUP BY 1, 2, 3
), 

all_combos AS ( 
    SELECT PRODUCT_ID, CUSTOMER_ID, dt
    FROM PRODUCT CROSS JOIN CUSTOMER 
    CROSS JOIN calendar_dates
    WHERE dt BETWEEN '2024-03-01' AND '2024-03-31'
)
SELECT all_combos.dt as ORDER_DATE, 
all_combos.PRODUCT_ID, 
all_combos.CUSTOMER_ID,
COALESCE(totals.TOTAL_REVENUE, 0) AS TOTAL_REVENUE

FROM all_combos LEFT JOIN totals
ON all_combos.CUSTOMER_ID = totals.CUSTOMER_ID
AND all_combos.PRODUCT_ID = totals.PRODUCT_ID
AND all_combos.dt = totals.ORDER_DATE
"""

pd.read_sql(sql, conn)

Let's look at one last example. Take a look at the `EMPLOYEE_AIR_TRAVEL` table. 

In [None]:
pd.read_sql("SELECT * FROM EMPLOYEE_AIR_TRAVEL", conn)

Note the `NUM_OF_PASSENGERS` column indicates the number of passengers on that ticket. Let's say we wanted to break these up into multiple records, so a `NUM_OF_PASSENGERS` of "3" would turn that one record into three copies of the record. We can use a `RECURSIVE` enumeration of integers to achieve this, copying the record as many times as we need. 

In [None]:
sql = """
WITH RECURSIVE integers(i) AS (
    SELECT 1
        UNION ALL
    SELECT i + 1 
    FROM integers
    WHERE i < 100
)

SELECT BOOKING_ID, 
BOOKED_EMPLOYEE_ID,
DEPARTURE_DATE,
ORIGIN,
DESTINATION,
FARE_PRICE,
integers.i AS PASSENGER_NUMBER
FROM EMPLOYEE_AIR_TRAVEL CROSS JOIN integers
ON integers.i <= NUM_OF_PASSENGERS
"""

pd.read_sql(sql, conn)
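
Here is the same row-expansion idea on a hypothetical two-booking table (made-up `BOOKING_ID` values), small enough to check by eye. This sketch spells the join as `INNER JOIN ... ON`, which is the more portable way to write a join that carries a condition.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE EMPLOYEE_AIR_TRAVEL (BOOKING_ID INTEGER, NUM_OF_PASSENGERS INTEGER);
INSERT INTO EMPLOYEE_AIR_TRAVEL VALUES (100, 1), (101, 3);
""")

expand_sql = """
WITH RECURSIVE integers(i) AS (
    SELECT 1
        UNION ALL
    SELECT i + 1
    FROM integers
    WHERE i < 100
)

SELECT BOOKING_ID,
integers.i AS PASSENGER_NUMBER
FROM EMPLOYEE_AIR_TRAVEL INNER JOIN integers
ON integers.i <= NUM_OF_PASSENGERS   -- one copy per passenger
ORDER BY BOOKING_ID, PASSENGER_NUMBER
"""

expanded = demo.execute(expand_sql).fetchall()
for row in expanded:
    print(row)  # booking 100 appears once, booking 101 three times
```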

We can also use these integers to enumerate copies of records for another purpose: breaking up the `ORIGIN` and `DESTINATION` into two separate records. This can help us find how much we are spending for employees to fly through each `AIRPORT`, regardless of whether that `AIRPORT` is the `ORIGIN` or the `DESTINATION`.

In [None]:
sql = """
WITH RECURSIVE integers(i) AS (
    SELECT 1
        UNION ALL
    SELECT i + 1 
    FROM integers
    WHERE i < 100
)

SELECT
CASE WHEN integers.i == 1 THEN ORIGIN ELSE DESTINATION END AS AIRPORT,
SUM(FARE_PRICE * NUM_OF_PASSENGERS) AS AIRPORT_COST
FROM EMPLOYEE_AIR_TRAVEL CROSS JOIN integers
ON integers.i <= 2
GROUP BY AIRPORT
"""

pd.read_sql(sql, conn)
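
A worked example with two hypothetical flights (made-up airports and fares) shows how each booking is counted once under its origin and once under its destination.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE EMPLOYEE_AIR_TRAVEL (
    ORIGIN TEXT, DESTINATION TEXT, FARE_PRICE REAL, NUM_OF_PASSENGERS INTEGER
);
INSERT INTO EMPLOYEE_AIR_TRAVEL VALUES
    ('DFW', 'ORD', 100.0, 2),
    ('ORD', 'LAX', 50.0, 1);
""")

airport_sql = """
WITH RECURSIVE integers(i) AS (
    SELECT 1
        UNION ALL
    SELECT i + 1
    FROM integers
    WHERE i < 2              -- only two copies are needed here
)

SELECT
CASE WHEN integers.i = 1 THEN ORIGIN ELSE DESTINATION END AS AIRPORT,
SUM(FARE_PRICE * NUM_OF_PASSENGERS) AS AIRPORT_COST
FROM EMPLOYEE_AIR_TRAVEL CROSS JOIN integers
GROUP BY 1
ORDER BY 1
"""

airport_costs = demo.execute(airport_sql).fetchall()
for row in airport_costs:
    print(row)  # ORD is counted for both flights: 100 * 2 + 50 * 1
```

Keep in mind that each fare is attributed in full to both airports, so the per-airport figures add up to twice the total spend.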

As you can see, recursive queries are very powerful, yet they are underutilized and underrated. Use them to fill in gaps in your data, duplicate and modify records, or simply generate a range of values such as integers and dates/times.

## Exercise 

For every calendar date and `PRODUCT_ID`, show the total revenue for the date range of `2024-01-01` to `2024-02-28`. A lot of boilerplate has been coded already. Just fill in the question marks "?".


In [None]:
sql = """
WITH RECURSIVE calendar_dates(dt) AS (
    SELECT date('2020-01-01')
        UNION ALL
    SELECT date(dt, '+1 day')
    FROM calendar_dates
    WHERE dt <'2099-12-31'
), 

product_totals_by_date AS ( 
    ?
), 

all_combos AS ( 
    SELECT ?, ?
    FROM PRODUCT ? ? calendar_dates
    WHERE dt BETWEEN '2024-01-01' AND '2024-02-28'
)
SELECT all_combos.dt as ORDER_DATE, 
all_combos.PRODUCT_ID, 
COALESCE(?, 0) AS TOTAL_REVENUE

FROM all_combos LEFT JOIN product_totals_by_date
ON all_combos.PRODUCT_ID = ?
AND all_combos.dt = ?
"""

pd.read_sql(sql, conn)


### SCROLL DOWN FOR ANSWER
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
v 

In [None]:
sql = """
WITH RECURSIVE calendar_dates(dt) AS (
    SELECT date('2020-01-01')
        UNION ALL
    SELECT date(dt, '+1 day')
    FROM calendar_dates
    WHERE dt <'2099-12-31'
), 

product_totals_by_date AS ( 
    SELECT PRODUCT.PRODUCT_ID, 
    ORDER_DATE,
    SUM(PRICE * QUANTITY) AS TOTAL_REVENUE
    
    FROM PRODUCT INNER JOIN CUSTOMER_ORDER 
    ON PRODUCT.PRODUCT_ID = CUSTOMER_ORDER.PRODUCT_ID
    
    GROUP BY 1,2
), 

all_combos AS ( 
    SELECT PRODUCT_ID, dt
    FROM PRODUCT CROSS JOIN calendar_dates
    WHERE dt BETWEEN '2024-01-01' AND '2024-02-28'
)
SELECT all_combos.dt as ORDER_DATE, 
all_combos.PRODUCT_ID, 
COALESCE(product_totals_by_date.TOTAL_REVENUE, 0) AS TOTAL_REVENUE

FROM all_combos LEFT JOIN product_totals_by_date
ON all_combos.PRODUCT_ID = product_totals_by_date.PRODUCT_ID
AND all_combos.dt = product_totals_by_date.ORDER_DATE
"""

pd.read_sql(sql, conn)