# [SETUP] Connect to DuckDB

In [2]:
# Load the extension
%load_ext sql

In [3]:
# Connect to DuckDB
%sql duckdb:///../../tpch.db

In [4]:
%config SqlMagic.displaylimit = 100

In [5]:
%%sql
-- Run a simple show tables
SELECT
  table_name
FROM
  information_schema.tables
WHERE
  table_schema = 'main'

table_name
customer
lineitem
nation
orders
part
partsupp
region
supplier


In [None]:
# If you do not see any tables, run the command below (after uncommenting it)
#! python setup.py

# [WHY] CTE (Common Table Expression) can improve readability and reduce code repetition

CTEs make testing complex queries simpler

* A CTE is a named SELECT statement that can be referenced one or more times within a single query.

* Complex SQL queries often involve multiple sub-queries. Nested sub-queries make the code hard to read.

* Use a Common Table Expression (CTE) to make your queries readable.
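
To make the readability and repetition points concrete, here is a minimal sketch that runs the same logic twice: once with repeated sub-queries and once with a CTE. It assumes an in-memory SQLite database (via Python's standard `sqlite3` module) and a made-up `sales` table rather than the workshop's DuckDB database; the `WITH` syntax shown is the same in both engines.

```python
import sqlite3

# Assumption: a toy in-memory SQLite database with a hypothetical `sales` table,
# standing in for the workshop's DuckDB database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('east', 100.0), ('east', 50.0), ('west', 30.0), ('west', 20.0);
""")

# Version 1: the per-region aggregation is written out twice as sub-queries.
subquery_sql = """
    SELECT region, total
    FROM (
        SELECT region, SUM(amount) AS total FROM sales GROUP BY region
    ) AS region_totals
    WHERE total > (
        SELECT AVG(total)
        FROM (
            SELECT region, SUM(amount) AS total FROM sales GROUP BY region
        ) AS region_totals_again
    )
    ORDER BY region
"""

# Version 2: the aggregation is defined once as a CTE and referenced twice.
cte_sql = """
    WITH region_totals AS (
        SELECT region, SUM(amount) AS total FROM sales GROUP BY region
    )
    SELECT region, total
    FROM region_totals
    WHERE total > (SELECT AVG(total) FROM region_totals)
    ORDER BY region
"""

subquery_rows = conn.execute(subquery_sql).fetchall()
cte_rows = conn.execute(cte_sql).fetchall()
print(subquery_rows)  # regions whose total exceeds the average regional total
print(cte_rows)       # same rows, produced by the shorter CTE version
```

Both versions return the same rows; the CTE version states the aggregation once and gives it a name that documents its purpose.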


## [HOW] to define a CTE


### [Example]

In [5]:
%%sql
-- CTE definition
WITH
  supplier_nation_metrics AS ( -- CTE 1 defined using WITH keyword
    SELECT
      n.n_nationkey,
      SUM(l.l_QUANTITY) AS num_supplied_parts
    FROM
      lineitem l
      JOIN supplier s ON l.l_suppkey = s.s_suppkey
      JOIN nation n ON s.s_nationkey = n.n_nationkey
    GROUP BY
      n.n_nationkey
  ),
  buyer_nation_metrics AS ( -- CTE 2: no second WITH; the comma above separates it from CTE 1
    SELECT
      n.n_nationkey,
      SUM(l.l_QUANTITY) AS num_purchased_parts
    FROM
      lineitem l
      JOIN orders o ON l.l_orderkey = o.o_orderkey
      JOIN customer c ON o.o_custkey = c.c_custkey
      JOIN nation n ON c.c_nationkey = n.n_nationkey
    GROUP BY
      n.n_nationkey
  )
SELECT -- The final select will not have a comma before it
  n.n_name AS nation_name,
  s.num_supplied_parts,
  b.num_purchased_parts
FROM
  nation n
  LEFT JOIN supplier_nation_metrics s ON n.n_nationkey = s.n_nationkey
  LEFT JOIN buyer_nation_metrics b ON n.n_nationkey = b.n_nationkey
LIMIT 10;

nation_name,num_supplied_parts,num_purchased_parts
ALGERIA,6454691.0,6117618.0
ARGENTINA,6339724.0,6087566.0
BRAZIL,6085551.0,6149174.0
CANADA,6296547.0,6168913.0
EGYPT,6385468.0,6024134.0
ETHIOPIA,5817697.0,6095241.0
FRANCE,6141618.0,6289987.0
GERMANY,6076474.0,6098776.0
INDIA,6347392.0,6102406.0
INDONESIA,6204759.0,6276420.0
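
One way CTEs can simplify testing is that you can point the final SELECT at a single CTE and inspect its output without touching the rest of the query. The sketch below illustrates the idea under stated assumptions: it uses an in-memory SQLite database and a hypothetical `toy_lineitem` table, not the TPC-H tables used above.

```python
import sqlite3

# Assumption: toy data in an in-memory SQLite database; table and column
# names are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE toy_lineitem (nation TEXT, quantity INTEGER);
    INSERT INTO toy_lineitem VALUES ('FRANCE', 5), ('FRANCE', 7), ('INDIA', 3);
""")

# The CTE definitions live in one string so only the final SELECT changes.
cte_definitions = """
    WITH
      nation_quantity AS (
        SELECT nation, SUM(quantity) AS total_quantity
        FROM toy_lineitem
        GROUP BY nation
      ),
      top_nation AS (
        SELECT nation
        FROM nation_quantity
        ORDER BY total_quantity DESC
        LIMIT 1
      )
"""

# Test step: inspect the intermediate CTE on its own.
intermediate = conn.execute(
    cte_definitions
    + "SELECT nation, total_quantity FROM nation_quantity ORDER BY nation"
).fetchall()

# Real query: the final SELECT reads from the last CTE.
final = conn.execute(cte_definitions + "SELECT nation FROM top_nation").fetchall()

print(intermediate)
print(final)
```

In a notebook cell, the equivalent move is to temporarily replace the final SELECT with `SELECT * FROM <cte_name>` while debugging.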


### [Exercise] 

Calculate the money lost due to discounts. Use `lineitem` to get the price of the items (without discounts) that are part of each order, and compare it to the order's `o_totalprice`.

**Time limit during live workshop: 10 min**

**Hint**: Figure out the grain at which the comparison needs to be made. Think in steps, i.e., get the price of all the items in an order without discounts and then compare it to the `orders` data, whose `o_totalprice` has been computed with discounts.
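
The point about grain can be illustrated with a toy sketch. It assumes an in-memory SQLite database and two hypothetical tables, `toy_orders` and `toy_items`, which are not the workshop tables. Joining before aggregating repeats each order row once per item, so order-level values get double-counted; aggregating the items to the order grain first avoids that.

```python
import sqlite3

# Assumption: hypothetical toy tables in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE toy_orders (order_id INTEGER, total_price REAL);
    CREATE TABLE toy_items (order_id INTEGER, item_price REAL);
    INSERT INTO toy_orders VALUES (1, 90.0), (2, 40.0);
    INSERT INTO toy_items VALUES (1, 60.0), (1, 40.0), (2, 40.0);
""")

# Wrong grain: the join yields one row per item, so the order-level
# total_price of order 1 is summed twice.
joined_first = conn.execute("""
    SELECT o.order_id, SUM(o.total_price) AS summed_total
    FROM toy_orders o
    JOIN toy_items i ON o.order_id = i.order_id
    GROUP BY o.order_id
    ORDER BY o.order_id
""").fetchall()

# Matching grain: roll items up to one row per order in a CTE, then join.
aggregated_first = conn.execute("""
    WITH items_per_order AS (
        SELECT order_id, SUM(item_price) AS items_total
        FROM toy_items
        GROUP BY order_id
    )
    SELECT o.order_id, o.total_price, p.items_total
    FROM toy_orders o
    JOIN items_per_order p ON o.order_id = p.order_id
    ORDER BY o.order_id
""").fetchall()

print(joined_first)      # order 1 shows an inflated total
print(aggregated_first)  # one row per order, safe to compare column to column
```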

Here are the schemas of `orders` and `lineitem` tables.

![Orders table](../../images/orders.png)


![lineitem table](../../images/lineitem.png)


In [9]:
%%sql
WITH
  lineitem_agg AS ( -- Aggregate lineitem to the order grain
    SELECT
      l_orderkey,
      SUM(l_extendedprice) AS total_price_without_discount
    FROM
      lineitem
    GROUP BY
      l_orderkey
  )
SELECT
  o.o_orderkey,
  -- Note: in TPC-H, o_totalprice also includes tax, so this difference can be negative
  l.total_price_without_discount - o.o_totalprice AS loss_due_to_discount
FROM
  orders o
  JOIN lineitem_agg l ON o.o_orderkey = l.l_orderkey
ORDER BY
  o.o_orderkey;

o_orderkey,loss_due_to_discount
1,8195.8
2,-2234.72
3,11238.07
4,-1460.88
5,3169.77
6,3248.72
7,10127.14
32,184.87
33,7778.18
34,-838.73


# [WHY] Just because you can doesn’t mean you should. Be mindful of code readability.

1. A SQL query split across a few temporary tables is better than a 1000-line SQL query with numerous CTEs.

2. Keep the number of CTEs per query small (this depends on the size of the query, but typically fewer than 5).
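
As a rough sketch of the temporary-table alternative from point 1, the example below materializes each step as its own temp table so it can be inspected and reused by later statements. It assumes an in-memory SQLite database and a made-up `sales` table; `CREATE TEMP TABLE ... AS SELECT` is also accepted by DuckDB, but the data and names here are purely illustrative.

```python
import sqlite3

# Assumption: toy data in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('east', 100.0), ('east', 50.0), ('west', 30.0), ('west', 20.0);
""")

# Step 1: materialize the first intermediate result as a temp table.
conn.execute("""
    CREATE TEMP TABLE region_totals AS
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
""")
# The step can be checked on its own before anything is built on top of it.
step1 = conn.execute(
    "SELECT region, total FROM region_totals ORDER BY region"
).fetchall()

# Step 2: a separate, short statement builds on the temp table.
conn.execute("""
    CREATE TEMP TABLE region_share AS
    SELECT
        region,
        total / (SELECT SUM(total) FROM region_totals) AS share
    FROM region_totals
""")
step2 = conn.execute(
    "SELECT region, share FROM region_share ORDER BY region"
).fetchall()

print(step1)
print(step2)
```

Each statement stays short, and any later query in the same session can read `region_totals` without repeating its definition.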



**Case study:**

Read the query below, then answer the questions in the exercise that follows it.

```sql
%%sql
with orders as (
select
        order_id,
        customer_id,
        order_status,
        order_purchase_timestamp::TIMESTAMP AS order_purchase_timestamp,
        order_approved_at::TIMESTAMP AS order_approved_at,
        order_delivered_carrier_date::TIMESTAMP AS order_delivered_carrier_date,
        order_delivered_customer_date::TIMESTAMP AS order_delivered_customer_date,
        order_estimated_delivery_date::TIMESTAMP AS order_estimated_delivery_date
    from raw_layer.orders
    ),
 stg_customers as (
    select
        customer_id,
        zipcode,
        city,
        state_code,
        datetime_created::TIMESTAMP as datetime_created,
        datetime_updated::TIMESTAMP as datetime_updated,
        dbt_valid_from,
        dbt_valid_to
    from customer_snapshot
),
state as (
select
        state_id::INT as state_id,
        state_code::VARCHAR(2) as state_code,
        state_name::VARCHAR(30) as state_name
    from raw_layer.state
    ),
dim_customers as (
select
    c.customer_id,
    c.zipcode,
    c.city,
    c.state_code,
    s.state_name,
    c.datetime_created,
    c.datetime_updated,
    c.dbt_valid_from::TIMESTAMP as valid_from,
    case
        when c.dbt_valid_to is NULL then '9999-12-31'::TIMESTAMP
        else c.dbt_valid_to::TIMESTAMP
    end as valid_to
from stg_customers as c
inner join state as s on c.state_code = s.state_code
)
select
    o.order_id,
    o.customer_id,
    o.order_status,
    o.order_purchase_timestamp,
    o.order_approved_at,
    o.order_delivered_carrier_date,
    o.order_delivered_customer_date,
    o.order_estimated_delivery_date,
    c.zipcode as customer_zipcode,
    c.city as customer_city,
    c.state_code as customer_state_code,
    c.state_name as customer_state_name
from orders as o
inner join dim_customers as c on
    o.customer_id = c.customer_id
    and o.order_purchase_timestamp >= c.valid_from
    and o.order_purchase_timestamp <= c.valid_to;
```

## [Exercise]

**Time limit during live workshop: 10 min** 

**Scenario**: Assume you are building tables for your data team and have written the query above.

**Question**: From a team-wide table reusability perspective, what do you think is wrong with the above query?

**Question**: How would you change this code so that your colleagues can reuse your work?


# Recap

1. CTEs help with the readability and reusability of your query

2. CTEs are defined using the WITH keyword

3. Don't overuse CTEs; be mindful of query size

4. CTE performance depends on the DB; check your query plan
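
To illustrate point 4, here is a minimal sketch of asking the database for a query plan. It assumes an in-memory SQLite database and a made-up `sales` table, where the statement prefix is `EXPLAIN QUERY PLAN`; in DuckDB you would prefix the query with `EXPLAIN` instead. The plan text differs between engines and versions, so no expected output is shown.

```python
import sqlite3

# Assumption: toy data in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount REAL);
    INSERT INTO sales VALUES ('east', 100.0), ('west', 30.0);
""")

cte_sql = """
    WITH region_totals AS (
        SELECT region, SUM(amount) AS total
        FROM sales
        GROUP BY region
    )
    SELECT region, total
    FROM region_totals
    WHERE total > (SELECT AVG(total) FROM region_totals)
"""

# SQLite-specific prefix; DuckDB uses plain EXPLAIN.
plan = conn.execute("EXPLAIN QUERY PLAN " + cte_sql).fetchall()
for row in plan:
    # The last column holds the human-readable description of the plan step.
    print(row[-1])
```

Reading the plan shows how the engine actually evaluates the CTE, for example whether it is computed once or once per reference.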



# Helpers

1. Solutions are available at [workshop_solutions](./workshop_solutions.ipynb). **Note**: You need to stop the kernel in this notebook before starting the next one, since a DuckDB database file can only be opened by one read-write process at a time.
2. Note that the `outline` (or `Table of Contents` in the left pane of Jupyter Notebook) is an easy way to navigate this workbook.

# Questions