**Goal**: Introduce basic SQL structure and show some example data transformations. Understand CTEs, aliases, and `CASE` statements. Preview boolean logic, `COALESCE`, `NOT NULL`, and other forms of filtering.

In [None]:
import duckdb

# Load SQL extension
%load_ext sql

# Initialize 🦆 DuckDB connection
conn = duckdb.connect()

# Import database
%sql conn --alias duckdb
%sql IMPORT DATABASE '../../data/nps';

We often deal with a mix of structured and semi-structured data in SQL transformations. Let's see what our parks dataset looks like.

In [None]:
%%sql
SELECT * FROM nps_public_data.parks LIMIT 3

Note the type of `operatingHours`: `STRUCT`! That means the column holds nested data (named fields packed inside a single value), much like a JSON object.

In [None]:
%%sql
-- Callout: query structuring, LIMIT statements
SELECT 
    name, 
    operatingHours as operating_hours
FROM nps_public_data.parks 
LIMIT 1

What if we want to create an `operatingHours` table? We can unpack the nested data using `UNNEST`. Notice that the query below has two steps, and we split them up: the first step lives in a named subquery introduced by `WITH`. That's called a CTE (common table expression), a way of separating aggregates or other intermediate operations from the final `SELECT`.

Next, we're using `UNNEST` to explode the `STRUCT` (JSON-like) data. DuckDB lets us use `recursive := true` to burrow down and get _every_ level of the nesting... Pretty neat!
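
To build intuition for what a recursive unnest does, here is a minimal pure-Python sketch of the same idea: one row per list entry, one column per leaf field. It only mimics the concept and is not DuckDB's implementation. The sample park and the `standardHours` field name are assumptions made up for illustration.

```python
def flatten_struct(struct):
    """Collapse nested dicts into one flat dict of leaf fields."""
    flat = {}
    for field, value in struct.items():
        if isinstance(value, dict):
            # Burrow into the nested struct and lift its fields up a level
            flat.update(flatten_struct(value))
        else:
            flat[field] = value
    return flat


def unnest_rows(park):
    """Return one flat row per entry in the park's operatingHours list."""
    return [
        {"park_name": park["name"], **flatten_struct(entry)}
        for entry in park["operatingHours"]
    ]


# Hypothetical sample data; the real dataset has more fields
park = {
    "name": "Example Park",
    "operatingHours": [
        {
            "name": "Hours of Operation",
            "standardHours": {"monday": "All Day", "thursday": "Closed"},
        },
        {
            "name": "Visitor Center",
            "standardHours": {"monday": "9:00AM - 5:00PM", "thursday": "9:00AM - 5:00PM"},
        },
    ],
}

rows = unnest_rows(park)
for row in rows:
    print(row)
```

Note that the nested entry has its own `name` field, which is why the queries in this notebook alias the park's `name` to `park_name` before unnesting.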

In [None]:
%%sql
-- Callout: CTEs, UNNEST
WITH park_hours AS (
    SELECT 
        name as park_name, 
        id as park_id, 
        UNNEST(operatingHours, recursive := true)
    FROM nps_public_data.parks
)
SELECT 
    * EXCLUDE (exceptions, name),
    name as category
FROM park_hours
LIMIT 2

Notice how we use a CTE to make the query logical and easy to read. Now we can create a table with the result. `EXCLUDE` lets us keep `SELECT *` while dropping the columns we don't need.
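
The CTE pattern itself is standard SQL, so it can be sketched outside DuckDB. The example below is a minimal illustration using Python's built-in `sqlite3` as a stand-in (an assumption made so it runs without the notebook's data); the `hours` table and its rows are made up.

```python
import sqlite3

# In-memory SQLite database standing in for DuckDB (illustration only)
demo_db = sqlite3.connect(":memory:")
demo_db.execute("CREATE TABLE hours (park TEXT, category TEXT, thursday TEXT)")
demo_db.executemany(
    "INSERT INTO hours VALUES (?, ?, ?)",
    [
        ("Park A", "Hours of Operation", "All Day"),
        ("Park A", "Visitor Center", "Closed"),
        ("Park B", "Hours of Operation", "Closed"),
    ],
)

query = """
WITH main_hours AS (      -- step 1: a named intermediate result
    SELECT park, thursday
    FROM hours
    WHERE category = 'Hours of Operation'
)
SELECT park               -- step 2: query it like any other table
FROM main_hours
WHERE thursday = 'Closed'
"""

closed_parks = demo_db.execute(query).fetchall()
print(closed_parks)
```

Each step can be read and reasoned about on its own, which is the main readability gain over nesting the first `SELECT` inside the second.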

In [None]:
%%sql
-- Callout: column renaming, EXCLUDE
CREATE OR REPLACE TABLE nps_public_data.park_hours AS (
    WITH park_hours AS (
        SELECT 
            name as park_name, 
            id as park_id, 
            -- https://duckdb.org/docs/sql/query_syntax/unnest.html
            UNNEST(operatingHours, recursive := true)
        FROM nps_public_data.parks
    )
    SELECT 
        * EXCLUDE (exceptions, name),
        name as category
    FROM park_hours 
)

Creating tables with _dimensions_, like operating hours, lets us easily join to access the information. Here, notice how readable the query becomes.

We're selecting the _name_ of the park and the _thursday_ hours `WHERE` the category is 'Hours of Operation'.

If we'd included the above logic in this query, it'd be much more dense! This is one of our first _patterns_ for SQL transformation:

> Store precalculated (or aggregated) queries or use CTEs to limit complexity and improve readability

In [None]:
%%sql
-- Callout: WHERE clause
SELECT
    p.name,
    h.thursday
FROM nps_public_data.park_hours h
LEFT JOIN nps_public_data.parks p
    ON h.park_id = p.id
WHERE h.category = 'Hours of Operation'
LIMIT 5

**Note:** Use single quotes (') for string values in DuckDB SQL. Double quotes (") are reserved for identifiers, such as table and column names.
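
The same quoting convention comes from standard SQL, so it can be sketched with Python's built-in `sqlite3` as a stand-in for DuckDB (an assumption for portability). The table and column names below are hypothetical and contain spaces on purpose, which is when double quotes become necessary.

```python
import sqlite3

demo_db = sqlite3.connect(":memory:")

# Double quotes wrap identifiers (here, names containing spaces)
demo_db.execute('CREATE TABLE "park hours" ("park name" TEXT, thursday TEXT)')

# Single quotes wrap string values
demo_db.execute("""INSERT INTO "park hours" VALUES ('Park A', 'Closed')""")
demo_db.execute("""INSERT INTO "park hours" VALUES ('Park B', 'All Day')""")

quoted_rows = demo_db.execute(
    """SELECT "park name" FROM "park hours" WHERE thursday = 'Closed'"""
).fetchall()
print(quoted_rows)
```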

If we want to know all the values that `thursday` can take, we can use `DISTINCT` to return a list... This is like using `set()` in Python.
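
To make the `set()` comparison concrete, here is a small sketch that runs `SELECT DISTINCT` and Python's `set()` over the same made-up values. It uses the built-in `sqlite3` module as a stand-in for DuckDB, and the `demo_hours` table is hypothetical.

```python
import sqlite3

demo_db = sqlite3.connect(":memory:")
demo_db.execute("CREATE TABLE demo_hours (thursday TEXT)")

values = ["All Day", "Closed", "All Day", "9:00AM - 5:00PM", "Closed"]
demo_db.executemany("INSERT INTO demo_hours VALUES (?)", [(v,) for v in values])

# DISTINCT removes duplicate rows; ORDER BY 1 sorts by the first output column
distinct_rows = demo_db.execute(
    "SELECT DISTINCT thursday FROM demo_hours ORDER BY 1 DESC"
).fetchall()
distinct_values = [row[0] for row in distinct_rows]

print(distinct_values)
print(set(values) == set(distinct_values))
```

One difference worth keeping in mind: a Python `set` has no order, while the SQL result can be sorted with `ORDER BY`.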

In [None]:
%%sql
-- Callout: DISTINCT, ORDER BY, LIMIT
SELECT 
    DISTINCT thursday
FROM nps_public_data.park_hours 
ORDER BY 1 DESC 
LIMIT 10;

We can use `CASE` expressions to alter how data is returned or to create entirely new columns. Here, we'll create a new table, renaming columns as we go. This is an example of _cleaning_ a dataset. We'll assume 'unknown' hours mean the park resource is closed.
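
`CASE` comes in two forms: the simple form compares one column against values (`CASE col WHEN ...`), while the searched form evaluates arbitrary conditions (`CASE WHEN condition ...`). The sketch below shows both on made-up rows, using Python's built-in `sqlite3` as a stand-in for DuckDB; the `raw_hours` table is hypothetical.

```python
import sqlite3

demo_db = sqlite3.connect(":memory:")
demo_db.execute("CREATE TABLE raw_hours (park TEXT, thursday TEXT)")
demo_db.executemany(
    "INSERT INTO raw_hours VALUES (?, ?)",
    [("Park A", "All Day"), ("Park B", "unknown"), ("Park C", "Closed")],
)

query = """
SELECT
    park,
    -- Simple form: replace one specific value, keep everything else
    CASE thursday WHEN 'unknown' THEN 'Closed' ELSE thursday END AS thursday_hours,
    -- Searched form: derive a new flag from a condition
    CASE WHEN thursday NOT IN ('unknown', 'Closed') THEN 1 ELSE 0 END AS open_thursday
FROM raw_hours
ORDER BY park
"""

cleaned = demo_db.execute(query).fetchall()
for row in cleaned:
    print(row)
```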

In [None]:
%%sql
CREATE OR REPLACE TABLE nps_public_data.park_hours AS (
    WITH park_hours AS (
        SELECT 
            name as park_name, 
            id as park_id, 
            -- https://duckdb.org/docs/sql/query_syntax/unnest.html
            UNNEST(operatingHours, recursive := true)
        FROM nps_public_data.parks
    )
    SELECT 
        park_name,
        park_id,
        description,
        name as category,
        CASE monday WHEN 'unknown' THEN 'Closed' ELSE monday END as monday_hours,
        CASE tuesday WHEN 'unknown' THEN 'Closed' ELSE tuesday END as tuesday_hours,
        CASE wednesday WHEN 'unknown' THEN 'Closed' ELSE wednesday END as wednesday_hours,
        CASE thursday WHEN 'unknown' THEN 'Closed' ELSE thursday END as thursday_hours,
        CASE friday WHEN 'unknown' THEN 'Closed' ELSE friday END as friday_hours,
        CASE saturday WHEN 'unknown' THEN 'Closed' ELSE saturday END as saturday_hours,
        CASE sunday WHEN 'unknown' THEN 'Closed' ELSE sunday END as sunday_hours,
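        -- Treat 'unknown' as closed here too, matching the cleaned columns above
        CASE WHEN 
            monday NOT IN ('Closed', 'unknown') AND
            tuesday NOT IN ('Closed', 'unknown') AND
            wednesday NOT IN ('Closed', 'unknown') AND
            thursday NOT IN ('Closed', 'unknown') AND
            friday NOT IN ('Closed', 'unknown') AND
            saturday NOT IN ('Closed', 'unknown') AND
            sunday NOT IN ('Closed', 'unknown')
        THEN TRUE ELSE FALSE END as open_seven_days_a_week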
    FROM park_hours 
)

In the above, we create a boolean column, `open_seven_days_a_week`, that tells us if a park is open every day. This might seem redundant, since the same information is already contained in parks, but it unlocks a precise, easily readable filter:
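
One subtlety in boolean columns built this way: comparing a missing (`NULL`) value with `!=` yields `NULL` rather than true, so the `CASE` falls through to its `ELSE` branch. A minimal sketch of that behavior, using Python's built-in `sqlite3` as a stand-in for DuckDB (the helper `evaluate` is hypothetical):

```python
import sqlite3

demo_db = sqlite3.connect(":memory:")


def evaluate(expression):
    """Run a single SQL expression and return its value."""
    return demo_db.execute(f"SELECT {expression}").fetchone()[0]


# A known value compares as expected (SQLite reports true as 1)
known_open = evaluate("'All Day' != 'Closed'")

# A missing value makes the comparison itself NULL (None in Python)
missing = evaluate("NULL != 'Closed'")

# Inside CASE WHEN, NULL is not true, so the ELSE branch is taken
flag_for_missing = evaluate("CASE WHEN NULL != 'Closed' THEN 1 ELSE 0 END")

print(known_open, missing, flag_for_missing)
```

In other words, a day with missing hours would count against "open every day", which is usually the conservative choice.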

In [None]:
%%sql
SELECT * FROM nps_public_data.park_hours WHERE open_seven_days_a_week LIMIT 1

As a data or analytics engineer, it's important to make queries as readable as possible. If you know downstream users often need to filter on something like whether a park is open every day, you can add a precomputed column like `open_seven_days_a_week` to make everyone's life easier!

Can we find parks that are closed on Thursday?

In [None]:
%%sql
SELECT
    p.name,
    closed_thurs.category,
    closed_thurs.thursday_hours,
    COALESCE(closed_thurs.thursday_hours, 'Open') as closed_open,
    closed_thurs.thursday_hours IS NOT NULL as is_closed
FROM nps_public_data.parks p
INNER JOIN nps_public_data.park_hours closed_thurs
    ON closed_thurs.park_id = p.id
    AND closed_thurs.thursday_hours = 'Closed'
WHERE 1 = 1
ORDER BY RANDOM()
LIMIT 5;

Notice how we can represent the information in multiple ways: `is_closed`, `closed_open`, and `thursday_hours` all contain the same information, but in different formats. There is no "correct" format; it depends entirely on how you use the data!
- Boolean columns make filters readable: `SELECT * FROM thursday WHERE is_closed`
- Human-readable text makes it easier for users to intuit the data: `SELECT * FROM park_hours WHERE thursday_hours = 'Closed'`
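
`COALESCE` and `IS NOT NULL` only make a visible difference when a value can actually be missing, for instance after a `LEFT JOIN` that finds no match. The sketch below illustrates that case with made-up tables, using Python's built-in `sqlite3` as a stand-in for DuckDB; `closed_thursday` is a hypothetical table listing only the parks that are closed.

```python
import sqlite3

demo_db = sqlite3.connect(":memory:")
demo_db.execute("CREATE TABLE parks (id INTEGER, name TEXT)")
demo_db.execute("CREATE TABLE closed_thursday (park_id INTEGER, thursday_hours TEXT)")
demo_db.executemany("INSERT INTO parks VALUES (?, ?)", [(1, "Park A"), (2, "Park B")])
demo_db.execute("INSERT INTO closed_thursday VALUES (2, 'Closed')")

query = """
SELECT
    p.name,
    -- COALESCE returns its first non-NULL argument
    COALESCE(c.thursday_hours, 'Open') AS closed_open,
    -- A boolean flag derived from whether the join found a match
    c.thursday_hours IS NOT NULL AS is_closed
FROM parks p
LEFT JOIN closed_thursday c
    ON c.park_id = p.id
ORDER BY p.name
"""

status = demo_db.execute(query).fetchall()
for row in status:
    print(row)
```

Park A has no matching row, so its `thursday_hours` is `NULL`: `COALESCE` fills in 'Open' and the flag is false (SQLite reports booleans as 0 and 1).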

In [None]:
%%sql
EXPORT DATABASE '../../data/nps' (FORMAT PARQUET, COMPRESSION ZSTD, ROW_GROUP_SIZE 100000);

Notice the pattern:
- Investigate data
- Identify facts/dimensions that could be useful
- Data modelling 
- Transformation
- Storage

A few other things:
1. Data modelling is a useful skill that won't be discussed in this course— for helpful reading see the appendix.
2. _Automating_ transformations is another useful tactic that _also_ won't be discussed in this course. See the appendix for data automations.