**Goal**: Introduce basic SQL structure and show some example data transformations. Understand CTEs, aliases, and `CASE` statements. Preview boolean logic, `COALESCE`, `NOT NULL`, and other forms of filtering.

In [15]:
import duckdb

# Load SQL extension, configure display limit
%load_ext sql
%config SqlMagic.displaylimit = 0

# Initialize 🦆 DuckDB connection
conn = duckdb.connect()

# Import database
%sql conn --alias duckdb
%sql IMPORT DATABASE '../../data/nps';

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


Count
224


We often deal with a mix of structured and semi-structured data in SQL transformations. Let's see what our parks dataset looks like.

In [4]:
%%sql
DESCRIBE nps_public_data.parks

column_name,column_type,null,key,default,extra
relevanceScore,BIGINT,YES,,,
designation,VARCHAR,YES,,,
weatherInfo,VARCHAR,YES,,,
addresses,"STRUCT(""type"" VARCHAR, line2 VARCHAR, line1 VARCHAR, stateCode VARCHAR, countryCode VARCHAR, line3 VARCHAR, city VARCHAR, provinceTerritoryCode VARCHAR, postalCode VARCHAR)[]",YES,,,
operatingHours,"STRUCT(""name"" VARCHAR, standardHours STRUCT(friday VARCHAR, sunday VARCHAR, thursday VARCHAR, tuesday VARCHAR, saturday VARCHAR, monday VARCHAR, wednesday VARCHAR), description VARCHAR, exceptions STRUCT(endDate DATE, ""name"" VARCHAR, startDate DATE, exceptionHours STRUCT(friday VARCHAR, sunday VARCHAR, thursday VARCHAR, tuesday VARCHAR, saturday VARCHAR, monday VARCHAR, wednesday VARCHAR))[])[]",YES,,,
entrancePasses,"STRUCT(description VARCHAR, title VARCHAR, ""cost"" DOUBLE)[]",YES,,,
name,VARCHAR,YES,,,
description,VARCHAR,YES,,,
directionsUrl,VARCHAR,YES,,,
fees,VARCHAR[],YES,,,


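Note the type of `operatingHours`: `STRUCT(...)[]`. The trailing `[]` means each value is a list of structs, so the column holds nested, JSON-like data rather than a flat value.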

In [5]:
%%sql
-- Callout: query structuring, LIMIT statements
SELECT 
    name, 
    operatingHours as operating_hours
FROM nps_public_data.parks 
LIMIT 1

name,operating_hours
Federal Hall,"[{'name': 'Hours of Operation', 'standardHours': {'friday': '10:00AM - 5:00PM', 'sunday': 'Closed', 'thursday': '10:00AM - 5:00PM', 'tuesday': '10:00AM - 5:00PM', 'saturday': 'Closed', 'monday': '10:00AM - 5:00PM', 'wednesday': '10:00AM - 5:00PM'}, 'description': 'Federal Hall is Open.', 'exceptions': [{'endDate': datetime.date(2025, 1, 15), 'name': 'Martin Luther King Jr. Day', 'startDate': datetime.date(2025, 1, 15), 'exceptionHours': {'friday': None, 'sunday': None, 'thursday': None, 'tuesday': None, 'saturday': None, 'monday': '10:00AM - 5:00PM', 'wednesday': None}}, {'endDate': datetime.date(2024, 2, 19), 'name': ""Washington's Birthday"", 'startDate': datetime.date(2024, 2, 19), 'exceptionHours': {'friday': None, 'sunday': None, 'thursday': None, 'tuesday': None, 'saturday': None, 'monday': '10:00AM - 5:00PM', 'wednesday': None}}, {'endDate': datetime.date(2024, 5, 27), 'name': 'Memorial Day', 'startDate': datetime.date(2024, 5, 27), 'exceptionHours': {'friday': None, 'sunday': None, 'thursday': None, 'tuesday': None, 'saturday': None, 'monday': '10:00AM - 5:00PM', 'wednesday': None}}, {'endDate': datetime.date(2024, 6, 19), 'name': 'Juneteenth', 'startDate': datetime.date(2024, 6, 19), 'exceptionHours': {'friday': None, 'sunday': None, 'thursday': None, 'tuesday': None, 'saturday': None, 'monday': None, 'wednesday': '10:00AM - 5:00PM'}}, {'endDate': datetime.date(2024, 7, 4), 'name': 'Independence Day', 'startDate': datetime.date(2024, 7, 4), 'exceptionHours': {'friday': None, 'sunday': None, 'thursday': '10:00AM - 5:00PM', 'tuesday': None, 'saturday': None, 'monday': None, 'wednesday': None}}, {'endDate': datetime.date(2024, 9, 2), 'name': 'Labor Day', 'startDate': datetime.date(2024, 9, 2), 'exceptionHours': {'friday': None, 'sunday': None, 'thursday': None, 'tuesday': None, 'saturday': None, 'monday': '10:00AM - 5:00PM', 'wednesday': None}}, {'endDate': datetime.date(2024, 10, 7), 'name': 'Columbus Day', 'startDate': datetime.date(2024, 10, 
7), 'exceptionHours': {'friday': None, 'sunday': None, 'thursday': None, 'tuesday': None, 'saturday': None, 'monday': '10:00AM - 5:00PM', 'wednesday': None}}, {'endDate': datetime.date(2024, 11, 11), 'name': 'Veterans Day', 'startDate': datetime.date(2024, 11, 11), 'exceptionHours': {'friday': None, 'sunday': None, 'thursday': None, 'tuesday': None, 'saturday': None, 'monday': '10:00AM - 5:00PM', 'wednesday': None}}, {'endDate': datetime.date(2024, 11, 28), 'name': 'Thanksgiving', 'startDate': datetime.date(2024, 11, 28), 'exceptionHours': {'friday': None, 'sunday': None, 'thursday': 'Closed', 'tuesday': None, 'saturday': None, 'monday': None, 'wednesday': None}}, {'endDate': datetime.date(2024, 12, 25), 'name': 'Christmas Day', 'startDate': datetime.date(2024, 12, 25), 'exceptionHours': {'friday': None, 'sunday': None, 'thursday': None, 'tuesday': None, 'saturday': None, 'monday': None, 'wednesday': 'Closed'}}, {'endDate': datetime.date(2025, 1, 1), 'name': ""New Year's Day"", 'startDate': datetime.date(2025, 1, 1), 'exceptionHours': {'friday': None, 'sunday': None, 'thursday': None, 'tuesday': None, 'saturday': None, 'monday': None, 'wednesday': 'Closed'}}]}]"


What if we want to create an `operatingHours` table? We can unpack the nested list of structs using `UNNEST`.

In [6]:
%%sql
-- Callout: CTEs, UNNEST
WITH park_hours AS (
    SELECT 
        name as park_name, 
        id as park_id, 
        UNNEST(operatingHours, recursive := true)
    FROM nps_public_data.parks
)
SELECT 
    * EXCLUDE (exceptions, name),
    name as category
FROM park_hours
LIMIT 2

park_name,park_id,friday,sunday,thursday,tuesday,saturday,monday,wednesday,description,category
Federal Hall,2337D255-2D32-4997-957A-D461EEA03AF8,10:00AM - 5:00PM,Closed,10:00AM - 5:00PM,10:00AM - 5:00PM,Closed,10:00AM - 5:00PM,10:00AM - 5:00PM,Federal Hall is Open.,Hours of Operation
Lewis & Clark,5D443C5F-19A0-4A06-9CE4-30534A3DD81A,8:30AM - 4:30PM,Closed,8:30AM - 4:30PM,8:30AM - 4:30PM,Closed,8:30AM - 4:30PM,8:30AM - 4:30PM,"Lewis and Clark National Historic Trail Visitor Center is located on the Missouri River in Omaha, Nebraska.",Visitor Center Hours
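
To make the effect of `UNNEST(..., recursive := true)` more concrete, here is a rough plain-Python sketch of the same idea: each list element becomes its own row, and nested struct fields are promoted to top-level columns. The sample record is made up and only mimics the shape of `operatingHours`; it leaves out the `exceptions` list and is not how DuckDB implements the function.

```python
# Plain-Python sketch of a recursive unnest: one output row per list element,
# with nested struct fields promoted to top-level columns.
# The record below is hypothetical; it only mimics the shape of `operatingHours`.
parks = [
    {
        "name": "Example Park",  # hypothetical park, not from the dataset
        "operatingHours": [
            {
                "name": "Hours of Operation",
                "standardHours": {"monday": "9:00AM - 5:00PM", "sunday": "Closed"},
                "description": "Example Park is open.",
            },
            {
                "name": "Visitor Center Hours",
                "standardHours": {"monday": "10:00AM - 4:00PM", "sunday": "Closed"},
                "description": "Visitor center hours.",
            },
        ],
    }
]


def flatten(struct):
    """Promote nested dict fields to top-level keys, keeping the child key names."""
    flat = {}
    for key, value in struct.items():
        if isinstance(value, dict):
            flat.update(flatten(value))
        else:
            flat[key] = value
    return flat


rows = []
for park in parks:
    for hours in park["operatingHours"]:  # list -> one row per element
        rows.append({"park_name": park["name"], **flatten(hours)})

for row in rows:
    print(row)
```

One park with two entries in its list yields two rows, which is why the unnested table can have more rows than `parks`.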


Notice how we use a CTE to name the intermediate result, which keeps the query easy to read and logical. Now we can create a table with the result.

In [7]:
%%sql
-- Callout: column renaming, EXCLUDE
CREATE OR REPLACE TABLE nps_public_data.park_hours AS (
    WITH park_hours AS (
        SELECT 
            name as park_name, 
            id as park_id, 
            -- https://duckdb.org/docs/sql/query_syntax/unnest.html
            UNNEST(operatingHours, recursive := true)
        FROM nps_public_data.parks
    )
    SELECT 
        * EXCLUDE (exceptions, name),
        name as category
    FROM park_hours 
)

Count
667
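
The CTE-plus-`CREATE TABLE ... AS` pattern can be tried on a toy table without the NPS database. The sketch below uses Python's built-in `sqlite3` as a stand-in for DuckDB, which is an assumption made only so the example is self-contained; the table `raw_hours` and its rows are made up. SQLite has no `CREATE OR REPLACE TABLE` or `EXCLUDE`, so the columns are listed explicitly.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for DuckDB; table and values are made up.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE raw_hours (park_name TEXT, name TEXT, thursday TEXT)")
con.executemany(
    "INSERT INTO raw_hours VALUES (?, ?, ?)",
    [
        ("Example Park", "Hours of Operation", "9:00AM - 5:00PM"),
        ("Example Park", "Visitor Center Hours", "Closed"),
    ],
)

# The CTE names an intermediate result, the outer SELECT renames a column,
# and CREATE TABLE ... AS stores the final result as a new table.
con.execute(
    """
    CREATE TABLE park_hours AS
    WITH hours AS (
        SELECT park_name, name, thursday
        FROM raw_hours
    )
    SELECT park_name, name AS category, thursday
    FROM hours
    """
)

result = con.execute(
    "SELECT park_name, category, thursday FROM park_hours ORDER BY category"
).fetchall()
print(result)
```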


Creating tables with _dimensions_, like operating hours, lets us easily join to access the information.

In [8]:
%%sql
-- Callout: WHERE clause
SELECT
    p.name,
    h.thursday
FROM nps_public_data.park_hours h
LEFT JOIN nps_public_data.parks p
    ON h.park_id = p.id
WHERE h.category = 'Hours of Operation'
LIMIT 5

name,thursday
Federal Hall,10:00AM - 5:00PM
Theodore Roosevelt Birthplace,10:00AM - 4:00PM
Tumacácori,9:00AM - 5:00PM
Wright Brothers,9:00AM - 5:00PM


In [9]:
%%sql 
-- Callout: DISTINCT, ORDER BY, LIMIT
SELECT DISTINCT
    thursday
FROM nps_public_data.park_hours 
ORDER BY 1 DESC 
LIMIT 10

thursday
unknown
Sunrise to Sunset
Opens at 6:00AM
Opens at 5:00AM
Closes at 12:00PM
Closed
All Day
9:30AM - 5:00PM
9:30AM - 4:30PM
9:30AM - 4:00PM
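
The ordering above can look surprising: the column is text, so `ORDER BY 1 DESC` sorts character by character rather than by time of day, which is why `unknown` comes first. A small sketch of the same behavior, assuming Python's built-in `sqlite3` as a stand-in for DuckDB and a handful of sample values in the style of the output above:

```python
import sqlite3

# Assumption: in-memory SQLite stands in for DuckDB; values are a small made-up sample.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE hours (thursday TEXT)")
con.executemany(
    "INSERT INTO hours VALUES (?)",
    [("Closed",), ("unknown",), ("9:30AM - 5:00PM",), ("All Day",), ("Closed",)],
)

# DISTINCT removes the duplicate 'Closed'. ORDER BY 1 sorts by the first
# selected column; because it is text, lowercase letters sort after
# uppercase letters, which sort after digits.
distinct_desc = [
    row[0]
    for row in con.execute("SELECT DISTINCT thursday FROM hours ORDER BY 1 DESC")
]
print(distinct_desc)
```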


In [10]:
%%sql
-- Callout: CASE statements, boolean logic
CREATE OR REPLACE TABLE nps_public_data.park_hours AS (
    WITH park_hours AS (
        SELECT 
            name as park_name, 
            id as park_id, 
            -- https://duckdb.org/docs/sql/query_syntax/unnest.html
            UNNEST(operatingHours, recursive := true)
        FROM nps_public_data.parks
    )
    SELECT 
        park_name,
        park_id,
        description,
        name as category,
        CASE monday WHEN 'unknown' THEN 'Closed' ELSE monday END as monday,
        CASE tuesday WHEN 'unknown' THEN 'Closed' ELSE tuesday END as tuesday,
        CASE wednesday WHEN 'unknown' THEN 'Closed' ELSE wednesday END as wednesday,
        CASE thursday WHEN 'unknown' THEN 'Closed' ELSE thursday END as thursday,
        CASE friday WHEN 'unknown' THEN 'Closed' ELSE friday END as friday,
        CASE saturday WHEN 'unknown' THEN 'Closed' ELSE saturday END as saturday,
        CASE sunday WHEN 'unknown' THEN 'Closed' ELSE sunday END as sunday,
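        -- Treat 'unknown' like 'Closed' here too, so the flag agrees with the
        -- cleaned day columns whether these names resolve to raw or cleaned values
        CASE WHEN 
            monday NOT IN ('Closed', 'unknown') AND
            tuesday NOT IN ('Closed', 'unknown') AND
            wednesday NOT IN ('Closed', 'unknown') AND
            thursday NOT IN ('Closed', 'unknown') AND
            friday NOT IN ('Closed', 'unknown') AND
            saturday NOT IN ('Closed', 'unknown') AND
            sunday NOT IN ('Closed', 'unknown')
        THEN TRUE ELSE FALSE END as open_seven_days_a_week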
    FROM park_hours 
)

Count
667
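
The cell above uses two forms of `CASE`: the simple form (`CASE column WHEN value ...`) and the searched form (`CASE WHEN condition ...`). The sketch below shows both on a toy table, again assuming Python's built-in `sqlite3` as a stand-in for DuckDB with made-up rows. It also shows what happens with a `NULL` day: the comparison evaluates to `NULL`, which is not true, so the searched `CASE` falls through to `ELSE`.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for DuckDB; parks A-D are made up.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE hours (park TEXT, monday TEXT)")
con.executemany(
    "INSERT INTO hours VALUES (?, ?)",
    [("A", "All Day"), ("B", "unknown"), ("C", "Closed"), ("D", None)],
)

rows = con.execute(
    """
    SELECT
        park,
        -- simple CASE: compares monday against each WHEN value
        CASE monday WHEN 'unknown' THEN 'Closed' ELSE monday END AS monday_clean,
        -- searched CASE: evaluates a boolean condition
        CASE WHEN monday NOT IN ('Closed', 'unknown') THEN 1 ELSE 0 END AS is_open
    FROM hours
    ORDER BY park
    """
).fetchall()

for row in rows:
    print(row)
```

Park D keeps a `NULL` in `monday_clean` and gets `is_open = 0`, so missing hours are never counted as open.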


In [11]:
%%sql
SELECT * FROM nps_public_data.park_hours WHERE open_seven_days_a_week LIMIT 1

park_name,park_id,description,category,monday,tuesday,wednesday,thursday,friday,saturday,sunday,open_seven_days_a_week
George Washington,E6D5BB41-3251-469F-ABDA-7B43B966F0CF,"The George Washington Memorial Parkway is generally open year round, 24 hours a day. Check current conditions for information about closures due to road work and inclement weather. Parkway headquarters is open Monday through Friday from 8:15 am to 4:15 pm. It is closed on weekends and holidays. Most park sites are open from 6 am to 10 pm. For hours at specific destinations along the parkway please visit the individual webpages for those sites.",The George Washington Memorial Parkway,All Day,All Day,All Day,All Day,All Day,All Day,All Day,True


Can we find parks that are closed on Thursday?

In [12]:
%%sql
SELECT
    p.name,
    closed_thurs.category,
    closed_thurs.thursday,
    COALESCE(closed_thurs.thursday, 'Open') as closed_open,
    closed_thurs.thursday IS NOT NULL as is_closed
FROM nps_public_data.parks p
LEFT JOIN nps_public_data.park_hours closed_thurs
    ON closed_thurs.park_id = p.id
    AND closed_thurs.thursday = 'Closed'
WHERE 1 = 1 -- always true; a placeholder that makes it easy to append AND filters
ORDER BY RANDOM()
LIMIT 5

name,category,thursday,closed_open,is_closed
Rio Grande,,,Open,False
Gateway,,,Open,False
Amache,,,Open,False
Chickasaw,,,Open,False
Lewis and Clark,,,Open,False
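
The query above puts `thursday = 'Closed'` in the `ON` clause rather than in `WHERE`. With a `LEFT JOIN`, that keeps every park and leaves `NULL` where no closed row matches, which is what `COALESCE` and `IS NOT NULL` then act on. A minimal sketch of the difference, assuming Python's built-in `sqlite3` as a stand-in for DuckDB and two made-up parks:

```python
import sqlite3

# Assumption: in-memory SQLite stands in for DuckDB; both parks are made up.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE parks (id INTEGER, name TEXT)")
con.execute("CREATE TABLE park_hours (park_id INTEGER, thursday TEXT)")
con.executemany("INSERT INTO parks VALUES (?, ?)", [(1, "Park A"), (2, "Park B")])
con.executemany(
    "INSERT INTO park_hours VALUES (?, ?)",
    [(1, "9:00AM - 5:00PM"), (2, "Closed")],
)

# Filter in ON: unmatched parks survive with NULLs on the right-hand side.
on_rows = con.execute(
    """
    SELECT
        p.name,
        COALESCE(h.thursday, 'Open') AS closed_open,
        h.thursday IS NOT NULL AS is_closed
    FROM parks p
    LEFT JOIN park_hours h
        ON h.park_id = p.id
        AND h.thursday = 'Closed'
    ORDER BY p.name
    """
).fetchall()

# Filter in WHERE: rows with a NULL thursday are removed after the join,
# so the LEFT JOIN behaves like an inner join.
where_rows = con.execute(
    """
    SELECT p.name
    FROM parks p
    LEFT JOIN park_hours h
        ON h.park_id = p.id
    WHERE h.thursday = 'Closed'
    ORDER BY p.name
    """
).fetchall()

print(on_rows)
print(where_rows)
```

SQLite reports booleans as `0`/`1`, where DuckDB shows `False`/`True`.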


In [13]:
%%sql
EXPORT DATABASE '../../data/nps' (FORMAT PARQUET, COMPRESSION ZSTD, ROW_GROUP_SIZE 100000);

Success


Notice the pattern:
- Investigate data
- Identify facts/dimensions that could be useful
- Data modelling 
- Transformation
- Storage

A few other things:
1. Data modelling is a useful skill that won't be discussed in this course. For helpful reading, see the appendix.
2. _Automating_ transformations is another useful tactic that _also_ won't be discussed in this course. See the appendix for data automations.