**Goal**: Understanding the basics of `SELECT`ing and filtering through examples, including comparisons (logical, `IS NULL`, `BETWEEN`), the `WHERE` clause, filtering with `JOIN`, `LIKE`, `IN`, `ORDER BY`, and `GROUP BY`.

In [9]:
import duckdb

# Load SQL extension, configure display limit
%load_ext sql
%config SqlMagic.displaylimit = 0

# Initialize 🦆 DuckDB connection
conn = duckdb.connect()

# Import database
%sql conn --alias duckdb
%sql IMPORT DATABASE '../../data/nps';

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


Creating derived fields for the conditions you commonly filter on can drastically improve readability. String and column manipulation (for example `COALESCE` or an `IS NULL` check) is great for that.

In [12]:
%%sql 
WITH thursday AS (
    SELECT
        p.name,
        closed_thurs.category,
        closed_thurs.thursday,
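        -- Parks with no matching 'Closed' row get NULL from the LEFT JOIN; label them 'Open'
        COALESCE(closed_thurs.thursday, 'Open') as closed_open,
        -- True only when a 'Closed' row matched in the join
        closed_thurs.thursday IS NOT NULL as is_closed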
    FROM nps_public_data.parks p
    LEFT JOIN nps_public_data.park_hours closed_thurs
        ON closed_thurs.park_id = p.id
        AND closed_thurs.thursday = 'Closed'
    WHERE 1 = 1
)
SELECT
    *
FROM thursday
WHERE 
    is_closed
    -- Don't forget we can use OR :)
    -- OR closed_open = 'Closed'
LIMIT 5

name,category,thursday,closed_open,is_closed
Eleanor Roosevelt,Val-Kill Cottage Tours,Closed,Closed,True
Morristown,Winter Hours,Closed,Closed,True
Freedom Riders,Anniston Greyhound Bus Depot,Closed,Closed,True
Black Canyon Of The Gunnison,East Portal,Closed,Closed,True
Golden Gate,Fort Point National Historic Site,Closed,Closed,True
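
To see why the derived columns come out the way they do, here is a minimal sketch on made-up toy tables. It uses Python's built-in `sqlite3` module as a stand-in for DuckDB so it runs without the course data; the table contents are hypothetical, and SQLite reports booleans as `1`/`0` rather than `True`/`False`.

```python
import sqlite3

# Assumption: toy stand-ins for parks / park_hours, not the real NPS data.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE parks (id INTEGER, name TEXT);
    CREATE TABLE park_hours (park_id INTEGER, thursday TEXT);
    INSERT INTO parks VALUES (1, 'Park A'), (2, 'Park B');
    INSERT INTO park_hours VALUES (1, 'Closed'), (2, 'All Day');
""")

rows = con.execute("""
    SELECT
        p.name,
        COALESCE(h.thursday, 'Open') AS closed_open,
        h.thursday IS NOT NULL AS is_closed
    FROM parks p
    LEFT JOIN park_hours h
        ON h.park_id = p.id
        AND h.thursday = 'Closed'   -- filter lives in the join condition
    ORDER BY p.id
""").fetchall()

for row in rows:
    print(row)
```

Park B has an hours row, but it fails the `thursday = 'Closed'` condition in the join, so the `LEFT JOIN` fills its columns with `NULL`. `COALESCE` then turns that `NULL` into `'Open'`, and the `IS NOT NULL` check gives a flag that is easy to filter on.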


We commonly filter queries with the `WHERE` clause, but we can also filter inside a `JOIN` condition.

However, we need to be very careful about how a filter interacts with the join type.

In [13]:
%%sql
SELECT
    p.name,
    vc.name as visitor_center_name
FROM nps_public_data.parks p
LEFT JOIN nps_public_data.visitorcenters vc
    ON p.parkcode = vc.parkcode
WHERE 1 = 1
-- Filter base query (parks) for national monument
    AND p.designation = 'National Monument'
-- Filter the joined table (!) for passport stamp locations.
-- What will happen to parks without visitor centers?
    AND vc.ispassportstamplocation
LIMIT 1

name,visitor_center_name
Statue Of Liberty,Liberty Island Information Center


How many rows are returned with/without the `LEFT JOIN`? What does that say about the number of parks we're querying? Why do you think that is?
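
One way to explore these questions is on a tiny dataset where every row can be checked by hand. The sketch below is an illustration only: it uses Python's built-in `sqlite3` module in place of DuckDB, and the tables, the `is_stamp` column, and the rows are all made up.

```python
import sqlite3

# Assumption: hypothetical toy tables, SQLite used as a stand-in for DuckDB.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE parks (parkcode TEXT, name TEXT);
    CREATE TABLE visitorcenters (parkcode TEXT, name TEXT, is_stamp INTEGER);
    INSERT INTO parks VALUES
        ('aaa', 'Park A'), ('bbb', 'Park B'), ('ccc', 'Park C');
    INSERT INTO visitorcenters VALUES
        ('aaa', 'Center 1', 1),   -- passport stamp location
        ('bbb', 'Center 2', 0);   -- no stamp; Park C has no center at all
""")

# Filter on the right-hand table in WHERE: rows with NULLs are dropped.
filter_in_where = con.execute("""
    SELECT p.name, vc.name
    FROM parks p
    LEFT JOIN visitorcenters vc ON p.parkcode = vc.parkcode
    WHERE vc.is_stamp = 1
    ORDER BY p.name
""").fetchall()

# Same filter inside the join condition: every park is kept.
filter_in_on = con.execute("""
    SELECT p.name, vc.name
    FROM parks p
    LEFT JOIN visitorcenters vc
        ON p.parkcode = vc.parkcode
        AND vc.is_stamp = 1
    ORDER BY p.name
""").fetchall()

print(filter_in_where)
print(filter_in_on)
```

With the filter in `WHERE`, only Park A survives, because the `NULL` values produced for unmatched parks never satisfy the condition. With the filter in `ON`, all three parks are returned and the visitor center name is `NULL` (`None` in Python) for Park B and Park C.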

An `INNER JOIN` returns the same rows as a `LEFT JOIN` filtered with `IS NOT NULL` on the right table's join key. Why is that?

In [14]:
%%sql
WITH filter_in_join AS (
    SELECT
        p.name,
        vc.name as visitor_center_name
    FROM nps_public_data.parks p
    INNER JOIN nps_public_data.visitorcenters vc
        ON p.parkcode = vc.parkcode
), filter_in_where AS (
    SELECT
        p.name,
        vc.name as visitor_center_name
    FROM nps_public_data.parks p
    LEFT JOIN nps_public_data.visitorcenters vc
        ON p.parkcode = vc.parkcode
    WHERE vc.parkcode IS NOT NULL
)
SELECT
    COUNT(*) as ct
FROM filter_in_join

UNION ALL

SELECT
    COUNT(*) as ct
FROM filter_in_where


ct
705
705


Some common ways of filtering data:

1. Comparisons (`>`, `<`, `=`)
2. `BETWEEN`
3. `IN`
4. `IS NULL`
5. `LIKE` & `ILIKE` // `REGEXP`

Comparisons and `BETWEEN` work well for numbers, and also for timestamps and dates (as we'll see). `IN` is helpful for matching against a list of values, while `IS NULL` helps when `NULL` values are a possibility.

`ILIKE`, `LIKE`, and `REGEXP` are all useful when pattern matching is at play.

We can filter numbers and dates with comparisons or `BETWEEN` statements:

In [15]:
%%sql
SELECT
    title,
    parkfullname,
    category,
    isfree,
    description
FROM nps_public_data.events e
WHERE 1 = 1
    AND recurrencedatestart > '2024-01-01'
    AND recurrencedatestart < '2024-01-23'
ORDER BY RANDOM()
LIMIT 2


title,parkfullname,category,isfree,description
Acadian Cultural Center - Louisiana Talks & Tales,Jean Lafitte National Historical Park and Preserve,Regular Event,True,"Join a ranger to learn about the history, culture, or environment of south Louisiana."


In [16]:
%%sql
SELECT
    title,
    parkfullname,
    category,
    isfree,
    description
FROM nps_public_data.events e
WHERE 1 = 1
    AND recurrencedatestart BETWEEN '2024-01-01' AND '2024-01-23'
ORDER BY RANDOM()
LIMIT 2

title,parkfullname,category,isfree,description
Acadian Cultural Center - Louisiana Talks & Tales,Jean Lafitte National Historical Park and Preserve,Regular Event,True,"Join a ranger to learn about the history, culture, or environment of south Louisiana."
Afternoon Stroll,Joshua Tree National Park,Regular Event,True,"Join a ranger for a 0.4 mile (0.6 km) guided walk! Learn about various topics like plants, animals, geology, cultural history, and more. Topics vary by ranger. Bring water, closed–toe shoes, and layers for this 45–minute walk. Difficulty: easy, unpaved trail"


But we have to note that `BETWEEN` is _inclusive_ of both endpoints, while `>` and `<` are not:

In [17]:
%%sql
SELECT
    'between' as f,
    COUNT(*) as ct
FROM nps_public_data.events e
WHERE 1 = 1
    AND recurrencedatestart BETWEEN '2024-01-01' AND '2024-01-23'
GROUP BY f

UNION ALL

SELECT
    'greater than' as f,
    COUNT(*) as ct
FROM nps_public_data.events e
WHERE 1 = 1
    AND recurrencedatestart > '2024-01-01'
    AND recurrencedatestart < '2024-01-23'
GROUP BY f

f,ct
between,3
greater than,1


Don't forget that we can nest logic!

In [35]:
%%sql
SELECT
    title,
    parkfullname,
    category,
    isfree,
    description
FROM nps_public_data.events e
WHERE 1 = 1
    -- Fetch events with dates in January _or_ March
    AND (
            (recurrencedatestart BETWEEN '2024-01-01' AND '2024-01-31') OR
            (recurrencedatestart BETWEEN '2024-03-01' AND '2024-03-31')
    ) 
ORDER BY RANDOM()
LIMIT 2

title,parkfullname,category,isfree,description
Acadian Cultural Center - Youth Art Showcase,Jean Lafitte National Historical Park and Preserve,Regular Event,True,"Join us for this special exhibition of artwork from local students who placed in this year's Youth Art Showcase. The 2023 theme was ""Waterways of the Atchafalaya Area."""
8 am to 6 pm: Village — History Exhibit: The Amazing Kolb Brothers; A Grand Life at Grand Canyon,Grand Canyon National Park,Regular Event,True,"For those with an interest in Grand Canyon history, see the Amazing Kolb Brothers Exhibit at Kolb Studio, house-turned-museum perched perilously on a western precipice in Grand Canyon Village. — View the antique cameras used by the canyon's pioneer photographers, study paintings by plein-air artists, and watch their 1912 motion picture travelogue, about their exploration of Grand Canyon and river trips down the Colorado River. Nearly demolished in the 1960s, this structure stands today as a park icon, art gallery, and bookstore for visitors in the vicinity of Bright Angel Trail. Currently operated by the park's non-profit partner, Grand Canyon Conservancy, visitors can purchase artwork, books, gifts, souvenirs, and basic hiking gear, or simply stop by for park information and the Amazing the Kolb Brothers Exhibit about the life and adventures on the edge of Grand Canyon. In addition, you can join Grand Canyon Conservancy guides for a behind-the-scenes tour of the historic Kolb Studio Residence! Walk through the home of Emery, Blanche, Edith, Ellsworth and family and learn about the fascinating history of private entrepreneurship and art at Grand Canyon. Details > Kolb Studio Tour | Grand Canyon Conservancy"
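
The outer parentheses matter as soon as another condition is combined with the `OR`, because `AND` binds more tightly than `OR`. The sketch below shows this on made-up data; it uses Python's built-in `sqlite3` module as a stand-in for DuckDB, and the `is_free` column and event rows are hypothetical.

```python
import sqlite3

# Assumption: toy events table, SQLite used in place of DuckDB.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE events (title TEXT, start_date TEXT, is_free INTEGER);
    INSERT INTO events VALUES
        ('Free Jan', '2024-01-15', 1),
        ('Paid Jan', '2024-01-20', 0),
        ('Free Feb', '2024-02-10', 1),
        ('Free Mar', '2024-03-05', 1),
        ('Paid Mar', '2024-03-10', 0);
""")

# No outer parentheses: parsed as (free AND January) OR March.
ungrouped = [r[0] for r in con.execute("""
    SELECT title FROM events
    WHERE is_free = 1
        AND (start_date BETWEEN '2024-01-01' AND '2024-01-31')
        OR (start_date BETWEEN '2024-03-01' AND '2024-03-31')
    ORDER BY title
""")]

# Outer parentheses: free AND (January OR March).
grouped = [r[0] for r in con.execute("""
    SELECT title FROM events
    WHERE is_free = 1
        AND (
            (start_date BETWEEN '2024-01-01' AND '2024-01-31') OR
            (start_date BETWEEN '2024-03-01' AND '2024-03-31')
        )
    ORDER BY title
""")]

print(ungrouped)
print(grouped)
```

Without the grouping, the paid March event slips through because the `is_free` condition only applies to the January branch.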


Another handy way to filter datasets is string matching. If you're familiar with Python, you probably know regex, but SQL has a few simpler options. First, `LIKE`:

In [18]:
%%sql
SELECT
    title,
    parkfullname,
    category,
    isfree,
    description
FROM nps_public_data.events e
WHERE 1 = 1
    AND title LIKE '%Stroll%'
LIMIT 5

title,parkfullname,category,isfree,description
Afternoon Stroll,Joshua Tree National Park,Regular Event,True,"Join a ranger for a 0.4 mile (0.6 km) guided walk! Learn about various topics like plants, animals, geology, cultural history, and more. Topics vary by ranger. Bring water, closed–toe shoes, and layers for this 45–minute walk. Difficulty: easy, unpaved trail"


But `LIKE` is case-sensitive, so it's easy to miss results:

In [19]:
%%sql
SELECT
    title,
    parkfullname,
    category,
    isfree,
    description
FROM nps_public_data.events e
WHERE 1 = 1
    AND title LIKE '%hike%'
LIMIT 5

title,parkfullname,category,isfree,description


Instead, we can use `ILIKE`:

In [20]:
%%sql
SELECT
    title,
    parkfullname,
    category,
    isfree,
    description
FROM nps_public_data.events e
WHERE 1 = 1
    AND title ILIKE '%hike%'
LIMIT 5

title,parkfullname,category,isfree,description
A Hike Through The (Cactus) Forest (East District),Saguaro National Park,Regular Event,True,"Let your interest reach new heights and join us for a hike through the heart of the cactus forest, getting to know the giant cactus. Hike will up to 2 miles round trip on a flat trail. Bring sun protection and water. Hiking boots are recommended."


Depending on your flavor of SQL, there might be other ways to pattern match. DuckDB also supports `glob` matching and `regex` matching. Those are outside the scope of this course, but you can read more [here](https://duckdb.org/docs/sql/functions/patternmatching.html).

Sometimes, we might need to construct a list to perform a more robust filter. We can use `split` and cast the result to a list of strings to turn the `states` field in `parks` into a list. Then, we can query individual elements of the list instead of matching against the raw string.

In this course, we'll challenge you to think critically about the structure of your data and how you can manipulate it to achieve a desired outcome.

In [21]:
%%sql
-- Which parks are fully or partially in Utah?
WITH park_states AS (
    SELECT 
        fullname,
        states AS states_string, 
        split(states, ',') ::string[] AS states_list
    FROM nps_public_data.parks p
    )
SELECT 
    * 
FROM park_states
WHERE list_contains(states_list, 'UT')
LIMIT 5

fullName,states_string,states_list
Cedar Breaks National Monument,UT,['UT']
Arches National Park,UT,['UT']
Bryce Canyon National Park,UT,['UT']
California National Historic Trail,"CA,CO,ID,KS,MO,NE,NV,OR,UT,WY","['CA', 'CO', 'ID', 'KS', 'MO', 'NE', 'NV', 'OR', 'UT', 'WY']"
Canyonlands National Park,UT,['UT']


This allows for some nifty queries in DuckDB for cross-border parks:

In [22]:
%%sql
-- Which parks are both in Utah and Wyoming?
WITH park_states AS (
    SELECT 
        fullname,
        split(states, ',') ::string[] AS states_list
    FROM nps_public_data.parks p
    )
SELECT 
    * 
FROM park_states
WHERE list_has_all(states_list, ['UT', 'WY'])

fullName,states_list
California National Historic Trail,"['CA', 'CO', 'ID', 'KS', 'MO', 'NE', 'NV', 'OR', 'UT', 'WY']"
Mormon Pioneer National Historic Trail,"['IL', 'IA', 'NE', 'UT', 'WY']"
Pony Express National Historic Trail,"['CA', 'CO', 'KS', 'MO', 'NE', 'NV', 'UT', 'WY']"


In [23]:
%%sql
-- Which parks are in Utah and/or Wyoming?
WITH park_states AS (
    SELECT 
        fullname,
        split(states, ',') ::string[] AS states_list
    FROM nps_public_data.parks p
    )
SELECT 
    * 
FROM park_states
WHERE list_has_any(states_list, ['UT', 'WY'])
LIMIT 5

fullName,states_list
Cedar Breaks National Monument,['UT']
Yellowstone National Park,"['ID', 'MT', 'WY']"
Arches National Park,['UT']
Bryce Canyon National Park,['UT']
California National Historic Trail,"['CA', 'CO', 'ID', 'KS', 'MO', 'NE', 'NV', 'OR', 'UT', 'WY']"
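
The logic behind these three list filters can be sketched in plain Python. This only illustrates the semantics, not how DuckDB implements `list_contains`, `list_has_all`, or `list_has_any`; the sample rows are a small hand-picked subset using values shown in the outputs above.

```python
# Small sample of (fullname, states) pairs, mirroring the parks table.
parks = [
    ("Arches National Park", "UT"),
    ("Yellowstone National Park", "ID,MT,WY"),
    ("California National Historic Trail", "CA,CO,ID,KS,MO,NE,NV,OR,UT,WY"),
    ("Yosemite National Park", "CA"),
]

def split_states(states):
    """Turn a comma-separated string such as 'ID,MT,WY' into a list."""
    return states.split(",")

wanted = {"UT", "WY"}

# Like list_contains(states_list, 'UT'): a single element must be present.
contains_ut = [name for name, s in parks if "UT" in split_states(s)]

# Like list_has_all(states_list, ['UT', 'WY']): every wanted state is present.
has_all = [name for name, s in parks if wanted.issubset(split_states(s))]

# Like list_has_any(states_list, ['UT', 'WY']): at least one is present.
has_any = [name for name, s in parks if wanted.intersection(split_states(s))]

print(contains_ut)
print(has_all)
print(has_any)
```

Thinking of the filters as "is an element", "is a subset", and "has a non-empty intersection" can make it easier to pick the right function for a given question.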


To deduplicate our first query, we can group or use `DISTINCT`. We'll discuss this in a later chapter, but grouping data without an aggregation collapses identical rows, though `DISTINCT` is a bit cleaner.

We can also filter for values in a list using `IN`:

In [26]:
%%sql
SELECT 
    id,
    fullname,
    states,
    description
FROM nps_public_data.parks p
WHERE name IN ('Arches', 'Bryce Canyon', 'Zion')

id,fullName,states,description
36240051-018E-4915-B6EA-3F1A7F24FBE4,Arches National Park,UT,"Discover a landscape of contrasting colors, land forms, and textures unlike any other. The park has over 2,000 natural stone arches, hundreds of soaring pinnacles, massive rock fins, and giant balanced rocks. This red-rock wonderland will amaze you with its formations, refresh you with its trails, and inspire you with its sunsets."
6B1D053D-714F-46D1-B410-04BE868F14C1,Bryce Canyon National Park,UT,"Hoodoos (irregular columns of rock) exist on every continent, but here is the largest concentration found anywhere on Earth. Situated along a high plateau at the top of the Grand Staircase, the park's high elevations include numerous life communities, fantastic dark skies, and geological wonders that defy description."
41BAB8ED-C95F-447D-9DA1-FCC4E4D808B2,Zion National Park,UT,"Follow the paths where people have walked for thousands of years. Gaze up at massive sandstone cliffs of cream, pink, and red that soar into a brilliant blue sky. Experience wilderness in a narrow slot canyon. Zion’s unique array of plants and animals will enchant you as you absorb the rich history of the past and enjoy the excitement of present-day adventures."
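
`IN` is shorthand for a chain of equality checks joined by `OR`. The sketch below checks that on made-up rows and also shows how `NULL` values behave; it uses Python's built-in `sqlite3` module as a stand-in for DuckDB, and the table contents are hypothetical.

```python
import sqlite3

# Assumption: toy parks table with one NULL name, SQLite in place of DuckDB.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE parks (name TEXT);
    INSERT INTO parks VALUES ('Arches'), ('Zion'), ('Yosemite'), (NULL);
""")

in_rows = con.execute("""
    SELECT name FROM parks
    WHERE name IN ('Arches', 'Bryce Canyon', 'Zion')
    ORDER BY name
""").fetchall()

or_rows = con.execute("""
    SELECT name FROM parks
    WHERE name = 'Arches' OR name = 'Bryce Canyon' OR name = 'Zion'
    ORDER BY name
""").fetchall()

# A NULL name is neither IN nor NOT IN the list: both comparisons yield NULL.
not_in_rows = con.execute("""
    SELECT name FROM parks
    WHERE name NOT IN ('Arches', 'Bryce Canyon', 'Zion')
    ORDER BY name
""").fetchall()

print(in_rows, or_rows, not_in_rows)
```

The row with a `NULL` name appears in neither result, which is why an explicit `IS NULL` check is needed whenever missing values should be included.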


When we return rows, we can order the results using the `ORDER BY` clause. We can also group results with `GROUP BY`. We'll discuss grouping more in the next section on aggregations, but `GROUP BY` can be used to eliminate duplicates, like `DISTINCT`:

In [29]:
%%sql
SELECT
    fullname,
    states
FROM nps_public_data.parks
ORDER BY fullname DESC
LIMIT 5

fullName,states
Zion National Park,UT
Yukon - Charley Rivers National Preserve,AK
Yucca House National Monument,CO
Yosemite National Park,CA
Yorktown Battlefield Part of Colonial National Historical Park,VA


In [32]:
%%sql
SELECT
    DISTINCT states
FROM nps_public_data.parks
LIMIT 5

states
WI
CO
OR
AK
"MD,VA"
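
For comparison, here is a minimal sketch showing that grouping without an aggregate collapses duplicates just like `DISTINCT`. It runs on a few hand-picked rows using Python's built-in `sqlite3` module as a stand-in for DuckDB; the toy table is an assumption made for illustration.

```python
import sqlite3

# Assumption: small toy parks table, SQLite in place of DuckDB.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE parks (fullname TEXT, states TEXT);
    INSERT INTO parks VALUES
        ('Arches National Park', 'UT'),
        ('Zion National Park', 'UT'),
        ('Yosemite National Park', 'CA'),
        ('Yellowstone National Park', 'ID,MT,WY');
""")

distinct_rows = con.execute("""
    SELECT DISTINCT states
    FROM parks
    ORDER BY states
""").fetchall()

grouped_rows = con.execute("""
    SELECT states
    FROM parks
    GROUP BY states
    ORDER BY states
""").fetchall()

print(distinct_rows)
print(grouped_rows)
```

Both queries return one row per unique `states` value. Note that `'ID,MT,WY'` counts as its own value, since the column holds the raw comma-separated string rather than a list.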
