In [None]:
import duckdb

# Load SQL extension
%load_ext sql

# Initialize 🦆 DuckDB connection
conn = duckdb.connect()

# Import database
%sql conn --alias duckdb
%sql IMPORT DATABASE '../../data/nps';

In our previous examples, we used `WHERE` to filter queries, but we can also do so in `JOIN`s. 

However, we need to be _very_ careful with how joins work.

In [None]:
%%sql
SELECT
    p.name,
    vc.name as visitor_center_name
FROM nps_public_data.parks p
LEFT JOIN nps_public_data.visitorcenters vc
    ON p.parkcode = vc.parkcode
WHERE 1 = 1
-- Filter base query (parks) for national monument
    AND p.designation = 'National Monument'
-- Filter JOIN (!) for passport stamp locations.
-- what will happen to parks without visitor centers?
    AND vc.ispassportstamplocation
LIMIT 1

How many rows come back with and without the `LEFT JOIN`? (Remove the `LIMIT 1` to see the full result.) What does that say about the number of parks we're querying? Why do you think that is? An `INNER JOIN` returns the same rows as a `LEFT JOIN` followed by an `IS NOT NULL` filter on the right-hand table's join key. Why is that?

We can compare the two approaches with a pair of CTEs and a `UNION ALL`. If they are equivalent, both counts should match.

In [None]:
%%sql
WITH filter_in_join AS (
    SELECT
        p.name,
        vc.name as visitor_center_name
    FROM nps_public_data.parks p
    INNER JOIN nps_public_data.visitorcenters vc
        ON p.parkcode = vc.parkcode
), filter_in_where AS (
    SELECT
        p.name,
        vc.name as visitor_center_name
    FROM nps_public_data.parks p
    LEFT JOIN nps_public_data.visitorcenters vc
        ON p.parkcode = vc.parkcode
    WHERE vc.parkcode IS NOT NULL
)
SELECT
    COUNT(*) as ct
FROM filter_in_join

UNION ALL

SELECT
    COUNT(*) as ct
FROM filter_in_where
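
To make the join behavior concrete, here is a minimal sketch that uses Python's built-in `sqlite3` module rather than DuckDB, so it runs without the NPS database. The three parks and two visitor centers are made-up rows, not NPS data. The join semantics shown are standard SQL, so the same counts would be expected in DuckDB.

```python
import sqlite3

# Toy stand-ins for the parks and visitorcenters tables (hypothetical rows).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parks (parkcode TEXT, name TEXT);
    CREATE TABLE visitorcenters (
        parkcode TEXT, name TEXT, ispassportstamplocation INTEGER
    );
    INSERT INTO parks VALUES
        ('aaa', 'Park A'), ('bbb', 'Park B'), ('ccc', 'Park C');
    INSERT INTO visitorcenters VALUES
        ('aaa', 'A Visitor Center', 1),
        ('bbb', 'B Visitor Center', 0);
""")


def count(sql):
    """Run a COUNT(*) query and return the single number it produces."""
    return conn.execute(sql).fetchone()[0]


base = "FROM parks p LEFT JOIN visitorcenters vc ON p.parkcode = vc.parkcode"

# Plain LEFT JOIN: every park survives, Park C with NULL visitor center columns.
left_join_rows = count(f"SELECT COUNT(*) {base}")

# Filter on the right-hand table in WHERE: NULL rows fail the test and vanish.
filter_in_where = count(f"SELECT COUNT(*) {base} WHERE vc.ispassportstamplocation")

# Same filter moved into the ON clause: unmatched parks are kept with NULLs.
filter_in_on = count(f"SELECT COUNT(*) {base} AND vc.ispassportstamplocation")

# INNER JOIN versus LEFT JOIN plus IS NOT NULL on the join key.
inner_join_rows = count(
    "SELECT COUNT(*) FROM parks p "
    "INNER JOIN visitorcenters vc ON p.parkcode = vc.parkcode"
)
left_not_null = count(f"SELECT COUNT(*) {base} WHERE vc.parkcode IS NOT NULL")

print(left_join_rows, filter_in_where, filter_in_on)  # 3 1 3
print(inner_join_rows, left_not_null)                 # 2 2
```

The takeaway is that a `WHERE` condition on a column from the right-hand table quietly removes the parks that had no match, which is exactly what an `INNER JOIN` does.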


Some common ways of filtering data include:

1. Comparisons (`>`, `<`, `=`)
2. `BETWEEN`
3. `IN`
4. `IS NULL`
5. `LIKE`, `ILIKE`, and regular expressions

Comparisons and `BETWEEN` work well for numbers, as well as timestamps and dates (as we'll see). `IN` is helpful for matching against a list of values, while `IS NULL` helps when `NULL` values are a possibility.

`LIKE`, `ILIKE`, and regular expressions are all useful when pattern matching is at play.

We can filter numbers and dates with comparisons or `BETWEEN`:

In [None]:
%%sql
SELECT
    title,
    parkfullname,
    category,
    isfree,
    description
FROM nps_public_data.events e
WHERE 1 = 1
    AND recurrencedatestart > '2024-01-01'
    AND recurrencedatestart < '2024-01-23'
ORDER BY RANDOM()
LIMIT 2


In [None]:
%%sql
SELECT
    title,
    parkfullname,
    category,
    isfree,
    description
FROM nps_public_data.events e
WHERE 1 = 1
    AND recurrencedatestart BETWEEN '2024-01-01' AND '2024-01-23'
ORDER BY RANDOM()
LIMIT 2

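What's the difference? `BETWEEN` is _inclusive_: it keeps rows that fall exactly on either bound, while `>` and `<` exclude them. We can confirm this by counting the rows each filter returns: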

In [None]:
%%sql
SELECT
    'between' as f,
    COUNT(*) as ct
FROM nps_public_data.events e
WHERE 1 = 1
    AND recurrencedatestart BETWEEN '2024-01-01' AND '2024-01-23'
GROUP BY f

UNION ALL

SELECT
    'greater than' as f,
    COUNT(*) as ct
FROM nps_public_data.events e
WHERE 1 = 1
    AND recurrencedatestart > '2024-01-01'
    AND recurrencedatestart < '2024-01-23'
GROUP BY f
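
The same comparison can be reproduced on a tiny table where the answer is easy to check by hand. This is a sketch using Python's built-in `sqlite3` with four invented events, not the NPS data. Dates are stored here as ISO-formatted strings, which sort correctly as text, whereas DuckDB would use a real date or timestamp type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (title TEXT, recurrencedatestart TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [
        ("On the lower bound", "2024-01-01"),
        ("Inside the range", "2024-01-10"),
        ("On the upper bound", "2024-01-23"),
        ("Outside the range", "2024-02-01"),
    ],
)


def count_where(condition):
    """Count the events that satisfy the given WHERE condition."""
    sql = f"SELECT COUNT(*) FROM events WHERE {condition}"
    return conn.execute(sql).fetchone()[0]


# BETWEEN keeps both endpoints.
between_ct = count_where(
    "recurrencedatestart BETWEEN '2024-01-01' AND '2024-01-23'"
)

# Strict comparisons drop rows that sit exactly on a bound.
strict_ct = count_where(
    "recurrencedatestart > '2024-01-01' AND recurrencedatestart < '2024-01-23'"
)

# >= and <= are the comparison-operator equivalent of BETWEEN.
inclusive_ct = count_where(
    "recurrencedatestart >= '2024-01-01' AND recurrencedatestart <= '2024-01-23'"
)

print(between_ct, strict_ct, inclusive_ct)  # 3 1 3
```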

Of course, we can also nest logic for multiple timeframes:

In [None]:
%%sql
SELECT
    title,
    parkfullname,
    category,
    isfree,
    description
FROM nps_public_data.events e
WHERE 1 = 1
    -- Fetch events with dates in January _or_ March
    AND (
            (recurrencedatestart BETWEEN '2024-01-01' AND '2024-01-31') OR
            (recurrencedatestart BETWEEN '2024-03-01' AND '2024-03-31')
    ) 
ORDER BY RANDOM()
LIMIT 2
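
The parentheses around the `OR` group start to matter as soon as another condition is added, because `AND` binds more tightly than `OR`. The sketch below illustrates this with Python's built-in `sqlite3` and five invented events. The extra `isfree = 1` filter is an assumption added for illustration and is not part of the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (title TEXT, isfree INTEGER, recurrencedatestart TEXT)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("Free January walk", 1, "2024-01-15"),
        ("Paid January tour", 0, "2024-01-20"),
        ("Free February hike", 1, "2024-02-10"),
        ("Free March talk", 1, "2024-03-05"),
        ("Paid March tour", 0, "2024-03-10"),
    ],
)

january = "recurrencedatestart BETWEEN '2024-01-01' AND '2024-01-31'"
march = "recurrencedatestart BETWEEN '2024-03-01' AND '2024-03-31'"


def titles_where(condition):
    """Return the titles of matching events in alphabetical order."""
    sql = f"SELECT title FROM events WHERE {condition} ORDER BY title"
    return [row[0] for row in conn.execute(sql)]


# Grouped: free events that fall in January or March.
grouped = titles_where(f"isfree = 1 AND ({january} OR {march})")

# Ungrouped: parsed as (isfree = 1 AND january) OR march,
# so paid March events slip through.
ungrouped = titles_where(f"isfree = 1 AND {january} OR {march}")

print(grouped)    # ['Free January walk', 'Free March talk']
print(ungrouped)  # ['Free January walk', 'Free March talk', 'Paid March tour']
```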

Another handy way to filter datasets is string matching. If you're familiar with Python, you probably know regex, but SQL has a few simpler options. First, `LIKE`:

In [None]:
%%sql
SELECT
    title,
    parkfullname,
    category,
    isfree,
    description
FROM nps_public_data.events e
WHERE 1 = 1
    AND title LIKE '%Stroll%'
LIMIT 5

But `LIKE` is case-sensitive, so it's easy to miss results:

In [None]:
%%sql
SELECT
    title,
    parkfullname,
    category,
    isfree,
    description
FROM nps_public_data.events e
WHERE 1 = 1
    AND title LIKE '%hike%'
LIMIT 5

Instead, we can use `ILIKE`, which is case-insensitive:

In [None]:
%%sql
SELECT
    title,
    parkfullname,
    category,
    isfree,
    description
FROM nps_public_data.events e
WHERE 1 = 1
    AND title ILIKE '%hike%'
LIMIT 5

`LIKE` is also great for cleaning up messy columns:

In [None]:
%%sql 
SELECT 
    name,
    managedByOrganization
FROM nps_public_data.parkinglots
LIMIT 10

In [None]:
%%sql 
SELECT 
    CASE WHEN name ILIKE '%visitor%' THEN 'Visitor Center'
         WHEN name ILIKE '%parking%' THEN 'Parking Lot'
         ELSE 'Other'
    END as type,
    IF(managedByOrganization ILIKE '%NPS%', 'National Park Service', managedByOrganization) as managed_by
FROM nps_public_data.parkinglots
LIMIT 10

Depending on your flavor of SQL, there may be other ways to pattern match. DuckDB also supports `glob` and `regex` matching. Those are outside the scope of this course, but you can read more [here](https://duckdb.org/docs/sql/functions/patternmatching.html).

In [None]:
%%sql
SELECT * FROM nps_public_data.states LIMIT 3

Sometimes, we might need to construct a list to perform a more robust filter. We can use `split` to turn the comma-separated `states` field in `parks` into a list, casting the result to a list of strings. Then we can query the list precisely, element by element.

In this course, we'll challenge you to think critically about the structure of your data and how you can manipulate it to achieve a desired outcome.

In [None]:
%%sql
-- Which parks are fully or partially in Utah?
WITH park_states AS (
    SELECT 
        fullname,
        states AS states_string, 
        split(states, ',') ::string[] AS states_list
    FROM nps_public_data.parks p
    )
SELECT 
    * 
FROM park_states
WHERE list_contains(states_list, 'UT')
LIMIT 5

This allows for some nifty DuckDB queries about cross-border parks:

In [None]:
%%sql
-- Which parks are both in Utah and Wyoming?
WITH park_states AS (
    SELECT 
        fullname,
        split(states, ',') ::string[] AS states_list
    FROM nps_public_data.parks p
    )
SELECT 
    * 
FROM park_states
WHERE list_has_all(states_list, ['UT', 'WY'])

In [None]:
%%sql
-- Which parks are in Utah and/or Wyoming?
WITH park_states AS (
    SELECT 
        fullname,
        split(states, ',') ::string[] AS states_list
    FROM nps_public_data.parks p
    )
SELECT 
    * 
FROM park_states
WHERE list_has_any(states_list, ['UT', 'WY'])
LIMIT 5
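
If the three list functions blur together, their logic is easy to mirror in plain Python. The functions below are stand-ins written to share the DuckDB names, not the DuckDB functions themselves. The park names and state strings are hypothetical and only assume the same comma-separated format as the `states` column.

```python
def list_contains(items, element):
    """True if the list holds the given element."""
    return element in items


def list_has_all(items, wanted):
    """True if every wanted element appears in the list."""
    return all(w in items for w in wanted)


def list_has_any(items, wanted):
    """True if at least one wanted element appears in the list."""
    return any(w in items for w in wanted)


# Hypothetical parks mapped to comma-separated state codes.
parks = {
    "Park One": "UT",
    "Park Two": "ID,MT,WY",
    "Park Three": "UT,WY",
    "Park Four": "CA",
}

# Equivalent of split(states, ','): one list of state codes per park.
park_states = {name: states.split(",") for name, states in parks.items()}

in_utah = [n for n, s in park_states.items() if list_contains(s, "UT")]
in_both = [n for n, s in park_states.items() if list_has_all(s, ["UT", "WY"])]
in_either = [n for n, s in park_states.items() if list_has_any(s, ["UT", "WY"])]

print(in_utah)    # ['Park One', 'Park Three']
print(in_both)    # ['Park Three']
print(in_either)  # ['Park One', 'Park Two', 'Park Three']
```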

We can also filter a column against a list of values using `IN`. This is handy for picking out several values at once:

In [None]:
%%sql
SELECT 
    fullname,
    states,
    description
FROM nps_public_data.parks p
WHERE name IN ('Arches', 'Bryce Canyon', 'Zion')

When we return rows, we can order the results with the `ORDER BY` clause. We can also group results with `GROUP BY`. We'll discuss grouping more in the next section on aggregations, but `GROUP BY` can also be used to eliminate duplicates, like `DISTINCT`.

In [None]:
%%sql
SELECT
    fullname,
    states
FROM nps_public_data.parks
ORDER BY fullname DESC
LIMIT 5

In [None]:
%%sql
SELECT
    DISTINCT states
FROM nps_public_data.parks
LIMIT 5
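
To see that `GROUP BY` and `DISTINCT` deduplicate in the same way, here is a small sketch using Python's built-in `sqlite3` and four invented parks (not NPS data). An `ORDER BY` is added so the two results can be compared directly, since neither clause guarantees an order on its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parks (fullname TEXT, states TEXT)")
conn.executemany(
    "INSERT INTO parks VALUES (?, ?)",
    [
        ("Park One", "UT"),
        ("Park Two", "UT"),
        ("Park Three", "WY"),
        ("Park Four", "CA"),
    ],
)

# DISTINCT collapses repeated values of the selected column.
distinct_states = [
    row[0]
    for row in conn.execute("SELECT DISTINCT states FROM parks ORDER BY states")
]

# GROUP BY with no aggregate yields one row per group: the same result.
grouped_states = [
    row[0]
    for row in conn.execute("SELECT states FROM parks GROUP BY states ORDER BY states")
]

print(distinct_states)  # ['CA', 'UT', 'WY']
print(grouped_states)   # ['CA', 'UT', 'WY']
```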

Voilà! That covers the basics of joins, comparisons, and filtering.