# 03 - DuckDB Queries
With the timeseries, geospatial, crosswalk, and attribute table data stored in Parquet files, we can use DuckDB to explore the data with SQL.

NOTE: The example code provided is not safe from SQL injection, because it interpolates values directly into the SQL strings. It is provided only to demonstrate the power of DuckDB to query data stored in Parquet files.

In [None]:
import duckdb
from pathlib import Path

In [None]:
CACHE_DIR = Path(Path.home(), "shared", "rti-eval")
STUDY_DIR = Path(CACHE_DIR, "post-event-example")
USGS = Path(STUDY_DIR, "timeseries/usgs/*.parquet")
MEDIUM_RANGE_MEM1 = Path(STUDY_DIR, "timeseries/medium_range_mem1/*.parquet")
SHORT_RANGE = Path(STUDY_DIR, "timeseries/short_range/*.parquet")
CROSSWALK = Path(STUDY_DIR, "geo/usgs_nwm22_crosswalk.parquet")
GEOMETRY = Path(STUDY_DIR, "geo/usgs_geometry.parquet")

We will write a few simple SQL queries to demonstrate how DuckDB can be used to query the data we stored in the Parquet files.

First, let's see how many rows are in one of the timeseries "tables".

In [None]:
duckdb.query(f"SELECT count(*) FROM read_parquet('{SHORT_RANGE}');")

Now let's see how many unique `reference_time` values are in the timeseries table.

In [None]:
duckdb.query(f"SELECT count(DISTINCT reference_time) as reference_time_count FROM read_parquet('{SHORT_RANGE}');")

View a single row from the timeseries data.

In [None]:
duckdb.query(f"""
    SELECT * FROM read_parquet('{SHORT_RANGE}') LIMIT 1;
""")

Now, to show the power of this approach, let's join the observed USGS data to the short range forecast data and select a single forecast.

In [None]:
duckdb.query(f"""
    SELECT 
        u.location_id as primary_location_id,
        sr.location_id as secondary_location_id,
        sr.reference_time as reference_time,
        u.value_time as value_time,
        u.value as primary_value,
        sr.value as secondary_value,
    FROM read_parquet('{SHORT_RANGE}') sr
    JOIN read_parquet('{CROSSWALK}') cw 
        ON sr.location_id = cw.secondary_location_id 
    JOIN read_parquet('{USGS}') u 
        ON cw.primary_location_id = u.location_id
        AND sr.value_time = u.value_time
    WHERE
        primary_location_id = 'usgs-10336676' AND
        sr.reference_time = '2023-01-02 16:00:00' AND
        primary_value > 0
    ORDER BY 
        sr.value_time DESC
    LIMIT 10;
""").to_df()

This starts to get tricky as the queries get more complex. This is where the TEEHR queries library comes in.