Joining Timeseries
==================

One of the first and most important steps in comparing simulated and observed timeseries data is to join
the two datasets together based on location and time (and potentially other fields).

In this example, we consider a comparison of National Water Model (NWM) v3.0 retrospective
streamflow simulations ("secondary") to USGS observed streamflow data ("primary") for a few
time steps at five different locations (gage stations).

We'll use an example Evaluation dataset for this tutorial.

In [None]:
from teehr import Evaluation

ev = Evaluation("joining_tutorial_data")

**Primary Timeseries**: The USGS observed streamflow data at all locations and times.

In [None]:
primary_df = ev.primary_timeseries.to_pandas()

In [None]:
primary_df

**Secondary Timeseries**: The NWM v3.0 retrospective streamflow simulations at all locations and times.

In [None]:
ev.secondary_timeseries.to_pandas()

**Location Crosswalks**: A mapping between the USGS and NWM location IDs.

In [None]:
ev.location_crosswalks.to_pandas()

**Location Geometry**: The point geometries of the USGS gage station locations.

In [None]:
ev.locations.to_geopandas()

**Attributes**: Additional information about each of the locations.

In [None]:
ev.location_attributes.to_pandas().head()

There are a number of physical attributes associated with each location:

In [None]:
ev.location_attributes.to_pandas().attribute_name.unique()

Ultimately, we want to combine all the data into a single table to facilitate efficient analysis and exploration based
on location, time, and potentially other attributes. For example, we could start to ask questions like:
"How does the NWM perform in primarily forested watersheds compared to primarily urban watersheds?"

First, we can join the primary and secondary timeseries by location and time without adding geometry or
attributes. This requires the crosswalk table to map the primary location IDs to the secondary location IDs. Because
the data may contain more than one variable and unit (e.g., temperature in degrees Celsius alongside streamflow),
we also need to match on the ``variable_name`` and ``measurement_unit`` fields during the join.

:::{figure} ../../images/tutorials/joining/nwm_usgs_ex_joining_snip.png
---
height: 400px
width: 900px
---

Joining the primary and secondary streamflow values by location, time, variable name, and measurement unit.
:::
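
To make the join logic concrete, here is a minimal pandas sketch of the same idea. It is not TEEHR's implementation: the miniature tables, the location IDs, and the column names (``location_id``, ``value_time``, ``value``, ``primary_location_id``, ``secondary_location_id``) are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical miniature tables; IDs and column names are illustrative assumptions.
times = pd.to_datetime(
    ["2000-10-01 00:00", "2000-10-01 01:00", "2000-10-01 00:00"]
)
primary = pd.DataFrame({
    "location_id": ["usgs-A", "usgs-A", "usgs-B"],
    "value_time": times,
    "variable_name": ["streamflow"] * 3,
    "measurement_unit": ["m^3/s"] * 3,
    "value": [1.0, 1.2, 5.0],
})
secondary = pd.DataFrame({
    "location_id": ["nwm-1", "nwm-1", "nwm-2"],
    "value_time": times,
    "variable_name": ["streamflow"] * 3,
    "measurement_unit": ["m^3/s"] * 3,
    "value": [0.9, 1.5, 4.0],
})
crosswalk = pd.DataFrame({
    "primary_location_id": ["usgs-A", "usgs-B"],
    "secondary_location_id": ["nwm-1", "nwm-2"],
})

# Step 1: attach the matching primary ID to every secondary row via the crosswalk.
secondary_xw = (
    secondary
    .merge(crosswalk, left_on="location_id", right_on="secondary_location_id")
    .rename(columns={"value": "secondary_value"})
    .drop(columns=["location_id"])
)

# Step 2: rename the primary columns so the join keys line up.
primary_renamed = primary.rename(
    columns={"location_id": "primary_location_id", "value": "primary_value"}
)

# Step 3: join on location, time, variable name, and measurement unit.
joined = primary_renamed.merge(
    secondary_xw,
    on=["primary_location_id", "value_time", "variable_name", "measurement_unit"],
    how="inner",
)
print(joined)
```

An inner join keeps only the rows where both an observation and a simulation exist for the same location, time, variable, and unit, which is what a paired comparison needs.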

We can also join the location geometry and attributes to the joined timeseries table.  This will allow us to
easily filter and group the data based on the location attributes, and to visualize the output.

To join the geometry, we can simply map each primary location ID in the joined timeseries table to the ID in the
geometry table, which in this case contains the point geometries of the USGS gage stations.

:::{figure} ../../images/tutorials/joining/nwm_usgs_ex_joining_geometry.png
---
height: 400px
width: 900px
---

Joining the geometry to the initial joined timeseries table.
:::

Finally, we can join additional, pre-calculated attributes to the table, which gives us more options for
filtering and grouping the data when calculating performance metrics.

:::{figure} ../../images/tutorials/joining/nwm_usgs_ex_joining_attributes.png
---
height: 400px
width: 900px
---

Joining the attributes to the initial joined timeseries table.
:::
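
The attribute join can be sketched in pandas as a pivot followed by a merge. This is only an illustration under assumed names: the attribute names (``drainage_area``, ``forest_fraction``), their values, and the column layout are hypothetical, and TEEHR's internal approach may differ.

```python
import pandas as pd

# Hypothetical attribute table in "long" form: one row per location and attribute.
attrs_long = pd.DataFrame({
    "location_id": ["usgs-A", "usgs-A", "usgs-B", "usgs-B"],
    "attribute_name": ["drainage_area", "forest_fraction"] * 2,
    "value": [120.0, 0.8, 45.0, 0.1],
})

# Pivot to "wide" form: one row per location, one column per attribute.
attrs_wide = attrs_long.pivot(
    index="location_id", columns="attribute_name", values="value"
).reset_index()
attrs_wide.columns.name = None  # drop the leftover "attribute_name" axis label

# Hypothetical joined timeseries table (only a few columns shown).
joined = pd.DataFrame({
    "primary_location_id": ["usgs-A", "usgs-A", "usgs-B"],
    "primary_value": [1.0, 1.2, 5.0],
    "secondary_value": [0.9, 1.5, 4.0],
})

# A left join keeps every timeseries row, even if a location has no attributes.
joined_attrs = joined.merge(
    attrs_wide,
    left_on="primary_location_id",
    right_on="location_id",
    how="left",
).drop(columns=["location_id"])
print(joined_attrs)
```

With each attribute in its own column, every timeseries row carries the attributes of its location, so filtering or grouping by an attribute becomes a simple column operation.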

TEEHR can automatically join the timeseries and location attributes, while simultaneously running user-defined functions on specified fields.

In [None]:
ev.joined_timeseries.create(execute_udf=True)

In [None]:
ev.joined_timeseries.to_pandas()

The geometry data is joined on the fly when a ``GeoDataFrame`` is requested.

In [None]:
ev.joined_timeseries.to_geopandas()

Now that the data is joined into a single table, we can easily filter and group by the available fields to calculate
performance metrics, such as the Nash-Sutcliffe Efficiency (NSE) or the Kling-Gupta Efficiency (KGE), and create visualizations.
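
As one illustration of a grouped metric, the sketch below computes NSE per location with plain pandas and NumPy rather than TEEHR's own metrics API. The helper ``nse``, the sample values, and the column names are assumptions made for this example.

```python
import numpy as np
import pandas as pd


def nse(obs, sim):
    """Nash-Sutcliffe Efficiency: 1 - sum((sim - obs)^2) / sum((obs - mean(obs))^2)."""
    obs = np.asarray(obs, dtype=float)
    sim = np.asarray(sim, dtype=float)
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)


# Hypothetical joined table: observed (primary) and simulated (secondary) values.
joined = pd.DataFrame({
    "primary_location_id": ["usgs-A"] * 3 + ["usgs-B"] * 3,
    "primary_value": [1.0, 2.0, 3.0, 4.0, 6.0, 8.0],
    "secondary_value": [1.0, 2.0, 3.0, 5.0, 5.0, 5.0],
})

# Group by location and compute one NSE value per group.
nse_by_location = {
    location: nse(group["primary_value"], group["secondary_value"])
    for location, group in joined.groupby("primary_location_id")
}
print(nse_by_location)
```

An NSE of 1 indicates a perfect match, while values below 0 mean the simulation performs worse than simply using the mean of the observations. The same pattern works for any other grouping field in the joined table, such as a location attribute.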