# 08 Add User Defined Fields

## Overview
In this lesson we will clone a small example TEEHR Evaluation from S3 and work through the different ways to add user defined fields (columns) to the `joined_timeseries` table. The new columns can either be written to the `joined_timeseries` table on disk so that they persist, or be added temporarily just before calculating metrics. The first few steps for creating and cloning the TEEHR Evaluation should look familiar if you worked through the previous examples.

### Create a new Evaluation
First we will import TEEHR along with some other required libraries for this example.  Then we create a new instance of the Evaluation that points to a directory where the evaluation data will be stored.

In [1]:
import teehr
from pathlib import Path
import shutil

# Tell Bokeh to output plots in the notebook
from bokeh.io import output_notebook
output_notebook()

In [2]:
# Define the directory where the Evaluation will be created
test_eval_dir = Path(Path.home(), "temp", "08_add_udfs")
shutil.rmtree(test_eval_dir, ignore_errors=True)

# Create an Evaluation object and create the directory
ev = teehr.Evaluation(dir_path=test_eval_dir, create_dir=True)

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/12/15 23:08:29 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


### Clone Evaluation Data from S3
As mentioned above, for this exercise we will clone a complete Evaluation dataset from the TEEHR S3 bucket.  First we will list the available Evaluations, and then we will clone the `e0_2_location_example` Evaluation, a small example that contains only 2 gages.

In [3]:
# List the evaluations in the S3 bucket
ev.list_s3_evaluations()

|   | name | description | url |
|---|---|---|---|
| 0 | e0_2_location_example | Example evaluation datsets with 2 USGS gages | s3a://ciroh-rti-public-data/teehr-data-warehou... |
| 1 | e1_camels_daily_streamflow | Daily average streamflow at ther Camels basins | s3a://ciroh-rti-public-data/teehr-data-warehou... |
| 2 | e2_camels_hourly_streamflow | Hourly instantaneous streamflow at ther Camels... | s3a://ciroh-rti-public-data/teehr-data-warehou... |
| 3 | e3_usgs_hourly_streamflow | Hourly instantaneous streamflow at USGS CONUS ... | s3a://ciroh-rti-public-data/teehr-data-warehou... |


In [4]:
# Clone the e0_2_location_example evaluation from the S3 bucket
ev.clone_from_s3("e0_2_location_example")

24/12/15 23:08:31 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
24/12/15 23:08:46 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
                                                                                

Now we will get right to it.  Let's start by looking at the data in the `joined_timeseries` table.

In [5]:
ev.joined_timeseries.to_pandas().head()

|   | reference_time | value_time | primary_location_id | secondary_location_id | primary_value | secondary_value | unit_name | location_id | frac_snow | frac_urban | ... | q_mean | baseflow_index | river_forecast_center | month | year | water_year | primary_normalized_flow | secondary_normalized_flow | configuration_name | variable_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaT | 2000-10-01 00:00:00 | usgs-14316700 | nwm30-23894572 | 1.132674 | 0.38 | m^3/s | usgs-14316700 | 0.176336580742005 | 0.0 | ... | 19.90923953325952 | 0.508616082222394 | NWRFC | 10 | 2000 | 2001 | 0.001927 | 0.000646 | nwm30_retrospective | streamflow_hourly_inst |
| 1 | NaT | 2000-10-01 00:00:00 | usgs-14138800 | nwm30-23736071 | 3.341388 | 0.06 | m^3/s | usgs-14138800 | 0.317266212149897 | 0.0 | ... | 1.5975415858787263 | 0.457869583655904 | NWRFC | 10 | 2000 | 2001 | 0.157613 | 0.00283 | nwm30_retrospective | streamflow_hourly_inst |
| 2 | NaT | 2000-10-01 01:00:00 | usgs-14316700 | nwm30-23894572 | 1.132674 | 0.38 | m^3/s | usgs-14316700 | 0.176336580742005 | 0.0 | ... | 19.90923953325952 | 0.508616082222394 | NWRFC | 10 | 2000 | 2001 | 0.001927 | 0.000646 | nwm30_retrospective | streamflow_hourly_inst |
| 3 | NaT | 2000-10-01 01:00:00 | usgs-14138800 | nwm30-23736071 | 3.992675 | 0.06 | m^3/s | usgs-14138800 | 0.317266212149897 | 0.0 | ... | 1.5975415858787263 | 0.457869583655904 | NWRFC | 10 | 2000 | 2001 | 0.188334 | 0.00283 | nwm30_retrospective | streamflow_hourly_inst |
| 4 | NaT | 2000-10-01 02:00:00 | usgs-14316700 | nwm30-23894572 | 1.132674 | 0.38 | m^3/s | usgs-14316700 | 0.176336580742005 | 0.0 | ... | 19.90923953325952 | 0.508616082222394 | NWRFC | 10 | 2000 | 2001 | 0.001927 | 0.000646 | nwm30_retrospective | streamflow_hourly_inst |


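Let's also look at the `secondary_timeseries` table. The cell below reads it into a Spark DataFrame, validates it with `add_missing_columns=True`, writes the validated data back to the table, and displays the result. Note that `_validate()` and `_write_spark_df()` are private methods (leading underscore), so they may change between TEEHR versions.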

In [6]:
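# Read the secondary timeseries into a Spark DataFrame
sdf = ev.secondary_timeseries.to_sdf()
# Validate the data, adding any columns that are missing
validated_sdf = ev.secondary_timeseries._validate(sdf, add_missing_columns=True)
# Write the validated data back to the table, then re-read and display it
ev.secondary_timeseries._write_spark_df(validated_sdf)
sdf = ev.secondary_timeseries.to_sdf()
sdf.show()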


+--------------+-------------------+-----+---------+--------------+------+-------------------+--------------------+
|reference_time|         value_time|value|unit_name|   location_id|member| configuration_name|       variable_name|
+--------------+-------------------+-----+---------+--------------+------+-------------------+--------------------+
|          NULL|2000-10-01 00:00:00| 0.38|    m^3/s|nwm30-23894572|  NULL|nwm30_retrospective|streamflow_hourly...|
|          NULL|2000-10-01 00:00:00| 0.06|    m^3/s|nwm30-23736071|  NULL|nwm30_retrospective|streamflow_hourly...|
|          NULL|2000-10-01 01:00:00| 0.38|    m^3/s|nwm30-23894572|  NULL|nwm30_retrospective|streamflow_hourly...|
|          NULL|2000-10-01 01:00:00| 0.06|    m^3/s|nwm30-23736071|  NULL|nwm30_retrospective|streamflow_hourly...|
|          NULL|2000-10-01 02:00:00| 0.38|    m^3/s|nwm30-23894572|  NULL|nwm30_retrospective|streamflow_hourly...|
|          NULL|2000-10-01 02:00:00| 0.06|    m^3/s|nwm30-23736071|  NUL

In [7]:
# Re-create the joined_timeseries table, including the location attributes
ev.joined_timeseries.create(add_attrs=True)

                                                                                

In [8]:
sdf = ev.joined_timeseries.to_sdf()
sdf.show()

+--------------+-------------------+-------------------+---------------------+-------------+---------------+---------+------+-------------+-----------------+----------+-----------------+----------+-------------+----------------+-----------------+-----------+------------------+-------------------+-----------+-------------------+------------+----------------+----------------+--------------+--------------------+-----------+---------+-----------------+--------------------+-----------------+------------------+-----------------+---------------------+-------------------+--------------------+
|reference_time|         value_time|primary_location_id|secondary_location_id|primary_value|secondary_value|unit_name|member|  location_id|        frac_snow|frac_urban|    soil_porosity|slope_mean|drainage_area|          p_mean|          aridity|zero_q_freq|     p_seasonality|                 q5|high_q_freq|dom_land_cover_frac|stream_order|        pet_mean|       slope_fdc|high_prec_freq|        ecoregion

In [9]:
# Import the two groups of UDF classes under short aliases
from teehr import RowLevelUDF as rlu
from teehr import TimeseriesAwareUDF as tau

In [None]:
# Add row-level UDF columns to a DataFrame in memory; nothing is written to disk here
sdf = ev.joined_timeseries.add_udf_columns([
    rlu.Month(),
    rlu.Year(),
    rlu.WaterYear(),
    rlu.NormalizedFlow(),
    rlu.Seasons()
]).to_sdf()
sdf.show()
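
To make the idea of a row-level field concrete, here is a minimal plain-Python sketch of the kind of calculation such a UDF performs: every output value depends only on the `value_time` of its own row. This is not TEEHR's implementation. The function name `row_level_fields`, the `season` field name, the month-to-season mapping, and the October 1 water year convention are all assumptions made for illustration.

```python
from datetime import datetime

# Assumed season definition (meteorological seasons); TEEHR's default may differ.
SEASON_BY_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "fall", 10: "fall", 11: "fall",
}


def row_level_fields(value_time: datetime) -> dict:
    """Derive calendar fields from a single row's value_time (hypothetical helper)."""
    # Assumed convention: the water year starts on October 1 and is
    # labeled with the calendar year in which it ends.
    if value_time.month >= 10:
        water_year = value_time.year + 1
    else:
        water_year = value_time.year
    return {
        "month": value_time.month,
        "year": value_time.year,
        "water_year": water_year,
        "season": SEASON_BY_MONTH[value_time.month],
    }


fields = row_level_fields(datetime(2000, 10, 1, 0, 0))
print(fields)
```

For a `value_time` of 2000-10-01 this gives month 10, year 2000, and water year 2001, which agrees with the `month`, `year`, and `water_year` values in the `joined_timeseries` preview shown earlier.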

In [None]:
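# Re-read the table to check which columns are currently stored
ev.joined_timeseries.to_sdf().show()

In [None]:
# Add the same row-level UDF columns and write them to the joined_timeseries table so they persist
ev.joined_timeseries.add_udf_columns([
    rlu.Month(),
    rlu.Year(),
    rlu.WaterYear(),
    rlu.NormalizedFlow(),
    rlu.Seasons()
]).write()

In [None]:
# Re-read the table to confirm the new columns were saved
ev.joined_timeseries.to_sdf().show()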

In [None]:
# Add columns from a timeseries-aware UDF to a DataFrame in memory
sdf = ev.joined_timeseries.add_udf_columns([
    tau.PercentileEventDetection()
]).to_sdf()
sdf.show()
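
Unlike a row-level field, a timeseries-aware field needs the whole series: a threshold has to be computed from all values before any single row can be flagged. The sketch below illustrates percentile-based event detection in plain Python under stated assumptions. It is not TEEHR's implementation; the helper names `quantile` and `detect_events`, the integer event IDs, and the choice of `q` are hypothetical, and only the `event` and `event_id` column names come from this notebook.

```python
def quantile(values, q):
    """Quantile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac


def detect_events(values, q):
    """Flag values above the q-th quantile and number each contiguous run."""
    threshold = quantile(values, q)
    events = [v > threshold for v in values]
    event_ids = []
    current_id = 0
    previous = False
    for flag in events:
        # A new event starts whenever a flagged value follows an unflagged one.
        if flag and not previous:
            current_id += 1
        event_ids.append(current_id if flag else None)
        previous = flag
    return events, event_ids


# Hypothetical flow values for one location, ordered by time
flows = [1, 1, 5, 6, 1, 1, 7, 1, 1, 1]
event, event_id = detect_events(flows, q=0.6)
print(event)
print(event_id)
```

With these values the threshold is 1, so the series contains two events: one covering the values 5 and 6, and one covering the single value 7.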

In [None]:
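# Re-read the table to check which columns are currently stored
ev.joined_timeseries.to_sdf().show()

In [None]:
# Add the event detection columns and write them to the joined_timeseries table so they persist
ev.joined_timeseries.add_udf_columns([
    tau.PercentileEventDetection()
]).write()

In [None]:
# Re-read the table to confirm the new columns were saved
ev.joined_timeseries.to_sdf().show()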

In [None]:
# Importing hvplot.pandas registers the .hvplot accessor on pandas DataFrames
import hvplot.pandas  # noqa

In [None]:
# Select only the rows flagged as events at a single location
pdf = ev.joined_timeseries.filter([
    "primary_location_id = 'usgs-14138800'",
    "event = true",
]).to_pandas()

In [None]:
# Plot the primary values over time, colored by event_id
primary_plot = pdf.hvplot.points(x="value_time", y="primary_value", color="event_id")

In [None]:
# Set the plot size and display the plot
primary_plot.opts(width=1200, height=400)

### Metrics
The `event` and `event_id` columns can be used like any other field in `filters` and `group_by` when querying metrics. The first query below calculates the maximum primary and secondary values for each event at one location. The second chains another `query()` onto the first to calculate the relative bias between the event maxima. The third adds the event detection UDF temporarily, just before the metrics are calculated, instead of relying on the columns saved to the `joined_timeseries` table.

In [None]:
(
    ev.metrics
    .query(
        group_by=["configuration_name", "primary_location_id", "event_id"],
        filters=[
            "primary_location_id = 'usgs-14138800'",
            "event = true",
        ],
        include_metrics=[
            teehr.Metrics.Maximum(
                input_field_names=["primary_value"],
                output_field_name="max_primary_value"
            ),
            teehr.Metrics.Maximum(
                input_field_names=["secondary_value"],
                output_field_name="max_secondary_value"
            )
        ]
    )
    .to_pandas()
)

In [None]:
(
    ev.metrics
    .query(
        group_by=["configuration_name", "primary_location_id", "event_id"],
        filters=[
            "primary_location_id = 'usgs-14138800'",
            "event = true",
        ],
        include_metrics=[
            teehr.Metrics.Maximum(
                input_field_names=["primary_value"],
                output_field_name="max_primary_value"
            ),
            teehr.Metrics.Maximum(
                input_field_names=["secondary_value"],
                output_field_name="max_secondary_value"
            )
        ]
    )
    .query(
        group_by=["configuration_name", "primary_location_id"],
        include_metrics=[
            teehr.Metrics.RelativeBias(
                input_field_names=["max_primary_value", "max_secondary_value"],
                output_field_name="event_max_relative_bias"
            )
        ]
    )
    .to_pandas()
)
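
The chained query above is a two-step aggregation: first reduce each event to its maximum values, then compute one metric across those per-event maxima. The plain-Python sketch below mirrors that logic on a few hypothetical rows. It is not TEEHR's implementation, and the relative bias formula used here, `sum(secondary - primary) / sum(primary)`, is an assumption about how the metric is defined.

```python
from collections import defaultdict

# Hypothetical rows: (event_id, primary_value, secondary_value)
rows = [
    ("event-1", 2.0, 1.0),
    ("event-1", 4.0, 3.0),
    ("event-2", 6.0, 6.0),
]

# Step 1: group by event and keep the maximum of each value column.
event_max = defaultdict(lambda: [float("-inf"), float("-inf")])
for event_id, primary, secondary in rows:
    event_max[event_id][0] = max(event_max[event_id][0], primary)
    event_max[event_id][1] = max(event_max[event_id][1], secondary)


# Step 2: relative bias across the per-event maxima (assumed formula).
def relative_bias(primary, secondary):
    return sum(s - p for p, s in zip(primary, secondary)) / sum(primary)


max_primary = [maxima[0] for maxima in event_max.values()]
max_secondary = [maxima[1] for maxima in event_max.values()]
event_max_relative_bias = relative_bias(max_primary, max_secondary)
print(event_max_relative_bias)  # -0.1
```

Here the event maxima are (4.0, 3.0) and (6.0, 6.0), so the result is (-1.0 + 0.0) / 10.0 = -0.1: the secondary peaks are on average 10% lower than the primary peaks.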

In [None]:
(
    ev.metrics
    # Add the PercentileEventDetection UDF to identify events greater than the 90th percentile.
    .add_udf_columns([
        tau.PercentileEventDetection(
            quantile=0.90
        )
    ])
    # First query to calculate the maximum primary and secondary values for each event.
    .query(
        group_by=["configuration_name", "primary_location_id", "event_id"],
        filters=[
            "primary_location_id = 'usgs-14138800'",
            "event = true",
        ],
        include_metrics=[
            teehr.Metrics.Maximum(
                input_field_names=["primary_value"],
                output_field_name="max_primary_value"
            ),
            teehr.Metrics.Maximum(
                input_field_names=["secondary_value"],
                output_field_name="max_secondary_value"
            )
        ]
    )
    # Second query to calculate the relative bias between the maximum primary and secondary values.
    .query(
        group_by=["configuration_name", "primary_location_id"],
        include_metrics=[
            teehr.Metrics.RelativeBias(
                input_field_names=["max_primary_value", "max_secondary_value"],
                output_field_name="event_90th_max_relative_bias"
            )
        ]
    )
    # Convert the metrics to a pandas DataFrame
    .to_pandas()
)

                                                                                

|   | configuration_name | primary_location_id | event_90th_max_relative_bias |
|---|---|---|---|
| 0 | nwm30_retrospective | usgs-14138800 | -0.087357 |


In [None]:
# Stop the Spark session now that we are done
ev.spark.stop()