# 03 Introduction
In this notebook we will explore the Evaluation schema through the Evaluation class interface. To do so, we first need to create an Evaluation and populate it with data. There are many ways to do this, ranging from cloning a complete Evaluation from the TEEHR S3 bucket that already contains all the necessary data, to cloning a blank template and populating the tables using the built-in loading and fetching methods. In this exercise we will clone a complete Evaluation and explore the tables using the TEEHR Evaluation table subclasses.

### Create a new Evaluation
First we will import the TEEHR Evaluation class and create a new instance that points to a directory where the evaluation data will be stored.

In [1]:
from teehr import Evaluation
from pathlib import Path

# Define the directory where the Evaluation will be created
test_eval_dir = Path.home() / "temp" / "03_introduction"

# Create an Evaluation object and create the directory
ev = Evaluation(dir_path=test_eval_dir, create_dir=True)

# Enable logging
ev.enable_logging()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/11/05 14:43:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


### Clone Evaluation Data from S3
As mentioned above, for this exercise we will clone a complete Evaluation dataset from the TEEHR S3 bucket. First we will list the available Evaluations, and then we will clone the `p0_2_location_example` evaluation, which is a small example Evaluation that contains only 2 gages.

In [2]:
# List the evaluations in the S3 bucket
ev.list_s3_evaluations()

Unnamed: 0,name,description,url
0,p0_2_location_example,Example evaluation datsets with 2 USGS gages,s3a://ciroh-rti-public-data/teehr-data-warehou...
1,p1_camels_daily_streamflow,Daily average streamflow at ther Camels basins,s3a://ciroh-rti-public-data/teehr-data-warehou...
2,p2_camels_hourly_streamflow,Hourly instantaneous streamflow at ther Camels...,s3a://ciroh-rti-public-data/teehr-data-warehou...
3,p3_retro_hourly_streamflow,Hourly instantaneous streamflow at USGS CONUS ...,s3a://ciroh-rti-public-data/teehr-data-warehou...


In [3]:
# Clone the p0_2_location_example evaluation from the S3 bucket
ev.clone_from_s3("p0_2_location_example")

24/11/05 14:43:44 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
24/11/05 14:43:57 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
                                                                                

Now that we have cloned the `p0_2_location_example` evaluation, let's take a look at the data that was cloned from S3, specifically the `dataset` directory. The three data groups are stored in slightly different ways. The domain tables (units, variables, configurations, attributes) are stored as *.csv files, the location tables (locations, location_attributes, location_crosswalks) are stored as parquet files without hive partitioning, and the timeseries tables (primary_timeseries, secondary_timeseries, joined_timeseries) are stored as parquet files with hive partitioning.

In [17]:
# from teehr.evaluation.utils import print_tree
# print_tree(ev.dataset_dir, exclude_patterns=[".*", "_*"])
!tree $HOME/temp/03_introduction/dataset -I ".*|_*"

/Users/mdenno/temp/03_introduction/dataset
├── attributes
│   └── part-00000-ebc3fa5d-f94e-4444-b0eb-8f410aa4b973-c000.csv
├── configurations
│   └── part-00000-154140ef-341e-483d-9426-570acf5af026-c000.csv
├── joined_timeseries
│   └── configuration_name=nwm30_retrospective
│       └── variable_name=streamflow_hourly_inst
│           └── part-00000-d72c9af4-ad9d-40ed-803d-0fdbfee97ab4.c000.snappy.parquet
├── location_attributes
│   └── part-00000-b9f7fca2-1b91-40d1-a7ad-d9276802cbfc-c000.snappy.parquet
├── location_crosswalks
│   └── part-00000-74f877cd-cf86-40c9-b604-6b97fc693a2e-c000.snappy.parquet
├── locations
│   └── part-00000-5b1e2b40-0f17-4391-9349-314a41bb15db-c000.snappy.parquet
├── primary_timeseries
│   └── configuration_name=usgs_observations
│       └── variable_name=streamflow_ho
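
In a hive-partitioned table, the values of the partition columns are not stored only inside the parquet files; they are also encoded in the directory names as `column=value`. The following is a minimal sketch, using only the Python standard library, of how those values can be read back from a path. The helper name `parse_hive_partitions` is hypothetical and is not part of TEEHR; Spark performs this step itself when it reads the table.

```python
from pathlib import PurePosixPath


def parse_hive_partitions(relative_path: str) -> dict[str, str]:
    """Return the column=value pairs encoded in a hive-partitioned path.

    Hypothetical helper for illustration only; not part of TEEHR.
    """
    partitions = {}
    for part in PurePosixPath(relative_path).parts:
        # Partition directories are named "column_name=value".
        if "=" in part:
            key, _, value = part.partition("=")
            partitions[key] = value
    return partitions


# Path taken from the joined_timeseries branch of the cloned dataset.
example_path = (
    "joined_timeseries/"
    "configuration_name=nwm30_retrospective/"
    "variable_name=streamflow_hourly_inst/"
    "part-00000-d72c9af4-ad9d-40ed-803d-0fdbfee97ab4.c000.snappy.parquet"
)
partition_values = parse_hive_partitions(example_path)
print(partition_values)
```

One practical consequence of this layout is that a query restricted to a single `configuration_name` or `variable_name` only needs to read the files under the matching directories.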

### Table Classes
The TEEHR Evaluation class contains sub-classes that organize its methods into logical groups. One group is the "table" sub-classes, which contain methods for interacting with the data tables. Each table in the Evaluation dataset has a corresponding sub-class with the same name as the table.
```
ev.units
ev.attributes
ev.variables
ev.configurations
ev.locations
ev.location_attributes
ev.location_crosswalks
ev.primary_timeseries
ev.secondary_timeseries
ev.joined_timeseries
```
Each of the table sub-classes has methods to add or insert new data as well as methods to query the data. These are documented in the API documentation.

### Querying
The underlying query engine for TEEHR is PySpark. Each of the table sub-classes can return data as either a Spark DataFrame (using the `to_sdf()` method) or a Pandas DataFrame (using the `to_pandas()` method). The location data tables have an additional method, `to_geopandas()`, that returns a GeoPandas DataFrame in which the binary geometry column (WKB bytes) has been converted to a proper geometry column, displayed as WKT.

Note: PySpark is lazily evaluated, meaning that it does not actually run the query until the data is needed for display, plotting, etc. Therefore, if you just call the `to_sdf()` method, you do not get the data but rather a lazy Spark DataFrame that can be used with subsequent Spark operations. Here we show how to get the Spark DataFrame and display its data; the many other ways a lazy Spark DataFrame can be used in subsequent operations are beyond the scope of this document.
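
To make the idea of lazy evaluation concrete, here is a small pure-Python analogy. It is not how Spark is implemented, and the `LazyTable` class is hypothetical; it only illustrates that building a query records a plan, and that nothing runs until an "action" (here `collect()`, in Spark for example `show()`) asks for the data.

```python
class LazyTable:
    """Toy stand-in for a lazy DataFrame: records steps, runs them on demand."""

    def __init__(self, rows, steps=()):
        self._rows = rows
        self._steps = tuple(steps)

    def filter(self, predicate):
        # No rows are touched here; the step is only appended to the plan.
        return LazyTable(self._rows, self._steps + (("filter", predicate),))

    def order_by(self, column):
        # Ordering is also only recorded, not performed.
        return LazyTable(self._rows, self._steps + (("order_by", column),))

    def collect(self):
        # The "action": only now is the recorded plan executed.
        rows = list(self._rows)
        for name, arg in self._steps:
            if name == "filter":
                rows = [row for row in rows if arg(row)]
            else:
                rows = sorted(rows, key=lambda row: row[arg])
        return rows


calls = []  # records every time the predicate is actually evaluated


def is_steamboat(row):
    calls.append(row["id"])
    return row["id"] == "usgs-14316700"


locations = LazyTable([
    {"id": "usgs-14316700", "name": "STEAMBOAT CREEK NEAR GLIDE, OR"},
    {"id": "usgs-14138800", "name": "BLAZED ALDER CREEK NEAR RHODODENDRON, OR"},
])

query = locations.filter(is_steamboat)  # builds the plan only
calls_before_collect = len(calls)       # predicate has not run yet
result = query.collect()                # runs the plan
print(calls_before_collect, len(calls), result)
```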

In [27]:
# Query the locations and return as a lazy Spark DataFrame.
ev.locations.to_sdf()

DataFrame[id: string, name: string, geometry: binary]

In [28]:
# Query the locations and return as a Spark DataFrame but tell Spark to show the data.
ev.locations.to_sdf().show()

+-------------+--------------------+--------------------+
|           id|                name|            geometry|
+-------------+--------------------+--------------------+
|usgs-14316700|STEAMBOAT CREEK N...|[01 01 00 00 00 9...|
|usgs-14138800|BLAZED ALDER CREE...|[01 01 00 00 00 B...|
+-------------+--------------------+--------------------+



In [29]:
# Query the locations and return as a Pandas DataFrame.
# Note that the geometry column is shown as a byte string.
ev.locations.to_pandas()

Unnamed: 0,id,name,geometry
0,usgs-14316700,"STEAMBOAT CREEK NEAR GLIDE, OR",b'\x01\x01\x00\x00\x00\x9f\xcc?\xfa\xa6\xae^\x...
1,usgs-14138800,"BLAZED ALDER CREEK NEAR RHODODENDRON, OR",b'\x01\x01\x00\x00\x00\xb7\xday\xd1\ry^\xc0\x1...


In [30]:
# Query the locations and return as a GeoPandas DataFrame.
# Note that the geometry column is now a proper geometry column, displayed as WKT.
ev.locations.to_geopandas()

Unnamed: 0,id,name,geometry
0,usgs-14316700,"STEAMBOAT CREEK NEAR GLIDE, OR",POINT (-122.72894 43.34984)
1,usgs-14138800,"BLAZED ALDER CREEK NEAR RHODODENDRON, OR",POINT (-121.89147 45.45262)
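
The byte strings in the Pandas output begin with `01 01 00 00 00`, which is the header of a little-endian WKB (well-known binary) Point. As a rough illustration of what the conversion to a geometry column involves, the sketch below decodes a 2D WKB point with the standard library's `struct` module. The function name `wkb_point_to_wkt` is hypothetical, and the example bytes are rebuilt from the rounded coordinates displayed above rather than copied from the dataset; in practice GeoPandas and Shapely handle this for all geometry types.

```python
import struct


def wkb_point_to_wkt(wkb: bytes) -> str:
    """Decode a 2D WKB Point into a WKT string (illustration only)."""
    # Byte 0 is the byte-order flag: 1 = little-endian, 0 = big-endian.
    endian = "<" if wkb[0] == 1 else ">"
    # Bytes 1-4 hold the geometry type as an unsigned int; 1 means Point.
    (geom_type,) = struct.unpack(endian + "I", wkb[1:5])
    if geom_type != 1:
        raise ValueError(f"expected a WKB Point (type 1), got type {geom_type}")
    # Bytes 5-20 hold the x and y coordinates as two 8-byte doubles.
    x, y = struct.unpack(endian + "2d", wkb[5:21])
    return f"POINT ({x} {y})"


# Build a WKB point from the rounded Steamboat Creek coordinates shown above.
steamboat_wkb = struct.pack("<BI2d", 1, 1, -122.72894, 43.34984)
steamboat_wkt = wkb_point_to_wkt(steamboat_wkb)
print(steamboat_wkb[:5])
print(steamboat_wkt)
```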


### Filter and Order
Because the tables are lazily evaluated Spark DataFrames, we can filter and order the data before returning it as a Pandas or GeoPandas DataFrame. The filter methods take either a raw SQL string, a filter dictionary, or a FilterObject (for example `LocationFilter`) built with the `Operators` and field enumerations. Using a FilterObject with the `Operators` and field enumerations is probably not a common pattern for most users; it is used internally to validate filter arguments.

In [None]:
# Filter using a raw SQL string
ev.locations.filter("id = 'usgs-14316700'").to_geopandas()

INFO:teehr.evaluation.tables:Setting filter <class 'filter'>.
DEBUG:teehr.querying.filter_format:Filter id = 'usgs-14316700' is already string.  Applying as is.


Unnamed: 0,id,name,geometry
0,usgs-14316700,"STEAMBOAT CREEK NEAR GLIDE, OR",POINT (-122.72894 43.34984)


In [None]:
# Filter using a dictionary
ev.locations.filter({
    "column": "id",
    "operator": "=",
    "value": "usgs-14316700"
}).to_geopandas()

INFO:teehr.evaluation.tables:Setting filter <class 'filter'>.
DEBUG:teehr.querying.filter_format:Filter is not a list.  Making a list.
DEBUG:teehr.querying.filter_format:Validating and applying {'column': 'id', 'operator': '=', 'value': 'usgs-14316700'}
DEBUG:teehr.querying.filter_format:Filter: {"column":"id","operator":"=","value":"usgs-14316700"}


Unnamed: 0,id,name,geometry
0,usgs-14316700,"STEAMBOAT CREEK NEAR GLIDE, OR",POINT (-122.72894 43.34984)


In [None]:
# Import the LocationFilter and Operators classes
from teehr import LocationFilter, Operators

# Get the field enumeration
fields = ev.locations.field_enum()

# Filter using the LocationFilter class
lf = LocationFilter(
    column=fields.id,
    operator=Operators.eq,
    value="usgs-14316700"
)
ev.locations.filter(lf).to_geopandas()

INFO:teehr.evaluation.tables:Setting filter <class 'filter'>.
DEBUG:teehr.querying.filter_format:Filter is not a list.  Making a list.
DEBUG:teehr.querying.filter_format:Validating and applying column=<LocationFields.id: 'id'> operator=<FilterOperators.eq: '='> value='usgs-14316700'
DEBUG:teehr.querying.filter_format:Filter: {"column":"id","operator":"=","value":"usgs-14316700"}


Unnamed: 0,id,name,geometry
0,usgs-14316700,"STEAMBOAT CREEK NEAR GLIDE, OR",POINT (-122.72894 43.34984)
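
All three filter forms above return the same row, and the debug log shows the dictionary and object forms being validated before they are applied. The following is a minimal sketch of how a filter dictionary could be validated and rendered as the equivalent SQL string. It is not TEEHR's implementation: `Op`, `filter_to_sql`, and `allowed_columns` are hypothetical names, and joining multiple filters with `AND` is an assumption made for this example.

```python
from enum import Enum


class Op(str, Enum):
    """Hypothetical operator enumeration (not TEEHR's Operators class)."""
    eq = "="
    gt = ">"
    lt = "<"


def filter_to_sql(filters, allowed_columns):
    """Validate filter dictionaries and render them as one SQL condition."""
    # Accept a single dictionary or a list of dictionaries.
    if isinstance(filters, dict):
        filters = [filters]
    clauses = []
    for f in filters:
        if f["column"] not in allowed_columns:
            raise ValueError(f"unknown column: {f['column']}")
        op = Op(f["operator"])  # raises ValueError for unsupported operators
        value = f["value"]
        if isinstance(value, str):
            # Quote strings and escape embedded single quotes.
            value = "'" + value.replace("'", "''") + "'"
        clauses.append(f"{f['column']} {op.value} {value}")
    # Assumption for this sketch: multiple filters are combined with AND.
    return " AND ".join(clauses)


location_columns = {"id", "name", "geometry"}
sql = filter_to_sql(
    {"column": "id", "operator": "=", "value": "usgs-14316700"},
    location_columns,
)
print(sql)
```

Validating against an enumeration of columns and operators catches typos such as a misspelled column name before the query reaches Spark, which is the benefit the FilterObject form provides over a raw SQL string.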
