# 03 Introduction
In this notebook we will explore the Evaluation schema through the Evaluation class interface. To do so, we first need to create an Evaluation and populate it with data. There are many ways to do this, ranging from cloning a complete Evaluation from the TEEHR S3 bucket that already contains all the necessary data, to cloning a blank template and populating the tables using the built-in loading and fetching methods. In this exercise we are going to clone a complete Evaluation and explore the tables using the TEEHR Evaluation table subclasses.

### Create a new Evaluation
First we will import the TEEHR Evaluation class and create a new instance that points to a directory where the evaluation data will be stored.

In [None]:
from teehr import Evaluation
from pathlib import Path
import shutil

# Define the directory where the Evaluation will be created
test_eval_dir = Path(Path.home(), "temp", "03_introduction_class")
shutil.rmtree(test_eval_dir, ignore_errors=True)

# Create an Evaluation object and create the directory
ev = Evaluation(dir_path=test_eval_dir, create_dir=True)

# Enable logging
ev.enable_logging()

### Clone Evaluation Data from S3
As mentioned above, for this exercise we will be cloning a complete Evaluation dataset from the TEEHR S3 bucket. First we will list the available Evaluations, and then we will clone the `p0_2_location_example` evaluation, which is a small example Evaluation that only contains 2 gages.

In [None]:
# List the evaluations in the S3 bucket
ev.list_s3_evaluations()

In [None]:
# Clone the p0_2_location_example evaluation from the S3 bucket
ev.clone_from_s3("p0_2_location_example")

Now that we have cloned the `p0_2_location_example` evaluation, let's take a look at the data that was cloned from S3, specifically the `dataset` directory. You can see that the three different data groups are stored in slightly different ways:
- The domain tables (units, variables, configurations, attributes) are stored as *.csv files
- The location tables (locations, location_attributes, location_crosswalks) are stored as parquet files without hive partitioning
- The timeseries tables (primary_timeseries, secondary_timeseries, joined_timeseries) are stored as parquet files with hive partitioning
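
To make "hive partitioning" concrete, here is a minimal sketch using only the standard library. It assumes the usual hive convention where each partition column becomes a `key=value` directory level. The partition column names, values, and file name below are hypothetical and chosen only for illustration; the real layout is whatever the directory tree in the next cell shows.

```python
from pathlib import Path
import tempfile


def partition_path(root: Path, **partitions: str) -> Path:
    """Build a hive-style path such as root/key1=value1/key2=value2."""
    path = root
    for key, value in partitions.items():
        path = path / f"{key}={value}"
    return path


def parse_partitions(file_path: Path, root: Path) -> dict:
    """Recover the partition values encoded in the directory names."""
    parts = file_path.relative_to(root).parts
    return dict(part.split("=", 1) for part in parts if "=" in part)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp, "primary_timeseries")
    # Hypothetical partition columns and values, for illustration only.
    part_dir = partition_path(
        root,
        configuration_name="usgs_observations",
        variable_name="streamflow",
    )
    part_dir.mkdir(parents=True)
    data_file = part_dir / "part-0000.parquet"
    data_file.touch()  # empty placeholder, not a real parquet file

    partitions = parse_partitions(data_file, root)
    print(data_file.relative_to(root))
    print(partitions)
```

Because the partition values live in the directory names, a query engine can skip whole directories when a filter only touches partition columns.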

In [None]:
# from teehr.evaluation.utils import print_tree
# print_tree(ev.dataset_dir, exclude_patterns=[".*", "_*"])
!tree $HOME/temp/03_introduction_class/dataset -I ".*|_*"

### Table Classes
The TEEHR Evaluation class contains different sub-classes that are used to organize class methods into logical groups. One of these types of sub-classes is the "table" sub-classes, which contain methods for interacting with the data tables. Each of the tables in the Evaluation dataset has a respective sub-class with the table name.
```
ev.units
ev.attributes
ev.variables
ev.configurations
ev.locations
ev.location_attributes
ev.location_crosswalks
ev.primary_timeseries
ev.secondary_timeseries
ev.joined_timeseries
```
Each of the table sub-classes then has methods to add or insert new data as well as methods to query the data out. These are documented in the API documentation.

NEED LINK

In [None]:
ev.units.to_pandas().head()

In [None]:
ev.attributes.to_pandas().head()

In [None]:
ev.variables.to_pandas().head()

In [None]:
ev.configurations.to_pandas().head()

In [None]:
ev.locations.to_pandas().head()

In [None]:
ev.location_attributes.to_pandas().head()

In [None]:
ev.primary_timeseries.to_pandas().head()

In [None]:
ev.location_crosswalks.to_pandas().head()

In [None]:
ev.secondary_timeseries.to_pandas().head()

### Querying
The underlying query engine for TEEHR is PySpark. Each of the table sub-classes can return data as either a Spark DataFrame (using the `to_sdf()` method) or as a Pandas DataFrame (using the `to_pandas()` method). The location data tables have an additional method that returns a GeoPandas GeoDataFrame (using the `to_geopandas()` method) where the geometry bytes column has been converted to a proper geometry column (displayed as WKT).

Note: PySpark is lazily evaluated, meaning that it does not actually run the query until the data is needed for display, plotting, etc. Therefore, if you just use the `to_sdf()` method, you do not get the data but rather a lazy Spark DataFrame that can be used with subsequent Spark operations. Here we show how to get the Spark DataFrame and show the data, but there are many other ways that the lazy Spark DataFrame can be used in subsequent operations that are beyond the scope of this document.
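
The idea of lazy evaluation can be illustrated without Spark. The sketch below is a toy stand-in, not how PySpark or TEEHR is implemented: `LazyTable`, its methods, and every location id other than `usgs-14316700` are hypothetical. It records the requested steps and only runs them when `collect()` is called.

```python
class LazyTable:
    """Toy stand-in for a lazy DataFrame: records steps, runs them on demand."""

    def __init__(self, rows, steps=()):
        self._rows = list(rows)
        self._steps = tuple(steps)

    def filter(self, predicate):
        # Return a new plan; nothing is evaluated here.
        return LazyTable(self._rows, self._steps + (("filter", predicate),))

    def order_by(self, key):
        # Return a new plan; nothing is evaluated here.
        return LazyTable(self._rows, self._steps + (("order_by", key),))

    def collect(self):
        # Only now are the recorded steps applied, in order.
        rows = self._rows
        for kind, func in self._steps:
            if kind == "filter":
                rows = [row for row in rows if func(row)]
            elif kind == "order_by":
                rows = sorted(rows, key=func)
        return rows


calls = []  # records every time the predicate actually runs


def is_usgs(row):
    calls.append(row["id"])
    return row["id"].startswith("usgs-")


rows = [
    {"id": "usgs-14316700"},
    {"id": "nwm30-23894572"},  # hypothetical id
    {"id": "usgs-14138800"},   # hypothetical id
]

query = LazyTable(rows).filter(is_usgs).order_by(lambda row: row["id"])
calls_before = len(calls)  # the predicate has not run yet
result = query.collect()   # the plan is executed here

print(calls_before, len(calls))
print(result)
```

In the same spirit, `to_sdf()` hands back a plan-like object, while `show()`, `to_pandas()`, and `to_geopandas()` are the points where data is actually materialized.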

In [None]:
# Query the locations and return as a lazy Spark DataFrame.
ev.locations.to_sdf()

In [None]:
# Query the locations and return as a Spark DataFrame but tell Spark to show the data.
ev.locations.to_sdf().show()

In [None]:
# Query the locations and return as a Pandas DataFrame.
# Note that the geometry column is shown as a byte string.
ev.locations.to_pandas()

In [None]:
# Query the locations and return as a GeoPandas DataFrame.
# Note that the geometry column is now a proper geometry column (displayed as WKT).
ev.locations.to_geopandas()
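
To give a feel for the difference between the byte string and the readable geometry, here is a minimal sketch that decodes a point by hand. It assumes the bytes are WKB (well-known binary), a common encoding for geometries stored in parquet; the source does not state the encoding, and `to_geopandas()` may perform the conversion differently. The function name and the coordinates are hypothetical.

```python
import struct


def wkb_point_to_wkt(wkb: bytes) -> str:
    """Decode a 2D WKB point into a WKT string (sketch, points only)."""
    # First byte is the byte order flag: 1 = little endian, 0 = big endian.
    byte_order = "<" if wkb[0] == 1 else ">"
    # Next four bytes are the geometry type; 1 means Point.
    (geom_type,) = struct.unpack(f"{byte_order}I", wkb[1:5])
    if geom_type != 1:
        raise ValueError("this sketch only handles 2D points")
    # The remaining sixteen bytes are the x and y coordinates as doubles.
    x, y = struct.unpack(f"{byte_order}2d", wkb[5:21])
    return f"POINT ({x:g} {y:g})"


# Build a little-endian WKB point with made-up coordinates.
wkb = struct.pack("<BI2d", 1, 1, -123.5, 43.25)
wkt = wkb_point_to_wkt(wkb)

print(wkb)
print(wkt)
```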

### Filter and Order
As noted above, because the tables are lazily evaluated Spark DataFrames, we can filter and order the data before returning it as a Pandas or GeoPandas DataFrame. The filter methods take either a raw SQL string, a filter dictionary, or a FilterObject built with the Operator and field enumerations. Using a FilterObject with the Operator and field enumerations is probably not a common pattern for most users; it is used internally to validate filter arguments.

In [None]:
# Filter using a raw SQL string
ev.locations.filter("id = 'usgs-14316700'").to_geopandas()

In [None]:
# Filter using a dictionary
ev.locations.filter({
    "column": "id",
    "operator": "=",
    "value": "usgs-14316700"
}).to_geopandas()

In [None]:
# Import the LocationFilter and Operators classes
from teehr import LocationFilter, Operators

# Get the field enumeration
fields = ev.locations.field_enum()

# Filter using the LocationFilter class
lf = LocationFilter(
    column=fields.id,
    operator=Operators.eq,
    value="usgs-14316700"
)
ev.locations.filter(lf).to_geopandas()

This same approach can be used to query the other tables in the evaluation dataset.
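
The three filter forms shown above express the same condition. As a rough illustration of how they relate, the sketch below normalizes each form to one SQL string. It is not TEEHR's internal implementation: `Op`, `SimpleFilter`, and `to_sql` are hypothetical names, and the quoting is deliberately naive (it does not escape quotes inside values).

```python
from dataclasses import dataclass
from enum import Enum


class Op(str, Enum):
    """Hypothetical operator enumeration; restricts operators to known ones."""
    eq = "="
    gt = ">"
    lt = "<"


@dataclass
class SimpleFilter:
    """Hypothetical filter object holding one column/operator/value triple."""
    column: str
    operator: Op
    value: object


def to_sql(flt) -> str:
    """Normalize a SQL string, a dict, or a SimpleFilter to a SQL string."""
    if isinstance(flt, str):
        return flt  # raw SQL is passed through unchanged
    if isinstance(flt, dict):
        # Op(...) raises ValueError for an operator that is not in the enum.
        flt = SimpleFilter(flt["column"], Op(flt["operator"]), flt["value"])
    if not isinstance(flt, SimpleFilter):
        raise TypeError(f"unsupported filter type: {type(flt).__name__}")
    # Naive quoting: strings get single quotes, everything else is used as-is.
    value = f"'{flt.value}'" if isinstance(flt.value, str) else str(flt.value)
    return f"{flt.column} {flt.operator.value} {value}"


sql_forms = [
    to_sql("id = 'usgs-14316700'"),
    to_sql({"column": "id", "operator": "=", "value": "usgs-14316700"}),
    to_sql(SimpleFilter(column="id", operator=Op.eq, value="usgs-14316700")),
]

for sql in sql_forms:
    print(sql)
```

Routing the dictionary through an enumeration is one way such validation could work: an unknown operator fails early with a clear error instead of producing an invalid query.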