# How to Access STREAM Data on HydroShare in Parquet Format

The purpose of this notebook is to demonstrate common patterns for accessing STREAM data on HydroShare. The data we're going to access is stored in the Parquet file format, which lets us efficiently read subsets of the large STREAM catalog.

First we'll need to authenticate with HydroShare.

In [None]:
from utils import S3hsclient

In [None]:
hs = S3hsclient.S3HydroShare()

The Mississippi River Basin (MRB) data are available on HydroShare at: https://hydroshare.org/resource/9fc3a923419640729b1606f0e64bd288/. To load these data, we'll use the HydroShare Python library.

In [None]:
resource_id = '9fc3a923419640729b1606f0e64bd288'
resource = hs.resource(resource_id)

This resource contains a number of files that can be loaded for analysis. To see which files are available, we'll call the resource's `s3_ls` method.

In [None]:
resource.s3_ls()

## Accessing Data using Scientific Python Libraries

Load the streamflow dataset and inspect the data. This can be done using several common scientific Python libraries.

### Pandas DataFrame

In [None]:
import pandas

The following example loads the entire record of streamflow data for the Mississippi River Basin, which includes over 100,000,000 records.

In [None]:
# load the full streamflow dataset into memory using Pandas

df = pandas.read_parquet(
    'tonycastronova/9fc3a923419640729b1606f0e64bd288/data/contents/streamflow.parquet',
    filesystem=hs.get_s3_filesystem()
)
df

Calculate the number of gauges in the data that was returned.

In [None]:
gauge_count = len(df.STREAM_ID.unique())
print(f'{gauge_count} gauges exist in data returned from HydroShare')

Plot the streamflow for one of the STREAM sites.

In [None]:
df[df.STREAM_ID == 'STREAM-gauge-3010'].Q_m3s.plot();

### Dask DataFrame

The advantage of using a Dask DataFrame instead of a Pandas DataFrame is that it provides lazy loading and delayed computations. This is especially beneficial when analyzing large amounts of data.

In [None]:
import dask.dataframe as dd

In [None]:
# lazily open the streamflow data using Dask

# Create a Dask DataFrame pointing to the file; the records are not loaded yet
ds = dd.read_parquet(
    'tonycastronova/9fc3a923419640729b1606f0e64bd288/data/contents/streamflow.parquet',
    filesystem=hs.get_s3_filesystem()
)
ds

Notice how quickly this cell ran. That is because the streamflow records have not been downloaded yet; Dask defers reading them until a result is requested. We can apply our filtering and other computations first, which means we avoid downloading data we do not need. For instance, we can plot the same data as above in the following manner.

In [None]:
# query the dataset and filter for the 'STREAM-gauge-3010' location, then
# return only DateTime and Q_m3s variables.
dat = ds[ds.STREAM_ID == 'STREAM-gauge-3010'][['DateTime', 'Q_m3s']]

# Perform additional computations/filtering here
# ....
# ....

# tell dask to perform the computations
dat = dat.compute()

dat

In [None]:
# plot the data
dat.Q_m3s.plot();