This Jupyter notebook shows how to use `compute_water_year` to calculate the water year of each record (here, October 1 through September 30, labeled by the year in which it ends), then summarize how many measurements of each water-quality variable are available per water year for a specified list of sites. Grouping by water year helps identify which years have the highest measurement coverage and which variables were measured most frequently across sites.

## Main function used in this example

In [None]:
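import pandas as pd


def compute_water_year(
    df: pd.DataFrame, inplace: bool = False):
    """Return the water year for each row of df, based on its DateTime column."""

    # parse DateTime without modifying the caller's DataFrame
    date_time = pd.to_datetime(df["DateTime"])

    # October through December belong to the following water year
    water_year = date_time.map(lambda x: x.year + 1 if x.month > 9 else x.year)

    if inplace:
        df["Water_Year"] = water_year
        return None

    return water_year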
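
The same rule can also be written without a row-wise `lambda`. The sketch below is a hypothetical alternative (`water_year_vectorized` is not part of the notebook's utilities) that uses the pandas `.dt` accessor. It assumes every timestamp parses successfully. The sample values are made up to check the September 30 / October 1 boundary.

```python
import pandas as pd


def water_year_vectorized(timestamps: pd.Series) -> pd.Series:
    """Hypothetical helper: water year computed with the .dt accessor."""
    ts = pd.to_datetime(timestamps)
    # Add 1 to the calendar year for October, November, and December.
    return ts.dt.year + (ts.dt.month > 9).astype(int)


# Made-up timestamps around the water-year boundary.
sample = pd.Series(["2007-09-30 23:00", "2007-10-01 00:00", "2008-01-15 12:00"])
print(water_year_vectorized(sample).tolist())  # [2007, 2008, 2008]
```

On large time series a vectorized expression like this is usually faster than `map` with a Python function, though the result should be the same.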

## Import libraries and connect to HydroShare to get data

In [88]:
import sys
from pathlib import Path
sys.path.append(str(Path("..").resolve()))
from utils import S3hsclient as hsclient

import pyarrow.dataset as ds
import geopandas as gpd
from typing import Union
import datetime as dt

In [7]:
# use your HydroShare credentials to login
hs = hsclient.S3HydroShare()

Please Enter Your HydroShare Credentials


Username:  igarousi
Password for igarousi:  ········


## Identify the HydroShare resource that contains the data

In [8]:
resource_id = '9fc3a923419640729b1606f0e64bd288'
resource = hs.resource(resource_id)

List the available STREAM data files within the HydroShare resource defined above.

In [12]:
resource.s3_ls()

['tonycastronova/9fc3a923419640729b1606f0e64bd288/data/contents/dynamic_antropogenic.parquet',
 'tonycastronova/9fc3a923419640729b1606f0e64bd288/data/contents/gauges.cpg',
 'tonycastronova/9fc3a923419640729b1606f0e64bd288/data/contents/gauges.dbf',
 'tonycastronova/9fc3a923419640729b1606f0e64bd288/data/contents/gauges.parquet',
 'tonycastronova/9fc3a923419640729b1606f0e64bd288/data/contents/gauges.prj',
 'tonycastronova/9fc3a923419640729b1606f0e64bd288/data/contents/gauges.shp',
 'tonycastronova/9fc3a923419640729b1606f0e64bd288/data/contents/gauges.shx',
 'tonycastronova/9fc3a923419640729b1606f0e64bd288/data/contents/gauges_meta.xml',
 'tonycastronova/9fc3a923419640729b1606f0e64bd288/data/contents/gauges_resmap.xml',
 'tonycastronova/9fc3a923419640729b1606f0e64bd288/data/contents/grab_samples.parquet',
 'tonycastronova/9fc3a923419640729b1606f0e64bd288/data/contents/lulc.parquet',
 'tonycastronova/9fc3a923419640729b1606f0e64bd288/data/contents/metadata.parquet',
 'tonycastronova/9fc3a92

Load the water quality data that is stored in the Parquet format. 

In [89]:
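%%timeit
# Note: names assigned inside a %%timeit cell are not kept in the notebook
# namespace; re-run this cell without the magic so `dataset` is available below.
dataset = ds.dataset(
    'tonycastronova/9fc3a923419640729b1606f0e64bd288/data/contents/water_quality.parquet',
    format="parquet",
    filesystem=hs.get_s3_filesystem()
)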

228 ms ± 7.94 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Check the type of the loaded data.

In [90]:
type(dataset)

pyarrow._dataset.FileSystemDataset

The loaded object is a `pyarrow._dataset.FileSystemDataset`. Convert it to a `pandas.DataFrame` with the code below so that common analysis operations such as grouping, filtering, and plotting are easy to apply.

In [92]:
result = dataset.to_table()
df = result.to_pandas()

In [93]:
type(df)

pandas.DataFrame

Check the available water quality variables in this dataframe.

In [94]:
df.columns

Index(['DateTime', 'WTemp_C', 'Flag_WTemp_C', 'SpC_uScm', 'Flag_SpC_uScm',
       'DO_mgL', 'Flag_DO_mgL', 'pH', 'Flag_pH', 'Turb_FNU', 'Flag_Turb_FNU',
       'STREAM_ID', 'NO3_mgNL', 'Flag_NO3_mgNL', 'PO4_mgL', 'Flag_PO4_mgL',
       'Turb_NTU', 'Flag_Turb_NTU', 'fDOM_QSU', 'Flag_fDOM_QSU', 'Chla_ugL',
       'Flag_Chla_ugL', 'PC_RFU', 'Flag_PC_RFU', 'fDOM_RFU', 'Flag_fDOM_RFU'],
      dtype='str')

List the site IDs in this DataFrame. Because this is a time-series DataFrame, each site ID appears on many rows; `unique()` returns each ID only once.

In [95]:
df.STREAM_ID.unique()

<ArrowStringArray>
['STREAM-gauge-3929', 'STREAM-gauge-2223',  'STREAM-gauge-249',
  'STREAM-gauge-248',  'STREAM-gauge-246',  'STREAM-gauge-247',
 'STREAM-gauge-4446', 'STREAM-gauge-3811',  'STREAM-gauge-250',
 'STREAM-gauge-3761',
 ...
 'STREAM-gauge-4434', 'STREAM-gauge-4416', 'STREAM-gauge-4407',
 'STREAM-gauge-4427',   'STREAM-gauge-31', 'STREAM-gauge-1044',
 'STREAM-gauge-1014', 'STREAM-gauge-3028', 'STREAM-gauge-2991',
 'STREAM-gauge-2820']
Length: 869, dtype: str

## Subset data for two sites

In [115]:
subset_df = (
    df[df["STREAM_ID"].isin(["STREAM-gauge-2223", "STREAM-gauge-2991"])]
    .reset_index(drop=True)
)

In [116]:
subset_df

,DateTime,WTemp_C,Flag_WTemp_C,SpC_uScm,Flag_SpC_uScm,DO_mgL,Flag_DO_mgL,pH,Flag_pH,Turb_FNU,...,Turb_NTU,Flag_Turb_NTU,fDOM_QSU,Flag_fDOM_QSU,Chla_ugL,Flag_Chla_ugL,PC_RFU,Flag_PC_RFU,fDOM_RFU,Flag_fDOM_RFU
0,1999-03-26 08:00:00,11.20,A,,,,,,,,...,,,,,,,,,,
1,2007-10-01 07:00:00,12.70,A,470.0,A,,,,,,...,,,,,,,,,,
2,2007-10-01 08:00:00,12.30,A,472.5,A,,,,,,...,,,,,,,,,,
3,2007-10-01 09:00:00,11.90,A,475.0,A,,,,,,...,,,,,,,,,,
4,2007-10-01 10:00:00,11.45,A,477.5,A,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
150854,2025-08-10 09:00:00,14.85,P,,,8.200,P,,,,...,,,,,,,,,,
150855,2025-08-10 10:00:00,14.45,P,,,8.275,P,,,,...,,,,,,,,,,
150856,2025-08-10 11:00:00,14.05,P,,,8.375,P,,,,...,,,,,,,,,,
150857,2025-08-10 12:00:00,13.75,P,,,8.450,P,,,,...,,,,,,,,,,


## Compute water year using the function

In [117]:
subset_df['water_year'] = compute_water_year(subset_df)

Display the newly added `water_year` column next to the original `DateTime` column.

In [118]:
subset_df[['DateTime', 'water_year']]

,DateTime,water_year
0,1999-03-26 08:00:00,1999
1,2007-10-01 07:00:00,2008
2,2007-10-01 08:00:00,2008
3,2007-10-01 09:00:00,2008
4,2007-10-01 10:00:00,2008
...,...,...
150854,2025-08-10 09:00:00,2025
150855,2025-08-10 10:00:00,2025
150856,2025-08-10 11:00:00,2025
150857,2025-08-10 12:00:00,2025


Compute the number of available records for each variable by site and water year.

In [120]:
subset_df.groupby(["STREAM_ID", "water_year"]).count().drop(columns=["DateTime"])

STREAM_ID,water_year,WTemp_C,Flag_WTemp_C,SpC_uScm,Flag_SpC_uScm,DO_mgL,Flag_DO_mgL,pH,Flag_pH,Turb_FNU,Flag_Turb_FNU,...,Turb_NTU,Flag_Turb_NTU,fDOM_QSU,Flag_fDOM_QSU,Chla_ugL,Flag_Chla_ugL,PC_RFU,Flag_PC_RFU,fDOM_RFU,Flag_fDOM_RFU
STREAM-gauge-2223,1999,1,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
STREAM-gauge-2223,2008,8680,8680,8009,8009,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
STREAM-gauge-2223,2009,8759,8759,8615,8615,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
STREAM-gauge-2223,2010,8181,8181,7820,7820,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
STREAM-gauge-2223,2011,8760,8760,8760,8760,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
STREAM-gauge-2223,2012,8712,8712,8448,8448,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
STREAM-gauge-2223,2013,8544,8544,8309,8309,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
STREAM-gauge-2223,2014,8016,8016,7090,7090,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
STREAM-gauge-2223,2015,8544,8544,8469,8469,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
STREAM-gauge-2223,2016,8640,8640,8495,8495,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


As shown in the table above, `STREAM-gauge-2991` has measurements only for water year 2025, while `STREAM-gauge-2223` has broader temporal coverage. Counting records this way gives a quick view of data availability across sites and water years.
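
The counts table can be reduced further to the number of variables with at least one record per site and water year. The sketch below is a minimal illustration on a made-up DataFrame. It assumes that columns prefixed with `Flag_` hold quality flags rather than measurements, which is inferred from the column names and not confirmed by the dataset documentation. The site labels `A` and `B` are placeholders.

```python
import pandas as pd

# Made-up records; "A" and "B" are placeholder site IDs.
toy = pd.DataFrame({
    "STREAM_ID": ["A", "A", "A", "B"],
    "water_year": [2008, 2008, 2009, 2025],
    "WTemp_C": [12.7, 12.3, None, 14.85],
    "Flag_WTemp_C": ["A", "A", None, "P"],
    "SpC_uScm": [470.0, None, None, None],
    "DO_mgL": [None, None, None, 8.2],
})

# Non-null records per variable for each site and water year.
counts = toy.groupby(["STREAM_ID", "water_year"]).count()

# Keep measurement columns only (assumption: Flag_* columns are quality flags).
value_cols = [c for c in counts.columns if not c.startswith("Flag_")]

# Number of variables with at least one record in each group.
n_variables = (counts[value_cols] > 0).sum(axis=1).rename("n_variables")
print(n_variables)
```

Applied to `subset_df`, the same steps would give one number per site and water year, which makes gaps in variable coverage easier to spot than the full counts table.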
