# Timeseries analysis

Thus far, we've read the data files exactly as the National Water Model outputs them: as hourly snapshots. If you want to do some kind of timeseries analysis, where you access data from a (perhaps small) spatial area of interest, you would need to read data from every single NetCDF file. Even if your access to the metadata is fast (thanks to Kerchunk), reading data from that many files is still going to take a while.

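At a fundamental level, the chunking of the data on disk will dictate your maximum performance. A layout with one file per time step is well suited to reading a full spatial snapshot, but it is a poor fit for reading a long timeseries at a single location.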
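To put rough numbers on that, here is a minimal sketch that counts how many chunks have to be fetched to read one location's full timeseries. It assumes that each chunk costs one read and that the location falls inside a single spatial chunk. The function name, the one-year span, and the 672-hour chunk size are all illustrative assumptions, not properties of the National Water Model data.

```python
import math


def reads_for_point_timeseries(n_timesteps: int, time_chunk: int) -> int:
    """Chunks fetched to read every time step at one location.

    Assumes the location sits inside a single spatial chunk, so the
    count depends only on how the time axis is chunked.
    """
    return math.ceil(n_timesteps / time_chunk)


# Hypothetical span: one year of hourly output.
n_hours = 365 * 24

# One snapshot per file, i.e. a time chunk of length 1.
per_snapshot = reads_for_point_timeseries(n_hours, time_chunk=1)

# Hypothetical rechunking: 672 hours (four weeks) per chunk.
rechunked = reads_for_point_timeseries(n_hours, time_chunk=672)

print(f"snapshot layout:  {per_snapshot} reads")
print(f"rechunked layout: {rechunked} reads")
```

Under these assumptions the snapshot layout needs one read per hour of data, while the time-chunked layout needs a small fraction of that. The trade-off is that each time-oriented chunk covers less of the spatial domain, so reading a full snapshot becomes more expensive.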

In [1]:
import dask.dataframe as dd
import dask_geopandas
import geopandas

In [2]:
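# Lazily open the Parquet dataset from Azure Blob Storage.
df = dd.read_parquet(
    "az://ciroh/short-range-reservoir.parquet/",
    storage_options={"account_name": "noaanwm"},
)
# Decode the WKB-encoded geometry column, one partition at a time.
geometry = df.geometry.map_partitions(
    geopandas.GeoSeries.from_wkb,
    meta=geopandas.GeoSeries([], name="geometry"),
    crs="epsg:4326",
)

# Wrap the result as a dask-geopandas GeoDataFrame.
gdf = dask_geopandas.from_dask_dataframe(df, geometry=geometry)
gdf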

npartitions=21,feature_id,geometry,reservoir_type,water_sfc_elev,inflow,outflow
,int32,geometry,category[unknown],float32,float64,float64
,...,...,...,...,...,...
...,...,...,...,...,...,...
,...,...,...,...,...,...
,...,...,...,...,...,...


In [3]:
gdf.head()

time,feature_id,geometry,reservoir_type,water_sfc_elev,inflow,outflow
2021-08-26 01:00:00,491,POINT (-68.37904 46.18327),1,205.007019,0.18,0.27
2021-08-26 01:00:00,531,POINT (-68.45489 46.16116),1,247.148026,0.39,0.09
2021-08-26 01:00:00,747,POINT (-68.06499 46.03409),1,190.294937,0.02,0.07
2021-08-26 01:00:00,759,POINT (-68.16213 46.02238),1,165.124832,0.0,0.17
2021-08-26 01:00:00,1581,POINT (-67.93720 45.64844),1,130.014114,0.47,0.53


In [4]:
%%time
inflow = df.groupby("feature_id").inflow.agg(["mean", "min", "max", "std"]).compute()
inflow.head()

CPU times: user 6.45 s, sys: 1.91 s, total: 8.35 s
Wall time: 3.49 s


feature_id,mean,min,max,std
491,0.794296,0.07,15.2,1.006294
531,0.342121,0.0,18.43,0.746961
747,0.416411,0.01,14.9,0.815884
759,0.000389,0.0,0.13,0.004175
1581,1.27134,0.37,5.22,1.113017
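To make the aggregation in cell [4] concrete, here is the same `groupby(...).agg(...)` pattern run with plain pandas on a tiny in-memory table. This is only a sketch: the values are made up for illustration and are not taken from the dataset, and it uses pandas rather than Dask so that it runs without access to the remote storage.

```python
import pandas as pd

# Toy table shaped like the reservoir data: indexed by time, one row per
# (time, feature_id). The inflow values are invented for illustration.
toy = pd.DataFrame(
    {
        "feature_id": [491, 491, 491, 531, 531],
        "inflow": [0.18, 0.20, 0.22, 0.39, 0.41],
    },
    index=pd.DatetimeIndex(
        [
            "2021-08-26 01:00",
            "2021-08-26 02:00",
            "2021-08-26 03:00",
            "2021-08-26 01:00",
            "2021-08-26 02:00",
        ],
        name="time",
    ),
)

# Collapse the time axis: one row of summary statistics per reservoir.
stats = toy.groupby("feature_id").inflow.agg(["mean", "min", "max", "std"])
print(stats)
```

The result has one row per `feature_id` and one column per statistic, which is the shape of the output above. Note that `std` here is the sample standard deviation (pandas defaults to `ddof=1`).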


In [5]:
len(df)

76543788