# CMIP6 Zarr Profiling

Benchmarks are generated for tiling a few different copies of the CMIP6 daily data, to understand how different data pre-processing options affect performance.

1. kerchunk + netCDF: A kerchunk reference file for NetCDF files stored on S3. 
2. Zarr stores with different chunking configurations and pyramids.
   * Chunked to optimize for time series analysis: 
       * latitude: 262, longitude: 262, time: 365.
       * This dataset has larger chunks, but more timesteps are loaded into each chunk.
   * Chunked to optimize for visualization at a single time step.
       * latitude: 600, longitude: 1440, time: 1.
       * This dataset has small chunks, but will likely not work well for time series generation.
   * Chunked to optimize for both time series and visualization:
       * latitude: 600, longitude: 1440, time: 29.
       * This dataset has larger chunks, but more timesteps are loaded into each chunk.
3. Zarr store with no coordinate chunking. At this time there is a known issue with pangeo-forge data generation where coordinates are chunked. This has a significant impact on performance.
4. A pyramid.
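
The chunk sizes implied by these configurations can be estimated from the chunk shape alone. The sketch below assumes 4-byte (float32) values and 1 MB = 1024² bytes; both are assumptions, and `chunk_size_mb` is a hypothetical helper, not part of tile-benchmarking. Under those assumptions the results are consistent with the `chunk_size_mb` column in the results table (about 3.3 MB for one timestep per chunk and about 95.6 MB for 29 timesteps per chunk).

```python
import math


def chunk_size_mb(chunk_shape, itemsize=4):
    """Uncompressed size of one chunk in MB (1 MB = 1024**2 bytes).

    itemsize=4 assumes float32 values; adjust for other dtypes.
    """
    return math.prod(chunk_shape) * itemsize / 1024**2


# Chunk shapes as (time, lat, lon), taken from the list above.
chunk_shapes = {
    "600_1440_1": (1, 600, 1440),
    "600_1440_29": (29, 600, 1440),
}

chunk_sizes = {name: chunk_size_mb(shape) for name, shape in chunk_shapes.items()}

for name, size in chunk_sizes.items():
    print(f"{name}: {size:.2f} MB")
```

These are uncompressed sizes; the bytes actually transferred from S3 depend on the compression used for each store.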

The libraries used are xarray for reading the Zarr metadata and rio_tiler's XarrayReader for reading data from the NetCDFs on S3.

Code from these libraries was copied into tile-benchmarking to generate tiles. This enabled full control to add timers to blocks of code and logs to understand where time was being spent. Specifically:

* `import s3fs; s3fs.core.setup_logging("DEBUG")` was used to debug calls to S3. This showed that most of the time is spent opening the dataset, which was slowed down by opening all the coordinate chunks.
* Timing code blocks also showed that most of the remaining time, after opening the dataset, was spent reprojecting the data. Time to reproject the data is positively correlated with the chunk size, since the minimum amount of data that can be read from S3 is one data chunk.
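
Timing individual blocks of code can be done with a small context manager. The following is a minimal sketch and not the code used in tile-benchmarking: the `timer` helper, the `timings` dictionary, and the labels are hypothetical, and `time.sleep` stands in for the real steps of opening the dataset and reprojecting.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# label -> list of elapsed times in milliseconds
timings = defaultdict(list)


@contextmanager
def timer(label):
    """Record the wall-clock time spent inside the `with` block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label].append((time.perf_counter() - start) * 1000)


# Placeholder workloads; replace with the real calls being profiled.
with timer("open dataset"):
    time.sleep(0.01)

with timer("reproject"):
    time.sleep(0.02)

for label, samples in timings.items():
    print(f"{label}: {sum(samples) / len(samples):.1f} ms (mean of {len(samples)})")
```

Collecting samples per label makes it straightforward to report a mean over repeated tile requests.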

Open questions and future work:

* We think, based on the results below, that a good chunk size for the tile server approach is 3MB. At how many chunks will maintaining this chunk size start to cost performance, and why?
* If you store your data in many small chunks, when (in terms of zoom level and resolution) does this impact performance?
* More exploration of the pyramiding option is required, since the pyramid has all timesteps in each chunk and thus is not faster than data chunked optimally for visualization.
* How to simultaneously optimize for time series and visualization.

# Interpretation of the Results

Note: The spatial resolution is the same for all datasets because it's from the same underlying data.
The differences in chunk size come only from how the data is chunked.

* The best performance was for data that was not chunked spatially and only had 1 timestep per chunk.
* The performance was about the same for the kerchunk reference of this dataset. This is because the NetCDF data is chunked the same way: even though 365 time steps (days) are stored in each NetCDF file, the data is chunked by day.
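
To put these results in relative terms, the sketch below computes each dataset's slowdown against the fastest one. The timings are copied from the `mean total time (ms) (0, 0, 0)` column of the results table at the end of this notebook; the variable names are illustrative only.

```python
# Mean total time (ms) for tile (0, 0, 0), copied from the results table.
mean_ms = {
    "600_1440_1_no-coord-chunks": 255.627,
    "kerchunk": 269.517,
    "pyramid": 667.796,
    "600_1440_29": 709.523,
    "365_262_262": 1426.366,
    "600_1440_1": 1816.089,
}

fastest = min(mean_ms.values())

# Slowdown factor relative to the fastest dataset (1.0 = fastest).
slowdown = {name: ms / fastest for name, ms in mean_ms.items()}

for name, ratio in sorted(slowdown.items(), key=lambda kv: kv[1]):
    print(f"{name}: {ratio:.1f}x")
```

Note that `600_1440_1` and `600_1440_1_no-coord-chunks` have the same data chunk size and differ in the number of coordinate chunks (732 versus 3 in the table), so the gap between them reflects the cost of coordinate chunking.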

# Summary

Data at this spatial resolution, if optimally chunked (1 timestep per chunk and no coordinate chunking), works well for tiling. But this does not tell us at what point we should chunk or pyramid the data. For that, we direct readers to the next notebook: `when-to-chunk.ipynb`.


In [1]:
import pandas as pd
git_url_path = "https://raw.githubusercontent.com/developmentseed/tile-benchmarking/feat/fake-data/profiling/results"
df = pd.read_csv(f"{git_url_path}/profiled_zarr_results.csv")

In [14]:
df.drop(
    columns=['Unnamed: 0', 'source', 'variable', 'chunks', 'dtype', 'compression']
).sort_values('mean total time (ms) (0, 0, 0)')

| | collection_name | shape | lat_resolution | lon_resolution | chunk_size_mb | number_coord_chunks | mean total time (ms) (0, 0, 0) | mean total time (ms) (603, 769, 11) |
|---|---|---|---|---|---|---|---|---|
| 3 | 600_1440_1_no-coord-chunks | {'time': 730, 'lat': 600, 'lon': 1440} | 0.25 | 0.25 | 3.295898 | 3 | 255.627 | 272.098 |
| 0 | kerchunk | {'time': 730, 'lat': 600, 'lon': 1440} | 0.25 | 0.25 | 3.295898 | 3 | 269.517 | 252.904 |
| 1 | pyramid | {'time': 730, 'y': 128, 'x': 128} | | | 45.625 | 3 | 667.796 | 828.671 |
| 4 | 600_1440_29 | {'time': 730, 'lat': 600, 'lon': 1440} | 0.25 | 0.25 | 95.581055 | 27 | 709.523 | 708.818 |
| 5 | 365_262_262 | {'time': 730, 'lat': 600, 'lon': 1440} | 0.25 | 0.25 | 95.577469 | 9 | 1426.366 | 739.358 |
| 2 | 600_1440_1 | {'time': 730, 'lat': 600, 'lon': 1440} | 0.25 | 0.25 | 3.295898 | 732 | 1816.089 | 1799.03 |
