__Zarr Development Notebooks__<br/>
<img src="http://static1.squarespace.com/static/530979d9e4b04bff4a3aadf5/t/5446c34ce4b0e5a2c7ff2614/1413923661373/DewberryLogo_RGB.png?format=1500w" width="40%" align='right'/>
__Probability Risk: Post Processor Group__<br/>
PYTHON 3.6<br/>
Overview: This notebook explores how to store WSE (water surface elevation) tifs from multiple models in a single cube of data.<br/>
Updated 2018-10-15<br/>
by Stephen Duncan: sduncan@dewberry.com <br/>

*Use an environment with boto3, zarr, and s3fs installed.*

# Create Zarr File Tutorial

 - [Zarr Tutorial & Documentation](https://zarr.readthedocs.io/en/stable/tutorial.html)

# Python Libraries Needed

In [1]:
import os, sys, time, glob, s3fs, zarr
import numpy as np
import pandas as pd
from io import BytesIO
from osgeo import gdal, osr
from matplotlib import pyplot as plt
from IPython.display import display, Markdown, Latex
%matplotlib inline

import zarr_aws  # local helper script with the functions used below

# Gather Sources for Creating Zarr Files

In [2]:
bucket1 = 'probmodelingrepository'
prefix1 = 'Augusta-Levee-AOP/ProductionRuns/outputs/Pluvial'
tifs = ['WSE_Pluvial_E1181.tif','WSE_Pluvial_E1182.tif']
s3_tifs = [zarr_aws.get_files(bucket1,prefix1,tif)[0] for tif in tifs]
zarr_aws.mprint(f'## Number of tifs found: {len(s3_tifs)}\n\n - '+'\n - '.join(s3_tifs))

## Number of tifs found: 2

 - probmodelingrepository:Augusta-Levee-AOP/ProductionRuns/outputs/Pluvial/E1181/WSE_Pluvial_E1181.tif
 - probmodelingrepository:Augusta-Levee-AOP/ProductionRuns/outputs/Pluvial/E1182/WSE_Pluvial_E1182.tif

## We should now read in the TIF objects

This is done with `BytesIO`.

In [3]:
obj_s3paths = [zarr_aws.s3Attributes(tif, '.tif', rtype='S3PATH') for tif in s3_tifs]
obj_s3names = [zarr_aws.s3Attributes(tif, '.tif', rtype='NAME').split('/')[-1] for tif in s3_tifs]

# Define where to save the Zarr file

In [4]:
bucket2 = 'probmodelingrepository'
prefix2 = 'zarr'
zarr_file_name = 'demo.zarr'
zarr_file = f's3://{bucket2}/{prefix2}/{zarr_file_name}'
zarr_aws.mprint(f'## Save Zarr File at:\n\n - {zarr_file}')

## Save Zarr File at:

 - s3://probmodelingrepository/zarr/demo.zarr

## Create the mapping object for zarr creation on s3fs

In [5]:
zarr_store = s3fs.S3Map(root=zarr_file, s3=s3fs.S3FileSystem(), check=False)

# Create Zarr Group File

 - [Zarr Arrays API](https://zarr.readthedocs.io/en/stable/api/creation.html)
     - Zarr Arrays are good for single datasets.
 - [Zarr Groups API](https://zarr.readthedocs.io/en/stable/api/hierarchy.html)
     - Zarr Groups are good for storing multiple datasets.
     - They can also store separate chunk groups with different sizes.
     - These are our ideal candidates for storing data from tifs.
     
## Paths

 - A path tells the zarr file which group to associate the data with. Since these are WSE tifs, we can call the path __'WSE'__.
 - `path` is the location in the tree where we want to save the datasets, such as 'WSE', 'Bool', or 'Depth'.

In [6]:
zarr_path = 'WSE'

z = zarr.group(store=zarr_store, path=zarr_path, overwrite=True)

## We need to designate a model group

Each WSE tif gets its own model group in the zarr file.

In [7]:
# We can do this for one tif
s3name = obj_s3names[0]
s3path = obj_s3paths[0]
model_group = z.create_group(f'{s3name}', overwrite=True)

## With any model group create a dataset

 - Model groups can have multiple datasets

In [8]:
rb, gt, ras = zarr_aws.getRasData(s3path)
cell_size, xsize, ysize = gt[1], rb.XSize, rb.YSize
np_dtype = zarr_aws.GDALtypeToDtype(zarr_aws.FindGDALtype(rb.DataType))
zarr_aws.mprint(f' - Cell Size: {cell_size:0.1f}\n - rows x cols: {ysize} x {xsize}' + \
                f'\n - Data Type: "{np_dtype}"')

 - Cell Size: 10.0
 - rows x cols: 13252 x 13549
 - Data Type: "f4"
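
These dimensions give a sense of how much memory the full raster needs once it is loaded as an array. The following is a back-of-the-envelope sketch that uses only the rows, columns, and data type printed above; `raster_nbytes` is a hypothetical helper, not part of `zarr_aws`.

```python
import numpy as np

def raster_nbytes(rows, cols, dtype):
    """Return the uncompressed in-memory size, in bytes, of a rows x cols array."""
    return rows * cols * np.dtype(dtype).itemsize

# Values reported above for WSE_Pluvial_E1181.tif
rows, cols, dtype = 13252, 13549, 'f4'
nbytes = raster_nbytes(rows, cols, dtype)
print(f'{rows} x {cols} "{dtype}" array: {nbytes:,} bytes ({nbytes / 1024**2:0.1f} MiB)')
# 13252 x 13549 "f4" array: 718,205,392 bytes (684.9 MiB)
```

Holding one tif in memory therefore takes roughly 685 MiB, before counting any temporary copies made while reading it.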

## Get the whole TIF as an array and write to Zarr File

- Warning: make sure you have enough free memory to hold the whole array, or it will crash the compute session.

In [9]:
newx_size, newy_size, ismemenough = zarr_aws.IsFreeMemoryEnough(xsize,ysize)
if ismemenough:
    zarr_aws.mprint(f'Current Free memory is enough for {ysize} x {xsize} array')
else:
    zarr_aws.mprint(f'Current Free memory is not enough for {ysize} x {xsize} array<br/>' + \
                    f'Recommend changing Chunks to be {newy_size} x {newx_size}')

Current Free memory is enough for 13252 x 13549 array
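
If free memory were not enough, the raster could be written as several smaller tiles instead of a single `r0_c0` dataset. The sketch below shows one way to enumerate such tiles. It assumes that the `r{row}_c{col}` name refers to the tile's row and column index; `tile_windows`, its arguments, and the 8000-cell tile size are hypothetical and do not come from `zarr_aws`.

```python
def tile_windows(xsize, ysize, tile_x, tile_y):
    """Yield (name, xoff, yoff, ncols, nrows) for each tile covering the raster.

    Tiles on the right and bottom edges are clipped to the raster bounds.
    """
    for r, yoff in enumerate(range(0, ysize, tile_y)):
        for c, xoff in enumerate(range(0, xsize, tile_x)):
            ncols = min(tile_x, xsize - xoff)
            nrows = min(tile_y, ysize - yoff)
            yield f'r{r}_c{c}', xoff, yoff, ncols, nrows

# Split the 13252 x 13549 raster into tiles of at most 8000 x 8000 cells
for window in tile_windows(xsize=13549, ysize=13252, tile_x=8000, tile_y=8000):
    print(window)
# ('r0_c0', 0, 0, 8000, 8000)
# ('r0_c1', 8000, 0, 5549, 8000)
# ('r1_c0', 0, 8000, 8000, 5252)
# ('r1_c1', 8000, 8000, 5549, 5252)
```

Each window could then be read and written as its own dataset in the model group, so that only one tile needs to fit in memory at a time.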

### Assuming free memory is enough, continue with full raster size

In [10]:
data_ = zarr_aws.GetChunkAsArray(rb, 0, 0, xsize, ysize)
model_group.create_dataset(f'r0_c0', data=data_, shape=(ysize,xsize), overwrite=True, dtype=np_dtype)

<zarr.core.Array '/WSE/WSE_Pluvial_E1181/r0_c0' (13252, 13549) float32>

## Metadata can also be included to help users later.

 - Keys are set with square brackets on `attrs`. Values can be any JSON-serializable data; zarr stores the attributes as a JSON dictionary so they are readable later.

In [11]:
model_metadata = zarr_aws.getFullRASMetaData(ras,gt,rb,xsize,ysize)
model_metadata.keys()

dict_keys(['Cell Size', 'Extents', 'Full GetStatistics', 'Projection', 'Unit Type', 'XYSize'])

In [12]:
data_metadata = zarr_aws.GetChunkMetaData(gt,xsize,ysize,0,0,cell_size)
data_metadata.keys()

dict_keys(['XMIN', 'YMIN', 'XMAX', 'YMAX'])
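
The four keys above describe the bounding box of the dataset. The body of `zarr_aws.GetChunkMetaData` is not shown in this notebook, so the following is only a sketch of how such extents can be derived from a GDAL geotransform for a north-up raster with no rotation. The name `chunk_extents` and the sample geotransform values are made up for illustration.

```python
def chunk_extents(gt, xoff, yoff, ncols, nrows):
    """Return the bounding box of a raster window from a GDAL geotransform.

    gt follows the GDAL order: (x_origin, pixel_width, row_rotation,
    y_origin, column_rotation, pixel_height), where pixel_height is
    negative for a north-up raster. Rotation terms are assumed to be zero.
    """
    xmin = gt[0] + xoff * gt[1]
    ymax = gt[3] + yoff * gt[5]
    xmax = xmin + ncols * gt[1]
    ymin = ymax + nrows * gt[5]
    return {'XMIN': xmin, 'YMIN': ymin, 'XMAX': xmax, 'YMAX': ymax}

# Illustrative geotransform with a 10-unit cell size (not the real raster origin)
gt = (1000.0, 10.0, 0.0, 5000.0, 0.0, -10.0)
print(chunk_extents(gt, xoff=0, yoff=0, ncols=13549, nrows=13252))
# {'XMIN': 1000.0, 'YMIN': -127520.0, 'XMAX': 136490.0, 'YMAX': 5000.0}
```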

In [13]:
z.attrs[f'{s3name}'] = model_metadata
model_group.attrs[f'r0_c0'] = data_metadata

# Write multiple datasets to same Zarr Group

In [14]:
zarr_path, start_time = 'WSE', time.time()
z = zarr.group(store=zarr_store, path=zarr_path, overwrite=True)
for i, s3name in enumerate(obj_s3names):
    s3path = obj_s3paths[i]
    model_group = z.create_group(f'{s3name}', overwrite=True)
    rb, gt, ras = zarr_aws.getRasData(s3path)
    cell_size, xsize, ysize = gt[1], rb.XSize, rb.YSize
    np_dtype = zarr_aws.GDALtypeToDtype(zarr_aws.FindGDALtype(rb.DataType))
    data_ = zarr_aws.GetChunkAsArray(rb, 0, 0, xsize, ysize)
    model_group.create_dataset(f'r0_c0', data=data_, shape=(ysize,xsize), overwrite=True, dtype=np_dtype)
    model_group.attrs[f'r0_c0'] = zarr_aws.GetChunkMetaData(gt,xsize,ysize,0,0,cell_size)
    z.attrs[f'{s3name}'] = zarr_aws.getFullRASMetaData(ras,gt,rb,xsize,ysize)
zarr_aws.mprint('Total Time: {:0.2f} Minutes'.format((time.time()-start_time)/60))

Total Time: 3.16 Minutes

# Write the WSE as a Bool

*Flooded cells are stored as 1; null cells are stored as 0.*

The only line changed from the previous loop is `data_ = zarr_aws.GetChunkAsArray(rb, 0, 0, xsize, ysize, boolize=makebool)`, which now passes `boolize=makebool`.
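
The `boolize` option is implemented inside `zarr_aws`, which is not shown in this notebook. As a rough illustration of the idea, the sketch below marks every cell with a valid value as 1 and every null cell as 0. The function name `to_flood_mask`, the nodata value of -9999, and the `uint8` output type are all assumptions.

```python
import numpy as np

def to_flood_mask(wse, nodata=-9999.0):
    """Return a uint8 array: 1 where wse holds a valid value, 0 where it is null.

    A cell is treated as null if it equals the nodata value or is NaN.
    """
    wse = np.asarray(wse)
    is_null = (wse == nodata) | np.isnan(wse)
    return (~is_null).astype('u1')

# Tiny example: two flooded cells, one nodata cell, one NaN cell
wse = np.array([[101.5, -9999.0],
                [np.nan, 99.2]], dtype='f4')
print(to_flood_mask(wse))
# [[1 0]
#  [0 1]]
```

A `uint8` mask takes a quarter of the uncompressed space of an `f4` array of the same shape, although the cell in this section keeps the dtype reported by the raster.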

In [15]:
zarr_path, start_time = 'Bool', time.time()
makebool = True
z = zarr.group(store=zarr_store, path=zarr_path, overwrite=True)
for i, s3name in enumerate(obj_s3names):
    s3path = obj_s3paths[i]
    model_group = z.create_group(f'{s3name}', overwrite=True)
    rb, gt, ras = zarr_aws.getRasData(s3path)
    cell_size, xsize, ysize = gt[1], rb.XSize, rb.YSize
    np_dtype = zarr_aws.GDALtypeToDtype(zarr_aws.FindGDALtype(rb.DataType))
    data_ = zarr_aws.GetChunkAsArray(rb, 0, 0, xsize, ysize, boolize=makebool)
    model_group.create_dataset(f'r0_c0', data=data_, shape=(ysize,xsize), overwrite=True, dtype=np_dtype)
    model_group.attrs[f'r0_c0'] = zarr_aws.GetChunkMetaData(gt,xsize,ysize,0,0,cell_size)
    z.attrs[f'{s3name}'] = zarr_aws.getFullRASMetaData(ras,gt,rb,xsize,ysize)
zarr_aws.mprint('Total Time: {:0.2f} Minutes'.format((time.time()-start_time)/60))

Total Time: 2.72 Minutes

# Example with a larger group of data

In [4]:
start_time = time.time()
bucket1 = 'probmodelingrepository'
prefix1 = 'Augusta-Levee-AOP/ProductionRuns/outputs/Pluvial'
tifs = ['WSE_Pluvial_E001.tif','WSE_Pluvial_E002.tif','WSE_Pluvial_E003.tif',
        'WSE_Pluvial_E008.tif','WSE_Pluvial_E050.tif','WSE_Pluvial_E094.tif',
        'WSE_Pluvial_E1139.tif','WSE_Pluvial_E1181.tif','WSE_Pluvial_E1182.tif']
s3_tifs = [zarr_aws.get_files(bucket1,prefix1,tif)[0] for tif in tifs]
zarr_aws.mprint(f'## Number of tifs found: {len(s3_tifs)}\n\n - '+'\n - '.join(s3_tifs))

bucket2 = 'probmodelingrepository'
prefix2 = 'zarr'
zarr_file_name = 'demo.zarr'
zarr_file = f's3://{bucket2}/{prefix2}/{zarr_file_name}'
zarr_aws.mprint(f'## Save Zarr File at:\n\n - {zarr_file}')

obj_s3paths = [zarr_aws.s3Attributes(tif, '.tif', rtype='S3PATH') for tif in s3_tifs]
obj_s3names = [zarr_aws.s3Attributes(tif, '.tif', rtype='NAME').split('/')[-1] for tif in s3_tifs]
print(obj_s3names)

zarr_store = s3fs.S3Map(root=zarr_file, s3=s3fs.S3FileSystem(), check=False)

zarr_paths = ['WSE','Bool']
makebools = [False,True]

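for j, makebool in enumerate(makebools):
    zarr_path = zarr_paths[j]
    # Create the path group once per path; recreating it with overwrite=True
    # for every tif would erase the model groups written before it.
    z = zarr.group(store=zarr_store, path=zarr_path, overwrite=True)
    for i, s3name in enumerate(obj_s3names):
        s3path = obj_s3paths[i]
        model_group = z.create_group(f'{s3name}', overwrite=True)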
        rb, gt, ras = zarr_aws.getRasData(s3path)
        cell_size, xsize, ysize = gt[1], rb.XSize, rb.YSize
        np_dtype = zarr_aws.GDALtypeToDtype(zarr_aws.FindGDALtype(rb.DataType))
        data_ = zarr_aws.GetChunkAsArray(rb, 0, 0, xsize, ysize, boolize=makebool)
        model_group.create_dataset(f'r0_c0', data=data_, shape=(ysize,xsize), overwrite=True, dtype=np_dtype)
        model_group.attrs[f'r0_c0'] = zarr_aws.GetChunkMetaData(gt,xsize,ysize,0,0,cell_size)
        z.attrs[f'{s3name}'] = zarr_aws.getFullRASMetaData(ras,gt,rb,xsize,ysize)
        zarr_aws.mprint(f'{i+1},{j+1} Time: {(time.time()-start_time)/60:0.2f} Minutes')
zarr_aws.mprint(f'Total Time: {(time.time()-start_time)/60:0.2f} Minutes')

## Number of tifs found: 9

 - probmodelingrepository:Augusta-Levee-AOP/ProductionRuns/outputs/Pluvial/E001/WSE_Pluvial_E001.tif
 - probmodelingrepository:Augusta-Levee-AOP/ProductionRuns/outputs/Pluvial/E002/WSE_Pluvial_E002.tif
 - probmodelingrepository:Augusta-Levee-AOP/ProductionRuns/outputs/Pluvial/E003/WSE_Pluvial_E003.tif
 - probmodelingrepository:Augusta-Levee-AOP/ProductionRuns/outputs/Pluvial/E008/WSE_Pluvial_E008.tif
 - probmodelingrepository:Augusta-Levee-AOP/ProductionRuns/outputs/Pluvial/E050/WSE_Pluvial_E050.tif
 - probmodelingrepository:Augusta-Levee-AOP/ProductionRuns/outputs/Pluvial/E094/WSE_Pluvial_E094.tif
 - probmodelingrepository:Augusta-Levee-AOP/ProductionRuns/outputs/Pluvial/E1139/WSE_Pluvial_E1139.tif
 - probmodelingrepository:Augusta-Levee-AOP/ProductionRuns/outputs/Pluvial/E1181/WSE_Pluvial_E1181.tif
 - probmodelingrepository:Augusta-Levee-AOP/ProductionRuns/outputs/Pluvial/E1182/WSE_Pluvial_E1182.tif

## Save Zarr File at:

 - s3://probmodelingrepository/zarr/demo.zarr

['WSE_Pluvial_E001', 'WSE_Pluvial_E002', 'WSE_Pluvial_E003', 'WSE_Pluvial_E008', 'WSE_Pluvial_E050', 'WSE_Pluvial_E094', 'WSE_Pluvial_E1139', 'WSE_Pluvial_E1181', 'WSE_Pluvial_E1182']


0 Time: 1.03 Minutes

1 Time: 1.45 Minutes

2 Time: 1.87 Minutes

3 Time: 2.27 Minutes

4 Time: 2.74 Minutes

5 Time: 3.17 Minutes

6 Time: 3.64 Minutes

7 Time: 4.08 Minutes

8 Time: 4.57 Minutes

0 Time: 5.05 Minutes

1 Time: 5.74 Minutes

2 Time: 6.16 Minutes

3 Time: 6.56 Minutes

4 Time: 6.99 Minutes

5 Time: 7.41 Minutes

6 Time: 7.86 Minutes

7 Time: 8.27 Minutes

8 Time: 8.67 Minutes

Total Time: 8.67 Minutes