# SlideRule Output to S3

```{admonition} Learning Objectives
- basics of Parquet and GeoParquet formats
- how to output SlideRule results as Parquet files on S3
- how to work with outputs on S3
```

In [1]:
from sliderule import sliderule, icesat2
import geopandas as gpd
import s3fs
import os
import boto3

```{tip}
Parquet is a cloud-optimized format for tabular data. Unlike CSV files, which are stored as plain text and written row-wise, Parquet is a columnar binary format that is well suited to hosting on S3 for data analysis.
```
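
To make the row-wise versus columnar distinction concrete, here is a toy sketch using only the Python standard library. It is not Parquet itself, and the table values are made up for illustration; it only shows why a columnar layout lets you read one column without parsing the rest.

```python
import csv
import io

# Toy table loosely modeled on elevation results (values are made up)
rows = [
    {"rgt": 714, "cycle": 2, "h_mean": 1944.36},
    {"rgt": 714, "cycle": 2, "h_mean": 1944.48},
    {"rgt": 272, "cycle": 1, "h_mean": 3010.12},
]

# Row-wise (CSV-like): every record is written out together as text
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["rgt", "cycle", "h_mean"])
writer.writeheader()
writer.writerows(rows)

# To extract one column we still have to parse every field of every row
buffer.seek(0)
h_mean_from_csv = [float(record["h_mean"]) for record in csv.DictReader(buffer)]

# Columnar (Parquet-like): the values of each column are stored together
columns = {name: [record[name] for record in rows] for name in rows[0]}

# A single column can be read without touching the others
h_mean_from_columns = columns["h_mean"]

print(h_mean_from_csv == h_mean_from_columns)
```

Real Parquet files add typed binary encoding, compression, and per-column statistics on top of this layout, which is what makes partial reads from S3 efficient.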

The SlideRule documentation has an extensive description of [Parquet](https://slideruleearth.io/web/rtd/user_guide/GeoParquet.html) and a [tutorial](https://slideruleearth.io/web/rtd/tutorials/user/geoparquet_output.html) with code examples.

Here we show a basic example of writing SlideRule output to S3. This example was put together for ICESat-2 Hackweek 2023, so we are using the CryoCloud JupyterHub, which has a preconfigured S3 bucket.

## Set Area of Interest

We will use a GeoJSON file from the SlideRule GitHub repository that covers Grand Mesa, Colorado.

In [2]:
gfa = gpd.read_file('https://raw.githubusercontent.com/ICESat2-SlideRule/sliderule-python/main/data/grandmesa.geojson')

In [3]:
folium_map = gfa.explore(tiles="Stamen Terrain", 
                         style_kwds=dict(fill=False, color='magenta'),
                        )
folium_map

## Configure SlideRule

In [4]:
# Connect to server
icesat2.init("slideruleearth.io")

In [5]:
# Sliderule Processing Parameters
parms = {
    "poly": sliderule.toregion(gfa)["poly"],
    "srt": icesat2.SRT_LAND,
    "cnf": icesat2.CNF_SURFACE_HIGH,
    "len": 40.0,
    "res": 20.0,
    "maxi": 6
}

### Get Temporary AWS Credentials (JupyterHub)

```{warning}
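This will only work on the CryoCloud JupyterHub. It relies on the `AWS_WEB_IDENTITY_TOKEN_FILE`, `AWS_ROLE_ARN`, and `JUPYTERHUB_CLIENT_ID` environment variables being set.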
```

In [6]:
# Get Temporary AWS Credentials on CryoCloud JupyterHub
client = boto3.client('sts')
# https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sts/client/assume_role_with_web_identity.html

with open(os.environ['AWS_WEB_IDENTITY_TOKEN_FILE']) as f:
    TOKEN = f.read()

response = client.assume_role_with_web_identity(
    RoleArn=os.environ['AWS_ROLE_ARN'],
    RoleSessionName=os.environ['JUPYTERHUB_CLIENT_ID'],
    WebIdentityToken=TOKEN,
    DurationSeconds=3600
)

ACCESS_KEY_ID = response['Credentials']['AccessKeyId']
SECRET_ACCESS_KEY_ID = response['Credentials']['SecretAccessKey']
SESSION_TOKEN = response['Credentials']['SessionToken']
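
The credentials above are temporary (`DurationSeconds=3600`), so a long session may outlive them. The STS response also carries an `Expiration` timestamp under `Credentials`. A minimal sketch of checking it might look like the following; `seconds_until_expiry`, `needs_refresh`, and the five-minute margin are hypothetical choices, and the credentials dictionary is a stand-in with a made-up date rather than a real response.

```python
from datetime import datetime, timedelta, timezone


def seconds_until_expiry(credentials, now=None):
    """Seconds left before the temporary credentials expire (hypothetical helper)."""
    if now is None:
        now = datetime.now(timezone.utc)
    return (credentials["Expiration"] - now).total_seconds()


def needs_refresh(credentials, margin_seconds=300, now=None):
    """True when fewer than margin_seconds remain (the margin is an arbitrary choice)."""
    return seconds_until_expiry(credentials, now) < margin_seconds


# Stand-in for response['Credentials']; the issue time is made up
issued = datetime(2023, 8, 8, 12, 0, tzinfo=timezone.utc)
fake_credentials = {"Expiration": issued + timedelta(seconds=3600)}

print(needs_refresh(fake_credentials, now=issued))                          # False
print(needs_refresh(fake_credentials, now=issued + timedelta(minutes=58)))  # True
```

With a real response you would pass `response['Credentials']` and, if a refresh is needed, rerun the cell that requests the credentials.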

### Configure Parquet and S3 Output

In [7]:
S3_OUTPUT = 's3://nasa-cryo-scratch/sliderule-example/grandmesa.parquet'

parms["output"] = {
    "path": S3_OUTPUT, 
    "format": "parquet", 
    "open_on_complete": False,
    "region": "us-west-2",
    "credentials": {
         "aws_access_key_id": ACCESS_KEY_ID,
         "aws_secret_access_key": SECRET_ACCESS_KEY_ID,
         "aws_session_token": SESSION_TOKEN
     }
}
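
SlideRule takes the output location as a single `s3://` URI, but some tools expect the bucket and the object key separately. As a small sketch, the standard library can split the URI; `split_s3_uri` is a hypothetical helper, not part of SlideRule or boto3.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key). Hypothetical helper."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    # The path starts with '/', which is not part of the object key
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = split_s3_uri("s3://nasa-cryo-scratch/sliderule-example/grandmesa.parquet")
print(bucket)  # nasa-cryo-scratch
print(key)     # sliderule-example/grandmesa.parquet
```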

### Run SlideRule processing

In [8]:
%%time

output_path = icesat2.atl06p(parms, version='006')
output_path

CPU times: user 96.5 ms, sys: 12.6 ms, total: 109 ms
Wall time: 29.2 s


's3://nasa-cryo-scratch/sliderule-example/grandmesa.parquet'

## Read output from S3


In [9]:
gf = gpd.read_parquet(output_path)

In [10]:
print("Start:", gf.index.min().strftime('%Y-%m-%d'))
print("End:", gf.index.max().strftime('%Y-%m-%d'))
print("Reference Ground Tracks: {}".format(gf["rgt"].unique()))
print("Cycles: {}".format(gf["cycle"].unique()))
print("Elevation Measurements: {} ".format(gf.shape[0]))
gf.head(2)

Start: 2018-10-16
End: 2023-03-07
Reference Ground Tracks: [ 714  272 1156 1179  737  295  211  234]
Cycles: [ 2  1  3  4  5  7  6  8  9 10 11 12 13 14 15 16 17 18]
Elevation Measurements: 328557 


| time | extent_id | distance | segment_id | rgt | rms_misfit | gt | dh_fit_dy | n_fit_photons | h_sigma | pflags | spot | h_mean | cycle | w_surface_window_final | dh_fit_dx | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2019-02-13 05:05:23.099193856 | 3215570299298775254 | 15716480.0 | 784673 | 714 | 1.422327 | 30 | 0.0 | 14 | 0.385593 | 0 | 3 | 1944.361847 | 2 | 12.627211 | -0.192554 | POINT (-108.27924 39.13310) |
| 2019-02-13 05:05:23.102031616 | 3215570299298775258 | 15716500.0 | 784674 | 714 | 1.64213 | 30 | 0.0 | 14 | 0.466869 | 0 | 3 | 1944.483824 | 2 | 9.10442 | 0.195641 | POINT (-108.27926 39.13292) |


More than 300,000 points is a lot to visualize! Let's randomly sample 1,000 of them and plot them on our map.

In [11]:
# Need to turn timestamps into strings first
points = gf.sample(1000).reset_index()
points['time'] = points.time.dt.strftime('%Y-%m-%d')
points.explore(column='h_mean', m=folium_map)
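
The timestamp conversion above is needed because the interactive map is built from JSON-like data, and datetime objects are not JSON serializable by default. This is our reading of the comment in the cell rather than something the notebook states. A standard-library sketch of the same issue, using a made-up record:

```python
import json
from datetime import datetime

# A single made-up record with a datetime value, similar to one row of the results
record = {"time": datetime(2019, 2, 13, 5, 5, 23), "h_mean": 1944.361847}

# Serializing the raw datetime fails
try:
    json.dumps(record)
    serializable = True
except TypeError:
    serializable = False

# Formatting the timestamp as a string first makes the record serializable
record["time"] = record["time"].strftime("%Y-%m-%d")
as_json = json.dumps(record)

print(serializable)
print(as_json)
```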

## Summary

We processed all ATL03 v006 data covering Grand Mesa, Colorado, from 2018-10-16 to 2023-03-07 into ATL06-SR elevations. We wrote the results in GeoParquet format to an AWS S3 bucket and quickly visualized a sample of them.