# Using CryoCloud S3 Scratch Bucket

CryoCloud JupyterHub has a preconfigured S3 "Scratch Bucket" that *automatically deletes files after 7 days*. This is a great resource for experimenting with large datasets and working collaboratively on a shared dataset with other CryoCloud users.

```{tip}
This notebook walks through uploading, downloading, and streaming data from an S3 scratch bucket.
```

## Access the scratch bucket

The CryoCloud scratch bucket is hosted at `s3://nasa-cryo-scratch`. CryoCloud JupyterHub automatically sets an environment variable, `SCRATCH_BUCKET`, that appends your GitHub username as a suffix to the S3 URL. This is intended to keep track of file ownership, stay organized, and prevent users from overwriting each other's data!

```{warning}
Everyone has full access to the scratch bucket, so be careful not to overwrite data from other users when uploading files. Also, any data you put there will be deleted 7 days after it is uploaded.
```

```{hint}
If you need permanent storage, refer to [these docs](./Instructions_for_configuring_AWS_S3_bucket.ipynb).
```


We'll use the [S3FS](https://s3fs.readthedocs.io/en/latest/) Python package, which provides a nice interface for interacting with S3 buckets.

In [1]:
import os
import s3fs
import fsspec
import xarray as xr
import geopandas as gpd

In [2]:
# My GitHub username is `scottyhq`
os.environ['SCRATCH_BUCKET']

's3://nasa-cryo-scratch/scottyhq'
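Because everyone can write anywhere in the bucket, it can help to build every remote path from `SCRATCH_BUCKET` rather than typing bucket URLs by hand. The following is a minimal sketch of that idea; the helper name `scratch_path` is hypothetical (it is not part of s3fs or CryoCloud) and it only assumes that `SCRATCH_BUCKET` is set as described above.

```python
import os


def scratch_path(*parts: str) -> str:
    """Join path components onto the per-user scratch prefix.

    `scratch_path` is an illustrative helper, not a CryoCloud or s3fs API.
    """
    base = os.environ.get("SCRATCH_BUCKET")
    if not base:
        # Outside the JupyterHub the variable is normally not defined.
        raise RuntimeError("SCRATCH_BUCKET is not set in this environment")
    # Drop stray slashes so components never collapse or double up.
    cleaned = [p.strip("/") for p in parts if p.strip("/")]
    return "/".join([base.rstrip("/"), *cleaned])
```

With the value shown above, `scratch_path("example_ATL03", "granule.h5")` would produce a path under `s3://nasa-cryo-scratch/scottyhq/`, so uploads stay inside your own prefix.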

In [3]:
# Here you see I previously uploaded files
s3 = s3fs.S3FileSystem()
s3.ls(os.environ['SCRATCH_BUCKET'])

['nasa-cryo-scratch/scottyhq/ATL03_20230103090928_02111806_006_01.h5',
 'nasa-cryo-scratch/scottyhq/IS2_Alaska.parquet',
 'nasa-cryo-scratch/scottyhq/Notes.txt',
 'nasa-cryo-scratch/scottyhq/example_ATL03']

## Uploading data

It's great to store data in S3 buckets because this storage features very high network throughput. If many users are simultaneously accessing the same file on a spinning networked hard drive (`/home/jovyan/shared`), performance can be quite slow. S3 has much higher performance for such cases.

### Single file

In [4]:
local_file = '/tmp/ATL03_20230103090928_02111806_006_01.h5'

remote_object = f"{os.environ['SCRATCH_BUCKET']}/ATL03_20230103090928_02111806_006_01.h5"

s3.upload(local_file, remote_object)

[None]

In [5]:
s3.stat(remote_object)

{'ETag': '"489f0191a8e9c844576ff2d18adfea59-21"',
 'LastModified': datetime.datetime(2023, 7, 20, 16, 40, 30, tzinfo=tzutc()),
 'size': 1063571816,
 'name': 'nasa-cryo-scratch/scottyhq/ATL03_20230103090928_02111806_006_01.h5',
 'type': 'file',
 'StorageClass': 'STANDARD',
 'VersionId': None,
 'ContentType': 'application/x-hdf5'}

### Directory

In [6]:
local_dir = '/tmp/example_ATL03'

!ls -lh {local_dir}

total 6.2G
-rw-r--r-- 1 jovyan jovyan 1015M Jul 20 15:29 ATL03_20230103090928_02111806_006_01.h5
-rw-r--r-- 1 jovyan jovyan  5.2G Jul 20 15:32 ATL03_20230108204519_02951802_006_01.h5


In [7]:
remote_prefix = f"{os.environ['SCRATCH_BUCKET']}/example_ATL03"

s3.upload(local_dir, remote_prefix, recursive=True)

[None, None]

In [8]:
print(remote_prefix)
s3.ls(remote_prefix)

s3://nasa-cryo-scratch/scottyhq/example_ATL03


['nasa-cryo-scratch/scottyhq/example_ATL03/ATL03_20230103090928_02111806_006_01.h5',
 'nasa-cryo-scratch/scottyhq/example_ATL03/ATL03_20230108204519_02951802_006_01.h5',
 'nasa-cryo-scratch/scottyhq/example_ATL03/example_ATL03']

## Accessing Data

Some software packages allow you to stream data directly from S3 buckets, but you can always pull objects from S3 and work with local file paths. For file formats that are not cloud-optimized (like HDF!), this often gives the best performance.

```{important}
For best performance, do not work with data in your home directory. Instead, use a local scratch space like `/tmp`.
```

In [9]:
local_object = '/tmp/test.h5'
s3.download(remote_object, local_object)

[None]

In [10]:
ds = xr.open_dataset(local_object, group='/gt3r/heights')
ds

```{tip}
If you don't want to think about downloading files, you can let `fsspec` handle this behind the scenes for you! This way, you only need to think about remote paths.
```

In [11]:
fs = fsspec.filesystem("simplecache", 
                       cache_storage='/tmp/files/',
                       same_names=True,  
                       target_protocol='s3',
                       )

In [12]:
# The `simplecache` setting above will download the full file to /tmp/files
print(remote_object)
with fs.open(remote_object) as f:
    ds = xr.open_dataset(f.name, group='/gt3r/heights') # NOTE: pass f.name for local cached path

s3://nasa-cryo-scratch/scottyhq/ATL03_20230103090928_02111806_006_01.h5


In [13]:
ds

## Cloud-optimized formats

Other formats like [COG](https://www.cogeo.org), [Zarr](https://zarr.readthedocs.io/en/stable/), and [Parquet](https://parquet.apache.org) are 'cloud-optimized' and allow for very efficient streaming directly from S3. In other words, you do not need to download entire files for best performance, and you can easily read subsets.

In [14]:
gf = gpd.read_parquet('s3://nasa-cryo-scratch/scottyhq/IS2_Alaska.parquet')
gf.head(2)

| | producer_granule_id | time_start | time_end | datetime | geometry |
|---|---|---|---|---|---|
| 0 | ATL03_20181014015337_02360103_006_02.h5 | 2018-10-14 01:53:36.912 | 2018-10-14 01:59:02.315 | 2018-10-14 01:56:19.613500 | POLYGON ((-166.98121 80.05247, -167.61386 80.0... |
| 1 | ATL03_20181014130413_02430105_006_02.h5 | 2018-10-14 13:04:12.567 | 2018-10-14 13:09:37.946 | 2018-10-14 13:06:55.256500 | POLYGON ((-130.81600 80.02773, -131.44724 80.0... |


## Advanced: Access Scratch bucket outside of JupyterHub

Let's say you have a lot of files on your laptop that you want to work with on CryoCloud. The S3 bucket is a convenient way to upload large datasets for collaborative analysis. To do this, you need to copy AWS credentials from the JupyterHub to use on other machines. More extensive documentation on this workflow can be found in this repository: https://github.com/scottyhq/jupyter-cloud-scoped-creds.

### Step 1
Install the latest [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html). **From a CryoCloud JupyterHub terminal, run the following**:

```
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
./aws/install --bin-dir /home/jovyan/.local/bin --install-dir /home/jovyan/.local/aws-cli --update
```

### Step 2
Get temporary credentials using the AWS CLI:
```
aws sts assume-role-with-web-identity \
    --role-arn $AWS_ROLE_ARN \
    --role-session-name $JUPYTERHUB_CLIENT_ID \
    --web-identity-token file://$AWS_WEB_IDENTITY_TOKEN_FILE \
    --duration-seconds 3600
```

If the above command is successful, the result will look something like this, which you can copy into a Jupyter notebook cell:

```
{
    "Credentials": {
        "AccessKeyId": "ASIAXXXXXXXXXX",
        "SecretAccessKey": "UULopNYXXXXXXXX",
        "SessionToken": "IQoxxXXXXXXXXdlc3Qt....."
    }
}
```

### Step 3
Use the returned credentials on another machine, together with the same examples above, to upload or download data!

In [None]:
import s3fs

# Paste the temporary credentials returned in Step 2
creds = {
    "Credentials": {
        "AccessKeyId": "ASIAXXXXXXXXXX",
        "SecretAccessKey": "UULopNYXXXXXXXX",
        "SessionToken": "IQoxxXXXXXXXXdlc3Qt.....",
    }
}

s3 = s3fs.S3FileSystem(key=creds['Credentials']['AccessKeyId'],
                       secret=creds['Credentials']['SecretAccessKey'],
                       token=creds['Credentials']['SessionToken'])

s3.ls('s3://nasa-cryo-scratch/scottyhq/')

```
['nasa-cryo-scratch/scottyhq/ATL03_20230103090928_02111806_006_01.h5',
 'nasa-cryo-scratch/scottyhq/IS2_Alaska.parquet',
 'nasa-cryo-scratch/scottyhq/Notes.txt',
 'nasa-cryo-scratch/scottyhq/example_ATL03']
```
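Instead of pasting the credentials into a cell by hand, one option is to save the JSON printed in Step 2 to a file and parse it. The sketch below assumes the JSON has the `Credentials` structure shown in Step 2; the helper name `s3fs_kwargs_from_sts` and the file name `creds.json` are hypothetical.

```python
import json


def s3fs_kwargs_from_sts(sts_json: str) -> dict:
    """Map the Step 2 JSON output onto the keyword arguments used in Step 3.

    `s3fs_kwargs_from_sts` is an illustrative helper, not part of s3fs.
    """
    creds = json.loads(sts_json)["Credentials"]
    return {
        "key": creds["AccessKeyId"],
        "secret": creds["SecretAccessKey"],
        "token": creds["SessionToken"],
    }


# Possible usage on another machine (assumes the Step 2 output was saved
# to a hypothetical file named creds.json):
#
#   import s3fs
#   from pathlib import Path
#   s3 = s3fs.S3FileSystem(**s3fs_kwargs_from_sts(Path("creds.json").read_text()))
```

Because Step 2 requests the credentials with `--duration-seconds 3600`, they stop working after an hour, and the command has to be run again to get a fresh set.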