# Overview

When you are working in the NASA Openscapes Hub, the default storage location is the `HOME` directory mounted to the compute instance (the cloud computer that is doing the computations). On AWS with an [EC2](https://aws.amazon.com/ec2/) compute instance, the `HOME` directory is in an [AWS Elastic File System (EFS)](https://aws.amazon.com/efs/). This drive is persistent across server restarts and is a great place to store your code. However, the `HOME` directory is not a good place to store data: it is very expensive, and it can also be quite slow to read from and write to.

To that end, the hub gives every user access to two [AWS S3](https://aws.amazon.com/s3/) buckets: a "scratch" bucket for short-term storage, whose contents are automatically deleted every seven days, and a "persistent" bucket for longer-term storage. S3 buckets offer fast reads and writes, and storage is relatively inexpensive. The buckets are accessible from inside the hub through these environment variables:

- `$SCRATCH_BUCKET` pointing to `s3://openscapeshub-scratch/[your-username]` (deleted every seven days)
- `$PERSISTENT_BUCKET` pointing to `s3://openscapeshub-persistent/[your-username]`

We can interact with these buckets on the command line with the `awsv2` CLI tool, or from Python with the `boto3` and/or `s3fs` packages.
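As a minimal sketch of how these variables can be read from Python, the snippet below looks up `SCRATCH_BUCKET` and splits its value into a bucket name and a per-user prefix. The helper `split_s3_url` and the fallback value `s3://openscapeshub-scratch/your-username` are assumptions made for illustration only; on the hub the variable is already set for you.

```python
import os
from urllib.parse import urlparse


def split_s3_url(url):
    """Split 's3://bucket/prefix' into (bucket, prefix). Hypothetical helper."""
    parsed = urlparse(url)
    if parsed.scheme != "s3":
        raise ValueError(f"not an S3 URL: {url!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# The fallback is a placeholder so the sketch also runs outside the hub.
scratch = os.environ.get(
    "SCRATCH_BUCKET", "s3://openscapeshub-scratch/your-username"
)

bucket, prefix = split_s3_url(scratch)
print(f"bucket={bucket} prefix={prefix}")
```

Splitting the URL this way is only needed for tools that take the bucket and key separately, as `boto3` does; `s3fs` accepts the full `s3://...` string directly.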

## Reading and writing to the `$SCRATCH_BUCKET`

We will start by accessing the same data we did in the [Earthdata Cloud Clinic](/tutorials/Earthdata-cloud-clinic.ipynb) - reading it into memory as an xarray object and subsetting it.

In [1]:
import earthaccess 
from pprint import pprint
import xarray as xr
import hvplot.xarray #plot
import os
import tempfile
import s3fs # aws s3 access

In [2]:
auth = earthaccess.login()

In [3]:
data_name = "SEA_SURFACE_HEIGHT_ALT_GRIDS_L4_2SATS_5DAY_6THDEG_V_JPL2205"

results = earthaccess.search_data(
    short_name=data_name,
    cloud_hosted=True,
    temporal=("2021-07-01", "2021-09-30"),
)

Granules found: 18


In [4]:
ds = xr.open_mfdataset(earthaccess.open(results))
ds

Opening 18 granules, approx size: 0.16 GB
using endpoint: https://archive.podaac.earthdata.nasa.gov/s3credentials



[Output: `xarray.Dataset` summary. Every array is dask-backed with one chunk per granule (18 chunks). The float32 arrays have shapes (18, 2160, 2) at 303.75 kiB and (18, 960, 2) at 135.00 kiB; one datetime64[ns] array has shape (18, 2) at 288 B; and two gridded float32 arrays have shape (18, 960, 2160), chunked as (1, 960, 2160), at 142.38 MiB each (7.91 MiB per chunk).]


In [5]:
ds_subset = ds['SLA'].sel(Latitude=slice(15.8, 35.9), Longitude=slice(234.5,260.5)) 
ds_subset

[Output: dask-backed `xarray.DataArray` of float32 values with shape (18, 120, 156), chunked as (1, 120, 156); 1.29 MiB in total, 73.12 kiB per chunk, 18 chunks.]


### Home directory

Imagine this `ds_subset` object is now an important intermediate dataset, or the result of a complex analysis, and we want to save it. Our default action might be to just save it to our `HOME` directory. This is simple, but we want to avoid it: it incurs significant storage costs, and reading the data back later will be slow.

In [6]:
ds_subset.to_netcdf("test.nc") # avoid this

### Use the s3fs package to interact with our S3 bucket

[s3fs](https://s3fs.readthedocs.io/en/latest/) is a Python library that allows us to interact with S3 objects in a file-system like manner.

We will start by listing everything in our scratch bucket:

In [7]:
# Create an S3FileSystem instance
s3 = s3fs.S3FileSystem()

# Get scratch and persistent buckets
scratch = os.environ["SCRATCH_BUCKET"]
persistent = os.environ["PERSISTENT_BUCKET"]

s3.ls(scratch)

['openscapeshub-scratch/ateucher/foo.txt']

In [8]:
s3.ls(persistent)

['openscapeshub-persistent/ateucher/bar.txt']

## Save dataset as netcdf on SCRATCH bucket

Next we can save `ds_subset` as a NetCDF file. This involves writing it to a temporary local file first, and then uploading that file to the scratch bucket:

In [9]:
# Where we want to store it:
out_s3_file_path = os.path.join(scratch, "test123.nc")

# Create a temporary intermediate file and save it to the bucket
with tempfile.NamedTemporaryFile(suffix = ".nc") as tmp:
    ds_subset.to_netcdf(tmp.name, engine = 'h5netcdf')
    s3.put(tmp.name, out_s3_file_path)

# Ensure the file is there
s3.ls(scratch)

['openscapeshub-scratch/ateucher/foo.txt',
 'openscapeshub-scratch/ateucher/test123.nc']

And we can open it to ensure it worked:

In [10]:
ds_subs = xr.open_dataarray(s3.open(out_s3_file_path), engine='h5netcdf')

ds_subs

In [11]:
ds_subs.hvplot.image(x='Longitude', y='Latitude', cmap='RdBu', clim=(-0.5, 0.5), title="Sea Level Anomaly Estimate (m)")
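The write-locally-then-upload pattern used above can be wrapped in a small reusable function. The following is a sketch under stated assumptions, not part of `s3fs` or `xarray`: `save_via_tempfile`, `demo_write`, and `demo_dest` are hypothetical names, and the upload step is stood in for by `shutil.copy` so the example runs without access to a bucket.

```python
import os
import shutil
import tempfile


def save_via_tempfile(write_fn, upload_fn, dest, suffix=".nc"):
    """Write data to a temporary local file, then hand it to an uploader.

    write_fn(local_path) writes the data to disk;
    upload_fn(local_path, dest) copies it to its final location.
    Hypothetical helper for illustration.
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        local_path = os.path.join(tmpdir, "data" + suffix)
        write_fn(local_path)
        upload_fn(local_path, dest)
    # The temporary directory and its contents are removed at this point.
    return dest


# Stand-in writer so the sketch runs anywhere.
def demo_write(path):
    with open(path, "wb") as f:
        f.write(b"example bytes")


# Stand-in destination: a local path instead of an S3 key.
demo_dest = os.path.join(tempfile.mkdtemp(), "test123.nc")
save_via_tempfile(demo_write, shutil.copy, demo_dest)
print(os.path.exists(demo_dest))
```

On the hub, the same helper could presumably be called with `lambda p: ds_subset.to_netcdf(p, engine="h5netcdf")` as `write_fn`, `s3.put` as `upload_fn`, and `out_s3_file_path` as `dest`. A temporary directory is used here rather than `NamedTemporaryFile` so the writer opens a path that is not already held open by another file handle.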

## Use the persistent bucket

If we decide this is a file we want to keep around for a longer time period, we can move it to our persistent bucket. We can even make a subdirectory in our persistent bucket to keep us organized:

In [12]:
dest_dir = os.path.join(persistent, "my-analysis-data")

# Make directory in persistent bucket
s3.mkdir(dest_dir)

# Move the file
s3.mv(out_s3_file_path, dest_dir)

# Check the scratch and persistent bucket listings:
s3.ls(scratch)

['openscapeshub-scratch/ateucher/foo.txt']

In [13]:
s3.ls(persistent)

['openscapeshub-persistent/ateucher/bar.txt',
 'openscapeshub-persistent/ateucher/my-analysis-data']

In [14]:
s3.ls(os.path.join(persistent, "my-analysis-data"))

['openscapeshub-persistent/ateucher/my-analysis-data']
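S3 has no real directories: `my-analysis-data` is only a key prefix. The last listing above returns the `my-analysis-data` key itself, which suggests the file may have been stored under that name rather than inside a folder. One way to make sure a moved file keeps its original name under the prefix is to spell out the full destination key. A minimal sketch, where `dest_key` is a hypothetical helper and the paths are the ones shown in this tutorial:

```python
import posixpath


def dest_key(src, dest_dir):
    """Build 'dest_dir/<file name of src>' using forward slashes.

    Hypothetical helper; S3 keys always use '/' as the separator.
    """
    return posixpath.join(dest_dir, posixpath.basename(src))


src = "s3://openscapeshub-scratch/ateucher/test123.nc"
dest_dir = "s3://openscapeshub-persistent/ateucher/my-analysis-data"

print(dest_key(src, dest_dir))
```

With this helper, the move could be written as `s3.mv(out_s3_file_path, dest_key(out_s3_file_path, dest_dir))`, so that the object should end up at `my-analysis-data/test123.nc`.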

## Try to save directly from xarray object to .nc in S3 scratch bucket

This is currently not working: it writes a file, but it raises an error and the resulting file is not valid. The approach comes from [this xarray issue](https://github.com/pydata/xarray/issues/4122), but [this comment](https://github.com/pydata/xarray/issues/4122#issuecomment-1400545067) suggests it is not possible without first writing a temporary local file.

In [15]:
# s3_target_path = os.path.join("simplecache::"+scratch, "test12345.nc")

# with fsspec.open(s3_target_path, mode="wb", s3=dict(profile='default')) as ff:
#     ds_subset.to_netcdf(ff)