In [1]:
import xarray as xr
from pathlib import Path
import glob

## Create Synthetic Data from Xarray

The Datarmor compute nodes have no internet access, so the synthetic example data cannot be downloaded. This notebook therefore uses real data already available on the datalake:
/home/ref-oc-public/modeles_marc/f1_e4000/best_estimate/2014/MARC_F1-MARS3D-MANGAE4000_20141231T2*Z.nc
These files contain hourly computed output in netCDF format.

Let's create a tarball from these netCDF files.

In [2]:
!cd /home/ref-oc-public/modeles_marc/f1_e4000/best_estimate/2014 ; tar Pcvf ${HOME}/git/xtar/data/realdata.tar MARC_F1-MARS3D-MANGAE4000_20141231T2*Z.nc

MARC_F1-MARS3D-MANGAE4000_20141231T2000Z.nc
MARC_F1-MARS3D-MANGAE4000_20141231T2100Z.nc
MARC_F1-MARS3D-MANGAE4000_20141231T2200Z.nc
MARC_F1-MARS3D-MANGAE4000_20141231T2300Z.nc
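The same packing step can also be done from Python with the standard-library `tarfile` module. The following is only a minimal sketch: `make_tarball` is a hypothetical helper, and the demo uses small placeholder files with made-up names rather than the real MARC output. Like the `tar` call above, it writes an uncompressed archive and stores only the file names, without the directory prefix.

```python
import tarfile
import tempfile
from pathlib import Path


def make_tarball(src_dir, pattern, tar_path):
    """Pack the files in src_dir that match pattern into an uncompressed tar."""
    src_dir = Path(src_dir)
    members = sorted(src_dir.glob(pattern))
    # Mode "w" writes a plain tar archive without compression.
    with tarfile.open(tar_path, "w") as tar:
        for path in members:
            # arcname keeps only the file name, like running tar after `cd`.
            tar.add(path, arcname=path.name)
    return [p.name for p in members]


# Self-contained demo with placeholder files (hypothetical names, not real data).
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "src"
    src.mkdir()
    for hour in ("2000", "2100"):
        (src / f"demo_20141231T{hour}Z.nc").write_bytes(b"placeholder")
    names = make_tarball(src, "demo_20141231T2*Z.nc", Path(tmp) / "demo.tar")
    print(names)
```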


## ratarmount

We will use [ratarmount](https://github.com/mxmlnkn/ratarmount) to create an index file with file names, ownership, permission flags, and offset information to be stored at the TAR file's location. Once the index is created, ratarmount then offers a FUSE mount integration for easy access to the files.
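To make the idea of an offset index concrete, here is a small sketch that uses only the standard-library `tarfile` module. It is not how ratarmount is implemented; `index_tar` is a hypothetical helper that collects the kind of per-file metadata described above (name, data offset, size, permission bits, owner) and then uses an offset to read one file directly, without scanning the archive again.

```python
import io
import tarfile


def index_tar(fileobj):
    """Return one record per regular file in an uncompressed tar archive."""
    records = []
    with tarfile.open(fileobj=fileobj, mode="r:") as tar:
        for member in tar:
            if member.isfile():
                records.append({
                    "name": member.name,
                    "offset": member.offset_data,  # byte offset of the file data
                    "size": member.size,
                    "mode": member.mode,           # permission bits
                    "uid": member.uid,             # numeric owner id
                })
    return records


# Build a tiny in-memory archive so the example is self-contained.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name, payload in [("a.nc", b"x" * 10), ("b.nc", b"y" * 700)]:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

buf.seek(0)
records = index_tar(buf)

# With the index, a reader can seek straight to the second file's data.
buf.seek(records[1]["offset"])
second_payload = buf.read(records[1]["size"])
print([r["name"] for r in records], len(second_payload))
```

Random access of this kind is what makes it practical to expose a large archive as a directory tree: opening one member does not require reading the members before it.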

**NOTE:** Since `ratarmount` uses FUSE to mount the TAR file as a "filesystem in user space", you will need FUSE installed.  On OSX, you will need to install [osxfuse](https://osxfuse.github.io/) *by hand*.  On Linux, you can install `libfuse` using `conda`, if it is not already installed on your system.

**NOTE:** If you have `libfuse` on your system and it is *older* than October 19, 2018 (i.e., < 3.3.0 for `fuse3` or < 2.9.9 for `fuse2`), and you have either Lustre or GPFS filesystems, `ratarmount` will fail with an error saying that your filesystem is unsupported.  The solution is to upgrade to a newer version of `libfuse`.

**NOTE:** If you install the `libfuse` Conda-Forge package on a Linux system, then you need to set the `FUSE_LIBRARY_PATH` environment variable to the location of the `libfuse.so` library file (e.g., `export FUSE_LIBRARY_PATH=/path/to/libfuse3.so`).  If you do not do this, then `fusepy` (another dependency of `ratarmount`) will use the system `libfuse.so` file, which might be old.

**NOTE:** Currently, the Conda-Forge version of `libfuse` does *not* build the `libfuse` utilities such as `fusermount3`.  However, `fusepy` uses these utility functions under the hood when trying to mount the userspace filesystem.  If you install the most recent version of `libfuse` and properly set the location of `libfuse` so that `fusepy` can find it (i.e., `FUSE_LIBRARY_PATH`), you will get an error saying that `fusermount3` cannot be found.
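Before mounting, it can help to check the three things these notes depend on: whether `FUSE_LIBRARY_PATH` is set, whether a system `libfuse` can be located, and whether the `fusermount` utilities are on the `PATH`. The sketch below is a hypothetical diagnostic helper built only from the standard library. It reports what it finds and does not try to fix anything.

```python
import os
import shutil
from ctypes.util import find_library


def fuse_environment_report(env=None):
    """Collect the FUSE-related settings discussed in the notes above."""
    env = os.environ if env is None else env
    return {
        # Explicit library location, if the user exported one.
        "FUSE_LIBRARY_PATH": env.get("FUSE_LIBRARY_PATH"),
        # Library the dynamic loader would find by default (may be None).
        "system_libfuse": find_library("fuse3") or find_library("fuse"),
        # Mount helpers looked up on PATH (None if missing).
        "fusermount3": shutil.which("fusermount3"),
        "fusermount": shutil.which("fusermount"),
    }


for key, value in fuse_environment_report().items():
    print(f"{key}: {value}")
```

A `None` next to `fusermount3` together with a set `FUSE_LIBRARY_PATH` corresponds to the situation described in the last note.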

In [3]:
%%time
!/home1/datahome/todaka/conda-env/xtar-dev/bin/ratarmount --recreate-index ${HOME}/git/xtar/data/realdata.tar mounted_NFS_dataset

Creating offset dictionary for /home1/datahome/todaka/git/xtar/data/realdata.tar ...
Creating new SQLite index database at /home1/datahome/todaka/git/xtar/data/realdata.tar.index.sqlite
Creating offset dictionary for /home1/datahome/todaka/git/xtar/data/realdata.tar took 0.12s
Writing out TAR index to /home1/datahome/todaka/git/xtar/data/realdata.tar.index.sqlite took 0s and is sized 24576 B
CPU times: user 8 ms, sys: 0 ns, total: 8 ms
Wall time: 528 ms


In [5]:
# Test on GPFS: copy the archive to $DATAWORK, then mount it from there
!cp  ${HOME}/git/xtar/data/realdata.tar ${DATAWORK}/realdata.tar

!/home1/datahome/todaka/conda-env/xtar-dev/bin/ratarmount --recreate-index ${DATAWORK}/realdata.tar mounted_GPFS_dataset

Creating offset dictionary for /home1/datawork/todaka/realdata.tar ...
Creating new SQLite index database at /home1/datawork/todaka/realdata.tar.index.sqlite
Creating offset dictionary for /home1/datawork/todaka/realdata.tar took 0.01s
Writing out TAR index to /home1/datawork/todaka/realdata.tar.index.sqlite took 0s and is sized 24576 B


In [6]:
# Test on Lustre: copy the archive to $SCRATCH, then mount it from there
!cp  ${HOME}/git/xtar/data/realdata.tar ${SCRATCH}/realdata.tar
!/home1/datahome/todaka/conda-env/xtar-dev/bin/ratarmount --recreate-index ${SCRATCH}/realdata.tar mounted_lustre_dataset

Creating offset dictionary for /home1/scratch/todaka/realdata.tar ...
Creating new SQLite index database at /home1/scratch/todaka/realdata.tar.index.sqlite
Creating offset dictionary for /home1/scratch/todaka/realdata.tar took 0.01s
Writing out TAR index to /home1/scratch/todaka/realdata.tar.index.sqlite took 0s and is sized 24576 B
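The log output above shows that the index is written as a SQLite database next to the archive (`realdata.tar.index.sqlite`). One way to look inside such a file is the standard-library `sqlite3` module. The sketch below makes no assumption about ratarmount's schema: the hypothetical helper `list_tables` only asks SQLite which tables exist, and the demo runs against a throwaway database with a made-up table name.

```python
import sqlite3
import tempfile
from pathlib import Path


def list_tables(db_path):
    """Return the table names of a SQLite file, opened read-only."""
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    con = sqlite3.connect(uri, uri=True)
    try:
        rows = con.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        con.close()
    return [row[0] for row in rows]


# Demo on a throwaway database (the table name is hypothetical).
with tempfile.TemporaryDirectory() as tmp:
    demo_db = Path(tmp) / "demo.index.sqlite"
    con = sqlite3.connect(str(demo_db))
    con.execute("CREATE TABLE demo_files (name TEXT, data_offset INTEGER)")
    con.commit()
    con.close()
    tables = list_tables(demo_db)
    print(tables)
```

Opening the file with `mode=ro` keeps the inspection from modifying an index that a mounted archive may still be using.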


In [8]:
%%time
mounted_dir = Path("mounted_NFS_dataset/")
list(mounted_dir.iterdir())


CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 1.03 ms


[PosixPath('mounted_NFS_dataset/MARC_F1-MARS3D-MANGAE4000_20141231T2000Z.nc'),
 PosixPath('mounted_NFS_dataset/MARC_F1-MARS3D-MANGAE4000_20141231T2100Z.nc'),
 PosixPath('mounted_NFS_dataset/MARC_F1-MARS3D-MANGAE4000_20141231T2200Z.nc'),
 PosixPath('mounted_NFS_dataset/MARC_F1-MARS3D-MANGAE4000_20141231T2300Z.nc')]

For comparison, this is how long it takes to `list` the original data directory.

In [None]:
%%time
# Use a separate name so the mounted path defined earlier is not overwritten
orig_dir = Path("data/air/")
list(orig_dir.iterdir())

**Substantially slower to list the directory contents (~2x slower)!!!**
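A single `%%time` measurement in the millisecond range is sensitive to caching and other noise, so a comparison like this is easier to trust when each listing is repeated a few times. The following sketch shows one way to do that. `time_listing` is a hypothetical helper, and the demo runs on a temporary directory rather than on the mounted or original data.

```python
import tempfile
import time
from pathlib import Path


def time_listing(directory, repeats=5):
    """Return (best_seconds, entry_count) over several directory listings."""
    directory = Path(directory)
    best = float("inf")
    count = 0
    for _ in range(repeats):
        start = time.perf_counter()
        entries = list(directory.iterdir())
        best = min(best, time.perf_counter() - start)
        count = len(entries)
    return best, count


# Demo on a temporary directory holding three placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(3):
        (Path(tmp) / f"file_{i}.nc").write_bytes(b"")
    best, count = time_listing(tmp)
    print(f"{count} entries, best of 5: {best * 1e3:.3f} ms")
```

Calling the helper once for the mounted directory and once for the original directory would give two comparable numbers. Taking the best of several runs reduces the effect of one-off delays, but it also means the first, uncached access is not what gets reported.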

## Benchmarks

In [None]:
from dask.distributed import performance_report, Client

In [None]:
client = Client()
client

### Original netCDF files

In [None]:
ds_orig = xr.open_mfdataset("data/air/*.nc", combine='nested', concat_dim='member_id')
ds_orig

In [None]:
ds_orig.air.data.visualize()

### Mounted netCDF files from the tar archive

In [None]:
ds_mntd = xr.open_mfdataset("mounted_air_dataset/data/air/*.nc", combine='nested', concat_dim='member_id')
ds_mntd

In [None]:
ds_mntd.air.data.visualize()

### Benchmark: Yearly Averages

In [None]:
%%time
ds_orig.groupby('time.year').mean(['time', 'member_id']).compute()

In [None]:
%%time
ds_mntd.groupby('time.year').mean(['time', 'member_id']).compute()

### Dask Performance Reports

In [None]:
with performance_report(filename="dask-perf-report-original.html"):
    ds_orig.groupby('time.year').mean(['time', 'member_id']).compute()

In [None]:
with performance_report(filename="dask-perf-report-mounted.html"):
    ds_mntd.groupby('time.year').mean(['time', 'member_id']).compute()

In [None]:
from IPython.display import HTML

In [None]:
display(HTML("dask-perf-report-original.html"))

In [None]:
display(HTML("dask-perf-report-mounted.html"))