Example notebook showing how to run the OPA to produce output for bias correction.

## First test on netCDF lat-lon data saved on disk

In [7]:
import numpy as np 
import os 
import sys 
import glob 
import xarray as xr
import pickle 

path = "/home/b/b382291/git/one_pass"
sys.path.append(path)
os.chdir(path)

from one_pass.convert_time import convert_time
from one_pass.check_request import check_request
from one_pass import util
from one_pass.opa import Opa

! hostname

l40495.lvt.dkrz.de


Load the 10-month test data from disk. The file is located in the tests folder of the one_pass repo.

In [2]:
file_path = "/home/b/b382291/git/one_pass/tests/uas_10_months.nc"
data = xr.open_dataset(file_path, engine='netcdf4')
data

Configuration file or dictionary to be passed to the OPA. Rather than having a separate key-value pair such as "bias_correction" : True, it made more sense for the OPA logic to pass bias correction simply as a statistic. This may be revised, but for now it works. Setting both "stat_freq" and "output_freq" to "daily" requests the data (both the raw data and the TDigest objects) over the span of one day.
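To make the "daily" setting concrete, here is a minimal sketch of how many time steps fill one daily window. It assumes that time_step is given in minutes, which matches the hourly test data used in this notebook. The helper `steps_per_window` is a hypothetical name for illustration and is not part of one_pass.

```python
MINUTES_PER_DAY = 24 * 60


def steps_per_window(time_step_minutes: int, window_minutes: int = MINUTES_PER_DAY) -> int:
    """Number of time steps needed to fill one statistic window.

    Assumption: the time step is expressed in minutes.
    """
    if time_step_minutes <= 0:
        raise ValueError("time_step_minutes must be positive")
    if window_minutes % time_step_minutes != 0:
        raise ValueError("the window must be a whole number of time steps")
    return window_minutes // time_step_minutes


# Hourly data (time step of 60 minutes) with a daily window
print(steps_per_window(60))
```

Under that assumption, hourly data needs 24 time steps before a daily statistic is complete.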

In [3]:
pass_dic = {"stat" : "bias_correction",
"percentile_list" : None,
"threshold_exceed" : None,
"stat_freq": "daily",
"output_freq": "daily",
"time_step": 60,
"variable": "uas",
"save": True,
"checkpoint": True,
"checkpoint_filepath": "/scratch/b/b382291/data/",
"out_filepath": "/scratch/b/b382291/data"}

Here we run a loop over 24 hours (the data has hourly frequency) to simulate streaming. Each iteration passes the number of time steps set by the parameter step; if step = 1, one hour of data is passed at a time. The progress bars in the output are mainly diagnostic: they show the loops that initialise the digests over the whole grid, update the digests, and pass them through a picklable class.

In [4]:
step = 6 

for i in range(0, 24, step): 

    ds = data.isel(time=slice(i,i+step)) # extract moving window 'simulating streaming'
    # can pass either a dictionary as above or data from the config file 
    #daily_mean = Opa("config.yml")
    opa_stat = Opa(pass_dic)
    dm = opa_stat.compute(ds) # computing algorithm with new data 


100%|██████████| 10000/10000 [00:00<00:00, 111012.99it/s]
100%|██████████| 10000/10000 [00:00<00:00, 31987.19it/s]
100%|██████████| 10000/10000 [00:00<00:00, 995137.14it/s]


written


100%|██████████| 10000/10000 [00:00<00:00, 24058.16it/s]
100%|██████████| 10000/10000 [00:00<00:00, 940342.57it/s]


written


100%|██████████| 10000/10000 [00:00<00:00, 28801.75it/s]
100%|██████████| 10000/10000 [00:00<00:00, 110780.19it/s]


written


100%|██████████| 10000/10000 [00:00<00:00, 28649.62it/s]


array([TDigest(mean=0.0643, weight=24, centroids=16, not merged, compression=15),
       TDigest(mean=0.0124, weight=24, centroids=16, not merged, compression=15),
       TDigest(mean=-0.0336, weight=24, centroids=16, not merged, compression=15),
       ...,
       TDigest(mean=-2.22, weight=24, centroids=16, not merged, compression=15),
       TDigest(mean=-2.21, weight=24, centroids=16, not merged, compression=15),
       TDigest(mean=-2.22, weight=24, centroids=16, not merged, compression=15)],
      dtype=object)
finished saving in 0.1471 s
finished saving in 0.2205 s
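The windowing used in the loop above can be sketched without any data. This is only an illustration of the index arithmetic. `chunk_windows` is a hypothetical helper, not something one_pass provides.

```python
def chunk_windows(n_total: int, step: int) -> list[tuple[int, int]]:
    """Return (start, stop) index pairs that cover n_total time steps in chunks of `step`."""
    if step <= 0:
        raise ValueError("step must be positive")
    return [(start, min(start + step, n_total)) for start in range(0, n_total, step)]


windows = chunk_windows(24, 6)
print(windows)

# Total number of time steps delivered across all chunks
print(sum(stop - start for start, stop in windows))
```

With step = 6 the day arrives in four chunks of six hourly values, 24 values in total, which is consistent with the weight=24 reported by each digest in the output above.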


As we set "save": True, we get two output files. The first is a netCDF file containing the accumulated raw data over the day. You can inspect it below:

In [6]:
file_path = "/scratch/b/b382291/data/2020_05_01_uas_raw_data_for_bc.nc"
data = xr.open_dataset(file_path, engine='netcdf4')
data

The next output file is a pickle file. I have chosen to pass this information as a pickle instead of a netCDF for two reasons:

1. We can pass the actual TDigest objects, rather than extracting the centroids and means and manipulating the data, which causes extra overhead.
2. It is much faster to save.

To deserialise this data do the following: 

In [12]:
 
file_path = "/scratch/b/b382291/data/2020_05_01_uas_daily_bias_correction.pkl"

# use a context manager so the file is closed after loading
with open(file_path, 'rb') as f:
    digest_data = pickle.load(f)

digest_data
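
The point about passing actual objects can be illustrated with a small round trip. This sketch uses only the standard library and substitutes `statistics.NormalDist` for a TDigest. That substitution is an assumption for illustration, since the real OPA output holds pytdigest objects. A temporary directory is used so that nothing is written to the paths in this notebook.

```python
import pickle
import tempfile
from pathlib import Path
from statistics import NormalDist

# Stand-in "grid" of distribution objects (assumption: NormalDist replaces TDigest here)
grid = [
    [NormalDist(0.0643, 1.0), NormalDist(0.0124, 1.0)],
    [NormalDist(-2.21, 1.0), NormalDist(-2.22, 1.0)],
]

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = Path(tmp_dir) / "digests.pkl"

    # Serialise the whole grid of objects in one call
    with open(pkl_path, "wb") as f:
        pickle.dump(grid, f)

    # Deserialise: the objects come back with their methods intact
    with open(pkl_path, "rb") as f:
        restored = pickle.load(f)

print(restored == grid)
print(restored[1][0].inv_cdf(0.5))
```

The restored objects can be queried directly, with no step that rebuilds them from stored centroids or means.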


The data is still on the grid that was initially passed. To inspect the digests you can do the following:

In [15]:
digest_data_uas = digest_data.uas
digest_data_uas

If you want to extract an individual digest you can do the following. You can call any pytdigest method on this object.

In [22]:
one_digest = digest_data_uas.values[0, 1, 1]
one_digest.inverse_cdf([0.6])

array([-0.60310357])
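
For readers unfamiliar with inverse_cdf: it returns the value below which a given fraction of the data lies, so inverse_cdf([0.6]) is the 60th percentile. A t-digest approximates this from a compressed summary. The sketch below instead computes an exact quantile from the full sorted data with linear interpolation, purely to show what the number means. `empirical_quantile` is a hypothetical helper and is not the t-digest algorithm.

```python
def empirical_quantile(values, q):
    """Exact quantile of `values` at fraction q, using linear interpolation."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie in [0, 1]")
    ordered = sorted(values)
    position = q * (len(ordered) - 1)      # fractional index into the sorted data
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


sample = [float(v) for v in range(11)]     # 0.0, 1.0, ..., 10.0
print(empirical_quantile(sample, 0.6))
```

The exact version needs every value in memory, whereas a digest keeps only a small number of centroids per grid point.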

## Another example, using the AQUA framework - for this you need to be on Levante

In [23]:
path = "/home/b/b382291/git/AQUA/"
sys.path.append(path)
os.chdir(path)

from aqua import Reader

This uses the AQUA reader's fake streamer, which is a really good way of simulating data streaming. It will keep looping. The AQUA repo can be found here: https://github.com/oloapinivad/AQUA
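If a streaming generator keeps yielding data, a loop over it needs an explicit stopping rule. The sketch below shows one way to cap the number of chunks with `itertools.islice`. `fake_stream` is a hypothetical stand-in generator, not the AQUA API, and whether the AQUA streamer really runs indefinitely is an assumption here.

```python
from itertools import count, islice


def fake_stream(chunk_hours):
    """Hypothetical stand-in for a streaming generator that never ends."""
    for chunk_index in count():
        start = chunk_index * chunk_hours
        # each chunk is a list of consecutive hour indices
        yield list(range(start, start + chunk_hours))


processed = []
# Take only the first two 12-hour chunks, i.e. one day of data
for chunk in islice(fake_stream(12), 2):
    processed.append((chunk[0], chunk[-1]))

print(processed)
```

`islice` works on any Python iterator, so the same pattern should carry over to a real generator object.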

The FESOM example below will be a lot slower than the first test because the grid for the selected data has about 7.4 million data points.
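A rough estimate shows why. The slower progress bars in this notebook's output report roughly 32,000 iterations per second on both grids, so the time for one pass over the grid scales with the number of grid points. The rate below is read off those progress bars and rounded. It is an approximation, not a benchmark.

```python
def seconds_per_pass(n_points: int, points_per_second: float) -> float:
    """Estimated time for one pass over the grid at a fixed per-point rate."""
    return n_points / points_per_second


RATE = 32_000  # approximate rate from the progress bars (assumption: roughly constant)

test_grid = seconds_per_pass(10_000, RATE)       # lat-lon test data
fesom_grid = seconds_per_pass(7_402_886, RATE)   # FESOM native grid

print(f"test grid:  {test_grid:.2f} s per pass")
print(f"FESOM grid: {fesom_grid:.0f} s per pass")
```

At that rate a single pass over the FESOM grid takes a few minutes, compared with a fraction of a second for the test grid.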

In [24]:
# this is our user request

pass_dic = {"stat" : "bias_correction",
"percentile_list" : None,
"threshold_exceed" : None,
"stat_freq": "daily",
"output_freq": "daily",
"time_step": 60,
"variable": "sst",
"save": True,
"checkpoint": True,
"checkpoint_filepath": "/scratch/b/b382291/data/",
"out_filepath": "/scratch/b/b382291/data"}


reader = Reader(model = 'FESOM', exp = 'tco2559-ng5-cycle3', source = '2D_1h_native')
reader.reset_stream()
data_gen = reader.retrieve(streaming_generator=True, stream_step=12, stream_unit = 'hours')# , stream_startdate='2022-12-01')
#data_gen = reader.retrieve(streaming=True, stream_step=3, stream_unit = 'days')

for data in data_gen:
    
    print(f"start_date: {data.time[0].values} stop_date: {data.time[-1].values}")
    data = data['sst']
    opa_stat = Opa(pass_dic)
    dm = opa_stat.compute(data)
    

start_date: 2020-01-20T00:56:00.000000000 stop_date: 2020-01-20T11:56:00.000000000


100%|██████████| 7402886/7402886 [00:22<00:00, 323113.78it/s]
 46%|████▌     | 3381093/7402886 [01:45<02:05, 32023.97it/s]


Unexpected exception formatting exception. Falling back to standard exception


Traceback (most recent call last):
  File "/home/b/b382291/.conda/envs/aqua/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3508, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "/tmp/ipykernel_1056089/1850657218.py", line 27, in <module>
    dm = opa_stat.compute(data)
  File "/home/b/b382291/git/one_pass/one_pass/opa.py", line 1189, in compute
  File "/home/b/b382291/git/one_pass/one_pass/opa.py", line 878, in _update
  File "/home/b/b382291/git/one_pass/one_pass/opa.py", line 783, in _update_tdigest
    self.__getattribute__(str(self.stat + "_cum"))[j].update(ds_values[:, j])
  File "/home/b/b382291/.conda/envs/aqua/lib/python3.10/site-packages/pytdigest/pytdigest.py", line 197, in update
    _lib.td_add_batch(self._tdigest, x.size, x, w)
ctypes.ArgumentError: argument 3: KeyboardInterrupt: 

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/b/b382291/.conda/envs/aqua/li