# Welcome to the example notebook for the one_pass package! 

In this notebook we show the functionality of the one_pass package using a small data set stored in the one_pass tests folder. The notebook is designed to be as simple as possible and to avoid requiring other packages, so we simulate streaming by simply looping through the data.

Before working through this notebook we recommend reading the documentation located in the docs folder.

In [1]:
import os
from urllib.request import urlretrieve

import numpy as np
import xarray as xr

from one_pass.opa import Opa

Load the 12-month test xr.Dataset, which covers a small lat-lon grid and is the one located in the tests folder of the one_pass repo. The cell below downloads the file if it is not already on disk.

In [2]:
url = "https://zenodo.org/records/8337510/files/pr_12_months.nc?download=1"
filename = "pr_12_months.nc"

# Download the file if it doesn't exist
if not os.path.isfile(filename):
    print("Downloading data file...")
    filename, _ = urlretrieve(url, filename)

In [3]:
data = xr.open_dataset(filename, engine="netcdf4")
data = data.compute()
data

## Example 1: daily means

As discussed in the documentation, the one_pass request has to be passed either from the config.yml file or from a python dictionary. Below is the first example where we are looking at calculating daily means of the data. Here we pass the request as a python dictionary to keep it all contained within this notebook.

### A quick note on the key value pairs in the request 

For detailed information on the key value pairs in the request, along with a list of all possible options, please see the documentation. A quick summary is provided here for the non self-explanatory ones. 

stat_freq : this is the frequency over which you want to compute your statistic, e.g. **daily** means, **monthly** standard deviation etc. 

output_freq : this is the length of time you want your final xr.Dataset to cover and (if save : True) the frequency at which it is saved to disk. If stat_freq and output_freq are the same, the final xr.Dataset has a time dimension of length 1 and is saved to disk; the next output is a new Dataset, also of length 1. If, for example, you request stat_freq as daily but output_freq as monthly, the final output has a time dimension of 30 (if the month has 30 days). The final Dataset is saved to disk at the end of the 30 days; the output 'dm' is still returned at daily frequency, with each day appended to the same Dataset. The output_freq must always be greater than or equal to the stat_freq.

time_step: this corresponds to the time step of the incoming data in minutes.

checkpoint vs save : checkpointing is the one_pass saving the current state of the statistic to disk every time new information is passed. This way if the model crashes or the memory is wiped, the statistic can start again from where it left off. Checkpoint files are saved as either pickle or zarr files (depending on size). Save files are the final xr.Dataset statistic saved as a .nc file. 
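
To make the relationship between time_step, stat_freq and output_freq concrete, here is a small arithmetic sketch. The helper `steps_needed` is hypothetical and not part of the one_pass API; it only counts how many incoming time steps are needed to cover a given number of days.

```python
MINUTES_PER_DAY = 24 * 60


def steps_needed(time_step_minutes, days):
    """Hypothetical helper: incoming time steps needed to cover `days` days."""
    if MINUTES_PER_DAY % time_step_minutes != 0:
        raise ValueError("time_step must divide a day exactly")
    return days * MINUTES_PER_DAY // time_step_minutes


# Hourly data (time_step : 60)
daily_steps = steps_needed(60, 1)     # steps in one daily statistic
january_steps = steps_needed(60, 31)  # steps in one monthly statistic for January

# Daily statistics written to a monthly output file for a 30-day month
# give one entry per day along the time dimension.
time_dimension_length = 30
```

With hourly data, a daily statistic therefore completes after 24 time steps, and a monthly statistic for January after 31 * 24 time steps.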

Note: You will need to modify the file paths 

In [4]:
checkpoint_filepath = os.getcwd()
save_filepath = os.getcwd()

In [5]:
request = {
    "stat": "mean",  # statistic you wish to compute
    "stat_freq": "daily",  # frequency of the statistic
    "output_freq": "daily",  # frequency that you wish to output / save the data
    "time_step": 60,  # time step of the data in minutes
    "variable": "pr",  # variable of interest
    "save": True,  # do you want to save the final statistic
    "checkpoint": True,  # do you want to checkpoint each time new data is passed to the statistic (recommended yes!)
    "checkpoint_filepath": checkpoint_filepath,  # path to checkpoint
    "save_filepath": save_filepath,  # path to save
}

Here we run a loop over 24 hours (the data has hourly frequency, hence time_step : 60) to simulate streaming. The step sets the number of time steps passed at each iteration (climate models may output several time steps in one go). With step = 1, we pass one hour of data at each iteration of the loop.
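
As a minimal sketch of how the step controls the chunking, the generator below slices a plain Python list in the same way the loops in this notebook slice the time dimension. The function name `stream` is an assumption made for illustration, and a list stands in for the xr.Dataset.

```python
def stream(values, start, stop, step):
    """Yield consecutive chunks of `step` values, mimicking data arriving over time."""
    for i in range(start, stop, step):
        yield values[i:i + step]


hours = list(range(24))  # stand-in for 24 hourly time steps
chunks_of_one = list(stream(hours, 0, 24, 1))   # one hour per iteration
chunks_of_four = list(stream(hours, 0, 24, 4))  # four hours per iteration
```

A larger step means fewer iterations, each carrying more time steps; the data seen in total is the same.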

The output of the one_pass is the xr.Dataset called dm. **This dm output is only displayed when enough data has been provided to the Opa class (in this case 24 time steps are required), until then it will be 'None'.** This dm Dataset is of exactly the same shape as the original data, other than the time dimension, which is now of length 1. The timestamp on this dimension corresponds to the first piece of data that contributed to the statistic. The variable name has not changed but a new 'history' attribute has been added to the data explaining that it is now a daily mean that has been calculated from the one_pass package, along with the timestamp of creation.

In [6]:
start = 0
stop = 24
step = 1

opa_stat = Opa(request)

for i in range(start, stop, step):

    # Simulate streaming by extracting a moving window
    incoming_dataset = data.isel(time=slice(i, i + step))

    # Compute result after the incoming data
    dm = opa_stat.compute(incoming_dataset)  # computing algorithm with new data

dm
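
The "None until enough data has arrived" behaviour can be illustrated with a small running mean. This is only a sketch of the general one-pass idea in plain Python; the class `RunningMean` is hypothetical and is not how the Opa class is implemented. It also assumes chunks never straddle the end of a period.

```python
class RunningMean:
    """Illustrative one-pass mean (not the one_pass implementation)."""

    def __init__(self, n_required):
        self.n_required = n_required  # time steps in one statistic, e.g. 24
        self.count = 0
        self.mean = 0.0

    def compute(self, chunk):
        """Fold a chunk into the mean; return the mean once the period is complete."""
        for value in chunk:
            self.count += 1
            # Incremental update: no earlier values need to be kept in memory
            self.mean += (value - self.mean) / self.count
        if self.count < self.n_required:
            return None  # not enough data yet
        result = self.mean
        self.count, self.mean = 0, 0.0  # reset for the next period
        return result


hourly = [float(h) for h in range(24)]  # made-up hourly values
stat = RunningMean(n_required=24)
outputs = [stat.compute(hourly[i:i + 1]) for i in range(24)]
```

The first 23 calls return None and only the final call returns the daily mean, which mirrors what happens to dm in the loop above.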

## Example 2: monthly variance 

In this example we calculate the variance of the data over a month. We use a step of 4, so 4 time stamps are passed to the algorithm at a time.

The output data (shown below as dm and also saved as "2071_01_pr_monthly_var.nc") is the sample variance of the data over a month. Again, the time stamp on the new xr.Dataset matches the time stamp of the first piece of data passed to the algorithm.

In [7]:
request = {
    "stat": "var",  # statistic you wish to compute
    "stat_freq": "monthly",  # frequency of the statistic
    "output_freq": "monthly",  # frequency that you wish to output / save the data
    "time_step": 60,  # time step of the data in minutes
    "variable": "pr",  # variable of interest
    "save": True,  # do you want to save the final statistic
    "checkpoint": True,  # do you want to checkpoint each time new data is passed to the statistic (recommended yes!)
    "checkpoint_filepath": os.getcwd(),  # path to checkpoint
    "save_filepath": os.getcwd(),  # path to save
}

In [8]:
start = 0
stop = 31 * 24
step = 4

opa_stat = Opa(request)

for i in range(start, stop, step):

    # Simulate streaming by extracting a moving window
    incoming_dataset = data.isel(time=slice(i, i + step))

    # Compute result after the incoming data
    dm = opa_stat.compute(incoming_dataset)

dm

Let's do a quick sanity check to make sure the output of the one-pass variance matches what you would get from numpy. The testing file "test_accuracy.py" tests the accuracy of all statistics against their numpy counterparts.

In [10]:
def two_pass_var(data, n_start, n_data):
    """Normal two-pass algorithm for the variance over the time dimension."""
    ds = data.isel(time=slice(n_start, n_data))
    axis = ds.get_axis_num("time")
    two_pass = np.var(ds, axis=axis, dtype=np.float64, ddof=1, keepdims=True)
    return two_pass


start = 0
stop = 31 * 24

data_pr = data.pr
two_pass = two_pass_var(data_pr, start, stop)
one_pass = dm.pr
np.allclose(two_pass, one_pass, atol=1e-5)

True
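
For readers curious how a variance can be accumulated chunk by chunk, here is a sketch using the well-known pairwise update of Chan et al. on a plain Python list. We do not claim this is the exact update used inside one_pass; the function `update` and the made-up data are assumptions for illustration. The result is compared with the standard library's sample variance.

```python
import statistics


def update(state, chunk):
    """Merge a chunk into the running (count, mean, M2) state."""
    n_a, mean_a, m2_a = state
    n_b = len(chunk)
    mean_b = sum(chunk) / n_b
    m2_b = sum((x - mean_b) ** 2 for x in chunk)  # sum of squared deviations in the chunk
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta ** 2 * n_a * n_b / n
    return n, mean, m2


values = [0.1 * ((7 * i) % 13) for i in range(31 * 24)]  # made-up hourly series
state = (0, 0.0, 0.0)
for i in range(0, len(values), 4):  # 4 time stamps at a time, as in Example 2
    state = update(state, values[i:i + 4])

n, mean, m2 = state
one_pass_variance = m2 / (n - 1)  # sample variance (ddof=1)
reference_variance = statistics.variance(values)
```

Only three numbers are carried between chunks, which is what makes this kind of update suitable for streamed data.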

## Example 3: monthly standard deviation

In this example we calculate the monthly standard deviation of the data but output it as one file covering 3 months. We use a step of 4, so 4 time stamps are passed to the algorithm at a time.

The final output now has a time dimension of 3, with 3 monthly standard deviations appended in one file. 
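
The number of hourly time steps needed to cover the three months can be worked out from the calendar rather than typed by hand. This sketch assumes the data set covers the year 2071, as the saved file name in Example 2 suggests.

```python
import calendar

YEAR = 2071  # assumed year of the test data

# Hours in January, February and March
hours_per_month = [calendar.monthrange(YEAR, month)[1] * 24 for month in (1, 2, 3)]
stop = sum(hours_per_month)  # index just past the last hour of March
```

Each entry of `hours_per_month` is the amount of data that contributes to one of the three monthly standard deviations.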

In [11]:
request = {
    "stat": "std",
    "stat_freq": "monthly",
    "output_freq": "3monthly",
    "time_step": 60,
    "variable": "pr",
    "save": True,
    "checkpoint": True,
    "checkpoint_filepath": os.getcwd(),
    "save_filepath": os.getcwd(),
}

In [12]:
start = 0
stop = (31 * 24) + (28 * 24) + (31 * 24)  # Jan + Feb + Mar
step = 4

opa_stat = Opa(request)

for i in range(start, stop, step):

    # Simulate streaming by extracting a moving window
    incoming_dataset = data.isel(time=slice(i, i + step))

    # Compute result after the incoming data
    dm = opa_stat.compute(incoming_dataset)

dm

## Example 4: weekly threshold exceedance 

In this example we look at the number of times the precipitation exceeds a certain threshold over a weekly frequency. Here we need to include the key value pair "thresh_exceed" in the request, which we have set to $2 \times 10^{-5}\ \text{kg}\ \text{m}^{-2} \ \text{s}^{-1}$.

Note that the threshold exceedance takes the absolute value of the threshold, so for directional variables like wind speed it looks for exceedance of the magnitude and includes negative values.

Here we have asked for weekly frequency. You will notice that the first day of data provided to the statistic does not correspond to a Monday (it corresponds to a Sunday). The algorithm realises this and prints "passing on this data as its not the initial data for the requested statistic" until it reaches the required time stamp. All statistic frequencies follow the Gregorian calendar, so if you ask for monthly it will wait for the first day of the month before starting the statistic, and likewise for annually etc.
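
The weekday of the first time step can be checked with the standard library. This sketch assumes that index 0 of the data corresponds to the first hour of 1 January 2071; the variable names are ours, and the start index is the one used in the loop for this example.

```python
from datetime import datetime, timedelta

first_timestamp = datetime(2071, 1, 1)  # assumed start of the data set
start = 3 * 24 + 7 * 24 * 23            # start index in hours, as used in Example 4

start_time = first_timestamp + timedelta(hours=start)

# Monday is weekday 0 and Sunday is weekday 6
days_to_monday = (7 - start_time.weekday()) % 7
first_monday = (start_time + timedelta(days=days_to_monday)).date()
skipped_steps = days_to_monday * 24     # hourly steps passed over before the week starts
```

Under this assumption the loop starts on a Sunday, so one day of hourly data is passed over before the weekly statistic begins on the Monday.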

You might also note that here we have included the initialisation line ``opa_stat = Opa(request)`` within the loop. This means that each time new data is passed, the class is re-initialised. As we have set ``checkpoint : True``, the Opa class will re-initialise from the checkpoint file. This illustrates how the one_pass will work if the job is re-launched or the model crashes and needs to be re-started. The configuration below would not work with ``checkpoint : False``.
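
The idea of re-initialising from a checkpoint can be sketched with a tiny exceedance counter that stores its state with pickle. The class `CheckpointedCounter` and the file name are hypothetical; the real Opa class stores far more state and chooses between pickle and zarr, as described earlier. The absolute-value handling mentioned above is not modelled here.

```python
import os
import pickle
import tempfile


class CheckpointedCounter:
    """Counts threshold exceedances, restoring its state from disk on creation."""

    def __init__(self, threshold, checkpoint_file):
        self.threshold = threshold
        self.checkpoint_file = checkpoint_file
        self.count = 0
        if os.path.isfile(checkpoint_file):
            # Resume from the state written by a previous instance
            with open(checkpoint_file, "rb") as f:
                self.count = pickle.load(f)

    def compute(self, chunk):
        self.count += sum(1 for value in chunk if value > self.threshold)
        # Write the updated state so that a new instance can pick it up
        with open(self.checkpoint_file, "wb") as f:
            pickle.dump(self.count, f)
        return self.count


values = [1e-5, 3e-5, 0.0, 5e-5, 2.5e-5, 1e-6, 4e-5, 0.0]  # made-up data
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "checkpoint.pkl")
    for i in range(0, len(values), 4):
        counter = CheckpointedCounter(2e-5, path)  # re-created on every iteration
        total = counter.compute(values[i:i + 4])
```

Without the checkpoint file each new instance would start from zero, which is why the loop in this example relies on ``checkpoint : True``.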

In [13]:
request = {
    "stat": "thresh_exceed",
    "thresh_exceed": [2 * 10 ** (-5)],  # threshold value, required for this statistic
    "stat_freq": "weekly",
    "output_freq": "weekly",
    "time_step": 60,
    "variable": "pr",
    "save": True,
    "checkpoint": True,
    "checkpoint_filepath": os.getcwd(),
    "save_filepath": os.getcwd(),
}

In [14]:
start = 3 * 24 + 7 * 24 * 23
stop = start + 24 * 8
step = 4

for i in range(start, stop, step):

    # Simulate streaming by extracting a moving window
    incoming_dataset = data.isel(time=slice(i, i + step))

    # In this example, because we are checkpointing, we create the Opa instance
    # from scratch on every iteration; it re-initialises from the checkpoint file
    opa_stat = Opa(request, logging_level="error")

    # Compute result after the incoming data
    dm = opa_stat.compute(incoming_dataset)

dm

Here, our final output shows the number of times the precipitation has exceeded $2 \times 10^{-5}\ \text{kg}\ \text{m}^{-2}\ \text{s}^{-1}$ at each spatial location over the week.

## Example 5: 3 monthly percentiles 

Example 5 shows how to use the one_pass to get the distribution of a variable, in this case over a 3 month period. We are interested in the whole distribution, so we set percentile_list : [].

This example takes around 2 minutes to run (on a compute node), which is much longer than the other statistics. The next phase of WP9 is to implement parallelisation in the percentiles statistic.

In [15]:
request = {
    "stat": "percentile",
    "percentile_list": [],  # an empty list requests the whole distribution
    "stat_freq": "3monthly",
    "output_freq": "3monthly",
    "time_step": 60,
    "variable": "pr",
    "save": True,
    "checkpoint": True,
    "checkpoint_filepath": os.getcwd(),
    "save_filepath": os.getcwd(),
}

In [16]:
start = 31 * 24 * 2 + 28 * 24
stop = start + 31 * 24 + 2 * 30 * 24
step = 8

opa_stat = Opa(request)

for i in range(start, stop, step):

    # Simulate streaming by extracting a moving window
    incoming_dataset = data.isel(time=slice(i, i + step))

    # Compute result after the incoming data
    dm = opa_stat.compute(incoming_dataset)

dm

Looking at the output here, we can see that there is a new dimension corresponding to the percentiles. Just from the snapshot below we can see we are looking at the two tail ends of the distribution of the precipitation over this 3 monthly period. 

In [17]:
dm

To look at the 50th percentile, index the percentile dimension:

In [18]:
dm.pr.values[0, 50, :, :]

array([[0.        , 0.        , 0.        , ..., 0.00016745, 0.00029003,
        0.00055154],
       [0.        , 0.        , 0.        , ..., 0.00016501, 0.00030175,
        0.00056315],
       [0.        , 0.        , 0.        , ..., 0.00018071, 0.0002709 ,
        0.00065883],
       ...,
       [0.        , 0.        , 0.        , ..., 0.00023277, 0.00050389,
        0.00097017],
       [0.        , 0.        , 0.        , ..., 0.00022417, 0.00044941,
        0.00099724],
       [0.        , 0.        , 0.        , ..., 0.00026018, 0.00037907,
        0.00110839]])

## Example 6: continuous maximum

In this example we look at the maximum value of the data set over an unspecified time period, i.e. continuous. This statistic starts on the very first time step you pass it; it does not wait for the beginning of a week, month etc. It then produces 'snapshots' of the current state of the statistic based on the output frequency. Here we have set monthly outputs.

Below we start half way through February. This will still produce the file "2071_02_pr_continuous_max.nc" at the end of February, even though it started half way through the month; it will just contain the maximum value from the start date until the end of February. Here we run until the end of April, so dm holds the maximum values from the start date until the end of April.

In [19]:
request = {
    "stat": "max",
    "stat_freq": "continuous",
    "output_freq": "monthly",
    "time_step": 60,
    "variable": "pr",
    "save": True,
    "checkpoint": True,
    "checkpoint_filepath": os.getcwd(),
    "save_filepath": os.getcwd(),
}

In [20]:
start = 46 * 24  # mid way through Feb
stop = start + 31 * 24 * 2 + 12 * 24
step = 8

opa_stat = Opa(request)

for i in range(start, stop, step):

    # Simulate streaming by extracting a moving window
    incoming_dataset = data.isel(time=slice(i, i + step))

    # Compute result after the incoming data
    dm = opa_stat.compute(incoming_dataset)

dm

INFO:one_pass.opa:Incoming time stamp 2071-02-16 00:30:00 is further back in time than the previously seen time stamp 2071-04-30 23:30:00. As the stat_freq is continuous it is not possible to roll back this stat so the checkpoint file (if checkpointing is true) has been removed and the time variables n_data, count, time_stamp have been reset.


Running the above will have saved the files "2071_02_pr_continuous_max.nc", "2071_03_pr_continuous_max.nc" and "2071_04_pr_continuous_max.nc" to disk. These are all snapshots showing the result of the statistic at that point in time. Above we can see the output at the end of April. We notice that there is an extra dimension, timings. This dimension contains the timestamps of the maximum value.

In [21]:
dm.timings
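
A continuous maximum that also records when the maximum occurred can be sketched in a few lines. The class `RunningMax` and the made-up values are assumptions for illustration; the sketch works on a single series of (time stamp, value) pairs rather than on a gridded xr.Dataset.

```python
from datetime import datetime, timedelta


class RunningMax:
    """Illustrative continuous maximum that remembers the time of the maximum."""

    def __init__(self):
        self.value = None
        self.timing = None

    def compute(self, chunk):
        for timestamp, value in chunk:
            if self.value is None or value > self.value:
                self.value, self.timing = value, timestamp
        return self.value, self.timing  # snapshot of the current state


t0 = datetime(2071, 2, 16, 0, 30)  # first time stamp seen in Example 6
series = [(t0 + timedelta(hours=h), float((5 * h) % 17)) for h in range(48)]

stat = RunningMax()
for i in range(0, len(series), 8):  # 8 time stamps at a time
    snapshot = stat.compute(series[i:i + 8])
```

Here `snapshot` plays the role of dm, and its second element plays the role of the timings shown above.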

## Example 7: daily summations output at monthly frequency 

In this final example we look at the summation of the data over a daily frequency, with the output set to monthly so that we obtain a final xr.Dataset with multiple days of daily data. Here we start part way through March and also part way through a day. The one_pass will skip the first timestamps corresponding to the partial day and wait for the first full day. It will then output an xr.Dataset of 16 days, corresponding to the last 16 days of the month.
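
The skipped time steps and the 16 full days can be checked with a little date arithmetic. As before, this sketch assumes that index 0 of the data corresponds to the first hour of 1 January 2071; the start and stop indices are the ones used in the loop for this example.

```python
from datetime import datetime, timedelta

first_timestamp = datetime(2071, 1, 1)  # assumed start of the data set
start = 30 * 24 + 28 * 24 + 15 * 24 + 12
stop = start + 12 + 16 * 24

start_time = first_timestamp + timedelta(hours=start)
stop_time = first_timestamp + timedelta(hours=stop)

# First midnight at or after the start time
midnight = start_time.replace(hour=0, minute=0, second=0, microsecond=0)
first_full_day = midnight if midnight == start_time else midnight + timedelta(days=1)

skipped_steps = int((first_full_day - start_time) / timedelta(hours=1))
full_days = (stop_time - first_full_day).days
```

Under this assumption the first 12 hourly steps are passed over, and the remaining data covers 16 complete days up to the end of March.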

In [22]:
request = {
    "stat": "sum",
    "stat_freq": "daily",
    "output_freq": "monthly",
    "time_step": 60,
    "variable": "pr",
    "save": True,
    "checkpoint": True,
    "checkpoint_filepath": os.getcwd(),
    "save_filepath": os.getcwd(),
}

In [23]:
start = 30 * 24 + 28 * 24 + 15 * 24 + 12  # mid way March and mid way through a day
stop = start + 12 + 16 * 24
step = 2

opa_stat = Opa(request, logging_level="error")

for i in range(start, stop, step):

    # Simulate streaming by extracting a moving window
    incoming_dataset = data.isel(time=slice(i, i + step))

    # Compute result after the incoming data
    dm = opa_stat.compute(incoming_dataset)

dm