# flox examples on laptop

```
/////
File: flox_examples_on_laptop
Author: Thomas Moore
Description: This notebook explores running `flox` documentation examples on a laptop
Equipment: MacBook Pro, Apple M2 Pro, Memory: 32 GB, Total Number of Cores: 12 (8 performance and 4 efficiency)
References: The very useful - flox: fast & furious GroupBy reductions for dask.array - https://flox.readthedocs.io
Credits: `flox` by @dcherian et al
Credits: .... and all the open source work across the Pangeo community: `xarray`, `dask`, ....
Date: 2 May 2024
/////
```

In [1]:
Author_dict = {"name": "Thomas Moore",
               "affiliation": "CSIRO",
               "email": "thomas.moore@csiro.au",
               "orcid_ID": "https://orcid.org/0000-0003-3930-1946"}

# setup

In [2]:
import dask
import flox
import xarray as xr
import pandas as pd
import numpy as np
from dask.distributed import Client, LocalCluster
import gc

## functions

In [3]:
def print_chunks(data_array):
    chunks = data_array.chunks
    dim_names = data_array.dims
    readable_chunks = {dim: chunks[i] for i, dim in enumerate(dim_names)}
    for dim, sizes in readable_chunks.items():
        print(f"{dim} chunks: {sizes}")

def clear_and_restart(variables, client):
    """
    Clear specified variables from memory, collect garbage, and restart the Dask cluster.

    Args:
        variables (list): List of string names of the variables to clear from the namespace.
        client (dask.distributed.Client): The Dask client associated with the cluster to restart.

    Returns:
        None
    """

    # Clear specified variables
    for var in variables:
        if var in globals():
            del globals()[var]
    
    # Collect garbage
    gc.collect()
    
    # Restart the Dask cluster
    client.restart()

def print_options():
    # Retrieve Xarray options
    options = xr.get_options()

    # Convert the options dictionary to a Pandas DataFrame for a nicer table display
    options_df = pd.DataFrame(list(options.items()), columns=['Option', 'Value'])

    # Print the DataFrame
    print(options_df)

def print_flox_options():
    # Look up the `use_flox` option by name rather than by row position,
    # since the position in the options table can shift between xarray versions
    value = xr.get_options()['use_flox']

    # Print in a human-readable, single-line format
    print(f"Option: use_flox, Value: {value}")

# start LocalCluster

In [4]:
cluster = LocalCluster(
    n_workers=4,           # Number of workers
    threads_per_worker=1,  # Number of threads per worker
)
client = Client(cluster)



# Flox docs examples
[https://flox.readthedocs.io/en/latest/user-stories/climatology.html](https://flox.readthedocs.io/en/latest/user-stories/climatology.html)
<br>and specifically<br>
[How about other climatologies?](https://flox.readthedocs.io/en/latest/user-stories/climatology.html#how-about-other-climatologies)

### adjust this example:
- replace `ones` with `random`

In [5]:
import dask.array as da
# Generate a DataArray with random numbers
oisst = xr.DataArray(
    da.random.random((14532, 720, 1440), chunks=(20, 720, 1440)),  # Generate random values
    dims=("time", "lat", "lon"),
    coords={"time": pd.date_range("1981-09-01 12:00", "2021-06-14 12:00", freq="D")},
    name="sst"
)

In [6]:
print('oisst object is '+ str(oisst.nbytes/1e9) + ' GB \n' + str((oisst.nbytes/1e9)/32.0) + ' times bigger than total memory.')

oisst object is 120.5342208 GB 
3.7666944 times bigger than total memory.
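
The size reported above can be checked by hand from the shape, chunking and dtype used in the `DataArray` definition. The following is a minimal sketch using only the standard library; the variable names are illustrative and it assumes 8 bytes per `float64` element.

```python
import math
from datetime import date

# Shape and chunking copied from the DataArray definition above
shape = (14532, 720, 1440)
chunks = (20, 720, 1440)
itemsize = 8  # bytes per float64 element

# The daily time axis includes both end dates
n_days = (date(2021, 6, 14) - date(1981, 9, 1)).days + 1

total_bytes = math.prod(shape) * itemsize
chunk_bytes = math.prod(chunks) * itemsize
n_chunks = math.ceil(shape[0] / chunks[0])  # the array is chunked along time only

print(f"days on time axis: {n_days}")
print(f"total size: {total_bytes / 1e9} GB")
print(f"chunk size: {chunk_bytes / 1e6} MB")
print(f"number of chunks: {n_chunks}")
```

Each chunk is far smaller than the 32 GB of memory, even though the full array is not.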


In [7]:
oisst.groupby('time.month').mean('time')

| | Array | Chunk |
|---|---|---|
| Bytes | 94.92 MiB | 7.91 MiB |
| Shape | (12, 720, 1440) | (1, 720, 1440) |
| Dask graph | 12 chunks in 79 graph layers | |
| Data type | float64 numpy.ndarray | |


#### force `map-reduce` with engine=`flox`

In [8]:
%%time
print_flox_options()
clim_flox = oisst.groupby('time.month').mean('time',engine='flox',method='map-reduce').compute()

Option: use_flox, Value: True
CPU times: user 6.52 s, sys: 1.25 s, total: 7.78 s
Wall time: 1min 29s
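
As a rough illustration of the idea behind `method='map-reduce'`, the sketch below computes a grouped mean by first reducing each chunk to per-label sums and counts and then combining those partial results. It is a pure-Python toy, not flox's implementation; the helper names `chunk_partial`, `combine` and `grouped_mean` are hypothetical.

```python
from collections import defaultdict

def chunk_partial(values, labels):
    """Map step: per-label [sum, count] for a single chunk."""
    partial = defaultdict(lambda: [0.0, 0])
    for value, label in zip(values, labels):
        partial[label][0] += value
        partial[label][1] += 1
    return partial

def combine(partials):
    """Reduce step: merge per-chunk [sum, count] pairs label by label."""
    merged = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for label, (total, count) in partial.items():
            merged[label][0] += total
            merged[label][1] += count
    return merged

def grouped_mean(values, labels, chunk_size):
    """Split into chunks, reduce each chunk, then combine and divide."""
    starts = range(0, len(values), chunk_size)
    partials = [
        chunk_partial(values[i:i + chunk_size], labels[i:i + chunk_size])
        for i in starts
    ]
    merged = combine(partials)
    return {label: total / count for label, (total, count) in merged.items()}

# Toy data: six values with month labels 1 and 2, split into chunks of four
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
months = [1, 1, 1, 2, 2, 2]
result = grouped_mean(values, months, chunk_size=4)
print(result)
```

Keeping sums and counts separate until the final step is what allows a label that spans two chunks (month 2 here) to be averaged correctly.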


In [11]:
clear_and_restart(['clim_flox'],client)

#### force `map-reduce` with engine=`flox` with `skipna = False`

In [10]:
%%time
print_flox_options()
clim_flox = oisst.groupby('time.month').mean('time',engine='flox',method='map-reduce',skipna=False).compute()

Option: use_flox, Value: True
CPU times: user 5.26 s, sys: 998 ms, total: 6.26 s
Wall time: 1min 22s


In [21]:
clear_and_restart(['clim_flox'],client)

#### force `cohorts` with engine=`flox`

In [12]:
%%time
print_flox_options()
clim_flox = oisst.groupby('time.month').mean('time',engine='flox',method='cohorts').compute()

Option: use_flox, Value: True
CPU times: user 4.59 s, sys: 773 ms, total: 5.36 s
Wall time: 42.6 s
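
The `cohorts` method relies on each group label being present in only a subset of the chunks. The sketch below counts, for this time axis and 20-day chunking, how many chunks contain each month. It only illustrates that label-to-chunk relationship under those assumptions and does not reproduce flox's cohort detection.

```python
from collections import defaultdict
from datetime import date, timedelta

# Time axis and chunk length copied from the DataArray definition above
start = date(1981, 9, 1)
n_days = 14532
chunk_len = 20

# For each month label, record which time chunks contain at least one day of it
chunks_per_month = defaultdict(set)
for day_index in range(n_days):
    month = (start + timedelta(days=day_index)).month
    chunks_per_month[month].add(day_index // chunk_len)

n_chunks = (n_days + chunk_len - 1) // chunk_len  # ceiling division
for month in sorted(chunks_per_month):
    touched = len(chunks_per_month[month])
    print(f"month {month:2d}: present in {touched} of {n_chunks} chunks")
```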


In [20]:
clear_and_restart(['clim_flox'],client)

#### force `cohorts` with engine=`flox` with `skipna = False`

In [14]:
%%time
print_flox_options()
clim_flox = oisst.groupby('time.month').mean('time',engine='flox',method='cohorts',skipna=False).compute()

Option: use_flox, Value: True
CPU times: user 4.09 s, sys: 1.45 s, total: 5.54 s
Wall time: 36.4 s


In [13]:
clear_and_restart(['clim_flox'],client)

#### `use_flox=False`

In [16]:
%%time
with xr.set_options(use_flox=False):
    print_flox_options()
    clim_noflox = oisst.groupby('time.month').mean('time').compute()

Option: use_flox, Value: False
CPU times: user 3.4 s, sys: 1.22 s, total: 4.62 s
Wall time: 29.4 s


In [19]:
clear_and_restart(['clim_noflox'],client)

#### `use_flox=False` with `skipna = False`

In [18]:
%%time
with xr.set_options(use_flox=False):
    print_flox_options()
    clim_noflox = oisst.groupby('time.month').mean('time',skipna=False).compute()

Option: use_flox, Value: False
CPU times: user 2.67 s, sys: 1.38 s, total: 4.05 s
Wall time: 20.9 s


### results
~~with flox map-reduce = CPU times: user 7.07 s, sys: 1.44 s, total: 8.51 s = Wall time: 2min 9s~~<br>
~~with flox cohorts = CPU times: user 5.82 s, sys: 1.16 s, total: 6.98 s = Wall time: 1min 20s~~<br>
~~without flox = CPU times: user 3.37 s, sys: 1.39 s, total: 4.77 s = Wall time: 29.5 s~~
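
For the runs recorded in this notebook, the wall times can be compared against the `use_flox=False` run with the same `skipna` setting. The sketch below copies the wall times from the cell outputs above; these are single runs on one laptop, so the ratios are indicative only.

```python
# Wall times in seconds, copied from the cell outputs above
wall_times = {
    ("map-reduce", "default skipna"): 89.0,      # 1min 29s
    ("map-reduce", "skipna=False"): 82.0,        # 1min 22s
    ("cohorts", "default skipna"): 42.6,
    ("cohorts", "skipna=False"): 36.4,
    ("use_flox=False", "default skipna"): 29.4,
    ("use_flox=False", "skipna=False"): 20.9,
}

# Ratio of each run to the use_flox=False run with the same skipna setting
ratios = {}
for (method, skipna), seconds in wall_times.items():
    baseline = wall_times[("use_flox=False", skipna)]
    ratios[(method, skipna)] = seconds / baseline
    print(f"{method:15s} {skipna:15s} {seconds:6.1f} s  {ratios[(method, skipna)]:4.2f}x baseline")
```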



# $The$ $End$

# $\downarrow$ clean-up buttons $\downarrow$

In [17]:
clear_and_restart(['clim_flox','clim_noflox'],client)

In [None]:
clear_and_restart([],client)