# Function for Cluster Plot w/ Basic Linear Regression
## Example Parameter: Avg Leaf Nitrogen Concentration (LNC) vs Leaf Carbon Nitrogen Ratio (LCN)
#### Authors: Heather Childers, Sofia Ingersoll, Sujan Bhattaria
##### Date: 2024-02-18

##### Loading environment settings

In [2]:
# moved the libraries that were here into utils.py because they're essential
# xarray is required to run the utils import line
import xarray as xr
import glob

In [2]:
# import libraries & data pre-processing functions from utils.py
from utils import *

##### Request additional processing power from server

In [3]:
# Request 40 cores from the server for processing
client = get_cluster("UCSB0021", cores = 40)

In [4]:
# display the cluster summary (dashboard link, workers, threads, memory)
client.cluster

Dashboard: https://jupyterhub.hpc.ucar.edu/stable/user/hmchilders/proxy/8787/status, Workers: 0
Total threads: 0, Total memory: 0 B

Comm: tcp://128.117.208.105:46839, Workers: 0
Dashboard: https://jupyterhub.hpc.ucar.edu/stable/user/hmchilders/proxy/8787/status, Total threads: 0
Started: Just now, Total memory: 0 B


### By loading utils,

A dummy data array `da` and a dummy data set `ds` were loaded into our local environment. Both are pre-processed; the array holds the data variable of interest, LNC (leaf nitrogen concentration).

In addition to the dummy data, the functions defined in `utils.py` are now accessible in this notebook and may be used to wrangle the cluster data that is read in below.

## Loading a Cluster of 500 files
The data files are located in `/glade/campaign/cgd/tss/projects/PPE/PPEn11_LHC/transient/hist/`

The 2005-2010 monthly output files have the form:
`PPEn11_transient_LHC0001.clm2.h0.2005-02-01-00000.nc`

We're interested in files spanning from:
`LHC0001 to LHC0500`
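
The naming scheme above can also be turned into an explicit list of paths, which avoids relying on a wildcard. This is a minimal sketch, assuming every member follows exactly the file name shown; the names `HIST_DIR`, `FILE_TEMPLATE`, `member_ids`, and `member_paths` are illustrative and are not part of `utils.py`.

```python
# Directory and file-name pattern copied from the text above
HIST_DIR = "/glade/campaign/cgd/tss/projects/PPE/PPEn11_LHC/transient/hist/"
FILE_TEMPLATE = "PPEn11_transient_{member}.clm2.h0.2005-02-01-00000.nc"

# Zero-padded ensemble member IDs: LHC0001 ... LHC0500
member_ids = ["LHC" + str(i).zfill(4) for i in range(1, 501)]

# Full path for every member (illustrative list, built without touching the disk)
member_paths = [HIST_DIR + FILE_TEMPLATE.format(member=m) for m in member_ids]

print(len(member_paths))
print(member_paths[0])
```

A list built this way contains exactly the 500 members of interest, so no element needs to be dropped afterwards.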

In [12]:
# pass the variable of interest (e.g. 'LNC') as `var`
def read_all_simulation(var):
    '''prepare cluster list and read to create ensemble(group of data)
    use preprocess to select only certain dimension and a variable'''
    # read all simulations as a sorted list (the pattern matches IDs 0000-0599)
    cluster_list= sorted(glob.glob('/glade/campaign/cgd/tss/projects/PPE/PPEn11_LHC/transient/hist/PPEn11_transient_LHC[0][0-5][0-9][0-9].clm2.h0.2005-02-01-00000.nc'))
    # drop the first match so the ensemble starts at LHC0001 (assumes LHC0000 sorts first)
    cluster_list = cluster_list[1:]

    # only select latitude, longitude, time, and `var` in the preprocess step
    def preprocess(ds, var):
        '''using this function in xr.open_mfdataset as preprocess
        ensures that only these four things are selected
        before the data is combined'''
        return ds[['lat', 'lon', 'time', var]]
    
    # read the list lazily and return the combined ensemble dataset
    return xr.open_mfdataset( cluster_list, 
                                   combine='nested',
                                   preprocess = lambda ds: preprocess(ds, var),
                                   parallel= True, 
                                   concat_dim="ens")



In [55]:
help(read_all_simulation)

Help on function read_all_simulation in module __main__:

read_all_simulation(var)
    prepare cluster list and read to create ensemble(group of data)
    use preprocess to select only certain dimension and a variable
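
The bracket pattern `LHC[0][0-5][0-9][0-9]` used in the `glob` call above matches any ID from `LHC0000` to `LHC0599`, which is wider than the `LHC0001` to `LHC0500` range of interest. A small check with `fnmatch` (the standard-library module that implements the same wildcard rules) illustrates this. The helper `matches` and the candidate IDs are made up for the check, and no files are read.

```python
from fnmatch import fnmatch

# File-name part of the glob pattern used in the function above
PATTERN = "PPEn11_transient_LHC[0][0-5][0-9][0-9].clm2.h0.2005-02-01-00000.nc"

def matches(member_id):
    """Return True if the file name for this member ID fits the glob pattern."""
    name = f"PPEn11_transient_{member_id}.clm2.h0.2005-02-01-00000.nc"
    return fnmatch(name, PATTERN)

# Illustrative candidate IDs around the edges of the pattern
for member_id in ["LHC0000", "LHC0001", "LHC0500", "LHC0599", "LHC0600"]:
    print(member_id, matches(member_id))
```

If an `LHC0000` file exists in the directory, it would sort first, which may be why the function discards the first element of the sorted list.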



In [5]:
# modify the function if you want to pass the parameter
def read_all_simulations(var):
    '''prepare cluster list and read to create ensemble(group of data)
    use preprocess to select only certain dimension and a variable'''
    # read all simulations as a list
    cluster_list= sorted(glob.glob('/glade/campaign/cgd/tss/projects/PPE/PPEn11_LHC/transient/hist/PPEn11_transient_LHC[0][0-5][0-9][0-9].clm2.h0.2005-02-01-00000.nc'))
    # drop the first match so the ensemble starts at LHC0001 (assumes LHC0000 sorts first)
    cluster_list = cluster_list[1:]

    def preprocess(ds, var):
        '''using this function in xr.open_mfdataset as preprocess
        ensures that only these four things are selected
        before the data is combined'''
        return ds[['lat', 'lon', 'time', var]]
    
    #read the list and load it for the notebook
    ds = xr.open_mfdataset( cluster_list, 
                                   combine='nested',
                                   preprocess = lambda ds: preprocess(ds, var),
                                   parallel=False, 
                                   concat_dim="ens")
    return ds

| | Array | Chunk |
|---|---|---|
| Bytes | 45.78 MiB | 93.75 kiB |
| Shape | (500, 60, 400) | (1, 60, 400) |
| Dask graph | 500 chunks in 1501 graph layers | |
| Data type | float32 numpy.ndarray | |


In [6]:
da = read_all_simulations('LNC')
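
To see what the preprocess-then-combine step does without access to the cluster files, the sketch below applies the same idea to small synthetic datasets. It is an illustration under stated assumptions: `make_member` is a hypothetical helper, the `(time, gridcell)` layout is a guess at the file structure, and `xr.concat` stands in for `xr.open_mfdataset(..., combine='nested', concat_dim="ens")`, which needs files on disk.

```python
import numpy as np
import xarray as xr

def make_member(seed, n_time=3, n_cell=4):
    """Build a small synthetic stand-in for one simulation file."""
    rng = np.random.default_rng(seed)
    return xr.Dataset(
        {
            "LNC": (("time", "gridcell"), rng.random((n_time, n_cell))),
            "TSA": (("time", "gridcell"), rng.random((n_time, n_cell))),
            "lat": ("gridcell", np.linspace(-60.0, 60.0, n_cell)),
            "lon": ("gridcell", np.linspace(0.0, 270.0, n_cell)),
        },
        coords={"time": np.arange(n_time)},
    )

def preprocess(ds, var):
    """Keep only lat, lon, time, and the requested variable."""
    return ds[["lat", "lon", "time", var]]

# Three synthetic ensemble members, trimmed before they are combined
members = [make_member(seed) for seed in range(3)]
ensemble = xr.concat([preprocess(m, "LNC") for m in members], dim="ens")

print(ensemble["LNC"].dims)
print(ensemble["LNC"].shape)
```

Trimming each member before combining keeps unused variables (here `TSA`) out of the ensemble, which matters far more when 500 full output files are involved.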

##### Accessing data processing functions from utils.py library

In [10]:
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# ----    Wrangle  Cluster Data     ----
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Subset for the variable of interest LNC and properly weigh grid cells and time dim
# these functions are defined in the utils library 
#This one works
cluster_ds = fix_time(da)
#This one doesn't
#cluster_ds = da.weighted_landarea_gridcells(landarea).yearly_weighted_average().fix_time()


TypeError: Cannot assign a Dataset to a single key - only a DataArray or Variable object can be stored under a single key.

### Leaf Carbon : Nitrogen Data

This will come in handy later when we want to plot trends over time.

In [None]:
# Leaf CN data for plotting
df = pd.read_csv('/glade/campaign/asp/djk2120/PPEn11/csvs/lhc220926.txt',index_col=0)
# convert to data set
params = xr.Dataset(df)

# the only dimension here is the 'member' aka file index id [LHC0001-LHC0500]
params

# subsetting for leafcn
leafcn = params['leafcn']

#leafcn
leafcn

### Downsampled two-file approach

In [None]:
# Set filepath
#filepath = '/glade/campaign/cgd/tss/projects/PPE/PPEn11_LHC/transient/hist/'

In [None]:
#members = ["LHC" + str(i).zfill(4) for i in range(1,501)]
#members

In [None]:
#Open multiple files as a single dataset
ds_mf =xr.open_mfdataset(['/glade/campaign/cgd/tss/projects/PPE/PPEn11_LHC/transient/hist/PPEn11_transient_LHC0001.clm2.h0.2005-02-01-00000.nc', 
                          '/glade/campaign/cgd/tss/projects/PPE/PPEn11_LHC/transient/hist/PPEn11_transient_LHC0002.clm2.h0.2005-02-01-00000.nc'], 
                         combine='nested', parallel=True, concat_dim = "ens")

In [None]:
ds_mf.time

In [None]:
ds.time.max()

In [None]:
# land area for the sparse grid, used as weights
file2 = '/glade/campaign/cgd/tss/projects/PPE/helpers/sparsegrid_landarea.nc'
ds2 = xr.open_dataset(file2)
landarea = ds2['landarea']
# area-weighted mean of TSA over grid cells, then a plain mean over time
weighted_avg_area = ds_mf['TSA'].weighted(landarea).mean(dim = 'gridcell').mean(dim = 'time')
weighted_avg_area.values
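
The weighting step above can be checked on a toy array. The sketch below uses made-up values and weights (not the real `landarea` file) to show that `.weighted(...).mean(dim='gridcell')` computes `sum(weight * value) / sum(weight)` for each time step before the plain time mean is taken.

```python
import numpy as np
import xarray as xr

# Toy data: 2 time steps x 2 grid cells (values are made up)
tsa = xr.DataArray([[1.0, 3.0], [2.0, 6.0]], dims=("time", "gridcell"))

# Toy weights standing in for the land area of each grid cell
landarea = xr.DataArray([1.0, 3.0], dims="gridcell")

# Area-weighted mean over grid cells, one value per time step
weighted_by_time = tsa.weighted(landarea).mean(dim="gridcell")

# Plain (unweighted) mean over time, as in the cell above
weighted_avg = weighted_by_time.mean(dim="time")

# The same weighted mean written out by hand
manual = (tsa * landarea).sum("gridcell") / landarea.sum()

print(weighted_by_time.values)
print(float(weighted_avg))
print(np.allclose(manual.values, weighted_by_time.values))
```

Note that the time mean here treats every time step equally; weighting months by their length would be a separate step.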

#### Single file visualization of LNC

In [None]:
# these are commented out because the utils import in cell 2 already provides the same dataset
#file = "/glade/campaign/cgd/tss/projects/PPE/PPEn11_LHC/transient/hist/PPEn11_transient_LHC0001.clm2.h0.2005-02-01-00000.nc"
#ds = xr.open_dataset(file)

In [None]:
lnc = ds['LNC']
lnc_timeavg = lnc.mean(dim = 'time')

In [None]:
plt.scatter(ds.grid1d_lon,
            ds.grid1d_lat,
            c=lnc_timeavg)

In [None]:
#lnc_avg = lnc_timeavg.mean(dim = 'lat')