# Basic plotting 
## Examples of typical plots used to look at CTSM variables

This tutorial uses [xarray](https://docs.xarray.dev/en/stable/user-guide/terminology.html) and [matplotlib](https://matplotlib.org/stable/index.html) to create several types of plots. You will find examples of how to:

1. Load datasets with xarray
2. Manipulate data & make plots for:
> 2.1 Raw dataset; <br>
> 2.2 Diel averages for given months of the year; <br>
> 2.3 Daily and annual fluxes; and <br>
> 2.4 Annual climatologies
3. Export data to other file types (*e.g., .csv files*).

------

**This tutorial uses a Jupyter Notebook.** 
For more information on Jupyter notebooks please see the information in the Getting Started tutorials or visit the [Jupyter Notebook Quick Start Guide](https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/what_is_jupyter.html). 

***

# 1. Load Datasets

## 1.1 Load Python Libraries
It is necessary to start by loading in the libraries that will be used in the script. Note that this list is a good starting point for most of your plotting needs. Not all the libraries loaded here will be used in this tutorial.


In [None]:
import os
import time
import datetime

import numpy as np
import pandas as pd
import xarray as xr

from glob import glob
from os.path import join

import matplotlib
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

from neon_utils import fix_time_h1

# suppress Future Warnings that are annoying...
import warnings
warnings.simplefilter("ignore", category=FutureWarning)

In [None]:
print('numpy '+np.__version__) ##-- was working with 1.23.5
print('pandas '+pd.__version__) ##-- was working with 1.5.3
print('xarray '+xr.__version__) ##-- was working with 2023.1.0

## 1.2 Point to history files 

### 1.2.1 Where are my simulation results?
When you run simulations, history files are all saved in the archive directory within your scratch directory. For the NEON simulations, this directory is `~/scratch/NEON_cases/archive/`.

We can print a list of the simulations in this directory. Because this is a Python notebook and we want to use a [bash](https://opensource.com/resources/what-bash) command, we have to use bash magic: `%%bash` turns the whole cell below into a bash cell, while `!` runs a single line as a shell command.

In [None]:
%%bash
ls ~/scratch/NEON_cases/archive/

#### Alternatively, prestaged cases for all NEON sites are available here

In [None]:
%%bash
ls /scratch/data/NEONv2/hist/



---

### 1.2.2 Point to the data folder with history files 
**The below code sets the following:**
- the NEON site; 
- the path to the archive directory;
- and the subdirectory with data.

**To Do:**
In the code below, change the name of the NEON site to the one you simulated and want to create plots for.

In [None]:
neon_site = 'KONZ'  #NEON site we're going to look at
archive = '~/scratch/NEON_cases/archive' #Path to archive directory

# Alternatively, prestaged data from all NEON sites are here
# archive = '/scratch/data/NEONv2/hist/'

# this expands the '~' shortcut we used above and resolves the full path
archive = os.path.realpath(os.path.expanduser(archive)) 

# Create a path to the data folder
data_folder = archive+'/'+neon_site+'.transient/lnd/hist'
data_folder

**Is the path for the simulated data, `data_folder`, correct?**
If not, you can set the `data_folder` variable manually. 
<br>*HINT:* You can check in the terminal window or using bash magic.
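
The same check can also be done from Python. The sketch below is only an illustration: the helper name `check_data_folder` and the example path are hypothetical, and it relies on nothing beyond the standard library.

```python
import os

def check_data_folder(path):
    """Return True if the folder exists and holds at least one netCDF file."""
    path = os.path.realpath(os.path.expanduser(path))
    if not os.path.isdir(path):
        print("Folder not found:", path)
        return False
    nc_files = [f for f in os.listdir(path) if f.endswith(".nc")]
    print(len(nc_files), "netCDF files found in", path)
    return len(nc_files) > 0

# Hypothetical path for illustration; pass your own data_folder instead
found = check_data_folder("~/no_such_folder_for_this_sketch/lnd/hist")
```

If this returns `False` for your own `data_folder`, check the site name and the archive path before moving on.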

---

### 1.2.3 Create functions to use when opening the data files
1. `preprocess` reads in a subset of variables using an xarray function. This saves time and memory resources.
2. `fix_time_h1` adjusts a time offset that is a feature of the CTSM history files. It is imported from `neon_utils` rather than defined here.
<br>**A note about model timestamps:** The CTSM history includes an initial 0th timestep for each model simulation. This offset in the time dimension can cause challenges when analyzing and evaluating model data if not treated properly.
<br>*Don't worry too much about the details of these functions right now*

**To Do:** You can change the 'variables' list below to add or remove variables that you are interested in plotting. Make sure to follow the same formatting. 

In [None]:
# -- read only these variables from the whole netcdf files
def preprocess(ds):
    variables = ['FCEV', 'FCTR', 'FGEV', 'FSH', 'GPP', 'FSA', 'FIRA', 'AR', 'HR', 'ELAI']

    ds_new = ds[variables].isel(lndgrid=0)
    return ds_new


Now we have the functions needed to manipulate our datasets.
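
To see what `preprocess` does without opening any history files, here is a minimal sketch on a synthetic dataset. The variable `TBOT` and the random values are placeholders chosen for illustration, and the shortened `variables` list is an assumption made to keep the example small.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for one day of a history file on (time, lndgrid)
times = pd.date_range("2018-01-01", periods=48, freq="30min")
ds = xr.Dataset(
    {name: (("time", "lndgrid"), np.random.rand(48, 1))
     for name in ["GPP", "FSH", "TBOT"]},
    coords={"time": times},
)

def preprocess(ds):
    # Keep only the variables of interest and drop the single-point grid dimension
    variables = ["GPP", "FSH"]
    return ds[variables].isel(lndgrid=0)

small = preprocess(ds)
print(list(small.data_vars))
print(dict(small.sizes))
```

The unwanted variable is gone and `lndgrid` is no longer a dimension, which is why later cells can plot `GPP` directly against time.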

---

### 1.2.4 List all the files we're going to open
The 30-minute, high-frequency data (**'h1' files**) are written out every day for NEON simulations. CTSM typically saves files as monthly averages (the 'h0' files), and many more variables are available at that frequency.

To open all of the `*h1*` files in our `data_folder`, we use the Python `glob` function.

You will notice that **several python functions are combined in a single line of code** that runs through a `for` loop over defined simulation years (written as a list of strings).

<div class="alert alert-block alert-info">
<b>Note</b> If you're new to Python, this code is dense but efficient. It is borrowed from Negin Sobhani. Sharing code is really helpful!
</div>

**To Do:** You can change the years of data that you will read in below. Make sure that you do not add years where data are not available, though! You can check by listing the files in your `data_folder`.

In [None]:
# This list gives you control over the years of data to read in
years = ["2018","2019","2020","2021"]  

# Create an empty list of all the file names to extend
sim_files = []
for year in years:
    sim_files.extend(sorted(glob(join(data_folder,"*h1."+year+"*.nc"))))

print("All simulation files for all years: [", len(sim_files), "files]")
print(sim_files[-1])

How many files are you going to read in?  What is the last day of the simulation you'll be looking at?
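
To make the pattern matching concrete, the sketch below repeats the same loop on a temporary folder of empty placeholder files. The file names are made up and only mimic the `h1` naming pattern; real history file names are longer.

```python
import os
import tempfile
from glob import glob
from os.path import join

years = ["2018", "2019"]

with tempfile.TemporaryDirectory() as folder:
    # Create empty placeholder files (hypothetical names)
    for name in ["case.h1.2018-01-01.nc", "case.h1.2018-01-02.nc",
                 "case.h1.2019-01-01.nc", "case.h0.2018-01.nc",
                 "case.h1.2020-01-01.nc"]:
        open(join(folder, name), "w").close()

    # Same loop as above: one sorted glob per requested year
    sim_files = []
    for year in years:
        sim_files.extend(sorted(glob(join(folder, "*h1." + year + "*.nc"))))

    found = [os.path.basename(f) for f in sim_files]

print(len(found), "files:", found)
```

The `h0` file and the 2020 file are skipped because they do not match `*h1.<year>*.nc` for any year in the list.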

---

### 1.2.5 Read in the data
`.open_mfdataset` will open all of these data files and concatenate them into a single **xarray dataset**.

There are a lot of files, so this can take a while! Be patient, it should be done in < 1 minute. This can be done more quickly with dask, which parallelizes Python code, but we do not use that here.

1. `preprocess` will limit the number of variables we're reading in. This is an xarray feature that helps save time (and memory resources).
2. `fix_time_h1` corrects annoying features related to how CTSM history files handle time and is provided in `neon_utils.py`.
*Don't worry too much about the details of these functions right now*

In [None]:
start = time.time()
print ('---------------------------')
print ("Reading in data for "+neon_site)
ds_ctsm = xr.open_mfdataset(sim_files, decode_times=True, combine='by_coords', preprocess=preprocess)
ds_ctsm = fix_time_h1(ds_ctsm)

end = time.time()
print("Reading all simulation files took:", end-start, "s.")


#### Take a quick look at the dataset.
- What are your coordinate variables?
- How long is the time dimension?
- What variables are available to look at?
- What are the 'long names', or descriptions, of some of these variables? (HINT: try `ds_ctsm.GPP`)
- What other metadata are associated with this dataset? 

In [None]:
ds_ctsm
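
The questions above can also be answered programmatically. The following sketch uses a small synthetic dataset in place of `ds_ctsm`; the attribute values are placeholders, so the output for your own data will differ.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for ds_ctsm: two days of 30-minute data
times = pd.date_range("2018-01-01", periods=96, freq="30min")
ds = xr.Dataset(
    {"GPP": ("time", np.zeros(96),
             {"long_name": "gross primary production", "units": "gC/m^2/s"})},
    coords={"time": times},
)

coord_names = list(ds.coords)                  # coordinate variables
n_times = ds.sizes["time"]                     # length of the time dimension
var_names = list(ds.data_vars)                 # variables available
gpp_long_name = ds["GPP"].attrs["long_name"]   # description of one variable

print(coord_names, n_times, var_names, gpp_long_name)
```

Replacing `ds` with `ds_ctsm` gives the same summary for the real simulation output.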

---

# 2. Manipulating data and making plots
The data frequency in the **h1 files** is 30 minutes. The below examples focus on different ways to average the time dimension of the data to visualize results.

While several variables are read in, this example focuses on gross primary production (GPP). You can change the variable to look at other variables of interest.

---

## 2.1 Plotting the full time series
It's often helpful to visualize the raw data.
This is quick and easy if you:
- point to your dataset, `ds_ctsm`,
- select a variable, `GPP`, and
- use the `.plot()` function that's built into xarray.

These three actions are combined in a single line of code.

In [None]:
ds_ctsm.GPP.plot() ;

When looking at this figure:
- Do you see clear peaks for each year's growing season? 
- What other trends are evident in the data? 
- Are patterns similar for other variables?

It can be challenging to interpret patterns and trends with this volume of data. Below are a few examples that summarize the data in different ways.

---
## 2.2 Mean diel cycle
The high-frequency data allow us to look at an average daily cycle of fluxes.

**Let's start by looking at just a few days of data.**

We will subset the data by adding the following xarray features to our previous `ds_ctsm.GPP.plot()` command:
- `.isel`, the *index select* on the coordinate variable, *time*; and
- `slice`, which lets us slice out a range of data, instead of a single time point.

In [None]:
# Select the data range. Below selects a few days in July (zero-based day indices 181 to 184,
#  i.e. July 1-4 when the record starts on January 1 of a non-leap year)
#  at 30 minute frequency (this means 48 times recorded / day, which is why the day index is multiplied by 48)
first_time = (181*48)
last_time = (185*48)

ds_ctsm.GPP.isel(time=slice(first_time,last_time)).plot() ;

*From this plot, you notice that at night GPP goes to zero.  That's why there are zero values in the middle of summer!*
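
Index arithmetic works, but it is easy to get wrong. As an alternative, xarray's `.sel` can select by date labels. The sketch below compares the two approaches on a synthetic 2018 time axis; it assumes timestamps start at midnight on January 1, which may not match your data exactly.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic 30-minute time axis covering all of 2018
times = pd.date_range("2018-01-01", "2018-12-31 23:30", freq="30min")
gpp = xr.DataArray(np.zeros(times.size), coords={"time": times},
                   dims="time", name="GPP")

# Position-based selection, as in the cell above
by_index = gpp.isel(time=slice(181 * 48, 185 * 48))

# Label-based selection; both ends of the slice are inclusive
by_label = gpp.sel(time=slice("2018-07-01", "2018-07-04"))

print(by_index.time.values[0], by_label.time.values[-1])
```

Both selections cover the same four days, but the label-based version states the dates explicitly.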

**You can also subset data to explore a single month.** 

Here, we look at data for all of July using the xarray `.where` function, which requires a: 
  - logical to evaluate. In the below example, this is when the time dimension equals our selected month; and 
  - 'dropping', or not using, the remaining data where the logical is false (optional). 

In [None]:
sel_month = 7
month_GPP = ds_ctsm['GPP'].where(ds_ctsm.time.dt.month == sel_month, drop=True)
print("month "+str(sel_month)+" has length = "+str(len(month_GPP.time)))

Does the time dimension have the right length? (48 times/day * n days/month * n years of simulations)
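
The expected length can be worked out with the standard library. This is a small sketch; the helper name `expected_timesteps` is hypothetical, and it assumes complete months with no missing timesteps.

```python
import calendar

def expected_timesteps(years, month, steps_per_day=48):
    """Number of timesteps in a given month summed over several years."""
    days = sum(calendar.monthrange(int(y), month)[1] for y in years)
    return days * steps_per_day

years = ["2018", "2019", "2020", "2021"]
n_july = expected_timesteps(years, 7)   # 31 days in each of 4 years
n_feb = expected_timesteps(years, 2)    # includes the 2020 leap day

print("July:", n_july, "February:", n_feb)
```

If the printed length of `month_GPP.time` differs from this number, some timesteps are missing or duplicated in the files you read in.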


<div class="alert alert-block alert-info">
<b>Note</b> You can also do this other ways. For example, using the xarray select function `.sel` should produce identical results.

> `month_GPP = ds_ctsm.GPP.sel(time=ds_ctsm.time.dt.month.isin([sel_month]))`    

</div>

**Calculate the average diel cycle for a month.**

We can accomplish this by combining the:
- `local_time` variable (which is made by adding the hour to the minutes expressed as a fraction of an hour),
- `.groupby` function, and
- `.plot`, which again gives us a quick look at an xarray data array.

In [None]:
month_GPP['local_time'] = month_GPP["time.hour"]+month_GPP["time.minute"]/60
month_GPP.groupby(month_GPP["local_time"]).mean().plot() ;

CTSM simulations are run in GMT, not local time.

We'll shift these results here, centering on local noon for now.
The shift used below is the one for CPER, which is in the Mountain time zone (GMT-7); if your site is in a different time zone, change the number of timesteps to match. We should:
- `.shift` by -14 timesteps (7 hours, since one timestep = 30 minutes), then repeat the
- `.groupby` function on our `local_time` variable, and then take the
- `.mean` or `.std`

You can layer all these steps into a single line of code!
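
A different way to get local time, shown here only as an alternative sketch, is to leave the data in place and convert the grouping variable instead, using modulo arithmetic on the hour. The offset of -7 hours and the synthetic cosine signal are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

utc_offset = -7  # hours; assumption for this sketch (Mountain time zone)

# Five days of synthetic 30-minute data that peak at 19:00 GMT
times = pd.date_range("2020-07-01", periods=48 * 5, freq="30min")
utc_hour = np.asarray(times.hour + times.minute / 60.0)
flux = np.cos(2 * np.pi * (utc_hour - 19.0) / 24.0)

# Convert the GMT hour to a local hour between 0 and 24
local_hour = (utc_hour + utc_offset) % 24

# Average by local hour to get the diel cycle
diel_mean = pd.Series(flux).groupby(local_hour).mean()
peak_local_hour = float(diel_mean.idxmax())

print("Peak at local hour:", peak_local_hour)
```

Because no values are moved, this approach does not push data across the boundaries between months or years, which a shift along the time dimension can do.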

In [None]:
# calculate mean and standard deviation 
mean = month_GPP.shift(time=-14).groupby(month_GPP["local_time"]).mean()
std = month_GPP.shift(time=-14).groupby(month_GPP["local_time"]).std()

**Now make the plots!**
Here we'll switch to plotting with matplotlib, loaded as `plt` for convenience, which has intuitive functions like:
- `.plot`(x, y, ...)
- `.fill_between`(x, y+z, y-z, ...) and plot control of features like
- `.xlabel`, `.ylabel`, `.title`, and `.grid`


In [None]:
plt.plot(mean['local_time'], mean, marker = 'o')
plt.fill_between(mean['local_time'], mean+std,mean-std, alpha=0.3)
plt.xlabel('local_time (h)')
plt.ylabel(month_GPP.attrs['long_name']+" ("+month_GPP.attrs['units']+")")
plt.title(neon_site+" diel mean, month="+str(sel_month), 
          loc='left',fontweight='bold') 
plt.grid() ;

**OK, just for fun, let's go crazy by plotting mean diel cycles for each month of the year.**

We can do this in a `for` loop with just a few lines of code by layering multiple functions into each line.

<div class="alert alert-block alert-info">
<b>Note:</b> loops and conditionals in Python use indentation to define what's happening inside them. This helps make the code easier to read and write. See the example below.
</div>

In [None]:
for m in range(1,13):
    month_GPP = ds_ctsm.GPP.sel(time=ds_ctsm.time.dt.month.isin([m]))
    month_GPP['local_time'] = month_GPP["time.hour"]+month_GPP["time.minute"]/60
    month_GPP.shift(time=-14).groupby(month_GPP["local_time"]).mean().plot(label=str(m))

# Note with newer versions of xarray this generates a 'FutureWarning', which is being suppressed

plt.title(neon_site+" diel mean", loc='left',fontweight='bold') 
plt.legend(title='month') ;

**Look at these results!**
- What months have the highest daily fluxes at this site?
- Why does the breadth of each curve change for different months?

---


## 2.3 Daily mean flux calculations

The above examples have calculated the mean and not changed units (*currently gC/m^2/s*).  Since each day has the same number of timesteps (48) and we are plotting diel cycles, the averaging and native units work. However, it is helpful to consider how data are being grouped in other types of averages and whether you might need to adjust the units.  For example, you may want to calculate a weighted mean at times (e.g. each month has a different number of days) or change units to make them better reflect your calculations. **You can think about this as you are working** by printing the outputs of  calculations at the end of each code cell and thinking about their dimensions, attributes, etc.

<div class="alert alert-block alert-info">
<b>Note:</b> Weighted averages and units changes can be calculated several ways. 

For example, you can use `.resample` or `.groupby` in xarray  
    
For one-dimensional variables it is more computationally efficient to use these functions in a pandas dataframe. For simplicity in this example, we will keep using xarray.

</div>
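
To illustrate why weighting matters, the sketch below builds a synthetic daily series for 2019 and compares a plain mean of monthly means with one weighted by the number of days in each month. The values are made up so that the difference is easy to see.

```python
import numpy as np
import pandas as pd

# Synthetic daily series for 2019: each day's value is its month number
days = pd.date_range("2019-01-01", "2019-12-31", freq="D")
daily = pd.Series(np.asarray(days.month, dtype=float), index=days)

# Monthly means, labelled by the first day of each month
monthly = daily.resample("MS").mean()

# Unweighted: every month counts equally
unweighted = monthly.mean()

# Weighted: each month counts in proportion to its number of days
weights = monthly.index.days_in_month
weighted = np.average(monthly.values, weights=weights)

print(unweighted, weighted, daily.mean())
```

Only the weighted result matches the mean of the daily values, because long months contribute more days than short ones.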

To calculate a daily mean we'll focus on: 
- `GPP` as the variable of interest
- the `.resample` function in xarray, as this handles the time dimension most intuitively
- the daily time dimension, `time='D'`

In [None]:
#-- Calculate daily average for every day 
daily_GPP = ds_ctsm.GPP.resample(time='D').mean()
daily_GPP

**Take a look at your new data array:**
- Do the values of the time coordinate seem appropriate?  
- Is the length of the time coordinate accurate?

<div class="alert alert-block alert-info">
<b>Note:</b> Variable attributes were carried forward to this new xarray data array, but we can modify this to get more sensible results.
</div>

Now we can modify the rate of the GPP fluxes from units of *per second* to units of *per day* and quickly `.plot()` results.

<div class="alert alert-block alert-warning">
<b>Be careful when changing the values of a variable!</b> 
    
If you run a code block like the one below repeatedly, you'll modify your GPP values every time. To avoid this, you can recalculate `daily_GPP` at the start of this cell by uncommenting the first line of the code below. Even though it's a little slower, it may ultimately be safer.
</div>



In [None]:
# daily_GPP = ds_ctsm.GPP.resample(time='D').mean()

sec_per_day = 60. * 60. * 24.
daily_GPP = daily_GPP * sec_per_day

# Then change the attribute
daily_GPP.attrs['units'] = 'gC/m^2/day'

# Now plot results
daily_GPP.plot() ;

Notice how averaging makes the results look smoother. We can also more easily see the interannual variability in GPP that is simulated in different years.
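
Another way to guard against converting units twice is to let the `units` attribute decide whether a conversion is still needed. The sketch below is one possible approach, not part of the tutorial's workflow; the function name `per_second_to_per_day` is hypothetical and the data are synthetic.

```python
import numpy as np
import pandas as pd
import xarray as xr

SEC_PER_DAY = 60.0 * 60.0 * 24.0

def per_second_to_per_day(da):
    """Convert a per-second flux to per-day; return converted data unchanged."""
    units = da.attrs.get("units", "")
    if units.endswith("/day"):
        return da  # already converted, do nothing
    if not units.endswith("/s"):
        raise ValueError("unexpected units: " + repr(units))
    out = da * SEC_PER_DAY
    out.attrs = dict(da.attrs)  # arithmetic may drop attributes, so copy them
    out.attrs["units"] = units[: -len("/s")] + "/day"
    return out

# Synthetic daily means standing in for daily_GPP
times = pd.date_range("2018-01-01", periods=3, freq="D")
daily = xr.DataArray(np.full(3, 1.0e-5), coords={"time": times},
                     dims="time", attrs={"units": "gC/m^2/s"})

once = per_second_to_per_day(daily)
twice = per_second_to_per_day(once)  # safe to repeat

print(once.attrs["units"], float(once[0]), float(twice[0]))
```

Running the conversion a second time leaves the values untouched, so re-executing a cell does no harm.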

**What is the range in annual GPP that CTSM simulates at this NEON site?**

We can assess this by creating a different plot: 
- use the `.resample` function  
- adjust the time dimension to the start of the year, `time='AS'`
- convert the `.mean()` to an annual flux with appropriate units (**NOTE** this assumes that all years have 365 days) 
- create a bar plot. xarray does not provide bar plots directly, so use `.to_series()` to convert to a `pandas.Series` and `.plot.bar()` to make the plot


In [None]:
(ds_ctsm.GPP.resample(time='AS').mean()*sec_per_day*365).to_series().plot.bar() ;
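
The 365-day assumption slightly underestimates totals in leap years. One alternative, sketched here on synthetic data, is to sum daily fluxes within each year so that the number of days is handled automatically. The constant flux of 1 gC/m^2/day is chosen only so the totals are easy to check.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic daily flux of exactly 1 gC/m^2/day for 2019 and 2020
times = pd.date_range("2019-01-01", "2020-12-31", freq="D")
daily_flux = xr.DataArray(np.ones(times.size), coords={"time": times}, dims="time")

# Sum the daily values within each calendar year
annual_total = daily_flux.groupby("time.year").sum()

totals = {int(y): float(v)
          for y, v in zip(annual_total["year"].values, annual_total.values)}
print(totals)
```

With real data this would be applied to the daily flux in per-day units, for example `daily_GPP` after the unit conversion.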

**There's even more to think about here:**
- What could cause this kind of interannual variability in GPP?  
- Are similar patterns common across multiple NEON sites in the region?
- Where is this site and what kind of vegetation grows there? 
- Are these annual GPP values realistic for this type of ecosystem? 
<br>

---

## 2.4 Annual climatology

Sometimes it's helpful to look at the seasonal cycle.  Averaging across multiple years to generate a mean annual cycle is called a climatology. This example uses the `.groupby` function.

We have already calculated the daily mean and will also need to calculate the daily standard deviation `.std` to understand how the variability in the **daily** fluxes contributes to seasonal variability across years.

This will create two data arrays for the variable `GPP` (the mean and the standard deviation). We want to:
- use `.rename` to facilitate combining the data arrays
- combine the variable names and their array values into a dataset, `xr.Dataset`

In [None]:
# Calculate mean and standard deviation for variables
climo_GPP_mean = daily_GPP.groupby(daily_GPP["time.dayofyear"]).mean()
climo_GPP_std  = daily_GPP.groupby(daily_GPP["time.dayofyear"]).std()

## rename this variable for this data array
climo_GPP_std = climo_GPP_std.rename('std')

# combine the data arrays into a dataset
climo_GPP = xr.Dataset({"GPP": climo_GPP_mean, "std": climo_GPP_std})

# print a variable from the dataset
climo_GPP['GPP']

Since we are already using daily values, the units are correct.

You'll notice that using `.groupby` changes the time coordinates. Let's adjust these back to a datetime format using the `.rename` function again (here on a coordinate variable).

In [None]:
# rename the coordinate variable
climo_GPP = climo_GPP.rename({'dayofyear': 'time'})

# create a pandas DatetimeIndex of the same length
# It's true our climatology is the mean of simulated daily values between 2018-2021, 
#   but datetime objects are simpler to work with if you provide a year
#   (2020 is a leap year, so it covers all 366 days)
time = pd.date_range("2020-01-01", periods=366)

#replace time coordinate values
climo_GPP['time'] = time 

# Check that it all worked
climo_GPP
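
The cell above assigns dates by position, which relies on all 366 days being present. A value-based alternative, sketched below, converts each day-of-year number to a date in the reference year with `pd.to_datetime`. The three example days are arbitrary.

```python
import numpy as np
import pandas as pd

# Example day-of-year values: first day, leap day, and last day of a leap year
dayofyear = np.array([1, 60, 366])

# Day 1 corresponds to an offset of 0 days from January 1 of the reference year
dates = pd.to_datetime(dayofyear - 1, unit="D", origin="2020-01-01")

labels = [d.strftime("%Y-%m-%d") for d in dates]
print(labels)
```

Because each date is computed from its own day-of-year value, a missing day would not shift the dates that follow it.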

**Now plot the data!** 
We've done a lot of work. Let's:
1. Save this figure. We can do this in a new directory, `plot_dir`, under the `NEON_cases` directory.
2. Plot interannual variability around the daily means using matplotlib. Note that this code is a little more complex to make a nice looking figure, so don't worry about understanding the details now.  Instead, refer back to this as you become more familiar with analyses in Python and plotting in matplotlib.

In [None]:
# Make the figures directory if it's not already there
plot_dir = os.path.realpath(os.path.expanduser('~/scratch/NEON_cases/figures'))
if not os.path.isdir(plot_dir):
    os.makedirs(plot_dir, exist_ok=True)

In [None]:
# Now plot the climatological mean
f, ax = plt.subplots()
ax.plot(climo_GPP.time,climo_GPP.GPP)

# Add shading around the mean +/- one sigma of daily mean
ax.fill_between(climo_GPP.time, (climo_GPP.GPP-climo_GPP['std']), 
                                (climo_GPP.GPP+climo_GPP['std']), alpha=0.3)

ax.set_ylabel('GPP ('+climo_GPP.GPP.attrs['units']+')',fontweight='bold')
ax.set_title(neon_site+" annual climatology", loc='left',fontweight='bold')


# format x axis for monthly dates
locator = mdates.MonthLocator()  # every month
# Specify the format - %b gives us Jan, Feb...
fmt = mdates.DateFormatter('%b')            
            
X=ax.xaxis
X.set_major_locator(locator)
X.set_major_formatter(fmt)
ax.grid()   ; 

# Save out a .pdf.  Other options include .png, .eps, etc.
print(' -- saving figure -- ')
plot_name = neon_site+'_'+'GPP_climatology.pdf'
plt.savefig (os.path.join(plot_dir,plot_name), 
             dpi=300,bbox_inches='tight', format = 'pdf')

In [None]:
# We can check the image was created
!ls ~/scratch/NEON_cases/figures

---
# 3. Export postprocessed data

Finally! We did a lot of work to generate this climatology. You want to save the data as a different file type to work with later (e.g. .csv) so that you don't need to rely on this python script. To do this: 
1. Create a directory to save the file
2. Define the file name for our postprocessed data
3. Generate a pandas dataframe from the climatology (xarray) dataset using the `.to_dataframe` function  
4. Save the dataframe as a .csv file using the `.to_csv` function

In [None]:
# Make the output directory if it's not already there
out_dir = os.path.realpath(os.path.expanduser('~/scratch/NEON_cases/postprocessed_data'))
if not os.path.isdir(out_dir):
    os.makedirs(out_dir, exist_ok=True)
    
# define the file name and directory
file_out = out_dir+'/'+neon_site+'_'+'GPP_climatology.csv'

# Create a dataframe and write out the .csv file
climo_GPP.to_dataframe().to_csv(file_out)
print('wrote '+ file_out)

In [None]:
# Check that this file exists
!ls ~/scratch/NEON_cases/postprocessed_data/

#### Check the .csv file
- Is there much metadata associated with the .csv file?
- Will you be able to remember how you generated these data if you use them elsewhere or share them with colleagues?
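
One way to keep some metadata with a .csv file is to write a few comment lines before the table and skip them when reading the file back. The sketch below uses a tiny synthetic table; the metadata keys, the file name, and the `#` comment convention are assumptions for illustration, not a standard.

```python
import os
import tempfile
import pandas as pd

# Tiny synthetic stand-in for climo_GPP.to_dataframe()
climo = pd.DataFrame(
    {"GPP": [0.1, 0.2, 0.3], "std": [0.01, 0.02, 0.03]},
    index=pd.date_range("2020-01-01", periods=3, name="time"),
)

# Hypothetical metadata describing how the data were generated
metadata = {"site": "KONZ", "units": "gC/m^2/day",
            "source": "CTSM h1 history files, 2018-2021"}

path = os.path.join(tempfile.mkdtemp(), "GPP_climatology_with_header.csv")
with open(path, "w") as fh:
    # Write one commented line per metadata entry, then the table itself
    for key, value in metadata.items():
        fh.write("# " + key + ": " + value + "\n")
    climo.to_csv(fh)

# Lines starting with '#' are skipped when reading the file back
check = pd.read_csv(path, comment="#", index_col="time", parse_dates=True)
print(check)
```

Anyone opening the file in a text editor sees the site, units, and source at the top, while pandas still reads the table normally.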

<div class="alert alert-block alert-success">
<b>Congratulations!</b> 
    
You've manipulated and plotted data from one site and one variable.  
    
What would you like to look at now?
</div>
