# Quick plots - CTSM
## *Simple plots to look at CTSM data quickly*

This tutorial is an introduction to [xarray](https://docs.xarray.dev/en/stable/user-guide/terminology.html) and [matplotlib](https://matplotlib.org/stable/index.html). There's lots more information to be found in the documentation for these libraries.  Note that some users prefer the seaborn library to matplotlib; we don't have examples using seaborn at this point.

In this tutorial you will find steps and instructions to:

1. Load datasets with xarray
2. Manipulate data and make plots for:
> 2.1 Raw dataset; <br>
> 2.2 Diel averages for given months of the year; <br>
> 2.3 Daily and annual fluxes; and <br>
> 2.4 Annual climatologies
3. Export data to other file types (*e.g., .csv files for users who don't want to work with python*).

------

# 1. Load Datasets

## 1.1 Load Python Libraries
We always start by loading in the libraries we're going to use for the script.  There are more libraries being loaded here than we'll likely use, but this list is a good one to get started for most of your plotting needs.


In [None]:
import os
import time
import datetime

import numpy as np
import pandas as pd
import xarray as xr

from glob import glob
from os.path import join

import matplotlib
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

from neon_utils import fix_time_h1

In [None]:
print('xarray '+xr.__version__) ##-- was working with 2023.1.0

## 1.2 Point to history files 

### 1.2.1 Where are my simulation results?
After your simulations finish, history files are all saved in your `~/scratch/NEON_cases/archive/` directory.

We can print the cases we have to look at using bash magic (`%%bash` or `!`), which turns the python cell block below into a bash cell.  

In [None]:
%%bash
ls ~/scratch/NEON_cases/archive/

<div class="alert alert-block alert-info">
<b>Note</b> you can accomplish the same thing with the following.

> `!ls ~/scratch/NEON_cases/archive/`
    
</div>

---

### 1.2.2 Point to the data folder with history files 
**We'll set the following:**
- site to look at; 
- path to our archive directory;
- directory with input data (where history files are found).

Setting these up front makes the script easier to modify for different sites.

In [None]:
neon_site = 'CPER'  #NEON site we're going to look at
archive = '~/scratch/NEON_cases/archive' #Path to archive directory

# this expands the `~` shortcut we used above and resolves the full path
archive = os.path.realpath(os.path.expanduser(archive)) 

# Create a path to the data folder
data_folder = archive+'/'+neon_site+'.transient/lnd/hist'
data_folder

**Is the path for input data, `data_folder`, correct?** *HINT:* You can check in the terminal window or using bash magic.
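
The same check can also be done from Python. The sketch below is only an illustration: it builds a throwaway directory tree in a temporary folder (`demo_root` and its layout are assumptions made for this example, not your real archive) and uses `os.path.isdir` to confirm whether a path exists.

```python
import os
import tempfile

# Stand-in for the archive directory; on your system this would be
# os.path.expanduser('~/scratch/NEON_cases/archive')
demo_root = tempfile.mkdtemp()
neon_site = 'CPER'

# os.path.join inserts the path separators for us
data_folder = os.path.join(demo_root, neon_site + '.transient', 'lnd', 'hist')
print(os.path.isdir(data_folder))   # False: nothing has been created yet

os.makedirs(data_folder)
print(os.path.isdir(data_folder))   # True: the folder now exists
```

In the notebook itself, `os.path.isdir(data_folder)` on the real path should return `True` once your simulations have been archived.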

---
### 1.2.3 Create some functions we'll use when opening the data
- `preprocess_some` limits the number of variables we're reading in. It is passed to xarray's `preprocess` option, which helps save time (and memory resources).
- `fix_time_h1` corrects annoying features related to how CTSM history files handle time and is provided as part of `neon_utils.py`.

Don't worry too much about the details of these functions right now.


In [None]:
# -- read only these variables from the whole netcdf files
def preprocess_some(ds):
    variables = ['FCEV', 'FCTR', 'FGEV', 'FSH', 'GPP', 'FSA', 'FIRA', 'AR', 'HR', 'ELAI']
    # keep the selected variables and take the first (only) point along lndgrid
    ds_new = ds[variables].isel(lndgrid=0)
    return ds_new
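
If you are curious what a preprocess function does to a dataset, here is a minimal sketch on synthetic data. The dataset, the `TBOT` placeholder variable, and the `preprocess_demo` name are made up for illustration; only the select-then-`isel` pattern mirrors the function above.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for one history file: two variables on (time, lndgrid)
ds = xr.Dataset(
    {
        "GPP": (("time", "lndgrid"), np.ones((4, 1))),
        "TBOT": (("time", "lndgrid"), np.full((4, 1), 280.0)),
    },
    coords={"time": np.arange(4)},
)

def preprocess_demo(ds):
    # keep only the listed variables and drop the length-1 lndgrid dimension
    return ds[["GPP"]].isel(lndgrid=0)

small = preprocess_demo(ds)
print(list(small.data_vars))   # ['GPP']
print(small.GPP.dims)          # ('time',)
```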

### 1.2.4 List all the files we're going to open
The 30-minute, high-frequency history output (**'h1' files**) is written out every day for NEON cases. 

To open all of these files we're going to need to know their names.  This can be done if we:
- Create an empty list `[]` of simulation files that is
- `.extend`ed with a 
- `sorted` list of files generated with the 
- `glob` function in python of the 
- `*h1*`files in our `data_folder` 

You'll notice that **all of this gets combined in a single line of code** that runs through a 
- `for` loop over defined simulation years (written as a list of strings)

<div class="alert alert-block alert-info">
<b>Note</b> If you're new to python, it's dense but efficient.  I actually borrowed a bunch of this code from a colleague, Negin Sobhani, who's good at python! Sharing code is really helpful. 
</div>



In [None]:
# This list gives you control over the years of data to read in
years = ["2018","2019","2020","2021"]  

# Create an empty list of all the file names to extend
sim_files = []
for year in years:
    sim_files.extend(sorted(glob(join(data_folder,"*h1."+year+"*.nc"))))

print("All simulation files for all years: [", len(sim_files), "files]")
print(sim_files[-1])

How many files are you going to have to read in?  What is the last day of the simulation you'll be looking at?
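
To see how `glob`, `sorted`, and `.extend` work together without touching your real output, here is a small sketch using empty placeholder files in a temporary folder. The `demo.*` file names are invented and only mimic the `h1` naming pattern.

```python
import os
import tempfile
from glob import glob
from os.path import join

demo_folder = tempfile.mkdtemp()

# Create empty placeholder files, deliberately out of order
names = ["demo.h1.2019-01-02.nc", "demo.h1.2018-12-31.nc",
         "demo.h1.2019-01-01.nc", "demo.h0.2019-01-01.nc"]
for name in names:
    open(join(demo_folder, name), "w").close()

demo_files = []
for year in ["2018", "2019"]:
    # glob returns matches in arbitrary order, so sort within each year
    demo_files.extend(sorted(glob(join(demo_folder, "*h1." + year + "*.nc"))))

print([os.path.basename(f) for f in demo_files])
# ['demo.h1.2018-12-31.nc', 'demo.h1.2019-01-01.nc', 'demo.h1.2019-01-02.nc']
```

Note that the `h0` file is skipped because it doesn't match the `*h1.` part of the pattern.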

---

### 1.2.5 Read in the data
`.open_mfdataset` will open all of these data files and concatenate them into a single **xarray dataset**.

There are a lot of files here! Be patient, it should be done in < 1 minute. This can be done more quickly with dask, but we're not going to mess with it right now.

We are also going to use our `preprocess_some` function and the `fix_time_h1` function from `neon_utils.py` in this step.

In [None]:
start = time.time()

# Open all files, reading only the variables selected in preprocess_some

print ('---------------------------')
print ("Reading in data for "+neon_site)
ds_ctsm = xr.open_mfdataset(sim_files, decode_times=True, combine='by_coords',
                            preprocess=preprocess_some)
ds_ctsm = fix_time_h1(ds_ctsm)

end = time.time()
print("Reading all simulation files took:", end-start, "s.")
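
For intuition about what `combine='by_coords'` is doing, the sketch below combines three in-memory datasets with `xr.combine_by_coords`. The `one_day` helper and the zero-filled data are invented for this example; no files are read.

```python
import numpy as np
import pandas as pd
import xarray as xr

def one_day(date):
    # one synthetic "file": 48 half-hourly values for a single day
    time = pd.date_range(date, periods=48, freq="30min")
    return xr.Dataset({"GPP": ("time", np.zeros(48))}, coords={"time": time})

pieces = [one_day("2018-01-01"), one_day("2018-01-02"), one_day("2018-01-03")]

# combine_by_coords orders and joins the pieces using their time coordinate
combined = xr.combine_by_coords(pieces)
print(combined.sizes["time"])   # 144
```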


#### Take a quick look at the dataset.
- What are your coordinate variables?
- How long is the time dimension?
- What variables do we have to look at?
- What are the long names of some of these variables? (HINT try `ds_ctsm.GPP`)
- What other metadata are associated with this dataset? 

In [None]:
ds_ctsm

---
---

# 2. Manipulating data and making plots
These are all just timeseries data.  We'll focus on different ways to average over the time dimension to visualize results.

We'll also just focus on gross primary production (GPP), but you can adjust this to look at other variables of interest.

---

## 2.1 Quick look at the full time series
It's often helpful to visualize the raw data before you get started.
This is quick and easy if you:
- point to your dataset `ds_ctsm`, 
- select a variable `GPP`, and use the 
- `.plot()` function that's built into xarray.  

These three actions are combined in a single line of code.

In [None]:
ds_ctsm.GPP.plot() ;

Hopefully you can see clear peaks for each year's growing season.  
- Why are there still zero values during the middle of the summer?
- What do data for other variables look like?

This volume of data is hard to digest.  Let's look at a few ways to manipulate the data.

---
## 2.2 Mean diel cycle
We have this high-frequency output to look at the daily cycle of fluxes.  Let's look at just a few days of data.

We'll just subset the data by adding the following xarray features to our previous `ds_ctsm.GPP.plot()` command:
- `.isel`, the *index select*, on our coordinate variable, *time*, and 
- `slice`, which lets us slice out a range of data instead of a single time point.

In [None]:
## Select data range, here for a few days of July 
#  @ 30 minute frequency this means 48 times recorded / day
first_time = (181*48)
last_time = (185*48)

ds_ctsm.GPP.isel(time=slice(first_time,last_time)).plot() ;

*Ah, it looks like GPP goes to zero at night.  That's why there are zero values in the middle of summer!*
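
If the index arithmetic feels opaque, this sketch checks which dates a slice like the one above picks out. It assumes a synthetic half-hourly series that starts at midnight on 1 January 2018; your real `ds_ctsm` time axis may be offset from this.

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2018-01-01", "2018-12-31 23:30", freq="30min")
da = xr.DataArray(np.zeros(time.size), coords={"time": time}, dims="time")

steps_per_day = 48
subset = da.isel(time=slice(181 * steps_per_day, 185 * steps_per_day))

print(subset.sizes["time"])      # 192, i.e. 4 days x 48 steps
print(subset.time.values[0])     # first timestamp in the slice
print(subset.time.values[-1])    # last timestamp in the slice
```

With this synthetic axis the slice runs from 1 July 00:00 to 4 July 23:30, because the first six months of 2018 contain 181 days.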

**You can also subset data to just look at data for a month.** 

Here we'll look at data from July using the xarray 
- `.where` function, which requires a 
  - logical to evaluate, here when the dataset's time dimension equals our selected month; and 
  - dropping the unused data where the logical is false (optional). 

In [None]:
sel_month = 7
month_GPP = ds_ctsm.GPP.where(ds_ctsm.time.dt.month == sel_month, drop=True)
print("month "+str(sel_month)+" has length = "+str(len(month_GPP.time)))

**Does the time dimension seem to have the right length?** (48 times/day * n days/month * n years of simulations) 


<div class="alert alert-block alert-info">
<b>Note</b> you can also do this other ways
    
For example, using the xarray select function, `.sel`, should produce identical results.

> `month_GPP = ds_ctsm.GPP.sel(time=ds_ctsm.time.dt.month.isin([sel_month]))`    

</div>
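
Here is a quick sketch comparing the `.where` and `.sel` approaches and checking the expected length. It runs on two years of synthetic half-hourly data, not on the CTSM output.

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2018-01-01", "2019-12-31 23:30", freq="30min")
da = xr.DataArray(np.arange(time.size, dtype=float),
                  coords={"time": time}, dims="time")

july_where = da.where(da.time.dt.month == 7, drop=True)
july_sel = da.sel(time=da.time.dt.month.isin([7]))

expected = 48 * 31 * 2    # times/day * days in July * years
print(july_where.sizes["time"], july_sel.sizes["time"], expected)   # 2976 2976 2976
print(bool((july_where == july_sel).all()))                         # True
```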

Even this subset of data is still a lot to try and make sense of! Now we'll take a look at the average diel cycle for the month.

We can accomplish this using the:
- `.groupby` function for combined 
- `local_time` variable (which is made by adding the hour to a decimal minute for each day) and 
- `.plot`, which again gives us a quick look at an xarray data array.

In [None]:
month_GPP['local_time'] = month_GPP["time.hour"]+month_GPP["time.minute"]/60
month_GPP.groupby(month_GPP["local_time"]).mean().plot() ;

CTSM simulations are run with GMT, not local time.

We'll shift these results here, just centering by eye to local noon for now.
CPER is in the Mountain time zone (GMT-7), meaning we should:
- `.shift` -14 timesteps (7 hours at 2 timesteps per hour), repeating the 
- `.groupby` function on our `local_time` variable, and then taking the 
- `.mean` or `.std`

You can layer all these steps into a single line of code!

In [None]:
# calculate mean and standard deviation 
mean = month_GPP.shift(time=-14).groupby(month_GPP["local_time"]).mean()
std = month_GPP.shift(time=-14).groupby(month_GPP["local_time"]).std()
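
A different way to get to local time, sketched below, is to relabel each timestep with its local hour rather than shifting the values. This is not what the cell above does; it's an alternative that assumes a fixed UTC-7 offset (no daylight saving) and uses a synthetic flux that peaks at 19:00 UTC.

```python
import numpy as np
import pandas as pd
import xarray as xr

utc_offset = -7   # assumed fixed offset for Mountain Standard Time

time = pd.date_range("2018-07-01", periods=48 * 31, freq="30min")
utc_hour = np.asarray(time.hour + time.minute / 60, dtype=float)
# Synthetic flux: zero at night, peaking at 19:00 UTC (12:00 local)
flux = np.clip(np.cos(2 * np.pi * (utc_hour - 19) / 24), 0, None)
da = xr.DataArray(flux, coords={"time": time}, dims="time", name="GPP")

# Label each timestep with its local hour of day instead of shifting values
local_time = (da.time.dt.hour + da.time.dt.minute / 60 + utc_offset) % 24
da = da.assign_coords(local_time=local_time)

diel_mean = da.groupby("local_time").mean()
peak_hour = float(diel_mean.local_time.values[diel_mean.values.argmax()])
print(peak_hour)   # 12.0
```

Relabeling keeps every value attached to its own timestamp, so nothing is pushed past the ends of the array the way it is with `.shift`.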

**Now make the plots!**
Here we'll switch to plotting with matplotlib, loaded as `plt` for convenience, which has intuitive functions like:
- `.plot`(x, y, ...)
- `.fill_between`(x, y+z, y-z, ...) and plot control of features like
- `.xlabel`, `.ylabel`, `.title`, and `.grid`


In [None]:
plt.plot(mean['local_time'], mean, marker = 'o')
plt.fill_between(mean['local_time'], mean+std,mean-std, alpha=0.3)
plt.xlabel('local_time (h)')
plt.ylabel(month_GPP.attrs['long_name']+" ("+month_GPP.attrs['units']+")")
plt.title(neon_site+" diel mean, month="+str(sel_month), 
          loc='left',fontweight='bold') 
plt.grid() ;

**OK just for fun, let's go crazy by plotting mean diel cycles for each month of the year.**

We can do this in a `for` loop with just a few lines of code by layering multiple functions into each line.

<div class="alert alert-block alert-info">
<b>Note</b> Loops and conditionals in python use indentation to define what's happening inside them.  This helps make the code easier to read and write. See the example below.
</div>
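
As a tiny, self-contained illustration of indentation (unrelated to the CTSM data):

```python
squares = []
for n in range(1, 4):
    value = n ** 2              # indented: runs on every pass through the loop
    if value > 1:
        squares.append(value)   # indented twice: runs only when the test is true
print(squares)                  # not indented: runs once, after the loop ends
# [4, 9]
```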

In [None]:
for m in range(1,13):
    month_GPP = ds_ctsm.GPP.sel(time=ds_ctsm.time.dt.month.isin([m]))
    month_GPP['local_time'] = month_GPP["time.hour"]+month_GPP["time.minute"]/60
    month_GPP.shift(time=-14).groupby(month_GPP["local_time"]).mean().plot(label=str(m))

plt.title(neon_site+" diel mean", loc='left',fontweight='bold') 
plt.legend(title='month') ;

**Look at these results!**
- What months have the highest daily fluxes at this site?
- Why does the breadth of each curve change for different months?

---


## 2.3 Daily mean flux calculations

We're calculating the mean to avoid having to change units (*currently gC/m^2/s*).  Since each day has the same number of timesteps (48), this kind of averaging is OK to do, but it's good to be aware of how you're grouping data down the road.  For example you may want to take a weighted mean at times (e.g. each month has a different number of days). Or change units to make them better reflect your calculations. **I like to think about this as I'm working** by printing the outputs of my calculations at the end of each code cell and thinking about their dimensions, attributes, etc.
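
To make the weighted-mean point concrete, here is a small sketch with made-up monthly values. It compares a simple average of twelve monthly means with one weighted by the number of days in each month.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly mean values for one non-leap year (arbitrary units)
months = pd.date_range("2019-01-01", periods=12, freq="MS")
monthly_mean = np.arange(1.0, 13.0)

days = np.asarray(months.days_in_month, dtype=float)

simple_mean = monthly_mean.mean()
weighted_mean = np.average(monthly_mean, weights=days)

print(simple_mean)               # 6.5
print(round(weighted_mean, 3))   # 6.526
```

The difference is small here, but it grows when the values change a lot between short and long months.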

<div class="alert alert-block alert-info">
<b>Note:</b> This can also be done several ways.
    
For example using `.resample` or `.groupby` in xarray, but `.resample` handles time better.  
    
For one dimensional variables it is more computationally efficient to use a pandas dataframe, from which xarray borrowed much of its `.resample` and `.groupby` logic. If you're familiar with pandas, you're welcome to do this, but for simplicity in this example we're going to keep using xarray.

</div>

To calculate daily mean we'll focus on: 
- `GPP` as our variable, and use the 
- `.resample` function in xarray over 
- `time='D'` for every day of the timeseries, to calculate the daily
- `.mean`

In [None]:
#-- Calculate daily average for every day 
daily_GPP = ds_ctsm.GPP.resample(time='D').mean()
daily_GPP

**Take a look at your new data array:**
- Do the values of the time coordinate seem appropriate?  
- Is the length of the time coordinate accurate?
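
One way to convince yourself that `.resample(time='D').mean()` does what you expect is to try it on data where you already know the answer. The sketch below uses three synthetic days with constant values.

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2018-01-01", periods=48 * 3, freq="30min")
# Day 1 is all 1s, day 2 all 2s, day 3 all 3s
values = np.repeat([1.0, 2.0, 3.0], 48)
da = xr.DataArray(values, coords={"time": time}, dims="time", name="GPP")

daily = da.resample(time="D").mean()
print(daily.sizes["time"])    # 3
print(daily.values)           # [1. 2. 3.]
print(daily.time.values[0])   # each label is the start of its day
```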

<div class="alert alert-block alert-info">
<b>Note:</b> Variable attributes were carried forward to this new xarray data array, but we can modify this to get more sensible results.
</div>

Now we can modify the rate of the GPP fluxes from units of *per second* to units of *per day* and quickly `.plot()` results.

<div class="alert alert-block alert-warning">
<b>Be careful when changing the values of a variable!</b> 
    
If you run a code block like the one below repeatedly, you'll modify your GPP values every time. To avoid this, you can recalculate `daily_GPP` at the start of this cell by uncommenting the first line of the code below. Even though it's a little slower, it may ultimately be safer.
</div>
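
Another defensive option, sketched here, is to wrap the conversion in a small function that checks the `units` attribute first, so running it twice does no harm. The `to_per_day` helper is hypothetical (it isn't part of xarray or `neon_utils.py`) and assumes the units string is exactly `gC/m^2/s`.

```python
import numpy as np
import xarray as xr

SEC_PER_DAY = 60.0 * 60.0 * 24.0

def to_per_day(da):
    """Hypothetical helper: convert gC/m^2/s to gC/m^2/day exactly once."""
    if da.attrs.get("units") != "gC/m^2/s":
        return da                      # already converted (or unknown units)
    out = da * SEC_PER_DAY
    out.attrs = dict(da.attrs, units="gC/m^2/day")
    return out

flux = xr.DataArray(np.array([1e-5, 2e-5]), dims="time",
                    attrs={"units": "gC/m^2/s", "long_name": "demo flux"})

once = to_per_day(flux)
twice = to_per_day(once)               # second call leaves values unchanged
print(once.values, twice.values)
print(twice.attrs["units"])            # gC/m^2/day
```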



In [None]:
# daily_GPP = ds_ctsm.GPP.resample(time='D').mean()

sec_per_day = 60. * 60. * 24.
daily_GPP = daily_GPP * sec_per_day

# Then change the attribute
daily_GPP.attrs['units'] = 'gC/m^2/day'

# Now plot results
daily_GPP.plot() ;

Now our results look a bit smoother because we averaged in the zero GPP fluxes simulated at night time.  

We can also more easily see the interannual variability in GPP that's simulated in different years.
- What is the range in annual GPP that CTSM simulates at this NEON site?

We can quickly assess this using the same 
- `.resample` function over 
- `time='AS'` years (at the start of the year; recent pandas versions prefer the equivalent alias `'YS'`), and converting the
- `.mean()` to an annual flux with appropriate units (**NOTE** this does assume that all years have 365 days, which isn't completely accurate, but should get us close enough.) 
- `.to_series()` converts to a pandas.Series
- `.plot.bar()` then makes a bar plot (xarray doesn't have this function)


In [None]:
(ds_ctsm.GPP.resample(time='AS').mean()*sec_per_day*365).to_series().plot.bar() ;
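
If you'd rather not assume 365 days, one alternative is to sum daily totals within each year. This is a different calculation from the mean-times-365 above; the sketch assumes daily values already in per-day units, a standard calendar, and synthetic data.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic daily flux of exactly 1 gC/m^2/day on a standard calendar
time = pd.date_range("2019-01-01", "2020-12-31", freq="D")
daily = xr.DataArray(np.ones(time.size), coords={"time": time}, dims="time")

annual_total = daily.resample(time="YS").sum()
print(annual_total.values)   # [365. 366.]
```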

**There's even more to think about here:**
- What could cause this kind of interannual variability in GPP?  
- Are similar patterns common across multiple NEON sites in the region?
- Where is this site? 
- What kind of vegetation grows there? 
- Are these annual GPP values realistic for this kind of ecosystem? 
<br>

---

## 2.4 Annual climatology

Sometimes it's helpful to just get a sense of the seasonal cycle at a site.  This is called a climatology, when we average fluxes across multiple years to see what the mean annual cycle looks like. Now we're going to use `.groupby`, as I'm not smart enough to do this with `.resample`.

Because we're doing statistics here, taking the standard deviation `.std`, we also want to do this on the daily results we've already calculated.  Otherwise 30 minute values will be used to calculate the standard deviation of the daily climatology, which isn't really reflective of the variability in **daily** fluxes at the site.

In summary, this code uses the following functions:
- `.groupby` over 
- `daily_GPP["time.dayofyear"]` a special time variable for the day of the year to calculate
- `.mean` or `.std`

This will create two data arrays for the variable `GPP` that we want to:
- `.rename` so that we can combine the data arrays into an
- `xr.Dataset`, which takes variable names, and their array values in a dictionary format
- `{"GPP": climo_GPP_mean, "std": climo_GPP_std}`

In [None]:
# Calculate mean and standard deviation for variables
climo_GPP_mean = daily_GPP.groupby(daily_GPP["time.dayofyear"]).mean()
climo_GPP_std  = daily_GPP.groupby(daily_GPP["time.dayofyear"]).std()

## rename this variable for this data array
climo_GPP_std = climo_GPP_std.rename('std')

# combine the data arrays into a dataset
climo_GPP = xr.Dataset({"GPP": climo_GPP_mean, "std": climo_GPP_std})

# print a variable from the dataset
climo_GPP['GPP']

Because we're already using daily values, the units are correct.
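
To check your understanding of the `dayofyear` grouping, here is a sketch on two synthetic years where the answer is easy to work out by hand.

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2019-01-01", "2020-12-31", freq="D")
# Year one is all 1s, year two is all 3s
values = np.where(time.year == 2019, 1.0, 3.0)
daily = xr.DataArray(values, coords={"time": time}, dims="time", name="GPP")

climo_mean = daily.groupby("time.dayofyear").mean()
climo_std = daily.groupby("time.dayofyear").std()

print(climo_mean.sizes["dayofyear"])        # 366, because 2020 is a leap year
print(float(climo_mean.sel(dayofyear=1)))   # 2.0
print(float(climo_std.sel(dayofyear=1)))    # 1.0
```

Day 366 only has one sample in this example, so its standard deviation is zero. Keep that in mind when interpreting the last day of a climatology built from a few years.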

You'll notice that using `.groupby` changes the time coordinates. Let's adjust these back to a datetime format using the `.rename` function again (here on a coordinate variable).

In [None]:
# rename the coordinate variable
climo_GPP = climo_GPP.rename({'dayofyear': 'time'})

# create a pandas DatetimeIndex of the same length as the climatology
# It's true our climatology is the mean of simulated daily values between 2018-2021, 
#   but datetime objects are simpler to work with if you provide a year
#   (2020 is a leap year, so it can hold up to 366 days)
time = pd.date_range("2020-01-01", periods=climo_GPP.sizes['time'])

#replace time coordinate values
climo_GPP['time'] = time 

# Check that it all worked
climo_GPP

**Now plot the data!** 
We've done a lot of work. Let's:
1. Save this figure to put on the fridge back home. 
  - We can do this in a new `plot_dir` under the `NEON_cases` directory where you can save your work.
2. Use matplotlib to show interannual variability around the daily means
  - This code is a little fancier to make a nice looking figure, but don't worry about understanding everything that's going on right away.  Instead, this is a place to refer back to as you get more familiar with analyses in python and plotting in matplotlib.

In [None]:
# Make the figures directory if it's not already there
plot_dir = os.path.realpath(os.path.expanduser('~/scratch/NEON_cases/figures'))
if not os.path.isdir(plot_dir):
    os.makedirs(plot_dir, exist_ok=True)

In [None]:
# Now plot the climatological mean
f, ax = plt.subplots()
ax.plot(climo_GPP.time,climo_GPP.GPP)

# Add shading around the mean +/- one sigma of daily mean
ax.fill_between(climo_GPP.time, (climo_GPP.GPP-climo_GPP['std']), 
                                (climo_GPP.GPP+climo_GPP['std']), alpha=0.3)

ax.set_ylabel('GPP ('+climo_GPP.GPP.attrs['units']+')',fontweight='bold')
ax.set_title(neon_site+" annual climatology", loc='left',fontweight='bold')


# format x axis for monthly dates
locator = mdates.MonthLocator()  # every month
# Specify the format - %b gives us Jan, Feb...
fmt = mdates.DateFormatter('%b')            
            
X=ax.xaxis
X.set_major_locator(locator)
X.set_major_formatter(fmt)
ax.grid()   ; 

# Save out a .pdf.  Other options include .png, .eps, etc.
print(' -- saving figure -- ')
plot_name = neon_site+'_'+'GPP_climatology.pdf'
plt.savefig (os.path.join(plot_dir,plot_name), 
             dpi=300,bbox_inches='tight', format = 'pdf')

In [None]:
# We can check the image was created
!ls ~/scratch/NEON_cases/figures

---
# 3. Export postprocessed data

Finally! We did a lot of work to generate this climatology.  Maybe we want to save the data as a different file type to work with later (e.g. .csv)?  To do this we can: 
1. Create a directory to save our file
2. Define the file name for our postprocessed data
3. Use the `.to_dataframe` function to generate a pandas dataframe from our climatology (xarray) dataset 
4. Use the `.to_csv` function to save the dataframe as a .csv

In [None]:
# Make the postprocessed data directory if it's not already there
out_dir = os.path.realpath(os.path.expanduser('~/scratch/NEON_cases/postprocessed_data'))
if not os.path.isdir(out_dir):
    os.makedirs(out_dir, exist_ok=True)
    
# define the file name and directory
file_out = out_dir+'/'+neon_site+'_'+'GPP_climatology.csv'

# Create a dataframe and write out the .csv file
climo_GPP.to_dataframe().to_csv(file_out)
print('wrote '+ file_out)

In [None]:
# Check that this file exists
!ls ~/scratch/NEON_cases/postprocessed_data/

#### It may be worth looking at this .csv file.
- Is there much metadata associated with the .csv file?
- Will you be able to remember how you generated these data if you use them elsewhere or share them with colleagues?
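
One possible way to keep some metadata with a .csv file, sketched below, is to write a few `#` comment lines before the table and tell pandas to skip them when reading. The dataframe, the metadata values, and the file name are stand-ins for illustration, and this is a convention you'd be choosing yourself, not something `.to_csv` does for you.

```python
import os
import tempfile
import pandas as pd

# Small stand-in for the climatology dataframe
df = pd.DataFrame({"GPP": [0.1, 0.2], "std": [0.01, 0.02]},
                  index=pd.Index([1, 2], name="dayofyear"))

# Hypothetical metadata you might want to keep with the file
metadata = {"site": "CPER", "units": "gC/m^2/day", "years": "2018-2021"}

csv_path = os.path.join(tempfile.mkdtemp(), "demo_climatology.csv")
with open(csv_path, "w", newline="") as f:
    for key, value in metadata.items():
        f.write("# " + key + ": " + value + "\n")   # comment lines first
    df.to_csv(f)                                     # then the table itself

# comment="#" tells pandas to skip the metadata lines when reading back
df_back = pd.read_csv(csv_path, comment="#", index_col="dayofyear")
print(df_back)
```

Anyone opening the file in a text editor will see the site, units, and years at the top. Formats like netCDF carry this kind of metadata natively if you need more than a few lines.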

<div class="alert alert-block alert-success">
<b>Congratulations:</b> 
    
You've done lots of manipulation and plotting with data from one site and one variable.  
    
What would you like to look at now?
</div>
