# Generic workflow for setting up single-point input data

This notebook describes a generic workflow for setting up a new single-point site in the [NorESM Land Sites Platform](https://noresmhub.github.io/NorESM_LandSites_Platform/).  It will guide you through setting up the required driver data (i.e. surface and atmosphere data) for the CTSM model. 

### Questions? 
- Open an [issue](https://github.com/NorESMhub/NorESM_LandSites_Platform/issues) on GitHub or ask a [developer](https://noresmhub.github.io/NorESM_LandSites_Platform/about/#code-development-team) if you have questions or problems related to the NorESM Land Sites Platform.
- To get in touch with the CTSM group for more generic CTSM-related issues, you can also post in the [CTSM forum in the CESM Bulletin Board](https://bb.cgd.ucar.edu/cesm/forums/ctsm-clm-mosart-rtm.134/).
- Additional tutorials and inspiration can be found in the [NCAR CTSM-Tutorial GitHub repository](https://github.com/NCAR/CTSM-Tutorial-2022).


## Outline

The workflow has several components. Below you will find steps to: 
1. Check the prerequisites for following this workflow.
2. Subset surface and atmosphere data from global datasets at a single latitude and longitude point.
3. Optionally modify some parts of these data.
4. Optionally create spin-up files for your site.


## 1. Prerequisites

The default global datasets from which we extract data for a new single point are too large to keep on a normal PC (this includes the Docker container used to run the NorESM-LSP, because it uses the resources of the computer it runs on). You therefore need access to a high-performance computing (HPC) system, or to another machine where these files are already available or where you can download and work with them. We use the Saga HPC as an example. Saga is an NRIS (Norwegian Research Infrastructure Services) computing cluster designed to handle sequential and small-scale parallel applications not exceeding 128 threads. NRIS is a collaboration between Uninett Sigma2 and the four universities NTNU, UiB, UiO, and UiT that offers resources and support for high-performance computing and data storage to researchers affiliated with academic institutions in Norway.

Before we start, make sure you have:

- cloned CTSM (`git clone https://github.com/ESCOMP/CTSM.git`),
- checked out the necessary externals (`./manage_externals/checkout_externals`),
- enough free space on your account. On Saga, for example, there is a limit on the number of files you can have on your account, and Anaconda virtual environments produce many hidden files that could cause problems if you have other projects stored under your user.

You also need to set up a virtual environment with a recent Python version and some dependencies (packages/libraries). The exact set of libraries needed in general has not been established yet; on Saga, the following works:

In [None]:
# Activate conda module:
module load Anaconda3/2020.11

# Create a new virtual environment with a recent Python version and required dependencies:
conda create --name ctsm-env python=3.10
source activate ctsm-env
conda install -c conda-forge xarray dask netCDF4 bottleneck numpy scipy

# Load a newer version of git (this module name is specific to Saga and will change over time;
# `module avail git` lists the versions currently installed):
module load git/2.33.1-GCCcore-11.2.0-nodocs
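
Before moving on, it can be worth confirming that the environment actually provides the packages installed above. The following is a minimal sketch and not part of CTSM: `check_python_packages` is a hypothetical helper name, and `PYTHON_BIN` is an assumed override for the interpreter that defaults to `python3`.

```shell
# Report which of the given Python packages can be imported.
# check_python_packages and PYTHON_BIN are illustrative names, not CTSM tools.
check_python_packages() {
  local py="${PYTHON_BIN:-python3}"
  local pkg
  local missing=0
  for pkg in "$@"; do
    if "$py" -c "import ${pkg}" > /dev/null 2>&1; then
      echo "ok: ${pkg}"
    else
      echo "MISSING: ${pkg}"
      missing=1
    fi
  done
  # Non-zero return value if at least one package could not be imported.
  return "$missing"
}

# Usage, with the package names from the conda install line above:
# check_python_packages xarray dask netCDF4 bottleneck numpy scipy
```

Run the usage line with the `ctsm-env` environment activated; any `MISSING` entry points to a package that still needs to be installed.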

## 2. Subset global surface and atmosphere files

CTSM uses a surface data file to read in important grid cell-level information like vegetation, crop, and glacier grid cell fractions, and soil characteristics.

A global surface data file is located and read by default for global CTSM cases, depending on the chosen _component set_ and _resolution_. To run CTSM at a single point, we will need to supply a surface data file at a specified latitude and longitude.



When running a land-only simulation, that is, when using a "data atmosphere model" (_DATM mode_ in the [component set](https://noresmhub.github.io/NorESM_LandSites_Platform/#component-sets-compsets)), the climate data (e.g. temperature, precipitation, solar radiation) are read from input files, so CTSM needs DATM files. The global DATM files can also be subset for single-point runs.

### 2.1 Specify directories for global data sets and storage

Before we can subset the data, we need to specify where the global default data can be found, along with some other pointers. These settings are read from `/tools/site_and_regional/default_data.cfg` (relative to the CTSM root directory). On Saga, the global *.nc* grids are located in `/cluster/shared/noresm/inputdata`. You might need to ask for access to that folder if you want to follow this process on Saga.

First, `cd` (*c*hange *d*irectory) to where `default_data.cfg` is located:

In [None]:
cd ~/CTSM/tools/site_and_regional
ls

Then you need to edit the file. You can do this manually in a text editor, e.g. Vim, by typing

In [None]:
vi default_data.cfg



and entering editing mode with `i` (for *i*nsert). To save the changes and close the text editor, exit editing mode with `Esc`, type `:wq` (for *w*rite and *q*uit), and press `Enter`. You need to change `clmforcingindir` and `dir` (the `dir` entry under `[datm_gswp3]`, which holds an absolute path). The file should end up looking like this:

    [main]
    clmforcingindir = /cluster/shared/noresm/inputdata
    
    [datm_gswp3]
    dir = /cluster/shared/noresm/inputdata/atm/datm7/atm_forcing.datm7.GSWP3.0.5d.v1.c170516
    domain = domain.lnd.360x720_gswp3.0v1.c170606.nc
    solardir = Solar
    precdir = Precip
    tpqwdir = TPHWL
    solartag = clmforc.GSWP3.c2011.0.5x0.5.Solr.
    prectag = clmforc.GSWP3.c2011.0.5x0.5.Prec.
    tpqwtag = clmforc.GSWP3.c2011.0.5x0.5.TPQWL.
    solarname = CLMGSWP3v1.Solar
    precname = CLMGSWP3v1.Precip
    tpqwname = CLMGSWP3v1.TPQW
    
    [surfdat]
    dir = lnd/clm2/surfdata_map/release-clm5.0.18
    surfdat_16pft = surfdata_0.9x1.25_hist_16pfts_Irrig_CMIP6_simyr2000_c190214.nc
    surfdat_78pft = surfdata_0.9x1.25_hist_78pfts_CMIP6_simyr2000_c190214.nc
    
    [landuse]
    dir = lnd/clm2/surfdata_map/release-clm5.0.18
    landuse_16pft = landuse.timeseries_0.9x1.25_hist_16pfts_Irrig_CMIP6_simyr1850-2015_c190214.nc
    landuse_78pft = landuse.timeseries_0.9x1.25_hist_78pfts_CMIP6_simyr1850-2015_c190214.nc
    
    [domain]
    file = share/domains/domain.lnd.fv0.9x1.25_gx1v7.151020.nc

<br>
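
If you would rather not edit the file interactively, the same changes can be scripted. The sketch below is one possible approach, not something shipped with CTSM: `set_cfg_value` is a hypothetical helper, and it assumes the layout shown above, with section headers such as `[main]` on their own lines and entries written as `key = value` with spaces around the `=`. Because `dir` occurs in several sections, the helper only rewrites the key inside the section you name.

```shell
# Hypothetical helper: set_cfg_value FILE SECTION KEY VALUE
# Rewrites "KEY = ..." inside [SECTION] only and leaves all other lines untouched.
set_cfg_value() {
  local file="$1" section="$2" key="$3" value="$4"
  awk -v sec="[${section}]" -v key="$key" -v val="$value" '
    # Track which [section] the current line belongs to.
    $1 ~ /^\[.*\]$/ { in_sec = ($1 == sec) }
    # Replace the value only for the matching key in the requested section.
    in_sec && $1 == key && $2 == "=" { print key " = " val; next }
    { print }
  ' "$file" > "${file}.tmp" && mv "${file}.tmp" "$file"
}

# Usage (run inside tools/site_and_regional; the paths are the Saga ones shown above):
# cp default_data.cfg default_data.cfg.orig   # keep a backup first
# set_cfg_value default_data.cfg main clmforcingindir /cluster/shared/noresm/inputdata
# set_cfg_value default_data.cfg datm_gswp3 dir /cluster/shared/noresm/inputdata/atm/datm7/atm_forcing.datm7.GSWP3.0.5d.v1.c170516
```

Afterwards, `cat default_data.cfg` lets you compare the result with the listing above.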

### 2.2 Use `subset_data` to subset surface and DATM files

NCAR has created a Python script, `subset_data`, which subsets the default global surface and DATM files at a user-specified latitude and longitude.

Make sure you are in `/tools/site_and_regional`, list the folder contents and see that `subset_data` is there:

In [None]:
cd ~/CTSM/tools/site_and_regional
ls

You can use the built-in help to see which options are available for the `subset_data` script:

In [None]:
./subset_data --help

There are a lot of options, but for now we will use only a few of the most common ones:

**Type of subsetting:**<br>
`point` : this tells the script to subset data at a single point (region is the other option)<br>

**Location-related information:**<br>
`--lat` : this tells the script which latitude to subset at (*must be between -90 and 90*)<br>
`--lon` : this tells the script which longitude to subset at (*can be between 0 and 360 or -180 and 180*)<br>
`--site` : optional, specifies a site name or tag<br>

**Type of files to create:**<br>
`--create-surface` : tells the script to subset surface data<br> 
`--create-datm` : tells the script to subset DATM data<br>
`--create-landuse` : tells the script to subset land use data (necessary for transient simulations)

**Time information:** <br>
`--datm-syr` and `--datm-eyr`: starting and ending years for the DATM data to subset (*must be between 1901 and 2014*)<br>

**Data management information:**<br>
`--create-user-mods` : tells the script to create a *user_mods* directory (see below). Note that if you don't use this option, you will have to modify scripts in your simulation to point to the modified files.<br> 
`--outdir` : specifies the directory to place subset data and user mods directory in<br>
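
Because the subsetting takes a while, it may help to check the values against the ranges listed above before starting the job. This is a sketch of such a pre-flight check: `check_subset_args` is a hypothetical helper and not an option of `subset_data` itself, and the limits are simply the ones stated in this list.

```shell
# Hypothetical pre-flight check: check_subset_args LAT LON START_YEAR END_YEAR
# Returns 0 if all values fall inside the ranges documented above, 1 otherwise.
check_subset_args() {
  local lat="$1" lon="$2" syr="$3" eyr="$4"
  awk -v lat="$lat" -v lon="$lon" -v syr="$syr" -v eyr="$eyr" 'BEGIN {
    ok = 1
    # Latitude must be between -90 and 90.
    if (lat < -90 || lat > 90)    { print "lat out of range: " lat; ok = 0 }
    # Longitude may be given as 0..360 or as -180..180.
    if (lon < -180 || lon > 360)  { print "lon out of range: " lon; ok = 0 }
    # DATM years must lie between 1901 and 2014.
    if (syr < 1901 || eyr > 2014) { print "years must lie within 1901-2014"; ok = 0 }
    if (syr > eyr)                { print "start year is after end year"; ok = 0 }
    exit(ok ? 0 : 1)
  }'
}

# Usage with the ALP1 values:
# check_subset_args 61.0243 8.12343 2000 2014 && echo "arguments look fine"
```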

<br>
Run the script to create data files. Below is an example command to subset data for the ALP1 site. Modify the coordinates, name, time period, output directory, etc. to create the data for a different site.

Note that this process is time consuming. Please be patient, and don't worry if you see `WARNING: No dominant pft type is chosen.`

In [None]:
# Print the success message only if subset_data exits without an error:
./subset_data point --lat 61.0243 --lon 8.12343 --site ALP1 \
    --create-surface --create-datm --create-landuse \
    --datm-syr 2000 --datm-eyr 2014 \
    --create-user-mods --outdir /cluster/home/$USER/ALP1_subset_data \
  && echo "------------------------" \
  && echo "Successfully subset data"

Depending on the speed of your computing system, it may take a bit of time to subset all the climate data.

### 2.3 Check on the subset files

Once the subsetting has successfully finished, let's navigate to the specified output directory to check on the data that we just created:

In [None]:
cd /cluster/home/$USER/ALP1_subset_data
ls

You should see a surface data file (e.g. *surfdata_0.9x1.25 ... .nc*) and two folders: **datmdata** and **user_mods**. 

* **datmdata** houses the subset DATM files
* **user_mods** houses several files that are useful to customise a single-point case beyond what is possible in the UI (for advanced users)
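
The same check can be scripted, which is handy when you create data for several sites. The sketch below uses a hypothetical helper, `check_subset_output`, and assumes that the surface data file name starts with `surfdata_` and ends in `.nc`, as in the example above.

```shell
# Hypothetical helper: check_subset_output OUTDIR
# Reports whether the surface data file and the two folders described above exist.
check_subset_output() {
  local outdir="$1"
  local sub
  local rc=0
  # Surface data file; the surfdata_*.nc pattern is assumed from the example name.
  if ls "${outdir}"/surfdata_*.nc > /dev/null 2>&1; then
    echo "found: surface data file"
  else
    echo "missing: surface data file (surfdata_*.nc)"
    rc=1
  fi
  for sub in datmdata user_mods; do
    if [ -d "${outdir}/${sub}" ]; then
      echo "found: ${sub}/"
    else
      echo "missing: ${sub}/"
      rc=1
    fi
  done
  return "$rc"
}

# Usage:
# check_subset_output /cluster/home/$USER/ALP1_subset_data
```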

#### 2.3.1 (Optional) Additional customisation of case

If you want to change additional settings, navigate into the **user_mods** directory to look at the contents:

In [None]:
cd user_mods
ls

You should see three files: *shell_commands*, *user_nl_clm*, and *user_nl_datm_streams*.  

The *shell_commands* file contains *xmlchange* commands required to set up a single point case at the specified latitude and longitude.

Take a look at this file if you want:

In [None]:
cat shell_commands

Note that many of the xml commands are changing aspects of the model configuration that are communicated to **[CIME](https://github.com/ESMCI/cime)** (Common Infrastructure for Modeling the Earth), which is the infrastructure that generates model executables and associated input files. Below are explanations of the commands included in this script. 

`./xmlchange CLM_USRDAT_DIR` - this sets the CIME variable *CLM_USRDAT_DIR*, which we use to specify the main directory of the subset data files  

`./xmlchange PTS_LON` and `./xmlchange PTS_LAT` - these tell CIME that we are running at a specified longitude and latitude  

`./xmlchange MPILIB` - this specifies the MPI (*Message Passing Interface*) library to use, which is required for single-point runs on NCAR machines.  
 
*user_nl_clm* is a Fortran namelist file used to set up different namelist options for CLM. Here, we are using it to specify the location of our subset surface data. Note the use of the variable `$CLM_USRDAT_DIR` set up in the *shell_commands* file.
    
Similarly, *user_nl_datm_streams* specifies the location and a few other options for our subset DATM data.


<div class="alert alert-block alert-warning">
<b>Note:</b> If for whatever reason you end up moving the subset data directory (i.e. here <b>/cluster/home/$USER/ALP1_subset_data</b>), you will need to modify the xmlchange command that specifies the <i>CLM_USRDAT_DIR</i> to be the full path to the directory's new location. 
</div>

## 3. (Optional) modification of data sets

Depending on your research questions, you may want to change some of the default data. 

## 4. (Optional) Spinup

When running a model like CLM, the initial conditions (i.e. state variables like carbon and nitrogen pools and soil moisture) have an impact on the results of the simulation. Often, we don't know the precise values of these initial conditions. To get around this issue, we can initialize the model with arbitrary values and then run the model with some cycle of atmospheric forcing for many years (e.g. 200) until the model attains an equilibrium state. Then, we can simulate the model response to some perturbation (e.g. changing climate, CO<sub>2</sub>, etc.). This process -- establishing an equilibrium state -- is called _spinup_.

An example from the CTSM tutorial shows the difference in Soil Organic Carbon, Vegetation Carbon, and Gross Primary Productivity before and after spin-up with the accelerated decomposition functionality, which shortens the spin-up time. *Note that you'll see **AD mode** for accelerated decomposition mode and **Post-AD mode** on this figure. See the figure caption below or visit the [Model Equilibration and its Acceleration](https://escomp.github.io/ctsm-docs/versions/release-clm5.0/html/tech_note/Decomposition/CLM50_Tech_Note_Decomposition.html#model-equilibration-and-its-acceleration) section of the CLM Tech Note for more information.*

<div>
<img src="https://github.com/NCAR/CTSM-Tutorial-2022/raw/main/images/ad_mode.png" width="525" height="775" alt="Evolution of different C pools during the accelerated decomposition spinup."/>
</div>

<i>Figure: The steady-state, or equilibrium, sizes of carbon (C) and nitrogen (N) pools are proportional to their turnover time. This spinup simulation was conducted using "accelerated decomposition", or "AD" mode. This accelerates the turnover time of "slow" ecosystem C and N pools (soil, wood, and coarse woody debris) so they come into equilibrium more quickly. On the left, the model ran in AD mode for 100 years cycling through atmospheric forcing for 1981 to 2000. AD mode was invoked with the commands: <code>./xmlchange CLM_FORCE_COLDSTART=on</code> and <code>./xmlchange CLM_ACCELERATED_SPINUP=on</code>. On the right, the model was run again in a "post-AD" simulation (using the end of our "AD" simulation as the starting point) for another 100 years with <code>./xmlchange CLM_ACCELERATED_SPINUP=off</code>. In returning the turnover times of slow C and N pools to their intended rates, the pool sizes must be adjusted from their "AD" steady-state. For example, if the turnover of "passive" soil C was 10x faster in AD mode, the passive soil C pool needs to be 10x larger when starting the post-AD simulation (the model automatically handles this conversion for you). Running the simulation for 100 years in post-AD mode allows the state variables to equilibrate with non-accelerated decomposition. In post-AD mode the history files are monthly, whereas AD output is set to annual averages by default. This difference in history file output frequency is reflected in the variability in the post-AD output.
</i>

After the spin-up is complete, we have to tell CIME to use the spin-up simulation's end point as the starting point, or initial conditions, for our simulation. 

We do this via the <i>user_nl_clm</i> file, either with the command below or manually using any text editing software (e.g. vi, emacs, etc.). Specify the spin-up file and check that it worked:

In [None]:
echo "finidat='/cluster/home/data/finidat_file/I2000_CTSM51_spinup.clm2.r.0281-01-01-00000.nc'" >> user_nl_clm
cat user_nl_clm
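
Note that `>>` appends a new line every time the command is run, so repeating it leaves several `finidat` entries in the file. A slightly more defensive variant is sketched below. `set_finidat` is a hypothetical helper, not a CTSM or CIME tool: it refuses to write a path that does not exist and replaces any earlier `finidat` line instead of adding another one.

```shell
# Hypothetical helper: set_finidat NAMELIST_FILE RESTART_FILE
# Writes a single finidat entry after checking that the restart file exists.
set_finidat() {
  local nl_file="$1" restart="$2"
  if [ ! -f "$restart" ]; then
    echo "restart file not found: $restart" >&2
    return 1
  fi
  # Copy everything except earlier finidat entries; grep exits non-zero
  # when nothing is left to copy, which is fine here.
  grep -v '^[[:space:]]*finidat[[:space:]]*=' "$nl_file" > "${nl_file}.tmp" || true
  echo "finidat = '${restart}'" >> "${nl_file}.tmp"
  mv "${nl_file}.tmp" "$nl_file"
}

# Usage, with the example restart file from the command above:
# set_finidat user_nl_clm /cluster/home/data/finidat_file/I2000_CTSM51_spinup.clm2.r.0281-01-01-00000.nc
# cat user_nl_clm
```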

### Acknowledgements

Thanks to the team at NCAR for sharing their tutorial materials. This notebook draws some text snippets and inspiration from the [2022 CTSM Generic Single Point tutorial](https://github.com/NCAR/CTSM-Tutorial-2022/blob/main/notebooks/Day2a_GenericSinglePoint.ipynb). Text and code inspiration also comes from [Hui's notebook for input creation for an older model version]() and [Lasse's repo for generic forcing data creation](https://github.com/lasseke/nlp-input-handling).