## M2.5 - Creating a Reproducible Climate Data Analysis

**Contents:**

---

In the previous lesson, we used `dask` and `xarray` to read a collection of netCDF files, then **mapped** a **vectorized function** over each array in a time series. We produced a graph that showed the Precipitation-to-PET ratio for Tiaret, Algeria, in the early months of 2024, during a severe drought.

On its own, the graph doesn't tell us how severe the drought in Tiaret is. Although precipitation in the region has replenished less than 5% of its lost water over the past few months, this could just be part of the normal seasonal cycle. Actually, we know that January through April is a relatively wet period for Tiaret, but the question remains: **Can we compare this year to past years?**

Whenever we want to apply a completed analysis to a new dataset, either over time or space, that's an opportunity for us to improve how our workflow is represented. Consider the two scripts below, which represent the first two steps in our workflow.

### Our workflow: Downloading the data (Step 1)

The first file might be named something like **`YYYYMMDD_step1_download_MERRA2_data.py`**. Remember that `YYYYMMDD` is today's date, and it will help us to **link our output files with the code that created them.**
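If you'd rather not type the date by hand, it can be generated in Python. The following is a minimal sketch; `dated_filename()` is a hypothetical helper introduced here for illustration, not part of our workflow scripts.

```python
from datetime import date

def dated_filename(stem, today = None):
    'Prefixes a file name with a date in YYYYMMDD format (default: today)'
    # Fall back to the current date if no date was given
    today = today or date.today()
    return f'{today:%Y%m%d}_{stem}'

# Uses today's date, so the result changes from day to day
print(dated_filename('step1_download_MERRA2_data.py'))

# A fixed date, just to show the format
print(dated_filename('step1_download_MERRA2_data.py', date(2024, 5, 1)))
```

Passing an explicit date, as in the second call, is one way to keep a file name stable if a script is re-run on a later day.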

Note that Python files, with the `*.py` file extension, can have a **file-level docstring,** which in the example below is the Python multi-line string beginning with `'''`. A file-level docstring must be the first statement in a file; only comments and blank lines may come before it. File-level docstrings are extremely important for reproducible workflows: the top of a file is the first place you'll look to understand what the purpose of the file is!
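To see how Python decides whether a string counts as the file-level docstring, here is a small sketch using the standard library's `ast` module. The two source strings are made up for illustration; they stand in for the contents of two `*.py` files.

```python
import ast

# The string is the first statement in the file, so it is the docstring
docstring_first = "'''Downloads MERRA-2 data.'''\nimport os\n"

# The string comes after an import, so it is just an ordinary string
docstring_late = "import os\n'''Downloads MERRA-2 data.'''\n"

first = ast.get_docstring(ast.parse(docstring_first))
late = ast.get_docstring(ast.parse(docstring_late))

print(first) # Downloads MERRA-2 data.
print(late)  # None
```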

```python
'''
Downloads MERRA-2 M2SDNXSLV data, for the first 5 months of a year.
Read more about MERRA-2 here:

    https://gmao.gsfc.nasa.gov/reanalysis/MERRA-2/

Data are downloaded to this folder:

    data_raw/MERRA2
'''

import earthaccess

DATA_YEAR = 2023

auth = earthaccess.login()

results = earthaccess.search_data(
    short_name = 'M2SDNXSLV',
    temporal = (f"{DATA_YEAR}-01-01", f"{DATA_YEAR}-05-31"))

# Could take about 1 minute on a broadband connection
earthaccess.download(results, 'data_raw/MERRA2')
```

&#x1F449; **From top to bottom, note that:**

- We have a **file-level docstring** at the top of the script with important information about the purpose of the script, where to get more information, and how the script changes our file system.
- All of our `import` statements are near the top of the script. This signals to someone reading our script what Python modules are required to run the script. We don't want to put any `import` statements farther down in the script because it would be harder to find them. This could lead to a surprise `ImportError` when running the script.
- Parameters that might be changed when running the script are clearly identified, using all capital letters to define the variable, near the top of the script. For example, `DATA_YEAR` is a variable we might want to change when running the script multiple times. Putting it at the top of our script, using all capital letters, helps avoid the difficulty of reading through every line of the script to find the part that needs to change.
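Because the year is isolated in a single parameter, repeating the download for other years only requires changing one value. As a rough sketch of that idea, assuming a hypothetical helper `temporal_range()` and a hypothetical list `DATA_YEARS` (neither is part of `earthaccess` or of the script above), we could build the same January-through-May date range for several years:

```python
def temporal_range(year):
    'Returns the (start, end) date strings for the first 5 months of a year'
    return (f'{year}-01-01', f'{year}-05-31')

# Hypothetical list of years to compare; adjust as needed
DATA_YEARS = [2022, 2023, 2024]

for year in DATA_YEARS:
    start, end = temporal_range(year)
    print(f'{year}: {start} to {end}')
```

Each tuple has the same form as the `temporal` argument passed to `earthaccess.search_data()` in the script above.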

### Our workflow: Data processing (Step 2)

The next step is to read in the data files and calculate top-of-atmosphere (TOA) radiation. The second file might be named **`YYYYMMDD_step2_compute_TOA_radiation.py`**.

```python
'''
Computes top-of-atmosphere (TOA) radiation from a series of MERRA-2 
M2SDNXSLV files, then writes an output netCDF file. TOA radiation is
calculated according to the FAO formula for extraterrestrial radiation:

    https://www.fao.org/4/X0490E/x0490e07.htm#radiation
'''

import numpy as np
import xarray as xr

# NOTE: This will be different on your computer system and you should
#   use an absolute path, not a relative path
MERRA2_DATA_DIR = './data_raw/MERRA2'
DATA_YEAR = 2023
OUTPUT_FILE = f'./outputs/YYYYMMDD_MERRA2_{DATA_YEAR}_with_TOA-radiation.nc'

def main():
    ds = xr.open_mfdataset(f'{MERRA2_DATA_DIR}/*{DATA_YEAR}*.nc4', chunks = 'auto')
    lats = ds['lat'].values.reshape((ds.lat.size, 1, 1))\
        .repeat(ds.lon.size, axis = 1)\
        .repeat(ds.time.size, axis = 2)
    ds['lat_grid'] = (('lat', 'lon', 'time'), lats)

    # Compute TOA radiation
    template = ds['T2MMEAN']
    template.name = 'toa_radiation'
    result = xr.map_blocks(toa_radiation_wrapper, ds, template = template)
    toa_result = result.compute()
    # Converting TOA Radiation from [MJ m-2 day-1] to [mm H2O day-1]
    ds['toa_radiation'] = toa_result * 0.408
    
    # Write the file to disk, just the important variables
    ds = ds[['T2MMAX', 'T2MMEAN', 'T2MMIN', 'toa_radiation']]
    comp = dict(zlib = True, complevel = 5)
    encoding = {var: comp for var in ds.data_vars}
    ds.to_netcdf(OUTPUT_FILE, format = 'NETCDF4', encoding = encoding)

    
def toa_radiation(latitude, doy):
    '''
    Top-of-atmosphere (TOA) radiation for a given latitude (L) and day of year
    (DOY) can be calculated as:

    R = ((24 * 60) / pi) * G * d * (w * sin(L) * sin(D) + cos(L) * cos(D) * sin(w))

    Where G is the solar constant, 0.0820 [MJ m-2 min-1]; d is the inverse
    relative earth-sun distance; w is the sunset hour angle; and D is the
    solar declination angle.
    
    For more information, consult the FAO documentation:

        https://www.fao.org/4/X0490E/x0490e07.htm#radiation
    
    Parameters
    ----------
    latitude : float
        The latitude on earth, in degrees
    doy : int
        The day of the year (DOY), an integer on [1,366]
    
    Returns
    -------
    Number
        Top-of-atmosphere (TOA) radiation, in [MJ m-2 day-1]
    '''
    solar_constant = 0.0820 # [MJ m-2 min-1]
    pi = np.pi
    
    # Convert latitude from degrees to radians
    lat_radians = np.deg2rad(latitude)
    # Inverse relative Earth-Sun distance, as a function of day-of-year (DOY)
    earth_sun_dist = 1 + 0.033 * np.cos(doy * ((2 * pi) / 365))
    # Solar declination, as a function of DOY
    declination = 0.409 * np.sin(doy * ((2 * pi) / 365) - 1.39)
    
    # Sunset hour angle; we use np.where() below to guard against
    #   warnings where arccos() would return invalid values, which
    #   happens when the argument is outside [-1, 1]
    _hour_angle = -np.tan(lat_radians) * np.tan(declination)
    _hour_angle = np.where(np.abs(_hour_angle) > 1, np.nan, _hour_angle)
    sunset_hour_angle = np.arccos(_hour_angle)
    
    return ((24 * 60) / pi) * solar_constant * earth_sun_dist *\
        (sunset_hour_angle * np.sin(lat_radians) * np.sin(declination) +
            np.cos(lat_radians) * np.cos(declination) * np.sin(sunset_hour_angle))


def toa_radiation_wrapper(dataset):
    'Wraps toa_radiation to work with an xarray.Dataset'
    return toa_radiation(dataset['lat_grid'], dataset['time.dayofyear'])


# If the file is run as a script, run the main() function
if __name__ == '__main__':
    main()
```

&#x1F449; **Again, note that:**

- A **file-level docstring** and `import` statements near the top of the script help users identify the purpose of the script and what Python modules are required to run it.
- Our `toa_radiation()` function also has a **function-level docstring** that describes how the function works: what **input arguments** it requires and what the **return value** is.
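A function-level docstring that cites its source also makes the function easier to check. The sketch below is a scalar re-statement of the FAO extraterrestrial radiation equations, written independently of the script above; the name `toa_radiation_scalar()` is hypothetical. It is evaluated for the case used in the FAO documentation's worked example (20 degrees South on 3 September), which reports roughly 32.2 MJ m-2 day-1.

```python
import numpy as np

def toa_radiation_scalar(latitude, doy):
    'Scalar sketch of the FAO extraterrestrial radiation formula'
    solar_constant = 0.0820 # [MJ m-2 min-1]
    lat_radians = np.deg2rad(latitude)
    # Inverse relative Earth-Sun distance and solar declination
    inv_dist = 1 + 0.033 * np.cos(doy * (2 * np.pi / 365))
    declination = 0.409 * np.sin(doy * (2 * np.pi / 365) - 1.39)
    # Sunset hour angle; no guard for polar day or night in this sketch
    sunset_hour_angle = np.arccos(-np.tan(lat_radians) * np.tan(declination))
    return ((24 * 60) / np.pi) * solar_constant * inv_dist * (
        sunset_hour_angle * np.sin(lat_radians) * np.sin(declination) +
        np.cos(lat_radians) * np.cos(declination) * np.sin(sunset_hour_angle))

# 3 September is day-of-year 246 in a non-leap year
ra = toa_radiation_scalar(-20.0, 246)
print(f'{ra:.1f} MJ m-2 day-1')
print(f'{ra * 0.408:.1f} mm H2O day-1')
```

Comparing a function against a published example like this is a quick way to catch a mistyped coefficient or a unit mix-up.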

&#x1F449; **Consider the line that contains `if __name__ == '__main__'`; what does this mean?**

- Every Python file (`*.py` file) contains code that can be executed in two ways: by running it as a script, e.g., `python myscript.py`, or by *importing* it as a module, e.g., `import myscript`.
- Whenever the code in a Python file runs, Python defines a variable called `__name__`. When the file is imported as a module, `__name__` is set to the name of the module (e.g., `'myscript'`). When the file is executed as a script, `__name__` is instead set to `'__main__'`. Therefore, `__name__` is a variable that we can use to test whether the Python code is currently being run as a script or was imported as a module.
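The two values of `__name__` can be observed directly. This is a minimal sketch that writes a throwaway file named `myscript.py` to a temporary folder; the variable `NAME_SEEN` is made up for illustration. Here, `runpy.run_path()` with `run_name = '__main__'` stands in for running `python myscript.py` at the command line.

```python
import importlib.util
import pathlib
import runpy
import tempfile

# The file simply records whatever value __name__ has when it runs
source = 'NAME_SEEN = __name__\n'

with tempfile.TemporaryDirectory() as tmp:
    path = pathlib.Path(tmp) / 'myscript.py'
    path.write_text(source)

    # Case 1: Import the file as a module called "myscript"
    spec = importlib.util.spec_from_file_location('myscript', path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    name_when_imported = module.NAME_SEEN

    # Case 2: Execute the file the way a script would be executed
    script_globals = runpy.run_path(str(path), run_name = '__main__')
    name_when_run = script_globals['NAME_SEEN']

print(name_when_imported) # myscript
print(name_when_run)      # __main__
```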

**Why do we care about whether the file is being run as a script or if it was imported as a module?** When a Python file is imported as a module, all of the code in that file is executed. This means that any Python code that is outside of a **function definition** will be executed when we import the module, which is probably not what we want, especially if the file contains useful functions like `toa_radiation()` that we might want to **re-use** elsewhere; that is, we might want to write something like `from myscript import toa_radiation` in another script.

The code that we want to execute *only when the file is run as a script* should be placed in a function like `main()`, which can be called conditionally:
```python
# If the file is run as a script, run the main() function
if __name__ == '__main__':
    main()
```

If this is confusing then, for now, just consider the `if` statement above to be a "magic" Python technique that allows us to write Python code files that can both be executed as scripts and imported as modules.
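To make the difference concrete, the sketch below imports two versions of a tiny stand-in module: one calls `main()` unconditionally and the other uses the `if` statement. The names `SOURCE`, `CALLS`, and `import_from_source()` are hypothetical and exist only for this illustration.

```python
import importlib.util
import pathlib
import tempfile

# A stand-in for a workflow script: main() records that it ran
SOURCE = '''
CALLS = []

def main():
    CALLS.append('main() ran')

{footer}
'''

def import_from_source(source, name):
    'Writes source code to a temporary *.py file, then imports it as a module'
    with tempfile.TemporaryDirectory() as tmp:
        path = pathlib.Path(tmp) / f'{name}.py'
        path.write_text(source)
        spec = importlib.util.spec_from_file_location(name, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
    return module

# Version 1: main() is called at the bottom of the file, unconditionally
unguarded = import_from_source(SOURCE.format(footer = 'main()'), 'unguarded')

# Version 2: main() is called only if the file is run as a script
guarded = import_from_source(
    SOURCE.format(footer = "if __name__ == '__main__':\n    main()"), 'guarded')

print(unguarded.CALLS) # main() ran as a side effect of the import
print(guarded.CALLS)   # Empty list: the import did not call main()
```

In both cases the `main()` function is still available after the import; only the guarded version leaves it up to us when to call it.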

---

## Comparing multiple years of climate data



## A reproducible project's files

---

## More resources

- The National Center for Atmospheric Research (NCAR) has an excellent article on ["Using `dask` to scale up your data analysis."](https://ncar.github.io/Xarray-Dask-ESDS-2024/notebooks/02-dask-intro.html)
- Sander van Rijn's [tutorial on using the `timeit` module.](https://sjvrijn.github.io/2019/09/28/how-to-timeit.html)