# M3.3 - Tracking Changes to Research Code

*Part of:* [**Open Science for Water Resources**](https://github.com/OpenClimateScience/M3-Open-Science-for-Water-Resources)

In the previous section, we processed IMERG-Final global precipitation data into a monthly precipitation time series for our basin. **It's likely we would want to save and repeat this analysis in the future.** There are several reasons for this:

- We might want to run this analysis for a different basin.
- We might want to run this analysis for hundreds or thousands of other basins and compare the results.
- Someone else might ask us to run this analysis for a basin they are interested in, or ask if they can use our code.
- We might discover a mistake in our analysis that requires us to process the data again (after correcting the mistake).

**For any of these reasons, we should always ask the following question about research code we generate: *Is it likely that someone, including me, will want to run this code again?***

If the answer is "Yes," then we need to think about what comes next. If we change the code in the future, we might unintentionally break something that is currently working. We might decide that new features we added aren't really necessary and the code would be better without them. Someone else might decide to adapt our code for a completely different purpose. And we might want to work with different versions of the same code; for example, a stable version that is commonly used and an experimental version that has more features.

**Source control management (SCM), sometimes called "version control," can help with these issues.** To see how SCM works, let's revisit our precipitation analysis code.

Below, we've simply combined the code from the previous section into a single code cell and moved the `import` statements to the top of the code block, where they belong.

In [None]:
import calendar
import datetime
import glob
import earthaccess
import numpy as np
import h5py
import rioxarray # Registers the ".rio" accessor on xarray objects
import xarray as xr
import geopandas
from matplotlib import pyplot
from pyproj import CRS

auth = earthaccess.login()

basin = geopandas.read_file('/home/arthur.endsley/Workspace/NTSG/projects/Y2024_TOPS_Training/data/YellowstoneRiver_drainage_WSG84.shp')

# results = earthaccess.search_data(
#     short_name = 'GPM_3IMERGM',
#     temporal = ('2014-01-01', '2023-12-31'))
# earthaccess.download(results, 'data/IMERG-Final_monthly')

file_list = glob.glob('data/IMERG-Final_monthly/*.HDF5')
file_list.sort()

datasets = []
for i, filename in enumerate(file_list):
    # Only need to do this once, for the first file
    if i == 0:
        with h5py.File(filename, 'r') as hdf:
            longitude = hdf['Grid/lon'][:]
            latitude = hdf['Grid/lat'][:]

    # Get the date of this image
    date = datetime.datetime.strptime(filename.split('.')[4][0:8], '%Y%m%d')
    ds0 = xr.open_dataset(
        filename, group = 'Grid', decode_times = False).get(['precipitation'])
    # Define the missing coordinates
    ds0 = ds0.assign_coords({
        'time': [date], 'lon': longitude, 'lat': latitude
    })
    
    # Define the coordinate reference system (CRS) and the spatial coordinates
    ds0 = ds0.rio.write_crs(CRS.from_epsg(4326))
    ds0 = ds0.rio.set_spatial_dims('lon', 'lat')

    # Clip the IMERG-Final precipitation data to our basin's boundary
    ds_clip = ds0.rio.clip(basin.geometry.values)
    
    # Save the clipped dataset to be merged with the others
    datasets.append(ds_clip)

# Merge the datasets together along the "time" axis (i.e., build a time series)
ds = xr.concat(datasets, dim = 'time')

# Converting from [mm hour-1] to [mm month-1]
days_in_month = np.array(calendar.mdays)[ds.coords['time.month'].values]
ds['precip_monthly'] = ds.precipitation * 24 * days_in_month.reshape((days_in_month.size, 1, 1))

# Compute basin-wide monthly precipitation
precip_series = ds.precip_monthly.mean(['lon','lat']).values
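
One caveat about the unit conversion above: `calendar.mdays` always lists 28 days for February, so the monthly total for a leap-year February is undercounted by one day. Below is a minimal sketch of a leap-year-aware conversion that uses the standard library's `calendar.monthrange()`. The function names `days_in_each_month()` and `hourly_rate_to_monthly_total()` are hypothetical and are not part of our analysis.

```python
import calendar
import numpy as np

def days_in_each_month(years, months):
    '''
    Number of days in each (year, month) pair, accounting for leap years.
    '''
    return np.array([
        calendar.monthrange(int(year), int(month))[1]
        for year, month in zip(years, months)
    ])

def hourly_rate_to_monthly_total(rate_mm_per_hour, years, months):
    '''
    Converts a rate in [mm hour-1] to a total in [mm month-1]. The first
    axis of the input array is assumed to be the time axis.
    '''
    rate = np.asarray(rate_mm_per_hour, dtype = float)
    days = days_in_each_month(years, months)
    # Reshape so the day counts broadcast along the time (first) axis
    shape = (days.size,) + (1,) * (rate.ndim - 1)
    return rate * 24 * days.reshape(shape)

# Two Februaries at a constant 1 mm hour-1: 24 * 28 = 672 and 24 * 29 = 696
rate = np.ones((2, 1, 1))
totals = hourly_rate_to_monthly_total(rate, [2015, 2016], [2, 2])
print(totals.ravel())
```

With an `xarray` Dataset like ours, the years and months could be taken from `ds.time.dt.year.values` and `ds.time.dt.month.values`.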

## Adapting research code for re-use

We started this discussion with the idea that our code will be re-used. When we want to re-use code, we typically write a **function.** **What parts of our analysis could be easily re-written as more general-purpose functions?**

Let's start by decomposing our analysis into a series of simple steps:

1. Download the IMERG-Final data for a given period.
2. Open one of the IMERG-Final data granules to read the latitude and longitude coordinates.
3. For each data granule, create an `xarray` Dataset with the proper coordinates.
4. For each data granule, clip the Dataset to the bounds of our basin.
5. Merge the Datasets together.
6. Convert the units of precipitation.
7. Calculate the basin-wide average monthly precipitation.

Step 3 seems like a good candidate for turning into a general-purpose function. Why? The IMERG-Final data are stored as HDF5 files and we have to do a lot of work to prepare them for use with `xarray`. **The *boilerplate code* we wrote to achieve this isn't specific to our analysis; we'd have to do it every time for every IMERG-Final data granule.**

#### &#x1F3C1; Challenge: Re-writing Code as a Function

Functions generally transform inputs (arguments) into outputs (the return value). When looking at existing code to determine if it can be re-written as a function, we might look for parts of our code where *a single argument* is used multiple times.

For example, in this section of our code, we use the `filename` variable a lot!

<code>
    with h5py.File(<span style = "background-color:yellow">filename</span>, 'r') as hdf:
        longitude = hdf['Grid/lon'][:]
        latitude = hdf['Grid/lat'][:]
    # Get the date of this image
    date = datetime.datetime.strptime(<span style = "background-color:yellow">filename</span>.split('.')[4][0:8], '%Y%m%d')
    ds0 = xr.open_dataset(
        <span style = "background-color:yellow">filename</span>, group = 'Grid', decode_times = False).get(['precipitation'])
</code>
<br />

**This suggests that the entire section (above) could be re-written as a function that takes `filename` as an argument. Try writing the function for Step 3 yourself, then compare it with our answer, below.**

In [None]:
def hdf5_to_xarray_dataset(filename, longitude = None, latitude = None):
    '''
    Reads an IMERG-Final HDF5 granule and returns an
    xarray.Dataset with the date, latitude, and longitude coordinates
    properly defined.

    Parameters
    ----------
    filename : str
        The file path to the HDF5 file
    longitude : numpy.ndarray
        The longitude coordinates, as a 1D NumPy array
    latitude : numpy.ndarray
        The latitude coordinates, as a 1D NumPy array

    Returns
    -------
    xarray.Dataset
    '''
    if longitude is None or latitude is None:
        with h5py.File(filename, 'r') as hdf:
            longitude = hdf['Grid/lon'][:]
            latitude = hdf['Grid/lat'][:]

    # Get the date of this image
    date = datetime.datetime.strptime(filename.split('.')[4][0:8], '%Y%m%d')
    ds0 = xr.open_dataset(
        filename, group = 'Grid', decode_times = False).get(['precipitation'])
    # Define the missing coordinates
    ds0 = ds0.assign_coords({
        'time': [date], 'lon': longitude, 'lat': latitude
    })
    
    # Define the coordinate reference system (CRS) and the spatial coordinates
    ds0 = ds0.rio.write_crs(CRS.from_epsg(4326))
    ds0 = ds0.rio.set_spatial_dims('lon', 'lat')
    return ds0
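
One fragile spot in this function is the date parsing. `filename.split('.')[4]` counts every dot in the whole path, so a directory name that contains a dot (for example, a path that starts with `./`) shifts the index and the wrong piece of text is parsed. Below is a minimal sketch of a more defensive approach. The helper name `granule_date()` is hypothetical, and the example file name is an assumption about how IMERG-Final monthly granules are named; check it against your own downloaded files.

```python
import datetime
import os
import re

def granule_date(filename):
    '''
    Extracts the start date from a granule's file name, ignoring any
    directories in the path. Assumes the name contains a date stamp like
    ".YYYYMMDD-SHHMMSS".
    '''
    basename = os.path.basename(filename)
    match = re.search(r'\.(\d{8})-S\d{6}', basename)
    if match is None:
        raise ValueError(f'Could not find a date stamp in "{basename}"')
    return datetime.datetime.strptime(match.group(1), '%Y%m%d')

# An assumed example file name, inside a path that contains extra dots
example = './data/v1.0/3B-MO.MS.MRG.3IMERG.20140101-S000000-E235959.01.V07B.HDF5'
print(granule_date(example))
```

Raising a `ValueError` for an unexpected file name makes the problem visible right away, instead of letting a wrong date flow silently into the time series.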

#### &#x1F3AF; Best Practice

There is one important thing to note about our `hdf5_to_xarray_dataset()` function.

We already know this function is going to be used inside a `for` loop, so we should think carefully about what happens inside the function. If there's a potentially time-consuming operation that only needs to be done once, we should exclude it from the function. 

We solved this problem by making `longitude` and `latitude` into optional arguments; if the function is going to be used inside a `for` loop, the user can provide these arguments to avoid having to read the HDF5 file with `h5py` multiple times. 
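
To see why this matters, here is a toy sketch of the same pattern that runs without any data files. Everything in it is hypothetical: `read_coordinates()` stands in for the expensive `h5py` read, `load_granule()` stands in for our function, and the file names and coordinate values are placeholders.

```python
read_count = 0 # Counts how many times the "expensive" read happens

def read_coordinates(filename):
    '''
    Hypothetical stand-in for reading coordinate arrays from a file.
    '''
    global read_count
    read_count += 1
    return [-110.0, -109.9], [45.0, 45.1]

def load_granule(filename, longitude = None, latitude = None):
    '''
    Toy loader that only reads the coordinates when they are not provided.
    '''
    if longitude is None or latitude is None:
        longitude, latitude = read_coordinates(filename)
    return {'filename': filename, 'longitude': longitude, 'latitude': latitude}

file_list = ['a.HDF5', 'b.HDF5', 'c.HDF5'] # Placeholder file names

# Without the optional arguments, every call repeats the read
naive = [load_granule(f) for f in file_list]
reads_without_reuse = read_count

# Read the coordinates once, then pass them to every call
read_count = 0
longitude, latitude = read_coordinates(file_list[0])
reused = [load_granule(f, longitude, latitude) for f in file_list]
reads_with_reuse = read_count

print(reads_without_reuse, reads_with_reuse)
```

With three files the difference is small, but for ten years of monthly granules it means one coordinate read instead of 120.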

---

## Version control for research code

With our `hdf5_to_xarray_dataset()` function already defined, we can put the rest of our code into a `main()` function, as below. This enables us to represent the entire workflow as a single Python script.

```python
import calendar
import datetime
import glob
import earthaccess
import numpy as np
import h5py
import rioxarray # Registers the ".rio" accessor on xarray objects
import xarray as xr
import geopandas
from matplotlib import pyplot
from pyproj import CRS

BASIN_FILE = '/home/arthur.endsley/Workspace/NTSG/projects/Y2024_TOPS_Training/data/YellowstoneRiver_drainage_WSG84.shp'

def main():
    auth = earthaccess.login()
    basin = geopandas.read_file(BASIN_FILE)
    
    results = earthaccess.search_data(
        short_name = 'GPM_3IMERGM',
        temporal = ('2014-01-01', '2023-12-31'))
    earthaccess.download(results, 'data/IMERG-Final_monthly')
    file_list = glob.glob('data/IMERG-Final_monthly/*.HDF5')
    file_list.sort()
    
    datasets = []
    for i, filename in enumerate(file_list):
        # Only need to do this once, for the first file
        if i == 0:
            with h5py.File(filename, 'r') as hdf:
                longitude = hdf['Grid/lon'][:]
                latitude = hdf['Grid/lat'][:]

        # Read the HDF5 file as an xarray Dataset, clip it to
        #    our basin's boundary
        ds0 = hdf5_to_xarray_dataset(filename, longitude, latitude)
        ds_clip = ds0.rio.clip(basin.geometry.values)
        datasets.append(ds_clip)
    
    # Merge the datasets together along the "time" axis (i.e., build a time series)
    ds = xr.concat(datasets, dim = 'time')
    
    # Converting from [mm hour-1] to [mm month-1], then compute basin-wide
    #    monthly precip.
    days_in_month = np.array(calendar.mdays)[ds.coords['time.month'].values]
    ds['precip_monthly'] = ds.precipitation * 24 * days_in_month.reshape((days_in_month.size, 1, 1))
    precip_series = ds.precip_monthly.mean(['lon','lat']).values
```

<br />

Remember these important lines?

```python
if __name__ == '__main__':
    main()
```

[Review this previous lesson if you need to recall what they are for.](https://github.com/OpenClimateScience/M2-Computational-Climate-Science/blob/main/notebooks/05_Creating_a_Reproducible_Climate_Data_Analysis.ipynb)

### Initializing a `git` repository

### Tracking changes to research code

### Finalizing changes

---

## Updating research software

In [None]:
def potential_et(toa_radiation, temp_max, temp_min, temp_mean):
    '''
    Calculates potential evapotranspiration, according to the Hargreaves
    equation:

    PET = 0.0023 * R * sqrt(Tmax - Tmin) * (Tmean + 17.8)

    Where R is the top-of-atmosphere (TOA) radiation (mm day-1); Tmax and
    Tmin are the maximum and minimum monthly air temperatures (degrees C),
    respectively; and Tmean is monthly mean air temperature (degrees C).

    Parameters
    ----------
    toa_radiation : Number
        The top-of-atmosphere (TOA) radiation (mm day-1)
    temp_max : Number
        Maximum monthly air temperature (degrees C)
    temp_min : Number
        Minimum monthly air temperature (degrees C)
    temp_mean : Number
        Average monthly air temperature (degrees C)

    Returns
    -------
    Number
        The potential evapotranspiration (PET) in [mm day-1]
    '''
    return 0.0023 * toa_radiation * np.sqrt(temp_max - temp_min) * (temp_mean + 17.8)


def toa_radiation(latitude, doy):
    '''
    Top-of-atmosphere (TOA) radiation for a given latitude (L) and day of year
    (DOY) can be calculated as:

    R = ((24 * 60) / pi) * G * d * (w * sin(L) * sin(D) + cos(L) * cos(D) * sin(w))

    Where G is the solar constant, 0.0820 [MJ m-2 min-1]; d is the (inverse)
    relative earth-sun distance; w is the sunset hour angle; and D is the solar
    declination angle.
    
    For more information, consult the FAO documentation:

        https://www.fao.org/4/X0490E/x0490e07.htm#radiation
    
    Parameters
    ----------
    latitude : float
        The latitude on earth, in degrees, where southern latitudes
        are represented as negative numbers
    doy : int
        The day of the year (DOY), an integer on [1,366]
    
    Returns
    -------
    Number
        Top-of-atmosphere (TOA) radiation, in [MJ m-2 day-1]
    '''
    assert isinstance(doy, int) or issubclass(doy.dtype.type, np.integer), 'The "doy" argument must be an integer'
    assert np.all(doy >= 1) and np.all(doy <= 366), 'The "doy" argument must be between 1 and 366, inclusive'
    
    solar_constant = 0.0820 # [MJ m-2 min-1]
    pi = np.pi
    
    # Convert latitude from degrees to radians
    latitude_radians = np.deg2rad(latitude)
    # Inverse Earth-Sun distance (relative), as a function of day-of-year (DOY)
    earth_sun_dist = 1 + 0.033 * np.cos((doy * 2 * pi) / 365)
    # Solar declination, as a function of DOY
    declination = 0.409 * np.sin(((doy * 2 * pi) / 365) - 1.39)
    
    # Sunset hour angle; we use np.where() below to guard against
    #   warnings where arccos() would return invalid values, which
    #   happens when the argument is outside [-1, 1]
    _hour_angle = -np.tan(latitude_radians) * np.tan(declination)
    _hour_angle = np.where(np.abs(_hour_angle) > 1, np.nan, _hour_angle)
    sunset_hour_angle = np.arccos(_hour_angle)

    # Incident radiation, depends only on the relative earth-sun distance
    inc_radiation = ((24 * 60) / pi) * solar_constant * earth_sun_dist
    return inc_radiation * (sunset_hour_angle * np.sin(latitude_radians) * np.sin(declination) +
            np.cos(latitude_radians) * np.cos(declination) * np.sin(sunset_hour_angle))
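
Before relying on functions like these, it helps to check them against a published value. The sketch below is a simplified, scalar-only version of the radiation calculation, written with the FAO coefficients (including 0.033 in the earth-sun distance term). It is compared with the worked example in the FAO documentation for 20 degrees South on 3 September (day 246), which reports roughly 32.2 MJ m-2 day-1. It also shows one way to connect the two functions: the Hargreaves equation expects radiation in mm day-1, while the TOA radiation is in MJ m-2 day-1, and the FAO documentation gives 0.408 as the conversion factor. The function names and the temperature values are hypothetical and chosen only for illustration.

```python
import math

SOLAR_CONSTANT = 0.0820 # [MJ m-2 min-1]
MJ_TO_MM = 0.408 # FAO factor: 1 MJ m-2 day-1 is about 0.408 mm day-1

def toa_radiation_scalar(latitude, doy):
    '''
    Scalar-only sketch of TOA radiation [MJ m-2 day-1]. Unlike the
    array version, math.acos() raises a ValueError at polar latitudes
    where the sun does not rise or set.
    '''
    lat = math.radians(latitude)
    angle = (2 * math.pi * doy) / 365
    earth_sun_dist = 1 + 0.033 * math.cos(angle)
    declination = 0.409 * math.sin(angle - 1.39)
    sunset_hour_angle = math.acos(-math.tan(lat) * math.tan(declination))
    return ((24 * 60) / math.pi) * SOLAR_CONSTANT * earth_sun_dist * (
        sunset_hour_angle * math.sin(lat) * math.sin(declination)
        + math.cos(lat) * math.cos(declination) * math.sin(sunset_hour_angle))

def hargreaves_pet(radiation_mm, temp_max, temp_min, temp_mean):
    '''
    Hargreaves PET [mm day-1], given radiation in [mm day-1] and
    temperatures in degrees C.
    '''
    return 0.0023 * radiation_mm * math.sqrt(temp_max - temp_min) * (temp_mean + 17.8)

# FAO worked example: 20 degrees South, 3 September (day 246)
radiation = toa_radiation_scalar(-20.0, 246)

# Made-up temperatures, only to show the unit conversion in use
pet = hargreaves_pet(
    MJ_TO_MM * radiation, temp_max = 25.0, temp_min = 15.0, temp_mean = 20.0)
print(round(radiation, 1), round(pet, 2))
```

A check like this is a natural candidate for an automated test: if a later change to the code alters the result, the comparison with the published value will reveal it.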