# M3.3 - Tracking Changes to Research Code

*Part of:* [**Open Science for Water Resources**](https://github.com/OpenClimateScience/M3-Open-Science-for-Water-Resources)

In the previous section, we processed IMERG-Final global precipitation data into a monthly precipitation time series for our basin. **It's likely we would want to save and repeat this analysis in the future.** There are several reasons for this:

- We might want to run this analysis for a different basin.
- We might want to run this analysis for hundreds or thousands of other basins and compare the results.
- Someone else might ask us to run this analysis for a basin they are interested in, or ask if they can use our code.
- We might discover a mistake in our analysis that requires us to process the data again (after correcting the mistake).

**For any of these reasons, we should always ask the following question about research code we generate: *Is it likely that someone, including me, will want to run this code again?***

If the answer is "Yes," then we need to think about what comes next. If we change the code in the future, we might unintentionally break something that is currently working. We might decide that new features we added aren't really necessary and the code would be better without them. Someone else might decide to adapt our code for a completely different purpose. And we might want to work with different versions of the same code; for example, a stable version that is commonly used and an experimental version that has more features.

**Source control management (SCM), sometimes called "version control," can help with these issues.** To see how SCM works, let's revisit our precipitation analysis code.

Below, all we've done so far is to combine the code into a single code cell and to move the `import` statements to the top of the code block, where they belong.

In [None]:
import calendar
import datetime
import glob
import earthaccess
import numpy as np
import h5py
import xarray as xr
import rioxarray # Registers the ".rio" accessor used below
import geopandas
from matplotlib import pyplot
from pyproj import CRS

auth = earthaccess.login()

basin = geopandas.read_file('/home/arthur.endsley/Workspace/NTSG/projects/Y2024_TOPS_Training/data/YellowstoneRiver_drainage_WSG84.shp')

# results = earthaccess.search_data(
#     short_name = 'GPM_3IMERGM',
#     temporal = ('2014-01-01', '2023-12-31'))
# earthaccess.download(results, 'data/IMERG-Final_monthly')

file_list = glob.glob('data/IMERG-Final_monthly/*.HDF5')
file_list.sort()

datasets = []
for i, filename in enumerate(file_list):
    # Only need to do this once, for the first file
    if i == 0:
        with h5py.File(filename, 'r') as hdf:
            longitude = hdf['Grid/lon'][:]
            latitude = hdf['Grid/lat'][:]

    # Get the date of this image
    date = datetime.datetime.strptime(filename.split('.')[4][0:8], '%Y%m%d')
    ds0 = xr.open_dataset(
        filename, group = 'Grid', decode_times = False).get(['precipitation'])
    # Define the missing coordinates
    ds0 = ds0.assign_coords({
        'time': [date], 'x': longitude, 'y': latitude
    })
    
    # Define the coordinate reference system (CRS) and the spatial coordinates
    ds0 = ds0.rio.write_crs(CRS.from_epsg(4326))
    ds0 = ds0.rio.set_spatial_dims('lon', 'lat')

    # Clip the IMERG-Final precipitation data to our basin's boundary
    ds_clip = ds0.rio.clip(basin.geometry.values)
    
    # Save the clipped dataset to be merged with the others
    datasets.append(ds_clip)

# Merge the datasets together along the "time" axis (i.e., build a time series)
ds = xr.concat(datasets, dim = 'time')

# Converting from [mm hour-1] to [mm month-1]
days_in_month = np.array(calendar.mdays)[ds.coords['time.month'].values]
ds['precip_monthly'] = ds.precipitation * 24 * days_in_month.reshape((days_in_month.size, 1, 1))

# Compute basin-wide monthly precipitation
precip_series = ds.precip_monthly.mean(['lon','lat']).values
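
As a quick check of the unit conversion above, here is a minimal sketch that uses a small, made-up array in place of the real IMERG-Final data. The array shape and the rate of 0.5 mm per hour are assumptions for illustration only. Note that `calendar.mdays` always counts 28 days in February, so this approach slightly undercounts precipitation in leap years; `calendar.monthrange()` is one alternative if that matters for your analysis.

```python
import calendar
import numpy as np

# Made-up example: 3 months (Jan, Feb, Mar) on a 2 x 2 grid,
#   with a constant precipitation rate of 0.5 mm per hour
months = np.array([1, 2, 3])
precip_rate = np.full((3, 2, 2), 0.5)

# calendar.mdays is a list where index 1 is January (index 0 is a placeholder)
days_in_month = np.array(calendar.mdays)[months]

# Reshape to (time, 1, 1) so each month's factor applies to every grid cell
precip_monthly = precip_rate * 24 * days_in_month.reshape((days_in_month.size, 1, 1))
print(precip_monthly[:, 0, 0])  # [372. 336. 372.]

# A leap-year-aware day count, for February 2016
print(calendar.monthrange(2016, 2)[1])  # 29
```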

---

## Adapting research code for re-use

We started this discussion with the idea that our code will be re-used. When we want to re-use code, we typically write a **function.** **What parts of our analysis could be easily re-written as more general-purpose functions?**

Let's start by decomposing our analysis into a series of simple steps:

1. Download the IMERG-Final data for a given period.
2. Open one of the IMERG-Final data granules to read the latitude and longitude coordinates.
3. For each data granule, create an `xarray` Dataset with the proper coordinates.
4. For each data granule, clip the Dataset to the bounds of our basin.
5. Merge the Datasets together.
6. Convert the units of precipitation.
7. Calculate the basin-wide average monthly precipitation.

Step 3 seems like a good candidate for turning into a general-purpose function. Why? The IMERG-Final data are stored as HDF5 files and we have to do a lot of work to prepare them for use with `xarray`. **The *boilerplate code* we wrote to achieve this isn't specific to our analysis; we'd have to do it every time for every IMERG-Final data granule.**

#### &#x1F3C1; Challenge: Re-writing Code as a Function

Functions generally transform inputs (arguments) into outputs (the return value). When looking at existing code to determine if it can be re-written as a function, we might look for parts of our code where *a single argument* is used multiple times.

For example, in this section of our code, we use the `filename` variable a lot!

<code>
    with h5py.File(<span style = "background-color:yellow">filename</span>, 'r') as hdf:
        longitude = hdf['Grid/lon'][:]
        latitude = hdf['Grid/lat'][:]
    # Get the date of this image
    date = datetime.datetime.strptime(<span style = "background-color:yellow">filename</span>.split('.')[4][0:8], '%Y%m%d')
    ds0 = xr.open_dataset(
        <span style = "background-color:yellow">filename</span>, group = 'Grid', decode_times = False).get(['precipitation'])
</code>
<br />

**This suggests that the entire section (above) could be re-written as a function that takes `filename` as an argument. Try writing the function for Step 3 yourself, then compare it with our answer, below.**

In [None]:
def hdf5_to_xarray_dataset(filename, longitude = None, latitude = None):
    '''
    Reads an HDF5 file representing monthly data and returns an
    xarray.Dataset with the date, latitude, and longitude coordinates
    properly defined.

    Parameters
    ----------
    filename : str
        The file path to the HDF5 file
    longitude : numpy.ndarray
        The longitude coordinates, as a 1D NumPy array
    latitude : numpy.ndarray
        The latitude coordinates, as a 1D NumPy array

    Returns
    -------
    xarray.Dataset
    '''
    if longitude is None or latitude is None:
        with h5py.File(filename, 'r') as hdf:
            longitude = hdf['Grid/lon'][:]
            latitude = hdf['Grid/lat'][:]

    # Get the date of this image
    date = datetime.datetime.strptime(filename.split('.')[4][0:8], '%Y%m%d')
    ds0 = xr.open_dataset(
        filename, group = 'Grid', decode_times = False).get(['precipitation'])
    # Define the missing coordinates
    ds0 = ds0.assign_coords({
        'time': [date], 'x': longitude, 'y': latitude
    })
    
    # Define the coordinate reference system (CRS) and the spatial coordinates
    ds0 = ds0.rio.write_crs(CRS.from_epsg(4326))
    ds0 = ds0.rio.set_spatial_dims('lon', 'lat')
    return ds0
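
One fragile step in the function above is the date parsing, which splits the whole file path on the `.` character. If any folder in the path also contains a dot, the index `4` no longer points at the date. The sketch below is one possible way to guard against that, by parsing only the file's base name. The helper name `parse_granule_date()` and the example path are hypothetical, and the file name is assumed to follow the usual IMERG-Final monthly naming pattern.

```python
import datetime
import os

def parse_granule_date(filename):
    # Use only the file's base name, so dots in folder names don't shift the index
    basename = os.path.basename(filename)
    return datetime.datetime.strptime(basename.split('.')[4][0:8], '%Y%m%d')

# Hypothetical path with a dot in one of the folder names
example = '/home/some.user/data/3B-MO.MS.MRG.3IMERG.20140101-S000000-E235959.01.V07B.HDF5'
print(parse_granule_date(example))  # 2014-01-01 00:00:00
```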

#### &#x1F3AF; Best Practice

There is one important thing to note about our `hdf5_to_xarray_dataset()` function.

We already know this function is going to be used inside a `for` loop, so we should think carefully about what happens inside the function. If there's a potentially time-consuming operation that only needs to be done once, we should exclude it from the function. 

We solved this problem by making `longitude` and `latitude` into optional arguments; if the function is going to be used inside a `for` loop, the user can provide these arguments to avoid having to read the HDF5 file with `h5py` multiple times. 

---

## Version control for research code

With our `hdf5_to_xarray_dataset()` function already defined, we can put the rest of our code into a `main()` function, as below. This enables us to represent the entire workflow as a single Python script. [**You can view the entire script here.**](https://github.com/OpenClimateScience/M3-Open-Science-for-Water-Resources/blob/main/notebooks/scripts/basin_precip_version1.py)

```python
import calendar
import datetime
import glob
import earthaccess
import numpy as np
import h5py
import xarray as xr
import rioxarray # Registers the ".rio" accessor used below
import geopandas
from pyproj import CRS

BASIN_FILE = '/home/arthur.endsley/Workspace/NTSG/projects/Y2024_TOPS_Training/data/YellowstoneRiver_drainage_WSG84.shp'

def main():
    auth = earthaccess.login()
    basin = geopandas.read_file(BASIN_FILE)
    
    results = earthaccess.search_data(
        short_name = 'GPM_3IMERGM',
        temporal = ('2014-01-01', '2023-12-31'))
    earthaccess.download(results, 'data/IMERG-Final_monthly')
    file_list = glob.glob('data/IMERG-Final_monthly/*.HDF5')
    file_list.sort()
    
    datasets = []
    for i, filename in enumerate(file_list):
        # Only need to do this once, for the first file
        if i == 0:
            with h5py.File(filename, 'r') as hdf:
                longitude = hdf['Grid/lon'][:]
                latitude = hdf['Grid/lat'][:]

        # Read the HDF5 file as an xarray Dataset, clip it to
        #    our basin's boundary
        ds0 = hdf5_to_xarray_dataset(filename, longitude, latitude)
        ds_clip = ds0.rio.clip(basin.geometry.values)
        datasets.append(ds_clip)
    
    # Merge the datasets together along the "time" axis (i.e., build a time series)
    ds = xr.concat(datasets, dim = 'time')
    
    # Converting from [mm hour-1] to [mm month-1], then compute basin-wide
    #    monthly precip.
    days_in_month = np.array(calendar.mdays)[ds.coords['time.month'].values]
    ds['precip_monthly'] = ds.precipitation * 24 * days_in_month.reshape((days_in_month.size, 1, 1))
    precip_series = ds.precip_monthly.mean(['lon','lat']).values
```

<br />

Remember these important lines?

```python
if __name__ == '__main__':
    main()
```

[Review this previous lesson if you need to recall what they are for.](https://github.com/OpenClimateScience/M2-Computational-Climate-Science/blob/main/notebooks/05_Creating_a_Reproducible_Climate_Data_Analysis.ipynb)

#### &#x1F3AF; Best Practice

When we defined the `main()` function, we made one more important change. We moved the file path for our basin's Shapefile, `YellowstoneRiver_drainage_WSG84.shp`, towards the top of the script and defined it as a global variable, `BASIN_FILE`. This helps any future users, including ourselves, to quickly identify what to change if they want to use this script for a different basin.
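
One possible way to take this a step further, shown here only as a sketch, is to let `main()` accept the file path as an argument that defaults to `BASIN_FILE`. The relative paths below are hypothetical, and the function body is a placeholder for the real workflow.

```python
# Hypothetical relative path, used here only for illustration
BASIN_FILE = 'data/YellowstoneRiver_drainage_WSG84.shp'

def main(basin_file = BASIN_FILE):
    # Stand-in for the real workflow: just report which file would be read
    return f'Reading basin boundary from: {basin_file}'

print(main())
print(main('data/AnotherBasin.shp'))
```

With this design, the global variable still documents the default basin, but running the analysis for a different basin doesn't require editing the script.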

### Introducing Git

In order to track changes to our research code, we'll use [**Git SCM (link).**](https://git-scm.com/) Git provides version control for code, comparing any new changes in a document to previous version(s). Code changes are tracked for a single **repository,** or collection of code, and Git makes it possible to view and go back to previous versions of the repository. Multiple users with access to the same repository can make changes to the code simultaneously and Git helps merge those changes together. A single repository can have multiple copies on different networks, connected over the internet.

**At the most basic level, Git is useful even for a single user working alone.** If you make a copy of your repository on another network, or on [a website like GitHub,](https://github.com/) then you automatically have a back-up of your work. And you still benefit from Git's version history.

**To get started with Git, we want to be working inside our project's top folder.** Below is an example of what our project folder, called `h2o`, should look like. **The Python script we are currently working on should be stored in a folder called `scripts`.**

![](./assets/M3_repository.png)

<br />

Our **repository** will also be a collection of files and folders, starting with the top folder, `h2o`, and everything below. However, we will get to decide what files and folders are actually included in the repository.

#### &#x1F6A9; <span style="color:red">Pay Attention</span>

**There are a lot of files in our project folder but we want to be careful about what we add to our repository.** SCM tools like Git are generally only good at tracking changes to plain-text files, like Python scripts. Git is very efficient at identifying and representing changes to plain-text files, where changes can be represented as additions or deletions of individual lines of text.

When we add large **binary files** to our repository, if they ever change in the future, the only way Git can represent that change is to store a second (or a third, or a fourth...) copy of the file. This can make our repository grow very large, very fast. **In general, we want to only add plain-text files, like code and plain-text documentation, to our repository.**

### Finding a place for our Git repository

**If you've already started working in your project folder (depicted above), make sure you know how to navigate to that folder using the command line. Otherwise, follow the steps below to create a new project folder.**

- The `cd` command, which stands for "change directory," can be used on any system's command line to navigate to a given folder.
- The `mkdir` command, which stands for "make directory," can be used on any system's command line to create a new folder with a given name.
- We'll then use `cd` again to go inside the new folder.

#### &#x1FA9F; Windows

On Windows, a good place for our new repository is in your Home folder, which is `C:/Users/username`, where `username` is your username.

```sh
cd C:/Users/username
mkdir h2o
cd h2o
```

#### &#x1F34E; Mac OS/X

On Mac OS/X, a good place for our new repository is in your Home folder, which is `/Users/username`, where `username` is your username.

```sh
cd /Users/username
mkdir h2o
cd h2o
```

#### &#x1F427; GNU/Linux

On GNU/Linux, a good place for our new repository is in your Home folder, which is `/home/username`, where `username` is your username.

```sh
cd /home/username
mkdir h2o
cd h2o
```

### Initializing a Git repository

#### &#x1FA9F; &#x1F34E; &#x1F427; All Operating Systems

**To tell Git that we want to create a new repository,** and to start tracking changes to files within that repository, we use the `git init` command.


```sh
git init
```

You should see a message similar to the one below:
```
Initialized empty Git repository in /home/username/h2o/.git/
```


#### &#x1FA9F; &#x1F34E; &#x1F427; All Operating Systems

**We can check on the status of our repository anytime with the `git status` command.**

```sh
git status
```

You should see output similar to the following:
```
On branch main

No commits yet

nothing to commit (create/copy files and use "git add" to track)
```

**What does this mean?**

- Git allows us to work on one or more **branches** at a time. Each **branch** is like a parallel version of the repository. Our new repository has a single branch, called `main` here; depending on your Git version and configuration, the default branch may be called `master` instead.
- Changes to repositories are bundled together in **commits.** A **commit** is a collection of changes that are related to one another; for example, a new feature or a bug fix.

### Adding files to the repository

Git also tells us that there is "nothing to commit" yet. That's because our repository has no files. Even if the repository *folder* contains files, Git will only track files that we explicitly tell it to track.

#### &#x1FA9F; &#x1F34E; &#x1F427; All Operating Systems

**To tell Git to start tracking a file, we use the `git add` command.** If you haven't already created the Python script called `step01_IMERG-Final_monthly_precipitation.py`, do so now and place it into the `scripts` folder, which should be inside the `h2o` folder (see the file tree diagram, above).

```sh
git add scripts/step01_IMERG-Final_monthly_precipitation.py
```

#### &#x1FA9F; &#x1F34E; &#x1F427; All Operating Systems

Now, if we were to check the status of our repository...

```sh
git status
```

We would see that we have now added a file to the repository.
```
On branch main

No commits yet

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)
	new file:   scripts/step01_IMERG-Final_monthly_precipitation.py
```

### Finalizing changes

Adding a new file also counts as a change to our repository. Therefore, we have a set of changes that we're ready to commit! Committing changes to the repository means that Git will remember them and that we can later revert the repository back to this state. Running `git commit` without any options opens a text editor, where we write a short message describing the changes.

#### &#x1FA9F; &#x1F34E; &#x1F427; All Operating Systems

```sh
git commit
```

---

## Updating research software