# M3.5 - Documenting our Water Budget Workflow

*Part of:* [**Open Science for Water Resources**](https://github.com/OpenClimateScience/M3-Open-Science-for-Water-Resources)

We've now computed the change in storage for the Yellowstone River basin using our water budget equation. **This is a result that we might want to share with our colleagues and the broader scientific community.**

**But we're not finished yet!** First, there are several important questions we should ask ourselves:

1. How should we communicate this scientific result? That is, how will it be published?
2. How do we ensure we get proper credit for this result?
3. How can others verify the steps that led to the result we claim?
4. How could others contribute to and ultimately improve upon the work we've started?

**The answer to Question 1 is typically to submit a manuscript for peer review. However, most journals don't accept scientific code or datasets; you're expected to publish those separately.**

- This can lead to uncertain authorship, co-author credit, and licensing concerns (Question 2).
- Reviewers might not be able to access your code or your data to verify that the analysis was performed correctly (Question 3).
- And, finally, depending on how we publish our code and our data, others may not be able to easily utilize or contribute toward the work we've started (Question 4), which is inefficient and slows down scientific discovery and innovation.

---

## Publishing reproducible scientific workflows

**At this point, you may be thinking about that Git repository we created back in Section 3.** The very questions we've raised here can be addressed using Git and an online community called [Github.](https://github.com/)

&#x1F449; [**Sign up for a Github account here.**](https://github.com/signup) This will allow you to put your Git repository online for others to access.

### Putting your repository on Github

### Updating a repository on Github

---

## Best practices for publishing a reproducible scientific workflow

#### &#x1F6A9; <span style="color:red">Pay Attention</span>

For the rest of this lesson, we'll be referencing [this demonstration project on Github.](https://github.com/OpenClimateScience/demo-M3-project)

**What are the required contents of a reproducible scientific project when we publish a repository?**

The [Journal of Open Source Software (JOSS)](https://joss.theoj.org/), being one of the few journals that actually publishes reproducible scientific code, provides [a detailed list of what you are expected to provide](https://joss.readthedocs.io/en/latest/paper.html#what-should-my-paper-contain) so that others can use your research code. Here, we'll discuss just the essentials.

### The README file

[&#x1F449; See the example here.](https://github.com/OpenClimateScience/demo-M3-project/blob/main/README.md)

You should include a `README` file in your project root (i.e., the top folder of the repository) that describes:

- The purpose of the research code you're publishing
- How to install the software, including any software dependencies (e.g., Python libraries) that are required to run it
- A minimal example of how the software could be used

Note that in [our example](https://github.com/OpenClimateScience/demo-M3-project/blob/main/README.md), we provided the `pip` command that would be needed to install the dependencies. `pip` allows a user to specify a `REQUIREMENTS` file, which is a plain text file that lists the Python software dependencies, including the specific version of each required package. [See the example `REQUIREMENTS` file here.](https://github.com/OpenClimateScience/demo-M3-project/blob/main/REQUIREMENTS)
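To make the idea of pinned dependencies concrete, here is a minimal sketch of how a requirements file could be read and compared against what is actually installed. The package names and version numbers in `EXAMPLE` are hypothetical and are not taken from the demonstration project, and the helper functions `parse_requirements()` and `check_installed()` are our own illustration, not part of `pip`.

```python
from importlib import metadata

def parse_requirements(text):
    '''Returns {package: version} for lines of the form "name==version".'''
    pinned = {}
    for line in text.splitlines():
        line = line.split('#', 1)[0].strip()  # Drop comments and whitespace
        if '==' in line:
            name, version = line.split('==', 1)
            pinned[name.strip()] = version.strip()
    return pinned

def check_installed(pinned):
    '''Returns a list of (package, required, installed) mismatches.'''
    problems = []
    for name, required in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # The package is not installed at all
        if installed != required:
            problems.append((name, required, installed))
    return problems

# Hypothetical file contents, for illustration only
EXAMPLE = '''
# Pinned dependencies
numpy==1.26.4
xarray==2024.1.0
'''

print(parse_requirements(EXAMPLE))
```

In practice, users rarely need to do this by hand: a command of the form `pip install -r REQUIREMENTS` reads the file and installs the listed versions.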

In addition to this essential information, a mature scientific software project should also include:

- Information on how to run tests, to verify that the software was installed correctly
- A link to the software's documentation

### The LICENSE

[&#x1F449; See the example here.](https://github.com/OpenClimateScience/demo-M3-project/blob/main/LICENSE)

You should also include a software license, ideally in a plain text file called `LICENSE`. There are several considerations that go into choosing a software license, but the most commonly used licenses are described at this helpful website:

[&#x1F449; https://choosealicense.com/](https://choosealicense.com/)

---

## Reproducible research code

Finally, in this minimal example, we have the research code itself. We opted to put it in a folder called `scripts` because our entire project currently consists of three (3) separate Python scripts that simply need to be executed in the correct order.

Despite the simplicity of this example, a few key features of our research code make it easier for someone else to run.

### Appropriate use of Python docstrings

We included a *docstring* at the top of each of our Python scripts. For example, in `step01_IMERG-Final_monthly_precipitation.py`, the docstring describes the purpose of the script and how to run it:

```python
'''
Computes monthly precipitation data for a basin (represented in a Shapefile)
based on IMERG-Final data. To execute this script:

    python step01_IMERG-Final_monthly_precipitation.py true

Where the argument "true" will cause the script to download the required data
from NASA Earthdata Search.
'''
```
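The docstring tells the user what to type, but the script still has to read that argument. The sketch below shows one way this could work; it is an assumption for illustration, not the actual code in the demonstration project. The script name `example_step.py` and the function `parse_download_flag()` are hypothetical.

```python
'''
Hypothetical example script. To execute this script:

    python example_step.py true

Where the argument "true" would cause the script to download data.
'''

import sys

def parse_download_flag(argv):
    'Returns True if the first command-line argument is "true" (any case).'
    return len(argv) > 1 and argv[1].lower() == 'true'

if __name__ == '__main__':
    if '--help' in sys.argv:
        # The module docstring is available as __doc__, so it can
        # double as the help text
        print(__doc__)
    else:
        print('Download data:', parse_download_flag(sys.argv))
```

Because the docstring is the first statement in the file, Python stores it in the module's `__doc__` attribute, which is why the same text can serve both a reader of the source code and a user at the command line.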

### Global variables that are easy to find

Also near the top of each script we have defined the global variables that point to input and output files. The user may want to change these details, so we put them towards the top of the script to make them easy to find. For example:

```python
import pathlib

CWD = pathlib.Path(__file__).parent.resolve() # Folder containing this script
OUTPUT_FILE = CWD.parent.joinpath('processed/IMERG-Final_precip_monthly_2014-2023.nc')
BASIN_FILE = CWD.parent.joinpath('data/shp/YellowstoneRiver_drainage_WSG84.shp')
```

Recall that the Python convention is to name such global variables in all capital letters, which also makes it easy to find where they are used.

### Handling file paths on different file systems

Our project's directory structure looks something like this:

```
demo-M3-project/
|-- h2o/
    |-- data/
        |-- shp/
    |-- processed/
    |-- scripts/
```

The Python scripts are located in the `scripts` folder, within the `h2o` folder. But, ultimately, users choose where to put the `demo-M3-project` on their own file system and we can't predict where that might be.

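In each Python script, we used the `pathlib` library and the special, pre-defined `__file__` variable (the path to the script being run) so that the script can be run on any user's file system, regardless of where they put the project. The following line defines a global variable `CWD` that holds the absolute path to the folder containing the script, i.e., the `scripts` folder. Despite the name ("current working directory"), this is the script's own location, not necessarily the directory the user launched Python from. We can then use `CWD.parent` to locate the *parent* of the `scripts` folder (the folder called `h2o`) on any user's file system: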

```python
import pathlib

# Get the path to the "scripts" folder on any system; CWD.parent is "h2o"
CWD = pathlib.Path(__file__).parent.resolve()
```

Then, we're able to get the absolute file path to any file we're interested in, for example:

```python
BASIN_FILE = CWD.parent.joinpath('data/shp/YellowstoneRiver_drainage_WSG84.shp')
```
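To see how these path operations behave, here is a small sketch that walks through them using a made-up location for the project. The `/home/user/` prefix is hypothetical, and we use `pathlib.PurePosixPath` (which only manipulates path strings and never touches the file system) in place of `pathlib.Path(__file__)` so that the example gives the same result on any machine.

```python
import pathlib

# Hypothetical location of one of the scripts on a user's file system
script = pathlib.PurePosixPath(
    '/home/user/demo-M3-project/h2o/scripts/'
    'step01_IMERG-Final_monthly_precipitation.py')

scripts_dir = script.parent   # The folder containing the script ("scripts")
h2o_dir = scripts_dir.parent  # One level up (the "h2o" folder)
basin_file = h2o_dir.joinpath('data/shp/YellowstoneRiver_drainage_WSG84.shp')

print(scripts_dir)
print(h2o_dir)
print(basin_file)
```

If the user moves the whole `demo-M3-project` folder somewhere else, only the prefix changes; the relationship between `scripts`, `h2o`, and `data/shp` stays the same, which is why building paths relative to the script's own location is portable.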

---

## Summary

If this was your first time publishing research code to Github, congratulations! While this project is just a simple demonstration, we've included the bare minimum requirements for someone else to run our research code.

In the following modules of [this open-science curriculum,](https://openclimatescience.github.io/curriculum) we'll see more sophisticated ways of publishing research code that is reproducible and ready to use.