# M5.2 - Creating a Reproducible Research Environment

*Part of:* [**Open Climate Science for Crops & Crop Conditions**](https://github.com/OpenClimateScience/M5-Open-Science-for-Crops)

**In this lesson, we'll introduce two new tools for creating and managing reproducible research environments:**

- **Pixi** is a package manager that can install and update Python packages and other parts of your software environment. It's an alternative to `pip`, which can only manage Python packages.
- **Snakemake** is a tool for writing software tasks and executing them using simple, one-line instructions.

**Together, Pixi and Snakemake will allow us to easily re-create a research software environment, installing all the necessary dependencies and executing our workflow using a small number of simple commands.**

## Command-line tools

**At this point, you're probably very experienced with using the command-line tools on your computer!** You'll see fewer hints in this lesson, but don't forget where to run commands like `cd`, `pixi`, or `snakemake`:

#### &#x1FA9F; Windows

**On Windows, you'll want to run these commands in the PowerShell or Command Prompt.**

#### &#x1F34E; &#x1F427; Mac OS/X or GNU/Linux

**On Mac OS/X or GNU/Linux, you'll want to run these commands in the Terminal.**

---

## Getting started with Pixi

To install Pixi and the `pixi` command line tool:

#### &#x1FA9F; Windows

[For Windows, see the instructions on their website.](https://pixi.sh/latest/installation/)

#### &#x1F34E; &#x1F427; Mac OS/X or GNU/Linux

```sh
# If you have curl installed:
curl -fsSL https://pixi.sh/install.sh | sh

# Otherwise:
wget -qO- https://pixi.sh/install.sh | sh
```

Note: In general, you should be suspicious of a command that runs a shell script from the internet! [See the Pixi website for yourself, where you can preview the shell script, if you want to verify its source.](https://pixi.sh/latest/installation/)



### Creating an environment with Pixi

To begin with, we need to create an environment with Pixi. This is similar to [what we did in Module 3](https://github.com/OpenClimateScience/M3-Open-Science-for-Water-Resources/blob/main/notebooks/01_Creating_a_Research_Software_Environment.ipynb), creating a virtual environment with `virtualenv`.

#### &#x1FA9F; &#x1F34E; &#x1F427; All Operating Systems

```sh
# Go to the directory we want to work in
cd demo-M5-project

pixi init
```

When we execute `pixi init` with no further arguments, Pixi creates an environment named after our current directory. If we wanted Pixi to create a new directory for the environment instead, we could write `pixi init name`, where `name` is the name of both the new directory and the environment.

**What happened?** There's now a file called `pixi.toml` in this directory. This file should look something like:

```toml
[workspace]
authors = ["K. Arthur Endsley <arthur.endsley@ntsg.umt.edu>"]
channels = ["conda-forge"]
name = "demo-M5-project"
platforms = ["linux-64"]
version = "0.1.0"

[tasks]

[dependencies]
```

This example is from my computer. The information in `authors` came from my Git installation, because I used `git config` to make sure Git knows my name and the e-mail address I use for my GitHub account. So, don't be alarmed if you see some personal information in this file!

### Compatibility with `pyproject.toml`

Pixi has its own format for `pixi.toml`, though it may resemble the `pyproject.toml` file we saw in Module 4. If you want Pixi to use the `pyproject.toml` format, you can add the argument `--format pyproject`. For example:

```sh
pixi init --format pyproject
```

### Activating our virtual environment

Because we want to do some further work setting up our environment, we need to activate the virtual environment. The command for this is `pixi shell`, because "shell" is another name for the command-line interface to our operating system.

#### &#x1FA9F; &#x1F34E; &#x1F427; All Operating Systems

```sh
pixi shell
```

### Adding software dependencies with Pixi

Let's see how we can use `pixi` to manage software dependencies. Previously, we used `pip` to install Python packages. `pixi` can manage more than just Python packages; it can manage entire software platforms like the Python interpreter itself! 

As an example, let's tell `pixi` that we want to include Python in our project.

#### &#x1FA9F; &#x1F34E; &#x1F427; All Operating Systems

```sh
pixi add python
```

This does more than install the Python interpreter. It also tells `pixi` that some of the things we `add` later will be Python packages, and that it may need to look in certain places (like the Python Package Index, or PyPI) to find them.

How do we install Python packages? It's the same command!

#### &#x1FA9F; &#x1F34E; &#x1F427; All Operating Systems

```sh
# Install the numpy package for Python
pixi add numpy
```

You'll see a few progress bars appear as `pixi` identifies what package we're asking it to install, determines the latest version (since we didn't specify a version), downloads the package, determines the dependencies of this new package and installs those, and ultimately installs the package. This is just as expected, but let's take a look at that `pixi.toml` file again. You should see that the `[dependencies]` section has been updated:

```toml
[dependencies]
numpy = ">=2.3.5,<3"
```
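
To make the version specifier concrete, here is a minimal Python sketch of how a range like `>=2.3.5,<3` can be read: every comma-separated clause must hold. It assumes plain numeric versions such as `2.3.5`. The helper names `parse_version` and `satisfies` are made up for this illustration, and Pixi's real solver handles many more cases than this.

```python
import operator

# Two-character operators come first so ">=" is not mistaken for ">".
OPERATORS = [
    (">=", operator.ge),
    ("<=", operator.le),
    ("==", operator.eq),
    (">", operator.gt),
    ("<", operator.lt),
]


def parse_version(text):
    """Turn '2.3.5' into (2, 3, 5), padding short versions like '3' with zeros."""
    parts = tuple(int(part) for part in text.strip().split("."))
    return parts + (0,) * (3 - len(parts))


def satisfies(version, spec):
    """Return True if `version` meets every clause of a spec like '>=2.3.5,<3'."""
    for clause in spec.split(","):
        clause = clause.strip()
        for symbol, compare in OPERATORS:
            if clause.startswith(symbol):
                bound = parse_version(clause[len(symbol):])
                if not compare(parse_version(version), bound):
                    return False
                break
        else:
            raise ValueError(f"Unsupported clause: {clause!r}")
    return True


print(satisfies("2.3.5", ">=2.3.5,<3"))  # True
print(satisfies("3.0.0", ">=2.3.5,<3"))  # False
print(satisfies("1.26.4", "<2"))         # True
```

In other words, the specifier written by `pixi add` accepts the version that was just installed and any later 2.x release, but not a future version 3.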

You may have noticed that, in addition to the `pixi.toml` file, there is also a `pixi.lock` file. [You can read more about this file here,](https://pixi.sh/latest/workspace/lockfile/) but we won't discuss it because we don't need to understand how it works in order to manage our environment in a reproducible way. All the information we need is contained in the `pixi.toml` file.

---

## Managing version conflicts

A good package manager needs to be able to handle conflicts between different packages and between different versions of their dependencies. [In Module 3, we discussed how our research project might become dependent on a specific version of a software dependency.](https://github.com/OpenClimateScience/M3-Open-Science-for-Water-Resources/blob/main/notebooks/01_Creating_a_Research_Software_Environment.ipynb) When our project depends on specific versions of one or more packages, that can conflict with the requirements of another one of our software dependencies.

For this project, we need to make sure we're using a version of `numpy` that is *less than* version 2.0. The latest version of `numpy` prior to the version 2.0 release is `1.26.4`. This version of `numpy` has its own software dependencies. Most importantly, `numpy` requires Python, and version `1.26.4` was created to work only with certain versions of Python. **How can we make sure we install both the right version of `numpy` *and* the right version of Python?**

`pixi` and many other package managers handle these situations seamlessly. However, in order to make sure `pixi` has all the required information about potential version conflicts, **we need to tell `pixi` to install all the packages with potential conflicts at the same time.** In this example, that means telling it to install both Python and a version of `numpy` that is *less than version 2.0.*

#### &#x1FA9F; &#x1F34E; &#x1F427; All Operating Systems

```sh
pixi add python "numpy<2"
```
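
Once the command finishes, you may want to confirm what was actually installed. The sketch below is one way to check from inside the environment (for example, after `pixi shell`), using only the Python standard library. The helper names `installed_version` and `major` are hypothetical, chosen for this example.

```python
import sys
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def major(version):
    """Return the major version number, e.g. 1 for '1.26.4'."""
    return int(version.split(".")[0])


numpy_version = installed_version("numpy")
print("Python:", ".".join(str(n) for n in sys.version_info[:3]))
if numpy_version is None:
    print("numpy is not installed in this environment")
elif major(numpy_version) < 2:
    print("numpy", numpy_version, "satisfies numpy<2")
else:
    print("numpy", numpy_version, "does NOT satisfy numpy<2")
```

The exact versions printed will depend on what Pixi resolved on your computer.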

### &#x1F449; Automatic environment configuration 

**But wait! Isn't Pixi supposed to make it easy to set up your environment? Why are we manually installing each package?** Right now, we're learning how to use Pixi for the first time. But, if you're having difficulties or want to skip ahead, note that there is already a `pixi.toml` file in this repository. You can set up your environment automatically, based on an existing `pixi.toml` file, by executing:

```sh
# Must be in the directory of the pixi.toml file
pixi install
```

If you choose to use `pixi install`, you should at least read through the rest of this lesson to gain an understanding of how Pixi works.

### Installing the remaining dependencies

**Let's go ahead and install the remaining dependencies we'll need for this lesson.**

#### &#x1FA9F; &#x1F34E; &#x1F427; All Operating Systems

```sh
pixi add earthaccess h5py matplotlib pyproj py4eos rasterio rioxarray shapely xarray notebook
```

**You may have encountered the following error:**

```
Error:   × failed to solve requirements of environment...
  ├─▶   × failed to solve the environment
  │   
  ╰─▶ Cannot solve the request because of: No candidates were found for py4eos *.
```

This happens with some packages that are available in the Python Package Index (PyPI) but weren't automatically found. This is often because they aren't available from the `conda` channels (here, `conda-forge`) that `pixi` searches by default. To make sure that `pixi` checks PyPI, we need to add the `--pypi` argument:

```sh
pixi add --pypi py4eos
```

#### &#x1F6A9; <span style="color:red">Pay Attention</span>

**Because there was an error with our original `pixi add` command, none of the packages listed in that command were installed.** So far, only `py4eos` has been added, by the `--pypi` command above. We must use `pixi add` again to install the remaining dependencies.

#### &#x1FA9F; &#x1F34E; &#x1F427; All Operating Systems

```sh
pixi add earthaccess h5py matplotlib pyproj rasterio rioxarray xarray shapely notebook
```

---

## Documenting your environment

We've set up our Python environment and installed all the software dependencies we'll need. Great! But how might we make this easier to reproduce?

- **If we needed to start working on the same project but on a different computer, we'd have to run all these commands again!**
- **If someone else wants to reproduce our work or contribute to our project, we'd want to make sure they set up their computer in the same way.**

The `pixi.toml` file is a complete description of the environment we just created. If you look inside the file, you'll see it's automatically updated with the dependencies we just installed:

```toml
[dependencies]
python = ">=3.12.12,<3.13"
numpy = "<2"
earthaccess = ">=0.15.1,<0.16"
h5py = ">=3.15.1,<4"
matplotlib = ">=3.10.8,<4"
pyproj = ">=3.7.2,<4"
rasterio = ">=1.4.4,<2"
rioxarray = ">=0.20.0,<0.21"
xarray = ">=2025.12.0,<2026"
notebook = ">=7.5.1,<8"
shapely = ">=2.1.2,<3"

[pypi-dependencies]
py4eos = ">=0.5.0, <0.6"
```

If you are working with someone who doesn't have `pixi` installed and doesn't want to install it, they won't be able to use your `pixi.toml` file. If they're using `mamba` or `conda`, however, you can send them a `conda` YAML file:


#### &#x1FA9F; &#x1F34E; &#x1F427; All Operating Systems

```sh
pixi workspace export conda-environment
```


#### &#x1F34E; &#x1F427; Mac OS/X and GNU/Linux

On Mac OS/X and GNU/Linux, you can write this output to a file:

```sh
pixi workspace export conda-environment > environment.yaml
```

**The `environment.yaml` file can now be shared with anyone who wants to re-create your software environment.**

---

## Getting started with Snakemake

So far, we have a package manager, `pixi`, that is similar in its capabilities to `pip`, the package manager we used before. We can now install Python packages and create a snapshot of our environment that can be re-used and shared, ensuring that the computational environment can be re-created on another computer.

**But that's only half of the problem of reproducibility.** We still need an easy way for someone to reproduce the *workflows* we'll be creating; a way for someone to not just install the same software environment but run the same programs *in the same way* that we did.

One of the early tools created to solve this kind of problem was called `make`. The `make` program allowed users to run an entire sequence of complex workflows using commands as simple as `make install` or `make results.csv`.

**In this lesson, we'll use a tool similar to `make` called [Snakemake (link)](https://snakemake.github.io/).** To install Snakemake using `pixi`:

#### &#x1FA9F; &#x1F34E; &#x1F427; All Operating Systems

```sh
pixi add --pypi Snakemake
```

### Making rules

**Snakemake automates tasks, like scientific workflows.** Every task is defined by a **rule,** often with required input files and output files specified.

As in Python `for` loops or function definitions, rules involve a keyword followed by a colon, `:`, and a code block. The Snakemake syntax is actually just an extension of the Python language.

A very simple example of a rule might look like the following:

```
# Snakefile
rule greeting:
    shell: "echo Hello, world!"
```

- Instead of a `def` keyword or the `for` statement, as in a Python program, we use the `rule` statement to name a new rule.
- The indented code block, following the colon, is where we put everything related to this rule.
- The `shell` attribute of a rule is where we would write the command that should be issued on the command line in order to satisfy the rule (i.e., execute the task). In this example, we want the command `echo Hello, world!` to be issued to the command line.
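
If it helps to think of a rule in plain Python terms, here is a toy analogy: a dictionary maps each rule name to a shell command, and a small function runs the command for the name we ask for. This is not how Snakemake is implemented, and `RULES` and `run_rule` are names invented for this sketch. Real rules can also track input and output files.

```python
import subprocess

# Each "rule" is just a name paired with the shell command that satisfies it
RULES = {
    "greeting": "echo Hello, world!",
}


def run_rule(name):
    """Look up a rule by name, run its shell command, and return what it printed."""
    command = RULES[name]
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


print(run_rule("greeting"))  # Hello, world!
```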

**Open a text editor and put the rule (above) into a plain-text file called `Snakefile`** (with no file extension). We can execute this rule by issuing the following command:

#### &#x1FA9F; &#x1F34E; &#x1F427; All Operating Systems

```sh
snakemake greeting --cores=1 --quiet
```

You should see some output that looks like the following example.

```
Assuming unrestricted shared filesystem usage.
host: Gullveig
Building DAG of jobs...
Using shell: /usr/bin/bash
Select jobs to execute...
Execute 1 jobs...
Hello, world!
Complete log(s): /usr/local/dev/demo-M5-project/.snakemake/log/2025-12-12T152205.384447.snakemake.log
```

It's hard to see, but in addition to the extra lines of text printed by Snakemake, it did run our task, printing `Hello, world!` to the screen.

### About executing snakefiles

Besides the name of the rule we wanted to run (`greeting`), we passed two arguments to the `snakemake` command: one of them is required and the other is optional.

- **The `--cores` argument is required anytime we call the `snakemake` program.** It tells Snakemake how many CPUs to use when executing the task. We wrote `--cores=1` to tell Snakemake to use only one CPU. Any number higher than one allows Snakemake to run tasks in parallel. Because `--cores=1` can be a lot to type, you can type `-j1` instead to request one core, `-j2` for two cores, and so on.
- The `--quiet` argument is optional and suppresses some of the informational messages that Snakemake generates when it executes a task. Try running the command without `--quiet` to see the difference.
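
Before choosing a value for `--cores`, it can be useful to know how many CPUs your computer has. A minimal way to find out, using only the Python standard library (the function name `available_cores` is just for this example):

```python
import os


def available_cores():
    """Return the number of CPUs reported by the operating system (at least 1)."""
    # os.cpu_count() can return None when the count cannot be determined
    return os.cpu_count() or 1


print(f"This machine reports {available_cores()} CPU core(s)")
```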

**Also, note that the `snakemake` command looks, by default, for the `Snakefile` in the current working directory.** There can be only one `Snakefile` in a single directory, but that file can contain multiple rules.

### About parallel tasks

Later, we'll see how Snakemake allows us to easily run tasks in parallel. For now, try an experiment. What happens if you change the number of cores, in the `--cores` argument, to a number higher than one?

In this example, nothing changes because the `echo` program is not designed to run in parallel. There are two or more CPUs available (depending on how many you requested) but the `echo` program only uses one.
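
The same idea can be sketched in plain Python. In this analogy, threads stand in for cores and `time.sleep` stands in for real work. Neither is how Snakemake schedules jobs, and `task` and `run_all` are hypothetical names. A single task takes the same time no matter how many workers are available, while several independent tasks can finish sooner when they run side by side.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def task(name, seconds=0.2):
    """Pretend to do some work, then report back."""
    time.sleep(seconds)
    return f"{name} done"


def run_all(names, workers):
    """Run one task per name using up to `workers` workers; return results and elapsed time."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(task, names))
    return results, time.perf_counter() - start


for workers in (1, 4):
    results, elapsed = run_all(["a", "b", "c", "d"], workers)
    print(f"{workers} worker(s): {len(results)} tasks in {elapsed:.2f} s")
```

With one worker the four tasks run one after another. With four workers they overlap, so the total time should be noticeably shorter.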

---

## More resources

- [Read about how workflow tools like Snakemake are revolutionizing computational science (*Nature*).](https://doi.org/10.1038/d41586-019-02619-z)