# M3.1 - Creating a Research Software Environment

*Part of:* [**Open Science for Water Resources**](https://github.com/OpenClimateScience/M3-Open-Science-for-Water-Resources)

## Software versions and reproducibility

When research software is used in an analysis, the results of that analysis depend on the software used, including the specific version of that software. To create *reproducible* research using software, we need to clearly identify and communicate all of the software dependencies and their versions. This becomes complicated when we consider how our own software's behavior depends on (open-source) software that other people wrote. A Python program's results may also depend on the version of Python used! **How can we keep track of it all?**

[In the previous lesson (link)](https://github.com/OpenClimateScience/M2-Computational-Climate-Science), we learned how to use the software `pip` to install and manage Python packages. Previously, when we used `pip`, we were installing Python packages to our user directory. The new packages were available to use every time we started Python. Our version of Python and all of the packages we have installed are collectively referred to as our Python *environment.* Every new Python project we start is able to access those same packages.
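To make the idea of an *environment* concrete, here is a minimal sketch that reports the Python version and the installed packages visible to the current interpreter. It uses only the standard library (`sys` and `importlib.metadata`); the function name `describe_environment` is our own choice for illustration, not part of any library.

```python
import sys
from importlib import metadata


def describe_environment():
    """Return the Python version and a {package name: version} mapping."""
    python_version = ".".join(str(part) for part in sys.version_info[:3])
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip any distribution whose metadata lacks a name
            packages[name] = dist.version
    return python_version, packages


python_version, packages = describe_environment()
print(f"Python {python_version}")
for name in sorted(packages, key=str.lower):
    print(f"{name}=={packages[name]}")
```

Running this in two different environments would generally print two different lists, which is exactly why the environment matters for reproducibility.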

**This sounds convenient, but there are some downsides.** To understand what can go wrong when several different Python projects use the same *environment,* let's imagine we are about to start a new Python project using code that someone else wrote, a module named `example.py`.

`example.py` uses NumPy **version 1.26.0** to provide some convenient tools for an analysis you want to perform but, on your system, you have NumPy **version 2.0.0** installed.

- `example.py` can only use **version 1.26.0** or another release earlier than **version 2.0.0,** because it uses `np.infty`, an alias for NumPy's representation of infinity that was removed in **version 2.0.0** in favor of `np.inf`. The developer of `example.py` would have to go through all of their code, replacing `np.infty` with `np.inf`, and they don't have the time or interest to do this right now.
- On our system, we started a new project where we have to use NumPy **version 2.0.0** because that version added support for a new feature we need. To run `example.py`, we'd have to *downgrade* to **version 1.26.0** and then our new project wouldn't work!

**This hypothetical illustrates the problems we can run into when we try to run different Python projects in the same environment.**

### Software versioning

Before we discuss the solution to this problem, let's talk about what software version numbers mean.

There are two commonly used ways to indicate a software's version.

- **Calendar versioning,** where the version number reflects the date of the software's release, usually in `YYYY.MM` format.
- **Semantic versioning,** where the version number represents *how different* the new software version is from a previous version.

[**Semantic versioning** (link)](https://semver.org/#semantic-versioning-200) is the most common. In semantic versioning, a software's version number has three parts, separated by dots:

```
MAJOR.MINOR.PATCH
```

For example, NumPy's version numbers, such as **version 2.0.0** or **version 1.26.0,** follow this three-part format.

- The first number, the `MAJOR` ("major") version number, is used to indicate changes to a software that will almost certainly break any software that depended on a previous major version. Recall our earlier example, where replacing `np.infty` with `np.inf` in NumPy version 2.0.0 would cause `example.py` to break. That's why NumPy's developers increased the major version number from 1 to 2.
- The second number, the `MINOR` ("minor") version number, is used when new features are added in a *backwards-compatible* manner; i.e., in a way that won't break software that depends on it. In NumPy version 1.26.0, 26 is the minor version number.
- The third number, the `PATCH` ("patch") version number, is used when we need to release a new version of the software to fix (or to "patch") a bug. The new version shouldn't break anything; in fact, it is released specifically to fix a problem with the previous version.
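The three parts of a semantic version can be compared programmatically. The sketch below is a simplified illustration that assumes plain `MAJOR.MINOR.PATCH` strings with no pre-release suffixes (such as `2.0.0rc1`); the function names `parse_version` and `classify_change` are hypothetical helpers written for this example.

```python
def parse_version(version):
    """Split a 'MAJOR.MINOR.PATCH' string into a tuple of three integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def classify_change(old, new):
    """Report which part of the version number differs first."""
    old_parts, new_parts = parse_version(old), parse_version(new)
    if new_parts[0] != old_parts[0]:
        return "major"  # may break software that depends on the old version
    if new_parts[1] != old_parts[1]:
        return "minor"  # new, backwards-compatible features
    if new_parts[2] != old_parts[2]:
        return "patch"  # bug fixes only
    return "none"


print(classify_change("1.26.0", "2.0.0"))   # major
print(classify_change("1.25.0", "1.26.0"))  # minor
print(classify_change("1.26.0", "1.26.1"))  # patch
```

Because the versions are parsed into integers, `1.26.0` is correctly treated as newer than `1.9.0`, which a plain string comparison would get wrong.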

### Python virtual environments

**The solution to the problem with our NumPy versions is to create separate Python environments for different projects.** 

`virtualenv` is a tool that we'll use for creating Python virtual environments. Using `virtualenv`, each of our projects can have a different Python installation, where different packages can be installed with different versions. We can even use different versions of Python itself in each environment.

![](./assets/M3_venv.png)

### Creating a Python virtual environment

To get started, let's use the command line, as we did in the previous lesson. The screenshot below is a helpful reminder of how to launch the command line.

![](./assets/M2_Jupyter_terminal.png)

#### &#x1F6A9; <span style="color:red">Pay Attention</span>

Python virtual environments, using `virtualenv`, are stored on our file system in their own directories. You should generally have a single place on your file system where all the virtual environments are stored, such as in a sub-directory of your home folder. However, in Jupyter Notebook, we only have access to a limited part of the file system. So, for now, we'll just create our virtual environment in our project directory, inside the `venv` folder:

```sh
mkdir venv
```

**To create a new Python virtual environment, we just need to provide the file path to the folder where it should be created.** We should choose a short but informative name; today, we'll call our project `h2o` (for "water"):

```sh
virtualenv venv/h2o
```
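For comparison, a virtual environment can also be created from within Python. Note that the sketch below uses the standard library's `venv` module, which is a different tool from the `virtualenv` command used in this lesson, though both take a target folder in a similar way. It builds the environment in a temporary folder so that nothing is left behind; the helper name `create_environment` is our own.

```python
import tempfile
import venv
from pathlib import Path


def create_environment(env_dir):
    """Create a virtual environment in env_dir using the standard library."""
    env_dir = Path(env_dir)
    # with_pip=False keeps this quick; pip is not installed into the environment
    venv.create(env_dir, with_pip=False)
    return env_dir


with tempfile.TemporaryDirectory() as scratch:
    env_path = create_environment(Path(scratch) / "venv" / "h2o")
    # List what was created inside the environment's folder
    created = sorted(p.name for p in env_path.iterdir())
    print(created)
```

The listing should include a `pyvenv.cfg` file, which records the Python installation that the environment was created from.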

### Using `pip` in a Python virtual environment

### Using Jupyter Notebook in a Python virtual environment

Jupyter Notebook makes it possible to work with literate programming documents in a variety of programming languages, not just Python. When you open a Notebook, you can specify the **kernel** that should be used for running any code in that Notebook. The **kernel** is simply the computer program that executes code; it could be the Python interpreter or it could be an interpreter for another language, like R.

**You can see the current kernel your Notebook is configured to use at the top-right of any Notebook:**

![](./assets/M3_Jupyter_kernel.png)


You may see a slightly different *kernel name;* here, "Python 3 (ipykernel)" is the name of our kernel. If you click on this name, you'll see a menu that allows you to change the kernel being used.

**When we work in a virtual environment, Jupyter Notebook doesn't always know that it should be using the Python installation associated with that virtual environment.** This means that packages we install in the virtual environment might not be available in Jupyter Notebook.
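One way to check which Python installation a Notebook is actually using is to run a short cell like the sketch below. It relies only on the standard library's `sys` module; the helper name `in_virtual_environment` is our own. The comparison of `sys.prefix` with `sys.base_prefix` assumes a reasonably recent version of `virtualenv` (or the standard `venv` module); very old `virtualenv` releases signaled this differently.

```python
import sys


def in_virtual_environment():
    """Return True if this interpreter appears to run inside a virtual environment."""
    # Inside a virtual environment, sys.prefix points at the environment's folder,
    # while sys.base_prefix points at the Python installation it was created from.
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Environment prefix:", sys.prefix)
print("Inside a virtual environment:", in_virtual_environment())
```

If the interpreter path does not point inside your virtual environment's folder, the Notebook is using a different kernel than the one you intended.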

**How can we tell Jupyter Notebook to use the Python kernel associated with our virtual environment?** First, make sure your virtual environment is still activated in the Terminal. We'll install a Python package called `ipykernel`, which will allow us to register a Python kernel with Jupyter Notebook:

```sh
pip install ipykernel
```

Then, with Jupyter Notebook and `ipykernel` both installed in our virtual environment, we tell `ipykernel` to make the current Python kernel (the one for our virtual environment) available as a kernel in Jupyter Notebook. In this example, we give it the name `OpenScience` so that we know it is associated with our open science project:

```sh
python -m ipykernel install --user --name=OpenScience
```

**Now, restart Jupyter Notebook. Click on the kernel name at the top-right of the Notebook you want to work in. You should see a selection menu, similar to the one below:**

![](./assets/M3_Jupyter_kernel_selection.png)

You should see the "OpenScience" kernel in the list of available kernels. Selecting that kernel will enable you to work with the Python packages that you installed in your virtual environment.

#### &#x1F3AF; Best Practice

**It's good practice to make sure the names of your virtual environment and your Jupyter kernel match.** That way, when you have multiple projects, it's clear which kernel is associated with the virtual environment you're working in.

---

### More resources

- [Using Virtual Environments in Jupyter Notebook and Python](https://janakiev.com/blog/jupyter-virtual-envs/#add-virtual-environment-to-jupyter-notebook) - A blog post