# 2. Setting up a reproducible environment

By the end of this lecture, you should be able to:

- Understand the benefits of using an environment management tool such as `conda`
- Set up a simple reproducible environment using `conda` and `environment.yml` files
- Set up a sample project structure using the [`cookiecutter` data science template](https://cookiecutter-data-science.drivendata.org)

## 1. Importance of Managing Environments

Managing environments is crucial in data science for several reasons:

- **Dependency Management**: Different projects may require different versions of libraries. Without managing environments, you might face conflicts between these dependencies.
- **Reproducibility**: Ensuring that your code runs the same way on different machines or at different times is essential for reproducibility. An unmanaged environment can lead to inconsistencies.
- **Isolation**: Isolating project environments prevents changes in one project from affecting another.

### Examples of Issues Without Environment Management
- **Version Conflicts**: Installing a new version of a library for one project might break another project that relies on an older version.
- **Inconsistent Results**: Running the same code on different machines might yield different results due to different library versions.
- **Difficult Collaboration**: Sharing code without a managed environment can lead to "it works on my machine" problems.
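
A quick way to diagnose an "it works on my machine" problem is to record which versions are actually installed on each machine and compare them. The following is a minimal sketch using only the Python standard library. The helper name `report_versions` and the package list are hypothetical and chosen for illustration.

```python
import sys
from importlib import metadata


def report_versions(packages):
    """Return {package name: installed version, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # The interpreter version is part of the environment too, so print it first.
    print("python", sys.version.split()[0])
    for name, version in report_versions(["numpy", "pandas"]).items():
        print(name, version or "not installed")
```

Running this on two machines and comparing the output usually shows whether a difference in results comes from a difference in library versions.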

## 2. Introduction to Conda

Conda is an open-source package management and environment management system. It is useful because:
- It allows you to create isolated environments with specific versions of libraries and dependencies.
- It supports multiple programming languages, including Python and R.
- It simplifies the process of installing and managing packages.



## 3. Creating a Conda Environment from Scratch

To create a new conda environment and install packages, follow these steps:

```bash
# Create a new environment named 'myenv' with Python 3.8
conda create --name myenv python=3.8

# Activate the environment
conda activate myenv

# Install packages, e.g., numpy and pandas
conda install numpy pandas
```
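
After activating an environment, it can be worth confirming that Python is really running inside it. The sketch below assumes that `conda activate` sets the `CONDA_DEFAULT_ENV` environment variable to the environment's name (or to its path for environments created with `--prefix`). The helper name `active_conda_env` is hypothetical.

```python
import os
import sys


def active_conda_env(environ=None):
    """Return the name of the active conda environment, or None if unset."""
    environ = os.environ if environ is None else environ
    # Assumption: `conda activate` exports CONDA_DEFAULT_ENV.
    return environ.get("CONDA_DEFAULT_ENV") or None


if __name__ == "__main__":
    print("active conda env:", active_conda_env())
    # sys.prefix is the root directory of the running interpreter's environment.
    print("interpreter prefix:", sys.prefix)
    print("python version:", sys.version.split()[0])
```

If the printed prefix does not point inside the `myenv` directory, the script is probably being run with a different interpreter than the one you activated.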



## 4. Creating a Conda Environment from a YAML File

You can create a conda environment from a YAML file, which specifies the environment configuration:



```yaml
# environment.yml
name: myenv
dependencies:
  - python=3.8
  - numpy
  - pandas
```



To create the environment from the YAML file:



```bash
conda env create -f environment.yml
```



## 5. Managing Conda Environments

### Updating a Conda Environment

You may need to update your environment for a variety of reasons. For example, it may be the case that:

- one of your core dependencies just released a new version (dependency version number update).
- you need an additional package for data analysis (add a new dependency).
- you have found a better package and no longer need the older package (add new dependency and remove old dependency).

If any of these occur, all you need to do is update the contents of your `environment.yml` file accordingly and then run the following command:



```bash
conda env update --file environment.yml --prune
```
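
Once the environment has been updated, you may want to check that the installed versions still agree with the pins in `environment.yml`. The following is a rough sketch, not a replacement for conda's own solver. It assumes the simple flat file layout used in this lecture (no nested `pip:` section, exact pins written with `=`) and that each conda package name matches its Python distribution name, which is not always true. The helper names `parse_pins` and `find_mismatches` are hypothetical.

```python
from importlib import metadata


def parse_pins(yaml_text):
    """Collect `name=version` entries from a flat `dependencies:` list."""
    pins = {}
    in_deps = False
    for raw in yaml_text.splitlines():
        line = raw.split("#", 1)[0].rstrip()  # drop comments
        if not line.strip():
            continue
        if not line.startswith((" ", "-")):
            # A top-level key starts or ends the dependencies section.
            in_deps = line.strip() == "dependencies:"
            continue
        if in_deps and line.lstrip().startswith("- "):
            spec = line.lstrip()[2:].strip()
            name, _, version = spec.partition("=")
            pins[name] = version.lstrip("=") or None
    return pins


def find_mismatches(pins, installed_version=metadata.version):
    """Return {name: (pinned, installed)} for pins that are not satisfied."""
    mismatches = {}
    for name, pinned in pins.items():
        if pinned is None or name == "python":
            continue  # unpinned, or not a Python distribution
        try:
            installed = installed_version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # conda's single `=` is a prefix match: `1.24` accepts `1.24.3`.
        ok = installed is not None and (
            installed == pinned or installed.startswith(pinned + ".")
        )
        if not ok:
            mismatches[name] = (pinned, installed)
    return mismatches


if __name__ == "__main__":
    with open("environment.yml") as fh:
        print(find_mismatches(parse_pins(fh.read())) or "all pins satisfied")
```

An empty result only means the pinned packages match. Unpinned entries such as `numpy` in the example file are skipped, which is one reason to pin the versions you depend on.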



### Deleting an Installed Package

To remove a specific package from the environment:



```bash
# Remove the 'numpy' package from the environment
conda remove --name myenv numpy
```



### Exporting the Environment to a YAML File

To export the current environment configuration to a YAML file:



```bash
# Export your active environment to 'environment.yml'
conda env export > environment.yml
```

A full export lists every package in the environment, including platform-specific builds, so it may not work on another operating system. To make your environment file work across platforms, use the `--from-history` flag of `conda env export`. This includes only the packages you explicitly asked for, rather than every package in your environment.

```bash
conda env export --from-history -f environment.yml
```
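
To see why a full export is tied to one platform, note that each of its dependency lines has the form `name=version=build`, and the build string differs between operating systems. The sketch below drops that build string from a single spec. The helper name `strip_build` and the example build string are hypothetical. In practice, the `--no-builds` flag of `conda env export` serves the same purpose.

```python
def strip_build(spec):
    """Drop the build string from a `name=version=build` dependency spec."""
    parts = spec.split("=")
    # Keep at most the name and the version; anything after is the build.
    return "=".join(parts[:2])


if __name__ == "__main__":
    # Hypothetical line from a full `conda env export`.
    print(strip_build("numpy=1.24.3=py38h14f4228_0"))
```

Even without build strings, a full export can still name packages that exist on only one platform, which is why `--from-history` is often the more portable choice.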



### Deleting a Conda Environment

To delete an entire conda environment:



```bash
# Remove the 'myenv' environment
conda env remove --name myenv
```



By managing your environments with conda, you can ensure that your data science projects are reproducible, isolated, and free from dependency conflicts.

## Slide

No slides this week. We will use a demo instead.

I recommend checking out the readings below for more details and explanations.

## Supplemental materials


### Readings

- [Filenames and data science project organization, Integrated development environments](https://ubc-dsci.github.io/reproducible-and-trustworthy-workflows-for-data-science/materials/lectures/03-filenames-project-organization.html)
- [Virtual environments](https://ubc-dsci.github.io/reproducible-and-trustworthy-workflows-for-data-science/materials/lectures/04-virtual-environments.html)
- [Cookiecutter data science project template](https://cookiecutter-data-science.drivendata.org)