---
title: "Python Environments and Package Management"
format:
  html:
    toc: true
    toc-title: Contents
    toc-depth: 4
    code-fold: show
    self-contained: false
    html-math-method: mathml
jupyter: python3
---

In [2]:
#| echo: false

# import image module
from IPython.display import Image

# get the image, 
Image(url="./images/package_manger.png",width=400)

## Learning Objectives

By the end of this section, you will be able to:

- Explain why virtual environments are important.  
- Install essential data science packages in your Python environment.  
- Create a `requirements.txt` file and use it to quickly recreate an environment.  
- Recognize why dependency conflicts happen.  
- Describe how tools like `conda` or `poetry` can help resolve dependency conflicts.  



## Data Science Packages in Python

Python has a rich ecosystem of packages that make data analysis easier and more powerful.  
In this course, you will install and use several core packages:

- **NumPy** → numerical computing and array operations  
- **pandas** → data manipulation and analysis  
- **Matplotlib** → data visualization  
- **Seaborn** → statistical data visualization (built on top of Matplotlib)  

These packages give you the essential tools to load, clean, analyze, and visualize data.  
👉 Before using them, you need to install them in your Python environment.



In the previous VS Code setup lesson, you created a Python virtual environment and selected it as the kernel for your notebook.  
This created a folder called `.venv`, which contains:

- A new Python interpreter  
- A `Lib` folder (`lib` on macOS/Linux) where your installed packages live  

Any packages you install (e.g., with `pip install ...`) will only affect this environment.  

Using virtual environments is considered best practice for Python development, especially in data science projects.  
Before we dive in, let’s first understand what a Python environment is.  


## Python Virtual Environments
A Python virtual environment is an **isolated workspace** that has its own Python interpreter and its own set of installed packages.  

This isolation helps you manage dependencies for different projects and avoid conflicts between package versions.  

**Why use virtual environments?**
- Keep project dependencies separate  
- Avoid version conflicts between packages  
- Make your code more reproducible and shareable  
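
To make the isolation concrete, here is a minimal sketch that reports whether the current interpreter belongs to a virtual environment and where its packages are installed. The helper name `in_virtual_environment` is chosen for illustration; the check relies on `sys.prefix` and `sys.base_prefix` differing inside an environment created with `venv`.

```python
import sys
import sysconfig


def in_virtual_environment() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # In a venv, sys.prefix points at the environment folder, while
    # sys.base_prefix still points at the Python installation it was created from.
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Inside a virtual environment:", in_virtual_environment())
# Folder where `pip install` places pure-Python packages for this interpreter.
print("Packages are installed in:", sysconfig.get_paths()["purelib"])
```

If you run this in a notebook that uses the `.venv` kernel, the interpreter path should point inside the `.venv` folder.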



## Install Data Science Packages Within Your Environment

Packages are collections of pre-written code that provide extra functionality, such as scientific computing, linear algebra, visualization, or machine learning.  
They are **not included in the Python standard library**, so you need to install them separately.

To test this, add the following code to your `test.ipynb` notebook:

```python
import numpy as np
import pandas as pd

print("Setup complete!")
```

👉 If you run this before installing the packages, you will likely see a `ModuleNotFoundError`.  
This error means that the package is not yet available in your environment.  
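
If you would rather check for the packages without triggering the error, one possible sketch uses `importlib.util.find_spec` from the standard library. The helper name `missing_packages` is an assumption made for this example.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the names from `names` that cannot be imported in this environment."""
    # find_spec returns None when a top-level module is not installed,
    # without actually importing it.
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(["numpy", "pandas"])
if missing:
    print("Not installed yet:", ", ".join(missing))
else:
    print("Setup complete!")
```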



### How to Install Packages  

There are two main ways to install data science packages:  

#### Installing from the Terminal
1. **Open a new terminal:** Check the terminal prompt to confirm that the active environment matches the kernel you chose for your notebook.
    - If you see `(.venv)` at the beginning of the prompt, the virtual environment `.venv` is active and matches the notebook kernel.
    - If you see something else, for example `(base)`, the base conda environment (installed by Anaconda) is currently active.
    - You can also check which interpreter comes first on your path: on macOS/Linux, run `which python`; on Windows, run `where.exe python`.

> Note: When both Anaconda and VS Code are installed on your system, their environments can conflict. If the terminal environment differs from the notebook kernel, packages may be installed into a different environment than intended, and the notebook will not be able to import them.

2. **Install with `pip`:**

`pip install numpy pandas`

#### Installing from the Notebook
You can also install packages directly from a Jupyter Notebook cell using the `%pip` magic command. This is convenient because you do not have to leave the notebook, and `%pip` installs into the environment of the kernel that is currently running.

- Add a new code cell and run:

```python
%pip install numpy pandas
```
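
A rough equivalent outside the notebook is to call pip through the interpreter that is running your code, so the packages land in that interpreter's environment. The sketch below only builds and prints the command; the helper name `pip_install_command` is hypothetical, and the line that would actually run the installation is left commented out.

```python
import subprocess
import sys


def pip_install_command(*packages):
    """Build a pip command that targets the interpreter running this code."""
    # `python -m pip` run through sys.executable installs into the same
    # environment as the current kernel or script.
    return [sys.executable, "-m", "pip", "install", *packages]


command = pip_install_command("numpy", "pandas")
print(" ".join(command))
# Uncomment to actually run the installation:
# subprocess.run(command, check=True)
```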

### Backing Up and Sharing Your Environment

A key part of reproducible data science is making sure you (and your collaborators) can recreate the same environment. This is especially important when sharing code or working on different machines.

**How to back up your environment:**

Step 1: **Create a `requirements.txt` file**


Run the following command to list all installed packages and their versions (it is shown here in a notebook cell, and it also works in the terminal):

In [None]:
pip freeze

asttokens==2.4.1
colorama==0.4.6
comm==0.2.2
contourpy==1.3.0
cycler==0.12.1
debugpy==1.8.6
decorator==5.1.1
executing==2.1.0
fonttools==4.54.1
ipykernel==6.29.5
ipython==8.27.0
jedi==0.19.1
jupyter_client==8.6.3
jupyter_core==5.7.2
kiwisolver==1.4.7
matplotlib==3.9.2
matplotlib-inline==0.1.7
nest-asyncio==1.6.0
numpy==2.1.1
packaging==24.1
pandas==2.2.3
parso==0.8.4
pillow==10.4.0
platformdirs==4.3.6
prompt_toolkit==3.0.48
psutil==6.0.0
pure_eval==0.2.3
Pygments==2.18.0
pyparsing==3.1.4
python-dateutil==2.9.0.post0
pytz==2024.2
pywin32==306
pyzmq==26.2.0
six==1.16.0
stack-data==0.6.3
tornado==6.4.1
traitlets==5.14.3
tzdata==2024.2
wcwidth==0.2.13
Note: you may need to restart the kernel to use updated packages.
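
A similar listing can be produced from Python with `importlib.metadata`. This is only an approximation of `pip freeze`, which also handles cases such as editable installs; the helper name `frozen_requirements` is an assumption for this sketch.

```python
from importlib.metadata import distributions


def frozen_requirements():
    """Return sorted 'name==version' strings for the installed distributions."""
    pins = set()
    for dist in distributions():
        metadata = dist.metadata
        name = metadata.get("Name") if metadata is not None else None
        if name:  # skip entries whose metadata is missing or incomplete
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


for line in frozen_requirements():
    print(line)
```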


Using the redirection operator `>`, you can save the output of `pip freeze` to a file named `requirements.txt`. This file can be used to install the same package versions in a different environment.

In [None]:
pip freeze > requirements.txt

Note: you may need to restart the kernel to use updated packages.


Let's check whether `requirements.txt` is in the current working directory:

In [None]:
%ls

 Volume in drive C is Windows
 Volume Serial Number is A80C-7DEC

 Directory of c:\Users\lsi8012\OneDrive - Northwestern University\FA24\303-1\test_env

09/27/2024  02:25 PM    <DIR>          .
09/27/2024  02:25 PM    <DIR>          ..
09/27/2024  07:44 AM    <DIR>          .venv
09/27/2024  01:42 PM    <DIR>          images
09/27/2024  02:25 PM               695 requirements.txt
09/27/2024  02:25 PM            21,352 venv_setup.ipynb
               2 File(s)         22,047 bytes
               4 Dir(s)  166,334,562,304 bytes free


Step 2: **Share the file**
   - Send the `requirements.txt` file to your collaborator, or save it for future use.

Step 3: **Recreate the environment elsewhere**
   - On a new machine or environment, run:
     ```bash
     pip install -r requirements.txt
     ```
   

In [None]:
pip install -r requirements.txt

Note: you may need to restart the kernel to use updated packages.


This installs all the packages listed in the file, matching the versions you used.
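
To confirm that an environment really matches the file, you could compare each pinned version with what is installed. The following is a minimal sketch that assumes the file contains only `name==version` lines, as `pip freeze` produced above; the helper names `parse_pins` and `check_pins` are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_pins(text):
    """Parse 'name==version' lines into a dict, ignoring blanks and comments."""
    pins = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, _, pinned = line.partition("==")
        pins[name.strip()] = pinned.strip()
    return pins


def check_pins(pins):
    """Return {name: (pinned, installed)} for packages that do not match."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches


# Two lines taken from the `pip freeze` output shown earlier.
sample = "numpy==2.1.1\npandas==2.2.3\n"
for name, (pinned, installed) in check_pins(parse_pins(sample)).items():
    print(f"{name}: file pins {pinned}, environment has {installed}")
```

To check a real file, pass its contents instead of `sample`, for example `pathlib.Path("requirements.txt").read_text()`. Nothing is printed when every pin matches.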

**Why is this important?**
- Ensures everyone is using the same package versions, reducing errors and inconsistencies.
- Makes it easy to set up your project on a new computer or server.
- Helps with troubleshooting and collaboration.

**Tip:** Always update your `requirements.txt` after installing or upgrading packages, so it stays current with your environment.

## Dependency Conflicts

A **dependency conflict** occurs when two or more packages in your Python environment require different or incompatible versions of the same library.  
This can cause errors, unexpected behavior, or even break your code.



### Why do dependency conflicts happen?
- Many Python packages rely on other packages (called *dependencies*) to work.  
- If you install packages that depend on different versions of the same dependency, they may not work together.  
- This is especially common in data science, where libraries evolve quickly and have complex interdependencies.  



### Example
Suppose you install two packages:
- `PackageA` requires `numpy==1.24.0`  
- `PackageB` requires `numpy==2.2.0`  

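If you request both in a single `pip install` command, pip's dependency resolver (pip 20.3 and later) stops with a `ResolutionImpossible` error because no version of `numpy` satisfies both requirements. If you install them one after the other, the second installation replaces the `numpy` version chosen by the first, and pip warns that the first package's requirement is no longer met. That package may then fail because it cannot use the version of `numpy` that was actually installed.  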
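
The idea can be illustrated with a toy check that compares exact version pins. This is a deliberately simplified sketch: real resolvers work with version ranges rather than only exact pins, and `find_conflicts` and the input dictionary are hypothetical names built from the example above.

```python
def find_conflicts(requirements):
    """Return {dependency: {package: version}} where exact pins disagree.

    `requirements` maps each package name to the exact versions it pins.
    """
    pins_by_dependency = {}
    for package, pins in requirements.items():
        for dependency, pinned in pins.items():
            pins_by_dependency.setdefault(dependency, {})[package] = pinned
    # A dependency conflicts when its requesters ask for more than one version.
    return {
        dependency: requested
        for dependency, requested in pins_by_dependency.items()
        if len(set(requested.values())) > 1
    }


# Hypothetical packages from the example above.
requirements = {
    "PackageA": {"numpy": "1.24.0"},
    "PackageB": {"numpy": "2.2.0"},
}
print(find_conflicts(requirements))
# → {'numpy': {'PackageA': '1.24.0', 'PackageB': '2.2.0'}}
```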



### How to avoid dependency conflicts

- ✅ Use **virtual environments** to isolate dependencies for each project.  
- ✅ Check package requirements before installing new libraries.  
- ✅ Use tools like **Poetry** or **Conda** to detect and resolve conflicts automatically.  


## How to Resolve Conflicts and Manage Environments 

### Use `conda` 

`conda` is a powerful package and environment management tool, especially popular in data science. It helps you avoid and resolve dependency conflicts by managing packages and their versions more effectively than `pip` alone.

**Key features of conda:**
- Create isolated environments for different projects
- Install packages and their dependencies from the Anaconda repository
- Easily switch between environments
- Export and share environment configurations

**How conda helps with dependency conflicts:**
- When you install a package with `conda`, it automatically checks for compatible versions of all dependencies and installs them together.
- If no compatible combination of versions exists, conda reports the conflicting requirements and leaves the environment unchanged instead of installing an incompatible set.

**Basic conda commands:**
- Create a new environment:
  ```bash
  conda create --name myenv numpy pandas matplotlib
  ```
- Activate an environment:
  ```bash
  conda activate myenv
  ```
- Install a package in an environment:
  ```bash
  conda install seaborn
  ```
- List all environments:
  ```bash
  conda env list
  ```
- Export environment configuration:
  ```bash
  conda env export > environment.yml
  ```
- Recreate an environment from a file:
  ```bash
  conda env create -f environment.yml
  ```

**Example:**
Suppose you need to work on two projects that require different versions of `scikit-learn`. You can create two separate environments:
```bash
conda create --name projectA scikit-learn=0.24
conda create --name projectB scikit-learn=1.2
```
Each environment will have its own compatible dependencies, so you avoid conflicts.
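
From inside Python, you can check which conda environment (if any) your code is running in. The sketch below assumes two conventions: `conda activate` sets the `CONDA_DEFAULT_ENV` environment variable, and every conda environment has a `conda-meta` folder at its root. The function names are chosen for illustration.

```python
import os
import sys
from pathlib import Path


def conda_environment_name():
    """Return the active conda environment name, or None if none is active."""
    # `conda activate` sets CONDA_DEFAULT_ENV in the shell it runs in.
    return os.environ.get("CONDA_DEFAULT_ENV")


def interpreter_is_conda_managed():
    """Return True when the running interpreter lives in a conda environment."""
    # Conda keeps its package records in a `conda-meta` folder at the
    # root of every environment it creates.
    return (Path(sys.prefix) / "conda-meta").is_dir()


print("Active conda environment:", conda_environment_name())
print("Interpreter managed by conda:", interpreter_is_conda_managed())
```

After `conda activate projectA`, the first line should report `projectA`; in a plain `.venv` environment the second line should report `False`.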

**Summary:**
Using `conda` is highly recommended for managing complex dependencies and environments in data science workflows.

### Use `poetry` to Manage the Environment

`poetry` is a modern Python tool for dependency management and packaging. It helps you create, manage, and share Python projects with reproducible environments.

**Key features of poetry:**
- Handles dependencies and virtual environments automatically
- Uses a `pyproject.toml` file to specify project requirements
- Locks dependencies for reproducibility (`poetry.lock`)
- Simplifies publishing packages to PyPI

**How poetry helps with dependency management:**
- Automatically resolves and installs compatible versions of all dependencies
- Records the exact resolved versions in a lock file, so every installation uses the same compatible set
- Makes it easy to update, add, or remove packages

**Basic poetry commands:**
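- Install poetry (if not already installed). The Poetry documentation recommends `pipx` or the official installer, so that Poetry stays separate from your project environments:
  ```bash
  pipx install poetry
  ```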
- Create a new project:
  ```bash
  poetry new myproject
  ```
- Add a package:
  ```bash
  poetry add numpy pandas matplotlib
  ```
- Install dependencies:
  ```bash
  poetry install
  ```
- Run commands inside the environment:
  ```bash
  poetry run python script.py
  ```
- Update dependencies:
  ```bash
  poetry update
  ```

**Example workflow:**
1. Create a new project:
   ```bash
   poetry new ds_project
   cd ds_project
   ```
2. Add dependencies:
   ```bash
   poetry add numpy pandas matplotlib seaborn
   ```
3. Install all dependencies:
   ```bash
   poetry install
   ```
4. Run your code inside the poetry-managed environment:
   ```bash
   poetry run python your_script.py
   ```

**Summary:**
`poetry` is highly recommended for modern Python projects, especially when you want reliable dependency management, easy environment setup, and reproducible results.

### Difference Between `conda` and `poetry`

Both `conda` and `poetry` are tools for managing Python environments and dependencies, but they have different strengths and use cases.

**conda:**
- Manages both Python environments and packages, including non-Python dependencies (e.g., C libraries, compilers)
- Works with multiple languages (Python, R, etc.)
- Uses the Anaconda repository, which includes many scientific and data science packages
- Great for data science workflows, especially when you need packages with compiled code or system-level dependencies
- Handles complex dependency resolution and environment isolation

**poetry:**
- Focuses on Python projects and packages
- Uses `pyproject.toml` and `poetry.lock` for reproducible builds
- Automatically creates and manages virtual environments
- Excellent for modern Python application development and publishing to PyPI
- Handles dependency resolution and version locking for Python packages only

**Key differences:**
- `conda` can install system-level and non-Python dependencies; `poetry` only manages Python packages
- `conda` environments can include packages from the Anaconda repository; `poetry` uses PyPI
- `poetry` is ideal for pure Python projects and reproducible builds; `conda` is better for scientific computing and mixed-language projects

**When to use each:**
- Use `conda` if you need scientific libraries, compiled code, or non-Python dependencies
- Use `poetry` for modern Python projects, apps, and libraries where you want easy dependency management and publishing

**Summary Table:**

| Feature                 | conda           | poetry      |
|-------------------------|-----------------|-------------|
| Language support        | Python, R, more | Python only |
| Non-Python dependencies | Yes             | No          |
| Environment isolation   | Yes             | Yes         |
| Dependency resolution   | Excellent       | Excellent   |
| Reproducibility         | Good            | Excellent   |
| Publishing to PyPI      | No              | Yes         |
Choose the tool that best fits your project needs!

### Modern Python Package Managers: uv, pip, conda, poetry

Python's evolution has been closely tied to improvements in package management. Over time, tools like `pip`, `conda`, and `poetry` have made installing and managing packages much easier.

**uv** is a new, high-performance Python package manager written in Rust and developed by Astral, the creators of `ruff`. It aims to be much faster than traditional tools and supports modern workflows for Python projects.

- Learn more about uv: [uv GitHub page](https://github.com/astral-sh/uv)

**For this course:**
- You will continue using `pip` and `.venv` for installing packages and managing your environment.
- As you work on larger or more complex Python projects, or manage multiple environments, consider exploring tools like `conda`, `poetry`, and `uv` for better performance, reproducibility, and ease of use.

**Summary:**
- `pip` is the standard tool for installing Python packages.
- The built-in `venv` module creates isolated environments; in this course, the environment lives in the `.venv` folder.
- `conda` and `poetry` offer advanced dependency management and environment handling.
- `uv` is an emerging tool for fast, modern Python package management.

Stay curious and keep learning about new tools—they can make your workflow smoother and more efficient!