# Comprehensive Tutorial on Using Pathlib In Python For File System Manipulation

## Introduction

One of the most frustrating aspects of Python up until version 3.4 was file system manipulation. Developers often struggled with tangled strings representing paths. Their code broke frequently due to path inconsistencies on different operating systems (Windows vs. Unix-like). That's when the `pathlib` module was introduced to the standard library.

`pathlib` offers a long-awaited object-oriented approach to path manipulation. It provides a powerful and elegant way to handle file system paths, ensures platform-agnostic behavior, and promotes code clarity and maintainability.

The module has matured significantly over the years, making it an essential tool for any Pythonista. This comprehensive tutorial will teach you the features and methods of `pathlib` that will be enough for most of your daily interactions with the file system. Let's get started.

## Python `os` module vs. `pathlib`

Some of our readers might ask "Why learn a new library when we have the Python `os` module?". That's a fair question. 

Let's say we want to find all `png` files inside a given directory (a common task in data science). If we were using the `os` module, we would have to write code like this:

```python
import os

dir_path = "/home/user/documents"

# Find all png files inside a directory
files = [
    os.path.join(dir_path, f)
    for f in os.listdir(dir_path)
    if os.path.isfile(os.path.join(dir_path, f)) and f.endswith(".png")
]
```

This snippet has several disadvantages:
1. It is long and hard to read for such a simple operation.
2. It requires knowledge of list comprehensions.
3. It involves string operations, which are error-prone.

If we were using `pathlib`, then our code would be much simpler:

In [2]:
from pathlib import Path

# Create a path object
dir_path = Path("/home/user/documents")

# Find all png files inside a directory
files = list(dir_path.glob("*.png"))

If you continue reading the article, you will discover many more benefits of `pathlib` over the `os` module besides simplicity and readability. So, shall we?

## `Path` objects

The entire `pathlib` library revolves around `Path` objects:

In [None]:
from pathlib import Path

These objects represent file system paths in a structured and - this is key - platform-independent way. Unlike working with raw strings, these objects offer a more user-friendly approach to manipulating file system paths.

We can create `Path` objects in several ways:

1. __From strings__

You can directly create a `Path` object by passing a string that represents a file system path:

In [None]:
file_path_str = "data/union_data.csv"
data_path = Path(file_path_str)

print(type(data_path))

2. __From other `Path` objects__

Existing `Path` objects can be building blocks to create new paths. You can combine them using operators:

In [3]:
base_path = Path("/home/user")
data_dir = Path("data")

file_path = base_path / data_dir / "prices.csv"  # Combining multiple paths
print(file_path)

/home/user/data/prices.csv


By using a forward slash, you can extend `Path` objects with another object or a string path. 
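
To see how the separator follows the platform's rules without switching machines, here is a minimal sketch that uses `PurePosixPath` and `PureWindowsPath`, the pure classes `pathlib` provides for handling paths without any file system access. The example paths are hypothetical, and `joinpath()` is shown as the method form of the `/` operator.

```python
from pathlib import PurePosixPath, PureWindowsPath

# The same join, evaluated under POSIX and Windows rules
posix_path = PurePosixPath("/home/user") / "data" / "prices.csv"
windows_path = PureWindowsPath("C:/Users/user") / "data" / "prices.csv"

# joinpath() is the method form of the / operator
same_posix_path = PurePosixPath("/home/user").joinpath("data", "prices.csv")

print(posix_path)                     # /home/user/data/prices.csv
print(windows_path)                   # C:\Users\user\data\prices.csv
print(posix_path == same_posix_path)  # True
```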

3. __From the current working directory__

The `Path.cwd()` method gives quick access to the current working directory as a `Path` object:

In [4]:
cwd = Path.cwd()

print(cwd)

/home/bexgboost/articles/2024/4_april/8_pathlib


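4. __From the home directory__

`Path.home()` returns the current user's home directory as a `Path` object, which you can extend like any other path:
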
In [5]:
home = Path.home()

home / "downloads" / "projects"

PosixPath('/home/bexgboost/downloads/projects')

__An important note__: Creating a `Path` object doesn't touch the file system. It performs no path validation and creates no files or directories; it is designed for representing and manipulating paths. To actually interact with the file system (checking existence, reading/writing files), we will have to use special methods of `Path` objects and, for some advanced cases, get help from the `os` module. More on this later.

## `Path` components 

Just like a physical address has different parts (street number, city, country, zip code, etc.), a file system path can be broken down into smaller components. `pathlib` allows us to access and manipulate these components using path attributes through dot-notation. 

Here are some common path attributes and how to retrieve them in `pathlib`:

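- __Root__: This refers to the top level of the file system (e.g., "/" on Unix-like systems). On Windows, `.root` is the backslash that follows the drive letter, and the drive letter itself (like "C:") is available through the separate `.drive` attribute.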

In [6]:
image_file = home / "downloads" / "midjourney.png"

image_file.root

'/'
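
The difference between the root and the drive is easiest to see side by side. The sketch below uses the pure path classes so that it runs on any operating system; the file paths are made up for illustration. It also prints `.anchor`, which combines the drive and the root.

```python
from pathlib import PurePosixPath, PureWindowsPath

win_file = PureWindowsPath("C:/Users/user/downloads/midjourney.png")
posix_file = PurePosixPath("/home/user/downloads/midjourney.png")

# Windows paths carry a drive and a root separately
print(repr(win_file.drive))   # 'C:'
print(repr(win_file.root))    # '\\'
print(repr(win_file.anchor))  # 'C:\\'

# POSIX paths have no drive, so the anchor is just the root
print(repr(posix_file.drive))   # ''
print(repr(posix_file.anchor))  # '/'
```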

- __Parent__: This attribute returns a `Path` object representing the directory containing the current path.

In [7]:
image_file.parent

PosixPath('/home/bexgboost/downloads')

- __name__: This attribute returns the entire filename (including extension) as a string.

In [8]:
image_file.name

'midjourney.png'

- __suffix__: This attribute returns the file extension (including the dot) as a string, or an empty string if there's no extension.

In [9]:
image_file.suffix

'.png'

- __stem__: This attribute returns the file name without the extension. It is useful when converting files to different formats.

In [10]:
image_file.stem

'midjourney'
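
Two details about suffixes are worth a quick sketch: files with more than one extension, and building the target path for a format conversion. The file names below are hypothetical. Note that `with_suffix` only returns a new path; it doesn't convert or rename anything on disk.

```python
from pathlib import PurePosixPath

# Only the last extension counts as the suffix
archive = PurePosixPath("backups/data.tar.gz")
print(archive.suffix)    # .gz
print(archive.suffixes)  # ['.tar', '.gz']
print(archive.stem)      # data.tar

# Build the path a converted file would be saved to
png_file = PurePosixPath("downloads/midjourney.png")
jpg_file = png_file.with_suffix(".jpg")
print(jpg_file)  # downloads/midjourney.jpg
```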

If you want to split a `Path` object into its components, you can use the `.parts` attribute:

In [11]:
image_file.parts

('/', 'home', 'bexgboost', 'downloads', 'midjourney.png')

If you want these components to be `Path` objects in themselves, you can use the `.parents` attribute, which returns an immutable sequence of the path's ancestors:

In [12]:
list(image_file.parents)

[PosixPath('/home/bexgboost/downloads'),
 PosixPath('/home/bexgboost'),
 PosixPath('/home'),
 PosixPath('/')]
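
Because `.parents` is a sequence, you can also index it directly instead of converting it to a list. A small sketch with a made-up path:

```python
from pathlib import PurePosixPath

image_path = PurePosixPath("/home/user/downloads/midjourney.png")

# Index 0 is the immediate parent, index 1 the grandparent, and so on
print(image_path.parents[0])    # /home/user/downloads
print(image_path.parents[1])    # /home/user
print(len(image_path.parents))  # 4

# parents[1] is the same as chaining .parent twice
print(image_path.parent.parent == image_path.parents[1])  # True
```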

## Common path operations using `pathlib`

`Path` objects have many methods that allow you to efficiently interact with directories and their contents. Below, we will look at how to perform some of the most common operations you will face daily.

### Listing directories

The `iterdir()` method allows you to iterate over all entries (files and subdirectories) within a folder given as a `Path` object. It is particularly useful for processing all files in a directory or performing operations on each entry:

In [13]:
cwd = Path.cwd()

for entry in cwd.iterdir():
    # Process the entry here
    ...
    # print(entry)

Since `iterdir()` returns an iterator, entries are retrieved on-demand as you go through the loop.

While iterating through a directory, you may want to focus processing only on files or directories. `Path` objects have methods for checking entry type:

- `is_dir()`: This method returns `True` if the path points to a directory, `False` otherwise.

In [14]:
for entry in cwd.iterdir():
    if entry.is_dir():
        print(entry.name)

.ipynb_checkpoints
data
images


- `is_file()`: This method returns `True` if the path points to a regular file, `False` otherwise.

In [15]:
for entry in cwd.iterdir():
    if entry.is_file():
        print(entry.suffix)

.ipynb
.txt


Since `Path` objects only represent paths, sometimes you need to check if a path actually exists using the `.exists()` method:

In [16]:
image_file.exists()

False
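
The pieces above combine naturally. The following sketch builds a throwaway directory with `tempfile` (the file and folder names are invented for the example), then uses `iterdir()`, `is_file()`, and `is_dir()` to count files by extension and list subdirectories.

```python
import tempfile
from collections import Counter
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)

    # Build a small hypothetical directory to iterate over
    (folder / "images").mkdir()
    for name in ("notes.txt", "todo.txt", "analysis.ipynb"):
        (folder / name).write_text("")  # create an empty file

    # Count regular files by extension; subdirectories are skipped
    suffix_counts = Counter(
        entry.suffix for entry in folder.iterdir() if entry.is_file()
    )
    subdirs = sorted(entry.name for entry in folder.iterdir() if entry.is_dir())

print(suffix_counts[".txt"], suffix_counts[".ipynb"])  # 2 1
print(subdirs)                                         # ['images']
```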

### Creating and deleting paths

`pathlib` also offers functionalities for creating and deleting files and directories. Let's see how.

- `mkdir()`: This method creates a new directory at the specified path. A relative path, like the one below, is resolved against the current working directory.

In [17]:
from pathlib import Path

data_dir = Path("new_data_dir")

# Create the directory 'new_data_dir' in the current working directory
data_dir.mkdir()

- `mkdir(parents=True)`: This method is particularly useful when you want to create a directory structure where some parent directories might not exist. Setting `parents=True` ensures that all necessary parent directories are created along the way.

In [19]:
sub_dir = Path("data/nested/subdirectory")

# Create 'data/nested/subdirectory', even if 'data' or 'nested' don't exist
sub_dir.mkdir(parents=True)

Keep in mind that `mkdir` raises an exception if a directory with the same name already exists:

```python
Path('data').mkdir()
```

```
FileExistsError: [Errno 17] File exists: 'data'
```
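
If an already existing directory is acceptable, `mkdir` accepts an `exist_ok=True` argument that suppresses this error. Here is a minimal sketch that runs inside a temporary directory so it doesn't touch your project; the directory names are only examples.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data"
    data_dir.mkdir()

    # A second plain mkdir() would raise FileExistsError;
    # with exist_ok=True the call does nothing instead
    data_dir.mkdir(exist_ok=True)

    # Both flags together: create missing parents, tolerate existing ones
    nested = data_dir / "nested" / "subdirectory"
    nested.mkdir(parents=True, exist_ok=True)
    nested.mkdir(parents=True, exist_ok=True)  # safe to repeat
    nested_created = nested.is_dir()

print(nested_created)  # True
```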

- `unlink()`: This method permanently deletes the file represented by the `Path` object. It is recommended to check if the file exists before running this method because it raises a `FileNotFoundError` if the file is missing:

```python
to_delete = Path("data/prices.csv")

if to_delete.exists():
    to_delete.unlink()
    print(f"Successfully deleted {to_delete.name}")
```

```
Successfully deleted prices.csv
```
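
On Python 3.8 and later, `unlink` also accepts `missing_ok=True`, which skips the error when the file is already gone. A short sketch under that assumption, using a temporary directory and an invented file name:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    to_delete = Path(tmp) / "prices.csv"
    to_delete.write_text("price\n10\n")

    to_delete.unlink()                 # deletes the file
    to_delete.unlink(missing_ok=True)  # file already gone: no error raised
    still_there = to_delete.exists()

print(still_there)  # False
```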

- `rmdir()`: This method removes a directory, but only if it is empty. If you want to delete a non-empty directory, the easiest way is to use the [`shutil` library](https://stackoverflow.com/questions/303200/how-do-i-remove-delete-a-folder-that-is-not-empty) or the terminal.

In [None]:
empty_dir = Path("new_data_dir")

empty_dir.rmdir()

Please be cautious when using `unlink()` or `rmdir()` as their results are permanent.
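
For the non-empty case, the `shutil` route looks roughly like the sketch below. It works inside a temporary directory with made-up names, first showing that `rmdir()` refuses to delete a directory with contents, then removing the whole tree with `shutil.rmtree`, which is just as permanent.

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    project_dir = Path(tmp) / "old_project"
    (project_dir / "data").mkdir(parents=True)
    (project_dir / "data" / "prices.csv").write_text("price\n10\n")

    rmdir_failed = False
    try:
        project_dir.rmdir()  # fails: the directory is not empty
    except OSError:
        rmdir_failed = True

    shutil.rmtree(project_dir)  # removes the directory and everything in it
    removed = not project_dir.exists()

print(rmdir_failed)  # True
print(removed)       # True
```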

## Advanced path manipulation

Let's move on to some advanced path manipulation concepts and how to apply them in practice using `pathlib`.

### Relative vs. absolute paths

We will start by understanding the difference between absolute and relative paths, as they come up often in your own work or in others' code. 

__Relative paths__: these paths specify the location of a file or directory relative to the current directory, hence the word _relative_. They are short and flexible within your project but can be confusing if you change the working directory. 

For example, I have an `images` folder in my current working directory, which has the `midjourney.png` file:

In [21]:
image = Path("images/midjourney.png")

image

PosixPath('images/midjourney.png')

The above code works now but if I move the notebook I am using to a different location, the snippet will break because the `images` folder didn't move with the notebook. 

__Absolute paths__: these paths specify the full location of a file or a directory from the root of the file system. They are independent of the current directory and offer a clear reference point for any user anywhere on the system:

In [22]:
image_absolute = Path(
    "/home/bexgboost/articles/2024/4_april/8_pathlib/images/midjourney.png"
)

image_absolute

PosixPath('/home/bexgboost/articles/2024/4_april/8_pathlib/images/midjourney.png')

I guess you can see why most people prefer relative paths - absolute paths can become pretty long, especially in complex projects with nested tree structures.

That's why `pathlib` provides the `resolve()` method, which converts a relative path to an absolute one:

In [23]:
relative_image = Path("images/midjourney.png")

absolute_image = relative_image.resolve()
absolute_image

PosixPath('/home/bexgboost/articles/2024/4_april/8_pathlib/images/midjourney.png')

You can also go the other way: if you have an absolute path, you can convert it to a relative one with the `relative_to()` method, based on a reference directory:

In [24]:
relative_path = Path.cwd()

absolute_image.relative_to(relative_path)

PosixPath('images/midjourney.png')
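
One thing to keep in mind is that `relative_to()` raises a `ValueError` when the path doesn't lie inside the reference directory. The sketch below illustrates this with pure paths and hypothetical locations:

```python
from pathlib import PurePosixPath

absolute_image = PurePosixPath("/home/user/project/images/midjourney.png")

# The reference directory is an ancestor, so this works
inside = absolute_image.relative_to("/home/user/project")
print(inside)  # images/midjourney.png

# The reference directory is unrelated, so this raises ValueError
try:
    absolute_image.relative_to("/etc")
    outside_ok = True
except ValueError:
    outside_ok = False

print(outside_ok)  # False
```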

### Globbing

Let's get back to the example we introduced in the beginning of the article, which showed how to find all PNG files in a given directory:

```python
files = list(dir_path.glob("*.png"))
```

Let's talk about the `.glob()` method a bit more. 

`pathlib` supports glob-style pattern matching, much like the built-in `glob` module, to search for files matching a specific pattern in any directory. This is very useful when you need to process files with similar names or extensions.

The `glob` method accepts a pattern string containing wildcards and returns a generator that yields matching `Path` objects on demand. The most common wildcards are:
- `*`: Matches zero or more characters within a single file or directory name.
- `?`: Matches any single character.
- `[]`: Matches any one character from the set or range enclosed in brackets (e.g., `[a-z]` matches any lowercase letter).

Let's try to find all Jupyter notebooks in my `articles` directory:

In [26]:
articles_dir = Path.home() / "articles"

# Find all notebooks at the top level of the directory
notebooks = articles_dir.glob("*.ipynb")

# Print how many found
print(len(list(notebooks)))

0


The `.glob` method found no notebooks when it should have found many (I have written over 150 articles in notebooks). The reason is that `glob` only searches inside the given directory, not its subdirectories. To do a recursive search, we need to use the `rglob` method, which has a similar syntax:

In [27]:
notebooks = articles_dir.rglob("*.ipynb")

print(len(list(notebooks)))

357


This time, it found 357 files, which is a far more plausible number.
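
To reproduce the difference on any machine, here is a self-contained sketch that creates a small, invented directory tree in a temporary folder and compares `glob`, `rglob`, and a bracket pattern. It also shows that a `**/` prefix inside a `glob` pattern recurses the same way `rglob` does.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    articles_dir = Path(tmp)
    (articles_dir / "2024" / "april").mkdir(parents=True)

    # Hypothetical files at three different depths
    for relative in (
        "index.ipynb",
        "2024/draft.ipynb",
        "2024/april/pathlib.ipynb",
        "2024/april/notes.txt",
    ):
        (articles_dir / relative).write_text("")

    top_level = sorted(p.name for p in articles_dir.glob("*.ipynb"))
    recursive = sorted(p.name for p in articles_dir.rglob("*.ipynb"))
    double_star = sorted(p.name for p in articles_dir.glob("**/*.ipynb"))

    # [dp] matches names starting with "d" or "p"
    bracket = sorted(p.name for p in articles_dir.rglob("[dp]*.ipynb"))

print(top_level)  # ['index.ipynb']
print(recursive)  # ['draft.ipynb', 'index.ipynb', 'pathlib.ipynb']
print(bracket)    # ['draft.ipynb', 'pathlib.ipynb']
```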

## Working with files

As mentioned earlier, `Path` objects only represent files but don't perform operations on them. However, they do have certain methods for common file operations. We will see how to use them in this section.

### Reading files

Reading file contents is a fundamental operation in many Python applications. `pathlib` provides convenient shorthand methods for reading files as either text or raw bytes.

The `read_text` method opens a text file, returns its contents as a string, and closes the file:

In [28]:
file = Path("file.txt")

print(file.read_text())

This is sample text.


Similarly, for binary files, you can use the `read_bytes` method:

In [29]:
image = Path("images/midjourney.png")

image.read_bytes()[:10]

b'\x89PNG\r\n\x1a\n\x00\x00'

When using a `read_*` method, error handling is important:

In [30]:
nonexistent_file = Path("gibberish.txt")

try:
    contents = nonexistent_file.read_text()
except FileNotFoundError:
    print("No such thing.")

No such thing.
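
Another common source of read errors is text encoding. Both `read_text` and `write_text` accept an `encoding` argument, and passing it explicitly avoids depending on the platform default. A minimal sketch with made-up file contents:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    menu_file = Path(tmp) / "menu.txt"

    # Write and read with the same explicit encoding
    menu_file.write_text("Café menu\nEspresso\n", encoding="utf-8")
    contents = menu_file.read_text(encoding="utf-8")
    lines = contents.splitlines()

print(lines)  # ['Café menu', 'Espresso']
```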


### Writing files

Writing to files is as easy as reading files. First, we have the `write_text` method:

In [41]:
file = Path("file.txt")

file.write_text("This is new text.")

17

In [42]:
file.read_text()

'This is new text.'

As you can see, the method overwrites the old text. `write_text` has no append mode, but we can use a workaround:

In [43]:
old_text = file.read_text() + "\n"
final_text = "This is the final text."

# Combine old and new texts and write them back
file.write_text(old_text + final_text)

print(file.read_text())

This is new text.
This is the final text.


By using `read_text` and `write_text` together, we can append text to the end of the file.
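
An alternative that avoids rereading the whole file is `Path.open()`, which mirrors the built-in `open()` and supports append mode. The sketch below is one way to do it, run in a temporary directory:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    log_file = Path(tmp) / "file.txt"
    log_file.write_text("This is new text.\n", encoding="utf-8")

    # Mode "a" adds to the end of the file instead of overwriting it
    with log_file.open("a", encoding="utf-8") as handle:
        handle.write("This is the final text.\n")

    appended = log_file.read_text(encoding="utf-8")

print(appended)
```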

`write_bytes` works in the same way. Let's duplicate the `midjourney.png` image with a new name:

In [45]:
original_image = Path("images/midjourney.png")

new_image = original_image.with_stem("duplicated_midjourney")
new_image

PosixPath('images/duplicated_midjourney.png')

First, we define the new image's path using the `with_stem` method (available since Python 3.9), which returns the given file path with a different stem (the suffix stays the same). Now, we can read the `original_image` and write its contents to `new_image`:

In [46]:
new_image.write_bytes(original_image.read_bytes())

1979612

The image is now duplicated.
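
Keep in mind that `read_bytes` loads the entire file into memory. For large files, a common alternative is `shutil.copy2`, which copies the contents and also tries to preserve file metadata. The sketch below assumes a tiny placeholder file rather than a real image, and uses `with_name`, which also works on Python versions older than 3.9:

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    original_image = Path(tmp) / "midjourney.png"
    original_image.write_bytes(b"\x89PNG\r\n\x1a\n placeholder bytes")

    # Same result as with_stem("duplicated_midjourney")
    new_image = original_image.with_name(
        "duplicated_midjourney" + original_image.suffix
    )

    shutil.copy2(original_image, new_image)
    identical = new_image.read_bytes() == original_image.read_bytes()

print(new_image.name)  # duplicated_midjourney.png
print(identical)       # True
```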

### File renaming and moving

Above, we used the `with_stem` method to build a path with a different stem. Note that it only returns a new `Path` object and doesn't rename anything on disk. To actually rename a file, `pathlib` offers the `rename` method:

In [47]:
file = Path("file.txt")
target_path = Path("new_file.txt")

file.rename(target_path)

PosixPath('new_file.txt')

`rename` accepts a target path, which can be a string or another path object.

To move files, you can use the `replace` method, which also accepts a destination path and overwrites any file that already exists there:

In [55]:
# Define the file to be moved
source_file = Path("new_file.txt")

# Define the location to put the file
destination = Path("data/new/location")

# Create the directories if they don't exist
destination.mkdir(parents=True, exist_ok=True)

# Move the file
source_file.replace(destination / source_file.name)

PosixPath('data/new/location/new_file.txt')
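
Since `replace` overwrites silently, it can be worth checking the destination first when existing data matters. The following sketch, with invented file contents inside a temporary directory, shows the overwrite happening:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    source_file = base / "new_file.txt"
    source_file.write_text("fresh contents")

    destination = base / "data" / "new" / "location"
    destination.mkdir(parents=True, exist_ok=True)

    # A file with the same name already sits at the destination
    target = destination / source_file.name
    target.write_text("stale contents")
    target_existed = target.exists()

    source_file.replace(target)  # moves the file, overwriting the old one
    source_gone = not source_file.exists()
    final_contents = target.read_text()

print(target_existed)  # True
print(source_gone)     # True
print(final_contents)  # fresh contents
```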

### Creating blank files

`pathlib` allows you to create blank files using the `touch` method:

In [65]:
# Define new file path
new_dataset = Path("data/new.csv")

new_dataset.exists()

False

In [66]:
new_dataset.touch()

new_dataset.exists()

True

The `touch` method was originally meant for updating a file's modification time, so it can be used on existing files as well:

In [70]:
original_image.touch()

But when you need to reserve a filename for later use but don't have any content to write to it at the moment, you can use `touch` to create a blank. The method was inspired by the Unix `touch` terminal command.

### Permissions and file system information

The last thing we will cover about files is retrieving file statistics with the `stat` method:

In [77]:
image_stats = original_image.stat()

image_stats

os.stat_result(st_mode=33188, st_ino=1950175, st_dev=2080, st_nlink=1, st_uid=1000, st_gid=1000, st_size=1979612, st_atime=1714664562, st_mtime=1714664562, st_ctime=1714664562)

It returns the same `os.stat_result` object as the `os.stat()` function, containing several file characteristics such as size, permissions (`st_mode`), and timestamps. Below, we will retrieve the file size using dot-notation:

In [86]:
image_size = image_stats.st_size

# File size in megabytes
image_size / (1024**2)

1.8879051208496094
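
The other fields can be decoded with help from the standard library. As a rough sketch, `stat.filemode` turns `st_mode` into the familiar permission string, and `datetime.fromtimestamp` converts `st_mtime` into a readable date. The sample file below is a stand-in created in a temporary directory, so its permissions and timestamp will differ on your machine.

```python
import stat
import tempfile
from datetime import datetime, timezone
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"\x00" * 2048)

    info = sample.stat()
    size_kib = info.st_size / 1024
    permissions = stat.filemode(info.st_mode)
    modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)

print(size_kib)              # 2.0
print(permissions)           # platform-dependent permission string
print(modified.isoformat())  # modification time in UTC

# Decoding the st_mode value from the output shown earlier
print(stat.filemode(33188))  # -rw-r--r--
```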

## Conclusion

In this tutorial, we have learned about `pathlib`, a powerful library for interacting with the file system in Python (I consider it one of the best things that has ever happened to Python). Here is a recap of its key benefits:


* **Object-oriented approach:** `Path` objects provide a structured and straightforward way to represent file system paths, making code more readable and maintainable.
* **Platform independence:** `pathlib` handles path separators and operations consistently across different operating systems, ensuring your code doesn't break on another's machine.
* **Concise and expressive methods:** `pathlib` offers a vast set of methods for common file system operations, making tasks like path manipulation, file reading/writing, and directory management a breeze.

If you want to learn more about Python and its built-in libraries, feel free to check out the following resources:

- [Data Scientist With Python Career Track](https://www.datacamp.com/tracks/associate-data-scientist-in-python)
- [Python Programming Skill Track](https://www.datacamp.com/tracks/python-programming)
- [Intro to Python for Data Science Course](https://www.datacamp.com/courses/intro-to-python-for-data-science)

Thank you for reading!