# Introduction to Python Paths using `pathlib`

In Python, the standard-library `os` and `os.path` modules let us interact with the filesystem. These utilities are useful, but the `pathlib` module, which is also part of the standard library, is typically easier to use. It treats paths as objects and provides a number of handy shorthands.

For example, using the `os.path` module, if we want to join two paths in a way that is safe across different operating systems, we would use the `os.path.join` function:

```python
import os

base_path = '/home/jovyan'
subdir = 'curriculum'
path = os.path.join(base_path, subdir)
# path is now equal to '/home/jovyan/curriculum'
```

Using `pathlib`, we can instead use the `/` operator. Although `/` normally means division, `Path` objects overload it to join paths, because the `'/'` character is the directory separator on POSIX operating systems like Linux:

```python
from pathlib import Path

base_path = Path('/home/jovyan')
subdir = 'curriculum'
path = base_path / subdir
# path now represents the path '/home/jovyan/curriculum'
```
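Joining is only one of the shorthands that path objects offer. The sketch below shows a few attributes for taking a path apart. The file name `notes.tar.gz` is a hypothetical example, and `PurePosixPath` is used (rather than `Path`) only so that the results are the same on every operating system, since it manipulates the path text without touching the filesystem.

```python
from pathlib import PurePosixPath

# 'notes.tar.gz' is a made-up file name used only for illustration.
path = PurePosixPath('/home/jovyan') / 'curriculum' / 'notes.tar.gz'

parent = path.parent                 # the containing directory
name = path.name                     # the final component: 'notes.tar.gz'
suffix = path.suffix                 # only the last extension: '.gz'
stem = path.stem                     # name minus the last extension: 'notes.tar'
parts = path.parts                   # tuple of every component, root first
renamed = path.with_suffix('.zip')   # swaps only the last extension

print(parent, name, suffix, stem, parts, renamed, sep='\n')
```

The same attributes are available on the `Path` objects used in the rest of this section.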

As an example, let's create a cache directory in which to store temporary files and files downloaded from the various datasets. We will do that with `pathlib`:

In [1]:
from pathlib import Path

# Make the path object:
cache_path = Path('/tmp/cache')

# Creating a Path object does not create the directory on disk, so we make it
# here. exist_ok=True makes this a no-op when the directory already exists, and
# parents=True also creates any missing parent directories.
cache_path.mkdir(parents=True, exist_ok=True)

We can also use `Path` objects to write a Markdown README file into this directory:

In [2]:
readme_filename = cache_path / 'README.md'
with readme_filename.open('wt') as file:
    print('# Local Dataset Cache Directory', file=file)
    print('', file=file)
    print('This directory contains cache data downloaded from the', file=file)
    print('various datasets. The subdirectories in this directory', file=file)
    print('should be safe to delete.', file=file)

Check that this file was written correctly:

In [3]:
! cat /tmp/cache/README.md

# Local Dataset Cache Directory

This directory contains cache data downloaded from the
various datasets. The subdirectories in this directory
should be safe to delete.
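For small text files, `Path` objects also provide `write_text` and `read_text`, which open, write or read, and close the file in a single call. The following is a minimal sketch of the same README round trip using those methods. It writes into a throwaway temporary directory (an assumption made so that the example does not touch `/tmp/cache`), and the name `demo_readme` is purely illustrative.

```python
import tempfile
from pathlib import Path

readme_text = (
    '# Local Dataset Cache Directory\n'
    '\n'
    'This directory contains cache data downloaded from the\n'
    'various datasets. The subdirectories in this directory\n'
    'should be safe to delete.\n'
)

# Work in a temporary directory that is removed when the block exits.
with tempfile.TemporaryDirectory() as tmpdir:
    demo_readme = Path(tmpdir) / 'README.md'
    # write_text replaces the open/print sequence with one call.
    demo_readme.write_text(readme_text, encoding='utf-8')
    # read_text is the matching one-call reader; it also avoids the
    # shell-specific `! cat` used in the notebook cell.
    contents = demo_readme.read_text(encoding='utf-8')

first_line = contents.splitlines()[0]
print(first_line)
```

Reading the file back through the same `Path` object avoids hard-coding the path a second time.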


A handy utility for looking at paths is an `ls` function that returns a list of the files and directories within a given directory. `Path` objects support an `iterdir()` method that returns an iterator over the directory's contents. We can use this to write an `ls` function:

In [4]:
def ls(path):
    """Lists the contents of the given path."""
    # If path is not a directory, raise an error:
    if not path.is_dir():
        raise ValueError(f"Path '{path}' is not a directory")
    return list(path.iterdir())

In [5]:
# There should only be one file in the cache directory (README.md).
ls(cache_path)

[PosixPath('/tmp/cache/README.md')]
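`iterdir()` makes no promise about the order of the entries it yields. One possible variation, sketched below, sorts the entries and optionally filters them with `Path.glob`. The function name `ls_matching`, its `pattern` parameter, and the demo file names are all hypothetical. The example runs in a temporary directory so that it is self-contained.

```python
import tempfile
from pathlib import Path

def ls_matching(path, pattern='*'):
    """Returns a sorted list of the entries in path that match pattern."""
    if not path.is_dir():
        raise ValueError(f"Path '{path}' is not a directory")
    # A pattern without '**' matches direct children only; sorting gives
    # a stable order from one call to the next.
    return sorted(path.glob(pattern))

with tempfile.TemporaryDirectory() as tmpdir:
    demo_dir = Path(tmpdir)
    (demo_dir / 'README.md').write_text('demo\n')
    (demo_dir / 'notes.txt').write_text('demo\n')
    (demo_dir / 'data').mkdir()
    all_names = [p.name for p in ls_matching(demo_dir)]
    md_names = [p.name for p in ls_matching(demo_dir, '*.md')]

print(all_names)
print(md_names)
```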

## Paths in the Cloud using `cloudpathlib`

The `cloudpathlib` library is similar to `pathlib`: a `CloudPath` object represents a path in cloud storage and exposes much the same interface as a `Path`. You can treat a `CloudPath` as a `Path` in most situations and it will simply work. Behind the scenes, it downloads data on demand, as you request it, into whatever cache directory you specify.

A `CloudPath` object can represent an S3 path, an Azure Blob Storage path, or a Google Cloud Storage path. We will be using S3 paths in these demos. For an `S3Path` we typically need to provide information about our authentication and our cache directory via an `S3Client` object. For S3 buckets that don't require authentication, such as the Natural Scenes Dataset bucket, we can simply tell the client not to authenticate. Here is an example:

In [6]:
from cloudpathlib import S3Path, S3Client

# Create a client that uses our cache path and that does not try to
# authenticate with S3.
client = S3Client(
    local_cache_dir=cache_path,
    no_sign_request=True)

# Now, create a cloudpath for the NSD's S3 bucket:
nsd_base_path = S3Path(
    "s3://natural-scenes-dataset/",
    client=client)

Having created this path, we can now list its contents with the same `ls` function we wrote for local paths:

In [7]:
ls(nsd_base_path)

[S3Path('s3://natural-scenes-dataset/nsddata'),
 S3Path('s3://natural-scenes-dataset/nsddata_betas'),
 S3Path('s3://natural-scenes-dataset/nsddata_diffusion'),
 S3Path('s3://natural-scenes-dataset/nsddata_other'),
 S3Path('s3://natural-scenes-dataset/nsddata_rawdata'),
 S3Path('s3://natural-scenes-dataset/nsddata_stimuli'),
 S3Path('s3://natural-scenes-dataset/nsddata_timeseries'),
 S3Path('s3://natural-scenes-dataset/index.html')]
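Because `ls` relies only on `is_dir()` and `iterdir()`, it worked on an `S3Path` without modification. The sketch below applies the same idea to a hypothetical helper, `split_entries`, that separates directories from files using only methods and attributes that `Path` and `CloudPath` share. To keep it runnable offline, it is exercised on a local temporary directory whose entry names merely imitate the bucket listing. Passing `nsd_base_path` instead should behave the same way, though that is an assumption and is not run here.

```python
import tempfile
from pathlib import Path

def split_entries(path):
    """Returns (directory names, file names) for the contents of path."""
    dirs, files = [], []
    for entry in path.iterdir():
        # is_dir() and .name belong to the interface shared by Path and
        # CloudPath, so nothing here is specific to local files.
        (dirs if entry.is_dir() else files).append(entry.name)
    return sorted(dirs), sorted(files)

# Local stand-in for a bucket; the names only mimic the listing above.
with tempfile.TemporaryDirectory() as tmpdir:
    demo_dir = Path(tmpdir)
    (demo_dir / 'nsddata').mkdir()
    (demo_dir / 'index.html').write_text('<html></html>\n')
    dir_names, file_names = split_entries(demo_dir)

print(dir_names, file_names)
```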