# Lesson 1. Working Directories, Absolute and Relative Paths and Other Science Project Management Terms Defined

After completing this chapter, you will be able to:

- Define a computer directory and list the primary types of directories.
- Explain the difference between relative and absolute paths.
- Check and set your working directory in Python using the os package.

A directory refers to a folder on a computer that has relationships to other folders. The term “directory” considers the relationship between that folder and the folders within and around it. Directories are hierarchical, which means that they can exist within other folders and can also contain folders themselves.

## What Is a Parent Directory?
The term “parent” directory is used to describe the preceding directory in which a subdirectory is created. A parent directory can have many subdirectories; thus, many subdirectories can share the same parent directory. This also means that a parent directory can itself be a subdirectory of another parent directory above it in the hierarchy.

### What Is the Home Directory?
The home directory on a computer is a directory defined by your operating system. The home directory is the primary directory for your user account on your computer. Your files are by default stored in your home directory.

On Windows, the home directory is typically C:\Users\your-username.

### What Is a Working Directory?
While the terminal will open in your home directory by default, you can change the working directory of the terminal to a different location within your computer’s file structure.

The working directory refers to the directory (or location) on your computer that a tool assumes is the starting place for all paths that you construct or try to access.

### Working Directories and Relative vs Absolute Paths in Python

When set correctly, working directories help the programming language to find files when you create paths.

Within Python, you can define (or set) the working directory of your choice. Then, you can create paths that are relative to that working directory, or create absolute paths, which begin at the root of the file system (for example, / on Mac and Linux or a drive such as C:\ on Windows) and provide the full path to the file that you wish to open.

#### Relative Paths
A relative path is a path that (as the name suggests) is relative to the working directory location on your computer.

If the working directory is earth-analytics, then Python knows to start looking for your files in the earth-analytics directory.

Following the example above, if you set the working directory to the earth-analytics directory, then the relative path to access streams.csv would be:

data/field-sites/california/colorado/streams.csv

**Data Tip:** The default working directory in any Jupyter Notebook file is the directory in which it is saved. However, you can change the working directory in your code!

#### Absolute Paths
An absolute path is a path that contains the entire path to the file or directory that you need to access. This path begins at the root of the file system (for example, C:\ on Windows or / on Mac and Linux), often passes through your home directory, and ends with the file or directory that you wish to access.
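The difference between the two kinds of path can be checked directly in Python. The following is a minimal sketch using only the standard library; the directory names are taken from the example path above and do not need to exist on your computer.

```python
import os

# A relative path is interpreted from the current working directory
relative_path = os.path.join(
    "data", "field-sites", "california", "colorado", "streams.csv"
)

# os.path.abspath() resolves a relative path against the working directory
absolute_path = os.path.abspath(relative_path)

print(os.path.isabs(relative_path))  # False
print(os.path.isabs(absolute_path))  # True

# The absolute version is the working directory plus the relative path
print(absolute_path)
```

Because the absolute path is built from the working directory, the same relative path resolves to a different location whenever the working directory changes.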

# Lesson 2. Write Code That Will Work On Any Computer: Introduction to Using the OS Python Package to Set Up Working Directories and Construct File Paths

Learning Objectives

- Use the earthpy attribute et.io.HOME to find the home directory on any computer.
- Use os.path.join() to create paths that will work on Windows, Mac and Linux.
- Use os.path.exists() to ensure a file path exists.
- Set your working directory in Python using os.chdir().

In [5]:
# Import necessary packages
import os

import earthpy as et

Ensuring that your code can run on multiple machines makes it easier to:

- set things up in the rare case that your machine dies.
- move your workflow to a cloud environment or high performance computing infrastructure.
- share your project and collaborate with others.

### Build Directory Paths that Work Across Operating Systems Using os.path.join

In [7]:
# Direction and number of slashes are handled by the function
os.path.join("earth-analytics", "data")

'earth-analytics\\data'

Constructing a path using the join() function will save you time when you (or others!) move your code to another computer, as you will not have to manually create or fix paths.
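To see what join() does behind the scenes, the sketch below compares its result with a path assembled by hand using os.sep, the separator character of the operating system that runs the code. The directory names are illustrative only.

```python
import os

# The separator used by the current operating system
print(repr(os.sep))

# join() accepts any number of path components
path = os.path.join("earth-analytics", "data", "field-sites")
print(path)

# Hand-built equivalent using the separator directly
manual_path = os.sep.join(["earth-analytics", "data", "field-sites"])
print(path == manual_path)  # True
```

Writing the separator yourself (for example "earth-analytics\\data") ties the code to one operating system, which is the problem that join() avoids.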

In [8]:
# Check that a directory exists on your computer
my_path = os.path.join("earth-analytics", "data")

# Boolean output (True or False)
os.path.exists(my_path)

False

In the example above, you have created a path. However, that path may or may not already exist on your computer.

If Python cannot find the directory, there are several issues to consider:

- Your working directory may not be set properly, so that it can find the relative path.
- You have a misspelling in your path. Or, the case (e.g. upper, lower) is incorrect.
- The directory has not been created on your computer.
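The checks in the list above can be sketched as a small helper. The function name diagnose_path is hypothetical (it is not part of os or earthpy); it only combines standard os.path calls to show where a path stops matching what is on disk.

```python
import os


def diagnose_path(path):
    """Print hints about why a path might not be found (hypothetical helper)."""
    print("Working directory:", os.getcwd())
    print("Path resolves to: ", os.path.abspath(path))
    print("Path exists:      ", os.path.exists(path))

    # Walk up the hierarchy until a directory that does exist is found
    existing = os.path.abspath(path)
    while not os.path.exists(existing):
        parent = os.path.dirname(existing)
        if parent == existing:  # reached the top of the file system
            break
        existing = parent

    print("Deepest existing part:", existing)
    return existing


diagnose_path(os.path.join("earth-analytics", "data"))
```

If the deepest existing part is not the directory you expected, the working directory is the likely cause. If it is the expected parent, check the spelling and case of the next component, or create the missing directory with os.makedirs(path, exist_ok=True).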

In [10]:
# Check your current working directory
os.getcwd()

'C:\\Users\\34639\\wa\\jupyter\\earth_data_science_0\\4_python_fundamentals'

In [11]:
# Find your home directory
et.io.HOME

'C:\\Users\\34639'
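If earthpy is not installed, the standard library offers a similar lookup. The sketch below assumes that os.path.expanduser("~") returns the same location as et.io.HOME; that equivalence is an assumption here, so compare the two values on your own machine before relying on it.

```python
import os

# Expand "~" to the home directory of the current user
home_dir = os.path.expanduser("~")
print(home_dir)

# Build the earth-analytics path from it, as with et.io.HOME
ea_path = os.path.join(home_dir, "earth-analytics")
print(ea_path)
```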

### Construct a Path to the earth-analytics Directory in Your Home Directory
Now you will implement some useful tricks to construct the path to the earth-analytics directory within your home directory using et.io.HOME and os.path.join.

In [12]:
# Create a path to the home/earth-analytics directory on your computer
os.path.join(et.io.HOME, "earth-analytics")

'C:\\Users\\34639\\earth-analytics'

In [13]:
my_ea_path = os.path.join(et.io.HOME, "earth-analytics")

# Does the path exist?
os.path.exists(my_ea_path)

True

### Set Your Working Directory to home/earth-analytics
Now that you have the basics of good project structure out of the way, you can get your project directory set up.

By now, you have already created the earth-analytics directory (in your home directory) where you will store data and files used in the textbook.

You will now set that project directory as your working directory in Python using the following syntax, which provides the output of os.path.join as input into the os.chdir function:

- os.chdir(os.path.join(et.io.HOME, 'earth-analytics'))

Breaking the above commands down, you are doing the following.

- os.chdir(): remember from above that this function changes the working directory. However, you need to tell Python the path of the working directory that you want to use.
- os.path.join(): this function combines strings or path variables into a full path that will work on any operating system.
- et.io.HOME: this attribute provides the path for the home directory on your (or any) computer.

Combining the three commands above in a nested structure will:

- create the path for the home/earth-analytics directory and
- change the working directory to that path.


In [14]:
# Check the current working directory
os.getcwd()

'C:\\Users\\34639\\wa\\jupyter\\earth_data_science_0\\4_python_fundamentals'

In [15]:
# Find the path to your home directory
et.io.HOME

'C:\\Users\\34639'

In [16]:
# Create a path to earth-analytics that will work on any computer
os.path.join(et.io.HOME, 'earth-analytics')

'C:\\Users\\34639\\earth-analytics'

In [17]:
# Change the directory to that path
os.chdir(os.path.join(et.io.HOME, 'earth-analytics'))

In [18]:
# Check the current working directory again
os.getcwd()

'C:\\Users\\34639\\earth-analytics'
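os.chdir() changes the working directory for the rest of the session, which can surprise later cells that expect the original location. One possible pattern, sketched below, is a small context manager that restores the previous directory afterwards. The name working_directory is hypothetical, and a temporary directory stands in for earth-analytics so the example runs anywhere.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def working_directory(path):
    """Temporarily switch to `path`, then restore the previous directory."""
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        # Runs even if the code inside the with-block raises an error
        os.chdir(previous)


start = os.getcwd()

with tempfile.TemporaryDirectory() as tmp_dir:
    with working_directory(tmp_dir):
        inside_name = os.path.basename(os.getcwd())
    after = os.getcwd()

print(after == start)  # True
```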

# Lesson 3. Use the OS and Glob Python Packages to Manipulate File Paths

Learning Objectives

- Use **earthpy** to download files from a URL (internet address).
- Use **glob** to get customized lists of files or directories.
- Use various functions in the **os** package to manipulate file paths.

### os, glob, and earthpy 

### Download Files Using EarthPy
You can use the function data.get_data() from the earthpy package to download data from online sources such as the Figshare.com data repository.

In [19]:
# Import necessary packages
import os
from glob import glob

import earthpy as et

To use the function et.data.get_data(), you can provide a parameter value for the url, which you define by providing a text string of the URL (internet address) for the dataset.

In [20]:
# Download data on average monthly temp for two California sites
file_url = "https://ndownloader.figshare.com/files/21894528"
et.data.get_data(url=file_url)

Downloading from https://ndownloader.figshare.com/files/21894528
Extracted output to C:\Users\34639\earth-analytics\data\earthpy-downloads\avg-monthly-temp-fahr


'C:\\Users\\34639\\earth-analytics\\data\\earthpy-downloads\\avg-monthly-temp-fahr'

By default, et.data.get_data() will download files to earth-analytics/data/earthpy-downloads under your home directory, and it will create the necessary directories if they do not already exist.

With this information, you can set the working directory to your earth-analytics directory and then create a relative path to the downloaded data directory.

In [21]:
# Set working directory to earth-analytics
os.chdir(os.path.join(et.io.HOME, "earth-analytics"))

# Create a path to the data folder
data_folder = os.path.join("data", "earthpy-downloads", 
                           "avg-monthly-temp-fahr")

## Glob in Python
glob is a powerful tool in Python to help with file management and filtering. While os helps manage and create specific paths that are friendly to whatever machine they are used on, glob helps to filter through large datasets and pull out only files that are of interest.

The glob() function uses the pattern-matching rules of the Unix shell to help users organize their files. The Unix shell follows fairly straightforward rules to search for items, which you will explore below.

#### Search for a Specific Folder or File
The glob function can be used to find just one folder or file. This can be done by just giving glob the path of the item you are trying to find.

In [22]:
# Get a specific directory
file_list = glob(data_folder)

file_list

['data\\earthpy-downloads\\avg-monthly-temp-fahr']

This is not very useful, as you already have the data path if you are using it to search for something.

Notice, however, that glob always returns a list of the items that match your search, even when there is only one match, rather than an individual string.

In [23]:
type(file_list)

list

You can also use the glob() function in combination with the os.path.join() function to create lists of paths that are built programmatically.

In [24]:
# Create a list containing a specific file name
glob(os.path.join(data_folder, 'San-Diego', 'San-Diego-1999-temp.csv'))

['data\\earthpy-downloads\\avg-monthly-temp-fahr\\San-Diego\\San-Diego-1999-temp.csv']
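Because glob always returns a list, a search that matches nothing produces an empty list rather than an error. The short sketch below uses a deliberately nonexistent folder name (an assumption made for illustration) to show how to check for that case.

```python
import os
from glob import glob

# This folder is assumed not to exist
missing = glob(os.path.join("data", "no-such-folder-for-this-example", "*.csv"))
print(missing)  # []

# An empty list is "falsy", so it can be tested directly
if not missing:
    print("No files matched; check the working directory and the pattern.")
```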

### * Operator
glob uses different operators to broaden its searching abilities. The primary operator is *.

The * is a sort of wildcard that can be used to search for items that have differences in their names. Whatever text doesn’t match can be replaced by a *.

For example, if you want every file in a directory to be returned to you, you can put a * at the end of a directory path.

glob will return a list of all of the files in that directory.

In [25]:
# Get list of all files/dirs in data folder
glob(os.path.join(data_folder, '*'))

['data\\earthpy-downloads\\avg-monthly-temp-fahr\\San-Diego',
 'data\\earthpy-downloads\\avg-monthly-temp-fahr\\Sonoma']

In [26]:
# Get list of all files/dirs in San-Diego folder
glob(os.path.join(data_folder, 'San-Diego', '*'))

['data\\earthpy-downloads\\avg-monthly-temp-fahr\\San-Diego\\San-Diego-1999-temp.csv',
 'data\\earthpy-downloads\\avg-monthly-temp-fahr\\San-Diego\\San-Diego-2000-temp.csv',
 'data\\earthpy-downloads\\avg-monthly-temp-fahr\\San-Diego\\San-Diego-2001-temp.csv',
 'data\\earthpy-downloads\\avg-monthly-temp-fahr\\San-Diego\\San-Diego-2002-temp.csv',
 'data\\earthpy-downloads\\avg-monthly-temp-fahr\\San-Diego\\San-Diego-2003-temp.csv']

If you only want .csv files, then *.csv will return every file that ends with .csv.

In [27]:
# Get only csv files
glob(os.path.join(data_folder, 'San-Diego', '*.csv'))

['data\\earthpy-downloads\\avg-monthly-temp-fahr\\San-Diego\\San-Diego-1999-temp.csv',
 'data\\earthpy-downloads\\avg-monthly-temp-fahr\\San-Diego\\San-Diego-2000-temp.csv',
 'data\\earthpy-downloads\\avg-monthly-temp-fahr\\San-Diego\\San-Diego-2001-temp.csv',
 'data\\earthpy-downloads\\avg-monthly-temp-fahr\\San-Diego\\San-Diego-2002-temp.csv',
 'data\\earthpy-downloads\\avg-monthly-temp-fahr\\San-Diego\\San-Diego-2003-temp.csv']

If you only want .csv files with the number 2 somewhere in the file name, then *2*.csv will return that list.

In [28]:
# Use multiple wildcards
glob(os.path.join(data_folder, 'San-Diego', '*2*.csv'))

['data\\earthpy-downloads\\avg-monthly-temp-fahr\\San-Diego\\San-Diego-2000-temp.csv',
 'data\\earthpy-downloads\\avg-monthly-temp-fahr\\San-Diego\\San-Diego-2001-temp.csv',
 'data\\earthpy-downloads\\avg-monthly-temp-fahr\\San-Diego\\San-Diego-2002-temp.csv',
 'data\\earthpy-downloads\\avg-monthly-temp-fahr\\San-Diego\\San-Diego-2003-temp.csv']

Note that 2*.csv would only return files that start with the number 2.

In [29]:
# Create empty list (no file names begin with 2)
glob(os.path.join(data_folder, 'San-Diego', '2*.csv'))

[]

The additional asterisk in front of 2 (e.g. *2*.csv) allows the 2 to be anywhere in the file name.

The * is meant to replace all text that does not matter to your search.

#### Recursive Searches
If you are trying to operate on files across multiple directories, you can use multiple * in a file path to indicate that you want every file in all folders in a directory.

The first * is to access all directories in the starting directory (e.g. data_folder).

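This is followed by the second * operator, which matches everything inside each of those subdirectories. Note that this pattern reaches exactly one level below the starting directory; it does not descend into deeper levels.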

In [30]:
# Search recursively through both site folders
glob(os.path.join(data_folder, '*', '*'))

['data\\earthpy-downloads\\avg-monthly-temp-fahr\\San-Diego\\San-Diego-1999-temp.csv',
 'data\\earthpy-downloads\\avg-monthly-temp-fahr\\San-Diego\\San-Diego-2000-temp.csv',
 'data\\earthpy-downloads\\avg-monthly-temp-fahr\\San-Diego\\San-Diego-2001-temp.csv',
 'data\\earthpy-downloads\\avg-monthly-temp-fahr\\San-Diego\\San-Diego-2002-temp.csv',
 'data\\earthpy-downloads\\avg-monthly-temp-fahr\\San-Diego\\San-Diego-2003-temp.csv',
 'data\\earthpy-downloads\\avg-monthly-temp-fahr\\Sonoma\\Sonoma-1999-temp.csv',
 'data\\earthpy-downloads\\avg-monthly-temp-fahr\\Sonoma\\Sonoma-2000-temp.csv',
 'data\\earthpy-downloads\\avg-monthly-temp-fahr\\Sonoma\\Sonoma-2001-temp.csv',
 'data\\earthpy-downloads\\avg-monthly-temp-fahr\\Sonoma\\Sonoma-2002-temp.csv',
 'data\\earthpy-downloads\\avg-monthly-temp-fahr\\Sonoma\\Sonoma-2003-temp.csv']
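For directory trees of unknown depth, glob also accepts the ** pattern together with the recursive=True argument. The sketch below builds a small temporary tree (the file and folder names are made up for illustration) and compares a one-level search with a fully recursive one.

```python
import os
import tempfile
from glob import glob

with tempfile.TemporaryDirectory() as root:
    # Hypothetical tree: root/a.csv, root/site/b.csv, root/site/deep/c.csv
    os.makedirs(os.path.join(root, "site", "deep"))
    for parts in [("a.csv",), ("site", "b.csv"), ("site", "deep", "c.csv")]:
        open(os.path.join(root, *parts), "w").close()

    # '*' followed by '*.csv' matches files exactly one directory down
    one_level = glob(os.path.join(root, "*", "*.csv"))

    # '**' with recursive=True matches zero or more directory levels
    all_levels = glob(os.path.join(root, "**", "*.csv"), recursive=True)

    one_level_names = sorted(os.path.basename(p) for p in one_level)
    all_level_names = sorted(os.path.basename(p) for p in all_levels)

print(one_level_names)  # ['b.csv']
print(all_level_names)  # ['a.csv', 'b.csv', 'c.csv']
```

Without recursive=True, ** behaves like a single *, so the argument is required for the recursive behavior.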

### Sorting glob Lists
Notice that the lists provided by glob are not guaranteed to be sorted, even when the output happens to appear in order, as it does below.

In [31]:
# Get list of CSVs in Sonoma directory
sonoma_files = glob(os.path.join(data_folder, 'Sonoma', '*.csv'))
sonoma_files

['data\\earthpy-downloads\\avg-monthly-temp-fahr\\Sonoma\\Sonoma-1999-temp.csv',
 'data\\earthpy-downloads\\avg-monthly-temp-fahr\\Sonoma\\Sonoma-2000-temp.csv',
 'data\\earthpy-downloads\\avg-monthly-temp-fahr\\Sonoma\\Sonoma-2001-temp.csv',
 'data\\earthpy-downloads\\avg-monthly-temp-fahr\\Sonoma\\Sonoma-2002-temp.csv',
 'data\\earthpy-downloads\\avg-monthly-temp-fahr\\Sonoma\\Sonoma-2003-temp.csv']

If it’s important for a list to be in a certain order, then always make sure to sort the list returned by glob using the .sort() method for lists.

In [32]:
# Sort glob list
sonoma_files.sort()

In [33]:
# Another option for sorting lists
sonoma_files = sorted(glob(os.path.join(data_folder, 'Sonoma', '*.csv')))
sonoma_files

['data\\earthpy-downloads\\avg-monthly-temp-fahr\\Sonoma\\Sonoma-1999-temp.csv',
 'data\\earthpy-downloads\\avg-monthly-temp-fahr\\Sonoma\\Sonoma-2000-temp.csv',
 'data\\earthpy-downloads\\avg-monthly-temp-fahr\\Sonoma\\Sonoma-2001-temp.csv',
 'data\\earthpy-downloads\\avg-monthly-temp-fahr\\Sonoma\\Sonoma-2002-temp.csv',
 'data\\earthpy-downloads\\avg-monthly-temp-fahr\\Sonoma\\Sonoma-2003-temp.csv']

### Why Sort glob Lists?
The order in which glob returns files from a folder can vary drastically. Depending on the operating system being used, or the way the files are stored, different people may get results from a glob list in different orders.

For example, consider how sorting a glob list changes what files you access when getting an index from the list, such as index [4] to access the 5th item in the list.



In [34]:
unsorted_sonoma = glob(os.path.join(data_folder, 'Sonoma', '*'))
print(unsorted_sonoma[4])

data\earthpy-downloads\avg-monthly-temp-fahr\Sonoma\Sonoma-2003-temp.csv


In [35]:
# Indexes can change once a list is sorted (here the order happens to match)
sorted_sonoma = glob(os.path.join(data_folder, 'Sonoma', '*'))
sorted_sonoma.sort() 

print(sorted_sonoma[4])

data\earthpy-downloads\avg-monthly-temp-fahr\Sonoma\Sonoma-2003-temp.csv
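On the machine used above, the unsorted and sorted lists happen to agree. To illustrate what can go wrong elsewhere, the sketch below simulates a different return order by reversing a list of file names; no files are read, and the reversed order is an assumption chosen only for demonstration.

```python
# File names as they might appear in the Sonoma folder
names = [f"Sonoma-{year}-temp.csv" for year in range(1999, 2004)]

# Simulate a system that returns the files in a different order
arbitrary_order = names[::-1]

print(arbitrary_order[4])          # Sonoma-1999-temp.csv
print(sorted(arbitrary_order)[4])  # Sonoma-2003-temp.csv
```

Sorting first means that index [4] refers to the same file no matter which order the operating system reports.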


### Using Ranges
In addition to using * to specify which parts of a file name are important to you, you can use [] to specify a range of characters to search for.

For example, you can create a search for all files with 2001 to 2003 in the name by using *200 and adding [1-3]* to it.

In [37]:
# Get files for 2001-2003
glob(os.path.join(data_folder, '*', '*200[1-3]*'))

['data\\earthpy-downloads\\avg-monthly-temp-fahr\\San-Diego\\San-Diego-2001-temp.csv',
 'data\\earthpy-downloads\\avg-monthly-temp-fahr\\San-Diego\\San-Diego-2002-temp.csv',
 'data\\earthpy-downloads\\avg-monthly-temp-fahr\\San-Diego\\San-Diego-2003-temp.csv',
 'data\\earthpy-downloads\\avg-monthly-temp-fahr\\Sonoma\\Sonoma-2001-temp.csv',
 'data\\earthpy-downloads\\avg-monthly-temp-fahr\\Sonoma\\Sonoma-2002-temp.csv',
 'data\\earthpy-downloads\\avg-monthly-temp-fahr\\Sonoma\\Sonoma-2003-temp.csv']
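Square brackets can also hold a set of individual characters, or a negated set that starts with !. The sketch below creates empty stand-in files in a temporary folder (so it does not depend on the downloaded data) and tries three bracket patterns; the helper name matching_names is hypothetical.

```python
import os
import tempfile
from glob import glob

with tempfile.TemporaryDirectory() as folder:
    # Empty stand-in files named like the Sonoma data
    for year in range(1999, 2004):
        open(os.path.join(folder, f"Sonoma-{year}-temp.csv"), "w").close()

    def matching_names(pattern):
        """Return sorted file names in `folder` that match `pattern`."""
        matches = glob(os.path.join(folder, pattern))
        return sorted(os.path.basename(p) for p in matches)

    range_match = matching_names("*200[1-3]*")   # any of 1, 2, 3
    set_match = matching_names("*200[03]*")      # either 0 or 3
    negated_match = matching_names("*200[!0]*")  # anything except 0

print(range_match)    # 2001, 2002 and 2003
print(set_match)      # 2000 and 2003
print(negated_match)  # 2001, 2002 and 2003
```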

### ? Operator
The ? operator functions similarly to the * operator but is used for a single character.

If one character in the file name can be variable, but everything else must stay the same, then ? is a good way to just replace that one character.

In [39]:
# ? operator used for last value in year
glob(os.path.join(data_folder, 'Sonoma', '*200?-temp.csv'))

['data\\earthpy-downloads\\avg-monthly-temp-fahr\\Sonoma\\Sonoma-2000-temp.csv',
 'data\\earthpy-downloads\\avg-monthly-temp-fahr\\Sonoma\\Sonoma-2001-temp.csv',
 'data\\earthpy-downloads\\avg-monthly-temp-fahr\\Sonoma\\Sonoma-2002-temp.csv',
 'data\\earthpy-downloads\\avg-monthly-temp-fahr\\Sonoma\\Sonoma-2003-temp.csv']

In [40]:
# Multiple ? operators
glob(os.path.join(data_folder, 'Sonoma', '*19??-temp.csv'))

['data\\earthpy-downloads\\avg-monthly-temp-fahr\\Sonoma\\Sonoma-1999-temp.csv']
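A pattern can be tried out against a single name, without touching the file system, by using the standard fnmatch module, which follows the same wildcard rules as glob. This is a minimal sketch for experimenting with patterns; it tests one file name taken from the output above.

```python
from fnmatch import fnmatchcase

name = "Sonoma-1999-temp.csv"

# Each ? stands for exactly one character
print(fnmatchcase(name, "*19??-temp.csv"))  # True
print(fnmatchcase(name, "*200?-temp.csv"))  # False
print(fnmatchcase(name, "*19?-temp.csv"))   # False (only one ? for two digits)
```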

#### Saving a glob Output to a Variable
To use the output of glob later in a script, be sure to save it to a variable! You can do this by assigning the output of the glob function to a variable name.

In [41]:
sd_data = glob(os.path.join(data_folder, 'San-Diego', '*'))
sd_data.sort()

sd_data

['data\\earthpy-downloads\\avg-monthly-temp-fahr\\San-Diego\\San-Diego-1999-temp.csv',
 'data\\earthpy-downloads\\avg-monthly-temp-fahr\\San-Diego\\San-Diego-2000-temp.csv',
 'data\\earthpy-downloads\\avg-monthly-temp-fahr\\San-Diego\\San-Diego-2001-temp.csv',
 'data\\earthpy-downloads\\avg-monthly-temp-fahr\\San-Diego\\San-Diego-2002-temp.csv',
 'data\\earthpy-downloads\\avg-monthly-temp-fahr\\San-Diego\\San-Diego-2003-temp.csv']

### os Advanced Functionality
os is another very powerful tool and has additional functionality that can be useful when dealing with file paths, such as advanced parsing abilities.

For example, os.path.normpath() is a great way to clean up file paths. It takes out any unnecessary characters to make the path more easily read.

It is a good way to make sure your path is properly formatted before using other os functions on the path.

In [42]:
# Example of normpath cleaning up path
example_path = "home//user//example_dir"
os.path.normpath(example_path)

'home\\user\\example_dir'
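Besides doubled separators, normpath() also collapses "." and ".." components and drops a trailing separator. The sketch below uses a made-up path and compares the result with a path built by os.path.join(), so the check holds on any operating system.

```python
import os

# A messy, hypothetical path with redundant components
messy_path = "home/user/../user/./example_dir/"
clean_path = os.path.normpath(messy_path)
print(clean_path)

# The cleaned path equals the straightforward version
print(clean_path == os.path.join("home", "user", "example_dir"))  # True
```

normpath() works purely on the text of the path and does not look at the file system, so it cleans a path without confirming that the path exists.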

os.path.commonpath() is very useful when combined with glob. This function will take a list of file paths and find the lowest directory that all the files have in common.

So if there were two files, one stored in home/user/dir/dir2/example.txt and one stored in home/user/dir/example.txt, then os.path.commonpath() would return home/user/dir as it’s the lowest common directory the two files share.

In [43]:
# Print list of files
sd_data

['data\\earthpy-downloads\\avg-monthly-temp-fahr\\San-Diego\\San-Diego-1999-temp.csv',
 'data\\earthpy-downloads\\avg-monthly-temp-fahr\\San-Diego\\San-Diego-2000-temp.csv',
 'data\\earthpy-downloads\\avg-monthly-temp-fahr\\San-Diego\\San-Diego-2001-temp.csv',
 'data\\earthpy-downloads\\avg-monthly-temp-fahr\\San-Diego\\San-Diego-2002-temp.csv',
 'data\\earthpy-downloads\\avg-monthly-temp-fahr\\San-Diego\\San-Diego-2003-temp.csv']

In [44]:
# Get a shared directory from a list of files
os.path.commonpath(sd_data)

'data\\earthpy-downloads\\avg-monthly-temp-fahr\\San-Diego'
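The two-file example described above can be reproduced without any real files, since commonpath() only compares the path strings. The sketch also contrasts it with os.path.commonprefix(), a related function that compares character by character and can therefore return something that is not a real directory.

```python
import os

paths = [
    os.path.join("home", "user", "dir", "dir2", "example.txt"),
    os.path.join("home", "user", "dir", "example.txt"),
]

shared = os.path.commonpath(paths)
print(shared == os.path.join("home", "user", "dir"))  # True

# commonprefix() compares characters, not path components
sites = [os.path.join("data", "San-Diego"), os.path.join("data", "Sonoma")]
print(os.path.commonprefix(sites))  # ends with "S", not a directory name
print(os.path.commonpath(sites))    # data
```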

os.path.basename() finds the last section of a path and returns that. If a file path is passed in, the file name will be parsed out and returned.

In [45]:
# Print normalized path
os.path.normpath(data_folder)

'data\\earthpy-downloads\\avg-monthly-temp-fahr'

In [46]:
# Get the last part of a file path with basename
os.path.basename(os.path.normpath(data_folder))

'avg-monthly-temp-fahr'

os.path.split() will split a path into two parts:

- the last part of the path.
- the rest of the path.

It returns a tuple: the rest of the path comes first, followed by the last part, which is the same value that os.path.basename() returns.

In [47]:
# Get the last part of a file path and the rest of the path
os.path.split(os.path.normpath(data_folder))

('data\\earthpy-downloads', 'avg-monthly-temp-fahr')

You can then use indexing on the result to get each piece of the split path.

In [48]:
os.path.split(os.path.normpath(data_folder))[0]

'data\\earthpy-downloads'

In [49]:
os.path.split(os.path.normpath(data_folder))[1]

'avg-monthly-temp-fahr'
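Instead of indexing the result, the two pieces can be unpacked into named variables in one step. The sketch below also uses os.path.splitext(), a related function that separates a file name from its extension; the path is assembled from names in the examples above and does not need to exist.

```python
import os

file_path = os.path.join("data", "San-Diego", "San-Diego-1999-temp.csv")

# Unpack the tuple returned by split() into two variables
folder, file_name = os.path.split(file_path)
print(file_name)  # San-Diego-1999-temp.csv

# splitext() splits off the extension, including the dot
stem, extension = os.path.splitext(file_name)
print(stem)       # San-Diego-1999-temp
print(extension)  # .csv
```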

## String Manipulation
Recall that when you create a file path using os.path.join(), it will properly format a string of the file path, so it can be used on any operating system.

Note, however, that the file path is still just a string. Thus, you can parse file paths, just like you would strings, and extract information from them that you may need for a project.

.split() is a built-in Python string method that splits a string into a list of strings based on a separator character, and can be used in combination with os.sep to separate directories in file paths into their base parts. os.sep is a data value stored in os that holds the character used to separate pathname components, such as directory or file names. This is \ for Windows (displayed as \\ when Python shows the string, because the backslash is escaped) and / for POSIX systems, such as Mac or Linux.

In [51]:
# Separate a path into parts
file_path_list = data_folder.split(os.sep)
file_path_list

['data', 'earthpy-downloads', 'avg-monthly-temp-fahr']

In [52]:
file_path_list[2]

'avg-monthly-temp-fahr'

In addition to built-in functions, file paths can be parsed with string[start_index:end_index] like a normal string. This can help get important information from a file path, such as a date.

In [54]:
# Print list of files
sd_data

['data\\earthpy-downloads\\avg-monthly-temp-fahr\\San-Diego\\San-Diego-1999-temp.csv',
 'data\\earthpy-downloads\\avg-monthly-temp-fahr\\San-Diego\\San-Diego-2000-temp.csv',
 'data\\earthpy-downloads\\avg-monthly-temp-fahr\\San-Diego\\San-Diego-2001-temp.csv',
 'data\\earthpy-downloads\\avg-monthly-temp-fahr\\San-Diego\\San-Diego-2002-temp.csv',
 'data\\earthpy-downloads\\avg-monthly-temp-fahr\\San-Diego\\San-Diego-2003-temp.csv']

In [55]:
# Get file name
year_path = sd_data[0]
file_name = os.path.basename(year_path)
print(file_name)

San-Diego-1999-temp.csv


In [56]:
# Parse a date from file name
year = file_name[10:14]
print(year)

1999


Notice that the range includes the first index value but not the second index value (e.g. 1999 are index values 10 through 13).
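Fixed index positions only work while every file name has the same layout. The Sonoma files, for example, have a shorter site name, so the same slice lands on different characters. One alternative, sketched below with the standard re module, is to search for the first run of four digits; the function name extract_year is hypothetical.

```python
import re

file_names = ["San-Diego-1999-temp.csv", "Sonoma-1999-temp.csv"]

# The fixed slice works for San-Diego but not for Sonoma
print(file_names[0][10:14])  # 1999
print(file_names[1][10:14])  # 9-te


def extract_year(file_name):
    """Return the first four-digit number in a file name, or None."""
    match = re.search(r"\d{4}", file_name)
    return int(match.group()) if match else None


for file_name in file_names:
    print(extract_year(file_name))  # 1999 for both names
```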