# Using the file system

Learning Objectives:
By the end of this notebook, you should be able to:
1. Use the `os` module to write platform-independent scripts to access information about the file system
2. Copy files using the `shutil` module
3. Use the `glob` module to write Unix-type file search commands
4. Use the `subprocess` module to run shell commands from Python

**Import Modules for this Notebook**

In the notebook introducing [modules](https://profmikewood.github.io/intro_to_python_book/code_organization/modules.html), we imported modules as we needed them. However, it is good style to import all of the modules you need in your notebook (or other scripts) in one import block near the top of the file. For this notebook, we will use 4 modules:

In [1]:
# import the os, shutil, glob, and subprocess modules
import os
import shutil
import glob
import subprocess

## The `os` module and paths

Python's built-in `os` module is a useful tool to access information about the file system. Many of the `os` functions mimic common shell commands but return Python objects that can be used in Python code. Let's take a look at a few examples:

### The Current Working Directory

In [2]:
# check your current working directory
print(os.getcwd())

/Users/mike/Documents/Teaching/Github/intro_to_python_book/files


Questions:
1. What kind of path is this (absolute or relative)?
2. What is the equivalent command in your terminal?

### Contents of the Current Working Directory

In [3]:
# check the contents of your current working directory
print(os.listdir())

['float.bin', 'Monterey Bay Buoy Data.txt', '.DS_Store', 'date_data.pickle', '2023_0101.txt', 'data_file', 'number_list.bin', 'date_dict.pickle', 'year_data.csv', 'using_the_filesystem.ipynb', 'organized_data', 'Monterey Bay Biggest Wave 2022.csv', 'readme.txt', '.ipynb_checkpoints', 'data', 'file-io.ipynb']


Questions:
1. What files are visible in your file system?
2. What is the equivalent command in your terminal?

### Paths on your machine
Paths on your machine provide the address where certain data is stored. For example, in the above list, we find that there is a directory called `data` in our present working directory. If we would like to provide a path to this folder, we just need to append `data` to our current path. However, different operating systems use different formats for the string representation of paths (for example, `/` as the separator on macOS and Linux and `\` on Windows). The `os` module gives us a convenient way to write platform-independent paths.

In [4]:
# create an absolute path to the data folder
data_folder = os.path.join(os.getcwd(),'data')

# print out the data folder path
print(data_folder)

# print the number of files in the data folder
print(len(os.listdir(data_folder)))

/Users/mike/Documents/Teaching/Github/intro_to_python_book/files/data
3
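
Beyond `os.path.join`, the `os.path` submodule has several helpers for taking paths apart again. The following is a minimal sketch; the folder and file names are only illustrative pieces of a path and do not need to exist on your machine.

```python
import os

# hypothetical path pieces, used only for illustration
folder = os.path.join('files', 'data')
file_path = os.path.join(folder, '2023_0101.txt')

# os.sep is the separator that os.path.join inserts on this platform
print(os.sep)
print(file_path)

# split a path back into its pieces
print(os.path.dirname(file_path))    # the folder part
print(os.path.basename(file_path))   # the file name part
print(os.path.splitext(file_path))   # (path without extension, extension)

# convert a relative path to an absolute path using the current working directory
print(os.path.abspath(file_path))
```

Because the separator comes from `os.sep`, the same code prints forward slashes on macOS or Linux and backslashes on Windows.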


### &#x1F914; Mini-Exercise
Goal: Get a list of all Jupyter Notebooks we've written in CS 122 so far using the `os` module.

In [5]:
# set an absolute path to your CS 122 directory
full_path = '/Users/mike/Documents/SJSU/CS_122'

# loop through the contents of your Lecture directory to get a list
# of directories corresponding to each of the lectures we've had so far
lectures = []
for lecture in os.listdir(os.path.join(full_path,'Lecture')):
    if 'Lecture' in lecture:
        lectures.append(lecture)

# for each lecture directory, loop through the contents and see
# if any of the files have the extension '.ipynb'. Note also that
# there may be hidden directories for ipynb checkpoints, so be careful
# about how you check the file names
notebooks = []
for lecture in lectures:
    for file_name in os.listdir(os.path.join(full_path,'Lecture',lecture)):
        if file_name[-6:]=='.ipynb':
            notebooks.append(file_name)

# print out the list of notebooks, one per line
for notebook in notebooks:
    print(notebook)

CS 122 Lecture 5-1.ipynb
CS 122 Lecture 2-1.ipynb
CS 122 Lecture 1 Notebook.ipynb
CS 122 Lecture 4-2.ipynb
CS 122 Lecture 3-1b.ipynb
CS 122 Lecture 3-1a.ipynb
CS 122 Lecture 2-2b.ipynb
CS 122 Lecture 2-2a.ipynb
CS 122 Lecture 4-1.ipynb
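
As an alternative to nested `os.listdir` loops, `os.walk` visits every directory below a starting point. The sketch below builds a small throwaway directory tree with `tempfile` so it can be run anywhere; the lecture and file names are hypothetical and only stand in for a real course folder.

```python
import os
import tempfile

# build a small throwaway directory tree (all names here are hypothetical)
with tempfile.TemporaryDirectory() as root:
    for lecture in ['Lecture 1', 'Lecture 2']:
        checkpoint_dir = os.path.join(root, lecture, '.ipynb_checkpoints')
        os.makedirs(checkpoint_dir)
        # create an empty notebook file and an empty checkpoint file
        open(os.path.join(root, lecture, lecture + '.ipynb'), 'w').close()
        open(os.path.join(checkpoint_dir, lecture + '-checkpoint.ipynb'), 'w').close()

    notebooks = []
    for dir_path, dir_names, file_names in os.walk(root):
        # skip hidden directories by removing them from dir_names in place
        dir_names[:] = [d for d in dir_names if not d.startswith('.')]
        for file_name in file_names:
            if file_name.endswith('.ipynb'):
                notebooks.append(file_name)

    # print the notebooks that were found, in alphabetical order
    print(sorted(notebooks))
```

Here `str.endswith` replaces the slice comparison, and pruning `dir_names` keeps the checkpoint copies out of the list.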


### Making new directories
The `os` module gives us the functionality to modify our file system. For example, we can make a new directory given an absolute (or relative) path.

In [6]:
# define a path for a new organized_data directory
organized_data = os.path.join(os.getcwd(),'organized_data')

# make a new directory called organized_data in the present working directory
if not os.path.exists(organized_data):
    os.mkdir(organized_data)

# the os.path.exists check above ensures that the directory is only made if it does not already exist

Question: What is the equivalent command in the terminal?
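
If you need to create nested directories, or would like to skip the explicit existence check, `os.makedirs` with `exist_ok=True` is one option. This is a minimal sketch that works inside a temporary directory; the folder names are hypothetical.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as sandbox:
    # hypothetical nested path: organized_data/2022_01
    nested = os.path.join(sandbox, 'organized_data', '2022_01')

    # creates both organized_data and 2022_01 in one call
    os.makedirs(nested, exist_ok=True)

    # calling it again does not raise an error because exist_ok=True
    os.makedirs(nested, exist_ok=True)

    created = os.path.isdir(nested)
    print(created)
```

By contrast, `os.mkdir` creates only one level at a time and raises a `FileExistsError` if the directory is already there.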

### Moving files to new directories
The `os` module also provides a means to move files from one location on your machine to another: the `rename` function.

In the following code block, we will practice creating directories and moving files using the 2022 data in the `data` folder. Begin by looping through the `data` directory and generating a folder in the `organized_data` directory for each month (e.g. 2022_01, 2022_02, etc.). Then, move each 2022 file from the `data` directory into its corresponding month folder in the `organized_data` directory.

In [7]:
# make a new folder in the organized_data folder for each month in 2022
for file_name in os.listdir(data_folder):

    # check that the file is from 2022
    if file_name[:4]=='2022':

        # define the name of a new folder in the format YYYY_MM
        year_month = file_name[:7]

        # if this year_month is not yet in the organized_data directory, then make it
        if year_month not in os.listdir(organized_data):
            os.mkdir(os.path.join(organized_data, year_month))

        # move the file into the year_month folder
        # define the src_path and the dest_path
        # then, move the file
        src_path = os.path.join(data_folder, file_name)
        dest_path = os.path.join(organized_data, year_month, file_name)
        os.rename(src_path, dest_path)
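
Since moving files changes your file system, it can help to rehearse the same pattern in a temporary directory first. The sketch below assumes file names in the `YYYY_MMDD.txt` format; the two example files are hypothetical and are created empty.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as sandbox:
    data_dir = os.path.join(sandbox, 'data')
    organized_dir = os.path.join(sandbox, 'organized_data')
    os.mkdir(data_dir)
    os.mkdir(organized_dir)

    # make two empty example files (hypothetical names)
    for name in ['2022_0101.txt', '2022_0215.txt']:
        open(os.path.join(data_dir, name), 'w').close()

    # move each file into a folder named after its year and month
    for file_name in os.listdir(data_dir):
        year_month = file_name[:7]
        month_dir = os.path.join(organized_dir, year_month)
        os.makedirs(month_dir, exist_ok=True)
        os.rename(os.path.join(data_dir, file_name),
                  os.path.join(month_dir, file_name))

    # the data folder is now empty and there is one folder per month
    remaining = os.listdir(data_dir)
    month_folders = sorted(os.listdir(organized_dir))
    print(remaining)
    print(month_folders)
```

The temporary directory and everything in it are deleted when the `with` block ends, so nothing is left behind.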

## The `shutil` module
The `shutil` module provides the utility to make copies of files on your file system. There are three main functions used for copying files, as follows:

|  | copyfile | copy | copy2 |
| -- | -------- | ---- | ----- |
| Destination can be a directory | N | Y | Y |
| Copies metadata | N | N | Y |
| Copies permissions | N | Y | Y |

In [8]:
# define a path to the source data file 2023_0101.txt in data
src_path = os.path.join(data_folder, '2023_0101.txt')

# define a destination path to the current directory with the file name
dst_path = os.path.join(os.getcwd(),'2023_0101.txt')

# try the copyfile method with the dst path
# what happens if you just provide the current directory?
shutil.copyfile(src_path,dst_path)

# try the copy method with the dst path
shutil.copy(src_path,dst_path)

# try the copy2 method with the dst path
shutil.copy2(src_path,dst_path)

'/Users/mike/Documents/Teaching/Github/intro_to_python_book/files/2023_0101.txt'
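
The differences in the table can be checked directly. The following sketch works in a temporary directory with a hypothetical `source.txt` file, and uses the modification time as one example of file metadata.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as sandbox:
    src = os.path.join(sandbox, 'source.txt')
    with open(src, 'w') as f:
        f.write('example data')

    # set the source file's modification time to a fixed date in the past
    old_time = 1_000_000_000  # seconds since the epoch (September 2001)
    os.utime(src, (old_time, old_time))

    dst_dir = os.path.join(sandbox, 'copies')
    os.mkdir(dst_dir)

    # copyfile needs a full file path; a directory as the destination raises an error
    try:
        shutil.copyfile(src, dst_dir)
        copyfile_to_dir_failed = False
    except OSError:
        copyfile_to_dir_failed = True

    # copy accepts a directory and returns the path of the new file
    copy_path = shutil.copy(src, dst_dir)

    # copy2 also preserves metadata such as the modification time
    copy2_path = shutil.copy2(src, os.path.join(dst_dir, 'source_copy2.txt'))

    copy_mtime = os.path.getmtime(copy_path)
    copy2_mtime = os.path.getmtime(copy2_path)
    print(copyfile_to_dir_failed)
    print(copy_mtime, copy2_mtime)
```

The file made with `copy` gets a fresh modification time, while the file made with `copy2` keeps the time of the source file.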

### &#x1F914; Mini-Exercise
Modify the code above to make copies of the 2023 data in monthly directories in the `organized_data` directory.

In [9]:
# make a new folder in the organized_data folder for each month in 2023
for file_name in os.listdir(data_folder):

    # check that the file is from 2023
    if file_name[:4]=='2023':

        # define the name of a new folder in the format YYYY_MM
        year_month = file_name[:7]

        # if this year_month is not yet in the organized_data directory, then make it
        if year_month not in os.listdir(organized_data):
            os.mkdir(os.path.join(organized_data, year_month))

        # make a copy of the file in the year_month folder
        # define the src_path and the dest_path
        # then, copy the file using one of the shutil functions
        src_path = os.path.join(data_folder, file_name)
        dest_path = os.path.join(organized_data, year_month, file_name)
        shutil.copyfile(src_path, dest_path)

## Overview: Python Commands vs Unix Shell Commands

| Python | Unix | Purpose |
| ------ | ---- | ------- |
| os.getcwd() | pwd | Determine the current/present working directory |
| os.chdir() | cd | Change directory |
| os.mkdir() | mkdir | Make a directory |
| os.rename() | mv | Rename a file or move to a new location |
| os.listdir() | ls | List the files and folders in a directory |
| shutil.copy() | cp | Copy a file to a new location |

## The `glob` module
When using Unix-type shell commands, wildcard symbols are extremely useful for finding and accessing subsets of files. There are 2 main wildcard symbols:

| symbol | use |
| ------ | --- |
| `?`    | Wildcard for exactly one character |
| `*`    | Wildcard for any number of characters (including none) |

Try these in the `data` directory in your shell:
1. How would you determine the names of files that correspond to the first day of each month in 2023?
2. How would you determine the name of all files that correspond to December of 2023?

The `glob` module lets you run Unix-style searches of your file system from Python.

In [10]:
# find all file names that correspond to the first day of each month in 2023
search_path = os.path.join(data_folder,'2023_??01.txt')
glob.glob(search_path)

# find all files in December 2023 (note: '2023_1*.txt' would also match October and November)
search_path = os.path.join(data_folder,'2023_12*.txt')
glob.glob(search_path)


[]
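
To see how the wildcards behave on a known set of files, here is a small sketch that creates a few empty files in a temporary directory. The file names are hypothetical but follow the `YYYY_MMDD.txt` format used in the `data` directory.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as sandbox:
    # hypothetical file names in the YYYY_MMDD.txt format
    names = ['2023_0101.txt', '2023_0115.txt', '2023_1001.txt',
             '2023_1201.txt', '2023_1225.txt']
    for name in names:
        open(os.path.join(sandbox, name), 'w').close()

    # '?' matches exactly one character, so '??01' is the first day of any month
    first_days = glob.glob(os.path.join(sandbox, '2023_??01.txt'))

    # '12*' matches December only, while '1*' also matches months 10 and 11
    december = glob.glob(os.path.join(sandbox, '2023_12*.txt'))
    oct_to_dec = glob.glob(os.path.join(sandbox, '2023_1*.txt'))

    # glob does not guarantee an order, so sort before printing
    first_days = sorted(os.path.basename(p) for p in first_days)
    december = sorted(os.path.basename(p) for p in december)
    oct_to_dec = sorted(os.path.basename(p) for p in oct_to_dec)
    print(first_days)
    print(december)
    print(oct_to_dec)
```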

### &#x1F914; Mini-Exercise
Goal: Get a list of all Jupyter Notebooks we've written in CS 122 so far using the `glob` module.

In [11]:
# define a search path
test_path = '/Users/mike/Documents/SJSU/CS_122/Lecture/Lecture*/*.ipynb'

# use the glob module to get the list of paths for the notebooks
matching_paths = glob.glob(test_path)

# make a loop to just get the file name (not the whole path)
notebooks = []
for file_path in matching_paths:
    notebooks.append(os.path.basename(file_path))

# print the notebook files, line by line
for notebook in notebooks:
    print(notebook)

CS 122 Lecture 5-1.ipynb
CS 122 Lecture 2-1.ipynb
CS 122 Lecture 1 Notebook.ipynb
CS 122 Lecture 4-2.ipynb
CS 122 Lecture 3-1b.ipynb
CS 122 Lecture 3-1a.ipynb
CS 122 Lecture 2-2b.ipynb
CS 122 Lecture 2-2a.ipynb
CS 122 Lecture 4-1.ipynb


## The `subprocess` module
The final module we will investigate in this notebook is the `subprocess` module. It's not necessarily related to using the file system, but it's related to accessing the terminal and running shell scripts from Python, so it fits within the theme of this notebook.

The most useful tool in the `subprocess` module is the `Popen` class.

In [12]:
# run the shell command `ls -l` to list the files in the current directory
p = subprocess.Popen(['ls','-l'])
# p.communicate()

# communicate() returns a tuple of (standard output, standard error)
# to capture the output in Python, "pipe" the stdout
# (error will be None here because stderr is not piped)
p = subprocess.Popen(['ls','-l'], stdout=subprocess.PIPE)
output, error = p.communicate()

# print the type of output
print(type(output))

# convert the type of the output to a string
output = output.decode()
print(type(output))

# split the output and print line by line
for line in output.split('\n'):
    print(line)

total 2784
-rw-r--r--  1 mike  staff       23 Sep 18  2023 2023_0101.txt
-rw-r--r--@ 1 mike  staff      230 Feb 20 10:25 Monterey Bay Biggest Wave 2022.csv
-rw-r--r--@ 1 mike  staff  1341408 Sep 19  2023 Monterey Bay Buoy Data.txt
drwxr-xr-x  5 mike  staff      160 Feb 18 14:35 data
-rw-r--r--@ 1 mike  staff     1904 Sep 20  2023 data_file
-rw-r--r--  1 mike  staff      311 Feb 20 10:25 date_data.pickle
-rw-r--r--  1 mike  staff      112 Feb 20 10:25 date_dict.pickle
-rw-r--r--@ 1 mike  staff    22414 Feb 20 10:25 file-io.ipynb
-rw-r--r--  1 mike  staff       32 Feb 20 10:25 float.bin
-rw-r--r--@ 1 mike  staff        6 Feb 20 10:25 number_list.bin
drwxr-xr-x  3 mike  staff       96 Feb 18 14:37 organized_data
-rw-r--r--  1 mike  staff       45 Feb 20 10:25 readme.txt
-rw-r--r--@ 1 mike  staff    19095 Feb 18 14:37 using_the_filesystem.ipynb
-rw-r--r--  1 mike  staff       33 Feb 20 10:25 year_data.csv
<class 'bytes'>
<class 'str'>
total 2784
-rw-r--r--  1 
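
For simple cases, `subprocess.run` wraps the `Popen` and `communicate` steps in a single call. The sketch below is one possible way to use it; it runs a short Python command instead of `ls` so that it also works on machines without a Unix shell, and the message text is arbitrary.

```python
import subprocess
import sys

# run a small Python command in a child process; sys.executable is the path
# to the Python interpreter that is running this code
result = subprocess.run(
    [sys.executable, '-c', 'print("hello from a subprocess")'],
    capture_output=True,  # collect stdout and stderr instead of printing them
    text=True,            # decode the output bytes to str
)

# a return code of 0 means the command finished without an error
print(result.returncode)
print(type(result.stdout))
print(result.stdout.strip())
```

With `text=True`, the output arrives as a `str`, so the separate `decode` step is not needed.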