# Notebook 3: Managing files

This notebook focuses on file management in Python. We shall touch on the following packages, as well as built-in Python functions:

 - `os`
 - `shutil`
 - `re`
 - `glob`

The `os` and `shutil` packages are particularly useful for manipulating files and directories whilst `glob` and `re` are designed to help manipulate paths and expressions.

### Table of Contents

 - [Notebook 0: Introduction](./nb_00_introduction.ipynb)
 - [Notebook 1: Datatypes, loops and logic](./nb_01_datatypes_loops_and_logic.ipynb)
 - [Notebook 2: Functions, modules and packages](./nb_02_functions_modules_and_packages.ipynb)
 - [**Notebook 3: Managing files**](./nb_03_managing_files.ipynb)
   - [Built in Python functions for reading and writing files](#Built-in-Python-functions-for-reading-and-writing-files)
   - [The os package](#The-os-package)
     - [Listing the contents of a directory](#Listing-the-contents-of-a-directory)
     - [Making directories with os](#Making-directories-with-os)
     - [Deleting directories with os](#Deleting-directories-with-os)
     - [Moving files with os](#Moving-files-with-os)
     - [Removing files with os](#Removing-files-with-os)
     - [Walking a directory tree with os](#Walking-a-directory-tree-with-os)
     - [Manipulating filepaths with os](#Manipulating-filepaths-with-os)
     - [Other useful os functions](#Other-useful-os-functions)
   - [The shutil package](#The-shutil-package)
     - [Removing directories with shutil](#Removing-directories-with-shutil)
     - [Copying directories with shutil](#Copying-directories-with-shutil)
   - [The re package](#The-re-package)
   - [The glob package](#The-glob-package)
   - [Exercises](#Exercises) (Recommended)
     
     
 - [Notebook 4: Numpy](./nb_04_numpy.ipynb)
 - [Notebook 5: Pandas](./nb_05_pandas.ipynb)

## Built in Python functions for reading and writing files

Before diving into packages for file management, we will first talk about the inbuilt Python functions for reading and writing files.

In Python, you can read and write files using the `with` keyword, `as` keyword and the `open` function. This may sound complex but the syntax is designed to be relatively intuitive:

In [21]:
# Open the file and read it
with open('./03_managing_files/file2.txt', 'r') as file:
    print(file.read())

1 2 3
4 5 6
7 8 9
10 11 12


Let's look a bit more at what just happened here. 

The `open` function took in two inputs; the filename and the mode. 
- The filename is the name of the file we want to look at. In this case we used a relative path (don't worry if you are not familiar with the term 'relative'; we will cover exactly what this is later).
- The mode tells Python how we wish to treat the file. For example, 'r' means read-only access, 'w' means write to the file (replacing any existing contents) and 'a' means append to the end of the file.

The `open` function has given us an object which we have named `file`. Using this object we can read and write to the file we are interested in. An example of how we can write to a file is given below (have a look at the file before and after running the below code to make sure you understand what is happening here).

In [22]:
# Open the file and append to it
with open('./03_managing_files/file2.txt', 'a') as file:
    file.write('\n oh look, we have added a new line')

 > **Note:** Much like in the loops and conditional statements in Notebook 1, the `with` statement must be consistently indented and have a correctly placed colon. Failure to do this will give some, potentially very confusing, errors!

 > **Note:** It is important to do all operations on the `file` object inside the indented section of the `with` statement. Outside the `with` statement the file object is no longer "connected" to the file in the sense that you can no longer read or write to the file using the file object.
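To make the modes and the note above concrete, here is a minimal sketch. It works inside a temporary directory (created with the standard `tempfile` module, which is not otherwise used in this notebook) so that none of the course files are touched. The file name `demo.txt` is made up for illustration, and `os.path.join` is covered later in this notebook.

```python
import os
import tempfile

# Work in a throwaway directory so no course files are modified
with tempfile.TemporaryDirectory() as tmpdir:
    demo_path = os.path.join(tmpdir, 'demo.txt')  # hypothetical file name

    # 'w' creates the file (or overwrites it if it already exists)
    with open(demo_path, 'w') as file:
        file.write('first line\n')

    # 'a' keeps the existing contents and adds to the end
    with open(demo_path, 'a') as file:
        file.write('second line\n')

    # 'r' opens the file for reading; looping over the file
    # object gives us one line at a time
    with open(demo_path, 'r') as file:
        lines = [line.rstrip('\n') for line in file]

    # Outside the `with` block the file has been closed for us
    is_closed = file.closed

print(lines)
print(is_closed)
```

Calling `file.read()` after the `with` block has finished would raise a `ValueError`, because the file has already been closed.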

## The `os` package

The letters `os` stand for Operating System. Perhaps unsurprisingly, the `os` package in Python deals with "Miscellaneous operating system interfaces". To get started let's import `os` and also `os.path` (which is particularly useful for path management).

In [23]:
import os
import os.path

One of the most useful features of the `os` package is its ability to get and set the working directory (the path/location you are currently executing your code from). For example;

In [24]:
# Get the current working directory
print("This is our current working directory:")
print(os.getcwd())

This is our current working directory:
/home/tommaullin/Documents/PythonIntro2019


In [25]:
# Set the current working directory to the 
# '03_managing_files' folder. We will need 
# to be in this directory for the rest of  
# this section.
os.chdir("03_managing_files")

# Print the new current working directory.
print("We have now changed directory to:")
print(os.getcwd())

We have now changed directory to:
/home/tommaullin/Documents/PythonIntro2019/03_managing_files


### Listing the contents of a directory

The `os` package also allows us to list the contents of a directory (for those familiar with `unix` commands this is very similar to the unix `ls` command) using the function `listdir`. For example;

In [26]:
# Get the current working directory
pwd = os.getcwd()
print("Present working directory: \n", pwd, "\n")

# List the contents of the directory
# (we use the join function to insert a newline
# character, `\n` between each entry in contents)
contents = os.listdir(pwd)
print("Directory Contents:") 
print('\n'.join(contents))

Present working directory: 
 /home/tommaullin/Documents/PythonIntro2019/03_managing_files 

Directory Contents:
file_undesired.txt
file1.txt
file2.txt
fileb.txt


 > **Note**: The function `os.scandir` operates in a similar manner to `os.listdir` and is a good alternative for directories with large contents.
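As a rough illustration of the difference, `os.scandir` yields `os.DirEntry` objects rather than plain strings, and each entry can report whether it is a file or a directory. The sketch below builds a small temporary directory for the demonstration; the names `notes.txt` and `subfolder` are made up.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    # Build a tiny directory to look at: one file and one sub-folder
    # (both names are made up for this example)
    open(os.path.join(tmpdir, 'notes.txt'), 'w').close()
    os.mkdir(os.path.join(tmpdir, 'subfolder'))

    # Each entry knows its own name and whether it is a directory
    listing = sorted((entry.name, entry.is_dir())
                     for entry in os.scandir(tmpdir))

print(listing)
```

Each entry also has a `path` attribute and an `is_file()` method, which can save a separate call to `os.path.isfile` for every item.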

### Making directories with `os`

The function `os.mkdir` can be used to make a directory in Python. The function `os.makedirs` is similar, but allows us to make nested directories, i.e. directories within directories (this mimics the unix `mkdir -p` command).

In [27]:
# List the contents of the directory
contents = os.listdir(pwd)
print("Old Directory Contents: ") 
print('\n'.join(contents))

# Make a new directory
os.mkdir('directoryWeJustMade')

# List the contents of the directory
contents = os.listdir(pwd)
print("\nDirectory Contents (after `mkdir`): ") 
print('\n'.join(contents))

# Make a new directory hierarchy
os.makedirs('nested/directories/we/just/made')

# List the contents of the directory
contents = os.listdir(pwd)
print("\nDirectory Contents (after `makedirs`): ") 
print('\n'.join(contents))

Old Directory Contents: 
file_undesired.txt
file1.txt
file2.txt
fileb.txt

Directory Contents (after `mkdir`): 
file_undesired.txt
file1.txt
directoryWeJustMade
file2.txt
fileb.txt

Directory Contents (after `makedirs`): 
file_undesired.txt
file1.txt
directoryWeJustMade
file2.txt
nested
fileb.txt


### Deleting directories with `os`

We can also remove empty directories and empty nested directories in a similar fashion using `os.rmdir` and `os.removedirs` (which again mimics the unix command `rmdir -p`);

In [28]:
# List the contents of the directory
contents = os.listdir(pwd)
print("Old Directory Contents: ") 
print('\n'.join(contents))

# Remove the (empty) directory we made earlier
os.rmdir('directoryWeJustMade')

# List the contents of the directory
contents = os.listdir(pwd)
print("\nNew Directory Contents (after `rmdir`): ") 
print('\n'.join(contents))

# Remove the (empty) nested directory hierarchy
os.removedirs('nested/directories/we/just/made')

# List the contents of the directory
contents = os.listdir(pwd)
print("\nDirectory Contents (after `removedirs`): ") 
print('\n'.join(contents))

Old Directory Contents: 
file_undesired.txt
file1.txt
directoryWeJustMade
file2.txt
nested
fileb.txt

New Directory Contents (after `rmdir`): 
file_undesired.txt
file1.txt
file2.txt
nested
fileb.txt

Directory Contents (after `removedirs`): 
file_undesired.txt
file1.txt
file2.txt
fileb.txt


 > **Note:** An important footnote here is that the above functions **only remove empty directories**! To remove directories which contain files you will need to use the `shutil.rmtree` function (see the `shutil` section of this notebook).
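To see what "only remove empty directories" means in practice, the following sketch tries to call `os.rmdir` on a directory that contains a file. It runs inside a temporary directory, and the names `not_empty` and `file.txt` are made up for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    # Make a directory containing a single file (hypothetical names)
    full_dir = os.path.join(tmpdir, 'not_empty')
    os.mkdir(full_dir)
    with open(os.path.join(full_dir, 'file.txt'), 'w') as f:
        f.write('some text')

    # `os.rmdir` refuses to remove a directory with contents
    try:
        os.rmdir(full_dir)
        removed = True
    except OSError as error:
        removed = False
        print('Could not remove directory:', type(error).__name__)

    # The directory (and the file inside it) is still there
    still_there = os.path.isdir(full_dir)

print(removed, still_there)
```

Catching `OSError` like this is one way to handle the failure gracefully rather than letting the error stop your script.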

### Moving files with `os`

We can move files in Python using `os.rename` (this mimics the behaviour of the Unix command `mv`).

 > **Note:** On Unix-like systems (Linux and macOS), if you use `os.rename` to give a file the name of a file which already exists, the existing file is silently overwritten with no error! (On Windows, `os.rename` raises a `FileExistsError` instead.) Be very careful here, especially when working with files which have similar names to one another!
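One way to reduce the risk of overwriting a file by accident is to check whether the destination already exists before renaming. The sketch below is only an illustration: `safe_rename` is a hypothetical helper (it is not part of `os`), the file names are made up, and `os.path.exists` is covered later in this notebook.

```python
import os
import tempfile

def safe_rename(src, dst):
    """Rename src to dst, but refuse if dst already exists.

    Hypothetical helper for illustration; not part of `os`.
    """
    if os.path.exists(dst):
        raise FileExistsError('Refusing to overwrite ' + dst)
    os.rename(src, dst)

with tempfile.TemporaryDirectory() as tmpdir:
    old = os.path.join(tmpdir, 'old.txt')
    new = os.path.join(tmpdir, 'new.txt')

    # Create two files with different contents
    for path, text in [(old, 'old contents'), (new, 'precious contents')]:
        with open(path, 'w') as f:
            f.write(text)

    # The destination exists, so the helper raises instead of renaming
    try:
        safe_rename(old, new)
        overwritten = True
    except FileExistsError:
        overwritten = False

    # Check that the destination file is untouched
    with open(new, 'r') as f:
        kept = f.read()

print(overwritten, kept)
```

Note that the check and the rename are two separate steps, so this does not protect against another program creating the destination file in between them.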

In [29]:
# Make a file for example
with open('file4.txt', 'wt') as f:
    f.write('Oh look, some text in a file')
print(os.listdir())

# Move the file
os.rename('file4.txt', 'file3.txt')
print(os.listdir())

# CAUTION: Running the below will write over `file1.txt`
# without warning!
os.rename('file3.txt', 'file1.txt')
print(os.listdir())


['file_undesired.txt', 'file1.txt', 'file2.txt', 'file4.txt', 'fileb.txt']
['file_undesired.txt', 'file1.txt', 'file2.txt', 'file3.txt', 'fileb.txt']
['file_undesired.txt', 'file1.txt', 'file2.txt', 'fileb.txt']


### Removing files with `os`

In Python, files can be removed with the function `os.remove` (this mimics the Unix command `rm`).

In [30]:
# Make a file for example
with open('file3.txt', 'wt') as f:
    f.write('Oh look, some text in a file')
print(os.listdir())

# Remove the file
os.remove('file3.txt')
print(os.listdir())

['file_undesired.txt', 'file1.txt', 'file2.txt', 'file3.txt', 'fileb.txt']
['file_undesired.txt', 'file1.txt', 'file2.txt', 'fileb.txt']


### Walking a directory tree with `os`

The `os` package also provides a particularly useful function for recursively moving through a directory; `os.walk`. 

Whilst `os.listdir` lets us look at the contents of a directory, `os.walk` lets us look at the contents of all subfolders of that directory (and all subfolders of those subfolders, and all subfolders of those subfolders of those subfolders,... and so on!). 

To understand how this works, consider the below example:

In [31]:
# Let's move up a directory to give a clear demonstration
pwd = os.getcwd()
os.chdir('..')
print('We are looking at the following directory: ', os.getcwd())

# Now let's have a look at all of the files and folders inside
# our working directory
for root, dirs, files in os.walk(os.getcwd()):
    
    # The variable `root` in this example gives us 
    # the current directory we are looking at
    print('\nCurrent directory: ', root)
    
    # The sub-directories of root are given by the variable
    # `dirs` in this example
    print('\n  Sub-directories:')
    print('\n'.join(['    {}'.format(d) for d in dirs]))
    
    # The variable `files` in this example gives us all files
    # in root
    print('\n  Files:')
    print('\n'.join(['    {}'.format(f) for f in files]))

# Let's move back to our working directory
os.chdir(pwd)

We are looking at the following directory:  /home/tommaullin/Documents/PythonIntro2019

Current directory:  /home/tommaullin/Documents/PythonIntro2019

  Sub-directories:
    .ipynb_checkpoints
    .git
    02_functions_modules_and_packages
    03_managing_files
    04_numpy
    05_pandas

  Files:
    nb_02_functions_modules_and_packages.ipynb
    nb_05_pandas.ipynb
    nb_03_managing_files.ipynb
    nb_04_numpy.ipynb
    nb_01_datatypes_loops_and_logic.ipynb
    nb_00_introduction.ipynb

Current directory:  /home/tommaullin/Documents/PythonIntro2019/.ipynb_checkpoints

  Sub-directories:


  Files:
    nb_05_pandas-checkpoint.ipynb
    Introduction-checkpoint.ipynb
    nb_02_functions_modules_and_packages-checkpoint.ipynb
    nb_04_numpy-checkpoint.ipynb
    nb_01_datatypes_loops_and_logic-checkpoint.ipynb
    nb_03_managing_files-checkpoint.ipynb
    nb_00_introduction-checkpoint.ipynb

Current directory:  /home/tommaullin/Documents/PythonIntro2019/.git

  Sub-directories:
    logs


By default, `os.walk` works top-down: each directory is reported before any of its sub-directories, and the walk then descends into each sub-directory in turn (this is a depth-first traversal rather than a breadth-first one). 

However, we can change this so that the walk is bottom-up, meaning each directory is reported only after all of its sub-directories have been reported (so the lowest-level folders appear first and the top-level folder appears last). This can be done using the `topdown` parameter like so: 

In [32]:
# Let's move up a directory to give a clear demonstration
pwd = os.getcwd()
os.chdir('..')
print('We are looking at the following directory: ', os.getcwd())

# Now let's have a look at all of the files and folders inside
# our working directory
for root, dirs, files in os.walk(os.getcwd(), topdown=False):
    
    # The variable `root` in this example gives us 
    # the current directory we are looking at
    print('\nCurrent directory: ', root)
    
    # The sub-directories of root are given by the variable
    # `dirs` in this example
    print('\n  Sub-directories:')
    print('\n'.join(['    {}'.format(d) for d in dirs]))
    
    # The variable `files` in this example gives us all files
    # in root
    print('\n  Files:')
    print('\n'.join(['    {}'.format(f) for f in files]))

# Let's move back to our working directory
os.chdir(pwd)

We are looking at the following directory:  /home/tommaullin/Documents/PythonIntro2019

Current directory:  /home/tommaullin/Documents/PythonIntro2019/.ipynb_checkpoints

  Sub-directories:


  Files:
    nb_05_pandas-checkpoint.ipynb
    Introduction-checkpoint.ipynb
    nb_02_functions_modules_and_packages-checkpoint.ipynb
    nb_04_numpy-checkpoint.ipynb
    nb_01_datatypes_loops_and_logic-checkpoint.ipynb
    nb_03_managing_files-checkpoint.ipynb
    nb_00_introduction-checkpoint.ipynb

Current directory:  /home/tommaullin/Documents/PythonIntro2019/.git/logs/refs/remotes/origin

  Sub-directories:


  Files:
    master

Current directory:  /home/tommaullin/Documents/PythonIntro2019/.git/logs/refs/remotes

  Sub-directories:
    origin

  Files:


Current directory:  /home/tommaullin/Documents/PythonIntro2019/.git/logs/refs/heads

  Sub-directories:


  Files:
    master

Current directory:  /home/tommaullin/Documents/PythonIntro2019/.git/logs/refs

  Sub-directories:
    remotes
  

 > **Note:** Beyond the search method, by default, no ordering is enforced on the files output by `os.walk`. Options are available to remedy this, however (see [the documentation](https://docs.python.org/3.5/library/os.html#os.walk) for more information).
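The sketch below uses a small temporary directory tree (all folder names are made up) to show the order in which directories are reported for each value of `topdown`: a directory comes before its sub-directories when `topdown=True` and after them when `topdown=False`. It also shows one way to make the order predictable, which is to sort the `dirs` list in place.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    # Build a small tree (all names are made up for this example):
    #   tmpdir/a/inner
    #   tmpdir/b
    os.makedirs(os.path.join(tmpdir, 'a', 'inner'))
    os.makedirs(os.path.join(tmpdir, 'b'))

    def visit_order(topdown):
        # Record each directory, relative to tmpdir, as it is reported
        order = []
        for root, dirs, files in os.walk(tmpdir, topdown=topdown):
            # Sorting `dirs` in place controls the order in which
            # sub-directories are visited (this only has an effect
            # when topdown=True)
            dirs.sort()
            order.append(os.path.relpath(root, tmpdir))
        return order

    top_down = visit_order(topdown=True)
    bottom_up = visit_order(topdown=False)

print(top_down)
print(bottom_up)
```

With `topdown=True` the starting directory (shown as `.`) is listed first, whereas with `topdown=False` it is listed last and `inner` always appears before `a`. The `files` list can be sorted in the same way if you need the files in a fixed order.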

### Manipulating filepaths with `os`

`os` contains a great package for manipulating filepaths. It is called `os.path`. We imported it earlier!

Some particularly useful functions in `os.path`, which check whether a filepath corresponds to some existing object, include:

 - `os.path.isfile`: This checks if a path represents an existing file.
 - `os.path.isdir`: This checks if a path represents an existing directory.
 - `os.path.exists`: This is much more general and checks whether the path represents any existing object.
 
For example;


In [33]:
print('-------------------------------')

# Let's see whether some files exist
print(os.path.isfile("file1.txt"))
print(os.path.isfile("file9.txt"))
print('-------------------------------')

# Let's see whether some directories exist
print(os.path.isdir("../03_managing_files"))
print(os.path.isdir("directoryName"))
print('-------------------------------')

# Let's see what happens when we try the exists function
# on these files:
print(os.path.exists("../03_managing_files"))
print(os.path.exists("file1.txt"))
print(os.path.exists("directoryName"))
print('-------------------------------')

-------------------------------
True
False
-------------------------------
True
False
-------------------------------
True
True
False
-------------------------------


A great feature of the `os.path` package is that it includes some very convenient tools for constructing paths. For example, you often may need to switch between relative and absolute paths. 

 > **Relative and Absolute Paths**
 >
 > A relative path is a path which specifies a location of a directory relative to another directory. Typically we talk about relative paths in relation to our current working directory. (e.g. `./folder/we/want`)
 >
 > An absolute path is a complete path which details entirely the location of a file or folder, starting from the root element and ending with the other subdirectories. (e.g. `C:/Complete/Path/all/the/way/to/folder/we/want`)

In Python, we can switch between absolute paths and relative paths (in relation to the current working directory) using the `os.path.abspath` and `os.path.relpath` functions. For example;

In [34]:
# Absolute path from a relative path
abspath = os.path.abspath('this/is/a/path/to/file.txt')
print('Absolute path: ', abspath)

# Relative path from an absolute path
relpath = os.path.relpath(abspath)
print('Relative path: ', relpath)

Absolute path:  /home/tommaullin/Documents/PythonIntro2019/03_managing_files/this/is/a/path/to/file.txt
Relative path:  this/is/a/path/to/file.txt


Other attributes and functions in `os` and `os.path` which are particularly useful for constructing paths (and whose use, when manipulating filepaths, is highly recommended) include:

 - `os.path.join` - This will automatically join directories and filenames to form a path.
 - `os.sep` - This gives the separator character which is used in file paths on your platform.
 - `os.pathsep` - This gives the separator character which is used in your `$PATH`  environment variable.
 
The above attributes and functions are particularly useful for cross-platform compatibility as Windows uses `\` as a file separator (written as `\\` inside an ordinary Python string) whilst `/` is used on Mac and Linux. 

Examples of these are given below:

In [35]:
# sep
print("sep: ", os.sep)
print("pathsep: ", os.pathsep)

# Join example
print(os.path.join('a','b','c'))

sep:  /
pathsep:  :
a/b/c



 > **Note**: If you are regularly working with filepaths, you will very likely run into a `FileNotFoundError` at some point! One common cause of this is using `/` instead of `\` or vice versa! To avoid this, build your paths with `os.path.join` (or the `sep` attribute of `os`) rather than typing the separator by hand!
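As a small illustration of the note above, the sketch below builds the same path in two ways and then tidies up a hand-written path. The folder and file names are made up, and `os.path.normpath` is a standard `os.path` function that has not been covered elsewhere in this notebook.

```python
import os

# Build a path from its parts instead of typing the separator by hand
# (the folder and file names here are made up for illustration)
data_path = os.path.join('results', 'run1', 'output.txt')
print(data_path)

# The same path written out explicitly with `os.sep`
by_hand = 'results' + os.sep + 'run1' + os.sep + 'output.txt'
print(by_hand == data_path)

# `os.path.normpath` tidies up a path, e.g. collapsing `..` and,
# on Windows, converting `/` to `\`
tidy = os.path.normpath('results/run1/../run2/output.txt')
print(tidy)
```

Because the separator is never typed by hand in `data_path`, the same code gives a valid path on Windows, Mac and Linux.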

`os.path` also includes some great tools for deconstructing paths. In particular, the following functions are extremely useful:

 - `os.path.dirname`: This returns the directory from a path.
 - `os.path.basename`: This returns the file (or bottom-level folder) name (with extension) from a path.
 - `os.path.split`: This returns a tuple containing both the directory name and the bottom-level file/folder name (with extension).
 - `os.path.splitext`: This returns a tuple containing the path without its extension and the extension of a file.

For example;

In [36]:
# Make an example path
examplepath = '/a/b/c.txt'

# Directory name
print('Directory name: ', os.path.dirname(examplepath))

# File name
print('Base name: ', os.path.basename(examplepath))

# Directory name and File name
print('Split output: ', os.path.split(examplepath))

# File name and extension
print('Splitext output: ', os.path.splitext(examplepath))

Directory name:  /a/b
Base name:  c.txt
Split output:  ('/a/b', 'c.txt')
Splitext output:  ('/a/b/c', '.txt')


### Other useful `os` functions

Some other mention-worthy `os` operations that are particularly handy and crop up from time to time are:

 - `os.path.expanduser`: On Unix and Windows, this function replaces an initial `~` in a path with your home directory (and an initial `~user` with the home directory of `user`).
 - `os.path.expandvars`: This function expands any environment variables in a path.

In [37]:
print(os.path.expanduser('~'))
print(os.path.expandvars('$HOME'))

/home/tommaullin
/home/tommaullin


## The `shutil` package

Another useful package for working with and manipulating files and paths is the `shutil` package. Full documentation for the `shutil` package can be found [here](https://docs.python.org/3/library/shutil.html). In general `shutil` provides many extremely useful higher level file operations, a full list of which would be too numerous to cover in this course.

However, there are a few `shutil` functions which deserve a special mention as they perform extremely commonly required operations that are not provided by `os`.

In [38]:
import shutil

### Removing directories with `shutil`

`shutil` can be used to remove whole directory trees with the function `shutil.rmtree` (remember `os` can only be used for the removal of files and **empty** directories). For example;

In [39]:
# Let's make a directory
os.mkdir('exampledir')

# Let's make a file in the directory
with open('exampledir/examplefile.txt', 'wt') as f:
    f.write('example text')
print(os.listdir())

# Now let's remove the whole directory in one go with `shutil`
shutil.rmtree('exampledir')
print(os.listdir())

['file_undesired.txt', 'exampledir', 'file1.txt', 'file2.txt', 'fileb.txt']
['file_undesired.txt', 'file1.txt', 'file2.txt', 'fileb.txt']


### Copying directories with `shutil`

`shutil` can also copy directories; like so:

In [40]:
# Let's make a directory
os.mkdir('exampledir')

# Let's make a file in the directory
with open('exampledir/examplefile.txt', 'wt') as f:
    f.write('example text')
print(os.listdir())

# Now let's copy the whole directory in one go with `shutil`
shutil.copytree('exampledir','exampledir_copy')
print(os.listdir())

# Remove directories
shutil.rmtree('exampledir')
shutil.rmtree('exampledir_copy')
print(os.listdir())

['file_undesired.txt', 'exampledir', 'file1.txt', 'file2.txt', 'fileb.txt']
['file_undesired.txt', 'exampledir', 'file1.txt', 'file2.txt', 'exampledir_copy', 'fileb.txt']
['file_undesired.txt', 'file1.txt', 'file2.txt', 'fileb.txt']


## The `re` package

Another useful package for filepath manipulation is the `re` package, which provides support for regular expressions in Python. If you are not familiar with regular expressions feel free to skip this and the following section of the notebook; alternatively, if you are interested in learning more about regular expressions [this link](https://www.regular-expressions.info/) provides a good introduction.

In [41]:
import re

A full discussion of regular expressions is beyond the scope of this practical. However, for those familiar with regular expressions a few quick examples are given below demonstrating the use of the `re` package when searching for files.

In [42]:
# In our folder we have 4 files
print(os.listdir())

# Scenario 1
# ------------------------------------------------
# Suppose we want all files with a filename
# of the form "file.\.txt" (i.e. with only one
# character following the filename)
p = [f for f in os.listdir() if re.match(r'file.\.txt', f)]
print(p)

# Scenario 2
# ------------------------------------------------
# Suppose we want all files with a filename
# of the form "file[0-9]\.txt" (i.e. with only one
# numeric character following the filename)
p = [f for f in os.listdir() if re.match(r'file[0-9]\.txt', f)]
print(p)

['file_undesired.txt', 'file1.txt', 'file2.txt', 'fileb.txt']
['file1.txt', 'file2.txt', 'fileb.txt']
['file1.txt', 'file2.txt']


Whilst the `re` package provides extremely powerful functionality, for manipulating filepaths it is often easier to use `glob` (see below). However, it is worth remembering that the `re` package exists, as it can be applied to a range of scenarios involving strings beyond file manipulation. 
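One detail worth knowing when filtering filenames is that `re.match` only checks that the pattern matches at the start of the string, so a name with extra characters after the pattern still counts as a match. The sketch below compares `re.match` with `re.fullmatch` and `re.search` on a made-up list of filenames (no real files are needed).

```python
import re

# A made-up list of filenames, so no real files are needed
names = ['file1.txt', 'file2.txt', 'fileb.txt',
         'file1.txt.bak', 'myfile1.txt']

pattern = r'file[0-9]\.txt'

# `re.match` only requires the pattern to match at the START of the name
with_match = [n for n in names if re.match(pattern, n)]

# `re.fullmatch` requires the WHOLE name to match the pattern
with_fullmatch = [n for n in names if re.fullmatch(pattern, n)]

# `re.search` accepts a match anywhere in the name
with_search = [n for n in names if re.search(pattern, n)]

print(with_match)
print(with_fullmatch)
print(with_search)
```

If you want only names that match the pattern exactly, `re.fullmatch` (or ending the pattern with `$`) is usually the safer choice.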

## The `glob` package

As discussed in the previous section, `glob` is an extremely powerful package which supports unix-style wildcard matching.

 > **Note:** The glob module contains a function, also called `glob`. This can get confusing, so we only import the function here.

In [43]:
from glob import glob

As a quick example, we will perform the "scenario 1" use case using `glob`:

In [44]:
# Like before, we want all files with a filename
# of the form "file?.txt" (i.e. with only one
# character following the filename)
p = glob('file?.txt')
print(p)


['file1.txt', 'file2.txt', 'fileb.txt']


 > **Note:** As the `re` package matches regular expressions and the `glob` package uses wildcard matching, the `re` package gives much finer control for precise pattern matching. Wildcard patterns in `glob` are limited to `*` (any number of characters), `?` (exactly one character) and `[...]` (one character from a set or range). This is enough for both scenarios from the `re` section (scenario 2 can be written as `glob('file[0-9].txt')`), but a pattern such as "one or more digits" cannot be expressed with `glob` and requires a regular expression. This is worth bearing in mind when you have many similarly named files.
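The sketch below compares a few wildcard patterns with a regular expression. It uses the standard pattern syntax documented for `glob` (`*`, `?` and `[seq]`) and creates some empty files in a temporary directory; the file names, and the helper function `names`, are made up for illustration.

```python
import os
import re
import tempfile
from glob import glob

with tempfile.TemporaryDirectory() as tmpdir:
    # Create some empty files to search through (made-up names)
    for name in ['file1.txt', 'file2.txt', 'file10.txt', 'fileb.txt']:
        open(os.path.join(tmpdir, name), 'w').close()

    def names(pattern):
        # Sorted filenames in tmpdir which match a wildcard pattern
        matches = glob(os.path.join(tmpdir, pattern))
        return sorted(os.path.basename(p) for p in matches)

    one_char = names('file?.txt')       # exactly one character of any kind
    one_digit = names('file[0-9].txt')  # exactly one digit
    any_chars = names('file*.txt')      # any number of characters

    # "One or more digits" cannot be written as a wildcard pattern,
    # so here we fall back on a regular expression
    digits = sorted(f for f in os.listdir(tmpdir)
                    if re.fullmatch(r'file[0-9]+\.txt', f))

print(one_char)
print(one_digit)
print(any_chars)
print(digits)
```

Here `file*.txt` also picks up `fileb.txt`, whereas the regular expression keeps only the names made up of `file` followed by digits.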

## Exercises

**Question 1:** Using the `os` package, get your current working directory and save it as a variable named `pwd`. Then make a directory named `testdir` and change the working directory to that directory.

In [None]:
# Write your code here

**Question 2:** Using the `with` keyword and a `for` loop, make 20 files in your `testdir` folder named `test1.txt`, `test2.txt`,... `test20.txt`, each containing a random number (*note: you can get random numbers using the `random` package as demonstrated below*).

In [None]:
# Import the random package
import random

# Use the gauss function to obtain a random number
randomNo = random.gauss(0, 1) # Random ~N(0,1) variable
print(randomNo)

# Write your code here

**Question 3:** Using the `os.listdir` function, obtain a list of all the files you created in question 2. Loop through these files, reading one at a time, and use the `os.remove` function to remove any file which contains a number less than 0.

In [None]:
# Write your code here

**Question 4:** Remove the `testdir` folder. *Hint: Remember! You cannot do this with the `os` package if `testdir` contains files!*. Your `pwd` variable from question 1 may come in useful here.

In [None]:
# Write your code here

**Question 5 (hard):** Somewhere in the `PythonIntro2019` folder there is a hidden subfolder and file. Find the file `secret.txt` using the `os.walk` function and print the message in the file.

In [None]:
# Write your code here