# File operations in Python

When we want to batch-process image data, i.e., apply the same workflow to a potentially large number of images, we can easily do this in Python. To do so, we need to access the filesystem and browse all images that meet certain criteria. Such criteria could be:

* We want to work with only `.tif` images
* Images could be stored in two different subdirectories like this:
```
root_directory/
   sub_directory_a/
     image_1.tif
     image_2.tif
     ...
   sub_directory_b/
     different_experiment_1.tif
     different_experiment_2.tif
     ...
```
* Images should carry some specific ID in their name. This is helpful if your images are named like this:
```
root_directory/
   experiment1_image1.tif
   experiment1_image2.tif
   experiment2_image1.tif
   experiment2_image2.tif
   ...
```

## The os package

`os` is a very handy package for such tasks. In this tutorial we will explain some file operations you can do with `os`, like how to iterate over files in a directory. `os` is part of the Python standard library, so it is available in every new environment and we don't have to install it separately.

In [1]:
import os

**Note**: There are two different types of paths: *Absolute paths* and *Relative paths*. Absolute paths always show the full pathname of a file in a `Drive/folder/file` kind of structure. Relative paths are more like directions to a file from the location of another file. Relative paths can look like this `./folder/file`. The `./` refers to the current location. In this tutorial we will stick to absolute paths.
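
The difference between the two kinds of paths can be checked directly with `os.path.isabs` and `os.path.abspath`. The following is a minimal sketch; the path `./folder/file.tif` is a hypothetical example and does not need to exist on disk:

```python
import os

# A relative path: it is interpreted relative to the current working directory.
# 'folder' and 'file.tif' are hypothetical names used only for illustration.
relative_path = os.path.join('.', 'folder', 'file.tif')

# os.path.abspath() converts it into an absolute path,
# using the current working directory as the starting point
absolute_path = os.path.abspath(relative_path)

print(os.path.isabs(relative_path))  # False
print(os.path.isabs(absolute_path))  # True
print(absolute_path)
```

The last line prints a different result on every computer, because it depends on the directory from which the code is run.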

Let's start with obtaining the file location where we are working right now (the current working directory). This returns an absolute path:

In [4]:
os.getcwd()

'E:\\BiAPoL\\Projects\\playing-hours-2022\\04_batch_processing_and_loops'

Let's assume we have some data stored on the Desktop. For this, create a folder on your Desktop, call it `my_data` and put `blobs.tif` inside. Now we create a string with the location of the desktop:

In [5]:
root = r'C:\Users\johan\Desktop'

Let's have a look at everything on our Desktop. The directory `my_data` should be there now:

In [6]:
os.listdir(root)

['desktop.ini', 'Mailbox', 'MQ', 'my_data', 'PoL-Johannes']
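
Note that `os.listdir` returns files and folders alike. If it matters which is which, `os.path.isdir` and `os.path.isfile` can tell them apart. The sketch below does not touch the real Desktop: it assumes a temporary directory created with `tempfile` as a stand-in, with one folder and one empty placeholder file:

```python
import os
import tempfile

# A temporary directory stands in for the Desktop (assumption for this sketch)
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, 'my_data'))            # a folder
    open(os.path.join(root, 'desktop.ini'), 'w').close()  # an empty placeholder file

    entries = sorted(os.listdir(root))
    print(entries)  # ['desktop.ini', 'my_data']

    # Both checks need the full path, not just the name
    is_folder = os.path.isdir(os.path.join(root, 'my_data'))
    is_file = os.path.isfile(os.path.join(root, 'desktop.ini'))
    print(is_folder, is_file)  # True True
```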

To obtain the complete path of `blobs.tif`, we can string together the directory structure towards the image with `os.path.join`. This will create the full path to the image we are looking for. We can also check whether the image really exists where we think it is with `os.path.exists`.

*Note*: We could also build the path by hand with something like `path = root + '/my_data' + '/blobs.tif'`, but the big advantage of `os.path.join` is that it inserts the correct path separator for us, so we don't have to think about using `/` or `\` or `\\`, etc. Hence, this will work on all operating systems.

In [9]:
filename_image = os.path.join(root, 'my_data', 'blobs.tif')
os.path.exists(filename_image)  # returns True if image exists

True
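
To illustrate why `os.path.join` is the safer choice, here is a small comparison with plain string concatenation. It is only a sketch: the root `some/root` is a hypothetical relative location, and nothing is read from disk:

```python
import os

root = os.path.join('some', 'root')  # hypothetical location, for illustration only

# Plain concatenation: it is easy to forget a separator between two parts
concatenated = root + '/my_data' + 'blobs.tif'
print(concatenated)

# os.path.join puts the separator of the current operating system between all parts
joined = os.path.join(root, 'my_data', 'blobs.tif')
print(joined)

# The parts can be recovered again from the joined path
print(os.path.basename(joined))                                  # blobs.tif
print(os.path.dirname(joined) == os.path.join(root, 'my_data'))  # True
```

In the concatenated version, the folder and the filename run together into `my_datablobs.tif`, which is exactly the kind of mistake `os.path.join` prevents.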

## Iterating over directories

Now, go to the `my_data` directory and copy the `blobs.tif` image a couple of times; the names of the copies don't matter. Let's now create a new path string that points directly to the data directory and list all the files there:

In [14]:
data_dir = os.path.join(root, 'my_data')
list_of_images = os.listdir(data_dir)
list_of_images

['blobs - Kopie (2).tif',
 'blobs - Kopie (3).tif',
 'blobs - Kopie.tif',
 'blobs.tif']

To iterate over all the images here, we need a `for` loop. Python provides a very simple syntax for such operations. 

**Important**: Everything that is indented happens *within* the loop, and variables assigned there are potentially overwritten during each cycle of the loop! The loop ends when the indentation is decreased again:

In [52]:
for filename in list_of_images:
    print(filename)  # The variable filename is the loop variable: it changes in every step of the loop

print('Loop finished!')

blobs - Kopie (2).tif
blobs - Kopie (3).tif
blobs - Kopie.tif
blobs.tif
Loop finished!


Note that this does not print the full path, but only the "base name" of each file. To retrieve the full path we can do:

In [20]:
for filename in list_of_images:
    full_path = os.path.join(data_dir, filename)
    print(full_path)
    
    # image = io.imread(full_path)  # We could now read each image like this, but `image` would be overwritten in every cycle of the loop.

C:\Users\johan\Desktop\my_data\blobs - Kopie (2).tif
C:\Users\johan\Desktop\my_data\blobs - Kopie (3).tif
C:\Users\johan\Desktop\my_data\blobs - Kopie.tif
C:\Users\johan\Desktop\my_data\blobs.tif
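
If the results of every cycle should be kept instead of being overwritten, one common approach is to create a list *before* the loop and append to it *inside* the loop. The sketch below assumes a temporary directory with empty placeholder files (hypothetical names) instead of real images, and it only collects the paths rather than reading the images:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as data_dir:
    # Create a few empty placeholder files standing in for real images
    for name in ['blobs.tif', 'blobs_copy_1.tif', 'blobs_copy_2.tif']:
        open(os.path.join(data_dir, name), 'w').close()

    full_paths = []  # created before the loop, so it survives every cycle
    for filename in sorted(os.listdir(data_dir)):
        full_path = os.path.join(data_dir, filename)  # overwritten in every cycle
        full_paths.append(full_path)                  # kept, because we append

    print(len(full_paths))  # 3
    print([os.path.basename(path) for path in full_paths])
```

The same pattern would work for images: appending each loaded image to a list keeps all of them available after the loop has finished.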


## Subdirectories

Next exercise: Let's put all the files we have created into a new directory and call this one `Experiment_1`. Then we copy this directory (and all the files within) to the same location and call the copy `Experiment_2`. You'll see that `os.listdir()` now shows this output:

In [21]:
os.listdir(data_dir)

['Experiment_1', 'Experiment_2']

In order to browse all images in all subdirectories at a given location, we can use the very powerful `os.walk()` function, which visits **all files** in **all subdirectories** at a given location. Applying `os.walk` like this returns, for every directory it visits, the path of that directory along with a list of its subdirectories and a list of its files:

In [33]:
for root, subdirs, files in os.walk(data_dir):
    print('These are all subdirectories at ' + root)
    print(subdirs)
    
    print('And these are all the files at ' + root)
    print(files)
    
    print('\n')  # print a new line for clarity

These are all subdirectories at C:\Users\johan\Desktop\my_data
['Experiment_1', 'Experiment_2']
And these are all the files at C:\Users\johan\Desktop\my_data
[]


These are all subdirectories at C:\Users\johan\Desktop\my_data\Experiment_1
[]
And these are all the files at C:\Users\johan\Desktop\my_data\Experiment_1
['blobs - Kopie (2).tif', 'blobs - Kopie (3).tif', 'blobs - Kopie.tif', 'blobs.tif']


These are all subdirectories at C:\Users\johan\Desktop\my_data\Experiment_2
[]
And these are all the files at C:\Users\johan\Desktop\my_data\Experiment_2
['blobs - Kopie (2).tif', 'blobs - Kopie (3).tif', 'blobs - Kopie.tif', 'blobs.tif']




This seems complicated to handle. One way to use this function is to iterate over all images in all subdirectories. To do this, we have to put a `for` loop inside another `for` loop. This is also called a **nested loop**. Note that you have to increase the indentation to indicate the beginning of another loop:

In [37]:
for root, subdirs, files in os.walk(data_dir):
    for file in files:
        print('Directory:', root, 'Filename: ', file)  # os.walk() gives you the correct directory as you iterate over the files :)

Directory: C:\Users\johan\Desktop\my_data\Experiment_1 Filename:  blobs - Kopie (2).tif
Directory: C:\Users\johan\Desktop\my_data\Experiment_1 Filename:  blobs - Kopie (3).tif
Directory: C:\Users\johan\Desktop\my_data\Experiment_1 Filename:  blobs - Kopie.tif
Directory: C:\Users\johan\Desktop\my_data\Experiment_1 Filename:  blobs.tif
Directory: C:\Users\johan\Desktop\my_data\Experiment_2 Filename:  blobs - Kopie (2).tif
Directory: C:\Users\johan\Desktop\my_data\Experiment_2 Filename:  blobs - Kopie (3).tif
Directory: C:\Users\johan\Desktop\my_data\Experiment_2 Filename:  blobs - Kopie.tif
Directory: C:\Users\johan\Desktop\my_data\Experiment_2 Filename:  blobs.tif
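
The nested loop can also be used to collect the full path of every file into one list. This is a minimal sketch under a few assumptions: it builds a small temporary directory tree with empty placeholder files, and it names the loop variable `current_dir` (a name chosen here) so that the `root` variable defined earlier is not overwritten:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as data_dir:
    # Build a small throwaway tree with empty placeholder files
    for experiment in ['Experiment_1', 'Experiment_2']:
        os.makedirs(os.path.join(data_dir, experiment))
        for name in ['blobs.tif', 'blobs - Kopie.tif']:
            open(os.path.join(data_dir, experiment, name), 'w').close()

    all_paths = []
    for current_dir, subdirs, files in os.walk(data_dir):
        for file in files:
            # current_dir is the directory that contains `file`
            all_paths.append(os.path.join(current_dir, file))

    # Paths relative to data_dir are easier to read than the temporary location
    relative_paths = sorted(os.path.relpath(path, data_dir) for path in all_paths)
    print(len(all_paths))  # 4
    print(relative_paths)
```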


## Checking filenames

Last exercise: Suppose we want to work only with the original images and not the copied images, i.e., the ones carrying the string `Kopie` in their name. To do so, we need to check whether the string `Kopie` occurs in the filename, which we can then use in an `if` condition. Be careful: You have to increase the indentation yet another time! Python allows you to check whether one string is contained in another very easily like this:

In [38]:
'Johannes' in 'Johannes Mueller'

True

In [39]:
'Johanes' in 'Johannes Mueller'

False

Let's apply this:

In [42]:
for root, subdirs, files in os.walk(data_dir):
    for file in files:
        if 'Kopie' not in file:
            print('Directory:', root, 'Filename: ', file)
            
            # actual work on images goes here - don't forget the indentation
            # image = io.imread(os.path.join(root, file))
            
# putting a command here (like image = io.imread(os.path.join(root, file))) will only work on the last known value of `root` and `file`!

Directory: C:\Users\johan\Desktop\my_data\Experiment_1 Filename:  blobs.tif
Directory: C:\Users\johan\Desktop\my_data\Experiment_2 Filename:  blobs.tif


We could use the same technique to check whether the image is a tif file:

In [51]:
for root, subdirs, files in os.walk(data_dir):
    for file in files:
        if '.tif' in file:
            print('Directory:', root, 'Filename: ', file)

Directory: C:\Users\johan\Desktop\my_data\Experiment_1 Filename:  blobs - Kopie (2).tif
Directory: C:\Users\johan\Desktop\my_data\Experiment_1 Filename:  blobs - Kopie (3).tif
Directory: C:\Users\johan\Desktop\my_data\Experiment_1 Filename:  blobs - Kopie.tif
Directory: C:\Users\johan\Desktop\my_data\Experiment_1 Filename:  blobs.tif
Directory: C:\Users\johan\Desktop\my_data\Experiment_2 Filename:  blobs - Kopie (2).tif
Directory: C:\Users\johan\Desktop\my_data\Experiment_2 Filename:  blobs - Kopie (3).tif
Directory: C:\Users\johan\Desktop\my_data\Experiment_2 Filename:  blobs - Kopie.tif
Directory: C:\Users\johan\Desktop\my_data\Experiment_2 Filename:  blobs.tif


Or, similarly, we can use the `endswith()` method, which is built into strings. This means that *every string* can call this method on itself. One example:

In [48]:
name = 'Johannes Mueller'
print(name.endswith('Smith'))
print(name.endswith('ller'))
print(name.endswith('Mueller'))

False
True
True


In [50]:
for root, subdirs, files in os.walk(data_dir):
    for file in files:
        if file.endswith('.tif'):
            print('Directory:', root, 'Filename: ', file)

Directory: C:\Users\johan\Desktop\my_data\Experiment_1 Filename:  blobs - Kopie (2).tif
Directory: C:\Users\johan\Desktop\my_data\Experiment_1 Filename:  blobs - Kopie (3).tif
Directory: C:\Users\johan\Desktop\my_data\Experiment_1 Filename:  blobs - Kopie.tif
Directory: C:\Users\johan\Desktop\my_data\Experiment_1 Filename:  blobs.tif
Directory: C:\Users\johan\Desktop\my_data\Experiment_2 Filename:  blobs - Kopie (2).tif
Directory: C:\Users\johan\Desktop\my_data\Experiment_2 Filename:  blobs - Kopie (3).tif
Directory: C:\Users\johan\Desktop\my_data\Experiment_2 Filename:  blobs - Kopie.tif
Directory: C:\Users\johan\Desktop\my_data\Experiment_2 Filename:  blobs.tif
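
The two checks are not fully equivalent: `'.tif' in file` is true whenever `.tif` appears *anywhere* in the name, while `endswith` only looks at the end. The following sketch compares them on a few hypothetical filenames. It also shows two optional refinements that go beyond the examples above: `lower()` makes the check case-insensitive, and passing a tuple to `endswith` tests several endings at once:

```python
import os

# Hypothetical filenames, chosen to show where the checks differ
filenames = ['blobs.tif', 'BLOBS.TIF', 'scan.tiff', 'notes.txt', 'blobs.tif.bak']

for filename in filenames:
    contains_tif = '.tif' in filename
    ends_with_tif = filename.lower().endswith(('.tif', '.tiff'))
    extension = os.path.splitext(filename)[1]  # the part from the last dot onwards
    print(filename, contains_tif, ends_with_tif, extension)

# Keep only the names that really end in .tif or .tiff, ignoring upper/lower case
tif_files = [f for f in filenames if f.lower().endswith(('.tif', '.tiff'))]
print(tif_files)  # ['blobs.tif', 'BLOBS.TIF', 'scan.tiff']
```

Here `blobs.tif.bak` passes the `in` check although it is not an image, and `BLOBS.TIF` fails it because the comparison is case-sensitive.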
