# Episode 5 - For loops
This notebook is based on a snapshot of [Episode 5](https://kmichali.github.io/SC-shell-novice/05-loop/index.html) of the [Unix Shell lesson](https://kmichali.github.io/SC-shell-novice/) from [Software Carpentry](https://software-carpentry.org). The original material has more detail.

### Questions:
- How can I perform the same actions on many different files?

### Objectives:
- Write a loop that applies one or more commands separately to each file in a set of files.
- Trace the values taken on by a loop variable during execution of the loop.
- Explain the difference between a variable’s name and its value.

<hr style="border: solid 1px red; margin-top: 1.5% ">

### Video
Learn with [video](https://imperial.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=cbaa59c0-8902-4047-8b8f-abd700ba3755).

### Practice data in Google Colab
If you are viewing this notebook in Colab and have saved it in your Drive ("File"->"Save a copy in Drive"), run the cell below to download practice data.

In [None]:
%%bash
[ -e data-shell ] && echo "data already exists" || { wget https://kmichali.github.io/SC-shell-novice/data/data-shell.zip; unzip data-shell.zip; } 

<hr style="border: solid 1px red; margin-top: 1.5% ">

Loops are a programming construct that allows us to repeat a command (or set of commands) for each item in a list. As such they are key to productivity improvements through automation. Similar to wildcards, using loops also reduces the amount of typing required (and hence reduces the number of typing mistakes).

Suppose we have several hundred genome data files (e.g., basilisk.dat, minotaur.dat, and unicorn.dat). For this example, we’ll use the **`creatures`** directory, which has only three example files, but the same principles apply to many more files at once.

The structure of these files is the same: the common name, classification, and updated date are presented on the first three lines, with DNA sequences on the following lines. Let’s look at the files:

In [None]:
cd data-shell/creatures

In [None]:
%%bash
ls *.dat
cat basilisk.dat

In [None]:
%%bash
head -n 5 *.dat

## Print second line from every file
<hr style="border: solid 1px gray; margin-top: 1.5% ">

We would like to print out the classification for each species, which is given on the second line of each file. For each file, we would need to execute the command **`head -n 2`** and pipe this to **`tail -n 1`**. 

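It is tempting to use the wildcard **`*.dat`**. However, this does not work with a pipe: **`head`** writes the first two lines of every file into a single stream, and **`tail -n 1`** keeps only the last line of that stream, so we get the classification of the last file only.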
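
To see why only one line comes out, it can help to look at what **`head`** sends into the pipe when it is given several files. The sketch below is a minimal illustration under stated assumptions: it works in a temporary directory (`workdir`) with three made-up stand-in files, not the lesson data.

```shell
# Scratch directory so the lesson data is untouched (assumption for this sketch).
workdir=$(mktemp -d)
cd "$workdir"

# Three stand-in files with made-up contents: classification on line 2.
printf 'COMMON NAME: a\nCLASSIFICATION: first\nUPDATED: x\n'  > a.dat
printf 'COMMON NAME: b\nCLASSIFICATION: second\nUPDATED: y\n' > b.dat
printf 'COMMON NAME: c\nCLASSIFICATION: third\nUPDATED: z\n'  > c.dat

# Given several files, head writes ONE combined stream,
# with a "==> name <==" header in front of each file's lines.
head -n 2 *.dat

# tail only sees that combined stream, so it keeps its final line:
# the second line of the last file.
last_line=$(head -n 2 *.dat | tail -n 1)
echo "$last_line"    # prints: CLASSIFICATION: third
```

The pipe connects two commands, not two commands per file, which is why a loop is needed to repeat the whole pipeline for each file.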

In [None]:
%%bash
# print the second line of a single file
head -n 2 basilisk.dat | tail -n 1
# the wildcard does not work as intended: only one line is printed
head -n 2 *.dat | tail -n 1

We’ll use a loop to solve this problem. Before we do so, let’s look at the general form of a loop:

```
for thing in list_of_things
do
    operation_using $thing
done

```
Note that the loop construct takes four lines and it contains compulsory keywords - **`for`**, **`in`**, **`do`** and **`done`**. When the shell sees the keyword **`for`**, it knows to repeat a command (or commands) in the body of the loop (indented above) once for each item in a list. Each time the loop runs (an iteration), an item in the list is assigned in sequence to the variable (called "thing" above), and the commands inside the loop are executed, before moving on to the next item in the list. 

Inside the loop, we call for the variable’s value by putting **`$`** in front of it. The symbol **`$`** tells the shell interpreter to treat what follows as a variable name and substitute a value in its place.
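
The difference between a variable's name and its value can be traced with a small sketch. The list items and the helper variable `trace` are made up for illustration.

```shell
# "thing" is the variable's name; "$thing" asks the shell for its current value.
trace=""
for thing in first second third
do
    echo "name: thing   value: $thing"
    trace="$trace$thing,"    # collect the values in the order they are assigned
done
echo "$trace"    # prints: first,second,third,
```

Without the **`$`**, the shell treats `thing` as plain text, so the first part of each printed line never changes while the second part follows the list.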

A loop can be written on a single line as well:

```
for thing in list_of_things; do operation_using $thing ; done

```

Using a loop, we can print out the second line from every **`.dat`** file.

In [None]:
%%bash
for filename in basilisk.dat minotaur.dat unicorn.dat
do
    head -n 2 $filename | tail -n 1
done

Loops also work with wildcards.

In [None]:
%%bash
for filename in *.dat
do 
   head -n 2 $filename | tail -n 1
done

The variable name can be changed.  One should follow good practice and choose variable names that are informative.

In [None]:
%%bash
for creature in *.dat
do 
   head -n 2 $creature | tail -n 1
done

One can follow the progress of a loop by printing out the value of the loop variable in each iteration with **`echo`**. 

In [None]:
%%bash
for filename in *.dat
do 
   echo $filename
   head -n 2 $filename | tail -n 1
done

## File names with spaces

Be very careful with loops that iterate over file names containing spaces: the shell may treat each space-separated part as a separate file, e.g. **`purple unicorn.dat`** could be read as two files named **`purple`** and **`unicorn.dat`**. The problem can be avoided by surrounding each file name with quotes, and by quoting the loop variable where it is used (**`"$filename"`**).

```
for filename in "purple unicorn.dat" "green basilisk.dat"
do
  echo "$filename"
done
```
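
The sketch below illustrates what can go wrong when the loop variable itself is used without quotes. It assumes a temporary directory (`workdir`) and a made-up file; `unquoted_result` and `quoted_result` are helper names introduced only for this example.

```shell
# Scratch directory and file contents are made up for this illustration.
workdir=$(mktemp -d)
cd "$workdir"
printf 'line one\nline two\n' > "purple unicorn.dat"

for filename in "purple unicorn.dat"
do
    # Unquoted: the value is split at the space, so head looks for
    # "purple" and "unicorn.dat", neither of which exists.
    unquoted_result=$(head -n 1 $filename 2>/dev/null || echo "file not found")

    # Quoted: the value stays one word and the file is found.
    quoted_result=$(head -n 1 "$filename")
done

echo "unquoted: $unquoted_result"   # prints: unquoted: file not found
echo "quoted:   $quoted_result"     # prints: quoted:   line one
```

Quoting in the list protects the name when the loop starts; quoting the variable protects it each time it is used.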

## Save a copy of every .dat file
<hr style="border: solid 1px gray; margin-top: 1.5% ">

The next task is to make a copy of every .dat file and save it under a new name.  For example, a copy of **`unicorn.dat`** should be saved as **`original-unicorn.dat`**.

As with **`head`** in the example above, we cannot simply use the copy command with wildcards. **`cp *.dat original-*.dat`** will not work because the shell expands the wildcards before running the command. Since no file matches **`original-*.dat`**, that pattern is left unchanged, so the shell tries to execute **`cp basilisk.dat minotaur.dat unicorn.dat original-*.dat`**, which produces an error. **`cp`** accepts more than two arguments only if the last argument is an existing directory (and that would not accomplish our task anyway).
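
One way to inspect the expansion without copying anything is to put **`echo`** in front of the command. This is a minimal sketch that assumes an empty temporary directory (`workdir`) with three empty stand-in files and default shell options; `expanded` is a helper variable used only here.

```shell
# Scratch directory with empty stand-in files (assumption for this sketch).
workdir=$(mktemp -d)
cd "$workdir"
touch basilisk.dat minotaur.dat unicorn.dat

# echo prints the command line as it looks after wildcard expansion.
# original-*.dat matches nothing, so by default it is passed through unchanged.
expanded=$(echo cp *.dat original-*.dat)
echo "$expanded"
# prints: cp basilisk.dat minotaur.dat unicorn.dat original-*.dat
```

Seen this way, **`cp`** receives four arguments and the last one is not a directory, which is why it refuses to run.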



In [None]:
%%bash
# this does not work
cp *.dat original-*.dat

We must use a loop for this task. When preparing a loop that executes many commands, it is good practice to do a "dry run" first: instead of executing the body of the loop, print the command that would be run. For example, **`cp $filename original-$filename`** becomes **`echo "cp $filename original-$filename"`**.

Note: The copy command uses the construct **`original-$filename`** to put plain text in front of a variable and so create a new file name (e.g., original-unicorn.dat). Since the plain text comes before the variable name, this is safe. The other way around needs more care, because the shell must be able to tell where the variable name ends. **`$filename-original`** happens to work, since a hyphen cannot be part of a variable name, but **`$filename_original`** would be read as a variable called **`filename_original`**. Curly braces delimit the variable name unambiguously: **`${filename}-original`**.
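
The following sketch shows where the shell considers a variable name to end. The suffixes and the helper variable `braced` are chosen purely for illustration, and the example assumes that no variable called `filename_original` is set.

```shell
filename=unicorn.dat
unset filename_original    # make sure this name is really unset for the demo

# A hyphen cannot be part of a variable name, so the name ends at "-".
echo "$filename-original"      # prints: unicorn.dat-original

# An underscore can, so the shell looks up "filename_original",
# which is unset and expands to an empty string.
echo "[$filename_original]"    # prints: []

# Braces mark exactly where the name ends.
braced="${filename}_original"
echo "$braced"                 # prints: unicorn.dat_original
```

Using braces whenever text directly follows a variable avoids having to remember which characters are allowed in names.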

In [None]:
%%bash
for filename in *.dat
do 
  echo "cp $filename original-$filename"
done

## Practice

Copy the loop above into the cell below and remove **`echo`** (and the surrounding quotes) so that the loop executes the copy commands. Check the results.

In [None]:
%%bash


## Exercise 1

Switch the current working directory to **`data-shell/molecules`**.


In [None]:
cd ../molecules

What will be the output of the following loop?

```
for datafile in *.pdb
do
    cat $datafile >> all.pdb
done

```

1. All of the text from cubane.pdb, ethane.pdb, methane.pdb, octane.pdb, and pentane.pdb would be concatenated and saved to a file called all.pdb.
2. The text from ethane.pdb will be saved to a file called all.pdb.
3. All of the text from cubane.pdb, ethane.pdb, methane.pdb, octane.pdb, pentane.pdb and propane.pdb would be concatenated and saved to a file called all.pdb.
4. All of the text from cubane.pdb, ethane.pdb, methane.pdb, octane.pdb, pentane.pdb and propane.pdb would be printed to the screen and saved to a file called all.pdb.


The solution can be found at the end of this notebook.


<hr style="border: solid 1px red; margin-top: 1.5% ">

## Key points

- A for loop repeats commands once for every thing in a list.
- Every for loop needs a variable to refer to the thing it is currently operating on.
- Use **`$name`** to expand a variable (i.e., get its value). **`${name}`** can also be used.
- Do not use spaces, quotes, or wildcard characters in filenames, as they complicate variable expansion.
- Give files consistent names that are easy to match with wildcard patterns to make it easy to select them for looping.

<hr style="border: solid 1px gray; margin-top: 1.5% ">

### Solution to Exercise 1:

3 is the correct answer. **`>>`** appends to a file rather than overwriting it with the redirected output from a command. Because the output of the **`cat`** command is redirected, nothing is printed to the screen.
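
The difference between overwriting and appending inside a loop can be checked with a small sketch. It assumes a temporary directory (`workdir`) and three one-line stand-in files rather than the real molecule data; the output names `overwrite.txt` and `append.txt` are made up and deliberately do not end in `.pdb`, so the wildcard does not pick them up.

```shell
# Scratch directory with one-line stand-in files (assumption for this sketch).
workdir=$(mktemp -d)
cd "$workdir"
echo "cubane"  > cubane.pdb
echo "ethane"  > ethane.pdb
echo "propane" > propane.pdb

# ">" empties the target on every iteration, so only the last file survives.
for datafile in *.pdb
do
    cat "$datafile" > overwrite.txt
done

# ">>" adds to the target on every iteration, so all files accumulate in order.
for datafile in *.pdb
do
    cat "$datafile" >> append.txt
done

cat overwrite.txt    # prints one line: propane
cat append.txt       # prints three lines: cubane, ethane, propane
```

Because **`>>`** keeps adding, running such a loop a second time would append everything again, so it is worth removing the output file before a rerun.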