# Episode 8 - Shell script for Nelle

This notebook is an extension to the [Unix Shell lesson](https://kmichali.github.io/SC-shell-novice/) from the [Software Carpentry](https://software-carpentry.org). It describes how to develop a shell script to solve an example problem that was outlined in [Episode 1](https://kmichali.github.io/SC-shell-novice/01-intro/index.html).

### Questions:
- How do you run a program on ~1500 data files using the command line?
- How do you make your script user-friendly?
- How do you check if the data files are valid?

### Objectives:
- Write a shell script that runs a command or series of commands for a fixed set of files.
- Learn about the fundamental scripting building blocks - loops, variables and conditionals.
- Learn about argument validation.

<hr style="border: solid 1px red; margin-top: 1.5% ">

### Video
Learn with video:
- [part 1](https://imperial.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=25c07517-6683-4f63-8d24-abd800cac743)
- [part 2](https://imperial.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=1095126e-44f1-443d-9fc2-abd800ce6f3b)
- [part 3](https://imperial.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=91ecd5c0-4454-4976-a3e2-abd800d37e0b)
- [part 4](https://imperial.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=ed699b4f-95e2-4aa6-93bd-abd800d7d2ec)

### Practice data in Google Colab
If you are viewing this notebook in Colab and have saved it in your Drive ("File"->"Save a copy in Drive"), run the cell below to download practice data.

In [None]:
%%bash
[ -e data-shell ] && echo "data already exists" || { wget https://kmichali.github.io/SC-shell-novice/data/data-shell.zip; unzip data-shell.zip; } 

<hr style="border: solid 1px red; margin-top: 1.5% ">

## Nelle's pipeline: a typical problem
<hr style="border: solid 1px gray; margin-top: 1.5% ">

Nelle Nemo, a marine biologist, has just returned from a six-month survey of the [North Pacific Gyre](https://en.wikipedia.org/wiki/North_Pacific_Gyre), where she has been sampling gelatinous marine life in the [Great Pacific Garbage Patch](https://en.wikipedia.org/wiki/Great_Pacific_garbage_patch). She has 1520 samples that she’s run through an assay machine to measure the relative abundance of 300 proteins. She needs to run these 1520 files through an imaginary program called **`goostats`** she inherited. On top of this huge task, she has to write up results by the end of the month so her paper can appear in a special issue of Aquatic Goo Letters.

The bad news is that if she has to run **`goostats`** by hand using a GUI, she’ll have to select and open a file 1520 times. If **`goostats`** takes 30 seconds to run each file, the whole process will take more than 12 hours of Nelle’s attention. With the shell, Nelle can instead assign her computer this mundane task while she focuses her attention on writing her paper.

The next few lessons will explore the ways Nelle can achieve this. More specifically, they explain how she can use a command shell to run the **`goostats`** program, using loops to automate the repetitive steps of entering file names, so that her computer can work while she writes her paper.

As a bonus, once she has put a processing pipeline together, she will be able to use it again whenever she collects more data.

## Preparing to write the script
<hr style="border: solid 1px gray; margin-top: 1.5% ">

Let's change to the directory **`data-shell/north-pacific-gyre/2012-07-03/`** and examine its contents.

In [None]:
cd data-shell/north-pacific-gyre/2012-07-03

In [None]:
%%bash 
ls -l

The directory contains some data files (all 1500 would be impractical) and the **`goostats`** program.  The goal is to run **`goostats`** on every **`txt`** file in the directory.

In preparation for writing the scripts, one would normally try to do two things:
- examine the data files and make sure that they are all valid
- figure out the correct command to run **`goostats`**


## Validating data files
<hr style="border: solid 1px gray; margin-top: 1.5% ">

Let's have a look at the filenames. They seem to follow the same format: **`NENE*[ABZ].txt`**. The square brackets indicate that there is either A, B or Z in that position.  Nelle knows that the **`NENE*Z.txt`** files contain "Z" because something went wrong with the input data for the protein measurement, and she has to remember to exclude those from her further analysis.

She also knows that the files contain measurements for 300 proteins and the line count should reflect that. 

In the next cell, use **`wc`** (with the right flag) to find out how many lines are in the data files.  Pipe **`wc`** into **`sort`** to be able to detect any outliers easily. What have you found?


In [None]:
%%bash
# use wc and sort to sort line counts for *.txt files


You should have found that one of the files is too short (240 lines), while the rest of them have 300 lines. The command above should have been **`wc -l *.txt | sort -n`**.  

This means that Nelle's script should be checking the length of every data file before processing it, otherwise she may have wrong results.
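
The outlier check can be tried without Nelle's data. The sketch below is only an illustration: it builds a throwaway directory with three made-up files (the names `sampleA.txt`, `sampleB.txt` and `sampleC.txt` are assumptions, not part of the lesson data) and then runs the same `wc -l | sort -n` pipeline on them.

```shell
# Create a temporary directory with stand-in files (hypothetical names).
demo_dir=$(mktemp -d)
seq 1 300 > "$demo_dir/sampleA.txt"   # 300 lines
seq 1 300 > "$demo_dir/sampleB.txt"   # 300 lines
seq 1 240 > "$demo_dir/sampleC.txt"   # 240 lines, the outlier

# Sort numerically by line count; the shortest file ends up on the first line.
shortest=$(cd "$demo_dir" && wc -l *.txt | sort -n | head -n 1)
echo "$shortest"

# Clean up the temporary directory.
rm -r "$demo_dir"
```

Because `sort -n` orders the counts from smallest to largest, a file that is too short appears at the top of the list and the `total` line appears at the bottom.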

## Running goostats
<hr style="border: solid 1px gray; margin-top: 1.5% ">

Nelle has not been given any detail about how to run the **`goostats`** command. She would normally ask her colleagues who used the script before but they are gone on a marine expedition.  Instead, Nelle tries to type the command to see if it produces any useful information.

*Note: Nelle uses "./" in front of **`goostats`**.  Without it, the shell would report "command not found".  This is because the shell only looks for executable commands in a specific list of directories. Since the current directory is normally not in this list of directories, Nelle has to specify a relative or absolute path to **`goostats`**; **`./goostats`** means look for goostats in the current directory.*

In [None]:
%%bash
./goostats

The program is supposed to be run (called) with two arguments - file1 and file2.  This does help but not entirely. What are the two files?

Nelle has to resort to examining the program further.  She types **`file goostats`** to see if it is binary or text.  If it is a text file, she can open it and read it.  This is something that we should not have to do, but it happens from time to time.

In [None]:
%%bash
file goostats

Good!  **`goostats`** is a text file and Nelle can read it.  

In [None]:
%%bash 
cat goostats

Nelle is happy to find out that the program is simple and that she can understand most of it.  Looking at the last command in the file, **`head -n 3 $1 | cut -d , -f 1 | sort | uniq > $2`**, she concludes that the first argument "\\$1" should be the data file and the second argument "\\$2" should be the result file.  

She also notices that the program is a shell script and it does not really do any statistics but that is ok since this is an imaginary scenario.
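
To see what that pipeline does step by step, here is a minimal sketch run on a tiny made-up input. The rows in `input.txt` are invented for illustration and do not reflect the real format of Nelle's data files.

```shell
demo_dir=$(mktemp -d)

# Four made-up comma-separated rows standing in for a data file.
printf 'b,2\na,1\nb,3\nc,4\n' > "$demo_dir/input.txt"

# head keeps the first 3 rows, cut keeps the first comma-separated field,
# sort orders the values and uniq drops adjacent duplicates.
head -n 3 "$demo_dir/input.txt" | cut -d , -f 1 | sort | uniq > "$demo_dir/output.txt"

result=$(cat "$demo_dir/output.txt")
echo "$result"

rm -r "$demo_dir"
```

The first three rows contribute the values `b`, `a` and `b`, so the result file contains `a` and `b`, one per line. The fourth row is never read.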

Nelle assembles an example **`goostats`** command and runs it. Since **`goostats`** is a shell script, she can use **`bash`** to run it.  Nelle also decides that the output files will be named **`stats-NENE*[AB].txt`**.

In [None]:
%%bash
bash goostats  NENE01729A.txt stats-NENE01729A.txt
ls -l *NENE01729A.txt

It looks like everything is ok: **`goostats`** ran without errors and produced a result file.  Nelle is ready to write a shell script that will process all ~1500 files.

## Nelle's script
<hr style="border: solid 1px gray; margin-top: 1.5% ">

Nelle takes some time to think about her script and assembles a list of features:
- To process multiple files, the script will use a for loop
- The script will check if it was called with the right number of arguments (Nelle noticed that the **`goostats`** script contains a construct that may do this)
- The script will check if the input file has the correct number of lines
- The final script will have two arguments - filename (with wildcards) and the correct line count


## Step 1 - for loop

Write a loop that iterates over all **`NENE*[AB].txt`** files; use **`filename`** as the variable name. In each iteration, the loop executes the command **`echo "bash goostats $filename stats-$filename"`** (this does not execute **`goostats`**, it only prints the command - it is a **dry run**).

Use **`nano`** to create such a script and name it **`rungoostats.sh`**.  For now, it will not take any arguments from the command line; we will add them later.  If you need a bit of help, you can have a look at the bottom of this notebook.

Test the script, it should produce something like this:

```
bash goostats NENE01729A.txt stats-NENE01729A.txt
bash goostats NENE01729B.txt stats-NENE01729B.txt
bash goostats NENE01736A.txt stats-NENE01736A.txt
bash goostats NENE01751A.txt stats-NENE01751A.txt
bash goostats NENE01751B.txt stats-NENE01751B.txt
bash goostats NENE01812A.txt stats-NENE01812A.txt
bash goostats NENE01843A.txt stats-NENE01843A.txt
bash goostats NENE01843B.txt stats-NENE01843B.txt
bash goostats NENE01978A.txt stats-NENE01978A.txt
bash goostats NENE01978B.txt stats-NENE01978B.txt
bash goostats NENE02018B.txt stats-NENE02018B.txt
bash goostats NENE02040A.txt stats-NENE02040A.txt
bash goostats NENE02040B.txt stats-NENE02040B.txt
bash goostats NENE02043A.txt stats-NENE02043A.txt
bash goostats NENE02043B.txt stats-NENE02043B.txt

```


## Step 2 - line count

One of the requirements is to count lines in each data file and to test the count. Before we can learn how to test, we need to figure out how to extract the line count into a variable (that will be tested against the desired count).  

Often, if faced with a shell scripting problem, one can try out various commands directly on the command line. When a suitable solution is found, one can copy it into the shell script.  See and execute the cells below.

If you recall, **`wc -l`** produces the line count plus the file name (e.g., **`wc -l NENE01729A.txt`** produces **`300 NENE01729A.txt`**).  We need to find a way to extract the line count without the filename.  A quick google search reveals a small trick - use **`cat`** and pipe to **`wc -l`**, e.g. **`cat NENE01729A.txt | wc -l`**.

In [None]:
%%bash 
wc -l NENE01729A.txt

In [None]:
%%bash
cat NENE01729A.txt | wc -l
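
The sketch below compares the ways of counting lines on a temporary stand-in file (created with `mktemp`, so it does not touch Nelle's data). It also shows input redirection with `<`, which is an alternative to the `cat` trick and is not used elsewhere in this notebook.

```shell
# A temporary file with exactly 300 lines.
demo_file=$(mktemp)
seq 1 300 > "$demo_file"

with_name=$(wc -l "$demo_file")      # count followed by the file name
piped=$(cat "$demo_file" | wc -l)    # count only: wc reads from the pipe
redirected=$(wc -l < "$demo_file")   # count only: wc reads from standard input

echo "$with_name"
echo $piped
echo $redirected

rm "$demo_file"
```

In the last two forms **`wc`** is never given a file name, so it has no name to print.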

Now, we need to store the line count in a variable.  This is done as follows:

In [None]:
%%bash
LINECOUNT=$(cat NENE01729A.txt | wc -l)
echo $LINECOUNT

Note about variables:
- in shell scripts, user-defined variable names are often in capitals
- at the assignment, use no "\$" before the variable name
- at the assignment, make sure there are no spaces surrounding the equal sign (the shell would parse it as arguments to a command)
- when recalling the value, use "\$" before the name of the variable

You may also remember that we used the "**`$(command)`**" syntax when we learned about the command **`find`**.  This forces the command in parentheses to execute first; its output is then assigned to the variable.
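
The rules above can be tried in isolation. This is a small sketch with made-up variable names (`GREETING` and `WORDCOUNT` are not part of Nelle's script):

```shell
# Assignment: no "$" on the left and no spaces around "=".
GREETING="hello"

# Command substitution: the pipeline inside $( ) runs first
# and its output becomes the value of the variable.
WORDCOUNT=$(echo "one two three" | wc -w)

# Recall: put "$" in front of the variable name.
echo $GREETING
echo $WORDCOUNT
```

Writing `GREETING = "hello"` with spaces would instead make the shell look for a command called `GREETING` and pass it two arguments.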

Let's add the **`LINECOUNT`** variable to **`rungoostats.sh`** using **`nano`**.  By now, the script should look something like this:


```
for filename in NENE*[AB].txt
do
  LINECOUNT=$(cat $filename | wc -l)
  echo $filename $LINECOUNT
  echo "bash goostats $filename stats-$filename"
done

```


## Step 3 - test the line count

We are ready to learn about conditionals that let us test an expression and execute different commands depending on the outcome of the test.  Similarly to loops, they are one of the fundamental building blocks of programs. 

We want to compare the value of **`LINECOUNT`** to 300 (the expected line count).  If the comparison is true, the script runs **`goostats`**. If the comparison is false, the script reports a problem and skips the **`goostats`** command.


This is the syntax for a conditional: 

```
if [ $LINECOUNT -ne 300 ]
  then
     echo "Error: $filename has $LINECOUNT lines."
  else
     echo "bash goostats $filename stats-$filename"
fi

```

- a conditional always contains the keywords **`if`**, **`then`** and **`fi`**; the **`else`** branch is optional
- the expression in square brackets is evaluated to true or false; it is important that each bracket and each element of the expression is surrounded by spaces, as the shell parses the expression as a command followed by arguments
- the comparison operator "-ne" means "not equal" and is used for numerical comparison
- more details on conditionals are listed in the tutorial linked from the main README
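
To watch both branches of a conditional run, the sketch below tests two made-up counts against an expected value. The names `COUNT`, `EXPECTED`, `VERDICT` and `RESULTS` are chosen for this illustration only.

```shell
EXPECTED=300
RESULTS=""

# Try one count that differs from the expected value and one that matches.
for COUNT in 240 300
do
  if [ $COUNT -ne $EXPECTED ]
  then
    VERDICT="mismatch"    # the test was true: the numbers differ
  else
    VERDICT="ok"          # the test was false: the numbers are equal
  fi
  echo "$COUNT lines: $VERDICT"
  RESULTS="$RESULTS $VERDICT"
done
```

The first pass takes the **`then`** branch and the second pass takes the **`else`** branch.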


In **`nano`**, add the conditional in the right place in **`rungoostats.sh`** - inside the loop, after the LINECOUNT variable is assigned a value.  The **`goostats`** command is now inside the conditional. 

If you have problems, have a look at the bottom of the notebook.  If on a local system, you can test the script in the cell below.

In [None]:
%%bash
# this works only on a local system
bash rungoostats.sh

## Step 4 - the command line arguments

Nelle decided that the script will have two command line arguments.  One for file type (e.g. **`NENE*[AB].txt`**) and one for the correct line count.

You may remember from the episode on shell scripts that using wildcards in shell script arguments is tricky.  Since wildcards expand to a list of files before the shell script runs, it is impossible to know how many arguments this will result in.  This makes using the positional arguments (\\$1,\\$2 etc.) inside the script difficult.  In the same episode, we solved the problem by using **`$@`**, which refers to all arguments on the command line.  

Nelle's script requires a different solution.  If we use **`$@`**, we cannot add a second argument for the line count.  There is another solution: we can call the script as follows.

```
bash rungoostats.sh "NENE*[AB].txt" 300
```
If we quote the first argument, the wildcard expansion is delayed until the shell reaches the for loop inside the script. This means that, on the command line, we have only two arguments and we can use **`$1`** and **`$2`**.
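
The difference between a quoted and an unquoted wildcard can be observed directly. The sketch below is an illustration under stated assumptions: it creates three empty files in a temporary directory and a hypothetical helper script, `count-args.sh`, that reports how many arguments it received and then loops over its first argument.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir"
touch a.txt b.txt c.txt

# Hypothetical helper: print the argument count, then loop over $1.
cat > count-args.sh << 'EOF'
echo "received $# argument(s)"
for name in $1
do
  echo "  loop sees: $name"
done
EOF

# Unquoted: the shell expands *.txt before the script starts.
unquoted=$(bash count-args.sh *.txt)
# Quoted: the pattern arrives as one argument and expands inside the loop.
quoted=$(bash count-args.sh "*.txt")

echo "$unquoted"
echo "$quoted"

cd - > /dev/null
rm -r "$demo_dir"
```

In the unquoted call the script receives three arguments and the loop only sees the first file. In the quoted call it receives one argument, and the loop sees all three files.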

In **`nano`**, change the script:
- so that it uses the first positional argument in place of the file name pattern in the first line of the for loop
- so that it uses the second positional argument in the line count test

Test the script in the cell below. If you need help, the solution is at the bottom of the notebook.

In [None]:
%%bash
# this works only on a local system
bash rungoostats.sh "NENE*[AB].txt" 300

## Step 5 - argument check

We are ready to add another conditional at the top of the script that will test if the script is run with two arguments. This is a very simple but effective check to prevent major errors.

The conditional we have used so far has commands for both outcomes of the test (true and false). Sometimes a conditional only needs commands for the true case, as in this hypothetical example:

```
if [ expression ]
then
  echo "expression is true"
fi

```

In Episode 6, we covered the variable **`$#`** that holds the total number of arguments on the command line.  Using **`nano`**, add another conditional to the top of the script that tests if the number of arguments is 2.  If it is not, print a warning message and exit with **`exit`**.

If you need help, check the bottom of the notebook. Test in the cell below, try with the correct arguments as well as with incomplete ones.

In [None]:
%%bash
# this works only on a local system
bash rungoostats.sh "NENE*[AB].txt" 300
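
A caller can also find out whether such a check passed by looking at the exit status. The sketch below uses a hypothetical helper script, `needs-two.sh`, and passes a number to **`exit`**. That number is an addition for illustration: by convention, 0 means success and a non-zero value means failure, whereas Nelle's script uses a plain **`exit`**.

```shell
demo_dir=$(mktemp -d)

# Hypothetical helper that insists on exactly two arguments.
cat > "$demo_dir/needs-two.sh" << 'EOF'
if [ $# -ne 2 ]
then
  echo "expected 2 arguments, got $#"
  exit 1
fi
echo "arguments look fine"
EOF

# Run it with one argument and record the exit status if it fails.
STATUS_BAD=0
bash "$demo_dir/needs-two.sh" only-one || STATUS_BAD=$?

# Run it with two arguments.
STATUS_GOOD=0
bash "$demo_dir/needs-two.sh" first second || STATUS_GOOD=$?

echo "exit status with one argument: $STATUS_BAD"
echo "exit status with two arguments: $STATUS_GOOD"

rm -r "$demo_dir"
```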

## Step 6 - comments

Finally, add comments to the script (start each comment with "\#").  It is a good custom to describe what a script does at the top. You can also add comments to the commands that you may find difficult to remember later.

## Step 7 - run the script

Remove the **`echo`** command from the loop and run the script. Now **`goostats`** will run on each file.

The final script should look something like this:

```
# this runs goostats on all given files
# arg 1: filenames using wildcard, all in quotes
# arg 2: correct line count per file

if [ $# -ne 2 ]
then
 echo "usage: bash rungoostats.sh \"files\" line_count"
 exit
fi

for filename in $1
do
  echo $filename
  LINECOUNT=$(cat $filename | wc -l)
  if [ $LINECOUNT -ne $2 ]
  then
     echo "Error: $filename has $LINECOUNT lines."
  else
     bash goostats $filename stats-$filename
  fi
done

```

You can run in the cell below or on the command line.

In [None]:
%%bash
# this works only on a local system
bash rungoostats.sh "NENE*[AB].txt" 300

The script is now reasonably good.  It can be used on various file types with a varying line count. It has comments that should make it easy to share the script with others.
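
If you do not have the practice data at hand, the whole workflow can be rehearsed in a scratch directory. The sketch below is a self-contained illustration, not Nelle's real setup: the `DEMO*` files are made up, and the stand-in `goostats` simply copies the first three lines of its input.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir"

# Stand-in data: two complete files and one truncated file (made-up names).
seq 1 300 > DEMO01A.txt
seq 1 300 > DEMO01B.txt
seq 1 240 > DEMO02A.txt

# Stand-in for goostats (not the real program): copy the first three lines.
cat > goostats << 'EOF'
head -n 3 $1 > $2
EOF

EXPECTED=300
for filename in DEMO*[AB].txt
do
  LINECOUNT=$(cat $filename | wc -l)
  if [ $LINECOUNT -ne $EXPECTED ]
  then
    echo "Error: $filename has $LINECOUNT lines."
  else
    bash goostats $filename stats-$filename
  fi
done

# List the result files that were produced.
produced=$(ls stats-*.txt)
echo "$produced"

cd - > /dev/null
rm -r "$demo_dir"
```

Only the two complete files get a `stats-` result file. The truncated file is reported and skipped.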

<hr style="border: solid 1px red; margin-top: 1.5% ">

## Key points
- you have developed a moderately complicated shell script that illustrates the power of the command line - task automation
- you know the fundamental building blocks for a shell script - variables, loops and conditionals
- you can use the command line arguments for shell scripts
- you recognise the importance of using good comments

<hr style="border: solid 1px gray; margin-top: 1.5% ">

## Solution to Step 1

```
for filename in NENE*[AB].txt
do
  echo "bash goostats $filename stats-$filename"
done

```

## Solution to Step 3

```
for filename in NENE*[AB].txt
do
  echo $filename
  LINECOUNT=$(cat $filename | wc -l)
  if [ $LINECOUNT -ne 300  ]
  then
     echo "Error: $filename has $LINECOUNT lines."
  else
     echo "bash goostats $filename stats-$filename"
  fi
done
```

## Solution to Step 4

```
for filename in $1
do
  echo $filename
  LINECOUNT=$(cat $filename | wc -l)
  if [ $LINECOUNT -ne $2 ]
  then
     echo "Error: $filename has $LINECOUNT lines."
  else
     echo "bash goostats $filename stats-$filename"
  fi
done
```

## Solution to Step 5

```
if [ $# -ne 2 ]
then
 echo "usage: bash rungoostats.sh \"files\" line_count"
 exit
fi

for filename in $1
do
  echo $filename
  LINECOUNT=$(cat $filename | wc -l)
  if [ $LINECOUNT -ne $2 ]
  then
     echo "Error: $filename has $LINECOUNT lines."
  else
     echo "bash goostats $filename stats-$filename"
  fi
done

```