# Episode 8 - Shell script for Nelle

This notebook is an extension to the [Unix Shell lesson](https://kmichali.github.io/SC-shell-novice/) from [Software Carpentry](https://software-carpentry.org). It describes how to develop a shell script to solve an example problem that was outlined in [Episode 1](https://kmichali.github.io/SC-shell-novice/01-intro/index.html).

### Questions:
- How do you run a program on ~1500 data files using the command line?
- How do you make your script user-friendly?
- How do you check if the data files are valid?

### Objectives:
- Write a shell script that runs a command or series of commands for a fixed set of files.
- Learn about the fundamental scripting building blocks - loops, variables and conditionals.
- Learn about argument validation.

<hr style="border: solid 1px red; margin-top: 1.5% ">

## Nelle's pipeline: a typical problem
<hr style="border: solid 1px gray; margin-top: 1.5% ">

Nelle Nemo, a marine biologist, has just returned from a six-month survey of the [North Pacific Gyre](https://en.wikipedia.org/wiki/North_Pacific_Gyre), where she has been sampling gelatinous marine life in the [Great Pacific Garbage Patch](https://en.wikipedia.org/wiki/Great_Pacific_garbage_patch). She has 1520 samples that she’s run through an assay machine to measure the relative abundance of 300 proteins. She needs to run these 1520 files through an imaginary program called **`goostats`** she inherited. On top of this huge task, she has to write up results by the end of the month so her paper can appear in a special issue of Aquatic Goo Letters.

The bad news is that if she has to run **`goostats`** by hand using a GUI, she’ll have to select and open a file 1520 times. If **`goostats`** takes 30 seconds to run each file, the whole process will take more than 12 hours of Nelle’s attention. With the shell, Nelle can instead assign her computer this mundane task while she focuses her attention on writing her paper.

The next few lessons will explore the ways Nelle can achieve this. More specifically, they explain how she can use a command shell to run the **`goostats`** program, using loops to automate the repetitive steps of entering file names, so that her computer can work while she writes her paper.

As a bonus, once she has put a processing pipeline together, she will be able to use it again whenever she collects more data.

## Preparing to write the script
<hr style="border: solid 1px gray; margin-top: 1.5% ">

Let's change to the directory **`data-shell/north-pacific-gyre/2012-07-03/`** and examine its contents.

In [1]:
cd data-shell/north-pacific-gyre/2012-07-03

/Users/katerina/GS_comm_line/notebooks/data-shell/north-pacific-gyre/2012-07-03


In [None]:
%%bash 
ls -l

The directory contains some data files (all 1500 would be impractical) and the **`goostats`** program.  The goal is to run **`goostats`** on every **`txt`** file in the directory.

In preparation for writing the scripts, one would normally try to do two things:
- examine the data files and make sure that they are all valid
- figure out the correct command to run **`goostats`**


## Validating data files
<hr style="border: solid 1px gray; margin-top: 1.5% ">

Let's have a look at the filenames. They all seem to follow the same format: **`NENE*[ABZ].txt`**. The square brackets match exactly one character, which must be A, B or Z.  Nelle knows that the **`NENE*Z.txt`** files are marked with a "Z" because something went wrong with the input data for the protein measurement, and she has to remember to exclude those from her further analysis.
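
To see how the square-bracket pattern selects files, the sketch below builds a throwaway directory with a few file names modeled on Nelle's and lists the matches for two patterns. The file names and the temporary directory are assumptions made for illustration, not her real data.

```shell
# Work in a temporary directory so the real data is untouched.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

# Hypothetical file names following the NENE*[ABZ].txt format.
touch NENE01729A.txt NENE01729B.txt NENE01971Z.txt

# [AB] matches exactly one character, either A or B.
good_files=$(echo NENE*[AB].txt)
# A literal Z before .txt selects only the flagged files.
bad_files=$(echo NENE*Z.txt)

echo "to analyse: $good_files"
echo "to exclude: $bad_files"
```

Using `echo` with a pattern is a cheap way to preview which files a wildcard will match before passing the same pattern to a real command.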

She also knows that the files contain measurements for 300 proteins and the line count should reflect that. 

In the next cell, use **`wc`** (with the right flag) to find out how many lines are in the data files.  Pipe **`wc`** into **`sort`** to be able to detect any outliers easily. What have you found?


In [None]:
%%bash
# use wc and sort to sort line counts for *.txt files


You should have found that one of the files is too short (240 lines), while the rest have 300 lines. The command above should have been **`wc -l *.txt | sort -n`**.  

This means that Nelle's script should check the length of every data file before processing it; otherwise she may get wrong results.
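
One way such a check could look is sketched below. It is only an illustration under stated assumptions: the expected count of 300 lines comes from the text above, while the file names, the temporary directory and the variable names are made up for the demonstration.

```shell
# Build two small stand-in data files in a temporary directory.
demo_dir=$(mktemp -d)
seq 1 300 > "$demo_dir/complete.txt"   # 300 lines, as expected
seq 1 240 > "$demo_dir/short.txt"      # 240 lines, too short

expected=300
valid=""
invalid=""
for datafile in "$demo_dir"/*.txt
do
    name=$(basename "$datafile")
    # Reading from stdin makes wc print only the number, not the file name;
    # tr removes the padding spaces that some versions of wc add.
    nlines=$(wc -l < "$datafile" | tr -d ' ')
    if [ "$nlines" -eq "$expected" ]
    then
        valid="$valid $name"
    else
        invalid="$invalid $name"
        echo "skipping $name: $nlines lines instead of $expected"
    fi
done
echo "valid files:$valid"
```

The same loop, variable and conditional structure can later be wrapped around the real processing command, so that only files of the right length are processed.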

## Running goostats
<hr style="border: solid 1px gray; margin-top: 1.5% ">

Nelle has not been given any details about how to run the **`goostats`** command. She would normally ask her colleagues who used the program before, but they are away on a marine expedition.  Instead, Nelle tries typing the command on its own to see if it produces any useful information.

*Note: Nelle uses "./" in front of **`goostats`**.  Without it, the shell would report "command not found".  This is because the shell only looks for executable commands in a specific list of directories. Since the current directory is usually not in this list, Nelle has to specify a relative or absolute path to **`goostats`**; **`./goostats`** means "look for goostats in the current directory".*
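
The list of directories mentioned in the note is stored in the `PATH` environment variable. The sketch below prints it one directory per line and uses `command -v` to ask the shell where it would find a command. Here `ls` serves only as an example of a command that is normally found, and the outcome for `goostats` assumes it has not been installed in any of the listed directories.

```shell
# PATH is a colon-separated list; print one directory per line.
echo "$PATH" | tr ':' '\n'

# command -v prints the location the shell would use for a command.
ls_location=$(command -v ls)
echo "ls is found at: $ls_location"

# For a name the shell cannot find, command -v prints nothing
# and returns a non-zero exit status.
if command -v goostats > /dev/null
then
    echo "goostats is on the PATH"
else
    echo "goostats is not on the PATH, so it must be run as ./goostats"
fi
```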

In [None]:
%%bash
./goostats

The program is supposed to be run (called) with two arguments, file1 and file2.  This helps, but not entirely: what are the two files?

Nelle has to resort to examining the program further.  She types **`file goostats`** to see if it is binary or text.  If it is a text file, she can open it and read it.  This is something that nobody likes to do, but it does happen from time to time.

In [None]:
%%bash
file goostats

Good!  **`goostats`** is a text file and Nelle can read it.  

In [None]:
%%bash 
cat goostats

Nelle is happy to find out that the program is simple and, having completed the shell class a while back, she can understand most of it.  Looking at the last command in the file, **`head -n 3 $1 | cut -d , -f 1 | sort | uniq > $2`**, she concludes that the first argument "\$1" should be the data file and the second argument "\$2" should be the result file.  
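
To make the role of the two arguments concrete, the sketch below wraps the same pipeline in a shell function, since positional parameters behave the same way inside a function as inside a script. The function name `first_column_summary`, the sample data and the file names are all hypothetical; only the pipeline itself is taken from **`goostats`**.

```shell
# $1 is the first argument (input file), $2 the second (output file).
# They are quoted here so that file names containing spaces also work.
first_column_summary() {
    head -n 3 "$1" | cut -d , -f 1 | sort | uniq > "$2"
}

demo_dir=$(mktemp -d)
# Made-up comma-separated data; not Nelle's real measurements.
printf 'b,1\na,2\nb,3\nc,4\n' > "$demo_dir/input.txt"

first_column_summary "$demo_dir/input.txt" "$demo_dir/output.txt"
result=$(cat "$demo_dir/output.txt")
echo "$result"
```

The pipeline keeps the first three lines of the input, extracts the first comma-separated field from each, and writes the sorted, de-duplicated values to the output file.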

She also notices that the program is a shell script and it does not really do any statistics but that is ok since this is an imaginary scenario.

Nelle assembles an example **`goostats`** command and runs it. Since **`goostats`** is a shell script, she can use **`bash`** to run it.  Nelle also decides that the output files will be named **`stats-NENE*[AB].txt`**.

In [None]:
%%bash
bash goostats  NENE01729A.txt stats-NENE01729A.txt
ls -l *NENE01729A.txt

It looks like everything is ok: **`goostats`** ran and produced a result file.  Nelle is ready to write a shell script that will process all ~1500 files.

## Nelle's script
<hr style="border: solid 1px gray; margin-top: 1.5% ">

