# Episode 4 - Pipes and filters
This notebook is based on a snapshot of [Episode 4](https://kmichali.github.io/SC-shell-novice/04-pipefilter/index.html) of the [Unix Shell lesson](https://kmichali.github.io/SC-shell-novice/) from [Software Carpentry](https://software-carpentry.org). The original material has more detail.

### Questions:
- How can I combine existing commands to do new things?

### Objectives:
- Redirect a command’s output to a file.
- Construct command pipelines with two or more stages.
- Explain Unix’s ‘small pieces, loosely joined’ philosophy.

<hr style="border: solid 1px red; margin-top: 1.5% ">

### Video
Learn with video:
- [part 1](https://imperial.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=0c667489-536e-40f0-9b1a-abd600d1e737)
- [part 2](https://imperial.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=5de519db-bffd-46d5-a5fd-abd600d6017f)


### Practice data in Google Colab
If you are viewing this notebook in Colab and have saved it in your Drive ("File"->"Save a copy in Drive"), run the cell below to download practice data.

In [None]:
%%bash
[ -e data-shell ] && echo "data already exists" || { wget https://kmichali.github.io/SC-shell-novice/data/data-shell.zip; unzip data-shell.zip; } 

<hr style="border: solid 1px red; margin-top: 1.5% ">

In this episode, we will work with the directory **`data-shell/molecules`**.

In [None]:
cd data-shell/molecules

## Which molecule is the smallest?
<hr style="border: solid 1px gray; margin-top: 1.5% ">

In this directory, you will find several files that represent small molecules in Protein Data Bank format, a simple text format that specifies the type and position of each atom in the molecule.  Our task for this lesson is to explore ways of finding out which of the molecules is the smallest (the one with the fewest atoms and hence the fewest lines).

Let's begin by examining the files.  The commands below list the pdb files and show the contents of cubane.pdb.

In [None]:
%%bash
ls *.pdb
cat cubane.pdb

## Counting lines
<hr style="border: solid 1px gray; margin-top: 1.5% ">

The command **`wc`** (word count) counts the number of lines, words and characters in a file.

In [None]:
%%bash
wc cubane.pdb

cubane.pdb has 20 lines, 156 words and 1158 characters.  To limit the output to the number of lines only, use **`wc -l`**.

In [None]:
%%bash
wc -l cubane.pdb

We can use a wildcard to count lines in all pdb files at once.

In [None]:
%%bash 
wc -l *pdb

## Sorting
<hr style="border: solid 1px gray; margin-top: 1.5% ">

In this case, we can identify the smallest molecule relatively easily.  However, if we had hundreds of molecules in the same directory, it would be more difficult.  We need to be able to sort the output numerically so the smallest molecule appears at the top.

One can sort text or numbers with the **`sort`** command.  By default, **`sort`** compares lines character by character, so any number starting with 1 comes before any number starting with 2 (for example, 10 is placed before 2).  To sort by numerical value, use **`sort -n`**.
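
The difference between the two orderings can be sketched with a few made-up numbers. The file name `numbers.txt` and the scratch directory are assumptions for illustration only and are not part of the lesson data.

```shell
# Scratch directory so nothing in data-shell is touched.
tmpdir=$(mktemp -d)

# Four made-up numbers, one per line; ">" writes them to a file
# (output redirection is covered in the next section).
printf '10\n9\n100\n2\n' > "$tmpdir/numbers.txt"

# Default sort compares character by character: 10, 100, 2, 9
alpha_sorted=$(sort "$tmpdir/numbers.txt")

# sort -n compares numerical values: 2, 9, 10, 100
numeric_sorted=$(sort -n "$tmpdir/numbers.txt")

echo "sort:"
echo "$alpha_sorted"
echo "sort -n:"
echo "$numeric_sorted"

# Clean up the scratch directory.
rm -r "$tmpdir"
```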

There is a slight problem.  We need to take the output of **`wc -l`** and use it as input to **`sort -n`**.  However, the output only appears on the screen, where **`sort`** cannot read it.  Next, we are going to learn two ways of handling this problem.
- we are going to capture the output of **`wc -l`** in a new file and use **`sort -n`** on this file
- we are going to use **`wc -l`** and **`sort -n`** together using a powerful utility called "pipe"


## Command output redirection
<hr style="border: solid 1px gray; margin-top: 1.5% ">

Normally, commands print output to the screen.  We can use the symbol **`>`** to redirect this output to a file. The next example takes the output of **`wc -l`** and puts it in a new file (one can choose any name; I call it **`lengths.txt`**).  Note that no output will appear on the screen.

In [None]:
%%bash
wc -l *pdb > lengths.txt

Check the file lengths.txt with **`cat`**.

In [None]:
%%bash
cat lengths.txt

- If you use a single **`>`**, a new file will always be made. If a file of the same name exists, it will be overwritten.


- If you use a double **`>>`**, one of two things will happen:
  - if a file of the same name exists, it will be appended to
  - if a file of the same name does not exist, a new file will be made
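
The rules above can be checked with a short sketch. The file names `demo.txt` and `new.txt` are hypothetical and live in a scratch directory, so no lesson files are changed.

```shell
# Scratch directory with hypothetical file names.
tmpdir=$(mktemp -d)
out="$tmpdir/demo.txt"

echo "first"  > "$out"     # file does not exist yet: it is created
echo "second" > "$out"     # file exists: its contents are replaced
after_overwrite=$(cat "$out")

echo "third" >> "$out"     # file exists: the line is appended
after_append=$(cat "$out")

echo "fourth" >> "$tmpdir/new.txt"   # file does not exist: ">>" creates it
new_file=$(cat "$tmpdir/new.txt")

echo "after > :"
echo "$after_overwrite"
echo "after >> :"
echo "$after_append"

# Clean up the scratch directory.
rm -r "$tmpdir"
```

After the two **`>`** commands only `second` is left in the file; the **`>>`** command then adds `third` below it.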

## Practice

In the next cell, capture the output of the **`ls`** command in a file.  First try with a single **`>`** and then with **`>>`**.  Check the output file with **`cat`**.

In [None]:
%%bash
#put your commands here


## Smallest molecule using output redirection
<hr style="border: solid 1px gray; margin-top: 1.5% ">

The list of commands below finds the smallest molecule with the help of output redirection.

In [None]:
%%bash
wc -l *.pdb > lengths.txt
sort -n lengths.txt > sorted_lengths.txt
cat sorted_lengths.txt

We have found the smallest molecule - methane.pdb.  This method works but, as you may have noticed, it requires a temporary file for every step.  If there are more steps in the process, this could get messy very quickly.


## Smallest molecule using the pipe utility
<hr style="border: solid 1px gray; margin-top: 1.5% ">

Knowing about output redirection is very useful, but our problem is best served by the **pipe** utility. If the pipe symbol **`|`** is placed between two commands on the same line, the output of the first one is directly "piped" as input to the next command.  One can use many pipes on the same line.

This utility is very powerful; it lets us combine simple commands into useful pipelines.  This is very much the essence of the command line philosophy - provide relatively simple building blocks that can be combined to achieve complex outcomes.
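
As a rough sketch of the idea, the cell below runs the same two steps twice, once through an intermediate file and once through a pipe, and compares the results. The three text files and their names are made up for illustration; they are not the lesson's molecules.

```shell
# Scratch directory with three made-up files of 3, 1 and 2 lines.
tmpdir=$(mktemp -d)
printf 'a\nb\nc\n' > "$tmpdir/three.txt"
printf 'a\n'       > "$tmpdir/one.txt"
printf 'a\nb\n'    > "$tmpdir/two.txt"

# Route 1: store the counts in an intermediate file, then sort that file.
# The file ends in .out so that the *.txt wildcard does not pick it up.
wc -l "$tmpdir"/*.txt > "$tmpdir/lengths.out"
via_file=$(sort -n "$tmpdir/lengths.out")

# Route 2: pipe the counts straight into sort, with no intermediate file.
via_pipe=$(wc -l "$tmpdir"/*.txt | sort -n)

echo "$via_pipe"
if [ "$via_file" = "$via_pipe" ]; then
    echo "both routes give the same result"
fi

# Clean up the scratch directory.
rm -r "$tmpdir"
```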

### How many files

I use the pipe very often for counting files.  If you deal with lots of data, a frequently asked question is "How many data files are in this directory?"

The example below, one of the most common pipe expressions, answers this question.

In [None]:
%%bash 
ls -l *.pdb | wc -l

Now back to our task.  The cell below shows the pipe sequence that reproduces the redirection-based commands from the previous section and finds the smallest molecule.

In [None]:
%%bash
wc -l *pdb | sort -n 

Let's imagine that the list of molecules is very long and it is not practical to view the whole output.  We can easily add another command that displays only the first line - **`head -n 1`**.

In [None]:
%%bash
wc -l *pdb | sort -n | head -n 1 

Please note that only the first command of the pipe is given file names as an argument (in our case, **`*.pdb`**).  The rest of the commands take options but no file names, as they operate on the output of the previous command.
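
A small sketch of this point, assuming a made-up file `counts.txt` that stands in for the output of **`wc -l`** (the names and numbers in it are invented): **`sort`** reads the file when it is given a file name, and reads the output of the previous command when it is not.

```shell
# Scratch directory with invented counts, one "count name" pair per line.
tmpdir=$(mktemp -d)
printf '20 big.pdb\n9 small.pdb\n12 medium.pdb\n' > "$tmpdir/counts.txt"

# With a file name, sort reads the file itself.
from_file=$(sort -n "$tmpdir/counts.txt" | head -n 1)

# Without a file name, sort reads whatever the previous command prints.
from_pipe=$(cat "$tmpdir/counts.txt" | sort -n | head -n 1)

echo "$from_file"
echo "$from_pipe"

# Clean up the scratch directory.
rm -r "$tmpdir"
```

In both cases **`head -n 1`** has no file name either; it simply keeps the first line it receives from **`sort`**.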

## New commands
<hr style="border: solid 1px gray; margin-top: 1.5% ">

In this episode, we have learned about **`wc`**, **`sort`** and **`head`**.  These commands can be loosely described as "filters": they act on a stream of input and transform it into a stream of output.

Let's add a few more commands.

- The counterpart of **`head`** is **`tail`**; it shows a specified number of lines from the end of a file; for example **`tail -n 3 lengths.txt`** will return the last three lines from lengths.txt.

- The **`sort`** command is often combined with **`uniq`**; for example **`sort test.txt | uniq`** will sort the file test.txt and remove all duplicate lines. If you want to see how many times each line occurred, use **`sort test.txt | uniq -c`**.  Note that **`uniq`** only collapses duplicate lines that are adjacent, so it will not work as expected on a file that is not sorted.

- The **`cut`** command is useful when handling tabular data: it "cuts" columns out using a separator; for example **`cut -d , -f 1 mytable.csv`** will return the first column from the comma-delimited file mytable.csv (the flag -d is for the delimiter and -f is for the column number).
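
The three commands above can be tried on small made-up files. The sketch below assumes hypothetical contents for `test.txt` and `mytable.csv`, created in a scratch directory; neither file is part of the lesson data.

```shell
# Scratch directory with two hypothetical files.
tmpdir=$(mktemp -d)
printf 'red\nblue\nred\ngreen\nred\n' > "$tmpdir/test.txt"
printf 'id,colour\n1,red\n2,blue\n'   > "$tmpdir/mytable.csv"

# tail: the last two lines of the file (green, red).
last_two=$(tail -n 2 "$tmpdir/test.txt")

# uniq on unsorted input: the repeated "red" lines are not next to
# each other, so all five lines are kept.
unsorted=$(uniq "$tmpdir/test.txt")

# sort first, then uniq: duplicates are adjacent and are collapsed.
deduped=$(sort "$tmpdir/test.txt" | uniq)

# uniq -c puts the number of occurrences in front of each line.
counted=$(sort "$tmpdir/test.txt" | uniq -c)

# cut: the first comma-separated column (id, 1, 2).
first_col=$(cut -d , -f 1 "$tmpdir/mytable.csv")

echo "$deduped"
echo "$counted"
echo "$first_col"

# Clean up the scratch directory.
rm -r "$tmpdir"
```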

## Exercise 1
<hr style="border: solid 1px gray; margin-top: 1.5% ">

Switch the current working directory to **`data-shell/data`** and have a look at the file **`animals.txt`**.  It is a comma-delimited file. The needed commands are provided in the next two cells.

In [None]:
cd ../data/

In [None]:
%%bash
cat animals.txt

The file above logs sightings of wild animals.  Let's use **`cut`** to get the second column (animal names) from **`animals.txt`**.

In [None]:
%%bash
cut -d , -f 2 animals.txt 

Your task is to produce a unique list of animals and how many times each was spotted. Which of the following pipes will do the job?

```
1. sort animals.txt | uniq -c
2. sort -t, -k2 animals.txt | uniq -c
3. cut -d , -f 2 animals.txt | uniq -c
4. cut -d , -f 2 animals.txt | sort | uniq -c
5. cut -d , -f 2 animals.txt | sort | uniq -c | wc -l

```

The solution can be found at the end of this notebook.

<hr style="border: solid 1px red; margin-top: 1.5% ">

## Key points

- **`cat`** displays the contents of its inputs.
- **`head`** displays the first 10 lines of its input by default.
- **`tail`** displays the last 10 lines of its input by default.
- **`sort`** sorts its inputs.
- **`wc`** counts lines, words, and characters in its inputs.
- **`command > file`** redirects a command’s output to a file (overwriting any existing content).
- **`command >> file`** appends a command’s output to a file.
- **`first | second`** is a pipeline: the output of the first command is used as the input to the second.
- The best way to use the shell is to use pipes to combine simple single-purpose programs (filters).

<hr style="border: solid 1px gray; margin-top: 1.5% ">

### Solution to Exercise 1:

Option 4 is the correct answer. If you have difficulty understanding why, try running the commands, or sub-sections of the pipelines (make sure you are in the data-shell/data directory).