# Introduction to Programming
# #2: the Unix Shell

_Hugo Lhuillier_ -- _Master in Economics, Sciences Po_

# What and why?

* The Unix Shell ($\approx$ the terminal $\approx$ the command line): one of the first ways to interact with a computer
* Why should you use it? It makes your instructions
    - reusable 
    - faster (at some point)
    - and sometimes you cannot avoid it (e.g. with remote machines and supercomputers)

Also, it makes you look really cool on the train or at the library
   
<img src="https://www.cyberciti.biz/media/new/faq/2008/10/perl-realpath-demo.png">

# What and why? Unix? Shell? Bash?

* Unix: one of the first OS
* The Shell: an interpreter
```BASH
$ cd ./Dropbox/Teaching/
```
* Specificity: it runs other programs rather than doing the calculations itself
* Bash (Bourne Again SHell): the most common Unix Shell
    * Documentary on this: [revolution OS](https://www.youtube.com/watch?v=k84FMc1GF8M)

# What and why? Specificity 

* Use the Unix shell from a command-line interface (CLI), and not a GUI 
* The heart of the CLI: the REPL (read-evaluate-print loop) 

Why REPL: when the user types a command and then presses the Enter (or Return) key, the computer reads it, executes it, and prints its output.

In [7]:
; whoami

hugolhuillier


More specifically, when we type `whoami` the shell:

* finds a program called whoami,
* runs that program,
* displays that program’s output, then
* displays a new prompt to tell us that it’s ready for more commands.

In [6]:
; pwd

/Users/hugolhuillier/Dropbox/Teaching/intro-prog/2017-2018/2-unix


* **Careful**: the Shell is just an interpreter

In [8]:
; 1 + 1

/bin/bash: 1: command not found
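
The shell reads `1` as the name of a program to run, hence the error. Bash can still do integer arithmetic when asked explicitly. A minimal sketch, using arithmetic expansion `$(( ))`, a Bash feature that is not covered elsewhere in these slides:

```bash
# $(( ... )) asks Bash to evaluate an integer expression itself
sum=$((1 + 1))
echo "$sum"
```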


# Linux & Mac vs. Windows 

* Mac and Linux OS include by default a Unix Shell (Bash)
* Windows does not. Instead, we will use an emulator: Git for Windows
    - Windows 10 users can also install Ubuntu on their machine. [See here for instructions](https://www.laptopmag.com/articles/use-bash-shell-windows-10).

Apparently, this has to do with a difference in culture between Windows and Linux. Most programmers on Windows write programs "for the users", and therefore need to work extensively on the GUI. By contrast, most developers on Unix write programs for themselves or for other developers; hence the CLI, which is faster and more useful in this case.

# Disclaimer

* Most of the material is drawn from the excellent course prepared by [software carpentry](https://software-carpentry.org/lessons/)
* In particular, most exercises are drawn from it. You can look up the answers, but you won't learn anything

# Context: Nelle the marine biologist

* Nelle is a marine biologist who just collected 1520 samples on marine life stuff 

<a href="https://ibb.co/d0M4P6"><img src="https://preview.ibb.co/mVcdj6/nelle.png" alt="nelle" border="0"></a>

# Context: Nelle the marine biologist


* For each sample, she has to 
    1. run the sample through a machine that's going to compute the relative abundance of the different proteins contained in it. The machine will output a text file.
    1. run the text file through a program called `goostats` that's going to compute some statistics
    
Very repetitive procedure $\Rightarrow$ should use the command shell to automate all this.

# The file system

* The file system: part of the OS responsible for managing files and directories

In [26]:
; cd ./Dropbox/Teaching/intro-prog/2017-2018/2-unix

/Users/hugolhuillier/Dropbox/Teaching/intro-prog/2017-2018/2-unix


In [27]:
; pwd

/Users/hugolhuillier/Dropbox/Teaching/intro-prog/2017-2018/2-unix


1. At the top is the root directory that holds everything else. We refer to it using a slash character `/` on its own
1. Inside that directory are several other directories, including `Users` (where users’ personal directories are located)
1. A directory such as `/Users/nelle` is stored inside `/Users`, because `/Users` is the first part of its name. Similarly, `/Users` is stored inside the root directory, because its name begins with `/`

In [28]:
; ls

imgs
notes.pages
slides.ipynb
slides.slides.html


In [29]:
; ls -F

imgs/
notes.pages
slides.ipynb
slides.slides.html


In [41]:
; ls -t -F

slides.slides.html
slides.ipynb
imgs/
notes.pages


In [33]:
; ls -F imgs

nelle.png


# The file system 

* For Linux and Mac users: can also access the documentation via  `man ls`
    * Navigate with the arrow keys
    * Can search with `/` followed by the character or word of interest 
    * Quit with `q`.

# The file system

* You can navigate through the directories via `cd` 
* `cd` alone will set the Shell to the home directory 

In [34]:
; pwd

/Users/hugolhuillier/Dropbox/Teaching/intro-prog/2017-2018/2-unix


In [35]:
; cd

/Users/hugolhuillier


In [36]:
; cd ./Dropbox/Teaching/intro-prog/2017-2018

/Users/hugolhuillier/Dropbox/Teaching/intro-prog/2017-2018


In [37]:
; cd ./2-unix/imgs

/Users/hugolhuillier/Dropbox/Teaching/intro-prog/2017-2018/2-unix/imgs


In [38]:
; cd ./..

/Users/hugolhuillier/Dropbox/Teaching/intro-prog/2017-2018/2-unix


In [40]:
; ls -F

imgs/
notes.pages
slides.ipynb
slides.slides.html


# The file system

* There exist plenty of shortcuts. Ex: starting from `/Users/amanda/data/`, Amanda can use any of the following to navigate to her home directory, which is `/Users/amanda`:
    - `cd ~`
    - `cd ..` 
    - `cd`
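
A way to try these shortcuts without touching real data. This is only a sketch: it assumes `mktemp -d` is available to create a scratch directory, and the `amanda/data` tree is made up to mirror the example above.

```bash
# Scratch tree standing in for /Users/amanda/data (names are illustrative)
sandbox=$(mktemp -d)
mkdir -p "$sandbox/amanda/data"
cd "$sandbox/amanda/data"

# `..` means "the parent of the current directory"
cd ..
pwd

# `~` and a bare `cd` both lead to your own home directory
cd ~
pwd
cd
pwd
```

Note that `cd ..` only reaches the home directory here because `data` sits directly inside it; `cd ~` and `cd` work from anywhere.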

* **Exercise**: navigate to the `data-shell` directory, and explore what's inside the `north-pacific-gyre/2012-07-03` directory

In [43]:
; cd ./data-shell

/Users/hugolhuillier/Dropbox/Teaching/intro-prog/2017-2018/2-unix/data-shell


In [44]:
; ls -F

creatures/
data/
molecules/
north-pacific-gyre/
notes.txt
pizza.cfg
solar.pdf
writing/


In [6]:
; ls -F ./north-pacific-gyre/2012-07-03/

NENE01729A.txt
NENE01729B.txt
NENE01736A.txt
NENE01751A.txt
NENE01751B.txt
NENE01812A.txt
NENE01843A.txt
NENE01843B.txt
NENE01971Z.txt
NENE01978A.txt
NENE01978B.txt
NENE02018B.txt
NENE02040A.txt
NENE02040B.txt
NENE02040Z.txt
NENE02043A.txt
NENE02043B.txt
goodiff*
goostats*


# Working with files and directories 

* How to create a directory? 

In [10]:
; cd ./data-shell

/Users/hugolhuillier/Dropbox/Teaching/intro-prog/2017-2018/2-unix/data-shell


In [11]:
; mkdir thesis

* Try not to use spaces in your file and directory names, EVER
* How to create a file? Most generic command is `touch`

In [15]:
; touch a-blank-file.txt

* This will create a blank file. Works with any extension (`.txt`, `.csv`, `.jl` etc.)

# Working with files and directories 

* To create a text file and write directly in it, use a text editor
* **Text editor**: which one to use? A simple one that runs inside the terminal is `nano`. Others are:
    - _Atom_ (open source, made by GitHub)
    - Sublime Text
    - Notepad++

* For now, use `nano`
```bash
$ nano draft.txt
```
    - Write in it: 
    > It's not "publish or perish" any more, it's "share and thrive".
    - Save via `Ctrl-O`
    - Exit via `Ctrl-X`

# Working with files and directories

* You can remove files with the command `rm`
* **Careful**: the Unix shell doesn’t have a trash bin. Deleting is permanent

In [17]:
; rm ./thesis/draft.txt

In [18]:
; ls -F ./thesis

* Let's recreate the `draft.txt` file, move to the parent directory (`data-shell`), and try to remove directly the `thesis` directory

In [19]:
; touch ./thesis/draft.txt

In [22]:
; rm thesis

rm: thesis: is a directory


* By default, `rm` only works on files. To get rid of the `thesis` directory and everything inside it (here `draft.txt`), add the recursive flag `-r`

In [23]:
; rm -r thesis

* `rm -r` can be a very dangerous command, since you might remove loads of files. To be certain that you are not removing too many files, you can type `rm -r -i thesis`, and the Shell will ask you whether you want to remove this or that file
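
A safe way to experiment with `rm` is to work in a scratch directory. The sketch below assumes `mktemp -d` is available; the `||` part simply prints a message when the command on its left fails.

```bash
# Work in a throwaway directory so that nothing important can be deleted
sandbox=$(mktemp -d)
cd "$sandbox"
mkdir thesis
touch thesis/draft.txt

# Without -r, rm refuses to remove a directory
rm thesis || echo "rm refused: thesis is a directory"

# With -r, rm removes the directory and everything inside it
rm -r thesis
ls -F
```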

# Working with files and directories

* How to rename or move a file? Use `mv old-name new-name`
* **Careful**: `mv` will silently overwrite any existing file with the same name
    - Add the flag `-i` for the Shell to warn you if `mv` is going to overwrite any file

* **Exercise**: in the directory `thesis`, create `draft.txt`. Then, rename it to `quote.txt`. Finally, move `quote.txt` from the `thesis` directory to its parent directory, `data-shell`

In [28]:
; mkdir thesis

In [29]:
; touch thesis/draft.txt

In [30]:
; mv thesis/draft.txt thesis/quote.txt

In [31]:
; mv thesis/quote.txt .

* `cp` works like `mv`, but makes a copy of the file instead of (re)moving it
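
The difference between `mv` and `cp` can be checked in a scratch directory. A minimal sketch, assuming `mktemp -d` is available; `backup.txt` is a made-up name.

```bash
sandbox=$(mktemp -d)
cd "$sandbox"
touch draft.txt

# mv renames: draft.txt disappears, quote.txt appears
mv draft.txt quote.txt

# cp duplicates: quote.txt stays, backup.txt is added
cp quote.txt backup.txt
ls
```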

# Filters

* So far, fancy, but not very useful... 
* Now: use the Shell to combine existing programs in new ways

In [35]:
; cd ./molecules

/Users/hugolhuillier/Dropbox/Teaching/intro-prog/2017-2018/2-unix/data-shell/molecules


In [36]:
; ls -F

cubane.pdb
ethane.pdb
methane.pdb
octane.pdb
pentane.pdb
propane.pdb


* **Ex #1**: count the number of lines, words and characters of each file in the directory

In [38]:
; wc *.pdb

      20     156    1158 cubane.pdb
      12      84     622 ethane.pdb
       9      57     422 methane.pdb
      30     246    1828 octane.pdb
      21     165    1226 pentane.pdb
      15     111     825 propane.pdb
     107     819    6081 total


# Filters

* **Wildcard**: wildcards are a form of shortcut.
    - `*`: matches zero or more characters; e.g. `*.pdb` expands to a list containing `ethane.pdb`, `propane.pdb` etc. Similarly, `p*.pdb` only matches `pentane.pdb` and `propane.pdb`
    - `?`: matches exactly one character
    - Wildcards can be combined: e.g. `p*.p*`
    - `[AB]`: matches exactly one character, either `A` or `B`
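
To see how each wildcard behaves, here is a sketch that creates four empty files in a scratch directory (assuming `mktemp -d` is available) and lists them with different patterns.

```bash
sandbox=$(mktemp -d)
cd "$sandbox"
touch ethane.pdb methane.pdb pentane.pdb propane.pdb

ls p*.pdb        # pentane.pdb and propane.pdb
ls *ethane.pdb   # ethane.pdb and methane.pdb (* may match zero characters)
ls ?ethane.pdb   # methane.pdb only (? must match exactly one character)
ls [em]*.pdb     # names starting with e or m: ethane.pdb and methane.pdb
```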

* **Ex**: in `north-pacific-gyre/2012-07-03/`, list all the _text_ files ending with `A` or `B` (i.e. exclude those ending with `Z`)

In [6]:
; ls -F

NENE01729A.txt
NENE01729B.txt
NENE01736A.txt
NENE01751A.txt
NENE01751B.txt
NENE01812A.txt
NENE01843A.txt
NENE01843B.txt
NENE01971Z.txt
NENE01978A.txt
NENE01978B.txt
NENE02018B.txt
NENE02040A.txt
NENE02040B.txt
NENE02040Z.txt
NENE02043A.txt
NENE02043B.txt
goodiff*
goostats*


In [7]:
; ls *[AB].txt

NENE01729A.txt
NENE01729B.txt
NENE01736A.txt
NENE01751A.txt
NENE01751B.txt
NENE01812A.txt
NENE01843A.txt
NENE01843B.txt
NENE01978A.txt
NENE01978B.txt
NENE02018B.txt
NENE02040A.txt
NENE02040B.txt
NENE02043A.txt
NENE02043B.txt


# Filters

* Let's find which of these files is the shortest 

In [39]:
; wc -l *.pdb > lengths.txt

The `-l` flag tells `wc` to report only the number of lines. The `>` operator tells the shell to redirect the command's output to a file instead of printing it to the screen.

In [40]:
; cat lengths.txt

      20 cubane.pdb
      12 ethane.pdb
       9 methane.pdb
      30 octane.pdb
      21 pentane.pdb
      15 propane.pdb
     107 total


* `>` overwrites a file, while `>>` appends to it
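
A small sketch of the difference, run in a scratch directory (assuming `mktemp -d` is available; `log.txt` is a made-up file name):

```bash
sandbox=$(mktemp -d)
cd "$sandbox"

echo "first"  > log.txt    # creates log.txt
echo "second" > log.txt    # overwrites: log.txt now holds only "second"
echo "third" >> log.txt    # appends: "second", then "third"
cat log.txt
```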

* It only remains to sort `lengths.txt` and to store the result in a new file

In [42]:
; sort -n lengths.txt > sorted-lengths.txt

In [43]:
; head -n 2 sorted-lengths.txt

       9 methane.pdb
      12 ethane.pdb


`sort` needs `-n` because by default the sorting is alphabetical; with `-n` it becomes numerical. The `-n` of `head` is a different flag: `head -n 2` asks for the first two lines.
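
The effect of `-n` on `sort` is easiest to see on a tiny file. A sketch in a scratch directory (assuming `mktemp -d` is available; `numbers.txt` is a made-up file):

```bash
sandbox=$(mktemp -d)
cd "$sandbox"
echo 10   > numbers.txt
echo 9   >> numbers.txt
echo 100 >> numbers.txt

sort numbers.txt      # alphabetical order: 10, 100, 9
sort -n numbers.txt   # numerical order: 9, 10, 100
```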

# Pipes

* This is a pipe: `|`

In [3]:
; sort -n lengths.txt | head -n 1

       9 methane.pdb


* Tells the shell that we want to use the output of the command on the left as the input to the command on the right

In [45]:
; wc -l *.pdb | sort -n | head -n 1

       9 methane.pdb


* The key: any program that reads lines of text from standard input and writes lines of text to standard output can be combined with every other program that behaves this way as well in Unix
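
The same principle works with programs other than `wc`. A sketch with `ls` as the first stage, in a scratch directory with three empty files (assuming `mktemp -d` is available; `sort -r` sorts in reverse order):

```bash
sandbox=$(mktemp -d)
cd "$sandbox"
touch cubane.pdb ethane.pdb methane.pdb

# ls writes one name per line to standard output; wc -l counts the lines it reads
ls | wc -l

# Three stages: list the names, sort them in reverse order, keep the first line
ls | sort -r | head -n 1
```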

# Loops 

* Why should you use loops? 

* Allow us to execute commands repetitively
* Reduces the amount of typing 
* Less typing = fewer typing mistakes

# Loops 

* **Ex #1**: would like to copy several files

In [11]:
; cd ./Dropbox/Teaching/intro-prog/2017-2018/2-unix/data-shell/creatures

/Users/hugolhuillier/Dropbox/Teaching/intro-prog/2017-2018/2-unix/data-shell/creatures


In [12]:
; ls -F

basilisk.dat
original-basilisk.dat
original-unicorn.dat
unicorn.dat


* The following command _would not_ work! 

```bash 
$ cp *.dat original-*.dat
```

When `cp` receives more than two inputs, it expects the last one to be a directory into which it can copy all the others. Whatever the wildcards expand to here, the last input is not a directory, so we get an error.

* Solution: use a loop 
* Example of a loop: 

```bash 
for filename in basilisk.dat unicorn.dat
do
    head -n 3 $filename
done
```

* `$filename` and `${filename}` are equivalent

When the shell sees the keyword `for`, it knows to repeat a command once for each item in a list (the list = the elements after `in`). For each iteration, an item in the list is assigned to the variable, and the commands inside the loop are executed. Inside the loop, we call for the variable’s value by putting `$` in front of it. The `$` tells the shell interpreter to treat the variable as a variable name rather than some text or an external command.
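
The braces in `${filename}` start to matter when the variable is followed directly by other characters. A small sketch (the `_copy.dat` suffix is made up for the example):

```bash
for filename in basilisk unicorn
do
    # The braces mark where the variable name ends. Without them,
    # $filename_copy would refer to another variable called filename_copy.
    echo "${filename}_copy.dat"
done
```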

**Tip**: always write the beginning (`do`) and the ending (`done`) of the loop first, then fill in the body

# Loops

* **Exercise**: for each file in the `creatures` directory, print the name of the file (you may want to check the `echo` command), and the last 20 lines of the file (have a look at the `tail` command)
* **Exercise**: write another loop to show that your previous code indeed printed only 20 lines per file

* Solution #1

```bash 
for filename in *.dat 
do 
    echo $filename 
    tail -n 20 $filename
done
```

* Solution #2 

```bash 
for filename in *.dat 
do 
    echo $filename
    tail -n 20 $filename | wc -l 
done
```

# Loops 

* **Exercise**: going back to the initial problem, copy, in the same directory, the files in the `creatures` directory 

* Solution

```bash 
for filename in *.dat 
do 
    cp $filename original-$filename
done
```

* Introduction to debugging: hard to check that the loop is doing the correct thing since it does not print anything. To check: print the command with `echo`

```bash 
for filename in *.dat 
do 
    echo cp $filename original-$filename
    cp $filename original-$filename
done
```

# Loop: Nelle's original problem

* Remember: Nelle wants to run a statistical program, `goostats`, on the samples contained in `north-pacific-gyre/2012-07-03`
* Info:
    - run the program only on the samples ending with `A` or `B`
    - `goostats` takes two inputs: 
        1. the file on which to run the program 
        1. the file that's going to store the results 
    - to run an external program, write `bash` in front of it

* Write a program that's going to run `goostats` on each sample, and store each result in a different file, named `stats-ORIGINAL_NAME_OF_THE_FILE`. _Debug your code as much as possible_.

```bash 
for datafile in NENE*[AB].txt
do 
    echo $datafile 
    bash goostats $datafile stats-$datafile
done
```
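
One way to debug this loop before running `goostats` for real is a dry run: `echo` the command instead of executing it. A sketch in a scratch directory with three empty stand-in files (assuming `mktemp -d` is available; the real samples live in the `north-pacific-gyre` directory):

```bash
sandbox=$(mktemp -d)
cd "$sandbox"
touch NENE01729A.txt NENE01729B.txt NENE01971Z.txt

for datafile in NENE*[AB].txt
do
    # Print the command that would be run; the Z file is never listed
    echo "bash goostats $datafile stats-$datafile"
done
```

Once the printed commands look right, remove the `echo` and the quotes to run them.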

# Useful commands

* Use the keyboard's arrows to retrieve past commands 
* Stop a running program at any point in time using `Ctrl + c`
* `Ctrl + a` and `Ctrl + e` bring the cursor to the beginning and the end of the line respectively
* `history` lists the last few hundred commands, and `!123` runs the command associated with line `123`. The code below prints the last 5 commands

```bash 
history | tail -n 5
```

# Shell Scripts

* _"You said the advantage of using the Shell is that the instructions are re-usable, but if we have to retype them every time, that's massive BS"_
* A shell script: a small program that contains a list of Shell commands

```bash 
cd
cd ./Dropbox/Teaching/intro-prog/2017-2018/2-unix/data-shell/molecules
nano middle.sh
```

* Write in the file 

```bash
head -n 15 octane.pdb | tail -n 5
```

* Execute the shell script with the `bash` command 

In [4]:
; bash middle.sh

ATOM      9  H           1      -4.502   0.681   0.785  1.00  0.00
ATOM     10  H           1      -5.254  -0.243  -0.537  1.00  0.00
ATOM     11  H           1      -4.357   1.252  -0.895  1.00  0.00
ATOM     12  H           1      -3.009  -0.741  -1.467  1.00  0.00
ATOM     13  H           1      -3.172  -1.337   0.206  1.00  0.00


This is a variation on the pipe we constructed earlier: it selects lines 11-15 of the file `octane.pdb`. The difference is that the commands are now stored in a file, so we can rerun them without retyping them.

# Shell scripts

* Remember the BLRs? Want to write code as re-usable as possible
* Replace the content of `middle.sh` by 

```bash
head -n 15 "$1" | tail -n 5
```

In other languages, we would call that a function, or a method. `"$1"` means "the first argument after the script name on the command line". We surround `$1` with double quotes in case the filename contains spaces.
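
To see why the quotes matter, here is a sketch with two hypothetical one-line scripts, `quoted.sh` and `unquoted.sh`, and a file whose name contains a space. It runs in a scratch directory (assuming `mktemp -d` is available).

```bash
sandbox=$(mktemp -d)
cd "$sandbox"
echo "hello" > "my notes.txt"

# Two tiny scripts: one quotes $1, the other does not
echo 'head -n 1 "$1"' > quoted.sh
echo 'head -n 1 $1'   > unquoted.sh

# Quoted: head receives a single file name, "my notes.txt"
bash quoted.sh "my notes.txt"

# Unquoted: the name is split at the space, so head looks for "my" and "notes.txt"
bash unquoted.sh "my notes.txt" || echo "head could not find 'my' or 'notes.txt'"
```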

In [5]:
; bash middle.sh octane.pdb

ATOM      9  H           1      -4.502   0.681   0.785  1.00  0.00
ATOM     10  H           1      -5.254  -0.243  -0.537  1.00  0.00
ATOM     11  H           1      -4.357   1.252  -0.895  1.00  0.00
ATOM     12  H           1      -3.009  -0.741  -1.467  1.00  0.00
ATOM     13  H           1      -3.172  -1.337   0.206  1.00  0.00


* As in maths, a function can take as many arguments as we want. E.g., we can turn the number of lines kept by `head` and by `tail` into arguments

```bash 
head -n "$2" "$1" | tail -n "$3"
```

In [6]:
; bash middle.sh octane.pdb 15 5

ATOM      9  H           1      -4.502   0.681   0.785  1.00  0.00
ATOM     10  H           1      -5.254  -0.243  -0.537  1.00  0.00
ATOM     11  H           1      -4.357   1.252  -0.895  1.00  0.00
ATOM     12  H           1      -3.009  -0.741  -1.467  1.00  0.00
ATOM     13  H           1      -3.172  -1.337   0.206  1.00  0.00


# Shell scripts

* Once again, remember the BLRs: always write documentation / comments
* Comments start with `#` in Shell

```bash 
# select lines from the middle of a file 
# usage: bash middle.sh filename end_line num_lines
head -n "$2" "$1" | tail -n "$3"
```

# Shell scripts 

* What if we want a script that works on different files, but we do not know a priori how many? 
* Ex: would like a script that counts the lines of different files

In [7]:
; wc -l *.pdb ../creatures/*.dat

      20 cubane.pdb
      12 ethane.pdb
       9 methane.pdb
      30 octane.pdb
      21 pentane.pdb
      15 propane.pdb
     163 ../creatures/basilisk.dat
     163 ../creatures/original-basilisk.dat
     163 ../creatures/original-unicorn.dat
     163 ../creatures/unicorn.dat
     759 total


* Use the variable `$@` = all of the command-line arguments to the shell script

```bash 
# sort files by their number of lines
# usage: bash sorted.sh one_or_more_filenames 
wc -l "$@" | sort -n
```
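
To see what `$@` contains, here is a sketch with a hypothetical script `show-args.sh` that prints each argument on its own line. The loop is written on a single line, with `;` replacing the line breaks, and everything happens in a scratch directory (assuming `mktemp -d` is available).

```bash
sandbox=$(mktemp -d)
cd "$sandbox"
touch a.pdb b.pdb c.pdb

# The script loops over "$@", i.e. over every argument it received
echo 'for arg in "$@"; do echo "argument: $arg"; done' > show-args.sh

# The shell expands *.pdb before the script starts, so the script gets three arguments
bash show-args.sh *.pdb
```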

In [8]:
; bash sorted.sh *.pdb ../creatures/*.dat

       9 methane.pdb
      12 ethane.pdb
      15 propane.pdb
      20 cubane.pdb
      21 pentane.pdb
      30 octane.pdb
     163 ../creatures/basilisk.dat
     163 ../creatures/original-basilisk.dat
     163 ../creatures/original-unicorn.dat
     163 ../creatures/unicorn.dat
     759 total


# Finding things

* `grep` = "global/regular expression/print": a command that finds the lines matching a pattern

In [19]:
; cd data-shell/writing

/Users/hugolhuillier/Dropbox/Teaching/intro-prog/2017-2018/2-unix/data-shell/writing


In [20]:
; cat haiku.txt

The Tao that is seen
Is not the true Tao, until
You bring fresh toner.

With searching comes loss
and the presence of absence:
"My Thesis" not found.

Yesterday it worked
Today it is not working
Software is like that.


In [22]:
; grep The haiku.txt

The Tao that is seen
"My Thesis" not found.


In [23]:
; grep -w The haiku.txt

The Tao that is seen


Previously, two lines included the letters “The”, but one instance of those letters is contained within a larger word, “Thesis”. To restrict matches to lines containing the word “The” on its own, we can give `grep` the `-w` flag. This will limit matches to word boundaries.

In [24]:
; grep -w "is not" haiku.txt

Today it is not working


# Finding things

* Several flags that are useful 
    - `-w`: the expression is searched for as a word
    - `-n`: each output line is numbered 
    - `-i`: perform case insensitive matching 
    - `-v`: reverse search: selected lines are those not matching any of the specified patterns.
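
A sketch of these flags on a three-line file, recreated in a scratch directory from the first stanza of `haiku.txt` (assuming `mktemp -d` is available):

```bash
sandbox=$(mktemp -d)
cd "$sandbox"
echo "The Tao that is seen"        > haiku.txt
echo "Is not the true Tao, until" >> haiku.txt
echo "You bring fresh toner."     >> haiku.txt

grep -n Tao haiku.txt      # lines 1 and 2, each prefixed with its line number
grep -v Tao haiku.txt      # only the line without "Tao": the third one
grep -i -w the haiku.txt   # "the" as a whole word, in any case: lines 1 and 2
```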

In [25]:
; grep -n -w -i "the" haiku.txt

1:The Tao that is seen
2:Is not the true Tao, until
6:and the presence of absence:


* If interested in the many possibilities offered by `grep`, e.g. possibility to implement complex searches, have a look [here](http://v4.software-carpentry.org/regexp/index.html)

In [27]:
; grep -E '^.o' haiku.txt

You bring fresh toner.
Today it is not working
Software is like that.


This finds all the lines that have an ‘o’ in the second position: `^` anchors the match at the start of the line and `.` matches any single character.

# Finding things

* `grep` finds lines in files; `find` finds the files themselves

In [29]:
; ls -F

data/
haiku.txt
thesis/
tools/


In [30]:
; find .

.
./tools
./tools/old
./tools/old/oldtool
./tools/format
./tools/stats
./haiku.txt
./thesis
./thesis/empty-draft.md
./data
./data/two.txt
./data/LittleWomen.txt
./data/one.txt


As always, the . on its own means the current working directory, which is where we want our search to start. find’s output is the names of every file and directory under the current working directory.

* Several very useful options
    - `-type d`: lists only the directories 
    - `-type f`: lists only the files 
    - `-name some_name`: lists only the files matching `some_name`
    
```bash 
find . -name '*.txt'
```

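It is very important to put quotes around the pattern; otherwise, the shell would expand `*.txt` itself before running `find`. Here the only match in the current directory is `haiku.txt`, so the command would become `find . -name haiku.txt`, which yields a different result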
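
The effect of the quotes can be reproduced in a scratch directory. A sketch with three empty files, one at the top level and two in a sub-directory (assuming `mktemp -d` is available):

```bash
sandbox=$(mktemp -d)
cd "$sandbox"
mkdir data
touch haiku.txt data/one.txt data/two.txt

# Quoted: find receives the pattern itself and searches every sub-directory
find . -name '*.txt'

# Unquoted: the shell first replaces *.txt with haiku.txt, the only match in
# the current directory, so find only looks for files named haiku.txt
find . -name *.txt
```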

# Finding things

* As before, can combine `find` and `grep` with other commands. E.g.: to count the number of lines in each `.txt` file contained in the `writing` directory

```bash 
wc -l $(find . -name '*.txt') 
```

* Careful: different from 

```bash 
find . -name '*.txt' | wc -l
```

* The shell first executes what is inside `$()`. It then replaces the `$()` expression with that command’s output. Here, the first command becomes

```bash 
wc -l ./data/one.txt ./data/LittleWomen.txt ./data/two.txt ./haiku.txt
```
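
The two commands answer different questions, which a small sketch makes visible. It uses two made-up files in a scratch directory (assuming `mktemp -d` is available): `one.txt` has one line and `two.txt` has two.

```bash
sandbox=$(mktemp -d)
cd "$sandbox"
echo "one line" > one.txt
echo "first"    > two.txt
echo "second"  >> two.txt

# How many lines are there inside each file that find reports?
wc -l $(find . -name '*.txt')

# How many files does find report? (find prints one line per file)
find . -name '*.txt' | wc -l
```

The first command reports 1 and 2 lines, for a total of 3; the second reports 2, the number of files.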