# Beyond Basic BASH Coding
------
### Learning Objectives:

+ Review BASH commands learned in previous lessons

+ Apply looping with complex BASH commands for repetitive tasks

+ Create and execute BASH scripts

## BASH Coding Review
--------

Thus far you've learned quite a bit about how to navigate the terminal environment with some basic BASH commands. In this section we will review those commands, plus a couple we haven't covered yet, and continue to build more complex commands by combining them in different ways.

### Navigating the Terminal

`pwd` <br>
Print working directory.

`cd` <br>
Change directory, accepts an argument to indicate the name of the directory you would like to move to.

`mkdir` <br>
Create a new directory, accepts an argument to indicate the name of the new directory to be created.

`ls` <br>
Lists the contents of a directory. Used without an argument, it lists the contents of the current working directory; given the name (path) of a directory as an argument, it lists the contents of that directory instead. You can also provide a wildcard pattern (a "glob", which the shell expands before `ls` runs) to list only the files whose names match a pattern of interest. For example, to list all files that begin with the letter "s" in the current working directory use `ls s*`, and to list all files that begin with the letter "p" in the directory "new_dir" use `ls new_dir/p*`.
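
The wildcard behaviour can be tried safely in a throwaway directory. This is only a sketch: the file names below (`samples.txt`, `plot.png`, and so on) are made up for illustration and are not part of the lesson data.

```bash
# Work in a temporary directory (hypothetical file names, for illustration only)
demo_dir=$(mktemp -d)
cd "$demo_dir"
mkdir new_dir
touch samples.txt sequences.fasta notes.txt
touch new_dir/plot.png new_dir/params.txt new_dir/reads.txt

# List only the files that begin with "s" in the current directory
ls s*

# List only the files that begin with "p" inside new_dir
ls new_dir/p*
```

The first `ls` should show only `samples.txt` and `sequences.fasta`; the second should show only the two files in `new_dir` whose names start with "p".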

### Creating and Organizing Files

`nano` <br>
This command invokes the "nano" text editing program. Used without an argument, it opens a blank page where you can enter text; when you close the page with `ctrl + x` you will be asked whether you want to save the contents (type `y` to save) and then prompted to name the file. Alternatively, you can give the command a filename as an argument, and when you close the page with `ctrl + x` you will be asked if you want to save to that filename. This tool can also be used to modify existing files by passing their filenames to the `nano` command. Nano is one of many text editors for the command line interface; `vi` is another popular one.
 
 <p align="center">
    <img src="images/nano.png" alt="nano" width="80%"/>
</p>



`touch` <br>
Create an empty file, accepts an argument indicating the name of the file to be created. 

`>` <br>
Redirects the output of a command to a file rather than to standard out, accepts an argument indicating the filename to write to. If the filename already exists, the existing content of that file will be replaced with the output of the new command. 

`>>` <br>
Appends the output of a command to a file rather than to standard out, accepts an argument indicating the filename to write to. If the filename already exists, the new output is added to the bottom of the existing content rather than overwriting what is already there (as in the `>` command).
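
The difference between `>` and `>>` is easiest to see side by side. The sketch below writes to a scratch file in a temporary directory; the file name `redirect_demo.txt` is an arbitrary choice for this example.

```bash
# Hypothetical scratch file in a temporary directory
out_file="$(mktemp -d)/redirect_demo.txt"

echo "first line" > "$out_file"     # creates the file (or overwrites it)
echo "second line" >> "$out_file"   # appends to the end of the file
cat "$out_file"                     # shows both lines

echo "third line" > "$out_file"     # overwrites everything written so far
cat "$out_file"                     # shows only the third line
```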

### Viewing File Contents

`cat`<br>
Print the entire contents of a file to the screen. This command takes an argument to indicate which file should be printed. It can accept multiple arguments (filenames) separated by spaces and will print the entire contents of each one after the next. It also works with wildcard patterns; for example, printing the contents of all files that end with the suffix "fasta" in a directory could be done with `cat *.fasta`.

`zcat` <br>
Print the entire contents of a zipped (gzip-compressed) file to the screen. This command takes a file name as an argument to indicate which file should be printed. It can accept multiple arguments (filenames) separated by spaces and will print the entire contents of each file one after the next. It also works with wildcard patterns; for example, printing the contents of all zipped fastq files in a directory could be done with `zcat *.fastq.gz`.

`head` or `tail`<br> 
Print the first (head) or last (tail) ten lines of a file, this command accepts a filename as an argument to indicate which file you would like to view. This command also accepts multiple arguments separated by spaces. One flag that is commonly used with this command is the `-n` flag which indicates the number of lines to print, without this flag ten lines will be printed. To print the last 100 lines of a file one could use the command `tail -n 100 some_file.txt`.

### Renaming, Moving, and Removing Files

`cp`<br>
Copies the contents of a file to a new file. This command accepts two arguments: the first is the file to be copied and the second is the new filename to write the data to. With the `-r` (recursive) flag, this tool can also copy a directory, preserving the structure of the directory in the copied version.

`mv` <br>
This command can be used in two ways. The first is to move a file from one location to another; in this case the first argument is the filename and the second argument is the path indicating the new location, e.g. `mv somefile.txt new_dir/new_location/`. The second is to change the name of a file; in this case the first argument is the current filename and the second argument is the new filename that should replace it, e.g. `mv somefileName.txt newfileName.txt`.

`rm`<br>
Removes a file or, with the `-r` flag, a directory. This command accepts an argument indicating the name of the file or directory to be removed. This action is for the most part permanent, and the only way to get back a file that you've removed is from a backup location, so be very careful when using this command and NEVER use it with a bare wildcard character `*` (as in `rm *`), which will remove all files in the working directory.

### File Content Manipulation

`wc`<br>
Counts the lines, words, and bytes in a file (all three are printed by default). This command accepts an argument indicating which file to count. One flag that is very useful with this command is `-l`, which directs the command to print only the number of lines in the file; similarly, `-w` prints only the number of words.
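
A quick way to see what `wc` reports is to feed it a few lines of known text. The two lines used here are made up for the example.

```bash
# Two made-up lines of text: "one two" and "three"
printf 'one two\nthree\n' | wc      # lines, words, and bytes together
printf 'one two\nthree\n' | wc -l   # number of lines only (2)
printf 'one two\nthree\n' | wc -w   # number of words only (3)
```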

`cut`<br>
Extracts selected fields (columns) from each line of a file. This command accepts an argument indicating the file to be cut and, to select fields, requires the flag `-f`, which dictates the fields to keep. You can optionally include the `-d` flag to indicate the delimiter that separates the fields; by default this command uses the tab. An example of the utility of this command is taking the first field of a GFF (tab delimited) file with `cut -f1 somefile.gff`. Another useful example uses the `-d` flag to take the first and second fields of the information column of a GFF file (the 9th tab delimited column): `cut -f9 somefile.gff | cut -d ";" -f1,2`. Here the first cut command isolates the information column (`cut -f9 somefile.gff`), and its output is piped to a second cut command, which splits that column on the delimiter ";" (`cut -d ";"`) and takes the first and second fields (`-f1,2`).
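
To see both `cut` commands in action without a real annotation file, the sketch below builds a single GFF-style record. The record and the file name `toy.gff` are invented for illustration only.

```bash
# One made-up GFF record (tab-delimited, 9 columns); the values are illustrative only
gff_file="$(mktemp -d)/toy.gff"
printf 'chr20\texample\tgene\t100\t900\t.\t+\t.\tID=gene1;Name=abc;biotype=protein_coding\n' > "$gff_file"

# First tab-delimited field: the sequence name
cut -f1 "$gff_file"

# Ninth field, then its first two ";"-delimited entries
cut -f9 "$gff_file" | cut -d ";" -f1,2
```

The first command should print `chr20` and the second `ID=gene1;Name=abc`.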

`sed`<br>
This command is generally used to edit the contents of files in an automated way. It is often used for search and replace functions, and accepts two arguments: the first is a short expression indicating the string to search for and what to replace it with, and the second is the file to search. One useful task to perform with this command is replacing a sample name (old_sample) in a code file so the same code can be used with a different input file (new_sample): `sed 's/old_sample/new_sample/g' some_codeFile.sh > new_codeFile.sh`. You will notice the pattern `'s/PATTERN1/PATTERN2/g'` is surrounded by single quotes. The preceding `s` indicates this is a substitution command and the trailing `g` indicates that you would like every instance of `PATTERN1` on each line replaced with `PATTERN2`. The `g` could be omitted to change only the first instance on each line, or replaced with a number N to change only the Nth instance on each line.
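
The effect of the trailing flag is easiest to see on a line that contains the pattern several times. A minimal sketch, using a made-up line of text:

```bash
line="old_sample old_sample old_sample"

# No trailing flag: only the first match on each line is replaced
echo "$line" | sed 's/old_sample/new_sample/'

# Trailing g: every match on each line is replaced
echo "$line" | sed 's/old_sample/new_sample/g'

# Trailing number: only that numbered match (here the 2nd) on each line is replaced
echo "$line" | sed 's/old_sample/new_sample/2'
```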

`sort`<br>
Sorts the lines of a file (or of several files concatenated together) and prints the result to the screen; the output can be redirected with `>`. This command accepts an argument indicating the file to sort. There are many flags that can modify the way the data are sorted: `-n` indicates that data should be sorted numerically, and `-u` prints only one copy of each set of identical lines.

`uniq`<br>
Collapses matching adjacent lines into a single line and prints the result to the screen. If two lines match but are not adjacent, both will be printed, which is why input is usually passed through `sort` first. This command accepts an argument indicating the file to be filtered, and there are many flags that can be used to modify its output; one useful one is `-c`, which prefixes each line with a count of the number of adjacent occurrences.
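
Because `uniq` only compares neighbouring lines, it is usually paired with `sort`. The sketch below uses five made-up single-letter lines to show the difference.

```bash
# uniq only collapses adjacent duplicates, so unsorted input gives several separate counts
printf 'G\nA\nG\nA\nA\n' | uniq -c

# Sorting first places identical lines next to each other, giving one count per letter
printf 'G\nA\nG\nA\nA\n' | sort | uniq -c

# sort -u keeps one copy of each distinct line
printf 'G\nA\nG\nA\nA\n' | sort -u
```

After sorting, `uniq -c` should report 3 for `A` and 2 for `G`.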

### Pattern Matching

`grep`<br>
This pattern matching tool prints the lines in a file that match the pattern of interest. This command accepts an argument indicating the pattern to be matched, usually in quotation marks, and the file to search. There are many flags that can be used to modify the behavior of this command: the flag `-m` is used with an integer to limit the number of matching lines, and the flag `-c` returns a count of the number of lines that matched (not the total number of matches). This command is often combined with regular expressions to look for complex patterns (more information on this can be found in the **Genomics file formats** lesson).
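
The distinction between counting matching lines and counting matches matters when a pattern can occur more than once per line. A small sketch with three made-up sequences:

```bash
reads=$(printf 'ATGCATG\nCCCGGG\nTTATGA\n')

# Print the lines that contain the pattern
echo "$reads" | grep "ATG"

# -c counts matching lines (2 here), not total matches
echo "$reads" | grep -c "ATG"

# -o prints every match on its own line, so wc -l counts total matches (3 here)
echo "$reads" | grep -o "ATG" | wc -l

# -m 1 stops after the first matching line
echo "$reads" | grep -m 1 "ATG"
```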

## For & While Loops
-------

Now that you know some basic commands, and you've combined some of those commands with the pipe (`|`) to build more complex commands, we are going to add another layer of complexity. A powerful application of coding lies in repetitive tasks: rather than clicking over and over, we can write code that applies a code chunk to a specified set of files or a defined variable using a loop. To build a loop you need to indicate what you want to loop over, maybe a set of files or all lines in a file, and the operation you want performed on each item.


### For Loop 

<span style="color:red">Learn more about looping Bash commands in [this video](https://www.youtube.com/watch?v=rWrPwYKtmDA&list=PLXaEJPtnQ4w7Vu7vqWbttBjUGrPp4Qa7b&index=5).</span>

We can start with a simple *for loop*, in a for loop we make the statement:
*For each element in my list, do X, and complete the loop when the list has ended*

We will use the variable ***i*** in the code chunk to reference all the elements of the list to be looped over, and let's keep the code chunk simple using the command `echo` to simply print the element to the screen.

In [None]:
%%bash

# Loop over numbers 1:10, printing them as we go
for i in {1..10}; do 
   echo "$i" 
done

<div class="alert alert-block alert-warning">
    <i class="fa fa-question-circle-o" aria-hidden="true"></i>
    <b>TEST YOUR SKILLS</b> 
      <p>Practice your skills in the code block below</p>
    <div style="background-color: white ; color:black; padding: 3px;">Copy the code above and write your own code chunk that prints the sequence of numbers from 1 to 25.<br><br> Run the #FLASHCARD code block to see the answer. </div>
    
</div>

In [None]:
%%bash

## TEST YOUR SKILLS (enter and run your answers here)

# Now copy the code above and write your own code chunk that prints the sequence of numbers from 1 to 25


In [None]:
# FLASHCARD
from IPython.display import IFrame
IFrame("quiz_files/quiz4-1.html", width=600, height=250)

### For Loop

A for loop is very useful when you have a discrete list of elements that you would like to apply a specific code chunk to; you can also use one with a list of file names. Here we will make a list of the fastq files in our `aws_research_workflow/` directory and loop over them, printing the first four lines of each file.

In [None]:
%%bash

# Create a list of the fastq files in the aws_research_workflow directory
ls aws_research_workflow/*.fastq.gz > fastq.list
head -n3 fastq.list

In [None]:
%%bash

# Loop over the files in the list and print the first four lines of each
for i in $(cat fastq.list); do
    zcat "$i" | head -n4
done
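
The list file is not strictly required: a for loop can also iterate directly over the filenames matched by a wildcard. The sketch below is an alternative under stated assumptions, not the lesson's method. It builds two tiny gzipped FASTQ files with made-up reads in a temporary directory so it can run anywhere, and it uses `gzip -cd`, which behaves like `zcat` for `.gz` files.

```bash
# Build two tiny gzipped FASTQ files in a temporary directory (made-up reads)
demo_dir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\nIIII\n' | gzip > "$demo_dir/sampleA.fastq.gz"
printf '@read1\nTTAA\n+\nIIII\n@read2\nCCGG\n+\nIIII\n' | gzip > "$demo_dir/sampleB.fastq.gz"

# Loop directly over the files matched by the wildcard; quoting "$f" protects unusual filenames
for f in "$demo_dir"/*.fastq.gz; do
    echo "$f"
    gzip -cd "$f" | head -n 4   # gzip -cd behaves like zcat for .gz files
done
```

With the lesson data, the same pattern would loop over `aws_research_workflow/*.fastq.gz` in place of the temporary directory.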

### While Loop

If you do not know how many times you might need to run a loop, a *while loop* may be useful, as it continues until the condition (a boolean, or logical, test) specified in the first line evaluates to `false`. An example would be looping over all of the files in your directory to perform a specific task. Let's loop over the FASTA files in the same directory as above. Because the while loop can read filenames directly from the output of `ls`, we can skip the step where we created a list file (`fastq.list` in the previous example). To add some additional complexity to the code chunk, we will print the file name and then print the first 3 lines of each file.


In [None]:
%%bash

ls aws_research_workflow/*.fasta | while read x; do 
     # Print the name of the file being processed
     echo "$x"
     # Print the first 3 lines of that file
     head -n 3 "$x"
 done


Now let's use variations on some of the complex code we wrote in the earlier lesson **Genomic File Formats** to write a loop to check how many reads contain the start codon `ATG`. We can do this by searching for matches with `grep` and counting how many times it was found with `wc -l` (literally how many lines are returned), and repeating this process for each sample using a while loop.


In [None]:
%%bash

# Check how many of the first 10,000 reads in each fastq file contain ATG
ls aws_research_workflow/*.fastq.gz | while read x; do 
   echo "$x"
   # grep without -o prints each matching read once, so wc -l counts reads rather than matches
   zcat "$x" | sed -n '2~4p' | head -n 10000 | grep "ATG" | wc -l
done


## Scripting in Bash
------

Loops are pretty useful, but what if we wanted to make it even simpler to run this code? Maybe we even want to share the program we just wrote with other lab members so that they can execute it on their own FASTQ files. One way to do this would be to write this series of commands into a `Bash script` that can be executed at the command line, passing the files you would like to be operated on to the script.

To generate the script (suffix `.sh`, which indicates this script is written in the BASH coding language) we will use the `nano` editor. The first line of a bash script is the **shebang**, which indicates the language the code is written in and therefore which interpreter should run it; in this case `#!/bin/bash` indicates the BASH coding language. The **shebang** is followed by a comment line (starting with the `#`) explaining what the code does, and lastly the code that you would use in the terminal.

As in the loops, we use the `$` to reference a variable; inside a script, `$1` represents the first argument handed to the script. Here we only need to provide the file name, so we only use `$1`, but if we wanted to accept more arguments to expand the functionality of our script, we would reference them with `$2`, `$3`, etc.

The next code chunk uses the `nano` text editor application. This cannot be run in the Jupyter notebook and should be executed in the terminal window only. Once you've executed this in the terminal window you can continue to run code in the Jupyter notebook, though you should also be practicing in the terminal window. 

1. In the terminal window type the command `nano count_ATGC.sh`
2. Copy and paste the text below into the terminal window to save to the `count_ATGC.sh` file you just created with `nano`:
    ```
    #!/bin/bash

    ## this code processes zipped fastq files to count the number of times each
    ## base (A, T, G, C, and N) occurs in the first 10,000 reads
    ## this code accepts one argument, the name of a zipped file in the fastq format
    
    echo processing sample "$1" 
    zcat "$1" | sed -n '2~4p' | sed -n '1,10000p' | grep -o . | sort | uniq -c
    ```
3. Use the `ctrl +X` keys to exit the text editor
4. Use the `y` key to save the changes to the filename `count_ATGC.sh`

In [None]:
%%bash

# Check to make sure your file contains the code chunk from above
cat count_ATGC.sh

The code you copied into the file `count_ATGC.sh` is a variation of code that you've seen before in the **Genomics File Formats** lesson. The first code line uses the `echo` command to print the name of the file being processed to the screen. Then there is a complex set of code built with several pipes:

- `zcat` prints the decompressed contents of a zipped fastq file
- `sed -n '2~4p'` prints the second line of the file (the read sequence of the first fastq entry) and then every fourth line after it (the read sequence of each following entry)
- `sed -n '1,10000p'` prints the first 10,000 read sequence lines from the prior command
- `grep -o .` separates each base pair in the first 10,000 read sequences onto its own line
- `sort` sorts all of the base pairs alphabetically
- `uniq -c` prints a count of each base pair
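
The steps above can be traced on a toy input. The sketch below uses two made-up reads stored in a shell variable instead of a zipped file, so the `zcat` step is replaced by `echo`; note that the `2~4p` address is GNU `sed` syntax.

```bash
# A made-up FASTQ snippet with two reads (4 lines per read)
fastq=$(printf '@read1\nACGTN\n+\nIIIII\n@read2\nGGCA\n+\nIIII\n')

# Keep every 4th line starting at line 2: the read sequences (GNU sed syntax)
echo "$fastq" | sed -n '2~4p'

# Put each character on its own line, sort them, then count each one
echo "$fastq" | sed -n '2~4p' | grep -o . | sort | uniq -c
```

For these two reads the counts should be 2 for `A`, 2 for `C`, 3 for `G`, 1 for `N`, and 1 for `T`.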

You can see that each piece of this code takes the output from the command in the preceding section, delimited by the `|`, and manipulates it slightly until we have the information we are after. Now let's run the script with the `bash` command, specifying the FASTQ file as the first argument (`$1`).

In [None]:
%%bash

# Run the script with the bash command
bash count_ATGC.sh aws_research_workflow/SRR1039508_1.chr20.fastq.gz


This is helpful information, but it would be more helpful to have this information for all of our fastq files, so let's run our BASH script with a while loop.

In [None]:
%%bash

# Run the bash script in a while loop on all fastq files
ls aws_research_workflow/*.fastq.gz|while read x; do
    bash count_ATGC.sh "$x"
done

Being able to quickly compare sequence quality between all of our fastq files is much more useful. You can see that all of our samples have very few Ns and that the numbers of Gs and Cs are very similar in all of our reads (the same is true for As and Ts). It might be useful to write the output into a file instead of printing to the screen, so that we can save this information to share with collaborators later. We can use the append command `>>` to save the output to a file called **bp_info.txt**.

In [None]:
%%bash

# Create the text file you want to write to
touch bp_info.txt

# Run the loop
ls aws_research_workflow/*.fastq.gz | while read x; do
   bash count_ATGC.sh "$x" >> bp_info.txt
done

# View the file
cat bp_info.txt

<div class="alert alert-block alert-warning">
    <i class="fa fa-question-circle-o" aria-hidden="true"></i>
    <b>TEST YOUR SKILLS</b> 
      <p>Practice your skills in the code block below</p>
    <div style="background-color: white ; color:black; padding: 3px;">Write the same loop but instead of the append command >> use the redirect command > <br><br>How does the output file change? Why?<br>HINT: check the section above on creating and organizing files to remind yourself of the operations performed by >> and > <br><br> Run the #FLASHCARD code block to see the answer.</div>
    
</div>

In [None]:
%%bash

## TEST YOUR SKILLS (enter and run your answers here)

## Now write the same loop but instead of the append command >> use the redirect command >

## How does the output file change? 

## Why is this, what is the difference between the append command >> and the redirect command > ? (hint check the section above about creating and organizing files)


In [None]:
# FLASHCARD
from IPython.display import IFrame
IFrame("quiz_files/quiz4-2.html", width=600, height=250)

#### nohup

These example programs run quickly, but stringing together multiple complex commands in a bash script is common and these programs often take many hours or sometimes days to run. In these cases we might want to close our computer and go and do some other stuff while our program is running.

We can achieve this using the command `nohup`, which stands for *no hang up*. Combined with `&`, it allows us to run a series of commands in the background while disconnecting the process from the terminal window you initially submit it through, so you are free to close the terminal window and the process will continue to run until completion. When you return, the text that would have printed to the screen has been saved automatically in a file called **nohup.out**.

As with the `nano` command, using `nohup` in the Jupyter notebook isn't recommended, but this command is very useful in the terminal environment. The syntax of the `nohup` command is `nohup COMMAND -flag -flag -flag argument1 argument2 &`, where the command and any flags and arguments are preceded by `nohup`, and adding `&` to the end of the command sends the process to the background.

Try running the following code in the terminal window

    ```
    nohup bash count_ATGC.sh aws_research_workflow/SRR1039508_1.chr20.fastq.gz &
    ```

In [None]:
%%bash

# Show the result
cat nohup.out
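
If you would rather choose where the output goes than rely on **nohup.out**, the output can be redirected explicitly. This is a sketch under stated assumptions: the log file name `count_run.log` is arbitrary, and a short `echo` stands in for a long-running script.

```bash
# Hypothetical log file name; any writable path works
log_file="$(mktemp -d)/count_run.log"

# Redirect both standard output and standard error to the log, then background the job
nohup bash -c 'echo "processing finished"' > "$log_file" 2>&1 &

# $! holds the process ID of the most recent background job
echo "started background job $!"

# Wait for the job (only needed in this demo so the log is complete before we read it)
wait
cat "$log_file"
```

In real use you would skip the `wait`, close the terminal, and check the log file when you return.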