# Introduction to the command line and BASH scripting

The command line is a text interface that allows you to interact efficiently with your computer's operating system and achieve tasks that would be difficult or impossible using a Graphical User Interface such as Windows or OSX. A shell is a software program that interprets commands entered *via* a command-line interface, enabling users to interact with a computer by giving it instructions. There are a few different shells for interacting with the command line; we will be using the **Bourne Again Shell (BASH)** for the next two labs.

Many bioinformatics algorithms and software packages are **only** available as command line tools, and these are most commonly compatible with BASH. In addition to allowing you to use bioinformatics tools that would not otherwise be accessible, having a basic understanding of BASH will allow you to manipulate data on a large scale and stitch together different bioinformatics tools into pipelines that automate data processing and analysis. In these labs, for example, you will start with reads from an Illumina DNA sequencer and use BASH to carry out a series of processing steps with different bioinformatics tools.

You will first trim adapter sequences from your reads using a program called `AdapterRemoval`. Once you have trimmed your reads, you will assemble them into genomes using a program called `spades.py`. You will then identify genes within your assembled genomes using a program called `glimmer3`. Once you have identified genes within all of your genomes, you will extract these into a new file and map sequencing reads from an RNAseq run against the genes using a program called `kallisto`. Mapping with `kallisto` will give you a count table, in which the number of reads mapping to each gene is recorded. You will then use a statistical analysis package called `pydeseq` to identify genes that are differentially expressed between two conditions.

**Note** I typed this in jupyter lab, so there was no spell checker. My spelling is atrocicous (see previous word for example). I would like to apologise for this in advance.

## A little bit of background about the data you will be using

For this lab, you will be assuming the role of a bioinformatician in a laboratory that studies the human microbiome. To make sure things run smoothly and in a reasonable time frame, we have provided you with small synthetic datasets. This is purely to avoid long running times. The software packages you will use are commonly used in many research applications, and the analysis you run could easily be scaled up to analyse real clinical data.

The data you have been given are part of a study (albeit an imaginary one) in which metagenome and RNAseq data were acquired from the gut microbiomes of a cohort of patients. These patients were then monitored for various health outcomes over the course of ten years with the aim of identifying potential linkages between the gut microbiome and certain disease states. You have been given RNAseq data from 12 patients. Six of these patients remained healthy over the course of the study. The other six developed colorectal cancer within two years of having their metagenome data collected. Don't worry though, all of these imaginary patients made a full recovery and then won lotto. Yay!

In addition to the RNAseq data, which is in the folders labelled `patient_data_1` and `patient_data_2`, you have a folder called `reads` that contains sequencing reads from an exhaustive collection of gut microbes isolated from the patients. Luckily, in our imaginary scenario, there are only ten microbes in the human microbiome (not thousands). Your task, as a newly minted bioinformatician, is to complete the following steps:

 1) Assemble the microbial isolate sequencing reads into genomes.
 
 2) Identify the genes in each genome and extract these to a single file.
 
 3) Map the patient RNAseq reads against the extracted gene sequences.
 
 4) Use an appropriate statistical approach to identify changes in gene expression between the healthy and colorectal cancer cohorts that might provide insight into the connection between gut microbiome function and cancer risk.
 

**A note on saving your work:** Using **ctrl+S**, or the jupyter lab dropdown menu, will save your work within the VM you created. When you shut down this VM, make sure you save the machine state or you will lose your work. In addition to saving locally, I **strongly recommend** that you periodically attach the jupyter notebook you are working on to an email and send this to yourself as a fail-safe. You can use firefox to log in to whichever email provider you prefer and attach a file as you would normally do. The notebook is called `bash_lab.ipynb`.

# Introductory exercises
Before we start analysing data, you will need to acquire some basic skills by completing some simple BASH scripting exercises.

## Basic commands:

**Navigating the file system:** Just like a graphical interface, we can navigate through directories using the command line. The directory we are currently in is our **working directory**. The `pwd` command prints the current working directory to the console so that we know where we are. 

**Exercise 1:** In the cell below use the `pwd` command to print your working directory.

**Exercise 2:** In the cell below use the `cd` command to change directory to `~/339_lab-main` and list the contents of this directory using `ls`. You can put both of these commands into the same cell and they will be executed sequentially.

**Note:** The tilde symbol `~` is a shortcut for your home directory. If you ever get lost, you can always find your way home by typing `cd ~`

**Note:** In BASH `*` is a wildcard that matches any sequence of characters, so by typing `ls *.zip` you will list all files ending in the extension `.zip`. Try it out below.
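
As a minimal sketch of how `*` behaves, the cell below builds a throwaway directory and lists files in it. The directory and the file names (`data_1.zip`, `data_2.zip`, `notes.txt`) are made up for illustration and are not part of the lab data.

```bash
# Make a temporary sandbox so nothing in your lab folder is touched
sandbox=$(mktemp -d)
touch "$sandbox"/data_1.zip "$sandbox"/data_2.zip "$sandbox"/notes.txt

# The parentheses run these commands in a subshell, so the cd
# does not change the working directory of your notebook
(
    cd "$sandbox"
    ls            # lists all three files
    ls *.zip      # lists only the two files ending in .zip
    echo *.zip    # shows the expanded list that the shell hands to a command
)
```

The shell replaces `*.zip` with the matching file names before the command runs, so `ls` and `echo` both receive `data_1.zip data_2.zip` as their arguments.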

**Unzipping files with `unzip`:** The command `unzip` will run a program that unpacks the contents of a `.zip` file. The name of the file to be unzipped is specified after the `unzip` command, with a space between the two. For example, `unzip my_file.zip` will unpack `my_file.zip`, if such a file exists in your current working directory. You might have noticed that all of the file names in this lab use underscores instead of spaces. This is because spaces are used to separate arguments in BASH commands, which makes dealing with file names that contain spaces a pain.

**Exercise 3:** In the cell below, write commands which will unzip all of the zip folders in your current working directory that end with the extension `.zip`

**Removing files with `rm`:** Unwanted files can be deleted using `rm`. The file(s) to be removed are specified after the `rm` command and are separated by a space. For example, the command `rm junk_1.txt junk_2.txt junk_3.txt` will remove three files called `junk_1.txt`, `junk_2.txt` and `junk_3.txt` from your current working directory, if they exist. You can also use `*` to remove multiple files that match a given pattern. For example, `rm *junk*` would remove all files in the current directory that contain the word `junk`; `rm *.txt` would remove all files with the extension `.txt`; `rm *junk*.txt` would remove all files that contain the word `junk` *and* end in the extension `.txt`.

**!!Caution!!** There is no safety net. Deleted files are not sent to a bin. They are gone for good. Before using `rm`, you want to be very sure you are not going to delete something you need. This is particularly true when using `*` to delete multiple files.
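
One cautious habit, sketched here with made-up file names in a throwaway directory, is to preview a pattern with `ls` and then reuse exactly the same pattern with `rm`.

```bash
# Temporary sandbox with hypothetical files
sandbox=$(mktemp -d)
touch "$sandbox"/junk_1.txt "$sandbox"/junk_2.txt "$sandbox"/keep_me.txt

# Subshell: the cd below does not affect your notebook's working directory
(
    cd "$sandbox"
    ls junk_*.txt    # preview: shows exactly what the pattern matches
    rm junk_*.txt    # delete using the very same pattern
    ls               # only keep_me.txt remains
)
```

If the preview lists anything you did not expect, tighten the pattern before running `rm`.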

**Exercise 4:** Once you have successfully unzipped the folders use the command `rm` to remove the zip files and free up space. If you want to, you can use `*.zip` to remove all of them with one command. **Note** When using pattern matching to `rm` multiple files, it is a good idea to check what you are removing first using `ls`

**Making, naming and moving:** The command `mkdir` makes directories. For example, `mkdir dir1` will make a subdirectory called `dir1` within your current working directory. You can pass `mkdir` multiple arguments; for example, `mkdir dir1 dir2 dir3` will make three directories called `dir1`, `dir2`, and `dir3`.

`mv` is a command that moves and renames files. For example, if I wanted to move all files ending in the extension `.txt` from my current working directory to the directory one level up, I would type `mv *.txt ..` where `..` refers to the directory one level up. The command `mv *.txt ../new_location` would move all txt files to a folder called `new_location` that is one level up from my current working directory. You can also use `mv` to rename files. For example, the command `mv old_name.txt new_name.txt` would rename the file `old_name.txt` to `new_name.txt`.
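
The following sketch strings `mkdir` and `mv` together in a throwaway directory. The directory names (`project`, `archive`) and file names are hypothetical.

```bash
# Temporary sandbox so the example cannot touch your lab files
sandbox=$(mktemp -d)

# Subshell: the cd commands below do not affect your notebook's working directory
(
    cd "$sandbox"
    mkdir project archive        # make two directories with one command
    touch project/draft.txt      # create an empty file to play with
    cd project
    mv draft.txt final.txt       # rename the file in place
    mv final.txt ../archive      # move it to a directory one level up
    ls ../archive                # prints final.txt
)
```

Whether `mv` renames or moves depends on its last argument: if it is an existing directory the file is moved into it, otherwise the file is renamed.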

**Exercise 5:** Using `mkdir`, make a new directory in `339_lab-main` called `patient_data`, then use the commands we have discussed to move all of the files from `patient_data_1` and `patient_data_2` to this folder.

**Exercise 6:** The command `ls patient_data/* | wc -l` will count the number of files in the folder `patient_data`. We won't unpack exactly how this works right now, but you should be able to figure it out with a little more BASH learning. In the cell below, modify this command so that it will count all the files that end in the extension `.fq`.

**Exercise 7:** Change to the directory `~/339_lab-main/reads`. Once in that directory, list all of the files. Notice that there are files that end with `reads1.fq` and files that end with `reads2.fq`. These are forward and reverse reads, respectively, from sequencing of ten different genomes. Can you think of a command that will list only the forward reads? Hint: you can use the wildcard `*` more than once. For example, `ls *something*.txt` would list all text files with the pattern `something` in them.

## Variables

BASH uses the equal sign to assign variables as shown in the cell below. Execute the commands and see what happens. 


In [None]:
course=biol339
echo $course

**Note:** `echo` is a command that prints something to the console. 

**Note:** The dollar symbol `$` can be thought of as meaning *the value of*. So the command `echo $course` prints the value we previously assigned to the variable named course.

**Note:** In jupyter lab, variables that you assign in one cell will persist in subsequent cells. 

**Note:** You can also use `$` to place the value of your variables in sentences or commands. See the cell below for an example of this using the `echo` command.

In [None]:
echo I love $course
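
The sketch below shows a few details of variable assignment that tend to trip people up. The variable `lab_title` and the file name are made up for illustration.

```bash
course=biol339                # no spaces around the equal sign
lab_title="BASH lab one"      # quote values that contain spaces

echo "$course"                # prints biol339
echo "Welcome to $lab_title"  # variables are expanded inside double quotes

# Curly brackets mark where the variable name ends. Without them,
# $course_notes would be read as a variable named course_notes.
echo "${course}_notes.txt"    # prints biol339_notes.txt
```

Writing `course = biol339` with spaces would not assign anything: BASH would try to run a command called `course`.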

## Arrays:
Below is an excerpt from **The Unix Workbench** by Sean Kross that provides a concise introduction to arrays. Make sure you understand all of this before moving on.

_______________________________________________________________________________________________

Arrays in Bash are ordered lists of values. You can create a list from scratch by assigning it to a variable name. Lists are created with parentheses `( )` with a space separating each element in the list. Let’s make a list of the plagues of Egypt:

```bash
plagues=(blood frogs lice flies sickness boils hail locusts darkness death)
```

To retrieve the array you need to use parameter expansion, which involves the dollar sign and curly brackets `${ }`. The positions of the elements in the array are numbered starting from zero. To get the first element of this array use `${plagues[0]}` like so:

```bash
echo ${plagues[0]}
```
`blood`

`echo` is a command that prints something to the console. The symbol `$` can be thought of as meaning **the value of**, so the command above asks the shell to print the value of the first element of the array `plagues` to the console.

**Notice** that the first element has an index of 0. You can get any of the elements this way, for example the fourth element:

```bash
echo ${plagues[3]}
```
`flies`

To get all of the elements of `plagues` use a star (`*`) between the square brackets:
```bash
echo ${plagues[*]}
```
`blood frogs lice flies sickness boils hail locusts darkness death`

You can also change individual elements in the array by specifying their index with square brackets:

```bash
plagues[4]=disease
echo ${plagues[*]}
```

`blood frogs lice flies disease boils hail locusts darkness death`

You can find the length of an array using the pound sign `#`:

```bash
echo ${#plagues[*]}
```

`10`

**Summary**

 - Arrays are a linear data structure with ordered elements which can be stored in variables.
 - Each element of an array has an index and the first index is 0.
 - Individual elements of an array can be accessed using their index.
 
_________________________________________________________________________

**Note:** You can use `*` to make arrays of items matching a particular pattern. For example, the cell below will create an array called `fw_reads` that contains all file names in the current directory that contain the pattern `reads1`



In [None]:
fw_reads=(*reads1*)
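
To see this pattern at work without relying on the lab files, here is a small sketch in a throwaway directory. The file names and the array name `example_reads` are hypothetical.

```bash
# Temporary sandbox with made-up read files for two genomes
sandbox=$(mktemp -d)
touch "$sandbox"/genome_1_reads1.fq "$sandbox"/genome_1_reads2.fq \
      "$sandbox"/genome_2_reads1.fq "$sandbox"/genome_2_reads2.fq

# Subshell: the cd below does not affect your notebook's working directory
(
    cd "$sandbox"
    example_reads=(*reads1*)      # the wildcard is expanded, then stored as an array
    echo "${example_reads[@]}"    # prints genome_1_reads1.fq genome_2_reads1.fq
)
```

Like `[*]`, the form `[@]` refers to every element of the array. Inside double quotes, `[@]` keeps each element as a separate word, which is why it is the form usually used in loops.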

**Exercise 8:** In the cell below, write a command that will get the length of the array `fw_reads`

**Exercise 9:** In the cell below write commands that will print the first, second, third and tenth element of the array `fw_reads`.

## For loops
In BASH, and other programming languages, for loops allow us to iterate through items in an array. For example, the code in the cell below iterates through `fw_reads` and prints each item it contains to the console.

In [None]:
for i in ${fw_reads[@]}
do
    echo $i
done

**Let's look at the structure of the `for` loop above one line at a time:**

The first line, `for i in ${fw_reads[@]}`, specifies the thing we are iterating through. In this case we are iterating through the array `fw_reads` that we created earlier. This line also assigns a variable name for iteration. In this case the variable is called `i`, but we could have used `j` or `k` or `read` or `cheeseburger` or another word with no spaces. On the first iteration, the value of `i` will be the first item in the array. At each subsequent iteration, `i` takes the value of the next item, until every item in the thing we are iterating through has been visited.

Everything between the lines **`do`** and **`done`** is a command that will be carried out at every iteration. You can put multiple commands here, one per line. In this case we are using the `echo` command and `$` to print the value of `i` as it changes with each iteration.
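
As a small self-contained sketch of these two points, the loop below iterates through a made-up array called `fruits` and runs two commands at every iteration.

```bash
fruits=(apple banana cherry)    # made-up values for illustration

for fruit in "${fruits[@]}"
do
    echo "The value of fruit is now: $fruit"
    echo "A file name built from it: ${fruit}_reads.fq"
done
```

On the first pass `fruit` holds `apple` (the item itself, not its position), and both `echo` commands run before the loop moves on to `banana`.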

The indentation is added for clarity of formatting. You can also make for loops on one line by using `;` as a separator. See the cell below for an example:

In [None]:
for i in ${fw_reads[@]}; do echo $i; done

**Note:** By adding an exclamation mark before the array name, we can iterate through the indices of the array as shown below.

In [None]:
for i in ${!fw_reads[@]}
do
    echo $i
done

**Note:** We can also use the indices to grab the corresponding item from the array as shown below. It will soon become clear why this is a useful thing to be able to do.

In [None]:
for i in ${!fw_reads[@]}
do
    echo ${fw_reads[$i]}
done

**Exercise 10:** In the cell below, make another array called `rv_reads` which contains the names of all of the reverse read files in the folder `reads`. Recall that these contain the pattern `reads2`.

**Note:** We can retrieve items from the same position in two or more arrays as shown in the cell below. Note that `echo` is printing everything you put after it, including the words `and`, `are`, `at` and `position`.

In [None]:
echo ${fw_reads[2]} and ${rv_reads[2]} are at position 3

**Exercise 11:** In the cell below, write a for loop that iterates through the indices of `fw_reads` and prints a statement of the type shown above for each element in both the `fw_reads` and `rv_reads` arrays. You will notice that the corresponding fw and rv reads occupy the same position in their respective arrays. This provides a convenient way of passing matching forward and reverse read sets to another command or to a software tool, as we will soon see...

# Assembling genomes from Illumina sequence reads:

Now that you have learned a little bit of BASH, you are ready to run your first bioinformatics software package, but first a little bit about **command line applications**.

Command line applications are pieces of software that are deployed using the command line. We installed a bunch of these at the start of the lab; now we are going to start using them. Command line applications can be called using their name. They are given arguments that specify which files they should act on, and what they should do to these files.

## Adapter removal

Before we can assemble genomes from our reads, we need to trim the reads to remove adapter sequences that were added during the sequencing run. We will do this using a program called `AdapterRemoval`. This program will identify and remove adapter sequences from our reads, and create new files from the resulting trimmed reads.

We are going to write a short script that uses a `for` loop to trim the adapters from all of our forward and reverse reads.

If we wanted to use adapter removal to trim a single pair of matching forward and reverse reads, we would use the command shown below:

```bash
# The backslash continues the command onto the next line
AdapterRemoval --file1 fwd_reads.fq --file2 rv_reads.fq --basename new_file_name \
    --trimns --trimqualities --collapse
```

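Whatever we put after `--file1` tells `AdapterRemoval` where the forward reads file is. `--file2` tells `AdapterRemoval` where the reverse reads file is. `--basename` tells `AdapterRemoval` what name to give the trimmed read files. `--trimns`, `--trimqualities` and `--collapse` are additional arguments that tell `AdapterRemoval` to clean up dubious data: `--trimns` trims ambiguous bases (`N`s) from the ends of reads, `--trimqualities` trims low-quality bases from the ends of reads, and `--collapse` merges overlapping forward and reverse reads into a single read.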

In **exercises 12-15** we are going to build the pieces of our adapter removal script one-by-one and bring them together to make our final script for adapter removal.

**Exercise 12:** In the cell below, write a for loop that iterates through the indices of the array `fw_reads` and uses the number returned to `echo` the corresponding values from both `fw_reads` and `rv_reads` on separate lines.

**Exercise 13:** Update your for loop so that it first assigns the values to variables called `reads_1` and `reads_2`, then uses `echo` and `$` to print the values of these variables on the same line.

**Helper Code**

*Notice* that we could pass the read file names to `AdapterRemoval` instead of `echo` using the same logic. To do this, we will also need to generate a new file name and store this as a third variable. You can copy and paste the extremely nasty looking line of code below to do this.

```bash
genome_name=${reads_1%??????????} 
```
This line of code strips the last n characters from `reads_1`, where n is the number of question marks. This gives us the common prefix for both `reads1` and `reads2` and stores it as a new variable called `genome_name`. Its value, `$genome_name`, can now be passed to the argument `--basename` so that `AdapterRemoval` knows what prefix it should use for the new files it creates.
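
Here is a small sketch of that suffix-stripping expansion on a single value. The file name `genome_1_reads1.fq` is an assumption made for illustration; your own file names may differ, in which case the number of characters to strip would differ too.

```bash
reads_1=genome_1_reads1.fq           # hypothetical file name

# Each ? matches exactly one character, so ten of them strip the last ten characters
genome_name=${reads_1%??????????}
echo "$genome_name"                  # prints genome_1

# The same result, naming the suffix explicitly instead of counting characters
genome_name=${reads_1%_reads1.fq}
echo "$genome_name"                  # prints genome_1
```

The `%` operator removes the shortest match of the pattern that follows it from the end of the value, and leaves the original variable `reads_1` unchanged.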

**Exercise 14:** Update your for loop so that it assigns an additional variable called `genome_name` as described above and prints all three variables on a single line. You can just cut and paste the line from the cell above if you want.

**Exercise 15:**  Update your for loop so that it passes the variables `reads_1`, `reads_2` and `genome_name` as arguments to AdapterRemoval. You should also add `--trimns --trimqualities --collapse` as shown in our previous example.

**Exercise 16:** If your script worked, your folder should now contain a bunch of files that end in `pair1.truncated` and `pair2.truncated`. These are your trimmed forward and reverse reads.

Once you have confirmed these are present:

 - Make a new directory `../trimmed_reads`
 - Move all the files from `reads` that match the pattern `*pair*` to your newly created directory

**Helper Code**

Running the cell below will:

 - Remove all the extra junk that AdapterRemoval made in your reads folder.
 
 - Change directory to `../trimmed_reads`
 
 - Add `.fq` as an extension to your trimmed read files so that they will be recognised by the genome assembly tool we are going to use

In [None]:
shopt -s extglob 
rm !(*.fq)
cd ../trimmed_reads
for f in *.truncated; do mv "$f" "$f.fq"; done

## Assembling your genomes:

You now have all the skills required to assemble your trimmed reads using `spades`. Genome assembly algorithms like the one used by `spades` take an overlapping collection of reads and rebuild a continuous genome sequence from them. This is analogous to exploding a stack of newspapers into 150 word chunks then using the overlapping fragments of text to reconstitute the entire newspaper. Genome assembly algorithms are super interesting, but we don't have time to go into exactly how they work here. If you want to know more, let me know!

**The command for assembling a single pair of fw and rv read sets with `spades` is shown below:** 

    spades.py --pe1-1 fw_reads.fq --pe1-2 rv_reads.fq -o output_directory

**Exercise 17:** Using the same approach you applied for adapter removal, write a script containing a `for` loop that uses `*` to make arrays of fw and rv trimmed reads and iterates through indices to pass matching fw and rv read sets to SPAdes.

**Note:** You will also need to use `mkdir` once in every iteration to make a new directory to send the assembly outputs to. You should name your new directories using the current index of the `for` loop. If the index is called `i`, as in our previous example, you'd make the directory with this command: `mkdir $i`, and pass the directory name to SPAdes as an argument using `-o $i` so that SPAdes sends the outputs from each iteration of the loop to a newly created and uniquely named directory.

**Helper Code**

Running the code provided below will:

 - Make a new directory `../assemblies`
 
 - Copy each of your newly assembled genomes to this directory and give it a unique name
 
 - Print some data about the sizes and coverages of the assembled genomes

In [None]:
mkdir ../assemblies
fw_reads=(*pair1*)
for i in ${!fw_reads[@]}
do
    echo $i
    file_name=${fw_reads[$i]} 
    new_name="${file_name%%.*}.fna"
    echo $new_name
    contig_file=$i/contigs.fasta
    head -n 1 $contig_file
    cp $contig_file ../assemblies/$new_name
    
done

# Annotating your genome assemblies using glimmer3

We will now identify genes within each of the newly assembled genomes using another program called `glimmer3`. First, let's change to the directory where we stored your assembled genomes, and see what is in there.

In [None]:
cd ../assemblies
ls

**A brief note on how to use glimmer3:** `glimmer3` requires that a model be built for each genome before it can find genes. This is achieved using three helper programs called `long-orfs`, `extract` and `build-icm`. The `extract` program can also be used to pull out the gene sequences that `glimmer3` identifies and store these as a new file. The example script below annotates a single genome using `glimmer3`. This script assumes that the name of the genome file, without its `.fna` extension, has previously been assigned to a variable called `genome_file`, so that the full file name can be written as `$genome_file.fna`.

```bash
long-orfs $genome_file.fna long_orfs.txt
extract -t $genome_file.fna long_orfs.txt > run1.train
build-icm -r run1.icm < run1.train
glimmer3 $genome_file.fna run1.icm run1
extract -t $genome_file.fna run1.predict > genes_$genome_file.fna
rm run1* long_orfs.txt 
```
**The first line**, `long-orfs $genome_file.fna long_orfs.txt`, runs the program `long-orfs` to identify long open reading frames (ORFs) in the specified genome assembly `$genome_file.fna`.

**The second line**, `extract -t $genome_file.fna long_orfs.txt > run1.train`, runs the program `extract` to extract the long ORF sequences identified in line one and store these in a new file called `run1.train`.

**The third line**, `build-icm -r run1.icm < run1.train`, uses the program `build-icm` to build a gene finding model for the genome using the training data `run1.train` generated in line two. The model is stored as a file called `run1.icm`.

**The fourth line**, `glimmer3 $genome_file.fna run1.icm run1`, runs `glimmer3` on `$genome_file.fna` and uses the model `run1.icm` to find the coordinates of genes in the genome. These are stored in a file called `run1.predict`.

**The fifth line**, `extract -t $genome_file.fna run1.predict > genes_$genome_file.fna`, uses `extract` again to pull out the sequences of the genes according to the coordinates in `run1.predict` and store these in a new file called `genes_$genome_file.fna`.

**The last line**, `rm run1* long_orfs.txt`, removes the files that are no longer needed.
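
To make the file naming in this script concrete, the sketch below only prints the names that BASH would build; it does not run `glimmer3`. The values `genome_1` and `genome_1.fna` are hypothetical.

```bash
genome_file=genome_1                     # hypothetical name, without the .fna extension

# A dot cannot be part of a variable name, so BASH reads $genome_file and then adds .fna
echo "input:  $genome_file.fna"          # prints input:  genome_1.fna
echo "output: genes_$genome_file.fna"    # prints output: genes_genome_1.fna

# Starting from a full file name, strip the extension to recover the name on its own
assembly=genome_1.fna
genome_file=${assembly%.fna}
echo "$genome_file"                      # prints genome_1
```

The last three lines may be handy when the file names come from an array built with `*.fna`, because each item in such an array already includes the extension.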

**Exercise 18:** Use what you have already learned, and the example above, to write a genome annotation script. Your script should make an array by matching all files in the `assemblies` directory that end with the extension `.fna`. It should then use a `for` loop to iterate through the resulting array of file names, and extract the gene sequences from each genome. You will need to store these as uniquely named files by adding the prefix `genes_` to each genome file name so that you end up with files called `genes_genome_1.fna`, `genes_genome_2.fna` ... `genes_genome_10.fna`. It might be helpful to test pieces of the script before putting the whole thing together.

**Helper Code**

Running the cell below will combine the extracted genes from all of your genomes into a single file called `all_genes.fna` that will be used for mapping the RNAseq data in the `patient_data` folder

In [None]:
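# Prefix every sequence header (lines starting with >) with the name of its file
for f in genes_*.fna; do sed -i -e "s/^>/>${f%.fna}_/g" "${f}"; done
# Remove the backup copies that some versions of sed create (-f: no error if there are none)
rm -f *fna-e
# Join all of the gene files into one
cat genes* > all_genes.fna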
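
To see what the `sed` substitution in this helper does to the sequence headers, here is a sketch on a tiny made-up file. The file name, the header names (`orf00001`, `orf00002`) and the sequences are all hypothetical.

```bash
# Temporary sandbox holding a tiny made-up gene file
sandbox=$(mktemp -d)
f=genes_genome_1.fna
printf '>orf00001\nATGAAATAG\n>orf00002\nATGCCCTGA\n' > "$sandbox/$f"

# ${f%.fna} is the file name without its extension. The substitution
# puts that name, plus an underscore, right after the > on each header line.
sed "s/^>/>${f%.fna}_/" "$sandbox/$f"
```

The header `>orf00001` is printed as `>genes_genome_1_orf00001`, so every gene name records which genome it came from. The helper cell uses `sed -i` to make the same substitution inside each file; the sketch leaves `-i` out so that the result is printed and the file stays unchanged.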

# Measuring Gene Expression in the Gut Microbiome with RNAseq

The folder `patient_data` contains metatranscriptome data from the gut microbiomes of 12 patients. There are two files for each patient. These are fw and rv Illumina sequencing reads from an RNAseq experiment. Six of the patients are healthy controls, while the other six developed colorectal cancer within 2 years of the date that the metatranscriptome data was collected.

**What exactly is a metatranscriptome?** A transcriptome is the collection of genes that are being transcribed by a single organism, tissue sample or cell culture. Most commonly this is measured using RNAseq. A *meta*transcriptome is the transcriptome of a collection of organisms. In our case, the data are transcriptomes from all of the gut microbes for 12 patients that were collected by extracting RNA directly from metagenome samples. Thankfully I didn't actually have to do this since the data are imaginary.

In the previous sections, we assembled the genomes for all known gut microbes in our imaginary world. We then annotated genes within these, and extracted the sequences for all of the genes, from all of the bacteria, into a single file. We are now going to use a program called `kallisto` to align our RNAseq reads against our extracted gene sequences. The output we generate for each patient dataset will be a count table where the number of transcripts for each gene are counted. We will then feed these count tables to a statistical analysis package in order to identify genes which are differentially expressed in the gut microbiomes of patients who went on to develop colorectal cancer, as compared to healthy controls. 

## Making an index file:

Before aligning RNAseq reads against our extracted gene sequences, `kallisto` needs to make an index file. The command for doing this on a file called `gene_file.fna` is shown below. 

`kallisto index -i kallisto_index gene_file.fna --make-unique`

This command tells `kallisto` to run the `index` making algorithm to generate an index file called `kallisto_index` from the input `gene_file.fna`. The flag `--make-unique` tells `kallisto` to make unique names if it encounters any duplicated gene names in the file `gene_file.fna`.

**Exercise 19:** In the next cell, run this same command on the concatenated gene file you made previously.

**Exercise 20:** Change directory to `patient_data` and list the files contained in this directory.

**Notice** that all of the fw read files end in 1 and all of the rv read files end in 2. 

## Aligning RNAseq reads against an index file

The index file we made is a special data structure that allows `kallisto` to rapidly align sequence reads against the sequences represented in the index. In our case, the index contains all the genes we extracted from our gut bacterial genome assemblies. In exercises 21 and 22, you will use the `quant` function of `kallisto` to align the RNAseq data in the `patient_data` folder against your index file. This will generate a count table for each patient that keeps track of the number of RNAseq reads aligning to each of the genes in the index file.

**Exercise 21:** In the cell below, make arrays called `fw_reads` and `rv_reads` that contain the file names for all of the forward and reverse reads for each patient.

Execute the cell below to make the folder where we will save the outputs from `kallisto`

In [None]:
mkdir ../outputs

The command for running `kallisto` with a single pair of RNAseq read sets is shown below:

```bash
kallisto quant -i ../assemblies/kallisto_index -o $out_dir $fw_file $rv_file
```
**Notes:**

- This command assumes that the names of the forward and reverse read files have been assigned to the variables `fw_file` and `rv_file` respectively. 

- It is telling `kallisto` to run the `quant` algorithm to map the reads in `$fw_file` and `$rv_file` against the index file `../assemblies/kallisto_index` that we created earlier. 

- The outputs will be stored in `$out_dir`, assuming this variable has also been set to an existing directory where we want to store the outputs from `kallisto`.

**Exercise 22:** In the cell below, write a script that uses a `for` loop to iterate through the fw and rv reads in all of your patient datasets and runs `kallisto quant` on these. At each iteration you should assign a variable for your fw and rv read set using the index `$i` generated by iterating through one of your arrays. You should also assign a variable for the output directory, make this directory, and tell `kallisto` to store the outputs there. Assuming you call your fw read file `fw_file` for each iteration, you could achieve this using the two lines shown below to make a new subdirectory `../outputs/$fw_file`.

```bash
out_dir=../outputs/$fw_file
mkdir $out_dir
```
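
One optional variation, sketched here in a throwaway directory with a made-up file name, is to use `mkdir -p`. This is only a suggestion and not a requirement of the exercise.

```bash
# Temporary sandbox standing in for the lab folder
sandbox=$(mktemp -d)
fw_file=patient_1_1.fq                 # hypothetical file name
out_dir="$sandbox/outputs/$fw_file"

mkdir -p "$out_dir"      # creates outputs/ and the subdirectory in one step
mkdir -p "$out_dir"      # running it again is not an error
ls "$sandbox/outputs"    # prints patient_1_1.fq
```

Plain `mkdir` reports an error if the directory already exists, which can happen when a notebook cell is run a second time; the `-p` option avoids that and also creates any missing parent directories.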