# Putting It All Together
------
### Learning Objectives:
+ Use the skills developed in previous lessons to assemble a genome

You have now been through most of the lessons and, believe it or not, you have the basic skills you need for most bioinformatic analyses. At this point you should be able to interact with your data through the terminal interface and use the help flag to learn how to use and customize BASH commands. You can build complex commands with the pipe and execute those commands on a set of files with loops and BASH scripts. You also understand the data structure of several common genomic file formats, and you can install software with the Conda package manager.


## Genome Assembly
----------

Best practice is to create different environments for each analysis task that you need to perform. However, here we will install packages to the base environment due to the complexity of creating conda environments on AWS SageMaker. For genome assembly, you will install software for downloading sequences from the SRA (`sra-tools`), checking the quality of the raw data we download (`fastqc`), and assembling fastq reads into a complete genome (`spades`). For annotation, you will install software for annotating the assembly (`prokka`).

In [None]:
%%bash

# Install the assembly and annotation software along with the packages they depend on
mamba install -c bioconda fastqc sra-tools spades perl perl-bioperl prokka

## Creating a New Directory for Your Project
----------
Again, this is optional, but it is good practice to organize your projects into directories so that you know where to find all files related to a given project.

In [None]:
%%bash

# Create a directory to store your data in 
mkdir assembly_test

## Downloading SRA Data (SRA toolkit)
----------
The Sequence Read Archive (SRA) is a database hosted by the National Center for Biotechnology Information (NCBI) as a long-term storage repository for raw sequencing data. These data are often fairly large (depending on the organism being sequenced) and can take up a lot of space. It is therefore handy to be able to download the files when you need them, use them, delete them, and retrieve them again later if necessary.

Here we are downloading the sequence of a SARS-CoV-2 sample collected in western New Hampshire in March 2022.

In [None]:
%%bash

# Pull the raw SRA file you need from the public SRA bucket on Amazon S3
aws s3 sync s3://sra-pub-run-odp/sra/SRR18241034/ SRR18241034/ --no-sign-request

# Convert the SRA file to fastq, writing forward and reverse reads to separate files
fastq-dump --outdir assembly_test/ --split-files SRR18241034/SRR18241034

# Remove the downloaded SRA directory now that you are finished with it
rm -r SRR18241034

## Checking the Quality of the Data (FastQC)
----------
When you download data from a public source, the first thing you should do is look at the quality of the data you've accessed. Remember that it is normal for quality scores to drop off at the end of the read, but ideally we would like to see the majority of bases in a read above 30 and certainly above 25. If you have downloaded low quality data, you can always use trimming to improve the quality of reads before assembly/mapping.
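
To make the thresholds above concrete, here is a minimal Python sketch of how a FASTQ quality line translates into numeric scores. It assumes the Phred+33 encoding (each character's ASCII code minus 33), and the quality string, function names, and thresholds passed in are made up for illustration. FastQC does far more than this; the sketch only shows what a score of 30 or 25 refers to.

```python
def phred_scores(quality_line, offset=33):
    """Convert a FASTQ quality string into integer Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality_line]


def fraction_at_least(scores, threshold):
    """Fraction of bases whose quality score is at or above the threshold."""
    if not scores:
        return 0.0
    return sum(s >= threshold for s in scores) / len(scores)


# Made-up quality string: eight high-quality bases, then a drop-off at the end.
example_quality = "IIIIIIII;+"
scores = phred_scores(example_quality)

print(scores)
print("Fraction of bases >= 30:", fraction_at_least(scores, 30))
print("Fraction of bases >= 25:", fraction_at_least(scores, 25))
```

In this toy read the last two bases fall to 26 and 10, which mirrors the end-of-read drop-off you would expect to see in a FastQC per-base quality plot.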

<div class="alert alert-block alert-warning">
    <i class="fa fa-question-circle-o" aria-hidden="true"></i>
    <b>TEST YOUR SKILLS</b> 
      <p>Practice your skills in the code block below</p>
    <div style="background-color: white ; color:black; padding: 3px;">Check the quality of the raw data from the SRA with the fastqc tool.<br><br>HINT: This is a command you used at the end of submodule 5. If you've forgotten the command syntax, you can use fastqc --help to remind you.<br><br> Run the #FLASHCARD code block to see the answer. </div>
    
</div>

In [None]:
%%bash

## TEST YOUR SKILLS (enter and run your answers here)

# Check the quality of the raw data


In [None]:
# FLASHCARD
from IPython.display import IFrame
IFrame("quiz_files/quiz6-1.html", width=600, height=250)

## Assembling the Genome (Spades)
----------
The assembler we are using here, [`spades`](https://cab.spbu.ru/files/release3.15.4/manual.html), is a commonly used assembler for small genomes, such as those of bacteria and viruses. For larger genomes we recommend mapping reads to a reference genome with a read aligner such as `bowtie2` or `bwa`.

<p align="center">
<img src="images/assembly.jpg" alt="assembly" width="50%"/>
</p>

Briefly, Spades, like most assemblers, works by breaking reads down into sets of k-mers, short sequences of length *k*, and then connecting reads that share a k-mer at their ends. In a simplified picture, reads are connected by building a graph where each read is a node and edges represent k-mer overlaps between the ends of reads. (Spades itself builds a de Bruijn graph, in which the k-mers rather than whole reads make up the graph, but the principle of extending sequences through shared k-mers is the same.) The series of overlaps between reads is used to build and extend a consensus sequence iteratively, producing the longest possible sequence from the overlapping reads. The final fasta file, which spades names `scaffolds.fasta`, is the set of sequences created by connecting reads in this way.
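
The idea of joining reads through shared k-mers can be sketched in a few lines of Python. This is a toy illustration only, not how Spades is implemented: the two reads, the value of *k*, and the function names are all made up, and a real assembler also handles sequencing errors, reverse complements, and repeats.

```python
def kmers(sequence, k):
    """Return every k-mer (substring of length k) in order of appearance."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def merge_if_overlapping(left, right, k):
    """Join two reads when the last k bases of `left` equal the first k bases of `right`."""
    if len(left) >= k and len(right) >= k and left[-k:] == right[:k]:
        return left + right[k:]
    return None  # no k-mer overlap between these two read ends


# Two made-up reads that share the 4-mer GTAC at their ends.
read_a = "ATGGCGTAC"
read_b = "GTACCTTAG"

print(kmers(read_a, 4))
print(merge_if_overlapping(read_a, read_b, 4))
```

Repeating this kind of extension across many reads is what gradually builds the longer sequences that end up in the assembly output.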

The syntax for using the `spades.py` command with paired-end reads (what we have here) is `spades.py -1 fwd_reads.fastq -2 rev_reads.fastq -o out_directory`. Conveniently, spades also has a flag specifically for assembling SARS-CoV-2 genomes, `--corona`, and we will use this flag in our assembly run as well.

In [None]:
%%bash

# Have a look at the manual for spades using the --help flag
spades.py --help

In [None]:
%%bash

# Assemble the genome - this takes ~ 15 minutes
spades.py --corona -1 assembly_test/SRR18241034_1.fastq -2 assembly_test/SRR18241034_2.fastq -o assembly_test

## Annotating the Assembly (Prokka)
----------

Prokka is a tool that leverages multiple existing annotation tools and runs them in parallel to rapidly annotate features of interest in microbial genomes.

**Tools leveraged by Prokka and the features they predict.**

| Tool | Features predicted |
| --- | --- |
| Prodigal | Coding sequences |
| RNAmmer | Ribosomal RNA genes (rRNA) |
| Aragorn | Transfer RNA genes (tRNA) |
| SignalP | Signal leader peptides |
| Infernal | Non-coding RNA |


After annotation, prokka returns fasta files containing both protein (.faa) and nucleotide (.fna) sequences, as well as an annotation file (.gff) of the annotated features. We have not used this software in these lessons, so try to use the manual to determine which flags might be helpful to include in your annotation.
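
Once you have a .gff file, a quick way to get a feel for it is to count how many features of each type were annotated. The sketch below is a minimal example that assumes the standard GFF3 layout of nine tab-separated columns with the feature type in the third column, and a `##FASTA` marker separating annotations from any embedded sequence. The example records, contig name, and IDs are invented for illustration and are not real prokka output.

```python
from collections import Counter


def count_feature_types(gff_text):
    """Count the values in column 3 (feature type) of GFF3 feature lines."""
    counts = Counter()
    for line in gff_text.splitlines():
        if line.startswith("##FASTA"):
            break  # everything after this marker is sequence, not annotation
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.split("\t")
        if len(fields) == 9:
            counts[fields[2]] += 1
    return counts


# Invented records used only to demonstrate the parsing.
example_gff = "\n".join([
    "##gff-version 3",
    "contig_1\texample\tCDS\t266\t13468\t.\t+\t0\tID=demo_00001",
    "contig_1\texample\tCDS\t21563\t25384\t.\t+\t0\tID=demo_00002",
    "contig_1\texample\ttRNA\t100\t170\t.\t-\t.\tID=demo_00003",
    "##FASTA",
    ">contig_1",
    "ATGC",
])

print(count_feature_types(example_gff))
```

To try this on your own annotation, you could read the file contents with `open(path).read()` and pass the resulting string to `count_feature_types`.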
 


<div class="alert alert-block alert-info">
    <i class="fa fa-lightbulb-o" aria-hidden="true"></i>
    <b>Tip: </b> Remember you can add a new cell in your Jupyter notebook with the raw format to take notes.
</div>

<div class="alert alert-block alert-warning">
    <i class="fa fa-question-circle-o" aria-hidden="true"></i>
    <b>TEST YOUR SKILLS</b> 
      <p>Practice your skills in the code block below</p>
    <div style="background-color: white ; color:black; padding: 3px;">Annotate your assembly with the prokka tool.<br><br>HINT: This is not a command you've used before, but you can use prokka --help to determine the syntax of the command you will need to run. <br><br>All of the software was installed to the base environment, so there is no need to switch conda environments. <br><br>Use the flags to give both the output directory and the annotation files a name (we suggest reusing the assembly_test directory as the output directory and using SRR18241034 as the prefix for your annotation).<br><br> Run the #FLASHCARD code block to see the answer.</div>
    
</div>

In [None]:
%%bash

## TEST YOUR SKILLS (enter and run your answers here)

prokka --help
# Annotate your assembly


<div class="alert alert-block alert-danger">
    <i class="fa fa-exclamation-circle" aria-hidden="true"></i>
    <b>Alert: </b>  There will be some warnings when running prokka, for example that the contig names are too long or that the output directory already exists. The messages suggest which flags to add to resolve these warnings.
</div>

In [None]:
# FLASHCARD
from IPython.display import IFrame
IFrame("quiz_files/quiz6-2.html", width=600, height=250)

## Storing your Finalized Dataset
---------
Once your analysis is complete, you will want to store the final output files in a location that can be easily shared with collaborators. To do this we will create an Amazon S3 bucket and copy the final set of files into it.

In [None]:
%%bash
# The mb option stands for MAKE BUCKET
# Syntax: aws s3 mb s3://NEW_BUCKET_NAME

# Create a bucket (add your new bucket name after s3://)
aws s3 mb s3://

In [None]:
%%bash
# Rename the assembly to something that makes sense (here, the SRA run accession)
mv assembly_test/scaffolds.fasta assembly_test/SRR18241034.fasta

# Copy the final assembly file into your Amazon S3 bucket
aws s3 cp assembly_test/SRR18241034.fasta s3://#Enter_previously_created_bucket_name

<div class="alert alert-block alert-info">
    <i class="fa fa-lightbulb-o" aria-hidden="true"></i>
    <b>Tip: </b>  Make sure that your bucket name follows the criteria below, otherwise the command will end in an <b>error</b>:
    <ul>
      <li>Can contain lowercase letters and numbers, and must be a unique name</li>
      <li>Does not contain any spaces </li>
      <li>Does not contain any uppercase letters</li>
      <li>The only special characters it can contain, if you choose to use them, are dots (.) and dashes (-); underscores (_) are not allowed</li>
      <li>Does not contain 'amazon', 'aws' or the prefix 's3'</li>
    </ul>  
</div>
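
If you would like to check a candidate name before running `aws s3 mb`, a rough local check can be sketched in Python. This is a partial sketch under stated assumptions, not an official validator: it assumes the general purpose bucket rules as documented by AWS (3 to 63 characters; lowercase letters, numbers, dots, and dashes only; starting and ending with a letter or number; no two adjacent dots). The function name and example names are hypothetical, and it cannot tell you whether a name is already taken.

```python
import re

# First and last characters must be a lowercase letter or digit; the middle
# may also contain dots and dashes. Total length is 3 to 63 characters.
_BUCKET_PATTERN = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")


def looks_like_valid_bucket_name(name):
    """Rough local check of an S3 bucket name; does not check uniqueness."""
    return _BUCKET_PATTERN.fullmatch(name) is not None and ".." not in name


# Hypothetical candidate names used only to demonstrate the check.
for candidate in ["my-assembly-results-2022", "My_Bucket", "has space", "ab"]:
    print(candidate, looks_like_valid_bucket_name(candidate))
```

A name that passes this check can still be rejected by AWS, for instance because another account already owns it, so treat the result as a first filter only.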

<div class="alert alert-block alert-warning">
    <i class="fa fa-question-circle-o" aria-hidden="true"></i>
    <b>TEST YOUR SKILLS</b> 
      <p>Practice your skills in the code block below</p>
    <div style="background-color: white ; color:black; padding: 3px;">Move the QC report from fastqc into the Amazon S3 bucket you just created.<br><br> Run the #FLASHCARD code block to see the answer.</div>
    
</div>

In [None]:
%%bash

## TEST YOUR SKILLS (enter and run your answers here)
# Move the QC report from fastqc into the Amazon S3 bucket you just created.


In [None]:
# FLASHCARD
from IPython.display import IFrame
IFrame("quiz_files/quiz6-3.html", width=600, height=250)

<div class="alert alert-block alert-warning">
    <i class="fa fa-question-circle-o" aria-hidden="true"></i>
    <b>TEST YOUR SKILLS</b> 
      <p>Practice your skills in the code block below</p>
    <div style="background-color: white ; color:black; padding: 3px;">Move the gff file from prokka into your Amazon S3 bucket.<br><br> Run the #FLASHCARD code block to see the answer.</div>
    
</div>

In [None]:
%%bash

## TEST YOUR SKILLS (enter and run your answers here)
# Move the gff file from prokka into your Amazon S3 bucket.


In [None]:
# FLASHCARD
from IPython.display import IFrame
IFrame("quiz_files/quiz6-4.html", width=600, height=250)