Nanopore DNA sequencing is portable and relatively cheap, which makes real-time sequencing in the field possible. We see the potential to use nanopore sequencing as an accessible educational experience. With a clear pipeline that Just Works(TM), a citizen scientist could do WIMP (What's in my pot?) analysis on their own samples without needing any external tools. Undergraduate or high school students could follow the steps of the pipeline to learn the basics of genome assembly.

# Why Nanopore is amazing
* Long reads, cheap, portable
* Comparison to the current standard (Illumina)
* Used for detecting Ebola/Zika?  Sent to space
* Sequencing in the jungle (tweet below!)
* Idea we care about: you should be able to take a random sample of stuff (ocean water?  dirt?) and sequence it cheaply and easily to find out what's there

To see a video of nanopore sequencing in the jungle, click on the block of code below and then click the "Run" button at the top of this page.

In [23]:
%%html
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Welcome to my laboratory :)<br><br>Sequencing long ribosomal cluster from plants, insects &amp; fungi in real-time in the Amazon rainforest. Within a few mins of <a href="https://twitter.com/nanopore?ref_src=twsrc%5Etfw">@nanopore</a> data generated, performed BLAST &amp; got correct hits! Dual indexing looks great for pooling many samples<a href="https://twitter.com/hashtag/junglegenomics?src=hash&amp;ref_src=twsrc%5Etfw">#junglegenomics</a> <a href="https://t.co/UQVjYfmU8U">pic.twitter.com/UQVjYfmU8U</a></p>&mdash; Aaron Pomerantz (@AaronPomerantz) <a href="https://twitter.com/AaronPomerantz/status/980873273348038656?ref_src=twsrc%5Etfw">April 2, 2018</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

# How Nanopore actually works
Insert a good paragraph description here:

My current understanding: a DNA strand is pulled through a pore, and the current across the pore changes depending on the nucleotides passing through it. You end up with an electrical signal stored in fast5 format that can be converted to a fasta file (it could be fastq, but no one trusts the quality scores?)

Poretools would fit in well here: show what the raw signal looks like, run poretools, and show the sequence we get out (something like the notebook here: http://nbviewer.jupyter.org/github/arq5x/poretools/blob/master/poretools/ipynb/test_run_report.ipynb)
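
Since fastq quality scores come up above, here is a rough sketch of what they encode. It assumes the standard Phred+33 encoding, where each base gets one character and the error probability is 10^(-Q/10). The function name and the example quality string are made up for illustration and do not come from our data.

```python
def phred_to_error_prob(quality_char, offset=33):
    """Convert one fastq quality character to an estimated error probability.

    Assumes Phred+33 encoding (offset=33), which is an assumption about the file.
    """
    q = ord(quality_char) - offset
    return 10 ** (-q / 10)

# Hypothetical quality string for a 5-base read (not real data).
example_quality = "I5+!?"
for ch in example_quality:
    print(ch, ord(ch) - 33, round(phred_to_error_prob(ch), 4))
```

A higher score means the basecaller claims to be more confident in that base. Whether that claim can be trusted for nanopore data is a separate question.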

## Poretools
The raw data we get from nanopore sequencing is in the fast5 format.  This is just a series of current values that were read across the pore as the DNA strand passed through it.

We are going to start by looking at this fast5 data, containing current values, and converting it to a fasta file that contains nucleotides. 

This poretools tutorial is adapted from here: http://nbviewer.jupyter.org/github/arq5x/poretools/blob/master/poretools/ipynb/test_run_report.ipynb

First we're going to find our fast5 files.  Our sample fast5 file is in the "data" folder, so we set the variable dataDirectory to "data/".  If you are using your own data, change dataDirectory to the path to your .fast5 files.

In [2]:
# dataDirectory is the path to our fast5 file.
# If you are using your own data, change dataDirectory to the path to your .fast5 files.
dataDirectory = 'data/'

# Print the number of fast5 files in the dataDirectory.
# Click the "Run" button at the top of this page to run this code.
!find $dataDirectory -maxdepth 1 -name "*.fast5" | wc -l

2
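
If you would rather stay in Python than use shell commands, the same count can be sketched with the standard library. This is a minimal sketch: the function name `count_fast5` is ours, and it only looks at files directly inside the folder, like `-maxdepth 1` does.

```python
from pathlib import Path

def count_fast5(directory):
    """Count .fast5 files directly inside directory (subfolders are ignored)."""
    folder = Path(directory)
    if not folder.is_dir():
        return 0
    return sum(1 for p in folder.glob("*.fast5") if p.is_file())

# Same folder as the dataDirectory variable used above.
print(count_fast5("data/"))
```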


Poretools has a number of subcommands.  Let's start with the stats command, which gives us some basic statistics about our reads.

In [3]:
# The -q option stops poretools from printing warning messages.
!poretools stats -q $dataDirectory

  from ._conv import register_converters as _register_converters
total reads	6
total base pairs	25217
mean	4202.83
median	4205
min	2940
max	5826
N25	5079
N50	5011
N75	3399


Our sample data has 6 reads and 25,217 base pairs. (Anything else of interest to say about this info?)

Each fast5 file here contributes 3 reads because the forward (template) strand, the reverse (complement) strand, and the two-directional (2D) consensus read are all counted separately. We can see the stats for each of these read types using the code below.

In [4]:
# Look at stats for forward strands
!poretools stats -q --type fwd $dataDirectory

  from ._conv import register_converters as _register_converters
total reads	2
total base pairs	8019
mean	4009.50
median	4009
min	2940
max	5079
N25	5079
N50	5079
N75	2940


In [5]:
# Look at stats for reverse strands
!poretools stats -q --type rev $dataDirectory

  from ._conv import register_converters as _register_converters
total reads	2
total base pairs	7973
mean	3986.50
median	3986
min	2962
max	5011
N25	5011
N50	5011
N75	2962


In [6]:
# Look at stats for two-directional (2D) reads
!poretools stats -q --type 2D $dataDirectory

  from ._conv import register_converters as _register_converters
total reads	2
total base pairs	9225
mean	4612.50
median	4612
min	3399
max	5826
N25	5826
N50	5826
N75	3399
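
The N25, N50, and N75 rows are probably the least familiar part of these stats. As a rough sketch, assuming the usual definition (N50 is the read length L such that reads of length L or longer contain at least half of all bases), we can recompute them by hand. The six read lengths below are inferred from the min and max rows of the fwd, rev, and 2D stats above, and the function name `n_stat` is ours, not part of poretools.

```python
def n_stat(lengths, percent):
    """Return the length L such that reads of length >= L contain
    at least `percent` percent of all bases."""
    threshold = sum(lengths) * percent / 100
    running_total = 0
    for length in sorted(lengths, reverse=True):
        running_total += length
        if running_total >= threshold:
            return length
    return 0  # only reached for an empty list

# Read lengths inferred from the min/max rows of the per-type stats above.
read_lengths = [2940, 5079, 2962, 5011, 3399, 5826]

for percent in (25, 50, 75):
    print("N%d" % percent, n_stat(read_lengths, percent))
```

For our sample data this gives 5079, 5011, and 3399, which agrees with the N25, N50, and N75 values poretools reported for all six reads.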


At some point we hope to add a squiggle plot here, showing how the current changes over time and giving a good idea of what the raw signal actually looks like.

In [7]:
# Add squiggle plot here!!!



Now we are going to convert our fast5 files to fasta.  Fasta is a common format for storing DNA sequences.  The code below extracts the sequences from all of the fast5 files in dataDirectory and writes them to a single fasta file in a folder called fastaOutput.

In [8]:
# Make a folder to store our fasta files in.
# The -p option avoids an error if the folder already exists.
!mkdir -p fastaOutput

# Convert our fast5 files to fasta.
!poretools fasta $dataDirectory > fastaOutput/outputPoretoolData.fasta

  from ._conv import register_converters as _register_converters


We can look at the first few lines of this fasta file to see what's in it.  Each sequence has a header line starting with ">" followed by a unique identifier, and then a line containing the nucleotide sequence.

In [13]:
# This will show us the first 200 characters of the first two lines of our file.
# We don't want to look at the whole sequence because it's going to be really long!
!cut -c -200 fastaOutput/outputPoretoolData.fasta | head -2

>b233f432-7786-4b0b-8b2d-03c2e168a45b_Basecall_2D_2d CPHG_CNU4299G4G_louse_library_2016_3_4_3507_1_ch120_read240_strand data/2016_3_4_3507_1_ch120_read240_strand.fast5
GAAATTGCTCCCGCTCTCAGTTCTGCTTTAACAGATAAATTAATAACATATCAATAAAGCATCAAAATCACGTGATTGGAACGCCGTACTTCGAAGAGGAGGATGGAGACGAGGATGGGAGCAGAGGGGAGGATGTGCACTTCTCCCCACGTCAGTTGGGATTCGAAGGAAGTTTGCGGCTTGTTTTAGAGTGGAGGACA
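
To work with these records in Python rather than just peeking at them, a small parser is enough. This is a minimal sketch using only the standard library. The function name `read_fasta` is ours, and it assumes the plain layout described above (a ">" header line followed by one or more sequence lines).

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of fasta lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

# Usage with the file created above (uncomment once it exists):
# with open("fastaOutput/outputPoretoolData.fasta") as handle:
#     for header, sequence in read_fasta(handle):
#         print(header.split()[0], len(sequence))
```

Printing only the identifier and the sequence length keeps the output short, for the same reason we used cut above.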


This fasta file containing our raw reads will be the input to the next steps in our pipeline.

# Clustering reads before assembly?
Are we going to use Mash?  Does Canu just do this for us?

# Assembling our reads
Assuming that we get Canu to work, an explanation of running Canu goes here.

If not, minimap and then miniasm: minimap finds the overlaps between reads, and miniasm assembles the reads using those overlaps.

General idea: all vs. all means trying to align each read to all the others; the overlapping reads are then merged into contigs (miniasm gave us one really long contig?)
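
To make the all vs. all idea concrete, here is a toy sketch that compares every ordered pair of reads and records where the end of one read exactly matches the start of another. This is only an illustration: the reads are made up, the function names are ours, and real tools such as minimap use approximate matching because nanopore reads contain errors.

```python
from itertools import permutations

def suffix_prefix_overlap(a, b, min_length=3):
    """Length of the longest suffix of a that exactly matches a prefix of b."""
    for size in range(min(len(a), len(b)), min_length - 1, -1):
        if a[-size:] == b[:size]:
            return size
    return 0

def all_vs_all(reads, min_length=3):
    """Compare every ordered pair of reads and keep the pairs that overlap."""
    overlaps = {}
    for (i, a), (j, b) in permutations(enumerate(reads), 2):
        size = suffix_prefix_overlap(a, b, min_length)
        if size:
            overlaps[(i, j)] = size
    return overlaps

# Made-up toy reads, far shorter than real nanopore reads.
toy_reads = ["ATTGCCGA", "CCGATTAC", "TTACGGA"]
print(all_vs_all(toy_reads))
```

Here read 0 overlaps read 1 by 4 bases and read 1 overlaps read 2 by 4 bases, so the three reads chain together into a single contig. An assembler builds that kind of chain at a much larger scale.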

For documentation stuff, it would be really good to put something about using IGV/looking at the actual read pileup here.

# What's in our sample?

BLAST on our contig(s) except that it hasn't actually worked yet.  Want an output that shows species.

MetaGeneMark or other way to show genes and read counts, some nice way to visualize/explore this information (I'm not convinced that this is the best software for this yet)

Displaying pileups/genes in genome browser at the end?