## IMPORTANT: Change the dataDirectory variable to the path to your *.fast5 files

In [1]:
# dataDirectory='~/Users/...'

dataDirectory='data/'

### This will print the number of FAST5 files in the dataDirectory.

- Poretools has a number of different command line options 
- Running poretools with no parameters gives us a brief list (and complies with Torsten's first rule)

In [2]:
!find $dataDirectory -maxdepth 1 -name "*.fast5" | wc -l

2


## What are the numbers?
### Let's start with a simple one: the stats command, which gives us some basic statistics about our reads.

The -q option stops poretools outputting any warning messages.

In [3]:
!poretools stats -q $dataDirectory

  from ._conv import register_converters as _register_converters
total reads	6
total base pairs	25217
mean	4202.83
median	4205
min	2940
max	5826
N25	5079
N50	5011
N75	3399
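
N25, N50 and N75 are length-weighted statistics: the N50 is the read length at which reads of that length or longer contain at least half of all bases. The sketch below is a minimal illustration of that standard definition, not poretools' own code. The function name `nxx` is made up for this example, and the six read lengths are inferred from the min and max values that the per-type runs in this notebook report.

```python
def nxx(lengths, fraction):
    """Return the length L such that reads of length >= L hold at least
    `fraction` of all bases (fraction=0.5 gives the N50)."""
    threshold = sum(lengths) * fraction
    running_total = 0
    # Walk from the longest read down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running_total += length
        if running_total >= threshold:
            return length
    raise ValueError("lengths must not be empty")


# Read lengths inferred from the fwd, rev and 2D statistics in this notebook.
read_lengths = [2940, 5079, 2962, 5011, 3399, 5826]

n25 = nxx(read_lengths, 0.25)
n50 = nxx(read_lengths, 0.50)
n75 = nxx(read_lengths, 0.75)
print(n25, n50, n75)
```

With these lengths the three values match the N25, N50 and N75 lines that poretools printed above.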


### Forward reads

###### Forward, reverse, and two-direction (2D) reads are all counted separately.

In [48]:
!poretools stats -q --type fwd $dataDirectory

  from ._conv import register_converters as _register_converters
total reads	2
total base pairs	8019
mean	4009.50
median	4009
min	2940
max	5079
N25	5079
N50	5079
N75	2940


### Reverse reads



In [49]:
!poretools stats -q --type rev $dataDirectory

  from ._conv import register_converters as _register_converters
total reads	2
total base pairs	7973
mean	3986.50
median	3986
min	2962
max	5011
N25	5011
N50	5011
N75	2962


### Two-direction (2D) reads



In [50]:
!poretools stats -q --type 2D $dataDirectory

  from ._conv import register_converters as _register_converters
total reads	2
total base pairs	9225
mean	4612.50
median	4612
min	3399
max	5826
N25	5826
N50	5826
N75	3399


# Convert to FASTA

- Stores the FASTA output in a folder named fastaOutput

In [51]:
#!poretools fasta $dataDirectory > fastaOutput/nameOfFile.fasta

In [52]:
!mkdir -p fastaOutput
!poretools fasta $dataDirectory > fastaOutput/outputPoretoolData.fasta

  from ._conv import register_converters as _register_converters


## Congratulations! You have made a FASTA file out of your raw data!
Next Step: Using the **mash** tool to clump together sequences that are close for more efficient genome assembly.


# What is mash?
- Fast metagenome distance estimation using MinHash
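
MinHash estimates how similar two sequences are by comparing only a small "sketch" of hashed k-mers instead of the full k-mer sets. The code below is a toy illustration of that idea, not Mash's implementation: it uses `hashlib.sha1` as the hash function, ignores reverse complements, and returns a Jaccard estimate rather than a Mash distance. The function names, the sequences, `k=4` and `sketch_size=8` are all assumptions chosen for the example.

```python
import hashlib


def kmers(sequence, k):
    """Return the set of all k-length substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def bottom_sketch(sequence, k, sketch_size):
    """Hash every k-mer and keep the `sketch_size` smallest hash values."""
    hashes = {
        int(hashlib.sha1(kmer.encode()).hexdigest(), 16)
        for kmer in kmers(sequence, k)
    }
    return sorted(hashes)[:sketch_size]


def jaccard_estimate(sketch_a, sketch_b, sketch_size):
    """Estimate Jaccard similarity from two bottom sketches."""
    # Take the smallest hashes of the combined sketches...
    merged = sorted(set(sketch_a) | set(sketch_b))[:sketch_size]
    if not merged:
        return 0.0
    # ...and count how many of them occur in both sketches.
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


# Made-up sequences for illustration only.
seq_a = "ACGTACGTTGCAACGTTGCA"
seq_b = "ACGTACGTTGCAACGTAGCA"
sketch_a = bottom_sketch(seq_a, k=4, sketch_size=8)
sketch_b = bottom_sketch(seq_b, k=4, sketch_size=8)
similarity = jaccard_estimate(sketch_a, sketch_b, sketch_size=8)
print(similarity)
```

Identical sequences give an estimate of 1.0 and sequences with no k-mers in common give 0.0. Real tools use much larger k and sketch sizes.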

# Using MetaGeneMark to Describe Gene Information
- MetaGeneMark takes the FASTA file of the assembled genome
- It predicts which genes are in the genome


genomeOutputAssembled.fa - input file from Canu

In [19]:
!pwd
!ls /work/MetaGeneMark_linux_64/mgm

/work/data
gmhmmp	INSTALL  LICENSE  MetaGeneMark_v1.mod  README.MetaGeneMark


## This will obtain a GFF file

In [123]:
!gmhmmp -a -r -f G -d -m ../MetaGeneMark_linux_64/mgm/MetaGeneMark_v1.mod -o data/sequence.gff assembly.fa


### Using this gff file, we can learn what genes are in your sample!

A GFF file has the columns: 
- seqname - name of the chromosome or scaffold
- source - name of the program that generated this feature: GeneMark.hmm
- feature - feature type name, e.g. Gene, Variation, Similarity, or CDS
- start - Start position of the feature, with sequence numbering starting at 1.
- end - End position of the feature, with sequence numbering starting at 1.
- score - A floating point value.
- strand - defined as + (forward) or - (reverse).
- frame - One of '0', '1' or '2'. '0' indicates that the first base of the feature is the first base of a codon, '1' that the second base is the first base of a codon, and so on.
- attribute - A semicolon-separated list of tag-value pairs, providing additional information about each feature.
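
To make the column layout concrete, here is a minimal sketch that splits tab-separated GFF lines into named fields. It uses only the standard library. The names `GffFeature` and `parse_gff` are made up for this example, and the two data rows are copied from the MetaGeneMark output shown in this notebook.

```python
from collections import namedtuple

GffFeature = namedtuple(
    "GffFeature",
    "seqname source feature start end score strand frame attribute",
)


def parse_gff(text):
    """Yield one GffFeature per data line, skipping comments and blank lines."""
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # not a complete nine-column row
        seqname, source, feature, start, end, score, strand, frame, attribute = fields
        yield GffFeature(
            seqname, source, feature, int(start), int(end),
            float(score) if score != "." else None,  # "." means no score
            strand, frame, attribute,
        )


example = (
    "##gff-version 2\n"
    "utg000001l\tGeneMark.hmm\tCDS\t1984\t2214\t0.337658\t+\t0\tgene_id=2\n"
    "utg000001l\tGeneMark.hmm\tCDS\t11382\t11618\t0.812308\t-\t0\tgene_id=6\n"
)
features = list(parse_gff(example))
for f in features:
    # Both coordinates are 1-based and inclusive, so length is end - start + 1.
    print(f.seqname, f.feature, f.strand, f.end - f.start + 1)
```

Because the parser skips every line that starts with `#`, it can read the GFF file without the header lines being removed first.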

### Quick look at GFF file

In [120]:
!head -20 data/sequence.gff

##gff-version 2
##source-version GeneMark.hmm_PROKARYOTIC 3.38
##date: Wed Apr  4 20:04:08 2018
# Sequence file name: data/asm.fa
# Model file name: ../MetaGeneMark_linux_64/mgm/MetaGeneMark_v1.mod
# RBS: true

# Model information: Heuristic_model_for_genetic_code_11_and_GC_51

utg000001l	GeneMark.hmm	CDS	1	72	2.248856	+	0	gene_id=1
utg000001l	GeneMark.hmm	CDS	1984	2214	0.337658	+	0	gene_id=2
utg000001l	GeneMark.hmm	CDS	2289	2411	-2.242110	+	0	gene_id=3
utg000001l	GeneMark.hmm	CDS	3078	3404	-8.565622	+	0	gene_id=4
utg000001l	GeneMark.hmm	CDS	3420	3665	5.495138	+	0	gene_id=5
utg000001l	GeneMark.hmm	CDS	11382	11618	0.812308	-	0	gene_id=6
utg000001l	GeneMark.hmm	CDS	12156	12431	-0.839825	-	0	gene_id=7
utg000001l	GeneMark.hmm	CDS	12557	12790	2.258979	+	0	gene_id=8
utg000001l	GeneMark.hmm	CDS	13044	13190	6.278825	+	0	gene_id=9
utg000001l	GeneMark.hmm	CDS	15513	15620	-1.745413	-	0	gene_id=10
utg000001l	GeneMark.hmm	CDS	15624	15899	-8.697843	-	0	gene_id=11


### Goal: Get FASTA sequences for every gene listed in the GFF file

In [124]:
!sed -i -e 1,9d data/sequence.gff

In [180]:
!head -20 data/sequence.gff
!wc -l data/sequence.gff

utg000001l	GeneMark.hmm	CDS	1	72	2.248856	+	0	gene_id=1
utg000001l	GeneMark.hmm	CDS	1984	2214	0.337658	+	0	gene_id=2
utg000001l	GeneMark.hmm	CDS	2289	2411	-2.242110	+	0	gene_id=3
utg000001l	GeneMark.hmm	CDS	3078	3404	-8.565622	+	0	gene_id=4
utg000001l	GeneMark.hmm	CDS	3420	3665	5.495138	+	0	gene_id=5
utg000001l	GeneMark.hmm	CDS	11382	11618	0.812308	-	0	gene_id=6
utg000001l	GeneMark.hmm	CDS	12156	12431	-0.839825	-	0	gene_id=7
utg000001l	GeneMark.hmm	CDS	12557	12790	2.258979	+	0	gene_id=8
utg000001l	GeneMark.hmm	CDS	13044	13190	6.278825	+	0	gene_id=9
utg000001l	GeneMark.hmm	CDS	15513	15620	-1.745413	-	0	gene_id=10
utg000001l	GeneMark.hmm	CDS	15624	15899	-8.697843	-	0	gene_id=11
utg000001l	GeneMark.hmm	CDS	16628	16783	-4.989063	+	0	gene_id=12
utg000001l	GeneMark.hmm	CDS	17340	17516	-4.470129	-	0	gene_id=13
utg000001l	GeneMark.hmm	CDS	20076	20378	-6.331264	+	0	gene_id=14
utg000001l	GeneMark.hmm	CDS	21634	21828	4.955708	+	0	gene_id=15
utg000001l	GeneMark.hmm	CDS	23066	23227	0.340974	-	0	gen

In [147]:
!pip3 install biopython

Collecting biopython
  Downloading biopython-1.71-cp35-cp35m-manylinux1_x86_64.whl (2.0MB)
Collecting numpy (from biopython)
  Downloading numpy-1.14.2-cp35-cp35m-manylinux1_x86_64.whl (12.1MB)
Installing collected packages: numpy, biopython
Successfully installed biopython-1.71 numpy-1.14.2
You are using pip version 8.1.1, however version 9.0.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


In [199]:
import csv

from Bio import SeqIO

# Collect (contig, name, start, end, strand) for every feature row in the GFF.
features = []
with open("data/sequence.gff") as tsv:
    for row in csv.reader(tsv, delimiter="\t"):
        # Skip blank lines, comment lines, and incomplete rows.
        if len(row) < 9 or row[0].startswith("#"):
            continue
        contig, feature_type, strand = row[0], row[2], row[6]
        start, end = int(row[3]), int(row[4])
        name = feature_type + str(start) + "-" + str(end)  # e.g. CDS1984-2214
        features.append((contig, name, start, end, strand))

with open("data/annotatedGene.fa", "w") as output_handle:
    for record in SeqIO.parse("data/asm.fa", "fasta"):
        print("This is the header for your assembly fasta: " + record.id)
        for contig, name, start, end, strand in features:
            # Only cut features out of the contig they were predicted on.
            if contig != record.id:
                continue
            # GFF coordinates are 1-based and inclusive; Python slices are 0-based and half-open.
            seq = record.seq[start - 1:end]
            # Genes on the reverse strand read as the reverse complement.
            if strand == "-":
                seq = seq.reverse_complement()
            output_handle.write(">" + name + "\n%s\n" % seq)

This is the header for your assembly fasta: utg000001l


### Annotated FASTA file is in data/annotatedGene.fa
This created a FASTA file with every feature in the GFF, but there are a lot!
Let's get the biggest.
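
One possible way to pick out the biggest entries is to read the FASTA records and sort them by sequence length. The sketch below uses only the standard library. The function names `read_fasta` and `longest_records` and the toy records are made up for illustration; on real data you would pass in the contents of data/annotatedGene.fa instead.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def longest_records(records, n):
    """Return the n records with the longest sequences, longest first."""
    return sorted(records, key=lambda record: len(record[1]), reverse=True)[:n]


# Toy records for illustration only.
toy_fasta = ">geneA\nATGAAA\n>geneB\nATGAAACCCGGG\nTTT\n>geneC\nATG\n"
top_two = longest_records(read_fasta(toy_fasta), 2)
print(top_two)
```

Sequences that wrap over several lines, like geneB here, are joined before their length is measured.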

In [22]:
!head data/asm.fa

>utg000001l
TTGCTCGGTTTTATTACTTTAGGCATTTATACTCCGCTGGAAGCGTGTGACCTGCTCAAAATAATTGCATGAGTTGCCCA
TCGATTGTAAGCTCTATTGAGCACTGCTCATTAATATACTTCTGGGTTCCTCAGTTCCAGTTGTTTTGCATAGTGATCAG
CCTCTCTGAGGGTGAAATAATCCCGTTCAGCGGTGTCTGCCAGTCGGGGGGAGGCTGCATTATCACGCCGGAGGCTGCGG
CTTCACGCATGACTGACAGACTGCTTTGATGTGCAACCGACGACCAAGAGCGGCAGCAACATCATCACGCAGAGCATCAT
TTTCAGCTTCGCATCAGCTAACTCCTTCGTGTATTTTGCAGCGACGCAGCAACATCACGCTGACGCATCTGCATGTCAGT
AATTGCCGCGTTCGCTAGCTTTTGCCAGTTCTCTCTGGCATTTTGTCGCCTGGACTTTGTAGGCGATTGCGTTATCACAC
GGTAATGATTGACCGCCCATGACAGGCTGACGATGATGCAGATAATCAGAGCGGATATAATCGCGGTTACTCTGCTCACT
GTTGCCCCCACAAACAGACTTCACGCTCAATCTCACGACGAGTCATCAGGCCTTTCCCATTATTGCTTACCGCCAGCGTA
TGTCCAGCGACGCAGCTGATGGATGCGCCTTTGATATCGCCCTGGTTTATTTTGCGAAGAAGCGTCGATGTTCTAAATTG


### What can we do with this?