## IMPORTANT: Change the dataDirectory variable to the path to your *.fast5 files

In [None]:
# dataDirectory='~/Users/...'

dataDirectory='data/'

### This will print the number of FAST5 files in the dataDirectory.

- Poretools has a number of different command line options 
- Running poretools with no parameters gives us a brief list (and complies with Torsten's first rule)

In [None]:
!find $dataDirectory -maxdepth 1 -name "*.fast5" | wc -l
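If you would rather stay in Python than shell out to `find`, the same count can be sketched with `pathlib`. The helper name `count_fast5` is made up for this example, and like `-maxdepth 1` it only looks at the top level of the directory.

```python
from pathlib import Path

def count_fast5(directory):
    """Count *.fast5 files directly inside `directory` (no recursion)."""
    folder = Path(directory).expanduser()  # expand a leading ~ if present
    return sum(1 for p in folder.glob("*.fast5") if p.is_file())
```

Calling `count_fast5(dataDirectory)` in the notebook should then give the same number as the `find ... | wc -l` cell above.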

## What are the numbers?
### Let's start with a simple one: the stats command gives us some basic statistics about our reads.

The -q option stops poretools from printing warning messages.

In [None]:
!poretools stats -q $dataDirectory
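To get a feel for what read statistics like these mean, here is a minimal sketch that computes similar numbers from a plain list of read lengths. It is not poretools' own code: the function names and dictionary keys are made up for this example, and poretools may define some values slightly differently.

```python
from statistics import mean, median

def n50(lengths):
    """Length L such that reads of length >= L contain at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # only reached for an empty list

def read_stats(lengths):
    """Summarise a non-empty list of read lengths."""
    return {
        "total reads": len(lengths),
        "total base pairs": sum(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "N50": n50(lengths),
    }

print(read_stats([100, 200, 300]))
```

N50 is usually more informative than the mean for long-read data, because a handful of very short reads can drag the mean down without changing where most of the bases sit.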

### Directional reads in forward 

###### Forward, reverse, and two-direction reads are all counted separately.

In [None]:
!poretools stats -q --type fwd $dataDirectory

### Directional reads in reverse 



In [None]:
!poretools stats -q --type rev $dataDirectory

### Two-direction reads 



In [None]:
!poretools stats -q --type 2D $dataDirectory

# Convert to FASTA

- Stores the FASTA output in a folder named fastaOutput

In [None]:
#!poretools fasta $dataDirectory > fastaOutput/nameOfFile.fasta

In [None]:
!mkdir -p fastaOutput
!poretools fasta $dataDirectory > fastaOutput/outputPoretoolData.fasta
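A quick sanity check on the new FASTA file is to count the records and look at their lengths. The sketch below uses only the standard library; `fasta_lengths` is a made-up helper name, and it assumes every header in the file is unique.

```python
def fasta_lengths(path):
    """Return {header: sequence length} for each record in a FASTA file."""
    lengths = {}
    header = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                header = line[1:]      # drop the leading '>'
                lengths[header] = 0
            elif header is not None:
                lengths[header] += len(line)  # sequences may span several lines
    return lengths
```

For example, `len(fasta_lengths("fastaOutput/outputPoretoolData.fasta"))` gives the number of reads that were written.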

## Congratulations! You have made a FASTA file out of your raw data!
Next Step: Using the **mash** tool to clump together sequences that are close for more efficient genome assembly.


# What is mash?
- Fast metagenome distance estimation using MinHash
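To make the MinHash idea concrete, here is a toy sketch of it. It is not mash's implementation (mash uses a different hash function and canonical k-mers, among other things); the function names, the use of SHA-1, and the tiny default parameters are all choices made for this example. The idea is to keep only the smallest k-mer hash values of each sequence and estimate the Jaccard similarity from those small sketches. The last function applies the formula Mash uses to turn a Jaccard estimate into a distance.

```python
import hashlib
import math

def kmers(seq, k):
    """Set of all overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def sketch(seq, k=5, size=100):
    """Bottom-`size` MinHash sketch: the `size` smallest k-mer hash values."""
    hashes = {
        int.from_bytes(hashlib.sha1(km.encode()).digest()[:8], "big")
        for km in kmers(seq, k)
    }
    return sorted(hashes)[:size]

def jaccard_estimate(sketch_a, sketch_b, size=100):
    """Estimate Jaccard similarity from two bottom-`size` sketches."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)

def mash_distance(j, k):
    """Convert a Jaccard estimate j into a Mash-style distance for k-mer size k."""
    if j == 0:
        return 1.0
    return -math.log(2 * j / (1 + j)) / k

a = sketch("ACGTACGTTGCAAGCTTAGGCTA")
b = sketch("ACGTACGTTGCAAGCTTAGGCTT")
print(jaccard_estimate(a, b))
```

Because only the sketches are compared, the cost no longer depends on how long the sequences are, which is what makes this approach fast on large read sets.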

# Using MetaGeneMark to Describe Gene Information
- MetaGeneMark takes the FASTA file of the assembled genome
- and tells you what genes are in the genome


genomeOutputAssembled.fa - input file from Canu

In [None]:
!pwd
!ls /work/MetaGeneMark_linux_64/mgm

## This will obtain a GFF file

In [None]:
!gmhmmp -a -r -f G -d -m ../MetaGeneMark_linux_64/mgm/MetaGeneMark_v1.mod -o data/sequence.gff assembly.fa


### Using this GFF file, we can learn what genes are in your sample!

A GFF file has the columns: 
- seqname - name of the chromosome or scaffold
- source - name of the program that generated this feature: GeneMark.hmm
- feature - the feature type name, e.g. Gene, Variation, or Similarity
- start - Start position of the feature, with sequence numbering starting at 1.
- end - End position of the feature, with sequence numbering starting at 1.
- score - A floating point value.
- strand - defined as + (forward) or - (reverse).
- frame - One of '0', '1' or '2'. '0' indicates that the first base of the feature is the first base of a codon, '1' that the second base is the first base of a codon, and so on.
- attribute - A semicolon-separated list of tag-value pairs, providing additional information about each feature.
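The nine columns above map naturally onto a named tuple. The sketch below parses one tab-separated GFF row; `GffFeature` and `parse_gff_line` are names made up for this example, and the sample row is invented purely to show the layout.

```python
from collections import namedtuple

GffFeature = namedtuple(
    "GffFeature",
    "seqname source feature start end score strand frame attribute",
)

def parse_gff_line(line):
    """Return a GffFeature for a feature row, or None for comments and blank lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 9:
        return None
    fields[3] = int(fields[3])  # start, 1-based
    fields[4] = int(fields[4])  # end, 1-based and inclusive
    return GffFeature(*fields)

# Hypothetical row, only to illustrate the column order
row = "contig1\tGeneMark.hmm\tCDS\t3\t98\t-12.5\t+\t0\tgene_id=1"
print(parse_gff_line(row))
```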

### Quick look at GFF file

In [None]:
!head -20 data/sequence.gff

### Goal: Get FASTA sequences for all features listed in the GFF
This is your assembly file

In [None]:
!head data/asm.fa

### The GFF file has header lines that we don't need.
Run this sed command to remove them

In [None]:
!sed -i -e 1,9d data/sequence.gff
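One thing to watch out for: deleting a fixed number of lines in place removes nine more lines every time the cell is rerun. A more forgiving alternative, sketched below, is to filter by content instead. It assumes the header and comment lines all start with `#`, and `feature_lines` is a made-up helper name.

```python
def feature_lines(lines):
    """Keep only feature rows: drop blank lines and lines starting with '#'."""
    return [ln for ln in lines if ln.strip() and not ln.startswith("#")]

example = [
    "##gff-version 2\n",
    "# some header comment\n",
    "\n",
    "contig1\tGeneMark.hmm\tCDS\t3\t98\t-12.5\t+\t0\tgene_id=1\n",
]
print(feature_lines(example))
```

Running this filter twice gives the same result as running it once, so it is safe to rerun.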

This is our current GFF file, followed by a count of its lines

In [None]:
!head -20 data/sequence.gff
!wc -l data/sequence.gff

### Run this to slice the FASTA sequences out of the assembly using the GFF start and stop indices

In [None]:
# This Python script reads the start and stop positions from the GFF
# and slices the matching sequences out of the assembly FASTA.

import csv
from Bio import SeqIO

# Collect (contig ID, feature name, start, stop) for every feature row in the GFF
features = list()
with open("data/sequence.gff") as tsv:
    for line in csv.reader(tsv, dialect="excel-tab"): # You can also use delimiter="\t" rather than giving a dialect.
        # Skip comments, blank lines, and rows too short to hold start/stop columns
        if len(line) >= 5 and line[0].strip() and not line[0].startswith("#"):
            contigId = line[0].split()[0]  # assumes seqname starts with the FASTA record ID
            name = line[2] + line[3] + "-" + line[4]
            features.append((contigId, name, int(line[3]), int(line[4])))

# Use Biopython to slice each feature out of the contig it belongs to
fastaList = list()
for record in SeqIO.parse("data/asm.fa", "fasta"):
    print("This is the header for your assembly fasta: "+record.id)
    for contigId, name, start, stop in features:
        if contigId == record.id:
            # GFF positions are 1-based and inclusive; Python slices are 0-based and end-exclusive
            fastaList.append((name, record.seq[start-1:stop]))

# Write every sliced sequence to one FASTA file
with open("data/annotatedGene.fa", "w") as output_handle:
    for name, seq in fastaList:
        fasta_format_string = ">"+name+"\n%s\n" % seq
        output_handle.write(fasta_format_string)

# Get the largest FASTA sequence
maxFasta = max(fastaList, key=lambda x: len(x[1]))
fasta_format_string = ">"+str(maxFasta[0])+"\n%s\n" % str(maxFasta[1])
print(fasta_format_string)

# BLAST the largest sequence (blastx: translated nucleotide query against the nr protein database)
from Bio.Blast import NCBIWWW
result_handle = NCBIWWW.qblast("blastx", "nr", str(maxFasta[1]))
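Two GFF conventions matter when slicing sequences: positions are numbered from 1 and the end position is included, and features on the `-` strand read along the reverse complement. The sketch below shows both on a plain Python string so it runs without Biopython. `extract_feature` is a made-up helper name, and it only handles the bases A, C, G and T.

```python
# Translation table for complementing DNA (upper- and lower-case)
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def extract_feature(contig, start, end, strand="+"):
    """Slice a 1-based, inclusive GFF interval out of `contig` (a plain string)."""
    sub = contig[start - 1:end]  # shift start by one; end needs no shift
    if strand == "-":
        sub = sub.translate(_COMPLEMENT)[::-1]  # reverse complement
    return sub

contig = "AAACCCGGGTTT"
print(extract_feature(contig, 4, 6))       # forward strand
print(extract_feature(contig, 4, 7, "-"))  # reverse strand
```

With Biopython `Seq` objects the same effect comes from slicing with `start - 1` and calling `.reverse_complement()` for `-` strand features.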



### The annotated FASTA file is data/annotatedGene.fa
- This created a FASTA entry for every feature in the GFF, but there are a lot of them!
- Copy the FASTA sequence printed by the last cell and use the BLAST website to find hits to species!