# Using SeqIO in Biopython to extract the largest contig

In [None]:
#!/usr/bin/python
from Bio import SeqIO
# print() on a SeqRecord only shows a summary; use SeqIO.write() to save records to a file.


def print_header():
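    '''Print the ID of every sequence header in a FASTA file.'''
    # Mode "r" replaces "rU", which was removed in Python 3.11.
    with open("F_decem_v100.fasta", "r") as handle:
        for record in SeqIO.parse(handle, "fasta"):
            print(record.id)
    # The with block closes the handle, so no explicit close() is needed.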

def print_lengths():
    # Print the length and ID of any sequence longer than 9,000,000 bp.
    # See "filter by sequence length" at https://biopython.org/wiki/SeqIO
    with open("F_decem_v100.fasta", "r") as handle:
        for record in SeqIO.parse(handle, "fasta"):
            # Each sequence is parsed into a SeqRecord object.
            if len(record) > 9000000:
                print(len(record), record.id)
    # The with block closes the handle, so no explicit close() is needed.

def save_largest():
    # Based on the previous function we know evgMSTRG.7663.1 is the largest contig.
    # This function saves it to a FASTA file.
    with open("F_decem_v100.fasta", "r") as handle:
        largest_seq = []
        for record in SeqIO.parse(handle, "fasta"):
            if len(record.seq) > 9000000:
                largest_seq.append(record)
    SeqIO.write(largest_seq, "chrom1.fasta", "fasta")



if __name__ == "__main__":
    #print_header()
    print_lengths()
    save_largest()
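
The hard-coded 9,000,000 bp threshold only works once the largest contig is already known. A minimal sketch of picking the longest record directly is given here. `largest_record` is a hypothetical helper, not part of Biopython, and it accepts any iterable of objects that support `len()`; the toy strings stand in for parsed sequences.

```python
def largest_record(records):
    """Return the longest item from an iterable of records, or None if it is empty.

    Hypothetical helper: works on anything that supports len(), so plain
    strings and Biopython SeqRecord objects can both be passed in.
    """
    # max() consumes the iterable one item at a time, so a generator of
    # records is never fully loaded into memory.
    return max(records, key=len, default=None)


# Toy stand-ins for parsed sequences.
toy_contigs = ["ACGT", "ACGTACGTAC", "AC"]
longest = largest_record(toy_contigs)
print(len(longest), longest)
```

With Biopython, the call would look like `largest_record(SeqIO.parse("F_decem_v100.fasta", "fasta"))`, and the result could then be passed to `SeqIO.write` as a one-element list.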

# What you have so far:

 1) Chromosome 1 (the largest contig), extracted from the Fusarium decemcellulare genome.
 
 2) Evidence transcriptome with redundancies eliminated, found at /home/data/bioinf_training/colin/comparative_genomics/fusarium_decemcellulares/merged/annotation/pipeline/1_RNAseq/okayset

## Notes on Maker

Read http://gmod.org/wiki/MAKER_Tutorial#About_MAKER

Maker is a de novo annotation pipeline.

Maker aligns evidence against the genome.
It then combines that evidence with the output of ab initio gene predictors such as Augustus or GeneMark and selects the combination that gives the best gene model.

# 1) Training maker

Maker has not been trained for Fusarium yet. We therefore need to train it by running Maker with the Augustus fusarium species model on the largest contig of the Fusarium decemcellulare genome.

- Annotation is the conversion of an assembly into a genome landscape with structural and functional descriptions.
- Every annotation should have evidence showing where it comes from.
- The evidence in this case is transcribed RNA (transcripts).
- A wrongly annotated genome poisons the database and introduces systematic error into all experiments derived from it.

In [3]:
# Prepend the Maker bin directory to PATH.
# A shell `export` statement is a syntax error in a Python cell, so use os.environ instead.
import os
os.environ["PATH"] = "/home/data/bioinf_resources/programming_tools/maker/bin:" + os.environ["PATH"]
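
One way to confirm that the Maker executables can be found once their directory is on PATH is `shutil.which`. The sketch below is only an assumption about how such a check might look: `find_executable` is a hypothetical helper, and the directory is the Maker bin path used in the cell above.

```python
import os
import shutil


def find_executable(name, extra_dir=None):
    """Return the full path of `name` if it is on the search path, else None.

    Hypothetical helper: `extra_dir`, if given, is searched before the
    directories already listed in the PATH environment variable.
    """
    search_path = os.environ.get("PATH", "")
    if extra_dir:
        search_path = extra_dir + os.pathsep + search_path
    return shutil.which(name, path=search_path)


maker_bin = "/home/data/bioinf_resources/programming_tools/maker/bin"
location = find_executable("maker", extra_dir=maker_bin)
print(location if location else "maker not found on the search path")
```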

#### Prerequisites

Maker has 3 configuration files:
 - maker_exe.ctl (paths to the executable programs that Maker calls)
 - maker_bopts.ctl (threshold values for filtering BLAST and Exonerate alignments)
 - maker_opts.ctl (the most important: location of the genome, evidence inputs, etc.)

The configuration files hold the options and parameters that control how the annotation run proceeds.

Lines 2, 16, 26 and 36 of maker_opts.ctl were changed to be specific to Fusarium.
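
Editing options by line number is fragile if the control file is ever regenerated. A sketch of changing an option by key instead follows. `set_ctl_option` is a hypothetical helper, not part of Maker, and it assumes each option sits on its own line in the `key=value #comment` layout; the two-line example text is made up for illustration.

```python
def set_ctl_option(ctl_text, key, value):
    """Return ctl_text with the line `key=...` set to value, keeping its comment.

    Hypothetical helper; raises KeyError if no line starts with `key=`.
    """
    out, found = [], False
    for line in ctl_text.splitlines():
        if line.startswith(key + "="):
            # Split off the trailing comment (if any) so it can be kept.
            _, sep, comment = line.partition("#")
            line = f"{key}={value}"
            if sep:
                line += " #" + comment
            found = True
        out.append(line)
    if not found:
        raise KeyError(key)
    return "\n".join(out)


example = "genome= #genome sequence\nmodel_org=all #RepBase model organism"
print(set_ctl_option(example, "model_org", "fungi"))
```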

### maker_opts.ctl: vital parameters

 - genome=../../../data/F_decem_chrom1.fasta #../../../data/F_decem_v100.fasta # genomic assembly
 - est=../1_RNAseq/okayset/evidence.okay.fasta # set of ESTs or assembled mRNA-seq in FASTA format
 - model_org=fungi # select a model organism for RepBase masking in RepeatMasker
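
To double-check which values are actually set before launching a run, the options can be read back into a dictionary. This is a minimal sketch: `parse_ctl` is a hypothetical helper, not part of Maker, and it assumes the `key=value #comment` layout of the parameters listed above.

```python
def parse_ctl(ctl_text):
    """Parse `key=value #comment` lines into a dict of key -> value.

    Hypothetical helper: comment-only lines and blank lines are skipped,
    and everything after the first '#' on a line is treated as a comment.
    """
    options = {}
    for line in ctl_text.splitlines():
        body = line.split("#", 1)[0].strip()
        if "=" not in body:
            continue
        key, value = body.split("=", 1)
        options[key.strip()] = value.strip()
    return options


example = """#-----Genome
genome=../../../data/F_decem_chrom1.fasta #genomic assembly
est=../1_RNAseq/okayset/evidence.okay.fasta #assembled mRNA-seq
model_org=fungi #RepBase model organism
"""
for key, value in parse_ctl(example).items():
    print(key, "->", value)
```

An option that is present but empty (for example `protein=`) comes back as an empty string, which makes unset inputs easy to spot.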

### Questions

 - Which protein sequences do we need for Maker?
 - Which model species should be used for RepeatMasker RepBase? fungi?
 - Which GFF files? See line 18 of maker_opts.ctl.
 - Do we need to turn Maker's repeat-masking options off?