# Genome Assembly and Annotation

# Assembly of nanopore reads
__Versions__  
Canu: 1.7

### Canu: 


In [None]:
#!/bin/bash
#SBATCH --partition=ellison_1 # Partition (job queue)
#SBATCH --job-name=canu200 # Assign an 8-character name to your job, no spaces
#SBATCH --nodes=1 # Number of compute nodes
#SBATCH --ntasks=1 # Processes (usually = cores) on each node
#SBATCH --cpus-per-task=28 # Threads per process (or per core)
#SBATCH --export=ALL # Export your current environment settings to the job environment
#SBATCH --time=72:00:00
#SBATCH --mem=200G

canu -p D_triauraria_all_seqs -d D_triauraria_all_sequs_assembly_200m gnuplotTested\=true overlapper\=mhap utgReAlign\=true useGrid\=0 genomeSize\=200m -nanopore-raw all_TA_combined_seqs.fastq >& all_TA_combined_seqs_output_200.txt

# Quality Control

__Versions__   
Purge Haplotigs: v1.0.1    
Nanopolish: 0.8.4  
Pilon: 1.22  
bowtie2: 2.2.9  

### Purge Haplotigs:
Identifies contigs that represent heterozygous haplotypes (haplotigs) and determines the consensus sequence.

In [None]:
#!/bin/bash
#SBATCH --partition=ellison_1 # Partition (job queue)
#SBATCH --job-name=PHap # Assign an 8-character name to your job, no spaces
#SBATCH --nodes=1 # Number of compute nodes
#SBATCH --ntasks=1 # Processes (usually = cores) on each node
#SBATCH --cpus-per-task=28 # Threads per process (or per core)
#SBATCH --export=ALL # Export your current environment settings to the job environment
#SBATCH --time=72:00:00
#SBATCH --mem=200G

python nanopolish_makerange.py D_triauraria_all_seqs.contigs.fasta | parallel --results nanopolish.results -P 8 \
    nanopolish variants --consensus -o polished.{1}.vcf -w {1} -r reads.fastq -b reads.sorted.bam -g draft.fa -t 4 --min-candidate-frequency 0.1

### Nanopolish:
Corrects errors that arise from nanopore sequencing.

In [None]:
#!/bin/bash
#SBATCH --partition=genetics_1 # Partition (job queue)
#SBATCH --job-name=nanopol # Assign an 8-character name to your job, no spaces
#SBATCH --nodes=1 # Number of compute nodes
#SBATCH --ntasks=1 # Processes (usually = cores) on each node
#SBATCH --cpus-per-task=28 # Threads per process (or per core)
#SBATCH --export=ALL # Export your current environment settings to the job environment
#SBATCH --time=72:00:00
#SBATCH --mem=150G
#SBATCH --mail-type=END
#SBATCH --mail-user=

python nanopolish_makerange.py /all_TA_combined/curated.fa | parallel --results nanopolish.results -P 6 \
 nanopolish variants --consensus -o polished.{1}.vcf -w {1} -r /run_3_sequs.fastq -b /curated_run3_sequs.sorted.bam -g /all_TA_combined/curated.fa -t 4 --min-candidate-frequency 0.1

In [None]:
#!/bin/bash
#SBATCH --partition=ellison_1 # Partition (job queue)
#SBATCH --job-name=nanopol # Assign an 8-character name to your job, no spaces
#SBATCH --nodes=1 # Number of compute nodes
#SBATCH --ntasks=1 # Processes (usually = cores) on each node
#SBATCH --cpus-per-task=28 # Threads per process (or per core)
#SBATCH --export=ALL # Export your current environment settings to the job environment
#SBATCH --time=72:00:00
#SBATCH --mem=150G

nanopolish vcf2fasta -g /all_TA_combined/curated.fa polished.*.vcf > polished_genome.fa

### Pilon:
Combines nanopore and Illumina data to correct sequencing errors.

In [None]:
#!/bin/bash
#SBATCH --partition=ellison_1 # Partition (job queue)
#SBATCH --job-name=allpilon # Assign an 8-character name to your job
#SBATCH --nodes=1 # Number of compute nodes
#SBATCH --ntasks=1 # Processes (usually = cores) on each node
#SBATCH --cpus-per-task=28 # Threads per process (or per core)
#SBATCH --export=ALL # Export your current environment settings to the job environment
#SBATCH --time=12:00:00
#SBATCH --mem=128G
module load bowtie2

bowtie2-build pilon_1.fasta pilon_1_index

bowtie2 -X 1200 -p 28 --no-unal -x pilon_1_index -S TA_illumina_Pilon_1.sam -1 /TA_illumina/sample_trimmed_1P.fastq.gz -2 /TA_illumina/sample_trimmed_2P.fastq.gz

samtools view -b TA_illumina_Pilon_1.sam > TA_illumina_Pilon_1.bam

samtools sort TA_illumina_Pilon_1.bam > TA_illumina_Pilon_1.sorted.bam

samtools index TA_illumina_Pilon_1.sorted.bam

pilon --genome pilon_1.fasta --frags TA_illumina_Pilon_1.sorted.bam --output pilon_2 --changes --minmq 15 &> pilon_1.log

# Scaffolding

### Juicer and 3D DNA alignment pipeline  

__Versions__:  
BWA: 0.7.13  
JUICER: 1.5.6  
JUICEBOX: 1.11.08   
3D-DNA: 180419  

Prepare reference genome

In [None]:
bwa index reference.fasta
samtools faidx reference.fasta
#fasta.fai has the length of each scaffold. Print first 2 columns for scaffold ID and length info
awk '{print $1,$2}' reference.fasta.fai > reference.chrom.sizes
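
For readers who prefer to do this step in Python, the sketch below mirrors the awk one-liner: it keeps the first two tab-separated columns of a samtools `.fai` index (sequence name and length). The function name `fai_to_chrom_sizes` and the example scaffold names are illustrative assumptions, not part of the original pipeline.

```python
import io


def fai_to_chrom_sizes(fai_lines):
    """Yield 'name length' strings from the lines of a samtools .fai index."""
    for line in fai_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        # Column 1 is the sequence name, column 2 is its length in bases.
        yield f"{fields[0]} {fields[1]}"


# Hypothetical two-scaffold index used only for illustration.
example_fai = io.StringIO(
    "scaffold_1\t1000\t12\t60\t61\n"
    "scaffold_2\t250\t1040\t60\t61\n"
)
chrom_sizes = list(fai_to_chrom_sizes(example_fai))
print("\n".join(chrom_sizes))
```

To process a real index, pass an open file handle (for example `open("reference.fasta.fai")`) instead of the `io.StringIO` object.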

Combine replicate HiC files 

In [None]:
#submit SLURM script with the following to combine replicate fastq files
zcat ./d_species_HiC_*_R1_001.fastq.gz | gzip > combined_reps_R1.fastq.gz
zcat ./d_species_HiC_*_R2_001.fastq.gz | gzip > combined_reps_R2.fastq.gz
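
The same concatenation can be sketched in Python with the standard library. This is a minimal illustration, not the script that was actually submitted; the function name `combine_gzipped` is an assumption. The one point worth noting is that the R1 and R2 replicates must be concatenated in the same order so that read pairs stay in sync, which is why the usage comment sorts the file lists.

```python
import glob
import gzip
import shutil


def combine_gzipped(inputs, output):
    """Decompress each input in order and write one recompressed gzip file."""
    with gzip.open(output, "wb") as out:
        for path in inputs:
            with gzip.open(path, "rb") as src:
                shutil.copyfileobj(src, out)


# Hypothetical usage, mirroring the shell commands above:
# combine_gzipped(sorted(glob.glob("./d_species_HiC_*_R1_001.fastq.gz")),
#                 "combined_reps_R1.fastq.gz")
# combine_gzipped(sorted(glob.glob("./d_species_HiC_*_R2_001.fastq.gz")),
#                 "combined_reps_R2.fastq.gz")
```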

Run JUICER to align the HiC reads to the reference assembly and create a contact matrix (.hic) and a "merged_nodups.txt" file to use as input to 3D-DNA

In [None]:
juicer.sh -d /Dtria/juicer/work/Dtria -p /scratch/tg484/Dtria/juicer/references/Dtria_20181107_r2_illumina_reformat.chrom.sizes -s none -z /Dtria/juicer/references/Dtria_20181107_r2_illumina.reformat.fa -D /Dtria/juicer -t 28 -b GCTGAGGGATCCCTCAGC

Run **3D-DNA**  
First, run without the --editor-repeat-coverage flag. 

In [None]:
run-asm-pipeline.sh --build-gapped-map -m diploid -i 5000 -r 3 /Dtria/juicer/references/Dtria_20181107_r2_illumina.reformat.fa /Dtria/juicer/work/Dtria/aligned/merged_nodups.txt >& /Dtria/3D-DNA/3d-dna.log

View the **rawchrom.assembly** output using Juicebox. If the output looks poorly aligned, run the following two commands to view a text-based coverage histogram, then rerun 3D-DNA with the --editor-repeat-coverage flag set to the value at the peak of the histogram.

In [None]:
bash ~/pkg/3d-dna/edit/run-coverage-analyzer.sh ./3D-DNA_d_species/reference.0.hic
awk -f plot_coverage.awk coverage_wide.wig 
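
If reading the peak off a text histogram is awkward, the idea can be sketched in Python: tally the coverage values and report the most frequent one. This is only an illustration under stated assumptions. It assumes `coverage_wide.wig` is a wiggle-style text file whose data lines end with a numeric coverage value and whose header lines start with `track`, `fixedStep`, `variableStep`, or `#`; the real file layout should be checked first. The function name `coverage_peak` is hypothetical.

```python
from collections import Counter


def coverage_peak(wig_lines):
    """Return the most frequent (rounded) coverage value in wiggle-style lines."""
    counts = Counter()
    for line in wig_lines:
        line = line.strip()
        # Assumed header/comment prefixes; adjust to the actual file.
        if not line or line.startswith(("track", "fixedStep", "variableStep", "#")):
            continue
        value = float(line.split()[-1])  # last token is taken as the coverage
        counts[round(value)] += 1
    if not counts:
        raise ValueError("no coverage values found")
    peak, _ = max(counts.items(), key=lambda kv: kv[1])
    return peak


# Hypothetical usage:
# with open("coverage_wide.wig") as fh:
#     print(coverage_peak(fh))
```

The returned value would then be a candidate for --editor-repeat-coverage, but it is worth confirming against the plotted histogram before rerunning 3D-DNA.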

In [None]:
#final 3D-DNA command using "--editor-repeat-coverage 6" because peak of histogram was around 6
run-asm-pipeline.sh --build-gapped-map --editor-repeat-coverage 6 -m diploid -i 5000 -r 3 /Dtria/juicer/references/Dtria_20181107_r2_illumina.reformat.fa /Dtria/juicer/work/Dtria/aligned/merged_nodups.txt >& /Dtria/3D-DNA/3d-dna.log

Once the output is fairly well aligned, edit any remaining misalignments using JUICEBOX. Export the new assembly (called **.review.assembly**) and create a new fasta file using the following command. This command will output a CAPITALIZED **(.FINAL.fasta)** file that should be used in downstream analysis.

In [None]:
module load java
./pkg/3d-dna/run-asm-pipeline-post-review.sh -r reference.rawchrom.review.assembly ./juicer/references/reference.fa ./juicer/work/d_species/aligned/merged_nodups.txt >& final_fasta.log
#the .review.assembly file is the edited file from JUICEBOX. This command will output a CAPITALIZED (.FINAL.fasta) file that should be used in downstream analysis.

Assign D. triauraria megascaffolds to corresponding Muller element.

Sanity check: Check the number of genes on each Muller element in the final <tt>chrom_count_table</tt>.    
The count should be roughly 1000-3500 for each of Muller A-E and roughly 50-75 for Muller F. 

In [None]:
module load samtools
module load blast

# GET LENGTHS OF SCAFFOLDS
# scaffold lengths will be stored in the second column of reference.FINAL.fasta.fai
samtools faidx reference.FINAL.fasta 
makeblastdb -dbtype nucl -in reference.FINAL.fasta 
tblastn -query dmel_r6_longest_isoform.pep.fasta -db reference.FINAL.fasta -evalue 1e-4 -outfmt 6 > d_species.tblastn
# KEEP ONLY THE BEST HIT FOR EACH QUERY
python /projects/genetics/ellison_lab/scripts/best.py d_species.tblastn > d_species.tblastn.best

# COUNT THE NUMBER OF PEPTIDES FROM EACH DMEL CHROMOSOME THAT MATCH EACH SCAFFOLD IN THE ASSEMBLY
cut -f 1,2 d_species.tblastn.best | sed -r 's/_/\t/' | sort -k3 | bedtools groupby -g 3 -c 2 -o freqdesc > d_species.chrom_count

echo -e "SCAFFOLD\tLENGTH\tMULLER_A\tMULLER_B\tMULLER_C\tMULLER_D\tMULLER_E\tMULLER_F\tOTHER\tTOTAL" > d_species.chrom_count_table.txt

python blast_chrom_count2table.py d_species.chrom_count reference.FINAL.fasta.fai | sort -k2 -nr >> d_species.chrom_count_table.txt
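
The sanity check described above can be sketched in Python. The snippet assigns each scaffold to the Muller element with the most peptide hits and flags whether that count falls in the approximate expected range. The function name `assign_muller`, the exact range boundaries, and the example rows are illustrative assumptions; small unplaced scaffolds will naturally fall outside the ranges, so the flag is only meaningful for the megascaffolds.

```python
import csv
import io

# Approximate expected gene counts per Muller element (from the text above).
EXPECTED = {
    "MULLER_A": (1000, 3500),
    "MULLER_B": (1000, 3500),
    "MULLER_C": (1000, 3500),
    "MULLER_D": (1000, 3500),
    "MULLER_E": (1000, 3500),
    "MULLER_F": (50, 75),
}


def assign_muller(table_text):
    """Return (scaffold, element, count, in_expected_range) for each table row."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    results = []
    for row in reader:
        counts = {name: int(row[name]) for name in EXPECTED}
        element, n = max(counts.items(), key=lambda kv: kv[1])
        low, high = EXPECTED[element]
        results.append((row["SCAFFOLD"], element, n, low <= n <= high))
    return results


# Hypothetical table contents used only for illustration.
example_table = (
    "SCAFFOLD\tLENGTH\tMULLER_A\tMULLER_B\tMULLER_C\tMULLER_D\tMULLER_E\tMULLER_F\tOTHER\tTOTAL\n"
    "scaf1\t30000000\t2500\t10\t5\t0\t3\t0\tNA:NA\t2518\n"
    "scaf2\t1500000\t0\t1\t0\t0\t2\t60\tNA:NA\t63\n"
    "scaf3\t50000\t4\t0\t0\t0\t0\t0\tNA:NA\t4\n"
)
for result in assign_muller(example_table):
    print(*result, sep="\t")
```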

blast_chrom_count2table.py

In [None]:
import sys

countfile=sys.argv[1]
lengthfile=sys.argv[2]

# Map each scaffold name to its length (columns 1 and 2 of the .fai index)
lengths={}
with open(lengthfile) as le:
    for scaff in le:
        slist=scaff.rstrip().split("\t")
        lengths[slist[0]]=slist[1]

#print("SCAFFOLD","LENGTH","MULLER_A","MULLER_B","MULLER_C","MULLER_D","MULLER_E","MULLER_F","OTHER","TOTAL",sep="\t")
muller = {"2L":"Muller_B","2R":"Muller_C","3L":"Muller_D","3R":"Muller_E","4":"Muller_F","X":"Muller_A"}

output = {}
fh=open(countfile)
for line in fh:
    list1=line.rstrip().split("\t")
    list2=list1[1].split(",")
    outdict = {"Muller_A":0,
               "Muller_B":0,
               "Muller_C":0,
               "Muller_D":0,
               "Muller_E":0,
               "Muller_F":0}
    other = {}
    chromSum=0
    for c in list2:
        (chrom,num) = c.split(":")
        chromSum+=int(num)
        if muller.get(chrom):
            mchrom = muller[chrom]
            outdict[mchrom] += int(num)
        else:
            other[chrom] = num
    outline = [list1[0],lengths[list1[0]],outdict["Muller_A"],outdict["Muller_B"],outdict["Muller_C"],outdict["Muller_D"],outdict["Muller_E"],outdict["Muller_F"]]
    other_list=[]
    if(other):
        for m,n in other.items():
            item = m+":"+n
            other_list.append(item)
    else:
        other_list.append("NA:NA")
    other_str = ','.join(other_list)
    outline.append(other_str)
    outline.append(chromSum)
    convert = [str(i) for i in outline] 
    print("\t".join(convert))
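
To make the input format concrete, here is a small sketch of how one `freqdesc` field from the `.chrom_count` file (comma-separated `chromosome:count` pairs) is folded into per-Muller-element counts. It restates the core loop of the script above as a standalone function; the function name `muller_counts` and the example field are assumptions for illustration.

```python
# D. melanogaster chromosome arm -> Muller element, as in the script above.
MULLER = {"2L": "Muller_B", "2R": "Muller_C", "3L": "Muller_D",
          "3R": "Muller_E", "4": "Muller_F", "X": "Muller_A"}


def muller_counts(freq_field):
    """Split a 'chrom:count,chrom:count' field into Muller counts and leftovers."""
    counts = {name: 0 for name in sorted(set(MULLER.values()))}
    other = {}
    for item in freq_field.split(","):
        chrom, num = item.rsplit(":", 1)
        if chrom in MULLER:
            counts[MULLER[chrom]] += int(num)
        else:
            other[chrom] = int(num)  # hits to arms outside Muller A-F
    return counts, other


# Hypothetical field for one scaffold.
example_counts, example_other = muller_counts("2L:1500,X:3,Y:1")
print(example_counts, example_other)
```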


best.py

In [None]:
import sys

filename = sys.argv[1]
fh = open(filename)

hits = {}
best = {}
tot=0
for line in fh:
    tot+=1
    line = line.rstrip()
    cols = line.split('\t')
    qid  = cols[0]
    scr  = float(cols[11])
    # Track the highest bit score (column 12 of -outfmt 6) seen for each query
    if qid in best:
        if scr > best[qid]:
            best[qid] = scr
    else:
        best[qid] = scr
    if hits.get(qid):
        if hits[qid].get(scr):
            hits[qid][scr].append(line)
        else:
            hits[qid][scr] = [line]
    else:
        hits[qid] = {}
        hits[qid][scr]=[line]
fh.close()

all_hits = hits.keys()
qtot = len(all_hits)

for i in all_hits:
    score = best[i]
    outstring = '\n'.join(hits[i][score])
    print(outstring)
print('Parsed a total of',tot,'alignments and reported the best hit for',qtot,'queries.',file=sys.stderr)
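
As a compact illustration of what best.py does, the sketch below keeps, for each query, every alignment that shares the highest bit score (column 12 of BLAST tabular `-outfmt 6`), so ties are all reported. The function name `best_hits` and the example rows are hypothetical; this is a restatement for clarity, not a replacement for the script.

```python
from collections import defaultdict


def best_hits(lines):
    """Return the top-scoring alignment line(s) for each query, ties included."""
    by_query = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        cols = line.split("\t")
        by_query[cols[0]].append((float(cols[11]), line))
    kept = []
    for hits in by_query.values():
        top = max(score for score, _ in hits)
        kept.extend(hit for score, hit in hits if score == top)
    return kept


# Hypothetical tblastn rows (12 tab-separated columns; the last is the bit score).
example_rows = [
    "FBpp1\tscafA\t90.0\t100\t5\t0\t1\t100\t1\t300\t1e-50\t200",
    "FBpp1\tscafB\t80.0\t100\t10\t0\t1\t100\t1\t300\t1e-20\t90.5",
    "FBpp2\tscafA\t95.0\t50\t2\t0\t1\t50\t400\t550\t1e-30\t110",
]
example_best = best_hits(example_rows)
print("\n".join(example_best))
```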

# Annotation

### MAKER pipeline

__Versions__:  
Maker: 2.31.10  
stringtie: 2.0  
HISAT2:  2.1.0  
trimmomatic: 0.38    
samtools: 1.9     
gffread: 0.11.8     
Augustus: 3.3.2  
SNAP: 1.0beta.18 

Remove adaptor sequences from RNAseq data using _Trimmomatic_

In [None]:
#!/bin/bash
#SBATCH --partition=genetics_1 # Partition (job queue)
#SBATCH --job-name=trim # Assign an 8-character name to your job, no spaces
#SBATCH --nodes=1 # Number of compute nodes
#SBATCH --ntasks=1 # Processes (usually = cores) on each node
#SBATCH --cpus-per-task=28 # Threads per process (or per core)
#SBATCH --export=ALL # Export your current environment settings to the job environment
#SBATCH --time=10:00:00
#SBATCH --mem=100G
cd /RNAseq_dtri_testes
trimmomatic PE dtri_testes_mRNA_RNAseq_9_S9_R1_001.fastq dtri_testes_mRNA_RNAseq_9_S9_R2_001.fastq -baseout sample_trimmed.fastq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:10 TRAILING:10 SLIDINGWINDOW:4:15 MINLEN:36

cd /RNAseq_dtri_ovaries
trimmomatic PE dtri_ovaries_mRNA_RNAseq_3_S3_R1_001.fastq dtri_ovaries_mRNA_RNAseq_3_S3_R2_001.fastq -baseout sample_trimmed.fastq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:10 TRAILING:10 SLIDINGWINDOW:4:15 MINLEN:36

cd /RNAseq_dtri_embryos
trimmomatic PE dtri_embryos_mRNA_RNAseq_1_S1_R1_001.fastq dtri_embryos_mRNA_RNAseq_1_S1_R2_001.fastq -baseout sample_trimmed.fastq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:10 TRAILING:10 SLIDINGWINDOW:4:15 MINLEN:36


Align the _D. triauraria_ RNA-seq Illumina reads to the _D. triauraria_ reference genome using _HISAT2_

In [None]:
#!/bin/bash
#SBATCH --partition=genetics_1    # Partition (job queue)
#SBATCH --job-name=hisat2       # Assign an 8-character name to your job, no spaces
#SBATCH --nodes=1                 # Number of compute nodes
#SBATCH --ntasks=1                # Processes (usually = cores) on each node
#SBATCH --cpus-per-task=28        # Threads per process (or per core)
#SBATCH --export=ALL              # Export your current environment settings to the job environment
#SBATCH --time=12:00:00
#SBATCH --mem=100G

hisat2 --rna-strandness RF --dta --max-intronlen 50000 --no-unal -x tria_ordered_muller_plus_other_scaffs_index -1 /RNAseq_dtri_embryos/sample_trimmed_1P.fastq.gz -2 /RNAseq_dtri_embryos/sample_trimmed_2P.fastq.gz > RNAseq_dtri_embryos_out.sam

hisat2 --rna-strandness RF --dta --max-intronlen 50000 --no-unal -x tria_ordered_muller_plus_other_scaffs_index -1 /RNAseq_dtri_ovaries/sample_trimmed_1P.fastq.gz -2 /RNAseq_dtri_ovaries/sample_trimmed_2P.fastq.gz > RNAseq_dtri_ovaries_out.sam

hisat2 --rna-strandness RF --dta --max-intronlen 50000 --no-unal -x tria_ordered_muller_plus_other_scaffs_index -1 /RNAseq_dtri_testes/sample_trimmed_1P.fastq.gz -2 /RNAseq_dtri_testes/sample_trimmed_2P.fastq.gz > RNAseq_dtri_testes_out.sam


Convert the SAM files to BAM   
Sort the BAM files

In [None]:
#!/bin/bash
#SBATCH --partition=genetics_1    # Partition (job queue)
#SBATCH --job-name=hisat2SAM       # Assign an 8-character name to your job, no spaces
#SBATCH --nodes=1                 # Number of compute nodes
#SBATCH --ntasks=1                # Processes (usually = cores) on each node
#SBATCH --cpus-per-task=28        # Threads per process (or per core)
#SBATCH --export=ALL              # Export your current environment settings to the job environment
#SBATCH --time=12:00:00
#SBATCH --mem=50G


samtools view -bS RNAseq_dtri_embryos_out.sam > dtri_embryos_out.bam
samtools sort dtri_embryos_out.bam > dtri_embryos_out_sort.bam

samtools view -bS RNAseq_dtri_ovaries_out.sam > dtri_ovaries_out.bam
samtools sort dtri_ovaries_out.bam > dtri_ovaries_out_sort.bam

samtools view -bS RNAseq_dtri_testes_out.sam > dtri_testes_out.bam
samtools sort dtri_testes_out.bam > dtri_testes_out_sort.bam

_HISAT2_ alignments were used to assemble mRNA transcripts with _stringtie_

In [None]:
#generate gtf file from bam files
stringtie dtri_embryos_HISAT2_2_sort.bam -j 3 --rf -o dtri_embryos_HISAT2_2.gtf 

#merge the gtf files  
stringtie --merge dtri_embryos_HISAT2_2.gtf dtri_embryos_HISAT2_2.gtf \
dtri_embryos_HISAT2_2.gtf -o dtri_all_sequs_merged.gtf 

#create fasta file for MAKER input
gffread -w dtri_all_sequs_transcripts.fa -g dtria_ordered_muller_plus_other_scaffs.fasta dtri_all_sequs_merged.gtf

Generate the _MAKER_ configuration files (contents shown below):   
**maker_exe.ctl** - paths to the executables used by MAKER.  
**maker_bopt.ctl** - filtering thresholds for BLAST and Exonerate; we used all defaults.   
**maker_opt.ctl** - all other options for MAKER, including the location of the input genome file.  

In [None]:
maker -CTL
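
The control files are plain `key=value #comment` text, so it can be handy to load one into a dictionary to review which options were actually set. The following is a minimal sketch under that assumption; the function name `parse_ctl` is hypothetical, and the simple comment stripping would truncate any value that itself contains a `#`.

```python
def parse_ctl(lines):
    """Parse MAKER-style 'key=value #comment' lines into a dict of strings."""
    options = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line or "=" not in line:
            continue  # section headers and blank lines
        key, value = line.split("=", 1)
        options[key.strip()] = value.strip()
    return options


# Hypothetical excerpt in the same layout as maker_opt.ctl.
example_ctl = [
    "#-----Genome (these are always required)",
    "genome=/dtria_ordered_muller_plus_other_scaffs.fasta #genome sequence",
    "maker_gff= #MAKER derived GFF3 file",
    "cpus=28 #max number of cpus",
]
example_options = parse_ctl(example_ctl)
# List only the options that were given a value.
print({k: v for k, v in example_options.items() if v})
```

For a real file, pass an open handle such as `open("maker_opt.ctl")`.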

Run _MAKER_

In [None]:
#!/bin/bash
#SBATCH --partition=genetics_1    # Partition (job queue)
#SBATCH --job-name=MAKER       # Assign an 8-character name to your job, no spaces
#SBATCH --nodes=1                 # Number of compute nodes
#SBATCH --ntasks=1                # Processes (usually = cores) on each node
#SBATCH --cpus-per-task=28        # Threads per process (or per core)
#SBATCH --export=ALL              # Export your current environment settings to the job environment
#SBATCH --time=72:00:00
#SBATCH --mem=100G

cd /maker_main_run
module use /projects/community/modulefiles/
module load augustus
export AUGUSTUS_CONFIG_PATH=/projects/community/augustus/augustus-3.3.2/config
 
/pkg/maker/bin/maker -cpus 28 

maker_exe.ctl

In [None]:
#-----Location of Executables Used by MAKER/EVALUATOR
makeblastdb=/anaconda3/bin/makeblastdb #location of NCBI+ makeblastdb executable
blastn=/anaconda3/bin/blastn #location of NCBI+ blastn executable
blastx=/anaconda3/bin/blastx #location of NCBI+ blastx executable
tblastx=/anaconda3/bin/tblastx #location of NCBI+ tblastx executable
formatdb= #location of NCBI formatdb executable
blastall= #location of NCBI blastall executable
xdformat= #location of WUBLAST xdformat executable
blasta= #location of WUBLAST blasta executable
RepeatMasker=/pkg/maker/bin/../exe/RepeatMasker/RepeatMasker #location of RepeatMasker executable
exonerate=/pkg/maker/bin/../exe/exonerate/bin/exonerate #location of exonerate executable

#-----Ab-initio Gene Prediction Algorithms
snap=/bin/snap #location of snap executable
gmhmme3=/pkg/gm_et_linux_64/gmes_petap/gmhmme3 #location of eukaryotic genemark executable
gmhmmp= #location of prokaryotic genemark executable
augustus=/projects/community/augustus/augustus-3.3.2/bin/augustus #location of augustus executable
fgenesh= #location of fgenesh executable
tRNAscan-SE= #location of trnascan executable
snoscan= #location of snoscan executable

#-----Other Algorithms
probuild=/pkg/gm_et_linux_64/gmes_petap/probuild #location of probuild executable (required for genemark)


maker_bopt.ctl

In [None]:
#-----BLAST and Exonerate Statistics Thresholds
blast_type=ncbi+ #set to 'ncbi+', 'ncbi' or 'wublast'

pcov_blastn=0.8 #Blastn Percent Coverage Threhold EST-Genome Alignments
pid_blastn=0.85 #Blastn Percent Identity Threshold EST-Genome Aligments
eval_blastn=1e-10 #Blastn eval cutoff
bit_blastn=40 #Blastn bit cutoff
depth_blastn=0 #Blastn depth cutoff (0 to disable cutoff)

pcov_blastx=0.5 #Blastx Percent Coverage Threhold Protein-Genome Alignments
pid_blastx=0.4 #Blastx Percent Identity Threshold Protein-Genome Aligments
eval_blastx=1e-06 #Blastx eval cutoff
bit_blastx=30 #Blastx bit cutoff
depth_blastx=0 #Blastx depth cutoff (0 to disable cutoff)

pcov_tblastx=0.8 #tBlastx Percent Coverage Threhold alt-EST-Genome Alignments
pid_tblastx=0.85 #tBlastx Percent Identity Threshold alt-EST-Genome Aligments
eval_tblastx=1e-10 #tBlastx eval cutoff
bit_tblastx=40 #tBlastx bit cutoff
depth_tblastx=0 #tBlastx depth cutoff (0 to disable cutoff)

pcov_rm_blastx=0.5 #Blastx Percent Coverage Threhold For Transposable Element Masking
pid_rm_blastx=0.4 #Blastx Percent Identity Threshold For Transposbale Element Masking
eval_rm_blastx=1e-06 #Blastx eval cutoff for transposable element masking
bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking

ep_score_limit=20 #Exonerate protein percent of maximal score threshold
en_score_limit=20 #Exonerate nucleotide percent of maximal score threshold


maker_opt.ctl

In [None]:
#-----Genome (these are always required)
genome=/dtria_ordered_muller_plus_other_scaffs.fasta #genome sequence (fasta file or fasta embeded in GFF3 file)
organism_type=eukaryotic #eukaryotic or prokaryotic. Default is eukaryotic

#-----Re-annotation Using MAKER Derived GFF3
maker_gff= #MAKER derived GFF3 file
est_pass=0 #use ESTs in maker_gff: 1 = yes, 0 = no
altest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no
protein_pass=0 #use protein alignments in maker_gff: 1 = yes, 0 = no
rm_pass=0 #use repeats in maker_gff: 1 = yes, 0 = no
model_pass=0 #use gene models in maker_gff: 1 = yes, 0 = no
pred_pass=0 #use ab-initio predictions in maker_gff: 1 = yes, 0 = no
other_pass=0 #passthrough anyything else in maker_gff: 1 = yes, 0 = no

#-----EST Evidence (for best results provide a file for at least one)
est=/dtri_all_sequs_transcripts.fa #set of ESTs or assembled mRNA-seq in fasta format
altest= #EST/cDNA sequence file in fasta format from an alternate organism
est_gff= #aligned ESTs or mRNA-seq from an external GFF3 file
altest_gff= #aligned ESTs from a closly relate species in GFF3 format

#-----Protein Homology Evidence (for best results provide a file for at least one)
protein=/dmel-all-translation-r6.26.fasta #protein sequence file in fasta format (i.e. from mutiple oransisms)
protein_gff=  #aligned protein homology evidence from an external GFF3 file

#-----Repeat Masking (leave values blank to skip repeat masking)
model_org=drosophila #select a model organism for RepBase masking in RepeatMasker
rmlib= #provide an organism specific repeat library in fasta format for RepeatMasker
repeat_protein=/pkg/maker/data/te_proteins.fasta #provide a fasta file of transposable element proteins for RepeatRunner
rm_gff= #pre-identified repeat elements from an external GFF3 file
prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)

#-----Gene Prediction
snaphmm=/pkg/snap/HMM/D.melanogaster.hmm #SNAP HMM file
gmhmm= #GeneMark HMM file
augustus_species=fly #Augustus gene prediction species model
fgenesh_par_file= #FGENESH parameter file
pred_gff= #ab-initio predictions from an external GFF3 file
model_gff= #annotated gene models from an external GFF3 file (annotation pass-through)
est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = no
protein2genome=0 #infer predictions from protein homology, 1 = yes, 0 = no
trna=0 #find tRNAs with tRNAscan, 1 = yes, 0 = no
snoscan_rrna= #rRNA file to have Snoscan find snoRNAs
unmask=0 #also run ab-initio prediction programs on unmasked sequence, 1 = yes, 0 = no

#-----Other Annotation Feature Types (features MAKER doesn't recognize)
other_gff= #extra features to pass-through to final MAKER generated GFF3 file

#-----External Application Behavior Options
alt_peptide=C #amino acid used to replace non-standard amino acids in BLAST databases
cpus=28 #max number of cpus to use in BLAST and RepeatMasker (not for MPI, leave 1 when using MPI)

#-----MAKER Behavior Options
max_dna_len=100000 #length for dividing up contigs into chunks (increases/decreases memory usage)
min_contig=1 #skip genome contigs below this length (under 10kb are often useless)

pred_flank=200 #flank for extending evidence clusters sent to gene predictors
pred_stats=0 #report AED and QI statistics for all predictions as well as models
AED_threshold=1 #Maximum Annotation Edit Distance allowed (bound by 0 and 1)
min_protein=0 #require at least this many amino acids in predicted proteins
alt_splice=0 #Take extra steps to try and find alternative splicing, 1 = yes, 0 = no
always_complete=0 #extra steps to force start and stop codons, 1 = yes, 0 = no
map_forward=0 #map names and attributes forward from old GFF3 genes, 1 = yes, 0 = no
keep_preds=0 #Concordance threshold to add unsupported gene prediction (bound by 0 and 1)

split_hit=10000 #length for the splitting of hits (expected max intron size for evidence alignments)
single_exon=0 #consider single exon EST evidence when generating annotations, 1 = yes, 0 = no
single_length=250 #min length required for single exon ESTs if 'single_exon is enabled'
correct_est_fusion=0 #limits use of ESTs in annotation to avoid fusion genes

tries=2 #number of times to try a contig if there is a failure for some reason
clean_try=0 #remove all data from previous run before retrying, 1 = yes, 0 = no
clean_up=0 #removes theVoid directory with individual analysis files, 1 = yes, 0 = no
TMP= #specify a directory other than the system default temporary directory for temporary files
