To give eBook authors a better sense of the workflow layout, we briefly describe the purpose of each directory below.
- cache: Intermediate datasets or results generated during the preprocessing steps.
- graphs: The graphs/figures produced during the analysis.
- input: The raw input data. Files larger than 100M are not allowed. We recommend using a small sample dataset to illustrate the workflow. If you have files larger than 100M, please contact the chapter editor to find a solution.
- lib: The source code, functions, or algorithms used within the workflow.
- output: The final output results of the workflow.
- workflow: The step-by-step pipeline. It may contain sub-directories.
  - We suggest using a numbering system and keywords to indicate the order and the main purpose of the scripts, e.g., `1_fastq_quality_checking.py`, `2_cleaned_reads_alignment.py`.
  - To ensure reproducibility, please use relative paths within the `workflow`.
- README: In the README file, please briefly describe the purpose of the repository, the installation, and the input data format.
  - We recommend using a diagram to describe the workflow briefly.
  - Provide the installation details.
  - Show a small portion of the input data, e.g., the `head` or `tail` of the input data, unless the data file is in a well-known standard format.
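As a rough illustration of the directory layout, the numbered script names, and the relative-path advice above, the following sketch creates the directories under a repository root and lists scripts in numeric order. The helper names (`WORKFLOW_DIRS`, `create_layout`, `ordered_scripts`) are hypothetical and only serve this example.

```python
from pathlib import Path

# Directory names taken from the list above.
WORKFLOW_DIRS = ["cache", "graphs", "input", "lib", "output", "workflow"]


def create_layout(root):
    """Create the standard directories under `root` and return their paths."""
    created = []
    for name in WORKFLOW_DIRS:
        path = Path(root) / name
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


def ordered_scripts(workflow_dir):
    """Return Python script names sorted by their leading number (1_..., 2_...)."""
    scripts = [
        p.name
        for p in Path(workflow_dir).glob("*.py")
        if p.name.split("_")[0].isdigit()
    ]
    # Sort numerically so that 10_... comes after 2_...
    return sorted(scripts, key=lambda name: int(name.split("_")[0]))
```

Assuming a script sits directly under `workflow/`, it can derive the repository root as `Path(__file__).resolve().parent.parent` and build paths such as `root / "input"` from it, so that no absolute path is hard-coded.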
Running environment:
- The workflow was built on a Linux system running the Oracle Java Runtime Environment (JRE), v1.6 to 1.8.
Required software and versions:
- MAKER-P (Campbell et al., 2014; v3.1; http://www.yandell-lab.org/software/maker-p.html)
- RepeatMasker (Tarailo-Graovac et al., 2009; v4.1.1; www.repeatmasker.org)
- Augustus (Stanke et al., 2006; v3.0; http://bioinf.uni-greifswald.de/augustus/)
- Fgenesh (Solovyev et al., 2006; v8.0.0a; http://www.softberry.com/berry.phtml)
- Snap (Korf, 2004; version 2013-11-29; https://github.com/KorfLab/SNAP)
- WUBLAST (Gish, 1996-2003; v2.0; http://blast.wustl.edu)
- InterProScan (Quevillon et al., 2005; v89.0; http://www.ebi.ac.uk/interpro/search/sequence-search)
- Exonerate (Slater and Birney, 2005; https://www.ebi.ac.uk/about/vertebrate-genomics/software/exonerate)
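Before running the workflow, it can help to confirm that the tools listed above are reachable from the shell. The sketch below only checks whether an executable is on `PATH`; it does not verify versions. The executable names in `REQUIRED_TOOLS` are assumptions and may differ on your installation.

```python
import shutil

# Executable names are assumptions; adjust them to match the local installation.
REQUIRED_TOOLS = ["maker", "RepeatMasker", "augustus", "snap", "exonerate"]


def missing_tools(tools):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```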
The example input is a genome sequence in FASTA format. Here we use the chr1 sequence of the maize RP125 genome from Nie et al., 2021.
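To give readers a quick view of a FASTA input like the one described above, a small summary of record names and lengths is often enough. The following is a minimal sketch using only the standard library; the function name `fasta_lengths` is hypothetical.

```python
def fasta_lengths(path):
    """Return {sequence_id: length} for every record in a FASTA file."""
    lengths = {}
    current = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # The ID is the first whitespace-delimited token of the header.
                fields = line[1:].split()
                current = fields[0] if fields else ""
                lengths[current] = 0
            elif current is not None:
                lengths[current] += len(line)
    return lengths
```

Calling it on the input file returns a dictionary with one entry per sequence, which can be pasted into the README in place of the raw data.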
Set up each configuration file (Maker_opts.ctl, Maker_exe.ctl, Maker_evm.ctl, and Maker_bopts.ctl): provide the paths of the input data and evidence data in the .ctl files, and set the parameters as needed.
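The control files can be edited by hand, but filling them in from a script keeps the paths reproducible. The sketch below assumes each option line has the form `key=value #comment`; the helper `set_ctl_options` is hypothetical, and the keys and path in the usage note are examples to be checked against your own control files.

```python
def set_ctl_options(ctl_text, updates):
    """Return `ctl_text` with the values of the keys in `updates` replaced.

    Each option line is assumed to look like `key=value #comment`.
    Comment-only lines and keys not in `updates` are left untouched.
    """
    new_lines = []
    for line in ctl_text.splitlines():
        key, sep, rest = line.partition("=")
        if sep and not line.lstrip().startswith("#") and key.strip() in updates:
            # Keep any trailing comment and replace only the value.
            _, hash_sign, comment = rest.partition("#")
            line = f"{key}={updates[key.strip()]}"
            if hash_sign:
                line += f" #{comment}"
        new_lines.append(line)
    return "\n".join(new_lines) + "\n"
```

For example, `set_ctl_options(text, {"genome": "input/chr1.fa", "cpus": "4"})` would rewrite the `genome=` and `cpus=` lines of the text read from the options file and leave everything else as it was.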
Run the script ‘run_maker.sh’ to annotate the genome.
Use the following command to create the final merged GFF file. The “-n” option produces a GFF file without the genome sequences.
gff3_merge -s -n -d genome.maker.output/genome_master_datastore_index.log > genome.noseq.gff
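One simple sanity check on the merged file is to confirm that it carries no embedded sequence section. In GFF3, embedded sequences follow a `##FASTA` directive, so a file written without sequences should not contain that line. A minimal sketch, with a hypothetical helper name:

```python
def has_embedded_fasta(gff_path):
    """Return True if the GFF3 file contains a `##FASTA` directive."""
    with open(gff_path) as handle:
        return any(line.strip() == "##FASTA" for line in handle)
```

For the file produced above, `has_embedded_fasta("genome.noseq.gff")` would be expected to return `False`.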
Generate AED plots.
/programs/maker/AED_cdf_generator.pl -b 0.025 chr1.noseq.gff > AED_rnd3
Plot the file AED_rnd3 in Excel or any plotting software.
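As an alternative to a spreadsheet, the AED file can be plotted with a short script. This sketch assumes that `AED_rnd3` is a two-column, whitespace-delimited table (AED threshold, cumulative fraction) with a non-numeric header line, and that matplotlib is installed for the plotting step. The function names and the output path are hypothetical.

```python
def read_aed_cdf(path):
    """Read (AED threshold, cumulative fraction) pairs, skipping non-numeric lines."""
    aed, fraction = [], []
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) < 2:
                continue
            try:
                x, y = float(fields[0]), float(fields[1])
            except ValueError:
                continue  # header or other non-numeric line
            aed.append(x)
            fraction.append(y)
    return aed, fraction


def plot_aed_cdf(path, out_png):
    """Plot the cumulative AED distribution and save it as a PNG."""
    # Imported here so that read_aed_cdf works without matplotlib.
    import matplotlib
    matplotlib.use("Agg")  # no display needed
    import matplotlib.pyplot as plt

    aed, fraction = read_aed_cdf(path)
    fig, ax = plt.subplots()
    ax.plot(aed, fraction)
    ax.set_xlabel("AED")
    ax.set_ylabel("Cumulative fraction of annotations")
    fig.savefig(out_png, dpi=150)
    plt.close(fig)
```

A call such as `plot_aed_cdf("AED_rnd3", "graphs/AED_rnd3.png")` would place the figure in the `graphs` directory.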
Load the gff file into IGV or JBrowse. Instructions for IGV and JBrowse can be found at:
IGV: http://software.broadinstitute.org/software/igv/UserGuide
JBrowse: https://biohpc.cornell.edu/lab/userguide.aspx?a=software&i=357#c
A GFF3 file with gene structure information and AED scores.
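To show how the AED scores can be read back from the output, the sketch below collects them per transcript. It assumes the scores are stored in an `_AED` attribute in column 9 of the `mRNA` features; the function name `mrna_aed_scores` is hypothetical.

```python
def mrna_aed_scores(gff_path):
    """Return {mRNA ID: AED score} parsed from the `_AED` attribute."""
    scores = {}
    with open(gff_path) as handle:
        for line in handle:
            if line.startswith("##FASTA"):
                break  # sequence section starts, no more features
            if line.startswith("#"):
                continue
            cols = line.rstrip("\n").split("\t")
            if len(cols) != 9 or cols[2] != "mRNA":
                continue
            # Column 9 holds semicolon-separated key=value attributes.
            attrs = dict(
                item.split("=", 1) for item in cols[8].split(";") if "=" in item
            )
            if "ID" in attrs and "_AED" in attrs:
                scores[attrs["ID"]] = float(attrs["_AED"])
    return scores
```

The resulting dictionary can be filtered by a threshold or summarized before the annotation is loaded into a genome browser.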
This is free and open-source software, licensed under (choose a license from the suggested list: GPLv3, MIT, or CC BY 4.0).