-
Notifications
You must be signed in to change notification settings - Fork 52
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Showing
78 changed files
with
4,620 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,106 @@ | ||
# AGAT - **A**nother **G**tf/gff **A**nalysis **T**oolkit | ||
## Suite of tools to handle gene annotations in any GTF/GFF format. | ||
--------------------------------------------- | ||
|
||
# Table of Contents | ||
|
||
* [Foreword](#foreword) | ||
* [List of AGAT tools (v0.6.0)](#list-of-agat-tools-v060) | ||
* [Topological sorting of gff features](https://github.com/NBISweden/AGAT/wiki/Topological-sorting-of-gff-features) | ||
|
||
## Foreword | ||
Providing support in genome annotation within [NBIS](https://nbis.se) the GTF/GFF format is the main format I handle. I receive from customers file in GTF/GFF format coming from a broad range of sources. Even sometimes files from mixed sources (concatenated in the same file), or manually edited. | ||
The problem is that often those files do not follow the official specifications or even if they do, they are not even be sure to be compatible we the inputs expected by the tools. | ||
|
||
* The main idea was **first** to be able to **parse all possible cases** that can be met (I listed more than 30 cases). To my knowledge AGAT is the only one able to handle all of them. | ||
|
||
* The **second** idea was to be able to **create a full standardised GFF3** file that could actually fit in any tool. | ||
Once again AGAT is the only one recreating fully the missing information: | ||
* missing features (gene, mRNA, tRNA, exon, UTRs, etc...) | ||
* missing attributes (ID, Parent). | ||
|
||
and fixing wrong information: | ||
* identifier to be uniq. | ||
* feature location (e.g mRNA will be stretched if shorter than its exons). | ||
* remove duplicated features. | ||
* merge overlapping loci (if option activate because for prokaryote is not something we would like) | ||
|
||
* The **third** idea was to have a **correct topological sorting output**. To my knowledge AGAT is the only one dealing properly with this task. More information about it [here](https://github.com/NBISweden/AGAT/wiki/Topological-sorting-of-gff-features). | ||
|
||
* **Finally**, based on the abilities described previously I have developed a **toolkit to perform different tasks**. Some are originals, some are similar than what other tools could offer, but within AGAT they will always have the strength of the 3 first points. | ||
|
||
|
||
**A final word** | ||
AGAT can solve lot of complicated cases and save headaches. | ||
Enjoy!! | ||
|
||
## List of AGAT tools (v0.6.1) | ||
[agat_convert_bed2gff.pl](https://github.com/NBISweden/AGAT/wiki/agat_convert_bed2gff) | ||
[agat_convert_embl2gff.pl](https://github.com/NBISweden/AGAT/wiki/agat_convert_embl2gff) | ||
[agat_convert_genscan2gff.pl](https://github.com/NBISweden/AGAT/wiki/agat_convert_genscan2gff) | ||
[agat_convert_mfannot2gff.pl](https://github.com/NBISweden/AGAT/wiki/agat_convert_mfannot2gff) | ||
[agat_convert_minimap2_bam2gff.pl](https://github.com/NBISweden/AGAT/wiki/agat_convert_minimap2_bam2gff) | ||
[agat_convert_sp_gff2bed.pl](https://github.com/NBISweden/AGAT/wiki/agat_convert_sp_gff2bed) | ||
[agat_convert_sp_gff2gtf.pl](https://github.com/NBISweden/AGAT/wiki/agat_convert_sp_gff2gtf) | ||
[agat_convert_sp_gff2tsv.pl](https://github.com/NBISweden/AGAT/wiki/agat_convert_sp_gff2tsv) | ||
[agat_convert_sp_gff2zff.pl](https://github.com/NBISweden/AGAT/wiki/agat_convert_sp_gff2zff) | ||
[agat_convert_sp_gxf2gxf.pl](https://github.com/NBISweden/AGAT/wiki/agat_convert_sp_gxf2gxf) | ||
[agat_sp_Prokka_inferNameFromAttributes.pl](https://github.com/NBISweden/AGAT/wiki/agat_sp_Prokka_inferNameFromAttributes) | ||
[agat_sp_add_introns.pl](https://github.com/NBISweden/AGAT/wiki/agat_sp_add_introns) | ||
[agat_sp_add_start_and_stop.pl](https://github.com/NBISweden/AGAT/wiki/agat_sp_add_start_and_stop) | ||
[agat_sp_alignment_output_style.pl](https://github.com/NBISweden/AGAT/wiki/agat_sp_alignment_output_style) | ||
[agat_sp_clipN_seqExtremities_and_fixCoordinates.pl](https://github.com/NBISweden/AGAT/wiki/agat_sp_clipN_seqExtremities_and_fixCoordinates) | ||
[agat_sp_compare_two_BUSCOs.pl](https://github.com/NBISweden/AGAT/wiki/agat_sp_compare_two_BUSCOs) | ||
[agat_sp_compare_two_annotations.pl](https://github.com/NBISweden/AGAT/wiki/agat_sp_compare_two_annotations) | ||
[agat_sp_complement_annotations.pl](https://github.com/NBISweden/AGAT/wiki/agat_sp_complement_annotations) | ||
[agat_sp_ensembl_output_style.pl](https://github.com/NBISweden/AGAT/wiki/agat_sp_ensembl_output_style) | ||
[agat_sp_extract_attributes.pl](https://github.com/NBISweden/AGAT/wiki/agat_sp_extract_attributes) | ||
[agat_sp_extract_sequences.pl](https://github.com/NBISweden/AGAT/wiki/agat_sp_extract_sequences) | ||
[agat_sp_filter_by_ORF_size.pl](https://github.com/NBISweden/AGAT/wiki/agat_sp_filter_by_ORF_size) | ||
[agat_sp_filter_by_locus_distance.pl](https://github.com/NBISweden/AGAT/wiki/agat_sp_filter_by_locus_distance) | ||
[agat_sp_filter_by_mrnaBlastValue.pl](https://github.com/NBISweden/AGAT/wiki/agat_sp_filter_by_mrnaBlastValue) | ||
[agat_sp_filter_feature_by_attribute_presence.pl](https://github.com/NBISweden/AGAT/wiki/agat_sp_filter_feature_by_attribute_presence) | ||
[agat_sp_filter_feature_by_attribute_value.pl](https://github.com/NBISweden/AGAT/wiki/agat_sp_filter_feature_by_attribute_value) | ||
[agat_sp_filter_feature_from_keep_list.pl](https://github.com/NBISweden/AGAT/wiki/agat_sp_filter_feature_from_keep_list) | ||
[agat_sp_filter_feature_from_kill_list.pl](https://github.com/NBISweden/AGAT/wiki/agat_sp_filter_feature_from_kill_list) | ||
[agat_sp_filter_gene_by_intron_numbers.pl](https://github.com/NBISweden/AGAT/wiki/agat_sp_filter_gene_by_intron_numbers) | ||
[agat_sp_filter_gene_by_length.pl](https://github.com/NBISweden/AGAT/wiki/agat_sp_filter_gene_by_length) | ||
[agat_sp_filter_incomplete_gene_coding_models.pl](https://github.com/NBISweden/AGAT/wiki/agat_sp_filter_incomplete_gene_coding_models) | ||
[agat_sp_filter_record_by_coordinates.pl](https://github.com/NBISweden/AGAT/wiki/agat_sp_filter_record_by_coordinates) | ||
[agat_sp_fix_cds_phases.pl](https://github.com/NBISweden/AGAT/wiki/agat_sp_fix_cds_phases) | ||
[agat_sp_fix_features_locations_duplicated.pl](https://github.com/NBISweden/AGAT/wiki/agat_sp_fix_features_locations_duplicated) | ||
[agat_sp_fix_fusion.pl](https://github.com/NBISweden/AGAT/wiki/agat_sp_fix_fusion) | ||
[agat_sp_fix_longest_ORF.pl](https://github.com/NBISweden/AGAT/wiki/agat_sp_fix_longest_ORF) | ||
[agat_sp_fix_overlaping_genes.pl](https://github.com/NBISweden/AGAT/wiki/agat_sp_fix_overlaping_genes) | ||
[agat_sp_fix_small_exon_from_extremities.pl](https://github.com/NBISweden/AGAT/wiki/agat_sp_fix_small_exon_from_extremities) | ||
[agat_sp_flag_premature_stop_codons.pl](https://github.com/NBISweden/AGAT/wiki/agat_sp_flag_premature_stop_codons) | ||
[agat_sp_flag_short_introns.pl](https://github.com/NBISweden/AGAT/wiki/agat_sp_flag_short_introns) | ||
[agat_sp_functional_statistics.pl](https://github.com/NBISweden/AGAT/wiki/agat_sp_functional_statistics) | ||
[agat_sp_keep_longest_isoform.pl](https://github.com/NBISweden/AGAT/wiki/agat_sp_keep_longest_isoform) | ||
[agat_sp_kraken_assess_liftover.pl](https://github.com/NBISweden/AGAT/wiki/agat_sp_kraken_assess_liftover) | ||
[agat_sp_list_short_introns.pl](https://github.com/NBISweden/AGAT/wiki/agat_sp_list_short_introns) | ||
[agat_sp_load_function_from_protein_align.pl](https://github.com/NBISweden/AGAT/wiki/agat_sp_load_function_from_protein_align) | ||
[agat_sp_manage_IDs.pl](https://github.com/NBISweden/AGAT/wiki/agat_sp_manage_IDs) | ||
[agat_sp_manage_UTRs.pl](https://github.com/NBISweden/AGAT/wiki/agat_sp_manage_UTRs) | ||
[agat_sp_manage_attributes.pl](https://github.com/NBISweden/AGAT/wiki/agat_sp_manage_attributes) | ||
[agat_sp_manage_functional_annotation.pl](https://github.com/NBISweden/AGAT/wiki/agat_sp_manage_functional_annotation) | ||
[agat_sp_manage_introns.pl](https://github.com/NBISweden/AGAT/wiki/agat_sp_manage_introns) | ||
[agat_sp_merge_annotations.pl](https://github.com/NBISweden/AGAT/wiki/agat_sp_merge_annotations) | ||
[agat_sp_prokka_fix_fragmented_gene_annotations.pl](https://github.com/NBISweden/AGAT/wiki/agat_sp_prokka_fix_fragmented_gene_annotations) | ||
[agat_sp_sensitivity_specificity.pl](https://github.com/NBISweden/AGAT/wiki/agat_sp_sensitivity_specificity) | ||
[agat_sp_separate_by_record_type.pl](https://github.com/NBISweden/AGAT/wiki/agat_sp_separate_by_record_type) | ||
[agat_sp_statistics.pl](https://github.com/NBISweden/AGAT/wiki/agat_sp_statistics) | ||
[agat_sp_webApollo_compliant.pl](https://github.com/NBISweden/AGAT/wiki/agat_sp_webApollo_compliant) | ||
[agat_sq_add_attributes_from_tsv.pl](https://github.com/NBISweden/AGAT/wiki/agat_sq_add_attributes_from_tsv) | ||
[agat_sq_add_hash_tag.pl](https://github.com/NBISweden/AGAT/wiki/agat_sq_add_hash_tag) | ||
[agat_sq_add_locus_tag.pl](https://github.com/NBISweden/AGAT/wiki/agat_sq_add_locus_tag) | ||
[agat_sq_keep_annotation_from_fastaSeq.pl](https://github.com/NBISweden/AGAT/wiki/agat_sq_keep_annotation_from_fastaSeq) | ||
[agat_sq_list_attributes.pl](https://github.com/NBISweden/AGAT/wiki/agat_sq_list_attributes) | ||
[agat_sq_manage_IDs.pl](https://github.com/NBISweden/AGAT/wiki/agat_sq_manage_IDs) | ||
[agat_sq_manage_attributes.pl](https://github.com/NBISweden/AGAT/wiki/agat_sq_manage_attributes) | ||
[agat_sq_mask.pl](https://github.com/NBISweden/AGAT/wiki/agat_sq_mask) | ||
[agat_sq_remove_redundant_entries.pl](https://github.com/NBISweden/AGAT/wiki/agat_sq_remove_redundant_entries) | ||
[agat_sq_repeats_analyzer.pl](https://github.com/NBISweden/AGAT/wiki/agat_sq_repeats_analyzer) | ||
[agat_sq_rfam_analyzer.pl](https://github.com/NBISweden/AGAT/wiki/agat_sq_rfam_analyzer) | ||
[agat_sq_split.pl](https://github.com/NBISweden/AGAT/wiki/agat_sq_split) | ||
[agat_sq_stat_basic.pl](https://github.com/NBISweden/AGAT/wiki/agat_sq_stat_basic) |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,58 @@ | ||
# NAME | ||
|
||
agat\_convert\_bed2gff.pl | ||
|
||
# DESCRIPTION | ||
|
||
The script takes a bed file as input, and will translate it in gff format. | ||
The BED format is described here: https://genome.ucsc.edu/FAQ/FAQformat.html#format1 | ||
The script converts 0-based, half-open \[start-1, end) bed file to | ||
1-based, closed \[start, end\] General Feature Format v3 (GFF3). | ||
|
||
# SYNOPSIS | ||
|
||
``` | ||
agat_convert_bed2gff.pl --bed infile.bed [ -o outfile ] | ||
agat_convert_bed2gff.pl -h | ||
``` | ||
|
||
# OPTIONS | ||
|
||
- **--bed** | ||
|
||
Input bed file that will be converted. | ||
|
||
- **--source** | ||
|
||
The source informs about the tool used to produce the data and is stored in 2nd field of a gff file. | ||
Example: Stringtie,Maker,Augustus,etc. \[default: data\] | ||
|
||
- **--primary\_tag** | ||
|
||
The primary\_tag corresponds to the data type and is stored in 3rd field of a gff file. | ||
Example: gene,mRNA,CDS,etc. \[default: gene\] | ||
|
||
- **--inflate\_off** | ||
|
||
By default we inflate the block fields (blockCount, blockSizes, blockStarts) to create subfeatures | ||
of the main feature (primary\_tag). The type of subfeature created is based on the | ||
inflate\_type parameter. If you do not want this inflating behaviour you can deactivate it | ||
by using the --inflate\_off option. | ||
|
||
- **--inflate\_type** | ||
|
||
Feature type (3rd column in gff) created when inflate parameter activated \[default: exon\]. | ||
|
||
- **--verbose** | ||
|
||
add verbosity | ||
|
||
- **-o** , **--output** , **--out** , **--outfile** or **--gff** | ||
|
||
Output GFF file. If no output file is specified, the output will be | ||
written to STDOUT. | ||
|
||
- **-h** or **--help** | ||
|
||
Display this helpful text. | ||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,38 @@ | ||
# NAME | ||
|
||
gaas\_converter\_embl2gff.pl | ||
|
||
# DESCRIPTION | ||
|
||
The script takes an EMBL file as input, and will translate it in gff format. | ||
|
||
# SYNOPSIS | ||
|
||
``` | ||
gaas_converter_embl2gff.pl --embl infile.embl [ -o outfile ] | ||
``` | ||
|
||
# OPTIONS | ||
|
||
- **--embl** | ||
|
||
Input EMBL file that will be read | ||
|
||
- **--primary\_tag**, **--pt**, **-t** | ||
|
||
List of "primary tag". Useful to discard or keep specific features. | ||
Multiple tags must be coma-separated. | ||
|
||
- **-d** | ||
|
||
Means that primary tags provided by the option "primary\_tag" will be discarded. | ||
|
||
- **-o**, **--output**, **--out**, **--outfile** or **--gff** | ||
|
||
Output GFF file. If no output file is specified, the output will be | ||
written to STDOUT. | ||
|
||
- **-h** or **--help** | ||
|
||
Display this helpful text. | ||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,65 @@ | ||
# NAME | ||
|
||
agat\_convert\_genscan2gff.pl | ||
|
||
# DESCRIPTION | ||
|
||
The script takes a genscan file as input, and will translate it in gff format. | ||
The genscan format is described here: http://genome.crg.es/courses/Bioinformatics2003\_genefinding/results/genscan.html | ||
/!\\ vvv Known problem vvv /!\\ | ||
You must have submited only DNA sequence, wihtout any header!! | ||
Indeed the tool expects only DNA sequences and does not crash/warn if an header | ||
is submited along the sequence. | ||
e.g If you have an header ">seq" s-e-q are seen as the 3 first nucleotides of the sequence. | ||
Then all prediction location are shifted accordingly. | ||
(checked only on the online version http://argonaute.mit.edu/GENSCAN.html. I don't | ||
know if there is the same pronlem elsewhere.) | ||
/!\\ ^^^ Known problem ^^^^ /!\\ | ||
|
||
# SYNOPSIS | ||
|
||
``` | ||
agat_convert_genscan2gff.pl --genscan infile.bed [ -o outfile ] | ||
agat_convert_genscan2gff.pl -h | ||
``` | ||
|
||
# OPTIONS | ||
|
||
- **--genscan** or **-g** | ||
|
||
Input bed file that will be convert. | ||
|
||
- **--source** | ||
|
||
The source informs about the tool used to produce the data and is stored in 2nd field of a gff file. | ||
Example: Stringtie,Maker,Augustus,etc. \[default: data\] | ||
|
||
- **--primary\_tag** | ||
|
||
The primary\_tag corresponf to the data type and is stored in 3rd field of a gff file. | ||
Example: gene,mRNA,CDS,etc. \[default: gene\] | ||
|
||
- **--inflate\_off** | ||
|
||
By default we inflate the block fields (blockCount, blockSizes, blockStarts) to create subfeatures | ||
of the main feature (primary\_tag). Type of subfeature created based on the | ||
inflate\_type parameter. If you don't want this inflating behaviour you can deactivate it | ||
by using the option --inflate\_off. | ||
|
||
- **--inflate\_type** | ||
|
||
Feature type (3rd column in gff) created when inflate parameter activated \[default: exon\]. | ||
|
||
- **--verbose** | ||
|
||
add verbosity | ||
|
||
- **-o** , **--output** , **--out** , **--outfile** or **--gff** | ||
|
||
Output GFF file. If no output file is specified, the output will be | ||
written to STDOUT. | ||
|
||
- **-h** or **--help** | ||
|
||
Display this helpful text. | ||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,36 @@ | ||
# NAME | ||
|
||
gaas\_convert\_mfannot2gff.pl | ||
|
||
# DESCRIPTION | ||
|
||
Conversion utility for MFannot "masterfile" annotation produced by the MFannot | ||
pipeline (http://megasun.bch.umontreal.ca/RNAweasel/). Reports GFF3 format. | ||
|
||
# SYNOPSIS | ||
|
||
``` | ||
gaas_convert_mfannot2gff.pl -m <mfannot> -o <gff> | ||
gaas_convert_mfannot2gff.pl --help | ||
``` | ||
|
||
# COPYRIGHT AND LICENSE | ||
|
||
Copyright (C) 2015, Brandon Seah (kbseah@mpi-bremen.de) | ||
... GPL-3 ... | ||
modified by jacques dainat 2017-11 | ||
|
||
# OPTIONS | ||
|
||
- **-m** or **-i** or **--mfannot** | ||
|
||
The mfannot input file | ||
|
||
- **-g** or **-o** or **--gff** | ||
|
||
the gff output file | ||
|
||
- **-h** or **--help** | ||
|
||
Display this helpful text. | ||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,44 @@ | ||
# NAME | ||
|
||
agat\_convert\_sp\_minimap2\_bam2gff.pl | ||
|
||
# DESCRIPTION | ||
|
||
The script converts output from minimap2 (bam or sam) into gff file. | ||
To get bam from minimap2 use the following command: | ||
minimap2 -ax splice:hq genome.fa Asecodes\_parviclava.nucest.fa | samtools sort -O BAM -o output.bam | ||
To use bam with this script you will need samtools in your path. | ||
|
||
# SYNOPSIS | ||
|
||
``` | ||
agat_convert_sp_minimap2_bam2gff.pl -i infile.bam [ -o outfile ] | ||
agat_convert_sp_minimap2_bam2gff.pl -i infile.sam [ -o outfile ] | ||
agat_convert_sp_minimap2_bam2gff.pl --help | ||
``` | ||
|
||
# OPTIONS | ||
|
||
if ( !GetOptions( 'i|input=s' => \\$opt\_in, | ||
|
||
- **-i** or **--input** | ||
|
||
Input file in sam (.sam extension) or bam (.bam extension) format. | ||
|
||
- **-b** or **--bam** | ||
|
||
To force to use the input file as sam file. | ||
|
||
- **-s** or **--sam** | ||
|
||
To force to use the input file as sam file. | ||
|
||
- **-o**, **--out** or **--output** | ||
|
||
Output GFF file. If no output file is specified, the output will be | ||
written to STDOUT. | ||
|
||
- **-h** or **--help** | ||
|
||
Display this helpful text. | ||
|
Oops, something went wrong.