improve doc, help, info (#294)
* improve doc, help, info
Juke34 committed Oct 26, 2022
1 parent 7458305 commit ab147d7
Showing 12 changed files with 143 additions and 168 deletions.
30 changes: 15 additions & 15 deletions README.md
@@ -36,10 +36,10 @@ Suite of tools to handle gene annotations in any GTF/GFF format.
* [Usage](#usage)
* [List of tools](#list-of-tools)
* [More about the tools](#more-about-the-tools)
* [Omniscient - Standardisation for a full GFF3 compliant to any tool](#omniscient---standardisation-for-a-full-gff3-compliant-to-any-tool)
* [Omniscient data structure](#omniscient-data-structure)
* [How does the Omniscient parser work](#how-does-the-omniscient-parser-work)
* [What can the Omniscient parser do for you](#what-can-the-omniscient-parser-do-for-you)
* [The AGAT parser - Standardisation to create GXF files compliant to any tool](#the-agat-parser---standardisation-to-create-gxf-files-compliant-to-any-tool)
* [The data structure](#the-data-structure)
* [How does the AGAT parser work](#how-does-the-agat-parser-work)
* [What can the AGAT parser do for you](#what-can-the-agat-parser-do-for-you)
* [examples](#examples)
* [How to cite?](#how-to-cite)
* [Publication using AGAT](#publication-using-agat)
@@ -293,34 +293,34 @@ To have a look to the available tools you have several approaches:

#### with \_sp\_ prefix => Means SLURP

The gff file will be charged in memory Omniscient data structure that is way to facilitate access to desired features at any time.
The GFF file is loaded into memory in a specific data structure that gives access to any feature at any time.
This has a memory cost but makes life smoother: complicated tasks can be performed in a more time-efficient way.
Moreover, it allows all potential errors to be fixed, within the limits of what the format itself permits.
See the Omniscient section for more information about it.
See the AGAT parser section for more information about it.

#### with \_sq\_ prefix => Means SEQUENTIAL

The gff file is read and processed from its top to the end line by line without sanity check. This is memory efficient.
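
The difference between the two prefixes can be illustrated with a minimal Python sketch. This is not AGAT code (AGAT is written in Perl); the function names `parse_line`, `sequential` and `slurp` and the two-line sample are hypothetical and exist only to contrast the two reading strategies.

```python
import io

# Tiny illustrative GFF3 sample (hypothetical data).
GFF_TEXT = (
    "##gff-version 3\n"
    "chr1\tsrc\tgene\t1\t100\t.\t+\t.\tID=g1\n"
    "chr1\tsrc\tmRNA\t1\t100\t.\t+\t.\tID=m1;Parent=g1\n"
)

def parse_line(line):
    """Split one GFF3 line into a small dict (columns 1, 3, 4, 5 and 9 only)."""
    cols = line.rstrip("\n").split("\t")
    attrs = dict(pair.split("=", 1) for pair in cols[8].split(";") if pair)
    return {"seqid": cols[0], "type": cols[2],
            "start": int(cols[3]), "end": int(cols[4]), "attributes": attrs}

def sequential(handle):
    """_sq_ style: yield one feature at a time; nothing is kept in memory."""
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue
        yield parse_line(line)

def slurp(handle):
    """_sp_ style: load every feature first, indexed by ID, for later lookups.
    Assumes every feature has an ID attribute, which holds for this sample."""
    return {f["attributes"]["ID"]: f for f in sequential(handle)}

# Sequential: features are seen once, in file order.
types_seen = [f["type"] for f in sequential(io.StringIO(GFF_TEXT))]

# Slurp: any feature can be reached at any time, e.g. the parent of m1.
by_id = slurp(io.StringIO(GFF_TEXT))
parent_of_m1 = by_id[by_id["m1"]["attributes"]["Parent"]]
```

The slurp variant pays for holding everything in memory, but it is the only one of the two that can resolve a relationship such as "the parent of m1" regardless of where the parent sits in the file.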

## Omniscient - Standardisation for a full GFF3 compliant to any tool
## The AGAT parser - Standardisation to create GXF compliant to any tool

All tools with `agat_sp_` prefix will parse and slurps the entire data into a data structure called Omniscient.
Below you will find more information about peculiarity of the Omniscient structure,
All tools with the `agat_sp_` prefix parse and slurp the entire data into a specific data structure.
Below you will find more information about the peculiarities of this data structure,
and the parsing approach used.

#### Omniscient data structure
#### the data structure

The method create a hash structure containing all the data in memory. We call it OMNISCIENT. The OMNISCIENT structure is a three levels structure:
The method creates a hash structure containing all the data in memory. We can call it OMNISCIENT. The OMNISCIENT structure has three levels:
```
$omniscient{level1}{tag_l1}{level1_id} = feature <= tag could be gene, match
$omniscient{level2}{tag_l2}{idY} = @featureListL2 <= tag could be mRNA,rRNA,tRNA,etc. idY is a level1_id (known as the Parent attribute within the level2 feature). The @featureListL2 is a list so that isoforms can be handled.
$omniscient{level3}{tag_l3}{idZ} = @featureListL3 <= tag could be exon,cds,utr3,utr5,etc. idZ is the ID of a level2 feature (known as the Parent attribute within the level3 feature). The @featureListL3 is a list so that all features of the same tag are kept together.
```
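
To make the three-level layout concrete, here is a minimal Python analogue of the Perl hash described above. It is only a sketch: the helper `new_omniscient` and the feature dictionaries are hypothetical, and real AGAT stores feature objects rather than plain dicts.

```python
from collections import defaultdict

def new_omniscient():
    """Return an empty three-level container mirroring the layout above."""
    return {
        # level1: tag -> level1 ID -> a single feature
        "level1": defaultdict(dict),
        # level2: tag -> parent level1 ID -> list of features (isoforms)
        "level2": defaultdict(lambda: defaultdict(list)),
        # level3: tag -> parent level2 ID -> list of features of that tag
        "level3": defaultdict(lambda: defaultdict(list)),
    }

omniscient = new_omniscient()

# One gene (level1) ...
omniscient["level1"]["gene"]["gene1"] = {"type": "gene", "ID": "gene1"}

# ... with two isoforms (level2), both keyed by the ID of their parent gene ...
for mrna_id in ("mrna1", "mrna2"):
    omniscient["level2"]["mRNA"]["gene1"].append(
        {"type": "mRNA", "ID": mrna_id, "Parent": "gene1"}
    )

# ... and two exons (level3) keyed by the ID of their parent mRNA.
for exon_id in ("exon1", "exon2"):
    omniscient["level3"]["exon"]["mrna1"].append(
        {"type": "exon", "ID": exon_id, "Parent": "mrna1"}
    )

# All isoforms of gene1, then all exons of the first isoform.
isoforms = [f["ID"] for f in omniscient["level2"]["mRNA"]["gene1"]]
exons_of_first = [f["ID"] for f in omniscient["level3"]["exon"][isoforms[0]]]
```

Keying level2 and level3 by the parent's ID, rather than by the feature's own ID, is what makes it cheap to walk from a gene down to its transcripts and from a transcript down to its exons.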

#### How does the Omniscient parser work
#### How does the AGAT parser work

The Omniscient parser phylosophy:
* 1) Parse by Parent/child relationship
The AGAT parser philosophy:
* 1) Parse by Parent/child relationship or gene_id/transcript_id relationship.
* 2) ELSE Parse by a common tag (an attribute value shared by features that must be grouped together. By default we use locus_tag, but this can be set by parameter).
* 3) ELSE Parse sequentially (features are grouped in a bucket, the bucket changes at each level2 feature, and buckets are joined under a common tag at each new L1 feature).

@@ -329,7 +329,7 @@ The Omniscient parser phylosophy:
To summarise, the parsing priority is: **Parent/child relationship > locus_tag > sequential.**
The parser may use only one or a mix of these approaches, depending on the peculiarities of the GTF/GFF file you provide.
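
The priority order can be sketched as a small decision function. This is an illustrative Python sketch, not the actual AGAT implementation: the name `choose_strategy` and its return values are assumptions, and only the `Parent` attribute is checked for the relationship case.

```python
def choose_strategy(attributes, common_tag="locus_tag"):
    """Return (strategy, grouping value) for one feature's attributes.

    Priority: explicit Parent relationship, then a shared common tag,
    then sequential grouping as the last resort.
    """
    if "Parent" in attributes:
        return ("parent", attributes["Parent"])
    if common_tag in attributes:
        return ("common_tag", attributes[common_tag])
    # No usable attribute: the feature is grouped by its position in the file.
    return ("sequential", None)

with_parent = choose_strategy({"ID": "exon1", "Parent": "mrna1", "locus_tag": "L1"})
with_tag = choose_strategy({"ID": "exon2", "locus_tag": "L1"})
with_nothing = choose_strategy({"ID": "exon3"})
custom_tag = choose_strategy({"gene_name": "abc"}, common_tag="gene_name")
```

Because the decision is made per feature, a single file can end up parsed with a mix of the three approaches, which matches the behaviour described above.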

#### What can the Omniscient parser do for you
#### What can the AGAT parser do for you

* It creates missing parental features. (e.g if a level2 or level3 feature do not have parental feature(s) we create the missing level2 and/or level1 feature(s)).
* It creates missing mandatory attributes (ID and/or Parent).
29 changes: 24 additions & 5 deletions bin/agat
@@ -43,13 +43,27 @@ my $application = {
MAIN => {
help => $header,
description =>
'AGAT has the power to check, fix, pad missing information (features/
attributes) of any kind of GTF and GFF to create complete, sorted and
standardised gff3 format. Over the years it has been enriched by many many
'AGAT checks, fixes and pads missing information (features/attributes) of any
kind of GTF/GFF (GXF) file and creates complete, sorted and standardised
GFF/GTF formatted files. Over the years it has been enriched by many
tools to perform just about any task related to GTF/GFF
format files (sanitizing, conversions, merging, modifying, filtering, FASTA
sequence extraction, adding information, etc). Compared to other methods
AGAT is robust to even the most despicable GTF/GFF files.',
AGAT is robust to even the most despicable GTF/GFF files.
By default, AGAT automatically selects the appropriate parser and generates
a GFF3 output. This can be tuned via the config file.
Configuration (optional)
========================
To access the config.yaml configuration file: agat config --expose
The file will appear in the working folder. By default, AGAT uses the config file from the working directory when one exists.
Feature levels (optional)
========================
To access the AGAT feature_levels file: agat levels --expose
The file will appear in the working folder. By default, AGAT uses the feature_levels file from the working directory when one exists.
',
children => [qw< levels config >],

# allow for configuration files
@@ -71,6 +85,11 @@ AGAT is robust to even the most despicable GTF/GFF files.',
shortbool => 1,
help => 'Display the AGAT tools available',
},
{
getopt => 'info|i!',
shortbool => 1,
help => 'Display information on how AGAT works',
},
],
commit => '#handle_main',
},
@@ -83,7 +102,7 @@ AGAT is robust to even the most despicable GTF/GFF files.',
use this option. It will copy into your working directory the feature_levels.yaml file
used to define the relationships between feature types and their level organisation.
Typical level organisation: Level1 => gene; Level2 => mRNA; level3 => exon,cds,
utrs. If you get warning from the Omniscient parser that a feature relationship
utrs. If you get a warning from the AGAT parser that a feature relationship
is not defined, you can provide information about it within the exposed feature_levels.yaml
file. Indeed, if the feature_levels.yaml file exists in your working directory, it will be
used by default.',
2 changes: 1 addition & 1 deletion bin/agat_convert_sp_gff2gtf.pl
@@ -96,7 +96,7 @@ =head1 DESCRIPTION
To be fully GTF compliant, all features must have a gene_id and a transcript_id attribute.
The gene_id is a unique identifier for the genomic source of the transcript, which is
used to group transcripts into genes.
The transcript_id is a unique identifier for the predicted transcript,
The transcript_id is a unique identifier for the predicted transcript,
which is used to group features into transcripts.
=head1 SYNOPSIS
43 changes: 7 additions & 36 deletions bin/agat_convert_sp_gxf2gxf.pl
@@ -77,43 +77,14 @@ =head1 NAME
=head1 DESCRIPTION
This script fixes and/or standardizes any GTF/GFF file into full sorted GFF3 file.
The output GFF syntax is shaped by bioperl and choose among the versions
1,2,2.5 (GTF equivalent) and 3. For a correct GTF file, it is recommended to use
agat_convert_sp_gff2gtf.pl
Without specifying an input GTF/GFF version, the Omniscient parser will first detect
automtically the most appropriate GFF parser to use from bioperl (GFF1,GFF2,GFF3)
in order to read you file properly.
Then the Omniscient parser removes duplicate features, fixes duplicated IDs,
adds missing ID and/or Parent attributes, deflates factorized attributes
(attributes with several parents are duplicated with uniq ID), add missing features
when possible (e.g. add exon if only CDS described, add UTR if CDS and exon described),
This script fixes and/or standardizes any GTF/GFF file into a fully sorted GTF/GFF file.
The AGAT parser removes duplicate features, fixes duplicated IDs, adds missing ID and/or Parent attributes,
deflates factorized attributes (attributes with several parents are duplicated with a unique ID),
adds missing features when possible (e.g. adds exon if only CDS is described, adds UTR if CDS and exon are described),
fixes feature locations (e.g. checks that each exon is embedded in its parent features mRNA, gene), etc.
All AGAT's scripts with the _sp_ prefix use the same parser, before to perform supplement tasks.
With that script you can tuned the Omniscient parser behaviour. I.e. you can decide
to merge loci that have an overlap at their CDS features (Only one top feature
is kept (gene), and the mRNA features become isoforms). This is not activated by
default in case you are working on a prokaryote annotation that often have overlaping
loci.
The Omniscient parser defines relationship between features using 3 levels.
e.g Level1=gene; Level2=mRNA,tRNA; Level3=exon,cds,utr.
The feature type information is stored within the 3rd column of a GTF/GFF file.
The parser need to know to which level a feature type is part of. This information
is stored by default in a yaml file coming with the tool. We have implemented the
most common feature types met in gff/gtf files. If a feature type is not yet handle
by the parser it will throw a warning. You can easily inform the parser how
to handle it (level1, level2 or level3) by modifying the feature_levels.yaml file.
How to access this file? Easy just run: agat levels --expose
The yaml file will appear in the working folder. By default, the Omniscient parser
use the feature_levels.yaml file from the working directory when any.
Omniscient parser phylosophy:
Parse by Parent/child relationship
ELSE Parse by a comomn tag (an attribute value shared by feature that must be grouped together.
By default we are using locus_tag and gene_id as locus tag, but you can specify the one of your choice
ELSE Parse sequentially (features are grouped in a bucket, and the bucket change at each level2 feature met, and bucket(s) are linked to the first l1 top feature met)
All AGAT scripts with the _sp_ prefix use the AGAT parser before performing any supplementary task.
So, it is not necessary to run this script prior to using any other _sp_ script.
=head1 SYNOPSIS
6 changes: 0 additions & 6 deletions bin/agat_sp_compare_two_BUSCOs.pl
@@ -174,11 +174,6 @@
}
}

# set verbosity for the parser. Quiete except if verbose == 66

my $parser_verbosity = -1;
$parser_verbosity = 0 if ($verbose and $verbose == 66) ; # put to -1 make the parser quiete even for warnings.

#extract gff from folder1
my $full_omniscient={};
my $loop = 0;
@@ -358,7 +353,6 @@ =head1 OPTIONS
=item B<-v> or B<--verbose>
Integer: For displaying extra information use -v 1.
For activating the verbosity in the omniscient parser use -v 66. (not recommended)
=item B<-o> or B<--output>
2 changes: 1 addition & 1 deletion bin/agat_sp_merge_annotations.pl
@@ -92,7 +92,7 @@ =head1 NAME
=head1 DESCRIPTION
This script merges different GFF annotation files into one.
It uses the Omniscient parser that takes care of duplicated names and fixes other oddities met in those files.
It uses the AGAT parser that takes care of duplicated names and fixes other oddities met in those files.
=head1 SYNOPSIS
20 changes: 10 additions & 10 deletions docs/agat_how_does_it_work.md
@@ -4,20 +4,20 @@ All tools taking GFF/GTF as input can be divided in two groups: \_sp\_ and \_sq\

* Tools with \_sp\_ prefix

\_sp\_ stands for SLURP. Those tools will charge the file in memory Omniscient data structure. It has a memory cost but makes life smoother. Indeed, it allows to perform complicated tasks in a more time efficient way ( Any features can be accessed at any time by AGAT).
\_sp\_ stands for SLURP. Those tools load the file into memory in a specific data structure. This has a memory cost but makes life smoother: complicated tasks can be performed in a more time-efficient way (any feature can be accessed at any time by AGAT).
Moreover, it allows all potential errors to be fixed, within the limits of what the format itself permits.
See the Omniscient section for more information about it.
See the AGAT parser section for more information about it.

* with \_sq\_ prefix

\_sq\_ stands for SEQUENTIAL. Those tools read and process GFF/GTF files from top to bottom, line by line, performing tasks on the fly. This is memory efficient, but the sanity check of the file is minimal. Those tools are not intended to perform complex tasks.

## Omniscient / parsing performed by \_sp\_ prefix tools / Standardisation for a full GFF3 compliant to any tool
## The AGAT parser / used by \_sp\_ prefix tools / Standardisation to create GXF files compliant to any tool

The first step of AGAT' tools with the \_sp\_ prefix of is to fix the file to standardize it. (e.g. a file containing only exon will be modified to create mRNA and gene features). To perform this task AGAT parses and slurps the entire data into a data structure called Omniscient.
Below you will find more information about peculiarity of the Omniscient structure, and the parsing approach used.
The first step of AGAT's tools with the \_sp\_ prefix is to fix the file in order to standardize it (e.g. a file containing only exons will be modified to create mRNA and gene features). To perform this task, AGAT parses and slurps the entire data into a specific data structure.
Below you will find more information about the peculiarities of this data structure, and the parsing approach used.

### What performs the Omniscient parser
### What the AGAT parser does

* It creates missing parental features. (e.g if a level2 or level3 feature do not have parental feature(s) we create the missing level2 and/or level1 feature(s)).
* It creates missing mandatory attributes (ID and/or Parent).
@@ -30,7 +30,7 @@ Below you will find more information about peculiarity of the Omniscient structu
* It groups features together (if related features are spread at different places in the file).


### Omniscient data structure
### The data structure

The method creates a hash structure containing all the data in memory. We call it OMNISCIENT. The OMNISCIENT structure has three levels:
```
@@ -39,9 +39,9 @@ $omniscient{level2}{tag_l2}{idY} = @featureListL2 <= tag could be mRNA,rRNA,tRNA
$omniscient{level3}{tag_l3}{idZ} = @featureListL3 <= tag could be exon,cds,utr3,utr5,etc. idZ is the ID of a level2 feature (known as the Parent attribute within the level3 feature). The @featureListL3 is a list so that all features of the same tag are kept together.
```

### How does the Omniscient parser work
### How does the AGAT parser work

To resume by priority of way to parse: **Parent/child relationship > common attribute/tag > sequential.**
To summarise, the parsing priority is: **Parent/child or gene_id/transcript_id relationship > common attribute/tag > sequential.**
The parser may use only one or a mix of these approaches, depending on the peculiarities of the GTF/GFF file you provide.
If you need to use the `--ct` option, you will have to process the file with `agat_convert_sp_gxf2gxf.pl` first, before running any other tool.

@@ -90,7 +90,7 @@ Example of relationship made sequentially:

### Particular case

Below you will find more information about peculiarity of the Omniscient structure, and the parsing approach used.
Below you will find more information about peculiar GXF files and how the AGAT parser behaves and uses the different parsing approaches.

#### A. Level1 feature type missing and no Parent/gene_id
