improve doc, help, info (#294)

* improve doc, help, info
NBISweden · Oct 26, 2022 · ab147d7
1 parent 7458305
commit ab147d7

Showing 12 changed files with 143 additions and 168 deletions.
diff --git a/README.md b/README.md
@@ -36,10 +36,10 @@ Suite of tools to handle gene annotations in any GTF/GFF format.
    * [Usage](#usage)
    * [List of tools](#list-of-tools)
    * [More about the tools](#more-about-the-tools)
-   * [Omniscient - Standardisation for a full GFF3 compliant to any tool](#omniscient---standardisation-for-a-full-gff3-compliant-to-any-tool)
-      * [Omniscient data structure](#omniscient-data-structure)
-      * [How does the Omniscient parser work](#how-does-the-omniscient-parser-work)
-      * [What can the Omniscient parser do for you](#what-can-the-omniscient-parser-do-for-you)
+   * [The AGAT parser - Standardisation to create GXF files compliant to any tool](#the-agat-parser---standardisation-to-create-gxf-files-compliant-to-any-tool)
+      * [The data structure](#the-data-structure)
+      * [How does the AGAT parser work](#how-does-the-agat-parser-work)
+      * [What can the AGAT parser do for you](#what-can-the-agat-parser-do-for-you)
       * [examples](#examples)
    * [How to cite?](#how-to-cite)
    * [Publication using AGAT](#publication-using-agat)
@@ -293,34 +293,34 @@ To have a look to the available tools you have several approaches:
 
 #### with \_sp\_ prefix => Means SLURP
 
-The gff file will be charged in memory Omniscient data structure that is way to facilitate access to desired features at any time.
+The gff file will be charged in memory in a specific data structure facilitating the access to desired features at any time.
 It has a memory cost but make life smoother. Indeed, it allows to perform complicated tasks in a more time efficient way.
 Moreover, it allows to fix all potential errors in the limit of the possibilities given by the format itself.
-See the Omniscient section for more information about it.  
+See the AGAT parser section for more information about it.  
 
 #### with \_sq\_ prefix => Means SEQUENTIAL
 
 The gff file is read and processed from its top to the end line by line without sanity check. This is memory efficient.
 
-## Omniscient - Standardisation for a full GFF3 compliant to any tool  
+## The AGAT parser - Standardisation to create GXF compliant to any tool  
 
-All tools with `agat_sp_` prefix will parse and slurps the entire data into a data structure called Omniscient.
-Below you will find more information about peculiarity of the Omniscient structure,
+All tools with `agat_sp_` prefix will parse and slurps the entire data into a specific data structure called.
+Below you will find more information about peculiarity of the data structure,
 and the parsing approach used.
 
-#### Omniscient data structure
+#### the data structure
 
-The method create a hash structure containing all the data in memory. We call it OMNISCIENT. The OMNISCIENT structure is a three levels structure:
+The method create a hash structure containing all the data in memory. We can call it OMNISCIENT. The OMNISCIENT structure is a three levels structure:
 ```
 $omniscient{level1}{tag_l1}{level1_id} = feature <= tag could be gene, match  
 $omniscient{level2}{tag_l2}{idY} = @featureListL2 <= tag could be mRNA,rRNA,tRNA,etc. idY is a level1_id (know as Parent attribute within the level2 feature). The @featureListL2 is a list to be able to manage isoform cases.  
 $omniscient{level3}{tag_l3}{idZ} =  @featureListL3 <= tag could be exon,cds,utr3,utr5,etc. idZ is the ID of a level2 feature (know as Parent attribute within the level3 feature). The @featureListL3 is a list to be able to put all the feature of a same tag together.  
 ```
 
-#### How does the Omniscient parser work
+#### How does the AGAT parser work
 
-The Omniscient parser phylosophy:
-  * 1) Parse by Parent/child relationship  
+The AGAT parser phylosophy:
+  * 1) Parse by Parent/child relationship or gene_id/transcript_id relationship.
   * 2) ELSE Parse by a common tag  (an attribute value shared by feature that must be grouped together. By default we are using locus_tag but can be set by parameter).  
   * 3) ELSE Parse sequentially (mean group features in a bucket, and the bucket change at each level2 feature, and bucket are join in a common tag at each new L1 feature).  
 
@@ -329,7 +329,7 @@ The Omniscient parser phylosophy:
 To resume by priority of way to parse: **Parent/child relationship > locus_tag > sequential.**  
 The parser may used only one or a mix of these approaches according of the peculiarity of the gtf/gff file you provide.
 
-#### What can the Omniscient parser do for you
+#### What can the AGAT parser do for you
 
 * It creates missing parental features. (e.g if a level2 or level3 feature do not have parental feature(s) we create the missing level2 and/or level1 feature(s)).    
 * It creates missing mandatory attributes (ID and/or Parent).  
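
Note on the README hunk above: the three-level structure it describes can be illustrated with a small sketch. This is written in Python for illustration only, not AGAT's Perl, and the helper names (`new_omniscient`, `add_feature`) and the dictionary keys `type`, `ID` and `Parent` are assumptions made for the example, not AGAT's API.

```python
from collections import defaultdict


def new_omniscient():
    """Return an empty three-level container (hypothetical helper)."""
    return {
        # level1: tag -> level1 id -> a single feature
        "level1": defaultdict(dict),
        # level2: tag -> parent (level1) id -> list of features (isoforms)
        "level2": defaultdict(lambda: defaultdict(list)),
        # level3: tag -> parent (level2) id -> list of features of that tag
        "level3": defaultdict(lambda: defaultdict(list)),
    }


def add_feature(omni, level, feature):
    """Store a feature (a plain dict here) at the given level."""
    tag = feature["type"]
    if level == "level1":
        # Level1 features are keyed by their own ID.
        omni["level1"][tag][feature["ID"]] = feature
    else:
        # Level2 and level3 features are grouped under their Parent ID.
        omni[level][tag][feature["Parent"]].append(feature)


omni = new_omniscient()
add_feature(omni, "level1", {"type": "gene", "ID": "gene1"})
add_feature(omni, "level2", {"type": "mRNA", "ID": "mrna1", "Parent": "gene1"})
add_feature(omni, "level2", {"type": "mRNA", "ID": "mrna2", "Parent": "gene1"})
add_feature(omni, "level3", {"type": "exon", "ID": "exon1", "Parent": "mrna1"})
```

Keying level2 and level3 entries by the parent ID is what lets all isoforms of a gene, or all exons of a transcript, be fetched in one lookup.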

diff --git a/bin/agat b/bin/agat
@@ -43,13 +43,27 @@ my $application = {
       MAIN => {
          help        => $header,
          description =>
-    'AGAT has the power to check, fix, pad missing information (features/
-attributes) of any kind of GTF and GFF to create complete, sorted and
-standardised gff3 format. Over the years it has been enriched by many many
+'AGAT checks, fixes, pads missing information (features/attributes) of any
+kind of GTF/GFF (GXF) files and create complete, sorted and standardised 
+GFF/GTF formated files. Over the years it has been enriched by many many
 tools to perform just about any tasks that is possible related to GTF/GFF
 format files (sanitizing, conversions, merging, modifying, filtering, FASTA
 sequence extraction, adding information, etc). Comparing to other methods
-AGAT is robust to even the most despicable GTF/GFF files.',
+AGAT is robust to even the most despicable GTF/GFF files.
+
+By default AGAT automatically selects the appropriate parser and generates
+a GFF3 output by default. This can be tuned via the config file.
+
+Configuration (optional)
+========================
+To access the config.yaml configuration file: agat config --expose
+The file will appear in the working folder. By default, AGAT uses the config file from the working directory when any.
+
+Feature levels (optional)
+========================
+To access the AGAT feature_levels file: agat levels --expose
+The file will appear in the working folder. By default, AGAT uses the config file from the working directory when any.
+',
 				 children => [qw< levels config >],
 
          # allow for configuration files
@@ -71,6 +85,11 @@ AGAT is robust to even the most despicable GTF/GFF files.',
 							 shortbool   => 1,
 							 help        => 'Display the AGAT tools available',
 						},
+						{
+							 getopt      => 'info|i!',
+							 shortbool   => 1,
+							 help        => 'Display information on how AGAT works',
+						},
          ],
 				 commit	     => '#handle_main',
       },
@@ -83,7 +102,7 @@ AGAT is robust to even the most despicable GTF/GFF files.',
 use this option. It will copy past in you working directory the feature_levels.yaml file 
 used to define the relationships between feature types and their level organisation.
 Typical level organisation: Level1 => gene; Level2 => mRNA; level3 => exon,cds,
-utrs. If you get warning from the Omniscient parser that a feature relationship
+utrs. If you get warning from the AGAT parser that a feature relationship
 is not defined, you can provide information about it within the exposed feature_levels.yaml
 file. Indeed, if the feature_levels.yaml file exists in your working directory, it will be
 used by default.',
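
Note on the help text in this hunk: it states that a config.yaml or feature_levels.yaml found in the working directory is used by default. A minimal sketch of that lookup rule follows, in Python for illustration only. The function `resolve_config` and the notion of a "packaged default" path are assumptions for the example and say nothing about how AGAT implements it.

```python
import tempfile
from pathlib import Path


def resolve_config(working_dir, packaged_default, filename="config.yaml"):
    """Prefer a file in the working directory, else fall back to the default.

    Hypothetical helper: it only illustrates the precedence described in
    the help text.
    """
    local = Path(working_dir) / filename
    return local if local.is_file() else Path(packaged_default)


with tempfile.TemporaryDirectory() as tmp:
    default = Path(tmp) / "packaged" / "config.yaml"
    # No local file yet: the packaged default is selected.
    before = resolve_config(tmp, default)
    # After exposing a local copy, the local file takes precedence.
    (Path(tmp) / "config.yaml").write_text("# placeholder content\n")
    after = resolve_config(tmp, default)
```

The same rule would apply to feature_levels.yaml by passing `filename="feature_levels.yaml"`.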

diff --git a/bin/agat_convert_sp_gff2gtf.pl b/bin/agat_convert_sp_gff2gtf.pl
@@ -96,7 +96,7 @@ =head1 DESCRIPTION
 To be fully GTF compliant all feature have a gene_id and a transcript_id attribute.
 The gene_id	is unique identifier for the genomic source of the transcript, which is
 used to group transcripts into genes.
-The transcript_id	is a unique identifier for the predicted transcript,
+The transcript_id is a unique identifier for the predicted transcript,
 which is used to group features into transcripts.
 
 =head1 SYNOPSIS

diff --git a/bin/agat_convert_sp_gxf2gxf.pl b/bin/agat_convert_sp_gxf2gxf.pl
@@ -77,43 +77,14 @@ =head1 NAME
 
 =head1 DESCRIPTION
 
-This script fixes and/or standardizes any GTF/GFF file into full sorted GFF3 file.
-The output GFF syntax is shaped by bioperl and choose among the versions
-1,2,2.5 (GTF equivalent) and 3. For a correct GTF file, it is recommended to use
-agat_convert_sp_gff2gtf.pl
-
-Without specifying an input GTF/GFF version, the Omniscient parser will first detect
-automtically the most appropriate GFF parser to use from bioperl (GFF1,GFF2,GFF3)
-in order to read you file properly.
-Then the Omniscient parser removes duplicate features, fixes duplicated IDs,
-adds missing ID and/or Parent attributes, deflates factorized attributes
-(attributes with several parents are duplicated with uniq ID), add missing features
-when possible (e.g. add exon if only CDS described, add UTR if CDS and exon described),
+This script fixes and/or standardizes any GTF/GFF file into full sorted GTF/GFF file.
+It AGAT parser removes duplicate features, fixes duplicated IDs, adds missing ID and/or Parent attributes,
+deflates factorized attributes (attributes with several parents are duplicated with uniq ID),
+add missing features when possible (e.g. add exon if only CDS described, add UTR if CDS and exon described),
 fix feature locations (e.g. check exon is embedded in the parent features mRNA, gene), etc...
-All AGAT's scripts with the _sp_ prefix use the same parser, before to perform supplement tasks.
-With that script you can tuned the Omniscient parser behaviour. I.e. you can decide
-to merge loci that have an overlap at their CDS features (Only one top feature
-is kept (gene), and the mRNA features become isoforms). This is not activated by
-default in case you are working on a prokaryote annotation that often have overlaping
-loci.
-The Omniscient parser defines relationship between features using 3 levels.
-e.g Level1=gene; Level2=mRNA,tRNA; Level3=exon,cds,utr.
-The feature type information is stored within the 3rd column of a GTF/GFF file.
-The parser need to know to which level a feature type is part of. This information
-is stored by default in a yaml file coming with the tool. We have implemented the
-most common feature types met in gff/gtf files. If a feature type is not yet handle
-by the parser it will throw a warning. You can easily inform the parser how
-to handle it (level1, level2 or level3) by modifying the feature_levels.yaml file.
-How to access this file? Easy just run: agat levels --expose 
-The  yaml file will appear in the working folder. By default, the Omniscient parser 
-use the feature_levels.yaml file from the working directory when any.
-
-Omniscient parser phylosophy:
-
- Parse by Parent/child relationship
-   ELSE Parse by a comomn tag  (an attribute value shared by feature that must be grouped together.
-        By default we are using locus_tag and gene_id as locus tag, but you can specify the one of your choice
-     ELSE Parse sequentially (features are grouped in a bucket, and the bucket change at each level2 feature met, and bucket(s) are linked to the first l1 top feature met)
+
+All AGAT's scripts with the _sp_ prefix use the AGAT parser, before to perform any supplementary task.
+So, it is not necessary to run this script prior the use of any other _sp_ script. 
 
 =head1 SYNOPSIS
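
Note on the parsing priority stated in this commit (Parent/child or gene_id/transcript_id relationship, then a common tag such as locus_tag, then sequential): the order can be sketched as a simple fallback chain. The sketch is in Python for illustration only. `choose_strategy` and its return labels are hypothetical, and the real parser may mix approaches within one file.

```python
def choose_strategy(attributes, common_tags=("locus_tag",)):
    """Pick a grouping approach for one feature's attributes.

    `attributes` is a plain dict of the feature's attribute column;
    `common_tags` mirrors the documented default of locus_tag.
    """
    # 1) An explicit relationship wins when present.
    if any(key in attributes for key in ("Parent", "gene_id", "transcript_id")):
        return "relationship"
    # 2) Otherwise group by a shared attribute value.
    if any(tag in attributes for tag in common_tags):
        return "common_tag"
    # 3) Otherwise fall back to file order.
    return "sequential"


examples = [
    choose_strategy({"ID": "exon1", "Parent": "mrna1"}),
    choose_strategy({"gene_id": "g1", "transcript_id": "t1"}),
    choose_strategy({"locus_tag": "LT_0001"}),
    choose_strategy({"note": "no linking attribute"}),
]
```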
 

diff --git a/bin/agat_sp_compare_two_BUSCOs.pl b/bin/agat_sp_compare_two_BUSCOs.pl
@@ -174,11 +174,6 @@
   }
 }
 
-# set verbosity for the parser. Quiete except if verbose == 66
-
-my $parser_verbosity = -1;
-$parser_verbosity = 0 if ($verbose and $verbose == 66) ; # put to -1 make the parser quiete even for warnings.
-
 #extract gff from folder1
 my $full_omniscient={};
 my $loop = 0;
@@ -358,7 +353,6 @@ =head1 OPTIONS
 =item B<-v> or B<--verbose>
 
 Integer: For displaying extra information use -v 1.
-For activating the verbosity in the omniscient parser use -v 66. (not recommended)
 
 =item B<-o> or B<--output>
 

diff --git a/bin/agat_sp_merge_annotations.pl b/bin/agat_sp_merge_annotations.pl
@@ -92,7 +92,7 @@ =head1 NAME
 =head1 DESCRIPTION
 
 This script merge different gff annotation files in one.
-It uses the Omniscient parser that takes care of duplicated names and fixes other oddities met in those files.
+It uses the AGAT parser that takes care of duplicated names and fixes other oddities met in those files.
 
 =head1 SYNOPSIS
 

diff --git a/docs/agat_how_does_it_work.md b/docs/agat_how_does_it_work.md
@@ -4,20 +4,20 @@ All tools taking GFF/GTF as input can be divided in two groups: \_sp\_ and \_sq\
 
 * Tools with \_sp\_ prefix
 
-\_sp\_ stands for SLURP. Those tools will charge the file in memory Omniscient data structure. It has a memory cost but makes life smoother. Indeed, it allows to perform complicated tasks in a more time efficient way ( Any features can be accessed at any time by AGAT).
+\_sp\_ stands for SLURP. Those tools will charge the file in memory in a specific data structure. It has a memory cost but makes life smoother. Indeed, it allows to perform complicated tasks in a more time efficient way ( Any features can be accessed at any time by AGAT).
 Moreover, it allows to fix all potential errors in the limit of the possibilities given by the format itself.
-See the Omniscient section for more information about it.  
+See the AGAT parser section for more information about it.  
 
 * with \_sq\_ prefix
 
  \_sq\_ stands for SEQUENTIAL. Those tools will read and process GFF/GTF files from the top to the bottom, line by line, performing tasks on the fly. This is memory efficient but the sanity check of the file is minimum. Those tools are not intended to perform complex tasks.
 
-## Omniscient / parsing performed by \_sp\_ prefix tools / Standardisation for a full GFF3 compliant to any tool
+## The AGAT parser / used by \_sp\_ prefix tools / Standardisation to create GXF files compliant to any tool
 
-The first step of AGAT' tools with the \_sp\_ prefix of is to fix the file to standardize it. (e.g. a file containing only exon will be modified to create mRNA and gene features). To perform this task AGAT parses and slurps the entire data into a data structure called Omniscient.
-Below you will find more information about peculiarity of the Omniscient structure, and the parsing approach used.
+The first step of AGAT' tools with the \_sp\_ prefix of is to fix the file to standardize it. (e.g. a file containing only exon will be modified to create mRNA and gene features). To perform this task AGAT parses and slurps the entire data into a specific data structure.
+Below you will find more information about peculiarity of this data structure, and the parsing approach used.
 
-### What performs the Omniscient parser
+### What performs the AGAT parser
 
 * It creates missing parental features. (e.g if a level2 or level3 feature do not have parental feature(s) we create the missing level2 and/or level1 feature(s)).    
 * It creates missing mandatory attributes (ID and/or Parent).  
@@ -30,7 +30,7 @@ Below you will find more information about peculiarity of the Omniscient structu
 * It groups features together (if related features are spread at different places in the file).  
 
 
-### Omniscient data structure
+### The data structure
 
 The method create a hash structure containing all the data in memory. We call it OMNISCIENT. The OMNISCIENT structure is a three levels structure:
 ```
@@ -39,9 +39,9 @@ $omniscient{level2}{tag_l2}{idY} = @featureListL2 <= tag could be mRNA,rRNA,tRNA
 $omniscient{level3}{tag_l3}{idZ} =  @featureListL3 <= tag could be exon,cds,utr3,utr5,etc. idZ is the ID of a level2 feature (know as Parent attribute within the level3 feature). The @featureList is a list to be able to put all the feature of a same tag together.  
 ```
 
-### How does the Omniscient parser work
+### How does the AGAT parser work
 
-To resume by priority of way to parse: **Parent/child relationship > common attribute/tag > sequential.**  
+To resume by priority of way to parse: **Parent/child or gene_id/transcript_id relationship > common attribute/tag > sequential.**  
 The parser may used only one or a mix of these approaches according of the peculiarity of the gtf/gff file you provide.
 If you need to use the `--ct` option you will have to process the file `agat_convert_sp_gxf2gxf.pl` first  before running any other tool.
 
@@ -90,7 +90,7 @@ Example of relationship made sequentially:
 
 ### Particular case
 
-Below you will find more information about peculiarity of the Omniscient structure, and the parsing approach used.
+Below you will find more information about peculiar GXF files and how the AGAT parser behaves and uses the different parsing approaches.
 
 #### A. Level1 feature type missing and no Parent/gene_id