QC phase
While hosting several genome projects, the I5K Workspace@NAL team has encountered a large number of formatting errors that can occur in GFF3 files. The automated QC procedure detects ~50 types of formatting errors caused by manual curation. Some errors are fixed automatically, whereas other error types must be reviewed manually by curators or administrators. Curators are given a list of errors to correct in Apollo. After a correction period, the QC reports are regenerated, and this cycle repeats until no errors remain.
Errors are detected by examining three kinds of feature sets in a GFF3 file and are therefore grouped into three categories, listed here as error category (code prefix) – feature set examined:
- Intra-model errors (Ema) – multiple features within a model
- Inter-model errors (Emr) – multiple features across models
- Single feature errors (Esf) – each single feature.
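
The prefix convention above can be sketched as a small lookup. This is only an illustration: the names `CATEGORY_BY_PREFIX` and `error_category` are hypothetical and are not part of the QC tool itself, and the sketch assumes every code is a three-letter prefix followed by digits, as in the tables on this page.

```python
# Hypothetical helper: maps an Error_Code prefix to its category name.
CATEGORY_BY_PREFIX = {
    "Ema": "Intra-model",
    "Emr": "Inter-model",
    "Esf": "Single feature",
}


def error_category(error_code):
    """Return the category name for a code such as 'Ema0003'."""
    prefix, number = error_code[:3], error_code[3:]
    # Assumption: a valid code is a known prefix followed only by digits.
    if prefix not in CATEGORY_BY_PREFIX or not number.isdigit():
        raise ValueError(f"unrecognized error code: {error_code!r}")
    return CATEGORY_BY_PREFIX[prefix]
```

For example, `error_category("Emr0005")` returns `"Inter-model"`.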
The 'Intra-model' category collects formatting errors that can only be found by considering multiple features within a single gene model together, such as its gene, mRNA, exon, and CDS features. An error in this category is assigned an 'Error_Code' starting with 'Ema'.
Error_Code | Error_Tag | Note |
---|---|---|
Ema0001 | redundant length of the gene | TBA |
Ema0002 | internal stops | TBA |
Ema0003 | This feature is not contained within the feature boundaries of parent | Done |
Ema0004 | Incomplete gene feature that should contain at least one mRNA, exon, and CDS | TBA |
Ema0005 | unusual child features in the type of pseudogene found | Done |
Ema0006 | Wrong phase | Done |
Ema0007 | Inconsistent CDS strand with parent | Done |
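
As a rough illustration of what an intra-model check involves, the sketch below tests two of the conditions in the table, Ema0003 (child outside the parent boundaries) and Ema0007 (CDS strand differs from the parent). The `Feature` tuple and the `intra_model_errors` function are hypothetical names chosen for this example and do not reflect how the actual QC procedure is implemented. Coordinates are assumed to be 1-based and inclusive, as in GFF3.

```python
from collections import namedtuple

# Hypothetical minimal feature record; a real GFF3 feature has nine columns.
Feature = namedtuple("Feature", "ftype start end strand")


def intra_model_errors(parent, children):
    """Return (error_code, child) pairs for one parent and its child features."""
    errors = []
    for child in children:
        # Ema0003: the child must lie inside the parent's boundaries.
        if child.start < parent.start or child.end > parent.end:
            errors.append(("Ema0003", child))
        # Ema0007: a CDS must be on the same strand as its parent.
        if child.ftype == "CDS" and child.strand != parent.strand:
            errors.append(("Ema0007", child))
    return errors
```

Both checks need the parent and the child at the same time, which is why they cannot be performed by reading the file one line at a time.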
The 'Inter-model' category collects formatting errors that are found by comparing multiple gene models with one another. An error in this category is assigned an 'Error_Code' starting with 'Emr'.
Error_Code | Error_Tag | Note |
---|---|---|
Emr0001 | Duplicate transcript found | Done |
Emr0002 | wrongly merged gene parent? | TBA |
Emr0003 | wrongly split gene parent? | TBA |
Emr0004 | models with distant isoforms | TBA |
Emr0005 | Duplicate ID | Done |
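
A check such as Emr0005 (Duplicate ID) has to look across the whole file. The following is a minimal sketch of that idea, not the tool's actual implementation: `duplicate_ids` is a hypothetical function, and it deliberately ignores the GFF3 rule that a discontinuous feature (for example a multi-segment CDS) may legitimately repeat one ID over several lines.

```python
from collections import defaultdict


def duplicate_ids(lines):
    """Map each ID used on more than one feature line to its line numbers."""
    seen = defaultdict(list)
    for lineno, line in enumerate(lines, start=1):
        # Skip blank lines, directives, and comments.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue  # malformed lines are a single-feature problem
        for attr in fields[8].split(";"):
            tag, sep, value = attr.partition("=")
            if tag == "ID" and sep:
                seen[value].append(lineno)
    return {fid: nums for fid, nums in seen.items() if len(nums) > 1}
```

The returned line numbers are 1-based positions within the input, so they can be reported back to a curator directly.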
The 'Single Feature' category collects formatting errors that can be found by checking the GFF3 file line by line. An error in this category is assigned an 'Error_Code' starting with 'Esf'.
Error_Code | Error_Tag | Note |
---|---|---|
Esf0001 | pseudogene or not? | Done |
Esf0002 | Negative/Zero start/end coordinate | Done |
Esf0003 | strand information missing | TBA |
Esf0004 | Seqid not found in any ##sequence-region | Done |
Esf0005 | Start is less than the ##sequence-region start | Done |
Esf0006 | End is greater than the ##sequence-region end | Done |
Esf0007 | Seqid not found in the embedded ##FASTA | Done |
Esf0008 | End is greater than the embedded ##FASTA sequence length | Done |
Esf0009 | Found Ns in a feature using the embedded ##FASTA | Done |
Esf0010 | Seqid not found in the external FASTA file | Done |
Esf0011 | End is greater than the external FASTA sequence length | Done |
Esf0012 | Found Ns in a feature using the external FASTA | Done |
Esf0013 | White chars not allowed at the start of a line | Done |
Esf0014 | "##gff-version" missing from the first line | Done |
Esf0015 | Expecting certain fields in the feature | Done |
Esf0016 | ##sequence-region seqid may only appear once | Done |
Esf0017 | Start/End is not a valid integer | Done |
Esf0018 | Start is not less than or equal to end | Done |
Esf0019 | Version is not "3" | Done |
Esf0020 | Version is not a valid integer | Done |
Esf0021 | Unknown directive | Done |
Esf0022 | Features should contain 9 fields | Done |
Esf0023 | escape certain characters | Done |
Esf0024 | Score is not a valid floating point number | Done |
Esf0025 | Strand has illegal characters | Done |
Esf0026 | Phase is not 0, 1, or 2, or not a valid integer | Done |
Esf0027 | Phase is required for all CDS features | Done |
Esf0028 | Attributes must escape the percent (%) sign and any control characters | Done |
Esf0029 | Attributes must contain one and only one equal (=) sign | Done |
Esf0030 | Empty attribute tag | Done |
Esf0031 | Empty attribute value | Done |
Esf0032 | Found multiple attribute tags | Done |
Esf0033 | Found ", " in an attribute, possible unescaped | Done |
Esf0034 | attribute has identical values (count, value) | Done |
Esf0035 | attribute has unresolved forward reference | Done |
Esf0036 | Value of an attribute contains unescaped "," | Done |
Esf0037 | Target attribute should have 3 or 4 values | Done |
Esf0038 | Start/End value of Target attribute is not a valid integer coordinate | Done |
Esf0039 | Strand value of Target attribute has illegal characters | Done |
Esf0040 | Value of Is_circular attribute is not "true" | Done |
Esf0041 | Unknown reserved (uppercase) attribute | Done |
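
Because single-feature errors depend on one line only, several of them can be illustrated with a simple per-line function. The sketch below covers a handful of the codes in the table (Esf0002, Esf0017, Esf0018, Esf0022, Esf0025, Esf0026, Esf0027). The function name `check_feature_line` is hypothetical, and the logic is a simplified reading of the error tags rather than the QC tool's own code.

```python
def check_feature_line(line):
    """Return the Esf error codes that apply to one tab-delimited feature line."""
    fields = line.rstrip("\n").split("\t")
    # Esf0022: a feature line must have exactly 9 fields.
    if len(fields) != 9:
        return ["Esf0022"]
    seqid, source, ftype, start, end, score, strand, phase, attrs = fields
    errors = []
    try:
        start_i, end_i = int(start), int(end)
    except ValueError:
        errors.append("Esf0017")  # start/end is not a valid integer
    else:
        if start_i <= 0 or end_i <= 0:
            errors.append("Esf0002")  # negative or zero coordinate
        if start_i > end_i:
            errors.append("Esf0018")  # start must be <= end
    # GFF3 allows '+', '-', '.', and '?' in the strand column.
    if strand not in {"+", "-", ".", "?"}:
        errors.append("Esf0025")
    if phase == ".":
        if ftype == "CDS":
            errors.append("Esf0027")  # phase is required for CDS features
    elif phase not in {"0", "1", "2"}:
        errors.append("Esf0026")
    return errors
```

A well-formed line such as `"chr1\t.\tCDS\t10\t20\t.\t+\t0\tID=cds1"` yields an empty list, while a line with fewer than nine columns yields only `["Esf0022"]`, since the remaining columns cannot be interpreted reliably.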