This repository has been archived by the owner on Apr 24, 2018. It is now read-only.

QC phase

mpoelchau edited this page Jun 21, 2017 · 8 revisions

Automated and manual QC for formatting errors caused by manual curation

In hosting several genome projects, the I5K Workspace@NAL team has encountered many kinds of formatting errors that can occur in GFF3 files. The automated QC procedure detects ~50 types of formatting errors caused by manual curation. Some errors are fixed automatically, whereas other error types need to be reviewed manually by curators or administrators. Curators are given a list of errors to correct in Apollo. After a correction period, the QC reports are regenerated, and this cycle repeats until no errors remain.

Errors are detected by reviewing three types of feature sets in a GFF3 file, and thus are grouped into three categories (Error category – feature type):

  • Intra-model errors (Ema) – multiple features within a model
  • Inter-model errors (Emr) – multiple features across models
  • Single feature errors (Esf) – each single feature.
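Because the category is encoded in the first three characters of each Error_Code, a QC report can be grouped by category with a simple lookup. The following is a minimal sketch, not the team's implementation; the names `CATEGORY_BY_PREFIX`, `error_category`, and `group_by_category` are hypothetical.

```python
# Hypothetical lookup table: Error_Code prefix -> category name used on this page.
CATEGORY_BY_PREFIX = {
    "Ema": "Intra-model",
    "Emr": "Inter-model",
    "Esf": "Single feature",
}


def error_category(error_code):
    """Return the category name for an Error_Code such as 'Esf0022'."""
    prefix = error_code[:3]
    if prefix not in CATEGORY_BY_PREFIX:
        raise ValueError(f"Unrecognized error code prefix: {error_code!r}")
    return CATEGORY_BY_PREFIX[prefix]


def group_by_category(error_codes):
    """Group a list of Error_Codes by category, keeping input order."""
    groups = {name: [] for name in CATEGORY_BY_PREFIX.values()}
    for code in error_codes:
        groups[error_category(code)].append(code)
    return groups


if __name__ == "__main__":
    report = ["Esf0022", "Ema0003", "Emr0005", "Esf0018"]
    for category, codes in group_by_category(report).items():
        print(category, codes)
```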

Intra-model: Multiple features within a model (Ema)

The 'Intra-model' category collects formatting errors that can be found by considering multiple features within a gene model together, such as its gene, mRNA, exon, and CDS features. An error in this category is assigned an 'Error_Code' starting with 'Ema'.

| Error_Code | Error_Tag | Note |
| --- | --- | --- |
| Ema0001 | redundant length of the gene | TBA |
| Ema0002 | internal stops | TBA |
| Ema0003 | This feature is not contained within the feature boundaries of parent | Done |
| Ema0004 | Incomplete gene feature that should contain at least one mRNA, exon, and CDS | TBA |
| Ema0005 | unusual child features in the type of pseudogene found | Done |
| Ema0006 | Wrong phase | Done |
| Ema0007 | Inconsistent CDS strand with parent | Done |
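As an illustration of how an intra-model check might work, the sketch below tests two of the errors listed above: Ema0003 (a child feature extends beyond its parent) and Ema0007 (a CDS strand differs from its parent). It assumes features have already been parsed into simple records; the `Feature` class and `check_intra_model` function are hypothetical and not taken from the QC code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Feature:
    """Hypothetical parsed GFF3 feature (only the fields these checks need)."""
    feature_id: str
    ftype: str
    start: int
    end: int
    strand: str
    parent_id: Optional[str] = None


def check_intra_model(features):
    """Return (Error_Code, feature_id) pairs for Ema0003 and Ema0007."""
    by_id = {f.feature_id: f for f in features}
    errors = []
    for f in features:
        parent = by_id.get(f.parent_id) if f.parent_id else None
        if parent is None:
            continue  # top-level feature, or parent missing from this model
        # Ema0003: the child must lie within the parent's coordinates.
        if f.start < parent.start or f.end > parent.end:
            errors.append(("Ema0003", f.feature_id))
        # Ema0007: a CDS must be on the same strand as its parent.
        if f.ftype == "CDS" and f.strand != parent.strand:
            errors.append(("Ema0007", f.feature_id))
    return errors


if __name__ == "__main__":
    model = [
        Feature("gene1", "gene", 100, 500, "+"),
        Feature("mrna1", "mRNA", 100, 500, "+", "gene1"),
        Feature("exon1", "exon", 90, 200, "+", "mrna1"),   # starts before mRNA
        Feature("cds1", "CDS", 150, 200, "-", "mrna1"),    # wrong strand
    ]
    print(check_intra_model(model))
```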

Inter-model: Multiple features across models (Emr)

The 'Inter-model' category collects formatting errors that can be found by comparing multiple gene models with one another. An error in this category is assigned an 'Error_Code' starting with 'Emr'.

| Error_Code | Error_Tag | Note |
| --- | --- | --- |
| Emr0001 | Duplicate transcript found | Done |
| Emr0002 | wrongly merged gene parent? | TBA |
| Emr0003 | wrongly split gene parent? | TBA |
| Emr0004 | models with distant isoforms | TBA |
| Emr0005 | Duplicate ID | Done |
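Inter-model checks compare models against each other, which can be sketched as grouping features by a shared key. The example below covers Emr0005 and Emr0001 under two stated assumptions that are not confirmed by this page: any repeated ID is reported (even though GFF3 allows a discontinuous feature such as a CDS to share one ID across several lines), and two transcripts count as duplicates when they have the same seqid, strand, and exon coordinates. Both function names are hypothetical.

```python
from collections import Counter, defaultdict


def find_duplicate_ids(ids):
    """Emr0005 sketch: return IDs that occur more than once, sorted."""
    counts = Counter(ids)
    return sorted(i for i, n in counts.items() if n > 1)


def find_duplicate_transcripts(transcripts):
    """Emr0001 sketch.

    transcripts maps transcript_id -> (seqid, strand, [(start, end), ...]).
    Assumption: identical seqid, strand, and exon coordinates mean duplicate.
    Returns a list of sorted ID groups, one per set of duplicates.
    """
    seen = defaultdict(list)
    for tid, (seqid, strand, exons) in transcripts.items():
        # Sort exons so that listing order in the file does not matter.
        key = (seqid, strand, tuple(sorted(exons)))
        seen[key].append(tid)
    return [sorted(tids) for tids in seen.values() if len(tids) > 1]


if __name__ == "__main__":
    print(find_duplicate_ids(["gene1", "mrna1", "gene1", "exon1"]))
    print(find_duplicate_transcripts({
        "t1": ("chr1", "+", [(1, 10), (20, 30)]),
        "t2": ("chr1", "+", [(20, 30), (1, 10)]),
        "t3": ("chr1", "-", [(1, 10), (20, 30)]),
    }))
```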

Single feature (Esf)

The 'Single feature' category collects formatting errors that can be found by scanning the GFF3 file line by line. An error in this category is assigned an 'Error_Code' starting with 'Esf'.

| Error_Code | Error_Tag | Note |
| --- | --- | --- |
| Esf0001 | pseudogene or not? | Done |
| Esf0002 | Negative/Zero start/end coordinate | Done |
| Esf0003 | strand information missing | TBA |
| Esf0004 | Seqid not found in any ##sequence-region | Done |
| Esf0005 | Start is less than the ##sequence-region start | Done |
| Esf0006 | End is greater than the ##sequence-region end | Done |
| Esf0007 | Seqid not found in the embedded ##FASTA | Done |
| Esf0008 | End is greater than the embedded ##FASTA sequence length | Done |
| Esf0009 | Found Ns in a feature using the embedded ##FASTA | Done |
| Esf0010 | Seqid not found in the external FASTA file | Done |
| Esf0011 | End is greater than the external FASTA sequence length | Done |
| Esf0012 | Found Ns in a feature using the external FASTA | Done |
| Esf0013 | White chars not allowed at the start of a line | Done |
| Esf0014 | "##gff-version" missing from the first line | Done |
| Esf0015 | Expecting certain fields in the feature | Done |
| Esf0016 | ##sequence-region seqid may only appear once | Done |
| Esf0017 | Start/End is not a valid integer | Done |
| Esf0018 | Start is not less than or equal to end | Done |
| Esf0019 | Version is not "3" | Done |
| Esf0020 | Version is not a valid integer | Done |
| Esf0021 | Unknown directive | Done |
| Esf0022 | Features should contain 9 fields | Done |
| Esf0023 | escape certain characters | Done |
| Esf0024 | Score is not a valid floating point number | Done |
| Esf0025 | Strand has illegal characters | Done |
| Esf0026 | Phase is not 0, 1, or 2, or not a valid integer | Done |
| Esf0027 | Phase is required for all CDS features | Done |
| Esf0028 | Attributes must escape the percent (%) sign and any control characters | Done |
| Esf0029 | Attributes must contain one and only one equal (=) sign | Done |
| Esf0030 | Empty attribute tag | Done |
| Esf0031 | Empty attribute value | Done |
| Esf0032 | Found multiple attribute tags | Done |
| Esf0033 | Found ", " in an attribute, possible unescaped | Done |
| Esf0034 | attribute has identical values (count, value) | Done |
| Esf0035 | attribute has unresolved forward reference | Done |
| Esf0036 | Value of an attribute contains unescaped "," | Done |
| Esf0037 | Target attribute should have 3 or 4 values | Done |
| Esf0038 | Start/End value of Target attribute is not a valid integer coordinate | Done |
| Esf0039 | Strand value of Target attribute has illegal characters | Done |
| Esf0040 | Value of Is_circular attribute is not "true" | Done |
| Esf0041 | Unknown reserved (uppercase) attribute | Done |
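Several of the single-feature checks above need nothing more than the nine tab-separated columns of one line. The sketch below shows how a line-by-line pass might report Esf0002, Esf0017, Esf0018, Esf0022, Esf0024, Esf0025, Esf0026, and Esf0027. It is an illustration under the GFF3 column rules, not the QC tool's own code, and the function name `check_feature_line` is hypothetical.

```python
VALID_STRANDS = {"+", "-", ".", "?"}   # strand values permitted by GFF3
VALID_PHASES = {"0", "1", "2"}


def check_feature_line(line):
    """Return a list of Esf Error_Codes found on one GFF3 feature line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return ["Esf0022"]  # remaining checks need all nine columns
    _seqid, _source, ftype, start, end, score, strand, phase, _attrs = fields
    errors = []

    try:
        start_i, end_i = int(start), int(end)
    except ValueError:
        errors.append("Esf0017")
    else:
        if start_i <= 0 or end_i <= 0:
            errors.append("Esf0002")
        if start_i > end_i:
            errors.append("Esf0018")

    if score != ".":
        try:
            float(score)
        except ValueError:
            errors.append("Esf0024")

    if strand not in VALID_STRANDS:
        errors.append("Esf0025")

    if phase == ".":
        if ftype == "CDS":
            errors.append("Esf0027")  # CDS features must carry a phase
    elif phase not in VALID_PHASES:
        errors.append("Esf0026")

    return errors


if __name__ == "__main__":
    lines = [
        "chr1\tsrc\tCDS\t10\t20\t.\t+\t0\tID=cds1",
        "chr1\tsrc\tgene\t20\t10\t.\t+\t.\tID=gene1",
        "chr1\tsrc\tCDS\t1\t9\t.\t*\t.\tID=cds2",
    ]
    for text in lines:
        print(check_feature_line(text))
```

Checks that depend on other lines, such as the ##sequence-region and FASTA comparisons, would need state carried across the file and are left out of this sketch.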