Martin Asser Hansen edited this page Oct 2, 2015 · 6 revisions

The KISS format

Background

The KISS format is derived from the UCSC BED format, but with important differences that allow it to carry, for example, alignment information. This makes it well suited for describing the mapping of short sequence matches from Next Generation Sequencing, as well as the layout of multi-exon genes.

The BED format is by definition bound to chromosomes, which is impractical when working with bacterial contigs, plasmids, viruses, etc. The BED format is also unsuitable for carrying alignment information, and it contains the useless itemRgb field. Moreover, the BED position scheme is awkward because the chromEnd field includes one extra position that does not pertain to a base position but is used for drawing features.

The SAM format cannot be used to describe a multi-exon gene; it is designed specifically for short sequence alignments of Next Generation Sequencing data. The SAM format is problematic to work with for several reasons. First, it is quite messy to parse even though it is claimed to be simple, because of the number of fields reserved for mate-pair sequence mapping and a complex bit field. Second, the alignment information is encoded using the Extended CIGAR format plus two optional fields for mismatch information, yet there is no information on inserted or deleted nucleotides; for that you need to recreate the alignment from the original sequence. Finally, the SAM format cannot be considered generic when it allows an unlimited number of optional fields that may contain user-defined content.

The GFF format is impossible to work with because of the requirement of parsing multiple lines to resolve the parent/child features for describing e.g. a gene.

Synopsis

The KISS format (Keep It Simple, Stupid) is a text-based data format for describing generic feature information, with one feature per line in 12 tab-separated columns:

  1. S_ID: Subject ID - e.g. chr12.
  2. S_BEG: Begin position of a feature relating to the subject sequence. 0-based.
  3. S_END: End position of a feature relating to the subject sequence. 0-based, inclusive.
  4. Q_ID: Query ID - e.g. a Solexa read ID such as a3_2VCOjxwXsN1.
  5. SCORE: A float that can describe e.g. a BLAT score.
  6. STRAND: Denotes which strand a feature relates to. + or -.
  7. HITS: Number of times a feature is found in the subject sequence.
  8. ALIGN: Comma-separated list of alignment descriptors for mismatches, insertions, and deletions *).
  9. BLOCK_COUNT: Number of blocks in a feature (e.g. introns + exons).
  10. BLOCK_BEGS: Comma-separated list of block begin positions. Offset is S_BEG.
  11. BLOCK_LENS: Comma-separated list of block lengths.
  12. BLOCK_TYPE: Comma-separated list of block types (0=Gap,1=Non-gap,2=CDS,3=5'UTR,4=3'UTR).

Values in fields 4-12 are optional and empty fields must contain a '.'.
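
As a rough illustration of the column layout above, the following Python sketch splits one KISS line into typed fields and maps the '.' placeholder to None. The names KissRecord and parse_kiss_line are hypothetical and not part of any existing KISS tooling. Unescaping of tabs in the ID fields is left out, because the escaping scheme is not specified here.

```python
from typing import List, NamedTuple, Optional


class KissRecord(NamedTuple):
    """One KISS feature; optional fields are None when the column is '.'."""
    s_id: str
    s_beg: int
    s_end: int
    q_id: Optional[str]
    score: Optional[float]
    strand: Optional[str]
    hits: Optional[int]
    align: Optional[str]
    block_count: Optional[int]
    block_begs: Optional[List[int]]
    block_lens: Optional[List[int]]
    block_type: Optional[List[int]]


def _opt(value, convert=str):
    """Return None for the empty-field marker '.', else the converted value."""
    return None if value == "." else convert(value)


def _int_list(value):
    """Convert a comma-separated column such as '0,5,10' to a list of ints."""
    return [int(x) for x in value.split(",")]


def parse_kiss_line(line):
    """Parse a single tab-separated KISS line into a KissRecord."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 12:
        raise ValueError(f"expected 12 tab-separated columns, got {len(cols)}")
    return KissRecord(
        s_id=cols[0],
        s_beg=int(cols[1]),
        s_end=int(cols[2]),
        q_id=_opt(cols[3]),
        score=_opt(cols[4], float),
        strand=_opt(cols[5]),
        hits=_opt(cols[6], int),
        align=_opt(cols[7]),
        block_count=_opt(cols[8], int),
        block_begs=_opt(cols[9], _int_list),
        block_lens=_opt(cols[10], _int_list),
        block_type=_opt(cols[11], _int_list),
    )
```

Returning None for '.' keeps "field not given" distinct from legitimate values such as a SCORE of 0.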

*) Alignment descriptors:

  • mismatch: (offset:S-base>Q-base) - e.g. 0:C>T,13:G>C
  • insertion: (offset:->Q-base) - e.g. 8:->G,18:->A
  • deletion: (offset:S-base>-) - e.g. 5:A>-,16:T>-

The offset position is relative to S_BEG and does not change with insertions or deletions. Alignment descriptors are based on the + strand.

Descriptors should be sorted by offset position.
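
The descriptor syntax can be checked mechanically. The sketch below is one possible way to split an ALIGN value into (offset, kind, S-base, Q-base) tuples and to enforce the sort order. The function name parse_align is hypothetical, and restricting bases to single ASCII letters is an assumption, not something the format states.

```python
import re

# Assumption: a base is a single ASCII letter; '-' marks the gap side.
_DESCRIPTOR = re.compile(r"^(\d+):([A-Za-z-])>([A-Za-z-])$")


def parse_align(align):
    """Parse an ALIGN field into (offset, kind, s_base, q_base) tuples."""
    if align == ".":
        return []
    parsed = []
    for item in align.split(","):
        match = _DESCRIPTOR.match(item)
        if match is None:
            raise ValueError(f"malformed alignment descriptor: {item!r}")
        offset = int(match.group(1))
        s_base, q_base = match.group(2), match.group(3)
        if s_base == "-" and q_base == "-":
            raise ValueError(f"descriptor has a gap on both sides: {item!r}")
        if s_base == "-":
            kind = "insertion"
        elif q_base == "-":
            kind = "deletion"
        else:
            kind = "mismatch"
        parsed.append((offset, kind, s_base, q_base))
    offsets = [entry[0] for entry in parsed]
    if offsets != sorted(offsets):
        raise ValueError("alignment descriptors must be sorted by offset")
    return parsed
```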

The KISS format description

Mandatory fields: S_ID, S_BEG, and S_END

The mandatory Subject ID S_ID field is used to identify the subject sequence, which could be chr1, contig00001, or gi|255961261|ref|NC_007492.2| Pseudomonas fluorescens Pf0-1, complete genome. There are no restrictions on this field except that tabs must be escaped.

The mandatory Subject Begin S_BEG field is the 0-based begin position of a feature. S_BEG is a non-negative integer.

The mandatory Subject End S_END field is the end position of a feature. S_END is a non-negative integer, and is always equal to or greater than S_BEG.

The simplest KISS entry could look like this (empty fields are denoted with .):

Contig1   10   20   .   .   .   .   .   .   .   .   .

This describes an unidentified feature, 11 bases long, covering positions 10 through 20 (both inclusive) on the sequence of Contig1:

Pos:     0123456789012345678901234567890
Contig1: -------------------------------
                   ===========
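
Because S_END is itself a base position, the feature above spans S_END - S_BEG + 1 bases. The small sketch below computes that length and, for comparison, the corresponding BED-style interval, where chromEnd lies one position past the last base. Both function names are hypothetical helpers for illustration.

```python
def feature_length(s_beg, s_end):
    """Number of bases covered by a KISS feature (S_END is inclusive)."""
    if s_end < s_beg:
        raise ValueError("S_END must be equal to or greater than S_BEG")
    return s_end - s_beg + 1


def kiss_to_bed_interval(s_beg, s_end):
    """Return (chromStart, chromEnd) for BED, whose end is exclusive."""
    return s_beg, s_end + 1


print(feature_length(10, 20))        # 11
print(kiss_to_bed_interval(10, 20))  # (10, 21)
```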

The Q_ID field

The optional Query ID Q_ID field is used to identify the query sequence, which could be the ID of a mapped sequence such as a Solexa read, e.g. a3_2VCOjxwXsN1, or a gene ID such as NM_006140. There are no restrictions on this field except that tabs must be escaped.

Contig1   10   20   NM_006140   .   .   .   .   .   .   .   .

The SCORE field

The optional SCORE field is used to hold a score value of a sequence match, e.g. a BLAT or BLAST match. SCORE is a float value.

Contig1   10   20   NM_006140   0.123   .   .   .   .   .   .   .

The STRAND field

The optional STRAND field is used to indicate the orientation of a feature. + and - are allowed.

Contig1   10   20   NM_006140   .   -   .   .   .   .   .   .

The HITS field

The optional HITS field can be used to denote how many times this feature was found on this subject sequence or in a stack of subject sequences. This is useful for analyzing multi-mapping Next Generation Sequencing reads. The value of 123 in the HITS field below indicates that this sequence was also mapped at 122 other loci.

Contig1   10   20   NM_006140   .   .   123   .   .   .   .   .

The ALIGN field

The optional ALIGN field consists of a comma-separated list of alignment descriptors of which there are 3 types:

  • mismatch: offset:S-base>Q-base
  • insertion: offset:->Q-base
  • deletion: offset:S-base>-

The offset position is relative to S_BEG and does not change with insertions or deletions.

The alignment descriptors are sorted on the offset positions.

The alignment descriptors are based on the + strand.

Thus we can generate an alignment of a Subject and a Query sequence using only the Subject sequence and the ALIGN descriptors.

The following Subject sequence is used in the below examples:

S_SEQ: CCGTAAGACTACGGCTTAGGC

ALIGN mismatches example

Here is an example of two mismatches: ALIGN: 0:C>T,13:G>C

                 1         2
       012345678901234567890
S_SEQ: CCGTAAGACTACGGCTTAGGC
        |||||||||||| |||||||
Q_SEQ: TCGTAAGACTACGCCTTAGGC

ALIGN insertions example

Here is an example of two insertions: ALIGN: 8:->G,18:->A

                  1          2
       01234567-8901234567-890
S_SEQ: CCGTAAGA-CTACGGCTTA-GGC
       |||||||| |||||||||| |||
Q_SEQ: CCGTAAGAGCTACGGCTTAAGGC

ALIGN deletions example

Here is an example with two deletions: ALIGN: 3:T>-,16:T>-

                 1         2
       012345678901234567890
S_SEQ: CCGTAAGACTACGGCTTAGGC
       ||| |||||||||||| ||||
Q_SEQ: CCG-AAGACTACGGCT-AGGC

Comprehensive example

And an example with all of the above: 0:C>T,3:T>-,8:->G,13:G>C,16:T>-,18:->A

                  1          2
       01234567-8901234567-890
S_SEQ: CCGTAAGA-CTACGGCTTA-GGC
        || |||| ||||| || | |||
Q_SEQ: TCG-AAGAGCTACGCCT-AAGGC
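
The examples above can be reproduced programmatically. The following Python sketch rebuilds the gapped Subject and Query strings from the Subject sequence and an ALIGN value. The function name align_pair is hypothetical. The sketch assumes the Subject string starts at S_BEG and that an inserted base is placed immediately before the Subject base at the given offset, as in the diagrams above. Insertions after the last Subject base are not handled.

```python
def align_pair(s_seq, align):
    """Return gapped (subject, query) strings from a subject and ALIGN."""
    mismatches = {}   # offset -> query base replacing the subject base
    insertions = {}   # offset -> query base(s) inserted before that offset
    deletions = set()  # offsets of subject bases missing from the query

    if align != ".":
        for item in align.split(","):
            offset, change = item.split(":", 1)
            s_base, q_base = change.split(">", 1)
            pos = int(offset)
            if s_base == "-":
                insertions[pos] = insertions.get(pos, "") + q_base
            elif q_base == "-":
                deletions.add(pos)
            else:
                mismatches[pos] = q_base

    s_out, q_out = [], []
    for pos, base in enumerate(s_seq):
        if pos in insertions:
            # Pad the subject with gaps where the query has extra bases.
            s_out.append("-" * len(insertions[pos]))
            q_out.append(insertions[pos])
        s_out.append(base)
        if pos in deletions:
            q_out.append("-")
        else:
            q_out.append(mismatches.get(pos, base))
    return "".join(s_out), "".join(q_out)


s_aln, q_aln = align_pair("CCGTAAGACTACGGCTTAGGC",
                          "0:C>T,3:T>-,8:->G,13:G>C,16:T>-,18:->A")
print(s_aln)  # CCGTAAGA-CTACGGCTTA-GGC
print(q_aln)  # TCG-AAGAGCTACGCCT-AAGGC
```

Since the descriptors are based on the + strand, no reverse complementing is needed here, whatever the STRAND value.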

The BLOCK_COUNT field

The optional BLOCK_COUNT field is used to denote the number of blocks in gapped alignments such as mapped paired-end reads, or multi-exon genes. BLOCK_COUNT is a positive integer.

The BLOCK_BEGS field

The optional BLOCK_BEGS field contains a comma-separated list of block begin positions with S_BEG as offset.

The BLOCK_LENS field

The optional BLOCK_LENS field contains a comma-separated list of block lengths.

The BLOCK_TYPE field

The optional BLOCK_TYPE field contains a comma-separated list of block types. Each block type is a non-negative integer with one of the following values:

0 = Gap (or intron)
1 = Non-gap
2 = CDS
3 = 5' UTR
4 = 3' UTR

Thus a paired-end read could result in the following KISS entry:

Contig1   10   27   ID00001   .   .   .   .   3   0,5,10   5,5,8   1,0,1
ID00001: =====-----========
         Block1    Block2

A gene with a 5' UTR, 3 exons, and a 3' UTR with an intron can be described like this:

Contig1   10   42   GENE00001   .   .   .   .   9   0,6,9,14,18,21,26,28,30   6,3,5,4,3,5,2,2,3   3,2,0,2,0,2,4,0,4
# CDS
= UTR
- Intron

GENE00001: ======###-----####---#####==--===
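
Since BLOCK_BEGS is relative to S_BEG, absolute block coordinates follow by simple addition. The sketch below shows one way to expand the block columns into inclusive (begin, end) pairs on the subject. The function name expand_blocks is a hypothetical helper.

```python
def expand_blocks(s_beg, block_begs, block_lens):
    """Return absolute, inclusive (begin, end) pairs for each block."""
    begs = [int(x) for x in block_begs.split(",")]
    lens = [int(x) for x in block_lens.split(",")]
    if len(begs) != len(lens):
        raise ValueError("BLOCK_BEGS and BLOCK_LENS must have equal length")
    # A block of length n starting at offset b covers b .. b + n - 1.
    return [(s_beg + b, s_beg + b + n - 1) for b, n in zip(begs, lens)]


blocks = expand_blocks(10, "0,6,9,14,18,21,26,28,30", "6,3,5,4,3,5,2,2,3")
print(blocks[0])   # (10, 15)
print(blocks[-1])  # (40, 42)
```

For the gene record above, the last block ends at position 42, which equals S_END. The number of pairs returned can likewise be compared against BLOCK_COUNT as a sanity check.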

Example KISS records

Here are some real-life KISS records from a BWA mapping of Solexa reads against a reference:

CP000046   49   92    1_524msxwXsN1   61.61   -   .   .   1   .   .   .
CP000046   50   93    5_LKLjAywXsN1   62.61   -   .   .   1   .   .   .
CP000046   51   94    1_64zDoxwXsN1   62.27   -   .   .   1   .   .   .
CP000046   55   98    3_WFMk4ywXsN1   61.52   -   .   0:G>A,5:A>-     1   .   .   .
CP000046   59   102   5_XYjz6ywXsN1   60.98   +   .   40:C>-,41:A>C   1   .   .   .
CP000046   59   102   5_XmvSlxwXsN1   62.73   +   .   40:C>-,41:A>C   1   .   .   .
CP000046   64   107   7_8ZFZ3ywXsN1   62.32   +   .   .   1   .   .   .
CP000046   66   100   5_ay97zxwXsN1   61.97   +   .   .   1   .   .   .
CP000046   67   101   7_nqQ11ywXsN1   63.14   +   .   .   1   .   .   .
CP000046   67   110   5_eky4kxwXsN1   62.34   -   .   .   1   .   .   .