unambiguous_codes

Simon Hegele edited this page May 28, 2025 · 12 revisions

Replacing ambiguous codes in FASTA/FASTQ-files.

The genomic nucleotides are adenine, cytosine, guanine and thymine, denoted as A, C, G and T. When the identity of a nucleotide is uncertain, other characters can be used to denote the set of possible bases, for example R for A or G and N for any base (see IUPAC Codes). However, many bioinformatics tools do not accept FASTA/FASTQ-files using these codes.

Ambiguous codes are replaced with a randomly selected allowed base.
The total number of replacements as well as the number of replacements per ambiguous code are reported.
Additionally, an "uncertainty" is reported, which is calculated as:

$$ u = \frac{\sum_{c \in \text{ambiguous codes}} \frac{\text{appearances of } c \text{ in all sequences}}{\text{number of possible bases represented by } c}}{\text{total number of bases in all sequences}} $$
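
The replacement and the uncertainty formula above can be illustrated with a minimal Python sketch. It is not the tool's actual implementation. The names `IUPAC_AMBIGUOUS`, `replace_ambiguous` and `uncertainty` are hypothetical, and the sketch assumes that an "allowed base" is one of the bases the ambiguous code represents under the standard IUPAC table, and that sequences are upper-case.

```python
import random
from collections import Counter

# Standard IUPAC nucleotide ambiguity codes and the bases each one can represent.
IUPAC_AMBIGUOUS = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def replace_ambiguous(sequence, rng=random):
    """Return the sequence with ambiguous codes replaced, plus per-code counts.

    Assumption: each ambiguous code is replaced by a base drawn uniformly
    from the bases that code represents. Other characters are kept as-is.
    """
    counts = Counter()
    replaced = []
    for base in sequence:
        options = IUPAC_AMBIGUOUS.get(base)
        if options is None:
            replaced.append(base)
        else:
            replaced.append(rng.choice(options))
            counts[base] += 1
    return "".join(replaced), counts


def uncertainty(counts, total_bases):
    """Compute u: sum over codes of (appearances / possible bases), over total bases."""
    if total_bases == 0:
        return 0.0
    weighted = sum(n / len(IUPAC_AMBIGUOUS[code]) for code, n in counts.items())
    return weighted / total_bases


if __name__ == "__main__":
    seq = "ACGTNNRY"
    new_seq, counts = replace_ambiguous(seq)
    print(new_seq)                          # random, e.g. ACGTCAGT
    print(sum(counts.values()), dict(counts))
    print(uncertainty(counts, len(seq)))    # (2/4 + 1/2 + 1/2) / 8 = 0.1875
```

For the example sequence `ACGTNNRY`, N appears twice and R and Y once each, so the numerator is 2/4 + 1/2 + 1/2 = 1.5 and u = 1.5 / 8 = 0.1875.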

usage: unambiguous_codes [-h] [-t THREADS] in_file out_file

Replacing ambigouity codes in FASTA/FASTQ with A,C,G or T.

positional arguments:
  in_file
  out_file

options:
  -h, --help            show this help message and exit
  -t THREADS, --threads THREADS
                        Number of parallel threads [default: 1]