
unambiguous_codes

Simon Hegele edited this page May 28, 2025 · 12 revisions

Replaces ambiguous codes in FASTA/FASTQ files.

The genomic nucleotides are adenine, cytosine, guanine and thymine, denoted as A, C, G and T. When the identity of a nucleotide is uncertain, other characters can be used to denote it (see IUPAC Codes). However, many bioinformatics tools do not accept FASTA/FASTQ files that use these codes.
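As an illustration of the codes mentioned above, the standard IUPAC ambiguity codes can be written as a lookup table. The sketch below is not taken from the tool's source: the names `IUPAC_AMBIGUOUS` and `ambiguous_positions` are hypothetical, and only uppercase characters are considered.

```python
# Standard IUPAC nucleotide ambiguity codes and the bases each one stands for.
# Illustrative only; the names here are not part of unambiguous_codes.
IUPAC_AMBIGUOUS = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def ambiguous_positions(seq):
    """Return (index, code) pairs for every ambiguous code in seq."""
    return [(i, base) for i, base in enumerate(seq) if base in IUPAC_AMBIGUOUS]


print(ambiguous_positions("ACGRTN"))  # [(3, 'R'), (5, 'N')]
```

A sequence for which this list is empty contains only A, C, G and T (or characters outside the IUPAC ambiguity set) and would need no replacement.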

Each ambiguous code is replaced with a randomly selected base from those the code allows.
The total number of replacements and the number of replacements per ambiguous code are reported.
Additionally, an "uncertainty" is reported, which is calculated as:

$$ u = \frac{\sum_{ac} ap(ac) \cdot \left(1 - \frac{1}{pb(ac)}\right)}{\text{total bases}} $$
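The page does not define the symbols in the formula. A plausible reading, assumed here, is that `ap(ac)` is the number of appearances of the ambiguous code `ac` and `pb(ac)` is the number of bases that code can stand for. Under that assumption, the replacement and the uncertainty can be sketched as follows. This is not the tool's actual implementation: the function names are hypothetical and only uppercase codes are handled.

```python
import random
from collections import Counter

# Standard IUPAC ambiguity codes and the bases each one allows.
IUPAC_AMBIGUOUS = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def replace_ambiguous(seq, rng=random):
    """Replace each ambiguous code with a random allowed base.

    Returns the new sequence and a Counter of replacements per code.
    """
    counts = Counter()
    out = []
    for base in seq:
        allowed = IUPAC_AMBIGUOUS.get(base)
        if allowed is None:
            out.append(base)  # unambiguous (or unknown) character: keep as-is
        else:
            counts[base] += 1
            out.append(rng.choice(allowed))
    return "".join(out), counts


def uncertainty(counts, total_bases):
    """Assumed reading of u: sum of ap(ac) * (1 - 1/pb(ac)) over total bases."""
    if total_bases == 0:
        return 0.0
    weighted = sum(
        n * (1 - 1 / len(IUPAC_AMBIGUOUS[code])) for code, n in counts.items()
    )
    return weighted / total_bases


seq = "ACGTNNRY"
new_seq, counts = replace_ambiguous(seq, random.Random(0))
print(sum(counts.values()))            # total replacements: 4
print(uncertainty(counts, len(seq)))   # (2*0.75 + 0.5 + 0.5) / 8 = 0.3125
```

With this reading, each `N` contributes 0.75 and each two-base code such as `R` contributes 0.5, so a file with no ambiguous codes has an uncertainty of 0.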

usage: unambiguous_codes [-h] [-t THREADS] in_file out_file

Replacing ambiguity codes in FASTA/FASTQ with A, C, G or T.

positional arguments:
  in_file
  out_file

options:
  -h, --help            show this help message and exit
  -t THREADS, --threads THREADS
                        Number of parallel threads [default: 1]
