complexcgr
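contains classes around the Chaos Game Representation (CGR) for DNA sequences.
The Frequency matrix CGR (FCGR) helps to visualize a k-mer distribution. The FCGR of a sequence is an image showing the distribution of its k-mers for a given value of k. The position of each k-mer in the image is defined by its CGR encoding.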
Some examples of bacterial assemblies (see reference) are shown below.
The name of the species and the sample_id are shown in the title of each image (see the first image for an example). These images were created using the 6-mers of each assembly and the class FCGR of this library.
10 different species of bacteria represented by their FCGR (6-mers)
pip install complexcgr
To update to the latest version:
pip install complexcgr --upgrade
Version 0.8.0:
The available classes and functionalities are listed below:
Encoders
The encoders map a sequence to a numerical encoding: CGR, iCGR, and ComplexCGR.
CGR
Chaos Game Representation: encodes a DNA sequence in 3 numbers
- encode a sequence.
- recover a sequence from a CGR encoding.
iCGR
integer CGR: encodes a DNA sequence in 3 integers
- encode a sequence
- recover a sequence from an iCGR encoding
ComplexCGR
Encodes a DNA sequence in 2 integers
- encode a sequence
- recover a sequence from a ComplexCGR encoding
- plot sequence of ComplexCGR encodings
Image for distribution of k-mers
FCGR
Frequency Matrix CGR: representation as an image for k-mer representativity, based on CGR.
- generates FCGR from an arbitrary n-long sequence.
- plot FCGR.
- save FCGR generated.
- save FCGR in different bits.
FCGRKmc
Same as FCGR, but receives as input the file with k-mer counts generated with KMC.
ComplexFCGR
Frequency ComplexCGR: representation as an image (circle) for k-mer representativity, based on ComplexCGR.
- generates ComplexFCGR from an arbitrary n-long sequence.
- plot ComplexFCGR.
- save ComplexFCGR generated.
from complexcgr import CGR
# Instantiate class CGR
cgr = CGR()
# encode a sequence
cgr.encode("ACGT")
# > CGRCoords(N=4, x=0.1875, y=-0.5625)
# recover a sequence from CGR coordinates
cgr.decode(N=4, x=0.1875, y=-0.5625)
# > "ACGT"
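To make the numbers above concrete, here is a minimal, library-independent sketch of the chaos game iteration. The corner assignment (A, C, G, T at (1, 1), (-1, 1), (-1, -1), (1, -1)), the starting point (0, 0), and the function names `cgr_encode` and `cgr_decode` are assumptions chosen so that the sketch reproduces the output quoted above; the actual implementation in complexcgr may differ.

```python
# Assumed corner for each nucleotide (chosen to match the output quoted above).
CORNERS = {"A": (1, 1), "C": (-1, 1), "G": (-1, -1), "T": (1, -1)}


def cgr_encode(seq):
    """Start at (0, 0) and move halfway towards the corner of each nucleotide."""
    x, y = 0.0, 0.0
    for nt in seq:
        cx, cy = CORNERS[nt]
        x, y = (x + cx) / 2, (y + cy) / 2
    return len(seq), x, y


def cgr_decode(n, x, y):
    """Undo the iteration: the quadrant of (x, y) reveals the last nucleotide."""
    quadrant = {corner: nt for nt, corner in CORNERS.items()}
    seq = []
    for _ in range(n):
        corner = (1 if x > 0 else -1, 1 if y > 0 else -1)
        seq.append(quadrant[corner])
        # Invert the halving step to get the previous point.
        x, y = 2 * x - corner[0], 2 * y - corner[1]
    return "".join(reversed(seq))


print(cgr_encode("ACGT"))               # (4, 0.1875, -0.5625)
print(cgr_decode(4, 0.1875, -0.5625))   # ACGT
```

The length N has to be stored alongside (x, y) because the decoder needs to know how many steps to undo.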
Input for FCGR only accepts sequences over the alphabet {A, C, G, T, N}.
import random; random.seed(42)
from complexcgr import FCGR
# set the k-mer size
fcgr = FCGR(k=8) # (256x256) array
# Generate a random sequence without T's
seq = "".join(random.choice("ACG") for _ in range(300_000))
chaos = fcgr(seq) # an array with the frequencies of each k-mer
fcgr.plot(chaos)
FCGR representation for a sequence without T's
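As a rough illustration of what an FCGR contains, the sketch below counts k-mers with a sliding window and places each count in a 2^k x 2^k grid at the position given by the k-mer's integer CGR coordinates. Everything here is an assumption made for illustration: the corner assignment, the row/column orientation, the choice to skip k-mers containing symbols other than A, C, G, T, and the names `count_kmers`, `kmer_to_pixel`, and `fcgr_matrix`. The class FCGR in the library may lay out the image differently.

```python
from collections import Counter

# Assumed corner for each nucleotide (same convention as the CGR example).
CORNERS = {"A": (1, 1), "C": (-1, 1), "G": (-1, -1), "T": (1, -1)}


def count_kmers(seq, k):
    """Naive sliding-window count; k-mers with symbols outside ACGT are skipped."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def kmer_to_pixel(kmer):
    """Map a k-mer to (row, col) in a 2^k x 2^k grid via integer CGR coordinates."""
    k = len(kmer)
    x = sum(CORNERS[nt][0] * 2**i for i, nt in enumerate(kmer))
    y = sum(CORNERS[nt][1] * 2**i for i, nt in enumerate(kmer))
    # x and y are odd integers in [-(2^k - 1), 2^k - 1]; shift them to 0..2^k - 1.
    col = (x + 2**k - 1) // 2
    row = (2**k - 1 - y) // 2  # assumed orientation: larger y is higher in the image
    return row, col


def fcgr_matrix(seq, k):
    """Build the grid of k-mer counts as a list of lists."""
    size = 2**k
    matrix = [[0] * size for _ in range(size)]
    for kmer, n in count_kmers(seq, k).items():
        row, col = kmer_to_pixel(kmer)
        matrix[row][col] = n
    return matrix


for row in fcgr_matrix("AACGTNAA", k=2):
    print(row)
```

With this layout every k-mer gets its own cell, so a sequence that never contains T leaves the whole region of T-ending k-mers empty, which is the kind of pattern visible in the image above.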
You can save the image with
fcgr.save_img(chaos, path="img/ACG.jpg")
Formats allowed are defined by PIL.
You can also generate the image in 16 bits (or more) to avoid losing information about k-mer frequencies.
# Generate image in 16-bits (default is 8-bits)
fcgr = FCGR(k=8, bits=16) # (256x256) array. When using plot() it will be rescaled to [0,65535] colors
# Generate a random sequence without T's and with lots of N's
seq = "".join(random.choice("ACGN") for _ in range(300_000))
chaos = fcgr(seq) # an array with the frequencies of each k-mer
fcgr.plot(chaos)
FCGR representation for a sequence without T's and with lots of N's
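To see why a higher bit depth preserves more information, consider a simple linear rescaling of counts onto the available gray levels. The function `rescale` below is a hypothetical helper written for this illustration; the exact rescaling used by the library may differ.

```python
def rescale(counts, bits=8):
    """Linearly map counts onto the integer range [0, 2**bits - 1]."""
    max_level = 2**bits - 1
    lo, hi = min(counts), max(counts)
    if hi == lo:  # constant input: avoid dividing by zero
        return [0 for _ in counts]
    return [round((c - lo) / (hi - lo) * max_level) for c in counts]


counts = [0, 1, 2, 1000]
print(rescale(counts, bits=8))   # [0, 0, 1, 255]
print(rescale(counts, bits=16))  # [0, 66, 131, 65535]
```

With 8 bits, the counts 0 and 1 collapse onto the same gray level once a single k-mer dominates; with 16 bits they remain distinguishable.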
from complexcgr import iCGR
# Instantiate class iCGR
icgr = iCGR()
# encode a sequence
icgr.encode("ACGT")
# > CGRCoords(N=4, x=3, y=-9)
# recover a sequence from iCGR coordinates
icgr.decode(N=4, x=3, y=-9)
# > "ACGT"
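The integer coordinates above can be reproduced with a short sketch in which the i-th nucleotide contributes its corner scaled by 2^i. As before, the corner assignment and the names `icgr_encode` and `icgr_decode` are assumptions chosen to match the quoted output, not the library's own code.

```python
# Assumed corner for each nucleotide (chosen to match the output quoted above).
CORNERS = {"A": (1, 1), "C": (-1, 1), "G": (-1, -1), "T": (1, -1)}


def icgr_encode(seq):
    """Sum the corners, weighting the i-th nucleotide by 2**i."""
    x = y = 0
    for i, nt in enumerate(seq):
        cx, cy = CORNERS[nt]
        x += cx * 2**i
        y += cy * 2**i
    return len(seq), x, y


def icgr_decode(n, x, y):
    """The largest term dominates, so the signs of (x, y) give the last nucleotide."""
    quadrant = {corner: nt for nt, corner in CORNERS.items()}
    seq = []
    for i in reversed(range(n)):
        corner = (1 if x > 0 else -1, 1 if y > 0 else -1)
        seq.append(quadrant[corner])
        # Remove the contribution of the nucleotide just recovered.
        x -= corner[0] * 2**i
        y -= corner[1] * 2**i
    return "".join(reversed(seq))


print(icgr_encode("ACGT"))      # (4, 3, -9)
print(icgr_decode(4, 3, -9))    # ACGT
```

Python integers have arbitrary precision, so this round trip stays exact for long sequences, whereas the floating-point coordinates of a plain CGR eventually run out of precision.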
from complexcgr import ComplexCGR
# Instantiate class ComplexCGR
ccgr = ComplexCGR()
# encode a sequence
ccgr.encode("ACGT")
# > CGRCoords(k=228,N=4)
# recover a sequence from ComplexCGR coordinates
ccgr.decode(k=228,N=4)
# > "ACGT"
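The value k=228 for "ACGT" is consistent with reading the sequence as a base-4 number with A=0, C=1, G=2, T=3 and the first nucleotide as the least significant digit. The sketch below implements that reading; the digit assignment and the names `ccgr_encode` and `ccgr_decode` are assumptions inferred from this single example, so treat it as an illustration rather than the library's definition.

```python
# Assumed base-4 digit for each nucleotide (inferred from k=228 for "ACGT").
DIGITS = {"A": 0, "C": 1, "G": 2, "T": 3}


def ccgr_encode(seq):
    """Read the sequence as a base-4 number, first nucleotide least significant."""
    k = sum(DIGITS[nt] * 4**i for i, nt in enumerate(seq))
    return k, len(seq)


def ccgr_decode(k, n):
    """Peel off n base-4 digits, least significant first."""
    letters = "ACGT"
    seq = []
    for _ in range(n):
        k, digit = divmod(k, 4)
        seq.append(letters[digit])
    return "".join(seq)


print(ccgr_encode("ACGT"))    # (228, 4)
print(ccgr_decode(228, 4))    # ACGT
```

The length N is needed in addition to k because, under this reading, "A" and "AA" both map to k=0.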
Input for FCGR only accepts sequences over the alphabet {A, C, G, T, N}.
import random; random.seed(42)
from complexcgr import ComplexFCGR
# set the k-mer size
cfcgr = ComplexFCGR(k=8) # 8-mers
# Generate a random sequence without T's
seq = "".join(random.choice("ACG") for _ in range(300_000))
fig = cfcgr(seq)
ComplexFCGR representation for a sequence without T's
You can save the image with
cfcgr.save(fig, path="img/ACG-ComplexCGR.png")
Currently, the plot must be saved as a PNG file.
Counting k-mers can be the bottleneck for large sequences (> 100000 bp).
Note that the classes FCGR and ComplexFCGR implement a naive approach to count k-mers. This is intentional, since in practice state-of-the-art tools like KMC or Jellyfish are used to count k-mers very efficiently.
We provide the class FCGRKmc, which receives as input the file generated by the following pipeline using KMC3.
Make sure to have kmc installed. One recommended way is to create a conda environment and install it there.
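kmer_size=6
input="path/to/sequence.fa"
output="path/to/count-kmers.txt"
# temporary working directory required by kmc
mkdir -p tmp-kmc
# count k-mers: -m4 caps RAM at 4 GB (-sm: strict memory mode), -ci0 and -cs100000 set the
# minimum count and the maximum counter value, -b disables canonical k-mers, -fa reads FASTA
kmc -v -k$kmer_size -m4 -sm -ci0 -cs100000 -b -t4 -fa "$input" "$input" "tmp-kmc"
# dump the k-mer counts to a text file
kmc_tools -t4 -v transform "$input" dump "$output"
# remove the KMC database files
rm -r "$input.kmc_pre" "$input.kmc_suf"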
The output file path/to/count-kmers.txt can be used with FCGRKmc:
from complexcgr import FCGRKmc
kmer = 6
fcgr = FCGRKmc(kmer)
arr = fcgr("path/to/count-kmers.txt") # k-mer counts ordered in a matrix of 2^k x 2^k
# to visualize the distribution of k-mers.
# Frequencies are scaled between [min, max] values.
# White color corresponds to the minimum value of frequency
# Black color corresponds to the maximum value of frequency
fcgr.plot(arr)
# Save it with numpy
import numpy as np
np.save("path_save/fcgr.npy",arr)
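For reference, the text file written by `kmc_tools ... dump` has one k-mer per line followed by its count, separated by whitespace. The helper below, `parse_kmc_dump`, is a hypothetical name used only for this sketch; it shows how such a file can be read into a dictionary without the library, for example to inspect the counts before building the FCGR.

```python
def parse_kmc_dump(lines):
    """Parse 'KMER<whitespace>COUNT' lines into a dict of k-mer -> count."""
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:  # ignore blank lines
            continue
        kmer, count = line.split()
        counts[kmer] = int(count)
    return counts


# Works on any iterable of lines, including an open file handle:
#     with open("path/to/count-kmers.txt") as fh:
#         counts = parse_kmc_dump(fh)
example = ["AAAAAA\t12\n", "AAAAAC\t3\n", "\n"]
print(parse_kmc_dump(example))  # {'AAAAAA': 12, 'AAAAAC': 3}
```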
CGR encoding
CGR encoding of all k-mers. This defines the positions of k-mers in the FCGR image.
ComplexCGR encoding