Pronounced génépi, like the French alpine spirit.
import genepy mysequences = genepy.seqarray("tutorial/rubella_samples.fasta") mysequences.show()
Different colours represent different nucleotides. They're not aligned, and one is much longer than the others. Let's dig a little deeper by printing a summary of the sequence array.
This returns :
GenePy sequence set (genepy.seqarray) : -- 32 sequences -- Mean length : 1697.2 (min 739, max 9762)
We have a wide range of sequence lengths. Most are small, but based on the visualisation, we have two that are probably a full genome. Our next step will be to align them and cut away parts of sequences that we don't have enough duplicates for.
We've aligned the sequences with the default GenePy arguments to Clustal Omega. A new file
was written to disk :
rubella_samples_aligned_genepy.phy. This file, in PHYLIP file format,
acts as a checkpoint for your future work. Let's visualise the result.
Most of the sequences belong to the E1 glycoprotein gene. Let's trim excess from the left and right, so we're comparing like with like.
mysequences.trimalignment(left = 8750, right = 9470) mysequences.show()
Our sequence array is now shorter in terms of nucleotides per sequence. We're considering the same parts of the genome for each sequence, so we can start to look at the nucleotide statistics.
This shows us the average nucleotide content, distributions of nucleotide frequencies across difference sequences, and the frequencies of two-step nucleotide transitions. We see that there's a very high cytosine and guanine content, and that cytosines tend to be followed by more cytosines, although CpG and GpC are also common. The high C+G content is an interesting feature of the rubella virus, which is somewhat of an outlier in this respect amongst ssRNA viruses.
Let's construct a phylogenetic tree from this alignment, once again using ... TO BE CONTINUED
The core data structure in GenePy is the sequence array, or
seqarray object. It acts as a container for a list of sequences, and has member functions for sequence alignment and display, for trimming sequences, for showing sequence array statistics, and for constructing phylogenetic trees.
GenePy makes system calls to ClustalO and PhyML, which do the heavy lifting. Some of the description of these packages, their usage, and their command-line arguments which follows is taken from their respective documentation, such as the ClustalO README file and the PhyML Manual.
Sequence Array Methods
Visual representation of array sequence
A visual representation of the sequences in the
seqarray. Each nucleotide has its own colour; black is an empty site or an unknown nucleotide.
Sequence array statistics
- the nucleotide content
- the distributions of nucleotide frequencies over each sequence as estimated kernel densities ( if scikit-learn is installed ) or as a histogram ( if scikit-learn is not found )
- the two-step transition matrix between nucleotides ( frequency of having an A going to a C, etc. )
.align(force = True, it = False, full = False, full_iter = False, auto = True, threads = False)
Align the sequences in the
seqarray by calling Clustal Omega. Sequences are clustered into a guide tree, which is used to guide a progressive alignment.
Arguments are command-line arguments to ClustalO.
force: overwrite the filename, if the output alignment file exists. The filename defaults to the filename of the sequence you passed on creation of the sequence array, without the extension, and with
it: the integer number of guide tree iterations. By default, no iteration of the guide tree is done. Iteration generates an alignment from the guide tree, then uses this alignment to generate a new guide tree. This iterated alignment procedure could give rise to better alignments at a linear cost in alignment time. Set to
Falsefor no iteration.
full: use the full distance matrix for guide-tree calculation; default uses the fast clustering algorithm, mBed5 instead of constructing a full distance matrix. mBed calculates a reduced set of pairwise distances. Use
full_iter: use the full distance matrix for guide-tree calculation during guide-tree iteration only.
auto: sets options automatically, selecting options for both speed and accuracy according to the number of sequences. This could overwrite some of your other arguments.
auto = Falseis automatically set if any of
full_iterare set to
threads: the number of threads to use for the parallelised part of the alignment. By default, ClustalO will use as many cores as are available;
threadscan be used to limit core usage.
Trim an alignment
.trimalignment(array = None, left = None, right = None)
left nucleotides from the beginning and
right nucleotides from the end of the sequence array. This is useful when you have aligned a number of sequences of different lengths, and want to consider only a full array, where each sequence has the same length. Currently, this method does not automatically find a reasonable truncation zone. As such,
array must be
right must be indices. Will probably get rid of the
array parameter later and just have
right as optional arguments.
Construct a phylogenetic tree
.phylotree(nucleotide_frequency = "empirical", bootstrap = -4, search_algorithm = "BEST")
Construct a phylogenetic tree by calling PhyML.
nucleotide_frequency: equilibrium nucleotide frequencies. Can be
max_likelihood. Will make this a better argument setup later.
bootstrap: bootstrap analysis of internal branch support. Integers > 0 set the number of bootstrap replicates;
bootstrap = 0turns off bootstrapping. In addition, values of
-1run approximate likelihood ratio tests returning aLRT statistics;
-2returns Chi2 based parametric branch supports, and
-4gives SH-like branch supports only.
search_algorithm: algorithm for tree searching. Options are
NNIfor nearest neighbour interchange,
SPRfor subtree pruning and regrafting, and
BESTfor the best of NNI and SPR methods.
Drop sequences with low nucleotide count
.dropempties(fraction = 0.5)
Removes any sequences which have lower nucleotide fraction than specified. This is useful after aligning sequences that may contain different parts of the genome, and trimming to an area of interest; calling
seqarray.dropempties() will the remove any sequences that don't contain information about this part of the genome.
fraction: the fraction of known nucleotides per site in the sequence, below which it is removed from the sequence array.
Sequencey Array Member Variables
.seq- a Python list of BioPython sequence objects ( from
.len- the number of sequences in the sequence array
Sequence Array Iteration
Sequence array objects are iterables. The elements are BioPython
Bio.Seq objects. For example :
for record in mysequences : print "%s \n%s \n" % (record.id, record.seq)
may return :
gi|538511347|gb|KF593861.1| CACCGGGCCCCTTGGGGCTGAAATTCAAGACCGTCCGCCCGGTTGCCCTGCCACGCGCGTTCGCACCACCCCGCAATGCGCGTGTGACCGGGTGTTACCAGTGCGGCACCCCCGCGCTGGTGGAAGGCCTTGCCCCCGGGGGGGGCAATTGCCATCTCACTGTCAATGGCGAGGATGTCGGCGCCTTCCCCCCTGGGAAGTTCGTCACCGCCGCCCTCCTCAACACCCCCCCGCCCTACCAAGTCAGCTGCGGGGGCGAGAGCGATCGCGCGAGCGCGCGGGTCATCGACCCCGCCGCGCAATCGTTTACCGG gi|513136723|gb|JX546593.1| TCTCCGTAGCCGGCGTGTCGTGCAATGTTACCACTGAACACCCGTTCTGCAACACGCCGCACGGACAACTCGAGGTCCAGGTCCCGCCCGACCCTGGGGACCTGGTTGAGTACATTATGAATTACACCGGCAATCAACAGTCCCGGTGGGGCCTCGGGAGCCCGAACTGTCATGGCCCCGATTGGGCCTCCCCGGTTTGCCAACGCCATTCCCCTGACTGCTCGCGGCTTGTGGGGGCCACGCCAGAGCGTCCCCGGCTGCGCCTGGTTGACGCCGATGACCCCCTGCTGCGCACCGCCCCTGGGCCCGGCGAGGTGTGGGTCACGCCTGTCATAGGCTCTCAGGCGCGCAAGTGCGGACTCCACATACGCGCTGGACCGTACGGCCATGCTACCGTCGAAATGCCCGAGTGGATCCACGCCCACACCACCAGCGACCCTTGGCACCCACCGGGCCCCTTGGGGCTGAAGTTTAAGACGGTTCGCCCGGTGGTCCTGCCACGCGCGTTGGCGCCGCCCCGCAATGTGCGTGTGACCGGGTGCTACCAGTGCGGCACGCCCGCGCTGGTGGAAGGCCTTGCCCCCGGGGGAGGGAA
At the moment, GenePy is in pre-alpha. Much of the documentation is missing, and there's not yet much functionality. I'll be adding functionality, mostly as I require it, but please feel free to send a pull request my way if you'd like to contribute.
GenePy is available on the Python Package Index (PyPI). As such, you can install it using
pip install genepy
- Cock, PJ et al., Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25, 1422, 2009
- Sievers, F et al., Fast, scalable generation of high‐quality protein multiple sequence alignments using Clustal Omega. Molecular Systems Biology 7, 539, 2011
- Guindon, S et al., New Algorithms and Methods to Estimate Maximum-Likelihood Phylogenies: Assessing the Performance of PhyML 3.0. Systematic Biology 59, 307, 2010
- Benson, DA et al., Genbank. Nucleic Acid Research 41 (D1), D36, 2013
- Blackshields, G et al., Sequence embedding for fast construction of guide trees for multiple sequence alignment. Algorithms for Molecular Biology 5, 21, 2010