Skip to content

Latest commit

 

History

History
298 lines (148 loc) · 11.4 KB

README.md

File metadata and controls

298 lines (148 loc) · 11.4 KB

GenePy

Pronounced génépi, like the French alpine spirit.

GenePy is a Python package that acts as an interface between BioPython,1 ClustalO,2 and PhyML,3 for manipulating nucleotide sequences, all in a neat data structure.

Dependencies :

Optional :

Example Usage

We have a file in FASTA format from GenBank,4 containing 31 sequences from the Rubella virus genome. Let's import the sequences and display them.

import genepy

mysequences = genepy.seqarray("tutorial/rubella_samples.fasta")

mysequences.show()

.show() - visually display the sequence array

Different colours represent different nucleotides. They're not aligned, and one is much longer than the others. Let's dig a little deeper by printing a summary of the sequence array.

mysequences

This returns :

GenePy sequence set (genepy.seqarray) :
-- 32 sequences
-- Mean length : 1697.2 (min 739, max 9762)

We have a wide range of sequence lengths. Most are small, but based on the visualisation, we have two that are probably a full genome. Our next step will be to align them and cut away parts of sequences that we don't have enough duplicates for.

mysequences.align()

We've aligned the sequences with the default GenePy arguments to Clustal Omega. A new file was written to disk : rubella_samples_aligned_genepy.phy. This file, in PHYLIP file format, acts as a checkpoint for your future work. Let's visualise the result.

mysequences.show()

Updated visualisation after sequence alignment

Most of the sequences belong to the E1 glycoprotein gene. Let's trim excess from the left and right, so we're comparing like with like.

mysequences.trimalignment(left = 8750, right = 9470)

mysequences.show()

Updated visualisation after sequence trimming

Our sequence array is now shorter in terms of nucleotides per sequence. We're considering the same parts of the genome for each sequence, so we can start to look at the nucleotide statistics.

mysequences.stats()

Sequence statistics

This shows us the average nucleotide content, distributions of nucleotide frequencies across difference sequences, and the frequencies of two-step nucleotide transitions. We see that there's a very high cytosine and guanine content, and that cytosines tend to be followed by more cytosines, although CpG and GpC are also common. The high C+G content is an interesting feature of the rubella virus, which is somewhat of an outlier in this respect amongst ssRNA viruses.

Let's construct a phylogenetic tree from this alignment, once again using ... TO BE CONTINUED

Sequence Arrays

The core data structure in GenePy is the sequence array, or seqarray object. It acts as a container for a list of sequences, and has member functions for sequence alignment and display, for trimming sequences, for showing sequence array statistics, and for constructing phylogenetic trees.

Workhorse Packages

GenePy makes system calls to ClustalO and PhyML, which do the heavy lifting. Some of the description of these packages, their usage, and their command-line arguments which follows is taken from their respective documentation, such as the ClustalO README file and the PhyML Manual.

Sequence Array Methods

Visual representation of array sequence

.show()

A visual representation of the sequences in the seqarray. Each nucleotide has its own colour; black is an empty site or an unknown nucleotide.


Sequence array statistics

.stats()

Displays :

  • the nucleotide content
  • the distributions of nucleotide frequencies over each sequence as estimated kernel densities ( if scikit-learn is installed ) or as a histogram ( if scikit-learn is not found )
  • the two-step transition matrix between nucleotides ( frequency of having an A going to a C, etc. )

Align sequences

.align(force = True, it = False, full = False, full_iter = False, auto = True, threads = False)

Align the sequences in the seqarray by calling Clustal Omega. Sequences are clustered into a guide tree, which is used to guide a progressive alignment.

Arguments are command-line arguments to ClustalO.

  • force : overwrite the filename, if the output alignment file exists. The filename defaults to the filename of the sequence you passed on creation of the sequence array, without the extension, and with _aligned_genepy.phy appended.
  • it : the integer number of guide tree iterations. By default, no iteration of the guide tree is done. Iteration generates an alignment from the guide tree, then uses this alignment to generate a new guide tree. This iterated alignment procedure could give rise to better alignments at a linear cost in alignment time. Set to False for no iteration.
  • full : use the full distance matrix for guide-tree calculation; default uses the fast clustering algorithm, mBed5 instead of constructing a full distance matrix. mBed calculates a reduced set of pairwise distances. Use True to iterate.
  • full_iter : use the full distance matrix for guide-tree calculation during guide-tree iteration only.
  • auto : sets options automatically, selecting options for both speed and accuracy according to the number of sequences. This could overwrite some of your other arguments. auto = False is automatically set if any of iter, full, or full_iter are set to True.
  • threads : the number of threads to use for the parallelised part of the alignment. By default, ClustalO will use as many cores as are available; threads can be used to limit core usage.

Trim an alignment

.trimalignment(array = None, left = None, right = None)

Remove left nucleotides from the beginning and right nucleotides from the end of the sequence array. This is useful when you have aligned a number of sequences of different lengths, and want to consider only a full array, where each sequence has the same length. Currently, this method does not automatically find a reasonable truncation zone. As such, array must be None, and left and right must be indices. Will probably get rid of the array parameter later and just have left and right as optional arguments.


Construct a phylogenetic tree

.phylotree(nucleotide_frequency = "empirical", bootstrap = -4, search_algorithm = "BEST")

Construct a phylogenetic tree by calling PhyML.

  • nucleotide_frequency : equilibrium nucleotide frequencies. Can be empirical or max_likelihood. Will make this a better argument setup later.
  • bootstrap : bootstrap analysis of internal branch support. Integers &gt 0 set the number of bootstrap replicates; bootstrap = 0 turns off bootstrapping. In addition, values of -1 run approximate likelihood ratio tests returning aLRT statistics; -2 returns Chi2 based parametric branch supports, and -4 gives SH-like branch supports only.
  • search_algorithm : algorithm for tree searching. Options are NNI for nearest neighbour interchange, SPR for subtree pruning and regrafting, and BEST for the best of NNI and SPR methods.

Drop sequences with low nucleotide count

.dropempties(fraction = 0.5)

Removes any sequences which have lower nucleotide fraction than specified. This is useful after aligning sequences that may contain different parts of the genome, and trimming to an area of interest; calling seqarray.dropempties() will the remove any sequences that don't contain information about this part of the genome.

  • fraction : the fraction of known nucleotides per site in the sequence, below which it is removed from the sequence array.

Sequencey Array Member Variables

  • .seq - a Python list of BioPython sequence objects ( from Bio.Seq.Seq )
  • .len - the number of sequences in the sequence array

Sequence Array Iteration

Sequence array objects are iterables. The elements are BioPython Bio.Seq objects. For example :

for record in mysequences :
	print "%s \n%s \n" % (record.id, record.seq)

may return :

gi|538511347|gb|KF593861.1| 
CACCGGGCCCCTTGGGGCTGAAATTCAAGACCGTCCGCCCGGTTGCCCTGCCACGCGCGTTCGCACCACCCCGCAATGCGCGTGTGACCGGGTGTTACCAGTGCGGCACCCCCGCGCTGGTGGAAGGCCTTGCCCCCGGGGGGGGCAATTGCCATCTCACTGTCAATGGCGAGGATGTCGGCGCCTTCCCCCCTGGGAAGTTCGTCACCGCCGCCCTCCTCAACACCCCCCCGCCCTACCAAGTCAGCTGCGGGGGCGAGAGCGATCGCGCGAGCGCGCGGGTCATCGACCCCGCCGCGCAATCGTTTACCGG

gi|513136723|gb|JX546593.1| 
TCTCCGTAGCCGGCGTGTCGTGCAATGTTACCACTGAACACCCGTTCTGCAACACGCCGCACGGACAACTCGAGGTCCAGGTCCCGCCCGACCCTGGGGACCTGGTTGAGTACATTATGAATTACACCGGCAATCAACAGTCCCGGTGGGGCCTCGGGAGCCCGAACTGTCATGGCCCCGATTGGGCCTCCCCGGTTTGCCAACGCCATTCCCCTGACTGCTCGCGGCTTGTGGGGGCCACGCCAGAGCGTCCCCGGCTGCGCCTGGTTGACGCCGATGACCCCCTGCTGCGCACCGCCCCTGGGCCCGGCGAGGTGTGGGTCACGCCTGTCATAGGCTCTCAGGCGCGCAAGTGCGGACTCCACATACGCGCTGGACCGTACGGCCATGCTACCGTCGAAATGCCCGAGTGGATCCACGCCCACACCACCAGCGACCCTTGGCACCCACCGGGCCCCTTGGGGCTGAAGTTTAAGACGGTTCGCCCGGTGGTCCTGCCACGCGCGTTGGCGCCGCCCCGCAATGTGCGTGTGACCGGGTGCTACCAGTGCGGCACGCCCGCGCTGGTGGAAGGCCTTGCCCCCGGGGGAGGGAA

Current State

At the moment, GenePy is in pre-alpha. Much of the documentation is missing, and there's not yet much functionality. I'll be adding functionality, mostly as I require it, but please feel free to send a pull request my way if you'd like to contribute.

Installation

GenePy is available on the Python Package Index (PyPI). As such, you can install it using

pip install genepy

References

  1. Cock, PJ et al., Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25, 1422, 2009
  2. Sievers, F et al., Fast, scalable generation of high‐quality protein multiple sequence alignments using Clustal Omega. Molecular Systems Biology 7, 539, 2011
  3. Guindon, S et al., New Algorithms and Methods to Estimate Maximum-Likelihood Phylogenies: Assessing the Performance of PhyML 3.0. Systematic Biology 59, 307, 2010
  4. Benson, DA et al., Genbank. Nucleic Acid Research 41 (D1), D36, 2013
  5. Blackshields, G et al., Sequence embedding for fast construction of guide trees for multiple sequence alignment. Algorithms for Molecular Biology 5, 21, 2010