AMechtley edited this page Aug 23, 2016 · 70 revisions

####April 22nd - May 5th: Weeks 1 & 2

  • Read paper on previous work related to regular-expression-like queries: PhyloPattern
  • Joined IRC on Gitter
  • First meeting with Mentors on Skype
  • Started reviewing ETE code
  • quantifiedcode:
  • "Resolve import naming collision": fixed 3 issues
  • "Check usage of exploitable MD5 hashes": Ignored. Not used for storing passwords, only for file matching.
  • "Avoid accessing class members from outside": Skipped. Large task with low priority. All internal. Consider subclassing QDialog.
  • "Avoid empty exception handlers". Fixed where it says "except Exception:" or simply "except:" to provide the specific range of possible exceptions. If it appears that no exceptions can be raised, then ignored (may want to remove exception handling because since all exceptions are already caught somewhere else). Skipped Exception passing (could provide printing/logging errors as is standard practice but there are reasons not to do it). (Looked through first page of 10 on quanitifedcode list).

####May 6th - May 19th: Weeks 3 & 4

  • Read two papers regarding another tree pattern search tool: Tree pattern matching suite, TPMS.
  • Their 2005 paper discusses basic algorithm (minus one key detail). Their 2009 paper discusses an improvement (query trees can now be collected by users, not just from within their own database) and gives examples.
  • This tool and the PhyloPattern tool both know about each other but compare their software to one another. PhloPattern simply says that TPMS is restricted to their own database (which is was at the time) and TPMS simply says that PhyloPattern has a difficult syntax to learn and is not strictly compiled (written in Prolog and Java) so it will not likely be as fast to search for a pattern.
  • Searched forums and conference proceedings for relevant topics.
  • learned details of the languages of the two programs PhyloPattern and TPMS to look for a commonality in techniques for structuring queries.

###Regular-expression-like Syntax

I focused on the regular-expression-like syntax used in each program, learning the details of how to write queries for TPMS vs PhyloPattern, making comparisons and focusing on the advantages of each language.

Example: search for a subtree with three nodes A, B, C and where B and C are siblings and A is attached to the parent node.

PhyloPattern Query:

Tags (which may be empty brackets) are required for every group of nodes. As a result, the syntax contains several layers of brackets making it difficult to read. There are also several grammar rules to learn (at least 14 grammars and 4 symbols).
@[[Species(‘ A)’,[[Species(‘B’),Species(‘C’)],[]]],[]]

TPMS query:

Uses a (Parameters/Subtree Constraints/Leaf Constraints) format. It is easy to read but is very literal (lacks ways of collapsing nodes; it is not possible to specify fuzzy patterns in which some species are optional for the search.). Finding families containing a large number of species and exactly matching this pattern is difficult in TPMS.

###Notes: In each of the examples above, Newick trees can not be used directly in queries. To be fair, it would be possible to manipulate Newick trees to become TPMS queries, so such searches are still feasible but require preprocessing. The current prototype implementation of treematcher uses a python-like syntax based on newick structure, which will easily allow for trees to be searched directly when I begin coding.

Suggested syntax for the same query above using ETE treematcher, written in standard Newick format:

PhyloPattern is built on the prolog language which enables many options for creating complex queries whereas TPMS is built from the need to perform specific types of queries on gene sets. TPMS is easy to read but is limited in the types of questions you can ask. In our program, the syntax is based both on enabling easy querying of common biological questions (through predefined and user-built functions) AND a python-based syntax that allows for complex queries. This will allow the program to be both flexible and easy-to-read.

Examples of questions we would like to ask and possible functions to build:

  • Is there evidence of domain shuffling?
  • Have there been gene loss events?
  • find sets of orthologs in a specific taxon using a pattern corresponding to the subtree of a species tree containing the given taxon. If a gene tree contains exactly this pattern, then the most parsimonious hypothesis is that the matching sequences are orthologous. (according to TPMS paper).
  • The issue of whether all animals with a specific characteristic form a single clade. Example in TPMS paper: all animals that moult an exoskeleton (such as the coelomate arthropods and the pseudocoelomate nematodes) form a distinct clade.

My goal this week is to begin forming ways to ask those queries using either the python-like syntax or predefined python functions.


If using Pyyaml, this is relevant: Here's two examples of how you might define constraints on a tree using YAML. YAML normally consists of key : value pairs separated by a colon. You can have more complex nested structures where each level has a name, followed by a colon, and an indented list of sub-attributes, which might themselves have nested structure. Dashes before elements indicate they're an (ordered) list. Here's an example of how a general phylo tree might look. Note that dist is an attribute on the node above it. Don't pay attention to the bullets, it is the indents that are important.

    dist: 0.5
     dist: 0.1
     dist: 0.1
     dist: 0.6

Normally each element needs a name (left side of the colon). We can get around that by using nested lists. Lists are just ordered lists of values (without names).

(length( < 3 or == "pasa") and @.dist >= 0.5
   len(@.children) > 2 in ("hello", "bye") # may need quoting since comma is special char

Another way to represent the above set of constraints, where we use the key: value syntax to define constraints (the values) on individual node attributes (the keys). I'm using 'this' as a shorthand for the given attribute you should be able to convince yourself that this syntax (compared to the above) does not allow you to constrain one attribute OR another (only AND) i.e. if we replace 'and' with 'or' in the first constraint above: (length( < 3 or == "pasa") or @.dist >= 0.5 there's no way to represent that in the syntax below. these constraints are on the root node.

name: length(this) < 3 or this == "pasa"
   dist: this >= 0.5
      children: len(this) > 2
      name: this in ("hello", "bye")

Week 3, May 23-28 (project starts)

####May 20th - June 2nd: Weeks 5 & 6 Progress The first week I focused on seeing what the program can currently do and extending those capabilities. The second week I stated adding attributes and an annotation function.

Permute the target rather than the pattern (handles unspecified sibling nodes)


  • tree1 = Tree("((kk,(1,2,3)bye)y:1, NODE);", format=1)
  • tree2 = Tree("(((1,2,3)bye)y:1, NODE);", format=1)
  • tree3 = Tree("(((1,2,3)bye,kk)y:1, NODE);", format=1)

The following pattern places 3 constraints on a node

  • a name less than 3 characters long
  • a distance value greater than or equal to 0.5.
  • one child has more than two children of it’s own

pattern_1 = """ (len{@.children} > 2 and in {"hello"|"bye"}){length{} < 3} and @.dist >= 0.5; """

Although all three trees technically satisfy the condition, only the second tree returned True. This is because the query specifies a single child node. I don’t actually care if there is a sibling to node “bye” or not. In order to return True for the other two trees, you had to change the query to something like the following, where an additional child node is specified to exist.

pattern_2 = """ (True, len{@.children} > 2 and in {"hello"|"bye"}){length{} < 3} and @.dist >= 0.5; """ Changing line 40 of the code, to allow the target tree to be larger than the pattern tree, did not correct the problem entirely. if len(node.children) >= len(self.children):

The first tree still returns False. This is because the code specified that all permutations of the pattern be compared to the target. If there is only one child in the pattern, there is only one permutation and it tries to place the child constraint on the left child of the target. By permuting the target node instead of the constraint node, the problem is averted. All three trees now return true.

Note that if you want to force “bye” to have a sibling, you can still use the second pattern.

It was decided that we will focus on Newick format and we will add a YAML parser once the Newick implementation is complete.

####June 3rd - June 16th: Weeks 6 & 7

  • Switched to branch of ETE using node names.

  • Submitted a pull request to that branch to remove the quotes. (It currently includes the quotes in the node name which I’m 99% sure is never the behavior we want.)

  • Added a function to parse constraints.

  • Created list of tuples to replace keywords with python code. The parser currently replaces inner strings (e.g., node names) containing syntax keywords. This is handled in the quoted_node_names parser by creating temporary variable during the replace operation, I will do the same. Something to consider: you may not want to allow custom functions on an API (could send malicious Python code).

English language syntax

Constraints like ' "human" in @.children ' work if I include @.children in the syntax and replace it with [ for node in __target.children]. This will allow more consistent behavior, which is what you’d expect in a query language. Whether you are searching in @.children, @.species, @.leaves, or @.descendants it will all behave in the same way. This will only possibly become an issue when people try to write their own custom functions and don’t see the normal behavior. How to handle the word "is". The " is " was replaced with "==" in the following examples. Although it intuitively seems that it should be used this way, it's not it may not behave properly if someone tries to use it as a Python “is” statement. Users may use "is" intuitively in other places (e.g., comparison statements like "is greater than", "is less than") and I have accounted for those few instances but we may decide to ignore "is" all together. For every @.something I am including in the syntax, I include a version without @ by capitalizing the first letter of Something and using the full word if it is abbreviated (e.g., @.dist is Distance). I am checking ete3 variables and methods for conflicts and so far there have not been any. I can use regular expressions to resolve any conflicts as they arise. I also included a few common symbols in word form to make the queries more readable. The following two queries currently produce the same results. We can make this second style optional if it becomes too complicated.

Examples The following two patterns are equivalent and were parsed correctly by treematcher.

pattern1 = """

( '  @.dist >= 0.5 ' , ' @.species in ("sapiens","pygmaeus")  ' )

' "Pan_troglodytes_1" in @.leaves or "Homo_sapiens_1" in @.children '


pattern2 = """

( '  Distance greater than or equal to 0.5 ' , ' Species is "sapiens" or Species is "pygmaeus" ')

' "Pan_troglodytes_1" in Leaves or "Homo_sapiens_1" in Children '

Note that the above examples required the following dictionary which has been removed. It can be added back later as an optional syntax and I am writing this here as a reminder of how it works in case we want to pursue this option again later.

("is greater than", ">"),
("greater than", ">"),
(" or equal to", "="),
("is less than", "<"),
("less than", "<"),
("Children", "[ for n in __target.children]"),
("Distance", "__target.dist"),
("Leaves", "[ for node in __target.get_leaves()]"),
("Species", "__target.species"),
(" is ", " == "),

Highlights of Progress for Midterm evaluations:

  • Tree matcher has been shown to successfully search for Newick based patterns involving node name, distance, support, species, children, and lineage.

  • Tree matcher now allows for exact matching of Newick trees. Note that this is not an option in the previous two re-pattern programs (TPMS and PhyloPattern).

  • Tree matcher has unittests for Name, distance, support, species, children, and lineage functions. The tests are currently implemented for True/False cases where the tests that return "True" also check that the correct node is returned.

  • Approximately 27k trees were searched for lineage patterns at an average 11 seconds to per 1,000 trees on a 3GHz Intel I7 with 32GB of RAM and a SATA 3 Solid State Drive running Ubuntu Linux. This same test took approximately 56 seconds per 1,000 trees on a MacBook Pro (13-inch, Mid 2010) with 2.4 GHz Intel Core 2 Duo processor and 4 GB RAM and 5400 RPM Hard disk.

June 17th - June 30th

Previous engagement, gone for one week for Evolution Conference (driving and attending). Mentor Evaluations.

  • Implemented command line capabilities
  • optimizing searches by creating cache functions for common operations and checking current code for python 2/3 compatibility.

July 1st - July 14th

  • removed custom pattern syntax (i.e. "is greater than")
  • improved unittest insertions to include specific node names where match occurs

July 15th - July 28th

Improved Unittiests, documentation, and added a tutorial in the

July 29th- August 11th

Worked on fixing bugs. Cache vs no cache had a discrepancy. Updated Unittests. Began working on a Biological Example. Began implementation of a relaxed match (i.e., a way to match 0 or more nodes using the * syntax).

August 11th - 23rd:

Completed biological example. Finished command line tool. Submitted final work.

My Contributions:

To see my contributions to the code, click and drag any "dot" with my name on it in the network image here:

Clone this wiki locally
You can’t perform that action at this time.
You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session.
Press h to open a hovercard with more details.