# BMI/CS 576 Fall 2023 - HW4
The objectives of this homework are to practice with

* weighted parsimony
* tree space search
* Markov chains


## HW policies
Before starting this homework, please read over the [homework policies](https://canvas.wisc.edu/courses/374201/pages/hw-policies) for this course.  In particular, note that homeworks are to be completed *individually* and plagiarism from any source (with the one exception noted below) will be considered **academic misconduct**.

You are welcome to use any code from the weekly notebooks (including the official solutions) in your solutions to the HW.

## Modules for this HW

In [1]:
import toytree
import fasta
import submatrix

## PROBLEM 1: Weighted parsimony (40 points)

In this problem you will implement the [weighted parsimony algorithm](https://canvas.wisc.edu/courses/374201/pages/day-16-online-lecture-phylogenetic-trees-weighted-parsimony) (also see [worked example](https://canvas.wisc.edu/courses/374201/modules/items/6356958)) for computing the cost of a tree given assignments of characters to its leaves.  Implement this algorithm as a function `weighted_parsimony` below, which takes as input a tree, a dictionary giving the character at each leaf, a cost matrix, and a string specifying all valid characters (e.g., "ACGT").  Your function will output the minimum cost of the tree as well as a dictionary giving an assignment of characters to all nodes of the tree (including the leaves) that achieves the minimum cost.

### Implementation details

* In the case of a tie during the traceback (e.g., multiple characters at a child node that allow for the same minimum cost at the parent, or multiple characters at the root node that give the same minimum cost), pick the character that is lexicographically smallest.

* You may find it useful to model your fill and traceback stages on the implementations of `fitch_score_and_min_cost_states` and `fitch_ancestral_states`, respectively, in the Day 16 notebook.

* To directly access the root node of a `toytree.tree` object, simply access its `treenode` attribute. 
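
The tie-breaking rule in the first bullet comes for free if candidates are compared as `(cost, character)` tuples, because Python orders tuples element by element. The following is a minimal sketch with a made-up cost table; `child_costs`, `substitution_cost`, and `best_child_state` are illustrative names and are not part of the required interface.

```python
# Hypothetical fill-stage result for one child node:
# the cost of the child's subtree if the child takes each character.
child_costs = {"A": 2, "C": 3, "G": 3, "T": 2}

def substitution_cost(a, b):
    """Unit costs for illustration: 0 for a match, 1 for a mismatch."""
    return 0 if a == b else 1

def best_child_state(parent_char, child_costs):
    """Return (min_cost, char) for the child given the parent's character.

    Comparing (cost, char) tuples means that equal costs are resolved
    in favor of the lexicographically smallest character.
    """
    return min((cost + substitution_cost(parent_char, char), char)
               for char, cost in child_costs.items())

# With parent "C", the characters A, C, and T all give total cost 3; A is chosen.
print(best_child_state("C", child_costs))
```

The same idiom applies at the root, where the candidates are simply the root's own `(cost, character)` pairs.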

Tests for Problem 1 are found at the bottom of this notebook.

### Cost matrices to be used in this assignment

In [2]:
DNA = "ACGT"
basic_dna_cost_matrix = submatrix.match_mismatch_matrix(0, 1, DNA)
purine_pyrimidine_cost_matrix = submatrix.read_substitution_matrix("purine_pyrimidine.txt")

print("basic_dna_cost_matrix = ")
submatrix.print_matrix(basic_dna_cost_matrix)
print("purine_pyrimidine_cost_matrix = ")
submatrix.print_matrix(purine_pyrimidine_cost_matrix)

basic_dna_cost_matrix = 
         A    C    G    T
    A    0    1    1    1
    C    1    0    1    1
    G    1    1    0    1
    T    1    1    1    0
purine_pyrimidine_cost_matrix = 
         A    C    G    T
    A    0    2    1    2
    C    2    0    2    1
    G    1    2    0    2
    T    2    1    2    0
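
The printed `purine_pyrimidine_cost_matrix` is consistent with a cost of 1 for transitions (purine to purine, or pyrimidine to pyrimidine) and 2 for transversions. As a sketch of the dictionary layout only, the matrix could be rebuilt without the course's `submatrix` module as follows; `transition_transversion_matrix` and its parameters are hypothetical names introduced for this example.

```python
PURINES = set("AG")        # A and G; C and T are the pyrimidines
DNA_ALPHABET = "ACGT"

def transition_transversion_matrix(transition_cost=1, transversion_cost=2):
    """Build a cost matrix as a dictionary keyed by (a, b) character tuples."""
    matrix = {}
    for a in DNA_ALPHABET:
        for b in DNA_ALPHABET:
            if a == b:
                matrix[a, b] = 0
            elif (a in PURINES) == (b in PURINES):
                matrix[a, b] = transition_cost    # same chemical class
            else:
                matrix[a, b] = transversion_cost  # purine <-> pyrimidine
    return matrix

cost = transition_transversion_matrix()
print(cost["A", "G"], cost["A", "C"], cost["C", "T"])
```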


In [3]:
def weighted_parsimony(tree, leaf_states, cost_matrix, alphabet = DNA):
    """Computes the minimum cost of a tree and an assignment of ancestral characters achieving that cost.   
    Args:
        tree: a toytree tree.
        leaf_states: a dictionary mapping leaf names to characters.
        cost_matrix: a cost matrix (represented as a dictionary with tuples as keys)
           where cost_matrix[a, b] is the cost of a substitution between characters a and b
        alphabet: a string specifying the possible character states that each node may take.
    Returns:
        A tuple (min_cost, node_states) where min_cost is the minimum cost of the tree (a numeric value)
        and node_states is an assignment of characters to the nodes that achieves this minimum cost, 
        (a dictionary mapping node names to characters).
    """      
    ### BEGIN SOLUTION
    import collections
    R = collections.defaultdict(lambda: {c: float("inf") for c in alphabet}) # costs
    T = collections.defaultdict(lambda: {c: [] for c in alphabet})           # tracebacks
    for node in tree.treenode.traverse("postorder"):
        if node.is_leaf():
            R[node.name][leaf_states[node.name]] = 0
        else:
            children_R = [R[child.name] for child in node.children]
            for c in alphabet:
                minimizers = [min((child_R[a] + cost_matrix[c, a], a) for a in alphabet)
                              for child_R in children_R]
                cost = sum(min_cost for min_cost, min_char in minimizers)
                traceback = [min_char for min_cost, min_char in minimizers]
                R[node.name][c] = cost
                T[node.name][c] = traceback
    
    # traceback
    r = {} # a dictionary mapping node names to character states
    # determine minimum cost character at the root
    root_node_name = tree.treenode.name
    min_cost, min_char = min((cost, char) for char, cost in R[root_node_name].items())
    r[root_node_name] = min_char
    
    for node in tree.treenode.traverse("preorder"):
        traceback = T[node.name][r[node.name]]
        for child, traceback_char in zip(node.children, traceback):
            r[child.name] = traceback_char

    return min_cost, r
    ### END SOLUTION

## Helper functions

You may find the function below helpful for visualizing a tree and the names of its nodes.

In [4]:
def draw_tree_with_internal_labels(t, leaf_states=None):
    """Draws the given toytree tree with all nodes labeled and 
    (optionally) with character assignments to the leaves.
    Args:
        t: a toytree tree.
        leaf_states: a dictionary mapping leaf names to characters.
    """
    if leaf_states:
        tip_labels = [leaf_states[leaf_name] for leaf_name in t.get_tip_labels()]
    else:
        tip_labels = False
    t.draw(node_labels=t.get_node_values(feature="name", show_root=True, show_tips=True),
           tip_labels=tip_labels,
           node_sizes=20,
           use_edge_lengths=False)

# Example usage:
t = toytree.tree("(Z,(X,Y));")
leaf_states = {"X": "C", "Y": "T", "Z": "A"}
draw_tree_with_internal_labels(t, leaf_states)

## PROBLEM 2: Human coronavirus phylogeny (20 points)

There are many types of coronaviruses that can infect humans, some of which are quite common and cause a subset of common cold cases.  It is helpful to understand the phylogenetic relationships between these viruses and their related evolutionary histories.  Included with this assignment is a multiple sequence alignment (`human_coronavirus_rdrp.fasta`) of the RNA-dependent RNA polymerase (RdRP) gene from six human coronavirus genomes (SARS-CoV-2, SARS_CoV_1, MERS, OC43, HKU1, and CoV229E).  For simplicity and efficiency, all gapped columns and all uninformative columns (no differences within the column) have been removed from this alignment.

In this problem, we will use your `weighted_parsimony` function from problem 1 to analyze the evolutionary relationships between these viruses.  If your `weighted_parsimony` function is incorrect, you may use the unweighted parsimony algorithms implemented in the Day 16 notebook instead.

**(a)** Using your `weighted_parsimony` function, compute the weighted parsimony score for all possible rooted trees of these six viral genomes, assuming CoV229E is the outgroup.  For your convenience, the file `all_rooted_trees.txt` contains the newick strings for all such trees.  Use `purine_pyrimidine_cost_matrix` as the cost matrix for your weighted parsimony computations.  *Hint: You will likely want to use slightly modified versions of functions and code from the Day 16 notebook for this problem*.

In [5]:
### BEGIN SOLUTION
def alignment_leaf_states_list(alignment, sequence_names):
    """Returns a list of dictionaries, where each dictionary corresponds to the leaf states
    for a column of the alignment."""
    return [dict(zip(sequence_names, column)) for column in zip(*alignment)]

def alignment_weighted_parsimony(tree, alignment, sequence_names, cost_matrix):
    """Runs weighted parsimony independently on each column of the alignment.
    Returns a tuple (column_scores, column_states) with one entry per column."""
    columns = alignment_leaf_states_list(alignment, sequence_names)
    weighted_parsimony_results = [weighted_parsimony(tree, column, cost_matrix) for column in columns]
    column_scores, column_states = zip(*weighted_parsimony_results)
    return column_scores, column_states

def score_tree_weighted_parsimony(tree, alignment, sequence_names, cost_matrix):
    """Returns the weighted parsimony score of the tree, summed over all alignment columns."""
    column_scores, _ = alignment_weighted_parsimony(tree, alignment, sequence_names, cost_matrix)
    return sum(column_scores)

def p2_score_tree(newick):
    return score_tree_weighted_parsimony(toytree.tree(newick), 
                                         alignment, 
                                         sequence_names, 
                                         purine_pyrimidine_cost_matrix)

alignment_filename = "human_coronavirus_rdrp.fasta"
aligned_sequences = fasta.read_sequences_from_fasta_file(alignment_filename)
sequence_names, alignment = zip(*aligned_sequences)

all_possible_trees_filename = "all_rooted_trees.txt"
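# Read one newick string per line, closing the file afterwards and skipping blank lines
with open(all_possible_trees_filename) as trees_file:
    all_possible_trees = [line.strip() for line in trees_file if line.strip()]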

tree_scores = list(map(p2_score_tree, all_possible_trees))
### END SOLUTION

**(b)** List the scores and trees for the **3** trees with the smallest weighted parsimony scores.  Given that these trees have similar scores, for what aspect of the true tree is there the most uncertainty?

In [6]:
### BEGIN SOLUTION
top_n = 3
top_trees = sorted(zip(tree_scores, all_possible_trees))[:top_n]
top_trees
### END SOLUTION

[(4154, '(CoV229E,((SARS-CoV-2,SARS_CoV_1),(MERS,(OC43,HKU1))));'),
 (4157, '(CoV229E,(((SARS-CoV-2,SARS_CoV_1),MERS),(OC43,HKU1)));'),
 (4191, '(CoV229E,(((SARS-CoV-2,SARS_CoV_1),(OC43,HKU1)),MERS));')]

### BEGIN SOLUTION
The only difference between these three trees is the placement of MERS.  SARS-CoV-2 and SARS_CoV_1 are siblings in all three trees, as are OC43 and HKU1.  Thus, there is the most uncertainty in the evolutionary relationship of MERS to these other viruses.
### END SOLUTION

**(c)** For the tree with the smallest score, use the results of your `weighted_parsimony` function to construct the ancestral sequence at the root node of the tree.  Save this sequence to a file `ancestor.fasta` in FASTA format. *Hint: You will likely want to use slightly modified versions of functions and code from the Day 16 notebook for this problem*
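
The course's `fasta` module takes care of writing the file. For reference, FASTA format is a `>` header line followed by the sequence, usually wrapped to a fixed width. The sketch below shows that layout with a hypothetical helper, `format_fasta`, that is not part of the course code and uses a made-up sequence.

```python
def format_fasta(records, line_width=60):
    """Return FASTA-formatted text for an iterable of (name, sequence) pairs."""
    lines = []
    for name, sequence in records:
        lines.append(f">{name}")
        # Wrap the sequence so that no line exceeds line_width characters.
        for start in range(0, len(sequence), line_width):
            lines.append(sequence[start:start + line_width])
    return "\n".join(lines) + "\n"

# A short made-up sequence, wrapped at 4 characters to make the layout visible.
print(format_fasta([("ancestor", "ACGTACGTAC")], line_width=4), end="")
```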

In [7]:
### BEGIN SOLUTION
def ancestor_weighted_parsimony(tree, alignment, sequence_names, cost_matrix):
    column_scores, column_states = alignment_weighted_parsimony(tree, alignment, sequence_names, cost_matrix)
    root_name = tree.treenode.name
    return ''.join(states[root_name] for states in column_states)

ancestor_filename = "ancestor.fasta"
ancestor_name = "ancestor"
top_tree = toytree.tree(top_trees[0][1])
ancestor_sequence = ancestor_weighted_parsimony(top_tree, alignment, sequence_names, purine_pyrimidine_cost_matrix)
fasta.write_sequences_to_fasta_file([(ancestor_name, ancestor_sequence)], ancestor_filename)
### END SOLUTION

## PROBLEM 3: Branch and Bound with unweighted Parsimony (25 points)

Suppose we wish to find an unrooted tree with the minimum unweighted parsimony score for five taxa: 1, 2, 3, 4, and 5, which have character states $A, C, C, A, C$, respectively.  In this problem, we will use the first branch and bound method described in the [Day 16 Tree space search lecture](https://canvas.wisc.edu/courses/374201/pages/day-16-online-lecture-phylogenetic-trees-tree-space-search) (slide 7: "Exact Method: Branch and Bound") to find such a tree.  We will use the unweighted parsimony score of a partial tree as the lower bound for the score of a full tree that may be built from it.

**(a)** Manually run the branch and bound algorithm on these data starting with the unrooted tree containing taxa 1, 2, and 3.  At the end of each iteration of the algorithm, list the elements (as newick strings) of the queue with their lower bounds. You do *not* need to show your work with respect to computing the parsimony score of each (partial) tree.

**(b)** For how many (partial) trees did you have to compute a parsimony score during the algorithm in part (a)?  How does this compare to the number of possible unrooted trees of five taxa?

### BEGIN SOLUTION template=Your solution to Problem 3a here

**(a)**  Listed below are the states of the queue at the beginning of each iteration and the tree that would be popped off.

*Iteration 1*

| tree     | lower bound |
|----------|-------------|
| (1,2,3)  |    1        |


Pop and branch on (1,2,3)

*Iteration 2*

| tree         | lower bound |
|--------------|-------------|
| ((1,4),2,3)  |    1        |
| (1,(2,4),3)  |    2        |
| (1,2,(3,4))  |    2        |


Pop and branch on ((1,4),2,3)

*Iteration 3*

| tree            | lower bound |
|-----------------|-------------|
| (((1,4),5),2,3) |    1        |
| ((1,4),(2,5),3) |    1        |
| ((1,4),2,(3,5)) |    1        |
| (((1,5),4),2,3) |    2        |
| ((1,(4,5)),2,3) |    2        |
| (1,(2,4),3)     |    2        |
| (1,2,(3,4))     |    2        |

The top three trees (arbitrarily ordered) on the queue all have lower bound = 1 and are complete trees: (((1,4),5),2,3), ((1,4),(2,5),3), and ((1,4),2,(3,5)).  Any one of these could be returned as the final optimal tree from the branch and bound algorithm.
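
The lower bounds in the tables above can be checked with a few lines of code. The sketch below is not the Day 16 implementation. It is a small Sankoff-style recursion with unit costs, where trees are written as nested tuples (an encoding chosen only for this example) and each unrooted tree is rooted at its central trifurcation. Because the costs are symmetric, the score does not depend on where the root is placed.

```python
# Character states of the five taxa, as given in the problem statement.
STATES = {1: "A", 2: "C", 3: "C", 4: "A", 5: "C"}
ALPHABET = "AC"  # only the observed states can appear in an optimal assignment

def subtree_costs(tree):
    """Map each state to the minimum number of changes within `tree`
    when the top node of `tree` is assigned that state."""
    if isinstance(tree, int):  # a leaf: its observed state costs 0, any other is impossible
        return {s: 0 if s == STATES[tree] else float("inf") for s in ALPHABET}
    child_costs = [subtree_costs(child) for child in tree]
    # For each child, pick its cheapest state, paying 1 if it differs from s.
    return {s: sum(min(costs[a] + (a != s) for a in ALPHABET) for costs in child_costs)
            for s in ALPHABET}

def parsimony_score(tree):
    """Unweighted parsimony score of a (partial) tree given as nested tuples."""
    return min(subtree_costs(tree).values())

for tree in [(1, 2, 3), ((1, 4), 2, 3), (1, (2, 4), 3), (((1, 5), 4), 2, 3)]:
    print(tree, parsimony_score(tree))
```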

### END SOLUTION

### BEGIN SOLUTION template=Your solution to Problem 3b here

**(b)** Including the first tree (1,2,3), there were 9 trees for which a parsimony score was calculated (1, 3, and 5 in the 1st, 2nd, and 3rd iterations, respectively).  There are $3 \times 5 = 15$ possible unrooted trees of five taxa.  Thus, branch and bound does $9/15 = 60\%$ of the work of a brute force approach that would explicitly examine every possible unrooted tree.
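
The count of 15 follows from building trees one taxon at a time: an unrooted binary tree on $k-1$ taxa has $2(k-1)-3 = 2k-5$ edges, and the $k$-th taxon can be attached to any one of them. A small sketch of that product, using a hypothetical helper name:

```python
def num_unrooted_trees(n_taxa):
    """Number of unrooted binary trees on n_taxa labeled leaves (n_taxa >= 3)."""
    if n_taxa < 3:
        raise ValueError("an unrooted binary tree needs at least 3 taxa")
    count = 1
    # Adding taxon k (k = 4, ..., n_taxa) offers 2k - 5 edges to attach to: 3, 5, 7, ...
    for edges in range(3, 2 * n_taxa - 4, 2):
        count *= edges
    return count

print(num_unrooted_trees(5))  # 3 * 5 = 15
```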

### END SOLUTION

## PROBLEM 4: Markov chain parameter estimation and likelihood (15 points)

Suppose we are given the following five DNA sequences. In this problem, we will model these sequences using a simple Markov chain with a state for each of the four DNA bases.

$\begin{eqnarray}
x_1 & = & \mathrm{\tt ATGT} \\
x_2 & = & \mathrm{\tt AAAA} \\
x_3 & = & \mathrm{\tt GTCG} \\
x_4 & = & \mathrm{\tt AACA} \\
x_5 & = & \mathrm{\tt TACC} \\
\end{eqnarray}$

**(a)** Using uniform distributions for the transition probabilities and initial state probabilities, calculate the likelihood, $P(x_1, x_2, x_3, x_4, x_5)$, of these sequences (where we assume that each sequence is generated independently from the model).

**(b)** Estimate the parameters (transition and initial probabilities) of the Markov chain using maximum likelihood estimates. Calculate the likelihood of these sequences given these maximum likelihood parameter estimates.

**(c)** Estimate the parameters (transition and initial probabilities) of the Markov chain using Laplace estimates (pseudocount = 1). Calculate the likelihood of these sequences given these Laplace parameter estimates.

### BEGIN SOLUTION template=Your solution to Problem 4a here

**(a)** The probability of each sequence is the product of one initial probability and three transition probabilities, all of which are equal to $\frac{1}{4}$ because we are using uniform distributions.  Thus, the probability of each sequence is $\left(\frac{1}{4}\right)^4 = \frac{1}{256}$.  The likelihood of all of the sequences is simply the product of the probabilities of each sequence, since they are generated independently.

$P(x_1, x_2, x_3, x_4, x_5) = \prod_{i=1}^{5} P(x_i) = \left(\frac{1}{256}\right)^5 = \frac{1}{1099511627776} = 9.1 \times 10^{-13}$
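
As a quick check of this arithmetic, the product can be evaluated exactly with Python's `fractions` module. This is only a sketch; the variable names are chosen for this example.

```python
from fractions import Fraction

sequences = ["ATGT", "AAAA", "GTCG", "AACA", "TACC"]
uniform = Fraction(1, 4)  # every initial and transition probability

likelihood = Fraction(1)
for seq in sequences:
    # One initial probability plus len(seq) - 1 transitions: len(seq) factors of 1/4.
    likelihood *= uniform ** len(seq)

print(likelihood, float(likelihood))
```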

### END SOLUTION

### BEGIN SOLUTION template=Your solution to Problem 4b here

**(b)** The transition count matrix $n_{s \rightarrow t}$ (rows index $s$ and columns index $t$) is shown below.  $S$ represents transitions from the start state (initial transitions).  The $*$ column represents the total number of transitions from the state indexed by the row.
$$\begin{array}{cccccc}
  & A & C & G & T & * \\
S & 3 & 0 & 1 & 1 & 5 \\
A & 4 & 2 & 0 & 1 & 7 \\
C & 1 & 1 & 1 & 0 & 3\\
G & 0 & 0 & 0 & 2 & 2 \\
T & 1 & 1 & 1 & 0 & 3
\end{array}$$
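
The counts in this table can be reproduced programmatically. A minimal sketch using `collections.Counter`, with variable names chosen for this example:

```python
from collections import Counter

sequences = ["ATGT", "AAAA", "GTCG", "AACA", "TACC"]

# Row S of the table: how often each base starts a sequence.
initial_counts = Counter(seq[0] for seq in sequences)

# Remaining rows: counts of each adjacent pair (s, t) within a sequence.
transition_counts = Counter((s, t) for seq in sequences for s, t in zip(seq, seq[1:]))

print(dict(initial_counts))
for s in "ACGT":
    print(s, [transition_counts[s, t] for t in "ACGT"])
```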

Maximum likelihood estimates for the parameters $a_{s \rightarrow t}$. The formula used for each cell is
$\frac{n_{s \rightarrow t}}{n_{s \rightarrow *}}$.
$$\begin{array}{ccccc}
  & A & C & G & T \\
S & \frac{3}{5} = 0.6 & 
    \frac{0}{5} = 0 & 
    \frac{1}{5} = 0.2 & 
    \frac{1}{5} = 0.2 \\
A & \frac{4}{7} \approx 0.571 & 
    \frac{2}{7} \approx 0.286 & 
    \frac{0}{7} = 0 & 
    \frac{1}{7} \approx 0.143 \\
C & \frac{1}{3} \approx 0.333 & 
    \frac{1}{3} \approx 0.333 & 
    \frac{1}{3} \approx 0.333 & 
    \frac{0}{3}  = 0 \\
G & \frac{0}{2} = 0 & 
    \frac{0}{2} = 0 & 
    \frac{0}{2} = 0 & 
    \frac{2}{2} = 1 \\
T & \frac{1}{3} \approx 0.333 & 
    \frac{1}{3} \approx 0.333 & 
    \frac{1}{3} \approx 0.333 & 
    \frac{0}{3} = 0 \\    
\end{array}$$

$\begin{eqnarray}
P(x_1, x_2, x_3, x_4, x_5) & = & P(x_1)P(x_2)P(x_3)P(x_4)P(x_5) \\
& = & (a_{S,A}a_{A,T}a_{T,G}a_{G,T}) \times 
(a_{S,A}a_{A,A}a_{A,A}a_{A,A}) \times 
(a_{S,G}a_{G,T}a_{T,C}a_{C,G}) \times 
(a_{S,A}a_{A,A}a_{A,C}a_{C,A}) \times 
(a_{S,T}a_{T,A}a_{A,C}a_{C,C})\\
& = & a_{S,A}^3 a_{S,G}^1 a_{S,T}^1 a_{A,A}^4 a_{A,C}^2 a_{A,T}^1 a_{C,A}^1 a_{C,C}^1 a_{C,G}^1 a_{G,T}^2 a_{T,A}^1 a_{T,C}^1 a_{T,G}^1 \\
& \approx & (0.6)^3 (0.2)^1 (0.2)^1 (0.571)^4 (0.286)^2 (0.143)^1 (0.333)^1 (0.333)^1 (0.333)^1 (1)^2 (0.333)^1 (0.333)^1 (0.333)^1 \\
& \approx & 1.5 \times 10^{-8}
\end{eqnarray}$


### END SOLUTION

### BEGIN SOLUTION template=Your solution to Problem 4c here

**(c)** Laplace estimates for the parameters $a_{s \rightarrow t}$. The formula used for each cell is
$\frac{n_{s \rightarrow t} + 1}{n_{s \rightarrow *} + 4}$.
$$\begin{array}{ccccc}
  & A & C & G & T \\
S & \frac{3 + 1}{5 + 4} \approx 0.444 & 
    \frac{0 + 1}{5 + 4} \approx 0.111 & 
    \frac{1 + 1}{5 + 4} \approx 0.222 & 
    \frac{1 + 1}{5 + 4} \approx 0.222 \\
A & \frac{4 + 1}{7 + 4} \approx 0.455 & 
    \frac{2 + 1}{7 + 4} \approx 0.273 & 
    \frac{0 + 1}{7 + 4} \approx 0.091 & 
    \frac{1 + 1}{7 + 4} \approx 0.182 \\
C & \frac{1 + 1}{3 + 4} \approx 0.286 & 
    \frac{1 + 1}{3 + 4} \approx 0.286 & 
    \frac{1 + 1}{3 + 4} \approx 0.286 & 
    \frac{0 + 1}{3 + 4} \approx 0.143 \\
G & \frac{0 + 1}{2 + 4} \approx 0.167 & 
    \frac{0 + 1}{2 + 4} \approx 0.167 & 
    \frac{0 + 1}{2 + 4} \approx 0.167 & 
    \frac{2 + 1}{2 + 4} = 0.5 \\
T & \frac{1 + 1}{3 + 4} \approx 0.286 & 
    \frac{1 + 1}{3 + 4} \approx 0.286 & 
    \frac{1 + 1}{3 + 4} \approx 0.286 & 
    \frac{0 + 1}{3 + 4} \approx 0.143 \\       
\end{array}$$

$\begin{eqnarray}
P(x_1, x_2, x_3, x_4, x_5) & = & a_{S,A}^3 a_{S,G}^1 a_{S,T}^1 a_{A,A}^4 a_{A,C}^2 a_{A,T}^1 a_{C,A}^1 a_{C,C}^1 a_{C,G}^1 a_{G,T}^2 a_{T,A}^1 a_{T,C}^1 a_{T,G}^1 \\
& \approx & (0.444)^3 (0.222)^1 (0.222)^1 (0.455)^4 (0.273)^2 (0.182)^1 (0.286)^1 (0.286)^1 (0.286)^1 (0.5)^2 (0.286)^1 (0.286)^1 (0.286)^1 \\
& \approx & 3.4 \times 10^{-10}
\end{eqnarray}
$
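
The hand calculation can be cross-checked numerically. The sketch below treats the start state `S` as an extra symbol placed before each sequence, estimates the probabilities with a configurable pseudocount, and multiplies out the likelihood. The function names are hypothetical. Setting `pseudocount=0` gives the maximum likelihood estimates, and `pseudocount=1` gives the Laplace estimates.

```python
from collections import Counter
from math import prod

sequences = ["ATGT", "AAAA", "GTCG", "AACA", "TACC"]
ALPHABET = "ACGT"
START = "S"  # pseudo-state preceding the first character of every sequence

def count_transitions(seqs):
    """Count (s, t) pairs, including the initial transition from START."""
    counts = Counter()
    for seq in seqs:
        path = START + seq
        counts.update(zip(path, path[1:]))
    return counts

def estimate_probabilities(counts, pseudocount):
    """Return a[s, t] = (n[s, t] + pseudocount) / (n[s, *] + 4 * pseudocount)."""
    probs = {}
    for s in START + ALPHABET:
        total = sum(counts[s, t] for t in ALPHABET) + pseudocount * len(ALPHABET)
        for t in ALPHABET:
            probs[s, t] = (counts[s, t] + pseudocount) / total
    return probs

def likelihood(seqs, probs):
    """Product of the probabilities of every transition in every sequence."""
    return prod(probs[s, t] for seq in seqs for s, t in zip(START + seq, seq))

counts = count_transitions(sequences)
for pseudocount in (0, 1):
    probs = estimate_probabilities(counts, pseudocount)
    print(pseudocount, f"{likelihood(sequences, probs):.2e}")
```

The two printed values should agree with the rounded results of about $1.5 \times 10^{-8}$ and $3.4 \times 10^{-10}$. For longer sequences, summing log-probabilities instead of multiplying probabilities avoids numerical underflow.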


### END SOLUTION

## Tests for Problem 1

In [8]:
pair_tree =              toytree.tree("(X,Y);")
triple_tree =         toytree.tree("(Z,(X,Y));")
quartet_tree =     toytree.tree("(W,(Z,(X,Y)));")
quartet2_tree =   toytree.tree("((W,Z),(X,Y));")
quintet_tree = toytree.tree("((V,W),(Z,(X,Y)));")
large_tree = toytree.tree("((E,(F,(G,H))),((A,B),(C,D)));")

pair_match_states =                             {"X": "C", "Y": "C"}
pair_mismatch_states =                          {"X": "A", "Y": "T"}
triple_states =                       {"Z": "C", "X": "A", "Y": "T"}
quartet_states =            {"W": "A", "Z": "C", "X": "A", "Y": "T"}
quartet2_states =           {"W": "C", "Z": "C", "X": "A", "Y": "T"}
quintet_states =  {"V": "C", "W": "C", "Z": "C", "X": "A", "Y": "T"}
large_states = {"A": "G", "B": "C", "C": "T", "D": "A", "E": "G", "F": "C", "G": "C", "H": "C"}

In [9]:
# pair_match_score (6 points)
tree, leaf_states = pair_tree, pair_match_states
draw_tree_with_internal_labels(tree, leaf_states)
cost, states = weighted_parsimony(tree, leaf_states, purine_pyrimidine_cost_matrix)
assert cost == 0
print("SUCCESS: pair_match_score test case passed!")

SUCCESS: pair_match_score test case passed!


In [10]:
# pair_mismatch_score (6 points)
tree, leaf_states = pair_tree, pair_mismatch_states
draw_tree_with_internal_labels(tree, leaf_states)
cost, states = weighted_parsimony(tree, leaf_states, purine_pyrimidine_cost_matrix)
assert cost == 2
print("SUCCESS: pair_mismatch_score test case passed!")

SUCCESS: pair_mismatch_score test case passed!


In [11]:
# triple_score (4 points)
tree, leaf_states = triple_tree, triple_states
draw_tree_with_internal_labels(tree, leaf_states)
cost, states = weighted_parsimony(tree, leaf_states, purine_pyrimidine_cost_matrix)
assert cost == 3
print("SUCCESS: triple_score test case passed!")

SUCCESS: triple_score test case passed!


In [12]:
# quartet_score (2 points)
tree, leaf_states = quartet_tree, quartet_states
draw_tree_with_internal_labels(tree, leaf_states)
cost, states = weighted_parsimony(tree, leaf_states, purine_pyrimidine_cost_matrix)
assert cost == 4
print("SUCCESS: quartet_score test case passed!")

SUCCESS: quartet_score test case passed!


In [13]:
# quartet2_score (2 points)
tree, leaf_states = quartet2_tree, quartet_states
draw_tree_with_internal_labels(tree, leaf_states)
cost, states = weighted_parsimony(tree, leaf_states, purine_pyrimidine_cost_matrix)
assert cost == 4
print("SUCCESS: quartet2_score test case passed!")

SUCCESS: quartet2_score test case passed!


In [14]:
# quintet_score (1 point)
tree, leaf_states = quintet_tree, quintet_states
draw_tree_with_internal_labels(tree, leaf_states)
cost, states = weighted_parsimony(tree, leaf_states, purine_pyrimidine_cost_matrix)
assert cost == 3
print("SUCCESS: quintet_score test case passed!")

SUCCESS: quintet_score test case passed!


In [15]:
# large_score (1 point)
tree, leaf_states = large_tree, large_states
draw_tree_with_internal_labels(tree, leaf_states)
cost, states = weighted_parsimony(tree, leaf_states, purine_pyrimidine_cost_matrix)
assert cost == 7
print("SUCCESS: large_score test case passed!")

SUCCESS: large_score test case passed!


In [16]:
# pair_match_states (3 points)
tree, leaf_states = pair_tree, pair_match_states
draw_tree_with_internal_labels(tree, leaf_states)
cost, states = weighted_parsimony(tree, leaf_states, purine_pyrimidine_cost_matrix)
assert states == {'2': 'C', 'X': 'C', 'Y': 'C'}
print("SUCCESS: pair_match_states test case passed!")

SUCCESS: pair_match_states test case passed!


In [17]:
# pair_mismatch_states (3 points)
tree, leaf_states = pair_tree, pair_mismatch_states
draw_tree_with_internal_labels(tree, leaf_states)
cost, states = weighted_parsimony(tree, leaf_states, purine_pyrimidine_cost_matrix)
assert states == {'2': 'A', 'X': 'A', 'Y': 'T'}
print("SUCCESS: pair_mismatch_states test case passed!")

SUCCESS: pair_mismatch_states test case passed!


In [18]:
# triple_states (2 points)
tree, leaf_states = triple_tree, triple_states
draw_tree_with_internal_labels(tree, leaf_states)
cost, states = weighted_parsimony(tree, leaf_states, purine_pyrimidine_cost_matrix)
assert states == {'4': 'C', 'Z': 'C', '3': 'C', 'X': 'A', 'Y': 'T'}
print("SUCCESS: triple_states test case passed!")

SUCCESS: triple_states test case passed!


In [19]:
# quartet_states (1 point)
tree, leaf_states = quartet_tree, quartet_states
draw_tree_with_internal_labels(tree, leaf_states)
cost, states = weighted_parsimony(tree, leaf_states, purine_pyrimidine_cost_matrix)
assert states == {'6': 'A', 'W': 'A', '5': 'A', 'Z': 'C', '4': 'A', 'X': 'A', 'Y': 'T'}
print("SUCCESS: quartet_states test case passed!")

SUCCESS: quartet_states test case passed!


In [20]:
# quartet2_states (1 point)
tree, leaf_states = quartet2_tree, quartet_states
draw_tree_with_internal_labels(tree, leaf_states)
cost, states = weighted_parsimony(tree, leaf_states, purine_pyrimidine_cost_matrix)
assert states == {'6': 'A', '5': 'A', '4': 'A', 'W': 'A', 'Z': 'C', 'X': 'A', 'Y': 'T'}
print("SUCCESS: quartet2_states test case passed!")

SUCCESS: quartet2_states test case passed!


In [21]:
# quintet_states (1 point)
tree, leaf_states = quintet_tree, quintet_states
draw_tree_with_internal_labels(tree, leaf_states)
cost, states = weighted_parsimony(tree, leaf_states, purine_pyrimidine_cost_matrix)
assert states == {'8': 'C','7': 'C','6': 'C','V': 'C','W': 'C','Z': 'C','5': 'C','X': 'A','Y': 'T'}
print("SUCCESS: quintet_states test case passed!")

SUCCESS: quintet_states test case passed!


In [22]:
# large_states (1 point)
tree, leaf_states = large_tree, large_states
draw_tree_with_internal_labels(tree, leaf_states)
cost, states = weighted_parsimony(tree, leaf_states, purine_pyrimidine_cost_matrix)
assert states == {'14': 'C','13': 'C','12': 'C','E': 'G','11': 'C','F': 'C', '8': 'C',
                  'G': 'C','H': 'C','10': 'C','9': 'C','A': 'G','B': 'C','C': 'T','D': 'A'}
print("SUCCESS: large_states test case passed!")

SUCCESS: large_states test case passed!


### Hidden tests (6 points total)

In [23]:
# p1_hidden1 (3 points)
### BEGIN HIDDEN TESTS
tree = toytree.tree("(((G,(F,B)),E),((C,J),(((H,D),A),I)));")
leaf_states = {'A': 'A', 'B': 'T', 'C': 'G', 'D': 'A', 'E': 'G', 'F': 'T', 'G': 'T', 'H': 'A', 'I': 'G', 'J': 'G'}
draw_tree_with_internal_labels(tree, leaf_states)
cost, states = weighted_parsimony(tree, leaf_states, purine_pyrimidine_cost_matrix)
assert cost == 3
assert states == {'18': 'G','17': 'G','16': 'G','E': 'G','15': 'T','G': 'T','12': 'T','F': 'T','B': 'T',
                  '14': 'G','13': 'G','C': 'G','J': 'G','I': 'G','11': 'A','A': 'A','10': 'A','H': 'A','D': 'A'}
print("SUCCESS: hidden1 test case passed!")
### END HIDDEN TESTS

SUCCESS: hidden1 test case passed!


In [24]:
# p1_hidden2 (2 points)
### BEGIN HIDDEN TESTS
tree = toytree.tree("(J,(I,(H,(G,(F,(E,(D,(C,(B,A)))))))));")
leaf_states = {'A': 'A', 'B': 'T', 'C': 'T', 'D': 'T', 'E': 'T', 'F': 'T', 'G': 'T', 'H': 'T', 'I': 'T', 'J': 'T'}
draw_tree_with_internal_labels(tree, leaf_states)
cost, states = weighted_parsimony(tree, leaf_states, purine_pyrimidine_cost_matrix)
assert cost == 2
assert states == {'18': 'T','J': 'T','17': 'T','I': 'T','16': 'T','H': 'T','15': 'T','G': 'T','14': 'T','F': 'T','13': 'T','E': 'T','12': 'T','D': 'T','11': 'T','C': 'T','10': 'T','B': 'T','A': 'A'}
print("SUCCESS: hidden2 test case passed!")
### END HIDDEN TESTS

SUCCESS: hidden2 test case passed!


In [25]:
# p1_hidden3 (1 point)
### BEGIN HIDDEN TESTS
tree = toytree.tree("((human, bat),pangolin);")
leaf_states = {"human": "0", "bat": "0", "pangolin": "1"}
binary_alphabet = "01"
binary_cost_matrix = submatrix.match_mismatch_matrix(0, 1, binary_alphabet)
draw_tree_with_internal_labels(tree, leaf_states)
cost, states = weighted_parsimony(tree, leaf_states, binary_cost_matrix, binary_alphabet)
assert cost == 1
assert states == {'4': '0', 'pangolin': '1', '3': '0', 'human': '0', 'bat': '0'}
print("SUCCESS: hidden3 test case passed!")
### END HIDDEN TESTS

SUCCESS: hidden3 test case passed!
