In [1]:
import numpy as np 
import pandas as pd
import itertools as it


For a given tree $T$, we can compute the distance between two leaves $i$ and $j$, denoted $d_{ij}(T)$: the sum of the edge lengths along the path that connects them.
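As a minimal sketch of what $d_{ij}(T)$ means, the tree below is stored as a weighted adjacency dictionary and the distance between two leaves is obtained by walking the unique path between them. The tree itself, the internal node names `u` and `v`, and the helper `tree_distance` are hypothetical and chosen only for illustration; the edge weights were picked so that the leaf distances agree with the 4x4 matrix `m1` used in this notebook.

```python
# Hypothetical weighted tree: leaves 0..3, internal nodes "u" and "v".
tree = {
    0: {"u": 3},
    1: {"u": 5},
    2: {"v": 3},
    3: {"v": 8},
    "u": {0: 3, 1: 5, "v": 1},
    "v": {2: 3, 3: 8, "u": 1},
}

def tree_distance(tree, start, goal):
    """Sum of edge weights on the unique path from start to goal."""
    stack = [(start, None, 0)]  # (node, parent, distance so far)
    while stack:
        node, parent, dist = stack.pop()
        if node == goal:
            return dist
        for neighbour, weight in tree[node].items():
            if neighbour != parent:  # never walk back along the same edge
                stack.append((neighbour, node, dist + weight))
    raise ValueError("goal is not reachable from start")

print(tree_distance(tree, 0, 1))  # 3 + 5 = 8
print(tree_distance(tree, 0, 3))  # 3 + 1 + 8 = 12
```

Because a tree has exactly one path between any two nodes, a plain depth-first walk is enough and no shortest-path algorithm is needed.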

In [2]:
# distance matrix 

m1 = np.array([[0,8,7,12],
               [8,0,9,14],
               [7,9,0,11],
               [12,14,11,0]])

m2 = np.array([[0,2,3,8,14,18],
               [2,0,3,8,14,18],
               [3,3,0,8,14,18],
               [8,8,8,0,14,18],
               [14,14,14,14,0,18],
               [18,18,18,18,18,0]])

In [3]:
def convert_matrix_df(matrix,columns_name=None):
    return pd.DataFrame(matrix, columns=columns_name)

In [4]:
m1_df = convert_matrix_df(m1)
m1_df

    0   1   2   3
0   0   8   7  12
1   8   0   9  14
2   7   9   0  11
3  12  14  11   0


In [5]:
m2_df = convert_matrix_df(m2)
m2_df

    0   1   2   3   4   5
0   0   2   3   8  14  18
1   2   0   3   8  14  18
2   3   3   0   8  14  18
3   8   8   8   0  14  18
4  14  14  14  14   0  18
5  18  18  18  18  18   0


In molecular phylogenetics, a distance matrix M holds the evolutionary distances between biological sequences such as DNA, RNA or proteins. M is additive if some tree with non-negative edge lengths reproduces every entry of M as the path length between the corresponding leaves. M is ultrametric if, in addition, that tree can be rooted so that every leaf lies at the same distance from the root.
Both properties can be tested directly on the matrix:
* Buneman’s 4-point condition theorem: M is additive if and only if the 4-point condition is satisfied for every four taxa
* 3-point condition theorem: M is ultrametric if and only if the 3-point condition is satisfied for every three taxa
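
For reference, the two conditions can be written out as follows. This is the usual textbook formulation; the notation $M_{ij}$ for the entry in row $i$ and column $j$ is an assumption, since the text does not fix one.

```latex
\text{4-point condition:}\quad
\forall\, i,j,k,l:\;
M_{ij} + M_{kl} \le \max\bigl(M_{ik} + M_{jl},\; M_{il} + M_{jk}\bigr)

\text{3-point condition:}\quad
\forall\, i,j,k:\;
M_{ik} \le \max\bigl(M_{ij},\; M_{jk}\bigr)
```

Since each inequality must hold for every ordering of the indices, it is equivalent to requiring that the two largest of the three sums (respectively, of the three distances) are equal. For the matrix m1 defined earlier, the single quadruple gives the sums 8 + 11 = 19, 7 + 14 = 21 and 12 + 9 = 21, and the two largest coincide.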

In [6]:
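def is_additive(matrix):
    # Four-point condition: for every quadruple, the two largest of the
    # three pairwise sums must be equal (exact comparison, integer input).
    for i, j, k, l in it.combinations(range(len(matrix)), 4):
        sums = sorted([matrix[i, j] + matrix[k, l],
                       matrix[i, k] + matrix[j, l],
                       matrix[i, l] + matrix[j, k]])
        if sums[1] != sums[2]:
            return False
    return True

def is_ultrametrix(matrix):
    # Three-point condition: for every triple, the two largest of the
    # three pairwise distances must be equal.
    for i, j, k in it.combinations(range(len(matrix)), 3):
        dists = sorted([matrix[i, j], matrix[i, k], matrix[j, k]])
        if dists[1] != dists[2]:
            return False
    return True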

In [7]:
print("is M1 additive", is_additive(m1))
print("is M2 ultrametrix", is_ultrametrix(m2))

is M1 additive True
is M2 ultrametrix True


In [8]:
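def cluster(df_matrix,i):
    # Despite its name, this returns the sum of the distances in column i,
    # i.e. the total distance from taxon i to every other taxon.
    return sum(df_matrix.iloc[:, i])

def all_cluster(df_matrix):
    # Iterate by position, because cluster() indexes with iloc.
    for pos, column in enumerate(df_matrix.columns):
        print("Number of cluster in ",column,"is",cluster(df_matrix,pos)) 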

In [9]:
all_cluster(m1_df)

Number of cluster in  0 is 27
Number of cluster in  1 is 31
Number of cluster in  2 is 27
Number of cluster in  3 is 37


In [10]:
all_cluster(m2_df)

Number of cluster in  0 is 45
Number of cluster in  1 is 45
Number of cluster in  2 is 46
Number of cluster in  3 is 56
Number of cluster in  4 is 74
Number of cluster in  5 is 90


---

The Newick format is a text notation for trees: nested parentheses group the children of each internal node, commas separate siblings, and a semicolon ends the tree. In biology it is the standard way to write phylogenetic trees, which show the evolutionary relationships between species, and it is also used in computer science for other hierarchical structures.
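
To make the notation concrete, here is a minimal sketch that serializes a tree stored as nested tuples into a Newick string. The helper `to_newick` and the taxon names `A` to `D` are hypothetical, and branch lengths are left out to keep the example short.

```python
def to_newick(node):
    """Serialize a nested-tuple tree into Newick text (without the final ';')."""
    if isinstance(node, tuple):
        # Internal node: children go inside parentheses, separated by commas.
        return "(" + ",".join(to_newick(child) for child in node) + ")"
    # Leaf: just its name.
    return str(node)

tree = ((("A", "B"), "C"), "D")   # hypothetical taxa
print(to_newick(tree) + ";")      # (((A,B),C),D);
```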

UPGMA steps:
1. Align and name the sequences.
2. Compare the sequences using pairwise sequence alignment.
3. Count the mismatches and record them in the mismatch matrix.
4. Create a new cluster $u$ that joins the closest pair $(i,j)$, i.e. the pair with the smallest distance $d_{i,j}$.
5. Update the matrix: replace the rows and columns of the two merged clusters with a single new row and column, filled with the average distance from the new cluster $u$ to each remaining cluster.
6. Repeat steps 4 and 5 until only one cluster remains.
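
Steps 4 to 6 can be sketched as follows, assuming steps 1 to 3 have already produced a symmetric distance matrix. The function `upgma` and the leaf names `A` to `F` are hypothetical; the matrix is a copy of `m2` from this notebook. The update uses the size-weighted average of the two merged rows, which is the usual UPGMA rule, and the sketch returns only the tree topology in Newick form, without branch lengths.

```python
import numpy as np

def upgma(matrix, names):
    """Return a Newick string built by UPGMA from a symmetric distance matrix."""
    dist = np.array(matrix, dtype=float)
    labels = list(names)        # Newick text of each current cluster
    sizes = [1] * len(labels)   # number of leaves in each cluster
    while len(labels) > 1:
        # Step 4: find the closest pair (i, j), ignoring the diagonal.
        masked = dist.copy()
        np.fill_diagonal(masked, np.inf)
        i, j = np.unravel_index(np.argmin(masked), masked.shape)
        i, j = sorted((int(i), int(j)))
        # Step 5: size-weighted average distance from the new cluster u.
        merged = (sizes[i] * dist[i] + sizes[j] * dist[j]) / (sizes[i] + sizes[j])
        dist[i, :] = merged
        dist[:, i] = merged
        dist[i, i] = 0.0
        # Drop the row and column of j; cluster u now lives at index i.
        dist = np.delete(np.delete(dist, j, axis=0), j, axis=1)
        labels[i] = f"({labels[i]},{labels[j]})"
        sizes[i] += sizes[j]
        del labels[j], sizes[j]
        # Step 6: the loop repeats until a single cluster is left.
    return labels[0] + ";"

m2 = [[0, 2, 3, 8, 14, 18],
      [2, 0, 3, 8, 14, 18],
      [3, 3, 0, 8, 14, 18],
      [8, 8, 8, 0, 14, 18],
      [14, 14, 14, 14, 0, 18],
      [18, 18, 18, 18, 18, 0]]

print(upgma(m2, "ABCDEF"))  # (((((A,B),C),D),E),F);
```

When several pairs share the smallest distance, this sketch takes the first one in row-major order; other tie-breaking rules can yield a different but equally valid tree.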
