# Class : Hierarchical Clustering - UPGMA

---

## Before Class
In this week's class we will be implementing UPGMA for hierarchical clustering. We will start today with the implementation of a distance matrix.

Prior to class, please do the following:
1. Review the structure of a distance matrix
2. Compare Hamming and Levenshtein distances
3. Re-familiarize yourself with numpy arrays

---
## Learning Objectives

1. Implement a simple distance metric
2. Build a distance matrix using the metric and a set of alignments

---
## Background
Today we will start implementing a frequently used hierarchical clustering algorithm from Sokal and Michener (1958) called UPGMA (unweighted pair group method using arithmetic averages). The first step of this algorithm requires that we build a distance matrix for all of the alignments that we will be clustering. To accomplish this today, we will be using the Hamming distance between the sequences. The Hamming distance is the number of positions at which two strings differ, which equals the number of single-character substitutions required to make the two strings match exactly. This metric requires that the strings are the same length. Luckily, our previous work on alignments results in strings that are all the same length and optimally aligned! We will be using those strings as input.
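
Written as a formula, the Hamming distance between two aligned strings $s$ and $t$ of equal length $n$ can be expressed as below. The symbols $s$, $t$, $n$, and $d_H$ are notation introduced here for illustration. We also assume that a gap character `-` is treated like any other symbol.

$$
d_H(s, t) = \sum_{i=1}^{n} \mathbb{1}[s_i \neq t_i]
$$

For example, with $s = \texttt{TA-TTTA}$ and $t = \texttt{TA-TTAA}$ the strings differ only at the sixth position (T versus A), so $d_H(s, t) = 1$.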


---
## Imports

In [None]:
import numpy as np

---
## Distance metrics for comparing sequences

### Hamming Distance



In [None]:
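def hamming_distance(alignment1, alignment2):
    ''' Function to calculate Hamming distance between two alignments

    Args:
        alignment1 (str): first sequence that has already been aligned
        alignment2 (str): second sequence that has already been aligned

    Returns:
        distance (int): Hamming distance between the two alignments

    '''
    # Hamming distance is only defined for strings of equal length
    if len(alignment1) != len(alignment2):
        raise ValueError("Alignments must be the same length")

    # One straightforward approach: count the positions where the
    # characters differ (a gap '-' is treated like any other character)
    distance = sum(c1 != c2 for c1, c2 in zip(alignment1, alignment2))
    return distance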


In [None]:
# Example data from the slides:
alignment1 = "TA-TTTA"
alignment2 = "TA-TTAA"
print(hamming_distance(alignment1, alignment2))  # expected: 1

alignment1 = "TA-TTCCA"
alignment2 = "TA-TTAAC"
print(hamming_distance(alignment1, alignment2))  # expected: 3

alignment1 = "TA-TTAAC"
alignment2 = "TA-TTAAC"
print(hamming_distance(alignment1, alignment2))  # expected: 0

In [None]:
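def build_distance_matrix(alignments):
    ''' Function to build a distance matrix from a list of alignments
    This is an n x n matrix (n = number of alignments) holding
    all pairwise distances (and 0 along the diagonal).
    All alignments must be the same length!

    Args:
        alignments (list of strings): a list of our sequence alignments

    Returns:
        distance_matrix (np.array of floats): n x n distance matrix

    '''
    n = len(alignments)
    distance_matrix = np.zeros((n, n))  # the diagonal stays 0

    # One possible approach: the matrix is symmetric, so compute each
    # pair once with hamming_distance and mirror it across the diagonal
    for i in range(n):
        for j in range(i + 1, n):
            distance = hamming_distance(alignments[i], alignments[j])
            distance_matrix[i, j] = distance
            distance_matrix[j, i] = distance

    return distance_matrix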


In [None]:
# Example data from the slides:
alignments = ["TA-TTTA", "TA-TTAA", "TA-TTTA", "TACTT-A", "TACTTAA"]

D = build_distance_matrix(alignments)
print(D)
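
---

Since the pre-class list asks for a review of numpy arrays, here is a minimal sketch of a vectorized alternative to a nested loop. It is not the method from the slides. The name `pairwise_hamming_vectorized` is hypothetical. The sketch assumes a non-empty list of equal-length alignments and treats the gap character `-` like any other symbol.

```python
import numpy as np

def pairwise_hamming_vectorized(alignments):
    """Sketch: all pairwise Hamming distances via numpy broadcasting.

    Hypothetical helper for illustration; assumes a non-empty list of
    equal-length strings and counts gaps '-' as ordinary characters.
    """
    if len({len(a) for a in alignments}) != 1:
        raise ValueError("All alignments must be the same length")

    # One row per alignment, one column per aligned position,
    # with each character stored as its integer code point
    codes = np.array([[ord(c) for c in a] for a in alignments])

    # Compare every row against every other row: shape (n, n, length)
    mismatches = codes[:, None, :] != codes[None, :, :]

    # Summing over positions gives the n x n matrix of mismatch counts
    return mismatches.sum(axis=2).astype(float)

alignments = ["TA-TTTA", "TA-TTAA", "TA-TTTA", "TACTT-A", "TACTTAA"]
D = pairwise_hamming_vectorized(alignments)
print(D)
```

The first and third alignments are identical, so `D[0, 2]` is 0. The result is symmetric with zeros on the diagonal, which is a quick sanity check for any distance matrix implementation. The intermediate boolean array has n x n x length entries, so for large inputs a loop over pairs may be the more memory-friendly choice.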