<a href="https://colab.research.google.com/github/EmiliaJarochowska/Echinoid_phylogeny/blob/main/Phylogenetic_tree_of_sea_urchins.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Phylogenetic tree of sea urchins

In [None]:
# Colab already ships these packages; the install is only needed in a fresh local environment.
# openpyxl is the engine pandas uses to read .xlsx files.
!pip install numpy pandas matplotlib scipy openpyxl

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, dendrogram
from google.colab import files # for file upload & download

In [None]:
uploads = files.upload() # upload your character matrix (matrix.xlsx) here

In [None]:
df = pd.read_excel("matrix.xlsx") # read the .xlsx file into a DataFrame
df

Next, store the taxon names in a separate list and keep only the character values.

In [None]:
Taxa = df.index.tolist()[1:] # select taxon names, skipping the first row
Characters = df.iloc[1:, :] # select character values for the same rows

In [None]:
# Convert the character states to an unsigned-integer array (numbers, despite the variable name)
sequences_str = Characters.values.astype(np.uint8)
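
The separation of names and character values can be tried out on a small in-memory table before running it on the spreadsheet. This is a minimal sketch under assumptions: the taxon names sit in the first column, there is no extra row to skip, and all names (`toy`, `Taxon_A`, `char_1`, ...) are hypothetical stand-ins rather than the contents of `matrix.xlsx`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy matrix: rows are taxa, columns are binary characters.
toy = pd.DataFrame(
    {
        "taxon": ["Taxon_A", "Taxon_B", "Taxon_C"],
        "char_1": [0, 1, 1],
        "char_2": [1, 1, 0],
        "char_3": [0, 0, 1],
    }
)

# Move the name column into the index, similar to read_excel(..., index_col=0).
toy = toy.set_index("taxon")

toy_taxa = toy.index.tolist()                   # taxon names as a plain list
toy_characters = toy.to_numpy(dtype=np.uint8)   # character states as integers

print(toy_taxa)
print(toy_characters.shape)
```

If the names end up in the index this way, the list of labels and the numeric array stay aligned row by row, which is what the dendrogram labels rely on later.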

Calculate the matrix of pairwise distances between taxa using [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance).

In [None]:
distance_matrix = squareform(pdist(sequences_str, metric='euclidean'))
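
To see what `pdist` and `squareform` return, here is a small sketch on a hypothetical three-taxon binary matrix (the values are invented for illustration). `pdist` gives the distances in condensed form, one value per pair of rows, and `squareform` expands them into a symmetric matrix with zeros on the diagonal. For binary characters, the Euclidean distance is the square root of the number of characters in which two taxa differ.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical character matrix: 3 taxa (rows) x 3 binary characters (columns).
toy_characters = np.array([[0, 1, 0],
                           [1, 1, 0],
                           [1, 0, 1]], dtype=np.uint8)

# Condensed form: distances for the pairs (0,1), (0,2), (1,2), in that order.
condensed = pdist(toy_characters, metric="euclidean")

# Square form: full symmetric matrix, convenient for plotting.
square = squareform(condensed)

print(condensed)
print(square)
```

Taxa 0 and 1 differ in one character, so their distance is 1; taxa 0 and 2 differ in all three, so their distance is the square root of 3.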

Visualize the distance matrix

In [None]:
fig, ax = plt.subplots()
cax = ax.matshow(distance_matrix, cmap='viridis')

# Distance values for matrix cells
for i in range(len(sequences_str)):
    for j in range(len(sequences_str)):
        ax.text(j, i, f'{distance_matrix[i, j]:.2f}', ha='center', va='center', color='white')

# Add color bar
cbar = fig.colorbar(cax)
cbar.ax.set_ylabel('Distance', rotation=270, labelpad=15)

Build a tree with a clustering algorithm. Here we use [UPGMA](https://en.wikipedia.org/wiki/UPGMA), which is selected by the parameter `method='average'`; see the [SciPy documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html#scipy.cluster.hierarchy.linkage) for details.

In [None]:
Z = linkage(pdist(sequences_str, metric='euclidean'), method='average')
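
How `method='average'` joins clusters can be checked by hand on a tiny example. The sketch below uses three hypothetical taxa with invented pairwise distances, passed to `linkage` in condensed form. The closest pair is merged first, and the distance from the new cluster to the remaining taxon is the mean of the original distances.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical condensed distances for taxa 0, 1, 2:
# d(0,1) = 2, d(0,2) = 4, d(1,2) = 6
toy_condensed = np.array([2.0, 4.0, 6.0])

# Each row of Z describes one merge:
# [cluster index, cluster index, merge height, number of leaves in the new cluster]
Z_toy = linkage(toy_condensed, method="average")

print(Z_toy)
```

Taxa 0 and 1 are joined at height 2. The resulting cluster is then joined with taxon 2 at height (4 + 6) / 2 = 5, which is the averaging step that gives UPGMA its name.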

Plot the tree

In [1]:
plt.figure(figsize=(10, 5))
dendrogram(Z,
            orientation='left',
            labels=Taxa,
            distance_sort='descending',
            show_leaf_counts=False
          )
plt.title('UPGMA dendrogram of sea urchins')
# With orientation='left', distances run along the x-axis and taxa along the y-axis
plt.xlabel('Distance')
plt.ylabel('Taxa')
plt.show()
