<a href="https://colab.research.google.com/github/DessimozLab/fold_tree/blob/main/notebooks/Foldtree.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="https://github.com/DessimozLab/fold_tree/raw/main/foldtree_logo.png" height="200" align="right" style="height:240px">


## Foldtree - construct trees from protein structures

An easy-to-use notebook for constructing phylogenetic trees from protein structures using [Foldtree](https://github.com/DessimozLab/fold_tree).
Foldtree uses [Foldseek](https://foldseek.com) to align protein structures and generate the distance matrix used for tree computation.

[Moi D., Bernard C., Steinegger M., Nevers Y., Langleib M., Dessimoz C.
Structural phylogenetics unravels the evolutionary diversification of communication systems in gram-positive bacteria and their viruses,
*bioRxiv*, 2023](https://www.biorxiv.org/content/10.1101/2023.09.19.558401v2)

### Input Types

This notebook supports three types of input for constructing phylogenetic trees from protein structures:

- **AFDB Cluster**: Use this option to input an AlphaFold Database (AFDB) cluster. You must provide a valid AFDB cluster ID (e.g., `A0A074YNE0`). Only AFDB cluster IDs are accepted for this input type.

- **Identifier List**: Provide a list of UniProt IDs, one per line. It is the user's responsibility to ensure that the listed proteins are homologous.

- **Custom PDBs**: Upload your own set of PDB files. All PDB files must be compressed into a single `.zip` archive before uploading. As with the identifier list, users must verify that the input proteins are homologous.

Carefully select the input type that matches your data and ensure the biological relevance of your input set.
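
Before uploading an identifier list, it can help to check that every line looks like a UniProt accession. The sketch below is not part of Foldtree: the helper name `split_identifiers` is hypothetical, and the pattern is the accession format published in the UniProt help pages. It checks the format only. It cannot tell whether an accession exists or whether the proteins are homologous.

```python
import re

# UniProt accession format (6 or 10 characters), as documented by UniProt.
UNIPROT_ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)

def split_identifiers(text):
    """Return (valid, invalid) accessions from a one-per-line string.

    Hypothetical helper for illustration; not a Foldtree function.
    """
    valid, invalid = [], []
    for line in text.splitlines():
        accession = line.strip()
        if not accession:
            continue  # ignore blank lines
        if UNIPROT_ACCESSION.fullmatch(accession):
            valid.append(accession)
        else:
            invalid.append(accession)
    return valid, invalid

valid, invalid = split_identifiers("A0A074YNE0\nP12345\nnot_an_id\n")
print("valid:", valid)
print("invalid:", invalid)
```

Lines reported as invalid are worth fixing before the upload, since the pipeline expects one accession per line.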

In [None]:
#@markdown ### Input (custom PDBs upload, identifier list, cluster ids)
from google.colab import files
import os
import re
import hashlib
import random
import zipfile

input_type = "afdb_cluster" #@param ["afdb_cluster", "identifier", "custom"]
#
#@markdown - afdb_cluster = identifier of an AFDB cluster (e.g. A0A074YNE0),
#@markdown - identifier = list of UniProt identifiers, one per line,
#@markdown - custom = zip file with PDBs

cluster_id = "A0A074YNE0" #@param {type:"string"}
jobname = 'test' #@param {type:"string"}

def add_hash(x,y):
  return x+"_"+hashlib.sha1(y.encode()).hexdigest()[:5]

from sys import version_info
python_version = f"{version_info.major}.{version_info.minor}"


basejobname = "".join(jobname.split())
basejobname = re.sub(r'\W+', '', basejobname)
jobname = add_hash(basejobname, cluster_id)

# Append a numeric suffix if a directory with this jobname already exists
def check(folder):
  return not os.path.exists(folder)
if not check(jobname):
  n = 0
  while not check(f"{jobname}_{n}"): n += 1
  jobname = f"{jobname}_{n}"

# make directory to save results
os.makedirs(jobname, exist_ok=True)

if input_type == "custom":
  input_file = os.path.join(jobname,f"{jobname}.zip")
  if not os.path.isfile(input_file):
    zipfiles = files.upload()
    zipfile_name = list(zipfiles.keys())[0]
    os.rename(zipfile_name, input_file)
    # Unzipping the file
    with zipfile.ZipFile(input_file, 'r') as zip_ref:
      zip_ref.extractall(jobname)
    os.remove(input_file)

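    input_file = os.path.join(jobname,"identifiers.txt")
    with open(input_file, "w") as f:
      f.write("")
    # Move the extracted PDB files into the structs subfolder
    # (exist_ok avoids an error if the archive already contains a structs folder)
    os.makedirs(os.path.join(jobname,"structs"), exist_ok=True)
    for file in os.listdir(jobname):
      if file.endswith(".pdb"):
        os.rename(os.path.join(jobname,file), os.path.join(jobname,"structs",file))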



elif input_type == "afdb_cluster":
  import requests
  # Define the endpoint and parameters
  base_url = "https://cluster.foldseek.com/api/cluster/"
  params = {
      "format": "accessions",
      "groupBy": "",
      "groupDesc": "",
      "itemsPerPage": 10,
      "multiSort": "false",
      "mustSort": "false",
      "page": 1,
      "sortBy": "",
      "sortDesc": "false"
  }

  # Make the request
  response = requests.get(f"{base_url}{cluster_id}/members", params=params)

  # Ensure the request was successful
  response.raise_for_status()

  # Save the response content to a file
  with open(f"{jobname}/identifiers.txt", "w") as file:
      file.write(response.text)
elif input_type == "identifier":
  input_file = os.path.join(jobname,f"identifiers.txt")
  if not os.path.isfile(input_file):
    identifierfiles = files.upload()
    identifierfilename = list(identifierfiles.keys())[0]
    os.rename(identifierfilename, input_file)



In [None]:
#@title Install dependencies
%%bash -s $python_version
PYTHON_VERSION=$1
# Check if fold_tree directory exists and remove it
if [ -d "fold_tree" ]; then
  rm -r fold_tree
fi
#pip install -q biopython ete3 pyqt5 wget statsmodels toytree toyplot requests tqdm
# Clone the repository
git clone -q https://github.com/DessimozLab/fold_tree
#git clone -q https://github.com/DessimozLab/fold_tree --branch foldtreeserver

wget -qnc "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
bash Miniforge3-$(uname)-$(uname -m).sh -bfp /usr/local > /dev/null 2>&1

#mamba config --set auto_update_conda false

export CONDA_AUTO_UPDATE_CONDA=false

if ! mamba env list | grep -qE '^\s*foldtree\s'; then
    mamba create -n foldtree -y
fi

eval "$(mamba shell hook --shell bash)"

mamba activate foldtree

# Install the Foldtree dependencies into the foldtree environment
mamba install -q -c bioconda -c conda-forge -c nodefaults python="${PYTHON_VERSION}" foldseek snakemake snakedeploy snakefmt iqtree muscle quicktree fasttree=2.1.11 clustalo=1.2.4 python-wget ete3 statsmodels toytree toyplot tqdm requests biopython
pip install ujson

mamba activate base

mamba install -q -c bioconda -c conda-forge -c nodefaults python="${PYTHON_VERSION}"  snakemake

In [None]:
#@title Run Foldtree
%%bash -s $jobname $input_type
JOBNAME=$1
INPUT_TYPE=$2
SUFFIX=""
if [[ $INPUT_TYPE = "custom" ]]; then
  mkdir -p "${JOBNAME}/structs"
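  # Move any structure files still in the job folder; patterns without a match are skipped
  shopt -s nullglob
  STRUCTS=("${JOBNAME}/"*.pdb "${JOBNAME}/"*.cif)
  if (( ${#STRUCTS[@]} > 0 )); then
    mv "${STRUCTS[@]}" "${JOBNAME}/structs"
  fi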
  SUFFIX="custom_structs=True"
fi
snakemake --cores $(nproc --all) --use-conda -s fold_tree/workflow/fold_tree -k --config folder="./${JOBNAME}" filter=False $SUFFIX  #> /dev/null 2>&1
#snakemake --cores 4 --use-conda -s fold_tree/workflow/fold_tree --config folder=./${jobname} filter=False

In [None]:
#@title Plot Foldtree output {run: "auto"}
tree = "foldseek_rooted" #@param ["foldseek_rooted", "foldseek", "lddt_rooted", "lddt", "alntmscore_rooted", "alntmscore"]
import sys
if f"/usr/local/lib/python{python_version}/site-packages/" not in sys.path:
    sys.path.insert(0, f"/usr/local/lib/python{python_version}/site-packages/")

import os
os.environ['QT_QPA_PLATFORM']='offscreen'
from ete3 import Tree, TreeStyle, TextFace, CircleFace

filelookup = {
    "foldseek_rooted": "foldtree_struct_tree.PP.nwk.rooted.final",
    "foldseek": "foldtree_struct_tree.PP.nwk",
    "lddt_rooted": "lddt_struct_tree.PP.nwk.rooted.final",
    "lddt": "lddt_struct_tree.PP.nwk",
    "alntmscore_rooted" : "alntmscore_struct_tree.PP.nwk.rooted.final",
    "alntmscore" :  "alntmscore_struct_tree.PP.nwk"
}

t = Tree(f"{jobname}/{filelookup[tree]}", format = 0)
# Define a tree style
ts = TreeStyle()
ts.mode = "c"  # "c" selects the circular layout ("r" would be rectangular)
ts.show_leaf_name = True
ts.show_branch_length = True
ts.show_branch_support = True

for n in t.traverse():
    support_face = CircleFace(radius=10, color="Thistle", style="circle")
    n.add_face(support_face, column=0, position="branch-right")
    n.img_style["vt_line_width"] = 50
    n.img_style["hz_line_width"] = 50


for leaf in t.iter_leaves():
    leaf.img_style["vt_line_type"] = 1  # for vertical lines
    leaf.img_style["hz_line_type"] = 1  # for horizontal lines
    leaf.add_face(TextFace(leaf.name, fsize=512), column=0, position="branch-right")

# Visualize the tree
t.render(jobname + "/tree.svg", w=1000, h=1000, units="px", tree_style=ts, dpi=300)

import base64
from IPython.display import display, HTML

# Read the SVG as bytes so that non-ASCII leaf names do not break the base64 encoding
with open(jobname + '/tree.svg', 'rb') as f:
  display(HTML('<img style="width:100%; background:white; height:100%;max-width: 80vw;margin:1em;" src="data:image/svg+xml;base64,' + base64.b64encode(f.read()).decode('ascii') + '" />'))


## Tree visualisation and comparison
The tree visualisation below is powered by [Phylo.io](https://beta.phylo.io/viewer/).
You can select structural distance metrics to compare tree topologies.
Comparison with sequence-based trees is coming soon.

### Usage
Use the dropdown menus to select the rooted or unrooted tree, the distance metric and the tree to display.
The best results in the manuscript were obtained with the Foldseek score.

To return to a single tree view, select no tree in the second dropdown menu.
The color of each branch represents the maximum Jaccard similarity between that subtree's leaf set and the closest matching subtree's leaf set in the tree on the opposite side of the visualization.
The darker the color of the branch leading up to a node, the more similar the two sets of leaves are.
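
The branch coloring described above can be illustrated with a small sketch. This is not Phylo.io's implementation. It assumes trees are written as nested tuples with strings as leaves, and the function names (`leaf_sets`, `jaccard`, `best_matches`) are hypothetical. For every subtree of the first tree, it reports the highest Jaccard similarity to any subtree of the second tree.

```python
def leaf_sets(tree):
    """Collect the leaf set of every subtree in a nested-tuple tree.

    A leaf is a string; an internal node is a tuple of child nodes.
    """
    collected = []

    def visit(node):
        if isinstance(node, str):
            leaves = frozenset([node])
        else:
            leaves = frozenset().union(*(visit(child) for child in node))
        collected.append(leaves)
        return leaves

    visit(tree)
    return collected

def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    return len(a & b) / len(a | b)

def best_matches(tree_a, tree_b):
    """Map each subtree leaf set of tree_a to its best Jaccard score in tree_b."""
    sets_b = leaf_sets(tree_b)
    return {
        leaves: max(jaccard(leaves, other) for other in sets_b)
        for leaves in leaf_sets(tree_a)
    }

# Two toy trees over the same four leaves, with conflicting groupings.
tree_a = (("A", "B"), ("C", "D"))
tree_b = (("A", "C"), ("B", "D"))

scores = best_matches(tree_a, tree_b)
for leaves, score in scores.items():
    print(sorted(leaves), round(score, 2))
```

In this toy case the clade `(A, B)` has no exact counterpart in the second tree, so its best score stays below 1, while the root and the single leaves match perfectly. In the viewer, a higher score corresponds to a darker branch.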

In [None]:
#@title Phyloio visualization
!cp -r /content/fold_tree/docs/dist_server/* /usr/local/share/jupyter/nbextensions/google.colab
!cp -r /content/{jobname} /usr/local/share/jupyter/nbextensions/google.colab
import csv
filelookup = {
    "foldseek_rooted": "foldtree_struct_tree.PP.nwk.rooted.final",
    "foldseek": "foldtree_struct_tree.PP.nwk",
    "lddt_rooted": "lddt_struct_tree.PP.nwk.rooted.final",
    "lddt": "lddt_struct_tree.PP.nwk",
    "alntmscore_rooted" : "alntmscore_struct_tree.PP.nwk.rooted.final",
    "alntmscore" :  "alntmscore_struct_tree.PP.nwk"
}

id_mapper = {}

with open(jobname +  '/' + 'finalset.csv', newline='') as csvfile:
    for row in csv.reader(csvfile, delimiter=','):
      id_mapper[row[3]] = row[2]

with open('fold_tree/docs/dist_server/compare_tree.html', 'r') as f:
    # Load the HTML template and strip zero-width spaces
    html_string = f.read().replace('\u200b', '')

# Fill each placeholder in the template with the matching Newick tree
for key, value in filelookup.items():
    with open(jobname + '/' + value, 'r') as treefile:
        output = treefile.read()

    # Replace internal identifiers with the names read from finalset.csv
    for name, name_species in id_mapper.items():
        output = output.replace(name, name_species)

    html_string = html_string.replace(key + '_123456789', output)

from IPython.display import HTML
HTML(html_string)

In [None]:
#@title Package and download results
!zip -FSr $jobname".result.zip" $jobname
files.download(f"{jobname}.result.zip")