In [42]:

# Setup AlphaFold Benchmark

# 5 structural families

# collect input sets with good taxonomic range

# use AA and FoldTree2 to reconstruct all internal nodes

# run AlphaFold on the amino acid sequences inferred from both

# recover pLDDT values for each ancestral reconstruction

# pLDDT as a measure of overall confidence in the ancestral reconstruction

# pLDDT (mean, variance, skew) vs. distance from root

# AlphaFold Benchmark Setup - Dual-Method Comparison

This notebook sets up a comprehensive benchmark comparing two approaches to ancestral sequence reconstruction:

## 1. Traditional AA-Based Method
- **Input**: Amino acid sequences of UniProt entries (read from the residues of their AlphaFold Database structures)
- **Alignment**: MAFFT with standard amino acid substitution matrices
- **Tree Inference**: RAxML-NG with amino acid models
- **Ancestral Reconstruction**: RAxML-NG ancestral state reconstruction
- **Validation**: AlphaFold prediction quality (pLDDT scores)

## 2. FoldTree2 Structure-Based Method
- **Input**: 3D protein structures from AlphaFold Database
- **Encoding**: Trained neural network encoder converts structures to discrete tokens
- **Alignment**: MAFFT with custom structure-based substitution matrices
- **Tree Inference**: RAxML-NG with multi-state custom alphabets
- **Ancestral Reconstruction**: Structure-based ancestral sequences
- **Validation**: AlphaFold prediction quality (pLDDT scores)

## Information Theory Benchmark Families

Selected protein families with high phylogenetic information content:

1. **Rhodopsin**: G-protein coupled receptors with conserved 7-transmembrane structure
2. **RuBisCO**: Critical photosynthesis enzyme with large/small subunits
3. **ATP Synthase F0**: Membrane-embedded proton channel subunit
4. **ATP Synthase F1**: Catalytic domain of ATP synthase
5. **HAP2**: Gamete fusion protein with conserved fusogenic role

## Workflow Overview

### Phase 1: Data Collection
1. Query UniProt for protein sequences and structures
2. Download AlphaFold Database structures
3. Cluster structures using Foldseek to reduce redundancy
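
As a minimal sketch of the redundancy-reduction step, the two-column cluster table that Foldseek writes (representative, then member, tab-separated) can be turned into a mapping from each representative to its members. The function name `read_clusters` and the inline example text are illustrative assumptions, not part of the notebook's code.

```python
import csv
import io
from collections import defaultdict

def read_clusters(tsv_text):
    """Map each cluster representative to the list of its member IDs.

    Assumes two tab-separated columns per line: representative, member.
    """
    clusters = defaultdict(list)
    for row in csv.reader(io.StringIO(tsv_text), delimiter='\t'):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        rep, member = row[0], row[1]
        clusters[rep].append(member)
    return dict(clusters)

# Hypothetical excerpt in the same layout as a *_cluster.tsv file
example = "O55240\tO55240\nO60241\tO60241\nO60241\tQ8CGM1\nO60241\tO14514\n"
clusters = read_clusters(example)
representatives = list(clusters)
print(representatives)
```

Only the representatives would then be carried forward to alignment and tree building.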

### Phase 2: Dual Phylogenetic Analysis
**AA Method:**
- Convert structures to FASTA
- Align with MAFFT
- Build tree with RAxML-NG
- Reconstruct ancestral sequences

**FoldTree2 Method:**
- Encode structures with trained model
- Align discrete structural tokens
- Build tree with custom substitution matrices
- Reconstruct ancestral structural sequences
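
For the AA branch, the tree search and ancestral reconstruction could be driven by two RAxML-NG invocations along these lines. This is a sketch under assumptions: the `LG+G4` model, the thread count, and the `aa_tree` prefix are placeholders chosen for illustration, and the helper only builds the command lists rather than running them.

```python
def raxml_ng_commands(msa, prefix, model='LG+G4', raxml_path='raxml-ng', threads=4):
    """Build the (tree search, ancestral reconstruction) command lists.

    The model and thread count are illustrative defaults, not the notebook's settings.
    """
    search = [raxml_path, '--search', '--msa', msa, '--model', model,
              '--prefix', prefix, '--threads', str(threads)]
    # RAxML-NG writes the best ML tree of a search to <prefix>.raxml.bestTree
    best_tree = f'{prefix}.raxml.bestTree'
    ancestral = [raxml_path, '--ancestral', '--msa', msa, '--tree', best_tree,
                 '--model', model, '--prefix', f'{prefix}_asr',
                 '--threads', str(threads)]
    return search, ancestral

search_cmd, asr_cmd = raxml_ng_commands(
    'alphafold_benchmark/rhodopsin/aligned.fasta',
    'alphafold_benchmark/rhodopsin/aa_tree',  # hypothetical output prefix
)
print(' '.join(search_cmd))
print(' '.join(asr_cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the paths and model have been adapted.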

### Phase 3: Validation
1. Prepare ancestral sequences for AlphaFold
2. Run AlphaFold on both AA and FoldTree2 ancestral sequences
3. Extract pLDDT confidence scores
4. Compare reconstruction quality between methods
5. Statistical analysis and visualization
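
Step 3 can be sketched as follows, assuming the predictions are PDB files in which AlphaFold stores per-residue pLDDT in the B-factor column. The helper names and the toy records are illustrative assumptions. The parser reads fixed-width columns directly so the example has no third-party dependencies.

```python
import statistics

def plddt_from_pdb_text(pdb_text):
    """Collect the B-factor value of every CA atom (one value per residue)."""
    values = []
    for line in pdb_text.splitlines():
        # Fixed-width PDB columns: atom name in 13-16, B-factor in 61-66
        if line.startswith('ATOM') and line[12:16].strip() == 'CA':
            values.append(float(line[60:66]))
    return values

def summarize(values):
    """Mean, population variance and skewness of a list of pLDDT values."""
    mean = statistics.fmean(values)
    var = statistics.pvariance(values, mu=mean)
    third = sum((v - mean) ** 3 for v in values) / len(values)
    skew = third / var ** 1.5 if var > 0 else 0.0
    return {'mean': mean, 'var': var, 'skew': skew}

def _atom_line(serial, name, resseq, bfactor):
    """Format one fixed-width ATOM record (coordinates are placeholders)."""
    return (f"ATOM  {serial:>5} {name:<4} ALA A{resseq:>4}    "
            f"{0.0:>8.3f}{0.0:>8.3f}{0.0:>8.3f}{1.0:>6.2f}{bfactor:>6.2f}")

# Toy three-residue input, made up for illustration only
toy = "\n".join([
    _atom_line(1, 'N', 1, 90.0),
    _atom_line(2, 'CA', 1, 90.0),
    _atom_line(3, 'CA', 2, 70.0),
    _atom_line(4, 'CA', 3, 50.0),
])
values = plddt_from_pdb_text(toy)
stats = summarize(values)
print(values, stats)
```

Applying `summarize` to every ancestral node gives one row of summary statistics per node and method.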

## Expected Outcomes

- **pLDDT distributions**: Quantify prediction confidence for each method
- **Method comparison**: Determine if structural information improves reconstruction
- **Family-specific insights**: Identify which protein families benefit most from structural encoding
- **Phylogenetic depth analysis**: Correlate reconstruction quality with evolutionary distance
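
One way the depth analysis might be set up is a rank correlation between each ancestral node's distance from the root and its mean pLDDT. The sketch below implements Spearman's rho with the standard library only. The input numbers are arbitrary placeholders, not benchmark results.

```python
def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Placeholder values for illustration only
root_distance = [0.1, 0.4, 0.8, 1.5]
mean_plddt = [62.0, 71.0, 80.0, 88.0]
rho = spearman(root_distance, mean_plddt)
print(rho)
```

A rank-based measure is used here because the relationship between depth and confidence need not be linear.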

## Dependencies

- **FoldTree2**: Structure-based phylogenetics framework
- **RAxML-NG**: Phylogenetic inference
- **MAFFT**: Multiple sequence alignment
- **Foldseek**: Structure clustering
- **AlphaFold**: Structure prediction and validation
- **BioPython**: Sequence/structure manipulation
- **PyTorch**: FoldTree2 model execution
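
Before running the pipeline, it can help to confirm that the command-line tools are reachable. A minimal sketch, assuming the executables are called `mafft`, `foldseek` and `raxml-ng` and are expected on `PATH` (the notebook itself points RAxML-NG at a local `./raxmlng/raxml-ng` binary, which a `PATH` lookup would not find):

```python
import shutil

def find_tools(tools):
    """Return {tool_name: resolved_path_or_None} for each executable name."""
    return {name: shutil.which(name) for name in tools}

found = find_tools(['mafft', 'foldseek', 'raxml-ng'])
for name, path in found.items():
    print(f"{name}: {path if path else 'NOT FOUND on PATH'}")
```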

In [43]:
cd /home/dmoi/projects/foldtree2/

/home/dmoi/projects/foldtree2


In [None]:
overwrite = False
benchmark_folder = 'alphafold_benchmark'

# FoldTree2 model configuration
model_name = 'monodecoder_model_best'  # Update with your trained model name
model_dir = 'models/'
model_path = f'{model_dir}{model_name}_encoder'

In [45]:
#use autoreload
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [None]:
# Benchmark configuration
# Choose which benchmark set to use: 'information_theory' or 'marker_genes'
BENCHMARK_SET = 'information_theory'  # Change to 'marker_genes' for alternative benchmark

# Marker genes directory (used when BENCHMARK_SET='marker_genes')
MARKER_GENES_DIR = '/home/dmoi/projects/foldtree2/families/Information_benchmark/marker_genes'
MARKER_GENES_CSV = '/home/dmoi/projects/foldtree2/oma_markergene_files.csv'

if BENCHMARK_SET == 'information_theory':
    # Original benchmark families with high phylogenetic signal
    benchmark_families = {
        'rhodopsin': {
            'search_term': 'rhodopsin',
            'taxonomy': 'Metazoa',  # Animals only
            'reviewed': True,
            'description': 'G-protein coupled receptor family'
        },
        'rubisco': {
            'search_term': 'ribulose bisphosphate carboxylase',
            'taxonomy': 'Viridiplantae',  # Green plants
            'reviewed': True,
            'description': 'Key enzyme in carbon fixation'
        },
        'atp_synthase_f0': {
            'search_term': 'ATP synthase subunit c',
            'taxonomy': 'Bacteria',
            'reviewed': True,
            'description': 'F0 complex proton channel'
        },
        'atp_synthase_f1': {
            'search_term': 'ATP synthase subunit alpha',
            'taxonomy': 'Bacteria',
            'reviewed': True,
            'description': 'F1 complex catalytic subunit'
        },
        'hap2': {
            'search_term': 'HAP2',
            'taxonomy': 'Eukaryota',
            'reviewed': True,
            'description': 'Gamete fusion protein'
        }
    }
    
elif BENCHMARK_SET == 'marker_genes':
    # Load marker genes - each subdirectory in marker_genes/ is a COG family with structs/ folder
    import os
    
    if not os.path.exists(MARKER_GENES_DIR):
        raise FileNotFoundError(f"Marker genes directory not found: {MARKER_GENES_DIR}")
    
    # Create benchmark_families dictionary for marker genes
    # Each subdirectory (e.g., COG0012, COG0016) becomes a family
    benchmark_families = {}
    
    for family_dir in os.listdir(MARKER_GENES_DIR):
        full_path = os.path.join(MARKER_GENES_DIR, family_dir)
        structs_path = os.path.join(full_path, 'structs')
        
        # Only include if it's a directory with a structs subdirectory
        if os.path.isdir(full_path) and os.path.isdir(structs_path):
            # Count PDB files in structs folder
            pdb_files = [f for f in os.listdir(structs_path) if f.endswith('.pdb')]
            
            if len(pdb_files) > 0:
                benchmark_families[family_dir] = {
                    'search_term': None,  # Not used for marker genes (already have structures)
                    'taxonomy': None,  # Not filtering by taxonomy
                    'reviewed': None,  # Not filtering by review status
                    'description': f'Marker gene family {family_dir} ({len(pdb_files)} structures)',
                    'directory': full_path  # Direct path to family folder
                }
    
    print(f"Loaded {len(benchmark_families)} marker gene families from {MARKER_GENES_DIR}")
    
else:
    raise ValueError(f"Invalid BENCHMARK_SET: {BENCHMARK_SET}. Must be 'information_theory' or 'marker_genes'")

# Extract search terms for UniProt queries (only used for information_theory)
if BENCHMARK_SET == 'information_theory':
    search_terms = [family_config['search_term'] for family_config in benchmark_families.values()]
    print(f"Using information theory benchmark with {len(benchmark_families)} families")
    print(f"Families: {list(benchmark_families.keys())}")
else:
    search_terms = []  # Not used for marker genes
    print(f"Using marker genes benchmark with {len(benchmark_families)} families")
    print(f"First 5 families: {list(benchmark_families.keys())[:5]}")


['Naegleria gruberi', 'Trypanosoma brucei', 'Giardia lamblia', 'Dictyostelium discoideum', 'Acanthamoeba castellanii', 'Chlamydomonas reinhardtii', 'Cyanidioschyzon merolae', 'Arabidopsis thaliana', 'Oryza sativa', 'Physcomitrella patens', 'Selaginella moellendorffii', 'Marchantia polymorpha', 'Thalassiosira pseudonana', 'Plasmodium falciparum', 'Tetrahymena thermophila', 'Bigelowiella natans', 'Saccharomyces cerevisiae', 'Neurospora crassa', 'Monosiga brevicollis', 'Homo sapiens', 'Drosophila melanogaster', 'Caenorhabditis elegans', 'Emiliania huxleyi', 'Guillardia theta']
Species to TaxID mapping:
Naegleria gruberi: 5762
Trypanosoma brucei: 5691
Giardia lamblia: 5741
Dictyostelium discoideum: 44689
Acanthamoeba castellanii: 5755
Chlamydomonas reinhardtii: 3055
Cyanidioschyzon merolae: 45157
Arabidopsis thaliana: 3702
Oryza sativa: 4530
Physcomitrella patens: 3218
Selaginella moellendorffii: 88036
Marchantia polymorpha: 3197
Thalassiosira pseudonana: 35128
Plasmodium falciparum: 5833


### Benchmark Set Options

This notebook supports two benchmark datasets:

#### 1. **Information Theory Benchmark** (Default)
- Uses protein families with high phylogenetic signal: rhodopsin, RuBisCO, ATP synthase, HAP2
- Downloads structures from AlphaFold DB based on UniProt queries
- Filters by taxonomy and reviewed status
- Workflow: UniProt query → Download AFDB structures → Cluster → Align → Build trees

#### 2. **Marker Genes Benchmark**
- Uses pre-existing marker gene families from OMA (Orthologous MAtrix) database  
- **Structures already downloaded** in `families/Information_benchmark/marker_genes/`
- Each COG family (e.g., COG0012, COG0016) has a `structs/` subdirectory with PDB files
- **No UniProt queries or structure downloads needed** - skips directly to clustering
- Workflow: Use existing structures → Cluster → Align → Build trees

**To switch benchmark sets:**
```python
# For information theory:
BENCHMARK_SET = 'information_theory'

# For marker genes:
BENCHMARK_SET = 'marker_genes'
```

The rest of the workflow adapts automatically based on your selection.
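
The per-family folder logic that later cells repeat can be summarised in one helper. This is a sketch: the name `family_paths` is hypothetical, and it mirrors the branching used in the cells below rather than replacing it.

```python
import os

def family_paths(benchmark_set, benchmark_folder, family_name, families):
    """Return (family_folder, label) for one family under the chosen benchmark set."""
    if benchmark_set == 'information_theory':
        # Folder names are derived from the UniProt search term
        label = families[family_name]['search_term'].replace(' ', '_')
    elif benchmark_set == 'marker_genes':
        # Folder names are the family IDs themselves (e.g. COG0012)
        label = family_name
    else:
        raise ValueError(f"Invalid BENCHMARK_SET: {benchmark_set}")
    return os.path.join(benchmark_folder, label), label

families = {'atp_synthase_f0': {'search_term': 'ATP synthase subunit c'}}
folder, label = family_paths('information_theory', 'alphafold_benchmark',
                             'atp_synthase_f0', families)
print(folder, label)
```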

In [None]:
# Import FoldTree2 treebuilder for structure-based phylogenetics
import os
import sys
sys.path.insert(0, '/home/dmoi/projects/foldtree2')
from foldtree2.ft2treebuilder import treebuilder
from foldtree2.src.pdbgraph import PDB2PyG
import torch

# Check if model files exist
encoder_path = f'{model_path}.pt'
decoder_path = f'{model_dir}{model_name}_decoder.pt'
mafftmat_path = f'{model_path}_mafftmat.mtx'
submat_path = f'{model_path}_submat.txt'
charmaps_path = f'{model_path}_pair_counts.pkl'

print("Checking FoldTree2 model files:")
print(f"  Encoder: {encoder_path} - {'✓' if os.path.exists(encoder_path) else '✗ MISSING'}")
print(f"  Decoder: {decoder_path} - {'✓' if os.path.exists(decoder_path) else '✗ MISSING'}")
print(f"  MAFFT matrix: {mafftmat_path} - {'✓' if os.path.exists(mafftmat_path) else '✗ MISSING'}")
print(f"  Substitution matrix: {submat_path} - {'✓' if os.path.exists(submat_path) else '✗ MISSING'}")
print(f"  Character maps: {charmaps_path} - {'✓' if os.path.exists(charmaps_path) else '✗ MISSING'}")

# Initialize treebuilder if model exists
if os.path.exists(encoder_path):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print(f"\nInitializing FoldTree2 treebuilder on device: {device}")
    
    tb = treebuilder(
        model=encoder_path,
        decoder_model=decoder_path if os.path.exists(decoder_path) else None,
        mafftmat=mafftmat_path if os.path.exists(mafftmat_path) else None,
        submat=submat_path if os.path.exists(submat_path) else None,
        raxml_path='./raxmlng/raxml-ng',
        charmaps=charmaps_path if os.path.exists(charmaps_path) else None,
        aapropcsv='./foldtree2/config/aaindex1.csv',
        device=str(device),
        ncores=8
    )
    print("✓ FoldTree2 treebuilder initialized successfully")
else:
    tb = None
    print("\n⚠ Warning: FoldTree2 model not found. Will only run AA-based reconstruction.")
    print(f"  To use FoldTree2, train a model and update model_name in the config cell.")

{'2611352': 1.0, '1206794': 3.0, '33213': 2.0, '716545': 2.0, '33154': 1.0, '33630': 2.0, '2698737': 1.0, '554915': 1.0, '1437183': 4.0, '58023': 3.0, '3193': 2.0, '33090': 1.0, '2759': 0.0}


In [48]:
import requests
from io import StringIO
import pandas as pd
# NOTE: this list replaces the search_terms derived from benchmark_families above.
# Later cells build folder names from benchmark_families[...]['search_term'],
# so the two sets of terms must match for the downstream paths to resolve.
search_terms = ['rhodopsin', 'RuBisCO', 'FO ATP synthase', 'F1 ATP synthase', 'Hap2']

def get_swissprot_eukaryota_entries(search_term, reviewed=True, max_entries=1000):
	"""Query UniProtKB for eukaryotic entries (taxonomy 2759) that match search_term
	and have an AlphaFold DB cross-reference. Returns a DataFrame, or None on failure."""
	base_url = 'https://rest.uniprot.org/uniprotkb/stream'
	# requests URL-encodes the parameters itself, so spaces are left untouched here
	#query = f'{search_term} AND taxonomy_id:2759'
	query = f'{search_term} AND taxonomy_id:2759 AND database:alphafolddb'
	print(f"Querying UniProt for: {query}")
	if reviewed:
		query += ' AND reviewed:true'
	params = {
		'query': query,
		'fields': 'accession,id,protein_name,organism_name,sequence,lineage',
		'format': 'tsv',
		# The stream endpoint appears to ignore 'size' (the rhodopsin query returned 3187 rows)
		'size': max_entries
	}
	response = requests.get(base_url, params=params)
	if response.status_code != 200:
		print(f"Error fetching data for {search_term}: {response.status_code}")
		return None
	if not response.text.strip():
		print(f"No data returned for search term: {search_term}")
		return None

	# Parse the response into a DataFrame
	results = pd.read_csv(StringIO(response.text), sep='\t')
	results['search_term'] = search_term  # Add search term column
	results['taxonomy_id'] = 2759
	return results



In [49]:
import os
if overwrite:
	# Remove existing results folder if it exists
	import shutil
	if os.path.exists(benchmark_folder):
		shutil.rmtree(benchmark_folder)
		print(f"Removed existing folder: {benchmark_folder}")


In [None]:
import os

# Only download structures for information_theory benchmark
# Marker genes already have structures in their directories
if BENCHMARK_SET == 'information_theory':
    if not os.path.exists(benchmark_folder):
        os.makedirs(benchmark_folder)

    for term in search_terms:
        print(f"\nSearch results for '{term}':")
        resultsdf = get_swissprot_eukaryota_entries(term)
        # The query helper returns None on HTTP errors or empty responses
        if resultsdf is None:
            print(f"No results for '{term}'. Skipping.")
            continue
        print(resultsdf)
        termfolder = term.replace(' ', '_')
        termfolder = os.path.join(benchmark_folder, termfolder)
        os.makedirs(termfolder, exist_ok=True)
        # Save the results to a CSV file
        results_file = os.path.join(termfolder, f"{term.replace(' ', '_')}_results.csv")
        resultsdf.to_csv(results_file, index=False)
        print(f"Results saved to {results_file}")
elif BENCHMARK_SET == 'marker_genes':
    print("\nSkipping UniProt search - marker genes already have structures")
    print(f"Using structures from: {MARKER_GENES_DIR}")
else:
    raise ValueError(f"Invalid BENCHMARK_SET: {BENCHMARK_SET}")



Search results for 'rhodopsin':
Querying UniProt for: rhodopsin AND taxonomy_id:2759 AND database:alphafolddb
           Entry   Entry Name  \
0     A0A0K3AWM6   MOM5_CAEEL   
1     A0A2R9YJI3  GPR22_DANRE   
2         A0T2N3   APJB_DANRE   
3         A1Z7G7   LPHN_DROME   
4         A2ARI4   LGR4_MOUSE   
...          ...          ...   
3182      Q09964   YS94_CAEEL   
3183      Q19473  SRD51_CAEEL   
3184      Q19474  SRD50_CAEEL   
3185      Q19508  SRD46_CAEEL   
3186      Q19975  SRD34_CAEEL   

                                          Protein names  \
0                                         Protein mom-5   
1                         G-protein coupled receptor 22   
2     Apelin receptor B (Angiotensin II receptor-lik...   
3                                      Latrophilin Cirl   
4     Leucine-rich repeat-containing G-protein coupl...   
...                                                 ...   
3182        Putative G-protein coupled receptor B0244.4   
3183  Serpentine rec

In [None]:

import glob
from src import AFDB_tools
import tqdm

# Only download structures for information_theory benchmark
# Marker genes already have structures
if BENCHMARK_SET == 'information_theory' and overwrite:
	for term in search_terms:
		termfolder = term.replace(' ', '_')
		structfolder = os.path.join(benchmark_folder, termfolder, 'input_structs')
		results_file = os.path.join(benchmark_folder , termfolder, f"{term.replace(' ', '_')}_results.csv")
		if not os.path.exists(results_file):
			print(f"Results file {results_file} does not exist. Skipping.")
			continue
		resultsdf = pd.read_csv(results_file)
		print(f"Processing results for {term} with {len(resultsdf)} entries.")
		print( len( glob.glob(os.path.join(structfolder, '*.pdb')) ), " structures already downloaded.")
		for index, row in tqdm.tqdm(resultsdf.iterrows() , total=len(resultsdf) , desc=f"Processing {term}"):
			uniprot_id = row['Entry']

			# Download only structures that are not already on disk
			if not os.path.isfile(os.path.join(structfolder, uniprot_id + '.pdb')):
				AFDB_tools.grab_struct(uniprot_id, structfolder=structfolder + '/', overwrite=False)
elif BENCHMARK_SET == 'marker_genes':
	print("\nSkipping structure download - marker genes already have structures in their directories.")

In [None]:
# Skip clustering for marker genes - use structures directly
# Clustering only needed for information_theory benchmark to reduce redundancy

import os
import subprocess

overwrite = True  # Set to True to overwrite existing results

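def run_foldseek(query_folder, target_folder, tmp_folder, foldseek_path='foldseek'):
	"""Cluster the structures in query_folder with `foldseek easy-cluster`.

	Foldseek treats target_folder as an output prefix, so the cluster table
	is written next to it as `<target_folder>_cluster.tsv`.
	"""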
	# Ensure output folder exists
	if not os.path.exists(target_folder):
		os.makedirs(target_folder)
		
	# Command example: foldseek easy-cluster example/ res tmp -c 0.9 
	command = [foldseek_path, 'easy-cluster', query_folder, target_folder, tmp_folder, '-c', '0.9']
	print(f"Running command: {' '.join(command)}")
	try:
		subprocess.run(command, check=True)
		print(f"Foldseek clustering completed successfully for {query_folder}.")
	except subprocess.CalledProcessError as e:
		print(f"Error running foldseek: {e}")
		raise

# Only run foldseek clustering for information_theory benchmark
if BENCHMARK_SET == 'information_theory':
	print("Running Foldseek clustering for information_theory benchmark...")
	for family_name in benchmark_families.keys():
		# Use search term to create folder name
		term = benchmark_families[family_name]['search_term']
		termfolder = term.replace(' ', '_')
		termfolder = os.path.join(benchmark_folder, termfolder)
		structfolder = os.path.join(termfolder, 'input_structs')
		
		print(f"Processing family: {family_name} in folder: {termfolder}")
		print(f"Input structures folder: {structfolder}")
		
		# Count PDB files in the input folder
		pdb_files = glob.glob(os.path.join(structfolder, '*.pdb'))
		print(f"Number of PDB files in {structfolder}: {len(pdb_files)}")
		
		temp_folder = os.path.join(termfolder, 'tmp')
		print(f"Temporary folder for foldseek: {temp_folder}")
		
		output_folder = os.path.join(termfolder, 'foldseek_output')
		
		if overwrite:
			# Remove existing output folder if it exists
			if os.path.exists(output_folder):
				import shutil
				shutil.rmtree(output_folder)
				print(f"Removed existing output folder: {output_folder}")
		
		print(f"Running foldseek for {family_name} in {termfolder}")
		run_foldseek(structfolder, output_folder, tmp_folder=temp_folder)
		
elif BENCHMARK_SET == 'marker_genes':
	print("Skipping Foldseek clustering for marker_genes - will use all structures directly for tree building.")


Processing term: rhodopsin in folder: alphafold_benchmark/rhodopsin
Input structures folder: alphafold_benchmark/rhodopsin/input_structs
Number of PDB files in alphafold_benchmark/rhodopsin/input_structs: 3187
Temporary folder for foldseek: alphafold_benchmark/rhodopsin/tmp
Removed existing output folder: alphafold_benchmark/rhodopsin/foldseek_output
Running foldseek for rhodopsin in rhodopsin
Running command: foldseek easy-cluster alphafold_benchmark/rhodopsin/input_structs alphafold_benchmark/rhodopsin/foldseek_output alphafold_benchmark/rhodopsin/tmp -c 0.9
alphafold_benchmark/rhodopsin/foldseek_output exists and will be overwritten
easy-cluster alphafold_benchmark/rhodopsin/input_structs alphafold_benchmark/rhodopsin/foldseek_output alphafold_benchmark/rhodopsin/tmp -c 0.9 

MMseqs Version:                     	9.427df8a
Substitution matrix                 	aa:3di.out,nucl:3di.out
Seed substitution matrix            	aa:3di.out,nucl:3di.out
Sensitivity                         	4
k-

In [None]:
if BENCHMARK_SET == 'information_theory':
	print("Foldseek clustering completed for information_theory benchmark.")
else:
	print("Clustering step skipped for marker_genes benchmark.")


Foldseek clustering completed for all terms.


In [None]:
# Read cluster heads from foldseek output (information_theory) 
# OR use all structures directly (marker_genes)

import pandas as pd
import shutil

def copy_cluster_heads(input_file, input_folder, output_folder, verbose=True):
	"""Copy cluster representative structures from foldseek clustering results."""
	if not os.path.exists(output_folder):
		os.makedirs(output_folder)
	df = pd.read_csv(input_file, sep='\t', header=None)
	print(f"Read {len(df)} rows from {input_file}")
	if verbose:
		print("First few rows of the DataFrame:")
		print(df.head())
	# Column 0 holds the cluster representative, column 1 the cluster member
	cluster_ids = df[0].unique()
	print(f"Found {len(cluster_ids)} unique clusters.")
	for cluster_id in cluster_ids:
		# Copy the representative structure of each cluster
		structure_path = os.path.join(input_folder, f"{cluster_id}.pdb")
		if os.path.exists(structure_path):
			dest_path = os.path.join(output_folder, f"{cluster_id}.pdb")
			print(f"Copying {structure_path} to {dest_path}")
			shutil.copy2(structure_path, dest_path)
		else:
			print(f"Structure file {structure_path} does not exist. Skipping.")

def copy_all_structures(input_folder, output_folder, verbose=True):
	"""Copy all structures without clustering (for marker genes)."""
	if not os.path.exists(output_folder):
		os.makedirs(output_folder)
	
	# Get all PDB files
	pdb_files = glob.glob(os.path.join(input_folder, '*.pdb'))
	print(f"Found {len(pdb_files)} PDB files in {input_folder}")
	
	for pdb_file in pdb_files:
		dest_path = os.path.join(output_folder, os.path.basename(pdb_file))
		if verbose:
			print(f"Copying {pdb_file} to {dest_path}")
		shutil.copy2(pdb_file, dest_path)
	
	print(f"Copied {len(pdb_files)} structures to {output_folder}")

# Process each family based on benchmark set
for family_name in benchmark_families.keys():
	if BENCHMARK_SET == 'information_theory':
		# For information theory: use cluster heads from foldseek
		term = benchmark_families[family_name]['search_term']
		termfolder = term.replace(' ', '_')
		termfolder = os.path.join(benchmark_folder, termfolder)
		input_folder = os.path.join(termfolder, 'input_structs')
		
		# Define the input file path from foldseek
		input_file = os.path.join(termfolder, 'foldseek_output_cluster.tsv')
		output_folder = os.path.join(termfolder, 'cluster_heads')
		
		if os.path.exists(input_file):
			copy_cluster_heads(input_file, input_folder, output_folder, verbose=True)
		else:
			print(f"Foldseek cluster file {input_file} not found. Skipping {family_name}.")
			
	elif BENCHMARK_SET == 'marker_genes':
		# For marker genes: use ALL structures directly (no clustering)
		termfolder = os.path.join(benchmark_folder, family_name)
		input_folder = os.path.join(MARKER_GENES_DIR, family_name, 'structs')
		output_folder = os.path.join(termfolder, 'cluster_heads')
		
		print(f"\nProcessing marker gene family: {family_name}")
		print(f"Copying all structures from {input_folder} to {output_folder}")
		copy_all_structures(input_folder, output_folder, verbose=False)


Read 3187 rows from alphafold_benchmark/rhodopsin/foldseek_output_cluster.tsv
First few rows of the DataFrame:
        0       1
0  O55240  O55240
1  O60241  O60241
2  O60241  Q8CGM1
3  O60241  O14514
4  O60241  C0HL12
Found 178 unique clusters.
Copying alphafold_benchmark/rhodopsin/input_structs/O55240.pdb to alphafold_benchmark/rhodopsin/cluster_heads/O55240.pdb
Copying alphafold_benchmark/rhodopsin/input_structs/O60241.pdb to alphafold_benchmark/rhodopsin/cluster_heads/O60241.pdb
Copying alphafold_benchmark/rhodopsin/input_structs/O70430.pdb to alphafold_benchmark/rhodopsin/cluster_heads/O70430.pdb
Copying alphafold_benchmark/rhodopsin/input_structs/O70432.pdb to alphafold_benchmark/rhodopsin/cluster_heads/O70432.pdb
Copying alphafold_benchmark/rhodopsin/input_structs/O75154.pdb to alphafold_benchmark/rhodopsin/cluster_heads/O75154.pdb
Copying alphafold_benchmark/rhodopsin/input_structs/O77830.pdb to alphafold_benchmark/rhodopsin/cluster_heads/O77830.pdb
Copying alphafold_benchmark/

In [None]:
import glob
from Bio.PDB import PDBParser
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO
import os
import tqdm
from scipy.stats import describe

#families directories
def pdb_to_fasta(pdb_file, fasta_file):
	parser = PDBParser(QUIET=True)
	three_to_one = {'ALA':'A', 'CYS':'C', 'ASP':'D', 'GLU':'E',
				'PHE':'F', 'GLY':'G', 'HIS':'H', 'ILE':'I',
				'LYS':'K', 'LEU':'L', 'MET':'M', 'ASN':'N',
				'PRO':'P', 'GLN':'Q', 'ARG':'R', 'SER':'S',
				'THR':'T', 'VAL':'V', 'TRP':'W', 'TYR':'Y'}

	try:
		structure = parser.get_structure('protein', pdb_file)
		# Get first chain sequence
		for model in structure:
			for chain in model:
				seq = ''
				for residue in chain:
					# Keep only the 20 standard residues that have a CA atom;
					# waters, ligands and modified residues are skipped
					resname = residue.get_resname()
					if 'CA' in residue and resname in three_to_one:
						seq += three_to_one[resname]
				if seq:  # Only create record if sequence was found
					pdb_id = os.path.splitext(os.path.basename(pdb_file))[0]
					record = SeqRecord(
						Seq(seq),
						id=f"{pdb_id}_{chain.id}",
						description=f"Chain {chain.id} from {pdb_id}"
					)
					return record
		return None
	except Exception as e:
		print(f"Error processing {pdb_file}: {e}")
		return None

def pdbs_to_fasta(pdb_files, fasta_file):
	with open(fasta_file, 'w') as fasta_out:
		for pdb_file in tqdm.tqdm(pdb_files, desc="Converting PDB to FASTA"):
			record = pdb_to_fasta(pdb_file, fasta_file)
			if record:
				SeqIO.write(record, fasta_out, "fasta")
	return fasta_file

fams = glob.glob('./alphafold_benchmark/*/')
for family_name in benchmark_families.keys():
	# Build folder paths based on benchmark set
	if BENCHMARK_SET == 'information_theory':
		# For information theory: use search term to create folder name
		term = benchmark_families[family_name]['search_term']
		termfolder = term.replace(' ', '_')
		fmt_term = term.replace(' ', '_')
		termfolder = os.path.join(benchmark_folder, termfolder)
	elif BENCHMARK_SET == 'marker_genes':
		# For marker genes: use family name directly
		termfolder = os.path.join(benchmark_folder, family_name)
		fmt_term = family_name
	
	structs = glob.glob(termfolder + '/cluster_heads/*.pdb')
	fasta_file = termfolder + f'/{fmt_term}_AA.fasta'
	# Skip families whose FASTA already exists unless overwriting
	if os.path.exists(fasta_file) and not overwrite:
		print(f"Skipping {termfolder} - fasta already exists.")
		continue
	if os.path.exists(fasta_file):
		os.remove(fasta_file)
	pdbs_to_fasta(structs, fasta_file)
	print(f"Converted {len(structs)} PDB files to FASTA in {fasta_file}")

Converting PDB to FASTA: 100%|█| 178/178 [00:08<00:00,


Converted 178 PDB files to FASTA in alphafold_benchmark/rhodopsin/rhodopsin_AA.fasta


Converting PDB to FASTA: 100%|█| 70/70 [00:02<00:00, 2


Converted 70 PDB files to FASTA in alphafold_benchmark/RuBisCO/RuBisCO_AA.fasta


Converting PDB to FASTA: 100%|█| 5/5 [00:00<00:00, 69.


Converted 5 PDB files to FASTA in alphafold_benchmark/FO_ATP_synthase/FO_ATP_synthase_AA.fasta


Converting PDB to FASTA: 100%|█| 97/97 [00:02<00:00, 4


Converted 97 PDB files to FASTA in alphafold_benchmark/F1_ATP_synthase/F1_ATP_synthase_AA.fasta


Converting PDB to FASTA: 100%|█| 31/31 [00:01<00:00, 2

Converted 31 PDB files to FASTA in alphafold_benchmark/Hap2/Hap2_AA.fasta





In [57]:
import subprocess
def align_AA(fasta_path, output_dir, mafft_path='mafft'):
	"""Align fasta_path with `mafft --auto` and write the result to output_dir/aligned.fasta."""
	aligned_fasta = os.path.join(output_dir, 'aligned.fasta')
	with open(aligned_fasta, 'w') as out_f:
		subprocess.run([mafft_path, '--auto', fasta_path], stdout=out_f, check=True)
	return aligned_fasta

In [None]:
#align the fasta files
for family_name in benchmark_families.keys():
	# Build folder paths based on benchmark set
	if BENCHMARK_SET == 'information_theory':
		# For information theory: use search term to create folder name
		term = benchmark_families[family_name]['search_term']
		termfolder = term.replace(' ', '_')
		fmt_term = term.replace(' ', '_')
		termfolder = os.path.join(benchmark_folder, termfolder)
	elif BENCHMARK_SET == 'marker_genes':
		# For marker genes: use family name directly
		termfolder = os.path.join(benchmark_folder, family_name)
		fmt_term = family_name
	
	fasta_file = os.path.join(termfolder, f'{fmt_term}_AA.fasta')
	if not os.path.exists(fasta_file):
		print(f"FASTA file {fasta_file} does not exist. Skipping alignment.")
		continue
	output_dir = termfolder
	aligned_fasta = align_AA(fasta_file, output_dir)
	print(f"Aligned FASTA saved to {aligned_fasta}")


nthread = 0
nthreadpair = 0
nthreadtb = 0
ppenalty_ex = 0
stacksize: 8192 kb
rescale = 1
Gap Penalty = -1.53, +0.00, +0.00



Making a distance matrix ..
  101 / 178
done.

Constructing a UPGMA tree (efffree=0) ... 
  170 / 178
done.

Progressive alignment 1/2... 
STEP    54 / 177 
Reallocating..done. *alloclen = 6207
STEP    76 / 177 
Reallocating..done. *alloclen = 7588
STEP   128 / 177 
Reallocating..done. *alloclen = 9078
STEP   177 / 177 
done.

Making a distance matrix from msa.. 
  100 / 178
done.

Constructing a UPGMA tree (efffree=1) ... 
  170 / 178
done.

Progressive alignment 2/2... 
STEP   129 / 177 
Reallocating..done. *alloclen = 6680
STEP   133 / 177 
Reallocating..done. *alloclen = 7763
STEP   153 / 177 
Reallocating..done. *alloclen = 9822
STEP   177 / 177 
done.

disttbfast (aa) Version 7.526
alg=A, model=BLOSUM62, 1.53, -0.00, -0.00, noshift, amax=0.0
0 thread(s)

distout=h
rescale = 1
dndpre (aa) Version 7.526
alg=X, model=BLOSUM62, 1.53, +0.12, -0.00, noshift, ama

Aligned FASTA saved to alphafold_benchmark/rhodopsin/aligned.fasta


outputhat23=16
treein = 0
compacttree = 0
stacksize: 8192 kb
rescale = 1
All-to-all alignment.
tbfast-pair (aa) Version 7.526
alg=L, model=BLOSUM62, 2.00, -0.10, +0.10, noshift, amax=0.0
0 thread(s)

outputhat23=16
Loading 'hat3.seed' ... 
done.
Writing hat3 for iterative refinement
rescale = 1
Gap Penalty = -1.53, +0.00, +0.00
tbutree = 1, compacttree = 0
Constructing a UPGMA tree ... 
   60 / 70
done.

Progressive alignment ... 
STEP    41 /69 
Reallocating..done. *alloclen = 4544
STEP    69 /69 
done.
tbfast (aa) Version 7.526
alg=A, model=BLOSUM62, 1.53, -0.00, -0.00, noshift, amax=0.0
1 thread(s)

minimumweight = 0.000010
autosubalignment = 0.000000
nthread = 0
randomseed = 0
blosum 62 / kimura 200
poffset = 0
niter = 16
sueff_global = 0.100000
nadd = 16
Loading 'hat3' ... done.
rescale = 1

   60 / 70
Segment   1/  1    1-3041
STEP 012-018-1  rejected..    identical.    identical.    identical.    rejected. identical.    accepted. identical.    rejected. rejected. rejected. rejec

Aligned FASTA saved to alphafold_benchmark/RuBisCO/aligned.fasta


outputhat23=16
treein = 0
compacttree = 0
stacksize: 8192 kb
rescale = 1
All-to-all alignment.
tbfast-pair (aa) Version 7.526
alg=L, model=BLOSUM62, 2.00, -0.10, +0.10, noshift, amax=0.0
0 thread(s)

outputhat23=16
Loading 'hat3.seed' ... 
done.
Writing hat3 for iterative refinement
rescale = 1
Gap Penalty = -1.53, +0.00, +0.00
tbutree = 1, compacttree = 0
Constructing a UPGMA tree ... 
    0 / 5
done.

Progressive alignment ... 
STEP     4 /4 
done.
tbfast (aa) Version 7.526
alg=A, model=BLOSUM62, 1.53, -0.00, -0.00, noshift, amax=0.0
1 thread(s)

minimumweight = 0.000010
autosubalignment = 0.000000
nthread = 0
randomseed = 0
blosum 62 / kimura 200
poffset = 0
niter = 16
sueff_global = 0.100000
nadd = 16
Loading 'hat3' ... done.
rescale = 1

    0 / 5
Segment   1/  1    1- 373
STEP 004-001-0  rejected..   
Converged.

done
dvtditr (aa) Version 7.526
alg=A, model=BLOSUM62, 1.53, -0.00, -0.00, noshift, amax=0.0
0 thread(s)


Strategy:
 L-INS-i (Probably most accurate, very slow)
 Iterat

Aligned FASTA saved to alphafold_benchmark/FO_ATP_synthase/aligned.fasta


outputhat23=16
treein = 0
compacttree = 0
stacksize: 8192 kb
rescale = 1
All-to-all alignment.
tbfast-pair (aa) Version 7.526
alg=L, model=BLOSUM62, 2.00, -0.10, +0.10, noshift, amax=0.0
0 thread(s)

outputhat23=16
Loading 'hat3.seed' ... 
done.
Writing hat3 for iterative refinement
rescale = 1
Gap Penalty = -1.53, +0.00, +0.00
tbutree = 1, compacttree = 0
Constructing a UPGMA tree ... 
   90 / 97
done.

Progressive alignment ... 
STEP    96 /96 
done.
tbfast (aa) Version 7.526
alg=A, model=BLOSUM62, 1.53, -0.00, -0.00, noshift, amax=0.0
1 thread(s)

minimumweight = 0.000010
autosubalignment = 0.000000
nthread = 0
randomseed = 0
blosum 62 / kimura 200
poffset = 0
niter = 16
sueff_global = 0.100000
nadd = 16
Loading 'hat3' ... done.
rescale = 1

   90 / 97
Segment   1/  1    1-2748
STEP 016-051-0  rejected..    identical.    identical.    rejected. identical.    accepted. identical.    rejected. accepted. identical.    identical.    rejected. rejected. accepted. rejected. rejected. reje

Aligned FASTA saved to alphafold_benchmark/F1_ATP_synthase/aligned.fasta


outputhat23=16
treein = 0
compacttree = 0
stacksize: 8192 kb
rescale = 1
All-to-all alignment.
tbfast-pair (aa) Version 7.526
alg=L, model=BLOSUM62, 2.00, -0.10, +0.10, noshift, amax=0.0
0 thread(s)

outputhat23=16
Loading 'hat3.seed' ... 
done.
Writing hat3 for iterative refinement
rescale = 1
Gap Penalty = -1.53, +0.00, +0.00
tbutree = 1, compacttree = 0
Constructing a UPGMA tree ... 
   20 / 31
done.

Progressive alignment ... 
STEP    30 /30 
done.
tbfast (aa) Version 7.526
alg=A, model=BLOSUM62, 1.53, -0.00, -0.00, noshift, amax=0.0
1 thread(s)

minimumweight = 0.000010
autosubalignment = 0.000000
nthread = 0
randomseed = 0
blosum 62 / kimura 200
poffset = 0
niter = 16
sueff_global = 0.100000
nadd = 16
Loading 'hat3' ... done.
rescale = 1

   20 / 31
Segment   1/  1    1-2078
STEP 008-015-0  rejected..   epted. rejected. identical.    rejected. accepted. rejected. rejected. rejected. rejected. rejected. rejected. rejected. rejected. rejected. rejected. rejected. rejected. rejected

Aligned FASTA saved to alphafold_benchmark/Hap2/aligned.fasta


STEP 008-014-0  rejected.
Converged.

done
dvtditr (aa) Version 7.526
alg=A, model=BLOSUM62, 1.53, -0.00, -0.00, noshift, amax=0.0
0 thread(s)


Strategy:
 L-INS-i (Probably most accurate, very slow)
 Iterative refinement method (<16) with LOCAL pairwise alignment information

If unsure which option to use, try 'mafft --auto input > output'.
For more information, see 'mafft --help', 'mafft --man' and the mafft page.

The default gap scoring scheme has been changed in version 7.110 (2013 Oct).
It tends to insert more gaps into gap-rich regions than previous versions.
To disable this change, add the --leavegappyregion option.



In [None]:

import os
import subprocess

def build_tree_ng(fasta_path, output_dir,  raxmlng_path='raxml-ng', model='LG+G+I', ancestral_states=True):
	# Run an RAxML-NG ML tree search on an existing alignment.
	# Note: RAxML-NG's --ancestral mode expects a fixed tree (--tree);
	# use build_states_ng below when a tree is already available.
	tree_prefix = os.path.join(output_dir, 'raxmlng')
	raxmlng_cmd = [
		raxmlng_path,
		'--msa', fasta_path,
		'--model', model,
		'--prefix', tree_prefix,
		'--seed', '12345'
	]
	if ancestral_states:
		raxmlng_cmd += ['--ancestral']
	subprocess.run(raxmlng_cmd, check=True)
	print(f"Alignment read from {fasta_path}")
	print(f"RAxML-NG output in {output_dir}")

def build_states_ng(fasta_path, treefile, output_dir,  raxmlng_path='raxml-ng', model='LG+G+I', ancestral_states=True):
	# Run RAxML-NG on a fixed tree, e.g. for ancestral state reconstruction
	tree_prefix = os.path.join(output_dir, 'raxmlng')
	raxmlng_cmd = [
		raxmlng_path,
		'--msa', fasta_path,
		'--model', model,
		'--prefix', tree_prefix,
		'--seed', '12345',
		'--tree', treefile,
	]
	if ancestral_states:
		raxmlng_cmd += ['--ancestral']
	subprocess.run(raxmlng_cmd, check=True)
	print(f"Alignment read from {fasta_path}")
	print(f"RAxML-NG output in {output_dir}")

In [None]:
#build tree for each term
for family_name in benchmark_families.keys():
	if BENCHMARK_SET == 'information_theory':
		term = benchmark_families[family_name]['search_term']
		termfolder = term.replace(' ', '_')
		fmt_term = term.replace(' ', '_')
		termfolder = os.path.join(benchmark_folder, termfolder)
	elif BENCHMARK_SET == 'marker_genes':
		termfolder = os.path.join(benchmark_folder, family_name)
		fmt_term = family_name
	
	fasta_file = os.path.join(termfolder, 'aligned.fasta')
	if not os.path.exists(fasta_file):
		print(f"FASTA file {fasta_file} does not exist. Skipping tree building.")
		continue
	output_dir = termfolder
	print( fasta_file, output_dir)
	raxmlng_path = '/home/dmoi/projects/foldtree2/raxmlng/raxml-ng'
	build_tree_ng(fasta_file, output_dir, raxmlng_path=raxmlng_path, model='LG+G+I', ancestral_states=False)
	# Check if the tree was built successfully (RAxML-NG writes <prefix>.raxml.bestTree)
	tree_file = os.path.join(output_dir, 'raxmlng.raxml.bestTree')
	if not os.path.exists(tree_file):
		print(f"RAxML-NG tree file {tree_file} does not exist. Tree building may have failed.")
		continue
	# Print the tree file path
	print(f"RAxML-NG tree file: {tree_file}")
	# Run ancestral reconstruction on the fixed ML tree
	build_states_ng(fasta_file, tree_file, output_dir, raxmlng_path=raxmlng_path, model='LG+G+I', ancestral_states=True)
	print(f"RAxML-NG tree and ancestral states built for {fmt_term} in {output_dir}")

alphafold_benchmark/rhodopsin/aligned.fasta alphafold_benchmark/rhodopsin

RAxML-NG v. 1.2.2-master released on 30.04.2024 by The Exelixis Lab.
Developed by: Alexey M. Kozlov and Alexandros Stamatakis.
Contributors: Diego Darriba, Tomas Flouri, Benoit Morel, Sarah Lutteropp, Ben Bettisworth, Julia Haag, Anastasis Togkousidis.
Latest version: https://github.com/amkozlov/raxml-ng
Questions/problems/suggestions? Please visit: https://groups.google.com/forum/#!forum/raxml

System: Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz, 16 cores, 251 GB RAM

RAxML-NG was called at 08-Jul-2025 10:41:26 as follows:

/home/dmoi/projects/foldtree2/raxmlng/raxml-ng --msa alphafold_benchmark/rhodopsin/aligned.fasta --model LG+G+I --prefix alphafold_benchmark/rhodopsin/raxmlng --seed 12345

Analysis options:
  run mode: ML tree search
  start tree(s): random (10) + parsimony (10)
  random seed: 12345
  tip-inner: OFF
  pattern compression: ON
  per-rate scalers: OFF
  site repeats: ON
  logLH epsilon: general

KeyboardInterrupt: 

## FoldTree2 Structure-Based Phylogenetics

Now we run FoldTree2 to build trees from structural information using the trained encoder model.

In [None]:
# Run FoldTree2 structure-based phylogenetics for each family
if tb is not None:
    for family_name in benchmark_families.keys():
        if BENCHMARK_SET == 'information_theory':
            term = benchmark_families[family_name]['search_term']
            termfolder = term.replace(' ', '_')
            fmt_term = term.replace(' ', '_')
            termfolder = os.path.join(benchmark_folder, termfolder)
        elif BENCHMARK_SET == 'marker_genes':
            termfolder = os.path.join(benchmark_folder, family_name)
            fmt_term = family_name
        
        # Check if cluster heads exist
        cluster_heads_folder = os.path.join(termfolder, 'cluster_heads')
        if not os.path.exists(cluster_heads_folder):
            print(f"Cluster heads folder {cluster_heads_folder} does not exist. Skipping FoldTree2 for {fmt_term}.")
            continue
        
        # Set up output directory for FoldTree2
        ft2_output_dir = os.path.join(termfolder, 'foldtree2_output')
        os.makedirs(ft2_output_dir, exist_ok=True)
        
        print(f"\n{'='*60}")
        print(f"Running FoldTree2 for {fmt_term}")
        print(f"{'='*60}")
        
        # Use FoldTree2 treebuilder to encode structures, align, and build tree
        try:
            ft2_results = tb.structs2tree(
                structs=os.path.join(cluster_heads_folder, '*.pdb'),
                outdir=ft2_output_dir,
                ancestral=True,  # Enable ancestral reconstruction
                raxml_iterations=20,
                raxml_path='./raxmlng/raxml-ng',
                output_prefix=os.path.join(ft2_output_dir, fmt_term),
                verbose=False
            )
            
            print(f"\n✓ FoldTree2 completed for {fmt_term}")
            print(f"  Encoded FASTA: {ft2_results['encoded_fasta']}")
            print(f"  Tree file: {ft2_results['tree']}")
            print(f"  Alignment: {ft2_results['alignment']}")
            if ft2_results['ancestral_fasta']:
                print(f"  Ancestral sequences (AA): {ft2_results['ancestral_fasta']}")
            
        except Exception as e:
            print(f"✗ Error running FoldTree2 for {fmt_term}: {e}")
            import traceback
            traceback.print_exc()
            continue
else:
    print("Skipping FoldTree2 - model not available.")

In [None]:
# Prepare AlphaFold runs for both AA and FoldTree2 ancestral reconstructions
from Bio import SeqIO

for family_name in benchmark_families.keys():
    if BENCHMARK_SET == 'information_theory':
        term = benchmark_families[family_name]['search_term']
        termfolder = term.replace(' ', '_')
        fmt_term = term.replace(' ', '_')
        termfolder = os.path.join(benchmark_folder, termfolder)
    elif BENCHMARK_SET == 'marker_genes':
        termfolder = os.path.join(benchmark_folder, family_name)
        fmt_term = family_name
    
    # Process AA-based ancestral sequences
    aa_ancestral_file = os.path.join(termfolder, 'raxmlng.raxml.ancestralStates')
    if os.path.exists(aa_ancestral_file):
        output_dir_aa = os.path.join(termfolder, 'alphafold_run_AA')
        os.makedirs(output_dir_aa, exist_ok=True)
        
        print(f"\nPreparing AA ancestral sequences for {fmt_term}")
        # Parse ancestral states file and create individual FASTAs
        try:
            with open(aa_ancestral_file, 'r') as f:
                lines = f.readlines()
                for line in lines:
                    if line.startswith('Node'):
                        parts = line.strip().split('\t')
                        if len(parts) >= 2:
                            node_id = parts[0].replace(' ', '_')
                            sequence = parts[1].replace('-', '')  # Remove gaps
                            if len(sequence) > 0:
                                fasta_file = os.path.join(output_dir_aa, f"{node_id}.fasta")
                                with open(fasta_file, 'w') as out_f:
                                    out_f.write(f">{node_id}\n{sequence}\n")
            print(f"  ✓ AA ancestral FASTAs saved to {output_dir_aa}")
        except Exception as e:
            print(f"  ✗ Error processing AA ancestral sequences: {e}")
    
    # Process FoldTree2-based ancestral sequences
    ft2_ancestral_file = os.path.join(termfolder, 'foldtree2_output', f'{fmt_term}_ancestral.fasta')
    if os.path.exists(ft2_ancestral_file):
        output_dir_ft2 = os.path.join(termfolder, 'alphafold_run_FT2')
        os.makedirs(output_dir_ft2, exist_ok=True)
        
        print(f"\nPreparing FoldTree2 ancestral sequences for {fmt_term}")
        # Split FoldTree2 ancestral FASTA into individual files
        try:
            for record in SeqIO.parse(ft2_ancestral_file, "fasta"):
                # Clean up sequence ID
                clean_id = record.id.replace(' ', '_').replace('/', '_')
                record.id = clean_id
                record.description = ""
                
                fasta_file = os.path.join(output_dir_ft2, f"{clean_id}.fasta")
                SeqIO.write(record, fasta_file, "fasta")
            print(f"  ✓ FoldTree2 ancestral FASTAs saved to {output_dir_ft2}")
        except Exception as e:
            print(f"  ✗ Error processing FoldTree2 ancestral sequences: {e}")

In [None]:
#get fastas from ancestral reconstruction
import os
import glob
from Bio import SeqIO

def extract_ancestral_sequences(fasta_file, output_folder):
	os.makedirs(output_folder, exist_ok=True)
	
	for record in SeqIO.parse(fasta_file, "fasta"):
		if 'ancestral' in record.id:
			output_file = os.path.join(output_folder, f"{record.id}.fasta")
			SeqIO.write(record, output_file, "fasta")
			print(f"Extracted ancestral sequence: {record.id} to {output_file}")
	
	print(f"Ancestral sequences extracted to {output_folder}")

In [None]:
#setup alphafold run

## Run FoldTree2 on a Protein Family

In [16]:
import os

# Example: Run FoldTree2 on a family directory
family_dir = './alphafold_benchmark/families/example_family/'  # Change to your family path
model_path = '../../models/your_trained_model'  # Path to your trained model (without .pkl)
mafftmat = model_path + '_mafftmat.mtx'
submat = model_path + '_submat.txt'
output_dir = os.path.join(family_dir, 'foldtree2_results')

os.makedirs(output_dir, exist_ok=True)

cmd = f"python ../../ft2treebuilder.py --model {model_path} --mafftmat {mafftmat} --submat {submat} --structures '{family_dir}/structs/*.pdb' --outdir {output_dir} --ancestral"
print('Run this command in your shell:')
print(cmd)
# Optionally, to run from the notebook (uncomment the next line):
# !{cmd}


Run this command in your shell:
python ../../ft2treebuilder.py --model ../../models/your_trained_model --mafftmat ../../models/your_trained_model_mafftmat.mtx --submat ../../models/your_trained_model_submat.txt --structures './alphafold_benchmark/families/example_family//structs/*.pdb' --outdir ./alphafold_benchmark/families/example_family/foldtree2_results --ancestral


In [17]:
#convert phylip files to fasta
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

def phylip_to_fasta(phylip_file, fasta_file):
	# Assumes sequential PHYLIP: a "<num_seqs> <seq_length>" header line,
	# then one "<id> <sequence>" line per sequence
	with open(phylip_file, 'r') as infile, open(fasta_file, 'w') as outfile:
		lines = infile.readlines()
		num_seqs = int(lines[0].split()[0])
		for i in range(1, num_seqs + 1):
			parts = lines[i].split()
			seq_id = parts[0]
			seq = ''.join(parts[1:])
			record = SeqRecord(Seq(seq), id=seq_id, description="")
			SeqIO.write(record, outfile, "fasta")

In [18]:
#convert the ancestral sequences to separate fastas for folding with alphafold

def split_fasta_by_id(fasta_file, output_dir):
    os.makedirs(output_dir, exist_ok=True)
    for record in tqdm.tqdm(SeqIO.parse(fasta_file, "fasta")):
        output_file = os.path.join(output_dir, f"{record.id}.fasta")
        SeqIO.write(record, output_file, "fasta")


for fam in fams:
    print(fam)
    ancestral_fasta = fam + '/ancestral.fasta'
    if not os.path.exists(ancestral_fasta):
        print(f"Skipping {fam} - ancestral fasta does not exist.")
        continue
    output_dir = fam + '/alphafold'
    split_fasta_by_id(ancestral_fasta, output_dir)
    print(f"Split ancestral sequences into separate FASTA files in {output_dir}")

In [19]:
# run AlphaFold separately on each ancestral sequence,
# using whichever method you prefer outside of this notebook
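
Because the folding step happens outside the notebook, it can help to check which ancestral sequences still lack a prediction before extracting pLDDT values. The sketch below is only an illustration: `pending_predictions` is a hypothetical helper, and it assumes each prediction for `<name>.fasta` lands in `<fasta_dir>/<name>/ranked_0.pdb`, which is the layout the pLDDT extraction cell globs for.

```python
import glob
import os


def pending_predictions(fasta_dir, model_name="ranked_0.pdb"):
    """Return FASTA paths in fasta_dir that have no predicted model yet.

    Assumption: the prediction for <name>.fasta is written to
    <fasta_dir>/<name>/<model_name>.
    """
    pending = []
    for fasta in sorted(glob.glob(os.path.join(fasta_dir, "*.fasta"))):
        name = os.path.splitext(os.path.basename(fasta))[0]
        if not os.path.exists(os.path.join(fasta_dir, name, model_name)):
            pending.append(fasta)
    return pending
```

Calling `pending_predictions(output_dir_aa)` after a batch of runs would list the FASTA files that still need folding.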

In [None]:
# Extract pLDDT values from AlphaFold predictions for both AA and FoldTree2 methods
import glob
import os

import tqdm
from Bio.PDB import PDBParser
from scipy.stats import describe

def ret_plddt_from_pdb(pdb_file):
    """Extract pLDDT values from a PDB file's B-factor column."""
    parser = PDBParser(QUIET=True)
    structure = parser.get_structure('protein', pdb_file)
    plddt_values = []
    for model in structure:
        for chain in model:
            for residue in chain:
                if 'CA' in residue:
                    # AlphaFold stores pLDDT in B-factor column
                    plddt = residue['CA'].get_bfactor()
                    if plddt is not None:
                        plddt_values.append(plddt)
    return plddt_values

# Dictionary to store results for both methods
resdf_aa = {}   # AA-based ancestral reconstruction
resdf_ft2 = {}  # FoldTree2-based ancestral reconstruction

for family_name in benchmark_families.keys():
    if BENCHMARK_SET == 'information_theory':
        term = benchmark_families[family_name]['search_term']
        termfolder = term.replace(' ', '_')
        termfolder = os.path.join(benchmark_folder, termfolder)
        fmt_term = term
    elif BENCHMARK_SET == 'marker_genes':
        termfolder = os.path.join(benchmark_folder, family_name)
        fmt_term = family_name
    
    # Process AA method results
    aa_alphafold_dir = os.path.join(termfolder, 'alphafold_run_AA')
    if os.path.exists(aa_alphafold_dir):
        print(f"\nProcessing AA AlphaFold results for {fmt_term}")
        best_models = glob.glob(os.path.join(aa_alphafold_dir, '*/ranked_0.pdb'))
        for model in tqdm.tqdm(best_models, desc=f"AA-{fmt_term}"):
            plddt_values = ret_plddt_from_pdb(model)
            if plddt_values:
                prot_id = os.path.basename(os.path.dirname(model))
                stats = describe(plddt_values)
                # Key by family and node: node names such as Node1 repeat across families
                resdf_aa[f"{fmt_term}|{prot_id}"] = {
                    'node': prot_id,
                    'mean': stats.mean,
                    'variance': stats.variance,
                    'skewness': stats.skewness,
                    'max': stats.minmax[1],
                    'min': stats.minmax[0],
                    'plddt': plddt_values,
                    'family': fmt_term,
                    'method': 'AA'
                }
    
    # Process FoldTree2 method results
    ft2_alphafold_dir = os.path.join(termfolder, 'alphafold_run_FT2')
    if os.path.exists(ft2_alphafold_dir):
        print(f"\nProcessing FoldTree2 AlphaFold results for {fmt_term}")
        best_models = glob.glob(os.path.join(ft2_alphafold_dir, '*/ranked_0.pdb'))
        for model in tqdm.tqdm(best_models, desc=f"FT2-{fmt_term}"):
            plddt_values = ret_plddt_from_pdb(model)
            if plddt_values:
                prot_id = os.path.basename(os.path.dirname(model))
                stats = describe(plddt_values)
                resdf_ft2[f"{fmt_term}|{prot_id}"] = {
                    'node': prot_id,
                    'mean': stats.mean,
                    'variance': stats.variance,
                    'skewness': stats.skewness,
                    'max': stats.minmax[1],
                    'min': stats.minmax[0],
                    'plddt': plddt_values,
                    'family': fmt_term,
                    'method': 'FoldTree2'
                }

print("\nSummary:")
print(f"  AA method: {len(resdf_aa)} predictions")
print(f"  FoldTree2 method: {len(resdf_ft2)} predictions")

In [21]:
from Bio import Phylo

def get_node_to_prot_mapping(tree_file):
    tree = Phylo.read(tree_file, "newick")
    node_to_prot = {}
    for clade in tree.find_clades(order='level'):
        if clade.name:
            node_to_prot[clade.name] = clade
    return node_to_prot

def normalize_tree_branch_lengths(tree):
    total_length = sum(clade.branch_length for clade in tree.find_clades() if clade.branch_length)
    if total_length == 0:
        return tree  # Avoid division by zero
    for clade in tree.find_clades():
        if clade.branch_length:
            clade.branch_length /= total_length
    return tree

def get_distance_to_root(tree, node_name):
    clade = None
    for c in tree.find_clades():
        if c.name == node_name:
            clade = c
            break
    if clade is None:
        raise ValueError(f"Node {node_name} not found in tree.")
    distance = 0.0
    while clade != tree.root:
        parent = tree.get_path(clade)[-2] if len(tree.get_path(clade)) > 1 else tree.root
        if clade.branch_length:
            distance += clade.branch_length
        clade = parent
    return distance

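# Assign distance to root for each AA-method node in resdf_aa.
# Internal node labels (Node1, ...) appear in RAxML-NG's <prefix>.raxml.ancestralTree,
# not in the bestTree file. FoldTree2 nodes in resdf_ft2 need the FoldTree2 tree
# instead (the path reported as ft2_results['tree']).
for prot, data in resdf_aa.items():
    fam = data['family']
    fam_dir = os.path.join(benchmark_folder, fam.replace(' ', '_'))
    tree_file = os.path.join(fam_dir, 'raxmlng.raxml.ancestralTree')  # adjust if tree filename differs
    if not os.path.exists(tree_file):
        print(f"Tree file not found for {fam}")
        data['distance_to_root'] = None
        continue
    tree = Phylo.read(tree_file, "newick")
    normalize_tree_branch_lengths(tree)
    # Use the stored node label when present; otherwise fall back to the dictionary key
    node_name = data.get('node', prot)
    try:
        distance = get_distance_to_root(tree, node_name)
    except Exception as e:
        print(f"Error for {prot} in {fam}: {e}")
        distance = None
    data['distance_to_root'] = distance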

In [None]:
# Visualize and compare pLDDT distributions for AA vs FoldTree2 methods

# Convert to DataFrames for easier analysis
df_aa = pd.DataFrame.from_dict(resdf_aa, orient='index')
df_ft2 = pd.DataFrame.from_dict(resdf_ft2, orient='index')

# Combine for comparative analysis
df_aa['method'] = 'AA'
df_ft2['method'] = 'FoldTree2'
df_combined = pd.concat([df_aa, df_ft2], ignore_index=True)

print("Combined dataset shape:", df_combined.shape)
print("\nSummary statistics by method:")
print(df_combined.groupby('method')['mean'].describe())

# Create comprehensive visualization
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# 1. Mean pLDDT comparison
ax1 = axes[0, 0]
if len(df_aa) > 0 and len(df_ft2) > 0:
    ax1.hist([df_aa['mean'], df_ft2['mean']], bins=30, label=['AA', 'FoldTree2'], alpha=0.7)
    ax1.set_xlabel('Mean pLDDT')
    ax1.set_ylabel('Count')
    ax1.set_title('Distribution of Mean pLDDT Scores')
    ax1.axvline(df_aa['mean'].mean(), color='C0', linestyle='--', label='AA mean')
    ax1.axvline(df_ft2['mean'].mean(), color='C1', linestyle='--', label='FT2 mean')
    # Build the legend last so the mean lines are included
    ax1.legend()
else:
    ax1.text(0.5, 0.5, 'Insufficient data', ha='center', va='center')

# 2. Bar chart of mean pLDDT by family and method
ax2 = axes[0, 1]
if len(df_combined) > 0 and 'family' in df_combined.columns:
    families = df_combined['family'].unique()
    positions = np.arange(len(families))
    width = 0.35
    
    means_aa = [df_combined[(df_combined['family'] == f) & (df_combined['method'] == 'AA')]['mean'].mean() 
                for f in families]
    means_ft2 = [df_combined[(df_combined['family'] == f) & (df_combined['method'] == 'FoldTree2')]['mean'].mean() 
                 for f in families]
    
    ax2.bar(positions - width/2, means_aa, width, label='AA', alpha=0.7)
    ax2.bar(positions + width/2, means_ft2, width, label='FoldTree2', alpha=0.7)
    ax2.set_xlabel('Protein Family')
    ax2.set_ylabel('Mean pLDDT')
    ax2.set_title('Mean pLDDT by Family and Method')
    ax2.set_xticks(positions)
    ax2.set_xticklabels([f[:15] for f in families], rotation=45, ha='right')
    ax2.legend()
else:
    ax2.text(0.5, 0.5, 'Insufficient data', ha='center', va='center')

# 3. Scatter plot comparing methods
ax3 = axes[1, 0]
if len(df_aa) > 0 and len(df_ft2) > 0:
    # Match proteins that appear in both methods
    aa_ids = set(df_aa.index)
    ft2_ids = set(df_ft2.index)
    common_ids = aa_ids.intersection(ft2_ids)
    
    if len(common_ids) > 0:
        aa_means = [df_aa.loc[pid, 'mean'] for pid in common_ids]
        ft2_means = [df_ft2.loc[pid, 'mean'] for pid in common_ids]
        
        ax3.scatter(aa_means, ft2_means, alpha=0.5)
        ax3.plot([0, 100], [0, 100], 'r--', label='y=x')
        ax3.set_xlabel('AA Method Mean pLDDT')
        ax3.set_ylabel('FoldTree2 Method Mean pLDDT')
        ax3.set_title(f'Method Comparison (n={len(common_ids)} common proteins)')
        ax3.legend()
        ax3.grid(True, alpha=0.3)
        
        # Add correlation coefficient
        from scipy.stats import pearsonr
        r, p = pearsonr(aa_means, ft2_means)
        ax3.text(0.05, 0.95, f'r = {r:.3f}\np = {p:.3e}', 
                transform=ax3.transAxes, va='top')
    else:
        ax3.text(0.5, 0.5, 'No common proteins', ha='center', va='center')
else:
    ax3.text(0.5, 0.5, 'Insufficient data', ha='center', va='center')

# 4. Statistical summary table
ax4 = axes[1, 1]
ax4.axis('tight')
ax4.axis('off')

if len(df_combined) > 0:
    # Create summary table
    summary_data = []
    for method in ['AA', 'FoldTree2']:
        method_df = df_combined[df_combined['method'] == method]
        if len(method_df) > 0:
            summary_data.append([
                method,
                len(method_df),
                f"{method_df['mean'].mean():.2f} ± {method_df['mean'].std():.2f}",
                f"{method_df['mean'].min():.2f}",
                f"{method_df['mean'].max():.2f}"
            ])
    
    table = ax4.table(cellText=summary_data,
                     colLabels=['Method', 'N', 'Mean ± SD', 'Min', 'Max'],
                     cellLoc='center',
                     loc='center')
    table.auto_set_font_size(False)
    table.set_fontsize(10)
    table.scale(1, 2)
    ax4.set_title('Summary Statistics', pad=20)

plt.tight_layout()
plt.savefig(os.path.join(benchmark_folder, 'method_comparison.png'), dpi=300, bbox_inches='tight')
plt.show()

# Statistical test comparing methods
if len(df_aa) > 0 and len(df_ft2) > 0:
    from scipy.stats import mannwhitneyu, ttest_ind
    
    print("\n" + "="*60)
    print("STATISTICAL COMPARISON")
    print("="*60)
    
    # T-test
    t_stat, t_pval = ttest_ind(df_aa['mean'], df_ft2['mean'])
    print(f"\nIndependent t-test:")
    print(f"  t-statistic: {t_stat:.4f}")
    print(f"  p-value: {t_pval:.4e}")
    
    # Mann-Whitney U test (non-parametric)
    u_stat, u_pval = mannwhitneyu(df_aa['mean'], df_ft2['mean'])
    print(f"\nMann-Whitney U test:")
    print(f"  U-statistic: {u_stat:.4f}")
    print(f"  p-value: {u_pval:.4e}")
    
    # Effect size (Cohen's d)
    pooled_std = np.sqrt((df_aa['mean'].std()**2 + df_ft2['mean'].std()**2) / 2)
    cohens_d = (df_aa['mean'].mean() - df_ft2['mean'].mean()) / pooled_std
    print(f"\nEffect size (Cohen's d): {cohens_d:.4f}")
    
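    # Conventional thresholds for |d|: 0.2 small, 0.5 medium, 0.8 large
    print("\nInterpretation:")
    if abs(cohens_d) < 0.2:
        print("  Negligible effect size")
    elif abs(cohens_d) < 0.5:
        print("  Small effect size")
    elif abs(cohens_d) < 0.8:
        print("  Medium effect size")
    else:
        print("  Large effect size")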
    
    # Save combined results
    df_combined.to_csv(os.path.join(benchmark_folder, 'plddt_results_combined.csv'))
    print(f"\nResults saved to {os.path.join(benchmark_folder, 'plddt_results_combined.csv')}")
else:
    print("\nInsufficient data for statistical comparison")

KeyError: ['distance_to_root']

In [None]:
#visualize the trees

#calculate the tcs values

#

In [None]:
#recover fasta from the ancestral sequences


## Benchmark Summary and Conclusions

This notebook compares two approaches to ancestral sequence reconstruction:

1. **Traditional AA-based method**: Uses amino acid sequences with MAFFT alignment and RAxML-NG
2. **FoldTree2 structure-based method**: Uses 3D structure encoding with custom substitution matrices

### Key Metrics:
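- **pLDDT scores**: AlphaFold's per-residue confidence scores (0-100), used here as a proxy for the overall quality of each ancestral reconstruction
- **Statistical comparison**: t-tests, Mann-Whitney U tests, and effect sizes (Cohen's d) quantify method differences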
- **Family-specific analysis**: Different protein families may benefit differently from structural information
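
As a rough illustration of the effect-size metric, the sketch below computes Cohen's d with a pooled standard deviation weighted by sample size, which matters when the two methods yield different numbers of predictions. The function names `cohens_d` and `effect_label` are hypothetical, and the 0.2 / 0.5 / 0.8 cut-offs are the conventional rule-of-thumb thresholds rather than anything specific to this benchmark.

```python
import numpy as np


def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled sample SD."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n1, n2 = len(a), len(b)
    # Unbiased sample variances, weighted by their degrees of freedom
    pooled_var = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)


def effect_label(d):
    """Map |d| to the conventional verbal label."""
    d = abs(d)
    if d < 0.2:
        return "negligible"
    if d < 0.5:
        return "small"
    if d < 0.8:
        return "medium"
    return "large"
```

For example, `effect_label(cohens_d(df_aa['mean'], df_ft2['mean']))` would give a one-word summary of the difference between methods.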

### Interpretation:
- Higher pLDDT scores suggest more confident/accurate structure predictions
- Comparison reveals whether structural encoding captures evolutionary information differently than sequence alone
- Information theory families (rhodopsin, RuBisCO, ATP synthase, HAP2) provide diverse test cases
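
One way to make "higher pLDDT" more concrete is to report the fraction of residues in each confidence band instead of a single mean. The sketch below assumes the bands commonly used when displaying AlphaFold models (above 90, 70 to 90, 50 to 70, and 50 or below); `plddt_band_fractions` is a hypothetical helper, and the handling of exact boundary values is an arbitrary choice made here.

```python
import numpy as np


def plddt_band_fractions(plddt_values):
    """Return the fraction of residues falling in each pLDDT confidence band."""
    v = np.asarray(plddt_values, dtype=float)
    if v.size == 0:
        return {}
    return {
        "very_high": float(np.mean(v > 90)),
        "confident": float(np.mean((v > 70) & (v <= 90))),
        "low": float(np.mean((v > 50) & (v <= 70))),
        "very_low": float(np.mean(v <= 50)),
    }
```

Applied to the per-residue `plddt` lists stored for each prediction, this separates a model that is uniformly mediocre from one with a well-predicted core and disordered termini.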

### Next Steps:
1. Run AlphaFold on prepared ancestral sequences
2. Analyze pLDDT distributions and structural features
3. Correlate with phylogenetic distance/node depth
4. Validate with known ancestral structures (if available)
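
Step 3 could be approached with a rank correlation between mean pLDDT and distance to root, computed separately for each method. The following is a minimal sketch under stated assumptions: `depth_correlation` is a hypothetical helper, and the column names `mean`, `distance_to_root`, and `method` are assumed to exist in the combined results table. Spearman's rho is used because the relationship need not be linear.

```python
import pandas as pd
from scipy.stats import spearmanr


def depth_correlation(df, value_col="mean", depth_col="distance_to_root",
                      group_col="method", min_n=3):
    """Spearman correlation of value_col against depth_col, per group."""
    rows = []
    for group, sub in df.groupby(group_col):
        # Drop nodes without a distance and force numeric dtype
        sub = sub[[value_col, depth_col]].dropna().astype(float)
        if len(sub) < min_n:
            continue  # too few points for a meaningful correlation
        rho, p = spearmanr(sub[depth_col], sub[value_col])
        rows.append({group_col: group, "n": len(sub), "rho": rho, "p": p})
    return pd.DataFrame(rows)
```

A negative rho for a method would mean that predictions become more confident for nodes closer to the root, and a positive rho the opposite.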