# Ancestral Sequence Reconstruction - Psilocybin BGC

**Project:** Analysis of psilocybin biosynthetic gene cluster (BGC) evolution in Psilocybe mushrooms

**Based on:** Bradshaw et al. (2024) - Phylogenomics of the psychoactive mushroom genus Psilocybe

**Objectives:**
1. Extract and align PsiD, PsiK, PsiM, PsiH sequences from Psilocybe species
2. Reconstruct phylogenetic trees for each gene
3. Perform ancestral sequence reconstruction (ASR)
4. Analyze evolutionary patterns and gene cluster organization

---

## Setup and Imports

In [None]:
# Standard libraries
import os
import sys
from pathlib import Path
import subprocess

# Data manipulation
import pandas as pd
import numpy as np

# Bioinformatics
from Bio import SeqIO, Seq, AlignIO, Phylo
# MAFFT is called through subprocess below, so the deprecated
# Bio.Align.Applications command-line wrappers are not imported.

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Set plotting style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 8)

# Set paths
PROJECT_DIR = Path.cwd().parent
DATA_DIR = PROJECT_DIR / "03_Sequences_Paper"
RESULTS_DIR = Path.cwd() / "results"
RESULTS_DIR.mkdir(exist_ok=True)

print(f"Project directory: {PROJECT_DIR}")
print(f"Data directory: {DATA_DIR}")
print(f"Results directory: {RESULTS_DIR}")
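
The later sections shell out to `mafft` and `iqtree`, so it can help to confirm that both are on `PATH` before running them. The following is a minimal sketch using only the standard library. The helper name `find_tools` and the candidate executable names (`iqtree2` as an alternative to `iqtree`) are assumptions, not part of the original notebook.

```python
import shutil

def find_tools(candidates):
    """Map each tool label to the first matching executable on PATH (or None).

    `candidates` maps a label to a list of executable names to try in order.
    """
    found = {}
    for label, names in candidates.items():
        found[label] = next(
            (path for path in map(shutil.which, names) if path), None
        )
    return found

# Hypothetical executable names; adjust to match the local installation.
tools = find_tools({"mafft": ["mafft"], "iqtree": ["iqtree2", "iqtree"]})
for label, path in tools.items():
    print(f"{label}: {path or 'NOT FOUND'}")
```

If a tool is reported as not found, the corresponding `run_*` function below would fail when it tries to launch the executable.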

## 1. Data Exploration

Let's explore the available sequence data from the Bradshaw et al. (2024) paper.

In [None]:
# List available data directories (guarding against a missing data folder)
if DATA_DIR.exists():
    data_subdirs = sorted(d for d in DATA_DIR.iterdir() if d.is_dir())
    print("Available data directories:")
    for d in data_subdirs:
        print(f"  - {d.name}")
        # Glob once, then show only the first few files in each directory
        files = sorted(d.glob("*"))
        for f in files[:5]:
            print(f"      {f.name}")
        if len(files) > 5:
            print(f"      ... and {len(files) - 5} more files")
else:
    data_subdirs = []
    print(f"Data directory not found at {DATA_DIR}")

## 2. Load PsiD Protein Sequences

PsiD (tryptophan decarboxylase) is one of the four core genes in the psilocybin BGC.

In [None]:
# Find PsiD sequences
psid_dir = DATA_DIR / "Proteins_PsiD"

if psid_dir.exists():
    psid_files = list(psid_dir.glob("*.fa*"))
    print(f"Found {len(psid_files)} PsiD sequence files")
    
    # Display first few files
    for f in psid_files[:10]:
        print(f"  - {f.name}")
else:
    print(f"PsiD directory not found at {psid_dir}")
    print("Available directories:")
    for d in DATA_DIR.iterdir():
        print(f"  - {d.name}")
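
The MAFFT step in the next section expects a single multi-FASTA file, while the data directory may hold many separate files. The following is a minimal sketch of merging them with a small standard-library FASTA reader (Biopython's `SeqIO.parse` could be used instead). The function names, the rule of keeping only the first record per sequence ID, and the output name `combined_PsiD.fasta` are assumptions for illustration.

```python
from pathlib import Path

def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def combine_fasta(fasta_files, output_fasta):
    """Write all records to one file, skipping repeated sequence IDs.

    Returns the number of records written.
    """
    seen = set()
    with open(output_fasta, "w") as out:
        for path in sorted(fasta_files):
            for header, seq in read_fasta(path):
                seq_id = header.split()[0] if header else ""
                if seq_id in seen:
                    continue  # keep only the first record for each ID
                seen.add(seq_id)
                out.write(f">{header}\n{seq}\n")
    return len(seen)

# Example usage (hypothetical output name; uncomment when ready):
# n_records = combine_fasta(psid_files, Path("results") / "combined_PsiD.fasta")
```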

## 3. Sequence Alignment with MAFFT

We'll use MAFFT for multiple sequence alignment of PsiD proteins.

In [None]:
def run_mafft(input_fasta, output_fasta, algorithm="auto"):
    """
    Run MAFFT multiple sequence alignment.
    
    Parameters:
    -----------
    input_fasta : str or Path
        Input unaligned FASTA file
    output_fasta : str or Path
        Output aligned FASTA file
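    algorithm : str
        MAFFT strategy: 'auto', 'linsi', 'ginsi', or 'einsi'
    """
    # L-INS-i, G-INS-i and E-INS-i are selected with the flag combinations
    # below; "--linsi", "--ginsi" and "--einsi" are not valid MAFFT options.
    strategy_flags = {
        "auto": ["--auto"],
        "linsi": ["--localpair", "--maxiterate", "1000"],
        "ginsi": ["--globalpair", "--maxiterate", "1000"],
        "einsi": ["--ep", "0", "--genafpair", "--maxiterate", "1000"],
    }
    if algorithm not in strategy_flags:
        raise ValueError(f"Unknown MAFFT algorithm: {algorithm}")
    cmd = ["mafft", *strategy_flags[algorithm], "--reorder", str(input_fasta)]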
    
    print(f"Running MAFFT: {' '.join(cmd)}")
    
    with open(output_fasta, 'w') as outfile:
        result = subprocess.run(cmd, stdout=outfile, stderr=subprocess.PIPE, text=True)
    
    if result.returncode == 0:
        print(f"✓ Alignment complete: {output_fasta}")
        return True
    else:
        print(f"✗ MAFFT failed: {result.stderr}")
        return False

# Example usage (uncomment when ready):
# run_mafft(
#     input_fasta="combined_PsiD.fasta",
#     output_fasta=RESULTS_DIR / "PsiD_aligned.fasta",
#     algorithm="auto"
# )
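
Once an alignment exists, a quick look at how gappy each column is can flag poorly aligned regions before tree building. The sketch below computes per-column gap fractions and drops columns above a threshold. The function names and the 0.5 cutoff are illustrative assumptions, and the toy sequences are made up.

```python
def column_gap_fractions(aligned):
    """Return the fraction of gap characters ('-') in each alignment column.

    `aligned` maps sequence IDs to aligned sequences of equal length.
    """
    seqs = list(aligned.values())
    if not seqs:
        return []
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("All aligned sequences must have the same length")
    return [sum(s[i] == "-" for s in seqs) / len(seqs) for i in range(length)]

def drop_gappy_columns(aligned, max_gap_fraction=0.5):
    """Remove columns whose gap fraction exceeds `max_gap_fraction`."""
    fractions = column_gap_fractions(aligned)
    keep = [i for i, frac in enumerate(fractions) if frac <= max_gap_fraction]
    return {name: "".join(seq[i] for i in keep) for name, seq in aligned.items()}

# Toy alignment (made-up sequences, not real PsiD data)
toy = {"sp1": "MK-TL", "sp2": "MK-SL", "sp3": "M-ATL"}
print(column_gap_fractions(toy))
print(drop_gappy_columns(toy))
```

In the toy alignment only the third column is gapped in more than half of the sequences, so it is the only column removed.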

## 4. Phylogenetic Tree Construction with IQ-TREE

IQ-TREE will be used to construct maximum-likelihood phylogenetic trees.

In [None]:
def run_iqtree(alignment_file, output_prefix, bootstrap=1000, model="MFP"):
    """
    Run IQ-TREE for phylogenetic analysis.
    
    Parameters:
    -----------
    alignment_file : str or Path
        Input alignment file
    output_prefix : str or Path
        Prefix for output files
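    bootstrap : int
        Number of ultrafast bootstrap replicates (IQ-TREE requires >= 1000)
    model : str
        Substitution model (MFP = ModelFinder Plus for automatic selection)
    """
    # The flags below use the IQ-TREE 1.x spellings, which IQ-TREE 2 still
    # accepts; with IQ-TREE 2 the executable is typically named "iqtree2".
    cmd = [
        "iqtree",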
        "-s", str(alignment_file),
        "-pre", str(output_prefix),
        "-m", model,
        "-bb", str(bootstrap),
        "-nt", "AUTO"
    ]
    
    print(f"Running IQ-TREE: {' '.join(cmd)}")
    
    result = subprocess.run(cmd, capture_output=True, text=True)
    
    if result.returncode == 0:
        print(f"✓ IQ-TREE analysis complete")
        print(f"  Tree file: {output_prefix}.treefile")
        return True
    else:
        print(f"✗ IQ-TREE failed: {result.stderr}")
        return False

# Example usage (uncomment when ready):
# run_iqtree(
#     alignment_file=RESULTS_DIR / "PsiD_aligned.fasta",
#     output_prefix=RESULTS_DIR / "PsiD_tree",
#     bootstrap=1000
# )

## 5. Visualize Phylogenetic Tree

In [None]:
def visualize_tree(tree_file, title="Phylogenetic Tree"):
    """
    Visualize a phylogenetic tree using Biopython.
    
    Parameters:
    -----------
    tree_file : str or Path
        Path to tree file (Newick format)
    title : str
        Title for the plot
    """
    tree = Phylo.read(tree_file, "newick")
    
    fig, ax = plt.subplots(figsize=(15, 10))
    Phylo.draw(tree, axes=ax, do_show=False)
    ax.set_title(title, fontsize=16, fontweight='bold')
    plt.tight_layout()
    plt.show()
    
    return tree

# Example usage (uncomment when ready):
# tree = visualize_tree(
#     tree_file=RESULTS_DIR / "PsiD_tree.treefile",
#     title="PsiD Phylogenetic Tree"
# )
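
Before interpreting the drawn tree, it is worth checking how well supported its branches are. As a rough sketch, the snippet below pulls internal-node labels out of a Newick string with a regular expression and summarises them. It assumes support values are written directly after each closing parenthesis (as in `)95:0.01`), which should be checked against the actual tree file. The threshold of 95, the function names, and the toy tree are illustrative.

```python
import re

def support_values(newick):
    """Extract numeric internal-node labels that follow a ')' in a Newick string."""
    return [float(v) for v in re.findall(r"\)(\d+(?:\.\d+)?)", newick)]

def summarize_support(newick, threshold=95.0):
    """Return (number of labelled nodes, fraction with support >= threshold)."""
    values = support_values(newick)
    if not values:
        return 0, 0.0
    strong = sum(v >= threshold for v in values)
    return len(values), strong / len(values)

# Toy tree with made-up taxa and support values
toy_tree = "((A:0.1,B:0.2)98:0.05,(C:0.1,(D:0.1,E:0.1)72:0.02)100:0.03);"
print(support_values(toy_tree))
print(summarize_support(toy_tree))
```

For a real run, the Newick text could be read from the `.treefile` written by IQ-TREE and passed to `summarize_support`.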

## 6. Ancestral Sequence Reconstruction

ASR is not implemented yet. The cell below lists candidate tools for inferring the sequences at the internal nodes of the gene trees built in section 4.

In [None]:
# Placeholder for ASR implementation
# Will use tools like:
# - PAML (codeml)
# - IQ-TREE (with -asr option)
# - MEGA

print("Ancestral Sequence Reconstruction methods to be implemented...")
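
One of the options listed above, IQ-TREE's `-asr` flag, writes per-site ancestral state probabilities to a `.state` file. The sketch below assembles such a command and parses state-table text into one sequence per internal node. It assumes the `.state` file is tab-separated with `Node`, `Site`, `State` and `p_*` columns plus `#` comment lines, and that `-te` is used to fix the tree topology; both should be verified against the installed IQ-TREE version. The function names and the toy table are hypothetical.

```python
import csv
import io

def build_asr_command(alignment_file, tree_file, output_prefix, model):
    """Assemble (but do not run) an IQ-TREE ancestral reconstruction command.

    `model` should be the substitution model chosen during tree construction.
    """
    return ["iqtree", "-s", str(alignment_file), "-te", str(tree_file),
            "-m", model, "-asr", "-pre", str(output_prefix)]

def parse_state_table(text):
    """Parse .state-style text into {node: (sequence, mean max posterior)}."""
    lines = [ln for ln in text.splitlines()
             if ln.strip() and not ln.startswith("#")]
    reader = csv.DictReader(io.StringIO("\n".join(lines)), delimiter="\t")
    per_node = {}
    for row in reader:
        # Posterior probability of the most likely state at this site
        probs = [float(v) for k, v in row.items() if k.startswith("p_")]
        per_node.setdefault(row["Node"], []).append(
            (int(row["Site"]), row["State"], max(probs))
        )
    result = {}
    for node, sites in per_node.items():
        sites.sort()  # order residues by alignment position
        seq = "".join(state for _, state, _ in sites)
        mean_pp = sum(p for _, _, p in sites) / len(sites)
        result[node] = (seq, mean_pp)
    return result

# Toy table with a reduced two-letter alphabet (made-up values)
toy_state = (
    "# Ancestral states (toy example)\n"
    "Node\tSite\tState\tp_A\tp_M\n"
    "Node1\t1\tM\t0.10\t0.90\n"
    "Node1\t2\tA\t0.80\t0.20\n"
    "Node2\t1\tM\t0.40\t0.60\n"
)
for node, (seq, mean_pp) in parse_state_table(toy_state).items():
    print(node, seq, round(mean_pp, 2))
```

The mean posterior per node gives a coarse indication of how confidently each ancestor is reconstructed; sites with low maximum posterior would need closer inspection before any functional interpretation.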

## 7. Analysis of Gene Cluster Organization

Compare the two distinct BGC gene orders reported by Bradshaw et al. (2024) for Clade I and Clade II.

In [None]:
# BGC gene orders from the paper:
clade_I_order = ["PsiD", "PsiK", "PsiH", "PsiM"]
clade_II_order = ["PsiD", "PsiM", "PsiH", "PsiK"]

print("Psilocybin BGC Gene Cluster Organization:")
print(f"  Clade I:  {' > '.join(clade_I_order)}")
print(f"  Clade II: {' > '.join(clade_II_order)}")

# Visualization of gene orders
fig, axes = plt.subplots(2, 1, figsize=(12, 4))

colors = {'PsiD': 'red', 'PsiK': 'green', 'PsiM': 'gold', 'PsiH': 'purple'}
clade_orders = {'Clade I': clade_I_order, 'Clade II': clade_II_order}

# Draw one row of labelled boxes per clade
for ax, (clade, order) in zip(axes, clade_orders.items()):
    for i, gene in enumerate(order):
        ax.add_patch(plt.Rectangle((i, 0), 0.8, 0.5, facecolor=colors[gene],
                                   edgecolor='black', linewidth=2))
        ax.text(i + 0.4, 0.25, gene, ha='center', va='center',
                fontsize=12, fontweight='bold')
    ax.set_xlim(-0.5, len(order))
    ax.set_ylim(-0.2, 1)
    ax.set_title(f'{clade} BGC Organization', fontsize=14, fontweight='bold')
    ax.axis('off')

plt.tight_layout()
plt.show()
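
Beyond the picture, the difference between the two orders can be quantified. The following is a minimal sketch using the two gene orders as listed in this notebook: it reports the positions where the clades differ and the neighbouring gene pairs that are preserved regardless of direction. The helper names are illustrative.

```python
clade_I_order = ["PsiD", "PsiK", "PsiH", "PsiM"]
clade_II_order = ["PsiD", "PsiM", "PsiH", "PsiK"]

def adjacencies(order):
    """Return the set of unordered neighbouring gene pairs in a gene order."""
    return {frozenset(pair) for pair in zip(order, order[1:])}

def compare_orders(order_a, order_b):
    """Report positions that differ and adjacencies shared by both orders."""
    differing = [i for i, (a, b) in enumerate(zip(order_a, order_b)) if a != b]
    shared = adjacencies(order_a) & adjacencies(order_b)
    return differing, sorted(tuple(sorted(pair)) for pair in shared)

differing, shared = compare_orders(clade_I_order, clade_II_order)
print(f"Positions that differ (0-based): {differing}")
print(f"Shared adjacencies: {shared}")
```

With the orders as listed here, PsiK and PsiM exchange places while PsiH keeps both of its neighbours, which is consistent with the PsiK-PsiH-PsiM block being reversed relative to PsiD. Gene orientation is not considered in this sketch.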

## 8. Summary and Next Steps

**Scaffolded in this notebook (tick once run):**
- [ ] Data exploration
- [ ] Sequence alignment (MAFFT)
- [ ] Phylogenetic tree construction (IQ-TREE)
- [ ] Tree visualization
- [ ] BGC organization analysis

**To Do:**
- [ ] Implement ancestral sequence reconstruction
- [ ] Analyze all four BGC genes (PsiD, PsiK, PsiM, PsiH)
- [ ] Divergence time estimation
- [ ] Functional analysis of reconstructed ancestral sequences

---

## References

1. Bradshaw et al. (2024). Phylogenomics of the psychoactive mushroom genus Psilocybe and evolution of the psilocybin biosynthetic gene cluster. *PNAS* 121(3):e2311245121.

2. Fricke et al. (2017). Enzymatic synthesis of psilocybin. *Angew. Chem. Int. Ed.* 56:12352–12355.