# ColabFold Structure Prediction - Psilocybin Enzymes

This notebook predicts 3D structures for psilocybin biosynthesis enzymes (PsiD, PsiK, PsiM, PsiH).

**Instructions:**
1. Go to Runtime → Change runtime type → Select **T4 GPU** (or A100 if available)
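2. Upload your FASTA files to the `input/` folder on Google Drive (the `INPUT_DIR` path set below)
3. Run all cells
4. Download results from the `output/` folder on Google Drive (the `OUTPUT_DIR` path set below)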

**Expected time:** ~2-3 min per sequence on T4, ~1 min on A100
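
The per-sequence figures above can be turned into a rough total before starting a run. The sketch below is illustrative only: `count_fasta_records` and `estimate_minutes` are hypothetical helpers, not part of ColabFold, and the estimate assumes every sequence costs about the same, which will not hold for unusually long sequences.

```python
def count_fasta_records(fasta_text: str) -> int:
    """Count sequences by counting header lines that start with '>'."""
    return sum(1 for line in fasta_text.splitlines() if line.startswith(">"))


def estimate_minutes(n_sequences: int, minutes_per_sequence: float) -> float:
    """Linear estimate: number of sequences times minutes per sequence."""
    return n_sequences * minutes_per_sequence


# Tiny made-up FASTA used only to exercise the helpers
example = ">seq1\nMKT\n>seq2\nMAA\nGGG\n"
n = count_fasta_records(example)

# 2-3 min per sequence is the T4 figure quoted above
low = estimate_minutes(n, 2.0)
high = estimate_minutes(n, 3.0)
print(f"{n} sequences: roughly {low:.0f}-{high:.0f} min on a T4")
```

In the notebook, the same helpers could be pointed at the text of a file in `INPUT_DIR` once Drive is mounted.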

## 1. Setup Environment

In [None]:
!curl ifconfig.me

34.136.87.8

In [None]:
#@title Install ColabFold
#@markdown This takes ~5 minutes on first run

import os
import sys

# Check GPU
!nvidia-smi

# Remove incompatible JAX CUDA plugin
!pip uninstall -y jax-cuda12-plugin jax-cuda12-pjrt 2>/dev/null || true

# Install ColabFold (without bundled JAX)
!pip install -q --no-warn-conflicts "colabfold[alphafold-minus-jax] @ git+https://github.com/sokrypton/ColabFold"

# Install compatible JAX with CUDA
!pip install "jax[cuda12_local]==0.4.35" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html

# Fix TensorFlow crash
!rm -f /usr/local/lib/python3.*/dist-packages/tensorflow/core/kernels/libtfkernel_sobol_op.so

# Install HH-suite (provides hhsearch, used for templates)
!apt-get update -qq && apt-get install -y hhsuite

# Download model weights
!python -m colabfold.download

# Verify installation
import jax
print(f"\nJAX devices: {jax.devices()}")
print("Setup complete!")


Fri Jan 16 01:30:38 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off |   00000000:00:04.0 Off |                    0 |
| N/A   31C    P0             45W /  400W |       0MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                

## 2. Mount Google Drive and Set Paths

In [None]:
#@title Mount Google Drive and set paths
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Your Google Drive paths
INPUT_DIR = "/content/drive/MyDrive/Psilobycin_Structure_Predictions/input"
OUTPUT_DIR = "/content/drive/MyDrive/Psilobycin_Structure_Predictions/output"

# Create output folders
!mkdir -p "$OUTPUT_DIR/PsiD"
!mkdir -p "$OUTPUT_DIR/PsiK"
!mkdir -p "$OUTPUT_DIR/PsiM"
!mkdir -p "$OUTPUT_DIR/PsiH"
!mkdir -p "$OUTPUT_DIR/ASR1_PsiD"

# Check input files
print("=== Input FASTA files ===")
!ls -la "$INPUT_DIR"

=== Input FASTA files ===
total 145
-rw------- 1 root root 33090 Jan 15 23:23 ASR1_PsiD_aln_nodes_CLEAN.faa
-rw------- 1 root root 33583 Jan 12 00:18 ASR1_PsiD_aln_nodes.faa
-rw------- 1 root root 22028 Jan 16 03:02 PsiD_sequences.faa
-rw------- 1 root root 26054 Jan 16 03:02 PsiH_sequences.faa
-rw------- 1 root root 19069 Jan 16 03:02 PsiK_sequences.faa
-rw------- 1 root root 12301 Jan 16 03:02 PsiM_sequences.faa


## 3. Run Structure Predictions

In [None]:
#@title Configuration

ENZYME = "PsiD"  #@param ["PsiD", "PsiK", "PsiM", "PsiH", "ASR1_PsiD", "ALL"]
NUM_MODELS = 5  #@param {type:"slider", min:1, max:5, step:1}
NUM_RECYCLES = 6  #@param {type:"slider", min:1, max:6, step:1}
USE_TEMPLATES = True  #@param {type:"boolean"}

print(f"Will predict: {ENZYME}")
print(f"Models per sequence: {NUM_MODELS}")
print(f"Recycles: {NUM_RECYCLES}")
print(f"Use templates: {USE_TEMPLATES}")

Will predict: PsiD
Models per sequence: 5
Recycles: 6
Use templates: True
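
The settings above map onto the `colabfold_batch` flags used in the prediction cell (`--num-models`, `--num-recycle`, `--templates`). As a sketch of that mapping, the hypothetical helper below assembles the command as an argument list, which avoids passing an empty string when templates are disabled. It only builds and prints the command; it does not run it, and the paths in the example are placeholders.

```python
import shlex


def build_colabfold_args(input_file, output_dir, num_models, num_recycles, use_templates):
    """Assemble a colabfold_batch command line from the notebook settings."""
    args = [
        "colabfold_batch",
        "--num-models", str(num_models),
        "--num-recycle", str(num_recycles),
    ]
    # Only add the flag when templates are requested
    if use_templates:
        args.append("--templates")
    args += [input_file, output_dir]
    return args


# Placeholder paths, for illustration only
cmd = build_colabfold_args("input/PsiD_sequences.faa", "output/PsiD", 5, 6, True)
print(shlex.join(cmd))
```

The resulting list could be handed to `subprocess.run(cmd, check=True)` instead of the `!` shell syntax, if a plain-Python call is preferred.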


In [None]:
#@title Run ColabFold predictions
#@markdown This is the main prediction cell; running all sequences takes several hours

import os
import time

# Define enzyme list based on selection
if ENZYME == "ALL":
    enzymes = ["PsiD", "PsiK", "PsiM", "PsiH", "ASR1_PsiD"]
else:
    enzymes = [ENZYME]

# Map enzyme names to input file names
input_file_map = {
    "PsiD": "PsiD_sequences.faa",
    "PsiK": "PsiK_sequences.faa",
    "PsiM": "PsiM_sequences.faa",
    "PsiH": "PsiH_sequences.faa",
    "ASR1_PsiD": "ASR1_PsiD_aln_nodes_CLEAN.faa"
}

template_flag = "--templates" if USE_TEMPLATES else ""

start_time = time.time()

for enzyme in enzymes:
    input_file = f"{INPUT_DIR}/{input_file_map.get(enzyme, enzyme + '_sequences.faa')}"
    output_dir = f"{OUTPUT_DIR}/{enzyme}"

    if not os.path.exists(input_file):
        print(f"WARNING: {input_file} not found, skipping...")
        continue

    print(f"\n{'='*50}")
    print(f"Processing: {enzyme}")
    print(f"{'='*50}\n")

    !colabfold_batch \
        --num-models {NUM_MODELS} \
        --num-recycle {NUM_RECYCLES} \
        {template_flag} \
        "{input_file}" \
        "{output_dir}"

elapsed = time.time() - start_time
print(f"\n\nTotal time: {elapsed/3600:.1f} hours")



Processing: PsiD

2026-01-16 03:20:43,562 Running colabfold 1.5.5 (83ee93d262a99ad62d6f0897c5ddd37eb918d385)
2026-01-16 03:20:49,078 Running on GPU
2026-01-16 03:20:49,242 Found 8 citations for tools or databases
2026-01-16 03:20:49,242 Query 1/51: Psilocybe_columbiana_ISOTYPE_NY-761607_PsiD___170_aa (length 170)
COMPLETE: 100% 150/150 [00:01<00:00, 75.35it/s] 
2026-01-16 03:21:02,756 Sequence 0 found templates: ['7cnz_G', '7cnz_A', '4qsh_D', '4qsh_A', '4qsh_C', '4qsk_A', '4qsh_B', '4qsh_A', '7cnz_A', '7cnz_G']
I0000 00:00:1768533674.661722   38492 mlir_graph_optimization_pass.cc:437] MLIR V1 optimization pass is not enabled
2026-01-16 03:21:15,574 Padding length to 180
2026-01-16 03:22:12,190 alphafold2_ptm_model_1_seed_000 recycle=0 pLDDT=84.9 pTM=0.804
2026-01-16 03:22:37,200 alphafold2_ptm_model_1_seed_000 recycle=1 pLDDT=82.6 pTM=0.791 tol=2.6
2026-01-16 03:22:39,411 alphafold2_ptm_model_1_seed_000 recycle=2 pLDDT=82.1 pTM=0.79 tol=1.03
2026-01-16 03:22:41,623 alphafold2_ptm_mode
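
The per-recycle lines in the log above carry the model name, recycle index, pLDDT, and pTM. A minimal sketch for pulling those values out of saved log text follows. The regular expression is an assumption based only on the lines shown here, so it may need adjusting for other ColabFold versions; `parse_recycle_lines` is a hypothetical helper.

```python
import re

# Pattern inferred from the log lines shown above (an assumption, not a ColabFold API)
LOG_PATTERN = re.compile(
    r"(?P<model>\S+) recycle=(?P<recycle>\d+) pLDDT=(?P<plddt>[\d.]+) pTM=(?P<ptm>[\d.]+)"
)


def parse_recycle_lines(log_text):
    """Return one dict per log line that reports recycle scores."""
    rows = []
    for line in log_text.splitlines():
        match = LOG_PATTERN.search(line)
        if match:
            rows.append({
                "model": match.group("model"),
                "recycle": int(match.group("recycle")),
                "plddt": float(match.group("plddt")),
                "ptm": float(match.group("ptm")),
            })
    return rows


# Three lines copied from the output above
sample_log = """\
2026-01-16 03:22:12,190 alphafold2_ptm_model_1_seed_000 recycle=0 pLDDT=84.9 pTM=0.804
2026-01-16 03:22:37,200 alphafold2_ptm_model_1_seed_000 recycle=1 pLDDT=82.6 pTM=0.791 tol=2.6
2026-01-16 03:22:39,411 alphafold2_ptm_model_1_seed_000 recycle=2 pLDDT=82.1 pTM=0.79 tol=1.03
"""

rows = parse_recycle_lines(sample_log)
for row in rows:
    print(row)
```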

## 4. Check Results

In [None]:
#@title Summary of completed predictions

import os

# Ensure OUTPUT_DIR is defined (using the path from your setup)
OUTPUT_DIR = "/content/drive/MyDrive/Psilobycin_Structure_Predictions/output"

print(f"Checking in: {OUTPUT_DIR}")
print("=== Completed Structures ===")
print()

for enzyme in ["PsiD", "PsiK", "PsiM", "PsiH", "ASR1_PsiD"]:
    # Full Drive path to this enzyme's results
    output_dir = os.path.join(OUTPUT_DIR, enzyme)

    if not os.path.exists(output_dir):
        print(f"{enzyme}: No output folder found at {output_dir}")
        continue

    # List all files
    all_files = os.listdir(output_dir)
    pdb_files = [f for f in all_files if f.endswith(".pdb")]

    # Simple count of unique sequences based on ColabFold naming convention
    # Assumes format like "SequenceName_unrelaxed..." or "SequenceName_relaxed..."
    unique_seqs = set(f.split("_unrelaxed")[0].split("_relaxed")[0] for f in pdb_files)

    print(f"{enzyme}: {len(unique_seqs)} unique sequences, {len(pdb_files)} PDB files")

Checking in: /content/drive/MyDrive/Psilobycin_Structure_Predictions/output
=== Completed Structures ===

PsiD: 53 unique sequences, 265 PDB files
PsiK: 24 unique sequences, 120 PDB files
PsiM: 0 unique sequences, 0 PDB files
PsiH: 0 unique sequences, 0 PDB files
ASR1_PsiD: 66 unique sequences, 330 PDB files


In [None]:
#@title Extract quality scores (pLDDT)

import os
import json
import pandas as pd

# Ensure OUTPUT_DIR is defined correctly
OUTPUT_DIR = "/content/drive/MyDrive/Psilobycin_Structure_Predictions/output"

results = []

print(f"Scanning directories in: {OUTPUT_DIR}")

for enzyme in ["PsiD", "PsiK", "PsiM", "PsiH", "ASR1_PsiD"]:
    # Full Drive path to this enzyme's results
    output_dir = os.path.join(OUTPUT_DIR, enzyme)

    if not os.path.exists(output_dir):
        print(f"Skipping {enzyme} (not found)")
        continue

    # Counter to verify it finds files
    found_count = 0

    for f in os.listdir(output_dir):
        # Match the best-ranked scores file (rank_001), regardless of model number
        if "_scores_rank_001" in f and f.endswith(".json"):
            found_count += 1

            # Sequence ID is everything before "_scores_rank_001",
            # whatever the length of the suffix that follows
            seq_id = f.split("_scores_rank_001")[0]

            try:
                with open(os.path.join(output_dir, f)) as fp:
                    scores = json.load(fp)

                plddt = scores.get("plddt", [])
                mean_plddt = sum(plddt) / len(plddt) if plddt else 0
                # The key is "ptm" in some versions and "pTM" in others
                ptm = scores.get("ptm", scores.get("pTM", 0))

                results.append({
                    "enzyme": enzyme,
                    "sequence": seq_id,
                    "pLDDT": round(mean_plddt, 1),
                    "pTM": round(ptm, 3)
                })
            except Exception as e:
                print(f"Error reading {f}: {e}")

    print(f"  {enzyme}: Found {found_count} best-rank structures")

df = pd.DataFrame(results)

if len(df) > 0:
    # Sort by enzyme, then by pLDDT (descending), for the saved summary
    df = df.sort_values(by=['enzyme', 'pLDDT'], ascending=[True, False])

    print("\n" + "="*40)
    print("TOP 10 STRUCTURES BY QUALITY")
    print("="*40)
    # Rank across all enzymes so the table shows the overall top 10
    print(df.nlargest(10, "pLDDT").to_string(index=False))

    # Save the CSV to Drive so it persists after the runtime is recycled
    save_path = os.path.join(OUTPUT_DIR, "quality_summary.csv")
    df.to_csv(save_path, index=False)
    print(f"\n\nFull summary saved to: {save_path}")
else:
    print("\nNo results found. Check your paths.")

Scanning directories in: /content/drive/MyDrive/Psilobycin_Structure_Predictions/output


KeyboardInterrupt: 
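
When reading the summary table, it can help to translate each mean pLDDT into the confidence bands commonly used for AlphaFold models (above 90, 70 to 90, 50 to 70, below 50). The sketch below is a reading aid under that assumption. The bands are defined per residue, so applying them to a per-sequence mean is a coarse simplification, and `mean_plddt` and `plddt_band` are hypothetical helpers.

```python
def mean_plddt(scores: dict) -> float:
    """Average the per-residue pLDDT list from a scores dict (0.0 if missing)."""
    values = scores.get("plddt", [])
    return sum(values) / len(values) if values else 0.0


def plddt_band(value: float) -> str:
    """Map a pLDDT value to a coarse confidence label."""
    if value > 90:
        return "very high"
    if value > 70:
        return "confident"
    if value > 50:
        return "low"
    return "very low"


# Made-up per-residue values, for illustration only
example_scores = {"plddt": [80.0, 90.0, 70.0]}
avg = mean_plddt(example_scores)
print(f"mean pLDDT {avg:.1f} -> {plddt_band(avg)}")
```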

## 5. Download Results

In [None]:
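#@title Zip and download all results

# Results live on Drive (OUTPUT_DIR), not in a local output/ folder
!cd "$OUTPUT_DIR" && zip -r -q /content/psilocybin_structures.zip .

from google.colab import files
files.download("/content/psilocybin_structures.zip")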

In [None]:
#@title Copy results to another Drive folder
#@markdown Optional: predictions are already written to `OUTPUT_DIR` on Drive; this makes a second copy

DRIVE_OUTPUT = "/content/drive/MyDrive/psilocybin_structures/output"  #@param {type:"string"}

!mkdir -p "{DRIVE_OUTPUT}"
!cp -r "{OUTPUT_DIR}/." "{DRIVE_OUTPUT}/"
print(f"Results copied to {DRIVE_OUTPUT}")

---
## Notes

- **Runtime disconnects:** Colab may disconnect after ~12 hours. Results already written to `OUTPUT_DIR` on Drive are kept; anything stored only on the local runtime disk is lost.
- **GPU limits:** Free tier has usage limits. Consider Colab Pro for longer runs.
- **Resume:** ColabFold skips completed sequences, so you can restart and continue.
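
Before restarting an interrupted run, it can be useful to see which sequences still lack results. The sketch below compares FASTA headers against PDB file names, using the same `_unrelaxed` / `_relaxed` naming assumption as the summary cell above. The helpers are hypothetical, and ColabFold may alter header text when it builds file names, so the IDs may need the same normalization before they match.

```python
def fasta_ids(fasta_text):
    """Return the first whitespace-delimited token of each FASTA header."""
    ids = []
    for line in fasta_text.splitlines():
        if line.startswith(">") and line[1:].strip():
            ids.append(line[1:].split()[0])
    return ids


def completed_ids(filenames):
    """Sequence IDs that already have at least one PDB file."""
    done = set()
    for name in filenames:
        if name.endswith(".pdb"):
            done.add(name.split("_unrelaxed")[0].split("_relaxed")[0])
    return done


def pending_ids(fasta_text, filenames):
    """FASTA IDs with no matching PDB file, in input order."""
    done = completed_ids(filenames)
    return [seq_id for seq_id in fasta_ids(fasta_text) if seq_id not in done]


# Made-up inputs, for illustration only
example_fasta = ">seqA\nMKT\n>seqB\nMAA\n"
example_files = [
    "seqA_unrelaxed_rank_001_alphafold2_ptm_model_1_seed_000.pdb",
    "seqA_scores_rank_001_alphafold2_ptm_model_1_seed_000.json",
]
print(pending_ids(example_fasta, example_files))
```

In the notebook, `filenames` could come from `os.listdir` on an enzyme's folder under `OUTPUT_DIR`.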