<a href="https://colab.research.google.com/github/JKourelis/Colab_Boltz-2/blob/main/Boltz_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="https://raw.githubusercontent.com/jwohlwend/boltz/main/docs/boltz2_title.png" height="200" align="right" style="height:240px">

## Boltz-2: Democratizing Biomolecular Interaction Modeling

Easy-to-use protein structure and binding affinity prediction with [Boltz-2](https://doi.org/10.1101/2025.06.14.659707). Boltz-2 is a biomolecular foundation model that jointly models complex structures and binding affinities. Its structure predictions approach [AlphaFold3](https://www.nature.com/articles/s41586-024-07487-w) accuracy, and its affinity predictions approach the accuracy of physics-based free energy perturbation (FEP) methods while running about 1000x faster.

**Key Features:**
- **Structure Prediction**: Protein, DNA, RNA, and ligand complexes with accuracy approaching AlphaFold3
- **Binding Affinity**: First deep learning model to approach the accuracy of free energy perturbation (FEP) for drug discovery
- **Open Source**: MIT license for academic and commercial use
- **Fast**: Affinity prediction runs about 1000x faster than physics-based methods such as FEP

**Usage Options:**
1. **Manual Input**: Enter sequences directly in the configuration boxes below
2. **FASTA Upload**: Upload FASTA files for batch processing

**Repository:**
- [Boltz-2 Colab Repository](https://github.com/JKourelis/Colab_Boltz-2)

**Citations:**

[Wohlwend J, Corso G, Passaro S, et al. Boltz-1: Democratizing Biomolecular Interaction Modeling. *bioRxiv*, 2024](https://doi.org/10.1101/2024.11.19.624167)

[Passaro S, Corso G, Wohlwend J, et al. Boltz-2: Towards Accurate and Efficient Binding Affinity Prediction. *bioRxiv*, 2025](https://doi.org/10.1101/2025.06.14.659707)

If using automatic MSA generation: [Mirdita M, Schütze K, Moriwaki Y, et al. ColabFold: making protein folding accessible to all. *Nature Methods*, 2022](https://doi.org/10.1038/s41592-022-01488-1)

In [None]:
#@title Cell 1: Install Boltz-2 with cuEquivariance Kernel Test
%%time
import subprocess
import sys
import os
import re

# Restart marker to handle Colab Feb 2025 NumPy issue
restart_marker = "/content/.boltz_numpy_restart"
is_post_restart = os.path.exists(restart_marker)

def run_cmd(cmd, desc):
    """Execute command with output suppression unless error"""
    print(f"[{desc}]")
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    if result.returncode != 0:
        print(f"FAILED: {result.stderr[:300]}")
        return False
    print("OK")
    return True

def get_cuda_version():
    """Detect CUDA version from nvidia-smi"""
    try:
        result = subprocess.run(['nvidia-smi'], capture_output=True, text=True)
        if result.returncode == 0:
            match = re.search(r'CUDA Version: (\d+\.\d+)', result.stdout)
            if match:
                version = match.group(1)
                major = int(version.split('.')[0])
                minor = int(version.split('.')[1])
                return major, minor, version
    except Exception as e:
        print(f"⚠️  Could not detect CUDA version: {e}")
    return None, None, None

def test_cuequivariance_kernels():
    """Test if cuEquivariance triangle kernels are available"""
    print("\n" + "=" * 60)
    print("CUEQUIVARIANCE KERNEL PREFLIGHT TEST")
    print("=" * 60)

    try:
        import torch
        print(f"✅ PyTorch: {torch.__version__}")
        print(f"✅ CUDA available: {torch.cuda.is_available()}")
        if torch.cuda.is_available():
            print(f"✅ CUDA version: {torch.version.cuda}")
            print(f"✅ GPU: {torch.cuda.get_device_name(0)}")
    except Exception as e:
        print(f"❌ PyTorch check failed: {e}")
        return False

    # Test cuequivariance-torch import
    try:
        import cuequivariance_torch
        print(f"✅ cuequivariance-torch installed")
    except ImportError as e:
        print(f"⚠️  cuequivariance-torch not found: {e}")
        return False

    # Test cuequivariance-ops-torch-cu12 import
    try:
        import cuequivariance_ops_torch
        print(f"✅ cuequivariance-ops-torch-cu12 installed")
    except ImportError as e:
        print(f"⚠️  cuequivariance-ops-torch-cu12 not found: {e}")
        return False

    # CRITICAL TEST: triangle_multiplicative_update
    try:
        from cuequivariance_ops_torch.triangle import triangle_multiplicative_update
        print(f"✅ triangle_multiplicative_update import: SUCCESS")

        if callable(triangle_multiplicative_update):
            print(f"✅ triangle_multiplicative_update is callable")
        else:
            print(f"❌ triangle_multiplicative_update exists but is not callable")
            return False

    except ImportError as e:
        print(f"❌ triangle_multiplicative_update import FAILED: {e}")
        print(f"   This error requires --no_kernels flag")
        return False
    except Exception as e:
        print(f"❌ Unexpected error testing triangle kernels: {e}")
        return False

    return True

def install_boltz():
    """Install Boltz-2 and dependencies with NumPy compatibility fix"""

    # Detect CUDA version
    cuda_major, cuda_minor, cuda_version = get_cuda_version()

    if cuda_major is None:
        print("❌ Could not detect CUDA version")
        return False

    print(f"✅ Detected CUDA {cuda_version}")

    # Determine PyTorch version and index URL based on CUDA
    if cuda_major == 12:
        if cuda_minor >= 4:
            pytorch_version = "2.6.0"
            pytorch_cuda = "cu124"
            index_url = "https://download.pytorch.org/whl/cu124"
        else:
            pytorch_version = "2.6.0"
            pytorch_cuda = "cu121"
            index_url = "https://download.pytorch.org/whl/cu121"
    elif cuda_major == 11:
        pytorch_version = "2.6.0"
        pytorch_cuda = "cu118"
        index_url = "https://download.pytorch.org/whl/cu118"
    else:
        print(f"⚠️  Unsupported CUDA version: {cuda_version}")
        print("   Attempting with CUDA 12.4 packages")
        pytorch_version = "2.6.0"
        pytorch_cuda = "cu124"
        index_url = "https://download.pytorch.org/whl/cu124"

    print(f"📦 Will install PyTorch {pytorch_version} ({pytorch_cuda})")

    # Nuclear cleanup
    print("\n[Uninstalling conflicting packages]")
    subprocess.run(
        f"{sys.executable} -m pip uninstall -y numpy pandas scipy torch torchvision torchaudio pytorch-lightning torchmetrics boltz",
        shell=True, capture_output=True
    )
    print("OK")

    # Install NumPy 1.26.4 first
    if not run_cmd(
        f"{sys.executable} -m pip install -q numpy==1.26.4",
        "Installing numpy==1.26.4"
    ):
        return False

    # Verify NumPy version
    try:
        import numpy as np
        if not np.__version__.startswith('1.26'):
            print(f"❌ NumPy version mismatch: {np.__version__}")
            return False
        print(f"✅ NumPy {np.__version__} verified")
    except Exception as e:
        print(f"❌ NumPy verification failed: {e}")
        return False

    # Install compatible pandas and scipy
    if not run_cmd(
        f"{sys.executable} -m pip install -q pandas scipy",
        "Installing pandas and scipy"
    ):
        return False

    # Install PyTorch with correct CUDA version
    if not run_cmd(
        f"{sys.executable} -m pip install -q torch=={pytorch_version}+{pytorch_cuda} torchvision torchaudio --index-url {index_url}",
        f"Installing PyTorch {pytorch_version} ({pytorch_cuda})"
    ):
        return False

    # Install cuEquivariance with correct CUDA version
    if cuda_major == 12:
        cuequiv_pkg = "cuequivariance-ops-torch-cu12"
    elif cuda_major == 11:
        cuequiv_pkg = "cuequivariance-ops-torch-cu11"
    else:
        cuequiv_pkg = "cuequivariance-ops-torch-cu12"

    if not run_cmd(
        f"{sys.executable} -m pip install -q cuequivariance-torch {cuequiv_pkg}",
        f"Installing cuEquivariance ({cuequiv_pkg})"
    ):
        return False

    # Install Lightning stack
    if not run_cmd(
        f"{sys.executable} -m pip install -q pytorch-lightning==2.4.0 torchmetrics==1.4.0",
        "Installing lightning stack"
    ):
        return False

    # Install boltz
    if not run_cmd(
        f"{sys.executable} -m pip install -q boltz",
        "Installing boltz"
    ):
        return False

    # Test installation
    print("[Testing boltz]")
    test_result = subprocess.run(["boltz", "--help"], capture_output=True, text=True, timeout=30)

    if test_result.returncode != 0:
        print("FAILED:")
        print(test_result.stderr)
        return False

    print("OK")

    # Create ready marker
    with open("/content/BOLTZ_READY", "w") as f:
        f.write("Ready")

    return True, pytorch_version, pytorch_cuda, cuda_version

# MAIN EXECUTION
if not is_post_restart:
    print("=" * 60)
    print("PHASE 1: ENVIRONMENT SETUP (requires restart)")
    print("=" * 60)
    print("\n⚠️  Colab Feb 2025 pre-loads NumPy 2.0, but Boltz-2 requires 1.26")
    print("   This will take ~10 seconds for one-time restart\n")

    # Create restart marker
    with open(restart_marker, "w") as f:
        f.write("pre-restart")

    print("🔄 Restarting Python environment...")
    import time
    time.sleep(2)
    os.kill(os.getpid(), 9)

else:
    print("=" * 60)
    print("PHASE 2: PACKAGE INSTALLATION")
    print("=" * 60)

    # sitecustomize.py prevention
    python_version = f"{sys.version_info.major}.{sys.version_info.minor}"
    sitecustomize_path = f"/usr/local/lib/python{python_version}/dist-packages/sitecustomize.py"

    sitecustomize_content = """import sys
if '/env/python' in sys.path:
    sys.path.remove('/env/python')
"""

    try:
        with open(sitecustomize_path, "w") as f:
            f.write(sitecustomize_content)
        print("✅ Permanent conflict prevention installed")
    except Exception as e:
        print(f"⚠️  Could not install sitecustomize.py: {e}")

    # Current kernel cleanup
    if '/env/python' in sys.path:
        sys.path.remove('/env/python')

    # Clear cached numpy/pandas imports. Match on the top-level package name so
    # a loose prefix such as 'pd' does not also remove unrelated modules (e.g. pdb).
    modules_to_clear = [key for key in list(sys.modules.keys())
                       if key.split('.')[0] in ('numpy', 'pandas')]
    for mod in modules_to_clear:
        del sys.modules[mod]

    if modules_to_clear:
        print(f"   ✅ Cleared {len(modules_to_clear)} cached modules")

    # Install Boltz (verification happens inside this function)
    print("\n" + "=" * 60)
    print("INSTALLING BOLTZ-2")
    print("=" * 60 + "\n")

    result = install_boltz()
    if not result:
        print("\n❌ Installation failed")
        sys.exit(1)

    success, pytorch_version, pytorch_cuda, cuda_version = result

    # CUEQUIVARIANCE KERNEL TEST
    kernels_available = test_cuequivariance_kernels()

    if kernels_available:
        print("\n✅ KERNEL TEST PASSED")
        print("   cuEquivariance kernels are available")
        print("   Will run Boltz-2 WITHOUT --no_kernels flag")
        use_no_kernels = False
    else:
        print("\n❌ KERNEL TEST FAILED")
        print("   cuEquivariance kernels are NOT available")
        print("   Will run Boltz-2 WITH --no_kernels flag")
        print("   Performance penalty: ~12 seconds per prediction")
        use_no_kernels = True

    # Store result for execution cell
    if 'global_settings' not in globals():
        global_settings = {}
    global_settings['use_no_kernels'] = use_no_kernels
    global_settings['kernels_tested'] = True

    print(f"\n🔧 Flag stored: use_no_kernels = {use_no_kernels}")

    # Show installed versions
    print("\n" + "=" * 60)
    print("INSTALLED PACKAGE VERSIONS")
    print("=" * 60)

    result = subprocess.run(
        [sys.executable, "-m", "pip", "list", "--format=freeze"],
        capture_output=True, text=True
    )

    all_packages = result.stdout.strip().split('\n')
    relevant = [
        'numpy', 'pandas', 'scipy',
        'torch', 'torchvision', 'torchaudio',
        'pytorch-lightning', 'torchmetrics',
        'boltz', 'cuequivariance-torch',
        'cuequivariance-ops-torch-cu11',
        'cuequivariance-ops-torch-cu12'
    ]

    print("\n📋 Core packages:")
    for pkg in relevant:
        for line in all_packages:
            if line.lower().startswith(pkg.lower() + '=='):
                print(f"   {line}")
                break
        else:
            for line in all_packages:
                if pkg.lower().replace('-', '_') in line.lower():
                    print(f"   {line}")
                    break

    # Save complete requirements.txt
    print("\n📄 Saving complete requirements.txt...")
    with open("/content/requirements_boltz.txt", "w") as f:
        f.write(f"# Boltz-2 Installation - CUDA {cuda_version}\n")
        f.write(f"# PyTorch {pytorch_version} ({pytorch_cuda})\n\n")
        f.write(result.stdout)
    print("   ✅ Saved to: /content/requirements_boltz.txt")

    # Cleanup and mark ready
    os.remove(restart_marker)

    print("\n" + "=" * 60)
    print("✅ BOLTZ-2 INSTALLATION COMPLETE")
    print("=" * 60)
    print("Next: Configure your sequences in Cell 2")

PHASE 2: PACKAGE INSTALLATION
✅ Permanent conflict prevention installed
   ✅ Cleared 105 cached modules

INSTALLING BOLTZ-2

✅ Detected CUDA 12.4
📦 Will install PyTorch 2.6.0 (cu124)

[Uninstalling conflicting packages]
OK
[Installing numpy==1.26.4]
OK
✅ NumPy 1.26.4 verified
[Installing pandas and scipy]




OK
[Installing PyTorch 2.6.0 (cu124)]
OK
[Installing cuEquivariance (cuequivariance-ops-torch-cu12)]
OK
[Installing lightning stack]
OK
[Installing boltz]
OK
[Testing boltz]
OK

CUEQUIVARIANCE KERNEL PREFLIGHT TEST
✅ PyTorch: 2.6.0+cu124
✅ CUDA available: True
✅ CUDA version: 12.4
✅ GPU: NVIDIA A100-SXM4-40GB


Error while loading libcue_ops.so: /usr/local/lib/python3.12/dist-packages/cuequivariance_ops/lib/libcue_ops.so: undefined symbol: cublasGemmGroupedBatchedEx, version libcublas.so.12


✅ cuequivariance-torch installed
⚠️  cuequivariance-ops-torch-cu12 not found: libcue_ops.so: cannot open shared object file: No such file or directory

❌ KERNEL TEST FAILED
   cuEquivariance kernels are NOT available
   Will run Boltz-2 WITH --no_kernels flag
   Performance penalty: ~12 seconds per prediction

🔧 Flag stored: use_no_kernels = True

INSTALLED PACKAGE VERSIONS

📋 Core packages:
   numpy==1.26.4
   pandas==2.3.3
   scipy==1.13.1
   torch==2.6.0+cu124
   torchvision==0.21.0+cu124
   torchaudio==2.6.0+cu124
   pytorch-lightning==2.5.0
   torchmetrics==1.4.0
   boltz==2.2.1
   cuequivariance-torch==0.7.0
   cuequivariance-ops-torch-cu12==0.7.0

📄 Saving complete requirements.txt...
   ✅ Saved to: /content/requirements_boltz.txt

✅ BOLTZ-2 INSTALLATION COMPLETE
Next: Configure your sequences in Cell 2
CPU times: user 3.83 s, sys: 349 ms, total: 4.18 s
Wall time: 1min 57s
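
The CUDA detection and wheel selection in Cell 1 can be read as two small pure functions. The sketch below restates that branch logic in isolation so it can be checked without a GPU. The function names `parse_cuda_version` and `select_wheel_tag` and the sample `nvidia-smi` header line are illustrative assumptions, not part of the notebook.

```python
import re

def parse_cuda_version(smi_output):
    """Return (major, minor) parsed from nvidia-smi text, or None if absent."""
    match = re.search(r"CUDA Version: (\d+)\.(\d+)", smi_output)
    return (int(match.group(1)), int(match.group(2))) if match else None

def select_wheel_tag(major, minor):
    """Map a CUDA version to a PyTorch wheel tag, mirroring install_boltz()."""
    if major == 12:
        return "cu124" if minor >= 4 else "cu121"
    if major == 11:
        return "cu118"
    return "cu124"  # same fallback the cell uses for unrecognised versions

# Hypothetical header line in the style nvidia-smi prints
sample = "| NVIDIA-SMI 550.54.15   Driver Version: 550.54.15   CUDA Version: 12.4 |"
version = parse_cuda_version(sample)
print(version, select_wheel_tag(*version))  # (12, 4) cu124
```

Keeping the mapping separate from the `pip` calls makes it easy to see which wheel index a given runtime would receive before anything is uninstalled.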


In [None]:
#@title Cell 3: Manual Input Configuration (Skip if using FASTA Upload)
#@markdown Only run this cell if you selected "Manual Input" above

# Job configuration
jobname = '' #@param {type:"string"}
#@markdown - Job name for output files

# Google Drive setup
setup_google_drive = True #@param {type:"boolean"}
#@markdown - Setup Google Drive for automatic result upload
gdrive_folder_name = "Boltz2_Predictions" #@param {type:"string"}
#@markdown - Google Drive folder name

# Sequence inputs
seq1_name = 'A' #@param {type:"string"}
seq1_type = "protein" #@param ["protein", "dna", "rna", "smiles", "ccd"]
seq1_content = '' #@param {type:"string"}
seq1_copies = 1 #@param {type:"integer"}

seq2_name = 'B' #@param {type:"string"}
seq2_type = "protein" #@param ["protein", "dna", "rna", "smiles", "ccd"]
seq2_content = '' #@param {type:"string"}
seq2_copies = 1 #@param {type:"integer"}

seq3_name = 'C' #@param {type:"string"}
seq3_type = "protein" #@param ["protein", "dna", "rna", "smiles", "ccd"]
seq3_content = '' #@param {type:"string"}
seq3_copies = 1 #@param {type:"integer"}

seq4_name = 'D' #@param {type:"string"}
seq4_type = "protein" #@param ["protein", "dna", "rna", "smiles", "ccd"]
seq4_content = '' #@param {type:"string"}
seq4_copies = 1 #@param {type:"integer"}

seq5_name = 'E' #@param {type:"string"}
seq5_type = "protein" #@param ["protein", "dna", "rna", "smiles", "ccd"]
seq5_content = '' #@param {type:"string"}
seq5_copies = 1 #@param {type:"integer"}

seq6_name = 'F' #@param {type:"string"}
seq6_type = "protein" #@param ["protein", "dna", "rna", "smiles", "ccd"]
seq6_content = '' #@param {type:"string"}
seq6_copies = 1 #@param {type:"integer"}

seq7_name = 'G' #@param {type:"string"}
seq7_type = "protein" #@param ["protein", "dna", "rna", "smiles", "ccd"]
seq7_content = '' #@param {type:"string"}
seq7_copies = 1 #@param {type:"integer"}

seq8_name = 'H' #@param {type:"string"}
seq8_type = "protein" #@param ["protein", "dna", "rna", "smiles", "ccd"]
seq8_content = '' #@param {type:"string"}
seq8_copies = 1 #@param {type:"integer"}

seq9_name = 'I' #@param {type:"string"}
seq9_type = "protein" #@param ["protein", "dna", "rna", "smiles", "ccd"]
seq9_content = '' #@param {type:"string"}
seq9_copies = 1 #@param {type:"integer"}

seq10_name = 'J' #@param {type:"string"}
seq10_type = "protein" #@param ["protein", "dna", "rna", "smiles", "ccd"]
seq10_content = '' #@param {type:"string"}
seq10_copies = 1 #@param {type:"integer"}

# Check if this cell should run
if 'global_settings' not in globals():
    print("⚠️  Please run the 'Choose Input Method' cell first")
elif global_settings['input_method'] != "Manual Input":
    print("⏭️  Skipping manual input (FASTA Upload selected)")
else:
    # Setup Google Drive if requested
    drive = None
    if setup_google_drive:
        try:
            from pydrive2.drive import GoogleDrive
            from pydrive2.auth import GoogleAuth
            from google.colab import auth
            from oauth2client.client import GoogleCredentials
            from google.colab import files

            print("Setting up Google Drive...")
            auth.authenticate_user()
            gauth = GoogleAuth()
            gauth.credentials = GoogleCredentials.get_application_default()
            drive = GoogleDrive(gauth)
            print("✅ Google Drive connected successfully!")
        except Exception as e:
            print(f"❌ Google Drive setup failed: {e}")
            drive = None

    # Process sequences - ALWAYS use A, B, C, etc. for MSA compatibility
    sequences = []
    all_sequences = [
        (seq1_name, seq1_type, seq1_content, seq1_copies),
        (seq2_name, seq2_type, seq2_content, seq2_copies),
        (seq3_name, seq3_type, seq3_content, seq3_copies),
        (seq4_name, seq4_type, seq4_content, seq4_copies),
        (seq5_name, seq5_type, seq5_content, seq5_copies),
        (seq6_name, seq6_type, seq6_content, seq6_copies),
        (seq7_name, seq7_type, seq7_content, seq7_copies),
        (seq8_name, seq8_type, seq8_content, seq8_copies),
        (seq9_name, seq9_type, seq9_content, seq9_copies),
        (seq10_name, seq10_type, seq10_content, seq10_copies)
    ]

    # Generate sequential letter IDs for MSA compatibility (5-char limit)
    # CRITICAL: Each copy gets a DIFFERENT letter (A, B, C...)
    # YAML generation groups identical sequences for MSA optimization
    letter_index = 0
    alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

    for name, seq_type, content, copies in all_sequences:
        if content.strip():  # Only process non-empty sequences
            chain_ids = []

            # Each copy gets its own letter - EXACT CSV processor behavior
            for copy_num in range(copies):
                if letter_index < len(alphabet):
                    chain_ids.append(alphabet[letter_index])
                else:
                    print(f"⚠️  Warning: Exceeded 26 chains. Using extended notation.")
                    chain_ids.append(f"Z{letter_index - 25}")

                letter_index += 1  # Increment for EACH copy

            sequences.append({
                'name': name,  # Preserve user name for display
                'type': seq_type,
                'content': content.strip(),
                'copies': copies,
                'chain_ids': chain_ids  # Each copy has unique ID: A, B for 2 copies
            })

    # Generate jobname hash
    if sequences:
        sequence_string = "".join([seq['content'] for seq in sequences])
        final_jobname = add_hash(jobname.replace(' ', '_'), sequence_string)

        # Update global settings
        global_settings.update({
            'sequences': sequences,
            'drive': drive,
            'gdrive_folder_name': gdrive_folder_name,
            'final_jobname': final_jobname
        })

        print("✅ Manual sequences configured:")
        print(f"📁 Job name: {final_jobname}")
        for seq in sequences:
            print(f"  {seq['name']}: {seq['type']}, {seq['copies']} copies, chains: {seq['chain_ids']}")
            print(f"    Content: {seq['content'][:50]}{'...' if len(seq['content']) > 50 else ''}")
    else:
        print("❌ No sequences provided")

☁️  Setting up Google Drive...
Mounted at /content/gdrive
✅ Google Drive connected
✅ Manual sequences configured:
🏷️ Job name: Cf4AVR4-single_791fc
  Cf4: protein, 1 copies, chains: ['Cf4']
    Content: SSLPHLCPEDQALALLEFKNMFTVNPNASDYCYDRRTLSWNKSTSCCSWD...
  AVR4: protein, 1 copies, chains: ['AVR4']
    Content: PCKPQEVIDTKCMGPKDCLYPNPDSCTTYIQCVPLDEVGNAKPVVKPCPK...

📊 Total sequences: 2
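
The chain ID scheme in the manual input cell gives every copy of every entity its own letter, then falls back to `Z1`, `Z2`, and so on once the alphabet is exhausted. A minimal standalone sketch of that rule follows. The helper name `assign_chain_ids` is an assumption introduced for illustration.

```python
import string

def assign_chain_ids(copies_per_entity):
    """Give each copy of each entity a unique chain ID: A..Z, then Z1, Z2, ..."""
    alphabet = string.ascii_uppercase
    index = 0
    assigned = []
    for copies in copies_per_entity:
        ids = []
        for _ in range(copies):
            if index < len(alphabet):
                ids.append(alphabet[index])
            else:
                # Same extended notation as the cell: the 27th chain becomes Z1
                ids.append(f"Z{index - 25}")
            index += 1  # advance once per copy, not once per entity
        assigned.append(ids)
    return assigned

# Two copies of the first entity, one copy of the second
print(assign_chain_ids([2, 1]))  # [['A', 'B'], ['C']]
```

Because the counter advances per copy, a homodimer plus a ligand occupies chains A, B and C, which is what any later restraint or modification input has to reference.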


In [None]:
#@title Cell 3: MSA Configuration
msa_mode = "mmseqs2_uniref_env" #@param ["mmseqs2_uniref_env", "mmseqs2_uniref","single_sequence","custom"]
#@markdown - MSA generation method. mmseqs2 modes use the ColabFold server

msa_pairing_strategy = "greedy" #@param ["greedy", "complete"]
#@markdown - `greedy` = pair any taxonomically matching subsets, `complete` = all sequences must match

# Check if global_settings exists
if 'global_settings' not in globals():
    print("⚠️  Please run the 'Choose Input Method' cell first")
else:
    # Configure MSA settings based on mode
    if "mmseqs2" in msa_mode:
        use_msa_server = True
        msa_server_url = "https://api.colabfold.com"
    else:
        use_msa_server = False
        msa_server_url = None

    # Handle custom MSA upload if selected
    if msa_mode == "custom":
        print("Upload your custom MSA file (A3M format):")
        from google.colab import files
        custom_msa_dict = files.upload()
        if custom_msa_dict:
            custom_msa_file = list(custom_msa_dict.keys())[0]
            print(f"Custom MSA uploaded: {custom_msa_file}")
        else:
            print("No custom MSA uploaded, switching to single_sequence mode")
            msa_mode = "single_sequence"
            use_msa_server = False

    # Store MSA settings in global_settings
    global_settings.update({
        'msa_mode': msa_mode,
        'msa_pairing_strategy': msa_pairing_strategy,
        'use_msa_server': use_msa_server,
        'msa_server_url': msa_server_url
    })

    print(f"✅ MSA configuration set:")
    print(f"  Mode: {msa_mode}")
    print(f"  Pairing strategy: {msa_pairing_strategy}")
    print(f"  Use MSA server: {use_msa_server}")

✅ MSA configuration set:
  Mode: mmseqs2_uniref_env
  Pairing strategy: greedy
  Use MSA server: True
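
The MSA cell reduces to one decision: any `mmseqs2` mode uses the ColabFold server, and a `custom` mode with no uploaded file falls back to `single_sequence`. The sketch below captures that decision as a function without the upload widget. The name `resolve_msa_settings` and the `custom_msa_uploaded` parameter are hypothetical.

```python
COLABFOLD_URL = "https://api.colabfold.com"  # same server URL the cell sets

def resolve_msa_settings(msa_mode, custom_msa_uploaded=False):
    """Return the MSA-related settings implied by the selected mode."""
    # A custom mode without a file falls back to single-sequence prediction
    if msa_mode == "custom" and not custom_msa_uploaded:
        msa_mode = "single_sequence"
    use_server = "mmseqs2" in msa_mode
    return {
        "msa_mode": msa_mode,
        "use_msa_server": use_server,
        "msa_server_url": COLABFOLD_URL if use_server else None,
    }

print(resolve_msa_settings("mmseqs2_uniref_env"))
print(resolve_msa_settings("custom"))
```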


In [None]:
#@title Cell 4: Advanced Prediction Settings
# Structure Prediction Settings
recycling_steps = 6 #@param {type:"integer"}
#@markdown - **Iterative refinement passes**: Each cycle refines the structure using updated predictions. Higher values improve local geometry and confidence scores. **Time**: ~linear scaling (3 steps = 3x base time). **VRAM**: +20-30% per additional step for intermediate states.

sampling_steps = 200 #@param {type:"integer"}
#@markdown - **Diffusion denoising iterations**: Controls how many steps the diffusion model takes to generate structures from noise. More steps = smoother, higher quality structures. **Time**: Linear scaling (50 steps = 4x faster than 200). **VRAM**: +10-15% for intermediate diffusion states.

diffusion_samples = 5 #@param {type:"integer"}
#@markdown - **Independent structure predictions**: Number of different structures generated per input. More samples increase diversity and reliability of results. **Time**: Linear scaling (5 samples = 5x base time). **VRAM**: Depends on max_parallel_samples setting.

max_parallel_samples = 5 #@param {type:"integer"}
#@markdown - **GPU memory management**: How many diffusion samples are processed simultaneously. Critical for large complexes - each parallel sample requires full model memory allocation. **Time**: Minimal impact on total time. **VRAM**: ~Linear scaling (2 parallel = ~2x memory, 5 parallel = ~5x memory).

step_scale = 1.638 #@param {type:"number"}
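#@markdown - **Sampling temperature**: Step size of the diffusion sampler, which sets the effective sampling temperature. Lower values increase diversity among samples but may reduce quality; the Boltz documentation recommends values between 1 and 2. 1.638 is the Boltz-1 default (Boltz-2 defaults to 1.5). **Time**: No impact. **VRAM**: No impact.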

# Affinity Prediction Settings
predict_affinity = False #@param {type:"boolean"}
#@markdown - **Binding strength prediction**: Runs additional affinity model to predict binding strength (Kd/Ki values). Most reliable for protein-small molecule complexes. **Time**: +50-100% total time. **VRAM**: +40-60% for affinity model loading.

affinity_mw_correction = False #@param {type:"boolean"}
#@markdown - **Molecular weight adjustment**: Applies size-based corrections to affinity predictions. Only affects affinity calculation, not structure. **Time**: Minimal impact. **VRAM**: No impact.

sampling_steps_affinity = 200 #@param {type:"integer"}
#@markdown - **Affinity model diffusion steps**: Controls quality of affinity predictions. Similar to sampling_steps but for the affinity model. **Time**: Linear scaling within affinity prediction. **VRAM**: +5-10% for affinity diffusion states.

diffusion_samples_affinity = 5 #@param {type:"integer"}
#@markdown - **Affinity prediction ensemble size**: Number of independent affinity predictions to average for final binding strength. More samples = more reliable Kd estimates. **Time**: Linear scaling for affinity portion. **VRAM**: Minimal additional impact.

# Output and Optimization Settings
output_format = "mmcif" #@param ["mmcif", "pdb"]
#@markdown - **Structure file format**: mmCIF supports more metadata and modern features, PDB is more widely compatible. Both contain same structural information. **Time**: No impact. **VRAM**: No impact.

write_full_pae = True #@param {type:"boolean"}
#@markdown - **Save Predicted Aligned Error matrix**: Confidence scores between all residue pairs. Essential for assessing interface quality and domain reliability. **Time**: +5-10% for matrix computation and I/O. **VRAM**: +10-20% for large complexes during matrix storage.

write_full_pde = False #@param {type:"boolean"}
#@markdown - **Save Predicted Distance Error matrix**: Distance confidence predictions between residue pairs. Useful for validation and uncertainty quantification. **Time**: +5-10% for matrix computation and I/O. **VRAM**: +10-20% for large complexes during matrix storage.

use_potentials = True #@param {type:"boolean"}
#@markdown - **Inference-time physics optimization**: Applies physics-based energy minimization to improve local geometry and remove clashes. Significantly improves structure quality, especially for interfaces. **Time**: +30-50% total time. **VRAM**: +15-25% for physics calculation buffers.

# Check if global_settings exists
if 'global_settings' not in globals():
    print("⚠️  Please run the 'Choose Input Method' cell first")
else:
    # Store advanced settings
    advanced_settings = {
        'recycling_steps': recycling_steps,
        'sampling_steps': sampling_steps,
        'diffusion_samples': diffusion_samples,
        'max_parallel_samples': max_parallel_samples,
        'step_scale': step_scale,
        'predict_affinity': predict_affinity,
        'affinity_mw_correction': affinity_mw_correction,
        'sampling_steps_affinity': sampling_steps_affinity,
        'diffusion_samples_affinity': diffusion_samples_affinity,
        'output_format': output_format,
        'write_full_pae': write_full_pae,
        'write_full_pde': write_full_pde,
        'use_potentials': use_potentials,
        'max_msa_seqs': 8192,
        'subsample_msa': False,
        'num_subsampled_msa': 1024
    }

    global_settings.update(advanced_settings)

    print("✅ Advanced settings configured:")
    print(f"  Recycling steps: {recycling_steps}")
    print(f"  Sampling steps: {sampling_steps}")
    print(f"  Diffusion samples: {diffusion_samples}")
    print(f"  Predict affinity: {predict_affinity}")
    print(f"  Output format: {output_format}")
    print(f"  Use potentials: {use_potentials}")

✅ Advanced settings configured:
  Recycling steps: 6
  Sampling steps: 200
  Diffusion samples: 5
  Predict affinity: False
  Output format: mmcif
  Use potentials: True
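
The settings stored above are eventually passed to the `boltz predict` command line. The execution cell is not shown in this part of the notebook, so the sketch below is an assumption about how that translation could look, not the notebook's own code. The helper `build_boltz_command` and the file names are hypothetical, and the flag names follow the Boltz prediction documentation as best recalled; confirm them with `boltz predict --help` before relying on them.

```python
def build_boltz_command(input_path, out_dir, settings):
    """Translate a settings dict into a `boltz predict` argument list."""
    cmd = ["boltz", "predict", input_path, "--out_dir", out_dir]

    # Options that take a value
    for key in ("recycling_steps", "sampling_steps", "diffusion_samples",
                "max_parallel_samples", "step_scale", "output_format"):
        cmd += [f"--{key}", str(settings[key])]

    # Boolean switches are added only when enabled
    for flag in ("write_full_pae", "write_full_pde", "use_potentials",
                 "use_msa_server", "affinity_mw_correction"):
        if settings.get(flag):
            cmd.append(f"--{flag}")

    # Result of the kernel preflight test from Cell 1
    if settings.get("use_no_kernels"):
        cmd.append("--no_kernels")
    return cmd

example = {
    "recycling_steps": 6, "sampling_steps": 200, "diffusion_samples": 5,
    "max_parallel_samples": 5, "step_scale": 1.638, "output_format": "mmcif",
    "write_full_pae": True, "write_full_pde": False, "use_potentials": True,
    "use_msa_server": True, "use_no_kernels": True,
}
print(" ".join(build_boltz_command("job.yaml", "results", example)))
```

Building the command as a list rather than a single string lets it go to `subprocess.run` without `shell=True`, which avoids quoting problems in job names.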


In [None]:
#@title Cell 4.1: Residue Modifications Instructions (Optional)
#@markdown Specify residue modifications for amino acid, DNA, or RNA sequences. Each row should define one modification, with values separated by colons (:). The format is:
#@markdown
#@markdown `SEQ_ID : RESIDUE_INDEX : CCD_CODE`
#@markdown
#@markdown * **SEQ_ID** → The chain ID of the sequence as defined in **Input Sequences**.
#@markdown * **RESIDUE_INDEX** → The residue position to modify. Use **1** for the first residue.
#@markdown * **CCD_CODE** → The **Chemical Component Dictionary (CCD) code** of the modification.
#@markdown
#@markdown **Example Input:**
#@markdown ```
#@markdown A:102:MLY
#@markdown B:1:5MC
#@markdown C:26:PSU
#@markdown ```
#@markdown
#@markdown **Notes:**
#@markdown * Chain IDs (**SEQ_ID**) must match those in **Input Sequences**.
#@markdown * Residue indices start at **1**, not **0**.
#@markdown * Use valid **CCD codes** for modifications. For guidance on which CCD code matches your modification, see https://pmc.ncbi.nlm.nih.gov/articles/PMC11394121/

residue_modifications = '' #@param {type:"string"}
#@markdown - Enter residue modifications (one per line, format: CHAIN_ID:RESIDUE_INDEX:CCD_CODE)

# Process residue modifications
modifications_list = []
if residue_modifications.strip():
    for line in residue_modifications.strip().split('\n'):
        if line.strip():
            parts = line.strip().split(':')
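            # Require three fields and a numeric residue index so int() cannot raise
            if len(parts) == 3 and parts[1].strip().isdigit():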
                chain_id, res_idx, ccd_code = parts
                modifications_list.append({
                    'chain_id': chain_id.strip(),
                    'position': int(res_idx.strip()),
                    'ccd': ccd_code.strip()
                })
            else:
                print(f"Invalid modification format: {line}")

print(f"Residue modifications configured: {len(modifications_list)} modifications")
for mod in modifications_list:
    print(f"  Chain {mod['chain_id']}, position {mod['position']}: {mod['ccd']}")

if 'global_settings' in globals() and modifications_list:
    global_settings['modifications_list'] = modifications_list

Residue modifications configured: 0 modifications
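
The `SEQ_ID:RESIDUE_INDEX:CCD_CODE` format can be validated separately from the form field. The sketch below is one way to parse it while collecting malformed rows instead of raising. The function name `parse_modifications` and its `(modifications, errors)` return shape are assumptions made for illustration.

```python
def parse_modifications(text):
    """Parse 'CHAIN:INDEX:CCD' rows; return (modifications, rejected_rows)."""
    mods, errors = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        parts = [p.strip() for p in line.split(":")]
        # Three fields, with a 1-based numeric residue index in the middle
        if len(parts) != 3 or not parts[1].isdigit() or int(parts[1]) < 1:
            errors.append(line)
            continue
        mods.append({"chain_id": parts[0],
                     "position": int(parts[1]),
                     "ccd": parts[2]})
    return mods, errors

mods, errors = parse_modifications("A:102:MLY\nB:1:5MC\nC:x:PSU")
print(mods)
print(errors)  # ['C:x:PSU']
```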


In [None]:
#@title Cell 4.2: Pocket Restraints Instructions (Optional)
#@markdown Specify inter-chain pocket restraints to guide Boltz-2 when folding complexes. A restraint ties one binder chain to residues on other chains that it should contact, which steers the predicted interface.
#@markdown The **Binder Chain** field names the binder; **Contact Residues** lists the residues that interact with it.
#@markdown Each row defines one contact residue, with the two values separated by a colon (:). The format is:
#@markdown
#@markdown `CONTACT_CHAIN:CONTACT_RES`
#@markdown
#@markdown * **CONTACT_CHAIN** → The chain containing the interacting residue.
#@markdown * **CONTACT_RES** → The position of the residue on **CONTACT_CHAIN**.
#@markdown
#@markdown **Example Input:**
#@markdown ```
#@markdown A:66
#@markdown A:78
#@markdown B:13
#@markdown ```
#@markdown
#@markdown **Notes:**
#@markdown * Chain names match those in **Input Sequences**.
#@markdown * Residue numbering starts at 1.
#@markdown * The model currently only supports a single binder chain per pocket restraint, but multiple contact residues can be specified across different chains.
#@markdown * The chain name of the binder should only be specified if pocket restraints are being used.

binder_chain = '' #@param {type:"string"}
#@markdown - Specify the chain acting as the binder. See above instructions for more details.
contact_residues = '' #@param {type:"string"}
#@markdown - Specify residues interacting with the binder chain. See above instructions for more details.

# Process pocket restraints
pocket_contacts = []
if contact_residues.strip() and binder_chain.strip():
    for line in contact_residues.strip().split('\n'):
        if line.strip():
            parts = line.strip().split(':')
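            if len(parts) == 2:
                contact_chain, contact_res = parts
                try:
                    residue = int(contact_res.strip())
                except ValueError:
                    # Report a non-numeric residue index instead of aborting the cell
                    print(f"Invalid residue index: {line}")
                    continue
                pocket_contacts.append({
                    'chain_id': contact_chain.strip(),
                    'residue': residue
                })
            else:
                print(f"Invalid contact format: {line}")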

if binder_chain.strip():
    print(f"Pocket restraints configured:")
    print(f"  Binder chain: {binder_chain.strip()}")
    print(f"  Contact residues: {len(pocket_contacts)} contacts")
    for contact in pocket_contacts:
        print(f"    Chain {contact['chain_id']}, residue {contact['residue']}")
else:
    print("No pocket restraints configured")

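# Always store the current values so that clearing the fields and re-running
# this cell also clears restraints left over from an earlier run
if 'global_settings' in globals():
    global_settings['binder_chain'] = binder_chain.strip()
    global_settings['pocket_contacts'] = pocket_contacts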

No pocket restraints configured
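
To make the restraint format concrete, here is a minimal sketch of how a binder chain and a set of `CONTACT_CHAIN:CONTACT_RES` rows could be turned into a pocket constraint. `pocket_constraint_yaml` is a hypothetical helper, and the YAML layout (a `constraints` list with a `pocket` entry holding `binder` and `[CHAIN_ID, RES_IDX]` contact pairs) is an assumption based on the Boltz input documentation, so check it against the Boltz version you have installed.

```python
def pocket_constraint_yaml(binder_chain, contact_rows):
    """Build a YAML fragment for one pocket constraint.

    contact_rows: iterable of 'CONTACT_CHAIN:CONTACT_RES' strings.
    The layout is an assumed reading of the Boltz input schema.
    """
    contacts = []
    for row in contact_rows:
        # Each row must have exactly two colon-separated fields
        chain, res = (part.strip() for part in row.split(':'))
        contacts.append(f"[{chain}, {int(res)}]")  # residue numbering starts at 1
    return "\n".join([
        "constraints:",
        "  - pocket:",
        f"      binder: {binder_chain}",
        f"      contacts: [{', '.join(contacts)}]",
    ])


print(pocket_constraint_yaml("C", ["A:66", "A:78", "B:13"]))
```

Keeping the chain ID next to each residue index is what allows contacts on several chains to be combined in a single restraint.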


In [None]:
#@title Cell 4.3: Covalent Restraints Instructions (Optional)
#@markdown Specify covalent bonds between atoms to guide Boltz-2 in complex folding. These restraints define fixed interactions between atoms in different sequences, ensuring structural constraints are maintained.
#@markdown Each row should define one covalent restraint, with values separated by colons (:). The format is:
#@markdown
#@markdown `CHAIN_ID1:RES_ID1:ATOM_NAME1:CHAIN_ID2:RES_ID2:ATOM_NAME2`
#@markdown
#@markdown * **CHAIN_ID1** → The chain containing the first atom.
#@markdown * **RES_ID1** → Residue index on **CHAIN_ID1**.
#@markdown * **ATOM_NAME1** → Atom name in **RES_ID1**.
#@markdown * **CHAIN_ID2** → The chain containing the second atom.
#@markdown * **RES_ID2** → Residue index on **CHAIN_ID2**.
#@markdown * **ATOM_NAME2** → Atom name in **RES_ID2**.
#@markdown
#@markdown **Example Input:**
#@markdown ```
#@markdown A:6:CA:B:26:CB
#@markdown C:1:N1:A:45:OG
#@markdown ```
#@markdown
#@markdown **Notes:**
#@markdown * Chain names match those in **Input Sequences**.
#@markdown * Residue numbering starts at 1.
#@markdown * Atom names must match standardized PDB/CIF naming conventions.
#@markdown * Only canonical residues and CCD ligands are supported.
#@markdown * Covalent restraints ensure atoms remain bonded during folding but do not enforce bond angles or torsions.

covalent_restraints = '' #@param {type:"string"}
#@markdown - Specify covalent bonds between atoms. See above instructions for more details.

# Process covalent restraints
covalent_bonds = []
if covalent_restraints.strip():
    for line in covalent_restraints.strip().split('\n'):
        if line.strip():
            parts = line.strip().split(':')
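            if len(parts) == 6:
                chain1, res1, atom1, chain2, res2, atom2 = parts
                try:
                    res1_idx, res2_idx = int(res1.strip()), int(res2.strip())
                except ValueError:
                    # Report a non-numeric residue index instead of aborting the cell
                    print(f"Invalid residue index: {line}")
                    continue
                covalent_bonds.append({
                    'atom1': [chain1.strip(), res1_idx, atom1.strip()],
                    'atom2': [chain2.strip(), res2_idx, atom2.strip()]
                })
            else:
                print(f"Invalid covalent restraint format: {line}")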

print(f"Covalent restraints configured: {len(covalent_bonds)} bonds")
for bond in covalent_bonds:
    print(f"  {bond['atom1'][0]}:{bond['atom1'][1]}:{bond['atom1'][2]} - {bond['atom2'][0]}:{bond['atom2'][1]}:{bond['atom2'][2]}")

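# Always store the current list so that clearing the field and re-running this
# cell also clears bonds left over from an earlier run
if 'global_settings' in globals():
    global_settings['covalent_bonds'] = covalent_bonds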

Covalent restraints configured: 0 bonds
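
As a quick illustration of the six-field format above, the sketch below splits one restraint row into the two atom triples that Cell 4.3 stores. `parse_covalent_restraint` is a hypothetical standalone helper written for this example; unlike the cell, it raises on malformed input rather than printing a message.

```python
def parse_covalent_restraint(line):
    """Split 'CHAIN1:RES1:ATOM1:CHAIN2:RES2:ATOM2' into two atom triples."""
    parts = [p.strip() for p in line.split(':')]
    if len(parts) != 6:
        raise ValueError(
            f"Expected 6 colon-separated fields, got {len(parts)}: {line}"
        )
    chain1, res1, atom1, chain2, res2, atom2 = parts
    return {
        'atom1': [chain1, int(res1), atom1],  # [chain, residue index, atom name]
        'atom2': [chain2, int(res2), atom2],
    }


bond = parse_covalent_restraint("C:1:N1:A:45:OG")
print(bond)
```

The second example row from the instructions is used here: atom `N1` of residue 1 on chain C bonded to atom `OG` of residue 45 on chain A.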


In [None]:
#@title Run Boltz-2 Prediction (Complete Integration)
%%time
import subprocess
import os
import zipfile
import shutil
import time
from datetime import datetime

# COMPLETE INTEGRATION OF ALL CONFIGURATION CELLS

# Check if global_settings exists
if 'global_settings' not in globals():
    print("❌ Error: Please run Cell 2 (Manual Input Configuration) first")
elif not global_settings.get('sequences'):
    print("❌ Error: No sequences configured in Cell 2")
    print("   Please configure at least one sequence before running prediction")
else:
    settings = global_settings

    # GPU verification - EXACT FROM CSV
    print("📍 Checking GPU availability...")
    try:
        import torch
        if torch.cuda.is_available():
            print(f"✅ GPU: {torch.cuda.get_device_name(0)} ({torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB)")
        else:
            print("⚠️  WARNING: No GPU detected - predictions will be very slow")
    except ImportError:
        print("❌ PyTorch not available")

    # Check kernel test status - EXACT FROM CSV
    if not settings.get('kernels_tested', False):
        print("\n⚠️  WARNING: Kernel preflight test not run!")
        print("   Running with --no_kernels by default for safety")
        settings['use_no_kernels'] = True

    use_no_kernels_flag = settings.get('use_no_kernels', True)
    print(f"\n🔧 Kernel mode: {'--no_kernels' if use_no_kernels_flag else 'WITH kernels'}")

    if use_no_kernels_flag:
        # --no_kernels only disables the optimized cuEquivariance kernels; the
        # model still runs on the GPU when one is available
        print("   (Optimized kernels disabled - slower but more compatible)")
    else:
        print("   (Using CUDA kernels - faster performance)")

    # Helper functions - EXACT FROM CSV
    def find_or_create_folder(drive, folder_name, parent_id='root'):
        if not drive:
            return None
        try:
            file_list = drive.ListFile({
                'q': f"title='{folder_name}' and '{parent_id}' in parents and mimeType='application/vnd.google-apps.folder' and trashed=false"
            }).GetList()
            if file_list:
                print(f"✅ Found existing folder: {folder_name}")
                return file_list[0]['id']
            else:
                folder = drive.CreateFile({
                    'title': folder_name,
                    'mimeType': 'application/vnd.google-apps.folder',
                    'parents': [{'id': parent_id}]
                })
                folder.Upload()
                print(f"✅ Created new folder: {folder_name}")
                return folder['id']
        except Exception as e:
            print(f"⚠️  Error with folder: {e}")
            return None

    def upload_to_drive(drive, local_path, folder_id):
        if not drive or not folder_id:
            return None
        try:
            file = drive.CreateFile({
                'title': os.path.basename(local_path),
                'parents': [{'id': folder_id}]
            })
            file.SetContentFile(local_path)
            file.Upload()
            return file['alternateLink']
        except Exception as e:
            print(f"⚠️  Upload failed: {e}")
            return None

    # EXACT _generate_yaml from CSV BoltzJobProcessor
    def _generate_yaml(job):
        """Generate YAML format for Boltz-2 - EXACT COPY FROM CSV PROCESSOR"""
        lines = ["version: 1", "sequences:"]

        # Group sequences by type and content
        protein_groups = {}
        dna_groups = {}
        rna_groups = {}
        ligand_groups = {}

        for seq in job['sequences']:
            seq_type = seq['type']

            if seq_type == 'protein':
                key = (seq['sequence'], tuple(sorted((m['position'], m['ccd']) for m in seq['modifications'])) if seq['modifications'] else ())
                if key not in protein_groups:
                    protein_groups[key] = []
                protein_groups[key].append(seq)

            elif seq_type == 'dna':
                key = (seq['sequence'], tuple(sorted((m['position'], m['ccd']) for m in seq['modifications'])) if seq['modifications'] else ())
                if key not in dna_groups:
                    dna_groups[key] = []
                dna_groups[key].append(seq)

            elif seq_type == 'rna':
                key = (seq['sequence'], tuple(sorted((m['position'], m['ccd']) for m in seq['modifications'])) if seq['modifications'] else ())
                if key not in rna_groups:
                    rna_groups[key] = []
                rna_groups[key].append(seq)

            elif seq_type == 'ligand':
                if 'smiles' in seq:
                    key = ('smiles', seq['smiles'])
                else:
                    key = ('ccd', seq['ccd'])
                if key not in ligand_groups:
                    ligand_groups[key] = []
                ligand_groups[key].append(seq)

        # Write protein sequences
        for (sequence, mod_tuple), seqs in protein_groups.items():
            lines.append("  - protein:")
            chain_ids = [s['id'] for s in seqs]
            if len(chain_ids) == 1:
                lines.append(f"      id: {chain_ids[0]}")
            else:
                lines.append(f"      id: [{', '.join(chain_ids)}]")
            lines.append(f"      sequence: {sequence}")

            if seqs[0]['modifications']:
                lines.append("      modifications:")
                for mod in seqs[0]['modifications']:
                    # Boltz reads each modification as a position/ccd pair
                    lines.append(f"        - position: {mod['position']}")
                    lines.append(f"          ccd: {mod['ccd']}")

        # Write DNA sequences
        for (sequence, mod_tuple), seqs in dna_groups.items():
            lines.append("  - dna:")
            chain_ids = [s['id'] for s in seqs]
            if len(chain_ids) == 1:
                lines.append(f"      id: {chain_ids[0]}")
            else:
                lines.append(f"      id: [{', '.join(chain_ids)}]")
            lines.append(f"      sequence: {sequence}")

            if seqs[0]['modifications']:
                lines.append("      modifications:")
                for mod in seqs[0]['modifications']:
                    # Boltz reads each modification as a position/ccd pair
                    lines.append(f"        - position: {mod['position']}")
                    lines.append(f"          ccd: {mod['ccd']}")

        # Write RNA sequences
        for (sequence, mod_tuple), seqs in rna_groups.items():
            lines.append("  - rna:")
            chain_ids = [s['id'] for s in seqs]
            if len(chain_ids) == 1:
                lines.append(f"      id: {chain_ids[0]}")
            else:
                lines.append(f"      id: [{', '.join(chain_ids)}]")
            lines.append(f"      sequence: {sequence}")

            if seqs[0]['modifications']:
                lines.append("      modifications:")
                for mod in seqs[0]['modifications']:
                    # Boltz reads each modification as a position/ccd pair
                    lines.append(f"        - position: {mod['position']}")
                    lines.append(f"          ccd: {mod['ccd']}")

        # Write ligand sequences
        for (lig_type, lig_value), seqs in ligand_groups.items():
            lines.append("  - ligand:")
            chain_ids = [s['id'] for s in seqs]
            if len(chain_ids) == 1:
                lines.append(f"      id: {chain_ids[0]}")
            else:
                lines.append(f"      id: [{', '.join(chain_ids)}]")

            if lig_type == 'smiles':
                lines.append(f"      smiles: '{lig_value}'")
            else:
                lines.append(f"      ccd: {lig_value}")

        # Add constraints if present. Boltz expects `constraints` to be a list
        # whose items are single-key mappings such as `pocket` and `bond`.
        has_pocket = bool(job.get('pocket'))
        has_bonds = bool(job.get('covalent_bonds'))
        if has_pocket or has_bonds:
            lines.append("constraints:")

        if has_pocket:
            # A contact given as a (chain, residue) pair is written as
            # [CHAIN_ID, RES_IDX]; any other value is passed through unchanged.
            contacts = ', '.join(
                f"[{c[0]}, {c[1]}]" if isinstance(c, (list, tuple)) else str(c)
                for c in job['pocket']['contacts']
            )
            lines.append("  - pocket:")
            lines.append(f"      binder: {job['pocket']['binder']}")
            lines.append(f"      contacts: [{contacts}]")

        if has_bonds:
            for bond in job['covalent_bonds']:
                lines.append("  - bond:")
                lines.append(f"      atom1: [{bond['atom1'][0]}, {bond['atom1'][1]}, {bond['atom1'][2]}]")
                lines.append(f"      atom2: [{bond['atom2'][0]}, {bond['atom2'][1]}, {bond['atom2'][2]}]")

        return "\n".join(lines)

    # STEP 1: Convert manual sequences to CSV processor format
    print("\n🔄 Converting manual input to job format...")

    csv_sequences = []

    # STEP 2: Get modifications from Cell 4.1 if present
    modifications_by_seq = {}
    if settings.get('modifications_list'):
        print(f"   ✓ Found {len(settings['modifications_list'])} modifications from Cell 4.1")
        for mod in settings['modifications_list']:
            # Cell 4.1 stores each entry as
            # {'chain_id': 'A', 'position': 10, 'ccd': 'SEP'}; group by chain and
            # keep only {'position': 10, 'ccd': 'SEP'} for the YAML writer
            seq_id = mod['chain_id']
            modifications_by_seq.setdefault(seq_id, []).append({
                'position': mod['position'],
                'ccd': mod['ccd']
            })

    # Convert each sequence with modifications attached
    for seq in settings['sequences']:
        for chain_id in seq.get('chain_ids', [seq['name']]):
            # Get modifications for this chain
            mods = modifications_by_seq.get(chain_id, [])

            seq_dict = {
                'type': seq['type'],
                'id': chain_id,
                'modifications': mods if mods else []
            }

            # Handle different sequence types
            if seq['type'] in ['protein', 'dna', 'rna']:
                seq_dict['sequence'] = seq['content']
            elif seq['type'] == 'smiles':
                seq_dict['type'] = 'ligand'
                seq_dict['smiles'] = seq['content']
            elif seq['type'] == 'ccd':
                seq_dict['type'] = 'ligand'
                seq_dict['ccd'] = seq['content']

            csv_sequences.append(seq_dict)

    print(f"   ✓ Converted {len(csv_sequences)} sequence entries")

    # STEP 3: Convert pocket restraints from Cell 4.2 if present
    pocket_config = None
    if settings.get('binder_chain') and settings.get('pocket_contacts'):
        print("   ✓ Found pocket restraints from Cell 4.2")
        # Convert from {'chain_id': 'A', 'residue': 66} to ['A', 66]. The chain
        # ID is kept so that contacts on different chains stay distinguishable.
        contacts = [[c['chain_id'], c['residue']] for c in settings['pocket_contacts']]
        pocket_config = {
            'binder': settings['binder_chain'],
            'contacts': contacts
        }
        print(f"     Binder: {pocket_config['binder']}, Contacts: {contacts}")

    # STEP 4: Get covalent bonds from Cell 4.3 (already in correct format)
    covalent_config = settings.get('covalent_bonds')
    if covalent_config:
        print(f"   ✓ Found {len(covalent_config)} covalent bonds from Cell 4.3")

    # Create job structure in CSV format
    job = {
        'name': settings['final_jobname'],
        'sequences': csv_sequences,
        'pocket': pocket_config,
        'covalent_bonds': covalent_config
    }

    print("✅ Job configuration complete")

    # Setup Google Drive folder if configured - EXACT FROM CSV
    folder_id = None
    if settings.get('drive'):
        folder_id = find_or_create_folder(
            settings['drive'],
            settings.get('gdrive_folder_name', 'Boltz2_Predictions')
        )

    # SINGLE JOB PROCESSING - EXACT LOGIC FROM CSV LOOP
    print("\n" + "=" * 60)
    print("🚀 STARTING PREDICTION")
    print("=" * 60)
    print(f"Job: {job['name']}")
    print(f"Sequences: {len(job['sequences'])}")
    if job['pocket']:
        print(f"Pocket restraints: {len(job['pocket']['contacts'])} contacts")
    if job['covalent_bonds']:
        print(f"Covalent bonds: {len(job['covalent_bonds'])} bonds")
    print("=" * 60)

    job_name = job['name']
    job_dir = job_name
    os.makedirs(job_dir, exist_ok=True)

    # Generate YAML file - EXACT FROM CSV
    yaml_content = _generate_yaml(job)
    yaml_file = os.path.join(job_dir, f"{job_name}.yaml")

    with open(yaml_file, 'w') as f:
        f.write(yaml_content)

    print("📝 Generated YAML configuration")

    # Build Boltz command - EXACT FROM CSV LINE FOR LINE
    cmd_parts = [
        "boltz", "predict", yaml_file,
        "--out_dir", job_dir,
        "--recycling_steps", str(settings.get('recycling_steps', 6)),
        "--sampling_steps", str(settings.get('sampling_steps', 200)),
        "--diffusion_samples", str(settings.get('diffusion_samples', 5)),
        "--max_parallel_samples", str(settings.get('max_parallel_samples', 5)),
        "--step_scale", str(settings.get('step_scale', 1.638)),
        "--output_format", settings.get('output_format', 'mmcif'),
        "--max_msa_seqs", str(settings.get('max_msa_seqs', 8192)),
        "--override"
    ]

    # Conditionally add --no_kernels based on preflight test - EXACT FROM CSV
    if settings.get('use_no_kernels', True):
        cmd_parts.append("--no_kernels")

    # Add MSA server if configured - EXACT FROM CSV
    if settings.get('use_msa_server', True):
        cmd_parts.extend([
            "--use_msa_server",
            "--msa_server_url", settings.get('msa_server_url', 'https://api.colabfold.com'),
            "--msa_pairing_strategy", settings.get('msa_pairing_strategy', 'greedy')
        ])

    # Add optional flags - EXACT FROM CSV
    if settings.get('write_full_pae', False):
        cmd_parts.append("--write_full_pae")
    if settings.get('write_full_pde', False):
        cmd_parts.append("--write_full_pde")
    if settings.get('use_potentials', True):
        cmd_parts.append("--use_potentials")
    if settings.get('predict_affinity', False):
        cmd_parts.extend([
            "--predict_affinity",
            "--sampling_steps_affinity", str(settings.get('sampling_steps_affinity', 200)),
            "--diffusion_samples_affinity", str(settings.get('diffusion_samples_affinity', 5))
        ])
        if settings.get('affinity_mw_correction', False):
            cmd_parts.append("--affinity_mw_correction")

    cmd = " ".join(cmd_parts)
    print(f"🔧 Command: {cmd}")

    # Run prediction. The arguments are passed as a list without a shell, so a
    # job name or URL containing spaces or shell metacharacters is not re-parsed.
    start_time = time.time()
    try:
        result = subprocess.run(
            cmd_parts,
            capture_output=True,
            text=True,
            timeout=7200  # 2 hour timeout
        )

        # CRITICAL: Always show stderr if present - EXACT FROM CSV
        if result.stderr and result.stderr.strip():
            print("\n📋 Boltz output/warnings:")
            stderr_lines = result.stderr.strip().split('\n')
            for line in stderr_lines[-50:]:
                if line.strip():
                    print(f"   {line}")

        if result.returncode == 0:
            # Check for output files - EXACT FROM CSV
            results_dirs = [d for d in os.listdir(job_dir) if d.startswith('boltz_results_')]

            if not results_dirs:
                print("\n❌ No results directory found")
                print(f"   Expected directory starting with 'boltz_results_' in {job_dir}")
                print(f"   Actual contents: {os.listdir(job_dir)}")
            else:
                predictions_dir = os.path.join(job_dir, results_dirs[0])

                # Count structure files - EXACT FROM CSV
                structure_count = 0
                structure_files = []
                for root, dirs, files in os.walk(predictions_dir):
                    for file in files:
                        if file.endswith(('.cif', '.pdb', '.mmcif')):
                            structure_files.append(file)
                            structure_count += 1

                if structure_count > 0:
                    print(f"✅ Generated {structure_count} structure files")
                    for f in structure_files:
                        print(f"   📄 {f}")

                    # Create results zip - EXACT FROM CSV
                    zip_filename = f"{job_name}_results.zip"
                    with zipfile.ZipFile(zip_filename, 'w', zipfile.ZIP_DEFLATED) as zipf:
                        for root, dirs, files in os.walk(job_dir):
                            for file in files:
                                file_path = os.path.join(root, file)
                                arcname = os.path.relpath(file_path, job_dir)
                                zipf.write(file_path, arcname)

                    print(f"📦 Created: {zip_filename}")

                    # Google Drive upload - EXACT FROM CSV
                    if folder_id:
                        try:
                            url = upload_to_drive(settings['drive'], zip_filename, folder_id)
                            if url:
                                print(f"  ☁️  Uploaded to Google Drive: {url}")
                        except Exception as e:
                            print(f"⚠️  Google Drive upload failed: {e}")

                    elapsed = time.time() - start_time
                    print(f"⏱️  Completed in {elapsed:.1f}s")
                else:
                    print("❌ No structure files found in output")

        else:
            print(f"\n❌ Prediction failed (return code: {result.returncode})")
            if result.stderr:
                print("Error output:")
                print(result.stderr[-1000:])

    except subprocess.TimeoutExpired:
        print("❌ Prediction timed out after 2 hours")
    except Exception as e:
        print(f"❌ Unexpected error: {e}")
        import traceback
        traceback.print_exc()

    print("\n" + "=" * 60)
    print("✅ PREDICTION COMPLETE")
    print("=" * 60)

📍 Checking GPU availability...
✅ GPU: NVIDIA A100-SXM4-40GB (42.5 GB)

🔧 Kernel mode: --no_kernels
   (Using CPU fallback - slower but more compatible)

🔄 Converting manual input to job format...
   ✓ Converted 2 sequence entries
✅ Job configuration complete
✅ Found existing folder: Boltz2_Predictions

🚀 STARTING PREDICTION
Job: Cf4AVR4-single_791fc
Sequences: 2
📝 Generated YAML configuration
🔧 Command: boltz predict Cf4AVR4-single_791fc/Cf4AVR4-single_791fc.yaml --out_dir Cf4AVR4-single_791fc --recycling_steps 6 --sampling_steps 200 --diffusion_samples 5 --max_parallel_samples 5 --step_scale 1.638 --output_format mmcif --max_msa_seqs 8192 --override --no_kernels --use_msa_server --msa_server_url https://api.colabfold.com --msa_pairing_strategy greedy --write_full_pae --use_potentials

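The command line shown above is assembled from the settings dictionary one flag at a time. A minimal sketch of that pattern follows, using only a subset of the flags that appear in the run cell. `build_boltz_command` is a hypothetical helper, the `my job` paths are made-up inputs, and the defaults simply mirror the values used in this notebook rather than Boltz's own defaults.

```python
import shlex


def build_boltz_command(yaml_file, out_dir, settings):
    """Assemble a `boltz predict` argument list from a settings dict."""
    cmd = [
        "boltz", "predict", yaml_file,
        "--out_dir", out_dir,
        "--recycling_steps", str(settings.get("recycling_steps", 6)),
        "--sampling_steps", str(settings.get("sampling_steps", 200)),
        "--diffusion_samples", str(settings.get("diffusion_samples", 5)),
    ]
    # Boolean settings map to flags that are either present or absent
    if settings.get("use_no_kernels", True):
        cmd.append("--no_kernels")
    if settings.get("write_full_pae", False):
        cmd.append("--write_full_pae")
    return cmd


cmd = build_boltz_command("my job/my job.yaml", "my job", {"write_full_pae": True})
# shlex.join quotes arguments that contain spaces, so the printed command
# can be pasted into a shell as-is
print(shlex.join(cmd))
```

Keeping the command as a list until the last moment means it can be handed directly to `subprocess.run` without `shell=True`, while `shlex.join` still gives a readable one-line version for logging.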