## What is Molecular Docking?

Molecular docking is a computational method used to predict how small molecules (ligands) bind to proteins (receptors). It is used during the initial in silico screening of candidate small molecules, for example in:

- Drug discovery and design

- Understanding protein-ligand interactions

- Predicting binding affinities

- Optimizing lead compounds

## Workshop Overview
In this workshop, you'll learn to:

- Prepare protein structures for docking

- Process ligand molecules

- Define binding sites and search spaces

- Run molecular docking simulations

- Analyze and visualize results

We will perform what is known as rigid docking: the receptor protein is assumed to be fixed (rigid) in its structure, and only the ligand changes its conformation through bond rotations. The alternative is flexible docking, which also allows the protein structure to change and adapt to ligand binding.


# Molecular Docking Workflow with AutoDock Vina

## Import Libraries and Utility Functions

This cell contains all the essential imports and utility functions needed for molecular docking:

### Key Libraries:
- **`vina`**: Python wrapper for AutoDock Vina docking engine
- **`rdkit`**: Chemical informatics toolkit for molecular manipulation
- **`pdbfixer`**: Fixes missing atoms and residues in PDB structures
- **`py3Dmol`**: Interactive 3D molecular visualization

### Utility Functions:

1. **`run_command()`**: Safely executes shell commands (e.g., OpenBabel conversions)
2. **`validate_pdb_files()`**: Checks that input PDB files exist and are readable
3. **`protonate_protein()`**: Adds missing atoms and hydrogens at specified pH
4. **`prepare_protein_openbabel()`**: Converts protein to PDBQT format and cleans inappropriate tags
5. **`prepare_ligand_openbabel()`**: Prepares ligand with protonation, 3D coordinates, and partial charges
6. **`get_ligand_center()`**: Calculates geometric center for defining binding site
7. **`parse_docking_results()`**: Extracts binding scores and poses from Vina output
8. **`save_best_pose_pdb()`**: Saves the best-scoring pose for visualization
9. **`visualize_docking_complex()`**: Creates interactive 3D visualization of results

### Special Handling:
- **AlphaFold3 compatibility**: Extra filtering steps for structure files with missing information
- **PDBQT cleaning**: Removes inappropriate ligand-specific tags from protein files
- **Validation checks**: Ensures no protein residues contaminate ligand files


In [None]:
# Import essential libraries for molecular docking workflow
import subprocess
from pathlib import Path
import numpy as np

# Suppress RDKit warnings for cleaner output
from rdkit import rdBase
rdBase.DisableLog('rdApp.*')

# Core docking and molecular manipulation libraries
from vina import Vina                    # AutoDock Vina Python interface
from rdkit import Chem                   # Chemical informatics
from pdbfixer import PDBFixer            # Fix PDB structure issues
from openmm.app import PDBFile           # PDB file handling
import py3Dmol                           # 3D molecular visualization

def run_command(command):
    """
    Execute shell commands safely with error handling.
    
    This function is used throughout the workflow to run external tools
    like OpenBabel for file format conversions.
    """
    print(f"Running: {command}")
    proc = subprocess.run(command, shell=True, capture_output=True, text=True)
    if proc.returncode != 0:
        print(proc.stderr)
        raise RuntimeError(f"Command failed: {command}")
    else:
        print(proc.stdout)

def validate_pdb_files(receptor_pdb_file, ligand_pdb_file):
    """
    Validate that input PDB files exist and are accessible.
    
    Essential first step to avoid runtime errors later in the workflow.
    """
    if not Path(receptor_pdb_file).exists():
        raise FileNotFoundError(f"Receptor PDB file not found: {receptor_pdb_file}")
    if not Path(ligand_pdb_file).exists():
        raise FileNotFoundError(f"Ligand PDB file not found: {ligand_pdb_file}")
    
    print(f" Validated input files:")
    print(f"   Receptor: {receptor_pdb_file}")
    print(f"   Ligand: {ligand_pdb_file}")

def protonate_protein(pdb_file, output_pdb_file, ph=7.4):
    """
    Add missing atoms and hydrogens to protein structure at specified pH.
    
    This is crucial because:
    - Crystal structures often lack hydrogen atoms
    - Protein protonation states affect binding
    - Different pH conditions change ionizable residues
    
    Parameters:
    -----------
    ph : float
        Physiological pH (7.4) or experimental conditions (e.g., 5.5 for HIV protease)
    """
    fixer = PDBFixer(filename=str(pdb_file))
    fixer.removeHeterogens(keepWater=False)  # Remove non-protein molecules
    fixer.findMissingResidues()              # Identify missing residues
    fixer.findMissingAtoms()                 # Identify missing atoms
    fixer.addMissingAtoms()                  # Add missing heavy atoms
    fixer.addMissingHydrogens(pH=ph)         # Add hydrogens at specified pH
    
    with open(str(output_pdb_file), 'w') as out_f:
        PDBFile.writeFile(fixer.topology, fixer.positions, out_f)
    print(f"Protein protonated at pH {ph} and saved to {output_pdb_file}")

def prepare_protein_openbabel(protein_pdb_path, protein_pdbqt_path):
    """
    Convert protein PDB to PDBQT format required by AutoDock Vina.
    
    PDBQT format includes:
    - Partial charges on atoms
    - Atom types for force field
    - Rotatable bonds information (for ligands)
    
    The function also cleans inappropriate ligand-specific tags that
    sometimes appear in protein files.
    """
    # Convert PDB to PDBQT using OpenBabel (paths quoted in case they contain spaces)
    run_command(f'obabel "{protein_pdb_path}" -O "{protein_pdbqt_path}"')
    
    # Clean the PDBQT file by removing ligand-specific tags
    with open(protein_pdbqt_path, 'r') as f:
        lines = f.readlines()
    
    # Torsion-tree records only make sense for flexible ligands. Drop the tag
    # lines themselves but keep every ATOM/HETATM record, including the atoms
    # written between ROOT and ENDROOT (skipping that whole section would
    # silently delete receptor atoms).
    ligand_tags = ('ROOT', 'ENDROOT', 'BRANCH', 'ENDBRANCH', 'TORSDOF')
    cleaned_lines = [
        line for line in lines
        if not line.strip().startswith(ligand_tags)
    ]
    
    with open(protein_pdbqt_path, 'w') as f:
        f.writelines(cleaned_lines)
    
    print(f"Protein PDBQT prepared and cleaned: {protein_pdbqt_path}")

def prepare_ligand_openbabel(ligand_in_path, ligand_pdbqt_path, ph=7.4, charge_model="gasteiger"):
    """
    Prepare ligand for docking by:
    1. Adding hydrogens at specified pH
    2. Generating 3D coordinates (if needed)
    3. Assigning partial charges
    4. Converting to PDBQT format
    
    Parameters:
    -----------
    charge_model : str
        Method for assigning partial charges ('gasteiger' is most common)
    """
    cmd = (
        f'obabel "{ligand_in_path}" '
        f'-O "{ligand_pdbqt_path}" '
        f'-p {ph} '                          # Protonate at specified pH
        f'--gen3d '                          # Generate 3D coordinates
        f'--partialcharge {charge_model}'    # Assign partial charges
    )
    run_command(cmd)

    # Validation: ensure no protein residue names in ligand file
    protein_residues = {
        'ALA','ARG','ASN','ASP','CYS','GLN','GLU','GLY','HIS','ILE',
        'LEU','LYS','MET','PHE','PRO','SER','THR','TRP','TYR','VAL'
    }
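    with open(ligand_pdbqt_path, 'r') as f:
        for line_num, line in enumerate(f, 1):
            if line.startswith(('ATOM', 'HETATM')):
                # Compare only the residue-name field (columns 18-20); a substring
                # search over the whole line can match atom names by accident.
                res = line[17:20].strip()
                if res in protein_residues:
                    raise ValueError(f"Protein residue {res} found in ligand PDBQT at line {line_num}")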

def get_ligand_center(ligand_pdb_path):
    """
    Calculate the geometric center of the ligand.
    
    This center point is used to define where the docking search space
    should be positioned. In co-crystal structures, this represents
    the known binding site location.
    """
    mol = Chem.MolFromPDBFile(str(ligand_pdb_path), removeHs=False)
    if mol is None:
        raise ValueError(f"Could not load ligand from {ligand_pdb_path}")
    conf = mol.GetConformer()
    coords = np.array(conf.GetPositions())
    center = coords.mean(axis=0)
    print(f"Ligand geometric center: {center.tolist()}")
    return center.tolist()

def parse_docking_results(docked_poses_pdbqt):
    """
    Extract binding scores and poses from AutoDock Vina output.
    
    Vina outputs multiple poses ranked by binding affinity.
    More negative scores indicate stronger binding.
    """
    poses = []
    current_pose = []
    current_score = None
    
    with open(docked_poses_pdbqt, 'r') as f:
        for line in f:
            if line.startswith('REMARK VINA RESULT:'):
                # Extract score from line like "REMARK VINA RESULT: -8.5 0.000 0.000"
                parts = line.split()
                current_score = float(parts[3])
            elif line.startswith('MODEL'):
                current_pose = [line]
            elif line.startswith('ENDMDL'):
                current_pose.append(line)
                if current_score is not None:
                    poses.append({'score': current_score, 'pdbqt_lines': current_pose})
                current_pose = []
                current_score = None
            elif current_pose:
                current_pose.append(line)
    
    return poses

def save_best_pose_pdb(poses, best_pose_pdb_path):
    """
    Save the best (lowest energy) docking pose as a PDB file for visualization.
    
    The best pose is determined by the most negative binding affinity score.
    """
    if not poses:
        raise ValueError("No docking poses found")
    
    # Sort poses by score (more negative = better binding)
    poses.sort(key=lambda x: x['score'])
    best_pose = poses[0]  # first entry after sorting is the lowest-energy pose
    
    # Convert PDBQT to PDB for visualization
    temp_pdbqt = Path(str(best_pose_pdb_path).replace('.pdb', '_temp.pdbqt'))
    
    with open(temp_pdbqt, 'w') as f:
        f.writelines(best_pose['pdbqt_lines'])
    
    # Convert to PDB using OpenBabel
    run_command(f"obabel {temp_pdbqt} -O {best_pose_pdb_path}")
    
    # Clean up temporary file
    temp_pdbqt.unlink()
    
    print(f"Best pose (score: {best_pose['score']:.3f} kcal/mol) saved to {best_pose_pdb_path}")
    return best_pose['score']

def visualize_docking_complex(protein_pdb_path, docked_ligand_pdb_path, box_center, box_size):
    """
    Create interactive 3D visualization of the docked protein-ligand complex.
    
    Visualization includes:
    - Protein shown as cartoon representation
    - Ligand shown as stick model
    - Docking search box shown as wireframe
    """
    if isinstance(box_size, (int, float)):
        box_size = [box_size] * 3
    
    view = py3Dmol.view(width=800, height=600)
    
    # Add protein structure
    with open(protein_pdb_path) as f:
        protein_str = f.read()
    view.addModel(protein_str, "pdb")
    view.setStyle({'model': 0}, {'cartoon': {'color': 'spectrum'}})
    
    # Add docked ligand
    with open(docked_ligand_pdb_path) as f:
        ligand_str = f.read()
    view.addModel(ligand_str, "pdb")
    view.setStyle({'model': 1}, {'stick': {'colorscheme': 'greenCarbon'}})
    
    # Add docking box visualization
    x, y, z = box_center
    view.addBox({
        'center': {'x': x, 'y': y, 'z': z},
        'dimensions': {'w': box_size[0], 'h': box_size[1], 'd': box_size[2]},
        'color': 'red',
        'opacity': 0.3,
        'wireframe': True,
        'linewidth': 3
    })
    
    view.zoomTo()
    view.show()

print("✅ All utility functions loaded successfully!")
print("Ready to proceed with molecular docking workflow.")


✅ All utility functions loaded successfully!
Ready to proceed with molecular docking workflow.


## Main Docking Function

This cell defines the complete molecular docking workflow function `dock_separate_pdb_files()`.

### Function Purpose:
Orchestrates the entire docking process from structure preparation to result visualization.

### Key Parameters:
- **`receptor_pdb_file`**: Path to protein structure (target)
- **`ligand_pdb_file`**: Path to small molecule ligand
- **`output_prefix`**: Naming prefix for all generated files
- **`pH`**: Protonation pH (7.4 physiological, 5.5 for HIV protease, etc.)
- **`box_size`**: Search space dimensions in Angstroms (typically 15-25 Å)
- **`num_poses`**: Number of binding poses to generate (5-20 recommended)
- **`exhaustiveness`**: Search thoroughness (8-32, higher = more accurate but slower)

### Workflow Steps:
1. **Validation**: Check input files exist and are accessible
2. **Protein Preparation**: Fix missing atoms, add hydrogens at specified pH
3. **Format Conversion**: Convert structures to PDBQT format for Vina
4. **Binding Site Definition**: Calculate search space center from ligand position
5. **Docking Simulation**: Run AutoDock Vina with specified parameters
6. **Results Analysis**: Parse scores, rank poses, interpret binding affinities
7. **Visualization**: Generate interactive 3D view of best binding pose


### Critical Preparation Parameters:

#### pH Settings:
- **Default pH 7.4**: Physiological conditions for most proteins
- **HIV protease pH 5.5**: Optimal activity pH, affects ASP25 protonation states
- **Alkaline proteins pH 8.0+**: Some enzymes require basic conditions

#### Charge Models:
- **Gasteiger**: Fast, empirical partial charge assignment (default)
- **AM1-BCC**: More accurate quantum mechanical charges (slower)
- **MMFF94**: Force field-based charges, good for drug-like molecules


## Main Docking Function: Detailed Parameterization

### Core Docking Parameters and Their Significance:

#### 1. **pH (Protonation State Control)**
`pH = 5.5  # Example for HIV protease`

**Significance**:
- Controls ionization states of ionizable residues (ASP, GLU, HIS, LYS, ARG, CYS, TYR)
- **Critical for binding**: Different protonation states create different electrostatic environments
- **pH-dependent binding**: Many drugs show pH-dependent activity
- **Examples**:
  - HIV protease: pH 5.5 (ASP25 monoprotonated for catalysis)
  - Physiological: pH 7.4 (blood/tissue conditions)
  - Stomach: pH 1.5-3.5 (gastric drug delivery)
  - Intestine: pH 8.0+ (enteric conditions)
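
To make the link between pH and ionization state concrete, the sketch below applies the Henderson-Hasselbalch relation to a few side chains. The pKa values are approximate textbook numbers for free amino acids and are an assumption here; pKa values inside a folded protein can shift substantially (the catalytic ASP25 pair in HIV protease is one such case), so this is an illustration rather than a prediction.

```python
# Approximate model pKa values for isolated side chains (assumed textbook
# numbers, not taken from this notebook; protein environments shift them).
MODEL_PKA = {
    "ASP": 3.9, "GLU": 4.3, "HIS": 6.0, "CYS": 8.3,
    "TYR": 10.1, "LYS": 10.5, "ARG": 12.5,
}


def fraction_protonated(pka, ph):
    """Fraction of a titratable group in its protonated form.

    Henderson-Hasselbalch: [HA] / ([HA] + [A-]) = 1 / (1 + 10**(pH - pKa)).
    """
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


if __name__ == "__main__":
    for ph in (5.5, 7.4):
        print(f"pH {ph}")
        for residue, pka in MODEL_PKA.items():
            print(f"  {residue}: {fraction_protonated(pka, ph):.3f} protonated")
```

A group is half protonated when pH equals its pKa, which is why histidine (model pKa near 6) is the residue most sensitive to the choice between pH 5.5 and 7.4.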

#### 2. **Box Size (Search Space Definition)**
`box_size = 20.0  # Angstroms`

**Significance**:
- Defines cubic search volume around binding site
- **15-20 Å**: Typical binding pockets (most drug targets)
- **25-30 Å**: Large binding sites or allosteric sites
- **>30 Å**: Multi-domain proteins or uncertain binding sites
- **Trade-offs**:
  - Smaller: Faster, focused search, but may miss alternative sites
  - Larger: Comprehensive search, but computationally expensive
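
One way to pick a box size instead of guessing is to derive it from the extent of a reference ligand plus a margin. The helper below is a hypothetical sketch (the name `box_from_coords` and the 5 Å default padding are assumptions, not part of this workflow). Note that it centres the box on the bounding-box midpoint, whereas `get_ligand_center()` uses the mean of the atom positions; the two can differ for asymmetric ligands.

```python
def box_from_coords(coords, padding=5.0):
    """Return (center, size) of an axis-aligned box around a set of atoms.

    coords  : iterable of (x, y, z) tuples in Angstroms
    padding : margin added on every side of the ligand (assumed value)
    """
    coords = list(coords)
    if not coords:
        raise ValueError("coords must contain at least one atom")
    xs, ys, zs = zip(*coords)
    mins = (min(xs), min(ys), min(zs))
    maxs = (max(xs), max(ys), max(zs))
    # Midpoint of the bounding box along each axis
    center = [(lo + hi) / 2.0 for lo, hi in zip(mins, maxs)]
    # Ligand extent plus padding on both sides
    size = [(hi - lo) + 2.0 * padding for lo, hi in zip(mins, maxs)]
    return center, size
```

The returned `size` list has one entry per axis, so it could be passed as a non-cubic `box_size` to `compute_vina_maps()` in place of `[box_size] * 3`.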

#### 3. **Number of Poses**
`num_poses = 9  # Range: 1-20 typical`

**Significance**:
- Number of top-ranked binding conformations to retain
- **1-5**: Quick screening, single binding mode expected
- **9-12**: Standard drug discovery (captures binding diversity)
- **15-20**: Research applications, exploring multiple binding modes
- **Considerations**:
  - More poses = better sampling of conformational space
  - Diminishing returns after ~10-15 poses for most systems

#### 4. **Exhaustiveness (Search Thoroughness)**
`exhaustiveness = 25  # Range: 1-32`

**Significance**:
- Controls depth of conformational search
- **1-8**: Fast screening, preliminary results
- **8-16**: Standard accuracy for most applications
- **16-32**: High accuracy, research-quality results
- **Computational cost scales linearly with exhaustiveness**
- **Rule of thumb**: Vina's default exhaustiveness is 8; raise it (e.g., to 16-32) for publication-quality results

### Advanced Docking Parameters (Vina Internal):

#### Scoring Function Parameters:
- **Vina scoring function**: Empirical, trained on binding affinity data
- **Components**:
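  - Steric terms (two attractive Gaussians plus a repulsion term, playing the role of van der Waals interactions)
  - Hydrogen bonding
  - Hydrophobic interactions
  - Rotatable-bond penalty (a proxy for the entropy lost on binding)
  - Note: the Vina scoring function has no explicit electrostatic term, so partial charges in the PDBQT files do not enter the score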


### Output Interpretation:

#### Binding Affinity Ranges:
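The Kd values below are what each score would correspond to if it were read as a true ΔG at 298 K (ΔG = RT ln Kd); docking scores are only rough estimates of ΔG.

- **< -12 kcal/mol**: Extremely strong binding (about 1 nM Kd or tighter)
- **-9 to -12**: Very strong binding (roughly 1 nM to 0.25 µM Kd)
- **-7 to -9**: Strong binding (roughly 0.25 to 7 µM Kd, drug-like)
- **-5 to -7**: Moderate binding (roughly 7 to 200 µM Kd, lead optimization)
- **-3 to -5**: Weak binding (roughly 0.2 to 6 mM Kd, fragment screening)
- **> -3**: Very weak/non-specific binding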
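
The Kd labels attached to score ranges come from the thermodynamic relation ΔG = RT ln Kd. The sketch below converts in both directions so a score can be compared with an experimental constant. It assumes T = 298.15 K and treats a docking score as if it were a true binding free energy, which it only approximates; the function names are illustrative.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def dg_from_kd(kd_molar, temp_k=298.15):
    """Binding free energy (kcal/mol) from a dissociation constant in mol/L."""
    return R_KCAL * temp_k * math.log(kd_molar)


def kd_from_dg(dg_kcal, temp_k=298.15):
    """Dissociation constant (mol/L) implied by a free energy in kcal/mol."""
    return math.exp(dg_kcal / (R_KCAL * temp_k))


if __name__ == "__main__":
    for score in (-12.0, -9.0, -7.0, -5.0, -3.0):
        print(f"{score:6.1f} kcal/mol  ->  Kd ~ {kd_from_dg(score):.2e} M")
    # Experimental Ki quoted in this notebook for darunavir: 0.16 nM
    print(f"Ki = 0.16 nM  ->  dG ~ {dg_from_kd(0.16e-9):.2f} kcal/mol")
```

Because the relation is logarithmic, every 1.36 kcal/mol at room temperature corresponds to a tenfold change in Kd.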

#### RMSD Values (Root Mean Square Deviation):
- **< 2.0 Å**: Similar binding poses (same binding mode)
- **2.0-4.0 Å**: Related poses (similar region, different orientation)
- **> 4.0 Å**: Different binding modes or sites
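
Vina writes these RMSD values next to each score, in the `REMARK VINA RESULT:` line of every model: affinity, then a lower-bound and an upper-bound RMSD measured against the best-scoring pose. The sketch below reads all three fields and applies the thresholds listed above. The function names and the choice to classify on the upper bound are assumptions made for illustration.

```python
def read_vina_results(pdbqt_text):
    """Extract affinity and RMSD bounds from 'REMARK VINA RESULT:' lines."""
    results = []
    for line in pdbqt_text.splitlines():
        if line.startswith("REMARK VINA RESULT:"):
            # Fields after the tag: affinity, RMSD lower bound, RMSD upper bound
            affinity, rmsd_lb, rmsd_ub = (float(x) for x in line.split()[3:6])
            results.append(
                {"affinity": affinity, "rmsd_lb": rmsd_lb, "rmsd_ub": rmsd_ub}
            )
    return results


def pose_relation(rmsd):
    """Map an RMSD to the categories used in this section."""
    if rmsd < 2.0:
        return "similar"
    if rmsd <= 4.0:
        return "related"
    return "different"


if __name__ == "__main__":
    # Made-up example lines in Vina's output layout, not real docking output
    example = (
        "MODEL 1\n"
        "REMARK VINA RESULT:    -8.500      0.000      0.000\n"
        "ENDMDL\n"
        "MODEL 2\n"
        "REMARK VINA RESULT:    -8.100      3.100      5.200\n"
        "ENDMDL\n"
    )
    for rank, r in enumerate(read_vina_results(example), 1):
        print(rank, r["affinity"], pose_relation(r["rmsd_ub"]))
```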

### Parameter Optimization Guidelines:

#### For Different Applications:
1. **Virtual Screening**: pH 7.4, box_size 20Å, num_poses 1-3, exhaustiveness 8
2. **Lead Optimization**: pH optimal, box_size 15-20Å, num_poses 9, exhaustiveness 16
3. **Research Studies**: pH experimental, box_size 20-25Å, num_poses 15, exhaustiveness 24-32
4. **Fragment Screening**: pH 7.4, box_size 25-30Å, num_poses 5, exhaustiveness 12

#### System-Specific Considerations:
- **Flexible proteins**: Larger box sizes, more poses
- **Allosteric sites**: Blind docking or multiple search centers  
- **Metal-containing sites**: Special charge models and constraints
- **Membrane proteins**: Lipid bilayer considerations

## Example Usage: HIV Protease Parameter Analysis

### System-Specific Parameter Selection for HIV Protease:

#### **pH = 5.5** - Critical for HIV Protease Activity
**Biological Significance**:
- HIV protease has optimal activity at pH 5.5
- **ASP25 catalytic residues**: One protonated, one deprotonated
- **Protonation pattern**: Essential for substrate cleavage mechanism
- **Drug design implication**: Inhibitors must account for this ionization state

**Comparison with other pH values**:
- pH 7.4: ASP25 residues both deprotonated (inactive enzyme)
- pH 4.0: Both ASP25 residues protonated (also inactive)
- pH 5.5: Optimal monoprotonated state for catalysis

#### **Box Size = 20.0 Å** - Binding Site Characteristics
**HIV protease binding site properties**:
- **Active site volume**: ~1200 Å³ (relatively large)
- **Binding cleft**: 20 Å length, accommodates peptide substrates
- **Inhibitor binding**: 15-20 Å covers S4 to S4' subsites
- **Rationale**: 20 Å captures full binding pocket without excessive search space

#### **Number of Poses = 9** - Capturing Binding Diversity
**HIV protease binding modes**:
- **Symmetry**: C2-symmetric homodimer creates symmetric binding modes
- **Multiple conformations**: Flap open/closed states
- **Inhibitor flexibility**: Darunavir has 6 rotatable bonds
- **9 poses**: Adequate to sample major binding conformations

#### **Exhaustiveness = 30** - High-Accuracy Search
**Justification for high exhaustiveness**:
- **Research quality**: Publication-standard accuracy
- **Complex system**: Flexible protein-ligand system
- **Validation**: Comparison with experimental crystal structures
- **Computational cost**: Acceptable for single-system detailed study

### Expected vs. Observed Results Analysis:

#### **Experimental Data Comparison**:
- **Crystal structure**: 6OPS with Darunavir (0.84 Å resolution)
- **Binding affinity**: Ki = 0.16 nM (experimental)
- **ΔG calculated**: about -13.4 kcal/mol at 298 K (from Ki using ΔG = RT ln(Ki))
- **Docking prediction**: ~-8.0 kcal/mol (typical underestimation)

#### **Scoring Function Limitations**:
- **Entropy underestimation**: Docking doesn't fully account for conformational entropy
- **Solvation effects**: Simplified desolvation terms
- **Protein flexibility**: Rigid receptor approximation
- **Quantum effects**: Classical force field limitations

### Parameter Sensitivity Analysis:

#### **pH Sensitivity Test**:
`pH_test = [4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.4]`

Expected: Optimal binding around pH 5.5


#### **Box Size Optimization**:
box_sizes =

Expected: Convergence around 20-22 Å for HIV protease


#### **Exhaustiveness Convergence**:
exhaustiveness_levels =

Expected: Convergence by exhaustiveness 24 for this system
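
A convergence claim like this can be checked mechanically once the best score at each setting has been recorded. The helper below is a hypothetical sketch: it takes a mapping from parameter value (for example exhaustiveness) to best score and returns the smallest value after which the score stays within a tolerance. The 0.1 kcal/mol default tolerance is an assumption, and which levels to test is left to the reader.

```python
def converged_level(best_score_by_level, tolerance=0.1):
    """Smallest level whose score all higher levels reproduce within tolerance.

    best_score_by_level : dict mapping a parameter value to the best score
                          (kcal/mol) obtained at that value
    Returns None if no level (other than the highest) meets the criterion.
    """
    levels = sorted(best_score_by_level)
    for i, level in enumerate(levels[:-1]):
        reference = best_score_by_level[level]
        higher = levels[i + 1:]
        if all(abs(best_score_by_level[h] - reference) <= tolerance for h in higher):
            return level
    return None
```

The input dictionary would be filled by calling `dock_separate_pdb_files()` once per setting and storing `results['best_score']`. Because Vina's search is stochastic, repeating each setting with different seeds gives a more reliable picture than a single run.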


### Validation Metrics:

#### **RMSD to Crystal Structure**:
- **< 1.5 Å**: Excellent prediction (near-native pose)
- **1.5-2.5 Å**: Good prediction (correct binding mode)
- **2.5-4.0 Å**: Moderate prediction (similar region)
- **> 4.0 Å**: Poor prediction (different binding mode)
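
The RMSD behind these categories can be computed directly from two coordinate lists. The sketch below is a minimal version under stated assumptions: both poses list the same atoms in the same order, both are in the receptor's coordinate frame (so no superposition is applied), and ligand symmetry is ignored. File conversion tools may reorder atoms, so matching order should be verified before trusting the number.

```python
import math


def in_place_rmsd(coords_a, coords_b):
    """RMSD between two equally ordered coordinate lists, without alignment."""
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must be non-empty and equally long")
    squared = sum(
        (a - b) ** 2
        for atom_a, atom_b in zip(coords_a, coords_b)
        for a, b in zip(atom_a, atom_b)
    )
    return math.sqrt(squared / len(coords_a))


def prediction_quality(rmsd):
    """Label an RMSD to the crystal pose using the thresholds in this section."""
    if rmsd < 1.5:
        return "excellent"
    if rmsd <= 2.5:
        return "good"
    if rmsd <= 4.0:
        return "moderate"
    return "poor"
```

Restricting both lists to heavy atoms is the usual convention, since hydrogen positions are added by the preparation tools rather than observed in the crystal structure.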

#### **Binding Pose Analysis**:
- **Key interactions**: H-bonds with ASP25, ASP29, ASP30
- **Hydrophobic contacts**: VAL82, ILE84, PHE99
- **P1/P1' binding**: Tetrahydrofuran oxygen interactions
- **Flap region**: ILE50/ILE50' hydrophobic interactions

### Clinical Relevance:

#### **Drug Resistance Mutations**:
- **I50V**: Reduces binding affinity (smaller hydrophobic contact)
- **V82A**: Alters P1 binding pocket
- **L76V**: Affects P2 subsite interactions
- **Docking applications**: Predicting resistance mutation effects

#### **Structure-Activity Relationships (SAR)**:
- **P1 modifications**: Tetrahydrofuran vs. cyclopentanol
- **P2 substitutions**: Aromatic ring modifications
- **Linker variations**: Different bridging groups

## Advanced Parameterization and Method Comparison

### Comprehensive Parameter Matrix for Different Applications:

#### **High-Throughput Virtual Screening (HTVS)**:
    htvs_params = {
        'pH': 7.4,                    # Standard physiological
        'box_size': 20.0,             # Standard pocket size
        'num_poses': 1,               # Only best pose needed
        'exhaustiveness': 8,          # Fast screening
        'charge_model': 'gasteiger'   # Fast charge calculation
    }

Throughput: ~1000-10000 compounds/day

#### **Lead Optimization**:
    lead_opt_params = {
        'pH': 5.5,                    # System-specific optimal pH
        'box_size': 18.0,             # Focused on known binding site
        'num_poses': 9,               # Sample binding diversity
        'exhaustiveness': 16,         # Good accuracy/speed balance
        'charge_model': 'gasteiger'
    }

Throughput: ~100-500 compounds/day

#### **Research/Publication Quality**:
    research_params = {
        'pH': 5.5,                    # Experimental conditions
        'box_size': 22.0,             # Comprehensive site coverage
        'num_poses': 15,              # Full conformational sampling
        'exhaustiveness': 32,         # Maximum accuracy
        'charge_model': 'am1bcc'      # High-quality charges (typically computed with an external tool such as antechamber)
    }

Throughput: ~10-50 compounds/day

### Scoring Function Parameters and Weights:

#### **AutoDock Vina Scoring Components**:
Score = Σ(interactions) - entropy_penalty


**Individual Components**:
1. **Gauss 1** (weight: -0.035579): Short-range interactions
2. **Gauss 2** (weight: -0.005156): Medium-range interactions  
3. **Repulsion** (weight: 0.840245): Steric clashes
4. **Hydrophobic** (weight: -0.035069): Hydrophobic contacts
5. **Hydrogen bond** (weight: -0.587439): H-bond interactions

#### **Distance-Dependent Scaling**:
- **0-3 Å**: Full interaction strength
- **3-8 Å**: Gradual decay to zero
- **>8 Å**: No interaction (cutoff)
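
To show how the weighted components combine for a single atom pair, here is a sketch of the pairwise terms. The weights are the ones listed above; the functional forms (Gaussian widths, the piecewise-linear hydrophobic and hydrogen-bond ramps, and the use of a surface distance d = r - R1 - R2) follow the published description of the Vina scoring function as best I recall it and are an assumption here, not something stated in this notebook. The final rotatable-bond normalisation is omitted.

```python
import math

# Weights as listed in this section
WEIGHTS = {
    "gauss1": -0.035579,
    "gauss2": -0.005156,
    "repulsion": 0.840245,
    "hydrophobic": -0.035069,
    "hbond": -0.587439,
}


def gauss1(d):
    return math.exp(-((d / 0.5) ** 2))


def gauss2(d):
    return math.exp(-(((d - 3.0) / 2.0) ** 2))


def repulsion(d):
    return d * d if d < 0.0 else 0.0


def hydrophobic(d):
    # 1 below 0.5 A, 0 above 1.5 A, linear in between
    if d < 0.5:
        return 1.0
    if d > 1.5:
        return 0.0
    return 1.5 - d


def hbond(d):
    # 1 below -0.7 A, 0 above 0 A, linear in between
    if d < -0.7:
        return 1.0
    if d > 0.0:
        return 0.0
    return -d / 0.7


def pair_score(r, radius_sum, hydrophobic_pair=False, hbond_pair=False, cutoff=8.0):
    """Weighted score for one atom pair at interatomic distance r (Angstroms)."""
    if r >= cutoff:
        return 0.0
    d = r - radius_sum  # surface distance
    score = (
        WEIGHTS["gauss1"] * gauss1(d)
        + WEIGHTS["gauss2"] * gauss2(d)
        + WEIGHTS["repulsion"] * repulsion(d)
    )
    if hydrophobic_pair:
        score += WEIGHTS["hydrophobic"] * hydrophobic(d)
    if hbond_pair:
        score += WEIGHTS["hbond"] * hbond(d)
    return score
```

The signs of the weights explain the behaviour: the Gaussian, hydrophobic and hydrogen-bond terms are rewarded (negative contribution), while overlap between atoms is penalised through the positive repulsion weight.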

### Charge Model Comparison:

#### **Gasteiger Charges**:
- **Speed**: Very fast (~1 second per molecule)
- **Accuracy**: Moderate (±0.2-0.3 e⁻ typical error)
- **Applications**: High-throughput screening
- **Limitations**: Poor for charged/polar molecules

#### **AM1-BCC Charges**:
- **Speed**: Slow (~30-60 seconds per molecule)
- **Accuracy**: High (±0.1 e⁻ typical error)
- **Applications**: Lead optimization, research
- **Advantages**: Better for drug-like molecules

#### **MMFF94 Charges**:
- **Speed**: Fast (~5 seconds per molecule)
- **Accuracy**: Good (±0.15 e⁻ typical error)
- **Applications**: General-purpose docking
- **Balance**: Speed vs. accuracy compromise

### Search Space Optimization Strategies:

#### **Binding Site Identification Methods**:
1. **Co-crystal structures**: Use known ligand position (best)
2. **Cavity detection**: fpocket, CASTp, SiteMap
3. **Consensus docking**: Multiple algorithms agreement
4. **Blind docking**: Grid search across protein surface
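
For blind docking, one simple strategy is to tile the protein's bounding box with overlapping search boxes and dock into each one. The helper below is a hypothetical sketch of that tiling; the function name, the 5 Å overlap, and the idea of using the bounding box of all protein atoms are assumptions for illustration.

```python
import itertools
import math


def tile_box_centers(mins, maxs, box_size=20.0, overlap=5.0):
    """Centers of cubic boxes that together cover the region [mins, maxs].

    mins, maxs : (x, y, z) corners of the region to cover, in Angstroms
    box_size   : edge length of each search box
    overlap    : minimum overlap between neighbouring boxes
    """
    step = box_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than box_size")
    axes = []
    for lo, hi in zip(mins, maxs):
        span = hi - lo
        if span <= box_size:
            # One box centred on this axis is enough
            axes.append([(lo + hi) / 2.0])
            continue
        n = math.ceil((span - box_size) / step) + 1
        first = lo + box_size / 2.0
        last = hi - box_size / 2.0
        # Spread n centers evenly so the outer boxes touch the region edges
        axes.append([first + i * (last - first) / (n - 1) for i in range(n)])
    return list(itertools.product(*axes))
```

Each returned center could be passed as `center` to `compute_vina_maps()`, with the best pose across all boxes kept at the end. The number of boxes grows quickly with protein size, which is why a known ligand position is preferred when one exists.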

#### **Box Size Guidelines by Target Class**:
- **Kinases**: 15-18 Å (compact ATP-binding site)
- **Proteases**: 18-22 Å (extended active site cleft)
- **GPCRs**: 15-20 Å (orthosteric site), 20-25 Å (allosteric)
- **Ion channels**: 20-30 Å (pore region)
- **Transcription factors**: 15-18 Å (DNA-binding domain)

### Validation and Benchmarking:

#### **Cross-Validation Approaches**:
1. **Leave-one-out**: Remove one ligand, dock back
2. **Temporal split**: Use older structures to predict newer
3. **Chemotype split**: Different chemical scaffolds
4. **Target split**: Different protein families

#### **Success Metrics**:
- **Top-1 success rate**: Best pose within 2 Å RMSD
- **Top-3 success rate**: Any of top 3 poses correct
- **Enrichment factor**: True positives vs. random selection
- **AUC**: Area under ROC curve for actives vs. decoys
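
Two of these metrics are simple enough to compute by hand. The sketch below assumes that RMSD values per ligand are already ordered by pose rank and that more negative scores rank first; the function names and the 2 Å default threshold mirror the text above, and the default top fraction of 1% is an assumption.

```python
def top_n_success_rate(rmsds_per_ligand, n=1, threshold=2.0):
    """Fraction of ligands with at least one of the top-n poses within threshold.

    rmsds_per_ligand : list of lists, each inner list holding the RMSD to the
                       reference pose for poses ordered best score first
    """
    if not rmsds_per_ligand:
        raise ValueError("no ligands given")
    hits = sum(
        1 for rmsds in rmsds_per_ligand
        if any(r <= threshold for r in rmsds[:n])
    )
    return hits / len(rmsds_per_ligand)


def enrichment_factor(scores, is_active, fraction=0.01):
    """Active rate in the top-ranked fraction divided by the overall active rate."""
    if len(scores) != len(is_active) or not scores:
        raise ValueError("scores and is_active must be non-empty and equally long")
    total_actives = sum(1 for flag in is_active if flag)
    if total_actives == 0:
        raise ValueError("at least one active is required")
    ranked = sorted(zip(scores, is_active), key=lambda pair: pair[0])
    n_top = max(1, round(len(ranked) * fraction))
    actives_in_top = sum(1 for _, flag in ranked[:n_top] if flag)
    return (actives_in_top / n_top) / (total_actives / len(ranked))
```

An enrichment factor of 1 means the ranking does no better than random selection; larger values mean actives are concentrated at the top of the list.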

### Computational Resource Requirements:

#### **CPU Time Scaling**:
Time ≈ exhaustiveness × num_poses × box_volume × flexibility

#### **Memory Requirements**:
- **Small molecules (<500 Da)**: ~10-50 MB RAM
- **Large molecules (>1000 Da)**: ~100-500 MB RAM
- **Protein preparation**: ~100 MB - 2 GB depending on size

#### **Parallel Processing**:
- **Multi-threading**: Vina runs multi-threaded on a single machine (controlled by its `cpu` setting; 2-8 cores optimal)
- **Embarrassingly parallel**: Independent ligand docking
- **GPU acceleration**: Not available in standard Vina
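
Because each ligand is docked independently, a library can be spread over workers with very little code. The sketch below is generic: `dock_one` is a hypothetical callable that takes a ligand path and returns its best score, and threads are used on the assumption that the real work happens in an external process (for example a command-line docking call). For CPU-bound Python code, a process pool would be the more appropriate choice.

```python
from concurrent.futures import ThreadPoolExecutor


def dock_library(ligand_paths, dock_one, max_workers=4):
    """Dock every ligand with dock_one and return (path, score) sorted by score.

    ligand_paths : list of ligand file paths
    dock_one     : callable(path) -> best score in kcal/mol (hypothetical)
    max_workers  : number of ligands processed at the same time
    """
    ligand_paths = list(ligand_paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so scores line up with paths
        scores = list(pool.map(dock_one, ligand_paths))
    # Most negative (best) score first
    return sorted(zip(ligand_paths, scores), key=lambda pair: pair[1])
```

When combining this with Vina's own multi-threading, the product of workers and threads per job should stay at or below the number of available cores to avoid oversubscription.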

### Troubleshooting Common Parameter Issues:

#### **Poor Binding Scores (>-5 kcal/mol)**:
- Check protein protonation state (pH parameter)
- Verify ligand preparation (3D coordinates, charges)
- Increase exhaustiveness (may be trapped in local minimum)
- Check binding site definition (box center/size)

#### **Unrealistic Poses**:
- Reduce box size (too much conformational freedom)
- Check for crystallographic artifacts in protein structure
- Verify ligand rotatable bonds (too flexible)
- Consider protein flexibility (receptor may need adjustment)


### Output:
Returns dictionary with binding scores, pose information, and file paths for further analysis.


In [None]:

def dock_separate_pdb_files(
    receptor_pdb_file,
    ligand_pdb_file,
    output_prefix="docking_result",
    pH=5.5,
    box_size=20.0,
    num_poses=9,
    exhaustiveness=25
):
    """
    Perform molecular docking with separate receptor and ligand PDB files
    
    Parameters:
    -----------
    receptor_pdb_file : str
        Path to the receptor (protein) PDB file
    ligand_pdb_file : str  
        Path to the ligand PDB file
    output_prefix : str
        Prefix for output files
    pH : float
        pH for protein protonation
    box_size : float
        Docking box size in Angstroms
    num_poses : int
        Number of poses to generate
    exhaustiveness : int
        Search thoroughness (1-32, higher = more thorough)
    """
    
    # Setup file paths
    receptor_ph_pdb = Path(f"{output_prefix}_receptor_ph_{pH}.pdb")
    receptor_pdbqt = Path(f"{output_prefix}_receptor.pdbqt")
    ligand_pdbqt = Path(f"{output_prefix}_ligand.pdbqt")
    docked_poses_pdbqt = Path(f"{output_prefix}_docked_poses.pdbqt")
    best_pose_pdb = Path(f"{output_prefix}_best_pose.pdb")

    # print(f"   Receptor: {receptor_pdb_file}")
    # print(f"   Ligand: {ligand_pdb_file}")

    # Step 1: Validate input files
    print("\nStep 1: Validating input PDB files")
    validate_pdb_files(receptor_pdb_file, ligand_pdb_file)

    # Step 2: Fix the pdb file for missing atoms and protonate the receptor at specified pH
    print(f"\n Step 2: Protonating receptor at pH {pH}")
    protonate_protein(receptor_pdb_file, receptor_ph_pdb, pH)

    # Step 3: Prepare receptor PDBQT
    print("\n Step 3: Preparing receptor PDBQT")
    prepare_protein_openbabel(receptor_ph_pdb, receptor_pdbqt)

    # Step 4: Prepare ligand PDBQT
    print("\nStep 4: Preparing ligand PDBQT")
    prepare_ligand_openbabel(ligand_pdb_file, ligand_pdbqt)

    # Step 5: Calculate ligand center for docking box
    print("\n Step 5: Calculating binding site center")
    center = get_ligand_center(ligand_pdb_file)

    # Step 6: **Perform Molecular Docking**
    print(f"\n Step 6: **PERFORMING MOLECULAR DOCKING** ({num_poses} poses, exhaustiveness={exhaustiveness})")
    v = Vina(sf_name='vina')
    v.set_receptor(str(receptor_pdbqt))
    v.set_ligand_from_file(str(ligand_pdbqt))
    
    box_dims = [box_size] * 3
    v.compute_vina_maps(center=center, box_size=box_dims)
    
    #  searches for optimal binding poses
    v.dock(exhaustiveness=exhaustiveness, n_poses=num_poses)
    
    # Save all docking poses
    v.write_poses(str(docked_poses_pdbqt), n_poses=num_poses)
    print(f" Docking complete! {num_poses} poses saved to {docked_poses_pdbqt}")

    # Step 7: Analyze docking results
    print("\n Step 7: Analyzing docking results")
    poses = parse_docking_results(docked_poses_pdbqt)
    
    if poses:
        print(f"\n **DOCKING RESULTS** ({len(poses)} poses found):")
        for i, pose in enumerate(sorted(poses, key=lambda x: x['score'])[:5], 1):  # Show top 5
            print(f"  Pose {i}: {pose['score']:.3f} kcal/mol")
        
        best_score = save_best_pose_pdb(poses, best_pose_pdb)
        
        print(f"\n **BEST BINDING POSE**: {best_score:.3f} kcal/mol")
    else:
        raise ValueError("No valid poses found in docking results")

    # Step 8: Visualise the docked complex
    print("\n Step 8: Generating 3D visualization of docked complex...")
    visualize_docking_complex(receptor_ph_pdb, best_pose_pdb, center, box_dims)

    return {
        'best_score': best_score,
        'all_poses': [(pose['score']) for pose in sorted(poses, key=lambda x: x['score'])],
        'num_poses': len(poses),
        'docking_files': {
            'receptor_pdbqt': str(receptor_pdbqt),
            'ligand_pdbqt': str(ligand_pdbqt),
            'all_poses': str(docked_poses_pdbqt),
            'best_pose': str(best_pose_pdb)
        }
    }


## Example Usage - HIV Protease with Darunavir

This cell demonstrates the docking workflow using a real-world example.

### Example System:
- **Target**: HIV protease (PDB: 6OPS) - essential enzyme for HIV replication
- **Ligand**: Darunavir - FDA-approved second-generation protease inhibitor
- **pH 5.5**: Optimal pH for HIV protease (important for accurate protonation states)

### Key Learning Points:

#### 1. File Path Configuration
- Update these paths to point to your own PDB files:
- `receptor_pdb_path = "path/to/your/protein.pdb"`
- `ligand_pdb_path = "path/to/your/ligand.pdb"`

#### 2. Parameter Selection
- **pH 5.5**: HIV protease optimal conditions
- **Box size 20Å**: Adequate for most binding sites
- **Exhaustiveness 30**: High thoroughness for accurate results
- **9 poses**: Good sampling of binding modes

#### 3. Expected Results
- **Good inhibitors**: Binding scores < -8.0 kcal/mol
- **Darunavir**: Typically shows -8 to -9 kcal/mol (literature comparison)
- **Experimental Ki**: ~0.16 nM (very strong binding)

### Result Interpretation:
- **< -9.0 kcal/mol**: Excellent binding (strong drug candidate)
- **-7.0 to -9.0**: Good binding (promising compound)
- **-5.0 to -7.0**: Moderate binding (needs optimization)
- **> -5.0**: Weak binding (poor candidate)

### Troubleshooting:
- Verify OpenBabel installation: `conda install -c conda-forge openbabel`
- Check file paths and formatting
- Ensure write permissions in working directory
- Validate PDB structure quality

### Next Steps:
1. Examine 3D visualization for binding interactions
2. Compare with experimental crystal structures
3. Analyze molecular contacts and hydrogen bonds
4. Consider molecular dynamics validation



In [None]:
# HIV-1 protease dimer + darunavir molecular docking with separate files
# IMPORTANT : CHANGE FILE PATH 
receptor_pdb_path = '/Users/kap037/Desktop/CSIRO-Malaysia-alphafold/molecules/protein-structure-prediction/x-ray-structures-pdb/6OPS-WT-clean-receptor-dockprep.pdb'
ligand_pdb_path = '/Users/kap037/Desktop/CSIRO-Malaysia-alphafold/molecules/protein-structure-prediction/x-ray-structures-pdb/Darunavir-dockprep.pdb'

# Docking parameters  
target_pH = 5.5       # HIV-1 protease optimal pH, https://pubmed.ncbi.nlm.nih.gov/1761538/
docking_box_size = 20.0
num_poses = 9         # Number of poses to generate
exhaustiveness = 30    # Search thoroughness

# Perform docking
results = dock_separate_pdb_files(
    receptor_pdb_file=receptor_pdb_path,
    ligand_pdb_file=ligand_pdb_path,
    output_prefix="hiv_protease_darunavir",  # Output file prefix
    pH=target_pH,
    box_size=docking_box_size,
    num_poses=num_poses,
    exhaustiveness=exhaustiveness
)

print(f"\n **FINAL DOCKING RESULTS:**")
print(f"Best binding affinity: {results['best_score']:.3f} kcal/mol") 
print(f"Total poses generated: {results['num_poses']}")
print(f"Top 5 pose scores: {[f'{score:.3f}' for score in results['all_poses'][:5]]} kcal/mol")
print(f"Output files: {results['docking_files']}")
