# 🧬 ParaDeep: Sequence-Based Paratope Prediction with BiLSTM-CNN

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/PiyachatU/ParaDeep/blob/main/ParaDeep_Colab_Final.ipynb)

**ParaDeep** is a lightweight, chain-aware deep learning framework for predicting **paratope residues** (the antibody residues that bind antigen) directly from antibody amino acid sequences. It uses a BiLSTM-CNN architecture with chain-specific input encodings: **learnable embeddings** for heavy (H) chains and **one-hot encoding** for light (L) chains. It requires no structural data or large pretrained models.
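
To make the two input encodings concrete, the sketch below one-hot encodes a sequence in plain Python. The alphabet ordering, the zero-padding, and the `MAX_LEN = 130` cap (taken from the length limit stated later in this notebook) are assumptions for illustration, not ParaDeep's actual preprocessing code. A learnable embedding would instead take the integer indices returned by `to_indices` and look up a trainable vector for each one.

```python
# Illustrative sketch only: ordering and padding are assumptions,
# not ParaDeep's actual preprocessing.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
MAX_LEN = 130                          # assumed fixed input length


def to_indices(seq, max_len=MAX_LEN):
    """Map residues to integer indices (the input an embedding layer would take)."""
    return [AA_INDEX[aa] for aa in seq[:max_len]]


def one_hot(seq, max_len=MAX_LEN):
    """Return a max_len x 20 matrix: one 1 per residue row, all-zero rows as padding."""
    matrix = [[0] * len(AMINO_ACIDS) for _ in range(max_len)]
    for pos, idx in enumerate(to_indices(seq, max_len)):
        matrix[pos][idx] = 1
    return matrix


encoded = one_hot("DIQMTQ")
print(len(encoded), len(encoded[0]))   # matrix shape (rows, columns)
print(encoded[0].index(1))             # column that is set for the first residue
```

One-hot rows carry no learned information, while embedding vectors are trained together with the rest of the network; which choice works better per chain is an empirical question the sketch does not address.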

## 🎯 What is ParaDeep?

ParaDeep was developed to enable **fast, interpretable, and accessible** paratope prediction in the early stages of antibody discovery. The model provides **per-residue binary predictions** (binding vs non-binding) and has been optimized for minimal computational overhead while maintaining competitive accuracy.

### Key Features:
- 🔬 **Sequence-only input**: No need for 3D structures or AlphaFold predictions
- ⚡ **Chain-aware modeling**: Independent models for H and L chains
- 🚀 **Lightweight architecture**: Suitable for local or Colab-based inference
- 📊 **Per-residue classification**: Clear binary output per amino acid
- 📁 **User-friendly I/O**: Direct sequence input or file upload

### What This Notebook Does:
1. 🛠️ Set up the environment and install dependencies
2. 📥 Load the pretrained ParaDeep models from the cloned repository
3. 📤 Input your H and L chain sequences directly
4. 🔮 Run predictions on both heavy and light chains
5. 💾 Save and visualize the results


## 1. 🛠️ Environment Setup

First, let's clone the ParaDeep repository from GitHub and install the required dependencies.


In [None]:
import os
import sys

# Check if we're in Colab
try:
    import google.colab
    IN_COLAB = True
    print("🔍 Running in Google Colab")
except ImportError:
    IN_COLAB = False
    print("🔍 Running in local environment")

# Clone the repository if not already present
# (skipped when this cell is re-run from inside the ParaDeep directory)
if os.path.basename(os.getcwd()) != 'ParaDeep':
    if not os.path.exists('ParaDeep'):
        print("📥 Cloning ParaDeep repository...")
        !git clone https://github.com/PiyachatU/ParaDeep.git
        print("✅ Repository cloned successfully")
    else:
        print("✅ ParaDeep repository already exists")

    # Change to the ParaDeep directory
    os.chdir('ParaDeep')
print(f"📂 Current directory: {os.getcwd()}")

# Install requirements
print("📦 Installing dependencies...")
!pip install -q -r requirements.txt

# Add src to Python path
sys.path.insert(0, os.path.join(os.getcwd(), "src"))

# Verify installation
try:
    import torch
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from tqdm import tqdm
    from Bio import SeqIO
    from datetime import datetime
    print("✅ All dependencies installed successfully")
    print(f"🔥 PyTorch version: {torch.__version__}")
    print(f"🖥️  Device: {'GPU' if torch.cuda.is_available() else 'CPU'}")
except ImportError as e:
    print(f"❌ Error importing dependencies: {e}")
    print("Please restart the runtime and try again.")

## 2. 📤 Input Your Antibody Sequences

Enter your Heavy (H) and Light (L) chain sequences below. The system will validate your sequences and run predictions automatically.

### Requirements:
- **Heavy Chain (H)**: Variable heavy chain sequence
- **Light Chain (L)**: Variable light chain sequence (kappa or lambda)
- **Format**: Standard single-letter amino acid codes (A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y)
- **Length**: Up to 130 residues (longer sequences will be truncated)

### Example Sequences:
- **H-chain**: `EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVSAISGSGGSTYYADSVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYCAR`
- **L-chain**: `DIQMTQSPSSLSASVGDRVTITCRASQGIRNYLAWYQQKPGKAPKLLIYAASTLQSGVPSRFSGSGSGTDFTLTISSLQPEDFATYYCQRYNRAPYTFGQGTKVEIK`
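
Sequences pasted from FASTA files or spreadsheets often carry line breaks or spaces, which `.strip()` removes only at the two ends. A minimal sketch of a clean-up step is shown here. The helper name `clean_sequence` is hypothetical and not part of the ParaDeep package, and keeping the first 130 residues is an assumption about how the length limit above is applied.

```python
# Hypothetical helper, not part of the ParaDeep package.
VALID_AA = set("ACDEFGHIKLMNPQRSTVWY")
MAX_LEN = 130  # length limit stated in the requirements above


def clean_sequence(raw, max_len=MAX_LEN):
    """Drop FASTA header lines and whitespace, upper-case, then truncate."""
    lines = [ln for ln in raw.splitlines() if not ln.startswith(">")]
    seq = "".join("".join(lines).split()).upper()
    invalid = sorted(set(seq) - VALID_AA)
    if invalid:
        raise ValueError(f"Invalid characters: {invalid}")
    return seq[:max_len]  # assumption: truncation keeps the first max_len residues


pasted = ">my_antibody_H\nevqlvesggg lvqpgg\nSLRLSCAAS"
print(clean_sequence(pasted))
```

Raising an error on invalid characters, rather than silently dropping them, keeps residue positions in the output aligned with the sequence you supplied.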


In [None]:
# Import required modules
from src.io_utils import load_sequences
from src.core import predict_paradeep
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

print("🧬 Enter your antibody sequences below:")
print("\n" + "="*60)
print("📝 ANTIBODY SEQUENCE INPUT")
print("="*60)

# Get user input for sequences
print("\n🔗 Heavy Chain (H) Sequence:")
print("   Paste your heavy chain variable region sequence below")
print("   Example: EVQLVESGGGLVQPGGSLRLSCAASGFTFS...")
h_sequence = input("H-chain: ").strip().upper()

print("\n🔗 Light Chain (L) Sequence:")
print("   Paste your light chain variable region sequence below")
print("   Example: DIQMTQSPSSLSASVGDRVTITC...")
l_sequence = input("L-chain: ").strip().upper()

# Optional: Sequence ID
print("\n🏷️  Sequence ID (optional):")
print("   Enter a name/ID for your antibody (default: 'MyAntibody')")
seq_id = input("Seq ID: ").strip()
if not seq_id:
    seq_id = "MyAntibody"

# Validate and process sequences
def validate_sequence(seq, chain_type):
    """Validate amino acid sequence"""
    if not seq:
        return False, f"❌ {chain_type}-chain sequence is empty"
    
    # Check for valid amino acids
    valid_aa = set("ACDEFGHIKLMNPQRSTVWY")
    invalid_chars = set(seq) - valid_aa
    
    if invalid_chars:
        return False, f"❌ {chain_type}-chain contains invalid characters: {sorted(invalid_chars)}"
    
    # Check length
    if len(seq) > 130:
        return True, f"⚠️  {chain_type}-chain sequence is {len(seq)} residues (will be truncated to 130)"
    
    return True, f"✅ {chain_type}-chain sequence is valid ({len(seq)} residues)"

# Validate sequences
print("\n🔍 Validating sequences...")
sequences_to_process = []

if h_sequence:
    h_valid, h_msg = validate_sequence(h_sequence, "Heavy")
    print(h_msg)
    if h_valid:
        sequences_to_process.append({
            'Seq_ID': seq_id,
            'Chain_Type': 'H',
            'Seq_cap': h_sequence
        })

if l_sequence:
    l_valid, l_msg = validate_sequence(l_sequence, "Light")
    print(l_msg)
    if l_valid:
        sequences_to_process.append({
            'Seq_ID': seq_id,
            'Chain_Type': 'L',
            'Seq_cap': l_sequence
        })

# Process sequences if valid
if sequences_to_process:
    print(f"\n✅ Ready to process {len(sequences_to_process)} sequence(s)")
    
    # Create DataFrame
    user_df = pd.DataFrame(sequences_to_process)
    
    # Save to file (create data/ first in case it does not exist)
    os.makedirs("data", exist_ok=True)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    input_filename = f"data/user_input_{timestamp}.csv"
    user_df.to_csv(input_filename, index=False)
    
    print(f"💾 Sequences saved to: {input_filename}")
    print("\n📋 Your sequences:")
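    for _, row in user_df.iterrows():
        # Only add an ellipsis when the preview actually truncates the sequence
        preview = row['Seq_cap'][:50] + ("..." if len(row['Seq_cap']) > 50 else "")
        print(f"   {row['Chain_Type']}-chain ({len(row['Seq_cap'])} residues): {preview}")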
    
    # Automatically run predictions
    print("\n🚀 Running ParaDeep predictions on your sequences...")
    
    # Set up output paths
    output_file = f"output/user_predictions_{timestamp}.csv"
    plot_dir = f"output/user_plots_{timestamp}"
    
    # Ensure output directory exists
    os.makedirs("output", exist_ok=True)
    os.makedirs(plot_dir, exist_ok=True)
    
    try:
        # Run predictions
        predict_paradeep(
            input_path=input_filename,
            model_H_path="models/Best_Model_H.pt",
            model_L_path="models/Best_Model_L.pt",
            kernel_H='Full',
            kernel_L='Full',
            output_path=output_file,
            visualize=True,
            plot_dir=plot_dir
        )
        
        print(f"\n🎉 Predictions completed successfully!")
        print(f"📊 Results saved to: {output_file}")
        print(f"🎨 Visualizations saved to: {plot_dir}")
        
        # Display results summary
        if os.path.exists(output_file):
            results_df = pd.read_csv(output_file)
            
            print(f"\n📈 Results Summary:")
            for chain in ['H', 'L']:
                pred_col = f"{chain}_Prediction"
                if pred_col in results_df.columns:
                    chain_data = results_df[results_df['Chain_Type'] == chain]
                    if not chain_data.empty:
                        # Cast to int so the summary prints "12/120" rather than "12.0/120"
                        binding_count = int(chain_data[pred_col].sum())
                        total_count = len(chain_data)
                        percentage = (binding_count / total_count) * 100
                        print(f"   🔗 {chain}-chain: {binding_count}/{total_count} binding residues ({percentage:.1f}%)")
                        
                        # Show binding residues
                        binding_residues = chain_data[chain_data[pred_col] == 1]
                        if not binding_residues.empty:
                            positions = binding_residues['Residue_Position'].tolist()
                            residues = binding_residues['Residue'].tolist()
                            binding_info = [f"{res}{pos}" for res, pos in zip(residues, positions)]
                            print(f"      Binding sites: {', '.join(binding_info)}")
                            
                            # Create highlighted sequence
                            sequence = ''.join(chain_data['Residue'].tolist())
                            predictions = chain_data[pred_col].tolist()
                            highlighted = ''.join([f"[{r}]" if p == 1 else r for r, p in zip(sequence, predictions)])
                            print(f"      Highlighted: {highlighted}")
        
        # Download option for Colab
        if IN_COLAB:
            print(f"\n📥 Download your results:")
            from google.colab import files
            files.download(output_file)
            
    except Exception as e:
        print(f"❌ Error during prediction: {e}")
        import traceback
        traceback.print_exc()
        
else:
    print("\n❌ No valid sequences to process. Please check your input and try again.")
    print("\n💡 Tips:")
    print("   - Use only the 20 standard amino acid letters (ACDEFGHIKLMNPQRSTVWY); no B, J, O, U, X, Z, numbers, or special characters")
    print("   - Make sure sequences are not empty")
    print("   - Variable region sequences typically range from 100-130 residues")
    print("   - You can input just H-chain, just L-chain, or both")

## 3. 📊 Enhanced Results Visualization

Create detailed visualizations of your paratope predictions with binding residue highlighting.


In [None]:
import matplotlib.pyplot as plt
import numpy as np
from glob import glob
import seaborn as sns

# Set style for better plots
plt.style.use('default')
sns.set_palette("husl")

def create_enhanced_visualization(save_dir="output/enhanced_figures"):
    """
    Create enhanced visualizations of ParaDeep predictions
    """
    os.makedirs(save_dir, exist_ok=True)
    
    # Find latest prediction file
    prediction_files = sorted(glob("output/*predictions_*.csv"), key=os.path.getmtime, reverse=True)
    if not prediction_files:
        print("❌ No prediction files found in 'output/'")
        print("Please run predictions first using the cell above.")
        return
    
    latest_file = prediction_files[0]
    print(f"📊 Using prediction file: {latest_file}")
    
    df = pd.read_csv(latest_file)
    
    # Validate required columns
    required_cols = ['Seq_ID', 'Chain_Type', 'Residue_Position', 'Residue']
    missing_cols = [col for col in required_cols if col not in df.columns]
    if missing_cols:
        print(f"❌ Missing required columns in prediction file: {missing_cols}")
        return
    
    print(f"📈 Creating visualizations for {df['Seq_ID'].nunique()} unique sequences...")
    
    # Process each chain type
    for chain in ['H', 'L']:
        pred_col = f"{chain}_Prediction"
        prob_col = f"{chain}_Probability"
        
        if pred_col not in df.columns or prob_col not in df.columns:
            continue
        
        chain_df = df[df['Chain_Type'].str.upper() == chain]
        if chain_df.empty:
            continue
            
        print(f"\n🔗 Processing {chain}-chain sequences...")
        
        for seq_id in chain_df['Seq_ID'].unique():
            df_seq = chain_df[chain_df['Seq_ID'] == seq_id].sort_values('Residue_Position')
            
            if df_seq.empty:
                continue
            
            # Extract data
            positions = df_seq['Residue_Position'].values
            residues = df_seq['Residue'].values
            probabilities = df_seq[prob_col].values
            predictions = df_seq[pred_col].astype(int).values
            
            # Create enhanced plot
            fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(16, 8), 
                                         gridspec_kw={'height_ratios': [3, 1]})
            
            # Main probability plot
            bars = ax1.bar(positions, probabilities, 
                          color=['#ff6b6b' if p == 1 else '#4ecdc4' for p in predictions],
                          alpha=0.7, edgecolor='black', linewidth=0.5)
            
            # Threshold line
            ax1.axhline(y=0.5, color='red', linestyle='--', alpha=0.8, 
                       linewidth=2, label='Decision Threshold (0.5)')
            
            # Highlight binding residues
            binding_mask = predictions == 1
            if np.any(binding_mask):
                ax1.scatter(positions[binding_mask], probabilities[binding_mask],
                           color='darkred', s=100, zorder=5, 
                           label=f'Binding Residues ({np.sum(binding_mask)})', 
                           marker='o', edgecolor='white', linewidth=2)
                
                # Label binding residues
                for pos, res, prob in zip(positions[binding_mask], 
                                        residues[binding_mask], 
                                        probabilities[binding_mask]):
                    ax1.annotate(res, (pos, prob), 
                               xytext=(0, 15), textcoords='offset points',
                               ha='center', va='bottom', fontweight='bold',
                               fontsize=10, color='darkred',
                               bbox=dict(boxstyle='round,pad=0.3', 
                                       facecolor='yellow', alpha=0.7))
            
            # Formatting
            ax1.set_title(f"Paratope Prediction: {seq_id} ({chain}-chain)", 
                         fontsize=16, fontweight='bold', pad=20)
            ax1.set_ylabel("Binding Probability", fontsize=12, fontweight='bold')
            ax1.set_ylim(0, 1.2)
            ax1.grid(True, alpha=0.3, linestyle='-', linewidth=0.5)
            ax1.legend(loc='upper right', fontsize=10)
            
            # Sequence visualization
            colors = ['red' if p == 1 else 'lightgray' for p in predictions]
            ax2.bar(positions, [1]*len(positions), color=colors, alpha=0.8)
            
            # Add residue labels
            for pos, res in zip(positions, residues):
                ax2.text(pos, 0.5, res, ha='center', va='center', 
                        fontsize=8, fontweight='bold')
            
            ax2.set_xlabel("Residue Position", fontsize=12, fontweight='bold')
            ax2.set_ylabel("Sequence", fontsize=10)
            ax2.set_ylim(0, 1)
            ax2.set_xlim(positions[0]-0.5, positions[-1]+0.5)
            
            plt.tight_layout()
            
            # Save figure
            fig_path = os.path.join(save_dir, f"{seq_id}_{chain}_enhanced.png")
            plt.savefig(fig_path, dpi=300, bbox_inches='tight')
            plt.show()
            plt.close(fig)  # release the figure so memory does not grow across sequences
            
            print(f"💾 Saved: {fig_path}")

# Run enhanced visualization
print("🎨 Creating enhanced visualizations...")
create_enhanced_visualization()

## 4. 📁 Alternative: File Upload

If you prefer to upload a file with multiple sequences, use the cell below:
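
For the CSV route, the file needs the three columns this notebook uses for its own input (`Seq_ID`, `Chain_Type`, `Seq_cap`), with one row per chain and the H and L chains of an antibody sharing a `Seq_ID`. The sketch below builds such a file with the standard `csv` module. The antibody IDs, the shortened sequence fragments, and the file name `my_antibodies.csv` are placeholders; use your full variable-region sequences in practice.

```python
import csv
import io

# Column names taken from the supported-format notes in this notebook.
FIELDNAMES = ["Seq_ID", "Chain_Type", "Seq_cap"]

# Placeholder records; sequences are shortened fragments for brevity.
records = [
    {"Seq_ID": "Ab1", "Chain_Type": "H", "Seq_cap": "EVQLVESGGGLVQPGGSLRLSCAAS"},
    {"Seq_ID": "Ab1", "Chain_Type": "L", "Seq_cap": "DIQMTQSPSSLSASVGDRVTITC"},
    {"Seq_ID": "Ab2", "Chain_Type": "H", "Seq_cap": "QVQLQESGPGLVKPSETLSLTCTVS"},
]


def write_input_csv(rows, handle):
    """Write rows in the Seq_ID,Chain_Type,Seq_cap layout to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


buffer = io.StringIO()
write_input_csv(records, buffer)
csv_text = buffer.getvalue()
print(csv_text)

# To create a file for upload, pass a real file handle instead:
# with open("my_antibodies.csv", "w", newline="") as fh:
#     write_input_csv(records, fh)
```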


In [None]:
# File upload interface for Colab
if IN_COLAB:
    print("📁 Upload your sequence file (CSV, FASTA, or TXT):")
    print("\nSupported formats:")
    print("   📄 CSV: Seq_ID,Chain_Type,Seq_cap")
    print("   🧬 FASTA: >header\nsequence")
    print("   📝 TXT: one sequence per line")
    
    from google.colab import files
    uploaded = files.upload()
    
    if uploaded:
        filename = list(uploaded.keys())[0]
        print(f"\n✅ File uploaded: {filename}")
        
        # Process the uploaded file
        try:
            df = load_sequences(filename)
            print(f"\n📋 Uploaded data preview:")
            print(df.head())
            
            # Run predictions on uploaded file
            os.makedirs("output", exist_ok=True)  # needed if the input cell above was skipped
            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            output_file = f"output/uploaded_predictions_{timestamp}.csv"
            plot_dir = f"output/uploaded_plots_{timestamp}"
            
            print(f"\n🔮 Running predictions on uploaded data...")
            predict_paradeep(
                input_path=filename,
                model_H_path="models/Best_Model_H.pt",
                model_L_path="models/Best_Model_L.pt",
                kernel_H='Full',
                kernel_L='Full',
                output_path=output_file,
                visualize=True,
                plot_dir=plot_dir
            )
            
            print(f"\n✅ Predictions completed!")
            print(f"📥 Download your results:")
            files.download(output_file)
            
        except Exception as e:
            print(f"❌ Error processing file: {e}")
            print("\n💡 Please check your file format:")
            print("   - CSV files need columns: Seq_ID, Chain_Type, Seq_cap")
            print("   - Chain_Type should be 'H' or 'L'")
            print("   - Sequences should contain only valid amino acids")
    else:
        print("No file uploaded.")
else:
    print("📝 The file upload widget is only available in Google Colab.")
    print("In a local environment, place your file in the data/ directory, set `filename` to its path, and run the load_sequences and predict_paradeep calls above.")

## 🎉 Conclusion

Congratulations! You've successfully used ParaDeep to predict paratope residues in your antibody sequences.

### Understanding Your Results:
- **Binding Residues**: Amino acids predicted to interact with antigens (marked with [brackets])
- **Probability Scores**: Model-estimated probability (0.0 to 1.0) that each residue belongs to the paratope
- **Threshold**: Probabilities are converted to binding/non-binding labels at a 0.5 cutoff
- **Visualization**: Red bars/dots indicate predicted binding sites
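
The relationship between probabilities, the 0.5 threshold, and the bracketed output can be sketched in a few lines. The probabilities below are made up for illustration, and treating a value exactly equal to 0.5 as binding is an assumption; ParaDeep's own comparison may differ at the boundary.

```python
THRESHOLD = 0.5  # cutoff stated above


def classify(probabilities, threshold=THRESHOLD):
    """Convert per-residue probabilities to 0/1 labels (assumes >= threshold means binding)."""
    return [1 if p >= threshold else 0 for p in probabilities]


def highlight(sequence, labels):
    """Wrap predicted binding residues in [brackets], as in the results summary."""
    return "".join(f"[{aa}]" if lab == 1 else aa for aa, lab in zip(sequence, labels))


# Made-up probabilities for a 6-residue fragment, for illustration only.
sequence = "GFTFSS"
probabilities = [0.12, 0.48, 0.91, 0.50, 0.73, 0.05]

labels = classify(probabilities)
print(labels)
print(highlight(sequence, labels))
```

Residues with probabilities close to the cutoff (such as 0.48 here) flip label under small changes, so it is worth checking the probability column as well as the binary call.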

### Next Steps:
1. **Analyze Results**: Review the binding predictions and visualizations
2. **Validate**: Compare the predictions with experimental binding data where available
3. **Design Experiments**: Use predictions to guide mutagenesis studies
4. **Iterate**: Test different antibody variants
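
If you do have known binding labels for a sequence, the comparison in step 2 can be summarized with precision, recall, and F1. The sketch below uses made-up label vectors; in practice `predicted` would come from the prediction column of the results CSV, and `observed` from your own annotation, which this notebook does not provide.

```python
def confusion_counts(predicted, observed):
    """Count true/false positives and negatives for two equal-length 0/1 label lists."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    pairs = list(zip(predicted, observed))
    tp = sum(1 for p, o in pairs if p == 1 and o == 1)
    fp = sum(1 for p, o in pairs if p == 1 and o == 0)
    fn = sum(1 for p, o in pairs if p == 0 and o == 1)
    tn = sum(1 for p, o in pairs if p == 0 and o == 0)
    return tp, fp, fn, tn


def precision_recall_f1(predicted, observed):
    """Return (precision, recall, F1); each is 0.0 when its denominator is zero."""
    tp, fp, fn, _ = confusion_counts(predicted, observed)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1


# Made-up labels for an 8-residue stretch, for illustration only.
predicted = [0, 1, 1, 0, 1, 0, 0, 1]
observed = [0, 1, 0, 0, 1, 1, 0, 1]

precision, recall, f1 = precision_recall_f1(predicted, observed)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Because paratope residues are a small fraction of each chain, plain accuracy is dominated by the non-binding class; precision and recall on the binding class are usually more informative.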

### Citation
If you use ParaDeep in your research, please cite:

```
ParaDeep: Sequence-Based Paratope Prediction with BiLSTM-CNN
GitHub: https://github.com/PiyachatU/ParaDeep
```

---

**Thank you for using ParaDeep! 🧬✨**

*Happy antibody discovery!* 🔬🎯
