<img src="https://drive.google.com/uc?id=1hKra9Eirlx69_cZMvPmjwfCAFx6Fhf0a" alt="ParaDeep Icon" width="200"/>

# 🧬 ParaDeep: Sequence-Based Paratope Prediction with BiLSTM-CNN

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/PiyachatU/ParaDeep/blob/main/ParaDeep_Colab.ipynb)

**ParaDeep** is a lightweight, chain-aware deep learning framework for predicting **paratope residues** (antigen-binding sites) directly from antibody amino acid sequences. It employs a BiLSTM-CNN architecture with chain-specific encodings: **learnable embeddings** for heavy (H) chains and **one-hot encoding** for light (L) chains. It requires no structural data or large pretrained models.

## 🎯 What is ParaDeep?

ParaDeep was developed to enable **fast, interpretable, and accessible** paratope prediction in the early stages of antibody discovery. The model provides **per-residue binary predictions** (binding vs non-binding) and has been optimized for minimal computational overhead while maintaining competitive accuracy.

### Key Features:
- 🔬 **Sequence-only input**: No need for 3D structures or AlphaFold predictions
- ⚡ **Chain-aware modeling**: Independent models for H and L chains
- 🚀 **Lightweight architecture**: Suitable for local or Colab-based inference
- 📊 **Per-residue classification**: Clear binary output per amino acid
- 📁 **User-friendly I/O**: Direct sequence input or file upload
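
To make the two encodings concrete, here is a minimal sketch of how a sequence could be turned into integer indices (the usual input to a learnable embedding layer) and into one-hot vectors. The alphabet order, the `X` catch-all, and both function names are assumptions made for illustration; ParaDeep's own encoders may differ.

```python
# Illustrative only: alphabet order and helper names are assumptions,
# not ParaDeep's actual implementation.
ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"  # 20 standard residues plus X for unknown
AA_TO_INDEX = {aa: i for i, aa in enumerate(ALPHABET)}


def to_indices(seq):
    """Map each residue to an integer index (typical input to an embedding layer)."""
    return [AA_TO_INDEX.get(aa, AA_TO_INDEX["X"]) for aa in seq.upper()]


def to_one_hot(seq):
    """Map each residue to a fixed 0/1 vector with a single 1 at its index."""
    vectors = []
    for idx in to_indices(seq):
        row = [0] * len(ALPHABET)
        row[idx] = 1
        vectors.append(row)
    return vectors


indices = to_indices("EVQ")
one_hot = to_one_hot("DIQ")
print(indices)          # one integer per residue
print(len(one_hot[0]))  # vector width equals the alphabet size
```

With indices, the model learns a dense vector per residue during training; with one-hot vectors, the representation is fixed and every residue is equally distant from every other.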

### ▶️ How to use this notebook
1. Run each cell **from top to bottom**
2. Choose **Manual Input** or **File Upload**
3. ParaDeep will:
   - validate your sequences
   - run predictions
   - show results on screen
   - provide the results CSV and a ZIP file of the figures for download


## 1. 🛠️ Environment Setup

First, let's clone the ParaDeep repository from GitHub and install the required dependencies.


In [None]:
import os
import sys

# Check if we're in Colab
try:
    import google.colab
    IN_COLAB = True
    print("🔍 Running in Google Colab")
except ImportError:
    IN_COLAB = False
    print("🔍 Running in local environment")

# Clone the repository unless we are already inside it (e.g. when this cell is re-run)
already_inside = os.path.basename(os.getcwd()) == 'ParaDeep'
if already_inside or os.path.exists('ParaDeep'):
    print("✅ ParaDeep repository already exists")
else:
    print("📥 Cloning ParaDeep repository...")
    !git clone https://github.com/PiyachatU/ParaDeep.git
    print("✅ Repository cloned successfully")

# Change to the ParaDeep directory (skipped on re-runs, when we are already there)
if not already_inside:
    os.chdir('ParaDeep')
print(f"📂 Current directory: {os.getcwd()}")

# Install requirements
print("📦 Installing dependencies...")
!pip install -q -r requirements.txt

# Add src to Python path
sys.path.insert(0, os.path.join(os.getcwd(), "src"))

# Verify installation
try:
    import torch
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from tqdm import tqdm
    from Bio import SeqIO
    from datetime import datetime
    print("✅ All dependencies installed successfully")
    print(f"🔥 PyTorch version: {torch.__version__}")
    print(f"🖥️  Device: {'GPU' if torch.cuda.is_available() else 'CPU'}")
except ImportError as e:
    print(f"❌ Error importing dependencies: {e}")
    print("Please restart the runtime and try again.")

## 2. 📤 Input Your Antibody Sequences

Enter your Heavy (H) and Light (L) chain sequences below. The system will validate your sequences and run predictions automatically.

### Requirements:
- **Heavy Chain (H)**: Variable heavy chain sequence
- **Light Chain (L)**: Variable light chain sequence (kappa or lambda)
- **Format**: Standard single-letter amino acid codes (A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y)
- **Length**: Up to 130 residues (longer sequences are truncated to their first 130 residues)

### Example Sequences:
- **H-chain**: `EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVSAISGSGGSTYYADSVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYCAR`
- **L-chain**: `DIQMTQSPSSLSASVGDRVTITCRASQGIRNYLAWYQQKPGKAPKLLIYAASTLQSGVPSRFSGSGSGTDFTLTISSLQPEDFATYYCQRYNRAPYTFGQGTKVEIK`
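
For file upload, the upload cell in this notebook expects a CSV with the columns `Seq_ID`, `Chain_Type`, and `Seq_cap`. The sketch below writes such a file with Python's standard `csv` module, using the example sequences above; the file name `my_antibodies.csv` and the ID `MyAntibody` are placeholders.

```python
import csv

h_chain = "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVSAISGSGGSTYYADSVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYCAR"
l_chain = "DIQMTQSPSSLSASVGDRVTITCRASQGIRNYLAWYQQKPGKAPKLLIYAASTLQSGVPSRFSGSGSGTDFTLTISSLQPEDFATYYCQRYNRAPYTFGQGTKVEIK"

# One row per chain; both chains of an antibody share the same Seq_ID
rows = [
    {"Seq_ID": "MyAntibody", "Chain_Type": "H", "Seq_cap": h_chain},
    {"Seq_ID": "MyAntibody", "Chain_Type": "L", "Seq_cap": l_chain},
]

# newline="" is the csv module's recommended setting for files it writes
with open("my_antibodies.csv", "w", newline="") as handle:
    writer = csv.DictWriter(handle, fieldnames=["Seq_ID", "Chain_Type", "Seq_cap"])
    writer.writeheader()
    writer.writerows(rows)
```

A CSV states the chain type explicitly, which avoids relying on the header-based chain detection used for FASTA files.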


In [None]:
import pandas as pd
import sys

IN_COLAB = "google.colab" in sys.modules

print("🧬 ParaDeep Input")
print("=" * 60)
print("Choose input method:")
print("  1️⃣ Manual input (single antibody)")
print("  2️⃣ File upload (CSV / FASTA)")
print("=" * 60)

choice = input("Enter 1 or 2: ").strip()
input_df = pd.DataFrame()

# --------------------------------------------------
# MANUAL INPUT
# --------------------------------------------------
if choice == "1":
    h_seq = input("H-chain sequence (press Enter to skip): ").strip().upper()
    l_seq = input("L-chain sequence (press Enter to skip): ").strip().upper()
    seq_id = input("Sequence ID (optional): ").strip() or "MyAntibody"

    rows = []
    if h_seq:
        rows.append({
            "Seq_ID": seq_id,
            "Chain_Type": "H",
            "Seq_cap": h_seq
        })
    if l_seq:
        rows.append({
            "Seq_ID": seq_id,
            "Chain_Type": "L",
            "Seq_cap": l_seq
        })

    input_df = pd.DataFrame(rows)

# --------------------------------------------------
# FILE UPLOAD (CSV / FASTA)
# --------------------------------------------------
elif choice == "2":
    if not IN_COLAB:
        raise RuntimeError(
            "File upload requires Google Colab. "
            "Please use manual input or run in Colab."
        )

    from google.colab import files
    from Bio import SeqIO
    import os

    uploaded = files.upload()
    if not uploaded:
        raise RuntimeError("No file uploaded.")

    filename = list(uploaded.keys())[0]
    ext = os.path.splitext(filename)[1].lower()

    # -------------------------------
    # CSV INPUT
    # -------------------------------
    if ext == ".csv":
        input_df = pd.read_csv(filename)

        required = {"Seq_ID", "Chain_Type", "Seq_cap"}
        if not required.issubset(input_df.columns):
            raise ValueError(
                f"CSV must contain columns: {required}"
            )

        # Normalize
        input_df["Chain_Type"] = input_df["Chain_Type"].str.upper().str.strip()
        input_df["Seq_cap"] = input_df["Seq_cap"].str.upper().str.strip()

    # -------------------------------
    # FASTA INPUT
    # -------------------------------
    elif ext in [".fasta", ".fa", ".faa"]:
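        import re

        rows = []

        for record in SeqIO.parse(filename, "fasta"):
            header = record.id
            seq = str(record.seq).upper()

            # Match whole header tokens (e.g. "Ab1_H", "mAb1|VL", "Ab1 light chain")
            # instead of single letters: a substring test labels "light" as heavy
            # because the word contains an "H".
            tokens = set(re.split(r"[^A-Z0-9]+", record.description.upper()))
            if tokens & {"H", "VH", "HEAVY"}:
                chain = "H"
            elif tokens & {"L", "VL", "LIGHT", "KAPPA", "LAMBDA"}:
                chain = "L"
            else:
                chain = "H"  # default if not specified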

            rows.append({
                "Seq_ID": header,
                "Chain_Type": chain,
                "Seq_cap": seq
            })

        input_df = pd.DataFrame(rows)

    else:
        raise ValueError(
            "Unsupported file format. Please upload CSV or FASTA files only."
        )

else:
    raise ValueError("Invalid choice. Please enter 1 or 2.")

# --------------------------------------------------
# Final confirmation
# --------------------------------------------------
print(f"\n✅ Loaded {len(input_df)} sequence(s)")
display(input_df.head())

## 3. 🔍 Sequence Validation & Preprocessing

ParaDeep will now **automatically check and prepare** your sequences for prediction.

During this step:
- Invalid or uncommon amino acids are safely converted to `X`
- Sequences longer than **130 residues** are truncated to match the ParaDeep model
- No user action is required

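ℹ️ This preprocessing step ensures the input matches what the model expects. Residues beyond position 130 are dropped and receive **no** prediction, so keep this in mind when interpreting results for long or non-standard sequences.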
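
The two rules above can be sketched as a small helper that also reports what was changed, which helps when checking whether truncation removed residues of interest. The function name and return format are hypothetical; the notebook's own preprocessing cell applies the same rules inline.

```python
MAX_LEN = 130                          # model input length used in this notebook
VALID_AA = set("ACDEFGHIKLMNPQRSTVWYX")


def preprocess(seq, max_len=MAX_LEN):
    """Return (clean_seq, replaced_positions, n_truncated) for one sequence.

    Hypothetical helper: positions are 1-based indices into the input.
    """
    seq = seq.upper().strip()
    # Record where non-standard residues occur before replacing them with X
    replaced = [i + 1 for i, aa in enumerate(seq) if aa not in VALID_AA]
    clean = "".join(aa if aa in VALID_AA else "X" for aa in seq)
    # Count how many trailing residues the length limit removes
    n_truncated = max(0, len(clean) - max_len)
    return clean[:max_len], replaced, n_truncated


clean, replaced, dropped = preprocess("evqlbes" + "G" * 130)
print(len(clean), replaced, dropped)  # 130 [5] 7
```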

In [None]:
MAX_LEN = 130
valid_aa = set("ACDEFGHIKLMNPQRSTVWYX")

clean_rows = []
print("\n🔍 Checking sequence length...")

for _, row in input_df.iterrows():
    seq = row["Seq_cap"].upper().strip()
    seq_id = row["Seq_ID"]
    chain = row["Chain_Type"]

    # sanitize
    seq = "".join(a if a in valid_aa else "X" for a in seq)

    # length rule
    if len(seq) > MAX_LEN:
        print(f"⚠️  {seq_id} ({chain}): {len(seq)} > {MAX_LEN}, truncating")
        seq = seq[:MAX_LEN]

    clean_rows.append({
        "Seq_ID": seq_id,
        "Chain_Type": chain,
        "Seq_cap": seq
    })

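clean_df = pd.DataFrame(clean_rows)

# Stop early with a clear message if no sequences were provided
if clean_df.empty:
    raise ValueError("No sequences to process. Please re-run the input cell.")

print("📏 Max length after preprocessing:",
      clean_df["Seq_cap"].str.len().max())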

## 4. 🚀 Run ParaDeep Prediction

ParaDeep will now run deep learning–based paratope prediction on your processed sequences.

During this step:
- The pretrained ParaDeep models are applied to each chain
- Predictions are generated at the **residue level**
- Output files are automatically saved for download

⏱️ This step typically takes a few seconds, depending on the number and length of sequences.
No user action is required.

In [None]:
from datetime import datetime
from src.core import predict_paradeep
import os
import sys
import pandas as pd

# --------------------------------------------------
# Environment detection
# --------------------------------------------------
IN_COLAB = "google.colab" in sys.modules

# --------------------------------------------------
# File paths
# --------------------------------------------------
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
input_file = f"data/input_{timestamp}.csv"
output_file = f"output/predictions_{timestamp}.csv"

os.makedirs("data", exist_ok=True)
os.makedirs("output", exist_ok=True)

# --------------------------------------------------
# Save cleaned input
# --------------------------------------------------
clean_df.to_csv(input_file, index=False)

print(f"\n🚀 Running ParaDeep prediction on {len(clean_df)} chain(s)...")
print("   Please wait while the model processes your sequences.\n")

# --------------------------------------------------
# Run ParaDeep
# --------------------------------------------------
predict_paradeep(
    input_path=input_file,
    model_H_path="models/Best_Model_H.pt",
    model_L_path="models/Best_Model_L.pt",
    kernel_H="Full",
    kernel_L="Full",
    output_path=output_file,
    visualize=True
)

print("\n✅ ParaDeep prediction completed successfully")
print(f"📄 Results saved to: {output_file}")

# --------------------------------------------------
# RESULTS SUMMARY (ON-SCREEN FIRST)
# --------------------------------------------------
results_df = pd.read_csv(output_file)

print("\n" + "=" * 60)
print("📈 Results Summary")
print("=" * 60)

for seq_id in results_df["Seq_ID"].unique():
    print(f"\n🧬 Antibody: {seq_id}")
    df_seq = results_df[results_df["Seq_ID"] == seq_id]

    for chain in ["H", "L"]:
        pred_col = f"{chain}_Prediction"
        chain_df = df_seq[df_seq["Chain_Type"] == chain]

        if chain_df.empty or pred_col not in chain_df.columns:
            continue

        total = len(chain_df)
        binding_df = chain_df[chain_df[pred_col] == 1]
        pct = 100 * len(binding_df) / total

        print(f"   🔗 {chain}-chain: {len(binding_df)}/{total} binding residues ({pct:.1f}%)")

        if binding_df.empty:
            print("      No binding residues predicted above threshold.")
            continue

        # Binding sites
        sites = [
            f"{r}{p}" for r, p in zip(
                binding_df["Residue"],
                binding_df["Residue_Position"]
            )
        ]
        print(f"      Binding sites: {', '.join(sites)}")

        # Highlighted sequence
        highlighted = "".join(
            f"[{r}]" if p == 1 else r
            for r, p in zip(
                chain_df["Residue"],
                chain_df[pred_col]
            )
        )
        print(f"      Highlighted: {highlighted}")

print("\nℹ️  Binding residues are shown in [brackets] above and highlighted in red in the figures.")

# --------------------------------------------------
# DOWNLOAD (COLAB ONLY)
# --------------------------------------------------
if IN_COLAB:
    print("\n📦 Your result file is ready for download.")
    print("⬇️ The download will start automatically.")

    from google.colab import files
    files.download(output_file)
else:
    print("\nℹ️ You are running locally.")
    print(f"   The result file is available at: {output_file}")

## 5. 🎨 Visualize Paratope Predictions

ParaDeep will now generate **visual representations of the predicted paratope residues**.

For each antibody chain, the visualization includes:
- **Residue-level binding probabilities**
- A **decision threshold** at 0.5
- Highlighted residues predicted as part of the paratope
- The full amino acid sequence aligned with predictions

ℹ️ If multiple antibodies are provided, figures will be generated for **all sequences** and saved automatically.  
By default, only the first 10 plots are displayed on screen to keep the notebook responsive.

In [None]:
%matplotlib inline

import os
import zipfile
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from glob import glob
import seaborn as sns
import sys

IN_COLAB = "google.colab" in sys.modules

plt.style.use("default")
sns.set_palette("husl")

def create_enhanced_visualization(
    save_dir="output/enhanced_figures",
    max_display=10,
    export_zip=True
):
    """
    Create visualizations of ParaDeep paratope predictions.

    - Displays up to `max_display` figures on screen
    - Saves ALL figures to disk
    - Optionally exports all figures as a ZIP file
    """

    os.makedirs(save_dir, exist_ok=True)

    prediction_files = sorted(
        glob("output/*predictions_*.csv"),
        key=os.path.getmtime,
        reverse=True
    )

    if not prediction_files:
        print("❌ No prediction files found in the output directory.")
        print("   Please run the prediction step first.")
        return

    latest_file = prediction_files[0]
    print(f"📊 Using prediction file: {latest_file}")

    df = pd.read_csv(latest_file)

    required_cols = {"Seq_ID", "Chain_Type", "Residue_Position", "Residue"}
    if not required_cols.issubset(df.columns):
        print("❌ Prediction file is missing required columns.")
        return

    print(f"📈 Generating visualizations for {df['Seq_ID'].nunique()} antibody ID(s)...")

    shown = 0
    saved_figures = []

    for seq_id in df["Seq_ID"].unique():
        df_seq_all = df[df["Seq_ID"] == seq_id]

        for chain in ["H", "L"]:
            df_seq = df_seq_all[df_seq_all["Chain_Type"] == chain]
            if df_seq.empty:
                continue

            pred_col = f"{chain}_Prediction"
            prob_col = f"{chain}_Probability"

            if pred_col not in df_seq.columns or prob_col not in df_seq.columns:
                continue

            df_seq = df_seq.sort_values("Residue_Position")

            positions = df_seq["Residue_Position"].values
            residues = df_seq["Residue"].values
            probabilities = df_seq[prob_col].values
            predictions = df_seq[pred_col].astype(int).values

            fig, (ax1, ax2) = plt.subplots(
                2, 1,
                figsize=(16, 8),
                gridspec_kw={"height_ratios": [3, 1]},
                sharex=True
            )

            ax1.bar(
                positions,
                probabilities,
                color=["#ff6b6b" if p else "#4ecdc4" for p in predictions],
                alpha=0.75,
                edgecolor="black",
                linewidth=0.4
            )

            ax1.axhline(
                y=0.5,
                color="red",
                linestyle="--",
                linewidth=2,
                label="Decision Threshold (0.5)"
            )

            binding_mask = predictions == 1
            if np.any(binding_mask):
                ax1.scatter(
                    positions[binding_mask],
                    probabilities[binding_mask],
                    color="darkred",
                    s=90,
                    zorder=5,
                    edgecolor="white",
                    linewidth=1.5,
                    label=f"Binding residues ({binding_mask.sum()})"
                )

                for pos, res, prob in zip(
                    positions[binding_mask],
                    residues[binding_mask],
                    probabilities[binding_mask]
                ):
                    ax1.annotate(
                        res,
                        (pos, prob),
                        xytext=(0, 12),
                        textcoords="offset points",
                        ha="center",
                        va="bottom",
                        fontsize=10,
                        fontweight="bold",
                        bbox=dict(
                            boxstyle="round,pad=0.3",
                            facecolor="yellow",
                            alpha=0.8
                        )
                    )

            ax1.set_title(
                f"Paratope Prediction: {seq_id} ({chain}-chain)",
                fontsize=16,
                fontweight="bold",
                pad=15
            )
            ax1.set_ylabel("Binding Probability", fontsize=12, fontweight="bold")
            ax1.set_ylim(0, 1.2)
            ax1.grid(alpha=0.3)
            ax1.legend(loc="upper right", fontsize=10)


            ax2.bar(
                positions,
                np.ones_like(positions),
                color=["red" if p else "lightgray" for p in predictions],
                alpha=0.85
            )

            for pos, res in zip(positions, residues):
                ax2.text(
                    pos, 0.5, res,
                    ha="center",
                    va="center",
                    fontsize=9,
                    fontweight="bold"
                )

            ax2.set_ylim(0, 1)
            ax2.set_yticks([])
            ax2.set_xlabel("Residue Position", fontsize=12, fontweight="bold")
            ax2.set_ylabel("Sequence", fontsize=10)
            ax2.set_xlim(positions.min() - 0.5, positions.max() + 0.5)

            plt.tight_layout()

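            # Make the ID safe for use in a file name (FASTA headers may contain "/" or "|")
            safe_id = "".join(c if c.isalnum() or c in "._-" else "_" for c in str(seq_id))
            fig_path = os.path.join(save_dir, f"{safe_id}_{chain}_paradeep.png")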
            plt.savefig(fig_path, dpi=300, bbox_inches="tight")
            saved_figures.append(fig_path)

            if shown < max_display:
                plt.show()
            else:
                plt.close(fig)

            shown += 1
            print(f"💾 Saved figure: {fig_path}")

    if export_zip and saved_figures:
        zip_path = os.path.join("output", "ParaDeep_figures.zip")

        with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zipf:
            for fig in saved_figures:
                zipf.write(fig, arcname=os.path.basename(fig))

        print("\n📦 All figures have been packaged into a ZIP file.")
        print(f"   ZIP file: {zip_path}")

        if IN_COLAB:
            from google.colab import files
            print("⬇️ Downloading figures ZIP...")
            files.download(zip_path)

    print("\n✅ Visualization completed.")
    print(f"📁 Individual figures saved to: {save_dir}")


print("🎨 Creating paratope visualizations...")
create_enhanced_visualization()

## 🎉 Conclusion

You have successfully completed paratope prediction using **ParaDeep**, a sequence-based deep learning framework for identifying antibody binding residues.

### Interpreting Your Results
- **Predicted Binding Residues**  
  Amino acids classified as part of the paratope, highlighted in brackets (`[ ]`) in the sequence and marked in red in the visualizations.

- **Prediction Probabilities**  
  Residue-level scores from **0.0 to 1.0** that reflect the model's confidence that a residue belongs to the paratope.

- **Decision Threshold**  
  A probability cutoff of **0.5** is applied to generate binary binding/non-binding classifications.

- **Visual Outputs**  
  Red bars and markers indicate residues predicted to participate in antigen binding, aligned with the corresponding amino acid sequence.
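
Because the output keeps the per-residue probabilities, a different cutoff can be applied after the fact without rerunning the model. The sketch below shows the idea on made-up probabilities; the values and the helper name are illustrative, and in practice the scores would come from the `H_Probability` or `L_Probability` column of the results CSV.

```python
def apply_threshold(probabilities, threshold=0.5):
    """Label a residue 1 (binding) when its probability meets the cutoff.

    Whether ParaDeep itself uses >= or > at exactly 0.5 is not stated in
    this notebook; >= is an assumption here.
    """
    return [1 if p >= threshold else 0 for p in probabilities]


# Made-up scores for a six-residue stretch (not model output)
probs = [0.05, 0.48, 0.51, 0.92, 0.67, 0.10]

default_calls = apply_threshold(probs)        # [0, 0, 1, 1, 1, 0]
strict_calls = apply_threshold(probs, 0.7)    # [0, 0, 0, 1, 0, 0]
print(sum(default_calls), sum(strict_calls))  # 3 1
```

A stricter cutoff trades fewer predicted residues for higher confidence in each one.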

---

### Recommended Next Steps
1. **Review Predictions**  
   Examine the predicted paratope residues and probability profiles across antibody chains.

2. **Experimental Validation**  
   Where possible, compare predictions with known epitope–paratope data or structural information.

3. **Rational Design**  
   Use the predicted binding sites to guide mutagenesis, affinity maturation, or antibody engineering studies.

4. **Iterative Analysis**  
   Apply ParaDeep to additional antibody variants to explore sequence–function relationships.
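
For step 2, agreement between predicted and known paratope residues is often summarized with precision and recall. The sketch below computes both from two sets of residue positions; the positions shown are invented for illustration, and the function is not part of ParaDeep.

```python
def precision_recall(predicted, known):
    """Compare two collections of residue positions.

    precision = fraction of predicted residues that are known binders
    recall    = fraction of known binders that were predicted
    """
    predicted, known = set(predicted), set(known)
    overlap = len(predicted & known)
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(known) if known else 0.0
    return precision, recall


# Invented positions, for illustration only
predicted_positions = [31, 32, 33, 52, 99, 100]
known_positions = [32, 33, 52, 53, 100, 101, 102]

p, r = precision_recall(predicted_positions, known_positions)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.67 recall=0.57
```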

---

### Citation
If ParaDeep contributes to your research, please cite or acknowledge the following:

ParaDeep: Sequence-Based Paratope Prediction with BiLSTM-CNN
GitHub repository: https://github.com/PiyachatU/ParaDeep

---
ℹ️ *ParaDeep | Last updated: 2025-12-26*
