<img src="https://drive.google.com/uc?id=1hKra9Eirlx69_cZMvPmjwfCAFx6Fhf0a" alt="ParaDeep Icon" width="200"/>

# ParaDeep: Sequence-Based Paratope Prediction with BiLSTM-CNN
**ParaDeep** is a lightweight, chain-aware deep learning framework for predicting **paratope residues** (antigen-binding sites) directly from antibody amino acid sequences. It employs a BiLSTM-CNN architecture with task-specific encodings—**learnable embeddings** for heavy (H) chains and **one-hot encoding** for light (L) chains—requiring no structural data or large pretrained models.



---

## What is ParaDeep?
ParaDeep was developed to enable **fast, interpretable, and accessible** paratope prediction in the early stages of antibody discovery. The model provides **per-residue binary predictions** (binding vs non-binding) and has been optimized for minimal computational overhead while maintaining competitive accuracy.

The framework includes pretrained models for:
*   Heavy (H) chains using embedding-based input
*   Light (L) chains using one-hot encoding
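
To make the difference between the two input encodings concrete, here is a minimal sketch in plain Python. It is not ParaDeep's own preprocessing code: the alphabet order, the use of index 0 for padding or unknown residues, and the function names `to_indices` and `to_one_hot` are all assumptions made for illustration. Integer indices are the kind of input a learnable embedding layer consumes, while one-hot encoding produces a fixed indicator vector per residue.

```python
# Hypothetical alphabet order; ParaDeep's actual vocabulary may differ.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Index 0 is reserved here for padding/unknown residues (an assumption).
AA_TO_INDEX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}


def to_indices(seq):
    """Integer codes, the kind of input a learnable embedding layer consumes."""
    return [AA_TO_INDEX.get(aa, 0) for aa in seq.upper()]


def to_one_hot(seq):
    """One fixed 20-dimensional indicator vector per residue."""
    vectors = []
    for aa in seq.upper():
        row = [0] * len(AMINO_ACIDS)
        if aa in AA_TO_INDEX:
            row[AA_TO_INDEX[aa] - 1] = 1
        vectors.append(row)
    return vectors


indices = to_indices("ACY")
one_hot = to_one_hot("ACY")
print(indices)          # [1, 2, 20]
print(one_hot[0][:3])   # [1, 0, 0]
```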

---

## Key Features

* Sequence-only input: No need for 3D structures or AlphaFold predictions
* Chain-aware modeling: Independent models for H and L chains
* Lightweight architecture: Suitable for local or Colab-based inference
* Per-residue classification: Clear binary output per amino acid
* User-friendly I/O: Reads .csv, .fasta, or .txt inputs and exports annotated .csv results

---

## What This Notebook Does
In this notebook, you will:

1. Set up the environment and install dependencies
2. Download and load pretrained ParaDeep models
3. Upload your antibody sequences
4. Run predictions on both heavy and light chains
5. Save and optionally visualize the results





## 1. Clone the ParaDeep Repository and Set Up the Environment

First, let's clone the ParaDeep repository from GitHub and install the required dependencies:

In [None]:
# Clone the repository and set up
!git clone https://github.com/PiyachatU/ParaDeep.git
%cd ParaDeep
%pip install -r requirements.txt

## 2. Examine Sample Input Format

Let's look at the sample input format to understand what our data should look like:

## Input Handling and Validation

This block performs all necessary checks before running predictions with ParaDeep. It ensures that the input file format, content structure, and sequence properties meet the criteria required by the pipeline.

### Supported Input Formats
- `.csv` — must contain at least the following columns:
  - `Seq_cap`: amino acid sequences
  - `Chain_Type`: must be either `"H"` (heavy chain) or `"L"` (light chain)
- `.fasta` / `.fa` — standard FASTA format (headers and sequences); parsed into `Seq_cap` and `Chain_Type` by `load_sequences()`
- `.txt` — one sequence per line; chain type may be inferred or annotated in pre-processing
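
For readers unfamiliar with FASTA, the sketch below shows how headers and sequences can be recovered from such a file using only the standard library. It is an illustration, not the implementation of `load_sequences()`: the function name `parse_fasta` and the example headers are hypothetical, and how ParaDeep derives `Chain_Type` from a FASTA record is not shown in this notebook, so the sketch leaves that step out.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Sequence lines may be wrapped, so collect and join them.
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Short placeholder fragments, not complete variable domains.
example = """>mAb1_H
EVQLVESGG
GLVQPGG
>mAb1_L
DIQMTQSPSS
"""
for header, seq in parse_fasta(example):
    print(header, len(seq))
```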

### Automatically Checked Criteria

1. **File Format and Parsing**  
   - Format is detected from the file extension  
   - All files are parsed using the `load_sequences()` function from `io_utils.py`  
   - Returns a `pandas.DataFrame` with required columns

2. **Column Validation**  
   - The input must include both `Seq_cap` and `Chain_Type`  
   - `Chain_Type` determines which model (H or L) to use for prediction

3. **Sequence Validation**  
   - All entries must be valid amino acid sequences (`str`)  
   - The notebook reports:
     - Minimum, maximum, and average sequence lengths
     - Any sequences that exceed the maximum supported length (`MAX_SEQ_LEN = 130`)  
   - Sequences longer than `MAX_SEQ_LEN` are **automatically truncated**, and shorter ones are **padded**, before model input

4. **Chain Type Distribution**  
   - The notebook displays a count of how many H and L chain sequences are included  
   - Ensures that predictions are properly routed to the corresponding pretrained model
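
The length handling in item 3 can be pictured with a small helper. This is a sketch under stated assumptions rather than ParaDeep's own code: truncation from the right, the `-` pad character, and the name `fit_to_length` are hypothetical, and only the value `MAX_SEQ_LEN = 130` comes from this notebook.

```python
MAX_SEQ_LEN = 130  # value stated in this notebook


def fit_to_length(seq, max_len=MAX_SEQ_LEN, pad_char="-"):
    """Truncate on the right if too long, otherwise right-pad to max_len.

    Right-side truncation and the '-' pad character are assumptions.
    """
    if len(seq) > max_len:
        return seq[:max_len]
    return seq + pad_char * (max_len - len(seq))


short = fit_to_length("EVQLVESGG")
long_ = fit_to_length("A" * 150)
print(len(short), len(long_))  # 130 130
```

Note that residues beyond position 130 would receive no prediction under this scheme, which is why the notebook reports over-length sequences up front.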

---
### Notes:
If any of these conditions are not met (e.g., missing columns, unsupported file format, or invalid sequences), the notebook will raise a clear and descriptive error message to guide correction.
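
A per-record check in this spirit might look like the sketch below. It is illustrative only: the helper `validate_record` is hypothetical, and restricting residues to the 20 standard amino acids is an assumption, since the notebook does not state which characters ParaDeep accepts.

```python
# Assumption: only the 20 standard amino acids are accepted.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")
VALID_CHAINS = {"H", "L"}


def validate_record(seq, chain_type):
    """Raise ValueError with a descriptive message if a record is unusable."""
    if not isinstance(seq, str) or not seq:
        raise ValueError("Sequence must be a non-empty string.")
    if chain_type not in VALID_CHAINS:
        raise ValueError(f"Chain_Type must be 'H' or 'L', got {chain_type!r}.")
    bad = sorted(set(seq.upper()) - VALID_RESIDUES)
    if bad:
        raise ValueError(f"Sequence contains unsupported characters: {bad}")


validate_record("EVQLVESGG", "H")  # passes silently

try:
    validate_record("EVQ1VESGG", "H")
except ValueError as err:
    print(err)
```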


In [None]:
import sys
sys.path.append("src")
import os
import pandas as pd
from io_utils import load_sequences
from collections import Counter

# Step 1: Specify your input file (can be .csv, .fasta, or .txt)
input_file = "data/sample_input.txt"  # Change as needed
print(f" Loading sequences from: {input_file}")

# Step 2: Load as DataFrame (handles all supported formats)
df = load_sequences(input_file)  # Unified loader for .csv, .fasta, .txt

# Step 3: Validate required structure
required_cols = {"Seq_cap", "Chain_Type"}
if not required_cols.issubset(df.columns):
    raise ValueError(f" Input file must contain columns: {required_cols}")

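# Extract sequences and chain labels
sequences = df["Seq_cap"].tolist()
chain_types = df["Chain_Type"].tolist()

# Reject empty input and non-string entries (e.g. NaN from blank CSV cells)
if not sequences or not all(isinstance(seq, str) and seq for seq in sequences):
    raise ValueError(" Every 'Seq_cap' entry must be a non-empty amino acid string")

print(f" Loaded {len(sequences)} sequence(s)")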

# Step 4: Sequence length analysis
MAX_SEQ_LEN = 130
lengths = [len(seq) for seq in sequences]

print(f" Sequence lengths — Min: {min(lengths)}, Max: {max(lengths)}, Avg: {sum(lengths) // len(lengths)}")

long_seqs = [i for i, l in enumerate(lengths) if l > MAX_SEQ_LEN]
if long_seqs:
    print(f" {len(long_seqs)} sequence(s) exceed MAX_SEQ_LEN = {MAX_SEQ_LEN} and will be truncated during prediction.")
else:
    print(" All sequences are within the length limit.")

# Step 5: Report Chain Type Distribution (H vs L)
type_counts = Counter(chain_types)
print(f" Chain Type Distribution: {dict(type_counts)}")


## 3. Running ParaDeep Predictions
The following command runs the ParaDeep model on a user-provided sequence file. It uses **two pretrained models**: one for the heavy chain (H) and one for the light chain (L). ParaDeep reads the `Chain_Type` of each sequence and applies the matching model.
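
Conceptually, this routing amounts to a dictionary dispatch on the chain label. The sketch below illustrates the idea only; it does not reproduce the internals of `paradeep.py`. The function `route_by_chain` and the placeholder predictors are hypothetical and stand in for the real pretrained models.

```python
from collections import defaultdict


def route_by_chain(records, predictors):
    """Group (seq_id, chain_type, sequence) records and call the matching predictor."""
    grouped = defaultdict(list)
    for seq_id, chain_type, seq in records:
        chain = chain_type.upper()
        if chain not in predictors:
            raise ValueError(f"No model registered for chain type {chain_type!r}")
        grouped[chain].append((seq_id, seq))
    return {
        chain: {seq_id: predictors[chain](seq) for seq_id, seq in items}
        for chain, items in grouped.items()
    }


# Placeholder predictors: they return a constant probability per residue
# and exist only to make the sketch runnable.
predictors = {
    "H": lambda seq: [0.0] * len(seq),
    "L": lambda seq: [1.0] * len(seq),
}
records = [("ab1", "H", "EVQLV"), ("ab1_light", "l", "DIQMT")]
results = route_by_chain(records, predictors)
print(sorted(results))  # ['H', 'L']
```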

### **Example Command:**

In [None]:
!python paradeep.py \
  --input data/sample_input.csv \
  --modelH models/Best_Model_H.pt \
  --modelL models/Best_Model_L.pt

## 4. Visualization of Paratope Predictions
This section automatically visualizes the predicted paratope residues using per-residue binding probabilities.

### Key Features:
* Bar chart shows the probability of each residue being part of a paratope
* Red scatter dots highlight predicted binding residues (Prediction = 1)
* Amino acid labels are displayed at a fixed height for clarity
* Threshold line at 0.5 marks the decision boundary used for classification

### What It Detects Automatically:
1. **Latest Prediction File**
The script scans the `output/` folder and selects the most recently modified `predictions_*.csv` file.
2. **Sequence IDs and Chain Types**
It automatically identifies unique sequences (Seq_ID) and checks for both heavy (H) and light (L) chain prediction columns.
3. **Per-Chain Visualization**
For each sequence, a separate chart is created for the H and/or L chain (if present).

### Example Output
* X-axis: Residue positions (1-based)
* Y-axis: Probability of being a binding residue (range: 0.0–1.0)
* Threshold at 0.5 is used to binarize predictions
* Binding residues are marked with both color and label (e.g., [Y])

---
### Notes:
* Only residues with Prediction = 1 are annotated above the bar
* The x-axis scales automatically to the length of the sequence
* Text highlights below the plot indicate binding positions in square brackets
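
The thresholding and bracket highlighting described above can be sketched in a few lines. The probabilities are made-up values for illustration, and treating a probability exactly equal to 0.5 as binding (`>=`) is an assumption; the notebook only states that 0.5 is the decision boundary.

```python
THRESHOLD = 0.5  # decision boundary stated above


def binarize(probabilities, threshold=THRESHOLD):
    """Assumption: a residue counts as binding when probability >= threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


def highlight(residues, predictions):
    """Wrap predicted binding residues in square brackets."""
    return "".join(f"[{r}]" if pred == 1 else r for r, pred in zip(residues, predictions))


residues = "GYTFT"
probabilities = [0.10, 0.82, 0.35, 0.91, 0.20]  # made-up values for illustration
predictions = binarize(probabilities)
print(predictions)                       # [0, 1, 0, 1, 0]
print(highlight(residues, predictions))  # G[Y]T[F]T
```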

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import os
from glob import glob

def visualize_latest_paradeep_predictions(save_dir="output/figures"):
    """
    Automatically finds the latest ParaDeep predictions file,
    and visualizes predicted binding residues for H and L chains.
    """
    os.makedirs(save_dir, exist_ok=True)

    # Step 1: Load latest prediction file
    prediction_files = sorted(glob("output/predictions_*.csv"), key=os.path.getmtime, reverse=True)
    if not prediction_files:
        print(" No prediction files found in 'output/'")
        return
    latest_file = prediction_files[0]
    print(f" Using latest prediction file: {latest_file}")

    df = pd.read_csv(latest_file)
    if 'Seq_ID' not in df.columns or 'Chain_Type' not in df.columns:
        print(" Missing required columns ('Seq_ID' or 'Chain_Type') in prediction file.")
        return

    # Step 2: Visualize each sequence by its chain type (H or L)
    for chain in ['H', 'L']:
        pred_col = f"{chain}_Prediction"
        prob_col = f"{chain}_Probability"

        if pred_col not in df.columns or prob_col not in df.columns:
            continue  # Skip if predictions for this chain are not present

        chain_df = df[df['Chain_Type'].str.upper() == chain]
        for seq_id in chain_df['Seq_ID'].unique():
            df_seq = chain_df[chain_df['Seq_ID'] == seq_id]
            if df_seq.empty:
                continue

            # Extract values
            positions = df_seq['Residue_Position'].values
            residues = df_seq['Residue'].values
            probabilities = df_seq[prob_col].values
            predictions = df_seq[pred_col].astype(int).values

            # Plot
            plt.figure(figsize=(15, 4))
            plt.bar(positions, probabilities, color='skyblue', alpha=0.7)
            plt.axhline(y=0.5, color='red', linestyle='--', alpha=0.7, label='Threshold (0.5)')
            plt.scatter(positions[predictions == 1], probabilities[predictions == 1],
                        color='red', s=100, label='Binding Residues')

            # Label only predicted binding residues
            for p, res in zip(positions[predictions == 1], residues[predictions == 1]):
                plt.text(p, 1.05, res, ha='center', fontweight='bold')

            plt.title(f"Paratope Prediction: {seq_id} ({chain}-chain)")
            plt.xlabel("Residue Position")
            plt.ylabel("Binding Probability")
            plt.ylim(0, 1.15)
            plt.grid(True, alpha=0.3)
            plt.legend()
            plt.tight_layout()

            # Save figure
            fig_path = os.path.join(save_dir, f"{seq_id}_{chain}.png")
            plt.savefig(fig_path)
            plt.show()
            print(f" Saved visualization to: {fig_path}")

            # Highlight residues
            highlighted_seq = ''.join([f"[{r}]" if pred == 1 else r for r, pred in zip(residues, predictions)])
            print(f" Highlighted Sequence ({seq_id}):\n{highlighted_seq}\n")

# ✅ Run the visualization
visualize_latest_paradeep_predictions()

## 5. Upload Your Own Sequences


To use ParaDeep with your own antibody sequence data in Google Colab:

1. Open Google Drive in a separate tab
2. Create the following folder structure:

*   `/MyDrive/ParaDeep/`
*   `/MyDrive/ParaDeep/data/`

3. Navigate to the `/MyDrive/ParaDeep/data/` folder.

4. Upload your CSV file to this folder. An example file containing three sequences is available for download: https://github.com/PiyachatU/ParaDeep/blob/main/data/my_sequences.csv

5. Ensure your CSV has three columns:
   - `Seq_ID`: unique sequence ID
   - `Chain_Type`: chain type, `H` or `L`
   - `Seq_cap`: amino acid sequence (e.g., "EVQLVESGG...")
6. Run the code cell below to load and process your file.
7. Authorize Google Drive access. When you run the next cell, a link will open asking for permission to access your Google Drive. **Please click “Allow” or “Select All” when asked**; otherwise the notebook may not be able to access your Drive.
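
If you prefer to generate the CSV programmatically before uploading it, a minimal sketch with Python's standard `csv` module follows. The sequence IDs and the short sequence fragments are placeholders for illustration, not real antibody data, and the helper name `write_input_csv` is hypothetical.

```python
import csv

FIELDNAMES = ["Seq_ID", "Chain_Type", "Seq_cap"]


def write_input_csv(path, rows):
    """Write rows (dicts keyed by FIELDNAMES) to a CSV file with a header line."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)


# Placeholder fragments; replace with your full sequences.
rows = [
    {"Seq_ID": "ab1_H", "Chain_Type": "H", "Seq_cap": "EVQLVESGG"},
    {"Seq_ID": "ab1_L", "Chain_Type": "L", "Seq_cap": "DIQMTQSPSS"},
]
write_input_csv("my_sequences.csv", rows)

# Read the file back to confirm the column layout.
with open("my_sequences.csv", newline="") as handle:
    loaded = list(csv.DictReader(handle))
print(loaded[0]["Chain_Type"], len(loaded))
```

The resulting `my_sequences.csv` can then be uploaded to the `data/` folder described in step 4.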


**Privacy & Security Notice**
This notebook is safe to use and does **not access any data from your Google Drive** unless you explicitly run a code cell to do so. When you authorize Google Drive access (via `drive.mount()`), only **your own account** can see and interact with your files — **the notebook author or others cannot access your Drive data**.

You remain in full control:
- No data will be read from or written to your Drive without your action.
- You can revoke access at any time via [https://myaccount.google.com/permissions](https://myaccount.google.com/permissions).
- This notebook runs entirely in your private Colab session.

In [None]:
# STEP 1: Mount Google Drive
from google.colab import drive
import os
import pandas as pd
from datetime import datetime
from glob import glob
import sys

drive.mount('/content/drive')

# STEP 2: Load input file (CSV, FASTA, or TXT)
sys.path.append("src")
from io_utils import load_sequences

input_path = '/content/drive/MyDrive/ParaDeep/data/my_sequences.csv'  # Change as needed

def load_sequence_file(path):
    if not os.path.exists(path):
        print(f"File not found: {path}")
        return None

    ext = os.path.splitext(path)[1].lower()
    try:
        if ext == '.csv':
            df = pd.read_csv(path)
        else:
            # load_sequences() is described earlier as returning a DataFrame;
            # a plain list of sequence strings is handled as a fallback.
            loaded = load_sequences(path)
            if isinstance(loaded, pd.DataFrame):
                df = loaded
            else:
                sequences = list(loaded)
                if not all(isinstance(s, str) for s in sequences):
                    raise ValueError("Parsed data must be amino acid sequences (strings).")

                # Infer chain type from filename
                fname = os.path.basename(path).lower()
                if 'heavy' in fname or '_h' in fname:
                    chain_type = 'H'
                elif 'light' in fname or '_l' in fname:
                    chain_type = 'L'
                else:
                    raise ValueError("Could not infer Chain_Type from filename. Use _H or _L in name.")

                # Build DataFrame
                df = pd.DataFrame({
                    'Seq_ID': [f"{chain_type}_seq_{i+1}" for i in range(len(sequences))],
                    'Seq_cap': sequences,
                    'Chain_Type': chain_type
                })

        # Validate required columns for every input format
        for col in ('Seq_cap', 'Chain_Type'):
            if col not in df.columns:
                raise ValueError(f"Input must contain '{col}' column.")

        print(f" Loaded {len(df)} sequence(s) from: {path}")
        display(df.head())
        return df
    except Exception as e:
        print(f"Failed to parse file: {e}")
        return None

# Run loading
df_result = load_sequence_file(input_path)

# STEP 4: Visualization with figure saving
import matplotlib.pyplot as plt
import numpy as np

def visualize_predictions_save(df, seq_id, chain, save_dir="/content/drive/MyDrive/ParaDeep/results/figures"):
    """
    Visualizes and saves ParaDeep prediction results for a given sequence and chain type.
    Only visualizes if chain type matches the model used.
    """
    os.makedirs(save_dir, exist_ok=True)

    #  Check chain match from 'Chain_Type' column
    seq_chain_type = df[df['Seq_ID'] == seq_id]['Chain_Type'].dropna().unique()
    if len(seq_chain_type) != 1:
        print(f" Skipping {seq_id}: Unable to determine unique Chain_Type.")
        return

    if seq_chain_type[0].upper() != chain.upper():
        print(f" Skipping {seq_id}: Chain_Type is '{seq_chain_type[0]}', not '{chain}'.")
        return

    #  Check prediction columns
    prob_col = f"{chain}_Probability"
    pred_col = f"{chain}_Prediction"

    if prob_col not in df.columns or pred_col not in df.columns:
        print(f" Columns for chain '{chain}' not found in DataFrame.")
        return

    #  Extract prediction results
    df_seq = df[(df['Seq_ID'] == seq_id) & (~df[pred_col].isna())]
    if df_seq.empty:
        print(f" No prediction data for {seq_id} ({chain}-chain). Skipping.")
        return

    pos = df_seq['Residue_Position'].values
    aa = df_seq['Residue'].values
    prob = df_seq[prob_col].values
    pred = df_seq[pred_col].astype(int).values

    #  Plotting
    plt.figure(figsize=(15, 4))
    plt.bar(pos, prob, color='skyblue', alpha=0.7)
    plt.axhline(y=0.5, color='red', linestyle='--', label='Threshold')
    plt.scatter(pos[pred == 1], prob[pred == 1], color='red', s=100, label='Binding Residues')
    for p, r in zip(pos[pred == 1], aa[pred == 1]):
        plt.text(p, 1.05, r, ha='center', fontweight='bold')

    plt.title(f"ParaDeep {chain}-Chain Prediction: {seq_id}")
    plt.xlabel("Residue Position")
    plt.ylabel("Binding Probability")
    plt.ylim(0, 1.15)
    plt.legend()
    plt.tight_layout()
    plt.grid(True)

    # Save and show
    fig_path = os.path.join(save_dir, f"{seq_id}_{chain}.png")
    plt.savefig(fig_path)
    plt.show()
    print(f" Saved visualization to: {fig_path}")

# STEP 5: Load result and visualize only matching chain-model pairs
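# Sort by modification time so that "latest" matches the visualization cell above
results = sorted(glob(os.path.join("output", 'predictions_*.csv')), key=os.path.getmtime, reverse=True)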
if results:
    latest = results[0]
    df = pd.read_csv(latest)

    for chain in ['H', 'L']:
        pred_col = f"{chain}_Prediction"
        if pred_col in df.columns:
            #  Iterate only sequences of the same chain type
            matching_ids = df[df['Chain_Type'].str.upper() == chain.upper()]['Seq_ID'].unique()
            for seq_id in matching_ids:
                visualize_predictions_save(df, seq_id, chain)
else:
    print(" No prediction result found.")

### References

- ParaDeep GitHub Repository: [https://github.com/PiyachatU/ParaDeep](https://github.com/PiyachatU/ParaDeep)
- For more information on antibody paratope prediction methods, refer to the manuscript.