<a href="https://colab.research.google.com/github/PiyachatU/ParaDeep/blob/main/ParaDeep_Colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="https://drive.google.com/uc?id=1hKra9Eirlx69_cZMvPmjwfCAFx6Fhf0a" alt="ParaDeep Icon" width="200"/>

# ParaDeep: Sequence-Based Paratope Prediction with BiLSTM-CNN

This notebook demonstrates how to use **ParaDeep**, a lightweight deep learning model for predicting paratope residues (antigen-binding sites) from antibody sequences. ParaDeep uses a BiLSTM-CNN architecture with learnable embeddings and requires only amino acid sequences — no structural input or large pretrained models.

## What is ParaDeep?

ParaDeep ships pretrained models for heavy (H), light (L), and combined (HL) chains. Predictions are per-residue and human-readable, designed for practical use in early-stage antibody discovery and analysis.

### Key Features:
- **Sequence-only approach**: No structural data required
- **Chain-aware modeling**: Specialized models for heavy (H), light (L), and combined (HL) chains
- **Lightweight architecture**: Significantly reduced computing demands compared to structure-based or pretrained language models
- **Per-residue predictions**: Binary classification of binding vs non-binding residues
- **Simple input/output**: Uses CSV format for easy integration

In this notebook, we'll set up the environment, clone the repository (which includes the pretrained models), and run predictions on sample antibody sequences.

## 1. Setup Environment

First, let's install the required dependencies:

In [None]:
# Install required packages (matplotlib is needed for the plots in section 8)
!pip install torch pandas numpy matplotlib

## 2. Clone the ParaDeep Repository

Now, let's clone the ParaDeep repository from GitHub:

In [None]:
# Clone the repository
!git clone https://github.com/PiyachatU/ParaDeep.git

# Change to the ParaDeep directory
%cd ParaDeep

# List the contents of the repository
!ls -la

## 3. Explore Repository Structure

Let's examine the key components of the repository:

In [None]:
# Check the model directory
!ls -la model

# Check the saved_models directory
!ls -la saved_models

# Check the data directory
!ls -la data

# Check the utils directory
!ls -la utils

## 4. Examine Sample Input Format

Let's look at the sample input format to understand what our data should look like:

In [None]:
import pandas as pd

# Load and display the sample input
sample_input = pd.read_csv('data/sample_input.csv')
sample_input

The input format requires:
- **Seq_ID**: A unique identifier for each sequence
- **Seq_cap**: The amino acid sequence in capital letters

## 5. Create Our Own Input Data

Let's create a custom input file with some example antibody sequences:

In [None]:
# Create a DataFrame with example sequences
custom_data = pd.DataFrame({
    'Seq_ID': ['Heavy_Chain_1', 'Light_Chain_1'],
    'Seq_cap': [
        'EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVSAISGSGGSTYYADSVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYCAKDLGWSDSYYYYYGMDVWGQGTTVTVSS',
        'DIQMTQSPSSLSASVGDRVTITCRASQSISSYLNWYQQKPGKAPKLLIYAASSLQSGVPSRFSGSGSGTDFTLTISSLQPEDFATYYCQQSYSTPPTFGQGTKVEIKR'
    ]
})

# Save to a CSV file
custom_data.to_csv('data/custom_input.csv', index=False)

# Display the custom data
custom_data

## 6. Run Predictions

Now, let's run predictions using the pretrained models. We'll try all three models: Heavy (H), Light (L), and Combined (HL).

In [None]:
# Run prediction with the Heavy chain model
!python predict.py --model-path saved_models/ParaDeep_H.pt --input data/custom_input.csv

In [None]:
# Run prediction with the Light chain model
!python predict.py --model-path saved_models/ParaDeep_L.pt --input data/custom_input.csv

In [None]:
# Run prediction with the Combined (HL) model
!python predict.py --model-path saved_models/ParaDeep_HL.pt --input data/custom_input.csv
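
Outside a notebook, the same three runs can be scripted. The sketch below only builds the command lines, reusing the `--model-path` and `--input` flags and the model file names shown above. `build_predict_command` is a hypothetical helper, and the commented-out `subprocess.run` call assumes the working directory is the cloned repository.

```python
import subprocess
import sys

def build_predict_command(model_tag, input_csv):
    """Build the predict.py command line for one pretrained model ('H', 'L', or 'HL')."""
    if model_tag not in ("H", "L", "HL"):
        raise ValueError(f"unknown model tag: {model_tag}")
    return [
        sys.executable, "predict.py",
        "--model-path", f"saved_models/ParaDeep_{model_tag}.pt",
        "--input", input_csv,
    ]

for tag in ("H", "L", "HL"):
    cmd = build_predict_command(tag, "data/custom_input.csv")
    print(" ".join(cmd))
    # Uncomment to actually run the prediction from inside the ParaDeep directory:
    # subprocess.run(cmd, check=True)
```

Passing the command as a list, rather than as one shell string, avoids quoting problems when the input path contains spaces.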

## 7. Examine the Results

Let's look at the prediction results:

In [None]:
# List output files
!ls -la output/

In [None]:
import glob
import os

# Output files match output/ParaDeep_<model>_predictions_*.csv (the pattern
# used in section 9), so pick each model's newest file by name rather than
# relying on its position in a time-sorted list.
def latest_output(model_tag):
    files = glob.glob(f'output/ParaDeep_{model_tag}_predictions_*.csv')
    return max(files, key=os.path.getmtime)

# Load and display the H chain results
h_results = pd.read_csv(latest_output('H'))
h_results.head(20)

In [None]:
# Load and display the L chain results
l_results = pd.read_csv(latest_output('L'))
l_results.head(20)

In [None]:
# Load and display the HL chain results
hl_results = pd.read_csv(latest_output('HL'))
hl_results.head(20)
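
Beyond `head()`, a per-sequence summary is often more informative. The sketch below assumes the column names used elsewhere in this notebook (`Seq_ID`, `Residue_Position`, `Residue`, and a 0/1 prediction column such as `H_pred`). The small table is made-up data for illustration, not real model output, and `summarize_predictions` is a hypothetical helper.

```python
import pandas as pd

def summarize_predictions(results_df, pred_column):
    """Count residues and predicted binding residues for each sequence."""
    summary = (
        results_df.groupby("Seq_ID")[pred_column]
        .agg(total_residues="count", binding_residues="sum")
        .reset_index()
    )
    summary["binding_fraction"] = summary["binding_residues"] / summary["total_residues"]
    return summary

# Invented toy predictions: sequence A has two predicted binding residues, B has none
toy = pd.DataFrame({
    "Seq_ID": ["A", "A", "A", "A", "B", "B"],
    "Residue_Position": [1, 2, 3, 4, 1, 2],
    "Residue": list("EVQLDI"),
    "H_pred": [0, 1, 1, 0, 0, 0],
})
print(summarize_predictions(toy, "H_pred"))
```

With real output, the equivalent call would be `summarize_predictions(h_results, 'H_pred')`.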

## 8. Visualize the Results

Let's create a simple visualization of the predicted binding residues:

In [None]:
import matplotlib.pyplot as plt
import numpy as np

def visualize_predictions(results_df, seq_id, pred_column):
    # Filter for the specific sequence
    seq_results = results_df[results_df['Seq_ID'] == seq_id]

    # Extract positions and predictions
    positions = seq_results['Residue_Position'].values
    residues = seq_results['Residue'].values
    predictions = seq_results[pred_column].values

    # Create figure
    plt.figure(figsize=(15, 4))

    # Plot predictions
    plt.bar(positions, predictions, color='skyblue', alpha=0.7)
    plt.axhline(y=0.5, color='red', linestyle='--', alpha=0.7, label='Threshold (0.5)')

    # Highlight binding residues
    binding_positions = positions[predictions == 1]
    binding_residues = residues[predictions == 1]
    plt.scatter(binding_positions, np.ones(len(binding_positions)), color='red', s=100, label='Binding Residues')

    # Add labels for binding residues
    for pos, res in zip(binding_positions, binding_residues):
        plt.text(pos, 1.05, res, ha='center', fontweight='bold')

    # Set labels and title
    plt.xlabel('Residue Position')
    plt.ylabel('Binding Prediction (1 = binding)')
    plt.title(f'Paratope Prediction for {seq_id}')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

    # Print sequence with binding residues highlighted
    sequence = ''.join(residues)
    binding_set = set(binding_positions)
    # Match on each residue's own position, so the result does not depend on
    # whether Residue_Position starts at 0 or 1
    highlighted_seq = ''.join(
        f"[{res}]" if pos in binding_set else res
        for pos, res in zip(positions, residues)
    )

    print(f"Sequence with binding residues [highlighted]: \n{highlighted_seq}")
    print(f"\nTotal residues: {len(sequence)}")
    print(f"Binding residues: {len(binding_positions)} ({len(binding_positions)/len(sequence)*100:.1f}%)")

In [None]:
# Visualize Heavy chain predictions
visualize_predictions(h_results, 'Heavy_Chain_1', 'H_pred')

In [None]:
# Visualize Light chain predictions
visualize_predictions(l_results, 'Light_Chain_1', 'L_pred')

In [None]:
# Visualize Combined (HL) chain predictions
visualize_predictions(hl_results, 'Heavy_Chain_1', 'HL_pred')
visualize_predictions(hl_results, 'Light_Chain_1', 'HL_pred')
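
Because the heavy chain is scored by both the H and the HL model, it can be useful to see where the two disagree. The sketch below assumes both result tables share the `Seq_ID`, `Residue_Position`, and `Residue` columns used above. The toy tables are invented for illustration, and `compare_models` is a hypothetical helper.

```python
import pandas as pd

def compare_models(df_a, col_a, df_b, col_b):
    """Align two prediction tables residue by residue and flag where they agree."""
    keys = ["Seq_ID", "Residue_Position", "Residue"]
    merged = df_a[keys + [col_a]].merge(df_b[keys + [col_b]], on=keys, how="inner")
    merged["agree"] = merged[col_a] == merged[col_b]
    return merged

# Invented toy predictions for a four-residue fragment
toy_h = pd.DataFrame({
    "Seq_ID": ["Heavy_Chain_1"] * 4,
    "Residue_Position": [1, 2, 3, 4],
    "Residue": list("EVQL"),
    "H_pred": [0, 1, 1, 0],
})
toy_hl = toy_h.drop(columns="H_pred").assign(HL_pred=[0, 1, 0, 0])

merged = compare_models(toy_h, "H_pred", toy_hl, "HL_pred")
print(merged[~merged["agree"]])                      # residues where the models differ
print(f"Agreement: {merged['agree'].mean():.0%}")    # share of residues with the same label
```

With real output, `compare_models(h_results, 'H_pred', hl_results, 'HL_pred')` would perform the same comparison; the inner join keeps only sequences present in both tables.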

## 9. Upload Your Own Sequences


To use ParaDeep with your own antibody sequence data in Google Colab:

1. Open Google Drive in a separate tab.
2. Create the following folder structure:
   - `/MyDrive/ParaDeep/`
   - `/MyDrive/ParaDeep/data/`
3. Navigate to the `/MyDrive/ParaDeep/data/` folder.
4. Upload your CSV file to this folder. An example file containing three sequences is available for download: https://github.com/PiyachatU/ParaDeep/blob/main/data/my_sequences.csv
5. Ensure your CSV has two columns:
   - `Seq_ID`: unique sequence ID
   - `Seq_cap`: amino acid sequence (e.g., "EVQLVESGG...")
6. Run the code cell below to load and process your file. The cell looks for `my_sequences.csv`; edit `csv_path` if your file has a different name.
7. Authorize Google Drive access. When you run the cell, a prompt will ask for permission to access your Google Drive. **Please click “Allow” or “Select All” when asked**; otherwise the notebook may not be able to access your Drive.


**Privacy & Security Notice**
This notebook is safe to use and does **not access any data from your Google Drive** unless you explicitly run a code cell to do so. When you authorize Google Drive access (via `drive.mount()`), only **your own account** can see and interact with your files — **the notebook author or others cannot access your Drive data**.

You remain in full control:
- No data will be read from or written to your Drive without your action.
- You can revoke access at any time via [https://myaccount.google.com/permissions](https://myaccount.google.com/permissions).
- This notebook runs entirely in your private Colab session.

In [None]:
# STEP 1: Mount Google Drive
from google.colab import drive
import os
import pandas as pd
from datetime import datetime
from glob import glob

drive.mount('/content/drive')

# STEP 2: Load your CSV from Google Drive
csv_path = '/content/drive/MyDrive/ParaDeep/data/my_sequences.csv'

def load_sequence_csv(path):
    if not os.path.exists(path):
        print(f"File not found: {path}")
        return None
    try:
        df = pd.read_csv(path)
        if 'Seq_ID' not in df.columns or 'Seq_cap' not in df.columns:
            print("CSV must contain 'Seq_ID' and 'Seq_cap' columns.")
            return None
        print(f"Loaded file: {path}")
        display(df.head())
        return df, path
    except Exception as e:
        print(f"Failed to read file: {e}")
        return None

df_result = load_sequence_csv(csv_path)


# STEP 3: Run ParaDeep Prediction with All Models and Save to Drive

if df_result:
    df, valid_path = df_result
    print("Running ParaDeep on uploaded sequences using all models...\n")

    model_list = [
        'ParaDeep_HL.pt',
        'ParaDeep_H.pt',
        'ParaDeep_L.pt'
    ]

    output_dir = "output"
    drive_output_dir = "/content/drive/MyDrive/ParaDeep/results"
    os.makedirs(output_dir, exist_ok=True)
    os.makedirs(drive_output_dir, exist_ok=True)

    timestamp = datetime.now().strftime('%Y%m%d_%H%M')

    for model_filename in model_list:
        model_path = f'saved_models/{model_filename}'
        model_tag = model_filename.replace('.pt', '').replace('ParaDeep_', '')  # 'H', 'L', or 'HL'

        print(f"Running model: {model_tag}")
        !python predict.py --model-path {model_path} --input {valid_path}

        # Find the latest output file for the model
        output_pattern = os.path.join(output_dir, f"ParaDeep_{model_tag}_predictions_*.csv")
        output_files = sorted(glob(output_pattern), reverse=True)

        if output_files:
            latest_file = output_files[0]
            output_filename = f"ParaDeep_{model_tag}_predictions_{timestamp}.csv"
            drive_output_path = os.path.join(drive_output_dir, output_filename)

            !cp {latest_file} {drive_output_path}
            print(f"Copied output for {model_tag} model to Google Drive: {drive_output_path}")
        else:
            print(f"No prediction file found for model: {model_tag}")

    print("\nFinal contents of your Google Drive result folder:")
    !ls -lh {drive_output_dir}
else:
    print("Cannot proceed with prediction. CSV file was not loaded correctly.")

In [None]:
# STEP 4: Imports and path setup
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from glob import glob
import os

# Define paths
drive_output_dir = "/content/drive/MyDrive/ParaDeep/results"
output_dir = "output"
image_save_dir = os.path.join(output_dir, "visualizations")
drive_vis_dir = "/content/drive/MyDrive/ParaDeep/results/visualizations"

os.makedirs(image_save_dir, exist_ok=True)
os.makedirs(drive_vis_dir, exist_ok=True)

# STEP 5: Function to visualize one sequence
def visualize_predictions(results_df, seq_id, pred_column, model_tag):
    seq_results = results_df[results_df['Seq_ID'] == seq_id]
    positions = seq_results['Residue_Position'].values
    residues = seq_results['Residue'].values
    predictions = seq_results[pred_column].values

    plt.figure(figsize=(15, 4))
    plt.bar(positions, predictions, color='skyblue', alpha=0.7)
    plt.axhline(y=0.5, color='red', linestyle='--', alpha=0.7, label='Threshold (0.5)')

    binding_positions = positions[predictions == 1]
    binding_residues = residues[predictions == 1]
    plt.scatter(binding_positions, np.ones(len(binding_positions)), color='red', s=100, label='Binding Residues')

    for pos, res in zip(binding_positions, binding_residues):
        plt.text(pos, 1.05, res, ha='center', fontweight='bold')

    plt.xlabel('Residue Position')
    plt.ylabel('Binding Prediction (1 = binding)')
    plt.title(f'ParaDeep {model_tag} - Prediction for {seq_id}')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.tight_layout()

    # Save before showing
    fig_filename = f"{model_tag}_{seq_id}_visualization.png"
    fig_path = os.path.join(image_save_dir, fig_filename)
    plt.savefig(fig_path)
    plt.show()
    print(f"Saved visualization to: {fig_path}")

    # Highlight residues
    sequence = ''.join(residues)
    binding_set = set(binding_positions)
    # Match on each residue's own position, so the result does not depend on
    # whether Residue_Position starts at 0 or 1
    highlighted_seq = ''.join(
        f"[{res}]" if pos in binding_set else res
        for pos, res in zip(positions, residues)
    )

    print(f"Sequence with binding residues highlighted:\n{highlighted_seq}")
    print(f"Total residues: {len(sequence)}")
    print(f"Binding residues: {len(binding_positions)} ({len(binding_positions)/len(sequence)*100:.1f}%)\n")

# STEP 6: Visualize latest predictions from all models
def visualize_all_model_predictions():
    print("Scanning prediction files for all models...\n")
    for model_tag in ['HL', 'H', 'L']:
        pattern = os.path.join(drive_output_dir, f"ParaDeep_{model_tag}_predictions_*.csv")
        model_files = sorted(glob(pattern), key=os.path.getctime, reverse=True)

        if model_files:
            latest_file = model_files[0]
            print(f"Loaded prediction file for model {model_tag}: {latest_file}")
            df = pd.read_csv(latest_file)

            pred_column = f"{model_tag}_pred"
            if pred_column not in df.columns:
                print(f"Column {pred_column} not found in file. Skipping.")
                continue

            for seq_id in df['Seq_ID'].unique():
                visualize_predictions(df, seq_id, pred_column, model_tag)
        else:
            print(f"No prediction output files found for model {model_tag}.")

# STEP 7: Run visualizations
visualize_all_model_predictions()

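# STEP 8: Copy plots to Google Drive
import shutil

image_files = glob(os.path.join(image_save_dir, "*.png"))

for img_path in image_files:
    filename = os.path.basename(img_path)
    target_path = os.path.join(drive_vis_dir, filename)
    # shutil.copy avoids shell quoting issues with unusual file names
    shutil.copy(img_path, target_path)
    print(f"Copied visualization to Google Drive: {target_path}")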


## 10. Understanding the Model Architecture

ParaDeep uses a BiLSTM-CNN architecture to predict paratope residues:

1. **Input**: Amino acid sequences are encoded using learnable embeddings
2. **BiLSTM Layer**: Captures bidirectional context from the sequence
3. **CNN Layer**: Extracts local features using sliding kernels
4. **Output Layer**: Produces per-residue binding probabilities

The pretrained models use different CNN kernel sizes for different chain types:
- Heavy chain (H): Kernel size 9
- Light chain (L): Kernel size 81
- Combined chains (HL): Kernel size 21

Because the kernel size sets how many neighboring residues the CNN sees at each position, this chain-specific choice lets each model capture the binding patterns of its chain type.
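
To make the four stages concrete, here is a minimal PyTorch sketch of an embedding, BiLSTM, 1D CNN, and per-residue output stack. It is not the repository's implementation: the class name `BiLSTMCNNSketch`, the vocabulary size, the embedding and hidden dimensions, the single convolution layer, and the use of index 0 for padding are all assumptions chosen for illustration. Only the kernel sizes (9, 81, 21) come from the text above.

```python
import torch
import torch.nn as nn

# Kernel sizes quoted in the text above; everything else below is assumed.
KERNEL_SIZES = {"H": 9, "L": 81, "HL": 21}

class BiLSTMCNNSketch(nn.Module):
    """Illustrative embedding -> BiLSTM -> Conv1d -> per-residue probability stack."""

    def __init__(self, kernel_size, vocab_size=21, embed_dim=32, hidden_dim=64):
        super().__init__()
        # Learnable embeddings; index 0 is reserved for padding (assumption)
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        # With an odd kernel, padding of kernel_size // 2 keeps one output per residue
        self.conv = nn.Conv1d(2 * hidden_dim, hidden_dim, kernel_size,
                              padding=kernel_size // 2)
        self.out = nn.Linear(hidden_dim, 1)

    def forward(self, tokens):
        x = self.embedding(tokens)                        # (batch, length, embed_dim)
        x, _ = self.lstm(x)                               # (batch, length, 2 * hidden_dim)
        x = torch.relu(self.conv(x.transpose(1, 2)))      # (batch, hidden_dim, length)
        logits = self.out(x.transpose(1, 2)).squeeze(-1)  # (batch, length)
        return torch.sigmoid(logits)                      # binding probability per residue

model = BiLSTMCNNSketch(KERNEL_SIZES["H"])
tokens = torch.randint(1, 21, (2, 120))  # two random integer-encoded sequences of length 120
probs = model(tokens)
print(probs.shape)
```

The output has one probability per input residue, which is what allows a threshold such as 0.5 to turn the scores into the binary labels shown in the result tables. Swapping in `KERNEL_SIZES["L"]` widens the convolution window to 81 residues without changing the output length.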

## 11. Conclusion

In this notebook, we've demonstrated how to use ParaDeep for predicting paratope residues from antibody sequences. The key advantages of ParaDeep include:

- **Sequence-only approach**: No need for structural data
- **Chain-specific modeling**: Specialized models for different chain types
- **Lightweight architecture**: Efficient computation with minimal resources
- **Interpretable results**: Clear per-residue binding predictions

ParaDeep is particularly useful for early-stage antibody discovery and analysis when structural data may be limited or unavailable.

### References

- ParaDeep GitHub Repository: [https://github.com/PiyachatU/ParaDeep](https://github.com/PiyachatU/ParaDeep)
- For more information on antibody paratope prediction methods, refer to the manuscript.