# PIPES-M: Protease Inhibitor Prediction Using Evolutionary Scale Modeling (ESM-2)

## Overview

This Google Colab notebook provides a user-friendly interface for inference with **PIPES-M**, a deep learning-based binary classifier designed to predict protease inhibitor (PI) activity from primary protein sequences.

PIPES-M enables rapid screening of small secreted protease inhibitors (<250 amino acids) in large-scale genomic, transcriptomic, or proteomic datasets, where experimental validation is resource-intensive.

The model assigns each input sequence to one of two classes:  
- **Positive (Potential PI)**: Predicted to exhibit protease inhibitor activity  
- **Negative (Non-PI)**: Predicted to lack protease inhibitor activity  

Output includes:  
- Probability of the positive class (`prob_class_1`): ranges from 0 (low likelihood) to 1 (high likelihood of PI activity)  
- Confidence score: probability of the predicted class  
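
The relationship between the two reported values can be sketched in plain Python. This is a minimal illustration that assumes the classifier emits two logits per sequence and that probabilities come from a softmax, as in the inference cell of this notebook; the function names and the example logits are hypothetical.

```python
import math

def class_probabilities(logit_0, logit_1):
    """Numerically stable softmax over two logits."""
    shift = max(logit_0, logit_1)
    e0 = math.exp(logit_0 - shift)
    e1 = math.exp(logit_1 - shift)
    total = e0 + e1
    return e0 / total, e1 / total

def summarize_prediction(logit_0, logit_1):
    """Derive the reported fields from a pair of logits."""
    prob_class_0, prob_class_1 = class_probabilities(logit_0, logit_1)
    # Ties go to class 0, matching argmax, which returns the first maximum
    predicted_class_id = 1 if prob_class_1 > prob_class_0 else 0
    return {
        "predicted_class_id": predicted_class_id,
        "prob_class_0": prob_class_0,
        "prob_class_1": prob_class_1,
        # Confidence is the probability of whichever class was predicted
        "confidence": max(prob_class_0, prob_class_1),
    }

# Hypothetical logits, for illustration only
example = summarize_prediction(-1.0, 2.0)
print(example)
```

For a positive prediction, `confidence` equals `prob_class_1`; for a negative prediction it equals `1 - prob_class_1`.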

## Model Architecture and Training

PIPES-M is a fine-tuned sequence classification model built on the **ESM-2** protein language model:  
- Base model: `facebook/esm2_t30_150M_UR50D` (150 million parameters, 30 layers)  
- Pre-trained on UniRef50 via masked language modeling  

Fine-tuning was performed on a high-quality curated dataset comprising:  
- Positive examples: known protease inhibitors (<250 AA) from the MEROPS database  
- Negative examples: non-inhibitors selected from UniProt using sequence similarity and Pfam domain analysis  

Training used sequence-only input, requiring no structural data. The classification head leverages evolutionary and physicochemical features encoded by ESM-2.  

Maximum sequence length is fixed at 250; longer sequences are truncated by the tokenizer, which with its default settings keeps the N-terminal portion and drops residues from the C-terminal end. This limit matches the typical size range of small secreted inhibitors.
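
The effect of the length limit can be sketched in plain Python. This is a rough illustration, not the tokenizer's implementation: it assumes the ESM-2 tokenizer adds two special tokens (start and end) that count toward the limit and that truncation removes residues from the end of the sequence. The constant `NUM_SPECIAL_TOKENS` and the helper `residues_kept` are hypothetical names.

```python
MAX_LEN = 250           # same value as the notebook parameter
NUM_SPECIAL_TOKENS = 2  # assumption: one start and one end token

def residues_kept(sequence, max_len=MAX_LEN, num_special=NUM_SPECIAL_TOKENS):
    """Return the part of the sequence that would survive truncation."""
    budget = max_len - num_special
    # Right-side truncation: keep the N-terminal residues, drop the rest
    return sequence[:budget]

long_sequence = "M" + "A" * 299  # hypothetical 300-residue sequence
kept = residues_kept(long_sequence)
print(len(long_sequence), "->", len(kept))
```

Under these assumptions, a sequence slightly shorter than 250 residues can still lose its last one or two residues, so it may be worth checking sequences close to the limit.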

## Input Requirements

- Multi-FASTA formatted file containing one or more protein sequences  
- Sequences must use standard single-letter amino acid codes  
- FASTA headers (lines beginning with `>`) are retained for identification  
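
Because the notebook does not validate residue codes itself, a quick pre-check can flag sequences that contain anything outside the 20 standard amino acids. The sketch below is one possible check; the helper `nonstandard_residues` is a hypothetical name, and how ambiguity codes such as `X` or `B` should be handled is left to the user.

```python
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")

def nonstandard_residues(sequence):
    """Return the sorted set of characters that are not standard amino acids."""
    return sorted({c for c in sequence.upper() if c not in STANDARD_AA})

# Hypothetical sequences, for illustration only
clean = nonstandard_residues("MKTAYIAKQR")
flagged = nonstandard_residues("mkxbz*")
print(clean, flagged)
```

An empty list means the sequence uses only standard codes; anything else lists the characters to review before upload.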

## Output Columns

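- `header`: Original FASTA identifier (the header line without the leading `>`)  
- `sequence`: Parsed amino acid sequence (included in the saved CSV)  
- `predicted_class_id`: Numeric class label, 0 or 1 (included in the saved CSV)  
- `predicted_class`: "Positive (Potential PI)" or "Negative (Non-PI)"  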
- `confidence`: Probability of the assigned class  
- `prob_class_1`: Raw probability of protease inhibitor activity  
- `prob_class_0`: Probability of the negative class  
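
As an example of working with these columns downstream, the sketch below shortlists sequences whose `prob_class_1` meets a cutoff. The CSV text, the headers `seq_A` to `seq_C`, the helper name, and the 0.9 threshold are all hypothetical; the model does not prescribe a particular cutoff.

```python
import csv
import io

# Hypothetical predictions in the same column layout as the output file
example_csv = """header,predicted_class,confidence,prob_class_1,prob_class_0
seq_A,Positive (Potential PI),0.95,0.95,0.05
seq_B,Negative (Non-PI),0.80,0.20,0.80
seq_C,Positive (Potential PI),0.60,0.60,0.40
"""

def high_probability_positives(csv_text, threshold=0.9):
    """Return headers whose prob_class_1 is at or above the threshold."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["header"] for row in reader
            if float(row["prob_class_1"]) >= threshold]

shortlist = high_probability_positives(example_csv)
print(shortlist)
```

Lowering the threshold trades fewer missed candidates for more false positives to test experimentally.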

## Usage Notes

- Intended for research and high-throughput screening  
- Positive predictions suggest potential PI activity and warrant experimental follow-up  
- Optimal performance is achieved on secreted or extracellular proteins, reflecting the composition of the training data  
- Predictions rely solely on the provided sequence; no homology search or multiple sequence alignment is performed  

## Model Availability

The fine-tuned PIPES-M model is publicly hosted on Hugging Face:  
https://huggingface.co/MuthuS97/PIPES-M

## Citation

When using PIPES-M in research, please reference the model repository and any associated forthcoming publication.

---

**Instructions**  
1. Enable GPU acceleration: Runtime → Change runtime type → Hardware accelerator → GPU (T4 recommended).  
2. Execute all cells in sequence (Runtime → Run all).  
3. Upload your multi-FASTA file in the designated section to obtain predictions.

In [13]:
# @title 0. Install Required Packages

!pip install --quiet transformers huggingface_hub

print("Required packages installed successfully")

Required packages installed successfully


In [14]:
# @title 1. Initialization and Setup

mount_drive = True  # @param {type:"boolean"}
if mount_drive:
    from google.colab import drive
    drive.mount('/content/drive')
    print("Google Drive mounted at /content/drive")

MAX_LEN = 250  # @param {type:"integer"}
BATCH_SIZE = 16  # @param {type:"integer"}

import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

import pandas as pd
import numpy as np
from IPython.display import display, HTML
from google.colab import files

print("Initialization complete")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Google Drive mounted at /content/drive
Using device: cuda
Initialization complete


In [15]:
# @title 2. Load PIPES-M Model

from transformers import AutoTokenizer, EsmForSequenceClassification

MODEL_ID = "MuthuS97/PIPES-M"

print(f"Loading tokenizer and model from {MODEL_ID}")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = EsmForSequenceClassification.from_pretrained(MODEL_ID)

model.to(device)
model.eval()

print("Model loaded successfully")

Loading tokenizer and model from MuthuS97/PIPES-M
Model loaded successfully


In [16]:
# @title 3. Upload Multi-FASTA File

uploaded = files.upload()

if not uploaded:
    raise ValueError("No file uploaded. Please provide a multi-FASTA file.")

fasta_filename = list(uploaded.keys())[0]
print(f"Uploaded file: {fasta_filename}")

def parse_fasta(content):
    headers = []
    sequences = []
    current_seq = []
    current_header = None

    for line in content.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Flush the previous record before starting a new one
            if current_header is not None:
                sequences.append("".join(current_seq))
                current_seq = []
            current_header = line[1:].strip()
            headers.append(current_header)
        elif current_header is not None:
            # Skip text before the first header; normalize case and remove spaces
            current_seq.append(line.upper().replace(" ", ""))

    # Flush the final record
    if current_header is not None:
        sequences.append("".join(current_seq))

    if len(headers) != len(sequences):
        raise ValueError("Parsing error: number of headers and sequences do not match")

    return pd.DataFrame({"header": headers, "sequence": sequences})

with open(fasta_filename, "r") as f:
    fasta_content = f.read()

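df = parse_fasta(fasta_content)
if df.empty:
    raise ValueError("No FASTA records found. Check that the file contains header lines starting with '>'.")
print(f"Loaded {len(df)} sequences")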

long_seqs = df[df['sequence'].str.len() > MAX_LEN]
if len(long_seqs) > 0:
    print(f"Warning: {len(long_seqs)} sequences exceed {MAX_LEN} residues and will be truncated")

display(df.head())

Saving rcsb_pdb_6TME.fasta to rcsb_pdb_6TME.fasta
Uploaded file: rcsb_pdb_6TME.fasta
Loaded 2 sequences


Unnamed: 0,header,sequence
0,"6TME_1|Chains A, B|Pollen-specific leucine-ric...",MELTDEEASFLTRRQLLALSENGDLPDDIEYEVDLDLKFANNRLKR...
1,"6TME_2|Chains C, D|Protein RALF-like 4|Arabido...",ARGRRYIGYDALKKNNVPCSRRGRSYYDCKKRRRNNPYRRGCSAIT...


In [17]:
# @title 4. Run Inference

from torch.utils.data import DataLoader, TensorDataset

print("Tokenizing sequences")
sequences = df['sequence'].tolist()
encoded = tokenizer(
    sequences,
    padding=True,
    truncation=True,
    max_length=MAX_LEN,
    return_tensors="pt"
)

dataset = TensorDataset(encoded['input_ids'], encoded['attention_mask'])
dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=False)

all_probs = []
all_preds = []

print("Running inference")
with torch.no_grad():
    for i, batch in enumerate(dataloader):
        input_ids, attention_mask = [b.to(device) for b in batch]
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        probs = torch.softmax(logits, dim=1).cpu().numpy()
        preds = np.argmax(probs, axis=1)
        all_probs.extend(probs)
        all_preds.extend(preds)

        if (i + 1) % 10 == 0 or (i + 1) == len(dataloader):
            processed = min((i + 1) * BATCH_SIZE, len(sequences))
            print(f"Processed {processed} of {len(sequences)} sequences")

print("Inference completed")

Tokenizing sequences
Running inference
Processed 2 of 2 sequences
Inference completed


In [18]:
# @title 5. Results and Download

confidence = [p[pred] for p, pred in zip(all_probs, all_preds)]
df['predicted_class_id'] = all_preds
df['confidence'] = confidence
df['prob_class_0'] = [p[0] for p in all_probs]
df['prob_class_1'] = [p[1] for p in all_probs]

df['predicted_class'] = df['predicted_class_id'].map({
    0: "Negative (Non-PI)",
    1: "Positive (Potential PI)"
})

display(HTML("<h3>Prediction Results (first 10 sequences)</h3>"))
display(df[['header', 'predicted_class', 'confidence', 'prob_class_1']].head(10))

print("\nClass distribution")
counts = df['predicted_class'].value_counts()
for label, count in counts.items():
    percentage = count / len(df) * 100
    print(f"{label}: {count} sequences ({percentage:.1f}%)")

output_csv = "PIPES-M_predictions.csv"
df.to_csv(output_csv, index=False)

if mount_drive:
    drive_path = "/content/drive/MyDrive/PIPES-M_predictions.csv"
    df.to_csv(drive_path, index=False)
    print(f"\nResults also saved to Google Drive: {drive_path}")

print(f"\nResults saved as {output_csv}")
files.download(output_csv)

Unnamed: 0,header,predicted_class,confidence,prob_class_1
0,"6TME_1|Chains A, B|Pollen-specific leucine-ric...",Positive (Potential PI),0.947041,0.947041
1,"6TME_2|Chains C, D|Protein RALF-like 4|Arabido...",Positive (Potential PI),0.965963,0.965963



Class distribution
Positive (Potential PI): 2 sequences (100.0%)

Results also saved to Google Drive: /content/drive/MyDrive/PIPES-M_predictions.csv

Results saved as PIPES-M_predictions.csv

