# PIP-BERT: Protease Inhibitor Prediction Using the ProtBERT Language Model


---


## Overview

This Google Colab notebook provides a user-friendly interface for running inference with **PIP-BERT**, a deep-learning binary classifier that predicts protease inhibitor (PI) activity from primary protein sequences using the ProtBERT protein language model.

PIP-BERT enables rapid screening for potential small secreted protease inhibitors in large-scale genomic, transcriptomic, or proteomic datasets.

The model assigns each input sequence to one of two classes:  
- **Positive (Potential PI)**: Predicted to exhibit protease inhibitor activity  
- **Negative (Non-PI)**: Predicted to lack protease inhibitor activity  

Output includes:  
- Probability of the positive class (`prob_class_1`): ranges from 0 (low likelihood) to 1 (high likelihood of PI activity)  
- Confidence score: probability of the predicted class  
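
The relationship between the two reported values can be sketched as follows. This is a minimal illustration, assuming a two-class softmax output where the predicted class is the one with the larger probability; the helper name `summarize_prediction` is hypothetical and not part of the notebook.

```python
def summarize_prediction(prob_class_1: float) -> tuple[str, float]:
    """Map the positive-class probability to a label and a confidence score."""
    if not 0.0 <= prob_class_1 <= 1.0:
        raise ValueError("prob_class_1 must lie in [0, 1]")
    # With two classes, the negative-class probability is the complement
    prob_class_0 = 1.0 - prob_class_1
    # Confidence is the probability of whichever class was predicted;
    # an exact tie falls to the negative class, as argmax picks the first index
    if prob_class_1 > prob_class_0:
        return "Positive (Potential PI)", prob_class_1
    return "Negative (Non-PI)", prob_class_0
```

Under this reading, the confidence score never falls below 0.5, and values close to 0.5 indicate that the model is undecided between the two classes.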

## Model Architecture and Training

PIP-BERT is a fine-tuned sequence classification model built on **ProtBERT** (`BertForSequenceClassification`):  
- Base model: `Rostlab/prot_bert`   
- Pre-trained on large corpora of protein sequences using masked language modeling  

Fine-tuning was performed on a curated dataset of known protease inhibitors and a negative set of non-inhibitor proteins. Sequences are tokenized by inserting spaces between amino acids, the input format ProtBERT expects. In this notebook the maximum sequence length is configurable (default: 250 tokens, which leaves room for 248 residues once the `[CLS]` and `[SEP]` tokens are added); longer sequences are truncated.
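
The input formatting and the effect of truncation can be sketched in plain Python. This is an illustration rather than the notebook's own code: the helper names are hypothetical, and the count of two special tokens assumes the tokenizer adds `[CLS]` and `[SEP]`, which is the `BertTokenizer` default.

```python
def to_protbert_input(sequence: str) -> str:
    """Insert a space between residues, the input format ProtBERT expects."""
    return " ".join(sequence.strip().upper())


def residues_kept(sequence: str, max_length: int = 250, special_tokens: int = 2) -> int:
    """Number of residues that survive truncation to max_length tokens.

    Assumes one token per residue plus `special_tokens` reserved positions.
    """
    return min(len(sequence), max_length - special_tokens)
```

For example, `to_protbert_input("mkv")` yields `"M K V"`, and a 300-residue sequence keeps its first 248 residues at the default length.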

## Input Requirements

- Multi-FASTA formatted file containing one or more protein sequences  
- Sequences must use standard single-letter amino acid codes  
- FASTA headers (lines beginning with `>`) are retained for identification  
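
Before uploading, it can help to check that sequences contain only the 20 standard residues. The following is a minimal sketch; the helper name `nonstandard_residues` is hypothetical, and how the model handles ambiguous codes such as `X`, `B`, `Z`, `U`, or `O` is not specified here, so flagged sequences should simply be reviewed.

```python
# The 20 standard single-letter amino acid codes
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def nonstandard_residues(sequence: str) -> set[str]:
    """Return the characters in a sequence that are not standard amino acids."""
    return set(sequence.upper()) - STANDARD_AA
```

An empty result means the sequence meets the requirement; for instance, `nonstandard_residues("MKVXB")` reports `X` and `B`.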

## Output Columns

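- `header`: Original FASTA identifier  
- `sequence`: Parsed amino acid sequence (included in the saved CSV)  
- `predicted_class_id`: Numeric class label, 0 or 1 (included in the saved CSV)  
- `predicted_class`: "Positive (Potential PI)" or "Negative (Non-PI)"  
- `confidence`: Probability of the assigned class  
- `prob_class_1`: Probability of the positive class (protease inhibitor activity)  
- `prob_class_0`: Probability of the negative class  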
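
One way to use these columns downstream is to rank sequences by `prob_class_1` and keep those above a cutoff. The sketch below uses only the standard library; the function name `top_candidates` is hypothetical, the 0.9 cutoff is an arbitrary illustrative value rather than a recommended threshold, and the rows in `example_csv` are invented placeholders.

```python
import csv
import io


def top_candidates(csv_text: str, min_prob: float = 0.9) -> list[tuple[str, float]]:
    """Return (header, prob_class_1) pairs at or above min_prob, highest first."""
    reader = csv.DictReader(io.StringIO(csv_text))
    hits = []
    for row in reader:
        prob = float(row["prob_class_1"])
        if prob >= min_prob:
            hits.append((row["header"], prob))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Invented placeholder rows in the column order the notebook writes
example_csv = """header,sequence,predicted_class_id,prob_class_1,prob_class_0,confidence,predicted_class
seqA,MKV,1,0.93,0.07,0.93,Positive (Potential PI)
seqB,MLL,0,0.10,0.90,0.90,Negative (Non-PI)
seqC,MAA,1,0.98,0.02,0.98,Positive (Potential PI)
"""

candidates = top_candidates(example_csv)
```

A stricter cutoff shortens the list of sequences passed on to experimental follow-up, at the cost of missing weaker candidates.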

## Usage Notes

- Intended for research and high-throughput screening  
- Positive predictions suggest potential PI activity and warrant experimental follow-up  
- Predictions rely solely on the provided sequence  

## Model Availability

The fine-tuned PIP-BERT model is publicly hosted on Hugging Face:  
https://huggingface.co/MuthuS97/PIP-BERT  

## Citation

When using PIP-BERT in research, please cite the model repository and its associated publication.

---

**Instructions**  
1. Enable GPU acceleration: Runtime → Change runtime type → Hardware accelerator → GPU (T4 recommended).  
2. Execute all cells in sequence (Runtime → Run all).  
3. Upload your multi-FASTA file to obtain predictions.

In [None]:
# @title 0. Install Required Packages

!pip install --quiet transformers torch

print("Required packages installed successfully")

Required packages installed successfully


In [None]:
# @title 1. Initialization and Setup

mount_drive = True  # @param {type:"boolean"}
if mount_drive:
    from google.colab import drive
    drive.mount('/content/drive')
    print("Google Drive mounted at /content/drive")

MAX_LEN = 250  # @param {type:"integer"}
BATCH_SIZE = 16  # @param {type:"integer"}

import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

import pandas as pd
import numpy as np
from IPython.display import display, HTML
from google.colab import files

print("Initialization complete")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Google Drive mounted at /content/drive
Using device: cuda
Initialization complete


In [None]:
# @title 2. Load PIP-BERT Classifier Model

from transformers import BertTokenizer, BertForSequenceClassification

MODEL_ID = "MuthuS97/PIP-BERT"

print(f"Loading tokenizer and model from {MODEL_ID}")
tokenizer = BertTokenizer.from_pretrained(MODEL_ID)
model = BertForSequenceClassification.from_pretrained(MODEL_ID)

model.to(device)
model.eval()

print("Model loaded successfully")

Loading tokenizer and model from MuthuS97/PIP-BERT
Model loaded successfully


In [None]:
# @title 3. Upload and Parse Multi-FASTA File

uploaded = files.upload()

if not uploaded:
    raise ValueError("No file uploaded. Please provide a multi-FASTA file.")

fasta_filename = list(uploaded.keys())[0]
print(f"Uploaded file: {fasta_filename}")

def parse_fasta(content):
    """Parse multi-FASTA text into a DataFrame with 'header' and 'sequence' columns."""
    headers = []
    sequences = []
    current_seq = []
    current_header = None

    for line in content.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Close the previous record before starting a new one
            if current_header is not None:
                sequences.append("".join(current_seq))
            current_header = line[1:].strip()
            headers.append(current_header)
            current_seq = []
        elif current_header is not None:
            # Normalize case and drop internal spaces.
            # Lines before the first header are ignored rather than merged into the first record.
            current_seq.append(line.upper().replace(" ", ""))

    if current_header is not None:
        sequences.append("".join(current_seq))

    if not headers:
        raise ValueError("No FASTA records found: the file contains no '>' header lines")

    return pd.DataFrame({"header": headers, "sequence": sequences})

with open(fasta_filename, "r") as f:
    fasta_content = f.read()

df = parse_fasta(fasta_content)
print(f"Loaded {len(df)} sequences")

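# The tokenizer reserves two positions for [CLS] and [SEP], so MAX_LEN - 2 residues fit
max_residues = MAX_LEN - 2
long_seqs = df[df['sequence'].str.len() > max_residues]
if len(long_seqs) > 0:
    print(f"Warning: {len(long_seqs)} sequences exceed {max_residues} residues and will be truncated")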

display(df.head())


In [None]:
# @title 4. Run Inference

from torch.utils.data import Dataset, DataLoader
import torch.nn.functional as F

class ProteinDataset(Dataset):
    def __init__(self, sequences, tokenizer, max_length):
        self.sequences = sequences
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        seq = " ".join(list(self.sequences[idx]))

        encoding = self.tokenizer(
            seq,
            truncation=True,
            padding='max_length',
            max_length=self.max_length,
            return_tensors='pt'
        )

        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten()
        }

print("Tokenizing sequences (ProtBERT format: space-separated amino acids)")
dataset = ProteinDataset(df['sequence'].tolist(), tokenizer, MAX_LEN)
dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=False)

all_preds = []
all_probs = []

print("Running inference")
with torch.no_grad():
    for i, batch in enumerate(dataloader):
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)

        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        probs = F.softmax(logits, dim=1)
        preds = torch.argmax(probs, dim=1).cpu().numpy()
        pos_probs = probs[:, 1].cpu().numpy()

        all_preds.extend(preds)
        all_probs.extend(pos_probs)

        if (i + 1) % 10 == 0 or (i + 1) == len(dataloader):
            processed = min((i + 1) * BATCH_SIZE, len(df))
            print(f"Processed {processed} of {len(df)} sequences")

print("Inference completed")

In [None]:
# @title 5. Results and Download

df['predicted_class_id'] = all_preds
df['prob_class_1'] = all_probs
df['prob_class_0'] = 1 - np.array(all_probs)
df['confidence'] = np.maximum(df['prob_class_0'], df['prob_class_1'])

df['predicted_class'] = df['predicted_class_id'].map({
    0: "Negative (Non-PI)",
    1: "Positive (Potential PI)"
})

display(HTML("<h3>Prediction Results (first 10 sequences)</h3>"))
display(df[['header', 'predicted_class', 'confidence', 'prob_class_1']].head(10))

print("\nClass distribution")
counts = df['predicted_class'].value_counts()
for label, count in counts.items():
    percentage = count / len(df) * 100
    print(f"{label}: {count} sequences ({percentage:.1f}%)")

output_csv = "ProtBERT-PI_predictions.csv"
df.to_csv(output_csv, index=False)

if mount_drive:
    drive_path = "/content/drive/MyDrive/ProtBERT-PI_predictions.csv"
    df.to_csv(drive_path, index=False)
    print(f"\nResults also saved to Google Drive: {drive_path}")

print(f"\nResults saved as {output_csv}")
files.download(output_csv)