In [9]:
!pip install -q transformers datasets accelerate
!pip install -q evaluate
!pip install scikit-learn wandb




In [2]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "facebook/esm2_t6_8M_UR50D"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Some weights of EsmForSequenceClassification were not initialized from the model checkpoint at facebook/esm2_t6_8M_UR50D and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [5]:
from datasets import Dataset, ClassLabel
import pandas as pd
from google.colab import drive

drive.mount("/content/drive")

# 1. Load raw CSV into HF Dataset
df = pd.read_csv("/content/drive/MyDrive/AllergenAInew/ESMFold/algpred2_train.csv")
dataset = Dataset.from_pandas(df)

# 2. Convert 'label' to ClassLabel BEFORE tokenization
dataset = dataset.cast_column("label", ClassLabel(num_classes=2))

# 3. Tokenize sequences


def tokenize(example):
    return tokenizer(
        example["sequence"], padding="max_length", truncation=True, max_length=1024
    )


tokenized_dataset = dataset.map(tokenize, batched=True)

# 4. Now do stratified split
tokenized_dataset = tokenized_dataset.train_test_split(
    test_size=0.2, stratify_by_column="label"
)

Mounted at /content/drive


Casting the dataset:   0%|          | 0/16120 [00:00<?, ? examples/s]

Map:   0%|          | 0/16120 [00:00<?, ? examples/s]

Tokenize the sequences

In [6]:
from transformers import TrainingArguments, Trainer
import evaluate
import torch

accuracy = evaluate.load("accuracy")
roc_auc = evaluate.load("roc_auc", "binary")


def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = logits.argmax(axis=-1)
    # Softmax over the class axis; column 1 is the positive-class probability.
    probas = torch.softmax(torch.tensor(logits), dim=-1).numpy()
    return {
        "accuracy": accuracy.compute(predictions=preds, references=labels)["accuracy"],
        "roc_auc": roc_auc.compute(prediction_scores=probas[:, 1], references=labels)[
            "roc_auc"
        ],
    }


training_args = TrainingArguments(
    output_dir="./esm2_finetuned",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=4,
    weight_decay=0.01,
    logging_dir="./logs",
    load_best_model_at_end=True,
    metric_for_best_model="roc_auc",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)


🚀 6. Train the Model

In [10]:
import numpy as np
import torch
from transformers import Trainer, TrainingArguments
import evaluate  # for accuracy, f1, roc_auc
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score
from sklearn.utils import resample

# ==============================
# 📈 Metrics: Accuracy, ROC-AUC, F1
# ==============================
accuracy = evaluate.load("accuracy")
roc_auc = evaluate.load("roc_auc", "binary")
f1 = evaluate.load("f1")


def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    probas = torch.nn.functional.softmax(torch.tensor(logits), dim=-1).numpy()
    return {
        "accuracy": accuracy.compute(predictions=preds, references=labels)["accuracy"],
        "roc_auc": roc_auc.compute(prediction_scores=probas[:, 1], references=labels)[
            "roc_auc"
        ],
        "f1": f1.compute(predictions=preds, references=labels)["f1"],
    }


# ==============================
# ⚙️ Training Arguments
# ==============================
training_args = TrainingArguments(
    output_dir="./esm2_finetuned",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    num_train_epochs=4,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    learning_rate=2e-5,
    weight_decay=0.01,
    logging_dir="./logs",
    load_best_model_at_end=True,
    metric_for_best_model="roc_auc",
    run_name="esm2_finetuned",
)

# ==============================
# 🧠 Trainer
# ==============================
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
# ==============================
# 🏋️‍♀️ Fine-tune! (Auto-resume enabled)
# ==============================
import os

# Auto-resume training from the latest checkpoint if available
if os.path.isdir(training_args.output_dir) and any(
    "checkpoint" in f for f in os.listdir(training_args.output_dir)
):
    print("🔁 Resuming from last checkpoint...")
    trainer.train(resume_from_checkpoint=True)
else:
    print("🚀 Starting training from scratch...")
    trainer.train()


# ==============================
# 📊 Final Evaluation with SE
# ==============================

# Get predictions
predictions = trainer.predict(tokenized_dataset["test"])
logits = predictions.predictions
labels = predictions.label_ids

# Softmax for probabilities
probs = torch.nn.functional.softmax(torch.tensor(logits), dim=-1).numpy()
y_pred = np.argmax(probs, axis=1)
y_true = labels
y_scores = probs[:, 1]

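# Bootstrap helper: resample the test set with replacement and recompute the metric


def bootstrap_metric(metric_fn, y_true, y_score, n_rounds=1000):
    scores = []
    for _ in range(n_rounds):
        idx = np.random.choice(len(y_true), len(y_true), replace=True)
        score = metric_fn(y_true[idx], y_score[idx])
        scores.append(score)
    # Caveat: std / sqrt(n_rounds) is the Monte Carlo error of the bootstrap mean
    # and shrinks as n_rounds grows. The bootstrap standard error of the metric
    # itself is np.std(scores, ddof=1), without the division.
    return np.mean(scores), np.std(scores, ddof=1) / np.sqrt(n_rounds)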


# Compute metrics + SE
acc_mean, acc_se = bootstrap_metric(accuracy_score, y_true, y_pred)
auc_mean, auc_se = bootstrap_metric(roc_auc_score, y_true, y_scores)
f1_mean, f1_se = bootstrap_metric(f1_score, y_true, y_pred)

# Print results
print("\n✅ Final Metrics on Test Set (with Standard Error):")
print(f"Accuracy:  {acc_mean:.4f} ± {acc_se:.4f} (SE)")
print(f"ROC-AUC:   {auc_mean:.4f} ± {auc_se:.4f} (SE)")
print(f"F1 Score:  {f1_mean:.4f} ± {f1_se:.4f} (SE)")

# ==============================
# 💾 Save Best Model
# ==============================
trainer.save_model(
    "/content/drive/MyDrive/AllergenAInew/Fine-tune Transformer/esm2_finetuned_best"
)
tokenizer.save_pretrained(
    "/content/drive/MyDrive/AllergenAInew/Fine-tune Transformer/esm2_finetuned_best"
)

🚀 Starting training from scratch...


Epoch,Training Loss,Validation Loss,Accuracy,Roc Auc,F1
1,0.11,0.26494,0.948821,0.990241,0.950849
2,0.0438,0.155552,0.974256,0.996249,0.974579
3,0.0259,0.123782,0.981079,0.996942,0.981155
4,0.0034,0.13433,0.981079,0.996878,0.981179



✅ Final Metrics on Test Set (with Standard Error):
Accuracy:  0.9811 ± 0.0001 (SE)
ROC-AUC:   0.9969 ± 0.0000 (SE)
F1 Score:  0.9811 ± 0.0001 (SE)


('/content/drive/MyDrive/AllergenAInew/Fine-tune Transformer/esm2_finetuned_best/tokenizer_config.json',
 '/content/drive/MyDrive/AllergenAInew/Fine-tune Transformer/esm2_finetuned_best/special_tokens_map.json',
 '/content/drive/MyDrive/AllergenAInew/Fine-tune Transformer/esm2_finetuned_best/vocab.txt',
 '/content/drive/MyDrive/AllergenAInew/Fine-tune Transformer/esm2_finetuned_best/added_tokens.json')

🧪 7. Evaluate on Test Set

In [11]:
metrics = trainer.evaluate()
print(metrics)

{'eval_loss': 0.12378237396478653, 'eval_accuracy': 0.9810794044665012, 'eval_roc_auc': 0.9969423261641905, 'eval_f1': 0.981155390793945, 'eval_runtime': 80.0409, 'eval_samples_per_second': 40.279, 'eval_steps_per_second': 10.07, 'epoch': 4.0}


🧬 Optional: Scale Up to Bigger ESM-2
Once it works, you can change:

In [None]:
model_name = "facebook/esm2_t33_650M_UR50D"  # or t36_3B if you dare

In [14]:
!cp -r /content/wandb /content/drive/MyDrive/AllergenAInew/Fine-tune-Transformer

## 🧪 Calibration: Making Predictions Trustworthy

### ❗ Problem: Overconfident Predictions

We computed accuracy, F1 score, and ROC-AUC, but the best checkpoint was selected on ROC-AUC alone (`metric_for_best_model="roc_auc"`). The training objective itself was the standard cross-entropy loss.

After fine-tuning our transformer (ESM-2) for binary allergenicity classification, we observed that the model frequently outputs **very high probabilities (e.g., 0.9999)**, even for sequences that are likely **not** allergens.

While the model achieved strong performance on **ROC-AUC**, its probability estimates were **overconfident** and not realistic.

This is expected when a model is selected on **ROC-AUC**, a metric that measures how well the model **ranks** allergens above non-allergens but **ignores how confident** those predictions are.

---

### 📈 What is ROC-AUC?

**ROC-AUC** (Receiver Operating Characteristic, Area Under the Curve) evaluates how well the model can **rank** positives above negatives across all thresholds:

| AUC Score | Interpretation            |
|-----------|---------------------------|
| 0.50      | Random guess              |
| 0.70      | Fair                      |
| 0.80      | Good                      |
| 0.90+     | Excellent ranking ability |

A model can achieve a high ROC-AUC **without producing well-calibrated probabilities**. That's why the outputs looked good for ranking but bad for real-world interpretation.
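
To illustrate the point, here is a minimal sketch on synthetic data (not the AlgPred 2.0 set used above). It assumes a hypothetical calibrated score `p_cal` and an overconfident copy `p_over` produced by a strictly increasing transform. Because the ordering of the scores is unchanged, ROC-AUC should stay the same, while the Brier score (a calibration-sensitive metric) gets worse.

```python
import numpy as np
from sklearn.metrics import brier_score_loss, roc_auc_score

rng = np.random.default_rng(0)

# Synthetic, hypothetical data: labels are drawn so that p_cal is calibrated.
p_cal = rng.uniform(0.01, 0.99, size=5000)
y = (rng.uniform(size=5000) < p_cal).astype(int)

# Overconfident copy: scale the logits by 3, which pushes scores toward 0 and 1
# without changing their order.
logit = np.log(p_cal / (1.0 - p_cal))
p_over = 1.0 / (1.0 + np.exp(-3.0 * logit))

auc_cal = roc_auc_score(y, p_cal)
auc_over = roc_auc_score(y, p_over)
brier_cal = brier_score_loss(y, p_cal)
brier_over = brier_score_loss(y, p_over)

print(f"ROC-AUC: calibrated={auc_cal:.4f}  overconfident={auc_over:.4f}")
print(f"Brier:   calibrated={brier_cal:.4f}  overconfident={brier_over:.4f}")
```

The two ROC-AUC values should match, while the Brier score of the overconfident scores should be higher (worse).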

---

### ✅ Solution: Post-Hoc Probability Calibration

To fix the overconfidence, we applied **Isotonic Regression** for probability calibration:

- Trained on the validation set using the model's raw softmax outputs.
- Learned to map overconfident scores to more realistic probabilities.
- Applied this calibrator during inference in the Streamlit app.
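
The steps above can be sketched with scikit-learn's `IsotonicRegression`. This is an illustrative sketch, not the exact code used in the app: `val_scores` and `val_labels` are hypothetical stand-ins (generated synthetically here) for the positive-class softmax outputs and true labels of the validation set.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)

# Hypothetical validation data: true probabilities, labels drawn from them,
# and overconfident raw scores (logits scaled by 3).
true_prob = rng.uniform(0.01, 0.99, size=4000)
val_labels = (rng.uniform(size=4000) < true_prob).astype(int)
logit = np.log(true_prob / (1.0 - true_prob))
val_scores = 1.0 / (1.0 + np.exp(-3.0 * logit))

# Fit a monotonic map from raw score to observed frequency of the positive class.
# out_of_bounds="clip" keeps predictions defined for scores outside the fitted range.
calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
calibrator.fit(val_scores, val_labels)

# At inference time, pass the raw positive-class probability through the calibrator.
new_scores = np.array([0.05, 0.40, 0.70, 0.9999])
calibrated = calibrator.predict(new_scores)
for raw, cal in zip(new_scores, calibrated):
    print(f"raw={raw:.4f} -> calibrated={cal:.4f}")
```

In practice the fitted calibrator would be saved alongside the model (for example with `joblib.dump`) and loaded by the app. Fitting it on data the model was not trained on is what keeps the mapping honest.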

---

### 📊 Before vs After Calibration

| Raw Model Output | Calibrated Output |
|------------------|-------------------|
| 0.9999           | 0.82              |
| 0.70             | 0.55              |
| 0.40             | 0.38              |

With calibration, our predictions are intended to be **more realistic, trustworthy, and user-friendly**:

- A calibrated score of 0.9 should correspond to roughly a 90% chance of being allergenic,
- Users can make decisions with more confidence,
- We avoid misleading, overly confident outputs.
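
One way to check whether a score of 0.9 really corresponds to about a 90% positive rate is a binned calibration error. The helper below is a hypothetical sketch, not part of the notebook's pipeline: it bins the positive-class probabilities into equal-width bins and averages the gap between the mean predicted probability and the observed positive rate, weighted by bin size.

```python
import numpy as np


def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Binned calibration error for positive-class probabilities (0 = perfect)."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges give bin indices 0..n_bins-1; a probability of 1.0 lands in the last bin.
    bin_ids = np.digitize(y_prob, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue  # skip empty bins
        gap = abs(y_true[mask].mean() - y_prob[mask].mean())
        ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return ece


# Toy check: scores of 0.95 on a set that is only 50% positive are overconfident.
ece_bad = expected_calibration_error([1, 0, 1, 0], [0.95, 0.95, 0.95, 0.95])
ece_good = expected_calibration_error([0, 0, 1, 1], [0.0, 0.0, 1.0, 1.0])
print(ece_bad, ece_good)
```

Comparing this value on held-out data before and after calibration gives a single number for how much the probabilities improved.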
