CoVFit: COVID-19 Variant Fitness Prediction

A refactored and user-friendly implementation of CoVFit - a multitask deep learning system for predicting SARS-CoV-2 variant fitness and antibody escape using ESM protein language models.

🚀 What's New

Google Colab Support: Run inference directly in Google Colab with automatic setup!
No Authentication Required: Pre-trained models are public - no HuggingFace token needed
Data Included: All training and test data included in the repository
Multiple Inference Modes: Choose between single-fold or multi-fold averaging
Auto-download Models: Models automatically downloaded from Hugging Face when needed

Overview

CoVFit predicts two key properties of SARS-CoV-2 variants:

Viral fitness: How well variants replicate and spread
Antibody escape: How well variants evade immune responses

The system uses ESM (Evolutionary Scale Modeling) protein language models with LoRA (Low-Rank Adaptation) for efficient fine-tuning on multitask regression objectives.
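
To make the LoRA idea concrete, here is a minimal pure-Python sketch of a low-rank update applied to one frozen weight matrix, plus the arithmetic behind the small trainable-parameter count. The matrix sizes, the rank r, and the scaling factor alpha are illustrative assumptions, not CoVFit's actual settings.

```python
import random

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_weight(W, A, B, alpha):
    """Return W + (alpha / r) * (B @ A), where r is the adapter rank."""
    r = len(A)
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

def lora_trainable_fraction(d_out, d_in, r):
    """Fraction of parameters that are trainable for one adapted matrix."""
    trainable = r * (d_in + d_out)
    return trainable / (d_out * d_in + trainable)

random.seed(0)
d_out, d_in, r, alpha = 6, 6, 2, 4  # illustrative sizes only

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable
B = [[0.0] * r for _ in range(d_out)]                                  # trainable, zero init

# With B initialised to zero, the adapted weight equals the frozen weight,
# so fine-tuning starts from the pre-trained model's behaviour.
W_adapted = lora_weight(W, A, B, alpha)

# Per-matrix trainable fraction for a hypothetical 1280 x 1280 layer at rank 8.
fraction = lora_trainable_fraction(1280, 1280, 8)
```

Only A and B receive gradient updates. The model-wide fraction depends on how many matrices carry an adapter, which this sketch does not model.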

Key Features

Multitask Learning: Simultaneous prediction of fitness and antibody escape
ESM-based Architecture: Leverages Meta's large-scale protein language models
LoRA Fine-tuning: Memory-efficient adaptation with <1% trainable parameters
Weighted Loss Functions: Handles data imbalance and temporal weighting
Interactive Analysis: Jupyter notebook interface for easy exploration
Modular Design: Clean, maintainable, and extensible codebase
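
As an illustration of the weighted-loss idea, the sketch below computes a weighted mean squared error with inverse-frequency sample weights. The weighting scheme and the helper names are assumptions chosen for the example; this README does not specify how CoVFit derives its imbalance or temporal weights.

```python
from collections import Counter

def inverse_frequency_weights(groups):
    """Weight each sample by n / (k * count of its group), so rare groups count more."""
    counts = Counter(groups)
    n, k = len(groups), len(counts)
    return [n / (k * counts[g]) for g in groups]

def weighted_mse(preds, targets, weights):
    """Mean squared error where each sample contributes in proportion to its weight."""
    total = sum(weights)
    return sum(w * (p - t) ** 2 for p, t, w in zip(preds, targets, weights)) / total

# Hypothetical group labels: three samples from a common group, one from a rare group.
groups = ["common", "common", "common", "rare"]
weights = inverse_frequency_weights(groups)
loss = weighted_mse([1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 2.0], weights)
```

A temporal weight could be multiplied into the same per-sample weight list before calling weighted_mse.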

Quick Start

Three Ways to Get Started

1. Google Colab (Recommended for beginners)

No installation required
Free GPU access
Just open infer.ipynb in Colab!
See Option A below

2. Local Python Environment

# Clone the repository
git clone https://github.com/TheSatoLab/CoVFit_module.git
cd CoVFit_module

# Install dependencies
pip3 install -r requirements.txt

# Run inference notebook
jupyter notebook infer.ipynb

3. Docker (For reproducible environments)

# Clone and build
git clone https://github.com/TheSatoLab/CoVFit_module.git
cd CoVFit_module
docker build -t covfit:latest .

# Run inference (replace "infer" with "train" to train a model)
./run_covfit.sh infer

Setup (Optional - for Docker usage)

For Docker-based workflows, create a configuration file user_config.env:

# Required settings for Docker
FOLD_ID=0
OUTPUT_PREFIX="covfit_fold_"
DOCKER_IMAGE="covfit:latest"

# Optional inference settings
RESULTS_DIR="./inference_results"
MODEL_CHECKPOINT_PATH="./covfit_fold_0_model.ckpt"
TASK_DICT_PATH="./covfit_fold_0_model_task_id_dict.pt"

# Inference mode settings
USE_SINGLE_FOLD=false  # Set to true to use single fold, false for multi-fold averaging
SINGLE_FOLD_ID=0       # Which fold to use when USE_SINGLE_FOLD=true

# HuggingFace token - NOT REQUIRED!
# The pre-trained models are public and don't need authentication
# HF_TOKEN="your_hf_token_here"

Important Notes:

HF_TOKEN is NOT required: Pre-trained models are publicly available on Hugging Face without authentication
Data is included: Training and test data are in this repository under data/raw/
Google Colab users: No configuration file needed! Everything is set up automatically.
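
If you want to inspect a configuration file from Python, the following sketch parses KEY=VALUE lines of the kind shown above. It is a hypothetical helper, not part of the repository; run_covfit.sh reads the file itself.

```python
import shlex

def parse_env_file(text):
    """Parse KEY=VALUE lines, ignoring blank lines and # comments."""
    config = {}
    for line in text.splitlines():
        # shlex strips surrounding quotes and drops trailing comments.
        tokens = shlex.split(line, comments=True)
        if not tokens:
            continue
        key, sep, value = tokens[0].partition("=")
        if sep:
            config[key] = value
    return config

example = '''
FOLD_ID=0
OUTPUT_PREFIX="covfit_fold_"
USE_SINGLE_FOLD=false  # Set to true to use single fold
# HF_TOKEN="your_hf_token_here"
'''

config = parse_env_file(example)
use_single_fold = config.get("USE_SINGLE_FOLD", "false").lower() == "true"
```

All values come back as strings, so numeric or boolean settings need an explicit conversion as in the last line.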

Build Docker image (for local/server usage):

docker build -t covfit:latest .

Basic Usage

CoVFit provides multiple ways to use the models:

Option A: Google Colab (Easiest - No Setup Required)

The easiest way to get started is using Google Colab:

1. Open the notebook in Google Colab:
   - Go to Google Colab
   - Select the "GitHub" tab and enter: TheSatoLab/CoVFit_module
   - Open infer.ipynb
2. The notebook automatically:
   - Detects the Google Colab environment
   - Installs required packages
   - Mounts your Google Drive
   - Clones/updates the repository to your Drive
   - Downloads pre-trained models from Hugging Face (no token required!)
   - Runs inference on test data
3. Results are saved to:
   - /content/drive/MyDrive/inference_results/
   - Files include: predictions, summaries, and visualizations
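
The environment detection step can be sketched as follows. This is one common way to check for Colab and is an assumption about the approach, not a copy of the notebook's code; the function names are hypothetical.

```python
from pathlib import Path

def running_in_colab():
    """Return True when the google.colab package can be imported."""
    try:
        import google.colab  # noqa: F401  (normally only available inside Colab)
    except ImportError:
        return False
    return True

def results_dir(in_colab):
    """Pick an output directory: Google Drive in Colab, a local folder otherwise."""
    if in_colab:
        return Path("/content/drive/MyDrive/inference_results")
    return Path("./inference_results")

output_dir = results_dir(running_in_colab())
```

Keeping the path choice in a function that takes a flag makes it easy to check both branches without being in Colab.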

Benefits of Google Colab:

No local installation required
Free GPU access
Easy sharing and collaboration
Results saved to Google Drive

Option B: Local/Server Usage with Docker

For local or server environments, use the run_covfit.sh script:

1. Training a Single Fold

# Train with default config (uses FOLD_ID from user_config.env)
./run_covfit.sh train

# Train specific fold with command line argument
./run_covfit.sh train user_config.env 2

# Train with custom config file
./run_covfit.sh train my_config.env

2. Training All Folds

# Train all folds (0-4) automatically
./run_covfit.sh train-all-folds

# With custom config
./run_covfit.sh train-all-folds my_config.env

3. Train + Inference Pipeline

# Complete pipeline: train then run inference automatically
./run_covfit.sh train-infer

# With specific fold
./run_covfit.sh train-infer user_config.env 3

4. Inference Only

# Run inference with config settings
./run_covfit.sh infer

# With custom config
./run_covfit.sh infer my_config.env

Output Files:

infer_executed.ipynb: Executed notebook with all cell outputs
logs/inference_YYYYMMDD_HHMMSS.log: Complete execution log
inference_results/: Prediction results and analysis files

Note: The inference script includes:

Extended timeout (10 hours) for long-running computations
Automatic logging to timestamped log files
Executed notebook saved for later review
Automatic model download from Hugging Face (no HF_TOKEN required!)
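
For reference, a timestamped log path in the format listed under Output Files can be built as in the sketch below. The helper name is hypothetical and the snippet only illustrates the naming pattern and the 10-hour timeout expressed in seconds; it is not the script's own code.

```python
from datetime import datetime
from pathlib import Path

# 10 hours expressed in seconds, matching the timeout described above.
TIMEOUT_SECONDS = 10 * 60 * 60

def inference_log_path(now, log_dir="logs"):
    """Build logs/inference_YYYYMMDD_HHMMSS.log for the given datetime."""
    stamp = now.strftime("%Y%m%d_%H%M%S")
    return Path(log_dir) / f"inference_{stamp}.log"

log_path = inference_log_path(datetime.now())
```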

Option C: Interactive Analysis with Jupyter Notebook

Using VSCode Dev Containers (For Local/Server Development)

This method provides the best development experience with full IDE integration:

Prerequisites:
- Install VSCode
- If you work on a remote server, install the Remote-SSH extension and connect to the server via SSH
- Install the "Dev Containers" extension in VSCode

Open in Dev Container:

# In VSCode:
# Press Ctrl+Shift+P (Cmd+Shift+P on Mac)
# Type: "Dev Containers: Reopen in Container"
# Select it and wait for container to build

Run Jupyter Notebook:
- Open infer.ipynb in VSCode
- Click "Select Kernel" in the top right
- Choose /opt/conda/bin/python
- Run cells with Shift + Enter
- Models are automatically downloaded from Hugging Face (no token required!)

Benefits:

Direct access to GPU resources
Full IDE features (autocomplete, debugging, git integration)
Automatic environment setup
No need for SSH port forwarding
Seamless file editing and notebook execution

Key capabilities in infer.ipynb:

Automatic environment detection (Colab vs local)
Load pre-trained models (automatically downloaded if needed)
Make predictions on test data or custom sequences
Perform evolutionary fitness landscape analysis
Generate publication-ready visualizations
Real-time progress monitoring with cell outputs
Support for both single-fold and multi-fold averaging inference modes
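
The difference between the two inference modes can be sketched as follows: single-fold mode returns one model's predictions, while multi-fold mode averages the predictions of all folds per sequence. The function name and the dictionary layout are assumptions for illustration, not the notebook's actual interface.

```python
from statistics import fmean

def combine_fold_predictions(fold_predictions, single_fold_id=None):
    """Combine per-fold predictions.

    fold_predictions maps fold_id -> list of per-sequence predictions.
    If single_fold_id is given, return that fold's predictions unchanged;
    otherwise return the per-sequence mean across all folds.
    """
    if single_fold_id is not None:
        return list(fold_predictions[single_fold_id])
    ordered = (fold_predictions[f] for f in sorted(fold_predictions))
    return [fmean(values) for values in zip(*ordered)]

# Hypothetical predictions for two sequences from two folds.
folds = {0: [1.0, 2.0], 1: [3.0, 4.0]}
averaged = combine_fold_predictions(folds)
single = combine_fold_predictions(folds, single_fold_id=1)
```

Averaging usually reduces variance between folds at the cost of running every model, while single-fold mode is faster.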

Custom Sequence Prediction

from transformers import AutoTokenizer

from src.models.esm_regression import load_model_for_inference

# NOTE: model_config, n_targets, and predict_fitness are not defined in this
# snippet. They must already exist in your session and match the checkpoint.

# Load model and tokenizer
model = load_model_for_inference("results/covfit_model.ckpt", model_config, n_targets)
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")

# Predict fitness for your sequences (amino acid strings)
your_sequences = [
    "MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRF"
]

predictions = predict_fitness(your_sequences, model, tokenizer)

Project Structure

CoVFit
├── config/
│   └── config.py              # Configuration management
├── src/
│   ├── data/
│   │   ├── dataset.py         # PyTorch datasets
│   │   ├── preprocessing.py   # Data preprocessing
│   │   └── preprocessing_lite.py # Memory-efficient preprocessing
│   ├── models/
│   │   └── esm_regression.py  # ESM regression model with LoRA
│   └── utils/
│       ├── io_utils.py        # File I/O utilities
│       ├── train_utils.py     # Training utilities
│       └── simple_split.py    # Data splitting utilities
├── data/
│   └── raw/                   # Training and test data (included in repository)
│       ├── metadata.representative.all_countries.with_date.v2.with_seq_231102_wo_variants_before_cutoff.txt
│       ├── escape_data_mutation.csv
│       └── nextclade.peptide.S_rename.fasta
├── infer.ipynb               # ⭐ Interactive analysis notebook (Colab-ready!)
├── run_covfit.sh             # ⭐ Main execution script (train/infer)
├── run_jupyter.sh            # ⭐ JupyterLab launcher for remote access
├── train.py                  # Training script (used by run_covfit.sh)
├── user_config.env           # Configuration file
├── logs/                     # Execution logs (auto-generated)
└── README.md                 # This file

Research Applications

SARS-CoV-2 Variant Analysis

Evaluate fitness effects of emerging mutations
Compare variant competitiveness across different backgrounds
Predict immune escape potential

Evolutionary Forecasting

Identify positions likely to undergo adaptive mutations
Predict fitness landscapes around current variants
Guide surveillance priorities

Protein Engineering

Design improved vaccine antigens
Optimize therapeutic proteins
Understand structure-function relationships

Model Performance

The model achieves strong performance across multiple tasks:

Fitness prediction: Pearson correlation >0.7 across countries
Antibody escape: High accuracy for DMS experimental data
Cross-validation: Robust performance across 5-fold CV
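
For readers who want to check the fitness metric on their own predictions, the Pearson correlation can be computed as in this minimal sketch. The function is a plain reference implementation, not the evaluation code used for the reported results.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical observed and predicted fitness values.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
r = pearson_r(observed, predicted)
```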

Advanced Usage

Command Line Options

The run_covfit.sh script provides comprehensive options:

# Get help
./run_covfit.sh --help

# Available commands
Usage: ./run_covfit.sh [train|train-all-folds|train-infer|infer] [config_file] [fold_id]

Commands:
  train             Run model training for single fold
  train-all-folds   Run model training for all folds (0-4)
  train-infer       Run training then inference automatically
  infer             Run model inference

Arguments:
  config_file    Configuration file (default: user_config.env)
  fold_id        Fold ID for training (0-4, overrides FOLD_ID from config)
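
If you drive the script from Python, the argument order above can be captured in a small helper like the one below. This is a hypothetical wrapper that follows the usage line; it only builds the argument list and does not run anything.

```python
COMMANDS = {"train", "train-all-folds", "train-infer", "infer"}

def build_command(command, config_file="user_config.env", fold_id=None):
    """Build the argument list for run_covfit.sh following its usage line."""
    if command not in COMMANDS:
        raise ValueError(f"unknown command: {command}")
    args = ["./run_covfit.sh", command, config_file]
    if fold_id is not None:
        if fold_id not in range(5):
            raise ValueError("fold_id must be between 0 and 4")
        args.append(str(fold_id))
    return args

cmd = build_command("train", "user_config.env", 2)
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the repository root.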

Configuration Management

Create different configuration files for different experiments:

# Experiment 1 config
cp user_config.env experiment1.env
# Edit experiment1.env with specific settings

# Experiment 2 config
cp user_config.env experiment2.env
# Edit experiment2.env with different settings

# Run experiments
./run_covfit.sh train experiment1.env 0
./run_covfit.sh train experiment2.env 0

Batch Processing

# Train specific folds in sequence
for fold in 0 1 2; do
    ./run_covfit.sh train user_config.env $fold
done

# Run complete pipeline for multiple experiments
for config in experiment1.env experiment2.env; do
    ./run_covfit.sh train-infer $config
done

Batch Prediction

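import pandas as pd

# Analyze multiple sequences (seq1, seq2, seq3 are your own amino acid strings;
# add further sequences to the list as needed)
sequences = [seq1, seq2, seq3]
predictions = predict_fitness(sequences, model, tokenizer)

# Save results; the axis=1 reductions assume predictions is a 2D array
# with one row per sequence
results_df = pd.DataFrame({
    'sequence': sequences,
    'fitness': predictions.mean(axis=1),
    'fitness_std': predictions.std(axis=1)
})
results_df.to_csv('batch_predictions.csv', index=False)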

Evolutionary Analysis

# Analyze fitness landscape around a sequence
# (analyze_fitness_landscape, model, and tokenizer must already exist in your session)
base_sequence = "YOUR_PROTEIN_SEQUENCE"
positions_of_interest = [145, 484, 501, 614]

landscape_results = analyze_fitness_landscape(
    base_sequence, model, tokenizer, positions_of_interest
)

# Find most beneficial mutations (nlargest implies a pandas DataFrame result)
top_mutations = landscape_results.nlargest(10, 'fitness_mean')
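
A landscape scan of this kind needs every single-residue substitution at the chosen positions. The sketch below enumerates them. It assumes positions are 1-based and is not the implementation of analyze_fitness_landscape, whose internals are not described here.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def single_mutants(base_sequence, positions):
    """Yield (label, mutated_sequence) for each substitution at 1-based positions."""
    for pos in positions:
        if not 1 <= pos <= len(base_sequence):
            raise ValueError(f"position {pos} is outside the sequence")
        wild_type = base_sequence[pos - 1]
        for aa in AMINO_ACIDS:
            if aa != wild_type:
                mutated = base_sequence[:pos - 1] + aa + base_sequence[pos:]
                yield f"{wild_type}{pos}{aa}", mutated

# Hypothetical toy sequence; real spike sequences are much longer.
mutants = list(single_mutants("MFVF", [2]))
```

Each mutated sequence can then be scored with the model and ranked by predicted fitness.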

Docker Features

The run_covfit.sh script automatically handles Docker execution. It provides:

Automatic GPU detection: Enables --gpus all when NVIDIA GPU is available
Volume mounting: Mounts current directory as /workspace
Environment variables: Passes configuration variables to Docker container
Result handling: Automatically manages inference output directories
Model auto-download: Pre-trained models are automatically downloaded from Hugging Face (no token required)
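
The GPU detection, volume mounting, and environment passing described above can be sketched as assembling a docker run command. This is an illustrative assumption about the general approach. The flags the script actually uses may differ, and the "python train.py" entry point is a hypothetical example.

```python
import os
import shutil

def nvidia_gpu_available():
    """Heuristic: treat a visible nvidia-smi binary as a sign of an NVIDIA GPU."""
    return shutil.which("nvidia-smi") is not None

def build_docker_command(image, script_args, has_gpu, workdir=None, env=None):
    """Assemble a docker run command that mounts workdir as /workspace."""
    workdir = workdir or os.getcwd()
    cmd = ["docker", "run", "--rm"]
    if has_gpu:
        cmd += ["--gpus", "all"]
    cmd += ["-v", f"{workdir}:/workspace", "-w", "/workspace"]
    for key, value in (env or {}).items():
        cmd += ["-e", f"{key}={value}"]
    cmd.append(image)
    cmd += list(script_args)
    return cmd

docker_cmd = build_docker_command(
    "covfit:latest",
    ["python", "train.py"],          # hypothetical entry point
    has_gpu=True,
    workdir="/home/user/CoVFit_module",
    env={"FOLD_ID": "0"},
)
```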

For manual Docker control, see the Advanced Usage section above.

Data Availability

All training and test data are included in this repository under the data/raw/ directory:

Fitness data: Variant frequency data from multiple countries
Antibody escape data: Deep mutational scanning (DMS) experimental results
Sequence data: SARS-CoV-2 spike protein sequences in FASTA format

This means you can:

Train models without additional data downloads
Reproduce published results exactly
Experiment with the full dataset immediately after cloning
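
To take a quick look at the bundled sequence data, a FASTA file can be read with a few lines of Python. The parser below is a minimal stand-in written for this example, not the repository's preprocessing code.

```python
def read_fasta(lines):
    """Parse FASTA lines into a dict mapping record name -> sequence."""
    records = {}
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            header = line[1:].split()
            name = header[0] if header else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records

# Hypothetical two-record example; sequences may span several lines.
example = [">variant_A some description", "MFVF", "LVLL", ">variant_B", "MFVFLV"]
records = read_fasta(example)
```

For the real data, pass an open file handle, for example open("data/raw/nextclade.peptide.S_rename.fasta"). Biopython's SeqIO.parse(handle, "fasta") is an alternative, since biopython is already a listed dependency.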

Pre-trained models are hosted on Hugging Face Hub at TheSatoLab-UTokyo/CoVFit and are automatically downloaded when needed (no authentication required).

Requirements

System Requirements

Python 3.8+
PyTorch 1.12+
CUDA-capable GPU (recommended) or CPU
8GB+ RAM (16GB+ recommended)

Python Dependencies

torch >= 1.12.0
transformers >= 4.20.0
pandas >= 1.3.0
numpy >= 1.21.0
scikit-learn >= 1.0.0
biopython >= 1.79
scipy >= 1.7.0
peft >= 0.3.0
matplotlib >= 3.5.0
seaborn >= 0.11.0
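
To see which of these packages are present in your environment, a check along these lines can help. The helper is hypothetical and only reports installed versions. It does not compare them against the minimums listed above.

```python
from importlib import metadata

def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

required = ["torch", "transformers", "pandas", "numpy", "scikit-learn",
            "biopython", "scipy", "peft", "matplotlib", "seaborn"]
report = installed_versions(required)
missing = [name for name, version in report.items() if version is None]
```

Any names in missing can be installed with pip3 install -r requirements.txt.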

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

ESM Models: Meta AI for the ESM protein language models
LoRA: Microsoft for the LoRA adaptation technique
HuggingFace: For the transformers library and training infrastructure
Original CoVFit: Based on the original CoVFit implementation

Citation

A Protein Language Model for Exploring Viral Fitness Landscapes. Jumpei Ito, Adam Strange, Wei Liu, Gustav Joas, Spyros Lytras, The Genotype to Phenotype Japan (G2P-Japan) Consortium, Kei Sato. 2024. bioRxiv https://doi.org/10.1101/2024.03.15.584819

Contact

jampei@g.ecc.u-tokyo.ac.jp (Jumpei Ito)
