A refactored and user-friendly implementation of CoVFit - a multitask deep learning system for predicting SARS-CoV-2 variant fitness and antibody escape using ESM protein language models.
- Google Colab Support: Run inference directly in Google Colab with automatic setup!
- No Authentication Required: Pre-trained models are public - no HuggingFace token needed
- Data Included: All training and test data included in the repository
- Multiple Inference Modes: Choose between single-fold or multi-fold averaging
- Auto-download Models: Models automatically downloaded from Hugging Face when needed
CoVFit predicts two key properties of SARS-CoV-2 variants:
- Viral fitness: How well variants replicate and spread
- Antibody escape: How well variants evade immune responses
The system uses ESM (Evolutionary Scale Modeling) protein language models with LoRA (Low-Rank Adaptation) for efficient fine-tuning on multitask regression objectives.
- Multitask Learning: Simultaneous prediction of fitness and antibody escape
- ESM-based Architecture: Leverages Meta's large-scale protein language models
- LoRA Fine-tuning: Memory-efficient adaptation with <1% trainable parameters
- Weighted Loss Functions: Handles data imbalance and temporal weighting
- Interactive Analysis: Jupyter notebook interface for easy exploration
- Modular Design: Clean, maintainable, and extensible codebase
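The "<1% trainable parameters" figure follows from how LoRA works: the original weight matrix stays frozen and only two thin low-rank factors are trained. The sketch below shows the arithmetic for a single square projection layer. The hidden size (1280) and rank (4) are illustrative assumptions, not CoVFit's actual settings.

```python
def lora_trainable_fraction(d_in, d_out, rank):
    """Fraction of parameters that are trainable when a frozen d_out x d_in
    weight matrix is augmented with LoRA factors B (d_out x rank) and A (rank x d_in)."""
    frozen = d_in * d_out               # original weights, not updated
    trainable = rank * (d_in + d_out)   # the two low-rank factors
    return trainable / (frozen + trainable)


# Hypothetical layer: 1280-dimensional projection adapted with rank 4.
fraction = lora_trainable_fraction(d_in=1280, d_out=1280, rank=4)
print(f"trainable share of this layer: {fraction:.2%}")
```

The share for a whole model is usually smaller still, because embeddings and any non-adapted layers stay frozen and add only to the denominator.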
1. Google Colab (Recommended for beginners)
- No installation required
- Free GPU access
- Just open infer.ipynb in Colab!
- See Option A below
2. Local Python Environment
# Clone the repository
git clone https://github.com/TheSatoLab/CoVFit_module.git
cd CoVFit_module
# Install dependencies
pip3 install -r requirements.txt
# Run inference notebook
jupyter notebook infer.ipynb
3. Docker (For reproducible environments)
# Clone and build
git clone https://github.com/TheSatoLab/CoVFit_module.git
cd CoVFit_module
docker build -t covfit:latest .
# Run training or inference
./run_covfit.sh infer
For Docker-based workflows, create a configuration file user_config.env:
# Required settings for Docker
FOLD_ID=0
OUTPUT_PREFIX="covfit_fold_"
DOCKER_IMAGE="covfit:latest"
# Optional inference settings
RESULTS_DIR="./inference_results"
MODEL_CHECKPOINT_PATH="./covfit_fold_0_model.ckpt"
TASK_DICT_PATH="./covfit_fold_0_model_task_id_dict.pt"
# Inference mode settings
USE_SINGLE_FOLD=false # Set to true to use single fold, false for multi-fold averaging
SINGLE_FOLD_ID=0 # Which fold to use when USE_SINGLE_FOLD=true
# HuggingFace token - NOT REQUIRED!
# The pre-trained models are public and don't need authentication
# HF_TOKEN="your_hf_token_here"
Important Notes:
- HF_TOKEN is NOT required: Pre-trained models are publicly available on Hugging Face without authentication
- Data is included: Training and test data are in this repository under data/raw/
- Google Colab users: No configuration file needed! Everything is set up automatically.
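The settings in user_config.env are plain KEY=VALUE lines with shell-style quotes and comments, so they are easy to inspect from Python. The following is a minimal sketch of reading such a file and deciding which folds an inference run would use. The helper names `parse_env_text` and `select_folds` are hypothetical and not part of the repository, and the five-fold default is taken from the "folds 0-4" wording in this README.

```python
import shlex


def parse_env_text(text):
    """Parse KEY=VALUE lines, honouring shell-style quotes and '#' comments."""
    settings = {}
    for line in text.splitlines():
        tokens = shlex.split(line, comments=True)
        if not tokens:
            continue  # blank line or comment-only line
        key, sep, value = tokens[0].partition("=")
        if sep:
            settings[key] = value
    return settings


def select_folds(settings, n_folds=5):
    """Return one fold when USE_SINGLE_FOLD is true, otherwise every fold."""
    use_single = settings.get("USE_SINGLE_FOLD", "false").lower() == "true"
    if use_single:
        return [int(settings.get("SINGLE_FOLD_ID", "0"))]
    return list(range(n_folds))


EXAMPLE = """
FOLD_ID=0
OUTPUT_PREFIX="covfit_fold_"
USE_SINGLE_FOLD=false  # multi-fold averaging
SINGLE_FOLD_ID=0
"""

settings = parse_env_text(EXAMPLE)
folds = select_folds(settings)
print(settings["OUTPUT_PREFIX"], folds)
```

How run_covfit.sh itself reads the file is not shown here; the sketch only illustrates the meaning of the two inference-mode settings.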
Build Docker image (for local/server usage):
docker build -t covfit:latest .
CoVFit provides multiple ways to use the models:
The easiest way to get started is using Google Colab:
1. Open the notebook in Google Colab:
   - Go to Google Colab
   - Select "GitHub" tab and enter: TheSatoLab/CoVFit_module
   - Open infer.ipynb
2. The notebook automatically:
   - Detects Google Colab environment
   - Installs required packages
   - Mounts your Google Drive
   - Clones/updates the repository to your Drive
   - Downloads pre-trained models from Hugging Face (no token required!)
   - Runs inference on test data
3. Results are saved to: /content/drive/MyDrive/inference_results/
   - Files include: predictions, summaries, and visualizations
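The environment-detection step can be sketched in a few lines. This is an assumption about one common way to do it (checking whether the `google.colab` module is loaded), not necessarily what infer.ipynb does. The function names are hypothetical; the two result paths are the ones mentioned in this README.

```python
import sys
from pathlib import Path


def running_in_colab():
    """True when the google.colab module has been loaded, as it is in Colab kernels."""
    return "google.colab" in sys.modules


def default_results_dir(in_colab):
    """Pick the Drive-backed folder on Colab and a local folder elsewhere."""
    if in_colab:
        return Path("/content/drive/MyDrive/inference_results")
    return Path("./inference_results")


results_dir = default_results_dir(running_in_colab())
print(f"results will be written to {results_dir}")
```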
Benefits of Google Colab:
- No local installation required
- Free GPU access
- Easy sharing and collaboration
- Results saved to Google Drive
For local or server environments, use the run_covfit.sh script:
# Train with default config (uses FOLD_ID from user_config.env)
./run_covfit.sh train
# Train specific fold with command line argument
./run_covfit.sh train user_config.env 2
# Train with custom config file
./run_covfit.sh train my_config.env
# Train all folds (0-4) automatically
./run_covfit.sh train-all-folds
# With custom config
./run_covfit.sh train-all-folds my_config.env
# Complete pipeline: train then run inference automatically
./run_covfit.sh train-infer
# With specific fold
./run_covfit.sh train-infer user_config.env 3
# Run inference with config settings
./run_covfit.sh infer
# With custom config
./run_covfit.sh infer my_config.env
Output Files:
- infer_executed.ipynb: Executed notebook with all cell outputs
- logs/inference_YYYYMMDD_HHMMSS.log: Complete execution log
- inference_results/: Prediction results and analysis files
Note: The inference script includes:
- Extended timeout (10 hours) for long-running computations
- Automatic logging to timestamped log files
- Executed notebook saved for later review
- Automatic model download from Hugging Face (no HF_TOKEN required!)
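For readers who want to reproduce this behaviour outside the script, the sketch below builds an equivalent headless notebook run with `jupyter nbconvert`. It is an assumption that run_covfit.sh works along these lines; the helper names are hypothetical. The 10-hour timeout, the output notebook name, and the log-file naming pattern come from this README.

```python
import subprocess
from datetime import datetime
from pathlib import Path

TIMEOUT_SECONDS = 10 * 60 * 60  # the 10-hour limit mentioned above


def build_inference_command(notebook="infer.ipynb", output="infer_executed.ipynb"):
    """Command line that executes a notebook and saves the executed copy."""
    return [
        "jupyter", "nbconvert", "--to", "notebook", "--execute", notebook,
        "--output", output,
        f"--ExecutePreprocessor.timeout={TIMEOUT_SECONDS}",
    ]


def log_path_for(now, log_dir="logs"):
    """Timestamped log path in the logs/inference_YYYYMMDD_HHMMSS.log pattern."""
    return Path(log_dir) / f"inference_{now:%Y%m%d_%H%M%S}.log"


def run_inference():
    """Run the notebook, sending all output to a fresh log file."""
    log_path = log_path_for(datetime.now())
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with open(log_path, "w") as log_file:
        subprocess.run(build_inference_command(), stdout=log_file,
                       stderr=subprocess.STDOUT, check=True)


command = build_inference_command()
print(" ".join(command))
```

`run_inference()` is defined but not called here, since it needs Jupyter and the repository checkout to be present.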
Using VSCode Dev Containers (For Local/Server Development)
This method provides the best development experience with full IDE integration:
1. Prerequisites:
   - Install VSCode with Remote-SSH extension
   - Connect to your remote server via SSH (if using remote server)
   - Install "Dev Containers" extension in VSCode
2. Open in Dev Container:
   # In VSCode:
   # Press Ctrl+Shift+P (Cmd+Shift+P on Mac)
   # Type: "Dev Containers: Reopen in Container"
   # Select it and wait for container to build
3. Run Jupyter Notebook:
   - Open infer.ipynb in VSCode
   - Click "Select Kernel" in the top right
   - Choose /opt/conda/bin/python
   - Run cells with Shift + Enter
   - Models are automatically downloaded from Hugging Face (no token required!)
Benefits:
- Direct access to GPU resources
- Full IDE features (autocomplete, debugging, git integration)
- Automatic environment setup
- No need for SSH port forwarding
- Seamless file editing and notebook execution
Key capabilities in infer.ipynb:
- Automatic environment detection (Colab vs local)
- Load pre-trained models (automatically downloaded if needed)
- Make predictions on test data or custom sequences
- Perform evolutionary fitness landscape analysis
- Generate publication-ready visualizations
- Real-time progress monitoring with cell outputs
- Support for both single-fold and multi-fold averaging inference modes
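Multi-fold averaging can be pictured as follows: each fold's model scores the same sequences, and the final prediction is the per-sequence mean across folds, with the spread across folds serving as a rough uncertainty indicator. This is a minimal sketch under that assumption. The function name and the numbers are hypothetical, and the notebook may aggregate differently.

```python
from statistics import mean, pstdev


def average_across_folds(fold_predictions):
    """fold_predictions[f][i] is fold f's prediction for sequence i.

    Returns (means, spreads), each with one value per sequence.
    """
    n_sequences = len(fold_predictions[0])
    if any(len(preds) != n_sequences for preds in fold_predictions):
        raise ValueError("every fold must score the same sequences")
    per_sequence = list(zip(*fold_predictions))  # regroup by sequence
    means = [mean(values) for values in per_sequence]
    spreads = [pstdev(values) for values in per_sequence]
    return means, spreads


# Two hypothetical folds scoring two sequences.
means, spreads = average_across_folds([[1.0, 2.0], [3.0, 2.0]])
print(means, spreads)
```

Single-fold mode corresponds to passing a single inner list, in which case the spread is zero for every sequence.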
from src.models.esm_regression import load_model_for_inference
from transformers import AutoTokenizer
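# Load model and tokenizer
# Note: model_config, n_targets, and predict_fitness are not defined in this
# snippet; it assumes they are already available in your session.
model = load_model_for_inference("results/covfit_model.ckpt", model_config, n_targets)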
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
# Predict fitness for your sequences
your_sequences = [
"MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRF"
]
predictions = predict_fitness(your_sequences, model, tokenizer)
CoVFit
├── config/
│   └── config.py                 # Configuration management
├── src/
│   ├── data/
│   │   ├── dataset.py            # PyTorch datasets
│   │   ├── preprocessing.py      # Data preprocessing
│   │   └── preprocessing_lite.py # Memory-efficient preprocessing
│   ├── models/
│   │   └── esm_regression.py     # ESM regression model with LoRA
│   └── utils/
│       ├── io_utils.py           # File I/O utilities
│       ├── train_utils.py        # Training utilities
│       └── simple_split.py       # Data splitting utilities
├── data/
│   └── raw/                      # Training and test data (included in repository)
│       ├── metadata.representative.all_countries.with_date.v2.with_seq_231102_wo_variants_before_cutoff.txt
│       ├── escape_data_mutation.csv
│       └── nextclade.peptide.S_rename.fasta
├── infer.ipynb                   # ⭐ Interactive analysis notebook (Colab-ready!)
├── run_covfit.sh                 # ⭐ Main execution script (train/infer)
├── run_jupyter.sh                # ⭐ JupyterLab launcher for remote access
├── train.py                      # Training script (used by run_covfit.sh)
├── user_config.env               # Configuration file
├── logs/                         # Execution logs (auto-generated)
└── README.md                     # This file
- Evaluate fitness effects of emerging mutations
- Compare variant competitiveness across different backgrounds
- Predict immune escape potential
- Identify positions likely to undergo adaptive mutations
- Predict fitness landscapes around current variants
- Guide surveillance priorities
- Design improved vaccine antigens
- Optimize therapeutic proteins
- Understand structure-function relationships
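Several of the applications above (finding positions likely to mutate, mapping the landscape around a variant) start from the same step: enumerating every single amino-acid substitution at chosen positions so that each candidate can be scored. The sketch below shows only that enumeration. It assumes 1-indexed positions and the 20 standard amino acids, and `single_mutants` is a hypothetical helper, not a function from this repository.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def single_mutants(base_sequence, positions):
    """Yield (label, mutated_sequence) for every substitution at the given
    1-indexed positions, using labels such as 'M1A' (wild type, position, mutant)."""
    for pos in positions:
        if not 1 <= pos <= len(base_sequence):
            raise ValueError(f"position {pos} is outside the sequence")
        wild_type = base_sequence[pos - 1]
        for aa in AMINO_ACIDS:
            if aa == wild_type:
                continue  # skip the unchanged residue
            mutated = base_sequence[:pos - 1] + aa + base_sequence[pos:]
            yield f"{wild_type}{pos}{aa}", mutated


# Short illustrative fragment; a real run would use a full spike sequence.
mutants = list(single_mutants("MFVFLVLLPL", [1, 5]))
print(len(mutants), mutants[0])
```

Each mutated sequence could then be passed to the prediction function, and the results ranked to highlight the substitutions with the largest predicted effect.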
The model achieves strong performance across multiple tasks:
- Fitness prediction: Pearson correlation >0.7 across countries
- Antibody escape: High accuracy for DMS experimental data
- Cross-validation: Robust performance across 5-fold CV
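For reference, the Pearson correlation quoted above measures how closely predicted values track observed values on a linear scale, from -1 to 1. The following is a minimal sketch of the calculation; the sample numbers are made up for illustration and are not CoVFit results.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / sqrt(var_x * var_y)


# Hypothetical observed and predicted fitness values.
observed = [1.0, 2.0, 3.0]
predicted = [2.0, 4.0, 6.0]
r = pearson(observed, predicted)
print(f"Pearson r = {r:.3f}")
```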
The run_covfit.sh script provides comprehensive options:
# Get help
./run_covfit.sh --help
# Available commands
Usage: ./run_covfit.sh [train|train-all-folds|train-infer|infer] [config_file] [fold_id]
Commands:
train Run model training for single fold
train-all-folds Run model training for all folds (0-4)
train-infer Run training then inference automatically
infer Run model inference
Arguments:
config_file Configuration file (default: user_config.env)
fold_id Fold ID for training (0-4, overrides FOLD_ID from config)
Create different configuration files for different experiments:
# Experiment 1 config
cp user_config.env experiment1.env
# Edit experiment1.env with specific settings
# Experiment 2 config
cp user_config.env experiment2.env
# Edit experiment2.env with different settings
# Run experiments
./run_covfit.sh train experiment1.env 0
./run_covfit.sh train experiment2.env 0
# Train specific folds in sequence
for fold in 0 1 2; do
./run_covfit.sh train user_config.env $fold
done
# Run complete pipeline for multiple experiments
for config in experiment1.env experiment2.env; do
./run_covfit.sh train-infer $config
done
# Analyze multiple sequences
import pandas as pd
sequences = [seq1, seq2, seq3, ...]  # replace with your own sequences
predictions = predict_fitness(sequences, model, tokenizer)
# Save results (mean and spread across the prediction columns)
results_df = pd.DataFrame({
    'sequence': sequences,
    'fitness': predictions.mean(axis=1),
    'fitness_std': predictions.std(axis=1)
})
results_df.to_csv('batch_predictions.csv', index=False)
# Analyze fitness landscape around a sequence
base_sequence = "YOUR_PROTEIN_SEQUENCE"
positions_of_interest = [145, 484, 501, 614]
landscape_results = analyze_fitness_landscape(
base_sequence, model, tokenizer, positions_of_interest
)
# Find most beneficial mutations
top_mutations = landscape_results.nlargest(10, 'fitness_mean')
The run_covfit.sh script automatically handles Docker execution. It provides:
- Automatic GPU detection: Enables --gpus all when NVIDIA GPU is available
- Volume mounting: Mounts current directory as /workspace
- Environment variables: Passes configuration variables to Docker container
- Result handling: Automatically manages inference output directories
- Model auto-download: Pre-trained models are automatically downloaded from Hugging Face (no token required)
For manual Docker control, see the Advanced Usage section above.
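To make the Docker handling concrete, the sketch below assembles a `docker run` command with the behaviours listed above: `--gpus all` only when a GPU is present, the current directory mounted at /workspace, and configuration passed as environment variables. It is an illustration, not the script's actual logic. The helper names are hypothetical, detecting a GPU by looking for `nvidia-smi` is an assumption, and `python train.py` stands in for whatever command the script really runs inside the container.

```python
import shutil
from pathlib import Path


def nvidia_gpu_available():
    """Assume an NVIDIA GPU is usable when nvidia-smi is on the PATH."""
    return shutil.which("nvidia-smi") is not None


def build_docker_command(image, container_args, env, workdir, has_gpu):
    """Assemble a docker run command as a list of arguments."""
    command = ["docker", "run", "--rm"]
    if has_gpu:
        command += ["--gpus", "all"]
    command += ["-v", f"{workdir}:/workspace", "-w", "/workspace"]
    for key, value in sorted(env.items()):
        command += ["-e", f"{key}={value}"]  # forward configuration values
    command += [image, *container_args]
    return command


command = build_docker_command(
    image="covfit:latest",
    container_args=["python", "train.py"],  # hypothetical container command
    env={"FOLD_ID": "0", "OUTPUT_PREFIX": "covfit_fold_"},
    workdir=Path.cwd(),
    has_gpu=nvidia_gpu_available(),
)
print(" ".join(str(part) for part in command))
```

Building the command as a list rather than a single string avoids shell-quoting problems if a path or value contains spaces.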
All training and test data are included in this repository under the data/raw/ directory:
- Fitness data: Variant frequency data from multiple countries
- Antibody escape data: Deep mutational scanning (DMS) experimental results
- Sequence data: SARS-CoV-2 spike protein sequences in FASTA format
This means you can:
- Train models without additional data downloads
- Reproduce published results exactly
- Experiment with the full dataset immediately after cloning
Pre-trained models are hosted on Hugging Face Hub at TheSatoLab-UTokyo/CoVFit and are automatically downloaded when needed (no authentication required).
- Python 3.8+
- PyTorch 1.12+
- CUDA-capable GPU (recommended) or CPU
- 8GB+ RAM (16GB+ recommended)
torch >= 1.12.0
transformers >= 4.20.0
pandas >= 1.3.0
numpy >= 1.21.0
scikit-learn >= 1.0.0
biopython >= 1.79
scipy >= 1.7.0
peft >= 0.3.0
matplotlib >= 3.5.0
seaborn >= 0.11.0
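Before running anything locally, it can help to confirm that the installed packages meet these minimums. The sketch below is one way to check using only the standard library. The helper names are hypothetical, and the version comparison is deliberately simple (numeric major.minor.patch only), so treat its output as a hint and not as a full dependency resolver.

```python
import re
from importlib import metadata

MINIMUM_VERSIONS = {
    "torch": "1.12.0", "transformers": "4.20.0", "pandas": "1.3.0",
    "numpy": "1.21.0", "scikit-learn": "1.0.0", "biopython": "1.79",
    "scipy": "1.7.0", "peft": "0.3.0", "matplotlib": "3.5.0",
    "seaborn": "0.11.0",
}


def version_tuple(version):
    """Turn '2.1.0+cu118' into (2, 1, 0); missing parts count as zero."""
    numbers = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        numbers.append(int(match.group()))
    return tuple((numbers + [0, 0, 0])[:3])


def check_requirements(minimums):
    """Return {package: problem} for missing or outdated packages."""
    problems = {}
    for package, minimum in minimums.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            problems[package] = "not installed"
            continue
        if version_tuple(installed) < version_tuple(minimum):
            problems[package] = f"{installed} < {minimum}"
    return problems


for package, problem in check_requirements(MINIMUM_VERSIONS).items():
    print(f"{package}: {problem}")
```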
This project is licensed under the MIT License - see the LICENSE file for details.
- ESM Models: Meta AI for the ESM protein language models
- LoRA: Microsoft for the LoRA adaptation technique
- HuggingFace: For the transformers library and training infrastructure
- Original CoVFit: Based on the original CoVFit implementation
A Protein Language Model for Exploring Viral Fitness Landscapes. Jumpei Ito, Adam Strange, Wei Liu, Gustav Joas, Spyros Lytras, The Genotype to Phenotype Japan (G2P-Japan) Consortium, Kei Sato. 2024. bioRxiv https://doi.org/10.1101/2024.03.15.584819
jampei@g.ecc.u-tokyo.ac.jp (Jumpei Ito)