TheSatoLab/CoVFit_module

CoVFit: COVID-19 Variant Fitness Prediction

A refactored, user-friendly implementation of CoVFit, a multitask deep learning system that predicts SARS-CoV-2 variant fitness and antibody escape using ESM protein language models.

🚀 What's New

  • Google Colab Support: Run inference directly in Google Colab with automatic setup!
  • No Authentication Required: Pre-trained models are public - no HuggingFace token needed
  • Data Included: All training and test data included in the repository
  • Multiple Inference Modes: Choose between single-fold or multi-fold averaging
  • Auto-download Models: Models automatically downloaded from Hugging Face when needed

Overview

CoVFit predicts two key properties of SARS-CoV-2 variants:

  • Viral fitness: How well variants replicate and spread
  • Antibody escape: How well variants evade immune responses

The system uses ESM (Evolutionary Scale Modeling) protein language models with LoRA (Low-Rank Adaptation) for efficient fine-tuning on multitask regression objectives.
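To make the efficiency argument concrete, the sketch below counts the extra parameters a rank-r LoRA adapter adds to a single weight matrix. The function names and the dimensions are illustrative assumptions, not values taken from the CoVFit code.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one (d_out x d_in) weight matrix.

    LoRA keeps the original matrix W frozen and learns a low-rank update
    B @ A, where A has shape (rank, d_in) and B has shape (d_out, rank).
    """
    return rank * d_in + d_out * rank


def full_param_count(d_in: int, d_out: int) -> int:
    """Parameters in the frozen (d_out x d_in) matrix itself."""
    return d_in * d_out


# Illustrative dimensions only (hypothetical, not CoVFit's actual settings).
d_in, d_out, rank = 1280, 1280, 8
adapter = lora_param_count(d_in, d_out, rank)
frozen = full_param_count(d_in, d_out)
print(f"adapter: {adapter:,} params, frozen matrix: {frozen:,} params")
print(f"adapter / frozen = {adapter / frozen:.2%}")
```

Adapters are usually attached to only some of a model's matrices, so the trainable fraction of the whole model is lower than this per-matrix ratio.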

Key Features

  • Multitask Learning: Simultaneous prediction of fitness and antibody escape
  • ESM-based Architecture: Leverages Meta's large-scale protein language models
  • LoRA Fine-tuning: Memory-efficient adaptation with <1% trainable parameters
  • Weighted Loss Functions: Handles data imbalance and temporal weighting
  • Interactive Analysis: Jupyter notebook interface for easy exploration
  • Modular Design: Clean, maintainable, and extensible codebase
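As a rough illustration of the weighted-loss idea, here is a minimal weighted mean squared error in plain Python. The exact weighting scheme used by CoVFit is not described in this README, so the recency weighting below (and both function names) are hypothetical.

```python
from typing import List, Sequence


def weighted_mse(preds: Sequence[float], targets: Sequence[float],
                 weights: Sequence[float]) -> float:
    """Mean squared error where each sample contributes according to its weight."""
    if not (len(preds) == len(targets) == len(weights)):
        raise ValueError("preds, targets, and weights must have equal length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    squared = (w * (p - t) ** 2 for p, t, w in zip(preds, targets, weights))
    return sum(squared) / total_weight


def recency_weights(ages_in_days: Sequence[float],
                    half_life_days: float) -> List[float]:
    """Hypothetical temporal weighting: a sample's weight halves every half_life_days."""
    return [0.5 ** (age / half_life_days) for age in ages_in_days]


# Newer observations (age 0) count twice as much as 30-day-old ones here.
weights = recency_weights([0, 30], half_life_days=30)
print(weighted_mse([1.0, 2.0], [0.0, 0.0], weights))
```

The same function can address class or country imbalance by passing weights inversely proportional to group size instead of recency weights.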

Quick Start

Three Ways to Get Started

1. Google Colab (Recommended for beginners)

  • No installation required
  • Free GPU access
  • Just open infer.ipynb in Colab!
  • See Option A below

2. Local Python Environment

# Clone the repository
git clone https://github.com/TheSatoLab/CoVFit_module.git
cd CoVFit_module

# Install dependencies
pip3 install -r requirements.txt

# Run inference notebook
jupyter notebook infer.ipynb

3. Docker (For reproducible environments)

# Clone and build
git clone https://github.com/TheSatoLab/CoVFit_module.git
cd CoVFit_module
docker build -t covfit:latest .

# Run inference (use "train" instead of "infer" to train a model)
./run_covfit.sh infer

Setup (Optional - for Docker usage)

For Docker-based workflows, create a configuration file user_config.env:

# Required settings for Docker
FOLD_ID=0
OUTPUT_PREFIX="covfit_fold_"
DOCKER_IMAGE="covfit:latest"

# Optional inference settings
RESULTS_DIR="./inference_results"
MODEL_CHECKPOINT_PATH="./covfit_fold_0_model.ckpt"
TASK_DICT_PATH="./covfit_fold_0_model_task_id_dict.pt"

# Inference mode settings
USE_SINGLE_FOLD=false  # Set to true to use single fold, false for multi-fold averaging
SINGLE_FOLD_ID=0       # Which fold to use when USE_SINGLE_FOLD=true

# HuggingFace token - NOT REQUIRED!
# The pre-trained models are public and don't need authentication
# HF_TOKEN="your_hf_token_here"
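If you want to read the same settings from Python, the sketch below parses shell-style KEY=VALUE lines like the ones above. It assumes the file contains only simple assignments with optional quotes and "#" comments; how run_covfit.sh itself loads the file is not shown in this README, and parse_env_text and as_bool are hypothetical helper names.

```python
import shlex
from typing import Dict


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, dropping '#' comments and surrounding quotes."""
    settings: Dict[str, str] = {}
    for line in text.splitlines():
        # shlex handles quoting and, with comments=True, strips "# ..." text.
        for token in shlex.split(line, comments=True):
            key, sep, value = token.partition("=")
            if sep and key:
                settings[key] = value
    return settings


def as_bool(value: str) -> bool:
    """Interpret common truthy spellings; everything else is False."""
    return value.strip().lower() in {"1", "true", "yes"}


example = '''
FOLD_ID=0
OUTPUT_PREFIX="covfit_fold_"
USE_SINGLE_FOLD=false  # Set to true to use single fold
# HF_TOKEN="your_hf_token_here"
'''
config = parse_env_text(example)
print(config)
print("single fold?", as_bool(config["USE_SINGLE_FOLD"]))
```

All values come back as strings, so numeric settings such as FOLD_ID need an explicit int() conversion.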

Important Notes:

  • HF_TOKEN is NOT required: Pre-trained models are publicly available on Hugging Face without authentication
  • Data is included: Training and test data are in this repository under data/raw/
  • Google Colab users: No configuration file needed! Everything is set up automatically.

Build Docker image (for local/server usage):

docker build -t covfit:latest .

Basic Usage

CoVFit provides multiple ways to use the models:

Option A: Google Colab (Easiest - No Setup Required)

The easiest way to get started is using Google Colab:

  1. Open the notebook in Google Colab:

    • Go to Google Colab
    • Select "GitHub" tab and enter: TheSatoLab/CoVFit_module
    • Open infer.ipynb
  2. The notebook automatically:

    • Detects Google Colab environment
    • Installs required packages
    • Mounts your Google Drive
    • Clones/updates the repository to your Drive
    • Downloads pre-trained models from Hugging Face (no token required!)
    • Runs inference on test data
  3. Results are saved to:

    • /content/drive/MyDrive/inference_results/
    • Files include: predictions, summaries, and visualizations
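The environment detection step can be sketched as follows. A common approach is to try importing the google.colab package, which is normally only available inside Colab runtimes; whether infer.ipynb uses this exact check is an assumption, and both function names are hypothetical.

```python
from pathlib import Path


def running_in_colab() -> bool:
    """Return True when the google.colab package can be imported."""
    try:
        import google.colab  # noqa: F401  (only present in Colab runtimes)
    except ImportError:
        return False
    return True


def default_results_dir(in_colab: bool) -> Path:
    """Pick the results directory named in this README for each environment."""
    if in_colab:
        return Path("/content/drive/MyDrive/inference_results")
    return Path("./inference_results")


print("Colab detected:", running_in_colab())
print("Results directory:", default_results_dir(running_in_colab()))
```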

Benefits of Google Colab:

  • No local installation required
  • Free GPU access
  • Easy sharing and collaboration
  • Results saved to Google Drive

Option B: Local/Server Usage with Docker

For local or server environments, use the run_covfit.sh script:

1. Training a Single Fold

# Train with default config (uses FOLD_ID from user_config.env)
./run_covfit.sh train

# Train specific fold with command line argument
./run_covfit.sh train user_config.env 2

# Train with custom config file
./run_covfit.sh train my_config.env

2. Training All Folds

# Train all folds (0-4) automatically
./run_covfit.sh train-all-folds

# With custom config
./run_covfit.sh train-all-folds my_config.env

3. Train + Inference Pipeline

# Complete pipeline: train then run inference automatically
./run_covfit.sh train-infer

# With specific fold
./run_covfit.sh train-infer user_config.env 3

4. Inference Only

# Run inference with config settings
./run_covfit.sh infer

# With custom config
./run_covfit.sh infer my_config.env

Output Files:

  • infer_executed.ipynb: Executed notebook with all cell outputs
  • logs/inference_YYYYMMDD_HHMMSS.log: Complete execution log
  • inference_results/: Prediction results and analysis files
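To locate or generate log files that follow the inference_YYYYMMDD_HHMMSS.log pattern, a timestamp can be formatted as in this sketch. The helper name inference_log_path is hypothetical; only the filename pattern comes from the list above.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def inference_log_path(now: Optional[datetime] = None,
                       log_dir: str = "logs") -> Path:
    """Build logs/inference_YYYYMMDD_HHMMSS.log for the given (or current) time."""
    now = now or datetime.now()
    return Path(log_dir) / f"inference_{now:%Y%m%d_%H%M%S}.log"


print(inference_log_path())
```

Because the timestamp is zero-padded and ordered from year to second, sorting the filenames alphabetically also sorts the logs chronologically.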

Note: The inference script includes:

  • Extended timeout (10 hours) for long-running computations
  • Automatic logging to timestamped log files
  • Executed notebook saved for later review
  • Automatic model download from Hugging Face (no HF_TOKEN required!)

Option C: Interactive Analysis with Jupyter Notebook

Using VSCode Dev Containers (For Local/Server Development)

This method provides the best development experience with full IDE integration:

  1. Prerequisites:

    • Install VSCode with Remote-SSH extension
    • Connect to your remote server via SSH (if using remote server)
    • Install "Dev Containers" extension in VSCode
  2. Open in Dev Container:

    # In VSCode:
    # Press Ctrl+Shift+P (Cmd+Shift+P on Mac)
    # Type: "Dev Containers: Reopen in Container"
    # Select it and wait for container to build
  3. Run Jupyter Notebook:

    • Open infer.ipynb in VSCode
    • Click "Select Kernel" in the top right
    • Choose /opt/conda/bin/python
    • Run cells with Shift + Enter
    • Models are automatically downloaded from Hugging Face (no token required!)

Benefits:

  • Direct access to GPU resources
  • Full IDE features (autocomplete, debugging, git integration)
  • Automatic environment setup
  • No need for SSH port forwarding
  • Seamless file editing and notebook execution

Key capabilities in infer.ipynb:

  • Automatic environment detection (Colab vs local)
  • Load pre-trained models (automatically downloaded if needed)
  • Make predictions on test data or custom sequences
  • Perform evolutionary fitness landscape analysis
  • Generate publication-ready visualizations
  • Real-time progress monitoring with cell outputs
  • Support for both single-fold and multi-fold averaging inference modes
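The difference between the two inference modes can be illustrated with a small sketch: either return one fold's predictions, or average the predictions of all folds per sequence. The function and argument names are hypothetical (they mirror USE_SINGLE_FOLD and SINGLE_FOLD_ID from the configuration file), and the notebook's actual aggregation may differ.

```python
from statistics import fmean
from typing import Dict, List


def combine_fold_predictions(per_fold: Dict[int, List[float]],
                             use_single_fold: bool = False,
                             single_fold_id: int = 0) -> List[float]:
    """Return one fold's predictions, or the per-sequence mean over all folds."""
    if use_single_fold:
        return list(per_fold[single_fold_id])
    fold_ids = sorted(per_fold)
    lengths = {len(per_fold[fold]) for fold in fold_ids}
    if len(lengths) != 1:
        raise ValueError("every fold must predict the same number of sequences")
    # zip(*...) groups the i-th prediction of every fold together.
    return [fmean(values) for values in zip(*(per_fold[f] for f in fold_ids))]


# Two folds, two sequences each (made-up numbers for illustration).
per_fold = {0: [1.0, 2.0], 1: [3.0, 4.0]}
print(combine_fold_predictions(per_fold))
print(combine_fold_predictions(per_fold, use_single_fold=True, single_fold_id=1))
```

Averaging over folds costs one forward pass per fold but typically reduces the variance of the final prediction, which is why single-fold mode is mainly useful for quick checks.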

Custom Sequence Prediction

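from src.models.esm_regression import load_model_for_inference
from transformers import AutoTokenizer

# Load model and tokenizer.
# model_config and n_targets are not defined in this snippet; they must match
# the configuration and number of prediction targets used to train the checkpoint.
model = load_model_for_inference("results/covfit_model.ckpt", model_config, n_targets)
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")

# Sequences to score (one amino-acid string per entry)
your_sequences = [
    "MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRF"
]

# predict_fitness is not imported above; it is assumed to already be defined
# in your session.
predictions = predict_fitness(your_sequences, model, tokenizer)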

Project Structure

CoVFit
├── config/
│   └── config.py              # Configuration management
├── src/
│   ├── data/
│   │   ├── dataset.py         # PyTorch datasets
│   │   ├── preprocessing.py   # Data preprocessing
│   │   └── preprocessing_lite.py # Memory-efficient preprocessing
│   ├── models/
│   │   └── esm_regression.py  # ESM regression model with LoRA
│   └── utils/
│       ├── io_utils.py        # File I/O utilities
│       ├── train_utils.py     # Training utilities
│       └── simple_split.py    # Data splitting utilities
├── data/
│   └── raw/                   # Training and test data (included in repository)
│       ├── metadata.representative.all_countries.with_date.v2.with_seq_231102_wo_variants_before_cutoff.txt
│       ├── escape_data_mutation.csv
│       └── nextclade.peptide.S_rename.fasta
├── infer.ipynb               # ⭐ Interactive analysis notebook (Colab-ready!)
├── run_covfit.sh             # ⭐ Main execution script (train/infer)
├── run_jupyter.sh            # ⭐ JupyterLab launcher for remote access
├── train.py                  # Training script (used by run_covfit.sh)
├── user_config.env           # Configuration file
├── logs/                     # Execution logs (auto-generated)
└── README.md                 # This file

Research Applications

SARS-CoV-2 Variant Analysis

  • Evaluate fitness effects of emerging mutations
  • Compare variant competitiveness across different backgrounds
  • Predict immune escape potential

Evolutionary Forecasting

  • Identify positions likely to undergo adaptive mutations
  • Predict fitness landscapes around current variants
  • Guide surveillance priorities

Protein Engineering

  • Design improved vaccine antigens
  • Optimize therapeutic proteins
  • Understand structure-function relationships

Model Performance

The model achieves strong performance across multiple tasks:

  • Fitness prediction: Pearson correlation >0.7 across countries
  • Antibody escape: High accuracy for DMS experimental data
  • Cross-validation: Robust performance across 5-fold CV
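For readers who want to check the correlation metric on their own predictions, a minimal Pearson correlation in plain Python looks like this. The function name pearson_r is hypothetical, and the numbers in the example are made up rather than CoVFit results.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up observed vs. predicted fitness values.
observed = [0.9, 1.1, 1.4, 1.2]
predicted = [1.0, 1.0, 1.5, 1.3]
print(f"Pearson r = {pearson_r(observed, predicted):.3f}")
```

In practice scipy.stats.pearsonr (scipy is already a dependency) gives the same coefficient together with a p-value.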

Advanced Usage

Command Line Options

The run_covfit.sh script provides comprehensive options:

# Get help
./run_covfit.sh --help

# Available commands
Usage: ./run_covfit.sh [train|train-all-folds|train-infer|infer] [config_file] [fold_id]

Commands:
  train             Run model training for single fold
  train-all-folds   Run model training for all folds (0-4)
  train-infer       Run training then inference automatically
  infer             Run model inference

Arguments:
  config_file    Configuration file (default: user_config.env)
  fold_id        Fold ID for training (0-4, overrides FOLD_ID from config)
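When driving the script from Python, it can help to validate arguments before launching it. The sketch below builds the argument list from the usage line above; the helper name build_covfit_command is hypothetical, and the checks only reflect what the help text states (four commands, fold IDs 0-4).

```python
from typing import List, Optional

COMMANDS = ("train", "train-all-folds", "train-infer", "infer")


def build_covfit_command(command: str,
                         config_file: str = "user_config.env",
                         fold_id: Optional[int] = None) -> List[str]:
    """Return the argv list for ./run_covfit.sh after basic validation."""
    if command not in COMMANDS:
        raise ValueError(f"unknown command {command!r}; expected one of {COMMANDS}")
    argv = ["./run_covfit.sh", command, config_file]
    if fold_id is not None:
        if not 0 <= fold_id <= 4:
            raise ValueError("fold_id must be between 0 and 4")
        argv.append(str(fold_id))
    return argv


print(build_covfit_command("train", fold_id=2))
print(build_covfit_command("infer", "my_config.env"))
```

The returned list can be passed to subprocess.run(argv, check=True) from the repository root to execute the script.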

Configuration Management

Create different configuration files for different experiments:

# Experiment 1 config
cp user_config.env experiment1.env
# Edit experiment1.env with specific settings

# Experiment 2 config
cp user_config.env experiment2.env
# Edit experiment2.env with different settings

# Run experiments
./run_covfit.sh train experiment1.env 0
./run_covfit.sh train experiment2.env 0

Batch Processing

# Train specific folds in sequence
for fold in 0 1 2; do
    ./run_covfit.sh train user_config.env $fold
done

# Run complete pipeline for multiple experiments
for config in experiment1.env experiment2.env; do
    ./run_covfit.sh train-infer $config
done

Batch Prediction

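import pandas as pd

# Analyze multiple sequences
sequences = [seq1, seq2, seq3]  # placeholders: replace with your own amino-acid strings
predictions = predict_fitness(sequences, model, tokenizer)

# Save results. This assumes predictions is a 2-D NumPy array with one row per
# sequence, so axis=1 aggregates across each sequence's predicted values.
results_df = pd.DataFrame({
    'sequence': sequences,
    'fitness': predictions.mean(axis=1),
    'fitness_std': predictions.std(axis=1)
})
results_df.to_csv('batch_predictions.csv', index=False)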

Evolutionary Analysis

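# Analyze fitness landscape around a sequence.
# analyze_fitness_landscape is not imported in this snippet; it is assumed to
# already be defined in your session, along with model and tokenizer.
base_sequence = "YOUR_PROTEIN_SEQUENCE"
positions_of_interest = [145, 484, 501, 614]

landscape_results = analyze_fitness_landscape(
    base_sequence, model, tokenizer, positions_of_interest
)

# Find the most beneficial mutations (the 10 rows with the highest mean predicted fitness)
top_mutations = landscape_results.nlargest(10, 'fitness_mean')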
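A landscape analysis of this kind needs a set of candidate mutants to score. One plausible way to enumerate them is a single-site scan over the positions of interest, sketched below. The internals of analyze_fitness_landscape are not shown in this README, so the helper single_mutants and the assumption of 1-based positions are illustrative only.

```python
from typing import Iterable, Iterator, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def single_mutants(sequence: str,
                   positions: Iterable[int]) -> Iterator[Tuple[str, str]]:
    """Yield (label, mutated_sequence) for every single substitution.

    Positions are 1-based; labels use the usual form, e.g. 'N501Y'.
    """
    for pos in positions:
        if not 1 <= pos <= len(sequence):
            raise ValueError(f"position {pos} is outside the sequence")
        wild_type = sequence[pos - 1]
        for aa in AMINO_ACIDS:
            if aa == wild_type:
                continue  # skip the unchanged residue
            label = f"{wild_type}{pos}{aa}"
            yield label, sequence[:pos - 1] + aa + sequence[pos:]


# Toy example on a short peptide (not a real spike sequence).
for label, mutant in list(single_mutants("MFVFLV", [2]))[:3]:
    print(label, mutant)
```

Each position yields 19 mutants, so the number of sequences to score grows linearly with the number of positions scanned.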

Docker Features

The run_covfit.sh script automatically handles Docker execution. It provides:

  • Automatic GPU detection: Enables --gpus all when NVIDIA GPU is available
  • Volume mounting: Mounts current directory as /workspace
  • Environment variables: Passes configuration variables to Docker container
  • Result handling: Automatically manages inference output directories
  • Model auto-download: Pre-trained models are automatically downloaded from Hugging Face (no token required)

For manual Docker control, see the Advanced Usage section above.

Data Availability

All training and test data are included in this repository under the data/raw/ directory:

  • Fitness data: Variant frequency data from multiple countries
  • Antibody escape data: Deep mutational scanning (DMS) experimental results
  • Sequence data: SARS-CoV-2 spike protein sequences in FASTA format

This means you can:

  • Train models without additional data downloads
  • Reproduce published results exactly
  • Experiment with the full dataset immediately after cloning
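To inspect the included sequence data without extra tooling, a FASTA file can be read with a few lines of plain Python. This is a minimal sketch with a hypothetical helper name (read_fasta); it keeps only the first word of each header as the record name, which is an assumption about how you want to key the sequences.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into {record_name: sequence}."""
    records: Dict[str, List[str]] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = []
        elif name is not None:
            # Sequence lines may be wrapped; collect and join them later.
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


# Toy input; pass an open file handle to read a real FASTA file.
example = [">seq1 example record", "MFVF", "LVLL", ">seq2", "ACD"]
print(read_fasta(example))
```

Any iterable of lines works, so an open handle on data/raw/nextclade.peptide.S_rename.fasta can be passed directly. Biopython's Bio.SeqIO.parse(handle, "fasta") is the more complete alternative, and biopython is already listed in the dependencies.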

Pre-trained models are hosted on Hugging Face Hub at TheSatoLab-UTokyo/CoVFit and are automatically downloaded when needed (no authentication required).

Requirements

System Requirements

  • Python 3.8+
  • PyTorch 1.12+
  • CUDA-capable GPU (recommended) or CPU
  • 8GB+ RAM (16GB+ recommended)

Python Dependencies

  • torch >= 1.12.0
  • transformers >= 4.20.0
  • pandas >= 1.3.0
  • numpy >= 1.21.0
  • scikit-learn >= 1.0.0
  • biopython >= 1.79
  • scipy >= 1.7.0
  • peft >= 0.3.0
  • matplotlib >= 3.5.0
  • seaborn >= 0.11.0
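Before running the notebook locally, the minimum versions above can be checked with the standard library. The sketch below compares only the leading numeric part of each version string, which is a simplification (pre-release tags are ignored); the helper names are hypothetical, and the packaging library offers a stricter comparison.

```python
import re
from importlib import metadata  # available from Python 3.8
from typing import Dict, Tuple

# A subset of the minimums listed above.
MINIMUM_VERSIONS = {"torch": "1.12.0", "transformers": "4.20.0", "peft": "0.3.0"}


def version_tuple(version: str, width: int = 3) -> Tuple[int, ...]:
    """Leading numeric components, zero-padded: '2.1.0+cu118' -> (2, 1, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"cannot parse version {version!r}")
    parts = [int(p) for p in match.group().split(".")]
    parts += [0] * (width - len(parts))
    return tuple(parts)


def check_requirements(minimums: Dict[str, str]) -> Dict[str, str]:
    """Return {package: problem} for missing or too-old packages."""
    problems: Dict[str, str] = {}
    for package, minimum in minimums.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            problems[package] = "not installed"
            continue
        if version_tuple(installed) < version_tuple(minimum):
            problems[package] = f"{installed} < {minimum}"
    return problems


print(check_requirements(MINIMUM_VERSIONS) or "all checked packages satisfy the minimums")
```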

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • ESM Models: Meta AI for the ESM protein language models
  • LoRA: Microsoft for the LoRA adaptation technique
  • HuggingFace: For the transformers library and training infrastructure
  • Original CoVFit: Based on the original CoVFit implementation

Citation

A Protein Language Model for Exploring Viral Fitness Landscapes. Jumpei Ito, Adam Strange, Wei Liu, Gustav Joas, Spyros Lytras, The Genotype to Phenotype Japan (G2P-Japan) Consortium, Kei Sato. 2024. bioRxiv https://doi.org/10.1101/2024.03.15.584819

Contact

jampei@g.ecc.u-tokyo.ac.jp (Jumpei Ito)
