Proteus is a sophisticated, Django-based web application that provides a semi-automated computational pipeline for synthetic protein design. It seamlessly integrates state-of-the-art tools for protein structure prediction, molecular dynamics simulation, and machine learning-based ranking into a unified, accessible platform.
This project aims to streamline the traditionally complex and fragmented workflow of computational protein design, empowering researchers and scientists to accelerate discovery in synthetic biology.
The primary goal of Proteus is to address the challenge of efficiently managing the multi-step computational workflow involved in designing and evaluating synthetic proteins. Traditional methods often require manual execution of disparate tools and cumbersome data transfer, hindering productivity and accessibility. Proteus solves this by:
- Integrating key computational biology tools into a cohesive pipeline.
- Automating job submission, execution, and data management.
- Providing an intuitive web-based interface for users.
Target Users: Researchers and professionals in synthetic biology, computational biology, bioinformatics, and related fields who require robust tools for designing, predicting, and validating protein structures.
Proteus stands out due to its comprehensive and integrated approach:
- 🧬 End-to-End Integrated Workflow: A complete pipeline from protein sequence submission to structure prediction (ColabFold), validation via molecular dynamics (GROMACS), and ML-based ranking.
- ⚙️ Automated Job Management: Leverages Celery and a persistent job queue (`core/models.py`) to manage the lifecycle of computationally intensive tasks (prediction, simulation, ranking) asynchronously, ensuring the web application remains responsive. See `core/tasks.py`.
- 📊 Integrated Data Management: Utilizes a PostgreSQL database to store user data, sequences, prediction results (PDB files, scores), simulation metrics (RMSD, Rg, energy), ML rankings, system logs, and performance metrics in a structured and interconnected manner.
- 🖥️ Web-Based Accessibility: A user-friendly interface built with Django (`core/templates/`) lowers the barrier to entry, making powerful computational tools accessible to a broader range of scientific users.
- 🐳 Reproducibility & Consistency: Employs Docker containers to run the core computational tools (ColabFold, GROMACS), ensuring reproducible results and simplifying dependency management across different environments. Execution logic is detailed in `core/tasks.py`.
- 🔬 Interactive 3D Visualization: Integrates Mol*Star for interactive visualization of predicted protein structures directly in the web browser.
- 🤖 ML-Based Ranking: Incorporates a machine learning component (the `MLRanking` model in `core/models.py`) to rank protein candidates based on a variety of predicted and simulated properties, aiding decision-making.
- 📈 System Monitoring: Collects and stores system metrics (CPU, memory, disk usage, active jobs) during computational tasks using `psutil` (see the `SystemMetric` model and its usage in `core/tasks.py`).
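The monitoring hook can be sketched as follows. `collect_system_metrics` and its field names are illustrative stand-ins, not the actual implementation in `core/tasks.py`:

```python
import psutil  # the same library Proteus uses for resource sampling

def collect_system_metrics(active_jobs: int = 0) -> dict:
    """Snapshot host resource usage, mirroring the fields a SystemMetric row might hold."""
    return {
        "cpu_percent": psutil.cpu_percent(interval=0.1),    # CPU utilization over a short window
        "memory_percent": psutil.virtual_memory().percent,  # RAM currently in use
        "disk_percent": psutil.disk_usage("/").percent,     # root filesystem usage
        "active_jobs": active_jobs,                         # supplied by the job queue
    }
```

A Celery task could call this periodically and persist the dict as a `SystemMetric` row.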
Proteus is built with a modern and robust technology stack:
- Backend: Python 3.12, Django 4.2.19
  - Main application logic resides in the `core` app.
- Database: PostgreSQL (interfaced via `psycopg2-binary==2.9.10`)
  - Schema defined in `core/models.py`.
  - Database configuration in `proteus/settings.py`.
- Task Queue: Celery 5.4.0 with RabbitMQ as the message broker (using `amqp==5.3.1`, `kombu==5.4.2`, `vine==5.1.0`, `billiard==4.2.1`)
  - Asynchronous tasks defined in `core/tasks.py`.
- Containerization: Docker
  - Used for running ColabFold and GROMACS, managed via `subprocess` calls in Celery tasks.
- Computational Tools:
  - Protein Structure Prediction: ColabFold (leveraging AlphaFold models)
  - Molecular Dynamics Simulation: GROMACS
- Frontend: HTML, CSS, JavaScript
  - Templates located in `core/templates/`.
- 3D Visualization: Mol*Star
- System Monitoring: `psutil` Python library
- Core Python Libraries: `numpy`, `pandas`, `scipy`, `biopython` for data handling and scientific computation.
- Web & API: `django-cors-headers==4.7.0` for Cross-Origin Resource Sharing.

A full list of Python dependencies is available in `requirements.txt`.
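Wiring Celery into a Django project of this shape usually takes only a few lines. The sketch below is the conventional pattern for a `proteus/celery.py`, not the project's verbatim file; broker settings would come from `proteus/settings.py`:

```python
# Hypothetical sketch of proteus/celery.py using the standard Django + Celery pattern.
import os
from celery import Celery

# Point Celery at the Django settings module before the app is created.
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "proteus.settings")

app = Celery("proteus")
# Read all CELERY_*-prefixed options (broker URL, serializers, ...) from Django settings.
app.config_from_object("django.conf:settings", namespace="CELERY")
# Discover @shared_task functions in each installed app, e.g. core/tasks.py.
app.autodiscover_tasks()
```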
The architecture of Proteus is designed for scalability, reliability, and ease of use:
- Asynchronous Pipeline Architecture: The core computational pipeline (prediction, simulation, ranking) is implemented using Celery. User requests trigger Django views (`core/views.py`), which in turn dispatch tasks to Celery workers. This non-blocking approach keeps the web interface responsive even during long-running computations.
- Dockerized Execution Environment: Computationally intensive tools like ColabFold and GROMACS are executed within Docker containers. Celery tasks in `core/tasks.py` dynamically manage these containers, ensuring that the tools run in isolated, consistent, and dependency-managed environments.
- Database-Driven State Management: The status, parameters, and results of each stage in the pipeline are tracked and stored in the PostgreSQL database. Django models such as `JobQueue`, `Prediction`, `ValidationMetric`, and `MLRanking` provide a structured way to manage and query this data, enabling robust workflow orchestration and data provenance.
- Modular Django Application: The project follows Django's app-based structure, with the primary functionality encapsulated within the `core` application. This promotes modularity and maintainability.
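Database-driven state management implies a small set of legal job-status transitions. The sketch below is hypothetical — the authoritative states and fields live in the `JobQueue` model in `core/models.py`:

```python
# Hypothetical job-status lifecycle for a queued pipeline job.
ALLOWED_TRANSITIONS = {
    "queued": {"running", "cancelled"},
    "running": {"completed", "failed"},
    "failed": {"queued"},   # e.g. a manual retry re-queues the job
    "completed": set(),
    "cancelled": set(),
}

def advance_job(status: str, new_status: str) -> str:
    """Validate and apply a status transition, raising on illegal moves."""
    if new_status not in ALLOWED_TRANSITIONS.get(status, set()):
        raise ValueError(f"illegal transition: {status} -> {new_status}")
    return new_status
```

Enforcing transitions like this at one choke point keeps the database state consistent even when multiple Celery workers update jobs concurrently.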
Project Directory (`proteus/`):
- `settings.py`: Django project settings, database configuration, installed apps, middleware, etc.
- `urls.py`: Root URL configuration, routing requests to the appropriate views.
- `celery.py`: Celery application definition and configuration.

Application Directory (`core/`):
- `models.py`: Defines the database schema for all entities (User, Role, ProteinSequence, Prediction, ValidationMetric, MLRanking, JobQueue, Log, SystemMetric).
- `views.py`: Contains the logic for handling HTTP requests, interacting with models, and rendering HTML templates. Implements views for sequence submission, listing, detail pages, etc.
- `forms.py`: Defines Django forms for user input, such as protein sequence submission.
- `tasks.py`: Contains Celery shared tasks for executing ColabFold predictions and GROMACS simulations, including Docker management, file handling, and database updates.
- `admin.py`: Configures the Django admin interface for managing application data.
- `templates/core/`: HTML templates that define the user interface.
- `static/`: Static files (CSS, JavaScript, images; the Mol*Star integration lives here).
- `migrations/`: Database migration files generated by Django.

`requirements.txt`: Specifies all Python package dependencies for the project.
- Prerequisites:
  - Python 3.12
  - PostgreSQL server
  - RabbitMQ server (or another Celery-compatible broker)
  - Docker
- Clone the repository: `git clone https://github.com/OtoYuki/proteus.git` and `cd proteus`
- Set up a Python environment and install dependencies:
  - A virtual environment (e.g., `venv` or `conda`) is highly recommended.
  - Install all project dependencies using pip and the provided `requirements.txt` file: `pip install -r requirements.txt`
- Configure Django settings:
  - Update `proteus/settings.py` with your database credentials, secret key, Celery broker URL, and other environment-specific settings.
- Apply database migrations: `python manage.py migrate`
- Create a superuser (for admin access): `python manage.py createsuperuser`
- Start Celery worker(s): `celery -A proteus worker -l info` (ensure RabbitMQ is running and accessible as configured in `proteus/settings.py` or `proteus/celery.py`)
- Start the Django development server: `python manage.py runserver`
- Ensure Docker is running and can pull the necessary images (e.g., `ghcr.io/sokrypton/colabfold:1.5.5`, a GROMACS image). The specific images and tags are defined within the execution logic in `core/tasks.py`.
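Pulling the container images ahead of time avoids a long delay on the first submitted job. The ColabFold tag below is the one referenced above; the GROMACS image is only an example — check `core/tasks.py` for the exact image and tag the tasks actually use:

```shell
# Pre-pull the ColabFold image referenced in the setup notes above
docker pull ghcr.io/sokrypton/colabfold:1.5.5
# Example GROMACS image -- verify the exact name/tag against core/tasks.py
docker pull gromacs/gromacs:latest
```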
- User Authentication: Users register and log in to the Proteus platform.
- Sequence Submission: Users submit protein sequences via a web form (`ProteinSequenceForm`, rendered by `ProteinSequenceCreateView`).
- Job Queuing: Upon submission, a job is created in the `JobQueue` and picked up by a Celery worker for processing.
- Prediction: The `run_colabfold` task in `core/tasks.py` executes ColabFold in a Docker container to predict the protein structure. Results (PDB file, scores) are saved to the `Prediction` model.
- Validation (Simulation): If configured, the `run_gromacs_simulation` task in `core/tasks.py` runs GROMACS simulations on the predicted structure, also within Docker. Metrics (RMSD, Rg, energy) are stored in the `ValidationMetric` model.
- ML Ranking: An ML ranking task (triggered after prediction) evaluates the prediction and simulation data, storing scores in the `MLRanking` model.
- Status Tracking & Results: Users can monitor the status of their jobs and view detailed results, including 3D structure visualizations, prediction scores, and simulation metrics, through the web interface (`ProteinSequenceListView`, `ProteinSequenceDetailView`).
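The data handoff between the stages above can be sketched with plain functions. Every name and value here is an illustrative placeholder, not the actual task code in `core/tasks.py`:

```python
# Illustrative stand-ins for the real pipeline tasks; metric values are placeholders.
def predict_structure(sequence: str) -> dict:
    """Stage 1: conceptually what run_colabfold yields (a PDB path plus confidence)."""
    return {"sequence": sequence, "pdb_path": "/tmp/model.pdb", "plddt": 85.0}

def simulate(prediction: dict) -> dict:
    """Stage 2: run_gromacs_simulation adds stability metrics such as RMSD and Rg."""
    return {**prediction, "rmsd_nm": 0.21, "rg_nm": 1.4}

def rank(candidate: dict) -> dict:
    """Stage 3: a toy score combining prediction confidence and simulation stability."""
    score = candidate["plddt"] / 100.0 - candidate["rmsd_nm"]
    return {**candidate, "ml_score": round(score, 3)}

result = rank(simulate(predict_structure("MKTAYIAKQR")))
```

In Proteus each stage additionally persists its output to the corresponding Django model (`Prediction`, `ValidationMetric`, `MLRanking`) so that downstream stages and the UI can query it.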
The development of Proteus addressed several significant challenges inherent in computational biology workflows:
- Managing Long-Running Computations: Solved by implementing an asynchronous task queue with Celery, preventing the web server from blocking and ensuring a responsive user experience.
- Complex Dependency Management: Addressed by containerizing ColabFold and GROMACS using Docker, which encapsulates dependencies and ensures consistent execution environments.
- Computational Resource Management: Basic resource monitoring is implemented using `psutil` within tasks, and GROMACS tasks can be configured with `nice` for CPU priority. Docker resource limits are also utilized.
- Workflow Orchestration: Achieved through carefully designed Celery tasks in `core/tasks.py` that manage sequential execution, data handoff between steps, and persistent state tracking in the database.
- Error Handling and Logging: Robust `try`/`except` blocks and detailed logging to both the console and the `Log` database model are implemented throughout `core/tasks.py` to ensure issues are captured and diagnosable.
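The error-handling pattern can be sketched as follows. `run_step` is a hypothetical helper; in Proteus the failure would also be written to the `Log` model, not only the Python logger:

```python
import logging

logger = logging.getLogger("proteus.tasks")

def run_step(step_name: str, fn, *args):
    """Run one pipeline step; log failures with a traceback, then re-raise
    so Celery marks the task as failed (Proteus would also record a Log row)."""
    try:
        return fn(*args)
    except Exception:
        logger.exception("pipeline step %r failed", step_name)
        raise
```

Re-raising after logging is the key detail: swallowing the exception would leave the job marked as running while the pipeline silently stalls.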
Proteus is expected to deliver significant benefits:
- A Functional, Web-Accessible Platform: Providing a user-friendly application for the entire synthetic protein design pipeline.
- An Automated Computational Pipeline: Reducing manual effort and the potential for errors in complex multi-step analyses.
- Centralized and Structured Data Storage: Ensuring data provenance and facilitating easier analysis and comparison of results.
- Accelerated Research: By streamlining the design-predict-validate-rank cycle, Proteus can significantly speed up the process of identifying promising protein candidates.
- Increased Accessibility: Making advanced computational tools available to a wider range of researchers, including those who may lack deep command-line or infrastructure management expertise.
- Improved Decision Making: Providing integrated data from prediction, simulation, and ML-ranking stages to help researchers make more informed decisions about which protein designs to pursue experimentally.
- Integration of more sophisticated and customizable Machine Learning models for ranking.
- Advanced 3D visualization options and comparative analysis tools.
- Expansion of the supported computational tools (e.g., different docking software, alternative simulation engines).
- Enhanced system monitoring, resource allocation, and job prioritization features.
- User collaboration features and project management capabilities.
- Comprehensive API development for programmatic access to Proteus functionalities.
This README aims to provide a comprehensive overview of the Proteus project. For more specific details, please refer to the source code and any accompanying documentation.