A comprehensive machine learning pipeline for predicting protein functions using Graph Neural Networks (GNNs) and Gene Ontology (GO) annotations.
This project implements a Graph Neural Network-based approach for predicting protein functions across three Gene Ontology categories:
- Molecular Function (MF): What the protein does at the molecular level
- Biological Process (BP): Which biological processes the protein participates in
- Cellular Component (CC): Where the protein is located in the cell
The model processes protein amino acid sequences and predicts GO term annotations using multi-label classification.
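As a rough illustration of what multi-label classification means here, the sketch below applies a sigmoid to each GO term score independently and keeps every term that reaches a threshold. The function names, the logits, and the 0.5 threshold are illustrative assumptions, not values or code taken from this project's model.

```python
import math
from typing import Dict, List


def sigmoid(x: float) -> float:
    """Map a raw score (logit) to a probability in (0, 1), numerically stable."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def predict_go_terms(logits: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return every GO term whose independent probability reaches the threshold.

    Each term is scored on its own (no softmax across terms), so a protein
    can receive zero, one, or many annotations.
    """
    return sorted(term for term, score in logits.items() if sigmoid(score) >= threshold)


# Hypothetical logits for one protein (values made up for illustration).
example_logits = {
    "GO:0003677": 2.1,   # DNA binding (MF)
    "GO:0006355": 0.4,   # regulation of DNA-templated transcription (BP)
    "GO:0005634": -1.3,  # nucleus (CC)
}
predicted = predict_go_terms(example_logits)
print(predicted)
```

Raising the threshold trades recall for precision: with `threshold=0.9`, none of the three example terms would be reported.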
- Advanced Architecture: Graph Neural Networks with attention mechanisms
- Multi-label Classification: Simultaneous prediction across three GO categories
- Comprehensive Pipeline: From data preprocessing to model deployment
- Performance Analysis: Detailed evaluation metrics and visualizations
- Interactive Web Interface: Streamlit-based application for easy predictions
- User-Friendly: No coding required for basic predictions
- Real-time Predictions: Fast inference on new protein sequences
- Python 3.8 or higher
- CUDA-compatible GPU (recommended)
- Clone the repository
git clone https://github.com/NameLessAth/Protein-Function-Prediction/ your-folder
cd your-folder
- Install dependencies
pip install -r requirements.txt

Required packages:
torch>=1.12.0
torch-geometric>=2.1.0
pandas>=1.4.0
numpy>=1.21.0
scikit-learn>=1.1.0
matplotlib>=3.5.0
seaborn>=0.11.0
streamlit>=1.12.0
biopython>=1.79
tqdm>=4.64.0
plotly>=5.10.0
requests>=2.28.0

# Basic training
python main.py --epochs 50 --hidden_dim 256 --lr 0.001
# Advanced training with custom parameters
python main.py \
--max_samples 25000 \
--epochs 100 \
--hidden_dim 512 \
--lr 0.0005 \
--batch_size 16 \
--dropout 0.1

# Start the Streamlit web application
streamlit run interface/app.py

Then open your browser and navigate to http://localhost:8501

# Predict function for a single sequence
python run_predictions.py \
--sequence "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG" \
--model improved_model.pth \
--preprocessors improved_preprocessors.pkl
# Predict functions from FASTA file
python run_predictions.py \
--fasta sequences.fasta \
--model improved_model.pth \
--preprocessors improved_preprocessors.pkl

protein-function-prediction/
├── data/
│ ├── data_loader.py # UniProt data loading
│ └── preprocessing.py # Data preprocessing pipeline
├── interface/
│ └── app.py # Streamlit web interface
├── models/
│ ├── feature_encoder.py # Protein sequence encoder
│ └── gnn_model.py # Graph Neural Network architecture
├── training/
│ ├── evaluator.py # Model evaluation
│ └── trainer.py # Training pipeline
├── utils/
│ └── helpers.py # Utility functions
├── main.py # Main training script
├── run_predictions.py # Prediction script
├── go_distribution_analysis.py # GO distribution analysis
├── test_app.py # Test Streamlit interface
├── test_predictions.py # Test prediction functionality
├── improved_model.pth # Improved model checkpoint
├── improved_preprocessors.pkl # Data preprocessors
├── requirements.txt # Python dependencies
├── setup.py # Package setup
├── README.md # This file
└── .gitignore # Git ignore rules
The Streamlit web interface provides an intuitive way to use the model without coding:
streamlit run interface/app.py

Key parameters for training:
| Parameter | Description | Default | Recommended |
|---|---|---|---|
| `--epochs` | Number of training epochs | 50 | 100-150 |
| `--hidden_dim` | Model hidden dimension | 256 | 512-768 |
| `--lr` | Learning rate | 0.001 | 0.0005-0.001 |
| `--batch_size` | Training batch size | 16 | 8-32 |
| `--max_samples` | Maximum training samples | 15000 | 25000-30000 |
| `--dropout` | Dropout rate | 0.1 | 0.05-0.2 |
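
For orientation, the parameters in the table above could be declared with `argparse` roughly as follows. This is a minimal sketch: the flag names and defaults come from the table, but `build_parser` and `parse_args` are hypothetical names, and the actual parser in `main.py` may be organized differently or accept further options.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose flags and defaults mirror the table above."""
    parser = argparse.ArgumentParser(description="Training options (illustrative sketch)")
    parser.add_argument("--epochs", type=int, default=50, help="Number of training epochs")
    parser.add_argument("--hidden_dim", type=int, default=256, help="Model hidden dimension")
    parser.add_argument("--lr", type=float, default=0.001, help="Learning rate")
    parser.add_argument("--batch_size", type=int, default=16, help="Training batch size")
    parser.add_argument("--max_samples", type=int, default=15000, help="Maximum training samples")
    parser.add_argument("--dropout", type=float, default=0.1, help="Dropout rate")
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the given argument list, or sys.argv[1:] when argv is None."""
    return build_parser().parse_args(argv)


# Override two options and leave the rest at their defaults.
args = parse_args(["--epochs", "100", "--lr", "0.0005"])
print(args.epochs, args.hidden_dim, args.lr, args.batch_size, args.max_samples, args.dropout)
```

Any flag left off the command line falls back to the value in the Default column.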
Quick Test Run:
python main.py --max_samples 5000 --epochs 20 --hidden_dim 128

Production Training:
python main.py \
--max_samples 30000 \
--epochs 150 \
--hidden_dim 768 \
--lr 0.0005 \
--batch_size 8 \
--dropout 0.05

Single Sequence Prediction:
python run_predictions.py \
--sequence "MSDNEKSKKYFVLSGFHGKFTQYTGDTNVYVKVAKACQEDFKQYKTQLLNKHWDVK" \
--model models/best_model.pth

- Follow PEP 8 guidelines
- Use type hints where applicable
- Add docstrings to all functions
- Format code with `black`: `black src/`
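
A function written to these guidelines might look like the sketch below. `validate_sequence` and `STANDARD_AMINO_ACIDS` are hypothetical names that are not part of this repository; the example only shows type hints and a docstring used together.

```python
from typing import Set

# The 20 standard amino acids, as one-letter codes.
STANDARD_AMINO_ACIDS: Set[str] = set("ACDEFGHIKLMNPQRSTVWY")


def validate_sequence(sequence: str) -> str:
    """Normalize a protein sequence and check its alphabet.

    Args:
        sequence: Amino acid sequence in one-letter codes.

    Returns:
        The sequence stripped of surrounding whitespace and upper-cased.

    Raises:
        ValueError: If the sequence is empty or contains a character
            outside the 20 standard amino acids.
    """
    cleaned = sequence.strip().upper()
    if not cleaned:
        raise ValueError("sequence is empty")
    invalid = sorted(set(cleaned) - STANDARD_AMINO_ACIDS)
    if invalid:
        raise ValueError(f"invalid residue(s): {', '.join(invalid)}")
    return cleaned
```
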
- UniProt Consortium for providing protein annotation data
- Gene Ontology Consortium for functional classification standards
- PyTorch Geometric for graph neural network implementations
- Streamlit for the amazing web framework