Protein Function Prediction using Graph Neural Networks

A machine learning pipeline that predicts protein functions from amino acid sequences using Graph Neural Networks (GNNs) and Gene Ontology (GO) annotations.

Overview

This project implements a Graph Neural Network-based approach for predicting protein functions across three Gene Ontology categories:

Molecular Function (MF): What the protein does at the molecular level
Biological Process (BP): Which biological processes the protein participates in
Cellular Component (CC): Where the protein is located in the cell

The model processes protein amino acid sequences and predicts GO term annotations using multi-label classification.
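To make the multi-label setup concrete, here is a minimal sketch of how per-term scores could be turned into GO annotations. It assumes the model emits one logit per GO term and that each term is thresholded independently. The function name `decode_go_terms`, the example logits, and the 0.5 threshold are illustrative assumptions, not taken from this project's code.

```python
import math

def sigmoid(x: float) -> float:
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def decode_go_terms(logits, go_terms, threshold=0.5):
    """Return (term, probability) pairs at or above the threshold, best first.

    Hypothetical helper: each GO term is scored independently, so a protein
    can receive zero, one, or many annotations.
    """
    if len(logits) != len(go_terms):
        raise ValueError("need exactly one logit per GO term")
    scored = [(term, sigmoid(z)) for term, z in zip(go_terms, logits)]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# One example term per category: DNA binding (MF),
# regulation of DNA-templated transcription (BP), nucleus (CC).
go_terms = ["GO:0003677", "GO:0006355", "GO:0005634"]
logits = [2.0, -1.0, 0.5]  # made-up model outputs

predicted = decode_go_terms(logits, go_terms)
for term, prob in predicted:
    print(f"{term}\t{prob:.2f}")
```

Because each term has its own sigmoid rather than sharing a softmax, the probabilities do not need to sum to one, which is what allows several GO terms to be assigned to the same protein.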

Features

Advanced Architecture: Graph Neural Networks with attention mechanisms
Multi-label Classification: Simultaneous prediction across three GO categories
Comprehensive Pipeline: From data preprocessing to model deployment
Performance Analysis: Detailed evaluation metrics and visualizations
Interactive Web Interface: Streamlit-based application for easy predictions
User-Friendly: No coding required for basic predictions
Real-time Predictions: Fast inference on new protein sequences
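This README does not describe how a sequence is turned into a graph. One common construction, shown here purely as an assumption, treats each residue as a node with a one-hot feature vector and connects residues that lie within a fixed window along the chain. The names `sequence_to_graph` and `window` are hypothetical, and the project's `feature_encoder.py` may work differently.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def sequence_to_graph(sequence, window=2):
    """Build node features and an undirected edge list from a sequence.

    Assumed scheme: one node per residue, one-hot features with an extra
    slot for non-standard residues, edges between residues at most
    `window` positions apart.
    """
    seq = sequence.upper()
    unknown_slot = len(AMINO_ACIDS)

    node_features = []
    for aa in seq:
        one_hot = [0] * (len(AMINO_ACIDS) + 1)
        one_hot[AA_INDEX.get(aa, unknown_slot)] = 1
        node_features.append(one_hot)

    edges = []
    for i in range(len(seq)):
        for j in range(i + 1, min(i + window, len(seq) - 1) + 1):
            edges.append((i, j))  # store both directions so the
            edges.append((j, i))  # graph is effectively undirected
    return node_features, edges

features, edges = sequence_to_graph("MKTV", window=1)
print(len(features), "nodes,", len(edges), "directed edges")
```

The edge list is in the source/target pair form that graph libraries such as PyTorch Geometric expect after transposing into a 2 x E index tensor.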

Installation

Prerequisites

Python 3.8 or higher
CUDA-compatible GPU (recommended)

Environment Setup

Clone the repository

git clone https://github.com/NameLessAth/Protein-Function-Prediction/ your-folder
cd your-folder

Install dependencies

pip install -r requirements.txt

requirements.txt

torch>=1.12.0
torch-geometric>=2.1.0
pandas>=1.4.0
numpy>=1.21.0
scikit-learn>=1.1.0
matplotlib>=3.5.0
seaborn>=0.11.0
streamlit>=1.12.0
biopython>=1.79
tqdm>=4.64.0
plotly>=5.10.0
requests>=2.28.0

Quick Start

1. Train the Model

# Basic training
python main.py --epochs 50 --hidden_dim 256 --lr 0.001

# Advanced training with custom parameters
python main.py \
    --max_samples 25000 \
    --epochs 100 \
    --hidden_dim 512 \
    --lr 0.0005 \
    --batch_size 16 \
    --dropout 0.1

2. Launch Interactive Web Interface

# Start the Streamlit web application
streamlit run interface/app.py

Then open your browser and navigate to http://localhost:8501

3. Command-Line Predictions

# Predict function for a single sequence
python run_predictions.py \
    --sequence "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG" \
    --model improved_model.pth \
    --preprocessors improved_preprocessors.pkl

# Predict functions from FASTA file
python run_predictions.py \
    --fasta sequences.fasta \
    --model improved_model.pth \
    --preprocessors improved_preprocessors.pkl
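For the FASTA command above, it may help to see the file layout the script presumably expects: a `>` header line followed by one or more sequence lines per record. The sketch below parses that layout using only the standard library. The record names `example_1` and `example_2` are made up, and `run_predictions.py` may read the file differently, for instance with Biopython's `SeqIO.parse`, which is listed in the requirements.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into {header: sequence}.

    Sequence lines belonging to one record are joined, so wrapped
    sequences are handled.
    """
    records = {}
    header = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = []
        elif header is None:
            raise ValueError("sequence data found before any '>' header")
        else:
            records[header].append(line.upper())
    return {name: "".join(parts) for name, parts in records.items()}

# Illustrative contents of sequences.fasta (record names are hypothetical).
fasta_text = """\
>example_1
MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIV
QDIAYLRSLGYNIVATPRGYVLAGG
>example_2
MSDNEKSKKYFVLSGFHGKFTQYTGDTNVYVKVAKACQEDFKQYKTQLLNKHWDVK
"""

records = parse_fasta(fasta_text)
for name, seq in records.items():
    print(name, len(seq))
```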

Project Structure

protein-function-prediction/
├── data/
│   ├── data_loader.py          # UniProt data loading
│   └── preprocessing.py        # Data preprocessing pipeline
├── interface/
│   └── app.py                  # Streamlit web interface
├── models/
│   ├── feature_encoder.py      # Protein sequence encoder
│   └── gnn_model.py           # Graph Neural Network architecture
├── training/
│   ├── evaluator.py           # Model evaluation
│   └── trainer.py             # Training pipeline
├── utils/
│   └── helpers.py             # Utility functions
├── main.py                 # Main training script
├── run_predictions.py      # Prediction script
├── go_distribution_analysis.py  # GO distribution analysis
├── test_app.py             # Test Streamlit interface
├── test_predictions.py     # Test prediction functionality
├── improved_model.pth      # Improved model checkpoint
├── improved_preprocessors.pkl    # Data preprocessors
├── requirements.txt        # Python dependencies
├── setup.py               # Package setup
├── README.md              # This file
└── .gitignore             # Git ignore rules

Usage

Web Interface (Recommended)

The Streamlit web interface provides an intuitive way to use the model without coding:

streamlit run interface/app.py

Training Configuration

Key parameters for training:

Parameter	Description	Default	Recommended
`--epochs`	Number of training epochs	50	100-150
`--hidden_dim`	Model hidden dimension	256	512-768
`--lr`	Learning rate	0.001	0.0005-0.001
`--batch_size`	Training batch size	16	8-32
`--max_samples`	Maximum training samples	15000	25000-30000
`--dropout`	Dropout rate	0.1	0.05-0.2
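As a rough illustration of how the flags and defaults in the table above fit together, the following sketch declares them with `argparse`. It mirrors only the names and default values listed in the table; the real `main.py` may define additional options or different help text, and `build_parser` is a hypothetical name.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Declare the training flags using the defaults from the table."""
    parser = argparse.ArgumentParser(
        description="Train the GO term prediction model (illustrative sketch)"
    )
    parser.add_argument("--epochs", type=int, default=50,
                        help="number of training epochs")
    parser.add_argument("--hidden_dim", type=int, default=256,
                        help="model hidden dimension")
    parser.add_argument("--lr", type=float, default=0.001,
                        help="learning rate")
    parser.add_argument("--batch_size", type=int, default=16,
                        help="training batch size")
    parser.add_argument("--max_samples", type=int, default=15000,
                        help="maximum number of training samples")
    parser.add_argument("--dropout", type=float, default=0.1,
                        help="dropout rate")
    return parser

# Flags that are not given on the command line keep their defaults.
args = build_parser().parse_args(["--epochs", "100", "--lr", "0.0005"])
print(vars(args))
```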

Example Training Commands

Quick Test Run:

python main.py --max_samples 5000 --epochs 20 --hidden_dim 128

Production Training:

python main.py \
    --max_samples 30000 \
    --epochs 150 \
    --hidden_dim 768 \
    --lr 0.0005 \
    --batch_size 8 \
    --dropout 0.05

Command-Line Prediction Examples

Single Sequence Prediction:

python run_predictions.py \
    --sequence "MSDNEKSKKYFVLSGFHGKFTQYTGDTNVYVKVAKACQEDFKQYKTQLLNKHWDVK" \
    --model improved_model.pth \
    --preprocessors improved_preprocessors.pkl

Code Style

Follow PEP 8 guidelines
Use type hints where applicable
Add docstrings to all functions
Format code with black: black .

Acknowledgments

UniProt Consortium for providing protein annotation data
Gene Ontology Consortium for functional classification standards
PyTorch Geometric for graph neural network implementations
Streamlit for the amazing web framework
