Protein Function Prediction using Graph Neural Networks

A comprehensive machine learning pipeline for predicting protein functions using Graph Neural Networks (GNNs) and Gene Ontology (GO) annotations.

Python 3.8+ PyTorch Streamlit

Overview

This project implements a Graph Neural Network-based approach for predicting protein functions across three Gene Ontology categories:

  • Molecular Function (MF): What the protein does at the molecular level
  • Biological Process (BP): Which biological processes the protein participates in
  • Cellular Component (CC): Where the protein is located in the cell

The model processes protein amino acid sequences and predicts GO term annotations using multi-label classification.
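
As a rough illustration of what multi-label classification means here, the sketch below encodes each protein's GO terms as a multi-hot vector and turns per-term scores back into a set of predicted terms. The function names, the toy annotations, and the 0.5 threshold are assumptions made for this example; the repository's own preprocessing may differ.

```python
from typing import List, Sequence


def build_vocabulary(annotations: Sequence[Sequence[str]]) -> List[str]:
    """Collect every GO term seen in the data, in a stable sorted order."""
    return sorted({term for terms in annotations for term in terms})


def to_multi_hot(terms: Sequence[str], vocabulary: Sequence[str]) -> List[int]:
    """Encode one protein's GO terms as a 0/1 vector over the vocabulary."""
    present = set(terms)
    return [1 if term in present else 0 for term in vocabulary]


def decode(scores: Sequence[float], vocabulary: Sequence[str],
           threshold: float = 0.5) -> List[str]:
    """Keep every term whose score reaches the threshold (assumed 0.5)."""
    return [term for term, score in zip(vocabulary, scores) if score >= threshold]


# Toy annotations for two hypothetical proteins (not taken from the dataset).
annotations = [
    ["GO:0003677", "GO:0005634"],  # DNA binding (MF), nucleus (CC)
    ["GO:0005524", "GO:0005737"],  # ATP binding (MF), cytoplasm (CC)
]
vocabulary = build_vocabulary(annotations)
targets = [to_multi_hot(terms, vocabulary) for terms in annotations]
print(vocabulary)
print(targets)
print(decode([0.9, 0.2, 0.7, 0.4], vocabulary))
```

Unlike single-label classification, each term is decided independently, so a protein can receive any number of terms from any of the three categories.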

Features

  • Advanced Architecture: Graph Neural Networks with attention mechanisms
  • Multi-label Classification: Simultaneous prediction across three GO categories
  • Comprehensive Pipeline: From data preprocessing to model deployment
  • Performance Analysis: Detailed evaluation metrics and visualizations
  • Interactive Web Interface: Streamlit-based application for easy predictions
  • User-Friendly: No coding required for basic predictions
  • Real-time Predictions: Fast inference on new protein sequences
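
This README does not describe how a sequence is turned into a graph, so the following is only one plausible construction, not necessarily what `models/gnn_model.py` does: each residue becomes a node, and residues within a small window along the chain are connected. The function name and the `window` parameter are hypothetical.

```python
from typing import List, Tuple


def sequence_to_graph(sequence: str, window: int = 2) -> Tuple[List[str], List[Tuple[int, int]]]:
    """Build a residue graph: one node per residue, edges between residues
    at most `window` positions apart. Edges are stored in both directions."""
    nodes = list(sequence)
    edges: List[Tuple[int, int]] = []
    for i in range(len(nodes)):
        last_neighbour = min(i + window, len(nodes) - 1)
        for j in range(i + 1, last_neighbour + 1):
            edges.append((i, j))
            edges.append((j, i))
    return nodes, edges


nodes, edges = sequence_to_graph("MKTV", window=2)
print(nodes)
print(edges)
```

PyTorch Geometric stores connectivity as an `edge_index` tensor of shape `[2, num_edges]`, so a list of pairs like this would be transposed before being handed to a model.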

Installation

Prerequisites

  • Python 3.8 or higher
  • CUDA-compatible GPU (recommended)
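
A small check like the one below can confirm both prerequisites before training. It is a sketch, not part of the repository: `check_prerequisites` is a hypothetical helper, and it only imports `torch` if the package is already installed.

```python
import importlib.util
import sys


def check_prerequisites(min_python=(3, 8)):
    """Report whether the interpreter is recent enough and a CUDA GPU is visible."""
    report = {
        "python_ok": sys.version_info[:2] >= min_python,
        "torch_installed": False,
        "cuda_available": False,
    }
    # Import torch only when it is installed, so the check also works
    # before the dependencies have been set up.
    if importlib.util.find_spec("torch") is not None:
        import torch

        report["torch_installed"] = True
        report["cuda_available"] = bool(torch.cuda.is_available())
    return report


if __name__ == "__main__":
    for name, value in check_prerequisites().items():
        print(f"{name}: {value}")
```

Without a GPU, PyTorch runs on the CPU, which is typically much slower for training.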

Environment Setup

  1. Clone the repository
git clone https://github.com/NameLessAth/Protein-Function-Prediction/ your-folder
cd your-folder
  2. Install dependencies
pip install -r requirements.txt

Requirements.txt

torch>=1.12.0
torch-geometric>=2.1.0
pandas>=1.4.0
numpy>=1.21.0
scikit-learn>=1.1.0
matplotlib>=3.5.0
seaborn>=0.11.0
streamlit>=1.12.0
biopython>=1.79
tqdm>=4.64.0
plotly>=5.10.0
requests>=2.28.0
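
To see whether an existing environment already satisfies these minimum versions, something like the following sketch can help. The helper names are made up for this example, it only understands the `name>=version` form used above, and its version comparison is deliberately simple (numeric components only).

```python
import re
from importlib import metadata


def version_tuple(version):
    """Turn '2.1.0+cu118' into (2, 1, 0); any non-numeric suffix is ignored."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    return tuple(int(part) for part in match.group(0).split(".")) if match else ()


def parse_requirement(line):
    """Split a 'name>=version' line into (name, minimum_version_tuple)."""
    name, _, minimum = line.strip().partition(">=")
    return name, version_tuple(minimum)


def check_requirements(lines):
    """Map each package name to 'ok', 'missing', or 'too old (found X)'."""
    status = {}
    for line in lines:
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        name, minimum = parse_requirement(line)
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            status[name] = "missing"
            continue
        if version_tuple(installed) >= minimum:
            status[name] = "ok"
        else:
            status[name] = f"too old (found {installed})"
    return status


print(check_requirements(["torch>=1.12.0", "biopython>=1.79", "streamlit>=1.12.0"]))
```

Passing the lines of `requirements.txt` instead of the inline list checks the full set.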

Quick Start

1. Train the Model

# Basic training
python main.py --epochs 50 --hidden_dim 256 --lr 0.001

# Advanced training with custom parameters
python main.py \
    --max_samples 25000 \
    --epochs 100 \
    --hidden_dim 512 \
    --lr 0.0005 \
    --batch_size 16 \
    --dropout 0.1

2. Launch Interactive Web Interface

# Start the Streamlit web application
streamlit run interface/app.py

Then open http://localhost:8501 (Streamlit's default address) in your browser.

3. Command-Line Predictions

# Predict function for a single sequence
python run_predictions.py \
    --sequence "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG" \
    --model improved_model.pth \
    --preprocessors improved_preprocessors.pkl

# Predict functions from FASTA file
python run_predictions.py \
    --fasta sequences.fasta \
    --model improved_model.pth \
    --preprocessors improved_preprocessors.pkl
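
For reference, a FASTA file holds one or more records, each a `>` header line followed by sequence lines. The sketch below reads that format and flags residues outside the 20 standard amino acids, which can be useful before passing a file to `--fasta`. It is a standalone illustration with hypothetical helper names; how `run_predictions.py` itself parses FASTA or treats ambiguous residues is not documented here.

```python
from typing import Iterable, List, Tuple

STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def read_fasta(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (header, sequence) pairs; sequences may span several lines."""
    records: List[Tuple[str, str]] = []
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def nonstandard_residues(sequence: str) -> List[str]:
    """List the characters that are not one of the 20 standard amino acids."""
    return sorted(set(sequence) - STANDARD_AMINO_ACIDS)


# Toy file contents; the second record deliberately contains an 'X'.
example = """>example_protein_1 first toy record
MKTVRQERLKSIVRILERSKEPVSGAQLAEEL
SVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG
>example_protein_2 contains an ambiguous residue
MSDNEKSXKYF
""".splitlines()

for header, sequence in read_fasta(example):
    print(header, len(sequence), nonstandard_residues(sequence))
```

Biopython, which is listed in the requirements, offers `Bio.SeqIO.parse(path, "fasta")` as a more complete alternative to this hand-written reader.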

Project Structure

protein-function-prediction/
├── data/
│   ├── data_loader.py          # UniProt data loading
│   └── preprocessing.py        # Data preprocessing pipeline
├── interface/
│   └── app.py                  # Streamlit web interface
├── models/
│   ├── feature_encoder.py      # Protein sequence encoder
│   └── gnn_model.py           # Graph Neural Network architecture
├── training/
│   ├── evaluator.py           # Model evaluation
│   └── trainer.py             # Training pipeline
├── utils/
│   └── helpers.py             # Utility functions
├── main.py                 # Main training script
├── run_predictions.py      # Prediction script
├── go_distribution_analysis.py  # GO distribution analysis
├── test_app.py             # Test Streamlit interface
├── test_predictions.py     # Test prediction functionality
├── improved_model.pth      # Improved model checkpoint
├── improved_preprocessors.pkl    # Data preprocessors
├── requirements.txt        # Python dependencies
├── setup.py               # Package setup
├── README.md              # This file
└── .gitignore             # Git ignore rules

Usage

Web Interface (Recommended)

The Streamlit web interface provides an intuitive way to use the model without coding:

streamlit run interface/app.py

Training Configuration

Key parameters for training:

Parameter       Description                 Default   Recommended
--epochs        Number of training epochs   50        100-150
--hidden_dim    Model hidden dimension      256       512-768
--lr            Learning rate               0.001     0.0005-0.001
--batch_size    Training batch size         16        8-32
--max_samples   Maximum training samples    15000     25000-30000
--dropout       Dropout rate                0.1       0.05-0.2
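
The flags and defaults above map naturally onto `argparse`. The sketch below shows how such a parser could be declared; it mirrors the table but is an assumption about the interface, not a copy of `main.py`, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the training flags with the defaults listed in the table."""
    parser = argparse.ArgumentParser(description="Train the protein function GNN")
    parser.add_argument("--epochs", type=int, default=50,
                        help="Number of training epochs")
    parser.add_argument("--hidden_dim", type=int, default=256,
                        help="Model hidden dimension")
    parser.add_argument("--lr", type=float, default=0.001,
                        help="Learning rate")
    parser.add_argument("--batch_size", type=int, default=16,
                        help="Training batch size")
    parser.add_argument("--max_samples", type=int, default=15000,
                        help="Maximum training samples")
    parser.add_argument("--dropout", type=float, default=0.1,
                        help="Dropout rate")
    return parser


# Parsing an explicit list instead of sys.argv keeps the example self-contained.
args = build_parser().parse_args(["--max_samples", "5000", "--epochs", "20"])
print(vars(args))
```

Any flag left off the command line falls back to its default, so the quick test run only needs to override the values it changes.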

Example Training Commands

Quick Test Run:

python main.py --max_samples 5000 --epochs 20 --hidden_dim 128

Production Training:

python main.py \
    --max_samples 30000 \
    --epochs 150 \
    --hidden_dim 768 \
    --lr 0.0005 \
    --batch_size 8 \
    --dropout 0.05

Command-Line Prediction Examples

Single Sequence Prediction:

python run_predictions.py \
    --sequence "MSDNEKSKKYFVLSGFHGKFTQYTGDTNVYVKVAKACQEDFKQYKTQLLNKHWDVK" \
    --model improved_model.pth \
    --preprocessors improved_preprocessors.pkl

Code Style

  • Follow PEP 8 guidelines
  • Use type hints where applicable
  • Add docstrings to all functions
  • Format code with black: black .

Acknowledgments

  • UniProt Consortium for providing protein annotation data
  • Gene Ontology Consortium for functional classification standards
  • PyTorch Geometric for graph neural network implementations
  • Streamlit for the amazing web framework

About

Repository for the major assignment (Tubes) of IF3211 Komputasi Domain Spesifik (Domain-Specific Computation)
